Derived Tables in WhereScape RED 26 Mar ’08
Posted by: Jason Laws
A derived table is an in-line SQL SELECT statement in the FROM clause of a query. They are also known as in-line views. Essentially they are non-created views used on the fly. A simple derived table is:
FROM (SELECT *
FROM SourceTable) AS drvd_tbl
In most relational databases, derived tables must be aliased. In the example above, the derived table is aliased as drvd_tbl.
Derived tables are very useful for performing calculations that require averages or division at a granularity different from the source and target tables.
Here’s a table I created in WhereScape RED called stage_orders:
Here it’s source diagram:
When I built the update procedure for stage_orders, I specified the following source table join:
My source data is at invoice line level and includes the customer, product, sales and order dates, sale value, tax and quantity. 21 invoice lines are loaded:
I need to add a condition to only load data for customers who have an average sale price across all orders over $10. I could do this in multiple steps or using a database view, but I’ve decided to use a derived table that retrieves the list of customers with an average sale price across all orders over $10. The derived table is: SELECT customer_code FROM load_order_header load_order_headerJOIN load_order_line load_order_lineON load_order_header.order_number = load_order_line.order_numberGROUP BY customer_codeHAVING AVERAGE(sales_value/quantity) > 10;
In WhereScape RED, all I have to do is add this SQL to the Source Table join like this:
Now only 11 invoice lines are loaded:
sampeascod 8:22 am 26 Mar ’08
It’s something I’ve run into myself and there seem to be a variety of solutions. I’m looking for discussion over the pros and cons of each, helping me to decide what I’d recommend as best practice.
You’ll notice that in Jason’s screenshots, the derived table is not shown in the data path diagram. Fair enough, it’d be a pretty complex thing to map out. However, I put forward the following alternatives to derived tables in the SQL join statement:
1. Stacked Stage tables, which include a filter on the join to effectively mimic the derived table. This will duplicate data, but could provide performance benefits. It will also flow through the documentation completely
2. User defined views, which are one of the dimension types. This won’t duplicate data, will flow through documentation, but it appears as a dimension rather than a stage table – thus breaking the natural flow of things
What are anyone else’s thoughts or suggestions on this? I’m not saying that either of mine are entirely correct or better than Jason’s suggestions, but to me the loss of documentation path is a significant disadvantage. The same principle applies to subqueries when used in the transformation of a column.
wbremmer 8:28 am 26 Mar ’08
I agree that keeping everything working through the documentation is a strong reason to not use the inline/derived views.
I generally prefer to substantiate this type of table anyway, as then if have any issues later in the process (say during the fact publish) you can always see the state of that data used (as it was) during the staging. Doing this also means that you have the ability to add indexes etc if required.
Regarding the user defined views, I always set the option in RED to have views visualized seperately so that I don’t feel constrained to only use views for dimension views.
md 8:29 am 26 Mar ’08
To keep the documentation working you can alias the inline table with the source table name, eg
SELECT *
FROM (SELECT *
FROM SourceTable) AS SourceTable