Archive: March ’08

Views over Type 2 dimensions, simplify everybody’s life 27 Mar ’08

Let’s talk about something that is really painful….type 2 slowly changing dimension. Of course, it can be the best solution in certain cases and yes, it is a very basic data warehousing concept but sometimes, isn’t it a bit “challenging” (to avoid the word painful) for everyone (users included) to maintain and/ or query a slowly changing dimensioned fact table?

Several mechanisms such as snapshot fact tables can “save” the project from them. For that, please refer to the luminous article prepared by Doug Barrett for this blog (SnapShot Fact Tables vs Type 2 dimensions / 20 Dec ’07).

But what can be done when the dimension is already in place? Let’s assume in addition that the users are not very familiar with this type of object and that the new piece of reporting does not require any history on the concerned dimension…

The answer is simple: a dimension view! As close as possible to a type 1 dimension… The goal is to simplify the data and for that, stick to the last known version of the dimension record (as we would do it for a type 1). However, there will be no one-to-one relationship between the surrogate key and the business key components. We still need all surrogate keys per business key to be able to link the data back to the whole fact table.

Once again, it is possible when the context (data, users, requirements…) allows it, but this is typically something that will simplify everybody’s life. Let’s go back to our classic example: A basic sales transactional fact table dimensioned by a type 2 dim_customer that looks like the following table:

dim_cust_key cust_number cust_name address start_date end_date status_flag version

Ø      dim_cust_key is the dim surrogate key (key type: 0)Ø      cust_number is the primary business key (key type: A)Ø      version and status_flag are the added business key (key type: 1)Ø      address is the slowly changing column (key type: 3) 

All we have to do is create a view with all fields from the customer dimension but valid_from, valid_to dates, status and version. Obviously, we leave the business key blank in the definition. Then, we transform the slowly changing column to make it the last known value (identified by dss_current_flag = ‘Y’ for instance) for each occurrence of the business key. The Oracle / MsSQL statement in the transformation of the column city in our example should looks like this:

Select   b.cityfrom   dim_customer b where   dss_current_flag = ‘Y’  and b.cust_number = dim_customer.cust_number  

The result view is as follows:

Dim_customer_view_key Cust_number Cust_name Address

We have now two levels of information for customers. The dimension view for simple things is ready to help. Join it, grouped by the business key for instance to link all data from the fact table. When reporting is more elaborated, more complex, the Type 2 Dimension table is still to be used for tracking changes properly.

There are different ways to do this kind of thing; we could for instance as well add a “current” field directly in the slowly changing dimension and query (or build the view) from it. The goal is just to simplify the use of the dimension when possible. After that, it is up to the developer to deploy what suits the best depending on the context.

The new “post update” option, how to boost dimensions lookup sourcing 27 Mar ’08

WhereScape RED provides all needed mechanisms for multiple sources management when populating dimensions. The new “post update” feature, available from version 5.6.2.0, brings the next steps in term of performance.

Let’s says for instance that your customer dimension table must be loaded from a load_customer table that contains the whole history.

The business key is the customer number and the load table contains all related information as well (let’s says customer name, tel, address, country, creation date and modification date).

Naturally, we don’t want the process to check if every single customer of the whole history exists before updating the dim data or inserting a new record. It is time for a lookup populated dimension using the post update feature!

First thing to do: another load table… Its source obviously is load_customer and it needs only one column: the customer number. Let’s call this table load_customer_code.

In that example (to make it simple), we build ourselves the lookup table from the source. In many cases, the lookup data will come from somewhere else. It could be a list of customers flagged by the marketing, the list of modified/ created customers since the last load…

Now, when creating the dimension, all related customer information will come from load_customer but cust numbers will be loaded from the second table. When building the procedure, WhereScape RED will detect multiple sources and proposes to populate the dimension via a join or a lookup. We select “lookup”; the primary table obviously is load_customer_code because it contains all customer identifiers we want to process.

The business key definition box appears, select the customer number as key. Let’s Tick “Not Exists Select” checkbox, we just ensured in a single click that Red will only deal with new customers in the cursor processing part of the procedure. For that a “Not exists” statement (Oracle: Minus) is added in the cursor definition. It should be as follows:

MS Sql:

   DECLARE c_Load CURSOR LOCAL FAST_FORWARD READ_ONLY FOR     SELECT       code                             code     FROM       load_customer_code    WHERE NOT EXISTS (       SELECT 1 FROM dim_customer       WHERE ( load_customer_code.code = dim_customer.code         OR ( load_customer_code.code IS NULL AND dim_customer.code IS NULL ))       )

Last step, the lookup definition box appears. Let’s tick “post update” then define the join between the dim table and the source table (in our example, dim_customer and load_customer). The procedure is ready to boost!

When loading the table, what happens behind the scenes is:

1) Not exists option: the cursor defined will create all new records very quickly (as it contains only new customer codes).

2) Post update: In that example, the cursor creates blank records with only the business key column populated. At the end of the procedure, a global update (As it would be for a “set” mode) occurs. It updates existing records on one hand and populates related information for new data on the other.

Everything is built and ready, fast…

To go even further, if we can pre-extract in the load_table all modified / new customer information as source of the dimension, it will go faster as the post update will only deal with modified / new customers. Besides, the functionality will obviously allow more complex lookups with other tables containing additional customer related information for instance.

To conclude, combining the “not exist” option with the post update lookup will allow to minimize the cursor processing and then gain in performance. To finish with something that could be important: even if the cursor deals with records to be “created only” in theory, it will always try to update first the table before inserting. That is because the source could contain doubles for instance. This is something that advanced WhereScape RED developers may want to tune but at the cost of a custom modified procedure…

Copy your tables, the belt and braces approach 27 Mar ’08

It’s always the same story. Before modifying a dimension or a fact table, I really do not want to lose any data or have to recreate everything from scratch. There is always the danger that this could happen just because I missed a step or irredeemably modified a column’s data wrongly for instance.

Sometimes, when working on critical objects, it is great to know that whatever happens, you can go back. You do not want to deploy last week’s full backup of the database just because you missed something in a single table. The method I use to prevent this is fairly simple:

1) An old but efficient SQL Statement to create a copy of the table:

Oracle: “CREATE TABLE clone_table AS SELECT * FROM table;”

SQL Server: “SELECT * INTO clone_table FROM table;”

It could be a good idea to adopt a common naming convention for this kind of table. For instance you can prefix it by bkp_ + original_table_name, so it’s easy to find them for a further purge.

When the volume is big, such as with a fact table, I obviously try to tune the bkp table creation query as much as possible.

2) A new version of the Object’s Metadata Definition with WhereScape RED:

Right Click on the table in the left pane then select “New Version”. Give a name to your version and thick the checkbox “include associated procedure / script”.

From there, whatever happens, I’m fine. If necessary, I can recreate the original meta-data by using the “create object from version” option (right click on a project/table type). I can easily repopulate the table by using the following statement:

“INSERT INTO table

SELECT * FROM clone_table;”

These mechanisms are not only useful for backup / restore purpose. They can be very handy for comparisons between tables after any processing or creating data subsets for instance.

Object versioning in WhereScape RED is used for several other reasons such as tracking development changes over the time or creating new tables close to the versioned one, for instance in case of table merges.

Obviously for this kind of approach, everyone implements using their own recipe. I guess the only important thing is to always have a plan B.

Derived Tables in WhereScape RED 26 Mar ’08

A derived table is an in-line SQL SELECT statement in the FROM clause of a query.  They are also known as in-line views.  Essentially they are non-created views used on the fly.  A simple derived table is: 

SELECT *

FROM  (SELECT *

       FROM   SourceTable) AS drvd_tbl 

In most relational databases, derived tables must be aliased.  In the example above, the derived table is aliased as drvd_tbl. 

Derived tables are very useful for performing calculations that require averages or division at a granularity different from the source and target tables. 

Here’s a table I created in WhereScape RED called stage_orders: 

 first pic

Here it’s source diagram:

Pic 2

When I built the update procedure for stage_orders, I specified the following source table join:

33.JPG

 My source data is at invoice line level and includes the customer, product, sales and order dates, sale value, tax and quantity.  21 invoice lines are loaded:

4.JPG

I need to add a condition to only load data for customers who have an average sale price across all orders over $10.  I could do this in multiple steps or using a database view, but I’ve decided to use a derived table that retrieves the list of customers with an average sale price across all orders over $10.  The derived table is: SELECT   customer_code FROM     load_order_header load_order_headerJOIN     load_order_line load_order_lineON       load_order_header.order_number = load_order_line.order_numberGROUP BY customer_codeHAVING AVERAGE(sales_value/quantity) > 10; 

In WhereScape RED, all I have to do is add this SQL to the Source Table join like this:

51.JPG

Now only 11 invoice lines are loaded:

6.JPG

WhereScape RED Teradata History Table DATE Support 19 Mar ’08

An upcoming release of WhereScape RED will include support for the DATE data type for start and end date columns on Teradata history tables. 

Currently, only the TIMESTAMP data type is able to be used for these columns. 

There will be an additional step to choose between TIMESTAMP and DATE data types when generating the update procedure for a History Table. 

If the Source System supplied Start and End dates option is chosen, WhereScape RED will automatically detect the data types of these columns.

Number Storage in SQL Server 13 Mar ’08

It’s always been important in SQL Server to specify the exact length and precision of number data types.  This is because SQL Server has always had just fixed length internal storage for numbers. 

For example, if you have a whole number to store but it can only been one or two digits long then declare it as SMALLINT. 

Recently I converted some tables from Oracle to SQL Server for benchmark testing.  Initially I left all the data type lengths the same as they were in Oracle (Oracle only has variable internal number storage). 

It seemed to take too long to load data into this table from a file, so I changed the data types of numeric fields to be just big enough for the data.  My load time halved! (and so did the space I was using for the table). 

Then I remembered that SQL Server 2005 introduced variable number storage.  The trick is you have to turn it on.  So I changed the table to have decimal data types for all numbers and turned on variable number storage by running the following:

- sp_db_vardecimal_storage_format ‘DatabaseName‘, ‘ON’

- sp_tableoption ‘TableName‘, ‘vardecimal storage format’, ‘ON’ 

I ran the load again and the table did take up a lot less space in the database, but the load time had increased.  I can only assume that the cost of calculating the size of the column somehow slowed things down…

Optimizing Analysis Services 11 Mar ’08

WhereScape have added some new functionality into WhereScape RED version 5.6.4 that improves how dimensions are optimized within Analysis Services 2005.  Attribute relationships are now automatically defined by WhereScape RED when creating a cube, which will improve the top end aggregation performance.  These attribute relationships are exposed and managed within WhereScape RED through the use of member properties (a right click option on a cube).  The two types of attribute relationship (rigid and flexible) are supported.

When creating a cube using version 5.6.4+ the cube dimensions may look different if a level within a hierarchy is not using a key and / or is not unique. This is because the attribute relationship is only built between two levels of a hierarchy, rather than all the way up from the leaf member.  For example if you have a hierarchy of country -< city, where you can have the same city name in multiple countries you must ensure that the city key column is unique otherwise all data against that city name will be allocated to just one of those countries. An artificial key can be built (if one is not available) by creating an additional column in the dimension that is populated with the country_code concatenated to the city_code thereby ensuring all city keys are unique.

If the new hierarchy structure does not fit with your requirements and you preferred the non-optimised view of the dimension then you can delete the attribute relationships in WhereScape RED by deleting the member properties defined on the cube in WhereScape RED.  This will create a cube identical in structure to those cubes created in prior versions of WhereScape RED.

WhereScape RED Native Load Teradata Multiload Support 11 Mar ’08

The latest release of WhereScape RED includes a new option for native odbc load tables in Teradata.

Teradata multiload has always been supported in WhereScape RED for loading flat files.  Now it’s also able to be chosen as the loader for native odbc loads.

This option has been added as several WhereScape Teradata customers prefer multiload to fastload.  There are also some benefits using multiload on busy servers.

To use this new option, choose multiload from the Native Load Routine drop down on the Source Mappings tab for the load table.

Non Mandatory Source Columns - yet another forgotten feature 11 Mar ’08

Every so often I have to extract some data from a source system that changes without warning.

If a column suddenly disappears from a source table my load table will error, probably causing my whole job to stop.  WhereScape RED has some really cool functionality to protect against this.   The Non Mandatory Source Columns feature allows you to flag a load table for extra protection.

This option can be found on the Source Mapping tab of load table properties for database link loads.  I turn this option on and usually choose Warning for Exit status when missing columns.  When the load table runs via the scheduler, WhereScape RED automatically checks to see which columns in the source table or tables exist.  If any source columns don’t exist, the table is still loaded and the missing columns are set to NULL.

Other options for the exit status are Error and Success.

I’ve taken this a step further a couple of times and used After Load column transformations to reset any NULL values to a default value.