Archive: Technology

The Power of Parameters 19 Nov ’08

When first starting to use WhereScape RED many years ago, I briefly looked at parameter functionality and thought “I am sure that will come in handy some day”. 

Now the RED parameter is a fundamental building block in the daily operation of all the warehouses I have built.

A parameter in RED can be called anything you like and can contain a static value that never changes after the implementation of the warehouse or more commonly regularly incremented/updated values.

A set of very useful built in RED functions allow parameters to be read and updated at any time which are called WsParameterRead and WsParameterWrite.

Examples of how I have used each of these in SQL Server are:

– find out the value of the parameter ‘DAILY_RUN_IN_PROGRESS_FLAG’ SELECT @v_run_in_progress_flag  = dbo.WsParameterReadF(’DAILY_RUN_IN_PROGRESS_FLAG’)

– set the value of parameter ‘DAILY_RUN_IN_PROGRESS_FLAG’ to ‘Y’ with comments EXEC @v_return_code = WsParameterWrite DAILY_RUN_IN_PROGRESS_FLAG,’Y',’Place our paramter update comments here’

There are there also two other “special” ways to access your RED parameters in addition to the functions above which are:

1. Load table definitions

A RED parameter can be used in the WHERE clause of a load table by simply wrapping “$P” and “$” around the parameter you wish to use (you must add single quotes as shown in example for non numeric comparisons) in the source mapping tab in the load table properties screen e.g. 

    WHERE DATE_ADD(’1970-01-01′, INTERVAL Load_Table_Name.Load_Table_Date_In_Seconds_Column) SECOND) >= ‘$PLAST_RUN_DATE_GMT$’

    AND DATE_ADD(’1970-01-01′, INTERVAL Load_Table_Name.Load_Table_Date_In_Seconds_Column) SECOND) <= PCURRENT_RUN_DATE_GMT$’ 

At run time the RED parameter value will be substituted and I recommend storing date parameters in a format that works (without transformation) with the database you are using. 

In the SQL Server example above I set the parameter to the format “YYYY-MM-DD HH24:MM:SS” e.g. 2008-02-27 10:59:56 

2. Staging table definitions 

When generating RED procedures a popup screen appears asking if you need to use any parameters in the SQL statement in the procedure.

By selecting parameters you want to use, additional code is automatically added to the beginning of the generated procedure setting the parameter values into procedure local variables: 

    SELECT @v_current_run_date  = dbo.WsParameterReadF(’CURRENT_RUN_DATE’)

    SELECT @v_last_run_date     = dbo.WsParameterReadF(’LAST_RUN_DATE’)

    SELECT @v_unknown_date      = dbo.WsParameterReadF(’UNKNOWN_DATE’)

These local variables can then be referenced in the where clause of the main SQL statement using the variable name generated in RED: 

    WHERE DATEADD(hh,2,stage_vehicle.StartDate + stage_vehicle.StartTime) >= last_run_date

    AND stage_vehicle.StartDate + stage_vehicle.StartTime <= @v_current_run_date 

Staging table column definitions can also use the local variable names for the parameters: 

ISNULL(stage_Vehicle.StartDate,CONVERT(DATETIME,@v_unknown_date,103))StartDate 

Using RED parameters in the WHERE clauses of load and staging tables is a very effective way of restricting large volumes of transactional data.

I can highly recommend the checking/setting of parameters at the start and end of a scheduled job to easily manage incremental/delta data loads as part of a well designed data warehouse.

To check if a load table is used in RED 9 Oct ’08

Data warehouses can be complex, and sometimes it is necessary to check whether a load table is actually used in one of the end star schemas.

We can use the track forward diagram, but this won’t pick up if a column has been mapped manually in a transform.
To catch this, do a search from Tools -> Search for string

Dimension Types: What are the different dimension types and how they are implemented in WhereScape RED? 22 Aug ’08

RED operates with dimension types that are recognized by modern data warehouse theory. When a developer starts working on dimension tables he gets to chose from four options for the default generation of the dimension table and its update procedure via the following dialog box: 

 

A Normal Dimension is where a dimension record is updated and changed whenever any of the non-business key information changes.   This is commonly called a type 1 dimension and is used when historical values of the change are not kept. 

Overwriting the old values.  A Normal dimension type is the most common dimension type. RED auto-generates a stored procedure to handle update and insert paths. Consider this simple example. The “Product” table contains a product named, “Thingamabob” with Product ID being the primary key. In the year 2007, the price of Thingamabob was $150 and over the time, the product’s price changes from $150 to $350. 

In the year 2007, if the price of the product changes to $250, then the old values of the columns “Year” and “Product Price” have to be updated and replaced with the new values. In this Normal Type, there is no way to find out the old value of the product “Thingamabob” in year 2006 since the table now contains only the new price and year information. 

The logic that WhereScape RED applies to the stored procedure handling the Normal dimension type works as follows:

- The stored procedure gets all unknown records from the dim table. If an unknown dim record does not exist, then the procedure allows an explicit value to be inserted into IDENTITY field

- The procedure loops through all the input records, attempting to update the dimension records based on the business key value. If the update attempt fails, then the procedure performs an insert. 

A Slowly Changing dimension is where new dimension records are created when certain identified columns in the dimension change.   This is commonly called a type 2 dimension and tracks point in time historical changes. 

Creating another added record. In this type, the old values will not be updated but a new row containing the new values will be inserted in the product table. Consequently, at any point in time, the difference between the old values and new values can be retrieved and easily compared. This can be very useful for reporting purposes. 

The problem with the above mentioned data structure is the “Product ID” column cannot store duplicate values for “Thingamabob” since “Product ID” is the business /primary key. Also, the current data structure doesn’t clearly specify the effective date and expiration date of the product when the change to its price happened. So, it would be better to change the current data structure to overcome the above primary key violation. 

 

In the changed Product table’s Data structure, “Product ID” and “Effective Datetime” are now a composite business / primary key. Therefore, there will be no violation of the primary key constraint. Addition of the new columns, “Effective Datetime” and “Expiration Datetime” provides the information about the Thingamabob’s Effective date and Expiration date adding more clarity and noticeably improving the scope of this table. The slowly changing dimension (SCD), type 2 approach, may need added space in the database, since for every changed record; an extra row has to be stored. Since dimensions are usually not that big in the real world, extra space is negligible. 

RED performs the following additional tasks in the stored procedures to handle Slowly Changing Dimensions (SCD), type 2 dimensions.

- Three new special variables are added: insert indicator, version indicator and version number.

- Four extra columns are added to support slowly changing dimensions: current_flag; version; start_date; end_date.

- RED automatically generates an alter table SQL script, based on type. It also adds a unique, business key index to support the slowly changing columns.

- In the update stored procedure loop, if the record is not found, the procedure inserts a new record and sets the insert flag on. Otherwise, the procedure checks to see if we have an existing dimension row and have just had a field change for data that is not part of the slowly changing data that we are interested in and performs an update. If there is an existing dimension row and it verifies that any of slowly-changing fields have changed. The procedure then updates the version indicator and version number, and inserts a new row. 

A Previous Values dimension commonly called a type 3 dimension, allows the storing of the last value of selected fields in secondary columns.

Creating new fields. In this dimension type, the latest update to the changed values can be seen. The example shown below illustrates how to add new columns and keep track of the changes. From the example, we can see the current price and the previous price of the product, “Thingamabob”. 

The problem with the Previous Values approach is over time if the product price has multiple changes; the complete history may not be stored, only the latest change will be stored. For instance, in year 2008, if the Thingamabob’s price changes to $350, then we would not be able to see the complete history of 2007 prices, since the old values would have been updated with 2008 product information. 

- RED guides you through the process of adding secondary columns. 

A Date Ranged dimension supports source systems that provide date ranged information. It rolls over time based on specified columns and date range and the history is kept. Otherwise, it looks just like the Slowly Changing dimension type. The difference in that the date range dimension is valid for a given range or set of dates on specified columns. 

The following example might be appropriate here.  Thingamabobs are on sale for the date range 1/1/08 – 1/31/08 at $150 and on sale for the date range 2/1/08 - 2/29/08 for $200.  So the dimensional records would be for 1/1/08…1/31/08, 2/1/08…2/29/08 and 3/1/08 to present each range would be represented by a date range dimension key and the product on sale for that dimension key could be reported on.   

- RED will guide you through the process of identifying which column(s) will be date ranged columns, i.e. “from” datetime column followed by the “until” datetime column. E.g. such “from” column could be “Effective Datetime” and “until” column - Expiration Datetime  

Note: RED is using slightly different naming convention.

The above table will look like this:

(*)          = (Business Key)

(**)       = (Primary Artificial Key)

(+)          = (Slowly Changing Business Key) 

A Get Dimension Key Stored Procedure 

The ‘Get Key’ stored procedure is auto-generated by RED to return an artificial key when a business key is supplied. The normal syntax is to call the function (usually from update_stage_% stored procedure) passing the business key and be returned a status indicating the result of the lookup.  

For example:

EXEC   @v_return_status = get_dim_product_key 

      @v_stage_product     

     ,’N’ — Auto add flag     

     ,’N’ — Write error trail entry if not found flag     

     ,@p_sequence

     ,@p_job_name

     ,@p_task_name

     ,@p_job_id

     ,@p_task_id     

     ,@v_stage_product_key OUTPUT     

     ,@v_getkey_status OUTPUT

On a successful lookup, the artificial key for the dimension record is also returned. If the lookup fails because the business key is not found in the dimension then an appropriate status is returned and the artificial key is set to 0. 

The main logic as follows:

- Lookup the key: If dss_end_date is NULL, then we are just looking up the current record. If not null then we are trying to match to a specific slowly changing record, so look through all of the records until finding the appropriate one.

- The procedure then loops through all the old keys looking for a match on the date. If key is found then the procedure sets the key and the flag.

- If no match found then the procedure gets the current version.

- If a dimension key is not found, then the procedure will add it if we have the auto_add flag set.

- If requested (auto add log message), the procedure will log a message about the auto add and log a message about the lookup failure.

Impact Analysis: What can fizzle while applying changes on the database and how to sustain a healthy, evolving data warehouse? 10 Jun ’08

Every action in the information database and warehouse in particular has a consequence. Database developers, business analysts and DBAs  feel agonizing pain when an organization makes changes to their data warehouse applications without having complete insight into the true impact and cost that a modification will have.

So, what is an impact analysis?  It is an identification of the consequences of change on an object to its related objects. Impact analysis, in case of a data warehouse, consists of:

• understanding of the processes performed within a data warehouse as a whole and its components;
• comprehension of interdependencies between processes and objects;
• assessing the effect the change brings to the data warehouse structure;
• developing recovery procedures in case of  a failed modification.

WhereScape RED provides a variety of needed mechanisms to identify a data warehouse’s weak spot when it comes to a change.

In the development process we have put our best effort into creating a perfect addition to our data warehouse. Now it’s time to deploy our changes to a higher environment. But how can we do so without crashing things that we had developed previously? How do we insure that the new modification will not destroy the integrity of the data warehouse instead of improving it? With WhereScape RED it’s easy.

Here are a few recommendations on how to avoid oopses and ouches:

1. Structural Impact Analysis.  i.e.  what objects have been changed.
Click “Tools” >  Click “Search for string…” > Fill out dialog box for string (e.g. either enter the procedure name being modified or the column being changed) > examine “Object search results” screen for all objects that contain this string.

For the structural-impact analysis you can use the Browser screen for “DataWarehouse” connection.
Click “Invoke SQL Admin” > execute something like this (SQL Server):

select distinct(a.name) from sys.procedures a inner join sys.syscomments b on a.object_id=b.id
where charindex(‘stored procedure or column’,b.text)<>0;
– Check load where clause too
select distinct(a.lt_table_name)from ws_load_tab a
where charindex(‘stored procedure or column’,a.lt_where_clause)<>0;

2. Evaluation of the Impact. i.e. what possible impact the data warehouse’s alteration renders on various data warehouse objects. WhereScape uses an array of pre-defined reports. Click the “Reports” tab from the main Browser menu. The list of available reports expands from release to release. There are about two dozen reports that are directly related to Impact Analysis. Some of them are:
•  “Object modification dates (without indexes)” with date created and date modified, same for “Index modification dates”;
• “Modified procedures” i.e. procedures that have been modified since creation;
• “Table objects that have no associated procedure or script” and “Procedures that are not related to a table”;
• “All objects that have been Refreshed or Imported”;
• “Differences between current and selected repository…”;
• “Query data warehouse objects… where custom validations scripts can be run”;
• “Track back on specified column usage…” and many others.

3. Impact during the Promotion from one environment to the other. e.g.  from DEV to TEST,  when managing such deployments using the “Build Application tables…” built-in utility. While the developer plunges through the wizard-like dialog boxes, WhereScape RED continuously warns him/her of the possible implications of this activity. E.g. if the particular object is to be overwritten, warning message pops up to alert the developer on the possible impact on the existing object.

4. Organizational Documentation Update. The considerable problem for a conventional data warehousing project is that database changes within data warehouse are not reflected in documentation: nobody has time to update them. WhereScape RED automatically generates a new documentation set after each iteration of data warehousing development.

5. Versioning of Metadata Objects. Several kinds of version management are utilized in WhereScape RED. For example, choosing the “Automated Version Creation” option from the Tools menu will thwart a developer from losing valuable DDL scripts, procedural code and other metadata and database information.

6. Standardization and Best Practice Usage.    WhereScape RED uses a state-of-the-art methodology in building warehousing prototypes. Following the developer’s guidelines delineated by the WhereScape RED methodology, wizards and the automatic generation of database objects, in combination with an enterprise standard “look and feel” dramatically decreases the chances of glitches in the data warehouse processing.

Loading Data Using RED 5 Jun ’08

The are several different ways to load data from files and source systems into the data warehouse load table using WhereScape RED.  Each load table type has different attributes, options and performance advantages in different situations.

RED automates loading data from files by either parsing files into columns as they’re loaded or by loading files into single column tables.  The following kinds of flat files:

  • fixed width
  • delimited
  • XML

The default is to have RED manage the code for loading file.  This is called a file load.  You may decide to generated a script and manage this yourself.  This is called a script load.

When load fixed width and delimited files, RED uses the load utilities in the target data warehouse database environment.  For SQL Server this is BULK insert.  In an Oracle data warehouse, SQL*Loader is used.  Teradata has several different loaders and RED can utilize fastload, multiload and TPT.  DB2 uses the LOAD statement.

All options available for these loaders are able to be used by RED.

Two options are available for XML.  The default is to allow RED to read and load the XML file itself.  Another option available for SQL Server data warehouses.  This involves using the XML OLEDB provider for SQL Server via a Script load.

RED also automates extracting and loading data from other databases into the data warehouse.  Again, there are several approaches available:

  • ODBC loads
  • Native ODBC loads
  • Database Link loads via a database link (Oracle warehouses only)
  • Database Link loads via a linked server (SQL Server warehouses only)
  • Externally Loaded load tables

For large volume loads from another database into the data warehouse, the fastest option is almost always the Native ODBC load.  The exceptions for this rule are:

  • In an Oracle data warehouse loading data from another oracle database, a database link load may be faster if one of the following is the case:
    • the two databases are connected via a really fast network
    • the two databases on the same physical server

For smaller volumes, you should always use database link loads for SQL Server->SQL Server and Oracle->Oracle load tables.  Similarly, for small volumes, loading data from a different kind of database, it may be better to use ODBC loads rather than Native ODBC loads.

The final option RED provides is Externally loaded load tables.  These provide integration with existing ETL or EAI infrastructure.  RED knows about the load table in its metadata and can build downstream objects from the load table.  RED does not load the load table.  It waits for the other system to load it.  Job integration is achieved using a combination of RED and the other products APIs.

Refer to the WhereScape RED User Guide for full information on all of the load types discussed.

Multiple Source Systems – One Set of Load Tables, How Does it Work? 23 May ’08

Consider an environment where a source system is duplicated across 3 sites. Site A, Site B and Site C. There may be 400 load tables that require loading from each of these duplicated source systems, in practical terms they are the same tables but in different implementations of the same application. When building WhereScape objects to load data from these tables it is preferable to build the one set of load objects that provide data to the one set of dimension, stage and fact processes. All objects can then be used regardless of the source system providing the actual data. The key to resolving this problem is the use of a WhereScape RED callable procedure called Ws_Connect_Replace. This procedure allows the caller to replace the contents of a connection from another connection.

For example, four connections are created:

Connection Site A - Stores all connection details for Site A 

Connection Site B - Stores all connection details for Site B  

Connection Site C - Stores all connection details for Site C 

Connection Site Generic - When called Ws_Connect_Replace swaps A, B, or C connection details into this connection

All 400 load tables use Connection Site Generic as the connection defined in the metadata.

Prior to loading the data from the load tables the procedure Ws_Connect_Replace is called swapping site specific connection details into the generic connection. Processing can then continue in the knowledge that the correct data is being loaded.

To automate the processing of each site you can use a custom procedure at the start of a job that contains the call to Ws_Connect_Replace along side some parameter controls. In this way you also have only one set of job metadata to maintain.

If your requirements are smaller and only involve one load table that requires source system or schema swapping, look at the WhereScape callable procedure Ws_Load_Change. This procedure allows the caller to change either the connection or the schema of a load table.

Migrating Between RDBMS Platforms 11 Apr ’08

A powerful feature within WhereScape RED is the ability to migrate a metadata repository from one RDBMS platform to another.  This will allow you to take an existing data warehouse environment on one database, such as SQL Server, and move it to another like Oracle.  While rarely used, this functionality is of immense value should you need to move RDBMS platforms at any stage after the data warehouse has been created.  

The following steps can be performed to move an existing WhereScape RED metadata repository from one RDBMS platform to another.  Note: this only moves the metadata structures and content, it does not move the data contained in the data warehouse. 

Prior to performing the metadata migration you may wish to perform the following tasks to ease the process;

  1. Upgrade to the latest version of WhereScape RED, to ensure you have the latest functionality. 
  2. Archive the Scheduler Audit and Detail log tables if you do not need to bring across this information.  The archive process moves the records from the audit trail log to an audit trail archive table.  This will reduce the amount of data that needs to be imported into the target metadata repository.  To perform the archive, go to the Scheduler tab within WhereScape RED and choose from the menu:  Logs à Archive the audit and detail logs.

In the example below we are moving from SQL Server to Oracle;

  1. Create an Oracle metadata repository via WhereScape Setup Administrator tool, using the same version as the SQL Server repository you are going to migrate.
  2. From the SQL Server repository, unload the metadata.  This can be done within WhereScape RED using menu option:  Backup à Unload the metadata to disk.
  3. Connect to the Oracle metadata repository using WhereScape RED.  Load the SQL Server metadata unload, ignoring the warning about the database being different.  This can be done using menu option:  Backup à Load the metadata from disk.  This process can take a long time for a large metadata environment.
  4. Verify column data types have been converted.  WhereScape RED will convert most data types for you during the load process.  However, there may be some data types that still need to be manually adjusted.  This can be easily verified by creating all of the tables (Load, Dimension, Stage, Fact, and Aggregate).  Those tables with bad data types will fail during table creation.  If any data types need to be manually converted, please inform WhereScape Support and they will ensure that future releases of WhereScape RED accommodate those data types.
  5. Change all transformations to Oracle supported functions.  Common transformations will be OK.  However, RDBMS specific transformations will need to be converted to the appropriate syntax on the target RDBMS.
  6. Regenerate or Rebuild all unmodified procedures. 
    • If you created simple join statement in SQL Server using ANSI join syntax, this will not translate in an Oracle environment.  You will need to rebuild the join syntax in Oracle via the wizard.
    • Outer join syntax will need to be rebuilt as the SQL syntax is different.
  7. Custom procedures will need to be recreated using the appropriate procedural language. eg: PL/SQL
  8. Change Load Table where clauses if required.
  9. Change View where clauses if required.
  10. Change Connection details if required.
  11. Change Load Table connections (for file and script loads) to UNIX if a UNIX scheduler is being used.
  12. Change Script connections (for script loads) to UNIX if a UNIX scheduler is being used.
  13. Regenerate any Load scripts.

While not a common situation, it shows the flexibility of WhereScape RED and the advantage of using a tool to generate the bulk of the underlying code.  What can be achieved in a very short time using WhereScape would be a substantial undertaking, if not impossible, in any hand-coded or proprietary environment.

Views over Type 2 dimensions, simplify everybody’s life 27 Mar ’08

Let’s talk about something that is really painful….type 2 slowly changing dimension. Of course, it can be the best solution in certain cases and yes, it is a very basic data warehousing concept but sometimes, isn’t it a bit “challenging” (to avoid the word painful) for everyone (users included) to maintain and/ or query a slowly changing dimensioned fact table?

Several mechanisms such as snapshot fact tables can “save” the project from them. For that, please refer to the luminous article prepared by Doug Barrett for this blog (SnapShot Fact Tables vs Type 2 dimensions / 20 Dec ’07).

But what can be done when the dimension is already in place? Let’s assume in addition that the users are not very familiar with this type of object and that the new piece of reporting does not require any history on the concerned dimension…

The answer is simple: a dimension view! As close as possible to a type 1 dimension… The goal is to simplify the data and for that, stick to the last known version of the dimension record (as we would do it for a type 1). However, there will be no one-to-one relationship between the surrogate key and the business key components. We still need all surrogate keys per business key to be able to link the data back to the whole fact table.

Once again, it is possible when the context (data, users, requirements…) allows it, but this is typically something that will simplify everybody’s life. Let’s go back to our classic example: A basic sales transactional fact table dimensioned by a type 2 dim_customer that looks like the following table:

dim_cust_key cust_number cust_name address start_date end_date status_flag version

Ø      dim_cust_key is the dim surrogate key (key type: 0)Ø      cust_number is the primary business key (key type: A)Ø      version and status_flag are the added business key (key type: 1)Ø      address is the slowly changing column (key type: 3) 

All we have to do is create a view with all fields from the customer dimension but valid_from, valid_to dates, status and version. Obviously, we leave the business key blank in the definition. Then, we transform the slowly changing column to make it the last known value (identified by dss_current_flag = ‘Y’ for instance) for each occurrence of the business key. The Oracle / MsSQL statement in the transformation of the column city in our example should looks like this:

Select   b.cityfrom   dim_customer b where   dss_current_flag = ‘Y’  and b.cust_number = dim_customer.cust_number  

The result view is as follows:

Dim_customer_view_key Cust_number Cust_name Address

We have now two levels of information for customers. The dimension view for simple things is ready to help. Join it, grouped by the business key for instance to link all data from the fact table. When reporting is more elaborated, more complex, the Type 2 Dimension table is still to be used for tracking changes properly.

There are different ways to do this kind of thing; we could for instance as well add a “current” field directly in the slowly changing dimension and query (or build the view) from it. The goal is just to simplify the use of the dimension when possible. After that, it is up to the developer to deploy what suits the best depending on the context.

The new “post update” option, how to boost dimensions lookup sourcing 27 Mar ’08

WhereScape RED provides all needed mechanisms for multiple sources management when populating dimensions. The new “post update” feature, available from version 5.6.2.0, brings the next steps in term of performance.

Let’s says for instance that your customer dimension table must be loaded from a load_customer table that contains the whole history.

The business key is the customer number and the load table contains all related information as well (let’s says customer name, tel, address, country, creation date and modification date).

Naturally, we don’t want the process to check if every single customer of the whole history exists before updating the dim data or inserting a new record. It is time for a lookup populated dimension using the post update feature!

First thing to do: another load table… Its source obviously is load_customer and it needs only one column: the customer number. Let’s call this table load_customer_code.

In that example (to make it simple), we build ourselves the lookup table from the source. In many cases, the lookup data will come from somewhere else. It could be a list of customers flagged by the marketing, the list of modified/ created customers since the last load…

Now, when creating the dimension, all related customer information will come from load_customer but cust numbers will be loaded from the second table. When building the procedure, WhereScape RED will detect multiple sources and proposes to populate the dimension via a join or a lookup. We select “lookup”; the primary table obviously is load_customer_code because it contains all customer identifiers we want to process.

The business key definition box appears, select the customer number as key. Let’s Tick “Not Exists Select” checkbox, we just ensured in a single click that Red will only deal with new customers in the cursor processing part of the procedure. For that a “Not exists” statement (Oracle: Minus) is added in the cursor definition. It should be as follows:

MS Sql:

   DECLARE c_Load CURSOR LOCAL FAST_FORWARD READ_ONLY FOR     SELECT       code                             code     FROM       load_customer_code    WHERE NOT EXISTS (       SELECT 1 FROM dim_customer       WHERE ( load_customer_code.code = dim_customer.code         OR ( load_customer_code.code IS NULL AND dim_customer.code IS NULL ))       )

Last step, the lookup definition box appears. Let’s tick “post update” then define the join between the dim table and the source table (in our example, dim_customer and load_customer). The procedure is ready to boost!

When loading the table, what happens behind the scenes is:

1) Not exists option: the cursor defined will create all new records very quickly (as it contains only new customer codes).

2) Post update: In that example, the cursor creates blank records with only the business key column populated. At the end of the procedure, a global update (As it would be for a “set” mode) occurs. It updates existing records on one hand and populates related information for new data on the other.

Everything is built and ready, fast…

To go even further, if we can pre-extract in the load_table all modified / new customer information as source of the dimension, it will go faster as the post update will only deal with modified / new customers. Besides, the functionality will obviously allow more complex lookups with other tables containing additional customer related information for instance.

To conclude, combining the “not exist” option with the post update lookup will allow to minimize the cursor processing and then gain in performance. To finish with something that could be important: even if the cursor deals with records to be “created only” in theory, it will always try to update first the table before inserting. That is because the source could contain doubles for instance. This is something that advanced WhereScape RED developers may want to tune but at the cost of a custom modified procedure…

Copy your tables, the belt and braces approach 27 Mar ’08

It’s always the same story. Before modifying a dimension or a fact table, I really do not want to lose any data or have to recreate everything from scratch. There is always the danger that this could happen just because I missed a step or irredeemably modified a column’s data wrongly for instance.

Sometimes, when working on critical objects, it is great to know that whatever happens, you can go back. You do not want to deploy last week’s full backup of the database just because you missed something in a single table. The method I use to prevent this is fairly simple:

1) An old but efficient SQL Statement to create a copy of the table:

Oracle: “CREATE TABLE clone_table AS SELECT * FROM table;”

SQL Server: “SELECT * INTO clone_table FROM table;”

It could be a good idea to adopt a common naming convention for this kind of table. For instance you can prefix it by bkp_ + original_table_name, so it’s easy to find them for a further purge.

When the volume is big, such as with a fact table, I obviously try to tune the bkp table creation query as much as possible.

2) A new version of the Object’s Metadata Definition with WhereScape RED:

Right Click on the table in the left pane then select “New Version”. Give a name to your version and thick the checkbox “include associated procedure / script”.

From there, whatever happens, I’m fine. If necessary, I can recreate the original meta-data by using the “create object from version” option (right click on a project/table type). I can easily repopulate the table by using the following statement:

“INSERT INTO table

SELECT * FROM clone_table;”

These mechanisms are not only useful for backup / restore purpose. They can be very handy for comparisons between tables after any processing or creating data subsets for instance.

Object versioning in WhereScape RED is used for several other reasons such as tracking development changes over the time or creating new tables close to the versioned one, for instance in case of table merges.

Obviously for this kind of approach, everyone implements using their own recipe. I guess the only important thing is to always have a plan B.

Derived Tables in WhereScape RED 26 Mar ’08

A derived table is an in-line SQL SELECT statement in the FROM clause of a query.  They are also known as in-line views.  Essentially they are non-created views used on the fly.  A simple derived table is: 

SELECT *

FROM  (SELECT *

       FROM   SourceTable) AS drvd_tbl 

In most relational databases, derived tables must be aliased.  In the example above, the derived table is aliased as drvd_tbl. 

Derived tables are very useful for performing calculations that require averages or division at a granularity different from the source and target tables. 

Here’s a table I created in WhereScape RED called stage_orders: 

 first pic

Here it’s source diagram:

Pic 2

When I built the update procedure for stage_orders, I specified the following source table join:

33.JPG

 My source data is at invoice line level and includes the customer, product, sales and order dates, sale value, tax and quantity.  21 invoice lines are loaded:

4.JPG

I need to add a condition to only load data for customers who have an average sale price across all orders over $10.  I could do this in multiple steps or using a database view, but I’ve decided to use a derived table that retrieves the list of customers with an average sale price across all orders over $10.  The derived table is: SELECT   customer_code FROM     load_order_header load_order_headerJOIN     load_order_line load_order_lineON       load_order_header.order_number = load_order_line.order_numberGROUP BY customer_codeHAVING AVERAGE(sales_value/quantity) > 10; 

In WhereScape RED, all I have to do is add this SQL to the Source Table join like this:

51.JPG

Now only 11 invoice lines are loaded:

6.JPG

WhereScape RED Teradata History Table DATE Support 19 Mar ’08

An upcoming release of WhereScape RED will include support for the DATE data type for start and end date columns on Teradata history tables. 

Currently, only the TIMESTAMP data type is able to be used for these columns. 

There will be an additional step to choose between TIMESTAMP and DATE data types when generating the update procedure for a History Table. 

If the Source System supplied Start and End dates option is chosen, WhereScape RED will automatically detect the data types of these columns.

Number Storage in SQL Server 13 Mar ’08

It’s always been important in SQL Server to specify the exact length and precision of number data types.  This is because SQL Server has always had just fixed length internal storage for numbers. 

For example, if you have a whole number to store but it can only been one or two digits long then declare it as SMALLINT. 

Recently I converted some tables from Oracle to SQL Server for benchmark testing.  Initially I left all the data type lengths the same as they were in Oracle (Oracle only has variable internal number storage). 

It seemed to take too long to load data into this table from a file, so I changed the data types of numeric fields to be just big enough for the data.  My load time halved! (and so did the space I was using for the table). 

Then I remembered that SQL Server 2005 introduced variable number storage.  The trick is you have to turn it on.  So I changed the table to have decimal data types for all numbers and turned on variable number storage by running the following:

- sp_db_vardecimal_storage_format ‘DatabaseName‘, ‘ON’

- sp_tableoption ‘TableName‘, ‘vardecimal storage format’, ‘ON’ 

I ran the load again and the table did take up a lot less space in the database, but the load time had increased.  I can only assume that the cost of calculating the size of the column somehow slowed things down…

Optimizing Analysis Services 11 Mar ’08

WhereScape have added some new functionality into WhereScape RED version 5.6.4 that improves how dimensions are optimized within Analysis Services 2005.  Attribute relationships are now automatically defined by WhereScape RED when creating a cube, which will improve the top end aggregation performance.  These attribute relationships are exposed and managed within WhereScape RED through the use of member properties (a right click option on a cube).  The two types of attribute relationship (rigid and flexible) are supported.

When creating a cube using version 5.6.4+ the cube dimensions may look different if a level within a hierarchy is not using a key and / or is not unique. This is because the attribute relationship is only built between two levels of a hierarchy, rather than all the way up from the leaf member.  For example if you have a hierarchy of country -< city, where you can have the same city name in multiple countries you must ensure that the city key column is unique otherwise all data against that city name will be allocated to just one of those countries. An artificial key can be built (if one is not available) by creating an additional column in the dimension that is populated with the country_code concatenated to the city_code thereby ensuring all city keys are unique.

If the new hierarchy structure does not fit with your requirements and you preferred the non-optimised view of the dimension then you can delete the attribute relationships in WhereScape RED by deleting the member properties defined on the cube in WhereScape RED.  This will create a cube identical in structure to those cubes created in prior versions of WhereScape RED.

WhereScape RED Native Load Teradata Multiload Support 11 Mar ’08

The latest release of WhereScape RED includes a new option for native odbc load tables in Teradata.

Teradata multiload has always been supported in WhereScape RED for loading flat files.  Now it’s also able to be chosen as the loader for native odbc loads.

This option has been added as several WhereScape Teradata customers prefer multiload to fastload.  There are also some benefits using multiload on busy servers.

To use this new option, choose multiload from the Native Load Routine drop down on the Source Mappings tab for the load table.