Archive: Technology

Tricks and traps with Oracle (as a source or target) on 64 bit servers 11 Nov ’09

Installing RED on a Windows 64-bit server, with Oracle as a repository or source system

When installing RED on 64-bit Windows, the installer selects the following installation folder by default:

C:\Program Files (x86)\WhereScape\

This is where 32-bit applications normally reside on a 64-bit Windows platform.

However, if Oracle is involved, either as the data warehouse repository, or as a source system, installing RED in “Program Files (x86)” causes a problem.

The problem is that Oracle doesn’t like directories that have a parenthesis in the name - applications using Oracle cannot be located in a directory with parenthesis in the name, and this includes RED using an ODBC connection to Oracle.  The result…RED will not be able to connect to Oracle.

This issue can be prevented by changing the installation directory of RED to:

C:\WhereScape

For more information about this Oracle issue, see

http://tiredblogger.wordpress.com/2007/11/21/oracle-data-access-components-odac-with-64-bit-development/

64-bit SQL Server data warehouse repository with Oracle as a source system

If you are running a 64-bit SQL Server data warehouse, and need to connect to one or more Oracle source systems, you will require both the 32-bit and 64-bit Oracle Clients installed on the 64-bit Windows Server:

1) 32-bit Oracle Client

The RED client is a 32-bit application, and requires a 32-bit ODBC source to connect to an Oracle database. 

The 32-bit Oracle Client is required so you can set up a 32-bit ODBC source using the “Microsoft ODBC for Oracle” ODBC source.

2) 64-bit Oracle Client

For better data load performance, create a SQL Server “Linked Server” to connect to an Oracle source system.

The RED scheduler will use the Linked Server for the data loads.

64-bit SQL Server requires the 64-bit Oracle Client and ODAC (Oracle Data Access Components) to create the Linked Server - these are both included in the Oracle Database 10g Release 2 Client (10.2.0.4.0) for Vista and Windows 2008 (64 bit).

For installation instructions for the 64-bit Oracle Client, and setup of the Linked Server, see

http://www.mssqltips.com/tip.asp?tip=1433

We have tested this setup using the following software versions:

64-bit SQL Server 2005 as the DW repository on the 64-bit Windows Server

32-bit Oracle 10g Client installed on the 64-bit Windows Server

64-bit Oracle 10g Client installed on the 64-bit Windows Server

Oracle 9iR2 as a source system

Oracle 10g as a source system

 

 

 

To be, or not to be Precise: that is the Question 23 Jul ’09

The other day I was on a customer site where a subject area in the data warehouse had been based on another division’s data warehouse.

There were a few changes that had been made due to different versions of the source system, but these were largely trivial.  The main change was this data warehouse was in SQL Server and the original data warehouse database had been Oracle.  In both cases the source system was an Oracle database.

The customer couldn’t get the new SQL Server data warehouse to balance back to the source system.

It didn’t take long to work out the cause: precision.

Here’s some background on Oracle:

The Oracle documentation describes the NUMBER datatype as follows:

The NUMBER datatype stores fixed and floating-point numbers.  Numbers of virtually any magnitude can be stored … up to 38 digits of precision.

Columns using the NUMBER datatype can be defined using the following conventions:

  • NUMBER – see below
  • NUMBER(precision) – i.e.: scale is zero
  • NUMBER(*, scale) – i.e.:  precision is 38
  • NUMBER(precision, scale)

Oracle always stores NUMBER values in variable length format as “mainframe” computational-3 packed decimals.  In English this means a value with p digits of precision uses ROUND((LENGTH(p)+s)/2))+1 bytes (s is 0 for positive values and 1 for negatives).

All other numeric datatypes in Oracle (except for BINARY_FLOAT and BINARY_DOUBLE) are aliases for NUMBER with specific precision and scale values.  For example, if you create a table with a column that has a datatype of INTEGER the column will really be NUMBER(38).

If a column is NUMBER without precision or scale, it will accept any value in the following list:

·         Positive numbers in the range 1 x 10-130 to 9.99…9 x 10125 with up to 38 significant digits

·         Negative numbers from -1 x 10-130 to 9.99…99 x 10125 with up to 38 significant digits

·         Zero

As you can see the NUMBER datatype without precision or scale is very flexible.  Unfortunately, it doesn’t move nicely to other relational databases.  This brings us back to the original issue.

If you move a table from Oracle to SQL Server, then for basic datatypes, you do the following:

  • CHAR(x) stays the same
  • VARCHAR2(x) becomes VARCHAR(x)
  • DATE becomes DATETIME (note: in SQL Server 2008 there’s now a DATE datatype, so you can leave these as DATE)
  • NUMBER(x) becomes NUMERIC(x)  **1
  • NUMBER(x,y) becomes NUMERIC(x,y)  **1
  • NUMBER becomes ???  **2

There are two important points to make here:

**1: The maximum precision in both Oracle and SQL Server is 38.  However, in SQL Server, the extreme size of a number is quite a lot less than Oracle.  SQL Server values can range from (-1 x 1038) + 1 to (1 x 1038) – 1.  Changing to FLOAT(53) will allow any Oracle NUMBER value outside the SQL Server NUMERIC range to be stored, but only with 15 digits of precision…  This lack of precision capability is why the FLOAT and REAL datatypes in SQL Server are referred to as “Approximate-number data types” in SQL Server Books Online.

Note: In SQL Server, REAL is just an alias for FLOAT(24).

**2: NUMBER (without precision or scale) is non-trivial.  If you have a column in SQL Server with a datatype of NUMERIC, what you actually get is NUMERIC(18,0).

This is largely unexpected, well it certainly was by me and the developers on the client site.  Of course, this turned out to be the cause of my original issue: when you load data into a NUMERIC(18,0) in SQL Server, any decimal digits are lost.  So, the data warehouse not balancing to the source system was caused by rounding errors caused by implicit conversion loading data from Oracle NUMBER columns into SQL Server NUMERIC columns (which were actually NUMERIC(18,0)).

Therefore, you can’t change NUMBER in Oracle to NUMERIC in SQL Server unless you know the NUMBER column in Oracle only contains smallish integer values.

What you need to do is look at the data in the column in Oracle and explicitly manage it into the best SQL Server option.  A couple of options are:

  • If the biggest value you have is 20 digits long in Oracle, I’d change NUMBER to be NUMERIC(38,10) in SQL Server.
  • If the most significant decimal digits you have is two in Oracle, I’d change NUMBER to be NUMERIC(38,2) in SQL Server.
  • And so on…

But you have to manage the rounding explicitly to make sure values always fit in the future.  And you need to balance future proofing a big value coming along with the default non-variable storage nature of the NUMERIC datatype in SQL Server too!

Out of interest, I had a look at the other two databases supported by WhereScape RED: IBM DB2 and Teradata.

The DB2 DECIMAL datatype and the SQL Server NUMERIC datatype are very similar to each other, with a few exceptions:

  1. The maximum precision is 31 in DB2
  2. DECIMAL without precision and scale gives you a DECIMAL(5,0) in DB2
  3. DECIMAL in DB2 can hold values from (-1 x 1031) + 1 to (1 x 1031) – 1
  4. DOUBLE can be used to hold bigger numbers
  5. DECIMAL storage works the same way as Oracle, so is variable by default

So, the answer is the same: moving from Oracle to DB2, you need to make the same judgement calls on the scale and precision to use.

There are even more issues moving to Teradata from Oracle.  In fact, you’re going to have issues moving to Teradata from Oracle, SQL Server and DB2, especially if you’re still using Teradata V2R6.x.

Specifically:

  1. The maximum precision is only 18 prior to Teradata 12.  From Teradata 12 onwards, this increases to 38.  Beware using precision over 18, as some built in database functions only work if the precision is less than or equal to 18.
  2. NUMERIC without precision and scale gives you a NUMERIC(5,0) in Teradata (same as DB2)
  3. NUMERIC in Teradata can hold values from (-1 x 1038) + 1 to (1 x 1038) – 1 from Teradata 12 onwards.  Before Teradata 12, this range is (-1 x 1018) + 1 to (1 x 1018) – 1.
  4. FLOAT can be used to hold bigger numbers
  5. In Teradata, NUMERIC storage is fixed storage, increasing exponentially in bands.  For example, a NUMERIC(19) uses twice the space as a NUMERIC(18), regardless of the value stored in the field.

WhereScape RED and SSAS OLAP Cubes 20 May ’09

WhereScape RED is not a cube development tool - whilst many people use WhereScape RED to build cubes over a star schema it does not replace Visual Studio BI Development Studio (BIDS), because it does not provide the same level of flexibility for cube development as a cubing tool such as BIDS.

WhereScape RED will very quickly build a SQL Server Analysis Services (SSAS) OLAP cube for you over a data warehouse – just drag the fact table over and ta-da you have a cube defined, a couple more clicks and it is deployed and processed. This is great for developers during the prototype phase of a star or subject area. It is of course also great for those that want to deploy cubes into production.

Often we get asked why WhereScape RED doesn’t support this feature or that in SSAS cubes. And to answer that we need to look at what we built WhereScape RED for, and how that is different from why Microsoft built BIDS - Microsoft’s cube development environment.

Straight off the bat – WhereScape RED will never offer the flexibility and functionality that BIDS offers for cube development. Why re-invent BIDS? The criteria we use to assess what features to include – will the feature help WhereScape RED users to build cubes quickly and efficiently and manage them easily as part of a data warehouse? Using these criteria we cater for 90%+ of the scenarios our customers’ meet – while always giving them the opportunity to use the full Microsoft development environment for more esoteric features.

Philosophically WhereScape RED builds cubes to provide a meta data layer and an aggregate layer over the top of a data warehouse. There are some features of SSAS that WhereScape therefore does not support – Analysis Services needs to support scenarios where there is no data warehouse, whereas we can always assume that a warehouse exists.

There are four basic options for using cubes with a data warehouse:

  1. Use cubes as an aggregate layer over an extensive data warehouse – this provides a good mix of the abilities of cubes and of a relational data warehouse. WhereScape RED will build and manage cube as part of the data warehouse.
  2. Use cubes extensively over a simple data warehouse. WhereScape RED can build base cubes which are then extended in BIDS. This allows WhereScape RED to integrate cube processing within the data warehouse workflow but allows the developer to use the flexibility of BIDS for cube development.
  3. Use cubes within a prototype to workshop the data warehouse but actually do not use them in production. Instead use WhereScape RED to build an aggregate layer in the data warehouse to simplify support and development.
  4. Use cubes directly over source systems with no data warehouse.

WhereScape RED provides immense value in all but one of these scenarios; scenario four, where there is no data warehouse, will not benefit from using WhereScape RED. WhereScape views the cubes as complementary to the data warehouse – great for exploring summarized data, but detailed reporting is more efficient in the data warehouse.

The ability to break complex processing into smaller steps for development and for troubleshooting has provided tremendous value for many organisations that have used WhereScape RED to build and manage an enterprise data warehouse. Cubes provide the capability for building extensive business rules and calculations using MDX within cubes – however the skills, complexity and support overhead becomes significant very quickly.

WhereScape RED supports building native cubes on SSAS 2000, SSAS 2005, and SSAS 2008 Analysis Services. These cubes are plain, vanilla cubes that can be opened with BIDS if they need to be edited or extended. WhereScape RED supports:

  • Multiple hierarchies per dimension
  • Attribute hierarchies (member properties in SSAS 2000)
  • Attribute relationships (SSAS 2005+)
  • Calculated members
  • KPIs (SSAS 2005+)
  • Linked cubes (SSAS 2005+ - Virtual cubes in SSAS 2000)
  • Cube partitioning
  • Processing options
  • Drill-through

WhereScape RED does not (or not yet) support:

  • Multi-fact cubes (this is coming, see below)
  • Perspectives (this is coming)
  • Translations (no plans, yet)
  • Aggregate design (can be added using BIDS, or through usage optimization).
  • Security (this is coming – can be added using BIDS or a script)

Value add – what WhereScape RED also brings:

  • WhereScape RED allows you to manage cubes as part of the data warehouse – not some additional technology that is managed separately. If you make a change to the data warehouse you can manage that change and its effects on the cube just using WhereScape RED.
  • Integrated documentation – WhereScape RED auto-generated documentation includes cubes.
  • Integrated migration – you can migrate cube meta data with the rest of the data warehouse.
  • Integrated workflow – you can process the cube within the same jobs that process the data warehouse.
  • Integrated versioning – if you want to reverse out a change or deleted something that was useful then WhereScape RED versioning can save the day.

One topic we get asked about is how WhereScape RED combines different facts / stars into a single cube. In a mixed data warehouse / cube environment there are two base options – integrate this data in the data warehouse and build a cube over the top or combine the data in the cube.

Currently WhereScape RED uses linked cubes to join together separate physical cubes, as opposed to building a single physical multi-fact (multi-measure group) cube. Linked cubes provide equivalent analytical functionality by displaying a single cube containing a superset of measures and dimensions. Multi-fact cubes are coming to WhereScape RED as an alternative option so that the developer has a choice. WhereScape RED will continue to offer linked cubes to combine multiple subject areas in the OLAP layer as they can provide performance benefits over multi-fact cubes. This is relevant when there are many fact tables or the fact tables are large.

Another question we get is how we support cubes building from a Teradata normalized data warehouse. This can be accomplished by utilizing views to provide a dimensional interface.

In summary, WhereScape RED makes managing cubes as part of a data warehouse easy. Where a cube only infrastructure is sufficient WhereScape RED will not provide all the bells and whistles that BIDS provides. If a data warehouse provides value to an organization then WhereScape RED will offer significant value to that organization over the entire lifecycle of that data warehouse.

Need for Speed? 27 Mar ’09

A client site of mine had a detail report that was running in BW over MaxDB that was taking 5-15 minutes to return a small selection of data.
The report was pretty much unusable which frustrated key users in the business as the subject area was of key interest to them.
As a proof of concept I combined my business knowledge and SQL skills with Doug’s Reporting Services knowledge and SQL skills and recreated the very slow report.
The source fact table has 200 million rows in it and is partitioned by month.
Without any additional performance improving indexes the report now runs in 2-4 seconds!

Quickstart Oracle Financials data marts in SQL Server 12 Mar ’09

This is just a quick blog to remind you that WhereScape have pre-built data marts built for several modules of Oracle Financials. These quickstarts allow us to install, tweak (people have a habit of customising these applications) and process data marts built to analyse Oracle Financials information (eg GL or AP) very quickly, within a day or so per module. Once in a generic dimensional data mart the data can be reported and analysed using a wide range of user friendly tools (eg BO, Cognos, etc). That frees people from the built-in reporting and provides the ability to integrate this data with other analysis areas.

Just to add some spice to the story - this week we had a request to install these Oracle Financials quickstarts into a SQL Server data warehouse. This is a new twist - normally its Oracle Financials into an Oracle data warehouse. But by using the database conversion functionality in RED (described here http://blog.wherescape.com/technology/migrating-between-rdbms-platforms/) a consultant has delivered 2 main areas (AP and GL) in SQL Server with 2.5 days of effort. In addition they added Analysis Services OLAP cubes over the top of the star schemas (because RED is good at that too) so that the accountants can slice and dice the data directly from Excel. A very quick way to make friends with Accounting! Now they can also analyse how little the solution cost them!

Multiple Databases and Teradata Grants 30 Jan ’09

In the simplest kind of data warehouse environment, the WhereScape RED metadata and all data warehouse tables exist in a single database or user.  A Teradata user is preferred (over a database) as the metadata includes some procedures that use dynamic SQL.  Procedures containing dynamic SQL may only be compiled under a user in Teradata.

For the simplest kind of environment, a user called dssdemo can be created and granted sufficient privileges to get started with WhereScape RED using this SQL:

CREATE USER dssdemo FROM dbc AS
PERMANENT = 100000000, SPOOL = 100000000, TEMPORARY = 100000000
,PASSWORD = wsl;
GRANT ALL ON dssdemo TO dssdemo WITH GRANT OPTION;
GRANT SELECT ON DBC.TVM TO dssdemo;
GRANT SELECT ON DBC.Dbase TO dssdemo;
GRANT SELECT ON DBC.ErrorMsgs TO dssdemo WITH GRANT OPTION;

The grants in this example may be considered excessive, but are simple to achieve.  The worst case scenario results in the user being dropped by another user they’ve on-granted some privilege to.

In a more complex example, a Database called Test_Grants owns the development data warehouse.  There are two developers, called John and Mary.  Test_Grants has three users:
- john
- mary
- metadata

The users john and mary are used by John and Mary to log into Teradata using WhereScape RED.  The user metadata owns the WhereScape RED metadata and all compile stored procedures generated by WhereScape RED.  The WhereScape metadata is installed and upgraded whilst signed in as metadata.

John has four databases for storing tables and views:
- john_load
- john_stage
- john_model
- john_view

Similarly, Mary has four databases for storing tables and views:
- mary_load
- mary_stage
- mary_model
- mary_view

Here is the database/user structure:

Once created, the metadata user has the following grants on itself:

These are insufficient for WhereScape RED and need to be changed to:

Note: a tick indicates a privilege has been granted, a green circle indicates a privilege has been granted WITH GRANT OPTION.

Each developer and the metadata user also require the following grants:

1. GRANT SELECT ON DBC.TVM
2. GRANT SELECT ON DBC.Dbase

The metadata user also required the following grant:
3. GRANT SELECT ON DBC.ErrorMsgs WITH GRANT OPTION

The first two grants are required by WhereScape RED to compare physical tables and metadata definitions when Validate against database is used.  In all cases, WhereScape RED includes LOCK ROW FOR ACCESS in SELECTs from these two tables.

The third grant is given to all users by default without the WITH GRANT OPTION.  WhereScape RED generated stored procedures use the DBC view ErrorMsgs to look up error message details.  In order for the executer of stored procedures be able to query DBC.ErrorMsgs, the metadata user must be able to on-grant select on DBC.ErrorMsgs.

As mentioned above, metadata installation and upgrade (validation) must be done as the metadata user.  For the ODBC connection DemoTDAT (in this case also with a TDPID of DemoTDAT), the following information is entered to install the WhereScape RED metadata:

The developers (John and Mary) also need some grants.  These examples work thru the details for the john user.  Everything also needs to be repeated for other developers.

The john user needs to be able to:
1. Read, write and delete data in the WhereScape RED metadata tables (in the metadata user).

2. Create, replace and drop stored procedures in the metadata user.
3. Create and drop tables in his four development databases.
4. Read, write and delete data in his four development databases (directly and via stored procedures compiled in the user metadata).

Grants to enable “Read, write and delete data in the WhereScape RED metadata tables” and “create, replace and drop stored procedures in the metadata user” (grants from metadata to john):

Grants to enable john to create and drop tables in his four development databases and to directly read, write and delete data in his four development databases:

Grants to enable john to read, write and delete data in his four development databases via stored procedures compiled in the user metadata:

Grants to enable john to create and drop views in the john_view development database and to directly read, write and delete data in his model database (john_model):

Note: It may be advisable to use database roles rather than direct grants.

Once these privileges have been added, sign into WhereScape RED as john, with the Meta Database set to metadata:

If required, create projects for each developer:

Then ensure each table or view is created in the correct database:

 

This approach can be expanded to the most complex scenarios, including multiple staging databases, etc.

Hardcopy of WhereScape RED user guide 8 Jan ’09

WhereScape RED ships with a pdf of the user guide, but it is also available in hardcopy from lulu.com.  The SQL Server/DB2/Oracle version can be ordered directly from http://www.lulu.com/content/4586805.

Schema Support in SQL Server 2005+ 8 Jan ’09

Schemas were introduced into SQL 2005 to provide security containers for objects (eg tables and views). They allow DBAs to manage security at the schema level - any objects created in a schema inherits the schema’s security properties.

Just as usefully schemas also provide a nice meta-data container for objects so that you can use a schema to group objects together - for example to relate a set of objects together relating to a specific department or function.

WhereScape RED creates and manages all of its objects in the dbo schema - this ensures that no issues (naming or security) are introduced when managing or updating the database objects due to schemas. In order to make use of schemas for security and grouping over user tables and views WhereScape consultants have released a procedure that manages synonyms within schemas to point back to the dbo objects. This means that users can be shut out of the dbo schema and restricted to see only the objects within the schemas that they have access to.

For example we can now have a project called Sales (in the Schema project group) that contains fact_sales, dim_date and dim_customer. Running this procedure will create a Sales schema in SQL, synonyms in that schema pointing to fact_sales, dim_date and dim_customer, and optionally a Sales SQL Server user limited to the Sales schema. As with projects the same object can happily exist in more than one schema.

The procedure (admin_schema_maintenance) is available by emailing support. When it is run or processed within RED it manages schemas and synonyms (and optionally users to access those schemas) based on RED’s projects. The script works by creating SQL Server schemas with the same name as RED Projects that sit under a Project Group called Schema (you need to create this group and projects). The procedure will create synonyms for any Dimensions, Views or Fact tables that exist in these projects in those schemas. The following parameters control procedure options:
SCHEMA_MAINTAIN_SCHEMAS - 0 or 1 will drop any existing schemas (except dbo) and recreate them
SCHEMA_MAINTAIN_USERS - 0 or 1 will drop and recreate users with access only to their schema
SCHEMA_PRINT_ONLY - 0 or 1 will only print the generated SQL not execute it

The procedure can be added to RED by creating a new Procedure called admin_schema_maintenance and then pasting in the procedure code. The procedure is not part of RED and can be added, changed, run and scheduled as required. We would recommend that the script be tested in your environment prior to running in production.

NOTE: Do not forget that synonyms are dependent objects in SQL Server - if they interfere with development then drop the synonyms for those objects being recreated - or drop all synonyms during development. To do this:

  • An easy way of removing synonyms for a single object is to move that object out of the Schema project(s) and rerun the procedure.
  • A simple way of removing all synonyms quickly is to rename the Schema project group (eg Schema_Off) and rerun the procedure.

Considerations should be made for migrating objects to production. It may be easier to drop all synonyms in production first by renaming the Schema project (as above), run procedure, migrate objects over, rename Group back to Schema and rerun procedure.

The Power of Parameters 19 Nov ’08

When first starting to use WhereScape RED many years ago, I briefly looked at parameter functionality and thought “I am sure that will come in handy some day”. 

Now the RED parameter is a fundamental building block in the daily operation of all the warehouses I have built.

A parameter in RED can be called anything you like and can contain a static value that never changes after the implementation of the warehouse or more commonly regularly incremented/updated values.

A set of very useful built in RED functions allow parameters to be read and updated at any time which are called WsParameterRead and WsParameterWrite.

Examples of how I have used each of these in SQL Server are:

– find out the value of the parameter ‘DAILY_RUN_IN_PROGRESS_FLAG’ SELECT @v_run_in_progress_flag  = dbo.WsParameterReadF(’DAILY_RUN_IN_PROGRESS_FLAG’)

– set the value of parameter ‘DAILY_RUN_IN_PROGRESS_FLAG’ to ‘Y’ with comments EXEC @v_return_code = WsParameterWrite DAILY_RUN_IN_PROGRESS_FLAG,’Y',’Place our paramter update comments here’

There are there also two other “special” ways to access your RED parameters in addition to the functions above which are:

1. Load table definitions

A RED parameter can be used in the WHERE clause of a load table by simply wrapping “$P” and “$” around the parameter you wish to use (you must add single quotes as shown in example for non numeric comparisons) in the source mapping tab in the load table properties screen e.g. 

    WHERE DATE_ADD(’1970-01-01′, INTERVAL Load_Table_Name.Load_Table_Date_In_Seconds_Column) SECOND) >= ‘$PLAST_RUN_DATE_GMT$’

    AND DATE_ADD(’1970-01-01′, INTERVAL Load_Table_Name.Load_Table_Date_In_Seconds_Column) SECOND) <= PCURRENT_RUN_DATE_GMT$’ 

At run time the RED parameter value will be substituted and I recommend storing date parameters in a format that works (without transformation) with the database you are using. 

In the SQL Server example above I set the parameter to the format “YYYY-MM-DD HH24:MM:SS” e.g. 2008-02-27 10:59:56 

2. Staging table definitions 

When generating RED procedures a popup screen appears asking if you need to use any parameters in the SQL statement in the procedure.

By selecting parameters you want to use, additional code is automatically added to the beginning of the generated procedure setting the parameter values into procedure local variables: 

    SELECT @v_current_run_date  = dbo.WsParameterReadF(’CURRENT_RUN_DATE’)

    SELECT @v_last_run_date     = dbo.WsParameterReadF(’LAST_RUN_DATE’)

    SELECT @v_unknown_date      = dbo.WsParameterReadF(’UNKNOWN_DATE’)

These local variables can then be referenced in the where clause of the main SQL statement using the variable name generated in RED: 

    WHERE DATEADD(hh,2,stage_vehicle.StartDate + stage_vehicle.StartTime) >= last_run_date

    AND stage_vehicle.StartDate + stage_vehicle.StartTime <= @v_current_run_date 

Staging table column definitions can also use the local variable names for the parameters: 

ISNULL(stage_Vehicle.StartDate,CONVERT(DATETIME,@v_unknown_date,103))StartDate 

Using RED parameters in the WHERE clauses of load and staging tables is a very effective way of restricting large volumes of transactional data.

I can highly recommend the checking/setting of parameters at the start and end of a scheduled job to easily manage incremental/delta data loads as part of a well designed data warehouse.

To check if a load table is used in RED 9 Oct ’08

Data warehouses can be complex, and sometimes it is necessary to check whether a load table is actually used in one of the end star schemas.

We can use the track forward diagram, but this won’t pick up if a column has been mapped manually in a transform.
To catch this, do a search from Tools -> Search for string

Dimension Types: What are the different dimension types and how they are implemented in WhereScape RED? 22 Aug ’08

RED operates with dimension types that are recognized by modern data warehouse theory. When a developer starts working on dimension tables he gets to chose from four options for the default generation of the dimension table and its update procedure via the following dialog box: 

 

A Normal Dimension is where a dimension record is updated and changed whenever any of the non-business key information changes.   This is commonly called a type 1 dimension and is used when historical values of the change are not kept. 

Overwriting the old values.  A Normal dimension type is the most common dimension type. RED auto-generates a stored procedure to handle update and insert paths. Consider this simple example. The “Product” table contains a product named, “Thingamabob” with Product ID being the primary key. In the year 2007, the price of Thingamabob was $150 and over the time, the product’s price changes from $150 to $350. 

In the year 2007, if the price of the product changes to $250, then the old values of the columns “Year” and “Product Price” have to be updated and replaced with the new values. In this Normal Type, there is no way to find out the old value of the product “Thingamabob” in year 2006 since the table now contains only the new price and year information. 

The logic that WhereScape RED applies to the stored procedure handling the Normal dimension type works as follows:

- The stored procedure gets all unknown records from the dim table. If an unknown dim record does not exist, then the procedure allows an explicit value to be inserted into IDENTITY field

- The procedure loops through all the input records, attempting to update the dimension records based on the business key value. If the update attempt fails, then the procedure performs an insert. 

A Slowly Changing dimension is where new dimension records are created when certain identified columns in the dimension change.   This is commonly called a type 2 dimension and tracks point in time historical changes. 

Creating another added record. In this type, the old values will not be updated but a new row containing the new values will be inserted in the product table. Consequently, at any point in time, the difference between the old values and new values can be retrieved and easily compared. This can be very useful for reporting purposes. 

The problem with the above mentioned data structure is the “Product ID” column cannot store duplicate values for “Thingamabob” since “Product ID” is the business /primary key. Also, the current data structure doesn’t clearly specify the effective date and expiration date of the product when the change to its price happened. So, it would be better to change the current data structure to overcome the above primary key violation. 

 

In the changed Product table’s Data structure, “Product ID” and “Effective Datetime” are now a composite business / primary key. Therefore, there will be no violation of the primary key constraint. Addition of the new columns, “Effective Datetime” and “Expiration Datetime” provides the information about the Thingamabob’s Effective date and Expiration date adding more clarity and noticeably improving the scope of this table. The slowly changing dimension (SCD), type 2 approach, may need added space in the database, since for every changed record; an extra row has to be stored. Since dimensions are usually not that big in the real world, extra space is negligible. 

RED performs the following additional tasks in the stored procedures to handle Slowly Changing Dimensions (SCD), type 2 dimensions.

- Three new special variables are added: insert indicator, version indicator and version number.

- Four extra columns are added to support slowly changing dimensions: current_flag; version; start_date; end_date.

- RED automatically generates an alter table SQL script, based on type. It also adds a unique, business key index to support the slowly changing columns.

- In the update stored procedure loop, if the record is not found, the procedure inserts a new record and sets the insert flag on. Otherwise, the procedure checks to see if we have an existing dimension row and have just had a field change for data that is not part of the slowly changing data that we are interested in and performs an update. If there is an existing dimension row and it verifies that any of slowly-changing fields have changed. The procedure then updates the version indicator and version number, and inserts a new row. 

A Previous Values dimension commonly called a type 3 dimension, allows the storing of the last value of selected fields in secondary columns.

Creating new fields. In this dimension type, the latest update to the changed values can be seen. The example shown below illustrates how to add new columns and keep track of the changes. From the example, we can see the current price and the previous price of the product, “Thingamabob”. 

The problem with the Previous Values approach is over time if the product price has multiple changes; the complete history may not be stored, only the latest change will be stored. For instance, in year 2008, if the Thingamabob’s price changes to $350, then we would not be able to see the complete history of 2007 prices, since the old values would have been updated with 2008 product information. 

- RED guides you through the process of adding secondary columns. 

A Date Ranged dimension supports source systems that provide date ranged information. It rolls over time based on specified columns and date range and the history is kept. Otherwise, it looks just like the Slowly Changing dimension type. The difference in that the date range dimension is valid for a given range or set of dates on specified columns. 

The following example might be appropriate here.  Thingamabobs are on sale for the date range 1/1/08 – 1/31/08 at $150 and on sale for the date range 2/1/08 - 2/29/08 for $200.  So the dimensional records would be for 1/1/08…1/31/08, 2/1/08…2/29/08 and 3/1/08 to present each range would be represented by a date range dimension key and the product on sale for that dimension key could be reported on.   

- RED will guide you through the process of identifying which column(s) will be date ranged columns, i.e. “from” datetime column followed by the “until” datetime column. E.g. such “from” column could be “Effective Datetime” and “until” column - Expiration Datetime  

Note: RED is using slightly different naming convention.

The above table will look like this:

(*)          = (Business Key)

(**)       = (Primary Artificial Key)

(+)          = (Slowly Changing Business Key) 

A Get Dimension Key Stored Procedure 

The ‘Get Key’ stored procedure is auto-generated by RED to return an artificial key when a business key is supplied. The normal syntax is to call the function (usually from update_stage_% stored procedure) passing the business key and be returned a status indicating the result of the lookup.  

For example:

EXEC   @v_return_status = get_dim_product_key 

      @v_stage_product     

     ,’N’ — Auto add flag     

     ,’N’ — Write error trail entry if not found flag     

     ,@p_sequence

     ,@p_job_name

     ,@p_task_name

     ,@p_job_id

     ,@p_task_id     

     ,@v_stage_product_key OUTPUT     

     ,@v_getkey_status OUTPUT

On a successful lookup, the artificial key for the dimension record is also returned. If the lookup fails because the business key is not found in the dimension then an appropriate status is returned and the artificial key is set to 0. 

The main logic as follows:

- Lookup the key: If dss_end_date is NULL, then we are just looking up the current record. If not null then we are trying to match to a specific slowly changing record, so look through all of the records until finding the appropriate one.

- The procedure then loops through all the old keys looking for a match on the date. If key is found then the procedure sets the key and the flag.

- If no match found then the procedure gets the current version.

- If a dimension key is not found, then the procedure will add it if we have the auto_add flag set.

- If requested (auto add log message), the procedure will log a message about the auto add and log a message about the lookup failure.

Impact Analysis: What can fizzle while applying changes on the database and how to sustain a healthy, evolving data warehouse? 10 Jun ’08

Every action in the information database and warehouse in particular has a consequence. Database developers, business analysts and DBAs  feel agonizing pain when an organization makes changes to their data warehouse applications without having complete insight into the true impact and cost that a modification will have.

So, what is an impact analysis?  It is an identification of the consequences of change on an object to its related objects. Impact analysis, in case of a data warehouse, consists of:

• understanding of the processes performed within a data warehouse as a whole and its components;
• comprehension of interdependencies between processes and objects;
• assessing the effect the change brings to the data warehouse structure;
• developing recovery procedures in case of  a failed modification.

WhereScape RED provides a variety of needed mechanisms to identify a data warehouse’s weak spot when it comes to a change.

In the development process we have put our best effort into creating a perfect addition to our data warehouse. Now it’s time to deploy our changes to a higher environment. But how can we do so without crashing things that we had developed previously? How do we insure that the new modification will not destroy the integrity of the data warehouse instead of improving it? With WhereScape RED it’s easy.

Here are a few recommendations on how to avoid oopses and ouches:

1. Structural Impact Analysis.  i.e.  what objects have been changed.
Click “Tools” >  Click “Search for string…” > Fill out dialog box for string (e.g. either enter the procedure name being modified or the column being changed) > examine “Object search results” screen for all objects that contain this string.

For the structural-impact analysis you can use the Browser screen for “DataWarehouse” connection.
Click “Invoke SQL Admin” > execute something like this (SQL Server):

select distinct(a.name) from sys.procedures a inner join sys.syscomments b on a.object_id=b.id
where charindex(‘stored procedure or column’,b.text)<>0;
– Check load where clause too
select distinct(a.lt_table_name)from ws_load_tab a
where charindex(‘stored procedure or column’,a.lt_where_clause)<>0;

2. Evaluation of the Impact. i.e. what possible impact the data warehouse’s alteration renders on various data warehouse objects. WhereScape uses an array of pre-defined reports. Click the “Reports” tab from the main Browser menu. The list of available reports expands from release to release. There are about two dozen reports that are directly related to Impact Analysis. Some of them are:
•  “Object modification dates (without indexes)” with date created and date modified, same for “Index modification dates”;
• “Modified procedures” i.e. procedures that have been modified since creation;
• “Table objects that have no associated procedure or script” and “Procedures that are not related to a table”;
• “All objects that have been Refreshed or Imported”;
• “Differences between current and selected repository…”;
• “Query data warehouse objects… where custom validations scripts can be run”;
• “Track back on specified column usage…” and many others.

3. Impact during the Promotion from one environment to the other. e.g.  from DEV to TEST,  when managing such deployments using the “Build Application tables…” built-in utility. While the developer plunges through the wizard-like dialog boxes, WhereScape RED continuously warns him/her of the possible implications of this activity. E.g. if the particular object is to be overwritten, warning message pops up to alert the developer on the possible impact on the existing object.

4. Organizational Documentation Update. The considerable problem for a conventional data warehousing project is that database changes within data warehouse are not reflected in documentation: nobody has time to update them. WhereScape RED automatically generates a new documentation set after each iteration of data warehousing development.

5. Versioning of Metadata Objects. Several kinds of version management are utilized in WhereScape RED. For example, choosing the “Automated Version Creation” option from the Tools menu will thwart a developer from losing valuable DDL scripts, procedural code and other metadata and database information.

6. Standardization and Best Practice Usage.    WhereScape RED uses a state-of-the-art methodology in building warehousing prototypes. Following the developer’s guidelines delineated by the WhereScape RED methodology, wizards and the automatic generation of database objects, in combination with an enterprise standard “look and feel” dramatically decreases the chances of glitches in the data warehouse processing.

Loading Data Using RED 5 Jun ’08

The are several different ways to load data from files and source systems into the data warehouse load table using WhereScape RED.  Each load table type has different attributes, options and performance advantages in different situations.

RED automates loading data from files by either parsing files into columns as they’re loaded or by loading files into single column tables.  The following kinds of flat files:

  • fixed width
  • delimited
  • XML

The default is to have RED manage the code for loading file.  This is called a file load.  You may decide to generated a script and manage this yourself.  This is called a script load.

When load fixed width and delimited files, RED uses the load utilities in the target data warehouse database environment.  For SQL Server this is BULK insert.  In an Oracle data warehouse, SQL*Loader is used.  Teradata has several different loaders and RED can utilize fastload, multiload and TPT.  DB2 uses the LOAD statement.

All options available for these loaders are able to be used by RED.

Two options are available for XML.  The default is to allow RED to read and load the XML file itself.  Another option available for SQL Server data warehouses.  This involves using the XML OLEDB provider for SQL Server via a Script load.

RED also automates extracting and loading data from other databases into the data warehouse.  Again, there are several approaches available:

  • ODBC loads
  • Native ODBC loads
  • Database Link loads via a database link (Oracle warehouses only)
  • Database Link loads via a linked server (SQL Server warehouses only)
  • Externally Loaded load tables

For large volume loads from another database into the data warehouse, the fastest option is almost always the Native ODBC load.  The exceptions for this rule are:

  • In an Oracle data warehouse loading data from another oracle database, a database link load may be faster if one of the following is the case:
    • the two databases are connected via a really fast network
    • the two databases on the same physical server

For smaller volumes, you should always use database link loads for SQL Server->SQL Server and Oracle->Oracle load tables.  Similarly, for small volumes, loading data from a different kind of database, it may be better to use ODBC loads rather than Native ODBC loads.

The final option RED provides is Externally loaded load tables.  These provide integration with existing ETL or EAI infrastructure.  RED knows about the load table in its metadata and can build downstream objects from the load table.  RED does not load the load table.  It waits for the other system to load it.  Job integration is achieved using a combination of RED and the other products APIs.

Refer to the WhereScape RED User Guide for full information on all of the load types discussed.

Multiple Source Systems – One Set of Load Tables, How Does it Work? 23 May ’08

Consider an environment where a source system is duplicated across 3 sites. Site A, Site B and Site C. There may be 400 load tables that require loading from each of these duplicated source systems, in practical terms they are the same tables but in different implementations of the same application. When building WhereScape objects to load data from these tables it is preferable to build the one set of load objects that provide data to the one set of dimension, stage and fact processes. All objects can then be used regardless of the source system providing the actual data. The key to resolving this problem is the use of a WhereScape RED callable procedure called Ws_Connect_Replace. This procedure allows the caller to replace the contents of a connection from another connection.

For example, four connections are created:

Connection Site A - Stores all connection details for Site A 

Connection Site B - Stores all connection details for Site B  

Connection Site C - Stores all connection details for Site C 

Connection Site Generic - When called Ws_Connect_Replace swaps A, B, or C connection details into this connection

All 400 load tables use Connection Site Generic as the connection defined in the metadata.

Prior to loading the data from the load tables the procedure Ws_Connect_Replace is called swapping site specific connection details into the generic connection. Processing can then continue in the knowledge that the correct data is being loaded.

To automate the processing of each site you can use a custom procedure at the start of a job that contains the call to Ws_Connect_Replace along side some parameter controls. In this way you also have only one set of job metadata to maintain.

If your requirements are smaller and only involve one load table that requires source system or schema swapping, look at the WhereScape callable procedure Ws_Load_Change. This procedure allows the caller to change either the connection or the schema of a load table.

Migrating Between RDBMS Platforms 11 Apr ’08

A powerful feature within WhereScape RED is the ability to migrate a metadata repository from one RDBMS platform to another.  This will allow you to take an existing data warehouse environment on one database, such as SQL Server, and move it to another like Oracle.  While rarely used, this functionality is of immense value should you need to move RDBMS platforms at any stage after the data warehouse has been created.  

The following steps can be performed to move an existing WhereScape RED metadata repository from one RDBMS platform to another.  Note: this only moves the metadata structures and content, it does not move the data contained in the data warehouse. 

Prior to performing the metadata migration you may wish to perform the following tasks to ease the process;

  1. Upgrade to the latest version of WhereScape RED, to ensure you have the latest functionality. 
  2. Archive the Scheduler Audit and Detail log tables if you do not need to bring across this information.  The archive process moves the records from the audit trail log to an audit trail archive table.  This will reduce the amount of data that needs to be imported into the target metadata repository.  To perform the archive, go to the Scheduler tab within WhereScape RED and choose from the menu:  Logs à Archive the audit and detail logs.

In the example below we are moving from SQL Server to Oracle;

  1. Create an Oracle metadata repository via WhereScape Setup Administrator tool, using the same version as the SQL Server repository you are going to migrate.
  2. From the SQL Server repository, unload the metadata.  This can be done within WhereScape RED using menu option:  Backup à Unload the metadata to disk.
  3. Connect to the Oracle metadata repository using WhereScape RED.  Load the SQL Server metadata unload, ignoring the warning about the database being different.  This can be done using menu option:  Backup à Load the metadata from disk.  This process can take a long time for a large metadata environment.
  4. Verify column data types have been converted.  WhereScape RED will convert most data types for you during the load process.  However, there may be some data types that still need to be manually adjusted.  This can be easily verified by creating all of the tables (Load, Dimension, Stage, Fact, and Aggregate).  Those tables with bad data types will fail during table creation.  If any data types need to be manually converted, please inform WhereScape Support and they will ensure that future releases of WhereScape RED accommodate those data types.
  5. Change all transformations to Oracle supported functions.  Common transformations will be OK.  However, RDBMS specific transformations will need to be converted to the appropriate syntax on the target RDBMS.
  6. Regenerate or Rebuild all unmodified procedures. 
    • If you created simple join statement in SQL Server using ANSI join syntax, this will not translate in an Oracle environment.  You will need to rebuild the join syntax in Oracle via the wizard.
    • Outer join syntax will need to be rebuilt as the SQL syntax is different.
  7. Custom procedures will need to be recreated using the appropriate procedural language. eg: PL/SQL
  8. Change Load Table where clauses if required.
  9. Change View where clauses if required.
  10. Change Connection details if required.
  11. Change Load Table connections (for file and script loads) to UNIX if a UNIX scheduler is being used.
  12. Change Script connections (for script loads) to UNIX if a UNIX scheduler is being used.
  13. Regenerate any Load scripts.

While not a common situation, it shows the flexibility of WhereScape RED and the advantage of using a tool to generate the bulk of the underlying code.  What can be achieved in a very short time using WhereScape would be a substantial undertaking, if not impossible, in any hand-coded or proprietary environment.