Archive: Technology

How to Print Out the Entire RED Documentation 20 Dec ’11

In order to be able to print out the complete RED documentation set, it is required to create a script that merges all the RED Documentation’s HTML files into one HTML file. The description is as follows:

  • Put the script_html_merge.bat script into RED documentation directory.
  • Double click on it to execute manually; it will create doc.html merged file. The execution of the script_html_merge.bat script can also be scheduled via RED Scheduler.
  • Download http://sourceforge.net/projects/pdfcreator/ PDF creator a free PDF printer for Windows. If desired, the other similar product can be used.
  • After the installation, it will create PDF Creator printer.
  • Open doc.html and print it to PDF Creator – it will create PDF file.
  • For PDF creation, you can use any other tool allowing printing into PDF or convert HTML into PDF; I used PDF Creator simply because it is free and it does a good job.
  • The code is two lines really, see attached code below

Note: Obviously, you can print out one doc at a time by Right Click on the doc and choose Print from dropdown menu

 

REM  *****************************************************************************

REM  Script Name    :    script_html_merge

REM  Description     :    Merges all the RED Documentation HTML files into one HTML file

REM  Generated by  :    WhereScape RED, manually

REM  Generated for :    WhereScape Customer

REM  Author              :    Raphael Klebanov

REM  *****************************************************************************

 

echo OFF

copy index.html doc.html

REM FOR %%G IN (dir /B w*.html) DO (copy doc.html+%%G d && copy d doc.html && del d)

FOR %%G IN (dir w*tech.html w*user.html ) DO (copy doc.html+%%G d  && copy d doc.html && del d)

 

REM this line will put the “tech” files first, then “user” files

 

REM  *****************************************************************************

REM  Notes

REM  *****************************************************************************

REM  1. “FOR %%G”  means for (each item)

REM  2. “/B” switch enabling the batch file to quit with a return code. “/B” option can be removed. I do not see any issues.

REM  3. DO (…&&…) means run the command following && only if the command preceding the symbol is successful.

REM  4. Small “d” is a temporary file used as a temporary copy.

REM  5. If the list of the files always contains the same files, then the remaining files can be added before or after w*, e.g. glossary.html

REM  6. To prevent cut off the images, configure the PDF Printer, e.g., use landscape instead of portrait.

REM  7. You can modify list of HTMLs that are processed within DO statement to include/exclude the doc files

 

Good luck!

3D Speeds Up Analysis 23 Nov ’11

The other day I set out to build a small data mart combining support data and development data from two database sources and several other sources.

I needed to quickly find out which tables in the two database sources I needed to use.

My development system has 104 tables and my support system 244 tables.

So I pointed WhereScape 3D at both sources.

Very quickly I had:
- complete ERDs of both sources
- profiling results for all tables and columns
- identified the 12 tables in the develpment system I needed
- identified the 13 tables in the support system I needed

Less than 30 minutes after starting, I was ready to build my data mart using WhereScape RED.

I’m sure it would have taken me at least a day to figure out both source systems without 3D.

Here’s the ERD I ended up with for the subset of the develpment system I needed:

And here’s the one for my subset of the support source system:

RED’s Supported Platforms and Databases after 6.5.5 23 Nov ’11

The current release of WhereScape RED, version 6.5.5, will be the last release of RED supporting the following operating systems:

· Windows 2000
· Windows XP before Service Pack 3
· Windows 2003 Release 1

It will also be the last release of WhereScape RED supporting the following databases versions for metadata repositories and target data warehouses:

· SQL Server 2000
· Oracle 8.1.x and 9.0.x
· DB2 9.1 and 9.5
· Teradata V2R5.x and V2R6.x

This will make the minimum supported Windows version for RED:

· Workstation: Windows XP SP3
· Server: Windows 2003 R2

And the minimum supported database versions for RED metadata repositories and target data warehouses:

· SQL Server 2005
· Oracle 9.2
· DB2 9.7
· Teradata 12

As is the case now, source systems using older versions of these databases and any other databases will still be accessible using RED.

Keyboard Productivity – WhereScape RED shortcut keys 10 Nov ’11

Every time you take your hand off the keyboard, reach for you mouse, locate the cursor, move the cursor where you need it, click, move again, and click; you are wasting precious seconds. Precious, precious seconds, which over the course of a day will add up to minutes. And which over the course of a week, will add up to hours. Hours of time lost, per week, moving your hand back and forth.

The trick to saving time and increasing your productivity is to mitigate those seconds lost by keeping your fingers engaged with the keyboard and the task at hand. It’s a little known fact that almost everything you do with your mouse in WhereScape RED, can in fact be done instead using keyboard shortcuts.

There are many shortcuts which will speed up your development time, directly improving your projects delivery time, whilst making you look like a super star. Don’t be intimidated though; there are a lot of shortcuts out there. For best results, print off the accompanying cheat sheet and pin it to your monitor. Try to identify just a couple of functions you use often and begin incorporating them into your everyday routine. Then next week, pick up a couple more, incorporating them as well. Before you know it, almost everything you do will be via shortcuts and you may as well unplug your mouse and save your business the power usage as well.

Download the Keyboard Productivity cheatsheet from our marketplace page – http://www.wherescape.com/support/marketplace/keyboard-productivity/

Comparative Analysis 3NF vs. DV 27 Sep ’11

Purpose

A number of projects I have been working on recently have benefited from the use of Data Vaults. The success has led to questions regarding Data Vaults and their application to Enterprise Data Warehousing (EDW).

This comparative analysis has been created to assist customers with making decisions regarding the design approach for an EDW Project.  It compares and contrasts two different approaches: Third Normal Form (3NF), historically a very popular option for EDW and Data Vaults, a purpose-designed for EDW modeling technique. There is no “right” answer, both options have advantages and drawbacks, and the final decision should be based on an organization’s unique circumstances. WhereScape professional services are experienced in both techniques, and are available to make recommendations where required.

 

Third Normal Form (3NF) model

Definition of 3NF:

  • A database, in which each attribute in the relationship, is a fact about a key, the whole key and nothing but the key. 3NF usually refers to a fully normalized structure where information is stored in 3rd Normal Form (E. F. Codd’s 3NF).
  • A 3NF structure is used for the EDW popularized in W. H. Inmon’s Corporate Information Factory (CIF).
  • The generally accepted goal is that a company has one centralized EDW – Data Marts and other Analytical Business Structures are fed from the EDW.
  • The data warehouse is stored using Database Normalization rules in 3NF – tables are structured into subject areas.
  • 3NF models are used for Operational Data Stores as well as Enterprise Data Warehouses.
  • 3NF structures are not recommended for queries and reports.
  • 3NF databases offer performance and stability for Online Transaction Processing Applications (OLTP).

 

Data Vault Model

 

Definition of Data Vault:

  • The Data Vault (DV) has a detail oriented, historical tracking, and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach designed to encompass the best of breed between third normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise.
  • A DV is intended to address the business challenges that DW practitioners meet. It is designed to avoid or minimize issues related to Dimensional Model and 3NF methodologies.
  • DV Modeling is an EDW design methodology that provides historical storage from multiple sources with complete tracking back to system of origin.
  • This method can be adapted to changes in the business environment.
  • The Data Vault is organized around existing business keys.

  

Comparison between 3NF and Data Vault for Enterprise Data Warehousing

 

Category 3NF DV Notes
Initial Purpose OLTP EDW The DV link structure allows flexibility in handling relationships between the Business Keys (M:M, M:1, 1:M, 1:1, etc.) without changing the model structure. In 3NF, relationships between Business Keys in the same table are “solidified” to, for example, 1:M. When past or future data requires different relationships (e.g. new set of data requires M:M instead of 1:M), the whole structure has to be changed in 3NF, while a DV supports this case in Link table.
Number of Objects Less More The number of objects in a DV vs. 3NF is approximately a 3:1 ratio at the beginning of the EDW build and approx. 2:1 for mature warehouses. Objects in DVs are generally smaller and easier to confederate.
 Code Complexity Higher Lower Although code is generated by WhereScape RED in both cases, the DV code makes it slightly less complex to follow and document, because the objects are smaller and have specific meaning.
Speed of Processing More to Same Less to Same DVs allow a high level of parallelism, therefore the total time to complete the DW loading, including Data Mart layer, is often lower than it would be for 3NF. DVs generally offer more stability and performance.
Number of Joins Less More DVs utilize more joins; however, this does not decrease throughput of the DW process due to high parallelism and effectiveness of the foreign key (FK) joins.
Number of Indexes Less to Same More to Same The DV utilizes more indexes, because there are more tables, however most additional indexes are on the PK/FK. The number of indexes may be equal for mature DWs because of the need to index non-key columns in 3NF in order to support the variety of queries against attributes in the tables.
Adaptability to Changes Medium High Adaptability is necessary in case there are changes in the business, e.g. adding a new data source of changing analytical requirements. 3NF structures need rework while with DV additional Hubs/Links/Satellites can be added without disturbing the existing structure. The changes are typically in a single place and easily adapted because the code is likely to be at only 1 or 2 places.
Agile Development Medium High Although both models work well with Agile Development principles, DVs allow the whole EDW project to be broken into well-defined stories/sprints/tasks, etc. based on Units of Work. It is easier to put the DV design into stories/sprints/tasks.
Historical Data Handling Harder Easier Although both models allow historical data to be handled, in DVs it is done in separate object types: one or many Satellites, while in 3NF history is held in a Normalized History object very much like a SCD. When built in WhereScape RED, Satellites are built the same way as 3NF history tables. This means Satellites = Normalized History tables.Another point to consider is the issue of cascading snapshot dates in master-detail-detail 3NF models. This adds to the complexity of the ETL code and makes it difficult to extract “as-of” data sets (i.e. need to use “select(max(snapshot date))” type sub queries).
Real-Time, Near-RT Loads Good Good Although both 3NF and DVs handle Real-Time and Near Real-Time loads well, DVs are preferred because of high-parallelism, hence shorter individual loads.
Very Large DW Loads Good Good Both 3NF and DVs handle very large database loads well. Nevertheless, any changes in business that require vast initial load are easier to handle in DVs.
“Forgiveness” of the Model Less Strict More Strict DVs require DW practitioners to have more discipline in following architectural and processing rules; defying the DV rules can potentially destroy the whole DW structure.
Historical Tracking Good Best DVs allow complete “geological” type tracing from the lowest grain of data to the source system’s atomic level. 3NF is mainly designed for holding operational data history.
Data Access Bad Worse DVs and 3NF are both impractical for direct querying; it is particularly difficult for DW users to combine data from different sources into meaningful information in 3NF.A Presentation Layer is required for both models and the amount of effort to build the layer is about the same.On the other hand, if these queries are performed by a super user, I believe DVs are better (not worse) because the user can create and follow a template to generate SQL Views to extract data from DVs much easier than with 3NF.
Access to Skilled Modellers Less More  There are more people trained in 3NF than there are in DV.

 

Figure 1 Predisposition of various “flavors” of Data Warehousing Models in relation to various purposes of Warehouses

Figure 1 Predisposition of various “flavors” of Data Warehousing Models in relation to various purposes of Warehouses

 

Diagrammatic Representation

The following shows the data models, built from the same source data, for a 3NF and a DV architecture. The diagrams are an example of what can be produced by WhereScape 3D at the planning stage of an EDW Project.

 

Figure 2 Northwind Tutorial Data Model built in Data Vault “flavor”.

Figure 2 Northwind Tutorial Data Model built in Data Vault “flavor”.

 

Figure 3 Northwind Tutorial Data Model built in Third Normal Form “flavor”

Figure 3 Northwind Tutorial Data Model built in Third Normal Form “flavor”

 

 

Reference:

  1. Building the Operational Data Store, 2nd Edition by William H. Inmon
  2. DW2.0, 2008 by William H. Inmon. Bill Inmon stated that the “Data Vault is the optimal approach for modeling the EDW in the DW2.0 framework.” (DW2.0)
  3. Supercharge your Data Warehouse, 2010-2011 by Dan Linstedt
  4. Data Vault and Data Modeling, 2010 Genesee Academy

WhereScape Introduces Data Driven Design Tool 25 May ’11

For WhereScape:
Scott R. Humphrey
Humphrey Strategic Communications
(503) 644-9709
mailto:humphrey@strategic-pr.com

WhereScape Introduces Data Driven Design Tool

WhereScape 3D scopes, sizes, costs and de-risks data warehouse, data mart and business intelligence projects before they begin

BOULDER, Colo.— May 20, 2011 – WhereScape, the developer of WhereScape RED, an agile Integrated Development Environment (IDE) for managing data warehouses, today introduced WhereScape 3D, a data driven design tool for planning and reality-testing data warehousing and business intelligence (BI) projects. Using WhereScape 3D, organizations are able to de-risk projects by planning accurate, user-tested projects up front, in hours or days rather than weeks or months.

WhereScape 3D extends the company’s leadership position in the data warehouse development tools market, providing a pre-ETL solution and making risk-assessed agile development a reality in the BI markets. WhereScape 3D is available via a free trial for a limited time by registering for the WhereScape 3D beta program. The announcement was made at the Boulder BI Brain Trust (BBBT), a gathering of leading BI analysts, experts, and practitioners who attend half day presentations from interesting and innovative vendors. The BBBT was founded by Dr. Claudia Imhoff, president of Intelligent Solutions, Inc.

WhereScape 3D steps data warehouse project teams through the process of source and target exploration and modeling, user functionality verification, scoping, sizing and costing activities, producing both complete project design documentation and an implementation-ready design that can be exported, as metadata to be utilized by products such as WhereScape RED, WhereScape’s flagship data warehouse lifecycle management platform.

Regardless of where the design team begins its process, WhereScape 3D provides end-to-end technical assessment tools to ensure that the flow of data from source to target in your warehousing environment is feasible, meets performance and capacity guidelines or constraints, and delivers the business value user communities expect.

Based on the power of the WhereScape Data Driven Design™ methodology, WhereScape 3D formalizes and accelerates investigation and risk assessment activities, enabling designers to begin where they actually are in the project lifecycle, including:

  • Starting with a detailed, systematic profiling existing source systems, in order to understand what sorts of dimensional or normal-form warehouses and marts can be built from the available source data.
  • Analyzing a new or updated source system against an in-place data warehouse or data mart to determine what impact those sources systems will have on existing BI infrastructure.
  • Proofing a custom schema design, or a purchased or licensed data model against available source system data.
  • Working backwards from prototyped or implemented reports, dashboards or analytical applications and their data requirements into warehouse or mart design and source system data availability.

“With careful application of data driven design principles and attention to the organizational structure of the delivery team, data warehouse design and population projects can now be placed on a much firmer basis of realistic and reliable scoping and sizing,” said Dr. Barry Devlin, a founder of the data warehousing industry and among the foremost worldwide authorities on business intelligence. “Tools to support data driven design, such as that provided by WhereScape 3D, separate from and prior to the design and development of ETL processes is key to enabling these skilled data practitioners to fulfill their role.”

Bank of New Zealand Extending Data Warehouse Infrastructure

The Bank of New Zealand used WhereScape RED to quickly and cost-effectively consolidate multiple financial data marts into a Microsoft SQL Server based centralized data warehouse. According to Dave Thompson, Head of Business Intelligence for Bank of New Zealand, incorporating WhereScape 3D into the Bank’s data warehouse infrastructure represents an opportunity to further reduce the risks and costs associated with designing, developing and extending data warehouses. “We are excited about WhereScape’s new data-driven design methodology and are evaluating opportunities for WhereScape 3D as we continue to build out our data warehouse infrastructure.”

Key features of WhereScape 3D include:

  • Supports multiple aspects of the data warehouse planning process
  • Combines data and design into the same process
  • Supports and associates conceptual, logical and physical models
  • Embedded data warehouse knowledge
  • Generates data warehouse specific design documentation
  • Out of the box support for common use cases, as well as creation of custom use cases.

“WhereScape has been devoted to the idea that well-designed tools can reduce the time, cost and risk of data warehouse development,” said WhereScape co-founder and CEO Michael Whitehead. “WhereScape 3D enables organizations to scope, size, eliminate risk and deliver data warehouses, data marts and business intelligence environments fast by focusing on the source systems, the target schema as well as end-user needs and expectations before they start the build process. It provides a central repository for planning information from interview notes to designs, and extends the concept of lineage to the planning process.”

About WhereScape

WhereScape design, develop, sell and support WhereScape 3D, the industry’s first data warehouse planning tool; and WhereScape RED, the industry’s first and best integrated development environment for building, deploying, managing and renovating data warehouses.

WhereScape’s products are used in every kind of business intelligence project found in today’s commercial environment: from normal-form enterprise data warehouses and data vaults, through user access layers and tiers of dependent data marts, to standalone data marts and reporting systems.

Our more than 400 customers are active and satisfied users of our technologies, and report that, with WhereScape’s products, they are able to build data warehouses with fewer people, few if any diversions, missed deadlines or unmet user expectations, and in record time: in days or weeks, instead of months or years. WhereScape has offices in Portland Oregon, Auckland New Zealand, and Wokingham UK. For more information, please visit www.wherescape.com.

All products or company names herein may be trademarks of their respective owners.

Running an Oracle Procedure when a RED Job Completes with a Linux Scheduler 7 Apr ’11

Ever wondered how to run an Oracle Procedure when a RED Job completes with a Linux or UNIX Scheduler?

Here’s a simple way that doesn’t require a separate script.

The procedure I want to run is very simple:

CREATE OR REPLACE PROCEDURE dss_test
( p_sequence IN NUMBER
, p_job_name IN VARCHAR2
, p_job_id IN NUMBER
) AS
v_result INTEGER;
BEGIN
v_result := WsParameterWrite('TEST_IT', '1', 'Test');
RETURN;
END dss_test;

This procedure writes a new RED parameter, just as a test. But you could get this procedure to do anything, including calling other RED APIs to restart a job if it fails, etc. The three input parameters give you all the information you need to call any RED Scheduler API from the procedure.

The all you have to do to run this procedure when a job succeeds or fails is paste this command into the job success / failure command boxes:

echo "exec SchemaName.Dss_Test($JOB_SEQ$,'$JOB_NAME$',$JOB_KEY$);" | sqlplus /

Note: If your Linux or UNIX scheduler is not running under an OS Authenticated user, you’ll need to change this to:

echo "exec SchemaName.Dss_Test($JOB_SEQ$,'$JOB_NAME$',$JOB_KEY$);" | sqlplus username/password

When my job completed, I could see a new parameter called TEST_IT in RED.

That’s it! Magic!

Solving Load Problems in SQL Server 3 Mar ’11

This article looks at the different levels of error reporting provided by the different database load methods into SQL Server.  There are a number of ways that we can define how to load data into SQL Server using RED:

·         SSIS – this is Integration Services, SQL Server’s ETL tool that RED will generate and call dynamically.

·         Database links – which utilize OLEDB providers to connect to from SQL Server to the source database.

·         Native ODBC – which uses an ODBC connection to read source data and SQL Server’s BULK INSERT command to efficiently load the data.

·         ODBC – which utilizes an ODBC client to read and load data into the data warehouse 1 row at a time.

The factors which determine the best load method to choose are typically:

·         Load performance – a separate blog covers performance, but generally SSIS is the fastest, followed by Native ODBC or a Database Link (depending on the OLEDB and ODBC driver for the source system) and finally ODBC loading. 

·         Error reporting – to assist troubleshooting.  RED will capture error messages that are generated by the method that is used to load data.  Some of these methods generate better error messages than others. 

·         Environment – which mechanism is supported by the standards, policies and software environment. 

Load issues are usually caused by incorrect data type mapping (incorrect load table definition, data type incompatibilities or subsequent changes to the source system) or incorrect column NULLability.  Data type conversion can be significantly improved by editing the mapping of data types from source systems using Tools / Data Type Mappings – see the RED documentation for more details.  Other factors that can cause load failure are source system changes, network failure or duplicates when using a unique constraint. 

The ideal error reporting will identify:

·         the reason that the load failed,

·         the row that caused the failure (either by showing the value that failed, or a row number) ,

·         the column that caused the failure. 

With all three of these parts in the failure report a developer can easily find the offending data that caused the problem and amend the load table attribute definition or apply a transformation to cater to load issue. 

The following table records the message recorded for each of the 4 load methods for loading data from SQL Server to SQL Server.  This test was run by altering the load table to generate a failure, and then capturing the error message during the load.  Interestingly some methods would allow a load where other methods fail.  For example an SSIS load will cope with a shorter destination data type by automatically truncating the data – it will generate a warning, but not a failure. 

 

Incorrect data type Length (varchar(15) to varchar(5)

Incorrect data type (char into an int)

NULLs into an NON-NULL column

SSIS

Warning:

Truncation may occur due to inserting data from data flow column “shipcity” with a length of 15 to database column “shipcity” with a length of 5.

 

Note:  data is loaded successfully but data is truncated to fit the destination column.

Failure:

Fails the load if implicit conversion fails:

There was an error with input column “shipcity” (92) on input “OLE DB Destination Input” (14). The column status returned was: “The value could not be converted because of a potential loss of data.”.

Failure:

SSIS Error Code There was an error with input column “shipregion” (93) on input “OLE DB Destination Input” (14). The column status returned was: “The value violated the integrity constraints for the column.”

Native ODBC

Failure:

Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 921, column 12 (shipregion).

Failure:

Bulk load data conversion error (type mismatch or invalid character for the specified codepage) for row 1, column 11 (shipcity).

Completion or Failure:

The bulk load failed. Unexpected NULL value in data file row 921, column 12. The destination column (shipregion) is defined as NOT NULL.

 

Note: a NULL will be converted to an empty text string and not fail loading into a text column.

Database Links

Failure:

[Microsoft][ODBC SQL Server Driver][SQL Server]String or binary data would be truncated.

Failure:

[Microsoft][ODBC SQL Server Driver][SQL Server]Conversion failed when converting the nvarchar value ‘Reims’ to data type int.

Failure:

[Microsoft][ODBC SQL Server Driver][SQL Server]Cannot insert the value NULL into column ‘shipregion’, table ‘WSLWarehouse.dbo.load_orders_link’; column does not allow nulls. INSERT fails.

ODBC

Failure:

Failed on insert after 8104 rows. [Microsoft][ODBC SQL Server Driver][SQL Server]String or binary data would be truncated.

Failure:

Failed on insert after 1 rows. [Microsoft][ODBC SQL Server Driver][SQL Server]String or binary data would be truncated.

 

Failure:

Failed on insert after 0 rows. [Microsoft][ODBC SQL Server Driver][SQL Server]Cannot insert the value NULL into column ‘shipregion’

 

In conclusion a Native ODBC load consistently provides the best error reporting, identifying the failure reason, row number (of the intermediate text file) and the column that caused the problem.  In the event of a failure the failed data file will be persisted as a text file, and the reported error row can be found if the text file is opened with a text editor (eg Notepad or Textpad) and using Goto line.   

If another load method is used which does not clearly identify the failure details required to identify and fix the problem, then it is recommended that the particular load table is temporarily changed to use a Native ODBC load method to get better error reporting. 

Degenerate Dimensions in Analysis Services 28 Dec ’10

Analysis Services supports the ability to define dimensions based on columns in the fact table – also known as Fact Dimensions.  These dimensions are very useful when attributes that exist in the fact table need to be exposed in our OLAP query or Drill through action, such as Invoice Number or Order Reference.

WhereScape RED does not directly support this dimension type, but it is simple to create equivalent functionality using RED and database views.  RED provides the ability to quickly and easily modify the data warehouse database using views to accommodate cube design, rather than modify the DSV within a cube database.  This approach is useful when equivalent queries are required from both the data warehouse and cubes. 

To define and add a degenerate / fact dimension to an OLAP cube already defined in RED follow these steps:

1.      First we need to define a dimension view built from the fact table, containing the degenerate attribute(s).  The dimension view is defined in RED by dragging the fact table attribute(s) into the Dimension list to create a new Dimension View.  This dimension view needs to have a single unique key column – perhaps just the degenerate attribute.  Next, define the dimension’s key column as an artificial key by checking the artificial key checkbox in the key column properties in RED. 

Note: If the degenerate dimension has a composite key, then a single unique key needs to be created in the dimension view by concatenating the composite key columns.  This key also needs to be added to the fact table or in a view over the fact table to enable a fact / dimension relationship to be defined using a single column.

2.    Once created, the degenerate dimension view can be dragged over to the list of OLAP dimensions to create a new OLAP dimension in RED.  Ensure the key column is identified correctly in the OLAP dimension properties.

3.    The OLAP dimension can be added to a cube in RED by right clicking the cube and choosing Add Dimension.  Choose the new degenerate dimension. 

 

4.    To define the relationship between fact and degenerate dimension right click the cube and choose Display Measure Group Dimensions.  The relationship between fact and degenerate dimension will be shown as “No Relationship”.  Double click this relationship and change the Relationship type to Regular, and choose the key column in the fact that matches the dimension’s key column. 

 

5.    Now you can create and process the cube.   The degenerate attribute(s) will be exposed in the new degenerate dimension.

This method has been tried and tested on client sites.  If you run into problems please let us know.

SQL Server Data Acquisition using RED and Native Support for Microsoft SQL Server Integration Services 10 Dec ’10

The other day I ran some comparative tests using different options to load data into a SQL Server data warehouse from an Oracle database using WhereScape RED.  Here’s a summary of what I did and what I found:

1. The new WhereScape RED load option Integration Services load is the fastest way to move data into SQL Server from another relational database.

2. Native ODBC loads are the second fastest way to move data into SQL Server.  Linked servers are significantly slower than other options.

3. Whether the architecture is 32-bit or 64-bit makes no difference loading data into a load table in SQL Server 2008 from another database.

Details follow, apologies if this looks like a high school science report!

Aim:

The objectives were to:

1. Show the new WhereScape RED load option:  Integration Services load.  This uses the new Native Support for Microsoft SQL Server Integration Services.

2. Determine the fastest way to move data from a relational database into SQL Server 2008 database using WhereScape RED.

3. Quantify any difference between 32-bit and 64-bit platform performance.

 

Method:

A series of tests were run on two identical machines, one 32-bit and one 64-bit.  The two machines were configured as follows:

32-bit Server:

- Windows Server 2003 R2 Enterprise Edition Service Pack 2

- AMD Athlon II X4 635 Processor 2.91 GHz

- 6GB of RAM (PAE)

64-bit Server:

- Windows Server 2003 R2 Enterprise x64 Edition Service Pack 2

- AMD Athlon II X4 635 Processor 2.91 GHz

- 6GB of RAM

Both machines were built using vm-ware 6.5 and tests were performed using vm-ware server on identical host machines.  The 64-bit virtual machine ran on a 64-bit host.  The 32-bit virtual machine ran on a 32-bit host.  Both had 2 virtual CPUs assigned to them (i.e.: 2 cores of the host machines CPU).

SQL Server 2008 R1 was used as the data warehouse database, the 32-bit Enterprise and 64-bit Enterprise versions being used respectively.

The source system was in an Oracle 11.2.0.1.0 database – Enterprise Edition.  Again 32-bit and 64-bit versions being used as appropriate.

SQL Server and Oracle were installed on the same server.

There were some additional set up complexities on the 64-bit machine.  Specifically:

- An oracle 32-bit client was also required (in addition to the 64-bit client installed with the database)

- A 64-bit oledb provider from oracle had to be installed as Microsoft does not supply a 64-bit version of MSDAORA (they do supply a 32-bit version of MSDAORA with 32-bit SQL Server).

The testing involved a large table being loaded into SQL Server from Oracle.  The source table contained approximately 13.6 million rows and used about 2Gb of space in Oracle.  It contained most common data types.

The following tests were performed:

- 32-bit machine:  Integration Services load in RED

- 32-bit machine:  Native ODBC load – 32-bit MS Driver

- 32-bit machine:  Native ODBC load – 32-bit ORA Driver

- 32-bit machine:  Native ODBC load – 32-bit ORA Driver and RAM Disk  ***1

- 32-bit machine:  Linked Server in RED – 32-bit – 4 part names

- 32-bit machine:  Linked Server in RED – 32-bit – open query

- 32-bit machine:  Linked Server in a Query Wind – 32-bit – 4 part names

- 64-bit machine:  Integration Services load in RED

- 64-bit machine:  Native ODBC load – 32-bit MS Driver

- 64-bit machine:  Native ODBC load – 32-bit ORA Driver

- 64-bit machine:  Native ODBC load – 64-bit ORA Driver  ***2

- 64-bit machine:  Native ODBC load – 32-bit ORA Driver and RAM Disk  ***1

- 64-bit machine:  Linked Server in RED – 64-bit – 4 part names

- 64-bit machine:  Linked Server in RED – 64-bit – open query

- 64-bit machine:  Linked Server in a Query Wind – 64-bit – 4 part names

Notes:

***1:  A RAM disk was used in one Native ODBC load test to check for IO constraints.

***2:  This test was run using a specially compiled 64-bit version of the RED native load extract utility.

 

Results:

Conclusions:

1. The new WhereScape RED load option Integration Services load is the fastest way to move data into SQL Server from another relational database.

2. Native ODBC loads are the second fastest way to move data into SQL Server.  Linked servers are significantly slower than other options.

3. Whether the architecture is 32-bit or 64-bit makes no difference loading data into a load table in SQL Server 2008 from another database.

 

Sucking the oxygen from the migration conversation: Thoughts on Microsoft’s Parallel Data Warehouse 29 Nov ’10

When it comes to the analyst community there is none better than Merv Adrian. His comments at a briefing or a BBBT are insightful, if you can’t make a conference following @merv is a worthy substitute, and his blog is required reading.

One of his recent blogs is on SQL Server PDW – Microsoft Leaps Late, Lags with SQL Server PDW. The title is not the most positive, and while his analysis is of course sound, I still believe Microsoft SQL Sever 2008 R2 Parallel Data Warehouse (PDW) is a game changer.

Microsoft is not like any of the other MPP vendors – or other data warehouse database vendors for that matter. It already has volume, and owns the low and mid end markets. WhereScape has more SQL Server implementations that any other database platform. We see more new (or replacement) data warehouses built on SQL Server than any other platform. Despite often having perfectly good entry level and mid market offerings, the other database providers don’t even get a chance in many opportunities – SQL Server is the default choice and a good one.

We do see conversions from SQL Server to other databases (to be fair we also see conversions to SQL Server as well). The main reason we are given for converting from SQL Server is perceived scalability issues. I say perceived as often it is balance and configuration as much as scalability that it the issue.

This is where PDW and Microsoft SQL Server 2008 Fast Track Data Warehouse (Fast Track) are game changes. Microsoft is removing scale as a reason to migrate to another database. It is sucking the oxygen from the migration conversation. Database migrations are already a painful exercise. The appliance vendors have done their best to mitigate the disruption in some scenarios, but it is still not without (often substantial) cost in dollars and time.

There will still be some people who want to join the big-end data warehouse club by implementing an appliance, another relational database or a specialist data warehouse database. Just not so many. And the argument is made harder when the new platform is more expensive to licence and maintain than the old one.

But I don’t see PDW and Fast Track only appealing to the Microsoft base. Who, having seen a bunch of them in kilts at SQL Pass, I have to say are a scary lot. Microsoft’s customer base gives them an ideal place to gain momentum, and to establish a beachhead for attacking other markets. As their technology matures and the customer stories come out, it can’t be good for the other database players, especially the less established MPP vendors, to have another alternative in the market.

Merv talks about how Microsoft is hiring, and its team is excited and knowledgeable (as well as their base). Marry that with Microsoft’s traditional aggressive pricing, and I think we have a game on right now. And it is a new game.

Analysis Services showing ancient history 12 Aug ’10

We had a customer who had built an OLAP cube over a fact with 10 years worth of history. Whenever the cube was browsed though in Business Objects or Excel it would show dates going back to 1950 – which are all the dates held in the date dimension. This meant that the user would have to scroll through years of empty cells to get to 2000 when the data actually started.

The problem was caused by a calculated member that returned a 0 when the calculation returned an empty value. This caused the cube browsers to think that there was data back to 1950. So rewriting the calculated member to return an empty value when the calculation returned nothing caused everything to come right in the front end.

Type 6 Dimensions 11 May ’10

WhereScape RED natively supports type 1, 2 and 3 dimensions. But what on earth is a type 6 dimension? It is a combination of a type1, type 2 and type 3! Basically there are two columns used to track a type 6 attribute – one holds the newest value, and one holds a point in time value that references a start and end datetime. This is a great way of being able to:

  • choose whether to report or filter by an employee’s current department or their department when an event / fact was recorded.
  • choose whether to report or filter by a customer’s current suburb or their suburb when an transaction / fact was recorded.
  • do a “where are they now?” type of report eg where are employees working now, that worked in this location 2 years ago.

WhereScape RED can build these type 6 dimensions without resorting to customizing the code. The standard WhereScape RED procedures will generate change detection code for time variant data (type 2 and 3 attributes) and RED will add change tracking columns to manage date ranges in those dimensions.
Just to summarise:

  • A type 1 slowly changing dimension (a Normal dimension in RED’s dimension wizard) will overwrite historical changes with the newest data – one record for each unique business key.
  • A type 2 slowly changing dimension (a Slowly Changing dimension in RED’s dimension wizard) will create new records each time an attribute that is marked as slowly changing does not match the “current” value for that business key. Unique start and end date ranges will manage which record is relevant at any point in time.
  • A type 3 dimension (a Previous Values dimension in RED’s dimension wizard) will hold the current and the previous attribute values for a tracked attribute in two attributes in the dimension.
  • The mysterious date ranged dimension in WhereScape RED is essentially a type 2 dimension that relies on the source system to maintain and pass changed data with date ranges.

So how would you build a type 6 dimension in RED using standard RED code? By copying a source attribute into two dimension attributes, one to hold the current value and one for a point in time value (with suffixes of _cur and _pit). This dimension should be built as a type 2 dimension in RED, with the _cur attributes marked as type 1, whilst the _pit attributes selected as type 2 attributes.

Thanks to Mr Kimball for writing about them here.
And thanks to Andrew Cann and the chaps from Mater for explaining why type 6’s were useful to them!

Performance Trade-Offs – Joins vs. Storage 17 Mar ’10

WhereScape’s Chuck Kelley has written a blog for BeyeNETWORK on the performance trade offs between joins and storage. 

The question he asks is:

What is the best performing way to load a fact table that has 20 different dimensions?

As he sees it, there are 3 options:

  1. Create the table from a single FROM … WHERE clause
  2. Create a view that does the FROM … WHERE clause and join that to the input for the fact table
  3. Create the table with multiple staging steps joining

The full blog is available at http://www.b-eye-network.com/view/12675

Tricks and traps with Oracle (as a source or target) on 64 bit servers 11 Nov ’09

Installing RED on a Windows 64-bit server, with Oracle as a repository or source system

When installing RED on 64-bit Windows, the installer selects the following installation folder by default:

C:\Program Files (x86)\WhereScape\

This is where 32-bit applications normally reside on a 64-bit Windows platform.

However, if Oracle is involved, either as the data warehouse repository, or as a source system, installing RED in “Program Files (x86)” causes a problem.

The problem is that Oracle doesn’t like directories that have a parenthesis in the name – applications using Oracle cannot be located in a directory with parenthesis in the name, and this includes RED using an ODBC connection to Oracle.  The result…RED will not be able to connect to Oracle.

This issue can be prevented by changing the installation directory of RED to:

C:\WhereScape

For more information about this Oracle issue, see

http://tiredblogger.wordpress.com/2007/11/21/oracle-data-access-components-odac-with-64-bit-development/

64-bit SQL Server data warehouse repository with Oracle as a source system

If you are running a 64-bit SQL Server data warehouse, and need to connect to one or more Oracle source systems, you will require both the 32-bit and 64-bit Oracle Clients installed on the 64-bit Windows Server:

1) 32-bit Oracle Client

The RED client is a 32-bit application, and requires a 32-bit ODBC source to connect to an Oracle database. 

The 32-bit Oracle Client is required so you can set up a 32-bit ODBC source using the “Microsoft ODBC for Oracle” ODBC source.

2) 64-bit Oracle Client

For better data load performance, create a SQL Server “Linked Server” to connect to an Oracle source system.

The RED scheduler will use the Linked Server for the data loads.

64-bit SQL Server requires the 64-bit Oracle Client and ODAC (Oracle Data Access Components) to create the Linked Server – these are both included in the Oracle Database 10g Release 2 Client (10.2.0.4.0) for Vista and Windows 2008 (64 bit).

For installation instructions for the 64-bit Oracle Client, and setup of the Linked Server, see

http://www.mssqltips.com/tip.asp?tip=1433

We have tested this setup using the following software versions:

64-bit SQL Server 2005 as the DW repository on the 64-bit Windows Server

32-bit Oracle 10g Client installed on the 64-bit Windows Server

64-bit Oracle 10g Client installed on the 64-bit Windows Server

Oracle 9iR2 as a source system

Oracle 10g as a source system