Archive: July ’09

More can cost less, and still be more. 30 Jul ’09

Chris Anderson’s recent article in Wired Magazine (“Tech is too cheap to meter”) makes the provocative case that we are wasting a lot of time and money trying to manage technology resources for efficiency.  Disk space, processing cycles and bandwidth are so cheap now that managing them as limited resources is just a waste of time.

Anderson gives the excellent example of the phone system where your mailbox “fills up” after a hundred messages and you are forced to sit down and take a half an hour and delete things.  This of course annoys the user (who, on a public phone system, is a CUSTOMER) to no end.  In a business, it takes up valuable employee time.

And what exactly is being saved here?  How much disk does a garbled thirty second voicemail message take up, anyway?   Well, we can figure this out.  A low-fi VOIP call needs a bandwidth of 30 kbits/sec, or about 4 kbytes/sec.  For a 30 second message, that’s 120 kbytes, or about one-tenth of a megabyte.  If I have a hundred messages in my voicemail box, it’s taking up about one megabyte. (Of course it should be much less, but we’ll assume this company has never heard of compression).

What does my clogged inbox cost the company?  I recently bought a terabyte drive for about $200.  Of course the price went down the next day, but let’s be generous and use the $200 figure.  That calculates out to about a thousandth of a cent per megabyte.

If you spend half an hour cleaning out your voicemail,  the cost in lost labor is going to be somewhere between four dollars and four thousand dollars, depending on where you are in life.  And what did you save the company? One thousandth of one penny.

What does this have to do with Data Warehousing, you ask?  A few things.  The truth is that Data Warehousing is very much a child of cheap technology, which means the cheaper it gets, the better our systems should be able to deliver good information.

First off, we are no longer allergic to redundant sets of data.  Most DW professionals have long ago seen the benefits of denormalizing dimensional data and aren’t worried that every single attribute isn’t tucked off into its own uniquely-valued table somewhere.  But more important than that, we are less conservative about creating downstream data marts, cubes, summaries, and extracts.  If we can save some financial analyst time and headache by giving them a custom data mart then we are happy to do it, even if the marketing department asked for a similar-but-not-quite extract the week before.

More important still is the change to our development model.  Software development projects often go through lengthy and expensive requirements phases, with extensive user interviews and fat specification documents written up.  Back in the 1990s, DW projects did the same thing.  In fact, many DW project STILL do the same thing, despite the evidence of the benefits of Agile methods.  But what exactly are we saving through a lengthy requirements and design process?  Disk space?  You may as well clean out your voicemail box.  Programmer time?  Not if you’ve got the right tools.  User reports and dashboards are created by drag-and-drop now, as is the back-end data warehouse, thanks to WhereScape RED.

“Technology too cheap to meter” is an opportunity for us.  Why go through a lengthy requirements phase?  Why not have a workshop instead where you roll out a dozen reports or data marts to your user group and see which one tickles their fancy? It will take less time than interviews and writing specifications and probably have better results.

Yes, deploying and managing many downstream data objects can be a royal headache, but WhereScape RED is the perfect tool for this new environment.  As you can drag and drop to create new downstream objects, RED tracks each aggregate, summary, data mart and data cube that you create and makes sure that they will be loaded and updated appropriately and included in the documentation.  RED allows you to be generous with information while still maintaining a simple and manageable environment.

To be, or not to be Precise: that is the Question 23 Jul ’09

The other day I was on a customer site where a subject area in the data warehouse had been based on another division’s data warehouse.

There were a few changes that had been made due to different versions of the source system, but these were largely trivial.  The main change was this data warehouse was in SQL Server and the original data warehouse database had been Oracle.  In both cases the source system was an Oracle database.

The customer couldn’t get the new SQL Server data warehouse to balance back to the source system.

It didn’t take long to work out the cause: precision.

Here’s some background on Oracle:

The Oracle documentation describes the NUMBER datatype as follows:

The NUMBER datatype stores fixed and floating-point numbers.  Numbers of virtually any magnitude can be stored … up to 38 digits of precision.

Columns using the NUMBER datatype can be defined using the following conventions:

  • NUMBER – see below
  • NUMBER(precision) – i.e.: scale is zero
  • NUMBER(*, scale) – i.e.:  precision is 38
  • NUMBER(precision, scale)

Oracle always stores NUMBER values in variable length format as “mainframe” computational-3 packed decimals.  In English this means a value with p digits of precision uses ROUND((LENGTH(p)+s)/2))+1 bytes (s is 0 for positive values and 1 for negatives).

All other numeric datatypes in Oracle (except for BINARY_FLOAT and BINARY_DOUBLE) are aliases for NUMBER with specific precision and scale values.  For example, if you create a table with a column that has a datatype of INTEGER the column will really be NUMBER(38).

If a column is NUMBER without precision or scale, it will accept any value in the following list:

·         Positive numbers in the range 1 x 10-130 to 9.99…9 x 10125 with up to 38 significant digits

·         Negative numbers from -1 x 10-130 to 9.99…99 x 10125 with up to 38 significant digits

·         Zero

As you can see the NUMBER datatype without precision or scale is very flexible.  Unfortunately, it doesn’t move nicely to other relational databases.  This brings us back to the original issue.

If you move a table from Oracle to SQL Server, then for basic datatypes, you do the following:

  • CHAR(x) stays the same
  • VARCHAR2(x) becomes VARCHAR(x)
  • DATE becomes DATETIME (note: in SQL Server 2008 there’s now a DATE datatype, so you can leave these as DATE)
  • NUMBER(x) becomes NUMERIC(x)  **1
  • NUMBER(x,y) becomes NUMERIC(x,y)  **1
  • NUMBER becomes ???  **2

There are two important points to make here:

**1: The maximum precision in both Oracle and SQL Server is 38.  However, in SQL Server, the extreme size of a number is quite a lot less than Oracle.  SQL Server values can range from (-1 x 1038) + 1 to (1 x 1038) – 1.  Changing to FLOAT(53) will allow any Oracle NUMBER value outside the SQL Server NUMERIC range to be stored, but only with 15 digits of precision…  This lack of precision capability is why the FLOAT and REAL datatypes in SQL Server are referred to as “Approximate-number data types” in SQL Server Books Online.

Note: In SQL Server, REAL is just an alias for FLOAT(24).

**2: NUMBER (without precision or scale) is non-trivial.  If you have a column in SQL Server with a datatype of NUMERIC, what you actually get is NUMERIC(18,0).

This is largely unexpected, well it certainly was by me and the developers on the client site.  Of course, this turned out to be the cause of my original issue: when you load data into a NUMERIC(18,0) in SQL Server, any decimal digits are lost.  So, the data warehouse not balancing to the source system was caused by rounding errors caused by implicit conversion loading data from Oracle NUMBER columns into SQL Server NUMERIC columns (which were actually NUMERIC(18,0)).

Therefore, you can’t change NUMBER in Oracle to NUMERIC in SQL Server unless you know the NUMBER column in Oracle only contains smallish integer values.

What you need to do is look at the data in the column in Oracle and explicitly manage it into the best SQL Server option.  A couple of options are:

  • If the biggest value you have is 20 digits long in Oracle, I’d change NUMBER to be NUMERIC(38,10) in SQL Server.
  • If the most significant decimal digits you have is two in Oracle, I’d change NUMBER to be NUMERIC(38,2) in SQL Server.
  • And so on…

But you have to manage the rounding explicitly to make sure values always fit in the future.  And you need to balance future proofing a big value coming along with the default non-variable storage nature of the NUMERIC datatype in SQL Server too!

Out of interest, I had a look at the other two databases supported by WhereScape RED: IBM DB2 and Teradata.

The DB2 DECIMAL datatype and the SQL Server NUMERIC datatype are very similar to each other, with a few exceptions:

  1. The maximum precision is 31 in DB2
  2. DECIMAL without precision and scale gives you a DECIMAL(5,0) in DB2
  3. DECIMAL in DB2 can hold values from (-1 x 1031) + 1 to (1 x 1031) – 1
  4. DOUBLE can be used to hold bigger numbers
  5. DECIMAL storage works the same way as Oracle, so is variable by default

So, the answer is the same: moving from Oracle to DB2, you need to make the same judgement calls on the scale and precision to use.

There are even more issues moving to Teradata from Oracle.  In fact, you’re going to have issues moving to Teradata from Oracle, SQL Server and DB2, especially if you’re still using Teradata V2R6.x.

Specifically:

  1. The maximum precision is only 18 prior to Teradata 12.  From Teradata 12 onwards, this increases to 38.  Beware using precision over 18, as some built in database functions only work if the precision is less than or equal to 18.
  2. NUMERIC without precision and scale gives you a NUMERIC(5,0) in Teradata (same as DB2)
  3. NUMERIC in Teradata can hold values from (-1 x 1038) + 1 to (1 x 1038) – 1 from Teradata 12 onwards.  Before Teradata 12, this range is (-1 x 1018) + 1 to (1 x 1018) – 1.
  4. FLOAT can be used to hold bigger numbers
  5. In Teradata, NUMERIC storage is fixed storage, increasing exponentially in bands.  For example, a NUMERIC(19) uses twice the space as a NUMERIC(18), regardless of the value stored in the field.