You know how there are times in life when you see something and know from experience that someone is headed for a world of problems (to put it mildly)? In situations like this, a time machine would be invaluable. We could go back to a date when the upcoming catastrophe could still be averted.
For anyone out there about to start a data warehousing project, consider this post a flash from the future. Maybe this is your first rodeo, or maybe building data marts/data warehouses is not your main focus. In any case, you are in luck! Avoid the following mistakes, and your project will have a much better chance at success.
No Consolidation Area
It might be tempting to collect data from your source systems, apply a few transformations on the fly, and load the results directly into your data warehouse. The problem with this plan is that it ties you directly to the whims of the source systems, which will eventually change. It will soon be impossible to reload a dimension, for example, because a source has purged historical data. Source systems also often modify records without storing the prior values, which is a huge problem when you are trying to track history on dimensional attributes.
The solution is to create a consolidation area. This goes by a variety of names: CHS (consolidated historical store), PSA (persistent staging area), and, arguably, ODS (operational data store). Without getting into the nitty-gritty, let’s just agree that we need a reliable data store which contains every version of every record that will ever be of interest.
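To make the idea concrete, here is a minimal sketch of a persistent staging area as an append-only store (the class, column names, and sample data are all hypothetical). Every extract is appended with a load timestamp and nothing is ever updated in place, so any historical version can be replayed even after the source has overwritten it.

```python
from datetime import datetime, timezone

class PersistentStagingArea:
    """Append-only store: every source extract is kept, never overwritten."""

    def __init__(self):
        self.rows = []  # each entry: (load_timestamp, source_record)

    def load_extract(self, records):
        """Append a full source extract, stamped with the load time."""
        load_ts = datetime.now(timezone.utc)
        for rec in records:
            self.rows.append((load_ts, dict(rec)))  # copy, never mutate
        return load_ts

    def versions_of(self, key_field, key_value):
        """Return every version of a record ever captured, oldest first."""
        return [rec for _, rec in self.rows if rec.get(key_field) == key_value]

psa = PersistentStagingArea()
psa.load_extract([{"cust_id": 1, "city": "Austin"}])
psa.load_extract([{"cust_id": 1, "city": "Dallas"}])  # source overwrote the city
print(psa.versions_of("cust_id", 1))  # both versions survive
```

With this in place, a dimension can be rebuilt from history at any time, regardless of what the source system has purged or modified since.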
Relying on Business Keys
Don’t do it. Sure, your source system has been around forever, and it isn’t going anywhere soon. But is your organization growing? Might it acquire other companies? If so, there is a good chance the acquired company has a source system handling the same tasks, and a chance that the new source uses some of the same business keys as your existing one. There are plenty of reasons not to rely on business keys any more than absolutely necessary.
Conveniently, the solution is simple. Use surrogate keys in every dimension. This doesn’t totally guard you from source system nightmares, but at least you won’t be trying to clean up huge fact tables when your business keys change.
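A toy sketch of why surrogate keys sidestep the problem (class and source names are hypothetical): the warehouse assigns its own keys, and the business key only matters as a lookup qualified by which source it came from, so two sources can reuse the same business key without colliding.

```python
class SurrogateKeyMap:
    """Assigns stable warehouse keys, independent of any source's business keys."""

    def __init__(self):
        self._next = 1
        self._map = {}  # (source_system, business_key) -> surrogate key

    def key_for(self, source_system, business_key):
        """Return the existing surrogate key, or assign the next one."""
        natural = (source_system, business_key)
        if natural not in self._map:
            self._map[natural] = self._next
            self._next += 1
        return self._map[natural]

keys = SurrogateKeyMap()
# Two acquired systems both use business key 1001 for different customers.
a = keys.key_for("erp_a", 1001)
b = keys.key_for("erp_b", 1001)
print(a, b)  # -> 1 2
```

Fact tables then reference only the surrogate keys, so a renumbering or merger on the source side never forces a rewrite of warehouse history.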
Using SSIS’s Built-In SCD Component
Granted, this is specific to Microsoft shops using SSIS, but it is a bad enough mistake to make the list. A little searching will show that SSIS’s built-in slowly changing dimension component uses RBAR (row by agonizing row) logic. Only the smallest dimensions will perform well without excessive optimization effort. There are other issues as well, such as the fact that rerunning the wizard breaks existing data flow definitions.
There are essentially two alternatives: the MERGE statement and third-party components. Loading dimensions with a MERGE statement is a great way to handle SCD logic, though it does take some time to work out the kinks. There are also some very nice SCD components (both open source and commercial) that are capable of doing the job. Skip the built-in SCD component and go straight to a resilient solution.
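Whichever route you take, the underlying Type 2 logic is the same: compare incoming rows to the current dimension rows, expire what changed, and insert a new version. Here is that logic sketched set-at-a-time in Python rather than T-SQL (the row layout — `bk`, `attrs`, `valid_from`, `valid_to`, `current` — is a hypothetical simplification):

```python
from datetime import date

def apply_scd2(dimension, incoming, today):
    """SCD Type 2: expire changed rows, insert new versions.

    dimension: list of dicts with bk, attrs, valid_from, valid_to, current.
    incoming:  dict mapping business_key -> latest attribute dict.
    """
    current = {r["bk"]: r for r in dimension if r["current"]}
    for bk, attrs in incoming.items():
        row = current.get(bk)
        if row is not None and row["attrs"] == attrs:
            continue  # unchanged: nothing to do
        if row is not None:
            row["current"] = False      # expire the old version
            row["valid_to"] = today
        dimension.append({"bk": bk, "attrs": attrs,
                          "valid_from": today, "valid_to": None,
                          "current": True})
    return dimension

dim = []
apply_scd2(dim, {"c1": {"city": "Austin"}}, date(2024, 1, 1))
apply_scd2(dim, {"c1": {"city": "Dallas"}}, date(2024, 6, 1))
```

After the second load, `dim` holds two versions of customer `c1`: the expired Austin row and a current Dallas row. A MERGE statement or a good third-party component performs the same expire-and-insert in one set-based pass, which is exactly what the built-in component's row-by-row approach fails to do at scale.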
Who Needs a Data Model & Data Flow Diagram?
Every serious data warehouse project should have a data model in place before any development begins. The same is true for data flow diagrams, although they are much less common. Even the simplest data integration tasks need to be thought through before you develop processes. Putting it on paper makes issues much easier to identify and much less time consuming to resolve.
No Plan for Change (a.k.a. Waterfall Methodology)
Remember Sanford &amp; Son? "This is the big one, Elizabeth." If you are building a data warehouse, it is likely because the information within is needed to make better decisions. Better decisions lead to success, which leads to growth and change. The data warehouse must be able to grow (change) with the business. Long projects that attempt to collect all requirements and then deliver on them in one lengthy process often fail for a very simple reason: by the time the project is complete, the requirements have changed!
Agile methodologies address this problem by organizing work into short iterations, each of which delivers a piece of functionality. But how does this work with data warehousing? Data integration by its very nature involves many serial tasks with interrelated dependencies. Changes often cause a domino effect resulting in many hours of rework. It is nearly impossible to be agile in this scenario.
The solution is to work at a higher level. By creating metadata-driven processes, we can loosely couple dependencies, allowing for rapid change. Data integration tools such as SSIS are extremely powerful, but they are built for generic data integration, not specifically for data warehousing. Tools built specifically for data warehousing save you a lot of time reinventing the wheel.
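A tiny illustration of the metadata-driven idea (the metadata schema and step names here are invented for the example): each load step is described as data, including its dependencies, and a generic runner derives the execution order. Adding or reordering a feed means editing metadata, not rewiring hand-built packages.

```python
# Hypothetical metadata: each entry fully describes one load step.
LOAD_METADATA = [
    {"name": "stg_customer", "source": "crm.customer",
     "target": "psa.customer", "depends_on": []},
    {"name": "dim_customer", "source": "psa.customer",
     "target": "dw.dim_customer", "depends_on": ["stg_customer"]},
]

def run_plan(metadata):
    """Topologically order load steps from their declared dependencies."""
    done, plan = set(), []
    while len(done) < len(metadata):
        progressed = False
        for step in metadata:
            ready = all(d in done for d in step["depends_on"])
            if step["name"] not in done and ready:
                plan.append(step["name"])   # a real runner would execute here
                done.add(step["name"])
                progressed = True
        if not progressed:
            raise ValueError("circular dependency in metadata")
    return plan

print(run_plan(LOAD_METADATA))  # -> ['stg_customer', 'dim_customer']
```

Because the dependencies live in metadata rather than inside the packages themselves, a changed requirement ripples through as a regenerated plan instead of hours of manual rework.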
By the way…
LeapFrogBI is a metadata-driven data warehousing platform. Solutions for all of these issues, along with many industry-standard best practices, are built into the toolset, which generates native SSIS packages in a small fraction of the time required by traditional methods.
Contact us today & start reaping the benefits of agile data warehousing.