Top 3 Reasons to Always use Surrogate Keys in Data Warehousing
There are many things about data warehousing that are subjective. The use of surrogate keys is clearly not one of them. Let’s start with a quick refresher.
Surrogate Key: A system generated value which uniquely identifies a record. Surrogates keys have no business meaning.
Natural Key: A value that represents a real world object. Aka – Business Key. Example; SSN is a natural key representing a person.
In dimensional models we have fact tables which contain foreign keys pointing to all related dimensions’ surrogate keys. Therefore, the decision to use surrogates will affect the primary key of each dimension as well as the foreign key values in fact tables. So, why do we always use surrogates as the primary key in dimensions?
1. Slow Changing Dimensions
It is often a requirement to track historical values of dimension records. For example; a customer dimension may contain a zip code field. This dimension has a natural key (aka business key) named CustomerID. To keep track of the customer zip code throughout time we need to include multiple versions of this customer record in our customer dimension; one for each zip code value. Using the CustomerID as our dimension primary key is simply not an option since it will not uniquely identify a dimension record.
2. Changing Source Systems
Using natural keys tightly ties a data warehouse’s integrity to the stability of the source system. While it is not common for a source system to change natural keys it does happen. Specifically, when data is archived or when no available values remain, OLTP system will do a variety of things to handle the situation. Probably more common is the need to source a dimension from multiple disparate systems. In this case there is a chance that the same key value is used in both source systems making it no longer a candidate for our dimension’s primary key.
Referring back to the two design goals of a dimensional model per Kimball; create an easy to understand data structure while providing fast query response times. Fast response times often mean efficient joins between facts and dimensions. The last thing we want to do is join on text or other meaningful values as this is not going to provide the best response times. By using surrogate keys we can select a data type, integer, which offers high performance.
For these three reasons alone it is simply not suitable to use anything other than a surrogate key as the primary key of dimensions. The good news for LeapFrogBI users is that this is one of many best practices that is automated requiring no developer time to implement.