A resilient ETL process deals with data quality issues without causing a process failure while also meeting business requirements. One issue that should be anticipated is the early arriving fact (aka late arriving dimension) situation. As the name implies an early arriving fact is a record that is bound for a fact table which references a dimension record that does not yet exist. In other words, the natural key in the fact record has not yet been loaded in a related dimension preventing a successful foreign key lookup to the dimension’s surrogate key.
How can this happen?
There are many situations that lead to failed fact foreign key lookups. In some cases a data source may simply be unconstrained such as simple flat file or Excel workbooks. Another common situation involves multiple applications that are not always in sync. Yet another situation involves large scale applications that may be replicated across an enterprise always leaving a short time period which may lead to inconsistencies.
How can we handle late arriving dimensions?
The answer depends on both business requirements and the nature of the data. The following schematic describes our options.
There are four main categories of options. Note that in all cases we need to test the foreign key lookup to determine if a failed lookup is going to occur when loading a fact record. The strategy used to do so depends on the treatment of such failures.
Never process – If the fact record to be loaded is of no value without a reference to the late arriving dimension, and there is no expectation that the late arriving dimension will ever show up, then there is no reason to load the fact. Simply omit the record from the fact load. This is rarely the case.
Queue & Retry – If the fact records to be loaded is of no value without a reference to the late arriving dimension, and the late arriving dimension is expected to show up eventually, then we can queue the record until such time the dimension record is loaded. The queued record must be flagged such that it is retried at a future time and loaded once a related dimension record is loaded. It is typically a good idea to create a queued record report to monitor this process.
Unknown Member – If the fact record does have value without a reference to a related late arriving dimension, and we either do not expect or do not care if a dimension record shows up at a later time, then we can simply load the fact record with a foreign key reference to the unknown record in the related dimension; typically -1 surrogate key value. Note that we must pay particular attention to the fact grain if we are going to take the unknown member approach to failed foreign key lookups. Most fact table primary keys are set as a composite across foreign keys and degenerate dimensions. If allowing a foreign key to be set to -1 causes a possible primary key violation, then we need to look for other options such as adding a degenerate dimension.
Inferred Member – If the fact record has value without a reference to a related late arriving dimension, and we expect the late arriving dimension to show up at a later time, then we can use the natural key in the record bound for the fact table to seed a record (inferred member) in the related dimensions. Once the dimension record shows up, it will have the same natural key and the dimension record’s attributes can then be updated. Particular attention should be paid to the inferred member’s record time in cases where dimension history is tracked such as SCD 2 situations.
When determining the proper treatment of early arriving facts, first consider the value of the record in the case that a foreign key cannot be set to the related dimension. If there is value in having the fact loaded regardless of the available dimension record, then either an inferred member or unknown member approach is appropriate. If there is no value in loading the fact without a related dimension record, then the record should either be queued until a related dimension record is loaded or ignored indefinitely.
The LeapFrogBI platform makes implementing all late arriving fact strategies a cinch. Queues & ignored records can be quickly set up using filters within standard design patterns. LeapFrogBI sets failed foreign key lookups to the related dimension unknown member surrogate by default; no development required. Inferred members are easily handled using standard design patterns that reuse the fact table source records to first load a dimension. Contact us to learn about the LeapFrogBI platform and discuss late arriving dimension treatment options.