Loading Dimensions From Multiple Sources
Dimensions records are often derived from multiple sources. When these sources are in sync, it is a simple enough process to combine the sources to create a dimension load source that includes all required fields. When sources are not in sync, however, other measures must be taken to ensure that the dimension load is accurate.
Consider this situation
- Dimension has 3 sources
- One source is the primary & will always report first
- The other two sources will provide values for a subset of dimension attributes at some point after the initial dimension record load.
- The dimension has both SCD 1 & SCD 2 attributes spread across all three sources.
One solution to this scenario is to create a dimension load component for each source. The required steps are listed here.
- Create a Stage & PSA component for each of the three sources.
- Using the primary (first available) source create and load the dimension. Attributes that are not available must be seeded in the dimension with a default value such as NULL.
- For all other sources create a transform component which joins to the dimension. Filter the dimension records to include only current versions by using LFBI_Current = 1. The output of the transform will include a combination of values from the source (PSA) & dimension. This can also be done in the dimension created in step 4, but it is often useful to seperate this logic from the actual dimension component.
- Create a dimension component for each non-primary source with the same structure as the primary dimension. Name the dimension the same as the primary & add a suffix to identify it as a post initial load. Example; Primary dimension is called “Product”. Subsequent dimension is called “Product_Category”.
- In the target database create a synonym for each subsequent load. In our example we would create a synonym called “d_Product_Category” which points to “d_Product”
- Set the “CreateTarget” configuration value to “0” for each subsequent load dimension component. The component will load the already created dimension using a synonym.
- When tracking load status, create a PSA Update component for each PSA using the dimension as the update source. Include logic in the join on clause which identifies the PSA record that was loaded. This will include a combination of the dimension key & identification of a relevant field that has been loaded from the source PSA. In complex scenarios it may be necessary to set the transforms to “write table” to ensure that adequate data is available to identify the PSA record that has been loaded.
Things to consider:
- Ensure that the RecordTime property is accurately defined for each source in PSA. When history is not tacked, it is okay to set this value to cover all time (1900-01-01). When history is tracked & a date is provided by the source system, use this value. When history is not tracked & a date is not provided, a value will need to be derived based on business rules.
- Dimension load sources must be at the appropriate grain. If using d1000, then the source must be limited to 1 version of each record (only one instance of the dimension key). When using d1001 the source can have multiple versions of the each record (multiple records with the same dimension key), but must not have multiple versions with identical RecordTime values as a record can only have one value at a point in time.
- SCD tracking definitions must be identical in all of the dimension component including the inital load & all subsequent loads. This will ensure that history is tracked accurately on each load.
The following illustration describes this approach to loading a dimension from multiple sources.