Design Pattern – Data Warehouse Deletes
What to do when a data warehouse source system deletes a record? There are probably hundreds of answers to this question. If you want to review a few of the options, then check out this post. For the sake of brevity, I am going to focus the situation as follows:
- The data source hard deletes records (no trace)
- We cannot implement change data capture or last change to detect deletes
- We must compare all source records to previously collected records to isolate deletions
- The source table was used to load dimension
- We will soft delete the dimension records with a datetime field
- We will soft delete records in PSA with a datetime field
Now that we know how we are going to detect source system deletes, how we are going to handle the delete in our consolidation area, and how we are going to handle the delete in our target model (dimension) we can begin designing a solution that meets the prescribed behavior. The best place to start is at the stage component.
Remember that this is a component that simply collects records and loads them into our controlled destination. No transformations are performed. Typically, we try to setup incremental extraction in the stage component. However, in this case we are not going to do so. Instead we will collect all records from our source table. This is required because we need to compare all records in our source to the records in our PSA.
Nothing special is going on here. Just a basic select to collect the desired records from the source system.
Generate Dimension Design Pattern
We can use a design pattern to generate the needed components to load and maintain our dimension. In this case I need to create a PSA component, apply load status tracking, and create and maintain a dimension. That’s perfect fit for DP-1001.
Preferably we would like to prevent from merging all records collected in our stage component into PSA (our consolidation area). So, we can setup an incremental extraction between the stage and PSA components. This is as simple as collecting a statistic on the [ModifiedDate] field and setting up a filter that includes only new records in the PSA component.
Next, we need to add a field in PSA’s output field list that will track the soft deletes. This field will have a datetime data type with a NULL value by default. This is easily achieved by creating a NULL value constant expression and using it as the source to an output field.
Transformation – Detect Source Deletes
Now that we have our data consolidated and a soft delete field created, we can setup the process which will compare Stage (all records from source) to PSA (all records we have previously collected). This is also pretty simple. We need to create a transformation component which uses the stage component as its source and right join to the PSA component.
To ensure that we return only records that exist in the PSA component, but not in the Stage component (deleted from the source) we need to look for null values in the stage component after performing the right outer join. We also are not concerned with records in PSA that were deleted and already detected on a prior execution. To make this happen we simply filter out any PSA records that have a non-null value in the deleted field.
Notice that we also set the transformation to only return unique records. In PSA we may have numerous versions of a records, but in this component we only want to return a distinct list. The output field list should include only the field that represents the source system primary key which in this example is [ProductID]. Note that this [ProductID] field is source from PSA, not the Stage component. Remember that the Stage component represents the null side of the right outer join.
Our transformation component will now return all records that have been deleted from the source system since the last data collection process. It is probably a good idea to create a table to instantiate these records. To do so, we click the box to “write table” in the transformation component header.
Update PSA [Deleted] Field
The records that were collected in our [DeletedProducts] transformation can now be joined to a PSA update component to update the [Deleted] field value in PSA. In the PSA update component we setup our join on the source system primary key, and set [Deleted] to and expression which calls the function GetDate(). We also need to set the “dls”, dimension load status, field to 0 meaning not processed to ensure the records is collected on the next dimension update execution.
Now our deleted record tracking process is all setup. It took two components to add the deleted record tracking behavior.
Add Soft Delete Field to Dimension
Finally, we need to add the [Deleted] attribute to our dimension’s output field list. This attribute should be tracked on an SCD1 (current) basis to ensure that all versions of a record are updated when a source system delete occurs.
Our final data flow is represented below. This is a typical dimension load design patter, DP-1001, with two additional components added to track source system deletes in PSA.
Setup Execution Order – Precedence
We want the deleted record detection to occur prior to the dimension load. We can accomplish this be setting the precedence definition to first run the Stage & PSA component, then run the delete detection component, and finally load the dimension. Basically, we stuff the two new components in the middle of a typical precedence definition.
All done. It took quite a bit of explanation to describe this process, but the actual time required to implement was under 10 minutes. Keep in mind that there are many different ways to detect and deal with source system deletions. This design pattern delivers the behavior needed in our example scenario. Be sure to carefully evaluate your situation before trying to develop a solution.
Two new design patterns are now in production; DP-4100 & DP-4101. These two patterns generate required components and related logic need to maintain soft deletes in PSA. DP-4100 can be used in cases where you are able to detect deletes and stage these deleted records. DP-4101 can be used when you cannot detect deletes. 4101 requires that you stage a copy of the source system records, and a full compare against PSA will be completed to figure what no longer exists in the source system. In both cases, implementation takes on a couple minutes.