Anyone who has been involved with data warehousing knows that there are plenty of things that can go wrong. Mistakes can be made when researching the data sources, collecting requirements, designing a dimensional model, etc… Assuming all of the analyst and modeling work has been done perfectly, we still need to be sure that the developed ETL delivers the prescribed behavior. This short post itemizes a few key ETL considerations.
Understand How the Source System Behaves
How does the source system behave? Does it soft delete? Hard delete? Does the source system keep history? These are a few of the questions we must answer. We can boil this down to the basic functions of a source database; CRUD.
Create – When a record gets created in the source system we typically want to collect it and load it into our target model. How will we determine that a new record was created? If we can’t determine this reliably, then maybe we need to collect every record and compare it to what was collected previously to get the new records. The collecting tactics can differ, but we must be able to identify newly created records.
Read – Read operations are often overlooked, but should be studied in most cases. When is the source system under the most load? We don’t want to try to collect large datasets during heavy load times as this will likely degrade performance for all users.
Update – How does the source system handle updates? In an ideal world the source will keep some amount of history by applying a versioning strategy. When collecting records from the source we need to be sure to include those records which have changed since our last collection (assuming we are applying an incremental collection process). This may be as simple as finding all records with an updated field value that is newer than our last collection process. On the other hand, many systems (or files) will not flag records with update times. In this case we likely will need to do a full compare to previously collected records to find what has changed. Other strategies such as CDC or last change should also be considered.
Delete – Deletes are probably the most overlooked as well as the toughest to deal with function. First we need to figure out how we will determine a delete occurred. There are a number of strategies itemized in this prior post. Dealing with the delete in the persistent staging area as well as the target model require careful consideration. In most cases we will not want to delete the records from our target model, but it is critical that these records be flagged (soft deleted).
You may be thinking that this post is a no brainer. True, there is no rocket science going on here. Sometimes we miss the forest for the trees though. Taking a little time upfront to consider these four functions will save a lot of pain in the long run. Handling reads is usually just a matter of scheduling the ETL process at a low load time and possibly using nolock logic. Inserts and updates are typically handled as a pair. Deletes will often require a separate data flow that may or may not merge with the insert & update logic.
In summary, take time early in your project to determine how to handle all four CRUD functions in your data sources.