Why You Need A Persistent Staging Area (PSA)
Data warehousing closely resembles life in general. Just about when we feel like everything is going our way something unexpected happens. Before we know it we are on a completely different path working to regain that comfortable feeling. Luckily, there are some safeguards in life. Insurance, for example, reduces the risks associated with the events that we don’t plan for but know will occur. A Persistent Staging Area (PSA) can be considered a type of data warehouse insurance.
What is a Persistent Staging Area?
- Source system data is loaded into PSA without transformation
- Records are never deleted from PSA (archives may occur)
- PSA stores all unique records loaded (tracks history)
- Many more fields are stored in PSA than required by the data warehouse
Why Invest in a Persistent Staging Area?
First, it is important to point out that a PSA is not a requirement for creating a data warehouse. In fact, if we could be guaranteed that the data warehouse that is currently envisioned will never change in any way, then PSA would not be needed at all. Of course, designing ETL processes with the premise of a data warehouse never changing is not a great idea.
Since we are pretty much guaranteed that requirements will change over time, it is appropriate to take steps to limit the impact of these changes. This is where PSA comes into the picture. Below is a list of scenarios where PSA provides a clear benefit.
New Data Required – The current data mart doesn’t include a piece of information that has become important to the business. There is a good chance that this is already contained in the PSA including all history. In some cases the source system may not persist this data. PSA to the rescue.
Attribute SCD change from 1 to 2 – At the time the initial data warehouse was developed dimension attributes were required to report their current version only. Now the requirement has changed and some attributes need to track history. No problem. All historical version are stored in PSA. The dimension can be reloaded without impacting the source system.
Auditing – A PSA has all versions of all records ever loaded. In the event that an audit is required the PSA contains a dependable source.
Limit Impact on Source System – A PSA becomes the reliable source for all downstream processes. PSA can then be accessed repeatedly without impacting the source system. Often source systems have small widows of time where access is granted for extracts. PSA simplifies downstream process timing by buffering them from source system availability.
Catastrophic Failure – A unplanned event resulted in the loss of a dimension. All of the required data is in the PSA enabling the dimension to be rebuilt quickly.
Bug Fix – It may never happen, but in the rare case that a developer makes a mistake in the often very complex ETL processes, we can mitigate the impact of such an error with the existence of a PSA.
Limited Source System Persistence – The source system stores data online for 2 years. The data warehouse has been extracting data daily for 10 years. It is not possible to rebuild the data warehouse based on source system data. PSA solves this problem.
What are the Costs?
Just like other insurance, there are costs associated with maintaining a PSA. Storage is an obvious cost. In all but the most extreme cases the cost of storage is negligible. There is also a costs associated with processing PSA loads. The time required to process a PSA should also be taken into consideration.
Including a PSA in ETL design offers is a hedge against unplanned events that are going to happen. The cost of doing so is quickly overcome when PSA turns a major problem into a planned solution. The good news for LeapFrogBI developers is that building a PSA only takes a few minutes per table. Developers can choose which fields track history by simply categorizing them appropriately. The rest (sql, statistics, logging, configuration, timestamps, etc…) is handled by LeapFrogBI automatically without further developer involvement.