In one of my last projects, we faced the requirement of designing our Data Warehouse-Core with bi-temporal historization.
“Bi-temp…. what”, you may now ask.
Well, I’ll try to explain. In all Data Warehouse (DWH) projects, you are confronted with the question whether you have to be able to tell, at what point in time you knew (or reported) some information. The answer to this question leads to slowly changing dimensions (SCD, or the often so called technical historization). Depending on your answer, you usually either use SCD Type 1 or Type 2. This is a well-known design method, and is widely covered in literature, so I’m not going to elaborate further on this topic.
“Yeah, I understand all this, but what is this bi-temporal stuff?”
Sometimes, your source applications (usually business systems), where we get our data to load into the DWH, can handle time validity. In this case, you have to handle two different histories, hence the bi-temporal.
One example for this is keeping track of customer addresses, another example would be the sales price history of an item (as shown in the picture).
The handling of the business history comes in two flavors. Either it’s the one, where you get only one date (usually the date, when the record starts being valid), as seen in the example above. The other possibility is, that a date range is used, where the data is valid between a “valid from” and a “valid to” date.
To make sense of the data, you will have to look at both date elements. To illustrate this, let’s look at how the prices for the books have changed in time. On the left, you see these changes illustrated. So in the beginning of August, the book was poised to start selling in September with a price of 50. Then shortly before the launch, the price was reduced first to 48 then to 45 and so on. One thing to point out is, that your business history can be changed after the fact, so to say. You can see an example of this in the last row.
The questions, in respect to the business history, you usually have to answer are:
- What was the Sales Price that was reported at some specific date? Here you have to filter your data on both the technical and the business date. So in our example, the correct sales price to show on the 18th of December would be 46.
- What were the current active Sales Prices at some specific date, or phrased a little differently what were the known Sales Prices at that date? If you ask this question at the 7th of September, you get to records one sales price of 45 for the 1st of September, and 46 valid from the 1st of October.
“So, how did you model this business history?”
In my opinion, the first question you have to ask is, whether you really, really need the business history. If there is no urgent need, like a federal regulation or driving business advantage, then I would not advise you to incorporate it. If you are not sure, build an Operational Data Store (ODS) or a Data Lake, where you keep all the relevant information, that will allow you later to ad this second history dimension to your DWH.
The second question to ask is, whether you need to know, when your data has changed. So in the example above, is it relevant to know, when the sales price that is valid from the first of September has changed to 45? If you don’t need to know this, then you could use the business history to fill the SCD2 validity timestamps.
In our project, there was no way around using both technical and business history. We modelled the business valid from attribute as an additional part of the business key. So in the sales price table, the records are identified by the attributes “Item No.” and “Valid From”.
“What are the challenges?”
If your source system is using date ranges, things become a bit more challenging. First of all, this method allows for gaps. So there might be periods, where you have no valid data. Also, this way is more error prone, as we had to learn in our project. Let’s delve further into this with an example. In our system, we have a Sales Price of 46, that is valid from 31.10.2016 until 31.12.9999 (Initial situation). If you now want to have a new price, that starts on the 30.11.2016, your source system has to make sure, that it also ends the first record correctly at the 29.11.2016 (Correct change). If this does not always happen correctly, you might end up with multiple records, that are valid (Data quality issue). So make sure, that you check the incoming data for date range overlaps, where more than one version can be true, and think of a way to deal with it.
Incorporating data into a central model becomes more difficult. When you combine data from multiple tables with different validity into one Core entity, you have to create new validity ranges. I’ll write a more in depth post about how you can go about it in a next post.
“So, what’s your conclusion?”
Bi-temporal historization is not so difficult to set up. Whether it works well, depends on a large part on the quality and reliability of the source data. In my experience, it will add a lot of complexity and cost to your solution. So make sure, that you really need both the technical as well as the business history before implementing. And make sure, that you have a good data quality and / or cleansing process in place.