This blog introduces extensions to the SuperNova data modeling technique. SuperNova is designed to create a view layer on the basis of an existing DataVault, that makes reporting and analytics more agile. The extensions that I am proposing add perspectives to SuperNova.
There are three perspectives:
- the historical perspective (aka the Supernova view)
- the active perspective: shows the latest version of business keys that are still present in the source system and
- the latest perspective: shows the latest version of business keys, even if the business keys are no longer present in the source system.
So, to give you an idea what the proposed perspectives look like, I will give an example. Let’s say there is a source table CUSTOMER with the following fields: CUSTOMER_NUMBER, NAME, ADRESS and CITY. In our data vault this gets transformed to a hub H_CUSTOMER with business key CUSTOMER_NUMBER and a satellite with the rest of the fields.
The historical perspective would look like this:
CREATE VIEW CRM_HST.DM_CUSTOMER ( H_CUSTOMER_ID , START_LOAD_DTS , END_LOAD_DTS , CUSTOMER_NUMBER , NAME , ADRESS , CITY ) AS ( SELECT H_CUSTOMER_VERSIONS.H_CUSTOMER_ID , H_CUSTOMER_VERSIONS.START_LOAD_DTS , H_CUSTOMER_VERSIONS.END_LOAD_DTS , CUSTOMER_NUMBER , NAME , ADRESS , CITY FROM CRM.H_CUSTOMER_VERSIONS LEFT OUTER JOIN CRM.S_CUSTOMER ON H_CUSTOMER_VERSIONS.CUSTOMER_ID = S_CUSTOMER.CUSTOMER_ID AND ( H_CUSTOMER_VERSIONS.START_LOAD_DTS >= S_CUSTOMER.START_LOAD_DTS AND H_CUSTOMER_VERSIONS.END_LOAD_DTS <= S_CUSTOMER.END_LOAD_DTS ) )
The actual perspective would look like this:
CREATE VIEW CRM_ACT.DM_CUSTOMER AS ( SELECT H_CUSTOMER_ID , START_LOAD_DTS , END_LOAD_DTS , CUSTOMER_NUMBER , NAME , ADRESS , CITY FROM CRM_HST.DM_CUSTOMER WHERE END_LOAD_DTS = '2099-12-31' )
The last perspective would look like this:
CREATE VIEW CRM_LST.DM_CUSTOMER AS ( WITH H_CUSTOMER_Ordered AS ( SELECT ROW_NUMBER () OVER ( PARTITION BY H_CUSTOMER_ID ORDER BY END_LOAD_DTS DESC ) as Rownumber , H_CUSTOMER_ID , START_LOAD_DTS , END_LOAD_DTS , CUSTOMER_NUMBER , NAME , ADRESS , CITY FROM CRM_HST.H_CUSTOMER_VERSION ) SELECT H_CUSTOMER_ID , START_LOAD_DTS , END_LOAD_DTS , CUSTOMER_NUMBER , NAME , ADRESS , CITY FROM H_CUSTOMER_Ordered WHERE Rownumber = 1 )
Note that the actual perspective and the last perspective are based on the historical perspective.
Because the perspectives are identically described they can be “hot swapped” for each other to change the temporal focus of a query without rewriting any SQL.
To make it clear that we are creating an As-Is Datamart, I like to start hub-based views with DM_, and link-based views with FT_.
I will now describe some use cases that show the benefits of the extensions I am proposing.
Missing dimension records
A dimensions gets its data from a different system of record. This system of record has a different regime of cleaning. Therefore you end up with facts that point to dimension records that are no longer present in the dimension.
By sourcing the dimension from the latest perspective, all business keys are loaded, so no more missing dimension records.
Clean up your data vault
An enterprise data warehouse should follow the source. So, new source records lead to inserts, changed source records lead to updates, and deleted source records lead do deletes.
By making use of the last perspective and the active perspective, finding the records that need to be deleted is trivial. This query:
SELECT H_CUSTOMER_ID FROM CRM_LST.H_CUSTOMER WHERE H_CUSTOMER_ID NOT IN ( SELECT H_CUSTOMER_ID FROM CRM_ACT.H_CUSTOMER )
returns a list of surrogate keys that have to be deleted from the hub and its satellites.
Data warehouse as an archive
In computer science an archive is “a long-term storage area for backup copies of files or for files that are no longer in active use”. Many stakeholders think of a datawarehouse that way.
The last perspective shows the latest version of business keys, even if the business keys are “no longer in active use”. So, if records are removed from the source system they can still be retrieved using the last perspective, thereby treating a data warehouse as an archive.
Type 1 Dimensions
The attributes of a type 1 dimension always reflect the current value. So, use the active (or latest, if your data warehouse is an archive) perspective, and you’re done.
Having said that, it is often the case that type 1 dimensions are the first choice of stakeholders, because, well, they find the concept of time traveling hard to grasp. If, at a later time, they demand history, you can simply swap the active perspective with the historical perspective without rewriting a single SQL query.
Hybrid Slowly Changing Dimensions
The hybrid SCD can use the historically correct values for “as was” reporting by default, but can also use the current values for “as is” reporting. This is achieved by selecting both the historical and the active versions of attributes and co-locate them in a single dimension.
To achieve this:
– join the active view with the historical view, based on the surrogate key.
– select the type 2 attributes, plus the start date and the end date, from the historical view. Select the rest of the attributes from the active view.
SELECT H.H_CUSTOMER_ID , H.START_DATE , H.END_DATE , H.CITY HISTORICAL_CITY , A.CITY CURRENT_CITY , ... FROM CRM_ACT.H_CUSTOMER A JOIN CRM_HST.H_CUSTOMER H ON A.CUSTOMER_ID = H.CUSTOMER_ID
would allow a user to group by HISTORICAL_CITY, while filtering on CURRENT_CITY.
Note that a hybrid SCD can be more like a type 2 if each attribute is only present once (some historical, others active), or more like a type 3 if one or more attributes are present with both the historical and the active value.
I hope I have shown the usefulness of the proposed extensions. If you have any questions, please feel free to contact me.