SuperNova perspectives – Where stars are born

 

This blog introduces extensions to the SuperNova data modeling technique. SuperNova is designed to create a view layer on the basis of an existing DataVault, that makes reporting and analytics more agile. The extensions that I am proposing add perspectives to SuperNova.

Perspectives

There are three perspectives:

  • the historical perspective (aka the Supernova view)
  • the active perspective: shows the latest version of business keys that are still present in the source system and
  • the latest perspective: shows the latest version of business keys, even if the business keys are no longer present in the source system.

So, to give you an idea what the proposed perspectives look like, I will give an example. Let’s say there is a source table CUSTOMER with the following fields: CUSTOMER_NUMBER, NAME, ADRESS and CITY. In our data vault this gets transformed to a hub H_CUSTOMER with business key CUSTOMER_NUMBER and a satellite with the rest of the fields.

The historical perspective would look like this:

CREATE VIEW CRM_HST.DM_CUSTOMER 
( 
            H_CUSTOMER_ID
,           START_LOAD_DTS
,           END_LOAD_DTS
,           CUSTOMER_NUMBER
,           NAME
,           ADRESS
,           CITY 
) 

AS (

  SELECT    H_CUSTOMER_VERSIONS.H_CUSTOMER_ID
  ,         H_CUSTOMER_VERSIONS.START_LOAD_DTS
  ,         H_CUSTOMER_VERSIONS.END_LOAD_DTS
  ,         CUSTOMER_NUMBER
  ,         NAME
  ,         ADRESS
  ,         CITY
 
  FROM      CRM.H_CUSTOMER_VERSIONS

  LEFT OUTER JOIN CRM.S_CUSTOMER
    ON      H_CUSTOMER_VERSIONS.CUSTOMER_ID = 
              S_CUSTOMER.CUSTOMER_ID
    AND (            
            H_CUSTOMER_VERSIONS.START_LOAD_DTS >= 
              S_CUSTOMER.START_LOAD_DTS
          AND 
            H_CUSTOMER_VERSIONS.END_LOAD_DTS <= 
              S_CUSTOMER.END_LOAD_DTS
        )
)

The actual perspective would look like this:

CREATE VIEW CRM_ACT.DM_CUSTOMER AS (

  SELECT    H_CUSTOMER_ID
  ,         START_LOAD_DTS
  ,         END_LOAD_DTS
  ,         CUSTOMER_NUMBER
  ,         NAME
  ,         ADRESS
  ,         CITY

  FROM      CRM_HST.DM_CUSTOMER

  WHERE     END_LOAD_DTS = '2099-12-31'
)

The last perspective would look like this:

CREATE VIEW  CRM_LST.DM_CUSTOMER AS (
  
  WITH       H_CUSTOMER_Ordered AS
  (
    SELECT   ROW_NUMBER () OVER (
               PARTITION BY H_CUSTOMER_ID 
               ORDER BY END_LOAD_DTS DESC
             ) as Rownumber
    ,        H_CUSTOMER_ID
    ,        START_LOAD_DTS
    ,        END_LOAD_DTS
    ,        CUSTOMER_NUMBER
    ,        NAME
    ,        ADRESS
    ,        CITY

    FROM     CRM_HST.H_CUSTOMER_VERSION
  )

  SELECT     H_CUSTOMER_ID
  ,          START_LOAD_DTS
  ,          END_LOAD_DTS
  ,          CUSTOMER_NUMBER
  ,          NAME
  ,          ADRESS
  ,          CITY
  
  FROM       H_CUSTOMER_Ordered

  WHERE      Rownumber = 1
)

 

Note that the actual perspective and the last perspective are based on the historical perspective.

Because the perspectives are identically described they can be “hot swapped” for each other to change the temporal focus of a query without rewriting any SQL.

To make it clear that we are creating an As-Is Datamart, I like to start hub-based views with DM_, and link-based views with FT_.

Use Cases

I will now describe some use cases that show the benefits of the extensions I am proposing.

Missing dimension records

A dimensions gets its data from a different system of record. This system of record has a different regime of cleaning. Therefore you end up with facts that point to dimension records that are no longer present in the dimension.

By sourcing the dimension from the latest perspective, all business keys are loaded, so no more missing dimension records.

Clean up your data vault

An enterprise data warehouse should follow the source. So, new source records lead to inserts, changed source records lead to updates, and deleted source records lead do deletes.

By making use of the last perspective and the active perspective, finding the records that need to be deleted is trivial. This query:

SELECT     H_CUSTOMER_ID

FROM       CRM_LST.H_CUSTOMER

WHERE      H_CUSTOMER_ID NOT IN (

    SELECT H_CUSTOMER_ID

    FROM   CRM_ACT.H_CUSTOMER
)

returns a list of surrogate keys that have to be deleted from the hub and its satellites.

Data warehouse as an archive

In computer science an archive is “a long-term storage area  for backup copies of files or for files that are no longer in active use”. Many stakeholders think of a datawarehouse that way.

The last perspective shows the latest version of business keys, even if the business keys are “no longer in active use”. So, if records are removed from the source system they can still be retrieved using the last perspective, thereby treating a data warehouse as an archive.

Type 1 Dimensions

The attributes of a type 1 dimension always reflect the current value. So, use the active (or latest, if your data warehouse is an archive) perspective, and you’re done.

Having said that, it is often the case that type 1 dimensions are the first choice of stakeholders, because, well, they find the concept of time traveling hard to grasp. If, at a later time, they demand history, you can simply swap the active perspective with the historical perspective without rewriting a single SQL query.

Hybrid Slowly Changing Dimensions

The hybrid SCD can use the historically correct values for “as was” reporting by default, but can also use the current values for “as is” reporting. This is achieved by selecting both the historical and the active versions of attributes and co-locate them in a single dimension.

To achieve this:

– join the active view with the historical view, based on the surrogate key.
– select the type 2 attributes, plus the start date and the end date, from the historical view. Select the rest of the attributes from the active view.

This query:

SELECT     H.H_CUSTOMER_ID
,          H.START_DATE
,          H.END_DATE
,          H.CITY HISTORICAL_CITY
,          A.CITY CURRENT_CITY
,          ...

FROM       CRM_ACT.H_CUSTOMER A

JOIN       CRM_HST.H_CUSTOMER H
ON         A.CUSTOMER_ID = H.CUSTOMER_ID

would allow a user to group by HISTORICAL_CITY, while filtering on CURRENT_CITY.

Note that  a hybrid SCD can be more like a type 2 if each attribute is only present once  (some historical, others active), or more like a type 3 if one or more attributes are present with both the historical and the active value.

I hope I have shown the usefulness of the proposed extensions. If you have any questions, please feel free to contact me.