DataVault, DevOps and the Minimum Viable Product

One of the practices of DevOps is the Minimum Viable Product (MVP): a product with just enough features to create value for the business. The MVP is then improved in subsequent rapid releases, each delivering more value.

So, in DataVault, what would be a Minimum Viable Product? Well, as you may know, DataVault uses three different types of artefacts: Hubs, Links and Satellites. A Hub contains a list of unique business keys, a Link contains the relations between Hubs and a Satellite stores detailed information about a business key (Hub Satellite) or a constellation of hubs (Link Satellite). A Link Satellite also stores measures.

So, let’s say we have a Date Hub (January, February, etc.), an Employee Hub (John, Peter), a Link that stores the relation between the two Hubs, and a Link Satellite that stores the hours worked. With these artefacts we would be able to create the following report:

Hours per Month | January | February | March
----------------|---------|----------|------
John            |     100 |      100 |   100
Peter           |     100 |       80 |   120

A Minimum Viable DataVault is a Link and Link Satellite plus its surrounding Hubs and Satellites. This is enough to create a product (a report, in this case) that creates value for the business.
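As a sketch, the artefacts of such a minimal product could be declared like this (table and column names are illustrative, not prescribed by DataVault):

```sql
-- Minimal DataVault for the hours-worked report (illustrative names).
CREATE TABLE H_DATE (
    H_DATE_ID      INT          PRIMARY KEY,   -- surrogate key
    DATE_NUM       INT          NOT NULL,      -- business key, e.g. 20150101
    LOAD_DTS       TIMESTAMP    NOT NULL
);

CREATE TABLE H_EMPLOYEE (
    H_EMPLOYEE_ID  INT          PRIMARY KEY,   -- surrogate key
    EMPLOYEE_NR    VARCHAR(20)  NOT NULL,      -- business key
    LOAD_DTS       TIMESTAMP    NOT NULL
);

-- The Link relates the two Hubs...
CREATE TABLE L_HOURS (
    L_HOURS_ID     INT PRIMARY KEY,
    H_DATE_ID      INT NOT NULL REFERENCES H_DATE (H_DATE_ID),
    H_EMPLOYEE_ID  INT NOT NULL REFERENCES H_EMPLOYEE (H_EMPLOYEE_ID),
    LOAD_DTS       TIMESTAMP NOT NULL
);

-- ...and the Link Satellite stores the measure (hours worked).
CREATE TABLE LS_HOURS (
    L_HOURS_ID     INT NOT NULL REFERENCES L_HOURS (L_HOURS_ID),
    LOAD_DTS       TIMESTAMP NOT NULL,
    HOURS          DECIMAL(5,2),
    PRIMARY KEY (L_HOURS_ID, LOAD_DTS)
);
```

A report query joining these four tables (plus the Hubs' Satellites, if any) is the product that ships in the first release.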

To improve the product we would then create subsequent releases, each with one more Link with its Link Satellite plus its surrounding Hubs and Satellites. This would allow the business users to create more – or more elaborate – reports with every release.

After a few releases there will be less need to add Hubs and Hub Satellites, because they were already added in an earlier release. So, the releases get smaller (eventually consisting of just a Link plus Link Satellite), giving the business user added value, while at the same time giving the developer more time to enhance the product with BusinessVault artefacts or SuperNova views and perspectives.

 

Smart Date Hub


The most important dimension in every data warehouse is the date dimension. Almost every fact table has one or more surrogate keys that join to this date dimension. Normally, dates from a source table have to be translated into surrogate keys, by looking them up in the date dimension.

However, in DataVault, there is a nifty little trick to overcome this:

  • create a source table that contains the dates and the details (Month, Year, etc.) you want in your data marts.
  • create a hub and a satellite to store the date and the details. Use the same date format as your source systems: if they use a numeric format for dates (like 20151015), use that format.
  • create a surrogate key that matches the business key.

The combination of hub and satellite looks like this:

DATE_ID  | DATE_NUM | DATE              | DAY | MONTH    | YEAR | ...
---------|----------|-------------------|-----|----------|------|----
20150101 | 20150101 | January 1, 2015   |   1 | January  | 2015 | ...
20150102 | 20150102 | January 2, 2015   |   2 | January  | 2015 | ...
...      | ...      | ...               | ... | ...      | ...  | ...
20151231 | 20151231 | December 31, 2015 |  31 | December | 2015 | ...
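The hub-and-satellite combination above could be created and loaded like this (a minimal sketch; the satellite column names are adjusted to avoid SQL reserved words, and the source table name is hypothetical):

```sql
-- Smart date hub: the surrogate key simply equals the numeric business key.
CREATE TABLE H_DATE (
    DATE_ID    INT       PRIMARY KEY,   -- surrogate key, e.g. 20150101
    DATE_NUM   INT       NOT NULL,      -- business key, in source format
    LOAD_DTS   TIMESTAMP NOT NULL
);

CREATE TABLE S_DATE (
    DATE_ID    INT         NOT NULL REFERENCES H_DATE (DATE_ID),
    DATE_VALUE DATE        NOT NULL,    -- "DATE" in the table above
    DAY_NR     INT,                     -- "DAY"
    MONTH_NAME VARCHAR(10),             -- "MONTH"
    YEAR_NR    INT,                     -- "YEAR"
    LOAD_DTS   TIMESTAMP   NOT NULL
);

-- Loading needs no lookup: the surrogate key is the business key itself.
INSERT INTO H_DATE (DATE_ID, DATE_NUM, LOAD_DTS)
SELECT DATE_NUM, DATE_NUM, CURRENT_TIMESTAMP
FROM   SRC_DATE;   -- hypothetical source table from step 1
```

Because the key is deterministic, any loading process can compute it directly from the source date, without joining the hub.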

When loading a Link that contains a date, use the date as a peg-legged link.

Pros

  • Links can be partitioned, based on the smart surrogate key of the date hub. Partitioning divides a single large table into multiple logical chunks; it improves the performance of queries against the partitioned table, enables piecemeal backup / restore of historical data, and speeds up data loading.

  • Fact tables can also be partitioned, thereby gaining the same benefits.
  • Fact tables can be queried without joining the date dimension.
  • Every date gets stored in the hub, so no more missing dimension records.
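For example, because the fact table carries the smart date key directly, a month of data can be selected without touching the date dimension at all (the fact table and measure are hypothetical):

```sql
-- Total hours booked in January 2015: the smart key 20150101..20150131
-- encodes the date range, so no join to the date dimension is needed.
SELECT SUM(HOURS)
FROM   FT_HOURS
WHERE  DATE_ID BETWEEN 20150101 AND 20150131;
```

A predicate like this also lets the optimizer eliminate partitions when the fact table is partitioned on DATE_ID.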

 

 

SuperNova perspectives – Where stars are born

 

This blog introduces extensions to the SuperNova data modeling technique. SuperNova is designed to create a view layer on top of an existing DataVault that makes reporting and analytics more agile. The extensions that I am proposing add perspectives to SuperNova.

Perspectives

There are three perspectives:

  • the historical perspective (aka the SuperNova view)
  • the active perspective: shows the latest version of business keys that are still present in the source system and
  • the latest perspective: shows the latest version of business keys, even if the business keys are no longer present in the source system.

So, to give you an idea of what the proposed perspectives look like, I will give an example. Let’s say there is a source table CUSTOMER with the following fields: CUSTOMER_NUMBER, NAME, ADRESS and CITY. In our data vault this gets transformed into a hub H_CUSTOMER with business key CUSTOMER_NUMBER and a satellite S_CUSTOMER with the rest of the fields.

The historical perspective would look like this:

CREATE VIEW CRM_HST.DM_CUSTOMER 
( 
            H_CUSTOMER_ID
,           START_LOAD_DTS
,           END_LOAD_DTS
,           CUSTOMER_NUMBER
,           NAME
,           ADRESS
,           CITY 
) 

AS (

  SELECT    H_CUSTOMER_VERSIONS.H_CUSTOMER_ID
  ,         H_CUSTOMER_VERSIONS.START_LOAD_DTS
  ,         H_CUSTOMER_VERSIONS.END_LOAD_DTS
  ,         CUSTOMER_NUMBER
  ,         NAME
  ,         ADRESS
  ,         CITY
 
  FROM      CRM.H_CUSTOMER_VERSIONS

  LEFT OUTER JOIN CRM.S_CUSTOMER
    ON      H_CUSTOMER_VERSIONS.H_CUSTOMER_ID = 
              S_CUSTOMER.H_CUSTOMER_ID
    AND (            
            H_CUSTOMER_VERSIONS.START_LOAD_DTS >= 
              S_CUSTOMER.START_LOAD_DTS
          AND 
            H_CUSTOMER_VERSIONS.END_LOAD_DTS <= 
              S_CUSTOMER.END_LOAD_DTS
        )
)

The active perspective would look like this:

CREATE VIEW CRM_ACT.DM_CUSTOMER AS (

  SELECT    H_CUSTOMER_ID
  ,         START_LOAD_DTS
  ,         END_LOAD_DTS
  ,         CUSTOMER_NUMBER
  ,         NAME
  ,         ADRESS
  ,         CITY

  FROM      CRM_HST.DM_CUSTOMER

  WHERE     END_LOAD_DTS = '2099-12-31'
)

The latest perspective would look like this:

CREATE VIEW  CRM_LST.DM_CUSTOMER AS (
  
  WITH       H_CUSTOMER_Ordered AS
  (
    SELECT   ROW_NUMBER () OVER (
               PARTITION BY H_CUSTOMER_ID 
               ORDER BY END_LOAD_DTS DESC
             ) as Rownumber
    ,        H_CUSTOMER_ID
    ,        START_LOAD_DTS
    ,        END_LOAD_DTS
    ,        CUSTOMER_NUMBER
    ,        NAME
    ,        ADRESS
    ,        CITY

    FROM     CRM_HST.DM_CUSTOMER
  )

  SELECT     H_CUSTOMER_ID
  ,          START_LOAD_DTS
  ,          END_LOAD_DTS
  ,          CUSTOMER_NUMBER
  ,          NAME
  ,          ADRESS
  ,          CITY
  
  FROM       H_CUSTOMER_Ordered

  WHERE      Rownumber = 1
)

 

Note that the active perspective and the latest perspective are based on the historical perspective.

Because the perspectives have identical column definitions, they can be “hot swapped” for each other to change the temporal focus of a query without rewriting any SQL.
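For example, a report written against the active perspective can be pointed at full history by swapping only the schema prefix; the column list stays identical (the query itself is illustrative):

```sql
-- Customers per city, current state only:
SELECT   CITY, COUNT(*)
FROM     CRM_ACT.DM_CUSTOMER
GROUP BY CITY;

-- Same question over all historical versions: only the schema changes.
SELECT   CITY, COUNT(*)
FROM     CRM_HST.DM_CUSTOMER
GROUP BY CITY;
```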

To make it clear that we are creating an As-Is Datamart, I like to start hub-based views with DM_, and link-based views with FT_.

Use Cases

I will now describe some use cases that show the benefits of the extensions I am proposing.

Missing dimension records

A dimension gets its data from a different system of record. This system of record has a different cleaning regime. Therefore you end up with facts that point to dimension records that are no longer present in the dimension.

By sourcing the dimension from the latest perspective, all business keys are loaded, so no more missing dimension records.
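As a sketch (the target dimension table is hypothetical), sourcing the dimension from the latest perspective could look like this:

```sql
-- Every business key ever seen is loaded into the dimension,
-- including keys that have since been cleaned out of the source,
-- so facts can always find their dimension record.
INSERT INTO DWH.DIM_CUSTOMER (H_CUSTOMER_ID, CUSTOMER_NUMBER, NAME, ADRESS, CITY)
SELECT H_CUSTOMER_ID, CUSTOMER_NUMBER, NAME, ADRESS, CITY
FROM   CRM_LST.DM_CUSTOMER;
```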

Clean up your data vault

An enterprise data warehouse should follow the source. So, new source records lead to inserts, changed source records lead to updates, and deleted source records lead to deletes.

By making use of the latest perspective and the active perspective, finding the records that need to be deleted is trivial. This query:

SELECT     H_CUSTOMER_ID

FROM       CRM_LST.DM_CUSTOMER

WHERE      H_CUSTOMER_ID NOT IN (

    SELECT H_CUSTOMER_ID

    FROM   CRM_ACT.DM_CUSTOMER
)

returns a list of surrogate keys that have to be deleted from the hub and its satellites.
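Wrapped in DELETE statements, the cleanup could be sketched as follows (the satellite table and its key column are taken from the earlier examples, and assumed to carry the hub's surrogate key):

```sql
-- Delete satellite rows first, then the hub rows:
-- anything not in the active perspective is no longer in the source.
DELETE FROM CRM.S_CUSTOMER
WHERE  H_CUSTOMER_ID NOT IN (SELECT H_CUSTOMER_ID FROM CRM_ACT.DM_CUSTOMER);

DELETE FROM CRM.H_CUSTOMER
WHERE  H_CUSTOMER_ID NOT IN (SELECT H_CUSTOMER_ID FROM CRM_ACT.DM_CUSTOMER);
```

Deleting the satellite rows before the hub rows keeps foreign key constraints satisfied throughout.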

Data warehouse as an archive

In computer science an archive is “a long-term storage area for backup copies of files or for files that are no longer in active use”. Many stakeholders think of a data warehouse that way.

The latest perspective shows the latest version of business keys, even if the business keys are “no longer in active use”. So, if records are removed from the source system they can still be retrieved using the latest perspective, thereby treating a data warehouse as an archive.

Type 1 Dimensions

The attributes of a type 1 dimension always reflect the current value. So, use the active (or latest, if your data warehouse is an archive) perspective, and you’re done.

Having said that, it is often the case that type 1 dimensions are the first choice of stakeholders, because, well, they find the concept of time traveling hard to grasp. If, at a later time, they demand history, you can simply swap the active perspective with the historical perspective without rewriting a single SQL query.

Hybrid Slowly Changing Dimensions

The hybrid SCD can use the historically correct values for “as was” reporting by default, but can also use the current values for “as is” reporting. This is achieved by selecting both the historical and the active versions of the attributes and co-locating them in a single dimension.

To achieve this:

  • join the active view with the historical view, based on the surrogate key.
  • select the type 2 attributes, plus the start date and the end date, from the historical view; select the rest of the attributes from the active view.

This query:

SELECT     H.H_CUSTOMER_ID
,          H.START_LOAD_DTS
,          H.END_LOAD_DTS
,          H.CITY HISTORICAL_CITY
,          A.CITY CURRENT_CITY
,          ...

FROM       CRM_ACT.DM_CUSTOMER A

JOIN       CRM_HST.DM_CUSTOMER H
ON         A.H_CUSTOMER_ID = H.H_CUSTOMER_ID

would allow a user to group by HISTORICAL_CITY, while filtering on CURRENT_CITY.
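For instance, an “as was” report filtered on the current situation could be sketched as follows (the hybrid dimension name and filter value are hypothetical):

```sql
-- Count customer versions by where the customer lived at the time,
-- but only for customers who currently live in Amsterdam.
SELECT   HISTORICAL_CITY
,        COUNT(*)
FROM     DM_CUSTOMER_HYBRID          -- hypothetical hybrid dimension
WHERE    CURRENT_CITY = 'Amsterdam'
GROUP BY HISTORICAL_CITY;
```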

Note that a hybrid SCD can be more like a type 2 if each attribute is only present once (some historical, others active), or more like a type 3 if one or more attributes are present with both the historical and the active value.

I hope I have shown the usefulness of the proposed extensions. If you have any questions, please feel free to contact me.