When designing your Data Vault, you first start with defining the Hub tables. The Data Vault is made up of many subject areas, and each Hub is the central piece to each core subject.
Each core subject area in your organisation needs its own Hub table. Bear in mind, this is an organisational view of the subject areas, which ultimately may not be viewed the same in every department. Many departments may have customer information, for example, but it may not be stored or defined in the same way by every department. Data Vault is about defining a centralised view of the data. That all starts with the Hub.
The Hub table stores all the business keys for that particular subject area, along with three data warehouse designed and controlled columns.
The Business Key is the only piece of business data that is stored in this table. Any descriptive information is stored in the Satellites, and any relationship information is stored in the Links. This makes Hub tables very narrow in design. The business key is typically the primary key in the source table. It should uniquely identify every record in that subject area, from a single source.
In some scenarios, where you have multiple sources populating a Hub, a business key may no longer be unique in the Hub context, but don’t fret, the data warehouse controlled columns in the Hub table help identify where each record came from and make sure it stays unique.
The Surrogate Key is the unique identifier and primary key for the hub. There is some debate about how the surrogate key should be defined. Some feel that it should be a hashed version of the business key, or business key concatenate. Others suggest to just use the business key, or a concatenated version of the business key. We think that sequences, using unique integers, still reign supreme. Performance should be the main consideration when defining the surrogate key, so I’ll leave you to make a judgement call on what best serves your data warehouse.
The naming of the data warehouse columns are entirely up to the design team. For the surrogate key, we postfix the subject area, such as customer, with “_Sqn” (which stands for sequence), becoming Customer_Sqn.
The Date Time Stamp identifies when each record was first inserted into the Data Vault Hub. We simply name this Insert_Time_Date.
The Record Source identifies the source database where the record originated from. This is especially important if you have multiple sources in your data warehouse. We name this column Record_Source_Id. Typically we would have a metadata table containing the names of all the sources, and any other relevant information, with a unique id that becomes the Record_Source_Id in our Hubs, Links and Satellites.
The real challenge is identifying what should be a Hub table, or more importantly, what are the core subject areas in your organisation. This is especially true in the Enterprise Data Warehouse, where you have multiple sources and multiple business teams. What may be the view of one business team, is not necessarily the same as another. This Hub tables require a centralised view of all data, so you need to compromise with the various business teams to achieve a happy ending when it comes to gathering these requirements.
One solution we have discovered to this issue is using the BEAM requirements gathering methodology as an easy way to identify subject areas and business events. It helps join the dots between the data warehouse and business teams, in an agile and easily understandable way. Check it out here if you’re interested in learning more!
Once you have designed your Hub tables, you can move on to connecting some Satellite tables, which have descriptive information about the Hubs (or in some cases Links), before lastly connecting all your Hubs together using Link tables. I’ll have a couple more blogs coming soon on Satellite and Link table design.