photo by pasja1000
Recently I have bumped into this blog by David Nicholson. I have heard an opinion that the Graph database and the Data Vault look similar before. This idea comes to mind when you see diagrams next to each other. (You can draw them to be identical, but it doesn’t prove anything, because it’s a schematic view.) Anyway, David went further than others and suggested to combine the power of two: to implement Data Vault on graph database platform. He didn’t provide any details on how this could be achieved, and I haven’t found any mentions in the internet if anyone had tried to do that, so I have decided to give it a go myself.
A Data Vault model and a screenshot of visualisation from Neo4J sample movie graph
Data Vault is a Data Warehousing design methodology. You can find the DV basics in one of my blogs. Graphs are the special databases design methodology which represent the idea of the mathematical graph theory. Data Vault main function is to store and integrate all the data from multiple sources organisation has. Graph database main function is to retrieve the complex relationships between data.
The basic concept underneath both, that makes them seem so similar, is that they are aligned with the business concepts and natural relationships between concepts, as business see them. They capture only those entities and relationships that exist, and don’t enforce the relationships that are supposed to be there, but are missing for some reason, i.e. entities, their relationships (also attributes in case of Data Vault) are captured independently. E.g. if Customer class is missing from the data, there is no record in the Customer_Class link; these records could have arrived from some other data source and have another business meaning of Customers, or it’s a data quality issue which could have been missed. In Graph case, there is no “default” edge between Customer and Class nodes if somehow customer didn’t go through classification.
Graph building blocks are nodes and edges; Data Vault building blocks are hubs, links and satellites. While there is an obvious match between nodes and hubs, edges and links, graph has no structure to support something satellite-like. They could not be degraded to node attributes because satellites store all the versions of same data, historical and current. I could see a few ways to overcome this in my implementation of Data Vault on graph platform. Initially I was looking toward using SQL Server, so I could have implemented Satellites as relational SCD2 tables, because you can mix two types of tables on this platform. However, I have decided to stick to the pure graph solution and implemented satellites as nodes.
In graph database edge is a relationship between just two nodes. Neo4J documentation recommends using verbs for them. Tom Hanks acted in Apollo 13, Ron Howard directed this movie, etc. Links in the Data Vault may also represent verbs, or events, but they are not so simple. Links could be not just two-way, but N-way relationships. The rule of thumb for links is to find a Unit of Work – a relationship between all the concepts participating in relationship at one moment from the business perspective. A classical example of unit of work is when Customer purchases a Product from a Salesperson in a Store. It is not recommended to break up this link to a set of two-way links as this breaks the natural business relationship and even could lead to ambiguity. E.g. there is no direct relationship between a Customer and a Product they have purchased, because the same product could be purchased multiple times, but they could be linked in a context of Sale, so that details of each sale transaction could be found. I.e. it is important to have a keyed instance for every natural relationship, and all the relationship parties should be only linked to it in the graph implementation.
Another obstacle is the fact that graph is directed, while in Data Vault links capture any relationship. Directed graph is perfect for hierarchies, in case of Data Vault they are Same-As and Hierarchy links. In other cases directed links could be an issue.
I have used Neo4J for my implementation, because I found it flexible, simple enough to learn and mature tool.
As I said, I have implemented satellites as nodes. They are connected with the hub nodes by Current and Non-Current edges. It means that the ETL tool should remove a Current edge and replace it with a Non-Current in case of update as we want to preserve old “satellite” record as it is. I have also thought about having a date/time stamp of the link record as hub to satellite edge name, but have decided against it. I should have added a type of the hub-satellite relationship as a part of edge name though, e.g. PersonDetail-Current, PersonAddress-Non-Current.
Links are edges, so they have degraded to labelled lines on the Neo4J visualisation, but they have attributes as well. I found it easy to create Same-As links, they match the graph idea of things so naturally. However, the rest of links was harder. As I have mentioned, I had to break up the multi-way natural business relationships into two-way directed edges and give them meaningful names, that’s more work than usual to me.
I like that all the attributes in Neo4J are stored in key-value form, so two nodes of a same type can have different set of attributes. It is very cool for implementing satellites populated from different data sources, if an attribute value is not captured there, there is no need in extra “cells” with empty values. This is good for disk space and gives flexibility around satellite structure, but also makes it complicated for routine reporting as data structure becomes unpredictable.
Graph database is suited for investigation-type analysis, which is only one of the purposes of the Data Vault. The idea of DV is being a single source of truth, which includes data integration, also history tracing and aggregations. Users should be able to get any type of answer from the Data Vault because of its enterprise-wide scope. While it is so easy to integrate with graphs, they are not so good for the consistent piling up of the history.
The most you could achieve on Graph platform with Data Vault implementation is a cut-down version, which is missing a lot of essential functions and features. The recommended architecture of Data Vault includes a semantic layer for prepared integrated slices of data to analyse and report on. I can suggest the graph database to be a part of this layer instead of fitting a square peg in a round hole. Also, I have mentioned other uses of Graph databases in Data Warehousing in my previous blog.