Free-Photos / Pixabay
Microsoft has introduced a new feature of graph database in SQL Server 2017. This is not a new idea, graph platforms were on the market for a while now. I have decided to dive in to understand why graph databases are special and how they could be used in Business Intelligence.
Unlike relational database, Graph database focuses on relationships between entities, which are called nodes. Such design allows quicker and simpler analysis of hierarchies and trace links between entities through their relationships with other entities, like investigations. To me such database model looks less technical than relational tables model, more business-oriented. Such thinking became more popular these days with raise of social networks: you can find someone not just by their name, but also by involvement in activities or preferences in music or food.
I have started digging into graph tables with this example on GitHub. From the SQL Server perspective, graphs are not much different from relational tables. I like the fact that you can have both relational and graph tables in the same database and even mix them in one query. Creating nodes and edges is not different from typical SQL syntax. Also, SSIS works with graph tables as it would work with normal tables. You can implement usual data flows to populate both node and edge tables, update their attributes or delete records. The only gotcha is node and edge identifiers generated by SQL Server have long names with the unique code as a part of it. Address them by pseudo-column $node_id and $edge_id instead. Also for edge ETL you’ll need to use lookup for node identifiers.
In the example I’ve mentioned above, graph tables are created based on the existing Wide World Importers sample database. It includes a demo of the same query for original relational tables and graph tables, returning the same result. Graph query is much shorter. But when I have run the queries, I have noticed that at first graph tables performed much slower than normal tables. I have analysed execution plan and noticed that relational tables are indexed. After I have added columnstore indexes to nodes and edges, graph tables query has returned result twice faster than relational tables query.
There are a couple of major differences between relational tables and graph tables, which require a bit of work to comprehend. SQL Server went for a directed edge architecture, it means that edge could only link two nodes and the relationship is directed. I.e. the fact that Bob befriended Jane doesn’t mean that Jane has befriended Bob; and you can’t include any other entity into this relationship, e.g. a night club where they’ve met, it will be a different edge. This is a feature of graph databases one should be aware of when querying via edge. Another difference which was hard for me to digest is that one edge table doesn’t necessarily connects same two nodes all the time, unlike foreign key relationships between the relational tables. For example, all the things I like, movies, songs and videogames, despite they belong to different nodes, could be stored in one edge.
It is obvious that SQL Server implementation is very young compare to other graph databases on the market. The only function available for querying graph tables in SQL Server is MATCH with a combination of standard SQL WHERE clauses. Other Graph platforms allow using graph query languages, for example, Neo4J, the most popular graph database these days, supports Cypher. It provides much more powerful querying options than just one function.
Neo4J is a separate platform, where data needs to be fed in by some process before you can query it, which is not a problem if you are fine with the standalone graph database. SQL Server could be a better option for a supplementary data analysis layer as a part of the Data Warehouse on the same platform, there is no need to move data anywhere and it could be easily integrated with the rest of the data organisation has.
Graph is represented the best with the visualisation. While Neo4J have a visualisation component built into the platform, in Microsoft environment you go to Reporting Services or Power BI for that function. So, these are two different approaches to splitting functionality between tools.
Kevin Tardivel in his blog explores more uses for graphs in Data Warehousing other than the obvious analysis. It is definitely good for the data lineage, data catalogue and master data management. All of these are important supplementary structures of any Data Warehouse, and it is a challenge for every BI project to find the right tool for these tasks. I would definitely give it a go on my next project. Kevin also mentions a Data Vault there. This is such exciting topic for me, that it deserves another blog.