Some RDBMS support SQL statement MERGE, which also referred as upsert. This is a very common task in data warehousing: we compare the source dataset with what has been recorded in DW, insert new values and update those which have changed since the last load. It is a very convenient, as everything could be done in just one statement.PostgreSQL doesn’t support MERGE, therefore we have to use workarounds for our data warehouse project that uses this platform. If you google for examples, the most obvious one everyone recommends to use is INSERT INTO … ON CONFLICT [ conflict_target ] conflict_action . This seems like a cool feature: minimal coding, no hassle, simply works. I liked it and gave it a go.
On the current project we use Data Vault methodology for the data warehouse, therefore there are three types of tables. Hubs and links are insert-only tables, they only capture the first occurrence of the record’s key in source and the rest is simply ignored in the subsequent loads. Satellites store most of the data and the history of changes.
It turned out though that ON CONFLICT clause has some caveats. It can only deal with conflicts on indexed fields. Which means that it cannot be used on satellites – an index that includes all satellite columns is pointless if not a dangerous thing. But this could still work on hubs and links as an index on business key (hub) or combination of keys (link) is achievable.
Another caveat came from another Postgres feature: NULL is not equal to NULL. Therefore, in N-way link if one of the sides is NULL, which is an acceptable pattern in Data Vault, there will be no conflict on insert, and the same value will be inserted over an over again. To avoid that I had to include NULL-handling logic in both index and insert conflict handler. This made code much clumsier, but that still looked right.
Findings on implementation
This solution has been implemented, and real live data started arriving at the data warehouse. That’s how we found that this solution is not viable. The major issue with that is that a primary key sequence increases dramatically. Every time the INSERT runs, sequence is increased by the number of the records which it tried to insert.
E.g. if our table has a 100 rows in it, and 1 new row arrived, then the new row autogenerated sequence number for that will be 201. However, if another record arrives after another 10 loads, its sequence number will be 1211. Now imagine what happens with more realistic database, where tables contain millions of records, and they are loaded multiple times a day.
Although Postgres sequence’s max value is quite big, there’s a risk of running out of the available values in a few years, which isn’t sustainable for the data warehouses. Also, such giant numbers aren’t human-friendly.
Although these generated numbers are not shown to the end users, data engineers like myself and report analysts often use these for quick testing while developing something. With a wall of numbers, it isn’t that easy to eyeball an issue.
I’ve started looking for another solution for the simple upsert into hubs and links, and here was another not very pleasant surprise. It turned out that any other solution works faster than no action on INSERT conflict. I have written INSERT statement with the check if such value already exists in the table, which I expected to be slow. Quite the opposite, this extra step on insert works faster than the insert that conflicted. Both of them are using index, but apparently extra query is still faster than an insert rollback.
So, the first and the most popular answer in Google is not always the optimal solution in a specific case. It’s always worth trying multiple options.