Reporting and analytics computations are usually resource-intensive, so they are rarely executed on the same server that runs the crucial business application. Instead, the data is copied to dedicated servers, where it can be dissected and studied for insights. Another layer of complexity is combining data from multiple sources in one place, which is a challenge when similar concepts share no common unique identifiers or the sources run on different platforms. For example, one of the hardest tasks is matching people between datasets: a full name is rarely unique, so you fall back on email address or phone number for a better match, but even that doesn't guarantee a result. Even small organisations run multiple applications and websites that collect and produce data, and the problem only gets worse in bigger companies.
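To make the matching problem concrete, here is a minimal sketch of pairing people across two datasets that share no unique identifier. The field names (`email`, `phone`, `full_name`) and the sample records are invented for illustration, and the key-preference logic is just one possible strategy, not a standard algorithm.

```python
def normalise(value):
    """Lower-case and strip a field so formatting differences don't block a match."""
    return value.strip().lower() if value else None

def match_key(person):
    """Prefer email, then phone; a full name alone is rarely unique, so it comes last."""
    return (
        normalise(person.get("email"))
        or normalise(person.get("phone"))
        or normalise(person.get("full_name"))
    )

def match_people(source_a, source_b):
    """Return pairs of records from the two datasets that share a match key."""
    index = {match_key(p): p for p in source_b}
    return [(p, index[match_key(p)]) for p in source_a if match_key(p) in index]

# Hypothetical records from two systems: same person, different spellings.
crm = [{"full_name": "Jane Doe", "email": "jane@example.com"}]
webshop = [{"full_name": "J. Doe", "email": "Jane@Example.com", "phone": "555-0101"}]
pairs = match_people(crm, webshop)
```

Note that this still fails whenever two systems hold different emails for the same person, which is exactly why identity matching stays hard in practice.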
So, it’s all about copying the data and fitting it together. Several terms describe this process, and they are easy to confuse. Let’s clear them up.
ETL vs CDC
ETL stands for Extract-Transform-Load, and the term describes the process of moving data around and manipulating it along the way. ETL tools can typically connect to multiple data platforms and apply transformations to the data in memory; afterwards, the result is written somewhere. ETL logic can also be written in most modern programming languages, but many organisations see this as the less preferable option because of the overhead of writing code from scratch and the unknown cost of supporting it afterwards.
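The three steps can be sketched in a few lines of hand-rolled code. Plain lists stand in for the source and destination systems here; in a real pipeline each step would use a database driver or an API client.

```python
def extract(source):
    """Extract: read raw rows from the source system."""
    return list(source)

def transform(rows):
    """Transform: reshape the rows in memory (rename fields, derive values)."""
    return [
        {"customer": row["name"].title(), "total_cents": int(row["total"] * 100)}
        for row in rows
    ]

def load(rows, destination):
    """Load: write the transformed rows to the destination store."""
    destination.extend(rows)

# Invented sample data standing in for a source table.
orders = [{"name": "jane doe", "total": 12.5}]
warehouse = []
load(transform(extract(orders)), warehouse)
```

Writing this yourself is trivial for one table; the hidden cost the paragraph mentions appears once you multiply it by hundreds of tables, error handling, and scheduling.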
CDC means Change Data Capture, and it is one of the ETL patterns for copying data. It is used to capture every change to a record: any change of values as well as the record’s deletion. Back in the day, this pattern was implemented with ETL tools that either compared the latest copy of the data with the source or checked the source record’s update timestamp. The performance was awful, and there was a real chance of missing some updates. Dedicated CDC tools have changed things drastically: they read the database transaction log to trace changes, so no change goes unnoticed, and the impact on the source database’s performance is minimal. To throw in another term, the side result (or the end goal, depending on how you see it) of CDC is data replication: an exact copy of the source data in another location.
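A toy model shows why the log-based approach catches what timestamp polling misses. Instead of re-reading the source tables, the tool replays every transaction-log event against the replica, so intermediate states and deletions are both observed. The event format below is invented for illustration; real transaction logs are binary and tool-specific.

```python
def apply_event(replica, event):
    """Apply one transaction-log event (insert/update/delete) to the replica."""
    op, key = event["op"], event["key"]
    if op in ("insert", "update"):
        replica[key] = event["row"]
    elif op == "delete":
        replica.pop(key, None)

# An invented slice of a transaction log for one record.
transaction_log = [
    {"op": "insert", "key": 1, "row": {"status": "new"}},
    {"op": "update", "key": 1, "row": {"status": "paid"}},  # intermediate state captured
    {"op": "delete", "key": 1},                             # deletion captured too
]

replica = {}
history = []
for event in transaction_log:
    apply_event(replica, event)
    history.append(dict(replica))
```

A polling job that ran only after the third event would see nothing at all, while the log replay records the "new" and "paid" states before the record disappears.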
Which one do you need?
CDC only copies the data, along with its intermediate states. For analytical and reporting purposes this is just the first step. CDC tools don’t allow much transformation, yet transformations are needed to interpret the data for users and, very importantly, to integrate data from the different sources an organisation has. These operations are performed by an ETL tool on the data warehouse or data store layer to make all the data consistent and easy to report on.
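As a sketch of such a warehouse-layer transformation: two replicated tables describe the same sales with different field names and units, and a conforming step folds them into one consistent reporting table. Both schemas are invented for illustration.

```python
def conform_shop(row):
    """Map the online-shop schema onto the common reporting schema."""
    return {"customer": row["customer_name"], "revenue": row["amount"]}

def conform_app(row):
    """Map the mobile-app schema (cents) onto the common schema (currency units)."""
    return {"customer": row["user"], "revenue": row["price_cents"] / 100}

# Invented replicas of two source tables, as landed by a CDC tool.
shop_sales = [{"customer_name": "Jane Doe", "amount": 20.0}]
app_sales = [{"user": "Jane Doe", "price_cents": 500}]

reporting = [conform_shop(r) for r in shop_sales] + [conform_app(r) for r in app_sales]
```

Only after this step can a report sum revenue per customer across both sources without double definitions of the same concept.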
These days, when data volumes are growing exponentially, transformations can be done in memory, i.e. the result of a transformation is only shown on the screen (or in the report) but never stored in the database. This is called ELT: Extract-Load-Transform. ELT only makes sense when the repeated cost of recomputing the transformation for every user is lower than the cost of writing its results to disk, which happens once you deal with terabytes and petabytes of data.
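The trade-off can be sketched as follows: the raw events are loaded as-is, and the aggregate a report needs is recomputed in memory on every request instead of being persisted. The data and the aggregation are invented examples.

```python
# Raw events loaded into the store untransformed (the "EL" part of ELT).
raw_events = [
    {"country": "DE", "amount": 10},
    {"country": "DE", "amount": 5},
    {"country": "FR", "amount": 7},
]

def revenue_by_country(events):
    """The "T" part: computed in memory per report request, never written to disk."""
    totals = {}
    for e in events:
        totals[e["country"]] = totals.get(e["country"], 0) + e["amount"]
    return totals

# Every report recomputes the aggregate; this pays off when storing every
# derived result would cost more than recomputing it on demand.
report = revenue_by_country(raw_events)
```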