Traditionally business intelligence is an enterprise solution, as only big businesses have multiple sources that require integration and analysis. It is assumed that the rest are small fishes that could be able to analyse their data with simpler accessible tools or develop proprietary solutions. Recently we had a customer with limited budget wanting to visualise and analyse their data. Due to complexity of their data they needed some ETL solution. Let’s see what options are available for those who have data, but not cash.
Let’s get back in history. In the 2000’s there was a time when the ETL has been “done properly”; data sources were loaded and properly integrated into data warehouses, the amount of data was manageable, structure well-defined and servers were located in the same building where the end users sat. Then everything changed when the Cloud was invented. Businesses didn’t need to worry about hardware anymore, unleashed data feeds brought a lot of data. Around these times salesmen started telling us that “ETL is no longer needed”. I can see that around 2015 many of the promising open-source ETL tools were abandoned. I found it an amusing sign of changing times.
New times bring new solutions, but let’s still observe what are those few free ETL tools left available. Although the Internet remembers everything and many tools are still available for download, I have deliberately excluded those which haven’t released new versions in years.
Talend Data Integration
I think this brand is familiar to many people who deal with data. Talend has a great range of free data tools, balancing it with paid versions and features.
It has nice graphical interface, many pre-configured transformations, many essential endpoint connectors, including Cloud ones. I believe, it gives everything you need from ETL tool.
Pentaho Data Integration
Pentaho Data Integration, also known as Kettle, was acquired by Hitachi. Good for them for supporting a few community data projects. Like Talend, Pentaho has a graphical interface and connectors to many of the databases, including some Cloud, Big Data and even streaming options. Both tools require downloading driver files for non-Cloud endpoints, however, Talend has a built-in wizard to help with that.
I like the transformations interface of Pentaho better than the Talend’s one, more intuitive I would say.
Another good thing both tools have is a scheduling mechanism. Data loads and transformations could be executed at certain time, queues and conditions for executions could be set up.
I think both of these tools could perform full-blown ETL to be used in production environments, which is currently happening across the world anyway.
Here we are stepping into the no graphical interface area. Solid A+ for the documentation. Most of the open-source projects are writing documentation, even user guides with an assumption that everyone has the same programming background, which is not the case. Transformalize documentation is written for anyone with no previous knowledge on the subject, which is heartwarming. Also, its developers thought about use cases, i.e. how to implement the basic load and incremental loads for data warehouse, how to schedule tasks and even provides links to free data visualisation tools.
To make development easier, there is an extension for Visual Studio Code. However, it didn’t work for me. Otherwise, composing XML scripts looks straightforward. All the transformations are done through pure SQL.
This project looked like it is almost dying, but suddenly the new version has been released in October 2019, therefore it tightly squeezes into our overview. It also means, that all drivers are outdated.
Scriptella also doesn’t have a graphical interface. You have to write XML files for executions, and most of the transformations need to be written in pure SQL. Connection strings are written in the same XML file, and if you have username and password connection, these should be provided explicitly. This is valid for Transformalize as well.
Scriptella’s documentation is not describing how to schedule tasks. I’m guessing, this could be done via the same means as Transformalize suggests.
I still can see how these two tools could be used by a single developer with all data on premise, i.e. safely used internally. It is free, minimalist, manageable.
All four have a drawback of getting into new tool and finding ways for the exact transformation and load tasks.
I believe this is all what you can get for free to do ETL in traditional way. As I said in the beginning, times have changed, and new technologies are available to us now, for example, data streaming. I have deliberately excluded such tools from this review, they are built for different solutions.