Jonny Dixon, Senior Product Manager at Dremio, explains why it’s time for a Hadoop migration.
While there are several alternatives available to enterprises looking to migrate away from Hadoop, the migration journey can be daunting. However, with a clear migration map, enterprises can make the leap to a better analytics infrastructure.
Why migrate now?
In its day, Hadoop was a genuinely revolutionary development in data management. And there are some excellent features within the Hadoop framework.
For example, the distributed file system, HDFS, provides fault tolerance by replicating data across multiple nodes, helping to keep data highly available even in the event of hardware failure. Hadoop was also unusually flexible for its time, supporting structured, semi-structured and unstructured data. This enabled enterprises to store and process data in its native format, without the need for extensive and expensive pre-processing such as extract, transform and load (ETL). That alone was an advantage over traditional, highly structured data warehouses.
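As a simple illustration of that schema-on-read flexibility, the sketch below (the paths and field names are hypothetical) reads raw JSON straight off HDFS with PySpark, with no upfront modelling or ETL:

```python
from pyspark.sql import SparkSession

# Hypothetical example: query raw JSON landed on HDFS in its native format.
spark = SparkSession.builder.appName("schema-on-read-sketch").getOrCreate()

# Spark infers the schema at read time, so no ETL is needed before querying.
events = spark.read.json("hdfs:///data/raw/events/")  # illustrative path
events.printSchema()

events.createOrReplaceTempView("events")
spark.sql("SELECT COUNT(*) AS event_count FROM events").show()
```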
However, for all these advantages, there are pain points with Hadoop, especially around rising costs and reduced efficiency. Hadoop requires a distinct level of expertise to set up and maintain, and administering a Hadoop cluster is complex and time-consuming, especially as the cluster grows. The hidden costs of this necessary administration are exacerbated by how difficult it is to hire people with the right Hadoop skills.
Further, Hadoop has some inherent security vulnerabilities, including somewhat weak authentication protocols and authorisation policies. In the same vein, the lack of native encryption at rest or in transit requires additional libraries – and the skills to implement those technologies.
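To give a sense of the extra hardening work involved, the sketch below forwards a few of the standard Hadoop security properties through a PySpark session. It is illustrative only: a real deployment also needs Kerberos keytabs, key management and distribution-specific configuration.

```python
from pyspark.sql import SparkSession

# Illustrative only: the kind of security settings Hadoop leaves to the operator.
# The spark.hadoop.* prefix forwards each property to the underlying Hadoop configuration.
spark = (
    SparkSession.builder
    .appName("hardened-hadoop-client")
    .config("spark.hadoop.hadoop.security.authentication", "kerberos")  # Kerberos authentication
    .config("spark.hadoop.hadoop.security.authorization", "true")       # service-level authorisation
    .config("spark.hadoop.hadoop.rpc.protection", "privacy")            # encrypt RPC traffic
    .config("spark.hadoop.dfs.encrypt.data.transfer", "true")           # encrypt HDFS block transfers
    .getOrCreate()
)
```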
At one time, the advantages of Hadoop overcame these drawbacks and hidden costs. But over the years, many enterprises have found that Hadoop is cheap to provision and expensive to run. And while the case for migration is compelling, the task can appear complex and formidable – especially for businesses that have committed so much to the Hadoop platform.
Time to shop around
Naturally, one of the most tempting approaches is simply to move from an on-premise solution to the cloud, especially if it’s thought the migration can be a simple ‘lift and shift’.
However, this isn't always feasible, and it's rarely simple. Some of the applications running on-premise may require specific versions or configurations that aren't supported in a cloud environment. And while the cloud is often expected to reduce costs, it can be more expensive in some cases, especially for organisations with large volumes of data that drive up storage and processing costs. Moreover, vendor lock-in is a genuine concern when operating Hadoop in the cloud, as it ties a business to the cloud provider for the maintenance, security and accessibility of the solution.
Another, more radical, solution is to migrate the data in the Hadoop system and the applications built over it to a cloud data warehouse. While the cloud has given the old data warehouse architecture a new lease of life, it’s a very different proposition compared to just moving the Hadoop data lake to the cloud. It requires a deep understanding of both architectures and involves significant data engineering work to reformat and restructure the data for the data warehouse – particularly for organisations with large and complex data sets.
A third, and the most impactful, option is to migrate to a data lakehouse. With the lakehouse, business intelligence (BI) and reporting tools have direct access to the data they need, without complex and error-prone ETL processes. And because the data is stored in open file formats like Apache Parquet and open table formats like Apache Iceberg, data scientists and machine learning (ML) engineers can build models directly on the data, whether structured, semi-structured or unstructured, depending on their use cases.
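As a rough sketch of what that looks like in practice, assuming the Iceberg Spark runtime is on the classpath and using hypothetical catalog, path and table names, an engineer might expose Parquet data as an Apache Iceberg table and query it directly with SQL:

```python
from pyspark.sql import SparkSession

# Hypothetical lakehouse setup: an Iceberg catalog over object storage.
# Assumes the iceberg-spark-runtime package is available to the Spark session.
spark = (
    SparkSession.builder
    .appName("lakehouse-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://analytics/warehouse")  # illustrative location
    .getOrCreate()
)

# Register existing Parquet data as an Iceberg table, then query it with plain SQL.
orders = spark.read.parquet("s3a://analytics/raw/orders/")  # illustrative path
orders.writeTo("lake.sales.orders").using("iceberg").createOrReplace()

spark.sql("SELECT region, SUM(amount) AS revenue FROM lake.sales.orders GROUP BY region").show()
```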
There’s something quite powerful about migrating to a data lakehouse – federating the data, rather than endlessly copying it, means it’s not much of a migration at all. The task is not so much a ‘lift and shift’ as a matter of swapping out the query engine, defining the virtualised semantic layer, and only then migrating HDFS objects to the cloud.
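A minimal sketch of that first step, assuming an open engine such as Apache Spark pointed at the existing HDFS data (the paths, columns and view names are illustrative): the semantic layer is just a set of virtual views, so nothing has to move yet.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("semantic-layer-sketch").getOrCreate()

# The data stays where it is on HDFS; only the query engine changes.
spark.read.parquet("hdfs:///warehouse/sales/transactions/") \
    .createOrReplaceTempView("transactions_raw")

# A virtualised, business-friendly view that BI and reporting tools can query directly.
spark.sql("""
    CREATE OR REPLACE TEMP VIEW monthly_revenue AS
    SELECT date_trunc('month', tx_date) AS month,
           SUM(amount)                  AS revenue
    FROM transactions_raw
    GROUP BY date_trunc('month', tx_date)
""")

spark.sql("SELECT * FROM monthly_revenue ORDER BY month").show()
```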
Mapping migration to a data lakehouse
To make the most of their data, enterprises need a modernisation map, starting with an honest assessment of where they stand today. The next step is to decouple storage and compute, and to replace the Hadoop query engine with an open-source one. This brings faster performance, more intuitive SQL and data federation, making it easier for tools and vendors to support the new architecture.
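To illustrate the federation point, here is a hedged sketch (both paths are hypothetical, and the s3a connector is assumed to be configured): a single open engine joins data that still sits on HDFS with data already landed in cloud object storage, with compute fully decoupled from either store.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("federation-sketch").getOrCreate()

# One engine, two storage systems: legacy HDFS and new cloud object storage.
customers = spark.read.parquet("hdfs:///warehouse/crm/customers/")  # still on the Hadoop cluster
orders = spark.read.parquet("s3a://analytics/landing/orders/")      # already in object storage

# A federated join: neither dataset is copied into the other system.
summary = (
    customers.join(orders, "customer_id")
    .groupBy("segment")
    .count()
)
summary.show()
```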
From there, Hadoop data should be migrated to object storage – whether on-premise, in the cloud or hybrid. This reduces costs, improves security and brings greater scalability and simpler administration. Finally, enterprises can start to build on their open data lakehouse – migrating tables to an open table format like Apache Iceberg. The result is a modern, scalable, flexible lakehouse that delivers strong performance, manageability and ease of use, reached through a process that takes the risk out of migration and gives business users better data access from step one.
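A rough sketch of that final step, under clearly stated assumptions: the files have already been copied to object storage (for example with a tool such as hadoop distcp), the Iceberg SQL extensions are enabled, and the catalog, table and path names are hypothetical. Iceberg's add_files procedure then registers the existing Parquet files as an Iceberg table without rewriting them.

```python
from pyspark.sql import SparkSession

# Assumes the iceberg-spark-runtime package and Iceberg SQL extensions are available.
spark = (
    SparkSession.builder
    .appName("iceberg-migration-sketch")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://analytics/warehouse")  # illustrative location
    .getOrCreate()
)

# Create the target Iceberg table (schema is illustrative).
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id BIGINT, customer_id BIGINT, amount DOUBLE, tx_date DATE
    ) USING iceberg
""")

# Register the Parquet files already copied to object storage, without rewriting them.
spark.sql("""
    CALL lake.system.add_files(
        table => 'sales.orders',
        source_table => '`parquet`.`s3a://analytics/landing/orders/`'
    )
""")
```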