Jonny Dixon, Senior Product Manager at Dremio, outlines the must-have characteristics of the latest development in data architecture – the data lakehouse.
The data lakehouse is the next generation of data architecture that combines the strengths of data warehouses and data lakes in a unified platform. Traditionally, organisations have used a two-tiered architecture, relying on data lakes for expansive data storage and data warehouses for enterprise-grade Business Intelligence (BI) and reporting. While this setup has served organisations well, it struggles to keep up with the exponential growth of data and increasing user demands.
Enter the data lakehouse, which offers a unified architecture combining the reliability and data quality of a data warehouse with the scalability, flexibility, and cost-effectiveness of a data lake. This solution allows organisations to scale and manage their data at an enterprise level without the complexities of the traditional two-tiered architecture.
However, building a data lakehouse presents its own challenges, such as ensuring data quality, managing metadata, and providing fast and reliable data access. A data lakehouse also needs certain characteristics, each of which requires careful consideration and implementation if the organisation is to realise the full benefits of the architecture.
The must-have characteristics of a data lakehouse
To accommodate BI and data science use cases, the data lakehouse must provide a unified platform for data engineers, analysts, and scientists. This allows them to share and collaborate on data from diverse sources, including operational databases, applications, IoT sensors, and social media feeds. For instance, a data scientist could create a machine learning model to predict market prices and then pass the model to a data analyst to forecast future revenue scenarios.
Additionally, the platform must allow multiple concurrent workloads to run on the same copy of the data and facilitate the sharing of relevant tools and outputs between team members.
For increased efficiency, the data lakehouse should automate the configuration and management of its components, allowing data teams to execute projects with less effort. This includes providing a user-friendly graphical interface that facilitates data discovery, transformation, curation, and querying for data engineers and analysts.
Instead of relying on data engineers, data analysts should be able to self-serve, enabling them to browse files within the data store, select relevant ones, and preview their contents. They should be able to apply filters or reformat the data for analytics using mouse clicks, drop-down menus, and popup windows, eliminating the need for manual scripting.
Self-service also increases productivity: data analysts and scientists should be able to reach the data they need without waiting on data engineers. This requires a catalogue that provides intuitive views of metadata, including file attributes, lineage, and usage history.
Furthermore, data analysts and scientists need to access the data views concurrently without causing conflicts or disruptions. This means creating consistent data views that rely on the same underlying copy of data.
The data lakehouse should adhere to stringent Service Level Agreements (SLAs) regarding key performance metrics. These metrics encompass low latency to ensure quick query response times, high throughput to handle large volumes of queries, and high concurrency to support numerous workloads simultaneously.
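As a rough illustration of how these three metrics might be checked against an SLA, the sketch below computes 95th-percentile latency, throughput, and peak concurrency from a list of completed queries. The names QuerySample and sla_report, and the thresholds passed in, are assumptions made for this example rather than part of any particular lakehouse product.

```python
import math
from dataclasses import dataclass


@dataclass
class QuerySample:
    """One completed query: start time and duration, both in seconds."""
    start_s: float
    duration_s: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def peak_concurrency(samples):
    """Largest number of queries running at the same instant."""
    events = []
    for s in samples:
        events.append((s.start_s, 1))                  # query starts
        events.append((s.start_s + s.duration_s, -1))  # query ends
    # At equal timestamps, process ends (-1) before starts (+1) so that
    # back-to-back queries are not counted as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


def sla_report(samples, window_s, max_p95_s, min_qps):
    """Summarise latency, throughput, and concurrency for one time window."""
    p95 = percentile([s.duration_s for s in samples], 95)
    qps = len(samples) / window_s
    return {
        "p95_latency_s": p95,
        "throughput_qps": qps,
        "peak_concurrency": peak_concurrency(samples),
        "meets_sla": p95 <= max_p95_s and qps >= min_qps,
    }
```

In practice the samples would come from the query engine's own job history, and the thresholds from whatever the SLA actually specifies.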
Like other cloud data architectures, a data lakehouse should assist enterprises in cost control by efficiently utilising resources and aligning with FinOps objectives for cloud analytics projects. This involves profiling workloads before execution to provide users with an understanding of their compute cycle requirements. The data lakehouse should then automatically optimise processing methods to streamline these workloads.
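One simple way to picture pre-execution profiling is an estimate built from file metadata alone: count the bytes a query would have to scan once irrelevant partitions are skipped, then convert that into compute time. The sketch below assumes a hypothetical DataFile record and an illustrative scan rate of 100 MB per compute second; neither figure comes from a real engine.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    """Metadata for one file in the lake (hypothetical shape)."""
    partition: str   # e.g. "2024-Q1"
    size_bytes: int


def estimate_scan(files, wanted_partitions, bytes_per_compute_second=100_000_000):
    """Estimate scan size and compute time without reading any data.

    bytes_per_compute_second is an assumed, illustrative scan rate.
    """
    selected = [f for f in files if f.partition in wanted_partitions]
    scanned = sum(f.size_bytes for f in selected)
    return {
        "files_scanned": len(selected),
        "files_skipped": len(files) - len(selected),
        "bytes_scanned": scanned,
        "est_compute_seconds": scanned / bytes_per_compute_second,
    }
```

A real optimiser would draw on far richer statistics, but even an estimate this crude tells a user whether a query will touch one quarter of the data or all of it before any compute is spent.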
To cater to changing workload requirements, the data lakehouse should offer elastic scaling capabilities, allowing users to adjust storage and compute capacity according to their needs. This scalability must be economically viable, prioritising cost-efficiency. Like a hybrid car that turns off its engine at a stoplight to conserve energy, the data lakehouse should avoid unnecessary compute cycles.
Enterprises require economic capabilities like these because analytics projects often involve bursts of processing, such as quarterly financial reports or ad-hoc customer 360 analyses. Instead of always paying for peak performance, the data lakehouse should enable cost reduction by dynamically allocating resources based on workload demands.
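The difference between paying for peak capacity and allocating resources dynamically can be shown with a small sketch. It assumes a hypothetical scaling rule, target_workers, that sizes a compute pool from the number of queued queries and drops to zero when nothing is waiting; the figures for queries per worker and maximum workers are illustrative.

```python
import math


def target_workers(queued_queries, queries_per_worker=4, max_workers=16):
    """Workers needed for the current queue; zero when the queue is empty."""
    if queued_queries <= 0:
        return 0  # like the hybrid car at the stoplight: switch off when idle
    return min(max_workers, math.ceil(queued_queries / queries_per_worker))


def elastic_worker_hours(hourly_queue_depths, **scaling):
    """Total worker-hours if the pool is resized once per hour."""
    return sum(target_workers(q, **scaling) for q in hourly_queue_depths)


def always_on_worker_hours(hourly_queue_depths, **scaling):
    """Total worker-hours if the pool is fixed at its peak size throughout."""
    peak = max(target_workers(q, **scaling) for q in hourly_queue_depths)
    return peak * len(hourly_queue_depths)
```

For a six-hour window with queue depths of 0, 0, 8, 40, 0, and 0, the elastic pool uses 12 worker-hours, while a pool fixed at the peak size of 10 workers uses 60.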
To mitigate risks to data quality and maintain compliance with privacy regulations, enterprises must implement effective data governance practices. This involves avoiding unnecessary data copies that could compromise the integrity of a ‘single source of truth’. It also entails implementing role-based access controls, data masking for sensitive information, and lineage tracking to monitor and manage user actions.
Additionally, comprehensive audit logs should be maintained to record all user activities. By implementing these governance measures, enterprises ensure that data analysts can generate accurate reports while compliance officers can protect personally identifiable information (PII).
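The sketch below shows how role-based access, masking, and audit logging might fit together on a single read path. The roles, column names, and in-memory POLICY and AUDIT_LOG structures are hypothetical; a production lakehouse would enforce these rules in its catalogue or query engine rather than in application code.

```python
from datetime import datetime, timezone

# Hypothetical policy: columns each role sees in clear text or masked.
POLICY = {
    "analyst": {"clear": {"order_id", "amount"}, "masked": {"email"}},
    "compliance": {"clear": {"order_id", "amount", "email"}, "masked": set()},
}

AUDIT_LOG = []  # stand-in for a durable, append-only audit store


def mask(value):
    """Keep the first character and replace the rest with asterisks."""
    text = str(value)
    return text[:1] + "*" * (len(text) - 1) if text else text


def audit(user, role, action, allowed):
    """Record every access attempt, whether or not it was allowed."""
    AUDIT_LOG.append({
        "at": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "role": role,
        "action": action,
        "allowed": allowed,
    })


def read_rows(user, role, rows):
    """Return rows filtered and masked according to the caller's role."""
    rules = POLICY.get(role)
    audit(user, role, "read_rows", rules is not None)
    if rules is None:
        raise PermissionError(f"role {role!r} has no read access")
    visible = []
    for row in rows:
        out = {}
        for column, value in row.items():
            if column in rules["clear"]:
                out[column] = value
            elif column in rules["masked"]:
                out[column] = mask(value)
            # Columns not named in the policy are dropped entirely.
        visible.append(out)
    return visible
```

With this policy, an analyst reading a row whose email is "a@x.io" receives "a*****", a compliance officer sees the full address, and both reads, along with any denied attempt, appear in the audit log.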
The data lakehouse should seamlessly integrate with the diverse ecosystem of data stores, formats, processors, tools, APIs, and libraries that modern data teams require for innovation. It must also be compatible with alternative cloud data architectures like Azure Synapse Analytics while avoiding the risk of vendor lock-in. Data teams require an open architecture that enables them to make changes as business requirements evolve without needing custom scripts to move data between object stores, switch processing engines, or import a Machine Learning (ML) algorithm.
An open architectural approach empowers data analysts to take iterative steps and gives them the flexibility to use multiple engines and their preferred tools directly on the data without being constrained by formats. This minimises the need for complex, insecure, or risky data moves and limits copy proliferation.
The data lakehouse offers a simplified approach for enterprises to meet their business demands, foster collaboration among analysts, and reduce overall effort. Data teams can transform how their business utilises analytics by carefully selecting the appropriate components for their environment and establishing seamless integration points. This enables them to effectively support new and existing use cases, delivering results within the desired timeframe and budget.
However, it is crucial to develop and execute a well-defined plan to achieve these objectives, and business leaders need to ensure their data lakehouses meet the requirements for a successful deployment.