Progress for Big Data in Kubernetes

Ted Dunning, chief application architect at MapR, discusses the recent developments that make it possible for Kubernetes to manage both compute and storage tiers in the same cluster.

The recent adoption of containerisation as a deployment strategy, and of Kubernetes as the de facto management framework for applications in these containers, has happened with stunning speed.

This has been driven by the basic nature of containers and how they let us build independently deployable pieces of applications with isolated environments, but also by the way that Kubernetes distils Google’s deep expertise in managing services. The fact that all of the major cloud providers support this trend doesn’t hurt, either.

Containerisation provides a clean separation of concerns, as developers focus on their application logic and dependencies, while IT operations teams can focus on deployment, geo-distribution and management without bothering with the functional details.

However, the folk wisdom has always been that when running stateful applications inside containers, the only viable choice is to externalise the state so that the containers themselves are stateless, or nearly so.

Keeping large amounts of state inside containers is possible, but it is considered a problem because stateful containers generally can’t preserve that state across restarts, and because large amounts of state anchor containers in place, which makes it hard to balance loads across the cluster.

The problem of state

In practice, state complicates the management of large-scale Kubernetes-based infrastructure because high-performance storage systems require separate management. In terms of overall system management, it would be ideal if we could run a software-defined storage system directly in containers managed by Kubernetes, but that has been hampered by a lack of direct device access and by difficult questions about what happens to the state on container restarts.

Another problem has been that storing state in Kubernetes via a data platform isn’t going to work unless the data platform manages a very large fraction of the data and can control data placement (and hence data motion) in order to manage the loads imposed by access to that data.

Any data motion implies that there has to be a layer of indirection to address data without regard to location and thus the data layer has to provide a comprehensive mechanism for accessing data by name.

These capabilities of controlling location, load balancing and naming of data are, essentially, the dual of the responsibilities that Kubernetes has for containers. In a strong sense, our data platform needs to be just like Kubernetes but for data.

Recent developments in data platforms make it possible for Kubernetes to manage compute and a data platform in the same cluster. With self-describing data formats, the containers running the data platform can be restarted as needed and re-join the data platform without needing to reformat storage devices.
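As a rough sketch of what that looks like in practice (the names and container image below are hypothetical, not any particular vendor’s product), a data-platform node can be run as a Kubernetes StatefulSet so that each replica keeps a stable identity and re-attaches its own volume when it restarts:

    # Minimal sketch: a data-platform node run as a StatefulSet.
    # Each replica keeps a stable identity and re-attaches its own volume
    # on restart, so it can re-join the platform without reformatting storage.
    apiVersion: apps/v1
    kind: StatefulSet
    metadata:
      name: data-platform-node
    spec:
      serviceName: data-platform
      replicas: 3
      selector:
        matchLabels:
          app: data-platform
      template:
        metadata:
          labels:
            app: data-platform
        spec:
          containers:
            - name: storage-server
              image: example.com/data-platform:latest   # hypothetical image
              volumeMounts:
                - name: data
                  mountPath: /var/lib/data-platform
      volumeClaimTemplates:
        - metadata:
            name: data
          spec:
            accessModes: ["ReadWriteOnce"]
            resources:
              requests:
                storage: 500Gi

Because the volume claims survive pod restarts, a restarted container finds its previous data already in place and can simply re-join the platform.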

In some environments it is even possible to implement elastic storage frameworks that can fold data onto just a few containers during quiescent periods or explode it quickly across many machines when higher speed access is required.

The benefits of a scalable data platform that runs external to Kubernetes have been clear for some time. Being able to manage this data platform more easily using Kubernetes itself is a major advance, but there are substantial advantages in updating data platform software to directly make use of Kubernetes control processes. Besides, it makes it a snap to deploy a full-scale compute and storage infrastructure from a single pane of glass.

Previous solutions

Maintaining state within containers has historically been approached in one of two ways. One was to build a lot of state-handling services, each consisting of a few containers and each housing a fair bit of data.

That doesn’t turn out well, because these state-handling services cause problems. They don’t move easily, yet because each service contains just a few containers, statistical variations in load create havoc for neighbouring containers, creating a need to move them.

Because of poor multi-tenancy, managing state in containers often leads to yet another bespoke state management service for every few operational services. Even worse, none of these state handlers has a global view of storage and loads, meaning that there is no good mechanism for balancing loads.

This is a problem because the load imposed by the services attached to each one of these stateful services is small, but the minimum number of containers required to safely manage state is typically five or more.

I have heard stories of running 5,000 Kafka brokers distributed across hundreds of clusters, of hundreds of database clusters, and of dozens of HDFS clusters in order to run HBase. The twin problems here are that the cost of managing this cluster sprawl scales very poorly and the utilisation of these machines is very poor since the load in each of these cases could typically be supported by a few dozen nodes.

The other major option has been to keep the state out of containers and put it onto special purpose storage appliances. Done poorly, that can lead to grief in a few ways. First off, if your data is on a specialised storage appliance of some kind that lives in your data centre, you have a boat anchor that is going to hold you back from the cloud.

Worse, none of the major cloud services will give you the same sort of storage, so your code isn’t portable any more. Each cloud provider has its own idiosyncratic version of storage, typically in the form of something designed to store large immutable blobs, which is only good for a few kinds of archival use and is non-portable across clouds. If you want anything else, you will find yourself locked into an array of the cloud provider’s specialised services.

Data platform approach

Increasingly, modern data platforms such as MapR are offering a software-defined storage layer that can form a pretty good cloud-neutral distributed data fabric, one that extends from on-premises systems into multiple cloud systems with data at varying performance levels and with the desired APIs.

This data fabric can share the storage loads of many applications and thus raise utilisation dramatically. One particular win with this kind of design is that by putting all or most of your state in a single platform, you get a statistical levelling out of the loads imposed by the storage system, which makes managing the overall system much, much easier.

Up to now, production data fabrics have pretty much universally been managed outside of Kubernetes for a variety of reasons, not least of which has been the difficulty in giving containers low-level storage hardware and network access so that reasonable performance is possible.

However, this isn’t what we would like in the long run. Having applications managed by Kubernetes and a data fabric that is outside Kubernetes is philosophically and practically unpleasant. It would be just so much better if we could run an advanced storage platform inside Kubernetes so that everything we do could be managed uniformly.

What we would like is something that would scale to handle the state required by most or all the services we are running in all the different forms that we need, but still run in containers managed by Kubernetes.

Recent advances in Kubernetes

Work has been progressing since Kubernetes 1.8 on Container Storage Interface (CSI) and Flexvolume plugins, which allow a pod to include an extra container that talks to a client-side daemonset and causes the appropriate data to be mounted and exposed to the application containers in your pod.
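To give a feel for how this looks from the application side (the driver, class and image names below are hypothetical), a CSI-backed storage class lets an application pod claim data-fabric storage through an ordinary persistent volume claim, with no change to the application container itself:

    # Minimal sketch: an application pod consuming data-fabric storage
    # through a CSI driver. The application container just sees an
    # ordinary mounted filesystem.
    apiVersion: storage.k8s.io/v1
    kind: StorageClass
    metadata:
      name: data-fabric
    provisioner: csi.example.com        # hypothetical CSI driver name
    ---
    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: app-data
    spec:
      accessModes: ["ReadWriteMany"]
      storageClassName: data-fabric
      resources:
        requests:
          storage: 100Gi
    ---
    apiVersion: v1
    kind: Pod
    metadata:
      name: app
    spec:
      containers:
        - name: app
          image: example.com/app:latest   # hypothetical application image
          volumeMounts:
            - name: data
              mountPath: /data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: app-data

The CSI driver running on each node takes care of mounting the right data into the pod; the application container only ever sees a normal filesystem at /data.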

The exciting thing about this architecture is that it permits independent evolution on several levels. The storage code can evolve independently of Kubernetes, the application container can access the data platform with no changes whatsoever and any or all pieces of the system can be upgraded on the fly in a running production cluster.

As exciting as these recent developments are, more is on the way. It is reasonable to expect that, before long, we will see ways of supporting more than just files via the same basic mechanisms we have discussed here. But even before that happens, it is already an amazing time to be working on big data under Kubernetes.
