Our recent client was a Fintech that had ambitions to build a Machine Learning platform for real-time decision making. The client had significant Kubernetes proficiency, ran on the cloud, and had a strong preference for free, open-source software over cloud-native offerings that come with lock-in. Several components were spiked with success (feature preparation with Apache Beam and model serving with Seldon performed particularly strongly). Kubeflow was one of the next technologies on our list of spikes, showing significant promise at the research stage and seemingly a good match for our client’s priorities and skills.
That platform slipped down the client’s priority list before we completed the Kubeflow research, so I wanted to see how that project might have turned out. Would Kubeflow have made the cut?
Machine learning has great potential for many businesses, but the path from a Data Scientist creating an amazing algorithm on their laptop, to that code running and adding value in production, can be arduous.
The paper “Hidden Technical Debt in Machine Learning Systems” covers it well with this often-cited diagram:
The little black box in the centre represents the initial training of the algorithm. However, in order to serve that model, a lot of other infrastructure must be built.
Productionising models involves accounting for things like data collection and verification, feature extraction, machine resource management, serving infrastructure, and monitoring.
Here are two typical machine learning workflows. Firstly, Data Scientists want the freedom to iterate around an experimentation cycle whilst they refine their model.
Then, once they’ve determined a model architecture that solves the problem at hand, and the first iteration of the model has been installed in a decision-making role in production, they might want it to continually be retrained with fresh data.
This functionality is a lot for businesses to build out from scratch, especially when most are at an early stage of adopting Machine Learning. Many of these problems are also familiar to practitioners, so a collaborative, open source framework starts to make a lot of sense.
Kubeflow is a strong attempt to create a platform that encompasses many of these requirements. Much like Kubernetes grew from Google’s internal Borg deployment framework, Kubeflow comes from Google’s internal Tensorflow deployment infrastructure (Tensorflow Extended), and its development is still Google-led.
As the name suggests, it’s designed to run on Kubernetes, a powerful tool for requesting and configuring the infrastructure required during a model’s lifecycle.
Kubernetes has taken the world of software deployments by storm over the last few years. Its ambition to make running software as conceptually easy as stacking boxes (containers) neatly into available spaces (cluster nodes), with a focus on high availability, has greatly simplified and standardised launching software. It also neatly decouples the complex high availability and access control problems from software delivery. Kubeflow in its role as an ML lifecycle management platform continues with this separation; the cluster’s installation can be a bit gnarly, especially for those unfamiliar with Kubernetes, but defining and managing your ML workflows is made much simpler as a result.
Kubeflow is also available (in pre-GA at time of writing) on GCP as part of the AI Hub, which was probably a significant motivation for creating this software. This somewhat shows in the documentation and overall experience of Kubeflow, as we’ll see below. I get the impression that running Kubeflow on GCP (especially through AI Hub) is a significantly more seamless experience than deploying it yourself.
Kubeflow has a fairly comprehensive suite of guides to get you up and running on various flavours of Kubernetes (vanilla, cloud-based AWS EKS and GCP GKE, and a whole host of local-machine options like Minikube and a Vagrant appliance called MiniKF).
I followed the minikube guide, https://www.kubeflow.org/docs/started/workstation/minikube-linux/, and after a few Istio/minikube-related false starts¹ the cluster came up. Because Kubeflow is composed of so many independently-maintained services, it relies on Istio (a service mesh) to provide a consistent auth mechanism, routing, policy layer, and metrics. Istio is currently required to run Kubeflow.
If, however, you’re already running a service mesh (especially a different one, like Linkerd), I’d suggest running a spike fairly early on into how Kubeflow’s Istio deployment interacts with your existing infrastructure, to determine whether there are any additional challenges to overcome.
In my case, following a successful deployment, I found I had three new namespaces: istio-system (10 pods), kubeflow (37 pods!), and kubeflow-test (2 pods), the latter being the namespace my work would execute in.
Step one in our hypothetical user journey: a data scientist has a stunning idea for a machine learning model. They want to play around with Python for a few hours, train a model, and see if it solves the problem. Maybe they want to collaborate with a colleague, and work from the same notebook. Jupyter Notebooks have long been standard practice for collaborative Python work, and deploying a notebook server on Kubeflow is trivial.
You can create a notebook server simply enough through the UI (shown below). Under the hood, this will create a Kubernetes StatefulSet that launches a Jupyter Notebook server pod. Alternatively, if you’d rather keep your infrastructure as code, you can just deploy the StatefulSet into Kubernetes through other means.
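For the infrastructure-as-code route, here is roughly the shape of the manifest involved, built as a plain Python dict (so it could be dumped to YAML or JSON and applied with kubectl). The name, image, and labels below are my own illustrative choices, not Kubeflow’s actual generated values:

```python
# A hedged sketch of a StatefulSet manifest for a single-replica Jupyter
# notebook server pod; names and image are illustrative placeholders.
import json

def notebook_statefulset(name: str, image: str, namespace: str = "kubeflow") -> dict:
    """Return a minimal StatefulSet manifest for a notebook server."""
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "replicas": 1,
            "serviceName": name,
            "selector": {"matchLabels": {"app": name}},
            "template": {
                "metadata": {"labels": {"app": name}},
                "spec": {
                    "containers": [{
                        "name": "notebook",
                        "image": image,
                        "ports": [{"containerPort": 8888}],  # Jupyter's default port
                    }],
                },
            },
        },
    }

manifest = notebook_statefulset("my-notebook", "jupyter/minimal-notebook:latest")
print(json.dumps(manifest, indent=2))
```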
Once deployed, the notebook server provides a familiar blank notebook, Python 3 pre-installed, ready to be filled with the ML model code of your choice.
And once the training and experimentation phase is complete, you can go beyond merely having a good model trapped in a notebook. You can add the code that continues its lifecycle by creating a Kubeflow Pipeline in the document.
(a small aside: Kubeflow’s heritage is evident here – the UI could be straight out of GCP).
Kubeflow Pipelines are at the core of Kubeflow’s offering. In a Pipeline, you string together a set of operations that should happen to your model, from training, testing and visualisation, to serving and making predictions.
The pipelines are represented as a DAG (Directed Acyclic Graph), and are viewable in the UI, along with their execution history.
The pipelines are defined using a Python DSL. I find this a wonderfully powerful paradigm; the move towards DevOps has blurred the boundaries of application development and deployment, and this choice drives the same for machine learning. There is no step in this process where a data scientist creates a model and “throws it over the wall” to developers, who then wrap it in some mysterious layers the scientist knows nothing about. Here, in a Jupyter notebook, we can compose the entire lifecycle of an ML model. The lifecycle steps remain decoupled; you could pull out the TensorFlow training code and run that independently if you wanted. It might be that the finished product needs a little more testing/structure than can be provided in a notebook, and the code gets pulled out into a repo for more formal packaging; you can do that too.
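To illustrate the paradigm (this is a stdlib-only sketch of the concept, not the kfp DSL itself): each lifecycle step is an ordinary Python function, and the pipeline is just Python code wiring their outputs together, which an engine can later run as containerised steps in a DAG:

```python
# Stdlib-only sketch of the pipeline-as-Python-code idea. In Kubeflow, each
# function below would become a containerised step in the DAG; here they
# simply run in-process, in dependency order.
def train(data):
    """Stand-in training step: 'fit' a trivial threshold model."""
    return {"threshold": sum(data) / len(data)}

def evaluate(model, data):
    """Stand-in evaluation step: fraction of points above the threshold."""
    hits = sum(1 for x in data if x > model["threshold"])
    return hits / len(data)

def pipeline(data):
    # The whole lifecycle is composed here, but the steps stay decoupled:
    # train() could be pulled out into its own repo and run independently.
    model = train(data)
    score = evaluate(model, data)
    return model, score

model, score = pipeline([1, 2, 3, 4])
print(model, score)
```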
My delight at this seemingly seamless integration between Kubeflow Pipelines and the Jupyter notebooks was short-lived as the latest release disabled communication between the notebook servers and the pipeline engine for security reasons! Here’s the discussion: https://github.com/kubeflow/pipelines/issues/4440. It’s scheduled to return once a patch has landed that enables secure authentication between the components. In the meantime, you can hack it in ways like disabling Istio’s RBAC (sure as heck not a solution for production!) or (slightly better but still not optimal) adding Service Accounts for the notebook to use with fixed RBAC users that can authenticate.
The practical impact of this is that none of Kubeflow’s copious and lovingly prepared examples work out of the box when Kubeflow is deployed locally, which made for quite a disappointing first impression!
After a little tinkering, I got some of the pipelines working. Below is the execution DAG of a simple coin-flip example (there’s no machine learning here – it’s just performing some Python steps in sequence).
Kubeflow provides a high-level organisational structure called “Experiments”. These allow the grouping of models and avenues of research. I’d be interested in seeing how this structure would play out in a larger business, with perhaps many people working across many projects. It’s certainly good enough for a first pass through and kept my simple experiments organised enough.
Simple but effective: the logs after a failed run clearly show the problem.
There’s a lot more depth I could go into with Kubeflow Pipelines; containerisation of pipeline steps is a must for running on Kubernetes for example. The best way to explore the possibilities is to go through some of the myriad examples on GitHub.
One of the first hurdles to productionising an ML algorithm is getting the same features available in both the training and model execution stages. Additionally, having a catalogue of existing features can be very useful to a data scientist in the exploratory stages of model design. Kubeflow tackles these challenges with the Feast feature store component. I didn’t have time to go into depth with Feast for this post, but if you’re looking for better organised feature sets, it would be worth exploring.
A feature I was pretty interested to see in Kubeflow’s bag of tricks was Katib, a piece of software for “hyperparameter tuning and neural architecture search” (currently in beta). To break it down: you might have the broad strokes of your machine learning model perfect – a good set of features that covers the problem scope, and decent results from one particular algorithm (say, a neural net). But there are lots of variables (hyperparameters) that can affect the detail of how well your model performs, for example, how much you adjust your weightings in response to each new training example (the learning rate). It’s also a vast problem space – each hyperparameter can independently take any value, so how do you find the global optimum for your model?
The method of “try a few values, see what generally works, and then set up a for loop to explore that region of the value space, and run that for loop with increasingly granular values to zero in on a maximum” seems standard practice. It’s a pretty easy thing to set up after all, but can we do better?
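That manual loop can be sketched in a few lines; the quadratic score function below is a hypothetical stand-in for a real validation metric, and the peak at 0.3 is an arbitrary choice:

```python
# Stdlib-only sketch of the coarse-to-fine loop described above: scan a
# hyperparameter range, then repeatedly zoom into the best region with a
# finer grid.
def score(learning_rate: float) -> float:
    # Hypothetical validation metric, peaking at learning_rate = 0.3.
    return 1.0 - (learning_rate - 0.3) ** 2

def coarse_to_fine(lo: float, hi: float, steps: int = 11, rounds: int = 3) -> float:
    best = lo
    for _ in range(rounds):
        grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
        best = max(grid, key=score)
        width = (hi - lo) / (steps - 1)   # zoom in around the best value
        lo, hi = best - width, best + width
    return best

best_lr = coarse_to_fine(0.0, 1.0)
print(round(best_lr, 3))
```

Katib generalises this: instead of a hand-rolled loop, you declare the search space and let it drive the exploration.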
Katib’s a fancier and more automated version of that process. You tell it what your goals are, what hyperparameters it’s permitted to tune, and Katib keeps training and re-training your model with different variations, exploring until it has achieved your definition of “success”.
The advantage of having this on Kubernetes is scale, and the ability to use the available resources. Hyperparameter tuning is often the sort of background job you want to leave running on the available, leftover compute. You can accomplish this on Kubernetes using Pod Priority and Preemption.
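As a sketch of the Kubernetes side, a low-priority class for tuning jobs might look like the following, expressed as a plain dict (the class name and value are my own illustrative choices):

```python
# A PriorityClass that marks hyperparameter-tuning pods as background work,
# so they yield compute to higher-priority workloads. Name and value are
# illustrative; apply the YAML equivalent with kubectl.
priority_class = {
    "apiVersion": "scheduling.k8s.io/v1",
    "kind": "PriorityClass",
    "metadata": {"name": "background-tuning"},
    "value": -100,                 # below the default of 0, so these pods yield first
    "preemptionPolicy": "Never",   # tuning jobs never evict other workloads
    "description": "Best-effort hyperparameter tuning jobs",
}
# A tuning pod then opts in via spec.priorityClassName: "background-tuning".
print(priority_class["metadata"]["name"])
```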
Kubeflow’s a framework that delights in integrating with other open-source software. Model serving is one of the more well-populated areas of the MLops space, so there are plenty of options. This table can help you choose which is most appropriate for your use case.
I wanted to give Seldon Core a particular shout-out in this post, as I was very impressed when I used it in the early stages of the MLOps project I mentioned earlier. Seldon is itself Kubernetes-native (that’s part of the reason we were drawn to it). It allows you to flexibly deploy models (and families of models) in pods, serviced by one of the many wrappers implemented by Seldon. You can then query a model (or a set of models) from another service, over REST or gRPC. This neatly decouples the model from the query method (for example, you can swap out a Pytorch model for a Tensorflow one) and still maintain the same contract with calling services.
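As a sketch of what that contract looks like from the calling side, using only the standard library (the host, namespace, and deployment names are placeholders, and the endpoint path and payload shape follow Seldon’s prediction protocol as I understand it):

```python
# Build a REST prediction request against a Seldon-served model. Whether a
# PyTorch or TensorFlow model answers is invisible to the caller.
import json
import urllib.request

def build_prediction_request(host: str, namespace: str, deployment: str,
                             features: list) -> urllib.request.Request:
    url = f"http://{host}/seldon/{namespace}/{deployment}/api/v1.0/predictions"
    payload = json.dumps({"data": {"ndarray": [features]}}).encode()
    return urllib.request.Request(
        url, data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_prediction_request("localhost:8003", "kubeflow-test",
                               "my-model", [5.1, 3.5, 1.4, 0.2])
print(req.full_url)
# urllib.request.urlopen(req) would send it against a live cluster.
```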
Seldon’s very flexible in its deployment structure, and allows multiple models to be queried simultaneously. An example is a champion-challenger pattern, where the current best-performing model will serve the result that’s used, but the other models’ performances are tracked and monitored. This allows you to study new potential candidate models in production, and know their historical performance before taking the plunge and updating your champion. They also have more dynamic options available like their multi-armed bandit router; its role is to dynamically route traffic between different models, in order to determine the most performant, and to work that out in as few trials as possible. Extremely flashy, algorithmic A/B testing for ML, in other words.
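Seldon’s own router is more sophisticated, but the underlying bandit idea can be sketched with a simple epsilon-greedy router in plain Python (all names, rates, and numbers here are illustrative):

```python
# Epsilon-greedy routing: mostly send traffic to the best-performing model
# so far, but occasionally explore the others.
import random

class EpsilonGreedyRouter:
    def __init__(self, models: list, epsilon: float = 0.1):
        self.models = models
        self.epsilon = epsilon
        self.trials = {m: 0 for m in models}
        self.wins = {m: 0 for m in models}

    def choose(self) -> str:
        if random.random() < self.epsilon or not any(self.trials.values()):
            return random.choice(self.models)   # explore
        return max(self.models,                 # exploit the best empirical rate
                   key=lambda m: self.wins[m] / max(self.trials[m], 1))

    def record(self, model: str, success: bool) -> None:
        self.trials[model] += 1
        self.wins[model] += success

random.seed(0)
router = EpsilonGreedyRouter(["champion", "challenger"])
# Simulate: the challenger is secretly the better model (70% vs 50% success).
rates = {"champion": 0.5, "challenger": 0.7}
for _ in range(1000):
    m = router.choose()
    router.record(m, random.random() < rates[m])
print(router.trials)   # traffic should drift towards the challenger
```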
Seldon’s interpretability module Alibi is also compatible with Kubeflow (even if you’re not using Seldon as the serving component). I’m very excited about this; one of the trickier challenges of taking some types of ML to production is the ability to justify decisions. Not only does it allow a data scientist to understand a model’s weaknesses and improve upon them, but it can also be a regulatory requirement, especially in the financial world. When denying someone’s loan application, it’s undoubtedly insufficient to respond that “our model’s a black box that we trust, and it said so.” We must be able to provide to the regulator, or the user, the set of information that backs a decision. Crucially, the decision must be fair and non-discriminatory. Tools like Alibi excite me because Machine Learning is inevitably going to impact the world significantly. The availability of tooling to ensure that impact is just and even-handed is of paramount importance.
To manage Seldon, you create Kubernetes resources (SeldonDeployments), which then handle the lifecycle of serving the model; see https://www.kubeflow.org/docs/components/serving/seldon/.
One of the first thoughts I had upon launching Kubeflow was “wow, that’s a lot of containers!”. The Kubeflow quickstart pack launches 49 pods, and that’s before launching any notebook servers or running any pipelines. That’s a heck of an attack surface for the nefarious. Combined with the likelihood of the data scientist users making use of sensitive information, it’s definitely worth having an upfront discussion with your security team about bringing this software within their risk appetite. One argument in favour of deploying Kubeflow in this context: it’s probably safer for jobs and pipelines that handle such sensitive data to be well understood and structured than to let them become a messier, less well-documented homebrew solution.
Kubeflow’s test harness functionality is still in its infancy, but it is being worked on. Machine learning itself resists many of software engineering’s general testing principles in any case; models fundamentally defy unit testing (it’s pointless to test one neuron of a neural net), and because almost no model is 100% accurate, testing is statistical at best.
I hoped to see more ability to test pipeline logic end to end (for example, testing that an inaccurate model doesn’t get promoted to production). That’s not to say you couldn’t implement this testing yourself using your preferred Python framework, or stub out the model to inject one with deterministic behaviour, but this doesn’t seem to be available out of the box.
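Nothing stops you sketching such a test yourself, though. For example, using a deterministic stub model and a promotion gate (promote_if_good and the 0.9 threshold are my own illustrative stand-ins, not Kubeflow API):

```python
# Stdlib-only sketch: stub out the trained model with deterministic
# behaviour, then assert the promotion step rejects an inaccurate model.
def promote_if_good(model_accuracy: float, threshold: float = 0.9) -> bool:
    """The promotion gate a pipeline might run before serving a new model."""
    return model_accuracy >= threshold

class StubModel:
    """Deterministic stand-in for a trained model."""
    def __init__(self, correct: int, total: int):
        self.accuracy = correct / total

good = StubModel(correct=95, total=100)
bad = StubModel(correct=60, total=100)

assert promote_if_good(good.accuracy)      # 0.95 passes the gate
assert not promote_if_good(bad.accuracy)   # 0.60 is rejected
print("promotion gate behaves as expected")
```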
A start has been made with this GitHub issue (merged between the writing of this post and publication!), which covers some aspects of local test running and stubbing of components, and will definitely make testing some aspects of Pipelines significantly easier.
One tool that I didn’t play with, but was quite excited to find, was the Kubeflow Pipelines Benchmark Scripts. This seems to be quite a sophisticated way of performance and soak testing your pipelines in the face of significant traffic.
Documentation is a noticeable weak point of Kubeflow. Quite a lot of the docs only cover previous versions, and bear the conspicuous header line “Out of date – This guide contains outdated information pertaining to Kubeflow 1.0. This guide needs to be updated for Kubeflow 1.1”. At the time of writing this post, Kubeflow 1.1 had been released for eight months (June 2020 – Feb 2021), and 1.2 had been out since November, which makes me worry a little about the project’s maintenance priorities. There also don’t seem to be any issues covering this area in their tracker. There are plenty of recent commits, so the project looks alive and healthy in other respects.
A lot of the primary documentation is in the form of linked Jupyter notebooks. The principle is excellent – I love executable docs. However, most of them were written over a year ago, and haven’t been updated with the framework. The result is that most of them will take quite a lot of effort to get going (pip upgrades, Kubernetes certificate issues, adding ServiceRoleBindings etc.). Expect to spend plenty of time trawling through GitHub issues looking for answers. For someone who’s been a user of Kubernetes for a while, but never an administrator, it was a steep learning curve.
Kubeflow is far from the only ML platform worth a look. Its unique appeal is how effectively it can integrate with a Kubernetes ecosystem, along with its aim to be an extensible, open-source framework for integrating and managing other tools. Here are some examples of rival platforms and tooling I discovered while researching this post, though there are plenty more out there:
Kubeflow is a potent and flexible tool. I love the principle that it’s an extensible framework for managing a machine learning lifecycle; it offers considerable functionality already, and I have high hopes for its future. It’s the most sophisticated FOSS tool I can find in this area.
However, it’s still evolving rapidly, and some of that speed has come at the expense of its current ease of use. I found the learning curve initially very steep, though the abundance of happy users and committers suggests that once you’re familiar with the tool, progress will be faster.
If your company is already comfortable and running its software on Kubernetes, Kubeflow is a decent way of leveraging that power for your data scientists, and providing a flexible, ordered framework to train and execute your models. It’s done a significant amount of the heavy lifting and will get things rolling much faster than creating all the components yourself.
Thinking back to the Fintech client, I think this would have been a good fit for them. Their strength in Kubernetes administration would have been tested: we’d have had to pare down the initial installation to only trusted containers, and have some very detailed conversations with the security team (I wouldn’t have been surprised if they insisted on a separate cluster for the majority of the infrastructure, only allowing the model-serving piece to co-exist with the main cluster). I’d also have warned them about the documentation and the difficulty of getting going. But overall, unless your core business is MLOps infrastructure, I wouldn’t recommend building your own framework in this area from scratch. Kubeflow (and the MLOps ecosystem in general) is coming on apace and provides so much functionality out of the box that it’s going to save you a lot of time, and be much easier to maintain in the long run.
¹The suggested minikube start flags don’t seem to work with the current version of minikube (1.17.1) – I eventually got it working by running with the flags from this GitHub issue: https://github.com/kubeflow/kubeflow/issues/5447. The Istio service mesh was misconfigured, and authentication was failing between components. Subsequent launches have been spotty; most of the time I am successful when providing no flags at all to the minikube start command. Several subsequent relaunches with no flags hit a similar problem, with the error “Envoy proxy is NOT ready: config not received from Pilot (is Pilot running?)”. Getting heavy-handed and deleting the istio-pilot pod fixed it those times. Something there needs debugging, but that’s beyond the scope of this blog post.
Because Kubernetes pods are ephemeral by nature, the input and output datastores must be external. If you’re running a cloud-native operation, the simplest option is a storage bucket, e.g. S3 or GCS (https://www.kubeflow.org/docs/aws/pipeline/). I couldn’t get reading from S3 working after a few hours’ work, I believe due to this defect. I get the impression Kubeflow will be significantly easier to use when running in the cloud, rather than locally on a laptop or on-prem.