
September 27, 2023

The Why’s, What’s and How’s of Kubernetes Operators


WRITTEN BY

Michal Tusnio

Senior Consultant

Most users working with Kubernetes will at some point have seen the “operator” suffix in a deployment’s name, or wedged somewhere in the middle of a pod name. Most administrators and developers will have deployed an operator or two – the Tigera operator that installs the Calico CNI is a common sight in a cluster, as is a FluxCD/ArgoCD deployment as part of GitOps tooling.

Then, when debugging, there is often no need to look past the logs and events an operator emits, and rarely a reason to look under the hood and dive into the codebase – which keeps all of the cogs and gears hidden. That leaves an element of mystery around how difficult an operator is to build and what capabilities it can offer, and often leads to discounting it as a potential solution to a problem. As such, I think there is value in understanding the pattern in more depth, both to know when we could apply it ourselves to solve a business problem and to estimate more accurately how much work it would take to have a minimum viable operator ready for production.

Thankfully, writing a custom solution is much easier than it looks at first glance, with readily available, easy-to-use frameworks providing everything from the boilerplate code to support for highly available deployments.

The Why’s

Before answering why we might need our own operator, it is good to know what an operator can be: a way of deploying and managing an application by declaring the desired outcomes and states. Those outcomes are then achieved through automation – be it upgrades to bring the application from one version to another, backups to achieve redundancy, or auto-scaling to match the demands of incoming traffic.

By utilising the operator pattern, we can bring modern platform engineering philosophy to life: operators put common processes into code, define a standard set of practices for every part of an application’s lifecycle, and automate fault recovery for known scenarios. All of this reduces the burden of know-how placed on the administrator, especially in a world of ever-increasing numbers of applications and tools deployed onto Kubernetes clusters.

There are cases where the Kubernetes API can manage an application just fine on its own, using its fleet of built-in controllers. A Deployment might not need much maintenance beyond liveness and readiness probes, a suitable deployment strategy, and a metrics server to enable Horizontal Pod Autoscaling. In cases like this, the data the application accesses is often handled by databases or object storage managed by a cloud provider.

Things become more complicated when our databases are self-managed. While the most popular databases offer an operator out of the box, when working with less popular technologies a solution that fits our requirements and/or is supported might not be available. To support internal teams, a platform team might decide to write an operator themselves, building in that lifecycle management functionality going forward.

Should our application be stateful, similar considerations apply – deploying a StatefulSet with provisioned volumes raises the question of how to back up the attached data, and what should happen to it upon application upgrades. Even more so when each version upgrade might come with different steps needed to migrate the data, cover edge cases or work around deprecations. Those steps can now be embedded in the operator, tested and released as code. Organisations that let customers self-host their applications can thereby standardise lifecycle and upgrade paths for all of their users.

Then there is the potential to glue together legacy systems or third-party APIs. We could automate replicating data stored with an external provider into a more Kubernetes-native format. An example would be secret storage synchronisation: an operator can query an external secrets service, retrieve the secrets, and then create, update or delete the corresponding Secret objects inside Kubernetes. Or, even better, encrypt and push them into a Git repository as SealedSecrets for the SealedSecrets operator to pick up. Processes like this can be useful when migrating from one service to another, where we temporarily need automation while the old service is being sunset. OperatorHub offers a good overview of the open-source solutions that are available. Then there are more niche and creative cases, such as this game server deployment project, which deploys a game server of your choice based on a Custom Resource (more about these in the next section) defined by the user. Sadly, the last commits date back to 2021!
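To make the secrets-synchronisation idea above a little more concrete, here is a minimal sketch using the official Kubernetes Python client. It is illustrative only: `fetch_legacy_secrets` is a hypothetical stand-in for whatever client the external service provides, and a real operator would also handle deletions and drive the behaviour from a custom resource rather than a hard-coded loop.

from kubernetes import client, config
from kubernetes.client.rest import ApiException

def fetch_legacy_secrets():
    # Hypothetical call to the external secrets service we are migrating away from.
    return {"db-password": {"password": "hunter2"}}

def sync_secrets(namespace="default"):
    config.load_kube_config()  # or load_incluster_config() once deployed
    core = client.CoreV1Api()
    for name, data in fetch_legacy_secrets().items():
        secret = client.V1Secret(
            metadata=client.V1ObjectMeta(name=name),
            string_data=data,
        )
        try:
            core.create_namespaced_secret(namespace, secret)
        except ApiException as e:
            if e.status == 409:  # already exists, so update it instead
                core.replace_namespaced_secret(name, namespace, secret)
            else:
                raise

if __name__ == "__main__":
    sync_secrets()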

The What’s

 

The official definition of an operator is quite succinct: “Operators are software extensions to Kubernetes that make use of custom resources to manage applications and their components”. Before we can have custom resources (CRs), however, we need custom resource definitions (CRDs).

CRDs are a way of extending the Kubernetes API to allow object kinds other than the built-in ones (Pods, Deployments, ConfigMaps and so on) to be created. A CRD could, for instance, lay out a config file for our application, specifying everything from the number of replicas and the names of deployments down to how the application should be exposed to an external user. Other definitions might be more specific, as in our secrets replication example: they might include information on where to find access credentials and which endpoint to use when querying our legacy secrets provider.

On their own, those custom objects are quite useless – they have no power to effect changes. What they do provide, however, is something akin to a database entry with strict type definitions. A custom resource definition also includes an OpenAPI schema, meaning that on top of the storage schema we are instantly provided with a way of creating, updating and deleting those objects using HTTP calls.
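As an illustration of what that API looks like from a client’s point of view, here is a minimal sketch using the official Python client and the CustomApp resource defined later in this post (group example.tutorial, version v1, plural customapps); the API server validates the payload against the CRD’s OpenAPI schema before storing it.

from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

custom_app = {
    "apiVersion": "example.tutorial/v1",
    "kind": "CustomApp",
    "metadata": {"name": "exampleapp"},
    "spec": {"name": "exampleapp", "replicas": 2},
}

# Plain CRUD over HTTP: the object is validated and stored, but nothing acts on it yet.
api.create_namespaced_custom_object(
    group="example.tutorial", version="v1",
    namespace="default", plural="customapps", body=custom_app,
)

apps = api.list_namespaced_custom_object(
    group="example.tutorial", version="v1",
    namespace="default", plural="customapps",
)
print([item["metadata"]["name"] for item in apps["items"]])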

Controllers come in next to make use of that API, providing the compute power and the control loop. Once deployed, they can use custom resources created from our CRDs as configuration. In the examples mentioned earlier, a controller would query the API to retrieve the application’s custom resource and then create a deployment object of the desired size, or read a custom resource to find authentication details for our legacy secrets service, query it and sync up the secrets. The only real limits on functionality are the resource allocation the operator has to operate within, and the size of the CRD itself – the latter being around 1.5 MB, depending on your etcd configuration!
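The skeleton of that control loop can be sketched with the official Python client, again using the CustomApp resource defined in the How’s section below. This is illustrative only: real controllers and the frameworks discussed later add resyncs, retries and work queues on top of a raw watch like this.

from kubernetes import client, config, watch

config.load_kube_config()
api = client.CustomObjectsApi()

# Watch our custom resources and react to each event; this is the heart of a controller.
stream = watch.Watch().stream(
    api.list_namespaced_custom_object,
    group="example.tutorial", version="v1",
    namespace="default", plural="customapps",
)
for event in stream:
    kind = event["type"]              # ADDED, MODIFIED or DELETED
    spec = event["object"]["spec"]
    print(f"{kind}: ensure {spec['replicas']} replicas of {spec['name']}")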

It is worth noting that controllers are not specific to operators; they are a core part of Kubernetes, powering its control plane. It is not only the kinds we specify in CRDs that are useless without a control loop – the same goes for built-in kinds. Deployments, for instance, are handled by a built-in controller that reads a deployment specification and adjusts replica sets to match it. When a new container image is rolled out, it also manages spinning down pods that use the old image and smoothly transitioning to a new replica set.

Controllers work in tandem without having to know of each other’s existence – one creates a pod object, another reads it and fulfils the requirement by scheduling it – all thanks to the standardisation that the Kubernetes API and CRDs provide. Without that loop, a deployment object, just like our own custom resources, is nothing more than a glorified database entry.

The How’s

Technically, there are hardly any limits on the technology that can underpin an operator. In practice, the limit is set by the availability of operator frameworks. Python implementations can use the Kubernetes Operator Pythonic Framework (Kopf), .NET has KubeOps, and Go engineers can turn to the Operator SDK.

Regardless of what we pick, we need a CRD to start with. A sample definition that creates a new API group and a CustomApp kind describing the deployment of an application can be found below:

# Some comments adapted from official docs: 
# https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definitions/#create-a-customresourcedefinition
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  # name must match the spec fields below, and be in the form: <plural>.<group>
  name: customapps.example.tutorial
spec:
  # either Namespaced or Cluster
  scope: Namespaced
  # group name to use for REST API: /apis/<group>/<version>
  group: example.tutorial
  names:
    # kind is normally the CamelCased singular type. Your resource manifests use this.
    kind: CustomApp
    # plural name to be used in the URL: /apis/<group>/<version>/<plural>
    plural: customapps
    # singular name to be used as an alias on the CLI and for display
    singular: customapp
    # shortNames allow shorter string to match your resource on the CLI
    shortNames:
      - capps
      - capp
  versions:
    - # API version; these can change down the line: v1alpha1, v1beta1, v1 etc.
      name: v1
      # Whether this version is served, i.e. available to be used via the API
      served: true
      # Whether this version's schema is the one used to store objects in etcd.
      # Only one version can be marked as the storage version; if multiple
      # versions with different schemas are served, a conversion must be
      # specified so the API server can translate between the version a client
      # used in its request and the storage version.
      # See: https://kubernetes.io/docs/tasks/extend-kubernetes/custom-resources/custom-resource-definition-versioning/#specify-multiple-versions
      storage: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                name:
                  type: string
                replicas:
                  type: integer

The CRD is the heftiest and most verbose part, mainly because it requires defining everything from the group and kind name, through short names, to the API and storage schema of the custom resource. Creating a CR is much easier and largely self-explanatory:

apiVersion: example.tutorial/v1
kind: CustomApp
metadata:
  name: exampleapp
spec:
  name: exampleapp
  replicas: 2

With the CRD applied and a CR ready in a YAML file, we need a controller. What a minimalistic implementation looks like depends on the framework we pick. In the case of Kopf, we can boil it down to six lines, at least as far as a working proof of concept that prints out our spec goes:

import kopf
import logging

@kopf.on.create('customapps.example.tutorial')
def create_fn(body, **kwargs):
    logging.info(f"Application created: {body}")

Operators do not need to be deployed to be run and tested. Since an operator performs all of its operations by sending HTTP requests to kube-apiserver, there is nothing stopping us from running it locally from our IDE or command line (provided our machine can reach the API server and authenticate with it!). In the case of Kopf, once the library is installed via pip, we can run our Python file with `kopf run <file_containing_code_above> --verbose` and use our local workstation’s credentials. Applying the custom resource will then print a log entry with the content of the object created.
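Kopf offers similar decorators for the rest of the lifecycle. A sketch, assuming the same CRD, that also reacts to updates and deletions might look like this:

import kopf
import logging

@kopf.on.update('customapps.example.tutorial')
def update_fn(spec, old, new, **kwargs):
    logging.info(f"Application updated: replicas now {spec.get('replicas')}")

@kopf.on.delete('customapps.example.tutorial')
def delete_fn(spec, **kwargs):
    logging.info(f"Application deleted: {spec.get('name')}")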

Expanding on this means using client libraries to communicate with the Kubernetes API and create a new deployment. If Kopf and Python are your stack of choice, the examples in the Kopf repository show how this communication can be achieved. After that, deploying the controller itself is a matter of pushing an image to a container registry and defining a Kubernetes Deployment object.
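As a rough sketch of what that next step can look like with Kopf and the official Python client – hedged, since the details depend on your application, and the container image here is hard-coded purely for illustration – a create handler could render a Deployment from the CustomApp spec and hand ownership to the custom resource, so the Deployment is cleaned up with it:

import kopf
from kubernetes import client, config

@kopf.on.startup()
def configure(**kwargs):
    # In-cluster credentials when deployed; local kubeconfig when run from an IDE.
    try:
        config.load_incluster_config()
    except config.ConfigException:
        config.load_kube_config()

@kopf.on.create('customapps.example.tutorial')
def create_fn(spec, namespace, **kwargs):
    # Render a Deployment from the CustomApp spec. The image is hard-coded for
    # illustration – a real CRD would most likely carry it in the spec too.
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": spec["name"]},
        "spec": {
            "replicas": spec["replicas"],
            "selector": {"matchLabels": {"app": spec["name"]}},
            "template": {
                "metadata": {"labels": {"app": spec["name"]}},
                "spec": {"containers": [{"name": spec["name"], "image": "nginx"}]},
            },
        },
    }
    # Make the CustomApp the owner, so the Deployment is garbage-collected
    # when the custom resource is deleted.
    kopf.adopt(deployment)
    client.AppsV1Api().create_namespaced_deployment(namespace=namespace, body=deployment)

When the operator runs in-cluster, its ServiceAccount will also need RBAC permissions to read CustomApp objects and manage Deployments.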

Not all frameworks are as terse as Kopf. The Operator SDK, for instance, generates code stubs and skeletons for the user to fill in and modify. While this makes the code base much heavier, it does not significantly change the experience. In the end, the workflow is similar: set up our CRDs, run the SDK to scaffold a new operator, and use the framework to listen for the events that interest us on the objects we care about.

More implementations of the operator pattern exist than just the ones listed above – see the full list here. As a curious mention, shell-operator is an interesting framework that allows shell scripts to be run as hooks. Rather than taking on the overhead of a full-blown programming language, we can use bash scripts and shell tools to respond to events. This can be especially useful when tooling that communicates with the API already exists and all that is needed is an engine to run the scripts at the appropriate time.

Putting it all together

Whatever framework we choose, there is more to look out for than just our favourite language: built-in support for monitoring via a tool of our choice, leader election or idempotency support so the operator can run in a highly available setup, support for admission and validation webhooks, and documentation explaining how to test the operator. Getting a production-ready system requires extra consideration, especially where fault tolerance is concerned.

After that, what remains is the difficult part – putting our chosen framework to use and implementing whatever idea we have for an operator.

 

