October 29, 2020 | Cloud, Kubernetes, Open Source
While working with a client recently, we experienced some issues when attempting to use NLB external load balancer services on AWS EKS. I wanted to investigate whether these issues had been fixed in the upstream Kubernetes GitHub repository, or whether I could fix them myself, contributing back to the community in the process.
The particular Kubernetes code in question is in the part of the repository responsible for communicating with the AWS EC2 APIs. This code is entwined with the kube-controller-manager, which runs on the masters. This meant I wasn’t able to use EKS directly to test the changes, as EKS master nodes cannot be controlled or upgraded by users. Instead I needed to run the masters myself and essentially simulate EKS. As long as these simulated masters resided somewhere in AWS, I would be able to accurately test the AWS integration. I thus needed to build, package, deploy & test Kubernetes for AWS from the Kubernetes Git repository – this blog records the steps (and travails) required to do this.
After a lot of searching which turned up little information, this blog post serves as an answer to anyone else needing or wanting to achieve the same thing.
The Kubernetes community is currently in the process of moving cloud vendor specific code into their own cloud provider repositories. This will allow each cloud provider to be released on a different cadence to Kubernetes itself. Unfortunately, the process isn’t trivial as the kubelet currently relies on being able to ask the cloud providers for various pieces of information at startup such as which availability zone it is running in. At the moment the code for the integration with AWS lives at staging/src/k8s.io/legacy-cloud-providers/aws within the Kubernetes repository. The AWS cloud provider code is going to be moved to cloud-provider-aws.
The basic flow for developing the AWS cloud provider is as follows:
1. Make your changes to the cloud provider code in your local checkout of the Kubernetes repository.
2. Build Kubernetes and its container images, and push the images up to ECR.
3. Stand up a test cluster in AWS using Terraform.
4. Manually test your changes against that cluster.
5. Tear the cluster down, and repeat from step 1 as needed.
The Terraform code used to manage the infrastructure changes for this testing is available at https://github.com/opencredo/hacking-k8s-on-mac.
You will need:
- A Mac (or similar workstation) with Docker installed
- An AWS account, with the AWS CLI installed and configured
- Terraform
- kubectl
- Git, along with local clones of the Kubernetes repository and of the repository accompanying this blog post
Set up your local Docker machine so that it has 50GB of disk space and 10GB of memory.
Create the ECR repositories that will be used to store the Docker images that you will build; the Terraform for this is contained in the repositories directory of the https://github.com/opencredo/hacking-k8s-on-mac repository. We will be building the images for kube-apiserver, kube-controller-manager, kube-proxy & kube-scheduler shortly but the pause image must be manually pushed up.
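Assuming a standard Terraform workflow, creating the repositories is just an apply from that directory (run terraform init first if you have not already done so):
cd repositories && terraform apply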
When we run kubeadm, it verifies and uses a pause image hosted in the same container image registry as the other images; nothing I could configure would change this. If the cluster is unable to download the pause image, this usually manifests as kubeadm timing out while waiting for the kubelet to boot the control plane. Run these commands to download the pause image and push it to the correct ECR repository.
aws ecr get-login-password | docker login --username AWS --password-stdin $(aws sts get-caller-identity --query 'Account' --output text).dkr.ecr.$(aws configure get region).amazonaws.com
docker pull k8s.gcr.io/pause:3.2
docker tag k8s.gcr.io/pause:3.2 $(aws sts get-caller-identity --query Account --output text).dkr.ecr.$(aws configure get region).amazonaws.com/pause:3.2
docker push $(aws sts get-caller-identity --query Account --output text).dkr.ecr.$(aws configure get region).amazonaws.com/pause:3.2
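To confirm that the push worked, you can query ECR for the image; this assumes the repository created by the Terraform is simply named pause, matching the tag used above:
aws ecr describe-images --repository-name pause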
These steps are performed within the Kubernetes Git repository that you should clone to your local machine.
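If you do not yet have a local checkout, clone the upstream repository and work from there; the AWS cloud provider code discussed above lives under staging/src/k8s.io/legacy-cloud-providers/aws within it.
git clone https://github.com/kubernetes/kubernetes.git && cd kubernetes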
KUBE_BUILD_PLATFORMS=linux/amd64 build/run.sh make
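The dockerized build writes the resulting binaries into _output/ within your checkout; a quick sanity check that it produced what you expect (the exact path may vary between Kubernetes versions):
ls _output/dockerized/bin/linux/amd64/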
KUBE_BUILD_PLATFORMS=linux/amd64 \
KUBE_DOCKER_REGISTRY=$(aws sts get-caller-identity --query Account --output text).dkr.ecr.$(aws configure get region).amazonaws.com \
KUBE_BUILD_CONFORMANCE=n \
build/release-images.sh
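The image tarballs are written to _output/release-images/amd64/, which is the path the push loop in the next step reads from; you can confirm they are present with:
ls _output/release-images/amd64/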
aws ecr get-login-password | docker login --username AWS --password-stdin $(aws sts get-caller-identity --query 'Account' --output text).dkr.ecr.$(aws configure get region).amazonaws.com
for i in _output/release-images/amd64/kube*.tar; do
  tag=$(docker load < "${i}" | grep 'Loaded image' | grep -v k8s.gcr | sed 's/^Loaded image: //')
  newTag=$(echo "${tag}" | sed -E 's|^(.*)-amd64([^_]*)_(.*)$|\1\2\3|')
  docker tag "${tag}" "${newTag}"
  docker push "${newTag}"
done
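The tag portion of each newTag (everything after the colon) is the value you will pass as kubernetes_version when creating the cluster in the next section, so it is worth noting it down. You can also confirm the images arrived in ECR; for example, for kube-controller-manager (one of the repositories created earlier):
aws ecr list-images --repository-name kube-controller-manager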
These steps take place within the Git repository created for this blog post that you should clone to your local machine.
cd cluster && terraform apply -var kubernetes_directory=<location where Kubernetes was checked out to> -var kubernetes_version=<image tag that the new images were pushed up with>
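As a purely illustrative example, with both values hypothetical (use your own checkout path and the tag produced by the push step above):
cd cluster && terraform apply -var kubernetes_directory=$HOME/src/kubernetes -var kubernetes_version=v1.20.0-beta.1.15_0123456789abcd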
cd cluster && $(terraform output master)
watch kubectl get nodes
kubectl -n kube-system logs --tail=-1 -l component=kube-controller-manager
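Since the original motivation was NLB load balancer services, one way to exercise the relevant cloud provider code path is to create a Service of type LoadBalancer carrying the NLB annotation, and watch the kube-controller-manager logs while it is reconciled. The deployment name and image below are purely illustrative:
kubectl create deployment echo --image=k8s.gcr.io/echoserver:1.4
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: echo
  annotations:
    service.beta.kubernetes.io/aws-load-balancer-type: "nlb"
spec:
  type: LoadBalancer
  selector:
    app: echo
  ports:
  - port: 80
    targetPort: 8080
EOF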
cd cluster && terraform destroy
If your changes have passed manual testing, then you can move on to the next section – otherwise go back to step 1.
Once you’re happy with the changes, you are now on the road to raising a PR against Kubernetes. Note that you need to:
- Sign the CNCF Contributor License Agreement (CLA) before your PR can be accepted
- Follow the Kubernetes contributor guide, including its conventions for commit messages and PR descriptions
- Add or update unit tests covering your change, and make sure the existing tests still pass
Using the workflow outlined above, I’ve managed to raise one PR so far although it is currently awaiting review before it can be merged.
Note that the per-cloud controllers are in the process of being separated out from kube-controller-manager, although this work is in its early stages at the moment. Currently cloud-provider-aws makes use of the staging/src/k8s.io/legacy-cloud-providers/aws code within the Kubernetes repository, but that is due to change as the migration continues. Once the cloud providers have been successfully calved off, some of the work in this blog post will be replaced by a simple deployment.
If you experience failures when building Kubernetes with a message similar to /usr/local/go/pkg/tool/linux_amd64/link: signal: killed, then increase the memory allocated to Docker.
I hope this blog can help those of you trying to do something similar! My final bit of parting advice is for the scenario where Docker runs out of disk space. If this happens, try cleaning out stopped Docker containers and unused volumes, as I found that building Kubernetes on a Mac leaked both.
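The following Docker commands remove stopped containers and unused volumes respectively:
docker container prune
docker volume prune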