Kubernetes from scratch to AWS with Terraform and Ansible (part 1)

Terraform, Ansible, AWS, Kubernetes

Part 1 (this post): Provision the infrastructure, with Terraform.
Part 2: Install and configure Kubernetes, with Ansible.
Part 3: Complete the setup and smoke test it, deploying an nginx service.

The goal

The purpose of this series of articles is to present a simple but realistic example of how to provision a Kubernetes cluster on AWS using Terraform and Ansible. This is an educational tool, neither a production-ready solution nor a quick way to deploy Kubernetes.

I will walk through the sample project, describing the steps to automate provisioning a simple Kubernetes setup on AWS and trying to make clear most of the simplifications and cut corners. To follow along, you need some basic understanding of AWS, Terraform, Ansible and Kubernetes.

The complete working project is available here: https://github.com/opencredo/k8s-terraform-ansible-sample. Please read the documentation included in the repository, which describes the requirements and the step-by-step process to run it.

Starting point

Our Kubernetes setup is inspired by Kubernetes the Hard Way, by Kelsey Hightower from Google. The original tutorial targets Google Cloud. My goal is to demonstrate how to automate the process on AWS, so I “translated” it to AWS and turned the manual steps into Terraform and Ansible code.

[update] Kelsey Hightower recently updated his tutorial to support AWS. I’m happy to see his AWS implementation is not very different from mine 🙂
I’ve updated the source code and the post, adding a few elements to match the new version of his tutorial. Additions are marked with [update].

Target platform

Infrastructure

Our target Kubernetes cluster includes:

- 3 EC2 instances for the Kubernetes Control Plane: Controller Manager, Scheduler, API Server.
- 3 EC2 instances for the HA etcd cluster.
- 3 EC2 instances as Kubernetes workers (aka Minions or Nodes, in Kubernetes jargon).
- Container networking using the Kubenet plugin (relying on CNI).
- HTTPS communication between all components.
- All Kubernetes and etcd components run as services directly in the VM (not in containers).

Automation Tools

I chose to pair Terraform with Ansible for several reasons:

- Terraform's declarative approach works very well for describing and provisioning infrastructure resources, but it is very limited when you have to install and configure software;
- Ansible's procedural approach is very handy for installing and configuring software on heterogeneous machines, but it becomes hacky when you use it to provision infrastructure;
- Terraform+Ansible is the stack we often use for production projects at OpenCredo.

The capabilities of Terraform and Ansible overlap, so I draw a clear line: I will use Terraform to provision infrastructure resources, then pass the baton to Ansible to install and configure software components.

Terraform to provision infrastructure

Terraform allows us to describe the target infrastructure; it then takes care of creating, modifying or destroying any resource required to match our blueprint. Despite its declarative nature, Terraform allows some programming patterns. In this project, resources are grouped in files, constants are externalised as variables, and we will use templating. We are not going to use Terraform Modules.

The code snippets have been simplified. Please refer to the code repository for the complete version.

Create VPC and networking layer

After specifying the AWS provider and the region (omitted here), the first step is defining the VPC, the single subnet and an Internet Gateway.

To make the subnet public, let’s add an Internet Gateway and route all outbound traffic through it:
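What makes the subnet public is the default route: anything not addressed to the VPC itself is sent to the Internet Gateway. The Python sketch below models that route-table lookup (longest prefix wins). It is an illustration, not the project's code: 10.43.0.0/16 is an assumed VPC CIDR chosen because it contains the 10.43.0.x addresses used later, and "igw" is a placeholder for the gateway.

```python
import ipaddress

# Hypothetical route table for the single public subnet.
# 10.43.0.0/16 is an assumed VPC CIDR; "igw" stands in for the Internet Gateway.
ROUTES = [
    (ipaddress.ip_network("10.43.0.0/16"), "local"),  # traffic that stays in the VPC
    (ipaddress.ip_network("0.0.0.0/0"), "igw"),       # everything else goes out via the IGW
]


def next_hop(destination: str) -> str:
    """Return the target of the most specific route that matches destination."""
    addr = ipaddress.ip_address(destination)
    matching = [(net, target) for net, target in ROUTES if addr in net]
    # Longest-prefix match: the route with the largest prefix length wins.
    _, target = max(matching, key=lambda route: route[0].prefixlen)
    return target


print(next_hop("10.43.0.21"))  # local
print(next_hop("8.8.8.8"))     # igw
```

The 0.0.0.0/0 entry matches every IPv4 address, so the lookup never comes back empty; more specific routes simply take precedence over it.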

We also have to import the key pair that will be used for all instances, so that we can SSH into them. The public key must correspond to the identity file loaded into the SSH agent (please see the README for more details).
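A quick way to check that the public key handed to Terraform matches an identity in the agent is to compare fingerprints. The Python sketch below computes the SHA256 fingerprint that recent OpenSSH versions print with ssh-add -l. It assumes a single-line OpenSSH public key (the content of a .pub file) and is a standalone helper, not part of the project.

```python
import base64
import hashlib


def openssh_sha256_fingerprint(pubkey_line: str) -> str:
    """Compute the OpenSSH-style SHA256 fingerprint of a public key line.

    pubkey_line looks like "ssh-rsa AAAA... comment". The fingerprint is the
    SHA-256 digest of the decoded key blob (second field), base64-encoded
    with the trailing "=" padding removed.
    """
    blob = base64.b64decode(pubkey_line.split()[1])
    digest = hashlib.sha256(blob).digest()
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")
```

If the returned value appears in the output of ssh-add -l, the agent holds the private key matching the key pair Terraform imports.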

[update] To match Hightower’s version, I’ve added the definition of custom DHCP Options. It doesn’t make much difference at the moment, as they match the default DHCP Options.

Create EC2 Instances

I’m using an official AMI for Ubuntu 16.04 to keep it simple. Here is, for example, the definition of etcd instances.

The other instances (controllers and workers) are defined the same way, except for one important detail: workers have source_dest_check = false, which allows them to send and receive packets whose IPs do not match the IP assigned to the machine by AWS (needed for inter-container communication).
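To see why workers need this flag, here is a deliberately simplified Python model of the EC2 source/destination check: with the check on, an instance only handles traffic it is itself the source or destination of. The class, the function and the 10.200.x.x container addresses are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    private_ip: str
    source_dest_check: bool = True  # EC2 default


def delivers(instance: Instance, src_ip: str, dst_ip: str) -> bool:
    """Simplified source/destination check.

    With the check enabled, traffic is dropped unless the instance itself is
    the source or the destination. Disabling it lets a worker forward packets
    addressed to containers, whose IPs do not belong to the VM.
    """
    if not instance.source_dest_check:
        return True
    return instance.private_ip in (src_ip, dst_ip)


worker = Instance("10.43.0.30", source_dest_check=False)
print(delivers(worker, "10.200.0.5", "10.200.1.7"))  # True: container-to-container traffic
```

Etcd and controller instances only ever talk with their own addresses, so they can keep the default.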

[update] We also define an IAM Role, a Role Policy and an Instance Profile for the controller instances. They are only actually required if you extend the example, adding a proper AWS integration.

Static vs. dynamic IP address vs. internal DNS

Instances have a static private IP address. I’m using a simple address pattern to make them human-friendly: 10.43.0.1x are etcd instances, 10.43.0.2x Controllers and 10.43.0.3x Workers (aka Kubernetes Nodes or Minions), but this is not a requirement.
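The addressing convention is easy to express in code. The Python sketch below derives each node's address from its type and index, and refuses to produce clashes. The node names (etcd0, controller1, ...) and the 10.43.0.0/24 subnet are assumptions for illustration; the project assigns these addresses directly in Terraform.

```python
import ipaddress

SUBNET = ipaddress.ip_network("10.43.0.0/24")  # assumed: any subnet containing 10.43.0.x
# Last-octet "decade" per node type: 10.43.0.1x etcd, 2x controllers, 3x workers.
BASE_OCTET = {"etcd": 10, "controller": 20, "worker": 30}


def node_ips(count_per_type: int = 3) -> dict:
    """Map hypothetical node names to static private IPs following the pattern."""
    if not 0 < count_per_type <= 10:
        raise ValueError("the 1x/2x/3x pattern supports 1 to 10 nodes per type")
    ips = {}
    for node_type, base in BASE_OCTET.items():
        for i in range(count_per_type):
            ips[f"{node_type}{i}"] = str(SUBNET.network_address + base + i)
    # The "map" of assigned IPs must never contain the same address twice.
    if len(set(ips.values())) != len(ips):
        raise ValueError("address clash")
    return ips
```

The limit of ten nodes per type is the price of a human-friendly pattern; it is one of the reasons this approach does not scale to big projects.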

A static address is required to have a fixed “handle” for each instance. In a big project, you have to create and maintain a “map” of assigned IPs and be careful to avoid clashes. It sounds easy, but it can quickly become messy. On the flip side, dynamic IP addresses change if (when) VMs restart because of some uncontrollable event (hardware failure, the provider moving the VM to different physical hardware, etc.), so the DNS entry must be managed by the VM, not by Terraform… but that is a different story.

Real-world projects use internal DNS names as stable handles, not static IPs. But to keep this project simple, I will use static IP addresses, assigned by Terraform, and no DNS.

Installing Python 2.x?

Ansible requires Python 2.5+ on managed machines. Ubuntu 16.04 ships only with Python 3, which is not compatible with Ansible, so we have to install Python 2 before running any playbook.

Terraform has a remote-exec provisioner, so we might execute apt-get install python... on the newly provisioned instances. But the provisioner is not very smart. The option I adopted instead is to let Ansible “pull itself up by its bootstraps” and install Python using the raw module. We’ll see it in the next article.

Resource tagging

Every resource has several tags assigned (omitted from the snippets above):

- ansibleFilter: fixed for all instances (“Kubernetes01” by default), used by the Ansible Dynamic Inventory to filter the instances belonging to this project (we will see the Dynamic Inventory in the next article).
- ansibleNodeType: defines the type (group) of the instance, e.g. “worker“. Used by Ansible to group hosts of the same type.
- ansibleNodeName: readable, unique identifier of an instance (e.g. “worker2“). Used by the Ansible Dynamic Inventory as a replacement for the hostname.
- Name: identifies the single resource (e.g. “worker-2“). No functional use, but useful on AWS console.
- Owner: your name or anything identifying you, if you are sharing the same AWS account with other teammates. No functional use, but handy to filter your resources on AWS console.
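The way the three ansible* tags cooperate can be sketched in a few lines of Python. This is a hypothetical illustration of the grouping logic only, not the Dynamic Inventory script the next article uses: the flattened instance records and the public_ip field are assumptions, and the addresses come from the 203.0.113.0/24 documentation range.

```python
from collections import defaultdict


def build_inventory(instances, ansible_filter="Kubernetes01"):
    """Group instances by ansibleNodeType, keyed by ansibleNodeName.

    instances is a list of hypothetical records like
    {"public_ip": "...", "tags": {...}}. Instances whose ansibleFilter tag
    differs from ansible_filter belong to another project and are skipped.
    """
    groups = defaultdict(dict)
    for instance in instances:
        tags = instance["tags"]
        if tags.get("ansibleFilter") != ansible_filter:
            continue
        # ansibleNodeName replaces the hostname; the value is the address to SSH to.
        groups[tags["ansibleNodeType"]][tags["ansibleNodeName"]] = instance["public_ip"]
    return dict(groups)


instances = [
    {"public_ip": "203.0.113.10", "tags": {"ansibleFilter": "Kubernetes01", "ansibleNodeType": "worker", "ansibleNodeName": "worker0"}},
    {"public_ip": "203.0.113.11", "tags": {"ansibleFilter": "Kubernetes01", "ansibleNodeType": "etcd", "ansibleNodeName": "etcd0"}},
    {"public_ip": "203.0.113.12", "tags": {"ansibleFilter": "SomeoneElse", "ansibleNodeType": "worker", "ansibleNodeName": "worker9"}},
]
print(build_inventory(instances))
```

The third instance is ignored because its ansibleFilter tag differs, which is exactly what lets several projects share one AWS account.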

A load balancer for Kubernetes API

For High Availability, we have multiple instances running the Kubernetes API server, and we expose the control API through an external Elastic Load Balancer.

The ELB works at TCP level (layer 4), forwarding connections to the destination. The HTTPS connection is terminated by the service, not by the ELB, so the ELB needs no certificate.
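A toy Python model makes the layer-4 behaviour concrete: the balancer only picks a backend for each new connection and passes the bytes through unchanged, so the TLS handshake reaches the API server intact. Round-robin selection and the controller addresses are assumptions for illustration; this is not how ELB is implemented.

```python
from itertools import cycle


class TcpPassthroughBalancer:
    """Toy layer-4 balancer: choose a backend per connection, never inspect the payload."""

    def __init__(self, backends):
        self._next_backend = cycle(backends)

    def route(self, payload: bytes):
        backend = next(self._next_backend)
        # No decryption here: TLS is terminated by the API server behind the balancer.
        return backend, payload


balancer = TcpPassthroughBalancer(["10.43.0.20", "10.43.0.21", "10.43.0.22"])
tls_record_header = b"\x16\x03\x01"  # first bytes of a TLS handshake record
print(balancer.route(tls_record_header))
```

Because the payload is opaque to the balancer, the certificate presented to clients is the one configured on the API servers.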

Security

Security is greatly simplified in this project. We have two Security Groups: one for all instances and another for the Kubernetes API Load Balancer (some rules omitted here).

All instances are directly accessible from outside the VPC, which is not acceptable for any production environment. But security is not entirely lax: inbound traffic from outside is allowed only from one IP address, the public IP address you are connecting from. This address is configured by the Terraform variable control_cidr. Please read the project documentation for further explanations.
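The effect of control_cidr can be sketched as an ingress check in Python. This assumes, as a labelled simplification, that one rule admits the single /32 built from your public IP and another admits traffic from inside the VPC (assumed CIDR 10.43.0.0/16) so that the instances can talk to each other; ports and protocols are ignored.

```python
import ipaddress


def control_cidr_for(public_ip: str) -> str:
    """Turn the public IP you connect from into a single-host CIDR (/32)."""
    return str(ipaddress.ip_network(f"{public_ip}/32"))


def ingress_allowed(source_ip: str, control_cidr: str, vpc_cidr: str = "10.43.0.0/16") -> bool:
    """Simplified Security Group check: allow your own IP and intra-VPC traffic only."""
    src = ipaddress.ip_address(source_ip)
    allowed = (ipaddress.ip_network(control_cidr), ipaddress.ip_network(vpc_cidr))
    return any(src in net for net in allowed)


cidr = control_cidr_for("198.51.100.7")
print(cidr)  # 198.51.100.7/32
```

Any other address on the Internet, even one next to yours, is rejected.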

Lax as it is, this configuration is still tighter than the default security set up by the Kubernetes cluster creation script.

Self-signed Certificates

All communication between Kubernetes components and the control API uses HTTPS, so we need a server certificate. It is self-signed with our own private CA: Terraform generates a CA certificate and a server key+certificate, then signs the latter with the CA. The process uses CFSSL.

The interesting point here is the template-based generation of certificates. I’m using Data Sources, a feature introduced in Terraform 0.7.

All components use the same certificate, so it has to include all addresses (IPs and/or DNS names). In a real-world project, we would use stable DNS names, and the certificate would include only those.
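The templating idea can be sketched in Python with string.Template, whose ${...} placeholders resemble Terraform's template syntax. The template follows the CFSSL CSR JSON layout (CN, hosts, key); the CN "kubernetes", the RSA 2048 key and the DNS name in the usage line are assumptions, not values taken from the project.

```python
import json
from string import Template

# CFSSL CSR template; ${hosts} is filled in with every address the certificate must cover.
CSR_TEMPLATE = Template("""{
  "CN": "kubernetes",
  "hosts": ${hosts},
  "key": {"algo": "rsa", "size": 2048}
}""")


def render_csr(ips, dns_names):
    """Render the CSR with each IP and DNS name listed once (IPs first, then names)."""
    hosts = sorted(set(ips)) + sorted(set(dns_names))
    return json.loads(CSR_TEMPLATE.substitute(hosts=json.dumps(hosts)))


csr = render_csr(["10.43.0.20", "10.43.0.10", "127.0.0.1"], ["api.k8s.example.com"])
print(json.dumps(csr, indent=2))
```

Every new node or load balancer name means re-rendering the CSR and re-issuing the certificate, which is another cost of a single shared certificate.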

The CFSSL command-line utility generates .pem files for the CA certificate, the server key and the server certificate. In the following articles, we will see how they are uploaded to all machines and used by the Kubernetes CLI to connect to the API.
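For reference, the two CFSSL steps correspond to the pipelines cfssl gencert ... | cfssljson -bare NAME. The Python sketch below only builds and optionally runs those command lines; the file names (ca-csr.json, kubernetes-csr.json, ca-config.json) and the profile name are hypothetical placeholders, and in the project these steps are driven by Terraform rather than by Python.

```python
import subprocess


def cfssl_pipelines(ca_csr="ca-csr.json", server_csr="kubernetes-csr.json",
                    ca_config="ca-config.json", profile="kubernetes"):
    """Return (generate, save) command pairs: create the CA, then sign the server cert."""
    init_ca = (["cfssl", "gencert", "-initca", ca_csr],
               ["cfssljson", "-bare", "ca"])
    server = (["cfssl", "gencert", "-ca=ca.pem", "-ca-key=ca-key.pem",
               f"-config={ca_config}", f"-profile={profile}", server_csr],
              ["cfssljson", "-bare", "kubernetes"])
    return [init_ca, server]


def expected_outputs(bare_name):
    """Files written by cfssljson -bare NAME."""
    return [f"{bare_name}.pem", f"{bare_name}-key.pem", f"{bare_name}.csr"]


def run(pipeline):
    """Pipe cfssl's JSON output into cfssljson (requires both tools on PATH)."""
    generate, save = pipeline
    output = subprocess.run(generate, check=True, capture_output=True).stdout
    subprocess.run(save, input=output, check=True)
```

The CA must be generated first, because the second pipeline reads ca.pem and ca-key.pem to sign the server certificate.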

Known simplifications and limitations

Let’s sum up the most significant simplifications introduced in this part of the project:

- All instances are in a single, public subnet.
- All instances are directly accessible from outside the VPC. No VPN, no bastion (though traffic is allowed only from a single, configurable IP).
- Instances have static internal IP addresses. Any real-world environment should use DNS names and, possibly, dynamic IPs.
- A single server certificate for all components and nodes.

Next steps

The infrastructure is now complete. There is still a lot of work to do to install all the components required by Kubernetes. We will see how to do it, using Ansible, in the next article.

This blog is written exclusively by the OpenCredo team. We do not accept external contributions.
