The journey towards a secure government cloud bootstrapping process
As a company, we at OpenCredo are heavily involved in automation and DevOps-based work, with a keen focus on making this a seamless experience, especially in cloud-based environments. We are currently working within HMRC, a UK government department, to help make this a reality as part of a broader cloud broker ecosystem project. In this blog post, I provide some initial insight into the tools and techniques employed to achieve this for one particular use case, namely:
With pretty much zero human intervention, bar initiating a process and providing some inputs, a development team in any location should be able to run “something” which, in the end, results in an isolated, secure set of fully configured VMs being provisioned within a cloud provider (or providers) of choice.
Let’s start by highlighting some of the key principles employed:
- Portable Cloud Automation – The process should not rely on any human intervention to perform technical tasks, nor should any niche cloud provider functionality form part of the solution, since the goal is to be able to spin up the end environment in a variety of different cloud providers.
- Modular (Open Source) Components – In order to prevent lock-in and allow for flexibility, the end-to-end process should be built from components which are cohesive and focused on a single purpose, yet able to be plugged in, or swapped out for others, as and when required. Adopting open source software is the default policy.
- Security – This is a key concern and should be baked in from the beginning, not added as an afterthought. Amongst other things, the process should ensure that only authorised users are able to connect to the infrastructure, and that this is done in a secure manner. Additionally, any secrets or keys handled as part of the provisioning process should be closely controlled, always aiming to limit their exposure to only those users and processes which need them, at the time they need them.
It should be noted that, at this point in time, the target cloud providers are still being negotiated. This initial use case thus focused on demonstrating a working process able to operate within OpenStack-based clouds, as well as the incumbent VMware (vCD-based) cloud provider. That said, the tools and process aim to be sufficiently generic and flexible to adapt easily to other cloud providers as and when they come along. OpenCredo has already successfully used and integrated many of these tools against some of the large public cloud providers for other clients.
The vCD-based cloud provider presented some interesting challenges, largely because VMware’s origins are more virtualisation focused than cloud focused, and for this use case it was not as natural a fit as some native cloud providers. Getting around these challenges is a blog post in its own right. Nevertheless, our modular, swappable principle allowed us to slot in some custom tooling and software to handle this case where the core software is, as yet, unable to cater for it.
High level process overview, and core building blocks
The end-to-end process is broken down into four smaller, yet fully automated, processes, labelled A to D on the diagrams above and below. Subsequent sections go into more detail on each.
A – Automated base image building process
The common foundation …
Underpinning everything is the process which ensures consistent base cloud images are created and uploaded into the various target cloud providers. This ensures that, irrespective of the cloud provider used, all processes can be confident that the base OS image (currently focused on Linux, specifically CentOS 7) has certain software installed (for instance cloud-init) and meets the appropriate security or corporate standards, with subsequent processes then able to take advantage of this known state. Automated ServerSpec and Cucumber-based tests are used to ensure these images conform before being made available for use. Though several tools are used in this area, we chose Packer from HashiCorp as our primary tool, as it was the most flexible and extensible.
B – Automated IaaS provisioning process
The old …
Traditionally, enterprises and corporations have relied on manual and physical processes and procedures to create various resources within their external infrastructure providers, or within their internally maintained data centres. For example, an administrator may be given access to, and responsibility for using, an infrastructure provider’s “portal” in order to provision (spin up or create) compute resources and configure networks, firewalls and/or security rules. Sometimes this may even involve physical changes. Manual processes are hard to repeat reliably, and often result in mistakes being introduced into the environment.
The new …
In the era of modern data centre management, all of this functionality is expected to be exposed via software-based APIs, with end user programs and scripts able to control and interact with it with pretty much zero human intervention: the so-called software-defined data centre. Codifying and automating this infrastructure interaction is key to reducing infrastructure provisioning time from weeks and months down to minutes.
For the implementation of this pattern, sometimes also referred to as infrastructure as code, we chose Terraform, another HashiCorp product, as the main tool for programmatically interacting with the cloud, as well as with other infrastructure-based services such as DNS or CDN providers. Like similar tools operating in this space (such as AWS CloudFormation and OpenStack Heat), Terraform expects you to provide a template / DSL describing your desired infrastructure, and then ensures the declared resources are created as described. Another key benefit of Terraform is its ability to interact with multiple cloud providers within the same template. This enables us to, for example, use the external DNS service from one provider whilst provisioning compute resources in another, something required for this use case. All Terraform templates and bootstrap code are stored and versioned in Git repositories.
Sadly, support for vCD environments is not yet available within Terraform (or indeed in many open source products in this space). So, in order to achieve roughly the same functionality, we used a combination of vCloudTools and Fog. In reality, this does not offer the same level of cohesive control and sophistication as Terraform; however, for what we needed to achieve at this stage, enough was available, and coded, to cover the basic use cases.
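To make the declarative idea concrete, here is a minimal Python sketch of how a tool in this space might compare a declared set of resources with what already exists and work out what needs to change. It is an illustration only, not how Terraform is actually implemented; the `Plan` type, the `plan` function and the resource names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Plan:
    """The changes needed to move from the actual state to the desired state."""
    create: dict  # declared resources that do not exist yet
    update: dict  # declared resources whose settings differ from what exists
    delete: list  # existing resources that are no longer declared


def plan(desired: dict, actual: dict) -> Plan:
    """Compare declared resources (name -> settings) with existing ones."""
    create = {name: spec for name, spec in desired.items() if name not in actual}
    update = {
        name: spec
        for name, spec in desired.items()
        if name in actual and actual[name] != spec
    }
    delete = sorted(name for name in actual if name not in desired)
    return Plan(create=create, update=update, delete=delete)
```

Calling `plan` with a desired state that adds a DNS record, resizes one VM and omits another would report one resource to create, one to update and one to delete. Running it again once the two states match yields an empty plan, which is what makes this style of tooling safe to re-run.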
C – Bootstrap configuration management process
The tools …
For the application of config management to the VMs (installation of the software which turns a VM into something useful, e.g. a Jenkins or Docker Registry VM), Ansible is our configuration tool of choice: it has allowed us to get up and running really fast and we’ve had good experiences with it in the past. It uses an agentless, push-based model and requires SSH access to the target server.
Cloud-init – the async bridge …
Being a push-based model, this would ordinarily involve something initiating the Ansible runs on the target servers, and more often than not this is done by making synchronous calls as part of, or just after, IaaS provisioning. We, however, use a combination of cloud-init and a certain minimal, predictable setup within the target environment to initiate a self-contained, asynchronous application of the initial bootstrap config management to boxes, irrespective of the initiating provisioning tool’s capability or timing in this regard.
Managing the bootstrap process …
The exact details of how this works require a blog post of their own; however, a summary at this stage should give you an overview and a flavour of things to come. As part of the target environment, a bootstrap server with a predictable IP address takes responsibility, via cloud-init, for downloading the config management code from a Git-based repository, gaining access to any secrets from the secret store Vault (more on this later), establishing internal DNS, and making a mini HTTP-based Ansible listening service available.
As and when other VMs boot, their cloud-init setup is configured to poll and wait for internal DNS and the Ansible listener to become available. Once this is detected, the VM joins the internal domain and, via the listener, requests an Ansible run on itself. The nice thing about this approach is that it does not require any specific coordination logic, even for the bootstrap server itself. VMs initially self-configure, as and when the appropriate services become available.
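The poll-and-wait behaviour described above can be sketched in a few lines of Python. This is a simplified illustration rather than our actual cloud-init scripts: the function names, the default interval and the retry limit are assumptions, and the checks for DNS and the listener are passed in as plain callables so that no particular hostname, port or URL is implied.

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    interval: float = 5.0,
    max_attempts: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call check() repeatedly until it returns True or attempts run out."""
    for _ in range(max_attempts):
        if check():
            return True
        sleep(interval)  # back off before the next attempt
    return False


def bootstrap(
    dns_ready: Callable[[], bool],
    listener_ready: Callable[[], bool],
    request_run: Callable[[], None],
    **wait_options,
) -> bool:
    """Wait for internal DNS, then the Ansible listener, then request a run."""
    if not wait_until(dns_ready, **wait_options):
        return False
    if not wait_until(listener_ready, **wait_options):
        return False
    request_run()  # ask the listener to run Ansible against this VM
    return True
```

Because each VM waits on its own, nothing has to track boot order centrally: a VM that comes up before the bootstrap server simply keeps polling until the services appear, or gives up after `max_attempts`.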
Management in the long term …
Beyond the initial bootstrap config, a dedicated external process (a Jenkins job) then takes over responsibility for maintaining the VMs and applying subsequent updates. Moving forward, however, a decision has been taken to move to Puppet for installation-based configuration management. Existing skill sets and organisational requirements make this a better fit for handling the longer-term maintenance and operational requirements of the VMs. Ansible will, however, continue to be used for certain orchestration-based tasks, for example software releases that require orchestration.
D – Process to establish secure connectivity
Finally, there needs to be some way for end users to gain access to the environment which has just been spun up. As part of the predictable environment setup, an OpenVPN server will have been created. By default, the security and firewall rules created via the IaaS provisioning process prevent any access that does not come through the VPN. Using information supplied by the user at creation time (for example a dev team prefix), an externally accessible subdomain is automatically established, and the publicly accessible IP of the OpenVPN VM is automatically registered with an appropriate DNS solution provider via the IaaS provisioning process. This provides a predictable address for the dev team to configure their VPN clients against, and thus gain access to the environment.
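As an illustration of how a predictable address can be derived from user input, the following Python sketch builds a VPN hostname from a dev team prefix. The `vpn.<prefix>.<base domain>` naming scheme, the function name and the example domain are assumptions made for this sketch; the real scheme may differ.

```python
import re

# A single DNS label: letters, digits and hyphens, 1-63 characters,
# and not starting or ending with a hyphen.
_DNS_LABEL = re.compile(r"[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?")


def vpn_hostname(team_prefix: str, base_domain: str) -> str:
    """Build the externally visible VPN hostname for a dev team."""
    label = team_prefix.strip().lower()
    if not _DNS_LABEL.fullmatch(label):
        raise ValueError(f"{team_prefix!r} is not usable as a DNS label")
    return f"vpn.{label}.{base_domain.strip('.').lower()}"
```

For example, `vpn_hostname("DevTeamA", "example.org")` gives `vpn.devteama.example.org`. Validating the prefix up front means a bad input fails before any infrastructure is created, rather than part way through provisioning.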
But is it secure?
From the outset, the architectural approach has been to ensure that key security areas are considered, and implemented to at least a minimal level upfront, allowing for further hardening later down the line if and when required. Security is one of those areas that is never really “finished”, and this project is no exception. Though there are still various security-related items on our roadmap and backlog, the next sections aim to highlight the approach and steps taken so far in securing some of these key areas. It should be stated that we are happily running our own tooling infrastructure (which builds and deploys the software we are developing) using this exact setup, in various clouds, with no problems to date.
It is also worth clarifying that, at this point in time, the setup described in this blog post is only being trialled on systems which do not currently store any personally identifiable information. This setup is thus applicable to just about any enterprise or company wanting to run an isolated, secure environment in a public cloud. Integrating systems with more stringent data and connectivity requirements is indeed part of the roadmap; however, these aspects are not covered in this blog.
Secrets and key management
A fundamental requirement for setting up any new environment is being able to securely gain access to sensitive information such as secrets and keys. This is required for example to be able to connect to the appropriate cloud provider via their API, as well as configuring some of the software on the VMs once spun up with certain passwords, certificates or keys as appropriate. In this space, Vault from HashiCorp is being trialled and assessed as our core secure secret and key management service. Vault exposes a secure API which the tools and various processes can use in order to gain access to appropriate sensitive information including all passwords, certificates, keys etc.
Though individual tools may have their own way of handling or interacting with sensitive data (for example Ansible Vault or Puppet’s encrypted Hiera), and many people opt to simply load the details out of Git repos, we wanted to offload this responsibility to a component whose sole purpose is to live and breathe security, including auditing access, and one which we feel can do this well! With this approach, if we choose to switch out a component, say moving from Ansible to Puppet, our sensitive data is still stored and managed by a software component we have vetted and trust. The only requirement is that any new tool or component which needs sensitive data must be able to get it from the Vault server via its API.
Vault as a product is still very much evolving and we are keeping a close eye on it; however, the vibrant community behind it, and the solid security principles it is founded on, make it a great fit thus far for integration into this modern data centre provisioning process.
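To show what “getting it from the Vault server via its API” might look like for a tool, here is a minimal Python sketch using only the standard library. It assumes Vault’s plain key/value backend mounted at `secret/`, read with a client token passed in the `X-Vault-Token` header; the server address, token and secret path are hypothetical, and other backends or versions use different paths and response shapes.

```python
import json
import urllib.request


def build_secret_request(vault_addr: str, token: str, path: str) -> urllib.request.Request:
    """Build the HTTP request that reads one secret from the key/value backend."""
    url = f"{vault_addr.rstrip('/')}/v1/secret/{path.lstrip('/')}"
    return urllib.request.Request(url, headers={"X-Vault-Token": token})


def parse_secret(body: bytes) -> dict:
    """Extract the stored key/value pairs from a Vault JSON response."""
    return json.loads(body.decode("utf-8"))["data"]


def read_secret(vault_addr: str, token: str, path: str) -> dict:
    """Fetch a secret over HTTPS; the caller supplies a valid token."""
    request = build_secret_request(vault_addr, token, path)
    with urllib.request.urlopen(request) as response:
        return parse_secret(response.read())
```

Since the only contract is an HTTP call with a token, swapping Ansible for Puppet changes who makes the request, not where the secrets live or how access to them is audited.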
From a user access perspective, VMs can only be accessed with authorised SSH keys, and only once the cloud environment’s dedicated VPN server has been established. By default, there is no direct internet access to any of the machines, with the exception of the VPN machine itself, which allows secure access from anywhere capable of running a VPN client. As the security setup, config and firewall rules are all codified, they are easy to review, and auditing and detecting changes becomes much easier.
At present, all VMs within an environment run off a CentOS 7.1 image and are designed to take advantage of SELinux for enhanced OS and application security, with the security state enforced by configuration management.
By automating everything, and blending together a choice set of modular tools, the goal of establishing an isolated, secure set of fully configured VMs within a matter of minutes in a cloud provider of choice has become a reality. Though there is still a long way to go, and there will no doubt be additional challenges along the way (this is still an ongoing project), the process above has already borne fruit within our own tooling environment. Eating our own dog food, we are currently able to spin up, test and destroy our entire tooling infrastructure in about 40 minutes, depending on which cloud provider is chosen. This came in very handy when we needed to test our own disaster recovery capability by moving from one cloud provider to another as a result of some disk issues which lasted a few days. We are looking forward to the next steps in this journey, and aim to blog more about some of the specifics in due course.
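Once rules exist as code, a check like “nothing but the VPN machine is reachable from the internet” can itself be automated. The Python sketch below illustrates the idea under simplifying assumptions: the `IngressRule` structure, the target names and the `internet_exposed` function are hypothetical and do not correspond to any particular cloud provider’s rule format.

```python
from dataclasses import dataclass

ANYWHERE = "0.0.0.0/0"  # CIDR block matching every IPv4 address


@dataclass(frozen=True)
class IngressRule:
    target: str       # logical name of the VM or group the rule protects
    port: int         # destination port being opened
    source_cidr: str  # where traffic is allowed to come from


def internet_exposed(rules, allowed_targets=frozenset({"vpn"})):
    """Return rules that open a non-VPN target to the whole internet."""
    return [
        rule
        for rule in rules
        if rule.source_cidr == ANYWHERE and rule.target not in allowed_targets
    ]
```

Run against the codified rules as part of a review or a build, an empty result confirms the intended posture, and any returned rule points directly at the change that opened something up.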