Automation of complex IT systems

Automation?

In the recent Global Capital Confidence Barometer from EY, 36% of respondents are accelerating their investments in automation, with 41% re-evaluating. We can consider this a natural response – the drive towards increased automation is a venerable endeavour which is now simply accelerating. In a pandemic situation it appears extremely attractive, freeing businesses from dependencies on humans, and humans from risk of exposure in the workplace.

At OpenCredo, our speciality is cloud native delivery – whether applications, infrastructure platforms or data systems. Over the last 10 years or so, the cloud has emerged, exploded in complexity and transformed the infrastructure space. A significant factor in this journey has been the ability to automate infrastructure delivery – and as complexity has grown with the adoption of microservices, big data and IoT, this automation has evolved to become more sophisticated.

So, when we talk about the automation of complex systems, modern cloud-based infrastructure can serve as a good example.

The Ironies of Automation

Published in 1983, “Ironies of Automation” is a paper by Lisanne Bainbridge which was highlighted to us in a recent Morning Paper from Adrian Colyer. Although it concerns control in process industries, this classic paper is still regularly cited and is perhaps even more relevant today, in a world of highly complex distributed information systems.

Some of the highlighted ironies are:

Designer errors can be a major source of operating problems.
DevOps has created increasingly automated cloud infrastructure workflows, such as continuous delivery of containers to Kubernetes and mutation of resources with Terraform. These systems themselves create the possibility of failure and error. Even as we strive to auto-heal on failure, we can never be sure that automated actions will not make things worse.
The designer who tries to eliminate the operator still leaves the operator to do the tasks which the designer cannot think how to automate.
We rarely, if ever, see a completely automated cloud system. There are always a few items which are awkward to automate and must be overseen by human operators. These items tend to be arbitrary and are therefore difficult to model operationally in a rational way.
The automatic control system has been put in because it can do the job better than the operator, but yet the operator is being asked to monitor that it is working effectively.
Automation can perform tasks quicker and more reliably than human operators, yet we must monitor these activities to check for error. To do this, however, we must understand what is being performed and why, and be able to follow it in real time.
It is the most successful automated systems, with rare need for manual intervention, which may need the greatest investment in human operator training.
As automation grows more complete and complicated, the burden on human operators to gain sufficient understanding to monitor and remediate these systems grows heavier.

The fundamental irony that “…the more advanced a control system is, so the more crucial may be the contribution of the human operator” presents a strong challenge to the idea that humans can easily be automated away. So, while we acknowledge the fundamental benefits that automation brings – speed of response, reliability and scalability – we must see how these fit within a system that includes both humans and computers.

Human Operators

So, where we experience significant automation error, or a system state which cannot be reconciled by either automated systems or standardised playbooks, we require experts to remediate the system. They must have appropriate monitoring to show them the state of the system and what actions have been performed. They must also be appropriately trained and skilled to interpret that information, and then know how to correct the system.

The irony is that increasing automation reduces the opportunity for interaction with the system and for developing a deep understanding of it. Increased reliance on automation causes human operators to become “complacent, over-reliant or unduly diffident” in a manner that is resistant to expertise and training. As Bainbridge points out: “it is impossible for even a highly motivated human being to maintain effective visual attention towards a source of information on which very little happens, for more than about half an hour.”

So, in a highly automated system, our human operators are not required to regularly interact with the system, nor are they able to concentrate on what is going on for a meaningful period of time. Over time whatever skills they had will go stale and their mental models will diverge from the reality of the current system.

Holding up the human side of the equation will require something more than a reductionist approach with monitoring in one box, incident response in another and expertise in yet one more.

Complexity & DevOps

It might be tempting to blame all of this on the complexity of modern engineering approaches – microservices, Kubernetes and cloud – and to conclude that we should reject these and return to familiar and well-tested workflows.

In some cases, this might well be a valid conclusion. We have seen many cases of unnecessary lift-and-shift of commodity, established services to the cloud – particularly with off-the-shelf, internal services which have static utilisation profiles, slow update cycles and reside on paid-for hardware. Indeed, there has been a backlash extolling the virtues of the “glorious monolith” which avoids the complexity and cognitive load associated with microservices architectures.

At the same time, such an approach limits our ambition, sacrificing the opportunities that modern architectures were designed to deliver: solving complex, high-scale problems at speed. We give up agility and the ability to out-deliver our competitors.

Looking more closely, the operational control placed over traditional enterprise systems is often, in reality, more limited than generally acknowledged. We find that the complexity has to live somewhere. Monoliths in which developers push all functionality into a single “ball of mud” are – unsurprisingly – hard to reason about and debug. By unrolling these into microservices, for example, the complexity can become more explicit.

So, we must concede that complexity in IT is the new normal and adapt to it – and we can only expect this to increase.

Automation is a critical part of managing this complexity. It is vital that conventions are established and standardised; otherwise the gains brought by automation are lost as each team introduces their own snowflakes which require special knowledge and training to work with. However, as we have seen, automation is not sufficient and new approaches are required to ensure that human operators are able to work effectively.

Enter Chaos Engineering

“…systems development is an integrated interdisciplinary endeavour, where disciplines including software engineering, hardware engineering, human factors engineering and ergonomics, psychology and sociology all potentially have something useful to contribute.”
The ironies of automation: still going strong at 30?

We must reconcile complex automated systems with the need for human oversight and operation.

Chaos engineering is an emerging (and somewhat poorly named) discipline in IT engineering. Building on many years of research in resilience and safety in systems, chaos engineering seeks to mitigate emergent chaotic effects from complex systems (not create chaos, as might be interpreted).

Why Chaos? The complex interrelated nature of these systems creates behaviour which is difficult to interpret from the component parts. This takes the Ironies one step further by illustrating that the human operator cannot simply study the automation and architecture of the system in isolation to gain readiness. They must study the behaviour of the system in practice.

The principles of chaos engineering provide guidelines for the scientific study of overall system behaviour, with the emphasis on whether the system works rather than how it works:

Build a Hypothesis around Steady State Behaviour
Vary Real-world Events
Run Experiments in Production
Automate Experiments to Run Continuously
Minimize Blast Radius

Fundamentally, we are measuring default behaviour, altering a variable and then measuring any change. These experiments should be run in vivo (in production) and continuously, to protect against historical effects. At the same time, we must protect the overall system by limiting the impact that experiments can have and ensuring that there is an auto-cutoff function.
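The measure, alter, measure loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of any real chaos engineering tool: the names run_experiment, ExperimentResult and FakeService are hypothetical, the "system" is a stand-in object reporting a made-up error rate, and the tolerance and cutoff values are arbitrary.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, List


@dataclass
class ExperimentResult:
    baseline: float        # steady-state metric before the fault
    observed: float        # mean metric while the fault was active
    aborted: bool          # True if the auto-cutoff fired
    hypothesis_held: bool  # True if the metric stayed within tolerance


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    remove_fault: Callable[[], None],
    tolerance: float,
    abort_threshold: float,
    samples: int = 5,
) -> ExperimentResult:
    """Measure steady state, alter one variable, then measure the change."""
    # 1. Establish the steady-state baseline.
    baseline = mean([measure() for _ in range(samples)])

    observations: List[float] = []
    aborted = False
    # 2. Alter a single variable (the "real-world event").
    inject_fault()
    try:
        for _ in range(samples):
            value = measure()
            observations.append(value)
            # 3. Auto-cutoff: stop early if the deviation is too large.
            if abs(value - baseline) > abort_threshold:
                aborted = True
                break
    finally:
        # Always undo the fault, even if measuring raised an exception.
        remove_fault()

    observed = mean(observations)
    held = (not aborted) and abs(observed - baseline) <= tolerance
    return ExperimentResult(baseline, observed, aborted, held)


class FakeService:
    """Hypothetical stand-in for a real system; reports an error rate in %."""

    def __init__(self) -> None:
        self.degraded = False

    def error_rate(self) -> float:
        return 4.0 if self.degraded else 1.0

    def kill_replica(self) -> None:
        self.degraded = True

    def restore_replica(self) -> None:
        self.degraded = False


service = FakeService()
result = run_experiment(
    measure=service.error_rate,
    inject_fault=service.kill_replica,
    remove_fault=service.restore_replica,
    tolerance=0.5,         # hypothesis: error rate moves by at most 0.5 points
    abort_threshold=10.0,  # cutoff: stop if it moves by more than 10 points
)
print(result)
```

In this toy run the hypothesis is refuted (the error rate moves from 1.0 to 4.0) without the cutoff firing. In a real system the fault would be scoped to a small slice of traffic or a single instance to limit the blast radius, and the experiment would be scheduled to run continuously rather than once.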

Running chaos experiments has the dual effect of deepening the team's knowledge of the system whilst keeping them engaged and ready to respond to incidents. So, by keeping your teams interacting in a meaningful way with your automated systems, they can retain readiness for action and overcome the ironies of automation.

We see chaos engineering as a key component in an array of tools and technologies for engagement – which includes the development of automation tools, load testing, penetration testing, monitoring, data visualisation and observability. All of these approaches keep your teams energised and working with the system. The critical element, however, is that the teams are not isolated from one another, which would leave each team with only a partial understanding.

Embrace Uncertainty

In uncertain times, it can be tempting to retreat to habit: cutting costs, streamlining efficiencies and limiting innovation. Applying reductionist approaches to the delivery of IT services (neatly organising work into waterfall diagrams, teams into functional silos and services according to Conway’s Law), backed by playbook-to-script automation patterns, is rational and sensible but often incompatible with success. Coronavirus itself demonstrates how easily disruption can change the parameters and alter what constitutes being “better adapted for the immediate, local environment”.

The ironies of automation demand that skilled, cross-functional teams continuously engage with your systems. Working with complex systems is inherently uncertain, and so is knowing how best to manage them. By rejecting any illusion of omniscience and embracing the uncertainty represented by chaos engineering, we can make a solid start.

OpenCredo has been working with distributed cloud infrastructure systems for years. We understand the uncertainty central to technical systems such as Kubernetes and Cassandra. However, we too must continuously evolve and have recently adopted design thinking to align system and user expectations. We are paying close attention to emerging approaches such as chaos engineering, site reliability engineering and observability engineering and seek to ensure that we can serve our clients with the most effective solutions to their challenges and opportunities.

This blog is written exclusively by the OpenCredo team. We do not accept external contributions.
