The delivery of software has been transformed in recent years by increased adoption of Continuous Integration (CI) and Continuous Delivery & Deployment (CD) processes, and the introduction of the DevOps approach to infrastructure management.
To give a somewhat simplistic summary: these processes are primarily concerned with the automated validation and delivery of application and infrastructure deliverables. We seek a single pipeline that automatically verifies that our deliverables meet both functional and non-functional requirements and delivers them efficiently and reliably through to production.
The focus here is on evaluating each individual delivery against quality criteria defined by the delivery team and the business, so that we can have confidence that the risk of deployment is low. We further reduce risk by ensuring that deliveries are small, incremental and reversible if a delivery proves problematic.
These methods are fundamentally important to the delivery of modern software and can grow sophisticated, with full test deployments to environments that replicate production and the use of service virtualisation to model system behaviour.
So, what gap does Continuous Verification fill?
Alongside DevOps and CI/CD, the adoption of cloud, distributed NoSQL databases, microservices and Kubernetes has generated an explosion in the complexity of IT systems. Systems which previously had three layers (presentation, application and persistence) may now have hundreds of moving parts. The inherent complexity of these systems has been pushed out of the monolith and into the many components composing the system architecture and infrastructure. This is reflected in studies and surveys: for example, 76% of chief information officers surveyed think growing IT complexity may soon make it impossible to efficiently manage digital performance.
Many systems have reached the point where it is no longer possible for a single person to fully understand their architecture and interactions; others are following closely behind. These systems are complex: their behaviour can no longer be reasonably inferred from the properties of their parts. From a CI/CD perspective, then, we cannot reason about how a deliverable will behave in the production environment based only on the quality guarantees provided by the pipeline.
This is scary stuff for enterprise IT, which has long operated in a command-and-control manner (despite attempts to adopt more Agile methods) and for which CI/CD itself is often a relatively novel methodology. The control we have demanded is now impossible in the traditional sense: we must think about our IT systems as wholes rather than reduce them to individual components.
So how do we manage quality in a complex system?
To make some inroads here, we might look to the basic methods science uses to study complex systems, such as those found in biology. By performing experiments on the system, we can learn about its behaviour and whether that behaviour lines up with our expectations:
- Generate a hypothesis about expected behaviour: as per the problem of induction, we actually require a null hypothesis – namely that nothing will change.
- Configure a measurement for a dependent variable, the thing that will tell us whether something has changed.
- Alter an independent variable and see whether changes occur in measurements for the dependent variable which would invalidate the null hypothesis.
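As a concrete sketch, the three steps above can be simulated in a few lines of Python. Everything here is an illustrative assumption – the latency distribution, the injected fault and the tolerance are invented for the example rather than drawn from any real system:

```python
import random
import statistics

def measure_latency(inject_delay_ms: float = 0.0, samples: int = 1000) -> list[float]:
    """Simulate measuring the dependent variable: request latency in ms."""
    return [random.gauss(50, 5) + inject_delay_ms for _ in range(samples)]

# 1. Null hypothesis: injecting network delay into a dependency does
#    NOT change end-to-end latency.
baseline = measure_latency()

# 2. The dependent-variable measurement: median latency.
baseline_median = statistics.median(baseline)

# 3. Alter the independent variable (inject 20ms of delay) and re-measure.
experiment = measure_latency(inject_delay_ms=20.0)
experiment_median = statistics.median(experiment)

# If the medians diverge by more than our tolerance, the null hypothesis
# is refuted: the injected fault is visible in user-facing behaviour.
TOLERANCE_MS = 5.0
null_hypothesis_holds = abs(experiment_median - baseline_median) < TOLERANCE_MS
print(f"baseline={baseline_median:.1f}ms experiment={experiment_median:.1f}ms "
      f"null hypothesis holds: {null_hypothesis_holds}")
```

In a real experiment the measurement would come from production observability tooling and the perturbation from a fault-injection tool, but the shape of the reasoning is the same.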
This approach is, of course, the Principles of Chaos Engineering restated. By performing experiments we learn how chaos manifests in our system and whether the system behaves as we expect it to.
So, we need to generate hypotheses about how the system works. There are many people involved in the development and operation of an IT system: developers, SREs, managers and architects. We can expect each of these to only have a partial mental model about how a system works. So, we must bring together all stakeholders so that we can get a broad perspective and use the power of the collective group to generate effective hypotheses.
With our hypotheses in hand, we can then experiment on the system to refute – or fail to refute – our beliefs about it, and in the process gain an understanding of how the system will behave under certain conditions or following particular deliveries.
Continuous Verification (CV) is an extension of the CI/CD process that is concerned with verifying the system as a whole. For the cynical, it might be tempting to consider it a rebranding of Chaos Engineering. To turn this around, Chaos Engineering is simply one tool in the arsenal of CV: it includes, and depends on, a whole array of tooling – from canary deployments to observability to FinOps. The critical element is that we are verifying that the system as a whole behaves as we expect it to, in addition to our existing tests for the behaviour of component parts.
We must test in production: in other environments (dev/UAT) different sets of conditions are in play, which may produce different results. This means we must have discipline in the deployment process, ensuring that experiments can be delivered to samples of traffic and rolled back effectively if they cause significant issues.
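To make that discipline concrete, here is a minimal sketch of an automated canary loop: a small sample of traffic is routed to the new version, and it is rolled back if its error rate blows the budget. The traffic fraction, error budget and simulated handlers are all illustrative assumptions; in practice this logic lives in a service mesh or deployment controller, not application code:

```python
import random

CANARY_FRACTION = 0.05   # route 5% of traffic to the canary
ERROR_BUDGET = 0.02      # roll back if canary error rate exceeds 2%
MIN_SAMPLES = 200        # don't judge the canary on too few requests

def handle_request(version: str) -> bool:
    """Simulated request handler; returns True on error.

    The canary is deliberately faulty (10% errors) so that the
    automated rollback path is exercised.
    """
    error_rate = 0.10 if version == "canary" else 0.005
    return random.random() < error_rate

canary_requests, canary_errors = 0, 0
rolled_back = False

for _ in range(10_000):
    use_canary = not rolled_back and random.random() < CANARY_FRACTION
    version = "canary" if use_canary else "stable"
    error = handle_request(version)
    if version == "canary":
        canary_requests += 1
        canary_errors += error
        if canary_requests >= MIN_SAMPLES and canary_errors / canary_requests > ERROR_BUDGET:
            rolled_back = True   # stop sending traffic to the canary

print(f"canary served {canary_requests} requests, "
      f"error rate {canary_errors / canary_requests:.1%}, rolled back: {rolled_back}")
```

The key property is that the blast radius is bounded: only a sample of traffic ever sees the experiment, and the rollback decision is automatic rather than waiting on a human.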
Taking a broad view, Continuous Verification may grow to include all methods and tools which help us understand complex IT systems in more detail. There is a significant historical body of work in resilience engineering, safety engineering and complex systems modelling. Over time we would expect elements of these disciplines to be trialled and adapted to the IT context, with those suited to the more general cases becoming mainstream.