
October 12, 2023 | Blog, Platform Engineering


WRITTEN BY

OpenCredo


Event Driven Load Testing

Supporting scale-up organisations with disparate engineering teams

Recently we spent a lot of time working with an EdTech client that had issues reconciling work across teams. There were many organisational factors at play; primarily, however, it was the result of many remote teams interacting across the globe from disparate time zones. This disparity dramatically limited the amount of synchronous communication the teams could have, making collaboration a problem.

This blog post explores how, through some smart automation techniques, testing strategies can be adapted to support scale-up organisations where there are potentially many disparate teams needing to work together. 

Setting the Scene

Let's take the example of a fictional client, InnerScale. They have an infrastructure team based in the UK and two application teams based in Australia and Singapore. They are an AWS shop, currently deploying with ECS, RDS, and ELB, which the application teams helped stand up before the infrastructure team was hired. The infrastructure team is building a new Kubernetes-based platform for InnerScale while the application teams build features requested by customers. Right now they have fewer than 10,000 active users and are looking to increase that 50-fold within the next year, and then double again to around 1 million users in the following six months.

The problem InnerScale faces is that they have no idea how the application will scale, or even if it can. What they are collectively observing is that the current platform is not fulfilling the aims of the company. InnerScale needs to drive its business forward and close the gap between the infrastructure and wider engineering teams without increasing their burden through meetings and increased remote contact.

The QA team, tasked with addressing this, faces two problems:

  • Delivering solid test data to every team, based on the changes being made
  • Reporting any issues discovered, without relying on the teams being able to spend time working together.

Dissecting the problem

InnerScale’s QA team have built some basic load testing scenarios that are executed locally on an ad-hoc basis: they perform user flow tests at a small scale and some stress tests against the login process. The problem they face here is two-fold: first, they are only testing the happy path, and second, they are only testing in a reactive fashion. If they discover a problem, they write a test to confirm it and deliver the results manually to the appropriate engineering team. In order to scale to the degree they intend to, they need to build a more proactive testing process to verify the limits of the application and the platform.

The QA team tasked with delivering this testing process face a problem: they need to react to actions by disparate teams using disparate systems. One engineering team uses GitHub, one uses a private GitLab instance, and the infrastructure team uses yet another flavour of publicly hosted Git offering. The engineering teams are developing in a new language and are still learning the complexities of the new Kubernetes-based platform, which is itself developing at a rapid pace.

The questions that the QA team have at this point are:

  • How do we respond to changes to multiple codebases?
  • How do we respond to changes to multiple microservices?
  • How do we keep up with the rate of change in the platform?

Event-driven load testing platform

Here InnerScale decides to create a team dedicated to this problem. This team decides to build an event-based system for responding to changes in all these distinct systems and then making the necessary changes in a dedicated environment.

This team focuses not only on building this system but also on writing tests that stress the system. They arrange for tracing to be implemented in the production environment, which allows them to build tests closer to how real users interact with the application. On top of this, they implement a tagging system on all resources and tests to track which tests relate to which service. These tags can then be utilised to test the system appropriately. They then use all this to deliver data to the engineering teams to allow them to iterate and improve.
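To illustrate what such tagging might look like in practice, here is a minimal sketch using Locust as the load-testing tool (the post does not prescribe a particular tool, and the service names and endpoints are hypothetical). Each user flow is tagged with the service it exercises, so a run can be filtered down to only the affected services:

```python
# Hypothetical Locust scenario file: user flows tagged by the service they exercise.
# A run such as `locust --tags enrolment-service` then executes only the
# flows relevant to the service that changed.
from locust import HttpUser, between, tag, task


class StudentUser(HttpUser):
    wait_time = between(1, 5)  # pause 1-5 seconds between tasks, roughly mimicking real users

    @tag("auth-service")
    @task(3)
    def login(self):
        # The login stress test the QA team already runs ad hoc
        self.client.post("/api/login", json={"username": "student", "password": "secret"})

    @tag("enrolment-service")
    @task(1)
    def enrol_on_course(self):
        # A user-flow test scoped to a hypothetical enrolment service
        self.client.post("/api/courses/101/enrol")
```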

The process is roughly shaped like so:

  1. Anytime any of the teams makes a significant change (e.g. by merging to main), an event is created in AWS EventBridge that can then be responded to (see the sketch after this list). Additionally, a notification is sent out that a test is taking place.
  2. Process any changes:
    1. In the case of a microservice change, deploy the latest image to the environment.
    2. In the case of an infrastructure change, run Terraform to apply any changes.
  3. Run the tests whose service or infrastructure tags match; if there is a large change, the full suite can be run to try to catch any unintended consequences.
  4. Send notifications for successful or failed scenarios to the relevant engineering teams.
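As a concrete sketch of step 1, a CI job on any of the Git platforms could publish a small custom event to EventBridge after a merge to main. The snippet below uses boto3; the bus name, event source, detail type and detail fields are assumptions made for illustration rather than part of InnerScale's actual design:

```python
# Minimal sketch: a CI step publishing a "merge to main" event to a custom
# EventBridge bus so the load-testing platform can react to it.
# Bus name, source, detail type and detail fields are hypothetical.
import json

import boto3

events = boto3.client("events")


def publish_change_event(repo: str, commit: str, change_type: str, service_tags: list[str]) -> None:
    events.put_events(
        Entries=[
            {
                "EventBusName": "load-testing-bus",   # assumed custom event bus
                "Source": "ci.merge-to-main",         # assumed event source
                "DetailType": "CodebaseChanged",
                "Detail": json.dumps(
                    {
                        "repository": repo,
                        "commit": commit,
                        "changeType": change_type,    # e.g. "microservice" or "infrastructure"
                        "serviceTags": service_tags,  # used later to select matching tests
                    }
                ),
            }
        ]
    )


# Example call from a CI job after a successful merge:
# publish_change_event("innerscale/enrolment-service", "abc123", "microservice", ["enrolment-service"])
```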

The following diagram outlines a possible implementation of the above process based on AWS services.

The sample architecture outlined above is composed mainly of managed services from AWS. The primary benefit of using these serverless offerings is that they reduce implementation time and infrastructure overhead. Components within each stage are kept to a minimum in order to avoid brittleness in the solution and to prevent poor data from being delivered to the engineering teams. Similar systems could easily be created on other cloud providers such as GCP and Azure.

The first step is to ensure there is a robust continuous delivery (CD) process in place for the new environment. The QA team relies on the engineering teams for a deployable artefact; if the artefact cannot be deployed, the engineering team responsible for the service is alerted. If it is OK to deploy, the environment is updated and the test suite runs, and the engineering team is then alerted if there is any deviation in results from the previous release.
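A minimal sketch of that "alert on deviation" step might compare headline metrics from the current run against the previous release's baseline and notify the owning team only when the difference exceeds a tolerance. The metric names, the 10% tolerance, and the SNS topic are all assumptions:

```python
# Sketch: compare the current run's headline metrics against the previous
# release's baseline and alert the owning team via SNS if they regress.
# The metric names, 10% tolerance, and topic ARN are assumptions.
import json

import boto3

sns = boto3.client("sns")
TEAM_TOPIC_ARN = "arn:aws:sns:eu-west-1:123456789012:qa-results"  # hypothetical topic


def check_for_deviation(current: dict, baseline: dict, tolerance: float = 0.10) -> None:
    regressions = {}
    for metric in ("p95_latency_ms", "error_rate"):
        if baseline.get(metric) and current.get(metric, 0) > baseline[metric] * (1 + tolerance):
            regressions[metric] = {"baseline": baseline[metric], "current": current[metric]}

    if regressions:
        sns.publish(
            TopicArn=TEAM_TOPIC_ARN,
            Subject="Load test deviation from previous release",
            Message=json.dumps(regressions, indent=2),
        )


# Example: a latency regression beyond 10% triggers a notification.
# check_for_deviation({"p95_latency_ms": 480, "error_rate": 0.01},
#                     {"p95_latency_ms": 400, "error_rate": 0.01})
```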

Ensuring that everything is tagged according to what the changes impact avoids the trap of delivering too much data to the engineering teams. For example, if a database configuration changes in the infrastructure codebase, only the tests related to the services utilising that database are run.
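The tag matching itself can stay simple. The sketch below (with hypothetical resource and service tag names) maps changed infrastructure resources to the service tags whose tests should run:

```python
# Sketch: select which test tags to run based on the tags of the changed
# resources, so only the affected services' tests execute.
# The mapping, resource names and service tags are hypothetical.
RESOURCE_TO_SERVICE_TAGS = {
    "rds-enrolment-db": ["enrolment-service"],
    "rds-auth-db": ["auth-service"],
    "eks-cluster": ["auth-service", "enrolment-service"],  # broad change: exercise everything
}


def tests_to_run(changed_resources: list[str]) -> set[str]:
    tags: set[str] = set()
    for resource in changed_resources:
        tags.update(RESOURCE_TO_SERVICE_TAGS.get(resource, []))
    return tags


# A database configuration change touching only the enrolment database...
# tests_to_run(["rds-enrolment-db"]) -> {"enrolment-service"}
# ...would then drive something like: locust --tags enrolment-service
```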

In Summary

The intent of this methodology is to build out a reactive, event-based system that is automated end to end. This means it doesn’t put any extra pressure on the engineering teams to keep up to date with one another, and it gives the new team the opportunity to spend time working on this tooling rather than digesting results and feeding them back to the engineering teams. This allows the organisation to keep its rate of change alive and expand rapidly without overloading engineering teams in an attempt to keep up.

We built an adaptive system external to the critical-path teams, allowing the organisation to maintain the momentum that got them into the position of rapid scale-up whilst simultaneously ensuring that they have a stable system when they get there.

If you need help building creative solutions to support rapid scale up opportunities, we’d be happy to talk to you about your project and find a way to work together to support your goals.

 

