This blog is the third part of a series “Spark – the Pragmatic bits”. Get the full overview here.
My colleague Dávid Borsós recently wrote a blog post on “Data Analytics using Cassandra and Spark”, covering how you can use Spark to analyse data with Cassandra, a NoSQL database. If you’re unfamiliar with these technologies, his post is an excellent primer for getting started, and recommended reading before this article. We have also written many other Spark-related blog posts, which you can find here.
One aspect not covered in Dávid’s post, however, is how to approach testing. Data analytics isn’t a field commonly associated with testing, but there’s no reason we can’t treat it like any other application. Data analytics services are often deployed in production, and production services should be properly tested.
This post covers some basic approaches for the testing of Cassandra/Spark code. There will be some code examples, but the focus is on how to structure your code to ensure it is testable!
Refactoring for testability
As a starting point, let’s revisit the original Scala code from Dávid’s blog post. The code extracts some data from a Cassandra table, processes it with Spark, and then saves the results back in Cassandra.
As with many blog-based code snippets, this code was written to demonstrate technical capability. The code specifically focused on highlighting how powerful and easy to use Spark is. However, it is not production-ready code.
The code does a variety of different things. Let’s examine those functional areas line by line:
Steps 1 and 2 set up some necessary variables.
Steps 3 through 5 specify the CQL queries within Cassandra, and get a Spark RDD.
Steps 6 through 9 carry out Spark RDD operations.
Steps 10 and 11 are back in Cassandra. They structure the data into a Cassandra friendly tuple, and then save that into a table.
Ideally, we want unit tests to cover all of the core functional areas of our code, namely:
- Extracting info from Cassandra
- Processing data with Spark
- Saving information back into Cassandra
It would be best if we could test these areas independently of one another. First, let’s split our code into different functions. This way we have defined functional interfaces for each functional area. Here’s one way of restructuring:
With this code, each functional area has been isolated and is called independently. This has the added benefit of making the main function more concise and readable.
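As a rough illustration of this kind of split, here is a minimal sketch. It is not the code from the original post: the `Transaction` case class, the `bank` keyspace, the table and column names, and the connection host are all assumptions made up for the example. It assumes the DataStax Spark Cassandra connector is on the classpath.

```scala
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Hypothetical row type; the connector maps the columns account_id and amount onto it.
case class Transaction(accountId: String, amount: BigDecimal)

object BalanceJob {
  // Extracting info from Cassandra (keyspace and table names are assumptions).
  def readTransactions(sc: SparkContext): RDD[Transaction] =
    sc.cassandraTable[Transaction]("bank", "transactions")

  // Processing data with Spark: RDD in, RDD out, no Cassandra involved.
  def calculateCurrentBalance(txs: RDD[Transaction]): RDD[(String, BigDecimal)] =
    txs.map(t => (t.accountId, t.amount)).reduceByKey(_ + _)

  // Saving information back into Cassandra (table and columns are assumptions).
  def saveBalances(balances: RDD[(String, BigDecimal)]): Unit =
    balances.saveToCassandra("bank", "balances", SomeColumns("account_id", "balance"))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("balance-job")
      .setMaster("local[*]") // assumed local run; a real job would configure this externally
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)
    try saveBalances(calculateCurrentBalance(readTransactions(sc)))
    finally sc.stop()
  }
}
```

The point of the shape is that only `readTransactions` and `saveBalances` touch Cassandra, so the middle function can be exercised with any RDD, including one built in memory.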
Unit testing the Spark code
The first thing we should concentrate on is unit testing. Unit testing focuses on functionality/application logic provided by small, isolated pieces of code. Ideally, we unit test single functions. All dependencies are mocked or otherwise isolated. Business logic is typically left to broader, application-wide testing – as described in the testing pyramid. The great thing about unit testing is that it is quick, and will point out errors in specific functions. You can get really fast, specific feedback about what is wrong in your application.
As we now have structured interfaces – by splitting the code into functions – we can write targeted unit tests. For this, I have decided to use ScalaTest, one of the more popular Scala testing frameworks.
Here is an example unit test which explicitly tests the core implementation logic of this application – “Calculating the current balance” implemented as a set of Spark RDD operations (Processing data with Spark as above). It uses ScalaTest’s GivenWhenThen trait for BDD style annotations and reporting.
The test calls the function CalculateCurrentBalance. All this function does is use Spark to sum transaction amounts from an input data source. The data is read from a CSV file (with mock data), and built into a Cassandra RDD. The results do not need to be saved back into the table – we can assert against them directly. Note that this test does not rely on a live Cassandra instance at all. We have isolated the application logic from the parts where the data is ingested and saved. As a result this test will be fast to execute and does not have any external dependencies.
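A test in this style could look roughly like the sketch below. It is an approximation, not the original test: the `Balances` object is a stand-in for the function under test, the account ids and amounts are invented, and the CSV lines are held in memory rather than read from a file so that the example stays self-contained. It assumes ScalaTest 3.1 or later and a local Spark master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.scalatest.GivenWhenThen
import org.scalatest.flatspec.AnyFlatSpec

object Balances {
  // Stand-in for the function under test: sums transaction amounts per account.
  def calculateCurrentBalance(txs: RDD[(String, BigDecimal)]): RDD[(String, BigDecimal)] =
    txs.reduceByKey(_ + _)
}

class CalculateCurrentBalanceSpec extends AnyFlatSpec with GivenWhenThen {

  "calculateCurrentBalance" should "sum the transaction amounts per account" in {
    // A local, in-process Spark context: no cluster and no Cassandra needed.
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("balance-test"))
    try {
      Given("mock transactions in CSV form (accountId,amount)")
      val csv = Seq("acc-1,100.00", "acc-1,-25.50", "acc-2,10.00")
      val txs = sc.parallelize(csv).map { line =>
        val Array(id, amount) = line.split(",")
        (id, BigDecimal(amount))
      }

      When("the current balance is calculated")
      val balances = Balances.calculateCurrentBalance(txs).collect().toMap

      Then("each account holds the sum of its transactions")
      assert(balances == Map(
        "acc-1" -> BigDecimal("74.50"),
        "acc-2" -> BigDecimal("10.00")))
    } finally sc.stop() // always release the context, even when an assertion fails
  }
}
```

Collecting the result into a `Map` keeps the assertion independent of the order in which Spark returns the records.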
This is a good way of testing our core Spark code directly, but what about the Cassandra code?
Unit testing with Cassandra is possible, through tools such as Scassandra, or Cassandra Unit. With the code clearly bounded by functionality, we could have unit tests using one of these mocking frameworks. However, I would argue that (for the time being) our CQL queries are reasonably straightforward, and don’t need to be rigorously tested at the unit level.
There’s an additional complexity with Spark that makes the Cassandra code difficult to unit test. Spark code is not necessarily executed linearly, and it is tightly coupled to Cassandra. Spark optimises your entire query once you call a terminal operation on the RDD. For example, if your terminal query was take(10), Spark will potentially not read the entire dataset, as it knows it only needs to return 10 records. This makes it difficult to stub out Spark and be sure we’re still testing the same Cassandra query logic. Because of Spark’s lazy evaluation it sometimes alters the Cassandra query logic to optimise it. Spark and its Cassandra connector are strongly coupled to account for this. Because of this strong coupling, it is extremely difficult – in many subtle ways – to isolate the Cassandra connector’s logic. Without isolation, we cannot unit test.
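The lazy behaviour itself can be observed without Cassandra. The following sketch is only an illustration on an in-memory RDD, with hypothetical names throughout: it counts how many records a `map` step touches, first before any terminal operation and then after `take(10)`. It does not show the connector's own query optimisations, which depend on the connector version.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LazyEvaluationDemo {
  // Returns (records touched before any action, records touched after take(10)).
  def run(sc: SparkContext): (Long, Long) = {
    val touched = sc.longAccumulator("touched")
    val mapped = sc.parallelize(1 to 1000, numSlices = 10).map { n =>
      touched.add(1) // side effect used only to observe how many records are evaluated
      n * 2
    }
    val beforeAction = touched.value.longValue // transformations alone run nothing
    mapped.take(10)                            // the terminal operation triggers the work
    (beforeAction, touched.value.longValue)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("lazy-demo"))
    try {
      val (before, after) = run(sc)
      println(s"touched before action: $before, after take(10): $after of 1000")
    } finally sc.stop()
  }
}
```

With a local master the second number should come out well below 1000, because `take` stops pulling records once it has enough. A stubbed data source would not necessarily be queried the same way as the real one, which is the difficulty described above.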
Ultimately, what we care about is whether Cassandra and Spark work together. Therefore we will test our Cassandra code with our Spark code, against a real Cassandra server.
In a real world scenario, a production Cassandra database would be composed of multiple nodes in a cluster. The same logic would apply to a Spark cluster. However, we should check our logic works against a single node first. We can do testing against a production-like environment once we are happy working with a simple setup. A single node cluster cannot detect problems related to the distributed nature of Cassandra/Spark – but it can still catch bugs related to Cassandra data modeling and the executed CQL queries. We will save money and time by not spinning up a full Cassandra/Spark cluster.
The best way to start doing this would be to have a Cassandra node set up in a Docker container. Then run your entire Spark/Cassandra job against this local, lightweight Cassandra node. Once the data has been saved back into Cassandra, read it back and assert against that. Having a containerised test setup means that your tests are self-contained (the container should be controlled by your build system) and easily portable. Additionally they will execute the same way on your local development environment and a CI server.
Any number of different frameworks can be used for integration testing. For example, you could use a tool like Cucumber-JVM and Java, and use the Java driver to interact with Cassandra. Here is an example Cucumber test.
Of course, the structure and tooling used for your actual tests can differ depending on your preference. The general idea is to make sure your dependencies and tech stacks work together in a basic scenario.
In terms of data, I prefer to wipe the Cassandra database clean between tests, so you can insert fresh data per test. The CQL operation TRUNCATE TABLE would make it very easy to do so.
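A small helper for this clean-up step might look like the sketch below. It assumes the DataStax Java driver 4.x used from Scala and a single Cassandra node listening on 127.0.0.1:9042 (for example the Docker container mentioned earlier). The object name, the keyspace and table names, and the `datacenter1` datacenter name are assumptions for illustration.

```scala
import java.net.InetSocketAddress
import com.datastax.oss.driver.api.core.CqlSession

object CassandraTestSupport {
  // Builds one TRUNCATE statement per table; kept pure so it is trivial to check.
  def truncateStatements(keyspace: String, tables: Seq[String]): Seq[String] =
    tables.map(table => s"TRUNCATE TABLE $keyspace.$table")

  // Connects to the single test node (host, port and datacenter name are assumptions).
  def connect(host: String = "127.0.0.1", port: Int = 9042): CqlSession =
    CqlSession.builder()
      .addContactPoint(new InetSocketAddress(host, port))
      .withLocalDatacenter("datacenter1")
      .build()

  // Wipes the given tables so that each test starts from an empty state.
  def wipe(session: CqlSession, keyspace: String, tables: Seq[String]): Unit =
    truncateStatements(keyspace, tables).foreach(cql => session.execute(cql))
}
```

A test suite could call `wipe` from a per-test setup hook, for instance ScalaTest's `BeforeAndAfterEach`, then insert its own rows, run the job, and read the results back through the same session.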
Acceptance testing, and beyond
The integration tests shown above were written in Cucumber. This is intentional, as these tests can be reused at a feature testing level. So far we have demonstrated some functionality in a somewhat unrealistic scenario, but it is important to test both Cassandra and Spark in a multi-node setup. There can be some subtle differences in multi-node environments, due to these systems’ distributed nature. These acceptance tests should be run against an environment as close to your production setup as possible.
The features written for the integration tests can be reused. You can either run the entire suite, or a subset of high priority tests. This decision depends upon how much risk you wish to mitigate, in exchange for faster/cheaper test runs.
Additionally, moving further up the testing pyramid, there is performance testing and benchmarking. Essentially, these can be considered the same, except that performance testing would be done against larger volumes of data (ideally equivalent to your production data volumes, or at least against a well-understood scale model). Benchmarking can help you decide where to optimise your queries and refactor your Spark code. Benchmarking should also be done against a production-like server setup, similar to your acceptance tests. If you are concerned about the load put on your QA/staging environment during performance testing, then it would be wise to have a temporary performance testing environment.
Test your Spark/Cassandra dataflows!
In this post I have described how to structure a testing approach involving a data analytics application with Cassandra and Spark. At the unit testing level, you will need to structure your code to be functionally isolated, so you can independently verify your Cassandra and Spark logic. This way you can test Spark without having to rely upon a Cassandra instance. Cassandra logic could also be unit tested if necessary, but it can have a low return on investment.
From a wider perspective, you can test the integration between Cassandra and Spark. This can be done on a single node hosted within a Docker container, which makes an easy, self-contained and portable testing setup. You can then reuse the same tests for your acceptance tests. Your acceptance tests should run on a more realistic environment, and expose any problems associated with a multi-node setup.
In the end, testing a Cassandra and Spark flow is just like testing any application. Just because it is concerned with data analytics does not mean it is immune to bugs. If your data is being viewed by your users – internal or external – it should be treated as a core part of your application.