With the upcoming Cassandra 4.0 release, there is a lot to look forward to. Most excitingly, and following a refreshing realignment of the Open Source community around Cassandra, the next release promises to focus on fundamentals: stability, repair, observability, performance and scaling.
We must set this against the fact that Cassandra ranks pretty highly in the Stack Overflow list of most dreaded databases, and the reality that Cassandra is expensive to configure, operate and maintain. Finding people with the requisite skills to do so is challenging.
At OpenCredo, we have been working with Cassandra for years and we have a good understanding of its pros and cons. The raw write performance of Cassandra cannot be denied but the overwhelming complexity of its operations – when combined with the mental adjustments required of developers to design appropriate data and query models – means that in many cases, we can no longer recommend operating it yourself.
This is borne out by the rise of several managed service providers offering Cassandra-as-a-service. There are Cassandra-compatible offerings from Instaclustr, Datastax, Aiven and Scylla, and as of April 2020, AWS also has a generally available offering: Amazon Keyspaces.
In this blog post, we’ll look at Amazon’s offering – how it differs from open source Cassandra and what use cases it might be suitable for.
AWS Keyspaces is a fully managed, serverless, Cassandra-compatible service. Note: Cassandra-compatible, not Cassandra – so we can assume it is not actually vanilla Cassandra under the hood. This has been unofficially confirmed:
Unsurprisingly, it integrates with DynamoDB technology. Cassandra and Dynamo share a common heritage and DynamoDB has been well worn-in over the years so we won’t necessarily consider this problematic.
What is more interesting is that it is serverless and autoscaling: there are no operations to consider – no compaction, no incremental repair, no rebalancing the ring, no scaling issues. For any under-resourced data operations team, this must be the main selling point: Keyspaces provides an SLA, and if that SLA is acceptable to your internal consumers of Cassandra, then Keyspaces can be used in place of your internal cluster.
AWS Keyspaces is delivered as a 9-node Cassandra 3.11.2 cluster; that is, it is compatible with tools and drivers for Cassandra 3.11.2, including the Datastax Java Driver. Whilst we have no control over the underlying operations and configuration, we do have full control over the data plane – keyspaces, tables and read/write operations.
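For example, standard CQL DDL works as expected. A minimal sketch – the `shop` keyspace and `orders` table are illustrative names, and `SingleRegionStrategy` is the replication strategy Keyspaces uses in place of the usual `NetworkTopologyStrategy`:

```sql
-- Hypothetical keyspace and table; Keyspaces accepts standard CQL DDL.
CREATE KEYSPACE IF NOT EXISTS shop
  WITH REPLICATION = {'class': 'SingleRegionStrategy'};

CREATE TABLE IF NOT EXISTS shop.orders (
  customer_id uuid,
  order_time  timestamp,
  total       decimal,
  PRIMARY KEY ((customer_id), order_time)
) WITH CLUSTERING ORDER BY (order_time DESC);
```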
So this all looks pretty promising – what are the compromises? At the moment, the following limitations apply relative to vanilla Cassandra 3.x:
These compromises are not particularly onerous – we would generally want to ensure that our writes are QUORUM-consistent anyway, although we may miss consistency level ONE for fast, risk-tolerant writes. Likewise, with both ONE and QUORUM reads available, we can choose between fast reads (for, say, immutable data) and consistent(ish) reads respectively.
The missing functionality is either not central to typical Cassandra use cases or is arguably an anti-pattern – materialized views, for example, are best suited to scenarios where write throughput is low, but if write throughput is low, why are you using Cassandra and not something else?
Being integrated with AWS, there are some additional functions that we can take advantage of. These essentially take the place of native Cassandra functions:
These remove a class of operational challenges. There are tools that help – Medusa for backups, for example – but for an architecture already integrated into the AWS ecosystem, the native equivalents are better aligned.
As of September 2020, there is unfortunately no support yet for AWS Keyspaces in the Terraform AWS provider, although the issue appears to have been raised. Likewise, it does not appear to be available in the AWS Python SDK (boto3) or the AWS CLI, though it is supported in CloudFormation. On this basis, we will need to be temporarily expedient about how Keyspaces is provisioned.
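Since CloudFormation does support Keyspaces, provisioning can be scripted there in the meantime. A minimal sketch – the keyspace name is illustrative:

```yaml
# Minimal CloudFormation template creating a keyspace via the
# AWS::Cassandra resource types that back Amazon Keyspaces.
AWSTemplateFormatVersion: "2010-09-09"
Resources:
  DemoKeyspace:
    Type: AWS::Cassandra::Keyspace
    Properties:
      KeyspaceName: demo_keyspace   # illustrative name
```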
Architecturally, there is support for Interface VPC Endpoints, which should be used to keep traffic private to the AWS network. These allow fine-grained control over who can reach the VPC endpoint and what that endpoint can access, preventing a class of exfiltration attacks – a vector that is often neglected.
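Such an interface endpoint can be declared in the same CloudFormation stack. A sketch, assuming you already have a VPC, subnet and security group (the IDs below are placeholders):

```yaml
# Interface VPC endpoint for Amazon Keyspaces; the service name
# follows the com.amazonaws.<region>.cassandra pattern.
Resources:
  KeyspacesEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Interface
      ServiceName: !Sub "com.amazonaws.${AWS::Region}.cassandra"
      PrivateDnsEnabled: true
      VpcId: vpc-0123456789abcdef0              # placeholder
      SubnetIds:
        - subnet-0123456789abcdef0              # placeholder
      SecurityGroupIds:
        - sg-0123456789abcdef0                  # placeholder
```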
Once set up, we need to connect to Keyspaces. The main way to do this is probably the Datastax Java Driver, which supports a range of features including connection pooling, load balancing and the control connection. It is a little ceremonial to get started, but well documented.
Out of the box, each TCP connection to Keyspaces supports up to 3,000 queries per second – so up to 27,000 across the 9 nodes. With the Datastax driver (on a recent driver and Cassandra version) we default to 1 shared connection per node. This can be increased – and must be where we want higher throughput from Keyspaces – and there is no limit to the number of TCP connections that can be made to Keyspaces nodes. So we can scale up through TCP connection configuration rather than by resizing the cluster.
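As back-of-envelope arithmetic, using the per-connection limit and node count quoted above:

```python
def keyspaces_peak_qps(connections_per_node: int,
                       nodes: int = 9,
                       qps_per_connection: int = 3_000) -> int:
    """Upper bound on queries/sec given the quoted per-connection limit."""
    return connections_per_node * nodes * qps_per_connection

# The default of 1 connection per node gives the 27,000 figure above;
# doubling the connections per node doubles the ceiling.
print(keyspaces_peak_qps(1))  # 27000
print(keyspaces_peak_qps(2))  # 54000
```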
We must also choose the pricing model – either on-demand where capacity auto-scales to… demand, or provisioned where capacity is fixed (but easily changed) with some cost advantages.
So, our feature survey above indicates that there is a lot of upside and only a few downsides to using AWS Keyspaces. Let’s take a step back and evaluate it again:
We’ve already discussed the lack of support from standard infrastructure-as-code tools like Terraform and the AWS CLI. We also note that, as of writing (September 2020), AWS Keyspaces does not appear in the AWS Services in Scope by Compliance Program list. So its compliance status seems to be presently “undefined” and presumably “in-progress”. For sensitive workloads this places a delay of unknown length on adoption (which, of course, might fruitfully be employed in evaluation).
This lack of contextual maturity is also reflected in how little the system has been proven in the wild. Particularly given that this is a new cloud service, it is critical that you benchmark your data and querying model against the performance of Keyspaces itself. The under-the-hood assumptions made to provide a SaaS service may, or may not, suit your workloads.
A final concern is pricing, which seems to be roughly DynamoDB + 15%. To illustrate: on-demand writes in the London region cost $1.7221 per million write units, so roughly 10,000 writes of up to 1KB each per second (0.01 million writes/sec) will cost:

$1.7221 * 0.01 * 60 * 60 * 24 * 365 ≈ $543,081 per year
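The same arithmetic as a checkable snippet – the rate is the London on-demand write price used above:

```python
PRICE_PER_MILLION_WRITES = 1.7221   # USD, on-demand, London region
WRITES_PER_SECOND = 10_000
SECONDS_PER_YEAR = 60 * 60 * 24 * 365

# Writes per second expressed in millions, times price, times a year.
annual_cost = (PRICE_PER_MILLION_WRITES
               * (WRITES_PER_SECOND / 1_000_000)
               * SECONDS_PER_YEAR)
print(f"${annual_cost:,.0f} per year")  # $543,081 per year
```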
This doesn’t even include reads, storage, point-in-time recovery, VPC endpoints or data transfer. Expensive stuff.
Having said this, we need to take into account the TCO of running Cassandra yourself. It is generally considered that you need at least 2 staff dedicated to a non-trivial Cassandra cluster, such as one ingesting 10,000 writes per second. These staff – with skills in distributed systems, caching, operating system configuration, JVM tuning and so on – are highly in demand and, consequently, expensive to hire and retain.
AWS Keyspaces is a slick offering that is well integrated into the AWS stack and provides a tidy, usable UI. It removes entire classes of pain from using Cassandra, freeing up staff and giving them the opportunity to pursue activities which might be more profitable to your business.
As with all new AWS services, time will tell: the compliance boxes must be ticked, performance better understood and the service matured before it becomes a sensible option for organisations who are not especially risk-tolerant.
Who would use it? Given the pricing model, it makes sense for:
Web-scale and big data organisations with mature teams, processes and significant workloads are unlikely to find much of interest here: the costs of migration are simply too great compared to running on, say, i3 instances on EC2, and the promises of Cassandra 4.0 are too attractive.
For startups and other organisations where the priority is go-to-market, Keyspaces could be an extremely valuable starting point for supporting big data and streaming workloads without the hassle.