
September 6, 2016 | Cassandra

Patterns of Successful Cassandra Data Modelling

A growing number of clients are asking OpenCredo for help with using Apache Cassandra and solving specific problems they encounter. Clients have different use cases, requirements, implementations and teams, but experience similar issues. We have noticed that Cassandra data modelling problems are the most consistent cause of Cassandra failing to meet their expectations. Data modelling is one of the most complex areas of using Cassandra and has many considerations.

WRITTEN BY

Alla Babkina


For a business it is essential to invest resources into data modelling from the early stages of Cassandra projects; unlike operational settings that can be tuned, a Cassandra data model is very costly to fix.

Cassandra is growing in popularity due to its well-advertised strengths such as high performance, fault tolerance, resilience and scalability covered in a previous blog post by Guy Richardson. How well these strengths are realised in an application depends heavily on the quality of the underlying Cassandra data model. The main rule for designing a good Cassandra data model is crafting it specifically to your business domain and application use cases.

Through years of experience with Cassandra engagements, we have identified a number of data modelling patterns and will cover some of them in this post. Their successful application requires a working knowledge of how Cassandra stores data. Without understanding the underlying architecture you risk making one of the common Cassandra mistakes covered in a separate post. Evaluating a seemingly valid and straightforward data model for potential pitfalls requires a level of expertise in Cassandra internals, the storage format in particular.

The Simplicity and Complexity of Cassandra Storage

While Cassandra architecture is relatively simple, it fundamentally limits the ways in which you can query it for data. Cassandra is designed to be a high-performance database and discourages inefficient queries. Query efficiency is determined by the way Cassandra stores data; it makes the following query patterns inefficient or impossible:

  • fetching all data without identifying a subset by a partition key,
  • fetching data from multiple partitions,
  • joining distinct data sets,
  • searching by values,
  • filtering.

Cassandra expects applications to store the data in such a way that they can retrieve it efficiently. It is therefore up to the client to know the ways it will query Cassandra and design the data model accordingly upfront.

Example: Projects by Manager and Turnover

Consider an application that records information about project managers, their projects and project turnovers. Even for this intentionally simple use case, there are many ways you could store the data and a number of things to consider to produce a good Cassandra data model. At the very least, Cassandra requires a meaningful key to split data into subsets that you will later use to retrieve data. Without much knowledge of how Cassandra works, the first response might be to store this data in a simplified table:

CREATE TABLE projects_by_manager (
  manager text,
  project_id int,
  project_name text,
  turnover int,
  PRIMARY KEY (manager, project_id)
);

In this table all data about projects and their turnovers is partitioned by manager. This data model will work if you only want to retrieve all project data for a particular manager. Disappointingly, this is the only type of query this table will support out of the box. Cassandra will not retrieve “all projects with a turnover of 2 000 000” – the way in which the data is stored makes this query inefficient.
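
To make this concrete, here is a sketch of both queries against this table; the names mirror the example above, and the second statement is rejected unless forced with ALLOW FILTERING:

-- Supported: a single-partition read by the partition key
SELECT * FROM projects_by_manager WHERE manager = 'Jack Jones';

-- Rejected out of the box: turnover is not part of the primary key,
-- so answering this would mean scanning every partition
SELECT * FROM projects_by_manager WHERE turnover = 2000000;

The reason behind this becomes obvious when looking at the format of Cassandra SSTables.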

Partitioned Row Store

[
{"key": "Jack Jones",
 "cells": [["1:","",1470229749953090],
           ["1:project_name","Cassandra Tuning",1470229749953090],
           ["1:turnover","5000000",1470229749953090],
           ["2:","",1470229928612372],
           ["2:project_name","Spark Layer",1470229928612372],
           ["2:turnover","2000000",1470229928612372]]},
{"key": "Jill Hill",
 "cells": [["1:","",1470229908473768],
           ["1:project_name","Kubernetes Setup",1470229908473768],
           ["1:turnover","2000000",1470229908473768],
           ["2:","",1470229948844042],
           ["2:project_name","Front End",1470229948844042],
           ["2:turnover","1000000",1470229948844042]]},
{"key": "Richard Ford",
 "cells": [["1:","",1470229980496296],
           ["1:project_name","Docker Training",1470229980496296],
           ["1:turnover","1000000",1470229980496296]]},
{"key": "Maggie Bail",
 "cells": [["1:","",1470230005734692],
           ["1:project_name","Docker Audit",1470230005734692],
           ["1:turnover","1000000",1470230005734692]]}
]

Cassandra is a partitioned row store and its storage structure is in effect a nested sorted map, which makes it easy to grab a subset of data by key. In the example above, data is partitioned by manager name (the key), so retrieving projects in per-manager subsets is easy. However, finding projects by turnover would involve checking turnover values for every project in every partition. Cassandra rightfully considers this inefficient and, as a result, does not support such queries. Dominic Fox's post “How Not To Use Cassandra like an RDBMS (and what will happen if you do)” [Release: 15/09/2016] gives many more examples of queries that are suboptimal in Cassandra. Luckily, there are patterns for designing Cassandra data models in a way that can provide answers to most reasonable questions.

Cassandra Data Modelling Patterns

Model around Business Domain

When designing a Cassandra data model for an application, first consider the business entities you are storing and the relationships between them. Your ultimate goal will be to store precomputed answers to the business questions the application asks about the stored data, and an understanding of its structure and meaning is a precondition for modelling these answers. Knowledge of the business domain model is also key to understanding the cardinality of certain data elements and estimating changes in future data volumes. A data model designed around the business domain will also spread data more evenly and keep partition sizes predictable. Naturally, achieving this requires close collaboration between business stakeholders and development teams.

Denormalisation

Unlike relational databases, Cassandra has no concept of foreign keys and does not support joining tables; both concepts are incompatible with its key architectural principles. Foreign keys have a significant negative impact on write performance while joining tables on reads is one of the inefficient querying patterns that Cassandra discourages. Cassandra demands that the structure of tables support the simplest and most efficient queries. On the other hand, writes in Cassandra are cheap and fast out of the box due to its simple storage model. As a result, it is usually a good idea to avoid extra reads at the expense of extra writes. This can be achieved by balanced denormalisation and data redundancy: if certain data is retrieved in multiple queries duplicate it across all tables supporting these queries.

Example:

While keeping data volumes and the number of tables low is not a concern in Cassandra, it is still important to avoid unnecessary duplication. Consider a simplified data model of users leaving comments on articles, and imagine retrieving the list of all comments for an article. Even though a separate table stores full user information, a good data model will also duplicate the author name in the comments table: application users want to see who wrote the comment. The following table structure would be appropriate for a common use case:

CREATE TABLE comments_by_article (
  article_id int,
  comment_id int,
  comment text,
  user_name text,
  PRIMARY KEY (article_id, comment_id)
);

Note that storing full user information for the author with each comment would be excessive and create unnecessary duplication. It is unlikely that someone will ever want to immediately see full user information next to a comment left by that user – it is easy to retrieve upon request. It is vital to understand the types of queries the application will run to strike the right balance.
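
As a sketch of the resulting read patterns (the users table and its columns are assumed here, not defined above): the common case is a single-partition read of the comments table, and full author details are fetched only when actually requested.

-- Common case: one partition holds all comments for the article,
-- already carrying the duplicated author name
SELECT comment_id, comment, user_name FROM comments_by_article WHERE article_id = 42;

-- On demand only: full details for one author, e.g. when a profile is opened
SELECT * FROM users WHERE user_name = 'jack.jones';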

Pre-built Result Sets

Fast reads in Cassandra result from storing data fully ready to be read – preparing it at write time. Cassandra does not support complex processing at query time such as filtering, joins, search, pattern matching or aggregation. The data must be stored partitioned, sorted, denormalised and filtered in advance. Ideally, all that is left to Cassandra is to read the results from a single location. For this reason, having roughly a table per query is often a good approach in Cassandra. By contrast, storing isolated denormalised models with no query context often leads to common Cassandra anti-patterns.

When creating a Cassandra data model, determine the specific questions the application will ask, and the format and content of the answers it will expect, and create tables to store pre-built result sets. Indexing offers only limited options for incrementally adding support for new queries to an existing model; it also introduces additional complexity that correct upfront modelling avoids.
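
Returning to the earlier projects example, supporting the “all projects with a turnover of 2 000 000” question would mean writing the same data into a second table partitioned by turnover. The sketch below is one possible shape; note that a raw turnover value is a questionable partition key in practice (its cardinality and distribution matter, as the next section discusses) and is used here only to show the table-per-query idea:

CREATE TABLE projects_by_turnover (
  turnover int,
  project_id int,
  project_name text,
  manager text,
  PRIMARY KEY (turnover, project_id)
);

-- Writes go to both tables; the read becomes a single-partition lookup
SELECT * FROM projects_by_turnover WHERE turnover = 2000000;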

Even Data Distribution

Understanding the distributed nature of Cassandra is key to modelling for predictably fast cluster performance. All nodes in a Cassandra cluster are equal by design and should run on identical hardware. Accordingly, for nodes to demonstrate comparable performance, they should bear the same load. Spreading data evenly across the cluster by choosing the right partition key helps achieve this. There are several data distribution aspects to consider when designing the data model.

  1. Likely partition sizes. A single partition will always be stored in its entirety on a single node; it must therefore be small enough to fit on that node, accounting for the free space required for compaction. For this reason, it is important to understand compaction and choose the right compaction strategy.
  2. Keeping partition data bounded. If the amount of data in a single partition is likely to grow too big, adding another partitioning column will limit its growth. For example, bounding a partition of events by time in addition to type keeps partition sizes within predictable limits (see the sketch after this list). That said, it is important to choose the time granularity best suited to a particular use case.
  3. Partition key cardinality. Choosing a column with a reasonably large number of unique values (high cardinality) is important to keep partition sizes small. However, it is important to balance this with the aim of a single partition ideally being able to satisfy each query.
  4. Separating read-heavy and write-heavy data. Read-heavy and write-heavy data have different scalability cycles, and measures to optimise them are different and often conflicting. For example, different compaction strategies suit read-heavy and write-heavy workflows best. For this reason, it often helps to keep such data separate even if there is a strong semantic relationship.
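
As a sketch of point 2, a composite partition key can bound partition growth by time. The events table below is hypothetical (its names and the per-day bucket are assumptions, not part of the example data above):

CREATE TABLE events_by_type_and_day (
  event_type text,
  day text,          -- e.g. '2016-09-06'; the granularity is a use-case decision
  event_id timeuuid,
  payload text,
  PRIMARY KEY ((event_type, day), event_id)
);

-- Each (type, day) pair is its own partition, so no partition grows without bound
SELECT * FROM events_by_type_and_day WHERE event_type = 'login' AND day = '2016-09-06';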

Inserts over Updates and Deletes

Although Cassandra supports updates and deletes, their excessive use results in unexpected operational overhead. Using them safely requires detailed knowledge of their implementation and of Cassandra's operational processes. While a record may appear updated or deleted on the surface, physical storage will not reflect it straight away: Cassandra reconciles updates and deletes in physical storage during compaction, and only under certain conditions. Use of update and delete operations substantially complicates cluster operations and relies on regular, well-scheduled repairs, monitoring and manual compaction. In our experience, data models that avoid updating or deleting data help reduce operational risks and costs.
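
One common way to follow this pattern is to write a new, timestamped row for every change instead of updating in place, and to read the latest version back. A minimal sketch, with hypothetical table and column names:

CREATE TABLE account_balance_by_time (
  account_id int,
  updated_at timeuuid,
  balance decimal,
  PRIMARY KEY (account_id, updated_at)
) WITH CLUSTERING ORDER BY (updated_at DESC);

-- Every change is an insert: no tombstones and no read-before-write
INSERT INTO account_balance_by_time (account_id, updated_at, balance)
VALUES (42, now(), 150.00);

-- The latest balance is simply the first row in the partition
SELECT balance FROM account_balance_by_time WHERE account_id = 42 LIMIT 1;

Old versions accumulate under this model, so it trades storage for operational simplicity; whether that trade is acceptable depends on the use case.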

Testing Data Models

Like any code, Cassandra data models need thorough testing before production use. Too many factors affect the performance and operations of a cluster to predict them with reasonable certainty. It is necessary to validate a Cassandra data model by testing it against real business scenarios. In fact, Cassandra ships with a cassandra-stress tool which can help load test the performance of your cluster with a chosen data model. The nodetool tablehistograms utility (nodetool cfhistograms prior to Cassandra 2.2) provides further per-table statistics. There may be a temptation to avoid testing the data model altogether because gathering fully accurate metrics is impossible. While it is rarely possible to run tests against a production-sized Cassandra cluster, testing against a smaller test cluster will highlight common data model problems. In short, prefer testing the data model for a subset of scenarios over speculating about all aspects of its performance.

Look for the Right Balance

Cassandra is a powerful tool but upfront investment into its setup pays off in later stages of projects. Performance and smooth operations of a Cassandra cluster depend in large part on the quality of the data model and how well it suits the application. There are many factors that shape the suitable data model and its design involves many complex decisions and tradeoffs. While data modelling in Cassandra requires a high level of expertise in its architecture, there are a number of patterns that help achieve good results. Carefully balancing these decisions helps avoid mistakes and anti-patterns that lead to common Cassandra operational problems.

This is the second post in our blog series “Cassandra – What You May Learn The Hard Way.” Get the full overview here.

The associated webinar, “Cassandra – The Good, the Bad, and the Ugly” was broadcast on October 6th, 2016. View the recording here.
