
February 22, 2024 | Blog, Kafka

Use Case: A Kafka Backup Solution

Check out Peter Vegh’s latest blog, where he explores a bespoke Kafka backup framework, discussing the approach, architecture and design process used to implement and deliver the solution.

WRITTEN BY

Peter Vegh

Senior Consultant

Recently one of our customers approached us to design and build a Kafka Backup solution for their AWS managed Kafka (MSK) cluster. 

There are many different ways to go about it, but obviously, any final solution will greatly depend on your requirements and constraints. 

Let’s first take a look at the functional requirements we were presented with.

  1. It must be possible to apply `Right to Forget` policies to records at rest, so records may need to be mutable in some way
  2. Records should be stored in some form of Cold Storage, preferably S3
  3. Keep architecture simple (use as few services as possible)
  4. Keep the cloud costs down

What was our design approach?

We laid down a few design principles very early on. 

  1. We do not want to add or change anything on the source MSK cluster, apart from changing the retention periods on some of the topics that need to support Right to Forget.
  2. We do not want to push additional logic into any external teams’ consumers or producers in order to read or write to backed-up topics or any encrypted fields therein.

This was followed by a fairly intense discovery period with many spikes to test if the different solutions could work together and to compare the pros and cons for each.

We looked at the following solutions:

  1. Kafka Tiered Storage
    • We had to drop this because it still relies on immutable storage for the underlying records, which makes Right to Forget difficult to implement.
  2. Kafka records as binary columns in Iceberg
    • This was a very tempting solution: Athena gives you powerful query capabilities for finding the data you need, and used in conjunction with Iceberg tables it lets you update or remove records, which is great. In the end, we dropped this solution as the customer wanted a very streamlined tech stack.
  3. Kafka records as flattened Iceberg table rows
    • For this one, we would have flattened the records and stored each field as a column. This proved to be overly complicated and made schema changes difficult to deal with, so we dropped it.
  4. Redshift as an object store
    • We had to drop this solution as it wasn’t part of the customer’s currently used technology stack.

Then we proceeded to implement a Vertical Slice, an all-singing, all-dancing happy path with most of the functional requirements present.

The end result was this:

  1. The Confluent S3 Sink Connector consumes records from the source topic and writes them to S3 buckets.
  2. Ops can trigger either a restore or redact scenario using a controller service. 
  3. The Controller will do the necessary queries over S3 and upload the list of S3 Objects that need to be restored or redacted.
  4. The Controller submits some restore/redact jobs to AWS Batch.

Let’s take a look at each of the components separately.

The S3 Sink Connector

This customer already had a lot of experience in running their own Kafka Connect clusters, so adding a “few” more for backing up the necessary topics seemed a reasonable approach. We decided to go with the Confluent S3 Sink Connector mostly because it is open source and they already had expertise in running it. The only additional work that needed to be done was adding a custom partitioner, as the default partitioners that come packaged with the Confluent connector did not support the partitioning scheme we wanted to use.

The first problem was that you could only use one partitioner at a time, and you couldn’t extract fields from the record key. We implemented a fairly simple one that could do just this.
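
The real partitioner is a Kafka Connect plugin (JVM code), so it isn’t reproduced here. Purely to illustrate the idea in Go (the language used for the other sketches in this post): its job boils down to deriving the directory portion of each S3 object key from a field extracted from the record key, which the Confluent S3 sink then uses as the encoded-partition segment of the object path. The `RecordKey` type and the `tenant` field below are hypothetical, not the customer’s actual scheme.

```go
package layout

import "fmt"

// RecordKey stands in for the structured key on the backed-up records.
// Both the type and the Tenant field are hypothetical examples.
type RecordKey struct {
	Tenant string
	ID     string
}

// EncodedPartition sketches what a key-aware partitioner produces: a path
// segment derived from a field pulled out of the record key. With the
// Confluent S3 sink, this value ends up as the <encodedPartition> part of
// the object path under <topics.dir>/<topic>/.
func EncodedPartition(key RecordKey) string {
	return fmt.Sprintf("tenant=%s", key.Tenant)
}
```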

Another important factor is the Recovery Point Objective (RPO). As our backup process is based on the connector, it is important to keep in mind the connector’s consumer lag and how you configure the flush sizes. If the connector lags far behind or your flush size is big, that will affect your RPO: if the consumer feeding your backup service falls too far behind and your source cluster fails, any data that has not yet been read into the backup process is potentially lost. The tradeoff here is that running a Connect cluster large enough to keep your RPO low might end up being too expensive. Ultimately, setting the RPO is very much a tradeoff between cost and a tolerable amount of data loss.
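
As a rough, back-of-the-envelope illustration of that relationship (the numbers below are made up for this sketch, not taken from the project), the worst-case RPO can be estimated from the connector’s flush size, its consumer lag, and the topic’s ingest rate:

```go
package main

import "fmt"

func main() {
	// Hypothetical figures, purely for illustration.
	const (
		ingestRatePerSec = 500.0   // records/sec arriving on the source topic
		flushSize        = 50000.0 // connector flush.size: records buffered before an S3 object is written
		consumerLag      = 20000.0 // records the connector's consumer is behind the topic head
	)

	// Records not yet in S3 = records still buffered + records not yet consumed.
	// Dividing by the ingest rate expresses the worst case in seconds of data.
	worstCaseRPO := (flushSize + consumerLag) / ingestRatePerSec
	fmt.Printf("worst-case RPO: ~%.0f seconds of records\n", worstCaseRPO)
}
```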

The Controller

As the name suggests, the Controller is in charge of preparing and running the different business scenarios. It runs the necessary queries on the backup bucket using the paginated list-objects API combined with some filtering. Once it knows which objects it needs to restore, it writes them into a file and uploads it to S3. This was necessary because, even though AWS Batch job submission supports parameter overriding, the entire command line has to stay under 30KB. When you have a lot of objects to work with, this limit is very easy to breach if you specify lists of IDs or topics on the command line.
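
A minimal sketch of that happy path, assuming the AWS SDK for Go v2 and hypothetical bucket, prefix, queue and job-definition names (the real Controller adds the filtering, the redact scenario and proper error handling), might look something like this:

```go
package main

import (
	"context"
	"log"
	"strings"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/batch"
	batchtypes "github.com/aws/aws-sdk-go-v2/service/batch/types"
	"github.com/aws/aws-sdk-go-v2/service/s3"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		log.Fatal(err)
	}
	s3Client := s3.NewFromConfig(cfg)

	// 1. Page through the backup bucket and collect the object keys to restore.
	var keys []string
	paginator := s3.NewListObjectsV2Paginator(s3Client, &s3.ListObjectsV2Input{
		Bucket: aws.String("kafka-backup-bucket"), // hypothetical names
		Prefix: aws.String("topics/orders/"),
	})
	for paginator.HasMorePages() {
		page, err := paginator.NextPage(ctx)
		if err != nil {
			log.Fatal(err)
		}
		for _, obj := range page.Contents {
			keys = append(keys, aws.ToString(obj.Key))
		}
	}

	// 2. Upload the list as a manifest object; the full list is far too large
	//    to pass on the AWS Batch command line.
	manifestKey := "manifests/restore-orders.txt"
	if _, err := s3Client.PutObject(ctx, &s3.PutObjectInput{
		Bucket: aws.String("kafka-backup-bucket"),
		Key:    aws.String(manifestKey),
		Body:   strings.NewReader(strings.Join(keys, "\n")),
	}); err != nil {
		log.Fatal(err)
	}

	// 3. Submit the restore job, passing only the manifest's location.
	batchClient := batch.NewFromConfig(cfg)
	if _, err := batchClient.SubmitJob(ctx, &batch.SubmitJobInput{
		JobName:       aws.String("restore-orders"),
		JobQueue:      aws.String("kafka-restore-queue"),
		JobDefinition: aws.String("kafka-restore-job"),
		ContainerOverrides: &batchtypes.ContainerOverrides{
			Environment: []batchtypes.KeyValuePair{
				{Name: aws.String("MANIFEST_KEY"), Value: aws.String(manifestKey)},
			},
		},
	}); err != nil {
		log.Fatal(err)
	}
}
```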

The Restore Service

The restore service runs via AWS Batch in a Fargate compute cluster. Its job is to load the S3 Objects that contain the Kafka records and then restore them. It uses a few S3 GetObject API calls and a fairly standard Kafka producer. 

The only difficulty we experienced was with the Go Kafka client. This library does not support custom partitioners, which we needed for handling some special cases where records were keyed by complex objects. We managed to work around this by re-implementing some of the custom partitioning logic in Go and combining it with the murmur2 hashing algorithm to assign the partition ID on the Kafka message itself.
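
For the simple case of keyed records, matching the Java client’s default placement means hashing the key bytes with Kafka’s murmur2 variant, masking the result to a positive value, and taking it modulo the partition count. A minimal sketch of that part in Go (the key serialisation and the special-case handling from the real service are omitted) looks roughly like this:

```go
package partitioner

// murmur2 is Kafka's variant of MurmurHash2, matching the implementation
// used by the Java client's default partitioner for keyed records.
func murmur2(data []byte) int32 {
	const (
		seed uint32 = 0x9747b28c
		m    uint32 = 0x5bd1e995
		r           = 24
	)

	length := len(data)
	h := seed ^ uint32(length)

	// Mix 4 bytes at a time into the hash.
	i := 0
	for ; length-i >= 4; i += 4 {
		k := uint32(data[i]) | uint32(data[i+1])<<8 | uint32(data[i+2])<<16 | uint32(data[i+3])<<24
		k *= m
		k ^= k >> r
		k *= m
		h *= m
		h ^= k
	}

	// Handle the remaining bytes.
	switch length - i {
	case 3:
		h ^= uint32(data[i+2]) << 16
		fallthrough
	case 2:
		h ^= uint32(data[i+1]) << 8
		fallthrough
	case 1:
		h ^= uint32(data[i])
		h *= m
	}

	h ^= h >> 13
	h *= m
	h ^= h >> 15

	return int32(h)
}

// PartitionForKey reproduces the Java client's choice of partition for a
// keyed record: positive murmur2 hash of the key bytes, modulo the number
// of partitions on the target topic.
func PartitionForKey(key []byte, numPartitions int32) int32 {
	return (murmur2(key) & 0x7fffffff) % numPartitions
}
```

The computed partition can then be set explicitly on each produced message, bypassing whatever partitioner the client library would otherwise apply.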

Conclusion

Before we started, we investigated several different approaches, but in the end we implemented the design described above because it ticked all the boxes for what the customer needed. It supported the Right to Forget requirement by providing mutable cold storage, and the customer’s specific requirement to maintain a streamlined technology stack further influenced our decision to pursue this design.

In these days of redundancy and 99.999% uptime, it’s easy to overlook backups as an operational concern, yet redundancy alone will never save you from data corruption or bad actors in your data centre. For this customer, Kafka was their central system of record, so backups were vital; the right to be forgotten is also a legal requirement, which meant out-of-the-box solutions were not 100% applicable.

If you have an operational requirement for Kafka backups, we’d be happy to talk to you about the project and discuss ways we might be able to work together!

