Message and event-driven systems provide an array of benefits to organisations of all shapes and sizes. At their core, they help decouple producers and consumers so that each can work at their own pace without having to wait for the other – asynchronous processing at its best.
In fact, such systems enable a whole range of messaging patterns, offering varying levels of guarantees surrounding the processing and consumption options for clients. Take for example the publish/subscribe pattern, which enables one message to be broadcast and consumed by multiple consumers; or the competing consumer pattern, which enables a message to be processed once but with multiple concurrent consumers vying for the honour—essentially providing a way to distribute the load. The manner in which these patterns are actually realised however, depends a lot on the technology used, as each has its own approach and unique tradeoffs.
In this article we will explore how this all applies to RabbitMQ and Apache Kafka, and how these two technologies differ, specifically from a message consumer’s perspective.
CTO / CEO
RabbitMQ is a popular open source message broker which works by pushing messages to consumers. Consumer process and acknowledge each message, after which, the message is removed from the queue. RabbitMQ supports several standard protocols including AMQP (the primary one), STOMP, and MQTT. The RabbitMQ server is written in and relies on Erlang, however has client libraries available in many languages to interact with it.
Compared to other traditional queuing systems, RabbitMQ comes with a (good) twist: exchanges (part of the AMQP protocol). With RabbitMQ, publishers normally publish to an exchange, which then takes care of routing that message to zero or more queues, depending on the strategy implemented, such as direct or fanout. Different exchange types, combined with other metadata like routing keys and headers, provides a great deal of flexibility for users.
Queues in RabbitMQ may be durable (i.e. in the event of a broker failure, messages which have not yet been processed will survive upon restart), but messages themselves are destroyed when consumed. The broker is in charge of keeping track of, and ensuring it knows when all appropriate clients have processed messages, and thus when to push more messages along.
This architectural approach and implementation has implications for consumers. For example, implementing the pub/sub pattern requires explicit decisions and architecting up front, knowing which appropriate exchange (a fanout in this case to uses) and queues as building blocks. However implementing the competing consumer pattern does not necessarily require you having to worry about what type of exchange is used, but simply that multiple consumers can connect to the same single queue (with caveats around scaling and messaging ordering).
Apache Kafka is much more than just a messaging system—it is a fully fledged distributed event streaming platform (which also handles messages). Kafka uses a custom protocol and is deployed as a JVM based solution which also relies on Apache ZooKeeper, although the latter is in the process of being architected out as a dependency. This is already available for early access preview in version 2.8.0. Kafka likewise has client integration libraries available in most of the popular programming languages.
Kafka is pull-based in that it is the Consumers responsibility to pull and consume messages at a rate and pace they are comfortable with. Under the hood Kafka uses a distributed append-only log, which at a basic level is comparable to git. Producers append messages to the log just like commits are added to a git repository, and different consumers can process those messages each at their own pace, similar to how git branches point to different commits and can move freely without affecting the history.
In fact, in Kafka, messages are durable and remain in the system for a period of time depending on a retention setting. The same message can be consumed several times, either by different consumers, or indeed by the same consumer multiple times.
Kafka uses the concepts of topics and partitions, where partitions are the main mechanism used to achieve its impressive scalability characteristics. At a basic level, a topic can be imagined as a queue with a particular type of message that interested consumers subscribe to (in the pub/sub sense). However, topics are refined into partitions (i.e. shards), with messages routed to different partitions based on a key. So, for example, messages for the same country code will always end up in the same partition.
This durability of messages, when combined with Kafka’s partition-based architecture, provides a natural way to implement the pub/sub pattern where multiple new consumers can be added without having to necessarily re-architect anything upfront from a consumer perspective. The competing consumer pattern is likewise easily accommodated, although it does require the introduction of a new “consumer group” concept. But going beyond this, Kafka also offers the ability for consumers to replay messages, which further opens up the possibility of even more event/message patterns such as Event Sourcing.
Now that we’ve established the fundamentals around RabbitMQ, Kafka and one or two key messaging patterns, let’s explore how these technologies compare when it comes to distributed systems characteristics – and specifically how consumer applications are impacted when using them in different scenarios.
Generally, increasing throughput means getting more stuff from producers to consumers as quickly as possible. Assuming both systems can get messages in relatively quickly, the challenge then becomes how fast can consumers consume? More consumers which are able to process in parallel will naturally improve the throughput in such systems.
It’s worth stating upfront that both technologies are able to consume millions of messages per second, provided they operate in agreed circumstances and typically with some caveats. For the purposes of this section, I want to focus on the case where you desire good throughput AND require your messages to be ordered. Take for example an e-commerce setting where you need to process a sequence of events in order for a particular client. I.e. such as: Someone orders a good, then cancels a good. Consuming apps would need to ensure they process the order event before the cancel event otherwise the system may not work as expected.
In RabbitMQ, you can have multiple parallel consumers for the same queue—but without doing anything special, this comes at the expense of ordering. This is because parallel consumer processes are not guaranteed to be able to pick messages off the queue and process in order. Whilst messages in a single queue are ordered, being restricted to a single queue for everything unfortunately does land up weakening scalability and ultimately overall message ordering too. To deal with the ordering problem, from version 3.8, RabbitMQ introduced the concept of a Single Active Consumer. This is very similar to the Kafka consumer group concept we cover next, in fact it was likely inspired by it. With this setup, whilst you may have multiple registered consumers for a queue, only one consumer is active at any given time and this is the one that consumes from a queue. If the active consumer is cancelled or dies, it can fail over to another registered consumer. So whilst you gain ordering guarantees through this, you still have the throughput problem of only having one consumer at a time. To get around this, you need to look to combine it with the add-on Consistent Hash Exchange or Sharded Plugins. These plugins provide a way for exchanges to consistently split incoming messages into multiple queues. This assumes that your data is able to be partitioned or sharded in a way which makes sense for in-order processing. For example all customers with IDs 1-100 may land up in one queue, with 101-200 in another etc. Being able to use the single active consumer against each partitioned or shared queue, introduces a level of parallelism not available before and thus improves throughput whilst preserving in-order processing for that specific queue. Note, there is no guarantees around in-order processing across queues.
Kafka, on the other hand, does all of the above by taking advantage of its native partitioning and consumer group functionality. This was built in from day one and through its built in partitioning architecture, scaling and thus throughput is improved whilst still preserving in-order processing within a partition. As with the RabbitMQ add-ons, there is no guarantee of ordering across partitions. Whilst Kafka allows multiple consumers for a partition, these consumers process messages independently. In order to ensure consistent in-order processing amongst consumers, you need to organise them into a single logical consumer group. Such a group provides a pool of highly available consumers in the first instance, however are architected such that only one will consume from a partition at any given time, although another can take over if it dies. Provided you can split your topic into appropriate partitions, consumers can consume that topic concurrently without compromising message ordering.
Routing is RabbitMQ’s secret sauce. Using different types of exchanges in conjunction with specific data sent by the producer (routing keys and/or headers), messages can be routed to different queues or duplicated to multiple queues. The system can easily be configured to ensure that consumers only receive the messages they need. This is very efficient and provides a means to minimise the amount of time, space and data needing to be processed by consumers.
Kafka, on the other hand, does not implement routing out of the box. This puts the burden of filtering out unwanted messages onto either the Consumers themselves, or requires the introduction of intermediary processes. In the latter case, these intermediary processes often need to filter and then forward/duplicate a subset of messages to a new topic, which then becomes the new subscription point for interested consumers. Whilst in some cases, partitions can be used to simulate a form of filtering (e.g. partitioning a topic by country code to have different consumers per country), partitions primarily exist to enable scaling, and thus their use for routing and filtering does not necessarily always fit as naturally. Without careful planning, such approaches can be wasteful in terms of processing and storage.
With RabbitMQ, multiple consumers on the same queue provide higher throughput and high availability, but as noted earlier the competing consumers do not respect message ordering. Thus, balancing throughput, high availability, and ordering requires assessing and balancing tradeoffs:
With Kafka, you likewise need to think about how to use a combination of the partition architecture and consumer groups to achieve this, although this is more naturally done out of the box.
RabbitMQ is designed to scale vertically, with the heaviest work happening in memory. You need to be careful to ensure that queue backlogs remain fairly small (ideally empty) so that consumers can keep up and the broker is not overwhelmed. Complex consumers with unpredictable consumption patterns (e.g. where some are waiting on other downstream processes) have been known to be problematic in such scenarios if they are unable to keep up.
Kafka, on the other hand, is designed to scale out horizontally. Kafka trades lower latency for better durability and availability; the retention and storage approach (see next subsection) means that it can handle higher throughput demands and also that the stability of the system is not threatened by temporary consumer outages. Interestingly, Confluent have recently made a beta Parallel Consumer Kafka client wrapper with client-side queueing available. This is a simpler consumer/producer API which allows for key based concurrency and extendable non-blocking IO processing. It works by allowing you to process messages in parallel within a single Kafka Consumer without necessarily having to increase the number of partitions in the topic you are handling. This can help to improve both throughput (increased ability for a single consumer to parallelise and process more) and latency by reducing load on brokers, although it is not officially supported yet as it is still experimental.
RabbitMQ can thus often achieve better end-to-end latency for smaller throughput scenarios where consumers are guaranteed to be able to keep up with the production of messages, but this comes at the expense of lower combined throughput, durability, and availability. This is a conclusion echoed in this benchmark (note however this is published by Confluent themselves).
One feature we at OpenCredo always look for when building or choosing architectural components is observability. With messaging and event-based systems it is important to be able to observe metrics such as queue depth (how far behind a consumer is becoming) and throughput (how quickly messages are being processed through the queuing mechanism).
Rabbit provides means to extract metrics on queue depth and throughput via APIs and dashboards available within its Management Plugin. On the other hand, Kafka provides similar metrics via JMX and tooling to monitor consumer lag.
However, the nature of Kafka’s immutable event log, allows one to take this further. Where multiple consumers are supported for every topic, it is possible to actually monitor the rates of production and also to listen in and inspect the actual data itself, without affecting live consumers. While this is possible using fanout exchanges in Rabbit, but a fanout queue is required for each additional listener. Whereas in Kafka one adds a new consumer group to the topic.
Finally, in RabbitMQ, queues are durable insofar as they can survive broker outages. On the other hand, message retention is acknowledgement-based, since messages are removed from queues once they are acknowledged. If optimising for latency you will naturally be trying to keep your queues small and moving and thus your storage will be minimised. That said, storage is pretty cheap these days.
Kafka, meanwhile, uses policy-based message retention. It can be configured to keep several days’ worth of message history, allowing messages to be processed at a later date – perhaps some consumers were offline, or some reprocessing is needed. However, it is vitally important to watch consumer group lag in this case, since messages can still be lost if consumers are unable to keep up with backlogs that then run beyond the retention period. This however tends to be less of a problem with Kafka as retention policies are often quite generous. And provided you have good monitoring and alerting (See James Bowkett’s Kafka, Devops …And Resilience for all talk) this should be able to be kept under control.
Overall, consumers get one shot at processing messages in RabbitMQ, whereas Kafka consumers have more flexibility, enabling message replays and patterns such as event sourcing.
RabbitMQ and Kafka follow very different approaches towards message delivery. RabbitMQ is the more traditional message broker system, but it leverages the added flexibility provided by AMQP exchanges, which can be beneficial in certain operational contexts. Kafka is a completely different system engineered specifically to solve the problems surrounding distributed systems processing massive amounts of data at scale (e.g., throughput, ordering, high availability, etc.).
Kafka tends to be the better choice when it comes to larger distributed systems. It can scale horizontally more effectively, achieving better throughput for bigger scenarios including when consumers are offline and not available. RabbitMQ is well-suited for systems with lower latency requirements, where consumers can keep up with the production of messages, but perhaps have less parallel throughput processing requirements.