The Secret Lives of Data


Apache Kafka: The Basics

by Ben Johnson    

There are many tools in a software developer's toolbox that add complexity to a system but the Apache Kafka project is one of the few that simplifies it. Kafka is a distributed log that can be used as a messaging system between different components of a distributed system.

This is a simple, interactive introduction to the concepts behind this log-oriented messaging system. We'll cover replication and internals in future posts.

Introduction to log processing

Many developers think of a log as somewhere they write their error messages but we're talking about a different kind of log. The log we're discussing is append-only log that contains a series of commands that each have a sequential identifier and some data:

In this example, each entry is a command to either add to or subtract from the current value, V, in our application. This is a very simple example of a log but it illustrates many of the important points of a log.

Mechanical sympathy

One important reason to use a log is that it is extremely efficient. Mechanical sympathy means to write software that is in harmony with how the underlying hardware works so you can use it most efficiently.

Your hard disk and RAM have the highest throughput when you read and write to them sequentially. The performance difference can be an order of magnitude over reading and writing data randomly. Logs take advantage of this by ordering entries one after another.

Many systems are optimized using an internal log. Most databases, for example, use a write-ahead log (WAL). The WAL keeps a record of changes so they're not lost during a system crash and the changes are applied in bulk to the underlying data store. Without a WAL, a database would have to make expensive random writes and disk syncs on every operation.

Replaying the log

Another benefit of a log is that changes can be replayed so you can see the state of your system at any given point in time. More importantly in distributed systems, though, is that you can copy the state of a system to another machine and replay all changes after the copy started to keep two systems in sync.

VIZ: Copy state and replay log.

Logical clocks

Producing messages

Consuming messages

Segmenting messages

Parallelizing topics