First steps with Apache Kafka
This is a blog post from our Community Stream: by developers, for developers. Don’t forget to stop by our community to find similar articles or join the conversation.
Before we find out why the Kafka rocket can fly at all and how it does so, we ignite the engines together and take a short test flight:
For our flight into space, we launch our (of course reusable) rocket, briefly leave the atmosphere to practice re-entry, and gently and safely land the rocket on a landing pad.
We want to capture all flight phases in Kafka, so we first create a topic. Topics are similar to tables in databases in that a topic stores a collection of data of a certain kind. In our case it is flight data, so we call the topic `flightdata`:

```shell
$ kafka-topics.sh \
    --create \
    --topic flightdata \
    --partitions 1 \
    --replication-factor 1 \
    --bootstrap-server localhost:9092
Created topic flightdata.
```
We use the `kafka-topics.sh` command to manage our topics in Kafka. Here, we tell Kafka to create the topic (`--topic flightdata`) with the `--create` argument. First, we start with one partition (`--partitions 1`) and without replicating the data (`--replication-factor 1`). Finally, we specify which Kafka cluster `kafka-topics.sh` should connect to. In our case, we use our local cluster, which by default listens on port 9092 (`--bootstrap-server localhost:9092`). The command confirms the successful creation of the topic. If we get errors here, it is often because Kafka has not started yet and is therefore not reachable, or because the topic already exists.
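The table analogy only goes so far: under the hood, a topic behaves more like an append-only log that producers write to and consumers read from. As a rough illustration, here is a toy Python model invented for this post (names like `ToyTopic` are made up; this is not Kafka's actual implementation or API):

```python
# A toy model of a Kafka topic: an append-only log of messages.
# Producers append; consumers read by offset and never remove anything.
class ToyTopic:
    def __init__(self, name):
        self.name = name
        self.log = []  # messages live here in arrival order

    def produce(self, message):
        self.log.append(message)

    def consume(self, offset=0):
        # Reading returns messages from the given offset onward;
        # the log itself is untouched, so reads can be repeated.
        return self.log[offset:]

topic = ToyTopic("flightdata")
topic.produce("Countdown started")
print(topic.consume())  # ['Countdown started']
```

The key point of the model: writing only ever appends, and reading never deletes, which is exactly the behavior we will observe with the real console tools below.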
So now we have a place to store our data. The rocket’s board computer continuously sends us updates on the rocket’s flight state. For our simulation, we use the command line tool `kafka-console-producer.sh`; this producer, like the other useful tools in this post, ships with Kafka directly. The producer connects to Kafka, takes data from the command line, and sends it as messages to a topic (configurable via the `--topic` parameter). Let’s write the message "Countdown started" into our just-created topic `flightdata`:

```shell
$ echo "Countdown started" | kafka-console-producer.sh \
    --topic flightdata \
    --bootstrap-server=localhost:9092
```
Our ground station now wants to read this data and output it on a large screen so that we can see whether the rocket really works as we expect. Let’s take a look at what has happened so far. To read our sent message again, we start `kafka-console-consumer.sh`, which is also part of the Kafka family:

```shell
$ timeout 10 kafka-console-consumer.sh \
    --topic flightdata \
    --bootstrap-server=localhost:9092
Processed a total of 0 messages
```
When we start `kafka-console-consumer.sh`, it keeps running by default until we actively cancel it (for example, with CTRL+C). This would also be the desired behavior if we really wanted to display the current state of the rocket somewhere. In our example, we use the `timeout` command to make the consumer terminate automatically after 10 seconds at the latest. For the consumer, we again have to specify which topic it should read from (`--topic flightdata`).
Somewhat surprisingly, no message is displayed. This is because by default, `kafka-console-consumer.sh` starts reading at the end of the topic and only prints new messages. To also display data that has already been written, we have to use the flag `--from-beginning`:

```shell
$ timeout 10 kafka-console-consumer.sh \
    --topic flightdata \
    --from-beginning \
    --bootstrap-server=localhost:9092
Countdown started
Processed a total of 1 messages
```
This time we see the message "Countdown started"! So, what happened? We used the `kafka-topics.sh` command to create the topic `flightdata` in Kafka and used `kafka-console-producer.sh` to produce the message "Countdown started". Then we read this message again with `kafka-console-consumer.sh`. Our data thus flowed from the producer through the topic `flightdata` to the consumer.
Without any other information, `kafka-console-consumer.sh` always starts reading at the end of the topic. That means if we want to read all messages, we have to use the `--from-beginning` flag.
Unlike in many messaging systems, we can read messages not just once, but as many times as we want. We can use this, for example, to connect several independent ground stations to the topic so that they all read the same data. Or there may be different systems that all need the same data: we can imagine not only a display screen, but also other services, such as one that compares the flight data with current weather data and decides whether something needs to be done. We may also want to analyze the data after the flight and need the historical flight data for that. To accomplish this, we can simply run the consumer multiple times and get the same result each time.
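This works because every consumer merely keeps track of its own read position (its offset) in the topic; reading never removes anything. A small Python sketch of this idea (again a hypothetical toy model with invented names, not the real Kafka protocol):

```python
# Toy model: the topic is a shared, append-only list of messages;
# every consumer tracks its own offset independently.
log = ["Countdown started"]

class ToyConsumer:
    def __init__(self, log, from_beginning=False):
        self.log = log
        # By default start at the end (only new messages),
        # mirroring kafka-console-consumer.sh's behavior.
        self.offset = 0 if from_beginning else len(log)

    def poll(self):
        messages = self.log[self.offset:]
        self.offset = len(self.log)  # advance only this consumer's position
        return messages

screen = ToyConsumer(log)                         # only wants new data
analysis = ToyConsumer(log, from_beginning=True)  # wants everything

log.append("Liftoff")
print(screen.poll())    # ['Liftoff']
print(analysis.poll())  # ['Countdown started', 'Liftoff']
```

Because each consumer owns its offset, adding another reader costs nothing and does not affect anyone else, which is exactly what we exploit in the multi-consumer experiments below.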
However, we would now like to display the current state of the rocket in our control center in such a way that the display updates immediately when there is new data. For this, we start `kafka-console-consumer.sh` (without `timeout`) in another terminal window. As soon as new data is available, the consumer fetches it from Kafka and displays it on the command line:

```shell
# Don't forget: use CTRL+C to stop the consumer
$ kafka-console-consumer.sh \
    --topic flightdata \
    --bootstrap-server=localhost:9092
```
To simulate the producer on the rocket side, we now start `kafka-console-producer.sh`. The command does not stop until we press CTRL+D, which signals end-of-file (EOF) to the producer:

```shell
# Don't forget: use CTRL+D to stop the producer
$ kafka-console-producer.sh \
    --topic flightdata \
    --bootstrap-server=localhost:9092
```
`kafka-console-producer.sh` sends one message to Kafka per line we write. That means we can now type messages into the producer’s terminal:

```shell
# Producer window:
> Countdown finished
> Liftoff
> Leave atmosphere
> Preparation reentry
> Reentry successful
> Landing successful
```
We should also see these promptly in the window with the consumer:

```shell
# Consumer window:
Countdown finished
Liftoff
Leave atmosphere
Preparation reentry
Reentry successful
Landing successful
```
Let’s imagine that part of our ground crew is working from home and wants to follow the flight from there. To do so, they independently start their own consumer. We can simulate this by starting a `kafka-console-consumer.sh` in another terminal window that displays all data from the beginning:

```shell
# Window Consumer 2
$ kafka-console-consumer.sh \
    --topic flightdata \
    --from-beginning \
    --bootstrap-server=localhost:9092
Countdown started
Countdown finished
Liftoff
Leave atmosphere
Preparation reentry
Reentry successful
Landing successful
```
We can see here that data, once written, can be read in parallel by multiple consumers, without the consumers needing to talk to each other or to register with Kafka first. Kafka does not delete any data on its own (at least not until a configured retention limit is reached), which means we can still start a consumer later that reads historical data.
Now, let’s say that we don’t want just one rocket to fly at a time, but several. Kafka has no problem with this and can easily process data from numerous producers simultaneously. So let’s start another producer in another terminal window:

```shell
# Producer for Rocket 2
$ kafka-console-producer.sh \
    --topic flightdata \
    --bootstrap-server=localhost:9092
> Countdown started
```
We see the messages from all producers show up in all our consumers, in the order in which they were produced:

```shell
# Window Consumer 1
[...]
Landing successful
Countdown started

# Window Consumer 2
[...]
Landing successful
Countdown started
```
But now the issue arises that we cannot distinguish the messages of rocket 1 from those of rocket 2. We could include in each message which rocket sent it. But more about that later.
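One common way to tell producers apart in Kafka is to attach a key to each message, such as a rocket ID. As a conceptual sketch of that idea (a toy model with invented names, not actual Kafka code):

```python
# Toy sketch: tagging each message with the rocket it came from,
# similar to attaching a key to a Kafka message.
log = []

def produce(rocket_id, message):
    # Store (key, value) pairs so consumers can tell rockets apart.
    log.append((rocket_id, message))

produce("rocket-1", "Landing successful")
produce("rocket-2", "Countdown started")

# A consumer interested only in rocket 2 filters by key:
rocket2 = [msg for key, msg in log if key == "rocket-2"]
print(rocket2)  # ['Countdown started']
```

With the console tools, keys can typically be supplied via the producer's `parse.key` and `key.separator` properties, but that goes beyond this first test flight.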
Before we continue, we should abort the flight of our second rocket, as we want to launch it later:
```shell
# Producer for Rocket 2
[...]
> Countdown canceled
```
We have now successfully launched and landed a rocket on a test basis, and in the process wrote some data to Kafka. To accomplish this, we first created a topic `flightdata` using the command line tool `kafka-topics.sh`, in which we stored all the flight data for our rocket. Into this topic, we produced some data using `kafka-console-producer.sh`; in our case, this was information about the current status of the rocket. We were able to read and display this data using `kafka-console-consumer.sh`. We even went further and produced data in parallel with multiple producers while reading it simultaneously with multiple consumers. With the `--from-beginning` flag of `kafka-console-consumer.sh`, we accessed historical data. Thus, we have already become familiar with three command line tools that ship with Kafka.
After gaining this experience, we can now close all open terminals: the producer with CTRL+D and the consumers with CTRL+C. This small example should not obscure the fact that Kafka is used wherever larger amounts of data are processed. From my training experience, I know that Kafka is used intensively by many car manufacturers, supermarket chains, logistics service providers, and even many banks and insurance companies. In this blog post, we got an initial overview of Kafka and ran through a simple test flight scenario. If you want to dig deeper, you can find more details about the Kafka architecture in my German book and in upcoming Xeotek Community articles.
As an IT trainer and frequent blogger for Xeotek, Anatoly Zelenin teaches Apache Kafka to hundreds of participants in interactive training sessions. For more than a decade, his clients from the DAX environment and German medium-sized businesses have appreciated his expertise and his inspiring manner. In that capacity, he delivers trainings for Xeotek clients as well. His book is available directly from him, from Hanser-Verlag, or on Amazon. You can reach Anatoly via e-mail. He is not only an IT consultant, trainer, and blogger but also explores our planet as an adventurer.