QuickStart — Apache Kafka + Kafka-Python

Apache Kafka can process trillions of events a day.

Kiruparan Balachandran
Towards Data Science

--

Introduction

Real-time data ingestion is a common problem in real-time analytics: on a platform such as an e-commerce site, many users are active at any given time, and each active user generates many events. Recommendations (i.e., predictions) for each event, or for groups of events, are therefore expected in near real time.

The primary concern is: how do we consume, produce, and process these events efficiently?

Apache Kafka addresses the consuming and producing parts of this problem. It is a distributed streaming platform that helps build real-time streaming data pipelines.

Apache Kafka Setup

This article walks through setting up Apache Kafka in a Linux environment and shows how events are produced and consumed using kafka-python. The following diagram illustrates the Kafka ecosystem we will set up.

Kafka Ecosystem — Image by Author

Starting Zookeeper

Apache ZooKeeper is an open-source, distributed configuration and synchronization service with a naming registry for distributed applications.

In the context of Kafka, ZooKeeper helps maintain broker state and stores configurations as key-value pairs in the ZK data tree, making them available across the cluster in a distributed manner.

The latest version of Kafka binary distribution is available at https://kafka.apache.org/downloads.

Go to the Kafka root folder:

cd /home/***/***_STREAM_PROCESSOR/kafka_2.12-2.0.0

Execute the following to start ZooKeeper:

bin/zookeeper-server-start.sh config/zookeeper.properties

If everything went well, output like the following is visible on the console:

ZooKeeper runs on port 2181 by default

Starting Kafka Brokers

A Kafka cluster consists of one or more brokers, and producers push events into Kafka topics hosted on those brokers.

There are two options to deploy the Kafka Broker:

  • deploy on a local machine
  • deploy on a remote machine; to do this, ensure that you update server.properties (located in the Kafka config directory) with the value listeners=PLAINTEXT://xxx.xxx.xxx.xxx:9092

Go to the Kafka root in a new console and execute the following command:

cd /home/***/***_STREAM_PROCESSOR/kafka_2.12-2.0.0
bin/kafka-server-start.sh config/server.properties

If everything went well, the following output is visible on the console:

Creating Kafka Topics

Producers push events to, and consumers pull events from, a Kafka topic.

Execute the following command to create a topic (e.g., “input_recommend_product”) in the Kafka Broker:
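For the Kafka 2.0 CLI used in this article, a topic-creation command along the following lines should work (this version of kafka-topics.sh still addresses ZooKeeper directly; run it from the Kafka root folder):

```shell
bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --replication-factor 1 \
  --partitions 3 \
  --topic input_recommend_product
```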

Since we have set up only one broker, we can keep only one copy of the topic; hence, we set the replication-factor to 1.

Here we divide the topic “input_recommend_product” into three partitions, so we set partitions to 3.

Creating Producer and Consumer using Kafka-python

Kafka-python is a Python client for Apache Kafka. It can be used to create topics, produce events to Kafka brokers, and consume events from a topic.

The following code segment shows how to create a producer and push a message to a topic within a broker:
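A minimal producer sketch with kafka-python could look like this; the event payload is an invented example, and value_serializer encodes each dict as UTF-8 JSON bytes before sending:

```python
import json

def serialize(event):
    """Encode an event dict as UTF-8 JSON bytes for Kafka."""
    return json.dumps(event).encode("utf-8")

def produce(topic="input_recommend_product", servers="localhost:9092"):
    # kafka-python is imported lazily so the serializer above can be
    # exercised without a running broker.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=servers,
        value_serializer=serialize,
    )
    # send() is asynchronous; flush() blocks until delivery completes.
    producer.send(topic, {"user_id": 42, "event": "view_product"})
    producer.flush()
    producer.close()

if __name__ == "__main__":
    produce()
```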

The pushed message can be pulled using KafkaConsumer; the following code segment consumes the message:
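A matching consumer sketch might look like the following; it subscribes to the same topic, decodes each message's JSON payload, and prints the topic, partition, offset, and value:

```python
import json

def deserialize(raw):
    """Decode UTF-8 JSON bytes from Kafka back into a Python object."""
    return json.loads(raw.decode("utf-8"))

def consume(topic="input_recommend_product", servers="localhost:9092"):
    # kafka-python is imported lazily so the deserializer above can be
    # exercised without a running broker.
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        topic,
        bootstrap_servers=servers,
        auto_offset_reset="earliest",   # read the topic from the beginning
        value_deserializer=deserialize,
    )
    # Iterating over the consumer blocks, yielding messages as they arrive.
    for message in consumer:
        print(message.topic, message.partition, message.offset, message.value)

if __name__ == "__main__":
    consume()
```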

The output from the consumer appears in the console as follows:

Final Thoughts

Apache Kafka is a vital platform for building real-time processing solutions. This article gives you a solid start at setting up Apache Kafka in a distributed environment and provides simple guidance for producing and consuming events.

--

A data enthusiast focused on extracting insights from business data sets, machine learning, and building and deploying large-scale machine learning models.