QuickStart — Apache Kafka + Kafka-Python
Apache Kafka can process trillions of events a day.
Introduction
Real-time data ingestion is a common problem in real-time analytics: on a platform such as an e-commerce site, many users are active at any given time, and each active user generates many events. Recommendations (i.e., predictions) for each event, or group of events, are therefore expected in near real-time.
The primary concern is: how do we consume, produce, and process these events efficiently?
Apache Kafka addresses the first two of these problems. It is a distributed streaming platform that helps build real-time streaming data pipelines.
Apache Kafka Setup
This article walks through setting up Apache Kafka in a Linux environment and shows how events are produced and consumed using kafka-python. The following diagram illustrates the Kafka ecosystem we are going to set up.
Starting Zookeeper
Apache ZooKeeper is a distributed, open-source configuration and synchronization service with a naming registry for distributed applications.
In the context of Kafka, ZooKeeper helps maintain server state, stores configurations as key-value pairs in the ZK data tree, and makes them available across the cluster in a distributed manner.
The latest version of Kafka binary distribution is available at https://kafka.apache.org/downloads.
Go to the Kafka root folder:
cd /home/***/***_STREAM_PROCESSOR/kafka_2.12-2.0.0
Execute the following to start ZooKeeper:
bin/zookeeper-server-start.sh config/zookeeper.properties
If everything went well, the following output will be visible on the console:
ZooKeeper runs on port 2181 by default.
Starting Kafka Brokers
A Kafka cluster consists of one or more brokers, and producers push events into Kafka topics hosted on those brokers.
There are two options to deploy the Kafka Broker:
- deploy on a local machine
- deploy on a remote machine; for this, update server.properties (located in the Kafka config directory) with the value listeners=PLAINTEXT://xxx.xxx.xxx.xxx:9092
Go to the Kafka root in a new console and execute the following command:
cd /home/***/***_STREAM_PROCESSOR/kafka_2.12-2.0.0
bin/kafka-server-start.sh config/server.properties
If everything went well, the following output will be visible on the console:
Creating Kafka Topics
Producers push events to, and consumers pull events from, a Kafka topic.
Execute the following command to create a topic (e.g., “input_recommend_product”) on the Kafka broker.
Since we have set up only one broker, we can keep only one copy of the topic; hence, we set the replication factor to 1.
Here we divide the topic “input_recommend_product” into three partitions, so we set the number of partitions to 3.
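With those settings, the topic-creation command looks like the following (this assumes ZooKeeper is running on localhost:2181, as in the default zookeeper.properties; Kafka 2.0's kafka-topics.sh still addresses ZooKeeper directly):

```shell
bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --replication-factor 1 \
  --partitions 3 \
  --topic input_recommend_product
```

You can verify the topic was created with `bin/kafka-topics.sh --list --zookeeper localhost:2181`.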
Creating Producer and Consumer using Kafka-python
kafka-python is a Python client for Apache Kafka. It helps create topics and produce events to Kafka brokers, and it is also employed to consume events from a topic on a Kafka broker.
The following code segment shows how to create a producer and push a message to a Topic within a broker:
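A minimal producer sketch is shown below, assuming the broker runs on localhost:9092 and the topic “input_recommend_product” already exists; the JSON payload here is a hypothetical example event:

```python
import json

from kafka import KafkaProducer

# Connect to the broker; value_serializer turns each Python dict
# into UTF-8 encoded JSON bytes before it is sent.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Push one example event to the topic.
producer.send("input_recommend_product", {"user_id": 42, "event": "view_product"})

# Block until all buffered messages are actually delivered.
producer.flush()
```

Since the topic has three partitions, messages sent without a key are distributed across partitions by the client.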
The pushed message can be pulled using KafkaConsumer, and the following code segment helps to consume the message:
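A matching consumer sketch, under the same assumptions (broker on localhost:9092, JSON-encoded values), could look like this:

```python
import json

from kafka import KafkaConsumer

# Subscribe to the topic; auto_offset_reset="earliest" makes a new
# consumer group start from the beginning of the topic, and the
# value_deserializer decodes the JSON bytes back into a dict.
consumer = KafkaConsumer(
    "input_recommend_product",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

# Iterate forever, printing each message's metadata and payload.
for message in consumer:
    print(message.topic, message.partition, message.offset, message.value)
```

The loop blocks waiting for new events, so it keeps printing as producers push further messages.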
The consumer's output appears in the console as follows:
Final Thoughts
Apache Kafka is a vital platform for building real-time processing solutions. This article gives you a solid start on setting up Apache Kafka in a distributed environment and provides simple guidance for producing and consuming events.