1. Introduction to the core concept

Broker: The Kafka cluster contains one or more servers, which are called brokers.

Topic: Every message posted to a Kafka cluster has a category called Topic. (Physically different Topic messages are stored separately. Logically, a Topic message is stored on one or more brokers, but the user only needs to specify the Topic of the message to produce or consume the data without having to care where the data is stored).

Partition: Partition is a physical concept, and each Topic contains one or more Partitions.

Producer: Responsible for posting messages to Kafka broker.

Consumer: The message consumer, the client that reads the message to the Kafka broker.

Consumer Group: Each Consumer belongs to a specific Consumer Group (you can specify a group name for each Consumer, or a default group if you do not specify a group name).

Users only need to pay attention to Topic when programming, and do not need to pay attention to the storage details inside.

2. Architecture


3. API Practice


# coding: utf-8
Import csv
Import time
From kafka import KafkaProducer
# Instantiate a KafkaProducer example for posting messages to Kafka
Producer = KafkaProducer(bootstrap_servers='localhost:9092')
# Open data file
Csvfile = open("../data/user_log.csv","r")
# Generate a reader that can be used to read csv files
Reader = csv.reader(csvfile)
For line in reader:
    Gender = line[9] # gender in the 9th element of each line of the log code
    If gender == 'gender':
        Continue # remove the first line header
    Time.sleep(0.1) # Send one line of data every 0.1 seconds
    #送数据, topic is 'sex'


from kafka import KafkaConsumer
consumer = KafkaConsumer('sex')
for msg in consumer:


1. How do you understand Topic's partition?

With partitioning, suppose a topic may be divided into 10 partitions. Kafka internally distributes 10 partitions to different servers as uniformly as possible according to certain algorithms. For example, A server is responsible for topic partition 1, B. The server is responsible for the partition 2 of the topic. In this case, if the Producer sends a message without specifying which partition to send to, the kafka may partition 1 according to a certain algorithm, and the next message may be in the partition 2. Of course, advanced APIs can also implement their own distribution algorithms. (link: https://www.zhihu.com/question/28925721/answer/43648910)