
What is Kafka and why we use it

chris_podorsiki edited this page Jan 26, 2018 · 1 revision

1 What is Kafka

Kafka is a message pipeline architecture mostly used for distributed message transport. It was developed by LinkedIn and has recently seen wide use in blockchain and high-performance data environments.

To make an analogy for Kafka: assume you had a chicken that produces one egg a day, and you yourself eat one egg a day. This situation is perfect, right? No leftover eggs, no shortage of eggs. However, what if you had hundreds of chickens, each producing one egg a day, while the only consumer is still just you? Oops -- if you don't have somewhere to keep those newly produced eggs, they will drop to the ground and you won't be able to eat them later.

Kafka is designed to solve this problem. Kafka doesn't worry about how many messages the producers have produced. It stores them temporarily in the broker (a middle layer between producer and consumer) and waits for the consumer to come and process/read them. Once a message has been read by a consumer under one group id, it cannot be consumed again unless a different group id is used.
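This "read once per group id" behavior can be sketched with a tiny in-memory model. This is a toy simulation, not the real Kafka API -- the `ToyBroker` class and its method names are invented purely for illustration:

```python
# Toy in-memory model of one Kafka topic log and per-group offsets.
# NOT the Kafka API -- just an illustration of the idea that each
# consumer group keeps its own independent read position.

class ToyBroker:
    def __init__(self):
        self.log = []            # messages stored in arrival order
        self.group_offsets = {}  # group_id -> next offset to read

    def produce(self, message):
        self.log.append(message)

    def consume(self, group_id):
        offset = self.group_offsets.get(group_id, 0)
        if offset >= len(self.log):
            return None          # nothing new left for this group
        self.group_offsets[group_id] = offset + 1
        return self.log[offset]

broker = ToyBroker()
broker.produce("egg-1")
broker.produce("egg-2")

print(broker.consume("group-A"))  # egg-1
print(broker.consume("group-A"))  # egg-2
print(broker.consume("group-A"))  # None: group-A has already read everything
print(broker.consume("group-B"))  # egg-1: a different group id reads it again
```

Note that the broker never deletes a message just because group-A read it; group-B still sees the full log from the beginning.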

Kafka has three main parts: producer / broker / consumer. Let's talk about each of them:

1 Producer: This is where messages come from. Producers sit outside the Kafka system and use the Kafka producer API to post messages to the broker. There can be multiple producers, meaning many producers can send messages to the same broker. When a producer sends a message, it gives each message a topic (much like data being stored in a table). Messages with the same topic are stored across different partitions in the next layer, the broker, and those partitions are replicated across the distributed system. So Kafka is a horizontally scaled system that can handle big-data sending and reading.

2 Broker: The broker is the middle layer of the whole architecture. It runs within a ZooKeeper system (ZooKeeper is a system for making sure a distributed system's data won't be lost when one of the servers has an outage; search for ZooKeeper for more info). Basically, multiple brokers sit within the same ZooKeeper system. When a message is posted to a broker, ZooKeeper makes sure the message is sent to the leader broker, and the rest of the brokers store replicas of the message so that it won't be lost if the leader broker crashes. When the leader broker has an outage, ZooKeeper elects another broker to be the new leader and keeps the whole Kafka system serving.

3 Consumer: Consumers read/consume messages from the broker. A consumer usually acts within a group. In the same group, the consumers consume the same message topic, and the same message from that topic will not be consumed by more than one consumer in the group -- this avoids repeated reading.
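The producer-to-partition and partition-to-consumer flow described above can be sketched as follows. This is a hypothetical simulation, not the real Kafka client API: the function names are invented, and the simple byte-sum hash stands in for Kafka's real key-hashing partitioner:

```python
# Toy sketch of (a) keyed messages always landing on the same partition,
# and (b) partitions being divided among the consumers in one group.
# Illustrative only -- not the real Kafka client API.

NUM_PARTITIONS = 3

def pick_partition(key, num_partitions=NUM_PARTITIONS):
    # Kafka's default partitioner hashes the message key, so all messages
    # with the same key land on the same partition; here a simple
    # deterministic byte-sum hash stands in for the real one.
    return sum(key.encode()) % num_partitions

def assign_partitions(partitions, consumers):
    # Round-robin assignment: each partition goes to exactly one consumer
    # in the group, so within a group no message is read twice.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# The same key always maps to the same partition:
print(pick_partition("chicken-7") == pick_partition("chicken-7"))  # True

print(assign_partitions([0, 1, 2], ["consumer-A", "consumer-B"]))
# {'consumer-A': [0, 2], 'consumer-B': [1]}
```

With three partitions and two consumers, one consumer handles two partitions; with more consumers than partitions, the extras would simply be assigned nothing.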

So, now you know how Kafka runs and what its core components are. Let's point out the rules in Kafka:

1 When messages are produced into brokers, each message is given an offset id when Kafka receives it, and the ids are assigned sequentially. This means an earlier-produced message has a smaller id than a later one.

2 When messages are read from the broker, the consumer reads the smallest offset id first. This means earlier-produced messages are read earlier by the consumer.

3 If a message in the broker has been replicated N times, then the Kafka system can afford N-1 broker outages without losing the message.

4 When a consumer reads a message, it automatically tells the Kafka system which offset id has been read (unless you manually set auto-commit-enable: false). Once Kafka knows which offset id has been read in the topic partition, it stores that id in the system. If the consumer stops suddenly, Kafka keeps the last committed offset id for that consumer; when the consumer comes back up and starts reading again, it resumes from where it last stopped. See, no repeated reading!

5 One partition of a particular topic will only be read by one consumer within a group. So don't set the number of consumers larger than the number of partitions; otherwise, there will always be one consumer with nothing to do.