Kafka — What You Should Know?
Apache Kafka is a distributed event streaming platform that has become a fundamental building block for real-time data architectures. It provides a robust and scalable solution for handling large-scale data streams, enabling organizations to build real-time applications and analytics pipelines. In this blog post, we’ll explore the key concepts of Apache Kafka and its role in modern data-driven applications.
What is Apache Kafka?
Apache Kafka, originally developed by LinkedIn, is an open-source distributed event streaming platform. It is designed to handle massive amounts of data in real-time, making it a powerful tool for building scalable and fault-tolerant applications. Kafka is built on the principles of publish-subscribe and distributed storage, providing a reliable and highly available data pipeline.
Key Concepts
1. Topics
In Kafka, data is organized into topics. A topic is a feed or category to which messages are published by producers. Topics allow for the logical organization and segregation of data streams. Each message published to a topic is timestamped and retained for a configurable period.
2. Producers
Producers are responsible for publishing messages to Kafka topics. They generate data and push it to one or more topics. Producers play a crucial role in the real-time data pipeline, allowing systems to continuously feed data into Kafka.
3. Consumers
Consumers subscribe to topics and process the messages published to those topics. They play a vital role in real-time data processing, allowing applications to react to events as they occur. Kafka’s consumer groups enable parallel processing of messages, ensuring high throughput.
4. Brokers
Kafka brokers form the core of the Kafka cluster. Brokers store and manage the topic partitions, handle producer and consumer requests, and ensure fault tolerance. Kafka can operate with multiple brokers, providing scalability and high availability.
5. Partitions
Each topic is divided into partitions, and each partition is replicated across multiple brokers. Partitions enable parallel processing and distribution of data across the Kafka cluster. Replication ensures data durability and fault tolerance.
Let's dig into each concept.
Topics
- Related events are structured and organized using Kafka topics.
- Each topic is uniquely identified by its name.
- Kafka topics are versatile, accommodating messages of any type and format. The ordered sequence of these messages is referred to as a data stream.
- By default, data within Kafka topics is retained for one week, known as the default message retention period. This duration is customizable to suit specific requirements.
- Partitions divide topics into distinct segments.
- An offset, represented by an integer, is assigned by Kafka to every message upon being written to a partition. Each message within a particular partition possesses a unique offset.
- The significance of Kafka offsets is confined to the respective partition they belong to.
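To make this concrete, here is a minimal sketch of creating a topic with the Java AdminClient, assuming a broker reachable at localhost:9092; the topic name (orders), partition count, replication factor, and retention override are illustrative values rather than anything prescribed by Kafka.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Assumed broker address; replace with your cluster's bootstrap servers.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions, replication factor 3, and an explicit 7-day retention override.
            NewTopic orders = new NewTopic("orders", 3, (short) 3)
                    .configs(Map.of("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)));
            admin.createTopics(Collections.singletonList(orders)).all().get();
        }
    }
}
```

With three partitions, the topic's data stream can be spread across up to three brokers and read in parallel by up to three consumers in a group.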
Producers
- Kafka producers are applications responsible for transmitting data to topics.
- Messages sent by a Kafka producer are allocated to partitions based on a mechanism such as key hashing.
  - When the message key is null, messages are distributed evenly across the partitions of the topic using a round-robin approach.
  - When the message key is not null, all messages sharing the same key are consistently sent to and stored in the same Kafka partition.
- Message keys are commonly used when ordering must be preserved across all messages that share a common field, for example all events belonging to the same customer ID (see the sketch after this list).
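A minimal producer sketch, assuming the same local broker and the hypothetical orders topic from the earlier snippet; using the customer ID as the message key keeps all events for that customer in a single partition, and therefore in order.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Keyed message: all events for "customer-42" hash to the same partition.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order created"));
            // Null key: messages are spread across partitions (round-robin, or sticky
            // batching in newer client versions).
            producer.send(new ProducerRecord<>("orders", null, "heartbeat"));
            producer.flush();
        }
    }
}
```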
Producer — Acks
- acks=0: the producer does not wait for any acknowledgement from Kafka. If the broker fails, the message is lost.
- acks=1: the producer waits for an acknowledgement from the partition leader only (historically the default). Data can still be lost if the leader goes down before its replicas have copied the message.
- acks=all: the most reliable setting. The producer waits for confirmation from the leader and from all in-sync replicas (ISR), so an acknowledged write survives as long as at least one in-sync replica remains. Only replicas currently in the ISR are waited on; the min.insync.replicas setting (covered later) controls how many of them must have the data before the acknowledgement is sent. A configuration sketch follows this list.
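As a sketch, the acknowledgement level is simply a producer configuration; this fragment reuses the Properties object from the producer snippet above, and the chosen value is illustrative.

```java
// Valid values for acks are "0", "1", and "all".
// "all" waits for the leader plus every in-sync replica and pairs well with
// a topic-level min.insync.replicas of 2 or more.
props.put(ProducerConfig.ACKS_CONFIG, "all");
```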
Consumer
- Kafka consumers are applications that read event data from one or more Kafka topics. A single consumer can read from one or more partitions at the same time (see the sketch after this list).
- Data is read sequentially within each partition, guaranteeing ordered consumption inside a partition.
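A minimal sketch of a consumer reading one specific partition of the hypothetical orders topic, assuming the same local broker; the printed offsets illustrate the sequential read order within a partition.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class PartitionReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false"); // no group.id is set here

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Manually assign partition 0 of "orders"; no consumer group coordination is involved.
            consumer.assign(Collections.singletonList(new TopicPartition("orders", 0)));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Offsets increase monotonically within the partition.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```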
Consumer Groups
- Consumers belonging to a shared application and collectively executing a common “logical job” can be organized into a Kafka consumer group. To signify that Kafka consumers are part of the same group, it is imperative to define the consumer-side setting ‘group.id.’
- Efficient load balancing is facilitated by Kafka Consumers through the utilization of a GroupCoordinator and a ConsumerCoordinator. These coordinators work in tandem to assign partitions to consumers, ensuring equitable distribution of the load among all consumers within the group.
- It’s worth noting that each topic partition is exclusively assigned to a single consumer within a consumer group. However, a consumer from a consumer group can be assigned multiple partitions, enhancing the parallel processing capabilities of the group.
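A sketch of a group-based consumer, building on the consumer snippet above (same imports, broker, and deserializer settings); setting group.id and calling subscribe() hands partition assignment to the group coordinator, and running several copies of this program spreads the topic's partitions across them. The group name order-processors is hypothetical.

```java
// Fragment: reuses the Properties and imports from the consumer snippet above.
props.put(ConsumerConfig.GROUP_ID_CONFIG, "order-processors"); // hypothetical group name

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
    // subscribe() (unlike assign()) lets the group coordinator assign partitions.
    consumer.subscribe(Collections.singletonList("orders"));
    while (true) {
        ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
        for (ConsumerRecord<String, String> record : records) {
            System.out.printf("assigned partition=%d offset=%d%n",
                    record.partition(), record.offset());
        }
    }
}
```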
Broker
- A single Kafka server is called a Kafka broker.
- When a group of Kafka brokers collaborates, it forms a Kafka cluster.
- Data storage within Kafka brokers occurs in directories on the respective server disks. Each topic-partition has its own sub-directory, named after the topic and the partition number.
- To enhance throughput and scalability, Kafka topics are partitioned. In scenarios with multiple Kafka brokers in a cluster, partitions for a specific topic are evenly distributed among the brokers. This distribution ensures effective load balancing and scalability across the Kafka cluster.
- A client seeking to send or receive messages from the Kafka cluster can establish a connection with any broker within the cluster. Each broker possesses metadata concerning all other brokers, enabling it to assist the client in connecting to them. Consequently, any broker within the cluster is commonly referred to as a bootstrap server.
- Upon connection, the bootstrap server furnishes the client with metadata, comprising a comprehensive list of all brokers in the cluster. Armed with this information, the client can precisely determine the broker to connect to when initiating data transmission or reception. This knowledge also allows the client to identify the brokers housing the pertinent topic-partitions as needed.
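A small sketch of the bootstrap mechanism using the Java AdminClient: the client is given only one assumed broker address, fetches cluster metadata from it, and learns about every other broker in the cluster.

```java
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.common.Node;

public class ListBrokers {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Any single reachable broker can act as the bootstrap server.
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // The metadata response lists every broker in the cluster, not just the bootstrap one.
            for (Node node : admin.describeCluster().nodes().get()) {
                System.out.printf("broker id=%d host=%s port=%d%n",
                        node.id(), node.host(), node.port());
            }
        }
    }
}
```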
Topic Replication
Data Replication serves as a safeguard against potential data loss by duplicating the same data across multiple brokers. In the Kafka context, replication extends beyond a single broker to encompass multiple brokers within the cluster.
The replication factor, a topic-specific setting determined at topic creation, defines the number of copies created for each piece of data. A replication factor of 1 implies no replication, often employed for development purposes but discouraged in testing and production Kafka clusters.
A replication factor of 3 is widely adopted, striking a balance between mitigating broker loss and managing replication overhead effectively. This approach ensures data resilience and availability across the Kafka cluster.
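To see replication in action, here is a sketch that inspects the replica and ISR lists of the hypothetical orders topic, reusing the AdminClient setup from the previous snippet.

```java
// Fragment: reuses the AdminClient (admin) from the previous snippet; also needs imports for
// org.apache.kafka.clients.admin.TopicDescription, org.apache.kafka.common.TopicPartitionInfo,
// and java.util.Collections.
TopicDescription description = admin.describeTopics(Collections.singletonList("orders"))
        .all().get().get("orders");

for (TopicPartitionInfo partition : description.partitions()) {
    // Each partition has one leader, a list of replicas, and the in-sync subset (ISR).
    System.out.printf("partition=%d leader=%d replicas=%s isr=%s%n",
            partition.partition(), partition.leader().id(),
            partition.replicas(), partition.isr());
}
```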
Topic Availability & Durability
- With a replication factor of 3 for a topic, the durability of topic data ensures resilience against the loss of 2 brokers. In general, for a replication factor of N, you can endure the permanent loss of up to N-1 brokers and still recover your data.
Consumer
- As long as at least one replica of a partition remains operational and is part of the In-Sync Replica set (ISR), the topic remains available for reads.
Producer
- For acks=0 and acks=1, the topic stays available for writes as long as one replica of the partition is up and part of the ISR.
- When acks=all:
  - With min.insync.replicas=1 (the default), the topic requires at least one in-sync replica per partition, so it can tolerate two brokers being down (with a replication factor of 3).
  - With min.insync.replicas=2, the topic must have at least two in-sync replicas per partition, so at most one broker can be down (with a replication factor of 3). This also guarantees that every write lands on at least two brokers.
  - Setting min.insync.replicas=3 would not be meaningful with a replication factor of 3, as no broker downtime could be tolerated.

In summary, when using acks=all with a replication factor of N and min.insync.replicas=M, the system can tolerate the loss of N-M brokers while keeping the topic available for writes.
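A sketch of how this is wired up in practice, assuming the orders topic with replication factor 3: min.insync.replicas is a topic-level (or broker-level) config, set here via the AdminClient from the earlier snippets, while acks is set on the producer side as shown before.

```java
// Fragment: reuses the AdminClient (admin) from the broker snippet above; also needs imports for
// org.apache.kafka.common.config.ConfigResource, org.apache.kafka.clients.admin.AlterConfigOp,
// org.apache.kafka.clients.admin.ConfigEntry, java.util.Map, and java.util.Collections.
ConfigResource topicResource = new ConfigResource(ConfigResource.Type.TOPIC, "orders");
AlterConfigOp setMinIsr = new AlterConfigOp(
        new ConfigEntry("min.insync.replicas", "2"), AlterConfigOp.OpType.SET);
admin.incrementalAlterConfigs(Map.of(topicResource, Collections.singletonList(setMinIsr)))
        .all().get();

// Paired with a producer using acks=all: with replication factor 3 and min.insync.replicas=2,
// one broker can be down and writes still succeed; with two brokers down, producers receive
// "not enough replicas" errors until the ISR recovers.
```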
As organizations increasingly embrace the need for seamless data integration and event-driven architectures, Kafka continues to play a pivotal role in enabling these transformations. The power of Kafka lies not just in its ability to handle massive volumes of data but in providing a foundation for building innovative and responsive applications.
Whether you are a developer, data engineer, or architect, understanding Kafka’s core concepts empowers you to harness its capabilities fully. The journey through topics, partitions, producers, consumers, and replication factors opens up a world of possibilities for handling streaming data at scale.
As we navigate the evolving landscape of data-driven technologies, Kafka remains a beacon, guiding us towards building resilient, scalable, and future-ready systems. With its vibrant community, ongoing enhancements, and seamless integration with various tools and frameworks, Kafka continues to be at the forefront of shaping the data-driven future.
We will learn more about Kafka and KSQL in upcoming blog posts. Stay tuned.