How to Set Up a Multi-Node Kafka Cluster Using KRaft

Managing distributed data streams at scale requires robust, fault-tolerant infrastructure, and Apache Kafka has been the go-to solution for countless organizations. With the introduction of KRaft (Kafka Raft), setting up Kafka clusters has become significantly simpler by eliminating the dependency on Apache ZooKeeper. This guide walks you through creating a production-ready multi-node Kafka cluster using KRaft mode, covering everything from initial setup to troubleshooting common issues you’ll encounter in real deployments.

Understanding KRaft: How It Works

KRaft represents a fundamental shift in Kafka’s architecture. Instead of relying on ZooKeeper for metadata management and leader election, Kafka now implements its own consensus protocol based on the Raft algorithm. This change eliminates the operational complexity of maintaining a separate ZooKeeper ensemble while improving performance and reducing resource overhead.

The key components in a KRaft cluster include:

  • Controller nodes: Handle metadata operations, partition leadership, and cluster coordination
  • Broker nodes: Process client requests and store topic data
  • Combined nodes: Can function as both controllers and brokers (suitable for smaller deployments)

KRaft uses a quorum-based approach where controller nodes form a Raft consensus group. This eliminates split-brain scenarios and provides better consistency guarantees compared to the ZooKeeper-based approach.
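Once a cluster is running (we build one below), the state of this quorum can be inspected directly. The kafka-metadata-quorum.sh tool ships with Kafka 3.3 and later:

# Show the current leader, voters, and observers of the Raft quorum
/opt/kafka/bin/kafka-metadata-quorum.sh --bootstrap-server 192.168.1.10:9092 describe --status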

Prerequisites and Environment Setup

Before diving into the cluster setup, ensure your environment meets these requirements:

  • Java 11 or higher installed on all nodes
  • Kafka 3.3.1 or later (KRaft was introduced in 2.8.0 as early access and declared production-ready in 3.3)
  • Network connectivity between all cluster nodes
  • Sufficient disk space for logs and metadata storage
  • Synchronized clocks across all nodes (use NTP)
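A quick way to verify these prerequisites on each node before proceeding (commands assume a systemd-based Linux distribution):

# Sanity-check Java, clock sync, and disk space on every node
java -version                      # expect 11 or newer
timedatectl | grep synchronized    # expect "System clock synchronized: yes"
df -h /opt                         # confirm space for /opt/kafka data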

For this guide, we’ll set up a three-node cluster with the following configuration:

Node           IP Address      Role                  Node ID
kafka-node-1   192.168.1.10    Controller + Broker   1
kafka-node-2   192.168.1.11    Controller + Broker   2
kafka-node-3   192.168.1.12    Controller + Broker   3

Step-by-Step Cluster Implementation

Step 1: Download and Install Kafka

Download the latest Kafka release on all nodes:

# KRaft is production-ready as of Kafka 3.3; substitute the latest release
# number from https://kafka.apache.org/downloads
KAFKA_VERSION=3.7.0
wget https://downloads.apache.org/kafka/${KAFKA_VERSION}/kafka_2.13-${KAFKA_VERSION}.tgz
tar -xzf kafka_2.13-${KAFKA_VERSION}.tgz
sudo mv kafka_2.13-${KAFKA_VERSION} /opt/kafka
sudo chown -R kafka:kafka /opt/kafka
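The chown step above assumes a dedicated kafka service account already exists. If it doesn't, create one first; a minimal sketch:

# Create a locked-down system user for running Kafka
sudo useradd --system --no-create-home --shell /usr/sbin/nologin kafka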

Step 2: Generate Cluster UUID

KRaft clusters require a unique cluster identifier. Generate this on one node and use it across all nodes:

/opt/kafka/bin/kafka-storage.sh random-uuid

Save the generated UUID (e.g., `4L6g3nShT-eMCtK--X86sw`) for use in all node configurations.
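If you script the setup, it helps to capture the ID once and reuse it in the later format step; the variable name here is purely illustrative:

# Generate once and reuse the same value on every node
KAFKA_CLUSTER_ID=$(/opt/kafka/bin/kafka-storage.sh random-uuid)
echo "${KAFKA_CLUSTER_ID}"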

Step 3: Configure Server Properties

Create the server configuration for each node. Here’s the configuration for kafka-node-1:

# Node ID - must be unique across the cluster
node.id=1

# Process roles - this node acts as both controller and broker
process.roles=broker,controller

# Controller quorum voters - all controller nodes in the cluster
controller.quorum.voters=1@192.168.1.10:9093,2@192.168.1.11:9093,3@192.168.1.12:9093

# Listeners configuration
listeners=PLAINTEXT://192.168.1.10:9092,CONTROLLER://192.168.1.10:9093
inter.broker.listener.name=PLAINTEXT
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT

# Log directories
log.dirs=/opt/kafka/kafka-logs

# Note: the cluster ID is not a server.properties setting; it is written to
# meta.properties by the kafka-storage.sh format command in Step 4

# Network and I/O thread settings
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600

# Log retention settings
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

# Internal topic settings
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2

# Group coordinator settings
group.initial.rebalance.delay.ms=0

For kafka-node-2 and kafka-node-3, modify the following parameters accordingly:

# For kafka-node-2
node.id=2
listeners=PLAINTEXT://192.168.1.11:9092,CONTROLLER://192.168.1.11:9093

# For kafka-node-3  
node.id=3
listeners=PLAINTEXT://192.168.1.12:9092,CONTROLLER://192.168.1.12:9093
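Maintaining three nearly identical files by hand invites drift. A hypothetical helper, assuming the kafka-node-1 file is kept as a template named server.properties.node1 (all names here are illustrative):

# Run on kafka-node-2; set NODE_ID/NODE_IP appropriately on each host.
# The IP substitution is restricted to the listeners line so that the
# 192.168.1.10 entry in controller.quorum.voters is left untouched.
NODE_ID=2
NODE_IP=192.168.1.11
sed -e "s/^node.id=.*/node.id=${NODE_ID}/" \
    -e "/^listeners=/s/192\.168\.1\.10/${NODE_IP}/g" \
    server.properties.node1 > /opt/kafka/config/server.properties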

Step 4: Format Storage Directories

Before starting the cluster, format the storage directories on all nodes:

/opt/kafka/bin/kafka-storage.sh format -t 4L6g3nShT-eMCtK--X86sw -c /opt/kafka/config/server.properties
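Formatting writes a meta.properties file into each log directory, recording the node ID and cluster ID. Checking it on every node is a quick sanity test:

# The cluster.id value must be identical on all three nodes
cat /opt/kafka/kafka-logs/meta.properties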

Step 5: Start the Cluster

Start Kafka on all nodes within a short window of one another so the controllers can establish a quorum. Create a systemd service file for better management:

[Unit]
Description=Apache Kafka Server (KRaft)
Documentation=https://kafka.apache.org/documentation/
Requires=network.target remote-fs.target
After=network.target remote-fs.target

[Service]
Type=simple
User=kafka
Group=kafka
Environment=JAVA_HOME=/usr/lib/jvm/java-11-openjdk
ExecStart=/opt/kafka/bin/kafka-server-start.sh /opt/kafka/config/server.properties
ExecStop=/opt/kafka/bin/kafka-server-stop.sh
Restart=on-abnormal

[Install]
WantedBy=multi-user.target

Save this as `/etc/systemd/system/kafka.service` and start the service:

sudo systemctl daemon-reload
sudo systemctl enable kafka
sudo systemctl start kafka
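After starting, confirm each node's service is healthy and watch the startup logs for errors:

# Verify the service state and follow the logs
sudo systemctl status kafka
sudo journalctl -u kafka -f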

Verification and Testing

Once all nodes are running, verify the cluster status:

# Check cluster metadata
/opt/kafka/bin/kafka-metadata-shell.sh --snapshot /opt/kafka/kafka-logs/__cluster_metadata-0/00000000000000000000.log

# List brokers
/opt/kafka/bin/kafka-broker-api-versions.sh --bootstrap-server 192.168.1.10:9092

# Create a test topic
/opt/kafka/bin/kafka-topics.sh --create --topic test-topic --bootstrap-server 192.168.1.10:9092 --partitions 6 --replication-factor 3

# List topics
/opt/kafka/bin/kafka-topics.sh --list --bootstrap-server 192.168.1.10:9092
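A full produce/consume round trip confirms that replication works across brokers:

# Produce a test message to one broker...
echo "hello kraft" | /opt/kafka/bin/kafka-console-producer.sh --bootstrap-server 192.168.1.10:9092 --topic test-topic

# ...and read it back through a different broker
/opt/kafka/bin/kafka-console-consumer.sh --bootstrap-server 192.168.1.11:9092 --topic test-topic --from-beginning --max-messages 1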

Performance Comparison: KRaft vs ZooKeeper

Based on real-world deployments and Apache Kafka benchmarks, here’s how KRaft compares to traditional ZooKeeper-based clusters:

Metric                                 ZooKeeper Mode      KRaft Mode      Improvement
Startup Time (3-node cluster)          45-60 seconds       15-25 seconds   60% faster
Memory Usage                           ~2GB (Kafka + ZK)   ~1.2GB          40% reduction
Controller Failover Time               10-30 seconds       3-10 seconds    70% faster
Partition Creation (1000 partitions)   15-20 seconds       5-8 seconds     65% faster

Real-World Use Cases and Examples

Here are some practical scenarios where KRaft-based Kafka clusters excel:

Microservices Event Streaming

A financial services company migrated their microservices communication from REST APIs to Kafka-based event streaming. Using KRaft simplified their infrastructure by eliminating ZooKeeper dependencies, reducing operational overhead by approximately 30%.

# Example producer configuration for microservices
bootstrap.servers=192.168.1.10:9092,192.168.1.11:9092,192.168.1.12:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
# Note: Apache Kafka ships no JSON serializer; this class comes from the
# Spring Kafka library. Alternatively, use StringSerializer and serialize
# JSON in application code.
value.serializer=org.springframework.kafka.support.serializer.JsonSerializer
acks=all
retries=2147483647
max.in.flight.requests.per.connection=5
enable.idempotence=true
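A matching consumer configuration follows the same pattern; this is a minimal sketch, with the group.id chosen purely for illustration:

# Example consumer configuration for the same microservices setup
bootstrap.servers=192.168.1.10:9092,192.168.1.11:9092,192.168.1.12:9092
group.id=payments-service
key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
value.deserializer=org.apache.kafka.common.serialization.StringDeserializer
enable.auto.commit=false
auto.offset.reset=earliest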

IoT Data Pipeline

An IoT platform processing 100,000+ device messages per second implemented KRaft for better resource utilization. The simplified architecture reduced their container footprint from 9 pods (3 Kafka + 3 ZooKeeper + 3 monitoring) to 6 pods (3 Kafka + 3 monitoring).

Common Pitfalls and Troubleshooting

Issue 1: Split-Brain During Initial Startup

Symptoms: Nodes start individually but don’t form a proper quorum

Solution: Ensure all controller nodes are listed correctly in `controller.quorum.voters` and start nodes within a reasonable time window:

# Check controller quorum status (kafka-metadata-quorum.sh ships with Kafka 3.3+)
/opt/kafka/bin/kafka-metadata-quorum.sh --bootstrap-server localhost:9092 describe --status

Issue 2: Metadata Inconsistency

Symptoms: Topics appear on some brokers but not others

Solution: Verify that the cluster ID recorded in each node's meta.properties matches and check the controller logs:

tail -f /opt/kafka/logs/controller.log
grep "cluster.id" /opt/kafka/kafka-logs/meta.properties

Issue 3: Performance Degradation

Common causes and solutions:

  • Insufficient controller resources: Separate controller and broker roles in high-throughput environments (see the dedicated-controller sketch after this list)
  • Network latency: Ensure sub-10ms latency between controller nodes
  • Disk I/O bottlenecks: Use SSD storage for metadata directories
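For the first point, a dedicated controller drops the broker role and the client listener entirely. Below is a sketch of such a server.properties, assuming a hypothetical redesign where nodes 4-6 (192.168.1.13-15) act as controllers only:

# Controller-only node (hypothetical node 4 of a dedicated controller trio)
process.roles=controller
node.id=4
controller.quorum.voters=4@192.168.1.13:9093,5@192.168.1.14:9093,6@192.168.1.15:9093
listeners=CONTROLLER://192.168.1.13:9093
controller.listener.names=CONTROLLER
listener.security.protocol.map=CONTROLLER:PLAINTEXT
log.dirs=/opt/kafka/kafka-metadata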

Best Practices and Security Considerations

Production Deployment Guidelines

  • Separate controller and broker roles for clusters handling >10GB/day throughput
  • Use odd numbers of controller nodes (3, 5, 7) to maintain quorum
  • Implement monitoring using JMX metrics and tools like Prometheus
  • Configure proper log retention based on storage capacity and compliance requirements

Security Configuration

Enable SASL/SCRAM authentication for production deployments (KRaft mode supports SCRAM credentials from Kafka 3.5):

# Add to server.properties (shown for kafka-node-1; adjust IPs per node).
# Rename the client listener so its name no longer implies plaintext, and
# keep the controller listener internal. Note that inter.broker.listener.name
# and security.inter.broker.protocol are mutually exclusive; only the former
# is used here.
listeners=BROKER://192.168.1.10:9092,CONTROLLER://192.168.1.10:9093
inter.broker.listener.name=BROKER
listener.security.protocol.map=CONTROLLER:PLAINTEXT,BROKER:SASL_PLAINTEXT
sasl.enabled.mechanisms=SCRAM-SHA-256
sasl.mechanism.inter.broker.protocol=SCRAM-SHA-256
listener.name.broker.scram-sha-256.sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin" password="admin-secret";

# Create SCRAM credentials
/opt/kafka/bin/kafka-configs.sh --bootstrap-server localhost:9092 --alter --add-config 'SCRAM-SHA-256=[password=admin-secret]' --entity-type users --entity-name admin
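Clients then need matching SASL settings. A minimal client.properties sketch, assuming the admin credentials created above:

# Pass this file to CLI tools via --command-config or to clients directly
security.protocol=SASL_PLAINTEXT
sasl.mechanism=SCRAM-SHA-256
sasl.jaas.config=org.apache.kafka.common.security.scram.ScramLoginModule required username="admin" password="admin-secret";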

Monitoring and Alerting

Key metrics to monitor in KRaft clusters:

  • kafka.controller:type=KafkaController,name=ActiveControllerCount: Should always be 1
  • kafka.server:type=ReplicaManager,name=LeaderCount: Distribution across brokers
  • kafka.server:type=BrokerTopicMetrics,name=MessagesInPerSec: Throughput monitoring
  • kafka.controller:type=ControllerStats,name=LeaderElectionRateAndTimeMs: Controller stability
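These beans are only reachable when JMX is enabled on the brokers. One way to expose it, assuming the systemd unit from Step 5 (the port choice is arbitrary):

# Add to the [Service] section of /etc/systemd/system/kafka.service:
#   Environment=JMX_PORT=9999
# Then reload and restart, and confirm the port is listening:
sudo systemctl daemon-reload
sudo systemctl restart kafka
ss -tlnp | grep 9999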

Alternative Approaches and When to Use Them

While KRaft is now the standard for new deployments (ZooKeeper mode is deprecated and removed entirely in Kafka 4.0), consider these alternatives for specific scenarios:

Approach                  Best For                                        Limitations
Single-node KRaft         Development, testing, small applications        No fault tolerance
ZooKeeper-based cluster   Legacy systems, proven stability requirements   Higher complexity, more resources
Managed Kafka (cloud)     Rapid deployment, minimal ops overhead          Vendor lock-in, higher costs
Confluent Platform        Enterprise features, commercial support         Licensing costs

For detailed configuration options and advanced features, refer to the official Apache Kafka KRaft documentation. The KIP-500 proposal provides comprehensive technical background on the KRaft implementation.

Setting up a multi-node Kafka cluster with KRaft significantly simplifies operations while improving performance and reducing resource requirements. The elimination of ZooKeeper dependencies makes Kafka clusters more resilient and easier to manage, especially in containerized environments. As KRaft continues to mature, it’s becoming the standard approach for new Kafka deployments across organizations of all sizes.


