Apache Kafka Architecture and Use Cases

Apache Kafka is a distributed streaming platform that has fundamentally changed how we handle real-time data processing and event-driven architectures. If you’ve been wondering why everyone’s talking about Kafka or if you’re planning to implement it in your stack, understanding its architecture is crucial for making informed decisions. This post walks you through Kafka’s core components, explores real-world implementation scenarios, and covers the common gotchas you’ll encounter when setting up your first cluster.

How Kafka Works: Core Architecture Components

Kafka’s architecture centers around a few key concepts that work together to provide high-throughput, fault-tolerant messaging. Let’s break down the main players:

  • Brokers: These are your Kafka servers that store and serve data. A typical production setup runs at least three brokers for fault tolerance; five is common at larger scale.
  • Topics: Think of these as categories or feeds where your data lives. Each topic can be split into multiple partitions for scalability.
  • Partitions: The actual storage units where messages are written. More partitions = better parallelism, but also more complexity.
  • Producers: Applications that send data to Kafka topics.
  • Consumers: Applications that read data from topics, usually organized into consumer groups for load balancing.
  • ZooKeeper: Manages cluster metadata and coordination (Kafka 2.8+ introduced KRaft mode, which removes this dependency, and it became production-ready in 3.3).

The magic happens through Kafka’s distributed log structure. When a producer sends a message, it gets appended to a partition log with a unique offset. Consumers can read from any offset, enabling replay capabilities that traditional message queues can’t match.
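
To make replay concrete, here's a minimal sketch of a consumer that assigns itself a partition and rewinds to offset 0 to re-read the log. The topic name test-topic and the single-partition read are assumptions for illustration:

import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);

        // Assign the partition directly (no consumer group) so we control the offset
        TopicPartition partition = new TopicPartition("test-topic", 0);
        consumer.assign(Collections.singletonList(partition));
        consumer.seek(partition, 0); // rewind to the start of the partition log

        consumer.poll(Duration.ofSeconds(1)).forEach(record ->
            System.out.printf("offset=%d value=%s%n", record.offset(), record.value()));

        consumer.close();
    }
}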

Step-by-Step Kafka Setup Guide

Let’s get a basic Kafka cluster running. I’ll walk you through both single-node and multi-node setups.

Single-Node Setup (Development)

First, grab Kafka from the official site and extract it:

wget https://downloads.apache.org/kafka/2.8.2/kafka_2.13-2.8.2.tgz
tar -xzf kafka_2.13-2.8.2.tgz
cd kafka_2.13-2.8.2

Start ZooKeeper (this release still coordinates the cluster through ZooKeeper by default):

bin/zookeeper-server-start.sh config/zookeeper.properties

In another terminal, start the Kafka broker:

bin/kafka-server-start.sh config/server.properties

Create a test topic:

bin/kafka-topics.sh --create --topic test-topic --bootstrap-server localhost:9092 --partitions 3 --replication-factor 1
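
To sanity-check the setup, you can exchange a few messages with the bundled console clients (type into the producer terminal and watch the messages appear in the consumer):

bin/kafka-console-producer.sh --topic test-topic --bootstrap-server localhost:9092
bin/kafka-console-consumer.sh --topic test-topic --from-beginning --bootstrap-server localhost:9092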

Production Multi-Node Configuration

For production, you’ll want to customize your server.properties file for each broker. Here’s a sample configuration for broker 1:

broker.id=1
listeners=PLAINTEXT://broker1.example.com:9092
log.dirs=/var/kafka-logs
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
socket.request.max.bytes=104857600
num.partitions=3
num.recovery.threads.per.data.dir=1
offsets.topic.replication.factor=3
transaction.state.log.replication.factor=3
transaction.state.log.min.isr=2
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000
zookeeper.connect=zk1:2181,zk2:2181,zk3:2181
zookeeper.connection.timeout.ms=18000
group.initial.rebalance.delay.ms=3000

The key differences for other brokers are the broker.id and listeners values. Each broker needs a unique ID and should listen on its own address.
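
For instance, broker 2's overrides would look like this (the hostname is illustrative, following the naming pattern above):

broker.id=2
listeners=PLAINTEXT://broker2.example.com:9092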

Real-World Use Cases and Examples

Kafka shines in several scenarios where traditional messaging systems fall short:

Event Sourcing and CQRS

Many fintech companies use Kafka as their event store. Instead of storing current state, they store all events that led to that state. Here’s a simple producer example for an e-commerce order system:

import org.apache.kafka.clients.producer.*;
import java.util.Properties;

public class OrderEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("acks", "all");
        props.put("retries", 3);
        props.put("enable.idempotence", true);
        
        Producer<String, String> producer = new KafkaProducer<>(props);
        
        // Send order created event
        ProducerRecord<String, String> record = new ProducerRecord<>(
            "order-events", 
            "order-123", 
            "{\"event\":\"OrderCreated\",\"orderId\":\"123\",\"customerId\":\"456\",\"amount\":99.99}"
        );
        
        producer.send(record, (metadata, exception) -> {
            if (exception == null) {
                System.out.printf("Sent message to partition %d with offset %d%n", 
                    metadata.partition(), metadata.offset());
            } else {
                exception.printStackTrace();
            }
        });
        
        producer.close();
    }
}

Log Aggregation

Netflix famously uses Kafka to collect logs from thousands of services. A typical setup involves Filebeat or Fluentd shipping logs to Kafka, then Logstash or custom consumers processing them for Elasticsearch.

Real-Time Analytics Pipeline

Companies like LinkedIn (Kafka’s creator) use it to power real-time dashboards. Data flows from operational databases through Kafka to stream processing frameworks like Apache Storm or Kafka Streams.
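
As a taste of what that looks like in code, here's a minimal Kafka Streams sketch that maintains a running count of events per order key from the order-events topic used earlier. The output topic name order-event-counts is made up for this example:

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import java.util.Properties;

public class OrderEventCounter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-event-counter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> orders = builder.stream("order-events");

        // Count events per order key and emit running totals downstream
        orders.groupByKey()
              .count()
              .toStream()
              .mapValues(Object::toString)
              .to("order-event-counts");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}

Because Kafka Streams runs as a plain Java application, scaling out is just a matter of starting more instances with the same application.id.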

Kafka vs. Alternatives: When to Choose What

Feature                 | Apache Kafka             | RabbitMQ                         | Apache Pulsar          | Amazon SQS
------------------------|--------------------------|----------------------------------|------------------------|----------------------
Throughput              | Very High (millions/sec) | Moderate (tens of thousands/sec) | Very High              | High (managed)
Message Ordering        | Per-partition only       | Per-queue                        | Per-partition          | FIFO queues available
Message Persistence     | Configurable retention   | Until consumed                   | Configurable retention | Up to 14 days
Operational Complexity  | High                     | Medium                           | High                   | Low (managed)
Multi-tenancy           | Limited                  | Good                             | Excellent              | Account-level

Choose Kafka when you need high throughput, message replay capabilities, or you’re building event-driven architectures. Go with RabbitMQ for traditional request-response patterns or when you need complex routing. Pulsar is worth considering if you need better multi-tenancy, and SQS is perfect when you want to avoid operational overhead.

Best Practices and Common Pitfalls

Partitioning Strategy

This is where most people mess up. Your partition key determines message distribution and ordering guarantees. Here's a consumer example that uses manual offset commits and logs which partition each record arrived on:

import org.apache.kafka.clients.consumer.*;
import java.time.Duration;
import java.util.*;

public class OrderEventConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processing-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("enable.auto.commit", "false");
        props.put("max.poll.records", "100");
        
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(Arrays.asList("order-events"));
        
        try {
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(100));
                for (ConsumerRecord<String, String> record : records) {
                    processOrder(record.key(), record.value());
                    System.out.printf("Processed order %s from partition %d%n", 
                        record.key(), record.partition());
                }
                consumer.commitSync(); // Manual commit for reliability
            }
        } finally {
            consumer.close();
        }
    }
    
    private static void processOrder(String orderId, String orderData) {
        // Your business logic here
    }
}
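
Because offsets are committed only after records are processed, this loop gives you at-least-once delivery: if the consumer crashes between processing and committing, those records will be redelivered, so processOrder should be idempotent.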

Performance Tuning Tips

  • Batch Size: Increase batch.size and linger.ms for better throughput at the cost of latency
  • Compression: Enable compression (snappy or lz4) to reduce network and disk I/O
  • Memory Management: Allocate enough heap memory but leave room for OS page cache
  • Disk Configuration: Use multiple disks and configure log.dirs to spread I/O load
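
As a concrete starting point, a throughput-oriented producer might use settings like these in its configuration (the values are illustrative starting points, not recommended defaults):

batch.size=65536
linger.ms=10
compression.type=lz4
buffer.memory=67108864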

Common Gotchas

Consumer Lag Monitoring: Always monitor consumer lag. A lagging consumer can indicate processing bottlenecks or undersized consumer groups.
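
The bundled consumer-groups tool shows lag per partition; for example, using the group id from the consumer above:

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group order-processing-group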

Replication Factor: Never use a replication factor of 1 in production; you'll lose data when a broker fails. Use 3 for most cases.

Topic Deletion: delete.topic.enable=true (the default in recent Kafka versions) allows topics to be deleted, but be careful – deletion is permanent.
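
With that setting enabled, deletion is a one-liner:

bin/kafka-topics.sh --delete --topic test-topic --bootstrap-server localhost:9092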

ZooKeeper Maintenance: Keep your ZooKeeper ensemble odd-numbered (3 or 5 nodes) and monitor its health closely.

Integration Ecosystem and Tools

Kafka’s strength lies in its ecosystem. Here are tools that extend its capabilities:

  • Kafka Connect: Pre-built connectors for databases, S3, Elasticsearch, and more
  • Kafka Streams: Stream processing library for building real-time applications
  • Schema Registry: Manages Avro, JSON, and Protobuf schemas for data consistency
  • KSQL (now ksqlDB): SQL interface for stream processing
  • Kafka Manager/Kafdrop: Web UIs for cluster management and monitoring

For monitoring, integrate with Prometheus and Grafana using JMX metrics. Key metrics to watch include:

  • Messages per second (throughput)
  • Consumer lag by topic and partition
  • Broker CPU and disk utilization
  • Network request rates
  • Under-replicated partitions (should always be 0)
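
For that last metric, the topics tool can list any under-replicated partitions directly:

bin/kafka-topics.sh --describe --under-replicated-partitions --bootstrap-server localhost:9092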

The learning curve is steep, but once you understand Kafka’s model, you’ll find it’s incredibly powerful for building scalable, real-time data architectures. Start small with a single-node setup, get comfortable with the concepts, then gradually move to production-grade configurations as your needs grow.

For deeper technical details, check out the official Apache Kafka documentation and the Kafka Wiki for community-contributed guides and best practices.


