Event-Driven Architecture with Kafka: Patterns for Production Systems
Synchronous REST calls between microservices create tight coupling — when the payment service is slow, the order service is slow too. Event-driven architecture with Kafka decouples services by communicating through events instead of direct calls. This guide covers the patterns that make event-driven systems reliable in production: event sourcing, CQRS, sagas, dead letter queues, and consumer group strategies.
Why Event-Driven? The Coupling Problem
In a synchronous architecture, placing an order might involve five REST calls: validate inventory, charge payment, update loyalty points, send confirmation email, and notify warehouse. If any service is down, the order fails. Moreover, adding a new step (like fraud detection) requires modifying the order service to call yet another endpoint.
With event-driven architecture, the order service publishes an “OrderPlaced” event to Kafka. Every interested service subscribes to this topic and reacts independently. The payment service charges the card, the inventory service reserves stock, the email service sends confirmation — all without the order service knowing they exist. Adding fraud detection means adding a new consumer; the order service never changes.
This decoupling has real operational benefits. Services can be deployed independently, scale independently, and fail independently. When the email service goes down for maintenance, orders keep flowing. The email service catches up when it comes back online because Kafka retains events for days or weeks.
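The fan-out described above can be illustrated with a toy in-memory event bus (illustration only — Kafka provides this behavior through topics and consumer groups; the class and topic names here are hypothetical):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Toy in-memory event bus: the publisher of "OrderPlaced" never knows
// which subscribers exist. Adding fraud detection is just one more
// subscribe() call; the publisher never changes.
public class EventBus {
    private final Map<String, List<Consumer<String>>> subscribers = new HashMap<>();

    public void subscribe(String topic, Consumer<String> handler) {
        subscribers.computeIfAbsent(topic, t -> new ArrayList<>()).add(handler);
    }

    public void publish(String topic, String event) {
        // Every registered handler reacts independently
        subscribers.getOrDefault(topic, List.of()).forEach(h -> h.accept(event));
    }
}
```

The real system swaps `publish` for a `KafkaTemplate.send` and each handler for a consumer group, but the coupling property is the same: publisher and subscribers only share the topic name and event schema.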
Event Sourcing: Store Events, Not State
Traditional databases store the current state — an order has status “shipped.” Event sourcing stores every state change as an immutable event: OrderPlaced, PaymentReceived, ItemsPacked, OrderShipped. The current state is derived by replaying all events for that entity.
// Event Sourcing with Kafka and Spring Boot
@Entity
@Table(name = "event_store")
public class StoredEvent {
    @Id
    private UUID eventId;
    private String aggregateId;
    private String eventType;
    private int version;
    private Instant timestamp;
    @Column(columnDefinition = "jsonb")
    private String payload;
}
// Order aggregate rebuilt from events
public class OrderAggregate {
    private String orderId;
    private OrderStatus status;
    private BigDecimal total;
    private List<OrderItem> items;

    public static OrderAggregate rebuild(List<StoredEvent> events) {
        OrderAggregate order = new OrderAggregate();
        for (StoredEvent event : events) {
            order.apply(event);
        }
        return order;
    }

    private void apply(StoredEvent event) {
        switch (event.getEventType()) {
            case "OrderPlaced":
                OrderPlacedPayload placed = deserialize(event);
                this.orderId = placed.getOrderId();
                this.items = placed.getItems();
                this.total = placed.getTotal();
                this.status = OrderStatus.PLACED;
                break;
            case "PaymentReceived":
                this.status = OrderStatus.PAID;
                break;
            case "OrderShipped":
                this.status = OrderStatus.SHIPPED;
                break;
            case "OrderCancelled":
                this.status = OrderStatus.CANCELLED;
                break;
        }
    }
}

Event sourcing gives you a complete audit trail, the ability to replay events to rebuild state, and temporal queries (“what was this order’s status last Tuesday at 3 PM?”). Additionally, you can create new read models by replaying events through a new projection — without changing any existing code.
The trade-off is complexity. Querying the current state requires replaying events (or maintaining snapshots). Schema evolution for events requires careful versioning. And the event store grows indefinitely, requiring archiving strategies for old events.
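The snapshot idea can be sketched as a pure function: start from the most recent snapshot and replay only the events with a higher version. The `Snapshot` and `Event` records here are illustrative stand-ins, not types from a specific framework:

```java
import java.util.List;

// Snapshot-assisted rebuild: instead of replaying the full event history,
// start from a periodically persisted snapshot and apply only newer events.
public class SnapshotRebuild {
    record Snapshot(int version, String state) {}
    record Event(int version, String newState) {}

    static String rebuild(Snapshot snapshot, List<Event> events) {
        String state = snapshot.state();
        int applied = snapshot.version();
        for (Event e : events) {
            // Skip events already folded into the snapshot
            if (e.version() > applied) {
                state = e.newState();
                applied = e.version();
            }
        }
        return state;
    }
}
```

In practice the snapshot is written every N events, so a rebuild touches at most N events regardless of how old the aggregate is.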
CQRS: Separate Reads from Writes
Command Query Responsibility Segregation splits your data model into a write model (optimized for consistency) and one or more read models (optimized for queries). Events bridge the gap — every write produces an event that updates the read models asynchronously.
// Write side: handles commands, produces events
@Service
public class OrderCommandHandler {
    @KafkaListener(topics = "order-commands")
    public void handle(OrderCommand command) {
        switch (command.getType()) {
            case "PlaceOrder":
                // Validate business rules
                Order order = Order.create(command.getItems());
                // Persist to event store
                eventStore.save(order.getUncommittedEvents());
                // Publish events to Kafka
                order.getUncommittedEvents().forEach(event ->
                    kafkaTemplate.send("order-events", order.getId(), event)
                );
                break;
        }
    }
}

// Read side: consumes events, builds query-optimized views
@Service
public class OrderReadModelUpdater {
    @KafkaListener(topics = "order-events", groupId = "order-read-model")
    public void updateReadModel(OrderEvent event) {
        switch (event.getType()) {
            case "OrderPlaced":
                // Insert into denormalized read table
                jdbcTemplate.update(
                    "INSERT INTO order_summary (id, customer, total, status, placed_at) VALUES (?, ?, ?, ?, ?)",
                    event.getOrderId(), event.getCustomerName(),
                    event.getTotal(), "PLACED", event.getTimestamp()
                );
                break;
            case "OrderShipped":
                jdbcTemplate.update(
                    "UPDATE order_summary SET status = 'SHIPPED', shipped_at = ? WHERE id = ?",
                    event.getTimestamp(), event.getOrderId()
                );
                break;
        }
    }
}

CQRS lets you optimize each side independently. The write side uses a normalized schema for consistency. The read side uses denormalized tables, materialized views, or even Elasticsearch for fast queries. Consequently, you can handle complex queries without impacting write performance.
The Saga Pattern for Distributed Transactions
When an order involves payment, inventory, and shipping, you need a way to coordinate across services without distributed transactions — two-phase commit across service boundaries does not scale, and Kafka does not participate in XA transactions. The saga pattern breaks a distributed transaction into a sequence of local transactions, each publishing an event that triggers the next step. If any step fails, compensating transactions undo the previous steps.
SAGA: Place Order
Step 1: Reserve Inventory → on failure: (nothing to compensate)
Step 2: Charge Payment → on failure: Release Inventory
Step 3: Confirm Order → on failure: Refund Payment, Release Inventory
Step 4: Notify Warehouse → on failure: Cancel Order, Refund Payment, Release Inventory
Each step is a Kafka consumer that:
1. Performs its local transaction
2. Publishes a success/failure event
3. Lets the saga orchestrator (or the next choreography step) react accordingly

There are two approaches: choreography (each service knows what to do next) and orchestration (a central coordinator manages the workflow). Choreography is simpler for 3-4 steps but becomes spaghetti with more. Orchestration adds a single point of coordination but is easier to understand and debug. For most production systems, orchestration with a saga coordinator service is the practical choice.
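The saga steps above reduce to a small state machine: on success, advance to the next step; on failure, emit compensating commands for every completed step in reverse order. A minimal sketch, with illustrative step and command names:

```java
import java.util.List;

// Orchestration sketch for the "Place Order" saga: the coordinator holds
// this state machine and issues commands to the participating services.
public class OrderSaga {
    enum Step { RESERVE_INVENTORY, CHARGE_PAYMENT, CONFIRM_ORDER, NOTIFY_WAREHOUSE }

    // On success, return the next step to run (null when the saga is done)
    static Step onSuccess(Step current) {
        switch (current) {
            case RESERVE_INVENTORY: return Step.CHARGE_PAYMENT;
            case CHARGE_PAYMENT:    return Step.CONFIRM_ORDER;
            case CONFIRM_ORDER:     return Step.NOTIFY_WAREHOUSE;
            default:                return null; // saga complete
        }
    }

    // On failure, return compensating commands for all completed steps,
    // newest first
    static List<String> onFailure(Step failed) {
        switch (failed) {
            case CHARGE_PAYMENT:   return List.of("ReleaseInventory");
            case CONFIRM_ORDER:    return List.of("RefundPayment", "ReleaseInventory");
            case NOTIFY_WAREHOUSE: return List.of("CancelOrder", "RefundPayment", "ReleaseInventory");
            default:               return List.of(); // nothing to compensate
        }
    }
}
```

A real coordinator would persist the saga's current step (so it survives restarts) and publish these commands to per-service Kafka topics, but the decision logic is exactly this table.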
Dead Letter Queues and Consumer Groups
Events that cannot be processed — malformed data, business rule violations, downstream failures — need a place to go besides crashing your consumer in an infinite retry loop. A dead letter queue (DLQ) captures failed events for investigation and reprocessing.
// Kafka consumer with DLQ and retry
@Configuration
public class KafkaConsumerConfig {
    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<Object, Object> template) {
        // Exponential backoff: 1s initial delay, doubling each retry,
        // giving up after 30s total
        ExponentialBackOff backOff = new ExponentialBackOff(1000L, 2.0);
        backOff.setMaxElapsedTime(30000L);
        // After retries are exhausted, send to the DLQ topic
        DeadLetterPublishingRecoverer recoverer =
            new DeadLetterPublishingRecoverer(template,
                (record, ex) -> new TopicPartition(
                    record.topic() + ".DLQ", record.partition()
                ));
        DefaultErrorHandler handler = new DefaultErrorHandler(recoverer, backOff);
        // Don't retry on deserialization errors — they'll never succeed
        handler.addNotRetryableExceptions(DeserializationException.class);
        return handler;
    }
}
// Consumer groups: scale consumers horizontally
// 12 partitions = up to 12 consumers in the same group
// Each consumer processes a subset of partitions
@KafkaListener(
    topics = "order-events",
    groupId = "payment-service",
    concurrency = "4" // 4 consumer threads
)
public void onOrderEvent(OrderEvent event) {
    // handle events from this consumer's assigned partitions
}

Consumer groups let you scale horizontally. If your topic has 12 partitions, you can run up to 12 consumer instances in the same group, each processing a portion of the events. Adding more consumers than partitions means some sit idle. Choose your partition count based on your expected throughput — more partitions mean more parallelism but also more overhead.
Operational Best Practices
Event-driven systems require different operational thinking. Monitor consumer lag — the gap between the latest event and what your consumer has processed. High lag means your consumer cannot keep up. Set alerts at 1,000 events lag for latency-sensitive consumers and 100,000 for batch processors.
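The lag metric itself is simple arithmetic: per partition, lag is the log end offset minus the group's committed offset. In production you would fetch both offset maps from Kafka's `AdminClient`; the sketch below isolates just the computation (partition numbers and offsets are made up):

```java
import java.util.Map;

// Consumer lag = (log end offset - committed offset), summed over partitions.
public class LagMonitor {
    // Partitions with no committed offset count their full end offset as lag
    static long totalLag(Map<Integer, Long> endOffsets, Map<Integer, Long> committed) {
        long lag = 0;
        for (Map.Entry<Integer, Long> e : endOffsets.entrySet()) {
            long consumed = committed.getOrDefault(e.getKey(), 0L);
            lag += Math.max(0, e.getValue() - consumed);
        }
        return lag;
    }
}
```

An alerting rule then compares `totalLag` against the thresholds above (1,000 for latency-sensitive consumers, 100,000 for batch processors).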
Schema evolution is critical. Use a schema registry (Confluent Schema Registry or Apicurio) with Avro or Protobuf schemas. Enforce backward compatibility so old consumers can read new events. Additionally, never delete or rename fields — add new ones and deprecate old ones.
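The backward-compatibility rule can be seen in miniature: a v2 consumer reads both old and new events by giving the added field a default instead of requiring it. With Avro or Protobuf the schema registry and field defaults handle this automatically; the field names below are illustrative:

```java
import java.util.Map;

// Backward-compatible read: "currency" was added in v2 of OrderPlaced.
// Old events simply lack the key, so the consumer falls back to a default
// rather than failing.
public class OrderEventReader {
    record OrderPlaced(String orderId, String currency) {}

    static OrderPlaced read(Map<String, String> fields) {
        return new OrderPlaced(
            fields.get("orderId"),
            fields.getOrDefault("currency", "USD"));
    }
}
```

Deleting or renaming a field breaks this property in the other direction: events written by old producers would no longer satisfy the new schema, which is why the rule is add-and-deprecate, never remove.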
Related Reading:
- Event Sourcing and CQRS Deep Dive
- Saga Pattern for Distributed Transactions
- Microservices Architecture Patterns
In conclusion, event-driven architecture with Kafka transforms tightly coupled services into independently scalable components. Start with simple event publishing, add CQRS when read performance matters, implement sagas for distributed transactions, and always plan for failure with dead letter queues. The patterns are well-established — the key is adopting them incrementally rather than rewriting everything at once.