System Design: Building a Real-Time Chat Application
Few problems show up in system design discussions as often as a real-time chat application, and for good reason: it touches connection state, fan-out, ordering, persistence, and presence all at once. The gap between a whiteboard sketch and a system that survives production load is enormous. This post walks through the architecture the way teams actually build it, with the trade-offs spelled out rather than glossed over.
Problem Statement
We are building a chat application that supports:
Functional Requirements:
One-on-one and group messaging
Real-time message delivery (sub-second latency)
Message persistence and history
Online/offline presence indicators
Read receipts
Push notifications for offline users
Non-Functional Requirements:
10 million daily active users
99.99% availability
Messages delivered in order within a conversation
End-to-end latency under 200ms for online users
Messages stored for 5 years
High-Level Architecture
Here is how the major components connect:
Clients (Web/Mobile)
    |
    | WebSocket (STOMP)
    v
[Load Balancer (Layer 7)]
    |
    v
[Chat Server Cluster] <---> [Redis Pub/Sub] <---> [Chat Server Cluster]
    |                                                   |
    v                                                   v
[Message Queue (Kafka)]                       [Presence Service (Redis)]
    |
    v
[Message Persistence (PostgreSQL)]
    |
    v
[Push Notification Service]
The key insight is that chat servers are stateful because they hold WebSocket connections. Redis Pub/Sub bridges the gap when two users connected to different servers need to talk. Everything else in the diagram exists to keep that stateful core fast and recoverable.
Technology Choices
| Component | Technology | Why |
|---|---|---|
| Real-time transport | WebSocket + STOMP | Full-duplex, Spring has first-class support |
| Message broker | Redis Pub/Sub | Low latency, simple, handles cross-server routing |
| Persistent queue | Apache Kafka | Durability, replay, decouples write path |
| Database | PostgreSQL | JSONB for flexible message metadata, strong consistency |
| Presence | Redis | TTL-based keys, sub-millisecond reads |
| Push notifications | Firebase Cloud Messaging | Industry standard, handles both iOS and Android |
WebSocket Implementation with Spring
Spring’s STOMP over WebSocket support is production-tested and saves you from writing low-level frame handling. STOMP gives you a lightweight pub/sub semantic on top of the raw socket, so you can address destinations like /topic/room.42 or /user/queue/messages instead of routing every frame by hand.
WebSocket Configuration:
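import org.springframework.context.annotation.Configuration;
import org.springframework.messaging.simp.config.MessageBrokerRegistry;
import org.springframework.web.socket.config.annotation.EnableWebSocketMessageBroker;
import org.springframework.web.socket.config.annotation.StompEndpointRegistry;
import org.springframework.web.socket.config.annotation.WebSocketMessageBrokerConfigurer;

@Configuration
@EnableWebSocketMessageBroker
public class WebSocketConfig implements WebSocketMessageBrokerConfigurer {

    @Override
    public void configureMessageBroker(MessageBrokerRegistry config) {
        // In-memory broker for the sessions held by this node. Redis does not
        // speak STOMP, so it cannot be the target of enableStompBrokerRelay
        // (that requires a STOMP broker such as RabbitMQ or ActiveMQ).
        // Cross-node delivery is handled by the Redis Pub/Sub relay shown later.
        config.enableSimpleBroker("/topic", "/queue");
        config.setApplicationDestinationPrefixes("/app");
        config.setUserDestinationPrefix("/user");
    }

    @Override
    public void registerStompEndpoints(StompEndpointRegistry registry) {
        registry.addEndpoint("/ws/chat")
                .setAllowedOrigins("https://yourchatapp.com")
                .withSockJS(); // Fallback for older browsers
    }
}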
Message Handler:
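import java.time.Instant;
import java.util.UUID;
import org.springframework.messaging.handler.annotation.MessageMapping;
import org.springframework.messaging.handler.annotation.Payload;
import org.springframework.messaging.simp.SimpMessageHeaderAccessor;
import org.springframework.messaging.simp.SimpMessagingTemplate;
import org.springframework.stereotype.Controller;

@Controller
public class ChatController {

    private final MessagePersistenceService persistenceService;
    private final PresenceService presenceService;
    private final PushNotificationService pushService;
    private final SimpMessagingTemplate messagingTemplate;

    // Constructor injection keeps every dependency final and visible in one place
    public ChatController(MessagePersistenceService persistenceService,
                          PresenceService presenceService,
                          PushNotificationService pushService,
                          SimpMessagingTemplate messagingTemplate) {
        this.persistenceService = persistenceService;
        this.presenceService = presenceService;
        this.pushService = pushService;
        this.messagingTemplate = messagingTemplate;
    }

    @MessageMapping("/chat.send")
    public void sendMessage(@Payload ChatMessage message,
                            SimpMessageHeaderAccessor headerAccessor) {
        // Sender and timestamp come from the server, never from the client payload
        String senderId = headerAccessor.getUser().getName();
        message.setSenderId(senderId);
        message.setTimestamp(Instant.now());
        // Random UUIDs are unique but not time-sortable; if clients resync by ID,
        // use the Snowflake-style generator discussed under Message Ordering instead
        message.setMessageId(UUID.randomUUID().toString());

        // Persist asynchronously via Kafka
        persistenceService.persistAsync(message);

        // Route to recipient
        String recipientId = message.getRecipientId();
        if (presenceService.isOnline(recipientId)) {
            messagingTemplate.convertAndSendToUser(
                recipientId,
                "/queue/messages",
                message
            );
        } else {
            // User offline -- queue push notification
            pushService.sendPushNotification(recipientId, message);
        }
    }

    // Typing events are ephemeral: relayed to the recipient, never persisted
    @MessageMapping("/chat.typing")
    public void typingIndicator(@Payload TypingEvent event) {
        messagingTemplate.convertAndSendToUser(
            event.getRecipientId(),
            "/queue/typing",
            event
        );
    }
}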
Message Persistence and Delivery Guarantees
Messages flow through Kafka before hitting PostgreSQL. This decouples the hot path (WebSocket delivery) from the cold path (database write). Even if the database is momentarily slow, the user still sees the message instantly because delivery never waits on the write.
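To make the hot path / cold path split concrete, here is a minimal single-threaded sketch in plain Java. It is not the Kafka client API: the `WritePathSketch` class, its in-memory queue standing in for the Kafka topic, and the lists standing in for the socket and the database are all assumptions made for illustration.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Single-threaded sketch of the hot path / cold path split (all names hypothetical).
final class WritePathSketch {
    private final Queue<String> durableLog = new ArrayDeque<>();      // stand-in for the Kafka topic
    private final List<String> deliveredToSocket = new ArrayList<>(); // stand-in for WebSocket delivery
    private final List<String> database = new ArrayList<>();          // stand-in for PostgreSQL

    // Hot path: append to the log and deliver; no database call happens here.
    void send(String message) {
        durableLog.add(message);
        deliveredToSocket.add(message);
    }

    // Cold path: a consumer drains the log into the database at its own pace.
    int drainToDatabase() {
        int written = 0;
        String next;
        while ((next = durableLog.poll()) != null) {
            database.add(next);
            written++;
        }
        return written;
    }

    int deliveredCount() { return deliveredToSocket.size(); }

    int persistedCount() { return database.size(); }
}
```

The sketch also exposes the cost of the design: between `send` and `drainToDatabase` a message has been delivered but is not yet queryable from history, so history reads can briefly lag live delivery.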
For delivery guarantees, the system tracks three states:
SENT — server received the message
DELIVERED — recipient’s client acknowledged receipt
READ — recipient opened the conversation
@Entity
@Table(name = "messages")
public class Message {
@Id
private String messageId;
private String conversationId;
private String senderId;
private String content;
private Instant timestamp;
@Enumerated(EnumType.STRING)
private DeliveryStatus status; // SENT, DELIVERED, READ
private Instant deliveredAt;
private Instant readAt;
}
The client sends an acknowledgment back over the WebSocket when it receives a message, which flips the status from SENT to DELIVERED. It is simple and reliable, and it avoids polling the server for delivery confirmation.
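Acknowledgments can arrive late, twice, or out of order, so the status should only ever move forward. The following is a small sketch of that rule, assuming the `DeliveryStatus` enum is declared in the order SENT, DELIVERED, READ; the `advanceTo` method and the in-memory `DeliveryTracker` are hypothetical helpers, not part of the entity above.

```java
import java.util.HashMap;
import java.util.Map;

enum DeliveryStatus {
    SENT, DELIVERED, READ;

    // Returns the later of the two states, so a stale or duplicate
    // acknowledgment can never move a message backwards.
    DeliveryStatus advanceTo(DeliveryStatus next) {
        return next.ordinal() > this.ordinal() ? next : this;
    }
}

// In-memory stand-in for the status column on the messages table.
final class DeliveryTracker {
    private final Map<String, DeliveryStatus> statusById = new HashMap<>();

    // Called when the server first accepts a message.
    void onServerReceived(String messageId) {
        statusById.putIfAbsent(messageId, DeliveryStatus.SENT);
    }

    // Applies a client acknowledgment and returns the resulting status,
    // or null if the message ID is unknown.
    DeliveryStatus onAck(String messageId, DeliveryStatus acked) {
        return statusById.computeIfPresent(
                messageId, (id, current) -> current.advanceTo(acked));
    }
}
```

With this rule, a DELIVERED acknowledgment that arrives after READ is ignored rather than downgrading the message.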
Presence System with Redis
Presence is one of those features that looks trivial but gets tricky at scale. A clean approach uses Redis TTL keys that expire if the client stops sending heartbeats:
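import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import org.springframework.data.redis.core.StringRedisTemplate;
import org.springframework.stereotype.Service;

@Service
public class PresenceService {

    private static final Duration PRESENCE_TTL = Duration.ofSeconds(30);

    private final StringRedisTemplate redisTemplate;

    public PresenceService(StringRedisTemplate redisTemplate) {
        this.redisTemplate = redisTemplate;
    }

    // Called on connect: records which server holds the socket, then starts the TTL
    public void markOnline(String userId, String serverId) {
        String key = "presence:" + userId;
        Map<String, String> value = Map.of(
            "serverId", serverId,
            "lastSeen", Instant.now().toString()
        );
        redisTemplate.opsForHash().putAll(key, value);
        redisTemplate.expire(key, PRESENCE_TTL);
    }

    public boolean isOnline(String userId) {
        return Boolean.TRUE.equals(
            redisTemplate.hasKey("presence:" + userId)
        );
    }

    // Refreshes the TTL only. If the key has already expired this is a no-op,
    // so a client that missed its window must go through markOnline again.
    public void heartbeat(String userId) {
        redisTemplate.expire("presence:" + userId, PRESENCE_TTL);
    }
}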
Clients send a heartbeat every 15 seconds. If the key expires (no heartbeat for 30 seconds), the user is considered offline. No complex state machine is needed, and crashed clients clean themselves up automatically once the TTL lapses.
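The timing rules are easier to check with the clock made explicit. Below is a minimal in-memory sketch of the same TTL logic, with the current time passed in as a parameter; `InMemoryPresence` is a hypothetical stand-in for Redis, not a replacement for it.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

// In-memory model of TTL-based presence; Redis does this expiry for you.
final class InMemoryPresence {
    private static final Duration TTL = Duration.ofSeconds(30);
    private final Map<String, Instant> expiresAt = new HashMap<>();

    // Creates (or overwrites) the presence entry and starts a fresh TTL.
    void markOnline(String userId, Instant now) {
        expiresAt.put(userId, now.plus(TTL));
    }

    // Mirrors EXPIRE on an existing key: refreshes the TTL only if the
    // entry is still alive, and reports whether it did.
    boolean heartbeat(String userId, Instant now) {
        if (!isOnline(userId, now)) {
            return false;
        }
        expiresAt.put(userId, now.plus(TTL));
        return true;
    }

    // A user is online while the current time is strictly before the expiry.
    boolean isOnline(String userId, Instant now) {
        Instant expiry = expiresAt.get(userId);
        if (expiry == null) {
            return false;
        }
        if (!now.isBefore(expiry)) {
            expiresAt.remove(userId); // lazily drop the expired entry
            return false;
        }
        return true;
    }
}
```

One consequence worth noting: a heartbeat that arrives after the entry has expired does not bring the user back online. The client has to re-register through `markOnline`, which is also what a reconnect would do.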
Scaling WebSockets Horizontally
This is where most designs fall apart. User A is connected to Server 1, User B is connected to Server 2 — so how does A’s message reach B?
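Redis Pub/Sub solves this. Each chat server subscribes to the channels of the users connected to it (chat:user:<userId> in the code below). When Server 1 receives a message for User B, it publishes to B's channel; Server 2, which holds B's socket, picks it up and delivers it over the local WebSocket. Pub/Sub is fire-and-forget, so a message published while no server is subscribed is dropped, which is why durability comes from the Kafka path rather than from this relay.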
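@Service
public class RedisMessageRelay {

    private final StringRedisTemplate redisTemplate;
    private final SimpMessagingTemplate messagingTemplate;
    private final ObjectMapper objectMapper; // Jackson

    public RedisMessageRelay(StringRedisTemplate redisTemplate,
                             SimpMessagingTemplate messagingTemplate,
                             ObjectMapper objectMapper) {
        this.redisTemplate = redisTemplate;
        this.messagingTemplate = messagingTemplate;
        this.objectMapper = objectMapper;
    }

    // Publish to the recipient's channel; the server holding the socket is subscribed to it
    public void relayMessage(ChatMessage message) throws JsonProcessingException {
        String channel = "chat:user:" + message.getRecipientId();
        redisTemplate.convertAndSend(channel,
            objectMapper.writeValueAsString(message));
    }

    // Must be registered with a RedisMessageListenerContainer for the relevant
    // chat:user:* channels; that wiring is omitted here.
    @Bean
    public MessageListenerAdapter messageListener() {
        return new MessageListenerAdapter((MessageListener) (message, pattern) -> {
            try {
                ChatMessage chatMessage = objectMapper.readValue(
                    message.getBody(), ChatMessage.class);
                messagingTemplate.convertAndSendToUser(
                    chatMessage.getRecipientId(),
                    "/queue/messages",
                    chatMessage
                );
            } catch (IOException e) {
                // onMessage cannot throw checked exceptions, so wrap and rethrow
                throw new UncheckedIOException("Malformed chat payload", e);
            }
        });
    }
}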
For group chats with many participants, fan-out happens at the Redis layer. Each server only delivers to users connected to it, which keeps cross-node chatter proportional to actual subscribers rather than total cluster size.
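One way to keep cross-node traffic proportional to the servers that actually hold subscribers is to group a conversation's members by the server holding their socket before publishing. The sketch below assumes the `serverId` recorded in the presence hash is available as a simple user-to-server map; `GroupFanOut` and `planDelivery` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupFanOut {
    // Groups online members by the server holding their socket.
    // Members with no presence entry are offline and are skipped here;
    // they would go down the push-notification path instead.
    static Map<String, List<String>> planDelivery(List<String> members,
                                                  Map<String, String> serverByUser,
                                                  String senderId) {
        Map<String, List<String>> byServer = new TreeMap<>();
        for (String member : members) {
            if (member.equals(senderId)) {
                continue; // the sender already has the message
            }
            String server = serverByUser.get(member);
            if (server == null) {
                continue; // offline
            }
            byServer.computeIfAbsent(server, s -> new ArrayList<>()).add(member);
        }
        return byServer;
    }
}
```

Each key in the result is one publish to one server, regardless of how many members that server holds, so a 500-member group spread over three servers costs three relay messages rather than 500.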
Connection Lifecycle and Failure Modes
The happy path is easy; the interesting engineering lives in what happens when connections drop. Mobile clients lose connectivity constantly — passing through tunnels, switching from Wi-Fi to cellular, or being backgrounded by the OS. A production design has to treat reconnection as the normal case, not the exception.
When a socket drops, the client must reconnect with exponential backoff and then request any messages it missed while disconnected. Provided message IDs are time-sortable, as with the Snowflake-style IDs described under Message Ordering and Consistency, the client can ask for everything after its last known ID. A typical reconnect routine looks like this:
function connect(lastSeenId) {
let attempt = 0;
const open = () => {
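// Protocol sketch over a raw socket; against the STOMP endpoint configured
// earlier, a STOMP client library would carry these payloads as frames
const ws = new WebSocket("wss://chat.example.com/ws/chat");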
ws.onopen = () => {
attempt = 0; // reset backoff
// Pull anything we missed while disconnected
ws.send(JSON.stringify({ type: "SYNC", afterMessageId: lastSeenId }));
};
ws.onmessage = (e) => {
const msg = JSON.parse(e.data);
lastSeenId = msg.messageId; // advance the cursor
ws.send(JSON.stringify({ type: "ACK", messageId: msg.messageId }));
render(msg);
};
ws.onclose = () => {
const delay = Math.min(1000 * 2 ** attempt++, 30000); // cap at 30s
setTimeout(open, delay + Math.random() * 1000); // add jitter
};
};
open();
}
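Notice the jitter added to the backoff. Without it, a server restart causes every disconnected client to reconnect at the same instant, a thundering herd that can knock the server over again. Server-side, you also need a disconnect hook (in Spring, a listener for SessionDisconnectEvent) that publishes a presence-offline event so that group rosters and read receipts stay accurate. Idempotency matters here too: because a flaky network can cause the same message to be retried, the server must deduplicate on messageId so a single logical message is never stored twice. That only works if a retry carries the same ID as the first attempt, so the ID (or a separate client-generated idempotency key) has to be fixed before the first send rather than assigned fresh on every arrival.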
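The two server-side obligations, deduplicating retries and answering a SYNC request, can be sketched together. This is a minimal in-memory model, assuming numeric, time-sortable message IDs; `ConversationLog` is a hypothetical class, and a real implementation would sit on the database's primary key rather than a map.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// In-memory model of one conversation's message log.
final class ConversationLog {
    // Sorted by message ID, which is assumed to be time-sortable.
    private final TreeMap<Long, String> messages = new TreeMap<>();

    // Returns true if the message was stored, false if this ID was
    // already present (a retried send), in which case nothing changes.
    boolean append(long messageId, String content) {
        return messages.putIfAbsent(messageId, content) == null;
    }

    // Answers a SYNC request: every message with an ID strictly greater
    // than the client's cursor, in ID order.
    List<String> after(long lastSeenId) {
        return new ArrayList<>(messages.tailMap(lastSeenId, false).values());
    }
}
```

Because `append` is idempotent, the client can retry a send as often as it likes, and because `after` is exclusive of the cursor, a client that resyncs twice from the same ID sees the same messages without duplicates of what it already rendered.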
Message Ordering and Consistency
Messages within a single conversation must be ordered. A Snowflake-like ID generator produces time-sortable, globally unique IDs, and the Kafka topic is keyed by conversationId, so every message in a conversation lands on the same partition and is consumed in the order it was produced. Cross-conversation ordering, by contrast, is not guaranteed, and it does not need to be, since users only ever perceive order within a single thread.
On the client side, messages are sorted by their server-assigned timestamp, never the client’s local time. Device clocks drift, get set manually, and disagree across time zones, so trusting them produces messages that appear out of order or even in the future.
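For reference, here is one possible shape for the Snowflake-like generator this section relies on. The bit layout (timestamp, then 10 bits of node ID, then 12 bits of sequence), the custom epoch, and the class name are assumptions chosen for illustration; the post does not prescribe a specific layout.

```java
// Sketch of a Snowflake-style ID generator (layout and epoch are assumptions).
final class SnowflakeIdGenerator {
    private static final long EPOCH_MS = 1_704_067_200_000L; // 2024-01-01T00:00:00Z
    private static final int NODE_BITS = 10;
    private static final int SEQ_BITS = 12;
    private static final long MAX_SEQ = (1L << SEQ_BITS) - 1;

    private final long nodeId;
    private long lastMs = -1;
    private long seq = 0;

    SnowflakeIdGenerator(long nodeId) {
        if (nodeId < 0 || nodeId >= (1L << NODE_BITS)) {
            throw new IllegalArgumentException("nodeId out of range");
        }
        this.nodeId = nodeId;
    }

    synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastMs) {
            now = lastMs; // clock stepped backwards: hold the last timestamp
        }
        if (now == lastMs) {
            seq = (seq + 1) & MAX_SEQ;
            if (seq == 0) {
                now = lastMs + 1; // sequence exhausted: borrow the next millisecond
            }
        } else {
            seq = 0;
        }
        lastMs = now;
        // High bits carry time, so numeric order follows creation order per node.
        return ((now - EPOCH_MS) << (NODE_BITS + SEQ_BITS)) | (nodeId << SEQ_BITS) | seq;
    }
}
```

IDs from a single generator are strictly increasing. Across nodes they are only as well ordered as the nodes' clocks agree, which is why per-conversation ordering still leans on the Kafka partition rather than on the IDs alone.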
Database Schema
CREATE TABLE users (
user_id VARCHAR(36) PRIMARY KEY,
username VARCHAR(50) UNIQUE NOT NULL,
display_name VARCHAR(100),
avatar_url TEXT,
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE conversations (
conversation_id VARCHAR(36) PRIMARY KEY,
type VARCHAR(10) NOT NULL, -- 'DIRECT' or 'GROUP'
name VARCHAR(100),
created_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE conversation_members (
conversation_id VARCHAR(36) REFERENCES conversations(conversation_id),
user_id VARCHAR(36) REFERENCES users(user_id),
joined_at TIMESTAMP DEFAULT NOW(),
PRIMARY KEY (conversation_id, user_id)
);
CREATE TABLE messages (
message_id VARCHAR(36) PRIMARY KEY,
conversation_id VARCHAR(36) REFERENCES conversations(conversation_id),
sender_id VARCHAR(36) REFERENCES users(user_id),
content TEXT NOT NULL,
status VARCHAR(10) DEFAULT 'SENT',
created_at TIMESTAMP DEFAULT NOW(),
delivered_at TIMESTAMP,
read_at TIMESTAMP
);
-- Critical indexes for query performance
CREATE INDEX idx_messages_conversation ON messages(conversation_id, created_at DESC);
CREATE INDEX idx_conversation_members_user ON conversation_members(user_id);
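Partition the messages table by created_at (monthly partitions) once you pass a few hundred million rows. PostgreSQL requires the partition key to be part of the primary key on a partitioned table, so the key becomes (message_id, created_at) rather than message_id alone. Old partitions can be detached and moved to cold storage cheaply, which keeps the hot index small enough to stay in memory.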
Capacity Estimation
Some quick back-of-the-envelope math for 10 million DAU:
Average user sends 20 messages/day
Total: 200 million messages/day (~2,300 messages/second)
Average message size: 200 bytes (content + metadata)
Daily storage: 200M x 200B = 40 GB/day
Annual storage: ~14.6 TB/year
5-year retention: ~73 TB
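For WebSocket connections, plan for up to 10M concurrent connections at peak. This is a deliberately conservative ceiling, since it assumes every daily active user is online at once. At ~10 KB of memory per connection, that is 100 GB of RAM across the cluster. With 16 GB allocated per server instance, memory alone sets a floor of 7 chat server instances (100 / 16 = 6.25, rounded up). In practice, teams run 12-15 for headroom and fault tolerance, since a node failure must not push survivors past their connection ceiling. Memory is also only one limit: at these numbers each node holds several hundred thousand sockets, so file-descriptor limits, heartbeat CPU, and GC pauses should be measured before settling on a node count.
A single Redis node can typically sustain Pub/Sub traffic far above the ~2,300 messages/second estimated here. Figures in the hundreds of thousands per second are commonly quoted, though they depend on payload size and subscriber count. A small Redis deployment with replicas should therefore cover this workload comfortably. The constraint that bites first is almost always memory for live connections, not message throughput.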
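The arithmetic above is simple enough to keep as code, which makes it easy to rerun when an input changes. The sketch below uses the figures from this section with decimal units (1 GB = 10^9 bytes, 1 TB = 10^12 bytes); the class and method names are hypothetical.

```java
// Back-of-the-envelope capacity math for the figures used in this section.
final class CapacityEstimate {
    static long messagesPerDay(long dailyActiveUsers, long messagesPerUser) {
        return dailyActiveUsers * messagesPerUser;
    }

    // Average rate, assuming traffic spread evenly over 86,400 seconds.
    static long messagesPerSecond(long messagesPerDay) {
        return messagesPerDay / 86_400;
    }

    // Storage in decimal terabytes for the given number of days.
    static double storageTb(long messagesPerDay, long bytesPerMessage, int days) {
        return messagesPerDay * bytesPerMessage * (double) days / 1e12;
    }

    // Smallest instance count whose combined memory holds every connection.
    static long minInstances(long connections, long bytesPerConnection, long bytesPerInstance) {
        long total = connections * bytesPerConnection;
        return (total + bytesPerInstance - 1) / bytesPerInstance; // ceiling division
    }

    public static void main(String[] args) {
        long perDay = messagesPerDay(10_000_000L, 20);
        System.out.println("messages/day:    " + perDay);
        System.out.println("messages/second: " + messagesPerSecond(perDay));
        System.out.println("TB per year:     " + storageTb(perDay, 200, 365));
        System.out.println("TB for 5 years:  " + storageTb(perDay, 200, 5 * 365));
        System.out.println("min instances:   "
                + minInstances(10_000_000L, 10_000L, 16_000_000_000L));
    }
}
```

Note that the per-second figure is an average. Peak traffic is usually a multiple of it, so size the broker and consumers against an assumed peak factor rather than the mean.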
Push Notifications for Offline Users
When the presence check says a user is offline, the message gets routed to a notification queue:
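@Service
public class PushNotificationService {

    private final FirebaseMessaging firebaseMessaging;
    // Hypothetical lookup from userId to the device's FCM registration token
    private final TokenRepository tokenRepository;

    public PushNotificationService(FirebaseMessaging firebaseMessaging,
                                   TokenRepository tokenRepository) {
        this.firebaseMessaging = firebaseMessaging;
        this.tokenRepository = tokenRepository;
    }

    public void sendPushNotification(String userId, ChatMessage message) {
        String fcmToken = tokenRepository.getToken(userId);
        if (fcmToken == null) return; // no registered device

        // com.google.firebase.messaging.Message, not the JPA Message entity
        Message notification = Message.builder()
            .setToken(fcmToken)
            .setNotification(Notification.builder()
                .setTitle(message.getSenderName())
                .setBody(truncate(message.getContent(), 100))
                .build())
            .putData("conversationId", message.getConversationId())
            .build();
        firebaseMessaging.sendAsync(notification);
    }

    // Keep the preview short; the full content is fetched when the app opens
    private static String truncate(String text, int maxLength) {
        return text.length() <= maxLength ? text : text.substring(0, maxLength);
    }
}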
Batch notifications if a user has many unread messages, because nobody wants 50 separate push alerts. A common pattern is to coalesce on a short timer — “3 new messages from Priya” instead of three separate buzzes.
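The coalescing step can be sketched as a small buffer that is flushed when the timer fires. The timer itself is left out; `NotificationCoalescer` and its methods are hypothetical names, and the alert wording is only an example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Buffers pending messages for one offline user and collapses them per sender.
final class NotificationCoalescer {
    // Sender name -> bodies buffered since the last flush, in arrival order.
    private final Map<String, List<String>> pendingBySender = new LinkedHashMap<>();

    void add(String senderName, String body) {
        pendingBySender.computeIfAbsent(senderName, s -> new ArrayList<>()).add(body);
    }

    // Call when the coalescing timer fires: produces one alert per sender
    // and clears the buffer.
    List<String> flush() {
        List<String> alerts = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : pendingBySender.entrySet()) {
            List<String> bodies = entry.getValue();
            if (bodies.size() == 1) {
                alerts.add(entry.getKey() + ": " + bodies.get(0));
            } else {
                alerts.add(bodies.size() + " new messages from " + entry.getKey());
            }
        }
        pendingBySender.clear();
        return alerts;
    }
}
```

A single message still shows its content, so the common case of one incoming message loses nothing; only bursts are collapsed.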
Load Testing and Validating the Design
An architecture is only a hypothesis until you load test it. WebSockets are notoriously hard to benchmark because the cost is in holding idle connections, not just throughput. Tools like Gatling, k6, and Artillery can open hundreds of thousands of persistent sockets and hold them while trickling messages through, which is exactly the profile you care about.
The metrics that matter are not the obvious ones. Watch the p99 end-to-end latency rather than the average, because chat feels broken the moment a small fraction of messages lag. Track Redis Pub/Sub propagation time separately from database write time, since they fail for different reasons. Crucially, test the failure path: kill a chat server mid-test and confirm that its clients reconnect, resync missed messages, and that presence converges within the TTL window. Under load, the first bottleneck is often connection memory and garbage-collection pauses on the chat nodes rather than the broker, which is why connection count per node is the number to tune most carefully. Treat that as a hypothesis to confirm in your own test, not a given.
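To see why the average hides the problem, here is a short sketch that computes the mean and a nearest-rank percentile over a set of latency samples. The `LatencyStats` class is a hypothetical helper; load-testing tools report these figures themselves, so this is only to make the definition concrete.

```java
import java.util.Arrays;

final class LatencyStats {
    // Nearest-rank percentile: the smallest sample such that at least
    // p percent of all samples are less than or equal to it.
    static long percentile(long[] samplesMs, double p) {
        if (samplesMs.length == 0) {
            throw new IllegalArgumentException("no samples");
        }
        if (p <= 0 || p > 100) {
            throw new IllegalArgumentException("p must be in (0, 100]");
        }
        long[] sorted = samplesMs.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p * sorted.length / 100.0);
        return sorted[rank - 1];
    }

    static double mean(long[] samplesMs) {
        long sum = 0;
        for (long sample : samplesMs) {
            sum += sample;
        }
        return (double) sum / samplesMs.length;
    }
}
```

As an example, take 100 samples where 98 messages arrive in 50 ms and 2 take 2,000 ms. The mean is 89 ms, comfortably inside a 200 ms target, while the p99 is 2,000 ms: one user in fifty is staring at a message that has not arrived.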
When NOT to Build Your Own Real-Time Chat Application
This whole design assumes chat is a core differentiator worth owning end to end. For many products, it is not. If you need a comment thread, in-app support messaging, or a notification feed, a managed real-time platform such as Pusher, Ably, or PubNub will get you to production in days instead of months, and they shoulder the operational burden of connection scaling and global edge delivery. The same applies to choosing between WebSockets and simpler transports — if your use case is one-directional updates, Server-Sent Events are far cheaper to operate.
Building your own makes sense when you need end-to-end encryption you control, strict data-residency guarantees, custom moderation pipelines, or sheer scale where per-message pricing becomes prohibitive. It rarely makes sense as a v1 for a small team, where the stateful WebSocket tier, the Redis coordination layer, and the on-call burden of all of it will consume engineering time you could spend on the actual product. The honest trade-off is control and cost-at-scale versus speed and operational simplicity — and for many teams, buying is the correct engineering decision. If you do go custom, it pairs naturally with the patterns in event-driven architecture with Kafka, since the persistence and fan-out paths are essentially an event stream.
Conclusion
System design is fundamentally about trade-offs. In this chat system, we traded the simplicity of a stateless HTTP API for the complexity of stateful WebSocket connections — because sub-second latency demanded it. We added Redis as a coordination layer, accepting the operational overhead because the alternative (sticky sessions with no failover) is worse. We chose eventual consistency for read receipts because strong consistency there would crush throughput for no real user benefit.
A production real-time chat application earns its complexity one decision at a time. The skill is not in memorizing architectures; it is in understanding why each piece exists and what breaks if you remove it. That understanding is what separates a whiteboard answer from a system that actually runs at scale.
For further reading, see the PostgreSQL documentation and the Redis documentation.