API Rate Limiting and Throttling in Spring Boot

API rate limiting in Spring Boot is the discipline of capping how many requests a single caller can make in a window of time, and it is one of those features that nobody notices until it is missing. A public-facing API without it is an open buffet: a misbehaving client stuck in an infinite loop, a runaway batch job, or a deliberate flood can exhaust your database connections, saturate thread pools, and drag down every downstream system in the chain. In production teams, the lesson is usually learned after the first incident, which is why mature services bake throttling in from day one rather than bolting it on after an outage.

This guide walks through the theory and then the production-ready code, moving from a single-instance interceptor to a distributed, Redis-backed limiter that survives horizontal scaling.

Why Rate Limiting Matters

Without throttling, your legitimate users pay the price for everyone else’s behavior. Concretely, rate limiting protects you from several distinct failure modes:

Denial of Service — intentional or accidental abuse that overwhelms capacity
Resource exhaustion — database connections, thread pools, and heap memory all have hard ceilings
Cost overruns — especially when you pay per-request to downstream services or cloud APIs
Unfair usage — one tenant hogging everything in a multi-tenant system, starving the rest

Beyond protection, rate limits double as a product lever. Tiered plans (free, pro, enterprise) are typically enforced through exactly this mechanism, so the same code that defends your infrastructure also encodes your business model.

Types of Rate Limiting Algorithms

Before writing any code, it helps to understand the algorithms on offer, because each carries a different trade-off between simplicity, fairness, and burst tolerance.

Algorithm	How It Works	Pros	Cons
Fixed Window	Counts requests in fixed time intervals (e.g., 100 req/min)	Simple to implement	Burst at window edges
Sliding Window	Rolling time window, smooths out edge bursts	Fairer distribution	Slightly more complex
Token Bucket	Tokens refill at a fixed rate; each request costs a token	Allows controlled bursts	Needs careful tuning
Leaky Bucket	Requests queue and process at a constant rate	Smooth output rate	Can add latency

The fixed-window approach has a notorious edge case worth understanding: with a 100 req/min limit, a client can fire 100 requests in the final second of one window and another 100 in the first second of the next, producing a 200-request burst across a two-second span. Sliding window counters fix this by weighting the previous window proportionally. For most Spring Boot APIs, however, the Token Bucket algorithm strikes the best balance: it tolerates legitimate bursts gracefully while still enforcing a long-run average, and it is exactly what the Bucket4j library implements under the hood.
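
To make the proportional weighting concrete, here is a minimal sketch of a sliding window counter. The class name `SlidingWindowCounter` and its method are illustrative, not from any library, and the clock is passed in as a parameter purely so the behavior is easy to reproduce.

```java
/** Sliding window counter: estimates the rolling count from two adjacent fixed windows. */
final class SlidingWindowCounter {
    private final int limit;
    private final long windowMillis;
    private long currentWindowStart;
    private int currentCount;
    private int previousCount;

    SlidingWindowCounter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Returns true if a request arriving at nowMillis is allowed. */
    synchronized boolean tryAcquire(long nowMillis) {
        long windowStart = nowMillis - (nowMillis % windowMillis);
        if (windowStart != currentWindowStart) {
            // Roll over. If more than one full window has passed, the previous window is empty.
            previousCount = (windowStart - currentWindowStart == windowMillis) ? currentCount : 0;
            currentCount = 0;
            currentWindowStart = windowStart;
        }
        // Fraction of the previous window that the rolling window still overlaps.
        double previousWeight = 1.0 - (double) (nowMillis - windowStart) / windowMillis;
        double estimated = previousCount * previousWeight + currentCount;
        if (estimated >= limit) {
            return false;
        }
        currentCount++;
        return true;
    }
}
```

With a limit of 100 per minute, a client that spends all 100 requests in the last second of one window gets only a couple more in the first second of the next, instead of a fresh 100.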

Implementing Rate Limiting with Bucket4j

Bucket4j is a mature Java library based on the token bucket algorithm. It integrates cleanly with Spring Boot and supports both in-memory and distributed (Redis, Hazelcast, Infinispan) backends, so you can start simple and scale later without rewriting your limiting logic.

First, add the dependency:

<dependency>
    <groupId>com.bucket4j</groupId>
    <artifactId>bucket4j-core</artifactId>
    <version>8.7.0</version>
</dependency>

Here is a basic bucket configuration with two limits that must both be satisfied: a long-run limit of 20 requests per minute, and a tighter short-term limit that caps bursts at 5 requests per 10 seconds:

import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.Refill;

import java.time.Duration;

public class RateLimiterConfig {

    public static Bucket createBucket() {
        Bandwidth limit = Bandwidth.classic(20, Refill.greedy(20, Duration.ofMinutes(1)));
        Bandwidth burst = Bandwidth.classic(5, Refill.intervally(5, Duration.ofSeconds(10)));

        return Bucket.builder()
                .addLimit(limit)
                .addLimit(burst)
                .build();
    }
}

Note the distinction between refill strategies. Refill.greedy trickles tokens back continuously, so a 20-per-minute bucket regains roughly one token every three seconds rather than all twenty at the top of the minute. Refill.intervally, by contrast, adds the full batch on a schedule. Greedy refill produces smoother behavior and avoids the thundering-herd effect of synchronized clients all retrying at the window boundary.
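
The difference is easiest to see as arithmetic. The sketch below is a simplified model of the two strategies, starting from an empty bucket; it is not Bucket4j's implementation, and `RefillModel` is a made-up name for illustration.

```java
/** Simplified model of the two refill strategies, starting from an empty bucket. */
final class RefillModel {

    /** Greedy: tokens accrue continuously, each one arriving as soon as it is due. */
    static long greedy(long capacity, long tokensPerPeriod, long periodMillis, long elapsedMillis) {
        long refilled = elapsedMillis * tokensPerPeriod / periodMillis;
        return Math.min(capacity, refilled);
    }

    /** Intervally: nothing arrives until a full period has elapsed, then the whole batch does. */
    static long intervally(long capacity, long tokensPerPeriod, long periodMillis, long elapsedMillis) {
        long refilled = (elapsedMillis / periodMillis) * tokensPerPeriod;
        return Math.min(capacity, refilled);
    }
}
```

For a 20-per-minute bucket, the greedy model yields one token after three seconds and ten after thirty, while the intervally model yields nothing until the full minute is up.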

Building a Rate Limiting Interceptor

A HandlerInterceptor is generally preferable to a raw servlet filter here, because it gives you access to Spring’s request context and handler metadata while still running early in the request lifecycle.

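// Servlet imports use jakarta.* on Spring Boot 3.x; use javax.servlet.http.* on Spring Boot 2.x
import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import io.github.bucket4j.ConsumptionProbe;
import io.github.bucket4j.Refill;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.HandlerInterceptor;

import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

@Component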
public class RateLimitInterceptor implements HandlerInterceptor {

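    // One bucket per client. This map is unbounded, so evict idle entries in production
    // (for example with a size- or time-bounded cache) to avoid a slow memory leak.
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();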

    @Override
    public boolean preHandle(HttpServletRequest request,
                             HttpServletResponse response,
                             Object handler) throws Exception {

        String clientId = resolveClientId(request);
        Bucket bucket = buckets.computeIfAbsent(clientId, k -> createBucket());

        ConsumptionProbe probe = bucket.tryConsumeAndReturnRemaining(1);

        response.addHeader("X-RateLimit-Limit", "20");
        response.addHeader("X-RateLimit-Remaining",
                String.valueOf(probe.getRemainingTokens()));

        if (probe.isConsumed()) {
            return true;
        }

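        // Round up so Retry-After is never 0 while the bucket is still empty
        long waitTimeSeconds =
                (probe.getNanosToWaitForRefill() + 999_999_999L) / 1_000_000_000L;
        response.addHeader("Retry-After", String.valueOf(waitTimeSeconds));
        response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value());
        response.setContentType("application/json");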
        response.getWriter().write(
            "{\"error\": \"Rate limit exceeded. Try again in "
            + waitTimeSeconds + " seconds.\"}"
        );
        return false;
    }

    private String resolveClientId(HttpServletRequest request) {
        // Prefer API key, fall back to IP
        String apiKey = request.getHeader("X-API-Key");
        if (apiKey != null && !apiKey.isBlank()) {
            return "key:" + apiKey;
        }
        return "ip:" + request.getRemoteAddr();
    }

    private Bucket createBucket() {
        return Bucket.builder()
                .addLimit(Bandwidth.classic(20,
                    Refill.greedy(20, Duration.ofMinutes(1))))
                .build();
    }
}

One caveat with the IP fallback: when your service sits behind a load balancer or reverse proxy, getRemoteAddr() returns the proxy’s address, not the real client’s. In that topology you must read X-Forwarded-For instead, and crucially you should only trust that header from known proxy IPs, otherwise a caller can simply spoof it to dodge the limit.

Register the interceptor in your WebMvc configuration:

@Configuration
public class WebConfig implements WebMvcConfigurer {

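    private final RateLimitInterceptor rateLimitInterceptor;

    // Constructor injection keeps the dependency final and the class easy to test
    public WebConfig(RateLimitInterceptor rateLimitInterceptor) {
        this.rateLimitInterceptor = rateLimitInterceptor;
    }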

    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        registry.addInterceptor(rateLimitInterceptor)
                .addPathPatterns("/api/**");
    }
}
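
Returning to the proxy caveat above, the following is one possible sketch of a trusted-proxy check. `ClientIpResolver` and the exact-match set of proxy addresses are assumptions made for illustration; a real deployment would more likely match CIDR ranges, or delegate to the forwarded-header handling of the servlet container or framework.

```java
import java.util.Set;

/** Resolves the client IP, honouring X-Forwarded-For only when the direct peer is trusted. */
final class ClientIpResolver {
    private final Set<String> trustedProxies;

    ClientIpResolver(Set<String> trustedProxies) {
        this.trustedProxies = Set.copyOf(trustedProxies);
    }

    /**
     * @param remoteAddr   the socket peer, e.g. the value of request.getRemoteAddr()
     * @param forwardedFor the raw X-Forwarded-For header value, or null if absent
     */
    String resolve(String remoteAddr, String forwardedFor) {
        // Ignore the header entirely unless the request came through one of our proxies.
        if (forwardedFor == null || forwardedFor.isBlank() || !trustedProxies.contains(remoteAddr)) {
            return remoteAddr;
        }
        // Walk right to left: the rightmost entries were appended by our own proxies,
        // so the first untrusted address is the closest hop we do not control.
        String[] hops = forwardedFor.split(",");
        for (int i = hops.length - 1; i >= 0; i--) {
            String hop = hops[i].trim();
            if (!hop.isEmpty() && !trustedProxies.contains(hop)) {
                return hop;
            }
        }
        return remoteAddr;
    }
}
```

Reading the list from the right matters: anything a caller puts in the header ends up to the left of the address your proxy appended, so a spoofed entry is never selected.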

Per-User vs Per-Endpoint Rate Limiting

The interceptor above does per-user limiting. Often, though, you also want per-endpoint limits, because a full-text search endpoint is far more expensive than a cached profile lookup and deserves a tighter cap.

You can combine both dimensions by keying the bucket on clientId + endpoint:

private String resolveBucketKey(HttpServletRequest request) {
    String clientId = resolveClientId(request);
    String endpoint = request.getMethod() + ":" + request.getRequestURI();
    return clientId + "|" + endpoint;
}

You can then assign different Bandwidth limits per endpoint pattern, typically stored in a configuration map so the values live alongside the rest of your application properties rather than buried in code:

Map<String, Bandwidth> endpointLimits = Map.of(
    "/api/search",  Bandwidth.classic(10, Refill.greedy(10, Duration.ofMinutes(1))),
    "/api/users",   Bandwidth.classic(50, Refill.greedy(50, Duration.ofMinutes(1)))
);

One pitfall to plan for: keying on the raw URI multiplies your bucket count. A path like /api/users/42 and /api/users/43 would each get their own bucket if you do not normalize path variables back to their template form first. Always reduce to the route pattern, not the concrete URL.
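
As a rough illustration of that normalization, the sketch below maps a concrete path onto a known route template. `RouteTemplates` and the hard-coded template list are hypothetical; inside a Spring MVC interceptor the matched pattern is usually available from the request attribute named by HandlerMapping.BEST_MATCHING_PATTERN_ATTRIBUTE, which avoids maintaining such a list by hand.

```java
import java.util.List;

/** Maps concrete request paths onto route templates so each route shares one bucket key. */
final class RouteTemplates {
    /** Shared key for paths that match no template, so random URLs cannot mint new buckets. */
    static final String UNMATCHED = "UNMATCHED";

    static String normalize(String path, List<String> templates) {
        String[] actual = path.split("/");
        for (String template : templates) {
            String[] expected = template.split("/");
            if (expected.length == actual.length && matches(expected, actual)) {
                return template;
            }
        }
        return UNMATCHED;
    }

    private static boolean matches(String[] expected, String[] actual) {
        for (int i = 0; i < expected.length; i++) {
            // A segment like {id} matches any concrete value in that position.
            boolean isVariable = expected[i].startsWith("{") && expected[i].endsWith("}");
            if (!isVariable && !expected[i].equals(actual[i])) {
                return false;
            }
        }
        return true;
    }
}
```

Returning a single shared key for unmatched paths is a deliberate choice here: falling back to the raw path would let a caller create an unbounded number of buckets just by requesting URLs that do not exist.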

Distributed Rate Limiting with Redis

In-memory buckets break the moment you scale beyond one instance. Two pods, each holding its own ConcurrentHashMap, means your effective limit doubles; ten pods means it is off by an order of magnitude. The fix is a shared store, and in practice that store is almost always Redis.

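Bucket4j ships a Redis integration via bucket4j-redis. In its compare-and-swap mode, each bucket’s state lives in Redis; an instance reads the state, computes the new token count locally, and writes it back with an atomic compare-and-swap, retrying on conflict, so concurrent decrements from different instances stay consistent: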

<dependency>
    <groupId>com.bucket4j</groupId>
    <artifactId>bucket4j-redis</artifactId>
    <version>8.7.0</version>
</dependency>

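// Assumes the application defines a Lettuce RedisClient bean.
// Builder names below follow the Bucket4j 8.7 Lettuce integration; later
// releases rename the entry point, so check the docs for your version.
@Bean
public ProxyManager<String> proxyManager(RedisClient redisClient) {
    // String keys, raw bytes for the serialized bucket state
    StatefulRedisConnection<String, byte[]> connection =
        redisClient.connect(RedisCodec.of(StringCodec.UTF8, ByteArrayCodec.INSTANCE));
    return LettuceBasedProxyManager.builderFor(connection)
            // Let Redis expire idle buckets instead of keeping them forever
            .withExpirationStrategy(ExpirationAfterWriteStrategy
                .basedOnTimeForRefillingBucketUpToMax(Duration.ofSeconds(10)))
            .build();
}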

// Then resolve buckets from the proxy manager
Bucket bucket = proxyManager.builder()
    .build(clientId, () -> BucketConfiguration.builder()
        .addLimit(Bandwidth.classic(20, Refill.greedy(20, Duration.ofMinutes(1))))
        .build());

Now every instance shares the same counters. Redis adds a few milliseconds of latency per request, so plan for that on hot paths, and equally important, decide what happens when Redis itself is unreachable. The two camps are “fail open” (allow the request when the limiter is down, prioritizing availability) and “fail closed” (reject it, prioritizing protection). Most public APIs fail open with an alert, on the logic that a brief lapse in throttling beats a total outage caused by the limiter.
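
A fail-open policy can be expressed as a thin wrapper around the remote check. This is only a sketch: `FailOpenLimiter` and the `RemoteLimiter` interface are hypothetical names standing in for whatever call reaches the Redis-backed bucket, and the failure callback is where a metric or alert would hook in.

```java
import java.util.function.Consumer;

/** Wraps a remote rate-limit check and allows the request if the check itself fails. */
final class FailOpenLimiter {

    /** Hypothetical abstraction over the Redis-backed check; throws if the store is unreachable. */
    interface RemoteLimiter {
        boolean tryConsume(String key) throws Exception;
    }

    private final RemoteLimiter delegate;
    private final Consumer<Exception> onFailure;

    FailOpenLimiter(RemoteLimiter delegate, Consumer<Exception> onFailure) {
        this.delegate = delegate;
        this.onFailure = onFailure;
    }

    boolean allow(String key) {
        try {
            return delegate.tryConsume(key);
        } catch (Exception e) {
            // Report the failure (metric, log, alert) and let the request through.
            onFailure.accept(e);
            return true;
        }
    }
}
```

Switching to fail closed is a one-line change (return false in the catch block), which makes the policy easy to drive from configuration per endpoint.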

Spring Cloud Gateway Rate Limiting

If you already run Spring Cloud Gateway at the edge, it provides a built-in RequestRateLimiter filter backed by Redis, which means you can throttle before traffic ever reaches your service instances:

spring:
  cloud:
    gateway:
      routes:
        - id: api-service
          uri: lb://api-service
          predicates:
            - Path=/api/**
          filters:
            - name: RequestRateLimiter
              args:
                redis-rate-limiter.replenishRate: 10
                redis-rate-limiter.burstCapacity: 20
                redis-rate-limiter.requestedTokens: 1
                key-resolver: "#{@userKeyResolver}"

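@Bean
public KeyResolver userKeyResolver() {
    return exchange -> {
        String apiKey = exchange.getRequest().getHeaders().getFirst("X-API-Key");
        if (apiKey != null && !apiKey.isBlank()) {
            return Mono.just("key:" + apiKey);
        }
        // Mono.just(null) throws, so fall back to the remote address when the header is missing
        java.net.InetSocketAddress remote = exchange.getRequest().getRemoteAddress();
        return Mono.just("ip:" + (remote != null ? remote.getHostString() : "unknown"));
    };
}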

This is the fastest path when Gateway is already in your stack, since the heavy lifting and the Redis Lua script ship with the framework. Be aware that the Gateway limiter uses a token-bucket variant where burstCapacity is the bucket size and replenishRate the steady refill, so tune the two together rather than in isolation.
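
To see how the two settings interact, here is a simplified in-memory model of that token bucket with one-second granularity. It is not the Gateway's Lua script, and `GatewayStyleBucket` is an illustrative name; the parameters mirror replenishRate, burstCapacity, and requestedTokens from the configuration above.

```java
/** Simplified model of a token bucket driven by replenishRate and burstCapacity. */
final class GatewayStyleBucket {
    private final long replenishRate;  // tokens added per second
    private final long burstCapacity;  // maximum tokens the bucket can hold
    private long tokens;
    private long lastRefillSeconds;

    GatewayStyleBucket(long replenishRate, long burstCapacity, long nowSeconds) {
        this.replenishRate = replenishRate;
        this.burstCapacity = burstCapacity;
        this.tokens = burstCapacity;   // the bucket starts full
        this.lastRefillSeconds = nowSeconds;
    }

    synchronized boolean tryConsume(long requestedTokens, long nowSeconds) {
        // Add the tokens earned since the last call, never exceeding the bucket size.
        long elapsed = Math.max(0, nowSeconds - lastRefillSeconds);
        tokens = Math.min(burstCapacity, tokens + elapsed * replenishRate);
        lastRefillSeconds = nowSeconds;
        if (tokens < requestedTokens) {
            return false;
        }
        tokens -= requestedTokens;
        return true;
    }
}
```

With replenishRate 10 and burstCapacity 20, a client can spend 20 tokens at once, then sustain 10 per second; after a long idle period the allowance is back to 20, never more.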

Handling 429 Responses Gracefully

A bare 429 status code is not helpful to the caller. A well-behaved API always returns enough context for the client to recover automatically:

Retry-After header — tells the client exactly when to try again, in seconds or as an HTTP date
X-RateLimit-Remaining — lets clients self-throttle before they ever hit the wall
A clear JSON body describing the error and, ideally, a link to the rate-limit documentation

On the client side, the correct response to a 429 is exponential backoff with jitter: rather than retrying immediately, wait, then double the wait on each subsequent attempt, capped at something like 60 seconds, and add a small random offset so that a fleet of clients does not retry in lockstep and create a synchronized thundering herd.
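
A minimal sketch of that retry schedule might look like the following. The class name `Backoff`, the one-second starting delay used in the usage note, and the jitter bound are illustrative choices rather than prescribed values.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Exponential backoff with a cap and a small random offset. */
final class Backoff {

    /** Delay before retry number `attempt` (0-based): initial * 2^attempt, capped. */
    static long baseDelayMillis(int attempt, long initialMillis, long capMillis) {
        long delay = initialMillis;
        // Stop doubling once the cap is reached.
        for (int i = 0; i < attempt && delay < capMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMillis);
    }

    /** Adds a random offset in [0, maxJitterMillis] so clients do not retry in lockstep. */
    static long delayWithJitterMillis(int attempt, long initialMillis, long capMillis,
                                      long maxJitterMillis) {
        long jitter = ThreadLocalRandom.current().nextLong(maxJitterMillis + 1);
        return baseDelayMillis(attempt, initialMillis, capMillis) + jitter;
    }
}
```

Starting from one second with a 60-second cap, the base delays run 1s, 2s, 4s, 8s, 16s, 32s, and then stay at 60s. If the server sent a Retry-After value, a client would reasonably wait at least that long before applying its own schedule.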

Testing Rate Limits

Do not skip this step; a misconfigured limit that never trips is worse than no limit at all because it gives a false sense of safety. A straightforward integration test exhausts the bucket and asserts on the throttled response:

@SpringBootTest(webEnvironment = SpringBootTest.WebEnvironment.RANDOM_PORT)
class RateLimitTest {

    @Autowired
    private TestRestTemplate restTemplate;

    @Test
    void shouldReturn429WhenRateLimitExceeded() {
        HttpHeaders headers = new HttpHeaders();
        headers.set("X-API-Key", "test-client");

        HttpEntity<Void> entity = new HttpEntity<>(headers);

        // Exhaust the limit
        for (int i = 0; i < 20; i++) {
            ResponseEntity<String> response = restTemplate
                .exchange("/api/users", HttpMethod.GET, entity, String.class);
            assertThat(response.getStatusCode()).isEqualTo(HttpStatus.OK);
        }

        // Next request should be throttled
        ResponseEntity<String> response = restTemplate
            .exchange("/api/users", HttpMethod.GET, entity, String.class);

        assertThat(response.getStatusCode())
            .isEqualTo(HttpStatus.TOO_MANY_REQUESTS);
        assertThat(response.getHeaders().getFirst("Retry-After"))
            .isNotNull();
    }
}

When NOT to Add Rate Limiting (and the Trade-offs)

Rate limiting is not free, and there are contexts where it adds more friction than value. For a purely internal service behind a trusted network boundary, where callers are your own well-behaved batch jobs, a hard limiter can cause more false-positive outages than it prevents; a circuit breaker or bulkhead may serve you better there. Distributed limiting also introduces a dependency on Redis, which becomes a new single point of failure you must monitor and run with high availability. There is a latency cost too, since every request now makes a network round trip to the shared store, and for ultra-low-latency paths that overhead can matter. Finally, limits set too aggressively frustrate legitimate power users and generate support tickets, while limits set too loosely never engage during the flood you built them for. The honest position is that throttling is a tuning exercise as much as an engineering one: start with generous limits, observe real traffic, and tighten gradually rather than guessing the numbers up front. For background on the broader security posture these controls fit into, the OWASP Top 10 and the NIST vulnerability database are useful reference material.

In conclusion, API rate limiting in Spring Boot is infrastructure, not a nice-to-have. The cost of adding it is genuinely low: a single interceptor or a Gateway filter, a Redis instance you probably already operate, and a handful of response headers. The cost of skipping it is an outage at 2 AM and an uncomfortable conversation on Monday morning. Start with the token bucket, key it by API key with an IP fallback, move the counters into Redis the moment you scale past one instance, and return honest 429 responses so well-behaved clients can recover on their own.