Designing a Rate Limiter
How to design a rate limiter — algorithms, trade-offs, and a production-ready, thread-safe token bucket wired into Spring Boot as an interceptor.
A rate limiter caps how fast a client may call a service. It protects you from abuse, accidental overload, and runaway cost — and it is a staple system-design question because it forces you to reason about accuracy, memory, and concurrency at once.
Requirements
Pin down exactly what "rate limit" means here, so the design and the code agree:
- Functional: admit a configured request rate per client, absorbing short bursts up to a fixed capacity; reject anything beyond it with
429 Too Many Requestsand aRetry-Afterhint. - Non-functional: negligible added latency,
O(1)memory per client, and correct under concurrency — many threads, and eventually many servers, hitting the same counter.
The hard part is not the algorithm — it is keeping the counter atomic when many threads decrement it at the same instant. Everything below is in service of that one property.
Algorithms
Fixed window
Count requests per fixed clock window (say, per second). Simple, but a client can fire a full window's worth at the end of one window and the start of the next — up to 2× the limit across the boundary.
Sliding window
Smooth the boundary by weighting the previous window. If a client made and requests, the estimate is:
where is the window length and is the time elapsed in the current window.
Token bucket
The workhorse, and what we implement. A bucket holds up to capacity tokens and refills at refillPerSecond. Each request removes one token; an empty bucket means reject. It allows controlled bursts (up to capacity) while bounding the long-run rate to the refill rate — exactly the requirement above.
A thread-safe token bucket
Tokens are computed lazily from elapsed time, so there is no background thread. The whole read-modify-write is guarded so concurrent callers can't both spend the last token:
package dev.fiveyear.ratelimit;
/** Lazy, thread-safe token bucket. O(1) memory, no background refill thread. */
public final class TokenBucket {
private final long capacity;
private final double refillPerMillis;
private double tokens;
private long lastRefillMillis;
public TokenBucket(long capacity, double refillPerSecond) {
this.capacity = capacity;
this.refillPerMillis = refillPerSecond / 1000.0;
this.tokens = capacity;
this.lastRefillMillis = System.currentTimeMillis();
}
public synchronized boolean tryAcquire() {
refill();
if (tokens < 1.0) {
return false;
}
tokens -= 1.0;
return true;
}
private void refill() {
long now = System.currentTimeMillis();
tokens = Math.min(capacity, tokens + (now - lastRefillMillis) * refillPerMillis);
lastRefillMillis = now;
}
}The synchronized block on line 18 is the whole correctness argument: refill, the check, and the decrement happen as one atomic step, so two threads can never both see the last token as available.
Wiring it into Spring Boot
Enforce the limit before the controller runs by implementing HandlerInterceptor. One bucket per client, kept in a ConcurrentHashMap:
package dev.fiveyear.ratelimit;
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.HandlerInterceptor;
@Component
public class RateLimitInterceptor implements HandlerInterceptor {
private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();
@Override
public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) {
String clientId = request.getRemoteAddr();
TokenBucket bucket = buckets.computeIfAbsent(clientId, key -> new TokenBucket(20, 20));
if (bucket.tryAcquire()) {
return true;
}
response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value()); // 429
response.setHeader("Retry-After", "1");
return false;
}
}Register it with a WebMvcConfigurer so it runs for the paths you choose:
package dev.fiveyear.ratelimit;
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;
@Configuration
public class WebConfig implements WebMvcConfigurer {
private final RateLimitInterceptor rateLimitInterceptor;
public WebConfig(RateLimitInterceptor rateLimitInterceptor) {
this.rateLimitInterceptor = rateLimitInterceptor;
}
@Override
public void addInterceptors(InterceptorRegistry registry) {
registry.addInterceptor(rateLimitInterceptor).addPathPatterns("/api/**");
}
}Trade-offs
| Algorithm | Memory | Burst behaviour | Boundary accuracy |
|---|---|---|---|
| Fixed window | O(1) | up to 2× the limit | poor |
| Sliding window | O(1) | smooth | good |
| Token bucket | O(1) | bursts to capacity | good |
For most APIs the token bucket is the right default: O(1) memory per client, atomic under concurrency, and a single knob — the refill rate — for the steady-state limit.
Going distributed
The in-memory map is per-instance, so each server enforces its own limit. To share state across a fleet, move the bucket into Redis (atomic via a server-side script) or adopt a library such as Bucket4j with a Redis/Hazelcast backend — the interceptor stays exactly the same; only TokenBucket is swapped for a distributed one.
What interviewers probe next
- Where does the limiter live — gateway, filter/interceptor, or library?
- What happens when the shared store is down? (Fail-open vs fail-closed — a deliberate choice.)
- How do you choose the client key — IP, API key, or user id — and avoid punishing users behind one NAT?