
Designing a Rate Limiter

The system design classic: where a rate limiter sits in your architecture, fixed window vs sliding window vs token bucket, going distributed with Redis, and a thread-safe core you can ship.

By fiveyearsdev · May 20, 2026 · 16 min read

You request an OTP, mistype it, request another… and on the third try your bank app says "Too many attempts. Try again in 30 seconds." Annoying? A little. But that message just stopped someone from trying ten thousand PINs against your account overnight.

Nobody was watching your account. What blocked that third try was a tiny counter with a fixed budget that refills a little every second — you spent it in three quick taps and hit zero, so the next tap bounced. That counter is a rate limiter saying hello, and designing one is a system-design classic, because that one little budget forces you to reason about accuracy, memory, and concurrency all at once. Let's design it the way you would on a whiteboard: from the outside in.

This article stays at the architecture level — boxes, arrows, and trade-offs. When you're ready to open the rate limiter box and build every class inside it, the companion piece walks through exactly that: Building a Rate Limiter: The Low-Level Design.

Let's start nowhere near a computer

Imagine your grandmother runs your arcade-game budget. Every hour, she drops one coin into a jar on the shelf. The jar holds at most five coins.

You can spend your coins whenever you like — all five in a glorious ten-minute binge, or one at a time. But when the jar is empty, no amount of begging works. You wait for the next coin to drop.

Hold onto that jar — it's the best algorithm in this article wearing a disguise. Notice what it achieves: short bursts are fine, but the long-run rate is fixed, and grandma never has to watch you. The jar enforces the rules by itself.

You got rate limited three times today

OTP and login screens — "try again later" is a limiter protecting against brute force.
Free API tiers — 100 requests/minute on that weather API? A limiter counting your key.
Ticket sales and flash sales — limiters are the difference between a slow queue and a crashed site.
Your own outbound calls — hit a partner API too fast and their limiter 429s you; well-built services limit themselves on the way out too.

What the limiter must do

Before any boxes, say plainly what this thing has to do — the functional requirements, in sentences:

For each incoming request, decide allow or deny, keyed by who's asking (a client, an IP, an API key) and often by which route they're hitting.
Enforce a rule of the shape N requests per time window — "20 per second", "100 per minute".
Be fast enough to sit in the request path — every request flows through it, so it can't be the slow part.
Ideally, when it says no, tell the caller when to try again — a Retry-After hint, so well-behaved clients back off instead of hammering.

That's the whole job. Everything below is just how well we do it.

Non-functional requirements

The features above are the easy half; the design is only good if it does them well. For a rate limiter, "well" means four things:

Low latency. The limiter is on every request's hot path. It must add next to no latency: a check that takes milliseconds defeats the point of a fast front door.
High availability. If the limiter itself is down, it must fail open or degrade — never block all traffic. A dead limiter taking your whole API down with it is the worst outcome of all.
Accuracy under concurrency. Counts must stay correct when many requests hit at once — on one node and across many. Two requests must never both spend the last token.
Scalability. The same contract must work from one node to ten without changing — adding boxes shouldn't quietly multiply your limit.

Listing them isn't enough; the design has to meet them. Here's the contract this design signs — each requirement, the mechanism that keeps it, and where we cover it:

Requirement	How this design fulfills it	Where
Low latency	an in-memory, O(1) check-and-spend, no I/O on the happy path	The single-node core
High availability	fail-open policy + a local in-process fallback limiter	When the limiter itself fails
Accuracy under concurrency	an atomic refill-check-spend (one `synchronized` step; an `INCR`/server-side script in Redis)	The single-node core; Going distributed
Scalability	one shared store across nodes, so the contract holds as you scale	Going distributed

Every trade-off from here on is chosen to keep one of these promises — and the trade-offs table at the end points back at this list.

Where the limiter lives

First whiteboard question: where does this thing sit? The standard answer — at the front door, before any expensive work happens:

Why the gateway? Because a rejected request should cost you almost nothing. If the request gets deep into service B before being refused, it already consumed connections, threads, and database time — the exact resources you were protecting.

The contract with clients is two signals: the status 429 Too Many Requests, plus a Retry-After header so well-behaved clients back off instead of hammering harder.

What "limit" should mean — picking the algorithm

Now the heart of the design. "20 requests per second per client" sounds precise until you ask: counted how? Let's break two obvious answers and keep the third.

Attempt 1: fixed window

Count requests in each clock second; reset the counter when the second flips. Simple, O(1) memory… and it has a hole you could drive a truck through:

A client who fires at the end of one window and the start of the next gets 2× the limit. Fine for rough protection, embarrassing in an interview if you don't mention it.
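To make the hole concrete, here is a minimal fixed-window sketch. The class name FixedWindowLimiter and the choice to pass the time in as nowMillis (so the example is deterministic) are assumptions for illustration, not code from the companion article.

```java
import java.util.HashMap;
import java.util.Map;

/** Fixed-window counter sketch: one counter per client per clock window. */
final class FixedWindowLimiter {
    private final int limit;
    private final long windowMillis;
    // clientId -> { window index, requests counted in that window }
    private final Map<String, long[]> counters = new HashMap<>();

    FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** nowMillis is supplied by the caller (an assumption, for testability). */
    synchronized boolean tryAcquire(String clientId, long nowMillis) {
        long window = nowMillis / windowMillis;
        long[] state = counters.computeIfAbsent(clientId, k -> new long[] {window, 0});
        if (state[0] != window) { // the clock window flipped: reset the counter
            state[0] = window;
            state[1] = 0;
        }
        if (state[1] >= limit) {
            return false;
        }
        state[1]++;
        return true;
    }
}
```

With a limit of 20 per 1000 ms, 20 calls at nowMillis = 999 and 20 more at nowMillis = 1000 are all allowed: 40 requests inside two milliseconds, which is exactly the 2× boundary burst.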

Attempt 2: sliding window

Smooth the boundary by blending the previous window in proportion to how much of it still "overlaps" your last second. If the client made $c_{prev}$ and $c_{curr}$ requests:

$$\text{count} = c_{curr} + c_{prev} \cdot \frac{w - t}{w}$$

where $w$ is the window length and $t$ is how far you are into the current one. Two counters per client, boundary hole closed.

Its only real sin: it's an estimate. That weighting assumes the previous window's requests were spread evenly, so it can be wrong in both directions. If all of $c_{prev}$ actually landed in that window's last instant, the true rolling count is higher than the formula thinks, and a burst slips through; if they landed early, you reject a client who was under the real limit. The error is bounded by $c_{prev}$, which is why it stays acceptable: at most one window's worth of requests is mis-attributed, never more.
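A worked instance of the weighting may help. The sketch below is one way to write the estimate; the class and method names are hypothetical, and the numbers are made up for illustration.

```java
/** Sliding-window counter estimate: current count plus a weighted share of the previous window. */
final class SlidingWindowEstimate {

    /** cPrev and cCurr are the two per-client counters; elapsedMillis is t, the time into the current window. */
    static double estimate(long cPrev, long cCurr, long windowMillis, long elapsedMillis) {
        double overlap = (double) (windowMillis - elapsedMillis) / windowMillis; // (w - t) / w
        return cCurr + cPrev * overlap;
    }

    /** Allow the request only while the estimated rolling count is below the limit. */
    static boolean allow(long cPrev, long cCurr, long windowMillis, long elapsedMillis, long limit) {
        return estimate(cPrev, cCurr, windowMillis, elapsedMillis) < limit;
    }
}
```

Take w = 1000 ms, a previous window with 20 requests and a current one with 6. At t = 250 ms, three quarters of the previous window still overlaps, so the estimate is 6 + 20 × 0.75 = 21 and a limit of 20 denies the request. At t = 750 ms the estimate drops to 6 + 20 × 0.25 = 11 and the same request passes.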

If you need the count to be exact, there's a third variant — the sliding-window log: store the timestamp of every request in a sorted set and, on each new request, drop everything older than w and count what's left. It's exact by construction, but the memory is now O(requests in the window) per client instead of two integers — a busy key at 10k req/min holds 10k timestamps. That's the accuracy-versus-cost dial in one line: the counter approximates in O(1); the log is exact at O(n). Most APIs take the O(1) approximation and never look back; the token bucket below sidesteps the whole question a different way.
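For comparison, a sliding-window log for a single key might look like the sketch below. It keeps the timestamps in an in-memory deque rather than a Redis sorted set, which is an assumption made to keep the example self-contained; the class name and the caller-supplied nowMillis are hypothetical too.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sliding-window log sketch for one key: exact count, O(requests in window) memory. */
final class SlidingWindowLog {
    private final int limit;
    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>(); // oldest first

    SlidingWindowLog(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    synchronized boolean tryAcquire(long nowMillis) {
        // Drop every timestamp that is a full window old or older.
        while (!timestamps.isEmpty() && timestamps.peekFirst() <= nowMillis - windowMillis) {
            timestamps.pollFirst();
        }
        if (timestamps.size() >= limit) {
            return false; // denied requests are not logged
        }
        timestamps.addLast(nowMillis);
        return true;
    }

    synchronized int size() {
        return timestamps.size();
    }
}
```

The memory cost is visible in the field list: the deque grows with traffic, where the counter variant holds two integers no matter how busy the key is.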

Attempt 3: the token bucket — grandma's jar

Now bring back the jar. A bucket holds up to capacity tokens and refills at a steady refillPerSecond. Each request takes one token; an empty bucket means 429.

This is the workhorse, and it gives you two independent knobs that map directly onto requirements: capacity controls how big a burst you'll absorb, and refillPerSecond pins the long-run rate. The jar guaranteed exactly this, remember — binge allowed, average fixed.

Algorithm	Memory per client	Burst behaviour	Boundary accuracy
Fixed window	O(1)	up to 2× the limit	poor
Sliding window	O(1)	up to the full limit	good (estimated)
Token bucket	O(1)	bursts to capacity	exact

For most APIs, the token bucket is the right default — and it's what we'll carry through the rest of the design.

The state it keeps — a tiny data model

Before the boxes, name what the limiter actually stores — and notice how little that is. There are only two kinds of state, and they could not be more different in their lifetimes:

the data model

rate_rule    (id, scope,            -- "route:/api/search" or "tier:free"
              capacity,             -- burst ceiling      (knob #1)
              refill_per_sec,       -- long-run rate      (knob #2)
              key_by)               -- ip | api_key | user_id
 
bucket_state (key,                  -- "<principal>:<route>", e.g. "key-abc:/search"
              tokens,               -- a double: fractional tokens are fine
              last_refill_ms)       -- when we last computed the drip

The split is the whole point. A rate_rule is durable configuration: a handful of rows that change when a product manager edits the free tier, read constantly but written almost never. A bucket_state is the opposite — one tiny, hot, ephemeral record per active principal, rewritten on every single request and worthless the moment that principal goes quiet. Two access patterns this far apart want two homes: the rules live in a relational store (or any config source) and get cached in memory at boot; the counters live in Redis, keyed by <principal>:<route>, each with a TTL so an idle client's bucket evicts itself. Why Redis specifically — and why the counter update has to run inside it — is the going-distributed question below; for now just hold the shape: a fat, slow-changing rulebook and a swarm of throwaway counters.

Notice that a bucket_state maps directly onto what the single-node class is about to hold: tokens and lastRefillMillis from the record itself, plus capacity and refillPerMillis from its rule. The data model is the class; the only question left is where the record lives and who is allowed to touch it at once.

The single-node core

Before going distributed, get one box right. Two design decisions matter:

Refill lazily. No background thread topping up millions of buckets. When a request arrives, compute how many tokens should have dripped in since the last visit, then decide. A bucket nobody calls costs nothing.

Make the check atomic. Many threads hit the same bucket at the same instant. The check-and-spend must be one indivisible step, or two threads will both grab the last token and your "limit" leaks under exactly the load it exists for.

Here is the whole mechanism as a strip of time — a burst empties the bucket, the next request is refused with a Retry-After computed from the token deficit, and a moment later the lazy drip lets one more through:

That Retry-After ≈ 0.2s is not a guess — it's (1 − tokens) / refillPerSecond, the exact wait until one whole token exists. Handing the client that number is the difference between well-behaved back-off and a retry storm.
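The deficit formula is small enough to write out. In this sketch the class and method names are hypothetical; the rounding step reflects that the delay form of the Retry-After header carries whole seconds, so a fractional wait has to be rounded up before it goes on the wire.

```java
/** Retry-After from the token deficit: the wait until one whole token exists. */
final class RetryAfter {

    /** (1 - tokens) / refillPerSecond, or zero if a token is already available. */
    static double secondsUntilNextToken(double tokens, double refillPerSecond) {
        if (tokens >= 1.0) {
            return 0.0;
        }
        return (1.0 - tokens) / refillPerSecond;
    }

    /** The header value: whole seconds, rounded up so the client never retries too early. */
    static long headerValue(double tokens, double refillPerSecond) {
        return (long) Math.ceil(secondsUntilNextToken(tokens, refillPerSecond));
    }
}
```

An empty bucket refilling at 5 tokens per second gives (1 − 0) / 5 = 0.2 s, which the header rounds up to 1. Clients that need the sub-second figure would have to get it from a separate, non-standard field.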

TokenBucket.java

package dev.fiveyear.ratelimit;
 
/** Lazy, thread-safe token bucket. O(1) memory, no background refill thread. */
public final class TokenBucket {
 
    private final long capacity;
    private final double refillPerMillis;
    private double tokens;
    private long lastRefillMillis;
 
    public TokenBucket(long capacity, double refillPerSecond) {
        this.capacity = capacity;
        this.refillPerMillis = refillPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefillMillis = System.currentTimeMillis();
    }
 
    public synchronized boolean tryAcquire() {
        refill();
        if (tokens < 1.0) {
            return false;
        }
        tokens -= 1.0;
        return true;
    }
 
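    private void refill() {
        long now = System.currentTimeMillis();
        // Wall-clock time can step backwards (e.g. an NTP adjustment); never refill a negative amount.
        long elapsedMillis = Math.max(0, now - lastRefillMillis);
        tokens = Math.min(capacity, tokens + elapsedMillis * refillPerMillis);
        lastRefillMillis = now;
    }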
}

That synchronized on tryAcquire() is the entire correctness story at this altitude: refill, check, and spend happen as one step. It's where the functional requirement ("allow or deny, per client") and the non-functional one (accuracy under concurrency: two threads never both spend the last token) are both cashed. Notice the cost we didn't pay: the critical section is one in-memory arithmetic step, not a lock held across a network, so we keep correctness without surrendering low latency. That balance is the whole game, and it's the same instinct that picks an atomic conditional UPDATE over a distributed lock in the seat-booking design. Why the atomic step must be that way (and what breaks when it isn't) gets a proper, diagram-by-diagram treatment in the low-level design.

Wiring it into the request path is one interceptor — one bucket per client, enforced before any controller runs:

RateLimitInterceptor.java

package dev.fiveyear.ratelimit;
 
import jakarta.servlet.http.HttpServletRequest;
import jakarta.servlet.http.HttpServletResponse;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import org.springframework.http.HttpStatus;
import org.springframework.stereotype.Component;
import org.springframework.web.servlet.HandlerInterceptor;
 
@Component
public class RateLimitInterceptor implements HandlerInterceptor {
 
    private final Map<String, TokenBucket> buckets = new ConcurrentHashMap<>();
 
    @Override
    public boolean preHandle(HttpServletRequest request, HttpServletResponse response, Object handler) {
        String clientId = request.getRemoteAddr();
        TokenBucket bucket = buckets.computeIfAbsent(clientId, key -> new TokenBucket(20, 20));
 
        if (bucket.tryAcquire()) {
            return true;
        }
        response.setStatus(HttpStatus.TOO_MANY_REQUESTS.value()); // 429
        response.setHeader("Retry-After", "1");
        return false;
    }
}

WebConfig.java

package dev.fiveyear.ratelimit;
 
import org.springframework.context.annotation.Configuration;
import org.springframework.web.servlet.config.annotation.InterceptorRegistry;
import org.springframework.web.servlet.config.annotation.WebMvcConfigurer;
 
@Configuration
public class WebConfig implements WebMvcConfigurer {
 
    private final RateLimitInterceptor rateLimitInterceptor;
 
    public WebConfig(RateLimitInterceptor rateLimitInterceptor) {
        this.rateLimitInterceptor = rateLimitInterceptor;
    }
 
    @Override
    public void addInterceptors(InterceptorRegistry registry) {
        registry.addInterceptor(rateLimitInterceptor).addPathPatterns("/api/**");
    }
}

Going distributed: when one box becomes ten

Here's where the design gets interesting. Scale out to three gateway instances, each with its own in-memory buckets, and watch your limit quietly triple:

The fix is to give the fleet one shared jar: move the bucket's state into a store every instance talks to — Redis is the standard pick.

The same atomicity rule follows the state wherever it goes. Naïve GET then SET from the gateway re-creates the two-threads-one-token race, just across the network — so the refill-check-spend runs inside Redis as a single server-side script. In practice you rarely hand-roll this: Bucket4j with a Redis or Hazelcast backend drops in, and your interceptor doesn't change a line — only the bucket behind it moves.

Which datastore — and why

Why Redis and not, say, your SQL database? Three reasons that line up exactly with the job. First, atomic counters: INCR/INCRBY bump a per-window count in one indivisible step, so the two-requests-one-token race never opens. Second, TTL for free: an EXPIRE on the counter key means each window's count cleans itself up — no sweeper, no leak. Third, latency: Redis is in-memory and fast enough to sit on the hot path, where a relational DB — a round-trip plus a row lock on the same hot key under load — would be both too slow and a bottleneck. The store's natural fit (atomic counters + TTL + speed) is what picks Redis here; it isn't habit.

Decide now what happens when Redis is down. Fail-open (let everyone through) risks the very overload you built this for; fail-closed (reject everyone) turns a cache hiccup into a full outage. Most APIs fail open and alarm loudly — but the wrong answer in an interview is not having one.

Who exactly are you limiting?

The last quiet design decision: what's the key? Per IP punishes whole offices and campuses behind one NAT, and a hostel full of students becomes "one abusive client". Per API key or user id is fairer, but only exists after authentication — which is itself a thing you must rate limit, by IP, because that's all you have pre-login. Real systems layer both: a generous IP-level shield in front, precise per-user limits behind it.

Scaling the limiter, one bottleneck at a time

The junior instinct is to reach for a Redis cluster on day one. The senior move is the opposite — start with the cheapest thing that's correct, and add a rung only when a named bottleneck forces it. The limiter climbs a short ladder:

One node, in-memory. The single-node core above. Correct and sub-microsecond for a single instance — and plenty until you run a second instance and the count starts multiplying.
One shared Redis. The distributed fix: move bucket_state into a store every node talks to. A single Redis instance does hundreds of thousands of atomic ops a second — more than most APIs will ever ask of it. When one instance's throughput or memory finally bites…
Replicate, then partition the keyspace. A replica gives you failover and can serve read-only lookups such as remaining-quota queries; the allow/deny check itself is a write, so it still goes to the primary. When the primary's write throughput saturates, shard the keys across a Redis cluster: buckets hash by <principal>:<route>, so millions of distinct clients spread cleanly across nodes. Each key lives on exactly one shard, so its atomic update stays atomic; no cross-shard coordination is ever needed for a single bucket.
Migrate stores — the last resort. Only if the workload outgrows even a cluster (say you need windowed analytics on limiter decisions) do you reach for a different engine. You will be surprised how rarely rung 4 arrives.

The write hot-key is a different axis. Sharding spreads a million clients beautifully, but it cannot split one client. A single abusive API key hammering 50k requests a second is one key on one shard — a hot key no partitioning can divide. The fix isn't more shards; it's to stop sending every decrement to Redis. Each node keeps a small local bucket that leases a slice of the global budget — grab, say, 200 tokens from Redis in one atomic INCRBY, then spend them locally with zero network hops, and only return to Redis when the lease runs dry. You trade a little precision (the global count can overshoot by one lease per node) for surviving the stampede — the same "make the losing path cheap, don't demand perfect global truth on every request" bargain the ticket-booking waiting room strikes at scale.
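The leasing idea can be sketched without a real Redis. Below, SharedBudget is a stand-in for the shared store (an AtomicLong plays the role of the single atomic Redis operation), and LeasingBucket is the per-node local bucket. Both names are hypothetical, and the sketch deliberately leaves out refill and lease expiry to keep the lease mechanics visible.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Stand-in for the shared store; in production this would be one atomic operation in Redis. */
final class SharedBudget {
    private final AtomicLong remaining;

    SharedBudget(long tokens) {
        this.remaining = new AtomicLong(tokens);
    }

    /** Atomically takes up to `want` tokens and returns how many were granted. */
    long lease(long want) {
        while (true) {
            long current = remaining.get();
            long granted = Math.min(want, current);
            if (granted == 0) {
                return 0;
            }
            if (remaining.compareAndSet(current, current - granted)) {
                return granted;
            }
        }
    }
}

/** Per-node bucket that spends a leased slice locally and only calls the store when it runs dry. */
final class LeasingBucket {
    private final SharedBudget shared;
    private final long leaseSize;
    private long local;       // tokens leased but not yet spent
    private int sharedCalls;  // how many times we went to the store (for illustration)

    LeasingBucket(SharedBudget shared, long leaseSize) {
        this.shared = shared;
        this.leaseSize = leaseSize;
    }

    synchronized boolean tryAcquire() {
        if (local == 0) {
            sharedCalls++;
            local = shared.lease(leaseSize);
            if (local == 0) {
                return false; // the global budget is exhausted
            }
        }
        local--;
        return true;
    }

    synchronized int sharedCalls() {
        return sharedCalls;
    }
}
```

With a budget of 500 and a lease size of 200, a node serves 500 requests with three successful leases (200, 200, 100), and the fourth store call is the one that returns the denial. That is four round trips where the per-request design would have made 501.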

When the limiter itself fails

We flagged it in the callout above; here's the stance, spelled out. When the shared store — Redis — is unreachable, you have two choices, and you must pick before it happens. Fail open: let traffic through. For most APIs this is the right call — availability over strictness, because a brief gap where a few callers exceed their limit is far cheaper than your whole front door rejecting everyone. To keep that gap from turning into a free-for-all, each node can fall back to its local in-process bucket (the single-node core from earlier) — looser than the shared count, but it caps the blast radius until Redis comes back.

The opposite choice, fail closed (reject when the limiter is down), is right only for security-critical endpoints — login, OTP, payments — where letting an attacker through unmetered is worse than a brief outage. Same machinery, opposite default, chosen per endpoint.

But Redis going down is only one of the ways this design meets trouble, and a finished design says what happens as each box dies. Go component by component:

The shared counter store dies (Redis down). The counter is the source of truth, so you can't degrade to a slower correct answer the way a cache would — there's no slower correct answer. Instead you degrade to an approximate one: each node falls back to its local in-process bucket and fails open (or closed, per endpoint). Looser than the global count, but the front door stays up. This is the deliberate inversion of the cache pattern: normally an optimization degrades to the truth; here the truth is briefly gone, so you fall back to a local optimization.
The store is slow, not down. The subtler killer. A Redis that answers in 300ms instead of 300µs would add its latency to every request — the limiter becomes the outage. So the check runs under a tight timeout (a few milliseconds); a check that doesn't answer in time is treated as a miss and falls through to the local bucket. Slow is handled exactly like down.
A gateway node dies. Nothing to recover — the nodes are stateless, all the bucket state lives in the shared store. The load balancer stops routing to the dead node and the survivors carry the traffic against the same counters, so a client's limit is unaffected by which node it lands on.
The rule/config store dies. Rules are read constantly but change almost never, so every node caches them in memory at boot. If the config store is unreachable, the limiter keeps enforcing the last-known rules indefinitely and just can't pick up an edit until it returns — a config outage is a non-event, not a stoppage.
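The "slow is handled exactly like down" rule from the list above can be sketched as a wrapper around the shared check. GuardedCheck and its parameters are hypothetical names; the shared check and the local fallback are passed in as plain functions so the example does not assume any particular Redis client.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.Supplier;

/** Runs the shared-store check under a tight timeout; slow or failed checks fall back locally. */
final class GuardedCheck {

    static boolean allow(Supplier<Boolean> sharedCheck, BooleanSupplier localFallback, long timeoutMillis) {
        try {
            // The shared store answered in time: its decision wins, allow or deny.
            return CompletableFuture.supplyAsync(sharedCheck).get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return localFallback.getAsBoolean();
        } catch (Exception e) {
            // Timeout or store error: treat slow exactly like down.
            return localFallback.getAsBoolean();
        }
    }
}
```

For a fail-open endpoint the fallback would be the node's in-process bucket; for a fail-closed one it could simply return false. A production version would more likely rely on the client library's own command timeout than on a thread hop per request, so read this as the shape of the policy rather than a tuned implementation.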

The pattern underneath all four: a dependency that holds the truth (the counter store) gets a fail-fast timeout and a degraded local fallback; a dependency that's merely config (the rules) is cached so its outage is invisible; and the request-serving nodes are kept stateless so losing one costs nothing. Designing for failure isn't preventing every outage — it's deciding, in advance, how the limiter bends instead of breaking.

Trade-offs — each one keeping an NFR

The last column is the point: every choice is accountable to one of the non-functional requirements from earlier. That's what designing with them looks like.

Decision	The tempting alternative	Why ours wins	Keeps
token bucket over fixed window	fixed window (reset each clock tick)	no boundary hole where a client gets 2× the limit; bursts are capped at capacity	accuracy
shared counter in Redis	a local counter per node	one source of truth — the limit doesn't multiply as you add boxes	scalability
local fallback bucket	always read the shared store	the check stays in-memory and fast, and survives a Redis outage	latency, availability
fail-open (most APIs)	fail-closed everywhere	a limiter outage degrades instead of taking the whole API down	availability

The interview corner

Clarify before you draw: Where does the limiter sit — an edge gateway, service middleware, or a client-side library (each moves the state and the blast radius)? What is the key — IP, API key, or user id, and is traffic pre- or post-authentication? What's the failure stance — does this endpoint fail open (availability) or closed (security)? And is it one global limit or a per-region one (this decides whether you can keep the counter in a single store)? Naming these before you touch a box is the first senior move.

The follow-up ladder:

"A single key must satisfy two limits at once — 20 requests a second and 1,000 an hour. How?" Keep two buckets per key — a small, fast-refilling one for the burst cap and a large, slow one for the hourly ceiling — and deny the instant either runs dry. The check stays O(1) (two arithmetic steps, not one), the same two token-bucket knobs compose with no new machinery, and you get a short fuse guarding a long one — the shape every "60/min and 1,000/day" tier is really built from.
"A search call is 50× more expensive than a health check — same limit for both?" No — make tokens a cost, not a count: a request spends weight tokens (tryAcquire(5) for search, 1 for a cheap read). The bucket now meters work, not calls, and the same two knobs still tune it.
"You run in five regions but the limit is global; Redis lives in one of them." A strong global limit would cost a cross-region round trip on every request — usually not worth it. The pragmatic answer is a per-region limit (local Redis per region, each region's cap set so their sum ≈ the global budget), accepting bounded slack; reach for a single global store with async reconciliation only when the limit is contractual and the latency is affordable.
"A client asks why they were throttled — what do you return?" Make the limiter observable: X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset on every response, plus the Retry-After on the 429 computed from the token deficit. A limiter that can't explain itself generates support tickets and retry storms.
"Millions of clients each get a bucket — doesn't that leak memory?" Yes, if you never expire them. In Redis, every bucket_state key carries a TTL so an idle client's bucket evicts itself; a local fallback map needs a bounded, LRU-style eviction. This is the leak the low-level design hunts down class by class — the same read-modify-write and cleanup concerns an LRU cache faces.

Mistakes that fail the round: doing a network GET then SET from the gateway (you've rebuilt the two-threads-one-token race across the wire — the whole reason the update must run inside the store as one atomic step); keying only by IP (you punish every office behind a NAT, and an attacker just rotates addresses); and never deciding a failure policy — a limiter that silently fails closed turns a Redis hiccup into a full-site outage.
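The first two follow-ups (two limits on one key, and tokens as a cost) compose into one small sketch. CostBucket and DualLimiter are hypothetical names, and the time is passed in as nowMillis to keep the example deterministic. The detail worth noticing is the order: check both buckets before spending from either, so a denial never burns tokens.

```java
/** Token bucket that spends a variable cost per request; the caller supplies the time. */
final class CostBucket {
    private final double capacity;
    private final double refillPerMillis;
    private double tokens;
    private long lastRefillMillis;

    CostBucket(double capacity, double refillPerSecond, long nowMillis) {
        this.capacity = capacity;
        this.refillPerMillis = refillPerSecond / 1000.0;
        this.tokens = capacity;
        this.lastRefillMillis = nowMillis;
    }

    void refill(long nowMillis) {
        long elapsed = Math.max(0, nowMillis - lastRefillMillis);
        tokens = Math.min(capacity, tokens + elapsed * refillPerMillis);
        lastRefillMillis = nowMillis;
    }

    boolean has(double cost) {
        return tokens >= cost;
    }

    void spend(double cost) {
        tokens -= cost;
    }
}

/** Two buckets per key: a short fuse (burst) guarding a long one (ceiling). */
final class DualLimiter {
    private final CostBucket burst;
    private final CostBucket ceiling;

    DualLimiter(CostBucket burst, CostBucket ceiling) {
        this.burst = burst;
        this.ceiling = ceiling;
    }

    synchronized boolean tryAcquire(double cost, long nowMillis) {
        burst.refill(nowMillis);
        ceiling.refill(nowMillis);
        // Check both before spending either, so a denial leaves both buckets untouched.
        if (!burst.has(cost) || !ceiling.has(cost)) {
            return false;
        }
        burst.spend(cost);
        ceiling.spend(cost);
        return true;
    }
}
```

A "20 per second and 1,000 per hour" key would be new DualLimiter(new CostBucket(20, 20, now), new CostBucket(1000, 1000.0 / 3600, now)). A search call weighted at 5 then drains the burst bucket after four calls, while twenty cheap reads fit in the same budget.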

Where to go from here

You now have the whole high-level shape: front door, token bucket, split state, explicit failure policy, layered keys.

Open the box. The natural next step is the companion article — Building a Rate Limiter: The Low-Level Design — which builds every class inside that limiter box: the race that breaks the naive bucket, the fix, the per-client registry, and the memory leak almost everyone forgets.
The leaky bucket — token bucket's cousin that smooths output to a steady drip instead of admitting bursts, the right default when a downstream (an SMS gateway, a legacy mainframe) needs an even flow rather than a bounded average.
Where the limiter grows up — the same metering instinct, one altitude bigger, becomes the virtual waiting room in the ticket-booking design, where a stampede for one show is admitted in fair batches.

Next time an app tells you "try again in 30 seconds", you'll see the whole machine behind the message: a jar, a steady drip of coins, and a very firm grandmother.
