HLD · intermediate

Load Balancer HLD: The Front Door That Never Sends You to a Dead Server

A load balancer system design: balancing algorithms (round-robin, least-connections, consistent hashing), health checks, L4 vs L7, and keeping the balancer itself from being a single point of failure.

By fiveyearsdev · June 14, 2026 · 16 min read

"Design a load balancer." It sounds like a one-liner — "spread requests across servers" — but it's the component every other system on this site quietly leans on, and the whole interview lives in one small, concrete question: a pool of servers, one request just landed — which server gets it, and how do you know that server is even alive? Answer that well and three hard jobs fall out of the same sentence: distribute load so no box melts while its neighbours idle, detect failure so a dead backend never sees a request, and — the one everybody forgets — keep the balancer itself from becoming the single point of failure it was put there to remove.

Let's start nowhere near a computer

Picture the host at a busy restaurant door. Guests arrive; the host seats them across the waiters. A good host doesn't pile every table on one waiter — they spread guests evenly (round-robin), or send the next party to the waiter with the fewest tables right now (least-connections), and they skip a waiter who's on break (a health check). And if the host themselves steps away, a backup host takes the podium so the door is never unattended.

Swap the nouns: the host is the load balancer, the waiters are the backend pool, "spread evenly / send to the least busy" are balancing algorithms, "skip the waiter on break" is a health check, and "a backup host" is LB redundancy. That's the whole design — one address out front, a pool behind it that can grow, shrink, and fail without the diners noticing.

Where this exact shape shows up

Nginx, HAProxy, AWS ELB/ALB, Envoy — every web stack has one (often several layers).
API gateways, service meshes, database read-replica routers — same pick-a-healthy-backend job.
It's the front of nearly every HLD here — the Cricinfo edge, the WhatsApp gateway, the Amazon API tier all sit behind one.

Step 1 — Functional requirements (sentences first)

Accept incoming requests on one stable address and forward each to a backend in the pool.
Distribute requests across backends by a configurable strategy.
Health-check backends and route only to healthy ones.
Support sticky routing when a request must reach a specific backend (session/cache affinity).
Let the pool change — backends added or removed — with no client impact.

The load-bearing verbs are "distribute by a strategy" and "route only to healthy ones." They are the system.

Step 2 — Non-functional requirements

Features tell you what; the non-functional requirements tell you how well, and here they're peculiar — because the LB is the one component whose whole job is a non-functional requirement (availability) for everyone else. Name them with the canonical terms:

Low latency. The LB sits on every request, so its own added overhead must be a rounding error — sub-millisecond for an L4 hop. This is the constraint that forbids a per-request database lookup and pushes almost all state into memory.
High availability. The LB makes the backends highly available by routing around dead ones — and must not become a single point of failure itself. Both halves matter; a redundant balancer in front of a redundant pool.
Scalability. The pool must grow, shrink, and fail with no client impact, and the balancer tier must scale horizontally too when one box can't absorb the connection rate.
Consistency — but of the view, not of data. The LB stores almost nothing durable, so "consistency" here means the pool + health view: with many balancers, that view is eventually consistent (a backend one LB just marked down may still get a request from a lagging peer for a beat — harmless, it fails fast and retries). What must be exactly right is the local read: never route to a backend this LB believes is dead.
Durability — the one that barely applies. The LB is stateless on the hot path: it forwards and forgets. Kill a balancer mid-request and there is nothing to lose, because the only durable truth (pool membership) lives in a config store, not in the LB. Naming durability and then explaining why it's nearly a non-issue is itself the senior move.

Listing them is the easy half; the design only earns them if it fulfills them:

Requirement	How this design fulfills it
Low latency	LB is stateless on the hot path; least-conn counts kept local, in RAM — Step 3
High availability (backends)	health checks eject failing backends; strategies route to the live — Steps 4–5
High availability (the LB)	redundant LBs behind a floating VIP / anycast / DNS — Step 7
Scalability	add backends and they're picked up; the LB tier replicates behind a VIP — Step 9
Consistency (of the view)	pool/rules in a watched config store; a stale health view fails fast, not wrong
Stickiness without a DB	consistent hashing maps a key to a stable backend — Step 4

Every trade-off below is chosen to keep one of these.

Step 3 — The data model (what little state an LB keeps)

Most HLDs start here with tables of durable rows. The load balancer's data model is the interesting opposite: almost nothing is durable, and where each piece lives is the whole design decision. Circle the nouns — pool, backend, route rule, health state, in-flight count — and the question for each is not "what columns" but "which store, and can it be on the hot path?"

the LB's state, and where it lives

pool         (id, name, strategy, listener :443)       -- config store, watched
backend      (id, pool_id → pool, address, weight)     -- config store, watched
route_rule   (id, pool_id → pool, host/path/header)    -- config store (L7 only)
health_state (backend_id, healthy, consec_ok, last_probe)  -- in each LB's RAM
in_flight    (backend_id → current connection count)   -- in each LB's RAM, local

Which stores, and why. There is no single "database" here — there are three homes, each chosen by an access pattern:

Pool membership and routing rules → a config store / service registry (etcd, Consul, or ZooKeeper). This data changes rarely, yet every balancer must agree on it, so a watch-based, strongly consistent store is the exact fit. A change (add a backend, shift a weight) is pushed to every LB, which applies it to its in-memory copy, so the hot path never queries the store.
Health state → in each LB's own memory, produced by its own probe loop and gossiped across the fleet. This is where the "consistency of the view" from Step 2 lives: two balancers can briefly disagree about one backend, and that's fine.
In-flight counts → local RAM, deliberately not shared. Here's the subtle senior beat: least-connections counts are per-balancer, never a global tally. Sharing them would add a network round-trip to every routing decision and blow the low-latency budget. So each LB balances only its own slice of traffic, which is why, with many balancers, least-connections is an approximation, and "power of two choices" (sample two backends, pick the idler) is often preferred to a true global minimum.

Metrics and access logs, meanwhile, stream off the hot path to a time-series / OLAP store (Prometheus, ClickHouse), never inline.

The numbers are what make this non-negotiable, not taste. An L4 forwarding decision is tens of microseconds of in-memory work; a round-trip to a shared store on the same network is ~0.5 ms even when it's healthy. Put a lookup on the hot path and you haven't slowed routing by a few percent — you've made a network hop the dominant cost of every request, inflating the per-request budget by an order of magnitude and coupling your availability to that store's. That's the whole reason every byte of routing state is already sitting in local memory before the request arrives.
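
The "power of two choices" idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the article's implementation: CountedBackend and PowerOfTwoChoices are hypothetical names, the backend type is reduced to an id and a local in-flight count, and the Random is injected only so the behaviour is repeatable.

```java
import java.util.List;
import java.util.Random;

/** Hypothetical, reduced backend: just an id and this LB's local in-flight count. */
final class CountedBackend {
    final String id;
    int inFlight = 0;

    CountedBackend(String id) { this.id = id; }
}

/** Power of two choices: sample two distinct backends, keep the less loaded one. */
final class PowerOfTwoChoices {
    private final Random random;

    PowerOfTwoChoices(Random random) { this.random = random; }

    CountedBackend pick(List<CountedBackend> healthy) {
        if (healthy.isEmpty()) throw new IllegalStateException("no healthy backend");
        if (healthy.size() == 1) return healthy.get(0);

        int i = random.nextInt(healthy.size());
        int j = random.nextInt(healthy.size() - 1);
        if (j >= i) j++;                       // shift so the second sample differs from the first

        CountedBackend first = healthy.get(i);
        CountedBackend second = healthy.get(j);
        // Only local counters are compared: no shared tally, no network hop.
        return first.inFlight <= second.inFlight ? first : second;
    }
}
```

Because the two samples are always distinct, the single busiest backend in a pool can never win a comparison, which is most of the benefit of a full minimum scan at a fraction of the cost.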

Step 4 — The strategies

The core decision — which backend gets the next request — has three classic answers, each right for a different workload.

Backend.java + Strategy.java

package dev.fiveyear.lb;
 
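/** One server in the pool. `healthy` is flipped by health checks; `inFlight`
 *  tracks current connections for least-connections balancing. Plain fields keep
 *  the sketch short; with a concurrent probe loop they would need to be
 *  volatile/atomic. */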
final class Backend {
    final String id;
    final int weight;
    boolean healthy = true;
    int inFlight = 0;
 
    Backend(String id, int weight) { this.id = id; this.weight = weight; }
}

the balancing strategies

package dev.fiveyear.lb;
 
import java.util.Comparator;
import java.util.List;
 
/** How to pick one backend from the healthy set. */
interface Strategy {
    Backend pick(List<Backend> healthy);
}
 
/** Round-robin: hand requests out in a rotating cycle — even spread, ignores load. */
class RoundRobin implements Strategy {
    private int next = 0;
    public Backend pick(List<Backend> healthy) {
        Backend b = healthy.get(Math.floorMod(next, healthy.size()));
        next++;
        return b;
    }
}
 
/** Least-connections: send the next request to the backend with the fewest
 *  in-flight connections; self-correcting when requests have uneven cost. */
class LeastConnections implements Strategy {
    public Backend pick(List<Backend> healthy) {
        return healthy.stream().min(Comparator.comparingInt(b -> b.inFlight)).orElseThrow();
    }
}
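
The Backend class above carries a weight that neither strategy reads. One common way to honour weights is smooth weighted round-robin; the sketch below is an assumption about how that could look here, not part of the original listing. WeightedBackend and SmoothWeightedRoundRobin are hypothetical names, and the backend type is redefined locally so the block stands alone.

```java
import java.util.List;

/** Hypothetical stand-in for the article's Backend, reduced to what weighting needs. */
final class WeightedBackend {
    final String id;
    final int weight;   // static share from config
    int current = 0;    // running score used by the smooth algorithm

    WeightedBackend(String id, int weight) { this.id = id; this.weight = weight; }
}

/** Smooth weighted round-robin: on each pick every backend gains its weight,
 *  the highest score wins, and the winner pays back the total weight. */
final class SmoothWeightedRoundRobin {
    WeightedBackend pick(List<WeightedBackend> healthy) {
        if (healthy.isEmpty()) throw new IllegalStateException("no healthy backend");
        int total = 0;
        WeightedBackend best = null;
        for (WeightedBackend b : healthy) {
            b.current += b.weight;
            total += b.weight;
            if (best == null || b.current > best.current) best = b;   // ties keep the earlier backend
        }
        best.current -= total;
        return best;
    }
}
```

With weights 5, 1, 1 for backends a, b, c, seven consecutive picks come out as aabacaa: the heavy backend gets five of seven requests, but they are interleaved rather than sent in one burst.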

The picks, in one line each: round-robin for uniform requests (cheap, perfectly even); least-connections when request cost varies wildly (it self-corrects — a backend stuck on a slow request stops receiving new ones); consistent hashing when you need affinity (the same user or key always lands on the same backend, keeping its cache warm) — the same ring trick as the Amazon cart, here for stickiness without a session database.

That third one earns a closer look, because "the same key lands on the same backend" is easy to say and easy to get catastrophically wrong. The naive version, backend = hash(key) % N, is a trap: the day you add or remove one backend, N changes and almost every key remaps at once (going from N to N+1 backends moves roughly N/(N+1) of the keys), cold-flushing every backend cache in a single deploy. Consistent hashing fixes this by placing both keys and backends on a ring (a 2^32 hash space) and mapping each key to the first backend clockwise from the key's position.

Now removing a backend remaps only the slice of keys it owned — the keys between it and its neighbour — while every other key stays put. Add a backend and it steals only one arc from one neighbour. Two refinements make it production-grade: each physical backend is hashed to many points on the ring (virtual nodes, typically 100–200) so load spreads evenly and a departing node's keys scatter across all survivors instead of dumping onto one neighbour; and when one key is far hotter than the rest — a celebrity, a whale tenant — you cap a node's share with bounded-load consistent hashing, spilling overflow to the next node (the hot-key axis we return to in Step 9). This buys the FR "sticky routing" and the NFR "stickiness without a session DB" — and it deliberately spends nothing on latency, because the ring is a pure in-memory computation, no lookup.
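
A minimal sketch of the ring with virtual nodes, under stated assumptions: ConsistentHashRing is a hypothetical class, the "id#n" naming of virtual nodes is an illustrative convention, and the hash (FNV-1a plus an extra mixing step) is just one reasonable choice, not something the design prescribes.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical consistent-hash ring: a sorted map from ring position to backend id. */
final class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final int virtualNodes;

    ConsistentHashRing(int virtualNodes) { this.virtualNodes = virtualNodes; }

    /** Place the backend at `virtualNodes` points; on a position collision the earlier owner stays. */
    void add(String backendId) {
        for (int v = 0; v < virtualNodes; v++) ring.putIfAbsent(hash(backendId + "#" + v), backendId);
    }

    /** Remove only the points this backend actually owns. */
    void remove(String backendId) {
        for (int v = 0; v < virtualNodes; v++) ring.remove(hash(backendId + "#" + v), backendId);
    }

    /** First point at or after the key's position, wrapping around to the start of the ring. */
    String route(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no backend on the ring");
        Map.Entry<Integer, String> owner = ring.ceilingEntry(hash(key));
        return (owner != null ? owner : ring.firstEntry()).getValue();
    }

    /** 32-bit FNV-1a followed by an avalanche step; positions are signed ints, still a ring. */
    static int hash(String s) {
        int h = 0x811C9DC5;                                 // FNV-1a offset basis
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xFF);
            h *= 0x01000193;                                // FNV prime
        }
        h ^= h >>> 16; h *= 0x85EBCA6B;                     // extra mixing so similar ids spread out
        h ^= h >>> 13; h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }
}
```

The property to check is the one the paragraph claims: after remove("b"), every key that was not on b still routes to the same backend, and adding b back restores the original mapping. Routing is a single in-memory TreeMap lookup, so nothing is added to the hot path.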

Step 5 — Routing and health checks

The balancer ties it together: filter the pool to the healthy backends, let the strategy pick one, and track in-flight connections so load-aware strategies work. A separate health-check loop flips backends up and down.

LoadBalancer.java

package dev.fiveyear.lb;
 
import java.util.List;
 
/**
 * Routes each request to one healthy backend by a pluggable strategy. The two
 * jobs that matter: never route to an unhealthy backend, and track in-flight
 * connections so load-aware strategies work. Acquire on route, release on done.
 */
public class LoadBalancer {
    private final List<Backend> pool;
    private final Strategy strategy;
 
    public LoadBalancer(List<Backend> pool, Strategy strategy) {
        this.pool = pool;
        this.strategy = strategy;
    }
 
    public Backend route() {
        List<Backend> healthy = pool.stream().filter(b -> b.healthy).toList();
        if (healthy.isEmpty()) throw new IllegalStateException("no healthy backend");
        Backend chosen = strategy.pick(healthy);
        chosen.inFlight++;
        return chosen;
    }
 
    public void release(Backend b) {
        if (b.inFlight > 0) b.inFlight--;
    }
 
    /** A health check marks a backend up or down; routing skips the down ones. */
    public void setHealthy(String id, boolean healthy) {
        for (Backend b : pool) if (b.id.equals(id)) b.healthy = healthy;
    }
}

The single most important line is filter(b -> b.healthy) followed by if (healthy.isEmpty()) throw: the LB would rather fail fast than hand a request to a dead server. Two loops run side by side: the fast route/response path on every request, and the slow probe loop beside it that keeps the healthy flags current.

Health checks come in two flavours, and a strong answer names both. Active checks probe on a timer, but a naive one is a trap: a shallow GET /health that returns 200 the moment the process is up will happily keep a backend in rotation while the database behind it is unreachable. A deep health check exercises the real dependency (a cheap query, a downstream ping) so "healthy" means "can actually serve." Passive checks need no probe at all — the LB watches live traffic and ejects a backend whose real error/latency rate spikes ("outlier detection"), catching the failures a synthetic probe misses.
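
Passive detection can be as small as a sliding window over real responses. The sketch below is one assumed shape for it: OutlierDetector, the window size, and the error-rate threshold are all hypothetical, and a production detector would typically track latency as well.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical passive check: watches the last `windowSize` real responses of one backend. */
final class OutlierDetector {
    private final int windowSize;
    private final double maxErrorRate;
    private final Deque<Boolean> window = new ArrayDeque<>();   // true = success
    private int errors = 0;

    OutlierDetector(int windowSize, double maxErrorRate) {
        this.windowSize = windowSize;
        this.maxErrorRate = maxErrorRate;
    }

    /** Record one live response; returns true when the backend should be ejected. */
    boolean record(boolean success) {
        window.addLast(success);
        if (!success) errors++;
        if (window.size() > windowSize) {
            boolean evicted = window.removeFirst();
            if (!evicted) errors--;                             // an old failure left the window
        }
        // Judge only a full window, so a single early error cannot eject a backend.
        return window.size() == windowSize && (double) errors / windowSize > maxErrorRate;
    }
}
```

The LB would call record once per completed request and, on true, mark the backend down exactly as a failed active probe would.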

Two more mechanics keep the up/down signal from thrashing. Hysteresis: a backend isn't marked down on one failed probe or up on one success — it flips only after N consecutive results (that's the consec_ok counter in the data model), so a single dropped packet doesn't yank a healthy server out. And slow-start: a recovered backend is readmitted with its share ramped up gradually, so it isn't flooded the instant it returns and knocked straight back over. The mirror of readmission is graceful removal — connection draining: to retire a backend for a deploy you stop sending it new connections but let its in-flight requests finish, then pull it, so a routine deploy never severs a live request.
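
The hysteresis rule reads naturally as a two-counter state machine. This is a sketch under assumptions: HealthTracker and its rise/fall thresholds are hypothetical names, and the failure counter is added alongside the consec_ok counter from the data model.

```java
/** Hypothetical per-backend health state with hysteresis. */
final class HealthTracker {
    private final int riseThreshold;   // consecutive successes needed to mark a backend up
    private final int fallThreshold;   // consecutive failures needed to mark a backend down
    private boolean healthy = true;
    private int consecOk = 0;
    private int consecFail = 0;

    HealthTracker(int riseThreshold, int fallThreshold) {
        this.riseThreshold = riseThreshold;
        this.fallThreshold = fallThreshold;
    }

    /** Feed one probe result; returns the (possibly unchanged) health state. */
    boolean onProbe(boolean ok) {
        if (ok) {
            consecOk++;
            consecFail = 0;
            if (!healthy && consecOk >= riseThreshold) healthy = true;
        } else {
            consecFail++;
            consecOk = 0;
            if (healthy && consecFail >= fallThreshold) healthy = false;
        }
        return healthy;
    }
}
```

With rise = 2 and fall = 3, two failed probes followed by a success leave the backend up, three failures in a row take it down, and it needs two clean probes before it returns.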

Step 6 — L4 vs L7

A balancer works at one of two layers. L4 (transport) routes by IP/port — it forwards packets/connections without reading them, so it's blazing fast, protocol-agnostic, and cheap: a single L4 box can shovel millions of connections because it never parses a byte of payload. L7 (application) reads the HTTP request, so it can route by path, host, or header, terminate TLS, retry idempotent requests, and rewrite — at the cost of parsing (and often decrypting) every request. Real stacks layer them: a fast L4 tier spreads raw connections across a fleet of L7 balancers that do the smart, expensive routing, so only the requests that need application awareness pay for it. The one piece of per-connection state an L4 balancer does keep is its connection-tracking table (which client flow is pinned to which backend) — still in memory, still off any store, and the reason an L4 hop stays sub-millisecond.
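
What "route by path, host, or header" means in practice is a rule table evaluated per request, which an L4 balancer cannot do because it never sees these fields. The sketch below assumes a first-match-wins rule list; RouteRule, L7Router, and the pool ids are hypothetical, and header matching is left out for brevity.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical L7 rule: optional host match plus a path prefix, pointing at a pool. */
final class RouteRule {
    final String host;         // null means "any host"
    final String pathPrefix;
    final String poolId;

    RouteRule(String host, String pathPrefix, String poolId) {
        this.host = host;
        this.pathPrefix = pathPrefix;
        this.poolId = poolId;
    }

    boolean matches(String requestHost, String requestPath) {
        boolean hostOk = host == null || host.equalsIgnoreCase(requestHost);
        return hostOk && requestPath.startsWith(pathPrefix);
    }
}

/** Evaluates rules in order; the first match decides the pool. */
final class L7Router {
    private final List<RouteRule> rules;

    L7Router(List<RouteRule> rules) { this.rules = List.copyOf(rules); }

    Optional<String> poolFor(String host, String path) {
        for (RouteRule rule : rules) {
            if (rule.matches(host, path)) return Optional.of(rule.poolId);
        }
        return Optional.empty();   // no rule matched; the caller decides (for example, respond 404)
    }
}
```

Ordering the rules from most to least specific, with a catch-all "/" rule last, gives the usual behaviour: an API host goes to one pool, static assets to another, and everything else to the default web pool.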

Step 7 — The load balancer can't be a single point of failure

Put one LB in front of everything and you've just moved the single point of failure, not removed it. The fix is to run at least two LBs and front them with something that can shift traffic quickly: a floating virtual IP (the standby grabs the VIP if the active dies), anycast (the network routes to the nearest healthy LB), or DNS with health-checked records (the slowest of the three, because clients keep a cached record until its TTL expires).

The failover itself has a mechanism worth naming. With a floating VIP, the two LBs exchange heartbeats (VRRP, as keepalived implements it); when the standby misses a few beats it claims the VIP by broadcasting a gratuitous ARP, and traffic swings over in a second or two. The failure mode to pre-empt is split-brain: a network blip makes each LB think the other is dead, so both claim the VIP and answer — the cure is a tie-breaker (a fencing token or an odd-sized quorum) so only one can win. Anycast sidesteps heartbeats entirely: several LBs announce the same IP from different locations, the network routes each client to the nearest, and a dying LB simply withdraws its route so traffic reconverges on the survivors — which also makes the arrangement active/active, not just active/standby, so no capacity sits idle waiting.
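
The "misses a few beats, then claims the VIP" logic can be sketched as a small timer check. This is not VRRP and not keepalived's implementation: StandbyMonitor is a hypothetical class, time is passed in explicitly so the logic is testable, and the actual VIP takeover (the gratuitous ARP) and the split-brain tie-breaker are deliberately left outside it.

```java
/** Hypothetical standby-side failure detector for an active/standby LB pair. */
final class StandbyMonitor {
    private final long intervalMillis;          // expected gap between heartbeats
    private final int missedBeforeTakeover;     // how many missed beats count as "dead"
    private long lastHeartbeatMillis;
    private boolean holdsVip = false;

    StandbyMonitor(long intervalMillis, int missedBeforeTakeover, long nowMillis) {
        this.intervalMillis = intervalMillis;
        this.missedBeforeTakeover = missedBeforeTakeover;
        this.lastHeartbeatMillis = nowMillis;
    }

    /** Called whenever a heartbeat from the active LB arrives. */
    void onHeartbeat(long nowMillis) { lastHeartbeatMillis = nowMillis; }

    /** Called on a timer; returns true once this standby should own the VIP. */
    boolean tick(long nowMillis) {
        long missed = (nowMillis - lastHeartbeatMillis) / intervalMillis;
        if (!holdsVip && missed >= missedBeforeTakeover) {
            holdsVip = true;   // a real system would announce the VIP here, after winning the tie-breaker
        }
        return holdsVip;
    }
}
```

With a one-second interval and a threshold of three, failover lands a few seconds after the last heartbeat, which matches the "second or two" order of magnitude only if the interval is tuned well below a second.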

So the redundancy is layered: the LB makes the backends highly available, and a VIP/anycast layer makes the LB highly available. Turtles, but only two deep.

Step 8 — Trade-offs (each one keeping an NFR)

Decision	The tempting alternative	Why ours wins	Keeps
least-connections (uneven work)	always round-robin	a backend stuck on a slow request stops getting new ones	availability
local in-flight counts per LB	a shared global tally	no network hop per routing decision; approximate is good enough	low latency
deep health check	a shallow `200 /health`	"healthy" means can serve, not just process-is-up	availability
consistent hashing for stickiness	a session store lookup	affinity with no per-request DB hit	latency / stickiness
L4 in front of L7	one smart L7 tier	cheap connection spread, then smart routing where it's worth it	low latency
redundant LBs + floating VIP	a single big LB	the front door survives losing a balancer	high availability

The complete implementation

The strategies and balancer are the engine. Here's the driver that proves them — even round-robin, an unhealthy backend dropping out, least-connections picking the idlest, and a fail-fast when the whole pool is down:

Main.java — balancing + health, asserted

package dev.fiveyear.lb;
 
import java.util.List;
 
public class Main {
    public static void main(String[] args) {
        Backend a = new Backend("a", 1), b = new Backend("b", 1), c = new Backend("c", 1);
 
        // round-robin cycles evenly (release immediately so load doesn't matter)
        LoadBalancer rr = new LoadBalancer(List.of(a, b, c), new RoundRobin());
        StringBuilder seq = new StringBuilder();
        for (int i = 0; i < 6; i++) { Backend x = rr.route(); seq.append(x.id); rr.release(x); }
        assertTrue(seq.toString().equals("abcabc"), "round-robin spreads evenly (got " + seq + ")");
 
        // an unhealthy backend drops out of rotation
        rr.setHealthy("b", false);
        StringBuilder seq2 = new StringBuilder();
        for (int i = 0; i < 4; i++) { Backend x = rr.route(); seq2.append(x.id); rr.release(x); }
        assertTrue(!seq2.toString().contains("b"), "unhealthy backend is skipped (got " + seq2 + ")");
        rr.setHealthy("b", true);
 
        // least-connections sends to the least busy
        Backend p = new Backend("p", 1), q = new Backend("q", 1), r = new Backend("r", 1);
        p.inFlight = 5; q.inFlight = 0; r.inFlight = 2;
        LoadBalancer lc = new LoadBalancer(List.of(p, q, r), new LeastConnections());
        Backend pick = lc.route();
        assertTrue(pick.id.equals("q"), "least-connections picks the idlest (got " + pick.id + ")");
        assertTrue(q.inFlight == 1, "routing increments the chosen backend's in-flight count");
 
        // every backend down -> routing fails fast, doesn't pick a dead server
        LoadBalancer dead = new LoadBalancer(List.of(new Backend("z", 1)), new RoundRobin());
        dead.setHealthy("z", false);
        boolean threw = false;
        try { dead.route(); } catch (IllegalStateException e) { threw = true; }
        assertTrue(threw, "no healthy backend -> fail fast, never route to a dead one");
 
        System.out.println("ALL LOADBALANCER ASSERTIONS PASSED");
    }
 
    static void assertTrue(boolean cond, String msg) { if (!cond) throw new AssertionError(msg); }
}

Step 9 — Scaling the design, one bottleneck at a time

Each rung below is earned by a named bottleneck, not added on day one:

The backends can't keep up (rising latency, saturating CPU) → add backends; health checks pick them up and traffic rebalances with no client change. This is the LB's whole reason to exist and the cheapest rung.
One LB box saturates (it's now the bottleneck — connection rate or bandwidth exceeds a single machine) → run many LBs, fronted by DNS round-robin or anycast so the balancer layer scales horizontally too.
Smart routing is burning CPU (TLS termination + HTTP parsing dominate) → put a cheap L4 tier in front of the L7 tier, so only requests that need application-level routing pay the parsing cost.
Backend caches keep going cold as the pool resizes → consistent hashing keeps a key on the same backend, so only a slice of keys move when the pool changes size (the ring from Step 4), and caches stay warm.

The hot key is a separate axis. Everything above spreads uniform load; it does nothing for one backend that's hot because one key is hot — a celebrity's shard, one giant tenant, a viral object all hashing to the same node. No amount of adding balancers or backends splits a single key. The fixes are their own axis: bounded-load consistent hashing caps any one node's share and spills the overflow to the next node on the ring; and for a genuinely un-splittable single key, you drop affinity for that key and fan it back out (round-robin the hot object across replicas that all serve it read-only). Naming this as a different problem — hot key, not hot tier — is the senior signal, the same split the movie-ticket blockbuster draws between sharding quiet shows and metering one hot show.
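
Bounded-load consistent hashing adds one rule to the ring: skip any node that is already at its cap. The sketch below assumes the cap is ceil(loadFactor × (inFlight + 1) / nodes), which is one common formulation; BoundedLoadRing, its method names, and the hash function are hypothetical choices for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical ring that spills a key clockwise when its home node is at the load cap. */
final class BoundedLoadRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private final Map<String, Integer> load = new HashMap<>();   // in-flight requests per node
    private final double loadFactor;                              // e.g. 1.25 = at most ~125% of the mean
    private int totalLoad = 0;

    BoundedLoadRing(double loadFactor) {
        if (loadFactor < 1.0) throw new IllegalArgumentException("loadFactor must be >= 1");
        this.loadFactor = loadFactor;
    }

    void add(String nodeId, int virtualNodes) {
        load.putIfAbsent(nodeId, 0);
        for (int v = 0; v < virtualNodes; v++) ring.putIfAbsent(hash(nodeId + "#" + v), nodeId);
    }

    /** Assign one request for `key`; the caller must call release(node) when it completes. */
    String acquire(String key) {
        if (ring.isEmpty()) throw new IllegalStateException("no node on the ring");
        int cap = (int) Math.ceil(loadFactor * (totalLoad + 1) / load.size());
        Integer point = ring.ceilingKey(hash(key));
        if (point == null) point = ring.firstKey();
        for (int steps = 0; steps < ring.size(); steps++) {
            String node = ring.get(point);
            if (load.get(node) < cap) {                // home node has room, or we spilled to this one
                load.merge(node, 1, Integer::sum);
                totalLoad++;
                return node;
            }
            point = ring.higherKey(point);             // walk clockwise, wrapping at the end
            if (point == null) point = ring.firstKey();
        }
        throw new IllegalStateException("no node under the cap");
    }

    void release(String node) {
        load.merge(node, -1, Integer::sum);
        totalLoad--;
    }

    static int hash(String s) {
        int h = 0x811C9DC5;                            // FNV-1a, then extra mixing
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) { h ^= (b & 0xFF); h *= 0x01000193; }
        h ^= h >>> 16; h *= 0x85EBCA6B;
        h ^= h >>> 13; h *= 0xC2B2AE35;
        h ^= h >>> 16;
        return h;
    }
}
```

Nine concurrent requests for the same key on three nodes with a factor of 1.25 can put at most four on any one node, so the hot key is forced onto all three instead of melting its home node. The cost is that affinity for that key is now approximate, which is the trade the paragraph above describes.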

Step 10 — When a piece fails: designing for failure

Failure handling isn't a feature of a load balancer — it's the entire point.

A backend dies → health checks eject it within seconds; in-flight requests that are safe to repeat (idempotent ones) are retried on another backend. A backend is fully replaceable. The trap to avoid is a retry storm: if every client and the LB all retry a struggling backend, the retries become the overload. So retries run on a budget (cap retries at a small fraction of traffic) behind a circuit breaker that stops sending to a failing backend entirely for a cool-off.
The active LB dies → the floating VIP / anycast shifts traffic to the standby, which already holds the same watched pool and its own health view. The front door fails over in a second or two; clients barely notice. (The split-brain guard from Step 7 is what stops two LBs from both claiming the VIP during a network blip.)
A backend is slow, not dead (a false "healthy") → passive outlier detection ejects it on rising error/latency, and the circuit breaker stops hammering it. Don't trust a binary, shallow health bit alone — this is why Step 5 pairs deep active checks with passive detection.
A backend flaps (probes oscillate up/down every few seconds) → hysteresis (the consec_ok threshold) refuses to move it until N results agree, and readmission uses slow-start and backoff, so route tables don't thrash and a recovered backend isn't instantly re-flooded and re-killed.
The config store is unreachable → the LB fails static: it keeps serving on the last-known-good pool it already has in memory rather than dropping every backend. Losing the ability to learn about changes must never become losing the ability to route — the hot path never depended on the store, so an outage there is invisible to traffic.

The pattern across all five: the LB degrades along whichever axis broke — a bad backend leaves rotation, a dead LB fails over, a lost config store freezes the view but keeps routing — and never collapses, because no single dependency sits on the request path.
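
The retry budget from the first failure case can be sketched as a credit counter: each first attempt earns a fraction of a retry, and a retry is allowed only when a whole one has been earned. RetryBudget and its parameters are hypothetical, and integer credits are used on purpose to avoid floating-point drift.

```java
/** Hypothetical retry budget: retries are capped at a percentage of first attempts. */
final class RetryBudget {
    private static final int COST_PER_RETRY = 100;   // credits are hundredths of a retry
    private final int retryPercent;                  // e.g. 10 = at most ~10% extra load from retries
    private final long maxCredit;                    // cap on banked retries, to bound a burst
    private long credit = 0;

    RetryBudget(int retryPercent, int maxBankedRetries) {
        this.retryPercent = retryPercent;
        this.maxCredit = (long) maxBankedRetries * COST_PER_RETRY;
    }

    /** Call once per original (non-retry) request. */
    synchronized void onFirstAttempt() {
        credit = Math.min(maxCredit, credit + retryPercent);
    }

    /** Returns true if a retry may be sent now; false means fail the request instead. */
    synchronized boolean tryAcquireRetry() {
        if (credit < COST_PER_RETRY) return false;
        credit -= COST_PER_RETRY;
        return true;
    }
}
```

With a 10 percent budget, ten first attempts pay for exactly one retry. When a backend starts failing everything, the budget drains almost immediately and the LB stops adding load, which is the point where the circuit breaker takes over.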

The interview corner

Clarify before you draw: L4 or L7 — do we route by IP/port or by HTTP path/host/header (it decides half the design)? Where does TLS terminate — at the LB or passed through to the backends? Do requests need affinity (a warm cache, a sticky session) or is any backend equal? What are the availability and latency targets — how fast must a dead backend leave rotation, and how much overhead can the LB add? One region or global (it decides DNS vs anycast)?

The follow-up ladder (each rung is a scenario the basics don't cover):

"You're deploying — how do you take a backend out with zero dropped requests?" Connection draining: stop routing new connections to it, let in-flight requests finish (bounded by a timeout), then remove it. Never yank a backend mid-request; the deploy should be invisible to users.
"With fifty load balancers, does least-connections still balance?" Not perfectly — each LB only counts its own in-flight connections, never a global tally (a shared counter would cost a network hop per request, Step 3). At that scale you switch to power-of-two-choices: sample two random backends, send to the idler — near-optimal spread with zero coordination.
"A backend's health probe flaps up and down every few seconds — now what?" Hysteresis: require N consecutive agreeing probes before flipping state, and back off readmission with slow-start, so route tables don't thrash and a half-dead backend isn't repeatedly re-flooded. A binary bit that flips on every probe is a bug, not a feature.
"One backend is struggling and everything's retrying it — the site gets slower. Why?" A retry storm: retries pile onto the weakest node and become the overload. Cap retries with a budget (retries ≤ a few percent of traffic) and a circuit breaker that stops sending to a failing backend during a cool-off. Retries must relieve load, never amplify it.
"Consistent hashing pins one whale tenant onto a single node and melts it — fix it." That's a hot key, not a hot tier; adding nodes won't help. Use bounded-load consistent hashing to cap any node's share and spill overflow to the next node, or drop affinity for that one key and fan it across read replicas. The movie-ticket blockbuster draws the same hot-key-vs-hot-tier line.

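The connection-draining answer from the first rung above comes down to two flags and a deadline. A minimal sketch, assuming a hypothetical DrainingBackend type with time passed in explicitly; it is not the article's Backend class.

```java
/** Hypothetical backend handle that supports graceful removal. */
final class DrainingBackend {
    private int inFlight = 0;
    private boolean draining = false;

    /** Routing calls this; a draining backend refuses new work. */
    synchronized boolean tryAcquire() {
        if (draining) return false;
        inFlight++;
        return true;
    }

    /** Called when a request completes, including during draining. */
    synchronized void release() {
        if (inFlight > 0) inFlight--;
    }

    /** Step one of a deploy: stop new connections, keep existing ones alive. */
    synchronized void startDraining() { draining = true; }

    /** Step two: remove once idle, or once the drain deadline has passed. */
    synchronized boolean safeToRemove(long nowMillis, long drainDeadlineMillis) {
        return draining && (inFlight == 0 || nowMillis >= drainDeadlineMillis);
    }
}
```

The deadline is what keeps one stuck request from blocking a deploy forever: the backend is removed when it is idle or when the timeout expires, whichever comes first.
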
Mistakes that fail the round:

Putting state on the hot path — a per-request session-store lookup or a shared global connection counter. It's correct and it's fatal to the low-latency NFR; the whole art is that the LB looks nothing up to route.
One "big" load balancer with no VIP/anycast standby. You didn't remove the single point of failure, you became it — the interviewer is waiting for you to make the balancer itself redundant.
Trusting a shallow 200 /health that reports "up" while the backend can't actually serve (its own DB is down). Pair a deep active check with passive outlier detection, or you'll route traffic straight into a black hole.

Where to go from here

Bounded-load consistent hashing and power-of-two-choices are the two ideas that most repay a deeper read — they're where "just round-robin" grows up into real load management.
The consistent-hashing ring also shards the Amazon cart; nearly every HLD here sits behind a load balancer — see the rookie's guide to HLD for the method.
For what runs behind the LB at read scale, the cache discipline is in Cricinfo; for the hot-key-vs-hot-tier split at the data layer, the movie-ticket blockbuster.
