Ad Click Aggregation HLD: Counting a Firehose Without Keeping It
An ad click event aggregation system design (HLD): tumbling windows, exactly-once counting by dedup, the Count-Min Sketch for fixed-memory counts, and stream processing with a batch recompute.
"Design ad click aggregation." An advertiser stares at a dashboard: "12,847 clicks in the last minute," the number ticking up live, broken down by minute for the last hour. Behind it, five billion click events a day are pouring in — tens of thousands a second, around the clock. Two instincts will both betray you here. The first: "store every click in a table and COUNT them" — but you cannot keep five billion rows a day around to scan on every dashboard refresh. The second: "just keep a running counter" — until you need per-minute breakdowns, per-ad breakdowns, deduped against a queue that delivers some events twice, and a memory budget that doesn't grow with the number of ads.
This is the leaderboard's cousin grown wild: there you ranked what existed; here you must count what's streaming past, faster than you can store it. The answer is two ideas that feel like cheating the first time you see them — chop time into buckets, and count with a structure that's deliberately, provably a little bit wrong.
Let's start nowhere near a computer
A toll plaza on a busy highway. The manager wants "cars per minute, all day." The naive clerk photographs every car and counts the photos at night — but that's millions of photos to store and sift. The real toll plaza has a clicker that resets every minute: each minute's total is written to a logbook line, and the cars themselves are forgotten. A month later you still know "10:01 had 1,240 cars" without a single photo. Chopping the endless stream into one-minute buckets turned an impossible archive into a thin logbook.
Now a harder ask: "how many distinct toll tags came through?" There are tens of millions of tags; a ledger with one row per tag is huge. So the plaza uses a tally board with a fixed number of slots: each tag is hashed to a few slots, and those slots are bumped. To estimate a tag's count, read its slots and take the smallest. It's never an under-count (two tags sharing a slot only inflate it), the board never grows no matter how many tags appear, and for a busy tag it's close enough. Approximate, bounded, fast.
The minute-clicker is a tumbling window. The fixed tally board is a Count-Min Sketch. Hold those two and the whole system is plumbing.
Where the clicker-and-tally-board trick runs
- Ad analytics (Google Ads, Meta), metrics systems (Prometheus, Datadog), view/like counts, rate dashboards — all aggregate a firehose into windowed counts.
- Count-Min Sketch and HyperLogLog are the workhorses of stream processing: bounded-memory approximate counts and distinct-counts that ship in Redis, Spark, and Flink.
- The "approximate the tail" instinct from the leaderboard: exact where it's cheap and matters, approximate where exactness is unaffordable and nobody notices.
Step 1 — Functional requirements (sentences first)
Every HLD starts the same way: write what the system must do as plain sentences. These are your functional requirements — the features, scoped out loud (the recipe).
- A click event arrives with an event id, an ad id, and a timestamp.
- The system reports clicks per ad, bucketed by minute.
- A dashboard reads recent counts with near-real-time latency.
- The ingest queue may deliver the same event more than once.
- Volume is billions of events a day across millions of ads.
No fraud detection, billing reconciliation, or per-user targeting yet — naming that boundary out loud is the first senior move.
Step 2 — Non-functional requirements
Features tell you what to build; the non-functional requirements tell you how well — and in a counting pipeline they, not the features, are what pick the architecture. Say them out loud too:
- Low latency. The dashboard reads must feel near-real-time (a number ticking up by the second), and the ingest must keep up with a firehose — tens of thousands of events a second, all day, with no backlog growing behind it. A read that's seconds stale is fine; an ingest that falls behind is fatal.
- Consistency. The one that shapes the design — and the subtle one here: counts are eventually exact, not exact now. The stream layer is deliberately approximate in the moment (a sketch over-counts a hair, a late event hasn't landed), and a batch layer reconciles it to the true number over the retained log. Name that split explicitly: approximate-then-exact is a choice, not an accident.
- High availability. Ingest must never drop events under load. A blockbuster campaign or a traffic spike cannot be allowed to shed clicks on the floor — so the design leans on a durable queue that absorbs the spike and lets the aggregator drain it at its own pace.
- Durability. The raw event log is retained long enough to recompute every aggregate from scratch; the aggregates themselves are derived and may be rebuilt freely. Lose a windowed count and you recompute it; lose the raw log inside its retention window and the batch layer has nothing to reconcile against.
- Scalability. Billions of events a day across millions of ads — and brutal skew, where one mega-campaign produces more clicks than a million small ads combined.
Listing requirements is the easy half; the design is only good if it meets them. So here's the contract this design signs — each requirement, and the one mechanism that keeps it (every row is cashed in a step below):
| Requirement | How this design fulfills it |
|---|---|
| Low latency (dashboard read) | pre-aggregated tumbling windows + a roll-up hierarchy in an OLAP store — Steps 4, 10 |
| Low latency (ingest) | a Kafka buffer absorbs the firehose; the aggregator drains asynchronously — Step 9 |
| Eventual consistency (exact) | the lambda batch layer recomputes truth over the raw log and corrects the stream — Step 7 |
| Exactly-once counting | dedup by event id within the late-arrival horizon — Step 5 |
| High availability (ingest) | the durable queue absorbs spikes; nothing is dropped when the aggregator lags — Steps 9, 11 |
| Durability | the raw event log is retained for recompute; aggregates are derived and rebuildable — Step 3 |
| Scalability / skew | Count-Min for fixed memory across millions of ads; partition by ad id for the hot key — Steps 6, 10 |
Every trade-off we make from here on is chosen to keep one of these promises — and we'll point back at this table when we do.
Step 3 — Nouns: the event and the aggregate
Circle the nouns: ClickEvent and the Aggregate (a per-(ad, window) count). The twist, as with metrics generally: the raw event is cheap to produce and expensive to keep, while the aggregate is what you actually query. You retain raw events only briefly (for recomputation); the aggregates are the product.
-- raw events: a short-retention log / queue, NOT a query target
click_events (event_id, ad_id, ts) -- kept hours, then dropped
-- the product: windowed aggregates (this is what the dashboard reads)
ad_minute (ad_id, window_start, count) -- one row per ad per minuteThe line that earns the nod: you query the aggregates, never the raw events. Raw click rows exist only long enough to recompute aggregates if the stream layer was wrong — they are not the dashboard's data source. Pointing a dashboard at the raw event table is the canonical way to take the system down.
Which datastore — and why it isn't one. Don't say "a database" and move on; this pipeline is a deliberately multi-store design, because no single engine is good at all three jobs. First, a durable append-only log / queue (Kafka) to absorb the firehose. Writes are sequential appends; it buffers spikes the aggregator can't yet drain; and it doubles as the retained raw log the batch layer recomputes from. So it serves two requirements at once: availability (never drop an event) and durability (keep raw long enough to recompute). Second, an OLAP / time-series store (ClickHouse, Druid, or similar) for the windowed aggregates the dashboard queries. The dashboard's whole query pattern is range-scans and roll-ups over time ("sum the last 60 minutes for ad-42"), and a column-oriented analytical store is built precisely for that: it reads only the columns it needs and aggregates over huge spans cheaply. A vanilla relational DB is the wrong fit here, and saying why is the senior beat — a row-oriented OLTP engine is tuned for transactional point-reads and writes of whole rows, not for scanning and summing billions of narrow count rows per query; you'd be fighting its storage layout on every dashboard refresh. The third "store" is barely one: the raw events kept briefly in the log for recompute, then dropped. Right tool per job: a log for the firehose, an analytical store for the time-range roll-ups, and nothing more durable than retention demands. So the query pattern picks the store, not habit — and it picks against SQL.
Step 4 — Tumbling windows turn a stream into rows
Every event maps to a window by arithmetic: windowStart = floor(ts / windowSize) * windowSize. Counting is then "increment (ad, windowStart)."
/** Total count for an ad across the windows covering [fromTs, toTs]. */
public long countInRange(String adId, long fromTs, long toTs) {
long total = 0;
for (long w = windowStart(fromTs); w <= windowStart(toTs); w += windowMs) {
Map<String, Long> counts = windows.get(w); // one bucket per minute
if (counts != null) total += counts.getOrDefault(adId, 0L);
}
return total;
}A tumbling window is non-overlapping and fixed-width — the simplest and the one interviewers expect first. (Sliding windows that overlap, or session windows that close on inactivity, are the natural follow-ups; name them as variants once the tumbling version is on the board.)
But a single minute-grained table is a trap the moment the dashboard offers a "last 30 days" view: answering it from minute rows means summing 43,200 buckets per ad per query. The fix is a roll-up hierarchy — minute rows compact into hour rows, hour rows into day rows, on a schedule (or as a continuous downsample). A query then reads at the coarsest grain that covers its range: the last hour from minute rows, the last week from hour rows, the last year from day rows. Same counts, exponentially fewer to add, and the old fine-grained rows can expire once their roll-up is sealed. Because counting is associative — sum(minutes) = the hour — the hierarchy is just repeated summation, and it's the difference between a dashboard that loads instantly at every zoom level and one that times out on the year view.
Step 5 — Exactly-once, by dedup
The queue is at-least-once: a network hiccup makes a consumer re-read events it already processed. Count blindly and your numbers drift upward. The fix is the idempotency reflex — dedup by event id: count an event only the first time you see its id.
public boolean record(String eventId, String adId, long ts) {
if (!seen.add(eventId)) return false; // duplicate delivery — ignore
windows.computeIfAbsent(windowStart(ts), k -> new HashMap<>()).merge(adId, 1L, Long::sum);
return true;
}(At true scale the seen set can't be an unbounded HashSet — it becomes a bounded structure scoped per window, or a Bloom filter, since you only need dedup within the late-arrival horizon. Naming that bound is the senior note; the toy keeps the full set for clarity.)
Step 6 — Count-Min Sketch: counting in memory you choose
When you must count per-ad across millions of ads, an exact HashMap<adId, count> grows without bound. The Count-Min Sketch counts in a fixed grid: d rows × w columns. Each key is hashed by d independent functions to one cell per row; counting bumps those d cells. The estimate is the minimum of those d cells.
/** Never an under-count: collisions only ever inflate a cell, so the min is the tightest. */
public long estimate(String key) {
long min = Long.MAX_VALUE;
for (int r = 0; r < depth; r++) {
min = Math.min(min, table[r][hash(key, r)]);
}
return min;
}Two properties make this beautiful and testable. It never under-counts: every cell holding a key's count holds at least that count (other keys colliding only add to it), so the minimum is ≥ the truth. And memory is fixed: a 4×2048 sketch is the same size whether you track a thousand ads or a billion.
So how do you size that grid? The two dimensions control two different things, and that's the part worth knowing. Width w bounds the over-count's magnitude: the expected inflation of any estimate is at most the total count divided by w (collisions spread the total mass across w columns), so a wider grid makes each over-estimate smaller — w ≈ e/ε to keep error within an ε-fraction of the total. Depth d bounds the probability of being unlucky: each extra row is an independent hash, and the estimate is the min across rows, so you're wrong only if every row collided — probability shrinks like (1/2)^d, so d ≈ ln(1/δ) for a failure chance δ. In words: add columns to make the answer tighter, add rows to make tightness more certain. That's why a few KB suffices — a 4×2048 grid keeps heavy hitters within a fraction of a percent at four-nines confidence, and heavy hitters are exactly what an ad dashboard cares about. (Counting distinct things — unique users who clicked — is the sibling problem, solved by HyperLogLog: distinct-count in a couple of kilobytes. Mention it and you've named the whole sketch family.)
Step 7 — Stream now, batch later
Streaming aggregation is fast but can be slightly wrong (a dropped worker, a late event, a sketch's over-count). A periodic batch recompute over the retained raw log produces the exact numbers and corrects the stream's drift — the Lambda pattern: a speed layer for now, a batch layer for truth, merged at query.
Step 8 — Trade-offs (each one keeping an NFR)
Notice the last column: every decision is made to keep one of the promises from Step 2. That's what "design with the non-functional requirements in mind" actually looks like — not a list you wrote and forgot, but the thing each choice is accountable to.
| Decision | The tempting alternative | Why ours wins | Keeps |
|---|---|---|---|
| query aggregates, drop raw | keep all events, COUNT on read | you can't scan 5B rows/day per refresh; the minute rows are tiny | latency, durability |
| tumbling windows | one global running counter | per-minute, per-ad breakdowns fall out; a counter answers neither | latency |
| dedup by event id | trust at-least-once = exactly-once | re-delivered events stop inflating the count | consistency |
| Count-Min Sketch (per-ad) | exact HashMap per ad | fixed memory across millions of ads; over-counts a little, never under | scalability |
| durable queue (Kafka) | write straight to the store | absorbs the firehose's spikes; nothing is dropped when the aggregator lags | availability, durability |
| stream + batch (Lambda) | stream only, or batch only | speed and eventual exactness; the batch layer heals the stream's drift | latency, consistency |
The complete implementation
Two cores: the exact windowed counter (with dedup), and the approximate sketch. Both are small, and both carry a property worth asserting.
package dev.fiveyear.adclicks;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
/** Tumbling-window click counts per ad, idempotent on event id. */
public final class WindowedCounter {
private final long windowMs;
private final Map<Long, Map<String, Long>> windows = new HashMap<>();
private final Set<String> seen = new HashSet<>();
public WindowedCounter(long windowMs) {
this.windowMs = windowMs;
}
public long windowStart(long ts) {
return Math.floorDiv(ts, windowMs) * windowMs;
}
/** Record a click, ignoring a re-delivered event id. True if it counted. */
public boolean record(String eventId, String adId, long ts) {
if (!seen.add(eventId)) {
return false;
}
windows.computeIfAbsent(windowStart(ts), k -> new HashMap<>()).merge(adId, 1L, Long::sum);
return true;
}
public long countInWindow(String adId, long ts) {
Map<String, Long> counts = windows.get(windowStart(ts));
return counts == null ? 0 : counts.getOrDefault(adId, 0L);
}
public long countInRange(String adId, long fromTs, long toTs) {
long total = 0;
for (long w = windowStart(fromTs); w <= windowStart(toTs); w += windowMs) {
Map<String, Long> counts = windows.get(w);
if (counts != null) {
total += counts.getOrDefault(adId, 0L);
}
}
return total;
}
/** The per-window series a dashboard plots: [windowStart, count] per bucket. */
public List<long[]> series(String adId, long fromTs, long toTs) {
List<long[]> out = new ArrayList<>();
for (long w = windowStart(fromTs); w <= windowStart(toTs); w += windowMs) {
Map<String, Long> counts = windows.get(w);
out.add(new long[] {w, counts == null ? 0 : counts.getOrDefault(adId, 0L)});
}
return out;
}
}package dev.fiveyear.adclicks;
/** Approximate counts in fixed memory; never under-counts. */
public final class CountMinSketch {
private final int width;
private final int depth;
private final long[][] table;
private final int[] seeds;
public CountMinSketch(int width, int depth) {
this.width = width;
this.depth = depth;
this.table = new long[depth][width];
this.seeds = new int[depth];
for (int r = 0; r < depth; r++) {
seeds[r] = (r + 1) * 0x9E3779B1; // distinct, deterministic per row
}
}
private int hash(String key, int row) {
int h = key.hashCode() ^ seeds[row];
h ^= (h >>> 16);
return Math.floorMod(h, width);
}
public void add(String key, long count) {
for (int r = 0; r < depth; r++) {
table[r][hash(key, r)] += count;
}
}
public long estimate(String key) {
long min = Long.MAX_VALUE;
for (int r = 0; r < depth; r++) {
min = Math.min(min, table[r][hash(key, r)]);
}
return min;
}
}At the dashboard:
WindowedCounter counter = new WindowedCounter(60_000); // one-minute windows
long t0 = 600_000_000L; // a window boundary
counter.record("e1", "ad-42", t0 + 1_000); // minute 0
counter.record("e2", "ad-42", t0 + 2_000); // minute 0
counter.record("e1", "ad-42", t0 + 3_000); // DUPLICATE id → ignored
counter.record("e3", "ad-42", t0 + 61_000); // minute 1
counter.countInWindow("ad-42", t0 + 5_000); // 2 (e1, e2; the dup didn't count)
counter.countInWindow("ad-42", t0 + 65_000); // 1 (e3)
counter.countInRange("ad-42", t0, t0 + 120_000); // 3 (across minutes 0,1,2)
CountMinSketch sketch = new CountMinSketch(2048, 4);
sketch.add("ad-42", 1_000);
sketch.add("ad-7", 5);
sketch.estimate("ad-42"); // ≥ 1000 (here exactly 1000 — no collision at this size)
sketch.estimate("ad-7"); // ≥ 5
sketch.estimate("ad-unseen"); // 0 or a tiny collision noise — never negative, never hugeStep 9 — Only now, the boxes
- Clients fire clicks into an ingest queue (Kafka) — a durable buffer that absorbs spikes and decouples producers from the aggregator.
- A stream aggregator consumes the queue, dedups by event id, and rolls events into windowed counts in an OLAP / time-series store that the query API serves to dashboards.
- A batch layer periodically recomputes the exact counts from the retained raw log and corrects the store — the speed/accuracy split from Step 7.
Step 10 — Scaling, one bottleneck at a time
A junior over-builds on day one — a dozen partitions and a batch cluster for traffic that doesn't exist yet. The senior move is the opposite: start with the simplest thing that's correct, and add a piece only when a measured bottleneck forces you. Climb the ladder, driven by the firehose your non-functional requirements flagged:
- One aggregator, one OLAP store. Correct, simple, and plenty for modest traffic. The ingest rate is the first thing to bite — when a single consumer can't drain the queue fast enough…
- Partition by ad id. Each partition's events flow to its own aggregator instance, so one ad's counts live together and the consumers parallelize cleanly. Counting is per-key, so this shards perfectly — until the store itself is the read or write bottleneck…
- Scale the OLAP store horizontally. ClickHouse / Druid are built to add nodes and spread shards; the windowed aggregates partition by
(ad_id, time)and fan across them. When even the volume into the pipeline is the problem… - Pre-aggregate at the edge. A client or edge collector sums clicks for a few seconds before sending, slashing event volume before it ever touches the queue — counting is associative, so partial sums compose without changing the answer. And alongside scale-out, the roll-up hierarchy (minute → hour → day) is the read-side ladder: a query reads at the coarsest grain that covers its range, so the year view sums 365 day-rows, not half a million minute-rows.
Each rung is more to operate, so you earn it with a number, not a hunch — that restraint is itself the senior signal.
The hot key is a different axis. Partitioning spreads a million quiet ads beautifully, but one mega-campaign is a single hot key no partitioning can split — more clicks on one ad id than a million others combined, all hashing to one partition. That's the familiar shape: handle it with extra partitions dedicated to the hot key, or a standalone aggregator for that one campaign, so the firehose for one ad doesn't starve everyone else's counts.
Late and out-of-order events are the other scale reality: an event for 10:00 arriving at 10:05. A watermark ("I won't accept events older than 5 minutes") closes a window for live writing; anything past the watermark is left for the batch layer to mop up — which is exactly the reconciliation the lambda design already bought us.
Step 11 — When a piece fails
A design isn't finished when it works — it's finished when you can say what happens as each box dies. Go component by component, and notice how much of the answer the durable log already handed us for free:
- A stream aggregator dies mid-window. Its in-flight counts vanish — but the events that fed them are still sitting in Kafka. A replacement consumer reprocesses from the last committed offset, re-reads those events (dedup by event id keeps the replay from double-counting), and the batch layer corrects any residual drift on its next pass. The crash costs latency, not accuracy.
- The OLAP / time-series store dies. Dashboards degrade: you serve the last-known counts (slightly stale) rather than fail, and you rebuild the aggregates from the retained raw log — the store holds derived data, so it's always reconstructable. Reads bend; the source of truth was never in the store.
- Kafka is the load-bearing piece that makes both of the above safe. Because it's a durable buffer, an aggregator outage doesn't drop events (they queue up and get drained on recovery), and a store rebuild has a source to replay from. Its own resilience is replication across brokers, so a single broker loss is a non-event.
The pattern across all three is the lesson: a component that holds only derived data (the aggregator's memory, the OLAP store) degrades and rebuilds; the one component that holds the truth (the durable log) is the thing you replicate and never lose. Designing for failure isn't preventing every outage — it's deciding, in advance, how the system bends instead of breaks. The durable log is what lets it bend: every other box can die and be rebuilt from the events it kept.
The interview corner
Clarify before you draw: Real-time-ish (seconds of lag) or strictly real-time? Exact counts or is approximate acceptable (it unlocks sketches)? What window granularity — minute, hour? Is at-least-once delivery in play (it forces dedup)? Distinct counts (unique users) too, or just totals?
The follow-up ladder:
- "Billions of events — how do you count per minute?" Tumbling windows:
floor(ts / window)buckets, increment(ad, window); query the minute rows, never the raw events. Lead with this. - "The queue delivers duplicates." Dedup by event id within the late-arrival horizon — a bounded seen-set or Bloom filter per window. At-least-once + dedup = exactly-once counting.
- "Per-ad counts for millions of ads won't fit in memory." Count-Min Sketch: fixed
d×wgrid, hash todcells, estimate is the min; never under-counts, bounded memory. For distinct users, HyperLogLog. - "How do you guarantee the numbers are eventually exact?" Lambda: a fast approximate speed layer plus a batch recompute over the raw log that corrects drift, merged at query. (Kappa — stream-only with replay — is the modern alternative; name the trade.)
- "Late events at 10:05 for the 10:00 window?" Watermarks bound how late you'll accept into the live window; the batch layer reconciles anything past the watermark. Naming watermarks is the stream-processing shibboleth.
Mistakes that fail the round: querying the raw event table on the hot path; one global counter that can't break down by ad or minute; trusting at-least-once delivery as exactly-once (drifting counts); an unbounded exact map per ad; ignoring late/duplicate events entirely; no path to exactness when the stream layer is approximate.
Where to go from here
Pocket version: chop the stream into tumbling windows and store one count per (ad, minute); dedup by event id for exactly-once; count per-ad in fixed memory with a Count-Min Sketch that never under-counts; and run a batch recompute behind the stream so the numbers are fast now and exact eventually.
- Implement HyperLogLog for "unique users who clicked" and feel distinct-counting in two kilobytes — the sketch family's other half.
- Add a sliding window on top of the tumbling buckets (sum the last 60 one-second buckets) and see why granularity is a storage/precision dial.
- Zoom across: the "approximate where exactness is unaffordable" move is the leaderboard's approximate tail and the crawler's Bloom filter — bounded memory beats exact memory at scale, every time.
- Next in the queue: the newsletter service — from counting a firehose to fanning one out, where delivery, retries, and idempotency meet the queue you just met here.
A clicker that resets every minute, a tally board with a fixed number of slots, and a nightly recount to set the record straight — that's how you count five billion clicks a day without keeping a single one.