HLDintermediate

Ad Click Aggregation HLD: Counting a Firehose Without Keeping It

An ad click event aggregation system design (HLD): tumbling windows, exactly-once counting by dedup, the Count-Min Sketch for fixed-memory counts, and stream processing with a batch recompute.

By fiveyearsdevJune 13, 202616 min read

"Design ad click aggregation." You tap an ad on your phone and scroll on. Somewhere a dashboard ticks from 12,846 to 12,847 clicks in the last minute — and does it again for the other five billion taps today, tens of thousands a second, around the clock. Two instincts both betray you here. "Store every click and COUNT them": you cannot rescan five billion rows on every dashboard refresh. "Keep one running counter": it can't break down by minute, or by ad, and it happily counts your one tap twice when the queue redelivers it.

This is the leaderboard's cousin grown wild: there you ranked what existed; here you must count what's streaming past, faster than you can store it. The answer is two ideas that feel like cheating the first time you see them — chop time into buckets, and count with a structure that's deliberately, provably a little bit wrong.

Let's start nowhere near a computer

A toll plaza on a busy highway. The manager wants "cars per minute, all day." The naive clerk photographs every car and counts the photos at night — but that's millions of photos to store and sift. The real toll plaza has a clicker that resets every minute: each minute's total is written to a logbook line, and the cars themselves are forgotten. A month later you still know "10:01 had 1,240 cars" without a single photo. Chopping the endless stream into one-minute buckets turned an impossible archive into a thin logbook.

Now a harder ask: "how many distinct toll tags came through?" There are tens of millions of tags; a ledger with one row per tag is huge. So the plaza uses a tally board with a fixed number of slots: each tag is hashed to a few slots, and those slots are bumped. To estimate a tag's count, read its slots and take the smallest. It's never an under-count (two tags sharing a slot only inflate it), the board never grows no matter how many tags appear, and for a busy tag it's close enough. Approximate, bounded, fast.

The minute-clicker is a tumbling window. The fixed tally board is a Count-Min Sketch. Hold those two and the whole system is plumbing.

Where the clicker-and-tally-board trick runs

Ad analytics (Google Ads, Meta), metrics systems (Prometheus, Datadog), view/like counts, rate dashboards — all aggregate a firehose into windowed counts.
Count-Min Sketch and HyperLogLog are the workhorses of stream processing: bounded-memory approximate counts and distinct-counts that ship in Redis, Spark, and Flink.
The "approximate the tail" instinct from the leaderboard: exact where it's cheap and matters, approximate where exactness is unaffordable and nobody notices.

Step 1 — Functional requirements (sentences first)

Every HLD starts the same way: write what the system must do as plain sentences. These are your functional requirements — the features, scoped out loud (the recipe).

A click event arrives with an event id, an ad id, and a timestamp.
The system reports clicks per ad, bucketed by minute.
A dashboard reads recent counts with near-real-time latency.
The ingest queue may deliver the same event more than once.
Volume is billions of events a day across millions of ads.

No fraud detection, billing reconciliation, or per-user targeting yet — naming that boundary out loud is the first senior move.

Step 2 — Non-functional requirements

Features tell you what to build; the non-functional requirements tell you how well — and in a counting pipeline they, not the features, are what pick the architecture. Say them out loud too:

Low latency. The dashboard reads must feel near-real-time (a number ticking up by the second), and the ingest must keep up with a firehose — tens of thousands of events a second, all day, with no backlog growing behind it. A read that's seconds stale is fine; an ingest that falls behind is fatal.
Consistency. The one that shapes the design — and the subtle one here: counts are eventually exact, not exact now. The stream layer is deliberately approximate in the moment (a sketch over-counts a hair, a late event hasn't landed), and a batch layer reconciles it to the true number over the retained log. Name that split explicitly: approximate-then-exact is a choice, not an accident.
High availability. Ingest must never drop events under load. A blockbuster campaign or a traffic spike cannot be allowed to shed clicks on the floor — so the design leans on a durable queue that absorbs the spike and lets the aggregator drain it at its own pace.
Durability. The raw event log is retained long enough to recompute every aggregate from scratch; the aggregates themselves are derived and may be rebuilt freely. Lose a windowed count and you recompute it; lose the raw log inside its retention window and the batch layer has nothing to reconcile against.
Scalability. Billions of events a day across millions of ads — and brutal skew, where one mega-campaign produces more clicks than a million small ads combined.

Listing requirements is the easy half; the design is only good if it meets them. So here's the contract this design signs — each requirement, and the one mechanism that keeps it (every row is cashed in a step below):

Requirement	How this design fulfills it
Low latency (dashboard read)	pre-aggregated tumbling windows + a roll-up hierarchy in an OLAP store — Steps 4, 10
Low latency (ingest)	a Kafka buffer absorbs the firehose; the aggregator drains asynchronously — Step 9
Eventual consistency (exact)	the lambda batch layer recomputes truth over the raw log and corrects the stream — Step 7
Exactly-once counting	dedup by event id within the late-arrival horizon — Step 5
High availability (ingest)	the durable queue absorbs spikes; nothing is dropped when the aggregator lags — Steps 9, 11
Durability	the raw event log is retained for recompute; aggregates are derived and rebuildable — Step 3
Scalability / skew	Count-Min for fixed memory across millions of ads; partition by ad id for the hot key — Steps 6, 10

Every trade-off we make from here on is chosen to keep one of these promises — and we'll point back at this table when we do.

Step 3 — Nouns: the event and the aggregate

Circle the nouns: ClickEvent and the Aggregate (a per-(ad, window) count). The twist, as with metrics generally: the raw event is cheap to produce and expensive to keep, while the aggregate is what you actually query. You retain raw events only briefly (for recomputation); the aggregates are the product.

the data model

-- raw events: a short-retention log / queue, NOT a query target
click_events   (event_id, ad_id, ts)            -- kept hours, then dropped
 
-- the product: windowed aggregates (this is what the dashboard reads)
ad_minute      (ad_id, window_start, count)      -- one row per ad per minute

The line that earns the nod: you query the aggregates, never the raw events. Raw click rows exist only long enough to recompute aggregates if the stream layer was wrong — they are not the dashboard's data source. Pointing a dashboard at the raw event table is the canonical way to take the system down.

Which datastore — and why it isn't one. Don't say "a database" and move on; this pipeline is a deliberately multi-store design, because no single engine is good at all three jobs. First, a durable append-only log / queue (Kafka) to absorb the firehose. Writes are sequential appends; it buffers spikes the aggregator can't yet drain; and it doubles as the retained raw log the batch layer recomputes from. So it serves two requirements at once: availability (never drop an event) and durability (keep raw long enough to recompute). Second, an OLAP / time-series store (ClickHouse, Druid, or similar) for the windowed aggregates the dashboard queries. The dashboard's whole query pattern is range-scans and roll-ups over time ("sum the last 60 minutes for ad-42"), and a column-oriented analytical store is built precisely for that: it reads only the columns it needs and aggregates over huge spans cheaply. A vanilla relational DB is the wrong fit here, and saying why is the senior beat — a row-oriented OLTP engine is tuned for transactional point-reads and writes of whole rows, not for scanning and summing billions of narrow count rows per query; you'd be fighting its storage layout on every dashboard refresh. The third "store" is barely one: the raw events kept briefly in the log for recompute, then dropped. Right tool per job: a log for the firehose, an analytical store for the time-range roll-ups, and nothing more durable than retention demands. So the query pattern picks the store, not habit — and it picks against SQL.

Step 4 — Tumbling windows turn a stream into rows

Every event maps to a window by arithmetic: windowStart = floor(ts / windowSize) * windowSize. Counting is then "increment (ad, windowStart)."

WindowedCounter.java (bucket + count)

/** Total count for an ad across the windows covering [fromTs, toTs]. */
public long countInRange(String adId, long fromTs, long toTs) {
    long total = 0;
    for (long w = windowStart(fromTs); w <= windowStart(toTs); w += windowMs) {
        Map<String, Long> counts = windows.get(w);     // one bucket per minute
        if (counts != null) total += counts.getOrDefault(adId, 0L);
    }
    return total;
}

A tumbling window is non-overlapping and fixed-width — the simplest and the one interviewers expect first. (Sliding windows that overlap, or session windows that close on inactivity, are the natural follow-ups; name them as variants once the tumbling version is on the board.)

But a single minute-grained table is a trap the moment the dashboard offers a "last 30 days" view: answering it from minute rows means summing 43,200 buckets per ad per query. The fix is a roll-up hierarchy — minute rows compact into hour rows, hour rows into day rows, on a schedule (or as a continuous downsample). A query then reads at the coarsest grain that covers its range: the last hour from minute rows, the last week from hour rows, the last year from day rows. Same counts, exponentially fewer to add, and the old fine-grained rows can expire once their roll-up is sealed. Because counting is associative — sum(minutes) = the hour — the hierarchy is just repeated summation, and it's the difference between a dashboard that loads instantly at every zoom level and one that times out on the year view.

Step 5 — Exactly-once, by dedup

The queue is at-least-once: a network hiccup makes a consumer re-read events it already processed. Count blindly and your numbers drift upward. The fix is the idempotency reflex — dedup by event id: count an event only the first time you see its id.

WindowedCounter.java (idempotent record)

public boolean record(String eventId, String adId, long ts) {
    if (!seen.add(eventId)) return false;               // duplicate delivery — ignore
    windows.computeIfAbsent(windowStart(ts), k -> new HashMap<>()).merge(adId, 1L, Long::sum);
    return true;
}

(At true scale the seen set can't be an unbounded HashSet — it becomes a bounded structure scoped per window, or a Bloom filter, since you only need dedup within the late-arrival horizon. Naming that bound is the senior note; the toy keeps the full set for clarity.)

Step 6 — Count-Min Sketch: counting in memory you choose

When you must count per-ad across millions of ads, an exact HashMap<adId, count> grows without bound. The Count-Min Sketch counts in a fixed grid: d rows × w columns. Each key is hashed by d independent functions to one cell per row; counting bumps those d cells. The estimate is the minimum of those d cells.

CountMinSketch.java (the estimate is the min)

/** Never an under-count: collisions only ever inflate a cell, so the min is the tightest. */
public long estimate(String key) {
    long min = Long.MAX_VALUE;
    for (int r = 0; r < depth; r++) {
        min = Math.min(min, table[r][hash(key, r)]);
    }
    return min;
}

Two properties make this beautiful and testable. It never under-counts: every cell holding a key's count holds at least that count (other keys colliding only add to it), so the minimum is ≥ the truth. And memory is fixed: a 4×2048 sketch is the same size whether you track a thousand ads or a billion.

So how do you size that grid? The two dimensions control two different things, and that's the part worth knowing. Width w bounds the over-count's magnitude: the expected inflation of any estimate is at most the total count divided by w (collisions spread the total mass across w columns), so a wider grid makes each over-estimate smaller — w ≈ e/ε to keep error within an ε-fraction of the total. Depth d bounds the probability of being unlucky: each extra row is an independent hash, and the estimate is the min across rows, so you're wrong only if every row collided — probability shrinks like (1/2)^d, so d ≈ ln(1/δ) for a failure chance δ. In words: add columns to make the answer tighter, add rows to make tightness more certain. That's why a few KB suffices — a 4×2048 grid keeps heavy hitters within a fraction of a percent at four-nines confidence, and heavy hitters are exactly what an ad dashboard cares about. (Counting distinct things — unique users who clicked — is the sibling problem, solved by HyperLogLog: distinct-count in a couple of kilobytes. Mention it and you've named the whole sketch family.)

Step 7 — Stream now, batch later

Streaming aggregation is fast but can be slightly wrong (a dropped worker, a late event, a sketch's over-count). A periodic batch recompute over the retained raw log produces the exact numbers and corrects the stream's drift — the Lambda pattern: a speed layer for now, a batch layer for truth, merged at query.

Step 8 — Trade-offs (each one keeping an NFR)

Notice the last column: every decision is made to keep one of the promises from Step 2. That's what "design with the non-functional requirements in mind" actually looks like — not a list you wrote and forgot, but the thing each choice is accountable to.

Decision	The tempting alternative	Why ours wins	Keeps
query aggregates, drop raw	keep all events, `COUNT` on read	you can't scan 5B rows/day per refresh; the minute rows are tiny	latency, durability
tumbling windows	one global running counter	per-minute, per-ad breakdowns fall out; a counter answers neither	latency
dedup by event id	trust at-least-once = exactly-once	re-delivered events stop inflating the count	consistency
Count-Min Sketch (per-ad)	exact `HashMap` per ad	fixed memory across millions of ads; over-counts a little, never under	scalability
durable queue (Kafka)	write straight to the store	absorbs the firehose's spikes; nothing is dropped when the aggregator lags	availability, durability
stream + batch (Lambda)	stream only, or batch only	speed and eventual exactness; the batch layer heals the stream's drift	latency, consistency

The complete implementation

Two cores: the exact windowed counter (with dedup), and the approximate sketch. Both are small, and both carry a property worth asserting.

WindowedCounter.java

package dev.fiveyear.adclicks;
 
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
 
/** Tumbling-window click counts per ad, idempotent on event id. */
public final class WindowedCounter {
 
    private final long windowMs;
    private final Map<Long, Map<String, Long>> windows = new HashMap<>();
    private final Set<String> seen = new HashSet<>();
 
    public WindowedCounter(long windowMs) {
        this.windowMs = windowMs;
    }
 
    public long windowStart(long ts) {
        return Math.floorDiv(ts, windowMs) * windowMs;
    }
 
    /** Record a click, ignoring a re-delivered event id. True if it counted. */
    public boolean record(String eventId, String adId, long ts) {
        if (!seen.add(eventId)) {
            return false;
        }
        windows.computeIfAbsent(windowStart(ts), k -> new HashMap<>()).merge(adId, 1L, Long::sum);
        return true;
    }
 
    public long countInWindow(String adId, long ts) {
        Map<String, Long> counts = windows.get(windowStart(ts));
        return counts == null ? 0 : counts.getOrDefault(adId, 0L);
    }
 
    public long countInRange(String adId, long fromTs, long toTs) {
        long total = 0;
        for (long w = windowStart(fromTs); w <= windowStart(toTs); w += windowMs) {
            Map<String, Long> counts = windows.get(w);
            if (counts != null) {
                total += counts.getOrDefault(adId, 0L);
            }
        }
        return total;
    }
 
    /** The per-window series a dashboard plots: [windowStart, count] per bucket. */
    public List<long[]> series(String adId, long fromTs, long toTs) {
        List<long[]> out = new ArrayList<>();
        for (long w = windowStart(fromTs); w <= windowStart(toTs); w += windowMs) {
            Map<String, Long> counts = windows.get(w);
            out.add(new long[] {w, counts == null ? 0 : counts.getOrDefault(adId, 0L)});
        }
        return out;
    }
}

CountMinSketch.java

package dev.fiveyear.adclicks;
 
/** Approximate counts in fixed memory; never under-counts. */
public final class CountMinSketch {
 
    private final int width;
    private final int depth;
    private final long[][] table;
    private final int[] seeds;
 
    public CountMinSketch(int width, int depth) {
        this.width = width;
        this.depth = depth;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        for (int r = 0; r < depth; r++) {
            seeds[r] = (r + 1) * 0x9E3779B1; // distinct, deterministic per row
        }
    }
 
    private int hash(String key, int row) {
        int h = key.hashCode() ^ seeds[row];
        h ^= (h >>> 16);
        return Math.floorMod(h, width);
    }
 
    public void add(String key, long count) {
        for (int r = 0; r < depth; r++) {
            table[r][hash(key, r)] += count;
        }
    }
 
    public long estimate(String key) {
        long min = Long.MAX_VALUE;
        for (int r = 0; r < depth; r++) {
            min = Math.min(min, table[r][hash(key, r)]);
        }
        return min;
    }
}

At the dashboard:

Demo.java

WindowedCounter counter = new WindowedCounter(60_000); // one-minute windows
long t0 = 600_000_000L;                                 // a window boundary
 
counter.record("e1", "ad-42", t0 + 1_000);  // minute 0
counter.record("e2", "ad-42", t0 + 2_000);  // minute 0
counter.record("e1", "ad-42", t0 + 3_000);  // DUPLICATE id → ignored
counter.record("e3", "ad-42", t0 + 61_000); // minute 1
 
counter.countInWindow("ad-42", t0 + 5_000);             // 2  (e1, e2; the dup didn't count)
counter.countInWindow("ad-42", t0 + 65_000);            // 1  (e3)
counter.countInRange("ad-42", t0, t0 + 120_000);        // 3  (across minutes 0,1,2)
 
CountMinSketch sketch = new CountMinSketch(2048, 4);
sketch.add("ad-42", 1_000);
sketch.add("ad-7", 5);
sketch.estimate("ad-42");        // ≥ 1000 (here exactly 1000 — no collision at this size)
sketch.estimate("ad-7");         // ≥ 5
sketch.estimate("ad-unseen");    // 0 or a tiny collision noise — never negative, never huge

Step 9 — Only now, the boxes

Clients fire clicks into an ingest queue (Kafka) — a durable buffer that absorbs spikes and decouples producers from the aggregator.
A stream aggregator consumes the queue, dedups by event id, and rolls events into windowed counts in an OLAP / time-series store that the query API serves to dashboards.
A batch layer periodically recomputes the exact counts from the retained raw log and corrects the store — the speed/accuracy split from Step 7.

Step 10 — Scaling, one bottleneck at a time

A junior over-builds on day one — a dozen partitions and a batch cluster for traffic that doesn't exist yet. The senior move is the opposite: start with the simplest thing that's correct, and add a piece only when a measured bottleneck forces you. Climb the ladder, driven by the firehose your non-functional requirements flagged:

One aggregator, one OLAP store. Correct, simple, and plenty for modest traffic. The ingest rate is the first thing to bite — when a single consumer can't drain the queue fast enough…
Partition by ad id. Each partition's events flow to its own aggregator instance, so one ad's counts live together and the consumers parallelize cleanly. Counting is per-key, so this shards perfectly — until the store itself is the read or write bottleneck…
Scale the OLAP store horizontally. ClickHouse / Druid are built to add nodes and spread shards; the windowed aggregates partition by (ad_id, time) and fan across them. When even the volume into the pipeline is the problem…
Pre-aggregate at the edge. A client or edge collector sums clicks for a few seconds before sending, slashing event volume before it ever touches the queue — counting is associative, so partial sums compose without changing the answer. And alongside scale-out, the roll-up hierarchy (minute → hour → day) is the read-side ladder: a query reads at the coarsest grain that covers its range, so the year view sums 365 day-rows, not half a million minute-rows.

Each rung is more to operate, so you earn it with a number, not a hunch — that restraint is itself the senior signal.

The hot key is a different axis. Partitioning spreads a million quiet ads beautifully, but one mega-campaign is a single hot key no partitioning can split — more clicks on one ad id than a million others combined, all hashing to one partition. That's the familiar shape: handle it with extra partitions dedicated to the hot key, or a standalone aggregator for that one campaign, so the firehose for one ad doesn't starve everyone else's counts.

Late and out-of-order events are the other scale reality: an event for 10:00 arriving at 10:05. A watermark ("I won't accept events older than 5 minutes") closes a window for live writing; anything past the watermark is left for the batch layer to mop up — which is exactly the reconciliation the lambda design already bought us.

Step 11 — When a piece fails

A design isn't finished when it works — it's finished when you can say what happens as each box dies. Go component by component, and notice how much of the answer the durable log already handed us for free:

A stream aggregator dies mid-window. Its in-flight counts vanish — but the events that fed them are still sitting in Kafka. A replacement consumer reprocesses from the last committed offset, re-reads those events (dedup by event id keeps the replay from double-counting), and the batch layer corrects any residual drift on its next pass. The crash costs latency, not accuracy.
The OLAP / time-series store dies. Dashboards degrade: you serve the last-known counts (slightly stale) rather than fail, and you rebuild the aggregates from the retained raw log — the store holds derived data, so it's always reconstructable. Reads bend; the source of truth was never in the store.
Kafka is the load-bearing piece that makes both of the above safe. Because it's a durable buffer, an aggregator outage doesn't drop events (they queue up and get drained on recovery), and a store rebuild has a source to replay from. Its own resilience is replication across brokers, so a single broker loss is a non-event.

The pattern across all three is the lesson: a component that holds only derived data (the aggregator's memory, the OLAP store) degrades and rebuilds; the one component that holds the truth (the durable log) is the thing you replicate and never lose. Designing for failure isn't preventing every outage — it's deciding, in advance, how the system bends instead of breaks. The durable log is what lets it bend: every other box can die and be rebuilt from the events it kept.

The interview corner

Clarify before you draw: Real-time-ish (seconds of lag) or strictly real-time? Exact counts or is approximate acceptable (it unlocks sketches)? What window granularity — minute, hour? Is at-least-once delivery in play (it forces dedup)? Distinct counts (unique users) too, or just totals?

The follow-up ladder:

"Billions of events — how do you count per minute?" Tumbling windows: floor(ts / window) buckets, increment (ad, window); query the minute rows, never the raw events. Lead with this.
"The queue delivers duplicates." Dedup by event id within the late-arrival horizon — a bounded seen-set or Bloom filter per window. At-least-once + dedup = exactly-once counting.
"Per-ad counts for millions of ads won't fit in memory." Count-Min Sketch: fixed d×w grid, hash to d cells, estimate is the min; never under-counts, bounded memory. For distinct users, HyperLogLog.
"How do you guarantee the numbers are eventually exact?" Lambda: a fast approximate speed layer plus a batch recompute over the raw log that corrects drift, merged at query. (Kappa — stream-only with replay — is the modern alternative; name the trade.)
"Late events at 10:05 for the 10:00 window?" Watermarks bound how late you'll accept into the live window; the batch layer reconciles anything past the watermark. Naming watermarks is the stream-processing shibboleth.

Mistakes that fail the round: querying the raw event table on the hot path; one global counter that can't break down by ad or minute; trusting at-least-once delivery as exactly-once (drifting counts); an unbounded exact map per ad; ignoring late/duplicate events entirely; no path to exactness when the stream layer is approximate.

Where to go from here

Pocket version: chop the stream into tumbling windows and store one count per (ad, minute); dedup by event id for exactly-once; count per-ad in fixed memory with a Count-Min Sketch that never under-counts; and run a batch recompute behind the stream so the numbers are fast now and exact eventually.

Implement HyperLogLog for "unique users who clicked" and feel distinct-counting in two kilobytes — the sketch family's other half.
Add a sliding window on top of the tumbling buckets (sum the last 60 one-second buckets) and see why granularity is a storage/precision dial.
Zoom across: the "approximate where exactness is unaffordable" move is the leaderboard's approximate tail and the crawler's Bloom filter — bounded memory beats exact memory at scale, every time.
Next in the queue: the newsletter service — from counting a firehose to fanning one out, where delivery, retries, and idempotency meet the queue you just met here.

A clicker that resets every minute, a tally board with a fixed number of slots, and a nightly recount to set the record straight — that's how you count five billion clicks a day without keeping a single one.

HLD