
Alarm Service LLD: Why CloudWatch Doesn't Page You for One Bad Second

A low-level design walkthrough of an AWS-style alarm and alert service: threshold rules as data, consecutive-breach debouncing, three honest states, and observers notified on transitions only.

By fiveyearsdev · June 12, 2026 · 8 min read

"Design an alarm service — like CloudWatch alarms." Here's the question hiding inside it: your CPU metric just spiked over 90% for one second, then dropped. Should someone's phone scream at 3 a.m.? Every engineer who's carried a pager answers in unison — no — and that answer, formalized, is most of this design: one spike is noise; N in a row is a fire.

This is Phase 2's closer, and a fitting one: a state machine (the queue's oldest friend), rules as data (the paytable's lesson), and the queue's first real Observer — the pattern the cheat sheet promised for "others must react."

Let's start nowhere near a computer

A night nurse watches a patient's heart-rate monitor. One weird beat? She notes it and watches — bodies are noisy. The third irregular reading in a row? Now she calls the doctor. And here's the subtle discipline: while the patient stays in distress, she doesn't call again every minute — the doctor is already coming. She calls on changes: when distress begins, and again when the patient stabilizes.

Three rules in that story — tolerate isolated noise, escalate on persistence, announce transitions rather than states — and they're exactly the three things rookie alarm systems get wrong, in order: they page on the first spike, they page again every breaching datapoint, and they never tell you when it's over.

You're being watched by this design right now

CloudWatch, Datadog, Prometheus Alertmanager — every monitoring stack is this machine; "evaluation periods" is the consecutive-breach knob.
Thermostats, smoke detectors, UPS beeps — threshold + hysteresis is the physical world's same trick.
The login throttle and the ATM's three strikes — "N consecutive bad events trigger a state change" is a pattern you've now built twice.

Step 1 — Functional requirements (sentences first)

What the service must do, as plain sentences — the functional requirements.

A metric is a stream of timestamped values.
An alarm watches one metric against a threshold rule.
The alarm trips only after N consecutive breaching datapoints.
Listeners (email, chat, pager) are notified when the alarm changes state — and only then.
Before any data arrives, the alarm honestly reports insufficient data.

That third sentence is a deliberate stance. Tripping on the first breach doesn't make noise disappear — it pages a human for a one-second spike that has already passed. Debouncing is a feature: a fire is N in a row, not a single bad reading, and the count of N is a knob you size to how jumpy the metric is, not a number you guess.

Step 2 — Non-functional requirements

At class level the non-functional requirements are different words for the same idea — how well, not just what — and for an alarm service they are the design:

Correctness of the verdict. The alarm fires once when the breach is real (N consecutive), clears once when it recovers, and never reports OK for a pipeline that has simply gone silent. Getting the state right is the whole job.
No-spam / notification fidelity. A breach lasting an hour is one page, not sixty — listeners hear transitions, never a weather report of every datapoint.
Thread-safety. Producers call record from many threads while other callers invoke addAlarm and addListener; the design must never corrupt an alarm's streak or notify listeners of a half-changed state.
Extensibility. New notification channels (email, chat, pager, webhook) plug in without touching the evaluator, and a new alarm is a new row of data, not new code.

Listing them is the easy half; the design only earns them if it fulfills them. Here's the contract — each requirement and the mechanism that keeps it:

Requirement	How this design fulfills it
Correctness of the verdict	a per-alarm streak counter gates the fire (N in a row), and `INSUFFICIENT_DATA` is a real third state (Steps 3, 4)
No-spam / notification fidelity	the transition gate fires listeners only when the state actually changes (Step 5)
Thread-safety	every public method is `synchronized` on the service; the streak and state are never read mid-write (the complete implementation)
Extensibility	the rule is a `record` (a row, not an if-chain) and notification is an Observer seam (Steps 3, 5)

Every trade-off below is chosen to keep one of these.

Step 3 — The rule is data; the alarm is a state machine

A threshold rule is three facts, not code:

the rule — a row, not an if-chain

public record Rule(String metric, double threshold, int consecutivePeriods) {}

New alarm on a new metric? A new row (the paytable, again). And the alarm itself has exactly three states: OK, ALARM, and the one everyone forgets.

INSUFFICIENT_DATA is the honest third state: an alarm that has seen zero datapoints doesn't know things are OK, and a monitoring system that guesses "OK" hides dead pipelines. ("Why is the alarm green?" — "Because the metric stopped arriving an hour ago" is an outage story older than the cloud.)
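Written down, the three states fit in a few lines. This is a minimal sketch: the constants match the listing further down, while the `SketchAlarmState` name and the `successors` and `canMoveTo` helpers are hypothetical additions that spell out which moves the evaluator in this article can actually make.

```java
import java.util.EnumSet;
import java.util.Set;

// Sketch only. The constants mirror the article's AlarmState; the helpers
// are hypothetical and exist purely to document the legal transitions.
enum SketchAlarmState {
    INSUFFICIENT_DATA, OK, ALARM;

    // States reachable from this one in the evaluator as written.
    Set<SketchAlarmState> successors() {
        return switch (this) {
            // The first datapoint decides; ALARM is reachable directly when N is 1.
            case INSUFFICIENT_DATA -> EnumSet.of(OK, ALARM);
            case OK -> EnumSet.of(ALARM);
            case ALARM -> EnumSet.of(OK);
        };
    }

    boolean canMoveTo(SketchAlarmState next) {
        return successors().contains(next);
    }
}
```

Nothing in the base design moves back to INSUFFICIENT_DATA; that edge only appears once missing-data handling is added.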

Which structure — and why. Alarms live in a Map<String, Alarm> keyed by name, and each Alarm carries its own int breachStreak next to its state — and that pairing is the load-bearing choice, not a default. The streak lives per alarm because correctness demands it: two alarms watching the same metric at different thresholds must count independently, so a shared counter would cross-contaminate verdicts. The Rule is a record rather than a hand-written if because extensibility wants a new alarm to be a new row, not new code. And the listener List is the only fan-out structure, kept separate from the alarm map so a channel can be added without the evaluator knowing it exists — the Observer seam in storage form.
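To see why the streak has to live on the alarm, here is a stripped-down sketch with two alarms watching the same metric at different thresholds. `StreakDemo`, `streakOf`, and the alarm names in the usage note are hypothetical; states and listeners are left out so that only the counters are visible.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: per-alarm streak counters, with states and listeners omitted.
final class StreakDemo {
    record Rule(String metric, double threshold, int consecutivePeriods) {}

    static final class Alarm {
        final Rule rule;
        int breachStreak;                 // one counter per alarm, never shared

        Alarm(Rule rule) {
            this.rule = rule;
        }
    }

    private final Map<String, Alarm> alarms = new LinkedHashMap<>();

    void addAlarm(String name, Rule rule) {
        alarms.put(name, new Alarm(rule));
    }

    // Every alarm watching the metric updates its own streak independently.
    void record(String metric, double value) {
        for (Alarm alarm : alarms.values()) {
            if (!alarm.rule.metric().equals(metric)) {
                continue;
            }
            boolean breach = value > alarm.rule.threshold();
            alarm.breachStreak = breach ? alarm.breachStreak + 1 : 0;
        }
    }

    int streakOf(String name) {
        return alarms.get(name).breachStreak;
    }
}
```

Register "cpu-warn" at 80 and "cpu-critical" at 95 on the same "cpu" metric, then feed 90, 97, 85: the warning alarm's streak reaches 3 while the critical alarm's streak is back at 0, because only the 97 breached its threshold. A single shared counter could not hold both answers.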

Step 4 — The debounce: counting to N

The heart of the evaluator is a streak counter — and the asymmetry is deliberate:

AlarmService.java (the evaluation)

private void evaluate(Alarm alarm, double value) {
    boolean breach = value > alarm.rule.threshold();
    AlarmState next;
    if (breach) {
        alarm.breachStreak++;
        next = (alarm.breachStreak >= alarm.rule.consecutivePeriods())
                ? AlarmState.ALARM                // N in a row: the fire is real
                : (alarm.state == AlarmState.INSUFFICIENT_DATA ? AlarmState.OK : alarm.state);
    } else {
        alarm.breachStreak = 0;                   // ONE healthy point resets the count
        next = AlarmState.OK;
    }
    transitionTo(alarm, next, value);
}

Going into alarm is hard (N in a row); coming out is easy (one healthy point). That's the same shape as a fuse (slow to blow, instant to acknowledge recovery), and you should say the asymmetry out loud, because the obvious-looking "symmetric" version (N healthy points to recover) is also defensible: it trades a slower all-clear for less flapping when a metric hovers around the threshold. Naming the knob beats hardcoding either answer.
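If the interviewer pushes on the recovery side, the knob could look like the sketch below. It is an illustration under assumptions, not the article's listing: `SymmetricDebounce`, `recoveryPeriods`, and `okStreak` are hypothetical names, and setting `recoveryPeriods` to 1 reproduces the one-healthy-point behaviour of the evaluator above.

```java
// Sketch only: the debounce with a configurable recovery count.
final class SymmetricDebounce {
    enum State { INSUFFICIENT_DATA, OK, ALARM }

    private final double threshold;
    private final int breachPeriods;     // breaching points in a row needed to fire
    private final int recoveryPeriods;   // healthy points in a row needed to clear
    private State state = State.INSUFFICIENT_DATA;
    private int breachStreak;
    private int okStreak;

    SymmetricDebounce(double threshold, int breachPeriods, int recoveryPeriods) {
        if (breachPeriods < 1 || recoveryPeriods < 1) {
            throw new IllegalArgumentException("periods must be at least 1");
        }
        this.threshold = threshold;
        this.breachPeriods = breachPeriods;
        this.recoveryPeriods = recoveryPeriods;
    }

    // Feeds one datapoint and returns the resulting state.
    State record(double value) {
        if (value > threshold) {
            breachStreak++;
            okStreak = 0;
            if (breachStreak >= breachPeriods) {
                state = State.ALARM;
            } else if (state == State.INSUFFICIENT_DATA) {
                state = State.OK;
            }
        } else {
            okStreak++;
            breachStreak = 0;
            // While in ALARM, hold the state until enough healthy points arrive.
            if (state != State.ALARM || okStreak >= recoveryPeriods) {
                state = State.OK;
            }
        }
        return state;
    }
}
```

With a threshold of 80 and both counts set to 2, the sequence 90, 95, 40, 41 yields OK, ALARM, ALARM, OK: the first healthy point is not yet trusted.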

Step 5 — Observer, with manners

Who gets told? Email for the log, chat for the team, the pager for whoever drew the short straw. The alarm cannot know — and must not care. That's the Observer pattern's whole job:

the transition gate — Observer's manners

private void transitionTo(Alarm alarm, AlarmState next, double value) {
    if (next == alarm.state) {
        return;                              // same state → SILENCE. no spam, no 47 pages
    }
    AlarmState previous = alarm.state;
    alarm.state = next;
    for (AlarmListener listener : listeners) {
        listener.onStateChange(alarm.name, previous, next, value);
    }
}

The early-return guard at the top of the method is the nurse's discipline: a metric breaching for an hour produces one notification at the transition, not sixty. Most homemade alert systems fail precisely here, and the engineers carrying their pagers can tell you the date it happened.

The follow-up with teeth: what if a listener throws — should one broken email integration stop the pager from firing? Production answer: never; wrap each call, log the failure, keep delivering. We keep the loop simple here, but say the sentence — "listener isolation" is the kind of phrase that gets written down.
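For the record, listener isolation is only a few lines. A minimal sketch, assuming unchecked exceptions are the failure mode worth catching: `IsolatedNotifier`, `publish`, and the in-memory `failures` list (a stand-in for a real logger) are hypothetical names, and states are passed as strings to keep the block self-contained.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: each listener call is wrapped so one failure cannot stop the rest.
final class IsolatedNotifier {
    interface Listener {
        void onStateChange(String alarmName, String from, String to, double value);
    }

    private final List<Listener> listeners = new ArrayList<>();
    private final List<String> failures = new ArrayList<>();   // stand-in for a logger

    void addListener(Listener listener) {
        listeners.add(listener);
    }

    void publish(String alarmName, String from, String to, double value) {
        for (Listener listener : listeners) {
            try {
                listener.onStateChange(alarmName, from, to, value);
            } catch (RuntimeException e) {
                // Record the failure and keep delivering to the remaining listeners.
                failures.add(alarmName + ": " + e.getMessage());
            }
        }
    }

    List<String> failures() {
        return List.copyOf(failures);
    }
}
```

Register a listener that throws ahead of the pager listener and the pager still fires; the broken integration shows up in the failure log instead of in an outage review.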

Step 6 — Trade-offs (each one keeping an NFR)

The last column is the discipline: every choice keeps one of the promises from Step 2 — that's what designing to the non-functional requirements looks like.

Decision	The tempting alternative	Why ours wins	Keeps
streak counter, N-in-a-row to fire	trip on the first breach	one spike is noise; a real fire is persistence — no 3 a.m. page for a blip	correctness
one healthy point resets the streak	symmetric N-to-recover	recovery is acknowledged instantly; the asymmetry is a named, defensible knob	correctness
notify only on state change	notify per breaching datapoint	an hour of breach is one page, not sixty	no-spam fidelity
`INSUFFICIENT_DATA` as a real state	assume OK before data arrives	a silent pipeline can't masquerade as healthy	correctness
`synchronized` on every public method	lock-free reads of mutable state	streak and state are never read half-written under concurrent `record`	thread-safety
`Rule` record + Observer list	hardcoded thresholds and channels	a new alarm is a row and a new channel is a listener — no evaluator edits	extensibility

The complete implementation

AlarmService.java

package dev.fiveyear.alarm;
 
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
 
public final class AlarmService {
 
    public enum AlarmState { INSUFFICIENT_DATA, OK, ALARM }
 
    public record Rule(String metric, double threshold, int consecutivePeriods) {}
 
    public interface AlarmListener {
        void onStateChange(String alarmName, AlarmState from, AlarmState to, double value);
    }
 
    static final class Alarm {
        final String name;
        final Rule rule;
        AlarmState state = AlarmState.INSUFFICIENT_DATA;
        int breachStreak;
 
        Alarm(String name, Rule rule) {
            this.name = name;
            this.rule = rule;
        }
    }
 
    private final Map<String, Alarm> alarms = new HashMap<>();
    private final List<AlarmListener> listeners = new ArrayList<>();
 
    public synchronized void addAlarm(String name, Rule rule) {
        if (rule.consecutivePeriods() < 1) {
            throw new IllegalArgumentException("need at least one period");
        }
        alarms.put(name, new Alarm(name, rule));
    }
 
    public synchronized void addListener(AlarmListener listener) {
        listeners.add(listener);
    }
 
    /** One datapoint arrives; every alarm watching that metric re-evaluates. */
    public synchronized void record(String metric, double value) {
        for (Alarm alarm : alarms.values()) {
            if (alarm.rule.metric().equals(metric)) {
                evaluate(alarm, value);
            }
        }
    }
 
    public synchronized AlarmState stateOf(String alarmName) {
        Alarm alarm = alarms.get(alarmName);
        if (alarm == null) {
            throw new IllegalArgumentException("no alarm " + alarmName);
        }
        return alarm.state;
    }
 
    private void evaluate(Alarm alarm, double value) {
        boolean breach = value > alarm.rule.threshold();
        AlarmState next;
        if (breach) {
            alarm.breachStreak++;
            next = (alarm.breachStreak >= alarm.rule.consecutivePeriods())
                    ? AlarmState.ALARM
                    : (alarm.state == AlarmState.INSUFFICIENT_DATA ? AlarmState.OK : alarm.state);
        } else {
            alarm.breachStreak = 0;
            next = AlarmState.OK;
        }
        transitionTo(alarm, next, value);
    }
 
    private void transitionTo(Alarm alarm, AlarmState next, double value) {
        if (next == alarm.state) {
            return;
        }
        AlarmState previous = alarm.state;
        alarm.state = next;
        for (AlarmListener listener : listeners) {
            listener.onStateChange(alarm.name, previous, next, value);
        }
    }
}

A bad night, narrated correctly:

Demo.java

AlarmService service = new AlarmService();
service.addAlarm("api-cpu-high", new AlarmService.Rule("cpu", 80.0, 3));   // Rule is nested in AlarmService
service.addListener((name, from, to, value) ->
        System.out.println(name + ": " + from + " → " + to + " (at " + value + ")"));
 
service.record("cpu", 45);   // api-cpu-high: INSUFFICIENT_DATA → OK
service.record("cpu", 91);   // breach #1 — counting, silence
service.record("cpu", 95);   // breach #2 — counting, silence
service.record("cpu", 97);   // breach #3 → "api-cpu-high: OK → ALARM (at 97.0)"
service.record("cpu", 98);   // still on fire — but the pager already knows. silence.
service.record("cpu", 40);   // "api-cpu-high: ALARM → OK (at 40.0)" — and you can sleep

Three notifications for six datapoints — each one a change, none of them spam.

The interview corner

Clarify before you code: Are metrics pushed to us or pulled by us? Evaluate on every datapoint or on a schedule? Is missing data OK, ALARM, or its own thing?

The follow-up ladder:

"Alarm on the 5-minute average, not raw points." A window buffer per alarm feeding evaluate: aggregation is a pre-step in the pipeline, not a new machine.
"The metric stops arriving." That's INSUFFICIENT_DATA returning: a per-alarm last-seen stamp checked against an injected clock. Silence is data about your pipeline.
"Alarm when three services degrade together." Composite alarms subscribe to child transitions — Observer chaining — and now you owe an answer about notification loops (cap the depth).
"One outage fires 300 alarms." A notification policy layer between alarms and humans: group by service, throttle, escalate — the listener list grows into a router, and that's a separate design.
"The pager webhook fails." At-least-once delivery: a retry queue plus idempotent receivers — and the honest sentence that exactly-once notification is a myth worth saying out loud.

Mistakes that fail the round: notifying on every breaching datapoint; modeling two states instead of three; letting one listener's exception silence the rest.
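The first rung of the ladder, the window buffer, could be sketched like this. It assumes datapoints arrive at a fixed cadence, so a "5-minute average" becomes "the last k points"; a time-based window would need timestamps as well. `WindowAverage` and `add` are hypothetical names.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.OptionalDouble;

// Sketch only: a count-based sliding window that sits in front of evaluate.
final class WindowAverage {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    WindowAverage(int size) {
        if (size < 1) {
            throw new IllegalArgumentException("window size must be at least 1");
        }
        this.size = size;
    }

    // Adds a datapoint; returns the average once the window is full, empty before that.
    OptionalDouble add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst();   // evict the oldest point
        }
        return window.size() == size
                ? OptionalDouble.of(sum / size)
                : OptionalDouble.empty();
    }
}
```

The evaluator then sees the average instead of the raw value, and only once the window has filled, which is the pre-step the ladder describes.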

Where to go from here — and Phase 2, closed

Pocket version: rules are rows, alarms are three-state machines, N-in-a-row gates the fire, one healthy point clears it, and listeners hear transitions — never weather reports.

That closes the systems-and-libraries phase, and look at the toolkit it added to Phase 1's: pipelines with cheap gates (logging), recursion over text (JSON) and over structure (the file system), coordinated lifecycles (restaurant), derived-never-stored quantities (inventory), interval arithmetic (hotel), identity inventory with lazy timers (airline) — and now debounced observation.

Add aggregation — real alarms watch "average over 5 minutes", not raw points; that's a windowed buffer in front of evaluate, and a fine evening's work.
Add missing-data handling — no datapoints for two periods should return the alarm to INSUFFICIENT_DATA; you'll need an injected clock, and you know exactly where to get one.
Phase 3 is next in the backlog: intermediate LLD — the elevator, the call center, Splitwise, the thread pool.
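For the missing-data exercise, the last-seen stamp might start from a sketch like this one. The clock is injected as a `Supplier<Instant>` so tests can move time by hand (`java.time.Clock` would fill the same role); `StalenessCheck`, `markSeen`, `isStale`, and `maxSilence` are hypothetical names.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

// Sketch only: a per-alarm last-seen stamp checked against an injected clock.
final class StalenessCheck {
    private final Supplier<Instant> clock;
    private final Duration maxSilence;
    private Instant lastSeen;              // null until the first datapoint arrives

    StalenessCheck(Supplier<Instant> clock, Duration maxSilence) {
        this.clock = clock;
        this.maxSilence = maxSilence;
    }

    // Call whenever a datapoint for the alarm's metric arrives.
    void markSeen() {
        lastSeen = clock.get();
    }

    // True when no data has ever arrived or the silence exceeds maxSilence.
    boolean isStale() {
        if (lastSeen == null) {
            return true;
        }
        return Duration.between(lastSeen, clock.get()).compareTo(maxSilence) > 0;
    }
}
```

A periodic sweep that finds `isStale()` true would route the alarm back to INSUFFICIENT_DATA through the same transition gate, so listeners hear about the silence exactly once.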

One bad second at 3 a.m., and your phone stays dark — because somewhere, a streak counter read 1, said "not yet," and went back to watching. That's the whole kindness of the design.
