Alarm Service LLD: Why CloudWatch Doesn't Page You for One Bad Second
A low-level design walkthrough of an AWS-style alarm and alert service: threshold rules as data, consecutive-breach debouncing, three honest states, and observers notified on transitions only.
"Design an alarm service — like CloudWatch alarms." Here's the question hiding inside it: your CPU metric just spiked over 90% for one second, then dropped. Should someone's phone scream at 3 a.m.? Every engineer who's carried a pager answers in unison — no — and that answer, formalized, is most of this design: one spike is noise; N in a row is a fire.
This is Phase 2's closer, and a fitting one: a state machine (the queue's oldest friend), rules as data (the paytable's lesson), and the queue's first real Observer — the pattern the cheat sheet promised for "others must react."
Let's start nowhere near a computer
A night nurse watches a patient's heart-rate monitor. One weird beat? She notes it and watches — bodies are noisy. The third irregular reading in a row? Now she calls the doctor. And here's the subtle discipline: while the patient stays in distress, she doesn't call again every minute — the doctor is already coming. She calls on changes: when distress begins, and again when the patient stabilizes.
Three rules in that story — tolerate isolated noise, escalate on persistence, announce transitions rather than states — and they're exactly the three things rookie alarm systems get wrong, in order: they page on the first spike, they page again every breaching datapoint, and they never tell you when it's over.
You're being watched by this design right now
- CloudWatch, Datadog, Prometheus Alertmanager — every monitoring stack is this machine; "evaluation periods" is the consecutive-breach knob.
- Thermostats, smoke detectors, UPS beeps — threshold + hysteresis is the physical world's same trick.
- The login throttle and the ATM's three strikes — "N consecutive bad events trigger a state change" is a pattern you've now built twice.
Step 1 — Functional requirements (sentences first)
What the service must do, as plain sentences — the functional requirements.
- A metric is a stream of timestamped values.
- An alarm watches one metric against a threshold rule.
- The alarm trips only after N consecutive breaching datapoints.
- Listeners (email, chat, pager) are notified when the alarm changes state — and only then.
- Before any data arrives, the alarm honestly reports insufficient data.
That third sentence is a deliberate stance. Tripping on the first breach doesn't make noise disappear — it pages a human for a one-second spike that has already passed. Debouncing is a feature: a fire is N in a row, not a single bad reading, and the count of N is a knob you size to how jumpy the metric is, not a number you guess.
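To put a number on that stance, here is a minimal sketch (not the service itself) that counts how many times an alarm would trip on the same series with N = 1 versus N = 3. The class name `DebounceDemo`, the method `tripsWithStreak`, and the sample CPU values are invented for illustration.

```java
final class DebounceDemo {
    /** Counts OK -> ALARM trips when the alarm fires after n consecutive breaches. */
    static int tripsWithStreak(double[] values, double threshold, int n) {
        int streak = 0;
        int trips = 0;
        boolean inAlarm = false;
        for (double v : values) {
            if (v > threshold) {
                streak++;
                if (!inAlarm && streak >= n) {
                    inAlarm = true;   // this is the moment a human gets paged
                    trips++;
                }
            } else {
                streak = 0;           // one healthy point resets the count
                inAlarm = false;
            }
        }
        return trips;
    }

    public static void main(String[] args) {
        // Hypothetical series: one isolated spike, then a sustained breach.
        double[] cpu = {45, 93, 40, 42, 91, 95, 97, 98, 40};
        System.out.println("first-breach trips: " + tripsWithStreak(cpu, 90.0, 1)); // 2
        System.out.println("3-in-a-row trips:   " + tripsWithStreak(cpu, 90.0, 3)); // 1
    }
}
```

On this series the first-breach alarm pages twice (once for the lone spike), while the 3-in-a-row alarm pages once, for the breach that actually persisted.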
Step 2 — Non-functional requirements
At class level the non-functional requirements are different words for the same idea — how well, not just what — and for an alarm service they are the design:
- Correctness of the verdict. The alarm fires once when the breach is real (N consecutive), clears once when it recovers, and never reports OK for a metric that has sent no data at all. Getting the state right is the whole job.
- No-spam / notification fidelity. A breach lasting an hour is one page, not sixty — listeners hear transitions, never a weather report of every datapoint.
- Thread-safety. Producers call record from many threads while callers addAlarm and addListener; the design must never corrupt an alarm's streak or fire a half-changed state.
- Extensibility. New notification channels (email, chat, pager, webhook) plug in without touching the evaluator, and a new alarm is a new row of data, not new code.
Listing them is the easy half; the design only earns them if it fulfills them. Here's the contract — each requirement and the mechanism that keeps it:
| Requirement | How this design fulfills it |
|---|---|
| Correctness of the verdict | a per-alarm streak counter gates the fire (N in a row), and INSUFFICIENT_DATA is a real third state (Steps 3, 4) |
| No-spam / notification fidelity | the transition gate fires listeners only when the state actually changes (Step 5) |
| Thread-safety | every public method is synchronized on the service; the streak and state are never read mid-write (Step 6 and the complete implementation) |
| Extensibility | the rule is a record (a row, not an if-chain) and notification is an Observer seam (Steps 3, 5) |
Every trade-off below is chosen to keep one of these.
Step 3 — The rule is data; the alarm is a state machine
A threshold rule is three facts, not code:
```java
public record Rule(String metric, double threshold, int consecutivePeriods) {}
```
New alarm on a new metric? A new row (the paytable, again). And the alarm itself has exactly three states, including the one everyone forgets: INSUFFICIENT_DATA, OK, and ALARM.
INSUFFICIENT_DATA is the honest third state: an alarm that has seen zero datapoints doesn't know things are OK, and a monitoring system that guesses "OK" hides dead pipelines. ("Why is the alarm green?" — "Because the metric stopped arriving an hour ago" is an outage story older than the cloud.)
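A minimal sketch of "rules as rows, three states" might look like the following. The wrapper class `RuleTableSketch`, the helper methods, and the second and third rule rows are hypothetical; only the `Rule` shape and the three state names come from the design.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class RuleTableSketch {
    enum AlarmState { INSUFFICIENT_DATA, OK, ALARM }

    record Rule(String metric, double threshold, int consecutivePeriods) {}

    /** Hypothetical rule table: alarm name -> rule row. A new alarm is a new put(). */
    static Map<String, Rule> rules() {
        Map<String, Rule> rules = new LinkedHashMap<>();
        rules.put("api-cpu-high", new Rule("cpu", 80.0, 3));
        rules.put("api-cpu-critical", new Rule("cpu", 95.0, 2));
        rules.put("db-latency-high", new Rule("db.latency.ms", 250.0, 5));
        return rules;
    }

    /** Every alarm starts in the honest state: nothing seen, nothing claimed. */
    static Map<String, AlarmState> initialStates(Map<String, Rule> rules) {
        Map<String, AlarmState> states = new LinkedHashMap<>();
        for (String name : rules.keySet()) {
            states.put(name, AlarmState.INSUFFICIENT_DATA);
        }
        return states;
    }
}
```

Nothing in the sketch branches on a metric name or a threshold value; the evaluator only ever reads the row it is handed.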
Which structure — and why. Alarms live in a Map<String, Alarm> keyed by name, and each Alarm carries its own int breachStreak next to its state — and that pairing is the load-bearing choice, not a default. The streak lives per alarm because correctness demands it: two alarms watching the same metric at different thresholds must count independently, so a shared counter would cross-contaminate verdicts. The Rule is a record rather than a hand-written if because extensibility wants a new alarm to be a new row, not new code. And the listener List is the only fan-out structure, kept separate from the alarm map so a channel can be added without the evaluator knowing it exists — the Observer seam in storage form.
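The per-alarm streak can be seen in isolation with a small sketch. `IndependentStreaks`, its stripped-down `Alarm`, and `firstFiringIndex` are illustrative names, and the state machine is reduced to a boolean to keep the focus on the counter.

```java
final class IndependentStreaks {
    static final class Alarm {
        final double threshold;
        final int periods;
        int breachStreak; // lives on the alarm, not on the metric

        Alarm(double threshold, int periods) {
            this.threshold = threshold;
            this.periods = periods;
        }

        /** Feeds one datapoint; true once `periods` breaches have arrived in a row. */
        boolean observe(double value) {
            breachStreak = (value > threshold) ? breachStreak + 1 : 0;
            return breachStreak >= periods;
        }
    }

    /** Index of the first datapoint at which the alarm fires, or -1 if it never does. */
    static int firstFiringIndex(Alarm alarm, double[] series) {
        for (int i = 0; i < series.length; i++) {
            if (alarm.observe(series[i])) {
                return i;
            }
        }
        return -1;
    }
}
```

Fed the same series 85, 90, 97, 99, an alarm with threshold 80 and N = 2 fires at the second point, while one with threshold 95 and N = 2 fires at the fourth. A single counter shared per metric could not give both answers.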
Step 4 — The debounce: counting to N
The heart of the evaluator is a streak counter — and the asymmetry is deliberate:
```java
private void evaluate(Alarm alarm, double value) {
    boolean breach = value > alarm.rule.threshold();
    AlarmState next;
    if (breach) {
        alarm.breachStreak++;
        next = (alarm.breachStreak >= alarm.rule.consecutivePeriods())
                ? AlarmState.ALARM   // N in a row: the fire is real
                : (alarm.state == AlarmState.INSUFFICIENT_DATA ? AlarmState.OK : alarm.state);
    } else {
        alarm.breachStreak = 0;      // ONE healthy point resets the count
        next = AlarmState.OK;
    }
    transitionTo(alarm, next, value);
}
```
Going into alarm is hard (N in a row); coming out is easy (one healthy point). That's the same shape as a fuse, slow to blow and instant to acknowledge recovery, and you should say the asymmetry out loud, because the obvious-looking "symmetric" version (N healthy points to recover) is also defensible. Production systems such as CloudWatch expose more knobs here than a single hardcoded count (for example, "M out of N" datapoints to alarm). Naming the knob beats hardcoding either answer.
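If the interviewer pushes on the recovery side, one way to name the knob is a second streak. This is a sketch under assumptions: `RecoveryKnob` and `periodsToRecover` are hypothetical names, and INSUFFICIENT_DATA is left out to keep the example short.

```java
final class RecoveryKnob {
    enum State { OK, ALARM }

    private final double threshold;
    private final int periodsToAlarm;
    private final int periodsToRecover; // hypothetical knob; 1 means one healthy point clears
    private int breachStreak;
    private int healthyStreak;
    private State state = State.OK;

    RecoveryKnob(double threshold, int periodsToAlarm, int periodsToRecover) {
        if (periodsToAlarm < 1 || periodsToRecover < 1) {
            throw new IllegalArgumentException("need at least one period on each side");
        }
        this.threshold = threshold;
        this.periodsToAlarm = periodsToAlarm;
        this.periodsToRecover = periodsToRecover;
    }

    /** Feeds one datapoint and returns the state after it. */
    State observe(double value) {
        if (value > threshold) {
            breachStreak++;
            healthyStreak = 0;
            if (breachStreak >= periodsToAlarm) {
                state = State.ALARM;
            }
        } else {
            healthyStreak++;
            breachStreak = 0;
            if (healthyStreak >= periodsToRecover) {
                state = State.OK;
            }
        }
        return state;
    }
}
```

With `periodsToRecover = 1` this clears on the first healthy point, as the evaluator above does; with 2, a single healthy reading in the middle of an incident leaves the alarm in ALARM. The cost of the larger value is a slower all-clear.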
Step 5 — Observer, with manners
Who gets told? Email for the log, chat for the team, the pager for whoever drew the short straw. The alarm cannot know — and must not care. That's the Observer pattern's whole job:
```java
private void transitionTo(Alarm alarm, AlarmState next, double value) {
    if (next == alarm.state) {
        return; // same state → SILENCE. no spam, no 47 pages
    }
    AlarmState previous = alarm.state;
    alarm.state = next;
    for (AlarmListener listener : listeners) {
        listener.onStateChange(alarm.name, previous, next, value);
    }
}
```
The guard at the top of the method is the nurse's discipline: a metric breaching for an hour produces one notification at the transition, not sixty. Most homemade alert systems fail precisely here, and the engineers carrying their pagers can tell you the date it happened.
The follow-up with teeth: what if a listener throws — should one broken email integration stop the pager from firing? Production answer: never; wrap each call, log the failure, keep delivering. We keep the loop simple here, but say the sentence — "listener isolation" is the kind of phrase that gets written down.
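For reference, listener isolation could be sketched like this. `IsolatedFanOut`, its simplified `Listener` interface, and the `deliver` method are illustrative stand-ins rather than part of the service, and real code would log through whatever logging facility the project uses instead of `System.err`.

```java
import java.util.List;

final class IsolatedFanOut {
    interface Listener {
        void onStateChange(String alarmName, String from, String to);
    }

    /** Notifies every listener in order; returns how many of them threw. */
    static int deliver(List<Listener> listeners, String alarmName, String from, String to) {
        int failures = 0;
        for (Listener listener : listeners) {
            try {
                listener.onStateChange(alarmName, from, to);
            } catch (RuntimeException e) {
                // One broken channel is logged and skipped; the loop keeps going.
                failures++;
                System.err.println("listener failed for " + alarmName + ": " + e);
            }
        }
        return failures;
    }
}
```

The try/catch sits inside the loop, not around it. Wrapping the whole loop would still let the first failure skip every listener after it.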
Step 6 — Trade-offs (each one keeping an NFR)
The last column is the discipline: every choice keeps one of the promises from Step 2 — that's what designing to the non-functional requirements looks like.
| Decision | The tempting alternative | Why ours wins | Keeps |
|---|---|---|---|
| streak counter, N-in-a-row to fire | trip on the first breach | one spike is noise; a real fire is persistence — no 3 a.m. page for a blip | correctness |
| one healthy point resets the streak | symmetric N-to-recover | recovery is acknowledged instantly; the asymmetry is a named, defensible knob | correctness |
| notify only on state change | notify per breaching datapoint | an hour of breach is one page, not sixty | no-spam fidelity |
| INSUFFICIENT_DATA as a real state | assume OK before data arrives | a silent pipeline can't masquerade as healthy | correctness |
| synchronized on every public method | lock-free reads of mutable state | streak and state are never read half-written under concurrent record | thread-safety |
| Rule record + Observer list | hardcoded thresholds and channels | a new alarm is a row and a new channel is a listener, with no evaluator edits | extensibility |
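The thread-safety row can be checked with a small stress sketch. This is a cut-down stand-in, not the service: `SynchronizedStreak` and `hammer` are hypothetical names, and the state is reduced to a boolean so the invariant is easy to state.

```java
final class SynchronizedStreak {
    private final double threshold;
    private final int periods;
    private int breachStreak;
    private boolean inAlarm;
    private int transitions;

    SynchronizedStreak(double threshold, int periods) {
        this.threshold = threshold;
        this.periods = periods;
    }

    /** The read-modify-write of streak and state happens under one lock. */
    synchronized void record(double value) {
        boolean next;
        if (value > threshold) {
            breachStreak++;
            next = inAlarm || breachStreak >= periods;
        } else {
            breachStreak = 0;
            next = false;
        }
        if (next != inAlarm) {
            inAlarm = next;
            transitions++;
        }
    }

    synchronized int streak() { return breachStreak; }

    synchronized int transitions() { return transitions; }

    /** Hammers one alarm with breaching values from several threads. */
    static SynchronizedStreak hammer(int threads, int perThread) {
        SynchronizedStreak alarm = new SynchronizedStreak(80.0, 3);
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    alarm.record(99.0);
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting", e);
            }
        }
        return alarm;
    }
}
```

After `hammer(8, 1000)` the streak should read exactly 8000 and the alarm should have transitioned exactly once. Drop the `synchronized` keyword from `record` and neither number is guaranteed any more, which is the failure the table row is guarding against.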
The complete implementation
```java
package dev.fiveyear.alarm;

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class AlarmService {

    public enum AlarmState { INSUFFICIENT_DATA, OK, ALARM }

    public record Rule(String metric, double threshold, int consecutivePeriods) {}

    public interface AlarmListener {
        void onStateChange(String alarmName, AlarmState from, AlarmState to, double value);
    }

    static final class Alarm {
        final String name;
        final Rule rule;
        AlarmState state = AlarmState.INSUFFICIENT_DATA;
        int breachStreak;

        Alarm(String name, Rule rule) {
            this.name = name;
            this.rule = rule;
        }
    }

    private final Map<String, Alarm> alarms = new HashMap<>();
    private final List<AlarmListener> listeners = new ArrayList<>();

    public synchronized void addAlarm(String name, Rule rule) {
        if (rule.consecutivePeriods() < 1) {
            throw new IllegalArgumentException("need at least one period");
        }
        alarms.put(name, new Alarm(name, rule));
    }

    public synchronized void addListener(AlarmListener listener) {
        listeners.add(listener);
    }

    /** One datapoint arrives; every alarm watching that metric re-evaluates. */
    public synchronized void record(String metric, double value) {
        for (Alarm alarm : alarms.values()) {
            if (alarm.rule.metric().equals(metric)) {
                evaluate(alarm, value);
            }
        }
    }

    public synchronized AlarmState stateOf(String alarmName) {
        Alarm alarm = alarms.get(alarmName);
        if (alarm == null) {
            throw new IllegalArgumentException("no alarm " + alarmName);
        }
        return alarm.state;
    }

    /** Debounce: N breaches in a row to fire, one healthy point to clear. */
    private void evaluate(Alarm alarm, double value) {
        boolean breach = value > alarm.rule.threshold();
        AlarmState next;
        if (breach) {
            alarm.breachStreak++;
            next = (alarm.breachStreak >= alarm.rule.consecutivePeriods())
                    ? AlarmState.ALARM
                    : (alarm.state == AlarmState.INSUFFICIENT_DATA ? AlarmState.OK : alarm.state);
        } else {
            alarm.breachStreak = 0;
            next = AlarmState.OK;
        }
        transitionTo(alarm, next, value);
    }

    /** Transition gate: listeners hear changes only, never repeats. */
    private void transitionTo(Alarm alarm, AlarmState next, double value) {
        if (next == alarm.state) {
            return;
        }
        AlarmState previous = alarm.state;
        alarm.state = next;
        for (AlarmListener listener : listeners) {
            listener.onStateChange(alarm.name, previous, next, value);
        }
    }
}
```
A bad night, narrated correctly:
```java
AlarmService service = new AlarmService();
service.addAlarm("api-cpu-high", new Rule("cpu", 80.0, 3));
service.addListener((name, from, to, value) ->
        System.out.println(name + ": " + from + " → " + to + " (at " + value + ")"));

service.record("cpu", 45); // "api-cpu-high: INSUFFICIENT_DATA → OK (at 45.0)"
service.record("cpu", 91); // breach #1: counting, silence
service.record("cpu", 95); // breach #2: counting, silence
service.record("cpu", 97); // breach #3 → "api-cpu-high: OK → ALARM (at 97.0)"
service.record("cpu", 98); // still on fire, but the pager already knows. silence.
service.record("cpu", 40); // "api-cpu-high: ALARM → OK (at 40.0)", and you can sleep
```
Three notifications for six datapoints: each one a change, none of them spam.
The interview corner
Clarify before you code: Are metrics pushed to us or pulled by us? Evaluate on every datapoint or on a schedule? Is missing data OK, ALARM, or its own thing?
The follow-up ladder:
- "Alarm on the 5-minute average, not raw points." A window buffer per alarm feeding evaluate: aggregation is a pre-step in the pipeline, not a new machine.
- "The metric stops arriving." That's INSUFFICIENT_DATA returning: a per-alarm last-seen stamp checked against an injected clock. Silence is data about your pipeline.
- "Alarm when three services degrade together." Composite alarms subscribe to child transitions — Observer chaining — and now you owe an answer about notification loops (cap the depth).
- "One outage fires 300 alarms." A notification policy layer between alarms and humans: group by service, throttle, escalate — the listener list grows into a router, and that's a separate design.
- "The pager webhook fails." At-least-once delivery: a retry queue plus idempotent receivers — and the honest sentence that exactly-once notification is a myth worth saying out loud.
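The first rung of that ladder, a window buffer in front of the evaluator, might be sketched as follows. `WindowAverage` is a hypothetical helper, and it assumes a count-based window (the last k points) rather than a true 5-minute time window, which would also need timestamps.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class WindowAverage {
    private final int size;
    private final Deque<Double> window = new ArrayDeque<>();
    private double sum;

    WindowAverage(int size) {
        if (size < 1) {
            throw new IllegalArgumentException("window needs at least one slot");
        }
        this.size = size;
    }

    /** Adds a raw point and returns the mean of the last `size` points (fewer while warming up). */
    double add(double value) {
        window.addLast(value);
        sum += value;
        if (window.size() > size) {
            sum -= window.removeFirst(); // evict the oldest point
        }
        return sum / window.size();
    }
}
```

The evaluator would then receive `average.add(rawValue)` instead of the raw value, and nothing in the state machine has to change. Whether a warming-up window should be evaluated at all, or held in INSUFFICIENT_DATA until it is full, is a question worth raising with the interviewer.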
Mistakes that fail the round: notifying on every breaching datapoint; modeling two states instead of three; letting one listener's exception silence the rest.
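The "metric stops arriving" rung of the ladder depends on an injected clock, which can be sketched in a few lines. `StalenessCheck` and `maxSilenceMillis` are hypothetical names, and the clock is modeled as a plain `LongSupplier` of milliseconds so a test can move time by hand.

```java
import java.util.function.LongSupplier;

final class StalenessCheck {
    private final LongSupplier clockMillis; // injected so tests control time
    private final long maxSilenceMillis;
    private long lastSeenMillis = -1;       // -1 means no datapoint yet

    StalenessCheck(LongSupplier clockMillis, long maxSilenceMillis) {
        this.clockMillis = clockMillis;
        this.maxSilenceMillis = maxSilenceMillis;
    }

    /** Call whenever a datapoint for the watched metric arrives. */
    void onDatapoint() {
        lastSeenMillis = clockMillis.getAsLong();
    }

    /** True when nothing has ever arrived, or the last point is older than the limit. */
    boolean insufficientData() {
        return lastSeenMillis < 0
                || clockMillis.getAsLong() - lastSeenMillis > maxSilenceMillis;
    }
}
```

In production the supplier could be `System::currentTimeMillis`; in a test it is a lambda over a variable the test advances. Something still has to call `insufficientData()` on a schedule, because a silent metric will never trigger an evaluation by itself.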
Where to go from here — and Phase 2, closed
Pocket version: rules are rows, alarms are three-state machines, N-in-a-row gates the fire, one healthy point clears it, and listeners hear transitions — never weather reports.
That closes the systems-and-libraries phase, and look at the toolkit it added to Phase 1's: pipelines with cheap gates (logging), recursion over text (JSON) and over structure (the file system), coordinated lifecycles (restaurant), derived-never-stored quantities (inventory), interval arithmetic (hotel), identity inventory with lazy timers (airline) — and now debounced observation.
- Add aggregation: real alarms watch "average over 5 minutes", not raw points; that's a windowed buffer in front of evaluate, and a fine evening's work.
- Add missing-data handling: no datapoints for two periods should return the alarm to INSUFFICIENT_DATA; you'll need an injected clock, and you know exactly where to get one.
- Phase 3 is next in the backlog: intermediate LLD — the elevator, the call center, Splitwise, the thread pool.
One bad second at 3 a.m., and your phone stays dark — because somewhere, a streak counter read 1, said "not yet," and went back to watching. That's the whole kindness of the design.