
Alexa HLD: How a Voice Assistant Turns Speech Into Action

How a voice assistant works inside: the wake-word-to-speech pipeline, the NLU layer that resolves an utterance to an intent and slots, the thin-device cloud-brain split, and the skill fan-out.

By fiveyearsdev · June 14, 2026 · 18 min read

You say "Alexa, play Hey Jude by The Beatles" and music starts. Between those two moments is a pipeline that wakes on a single word, streams your audio to the cloud, transcribes it, figures out what you meant, routes to the right skill, and speaks back. The device on your counter is deliberately dumb; the intelligence lives in the cloud. And the hardest, most interview-worthy piece isn't the speech recognition — it's the NLU layer that turns a transcript into a structured intent plus slots a program can act on: Weather with city = Paris, PlayMusic with song and artist. Get that mapping right and the rest is plumbing; get it wrong and a misheard word silently does the wrong thing.

Let's start nowhere near a computer

Picture a hotel concierge who only springs to life when you say their name. Until then they politely ignore the whole lobby's chatter — that's the wake word, and it's the one thing they do entirely on their own. Once named, they listen to your sentence (transcription), work out what you actually want from the dozens of ways you might phrase it (understanding), do it or delegate to the right department (the skill), and reply out loud (speech).

The clever part is the understanding. "What's the weather in Paris," "weather in Paris," "is it raining in Paris" — all the same request. The concierge maps each phrasing to one intent (Weather) and pulls out the slot (city = Paris). That mapping is the whole game. And a good concierge remembers: ask "what about tomorrow?" and they don't ask tomorrow's what? — they carry Paris forward from the sentence before. Hold onto that; it's a section of its own later.

Where this exact shape shows up

Alexa, Google Assistant, Siri, Cortana — all run this wake → ASR → NLU → skill → TTS pipeline.
The intent+slot model is the same one behind chatbots and IVR phone trees ("press or say…").
The skill fan-out is a plugin architecture — third parties ship skills against the assistant's intent contract, the same autoscaling-per-feature idea a serverless platform runs on.

Step 1 — Functional requirements (sentences first)

Wake on a hotword, locally, without streaming audio until then.
Transcribe the spoken request to text (ASR — automatic speech recognition).
Understand the text: resolve it to an intent and extract its slots.
Fulfil the intent by routing to the matching skill.
Respond in speech (TTS — text to speech), and keep enough context for a follow-up.

The load-bearing requirement is "understand": map many phrasings to one intent with parameters — and it's the one requirement with no single right answer, so every stage downstream treats its output as a ranked, confidence-scored guess rather than a settled fact.

Step 2 — Non-functional requirements

These are the qualities that actually shape the architecture — named with the canonical terms so an interviewer hears you reaching for them on purpose:

Low latency. The round trip from end-of-speech to spoken reply must feel instant — sub-second perceived, which we'll put a real budget on in Step 6.
Privacy. No audio leaves the device until the wake word fires; the always-on part is local. This one is non-negotiable and shapes the on-device/cloud split more than latency does.
Accuracy. The right intent, the right slots — a misheard "ten" vs "tent" can't silently do the wrong thing; when unsure, the assistant confirms rather than guesses.
Availability. A single skill failing must not take down the assistant, and a device offline still does its few truly-local jobs.
Extensibility. New skills plug in against the intent contract without touching the core.
Scalability. Hundreds of millions of devices, each firing bursts of requests; every cloud stage must scale out horizontally.
Consistency & durability, split deliberately. Session state is eventually consistent and disposable: it lives on a TTL and may be lost. But the interaction log (every transcript, intent, and confidence) is durable, because it's what the models retrain on tomorrow.

Listing them is the easy half; the design only earns them if it fulfills them. Here's the contract, each requirement mapped to the one mechanism that keeps it and the step that delivers it:

Requirement	How this design fulfills it
Low latency	wake word is local; ASR streams under your speech; skill + TTS budgeted — Steps 3, 6
Privacy	the always-on stage is on-device; audio uploads only after the hotword — Steps 3, 8
Accuracy	NLU picks the most specific intent, validates slot types, confirms low-confidence ASR — Steps 5, 11
Availability	the router isolates skills; a failed skill degrades to a spoken fallback — Step 11
Extensibility	skills register against an intent contract and deploy independently — Steps 5, 8
Scalability	every cloud stage is stateless; the only state (the session) is a shared, shardable store — Step 10
Consistency/durability	sessions are eventual + TTL (droppable); the interaction log is durable for retraining — Step 4

Every trade-off below is chosen to keep one of these.

Step 3 — The pipeline, stage by stage

Five stages, and the split between them is where the design decisions live:

Wake word: a tiny model running on the device, the only always-listening part. It exists so audio isn't streamed to the cloud constantly (privacy), so the trigger fires instantly with no network round trip (latency), and so the device hardware can stay cheap.
ASR: streams the audio to the cloud and converts it to text as you speak, emitting partial hypotheses that firm up token by token.
NLU: the transcript becomes a structured intent + slots. (Step 5.)
Skill: the code that actually fulfils the intent (fetch weather, queue a song).
TTS: the skill's text response becomes audio and plays back.

The on-device/cloud boundary is deliberate: only the wake word is local, because it must be instant and private; everything heavier rides the cloud where it can be big and updated centrally.
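The five stages and the wake-word gate can be sketched as plain function composition. This is a toy under loud assumptions: `VoicePipeline` is a hypothetical name, strings stand in for audio, and every stage is a stub supplied by the caller. The only point it makes is that nothing past the gate runs until the hotword fires.

```java
import java.util.Optional;
import java.util.function.Function;
import java.util.function.Predicate;

/** Hypothetical wiring of the five stages; strings stand in for audio and every stage is a stub. */
class VoicePipeline {
    private final Predicate<String> wakeWord;     // on-device: did the hotword fire?
    private final Function<String, String> asr;   // cloud: audio -> transcript
    private final Function<String, String> nlu;   // cloud: transcript -> intent name
    private final Function<String, String> skill; // cloud: intent name -> reply text
    private final Function<String, String> tts;   // cloud: reply text -> reply audio

    VoicePipeline(Predicate<String> wakeWord, Function<String, String> asr,
                  Function<String, String> nlu, Function<String, String> skill,
                  Function<String, String> tts) {
        this.wakeWord = wakeWord;
        this.asr = asr;
        this.nlu = nlu;
        this.skill = skill;
        this.tts = tts;
    }

    /** Empty when the wake word never fired, in which case no later stage sees the audio. */
    Optional<String> handle(String audio) {
        if (!wakeWord.test(audio)) return Optional.empty();
        String transcript = asr.apply(audio);
        String intent = nlu.apply(transcript);
        String reply = skill.apply(intent);
        return Optional.of(tts.apply(reply));
    }
}
```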

Two subtleties hide inside "ASR streams." First, endpointing — the cloud has to decide when you've stopped talking, from a few hundred milliseconds of trailing silence. Too eager and it cuts you off mid-sentence ("play… [pause] the Beatles"); too lazy and it adds dead air before the reply. That silence threshold is a genuine tuning trade-off, not a constant. Second, because ASR runs while you speak, the transcript is essentially finished the moment you stop — which is the single biggest reason the whole thing feels instant, and the reason Step 6's budget starts its clock at end-of-speech, not at start-of-speech.
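The endpointing trade-off can be sketched as a trailing-silence counter. This is a minimal illustration, not how any production recogniser decides end-of-speech: the class name `SilenceEndpointer`, the fixed-size frames, and the single energy value per frame are all assumptions made for the example.

```java
/**
 * Hypothetical trailing-silence endpointer. Audio arrives as fixed-size frames,
 * each reduced to one energy value; end-of-speech is declared once enough
 * consecutive quiet frames follow at least one loud frame.
 */
class SilenceEndpointer {
    private final int frameMs;            // duration of one frame (assumed)
    private final double energyThreshold; // below this a frame counts as silence (assumed)
    private final int silenceMs;          // trailing silence needed to end the turn
    private boolean heardSpeech = false;
    private int quietMs = 0;

    SilenceEndpointer(int frameMs, double energyThreshold, int silenceMs) {
        this.frameMs = frameMs;
        this.energyThreshold = energyThreshold;
        this.silenceMs = silenceMs;
    }

    /** Feed one frame; returns true once the user is judged to have stopped talking. */
    boolean accept(double frameEnergy) {
        if (frameEnergy >= energyThreshold) {
            heardSpeech = true;
            quietMs = 0;                    // any speech resets the trailing-silence clock
            return false;
        }
        if (!heardSpeech) return false;     // leading silence never ends a turn
        quietMs += frameMs;
        return quietMs >= silenceMs;
    }

    /** Index of the frame at which end-of-speech fires, or -1 if it never does. */
    static int endpointIndex(double[] energies, int frameMs, double threshold, int silenceMs) {
        SilenceEndpointer e = new SilenceEndpointer(frameMs, threshold, silenceMs);
        for (int i = 0; i < energies.length; i++) {
            if (e.accept(energies[i])) return i;
        }
        return -1;
    }
}
```

Feeding the same frames with a shorter `silenceMs` ends the turn inside a mid-sentence pause; a longer one waits the pause out but adds that much dead air after the last word.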

Step 4 — The data model (and where each piece lives)

Circle the nouns. Some are the obvious furniture — User, Device, Skill. The load-bearing ones are the pieces that make understanding and memory work:

Intent / SampleUtterance / Slot / SlotType: the catalog a skill registers. An intent owns many sample utterances (play {song} by {artist}) and many typed slots (city is an AMAZON.City, duration is a Duration). The slot's type is what lets NLU reject "set a timer for Paris": the captured value isn't a Duration, so it doesn't match the slot's type.
Session: the hidden noun the follow-up question is about. A short-lived per-device bag of context (the last intent, the slots so far, an expires_at). This is the only piece of mutable, hot state in the whole design.
Interaction: one append-only row per turn, holding the transcript, the resolved intent, the confidence, and the timestamp. Nobody reads it on the request path; it exists so the models can be retrained and so you can measure which intents mis-fire.

the data model

users          (user_id, ...)
devices        (device_id, user_id → users, locale, location, wake_word)
skills         (skill_id, invocation_name, endpoint_url)
intents        (intent_id, skill_id → skills, name)
sample_utts    (id, intent_id → intents, pattern)           -- "play {song} by {artist}"
slots          (slot_id, intent_id → intents, name, slot_type)
sessions       (session_id, device_id → devices,
                context_json, expires_at)                   -- the hidden noun; hot + ephemeral
interactions   (id, device_id, transcript, intent_id,
                confidence, ts)                             -- append-only; retraining fuel
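
The slot_type column in the schema above is what turns a captured string into a checked value. A minimal sketch of one such type follows, assuming a hypothetical `SlotTypes.parseDuration` helper that accepts only "<digits> <unit>" text; real slot types would also handle spelled-out numbers and compound phrases, which this deliberately leaves out.

```java
import java.time.Duration;
import java.util.Locale;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical slot-type validator: turns captured text into a typed value or rejects it. */
class SlotTypes {
    // Accepts only "<digits> <unit>", e.g. "5 minutes"; spelled-out numbers are out of scope here.
    private static final Pattern DURATION =
        Pattern.compile("(\\d{1,6}) (second|minute|hour)s?");

    /** Returns the parsed Duration, or empty when the text is not a duration at all. */
    static Optional<Duration> parseDuration(String raw) {
        Matcher m = DURATION.matcher(raw.trim().toLowerCase(Locale.ROOT));
        if (!m.matches()) return Optional.empty();
        long n = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "second": return Optional.of(Duration.ofSeconds(n));
            case "minute": return Optional.of(Duration.ofMinutes(n));
            default:       return Optional.of(Duration.ofHours(n));
        }
    }
}
```

An empty result is the signal to re-prompt or try the next interpretation rather than hand a skill a value it cannot use.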

Which datastores — and why it isn't one "database." The access patterns pull in three directions, so this is a split, and naming the split is the senior beat:

The catalog (skills, intents, utterances, slots) is small, relational, read-mostly, and only changes when a developer publishes a skill — a relational SQL store, cached and compiled into in-memory matchers on the NLU nodes. New skills propagate in minutes; eventual consistency is fine here.
The session needs single-digit-millisecond reads and writes on every turn, and it must expire on its own after a short silence — that's Redis (or any in-memory KV with native TTLs), keyed by device_id. Losing it is a non-event by design.
The interaction log is a firehose — hundreds of millions of rows a day, queried in windows ("fallback rate for PlayMusic this week") — which is a columnar/OLAP store like ClickHouse or Druid, fed off the request path, never on it.
The big ASR/NLU model artifacts are large binaries loaded onto GPU nodes — an object store (S3-style), not a database at all.

The rule underneath: the store follows the access pattern and the NFR it must keep. Relational integrity and joins → SQL; per-turn TTL state → Redis; windowed analytics → columnar; blobs → object store. Never "a database."

Step 5 — NLU: resolving intent and slots

Here's the core. Each skill registers sample utterances with {slot} placeholders; a transcript is matched against all of them, and the most specific match — the one nailed down by the most literal words — wins, so play {song} by {artist} beats the looser play {song}.

IntentRouter.java

package dev.fiveyear.voice;
 
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
/**
 * The natural-language understanding (NLU) core of a voice assistant. After speech is
 * transcribed to text, the assistant must decide WHICH skill the user wants (the
 * intent) and pull out the variable parts (the slots). Each skill registers sample
 * utterances with {slot} placeholders — "play {song} by {artist}" — which compile to a
 * matcher. An incoming transcript is matched against all of them and the MOST SPECIFIC
 * match wins (the one pinned down by the most literal words), so "play X by Y" beats the
 * looser "play {song}". This is the routing layer between raw transcript and skill code.
 */
public class IntentRouter {
 
    /** The resolved intent plus the slot values pulled from the utterance. */
    public static final class Match {
        public final String intent;            // null == no skill matched
        public final Map<String, String> slots;
        Match(String intent, Map<String, String> slots) { this.intent = intent; this.slots = slots; }
        public boolean matched() { return intent != null; }
        static Match none() { return new Match(null, Map.of()); }
    }
 
    private static final class Sample {
        final String intent;
        final Pattern regex;
        final int literalCount; // more literal words == more specific
        final List<String> slots;
        Sample(String intent, Pattern regex, int literalCount, List<String> slots) {
            this.intent = intent; this.regex = regex; this.literalCount = literalCount; this.slots = slots;
        }
    }
 
    private final List<Sample> samples = new ArrayList<>();
 
    /** Register one intent with one or more sample utterances containing {slot} placeholders. */
    public void register(String intent, String... utterances) {
        for (String u : utterances) samples.add(compile(intent, u));
    }
 
    /** Resolve a transcript to its most specific matching intent, extracting slot values. */
    public Match resolve(String transcript) {
        String u = normalize(transcript);
        Sample best = null;
        Matcher bestMatcher = null;
        for (Sample s : samples) {
            Matcher m = s.regex.matcher(u);
            if (m.matches() && (best == null || s.literalCount > best.literalCount)) {
                best = s;
                bestMatcher = m;
            }
        }
        if (best == null) return Match.none();
        Map<String, String> slots = new LinkedHashMap<>();
        for (String name : best.slots) slots.put(name, bestMatcher.group(name).trim());
        return new Match(best.intent, slots);
    }
 
    private static Sample compile(String intent, String utterance) {
        StringBuilder regex = new StringBuilder();
        List<String> slots = new ArrayList<>();
        int literals = 0;
        String[] tokens = normalize(utterance).split(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) regex.append(" ");
            String t = tokens[i];
            if (t.startsWith("{") && t.endsWith("}")) {
                String name = t.substring(1, t.length() - 1);
                slots.add(name);
                regex.append("(?<").append(name).append(">.+?)");
            } else {
                regex.append(Pattern.quote(t));
                literals++;
            }
        }
        return new Sample(intent, Pattern.compile(regex.toString()), literals, slots);
    }
 
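    /** Lowercase, drop punctuation, collapse whitespace, so matching ignores casing and "?" etc. */
    private static String normalize(String s) {
        // Locale.ROOT keeps lowercasing stable whatever the JVM's default locale is
        // (under a Turkish locale "I" lowercases to a dotless i, which the filter below would strip).
        // Note: the character class keeps ASCII letters and digits only.
        return s.toLowerCase(java.util.Locale.ROOT).replaceAll("[^a-z0-9{} ]", " ").replaceAll("\\s+", " ").trim();
    }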
}

The literalCount ranking is the heart of it. Many sample utterances match a given transcript; choosing the one with the most literal (non-slot) words means the assistant prefers the interpretation the user pinned down most precisely. This is where the functional "understand" requirement and the non-functional accuracy one are both cashed — and note what it doesn't cost: matching is a handful of compiled-regex passes, microseconds, so correctness here doesn't dent the latency budget.

Two things production layers on top of the same contract, both worth naming:

The ASR feeds an n-best list, not one string. "Recognise speech" vs "wreck a nice beach" — the recogniser returns several candidate transcripts with scores, and NLU resolves each, letting a confident intent match rescue a shaky transcript. The contract is unchanged: transcript in, ranked intent + typed slots out; it just runs a few times.
Slot types validate, they don't just capture. The matcher above pulls duration = "5 minutes"; a real slot type parses it to a Duration and rejects a value that doesn't fit, which is half of how a mishearing gets caught before it fires a skill.
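
The n-best idea can be sketched in a few lines. The names here (`NBestResolver`, `Hypothesis`, the `resolveIntent` function) are hypothetical, the score scale is assumed to be 0 to 1, and the selection rule (take the highest-scored candidate that resolves to any intent) is one simple policy among several a real system might use.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical n-best resolver: try each ASR candidate, keep the best one that maps to an intent. */
class NBestResolver {
    /** One ASR candidate: a transcript plus the recogniser's score (assumed to lie in [0, 1]). */
    static final class Hypothesis {
        final String transcript;
        final double asrScore;
        Hypothesis(String transcript, double asrScore) {
            this.transcript = transcript;
            this.asrScore = asrScore;
        }
    }

    /**
     * resolveIntent stands in for the NLU contract: transcript in, intent name out,
     * or null when nothing matches. Returns the highest-scored hypothesis that resolves.
     */
    static Optional<Hypothesis> pick(List<Hypothesis> nBest, Function<String, String> resolveIntent) {
        Hypothesis best = null;
        for (Hypothesis h : nBest) {
            if (resolveIntent.apply(h.transcript) == null) continue; // no skill claims this reading
            if (best == null || h.asrScore > best.asrScore) best = h;
        }
        return Optional.ofNullable(best);
    }
}
```

Here a lower-scored transcript that some skill actually claims beats a higher-scored one that resolves to nothing, which is the "rescue" described above.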

Step 6 — The request lifecycle, and where the milliseconds go

"Sub-second" is a promise you have to be able to budget. The clock that matters starts at end-of-speech — because streaming ASR has been transcribing all along, the transcript is basically ready the instant you stop talking.

ASR finalize — ~90 ms. Not "transcribe the whole sentence" (that already happened live); just endpoint and settle the last few tokens.
NLU resolve — ~50 ms. The matching itself is microseconds; the budget is the statistical intent model running over the n-best list.
Skill call — ~250 ms. A network hop plus the skill's own work — and if the weather skill calls an upstream API, that call is inside this number. This is the fattest slice, and the one you least control.
TTS — ~100 ms to the first audio chunk. Speech is synthesised and streamed, so playback starts before the whole sentence is rendered.

End-of-speech to first spoken word lands near 490 ms — comfortably under a second, and the shape of the budget is the lesson: NLU is cheap; the skill call and TTS dominate. So the latency wins live there — cache a skill's upstream response, stream TTS instead of batching it, pre-synthesize the common replies ("Okay", "Playing…") so their TTS cost is zero. Shaving NLU would be optimising the slice that was never the problem.
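
The budget arithmetic is easy to check mechanically. The sketch below simply adds up the four per-stage estimates quoted above and reports the largest slice; the class and key names are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Adds up the per-stage budget from the text; the numbers are the article's estimates. */
class LatencyBudget {
    static Map<String, Integer> stages() {
        Map<String, Integer> ms = new LinkedHashMap<>();
        ms.put("asr_finalize", 90);
        ms.put("nlu_resolve", 50);
        ms.put("skill_call", 250);
        ms.put("tts_first_chunk", 100);
        return ms;
    }

    static int totalMs(Map<String, Integer> stages) {
        int sum = 0;
        for (int v : stages.values()) sum += v;
        return sum;
    }

    /** Name of the stage with the largest share, i.e. where optimisation pays most. */
    static String fattest(Map<String, Integer> stages) {
        String worst = null;
        for (Map.Entry<String, Integer> e : stages.entrySet()) {
            if (worst == null || e.getValue() > stages.get(worst)) worst = e.getKey();
        }
        return worst;
    }

    public static void main(String[] args) {
        Map<String, Integer> s = stages();
        // prints: total=490 ms, fattest=skill_call
        System.out.println("total=" + totalMs(s) + " ms, fattest=" + fattest(s));
    }
}
```

The skill call alone is 250 of 490 ms, roughly half the budget, while NLU is about a tenth.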

Step 7 — Multi-turn: the session carries context

"What's the weather in Paris?" … "What about tomorrow?" The second sentence has no city in it, yet the assistant answers Paris. That only works because a session — a small per-device bag of context with a short TTL — remembers the last intent and its slots, and NLU fills the gaps in a follow-up from what's already there.

Turn one resolves to Weather{city: Paris} and writes city = Paris into the session. Turn two resolves against the same intent, finds city missing in the text, and borrows it from the session while filling date = tomorrow from the new words. The session is why the assistant feels like it's listening rather than answering one disconnected question at a time.

Three edges an interviewer will push on, none of which the happy path shows:

The TTL is the whole trick. The session expires after ~30 seconds of silence, on the same lazy-expiry principle a seat-hold uses — say "what about tomorrow?" a minute later and there's no Paris to borrow, so the assistant must re-prompt ("Tomorrow's weather where?") rather than guess a stale city. A follow-up isn't a promise; it's a bet that the context is still alive.
Slot elicitation is the same machinery run backwards. Say only "play something" and the skill returns "what would you like to hear?", the session records that it's waiting on the song slot, and your next words fill it. A multi-turn conversation is just a session with a hole in it.
The session is mutable hot state, so its home is forced. This is exactly why session lives in the shared KV, not on the node that handled turn one — turn two may land on a different node behind the load balancer, and it still has to see Paris.
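
The carry-over and the TTL can be sketched together. This is an in-memory stand-in for the shared KV, not a client for any real store: `SessionStore`, `remember`, and `carryOver` are hypothetical names, the clock is injected so expiry can be shown without sleeping, and the class is not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.LongSupplier;

/** Hypothetical in-memory stand-in for the shared KV: per-device slots with a lazy TTL. */
class SessionStore {
    private static final class Entry {
        final Map<String, String> slots;
        final long expiresAtMs;
        Entry(Map<String, String> slots, long expiresAtMs) {
            this.slots = slots;
            this.expiresAtMs = expiresAtMs;
        }
    }

    private final Map<String, Entry> byDevice = new HashMap<>();
    private final long ttlMs;
    private final LongSupplier clock; // injected so expiry can be exercised without sleeping

    SessionStore(long ttlMs, LongSupplier clock) {
        this.ttlMs = ttlMs;
        this.clock = clock;
    }

    /** Save the slots of the turn just handled and restart the TTL. */
    void remember(String deviceId, Map<String, String> slots) {
        byDevice.put(deviceId, new Entry(new HashMap<>(slots), clock.getAsLong() + ttlMs));
    }

    /**
     * Fill slots missing from the new turn using the live session. Values spoken in this
     * turn always win; an expired session contributes nothing (lazy expiry on read).
     */
    Map<String, String> carryOver(String deviceId, Map<String, String> spoken) {
        Map<String, String> merged = new HashMap<>();
        Entry e = byDevice.get(deviceId);
        if (e != null && clock.getAsLong() < e.expiresAtMs) {
            merged.putAll(e.slots);
        } else {
            byDevice.remove(deviceId); // expired or absent: nothing to borrow
        }
        merged.putAll(spoken);
        return merged;
    }
}
```

When the merged map still lacks a slot the intent requires, that is the cue to re-prompt instead of guessing.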

Step 8 — Architecture: thin device, cloud brain

The device holds only the wake word, mic, and speaker. The cloud holds the brain: ASR, NLU, the intent router, and the session store for follow-ups. The router then fans out to skills — independently built and deployed services that fulfil intents against a fixed contract:

Skill.java — the contract every skill implements

package dev.fiveyear.voice;
 
import java.util.Map;
 
/** Third-party skills implement this; the core knows nothing about them beyond it. */
public interface Skill {
    /** Called once at startup: claim intents by registering their sample utterances. */
    void register(IntentRouter router);
 
    /** Fulfil one resolved intent; return the text the assistant will speak. */
    String handle(String intent, Map<String, String> slots);
}

That two-method interface is the extensibility requirement, made concrete: a skill declares what it answers (register) and how (handle), and the core links to nothing else. Add a skill, deploy it, and the router discovers it — the millions of devices in the field never change. This split is what lets the assistant get smarter (update the cloud) and grow (add skills) without ever touching the hardware on the counter.
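
The fan-out itself can be sketched as a lookup plus a guard. This is an illustration under assumptions: `SkillDispatcher` and its fallback strings are hypothetical, and in-process lambdas stand in for what would really be network calls to separately deployed skill services.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Hypothetical fan-out: the core maps an intent name to whichever skill claimed it. */
class SkillDispatcher {
    static final String NO_SKILL = "Sorry, I can't help with that yet.";
    static final String SKILL_FAILED = "Sorry, that skill isn't responding right now.";

    // intent name -> handler(intent, slots) returning the text to speak
    private final Map<String, BiFunction<String, Map<String, String>, String>> handlers = new HashMap<>();

    /** A skill claims an intent; the core learns nothing else about it. */
    void claim(String intent, BiFunction<String, Map<String, String>, String> handler) {
        handlers.put(intent, handler);
    }

    /** Route one resolved intent; a missing or crashing skill becomes a spoken fallback. */
    String dispatch(String intent, Map<String, String> slots) {
        BiFunction<String, Map<String, String>, String> h = handlers.get(intent);
        if (h == null) return NO_SKILL;
        try {
            return h.apply(intent, slots);
        } catch (RuntimeException e) {
            return SKILL_FAILED; // one skill's failure stays inside this call
        }
    }
}
```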

Step 9 — Trade-offs (each one keeping an NFR)

Every decision is accountable to a promise from Step 2 — that's the last column:

Decision	The tempting alternative	Why ours wins	Keeps
wake word on-device	stream all audio to the cloud	privacy + instant trigger; no constant upload	privacy, latency
heavy NLU/ASR in the cloud	run everything on the device	big models, updated centrally, cheap hardware	accuracy, latency
streaming ASR + endpointing	record fully, then upload	transcript ready at end-of-speech; no round-trip wait	latency
most-specific intent match	first match wins	prefers the interpretation the user pinned down most	accuracy
skills behind an intent contract	hard-code every feature in core	third parties extend the assistant without touching it	extensibility
session in a shared KV with TTL	dialog state in node memory	any node serves the follow-up; abandoned state expires	scalability

The complete implementation

The intent router is the engine. Here's the driver that proves it — exact resolution, multi-token slots, most-specific ranking, fallback, and a clean no-match:

Main.java — intent resolution, asserted

package dev.fiveyear.voice;
 
import dev.fiveyear.voice.IntentRouter.Match;
 
public class Main {
    public static void main(String[] args) {
        IntentRouter router = new IntentRouter();
        router.register("WeatherIntent", "what is the weather in {city}", "weather in {city}");
        router.register("TimerIntent", "set a timer for {duration}");
        router.register("PlayMusicIntent", "play {song} by {artist}", "play {song}");
 
        // exact intent + slot extraction, ignoring case and the trailing "?"
        Match w = router.resolve("What is the weather in Paris?");
        assertTrue(w.matched() && w.intent.equals("WeatherIntent"), "weather intent resolved");
        assertTrue(w.slots.get("city").equals("paris"), "city slot extracted");
 
        // a different sample utterance for the same intent
        Match w2 = router.resolve("weather in San Francisco");
        assertTrue(w2.intent.equals("WeatherIntent") && w2.slots.get("city").equals("san francisco"),
            "multi-word slot value captured");
 
        // a multi-token slot value
        Match t = router.resolve("set a timer for 5 minutes");
        assertTrue(t.intent.equals("TimerIntent") && t.slots.get("duration").equals("5 minutes"),
            "duration slot spans multiple tokens");
 
        // specificity: "play X by Y" must beat the looser "play {song}"
        Match p = router.resolve("play Hey Jude by The Beatles");
        assertTrue(p.intent.equals("PlayMusicIntent"), "play intent resolved");
        assertTrue(p.slots.get("song").equals("hey jude"), "song slot");
        assertTrue(p.slots.get("artist").equals("the beatles"), "artist slot — the more specific sample won");
 
        // the looser sample handles the no-artist case
        Match p2 = router.resolve("play Yesterday");
        assertTrue(p2.intent.equals("PlayMusicIntent") && p2.slots.get("song").equals("yesterday")
            && !p2.slots.containsKey("artist"), "falls back to the single-slot sample");
 
        // nothing matches -> a clean no-match the assistant can answer "I didn't get that"
        Match none = router.resolve("tell me a joke");
        assertTrue(!none.matched() && none.intent == null, "unmatched utterance returns no intent");
 
        System.out.println("ALL VOICE ASSISTANT ASSERTIONS PASSED");
    }
 
    static void assertTrue(boolean cond, String msg) { if (!cond) throw new AssertionError(msg); }
}

Step 10 — Scaling to millions of speakers

The junior move is to shard on day one. The senior move is to climb a ladder, and to earn each rung with the specific bottleneck that forced it:

One region, one pool. Stateless ASR/NLU nodes and a single session store. Fine for a pilot — until the GPU-bound ASR nodes are the first thing to melt, because speech recognition is far heavier per request than intent matching.
Split ASR from NLU. Give ASR its own GPU-backed autoscaling pool and NLU a cheaper CPU pool, each scaling on its own load. Coupling them would mean buying GPUs to serve regex matches. Now the per-turn session read/write from every node becomes the pinch…
Move the session to a shared cache. A fast KV (Redis) keyed by device_id, so any node serves any turn and node memory holds nothing. When even that one store runs hot…
Shard the session store by device_id. Sessions partition perfectly — every device is independent, no cross-partition query exists — so this is the cleanest shard in system design. Model artifacts, meanwhile, are served from the object store and warmed onto nodes: a read-replica in spirit.

Only after those four rungs does anything exotic come up — and you'll rarely get there.
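
Sharding by device_id comes down to a stable function from device to shard. A minimal sketch, assuming a hypothetical `SessionSharding.shardFor` helper and plain hash-modulo placement:

```java
/** Hypothetical shard picker: every turn from one device lands on the same session shard. */
class SessionSharding {
    /**
     * Stable shard index in [0, shardCount). String.hashCode is defined by the language
     * specification, so the same device id maps to the same shard on every node.
     */
    static int shardFor(String deviceId, int shardCount) {
        if (shardCount <= 0) throw new IllegalArgumentException("shardCount must be positive");
        return Math.floorMod(deviceId.hashCode(), shardCount); // floorMod: hashCode may be negative
    }
}
```

Plain modulo remaps most devices when the shard count changes. For sessions that is tolerable, since a lost session only costs a re-prompt; a store that must keep its data through resharding would more likely use consistent hashing.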

The write hot-key is a different axis. Sharding by device spreads steady traffic beautifully, but it can't help a correlated spike: a live TV ad says "Alexa, order the thing" and a million devices fire the same utterance at the same skill in the same ten seconds. Two moves, neither of which is "add nodes blindly":

The identical transcript is cacheable. A million copies of one utterance resolve to one NLU result — cache it and the intent layer barely notices the flood.
Meter at the skill boundary. The target skill is the hot key, so guard it with a queue and a rate limiter, and autoscale it independently. Because skills are isolated behind the contract, the surge is contained to one service; every other skill — and the core — is untouched.
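
Metering at the skill boundary is commonly done with a token bucket; the sketch below is one way to write it, not a description of what any particular assistant runs. `SkillRateLimiter` is a hypothetical name, and the clock is injected so the behaviour can be shown without sleeping.

```java
import java.util.function.LongSupplier;

/** Hypothetical token bucket guarding one skill's endpoint. */
class SkillRateLimiter {
    private final double capacity;      // largest burst allowed
    private final double refillPerMs;   // steady-state rate
    private final LongSupplier clockMs; // injected so the sketch can be exercised without sleeping
    private double tokens;
    private long lastRefillMs;

    SkillRateLimiter(int capacity, double refillPerSecond, LongSupplier clockMs) {
        this.capacity = capacity;
        this.refillPerMs = refillPerSecond / 1000.0;
        this.clockMs = clockMs;
        this.tokens = capacity;
        this.lastRefillMs = clockMs.getAsLong();
    }

    /** True if the call may proceed now; false means queue it or speak a "try again" reply. */
    synchronized boolean tryAcquire() {
        long now = clockMs.getAsLong();
        tokens = Math.min(capacity, tokens + (now - lastRefillMs) * refillPerMs);
        lastRefillMs = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```

One limiter per skill keeps the surge local: a flooded skill sheds or queues its own load while every other skill keeps its full allowance.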

The headline: every cloud stage is stateless and replicated, the only mutable state (the session) lives in a shardable store, and a hot skill is a local fire, not a system-wide one.

Step 11 — When a piece fails: designing for failure

A design is finished when you can say what happens as each box dies — component by component, degrade instead of collapse:

The wake word misfires. A false negative just makes you repeat yourself. A false positive (the TV said a name) is caught by a second, heavier wake-word check in the cloud that rejects the audio and discards it — so a hair-trigger local model never turns into a privacy leak.
ASR is unsure. A low-confidence transcript (or a tie across the n-best list) triggers a confirmation — "Did you mean…?" — so a mishearing never silently fires the wrong intent. Accuracy over guessing, spent exactly where it's cheap.
NLU finds no intent. A clean no-match (as the code returns) drives a re-prompt ("I didn't get that"), never a wrong action.
A skill errors or times out. The router catches it, speaks a graceful fallback, and a circuit-breaker trips after repeated failures so a sick skill is skipped fast instead of eating the latency budget on every call. One skill's failure is bulkheaded from the rest.
The session store is down. The assistant degrades to one-shot: every single-sentence request still resolves and fulfils; only multi-turn follow-ups lose their memory, so "what about tomorrow?" asks for the city again. The session was only ever an optimization over asking outright — it degrades, it doesn't collapse.
The cloud is unreachable. The device still does its few truly-local jobs (stop an alarm, change volume) and otherwise reports it's offline — fail fast, don't hang.

The pattern across all six is the lesson: a dependency that's only an optimization (the session, a cached skill response) degrades to doing without; a dependency that holds the truth (the model artifacts, the catalog) gets replicated and warmed; a slow external (a skill's upstream API) is wrapped in a timeout and circuit-breaker so its outage expires harmlessly. Designing for failure is deciding, in advance, how the system bends.
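
The circuit-breaker behaviour for a sick skill can be sketched as follows. This is a simplified illustration: `SkillCircuitBreaker` is a hypothetical name, it counts consecutive failures only, it has no separate half-open state (the first call after the cool-off is simply attempted), and it holds a lock during the call for brevity.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Hypothetical circuit breaker around one skill: fail fast while the skill is known to be sick. */
class SkillCircuitBreaker {
    private final int failureThreshold;  // consecutive failures that trip the breaker
    private final long openMs;           // how long to skip the skill once tripped
    private final LongSupplier clockMs;  // injected so the sketch can be exercised without sleeping
    private int consecutiveFailures = 0;
    private long openUntilMs = 0;

    SkillCircuitBreaker(int failureThreshold, long openMs, LongSupplier clockMs) {
        this.failureThreshold = failureThreshold;
        this.openMs = openMs;
        this.clockMs = clockMs;
    }

    /** Run the skill call, or return the fallback at once while the breaker is open. */
    synchronized String call(Supplier<String> skillCall, String fallback) {
        long now = clockMs.getAsLong();
        if (now < openUntilMs) return fallback;        // open: skip the call entirely
        try {
            String reply = skillCall.get();
            consecutiveFailures = 0;                   // a success resets the count
            return reply;
        } catch (RuntimeException e) {
            if (++consecutiveFailures >= failureThreshold) {
                openUntilMs = now + openMs;            // trip: skip calls until the cool-off ends
                consecutiveFailures = 0;
            }
            return fallback;
        }
    }
}
```

While the breaker is open the fallback is spoken without touching the skill at all, so a dead skill costs no latency budget.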

The interview corner

Clarify before you draw: Is the wake word always-on or push-to-talk (it changes the privacy story)? One locale or many (multilingual routes ASR/NLU to per-language models)? Is barge-in supported — can the user talk over the reply? What's the latency target (name the sub-second, end-of-speech budget)? Are third-party skills in scope (they bring the contract and isolation) or first-party only? Is personalization in scope ("call Mom" needs per-user entity resolution)?

The follow-up ladder — each rung is a new scenario, not a re-run of the pipeline:

"Mid-elicitation the user switches topics — 'set a timer,' the assistant asks 'for how long?', and they answer 'actually, what's the weather?' What happens to the half-filled slot?" The pending duration elicitation lives in the session under the same TTL, so if the user simply wanders off it self-expires and nothing dangles. The sharp case is the topic switch: the router has to recognise a fresh intent and abandon the pending elicitation rather than mis-bind "weather" as the timer's duration — intent-switch outranks slot-fill. Name that precedence rule.
"The user talks over the reply (barge-in) — what happens to the in-flight turn?" Incoming audio cancels the current TTS and starts a new turn; the superseded turn's session write must not clobber the new one. It's concurrency on session state — name the ordering.
"A skill needs five seconds. Do you make the user wait in silence?" No: speak a holding phrase or a progressive response, cap the call with a timeout, and circuit-break a chronically slow skill. Never let one skill's latency become the assistant's.
"How do you add a new language?" ASR and NLU become per-locale model routes; the intent contract and slot types are language-independent, only the sample utterances are localized, and the device's locale picks the path. You scale by configuration, not by forking the core.
"How would you even know intent resolution is getting worse?" The interaction log → the columnar store → per-intent confidence and fallback-rate dashboards, and the same logged utterances are the retraining set. This is why the durable log earned its place in the data model.

Mistakes that fail the round: streaming all audio to the cloud instead of gating on a local wake word (dead on privacy and bandwidth); treating ASR as finished before endpointing, so you cut the user off or leave dead air; and guessing on a low-confidence transcript instead of confirming, so a mishearing silently fires the wrong intent.

Where to go from here

Pocket version: the wake word is local and everything heavier is cloud; NLU maps many phrasings to one intent by the most-specific match and extracts typed slots; a TTL session carries context across turns; skills fan out behind an isolated contract; every cloud stage is stateless so the whole brain scales horizontally.

Build the slot-elicitation loop — the multi-turn state machine that fills a missing slot over several turns — and watch the session become a tiny per-device workflow engine.
Add entity resolution — "call Mom", "play my workout playlist" — as a per-user catalog with fuzzy matching, and see the intent+slot contract stretch to personalized values.
The skill fan-out is the same autoscaling idea as AWS Lambda; the stateless-service scaling mirrors the load balancer write-up; the correlated-spike defense is the rate limiter grown into a fairness gate.
New to system design? The rookie's guide to HLD walks the method this article follows.
