Alexa HLD: How a Voice Assistant Turns Speech Into Action
How a voice assistant works inside: the wake-word-to-speech pipeline, the NLU layer that resolves an utterance to an intent and slots, the thin-device cloud-brain split, and the skill fan-out.
You say "Alexa, play Hey Jude by The Beatles" and music starts. Between those two moments is a pipeline that wakes on a single word, streams your audio to the cloud, transcribes it, figures out what you meant, routes to the right skill, and speaks back. The device on your counter is deliberately dumb; the intelligence lives in the cloud. And the hardest, most interview-worthy piece isn't the speech recognition — it's the NLU layer that turns a transcript into a structured intent plus slots a program can act on.
This is the inside of Alexa, Google Assistant, Siri. The signature problem is intent resolution: many ways of saying the same thing, mapped to one action with its parameters extracted.
Let's start nowhere near a computer
Picture a hotel concierge who only springs to life when you say their name. Until then they politely ignore the whole lobby's chatter — that's the wake word, and it's the one thing they do entirely on their own. Once named, they listen to your sentence (transcription), work out what you actually want from the dozens of ways you might phrase it (understanding), do it or delegate to the right department (the skill), and reply out loud (speech).
The clever part is the understanding. "What's the weather in Paris," "weather in Paris," "is it raining in Paris" — all the same request. The concierge maps each phrasing to one intent (Weather) and pulls out the slot (city = Paris). That mapping is the whole game.
Where this exact shape shows up
- Alexa, Google Assistant, Siri, Cortana — all run this wake → ASR → NLU → skill → TTS pipeline.
- The intent+slot model is the same one behind chatbots and IVR phone trees ("press or say…").
- The skill fan-out is a plugin architecture — third parties ship skills against the assistant's intent contract.
Step 1 — Functional requirements (sentences first)
- Wake on a hotword, locally, without streaming audio until then.
- Transcribe the spoken request to text (ASR — automatic speech recognition).
- Understand the text: resolve it to an intent and extract its slots.
- Fulfil the intent by routing to the matching skill.
- Respond in speech (TTS — text to speech), and keep enough context for a follow-up.
The load-bearing requirement is "understand": map many phrasings to one intent with parameters. It's what the whole NLU layer exists to do.
Step 2 — Non-functional requirements
- Low latency. The round trip from end-of-speech to spoken reply must feel instant (sub-second perceived).
- Privacy. No audio leaves the device until the wake word fires; the always-on part is local.
- Accuracy. The right intent, the right slots — a misheard "ten" vs "tent" can't silently do the wrong thing.
- Extensibility. New skills plug in against the intent contract without touching the core.
- Availability. A single skill failing must not take down the assistant.
Listing them is the easy half; the design only earns them if it fulfills them:
| Requirement | How this design fulfills it |
|---|---|
| Low latency | wake word is local; ASR/NLU stream and run in the cloud; responses are pre-synthesized where possible — Steps 3, 6 |
| Privacy | the always-on stage is on-device; audio uploads only after the hotword — Steps 3, 6 |
| Accuracy | NLU picks the most specific intent match and validates slot types — Step 4 |
| Extensibility | skills register against an intent contract and deploy independently — Steps 4, 6 |
| Availability | the router isolates skills; a failed skill degrades to a spoken fallback — Step 8 |
Every trade-off below is chosen to keep one of these.
Step 3 — The pipeline, stage by stage
Five stages, and the split between them is where the design decisions live:
- Wake word — a tiny model running on the device, the only always-listening part. It exists so audio isn't streamed to the cloud constantly (privacy) and so the trigger fires instantly, with no network round trip (latency).
- ASR — the device streams the audio to the cloud, where it is converted to text as you speak.
- NLU — the transcript becomes a structured intent + slots. (Step 4.)
- Skill — the code that actually fulfils the intent (fetch weather, queue a song).
- TTS — the skill's text response becomes audio and plays back.
The on-device/cloud boundary is deliberate: only the wake word is local, because it must be instant and private; everything heavier rides the cloud where it can be big and updated centrally.
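The five stages can be sketched as a chain of small interfaces. This is only an illustration under stated assumptions: every name below (`VoicePipelineSketch`, `WakeWord`, `Asr`, `Nlu`, `Skill`, `Tts`) is hypothetical and not taken from any real assistant SDK, and "audio" is modeled as UTF-8 text so the example can run. The point it shows is the gate: no cloud stage is called unless the on-device wake word fires.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of the five-stage chain; none of these names come from a real SDK. */
final class VoicePipelineSketch {
    interface WakeWord { boolean heard(byte[] audio); }               // runs on the device
    interface Asr { String transcribe(byte[] audio); }                // cloud: audio -> text
    interface Nlu { Optional<Intent> understand(String transcript); } // cloud: text -> intent + slots
    interface Skill { String fulfil(Intent intent); }                 // cloud: intent -> reply text
    interface Tts { byte[] speak(String text); }                      // cloud: reply text -> audio

    static final class Intent {
        final String name;
        final Map<String, String> slots;
        Intent(String name, Map<String, String> slots) { this.name = name; this.slots = slots; }
    }

    private final WakeWord wake;
    private final Asr asr;
    private final Nlu nlu;
    private final Skill skill;
    private final Tts tts;

    VoicePipelineSketch(WakeWord wake, Asr asr, Nlu nlu, Skill skill, Tts tts) {
        this.wake = wake; this.asr = asr; this.nlu = nlu; this.skill = skill; this.tts = tts;
    }

    /** Returns reply audio, or empty when the wake word never fired and nothing left the device. */
    Optional<byte[]> handle(byte[] audio) {
        if (!wake.heard(audio)) return Optional.empty(); // privacy gate: no cloud stage is called
        String transcript = asr.transcribe(audio);
        String reply = nlu.understand(transcript)
                .map(skill::fulfil)
                .orElse("Sorry, I didn't get that.");
        return Optional.of(tts.speak(reply));
    }

    static byte[] bytes(String s) { return s.getBytes(StandardCharsets.UTF_8); }
    static String text(byte[] b) { return new String(b, StandardCharsets.UTF_8); }

    public static void main(String[] args) {
        // Stub stages: the "audio" is just UTF-8 text so the example stays runnable.
        VoicePipelineSketch pipeline = new VoicePipelineSketch(
                audio -> text(audio).startsWith("alexa"),
                audio -> text(audio).substring("alexa".length()).trim(),
                t -> t.startsWith("weather in ")
                        ? Optional.of(new Intent("WeatherIntent", Map.of("city", t.substring(11))))
                        : Optional.empty(),
                intent -> "Looking up the weather in " + intent.slots.get("city"),
                reply -> bytes(reply));

        // prints Optional[Looking up the weather in paris]
        System.out.println(pipeline.handle(bytes("alexa weather in paris")).map(VoicePipelineSketch::text));
        // prints Optional.empty (no wake word, so nothing was transcribed)
        System.out.println(pipeline.handle(bytes("weather in paris")).map(VoicePipelineSketch::text));
    }
}
```

In a real deployment the four cloud stages would sit behind network calls; the single early return is what keeps the always-on part local.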
Step 4 — NLU: resolving intent and slots
Here's the core. Each skill registers sample utterances with {slot} placeholders; a transcript is matched against all of them, and the most specific match — the one nailed down by the most literal words — wins, so play {song} by {artist} beats the looser play {song}.
package dev.fiveyear.voice;

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * The natural-language understanding (NLU) core of a voice assistant. After speech is
 * transcribed to text, the assistant must decide WHICH skill the user wants (the
 * intent) and pull out the variable parts (the slots). Each skill registers sample
 * utterances with {slot} placeholders, such as "play {song} by {artist}", which compile
 * to a matcher. An incoming transcript is matched against all of them and the MOST SPECIFIC
 * match wins (the one pinned down by the most literal words), so "play X by Y" beats the
 * looser "play {song}". This is the routing layer between raw transcript and skill code.
 */
public class IntentRouter {

    /** The resolved intent plus the slot values pulled from the utterance. */
    public static final class Match {
        public final String intent; // null == no skill matched
        public final Map<String, String> slots;

        Match(String intent, Map<String, String> slots) { this.intent = intent; this.slots = slots; }

        public boolean matched() { return intent != null; }

        static Match none() { return new Match(null, Map.of()); }
    }

    private static final class Sample {
        final String intent;
        final Pattern regex;
        final int literalCount; // more literal words == more specific
        final List<String> slots;

        Sample(String intent, Pattern regex, int literalCount, List<String> slots) {
            this.intent = intent; this.regex = regex; this.literalCount = literalCount; this.slots = slots;
        }
    }

    private final List<Sample> samples = new ArrayList<>();

    /**
     * Register one intent with one or more sample utterances containing {slot} placeholders.
     * Slot names become regex group names, so they must be letters and digits only.
     */
    public void register(String intent, String... utterances) {
        for (String u : utterances) samples.add(compile(intent, u));
    }

    /** Resolve a transcript to its most specific matching intent, extracting slot values. */
    public Match resolve(String transcript) {
        String u = normalize(transcript);
        Sample best = null;
        Matcher bestMatcher = null;
        for (Sample s : samples) {
            Matcher m = s.regex.matcher(u);
            // strict '>' means a tie goes to the sample registered first
            if (m.matches() && (best == null || s.literalCount > best.literalCount)) {
                best = s;
                bestMatcher = m;
            }
        }
        if (best == null) return Match.none();
        Map<String, String> slots = new LinkedHashMap<>();
        for (String name : best.slots) slots.put(name, bestMatcher.group(name).trim());
        return new Match(best.intent, slots);
    }

    /** Turn "play {song} by {artist}" into a regex with one named group per slot. */
    private static Sample compile(String intent, String utterance) {
        StringBuilder regex = new StringBuilder();
        List<String> slots = new ArrayList<>();
        int literals = 0;
        String[] tokens = normalize(utterance).split(" ");
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) regex.append(" ");
            String t = tokens[i];
            if (t.startsWith("{") && t.endsWith("}")) {
                String name = t.substring(1, t.length() - 1);
                slots.add(name);
                // reluctant match; matches() still forces the whole transcript to be consumed
                regex.append("(?<").append(name).append(">.+?)");
            } else {
                regex.append(Pattern.quote(t));
                literals++;
            }
        }
        return new Sample(intent, Pattern.compile(regex.toString()), literals, slots);
    }

    /** Lowercase, drop punctuation, collapse whitespace, so matching ignores casing, "?" and similar. */
    private static String normalize(String s) {
        // Locale.ROOT keeps lowercasing stable regardless of the JVM's default locale
        return s.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9{} ]", " ").replaceAll("\\s+", " ").trim();
    }
}
The literalCount ranking is the heart of it. Several sample utterances can match the same transcript; choosing the one with the most literal (non-slot) words means the assistant prefers the interpretation the user pinned down most precisely, and a tie goes to the sample registered first. Real assistants layer a statistical model on top, but the contract is the same: transcript in, best-ranked intent plus extracted slots out.
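The router above returns every slot as a raw string, while the accuracy requirement asks for slot types to be validated. One way that check might look for the timer's duration slot is sketched below. `DurationSlot` and the accepted phrasing (digits followed by second, minute, or hour) are assumptions made for illustration; a production assistant would accept far more forms, including spelled-out numbers.

```java
import java.time.Duration;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical typed-slot check: turns a raw "duration" slot string into a Duration or rejects it. */
final class DurationSlot {
    // Assumed shape: 1 to 6 digits, a space, then a unit with an optional plural "s".
    private static final Pattern SHAPE = Pattern.compile("(\\d{1,6}) (second|minute|hour)s?");

    /** Empty means the value is not a valid duration, so the caller should re-prompt instead of acting. */
    static Optional<Duration> parse(String raw) {
        Matcher m = SHAPE.matcher(raw.trim());
        if (!m.matches()) return Optional.empty();
        long n = Long.parseLong(m.group(1));
        switch (m.group(2)) {
            case "second": return Optional.of(Duration.ofSeconds(n));
            case "minute": return Optional.of(Duration.ofMinutes(n));
            default:       return Optional.of(Duration.ofHours(n));
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("5 minutes"));    // prints Optional[PT5M]
        System.out.println(parse("tent minutes")); // prints Optional.empty
    }
}
```

A misheard value such as "tent minutes" then fails validation and can drive a re-prompt rather than silently starting the wrong timer.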
Step 5 — Architecture: thin device, cloud brain
The device holds only the wake word, mic, and speaker. The cloud holds the brain: ASR, NLU, the intent router, and dialog/session state for follow-ups ("what about tomorrow?"). The router then fans out to skills — independently built and deployed services that fulfil intents against a fixed contract. This split is what lets the assistant get smarter (update the cloud) and grow (add skills) without ever touching the millions of devices in the field.
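The fixed contract between router and skills can be pictured as a single interface plus a registry. The sketch below is an assumption about its shape, not a real skills API: `SkillRegistrySketch`, `Skill`, and `dispatch` are hypothetical names, and an in-process map stands in for what would be independently deployed network services.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical skill contract and registry; the interface shape is assumed for illustration. */
final class SkillRegistrySketch {
    /** The fixed contract: extracted slots in, reply text out. */
    interface Skill { String handle(Map<String, String> slots); }

    private final Map<String, Skill> skillsByIntent = new HashMap<>();

    /** Adding a skill is a registration, not a change to the router. */
    void register(String intent, Skill skill) { skillsByIntent.put(intent, skill); }

    /** Fan out to whichever skill owns the intent, or answer with a fallback if none does. */
    String dispatch(String intent, Map<String, String> slots) {
        Skill skill = skillsByIntent.get(intent);
        return skill == null ? "Sorry, I can't help with that yet." : skill.handle(slots);
    }

    public static void main(String[] args) {
        SkillRegistrySketch registry = new SkillRegistrySketch();
        registry.register("WeatherIntent", slots -> "Looking up the weather in " + slots.get("city"));

        // prints: Looking up the weather in paris
        System.out.println(registry.dispatch("WeatherIntent", Map.of("city", "paris")));
        // prints: Sorry, I can't help with that yet.
        System.out.println(registry.dispatch("JokeIntent", Map.of()));
    }
}
```

Because the router only knows the interface, a new skill ships by registering itself, which is the property that keeps devices in the field untouched.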
Step 6 — Trade-offs (each one keeping an NFR)
| Decision | The tempting alternative | Why ours wins | Keeps |
|---|---|---|---|
| wake word on-device | stream all audio to the cloud | privacy + instant trigger; no constant upload | privacy, latency |
| heavy NLU/ASR in the cloud | run everything on the device | big models, updated centrally, cheap hardware | accuracy, latency |
| most-specific intent match | first match wins | prefers the interpretation the user pinned down most | accuracy |
| skills behind an intent contract | hard-code every feature in core | third parties extend the assistant without touching it | extensibility |
| dialog state in the cloud | stateless one-shot requests | enables follow-ups and context carry-over | extensibility |
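The "most-specific match" row is easiest to see with the looser sample registered first. The sketch below compares the two policies on the same transcript; `MatchPolicySketch` and its two hand-written patterns are illustrative stand-ins for compiled sample utterances, not part of the router itself.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical comparison of first-match vs most-specific resolution on two samples. */
final class MatchPolicySketch {
    static final class Sample {
        final Pattern regex;
        final int literalCount; // number of non-slot words in the sample
        Sample(String regex, int literalCount) {
            this.regex = Pattern.compile(regex);
            this.literalCount = literalCount;
        }
    }

    // Registered loosest-first on purpose, to expose the first-match weakness.
    static final List<Sample> SAMPLES = List.of(
            new Sample("play (?<song>.+?)", 1),
            new Sample("play (?<song>.+?) by (?<artist>.+?)", 2));

    /** First sample that matches wins, whatever its specificity. */
    static String firstMatchSong(String utterance) {
        for (Sample s : SAMPLES) {
            Matcher m = s.regex.matcher(utterance);
            if (m.matches()) return m.group("song");
        }
        return null;
    }

    /** The matching sample with the most literal words wins. */
    static String mostSpecificSong(String utterance) {
        Matcher best = null;
        int bestLiterals = -1;
        for (Sample s : SAMPLES) {
            Matcher m = s.regex.matcher(utterance);
            if (m.matches() && s.literalCount > bestLiterals) {
                best = m;
                bestLiterals = s.literalCount;
            }
        }
        return best == null ? null : best.group("song");
    }

    public static void main(String[] args) {
        String u = "play hey jude by the beatles";
        System.out.println(firstMatchSong(u));   // prints: hey jude by the beatles
        System.out.println(mostSpecificSong(u)); // prints: hey jude
    }
}
```

With first-match, the result depends on registration order and the artist is swallowed into the song title; ranking by literal words removes that dependence.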
The complete implementation
The intent router is the engine. Here's the driver that proves it — exact resolution, multi-token slots, most-specific ranking, fallback, and a clean no-match:
package dev.fiveyear.voice;

import dev.fiveyear.voice.IntentRouter.Match;

public class Main {
    public static void main(String[] args) {
        IntentRouter router = new IntentRouter();
        router.register("WeatherIntent", "what is the weather in {city}", "weather in {city}");
        router.register("TimerIntent", "set a timer for {duration}");
        router.register("PlayMusicIntent", "play {song} by {artist}", "play {song}");

        // exact intent + slot extraction, ignoring case and the trailing "?"
        Match w = router.resolve("What is the weather in Paris?");
        assertTrue(w.matched() && w.intent.equals("WeatherIntent"), "weather intent resolved");
        assertTrue(w.slots.get("city").equals("paris"), "city slot extracted");

        // a different sample utterance for the same intent
        Match w2 = router.resolve("weather in San Francisco");
        assertTrue(w2.intent.equals("WeatherIntent") && w2.slots.get("city").equals("san francisco"),
                "multi-word slot value captured");

        // a multi-token slot value
        Match t = router.resolve("set a timer for 5 minutes");
        assertTrue(t.intent.equals("TimerIntent") && t.slots.get("duration").equals("5 minutes"),
                "duration slot spans multiple tokens");

        // specificity: "play X by Y" must beat the looser "play {song}"
        Match p = router.resolve("play Hey Jude by The Beatles");
        assertTrue(p.intent.equals("PlayMusicIntent"), "play intent resolved");
        assertTrue(p.slots.get("song").equals("hey jude"), "song slot");
        assertTrue(p.slots.get("artist").equals("the beatles"), "artist slot: the more specific sample won");

        // the looser sample handles the no-artist case
        Match p2 = router.resolve("play Yesterday");
        assertTrue(p2.intent.equals("PlayMusicIntent") && p2.slots.get("song").equals("yesterday")
                && !p2.slots.containsKey("artist"), "falls back to the single-slot sample");

        // nothing matches -> a clean no-match the assistant can answer "I didn't get that"
        Match none = router.resolve("tell me a joke");
        assertTrue(!none.matched() && none.intent == null, "unmatched utterance returns no intent");

        System.out.println("ALL VOICE ASSISTANT ASSERTIONS PASSED");
    }

    static void assertTrue(boolean cond, String msg) { if (!cond) throw new AssertionError(msg); }
}
Step 7 — Scaling to millions of speakers
- Inbound audio streams → ASR is a stateless, GPU-backed service behind a load balancer; scale it horizontally by request.
- NLU throughput → intent matching is stateless per request; replicate it and cache compiled samples.
- Skill fan-out → each skill is its own autoscaling service (like a Lambda); a viral skill scales independently of the core.
- Dialog/session state → keep it in a fast key-value store keyed by device + session, not in app memory, so any node can serve the follow-up.
The headline: every cloud stage is stateless and replicated; the only state (dialog session) lives in a shared store, so the whole brain scales horizontally.
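Keeping dialog state outside app memory might look like the sketch below. It is a minimal illustration with assumed names: `DialogSessionSketch`, `remember`, and `withContext` are hypothetical, and a `ConcurrentHashMap` stands in for the shared key-value store (a real deployment would use an external store and expire old sessions). It shows how a follow-up such as "what about tomorrow?" can borrow the city from the previous turn.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical dialog-state sketch; the in-memory map stands in for a shared key-value store. */
final class DialogSessionSketch {
    private final Map<String, Map<String, String>> store = new ConcurrentHashMap<>();

    /** Sessions are keyed by device + session so any stateless node can look them up. */
    private static String key(String deviceId, String sessionId) { return deviceId + ":" + sessionId; }

    /** Save the slots of the last fulfilled intent so a follow-up can reuse them. */
    void remember(String deviceId, String sessionId, Map<String, String> slots) {
        store.put(key(deviceId, sessionId), Map.copyOf(slots));
    }

    /** Fill in slots the follow-up left out from the saved context. */
    Map<String, String> withContext(String deviceId, String sessionId, Map<String, String> followUp) {
        Map<String, String> merged = new HashMap<>(store.getOrDefault(key(deviceId, sessionId), Map.of()));
        merged.putAll(followUp); // values spoken in the follow-up win over remembered ones
        return merged;
    }

    public static void main(String[] args) {
        DialogSessionSketch sessions = new DialogSessionSketch();
        sessions.remember("device-1", "session-1", Map.of("city", "paris"));

        // "what about tomorrow?" carries a day but no city
        Map<String, String> merged = sessions.withContext("device-1", "session-1", Map.of("day", "tomorrow"));
        System.out.println(merged.get("city") + " / " + merged.get("day")); // prints: paris / tomorrow
    }
}
```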
Step 8 — When a piece fails: designing for failure
- A skill errors or times out → the router catches it and the assistant speaks a graceful fallback ("Sorry, I can't do that right now") instead of going silent. One skill's failure is contained.
- NLU finds no intent → a clean no-match (as the code returns) drives a re-prompt ("I didn't get that"), never a wrong action.
- The cloud is unreachable → the device can still do the few truly-local things (stop an alarm), and otherwise reports it's offline rather than hanging.
- ASR is unsure → low-confidence transcripts trigger a confirmation ("Did you mean…?") so a mishearing doesn't silently fire the wrong intent — accuracy over guessing.
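The first failure case, a skill that errors or times out, can be contained with a deadline around the call. The sketch below is one possible shape using the standard `ExecutorService` and `Future.get` with a timeout; `SkillGuardSketch`, `callWithFallback`, and the timeout values are assumptions for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical guard around a skill call: any error or timeout becomes a spoken fallback. */
final class SkillGuardSketch {
    static final String FALLBACK = "Sorry, I can't do that right now.";

    static String callWithFallback(ExecutorService pool, Callable<String> skillCall, long timeoutMillis) {
        Future<String> pending = pool.submit(skillCall);
        try {
            return pending.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            pending.cancel(true); // stop waiting on the slow or broken skill
            return FALLBACK;
        } catch (InterruptedException e) {
            pending.cancel(true);
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return FALLBACK;
        }
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            // healthy skill: prints its own reply
            System.out.println(callWithFallback(pool, () -> "It is sunny in Paris.", 2_000));
            // failing skill: prints the fallback
            System.out.println(callWithFallback(pool, () -> { throw new IllegalStateException("boom"); }, 2_000));
            // slow skill: prints the fallback after roughly 50 ms
            System.out.println(callWithFallback(pool, () -> { Thread.sleep(2_000); return "too late"; }, 50));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

The caller always gets something to say within the deadline, so one misbehaving skill cannot leave the assistant silent.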
The interview corner
- "Why is the wake word on-device but the rest in the cloud?" Privacy (no audio uploaded until you address it) and latency (instant local trigger), while the heavy ASR/NLU models stay big and centrally updated.
- "What's the difference between an intent and a slot?" The intent is what you want (PlayMusic); slots are the parameters (song, artist) extracted from the utterance.
- "How do you handle many phrasings of the same request?" Register multiple sample utterances per intent and resolve to the most specific match; production layers a statistical model on the same contract.
- "How do third parties add features?" Skills register against the intent contract and deploy independently — the core never changes.
- "How do you keep one bad skill from breaking the assistant?" The router isolates skills and falls back to a spoken error; skills scale and fail independently.
Where to go from here
- The skill fan-out is the same autoscaling idea as AWS Lambda; the stateless-service scaling mirrors the load balancer write-up.
- New to system design? The rookie's guide to HLD walks the method this article follows.