Stack Overflow HLD: Search Is the System, and It Isn't a SQL LIKE
A Stack Overflow system design: full-text search via an inverted index ranked by TF-IDF and votes, a read-heavy cache/CDN path, the Q&A data model, and keeping search in sync with the source of truth.
"Design Stack Overflow." It's tempting to spend the hour on questions, answers, and votes — a clean CRUD schema — and call it done. But that misses what Stack Overflow actually is to its users: a place you arrive at from Google, having searched for an error message, and land on the one answer that solves it. The product is search over a huge corpus, served to an overwhelmingly read-heavy, mostly-anonymous audience. Get search and the read path right and the CRUD is an afternoon.
And the first thing to say out loud: search is not WHERE body LIKE '%...%'. That scans every row, can't rank by relevance, and falls over at the first million questions. Real search is a different data structure entirely — an inverted index — and recognizing that is most of the interview.
Let's start nowhere near a computer
Imagine a library of a million books and the question: "which books mention concurrency?" The brute-force answer is to open every book and scan it — that's LIKE over a million rows, and it's hopeless.
A library doesn't work that way. It keeps a card catalog: one card per topic, and each card lists the books that cover it. You look up the "concurrency" card and instantly have your shortlist — without touching the other 999,000 books. And the librarian doesn't hand you the list in random order; the most-cited, most-trusted books come first.
That card catalog is an inverted index — a map from each term to the documents containing it — and "most-trusted first" is ranking. Those two ideas are the spine of Stack Overflow's search, and the rest of the design is feeding and serving them at scale.
Where this exact shape shows up
- Any search box over a large corpus — product search, docs search, email search, log search. All inverted index + ranking.
- Stack Overflow, Quora, GitHub issues, Zendesk — read-heavy Q&A/knowledge bases fronted by aggressive caching.
- The trie is the autocomplete cousin of this; the inverted index is its full-text sibling.
- The read-from-cache discipline is the same one the Cricinfo live score uses — here the hot objects are millions of cacheable question pages instead of one score.
Step 1 — Functional requirements (sentences first)
- A user asks a question (title, body, tags), answers questions, and comments.
- A user votes on questions and answers, and an asker accepts an answer.
- A user searches questions by keywords and/or tags, ranked by relevance and quality.
- A visitor reads a question page — the question plus answers, accepted first, then by score.
- A user browses by tag and sees feeds (hot, newest, active).
- Reputation accrues to users from the votes their posts receive.
The load-bearing verb is "searches." It's the one that isn't CRUD, and it dictates an entire subsystem.
Step 2 — Non-functional requirements
For Stack Overflow the qualities, not the features, choose the architecture.
- Read-heavy, by a lot. The vast majority of traffic is anonymous reads, much of it landing from search engines on question pages that rarely change. Read latency and cacheability dominate.
- Search quality and latency. Results must be relevant (not just matching) and fast. This is the core engineering problem.
- High availability. A question page is often someone's answer to a production incident; it must load.
- Eventual consistency is acceptable. Vote counts, reputation, and "new question appears in search" can lag a little; a reader won't notice a few seconds.
- Write-light. Asks, answers, and votes are rare next to reads — a single strong primary can own them.
Listing them is the easy half; the design only earns them if it fulfills them:
| Requirement | How this design fulfills it |
|---|---|
| Read-heavy / latency | question pages served from CDN + cache; SQL is touched on writes and misses — Steps 5, 9 |
| Search quality + speed | a dedicated inverted index ranked by TF-IDF blended with votes — Steps 4, 3 |
| High availability | reads survive on cache + replicas; search degrades to browse if the engine is down — Step 10 |
| Eventual consistency | the search index is updated async from the source of truth — Steps 3, 9 |
| Write-light | one SQL primary owns the transactional writes; reads scale out around it — Steps 3, 9 |
Every trade-off below is chosen to keep one of these.
Step 3 — Nouns and the data model
The nouns are familiar: questions, answers, votes, tags (many-to-many with questions), and users with reputation.
package dev.fiveyear.stackoverflow;
import java.util.Set;
/** A question in the corpus. `score` is the denormalized vote total — blended into
* search ranking so a highly-upvoted answer outranks a merely-relevant one. */
record Question(long id, String title, String body, Set<String> tags, int score) {}Which datastore — and why it isn't a default. The source of truth — questions, answers, votes, users — wants a relational database, and earns it: votes need a UNIQUE(user, post) constraint (one vote per person), accepting an answer touches two rows transactionally, and the whole thing is richly related. (Stack Overflow famously runs on a single, well-tuned SQL primary — write-light data doesn't need exotic stores.) But search is the opposite shape: you cannot rank a LIKE and you cannot afford to scan, so the index lives in a dedicated search engine (Elasticsearch / a Lucene-based store) holding the inverted index, kept in sync asynchronously from SQL. And the read path is fronted by Redis + a CDN, because a rendered question page is nearly static and read enormously. Three stores, three jobs: relational for the transactional truth, an inverted index for search, a cache for the read storm.
Step 4 — The core: search is an inverted index, not a LIKE
The card catalog, in code. Each term points at its postings — the documents that contain it, with how often. A query scores candidates by TF-IDF (a term that's rare across the corpus counts for more than a common one), then blends in the question's vote score so results are both relevant and trusted.
package dev.fiveyear.stackoverflow;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
/**
* A minimal full-text search core: the inverted index. Each term points at the
* documents that contain it (its "postings"). A search scores candidates by
* TF-IDF — rare terms count more — then blends in the question's vote score, so
* results are both relevant and trusted. This is why search isn't a SQL LIKE.
*/
public class InvertedIndex {
private record Posting(long docId, int termFreq) {}
private final Map<String, List<Posting>> postings = new HashMap<>();
private final Map<Long, Question> docs = new HashMap<>();
public void add(Question q) {
docs.put(q.id(), q);
Map<String, Integer> tf = new HashMap<>();
for (String t : tokenize(q.title() + " " + q.body())) tf.merge(t, 1, Integer::sum);
for (Map.Entry<String, Integer> e : tf.entrySet()) {
postings.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
.add(new Posting(q.id(), e.getValue()));
}
}
/** Rank questions matching the query terms; an empty tag set means no filter. */
public List<Question> search(String query, Set<String> requiredTags, int limit) {
int n = Math.max(1, docs.size());
Map<Long, Double> relevance = new HashMap<>();
for (String term : tokenize(query)) {
List<Posting> list = postings.get(term);
if (list == null) continue;
double idf = Math.log(1.0 + (double) n / list.size()); // rarer term, higher weight
for (Posting p : list) relevance.merge(p.docId(), p.termFreq() * idf, Double::sum);
}
return relevance.keySet().stream()
.map(docs::get)
.filter(q -> requiredTags.isEmpty() || q.tags().containsAll(requiredTags))
.sorted(Comparator.comparingDouble(
(Question q) -> relevance.get(q.id()) + 0.1 * q.score()).reversed())
.limit(limit)
.toList();
}
private static List<String> tokenize(String text) {
List<String> out = new ArrayList<>();
for (String t : text.toLowerCase().split("[^a-z0-9]+")) if (!t.isEmpty()) out.add(t);
return out;
}
}Three decisions in that code are the whole interview:
- Postings, not scans. Looking up a term is
O(1)into the map; you only ever touch documents that actually contain the query terms. No full-table scan, ever. - IDF weights rarity. A query for "java garbage collector" should be driven by "garbage collector," not the ten-million-times-seen "java."
log(N / postings)does exactly that. - Blend relevance with votes. A merely-matching answer shouldn't beat the canonical, 5,000-upvote one. The
+ 0.1 * scoreis how "trusted" enters the ranking — the real thing also folds in recency and exact-title matches.
The index, the TF-IDF ranking, the vote blend, and the tag filter were all run against this code: a search for "sort list" returns the higher-voted match first, a tag filter drops off-topic hits, and an unknown term returns nothing.
Step 5 — The read path: serve from cache, touch SQL rarely
A question page barely changes after the dust settles, yet it's read forever. So it's a near-perfect cache target: render it once, serve it from the edge to the millions who arrive from Google, and let the database sleep.
The discipline is the same as any read-heavy system: CDN in front, page cache (Redis) behind it, SQL only on a miss or a write. Stack Overflow's legendary "huge traffic, tiny server footprint" comes straight from this — the database is the source of truth, not the workhorse for reads.
Step 6 — Verbs become APIs (the API design)
| Verb / endpoint | Does |
|---|---|
POST /questions | ask; writes to SQL, queues an index update |
POST /questions/{id}/answers | answer |
POST /posts/{id}/vote | up/down vote (UNIQUE(user, post) enforces one) |
POST /answers/{id}/accept | asker accepts an answer |
GET /questions/{id} | the cached question page |
GET /search?q=...&tags=... | hits the search engine, not SQL |
GET /questions?tag=...&sort=hot | tag feed (cached) |
GET /search and GET /questions/{id} are the two hot paths, and neither’s common case touches the primary database — search goes to the index, reads go to the cache.
Step 7 — Trade-offs (each one keeping an NFR)
| Decision | The tempting alternative | Why ours wins | Keeps |
|---|---|---|---|
| inverted index for search | WHERE body LIKE '%term%' | no full scan, and you can actually rank by relevance | search quality |
| TF-IDF blended with votes | match-only ranking | the canonical, trusted answer rises above a mere keyword match | search quality |
| async index updates | index inside the write txn | the write stays fast; search lags by seconds, which nobody notices | write latency |
| CDN + page cache for reads | read SQL per page view | the read storm hits the edge; SQL stays write-light | read latency |
| one SQL primary (source of truth) | shard everything early | write-light data fits one tuned primary; reads scale via replicas | simplicity / cost |
The complete implementation
The index above is the engine. Here's the driver that proves its ranking — relevance blended with votes, tag filtering, and an empty result for an unknown term:
package dev.fiveyear.stackoverflow;
import java.util.List;
import java.util.Set;
public class Main {
public static void main(String[] args) {
InvertedIndex idx = new InvertedIndex();
Question q1 = new Question(1, "How to sort a list", "sort a list quickly", Set.of("python"), 10);
Question q2 = new Question(2, "Sort a list of dicts", "sort list by a key", Set.of("python"), 100);
Question q3 = new Question(3, "Binary search tree", "search a tree node", Set.of("algorithms"), 20);
idx.add(q1); idx.add(q2); idx.add(q3);
// "sort list" matches q1 and q2; q2 wins on the vote blend; q3 is irrelevant
List<Question> r = idx.search("sort list", Set.of(), 10);
assertTrue(r.size() == 2, "two relevant questions (got " + r.size() + ")");
assertTrue(r.get(0).id() == 2, "higher-voted relevant question ranks first");
assertTrue(r.get(1).id() == 1, "then the lower-voted relevant one");
assertTrue(r.stream().noneMatch(q -> q.id() == 3), "irrelevant question excluded");
// a term only q3 has
List<Question> tree = idx.search("tree", Set.of(), 10);
assertTrue(tree.size() == 1 && tree.get(0).id() == 3, "'tree' returns only q3");
// tag filter narrows results
List<Question> py = idx.search("sort list", Set.of("python"), 10);
assertTrue(py.size() == 2, "both matches are python-tagged");
List<Question> pyTree = idx.search("tree", Set.of("python"), 10);
assertTrue(pyTree.isEmpty(), "q3 is algorithms, filtered out by python tag");
// unknown term
assertTrue(idx.search("kubernetes", Set.of(), 10).isEmpty(), "no match returns empty");
System.out.println("ALL STACKOVERFLOW ASSERTIONS PASSED");
}
static void assertTrue(boolean c, String m) { if (!c) throw new AssertionError(m); }
}Step 8 — Only now, the boxes
With search and the read path settled, the architecture is those responsibilities given homes.
The shape to notice: three back-ends, each fed from the source of truth. Reads flow through CDN → page cache and rarely reach SQL; search goes to its own engine, kept current by an async indexer; and a write to SQL fans out two side effects — invalidate the page cache, and update the search index. The database is the system of record, not the thing under load.
Step 9 — Scaling the design, one bottleneck at a time
- Reads dominate → CDN + page cache first, then SQL read replicas for cache misses and logged-in views. The primary stays for writes.
- Counts are hot → denormalize the obvious ones (answer count, vote score) so a question page never runs an aggregate.
- Search grows → the search engine scales on its own axis: shard the inverted index across nodes and replicate for query throughput, independent of SQL.
- Writes are light → one well-tuned primary handles asks/answers/votes; buffer vote spikes through a queue and fold them in. You scale the read and search planes long before you shard the write plane.
The headline: this system scales by separating planes — read, search, and write each grow independently — not by sharding one giant database.
Step 10 — When a piece fails: designing for failure
- The CDN or page cache fails → an optimization gone, not the truth; reads fall back to SQL replicas, slower but correct. Degrade, don't break.
- The search engine is down → search degrades to tag/recent browse (or a basic SQL fallback), and every direct question link still works from cache. The site stays useful.
- The SQL primary fails → reads continue from cache and replicas; writes pause and fail over to a promoted replica. Availability for readers is preserved even while writes blip.
- The async indexer lags → a brand-new question shows up in search a little late. That's the eventual-consistency bargain, made on purpose, and invisible in practice.
The interview corner
- "How does search work — isn't it just
LIKE?" No. An inverted index: term → postings, ranked by TF-IDF blended with votes.LIKEcan't rank and scans every row. - "Why is search a separate system from the database?" Different data structure and different scaling axis. The index is updated async from SQL so writes stay fast and search scales on its own.
- "How do you serve such enormous read traffic?" Question pages are highly cacheable — CDN + Redis answer the vast majority; SQL is write-light and barely touched on reads.
- "How do you rank results?" Relevance (TF-IDF, exact-title bonus, recency) blended with quality signals (vote score, accepted answer). Pure keyword match isn't enough.
- "Consistency of votes and reputation?" Eventual — buffered and folded in asynchronously. A few seconds of lag is acceptable and keeps the write path cheap.
Where to go from here
- New to system design? The rookie's guide to HLD walks the method this article follows.
- The autocomplete cousin of full-text search is the trie; the read-from-cache discipline is shared with the Cricinfo live score.
- For a write-heavy tree of comments (the opposite load profile), see the Reddit comments schema.