
Reddit Comments DB Design: Modelling a Threaded Tree That Scales

A database deep dive on modelling Reddit-style threaded comments in SQL: adjacency list vs materialized path vs closure table, fetching a subtree in one query, plus scores and sharding by post.

By fiveyearsdev · June 14, 2026 · 16 min read

You're on a hot thread, thumb scrolling. A reply sits under a reply under a reply, eight indents deep; you tap "load more replies" and a fresh clutch of them slides in, best ones first. One says [deleted] — the account is gone, but the twelve replies hanging off it are all still there, still readable. None of that stutters, even though this post has three hundred thousand comments and you're looking at forty of them. That smoothness is a database trick, and it's the one the interviewer means when they say "design the database for Reddit's comments."

Because comments aren't a list — they're a tree of arbitrary depth, and a tree is the one shape a table full of flat rows is worst at. A reply has a parent, which has a parent, which has a parent. You have to load a whole branch sorted by score, paginate "load more" deep into it, show a live count, and delete a comment without orphaning the replies underneath it — all on a table that, for a viral thread, holds millions of rows. So the whole question collapses to one trade-off: how do you store a tree in flat rows so that reading one subtree is a single cheap query, not a hundred? Get that model right and everything else is ordinary SQL.

Let's start nowhere near a computer

Picture a company org chart in a filing cabinet, one index card per person. You're asked: "pull every card under VP Jane."

If each card lists only "my manager" — Bob's card says "Jane" — you start at Jane, find her direct reports, then their reports, walking the cabinet level by level. Correct, but it's a trip per layer.
If each card instead lists its full chain of command — Bob's card reads "CEO › Jane › Bob" — then "everyone under Jane" is just every card whose chain starts with "CEO › Jane ›". One pass, one matching rule, the whole branch.

That second card is a materialized path, and it's the trick that makes a tree fast to read in a flat filing cabinet — or a flat database table. The rest of this article is choosing between that and its rivals, then making it scale.

Where this exact problem shows up

Reddit, Hacker News, YouTube, Disqus — every nested comment section is this tree.
Slack / Teams threads, email threading — shallower, but the same parent-child model.
Category trees, org charts, file systems, BOMs — any "things nested inside things" schema is one of the four models below. (The in-memory file system is this tree without a database.)
The Twitter timeline HLD — comments are the read-heavy, fan-out-on-read sibling of the feed problem.

Step 1 — The access patterns (functional requirements)

For a schema, "functional requirements" means the queries it must serve. Write those down first; the model is whatever makes them cheap.

Post a comment — a reply to the post (top-level) or to another comment.
Read a page of top-level comments for a post, sorted (best / top / new), paginated.
Expand a comment's replies — fetch its subtree, sorted, paginated ("load more").
Vote on a comment and show its score.
Count the comments on a post.
Delete a comment, leaving its replies readable (the [deleted] tombstone).

The two that bite are "read a subtree to any depth" and "delete without orphaning." Everything below is in service of those.

Step 2 — Non-functional requirements

What separates a toy schema from Reddit's is how well it serves those queries under real load. Name them with the canonical terms an interviewer is listening for:

Low latency. Comments are read far more than written (100:1 and up), so read latency is the boss — the top-level page and each "load more" must feel instant. We'll happily pay at write time to make reads cheap.
Consistency, split two ways. A posted comment is strongly consistent — the author must see their own reply the instant it lands (read-your-writes), and its tree position can never be ambiguous. But the score and the comment count are fine eventually consistent: a number that lags a second or reconciles from a background job is not a bug worth blocking a write over.
High availability. A viral thread is a flood of new comments and votes at once; reads must keep serving even while writes spike, and a slow vote path must never take the read path down with it.
Durability. A comment, once posted, cannot silently vanish — the rows are the source of truth. The derived numbers (score, count) may be recomputed from those rows, so they need no durability guarantee of their own.
Scalability. One AMA can hold millions of comments; the model must shard, and the natural shard key is the post. Ordering is first-class too — the same subtree shown by best, new, or controversial without re-reading the whole branch.

Listing them is the easy half; the schema only earns them if it fulfills them. Here's the contract, each row cashed in a step below:

Requirement (canonical)	How this design fulfills it
Low latency (reads)	a materialized path turns "fetch a subtree" into one indexed range scan — Steps 4, 5
Cheap writes / availability	a new comment is one `INSERT`; no ancestor rows rewritten (unlike nested sets) — Step 6
Strong consistency (post)	the insert is atomic and self-anchoring; the author's own row is in their view — Steps 6, 9
Eventual consistency (tally)	a denormalized `score`/count, folded from a Redis counter, may lag harmlessly — Step 6
Sort flexibility	per-level `ORDER BY score` on a score-ordered index; keyset paging, no join to votes — Step 5
Durability	the comment rows are the truth; derived tallies are recomputable, never load-bearing — Step 9
Scalability	everything keys on `post_id`, so the table shards by post cleanly — Step 8

Every trade-off below is chosen to keep one of these.

Step 3 — Three ways to model a tree

This is the whole interview. There are four classic models; three are live options and one is a trap.

Adjacency list — each row stores its parent_id. Trivial to write, but reading a deep subtree means one query per level, or a recursive CTE that walks the tree at read time. Fine for shallow threads; it strains as depth grows.
Materialized path — each row stores the full chain of ancestor ids as a string (/1/4/9/). A subtree is one prefix query at any depth. Writes stay a single insert. This is the sweet spot for read-heavy comment trees.
Closure table — a separate table holds one row per ancestor→descendant pair. Maximally flexible (move a subtree cheaply), but a comment at depth d writes d rows, and the table explodes on deep, wide threads. Great for org charts that get re-parented; overkill for append-only comments.
Nested sets (the trap) — each node stores left/right numbers bracketing its subtree. Reads are elegant, but inserting one comment renumbers half the table. For a high-write comment system it's disqualifying; name it in an interview only to reject it.

The choice reduces to a read-cost vs. write-cost trade you make on purpose. Adjacency pushes the whole cost onto every read (walk the tree each time); closure and nested sets push it onto writes (many rows touched, or a renumber, per insert); materialized path splits it — one string written once at insert, one prefix scan at read — which is exactly the split a 100:1 read-heavy workload wants. That's why the rest of this build commits to it, and Step 4 names the store that carries it.
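To make the write-cost side of that trade concrete, here is a minimal sketch that counts the rows a single new comment would write under the materialized-path and closure-table models. The function names are hypothetical, and it assumes a top-level comment's parent path is the bare separator `/` and that the closure table also stores the self-pair (a common convention, not something this article specifies).

```python
def materialized_path_rows(parent_path: str, new_id: int) -> list[str]:
    """Materialized path: one row, the parent's path with the new id appended."""
    return [f"{parent_path}{new_id}/"]


def closure_table_rows(parent_path: str, new_id: int) -> list[tuple[int, int]]:
    """Closure table: one (ancestor, descendant) row per ancestor, plus the self-pair."""
    ancestors = [int(part) for part in parent_path.strip("/").split("/") if part]
    return [(ancestor, new_id) for ancestor in ancestors] + [(new_id, new_id)]


# Comment 9 replies to comment 4, which replied to top-level comment 1 (depth 3).
path_rows = materialized_path_rows("/1/4/", 9)
closure_rows = closure_table_rows("/1/4/", 9)
```

Here `path_rows` holds a single path string while `closure_rows` holds three pairs, one per level of depth, which is the "a comment at depth d writes d rows" cost described above.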

Step 4 — The data model and schema

Circle the nouns and you get three entities — and the one people miss is that the tree doesn't need a table of its own. It lives inside the comment row, as a pointer back to another comment:

the data model

post          (id, title, author_id, comment_count)     -- comment_count is denormalized
comment       (id, post_id → post, parent_id → comment,  -- parent_id is the self-edge; NULL = root
               path, depth, author_id, body, score, deleted, created_at)
comment_vote  (comment_id → comment, user_id, value)     -- UNIQUE(comment_id, user_id); ±1

Two things earn an interviewer's nod here. The tree is a self-reference — comment.parent_id points at another comment, so a million-node tree is still one table, not a table per level. And the noun that isn't a column: comment_vote is its own table with a UNIQUE(comment_id, user_id), because "one vote per user" is a correctness rule the score integer can't enforce. The score on the comment row is only the cached total of those vote rows.

Which datastore — and why it isn't a default. The durable truth — posts, comments, votes — lands in a relational store (Postgres), and not from habit. The tree is naturally per-post and bounded to one post's rows, the access is WHERE post_id = ? plus a prefix or parent match, and a vote wants an atomic, uniquely-constrained row — all relational strengths. Postgres even ships the ltree extension, a label-path data type with its own index support, built for exactly this kind of ancestor/descendant query. But two access patterns fit SQL badly and get their own home in Redis: the live vote tally (a hot INCR counter per comment, folded into score asynchronously) and the rendered hot-thread page (a cache with a short TTL). The tempting one-store alternative — a document store that embeds the whole tree in one post document — reads beautifully until a thread hits tens of thousands of comments: you blow past the document size limit, and every new comment rewrites the entire document, which murders the availability-under-write-spike requirement. A graph database is the right shape but overkill for a strict tree keyed by one post. Relational rows for the truth, Redis for the counters and the cache — the split the access patterns demand.

The table is flat. The path column is what makes it a tree — a string of ancestor ids with a leading and trailing separator (the trailing / matters, as you'll see).

schema.sql

CREATE TABLE comments (
  id         BIGINT PRIMARY KEY,
  post_id    BIGINT NOT NULL,
  parent_id  BIGINT,                       -- NULL = top-level
  path       TEXT   NOT NULL,              -- materialized path, e.g. '/1/4/9/'
  depth      INT    NOT NULL,              -- 1 = top-level
  author_id  BIGINT NOT NULL,
  body       TEXT   NOT NULL,
  score      INT    NOT NULL DEFAULT 0,    -- denormalized: upvotes - downvotes
  deleted    BOOLEAN NOT NULL DEFAULT FALSE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
 
-- the subtree query rides this: range-scan a path prefix within a post
CREATE INDEX idx_thread   ON comments (post_id, path);
-- the top-level page rides this: roots of a post, best first
CREATE INDEX idx_toplevel ON comments (post_id, depth, score DESC);

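Two indexes, two query shapes. (post_id, path) makes a subtree a contiguous range; (post_id, depth, score DESC) makes the first page of roots a pre-sorted slice. Everything in Step 5 is one of these two reads. One Postgres caveat: a B-tree index serves LIKE 'prefix%' as a range scan only under byte-wise ordering, so in a database with a non-C locale, declare the path column (or the index) with COLLATE "C".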

Step 5 — The queries that matter (the read path)

A comment section loads in two motions: the page of top-level comments, then expand replies on demand. Both are single indexed reads — and neither ever fetches the whole tree.

the page of roots, best first

SELECT id, author_id, body, score, deleted
FROM comments
WHERE post_id = 100 AND depth = 1   -- tombstoned roots stay listed so their replies remain reachable
ORDER BY score DESC
LIMIT 20;

expand a subtree — one prefix query, any depth

SELECT id, parent_id, path, body, score
FROM comments
WHERE post_id = 100 AND path LIKE '/1/%'
ORDER BY path;          -- path order == tree (pre-order) traversal

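That ORDER BY path is a quiet gift: sorting strings lexically yields a parent-before-children, depth-first walk — the exact order you render. (This relies on byte-wise comparison, where / sorts before every digit. Siblings come out in string order of their ids, so /1/10/ precedes /1/4/; if you want siblings by score, re-sort each sibling group after the fetch or query one level at a time.) And the trailing separator earns its keep here: LIKE '/1/%' matches /1/, /1/4/, /1/4/9/, /1/5/ — but not /10/, because the pattern demands the / after the 1. Drop the trailing slash from your paths and /1 silently swallows /10/, /11/, /123/, a classic materialized-path bug.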
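The prefix match and the separator pitfall can be checked with a minimal sketch in Python's built-in `sqlite3`. The table is trimmed to the columns the query touches, the sample rows are hypothetical, and the second call simulates the missing-separator bug by dropping the `/` from the pattern rather than from the stored paths.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, path TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, 100, "/1/"),
        (4, 100, "/1/4/"),
        (9, 100, "/1/4/9/"),
        (5, 100, "/1/5/"),
        (10, 100, "/10/"),      # a different top-level comment
        (11, 100, "/10/11/"),   # and its reply
    ],
)


def subtree(pattern: str) -> list[str]:
    """Return the paths matching a LIKE pattern within post 100, in path order."""
    cur = conn.execute(
        "SELECT path FROM comments WHERE post_id = ? AND path LIKE ? ORDER BY path",
        (100, pattern),
    )
    return [row[0] for row in cur]


with_separator = subtree("/1/%")    # the separator after the id is part of the pattern
without_separator = subtree("/1%")  # no separator: also matches ids that start with 1
```

`with_separator` holds only comment 1 and its descendants, in pre-order. `without_separator` additionally pulls in `/10/` and `/10/11/`, which belong to an unrelated branch.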

"Load more" is where a naive read path quietly dies. The obvious way to page deeper is LIMIT 20 OFFSET n, which makes the database read and throw away every skipped row before it reaches your page: page 40 of a mega-thread (OFFSET 780) reads 800 rows to return 20, forty times the work of page 1. On a hot AMA that's the query that tips over. The fix is keyset (cursor) pagination: carry the last row you showed as a cursor and ask for what comes after it.

page a subtree with a cursor, not an OFFSET

SELECT id, path, body, score
FROM comments
WHERE post_id = 100 AND path LIKE '/1/%'
  AND path > :last_path          -- the cursor: where the previous page ended
ORDER BY path
LIMIT 20;

Because (post_id, path) is indexed, path > :last_path is an index seek straight to the cursor — 20 rows read, no matter how deep you page. For the score-sorted roots the cursor is the tuple (score, id) instead (WHERE (score, id) < (:last_score, :last_id), paired with ORDER BY score DESC, id DESC), the id breaking ties so no comment is shown twice or skipped when scores collide. Same idea, different sort key; the OFFSET is gone either way.
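As a rough illustration of the cursor loop, the sketch below pages through a subtree 20 rows at a time using Python's `sqlite3`. The 50 replies are generated test data, the empty string is used as an assumed "start of thread" cursor, and `page` and `all_pages` are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER, path TEXT)")
conn.execute("CREATE INDEX idx_thread ON comments (post_id, path)")
conn.execute("INSERT INTO comments VALUES (1, 100, '/1/')")
# 50 hypothetical direct replies to comment 1, ids 2..51
conn.executemany(
    "INSERT INTO comments VALUES (?, 100, ?)",
    [(i, f"/1/{i}/") for i in range(2, 52)],
)


def page(last_path: str, size: int = 20) -> list[str]:
    """One keyset page: rows strictly after the cursor, in path order."""
    cur = conn.execute(
        "SELECT path FROM comments "
        "WHERE post_id = ? AND path LIKE ? AND path > ? "
        "ORDER BY path LIMIT ?",
        (100, "/1/%", last_path, size),
    )
    return [row[0] for row in cur]


def all_pages() -> list[str]:
    """Follow the cursor until a page comes back empty."""
    seen: list[str] = []
    cursor = ""  # assumption: the empty string sorts before every path
    while True:
        batch = page(cursor)
        if not batch:
            return seen
        seen.extend(batch)
        cursor = batch[-1]  # the last row shown becomes the next cursor
```

Walking the cursor visits each of the 51 rows once, with no duplicates and no gaps, and every page is the same bounded read regardless of how far in it sits.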

If you'd rather not store a path — the adjacency-list model — the same subtree comes from a recursive CTE, walking parent links at read time:

adjacency-list alternative: walk the tree with a recursive CTE

WITH RECURSIVE subtree AS (
  SELECT * FROM comments WHERE id = 1
  UNION ALL
  SELECT c.* FROM comments c
  JOIN subtree s ON c.parent_id = s.id
)
SELECT id, body, score FROM subtree;

It's correct and needs no path column, but it does the tree-walk on every read. The materialized path pays that cost once, at write time, by storing the answer — which is exactly the read-heavy trade you want.
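A quick way to convince yourself the two models answer the same question is to run both against the same rows. This is a minimal sketch with Python's `sqlite3` and hypothetical sample data; the table keeps both `parent_id` and `path` so the recursive walk and the prefix match can be compared directly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, parent_id INTEGER, path TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?)",
    [
        (1, None, "/1/"),
        (4, 1, "/1/4/"),
        (5, 1, "/1/5/"),
        (9, 4, "/1/4/9/"),
        (10, None, "/10/"),  # unrelated top-level comment
    ],
)

# Materialized path: one prefix match.
by_path = {row[0] for row in conn.execute("SELECT id FROM comments WHERE path LIKE '/1/%'")}

# Adjacency list: walk parent links level by level at read time.
by_cte = {
    row[0]
    for row in conn.execute(
        """
        WITH RECURSIVE subtree AS (
          SELECT id FROM comments WHERE id = 1
          UNION ALL
          SELECT c.id FROM comments c JOIN subtree s ON c.parent_id = s.id
        )
        SELECT id FROM subtree
        """
    )
}
```

Both sets contain comment 1 and its three descendants and exclude comment 10; the difference is only in where the work happens.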

Step 6 — Writes: post, vote, delete

A new comment is one insert; its path is the parent's path with its own id appended. No ancestor row is touched — that's what keeps writes cheap under a comment flood.

post a reply (parent is comment 1, path '/1/')

INSERT INTO comments (id, post_id, parent_id, path, depth, author_id, body)
VALUES (4, 100, 1, '/1/4/', 2, 7, 'reply to A');
-- path = parent.path || new_id || '/';  depth = parent.depth + 1
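The rule in that last comment can be written as a small helper. This is a sketch only: `child_position` is a hypothetical name, and it assumes a top-level comment is modelled as having no parent path and a parent depth of 0.

```python
from typing import Optional, Tuple


def child_position(parent_path: Optional[str], parent_depth: int, new_id: int) -> Tuple[str, int]:
    """Derive (path, depth) for a new comment from its parent's path and depth.

    A top-level comment passes parent_path=None and parent_depth=0.
    """
    base = parent_path if parent_path is not None else "/"
    return f"{base}{new_id}/", parent_depth + 1
```

For example, `child_position("/1/", 1, 4)` yields the `'/1/4/'` and depth 2 used in the insert above. Because the new comment's own id is the only varying part, two replies to the same parent get distinct paths as long as their ids are distinct.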

A vote bumps the denormalized score so reads never join to a votes table to sort. (At Reddit scale the votes themselves go through a separate counter — buffered in Redis and folded in asynchronously — but the comment row carries the materialized total.)

a vote updates the denormalized score

UPDATE comments SET score = score + 1 WHERE id = 4;
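The score bump on its own does not enforce "one vote per user"; that job belongs to the `comment_vote` table from Step 4. The sketch below pairs the two in one transaction using Python's `sqlite3`. It is a simplified direct-to-database version that sets aside the Redis buffering mentioned above, and `cast_vote` and `score` are hypothetical helper names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE comments (id INTEGER PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE comment_vote (
      comment_id INTEGER NOT NULL,
      user_id    INTEGER NOT NULL,
      value      INTEGER NOT NULL CHECK (value IN (-1, 1)),
      UNIQUE (comment_id, user_id)
    );
    INSERT INTO comments (id) VALUES (4);
    """
)


def cast_vote(comment_id: int, user_id: int, value: int) -> bool:
    """Record a vote and bump the cached score; return False on a repeat vote."""
    try:
        with conn:  # one transaction: the vote row and the score move together
            conn.execute(
                "INSERT INTO comment_vote (comment_id, user_id, value) VALUES (?, ?, ?)",
                (comment_id, user_id, value),
            )
            conn.execute(
                "UPDATE comments SET score = score + ? WHERE id = ?",
                (value, comment_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # UNIQUE(comment_id, user_id) rejected the second vote


def score(comment_id: int) -> int:
    return conn.execute("SELECT score FROM comments WHERE id = ?", (comment_id,)).fetchone()[0]
```

A second `cast_vote` from the same user on the same comment fails on the unique constraint and the transaction rolls back, so the cached score is left unchanged.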

Delete is the subtle one. You can't remove the row — its replies would be orphaned and the subtree query would skip a generation. Instead, tombstone it: keep the row and its path, blank the content. The branch stays whole; the UI renders [deleted].

soft-delete: keep the node, keep the children

UPDATE comments SET deleted = TRUE, body = '[deleted]', author_id = 0
WHERE id = 1;
-- comment 1 still anchors '/1/4/', '/1/5/', '/1/4/9/' — nothing is orphaned
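The difference between the tombstone and a hard delete shows up clearly when you look for rows whose parent no longer exists. A minimal sketch with Python's `sqlite3` follows; the sample rows and the helper names (`soft_delete`, `hard_delete`, `orphans`, `subtree`) are hypothetical, and `deleted` is stored as 0/1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, parent_id INTEGER, path TEXT NOT NULL,"
    " author_id INTEGER NOT NULL, body TEXT NOT NULL, deleted INTEGER NOT NULL DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO comments (id, parent_id, path, author_id, body) VALUES (?, ?, ?, ?, ?)",
    [
        (1, None, "/1/", 7, "comment A"),
        (4, 1, "/1/4/", 8, "reply to A"),
        (5, 1, "/1/5/", 9, "another reply"),
        (9, 4, "/1/4/9/", 7, "reply to the reply"),
    ],
)


def soft_delete(comment_id: int) -> None:
    """Tombstone: keep the row and its path, blank the content."""
    with conn:
        conn.execute(
            "UPDATE comments SET deleted = 1, body = '[deleted]', author_id = 0 WHERE id = ?",
            (comment_id,),
        )


def hard_delete(comment_id: int) -> None:
    """Row removal, shown only for contrast."""
    with conn:
        conn.execute("DELETE FROM comments WHERE id = ?", (comment_id,))


def orphans() -> list[int]:
    """Ids whose parent_id points at a row that no longer exists."""
    cur = conn.execute(
        "SELECT c.id FROM comments c LEFT JOIN comments p ON p.id = c.parent_id"
        " WHERE c.parent_id IS NOT NULL AND p.id IS NULL ORDER BY c.id"
    )
    return [row[0] for row in cur]


def subtree(prefix: str) -> list[tuple[str, str]]:
    """(path, body) pairs under a path prefix, in tree order."""
    cur = conn.execute(
        "SELECT path, body FROM comments WHERE path LIKE ? ORDER BY path", (prefix + "%",)
    )
    return [(row[0], row[1]) for row in cur]
```

After `soft_delete(1)` the subtree still has all four rows and `orphans()` is empty. A `hard_delete(4)` would instead leave comment 9 pointing at a parent that is gone.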

These six statements aren't illustrative sketches — the schema, the prefix query (including the /10/ exclusion), the recursive CTE, the soft-delete, and the score update were all run against a live SQLite database and produce exactly the rows described.

Step 7 — Trade-offs (each one keeping an NFR)

The last column is the discipline: every choice keeps one of the promises from Step 2.

Decision	The tempting alternative	Why ours wins	Keeps
materialized path	adjacency list + recursive CTE	subtree is one indexed range scan, not a per-read tree walk	read latency
store the tree in rows	embed the tree in one document	a huge thread blows the doc limit; every comment rewrites the whole	cheap writes
denormalized `score` column	`COUNT` a votes table per sort	sorting needs no join; the hot read path stays a single scan	read latency
soft-delete (tombstone)	hard-delete the row	replies stay anchored; the subtree query never skips a generation	correctness
leading and trailing `/`	bare ids in the path	the trailing `/` stops `/1/` from matching `/10/`	correctness
shard by `post_id`	one global comments table	a thread's rows live together; no cross-shard subtree query	scale

Step 8 — Scaling, one bottleneck at a time

Do the napkin math first, because it tells you which bottleneck to expect. A comment row is small — a couple hundred bytes with the body — so even a billion comments is a few hundred GB, well within one beefy Postgres. Writes are modest too: a busy site might see thousands of new comments a second at peak. It's reads that explode — every one of millions of pageviews fans a comment section into dozens of row reads. So you climb the read ladder, and add a rung only when a measured bottleneck forces it, never on day one.
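The napkin math is easy to reproduce. In this sketch every constant is an assumption: the row size and write rate take the round figures from the paragraph above, while the pageview rate and rows per pageview are hypothetical numbers chosen only to show the shape of the calculation.

```python
BYTES_PER_COMMENT = 200          # assumption: "a couple hundred bytes with the body"
TOTAL_COMMENTS = 1_000_000_000   # one billion comments

# Raw row storage, before indexes and replication overhead.
storage_gb = BYTES_PER_COMMENT * TOTAL_COMMENTS / 1e9

WRITES_PER_SEC = 2_000           # assumption: "thousands of new comments a second"
PAGEVIEWS_PER_SEC = 50_000       # hypothetical peak pageview rate
ROWS_PER_PAGEVIEW = 40           # assumption: "dozens of row reads" per comment section

row_reads_per_sec = PAGEVIEWS_PER_SEC * ROWS_PER_PAGEVIEW
```

With these inputs the rows come to about 200 GB, and row reads outnumber writes by three orders of magnitude, which is why the ladder below climbs the read side first.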

One Postgres. Correct and simple, and it carries you further than you'd guess — the whole tree of any single post is a handful of indexed reads. The first thing to hurt is repeated reads of the same hot thread…
Cache the rendered hot page. The top-level page of a viral thread is identical for every anonymous viewer, so cache the rendered slice in Redis with a short TTL and invalidate on a new top-level comment. Most reads of a hot thread never touch Postgres. When the long tail of cold threads still saturates the primary…
Add read replicas. Fan the subtree and "load more" reads across replicas while the primary keeps the writes. Replicas lag by milliseconds — fine for the score/count NFR (eventual), and the author's own read-your-writes can be pinned to the primary. When total data or write volume finally outgrows one box…
Shard by post_id. Because every query is scoped to a post — WHERE post_id = ? is in every index — a thread's entire tree lives on one shard, so there is never a cross-shard subtree scan. Quiet posts spread evenly across shards. Migrating to a different store is the genuine last resort; the four rungs above go a very long way.

The write hot-key is a separate axis. Sharding spreads quiet posts, but one AMA is a single hot post that no post_id sharding can split — its whole tree is one shard's problem. Two pressures land on it, and each has its own release valve. The read pressure is answered by the cache in rung 2 plus the honest concession that deep, cold branches are a slower on-demand "load more" nobody expects to be instant. The write pressure is the vote storm: a front-page comment can take thousands of votes a second, and UPDATE ... SET score = score + 1 on one row that hot serializes into a bottleneck. That's why the vote tally lives in a Redis counter — absorb the INCR flood in memory and fold the total back into the score column every few seconds. The row's cached total lags by seconds; nobody notices, and the hot row is never the write bottleneck.
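The buffer-and-fold idea can be sketched without a Redis server. Below, an in-memory `Counter` stands in for the per-comment Redis counters and Python's `sqlite3` stands in for Postgres; `record_vote`, `fold_pending`, and `score` are hypothetical names. With a real Redis, the read-and-reset step would need to be atomic so that votes arriving during a fold are not lost, which this single-threaded sketch does not have to handle.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (id INTEGER PRIMARY KEY, score INTEGER NOT NULL DEFAULT 0)")
conn.execute("INSERT INTO comments (id) VALUES (4)")
conn.commit()

pending: Counter = Counter()  # stand-in for per-comment Redis counters


def record_vote(comment_id: int, value: int) -> None:
    """Hot path: an in-memory increment, no database row touched."""
    pending[comment_id] += value


def fold_pending() -> int:
    """Periodic job: apply the buffered deltas as one UPDATE per comment."""
    deltas = dict(pending)
    pending.clear()
    with conn:
        conn.executemany(
            "UPDATE comments SET score = score + ? WHERE id = ?",
            [(delta, comment_id) for comment_id, delta in deltas.items()],
        )
    return len(deltas)


def score(comment_id: int) -> int:
    return conn.execute("SELECT score FROM comments WHERE id = ?", (comment_id,)).fetchone()[0]
```

A thousand votes on one comment become a thousand cheap increments followed by a single row update at fold time, so the hot row sees one write per interval instead of one per vote.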

Step 9 — Designing for failure: dead components and racing data

A design isn't finished until you can say what happens as each piece dies — and, for a schema, also what happens when its data races. Both sort the same way: a piece that's only an optimization degrades to the source of truth; the piece that holds the truth gets a replica and fails fast; and every race is closed at the row.

When a component dies:

The Redis cache dies. Reads fall through to Postgres — slower, but correct, because the cached page was only ever an optimization over the authoritative rows. Browsing degrades; nothing is lost.
The Redis vote counter dies. The score column already holds the last folded-in total, so ranking keeps working on a slightly-stale number; the in-flight INCRs that hadn't been folded yet are the only loss, and they self-heal — a periodic recount from the comment_vote rows (the durable truth) restores the exact score. A tally is recomputable by design, so its store is allowed to fail.
The Postgres primary dies. This is the one you can't degrade away — the comment rows are the truth. So you keep a replica and fail over; until failover completes you fail fast on writes (reject new comments cleanly) rather than accept comments you can't durably store. Reads keep serving from replicas throughout.
A read replica lags. Fine for scores and counts (eventual by contract), but a user must see their own new comment — so route each author's read-your-writes to the primary for a few seconds after they post.
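One simple way to pin an author's reads to the primary is to remember when each user last wrote. The sketch below is an assumption-laden illustration: the 5-second window, the in-process dictionary, and the names `note_write` and `choose_backend` are all hypothetical, and a multi-server deployment would need that timestamp somewhere shared, such as the session or a cookie.

```python
import time
from typing import Dict, Optional

PIN_SECONDS = 5.0  # assumption: "a few seconds" of pinning after a post

_last_write_at: Dict[int, float] = {}


def note_write(user_id: int, now: Optional[float] = None) -> None:
    """Call after a user's comment commits on the primary."""
    _last_write_at[user_id] = time.monotonic() if now is None else now


def choose_backend(user_id: int, now: Optional[float] = None) -> str:
    """Route recent writers to the primary, everyone else to a replica."""
    now = time.monotonic() if now is None else now
    last = _last_write_at.get(user_id)
    if last is not None and now - last < PIN_SECONDS:
        return "primary"
    return "replica"
```

The optional `now` argument exists only so the routing can be exercised with fixed timestamps; in normal use both functions read the monotonic clock.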

When the data races:

The comment count drifts (a crash between the insert and the count bump). It's an optimization, not the truth, so show the slightly-stale number and let the periodic recount fix it. Never block posting a comment on its counter.
A reply races its parent's deletion. Because delete is a tombstone, not a row removal, the child insert still computes a valid path off the (now blanked) parent — the reply is anchored correctly instead of dangling. Hard-delete would have raced into an orphan here; the tombstone is what makes the race benign.
Two votes race on the same comment. The UNIQUE(comment_id, user_id) row is the referee: the second insert for the same user fails, so one person can't vote twice no matter how they retry — and the score counter is only the cached sum of those rows, never the authority on who voted.

The interview corner

Clarify before you draw: How deep do threads realistically go — is there a max display depth (Reddit collapses past ~8 and offers "continue this thread"), which caps path length and index bloat? Is the workload read-heavy the way a public forum is, or write-heavy like a live chat (it flips the model choice)? Does a comment ever get re-parented or moved to another post (that one requirement alone can push you off materialized path toward a closure table)? Are exact counts required, or is "12.4k" — an approximate, eventually-consistent number — acceptable?

The follow-up ladder — each rung a new scenario, not a re-run of the thesis:

"A thread goes 60 levels deep — does the path still hold?" The prefix query is depth-agnostic, but the path string and its index grow with depth, and rendering 60 nested indents is its own problem. So cap display depth: past a limit, stop rendering inline and hand back a "continue this thread" link that starts a fresh fetch rooted at that node — bounding both the payload and the index key. Postgres ltree also has a depth-aware operator set if you lean on it.
"Sort by 'best' while votes stream in — how does ORDER BY score stay cheap?" The index (post_id, depth, score DESC) serves roots pre-sorted (it orders the rows but does not cover body or author_id, so each hit still fetches its row), and the score column is a denormalized tally folded from a Redis counter, so no page-load ever joins the votes table. When "best" is a fancier time-decayed formula, precompute a rank column on the same cadence — the same trick a leaderboard uses to avoid re-sorting on every event.
"Two users reply to the same parent in the same millisecond — can their paths collide?" No: each child appends its own globally-unique id to the parent's path, and ids come from a sequence/snowflake generator, so siblings get distinct paths with no coordination. Contrast the tempting "append the child's position index" scheme, which would need a lock on the parent to allocate the next slot — the id-based path sidesteps that race entirely.
"A mod moves a whole subtree to a different post." This is materialized path's sore spot and the honest place to name a limit: re-parenting means rewriting path (and post_id) for every descendant — a range UPDATE proportional to the subtree size, and across shards if the new post lives elsewhere. If moves are frequent, a closure table wins because a move touches only the ancestor rows; for append-only comments where moves are rare, eating the occasional range rewrite is the right trade.
"The count says 12,388 but the header shows 12.4k — reconcile it." The header is the eventually-consistent denormalized count; a background job recomputes it from the rows and approximate display hides the drift. If the question turns to unique commenters, that's a distinct-count problem — a HyperLogLog sketch in Redis, the same approximate-counting move the analytics aggregation build leans on, not a COUNT(DISTINCT) over millions of rows.

Mistakes that fail the round:

Reaching for nested sets (or any model that renumbers/rewrites existing rows on insert) for a high-write comment tree — one new comment shouldn't touch half the table.
Dropping the trailing separator, so path LIKE '/1/%' silently swallows /10/, /11/, /123/ — the subtree fetch returns strangers' comments.
Hard-deleting a comment that has replies, orphaning the subtree — the branch below it vanishes or dangles. Tombstone instead, keeping the row and its path.

Where to go from here

Pocket version: a comment tree is flat rows plus a materialized path; a subtree is one prefix range scan in tree order; page it with a keyset cursor, never OFFSET; keep score/count denormalized and folded from a Redis counter; tombstone deletes so replies stay anchored; shard by post_id so a thread lives on one shard.

The same tree, in memory and without a database, is the file system LLD — composite nodes and pointers instead of a path string; building both back to back is the clearest way to feel why the on-disk model differs from the in-memory one.
New to the method? The rookie's guide to HLD walks the recipe this article follows.
For the read-vs-write fan-out decision that shapes the rest of a social app, see Twitter's timeline; for ranking that hot score at scale, the leaderboard HLD; for sending to that audience, the newsletter service.
