Web Crawler LLD: A Frontier, a Seen-Set, and the Trick Question of When You're Done
A low-level design walkthrough of a web crawler core: the frontier queue, URL normalization before deduplication, a shared seen-set, per-host politeness, and termination by in-flight count.
"Design a web crawler." Fetch a page, find its links, fetch those, repeat — a graph traversal you could sketch in a minute. Which is exactly why it's a favorite: the one-minute sketch is a trap with at least four pits underneath it, and the interviewer is watching which ones you walk into. When do you stop, if every page leads to more pages? How do you not fetch the same page a thousand times when a thousand pages link to it? What stops you from hammering one poor server into the ground? And — the sneakiest — with ten workers running, how do you even know you're finished, when an empty queue c…
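To make two of those pits concrete, here is a minimal sketch of deduplication and termination, not a definitive implementation. It assumes a tiny in-memory link graph (`FAKE_WEB`) in place of real HTTP, and the names `normalize`, `crawl`, and `fetch_links` are hypothetical. URLs are normalized before they touch the shared seen-set, and workers stop when the count of in-flight URLs (queued or being processed) reaches zero, rather than when the queue happens to look empty. Per-host politeness is left out to keep the sketch short.

```python
import queue
import threading
from urllib.parse import urlsplit, urlunsplit

# Hypothetical in-memory "web" standing in for real fetches: URL -> outgoing links.
FAKE_WEB = {
    "http://a.test/": ["http://a.test/x", "HTTP://A.test/x#top", "http://b.test/"],
    "http://a.test/x": ["http://a.test/"],
    "http://b.test/": ["http://a.test/x", "http://b.test/y"],
    "http://b.test/y": [],
}


def normalize(url):
    """Lowercase scheme and host, drop the fragment, default an empty path to '/'."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def crawl(seed, fetch_links, workers=4):
    frontier = queue.Queue()
    seen = set()              # shared seen-set, guarded by `lock`
    lock = threading.Lock()
    done = threading.Event()
    in_flight = 0             # URLs queued or currently being processed

    def enqueue(url):
        nonlocal in_flight
        url = normalize(url)  # normalize BEFORE the dedup check
        with lock:
            if url in seen:
                return
            seen.add(url)
            in_flight += 1    # counted before it becomes visible in the queue
        frontier.put(url)

    def worker():
        nonlocal in_flight
        while not done.is_set():
            try:
                url = frontier.get(timeout=0.05)
            except queue.Empty:
                continue      # an empty queue alone does not prove we are finished
            try:
                for link in fetch_links(url):
                    enqueue(link)
            finally:
                with lock:
                    in_flight -= 1   # only after this page's links were enqueued
                    if in_flight == 0:
                        done.set()   # nothing queued, nothing being processed

    enqueue(seed)
    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    done.wait()
    for t in threads:
        t.join()
    return seen


if __name__ == "__main__":
    pages = crawl("http://a.test/", lambda url: FAKE_WEB.get(url, []))
    print(sorted(pages))
```

The ordering is the point of the design: a URL is counted before it is put on the queue, and a worker decrements the count only after it has enqueued everything the page links to. At no moment can the count read zero while work is still hidden inside a worker, which is exactly the case an "is the queue empty?" check gets wrong.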