Web Crawler — Simple
Problem statement (interviewer prompt)
Design a polite, scalable web crawler that fetches more than 1 billion pages in a month. It must respect robots.txt rules (including Crawl-delay), deduplicate URLs and near-duplicate content, prioritise important pages, handle JS-heavy sites, and survive worker failures.
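Two of the requirements above can be made concrete with a short sketch: the sustained fetch rate implied by the page target, and a robots.txt check. The 30-day month, the `example-crawler` user agent, and the robots.txt content are assumptions chosen for illustration. The parsing uses Python's standard `urllib.robotparser` on an in-memory string, so nothing is fetched over the network.

```python
from urllib.robotparser import RobotFileParser

# Sizing: 1B pages spread evenly over an assumed 30-day month.
PAGES = 1_000_000_000
SECONDS_PER_MONTH = 30 * 24 * 60 * 60  # 2,592,000 seconds


def required_fetch_rate(pages: int = PAGES, seconds: int = SECONDS_PER_MONTH) -> float:
    """Average fetches per second needed to reach the page target."""
    return pages / seconds


# Politeness: a hypothetical robots.txt, held in memory for the example.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 2
Disallow: /private/
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried per URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = load_rules(EXAMPLE_ROBOTS_TXT)
AGENT = "example-crawler"  # hypothetical user-agent name

print(f"{required_fetch_rate():.1f} pages/s on average")
print(rules.can_fetch(AGENT, "https://example.com/public"))      # allowed path
print(rules.can_fetch(AGENT, "https://example.com/private/x"))   # disallowed path
print(rules.crawl_delay(AGENT))                                  # seconds between requests
```

The average works out to a little under 400 fetches per second. Because each host limits how often it may be hit, that rate has to come from crawling many hosts in parallel rather than from crawling any single host faster.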
flowchart LR
S[Seed URLs]
F[[Frontier Queue]]
FT([Fetcher])
P([Parser / Link Extractor])
DD[Dedup<br/>URL + content hash]
ST[(Page Storage)]
IX[Indexer]
S --> F
F --> FT --> P
P --> DD
DD -->|new urls| F
P --> ST --> IX
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class S,DD,IX service;
class F,ST datastore;
class FT,P compute;
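The loop in the diagram (frontier, fetcher, parser, dedup, storage) can be sketched on a single machine as follows. This is a minimal illustration under stated assumptions, not a production design: `crawl`, `LinkExtractor`, and the injected `fetch` callable are hypothetical names, the frontier is a plain FIFO with no prioritisation or politeness delay, and content dedup is an exact SHA-256 match rather than near-duplicate detection. The sketch also applies the content-hash check before storing a page, which is one possible reading of the diagram.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, Iterable, List, Optional
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Parser stage: collects href values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], Optional[str]],
    max_pages: int = 100,
) -> Dict[str, str]:
    """Run the frontier -> fetch -> parse -> dedup loop; return stored pages."""
    frontier = deque()          # Frontier Queue (FIFO, no priorities)
    seen_urls = set()           # URL dedup
    seen_hashes = set()         # exact content dedup
    storage: Dict[str, str] = {}  # Page Storage stand-in

    def enqueue(url: str) -> None:
        if url not in seen_urls:
            seen_urls.add(url)
            frontier.append(url)

    for seed in seeds:
        enqueue(seed)

    while frontier and len(storage) < max_pages:
        url = frontier.popleft()
        html = fetch(url)                       # Fetcher stage (injected)
        if html is None:
            continue                            # fetch failed or page missing
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue                            # same bytes already stored
        seen_hashes.add(digest)
        storage[url] = html

        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            enqueue(absolute)                   # only new URLs re-enter the frontier
    return storage


# Usage with an in-memory "site" instead of real HTTP (hypothetical pages).
site = {
    "https://example.com/": '<a href="/a">A</a><a href="/b#top">B</a>',
    "https://example.com/a": '<a href="/">home</a><a href="/c">C</a>',
    "https://example.com/b": '<a href="/a">A</a>',
    "https://example.com/c": '<a href="/">home</a><a href="/c">C</a>',
}
pages = crawl(["https://example.com/"], site.get)
print(sorted(pages))
```

In the example, `/c` has the same body as `/a`, so it is fetched but not stored, and the `#top` fragment is stripped before the URL check. Passing `fetch` in as a parameter keeps the loop testable without network access; at the scale in the problem statement, the in-memory sets and queue would need to be replaced by shared, durable components.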