Web Crawler — Simple

Problem statement (interviewer prompt)

Design a polite, scalable web crawler that fetches 1B+ pages over a month. It must respect robots.txt rules and Crawl-delay directives, deduplicate both URLs and near-duplicate content, prioritise important pages, handle JavaScript-heavy sites, and survive worker failures.
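
To make the politeness requirement concrete, here is a minimal sketch of a per-host gate built on Python's standard `urllib.robotparser`. The names `USER_AGENT`, `DEFAULT_DELAY_S`, and `HostPoliteness` are hypothetical, the one-second fallback delay is an assumption, and the robots.txt text is passed in directly rather than fetched. A real crawler would also cache and periodically refresh these rules per host.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "example-crawler"  # hypothetical crawler name
DEFAULT_DELAY_S = 1.0           # assumed fallback when no Crawl-delay is given


class HostPoliteness:
    """Tracks robots.txt rules and the next allowed fetch time for one host."""

    def __init__(self, robots_txt: str, default_delay: float = DEFAULT_DELAY_S):
        self._rules = RobotFileParser()
        # parse() takes the robots.txt content as an iterable of lines.
        self._rules.parse(robots_txt.splitlines())
        delay = self._rules.crawl_delay(USER_AGENT)
        self.delay = float(delay) if delay is not None else default_delay
        self.next_allowed = 0.0

    def allowed(self, url: str) -> bool:
        """True if robots.txt permits this user agent to fetch the URL."""
        return self._rules.can_fetch(USER_AGENT, url)

    def try_acquire(self, url: str, now: float) -> bool:
        """Reserve a fetch slot if the URL is allowed and the delay has elapsed."""
        if not self.allowed(url) or now < self.next_allowed:
            return False
        self.next_allowed = now + self.delay
        return True
```

Passing `now` in explicitly, instead of reading the clock inside the class, keeps the gate deterministic and easy to test. In a distributed setup the `next_allowed` timestamp would probably need to live in shared state keyed by host.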

flowchart LR
  S[Seed URLs]
  F[(Frontier Queue)]
  FT([Fetcher])
  P([Parser / Link Extractor])
  DD[Dedup<br/>URL + content hash]
  ST[(Page Storage)]
  IX[Indexer]
  S --> F
  F --> FT --> P
  P --> DD
  DD -->|new urls| F
  P --> ST --> IX

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class S,DD,IX service;
    class F,ST datastore;
    class FT,P compute;
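
The flow in the diagram (seeds, frontier, fetcher, parser, dedup, storage) can be sketched as a single-process loop. This is an illustration under stated assumptions, not a production design: `crawl`, `LinkExtractor`, and the injected `fetch` callable are hypothetical names, the frontier is a plain FIFO queue with no prioritisation, and the SHA-256 content hash catches only exact duplicates. Near-duplicate detection would need a similarity fingerprint such as SimHash instead.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects absolute, fragment-free links from <a href> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                absolute, _fragment = urldefrag(urljoin(self.base_url, href))
                self.links.append(absolute)


def crawl(seeds, fetch, max_pages=100):
    """Run the frontier -> fetch -> parse -> dedup loop.

    `fetch(url)` is an assumed callable returning the page body as a string,
    or None on failure. Returns a dict of stored pages keyed by URL.
    """
    frontier = deque(seeds)   # Frontier Queue
    seen_urls = set(seeds)    # URL dedup
    seen_hashes = set()       # content-hash dedup (exact matches only)
    storage = {}              # Page Storage

    while frontier and len(storage) < max_pages:
        url = frontier.popleft()
        body = fetch(url)
        if body is None:
            continue
        digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue  # identical content already stored under another URL
        seen_hashes.add(digest)
        storage[url] = body

        parser = LinkExtractor(url)
        parser.feed(body)
        for link in parser.links:
            if link not in seen_urls:  # only new URLs re-enter the frontier
                seen_urls.add(link)
                frontier.append(link)
    return storage
```

Because `fetch` is injected, the loop can be exercised against an in-memory dict of pages (for example `crawl(seeds, pages.get)`) before it is wired to a real HTTP client. At the scale in the problem statement, the in-memory sets would likely be replaced by a shared store or a Bloom filter.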