Web Crawler: Detailed

flowchart TB
  subgraph Seeds[Seeds & Sitemaps]
    SD[Seed list]
    SM[Sitemaps]
  end

  subgraph Frontier[URL Frontier]
    PRIO[[Priority queues<br/>by importance]]
    HOST[[Per-host queues<br/>politeness]]
    RR([Round-robin scheduler])
    DELAY[Crawl-delay enforcer]
  end

  subgraph Resolve[DNS & Politeness]
    DNS[DNS cache]
    ROB[robots.txt cache]
    HC[Host metadata<br/>rps cap]
  end

  subgraph Fetch[Fetcher Workers]
    FW([HTTP fetcher pool<br/>async / epoll])
    REND[Headless renderer<br/>JS sites - Chromium]
    RETR[Retry / backoff]
  end

  subgraph Process[Processing Pipeline]
    PR([HTML parser])
    NORM[URL normalizer<br/>canonicalize]
    LINK[Link extractor]
    LANG[Lang detect]
    EX[Content extractor<br/>readability]
    DUPC[Content dedup<br/>SimHash / MinHash]
  end

  subgraph Dedup[Seen-URL filter]
    BLOOM[Bloom filter / Cuckoo<br/>billions of URLs]
    URLDB[(URL store<br/>last-seen, fetch state)]
  end

  subgraph Store[Storage]
    WARC[(WARC archive<br/>S3 / HDFS)]
    PDB[(Page DB<br/>HBase / Cassandra)]
    GRAPH[(Link Graph<br/>Pregel / GraphX)]
  end

  subgraph Downstream
    IDX[Indexer -> Inverted index]
    RANK[PageRank / signal builder]
    SPAM([Spam / malware classifier])
  end

  subgraph Ctl[Control Plane]
    SCH([Master scheduler])
    MON[Metrics: pages/s,<br/>per-host QPS]
    LIM[Quotas, kill switch]
  end

  SD --> Frontier
  SM --> Frontier
  Frontier --> Resolve
  Resolve --> Fetch
  Fetch --> Process
  Process --> Dedup
  Dedup -->|new URL| Frontier
  Process --> Store
  Store --> Downstream
  SCH -.assigns shards.-> Frontier
  MON -.observes.-> Fetch

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class DNS edge;
    class SD,SM,DELAY,ROB,HC,REND,RETR,NORM,LINK,LANG,EX,DUPC,BLOOM,RANK,LIM service;
    class URLDB,PDB,IDX datastore;
    class PRIO,HOST queue;
    class RR,FW,PR,SPAM,SCH compute;
    class WARC storage;
    class MON obs;
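
The per-host queues and crawl-delay enforcer in the Frontier box can be sketched as a heap keyed by each host's next allowed fetch time. This is a minimal illustration, not the design's actual implementation: the class name `PoliteFrontier`, the single fixed `delay`, and the caller-supplied clock value `now` are all assumptions, and the priority queues by importance are left out.

```python
import heapq
from collections import deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Per-host FIFO queues plus a min-heap of (next_allowed_time, host)."""

    def __init__(self, delay=1.0):
        self.delay = delay        # assumed minimum seconds between hits to one host
        self.queues = {}          # host -> deque of pending URLs
        self.next_allowed = {}    # host -> earliest time of the next fetch
        self.ready = []           # heap of (time, host) for hosts with pending URLs

    def add(self, url, now=0.0):
        host = (urlsplit(url).hostname or "").lower()
        if host not in self.queues:
            self.queues[host] = deque()
            # A host that was fetched recently keeps its cool-down.
            start = max(now, self.next_allowed.get(host, now))
            heapq.heappush(self.ready, (start, host))
        self.queues[host].append(url)

    def next_url(self, now):
        """Return a URL whose host may be fetched at `now`, or None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        queue = self.queues[host]
        url = queue.popleft()
        self.next_allowed[host] = now + self.delay
        if queue:
            heapq.heappush(self.ready, (self.next_allowed[host], host))
        else:
            del self.queues[host]
        return url
```

A fetcher worker would call `next_url(time.monotonic())` in a loop and sleep briefly when it returns `None`. A per-host delay taken from robots.txt could replace the fixed `delay`.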

Politeness & robots

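  • Respect robots.txt rules and any Crawl-delay directive; use sitemaps for URL discovery.
  • Cap per-host concurrency (1–4 connections); identify the crawler with a User-Agent string that includes a contact URL.
  • Use exponential backoff with jitter on 5xx and 429 responses, and honor Retry-After when the server sends it.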
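
The rules above can be sketched with Python's standard `urllib.robotparser` plus a small backoff helper. The crawler name `ExampleBot`, the `base` and `cap` values, and the choice of full-jitter backoff are assumptions made for illustration; the source only says "exponential backoff".

```python
import random
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"  # hypothetical crawler name


def load_robots(robots_txt):
    """Parse robots.txt text that was already fetched (and cached) elsewhere."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def should_retry(status):
    """Retry only on 429 and 5xx responses."""
    return status == 429 or 500 <= status <= 599


def backoff_delay(attempt, base=1.0, cap=300.0, rng=random.random):
    """Full-jitter backoff: uniform in [0, min(cap, base * 2**attempt)) seconds."""
    return rng() * min(cap, base * 2 ** attempt)
```

With a parsed file, `parser.can_fetch(USER_AGENT, url)` gates each request and `parser.crawl_delay(USER_AGENT)` returns the declared delay, or `None` when the site sets none.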

URL canonicalization

  • Lowercase the scheme and host, strip default ports (80 for http, 443 for https), sort query parameters, drop fragments, and follow <link rel=canonical> when the page declares one.
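
A minimal sketch of those normalization steps using only `urllib.parse`. The function name `canonicalize` is hypothetical, and the sketch makes simplifying assumptions: userinfo is dropped, IPv6 literals are not handled, and re-encoding the query may change how some characters are escaped. Following `<link rel=canonical>` is not shown because it needs the fetched page.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def canonicalize(url):
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()   # .hostname is already port-free
    port = parts.port
    # Keep the port only when it is not the scheme's default.
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host
    else:
        netloc = f"{host}:{port}"
    # Sort query parameters so equivalent URLs compare equal.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(pairs)
    path = parts.path or "/"
    # The empty last element drops the fragment.
    return urlunsplit((scheme, netloc, path, query, ""))
```

For example, `canonicalize("HTTP://Example.COM:80/a?b=2&a=1#frag")` yields `http://example.com/a?a=1&b=2`.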

Dedup

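  • URL: Bloom filter sized for the expected number of URLs at a 1% false-positive rate. A false positive means a new URL is wrongly treated as seen and skipped.
  • Content: 64-bit SimHash for near-duplicate pages; two pages count as near-duplicates when their fingerprints differ in at most 3 bits (Hamming threshold 3).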
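
Both bullets can be made concrete. The sizing function uses the standard Bloom filter formulas m = -n ln(p) / (ln 2)^2 and k = (m / n) ln 2. The SimHash sketch assumes unweighted tokens and BLAKE2b as the per-token hash, neither of which the source specifies, and all function names are hypothetical.

```python
import hashlib
import math


def bloom_parameters(n_items, fpr):
    """Return (bits, hash_count) for n_items at the target false-positive rate."""
    m_bits = math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2))
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes


def simhash64(tokens):
    """64-bit SimHash: per-bit vote across the token hashes."""
    votes = [0] * 64
    for token in tokens:
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for bit in range(64):
            votes[bit] += 1 if (h >> bit) & 1 else -1
    return sum(1 << bit for bit in range(64) if votes[bit] > 0)


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def near_duplicate(a, b, threshold=3):
    return hamming(a, b) <= threshold
```

For one billion URLs at a 1% false-positive rate, `bloom_parameters` gives roughly 9.6 billion bits (about 1.2 GB) and 7 hash functions, so the filter fits in memory on a single node or can be split across frontier shards.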

Scale design

  • Frontier sharded by host hash; a given host is always served by the same node, so its politeness limits are enforced in one place.
  • Workers pull from the frontier; processed results are emitted to Kafka.
  • WARC files are stored as 1 GB chunks in S3 / HDFS.
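
Sharding by host hash can be sketched as below. The function name `shard_for_url` and the use of SHA-256 with a plain modulo are assumptions. A stable digest is used because Python's built-in `hash()` is randomized per process for strings, which would send the same host to different nodes after a restart.

```python
import hashlib
from urllib.parse import urlsplit


def shard_for_url(url, num_shards):
    """Map a URL to a frontier shard using only its host."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # The first 8 bytes give a stable 64-bit integer.
    return int.from_bytes(digest[:8], "big") % num_shards
```

Every URL on the same host lands on the same shard regardless of scheme, path, or query. Plain modulo remaps most hosts when `num_shards` changes; consistent hashing is one common way to limit that, though the source does not say which scheme is used.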

Politeness sticking points

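  • Shared CDN and hosting IPs (e.g. *.akamai.com) put many hosts behind one IP, so rate-limit per IP as well as per host.
  • Sitemap honesty: sitemap timestamps can be inaccurate, so verify them before using them to schedule a re-crawl.
  • Crawler traps (infinite calendars, faceted search): apply a depth limit and deduplicate by URL pattern.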
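
One way to combine a depth limit with URL-pattern dedup is sketched below. The pattern rule (digit runs in the path collapsed to a placeholder, query reduced to its sorted key names), the `TrapGuard` class, and both limits are assumptions chosen for illustration, not values from this design.

```python
import re
from collections import Counter
from urllib.parse import parse_qsl, urlsplit

MAX_DEPTH = 16          # assumed link-depth limit measured from the seed
MAX_PER_PATTERN = 100   # assumed cap on URLs that share one pattern


def url_pattern(url):
    """Collapse a URL to a coarse shape so trap-generated URLs collide."""
    parts = urlsplit(url)
    path = re.sub(r"\d+", "{n}", parts.path)
    keys = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
    return f"{(parts.hostname or '').lower()}{path}?{'&'.join(keys)}"


class TrapGuard:
    def __init__(self, max_depth=MAX_DEPTH, max_per_pattern=MAX_PER_PATTERN):
        self.max_depth = max_depth
        self.max_per_pattern = max_per_pattern
        self.seen = Counter()   # pattern -> number of URLs admitted so far

    def allow(self, url, depth):
        """Return True if the URL may enter the frontier."""
        if depth > self.max_depth:
            return False
        pattern = url_pattern(url)
        if self.seen[pattern] >= self.max_per_pattern:
            return False
        self.seen[pattern] += 1
        return True
```

An infinite calendar such as `/calendar/2024/05?view=month` and `/calendar/2031/12?view=week` collapses to one pattern, so the crawler admits only a bounded number of its pages. A cap this coarse can also throttle legitimate paginated content, so the limits would need tuning per site.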

Glossary & fundamentals

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag | Concept                       | What it is                                  | Page
HLD | CDN                           | edge caching for static assets              | cdn
HLD | Sharding                      | horizontal partitioning across nodes        | database-sharding
HLD | Pub/Sub & message brokers     | topics, consumer groups, delivery semantics | pub-sub-pattern
HLD | CAP / PACELC                  | C vs A under partition; L vs C otherwise    | cap-pacelc
HLD | Probabilistic data structures | Bloom, HLL, Count-Min, MinHash, t-digest    | probabilistic-data-structures
HLD | Idempotency & retries         | safe re-execution, backoff + jitter         | idempotency-retries
HLD | Observability                 | metrics, logs, traces, SLOs                 | observability
HLD | Search internals              | inverted index, BM25, embeddings, ANN       | search-internals
LLD | Creational patterns           | Singleton, Factory, Builder, Prototype      | creational-patterns