Web Crawler — Detailed#
flowchart TB
subgraph Seeds[Seeds & Sitemaps]
SD[Seed list]
SM[Sitemaps]
end
subgraph Frontier[URL Frontier]
PRIO[[Priority queues<br/>by importance]]
HOST[[Per-host queues<br/>politeness]]
RR([Round-robin scheduler])
DELAY[Crawl-delay enforcer]
end
subgraph Resolve[DNS & Politeness]
DNS[DNS cache]
ROB[robots.txt cache]
HC[Host metadata<br/>rps cap]
end
subgraph Fetch[Fetcher Workers]
FW([HTTP fetcher pool<br/>async / epoll])
REND[Headless renderer<br/>JS sites - Chromium]
RETR[Retry / backoff]
end
subgraph Process[Processing Pipeline]
PR([HTML parser])
NORM[URL normalizer<br/>canonicalize]
LINK[Link extractor]
LANG[Lang detect]
EX[Content extractor<br/>readability]
DUPC[Content dedup<br/>SimHash / MinHash]
end
subgraph Dedup[Seen-URL filter]
BLOOM[Bloom filter / Cuckoo<br/>billions of URLs]
URLDB[(URL store<br/>last-seen, fetch state)]
end
subgraph Store[Storage]
WARC[(WARC archive<br/>S3 / HDFS)]
PDB[(Page DB<br/>HBase / Cassandra)]
GRAPH[(Link Graph<br/>Pregel / GraphX)]
end
subgraph Downstream
IDX[Indexer -> Inverted index]
RANK[PageRank / signal builder]
SPAM([Spam / malware classifier])
end
subgraph Ctl[Control Plane]
SCH([Master scheduler])
MON[Metrics: pages/s,<br/>per-host QPS]
LIM[Quotas, kill switch]
end
SD --> Frontier
SM --> Frontier
Frontier --> Resolve
Resolve --> Fetch
Fetch --> Process
Process --> Dedup
Dedup -->|new URL| Frontier
Process --> Store
Store --> Downstream
SCH -.assigns shards.-> Frontier
MON -.observes.-> Fetch
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class DNS edge;
class SD,SM,DELAY,ROB,HC,REND,RETR,NORM,LINK,LANG,EX,DUPC,BLOOM,RANK,LIM service;
class URLDB,PDB,IDX datastore;
class PRIO,HOST queue;
class RR,FW,PR,SPAM,SCH compute;
class WARC storage;
class MON obs;
Politeness & robots#
- Respect `robots.txt`, `Crawl-delay`, sitemaps.
- Cap per-host concurrency (1–4 connections); identify with User-Agent + contact URL.
- Use exponential backoff on 5xx / 429.
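
As a minimal sketch of the rules above, the snippet below checks a URL against already-fetched `robots.txt` text using Python's standard `urllib.robotparser`, reads the `Crawl-delay` value, and computes a jittered exponential backoff delay. The `ExampleCrawler` user agent, its contact URL, the sample rules, and the `base`/`cap` defaults are assumptions for illustration, not values from this design.

```python
import random
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity: product token plus a contact URL.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/bot)"

def load_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was fetched elsewhere (no network access here)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def should_retry(status: int) -> bool:
    """Retry only on throttling (429) and server errors (5xx)."""
    return status == 429 or 500 <= status <= 599

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0) -> float:
    """Exponential backoff with full jitter: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

# Sample rules, made up for this example.
ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

rules = load_robots(ROBOTS)
print(rules.can_fetch(USER_AGENT, "https://example.com/index.html"))  # allowed path
print(rules.can_fetch(USER_AGENT, "https://example.com/private/a"))   # disallowed path
print(rules.crawl_delay(USER_AGENT))                                  # seconds between requests
```

In a real fetcher the parsed rules would live in the robots.txt cache per host, and the crawl-delay enforcer would take the larger of the site's `Crawl-delay` and the crawler's own per-host rate cap.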
URL canonicalization#
- Lowercase host, strip default ports, sort query, drop fragments, follow `<link rel=canonical>`.
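
The syntactic steps in that list can be sketched with the standard `urllib.parse` module. This is one possible normalizer, not the definitive one: the function name `canonicalize` is hypothetical, it drops any userinfo, and re-encoding the query through `urlencode` also normalizes percent-escapes. Following `<link rel=canonical>` needs the fetched HTML, so it is left out here.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def canonicalize(url: str) -> str:
    """Return a normalized form of url for use as a dedup key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()      # lowercase host
    if ":" in host:                            # IPv6 literal: restore brackets
        host = f"[{host}]"
    port = parts.port
    if port is None or port == DEFAULT_PORTS.get(scheme):
        netloc = host                          # strip default port
    else:
        netloc = f"{host}:{port}"
    # Sort query parameters so that ?b=2&a=1 and ?a=1&b=2 map to one key.
    pairs = sorted(parse_qsl(parts.query, keep_blank_values=True))
    query = urlencode(pairs)
    path = parts.path or "/"                   # empty path is equivalent to "/"
    return urlunsplit((scheme, netloc, path, query, ""))  # "" drops the fragment

print(canonicalize("HTTP://Example.COM:80/a?b=2&a=1#frag"))
# → http://example.com/a?a=1&b=2
```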
Dedup#
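- URL: Bloom filter sized for the expected URL count at a 1% false-positive rate (FPR). A false positive means a new URL is wrongly treated as already seen and skipped.
- Content: 64-bit SimHash for near-duplicate pages; two pages count as duplicates when their fingerprints differ in at most 3 bits (Hamming distance).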
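
To make the two dedup choices concrete, here is a hedged sketch. `bloom_parameters` applies the standard sizing formulas m = -n·ln(p) / (ln 2)² and k = (m/n)·ln 2. `simhash` is a simple unweighted variant over whitespace tokens; the tokenization, the use of BLAKE2b as the per-token hash, and reading "threshold 3" as "at most 3 differing bits" are assumptions made for this example.

```python
import hashlib
import math

def bloom_parameters(n_items: int, fpr: float) -> tuple[int, int]:
    """Return (bits, hash_count) for a Bloom filter holding n_items at the target FPR."""
    m_bits = math.ceil(-n_items * math.log(fpr) / (math.log(2) ** 2))
    k_hashes = max(1, round(m_bits / n_items * math.log(2)))
    return m_bits, k_hashes

def simhash(text: str) -> int:
    """64-bit SimHash over lowercase whitespace tokens (unweighted)."""
    weights = [0] * 64
    for token in text.lower().split():
        digest = hashlib.blake2b(token.encode("utf-8"), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        for i in range(64):
            # Each token votes +1 or -1 on every bit position.
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, w in enumerate(weights):
        if w > 0:
            fingerprint |= 1 << i
    return fingerprint

def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")

def near_duplicate(a: int, b: int, threshold: int = 3) -> bool:
    """Assumed reading of the threshold: duplicates differ in at most `threshold` bits."""
    return hamming(a, b) <= threshold

bits, hashes = bloom_parameters(1_000_000_000, 0.01)
print(bits / 8 / 1e9, "GB with", hashes, "hash functions")
```

At a 1% FPR the formula gives roughly 9.6 bits per URL, so about 1.2 GB of filter per billion URLs with 7 hash functions.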
Scale design#
- Frontier sharded by host hash; one host always served by same node (politeness).
- Workers pull from frontier; processed results emitted to Kafka.
- WARC files stored as 1 GB chunks in S3 / HDFS.
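
The host-to-shard assignment from the first bullet might look like the sketch below. The function name `shard_for` and the choice of SHA-256 are assumptions. A stable digest is used instead of Python's built-in `hash()`, which is randomized per process for strings and would send the same host to different nodes after a restart.

```python
import hashlib
from urllib.parse import urlsplit

def shard_for(url: str, num_shards: int) -> int:
    """Map a URL to a frontier shard using only its host, so one host stays on one node."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Take the first 8 bytes as an unsigned integer, then reduce modulo the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards

a = shard_for("https://example.com/a", 16)
b = shard_for("http://EXAMPLE.com/b?x=1", 16)
print(a == b)  # same host, so same shard
```

Plain modulo remaps most hosts whenever `num_shards` changes; a consistent-hashing ring is a common alternative if shards are added or removed often.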
Politeness sticking points#
- `*.akamai.com` masking many hosts behind one IP: limit by IP too.
- Sitemap honesty: verify timestamps before re-crawl.
- Crawler traps (infinite calendar, faceted search): depth limit + URL pattern dedup.
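
One way to read "depth limit + URL pattern dedup" for crawler traps is sketched below: collapse digit runs in the path and ignore query values so that endless calendar or facet URLs share one pattern, then cap how many URLs are admitted per pattern. The `TrapGuard` class, the `{n}` placeholder, and both default limits are hypothetical choices for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

_DIGITS = re.compile(r"\d+")

def url_pattern(url: str) -> str:
    """Reduce a URL to host + path with digit runs masked + sorted query keys."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = _DIGITS.sub("{n}", parts.path)
    keys = sorted({pair.split("=", 1)[0] for pair in parts.query.split("&") if pair})
    return f"{host}{path}?{'&'.join(keys)}"

class TrapGuard:
    """Reject URLs that are too deep or whose pattern has been seen too often."""

    def __init__(self, max_depth: int = 16, max_per_pattern: int = 1000) -> None:
        self.max_depth = max_depth
        self.max_per_pattern = max_per_pattern
        self.seen: Counter[str] = Counter()

    def admit(self, url: str, depth: int) -> bool:
        if depth > self.max_depth:
            return False                      # depth limit
        pattern = url_pattern(url)
        if self.seen[pattern] >= self.max_per_pattern:
            return False                      # pattern budget exhausted
        self.seen[pattern] += 1
        return True

print(url_pattern("https://example.com/calendar/2024/05?view=month&day=3"))
# → example.com/calendar/{n}/{n}?day&view
```

The per-pattern counter trades some coverage for safety: a legitimate site with many numeric article IDs under one path would hit the same cap, so the limit would need tuning per host.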
Glossary & fundamentals#
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
| HLD | CDN | edge caching for static assets | cdn |
| HLD | Sharding | horizontal partitioning across nodes | database-sharding |
| HLD | Pub/Sub & message brokers | topics, consumer groups, delivery semantics | pub-sub-pattern |
| HLD | CAP / PACELC | C vs A under partition; L vs C otherwise | cap-pacelc |
| HLD | Probabilistic data structures | Bloom, HLL, Count-Min, MinHash, t-digest | probabilistic-data-structures |
| HLD | Idempotency & retries | safe re-execution, backoff + jitter | idempotency-retries |
| HLD | Observability | metrics, logs, traces, SLOs | observability |
| HLD | Search internals | inverted index, BM25, embeddings, ANN | search-internals |
| LLD | Creational patterns | Singleton, Factory, Builder, Prototype | creational-patterns |