Web Crawler — Notes
Functional
- Start from seeds, fetch HTML, extract links, recurse.
- Respect robots.txt, sitemaps, crawl-delay.
- Re-crawl on a freshness policy: popular pages are revisited more often.
- Deduplicate URLs and content.
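
The functional requirements above can be sketched as a small breadth-first loop. This is a minimal illustration, not a production design: the `fetch` callback, the `FAKE_WEB` dictionary, the `a.example` host, and the `LinkExtractor` helper are all hypothetical names, nothing touches the network, and sitemaps and crawl-delay are left out.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.robotparser import RobotFileParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, robots, max_pages=100):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    frontier = deque(seeds)
    seen_urls = set(seeds)      # URL-level dedup
    seen_content = set()        # content-level dedup (SHA-256 of the body)
    stored = []
    while frontier and len(stored) < max_pages:
        url = frontier.popleft()
        if not robots.can_fetch("*", url):
            continue            # disallowed by robots.txt
        html = fetch(url)
        if html is None:
            continue
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_content:
            continue            # same body already stored under another URL
        seen_content.add(digest)
        stored.append(url)
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments before dedup.
            link, _fragment = urldefrag(urljoin(url, href))
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return stored


# Toy in-memory "web" standing in for real HTTP fetches (assumption).
FAKE_WEB = {
    "http://a.example/": '<a href="/x">x</a><a href="/private/p">p</a><a href="/x#top">x</a>',
    "http://a.example/x": '<a href="/y">y</a><a href="/z">z</a>',
    "http://a.example/y": "<p>same body</p>",
    "http://a.example/z": "<p>same body</p>",
    "http://a.example/private/p": "<p>secret</p>",
}

robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /private/"])

pages = crawl(["http://a.example/"], FAKE_WEB.get, robots)
print(pages)
```

In this toy run `/private/p` is skipped by the robots rule, `/x#top` collapses into `/x`, and `/z` is dropped as a content duplicate of `/y`.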
Non-functional
- Scale: 1B+ pages crawled (the capacity estimate below is sized for 10B).
- 1k+ fetchers in parallel.
- Politeness: ≤ 1 connection / host by default.
- Resumable on failure.
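
One way to enforce the per-host politeness limit is a frontier with one queue per host and a heap of hosts ordered by the earliest time they may be contacted again. The sketch below is an assumption about how this could look, not a reference design: `PoliteFrontier`, `delay_s`, and the `done` callback are hypothetical names, and the clock is passed in explicitly so the behavior is deterministic.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """At most one in-flight fetch per host, with a delay between fetches."""

    def __init__(self, delay_s=1.0):
        self.delay_s = delay_s
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready = []                   # heap of (earliest_time, host)
        self.in_heap = set()              # idle hosts currently in the heap
        self.busy = set()                 # hosts with a fetch in flight
        self.next_ok = {}                 # host -> earliest time of next fetch

    def _schedule(self, host, now):
        if host in self.busy or host in self.in_heap or not self.queues[host]:
            return
        when = max(now, self.next_ok.get(host, now))
        heapq.heappush(self.ready, (when, host))
        self.in_heap.add(host)

    def add(self, url, now):
        host = urlsplit(url).netloc
        self.queues[host].append(url)
        self._schedule(host, now)

    def next_url(self, now):
        """Return a URL whose host is idle and past its delay, else None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        self.in_heap.discard(host)
        self.busy.add(host)
        return self.queues[host].popleft()

    def done(self, url, now):
        """Call when a fetch finishes; the host becomes eligible after delay_s."""
        host = urlsplit(url).netloc
        self.busy.discard(host)
        self.next_ok[host] = now + self.delay_s
        self._schedule(host, now)


frontier = PoliteFrontier(delay_s=1.0)
for u in ["http://a.example/1", "http://a.example/2", "http://b.example/1"]:
    frontier.add(u, now=0.0)

first = frontier.next_url(now=0.0)    # a.example is idle
second = frontier.next_url(now=0.0)   # b.example is idle
third = frontier.next_url(now=0.0)    # both hosts busy
frontier.done(first, now=0.2)         # a.example eligible again at 1.2
blocked = frontier.next_url(now=1.0)  # still inside the delay window
fourth = frontier.next_url(now=1.2)
```

Resumability is not covered here; persisting `queues` and `next_ok` to durable storage would be one way to approach it.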
Capacity (10B pages, 1 month)
- 10B pages / 30 days = ~4,000 pages/s avg.
- Avg page 500 KB → 2 GB/s = 16 Gbps inbound.
- Storage: 10B × 500 KB = 5 PB raw; 1 PB compressed (gzip ~5×).
- URL frontier: 100B URLs × 50 B → 5 TB raw; a Bloom filter over the seen-URL set needs ~120 GB at 1% FPR (~9.6 bits/URL).
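
The estimates above can be reproduced with a few lines of arithmetic. The inputs are the ones stated in the notes (10B pages, 30 days, 500 KB/page, 5× compression, 100B URLs at 50 B each, 1% FPR); the Bloom filter size uses the standard optimal-size formula m = -n·ln(p) / (ln 2)². The notes round the unrounded values shown here.

```python
import math

PAGES = 10_000_000_000          # 10B pages
WINDOW_S = 30 * 24 * 3600       # one 30-day month in seconds
PAGE_BYTES = 500_000            # 500 KB average page (decimal units)
COMPRESSION = 5                 # assumed gzip ratio from the notes
URLS = 100_000_000_000          # 100B URLs tracked
URL_BYTES = 50
FPR = 0.01

pages_per_s = PAGES / WINDOW_S
inbound_gbps = pages_per_s * PAGE_BYTES * 8 / 1e9
raw_pb = PAGES * PAGE_BYTES / 1e15
compressed_pb = raw_pb / COMPRESSION
url_tb = URLS * URL_BYTES / 1e12

# Optimal Bloom filter size: bits per element = -ln(p) / (ln 2)^2
bits_per_url = -math.log(FPR) / (math.log(2) ** 2)
bloom_gb = URLS * bits_per_url / 8 / 1e9

print(f"{pages_per_s:,.0f} pages/s, {inbound_gbps:.1f} Gbps inbound")
print(f"{raw_pb:.1f} PB raw, {compressed_pb:.1f} PB compressed")
print(f"{url_tb:.1f} TB of URLs, Bloom filter {bloom_gb:.0f} GB "
      f"({bits_per_url:.2f} bits/URL)")
```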
Data
- URL store: (url, host, depth, last_fetched, fetch_state, http_code, content_hash).
- Page store: WARC files on S3 (immutable, append-only).
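
The URL store tuple above maps naturally onto a small record type. The field names come from the notes; the field types, the `fetch_state` values ("pending", "fetched", "failed"), the use of SHA-256 for `content_hash`, and the helper functions are assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlsplit


@dataclass
class UrlRecord:
    url: str
    host: str
    depth: int
    last_fetched: Optional[float] = None   # Unix timestamp, None if never fetched
    fetch_state: str = "pending"           # assumed states: pending / fetched / failed
    http_code: Optional[int] = None
    content_hash: Optional[str] = None     # hex SHA-256 of the body (assumption)


def new_record(url: str, depth: int) -> UrlRecord:
    """Create a pending record, deriving host from the URL."""
    return UrlRecord(url=url, host=urlsplit(url).netloc, depth=depth)


def mark_fetched(rec: UrlRecord, http_code: int, body: bytes, now: float) -> None:
    """Update a record in place after a fetch attempt."""
    rec.last_fetched = now
    rec.http_code = http_code
    if 200 <= http_code < 300:
        rec.fetch_state = "fetched"
        rec.content_hash = hashlib.sha256(body).hexdigest()
    else:
        rec.fetch_state = "failed"


rec = new_record("https://example.com/a", depth=0)
mark_fetched(rec, 200, b"hello", now=1000.0)
print(rec)
```

Keeping `content_hash` on the URL record lets content-level dedup run against the URL store without reading the WARC files back.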
Trade-offs
- BFS gives broader coverage across sites; DFS tends to go deep into a single site.
- Priority crawl by PageRank/freshness for limited budget.
- Politeness vs throughput: per-host queues cap throughput on large hosts, but politeness is non-negotiable.
- Headless rendering is essential for SPA sites but costs 10–100× more per page; apply it selectively.
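
Priority crawling under a limited budget can be sketched as picking the top-scoring URLs each round. The scoring rule here (importance multiplied by how overdue the page is) is only one plausible choice and is an assumption, as are the `Candidate` fields, the `pick_batch` name, and the example URLs; `importance` stands in for a PageRank-like score.

```python
import heapq
from typing import NamedTuple


class Candidate(NamedTuple):
    url: str
    importance: float        # e.g. a normalized PageRank-like score (assumption)
    last_fetched: float      # Unix timestamp of the previous fetch
    target_interval_s: float # desired re-crawl interval for this page


def score(c: Candidate, now: float) -> float:
    """Higher is more urgent: important pages that are overdue rank first."""
    staleness = (now - c.last_fetched) / c.target_interval_s
    return c.importance * staleness


def pick_batch(candidates, budget: int, now: float):
    """Return up to `budget` URLs, most urgent first."""
    top = heapq.nlargest(budget, candidates, key=lambda c: score(c, now))
    return [c.url for c in top]


candidates = [
    Candidate("http://news.example/", 0.9, last_fetched=0, target_interval_s=3600),
    Candidate("http://blog.example/", 0.3, last_fetched=0, target_interval_s=86400),
    Candidate("http://docs.example/", 0.6, last_fetched=3600, target_interval_s=7200),
]

batch = pick_batch(candidates, budget=2, now=7200)
print(batch)
```

With a budget of 2, the overdue high-importance page is chosen first and the rarely changing blog is deferred to a later round.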
Refs
- Mercator (1999), UbiCrawler, Heritrix (Internet Archive), Common Crawl,
"Designing Data-Intensive Applications" ch.10, ByteByteGo web crawler.