Web Crawler — Notes

Functional

  • Start from seeds, fetch HTML, extract links, recurse.
  • Respect robots.txt rules and crawl-delay; use sitemaps for URL discovery.
  • Re-crawl on a freshness policy (popular pages are revisited more often).
  • Deduplicate URLs and content.
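The loop described by these bullets can be sketched as follows. This is a minimal, single-threaded illustration rather than a production design: `fetch` and `allowed` are hypothetical hooks (the first stands in for an HTTP client, the second for a robots.txt check such as `urllib.robotparser.RobotFileParser.can_fetch`), and content deduplication here means an exact SHA-256 match only.

```python
import hashlib
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, fetch, allowed, max_pages=100):
    """Breadth-first crawl from `seeds`.

    fetch(url)   -> HTML text or None (hypothetical hook)
    allowed(url) -> bool, stands in for a robots.txt check (hypothetical hook)
    """
    frontier = deque(seeds)
    seen_urls = set(seeds)      # URL-level dedup
    seen_hashes = set()         # content-level dedup (exact match only)
    pages = {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if not allowed(url):
            continue
        html = fetch(url)
        if html is None:
            continue
        digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
        if digest in seen_hashes:
            continue            # same body already stored under another URL
        seen_hashes.add(digest)
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop the #fragment before dedup.
            link, _fragment = urldefrag(urljoin(url, href))
            if link.startswith(("http://", "https://")) and link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return pages
```

Passing a dict's `get` method as `fetch` is enough to exercise the loop offline. A real crawler would also normalize URLs more aggressively (case, default ports, query ordering) and would likely use near-duplicate detection rather than exact hashes.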

Non-functional

  • 1B+ pages crawled.
  • 1k+ fetchers in parallel.
  • Politeness: ≤ 1 connection / host by default.
  • Resumable on failure.
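One way to enforce the one-connection-per-host rule is a frontier with a FIFO queue per host and a heap of host ready times. The sketch below is an assumption-laden illustration: the class name `PoliteFrontier`, the 1-second default delay, and the explicit `now` arguments (used instead of a real clock so the behavior is deterministic) are all choices made for this example.

```python
import heapq
from collections import defaultdict, deque
from urllib.parse import urlsplit


class PoliteFrontier:
    """Per-host FIFO queues plus a heap of (ready_time, host).

    At most one URL per host is handed out at a time, and a host becomes
    eligible again only `delay` seconds after its previous fetch finishes.
    """

    def __init__(self, default_delay=1.0):
        self.default_delay = default_delay
        self.queues = defaultdict(deque)  # host -> pending URLs
        self.ready = []                   # heap of (ready_time, host)
        self.scheduled = set()            # hosts that are in the heap or in flight

    def add(self, url, now=0.0):
        host = urlsplit(url).netloc
        self.queues[host].append(url)
        if host not in self.scheduled:
            self.scheduled.add(host)
            heapq.heappush(self.ready, (now, host))

    def next_url(self, now):
        """Return a URL whose host is ready at time `now`, or None."""
        if not self.ready or self.ready[0][0] > now:
            return None
        _, host = heapq.heappop(self.ready)
        # The host stays in `scheduled` while its fetch is in flight.
        return self.queues[host].popleft()

    def done(self, url, now, delay=None):
        """Mark a fetch finished and re-arm the host after its delay."""
        host = urlsplit(url).netloc
        if self.queues[host]:
            wait = self.default_delay if delay is None else delay
            heapq.heappush(self.ready, (now + wait, host))
        else:
            self.scheduled.discard(host)
            del self.queues[host]
```

The optional `delay` argument to `done` is where a per-host crawl-delay value from robots.txt could be passed in. Persisting `queues` and `ready` periodically would be one route to the resumability requirement, though that is not shown here.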

Capacity (10B pages, 1 month)

  • 10B pages / 30 days = ~4,000 pages/s avg.
  • Avg page 500 KB → 2 GB/s = 16 Gbps inbound.
  • Storage: 10B × 500 KB = 5 PB raw; 1 PB compressed (gzip ~5×).
  • URL frontier: 100B URLs × 50 B → 5 TB of raw URLs; a seen-URL Bloom filter at 1% FPR needs ~9.6 bits/URL → ~120 GB.
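The estimates above can be reproduced with a few lines of arithmetic. The sketch assumes decimal units (1 KB = 1,000 bytes) and the standard optimal Bloom filter sizing of -ln(p) / (ln 2)^2 bits per element; the exact outputs differ slightly from the rounded figures in the list.

```python
import math

SECONDS_PER_DAY = 86_400

pages = 10e9             # pages to crawl
days = 30                # crawl window
avg_page_bytes = 500e3   # 500 KB per page
gzip_ratio = 5           # assumed compression ratio
urls = 100e9             # URLs tracked by the frontier
avg_url_bytes = 50
fpr = 0.01               # target Bloom filter false-positive rate

pages_per_s = pages / (days * SECONDS_PER_DAY)
inbound_bytes_per_s = pages_per_s * avg_page_bytes
inbound_gbps = inbound_bytes_per_s * 8 / 1e9
raw_storage_bytes = pages * avg_page_bytes
compressed_storage_bytes = raw_storage_bytes / gzip_ratio
url_bytes = urls * avg_url_bytes

# Optimal Bloom filter size: m/n = -ln(p) / (ln 2)^2 bits per element.
bits_per_url = -math.log(fpr) / (math.log(2) ** 2)
bloom_bytes = urls * bits_per_url / 8

print(f"throughput:   {pages_per_s:,.0f} pages/s")
print(f"inbound:      {inbound_bytes_per_s / 1e9:.2f} GB/s ({inbound_gbps:.1f} Gbps)")
print(f"raw storage:  {raw_storage_bytes / 1e15:.1f} PB")
print(f"compressed:   {compressed_storage_bytes / 1e15:.1f} PB")
print(f"URL bytes:    {url_bytes / 1e12:.1f} TB")
print(f"Bloom filter: {bits_per_url:.2f} bits/URL, {bloom_bytes / 1e9:.0f} GB")
```

Changing `fpr` shows how sensitive the filter size is: each tenfold reduction in the false-positive rate adds roughly 4.8 bits per URL.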

Data

  • URL store: (url, host, depth, last_fetched, fetch_state, http_code, content_hash).
  • Page store: WARC files on S3 (immutable, append-only).
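As an illustration, the URL store tuple could map onto a record type like the one below. The field names come from the list above; the `FetchState` values, the `new` and `record_fetch` helpers, and the choice of SHA-256 for `content_hash` are assumptions made for this sketch, not part of the notes.

```python
import hashlib
from dataclasses import dataclass
from enum import Enum
from typing import Optional
from urllib.parse import urlsplit


class FetchState(str, Enum):
    # Hypothetical states; the notes do not enumerate them.
    PENDING = "pending"
    FETCHED = "fetched"
    FAILED = "failed"


@dataclass
class UrlRecord:
    url: str
    host: str
    depth: int
    last_fetched: Optional[float] = None   # Unix timestamp of the last attempt
    fetch_state: FetchState = FetchState.PENDING
    http_code: Optional[int] = None
    content_hash: Optional[str] = None     # hex digest of the response body

    @classmethod
    def new(cls, url: str, depth: int) -> "UrlRecord":
        """Create a pending record, deriving host from the URL."""
        return cls(url=url, host=urlsplit(url).netloc, depth=depth)

    def record_fetch(self, http_code: int, body: bytes, fetched_at: float) -> None:
        """Update the record after a fetch attempt."""
        self.last_fetched = fetched_at
        self.http_code = http_code
        if 200 <= http_code < 300:
            self.fetch_state = FetchState.FETCHED
            self.content_hash = hashlib.sha256(body).hexdigest()
        else:
            self.fetch_state = FetchState.FAILED
```

Keeping `content_hash` on the URL record lets a re-crawl detect an unchanged page without reading the WARC payload back from the page store.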

Trade-offs

  • BFS gives broader coverage across sites; DFS biases the crawl toward a single site.
  • With a limited budget, prioritize the crawl by PageRank and freshness.
  • Politeness vs throughput: a per-host queue caps throughput on big hosts, but politeness is non-negotiable.
  • Headless rendering is essential for SPA sites but costs 10–100× more per page; render selectively.
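A budget-limited priority crawl can be sketched with a heap-backed frontier. The scoring function below is purely hypothetical: it multiplies an importance value (for example a PageRank-like score in [0, 1]) by a staleness term that grows with the age of the last fetch, with an assumed 7-day half-life. The notes do not specify a formula.

```python
import heapq
import itertools


class PriorityFrontier:
    """Max-priority queue of URLs built on heapq (a min-heap)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO among equal scores

    def push(self, url, score):
        # Negate the score so the highest-scoring URL is popped first.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


def priority(importance, age_days, half_life_days=7.0):
    """Hypothetical score: importance weighted by staleness in [0, 1)."""
    staleness = 1.0 - 0.5 ** (age_days / half_life_days)
    return importance * staleness


def crawl_with_budget(candidates, budget):
    """Pick up to `budget` URLs from (url, importance, age_days) tuples."""
    frontier = PriorityFrontier()
    for url, importance, age_days in candidates:
        frontier.push(url, priority(importance, age_days))
    picked = []
    while frontier and len(picked) < budget:
        picked.append(frontier.pop())
    return picked
```

With this scoring, an important page fetched yesterday ranks below a moderately important page that has not been fetched for a month, which is one way to trade importance against freshness.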

Refs

  • Mercator (1999), UbiCrawler, Heritrix (Internet Archive), Common Crawl, "Designing Data-Intensive Applications" ch.10, ByteByteGo web crawler.