Skip to content

Probabilistic Data Structures — Notes#

Why they exist#

Sometimes "approximately correct, with O(1) memory" beats "exact, with O(N) memory" — by orders of magnitude. Cache invalidation, analytics, anti-abuse, log dedup all benefit.

Where they show up in this repo#

  • Web Crawler — Bloom filter for seen URLs.
  • Distributed Cache — Bloom filter for cache penetration.
  • Rate Limiter — Count-Min for hot keys.
  • Real-time Analytics / Trending — HLL for unique counts, CMS+heap for top-K.
  • Spam / Abuse — MinHash/LSH for near-dup content.
  • Distributed Counter — HLL & CMS for views/likes at scale.

Reservoir & sampling#

  • Reservoir sampling: uniform random sample of N from a stream of unknown length.
  • Weighted reservoir for biased sampling.

Tuning rules of thumb#

  • Bloom 10 bits/key → 1% FPR.
  • CMS width = e/ε, depth = ln(1/δ); for ε=0.01, δ=0.01 → 272 × 5 = 1.4k cells.
  • HLL m=2^14 → 1.62/√m = 0.81% std err.

Refs#

  • Cormode, Muthukrishnan: "An improved data stream summary: the Count-Min Sketch."
  • Flajolet et al.: "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm."
  • Bonomi et al.: "Cuckoo filter."
  • Mitzenmacher, Upfal: Probability and Computing (textbook).
  • Apache DataSketches library docs.