Probabilistic Data Structures — Notes#

Why they exist#

Sometimes "approximately correct, with O(1) memory" beats "exact, with O(N) memory" — by orders of magnitude. Cache invalidation, analytics, anti-abuse, log dedup all benefit.

Where they show up in this repo#

Web Crawler — Bloom filter for seen URLs.
Distributed Cache — Bloom filter for cache penetration.
Rate Limiter — Count-Min for hot keys.
Real-time Analytics / Trending — HLL for unique counts, CMS+heap for top-K.
Spam / Abuse — MinHash/LSH for near-dup content.
Distributed Counter — HLL & CMS for views/likes at scale.

Reservoir & sampling#

Reservoir sampling: uniform random sample of N from a stream of unknown length.
Weighted reservoir for biased sampling.

Tuning rules of thumb#

Bloom 10 bits/key → 1% FPR.
CMS width = e/ε, depth = ln(1/δ); for ε=0.01, δ=0.01 → 272 × 5 = 1.4k cells.
HLL m=2^14 → 1.62/√m = 0.81% std err.

Refs#

Cormode, Muthukrishnan: "An improved data stream summary: the Count-Min Sketch."
Flajolet et al.: "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm."
Bonomi et al.: "Cuckoo filter."
Mitzenmacher, Upfal: Probability and Computing (textbook).
Apache DataSketches library docs.