Probabilistic Data Structures — Notes#
Why they exist#
Sometimes "approximately correct, with O(1) memory" beats "exact, with O(N) memory" — by orders of magnitude. Cache invalidation, analytics, anti-abuse, log dedup all benefit.
Where they show up in this repo#
- Web Crawler — Bloom filter for seen URLs.
- Distributed Cache — Bloom filter for cache penetration.
- Rate Limiter — Count-Min for hot keys.
- Real-time Analytics / Trending — HLL for unique counts, CMS+heap for top-K.
- Spam / Abuse — MinHash/LSH for near-dup content.
- Distributed Counter — HLL & CMS for views/likes at scale.
Reservoir & sampling#
- Reservoir sampling: uniform random sample of N from a stream of unknown length.
- Weighted reservoir for biased sampling.
Tuning rules of thumb#
- Bloom 10 bits/key → 1% FPR.
- CMS width = e/ε, depth = ln(1/δ); for ε=0.01, δ=0.01 → 272 × 5 = 1.4k cells.
- HLL m=2^14 → 1.62/√m = 0.81% std err.
Refs#
- Cormode, Muthukrishnan: "An improved data stream summary: the Count-Min Sketch."
- Flajolet et al.: "HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm."
- Bonomi et al.: "Cuckoo filter."
- Mitzenmacher, Upfal: Probability and Computing (textbook).
- Apache DataSketches library docs.