Skip to content

Probabilistic Data Structures — Detailed#

flowchart TB
  subgraph BF[Bloom Filter]
    BFW[Insert: set k bits via k hashes]
    BFR[Query: all k bits set?<br/>false positive possible, no false negative]
    BFP["Params: m bits, n items<br/>FPR ≈ (1 - e^{-kn/m})^k"]
    BFV[Variants: Counting, Scalable, Partitioned]
  end

  subgraph CKF[Cuckoo Filter]
    CFW[Insert with fingerprint<br/>+ alt bucket]
    CFD[Delete supported]
    CFP[Better cache locality<br/>than Bloom for high load]
  end

  subgraph CMS[Count-Min Sketch]
    CMW[Insert: ++ in d rows]
    CMR[Query: min over rows<br/>overestimate only]
    CMP["Params: d hashes, w width<br/>ε ≈ e/w, δ ≈ e^{-d}"]
    CMU[Use: heavy hitters,<br/>top-k, frequency]
  end

  subgraph HLL[HyperLogLog]
    HLLW[Bucket via hash prefix<br/>track max trailing zeros]
    HLLR[Cardinality estimate<br/>via harmonic mean]
    HLLP[2^14 bytes ≈ 16 KB<br/>0.81% std err]
    HLLM[Mergeable across shards]
  end

  subgraph QT[Quantiles: t-digest, KLL, GK]
    QTW[Cluster ranks into centroids]
    QTR[Query p50/p95/p99]
    QTM[Mergeable]
  end

  subgraph MinHash[MinHash / SimHash / LSH]
    SHJ[Jaccard / cosine similarity<br/>at scale]
    LSH[Locality-sensitive hashing<br/>for near-neighbor search]
  end

  subgraph TD[Top-K]
    SS[Space-Saving]
    HK[Misra-Gries Heavy Keepers]
  end

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class BFW,BFR,BFP,BFV,CFD,CFP,CMW,CMR,CMP,CMU,HLLR,HLLP,HLLM,QTW,QTR,QTM,SHJ,LSH,SS,HK service;
    class CFW,HLLW storage;

Bloom filter math#

  • m bits, n items, k hash functions.
  • Optimal k ≈ (m/n)·ln 2.
  • FPR ≈ (1 - e^{-kn/m})^k ≈ (1 - e^{-k·n/m})^k.
  • For FPR 1% with n=1B: m ≈ 9.6 bn bits ≈ 1.2 GB.

When to use each#

Need Pick
"have we seen this URL?" Bloom / Cuckoo
"what are top hashtags?" Count-Min + heap, or Space-Saving
"how many unique users today?" HyperLogLog
"p99 latency over 1B events" t-digest / KLL
"near-duplicate detection" MinHash + LSH (or SimHash)
"set difference between replicas" Invertible Bloom Filter

Pitfalls#

  • Bloom: cannot delete (use Counting Bloom or Cuckoo).
  • CMS: overestimates; pair with reservoir sampling for confidence.
  • HLL: bad at small cardinalities (n < ~12·m); use sparse repr.
  • All assume good hash functions (Murmur, xxHash); avoid identity hashes.

Glossary & fundamentals#

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag Concept What it is Page
HLD Leader/follower replication sync/semi-sync/async replication, failover replication-leader-follower
HLD Probabilistic data structures Bloom, HLL, Count-Min, MinHash, t-digest probabilistic-data-structures