HLD
LLD
Probabilistic Data Structures — Detailed
flowchart TB
subgraph BF[Bloom Filter]
BFW[Insert: set k bits via k hashes]
BFR[Query: all k bits set?<br/>false positive possible, no false negative]
BFP["Params: m bits, n items<br/>FPR ≈ (1 - e^{-kn/m})^k"]
BFV[Variants: Counting, Scalable, Partitioned]
end
subgraph CKF[Cuckoo Filter]
CFW[Insert with fingerprint<br/>+ alt bucket]
CFD[Delete supported]
CFP[Better cache locality<br/>than Bloom for high load]
end
subgraph CMS[Count-Min Sketch]
CMW[Insert: ++ in d rows]
CMR[Query: min over rows<br/>overestimate only]
CMP["Params: d hashes, w width<br/>ε ≈ e/w, δ ≈ e^{-d}"]
CMU[Use: heavy hitters,<br/>top-k, frequency]
end
subgraph HLL[HyperLogLog]
HLLW[Bucket via hash prefix<br/>track max trailing zeros]
HLLR[Cardinality estimate<br/>via harmonic mean]
HLLP[2^14 bytes ≈ 16 KB<br/>0.81% std err]
HLLM[Mergeable across shards]
end
subgraph QT[Quantiles: t-digest, KLL, GK]
QTW[Cluster ranks into centroids]
QTR[Query p50/p95/p99]
QTM[Mergeable]
end
subgraph MinHash[MinHash / SimHash / LSH]
SHJ[Jaccard / cosine similarity<br/>at scale]
LSH[Locality-sensitive hashing<br/>for near-neighbor search]
end
subgraph TD[Top-K]
SS[Space-Saving]
HK[Misra-Gries Heavy Keepers]
end
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class BFW,BFR,BFP,BFV,CFD,CFP,CMW,CMR,CMP,CMU,HLLR,HLLP,HLLM,QTW,QTR,QTM,SHJ,LSH,SS,HK service;
class CFW,HLLW storage;
Bloom filter math
m bits, n items, k hash functions.
Optimal k ≈ (m/n)·ln 2.
FPR ≈ (1 - e^{-kn/m})^k ≈ (1 - e^{-k·n/m})^k.
For FPR 1% with n=1B: m ≈ 9.6 bn bits ≈ 1.2 GB.
When to use each
Need
Pick
"have we seen this URL?"
Bloom / Cuckoo
"what are top hashtags?"
Count-Min + heap, or Space-Saving
"how many unique users today?"
HyperLogLog
"p99 latency over 1B events"
t-digest / KLL
"near-duplicate detection"
MinHash + LSH (or SimHash)
"set difference between replicas"
Invertible Bloom Filter
Pitfalls
Bloom: cannot delete (use Counting Bloom or Cuckoo).
CMS: overestimates; pair with reservoir sampling for confidence.
HLL: bad at small cardinalities (n < ~12·m); use sparse repr.
All assume good hash functions (Murmur, xxHash); avoid identity hashes.
Glossary & fundamentals
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.