Search Internals — Detailed#
Inverted index#
flowchart TB
subgraph Build[Index build]
Docs[Documents]
Tok[Tokenise + normalise<br/>lowercase, stop, stem]
Post[Per-term postings list<br/>doc_id, positions, tf]
Seg[(Immutable segment)]
Merge[Background merge]
end
subgraph Query[Query path]
QP[Parse query]
QT[Per-term postings fetch]
Intersect[AND / OR / phrase / proximity]
Score[BM25 score per doc]
Topk[Top-k heap]
end
Docs --> Tok --> Post --> Seg --> Merge
QP --> QT --> Intersect --> Score --> Topk
Seg --> QT
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class Docs,Tok,Post,Merge,QP,QT,Intersect,Score,Topk service;
class Seg datastore;
A posting for term cat:
BM25#
The de-facto ranker. For a query Q with terms q_i and a doc D:
Knobs: k_1 (term frequency saturation, ~1.2), b (length normalisation, ~0.75).
TF-IDF (predecessor)#
- Term frequency × inverse document frequency.
- Doesn't saturate; long docs win unfairly. BM25 fixes both.
Phrase / proximity#
- Use position lists to detect
"data center"as adjacent tokens. - Skip lists in the postings make intersection O(min(p1, p2)).
Sharding the index#
- Shard by doc id range: each shard has all terms for its docs (sub-set of docs). Queries fan out to all shards.
- Shard by term: each shard owns a subset of terms — cheaper for some queries but skewed (rare terms vs common).
- Most production systems shard by doc id.
Vector / semantic search#
flowchart TB
T[Text / image]
E[Embedding model<br/>BERT / CLIP / SBERT]
V[Vector - 512 to 4096 dim]
IDX[(ANN index<br/>HNSW / IVF-PQ / ScaNN)]
Q[Query embedding]
KNN[k-NN search]
T --> E --> V --> IDX
Q --> IDX
IDX --> KNN
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class T,V,KNN service;
class IDX datastore;
class E,Q compute;
ANN structures#
- HNSW — small-world graph; fast recall, RAM-hungry. Most popular.
- IVF + PQ — coarse cluster + product-quantised residuals. Memory-efficient.
- ScaNN — Google's hybrid; great quality / latency.
- DiskANN — SSD-backed for billion-scale.
Hybrid search#
- Lexical (BM25) returns top 200.
- Vector returns top 200.
- Reciprocal-rank fusion or learned-to-rank reranks. Both signals matter — pure vector misses brand names; pure lexical misses synonyms.
Real-time freshness#
- Lucene/ES uses immutable segments + soft commits (visible within seconds).
- Trade-off: smaller segments = faster freshness, more merge overhead.
Caches in front of search#
- Query cache (whole-query top-k).
- Filter cache (e.g. "language=en" bitsets).
- Field-data cache (sort + aggregations).
Where this shows up#
- Search engine (tier 8) — direct application.
- Recommendation system — ANN over user/item embeddings.
- Autocomplete — trie + ranker = lightweight inverted index.
- Spam / abuse detection — vector similarity for near-dup content.
Glossary & fundamentals#
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
HLD |
Sharding | horizontal partitioning across nodes | database-sharding |
HLD |
Search internals | inverted index, BM25, embeddings, ANN | search-internals |
LLD |
Data structures & complexity | Big-O, common DS, latency numbers | data-structures-complexity |
LLD |
Immutability | immutable types, persistent collections | immutability |