Search Internals — Detailed#

Inverted index#

flowchart TB
  subgraph Build[Index build]
    Docs[Documents]
    Tok[Tokenise + normalise<br/>lowercase, stop, stem]
    Post[Per-term postings list<br/>doc_id, positions, tf]
    Seg[(Immutable segment)]
    Merge[Background merge]
  end
  subgraph Query[Query path]
    QP[Parse query]
    QT[Per-term postings fetch]
    Intersect[AND / OR / phrase / proximity]
    Score[BM25 score per doc]
    Topk[Top-k heap]
  end
  Docs --> Tok --> Post --> Seg --> Merge
  QP --> QT --> Intersect --> Score --> Topk
  Seg --> QT

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class Docs,Tok,Post,Merge,QP,QT,Intersect,Score,Topk service;
    class Seg datastore;

A posting for term cat:

cat → [(doc_3, freq=2, pos=[10,42]), (doc_7, freq=1, pos=[3]), ...]

BM25#

The de-facto ranker. For a query Q with terms q_i and a doc D:

score(D, Q) = Σ IDF(q_i) · (tf · (k_1 + 1)) / (tf + k_1 · (1 - b + b · |D| / avgdl))

Knobs: k_1 (term frequency saturation, ~1.2), b (length normalisation, ~0.75).

TF-IDF (predecessor)#

Term frequency × inverse document frequency.
Doesn't saturate; long docs win unfairly. BM25 fixes both.

Phrase / proximity#

Use position lists to detect "data center" as adjacent tokens.
Skip lists in the postings make intersection O(min(p1, p2)).

Sharding the index#

Shard by doc id range: each shard has all terms for its docs (sub-set of docs). Queries fan out to all shards.
Shard by term: each shard owns a subset of terms — cheaper for some queries but skewed (rare terms vs common).
Most production systems shard by doc id.

Vector / semantic search#

flowchart TB
  T[Text / image]
  E[Embedding model<br/>BERT / CLIP / SBERT]
  V[Vector - 512 to 4096 dim]
  IDX[(ANN index<br/>HNSW / IVF-PQ / ScaNN)]
  Q[Query embedding]
  KNN[k-NN search]
  T --> E --> V --> IDX
  Q --> IDX
  IDX --> KNN

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class T,V,KNN service;
    class IDX datastore;
    class E,Q compute;

ANN structures#

HNSW — small-world graph; fast recall, RAM-hungry. Most popular.
IVF + PQ — coarse cluster + product-quantised residuals. Memory-efficient.
ScaNN — Google's hybrid; great quality / latency.
DiskANN — SSD-backed for billion-scale.

Hybrid search#

Lexical (BM25) returns top 200.
Vector returns top 200.
Reciprocal-rank fusion or learned-to-rank reranks. Both signals matter — pure vector misses brand names; pure lexical misses synonyms.

Real-time freshness#

Lucene/ES uses immutable segments + soft commits (visible within seconds).
Trade-off: smaller segments = faster freshness, more merge overhead.

Caches in front of search#

Query cache (whole-query top-k).
Filter cache (e.g. "language=en" bitsets).
Field-data cache (sort + aggregations).

Where this shows up#

Search engine (tier 8) — direct application.
Recommendation system — ANN over user/item embeddings.
Autocomplete — trie + ranker = lightweight inverted index.
Spam / abuse detection — vector similarity for near-dup content.

Glossary & fundamentals#

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag	Concept	What it is	Page
`HLD`	Sharding	horizontal partitioning across nodes	database-sharding
`HLD`	Search internals	inverted index, BM25, embeddings, ANN	search-internals
`LLD`	Data structures & complexity	Big-O, common DS, latency numbers	data-structures-complexity
`LLD`	Immutability	immutable types, persistent collections	immutability