Search Engine — Detailed
flowchart TB
subgraph Ingest
DOC[Documents]
CDC[CDC from primary DB]
NORM[Normalize / tokenize / stem]
ANALY[Analyzers / language]
EMBED([Optional: vector embeddings])
end
subgraph Index[Index Tier]
SHARDS[Shards per index]
REPL[Replicas per shard]
SEGS[Lucene segments]
BLOOM[Bloom / FST]
POST[Posting lists]
VECINDEX[HNSW vector index]
end
subgraph Cluster
MASTER[Master / coordinator]
INGEST_N([Ingest nodes])
DATA_N[Data nodes]
ROUTING[Routing layer]
end
subgraph Query
PARSE([Query parser DSL / Lucene])
PLAN[Planner: term + filter + rerank]
RECALL[Recall stage: text + vector]
RERANK([Reranker: BM25 + ML])
HIL[Highlight + snippet]
AGG[Aggregations / facets]
end
subgraph Ops
SNAP[Snapshots to S3]
BACKFILL[Reindex]
HOT_WARM[Hot / warm / cold tiers]
ILM[Index lifecycle mgmt]
end
Ingest --> Index
Index --- Cluster
Query --> Cluster
Cluster --> Query
Ops --- Index
classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
class DOC,NORM,ANALY,SHARDS,REPL,SEGS,BLOOM,POST,VECINDEX,MASTER,DATA_N,ROUTING,PLAN,RECALL,HIL,AGG,BACKFILL,HOT_WARM,ILM service;
class CDC datastore;
class EMBED,INGEST_N,PARSE,RERANK compute;
class SNAP storage;
Posting lists & scoring
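- Each term maps to a posting list sorted by doc_id; every entry holds (doc_id, term_freq, positions).
- Scoring defaults to BM25; the similarity function is pluggable.
- Conjunctive (AND) queries intersect the posting lists of their terms by walking them in doc_id order and keeping only the IDs present in every list.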
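The three points above can be sketched in a few lines of Python. This is a minimal illustration, not how Lucene stores or scores postings: the whitespace tokenizer, the helper names (`build_index`, `intersect`, `and_query`, `bm25`), and the sample documents are all assumptions, and the scoring uses the textbook BM25 formula with the common defaults `k1=1.2`, `b=0.75`.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Map each term to a posting list of (doc_id, term_freq, positions), sorted by doc_id."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # visiting IDs in order keeps every list sorted
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, len(pos_list), pos_list))
    return dict(index)

def intersect(a, b):
    """Two-pointer intersection of two sorted doc_id lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def and_query(index, terms):
    """Return doc_ids that contain every term, intersecting shortest lists first."""
    id_lists = sorted(([p[0] for p in index.get(t, [])] for t in terms), key=len)
    if not id_lists:
        return []
    result = id_lists[0]
    for ids in id_lists[1:]:
        result = intersect(result, ids)
    return result

def bm25(index, doc_len, terms, doc_id, k1=1.2, b=0.75):
    """Textbook BM25 score of one document for a bag of query terms."""
    n_docs = len(doc_len)
    avg_len = sum(doc_len.values()) / n_docs
    score = 0.0
    for term in terms:
        postings = index.get(term, [])
        tf = next((f for d, f, _ in postings if d == doc_id), 0)
        if tf == 0:
            continue
        df = len(postings)  # number of documents containing the term
        idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
        norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
        score += idf * tf * (k1 + 1) / norm
    return score

# Hypothetical sample corpus
docs = {1: "quick brown fox", 2: "quick red fox jumps", 3: "lazy brown dog"}
index = build_index(docs)
doc_len = {d: len(text.split()) for d, text in docs.items()}
matches = and_query(index, ["quick", "fox"])  # [1, 2]
ranked = sorted(matches, key=lambda d: bm25(index, doc_len, ["quick", "fox"], d), reverse=True)
```

Both matching documents contain each query term once, so the shorter document (1) ranks above the longer one (2) purely through BM25 length normalization.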
Sharding & routing
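- An index is split into shards; each shard is a self-contained Lucene index.
- Each shard has a primary copy plus replicas; a document is routed to its shard by a hash of the document ID.
- A query fans out to one copy of every shard; the coordinator merges the per-shard results into the final top-k.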
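Hash routing and the fan-out/merge step can be sketched as follows. This is an illustrative sketch under stated assumptions: MD5 stands in for whatever hash a real engine uses, `search_shard` is a hypothetical per-shard scorer that simply counts term occurrences, and shards are plain in-memory dicts.

```python
import hashlib
import heapq

def shard_for(doc_id: str, num_shards: int) -> int:
    """Route a document to a shard: stable hash of its ID modulo the shard count."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

def search_shard(shard_docs, query_term, k):
    """Hypothetical per-shard search: score is the term count; returns the local top-k."""
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(query_term)
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)

def scatter_gather(shards, query_term, k):
    """Coordinator: query every shard, then merge local top-k lists into a global top-k."""
    partials = [search_shard(shard, query_term, k) for shard in shards]
    return heapq.nlargest(k, (hit for part in partials for hit in part))

# Hypothetical sample data spread over three shards
docs = {"a": "fox fox fox", "b": "fox dog", "c": "cat", "d": "fox fox"}
num_shards = 3
shards = [dict() for _ in range(num_shards)]
for doc_id, text in docs.items():
    shards[shard_for(doc_id, num_shards)][doc_id] = text

top = scatter_gather(shards, "fox", k=2)  # [(3, 'a'), (2, 'd')]
```

Because every shard returns its own top-k, the global top-k is always contained in the union of the partial results, whichever shard each document lands on. The modulo also shows why the shard count is hard to change after indexing: a different `num_shards` remaps most document IDs.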
Vector + BM25 hybrid
- Hybrid retrieval combines lexical (BM25) recall with dense-vector (HNSW) recall, then reranks the merged candidates.
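One way to merge the two candidate lists before reranking is reciprocal rank fusion (RRF). The source does not say which fusion method is used, so RRF here is an assumption chosen for illustration. The sketch also replaces the HNSW index with a brute-force cosine scan, and the function names, vectors, and the constant `k=60` are illustrative.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def vector_recall(query_vec, doc_vecs, k):
    """Brute-force nearest neighbours; a stand-in for an approximate HNSW lookup."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: each list contributes 1 / (k + rank) per document."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for a deterministic order
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical inputs: a lexical ranking (e.g. from BM25) and toy 2-d embeddings
lexical = ["d1", "d2", "d3"]
doc_vecs = {"d1": [1.0, 0.0], "d2": [0.1, 1.0], "d4": [0.0, 1.0]}
dense = vector_recall([0.0, 1.0], doc_vecs, k=2)  # ['d4', 'd2']
candidates = rrf_fuse([lexical, dense])           # ['d2', 'd1', 'd4', 'd3']
```

`d2` comes first because it appears in both lists; the fused candidate list is what a downstream reranker would then rescore. RRF only needs ranks, which sidesteps the problem that BM25 scores and cosine similarities live on different scales.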
Glossary & fundamentals
Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.
| Tag | Concept | What it is | Page |
|---|---|---|---|
| HLD | Sharding | horizontal partitioning across nodes | database-sharding |
| HLD | Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower |
| HLD | Change Data Capture | WAL/binlog tailing, outbox publishing | change-data-capture |
| HLD | Search internals | inverted index, BM25, embeddings, ANN | search-internals |