Search Engine: Detailed

flowchart TB
  subgraph Ingest
    DOC[Documents]
    CDC[CDC from primary DB]
    NORM[Normalize / tokenize / stem]
    ANALY[Analyzers / language]
    EMBED([Optional: vector embeddings])
  end

  subgraph Index[Index Tier]
    SHARDS[Shards per index]
    REPL[Replicas per shard]
    SEGS[Lucene segments]
    BLOOM[Bloom / FST]
    POST[Posting lists]
    VECINDEX[HNSW vector index]
  end

  subgraph Cluster
    MASTER[Master / coordinator]
    INGEST_N([Ingest nodes])
    DATA_N[Data nodes]
    ROUTING[Routing layer]
  end

  subgraph Query
    PARSE([Query parser DSL / Lucene])
    PLAN[Planner: term + filter + rerank]
    RECALL[Recall stage: text + vector]
    RERANK([Reranker: BM25 + ML])
    HIL[Highlight + snippet]
    AGG[Aggregations / facets]
  end

  subgraph Ops
    SNAP[Snapshots to S3]
    BACKFILL[Reindex]
    HOT_WARM[Hot / warm / cold tiers]
    ILM[Index lifecycle mgmt]
  end

  Ingest --> Index
  Index --- Cluster
  Query --> Cluster
  Cluster --> Query
  Ops --- Index

    classDef client fill:#dbeafe,stroke:#1e40af,stroke-width:1px,color:#0f172a;
    classDef edge fill:#cffafe,stroke:#0e7490,stroke-width:1px,color:#0f172a;
    classDef service fill:#fef3c7,stroke:#92400e,stroke-width:1px,color:#0f172a;
    classDef datastore fill:#fee2e2,stroke:#991b1b,stroke-width:1px,color:#0f172a;
    classDef cache fill:#fed7aa,stroke:#9a3412,stroke-width:1px,color:#0f172a;
    classDef queue fill:#ede9fe,stroke:#5b21b6,stroke-width:1px,color:#0f172a;
    classDef compute fill:#d1fae5,stroke:#065f46,stroke-width:1px,color:#0f172a;
    classDef storage fill:#e5e7eb,stroke:#374151,stroke-width:1px,color:#0f172a;
    classDef external fill:#fce7f3,stroke:#9d174d,stroke-width:1px,color:#0f172a;
    classDef obs fill:#f3e8ff,stroke:#6b21a8,stroke-width:1px,color:#0f172a;
    class DOC,NORM,ANALY,SHARDS,REPL,SEGS,BLOOM,POST,VECINDEX,MASTER,DATA_N,ROUTING,PLAN,RECALL,HIL,AGG,BACKFILL,HOT_WARM,ILM service;
    class CDC datastore;
    class EMBED,INGEST_N,PARSE,RERANK compute;
    class SNAP storage;

Posting lists & scoring

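  • Each term maps to a posting list sorted by doc_id; each entry holds (doc_id, term_freq, positions).
  • Scoring defaults to BM25; the similarity function is pluggable.
  • Conjunctive (AND) queries intersect the posting lists of their terms, walking the sorted lists in step.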
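
The three points above can be sketched in a few lines of Python. This is a toy illustration, not Lucene's implementation: the whitespace analyzer and the function names (`build_index`, `intersect`, `bm25`, `and_query`) are assumptions made for the example. The scoring function is the textbook Okapi BM25 form with the commonly used defaults k1 = 1.2 and b = 0.75.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Return (postings, doc_len).

    postings[term] is a list of (doc_id, term_freq, positions), sorted by doc_id.
    """
    postings = defaultdict(list)
    doc_len = {}
    for doc_id in sorted(docs):  # visiting ids in order keeps every list sorted
        tokens = docs[doc_id].lower().split()  # toy analyzer (assumption)
        doc_len[doc_id] = len(tokens)
        positions = defaultdict(list)
        for pos, tok in enumerate(tokens):
            positions[tok].append(pos)
        for term, pos_list in positions.items():
            postings[term].append((doc_id, len(pos_list), pos_list))
    return dict(postings), doc_len


def intersect(a, b):
    """Two-pointer intersection of two sorted doc_id lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def bm25(term, doc_id, postings, doc_len, k1=1.2, b=0.75):
    """Textbook BM25 contribution of one term to one document."""
    plist = postings.get(term, [])
    tf = next((f for d, f, _ in plist if d == doc_id), 0)
    if tf == 0:
        return 0.0
    n_docs = len(doc_len)
    df = len(plist)  # number of documents containing the term
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    avgdl = sum(doc_len.values()) / n_docs
    norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avgdl)
    return idf * tf * (k1 + 1) / norm


def and_query(terms, postings, doc_len):
    """Intersect the terms' posting lists, then rank the hits by summed BM25."""
    id_lists = sorted(
        ([d for d, _, _ in postings.get(t, [])] for t in terms), key=len
    )
    if not id_lists:
        return []
    hits = id_lists[0]  # start from the shortest list to keep work small
    for ids in id_lists[1:]:
        hits = intersect(hits, ids)
    scored = [(sum(bm25(t, d, postings, doc_len) for t in terms), d) for d in hits]
    return [d for _, d in sorted(scored, key=lambda x: (-x[0], x[1]))]
```

Starting the intersection from the shortest posting list bounds the result size early. Production engines go further and use skip data to jump ahead in the longer lists rather than stepping one entry at a time.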

Sharding & routing

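  • An index is split into shards; each shard is a self-contained Lucene index.
  • Each shard has one primary and its replicas; a document is routed to a shard by hashing its document ID.
  • A query fans out to one copy of every shard; the coordinator merges the per-shard results into the final ranking.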
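
A minimal sketch of hash routing and scatter-gather, under stated assumptions: shards are modeled as in-memory dicts, MD5 stands in for whatever stable hash a real engine uses, and `shard_for`, `scatter_gather`, and `score_fn` are hypothetical names introduced here for illustration.

```python
import hashlib
import heapq


def shard_for(doc_id: str, num_shards: int) -> int:
    """Map a document ID to a shard with a stable hash (MD5 is a stand-in)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def scatter_gather(shards, score_fn, k):
    """Ask every shard for its local top-k, then merge into a global top-k.

    shards is a list of {doc_id: document} dicts; score_fn scores one document.
    """
    partials = []
    for shard in shards:  # a real coordinator would issue these in parallel
        local = heapq.nlargest(
            k, ((score_fn(doc), doc_id) for doc_id, doc in shard.items())
        )
        partials.extend(local)
    return [doc_id for _, doc_id in heapq.nlargest(k, partials)]
```

Because each shard returns its own top k, the merged top k matches what a single unsharded index would return for the same scores. The modulo in `shard_for` is also why the primary shard count is hard to change after the fact: a different `num_shards` sends most existing IDs to a different shard.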

Vector + BM25 hybrid

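  • Modern stacks run lexical recall (BM25 over the inverted index) and dense vector recall (approximate nearest-neighbor search over an HNSW index), merge the two candidate sets, then rerank.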
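
The text does not say how the two candidate lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as this design's prescribed method. The function name is hypothetical, and k = 60 is the constant conventionally used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by doc ID for determinism.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical candidate lists from the two recall stages.
lexical_hits = ["a", "b", "c"]  # BM25 order
vector_hits = ["c", "a", "d"]   # HNSW nearest-neighbor order
fused = reciprocal_rank_fusion([lexical_hits, vector_hits])
```

RRF uses only ranks, so it avoids having to normalize BM25 scores against vector similarities, which live on different scales. The fused list would then be passed to the reranker.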

Glossary & fundamentals

Concepts referenced in this design. Each row links to its canonical page; the tag column shows whether it is a high-level (HLD) or low-level (LLD) concept.

Tag | Concept | What it is | Page
HLD | Sharding | horizontal partitioning across nodes | database-sharding
HLD | Leader/follower replication | sync/semi-sync/async replication, failover | replication-leader-follower
HLD | Change Data Capture | WAL/binlog tailing, outbox publishing | change-data-capture
HLD | Search internals | inverted index, BM25, embeddings, ANN | search-internals