Search Internals: Notes

Lucene segment lifecycle

  • In-memory buffer → flush → immutable segment.
  • Many small segments → merge into fewer larger ones.
  • Deletes are tombstones: documents are only marked as deleted, and their space is reclaimed at merge.
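
The lifecycle above can be sketched as a toy model. This is not Lucene's API: `ToyIndex`, `Segment`, and `flush_threshold` are hypothetical names invented for illustration, and the merge policy (merge everything into one segment) is a deliberate simplification.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """An immutable batch of (doc_id, text) pairs."""
    docs: tuple


@dataclass
class ToyIndex:
    flush_threshold: int = 2
    buffer: dict = field(default_factory=dict)    # in-memory, mutable
    segments: list = field(default_factory=list)  # flushed, immutable
    tombstones: set = field(default_factory=set)  # ids deleted from segments

    def add(self, doc_id, text):
        self.buffer[doc_id] = text
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write the buffer out as a new immutable segment.
        if self.buffer:
            self.segments.append(Segment(tuple(self.buffer.items())))
            self.buffer = {}

    def delete(self, doc_id):
        if doc_id in self.buffer:
            del self.buffer[doc_id]
        else:
            # Segments are never edited in place; only record the delete.
            self.tombstones.add(doc_id)

    def merge(self):
        # Rewrite all segments into one, dropping tombstoned documents.
        live = tuple(
            (doc_id, text)
            for seg in self.segments
            for doc_id, text in seg.docs
            if doc_id not in self.tombstones
        )
        self.segments = [Segment(live)]
        self.tombstones.clear()

    def stored_docs(self):
        """Documents physically stored in segments, tombstoned ones included."""
        return sum(len(seg.docs) for seg in self.segments)
```

Adding four documents with `flush_threshold=2` yields two segments. Deleting one of them leaves `stored_docs()` at 4 until `merge()` runs, after which a single segment holds the 3 live documents.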

Tokenisation pitfalls

  • CJK languages need n-gram or ICU analysers, not whitespace.
  • URLs / code / log lines need keyword analysers (no stemming).
  • Synonyms can be expanded at index time (precise, but changing the synonym list means reindexing) or at query time (flexible, but every query does more work).
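
A minimal sketch of two of the points above, in plain Python rather than any real analyser chain. The function names and the `SYNONYMS` table are hypothetical; a production setup would use the engine's own n-gram or ICU analysers and synonym filters.

```python
def whitespace_tokens(text):
    """Split on whitespace only, as a naive analyser would."""
    return text.split()


def char_bigrams(text):
    """Overlapping character bigrams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    if len(chars) < 2:
        return ["".join(chars)] if chars else []
    return [chars[i] + chars[i + 1] for i in range(len(chars) - 1)]


# Hypothetical synonym table, for illustration only.
SYNONYMS = {"laptop": ["notebook"], "tv": ["television"]}


def expand_query(tokens, synonyms=SYNONYMS):
    """Query-time expansion: keep each token and append its synonyms."""
    expanded = []
    for tok in tokens:
        expanded.append(tok)
        expanded.extend(synonyms.get(tok, []))
    return expanded


print(whitespace_tokens("東京都に住む"))   # ['東京都に住む']
print(char_bigrams("東京都に住む"))        # ['東京', '京都', '都に', 'に住', '住む']
print(expand_query(["cheap", "laptop"]))  # ['cheap', 'laptop', 'notebook']
```

Whitespace splitting leaves the Japanese phrase as a single unsearchable token, while bigrams make its parts matchable. The bigram output also contains 京都 (Kyoto), which is not in the original phrase's meaning. That is the precision cost n-grams pay for recall.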

Embedding model choice (2024+)

  • General text: OpenAI text-embedding-3 (small / large), Cohere embed-multilingual, BGE.
  • Code: voyage-code-2.
  • Images / video: CLIP, OpenCLIP.
  • Pick a dimension that fits memory: 512-d × 1B vectors × 4 bytes (float32) ≈ 2 TB of raw vectors, before any index overhead.
  • Query latency: p99 budget of 100-300 ms.
  • Example allocation (135 ms total): parse 5 ms, candidate fan-out 30 ms, score 30 ms, hydrate 20 ms, ranker 50 ms.
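
The sizing and budget arithmetic above can be checked with a few lines. This is a back-of-the-envelope sketch: `raw_vector_bytes` and `STAGE_BUDGET_MS` are names made up here, and the figures cover raw vectors only, not graph or quantisation structures.

```python
def raw_vector_bytes(dim, n_vectors, bytes_per_component=4):
    """Size of the raw vectors alone (float32 by default)."""
    return dim * n_vectors * bytes_per_component


# Per-stage allocation taken from the notes above, in milliseconds.
STAGE_BUDGET_MS = {
    "parse": 5,
    "candidate fan-out": 30,
    "score": 30,
    "hydrate": 20,
    "ranker": 50,
}

total_ms = sum(STAGE_BUDGET_MS.values())

print(f"{raw_vector_bytes(512, 1_000_000_000) / 1e12:.3f} TB")     # 2.048 TB
print(f"{raw_vector_bytes(512, 1_000_000_000, 2) / 1e12:.3f} TB")  # 1.024 TB
print(f"{total_ms} ms")                                            # 135 ms
```

Storing components as 2-byte floats halves the footprint. The 135 ms allocation fits a 300 ms p99 target with headroom but already overshoots a 100 ms one, so the stage figures would need to shrink at the tight end of the range.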

When you don't need a search engine

  • Single table, < 100k rows — Postgres full-text (tsvector) is fine.
  • Strict exact-match lookups: B-tree indexes on the queried columns.
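
As a rough illustration of the exact-match case, the sketch below uses Python's built-in `sqlite3` as a stand-in for Postgres, so that it runs without a server. That substitution is an assumption of this example, not something the notes prescribe. SQLite indexes are B-trees, and in Postgres a plain `CREATE INDEX` also defaults to a B-tree. The `users` table and `idx_users_email` index are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

# B-tree index on the column used for exact-match lookups.
conn.execute("CREATE INDEX idx_users_email ON users (email)")

query = "SELECT name FROM users WHERE email = ?"
params = ("user42@example.com",)

# The last column of each EXPLAIN QUERY PLAN row is a human-readable detail.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, params).fetchall()
plan_text = " ".join(str(row[-1]) for row in plan)

row = conn.execute(query, params).fetchone()
print(plan_text)
print(row)
```

The plan text should name `idx_users_email`, showing that the lookup is served by the index rather than a full table scan, with no search engine involved.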

Refs

  • "Lucene in Action" (still relevant).
  • Elasticsearch internals docs.
  • "Pretrained Transformers as Universal Computation Engines" (frozen pretrained transformers reused across modalities; background reading for modern embeddings).
  • Faiss / HNSWlib / ScaNN repos.
  • "BM25 explained" — Trey Grainger talk.