Search Internals — Notes
Lucene segment lifecycle
- In-memory buffer → flush → immutable segment.
- Many small segments → merge into fewer larger ones.
- Deletes are tombstones: a document is only marked deleted, and its space is reclaimed when its segment is merged.
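
A toy model of the lifecycle above, as a minimal sketch. This is not Lucene's API: `ToyIndex`, `Segment`, and every method name are hypothetical, and the model assumes doc ids are never reused after a delete.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Segment:
    """An immutable batch of documents (hypothetical toy type)."""
    docs: tuple  # tuple of (doc_id, text) pairs


@dataclass
class ToyIndex:
    buffer: dict = field(default_factory=dict)    # in-memory, mutable
    segments: list = field(default_factory=list)  # flushed, immutable
    tombstones: set = field(default_factory=set)  # ids marked deleted

    def add(self, doc_id, text):
        self.buffer[doc_id] = text

    def flush(self):
        """Turn the in-memory buffer into a new immutable segment."""
        if self.buffer:
            self.segments.append(Segment(tuple(self.buffer.items())))
            self.buffer = {}

    def delete(self, doc_id):
        """Mark a document deleted; segment data is left untouched."""
        self.buffer.pop(doc_id, None)
        self.tombstones.add(doc_id)

    def merge(self):
        """Rewrite all segments into one, dropping tombstoned docs."""
        live = tuple(
            (doc_id, text)
            for seg in self.segments
            for doc_id, text in seg.docs
            if doc_id not in self.tombstones
        )
        self.segments = [Segment(live)] if live else []
        self.tombstones.clear()

    def stored_docs(self):
        """Docs physically present in segments, deleted ones included."""
        return sum(len(seg.docs) for seg in self.segments)


index = ToyIndex()
index.add(1, "a"); index.add(2, "b"); index.flush()
index.add(3, "c"); index.flush()
index.delete(2)
before = index.stored_docs()  # 3: the tombstoned doc still takes space
index.merge()
after = index.stored_docs()   # 2: space reclaimed at merge
```

Real merges pick a subset of segments by policy rather than rewriting everything; the single-merge version here only shows where deleted space goes away.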
Tokenisation pitfalls
- CJK languages need n-gram or ICU analysers, not whitespace.
- URLs / code / log lines need keyword analysers (no stemming).
- Synonyms: expand at index time (precise, but a synonym change means reindexing) or at query time (flexible, but queries grow).
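
A rough sketch of two of the pitfalls above: whitespace splitting versus character bigrams on unsegmented text, and query-time synonym expansion. The function names and the `SYNONYMS` table are made up for illustration; a real analyser (ICU, for example) does far more than this.

```python
def whitespace_tokens(text):
    """Split on whitespace only."""
    return text.split()


def char_bigrams(text):
    """Overlapping character bigrams within each whitespace-separated chunk."""
    tokens = []
    for chunk in text.split():
        if len(chunk) < 2:
            tokens.append(chunk)
        else:
            tokens.extend(chunk[i:i + 2] for i in range(len(chunk) - 1))
    return tokens


SYNONYMS = {"laptop": ["notebook"]}  # hypothetical synonym table


def expand_query(tokens, synonyms=SYNONYMS):
    """Query-time expansion: keep each token and append its synonyms."""
    expanded = []
    for tok in tokens:
        expanded.append(tok)
        expanded.extend(synonyms.get(tok, []))
    return expanded


text = "東京都に住む"
print(whitespace_tokens(text))  # one giant token, useless for matching
print(char_bigrams(text))       # five overlapping bigrams
print(expand_query(["cheap", "laptop"]))
```

Note the bigram output contains 京都 (Kyoto) even though the text says 東京都 (Tokyo Metropolis): n-grams buy recall at the cost of false positives like this one.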
Embedding model choice (2024+)
- General text: OpenAI text-embedding-3 (small or large), Cohere embed-multilingual, BGE.
- Code: voyage-code-2.
- Images / video: CLIP, OpenCLIP.
- Pick a dimension that fits memory: 512 dims × 1B vectors × 4 bytes (float32) ≈ 2 TB of raw vectors, before any index overhead.
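
The memory arithmetic above as a small calculator. This is a sketch: the helper names are made up, and it counts raw vector storage only, so graph or quantiser overhead from the index comes on top.

```python
def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw storage for dense vectors; float32 (4 bytes) by default."""
    return num_vectors * dim * bytes_per_value


def to_tb(n_bytes):
    """Decimal terabytes (1 TB = 10**12 bytes)."""
    return n_bytes / 1e12


one_billion = 1_000_000_000
for label, width in [("float32", 4), ("float16", 2), ("int8", 1)]:
    size = raw_vector_bytes(one_billion, 512, bytes_per_value=width)
    print(f"512-d x 1B x {label}: {to_tb(size):.3f} TB")
```

Halving the value width or the dimension halves the footprint, which is often the cheapest lever before sharding.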
Latency budget (typical search)
- p99 budget: 100-300 ms.
- Allocate: parse 5 ms, candidate fan-out 30 ms, score 30 ms, hydrate 20 ms, ranker 50 ms (135 ms total, so this split only fits a budget above the 100 ms floor).
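
A quick way to sanity-check the allocation above against a chosen p99 budget. Sketch only: the stage keys mirror the bullet, while `headroom_ms`, `over_budget`, and the measured numbers are invented for illustration. Summing per-stage limits is a conservative planning figure, not a true end-to-end p99.

```python
STAGE_BUDGET_MS = {
    "parse": 5,
    "candidate_fan_out": 30,
    "score": 30,
    "hydrate": 20,
    "ranker": 50,
}


def headroom_ms(stages, p99_budget_ms):
    """Budget left after all stage allocations; negative means overcommitted."""
    return p99_budget_ms - sum(stages.values())


def over_budget(measured_ms, stages=STAGE_BUDGET_MS):
    """Stages whose measured latency exceeds their allocation, with the excess."""
    return {
        name: measured_ms[name] - limit
        for name, limit in stages.items()
        if measured_ms.get(name, 0) > limit
    }


print(headroom_ms(STAGE_BUDGET_MS, 300))  # 165
print(headroom_ms(STAGE_BUDGET_MS, 100))  # -35

# Hypothetical measurements, purely for illustration.
measured = {"parse": 4, "candidate_fan_out": 45, "score": 30,
            "hydrate": 10, "ranker": 70}
print(over_budget(measured))
```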
When you don't need a search engine
- Single table, < 100k rows: Postgres full-text (tsvector) is fine.
- Strict exact-match: indexed columns + B-tree.
Refs
- "Lucene in Action" (still relevant).
- Elasticsearch internals docs.
- "Pretrained Transformers as Universal Computation Engines" (background on pretrained transformers; not specifically about embeddings).
- Faiss / HNSWlib / ScaNN repos.
- "BM25 explained" — Trey Grainger talk.