RAG & Semantic Search – Personal Notes
1. Semantic Search
Semantic search focuses on the meaning of content, unlike lexical search which does literal string/pattern matching.
- Meaning is captured as vectors (numerical representations of content)
- To find relevant content, we calculate the distance between vectors and return the nearest ones
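The distance calculation can be sketched with cosine similarity over toy 3-dimensional vectors (real embeddings have hundreds or thousands of dimensions; the values here are made up):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction (same meaning), near 0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query = [0.1, 0.9, 0.2]
doc_about_dogs = [0.12, 0.85, 0.25]  # points in a similar direction to the query
doc_about_tax = [0.9, 0.05, 0.1]     # points elsewhere

print(cosine_similarity(query, doc_about_dogs))  # high
print(cosine_similarity(query, doc_about_tax))   # low
```

"Nearest" can equally be measured by Euclidean distance; cosine is the most common choice for text embeddings because it ignores vector magnitude.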
Nearest Neighbour Algorithms
| Algorithm | Description |
|---|---|
| KNN (K-Nearest Neighbours) | Exact approach – compares the query vector against all others |
| NSW (Navigable Small World) | Approximate nearest neighbour – each node keeps links to a small number of its closest neighbours; search navigates the graph from a random start point |
| HNSW (Hierarchical Navigable Small World) | Layered version of NSW – searches coarse layers first, narrowing down to the nearest vectors at each level |
HNSW is the standard used in most production vector databases.
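The exact KNN baseline that HNSW approximates is a full scan, which is easy to show in a few lines (HNSW itself avoids this O(n) comparison by navigating its layered graph):

```python
import math

def knn(query: list[float], vectors: list[list[float]], k: int = 3) -> list[int]:
    """Exact KNN: measure the query against every stored vector, O(n) per query."""
    ranked = sorted(range(len(vectors)), key=lambda i: math.dist(query, vectors[i]))
    return ranked[:k]  # indices of the k nearest vectors

vectors = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.1], [5.0, 5.0]]
print(knn([0.02, 0.02], vectors, k=2))
```

This is fine for thousands of vectors; production stores use HNSW because the full scan stops scaling at millions.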
2. RAG Workflow
DATA → Chunks → Vector Embeddings → Vector Store → Relevant Docs → LLM → USER
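The query side of this workflow can be sketched as one function; `store`, `embed`, and `llm` are hypothetical placeholders for whichever embedding model, vector store, and LLM you plug in:

```python
def rag_answer(question: str, store, embed, llm, top_k: int = 5) -> str:
    """End-to-end RAG query: embed the question, retrieve, then generate."""
    query_vec = embed(question)               # question -> vector
    docs = store.search(query_vec, k=top_k)   # nearest chunks from the vector store
    context = "\n\n".join(d["text"] for d in docs)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return llm(prompt)                        # grounded generation
```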
Tweaks
- Recommender by topic → Regular topic search
- Recommender by content → Remove duplicate titles so newly retrieved content is more relevant instead of repeating results
3. Embeddings
An embedding is a vector that carries meaning.
Sparse Embeddings
- BM25 – A ranking function that scores documents by estimating the relevance of their terms to the search query
- Calculated from term frequency within each document and inverse document frequency (how many documents in the corpus contain the term), normalised by document length
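A minimal BM25 scorer, using the common Okapi formulation with the usual `k1`/`b` defaults (documents are pre-tokenised word lists; the corpus is a toy example):

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """BM25: rewards term frequency in the doc, discounts common terms via IDF,
    and normalises by document length relative to the corpus average."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)        # docs containing the term
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)  # rarer term -> higher weight
        f = tf[term]
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    "the cat sat on the mat".split(),
    "dogs chase cats in the park".split(),
    "tax law for small companies".split(),
]
print(bm25_score(["cat"], corpus[0], corpus))  # positive: doc contains the term
```

The embedding itself is "sparse" because a document's vector over the whole vocabulary is mostly zeros – only the terms it actually contains score.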
Dense Embeddings
- CLIP (Contrastive Language–Image Pre-Training) – Maps images and their text/captions into a shared embedding space
- Trained on image-text pairs to describe images semantically
Hybrid Search
Combining sparse + dense embeddings gives the best retrieval results. This is called hybrid search.
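One common way to combine sparse and dense results is reciprocal rank fusion (RRF), which needs only the two ranked lists, not comparable scores:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked result lists (e.g. one from BM25, one from dense search).
    Each doc accumulates 1 / (k + rank); k=60 is the conventional constant."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse = ["doc3", "doc1", "doc7"]  # BM25 ranking
dense = ["doc1", "doc5", "doc3"]   # embedding ranking
print(reciprocal_rank_fusion([sparse, dense]))
```

Docs that appear high in *both* lists (here `doc1`, `doc3`) float to the top, which is exactly the behaviour hybrid search wants.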
4. Vector Databases
Most vector databases use HNSW under the hood – given a query vector, they return its approximate nearest neighbours (optionally limited by a distance threshold).
Choosing a Vector Database – Key Considerations
| Factor | Questions to Ask |
|---|---|
| Search Functionality | What types of search do you need? |
| Budget | Open-source vs. commercial? |
| Privacy | Does data need to be self-hosted? |
| Popularity | Better community = better support |
| SDK Support | Does it support your language/stack? |
| Performance | Does it meet your latency/scale needs? |
5. Chunking Strategy
Chunk Length
- 200–300 tokens works well as a general baseline
- Up to 600 tokens for complex/legal/policy documents
- Always test different lengths with your specific data to optimise recall
Chunk Overlap
- Overlapping chunks improve search relevance by preserving context at boundaries
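A sketch of fixed-size chunking with overlap (integers stand in for real tokeniser output):

```python
def chunk_tokens(tokens, size=250, overlap=50):
    """Split a token list into chunks of `size`, where each chunk repeats the
    last `overlap` tokens of the previous one to preserve context at boundaries."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break
    return chunks

tokens = list(range(600))  # stand-in for 600 real tokens
chunks = chunk_tokens(tokens, size=250, overlap=50)
print(len(chunks), [len(c) for c in chunks])
```

A sentence that straddles a chunk boundary now appears whole in at least one chunk, so its embedding still captures it.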
6. Metadata
Attach extra metadata to chunks for two reasons:
- Enrich LLM generation – e.g. citations, page numbers
- Enable custom filtering during retrieval
Common Metadata Fields
- Source document name
- Page numbers
- Section & section path
- Section chunk index + total chunks
- Reference IDs, article/rule/error/product codes
- LLM-generated summary of chunk contents
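How this might look attached to a single chunk (all field names and values here are illustrative, not a required schema):

```python
chunk = {
    "text": "Refunds must be requested within 30 days of purchase...",
    "metadata": {
        "source": "refund-policy.pdf",      # source document name
        "page": 4,                          # page number for citations
        "section_path": "3 > 3.2 Refunds",  # section & section path
        "chunk_index": 2,                   # this chunk's index in the section
        "chunk_total": 5,                   # total chunks in the section
        "summary": "30-day refund window and eligibility rules",
    },
}

# Custom filtering at retrieval time, e.g. restrict results to one document:
results = [c for c in [chunk] if c["metadata"]["source"] == "refund-policy.pdf"]

# Enriching generation, e.g. building a citation string:
citation = f'({chunk["metadata"]["source"]}, p. {chunk["metadata"]["page"]})'
print(citation)
```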
7. Reranking
Searches often return irrelevant chunks – reranking sorts results by relevance and filters out noise, which is especially useful with keyword search.
Reranking Pipeline (Example: Top k = 5)
Hybrid Search → Top k × 2 = 10 chunks
Add adjacent chunks = 18 chunks
Reranker filters to Top k = 5 chunks
Rule of thumb: Retrieve 2–3× more than your Top k, then rerank down to Top k.
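The pipeline can be sketched as one function; `search` and `score` are placeholders for your retriever and relevance scorer (e.g. a cross encoder):

```python
def rerank_pipeline(query, search, score, top_k=5):
    """Over-retrieve 2x top_k candidates, then let the relevance scorer
    pick the final top_k."""
    candidates = search(query, k=top_k * 2)  # retrieve more than needed
    ranked = sorted(candidates, key=lambda c: score(query, c), reverse=True)
    return ranked[:top_k]                    # filter back down to top_k
```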
Cross Encoders
- Specialised ML models trained specifically for relevance scoring
- Much faster than LLMs at processing text
- Higher-quality models typically require GPUs for production use
- Good open-source option:
BAAI/bge-reranker-base – solid performance, fast enough for CPUs, hostable via a Python FastAPI + Hugging Face setup
Quick Reference Cheatsheet
Semantic Search – meaning-based, uses vector distance
Lexical Search – pattern/string matching (BM25)
Embeddings – vectors with meaning
Sparse – BM25 (term frequency)
Dense – CLIP, neural models
Hybrid – sparse + dense combined
HNSW – layered graph for fast ANN search
Chunking – 200–300 tokens default, overlap helps
Reranking – retrieve 2–3× Top k, filter down
Cross Encoder – fast ML model for relevance scoring