Company Context Layer
Retrieval Benchmark
Comprehensive evaluation of 15 search configurations across 1,706 queries on a production compliance knowledge base. Testing BM25, vector search, and 5 fusion algorithms to find the optimal retrieval strategy for regulated financial services.
| Configuration | NDCG@10 | Recall@5 | Recall@10 | MRR | Prec@5 | p50 (ms) |
|---|---|---|---|---|---|---|
Why BM25 Wins on Compliance Documents
Compliance knowledge bases are identifier-dense. BSA policies reference
31 CFR 1020.320(d). SOPs name specific systems (DNA Core,
CU*BASE). Agent instructions mention exact form names and process IDs.
When a user searches for "BSA filing requirements for CTR," the winning signal is
exact keyword overlap, not semantic similarity.
BM25 (Best Matching 25) excels here because it rewards term frequency weighted by inverse document frequency — rare terms in the corpus that appear in the query get the highest scores. In a 109-file compliance corpus, terms like "CTR" and "31 CFR" are rare enough to be decisive.
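The scoring mechanics can be sketched in a few lines. This is a toy BM25 implementation (classic parameters k1=1.2, b=0.75) over a made-up three-document corpus, purely for illustration — not the DuckDB FTS scorer used in the benchmark:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized doc against the query with classic BM25."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency per query term
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if df[t] == 0:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl)
            )
        scores.append(s)
    return scores

# Toy corpus: the rare identifier "ctr" dominates the score,
# while the ubiquitous word "policy" contributes almost nothing.
docs = [
    "ctr filing policy for cash transactions".split(),
    "general records retention policy".split(),
    "wire transfer policy and procedures".split(),
]
scores = bm25_scores(["ctr", "policy"], docs)
```

The IDF term is what makes rare identifiers decisive: "ctr" appears in one of three documents, so it carries roughly 7x the weight of "policy", which appears in all three.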
Why Vector Search Fails on Small Domain-Specific Corpora
Vector search works by finding documents whose embeddings are closest in semantic space to the query. This works brilliantly on large, diverse corpora (like web search) where semantically similar documents are genuinely relevant.
But in a small, domain-specific corpus where all documents are about related topics (credit union compliance, BSA, operations), every document is semantically close to every query. The cosine similarity scores cluster tightly, making it nearly impossible to distinguish the truly relevant document from dozens of topically adjacent ones. Our data shows vector-only achieving just 3.8% NDCG@10 — worse than random selection.
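The score-compression effect is easy to reproduce. The sketch below uses fabricated three-dimensional "embeddings" purely for illustration (the real index uses 1024-dimensional Voyage-3.5 vectors); in a narrow domain, all document vectors point in nearly the same direction, so similarities land in a tight band:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# Fabricated embeddings for topically adjacent compliance docs.
query = [1.0, 0.2, 0.1]
docs = {
    "bsa_ctr_sop":        [1.0, 0.25, 0.12],   # the truly relevant doc
    "bsa_training_guide": [0.97, 0.3, 0.15],   # adjacent, not relevant
    "aml_policy":         [0.95, 0.28, 0.1],   # adjacent, not relevant
}
sims = {name: cosine(query, vec) for name, vec in docs.items()}
spread = max(sims.values()) - min(sims.values())
```

All three similarities land above 0.99 with a spread under 0.01 — a margin that small is easily swamped by embedding noise, which is how the relevant document loses its rank.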
Why Convex Combination Beats RRF
RRF (Reciprocal Rank Fusion) combines retrieval lists using rank positions only, while convex combination fuses the normalized scores themselves:
RRF: score(d) = Σ_i 1 / (K + rank_i(d))
Convex: score(d) = α · norm(vector_score) + (1-α) · norm(bm25_score)
The critical difference: RRF discards score magnitude. A document ranked #1 with BM25 score 45.2 gets the same RRF contribution as one ranked #1 with score 0.3. When BM25 produces high-confidence exact matches (as it does for compliance queries), throwing away that confidence signal is catastrophic.
Convex combination preserves score magnitude through MinMax normalization, then applies a weighted sum. This lets BM25's strong signal dominate when it has a confident match, while vector search can still contribute when keyword matching is ambiguous. This validates Bruch et al. (ACM TOIS 2023), who demonstrated convex combination outperforms RRF on the BEIR benchmark suite.
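A minimal sketch of this fusion step, assuming score dicts keyed by document ID (the `minmax` and `convex_fusion` names and the toy scores are illustrative, not from the benchmark code):

```python
def minmax(scores):
    """MinMax-normalize a {doc_id: raw_score} dict into [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def convex_fusion(vec_scores, bm25_scores, alpha=0.3):
    """score(d) = alpha * norm(vector) + (1 - alpha) * norm(bm25)."""
    v, b = minmax(vec_scores), minmax(bm25_scores)
    docs = set(v) | set(b)
    return {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0)
            for d in docs}

# BM25 has a high-confidence exact match on "ctr_sop"; the vector
# scores are nearly tied. With alpha=0.3 the BM25 signal dominates.
bm25 = {"ctr_sop": 45.2, "aml_policy": 12.1, "wire_sop": 8.3}
vec = {"aml_policy": 0.82, "ctr_sop": 0.80, "wire_sop": 0.79}
fused = convex_fusion(vec, bm25, alpha=0.3)
best = max(fused, key=fused.get)
```

Note the contrast with RRF: here BM25's 45.2-vs-12.1 margin survives normalization and outvotes the near-tied vector list, whereas RRF would reduce both lists to rank positions before combining.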
What This Means for Production
Immediate action: Switch from RRF K=60 (2:1 vector) to Convex α=0.3 (70% BM25, 30% vector). This is a config change, not a code change. Expected improvement: NDCG@10 from 0.082 → 0.729 — an 8.9x improvement in retrieval quality.
For teams deploying RAG in regulated industries: benchmark your retrieval before tuning your prompts. A mid-tier model with excellent retrieval will outperform a frontier model with broken retrieval. The retrieval layer is the highest-leverage optimization in any RAG system.
Evaluation Framework
We use known-item retrieval: each query is generated from a specific document, and the ground truth is that the source document should appear in the search results. This mirrors real agent usage — searching for a specific SOP, policy, or meeting note.
Query Categories
heading_direct (505): Exact section headings as questions.
entity_lookup (537): Named entities, identifiers, form numbers.
heading_paraphrase (311): Reworded section headings.
exact_phrase (213): Distinctive phrases from document bodies.
first_sentence (140): Opening sentences of chunks.
Metrics
NDCG@10: Normalized Discounted Cumulative Gain — position-weighted relevance.
Recall@K: Fraction of relevant docs found in top K.
MRR: Mean Reciprocal Rank of first relevant result.
Precision@5: Fraction of top 5 that are relevant.
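Under the known-item assumption (exactly one relevant document per query), each of these metrics collapses to a simple function of the gold document's rank. A sketch, with illustrative function and field names:

```python
import math

def known_item_metrics(ranked_ids, gold_id, k=10):
    """Metrics for known-item retrieval: one relevant doc per query."""
    if gold_id in ranked_ids:
        rank = ranked_ids.index(gold_id) + 1  # 1-based position
    else:
        rank = None
    # With a single relevant doc the ideal DCG is 1,
    # so NDCG@k reduces to 1 / log2(rank + 1).
    ndcg = 1.0 / math.log2(rank + 1) if rank and rank <= k else 0.0
    return {
        "ndcg@10": ndcg,
        "recall@5": 1.0 if rank and rank <= 5 else 0.0,
        "recall@10": 1.0 if rank and rank <= 10 else 0.0,
        "mrr": 1.0 / rank if rank else 0.0,
        "prec@5": (1.0 if rank and rank <= 5 else 0.0) / 5,
    }

# Gold doc retrieved at rank 2 of a 4-result list.
m = known_item_metrics(["a", "gold", "b", "c"], "gold")
```

Per-query values like these are averaged over all 1,706 queries to produce the headline numbers.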
Corpus
109 Markdown files from a production credit union knowledge base. Indexed into 1,974 chunks (~500 tokens, 50-token overlap). YAML frontmatter for doc_type, status, tags. Dual-indexed: DuckDB BM25 FTS + LanceDB vectors (Voyage-3.5, 1024d).
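The chunking step can be sketched as a sliding token window. Integer IDs stand in for real tokens here; the actual pipeline's tokenizer and boundary handling may differ:

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into fixed-size windows with overlap."""
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # final window reached the end of the document
    return chunks

# A 1200-token document yields three chunks: 0-499, 450-949, 900-1199.
chunks = chunk_tokens(list(range(1200)), size=500, overlap=50)
```

The 50-token overlap means the tail of each chunk reappears at the head of the next, so a sentence straddling a boundary is still retrievable from at least one chunk.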
Fusion Algorithms
RRF: Reciprocal Rank Fusion (Cormack et al., 2009).
Convex: Score-based weighted sum (Bruch et al., 2023).
RSF: Relative Score Fusion (MinMax normalization).
DBSN: Distribution-Based Score Normalization (Z-score).
CombMNZ: sum × list_count (Fox & Shaw, 1994).
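Two of these variants are compact enough to sketch. `zscore` approximates the DBSN normalization step and `comb_mnz` the Fox & Shaw combination; both are illustrative, not the benchmark's implementation, and CombMNZ is normally applied after normalizing each list:

```python
import statistics

def zscore(scores):
    """Z-score normalize a {doc_id: score} dict (DBSN-style)."""
    mu = statistics.mean(scores.values())
    sd = statistics.pstdev(scores.values()) or 1.0  # guard against sd=0
    return {d: (s - mu) / sd for d, s in scores.items()}

def comb_mnz(*score_lists):
    """CombMNZ: sum of scores times the number of lists with the doc."""
    docs = set().union(*score_lists)
    fused = {}
    for d in docs:
        hits = [sl[d] for sl in score_lists if d in sl]
        fused[d] = sum(hits) * len(hits)
    return fused

# A doc found by both retrievers gets its summed score doubled.
bm25 = {"ctr_sop": 45.2, "aml_policy": 12.1}
vec = {"aml_policy": 0.82, "ctr_sop": 0.80}
fused = comb_mnz(bm25, vec)
```

The `len(hits)` multiplier is CombMNZ's "list_count" term: it rewards documents that multiple retrievers agree on, independent of score magnitude.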
References
Cormack, G., Clarke, C., & Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods. SIGIR '09.
Bruch, S., Gai, S., & Ingber, A. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM Transactions on Information Systems.
Thakur, N., et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.
Muennighoff, N., et al. (2023). MTEB: Massive Text Embedding Benchmark.
Fox, E. & Shaw, J. (1994). Combination of Multiple Searches. NIST Special Publication 500-215.