Company Context Layer
Retrieval Benchmark

Comprehensive evaluation of 15 search configurations across 1,706 queries on a production compliance knowledge base. Testing BM25, vector search, and 5 fusion algorithms to find the optimal retrieval strategy for regulated financial services.

25,590 Total Evaluations (1,706 queries × 15 configs)
1,706 Test Queries (5 categories, auto-generated)
109 Source Files (1,974 indexed chunks)
1024d Embeddings (Voyage-3.5 vectors)
15 Configurations (5 fusion algorithms)
6 IR Metrics (NDCG, Recall, MRR, Precision)
Key Findings
BM25 Dominates
0.790 NDCG@10 for plain keyword search. In domain-specific compliance corpora where documents use specific identifiers, form numbers, and policy names, BM25 alone outperforms every hybrid configuration and scores roughly 20x higher than vector-only search.
Your Production Config Is Wrong
0.082 NDCG@10 for RRF K=60 with 2:1 vector weighting, our production default, which scores 9.6x worse than BM25 alone. Vector-heavy weighting drowns out the BM25 signal that actually finds the right documents.
Convex α=0.3 Wins Hybrid
0.729 NDCG@10 for score-based fusion with 70% BM25 / 30% vector, the best hybrid performance. This validates Bruch et al. (ACM TOIS 2023): convex combination preserves score magnitude where RRF discards it.
Vector-Only Is Worse Than Random
3.8% NDCG@10 for vector-only search. In a small, domain-specific corpus where every document covers related topics, semantic similarity retrieves related but wrong content. A random baseline would score roughly 9%.
Leaderboard
NDCG@10 by Configuration
All 15 configurations ranked by Normalized Discounted Cumulative Gain at cutoff 10. Higher is better.
Configuration families: BM25, Vector, RRF, RRF (Production), Convex, Other Fusion.
Multi-Metric Comparison
Columns: Configuration, NDCG@10, Recall@5, Recall@10, MRR, Precision@5, p50 latency (ms).
Performance by Query Category
NDCG@10 Across Query Categories
Top 5 configurations compared across 5 query types. Shows where keyword search dominates vs. where semantic search adds value.
Fusion Method Families
NDCG@10 by Fusion Method Family
Average and range of NDCG@10 scores within each method family. Shows which fusion approach is most effective overall.
The Story

Why BM25 Wins on Compliance Documents

Compliance knowledge bases are identifier-dense. BSA policies reference 31 CFR 1020.320(d). SOPs name specific systems (DNA Core, CU*BASE). Agent instructions mention exact form names and process IDs. When a user searches for "BSA filing requirements for CTR," the winning signal is exact keyword overlap, not semantic similarity.

BM25 (Best Matching 25) excels here because it rewards term frequency weighted by inverse document frequency — rare terms in the corpus that appear in the query get the highest scores. In a 109-file compliance corpus, terms like "CTR" and "31 CFR" are rare enough to be decisive.
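To make that concrete, here is a minimal, simplified BM25 sketch (not the benchmark's indexing code) over a three-chunk toy corpus, using the common k1=1.5, b=0.75 defaults; the documents and query are invented for illustration.

```python
import math
from collections import Counter

# Toy corpus: three compliance-style chunks. The identifiers "BSA" and "CTR"
# appear in only one chunk, so their inverse document frequency is high.
docs = [
    "BSA filing requirements for CTR under 31 CFR 1020.320 within 15 days",
    "Member onboarding SOP for the DNA Core account opening workflow",
    "Quarterly board reporting procedures and meeting note retention policy",
]

def bm25_scores(query, docs, k1=1.5, b=0.75):
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(d) for d in tokenized) / len(tokenized)
    n = len(tokenized)
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            df = sum(1 for d in tokenized if term in d)
            if df == 0:
                continue
            idf = math.log((n - df + 0.5) / (df + 0.5) + 1)  # rare terms get large idf
            freq = tf[term]
            score += idf * freq * (k1 + 1) / (freq + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores

print(bm25_scores("BSA filing requirements for CTR", docs))
# The first chunk wins decisively: "BSA" and "CTR" appear only there,
# so their high idf dominates the total score.
```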

Why Vector Search Fails on Small Domain-Specific Corpora

Vector search works by finding documents whose embeddings are closest in semantic space to the query. This works brilliantly on large, diverse corpora (like web search) where semantically similar documents are genuinely relevant.

But in a small, domain-specific corpus where all documents are about related topics (credit union compliance, BSA, operations), every document is semantically close to every query. The cosine similarity scores cluster tightly, making it nearly impossible to distinguish the truly relevant document from dozens of topically adjacent ones. Our data shows vector-only achieving just 3.8% NDCG@10 — worse than random selection.
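The clustering effect is easy to reproduce synthetically. The sketch below uses a made-up topic-plus-noise embedding model rather than real Voyage-3.5 vectors, but it shows the same failure mode: when every document shares a dominant topic direction, cosine similarities compress into a narrow band.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 1024

# Every "document" embedding is a shared compliance-topic direction plus a
# small document-specific offset, mimicking a narrow single-domain corpus.
topic = rng.normal(size=dim)
docs = topic + 0.3 * rng.normal(size=(1974, dim))
query = topic + 0.3 * rng.normal(size=dim)

sims = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
print(f"min={sims.min():.3f}  max={sims.max():.3f}  spread={sims.max() - sims.min():.3f}")
# All 1,974 similarities land in a narrow band, so the one truly relevant chunk
# is nearly indistinguishable from its topically adjacent neighbors.
```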

Why Convex Combination Beats RRF

RRF (Reciprocal Rank Fusion) combines retrieval lists using rank positions only:

RRF: score(d) = Σ_i weight_i / (K + rank_i(d))
Convex: score(d) = α · norm(vector_score) + (1-α) · norm(bm25_score)

The critical difference: RRF discards score magnitude. A document ranked #1 with BM25 score 45.2 gets the same RRF contribution as one ranked #1 with score 0.3. When BM25 produces high-confidence exact matches (as it does for compliance queries), throwing away that confidence signal is catastrophic.

Convex combination preserves score magnitude through MinMax normalization, then applies a weighted sum. This lets BM25's strong signal dominate when it has a confident match, while vector search can still contribute when keyword matching is ambiguous. This validates Bruch et al. (ACM TOIS 2023), who demonstrated convex combination outperforms RRF on the BEIR benchmark suite.
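A minimal sketch of both fusion rules makes the difference visible. The document IDs and scores below are invented to mirror the high-confidence-BM25 case described above; this is not the benchmark harness itself.

```python
def rrf_fuse(bm25_ranked, vector_ranked, k=60, w_bm25=1.0, w_vector=2.0):
    """Reciprocal Rank Fusion: only rank positions matter; raw scores are discarded."""
    fused = {}
    for weight, ranked in ((w_bm25, bm25_ranked), (w_vector, vector_ranked)):
        for rank, (doc_id, _score) in enumerate(ranked, start=1):
            fused[doc_id] = fused.get(doc_id, 0.0) + weight / (k + rank)
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

def minmax(scores):
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def convex_fuse(bm25_scores, vector_scores, alpha=0.3):
    """Convex combination: alpha * norm(vector_score) + (1 - alpha) * norm(bm25_score)."""
    b, v = minmax(bm25_scores), minmax(vector_scores)
    docs = set(b) | set(v)
    fused = {d: alpha * v.get(d, 0.0) + (1 - alpha) * b.get(d, 0.0) for d in docs}
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical results: BM25 has one high-confidence exact match; vector scores are nearly flat.
bm25 = {"policy_ctr.md": 45.2, "sop_wires.md": 3.1, "minutes_q3.md": 2.8}
vect = {"sop_wires.md": 0.84, "minutes_q3.md": 0.83, "policy_ctr.md": 0.82}
bm25_ranked = sorted(bm25.items(), key=lambda kv: kv[1], reverse=True)
vect_ranked = sorted(vect.items(), key=lambda kv: kv[1], reverse=True)

print(rrf_fuse(bm25_ranked, vect_ranked)[0][0])   # sop_wires.md: 2:1 vector weighting buries the exact match
print(convex_fuse(bm25, vect, alpha=0.3)[0][0])   # policy_ctr.md: BM25's confident score survives normalization
```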

What This Means for Production

Immediate action: Switch from RRF K=60 (2:1 vector) to Convex α=0.3 (70% BM25, 30% vector). This is a config change, not a code change. Expected improvement: NDCG@10 from 0.082 to 0.729, an 8.9x gain in retrieval quality.
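As a sketch of the change (the key names here are hypothetical; the real schema depends on your search layer's configuration):

```python
# Before: rank-based fusion with vector-heavy weighting (NDCG@10 ≈ 0.082)
RETRIEVAL_CONFIG_BEFORE = {
    "fusion": "rrf",
    "rrf_k": 60,
    "weights": {"vector": 2.0, "bm25": 1.0},
}

# After: score-based convex combination, 70% BM25 / 30% vector (NDCG@10 ≈ 0.729)
RETRIEVAL_CONFIG_AFTER = {
    "fusion": "convex",
    "alpha": 0.3,
    "normalization": "minmax",
}
```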

For teams deploying RAG in regulated industries: benchmark your retrieval before tuning your prompts. A mid-tier model with excellent retrieval will outperform a frontier model with broken retrieval. The retrieval layer is the highest-leverage optimization in any RAG system.

Methodology

Evaluation Framework

We use known-item retrieval: each query is generated from a specific document, and the ground truth is that the source document should appear in the search results. This mirrors real agent usage — searching for a specific SOP, policy, or meeting note.

Query Categories

heading_direct (505): Exact section headings as questions.
entity_lookup (537): Named entities, identifiers, form numbers.
heading_paraphrase (311): Reworded section headings.
exact_phrase (213): Distinctive phrases from document bodies.
first_sentence (140): Opening sentences of chunks.

Metrics

NDCG@10: Normalized Discounted Cumulative Gain — position-weighted relevance.
Recall@K: Fraction of relevant docs found in top K.
MRR: Mean Reciprocal Rank of first relevant result.
Precision@5: Fraction of top 5 that are relevant.
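Because each query has exactly one relevant document (its source), every metric reduces to a simple function of that document's rank. A minimal per-query sketch, assuming binary relevance; averaging these values over all 1,706 queries gives the reported figures.

```python
import math

def known_item_metrics(ranked_doc_ids, relevant_doc_id, k=10):
    """Per-query IR metrics when exactly one document is relevant."""
    rank = (ranked_doc_ids.index(relevant_doc_id) + 1
            if relevant_doc_id in ranked_doc_ids else None)
    return {
        # With a single relevant item the ideal DCG is 1, so NDCG@10 = 1 / log2(rank + 1).
        "ndcg@10": 1.0 / math.log2(rank + 1) if rank and rank <= k else 0.0,
        "recall@5": 1.0 if rank and rank <= 5 else 0.0,
        "recall@10": 1.0 if rank and rank <= k else 0.0,
        "mrr": 1.0 / rank if rank else 0.0,          # reciprocal rank; mean over queries = MRR
        "precision@5": (1 if rank and rank <= 5 else 0) / 5,
    }

# Example: the source document comes back at rank 3.
print(known_item_metrics(["a.md", "b.md", "ctr_policy.md", "d.md"], "ctr_policy.md"))
# {'ndcg@10': 0.5, 'recall@5': 1.0, 'recall@10': 1.0, 'mrr': 0.333..., 'precision@5': 0.2}
```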

Corpus

109 Markdown files from a production credit union knowledge base. Indexed into 1,974 chunks (~500 tokens, 50-token overlap). YAML frontmatter for doc_type, status, tags. Dual-indexed: DuckDB BM25 FTS + LanceDB vectors (Voyage-3.5, 1024d).
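A minimal sketch of the chunking step (the tokenizer here is a whitespace stand-in and the document is synthetic; the real pipeline also writes each chunk to the DuckDB FTS index and the LanceDB vector table):

```python
def chunk_tokens(tokens, size=500, overlap=50):
    """Split a token list into fixed-size windows where neighbors share `overlap` tokens."""
    step = size - overlap
    chunks = []
    for start in range(0, max(len(tokens) - overlap, 1), step):
        window = tokens[start:start + size]
        if window:
            chunks.append(window)
    return chunks

doc_tokens = [f"tok{i}" for i in range(1600)]        # stand-in for one Markdown file's tokens
print([len(c) for c in chunk_tokens(doc_tokens)])    # [500, 500, 500, 250]
```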

Fusion Algorithms

RRF: Reciprocal Rank Fusion (Cormack et al., 2009).
Convex: Score-based weighted sum (Bruch et al., 2023).
RSF: Relative Score Fusion (MinMax normalization).
DBSN: Distribution-Based Score Normalization (Z-score).
CombMNZ: sum × list_count (Fox & Shaw, 1994).
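For completeness, minimal sketches of the score transforms behind the other three families (function names are illustrative): RSF normalizes each list with MinMax, DBSN normalizes with Z-scores, and CombMNZ multiplies the summed normalized scores by the number of lists that returned the document.

```python
import statistics

def minmax_norm(scores):
    """RSF-style MinMax normalization of one result list's scores."""
    lo, hi = min(scores.values()), max(scores.values())
    return {d: (s - lo) / (hi - lo) if hi > lo else 0.0 for d, s in scores.items()}

def zscore_norm(scores):
    """DBSN-style Z-score normalization of one result list's scores."""
    mean = statistics.mean(scores.values())
    std = statistics.pstdev(scores.values()) or 1.0
    return {d: (s - mean) / std for d, s in scores.items()}

def comb_mnz(score_lists):
    """CombMNZ: summed normalized scores multiplied by the number of lists containing the doc."""
    totals, counts = {}, {}
    for scores in score_lists:
        for doc, s in minmax_norm(scores).items():
            totals[doc] = totals.get(doc, 0.0) + s
            counts[doc] = counts.get(doc, 0) + 1
    return {doc: totals[doc] * counts[doc] for doc in totals}
```

References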

Cormack, G., Clarke, C., & Buettcher, S. (2009). Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods. SIGIR '09.

Bruch, S., Gai, S., & Ingber, A. (2023). An Analysis of Fusion Functions for Hybrid Retrieval. ACM Transactions on Information Systems.

Thakur, N., et al. (2021). BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models.

Muennighoff, N., et al. (2023). MTEB: Massive Text Embedding Benchmark.

Fox, E. & Shaw, J. (1994). Combination of Multiple Searches. NIST Special Publication 500-215.