If you are building a RAG pipeline, you have probably hit this wall: Hybrid Search is great, but the scores are meaningless.
You combine your dense vector search (Gemini/OpenAI embeddings) with your sparse keyword search (BM25) using Reciprocal Rank Fusion (RRF). It works beautifully to bubble up the best results. But when you try to filter out the noise, you're left staring at scores like 0.0163 or 0.0327.
Most tutorials tell you to “pick a threshold like 0.02.” But why? Why not 0.015? Why not 0.025?
We recently ran a deep dive into our own retrieval metrics to stop guessing and start calculating. Here is how we derived a mathematically justified threshold for RRF and solved the edge cases that break it.
## The Problem: The “Magic Number”
RRF is a rank-based formula. It doesn’t care about cosine similarity or term frequency; it only cares about order. Each method that retrieves a document contributes to its fused score:

RRF(d) = Σ 1 / (k + rank_method(d))

(The standard k is usually 60.)
Because the denominator is huge, the resulting scores are tiny.
- Rank 1 result: ~0.016
- Rank 10 result: ~0.014
When you merge two lists (Semantic + Keyword), the scores get summed. The problem is that a “mediocre” result from two sources looks mathematically similar to a “great” result from just one. We needed a way to distinguish Consensus from Noise.
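To see how compressed these fused scores are, here is a quick sketch of the per-method contribution (the helper name `rrf_contribution` is ours, not from the original system):

```python
K = 60  # standard RRF constant

def rrf_contribution(rank: int, k: int = K) -> float:
    """Score contribution of one retrieval method for a document at `rank` (1-based)."""
    return 1.0 / (k + rank)

print(round(rrf_contribution(1), 4))   # 0.0164 -- a rank-1 result
print(round(rrf_contribution(10), 4))  # 0.0143 -- a rank-10 result
```

Even a 100x difference in raw similarity collapses into a few thousandths of RRF score, which is exactly why eyeballing a threshold fails.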
## The Research: Decoding the Score
We analyzed query logs across three distinct categories: Specific (perfect matches), Vague (conceptual matches), and Garbage (keyboard smashing).
We found a distinct “Ceiling” in the garbage results:
| Query Type | Top RRF Score | Pattern |
| --- | --- | --- |
| Specific | ~0.032 | High confidence from both algorithms. |
| Garbage | ~0.016 | Maxed out at exactly 0.0164. |
### The “Noise Floor” (0.016)
Why did garbage queries hit a wall at 0.016? The math explains it.
If Method A (e.g., Vector) thinks a document is Rank #1, but Method B (Keyword) doesn’t find it at all:

score = 1/(60 + 1) + 0 ≈ 0.0164
A score of ~0.016 represents Single-Source Confidence. It means one algorithm liked it, but the other one didn’t care. In a Hybrid system, this is often a hallucination or a partial match.
### The “Consensus Floor” (0.025)
Now, look at what happens when both algorithms agree that a document is relevant (say, top 10 in both):

score ≥ 1/(60 + 10) + 1/(60 + 10) = 2/70 ≈ 0.0286

Even the weakest dual top-10 agreement clears 0.025 with margin to spare. This gave us our mathematically derived threshold.
- Score >= 0.025: The document appeared in the Top 10 of BOTH methods. We have Consensus.
- Score < 0.016: Neither method ranked it highly. Noise.
- The Middle (0.016 – 0.025): The “Danger Zone” where only one algorithm is confident.
By setting our threshold to 0.025, we aren’t just picking a number. We are enforcing a policy: “For a result to be shown, both the semantic model and the keyword model must independently agree it is top-tier.”
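That policy can be sketched as a tiny classifier (the function name and zone labels are ours; the boundaries come from the analysis above):

```python
CONSENSUS_FLOOR = 0.025  # roughly top-10 agreement from both methods
NOISE_CEILING = 0.016    # below this, neither method ranked it highly

def classify_rrf_score(score: float) -> str:
    """Bucket a fused RRF score into the three zones described above."""
    if score >= CONSENSUS_FLOOR:
        return "consensus"
    if score >= NOISE_CEILING:
        return "danger-zone"  # single-source confidence only
    return "noise"

print(classify_rrf_score(0.0327))  # consensus: rank 1 in both lists
print(classify_rrf_score(0.0164))  # danger-zone: rank 1 in only one list
print(classify_rrf_score(0.0143))  # noise: rank 10 in only one list
```

Note how a rank-1 hit from a single source lands in the danger zone, not in "consensus": one enthusiastic voter is not agreement.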
## The “Quantum Blockchain” Edge Case
There was one fatal flaw in our logic.
We ran the query: “quantum blockchain banana”.
- BM25: 0 results (Words didn’t exist in our docs).
- Vector: Found “nearest neighbors” with 0.53 similarity.
- RRF Score: ~0.016 (Single source).
Our new filter correctly blocked it. But wait: what if a user searches for a legitimate concept that shares no keywords with our docs?
If BM25 returns zero results, it’s a signal. It means there is zero lexical overlap. In this scenario, RRF breaks because it relies on two votes. When one voter is silent, the score naturally drops below our consensus threshold.
## The Solution: The Dual-Filter Strategy
We realized we needed different logic when lexical overlap is impossible.
- If Keywords Match: Use RRF with the Consensus Threshold (0.025). We demand agreement.
- If Keywords Fail: Fall back to Vector-only, but with a Strict Similarity Threshold (0.65).
If we can’t match your words, the meaning must be overwhelming for us to show a result.
## The Code
Here is the logic we implemented to sanitize our RAG inputs:
```python
MIN_RRF_SCORE = 0.025    # requires roughly top-10 ranking in BOTH methods
MIN_FAISS_SCORE = 0.65   # requires a strong semantic match if keywords fail

def smart_hybrid_search(query):
    bm25_results = get_sparse(query)   # sparse keyword retrieval (BM25)
    faiss_results = get_dense(query)   # dense vector retrieval

    if bm25_results:
        # Standard path: demand consensus
        merged = rrf_fusion(bm25_results, faiss_results)
        return [doc for doc in merged if doc.score >= MIN_RRF_SCORE]
    else:
        # Fallback path: strict semantics
        # "Zero keyword overlap? The meaning better be exact."
        return [doc for doc in faiss_results if doc.score >= MIN_FAISS_SCORE]
```
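The `rrf_fusion` helper isn't shown above; here is a minimal sketch of one way it could work, assuming each input is a best-first list of document ids and using a simple `ScoredDoc` container of our own:

```python
from dataclasses import dataclass

K = 60  # standard RRF constant

@dataclass
class ScoredDoc:
    doc_id: str
    score: float

def rrf_fusion(*ranked_lists, k=K):
    """Merge best-first ranked lists: each list adds 1/(k + rank) per document."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    merged = [ScoredDoc(doc_id, score) for doc_id, score in scores.items()]
    merged.sort(key=lambda doc: doc.score, reverse=True)
    return merged

# A document found by both lists outranks a rank-1 hit from a single list:
merged = rrf_fusion(["a", "b"], ["b", "c"])
print(merged[0].doc_id)  # b
```

In production you would fuse whatever result objects your retrievers return; the only requirement is a stable document identity to sum contributions across lists.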
## Conclusion
Stop treating hybrid search thresholds as magic numbers.
- 0.016 is the sound of one hand clapping (Single Source).
- 0.025 is the sound of applause (Consensus).
By aligning our thresholds with the mathematical reality of RRF, we turned our retrieval layer from a “best guess” engine into a precision instrument.