We Tested 3 Reranker Models on Live AI Sales Conversations. Here's What Actually Mattered.

When your AI sales agent gets a question like "How does your data security work?" — the quality of the answer depends entirely on what gets retrieved from the knowledge base.
Most retrieval systems use cosine similarity. Embed the query, embed the documents, rank by distance. It works. Until it doesn't.
Cosine measures semantic proximity. Not relevance. A document about "data encryption standards" might score lower than one about "data governance overview" — even though the first is exactly what the buyer asked about.
So we built a custom cross-encoder reranker. Retrieve 50 candidates by cosine, then rerank them with a model that reads the query and each candidate together. The question we wanted to answer: does a bigger reranker model actually make a difference?
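The two-stage setup can be sketched in a few lines of plain Python. This is illustrative only: `score_fn` stands in for whatever cross-encoder you load, and the embeddings are assumed to be precomputed.

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_then_rerank(query_vec, query_text, docs, embed_index,
                         score_fn, k_retrieve=50, k_final=5):
    """Stage 1: cheap cosine retrieval over precomputed embeddings.
    Stage 2: expensive cross-encoder scoring of (query, candidate) pairs."""
    candidates = sorted(
        docs,
        key=lambda d: cosine(query_vec, embed_index[d]),
        reverse=True,
    )[:k_retrieve]
    reranked = sorted(candidates, key=lambda d: score_fn(query_text, d),
                      reverse=True)
    return reranked[:k_final]
```

The key property: the cross-encoder only ever scores `k_retrieve` pairs per query, which keeps the expensive joint-encoding step affordable.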
Short answer: no. But we learned something more important along the way.
Three models, same training data
We compared three cross-encoder architectures, all fine-tuned on 5,256 training pairs from real production conversations:
- MiniLM-L6 — 22M parameters, 6 layers. The lightweight option.
- MiniLM-L12 — 33M parameters, 12 layers. The "maybe bigger is better" option.
- BGE-M3 — 568M parameters, 24 layers. The heavyweight.
Training pairs came from actual production sessions — queries paired with KB entries, labeled as relevant or irrelevant based on conversation quality scores.
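In outline, that labeling step looks something like this. The field names and threshold are hypothetical; the real quality-scoring pipeline is more involved.

```python
def build_pairs(turns, quality_threshold=0.7):
    """Turn logged (query, kb_entry, quality_score) triples into binary
    training pairs: entries from high-quality turns become positives,
    the rest negatives."""
    pairs = []
    for query, entry, score in turns:
        pairs.append((query, entry, 1 if score >= quality_threshold else 0))
    return pairs
```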
The binary metrics looked identical
After training, both MiniLM models scored nearly the same on standard eval metrics. L6 hit 95.38% accuracy. L12 hit 95.21%. F1 scores within noise.
If we'd stopped here, we might've concluded "model size doesn't matter" and moved on. But binary classification metrics don't tell you what matters most: which documents end up in the top 5.
Ranking metrics told a slightly different story
When we measured ranking quality (MRR, NDCG@10, Precision@5), L12 showed a small edge. NDCG went from 0.959 to 0.974. Precision@5 from 0.982 to 0.991.
Real but modest. The kind of improvement you'd struggle to notice in production.
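For reference, the three ranking metrics take only a few lines each in their binary-relevance form. This sketch assumes `ranked` is an ordered list of doc ids and `relevant` is a set of the relevant ids.

```python
import math

def precision_at_k(ranked, relevant, k=5):
    # Fraction of the top-k results that are relevant.
    return sum(1 for d in ranked[:k] if d in relevant) / k

def mrr(ranked, relevant):
    # Reciprocal rank of the first relevant result (0 if none found).
    for i, d in enumerate(ranked, start=1):
        if d in relevant:
            return 1.0 / i
    return 0.0

def ndcg_at_k(ranked, relevant, k=10):
    # Binary-relevance NDCG: log-discounted gain vs the ideal ordering.
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0
```

Unlike accuracy, all three are position-sensitive, which is why they can separate two models whose binary metrics look identical.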
But here's the thing that actually mattered: both rerankers massively outperformed cosine similarity alone. Each model surfaced about 6 relevant KB entries per query that pure cosine completely missed.
The story wasn't "L12 beats L6." It was "any reranker beats no reranker."
We ran both models on 10 live sessions
Benchmark metrics are one thing. We wanted to see what happens on real buyer conversations.
We ran both models (plus the cosine baseline) on the 10 most recent production sessions and generated side-by-side diffs.
The results:
- In 7 out of 10 sessions, the models surfaced different entries — not more, not fewer, just different
- Both models consistently found 4-8 entries per query that cosine missed entirely
- L12 did better on complex, multi-faceted security questions. L6 matched it on simple intent queries
The bigger model helped with nuanced queries. But for straightforward buyer questions — "What's your pricing?" or "How do I get started?" — both models (and cosine) got it right.
So we looked at what the community was saying
Before scaling up model size, we dug into what practitioners had learned about cross-encoder training. Three findings changed our approach:
Cross-encoders overfit fast on small datasets. Our 96% eval accuracy after 3 epochs on 5K pairs was suspiciously high. The sentence-transformers docs explicitly warn about this.
Hard-negatives-only training can backfire. Our training data used cosine-retrieved negatives — all "hard" negatives from the same org's KB. The community recommends mixing in random negatives (completely unrelated entries). Without them, the model becomes too strict and filters out genuinely relevant content.
The real lever is training data, not model size. With 5K pairs where all negatives are hard, a bigger model simply can't differentiate itself. The bottleneck was data quality, not architecture.
V2: mixed negatives (small improvement)
We retrained with 75% hard negatives and 25% random cross-org negatives. Same MiniLM-L6 architecture. Reduced from 3 epochs to 2.
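The sampling itself is simple. A sketch of the 75/25 mix, assuming the hard and random negative pools are already collected (the function name and seed handling are illustrative):

```python
import random

def mix_negatives(hard_negs, random_negs, n_total, hard_frac=0.75, seed=0):
    """Sample a negative set that is hard_frac hard negatives (same-org,
    cosine-retrieved) and the remainder random cross-org negatives."""
    rng = random.Random(seed)
    n_hard = int(n_total * hard_frac)
    n_rand = n_total - n_hard
    return rng.sample(hard_negs, n_hard) + rng.sample(random_negs, n_rand)
```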
The result? Minimal practical difference. In 4 out of 10 sessions, identical results. In 5 sessions, one different entry. The mixed negatives helped calibration but 1,460 cross-org pairs weren't enough to move the needle.
V3: 315K pairs changed everything
This is where it got interesting.
We rewrote the training data pipeline. Instead of 5 orgs with 50 turns each, we pulled from 50 orgs with 500 turns each. Batch embeddings (16 per API call instead of one-by-one). Pair mix: 64% hard negatives, 18% cross-org random negatives, 18% positives. After deduplication: 315,940 training pairs.
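The batching and deduplication steps are the unglamorous part of that rewrite. A sketch, with the batch size of 16 taken from the numbers above and everything else illustrative:

```python
def batched(items, batch_size=16):
    # Yield fixed-size chunks so each embedding API call covers many texts
    # instead of one (the last chunk may be shorter).
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

def dedupe_pairs(pairs):
    # Drop exact duplicate (query, entry, label) triples, keeping the
    # first occurrence and preserving order.
    seen = set()
    out = []
    for p in pairs:
        if p not in seen:
            seen.add(p)
            out.append(p)
    return out
```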
Cost: less than $1 in embedding API calls. About 15 minutes of runtime.
The model trained in 62 minutes on an A10G GPU. On CPU, that would've been 40+ hours.
V3 results on live sessions
We ran the production model (V1, trained on 5K pairs) against V3 (trained on 315K pairs) on 10 recent live sessions:
- 34% more relevant KB entries surfaced — 47 unique entries vs V1's 35
- 7 out of 10 sessions: V3 found entries that V1 completely missed
- Average of 4.1 V3-only entries per session
Same architecture. Same 22M parameter MiniLM-L6. Same latency (~110ms on GPU). The only difference was training data.
The model learned from 50 orgs' worth of KB diversity. It got better at distinguishing "relevant to this specific question" from "topically related but not helpful" — exactly what cross-org random negatives teach.
What this means for AI conversation quality
When your AI agent handles a buyer conversation, the quality ceiling is set by retrieval. The best language model in the world can't give a good answer if the right KB entry never makes it into context.
Better reranking means:
- More accurate answers to complex buyer questions about security, compliance, and integration
- Fewer hallucinations because the model has the right source material
- Better buyer experience because the intelligent front door actually knows what it's talking about
This isn't a theoretical improvement. It's the difference between an AI agent that surfaces a generic product overview and one that pulls the exact security whitepaper paragraph the buyer needs.
Plot twist: we tested Cohere Rerank 3.5 and switched
After all of that — five model versions, 315K training pairs, a clear production winner — we decided to benchmark against a managed reranker. Specifically, Cohere Rerank 3.5 via AWS Bedrock.
We ran a blinded A/B evaluation: 100 real queries sampled across all active orgs from the past 7 days. Both rerankers scored 15 KB entries per query. An LLM judge (Claude Sonnet on Bedrock) compared the top-10 results without knowing which model produced them.
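The blinding logic is the part worth getting right: randomize which side each reranker's results land on, so the judge can't develop a positional preference. A sketch where `judge` is any callable returning "left", "right", or "tie" (the real one calls Claude on Bedrock; the names here are illustrative):

```python
import random

def blinded_compare(queries, rerank_a, rerank_b, judge, seed=0):
    """Blinded A/B eval: for each query, randomly assign each reranker's
    results to a side before judging, then map verdicts back to models."""
    rng = random.Random(seed)
    tally = {"a": 0, "b": 0, "tie": 0}
    for q in queries:
        left, right = rerank_a(q), rerank_b(q)
        flipped = rng.random() < 0.5
        if flipped:
            left, right = right, left  # judge never sees model identities
        verdict = judge(q, left, right)
        if verdict == "tie":
            tally["tie"] += 1
        elif (verdict == "left") != flipped:
            tally["a"] += 1  # un-flip the verdict back to the real model
        else:
            tally["b"] += 1
    return tally
```

Without the flip, a judge that slightly favors whichever list it reads first would bias every comparison the same way.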
The results were decisive:
- Cohere won 44% of comparisons. Our custom ONNX model won 21%. The remaining 35% were ties.
- When they disagreed, Cohere was right 68% of the time
- 81% of their top-10 entries overlapped — they largely agreed, but Cohere made better choices on the 2 entries that differed
And the latency gap was even more striking. Our custom ONNX model on Lambda: ~2,700ms. Cohere on Bedrock: ~250ms. Over 10x faster.
Cost? About $100/month at our current volume. Worth it.
We switched production to Cohere Rerank 3.5. Our custom model is preserved for future use, but the combination of better quality, dramatically lower latency, and zero infrastructure maintenance made the managed option the clear winner.
Sometimes the best engineering decision is knowing when to stop building and start buying.
Six things we learned
1. Data quality beats model size. Scaling from 5K to 315K pairs with proper negative mixing produced a bigger improvement than doubling model parameters. The architecture was never the bottleneck.
2. Mixed negatives are essential. Hard-negatives-only training makes the model too strict. Cross-org random negatives teach basic topicality and prevent the model from filtering out relevant content.
3. There's a minimum data threshold. 50K pairs actually performed worse than 5K — the model saw the mixed distribution but didn't have enough examples to learn it. 315K crossed the threshold.
4. Binary eval metrics hide real differences. Two models with similar accuracy scores surfaced meaningfully different entries on real queries. Always validate with live session diffs.
5. GPU training enables iteration. 315K pairs trained in 62 minutes on GPU vs 40+ hours on CPU. The entire experiment — five model versions, multiple comparisons — cost about $3 in compute.
6. Benchmark against managed alternatives before shipping. We spent weeks optimizing a custom reranker only to find that Cohere Rerank 3.5 outperformed it in a blinded eval — at 10x lower latency and zero maintenance cost. The custom work wasn't wasted (it taught us what good reranking looks like), but the build-vs-buy evaluation should happen earlier.
Retrieval quality is the invisible foundation of every AI conversation. Most teams obsess over prompt engineering and model selection. Few invest in what actually determines whether the right information makes it into context.
We did — through five model versions, 315K training pairs, and a blinded evaluation against a managed alternative. The journey taught us as much as the destination: data quality matters more than model size, and knowing when to buy beats building everything yourself.
If you're curious how Salespeak handles real buyer conversations — from the first question to qualified handoff — see it in action.
