Frequently Asked Questions

Technical Insights & Reranker Experiments

What was the main goal of Salespeak's reranker model experiment?

The main goal was to determine whether increasing the size of a reranker model would improve the quality of answers provided by Salespeak's AI sales agent during live buyer conversations. The team compared different cross-encoder architectures to see if model size or another factor had the greatest impact on retrieval quality. (source, March 30, 2026)

Which reranker models did Salespeak test in its experiment?

Salespeak tested three cross-encoder architectures: MiniLM-L6 (22M parameters, 6 layers), MiniLM-L12 (33M parameters, 12 layers), and BGE-M3 (568M parameters, 24 layers). All models were fine-tuned on 5,256 training pairs from real production conversations. (source)

Did increasing the reranker model size improve answer quality?

No, increasing the reranker model size did not significantly improve answer quality. Both MiniLM-L6 and MiniLM-L12 performed similarly on standard evaluation metrics, and the most important improvements came from data quality and diversity, not model size. (source)

What was the key factor that improved retrieval quality in Salespeak's experiments?

The key factor was data quality and diversity. Scaling the training data from 5,000 to 315,000 pairs with proper negative mixing produced a much larger improvement in retrieval quality than increasing model size. (source)

How did Salespeak generate over 300,000 training pairs for less than $1?

Salespeak rewrote its training data pipeline to pull from its full customer base, batching embeddings (16 per API call) and using a pair mix of 64% hard negatives, 18% cross-org random negatives, and 18% positives. After deduplication, this resulted in 315,940 training pairs at a cost of less than $1 in embedding API calls. (source)

What performance improvements did Salespeak see after scaling its training data?

After scaling to 315,000 training pairs, Salespeak's reranker surfaced 34% more relevant knowledge base entries in live sessions compared to the previous version trained on 5,000 pairs. In 7 out of 10 sessions, the new model found entries the old model missed, with an average of 4.1 entries per session that only the new model surfaced. (source)

Why did Salespeak switch to a managed reranker after building its own?

Salespeak switched to Cohere Rerank 3.5 after a blinded A/B evaluation showed that the managed service outperformed their custom model in quality (winning 44% of comparisons vs. 21%), was over 10x faster (~250ms vs. ~2,700ms latency), and required zero infrastructure maintenance. (source)

What are the six key lessons Salespeak learned from its reranker experiments?

1. Data quality beats model size. 2. Mixed negatives are essential. 3. There is a minimum data threshold for effective training. 4. Binary evaluation metrics can hide real differences. 5. GPU training enables rapid iteration. 6. Benchmark against managed alternatives before shipping. (source)

How did Salespeak validate reranker model performance in real-world scenarios?

Salespeak validated reranker performance by running models on live buyer conversations and comparing which knowledge base entries were surfaced. They found that rerankers consistently found relevant entries that cosine similarity missed, and that live session diffs revealed meaningful differences not captured by binary metrics. (source)

What is the impact of better reranking on AI sales conversations?

Better reranking leads to more accurate answers for complex buyer questions, fewer hallucinations, and an improved buyer experience. The AI agent is more likely to surface the exact information buyers need, such as specific security documentation, rather than generic overviews. (source)

What is the recommended pilot experiment before building a custom reranker?

Salespeak recommends pulling 500 production queries, hand-labeling relevant and irrelevant documents, and running three retrieval passes (cosine-only, managed reranker, open-source cross-encoder). Compute nDCG and Recall@5 for each. If the managed reranker closes most of the gap, buy it; otherwise, consider fine-tuning or addressing retrieval issues. (source)

Where can I find the full methodology and dataset construction for Salespeak's reranker experiments?

You can find the full methodology, dataset construction, and hard-negative mining details in Salespeak's original reranker experiment post: Read the full write-up.

What is the key takeaway from Salespeak's reranker experiments regarding model size and data quality?

The key takeaway is that data quality matters more than model size. Even a small model with high-quality, diverse data can outperform a larger model with less effective data. Latency and managed service performance are also critical factors. (source)

Did Salespeak find a minimum data threshold for effective reranker training?

Yes, Salespeak found that a model trained on 50,000 pairs performed worse than one trained on 5,000 pairs due to insufficient examples for learning the mixed distribution. Performance improved significantly only after scaling to 315,000 pairs. (source)

How did Salespeak's reranker models perform on simple versus complex queries?

Both MiniLM-L6 and MiniLM-L12 performed well on simple intent queries, but the larger L12 model did better on complex, multi-faceted security questions. However, the main improvement came from using any reranker over cosine similarity alone. (source)

What is the cost and latency difference between Salespeak's custom reranker and Cohere Rerank 3.5?

Cohere Rerank 3.5 costs about $100/month at current volume and has a latency of ~250ms, compared to ~2,700ms for Salespeak's custom ONNX model. The managed service is over 10 times faster and requires zero infrastructure maintenance. (source)

How does Salespeak recommend evaluating build vs. buy for reranker models?

Salespeak recommends benchmarking custom models against managed alternatives early in the process. If a managed service like Cohere Rerank 3.5 closes most of the gap in quality and latency, it is often more cost-effective to buy rather than build and maintain a custom solution. (source)

Where can I read more about Salespeak's reranker experiment and results?

You can read the full write-up, including methodology, results, and key lessons, in Salespeak's blog post: We Tested 3 Reranker Models on Live AI Sales Conversations. Here's What Actually Mattered.

Product Features & Capabilities

What is Salespeak.ai and what does it do?

Salespeak.ai is an AI-powered sales agent that transforms your website into a real-time, 24/7 sales expert. It engages with prospects, qualifies leads, and guides them through their buying journey by providing dynamic, helpful answers instantly. It integrates with your CRM and learns from previous conversations to continuously improve. (source)

What are the key features of Salespeak.ai?

Key features include 24/7 customer engagement, expert-level conversations, seamless CRM integration, actionable insights from buyer interactions, quick setup (under an hour), and intelligent lead qualification. (source)

Does Salespeak.ai support integration with other systems?

Yes, Salespeak.ai supports integration with CRM systems such as Salesforce, Pardot, and HubSpot. It also offers a webhook for custom integration with downstream systems. (source)

How quickly can Salespeak.ai be implemented?

Salespeak.ai can be fully implemented in under an hour. Onboarding takes just 3-5 minutes, with no coding required. Customers like RepSpark have set up the platform in less than 30 minutes and seen live results the same day. (source)

What security and compliance certifications does Salespeak have?

Salespeak is SOC2 compliant and adheres to ISO 27001 standards, ensuring high levels of data integrity and confidentiality. (source)

How does Salespeak.ai ensure a better buyer experience compared to traditional chatbots?

Salespeak.ai provides intelligent, personalized conversations trained on your content, rather than generic scripted responses. It adapts in real time, delivers expert-level answers, and aligns the sales process with the modern buyer's journey, resulting in higher engagement and satisfaction. (source)

What actionable insights does Salespeak.ai provide?

Salespeak.ai generates actionable intelligence from buyer interactions, helping businesses understand buyer needs, optimize sales strategies, and improve conversion rates. (source)

What is the primary purpose of Salespeak.ai?

The primary purpose of Salespeak.ai is to transform the B2B sales process by acting as an AI brain and buddy that provides custom engagement and delight, ensuring businesses meet buyers with intelligence everywhere and accurately represent their brand in AI responses. (source)

How does Salespeak.ai continuously improve its performance?

Salespeak.ai learns from previous conversations, continuously updating its AI to provide more accurate and relevant answers over time. This ensures ongoing improvement in customer interactions and insights. (source)

Use Cases & Customer Success

Who can benefit from using Salespeak.ai?

Salespeak.ai is designed for mid-to-large B2B enterprises, especially SaaS, AI, or technical product companies with high inbound traffic but low conversion rates. It is particularly valuable for CMOs, demand generation leaders, and RevOps leaders seeking to scale pipeline and improve conversion. (source)

What measurable results have customers achieved with Salespeak.ai?

Customers have seen a 40% average increase in close rates, a 17% average increase in ticket price, and a 3.2x increase in qualified demos in 30 days. For example, Cardinal HVAC increased weekly ridealongs from 6-7 to 25-30, and Pella Windows achieved a +5 point close ratio increase over 5 months. (source)

Can you share specific case studies of Salespeak.ai in action?

Yes, case studies include RepSpark, which set up Salespeak.ai in less than 30 minutes and saw live results the same day, and Faros AI, which used Salespeak to turn LLM traffic into measurable growth. (RepSpark, Faros AI)

What feedback have customers given about Salespeak.ai's ease of use?

Customers like Tim McLain reported being able to set up Salespeak.ai and see results without a demo or onboarding call. RepSpark set up the platform in under 30 minutes and saw live results the same day. Onboarding typically takes just 3-5 minutes with no coding required. (source)

What types of pain points does Salespeak.ai address for businesses?

Salespeak.ai addresses pain points such as lack of 24/7 customer interaction, misalignment with buyer needs, inefficient lead qualification, complex implementation, poor user experience with forms/chatbots, and pricing concerns. (source)

How does Salespeak.ai help improve pipeline quality?

Salespeak.ai helps improve pipeline quality by qualifying leads more effectively. For example, a SaaS company found that prospects asking about integrations converted at a rate 4x higher than those asking about pricing, doubling their pipeline quality. (source)

How does Salespeak.ai support inbound activity on websites?

Salespeak.ai ensures 100% coverage of all leads entering a website, increasing conversion rates to free trials, demos, or deeper sales engagements. It is designed to maximize inbound activity and conversion. (source)

Where can I read more blog posts and technical articles from Salespeak?

You can access Salespeak's blog for more insights and technical articles at https://salespeak.ai/blog.

We Tested 3 Reranker Models on Live AI Sales Conversations. Here's What Actually Mattered.

Lior Mechlovich
6 min read
March 30, 2026

When your AI sales agent gets a question like "How does your data security work?" — the quality of the answer depends entirely on what gets retrieved from the knowledge base.

Most retrieval systems use cosine similarity. Embed the query, embed the documents, rank by distance. It works. Until it doesn't.

Cosine measures semantic proximity. Not relevance. A document about "data encryption standards" might score lower than one about "data governance overview" — even though the first is exactly what the buyer asked about.

So we built a custom cross-encoder reranker. Retrieve 50 candidates by cosine, then rerank them with a model that reads the query and each candidate together. The question we wanted to answer: does a bigger reranker model actually make a difference?
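The two-stage flow described above can be sketched roughly as follows. This is an illustration only, not the production code: `retrieve_then_rerank` and `score_pair` are hypothetical names, and `score_pair` stands in for whatever cross-encoder scores a (query, document) pair.

```python
import math
from typing import Callable, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_then_rerank(
    query: str,
    query_vec: Sequence[float],
    docs: Sequence[tuple[str, Sequence[float]]],  # (text, embedding) pairs
    score_pair: Callable[[str, str], float],      # hypothetical cross-encoder scorer
    n_candidates: int = 50,
    top_k: int = 5,
) -> list[str]:
    # Stage 1: cheap vector search keeps the n_candidates nearest documents.
    by_cosine = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    candidates = [text for text, _ in by_cosine[:n_candidates]]
    # Stage 2: the reranker scores the query and each candidate jointly.
    reranked = sorted(candidates, key=lambda text: score_pair(query, text), reverse=True)
    return reranked[:top_k]
```

The first stage only has to get the right entry somewhere into the candidate pool; the second stage decides the final order.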

Short answer: no. But we learned something more important along the way.

Three models, same training data

We compared three cross-encoder architectures, all fine-tuned on 5,256 training pairs from real production conversations:

  • MiniLM-L6 — 22M parameters, 6 layers. The lightweight option.
  • MiniLM-L12 — 33M parameters, 12 layers. The "maybe bigger is better" option.
  • BGE-M3 — 568M parameters, 24 layers. The heavyweight.

Training pairs came from actual production sessions — queries paired with KB entries, labeled as relevant or irrelevant based on conversation quality scores.

The binary metrics looked identical

After training, both MiniLM models scored nearly the same on standard eval metrics. L6 hit 95.38% accuracy. L12 hit 95.21%. F1 scores within noise.

If we'd stopped here, we might've concluded "model size doesn't matter" and moved on. But binary classification metrics don't tell you what matters most: which documents end up in the top 5.
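A toy illustration of why that happens, using made-up scores rather than anything from the experiment: two models can classify every pair identically at a 0.5 threshold and still put different documents at the top.

```python
def accuracy(scores, labels, threshold=0.5):
    """Fraction of pairs whose thresholded score matches the binary label."""
    hits = sum((s >= threshold) == bool(l) for s, l in zip(scores, labels))
    return hits / len(labels)


def top_k(scores, k):
    """Indices of the k highest-scoring documents, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Hypothetical scores for six candidate documents (1 = relevant, 0 = not).
labels = [1, 1, 0, 1, 0, 0]
model_a = [0.95, 0.90, 0.40, 0.60, 0.20, 0.10]
model_b = [0.60, 0.55, 0.40, 0.95, 0.20, 0.10]

same_accuracy = accuracy(model_a, labels) == accuracy(model_b, labels)
different_top_2 = top_k(model_a, 2) != top_k(model_b, 2)
```

Both models get every label right here, yet their top two documents differ, which is exactly what a ranking metric or a side-by-side diff would pick up.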

Ranking metrics told a slightly different story

When we measured ranking quality (MRR, NDCG@10, Precision@5), L12 showed a small edge over L6. NDCG@10 went from 0.959 to 0.974. Precision@5 from 0.982 to 0.991.

Real but modest. The kind of improvement you'd struggle to notice in production.
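For reference, here is a minimal sketch of the three ranking metrics, assuming binary relevance labels listed in ranked order (best-scored document first). One simplifying assumption: the ideal ordering for NDCG is computed from the labels in the ranked list itself, not from every relevant entry in the KB.

```python
import math
from typing import Sequence


def reciprocal_rank(labels: Sequence[int]) -> float:
    """1 / rank of the first relevant item, or 0.0 if nothing relevant was ranked."""
    for rank, label in enumerate(labels, start=1):
        if label:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(per_query_labels: Sequence[Sequence[int]]) -> float:
    """MRR: the average reciprocal rank over all queries."""
    return sum(reciprocal_rank(l) for l in per_query_labels) / len(per_query_labels)


def precision_at_k(labels: Sequence[int], k: int = 5) -> float:
    """Share of the top k positions filled by relevant items."""
    return sum(labels[:k]) / k


def ndcg_at_k(labels: Sequence[int], k: int = 10) -> float:
    """Discounted gain of the ranking, normalized by the best possible ordering."""
    def dcg(ordered: Sequence[int]) -> float:
        return sum(l / math.log2(i + 2) for i, l in enumerate(ordered[:k]))

    ideal = dcg(sorted(labels, reverse=True))
    return dcg(labels) / ideal if ideal else 0.0
```

Unlike accuracy, all three depend on where the relevant entries land in the list, not just on whether each one clears a threshold.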

But here's the thing that actually mattered: both rerankers massively outperformed cosine similarity alone. Each model surfaced about 6 relevant KB entries per query that pure cosine completely missed.

The story wasn't "L12 beats L6." It was "any reranker beats no reranker."

We ran both models on 10 live sessions

Benchmark metrics are one thing. We wanted to see what happens on real buyer conversations.

We ran both models (plus the cosine baseline) on the 10 most recent production sessions and generated side-by-side diffs.

The results:

  • In 7 out of 10 sessions, the models surfaced different entries — not more, not fewer, just different
  • Both models consistently found 4-8 entries per query that cosine missed entirely
  • L12 did better on complex, multi-faceted security questions. L6 matched it on simple intent queries

The bigger model helped with nuanced queries. But for straightforward buyer questions — "What's your pricing?" or "How do I get started?" — both models (and cosine) got it right.
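A side-by-side diff of this kind needs very little code. The sketch below assumes each system returns a ranked list of KB entry ids for the same query; `diff_surfaced` and the ids in the usage example are hypothetical.

```python
def diff_surfaced(baseline: list[str], candidate: list[str]) -> dict[str, list[str]]:
    """Compare two ranked lists of KB entry ids returned for the same query."""
    base, cand = set(baseline), set(candidate)
    return {
        "shared": [e for e in candidate if e in base],
        "only_candidate": [e for e in candidate if e not in base],
        "only_baseline": [e for e in baseline if e not in cand],
    }


# Usage with made-up entry ids: cosine top results vs. reranked top results.
report = diff_surfaced(
    baseline=["kb-governance", "kb-overview", "kb-pricing"],
    candidate=["kb-encryption", "kb-governance", "kb-soc2"],
)
```

Reading the `only_candidate` entries by hand is the point: a person decides whether the entries one system found and the other missed were actually worth surfacing.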

So we looked at what the community was saying

Before scaling up model size, we dug into what practitioners had learned about cross-encoder training. Three findings changed our approach:

Cross-encoders overfit fast on small datasets. Our roughly 95% eval accuracy after 3 epochs on 5K pairs was suspiciously high. The sentence-transformers docs explicitly warn about this.

Hard-negatives-only training can backfire. Our training data used cosine-retrieved negatives — all "hard" negatives from the same org's KB. The community recommends mixing in random negatives (completely unrelated entries). Without them, the model becomes too strict and filters out genuinely relevant content.

The real lever is training data, not model size. With 5K pairs where all negatives are hard, a bigger model simply can't differentiate itself. The bottleneck was data quality, not architecture.

V2: mixed negatives (small improvement)

We retrained with 75% hard negatives and 25% random cross-org negatives. Same MiniLM-L6 architecture. Reduced from 3 epochs to 2.

The result? Minimal practical difference. In 4 out of 10 sessions, identical results. In 5 sessions, one different entry. The mixed negatives helped calibration, but 1,460 cross-org pairs weren't enough to move the needle.
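The 75/25 split can be sketched as a small sampling helper. Everything here is illustrative: `build_negative_pairs` is a hypothetical name, `hard_pool` stands for cosine-retrieved entries from the same org, and `random_pool` for entries drawn from other orgs.

```python
import random


def build_negative_pairs(
    query: str,
    hard_pool: list[str],     # same-org entries retrieved by cosine but not relevant
    random_pool: list[str],   # unrelated entries from other orgs
    n_total: int,
    hard_fraction: float = 0.75,
    seed: int = 0,
) -> list[tuple[str, str, int]]:
    """Sample negatives at the requested hard/random ratio, labeled 0."""
    rng = random.Random(seed)
    n_hard = min(round(n_total * hard_fraction), len(hard_pool))
    n_random = min(n_total - n_hard, len(random_pool))
    chosen = rng.sample(hard_pool, n_hard) + rng.sample(random_pool, n_random)
    return [(query, entry, 0) for entry in chosen]
```

If either pool is too small, the helper returns fewer pairs rather than repeating entries, so the realized ratio is worth checking on real data.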

V3: 315K pairs changed everything

This is where it got interesting.

We rewrote the training data pipeline. Instead of a handful of orgs with 50 turns each, we pulled from our full customer base, with hundreds of turns per org. We batched embedding requests (16 texts per API call instead of one at a time). The pair mix was 64% hard negatives, 18% cross-org random negatives, and 18% positives. After deduplication: 315,940 training pairs.

Cost: less than $1 in embedding API calls. About 15 minutes of runtime.
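Batching and deduplication are the two mechanical pieces of that pipeline, and both are simple to sketch. The names below are hypothetical, `embed_batch` stands in for whichever embedding API is used, and the dedup key (lowercased, whitespace-stripped query and entry text) is an assumption, not the rule used in the experiment.

```python
from typing import Callable, Iterator, Sequence

Vector = list[float]
Pair = tuple[str, str, int]  # (query, KB entry, label)


def batched(items: Sequence[str], size: int = 16) -> Iterator[Sequence[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], list[Vector]],  # hypothetical API wrapper
    size: int = 16,
) -> list[Vector]:
    """Embed every text using one API call per chunk instead of one per text."""
    vectors: list[Vector] = []
    for chunk in batched(texts, size):
        vectors.extend(embed_batch(chunk))
    return vectors


def dedupe_pairs(pairs: Sequence[Pair]) -> list[Pair]:
    """Keep the first occurrence of each (query, entry) pair, ignoring case and padding."""
    seen: set[tuple[str, str]] = set()
    unique: list[Pair] = []
    for query, entry, label in pairs:
        key = (query.strip().lower(), entry.strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append((query, entry, label))
    return unique
```

With a batch size of 16, 40 texts take 3 calls instead of 40, which is where most of the runtime saving comes from.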

The model trained in 62 minutes on an A10G GPU. On CPU, that would've been 40+ hours.

V3 results on live sessions

We ran the production model (V1, trained on 5K pairs) against V3 (trained on 315K pairs) on 10 recent live sessions:

  • 34% more relevant KB entries surfaced — 47 unique entries vs V1's 35
  • 7 out of 10 sessions: V3 found entries that V1 completely missed
  • Average of 4.1 V3-only entries per session

Same architecture. Same 22M parameter MiniLM-L6. Same latency (~110ms on GPU). The only difference was training data.

The model learned from dozens of orgs' worth of KB diversity. It got better at distinguishing "relevant to this specific question" from "topically related but not helpful" — exactly what cross-org random negatives teach.

What this means for AI conversation quality

When your AI agent handles a buyer conversation, the quality ceiling is set by retrieval. The best language model in the world can't give a good answer if the right KB entry never makes it into context.

Better reranking means:

  • More accurate answers to complex buyer questions about security, compliance, and integration
  • Fewer hallucinations because the model has the right source material
  • Better buyer experience because the intelligent front door actually knows what it's talking about

This isn't a theoretical improvement. It's the difference between an AI agent that surfaces a generic product overview and one that pulls the exact security whitepaper paragraph the buyer needs.

Plot twist: we tested Cohere Rerank 3.5 and switched

After all of that — five model versions, 315K training pairs, a clear production winner — we decided to benchmark against a managed reranker. Specifically, Cohere Rerank 3.5 via AWS Bedrock.

We ran a blinded A/B evaluation: 100 real queries sampled across all active orgs from the past 7 days. Both rerankers scored 15 KB entries per query. An LLM judge (Claude Sonnet on Bedrock) compared the top-10 results without knowing which model produced them.

The results were decisive:

  • Cohere won 44% of comparisons. Our custom ONNX model won 21%. The remaining 35% were ties.
  • When they disagreed, Cohere was right 68% of the time
  • 81% of their top-10 entries overlapped: they largely agreed, but Cohere made better choices on the roughly 2 entries per query that differed
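A blinded pairwise comparison like this one can be sketched as below. It is a sketch under assumptions, not the evaluation harness itself: `rank_a`, `rank_b`, and `judge` are hypothetical callables, and the judge is assumed to answer "first", "second", or "tie" without being told which system produced which list.

```python
import random
from collections import Counter
from typing import Callable, Sequence

Ranker = Callable[[str], Sequence[str]]
Judge = Callable[[str, Sequence[str], Sequence[str]], str]  # "first" | "second" | "tie"


def blinded_compare(
    queries: Sequence[str], rank_a: Ranker, rank_b: Ranker, judge: Judge, seed: int = 0
) -> Counter:
    """Tally wins for systems a and b, randomizing presentation order per query."""
    rng = random.Random(seed)
    tally: Counter = Counter()
    for query in queries:
        a, b = rank_a(query), rank_b(query)
        swapped = rng.random() < 0.5          # hide which system comes first
        first, second = (b, a) if swapped else (a, b)
        verdict = judge(query, first, second)
        if verdict == "tie":
            tally["tie"] += 1
        elif (verdict == "first") != swapped:  # map the verdict back to its system
            tally["a"] += 1
        else:
            tally["b"] += 1
    return tally


def win_rate_when_decided(wins: int, losses: int) -> float:
    """Share of non-tied comparisons won."""
    decided = wins + losses
    return wins / decided if decided else 0.0
```

As a sanity check on the numbers above, 44 wins against 21 losses gives `win_rate_when_decided(44, 21)` of about 0.68, which is consistent with the 68% figure for the comparisons where the two rerankers disagreed.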

And the latency gap was even more striking. Our custom ONNX model on Lambda: ~2,700ms. Cohere on Bedrock: ~250ms. Over 10x faster.

Cost? About $100/month at our current volume. Worth it.

We switched production to Cohere Rerank 3.5. Our custom model is preserved for future use, but the combination of better quality, dramatically lower latency, and zero infrastructure maintenance made the managed option the clear winner.

Sometimes the best engineering decision is knowing when to stop building and start buying.

Six things we learned

1. Data quality beats model size. Scaling from 5K to 315K pairs with proper negative mixing produced a bigger improvement than stepping up from MiniLM-L6 to MiniLM-L12 (22M to 33M parameters, twice the layers). The architecture was never the bottleneck.

2. Mixed negatives are essential. Hard-negatives-only training makes the model too strict. Cross-org random negatives teach basic topicality and prevent the model from filtering out relevant content.

3. There's a minimum data threshold. 50K pairs actually performed worse than 5K — the model saw the mixed distribution but didn't have enough examples to learn it. 315K crossed the threshold.

4. Binary eval metrics hide real differences. Two models with similar accuracy scores surfaced meaningfully different entries on real queries. Always validate with live session diffs.

5. GPU training enables iteration. 315K pairs trained in 62 minutes on GPU vs 40+ hours on CPU. The entire experiment — five model versions, multiple comparisons — cost about $3 in compute.

6. Benchmark against managed alternatives before shipping. We spent weeks optimizing a custom reranker only to find that Cohere Rerank 3.5 outperformed it in a blinded eval — at 10x lower latency and zero maintenance cost. The custom work wasn't wasted (it taught us what good reranking looks like), but the build-vs-buy evaluation should happen earlier.


Retrieval quality is the invisible foundation of every AI conversation. Most teams obsess over prompt engineering and model selection. Few invest in what actually determines whether the right information makes it into context.

We did — through five model versions, 315K training pairs, and a blinded evaluation against a managed alternative. The journey taught us as much as the destination: data quality matters more than model size, and knowing when to buy beats building everything yourself.

If you're curious how Salespeak handles real buyer conversations — from the first question to qualified handoff — see it in action.