Cohere Rerank 3.5 vs Custom Reranker: When to Build, When to Buy


Salespeak Team
9 min read
April 23, 2026

We spent four months training custom cross-encoder rerankers on real AI sales conversations. We scaled our training data from 5,000 to 315,000 pairs. We tested three architectures. Then we put the best version of our own model head-to-head against Cohere Rerank 3.5 in a blinded A/B.

Cohere won 44% of the comparisons with 10x lower latency.

That was not the result we hoped for after four months of work, and it rearranged how we think about build-versus-buy on retrieval. The full experiment is written up in Reranker Experiment: Data Beats Model Size. This post is the part that experiment implied but never spelled out: when does it actually make sense to train your own reranker, and when should you just pay someone for theirs?

Short answer: five questions decide it. None of them are "how smart is your team."

The question nobody asks until they've already spent six months training

Teams reach for a custom reranker for a defensible reason. Cosine similarity is not relevance. A retriever that ranks purely by embedding distance will keep surfacing documents that are topically adjacent but factually wrong. Putting a cross-encoder reranker on top of a bi-encoder retriever is one of the highest-ROI moves in modern RAG, and if you have not done it yet, do it.
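If you have not layered a reranker yet, the pattern is small enough to sketch. Here is a minimal version using sentence-transformers, with a bi-encoder doing cheap first-pass retrieval and a cross-encoder rescoring the candidates; the model names and toy corpus are illustrative defaults, not our production stack:

    # Minimal two-stage retrieval: bi-encoder for recall, cross-encoder for precision.
    from sentence_transformers import SentenceTransformer, CrossEncoder, util

    retriever = SentenceTransformer("all-MiniLM-L6-v2")               # fast, recall-oriented
    reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")   # slower, precision-oriented

    corpus = [
        "We support EU data residency via regional hosting.",
        "Pricing starts at $49 per seat per month.",
        "Data sovereignty overview for enterprise customers.",
    ]
    query = "how do you handle EU data residency"

    # Stage 1: rank the corpus by embedding similarity and keep the top candidates.
    corpus_emb = retriever.encode(corpus, convert_to_tensor=True)
    query_emb = retriever.encode(query, convert_to_tensor=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=3)[0]

    # Stage 2: rescore the candidates with the cross-encoder and re-sort.
    pairs = [(query, corpus[h["corpus_id"]]) for h in hits]
    scores = reranker.predict(pairs)
    for (_, doc), score in sorted(zip(pairs, scores), key=lambda x: x[1], reverse=True):
        print(f"{score:.3f}  {doc}")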

The honest question is not whether you need a reranker. You probably do. The question is whether you should train one yourself or rent one from Cohere, Voyage, Jina, or any of the other providers selling a managed reranker behind an API.

Five criteria decide the call, in roughly this order.

Criterion 1: Latency budget

Most teams underweight this one, and it quietly kills more custom rerankers in production than any other factor.

Our own BGE-M3 based reranker clocked in at roughly 10x the latency of Cohere Rerank 3.5 on the same hardware profile. That is a real gap. For an AI sales agent handling a live conversation, a 400 ms reranker pass versus a 40 ms one is the difference between a response that feels instant and one that makes the buyer sit there watching a spinner.

If your reranker sits on a synchronous user-facing path (chat, voice, search-as-you-type), latency is the dominant constraint, full stop. Managed providers have spent years industrializing inference for exactly this workload, and that is very hard to replicate without a dedicated ML infra team. They batch aggressively, run on tuned GPUs, and amortize cold starts across a pool of tenants you are not paying to keep warm.

If your reranker runs in a batch path (document tagging, offline relevance scoring, monthly retrieval audits), latency is a second-tier problem. This is where a custom reranker gets to play on a fair field.

The question to ask yourself: is there a human waiting on the other end of this call? If yes, start the decision leaning managed.
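One way to answer that question with numbers rather than instinct: replay real (query, candidates) pairs from production through whichever reranker you are evaluating and look at the tail latencies, not the average. A rough sketch, where rerank() stands in for the call under test and the 100 ms budget is a placeholder, not a recommendation:

    import time

    def rerank_latency_check(rerank, samples, budget_ms=100):
        """samples: list of (query, candidate_docs) pairs pulled from production traffic."""
        latencies = []
        for query, docs in samples:
            start = time.perf_counter()
            rerank(query, docs)
            latencies.append((time.perf_counter() - start) * 1000)
        latencies.sort()
        pct = lambda q: latencies[min(len(latencies) - 1, int(q * len(latencies)))]
        print(f"p50={pct(0.50):.0f}ms  p95={pct(0.95):.0f}ms  p99={pct(0.99):.0f}ms")
        return pct(0.95) <= budget_ms   # does the tail fit inside the budget?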

Criterion 2: Domain specificity of your data

The one real argument for going custom, and the one that made us build our first reranker in the first place.

Managed rerankers train on general-purpose relevance judgments. They are very good at telling you which documents are topically closer to a query. They are weaker at telling you which document matches the conventions, jargon, and factual shape of your specific domain.

When we trained on 315K labeled pairs from production B2B sales conversations, our model surfaced six to seven extra documents per query that cosine similarity alone missed. Not off-topic documents. Documents that used our customers' actual language instead of generic SaaS-English. A buyer asking "how do you handle EU data residency" maps to a KB entry titled "data sovereignty overview" only if your reranker has seen how real buyers phrase that question in practice.

The heuristic is the same one embedding teams use when deciding whether to fine-tune a base model. If your domain gets served well by generic internet text (product questions, consumer software support, e-commerce queries), a managed reranker probably reads your queries the way your users do. If your domain is unusual (legal, medical, proprietary B2B verticals, internal jargon that does not exist in public corpora), a custom reranker's ability to learn your phrasing is worth real effort.

Test this before deciding. Pull 200 real queries and 200 real documents out of production, hand-label them for relevance, and run both Cohere Rerank 3.5 and a small open-source baseline on the set. If the managed reranker hits 90% agreement with your labels, the specificity argument does not apply to you and you should buy.
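A sketch of what that agreement check can look like, assuming you have recorded the hand-labeled relevant document IDs for each query and each reranker's top pick; the function name and data shapes here are illustrative:

    def top1_agreement(gold: dict, top1: dict) -> float:
        """gold[q]: set of hand-labeled relevant doc IDs; top1[q]: doc ID the reranker ranked first."""
        hits = sum(1 for q, doc in top1.items() if doc in gold[q])
        return hits / len(top1)

    # If a managed reranker lands at roughly 0.90 or above here, the
    # domain-specificity argument for building your own probably does not apply.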

Criterion 3: Training data you actually have

This one quietly decides most projects, and it is the criterion teams lie to themselves about the hardest.

Our experiment has a finding worth memorizing: 5,000 high-quality labeled pairs outperformed 50,000 sloppy ones. Scaling to 315,000 pairs worked only because we had a careful negative-mining strategy (hard negatives from within-session confusions, random negatives across organizations, a calibrated mix ratio between them). Without that discipline, more data made the model worse, not better.

Translation: a useful custom reranker demands not just data volume but data construction discipline. Clean labels. Hard negatives. A feedback loop where you look at what the current model is getting wrong and feed that signal back into the next training set. That takes an ML engineer who actually understands retrieval, several weeks of focused work, and a labeling budget that does not vanish after the initial push.
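To make that construction discipline concrete, here is a rough sketch of the kind of negative mix described above. The field names, the three-to-one hard-to-random ratio, and the data shapes are illustrative assumptions, not our exact production recipe:

    import random

    def build_training_pairs(labeled, corpus, hard_per_pos=3, rand_per_pos=1, seed=0):
        """labeled: [{"query", "positive", "session_id", "org_id"}]; corpus: [{"text", "session_id", "org_id"}]."""
        rng = random.Random(seed)
        pairs = []  # (query, document, label)
        for item in labeled:
            pairs.append((item["query"], item["positive"], 1))

            # Hard negatives: documents surfaced in the same session but judged irrelevant.
            same_session = [d["text"] for d in corpus
                            if d["session_id"] == item["session_id"] and d["text"] != item["positive"]]
            for doc in rng.sample(same_session, min(hard_per_pos, len(same_session))):
                pairs.append((item["query"], doc, 0))

            # Random negatives: documents pulled from other organizations entirely.
            other_org = [d["text"] for d in corpus if d["org_id"] != item["org_id"]]
            for doc in rng.sample(other_org, min(rand_per_pos, len(other_org))):
                pairs.append((item["query"], doc, 0))
        rng.shuffle(pairs)
        return pairs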

If your total training corpus is under 5,000 genuinely labeled pairs, you are almost certainly better off using a managed reranker. The data will not be enough to beat a well-trained general-purpose model, and you will burn months learning that the hard way.

If you are sitting on 50,000 or more high-quality pairs and a team that can sustain ongoing labeling, custom becomes defensible. Above 100,000 pairs with real negative mining, custom starts to pull clearly ahead on domain-specific queries. That was roughly our experience.

Criterion 4: Ops maturity (the one most teams skip)

A reranker is not a one-time build. It is a model in production, which means versioning, A/B testing infrastructure, drift monitoring, retraining triggers, eval harnesses, and on-call ownership when things start regressing at 2am.

This is where most custom-reranker projects die quietly. The initial training run looks great in offline eval. Six months later the production distribution has shifted, nobody has retrained, latency has crept up because someone else changed the batching, and eval metrics that used to hit 95% are sitting at 88%. The engineer who cared about the project got pulled onto something else two quarters ago.

Ask yourself honestly. Do you have the operational muscle to run a custom model in production for the next three years? Who owns retraining? Where does the labeling data flow from on an ongoing basis, not just during the initial push? What is the rollback procedure when v3 ships worse than v2 and nobody notices for two weeks?

If any of those answers are vague, managed is the right call regardless of how compelling your initial training results look. You are not buying a model. You are buying the operational discipline Cohere or Voyage have already built around one.

Criterion 5: Cost at your scale

Everyone expects this criterion to be first. It is almost always last.

Managed reranker APIs charge per search. At small to medium scale (under a million reranker calls per month) the bill is trivial compared to what it would cost you to do this in-house. At larger scale the math starts shifting. Our own usage-scenario projections had custom reranking reaching breakeven around ten million calls per month, and clearly ahead at a hundred million, depending on how you account for infra amortization.
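A toy version of that breakeven math. The per-1,000-call price and the fixed monthly figure for GPUs, retraining, and ownership are placeholders to plug your own numbers into, not quoted rates:

    def monthly_costs(calls_per_month,
                      managed_per_1k=2.00,      # assumed managed API price per 1,000 rerank calls
                      custom_fixed=25_000):     # assumed monthly infra + retraining + on-call overhead
        managed = calls_per_month / 1_000 * managed_per_1k
        return managed, custom_fixed

    for calls in (1_000_000, 10_000_000, 100_000_000):
        managed, custom = monthly_costs(calls)
        print(f"{calls:>11,} calls/mo   managed=${managed:>9,.0f}   custom fixed=${custom:,.0f}")

With those placeholder numbers the crossover lands in the low tens of millions of calls per month, which matches the shape of our own projections; swap in real quotes and your own infra estimates before trusting it.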

The honest answer for most teams: you are nowhere near the scale where managed becomes uneconomic. If you are under a few million reranker calls per month, cost is not your constraint. Latency, domain specificity, data, and ops maturity are. Stop optimizing a rounding error.

The decision, compressed

Stacking the five criteria together, a rough rule emerges. Reach for a managed reranker first if any one of these is true:

  • Your reranker sits on a synchronous user-facing path and latency actually matters.
  • You have under 10,000 high-quality labeled pairs.
  • Your domain is reasonably standard (B2C, mainstream B2B, generic search).
  • Nobody on your team has signed up to own an ML model in production for years.
  • You are under a million reranker calls per month.

A custom reranker becomes worth the effort only if all of these are true (both checklists are compressed into a short sketch after this list):

  • You have a latency-tolerant path, or you are willing to invest in inference infrastructure to close the gap.
  • You have 50,000+ high-quality pairs, and a labeling process that will keep producing them.
  • Your domain is meaningfully different from public internet text.
  • Engineers will own the model in production for years, not just ship v1.
  • Per-call cost of managed adds up to a real line item at your scale.
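Compressed one step further, the two checklists reduce to an any-versus-all rule. A small sketch with the thresholds copied from the lists above:

    def buy_managed_first(sync_user_path: bool, labeled_pairs: int, standard_domain: bool,
                          longterm_ml_owner: bool, calls_per_month: int) -> bool:
        """Returns True if any single criterion tips the call toward a managed reranker."""
        return any([
            sync_user_path,                  # latency on a synchronous user-facing path
            labeled_pairs < 10_000,          # not enough high-quality training data
            standard_domain,                 # generic B2C / mainstream B2B / generic search
            not longterm_ml_owner,           # nobody signed up to own the model for years
            calls_per_month < 1_000_000,     # managed cost is a rounding error at this scale
        ])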

The honest middle ground is what we landed on: managed reranker in the synchronous path where every millisecond counts, custom reranker in the batch and offline paths where domain specificity pays for itself. The mixed setup beat either pure approach for us.

Run the pilot before you start the project

Before spending four months on a custom reranker, spend two weeks running the smallest experiment that forces the decision into the open. We wish we had done this sooner.

Pull 500 production queries with their retrieved candidate sets. Hand-label the relevant and irrelevant documents in each set, so you have ground truth. Run three retrieval passes: your current cosine-only baseline, the managed reranker of your choice sitting on top of cosine, and one of the open-source cross-encoders (ms-marco-MiniLM-L-6-v2 works fine as a starting point) with no fine-tuning. Compute nDCG and Recall@5 for each.
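A sketch of the scoring step, assuming binary hand labels per query and a ranked list of document IDs from each of the three passes; the helper names and the nDCG cutoff of 10 are illustrative choices:

    import math

    def ndcg_at_k(ranked, relevant, k=10):
        """Binary-relevance nDCG@k: ranked is a list of doc IDs, relevant a set of doc IDs."""
        dcg = sum(1.0 / math.log2(i + 2) for i, d in enumerate(ranked[:k]) if d in relevant)
        ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
        return dcg / ideal if ideal else 0.0

    def recall_at_k(ranked, relevant, k=5):
        return len(set(ranked[:k]) & relevant) / len(relevant) if relevant else 0.0

    def evaluate(runs, gold):
        """runs: {system_name: {query_id: ranked doc IDs}}; gold: {query_id: set of relevant doc IDs}."""
        for name, ranking in runs.items():
            ndcg = sum(ndcg_at_k(ranking[q], gold[q]) for q in gold) / len(gold)
            rec = sum(recall_at_k(ranking[q], gold[q]) for q in gold) / len(gold)
            print(f"{name:<24} nDCG@10={ndcg:.3f}  Recall@5={rec:.3f}")

Run it once per system (cosine-only baseline, managed reranker, untuned open-source cross-encoder) and the resulting table is usually enough to settle the argument.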

One of three things happens. If the managed reranker closes most of the gap between baseline and your hand-label ground truth, you are done. Buy it and move on.

If the managed reranker improves over baseline but leaves noticeable headroom, and the open-source baseline tracks close to managed, you have a case for a fine-tuned custom model in the specific parts of your domain where the gap is largest. Build that and only that, not a universal replacement.

If neither approach helps, your problem is not a reranking problem. Go fix your embeddings, chunking, or KB structure first. No reranker will save bad retrieval.

This pilot is cheaper than one sprint of custom-reranker work, and it will answer the build-versus-buy question with numbers instead of vibes.

What we actually do at Salespeak

For the record, here is where we landed after the experiment. Our synchronous AI sales agent path runs on Cohere Rerank 3.5, because the latency and quality combination was genuinely hard to beat for our domain. Our custom reranker (v3, trained on 315,000 pairs) runs in the offline pipelines for KB quality scoring, session analysis, and training-data generation for downstream models. Both earn their keep. Neither replaces the other, and we no longer pretend one has to.

The full methodology, dataset construction, and the specific way we built hard-negative mining live in the original reranker experiment post. If you are about to start your own build-versus-buy decision on retrieval, read that post first, then come back here with the five criteria in hand.

The punchline: data quality matters more than model size. But even great data and a small model will not beat a mature managed reranker on latency, and latency is the criterion that actually decides the architecture, whether teams admit it up front or only after four months of training runs.

Build the reranker you need. Buy the reranker that saves you from running a production model you do not have the ops to carry. Most teams need one of each, and they figure that out slower than they should.
