Per-agent model selection: why we fine-tuned discovery but not the orchestrator

Lior Mechlovich
8 min read
April 24, 2026

The default advice for AI agents is to pick the smartest model you can afford and use it everywhere. GPT-4 across your whole graph. Upgrade when a new one ships. Move on.

In a multi-agent system that advice stops working. We run seven specialized agents in production. Only one of them uses our fine-tuned Qwen 14B. The rest stay on a frontier model. This is not a cost decision, and it is not a "we couldn't afford to fine-tune the others" decision either. It is a deliberate choice that has to be made per agent, because the agents are doing different jobs and those jobs respond to fine-tuning differently.

What I want to walk through here is the decision for each agent, the one agent where the call is still open, and the piece of the architecture that matters more than any of those individual choices: the fallback path that makes running a fine-tuned model in production actually tolerable.

The seven agents and what each one does

Our graph has seven specialized agents, and they are not interchangeable:

  • Orchestrator — central traffic controller. Classifies the conversation phase, validates input, detects language, decides which specialized agent handles the next turn.
  • Discovery — runs when pain points are still unknown. Asks one focused question per turn. Prioritizes buying signals like competitor mentions, timeline hints, budget language, social proof questions.
  • Researcher — vector search over the customer's knowledge base. Runs on the entry node in parallel with the orchestrator on every turn.
  • Support — existing-customer troubleshooting. Gated by whether the KB actually has an answer.
  • Technical consultant — detailed solution design. Multi-entity questions, technical doc lookups, implementation guidance.
  • Adaptive response — handles repeated questions. Reads the full history, finds a different angle from the KB, refuses to rephrase the same facts.
  • Content generator — creates visuals (flowcharts, one-pagers, diagrams). Different modality, runs on a separate image model.

If we picked one model for all of them we would be optimizing for the average job. Which is a way of saying we would be optimizing for none of them.
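
To make that concrete, here is a minimal sketch of a per-agent model map. The model names and the shape of the config are illustrative assumptions, not our production setup:

```python
# Hypothetical per-agent model map; names are illustrative, not our actual config.
AGENT_MODELS = {
    "orchestrator":         {"model": "frontier-large"},
    "discovery":            {"model": "qwen-14b-lora", "fallback": "frontier-large"},
    "researcher":           {"model": "frontier-large"},
    "support":              {"model": "frontier-large"},
    "technical_consultant": {"model": "frontier-large"},
    "adaptive_response":    {"model": "frontier-large"},
    "content_generator":    {"model": "image-model"},  # different modality entirely
}

def model_for(agent: str) -> str:
    """Resolve which model an agent calls this turn."""
    return AGENT_MODELS[agent]["model"]
```

The point of writing it down this way is that the mapping is a per-agent decision you can change one row at a time, not a single global default.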

Where the fine-tuning payoff actually lives

Fine-tuning wins when three things line up: the task is repetitive, the output format is constrained, and you have a clear success signal you can train against. If any of the three is missing, you are paying for a training loop that will not give you a material quality improvement over a well-prompted frontier model.

Our discovery agent hits all three. The output is always one focused question. The format is stable — short, specific, ends on a question, never stacks multiple questions together, pulls from a finite taxonomy of pain-point dimensions. And the success signal is legible: did the buyer answer, did the answer advance qualification, did we learn something we did not know before? That's a trainable signal.
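
To give a feel for how checkable that format is, here is a rough validator of the kind you could run over candidate training examples. The thresholds and the generic-question pattern are assumptions for illustration, not our actual filters:

```python
import re

def looks_like_discovery_question(text: str, max_words: int = 40) -> bool:
    """Rough format check for a discovery turn: short, specific, exactly one question.
    Thresholds here are illustrative, not production values."""
    stripped = text.strip()
    if not stripped.endswith("?"):
        return False                                   # must end on a question
    if stripped.count("?") != 1:
        return False                                   # never stack multiple questions
    if len(stripped.split()) > max_words:
        return False                                   # keep it short and focused
    generic = re.search(r"tell me more about your business", stripped, re.IGNORECASE)
    return generic is None                             # reject the generic filler question
```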

So we built a fine-tuned Qwen 14B with LoRA for that one agent. Internally we call it Navon. It streams tokens to the WebSocket in the same shape as the OpenAI streaming interface. It runs with a 3600-token input budget and sees the last ten user-and-assistant turns of conversation history. On the specific job of "ask the next discovery question," it is better — faster tail latency, more consistent format, fewer of the generic "could you tell me more about your business?" questions that frontier models love to fall back on when they are not sure what to do.
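
A sketch of how an input window like that can be assembled — keep the last ten user/assistant turns, then trim oldest-first until the token budget fits. The message shape and the count_tokens callable are assumptions; use whatever tokenizer matches the deployed model:

```python
def build_discovery_window(history, count_tokens, max_turns=10, token_budget=3600):
    """Keep the last `max_turns` user/assistant messages, then drop the oldest
    until the window fits `token_budget`. Shapes here are illustrative."""
    window = [m for m in history if m["role"] in ("user", "assistant")][-max_turns:]
    while window and sum(count_tokens(m["content"]) for m in window) > token_budget:
        window.pop(0)                      # drop the oldest turn first
    return window
```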

That is the entire argument for fine-tuning discovery. Not "small model beats big model." Not "we built it so we have to use it." Discovery is the one job on our graph where the three conditions actually line up.

The four agents we kept on frontier models, and why

Orchestrator. This one is not close. The orchestrator reads config, conversation phase, language, KB availability, and user intent simultaneously, then decides who handles the turn. The cost of a bad routing decision is enormous — send the user to the technical consultant when the KB does not have the answer, and you get a confidently wrong reply under your name. Orchestration is reasoning-heavy, the output is structured but the decision space is wide, and small errors cascade through every downstream agent. Frontier model, full stop.
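
For a sense of what "structured output, wide decision space" means here, a routing decision carries something like the following. The field names are illustrative assumptions, not our actual schema:

```python
from dataclasses import dataclass

@dataclass
class RoutingDecision:
    """Illustrative shape of an orchestrator decision; field names are assumptions."""
    phase: str            # e.g. "discovery", "evaluation", "support"
    language: str         # detected conversation language
    kb_has_answer: bool   # whether the knowledge base can ground a reply
    target_agent: str     # which specialized agent handles the next turn
    reason: str           # short rationale, useful for audits and evals
```

Each field is easy to emit on its own; getting all of them right at once on an arbitrary turn is where the reasoning load lives.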

Technical consultant. Multi-entity combinations (platform X with feature Y, integration A plus config B), technical doc lookups, long-tail questions that never show up twice the same way. Fine-tuning underperformed here because the training distribution was never diverse enough — every conversation hit a new combination of entities. The frontier model's breadth matters more than any domain adaptation we could teach.

Adaptive response. The job is subtle: detect that a question is effectively a repeat, read through the full history of what has already been said, and find a genuinely different angle from the KB. It requires holding a long context and reasoning over it — exactly the thing fine-tuned small models are weakest at. The frontier model is not optional here.

Support. This one we think about most often. The answers are format-constrained, the success signal is clear, the output is repetitive — on paper, a candidate for fine-tuning. The reason we have not is data volume. Our support corpus is thinner than our discovery corpus, and the cost of a bad support answer (a customer getting wrong troubleshooting advice) is higher than the cost of a slightly less polished discovery question. We will revisit this in the next six months. For now, frontier model with careful prompting wins the expected-value calculation.

The fallback architecture that makes all this work

Everything I have said so far is the part of the story you can argue about. Here is the part that is not optional: you cannot run a fine-tuned model in production unless you have a good answer for what happens when it fails.

Navon sits behind a provider abstraction. Every request to it has a built-in fallback: if the inference call errors out — network blip, GPU timeout, a malformed response the partial-JSON parser cannot recover from — the provider silently falls back to the default frontier model. The user sees one response; they never know which path served it.
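
A minimal sketch of that pattern. The class, the complete() method, and ModelLoadError are illustrative names, not our actual provider interface:

```python
class ModelLoadError(RuntimeError):
    """Raised when the local checkpoint fails to load (bad weights, OOM, etc.)."""


class FallbackProvider:
    """Try the fine-tuned model once per turn; on any failure, serve the turn
    from the frontier model. Interfaces here are illustrative assumptions."""

    def __init__(self, fine_tuned, frontier):
        self.fine_tuned = fine_tuned
        self.frontier = frontier
        self._fine_tuned_disabled = False      # set permanently if the local load fails

    def complete(self, prompt: str) -> str:
        if not self._fine_tuned_disabled:
            try:
                return self.fine_tuned.complete(prompt)   # one attempt, no retries
            except ModelLoadError:
                self._fine_tuned_disabled = True          # load failures disable it for this process
            except Exception:
                pass                                      # anything else: fall through silently
        return self.frontier.complete(prompt)             # the user never sees the switch
```

The `_fine_tuned_disabled` flag is doing the "no retry queue" work described in the list below: once a local load fails, every later request in that process goes straight to the frontier model.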

The details that matter:

  • No retry storm. If Navon fails once on a given turn, we do not retry it. We fall through to the frontier model immediately. One request per turn, one fallback per turn. Hammering a hurting GPU with retries makes outages worse, and the user has a latency budget you cannot blow on a retry loop.
  • No retry queue for load-once failures. In local-inference mode the model loads on the first call. If that load fails — bad checkpoint, quantization mismatch, memory pressure — it fails permanently for that process. We do not keep trying in the background. The next process restart gets to try again; in the meantime, every request goes to the frontier model. Simple, boring, and it prevents the kind of resource-exhaustion death spiral that broke us once.
  • Streaming with partial JSON parsing. Navon streams tokens in a constrained JSON shape. The client extracts the html_response field from incomplete output mid-generation, so users see progressive text even before the model finishes writing (a sketch of that extraction follows this list). If the stream dies halfway through, the partial text we already showed is preserved and the fallback fires for the next turn, not this one.
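
That mid-stream extraction is the fiddly part, so here is a simplified sketch. The field name html_response matches what we stream; the regex approach and the chunk loop are assumptions, and real code also has to unescape sequences like \n and \":

```python
import re

# Grab whatever of the html_response string value has arrived so far,
# whether or not the closing quote exists yet in the buffer.
_HTML_FIELD = re.compile(r'"html_response"\s*:\s*"((?:[^"\\]|\\.)*)"?')

def partial_html_response(buffer: str) -> str:
    """Extract the html_response text from an incomplete JSON buffer."""
    match = _HTML_FIELD.search(buffer)
    return match.group(1) if match else ""

# Illustrative streaming loop: `stream` and `emit` are placeholders.
def relay(stream, emit):
    buffer, shown = "", ""
    for chunk in stream:                   # raw text chunks of the model's JSON output
        buffer += chunk
        text = partial_html_response(buffer)
        if len(text) > len(shown):
            emit(text[len(shown):])        # push only the newly revealed suffix
            shown = text
```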

This architecture is why I am comfortable running a fine-tuned model on a live sales conversation. Not because Navon is bulletproof — nothing is — but because a Navon failure degrades to "the frontier model answers this turn," not "the agent disappears." The worst case is identical to the default everyone else runs.

How we actually make the call

When we sit down to decide whether to fine-tune a new agent, the conversation is four questions:

1. Is the task format-constrained? If the output is a question, a structured decision, a JSON payload with a stable schema — fine-tuning helps. If the output is open-ended prose, reasoning chains, or long-form explanations, fine-tuning does not help enough to justify the operational load.

2. Do we have a clear success signal we can train against? "Did the buyer respond" is a signal. "Did the technical consultant give a good answer" is a rubric, not a signal, and it cannot be trained against without a judge layer that is itself the hard part. Fine-tuning needs the former.

3. Are there 20,000+ labeled examples of this specific task in our data? This is the threshold we found empirically where domain adaptation starts pulling ahead of careful prompting. Below that, you are better off investing in a longer system prompt and a few-shot set.

4. Is the cost of a bad output low enough that a fallback can catch it? If the answer is yes — discovery, classification, format normalization — the fallback pattern makes fine-tuning safe. If it is no (legal text, pricing commitments, numbers going into a downstream invoice), the safety calculus is different and we keep it on the frontier model regardless.

Three yeses and one "safe" and we fine-tune. Anything else and we don't.
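
Written out as code, the rule is deliberately small. A sketch — the inputs are whatever your own evaluation of the agent produces, and the 20,000 threshold is the empirical one mentioned above:

```python
def should_fine_tune(format_constrained: bool,
                     trainable_signal: bool,
                     labeled_examples: int,
                     bad_output_is_cheap: bool) -> bool:
    """Three yeses on the task itself, plus "a fallback can safely catch a bad
    output". Anything else stays on the frontier model."""
    return (format_constrained
            and trainable_signal
            and labeled_examples >= 20_000
            and bad_output_is_cheap)
```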

The one thing I would not skip

If you take nothing else from this: fine-tuning a model in production without a silent fallback to a frontier model is not an engineering decision, it is a bet. You are betting that your inference stack never fails, your training distribution never drifts, and your deployment pipeline never ships a bad checkpoint. Every one of those bets eventually loses.

The fallback is not insurance. It is the architecture. The fine-tuned model is an optimization on top.

Which agent in your graph would fail most safely if its model disappeared for a day? That's probably the one you should fine-tune first.
