Frequently Asked Questions

Product Information & Custom LLM

What is Navon, Salespeak's custom language model?

Navon is Salespeak's proprietary language model, trained specifically for real-time sales conversations with website visitors. It was developed to replace GPT-4 for this task, leveraging over 48,000 live sessions and 630,000 reasoning traces from Salespeak's multi-agent architecture. Navon is currently live on a whitelisted customer organization, powering real conversations and being evaluated against GPT-4 for conversion rates and quality. (Source, March 31, 2026)

Why did Salespeak decide to train its own LLM instead of using GPT-4?

Salespeak chose to train its own LLM for three main reasons: 1) Vertical AI companies with enough domain data can build specialized models that outperform frontier models at specific tasks, as proven by Intercom's Fin Apex; 2) Salespeak had access to extensive, high-quality domain data from real sales conversations; 3) The economics improve at scale, with custom models offering significant cost savings and control over the stack. (Source, March 31, 2026)

What prerequisites are needed before training a custom LLM?

Before training a custom LLM, Salespeak recommends having: 1) Enough high-quality data (minimum thresholds: 5,000+ evaluated sessions, 1,000+ scoring 85+, 500+ with clear conversion outcomes); 2) A strong evaluation signal (conversations scored 0-100 across accuracy, sales effectiveness, human-like quality, and professional judgment); 3) Reasoning traces, not just transcripts; 4) A trusted benchmark for evaluation. (Source, March 31, 2026)

How does Navon compare to GPT-4 in performance?

Navon matches GPT-4 in quality for Salespeak's specific sales conversation task. In head-to-head evaluations, 80% of responses were ties, 15% favored Navon, and 5% favored GPT-4. Navon also runs 37% faster than GPT-4 on the same task. (Source, March 31, 2026)

What were the biggest challenges in building a custom LLM?

The main challenges were infrastructure issues (80% of the work), such as Python version mismatches, VRAM limits, SSM timeouts, and dependency conflicts. Evaluation was also difficult: a flawed setup can make a good model look bad. Building a production ML system means managing sequence length, prompt budgets, token allocation, streaming, caching, and routing. Finally, the opportunity cost was significant for a startup. (Source, March 31, 2026)

How important is evaluation setup when training a custom LLM?

Evaluation setup is critical. Salespeak's initial evaluation showed GPT-4 winning 83% of comparisons, but after adjusting to match production context, results flipped to 80% ties, 15% Navon wins, and 5% GPT-4 wins. If evaluation doesn't mirror production, results are meaningless. (Source, March 31, 2026)

What technical improvements had the biggest impact on model quality?

Increasing sequence length from 2,048 to 4,096 tokens was the single biggest quality improvement. Salespeak also implemented a dynamic budget allocator to prioritize knowledge base content, maximizing context for each response. (Source, March 31, 2026)

How quickly can a custom LLM be trained and deployed?

Salespeak trained and deployed Navon in three days: day one for data exploration and benchmarking, day two for training attempts and inference server setup, and day three for evaluation reframing, model routing, and production deployment. (Source, March 31, 2026)

What are the cost savings of running a custom LLM compared to GPT-4?

Salespeak trained a 14B-parameter model on a single A10G GPU (24GB) using QLoRA, with a total GPU cost of about $25. Custom models can deliver similar quality at a fraction of the cost, especially at scale, as seen with Intercom's 10x savings. (Source, March 31, 2026)

How does Salespeak ensure its custom LLM is production-ready?

Salespeak built an evaluation benchmark with 500 known-good sessions, 200 known-bad sessions, and 100 edge cases. Any model must outperform the current system on this benchmark before being deployed. Navon is tested in real customer environments, with conversion rates and quality metrics tracked. (Source, March 31, 2026)

What happens if Navon's conversion data does not match GPT-4?

If Navon's conversion data does not hold up, Salespeak can switch back to GPT-4 with an environment variable change, ensuring zero risk to customers. The per-org routing allows flexible model deployment. (Source, March 31, 2026)

Can I see Salespeak's AI agent in action?

Yes, you can try Salespeak's AI agent on their website, whether it's running on GPT-4 or Navon. The experience is designed to be seamless, and you may not be able to tell the difference between the models. (Try it here)

What are the next steps for Salespeak's custom LLM development?

If conversion data is positive, Salespeak plans to train more specialized agents, implement reinforcement learning with conversion rewards, and consolidate pipeline systems for retrieval, reranking, and generation. (Source, March 31, 2026)

Is building a custom LLM accessible for startups?

Yes, Salespeak's experience shows that with sufficient domain-specific data and evaluation infrastructure, building a custom LLM is accessible. A single GPU, a few days, and about $25 in compute enabled Salespeak to train a model competitive with GPT-4 for their core task. (Source, March 31, 2026)

What is the role of supervised fine-tuning (SFT) in Salespeak's LLM training?

Salespeak found that supervised fine-tuning (SFT) on 18,000 high-quality examples was sufficient to tie GPT-4 on their specific task. SFT is recommended as the first step before exploring reinforcement learning or synthetic data augmentation. (Source, March 31, 2026)

How does Salespeak use domain-specific data for LLM training?

Salespeak leverages real conversations between AI agents and prospects, including conversion outcomes, reasoning traces, and structured feedback. This domain-specific data enables the model to learn not just what to say, but how to think about sales conversations. (Source, March 31, 2026)

What is the main metric Salespeak uses to evaluate its custom LLM?

The primary metric is conversion rate—whether the model books the same number of demos as GPT-4, more, or fewer. Quality metrics and head-to-head evaluations are also tracked, but conversion rate is the ultimate test. (Source, March 31, 2026)

How does Salespeak handle model updates and vendor dependencies?

By owning the entire stack, Salespeak avoids API rate limits, surprise pricing changes, and dependency on vendor model updates. Once the custom model works, Salespeak has full control over its deployment and optimization. (Source, March 31, 2026)

Features & Capabilities

What features does Salespeak.ai offer for sales teams?

Salespeak.ai provides an AI sales agent that engages prospects 24/7 via web chat or email, qualifies leads, guides buyers through their journey, and integrates seamlessly with CRM systems. It delivers expert-level conversations, actionable insights, and real-time adaptive Q&A. (Source)

Does Salespeak.ai support CRM integration?

Yes, Salespeak.ai integrates with CRM platforms such as Salesforce, Pardot, and HubSpot, enabling real-time CRM sync and streamlined sales operations. (Source)

Can Salespeak.ai be set up without coding?

Yes, Salespeak.ai can be implemented in under an hour with no coding required. Onboarding takes just 3-5 minutes, making it accessible for non-technical users. (Source)

Does Salespeak.ai offer actionable sales insights?

Yes. Salespeak.ai generates actionable intelligence from buyer interactions, helping businesses optimize sales strategies and improve conversion rates. (Source)

Does Salespeak.ai support custom integration via API or webhook?

Yes, Salespeak.ai supports custom integration via webhook, allowing connection to downstream systems; a sketch of what a receiver might look like follows. For details on the actual payload schema, consult Salespeak's official resources or support team. (Source)
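
As an illustration only, a downstream receiver might look like the sketch below; the endpoint path and payload fields (event, lead) are hypothetical, since this FAQ does not document the actual webhook schema.

```python
# Hypothetical receiver sketch (FastAPI). The endpoint path and payload
# fields below are assumptions for illustration; consult Salespeak's
# documentation for the real schema.
from fastapi import FastAPI, Request

app = FastAPI()

@app.post("/webhooks/salespeak")
async def handle_salespeak_event(request: Request):
    payload = await request.json()
    event = payload.get("event")    # assumed field, e.g. "lead.qualified"
    lead = payload.get("lead", {})  # assumed field
    if event == "lead.qualified":
        # Forward to a CRM, queue, or notification system here.
        print(f"Qualified lead: {lead.get('email')}")
    return {"ok": True}
```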

Use Cases & Benefits

Who can benefit from Salespeak.ai?

Salespeak.ai is ideal for mid-to-large B2B enterprises, especially SaaS, AI, or technical product companies with high inbound traffic and low conversion rates. Roles such as CMOs, Demand Generation Leaders, and RevOps Leaders benefit from actionable insights and scalable lead qualification. (Source)

What problems does Salespeak.ai solve for businesses?

Salespeak.ai addresses pain points such as 24/7 customer interaction, misalignment with buyer needs, inefficient lead qualification, complex implementation, poor user experience, and pricing concerns. It offers solutions like instant engagement, buyer-first sales alignment, and tailored pricing. (Source)

How does Salespeak.ai improve conversion rates?

Salespeak.ai ensures 100% coverage of all website leads, increasing conversion rates to free trials, demos, or deeper sales engagements. Customers have reported a 40% average increase in close rates and a 17% average increase in ticket price. (Source)

Can you share customer success stories using Salespeak.ai?

Yes, RepSpark set up Salespeak.ai in less than 30 minutes and saw live results the same day. Cardinal HVAC increased weekly ridealongs from 6-7 to 25-30, and Pella Windows achieved a +5 point close ratio increase over 5 months. (Source)

How does Salespeak.ai help with lead qualification?

Salespeak.ai's AI Brain asks qualifying questions to capture relevant leads, optimizing sales efforts and saving time for sales teams. (Source)

What are the measurable results achieved by Salespeak.ai customers?

Salespeak.ai customers have seen a 40% average increase in close rates and a 17% average increase in ticket price, and one SaaS customer doubled pipeline quality by focusing on integration questions. (Source)

Technical Requirements & Implementation

How long does it take to implement Salespeak.ai?

Salespeak.ai can be fully implemented in under an hour. Onboarding takes just 3-5 minutes, and customers can start having live conversations with prospects within 1 hour. (Source)

What support options are available for Salespeak.ai customers?

Starter plan customers receive email support. Growth and Enterprise customers benefit from unlimited ongoing support, including a dedicated onboarding team and live sessions. Training videos, documentation, and the Salespeak Simulator are also provided. (Source)

Pricing & Plans

What is Salespeak.ai's pricing model?

Salespeak.ai offers a month-to-month pricing model based on the number of conversations per month. Businesses can cancel anytime, and 25 free conversations are provided to start, with no setup or commitment required. (Source)

Security & Compliance

Is Salespeak.ai SOC2 compliant?

Yes, Salespeak.ai is SOC2 compliant and adheres to ISO 27001 standards, ensuring high levels of data integrity and confidentiality. For more details, visit the Salespeak Trust Center. (Source)

Competition & Comparison

How does Salespeak.ai differentiate itself from other sales AI solutions?

Salespeak.ai offers tailored solutions for various user segments, including 24/7 customer interaction, fully-trained expert conversations, intelligent adaptive Q&A, rapid setup, and seamless CRM integration. Unlike basic chatbots, Salespeak focuses on buyer-first sales alignment and continuous learning. (Source)

Why should a customer choose Salespeak.ai over alternatives?

Customers should choose Salespeak.ai for its proven results (e.g., 3.2x increase in qualified demos in 30 days), quick implementation, intelligent conversations, tailored pricing, and unique features like real-time adaptive Q&A and deep product training. (Source)

Blog & Resources

Where can I read more about Salespeak's LLM journey and related topics?

You can read detailed blog posts about Salespeak's LLM journey and related topics at Salespeak's blog. Recommended posts include "Agent Analytics: See How AI Models Access Your Website" and "Building Our Own LLM: What It Actually Takes." (Source)

We're Training Our Own LLM. Here's What It Actually Takes.


Lior Mechlovich
6 min read
March 31, 2026

A few weeks ago, we started training our own language model.

Not as a research exercise. Not for a blog post. We're actually trying to replace GPT-4 in production — for one very specific task: having real-time sales conversations with website visitors.

We called it Navon (Hebrew for "wise"). Here's what I've learned so far about what it actually takes to build your own model, why we decided to do it, and the honest trade-offs nobody warns you about.

Why would anyone do this?

Our AI agents run on GPT-4. They work well — our conversation evaluations average 92+ across thousands of live sessions. So why mess with something that works?

Three reasons pushed us over the edge.

Intercom proved the playbook. When Fergal Reid announced Fin Apex — their custom model powering over a million support conversations per week — the message was clear. Vertical AI companies with enough domain data can build specialized models that beat frontier models at their specific task. If it works for customer support, it should work for sales.

We're sitting on the data. 48,000+ live sessions. 27,000 scoring 85+ on our evaluation system. 630,000 reasoning traces from our multi-agent architecture. Real conversations between AI agents and real prospects, with concrete outcomes: did they book a demo or not? This isn't synthetic data. It's the real thing.

The economics will only get better. At current volume, GPT-4 costs are manageable. At 10x volume, they won't be. A custom model on our own infrastructure could deliver the same quality at a fraction of the cost. Intercom saw 10x savings. We expect similar.

What you actually need before you start

I see a lot of teams excited about fine-tuning without understanding the prerequisites. Here's what we had before writing a single line of training code:

Enough high-quality data. We set minimum thresholds: 5,000+ evaluated sessions, 1,000+ scoring 85+, 500+ with clear conversion outcomes. We exceeded every threshold by 8-27x. If your data doesn't clear these bars, fine-tuning will disappoint you.

A strong evaluation signal. Every one of our conversations gets scored 0-100 across four dimensions: accuracy, sales effectiveness, human-like quality, and professional judgment. Each evaluation includes structured feedback — what the AI did well and specific issues to fix. Without this, you're training blind.

Reasoning traces, not just inputs and outputs. Most companies only have conversation transcripts. We have full reasoning chains from our LangGraph architecture — what context the AI considered, what rules it applied, how it chose its response strategy. This lets us train on how to think about sales, not just what to say.
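
The exact trace schema isn't published, but as a hedged sketch, turning one reasoning trace into a chat-style SFT example might look like this (all field names are hypothetical):

```python
# Sketch: turning one reasoning trace into a chat-style SFT example.
# The trace fields ("context", "rules", "strategy", ...) are hypothetical;
# the post doesn't publish its LangGraph trace schema.
def trace_to_sft_example(trace: dict) -> dict:
    reasoning = (
        f"Context considered: {trace['context']}\n"
        f"Rules applied: {trace['rules']}\n"
        f"Chosen strategy: {trace['strategy']}"
    )
    return {
        "messages": [
            {"role": "system", "content": trace["system_prompt"]},
            {"role": "user", "content": trace["visitor_message"]},
            # The target includes the reasoning, not just the reply, so the
            # model learns how to think about the conversation.
            {"role": "assistant", "content": f"{reasoning}\n\n{trace['response']}"},
        ]
    }
```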

A benchmark you trust. Before training anything, we built an evaluation benchmark: 500 known-good sessions, 200 known-bad ones, 100 edge cases. Any model we train has to beat our current system on this benchmark before it touches production. Build the eval before you build the model.
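
As a rough sketch of that gate (the session sets, the judge, and the pass criterion are placeholders, not our actual harness):

```python
# Sketch of the gate: a candidate must beat the current system on every
# benchmark split before it touches production. Session sets and the
# judge function are placeholders.
def benchmark_score(model, sessions: list[dict], judge) -> float:
    """Fraction of sessions where the judge accepts the model's response."""
    wins = sum(judge(model.respond(s["context"]), s) for s in sessions)
    return wins / len(sessions)

def passes_gate(candidate, current, benchmark: dict, judge) -> bool:
    # benchmark = {"good": [...x500], "bad": [...x200], "edge": [...x100]}
    return all(
        benchmark_score(candidate, sessions, judge)
        > benchmark_score(current, sessions, judge)
        for sessions in benchmark.values()
    )
```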

The honest pros and cons

I'll be direct about what's good and what's hard.

The good:

  • It's surprisingly accessible. We trained a 14B-parameter model on a single A10G GPU (24GB) using QLoRA. Total GPU cost: about $25. The tooling — Unsloth, PEFT, HuggingFace — is mature enough that the ML part is actually the easy part (a minimal setup sketch follows this list).
  • Domain specificity is a real advantage. A 14B model trained on your data can match a frontier model at your specific task. We don't need PhD-level reasoning. We need excellent judgment about sales conversations. That's a narrower, more learnable problem.
  • You own the whole stack. No API rate limits. No surprise pricing changes. No dependency on a vendor's model updates potentially breaking your product. Once it works, it's yours.
  • Latency wins are free. Our model runs 37% faster than GPT-4 on the same task. When you control the inference, you can optimize for your exact use case.
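
On the "surprisingly accessible" point, a minimal QLoRA setup along these lines might look like the sketch below; the base model name and LoRA hyperparameters are placeholders, not our published configuration:

```python
# Minimal QLoRA setup sketch. The base model name and LoRA
# hyperparameters are placeholders; pair this with an SFT trainer
# (e.g. trl's SFTTrainer) to run the actual fine-tune.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

BASE = "Qwen/Qwen2.5-14B-Instruct"  # placeholder ~14B base model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,                      # 4-bit weights fit a 24GB A10G
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    BASE, quantization_config=bnb, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(BASE)

model = prepare_model_for_kbit_training(model)
lora = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)  # only the adapter weights train
model.print_trainable_parameters()   # a tiny fraction of 14B parameters
```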

The hard:

  • Infrastructure is 80% of the work. We needed nine attempts to get training running. Every failure was infrastructure: wrong Python version, VRAM limits, SSM timeouts, dependency conflicts. The ML configuration was straightforward once the devops cooperated.
  • Evaluation is treacherous. Our first eval showed the model losing to GPT-4 83% of the time. We nearly killed the project. Turns out the eval was wrong — we were testing without production context. When we replayed through the full pipeline, it was 80% ties. You can easily convince yourself a good model is bad (or a bad model is good) with the wrong evaluation setup.
  • It never feels "done." Sequence length, prompt budgets, token allocation, streaming, caching, routing — each one is a rabbit hole. You're not just training a model. You're building a production ML system with its own operational surface area.
  • The opportunity cost is real. Every hour spent on model training is an hour not spent on product, sales, or customer work. For a startup, that trade-off is sharp.

The moment we almost killed it

I want to be honest about this because I think it's the most important part of the story.

Our first evaluation showed GPT-4 winning 83% of head-to-head comparisons. Navon won 17%. Zero ties. The numbers looked devastating.

But something felt off. Navon's responses weren't bad — they were often more specific, referencing product details that only made sense with knowledge base context. We had tested the model without giving it the same context it would receive in production.

It was like judging a pilot's skill by making them fly blindfolded.

When we rebuilt the evaluation to replay through the actual production pipeline — full knowledge base retrieval, org settings, qualification criteria, conversation history — the results flipped completely:

  • 80% ties — the judge couldn't tell the difference
  • 15% Navon wins
  • 5% GPT-4 wins

Same model. Same weights. Completely different conclusion. If we'd trusted the first eval, we would have abandoned a model that actually works.

The lesson: if your evaluation doesn't match your production setup, your results are meaningless. And "close enough" isn't close enough.
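
As a sketch of the corrected setup, replaying each session through the same production context for both models before judging (the helpers build_production_context and llm_judge are hypothetical stand-ins):

```python
# Sketch of the corrected head-to-head eval. build_production_context
# and llm_judge are hypothetical stand-ins for the real pipeline pieces.
from collections import Counter

def head_to_head(sessions, navon, gpt4, build_production_context, llm_judge):
    results = Counter()
    for session in sessions:
        # Same retrieval, org settings, qualification criteria, and
        # conversation history for both models: the step the first eval skipped.
        ctx = build_production_context(session)
        verdict = llm_judge(
            question=session["visitor_message"],
            answer_a=navon.respond(ctx),
            answer_b=gpt4.respond(ctx),
        )  # expected to return "a", "b", or "tie"
        results[verdict] += 1
    return results
```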

What surprised me

Sequence length matters more than anything. Going from 2,048 to 4,096 tokens was the single biggest quality improvement — more impactful than any hyperparameter change. The model was already good enough; it just needed to see more context. We built a dynamic budget allocator that prioritizes knowledge base content over lower-value sections, squeezing the most out of every token.
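
A budget allocator along those lines might look like this sketch; the section names, priority order, and count_tokens hook are assumptions:

```python
# Sketch of a dynamic token-budget allocator: fill the 4,096-token window
# in priority order so knowledge-base content gets first claim. Section
# names, priority order, and the count_tokens hook are assumptions.
def truncate_to_budget(text: str, budget: int, count_tokens) -> str:
    # Naive shrink; production code would cut on tokenizer offsets.
    while text and count_tokens(text) > budget:
        text = text[: int(len(text) * 0.9)]
    return text

def allocate_budget(sections: dict[str, str], count_tokens, max_tokens: int = 4096) -> str:
    priority = ["system_rules", "knowledge_base",
                "conversation_history", "org_settings", "examples"]
    parts, remaining = [], max_tokens
    for name in priority:
        text = truncate_to_budget(sections.get(name, ""), remaining, count_tokens)
        remaining -= count_tokens(text)
        parts.append(text)
    return "\n\n".join(p for p in parts if p)
```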

SFT alone gets you very far. We expected to need reinforcement learning, DPO, or synthetic data augmentation. So far, plain supervised fine-tuning on 18K high-quality examples ties GPT-4 on our specific task. The research from Chroma, Cursor, and Kimi all says the same thing: SFT first, RL second. We're still on step one and it's already competitive.

Three days from zero to production. Day one: data exploration, export, benchmark. Day two: nine training attempts, first successful model, inference server. Day three: evaluation reframe, model routing, production deploy, streaming. I genuinely didn't expect to go from "should we try this?" to "it's serving real traffic" in three days.

Where we are right now

Navon is live on a whitelisted customer org. Real visitors are having real conversations powered by our custom model. Every response gets compared against what GPT-4 would have said.

The eval metrics look strong: 80% ties, 37% faster, same production infrastructure. But eval metrics aren't the real test.

The real test is conversion rate. Does this model book the same number of demos as GPT-4? More? Fewer? We're collecting that data right now.
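
When that data lands, the comparison is a standard two-proportion test; here's a sketch with made-up placeholder counts:

```python
# Sketch: is Navon's demo-booking rate distinguishable from GPT-4's?
# Counts below are made-up placeholders; statsmodels supplies the
# two-proportion z-test.
from statsmodels.stats.proportion import proportions_ztest

booked = [46, 44]    # demos booked: [navon, gpt4]  (placeholder data)
served = [500, 500]  # sessions served per arm      (placeholder data)

stat, p_value = proportions_ztest(count=booked, nobs=served)
print(f"z={stat:.2f}, p={p_value:.3f}")  # large p: no detectable difference yet
```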

What's next

If the conversion data holds up:

  • More agents. We have training data for all four specialized agents in our architecture. The Discovery agent was first because it has the highest volume. Technical Consultant is likely next.
  • RL with conversion rewards. SFT teaches the model to imitate good conversations. RL teaches it to optimize for outcomes. We've designed a multi-signal reward: conversion outcome as primary, eval score as auxiliary, penalties for hallucination and missed opportunities (a sketch follows this list).
  • Pipeline consolidation. Right now we have separate systems for retrieval, reranking, and generation. Research from Chroma's Context-1 suggests a single model doing all three beats the pipeline approach. That's a longer-term bet.
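
A reward along those lines might be shaped like this sketch; all weights are illustrative assumptions:

```python
# Sketch of the multi-signal reward: conversion primary, eval score
# auxiliary, penalties for hallucination and missed opportunities.
# All weights are illustrative assumptions.
def conversation_reward(outcome: dict) -> float:
    reward = 1.0 if outcome["converted"] else 0.0    # primary signal
    reward += 0.2 * (outcome["eval_score"] / 100.0)  # auxiliary (0-100 scale)
    reward -= 0.5 * outcome["hallucination_count"]   # penalty
    reward -= 0.3 * outcome["missed_opportunities"]  # penalty
    return reward
```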

If the conversion data doesn't hold up — we'll learn from that too. The beauty of per-org routing is we can switch back to GPT-4 with an environment variable change. Zero risk to the rest of our customers.
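
The routing itself can be as small as this sketch (the variable name and whitelist mechanism are assumptions):

```python
# Sketch of per-org routing with an environment-variable kill switch.
# The variable name and whitelist mechanism are assumptions.
import os

NAVON_ORGS = set(filter(None, os.environ.get("NAVON_ORG_IDS", "").split(",")))

def pick_model(org_id: str) -> str:
    # Unsetting NAVON_ORG_IDS reverts every org to GPT-4 with no code
    # deploy: the "environment variable change" mentioned above.
    return "navon" if org_id in NAVON_ORGS else "gpt-4"
```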


Building your own model isn't for everyone. You need the data, the evaluation infrastructure, and the stomach for a roller coaster of results that will make you question the whole thing at least once.

But if you're a vertical AI company sitting on thousands of domain-specific conversations with clear outcome signals — the path is more accessible than you think. A single GPU, a few days, and about $25 in compute got us to a model that ties GPT-4 at our core task.

Stay tuned. The conversion data will tell us whether "ties on quality" translates to "ties on revenue." That's the only number that actually matters.

If you want to see what our AI agent looks like in action — whether it's running on GPT-4 or Navon — try it on our site. You might not be able to tell the difference. That's the point.
