Your chat history after 8 turns is lying to you

Lior Mechlovich
6 min read
April 24, 2026

There's a failure mode every team building multi-turn AI agents eventually hits, and almost nobody talks about it in public. Around turn 8, your agent starts lying. Not in ways you can easily catch — it still sounds reasonable, the grammar is clean, the responses feel on-topic. But a pain point the user shared three turns ago has evaporated. The budget number the agent captured in turn 4 comes back slightly wrong. A question that was definitively answered gets asked again in a rephrased way.

The instinct is to blame the model. It's not the model. It's the data structure you're feeding it.

If you're dumping full chat history into the context window on every turn and asking the model to re-derive state from the transcript, you're making it do forensic work on every call. You wouldn't ask a human rep to re-read the entire call log before every sentence they speak. The reason they don't need to is that they're carrying a structured summary in their head — pain points noted, objections handled, numbers captured, what comes next. The conversation itself is the record. The summary is what they're operating from.

Your agent needs the same split. Here's the one we use in production.

The 8-turn cliff is real, and it's quiet

The thing that surprised us most when we started measuring was how invisible the failures were. The loud failures — wrong response, off-topic reply, broken JSON — were easy to catch and trace. The quiet ones were the expensive ones.

Three categories we ended up naming:

Partial extraction. The user mentioned they had a six-week timeline in turn 3. By turn 11, the agent is confidently telling them about a Q3 rollout plan. The timeline did not change. The agent just lost it.

Goal drift. The conversation started with a clear intent: qualify on budget. Six turns of good back-and-forth later, the agent is somewhere else entirely — talking about integrations, giving a demo, anywhere but the qualification goal it started with. There was no decision to drift. It just happened.

Question repetition. The agent asks what industry the buyer works in. Buyer answers. Seven turns later, the agent asks again. Sometimes phrased slightly differently. The user notices. The user never comes back.

None of these show up as errors. They show up as lost deals, six weeks later, with no clear story for what went wrong.

Why bigger context windows don't fix it

The natural first response, when you see the 8-turn problem, is to reach for a model with a bigger context window. GPT-4 Turbo's 128k, Claude's 200k, whatever the latest number is. More room for history. Problem solved.

Except the problem is not "the history doesn't fit." It fits fine. The problem is that the model has to re-derive state from the raw transcript on every single turn. A 40-message history fits comfortably in a 128k-token window, but the model still has to scan it, identify what matters, and reconstruct the implicit state — on every turn, before it has written a single word of its actual response.

That reconstruction is noisy. The model picks up most of what was said, misses a few things, and occasionally hallucinates details that were never there. The error rate is small per turn and compounds across a long conversation. By turn 12 the accumulated drift is visible, and a bigger context window does not help because the ratio of signal to re-derivation noise stays the same.

You don't need the model to rebuild state on every turn. You need the model to read state.
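
To make the difference concrete, here is a minimal sketch of the two ways a turn's prompt can be assembled: hand the model the raw transcript and ask it to re-derive state, or hand it a compact state summary it only has to read. The function names and prompt layout are illustrative assumptions, not production code.

```python
# Illustrative sketch; function names and prompt layout are assumptions,
# not production code. The point is the shape of what the model receives.

def prompt_from_transcript(system_prompt: str, messages: list) -> str:
    # Re-derive mode: the model must scan the full history and reconstruct
    # state (pain points, numbers, open questions) before it can respond.
    history = "\n".join(f"{m['role']}: {m['content']}" for m in messages)
    return (
        f"{system_prompt}\n\n"
        f"Full transcript:\n{history}\n\n"
        f"Respond to the last message."
    )

def prompt_from_state(system_prompt: str, state_summary: str, last_message: str) -> str:
    # Read mode: the model gets the already-extracted state plus only the
    # latest user message. Nothing has to be reconstructed.
    return (
        f"{system_prompt}\n\n"
        f"Conversation state (extracted so far):\n{state_summary}\n\n"
        f"Latest user message:\n{last_message}\n\n"
        f"Respond."
    )
```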

What we extract into structured memory

Our conversation state is a dataclass. Every turn reads from it and writes to it. The raw messages still live in the state object, but they're there for audit and error recovery, not as the primary thing the agent reasons over.

The fields that carry most of the work, roughly in priority order:

conversation_contextual_memory. The core structured summary. Pain points identified, topics already discussed, key commitments made by either side. Specialized agents write to this as they extract information during their turn. The orchestrator reads it on every subsequent turn. This is the field that does the most to kill goal drift.

qualification_status. A dict of scores, one per qualification dimension (budget, authority, need, timeline, fit — the standard sales shape). Each dimension is updated incrementally. The orchestrator doesn't have to re-read the transcript to decide whether the buyer is qualified; it reads the dict.

follow_up_questions_asked. A list of every follow-up question the agent has already asked this session. Before asking a new question, every specialized agent checks this list. This is what stops the "ask about industry twice" failure mode. Costs basically nothing, eliminates the entire class.

is_qualified. A three-state enum: IN_PROGRESS, QUALIFIED, or UNQUALIFIED. The orchestrator reads this on the first turn of every call and routes accordingly. When the state flips to QUALIFIED, the orchestrator's behavior changes — it stops asking discovery questions and starts sharing target functions. Without a flag like this, the agent has to infer "are we done qualifying yet?" on every turn, and it gets it wrong sometimes.

previous_chat_phase. Where the conversation was at the end of the last turn. This is separate from the current phase, which the orchestrator decides fresh each turn. Phase progression — discovery → qualification → asset-sharing → close — is explicit.

kb_information_level. An enum set by the researcher: ACCURATE, PARTIAL, or MISSING for the current turn. The technical consultant agent reads this and refuses to answer if the KB does not actually contain the information. That single field prevents most of our silent-hallucination failures on technical questions.

There are other fields — research_results, user_language, visual_content_offered, agent_responses — but the six above carry the bulk of the "agent does not lose track" work.
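
Put together, a state object shaped like this could look roughly as follows. The field names follow the ones above; the types, defaults, and enum values are assumptions for illustration, not the production definition.

```python
from dataclasses import dataclass, field
from enum import Enum

class QualificationState(Enum):
    IN_PROGRESS = "in_progress"
    QUALIFIED = "qualified"
    UNQUALIFIED = "unqualified"

class KBInformationLevel(Enum):
    ACCURATE = "accurate"
    PARTIAL = "partial"
    MISSING = "missing"

@dataclass
class ConversationState:
    # Core structured summary: pain points, topics covered, commitments made.
    conversation_contextual_memory: dict = field(default_factory=dict)
    # Incremental score per qualification dimension (budget, authority, need, timeline, fit).
    qualification_status: dict = field(default_factory=dict)
    # Every follow-up question already asked this session; checked before asking a new one.
    follow_up_questions_asked: list = field(default_factory=list)
    # Routing flag the orchestrator reads on the first turn of every call.
    is_qualified: QualificationState = QualificationState.IN_PROGRESS
    # Where the conversation was at the end of the last turn.
    previous_chat_phase: str = "discovery"
    # Set by the researcher each turn; the technical consultant refuses to answer on MISSING.
    kb_information_level: KBInformationLevel = KBInformationLevel.MISSING
    # A few of the other fields mentioned in this post.
    research_results: dict = field(default_factory=dict)
    agent_responses: list = field(default_factory=list)
    next_agent: str = ""
    # Raw messages stay in the state for audit and error recovery,
    # not as the primary thing the agent reasons over.
    messages: list = field(default_factory=list)
```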

Who writes, who reads

The ownership pattern matters as much as the fields themselves.

Specialized agents write to conversation_contextual_memory during their turn. As the discovery agent extracts a pain point, it pushes it into the memory object as part of its response. As the technical consultant clarifies a requirement, that goes in too.

The orchestrator does not write to conversation_contextual_memory. It reads it. The orchestrator's system prompt includes the current memory object every turn, so when it decides who handles the next turn, it has the full extracted state in front of it — not 40 messages of raw chat history.

This read-write split is what makes parallel orchestration possible at all. If both the orchestrator and the specialized agent were writing to the same memory field on the same turn, you'd have a merge conflict every time. With strict ownership, the state graph is predictable: researcher writes research_results, orchestrator writes next_agent, specialized agents write conversation_contextual_memory and agent_responses. Each node owns its fields and nobody else touches them.
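
As a sketch of what strict ownership can look like in code, assume each node is a function that returns only the fields it owns, and a small runner merges those updates into the shared state. The node bodies and merge logic below are illustrative assumptions built on the ConversationState sketch above, not the production graph.

```python
# Each node returns only the fields it owns; nothing else writes them.

def researcher_node(state: ConversationState) -> dict:
    # Owns research_results and kb_information_level.
    return {
        "research_results": {"query": "pricing tiers", "hits": []},
        "kb_information_level": KBInformationLevel.PARTIAL,
    }

def orchestrator_node(state: ConversationState) -> dict:
    # Reads conversation_contextual_memory and is_qualified; writes only next_agent.
    if state.is_qualified is QualificationState.IN_PROGRESS:
        return {"next_agent": "discovery"}
    return {"next_agent": "technical_consultant"}

def discovery_node(state: ConversationState) -> dict:
    # Specialized agents own conversation_contextual_memory and agent_responses.
    memory = dict(state.conversation_contextual_memory)
    memory.setdefault("pain_points", []).append("manual reporting takes two days a week")
    reply = "Got it. How much time does that cost the team each week?"
    return {
        "conversation_contextual_memory": memory,
        "agent_responses": state.agent_responses + [reply],
    }

def apply_updates(state: ConversationState, updates: dict) -> ConversationState:
    # With strict ownership, a plain merge never conflicts within a turn.
    for key, value in updates.items():
        setattr(state, key, value)
    return state
```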

Stateless agents, stateful graph. That's the shape.

The tradeoff nobody talks about

Structured memory is not free. There is a cost, and writeups on this pattern mostly skip it.

Extraction is itself LLM work. When the discovery agent identifies a pain point in the buyer's turn, that identification — "this is a budget signal, this is a competitor mention, this is a timeline hint" — happens inside the discovery agent's own response generation, which is already an LLM call, so there is no extra round trip. But it still costs tokens. The system prompt for each specialized agent is longer because it includes the extraction schema. The output is longer because it includes the extracted fields alongside the visible response.
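
For a sense of what that extra output looks like, here is a sketch of the kind of structured response a specialized agent could return, with the extracted fields riding alongside the visible reply. The schema is an assumption for illustration; any structured-output mechanism would do.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DiscoveryTurnOutput:
    # The text the buyer actually sees.
    visible_response: str
    # Extracted in the same LLM call, so no extra round trip, but the schema
    # lengthens the system prompt and these fields lengthen the output.
    pain_points: list = field(default_factory=list)
    budget_signals: list = field(default_factory=list)
    competitor_mentions: list = field(default_factory=list)
    timeline_hints: list = field(default_factory=list)
    follow_up_question_asked: Optional[str] = None
```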

You have moved latency and cost from "re-derive state from 40 messages on every turn" to "extract structured state during this turn's response." The total is similar. The distribution is different.

This trade is worth it when multi-turn accuracy matters more than single-turn speed. In a sales conversation that might go 15 turns, yes — compounding accuracy beats a single fast response. In a one-shot Q&A system, probably no. Measure your own workload before you decide.

The honest framing: structured memory is a pattern for multi-turn systems with real state. It is not universally better. It is the right answer when the alternative is your agent losing track of what was said around turn 8.

The question to ask yourself

Open the trace of your longest production conversation. Walk through it turn by turn and ask: what did the agent have to re-figure-out on this turn that it should have just read from state?

Every "re-figure-out" is a place the 8-turn cliff is waiting for you.

What's the one piece of state your agent is re-deriving on every turn that it should save once?
