LLMs Get Lost in Multi-turn Conversation
The cat is out of the bag.
Pay attention, devs.
This is one of the most common issues when building with LLMs today.
Glad there is now a paper to share insights from.
Here are my notes:
The paper investigates how LLMs perform in realistic, multi-turn conversational settings where user instructions are often underspecified and clarified over several turns.
I keep telling devs to spend time preparing those initial instructions. Prompt engineering is important.
The authors conduct large-scale simulations across 15 top LLMs (including GPT-4.1, Gemini 2.5 Pro, Claude 3.7 Sonnet, DeepSeek-R1, and others) over six generation tasks (code, math, SQL, API calls, data-to-text, and document summarization).
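For intuition, here is a rough sketch of that simulation setup (not the paper's actual harness): a fully specified instruction is split into shards that are revealed one user turn at a time, and only the final answer gets scored. The OpenAI-style client, the model name, and the shards themselves are all illustrative assumptions.

```python
# Minimal sketch of a "sharded" multi-turn simulation: reveal one piece of
# the instruction per user turn, keep the growing history, score the final
# answer. Client and model name are placeholders, not the paper's harness.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical shards of one underspecified-then-clarified request.
shards = [
    "Write a Python function that filters a list of orders.",
    "Only keep orders with status 'shipped'.",
    "Also sort the result by order date, newest first.",
]

messages = [{"role": "system", "content": "You are a helpful coding assistant."}]
for shard in shards:
    messages.append({"role": "user", "content": shard})
    reply = client.chat.completions.create(
        model="gpt-4.1",  # placeholder; any chat model works here
        messages=messages,
    )
    answer = reply.choices[0].message.content
    messages.append({"role": "assistant", "content": answer})

print(answer)  # only the final-turn answer is evaluated
```

The single-turn baseline is the same call with all three shards concatenated into one user message, which is what makes the two settings directly comparable.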
Severe Performance Drop in Multi-Turn Settings
All tested LLMs show significantly worse performance in multi-turn, underspecified conversations compared to single-turn, fully-specified instructions. The average performance drop is 39% across six tasks, even for SoTA models. For example, models with >90% accuracy in single-turn settings often drop to ~60% in multi-turn settings.
Degradation Is Due to Unreliability, Not Just Aptitude
The performance loss decomposes into a modest decrease in best-case capability (aptitude, -15%) and a dramatic increase in unreliability (+112%).
In multi-turn settings, the gap between the best and worst response widens substantially, meaning LLMs become much less consistent and predictable.
High-performing models in single-turn settings are just as unreliable as smaller models in multi-turn dialogues. Don't ignore testing and evaluating in multi-turn settings.
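A minimal sketch of how that decomposition can be computed, assuming (my reading of the paper) that aptitude is the 90th-percentile score over repeated runs of the same task and unreliability is the gap between the 90th and 10th percentiles:

```python
# Hedged sketch of the aptitude/unreliability decomposition; the exact
# percentile definitions are an assumption about the paper's metrics.
import numpy as np

def aptitude_and_unreliability(scores):
    """scores: per-run scores (0-100) for one model on one task."""
    p90 = np.percentile(scores, 90)  # best-case capability (aptitude)
    p10 = np.percentile(scores, 10)  # worst-case behavior
    return p90, p90 - p10            # aptitude, unreliability

# Toy numbers: a capable model that becomes erratic across multi-turn runs.
single_turn = [92, 90, 94, 91, 93, 92, 90, 95, 91, 93]
multi_turn  = [88, 35, 90, 42, 86, 30, 91, 55, 89, 40]

for name, s in [("single-turn", single_turn), ("multi-turn", multi_turn)]:
    apt, unrel = aptitude_and_unreliability(s)
    print(f"{name}: aptitude={apt:.0f}, unreliability={unrel:.0f}")
```

Note how the multi-turn toy scores keep a high top end (aptitude barely moves) while the percentile gap explodes, which is exactly the pattern the paper reports.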
Main reasons LLMs get "lost"
- Make premature and often incorrect assumptions early in the conversation.
- Attempt full solutions before having all necessary information, leading to "bloated" or off-target answers.
- Over-rely on their previous (possibly incorrect) answers, compounding errors as the conversation progresses.
- Produce overly verbose outputs, which can further muddle context and confuse subsequent turns.
- Pay disproportionate attention to the first and last turns, neglecting information revealed in the middle turns (the "lost-in-the-middle" effect).
Practical Recommendations:
- Users are better off consolidating all requirements into a single prompt rather than clarifying over multiple turns.
- If a conversation goes off-track, starting a new session with a consolidated summary leads to better outcomes (see the sketch after this list).
- System builders and model developers are urged to prioritize reliability in multi-turn contexts, not just raw capability. This is especially true if you are building complex agentic systems, where these issues compound across many turns.
- LLMs are really weird. And all this weirdness is creeping into the latest models too, but in more subtle ways. Be careful out there, devs.
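To make the first two tips concrete, here is a hypothetical helper implementing the restart-with-a-summary pattern. It naively concatenates the user's turns into one fully specified prompt; the OpenAI-style client, model name, and function are my own illustration, and a real version might ask the model itself to write the summary.

```python
# Hypothetical helper for the "restart with a consolidated summary" tip:
# collect the user's requirements from a drifting conversation and replay
# them as one fully specified prompt in a fresh session.
from openai import OpenAI

client = OpenAI()

def restart_consolidated(messages, model="gpt-4.1"):
    """Fold all user turns into one prompt and start a new session."""
    requirements = [m["content"] for m in messages if m["role"] == "user"]
    consolidated = (
        "Complete this task. Full requirements:\n- " + "\n- ".join(requirements)
    )
    fresh = client.chat.completions.create(
        model=model,
        # A single-message session: none of the stale assistant turns that
        # tend to derail the model are carried over.
        messages=[{"role": "user", "content": consolidated}],
    )
    return fresh.choices[0].message.content
```

Plain concatenation is the simplest consolidation strategy; the point is that the fresh call drops the model's earlier (possibly wrong) assumptions instead of letting it keep anchoring on them.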