LLMs Turned Conversation Design Into Perpetual QA
Once upon a yesterday, most Conversation Designers owned the entire loop:
Empathize → Define → Ideate → Prototype → Test & QA → Implement
We wore every hat. Researcher. Writer. Prototyper. Builder. Validator. Sometimes PM. Sometimes engineer. Often all in one day.
When systems were slow, deterministic, and predictable, the work stayed centered on design. You could reason about behavior upfront and trust that it would mostly hold.
LLMs changed that.
With probabilistic models and constantly shifting outputs, the work is drifting toward system analysis: reviewing conversations, auditing behavior, tagging root causes. That’s where system behavior is actually inspected now. And it’s where trust is either built or broken.
But this kind of QA can’t stop at checklists or accuracy scores.
Maybe the wrong answer wasn’t a prompt issue.
Maybe the source data was outdated.
Maybe the model ignored relevant context.
Maybe it hallucinated entirely.
To fix it, you have to trace the full system: prompt → retrieval → model behavior → output → user impact.
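One way to make that trace concrete is to capture every stage on a single review record, so root causes get tagged consistently instead of living in someone's head. The sketch below is purely illustrative; the names (`TurnReview`, `RootCause`, the individual fields) are assumptions for this post, not any vendor's schema:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class RootCause(Enum):
    """Illustrative failure categories for a reviewed turn."""
    PROMPT_ISSUE = "prompt_issue"          # instructions too open or ambiguous
    STALE_SOURCE = "stale_source"          # retrieved content was outdated
    IGNORED_CONTEXT = "ignored_context"    # model skipped relevant context it was given
    HALLUCINATION = "hallucination"        # output not grounded in any source
    OUT_OF_SCOPE = "out_of_scope"          # issue originated outside the assistant


@dataclass
class TurnReview:
    """One reviewed turn, traced across the full pipeline."""
    conversation_id: str
    user_message: str
    prompt_version: str                     # which prompt was live at the time
    retrieved_sources: list[str]            # doc IDs or URLs the model actually saw
    model_output: str
    user_impact: str                        # e.g. "quoted last year's refund policy"
    root_cause: Optional[RootCause] = None  # filled in during QA review
    notes: str = ""
```

The specific fields matter less than the shape: prompt, retrieval, model behavior, output, and user impact all live on one record, so a reviewer can tell which stage actually broke.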
And while you’re doing that, the ground keeps moving. Vendors update models. Engineering tweaks backend rules. Even subtle upstream changes can quietly unravel yesterday’s work. A prompt that behaved perfectly three days ago can fail today, without warning.
No single Conversation Designer can keep a system healthy long-term under those conditions. Not when models shift constantly and every fix cascades. Maintaining trust is a team sport now.
We may see fewer traditional CxD roles and more specialized paths emerge: analyst, conversation reviewer, QA lead, LLM workflow strategist. In chasing automation, we created perpetual QA.
Traditional QA asked, “Did this pass?”
LLM-era QA has to ask, “Is this still behaving as expected?”
That’s a full-time job.
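Here is one minimal way to make "is this still behaving as expected?" operational: a small regression check that re-runs known conversations on a schedule and checks properties of the output rather than exact strings. This is a sketch under assumptions; `call_assistant` is a hypothetical stub you would wire to your own endpoint, and `REGRESSION_CASES` is invented for illustration:

```python
from datetime import datetime, timezone


def call_assistant(prompt: str, user_message: str) -> str:
    # Hypothetical client for the deployed assistant; replace with your own call.
    raise NotImplementedError("wire this to your model endpoint")


# Expectations are phrased as properties of the output, not exact strings,
# because probabilistic outputs rarely match verbatim.
REGRESSION_CASES = [
    {
        "user_message": "How do I reset my password?",
        "must_mention": ["reset link", "email"],
        "must_not_mention": ["call support"],  # a deflection we already fixed
    },
]


def run_regression(prompt_version: str, prompt: str) -> list[dict]:
    """Re-run the same cases on a cadence and log drift, not just pass/fail."""
    results = []
    for case in REGRESSION_CASES:
        output = call_assistant(prompt, case["user_message"]).lower()
        missing = [p for p in case["must_mention"] if p not in output]
        forbidden = [p for p in case["must_not_mention"] if p in output]
        results.append({
            "checked_at": datetime.now(timezone.utc).isoformat(),
            "prompt_version": prompt_version,
            "user_message": case["user_message"],
            "passed": not missing and not forbidden,
            "missing": missing,
            "forbidden": forbidden,
        })
    return results
```

The point isn't these particular checks. It's that they run on a schedule, so a prompt that drifted after a vendor update gets caught before users notice.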
One way this work can be divided across roles:
- Discover → AI Analysts or Program Managers
- Design & Implement → Conversation Designers
- QA → Specialists auditing fallbacks and tagging patterns
- Gather Insights → Quality or CX leads triangulating metrics with confusion signals (Was that 1-star CSAT about the system, or about a branding change outside its scope?)
- Resolve → Once QA confirms the signal, Conversation Designers update prompts or route issues to the right team
Startups may still expect one person to do it all. It’s rarely sustainable.
And depending on how teams are structured, LLM evaluation often requires joint QA and review. QA might flag an issue, but someone still has to trace why. Was the content wrong? The model confused? The prompt too open? You can’t debug what you can’t diagnose. Without shared evaluation flows, ownership gets murky fast.
If your org has split the work, a few questions matter:
- Who owns what, especially if you have multiple bots?
- How do you keep context intact across handoffs?
- What happens when quality is judged through a different lens than the one it was designed with?
- How do you prevent Conversation Designers and QA from duplicating effort when both need to review the same conversations to understand model behavior?
This shift doesn’t mean conversation design matters less. It means the work has moved closer to the system itself.
Designing trust now means maintaining it.