Case Study

Building Figma’s First LLM QA Program

A system that detects model failures, protects customer trust, and teaches the organization how to evaluate AI with rigor.

The System I Built

This system creates a consistent, repeatable way to understand model behavior across layers, from intent to calibration.

To replace anecdotal evaluation with measurable quality, I designed a modular LLM QA system that combined signals-based scoring, structured failure detection, and a repeatable human-in-the-loop workflow. Each component was built to scale across teams while reducing cognitive load for specialists and giving Engineering actionable feedback.
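To make the scoring model concrete, here is a minimal sketch of what a single reviewed response could look like as data. It is illustrative only: the signal names, the 0–2 scale, and the failure-mode labels are hypothetical placeholders standing in for the rubric described above, not Figma's actual implementation.

```python
# Hypothetical sketch of a signals-based review record with failure-mode tagging.
# Signal names, the 0-2 scale, and the failure labels are illustrative
# placeholders, not the production rubric.
from dataclasses import dataclass, field
from enum import Enum


class FailureMode(Enum):
    HALLUCINATION = "hallucination"   # ungrounded or fabricated claim
    INTENT_MISS = "intent_miss"       # answered the wrong question
    SAFETY_DRIFT = "safety_drift"     # strays from policy in a sensitive flow
    CALIBRATION = "calibration"       # over- or under-confident phrasing


@dataclass
class SignalScores:
    """Per-signal scores on a 0-2 scale: 0 = fail, 1 = partial, 2 = pass."""
    intent: int
    grounding: int
    calibration: int


@dataclass
class ReviewRecord:
    """One human-in-the-loop review of a single model response."""
    response_id: str
    reviewer: str
    scores: SignalScores
    failure_modes: list[FailureMode] = field(default_factory=list)
    notes: str = ""

    @property
    def passed(self) -> bool:
        # A response passes only when no signal scored a hard fail (0).
        return min(self.scores.intent, self.scores.grounding, self.scores.calibration) > 0
```

Keeping the scores and the tagged failure modes on the same record is what lets one review feed both the quality metric and the failure clusters Engineering triages.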

The QA Modules

These core modules form the system's architecture. They work together to surface patterns, score outputs consistently, and drive actionable model improvements.
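Building on the hypothetical record above, the sketch below shows how individual reviews could roll up into the pattern-level view the modules are meant to produce: a ranked list of failure clusters plus an overall pass rate. Again, this is an illustrative sketch reusing the placeholder ReviewRecord and FailureMode types, not the production pipeline.

```python
# Illustrative roll-up: turning a batch of reviews into a pattern summary
# (failure clusters plus a pass rate), reusing the hypothetical
# ReviewRecord / FailureMode types sketched earlier.
from collections import Counter


def cluster_failures(records: list[ReviewRecord]) -> list[tuple[FailureMode, int]]:
    """Count tagged failure modes across a review batch, most frequent first."""
    counts = Counter(mode for record in records for mode in record.failure_modes)
    return counts.most_common()


def pass_rate(records: list[ReviewRecord]) -> float:
    """Share of reviewed responses with no hard signal failure."""
    return sum(r.passed for r in records) / len(records) if records else 0.0
```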

Impact

The QA program changed how teams shipped, measured, and improved AI features.

  • For Product

    • Stronger grounding in long-context tasks due to clearer signal criteria
    • Faster regression detection through weekly pattern surfacing
    • Reduced safety drift in sensitive flows from structured failure-mode tagging

  • For the Org

    • A shared quality language across Support, PM, Design, and Research
    • Faster decisions driven by pattern-based insights instead of anecdotal reports
    • Predictable evaluation cadence that shifted teams from reactive debugging to proactive governance
    • Clearer prioritization for Engineering based on tagged failure clusters

  • For Customers

    • Noticeably fewer misleading or incorrect responses in guided steps
    • More reliable multi-step assistance due to improved grounding
    • Smoother, more consistent experiences across surfaces as drift was identified earlier

Why This Matters

The QA program created a reliable foundation for how Figma evaluates AI behavior. It gave Support, Product, and Engineering a shared framework for understanding quality, surfaced model patterns that influenced roadmap decisions, and established an evaluation rhythm the teams could depend on.

By replacing anecdotal feedback with structured evidence, the system made it easier to spot regressions, prioritize improvements, and measure progress over time. It also established a repeatable evaluation approach that can be applied to any feature or model.

The methods behind this program draw on systems I’d been developing long before Figma, and they continue to shape how I design quality frameworks in new environments.

From Talk to Practice

These slides walk through the architecture and methods behind this QA program. I presented them at a conference session focused on practical AI evaluation.

See the full deck

Marlinda Galapon, AI Experience Architect
