Case Study

Teaching AI to Hear Everyone

AI fails when it only learns from the easiest voices to hear. I built a system that teaches it to listen to everyone else.

What I Built

These components rebuilt the linguistic backbone the model had been missing, letting it learn from the full range of real communication patterns.

    1. A priority list of underserved linguistic varieties

    Centered African-American English, Chicano English, Southern US English, and ESL patterns that commonly appear in real customer interactions yet remain underrepresented in most training datasets.

    2. A demographic-aware sampling model

    Designed data collection to include context, not just text. Captured demographic factors such as gender identity, race, education level, and age to reflect how these variables shape linguistic expression (a minimal sampling sketch follows this list).

    3. A scalable, org-wide training data library

    Created a shared repository of diverse utterances that any team could use when training or retraining intents, ensuring language inclusivity became a standard rather than a one-off effort.

    4. A sociolinguistic field methodology for AI

    Applied standardized sociolinguistic methods to select linguistic varieties intentionally, collect data responsibly, and embed ethical AI practices directly into the research design.

    5. A large-scale dataset with measurable coverage

    Collected over 24,000 utterances from more than 530 participants, building one of the most comprehensive linguistic datasets in the org focused on dialect variation and diversity (a coverage-tracking sketch also follows this list).

    6. An organizational literacy layer around language

    Introduced workshops and education moments that helped teams understand pronoun variation, emoji pragmatics, dialect differences, and the real-world nuance behind user intent, shifting internal culture toward more linguistically informed design.
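
To make the sampling idea concrete, here is a minimal sketch of what demographic-aware, quota-based collection could look like. It assumes a simplified Utterance record and (variety, age band) sampling cells; the field names, category labels, and quota logic are illustrative stand-ins, not the production pipeline.

```python
from collections import Counter
from dataclasses import dataclass
import random

@dataclass
class Utterance:
    text: str
    variety: str      # e.g. "African-American English", "Chicano English", "ESL"
    age_band: str     # e.g. "18-29", "30-44", "45-64", "65+"
    gender: str
    education: str

def stratified_sample(pool: list[Utterance], per_cell_quota: int) -> list[Utterance]:
    """Keep at most `per_cell_quota` utterances from each (variety, age_band)
    cell so that no single group dominates the training mix."""
    shuffled = random.sample(pool, len(pool))  # avoid order bias from the source export
    taken, counts = [], Counter()
    for utt in shuffled:
        cell = (utt.variety, utt.age_band)
        if counts[cell] < per_cell_quota:
            taken.append(utt)
            counts[cell] += 1
    return taken
```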
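And a rough sketch of what "measurable coverage" can mean in practice: tallying each target variety's share of the collected utterances and flagging anything below a representation floor. The 10% floor and the variety labels are assumptions for illustration only.

```python
from collections import Counter

def coverage_report(variety_labels, target_varieties, floor=0.10):
    """Summarize each target variety's share of the dataset and flag
    varieties that fall below a minimum representation floor."""
    counts = Counter(variety_labels)
    total = sum(counts.values()) or 1
    return {
        v: {
            "utterances": counts.get(v, 0),
            "share": round(counts.get(v, 0) / total, 3),
            "meets_floor": counts.get(v, 0) / total >= floor,
        }
        for v in target_varieties
    }

# Example call with hypothetical labels:
# coverage_report(labels, ["African-American English", "Chicano English",
#                          "Southern US English", "ESL"])
```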

Impact

  • On the model
      • Improved intent recognition across dialects
      • Reduced misclassification from unfamiliar grammar or pragmatics
      • Lowered unnecessary escalations
      • Improved automation success rates for customer groups historically underserved by AI systems
  • On the business
      • Increased the bot’s functional coverage across a wider spectrum of customer communication styles
      • Increased the total addressable market for conversational channels
      • Provided a repeatable framework for ethical and inclusive AI training
  • On the organization
      • Set a new internal standard for how linguistic data should be collected
      • Highlighted practical, rigorous methods for responsible AI
      • Became the reference model for future intent training cycles across the company

Why This Matters

Language encodes identity, region, culture, and lived experience. When AI misinterprets someone because of their dialect, the failure is structural, not personal. Rebuilding this dataset expanded the range of people the system can understand accurately, closing gaps that had previously excluded entire communities.

Extending the Model’s Reach

  • Improved model accuracy on underrepresented linguistic patterns
  • Reduced misinterpretation and unnecessary escalations
  • Increased the model’s functional coverage across diverse customer groups
  • Established a standardized framework for inclusive linguistic data collection
  • Set a new precedent for evidence-based practices in AI training

Dreamforce Presentation Deck

A complete look at the research program, dataset scale, and linguistic principles behind rebuilding Salesforce’s intent training library.

See the full deck

Marlinda Galapon
AI Experience Architect
