Case Study
AI fails when it only learns from the easiest voices to hear. I built a system that teaches it to listen to everyone else.
I led a research initiative that rebuilt Salesforce’s intent training library using sociolinguistic field methods, community-centered data collection, and demographic-aware sampling. This work prioritized historically underrepresented varieties, including African-American English, Chicano English, Southern US English, and ESL speakers. Over 530 participants contributed more than 24,000 utterances, giving the model exposure to real patterns like indirect requests, regional markers, emoji pragmatics, code-switching, and nonstandard grammar.
The result was an AI system that could interpret intent across dialects without penalty, improving accuracy for communities previously misclassified and expanding Salesforce’s total addressable market by making the bot reliable across a broader range of customer communication patterns.
The training data powering Salesforce’s bots skewed toward “standardized” English, which meant the model struggled with customers whose linguistic patterns fell outside the dataset’s narrow language range. Biased training data produces biased bots; linguistically inclusive training data produces linguistically inclusive bots.
These gaps created predictable failures. The real failure lived in the dataset itself, shaped by whose voices were present and whose were missing.
The goal: create a training data system that reflects real linguistic diversity, so AI can listen accurately across communities instead of forcing customers to conform to a narrow language standard.
These components rebuilt the linguistic backbone the model had been missing, letting it learn from the full range of real communication patterns.
Centered African-American English, Chicano English, Southern US English, and ESL patterns that commonly appear in real customer interactions yet remain underrepresented in most training datasets.
Designed data collection to include context, not just text. Captured demographic factors like gender identity, race, education level, and age to reflect how these variables shape linguistic expression.
Created a shared repository of diverse utterances that any team could use when training or retraining intents, ensuring language inclusivity became a standard rather than a one-off effort.
Applied standardized sociolinguistic methods to select linguistic varieties intentionally, collect data responsibly, and embed ethical AI practices directly into the research design.
Collected over 24,000 utterances from more than 530 participants, building one of the most comprehensive linguistic datasets in the org focused on dialect variation and diversity.
Introduced workshops and education moments that helped teams understand pronoun variation, emoji pragmatics, dialect differences, and the real-world nuance behind user intent, shifting internal culture toward more linguistically informed design.
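As a minimal sketch of the demographic-aware sampling idea described above (the record fields, dialect labels, and example utterances are illustrative assumptions, not the actual Salesforce schema or tooling):

```python
import random
from collections import Counter
from dataclasses import dataclass

@dataclass(frozen=True)
class Utterance:
    """One contributed utterance plus the context captured alongside it."""
    text: str
    dialect: str    # illustrative labels, e.g. "AAE", "Chicano English"
    age_band: str
    education: str

def stratified_sample(pool, per_dialect, seed=0):
    """Draw up to `per_dialect` utterances from each dialect group,
    so no single variety dominates the training set."""
    rng = random.Random(seed)
    by_dialect = {}
    for u in pool:
        by_dialect.setdefault(u.dialect, []).append(u)
    sample = []
    for _, items in sorted(by_dialect.items()):
        items = list(items)
        rng.shuffle(items)
        sample.extend(items[:per_dialect])
    return sample

# Hypothetical pool of collected utterances with demographic context.
pool = [
    Utterance("where my order at", "AAE", "25-34", "college"),
    Utterance("order status??", "AAE", "18-24", "high school"),
    Utterance("can y'all check on my package?", "Southern US", "35-44", "college"),
    Utterance("i am needing the refund please", "ESL", "45-54", "graduate"),
    Utterance("my order never showed up, órale", "Chicano English", "18-24", "high school"),
]

balanced = stratified_sample(pool, per_dialect=1)
print(Counter(u.dialect for u in balanced))
```

The point of the sketch is the quota step: rather than letting the majority variety swamp the intent library, each linguistic variety contributes up to a fixed share, and the demographic fields travel with every utterance so downstream teams can audit coverage.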
Language encodes identity, region, culture, and lived experience. When AI misinterprets someone because of their dialect, the failure is structural, not personal. Rebuilding this dataset expanded who the system can understand with accuracy, closing gaps that previously excluded entire communities.
A complete look at the research program, dataset scale, and linguistic principles behind rebuilding Salesforce’s intent training library.
See the full deck