Have you ever wondered why Siri understands some people's voices better than others?
Linguistics Professor Nicole Holliday has an answer. She explores how biases develop in the speech AI behind Siri and other applications, and the real-world consequences those biases carry.
A lack of representation in training data causes biases in speech AI, making these systems work less well for underrepresented groups such as elderly people, Black people and people who speak English as a second language, according to Holliday.
Her research shows how socially prescriptive systems, ones that dictate how people should “talk,” fail because of their biases and their inability to account for social context. Companies using these systems may unfairly target marginalized individuals, labeling their speech as insufficiently “professional” based on flawed system metrics.
Nicole Holliday spoke to Berkeley Social Sciences about the social uses and effects of speech AI. This interview has been edited for clarity.
Please tell us more about your background and how you ended up at UC Berkeley.
Nicole Holliday: This is my first year at Berkeley. I joined after spending some years at Pomona College and at the University of Pennsylvania. I am a sociolinguist, so I study everything that has to do with language as a social object. In particular, I focus on speech sounds, so that’s the area of sociophonetics. Before I was at Pomona College, I got my Ph.D. in the Department of Linguistics at New York University. Before that, I was an undergrad at The Ohio State University. I'm originally from Columbus, Ohio.
How is AI used in linguistics?
Nicole Holliday: As humans started to interact with speech systems more, it’s become clear that we need a better understanding of how language variation can cause these systems to work better for some groups than others. Ten years ago, Siri didn't recognize anybody's voice very well. It was making systematic errors for groups such as children, elderly people and people who speak English as a second language.
One thing to know about speech AI is that your output is going to match your input to a certain extent. So if we take a really easy example, imagine that you create a voice assistant and you train it on 1,000 white women from San Francisco. When other people talk to it, it's going to be worse if they're not white women from San Francisco, because it wasn't trained on them. It doesn't know how to deal with regional, gender and age differences, for example.
How do biases develop in AI?
Nicole Holliday: The lack of representation in the training data is a really major issue. There's another layer to this that I think a lot of people don't understand if they're not familiar with large language models. If you have a generative AI system that has vacuumed up all of this text from the internet, or all of this speech from the internet, the size of that data is so large that it's really a challenge for the companies to do any sort of content moderation or management.
When we're talking about issues of representation, or how people interact with these systems, having large data sets with a ton of representation can be really good. It means that the systems have more information, and they work better for more people. But it also lowers the amount of control that the companies profiting from these systems have over them.
What would you say are some key findings in your research?
Nicole Holliday: What I've been working on lately is systems that I call socially prescriptive speech technologies, or SPSTs, that establish and enforce “standards” for how people should speak.
One example of an SPST that I've been working with lately is Zoom Revenue Accelerator, a system that uses AI to evaluate interactions between employees and customers. It evaluates the speech of an employee so that management sees a set of scored outputs. It will say, for example, that the employee is talking too fast or that the employee is using too many pauses. All of these are prescriptive standards about the way the system thinks a human “should” talk. That's really an issue, especially because the Zoom Revenue Accelerator does not tell you which metrics you should be aiming for.
Beyond that, there's a more basic issue, which is that humans are really good at talking to other humans and at adjusting their speech style to accommodate the other person. For example, older people, on average, speak more slowly than younger people. So the Zoom Revenue Accelerator might see two 20-year-old students, both talking at 190 words per minute and having a perfectly successful conversation, but it docks scores for both of them because, by its standards, they're talking too fast.
We know from the linguistic literature that when we move in the direction of the person we're talking to, we understand them better. So having a system like the Zoom Revenue Accelerator that says “this is exactly how fast you should be talking to everyone, all the time, to sound ‘professional’” goes against what we know about successful human interaction, since speech rate is something that speakers negotiate amongst themselves within each conversation in real time.
How might your findings be used in society?
Nicole Holliday: Automatic speech recognition, we know, does not perform as well for African American speakers as it does for white speakers. We have a lot of data on that. So it's already misunderstanding some groups of people before it even gets to judging their speech, which then compounds the system's issues.
If these systems are being deployed in employment or educational contexts, then because they already have a built-in racial bias, which is what I've found, a company that uses them can say to an employee, “Well, we have these metrics which show that you're not an effective communicator,” regardless of their actual job performance. So it allows companies, if they're not working in good faith, to perpetuate other types of inequality, such as racism or sexism, by claiming that certain groups are just “worse at talking.”
Do you have anything else to add to the interview?
Nicole Holliday: I don't hate AI and I'm not against technology. What I'm against is the technology pretending that it can do sociological things, because sociological things are fundamentally human. It can't do that. These systems are making sociological judgments about language, which they are not equipped to do and which can cause significant social harms when they’re deployed at scale.