Tech & AI and Science and Society

Hey Siri, do I sound like a girl?

Helping humans and machines find their voice

By Irian D'Andrea

Designs by Eleanor Wang

October 2, 2024

Have you ever wished you could change your voice, or realized with embarrassment that you laugh just like your dad? Maybe you have noticed that you tend to mirror the voices of those around you, or perhaps you have caught yourself slipping back into an old accent that reveals something about your past. It’s possible you don’t often think about your voice. Or perhaps you think about your voice almost every day, out of necessity.

The complex process by which we arrive at our normal speaking voice—and the extent to which there is such a thing as a normal speaking voice—lives delectably at the intersection of culture, gender, acoustics, and self-expression. As both a trans woman and a physicist, I have spent many hours trying to understand, analyze, and shape my voice.

If you are a trans woman in the year 2024, you have a lot of options when it comes to learning how to modify your voice. From quick tips on TikTok, to video essays on YouTube, and even private one-on-one lessons, creators—usually trans women—are sharing their strategies, exercises, practice regimens, and encouragement with a community of people interested in voice training. This reflects a decades-long tradition of trans people—especially trans women, whose voices do not change with hormone therapy—teaching each other the tools necessary to modify their voices.

While these resources are a great start, they have limitations. Importantly, unless someone is receiving one-on-one attention from a teacher, online resources don’t provide any direct feedback on one’s voice—a crucial part of the learning process. Without it, progress will likely be slow or nonexistent. As a result, there has been a growing demand for software that analyzes the gendered characteristics of a voice in real time. Thus far, this has proven to be a surprisingly elusive goal. However, researchers at UC Berkeley are pushing the edge of current speech processing capabilities to potentially make such a tool a reality.

Speech-language pathology

Historically, the field of speech-language pathology has been focused on helping people with atypical voices or speech impediments. If someone has a lisp that makes them hesitant to speak or a stutter that is interfering with their career, they might consult a speech-language pathologist. Over the last ten years, however, demand for and interest in gender-affirming voice therapy has exploded. “At the American Speech and Hearing Conference in Boston this year, the sessions that addressed transgender voice were sold out, at full capacity. Over the last five or six years it’s been a big growing area of our field,” explains Jocelyne Leger, a clinical instructor at Emerson College.

Despite this rapid growth, these services remain financially inaccessible to many trans people. “It has not been until recently that insurers will pay for voice modification, and many still do not pay for it,” says Jocelyne. Even when access is not an issue, trans people as a community tend to be skeptical of consulting cis people (who are often ignorant of trans experiences and needs) on transition-related matters. This makes it all the more important to develop alternative voice training resources for people who are not able to or do not wish to access professional guidance.

Enter: the machine and its penchant for learning

If asked to describe a person's voice, you would probably start by using words like "deep," "buzzy," "nasal," or "smooth." These intuitive descriptions are often called perceptual qualities. There is a (not completely agreed upon) set of such perceptual qualities that are used in the speech community and the trans voice community, including, for instance, "strain," "breathiness," and "weight." Teachers and clinicians use these descriptions to communicate the overall impression of a voice in a way that humans can understand and, with practice, recognize and replicate. While it is intuitive for most untrained human listeners to perform this sort of perceptual analysis, it remains an elusive goal to get computers to carry out the same task. A team at UC Berkeley is working hard to train computers in the art of perceptual analysis.

Robin Netzorg, a graduate student in Computer Science, together with the Berkeley Speech Group, is working on accomplishing this goal. Robin explains that “While non-experts can modify an image or sentence perceptually with available tools, it isn’t clear how to similarly modify speech along perceptual lines.”

Robin is primarily interested in perceptual voice analysis as a tool for trans voice training. “The reason why I got into this is because I'm interested in gender-affirming healthcare, first and foremost,” says Robin. When asked about her big picture goals, Robin states, “I want to build a gender voice training feedback tool for folks. I think that's the clearest thing that I want to do right now: just release these models to the public.”

That said, by beginning to investigate this topic, Robin noticed that perceptual voice analysis represents a gap in existing human-computer interaction, with broader applicability than just trans voice training. “There are limits to our current human-computer interfacing abilities. For example, nobody is close to being able to use machine learning to describe smell,” explains Robin. In terms of difficulty, perceptual voice analysis appears to be somewhere between smell and the more “solved” tasks, like language processing and image recognition.

An important distinction needs to be made here: those who have been following the world of machine learning will likely know that “deepfakes” and other machine learning models can reproduce the vocal quality of any speaker. However, such models are static black boxes: once trained, they allow someone to make, for example, President Biden say whatever they would like, but they do not give insight into the quality of his voice: does he sound tired, sick, or happy? Such models are said to be “not interpretable,” in the sense that a human cannot consistently understand why certain decisions or predictions have been made by the model. Gopala Krishna Anumanchipalli, Assistant Professor at UC Berkeley and member of the Berkeley Speech Group, says, "With the widespread use of deep learning, having the models be interpretable is more important now than ever.”

The Berkeley Speech Group is working on two voice models: an analysis tool that listens to a voice and describes it in terms of perceptual qualities, and a modification tool which allows users to take a voice recording and modify it along perceptual axes. “You have these knobs and you can just play with them, and it could generate your voice with these perceptual shifts. Then you could [...] explore the space and be like, okay, this is the voice that I want,” explains Robin. Together, these two tools represent a major step towards providing accessible voice training feedback.

The versatility of these models and the focus on getting a voice that you want (as opposed to one that is just stereotypically feminine) resonates with the experience of many trans women. As Robin puts it, “The reason why I transitioned is to be better aligned with myself. So to be a version of myself that I want to be and that I'm happy with. And so that should be my primary concern: am I getting a voice that I'm happy with?” She was also careful to point out that, for many trans women, passing is a matter of safety. “These tools can also be a benefit if you need to make sure your voice is read as feminine because you’re not in a safe environment,” she says.

(Re)-defining perceptual qualities

A sort of “step zero” in this process is better defining the meaning of vocal perceptual qualities, and importantly, determining to what extent people can perceive and agree on them. Otherwise, they may not be properties that are worth assigning numerical values to—a prerequisite for predicting those values with machine learning.

Previous studies have found that speech specialists are able to reliably agree when assigning values to the perceptual qualities of voices they hear. However, such studies typically use a reduced set of five parameters—roughness, breathiness, strain, pitch, and loudness—as defined by a standard called the CAPE-V (Consensus Auditory-Perceptual Evaluation of Voice). These five parameters are usually augmented by an additional binary property: “sex,” which takes on the values “male” and “female.” However, Robin and others have argued that despite appearing to be an intuitively useful perceptual quality, the perceived sex of the speaker should not be on this list. Rather, like other vocal qualities, the perceived sex of a speaker properly lies on a spectrum, and any attempts to assign a binary label to it not only serve to alienate trans and nonbinary folks but may lead to poor results, especially for androgynous voices. “The technical reason is that we can just build better models if we don't include speaker gender ... the philosophical reason is that, from a scientific perspective, it's not very interesting, because it doesn't actually tell you if you know how diverse masculine and feminine voices are,” explains Robin.

In their search for an accurate and useful set of perceptual qualities, the research team reached out to three voice coaches who specialize in transgender voice training. Together, they reached the consensus that current voice training methodologies are particularly concerned with two perceptual qualities as indicators of gender: vocal resonance and vocal weight. These are the main “knobs” you learn to control when undergoing trans voice training.

To understand the salience of these qualities, one must first understand the basics of voice production. In a nutshell, the mechanism for generating voice can be divided into three major parts: the vocal folds within the larynx (the “voice box”), the resonators (the throat, mouth, and nasal passages), and the articulators (the tongue, soft palate, and lips). Like a plucked guitar string, the vocal folds vibrate rapidly under tension and create an initial sound. This step is called “sound production,” and it primarily determines the pitch of the voice. Next, like the body of a guitar, the resonators amplify the sound, and primarily determine its timbre—its unique character as a sound as captured by the strength of its overtones. This step is called “sound modification.” Finally, the articulators form recognizable vowels and consonants, as well as allow for a highly intricate array of sounds. This step is called “articulation.”
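
For readers who like to tinker, this three-stage picture is essentially the classic source-filter model of speech, and it can be caricatured in a few lines of code. The sketch below is purely my illustration, not part of the Berkeley group’s work: it generates a buzzy glottal source at a chosen pitch and runs it through a handful of resonances standing in for the throat and mouth; the specific frequencies and bandwidths are ballpark assumptions.

```python
# A toy "source-filter" rendering of the three stages described above.
# The glottal source sets the pitch, the resonances shape the timbre, and
# the articulators (not modeled here) would carve the result into vowels
# and consonants. All of the numbers below are illustrative guesses.
import numpy as np
from scipy import signal

sr = 16000                       # sample rate in Hz
t = np.arange(0, 1.0, 1 / sr)    # one second of audio

# 1) Sound production: a buzzy, harmonic-rich source at a chosen pitch,
#    standing in for the vibrating vocal folds.
pitch_hz = 150
source = signal.sawtooth(2 * np.pi * pitch_hz * t)

# 2) Sound modification: pass the source through a few resonances
#    ("formants"), standing in for the throat, mouth, and nasal cavities.
formants_hz = [700, 1200, 2600]  # rough values for an "ah"-like vowel
bandwidth_hz = 100
voice = np.zeros_like(source)
for f in formants_hz:
    b, a = signal.iirpeak(f, Q=f / bandwidth_hz, fs=sr)
    voice += signal.lfilter(b, a, source)

voice /= np.max(np.abs(voice))   # normalize before listening or plotting
# 3) Articulation (tongue, soft palate, lips) is what this toy model omits.
```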

Vocal weight is a perceptual quality that is largely determined at the stage of sound production, where the size, tension, shape, and vibrational pattern of the vocal folds all play a role in the initial timbre of the sound. One could describe vocal weight as a scale of buzziness to smoothness, or heaviness to lightness, of the voice: think of the buzz of a bee versus the smoothness of a flute. In the parlance of audio engineering, it can be associated loosely with the “waveform” of the voice: for instance, labels like “square wave,” “triangle wave,” and “sine wave” could be applied here. The weight of a voice is strongly correlated with the thickness of the vocal folds, and for this reason, weight is a strong indicator of speaker sex. Those who experience a testosterone puberty undergo a thickening of their vocal folds, leading to a voice with more “weight” on average.
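
To put a rough number on that buzz-versus-smoothness intuition, you can compare how much energy a waveform carries in its overtones relative to its fundamental. The toy snippet below, my own illustration with an arbitrary pitch, does this for a pure sine wave and a square wave; the square wave’s rich overtone series is what the ear reads as a heavier, buzzier quality.

```python
# Compare the overtone content of a smooth (sine) and a buzzy (square)
# waveform at the same pitch -- a crude numerical stand-in for vocal weight.
import numpy as np
from scipy import signal

sr, pitch = 16000, 150           # sample rate and an arbitrary pitch in Hz
t = np.arange(0, 1.0, 1 / sr)
waves = {
    "sine (smooth)": np.sin(2 * np.pi * pitch * t),
    "square (buzzy)": signal.square(2 * np.pi * pitch * t),
}
for name, w in waves.items():
    energy = np.abs(np.fft.rfft(w)) ** 2
    freqs = np.fft.rfftfreq(len(w), 1 / sr)
    fundamental = energy[np.argmin(np.abs(freqs - pitch))]
    overtones = energy[freqs > 1.5 * pitch].sum()
    print(f"{name}: overtone-to-fundamental energy ratio = {overtones / fundamental:.3f}")
```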

Vocal resonance is a perceptual quality that is largely determined at the stage of sound modification. This is roughly synonymous with the brightness versus darkness of the voice, or with the richness and overall perceived depth. Vocal resonance is determined by the size and shape of the major resonant cavities, especially the pharynx—essentially the back of the throat, and the first major cavity after your vocal folds. While sound modification allows for endless variation and complexity, the simplest and most noticeable effect comes from the overall size of the resonators—think about the timbre of a guitar as opposed to that of a ukulele. For people whose testosterone increases during puberty, the overall size of their pharynx increases, on average, which generally causes resonance to darken, leading to a deeper voice.
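
The size effect can even be put into back-of-the-envelope numbers using the textbook approximation of the vocal tract as a uniform tube, closed at the vocal folds and open at the lips. The snippet below is only an illustration; the tract lengths are commonly cited ballpark figures, not measurements from this research.

```python
# Quarter-wave approximation: a uniform tube closed at the vocal folds and
# open at the lips resonates near F_n = (2n - 1) * c / (4 * L).
# The tract lengths below are typical textbook values, not study data.
SPEED_OF_SOUND = 343.0  # m/s, approximate

def formants(tract_length_m, n=3):
    return [(2 * k - 1) * SPEED_OF_SOUND / (4 * tract_length_m)
            for k in range(1, n + 1)]

for label, length in [("longer tract (~17.5 cm)", 0.175),
                      ("shorter tract (~14.5 cm)", 0.145)]:
    values = ", ".join(f"{f:.0f} Hz" for f in formants(length))
    print(f"{label}: first resonances near {values}")
```

A longer tube puts every resonance lower, which the ear hears as a darker, deeper quality; a shorter tube shifts them all upward and brightens the voice.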

While hormone-induced changes produce, on average, a bimodal distribution of physiological voice characteristics in the population, it is important to understand that most people learn to emphasize these qualities, settling into their voice habits through a confluence of physiology and culture. As a result, the perceived bimodality of “speaker sex” is exaggerated: men often learn to deepen their voice and add a buzzy, rich quality, while women often learn to raise their pitch, brighten their vowels and sibilants (“s” sounds), and adopt certain patterns of intonation. Rather ironically, the malleability of the human voice is partly responsible for the common sentiment that “speaker sex” as a voice quality is a relatively static property.

To better capture the true range of voice, the team decided to test the usefulness of a seven-parameter set of perceptual qualities, which consists of the CAPE-V set augmented by weight and resonance. The full set is: roughness, breathiness, strain, pitch, loudness, weight, and resonance.

Initial tests of learnability

The first task at hand is to determine the reproducibility and level of agreement between experts and non-experts about these qualities. That way, a human-interpretable machine learning model based on these qualities can be said to be, well… interpretable. There is also a simple practical reason why it is important to test the ability of non-experts to consistently rate perceptual voice qualities: to train a machine learning model, the team needs a large amount of labeled data, and incentivizing experts to label a large volume of data is expensive and time-consuming—a common problem in machine learning.

As a result, the UC Berkeley team gathered 150 audio clips from the PVQD (Perceptual Voice Qualities Database), a widely used database of “atypical” voices. They then had a panel of speech experts and a group of non-experts rate each clip on every one of the seven perceptual qualities, using a scale from 1 to 100.

What they found is that the perceptual qualities are reliably hearable by ensembles of non-experts—though not to the same extent as expert agreement, which is to be expected. This important step indicates that it may be possible to crowd-source labeled voice samples and obtain high quality data for training machine learning models. Robin remains open to other options, saying, “They do perform fairly well. I think there are possibly better ways to do it and so that's something that I want to explore in future work.”
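
As a concrete (and entirely synthetic) picture of what such a comparison can look like, the sketch below simulates a small panel of experts and a larger, noisier crowd rating the same clips on one quality, then checks how well the two groups’ averaged ratings line up. The numbers of raters, the noise levels, and the use of a simple correlation are assumptions for illustration only; a real reliability analysis would use the actual ratings and a more formal statistic such as an intraclass correlation.

```python
# Toy check of non-expert vs. expert agreement on one perceptual quality
# (say, "weight"), rated 1-100 per clip. All arrays are synthetic.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_clips = 150
true_weight = rng.uniform(1, 100, n_clips)                       # hypothetical consensus values
expert_ratings = true_weight + rng.normal(0, 5, (5, n_clips))    # 5 experts, small rating noise
crowd_ratings = true_weight + rng.normal(0, 15, (20, n_clips))   # 20 non-experts, noisier ratings

# Averaging over many noisy raters recovers a stable consensus per clip.
r, _ = pearsonr(expert_ratings.mean(axis=0), crowd_ratings.mean(axis=0))
print(f"correlation between expert and crowd consensus: r = {r:.2f}")
```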

Pushing forward with the same modest pool of 150 audio samples, the team was also able to train a machine learning model, called a “random forest regression model,” to rate audio clips in each perceptual quality on a scale of 1 to 100. This commonly used class of machine learning models excels at prediction tasks on small, structured datasets, while often remaining human-interpretable. Essentially, the model uses a collection of decision trees, each trained on a different subset of the data. The trees’ predictions are then averaged to produce the final best guess: hence, a decision “forest.” Using just 100 of the audio samples as training data, the model outperformed human non-experts at the task, performing more similarly to the expert pool.
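
To make that concrete, here is a minimal sketch of what a random-forest rater could look like, assuming each clip is summarized by simple MFCC statistics and using scikit-learn and librosa. The feature choice, hyperparameters, and helper names (clip_features, fit_raters) are my own assumptions, not the team’s actual pipeline.

```python
# Minimal sketch of a random-forest rater: summarize each clip with MFCC
# statistics and regress the 1-100 perceptual ratings. Illustrative only;
# the Berkeley team's actual features and training setup may differ.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

QUALITIES = ["roughness", "breathiness", "strain", "pitch",
             "loudness", "weight", "resonance"]

def clip_features(path):
    """Collapse one audio clip into a fixed-length vector of MFCC means and stds."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

def fit_raters(paths, ratings):
    """ratings: dict mapping quality name -> array of 1-100 labels, one per clip."""
    X = np.stack([clip_features(p) for p in paths])
    models = {}
    for q in QUALITIES:
        model = RandomForestRegressor(n_estimators=200, random_state=0)
        # quick cross-validated sanity check before fitting on all the data
        print(q, cross_val_score(model, X, ratings[q], cv=5).mean())
        models[q] = model.fit(X, ratings[q])
    return models
```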

This achievement represents a proof of concept for the analysis tool Robin hopes to release to the public. While other voice researchers were previously able to obtain similar results for categorizing voices along the CAPE-V, the novel feature of this model is that it incorporates the gender-correlated qualities of weight and resonance as properties that exist on a spectrum. The team hopes to scale up this model, with more training data drawn from a broader pool of gender-diverse voices.

PerMod

Forging onwards, Robin and the Berkeley Speech Group have brought to life a prototype of their envisioned voice modification tool. The prototype, which they call PerMod (short for Perceptual Modification), is a machine learning model that takes in an input voice and a description of the desired output vocal quality, then outputs a modified version of the voice. Robin excitedly sat me down at her desk to show me the model in action. In short, it works as advertised, a fact that made me share her excitement and wonder what comes next.

This version of PerMod was trained on 296 labeled audio clips of around 30 seconds each, covering a range of voice qualities. The model is built on a conditional latent diffusion model—the same family of architecture behind image generators such as OpenAI’s DALL-E 2. The goal of a latent diffusion model is to learn a “diffusion process” that captures the probability distribution of a given dataset. To understand the basic idea, first imagine the space of all possible audio clips that last 10 seconds. Clearly, this space is absolutely massive, with most of it taken up by things that sound like meaningless white noise. However, a very tiny part of this space corresponds to recordings of a human voice. A latent diffusion model learns what part of this space fits the dataset—in this case, it learns which parts sound like a human voice. It also learns which data points are “close to each other,” as well as how to smoothly traverse from one part of the “human voice” dataset to another. In image processing, latent diffusion models are the tool of choice for tasks like de-noising (cleaning up a blurry or noisy image) and for creating images from input text. In the case of PerMod, the model learns how to traverse the “human voice” dataset according to the seven perceptual qualities.
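
For the machine-learning-curious, the heavily simplified sketch below shows the core training step of a conditional diffusion model: noise is added to a latent representation of a voice, and a network learns to predict that noise given the seven perceptual “knobs.” Every dimension, schedule, and layer here is a placeholder of my own choosing, not PerMod’s real architecture.

```python
# Toy conditional diffusion training step: the model learns to remove noise
# from a latent representation of a voice, guided by the seven perceptual
# sliders. Dimensions, schedules, and the network are placeholders.
import torch
import torch.nn as nn

LATENT_DIM, N_QUALITIES, T_STEPS = 128, 7, 1000

# simple linear noise schedule; alpha_bar[t] = cumulative signal fraction at step t
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)

denoiser = nn.Sequential(          # stand-in for a real U-Net or transformer
    nn.Linear(LATENT_DIM + N_QUALITIES + 1, 256), nn.SiLU(),
    nn.Linear(256, LATENT_DIM),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=1e-4)

def training_step(voice_latent, quality_knobs):
    """voice_latent: (batch, LATENT_DIM); quality_knobs: (batch, 7) in [0, 1]."""
    batch = voice_latent.shape[0]
    t = torch.randint(0, T_STEPS, (batch,))
    noise = torch.randn_like(voice_latent)
    a = alpha_bar[t].unsqueeze(1)
    # forward (noising) process: blend the clean latent with Gaussian noise
    noisy = a.sqrt() * voice_latent + (1 - a).sqrt() * noise
    # the network predicts the noise, conditioned on the knobs and the timestep
    inp = torch.cat([noisy, quality_knobs, t.float().unsqueeze(1) / T_STEPS], dim=1)
    loss = ((denoiser(inp) - noise) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```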

PerMod is a leap forward, but it is far from perfect in its current form. “While PerMod is able to generate certain voices with desired perceptual qualities on average, there is much room for improvement. Currently, PerMod underperforms on atypical voices,” Robin writes in her 2023 paper presented at the Automatic Speech Recognition and Understanding Workshop, where the group introduced the model. In addition, a certain lack of naturalness, almost like a Siri voice from 2010, sometimes seems to creep into the voice. That said, in the modern era of big-data AI, the model’s training dataset is extremely modest in size. The team remains optimistic about the prospects of the model, given additional training data.

Ethics and the future

Looking to the future, I ask Robin what she hopes to see in the field of speech processing and trans voice. She replies, “I hope the field becomes more interested in questions of ethics and fairness and bias.” Currently, the project’s main bottleneck is the question of how to make the model equitable in the presence of a racist and misogynistic world. This is a widespread issue for deep learning researchers, where the state-of-the-art approach is to train models on, essentially, the entire Internet (as with ChatGPT, for example). It then becomes hard to keep the bias in our world from seeping into the model.

Robin’s work also represents a growing trend wherein the trans voice community and the speech-language pathologist community are becoming more tuned-in to each other. At the same time that trans voice training is becoming a high-demand topic among clinicians, Robin and other trans researchers are helping redefine the field by challenging the frameworks it is built on, the language used to describe it, and widely held beliefs about what is possible.

This article is part of the Spring 2024 issue.
