At first glance it seems familiar: a young woman smiling and dancing to the sound of Spotify’s most streamed artist of the year, Bad Bunny. The only hint that something sets this video apart from every other viral dance making its way through TikTok is the caption: “Bad Bunny doesn’t like it, but I do.” What is it about these videos that makes Bad Bunny so mad? The answer: it’s not actually his song. “Demo5: nostalgIA” is the creation of AI artist FlowGPT, himself the creation of a real human who goes by Maury Senpai on TikTok. Senpai used a voice cloning tool to simulate Justin Bieber, Bad Bunny, and Daddy Yankee singing his song. The story of “nostalgIA” offers a window into the public’s relationship with generative AI and into the role that UC Berkeley research plays in it. Students working in Professor Hany Farid’s digital forensics lab recently published work on algorithms that detect whether an audio clip is real or a so-called “deepfake” of an artist.
The concept of voice cloning is not new; in fact, it has been in practice for over 65 years. Yet today’s easy access to AI technologies (like those that can create deepfakes of people and their voices) has caused an explosion in their popularity. Celebrities, actors, and world leaders have been protesting the use of these technologies, asserting legal and ethical rights to their own identities.
Now, multiple publicly available providers will let you pay as little as five dollars to clone a voice and generate new vocals using text-to-speech (TTS) technology. Platforms combine traditional TTS with newer deep-learning approaches to generate high-quality outputs. In particular, zero-shot multi-speaker TTS systems can clone any new voice, regardless of whether the underlying machine-learning model encountered it during training. These algorithms combine speaker-specific characteristics from any voice with a generic model for turning text into human-sounding speech, yielding an identity-specific voice clone. In short: if more than 30 seconds of your voice exists online, it could be cloned and made to say anything a perpetrator wants.
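To make those mechanics concrete, here is a minimal sketch of what zero-shot voice cloning looks like in practice, using the open-source Coqui TTS package. The model name, file paths, and exact arguments are assumptions based on the library’s publicly documented interface; this illustrates the general technique, not any specific tool mentioned in this story.

```python
# pip install TTS  (the open-source Coqui TTS package; interface as publicly
# documented at the time of writing -- treat the specifics as assumptions)
from TTS.api import TTS

# Load a zero-shot multi-speaker model: one network trained to clone voices
# it never saw during training, given only a short reference recording.
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

tts.tts_to_file(
    text="Any words you like, in a voice that never recorded them.",
    speaker_wav="reference_clip.wav",  # hypothetical path: ~30 s of the target voice
    language="en",
    file_path="cloned_output.wav",
)
```

The reference clip supplies the speaker-specific characteristics; the model supplies the generic text-to-speech machinery. That division of labor is exactly why a short public recording is all a perpetrator needs.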
The good news is that UC Berkeley scientists have been working to create and evaluate algorithms that help humans differentiate between real and fake voices. Sarah Barrington, a graduate student in Professor Farid’s lab, has been working with her colleagues Romit Barua and Gautham Koorma to develop computational methods that detect subtle differences between real and fake voices that the typical human listener cannot. “Our goal was to understand the state of the art in voice cloning technologies, and then put them to the test against a range of computational detection algorithms,” explains Barrington. Their recent paper, published at the Institute of Electrical and Electronics Engineers (IEEE) International Workshop on Information Forensics and Security, evaluated multiple algorithms for effectively detecting audio deepfakes. Koorma explains, “Our work examined a range of audio features, from those detectable by the human ear to those only discernible by complex deep-learning models. We found that the voice clones are so advanced that the human-detectable features were quickly outdated, leaving the deep-learning models as the most effective way to detect AI-generated audio in research conditions.”
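To illustrate the hand-crafted end of that feature spectrum, here is a minimal sketch of a detector built on summary statistics of mel-frequency cepstral coefficients (MFCCs), a classic representation of the audible qualities of speech. The filenames and labels are hypothetical, and this is a sketch of the general approach, not the Berkeley team’s actual pipeline.

```python
# pip install numpy librosa scikit-learn
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def clip_features(path):
    """Summarize one audio clip as a fixed-length feature vector."""
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    # Mean and standard deviation of each coefficient over time.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

real_clips = ["real_001.wav", "real_002.wav"]    # hypothetical recordings
fake_clips = ["clone_001.wav", "clone_002.wav"]  # hypothetical voice clones

X = np.stack([clip_features(p) for p in real_clips + fake_clips])
y = np.array([0] * len(real_clips) + [1] * len(fake_clips))  # 0 = real, 1 = fake

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```

As Koorma notes, voice clones have advanced past features like these; in the team’s evaluation, deep-learning models that learn their own representations from raw audio proved far more effective.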
Yet designing deep-learning algorithms that generalize in the wild remains a challenge. For vocals used in music, these detectors fare less well. Music brings additional challenges: real voices are often processed, superimposed over backing audio, and delivered with non-conversational phrasing and intonation. These complications make it increasingly difficult to differentiate between real and fake voices, but not impossible. The team experimented with a range of audio qualities and techniques and found that denoising the audio before analysis still led to promising results. “Deepfakes are no longer hypothetical, and scientists have got to adapt quickly to the ever-changing AI technologies around them,” Barrington explains.
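As one concrete illustration of that preprocessing idea, the sketch below denoises a sung vocal before handing it to a detector. The noisereduce package is a stand-in chosen for illustration; the source does not say which denoising tool the team used, and the file paths are hypothetical.

```python
# pip install librosa noisereduce soundfile
import librosa
import noisereduce as nr
import soundfile as sf

# Load the suspect vocal track (hypothetical file).
y, sr = librosa.load("suspect_vocal.wav", sr=16000, mono=True)

# Spectral-gating noise reduction: suppress steady background energy
# (e.g., residue of a backing track) while keeping the voice.
y_clean = nr.reduce_noise(y=y, sr=sr)

sf.write("suspect_vocal_denoised.wav", y_clean, sr)
# The cleaned clip can then be scored by a detector like the one sketched above.
```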
While detection technologies may help protect artists and the general public from AI misuse, they will not prevent deepfakes from being created. Generative AI is here to stay, but as the technology grows, so do the detection methods that will shape our relationship with it for the better.
This article is part of the Spring 2024 issue.