Voice cloning in 3.7 seconds. But why?

Blaise Zerega
2 min read · Mar 2, 2018
Text for human voice samples used by Baidu Research to generate synthesized audio.

Neural networks can now take just a few seconds of your speech and generate entirely new audio samples. What’s more, these synthetic voices may soon be indistinguishable from the originals. It’s an impressive feat of technology, but one that leaves me asking, why?

Last week, Baidu Research published a paper explaining how short phrases spoken by humans, such as "It was even worse than at home," can be used to create a cloned voice that speaks a relatively complex phrase: "Different telescope designs perform differently and have different strengths and weaknesses."

Baidu's results come just weeks after Google's DeepMind claimed to have synthesized AI voices indistinguishable from human ones. Called Tacotron 2, Google's system uses a text-to-speech technique, and its output is limited to a single female-sounding voice, at least for now. It's a topic we explored in All Turtles Podcast Episode 14 (18:17–32:39). At least one of the show's hosts could not differentiate between the AI and the human voices when considering the phrase, "I'm too busy for romance."

Skip ahead to 18:17 for a discussion of AI-generated voices.

These developments have some experts sounding an alarm. "Fears of fake news will pale in comparison to new technology that can fake the human voice," cautioned the RAND Corporation following Google's announcement. Researchers from Oxford, Cambridge, and a dozen other institutions issued a 101-page report, The Malicious Use of Artificial Intelligence: Forecasting, Prevention, and Mitigation, which predicts many ways that benign AI will be repurposed for nefarious aims. In the wrong hands, AI-generated voices could undermine public trust in the spoken words of politicians, business leaders, and even loved ones. (Such "photoshopping of history" extends to video as well. See the recent AI-generated videos of President Obama by researchers at the University of Washington.)

Closer to home, while I don’t experience an Uncanny Valley revulsion when hearing a human-like machine voice, I don’t want to be fooled. I want my Alexa to sound like an Alexa, and my Siri to sound like Siri. And for those instances when I may prefer to have a device or service emit a human-like voice, I want it to be one that I select.

We're hurtling towards a soundscape where machine voices sound identical to human ones. With the potential for malfeasance so large, we need to ask: what real problem is this use of AI trying to solve?

Blaise Zerega

Editor in Chief at All Turtles | VentureBeat, FORA.tv, Conde Nast Portfolio, WIRED, Red Herring, Forbes | Michener Fellow at Texas Center for Writers | @BeeZee