“We also need a small plastic snake,” the woman says in a northern English accent, “and a big toy frog for the kids.” The sentence is a bit silly, perhaps, but otherwise perfectly innocuous. These six seconds of natural speech are all it takes for a new Chinese algorithm to clone Queen Elizabeth’s voice and make it say anything the user fancies. The Deep Voice programme, which was built by Baidu, a technology giant sometimes described as the Asian counterpart to Google, uses an artificial intelligence (AI) technique called a deep neural network to mimic British and American voices from only a handful of brief audio clips. The development raises concerns about an insidious new species of fake news in which manipulating the voices of politicians or celebrities would be child’s play for anyone with the right software on their bedroom computer. The Times published a painstaking reconstruction of the speech that the US president John F Kennedy was preparing to deliver at the Dallas Trade Mart before his assassination 55 years ago. Resurrecting Kennedy’s voice in precise detail, with its gliding Boston vowels and cadences, took sound engineers at the Scottish speech technology company CereProc eight weeks and 831 recordings. However, recent advances in AI mean that in a few years it should take only a few hours to produce a thoroughly convincing imitation of Theresa May or Boris Johnson, according to Matthew Aylett, CereProc’s co-founder.
The scope for mischief is considerable. Dr Aylett, a former research fellow at the University of Edinburgh, said that during the 2012 US election his company had rebuffed a West Coast company’s offer to buy its synthesised Barack Obama voice. For now, the voice cloning market is mostly made up of people who fear that they will lose their power of speech to chronic diseases such as amyotrophic lateral sclerosis, which afflicted the late Stephen Hawking. CereProc has rebuilt the voices of the film critic Roger Ebert and the New Orleans Saints American footballer Steve Gleason, as well as editing together a compendium of the Queen’s speeches to make it sound as though she had composed a rap for her diamond jubilee. The company typically asks its clients to record themselves reading more than 600 sentences, amounting to at least 40 minutes of speech. Adobe, the company behind Photoshop, claims to have made a voice-cloning counterpart to its image-editing software that requires only 20 minutes of audio.
If Baidu’s claims, which are published in a paper on the pre-publication website arXiv.org, are to be believed, it can do the same trick with a handful of samples adding up to less than a minute. Rather than building the cloned voice from scratch, Deep Voice takes a model from a library made up of 2,400 other people’s voices and then tweaks it until it matches the speaker. “The Baidu work is really interesting,” Dr Aylett said. “They are a very credible research team. Their samples sound good. The question is whether they cherry-picked some of the samples to make it sound better.”
Credit: Oliver Moody for The Times, 17 March 2018.