Microsoft’s new VALL-E AI can clone your voice from a three-second audio clip



Text-to-speech models usually require significantly longer training samples, but VALL-E can synthesise a voice from a very short clip.
By Ryan Morrison


Microsoft’s latest foray into the world of artificial intelligence comes in the form of VALL-E, a transformer-based text-to-speech model that can “recreate any voice from a three-second sample clip”. Cybersecurity experts say that without proper protections, it could be used for more realistic phishing attacks and to spread misinformation.



The VALL-E model was trained on 60,000 hours of speech and can generate a new voice from just a three-second sample clip. (Photo courtesy of Microsoft)

As well as reducing the training time to generate a new voice, VALL-E creates a much more natural-sounding synthetic voice than other models by preserving the intonation, charisma, and style of the original sample. These can then be directed as needed when writing the text-to-speech script.

Having these features means that with just three seconds of someone’s voice, recorded from a phone call, in person or even from a podcast, the model can synthesise that voice saying any sentence. It could potentially put words into the mouth of a politician, an actor or even a family member “asking for money”.
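Microsoft describes VALL-E as a codec language model: the short clip is compressed into discrete acoustic tokens, a transformer then generates matching tokens for the new text, and a decoder turns those tokens back into audio. The sketch below is only an illustration of that data flow; every function name is a hypothetical stub returning dummy data, not Microsoft’s code or API.

```python
# Illustrative sketch of a zero-shot voice-cloning pipeline in the VALL-E
# style. All names are hypothetical stand-ins and the stubs return dummy
# data; this only shows the order of operations, not Microsoft's system.
import numpy as np

SAMPLE_RATE = 24_000  # assumption: 24 kHz audio, typical for neural codecs


def encode_to_codec_tokens(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for a neural audio codec encoder that discretises the
    three-second enrolment clip into acoustic tokens."""
    frames = max(1, len(waveform) // 320)           # dummy frame rate
    return np.random.randint(0, 1024, size=(frames,))


def text_to_phonemes(text: str) -> list[str]:
    """Stand-in for a grapheme-to-phoneme front end."""
    return list(text.lower())


def codec_language_model(phonemes: list[str], prompt_tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the transformer that predicts acoustic tokens for the
    new sentence. A real model would condition on prompt_tokens to copy the
    speaker's identity; this stub simply ignores them."""
    length = len(phonemes) * 8                      # dummy duration estimate
    return np.random.randint(0, 1024, size=(length,))


def decode_codec_tokens(tokens: np.ndarray) -> np.ndarray:
    """Stand-in for the codec decoder turning tokens back into a waveform."""
    return np.zeros(len(tokens) * 320, dtype=np.float32)


# 1. Three seconds of the target speaker (silence here, as a placeholder).
enrolment_clip = np.zeros(3 * SAMPLE_RATE, dtype=np.float32)

# 2. Encode the clip into acoustic tokens that carry the speaker's identity.
prompt_tokens = encode_to_codec_tokens(enrolment_clip)

# 3. Generate tokens for an arbitrary sentence, then decode them to audio.
tokens = codec_language_model(text_to_phonemes("Any sentence you like."), prompt_tokens)
synthetic_speech = decode_codec_tokens(tokens)
print(synthetic_speech.shape)
```

The practical consequence is that the expensive part, training the codec and the language model on the 60,000 hours of speech, happens once; adding a new voice afterwards needs only the short prompt.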



Performance has improved over previous synthetic voice models to such a point that it would be difficult to tell whether you were hearing a real or fake voice, Microsoft says.

Much like the large generative AI models DALL-E 2 and GPT-3, VALL-E was created by feeding a significant amount of material into the system. Its developers used 60,000 hours of speech while training the model, much of which came from recordings made using the Teams app.

VALL-E could be used in gaming – and fintech

The code for VALL-E is not currently available to the public; only sample audio files produced using the tool have been published. It also isn’t clear when, or if, Microsoft plans to make VALL-E available as a public-access or commercial tool.

Joshua Kaiser, CEO of AI company Tovie.ai, told Tech Monitor that the model has been designed to let users do a lot more with a lot less data, which is crucial for organisations trying to build speech synthesis without enough data to achieve good performance. “We think this will benefit a lot of industries – from retail to fintech to gaming – that are already embracing voice interfaces, by making the whole process more accessible,” he says.

The biggest benefit from VALL-E is its potential scale, says Arun Chandrasekaran, distinguished VP analyst at Gartner. It can be effective in “zero-shot” or “few-shot” scenarios where little domain-specific training data is available. “In addition, if these models can be delivered as a cloud service, they can reduce time/effort required to get the models up and running in contrast to classic approaches,” Chandrasekaran says.
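To illustrate Chandrasekaran’s point about cloud delivery, the snippet below shows what such a service might look like if cloning were exposed as a simple API call: upload a short reference clip and the text, get synthesised audio back. The endpoint, fields and response format are entirely invented for illustration; Microsoft has not announced any such service.

```python
# Hypothetical REST call to an imaginary voice-cloning service; the URL,
# fields and response handling are invented for illustration only.
import requests

with open("enrolment_3s.wav", "rb") as clip:
    response = requests.post(
        "https://api.example.com/v1/tts/clone",            # placeholder endpoint
        files={"reference_audio": clip},                   # the short sample clip
        data={"text": "Hello, this is a demonstration sentence."},
        timeout=30,
    )

response.raise_for_status()
with open("cloned_output.wav", "wb") as out:
    out.write(response.content)  # synthesised speech in the cloned voice
```

Whether or not anything ever ships in this form, the point is that no per-speaker training loop is involved, which is what makes zero-shot delivery attractive at scale.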

There are several real-world use cases for the technology, Chandrasekaran explains, including “speech editing (where a certain word or sentence can be corrected), contextualizing voice for different scenarios, interactive virtual learning, and customer service automation.”

It does come with risks, including spoofing voice identification or impersonating specific speakers and celebrities, which could lead to a more rapid spread of misinformation. This aspect could be why Microsoft has been slow to publish the code behind the technology or release an API, as OpenAI and others have done with text and image generation tools like GPT-3 and DALL-E 2. A widely available tool would make it easier to carry out phishing attacks using a real voice, or to spread fake news online, perhaps through a YouTube video or a podcast.

Spoofing risk of VALL-E

Spoofing could allow a cybercriminal to gain access to banks or secure systems that use a voice print as a password, although many of these systems have mechanisms to detect whether a voice is live or recorded. It could also be used in a phishing scam: a short sample of a voice taken from a phone call could be used to build a new voice model, making it easier to convince someone to part with a password, perhaps by spoofing a finance manager at a company.
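The voice-print risk is easiest to see in a toy verification scheme: enrol a speaker embedding, then accept any clip whose embedding is close enough by cosine similarity. The sketch below uses a made-up embedding (a normalised magnitude spectrum) and simulates a clone as a lightly perturbed copy of the enrolment recording; real systems use learned embeddings plus liveness and anti-spoofing checks precisely because a convincing clone can clear a naive threshold.

```python
# Toy voice-print check, for illustration only: a convincing imitation of the
# enrolled audio passes, an unrelated signal does not. The "embedding" is a
# deliberately crude stand-in for a real speaker-embedding model.
import numpy as np

rng = np.random.default_rng(0)


def speaker_embedding(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in embedding: normalised, mean-removed spectrum."""
    spectrum = np.abs(np.fft.rfft(waveform))
    spectrum -= spectrum.mean()
    return spectrum / (np.linalg.norm(spectrum) + 1e-9)


def verify(enrolled: np.ndarray, candidate: np.ndarray, threshold: float = 0.85) -> bool:
    """Accept if the cosine similarity of the two embeddings clears a threshold."""
    return float(enrolled @ candidate) >= threshold


genuine = rng.standard_normal(24_000)                        # enrolment recording
close_clone = genuine + 0.05 * rng.standard_normal(24_000)   # convincing imitation
stranger = rng.standard_normal(24_000)                       # unrelated audio

enrolled_emb = speaker_embedding(genuine)
print(verify(enrolled_emb, speaker_embedding(close_clone)))  # clone clears the check
print(verify(enrolled_emb, speaker_embedding(stranger)))     # unrelated audio rejected
```

This is why the live-versus-recorded detection mentioned above matters: similarity alone is not enough once high-quality synthesis is cheap.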

Muhammad Yahya Patel, security engineer at Check Point Software, said the advancement of new technologies like VALL-E shouldn’t be feared, but that we should still approach systems like this with a degree of caution. “While it has its merits, Microsoft’s new VALL-E text-to-speech model could have some worrying implications for cybersecurity as it becomes more mature and integrated into our daily lives.

“If we have learned anything from the last year, it’s that cybercriminals will exploit any route to trick unsuspecting victims into handing over their passwords or bank details for example. Vishing [a scam phone call] is a popular method used by threat actors, and for good reason given the success rates of these campaigns.”

He said the new technology could give cybercriminals an opportunity to up their game and introduce a personal element, including allowing them to impersonate the voice of a loved one. “This would make it much harder for anyone to differentiate between the request of someone they trust and one from a malicious cybercriminal.

“Equally, as we move towards a time where many banks are now using voice authentication to authorise transactions, it’s easy to see how a threat actor could target an individual and gain access to an account with very minimal effort. It’s key that these opportunities for hackers to leverage new technologies are understood and, as such, that the necessary precautions are taken before it’s too late.”

Tech Monitor has approached Microsoft for comment on how it plans to mitigate the potential misuse of VALL-E.
 