1/21
@GoogleDeepMind
We recently helped develop AI tools, NotebookLM and Illuminate, which can narrate articles and papers, generate stories based on prompts, and even create multi-speaker audio discussions.
A snapshot of how the technology works.
Pushing the frontiers of audio generation
https://video.twimg.com/ext_tw_video/1851640847042842625/pu/vid/avc1/1080x1080/BaO0rS9srokmGUD_.mp4
2/21
@GoogleDeepMind
These tools build upon our previous research, which includes:
Creating a model that could produce 30-second segments of dialogue between multiple speakers
Technology that casts audio generation as a language-modeling problem by mapping audio to sequences of discrete tokens (a toy sketch of this framing follows below).
↓
Pushing the frontiers of audio generation
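To make the language-modeling framing concrete, here is a minimal, self-contained sketch. A toy 8-bit mu-law quantizer stands in for the learned neural codec and a tiny decoder-only Transformer stands in for the real model, so every name below is an illustrative assumption rather than DeepMind's implementation:

```python
# Toy sketch: audio generation as next-token prediction over discrete tokens.
# The mu-law quantizer stands in for a learned neural codec, and TinyAudioLM
# for the real model; both are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB = 256     # 8-bit mu-law levels play the role of a codec codebook
CONTEXT = 128   # number of audio tokens the model conditions on

def mulaw_encode(x: torch.Tensor, mu: int = 255) -> torch.Tensor:
    """Map waveform samples in [-1, 1] to discrete token ids in [0, 255]."""
    y = torch.sign(x) * torch.log1p(mu * x.abs()) / torch.log1p(torch.tensor(float(mu)))
    return ((y + 1) / 2 * mu).long().clamp(0, VOCAB - 1)

class TinyAudioLM(nn.Module):
    """Decoder-only Transformer trained to predict the next audio token."""
    def __init__(self, d_model: int = 64, n_heads: int = 4, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.pos = nn.Embedding(CONTEXT, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        t = tokens.shape[1]
        causal = nn.Transformer.generate_square_subsequent_mask(t)
        h = self.embed(tokens) + self.pos(torch.arange(t))
        return self.head(self.blocks(h, mask=causal))  # logits per position

# Waveform -> tokens -> logits -> sample one continuation token.
wave = torch.sin(torch.linspace(0, 20, CONTEXT)).unsqueeze(0)
tokens = mulaw_encode(wave)
logits = TinyAudioLM()(tokens)
next_token = torch.multinomial(logits[0, -1].softmax(dim=-1), num_samples=1)
print(tokens.shape, next_token.item())
```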
3/21
@GoogleDeepMind
Using these advances, our latest speech generation technology can produce 2 minutes of dialogue with improved speaker consistency.
To generate longer segments, we created a new speech codec that compresses audio into a sequence of tokens at rates as low as 600 bits per second, without compromising quality (an illustrative bitrate calculation follows below).
https://video.twimg.com/ext_tw_video/1851641095584714754/pu/vid/avc1/1080x1080/PpwOZ8RyjYVJTEOG.mp4
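As a rough illustration of how a token bitrate like that can arise: in a residual-VQ style codec, each frame emits one token per codebook, and each token costs log2(codebook size) bits. The thread does not give the codec's actual configuration, so the numbers here are assumptions chosen to land on the quoted figure:

```python
import math

def codec_bitrate(frames_per_sec: float, n_codebooks: int, codebook_size: int) -> float:
    """Bits per second: frames_per_sec frames, one token per codebook per
    frame, and log2(codebook_size) bits per token."""
    return frames_per_sec * n_codebooks * math.log2(codebook_size)

# One illustrative configuration that lands exactly on 600 bits per second:
print(codec_bitrate(frames_per_sec=25, n_codebooks=2, codebook_size=4096))  # 600.0
```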
4/21
@GoogleDeepMind
A 2-minute piece of dialogue still requires generating over 5,000 tokens.
That's why we also developed a specialized Transformer that can efficiently handle vast amounts of data, match the hierarchical structure of our acoustic tokens, and decode them back into audio using the speech codec (a quick sanity check on these numbers follows below).
↓
Pushing the frontiers of audio generation
https://video.twimg.com/ext_tw_video/1851641193253212161/pu/vid/avc1/1080x1350/VtuVbsAAp2pvqb2i.mp4
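A quick back-of-the-envelope check using only the figures quoted in the thread:

```python
# "2 minutes of dialogue" and "over 5000 tokens" imply this token rate:
seconds = 2 * 60
tokens = 5000
print(f"{tokens / seconds:.1f} acoustic tokens per second")   # ~41.7

# Cross-checking against the 600 bits-per-second codec from the previous post:
print(f"{600 / (tokens / seconds):.1f} bits per token on average")  # ~14.4
```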
5/21
@GoogleDeepMind
The applications for advanced speech generation are vast.
From improving accessibility and creating new educational experiences to combining these developments with our Gemini models, we’re excited to continue pushing the boundaries of what’s possible.
↓
Pushing the frontiers of audio generation
6/21
@un_editormas
when will it be released in Spanish?
7/21
@diegocabezas01
One of the best things Google has shipped after Seasonal Holidays 2024
8/21
@walulyafrancis
This is perfection. If we can run this on device in the future, it will be great.
9/21
@koltregaskes
Nice, thank you DeepMind.
10/21
@alignment_lab
We're wanting to use it for some research but are getting walled by the topic of the research being secops and capabilities - is there anything we can do to pursue it for that use case? It's quickly becoming an important piece of infrastructure for us to manage the information-bandwidth needs of the research space right now.
11/21
@FiveRivers_Tech
These tools sound like game-changers for how we interact with content! Excited to see how they transform the way we learn and create.
12/21
@w3whq
Google is cooking!
13/21
@a_meta4
@ValueAnalyst1 you could use this for your AI amateurs if you want...
14/21
@ezzakyyy
Does it have an API?
15/21
@pcberdwin
I tried to ask them questions like their names and stuff, but they just dance around me with psychoanalysis and philosophical flights of mockery. They will probably take over the world.
16/21
@tombielecki
The most impressive parts are the speaker overlaps, realistic disfluencies, natural pauses, tone, and timing. I think the fact that the training audio was *unscripted* really helped!
17/21
@byteprobe
wow! that's cool!
18/21
@234Sagyboy
@GoogleDeepMind Interesting. Can I give feedback? Please implement the ability to design custom podcast voices like this, for example, and also add multilingual capabilities (Hindi, Urdu, Mandarin, etc.)
[Quoted tweet]
Stability AI has introduced a novel Text-to-Speech (TTS) model.
It does not require pre-recorded human voice samples as references; instead, it only needs textual descriptions of desired voice characteristics. For instance, by specifying "a female voice with a British accent, speaking at a fast pace," the model can generate the corresponding voice.
Furthermore, it can adjust various features of the speech based on text descriptions, including gender, accent, speech rate, and tone.
Not only does it mimic, but it also synthesizes new voices based on descriptions...
Key Features:
1. High-fidelity speech generation: The model can generate high-fidelity speech across a wide range of accents, rhythmic styles, channel conditions, and acoustic environments, providing diverse auditory experiences.
2. Natural language control: Control over speaker identity and style is achieved through intuitive natural language prompts, eliminating the need for reference voice recordings and simplifying the speech generation process, making it more flexible and user-friendly.
It can accept text descriptions regarding speaker identity (e.g., gender, accent), speaking style (fast, slow, high pitch, low pitch), recording conditions (e.g., quiet room or noisy environment), and generate corresponding speech based on these descriptions.
3. Scalable labeling method: A new, scalable method for labeling speaker identity, style, and recording conditions has been proposed, allowing for training models on large datasets, thereby enhancing model applicability and flexibility.
4. Significant improvement in audio quality: The proposed method significantly enhances audio fidelity, surpassing recent work even when relying solely on existing data, improving speech clarity and realism.
5. Fine-grained attribute control: The model supports fine-grained control over various speech attributes, including gender, speaker pitch, pitch modulation, speech rate, channel conditions, and accent, providing users with customized speech output options.
6. Creating new voices: It not only imitates known voices but also creates entirely new voice styles and features based solely on text descriptions.
Working Principle:
1. Dataset labeling: They developed a labeling technique that enables the model to automatically learn how to generate human speech from textual descriptions.
They used a massive dataset—comprising 45,000 hours of speech recordings—to train their artificial intelligence model. By learning from this speech data, the model can understand and mimic various features of human speech, such as altering the perceived gender (male or female), accent (e.g., British or American), speaking rate (fast or slow), and pitch.
Importantly, although only a small portion of this vast dataset consists of high-quality recordings, the technique still leverages those samples to improve the naturalness and realism of the generated speech. In other words, the model can produce highly natural, authentic-sounding voices even with very limited high-quality speech data, which is a significant technical advancement.
2. Training the speech generation model: Using the labeled large-scale dataset, researchers trained a deep learning model that learns how to generate speech based on input natural language descriptions. Model training involves learning the relationships between different voice attributes and how to adjust these attributes according to the requirements in the descriptions.
Project and Demo: text-description-to-speech.c…
Paper: arxiv.org/abs/2402.01912
#Stability #ai
https://video.twimg.com/amplify_video/1755159365810896896/vid/avc1/1280x720/Kb9-oYQ0qoLy66l8.mp4
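To make the quoted tweet's control interface concrete, here is a toy sketch that maps a free-text voice description onto the fine-grained attributes listed above. The `VoiceSpec` attribute set and the keyword parser are illustrative stand-ins for the model's learned text conditioning, not Stability AI's actual implementation:

```python
# Toy sketch of description-conditioned voice control: parse a plain-language
# voice description into the attributes the quoted tweet says the model can
# steer. The attribute set and parser here are illustrative assumptions.
import re
from dataclasses import dataclass

@dataclass
class VoiceSpec:
    gender: str = "unspecified"
    accent: str = "unspecified"
    rate: str = "normal"
    pitch: str = "normal"

def parse_description(text: str) -> VoiceSpec:
    """Very rough keyword matcher standing in for a learned text encoder."""
    spec = VoiceSpec()
    t = text.lower()
    for gender in ("female", "male"):
        if gender in t:
            spec.gender = gender
    for accent in ("british", "american"):
        if accent in t:
            spec.accent = accent
    if re.search(r"fast pace|quickly|fast", t):
        spec.rate = "fast"
    elif re.search(r"slow pace|slowly|slow", t):
        spec.rate = "slow"
    if "high pitch" in t:
        spec.pitch = "high"
    elif "low pitch" in t:
        spec.pitch = "low"
    return spec

print(parse_description("a female voice with a British accent, speaking at a fast pace"))
# VoiceSpec(gender='female', accent='british', rate='fast', pitch='normal')
```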
19/21
@TJ09299872
Please melt the glaciers
20/21
@ai_academy_team
Love this. How do you make the waveform?
21/21
@scott_heitmann
It would be awesome to be able to change voices with ease rather than having to export and use ElevenLabs or similar.