We introduce OpenVoice, a versatile voice cloning approach that requires only a short audio clip from the reference speaker to replicate their voice and generate speech in multiple languages. OpenVoice represents a significant advance on the following open challenges in the field: 1) Flexible Voice Style Control. OpenVoice enables granular control over voice styles, including emotion, accent, rhythm, pauses, and intonation, in addition to replicating the tone color of the reference speaker. The voice styles are not directly copied from, nor constrained by, the style of the reference speaker; previous approaches lacked the ability to flexibly manipulate voice styles after cloning. 2) Zero-Shot Cross-Lingual Voice Cloning. OpenVoice achieves zero-shot cross-lingual voice cloning for languages not included in the massive-speaker training set. Unlike previous approaches, which typically require an extensive massive-speaker multi-lingual (MSML) dataset covering all target languages, OpenVoice can clone voices into a new language without any massive-speaker training data for that language. OpenVoice is also computationally efficient, costing tens of times less than commercially available APIs that deliver inferior performance. To foster further research in the field, we have made the source code and trained model publicly accessible, and we provide qualitative results on our demo website. Prior to its public release, our internal version of OpenVoice was used tens of millions of times by users worldwide between May and October 2023, serving as the backend of MyShell.
Comments: Technical Report
Subjects: Sound (cs.SD); Machine Learning (cs.LG); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2312.01479 [cs.SD] (or arXiv:2312.01479v5 [cs.SD] for this version)
[2312.01479] OpenVoice: Versatile Instant Voice Cloning
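The abstract's key design point — voice style is controlled independently of the cloned timbre — suggests a two-stage pipeline: generate styled speech first, then transfer only the reference speaker's tone color. The sketch below illustrates that decoupling with invented stub components (`StubBaseTTS`, `StubToneConverter`, and `StyleParams` are hypothetical names for illustration, not OpenVoice's actual API):

```python
from dataclasses import dataclass

@dataclass
class StyleParams:
    """Chosen freely by the user; NOT copied from the reference speaker."""
    emotion: str = "neutral"
    speed: float = 1.0

class StubBaseTTS:
    """Stand-in for a base text-to-speech model (style + language live here)."""
    def synthesize(self, text, lang, style):
        return f"audio[{text}|{lang}|{style.emotion}|x{style.speed}]"

class StubToneConverter:
    """Stand-in for a tone-color (timbre) converter."""
    def extract(self, clip):
        return f"timbre<{clip}>"
    def convert(self, audio, target):
        return f"{audio} -> {target}"

def clone_and_speak(text, lang, reference_clip, style, tts=None, converter=None):
    tts = tts or StubBaseTTS()
    converter = converter or StubToneConverter()
    # Stage 1: style and language are decided here, with no input from the
    # reference speaker at all.
    base_audio = tts.synthesize(text, lang=lang, style=style)
    # Stage 2: only the reference speaker's timbre is transferred.
    timbre = converter.extract(reference_clip)
    return converter.convert(base_audio, target=timbre)
```

Because stage 2 touches only timbre, the same reference clip can be re-voiced with any emotion, speed, or language — which is what makes the style control flexible rather than inherited from the clip.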
I showed Gemini 1.5 Pro the ENTIRE Self-Operating Computer codebase, and an example Gemini 1.5 API call.
From there, it was able to perfectly explain how the codebase works...
and then it implemented itself as a new supported model for the repo!
Not perfect, but very close.
We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long-context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data from certain domains like books, a common practice in existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
Comments: Code at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2402.10171 [cs.CL] (or arXiv:2402.10171v1 [cs.CL] for this version)
[2402.10171] Data Engineering for Scaling Language Models to 128K Context
Mind officially blown:
I recorded a screen capture of a task (looking for an apartment on Zillow). Gemini was able to generate Selenium code to replicate that task, and described everything I did step-by-step.
It even caught that my threshold was set to $3K, even though I didn't explicitly select it.
"This code will open a Chrome browser, navigate to Zillow, enter "Cupertino, CA" in the search bar, click on the "For Rent" tab, set the price range to "Up to $3K", set the number of bedrooms to "2+", select the "Apartments/Condos/Co-ops" checkbox, click on the "Apply" button, wait for the results to load, print the results, and close the browser."
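The quoted summary maps naturally onto a small browser-automation plan. The sketch below is a reconstruction in that spirit, not the code Gemini generated: every CSS selector is a hypothetical placeholder (the real Zillow page's attributes would need to be inspected), and the driver is abstracted behind four verbs so the plan could be executed by Selenium or by a test double.

```python
# Hypothetical reconstruction of the described script; selectors are placeholders.
SEARCH = {"city": "Cupertino, CA", "max_price": "Up to $3K",
          "beds": "2+", "home_type": "Apartments/Condos/Co-ops"}

def build_plan(search):
    """Each step is (action, target, value); targets are placeholder selectors."""
    return [
        ("get",   "https://www.zillow.com/", None),
        ("type",  "input[aria-label='Search']", search["city"] + "\n"),
        ("click", "#for-rent-tab", None),
        ("click", "#price-menu", None),
        ("click", f"li[data-label='{search['max_price']}']", None),
        ("click", "#beds-menu", None),
        ("click", f"button[data-beds='{search['beds']}']", None),
        ("click", "#home-type-apartments-condos-coops", None),
        ("click", "button.apply-filters", None),
        ("read",  "ul.search-results li", None),
    ]

def run(plan, driver):
    """Execute the plan against any driver exposing get/type/click/read."""
    results = None
    for action, target, value in plan:
        if action == "get":
            driver.get(target)
        elif action == "type":
            driver.type(target, value)
        elif action == "click":
            driver.click(target)
        elif action == "read":
            results = driver.read(target)
    return results
```

With real Selenium, a thin adapter would map `get` to `webdriver.Chrome().get(url)`, `type` to `find_element(By.CSS_SELECTOR, sel).send_keys(text)`, and `click` to `find_element(By.CSS_SELECTOR, sel).click()`; keeping the plan as data is also roughly the shape a model-generated step list takes before being turned into code.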