bnew

Veteran
Joined
Nov 1, 2015
Messages
58,221
Reputation
8,625
Daps
161,907

3 hours ago - Technology

New AI polyglot launched to help fill massive language gap in field


Illustration of a large dictionary with AI embossed on the front

Illustration: Natalie Peeples/Axios

A new open-source generative AI model can follow instructions in more than 100 languages.

Why it matters: Most models that power today's generative AI tools are trained on data in English and Chinese, leaving a massive gap of thousands of languages — and potentially limiting access to the powerful technology for billions of people.

Details: Cohere for AI, the nonprofit AI research lab at Cohere, on Tuesday released its open-source multilingual large language model (LLM) called Aya.


  • It covers more than twice as many languages as other existing open-source models and is the result of a year-long project involving 3,000 researchers in 119 countries.
How it works: The team started with a base model pre-trained on text that covered 101 languages and fine-tuned it on those languages.

  • But they first had to create a high-quality dataset of prompt and completion pairs (the inputs and outputs of the model) in different languages, which is also being released.
  • Their data sources include machine translations of several existing datasets into more than 100 languages, roughly half of which are considered underrepresented — or unrepresented — in existing text datasets, including Azerbaijani, Bemba, Welsh and Gujarati.
  • They also created a dataset that tries to capture cultural nuances and meaningful information by having about 204,000 prompts and completions curated and annotated by fluent speakers in 67 languages.
  • The team reports Aya outperforms other existing open-source multilingual models when evaluated by humans or using GPT-4.
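As a rough illustration of the "prompt and completion pairs" mentioned above, here is a minimal sketch of how one such record might be represented and flattened into training text. The field names, the separator, and the language tags are assumptions made for illustration; they are not taken from the Aya release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstructionPair:
    prompt: str      # input shown to the model
    completion: str  # output the model is trained to produce
    language: str    # hypothetical language tag for the pair


def to_training_text(pair: InstructionPair, sep: str = "\n\n") -> str:
    """Join prompt and completion into a single training string."""
    return f"{pair.prompt}{sep}{pair.completion}"


def count_by_language(items):
    """Count how many pairs carry each language tag."""
    counts = {}
    for item in items:
        counts[item.language] = counts.get(item.language, 0) + 1
    return counts


# Two made-up records, for illustration only.
pairs = [
    InstructionPair("Translate to Welsh: Good morning", "Bore da", "cy"),
    InstructionPair("What is the capital of Azerbaijan?", "Baku", "en"),
]
```

Counting by language tag is the kind of quick check that shows how evenly a multilingual collection is spread.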

The impact: "Aya is a massive leap forward — but the biggest goal is all these collaboration networks spur bottom up collaborations," says Sara Hooker, who leads Cohere for AI.

  • The team envisions Aya being used for language research and to preserve and represent languages and cultures at risk of being left out of AI advances.

The big picture: Aya is one of a handful of open-source multilingual models, including BLOOM, which can generate text in 46 languages, a bilingual Arabic-English LLM called Jais, and a model in development by the Masakhane Foundation that covers African languages.

What to watch: "In some ways this is a bandaid for the wider issue with multilingual [LLMs]," Hooker says. "An important bandaid but the issues still persist."


  • Those include figuring out why LLMs can't seem to be adapted to languages they didn't see in pre-training and exploring how best to evaluate these models.
  • "We used to think of a model as something that fulfills a very specific, finite and controlled notion of a request," like a definitive answer about whether a response is correct, Hooker says. "And now we want models to do everything, and be universal and fluid."
  • "We can’t have everything so we will have to choose what we want it to be."



FEB 13, 2024

Cohere For AI Launches Aya, an LLM Covering More Than 100 Languages

More than double the number of languages covered by previous open-source AI models to increase coverage for underrepresented communities

Today, the research team at Cohere For AI (C4AI), Cohere’s non-profit research lab, are excited to announce a new state-of-the-art, open-source, massively multilingual, generative large language research model (LLM) covering 101 different languages — more than double the number of languages covered by existing open-source models. Aya helps researchers unlock the powerful potential of LLMs for dozens of languages and cultures largely ignored by most advanced models on the market today.

We are open-sourcing both the Aya model and the largest multilingual instruction fine-tuning dataset to date, with 513 million prompts and completions covering 114 languages. This data collection includes rare annotations from native and fluent speakers all around the world, ensuring that AI technology can effectively serve a broad global audience that has had limited access to date.



Closes the Gap in Languages and Cultural Relevance​

Aya is part of a paradigm shift in how the ML community approaches massively multilingual AI research, representing not just technical progress, but also a change in how, where, and by whom research is done.

As LLMs, and AI generally, have changed the global technological landscape, many communities across the world have been left unsupported due to the language limitations of existing models. This gap hinders the applicability and usefulness of generative AI for a global audience, and it has the potential to further widen disparities that already exist from previous waves of technological development. By focusing primarily on English and one or two dozen other languages as training resources, most models tend to reflect inherent cultural bias.

We started the Aya project to address this gap, bringing together over 3,000 independent researchers from 119 countries.


Figure: Geographical distribution of Aya collaborators


Significantly Outperforms Existing Open-Source Multilingual Models​

The research team behind Aya was able to substantially improve performance for underserved languages, demonstrating superior capabilities in complex tasks, such as natural language understanding, summarization, and translation, across a wide linguistic spectrum.

We benchmark Aya model performance against available, open-source, massively multilingual models. It surpasses the best open-source models, such as mT0 and Bloomz, on benchmark tests by a wide margin. Aya consistently scored 75% in human evaluations against other leading open-source models, and 80-90% across the board in simulated win rates.
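The win rates quoted above come from head-to-head comparisons. A minimal sketch of that arithmetic follows, assuming each comparison is recorded as a label naming the preferred model (or "tie") and that ties are left out of the denominator. The source does not say how ties were handled, so that choice is an assumption, and the judgments below are made up.

```python
from collections import Counter


def win_rate(preferences, model="aya"):
    """Share of decided head-to-head comparisons won by `model`.

    Ties are excluded from the denominator (an assumption).
    """
    counts = Counter(preferences)
    decided = sum(n for label, n in counts.items() if label != "tie")
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts[model] / decided


# Made-up judgments, for illustration only: 5 wins, 2 losses, 1 tie.
judgments = ["aya", "aya", "baseline", "tie", "aya", "baseline", "aya", "aya"]
rate = win_rate(judgments)  # 5 wins out of 7 decided comparisons
```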

Aya also expands coverage to more than 50 previously unserved languages, including Somali, Uzbek, and more. While proprietary models do an excellent job serving a range of the most commonly spoken languages in the world, Aya helps to provide researchers with an unprecedented open-source model for dozens of underrepresented languages.


Figure: Head-to-head comparison of preferred model responses


Trained on the Most Extensive Multilingual Dataset to Date​

We are releasing the Aya Collection consisting of 513 million prompts and completions covering 114 languages. This massive collection was created by fluent speakers around the world creating templates for selected datasets and augmenting a carefully curated list of datasets. It also includes the Aya Dataset which is the most extensive human-annotated, multilingual, instruction fine-tuning dataset to date. It contains approximately 204,000 rare human curated annotations by fluent speakers in 67 languages, ensuring robust and diverse linguistic coverage. This offers a large-scale repository of high-quality language data for developers and researchers.

Many languages in this collection had no representation in instruction-style datasets before. The fully permissive and open-sourced dataset includes a wide spectrum of language examples, encompassing a variety of dialects and original contributions that authentically reflect organic, natural, and informal language use. This makes it an invaluable resource for multifaceted language research and linguistic preservation efforts.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,221
Reputation
8,625
Daps
161,907

DeepMind framework offers breakthrough in LLMs’ reasoning


About the Author
By Ryan Daws | February 8, 2024


Categories: Artificial Intelligence, Companies, DeepMind, Development, Google


Ryan is a senior editor at TechForge Media with over a decade of experience covering the latest technology and interviewing leading industry figures. He can often be sighted at tech conferences with a strong coffee in one hand and a laptop in the other. If it's geeky, he’s probably into it. Find him on Twitter (@Gadget_Ry) or Mastodon (@gadgetry@techhub.social)

A breakthrough approach in enhancing the reasoning abilities of large language models (LLMs) has been unveiled by researchers from Google DeepMind and the University of Southern California.

Their new ‘SELF-DISCOVER’ prompting framework – published this week on arXiv and Hugging Face – represents a significant leap beyond existing techniques, potentially revolutionising the performance of leading models such as OpenAI’s GPT-4 and Google’s PaLM 2.

The framework promises substantial enhancements in tackling challenging reasoning tasks. It demonstrates remarkable improvements, boasting up to a 32% performance increase compared to traditional methods like Chain of Thought (CoT). This novel approach revolves around LLMs autonomously uncovering task-intrinsic reasoning structures to navigate complex problems.

At its core, the framework empowers LLMs to self-discover and utilise various atomic reasoning modules – such as critical thinking and step-by-step analysis – to construct explicit reasoning structures.

By mimicking human problem-solving strategies, the framework operates in two stages:
  • Stage one involves composing a coherent reasoning structure intrinsic to the task, leveraging a set of atomic reasoning modules and task examples.
  • During decoding, LLMs then follow this self-discovered structure to arrive at the final solution.
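The two stages above can be sketched as two prompt-building functions around a generic model call. This is only an outline under stated assumptions: the `LLM` callable, the prompt wording, the module list, and the `echo_llm` stand-in are all hypothetical and are not the prompts used in the paper.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

# Example atomic reasoning modules (wording is illustrative).
REASONING_MODULES = [
    "Use critical thinking to question assumptions.",
    "Think step by step.",
]


def discover_structure(llm: LLM, modules: List[str], task_examples: List[str]) -> str:
    """Stage one: ask the model to compose a reasoning structure for the task."""
    prompt = (
        "Select the reasoning modules useful for these tasks and compose "
        "them into an explicit step-by-step reasoning structure.\n"
        "Modules:\n" + "\n".join(f"- {m}" for m in modules) + "\n"
        "Task examples:\n" + "\n".join(f"- {t}" for t in task_examples)
    )
    return llm(prompt)


def solve(llm: LLM, structure: str, task: str) -> str:
    """Stage two: follow the discovered structure on one task instance."""
    prompt = f"Follow this reasoning structure:\n{structure}\n\nTask: {task}\nAnswer:"
    return llm(prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return f"[model output for {len(prompt)} chars of prompt]"


structure = discover_structure(echo_llm, REASONING_MODULES, ["What is 17 * 24?"])
answer = solve(echo_llm, structure, "What is 13 * 19?")
```

The structure is discovered once per task type and then reused for every instance, so stage one adds only a fixed number of extra model calls.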

In extensive testing across various reasoning tasks – including Big-Bench Hard, Thinking for Doing, and Math – the self-discover approach consistently outperformed traditional methods. Notably, it achieved an accuracy of 81%, 85%, and 73% across the three tasks with GPT-4, surpassing chain-of-thought and plan-and-solve techniques.

However, the implications of this research extend far beyond mere performance gains.

By equipping LLMs with enhanced reasoning capabilities, the framework paves the way for tackling more challenging problems and brings AI closer to achieving general intelligence. Transferability studies conducted by the researchers further highlight the universal applicability of the composed reasoning structures, aligning with human reasoning patterns.

As the landscape evolves, breakthroughs like the SELF-DISCOVER prompting framework represent crucial milestones in advancing the capabilities of language models and offering a glimpse into the future of AI.

(Photo by Victor on Unsplash)




Computer Science > Artificial Intelligence​

[Submitted on 6 Feb 2024]

Self-Discover: Large Language Models Self-Compose Reasoning Structures​

Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, Huaixiu Steven Zheng

We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10-40x fewer inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.

Comments: 17 pages, 11 figures, 5 tables
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2402.03620 [cs.AI]
(or arXiv:2402.03620v1 [cs.AI] for this version)

Submission history​

From: Pei Zhou

[v1] Tue, 6 Feb 2024 01:13:53 UTC (1,783 KB)


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,221
Reputation
8,625
Daps
161,907

World Model on Million-Length Video and Language with RingAttention


Abstract​

Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including using masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and model-generated QA dataset for long sequence chat. (c) A highly-optimized implementation with RingAttention, masked sequence packing, and other key features for training on millions-length multimodal sequences. (d) Fully open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop understanding of both human knowledge and the multimodal world, and broader capabilities.

Question Answering Over 1 Hour Video.​







Figure 1. Long video understanding. LWM can answer questions about over 1 hour YouTube video.

Fact Retrieval Over 1M Context.​




Figure 2. Needle retrieval task. LWM achieves high accuracy across 1M context window and outperforms GPT-4V and Gemini Pro.




Figure 3. Needle retrieval task. LWM achieves high accuracy for varying context sizes and positions in the context window.

Long Sequence Any-to-Any AR Prediction.​





Figure 4. Any-to-Any Long Sequence Prediction. RingAttention enables the use of a very large context window for training across diverse formats such as video-text, text-video, image-text, text-image, pure video, pure image, and pure text. See the LWM paper for key features, including masked sequence packing and loss weighting, which allow effective video-language training.
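To make "masked sequence packing" and "loss weighting" more concrete, here is a small pure-Python sketch. It assumes packing means concatenating several sequences into one row and masking attention so tokens cannot see across sequence boundaries, and it assumes loss weighting means a separate scalar weight on vision tokens. The source does not spell out either mechanism, so both readings, along with the function names and the `vision_weight` value, are assumptions.

```python
def packed_attention_mask(lengths):
    """Causal mask for several sequences packed into one row.

    mask[i][j] is True when position i may attend to position j: j must
    come no later than i and belong to the same original sequence.
    """
    segment_ids = [seg for seg, n in enumerate(lengths) for _ in range(n)]
    total = len(segment_ids)
    return [
        [j <= i and segment_ids[i] == segment_ids[j] for j in range(total)]
        for i in range(total)
    ]


def weighted_loss(token_losses, is_vision, vision_weight=0.5):
    """Weighted average of per-token losses with a separate vision weight."""
    weights = [vision_weight if v else 1.0 for v in is_vision]
    return sum(w * l for w, l in zip(weights, token_losses)) / sum(weights)


# Pack a 2-token sequence and a 3-token sequence into one row of 5 positions.
mask = packed_attention_mask([2, 3])

# One text token (loss 1.0) and one vision token (loss 3.0).
loss = weighted_loss([1.0, 3.0], [False, True], vision_weight=0.5)
```

In the packed mask, position 2 (the first token of the second sequence) cannot attend to positions 0 and 1, even though they come earlier in the row.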

Modeling Diverse Videos and Books With RingAttention.​




Figure 5. Context Extension and Vision-Language Training. Expanding context size from 4K to 1M on books using RingAttention, followed by vision-language training on diverse forms of visual contents of lengths 32K to 1M. The lower panel shows interactive capabilities in understanding and responding to queries about complex multimodal world.

Text-Image Generation.​



Figure 6. Text to Image. LWM generates images based on text prompts, autoregressively.



Text-Video Generation.​

Fireworks exploding in the sky

Waves crashing against the shore

A bustling street in London with red telephone booths and Big Ben in the background

Camera pans left to right on mango slices sitting on a table

A ball thrown in the air

Slow motion flower petals falling on the ground

A burning campfire in a forest

A boat sailing on a stormy ocean


Figure 5. Text to Video. LWM generates videos based on text prompts, autoregressively.

Image Based Conversation.​



Figure 6. Image understanding. LWM can answer questions about images.



Video Chat Over 1 Hour YouTube Video.​










Figure 7. Long Video Chat. LWM answers questions about 1 hour long YouTube video even if state-of-the-art commercial models GPT-4V and Gemini Pro both fail. The relevant clips for each example are at timestamps 9:56 (top) and 6:49 (bottom).







Large World Model (LWM)​

[Project] [Paper] [Models]

Large World Model (LWM) is a general-purpose large-context multimodal autoregressive model. It is trained on a large dataset of diverse long videos and books using RingAttention, and can perform language, image, and video understanding and generation.




Computer Science > Machine Learning​

[Submitted on 13 Feb 2024]

World Model on Million-Length Video And Language With RingAttention​


Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel




Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2402.08268 [cs.LG]
(or arXiv:2402.08268v1 [cs.LG] for this version)

Submission history​

From: Hao Liu
[v1] Tue, 13 Feb 2024 07:47:36 UTC (7,336 KB)





 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,221
Reputation
8,625
Daps
161,907

Google unveils a more capable Gemini Pro AI as it seeks an edge against OpenAI and other rivals in tech’s hottest battlefield

BY JEREMY KAHN

February 15, 2024 at 10:00 AM EST

Google DeepMind CEO Demis Hassabis

Demis Hassabis heads Google DeepMind, Alphabet's advanced AI research lab. The lab has helped to develop Google's Gemini AI models. The company has unveiled its latest, a more capable Gemini Pro called 1.5 Pro, that can process more data in one go than any other AI model on the market.

TOBY MELVILLE—WPA POOL/GETTY IMAGES

Another week, another Gemini model.

Alphabet-owned Google has been furiously pushing out new Gemini-branded AI models, which can power business applications like chatbots and coding assistants, as it tries to outdo rivals OpenAI and Microsoft in a generative AI battle that keeps getting more intense.

Earlier this week, OpenAI announced it was adding a persistent memory to ChatGPT—which means the chatbot will remember facts about the user’s preferences and past dialogues and apply them to future responses—through a new underlying large language model GPT-4.5. Now Google is countering by releasing a yet more capable and compact Gemini model.




And, while a Google DeepMind researcher who worked on the models said the company is sharing information on the safety of these new AI systems with government regulatory bodies in the U.S. and U.K., it isn’t exactly waiting for their green light before releasing them to the public.

While this information sharing is voluntary, the U.K. government has often implied that its AI Safety Institute will help keep dangerous models from posing a risk to the public, a role it cannot fulfil if tech companies keep pushing out models far faster than it can evaluate them. In the U.S., a similarly named AI Safety Institute is only charged with issuing standards for companies to use in evaluating their own models, not undertaking the evaluations themselves.

Only last week, Alphabet put its most powerful Gemini Ultra 1.0 model into wide release, charging users $20 monthly for access to a better AI assistant to advise you on how to change a tire, help you design a birthday card, or analyze financial statements for you.

Today it is announcing a more limited “research release” of a new version of its Gemini Pro model—Gemini 1.5 Pro —that delivers similar performance to the Ultra 1.0 but in a much smaller model. Smaller models use less computing power to train and run, which also makes them less costly to use.

The 1.5 Pro is also built using a “mixture of experts” design, which means that rather than being a single giant neural network, it is actually an assemblage of several smaller ones, each specialized for a particular task. This too makes the model cheaper to train and to run.
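A toy sketch of the mixture-of-experts idea follows: a gate scores the experts, only the top-scoring ones run, and their outputs are mixed. This is a generic illustration, not Gemini's design, which Google has not detailed here. The experts are plain functions and the gate scores are passed in by hand, whereas a real model would compute them with a learned router.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_forward(x, experts, gate_scores, top_k=1):
    """Route input x to the top_k experts and mix their outputs.

    Only the selected experts are evaluated, which is where the
    compute saving comes from.
    """
    ranked = sorted(range(len(experts)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([gate_scores[i] for i in chosen])
    return sum(w * experts[i](x) for w, i in zip(weights, chosen))


# Three hypothetical experts; the gate prefers the second one.
experts = [lambda x: 2 * x, lambda x: x + 10, lambda x: -x]
output = moe_forward(3.0, experts, gate_scores=[0.1, 2.0, -1.0], top_k=1)
```

With `top_k=1` only the second expert runs, so the cost of a forward pass tracks the size of one expert, not the whole collection.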

Google charges customers of the current Gemini 1.0 Pro model $.0025 per image the model generates, $.002 per second of audio it outputs, and $.00025 per 1,000 characters of text it produces. The company has not said what it plans to charge for the new 1.5 Pro version.

Like Google’s other Gemini models, 1.5 Pro is multi-modal, meaning it has been trained on text, images, audio, and video. It can process inputs or provide outputs in any of these forms.

But, in addition to being smaller, the new Pro does something that even the larger Ultra 1.0 can’t do. It can ingest and analyze far more data than any other AI model on the market, including its bigger, older cousin.

The new Gemini 1.5 Pro can take in about seven books’ worth of text, or a full hour of video, or 11 hours of audio. This makes it easier to ask the AI system questions that involve searching for an answer amid a lot of data, such as trying to find a particular clip in a long video, or trying to answer a complex question about some portion of the federal criminal code.

The new model can do this because its “context window,” or the maximum length of a prompt, can be as long as 1 million tokens. A token is a chunk of data that is, on average, a bit shorter than a word. So one million tokens is about 700,000 words. The next closest publicly available large language model, Anthropic’s Claude 2.0, has a context window of 200,000 tokens.
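Going by the article's own figures (1 million tokens for about 700,000 words), a token works out to roughly 0.7 of a word. The small sketch below applies that ratio to both context windows mentioned. The ratio is a rule of thumb for English text, and the real value depends on the tokenizer and the language.

```python
# Ratio implied by the article: about 700,000 words per 1,000,000 tokens.
WORDS_PER_TOKEN = 700_000 / 1_000_000


def tokens_to_words(tokens):
    """Approximate word count for a given number of tokens."""
    return round(tokens * WORDS_PER_TOKEN)


gemini_words = tokens_to_words(1_000_000)  # context window quoted for 1.5 Pro
claude_words = tokens_to_words(200_000)    # context window quoted for Claude
```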

For now, the new 1.5 Pro is being aimed at corporate customers and AI researchers. It’s being made available to users with access to the Gemini API through Google’s AI Studio sandbox as well as select Google Cloud customers being invited to a “private preview” of the model through Google Cloud’s Vertex AI platform.

Google is desperate to convince big businesses to start building their generative AI applications on top of its AI models. It is hoping this will help it grow its Cloud Computing business, which has consistently been in third place behind Microsoft Azure and Amazon’s AWS. But Google’s new AI features have given it the best opportunity it has had in years to gain market share, particularly from AWS, which has been much slower than its rivals in offering cutting-edge generative AI models.

Last month, Alphabet reported that its cloud revenue grew 25% year over year in the last quarter, a figure that is below the 30% cloud revenue growth Microsoft reported. But Alphabet’s cloud sales were expanding at a rate almost double that reported by AWS. Amazon has sought to rectify its genAI laggard status through a partnership with AI startup Anthropic, although it’s unclear if that alliance will keep it from losing ground to Microsoft Azure and Google Cloud.

Oriol Vinyals, a vice president of research at Google DeepMind who helped develop the latest Gemini model, showed reporters a video demonstration that highlighted how the new model could exhibit a sophisticated understanding of both video and language. When asked about the significance of a piece of paper in an old Buster Keaton silent film, the model not only answered correctly that the paper was a pawn ticket and explained its importance in the film’s plot, it could also cite the correct scene in the film where it was featured. It could also pull out examples of astronauts joking in transcripts of the Apollo 11 mission.

Vinyals also showed how a person could use a simple sketch to ask the model to find scenes or instances in the transcript that matched the sketch.

But, he noted, the model is still fallible. Like all LLM-based AI models, it remains prone to “hallucinations,” in which the model simply invents information. He said the 1.5 Pro’s hallucination rate was “no better or worse” than Google’s earlier Gemini models, but he did not disclose a specific error rate.

In response to journalists’ questions, Vinyals also implied that the demonstration videos he had just played to show off the capabilities of the Gemini 1.5 Pro may have depicted examples Google had cherry picked from among other similar attempts that were less successful.

Many journalists and technologists criticized Google for editing a video demonstration that accompanied the unveiling of its Gemini models in December that made the models seem more capable of understanding scenes in live video as well as speedier in answering questions than they actually are.

The new 1.5 Pro also does not have a persistent long-term memory, unlike OpenAI’s GPT-4.5. This means that while the 1.5 Pro can find information within a fairly large dataset, it cannot remember that information for future sessions. For instance, Vinyals said that a 1.5 Pro user could give the model an entire dictionary for an obscure language and then ask it to translate from that language. But if the user came back a month later, the model wouldn’t instantly know how to do the same translation. The user would have to feed it the dictionary again.

The U.K. government’s newly created AI Safety Institute is supposed to be conducting independent evaluations of the most powerful models that AI companies develop. In addition, AI companies including Google DeepMind have agreed to share information about their own internal safety testing with both the U.K. and U.S. governments. Vinyals said that Google is complying with the promises it made to these governments at the international AI Safety Summit this past summer, but he did not specify whether the U.K. AI Safety Institute has evaluated 1.5 Pro or any of the Gemini models.











 

PoorAndDangerous

Superstar
Joined
Feb 28, 2018
Messages
8,913
Reputation
1,041
Daps
33,071
Anyone been using gpt for anything interesting lately? I sent it an image of Japanese text and it translated it completely accurately. I could see that being very useful if you were traveling in Japan
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,221
Reputation
8,625
Daps
161,907







 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,221
Reputation
8,625
Daps
161,907

Video generation models as world simulators


We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

February 15, 2024

More resources

View Sora overview

Video generation, Sora, Milestone, Release


This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks,[sup]1,2,3[/sup] generative adversarial networks,[sup]4,5,6,7[/sup] autoregressive transformers,[sup]8,9[/sup] and diffusion models.[sup]10,11,12[/sup] These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

Turning visual data into patches

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.[sup]13,14[/sup] The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.[sup]15,16,17,18[/sup] We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.


figure-patches.png


At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space,[sup]19[/sup] and subsequently decomposing the representation into spacetime patches.

Video compression network

We train a network that reduces the dimensionality of visual data.[sup]20[/sup] This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.

Spacetime Latent Patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.
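To illustrate how a patch grid determines sequence length and output size, here is a small sketch that counts spacetime patches for a latent video. All dimensions and patch sizes are hypothetical. The report does not give Sora's latent shapes or patch sizes.

```python
def patch_grid(frames, height, width, pt, ph, pw):
    """Number of patches along time, height and width of a latent video."""
    if frames % pt or height % ph or width % pw:
        raise ValueError("latent dimensions must be divisible by patch size")
    return frames // pt, height // ph, width // pw


def token_count(frames, height, width, pt=1, ph=2, pw=2):
    """Total number of spacetime patches, i.e. transformer tokens."""
    t, h, w = patch_grid(frames, height, width, pt, ph, pw)
    return t * h * w


# A hypothetical 16-frame latent video and a single-frame "video" (an image).
video_tokens = token_count(frames=16, height=32, width=32)
image_tokens = token_count(frames=1, height=32, width=32)
```

Because an image is treated as a one-frame video, the same function covers both cases. A wider or longer grid yields more tokens, which is how the size of a generated sample could be controlled.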

Scaling transformers for video generation

Sora is a diffusion model[sup]21,22,23,24,25[/sup]; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer.[sup]26[/sup] Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling,[sup]13,14[/sup] computer vision,[sup]15,16,17,18[/sup] and image generation.[sup]27,28,29[/sup]
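To make the training objective concrete, here is a hedged pure-Python sketch of "predict the clean patches from noisy ones". The variance-preserving noise mix, the mean-squared-error loss, and the names `add_noise`, `training_loss` and `identity_denoiser` are assumptions for illustration; the report does not specify Sora's noise schedule or loss, and text conditioning is omitted here.

```python
import math
import random


def add_noise(clean_patches, noise_level, rng):
    """Blend each clean value with Gaussian noise; noise_level is in [0, 1]."""
    keep = math.sqrt(1.0 - noise_level)
    spread = math.sqrt(noise_level)
    return [
        [keep * value + spread * rng.gauss(0.0, 1.0) for value in patch]
        for patch in clean_patches
    ]


def mean_squared_error(predicted, target):
    """Average squared difference over every value in every patch."""
    squared = [
        (p - t) ** 2
        for pred_patch, target_patch in zip(predicted, target)
        for p, t in zip(pred_patch, target_patch)
    ]
    return sum(squared) / len(squared)


def training_loss(denoiser, clean_patches, noise_level, rng):
    """Corrupt the clean patches, ask the model to restore them, score the result."""
    noisy = add_noise(clean_patches, noise_level, rng)
    predicted = denoiser(noisy, noise_level)
    return mean_squared_error(predicted, clean_patches)


def identity_denoiser(noisy_patches, noise_level):
    """Hypothetical stand-in for the trained network: returns its input unchanged."""
    return noisy_patches


rng = random.Random(0)
clean = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5]]
loss_without_noise = training_loss(identity_denoiser, clean, 0.0, rng)
loss_with_noise = training_loss(identity_denoiser, clean, 0.5, rng)
```

A real model would replace `identity_denoiser` with a transformer, and training would minimize this loss over many videos and noise levels.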


figure-diffusion.png


In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

Base compute

4x compute

16x compute

Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop or trim videos to a standard size – e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.

Sampling flexibility

Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything in between. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution, all with the same model.

Improved framing and composition

We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right) have improved framing.
 

Language understanding

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3[sup]30[/sup] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.

Example prompt: "a woman wearing a green dress and a sun hat taking a pleasant stroll in Johannesburg, South Africa during a colorful festival"

Interchangeable prompt fragments:

- Subject: a woman; an old man; a toy robot; an adorable kangaroo
- wearing: blue jeans and a white t-shirt; a green dress and a sun hat; purple overalls and cowboy boots
- taking a pleasant stroll in: Mumbai, India; Johannesburg, South Africa; Antarctica
- during: a beautiful sunset; a winter storm; a colorful festival


Prompting with images and videos

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2[sup]31[/sup] and DALL·E 3[sup]30[/sup] images.[/SIZE]

prompting_0.png


A Shiba Inu dog wearing a beret and black turtleneck.

prompting_2.png


Monster Illustration in flat design style of a diverse family of monsters. The group includes a furry brown monster, a sleek black monster with antennas, a spotted green monster, and a tiny polka-dotted monster, all interacting in a playful environment.

prompting_4.png


An image of a realistic cloud that spells “SORA”.

prompting_6.png


In an ornate, historical hall, a massive tidal wave peaks and begins to crash. Two surfers, seizing the moment, skillfully navigate the face of the wave.

Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts differently from the others, yet all four lead to the same ending.


We can use this method to extend a video both forward and backward to produce a seamless infinite loop.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit,[sup]32[/sup] to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.
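The general shape of SDEdit is: add noise to the input up to an intermediate level, then run the usual denoising process from that level, guided by the new prompt. The sketch below shows only that control flow. The function names, the variance-preserving noise mix, the linear schedule, and the `toy_denoise_step` callback (which just pulls values toward a fixed target so the code runs) are assumptions; it is not how Sora applies the method.

```python
import math
import random


def sdedit(source, denoise_step, strength, steps, rng):
    """Noise `source` up to `strength` (0..1), then denoise back to level 0."""
    keep = math.sqrt(1.0 - strength)
    spread = math.sqrt(strength)
    x = [keep * v + spread * rng.gauss(0.0, 1.0) for v in source]
    for i in range(steps):
        # Walk the noise level linearly from `strength` down to zero.
        level = strength * (1.0 - i / steps)
        next_level = strength * (1.0 - (i + 1) / steps)
        x = denoise_step(x, level, next_level)
    return x


def toy_denoise_step(x, level, next_level, target=1.0):
    """Hypothetical stand-in for a prompt-conditioned model step."""
    if level == 0.0:
        return list(x)  # nothing left to denoise
    fraction = next_level / level
    return [target + fraction * (v - target) for v in x]


source = [0.2, -0.3, 0.7]
unchanged = sdedit(source, toy_denoise_step, strength=0.0, steps=4, rng=random.Random(0))
edited = sdedit(source, toy_denoise_step, strength=1.0, steps=4, rng=random.Random(0))
```

With `strength=0.0` the input passes through untouched; with `strength=1.0` the input is replaced by pure noise before denoising, so the result depends only on the denoiser.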

Input video

- change the setting to be in a lush jungle
- change the setting to the 1920s with an old school car. make sure to keep the red color
- make it go underwater
- change the video setting to be different than a mountain? perhaps joshua tree?
- put the video in space with a rainbow road
- keep the video the same but make it be winter
- make it in claymation animation style
- recreate in the style of a charcoal drawing, making sure to be black and white
- change the setting to be cyberpunk
- change the video to a medieval theme
- make it have dinosaurs
- rewrite the video in a pixel art style
[/SIZE]

Connecting videos

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

Image generation capabilities

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.
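A small sketch of what "arranging patches of Gaussian noise in a grid" could look like: the output size is chosen purely by how many noise patches are laid out, and an image is the case with one frame. The function name `init_noise_patches`, the grid sizes, and `patch_dim` are hypothetical values picked for the example.

```python
import random


def init_noise_patches(grid_frames, grid_height, grid_width, patch_dim, rng):
    """Return one Gaussian-noise vector per cell of a (frames, height, width) grid."""
    count = grid_frames * grid_height * grid_width
    return [[rng.gauss(0.0, 1.0) for _ in range(patch_dim)] for _ in range(count)]


rng = random.Random(0)
# Temporal extent of one frame: the starting point for an image.
image_patches = init_noise_patches(1, 16, 16, patch_dim=4, rng=rng)
# Same function, different grids: a wide clip and a tall clip.
widescreen_patches = init_noise_patches(8, 9, 16, patch_dim=4, rng=rng)
vertical_patches = init_noise_patches(8, 16, 9, patch_dim=4, rng=rng)
print(len(image_patches), len(widescreen_patches), len(vertical_patches))  # 256 1152 1152
```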
 


Tech companies sign accord to combat AI-generated election trickery​

FILE - Meta's president of global affairs Nick Clegg speaks at the World Economic Forum in Davos, Switzerland, Jan. 18, 2024. Adobe, Google, Meta, Microsoft, OpenAI, TikTok and other companies are gathering at the Munich Security Conference on Friday to announce a new voluntary framework for how they will respond to AI-generated deepfakes that deliberately trick voters. (AP Photo/Markus Schreiber, File)


BY MATT O’BRIEN AND ALI SWENSON

Updated 1:18 PM EST, February 16, 2024

Major technology companies signed a pact Friday to voluntarily adopt “reasonable precautions” to prevent artificial intelligence tools from being used to disrupt democratic elections around the world.

Tech executives from Adobe, Amazon, Google, IBM, Meta, Microsoft, OpenAI and TikTok gathered at the Munich Security Conference to announce a new voluntary framework for how they will respond to AI-generated deepfakes that deliberately trick voters. Twelve other companies — including Elon Musk’s X — are also signing on to the accord.

“Everybody recognizes that no one tech company, no one government, no one civil society organization is able to deal with the advent of this technology and its possible nefarious use on their own,” said Nick Clegg, president of global affairs for Meta, the parent company of Facebook and Instagram, in an interview ahead of the summit.

The accord is largely symbolic, but targets increasingly realistic AI-generated images, audio and video “that deceptively fake or alter the appearance, voice, or actions of political candidates, election officials, and other key stakeholders in a democratic election, or that provide false information to voters about when, where, and how they can lawfully vote.”

The companies aren’t committing to ban or remove deepfakes. Instead, the accord outlines methods they will use to try to detect and label deceptive AI content when it is created or distributed on their platforms. It notes the companies will share best practices with each other and provide “swift and proportionate responses” when that content starts to spread.

The vagueness of the commitments and lack of any binding requirements likely helped win over a diverse swath of companies, but may disappoint pro-democracy activists and watchdogs looking for stronger assurances.

“The language isn’t quite as strong as one might have expected,” said Rachel Orey, senior associate director of the Elections Project at the Bipartisan Policy Center. “I think we should give credit where credit is due, and acknowledge that the companies do have a vested interest in their tools not being used to undermine free and fair elections. That said, it is voluntary, and we’ll be keeping an eye on whether they follow through.”

Clegg said each company “quite rightly has its own set of content policies.”

“This is not attempting to try to impose a straitjacket on everybody,” he said. “And in any event, no one in the industry thinks that you can deal with a whole new technological paradigm by sweeping things under the rug and trying to play whack-a-mole and finding everything that you think may mislead someone.”

Tech executives were also joined by several European and U.S. political leaders at Friday’s announcement. European Commission Vice President Vera Jourova said while such an agreement can’t be comprehensive, “it contains very impactful and positive elements.” She also urged fellow politicians to take responsibility to not use AI tools deceptively.

She stressed the seriousness of the issue, saying the “combination of AI serving the purposes of disinformation and disinformation campaigns might be the end of democracy, not only in the EU member states.”

The agreement at the German city’s annual security meeting comes as more than 50 countries are due to hold national elections in 2024. Some have already done so, including Bangladesh, Taiwan, Pakistan, and most recently Indonesia.

Attempts at AI-generated election interference have already begun, such as when AI robocalls that mimicked U.S. President Joe Biden’s voice tried to discourage people from voting in New Hampshire’s primary election last month.

Just days before Slovakia’s elections in November, AI-generated audio recordings impersonated a liberal candidate discussing plans to raise beer prices and rig the election. Fact-checkers scrambled to identify them as false, but they were already widely shared as real across social media.

Politicians and campaign committees also have experimented with the technology, from using AI chatbots to communicate with voters to adding AI-generated images to ads.

Friday’s accord said in responding to AI-generated deepfakes, platforms “will pay attention to context and in particular to safeguarding educational, documentary, artistic, satirical, and political expression.”

It said the companies will focus on transparency to users about their policies on deceptive AI election content and work to educate the public about how they can avoid falling for AI fakes.

Many of the companies have previously said they’re putting safeguards on their own generative AI tools that can manipulate images and sound, while also working to identify and label AI-generated content so that social media users know if what they’re seeing is real. But most of those proposed solutions haven’t yet rolled out and the companies have faced pressure from regulators and others to do more.

That pressure is heightened in the U.S., where Congress has yet to pass laws regulating AI in politics, leaving AI companies to largely govern themselves. In the absence of federal legislation, many states are considering ways to put guardrails around the use of AI, in elections and other applications.

The Federal Communications Commission recently confirmed AI-generated audio clips in robocalls are against the law, but that doesn’t cover audio deepfakes when they circulate on social media or in campaign advertisements.

Misinformation experts warn that while AI deepfakes are especially worrisome for their potential to fly under the radar and influence voters this year, cheaper and simpler forms of misinformation remain a major threat. The accord noted this too, acknowledging that “traditional manipulations (‘cheapfakes’) can be used for similar purposes.”

Many social media companies already have policies in place to deter deceptive posts about electoral processes — AI-generated or not. For example, Meta says it removes misinformation about “the dates, locations, times, and methods for voting, voter registration, or census participation” as well as other false posts meant to interfere with someone’s civic participation.

Jeff Allen, co-founder of the Integrity Institute and a former data scientist at Facebook, said the accord seems like a “positive step” but he’d still like to see social media companies taking other basic actions to combat misinformation, such as building content recommendation systems that don’t prioritize engagement above all else.

Lisa Gilbert, executive vice president of the advocacy group Public Citizen, argued Friday that the accord is “not enough” and AI companies should “hold back technology” such as hyper-realistic text-to-video generators “until there are substantial and adequate safeguards in place to help us avert many potential problems.”

In addition to the major platforms that helped broker Friday’s agreement, other signatories include chatbot developers Anthropic and Inflection AI; voice-clone startup ElevenLabs; chip designer Arm Holdings; security companies McAfee and TrendMicro; and Stability AI, known for making the image-generator Stable Diffusion.

Notably absent from the accord is another popular AI image-generator, Midjourney. The San Francisco-based startup didn’t immediately return a request for comment Friday.
But in a statement Friday, X CEO Linda Yaccarino said “every citizen and company has a responsibility to safeguard free and fair elections.”

“X is dedicated to playing its part, collaborating with peers to combat AI threats while also protecting free speech and maximizing transparency,” she said.
 


‘Game on’ for video AI as Runway, Stability react to OpenAI’s Sora leap​

Sharon Goldman @sharongoldman

February 15, 2024 1:04 PM


Runway CEO Cristóbal Valenzuela posted two words on X this afternoon in response to today’s OpenAI surprise demo drop of its new Sora video AI model, which is capable of generating 60-second clips: “game on.”

Screen-Shot-2024-02-15-at-3.21.09-PM.png

Clearly, the race for video AI gold is on. After all, just a few months ago Runway blew people’s minds with its Gen-2 update. Three weeks ago, Google introduced its Lumiere video AI generation model, while just last week, Stability AI launched SVD 1.1, a diffusion model for more consistent AI videos.

Stability AI CEO Emad Mostaque responded positively to the Sora news, calling OpenAI CEO Sam Altman a “wizard.”

Screen-Shot-2024-02-15-at-3.39.16-PM.png

Today’s Sora news firmly squashed any talk about OpenAI “jumping the shark” in the wake of Andrej Karpathy’s departure; reports that it was suddenly getting into the web search product business; and Sam Altman’s $7 trillion AI chips project.





Runway has a need for speed​

In June 2023, Runway was valued at $1.5 billion, after raising $141 million from investors including Google and Nvidia.

In January, Runway announced it added multiple motion controls to AI videos with its Multi Motion Brush, while it is as yet unclear what feature set Sora offers, or its limitations.

Coincidentally, I spoke to Runway’s creative director Jamie Umpherson yesterday about what I think is one of the company’s really fascinating and clever non-video promotional art projects: its Gen-2 Book of Weights, an analog printed booklet of nothing but numbers that is the first volume (of 6834) of what will be the entire collection of Gen-2 model weights.

“It was a fun experiment,” he told me, adding it came from a “conversation in passing as a lot of these more creative experimental ideas do — we were just discussing how funny it would be to make the weights of the model tangible, visual, something that you could actually consume.”

For now, however, Runway will most certainly have to be laser-focused not on art or philosophy but on keeping pace with OpenAI and the rest of the pack in the video AI race.
 



Apparently some folks don't get "data-driven physics engine", so let me clarify. Sora is an end-to-end, diffusion transformer model. It inputs text/image and outputs video pixels directly. Sora learns a physics engine implicitly in the neural parameters by gradient descent through massive amounts of videos.

Sora is a learnable simulator, or "world model". Of course it does not call UE5 explicitly in the loop, but it's possible that UE5-generated (text, video) pairs are added as synthetic data to the training set.

If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths.

I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!

Let's break down the following video. Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee."

- The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space.
- The 3D objects are consistently animated as they sail and avoid each other's paths.
- Fluid dynamics of the coffee, even the foams that form around the ships. Fluid simulation is an entire sub-field of computer graphics, which traditionally requires very complex algorithms and equations.
- Photorealism, almost like rendering with raytracing.
- The simulator takes into account the small size of the cup compared to oceans, and applies tilt-shift photography to give a "minuscule" vibe.
- The semantics of the scene does not exist in the real world, but the engine still implements the correct physical rules that we expect.

Next up: add more modalities and conditioning, then we have a full data-driven UE that will replace all the hand-engineered graphics pipelines.
 