From GPT-4 to Mistral 7B, there is now a 300x range in the cost of LLM inference 💸

With prices spanning more than two orders of magnitude, it's more important than ever to understand the quality, speed and price trade-offs between different LLMs and inference providers.

Smaller, faster and cheaper models are enabling use-cases that could never have been economically viable with a GPT-4 sized model, from consumer chat experiences to at-scale enterprise data extraction.

Mistral 7B Instruct is a 7 billion parameter open-source LLM from French startup @MistralAI. Its quality benchmarks compare strongly against similar-sized models, and while it can’t compete head-on with OpenAI’s GPT-3.5, it is suitable for many use-cases that aren’t pushing reasoning capabilities to the limit. Mistral 7B Instruct is available at competitive prices from a range of providers including @MistralAI, @perplexity_ai, @togethercompute, @anyscalecompute, @DeepInfra and @FireworksAI_HQ.


See our LLM comparison analysis here: Comparison of AI Models across Quality, Performance, Price | Artificial Analysis



 

Create a personalized chatbot with the Chat with RTX tech demo. Accelerated by TensorRT-LLM and Tensor Cores, it lets you quickly get tailored info from your files and content. Just connect your data to an LLM on an RTX-powered PC for fast, local generative AI.



 

New AI polyglot launched to help fill massive language gap in field​


Illustration: a large dictionary with "AI" embossed on the front (Natalie Peeples/Axios)

A new open-source generative AI model can follow instructions in more than 100 languages.

Why it matters: Most models that power today's generative AI tools are trained on data in English and Chinese, leaving a massive gap of thousands of languages — and potentially limiting access to the powerful technology for billions of people.

Details: Cohere for AI, the nonprofit AI research lab at Cohere, on Tuesday released its open-source multilingual large language model (LLM) called Aya.


  • It covers more than twice as many languages as other existing open-source models and is the result of a year-long project involving 3,000 researchers in 119 countries.
How it works: The team started with a base model pre-trained on text that covered 101 languages and fine-tuned it on those languages.

  • But they first had to create a high-quality dataset of prompt and completion pairs (the inputs and outputs of the model) in different languages, which is also being released.
  • Their data sources include machine translations of several existing datasets into more than 100 languages, roughly half of which are considered underrepresented — or unrepresented — in existing text datasets, including Azerbaijani, Bemba, Welsh and Gujarati.
  • They also created a dataset that tries to capture cultural nuances and meaningful information by having about 204,000 prompts and completions curated and annotated by fluent speakers in 67 languages.
  • The team reports Aya outperforms other existing open-source multilingual models when evaluated by humans or using GPT-4.

The impact: "Aya is a massive leap forward — but the biggest goal is all these collaboration networks spur bottom up collaborations," says Sara Hooker, who leads Cohere for AI.

  • The team envisions Aya being used for language research and to preserve and represent languages and cultures at risk of being left out of AI advances.

The big picture: Aya is one of a handful of open-source multilingual models, including BLOOM, which can generate text in 46 languages, a bilingual Arabic-English LLM called Jais, and a model in development by the Masakhane Foundation that covers African languages.

What to watch: "In some ways this is a bandaid for the wider issue with multilingual [LLMs]," Hooker says. "An important bandaid but the issues still persist."


  • Those include figuring out why LLMs can’t seem to be adapted to languages they didn’t see in pre-training and exploring how best to evaluate these models.
  • "We used to think of a model as something that fulfills a very specific, finite and controlled notion of a request," like a definitive answer about whether a response is correct, Hooker says. "And now we want models to do everything, and be universal and fluid."
  • “We can’t have everything, so we will have to choose what we want it to be.”



FEB 13, 2024

Cohere For AI Launches Aya, an LLM Covering More Than 100 Languages​

More than double the number of languages covered by previous open-source AI models to increase coverage for underrepresented communities

Today, the research team at Cohere For AI (C4AI), Cohere’s non-profit research lab, are excited to announce a new state-of-the-art, open-source, massively multilingual, generative large language research model (LLM) covering 101 different languages — more than double the number of languages covered by existing open-source models. Aya helps researchers unlock the powerful potential of LLMs for dozens of languages and cultures largely ignored by most advanced models on the market today.

We are open-sourcing both the Aya model and the largest multilingual instruction fine-tuned dataset to date, with 513 million prompts and completions covering 114 languages. This data collection includes rare annotations from native and fluent speakers all around the world, ensuring that AI technology can effectively serve a broad global audience that has had limited access to date.



Closes the Gap in Languages and Cultural Relevance​

Aya is part of a paradigm shift in how the ML community approaches massively multilingual AI research, representing not just technical progress, but also a change in how, where, and by whom research is done.

As LLMs, and AI generally, have changed the global technological landscape, many communities across the world have been left unsupported due to the language limitations of existing models. This gap hinders the applicability and usefulness of generative AI for a global audience, and it risks widening the disparities that already exist from previous waves of technological development. By focusing primarily on English and one or two dozen other languages as training resources, most models tend to reflect inherent cultural bias.

We started the Aya project to address this gap, bringing together over 3,000 independent researchers from 119 countries.


Figure: Geographical distribution of Aya collaborators


Significantly Outperforms Existing Open-Source Multilingual Models​

The research team behind Aya was able to substantially improve performance for underserved languages, demonstrating superior capabilities in complex tasks, such as natural language understanding, summarization, and translation, across a wide linguistic spectrum.

We benchmark Aya’s performance against available, open-source, massively multilingual models. It surpasses the best of them, such as mT0 and Bloomz, on benchmark tests by a wide margin: Aya’s responses were preferred 75% of the time in human evaluations against other leading open-source models, and it achieved 80-90% simulated win rates across the board.

Aya also expands coverage to more than 50 previously unserved languages, including Somali, Uzbek, and more. While proprietary models do an excellent job serving a range of the most commonly spoken languages in the world, Aya helps to provide researchers with an unprecedented open-source model for dozens of underrepresented languages.


Figure: Head-to-head comparison of preferred model responses


Trained on the Most Extensive Multilingual Dataset to Date​

We are releasing the Aya Collection, consisting of 513 million prompts and completions covering 114 languages. This massive collection was built by fluent speakers around the world, who created templates for selected datasets and augmented a carefully curated list of datasets. It also includes the Aya Dataset, the most extensive human-annotated, multilingual, instruction fine-tuning dataset to date: approximately 204,000 rare, human-curated annotations by fluent speakers in 67 languages, ensuring robust and diverse linguistic coverage. This offers a large-scale repository of high-quality language data for developers and researchers.

Many languages in this collection had no representation in instruction-style datasets before. The fully permissive and open-sourced dataset includes a wide spectrum of language examples, encompassing a variety of dialects and original contributions that authentically reflect organic, natural, and informal language use. This makes it an invaluable resource for multifaceted language research and linguistic preservation efforts.
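
As a rough starting point for working with these releases, the sketch below loads the human-annotated portion with the Hugging Face `datasets` library and tallies examples per language. The repository id and column names are assumptions for illustration; check the official release pages for the exact identifiers.

```python
# Hedged sketch: explore the human-annotated Aya data with Hugging Face
# `datasets`. The repo id and column names below are assumed for illustration.
from collections import Counter

from datasets import load_dataset

ds = load_dataset("CohereForAI/aya_dataset", split="train")  # assumed repo id

# Tally prompt/completion pairs per language to see coverage across the
# 67 annotated languages.
lang_counts = Counter(ds["language"])  # assumed column name
for lang, n in lang_counts.most_common(10):
    print(f"{lang}: {n} examples")

# Peek at one instruction-style example (assumed field names).
example = ds[0]
print(example["inputs"])
print(example["targets"])
```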
 


DeepMind framework offers breakthrough in LLMs’ reasoning​


By Ryan Daws | February 8, 2024

Categories: Artificial Intelligence, Companies, DeepMind, Development, Google

About the author: Ryan is a senior editor at TechForge Media with over a decade of experience covering the latest technology and interviewing leading industry figures. He can often be sighted at tech conferences with a strong coffee in one hand and a laptop in the other. If it's geeky, he’s probably into it. Find him on Twitter (@Gadget_Ry) or Mastodon (@gadgetry@techhub.social)

A breakthrough approach in enhancing the reasoning abilities of large language models (LLMs) has been unveiled by researchers from Google DeepMind and the University of Southern California.

Their new ‘SELF-DISCOVER’ prompting framework – published this week on arXiv and Hugging Face – represents a significant leap beyond existing techniques, potentially revolutionising the performance of leading models such as OpenAI’s GPT-4 and Google’s PaLM 2.

The framework promises substantial enhancements in tackling challenging reasoning tasks. It demonstrates remarkable improvements, boasting up to a 32% performance increase compared to traditional methods like Chain of Thought (CoT). This novel approach revolves around LLMs autonomously uncovering task-intrinsic reasoning structures to navigate complex problems.

At its core, the framework empowers LLMs to self-discover and utilise various atomic reasoning modules – such as critical thinking and step-by-step analysis – to construct explicit reasoning structures.

By mimicking human problem-solving strategies, the framework operates in two stages:
  • Stage one involves composing a coherent reasoning structure intrinsic to the task, leveraging a set of atomic reasoning modules and task examples.
  • During decoding, LLMs then follow this self-discovered structure to arrive at the final solution.
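
A minimal sketch of that two-stage loop is below. It assumes a generic `llm(prompt) -> str` completion callable; the module list and prompt wording are illustrative stand-ins rather than the paper's exact prompts.

```python
# Hedged sketch of the two-stage SELF-DISCOVER loop described above.
# `llm` is any text-completion callable (OpenAI, Gemini, etc.); the module
# list and prompt wording are illustrative, not the paper's exact prompts.
from typing import Callable

ATOMIC_MODULES = [
    "Use critical thinking to question assumptions.",
    "Break the problem into step-by-step sub-problems.",
    "Consider a simpler, analogous problem first.",
    "Check the candidate answer against the task's constraints.",
]

def self_discover_structure(llm: Callable[[str], str],
                            task_examples: list[str]) -> str:
    """Stage 1: select, adapt, and implement modules into a reasoning structure."""
    modules = "\n".join(f"- {m}" for m in ATOMIC_MODULES)
    prompt = (
        "Example tasks:\n" + "\n".join(task_examples) + "\n\n"
        "Select and adapt the reasoning modules below, then implement them as a "
        "step-by-step JSON reasoning structure for solving tasks of this kind:\n"
        + modules
    )
    return llm(prompt)

def solve(llm: Callable[[str], str], task: str, structure: str) -> str:
    """Stage 2 (decoding): follow the self-discovered structure to an answer."""
    prompt = (
        f"Task: {task}\n\n"
        "Follow this reasoning structure step by step, filling in each field, "
        f"then state the final answer:\n{structure}"
    )
    return llm(prompt)
```

The structure from stage one is composed once per task type and reused across that task's instances during decoding, which is consistent with the paper's claim of needing far less inference compute than self-consistency-style sampling.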

In extensive testing across various reasoning tasks – including Big-Bench Hard, Thinking for Doing, and Math – the self-discover approach consistently outperformed traditional methods. Notably, it achieved an accuracy of 81%, 85%, and 73% across the three tasks with GPT-4, surpassing chain-of-thought and plan-and-solve techniques.

However, the implications of this research extend far beyond mere performance gains.

By equipping LLMs with enhanced reasoning capabilities, the framework paves the way for tackling more challenging problems and brings AI closer to achieving general intelligence. Transferability studies conducted by the researchers further highlight the universal applicability of the composed reasoning structures, aligning with human reasoning patterns.

As the landscape evolves, breakthroughs like the SELF-DISCOVER prompting framework represent crucial milestones in advancing the capabilities of language models and offering a glimpse into the future of AI.

(Photo by Victor on Unsplash)




Computer Science > Artificial Intelligence​

[Submitted on 6 Feb 2024]

Self-Discover: Large Language Models Self-Compose Reasoning Structures​

Pei Zhou, Jay Pujara, Xiang Ren, Xinyun Chen, Heng-Tze Cheng, Quoc V. Le, Ed H. Chi, Denny Zhou, Swaroop Mishra, Huaixiu Steven Zheng

We introduce SELF-DISCOVER, a general framework for LLMs to self-discover the task-intrinsic reasoning structures to tackle complex reasoning problems that are challenging for typical prompting methods. Core to the framework is a self-discovery process where LLMs select multiple atomic reasoning modules such as critical thinking and step-by-step thinking, and compose them into an explicit reasoning structure for LLMs to follow during decoding. SELF-DISCOVER substantially improves GPT-4 and PaLM 2's performance on challenging reasoning benchmarks such as BigBench-Hard, grounded agent reasoning, and MATH, by as much as 32% compared to Chain of Thought (CoT). Furthermore, SELF-DISCOVER outperforms inference-intensive methods such as CoT-Self-Consistency by more than 20%, while requiring 10-40x fewer inference compute. Finally, we show that the self-discovered reasoning structures are universally applicable across model families: from PaLM 2-L to GPT-4, and from GPT-4 to Llama2, and share commonalities with human reasoning patterns.

Comments: 17 pages, 11 figures, 5 tables
Subjects: Artificial Intelligence (cs.AI); Computation and Language (cs.CL)
Cite as: arXiv:2402.03620 [cs.AI]
(or arXiv:2402.03620v1 [cs.AI] for this version)
[2402.03620] Self-Discover: Large Language Models Self-Compose Reasoning Structures


Submission history​

From: Pei Zhou

[v1] Tue, 6 Feb 2024 01:13:53 UTC (1,783 KB)


 


World Model on Million-Length Video and Language with RingAttention


Abstract​

Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including using masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and model-generated QA dataset for long sequence chat. (c) A highly-optimized implementation with RingAttention, masked sequence packing, and other key features for training on millions-length multimodal sequences. (d) Fully open-sourced a family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop an understanding of both human knowledge and the multimodal world, and broader capabilities.
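
The memory pressure that motivates RingAttention comes from attending over million-token sequences. The NumPy sketch below is a single-process stand-in for the idea: the loop over key/value blocks plays the role of devices passing KV blocks around a ring, and a numerically stable running softmax means the full attention matrix is never materialized. Block sizes and shapes are illustrative, and the causal masking used for language modeling is omitted.

```python
# Hedged, single-process sketch of blockwise (ring-style) attention in NumPy.
# Each loop iteration stands in for one hop of KV blocks around a device ring;
# running max/sum statistics keep the softmax exact without ever forming the
# full (seq_len x seq_len) score matrix. Non-causal, shapes illustrative.
import numpy as np

def ring_attention(q, k, v, block_size):
    seq_len, d = q.shape
    out = np.zeros((seq_len, v.shape[1]))
    row_max = np.full(seq_len, -np.inf)   # running max of scores per query
    row_sum = np.zeros(seq_len)           # running softmax denominator

    for start in range(0, seq_len, block_size):   # one "hop" around the ring
        kb = k[start:start + block_size]          # KV block currently held
        vb = v[start:start + block_size]
        scores = q @ kb.T / np.sqrt(d)
        new_max = np.maximum(row_max, scores.max(axis=1))
        rescale = np.exp(row_max - new_max)       # correct earlier accumulators
        p = np.exp(scores - new_max[:, None])
        out = out * rescale[:, None] + p @ vb
        row_sum = row_sum * rescale + p.sum(axis=1)
        row_max = new_max

    return out / row_sum[:, None]

# Sanity check against ordinary full attention on a tiny example.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
scores = q @ k.T / np.sqrt(8)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
full = (weights / weights.sum(axis=1, keepdims=True)) @ v
assert np.allclose(ring_attention(q, k, v, block_size=4), full)
```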

Question Answering Over 1 Hour Video.​





Figure 1. Long video understanding. LWM can answer questions about a YouTube video more than an hour long.

Fact Retrieval Over 1M Context.​


Figure 2. Needle retrieval task. LWM achieves high accuracy across a 1M-token context window and outperforms GPT-4V and Gemini Pro.




Figure 3. Needle retrieval task. LWM achieves high accuracy for varying context sizes and positions in the context window.

Long Sequence Any-to-Any AR Prediction.​





Figure 4. Any-to-Any Long Sequence Prediction. RingAttention enables the use of a very large context window for training across diverse formats such as video-text, text-video, image-text, text-image, pure video, pure image, and pure text. See the LWM paper for key features, including masked sequence packing and loss weighting, which allow effective video-language training.

Modeling Diverse Videos and Books With RingAttention.​




Figure 5. Context Extension and Vision-Language Training. Expanding context size from 4K to 1M on books using RingAttention, followed by vision-language training on diverse forms of visual content of lengths 32K to 1M. The lower panel shows interactive capabilities in understanding and responding to queries about the complex multimodal world.

Text-Image Generation.​



Figure 6. Text to Image. LWM generates images based on text prompts, autoregressively.



Text-Video Generation.​

Fireworks exploding in the sky

Waves crashing against the shore

A bustling street in London with red telephone booths and Big Ben in the background

Camera pans left to right on mango slices sitting on a table

A ball thrown in the air

Slow motion flower petals falling on the ground

A burning campfire in a forest

A boat sailing on a stormy ocean


Figure 5. Text to Video. LWM generates videos based on text prompts, autoregressively.

Image Based Conversation.​



Figure 6. Image understanding. LWM can answer questions about images.



Video Chat Over 1 Hour YouTube Video.​






Figure 7. Long Video Chat. LWM answers questions about a 1-hour-long YouTube video even when the state-of-the-art commercial models GPT-4V and Gemini Pro both fail. The relevant clips for each example are at timestamps 9:56 (top) and 6:49 (bottom).







Large World Model (LWM)​

[Project] [Paper] [Models]

Large World Model (LWM) is a general-purpose large-context multimodal autoregressive model. It is trained on a large dataset of diverse long videos and books using RingAttention, and can perform language, image, and video understanding and generation.




Computer Science > Machine Learning​

[Submitted on 13 Feb 2024]

World Model on Million-Length Video And Language With RingAttention​


Hao Liu, Wilson Yan, Matei Zaharia, Pieter Abbeel




Subjects: Machine Learning (cs.LG)
Cite as: arXiv:2402.08268 [cs.LG]
(or arXiv:2402.08268v1 [cs.LG] for this version)
[2402.08268] World Model on Million-Length Video And Language With RingAttention


Submission history​

From: Hao Liu
[v1] Tue, 13 Feb 2024 07:47:36 UTC (7,336 KB)





 


Google unveils a more capable Gemini Pro AI as it seeks an edge against OpenAI and other rivals in tech’s hottest battlefield​

BY JEREMY KAHN

February 15, 2024 at 10:00 AM EST

Google DeepMind CEO Demis Hassabis

Demis Hassabis heads Google DeepMind, Alphabet's advanced AI research lab, which has helped to develop Google's Gemini AI models. The company has unveiled its latest model, a more capable Gemini Pro called 1.5 Pro, which can process more data in one go than any other AI model on the market.

TOBY MELVILLE—WPA POOL/GETTY IMAGES

Another week, another Gemini model.

Alphabet-owned Google has been furiously pushing out new Gemini-branded AI models, which can power business applications like chatbots and coding assistants, as it tries to outdo rivals OpenAI and Microsoft in a generative AI battle that keeps getting more intense.

Earlier this week, OpenAI announced it was adding a persistent memory to ChatGPT—which means the chatbot will remember facts about the user’s preferences and past dialogues and apply them to future responses. Now Google is countering by releasing a yet more capable and compact Gemini model.




And, while a Google DeepMind researcher who worked on the models said the company is sharing information on the safety of these new AI systems with government regulatory bodies in the U.S. and U.K., it isn’t exactly waiting for their greenlight before releasing them to the public.

While this information sharing is voluntary, the U.K. government has often implied that its AI Safety Institute will help keep dangerous models from posing a risk to the public, a role it cannot fulfil if tech companies keep pushing out models far faster than it can evaluate them. In the U.S., a similarly named AI Safety Institute is only charged with issuing standards for companies to use in evaluating their own models, not undertaking the evaluations themselves.

Only last week, Alphabet put its most powerful Gemini Ultra 1.0 model into wide release, charging users $20 monthly for access to a better AI assistant that can advise them on how to change a tire, help them design a birthday card, or analyze financial statements.

Today it is announcing a more limited “research release” of a new version of its Gemini Pro model—Gemini 1.5 Pro—that delivers similar performance to the Ultra 1.0 but in a much smaller model. Smaller models use less computing power to train and run, which also makes them less costly to use.

The 1.5 Pro is also built using a “mixture of experts” design, which means that rather than being a single giant neural network, it is actually an assemblage of several smaller ones, each specialized for a particular task. This too makes the model cheaper to train and to run.

Google charges customers of the current Gemini 1.0 Pro model $.0025 per image the model generates, $.002 per second of audio it outputs, and $.00025 per 1,000 characters of text it produces. The company has not said what it plans to charge for the new 1.5 Pro version.
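
Taken at face value, those list prices make back-of-the-envelope estimates straightforward; the workload below is a hypothetical example, not a figure from Google.

```python
# Rough cost estimate from the Gemini 1.0 Pro list prices quoted above.
# The workload numbers are hypothetical, for illustration only.
PRICE_PER_IMAGE = 0.0025           # dollars per image
PRICE_PER_AUDIO_SECOND = 0.002     # dollars per second of audio
PRICE_PER_1K_TEXT_CHARS = 0.00025  # dollars per 1,000 characters of text

images, audio_seconds, text_chars = 100, 600, 250_000  # hypothetical workload

cost = (images * PRICE_PER_IMAGE
        + audio_seconds * PRICE_PER_AUDIO_SECOND
        + text_chars / 1_000 * PRICE_PER_1K_TEXT_CHARS)
print(f"Estimated cost: ${cost:.4f}")  # 0.25 + 1.20 + 0.0625 = $1.5125
```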

Like Google’s other Gemini models, 1.5 Pro is multi-modal, meaning it has been trained on text, images, audio, and video. It can process inputs or provide outputs in any of these forms.

But, in addition to being smaller, the new Pro does something that even the larger Ultra 1.0 can’t do. It can ingest and analyze far more data than any other AI model on the market, including its bigger, older cousin.

The new Gemini 1.5 Pro can take in about seven books’ worth of text, or a full hour of video, or 11 hours of audio. This makes it easier to ask the AI system questions that involve searching for an answer amid a lot of data, such as trying to find a particular clip in a long video, or trying to answer a complex question about some portion of the federal criminal code.

The new model can do this because its “context window”—or the maximum length of a prompt—can be as long as 1 million tokens. A token is a chunk of data roughly three-quarters of a word long, so one million tokens is about 700,000 words. The next closest publicly available large language model, Anthropic’s Claude 2.1, has a context window of 200,000 tokens.
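
A quick check of that arithmetic, using the ratios quoted in this article; the words-per-book figure is an assumption of this example rather than anything Google has stated.

```python
# Context-window arithmetic from the figures quoted above. WORDS_PER_TOKEN and
# WORDS_PER_BOOK are rough assumptions used only for illustration.
CONTEXT_TOKENS = 1_000_000
WORDS_PER_TOKEN = 0.7     # ~700,000 words per million tokens, per the article
WORDS_PER_BOOK = 100_000  # assumed average book length

words = CONTEXT_TOKENS * WORDS_PER_TOKEN
print(f"{words:,.0f} words, roughly {words / WORDS_PER_BOOK:.0f} books")

# The 200,000-token window cited for Claude, in the same units:
print(f"{200_000 * WORDS_PER_TOKEN:,.0f} words")
```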

For now, the new 1.5 Pro is being aimed at corporate customers and AI researchers. It’s being made available to users with access to the Gemini API through Google’s AI Studio sandbox as well as select Google Cloud customers being invited to a “private preview” of the model through Google Cloud’s Vertex AI platform.

Google is desperate to convince big businesses to start building their generative AI applications on top of its AI models. It is hoping this will help it grow its Cloud Computing business, which has consistently been in third place behind Microsoft Azure and Amazon’s AWS. But Google’s new AI features have given it the best opportunity it has had in years to gain market share, particularly from AWS, which has been much slower than its rivals in offering cutting-edge generative AI models.

Last month, Alphabet reported that its cloud revenue grew 25% year over year in the last quarter, a figure below the 30% cloud revenue growth Microsoft reported. But Alphabet’s cloud sales were expanding at a rate almost double that reported by AWS. Amazon has sought to rectify its genAI laggard status through a partnership with AI startup Anthropic, although it’s unclear whether that alliance will keep it from losing ground to Microsoft Azure and Google Cloud.

Oriol Vinyals, a vice president of research at Google DeepMind who helped develop the latest Gemini model, showed reporters a video demonstration that highlighted how the new model could exhibit a sophisticated understanding of both video and language. When asked about the significance of a piece of paper in an old Buster Keaton silent film, the model not only answered correctly that the paper was a pawn ticket and explained its importance in the film’s plot, but also cited the correct scene in the film where it was featured. It could also pull out examples of astronauts joking in transcripts of the Apollo 11 mission.

Vinyals also showed how a person could use a simple sketch to ask the model to find scenes or instances in the transcript that matched the sketch.

But, he noted, the model is still fallible. Like all LLM-based AI models, it remains prone to “hallucinations,” in which the model simply invents information. He said the 1.5 Pro’s hallucination rate was “no better or worse” than Google’s earlier Gemini models, but he did not disclose a specific error rate.

In response to journalists’ questions, Vinyals also implied that the demonstration videos he had just played to show off the capabilities of the Gemini 1.5 Pro may have depicted examples Google had cherry picked from among other similar attempts that were less successful.

Many journalists and technologists criticized Google for editing a video demonstration that accompanied the unveiling of its Gemini models in December that made the models seem more capable of understanding scenes in live video as well as speedier in answering questions than they actually are.

The new 1.5 Pro also does not have a persistent long-term memory, unlike the memory feature OpenAI just added to ChatGPT. This means that while the 1.5 Pro can find information within a fairly large dataset, it cannot remember that information for future sessions. For instance, Vinyals said that a 1.5 Pro user could give the model an entire dictionary for an obscure language and then ask it to translate from that language. But if the user came back a month later, the model wouldn’t instantly know how to do the same translation. The user would have to feed it the dictionary again.

The U.K. government’s newly created AI Safety Institute is supposed to be conducting independent evaluations of the most powerful models that AI companies develop. In addition, AI companies including Google DeepMind have agreed to share information about their own internal safety testing with both the U.K. and U.S. governments. Vinyals said that Google is complying with the promises it made to these governments at the international AI Safety Summit this past summer, but he did not specify whether the U.K. AI Safety Institute has evaluated 1.5 Pro or any of the Gemini models.










 


BUSINESS

FEB 15, 2024 1:15 PM

OpenAI’s Sora Turns AI Prompts Into Photorealistic Videos​


OpenAI's entry into generative AI video is an impressive first step.


VIDEO: WIRED STAFF; GETTY IMAGES

We already know that OpenAI’s chatbots can pass the bar exam without going to law school. Now, just in time for the Oscars, a new OpenAI app called Sora hopes to master cinema without going to film school. For now a research product, Sora is going out to a few select creators and a number of security experts who will red-team it for safety vulnerabilities. OpenAI plans to make it available to all wannabe auteurs at some unspecified date, but it decided to preview it in advance.

Other companies, from giants like Google to startups like Runway, have already revealed text-to-video AI projects. But OpenAI says that Sora is distinguished by its striking photorealism—something I haven’t seen in its competitors—and its ability to produce longer clips than the brief snippets other models typically do, up to one minute. The researchers I spoke to won’t say how long it takes to render all that video, but when pressed, they described it as more in the “going out for a burrito” ballpark than “taking a few days off.” If the hand-picked examples I saw are to be believed, the effort is worth it.

OpenAI didn’t let me enter my own prompts, but it shared four instances of Sora’s power. (None approached the purported one-minute limit; the longest was 17 seconds.) The first came from a detailed prompt that sounded like an obsessive screenwriter’s setup: “Beautiful, snowy Tokyo city is bustling. The camera moves through the bustling city street, following several people enjoying the beautiful snowy weather and shopping at nearby stalls. Gorgeous sakura petals are flying through the wind along with snowflakes.”


AI-generated video made with OpenAI's Sora.


COURTESY OF OPENAI

The result is a convincing view of what is unmistakably Tokyo, in that magic moment when snowflakes and cherry blossoms coexist. The virtual camera, as if affixed to a drone, follows a couple as they slowly stroll through a streetscape. One of the passersby is wearing a mask. Cars rumble by on a riverside roadway to their left, and to the right shoppers flit in and out of a row of tiny shops.

It’s not perfect. Only when you watch the clip a few times do you realize that the main characters—a couple strolling down the snow-covered sidewalk—would have faced a dilemma had the virtual camera kept running. The sidewalk they occupy seems to dead-end; they would have had to step over a small guardrail to a weird parallel walkway on their right. Despite this mild glitch, the Tokyo example is a mind-blowing exercise in world-building. Down the road, production designers will debate whether it’s a powerful collaborator or a job killer. Also, the people in this video—who are entirely generated by a digital neural network—aren’t shown in close-up, and they don’t do any emoting. But the Sora team says that in other instances they’ve had fake actors showing real emotions.

The other clips are also impressive, notably one asking for “an animated scene of a short fluffy monster kneeling beside a red candle,” along with some detailed stage directions (“wide eyes and open mouth”) and a description of the desired vibe of the clip. Sora produces a Pixar-esque creature that seems to have DNA from a Furby, a Gremlin, and Sully in Monsters, Inc. I remember when that latter film came out, Pixar made a huge deal of how difficult it was to create the ultra-complex texture of a monster’s fur as the creature moved around. It took all of Pixar’s wizards months to get it right. OpenAI’s new text-to-video machine … just did it.

“It learns about 3D geometry and consistency,” says Tim Brooks, a research scientist on the project, of that accomplishment. “We didn’t bake that in—it just entirely emerged from seeing a lot of data.”


AI-generated video made with the prompt, “animated scene features a close-up of a short fluffy monster kneeling beside a melting red candle. the art style is 3d and realistic, with a focus on lighting and texture. the mood of the painting is one of wonder and curiosity, as the monster gazes at the flame with wide eyes and open mouth. its pose and expression convey a sense of innocence and playfulness, as if it is exploring the world around it for the first time. the use of warm colors and dramatic lighting further enhances the cozy atmosphere of the image.”


COURTESY OF OPENAI

While the scenes are certainly impressive, the most startling of Sora’s capabilities are those that it has not been trained for. Powered by a version of the diffusion model used by OpenAI’s DALL-E 3 image generator as well as the transformer-based engine of GPT-4, Sora does not merely churn out videos that fulfill the demands of the prompts, but does so in a way that shows an emergent grasp of cinematic grammar.

That translates into a flair for storytelling. Another video was created from a prompt for “a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.” Bill Peebles, another researcher on the project, notes that Sora created a narrative thrust with its camera angles and timing. “There's actually multiple shot changes—these are not stitched together, but generated by the model in one go,” he says. “We didn’t tell it to do that, it just automatically did it.”


AI-generated video made with the prompt “a gorgeously rendered papercraft world of a coral reef, rife with colorful fish and sea creatures.”COURTESY OF OPENAI


In another example I didn’t view, Sora was prompted to give a tour of a zoo. “It started off with the name of the zoo on a big sign, gradually panned down, and then had a number of shot changes to show the different animals that live at the zoo,” says Peebles. “It did it in a nice and cinematic way that it hadn't been explicitly instructed to do.”

One feature in Sora that the OpenAI team didn’t show, and may not release for quite a while, is the ability to generate videos from a single image or a sequence of frames. “This is going to be another really cool way to improve storytelling capabilities,” says Brooks. “You can draw exactly what you have on your mind and then animate it to life.” OpenAI is aware that this feature also has the potential to produce deepfakes and misinformation. “We’re going to be very careful about all the safety implications for this,” Peebles adds.

Expect Sora to have the same restrictions on content as DALL-E 3: no violence, no porn, no appropriating real people or the style of named artists. Also, as with DALL-E 3, OpenAI will provide a way for viewers to identify the output as AI-created. Even so, OpenAI says that safety and veracity are ongoing problems that are bigger than any one company. “The solution to misinformation will involve some level of mitigations on our part, but it will also need understanding from society and for social media networks to adapt as well,” says Aditya Ramesh, lead researcher and head of the DALL-E team.

AI-generated video made with the prompt “several giant wooly mammoths approach treading through a snowy meadow, their long wooly fur lightly blows in the wind as they walk, snow covered trees and dramatic snow capped mountains in the distance, mid afternoon light with wispy clouds and a sun high in the distance creates a warm glow, the low camera view is stunning capturing the large furry mammal with beautiful photography, depth of field.”COURTESY OF OPENAI

Another potential issue is whether the content of the video Sora produces will infringe on the copyrighted work of others. “The training data is from content we’ve licensed and also publicly available content,” says Peebles. Of course, the nub of a number of lawsuits against OpenAI hinges on the question whether “publicly available” copyrighted content is fair game for AI training.

It will be a very long time, if ever, before text-to-video threatens actual filmmaking. No, you can’t make coherent movies by stitching together 120 of the minute-long Sora clips, since the model won’t respond to prompts in the exact same way—continuity isn’t possible. But the time limit is no barrier for Sora and programs like it to transform TikTok, Reels, and other social platforms. “In order to make a professional movie, you need so much expensive equipment,” says Peebles. “This model is going to empower the average person making videos on social media to make very high-quality content.”

As for now, OpenAI is faced with the huge task of making sure that Sora isn’t a misinformation train wreck. But after that, the long countdown begins until the next Christopher Nolan or Celine Song gets a statuette for wizardry in prompting an AI model. The envelope, please!
 