bnew

Veteran
Joined
Nov 1, 2015
Messages
58,098
Reputation
8,592
Daps
161,746

Video generation models as world simulators​


We explore large-scale training of generative models on video data. Specifically, we train text-conditional diffusion models jointly on videos and images of variable durations, resolutions and aspect ratios. We leverage a transformer architecture that operates on spacetime patches of video and image latent codes. Our largest model, Sora, is capable of generating a minute of high fidelity video. Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.

February 15, 2024

More resources

View Sora overview

Video generation, Sora, Milestone, Release


This technical report focuses on (1) our method for turning visual data of all types into a unified representation that enables large-scale training of generative models, and (2) qualitative evaluation of Sora’s capabilities and limitations. Model and implementation details are not included in this report.

Much prior work has studied generative modeling of video data using a variety of methods, including recurrent networks,[sup]1,2,3[/sup] generative adversarial networks,[sup]4,5,6,7[/sup] autoregressive transformers,[sup]8,9[/sup] and diffusion models.[sup]10,11,12[/sup] These works often focus on a narrow category of visual data, on shorter videos, or on videos of a fixed size. Sora is a generalist model of visual data—it can generate videos and images spanning diverse durations, aspect ratios and resolutions, up to a full minute of high definition video.

Turning visual data into patches

We take inspiration from large language models which acquire generalist capabilities by training on internet-scale data.[sup]13,14[/sup] The success of the LLM paradigm is enabled in part by the use of tokens that elegantly unify diverse modalities of text—code, math and various natural languages. In this work, we consider how generative models of visual data can inherit such benefits. Whereas LLMs have text tokens, Sora has visual patches. Patches have previously been shown to be an effective representation for models of visual data.[sup]15,16,17,18[/sup] We find that patches are a highly-scalable and effective representation for training generative models on diverse types of videos and images.


figure-patches.png


At a high level, we turn videos into patches by first compressing videos into a lower-dimensional latent space,[sup]19[/sup] and subsequently decomposing the representation into spacetime patches.

Video compression network

We train a network that reduces the dimensionality of visual data.[sup]20[/sup] This network takes raw video as input and outputs a latent representation that is compressed both temporally and spatially. Sora is trained on and subsequently generates videos within this compressed latent space. We also train a corresponding decoder model that maps generated latents back to pixel space.

Spacetime Latent Patches

Given a compressed input video, we extract a sequence of spacetime patches which act as transformer tokens. This scheme works for images too since images are just videos with a single frame. Our patch-based representation enables Sora to train on videos and images of variable resolutions, durations and aspect ratios. At inference time, we can control the size of generated videos by arranging randomly-initialized patches in an appropriately-sized grid.

Scaling transformers for video generation

Sora is a diffusion model[sup]21,22,23,24,25[/sup]; given input noisy patches (and conditioning information like text prompts), it’s trained to predict the original “clean” patches. Importantly, Sora is a diffusion transformer.[sup]26[/sup] Transformers have demonstrated remarkable scaling properties across a variety of domains, including language modeling,[sup]13,14[/sup] computer vision,[sup]15,16,17,18[/sup] and image generation.[sup]27,28,29[/sup]


figure-diffusion.png


In this work, we find that diffusion transformers scale effectively as video models as well. Below, we show a comparison of video samples with fixed seeds and inputs as training progresses. Sample quality improves markedly as training compute increases.

Base compute

4x compute

16x compute

Variable durations, resolutions, aspect ratios

Past approaches to image and video generation typically resize, crop or trim videos to a standard size – e.g., 4 second videos at 256x256 resolution. We find that instead training on data at its native size provides several benefits.

Sampling flexibility

Sora can sample widescreen 1920x1080p videos, vertical 1080x1920 videos and everything inbetween. This lets Sora create content for different devices directly at their native aspect ratios. It also lets us quickly prototype content at lower sizes before generating at full resolution—all with the same model.

Improved framing and composition

We empirically find that training on videos at their native aspect ratios improves composition and framing. We compare Sora against a version of our model that crops all training videos to be square, which is common practice when training generative models. The model trained on square crops (left) sometimes generates videos where the subject is only partially in view. In comparison, videos from Sora (right)s have improved framing.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,098
Reputation
8,592
Daps
161,746
Language understanding

Training text-to-video generation systems requires a large amount of videos with corresponding text captions. We apply the re-captioning technique introduced in DALL·E 3[sup]30[/sup] to videos. We first train a highly descriptive captioner model and then use it to produce text captions for all videos in our training set. We find that training on highly descriptive video captions improves text fidelity as well as the overall quality of videos.

Similar to DALL·E 3, we also leverage GPT to turn short user prompts into longer detailed captions that are sent to the video model. This enables Sora to generate high quality videos that accurately follow user prompts.

a woman

a womanan old mana toy robotan adorable kangaroo


wearing

a green dress and a sun hat

blue jeans and a white t-shirta green dress and a sun hatpurple overalls and cowboy boots


taking a pleasant stroll in

Johannesburg, South Africa

Mumbai, IndiaJohannesburg, South AfricaAntarctica


during

a colorful festival

a beautiful sunseta winter storma colorful festival


Prompting with images and videos

All of the results above and in our landing page show text-to-video samples. But Sora can also be prompted with other inputs, such as pre-existing images or video. This capability enables Sora to perform a wide range of image and video editing tasks—creating perfectly looping video, animating static images, extending videos forwards or backwards in time, etc.

Animating DALL·E images

Sora is capable of generating videos provided an image and prompt as input. Below we show example videos generated based on DALL·E 2[sup]31[/sup] and DALL·E 3[sup]30[/sup] images.[/SIZE]

prompting_0.png


A Shiba Inu dog wearing a beret and black turtleneck.

prompting_2.png


Monster Illustration in flat design style of a diverse family of monsters. The group includes a furry brown monster, a sleek black monster with antennas, a spotted green monster, and a tiny polka-dotted monster, all interacting in a playful environment.

prompting_4.png


An image of a realistic cloud that spells “SORA”.

prompting_6.png


In an ornate, historical hall, a massive tidal wave peaks and begins to crash. Two surfers, seizing the moment, skillfully navigate the face of the wave.

Extending generated videos

Sora is also capable of extending videos, either forward or backward in time. Below are four videos that were all extended backward in time starting from a segment of a generated video. As a result, each of the four videos starts different from the others, yet all four videos lead to the same ending.

00:0000:20

We can use this method to extend a video both forward and backward to produce a seamless infinite loop.

Video-to-video editing

Diffusion models have enabled a plethora of methods for editing images and videos from text prompts. Below we apply one of these methods, SDEdit,[sup]32[/sup] to Sora. This technique enables Sora to transform the styles and environments of input videos zero-shot.

Input video

change the setting to be in a lush junglechange the setting to the 1920s with an old school car. make sure to keep the red colormake it go underwaterchange the video setting to be different than a mountain? perhaps joshua tree?put the video in space with a rainbow roadkeep the video the same but make it be wintermake it in claymation animation stylerecreate in the style of a charcoal drawing, making sure to be black and whitechange the setting to be cyberpunkchange the video to a medieval thememake it have dinosaursrewrite the video in a pixel art style
[/SIZE]

Connecting videos

We can also use Sora to gradually interpolate between two input videos, creating seamless transitions between videos with entirely different subjects and scene compositions. In the examples below, the videos in the center interpolate between the corresponding videos on the left and right.

Image generation capabilities

Sora is also capable of generating images. We do this by arranging patches of Gaussian noise in a spatial grid with a temporal extent of one frame. The model can generate images of variable sizes—up to 2048x2048 resolution.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,098
Reputation
8,592
Daps
161,746

Tech companies sign accord to combat AI-generated election trickery​

FILE - Meta's president of global affairs Nick Clegg speaks at the World Economic Forum in Davos, Switzerland, Jan. 18, 2024. Adobe, Google, Meta, Microsoft, OpenAI, TikTok and other companies are gathering at the Munich Security Conference on Friday to announce a new voluntary framework for how they will respond to AI-generated deepfakes that deliberately trick voters. (AP Photo/Markus Schreiber, File)

FILE - Meta’s president of global affairs Nick Clegg speaks at the World Economic Forum in Davos, Switzerland, Jan. 18, 2024. Adobe, Google, Meta, Microsoft, OpenAI, TikTok and other companies are gathering at the Munich Security Conference on Friday to announce a new voluntary framework for how they will respond to AI-generated deepfakes that deliberately trick voters. (AP Photo/Markus Schreiber, File)

BY MATT O’BRIEN AND ALI SWENSON

Updated 1:18 PM EST, February 16, 2024

Major technology companies signed a pact Friday to voluntarily adopt “reasonable precautions” to prevent artificial intelligence tools from being used to disrupt democratic elections around the world.

Tech executives from Adobe, Amazon, Google, IBM, Meta, Microsoft, OpenAI and TikTok gathered at the Munich Security Conference to announce a new voluntary framework for how they will respond to AI-generated deepfakes that deliberately trick voters. Twelve other companies — including Elon Musk’s X — are also signing on to the accord.

“Everybody recognizes that no one tech company, no one government, no one civil society organization is able to deal with the advent of this technology and its possible nefarious use on their own,” said Nick Clegg, president of global affairs for Meta, the parent company of Facebook and Instagram, in an interview ahead of the summit.

The accord is largely symbolic, but targets increasingly realistic AI-generated images, audio and video “that deceptively fake or alter the appearance, voice, or actions of political candidates, election officials, and other key stakeholders in a democratic election, or that provide false information to voters about when, where, and how they can lawfully vote.”

The companies aren’t committing to ban or remove deepfakes. Instead, the accord outlines methods they will use to try to detect and label deceptive AI content when it is created or distributed on their platforms. It notes the companies will share best practices with each other and provide “swift and proportionate responses” when that content starts to spread.

The vagueness of the commitments and lack of any binding requirements likely helped win over a diverse swath of companies, but may disappoint pro-democracy activists and watchdogs looking for stronger assurances.

“The language isn’t quite as strong as one might have expected,” said Rachel Orey, senior associate director of the Elections Project at the Bipartisan Policy Center. “I think we should give credit where credit is due, and acknowledge that the companies do have a vested interest in their tools not being used to undermine free and fair elections. That said, it is voluntary, and we’ll be keeping an eye on whether they follow through.”

Clegg said each company “quite rightly has its own set of content policies.”

“This is not attempting to try to impose a straitjacket on everybody,” he said. “And in any event, no one in the industry thinks that you can deal with a whole new technological paradigm by sweeping things under the rug and trying to play whack-a-mole and finding everything that you think may mislead someone.”

Tech executives were also joined by several European and U.S. political leaders at Friday’s announcement. European Commission Vice President Vera Jourova said while such an agreement can’t be comprehensive, “it contains very impactful and positive elements.” She also urged fellow politicians to take responsibility to not use AI tools deceptively.

She stressed the seriousness of the issue, saying the “combination of AI serving the purposes of disinformation and disinformation campaigns might be the end of democracy, not only in the EU member states.”

The agreement at the German city’s annual security meeting comes as more than 50 countries are due to hold national elections in 2024. Some have already done so, including Bangladesh, Taiwan, Pakistan, and most recently Indonesia.

Attempts at AI-generated election interference have already begun, such as when AI robocalls that mimicked U.S. President Joe Biden’s voice tried to discourage people from voting in New Hampshire’s primary election last month.

Just days before Slovakia’s elections in November, AI-generated audio recordings impersonated a liberal candidate discussing plans to raise beer prices and rig the election. Fact-checkers scrambled to identify them as false, but they were already widely shared as real across social media.

Politicians and campaign committees also have experimented with the technology, from using AI chatbots to communicate with voters to adding AI-generated images to ads.

Friday’s accord said in responding to AI-generated deepfakes, platforms “will pay attention to context and in particular to safeguarding educational, documentary, artistic, satirical, and political expression.”

It said the companies will focus on transparency to users about their policies on deceptive AI election content and work to educate the public about how they can avoid falling for AI fakes.

Many of the companies have previously said they’re putting safeguards on their own generative AI tools that can manipulate images and sound, while also working to identify and label AI-generated content so that social media users know if what they’re seeing is real. But most of those proposed solutions haven’t yet rolled out and the companies have faced pressure from regulators and others to do more.

That pressure is heightened in the U.S., where Congress has yet to pass laws regulating AI in politics, leaving AI companies to largely govern themselves. In the absence of federal legislation, many states are considering ways to put guardrails around the use of AI, in elections and other applications.

The Federal Communications Commission recently confirmed AI-generated audio clips in robocalls are against the law, but that doesn’t cover audio deepfakes when they circulate on social media or in campaign advertisements.

Misinformation experts warn that while AI deepfakes are especially worrisome for their potential to fly under the radar and influence voters this year, cheaper and simpler forms of misinformation remain a major threat. The accord noted this too, acknowledging that “traditional manipulations (”cheapfakes”) can be used for similar purposes.”

Many social media companies already have policies in place to deter deceptive posts about electoral processes — AI-generated or not. For example, Meta says it removes misinformation about “the dates, locations, times, and methods for voting, voter registration, or census participation” as well as other false posts meant to interfere with someone’s civic participation.

Jeff Allen, co-founder of the Integrity Institute and a former data scientist at Facebook, said the accord seems like a “positive step” but he’d still like to see social media companies taking other basic actions to combat misinformation, such as building content recommendation systems that don’t prioritize engagement above all else.

Lisa Gilbert, executive vice president of the advocacy group Public Citizen, argued Friday that the accord is “not enough” and AI companies should “hold back technology” such as hyper-realistic text-to-video generators “until there are substantial and adequate safeguards in place to help us avert many potential problems.”

In addition to the major platforms that helped broker Friday’s agreement, other signatories include chatbot developers Anthropic and Inflection AI; voice-clone startup ElevenLabs; chip designer Arm Holdings; security companies McAfee and TrendMicro; and Stability AI, known for making the image-generator Stable Diffusion.

Notably absent from the accord is another popular AI image-generator, Midjourney. The San Francisco-based startup didn’t immediately return a request for comment Friday.

The inclusion of X — not mentioned in an earlier announcement about the pending accord — was one of the biggest surprises of Friday’s agreement. Musk sharply curtailed content-moderation teams after taking over the former Twitter and has described himself as a “free speech absolutist.”

But in a statement Friday, X CEO Linda Yaccarino said “every citizen and company has a responsibility to safeguard free and fair elections.”

“X is dedicated to playing its part, collaborating with peers to combat AI threats while also protecting free speech and maximizing transparency,” she said.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,098
Reputation
8,592
Daps
161,746

‘Game on’ for video AI as Runway, Stability react to OpenAI’s Sora leap​

Sharon Goldman @sharongoldman

February 15, 2024 1:04 PM

DALL%C2%B7E-2024-02-15-16.03.59-A-futuristic-scene-showcasing-five-different-digital-animals-in-a-running-race-designed-with-a-high-tech-cybernetic-aesthetic.-The-setting-is-a-neon.webp

Runway CEO Cristóbal Valenzuela posted two words on X this afternoon in response to today’s OpenAI surprise demo drop of its new Sora video AI model, which is capable of generating 60-second clips: “game on.”

Screen-Shot-2024-02-15-at-3.21.09-PM.png

Clearly, the race for video AI gold is on. After all, just a few months ago Runway blew people’s minds with its Gen-2 update. Three weeks ago, Google introduced its Lumiere video AI generation model, while just last week, Stability AI launched SVD 1.1, a diffusion model for more consistent AI videos.

Stability AI CEO Emad Mostaque responded positively to the Sora news, calling OpenAI CEO Sam Altman a “wizard.”

Screen-Shot-2024-02-15-at-3.39.16-PM.png

Today’s Sora news firmly squashed any talk about OpenAI “ jumping the shark” in the wake of Andrej Karpathy’s departure; reports that it was suddenly getting into the web search product business; and Sam Altman’s $7 trillion AI chips project.



VB EVENT​

The AI Impact Tour – NYC

We’ll be in New York on February 29 in partnership with Microsoft to discuss how to balance risks and rewards of AI applications. Request an invite to the exclusive event below.



Request an invite


Runway has a need for speed​

In June 2023, Runway was valued at $1.5 billion, after raising $141 million from investors including Google and Nvidia.

In January, Runway announced it added multiple motion controls to AI videos with its Multi Motion Brush, while it is as yet unclear what feature set Sora offers, or its limitations.

Coincidentally, I spoke to Runway’s creative director Jamie Umpherson yesterday about what I think is one of the company’s really fascinating and clever non-video promotional art projects — its Gen-2 Book of Weights, an analog printed booklet of nothing but numbers that is the first volume (of 6834) what will be the entire collection of Gen-2 model weights.

“It was a fun experiment,” he told me, adding it came from a “conversation in passing as a lot of these more creative experimental ideas do — we were just discussing how funny it would be to make the weights of the model tangible, visual, something that you could actually consume.”

For now, however, Runway will most certainly have to be laser-focused not on art or philosophy, but keeping up with OpenAI and the rest of the pack in order to keep pace in the video AI race.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,098
Reputation
8,592
Daps
161,746


Apparently some folks don't get "data-driven physics engine", so let me clarify. Sora is an end-to-end, diffusion transformer model. It inputs text/image and outputs video pixels directly. Sora learns a physics engine implicitly in the neural parameters by gradient descent through massive amounts of videos.

Sora is a learnable simulator, or "world model". Of course it does not call UE5 explicitly in the loop, but it's possible that UE5-generated (text, video) pairs are added as synthetic data to the training set.

If you think OpenAI Sora is a creative toy like DALLE, ... think again. Sora is a data-driven physics engine. It is a simulation of many worlds, real or fantastical. The simulator learns intricate rendering, "intuitive" physics, long-horizon reasoning, and semantic grounding, all by some denoising and gradient maths.

I won't be surprised if Sora is trained on lots of synthetic data using Unreal Engine 5. It has to be!

Let's breakdown the following video. Prompt: "Photorealistic closeup video of two pirate ships battling each other as they sail inside a cup of coffee."

- The simulator instantiates two exquisite 3D assets: pirate ships with different decorations. Sora has to solve text-to-3D implicitly in its latent space.
- The 3D objects are consistently animated as they sail and avoid each other's paths.
- Fluid dynamics of the coffee, even the foams that form around the ships. Fluid simulation is an entire sub-field of computer graphics, which traditionally requires very complex algorithms and equations.
- Photorealism, almost like rendering with raytracing.
- The simulator takes into account the small size of the cup compared to oceans, and applies tilt-shift photography to give a "minuscule" vibe.
- The semantics of the scene does not exist in the real world, but the engine still implements the correct physical rules that we expect.

Next up: add more modalities and conditioning, then we have a full data-driven UE that will replace all the hand-engineered graphics pipelines.
 

Skooby

Alone In My Zone
Supporter
Joined
Sep 30, 2012
Messages
25,290
Reputation
10,309
Daps
59,970
Reppin
The Cosmos
AI-Powered Service Churns Out Pictures of Fake IDs Capable of Passing KYC/AML for as Little as $15 | Feb 5, 2024





Excellent!
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,098
Reputation
8,592
Daps
161,746


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,098
Reputation
8,592
Daps
161,746

Let me clear a *huge* misunderstanding here.
The generation of mostly realistic-looking videos from prompts *does not* indicate that a system understands the physical world.
Generation is very different from causal prediction from a world model.
The space of plausible videos is very large, and a video generation system merely needs to produce *one* sample to succeed.
The space of plausible continuations of a real video is *much* smaller, and generating a representative chunk of those is a much harder task, particularly when conditioned on an action.
Furthermore, generating those continuations would be not only expensive but totally pointless.
It's much more desirable to generate *abstract representations* of those continuations that eliminate details in the scene that are irrelevant to any action we might want to take.
That is the whole point behind the JEPA (Joint Embedding Predictive Architecture), which is *not generative* and makes predictions in representation space.
Our work on VICReg, I-JEPA, V-JEPA, and the works of others show that Joint Embedding architectures produce much better representations of visual inputs than generative architectures that reconstruct pixels (such as Variational AE, Masked AE, Denoising AE, etc).
When using the learned representations as inputs to a supervised head trained on downstream tasks (without fine tuning the backbone), Joint Embedding beats generative.

See the results table from the V-JEPA blog post or paper:


 
Top