1/22
@karansdalal
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training.
We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency.
Every video below is produced directly by the model in a single shot, without editing, stitching, or post-processing. Every story is newly created.
Demos:
One-Minute Video Generation with Test-Time Training
Paper:
http://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf
https://video.twimg.com/ext_tw_video/1909310443530944513/pu/vid/avc1/720x480/S8MsN5qN0o9f_Lnx.mp4
2/22
@karansdalal
Test-time training (TTT) layers are RNN layers where the hidden state is a machine learning model and the update rule is a step of gradient descent. See this thread for previous work.
[Quoted tweet]
I’m excited to share a project I’ve been working on for over a year, which I believe will fundamentally change our approach to language models.
We’ve designed a new architecture, which replaces the hidden state of an RNN with a machine learning model. This model compresses context through actual gradient descent on input tokens. We call our method “Test-Time-Training layers.”
TTT layers directly replace attention, and unlock linear complexity architectures with expressive memory, allowing us to train LLMs with millions (someday billions) of tokens in context.
Our instantiations, TTT-Linear and TTT-MLP, both match or beat the strongest Transformers and Mamba. Arxiv: arxiv.org/abs/2407.04620
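The idea in the quoted thread can be sketched in a few lines of numpy. This is a toy, not the paper's layer: the hidden state is the weight matrix `W` of a linear model, and processing each token takes one gradient-descent step on a simple reconstruction loss `0.5 * ||W x - x||^2` (a stand-in for the paper's actual self-supervised objective; the learning rate and loss are illustrative assumptions).

```python
import numpy as np

def ttt_linear_sketch(tokens, d, lr=0.1):
    """Toy TTT layer: the hidden state is a linear model's weights W,
    and the update rule is one SGD step per input token."""
    W = np.zeros((d, d))
    outputs = []
    for x in tokens:
        # gradient of 0.5 * ||W x - x||^2 with respect to W is (W x - x) x^T
        grad = np.outer(W @ x - x, x)
        W = W - lr * grad           # "training" at test time: one SGD step
        outputs.append(W @ x)       # emit the token through the updated state
    return np.stack(outputs), W
```

Because the state is updated once per token, the whole scan is linear in sequence length, which is what lets TTT layers stand in for attention at long context.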
3/22
@karansdalal
Our approach simply adds TTT layers to a pre-trained Diffusion Transformer and fine-tunes it on long videos with text annotations. To keep costs manageable, we limit self-attention to local segments and let TTT (linear complexity) operate globally.
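The "local self-attention" part of this split can be pictured as a block-diagonal attention mask: each token attends only within its own segment, so attention cost grows as O(T·k) for segment length k instead of O(T²), while the linear-complexity TTT layers carry information across segments. A minimal sketch (segment length and shapes are illustrative; in the paper, segments correspond to 3-second video clips):

```python
import numpy as np

def local_attention_mask(seq_len, segment_len):
    """Block-diagonal boolean mask: token i may attend to token j only
    if both fall in the same segment of length `segment_len`.
    Cross-segment (global) context is left to the TTT layers."""
    seg_id = np.arange(seq_len) // segment_len
    return seg_id[:, None] == seg_id[None, :]
```

Applying this mask inside standard attention keeps the quadratic term bounded by the segment length, which is what makes fine-tuning on minute-long videos affordable.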
4/22
@karansdalal
We develop an “On-Chip Tensor Parallel” algorithm to implement an efficient TTT-MLP kernel. Specifically, we shard the weights of the “hidden state model” across Streaming Multiprocessors (SMs), and use the DSMEM feature of Hopper GPUs to implement AllReduce among SMs.
This avoids costly transfers between global memory (HBM) and shared memory (SMEM), while still fitting the large hidden state into the small amount of fast SMEM.
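The tensor-parallel math being sharded here can be simulated on the CPU: split the MLP's first weight matrix by columns and the second by rows, give one shard to each "SM," and sum the partial outputs with an AllReduce. A hedged numpy sketch (shard count, ReLU activation, and shapes are illustrative assumptions; the real kernel performs the reduce SM-to-SM over DSMEM rather than as a host-side sum):

```python
import numpy as np

def sharded_mlp(x, W1, W2, num_shards=4):
    """Tensor-parallel MLP forward pass. W1 is split column-wise and W2
    row-wise, so each shard computes an independent partial output;
    summing the partials (the AllReduce) recovers the full result."""
    W1_shards = np.split(W1, num_shards, axis=1)
    W2_shards = np.split(W2, num_shards, axis=0)
    partials = [np.maximum(x @ a, 0.0) @ b        # per-shard activation + matmul
                for a, b in zip(W1_shards, W2_shards)]
    return np.sum(partials, axis=0)               # AllReduce across shards
```

Because the elementwise activation commutes with the column split, the sharded result matches the unsharded MLP exactly; the win on the GPU is that each shard's slice of the hidden state stays resident in its SM's fast SMEM.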
More details in the paper. Kernel code:
GitHub - test-time-training/ttt-tk
5/22
@karansdalal
Grateful for wonderful collaborators. This work will be presented at CVPR 2025.
@danielkoceja @GashonHussein @Jerry_XU_Jiarui @__yuezhao__ @jankautz @guestrin @tatsu_hashimoto @sanmikoyejo @YejinChoinka @xiaolonw
6/22
@karansdalal
+ our wonderful collaborators without Twitter – Shihao Han, Ka Chun Cheung, Youjin Song, and Yu Sun.
7/22
@menhguin
what the fukk (complimentary)
ok for like a solid 30 seconds I thought this was the Test-Time Training used for the ARC AGI MIT submission and I was rly confused
8/22
@karansdalal
Same thing, different application! Best characterization would be "End to End" vs "Non E2E" test-time training.
Test-Time Training Project Website
9/22
@ruslanjabari
damn and this is only ~50 hours of training runs
10/22
@karansdalal
With a 5B model 🫣
11/22
@reborn_agi
This is incredible work — generating coherent, one-minute-long animated stories with zero post-processing is a huge leap in video generation. The TTT approach looks super promising for maintaining temporal consistency. Huge respect to you and the team.
12/22
@karansdalal
Thank you
13/22
@willdepue
very cool work karan! do you have any baselines of what it looks like without test time training?
14/22
@karansdalal
Thank you Will, sorry to miss this! Here's the World Trade Center video with the local-attention baseline.* We have some examples comparing TTT to other RNNs on the project page.
* Disclaimer – this model has fewer parameters than the one with added TTT layers.
https://video.twimg.com/ext_tw_video/1909798570049650689/pu/vid/avc1/720x480/0agZ6XihQUKUJ9iC.mp4
15/22
@TheGrizztronic
Pretty cool. TTT should get more love. Hope this helps!
16/22
@karansdalal
17/22
@jc_stack
Really interested in your pre-training approaches. Have you seen much impact on compute/memory overhead with the TTT layers? Thinking about startup resource constraints here.
18/22
@karansdalal
TTT layers have linear complexity, so long context inference is far better than self-attention. But we still have some way to go on kernel optimization when compared to other modern RNN layers.
Figure 6 from our paper:
19/22
@john7rho
Amazing work Karan
20/22
@karansdalal
Thank you John!
21/22
@jam3scampbell
[Quoted tweet]
in b4 ttt is the new q*
22/22
@nearcyan
hmmmm