bnew

Veteran
Joined
Nov 1, 2015
Messages
62,427
Reputation
9,478
Daps
171,090


1/10
@_philschmid
Gemini 2.5 Flash is here! We're excited to launch our first hybrid reasoning Gemini model. In 2.5 Flash, developers can turn thinking off.

TL;DR:

🧠 Controllable "thinking" with a thinking budget of up to 24k tokens (see the sketch below)
🌌 1 million token multimodal input context for text, image, video, audio, and PDF
🛠️ Function calling, structured output, Google Search & code execution.
🏦 $0.15 per 1M input tokens; $0.60 (non-thinking) or $3.50 (thinking on) per 1M output tokens (thinking tokens are billed as output tokens)
💡 Knowledge cutoff of January 2025
🚀 Rate limits: free tier 10 RPM, 500 requests/day
🏅Outperforms 2.0 Flash on every benchmark

Try it ⬇️



GoxA4AEWAAAPoDs.jpg
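For reference, here is a minimal sketch of controlling the thinking budget from Python with the google-genai SDK. It assumes an AI Studio API key and the preview model name used in this thread; treat the exact parameter names as illustrative rather than authoritative.

from google import genai
from google.genai import types

# Minimal sketch, assuming an AI Studio API key.
client = genai.Client(api_key="YOUR_API_KEY")

response = client.models.generate_content(
    model="gemini-2.5-flash-preview-04-17",
    contents="Explain hybrid reasoning in two sentences.",
    config=types.GenerateContentConfig(
        # thinking_budget=0 turns thinking off; larger values (up to ~24k) allow deeper reasoning.
        thinking_config=types.ThinkingConfig(thinking_budget=1024),
    ),
)
print(response.text)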


2/10
@_philschmid
Sign in - Google Accounts



3/10
@aniruddhadak
That is wonderful 👍



4/10
@bennetkrause
Always love your iteration speed, knowledge cutoff, and pricing. Keep it up!



5/10
@CosmicRob87
Is the 24k the max permissible token count? I’m asking because on auto, for one problem it used north of 41k



6/10
@pichi_
Great!!!



7/10
@boogeyManKDot
These 1M ctx will soon look common. You better be working on a greater moat



8/10
@AndaICP
*Tilts head, bamboo shoot dangling from mouth* Interesting - but does the "thinking budget" account for spontaneous curiosity sparks that defy token limits?



9/10
@b_kalisetty
Any suggestions on how to consistently see the thoughts in output ?



10/10
@TDev168
Is it able to edit images?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





















1/20
@CodeByPoonam
Google just dropped Gemini 2.5 Flash, and people are going crazy over it.

SPOILER: Claude is now falling behind.

13 wild examples so far (Don't miss the 5th one)



Go4cqVEbUAAZr8V.jpg


2/20
@CodeByPoonam
1. Tron-style game

[Quoted tweet]
Gemini 2.5 Flash Thinking 24k

Prompt: "Create Design a visually striking Tron-style game in a single HTML file, where AI-controlled light cycles compete in fast-paced, strategic battles against each other"


https://video.twimg.com/amplify_video/1912953001712447488/vid/avc1/1920x1080/-IoE5vICEJ3TqYS_.mp4

3/20
@CodeByPoonam
2. Gemini 2.5 Flash vs ChatGPT o3

[Quoted tweet]
I tested Gemini 2.5 Flash vs ChatGPT o3

Which one did better?


https://video.twimg.com/amplify_video/1913198527746129920/vid/avc1/1280x720/InEUUE-tUG1QljHE.mp4

4/20
@CodeByPoonam
3. Galton Board Test

[Quoted tweet]
Gemini 2.5 Flash demolishes my Galton Board test; I could not get o4-mini, o4-mini-high, or o3 to produce this. I found that Gemini 2.5 Flash understands my intent almost instantly, and the code produced is tight and neat. The prompt is a merging of various steps. It took me 5 steps to achieve this in Gemini 2.5 Flash; I gave up on the OpenAI models after about half an hour. My iterations are obviously not exact, but people can test with this one prompt for a more objective comparison.

Please try this prompt on your end to confirm:
--------------------------------------------------
Create a self-contained HTML file for a Galton board simulation using client-side JavaScript and a 2D physics engine (like Matter.js, included via CDN). The simulation should be rendered on an HTML5 canvas and meet the following criteria:
1. **Single File:** All necessary HTML, CSS, and JavaScript code must be within this single `.html` file.
2. **Canvas Size:** The overall simulation area (canvas) should be reasonably sized to fit on a standard screen without requiring extensive scrolling or zooming (e.g., around 500x700 pixels).
3. **Physics:** Utilize a 2D rigid body physics engine for realistic ball-peg and ball-wall interactions.
4. **Obstacles (Pegs):** Create static, circular pegs arranged in full-width horizontal rows extending across the usable width of the board (not just a triangle). The pegs should be small enough and spaced appropriately for balls to navigate and bounce between them.
5. **Containment:**
   * Include static, sufficiently thick side walls and a ground at the bottom to contain the balls within the board.
   * Implement *physical* static dividers between the collection bins at the bottom. These dividers must be thick enough to prevent balls from passing through them, ensuring accurate accumulation in each bin.
6. **Ball Dropping:** Balls should be dropped from a controlled, narrow area near the horizontal center at the top of the board to ensure they enter the peg field consistently.
7. **Bins:** The collection area at the bottom should be divided into distinct bins by the physical dividers. The height of the bins should be sufficient to clearly visualize the accumulation of balls.
8. **Visualization:** Use a high-contrast color scheme to clearly distinguish between elements. Specifically, use yellow for the structural elements (walls, top guides, physical bin dividers, ground), a contrasting color (like red) for the pegs, and a highly contrasting color (like dark grey or black) for the balls.
9. **Demonstration:** The simulation should visually demonstrate the formation of the normal (or binomial) distribution as multiple balls fall through the pegs and collect in the bins. Ensure the physics parameters (restitution, friction, density) and ball drop rate are tuned for a smooth and clear demonstration of the distribution.

#OpenAI @sama @gdb @ai_for_success @aidan_mclau


https://video.twimg.com/amplify_video/1912972357947334656/vid/avc1/776x586/dX9gd5al-B2qxt6t.mp4

5/20
@CodeByPoonam
Get the latest updates on AI insights and Tutorials.

Join "AI Toast" the community of 35,000 readers for FREE.

Read latest edition here:
AI Toast



6/20
@CodeByPoonam
4. Gemini 2.5 Flash is blazing fast

[Quoted tweet]
First test of Gemini 2.5 Flash on my phone. This model is blazing fast and it one-shotted this mobile-friendly animation. The code looks pretty clean too. Good vibes so far.


https://video.twimg.com/ext_tw_video/1912946801809772545/pu/vid/avc1/590x1278/nXzNRDKeHXL7JAyb.mp4

7/20
@CodeByPoonam
5. Cloth Simulation

[Quoted tweet]
Prompt: Create a cloth simulation using Verlet integration in a single HTML file (Canvas or Three.js). Include wind, gravity, and drag. Let users interact by dragging points. Cloth should bend and move realistically.

Model: Gemini flash 2.5


https://video.twimg.com/ext_tw_video/1913047505815953408/pu/vid/avc1/590x1278/WSwRATTymRpNQRy2.mp4

8/20
@CodeByPoonam
6. Image segmentation masks on command

[Quoted tweet]
Gemini 2.5 Pro and Flash now have the ability to return image segmentation masks on command, as base64 encoded PNGs embedded in JSON strings

I vibe coded this interactive tool for exploring this new capability - it costs a fraction of a cent per image


Go0lkL7bwAABgeT.jpg
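Because the masks come back as base64-encoded PNGs embedded in a JSON string, decoding them client-side takes only a few lines. The sketch below is illustrative only: the "mask" and "label" field names are assumptions, not a documented schema, so adjust them to whatever structure your prompt asks the model to return.

import base64
import json

def save_masks(model_json_text: str) -> None:
    # Assumes the model was asked to answer with a JSON list of objects.
    items = json.loads(model_json_text)
    for i, item in enumerate(items):
        data = item["mask"]                  # e.g. "data:image/png;base64,iVBOR..."
        b64 = data.split("base64,", 1)[-1]   # strip a data-URI prefix if present
        png_bytes = base64.b64decode(b64)
        with open(f"mask_{i}_{item.get('label', 'object')}.png", "wb") as f:
            f.write(png_bytes)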


9/20
@CodeByPoonam
7. MCP AI Agent

[Quoted tweet]
I built an MCP AI Agent using Gemini Flash 2.5 with access to AirBnB and Google Maps in just 30 lines of Python Code.

100% Opensource Code.


https://video.twimg.com/amplify_video/1913056645271429120/vid/avc1/1212x720/AfIwfVNsUWTKRlmu.mp4

10/20
@CodeByPoonam
8. Gemini 2.5 Flash is a very cheap and super intelligent model.

[Quoted tweet]
Gemini 2.5 Flash Preview is an amazing model. Google is literally winning. No stopping them now; this is not normal.

Gemini 2.5 Flash is a very cheap and super intelligent model. Intelligence too cheap to meter: this is what it means.

Thank you, Google.


https://video.twimg.com/amplify_video/1912957952824356864/vid/avc1/1920x1080/20ckf4zJ7d1F-Y5P.mp4

11/20
@CodeByPoonam
8. Classic Snakes and Ladders

[Quoted tweet]
A lot of people make a snake game when trying out new models. I went with the classic Snakes and Ladders instead — built it using @GoogleDeepMind's latest Gemini 2.5 Flash, and it nails it. Look at how it follows the stairs and snakes so smoothly.
Still a WIP and don’t mind the extra dot on the die though 🐍🎲
It is said that this game started in ancient India where it was called Moksha Patam. Every move was a little life lesson where ladders were virtues while snakes were vices.


https://video.twimg.com/amplify_video/1913417180785360896/vid/avc1/1920x1080/nSy-2R-lP8ZiOk13.mp4

12/20
@CodeByPoonam
9. Create Simulation

[Quoted tweet]
AGI is achieved by Google's Gemini 2.5 Flash Preview
Seriously, this is the best simulation I've ever created of how AI models work


https://video.twimg.com/amplify_video/1912963072299311104/vid/avc1/1464x720/5TOp8tU-RVWCulcR.mp4

13/20
@CodeByPoonam
10. A Block breaker

[Quoted tweet]
📕 Breaking: Gemini 2.5 Flash arrives: a new-era model that lets you control the AI's thinking at will

- Thinking process can be toggled on/off
- Reasoning ability greatly improved while keeping speed and cost efficiency
- A hybrid-thinking AI whose thinking budget setting lets you freely optimize quality, cost, and latency

As a test, I built a block breaker

I've summarized 7 points worth noting 🚀


GoxjbXnasAAQc-Y.jpg


14/20
@CodeByPoonam
11. A dreamy low-poly floating island scene

[Quoted tweet]
Gemini 2.5 Pro 🆚 Gemini 2.5 Flash Thinking 24k

Prompt: "Create a dreamy low-poly floating island scene with dynamic lighting and gentle animations, in a single HTML file."

Gemini 2.5 Pro (left), Gemini 2.5 Flash (right)


https://video.twimg.com/amplify_video/1912964537277452288/vid/avc1/1920x1080/9egTWI8Uw7s6dkfe.mp4

15/20
@CodeByPoonam
12. Generate an SVG of a pelican riding a bicycle

[Quoted tweet]
I upgraded llm-gemini to support the new model, including a "-o thinking_budget X" option for setting the thinking budget

llm install -U llm-gemini
llm -m gemini-2.5-flash-preview-04-17 'Generate an SVG of a pelican riding a bicycle' -o thinking_budget 24576


Gow_iFabYAA9awi.png


16/20
@CodeByPoonam
13. Destroys Claude Sonnet 3.7 in benchmarks

[Quoted tweet]
Holy sh*t

Google Gemini 2.5 Flash dropped.

It destroyed Claude Sonnet 3.7 (64k Extended Thinking) in benchmarks 🤯

20x cheaper on input
25x cheaper on output
~4.2x cheaper on output with reasoning


Go0Vh5lWgAADLVo.jpg


17/20
@CodeByPoonam
Gemini 2.5 Flash is now available on Gemini App, AI Studio, and API

Gemini: ‎Gemini
AI Studio: Sign in - Google Accounts



Go4c4EfaAAA1mZA.jpg


18/20
@CodeByPoonam
Thanks for reading!

If you liked this post, check out my AI updates and tutorials Newsletter.

Join 35000+ readers in the AI Toast Community for free: AI Toast



19/20
@CodeByPoonam
Don't forget to bookmark for later.

If you enjoyed reading this post, please support it with like/repost of the post below 👇

[Quoted tweet]
Google just dropped Gemini 2.5 Flash, and people are going crazy over it.

SPOILER: Claude is now falling behind.

13 wild examples so far (Don't miss the 5th one)


Go4cqVEbUAAZr8V.jpg


20/20
@ricathrs
Gemini 2.5 Flash sounds like a game changer! 🌟




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
62,427
Reputation
9,478
Daps
171,090

OpenAI Releases a Practical Guide to Building LLM Agents for Real-World Applications​


By Nikhil

April 17, 2025

OpenAI has published a detailed and technically grounded guide, A Practical Guide to Building Agents, tailored for engineering and product teams exploring the implementation of autonomous AI systems. Drawing from real-world deployments, the guide offers a structured approach to identifying suitable use cases, architecting agents, and embedding robust safeguards to ensure reliability and safety.

Defining an Agent


Unlike conventional LLM-powered applications such as single-turn chatbots or classification models, agents are autonomous systems capable of executing multi-step tasks with minimal human oversight. These systems integrate reasoning, memory, tool use, and workflow management.

An agent comprises three essential components:

  1. Model — The LLM responsible for decision-making and reasoning.
  2. Tools — External APIs or functions invoked to perform actions.
  3. Instructions — Structured prompts that define the agent’s objectives, behavior, and constraints.

When to Consider Building an Agent


Agents are well-suited for workflows that exceed the capabilities of traditional rule-based automation. Typical scenarios include:

  • Complex decision-making: For instance, nuanced refund approvals in customer support.
  • High-maintenance rule systems: Such as policy compliance workflows that are brittle or difficult to scale.
  • Interaction with unstructured data: Including document parsing or contextual natural language exchanges.

The guide emphasizes careful validation to ensure the task requires agent-level reasoning before embarking on implementation.

Technical Foundations and SDK Overview


The OpenAI Agents SDK provides a flexible, code-first interface for constructing agents using Python. Developers can declaratively define agents with a combination of model choice, tool registration, and prompt logic.

OpenAI categorizes tools into:

  • Data tools — Fetching context from databases or document repositories.
  • Action tools — Writing or updating data, triggering downstream services.
  • Orchestration tools — Agents themselves exposed as callable sub-modules.

Instructions should derive from operational procedures and be expressed in clear, modular prompts. The guide recommends using prompt templates with parameterized variables for scalability and maintainability.
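As a rough illustration of that model / tools / instructions split, here is a minimal sketch with the OpenAI Agents SDK (openai-agents). The refund-triage scenario, the lookup_order tool, and the instruction template are hypothetical examples, the SDK surface may have evolved since the guide was published, and the script expects an OPENAI_API_KEY in the environment.

from agents import Agent, Runner, function_tool

@function_tool
def lookup_order(order_id: str) -> str:
    """Data tool (stubbed): fetch order context from a backing store."""
    return f"Order {order_id}: delivered 2025-04-01, total $42.00"

# Parameterized instruction template, as the guide recommends.
INSTRUCTIONS_TEMPLATE = (
    "You are a refund triage agent for {company}. "
    "Always call lookup_order before recommending anything. "
    "Escalate any refund over ${limit} to a human."
)

agent = Agent(
    name="Refund triage",
    instructions=INSTRUCTIONS_TEMPLATE.format(company="Acme", limit=100),
    tools=[lookup_order],
)

result = Runner.run_sync(agent, "Customer wants a refund for order A-7.")
print(result.final_output)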

Orchestration Strategies


Two architectural paradigms are discussed:

  • Single-agent systems: A single looped agent handles the entire workflow, suitable for simpler use cases.
  • Multi-agent systems:
    • Manager pattern: A central coordinator delegates tasks to specialized agents.
    • Decentralized pattern: Peer agents autonomously transfer control among themselves.

Each design supports dynamic execution paths while preserving modularity through function-based orchestration.
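A hedged sketch of the decentralized pattern using the same SDK's handoffs: a triage agent transfers control to specialist peers (in the manager pattern, the sub-agents would instead be exposed to a central coordinator as callable tools). The agent names and instructions here are made up for illustration.

from agents import Agent, Runner

billing_agent = Agent(
    name="Billing agent",
    instructions="Handle billing and refund questions.",
)
tech_agent = Agent(
    name="Tech support agent",
    instructions="Handle technical troubleshooting questions.",
)
triage_agent = Agent(
    name="Triage agent",
    instructions="Decide whether the request is billing or technical, then hand off.",
    handoffs=[billing_agent, tech_agent],  # peers this agent may transfer control to
)

result = Runner.run_sync(triage_agent, "I was double charged last month.")
print(result.final_output)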

Guardrails for Safe and Predictable Behavior


The guide outlines a multi-layered defense strategy to mitigate risks such as data leakage, inappropriate responses, and system misuse:

  • LLM-based classifiers: For relevance, safety, and PII detection.
  • Rules-based filters: Regex patterns, input length restrictions, and blacklist enforcement.
  • Tool risk ratings: Assigning sensitivity levels to external functions and gating execution accordingly.
  • Output validation: Ensuring responses align with organizational tone and compliance requirements.

Guardrails are integrated into the agent runtime, allowing for concurrent evaluation and intervention when violations are detected.
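As a plain-Python sketch (deliberately not tied to any specific SDK guardrail API), the rules-based filters and tool risk ratings above might look like this; the patterns, limits, and tool names are placeholders.

import re

BLOCKED_PATTERNS = [
    r"(?i)ignore previous instructions",   # crude prompt-injection check
    r"\b\d{3}-\d{2}-\d{4}\b",              # SSN-like pattern (PII)
]
MAX_INPUT_CHARS = 4000
TOOL_RISK = {"lookup_order": "low", "issue_refund": "high"}  # hypothetical tools

def passes_input_guardrails(user_input: str) -> bool:
    if len(user_input) > MAX_INPUT_CHARS:
        return False
    return not any(re.search(p, user_input) for p in BLOCKED_PATTERNS)

def tool_allowed(tool_name: str, human_approved: bool = False) -> bool:
    # High-risk (e.g. irreversible) tools are gated behind explicit human approval.
    return TOOL_RISK.get(tool_name, "high") == "low" or human_approved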

Human Oversight and Escalation Paths


Recognizing that even well-designed agents may encounter ambiguity or critical actions, the guide encourages incorporating human-in-the-loop strategies. These include:

  • Failure thresholds: Escalating after repeated misinterpretations or tool call failures.
  • High-stakes operations: Routing irreversible or sensitive actions to human operators.

Such strategies support incremental deployment and allow trust to be built progressively.
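A failure-threshold policy can be as simple as counting failed attempts before routing to a person; the sketch below assumes a step_fn callable that raises on failure and uses an arbitrary limit of three.

MAX_FAILURES = 3

def run_with_escalation(step_fn, task: str) -> str:
    failures = 0
    while failures < MAX_FAILURES:
        try:
            return step_fn(task)   # one attempt at the task (agent step / tool call)
        except Exception:
            failures += 1
    return f"Escalated to a human operator after {failures} failures: {task}"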

Conclusion


With this guide, OpenAI formalizes a design pattern for constructing intelligent agents that are capable, controllable, and production-ready. By combining advanced models with purpose-built tools, structured prompts, and rigorous safeguards, development teams can go beyond experimental prototypes and toward robust automation platforms.

Whether orchestrating customer workflows, document processing, or developer tooling, this practical blueprint sets a strong foundation for adopting agents in real-world systems. OpenAI recommends beginning with single-agent deployments and progressively scaling to multi-agent orchestration as complexity demands.

Check out the guide: Download the Guide.

 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
62,427
Reputation
9,478
Daps
171,090

Meta AI Introduces Perception Encoder: A Large-Scale Vision Encoder that Excels Across Several Vision Tasks for Images and Video​


By Asif Razzaq

April 18, 2025

The Challenge of Designing General-Purpose Vision Encoders


As AI systems grow increasingly multimodal, the role of visual perception models becomes more complex. Vision encoders are expected not only to recognize objects and scenes, but also to support tasks like captioning, question answering, fine-grained recognition, document parsing, and spatial reasoning across both images and videos. Existing models typically rely on diverse pretraining objectives—contrastive learning for retrieval, captioning for language tasks, and self-supervised methods for spatial understanding. This fragmentation complicates scalability and model deployment, and introduces trade-offs in performance across tasks.

What remains a key challenge is the design of a unified vision encoder that can match or exceed task-specific methods, operate robustly in open-world scenarios, and scale efficiently across modalities.

A Unified Solution: Meta AI’s Perception Encoder


Meta AI introduces Perception Encoder (PE), a vision model family trained using a single contrastive vision-language objective and refined with alignment techniques tailored for downstream tasks. PE departs from the traditional multi-objective pretraining paradigm. Instead, it demonstrates that with a carefully tuned training recipe and appropriate alignment methods, contrastive learning alone can yield highly generalizable visual representations.

The Perception Encoder operates across three scales—PEcoreB, PEcoreL, and PEcoreG—with the largest (G-scale) model containing 2B parameters. These models are designed to function as general-purpose encoders for both image and video inputs, offering strong performance in classification, retrieval, and multimodal reasoning.

Screenshot-2025-04-18-at-8.17.43%E2%80%AFAM-1-1024x470.png


Training Approach and Architecture


The pretraining of PE follows a two-stage process. The first stage involves robust contrastive learning on a large-scale curated image-text dataset (5.4B pairs), where several architectural and training enhancements improve both accuracy and robustness. These include progressive resolution scaling, large batch sizes (up to 131K), use of the LAMB optimizer, 2D RoPE positional encoding, tuned augmentations, and masked regularization.
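For intuition, the single objective described here is a standard CLIP-style image-text contrastive loss. The sketch below is a generic PyTorch illustration of that objective, not Meta's released training code, and it omits the scaling and regularization tricks listed above.

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature   # [batch, batch] similarity matrix
    targets = torch.arange(image_emb.size(0))         # matching pairs lie on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)       # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)   # text -> image direction
    return (loss_i2t + loss_t2i) / 2

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))  # random stand-ins for encoder outputs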

The second stage introduces video understanding by leveraging a video data engine that synthesizes high-quality video-text pairs. This pipeline incorporates captions from the Perception Language Model (PLM), frame-level descriptions, and metadata, which are then summarized using Llama 3.3. These synthetic annotations allow the same image encoder to be fine-tuned for video tasks via frame averaging.

Despite using a single contrastive objective, PE features general-purpose representations distributed across intermediate layers. To access these, Meta introduces two alignment strategies:

  • Language alignment for tasks such as visual question answering and captioning.
  • Spatial alignment for detection, tracking, and depth estimation, using self-distillation and spatial correspondence distillation via SAM2.

Empirical Performance Across Modalities


PE demonstrates strong zero-shot generalization across a wide range of vision benchmarks. On image classification, PEcoreG matches or exceeds proprietary models trained on large private datasets such as JFT-3B. It achieves:

  • 86.6% on ImageNet-val,
  • 92.6% on ImageNet-Adversarial,
  • 88.2% on the full ObjectNet set,
  • Competitive results on fine-grained datasets including iNaturalist, Food101, and Oxford Flowers.

In video tasks, PE achieves state-of-the-art performance on zero-shot classification and retrieval benchmarks, outperforming InternVideo2 and SigLIP2-g-opt, while being trained on just 22M synthetic video-caption pairs. The use of simple average pooling across frames—rather than temporal attention—demonstrates that architectural simplicity, when paired with well-aligned training data, can still yield high-quality video representations.
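The frame-averaging idea is simple enough to sketch: encode frames independently with the image encoder and mean-pool the per-frame embeddings. This toy PyTorch snippet (with a random projection standing in for the encoder) only illustrates the pooling, not PE itself.

import torch

def video_embedding(frames, image_encoder):
    # frames: [num_frames, channels, height, width]
    per_frame = image_encoder(frames)   # [num_frames, embed_dim]
    return per_frame.mean(dim=0)        # average pooling across frames

fake_encoder = lambda x: x.flatten(1) @ torch.randn(3 * 32 * 32, 512)  # stand-in encoder
emb = video_embedding(torch.randn(8, 3, 32, 32), fake_encoder)         # [512]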

An ablation study shows that each component of the video data engine contributes meaningfully to performance. Improvements of +3.9% in classification and +11.1% in retrieval over image-only baselines highlight the utility of synthetic video data, even at modest scale.

Screenshot-2025-04-18-at-8.18.27%E2%80%AFAM-1024x530.png


Conclusion


Perception Encoder provides a technically compelling demonstration that a single contrastive objective, if implemented with care and paired with thoughtful alignment strategies, is sufficient to build general-purpose vision encoders. PE not only matches specialized models in their respective domains but does so with a unified and scalable approach.

The release of PE, along with its codebase and the PE Video Dataset, offers the research community a reproducible and efficient foundation for building multimodal AI systems. As visual reasoning tasks grow in complexity and scope, PE provides a path forward toward more integrated and robust visual understanding.




Check out the Paper, Model, Code and Dataset.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
62,427
Reputation
9,478
Daps
171,090
New layer addition to Transformers radically improves long-term video generation



Posted on Tue Apr 8 15:30:23 2025 UTC


Fascinating work coming from a team from Berkeley, Nvidia and Stanford.

They added a new Test-Time Training (TTT) layer to pre-trained transformers. This TTT layer can itself be a neural network.

The result? Much more coherent long-term video generation! The results aren't conclusive, since they limited themselves to one-minute videos, but the approach can potentially be extended further.

Maybe the beginning of AI shows?

Link to repo:
One-Minute Video Generation with Test-Time Training



Abstract​


Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle with complex multi-scene stories because their hidden states are less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. For proof of concept, we curate a dataset based on Tom and Jerry cartoons. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complex stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, results still contain artifacts, likely due to the limited capability of the pre-trained 5B model. The efficiency of our implementation can also be improved. We have only experimented with one-minute videos due to resource constraints, but the approach can be extended to longer videos and more complex stories.

Paper

Code

Adding TTT Layers to a Pre-Trained Transformer​


Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos with strong temporal consistency and motion smoothness.
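To make the idea concrete, here is a toy sketch of a TTT-style layer in PyTorch: the hidden state is itself a small linear model W, updated by one gradient-descent step per token on a self-supervised reconstruction loss, then used to produce that token's output. This is a simplified illustration under assumed details (the corruption and learning rate are placeholders), not the paper's TTT-MLP kernel.

import torch

def ttt_linear_forward(tokens, lr=0.1):
    # tokens: [seq_len, dim]; W is the per-sequence hidden state (a linear model).
    seq_len, dim = tokens.shape
    W = torch.zeros(dim, dim)
    outputs = []
    for x in tokens:                     # process the sequence token by token
        x = x.unsqueeze(0)               # [1, dim]
        corrupted = 0.9 * x              # placeholder corruption of the input token
        grad = 2 * corrupted.t() @ (corrupted @ W - x)  # gradient of ||corrupted @ W - x||^2
        W = W - lr * grad                # the "update rule": one gradient-descent step
        outputs.append(x @ W)            # output computed with the updated hidden state
    return torch.cat(outputs, dim=0)     # [seq_len, dim]

out = ttt_linear_forward(torch.randn(16, 32))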









1/12
@hyperbolic_labs
We’re proud to have supported the team behind One-Minute Video Generation with Test-Time Training with compute infrastructure.

Incredible to see our platform enabling breakthroughs in long-form video generation. Congrats to the authors!

@danielkoceja @GashonHussein @Jerry_XU_Jiarui @__yuezhao__ @jankautz @guestrin @tatsu_hashimoto @sanmikoyejo @YejinChoinka @xiaolonw @karansdalal

[Quoted tweet]
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training.

We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency.

Every video below is produced directly by the model in a single shot, without editing, stitching, or post-processing. Every story is newly created.

Demos: test-time-training.github.io…
Paper: test-time-training.github.io…


GolAoksW8AA1YCm.png

GolApZAWIAA8wJC.jpg


https://video.twimg.com/ext_tw_video/1909310443530944513/pu/vid/avc1/720x480/S8MsN5qN0o9f_Lnx.mp4

2/12
@hyperbolic_labs
Read the full paper: https://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf



3/12
@Quangduycbq
so cool i will make meaningful video🥰🥰🥰



4/12
@hyperbolic_labs
love it



5/12
@ChetaOfAllTrade
Incredible. Hyperbolic is built to see developers actually reach their potential and not get stuck on compute resources.

Congrats to the team



6/12
@hyperbolic_labs
🥂



7/12
@ericspo29
So now I can make my own cartoons, this is awesome!



8/12
@hyperbolic_labs
Pretty wild tech



9/12
@Just_marhk
Great 👍👏



10/12
@hyperbolic_labs
💯💯



11/12
@Bruhbears985
That's so great 🤘🏻



12/12
@hyperbolic_labs
amazing what AI can do now




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
















1/22
@karansdalal
Today, we're releasing a new paper – One-Minute Video Generation with Test-Time Training.

We add TTT layers to a pre-trained Transformer and fine-tune it to generate one-minute Tom and Jerry cartoons with strong temporal consistency.

Every video below is produced directly by the model in a single shot, without editing, stitching, or post-processing. Every story is newly created.

Demos: One-Minute Video Generation with Test-Time Training
Paper: http://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf



https://video.twimg.com/ext_tw_video/1909310443530944513/pu/vid/avc1/720x480/S8MsN5qN0o9f_Lnx.mp4

2/22
@karansdalal
Test-time training (TTT) layers are RNN layers where the hidden state is a machine learning model and the update rule is a step of gradient descent. See this thread for previous work.

[Quoted tweet]
I’m excited to share a project I’ve been working on for over a year, which I believe will fundamentally change our approach to language models.

We’ve designed a new architecture, which replaces the hidden state of an RNN with a machine learning model. This model compresses context through actual gradient descent on input tokens. We call our method “Test-Time-Training layers.”

TTT layers directly replace attention, and unlock linear complexity architectures with expressive memory, allowing us to train LLMs with millions (someday billions) of tokens in context.

Our instantiations, TTT-Linear and TTT-MLP, both match or beat the strongest Transformers and Mamba. Arxiv: arxiv.org/abs/2407.04620


GR-cpVpawAABD38.png


3/22
@karansdalal
Our approach simply adds TTT layers to a pre-trained Diffusion Transformer and fine-tunes it on long videos with text annotations. To keep costs manageable, we limit self-attention to local segments and let TTT (linear complexity) operate globally.



Gn88CAKbwAMyW2Q.jpg


4/22
@karansdalal
We create an “On-Chip Tensor Parallel” algorithm to implement an efficient TTT-MLP kernel. Specifically, we shard the weights of the “hidden state model” across Streaming Multiprocessors, and use the DSMEM feature of Hopper GPUs to implement AllReduce among SMs.

This avoids costly transfers between global memory (HBM) and shared memory (SMEM), while still fitting the large hidden state into the small amount of fast SMEM.

More details in the paper. Kernel code: GitHub - test-time-training/ttt-tk



Gn88LCwbwAMkVY_.jpg


5/22
@karansdalal
Grateful for wonderful collaborators. This work will be presented at CVPR 2025.

@danielkoceja @GashonHussein @Jerry_XU_Jiarui @__yuezhao__ @jankautz @guestrin @tatsu_hashimoto @sanmikoyejo @YejinChoinka @xiaolonw



Gn89FL3bwAM8v5j.jpg


6/22
@karansdalal
+ our wonderful collaborators without Twitter – Shihao Han, Ka Chun Cheung, Youjin Song, and Yu Sun.



7/22
@menhguin
what the fukk (complimentary)

ok for like a solid 30 seconds I thought this was the Test-Time Training used for the ARC AGI MIT submission and I was rly confused



8/22
@karansdalal
Same thing, different application! Best characterization would be "End to End" vs "Non E2E" test-time training.

Test-Time Training Project Website



9/22
@ruslanjabari
damn and this is only ~50 hours of training runs



10/22
@karansdalal
With a 5B model 🫣



11/22
@reborn_agi
This is incredible work — generating coherent, one-minute-long animated stories with zero post-processing is a huge leap in video generation. The TTT approach looks super promising for maintaining temporal consistency. Huge respect to you and the team.



12/22
@karansdalal
Thank you



13/22
@willdepue
very cool work karan! do you have any baselines of what it looks like without test time training?



14/22
@karansdalal
Thank you Will, sorry to miss this! Here's the World Trade Center video with the local attention baseline.* We have some examples comparing TTT to other RNNs on the project page.

* Disclaimer – this model has fewer parameters than the one with added TTT layers.



https://video.twimg.com/ext_tw_video/1909798570049650689/pu/vid/avc1/720x480/0agZ6XihQUKUJ9iC.mp4

15/22
@TheGrizztronic
Pretty cool. TTT should get more love. Hope this helps!



16/22
@karansdalal
🙏



17/22
@jc_stack
Really interested in your pre-training approaches. Have you seen much impact on compute/memory overhead with the TTT layers? Thinking about startup resource constraints here.



18/22
@karansdalal
TTT layers have linear complexity, so long context inference is far better than self-attention. But we still have some way to go on kernel optimization when compared to other modern RNN layers.

Figure 6 from our paper:



Gn9WhFzbwAE84O3.jpg


19/22
@john7rho
Amazing work Karan



20/22
@karansdalal
Thank you John!



21/22
@jam3scampbell
🤔

[Quoted tweet]
in b4 ttt is the new q*


22/22
@nearcyan
hmmmm




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 