bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799






1/11
@genmoai
Introducing Mochi 1 preview. A new SOTA in open-source video generation. Apache 2.0.

magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce



https://video.twimg.com/ext_tw_video/1848745801926795264/pu/vid/avc1/1920x1080/zCXCFAyOnvznHUAf.mp4

2/11
@genmoai
Mochi 1 has superior motion quality, prompt adherence and exceptional rendering of humans that begins to cross the uncanny valley.

Today, we are open-sourcing our base 480p model with an HD version coming soon.



GagSEjMW0AAQ-ff.jpg

GagSFdDWIAAHKSa.jpg


3/11
@genmoai
We're excited to see what you create with Mochi 1. We're also excited to announce our $28.4M Series A from @NEA, @TheHouseVC, @GoldHouseCo, @WndrCoLLC, @parasnis, @amasad, @pirroh and more.

Use Mochi 1 via our playground at our homepage or download the weights freely.



4/11
@jahvascript
can I get a will smith eating spaghetti video?



5/11
@genmoai




6/11
@GozukaraFurkan
Does it have image to video and video to video?



7/11
@genmoai
Good question! Our new model, Mochi, is entirely text-to-video. On Genmo, if you use the older model (Replay v.0.2), you're able to do image-to-video as well as text-to-video.



8/11
@iamhitarth
Keep getting this error message even though I've created an account and signed in 🤔



GaguaHHa0AA-5yl.png


9/11
@genmoai
Hey there, thanks for escalating this! We're taking a closer look right now to see why this is happening 👀 I appreciate your patience and understanding!



10/11
@sairahul1
Where can I try it



11/11
@genmoai
You can find everything you'll need to try Mochi at Genmo. 🫶 Happy generating!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/3
@AIWarper
My Mochi 1 test thread. Will post some video examples below if you are interested.

Inference done with FAL



https://video.twimg.com/ext_tw_video/1848834063462993920/pu/vid/avc1/848x480/5De8HKQoKL4cYf6I.mp4

https://video-t-1.twimg.com/ext_tw_...016/pu/vid/avc1/848x480/JJWbnMjCeriVR86A.mp4?


2/3
@AIWarper
A floating being composed of pure, radiant energy, its form shifting between physical and immaterial states. The Lumiscribe’s eyes glow with ancient knowledge, and its translucent wings pulse with golden light. It hovers above the Skyvault, a vast floating city on its homeworld, Zephyra, a world of constant storms and lightning-filled skies. Giant skyborne creatures soar above the city, their bodies crackling with electricity. Below, the landscape is a mix of jagged mountains and stormy seas, illuminated by constant flashes of lightning. The Lumiscribe uses its energy to inscribe symbols of power into the air, controlling the flow of the storms that define their world.



https://video.twimg.com/ext_tw_video/1848835515191332868/pu/vid/avc1/848x480/HjfJWN7l2MzSNEcz.mp4

3/3
@DataPlusEngine
Holy shyt it's so good




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/1
Quote
Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer architecture.

The model requires at least 4 H100 GPUs to run.
genmo/mochi-1-preview · Hugging Face
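
For anyone who prefers the Hugging Face route over the torrent, here is a minimal sketch of pulling the checkpoint with `huggingface_hub`, assuming the `genmo/mochi-1-preview` repo id linked above (the download is large, and inference still needs serious GPU memory):

```python
# Minimal sketch: download the Mochi 1 preview weights from Hugging Face.
# Assumes the "genmo/mochi-1-preview" repo id referenced above; adjust local_dir as needed.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="genmo/mochi-1-preview",
    local_dir="mochi-1-preview",  # where to place the weights on disk
)
print(f"Weights downloaded to: {local_path}")
```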




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/10
@_akhaliq
Mochi 1 preview

A new SOTA in open-source video generation. Apache 2.0



https://video.twimg.com/ext_tw_video/1848763058291642370/pu/vid/avc1/1280x720/QFGsHnbVyszo5Xgz.mp4

2/10
@_akhaliq
model: genmo/mochi-1-preview · Hugging Face



3/10
@Prashant_1722
Sora has become extremely non moat by now



4/10
@LearnedVector
Ima need more h100s



5/10
@thebuttredettes
Getting only errors on the preview page.



6/10
@noonescente
What is a SOTA?



7/10
@m3ftah
It requires 4 H100 to run!



8/10
@ED84VG
It’s crazy how fast ai video has come.

Does anyone know if it got better because Sora showed it was possible, or would they have gotten this good even if they never showed us?



9/10
@JonathanKorstad
🔥



10/10
@_EyesofTruth_
OpenAi getting mogged




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799






1/11
@runwayml
Introducing, Act-One. A new way to generate expressive character performances inside Gen-3 Alpha using a single driving video and character image. No motion capture or rigging required.

Learn more about Act-One below.

(1/7)



https://video-t-2.twimg.com/ext_tw_...248/pu/vid/avc1/1280x720/2EyYj6GjSpT_loQf.mp4

2/11
@runwayml
Act-One allows you to faithfully capture the essence of an actor's performance and transpose it to your generation. Where traditional pipelines for facial animation involve complex, multi-step workflows, Act-One works with a single driving video that can be shot on something as simple as a cell phone.

(2/7)



https://video-t-2.twimg.com/ext_tw_...282/pu/vid/avc1/1280x720/Qie29gOWU42zMaGo.mp4

3/11
@runwayml
Without the need for motion-capture or character rigging, Act-One is able to translate the performance from a single input video across countless different character designs and in many different styles.

(3/7)



https://video-t-2.twimg.com/ext_tw_...696/pu/vid/avc1/1280x720/TcWzpRl3kMfHM4ro.mp4

4/11
@runwayml
One of the model's strengths is producing cinematic and realistic outputs across a robust number of camera angles and focal lengths, allowing you to generate emotional performances with previously impossible character depth and opening new avenues for creative expression.

(4/7)



https://video-t-2.twimg.com/ext_tw_...424/pu/vid/avc1/1280x720/3fupOI32Ck6ITIiE.mp4

5/11
@runwayml
A single video of an actor is used to animate a generated character.

(5/7)



https://video-t-2.twimg.com/ext_tw_...016/pu/vid/avc1/1280x720/Fh7WanCgSTR_ffHF.mp4

6/11
@runwayml
With Act-One, eye-lines, micro expressions, pacing and delivery are all faithfully represented in the final generated output.

(6/7)



https://video-t-2.twimg.com/ext_tw_...528/pu/vid/avc1/1280x720/ywweYvzLHe-3GO2B.mp4

7/11
@runwayml
Access to Act-One will begin gradually rolling out to users today and will soon be available to everyone.

To learn more, visit Runway Research | Introducing Act-One

(7/7)



8/11
@xushanpao310
@threadreaderapp Unroll



9/11
@threadreaderapp
@xushanpao310 Salam, please find the unroll here: Thread by @runwayml on Thread Reader App See you soon. 🤖



10/11
@threadreaderapp
Your thread is gaining traction! /search?q=#TopUnroll Thread by @runwayml on Thread Reader App 🙏🏼@TheReal_Ingrid_ for 🥇unroll



11/11
@flowersslop
motion capture industry is cooked




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799


1/4
@AIWarper
New Stable Diffusion 3.5

Weights: Comfy-Org/stable-diffusion-3.5-fp8 at main

Compatible with ComfyUI already



2/4
@AIWarper
Update: As per usual - SD basemodels are.... meh?

The strength will hopefully be in the finetune



GagKKv3aMAEvHPO.png

GagKL8gbEAMpgBD.jpg


3/4
@goon_nguyen
Still not good as FLUX 😅



GagJXPmbEAIFVg2.jpg


4/4
@AIWarper
But can it fine tune. That is the selling point.

It's a non distilled model (I think)




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/2
@AIWarper
SD 3.5 Large vs Flux [DEV]

Image 1 = identical prompt I used for Flux

Image 2 = Shortened prompt

Conclusion - I need to experiment more. These were generated with the HF demo so not many knobs to turn

[Quoted tweet]
Flux [DEV] locally w/ ComfyUI

1024 x 1024 gen upscaled 2x

Spicyyyyyyyyyyyy details and the prompt adherence is insane. I never once mentioned the words Frieren OR Fern.

(prompt in 1st comment below)


Gag2xmsbEAEo1j4.jpg

Gag3f1mbEAIheEO.jpg


https://video-t-1.twimg.com/ext_tw_...37/pu/vid/avc1/1542x1080/PTkQOzHzyUXnIEa9.mp4

2/2
@GorillaRogueGam
I cannot get excited for a Stability product until they release something 1.5 level revolutionary again

I was ready to completely give up on the company until Cameron got on the board of directors, but he is fond of sinking ships so I don’t know what to think…




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799





1/11
@Kijaidesign
This model is so much fun!
I have implemented Tora:
GitHub - alibaba/Tora: The official repository of Tora
to my CogVideoX nodes:
GitHub - kijai/ComfyUI-CogVideoXWrapper
This implementation can be run within ~13GB VRAM
#ai #comfyui



https://video.twimg.com/ext_tw_video/1848349720432754688/pu/vid/avc1/720x480/uR1WPXyQj0FvdxCO.mp4

2/11
@DreamStarter_1
I have only 12...so cloooose!😆



3/11
@Kijaidesign
I didn't try it with this particular model, but there's always the option to use sequential_cpu_offloading, which makes it really slow, but runs on very little VRAM.
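
For readers outside ComfyUI: the same trick exists in plain diffusers. A minimal sketch, assuming the `THUDM/CogVideoX-5b` checkpoint and diffusers' `CogVideoXPipeline`; the ComfyUI wrapper nodes above expose this differently:

```python
# Minimal sketch of sequential CPU offloading with diffusers' CogVideoX pipeline
# (not the ComfyUI wrapper discussed above). Assumes the THUDM/CogVideoX-5b checkpoint.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# Moves submodules to the GPU one at a time: much slower, but runs in very little VRAM.
pipe.enable_sequential_cpu_offload()

frames = pipe(prompt="a red panda riding a bicycle", num_frames=49).frames[0]
export_to_video(frames, "output.mp4", fps=8)
```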



4/11
@6___0
how long does a 3 sec vid take to render (on what GPU)?



5/11
@Kijaidesign
It depends on many things; there are many possible optimizations to use. Stacking everything, at best it's around 2 minutes to do 49 frames.



6/11
@huanggou7
does it work with image to video?



7/11
@Kijaidesign
Not currently, the model is trained on top of the text2video model, looking for a way to do that though.



8/11
@Sire1Sire1
Can we try this? somewhere?



9/11
@Kijaidesign
Currently only with my ComfyUI nodes:
GitHub - kijai/ComfyUI-CogVideoXWrapper



10/11
@sophanimeai
Interesting to see Tora implemented in your ComfyUI-CogVideoX wrapper. What drove you to experiment with this particular combination?



11/11
@Kijaidesign
The model that was released was actually built on the CogVideoX 5b text2vid model; after studying their code and the model structure for a bit, it was just a natural fit.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


1/8
@Kijaidesign
Oh hey Tora trajectory actually works with the image2video model too, this will be good...
#ai #comfyui #cogvideox



https://video-t-2.twimg.com/ext_tw_...8464/pu/vid/avc1/720x480/guTYCJ78ra53nmKL.mp4

2/8
@Elliryk_Krypto
Awesome🔥🔥



3/8
@huanggou7
holy shyt



4/8
@Leo1991199
cool!



5/8
@Mayzappp1
This is like instance diffusion with spline:smile: used to work on sd 1.5 with animate diff
Really nice to see something like it on cog



6/8
@AmirKerr
Finally! Some video control that we can run locally 🥹 Kijai ✨



7/8
@mickmumpitz
That looks absolutely fantastic, great work!



8/8
@_EyesofTruth_
Wut 🤯




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799


AI video startup Genmo launches Mochi 1, an open source rival to Runway, Kling, and others​

Carl Franzen@carlfranzen

October 22, 2024 5:00 AM

Screenshot of AI video close-up of Caucasian elderly woman with brown eyes smiling


Credit: Genmo

Genmo, an AI company focused on video generation, has announced the release of a research preview for Mochi 1, a new open-source model for generating high-quality videos from text prompts — and claims performance comparable to, or exceeding, leading closed-source/proprietary rivals such as Runway’s Gen-3 Alpha, Luma AI’s Dream Machine, Kuaishou’s Kling, Minimax’s Hailuo, and many others.

Available under the permissive Apache 2.0 license, Mochi 1 offers users free access to cutting-edge video generation capabilities — whereas pricing for other models starts at limited free tiers but goes as high as $94.99 per month (for the Hailuo Unlimited tier). Users can download the full weights and model code free on Hugging Face, though it requires “at least 4” Nvidia H100 GPUs to operate on a user’s own machine.

In addition to the model release, Genmo is also making available a hosted playground, allowing users to experiment with Mochi 1’s features firsthand.

The 480p model is available for use today, and a higher-definition version, Mochi 1 HD, is expected to launch later this year.

Initial videos shared with VentureBeat show impressively realistic scenery and motion, particularly with human subjects as seen in the video of an elderly woman below:



Advancing the state-of-the-art​


Mochi 1 brings several significant advancements to the field of video generation, including high-fidelity motion and strong prompt adherence.

According to Genmo, Mochi 1 excels at following detailed user instructions, allowing for precise control over characters, settings, and actions in generated videos.

Genmo has positioned Mochi 1 as a solution that narrows the gap between open and closed video generation models.

“We’re 1% of the way to the generative video future. The real challenge is to create long, high-quality, fluid video. We’re focusing heavily on improving motion quality,” said Paras Jain, CEO and co-founder of Genmo, in an interview with VentureBeat.

Jain and his co-founder started Genmo with a mission to make AI technology accessible to everyone. “When it came to video, the next frontier for generative AI, we just thought it was so important to get this into the hands of real people,” Jain emphasized. He added, “We fundamentally believe it’s really important to democratize this technology and put it in the hands of as many people as possible. That’s one reason we’re open sourcing it.”

Already, Genmo claims that in internal tests, Mochi 1 bests most other video AI models — including the proprietary competition from Runway and Luma — at prompt adherence and motion quality.

unnamed.png

unnamed-1.png



Series A funding to the tune of $28.4M​


In tandem with the Mochi 1 preview, Genmo also announced it has raised a $28.4 million Series A funding round, led by NEA, with additional participation from The House Fund, Gold House Ventures, WndrCo, Eastlink Capital Partners, and Essence VC. Several angel investors, including Abhay Parasnis (CEO of Typeface) and Amjad Masad (CEO of Replit), are also backing the company’s vision for advanced video generation.

Jain’s perspective on the role of video in AI goes beyond entertainment or content creation. “Video is the ultimate form of communication—30 to 50% of our brain’s cortex is devoted to visual signal processing. It’s how humans operate,” he said.

Genmo’s long-term vision extends to building tools that can power the future of robotics and autonomous systems. “The long-term vision is that if we nail video generation, we’ll build the world’s best simulators, which could help solve embodied AI, robotics, and self-driving,” Jain explained.



Open for collaboration — but training data is still close to the vest​


Mochi 1 is built on Genmo’s novel Asymmetric Diffusion Transformer (AsymmDiT) architecture.

At 10 billion parameters, it’s the largest open source video generation model ever released. The architecture focuses on visual reasoning, with four times the parameters dedicated to processing video data as compared to text.

Efficiency is a key aspect of the model’s design. Mochi 1 leverages a video VAE (Variational Autoencoder) that compresses video data to a fraction of its original size, reducing the memory requirements for end-user devices. This makes it more accessible for the developer community, who can download the model weights from HuggingFace or integrate it via API.
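
To make the memory argument concrete, here is a back-of-the-envelope sketch. The article does not spell out Mochi's exact compression factors, so the temporal/spatial/channel numbers below are illustrative placeholders, purely to show how a video VAE shrinks the tensor the diffusion model has to work on:

```python
# Toy arithmetic: how a video VAE shrinks the tensor a diffusion model must denoise.
# The factors below are illustrative placeholders, not Mochi's published numbers.
frames, height, width, rgb = 163, 480, 848, 3   # a raw 480p clip
t_down, s_down, latent_ch = 6, 8, 12            # assumed temporal/spatial/channel factors

raw_elems = frames * height * width * rgb
latent_elems = (frames // t_down) * (height // s_down) * (width // s_down) * latent_ch
print(f"raw elements:    {raw_elems:,}")
print(f"latent elements: {latent_elems:,}")
print(f"compression:     ~{raw_elems / latent_elems:.0f}x fewer values to process")
```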

Jain believes that the open-source nature of Mochi 1 is key to driving innovation. “Open models are like crude oil. They need to be refined and fine-tuned. That’s what we want to enable for the community—so they can build incredible new things on top of it,” he said.

However, when asked about the model’s training dataset — among the most controversial aspects of AI creative tools, as evidence has shown many to have trained on vast swaths of human creative work online without express permission or compensation, and some of it copyrighted works — Jain was coy.

“Generally, we use publicly available data and sometimes work with a variety of data partners,” he told VentureBeat, declining to go into specifics due to competitive reasons. “It’s really important to have diverse data, and that’s critical for us.”



Limitations and roadmap​


As a preview, Mochi 1 still has some limitations. The current version supports only 480p resolution, and minor visual distortions can occur in edge cases involving complex motion. Additionally, while the model excels in photorealistic styles, it struggles with animated content.

However, Genmo plans to release Mochi 1 HD later this year, which will support 720p resolution and offer even greater motion fidelity.

“The only uninteresting video is one that doesn’t move—motion is the heart of video. That’s why we’ve invested heavily in motion quality compared to other models,” said Jain.

Looking ahead, Genmo is developing image-to-video synthesis capabilities and plans to improve model controllability, giving users even more precise control over video outputs.


Expanding use cases via open source video AI​


Mochi 1’s release opens up possibilities for various industries. Researchers can push the boundaries of video generation technologies, while developers and product teams may find new applications in entertainment, advertising, and education.

Mochi 1 can also be used to generate synthetic data for training AI models in robotics and autonomous systems.

Reflecting on the potential impact of democratizing this technology, Jain said, “In five years, I see a world where a poor kid in Mumbai can pull out their phone, have a great idea, and win an Academy Award—that’s the kind of democratization we’re aiming for.”

Genmo invites users to try the preview version of Mochi 1 via their hosted playground at genmo.ai/play, where the model can be tested with personalized prompts — though at the time of this article’s posting, the URL was not loading the correct page for VentureBeat.


A call for talent​


As it continues to push the frontier of open-source AI, Genmo is actively hiring researchers and engineers to join its team. “We’re a research lab working to build frontier models for video generation. This is an insanely exciting area—the next phase for AI—unlocking the right brain of artificial intelligence,” Jain said. The company is focused on advancing the state of video generation and further developing its vision for the future of artificial general intelligence.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799


Meta Introduces Spirit LM open source model that combines text and speech inputs/outputs​

Carl Franzen@carlfranzen

October 18, 2024 5:05 PM

comic book style pointillism image of a startled programmer watching a happy ghost emerge from a computer screen


Credit: VentureBeat made with ChatGPT


Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, the company’s first open-source multimodal language model capable of seamlessly integrating text and speech inputs and outputs.

As such, it competes directly with OpenAI’s GPT-4o (also natively multimodal) and other multimodal models such as Hume’s EVI 2, as well as dedicated text-to-speech and speech-to-text offerings such as ElevenLabs.

Designed by Meta’s Fundamental AI Research (FAIR) team, Spirit LM aims to address the limitations of existing AI voice experiences by offering a more expressive and natural-sounding speech generation, while learning tasks across modalities like automatic speech recognition (ASR), text-to-speech (TTS), and speech classification.

Unfortunately for entrepreneurs and business leaders, the model is only currently available for non-commercial usage under Meta’s FAIR Noncommercial Research License, which grants users the right to use, reproduce, modify, and create derivative works of the Meta Spirit LM models, but only for noncommercial purposes. Any distribution of these models or derivatives must also comply with the noncommercial restriction.


A new approach to text and speech​


Traditional AI models for voice rely on automatic speech recognition to process spoken input before synthesizing it with a language model, which is then converted into speech using text-to-speech techniques.

While effective, this process often sacrifices the expressive qualities inherent to human speech, such as tone and emotion. Meta Spirit LM introduces a more advanced solution by incorporating phonetic, pitch, and tone tokens to overcome these limitations.

Meta has released two versions of Spirit LM:

Spirit LM Base: Uses phonetic tokens to process and generate speech.

Spirit LM Expressive: Includes additional tokens for pitch and tone, allowing the model to capture more nuanced emotional states, such as excitement or sadness, and reflect those in the generated speech.

Both models are trained on a combination of text and speech datasets, allowing Spirit LM to perform cross-modal tasks like speech-to-text and text-to-speech, while maintaining the natural expressiveness of speech in its outputs.
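
As a rough mental model of that interleaving idea (a toy illustration only, not Meta's tokenizer or training code): text spans and speech spans are flattened into a single token stream, with special markers telling the model which modality it is currently generating.

```python
# Toy illustration of an interleaved text/speech training sequence.
# Token names and markers are made up for clarity; they are not Spirit LM's real vocabulary.
text_tokens = ["the", "cat", "sat"]
speech_tokens = ["[Hu12]", "[Hu87]", "[Pitch3]", "[Style1]", "[Hu42]"]  # hypothetical phonetic/pitch/style units

sequence = ["[TEXT]"] + text_tokens + ["[SPEECH]"] + speech_tokens
print(sequence)
# ['[TEXT]', 'the', 'cat', 'sat', '[SPEECH]', '[Hu12]', '[Hu87]', '[Pitch3]', '[Style1]', '[Hu42]']
# A single language model trained on such sequences can continue in either modality,
# which is what enables ASR-style (speech -> text) and TTS-style (text -> speech) prompting.
```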


Open-source noncommercial — only available for research​


In line with Meta’s commitment to open science, the company has made Spirit LM fully open-source, providing researchers and developers with the model weights, code, and supporting documentation to build upon.

Meta hopes that the open nature of Spirit LM will encourage the AI research community to explore new methods for integrating speech and text in AI systems.

The release also includes a research paper detailing the model’s architecture and capabilities.

Mark Zuckerberg, Meta’s CEO, has been a strong advocate for open-source AI, stating in a recent open letter that AI has the potential to “increase human productivity, creativity, and quality of life” while accelerating advancements in areas like medical research and scientific discovery.


Applications and future potential​


Meta Spirit LM is designed to learn new tasks across various modalities, such as:

Automatic Speech Recognition (ASR): Converting spoken language into written text.

Text-to-Speech (TTS): Generating spoken language from written text.

Speech Classification: Identifying and categorizing speech based on its content or emotional tone.

The Spirit LM Expressive model goes a step further by incorporating emotional cues into its speech generation.

For instance, it can detect and reflect emotional states like anger, surprise, or joy in its output, making the interaction with AI more human-like and engaging.

This has significant implications for applications like virtual assistants, customer service bots, and other interactive AI systems where more nuanced and expressive communication is essential.


A broader effort​


Meta Spirit LM is part of a broader set of research tools and models that Meta FAIR is releasing to the public. This includes an update to Meta’s Segment Anything Model 2.1 (SAM 2.1) for image and video segmentation, which has been used across disciplines like medical imaging and meteorology, and research on enhancing the efficiency of large language models.

Meta’s overarching goal is to achieve advanced machine intelligence (AMI), with an emphasis on developing AI systems that are both powerful and accessible.

The FAIR team has been sharing its research for more than a decade, aiming to advance AI in a way that benefits not just the tech community, but society as a whole. Spirit LM is a key component of this effort, supporting open science and reproducibility while pushing the boundaries of what AI can achieve in natural language processing.


What’s next for Spirit LM?​


With the release of Meta Spirit LM, Meta is taking a significant step forward in the integration of speech and text in AI systems.

By offering a more natural and expressive approach to AI-generated speech, and making the model open-source, Meta is enabling the broader research community to explore new possibilities for multimodal AI applications.

Whether in ASR, TTS, or beyond, Spirit LM represents a promising advance in the field of machine learning, with the potential to power a new generation of more human-like AI interactions.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799

1/1
I am pleased to share that our work on SynthID text watermarking is published by @Nature today.

Read the Nature paper at: Scalable watermarking for identifying large language model outputs - Nature
Read more about the work at: SynthID: Tools for watermarking and detecting LLM-generated Text | Responsible Generative AI Toolkit | Google AI for Developers

[Quoted tweet]
Today, we’re open-sourcing our SynthID text watermarking tool through an updated Responsible Generative AI Toolkit.

Available freely to developers and businesses, it will help them identify their AI-generated content. 🔍

Find out more → goo.gle/40apGQh



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/20
@GoogleDeepMind
Today, we’re open-sourcing our SynthID text watermarking tool through an updated Responsible Generative AI Toolkit.

Available freely to developers and businesses, it will help them identify their AI-generated content. 🔍

Find out more → SynthID



https://video-ft.twimg.com/ext_tw_v...376/pu/vid/avc1/1280x720/G5K0TaljbmDqO-lP.mp4

2/20
@GoogleDeepMind
Here’s how SynthID watermarks AI-generated content across modalities. ↓



https://video-ft.twimg.com/ext_tw_video/1792521399359180800/pu/vid/avc1/720x720/fT7NUZR4FiMQ2iwO.mp4

3/20
@GoogleDeepMind
By open-sourcing the code, more people will be able to use the tool to watermark and determine whether text outputs have come from their own LLMs - making it easier to build AI responsibly.

We explain more about this tech in @Nature. ↓ Scalable watermarking for identifying large language model outputs - Nature



4/20
@AidfulAI
Detecting AI-written text is tough without watermarks.

Open-sourcing SynthID-Text enables others to embed watermarks in their model outputs.

This means there will be two types of models:
Models that watermark their outputs and ones that don't. 🤔



5/20
@mkieffer1107
awesome!!! was just looking into this yesterday hoping it was open source :smile:



6/20
@dom_beaini
1. Can we break down the image generation by down-sampling and up-sampling?

2. Invisible to the human eye, but if we plug them back into another gen-AI, would it remove the watermark? For example adding noise to the image, then feeding it back into another watermark-free diffusion model? Asking another LLM to make random modification to a given text?

3. Without regulatory enforcement of these watermarks, I suspect most models won't have them.



7/20
@DesFrontierTech
How does SynthID text’s generative watermarking handle variability across different content domains, and what measures are taken to ensure the watermark’s detectability remains consistent when faced with novel or out-of-distribution input contexts?



8/20
@cloudseedingtec
ok i have a random question that no one has answered.. did yall put that (i call it the poison pill) into youtube videos.. cuz like well not to self incriminate but it seems like yall did something<3



9/20
@entergnomer
Would a different sampler bypass this?



10/20
@BensenHsu
The study focuses on developing a method called SynthID-Text to watermark text generated by large language models (LLMs). Watermarking can help identify synthetic text and limit accidental or deliberate misuse of LLMs.

The researchers evaluate SynthID-Text across multiple LLMs and find that it provides improved detectability over comparable methods, while maintaining standard benchmarks and human side-by-side ratings that indicate no change in LLM capabilities. They also conduct a live experiment with the Gemini production system, which shows that the difference in response quality and utility, as judged by humans, is negligible between watermarked and unwatermarked responses.

full paper: Scalable watermarking for identifying large language model outputs



GaquIVKbIAAgkV7.jpg


11/20
@shawnchauhan1
Awesome! Really appreciate it.



12/20
@HungamaHeadline
Google's open-sourcing of SynthID is a major step forward in ensuring accountability and trust in AI-generated content. By providing a reliable way to identify AI-generated media, SynthID empowers users to make informed decisions. This is a crucial development as AI continues to shape our world.



13/20
@thegenioo
Irrelevant somehow to the OP

But this simple animation also shows how LLMs basically work, using probability to output words, like predicting the next word. It's not the entire process, but a very simple illustration for someone who has no clue how AI works.



14/20
@MinhQua52508258
Alphastarter



15/20
@benrayfield
very suspicious to announce opensourcing something without saying what license or where to download it



16/20
@benrayfield
"Where is SynthID available? This technology is available to Vertex AI customers using our text-to-image models, Imagen 3 and Imagen 2, which create high-quality images in a wide variety of artistic styles". Prove its opensource. Wheres one of those guys one could fork from?



17/20
@benrayfield
Why dont you call it a steganography tool? Isnt watermarking a kind of steganography if you do it well enuf? You're hiding any arbitrary data by rewriting words to have a similar meaning, and paying for that in extra length to store the data.



18/20
@234Sagyboy
@GoogleDeepMind @Google Awesome now that we have verification in place meaning better identification of content generated by AI Is it possible that we can please have Google Soundstorm and AudioLm released Thanks



19/20
@explorewithmom
Google DeepMind's SynthID is a game-changer for identifying AI-generated content. I've been exploring AI watermarking for my own work and I'm excited to see SynthID open-sourced and freely available to developers and businesses.



20/20
@AdalaceV2
Oh ok so you're actively polluting the output of the software I am paying for. Sounds like I won't be paying for it anymore.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




1/4
@MushtaqBilalPhD
Google has open-sourced a watermarking tool, SynthID, to identify AI-generated content.

Teachers can relax now because soon students won't be able to use AI to cheat on their assignments.



https://video-ft.twimg.com/ext_tw_v...305/pu/vid/avc1/1352x720/i6YazQbRYIH6iBnX.mp4

2/4
@MushtaqBilalPhD
Here's the full paper by Google DeepMind:
Scalable watermarking for identifying large language model outputs - Nature



3/4
@healthheronav
I've developed my own ways to detect AI-generated content, but I'm skeptical about tools like SynthID. What's to stop AI from evolving to evade watermarks?



4/4
@fcordobaot
It only works if the content was generated by Gemini after they created the watermark. So unless all the big ones use the standard watermark, it would be complicated to really achieve it!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




1/3
@kanpuriyanawab
Google Deepmind open-sourced SynthID today.

Here are 3 things you need to know:

What is SynthID??

SynthID has been developed for watermarking and identifying AI-generated content. This includes text, images, audio, and video.

Significance:

> This tool comes when distinguishing between AI and human-created content is becoming increasingly important due to misinformation, plagiarism, and copyright violations.

How it works?

> For text, SynthID modifies the probability scores of tokens during the generation process so that these modifications act as a watermark.

> This watermark can then be detected through a specific scoring system that assesses the likelihood that the text was generated by a watermarked large language model (LLM).

In my opinion,

The move to open-source SynthID allows anyone to implement this technology in their own AI models to watermark and later identify AI-generated text.

Moreover, this can be seen as a step towards fostering responsible AI development by allowing widespread implementation of watermarking technology.
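
To make the probability-scoring idea above concrete, here is a deliberately simplified toy: bias sampling toward a pseudorandom "green" subset of the vocabulary keyed on the previous token, then score a text by how many of its tokens land in their green sets. Google's actual SynthID-Text scheme (tournament sampling) is more sophisticated; this only shows the general flavor of generation-time watermarking and detection.

```python
# Toy sketch of generation-time text watermarking (NOT Google's actual SynthID algorithm).
# Idea: a secret key + previous token pseudorandomly picks a "green" half of the vocabulary;
# generation nudges probability toward green tokens, detection counts how often they appear.
import hashlib

KEY = "secret-key"

def is_green(prev_token: str, token: str) -> bool:
    digest = hashlib.sha256(f"{KEY}|{prev_token}|{token}".encode()).digest()
    return digest[0] % 2 == 0  # ~half the vocabulary is "green" for any given context

def green_fraction(tokens: list[str]) -> float:
    hits = sum(is_green(prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return hits / max(len(tokens) - 1, 1)

# Unwatermarked text hovers near 0.5; text generated with a green bias scores noticeably higher.
print(green_fraction("the cat sat on the mat".split()))
```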



GarI4YwaAAEtHMM.jpg


2/3
@Yaaaaaashhh
SynthID is really cool!!!!



3/3
@kanpuriyanawab
and necessary




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799



1/3
Llama-3.1-Nemotron-70b by @nvidia is now on the Arena leaderboard with overall rank #9 and #26 with Style Control!

Impressive to see a 70B open model competitive in human preference, as well as interesting ranking shifts under style control.

Comment below to share your thoughts!

[Quoted tweet]
Llama-3.1-Nemotron-70B-Instruct model aligned by our team is now live on lmarena.ai leaderboard with overall rank 9.

Everything used to create this model is public: code, data and reward model. HF checkpoint: huggingface.co/nvidia/Llama-…


2/3
Full result at http://lmarena.ai/?leaderboard!



3/3
Paper [2410.01257] HelpSteer2-Preference: Complementing Ratings with Preferences
Model weight nvidia/Llama-3.1-Nemotron-70B-Instruct-HF · Hugging Face
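
If you want to poke at the checkpoint yourself, a minimal sketch with transformers, assuming the `nvidia/Llama-3.1-Nemotron-70B-Instruct-HF` repo id linked above (at 70B parameters this still needs multiple large GPUs or heavy quantization):

```python
# Minimal sketch: chat with Llama-3.1-Nemotron-70B-Instruct-HF via transformers.
# Assumes the Hugging Face repo id linked above; 70B weights need multiple large GPUs.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nvidia/Llama-3.1-Nemotron-70B-Instruct-HF"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```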




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

GarTf1RawAAysGQ.jpg

GanZkZuboAA2dHF.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799

1/2
@dreamingtulpa
EfficientViT can speed up high-resolution diffusion models with a compression ratio of 128 while keeping good image quality!

It achieves a 19.1x speed increase for inference and a 17.9x for training on ImageNet 512x512 compared to other autoencoders.

GitHub - mit-han-lab/efficientvit: Efficient vision foundation models for high-resolution generation and perception.



https://video.twimg.com/ext_tw_video/1848303913369255936/pu/vid/avc1/1536x512/HFyptxie-pWhglzU.mp4
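
On the "ratio of 128" question in the next reply: my reading (an assumption, since the tweet doesn't say) is a spatial downsampling factor of f=128 for the autoencoder, i.e. each image side shrinks 128x in latent space. A quick shape sketch:

```python
# Quick shape sketch of what a spatial compression factor f = 128 would mean for an autoencoder
# (my assumption about the "ratio of 128" above, not a statement from the authors).
f = 128                   # assumed spatial downsampling factor
latent_channels = 512     # placeholder channel count for the latent space

image_shape = (3, 512, 512)                           # ImageNet 512x512 RGB input
latent_shape = (latent_channels, 512 // f, 512 // f)  # -> (512, 4, 4)

print("image  :", image_shape)
print("latent :", latent_shape)
# The diffusion model then only denoises a 4x4 spatial grid instead of 512x512 pixels,
# which is where the large inference/training speedups come from.
```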

2/2
@Kleos00
What do you mean 128? 128:1? And are you talking about bytes?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/3
@0xorphaned
"In this paper, we propose an FPGA-based accelerator for EfficientViT to advance the hardware efficiency frontier of ViTs. Specifically, we design a reconfigurable architecture to efficiently support various operation types, including lightweight convolutions and attention, boosting hardware utilization."

[Quoted tweet]
cs AR, LG
5 pages
An FPGA-Based Reconfigurable Accelerator for Convolution-Transformer Hybrid EfficientViT
Haikuo Shao, Huihong Shi, Wendong Mao, Zhongfeng Wang arxiv.org/abs/2403.20230


GKDIBpOWUAAebkJ.jpg


2/3
@0xorphaned
"Additionally, we present a time-multiplexed and pipelined dataflow to facilitate both intra- and inter-layer fusions, reducing off-chip data access costs."



3/3
@0xorphaned
"Experimental results show that our accelerator achieves up to 780.2 GOPS in throughput and 105.1 GOPS/W in energy efficiency at 200MHz on the Xilinx ZCU102 FPGA, which significantly outperforms prior works."




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196






1/3
@FBiras
Just yesterday a new paper approaching the Segment Anything Method came out: "EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss"🚀

Lets 𝐡𝐞𝐚𝐫 what Audiolizer - Convert Research Papers to Audio 🎧 thinks about it:



GF1JWk3XoAA9qQl.png


2/3
@FBiras
👉 This paper introduces EfficientViT-SAM, a highly efficient and fast model for image segmentation. It innovates on the Segment Anything Model (SAM) framework by incorporating EfficientViT, a more streamlined image encoder, enhancing the model's speed without sacrificing accuracy. EfficientViT-SAM shows a significant processing rate improvement over SAM-ViTH, achieving a 48.9 times faster rate on the A100 GPU.

👉 SAM models, known for their zero-shot segmentation capabilities, face computational efficiency challenges due to their intensive image encoders. EfficientViT-SAM addresses this by replacing SAM's image encoder with EfficientViT, aiming to retain performance while significantly boosting speed. This development is particularly relevant in scenarios requiring rapid image processing, such as augmented reality and interactive editing.

👉 EfficientViT-SAM's development involves a two-stage training process. Initially, knowledge distillation transfers capabilities from SAM-ViTH to EfficientViT, ensuring robust performance inheritance. Subsequent end-to-end training on the SA-1B dataset fine-tunes performance. EfficientViT-SAM comes in two versions (L and XL), offering varied speed and accuracy trade-offs. It utilizes EfficientViT's multi-scale linear attention module with ReLU linear attention and convolution for efficient image processing. The model's architecture involves a fusion of features through upsampling and addition, indicating sophisticated integration of multi-scale features.

👉 EfficientViT-SAM's evaluation demonstrates superior performance and efficiency compared to previous SAM models and alternatives. It excels in various zero-shot benchmarks, including point-prompted and box-prompted segmentation on COCO and LVIS datasets. The model also shows high performance in diverse real-world segmentation tasks. Its adaptability is further highlighted when paired with different object detection methods like YOLOv8 and GroundingDINO.

👉 EfficientViT-SAM represents a significant advancement in image segmentation, striking an impressive balance between efficiency and performance. Its ability to perform high-speed processing without compromising accuracy makes it a notable contribution to the field. By open-sourcing the model, the authors promote further research and potential enhancements, expanding the possibilities for practical applications of image segmentation technology.
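
A generic sketch of the first training stage described above (distilling the SAM-ViT-H image encoder into a smaller one by matching embeddings); the loss and encoder interfaces here are assumptions for illustration, not the authors' code.

```python
# Generic sketch of stage-1 knowledge distillation: match the student encoder's image
# embeddings to the frozen SAM-ViT-H teacher. Encoder classes and shapes are placeholders.
import torch
import torch.nn.functional as F

def distill_step(student_encoder, teacher_encoder, images, optimizer):
    with torch.no_grad():
        target = teacher_encoder(images)   # frozen SAM-ViT-H features
    pred = student_encoder(images)         # EfficientViT features (same output shape assumed)
    loss = F.mse_loss(pred, target)        # simple feature-matching objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```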



3/3
@FBiras
As usual, the full version is available on Audiolizer - Convert Research Papers to Audio, where you can 𝐥𝐢𝐬𝐭𝐞𝐧 𝐭𝐨 𝐢𝐭.

If there are any papers or domains that 𝐲𝐨𝐮 𝐰𝐨𝐮𝐥𝐝 𝐥𝐢𝐤𝐞 𝐭𝐨 𝐡𝐞𝐚𝐫 𝐚𝐛𝐨𝐮𝐭, drop a comment below!

#buildinpublic #indiehackers #Researchpaper #AI




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/3
@_akhaliq
EfficientViT-SAM

Accelerated Segment Anything Model Without Performance Loss

paper page: Paper page - EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

present EfficientViT-SAM, a new family of accelerated segment anything models. We retain SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT. For the training, we begin with the knowledge distillation from the SAM-ViT-H image encoder to EfficientViT. Subsequently, we conduct end-to-end training on the SA-1B dataset. Benefiting from EfficientViT's efficiency and capacity, EfficientViT-SAM delivers 48.9x measured TensorRT speedup on A100 GPU over SAM-ViT-H without sacrificing performance.



GFyu5r1WIAADSpF.png


2/3
@paul_cal
@yacineMTB fyi



3/3
@talrid23
Nice to see that they compare throughput and not "flops", which is a pretty useless metric




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799



1/3
OmniParser

Microsoft has casually dropped this gem to enable GPT4V to navigate your computer! Looks like, 'Computer use' is the next battleground.



2/3
OmniParser

> Screen Parsing tool for Pure Vision Based GUI Agent
> A method for parsing user interface screenshots into structured and easy-to-understand elements.
> This significantly enhances the ability of GPT-4V to generate actions 🤯
> Makes it possible for powerful LLMs to accurately ground the corresponding regions of interest in an interface.

🤗 Model on @Huggingface with MIT license: microsoft/OmniParser · Hugging Face
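
A minimal sketch for grabbing the released weights, assuming the `microsoft/OmniParser` repo id above (the GitHub README, not this snippet, is the reference for actually running the parser):

```python
# Minimal sketch: download the OmniParser weights from Hugging Face.
# Assumes the microsoft/OmniParser repo id mentioned above; see the GitHub repo for actual usage.
from huggingface_hub import snapshot_download

path = snapshot_download(repo_id="microsoft/OmniParser", local_dir="OmniParser")
print(f"Model files at: {path}")
```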



3/3
OmniParser: Screen Parsing tool for Pure Vision Based GUI Agent

🚀 Understanding the user interfaces like never before!
👉 Launch the local gradio app by following the instructions here: GitHub - microsoft/OmniParser




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799


1/2
It's here already!! @AnthropicAI Computer Use - out-of-the-box, no docker required! 🤯

Can support any platform. A user-friendly interface based on Gradio🌟



2/2
Computer Use - OOTB 🌟

An out-of-the-box (OOTB) solution for Claude's new Computer Use APIs.
No Docker is required, and it theoretically supports any platform, with testing currently done on Windows.

🤩 Extremely easy to launch: Clone the repo, install from requirements, and launch the app with `python app.py`

A user-friendly interface based on Gradio will launch locally 🎨

👉 Repo for Computer_Use_OOTB: GitHub - showlab/computer_use_ootb: An out-of-the-box (OOTB) version of Anthropic Claude Computer Use




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799

1/1
Checkout our paper at Paper page - Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

[Quoted tweet]
🚀Excited to release our study on visual comprehension and abstract reasoning skills via #PolyMATH. We provide both quantitative and qualitative evaluations of #GPT4o, #Claude 3.5 Sonnet, #Gemini 1.5 pro, #OpenAI o1 models & 13 other models.

📄✨Full paper: arxiv.org/abs/2410.14702

🤗@huggingface Dataset:
huggingface.co/datasets/him1…

🔍 Key Insights:
1️⃣ A dataset of 5000 samples to test cognitive reasoning capabilities of MLLMs.
2️⃣ The best scores achieved on POLYMATH are ∼ 41%, ∼ 36%, and ∼ 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively while human baseline was at ~66%.
3️⃣ An improvement of 4% is observed when image descriptions are passed instead of actual images, indicating reliance on text over image even in multimodal reasoning.
4️⃣ Open AI o1 models get competitive scores with human baseline on text only samples, highlighting room for improvement!

A massive shoutout to our outstanding team @s_verma3011, @ujjwala_ananth, @thegraydleguy and @Mihir3009 and guidance of @Swarooprm7 and @cbaral


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

Gaq7c8CaAAUCaLz.png

Gaq7sTpaAAIMi3m.jpg











1/10
@himanshu_gup14
🚀Excited to release our study on visual comprehension and abstract reasoning skills via #PolyMATH. We provide both quantitative and qualitative evaluations of #GPT4o, #Claude 3.5 Sonnet, #Gemini 1.5 pro, #OpenAI o1 models & 13 other models.

📄✨Full paper: [2410.14702] Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark

🤗@huggingface Dataset:
him1411/polymath · Datasets at Hugging Face

🔍 Key Insights:
1️⃣ A dataset of 5000 samples to test cognitive reasoning capabilities of MLLMs.
2️⃣ The best scores achieved on POLYMATH are ∼ 41%, ∼ 36%, and ∼ 27%, obtained by Claude-3.5 Sonnet, GPT-4o and Gemini-1.5 Pro respectively while human baseline was at ~66%.
3️⃣ An improvement of 4% is observed when image descriptions are passed instead of actual images, indicating reliance on text over image even in multimodal reasoning.
4️⃣ Open AI o1 models get competitive scores with human baseline on text only samples, highlighting room for improvement!

A massive shoutout to our outstanding team @s_verma3011, @ujjwala_ananth, @thegraydleguy and @Mihir3009 and guidance of @Swarooprm7 and @cbaral



Gaq7c8CaAAUCaLz.png

Gaq7sTpaAAIMi3m.jpg
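
To look at the benchmark samples yourself, a minimal sketch with the `datasets` library, assuming the `him1411/polymath` dataset id linked above (split and field names are whatever the dataset card defines):

```python
# Minimal sketch: load the PolyMATH benchmark from the Hugging Face Hub.
# Assumes the him1411/polymath dataset id linked above; check the dataset card for exact splits/fields.
from datasets import load_dataset

ds = load_dataset("him1411/polymath")
print(ds)                          # available splits and sizes
print(ds[list(ds.keys())[0]][0])   # peek at the first sample of the first split
```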


2/10
@himanshu_gup14
PolyMATH evaluates the complex multi-modal cognitive reasoning capabilities of MLLMs. The tasks and puzzles could be easy for humans but are fairly challenging for even State of the Art Models!



Gaq52OoasAA8GW7.png


3/10
@himanshu_gup14
The questions of the dataset are presented in the following format:



Gaq7KDlaAAIaYoM.png


4/10
@himanshu_gup14
State of the Art Models are evaluated across various prompting methods:



Gaq7Y-LaAAIp531.png


5/10
@himanshu_gup14
Similarly open source models’ performance remains low on PolyMATH as well:



Gaq78xPbEAAn2im.png


6/10
@himanshu_gup14
SOTA LMs frequently misinterpret diagrams, illustrating a need for improved understanding of visual data. Top LMs share common pitfalls in reasoning - suggesting a fundamental challenge in current architectures.



Gaq8JGvaAAIm2hM.png


7/10
@himanshu_gup14
A case study on o1-mini and o1-preview on text only samples showed competitive performance compared to human baseline



Gaq8Q04aAAAP3p4.png


8/10
@himanshu_gup14
Project Page: PolyMATH: A Challenging Multi-Modal Mathematical Reasoning Benchmark
Github Page: GitHub - polymathbenchmark/PolyMATH: Official github repository of XXXX



9/10
@himanshu_gup14
hf paper page: Paper page - Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark



10/10
@synthical_ai
Dark mode for this paper for those who read at night 🌙 Polymath: A Challenging Multi-modal Mathematical Reasoning Benchmark




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799


1/2
KlingAI Virtual Try-On

This project utilizes the KlingAI API to provide a virtual try-on experience using images of people and garments.



2/2
github: GitHub - AtaUllahB/KlingAI-Virtual-Try-On: This project utilizes the KlingAI API to provide a virtual try-on experience using images of people and garments.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/1
🚀 @Kling_ai Kwai-Kolors just launched Kolors Virtual Try-On! 🛍️👗 Upload your pic 📸, add the dress, and see how it looks on you instantly! 🤯

Try it here: Kolors Virtual Try-On - a Hugging Face Space by Kwai-Kolors

What do you think? 🤔

#AI #FashionTech #VirtualTryOn #Innovation #ShoppingExperience




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196







1/5
The Virtual Try-On API of Kling, the strongest video generation AI out there, is so wild I need to talk about it.

Via the API, you can swap the outfit of any person in a video 👇



2/5
It seems to build on the Kolors try-on model released by the same company behind Kling.

Kolors Virtual Try-On - a Hugging Face Space by Kwai-Kolors



3/5
There's a cheap trial plan for this, so I'll test it out.

[Quoted tweet]
The Virtual Try-On API of Kling, the strongest video generation AI out there, is so wild I need to talk about it.

Via the API, you can swap the outfit of any person in a video 👇


4/5
This account also does serious AI analysis 👇

[Quoted tweet]
Today, October 17: are Claude 3.5 Opus and ChatGPT 4.5 really coming?!?

I've rounded up the takes from hardcore overseas AI watchers 👇️ note.com/meru2002/n/n1f7be67…


5/5
It's wild that this can be integrated into all kinds of services.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196









1/6
@aitoolhouse
This is wild!

The creators of Kling AI have released a new Virtual Try-On tool called Kolors and it's really good!

Some examples and the link below 👇



GWXzMe8bYAAfBZB.jpg

GWXzMfZWUAEwQcc.jpg

GWXzMezbQAIvnYR.jpg


2/6
@aitoolhouse
Kolors is pretty accurate even with more complex patterns, although it's not 100% perfect in some cases.

It also understands how to apply shadows and lighting to the new generation:



GWXzZUdXIAAGS1Q.jpg


3/6
@aitoolhouse
Surprisingly, it also works on multiple characters.

It automatically understands the nature of the garment and applies it correctly.



GWXzdwqaoAAKZoD.jpg


4/6
@aitoolhouse
Kolors can automatically apply full-body garments, like dresses and suits to the body.



GWXzmfVbQAUf3K0.jpg


5/6
@aitoolhouse
You can try it here:
Kolors Virtual Try-On - a Hugging Face Space by Kwai-Kolors



6/6
@aitoolhouse
🤖 Contact us if you made a great AI tool to be featured: Submit your AI Tool to AI Toolhouse




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/1
Kling AI developers have just released a neural network that can transform a person in a photo into any clothes—Kolors Virtual Try-On.

It's fast, high-quality, and almost unlimited.

Check this out: Kolors Virtual Try-On - a Hugging Face Space by Kwai-Kolors




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/2
🚀 Exciting news! Our KOLORS virtual try-on project is now live! 🎉 Experience the thrill of trying on any fashionable outfit and using @Kling_ai to turn your images into dynamic videos. Don't miss out!

🤗Kolors Virtual Try-On - a Hugging Face Space by Kwai-Kolors



2/2
We are continuously optimizing the model, and external API access is also being actively prepared. Please stay tuned~




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799








1/7
@andi_marafioti
People are asking me about the Qwen2VL paper so I'll share my notes 🧶



GaB8HuGXsAAcStx.png


2/7
@andi_marafioti
They introduce an extension of RoPE to represent temporality in videos. This seems like a great idea in principle, but it doesn't make a huge difference in their ablations.



3/7
@andi_marafioti
They train their own ViT to represent images of different resolutions.
They don't exploit this training often; they use the same vision backbone for all three model sizes.
But it might give their model an advantage over others that share similar vision backbones.
This ViT allows them to have a few tokens for small images and many tokens for large images.
Their ablations for this dynamic tokenization show a huge improvement over using only 64 tokens per image, but using 64 tokens per image with their architecture is a bad idea and is only shown for the ablation.
When they get to 1600 tokens per image, the difference in performance mostly disappears.
There might still be a large difference in performance between their ViT and others, but this isn't shown.
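
As a rough sketch of what "a few tokens for small images and many tokens for large images" means in practice (the patch size and 2x2 merge factor below are assumptions for illustration; the paper's exact numbers may differ):

```python
# Rough sketch of resolution-dependent visual token counts.
# Patch size and 2x2 token merging are illustrative assumptions, not quoted from the paper.
import math

def visual_token_count(height: int, width: int, patch: int = 14, merge: int = 2) -> int:
    grid_h = math.ceil(height / patch)
    grid_w = math.ceil(width / patch)
    return (grid_h * grid_w) // (merge * merge)  # neighboring patches merged into one token

for h, w in [(224, 224), (448, 448), (1024, 768)]:
    print(f"{h}x{w} image -> ~{visual_token_count(h, w)} visual tokens")
```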



4/7
@andi_marafioti
They train on videos and images together—more data! Although they call them videos, they remove all of the audio and don't even use transcriptions.



5/7
@andi_marafioti
They train with function calling, which is super cool. They even evaluate the model as a robotic control agent.



6/7
@andi_marafioti
So... what should we try to integrate for Idefics?

The mRoPE strategy seems great for videos, but I'm a bit disappointed that the ablations don't show a larger difference. It doesn't seem like a large change, so I would still code it and test it.

The training of a new ViT model is great and something I've been meaning to do, but it seems to take more work, and I'm not convinced that it is better than what we are doing for Idefics3 (splitting the images into patches).



7/7
@andi_marafioti
I still think the larger mojo here is the data. The paper is not very explicit/extensive as to what data they use exactly. So I'm left wondering if they have more/better data than what's freely available and how much that contributes to their great benchmarks/performance.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,799

1/7
@burkov
Chinese models kick ass. Last week, I posted about Qwen 2.5 being awesome. Here's another one: DeepSeek 2.5. As good as GPT-4, as cheap as GPT-4 mini: DeepSeek
If you want to run it locally, you can get it on @huggingface: deepseek-ai/DeepSeek-V2.5 · Hugging Face

The license for running it locally is permissive, no "for research only" nonsense.
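
A minimal sketch of loading it locally with transformers, hardware permitting; the repo id is the one linked above, and `trust_remote_code` is my assumption since DeepSeek's V2-series checkpoints ship custom modeling code:

```python
# Minimal sketch: load DeepSeek-V2.5 locally with transformers (needs a LOT of GPU memory).
# trust_remote_code is assumed because DeepSeek V2-series repos ship custom modeling code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2.5"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)

messages = [{"role": "user", "content": "Write a haiku about open weights."}]
inputs = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=64)
print(tokenizer.decode(out[0][inputs.shape[-1]:], skip_special_tokens=True))
```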



2/7
@DanielSMatthews
Via ollama too, but a 133 gig model!

deepseek-v2.5



3/7
@tensor_art_com
Too big



4/7
@techyalex123
I've been following DeepSeek's progress and I'm excited to see their V2.5 model delivering results on par with GPT-4. The cost-effectiveness and permissive license make it an attractive option for those looking to run it locally. Good job!



5/7
@Naveen_AIGeek
I'm excited to see more Chinese models like DeepSeek 2.5 giving GPT-4 a run for its money. The permissive license is a big plus, no'research only' restrictions can only lead to more innovation.



6/7
@SimplyObjective
DS v2.5 is very good, but far too big for consumer hardware. And if you're going to run it online from a 3rd party you might as well use a superior model like GPT4o. And Q2.5 hallucinates like crazy about very popular things b/c they're overly focused on test scores & coding.



7/7
@AdinaYakup
If you're interested, we've collected some good open models from the Chinese community here:
zh-ai-community (Chinese LLMs on Hugging Face)




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 