bnew

Veteran
Joined
Nov 1, 2015
Messages
57,368
Reputation
8,499
Daps
160,069


Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku​


Oct 22, 2024 · 5 min read

An illustration of Claude navigating a computer cursor


Today, we’re announcing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. The upgraded Claude 3.5 Sonnet delivers across-the-board improvements over its predecessor, with particularly significant gains in coding—an area where it already led the field. Claude 3.5 Haiku matches the performance of Claude 3 Opus, our prior largest model, on many evaluations for the same cost and similar speed to the previous generation of Haiku.

We’re also introducing a groundbreaking new capability in public beta: computer use. Available today on the API, developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage, it is still experimental—at times cumbersome and error-prone. We're releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.

Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already begun to explore these possibilities, carrying out tasks that require dozens, and sometimes even hundreds, of steps to complete. For example, Replit is using Claude 3.5 Sonnet's capabilities with computer use and UI navigation to develop a key feature that evaluates apps as they’re being built for their Replit Agent product.



The upgraded Claude 3.5 Sonnet is now available for all users. Starting today, developers can build with the computer use beta on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. The new Claude 3.5 Haiku will be released later this month.


Claude 3.5 Sonnet: Industry-leading software engineering skills​


The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.

Early customer feedback suggests the upgraded Claude 3.5 Sonnet represents a significant leap for AI-powered coding. GitLab, which tested the model for DevSecOps tasks, found it delivered stronger reasoning (up to 10% across use cases) with no added latency, making it an ideal choice to power multi-step software development processes. Cognition uses the new Claude 3.5 Sonnet for autonomous AI evaluations, and experienced substantial improvements in coding, planning, and problem-solving compared to the previous version. The Browser Company, in using the model for automating web-based workflows, noted Claude 3.5 Sonnet outperformed every model they’ve tested before.

As part of our continued effort to partner with external experts, joint pre-deployment testing of the new Claude 3.5 Sonnet model was conducted by the US AI Safety Institute (US AISI) and the UK AI Safety Institute (UK AISI).

We also evaluated the upgraded Claude 3.5 Sonnet for catastrophic risks and found that the ASL-2 Standard, as outlined in our Responsible Scaling Policy, remains appropriate for this model.

Claude 3.5 Haiku: State-of-the-art meets affordability and speed​


Claude 3.5 Haiku is the next generation of our fastest model. For the same cost and similar speed to Claude 3 Haiku, Claude 3.5 Haiku improves across every skill set and surpasses even Claude 3 Opus, the largest model in our previous generation, on many intelligence benchmarks. Claude 3.5 Haiku is particularly strong on coding tasks. For example, it scores 40.6% on SWE-bench Verified, outperforming many agents using publicly available state-of-the-art models—including the original Claude 3.5 Sonnet and GPT-4o.

With low latency, improved instruction following, and more accurate tool use, Claude 3.5 Haiku is well suited for user-facing products, specialized sub-agent tasks, and generating personalized experiences from huge volumes of data—like purchase history, pricing, or inventory records.

Claude 3.5 Haiku will be made available later this month across our first-party API, Amazon Bedrock, and Google Cloud’s Vertex AI—initially as a text-only model and with image input to follow.

Teaching Claude to navigate computers, responsibly​


With computer use, we're trying something fundamentally new. Instead of making specific tools to help Claude complete individual tasks, we're teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people. Developers can use this nascent capability to automate repetitive processes, build and test software, and conduct open-ended tasks like research.

To make these general skills possible, we've built an API that allows Claude to perceive and interact with computer interfaces. Developers can integrate this API to enable Claude to translate instructions (e.g., “use data from my computer and online to fill out this form”) into computer commands (e.g. check a spreadsheet; move the cursor to open a web browser; navigate to the relevant web pages; fill out a form with the data from those pages; and so on). On OSWorld, which evaluates AI models' ability to use computers like people do, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category—notably better than the next-best AI system's score of 7.8%. When afforded more steps to complete the task, Claude scored 22.0%.
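
For developers who want to try the beta, here is a minimal sketch using the Anthropic Python SDK. The model name, tool type strings, and beta flag reflect what Anthropic documented at launch and should be verified against the current docs; the agent loop that actually takes screenshots and executes clicks is left to the caller.

```python
# Minimal sketch: ask Claude to plan computer actions via the computer-use beta.
# Identifiers below are the October 2024 beta values; verify against current docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    betas=["computer-use-2024-10-22"],
    tools=[
        {
            "type": "computer_20241022",
            "name": "computer",
            "display_width_px": 1280,
            "display_height_px": 800,
        },
        {"type": "bash_20241022", "name": "bash"},
        {"type": "text_editor_20241022", "name": "str_replace_editor"},
    ],
    messages=[{"role": "user", "content": "Open a browser and check today's weather."}],
)

# Claude does not execute anything itself: it returns tool_use blocks
# (e.g. screenshot, mouse_move, left_click, type) that your own code must
# carry out and feed back as tool_result messages in a loop.
for block in response.content:
    print(block)
```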

While we expect this capability to improve rapidly in the coming months, Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks. Because computer use may provide a new vector for more familiar threats such as spam, misinformation, or fraud, we're taking a proactive approach to promote its safe deployment. We've developed new classifiers that can identify when computer use is being used and whether harm is occurring. You can read more about the research process behind this new skill, along with further discussion of safety measures, in our post on developing computer use.

Looking ahead​


Learning from the initial deployments of this technology, which is still in its earliest stages, will help us better understand both the potential and the implications of increasingly capable AI systems.

We’re excited for you to explore our new models and the public beta of computer use—and welcome you to share your feedback with us. We believe these developments will open up new possibilities for how you work with Claude, and we look forward to seeing what you'll create.
 

bnew







1/11
@genmoai
Introducing Mochi 1 preview. A new SOTA in open-source video generation. Apache 2.0.

magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce



https://video.twimg.com/ext_tw_video/1848745801926795264/pu/vid/avc1/1920x1080/zCXCFAyOnvznHUAf.mp4

2/11
@genmoai
Mochi 1 has superior motion quality, prompt adherence and exceptional rendering of humans that begins to cross the uncanny valley.

Today, we are open-sourcing our base 480p model with an HD version coming soon.



GagSEjMW0AAQ-ff.jpg

GagSFdDWIAAHKSa.jpg


3/11
@genmoai
We're excited to see what you create with Mochi 1. We're also excited to announce our $28.4M Series A from @NEA, @TheHouseVC, @GoldHouseCo, @WndrCoLLC, @parasnis, @amasad, @pirroh and more.

Use Mochi 1 via our playground at our homepage or download the weights freely.



4/11
@jahvascript
can I get a will smith eating spaghetti video?



5/11
@genmoai




6/11
@GozukaraFurkan
Does it have image to video and video to video?



7/11
@genmoai
Good question! Our new model, Mochi, is entirely text to video. On Genmo (genmo.ai), if you use the older model (Replay v0.2), you're able to do image to video as well as text to video.



8/11
@iamhitarth
Keep getting this error message even though I've created an account and signed in 🤔



GaguaHHa0AA-5yl.png


9/11
@genmoai
Hey there, thanks for escalating this! We're taking a closer look right now to see why this is happening 👀 I appreciate your patience and understanding!



10/11
@sairahul1
Where can I try it



11/11
@genmoai
You can find everything you'll need to try Mochi over at Genmo (genmo.ai). 🫶 Happy generating!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/3
@AIWarper
My Mochi 1 test thread. Will post some video examples below if you are interested.

Inference done with FAL



https://video.twimg.com/ext_tw_video/1848834063462993920/pu/vid/avc1/848x480/5De8HKQoKL4cYf6I.mp4

https://video-t-1.twimg.com/ext_tw_...016/pu/vid/avc1/848x480/JJWbnMjCeriVR86A.mp4?


2/3
@AIWarper
A floating being composed of pure, radiant energy, its form shifting between physical and immaterial states. The Lumiscribe’s eyes glow with ancient knowledge, and its translucent wings pulse with golden light. It hovers above the Skyvault, a vast floating city on its homeworld, Zephyra, a world of constant storms and lightning-filled skies. Giant skyborne creatures soar above the city, their bodies crackling with electricity. Below, the landscape is a mix of jagged mountains and stormy seas, illuminated by constant flashes of lightning. The Lumiscribe uses its energy to inscribe symbols of power into the air, controlling the flow of the storms that define their world.



https://video.twimg.com/ext_tw_video/1848835515191332868/pu/vid/avc1/848x480/HjfJWN7l2MzSNEcz.mp4

3/3
@DataPlusEngine
Holy shyt it's so good







1/1
Quote:
Mochi 1 represents a significant advancement in open-source video generation, featuring a 10 billion parameter diffusion model built on our novel Asymmetric Diffusion Transformer architecture.

The model requires at least 4 H100 GPUs to run.
genmo/mochi-1-preview · Hugging Face







1/10
@_akhaliq
Mochi 1 preview

A new SOTA in open-source video generation. Apache 2.0



https://video.twimg.com/ext_tw_video/1848763058291642370/pu/vid/avc1/1280x720/QFGsHnbVyszo5Xgz.mp4

2/10
@_akhaliq
model: genmo/mochi-1-preview · Hugging Face



3/10
@Prashant_1722
Sora has become extremely non moat by now



4/10
@LearnedVector
Ima need more h100s



5/10
@thebuttredettes
Getting only errors on the preview page.



6/10
@noonescente
What is a SOTA?



7/10
@m3ftah
It requires 4 H100 to run!



8/10
@ED84VG
It’s crazy how fast ai video has come.

Does anyone know if it got better because Sora showed it was possible, or would they have gotten this good even if they never showed us?



9/10
@JonathanKorstad
🔥



10/10
@_EyesofTruth_
OpenAi getting mogged





 

bnew







1/11
@runwayml
Introducing, Act-One. A new way to generate expressive character performances inside Gen-3 Alpha using a single driving video and character image. No motion capture or rigging required.

Learn more about Act-One below.

(1/7)



https://video-t-2.twimg.com/ext_tw_...248/pu/vid/avc1/1280x720/2EyYj6GjSpT_loQf.mp4

2/11
@runwayml
Act-One allows you to faithfully capture the essence of an actor's performance and transpose it to your generation. Where traditional pipelines for facial animation involve complex, multi-step workflows, Act-One works with a single driving video that can be shot on something as simple as a cell phone.

(2/7)



https://video-t-2.twimg.com/ext_tw_...282/pu/vid/avc1/1280x720/Qie29gOWU42zMaGo.mp4

3/11
@runwayml
Without the need for motion-capture or character rigging, Act-One is able to translate the performance from a single input video across countless different character designs and in many different styles.

(3/7)



https://video-t-2.twimg.com/ext_tw_...696/pu/vid/avc1/1280x720/TcWzpRl3kMfHM4ro.mp4

4/11
@runwayml
One of the model's strengths is producing cinematic and realistic outputs across a robust number of camera angles and focal lengths, allowing you to generate emotional performances with previously impossible character depth and opening new avenues for creative expression.

(4/7)



https://video-t-2.twimg.com/ext_tw_...424/pu/vid/avc1/1280x720/3fupOI32Ck6ITIiE.mp4

5/11
@runwayml
A single video of an actor is used to animate a generated character.

(5/7)



https://video-t-2.twimg.com/ext_tw_...016/pu/vid/avc1/1280x720/Fh7WanCgSTR_ffHF.mp4

6/11
@runwayml
With Act-One, eye-lines, micro expressions, pacing and delivery are all faithfully represented in the final generated output.

(6/7)



https://video-t-2.twimg.com/ext_tw_...528/pu/vid/avc1/1280x720/ywweYvzLHe-3GO2B.mp4

7/11
@runwayml
Access to Act-One will begin gradually rolling out to users today and will soon be available to everyone.

To learn more, visit Runway Research | Introducing Act-One

(7/7)



8/11
@xushanpao310
@threadreaderapp Unroll



9/11
@threadreaderapp
@xushanpao310 Salam, please find the unroll here: Thread by @runwayml on Thread Reader App See you soon. 🤖



10/11
@threadreaderapp
Your thread is gaining traction! #TopUnroll Thread by @runwayml on Thread Reader App 🙏🏼 @TheReal_Ingrid_ for 🥇unroll



11/11
@flowersslop
motion capture industry is cooked




 

bnew



1/4
@AIWarper
New Stable Diffusion 3.5

Weights: Comfy-Org/stable-diffusion-3.5-fp8 at main

Compatible with ComfyUI already



2/4
@AIWarper
Update: As per usual - SD basemodels are.... meh?

The strength will hopefully be in the finetune



GagKKv3aMAEvHPO.png

GagKL8gbEAMpgBD.jpg


3/4
@goon_nguyen
Still not good as FLUX 😅



GagJXPmbEAIFVg2.jpg


4/4
@AIWarper
But can it fine tune. That is the selling point.

It's a non distilled model (I think)







1/2
@AIWarper
SD 3.5 Large vs Flux [DEV]

Image 1 = identical prompt I used for Flux

Image 2 = Shortened prompt

Conclusion - I need to experiment more. These were generated with the HF demo so not many knobs to turn

[Quoted tweet]
Flux [DEV] locally w/ ComfyUI

1024 x 1024 gen upscaled 2x

Spicyyyyyyyyyyyy details and the prompt adherence is insane. I never once mentioned the words Frieren OR Fern.

(prompt in 1st comment below)


Gag2xmsbEAEo1j4.jpg

Gag3f1mbEAIheEO.jpg


https://video-t-1.twimg.com/ext_tw_...37/pu/vid/avc1/1542x1080/PTkQOzHzyUXnIEa9.mp4

2/2
@GorillaRogueGam
I cannot get excited for a Stability product until they release something 1.5 level revolutionary again

I was ready to completely give up on the company until Cameron got on the board of directors, but he is fond of sinking ships so I don’t know what to think…




 

bnew






1/11
@Kijaidesign
This model is so much fun!
I have implemented Tora:
GitHub - alibaba/Tora: The official repository of Tora
to my CogVideoX nodes:
GitHub - kijai/ComfyUI-CogVideoXWrapper
This implementation can be run within ~13GB VRAM
#ai #comfyui



https://video.twimg.com/ext_tw_video/1848349720432754688/pu/vid/avc1/720x480/uR1WPXyQj0FvdxCO.mp4

2/11
@DreamStarter_1
I have only 12...so cloooose!😆



3/11
@Kijaidesign
I didn't try it with this particular model, but there's always the option to use sequential_cpu_offloading, which makes it really slow, but runs on very little VRAM.
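
For readers outside ComfyUI, the same trade-off is exposed by the plain diffusers CogVideoX pipeline; the sketch below is a generic low-VRAM setup, not the ComfyUI-CogVideoXWrapper itself, and the prompt and settings are placeholders.

```python
# Sketch of the low-VRAM trade-off with the diffusers CogVideoX pipeline
# (the ComfyUI wrapper exposes an equivalent toggle).
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Streams weights between CPU and GPU layer by layer: much slower per step,
# but runs in a few GB of VRAM instead of the full model footprint.
pipe.enable_sequential_cpu_offload()
pipe.vae.enable_tiling()  # further reduces peak memory during decoding

video = pipe(
    prompt="a bear playing a saxophone on a mountain at sunset",
    num_frames=49,           # the 49-frame setting mentioned above
    num_inference_steps=50,
).frames[0]

export_to_video(video, "output.mp4", fps=8)
```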



4/11
@6___0
how long does a 3 sec vid take to render (on what GPU)?



5/11
@Kijaidesign
Depends on many things; there are many possible optimizations to use. Stacking everything, at best it's around 2 mins to do 49 frames.



6/11
@huanggou7
does it work with image to video?



7/11
@Kijaidesign
Not currently, the model is trained on top of the text2video model, looking for a way to do that though.



8/11
@Sire1Sire1
Can we try this? somewhere?



9/11
@Kijaidesign
Currently only with my ComfyUI nodes:
GitHub - kijai/ComfyUI-CogVideoXWrapper



10/11
@sophanimeai
Interesting to see Tora implemented in your ComfyUI-CogVideoX wrapper. What drove you to experiment with this particular combination?



11/11
@Kijaidesign
The model that was released was actually built on the CogVideoX 5b text2vid model; after studying their code and the model structure for a bit, it was just a natural fit.






1/8
@Kijaidesign
Oh hey Tora trajectory actually works with the image2video model too, this will be good...
#ai #comfyui #cogvideox



https://video-t-2.twimg.com/ext_tw_...8464/pu/vid/avc1/720x480/guTYCJ78ra53nmKL.mp4

2/8
@Elliryk_Krypto
Awesome🔥🔥



3/8
@huanggou7
holy shyt



4/8
@Leo1991199
cool!



5/8
@Mayzappp1
This is like instance diffusion with spline:smile: used to work on sd 1.5 with animate diff
Really nice to see something like it on cog



6/8
@AmirKerr
Finally! Some video control that we can run locally 🥹 Kijai ✨



7/8
@mickmumpitz
That looks absolutely fantastic, great work!



8/8
@_EyesofTruth_
Wut 🤯




 

bnew







1/11
@ComfyUI
Introducing ComfyUI V1, a packaged desktop application

- Windows (@nvidia), macOS (apple silicon), Linux
- One click install for less technical users
- Ships with ComfyUI manager
- Auto-installed python environment

Join the closed beta: Download ComfyUI for Windows/Mac



https://video.twimg.com/amplify_video/1848333452916981760/vid/avc1/1920x1080/aCnRKztsS-ftpb0_.mp4

2/11
@ComfyUI
To experience the new UI immediately, update your current ComfyUI to the latest and enable it in the settings menu.



3/11
@ComfyUI
Custom Node Registry

- Registry for custom nodes (similar to PYPI)
- Versioned custom nodes
- Integrated into ComfyUI Manager
- Only available on desktop application but coming soon to everyone else



4/11
@ComfyUI
More details: ComfyUI V1 Release



5/11
@22yonking2
is it possible to use this without installing python, and without it installing python on the computer?

if it has python that only it can access and it doesn't install libraries also good.



6/11
@ComfyUI
The python is a standalone one, and will install PyTorch by default. Will not affect system python.



7/11
@22yonking2
Wow!!!! impressive work.
this not only looks incredible but much more beautiful and advanced than any other stable diffusion studio app!

Great job!



8/11
@ComfyUI
@HclHno3 did all the amazing UI improvements!



9/11
@BoneStudioAnim
Can we run workflows from the command line?



10/11
@ComfyUI
You already can: try comfy-cli



11/11
@chidzoWTF
Amazing! Signed up! 🙏




 

bnew



1/11
@perplexity_ai
Pro Search is now more powerful. Introducing Reasoning Mode!

Challenge your own curiosity. Ask multi-layered questions. Perplexity will adapt.

Try it yourself (sample queries in thread)👇



https://video-t-2.twimg.com/ext_tw_...563/pu/vid/avc1/1280x720/_0rglRblnHofZeaG.mp4

2/11
@perplexity_ai
"Read all of Bezos’ annual shareholder letters and compile a table of the key takeaways from each year." https://www.perplexity.ai/search/read-all-the-shareholder-lette-ACNERXw4T0iuxdENJjzsPQ



3/11
@perplexity_ai
"Please provide me with the latest information or releases from the following areas regarding Amazon:

1. Recent acquisitions or mergers
2. Executive leadership transitions
3. Technological innovations or IT infrastructure updates
4. Cybersecurity incidents or data breaches
5. Major company announcements or significant news stories
6. Developments in user data protection and privacy policies
7. Key points from their latest 10-K filing and annual report"

https://www.perplexity.ai/search/please-provide-me-with-the-lat-wXprqGQQQ2.hExhF4XQUsQ



4/11
@HCSolakoglu
Add one more step for hallucination checks; it will help users a lot when verifying all the information.



5/11
@ChrisUniverseB
Currently stuck on this question, it works for other questions, but this used up 3 chances ;(



GahF7mWXMAAX02g.png


6/11
@riderjharris




https://video-t-2.twimg.com/amplify...81216/vid/avc1/1080x1920/ShNb6ZKMUVzrtmOD.mp4

7/11
@hoppyturtles
casual Perplexity W



8/11
@MarkusOdenthal
Wow! Super useful.

This will really help developers like me build great products.

Best tool to start with research.



9/11
@Prashant_1722
This is great. What are the Top Reasoning searches rn on Perplexity



10/11
@koltregaskes
Amazing, However, this would look even more amazing in a Windows app. 😜



11/11
@ash_stuart_
How do you switch it on and off? Sometimes LLMs go off track so it's good if we have the option to choose.




 

bnew



1/7
@logtdx
RAVE and FLATTEN were two of the papers that originally got me into diffusion models. They take inverse noise and apply consistency to image models.

Now with RF-Inversion (thanks @litu_rout_ and @natanielruizg) I can try these on Flux.

Not production quality, but still fun.



https://video.twimg.com/ext_tw_video/1847352406654300160/pu/vid/avc1/1536x560/OzWYdWCyfjxNNEsq.mp4

2/7
@natanielruizg
you’re on twitter now, awesome!



3/7
@logtdx
ahaha yeah, didn't know what I was missing out on



4/7
@blizaine
Welcome to the AI fam on 𝕏 @logtdx 🔥👊



5/7
@fblissjr
this is such an elegant solution (and highly useful). thx!



6/7
@McintireTristan
You’re a wizard harry



7/7
@shateinft
this is something i would like to test











1/9
@AIWarper
Using @logtdx implementation of RF-Inversion by @Google and @litu_rout_ and @natanielruizg I think there may be a method here for consistent stylized animation frames.

If we could somehow just align these grids it would be very powerful

Grid in the second tweet



2/9
@AIWarper




GaIB3CJb0AEPWKq.jpg


3/9
@AIWarper




4/9
@joehenriod
Sorry for noob question, what is the problem with alignment?

They look very aligned to me.



5/9
@AIWarper
Sorry I should be more clear.

Let's say I have 24 source images and I want to convert them all into this style - how do I do it?

I could:
A) Pack 24 images into 1 grid and make 24 extremely tiny images
B) Pack 6 images (4 per grid) and run 6 different runs

Issue with B is none of them will look alike
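
A minimal sketch of option B with Pillow, assuming placeholder file names and a 2x2 grid of 512 px frames:

```python
# Sketch: pack 4 frames into a 2x2 grid so one stylization pass sees them
# together, then unpack afterwards. Paths and sizes are placeholders.
from PIL import Image

frames = [Image.open(f"frame_{i:02d}.png").resize((512, 512)) for i in range(4)]

grid = Image.new("RGB", (1024, 1024))
for idx, frame in enumerate(frames):
    x, y = (idx % 2) * 512, (idx // 2) * 512
    grid.paste(frame, (x, y))
grid.save("grid_input.png")

# ...run the grid through the stylization model, then split it back up:
styled = Image.open("grid_styled.png")
for idx in range(4):
    x, y = (idx % 2) * 512, (idx // 2) * 512
    styled.crop((x, y, x + 512, y + 512)).save(f"styled_{idx:02d}.png")
```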



6/9
@natanielruizg
this is soooo cool



7/9
@AIWarper
If you have any clever ideas on how I can propagate this across more frames I am all ears haha

Very promising though!



8/9
@vkumar_1994
Be very clear BANGLORE MUMBAI DO NOT MATTER IN INDIA.



9/9
@vkumar_1994
Chutiya define success only for Mumbai Bangalore circles 1 Billion plus people you banglore pigs do not exist for most people in this country.keep you colonised NARSCISSTIC MARKETING defenition of success and lecturing as if u r some god runing earth to 👇




 

bnew


1/2
@carlfranzen
New #AIvideo model just dropped: Mochi 1 from @genmoai — and it claims higher performance on motion and prompt adherence than leading proprietary rivals!

It's also ✨OPEN SOURCE✨ under Apache 2.0. Only 480p but still really impressive imo...

AI video startup Genmo launches Mochi 1, an open source rival to Runway, Kling, and others



https://video-ft.twimg.com/ext_tw_v...160/pu/vid/avc1/1280x720/a5TaV-PxQRNBI-Nd.mp4

https://video-ft.twimg.com/ext_tw_v...680/pu/vid/avc1/1280x720/26TdwdF2YX18tD21.mp4

https://video-ft.twimg.com/ext_tw_v...432/pu/vid/avc1/1280x720/Vl5X1IyamBbWxVbU.mp4

https://video-ft.twimg.com/ext_tw_v...048/pu/vid/avc1/1280x720/Bqdqq8nW3XwEuwnH.mp4

2/2
@_parasj
Thank you for the coverage! The playground and weights are now live finally

[Quoted tweet]
Introducing Mochi 1 preview. A new SOTA in open-source video generation. Apache 2.0.

magnet:?xt=urn:btih:441da1af7a16bcaa4f556964f8028d7113d21cbb&dn=weights&tr=udp://tracker.opentrackr.org:1337/announce


https://video.twimg.com/ext_tw_video/1848745801926795264/pu/vid/avc1/1920x1080/zCXCFAyOnvznHUAf.mp4


 

bnew



AI video startup Genmo launches Mochi 1, an open source rival to Runway, Kling, and others​

Carl Franzen@carlfranzen

October 22, 2024 5:00 AM

Screenshot of AI video close-up of Caucasian elderly woman with brown eyes smiling


Credit: Genmo

Genmo, an AI company focused on video generation, has announced the release of a research preview for Mochi 1, a new open-source model for generating high-quality videos from text prompts — and claims performance comparable to, or exceeding, leading closed-source/proprietary rivals such as Runway’s Gen-3 Alpha, Luma AI’s Dream Machine, Kuaishou’s Kling, Minimax’s Hailuo, and many others.

Available under the permissive Apache 2.0 license, Mochi 1 offers users free access to cutting-edge video generation capabilities — whereas pricing for other models starts at limited free tiers but goes as high as $94.99 per month (for the Hailuo Unlimited tier). Users can download the full weights and model code free on Hugging Face, though it requires “at least 4” Nvidia H100 GPUs to operate on a user’s own machine.
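
For reference, the weights can be fetched programmatically with huggingface_hub; the repo ID is the one named in the article, and downloading is the easy part, since inference still needs the multi-GPU setup described above.

```python
# Sketch: pull the Mochi 1 preview weights from Hugging Face.
# Downloading works on any machine; inference still needs roughly 4x H100 per Genmo.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="genmo/mochi-1-preview",
    local_dir="./mochi-1-preview",
)
print("Weights downloaded to:", local_dir)
```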

In addition to the model release, Genmo is also making available a hosted playground, allowing users to experiment with Mochi 1’s features firsthand.

The 480p model is available for use today, and a higher-definition version, Mochi 1 HD, is expected to launch later this year.

Initial videos shared with VentureBeat show impressively realistic scenery and motion, particularly with human subjects as seen in the video of an elderly woman below:



Advancing the state-of-the-art​


Mochi 1 brings several significant advancements to the field of video generation, including high-fidelity motion and strong prompt adherence.

According to Genmo, Mochi 1 excels at following detailed user instructions, allowing for precise control over characters, settings, and actions in generated videos.

Genmo has positioned Mochi 1 as a solution that narrows the gap between open and closed video generation models.

“We’re 1% of the way to the generative video future. The real challenge is to create long, high-quality, fluid video. We’re focusing heavily on improving motion quality,” said Paras Jain, CEO and co-founder of Genmo, in an interview with VentureBeat.

Jain and his co-founder started Genmo with a mission to make AI technology accessible to everyone. “When it came to video, the next frontier for generative AI, we just thought it was so important to get this into the hands of real people,” Jain emphasized. He added, “We fundamentally believe it’s really important to democratize this technology and put it in the hands of as many people as possible. That’s one reason we’re open sourcing it.”

Already, Genmo claims that in internal tests, Mochi 1 bests most other video AI models, including the proprietary competitors Runway and Luma, at prompt adherence and motion quality.

unnamed.png

unnamed-1.png



Series A funding to the tune of $28.4M​


In tandem with the Mochi 1 preview, Genmo also announced it has raised a $28.4 million Series A funding round, led by NEA, with additional participation from The House Fund, Gold House Ventures, WndrCo, Eastlink Capital Partners, and Essence VC. Several angel investors, including Abhay Parasnis (CEO of Typeface) and Amjad Masad (CEO of Replit), are also backing the company’s vision for advanced video generation.

Jain’s perspective on the role of video in AI goes beyond entertainment or content creation. “Video is the ultimate form of communication—30 to 50% of our brain’s cortex is devoted to visual signal processing. It’s how humans operate,” he said.

Genmo’s long-term vision extends to building tools that can power the future of robotics and autonomous systems. “The long-term vision is that if we nail video generation, we’ll build the world’s best simulators, which could help solve embodied AI, robotics, and self-driving,” Jain explained.



Open for collaboration — but training data is still close to the vest​


Mochi 1 is built on Genmo’s novel Asymmetric Diffusion Transformer (AsymmDiT) architecture.

At 10 billion parameters, it’s the largest open source video generation model ever released. The architecture focuses on visual reasoning, with four times the parameters dedicated to processing video data as compared to text.

Efficiency is a key aspect of the model’s design. Mochi 1 leverages a video VAE (Variational Autoencoder) that compresses video data to a fraction of its original size, reducing the memory requirements for end-user devices. This makes it more accessible for the developer community, who can download the model weights from HuggingFace or integrate it via API.

Jain believes that the open-source nature of Mochi 1 is key to driving innovation. “Open models are like crude oil. They need to be refined and fine-tuned. That’s what we want to enable for the community—so they can build incredible new things on top of it,” he said.

However, when asked about the model’s training dataset — among the most controversial aspects of AI creative tools, as evidence has shown many to have trained on vast swaths of human creative work online without express permission or compensation, and some of it copyrighted works — Jain was coy.

“Generally, we use publicly available data and sometimes work with a variety of data partners,” he told VentureBeat, declining to go into specifics due to competitive reasons. “It’s really important to have diverse data, and that’s critical for us.”



Limitations and roadmap​


As a preview, Mochi 1 still has some limitations. The current version supports only 480p resolution, and minor visual distortions can occur in edge cases involving complex motion. Additionally, while the model excels in photorealistic styles, it struggles with animated content.

However, Genmo plans to release Mochi 1 HD later this year, which will support 720p resolution and offer even greater motion fidelity.

“The only uninteresting video is one that doesn’t move—motion is the heart of video. That’s why we’ve invested heavily in motion quality compared to other models,” said Jain.

Looking ahead, Genmo is developing image-to-video synthesis capabilities and plans to improve model controllability, giving users even more precise control over video outputs.


Expanding use cases via open source video AI​


Mochi 1’s release opens up possibilities for various industries. Researchers can push the boundaries of video generation technologies, while developers and product teams may find new applications in entertainment, advertising, and education.

Mochi 1 can also be used to generate synthetic data for training AI models in robotics and autonomous systems.

Reflecting on the potential impact of democratizing this technology, Jain said, “In five years, I see a world where a poor kid in Mumbai can pull out their phone, have a great idea, and win an Academy Award—that’s the kind of democratization we’re aiming for.”

Genmo invites users to try the preview version of Mochi 1 via their hosted playground at genmo.ai/play, where the model can be tested with personalized prompts — though at the time of this article’s posting, the URL was not loading the correct page for VentureBeat.


A call for talent​


As it continues to push the frontier of open-source AI, Genmo is actively hiring researchers and engineers to join its team. “We’re a research lab working to build frontier models for video generation. This is an insanely exciting area—the next phase for AI—unlocking the right brain of artificial intelligence,” Jain said. The company is focused on advancing the state of video generation and further developing its vision for the future of artificial general intelligence.
 

bnew



Meta Introduces Spirit LM open source model that combines text and speech inputs/outputs​

Carl Franzen@carlfranzen

October 18, 2024 5:05 PM

comic book style pointillism image of a startled programmer watching a happy ghost emerge from a computer screen


Credit: VentureBeat made with ChatGPT


Just in time for Halloween 2024, Meta has unveiled Meta Spirit LM, the company’s first open-source multimodal language model capable of seamlessly integrating text and speech inputs and outputs.

As such, it competes directly with OpenAI’s GPT-4o (also natively multimodal) and other multimodal models such as Hume’s EVI 2, as well as dedicated text-to-speech and speech-to-text offerings such as ElevenLabs.

Designed by Meta’s Fundamental AI Research (FAIR) team, Spirit LM aims to address the limitations of existing AI voice experiences by offering a more expressive and natural-sounding speech generation, while learning tasks across modalities like automatic speech recognition (ASR), text-to-speech (TTS), and speech classification.

Unfortunately for entrepreneurs and business leaders, the model is only currently available for non-commercial usage under Meta’s FAIR Noncommercial Research License, which grants users the right to use, reproduce, modify, and create derivative works of the Meta Spirit LM models, but only for noncommercial purposes. Any distribution of these models or derivatives must also comply with the noncommercial restriction.


A new approach to text and speech​


Traditional AI models for voice rely on automatic speech recognition to process spoken input before synthesizing it with a language model, which is then converted into speech using text-to-speech techniques.

While effective, this process often sacrifices the expressive qualities inherent to human speech, such as tone and emotion. Meta Spirit LM introduces a more advanced solution by incorporating phonetic, pitch, and tone tokens to overcome these limitations.

Meta has released two versions of Spirit LM:

Spirit LM Base: Uses phonetic tokens to process and generate speech.

Spirit LM Expressive: Includes additional tokens for pitch and tone, allowing the model to capture more nuanced emotional states, such as excitement or sadness, and reflect those in the generated speech.

Both models are trained on a combination of text and speech datasets, allowing Spirit LM to perform cross-modal tasks like speech-to-text and text-to-speech, while maintaining the natural expressiveness of speech in its outputs.
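
To make the contrast concrete, here is a toy, self-contained illustration (not Meta's actual API) of what information reaches the model in each setup:

```python
# Toy illustration only: what survives in a cascaded pipeline vs. a
# Spirit LM-style interleaved token stream.

# Pretend output of an expressive speech front end for the utterance "I won!"
utterance = [
    {"kind": "word",  "value": "I"},
    {"kind": "pitch", "value": "rising"},      # expressive cue
    {"kind": "word",  "value": "won!"},
    {"kind": "style", "value": "excited"},     # expressive cue
]

# Cascaded ASR -> LLM -> TTS: only the transcript survives the ASR step,
# so tone and emotion are gone before the language model ever sees them.
cascaded_input = " ".join(t["value"] for t in utterance if t["kind"] == "word")

# Spirit LM-style early fusion: phonetic, pitch, and style tokens stay in one
# interleaved stream, so the model can reason over (and generate) them.
fused_input = [f"<{t['kind']}:{t['value']}>" for t in utterance]

print(cascaded_input)  # 'I won!'
print(fused_input)     # ['<word:I>', '<pitch:rising>', '<word:won!>', '<style:excited>']
```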


Open-source noncommercial — only available for research​


In line with Meta’s commitment to open science, the company has made Spirit LM fully open-source, providing researchers and developers with the model weights, code, and supporting documentation to build upon.

Meta hopes that the open nature of Spirit LM will encourage the AI research community to explore new methods for integrating speech and text in AI systems.

The release also includes a research paper detailing the model’s architecture and capabilities.

Mark Zuckerberg, Meta’s CEO, has been a strong advocate for open-source AI, stating in a recent open letter that AI has the potential to “increase human productivity, creativity, and quality of life” while accelerating advancements in areas like medical research and scientific discovery.


Applications and future potential​


Meta Spirit LM is designed to learn new tasks across various modalities, such as:

Automatic Speech Recognition (ASR): Converting spoken language into written text.

Text-to-Speech (TTS): Generating spoken language from written text.

Speech Classification: Identifying and categorizing speech based on its content or emotional tone.

The Spirit LM Expressive model goes a step further by incorporating emotional cues into its speech generation.

For instance, it can detect and reflect emotional states like anger, surprise, or joy in its output, making the interaction with AI more human-like and engaging.

This has significant implications for applications like virtual assistants, customer service bots, and other interactive AI systems where more nuanced and expressive communication is essential.


A broader effort​


Meta Spirit LM is part of a broader set of research tools and models that Meta FAIR is releasing to the public. This includes an update to Meta’s Segment Anything Model 2.1 (SAM 2.1) for image and video segmentation, which has been used across disciplines like medical imaging and meteorology, and research on enhancing the efficiency of large language models.

Meta’s overarching goal is to achieve advanced machine intelligence (AMI), with an emphasis on developing AI systems that are both powerful and accessible.

The FAIR team has been sharing its research for more than a decade, aiming to advance AI in a way that benefits not just the tech community, but society as a whole. Spirit LM is a key component of this effort, supporting open science and reproducibility while pushing the boundaries of what AI can achieve in natural language processing.


What’s next for Spirit LM?​


With the release of Meta Spirit LM, Meta is taking a significant step forward in the integration of speech and text in AI systems.

By offering a more natural and expressive approach to AI-generated speech, and making the model open-source, Meta is enabling the broader research community to explore new possibilities for multimodal AI applications.

Whether in ASR, TTS, or beyond, Spirit LM represents a promising advance in the field of machine learning, with the potential to power a new generation of more human-like AI interactions.
 

bnew


1/6
@OdinLovis
If ComfyUI is complicated for you, I think @KaiberAI has created an amazing tool called SuperStudio.

It's clearly the new Figma of AI: it uses a canvas that lets you iterate and share ideas quickly. And what's super nice is that they integrate Flux, Kling, Runway, and much more, all in one place.

I will make a canvas from it and share it soon!

You can test it here: Kaiber



https://video.twimg.com/ext_tw_video/1846957497812221952/pu/vid/avc1/1280x720/Umv9xfMkxEbu1TLm.mp4

2/6
@NNaumovsky
Forgot to mention the small detail that, in comparison to ComfyUI, it's not open-source software and it's definitely not free. There are many great and free tools for image generation and lately even for video generation locally (Pyramid Flow, for example).



3/6
@KaiberAI
you get it 🙂 can't wait to see what you generate



4/6
@bkvenn
Can’t wait to see how you cook @OdinLovis



5/6
@mr_flaneur_
Terrible UX



6/6
@tabletennisrule
I need a tutorial! It's confusing.




 

bnew





1/11
@OdinLovis
Generate image, 3D, and video from a single sketch in one click in ComfyUI with AI!

I’m excited to share a cutting-edge workflow I’ve developed that combines, inside ComfyUI: Runway via the fal.ai API (yes, there is a node for it :smile: ), Stability AI's Stable Fast 3D, the powerful Flux, ControlNet, IPAdapter, and the Gemini LLM. This innovation allows you to generate high-quality visuals and 3D models, reposition objects, and create videos—completely automated from a simple sketch input.

Would you like a tutorial for it?

#AI #ComfyUI #3DModeling #Automation #CreativeTech



https://video.twimg.com/ext_tw_video/1843381274720804864/pu/vid/avc1/780x720/erFoQQHs8ROf1z03.mp4

2/11
@OdinLovis
how the mess look for now haha



GZUC7cxWMAAB3Kl.jpg


3/11
@matgee01
Good stuff, need a way to use a 3d model/viewport to pick a camera angle to export to image input



4/11
@OdinLovis
its what i am doing here, picking a angle that I want for the final generation



5/11
@PaulFidika
So RunwayML Gen 3 is the model producing the video? But their API isn't public yet so it's going through Fal-ai first?

You're using the 3d model to produce start + end frames to guide Gen 3's video generation?

Thanks for sharing; great stuff!



6/11
@OdinLovis
For Runway fal api connections I am using the node of @gokayfem , its pretty Amazing node , not famous enough yet ! GitHub - gokayfem/ComfyUI-fal-API: Custom nodes for using fal API. Video generation with Kling, Runway, Luma. Image generation with Flux. LLMs and VLMs OpenAI, Claude, Llama and Gemini.



7/11
@BennyKokMusic
Let me help you deploy this 😂



8/11
@bchewyme
Tutorial! Pronto



9/11
@AI_Car_Pool
That’s kind of cool.. app’afy it



10/11
@byarlooo
Great! Super cool! 👍



11/11
@nicolasmariar
That's amazing I want to try it




 

bnew






Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant​


Published on Oct 20
· Submitted by Hoang Ha on Oct 22

Abstract​


Large Language Models (LLMs) have revolutionized natural language processing, but their application to speech-based tasks remains challenging due to the complexities of integrating audio and text modalities. This paper introduces Ichigo, a mixed-modal model that seamlessly processes interleaved sequences of speech and text. Utilizing a tokenized early-fusion approach, Ichigo quantizes speech into discrete tokens and employs a uniform transformer-based architecture for both speech and text modalities. This method enables joint reasoning and generation across modalities without the need for separate adapters. We present a comprehensive training methodology, including pre-training on multilingual speech recognition datasets and fine-tuning on a curated instruction dataset. Ichigo demonstrates state-of-the-art performance on speech question-answering benchmarks, outperforming existing open-source speech language models and achieving comparable results to cascaded systems. Notably, Ichigo exhibits a latency of just 111 ms to first token generation, significantly lower than current models. Our approach not only advances the field of multimodal AI but also provides a framework for smaller research teams to contribute effectively to open-source speech-language models.
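
As a rough, hypothetical illustration of the tokenized early-fusion idea (names and sizes are placeholders, not the paper's code), speech units are mapped into the same vocabulary as text tokens so one transformer can model the interleaved sequence:

```python
# Hypothetical illustration of tokenized early fusion: speech is quantized
# into discrete unit IDs that share one vocabulary with text tokens, so a
# single transformer models the interleaved sequence end to end.

TEXT_VOCAB_SIZE = 32_000        # placeholder sizes
NUM_SPEECH_UNITS = 1_024        # discrete acoustic units from the quantizer

def speech_unit_to_token_id(unit: int) -> int:
    # Speech units are appended after the text vocabulary.
    return TEXT_VOCAB_SIZE + unit

def build_interleaved_sequence(text_ids, speech_units):
    # e.g. "<user speech> ... <assistant text>" as one flat sequence;
    # special boundary tokens would mark modality switches in practice.
    return [speech_unit_to_token_id(u) for u in speech_units] + list(text_ids)

sequence = build_interleaved_sequence(
    text_ids=[17, 204, 9],          # toy text token IDs
    speech_units=[5, 87, 87, 640],  # toy quantized speech units
)
print(sequence)  # a single stream the transformer can be trained on
```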



1/1
🏷️:Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant

🔗:https://arxiv.org/pdf/2410.15316.pdf




GamntG5aMAAoXJ-.jpg
 

bnew


1/2
@Prashant_1722
Can you detect the type of plane?

Most people can't, including OpenAI's ChatGPT. But guess what: Claude guessed it right on the first attempt, with valid reasoning behind it. Seems like Anthropic trained its model well.

[Quoted tweet]
The ability of multimodal AI to “understand” images is underrated.

I just took these. Given the first photo Claude guesses where I am. Given the second it identifies the type of plane. These aren’t obvious.


GamUpjwaQAAoOaN.jpg

GamQ4YubgAAJjX1.jpg

GamQ4Y1a0AAtc9r.jpg

GamQ4Y0bUAAT7-T.jpg

GamQ4YyaAAAWdQd.jpg


2/2
@noahbyteart
I've worked with Claude and I'm consistently impressed by its ability to discern subtle details. Guessing the plane type with valid reasoning is quite a feat. Props to Anthropic's training model
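
For anyone who wants to reproduce this kind of image question, a minimal sketch with the Anthropic Python SDK follows; the model name and file path are placeholders:

```python
# Sketch: ask Claude to identify an aircraft from a photo via the Messages API.
# Model name and file path are placeholders.
import base64
import anthropic

with open("plane.jpg", "rb") as f:
    image_b64 = base64.standard_b64encode(f.read()).decode("utf-8")

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=512,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {
                    "type": "base64",
                    "media_type": "image/jpeg",
                    "data": image_b64,
                },
            },
            {"type": "text", "text": "What type of plane is this, and why do you think so?"},
        ],
    }],
)
print(message.content[0].text)
```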




 