Large Language Models News & Discussions

bnew · Sep 25, 2024

Fake AI “podcasters” are reviewing my book and it’s freaking me out

NotebookLM's "Audio Summaries" show a more personable future for AI-generated content.

arstechnica.com

Fake AI “podcasters” are reviewing my book and it’s freaking me out

NotebookLM's "Audio Summaries" show a more personable future for AI-generated content.

Kyle Orland - 9/23/2024, 11:40 AM

Hey, welcome back to Talkin'<em>Minesweeper</em>, the podcast where AI hosts discuss a book about <em>Minesweeper</em>!

Enlarge / Hey, welcome back to "Talkin'Minesweeper," the podcast where AI hosts discuss a book about Minesweeper!

Aurich Lawson | Boss Fight Books
66

Hey! Listen!

https://cdn.arstechnica.net/wp-content/uploads/2024/09/Minesweeper-AI-discussion.wav?_=1

Listen to NotebookLM's 12.5-minute summary of my Minesweeper book using the player above.

Google's NotebookLM launched over a year ago as "a virtual research assistant that can summarize facts, explain complex ideas, and brainstorm new connections—all based on the sources you select." Just last week, though, Google added the new "Audio Overview" feature that it's selling as "a new way to turn your documents into engaging audio discussions."

Google doesn't use the word "podcast" anywhere in that announcement, instead talking up audio creations that "summarize your material, make connections between topics, and banter back and forth." But Wharton AI professor Ethan Mollick correctly referred to the style as a "podcast" in a recent social media post sharing a NotebookLM Audio Overview of his book. Mollick called these Audio Summaries "the current best 'wow this is amazing & useful' demo of AI" and "unnerving, too," and we agree on both counts.

Inspired by Mollick's post, I decided to feed my own book into NotebookLM to see what its virtual "podcasters" would make of 30,000 or so words about '90s Windows gaming classic Minesweeper (believe it or not, I could have written much more). Just a few minutes later, I was experiencing a reasonable facsimile of what it would be like if I was featured on NPR's Pop Culture Happy Hour or a similar banter-filled podcast.

Just the facts?

NotebookLM's summary hits on all the book's major sections: the pre-history of the games that inspired Minesweeper; the uphill battle for the Windows Entertainment Pack at a business-focused Microsoft of the '90s; the moral panic over the game's pre-installation on millions of business and government computers; and the surprising cheating controversies that surrounded the game's competitive scene.

Why <a href=https://bossfightbooks.com/products/minesweeper-by-kyle-orland?srsltid=AfmBOookKnnu3mqj63xFEHPhWjXQRLplFphQtE_CAAh-F4BTmsjCAR3D>read ~30,000 words about <em>Minesweeper</em></a> when you can listen to two fake people banter for a few minutes instead?

Enlarge / Why read ~30,000 words about Minesweeper when you can listen to two fake people banter for a few minutes instead?

Boss Fight Books

Sure, I could quibble about which specific bits the summary decided to focus on and/or leave out (maybe feeding different chapters individually would have led to more detail in the collected summaries). But anyone listening to this "podcast" would get the same general overview of my book that they would listening to one of the many actual podcasts that I did after the book launched.

While there weren't any full-blown, whole-cloth hallucinations in NotebookLM's summary "podcast," there were a few points where it got small details wrong or made assumptions that weren't supported in the text. Discussing Minesweeper predecessor Mined-Out, for instance, NotebookLM's audio summary says, "So this is where those squares and flags start to come into play..." even though Mined-Out had neither feature.

Then there's the portion where the summary-cast mentions a senator who called Minesweeper "a menace to the republic," repeating the quote for emphasis. That definitely captures the spirit of Senator Lauch Faircloth's tirade against Minesweeper and other games being pre-installed on government computers. In the "podcast" context, though, it sounds like the voices are putting words in Faircloth's mouth by sharing a direct quote.

Small, overzealous errors like these—and a few key bits of the book left out of the podcast entirely—would give me pause if I were trying to use a NotebookLM summary as the basis for a scholarly article or piece of journalism. But I could see using a summary like this to get some quick Cliff's Notes-style grounding on a thick tome I didn't have the time or inclination to read fully. And, unlike poring through Cliff's Notes, the pithy, podcast-style format would actually make for enjoyable background noise while out on a walk or running errands.

It’s all in the delivery

It's that naturalistic, bantering presentation that makes NotebookLM's new feature stand out from other AI products that generate capable text summaries. I felt like I was eavesdropping on two people who just happened to be discussing my book in a cafe, except those people don't actually exist (and were probably algorithmically designed to praise the book).

Enlarge / Is this thing on?

Getty Images

Right from the start, I was tickled by the way one "podcast host" described the book as a tale from "the land of floppy disks and dial-up modems" (a phrase I did not use in the book). That same voice goes on to tease "a bit of Bill Gates sneaking around the Microsoft office," up front, hinting at my absolute favorite anecdote from the book before fully exploring it later in the summary.

When they do get to that anecdote, the fake podcast hosts segue in with what feels like a natural conversational structure:

Voice 1: It's hard to deny the impact of something when your own CEO is secretly hooked.

Voice 2: Wait, are we talking about Bill Gates?

The back-and-forth style of the two-person "podcast" format allows for some entertaining digressions from the main point of the book, too. When discussing the wormy movie-star damsel-in-distress featured in Minesweeper predecessor Mined-Out, for instance, the AI summarizers seem to get a little distracted:

Voice 1: I have to ask, what kind of movies does a worm even star in?

Voice 2: I'm afraid that detail has been lost to the sands of gaming history.

Then there's the casual way the two "hosts" bring up the improved versions of Minesweeper that were crafted to fix problems with Microsoft's original:

Voice 1: So eventually the community came up with a more elegant solution.

Voice 2: Let me guess. They created a new version of Minesweeper.

Voice 1: Exactly.

Voice 2: Called it a day on the old one.

The two-person format helps foster a gentle, easy rhythm to the presentation of dense information, with natural-sounding pauses and repetition that help emphasize key points. When one ersatz podcaster talks about the phenomenon of "this incredibly addictive puzzle game [being] pre-installed on practically every computer," for instance, the other voice can answer back with the phrase "on every computer" with just the right amount of probing interest. Or when one AI voice intones that "it was discovered that the original Minesweeper had a flaw in how it generated random boards," the other voice jumps in and exclaims "A flaw!" with pitch-perfect timing and a sense of surprise.

Wait, are we talking about Bill Gates?

NotebookLM podcast voice

There are some problems with this back-and-forth style, though. For one, both voices seem to alternate between the "I read the book" role and the "I'm surprised at these book facts you're sharing" role, making it hard to feel like either one is genuine. For another, the sheer volume of surprised reactions (a partial sample: "What? No! Wooooow! You're kidding! No way! You're blowing my mind here!") can get a little grating. And then there are the sentences that pause at the wrong points or the bits of laughter that feel like an editor chopped them off prematurely.

Still, when one fake podcast voice cooed, "Oh, do tell!" in response to the idea of controversy in competitive Minesweeper, it set off the same parasocial relationship buttons that a good, authentic podcast can (while also effectively flattering my sense of authorial ego).

After listening to NotebookLM's summary of my own book, I can easily envision a near future where these "fake" podcasts become a part of my real podcast diet, especially for books or topics that are unlikely to get professional interest from human podcasters. By repackaging generative AI text into a "just two people chatting" format, Google has put a much more amiable face on what can sometimes seem like a dehumanizing technology.

bnew · Sep 25, 2024

1/21
@AIatMeta

Introducing Llama 3.2: Lightweight models for edge devices, vision models and more!

What’s new?
• Llama 3.2 1B & 3B models deliver state-of-the-art capabilities for their class for several on-device use cases — with support for @Arm, @MediaTek & @Qualcomm on day one.
• Llama 3.2 11B & 90B vision models deliver performance competitive with leading closed models — and can be used as drop-in replacements for Llama 3.1 8B & 70B.
• New Llama Guard models to support multimodal use cases and edge deployments.
• The first official distro of Llama Stack simplifies and supercharges the way developers & enterprises can build around Llama to support agentic applications and more.

Details in the full announcement

Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
Download Llama 3.2 models

Llama 3.2

These models are available to download now directly from Meta and @HuggingFace — and will be available across offerings from 25+ partners that are rolling out starting today, including @accenture, @awscloud, @AMD, @azure, @Databricks, @Dell, @Deloitte, @FireworksAI_HQ, @GoogleCloud, @GroqInc, @IBMwatsonx, @Infosys, @Intel, @kaggle, @NVIDIA, @OracleCloud, @PwC, @scale_AI, @snowflakeDB, @togethercompute and more.

With Llama 3.2 we’re making it possible to run Llama in even more places, with even more flexible capabilities. We’ve said it before and we’ll say it again: open source AI is how we ensure that these innovations reflect the global community they’re built for and benefit everyone. We’re continuing our drive to make open source the standard with Llama 3.2.

2/21
@reach_vb
Congrats on the release! I’m a huge fan of your commitment to open science and weights!

Thanks for the vision and an-device goodies:

Llama 3.2 - a meta-llama Collection

3/21
@ai_for_success
Congratulations this is huge.

4/21
@togethercompute

We love that Llama has gone multimodal! We're excited to partner with @AIatMeta to offer free access to the Llama 3.2 11B vision model for developers. Can't wait to see what everyone builds!

Try now with our Llama-Vision-Free model endpoint.

Sign up here: https://api.together.ai/playground/chat/meta-llama/Llama-Vision-Free

5/21
@Saboo_Shubham_
@ollama and @LMStudioAI time to go!!

6/21
@ollama
Let's go!!!! Open-source AI!

7/21
@joinnatural
GG

8/21
@borsavada
It is a real pity that Llama 3.2 is not available and accessible in Turkey. Restricting access to such innovative technologies can cause developers and researchers in Turkey to miss important opportunities.

Given the rapid developments in the field of artificial intelligence, it is crucial that our country is able to closely follow the advances in this field and utilize these technologies. Advanced language models such as Llama 3.2 can accelerate innovation and increase productivity in many sectors.

This may be due to license agreements, legal regulations or technical infrastructure issues. But whatever the reason, such obstacles need to be overcome to ensure that Turkey does not fall behind in the global AI race.

It is critical that policymakers, technology companies and academic institutions collaborate to ensure that Turkey has access to the latest AI technologies and strengthen the local ecosystem in this field. In this way, Turkey can become not only a consumer but also a producer and innovation center in the field of artificial intelligence.

9/21
@swissyp_
let's get these on-chain on /search?q=#ICP!! Llama 3.2 1B & 3B

cc: @icpp_pro

10/21
@AMD
AMD welcomes the latest Llama 3.2 release from Meta. We're excited to share how our collaboration with Meta is enabling developers with Day-0 support. Llama 3.2 and AMD: Optimal Performance from Cloud to Edge and AI PCs

11/21
@basetenco
We're excited to bring dedicated deployments of these new Llama 3.2 models to our customers!

90B vision looks especially powerful — congrats to the entire Llama team!

12/21
@janvivaddison
Congratulations that was amazing

13/21
@Ming_Chun_Lee
"open source AI is how we ensure that these innovations reflect the global community they’re built for and benefit everyone."

This is also how we can ensure everyone can help to build and advance AI together with the same goal.

Very important.

14/21
@dhruv2038
Congrats to @AIatMeta for being this open.

15/21
@FirstBatchAI
Thank you for helping us build better for edge!

16/21
@testingcatalog
“Linux of AI”

17/21
@ryanraysr
Awesome! Looking forward to digging it!

18/21
@Neeraj_Kumar222
I am excited to see the new capabilities of Llama 3.2 models, especially the support for edge devices and the competitive performance of the vision models. The expanded partner ecosystem and commitment to open-source AI are great to see. Looking forward to exploring the potential applications of these advancements.

19/21
@JonathanRoseD
I've been trying out 3.2 3b on my phone and it's been super for a micro model! I just highly recommend using a temperature at 0.2 and no higher (hallucination). Amazing work!!

20/21
@CamelAIOrg
Congratulations, we are excited to test these out within our framework!

21/21
@philip_kiely
Excited to compare 1B and 3B to the Qwen 2.5 small language models — have been really enjoying Qwen 2.5 1.5B

Llama 1B will be especially useful as a small example model in all of the documentation that I write!

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 25, 2024

1/1

BREAKING

Llama 3.2 multimodal is here and 90B outperforms GPT-4o-mini and Claude Haiku in different benchmarks.

» Lightweight - 1B and 3B
» Multimodal - 11B and 90B

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
Together is serving Llama 3.2 vision for free - have fun!

[Quoted tweet]

Big news! We’re thrilled to announce the launch of Llama 3.2 Vision Models & Llama Stack on Together AI.

Free access to Llama 3.2 Vision Model for developers to build and innovate with open source AI. api.together.ai/playground/c…

Learn more in the blog together.ai/blog/llama-3-2-v…

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/12
@nutlope
Announcing Napkins.dev – Screenshot to code!

An open source wireframe to app tool powered by Llama 3.2 vision. Upload a screenshot of a simple site/design & get code.

100% free and open source.

2/12
@nutlope
Here's the GitHub repo! Also, shoutout to @YoussefUiUx for the great design.

GitHub - Nutlope/napkins: napkins.dev – from screenshot to app

3/12
@nutlope
Tech Stack:

◆ @togethercompute's inference (AI API)
◆ @AIatMeta's Llama 3.2 Vision models
◆ @AIatMeta's Llama 3.1 405B for the LLM
◆ @codesandbox's sandpack for the sandbox
◆ @nextjs w/ tailwind & typescript
◆ @helicone_ai for AI observability
◆ @PlausibleHQ for analytics
◆ @awscloud's S3 for file uploads

4/12
@nutlope
How it works:

I ask the Llama 3.2 vision models to describe whatever screenshot the user uploaded, then pass it to Llama 3.1 405B to actually code it.

It's fairly limited in what it can handle right now – best for simple UI sketches!

5/12
@nutlope
Launched this as part of us at Together AI supporting the new Llama 3.2 models (including vision). Check it out!

[Quoted tweet]

Big news! We’re thrilled to announce the launch of Llama 3.2 Vision Models & Llama Stack on Together AI.

Free access to Llama 3.2 Vision Model for developers to build and innovate with open source AI. api.together.ai/playground/c…

Learn more in the blog together.ai/blog/llama-3-2-v…

6/12
@nutlope
Check out the app here!

Napkins.dev – Screenshot to code

7/12
@LM_22
It would be nice to change it as OCR and extracting particular data out of pictures, invoices, packing list, delivery notes, and structure it in json or csv for handover to agent

8/12
@nutlope
Agreed! It has a lot of really cool use cases and I'm planning to do one with receipts potentially

9/12
@olanetsoft
Wow

10/12
@nutlope
Still limited to fairly simple designs but gonna work to make it better!

11/12
@DamiDina
Fire

12/12
@nutlope
Thanks Dami!

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/6
Llama 3.2 is available on Ollama! It's lightweight and multimodal! It's so fast and good!

Try it:

1B
ollama run llama3.2:1b

3B
ollama run llama3.2

vision models are coming very soon!

llama3.2

[Quoted tweet]

Introducing Llama 3.2: Lightweight models for edge devices, vision models and more!

What’s new?
• Llama 3.2 1B & 3B models deliver state-of-the-art capabilities for their class for several on-device use cases — with support for @Arm, @MediaTek & @Qualcomm on day one.
• Llama 3.2 11B & 90B vision models deliver performance competitive with leading closed models — and can be used as drop-in replacements for Llama 3.1 8B & 70B.
• New Llama Guard models to support multimodal use cases and edge deployments.
• The first official distro of Llama Stack simplifies and supercharges the way developers & enterprises can build around Llama to support agentic applications and more.

Details in the full announcement

go.fb.me/229ug4
Download Llama 3.2 models

go.fb.me/w63yfd

These models are available to download now directly from Meta and @HuggingFace — and will be available across offerings from 25+ partners that are rolling out starting today, including @accenture, @awscloud, @AMD, @azure, @Databricks, @Dell, @Deloitte, @FireworksAI_HQ, @GoogleCloud, @GroqInc, @IBMwatsonx, @Infosys, @Intel, @kaggle, @NVIDIA, @OracleCloud, @PwC, @scale_AI, @snowflakeDB, @togethercompute and more.

With Llama 3.2 we’re making it possible to run Llama in even more places, with even more flexible capabilities. We’ve said it before and we’ll say it again: open source AI is how we ensure that these innovations reflect the global community they’re built for and benefit everyone. We’re continuing our drive to make open source the standard with Llama 3.2.

2/6
Amazing!!

3/6
lightweight and multimodal!

4/6

5/6
Amazing! Is this the 3B or 1B?

6/6

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/4
@awnihannun
Llama 3.2 1B in 4-bit generates at ~350 (!) toks/sec with MLX on an M2 Ultra. This is interesting.

Command: mlx_lm.generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --prompt "Write a story about Einstein" --temp 0.0 --max-tokens 512

Not sped up:

2/4
@MemoSparkfield
Wow!

3/4
@ivanfioravanti
WOW! That’s ultra fast!

4/4
@shareastronomy
Is the MLX version of the model on Hugging Face already?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 25, 2024

bnew · Sep 28, 2024

1/3
@rohanpaul_ai
Really

new Paper, MINI-SEQUENCE TRANSFORMER claims to extend the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.

MST enables efficient long-sequence training by reducing intermediate memory overhead

It achieves 2.7x improvement in perplexity with 30k context vs 8k baseline on LongAlpaca dataset

**Original Problem**

:

Training large language models with long sequences is limited by memory constraints, particularly due to large intermediate values in MLP and LM-Head blocks.

-----

**Solution in this Paper**

:

• MINI-SEQUENCE TRANSFORMER (MST) partitions input sequences into M mini-sequences
• Processes mini-sequences iteratively to reduce intermediate memory usage
• Applies to MLP and LM-Head blocks, compatible with various attention mechanisms
• Integrates with activation recomputation for further memory savings
• Chunk-based implementation optimizes performance for small sequences
• Extends to distributed training settings

-----

**Key Insights from this Paper**

:

• Intermediate values in MLP and LM-Head blocks consume significantly more memory than inputs/outputs
• Partitioning input sequences into mini-sequences can reduce intermediate memory usage
• MST is compatible with existing optimization techniques like activation recomputation
• Optimal mini-sequence size depends on model architecture and sequence length

-----

**Results**

:

• Enables training of Llama3-8B with 60k context length (12x longer than standard)
• Maintains comparable throughput to standard implementation
• Reduces peak memory usage by 30% compared to standard transformer
• Scales linearly with number of GPUs in distributed settings

2/3
@rohanpaul_ai
Part (a) shows the conventional Transformer architecture with MLP and LM-Head blocks processing full sequence length S.

Part (b) illustrates the proposed MST, which splits the input sequence S into M mini-sequences of length S/M (M=2 in this example).

3/3
@rohanpaul_ai
Memory consumption of pre-training Llama3-8B and Gemma2-9B models with a batch size of 1 on a single A100 device, with activation recomputation and MST.

https://arxiv.org/pdf/2407.15892

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Sep 28, 2024

1/1
Chat with videos using LLaVA-NeXT-Video, VideoLLaMA2, Video-LLaVA

[Quoted tweet]

Exciting News!

We’re thrilled to announce the release of the new Video Arena

tab in our WildVision-Arena! You can now chat with videos using LLaVA-NeXT-Video, VideoLLaMA2, Video-LLaVA, with more VideoLLMs coming soon.

Huge thanks to the amazing team for their hard work!

Special shoutout to @DongfuJiang, Yingzi Ma, @WenhuChen, @WilliamWangNLP , @YejinChoinka, @billyuchenlin, and the entire team.

Demo

: huggingface.co/spaces/WildVi…

Other resources

WildVision-Bench: huggingface.co/datasets/Wild…
WildVision-Chat: huggingface.co/datasets/Wild…
Paper: arxiv.org/abs/2406.11069
Github: github.com/orgs/WildVision-A…

#WildVisionArena #VideoArena #Video-LLM

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/2

Exciting News!

We’re thrilled to announce the release of the new Video Arena

tab in our WildVision-Arena! You can now chat with videos using LLaVA-NeXT-Video, VideoLLaMA2, Video-LLaVA, with more VideoLLMs coming soon.

Huge thanks to the amazing team for their hard work!

Special shoutout to @DongfuJiang, Yingzi Ma, @WenhuChen, @WilliamWangNLP , @YejinChoinka, @billyuchenlin, and the entire team.

Demo

: Vision Arena (Testing VLMs side-by-side) - a Hugging Face Space by WildVision

Other resources

WildVision-Bench: WildVision/wildvision-bench · Datasets at Hugging Face
WildVision-Chat: WildVision/wildvision-chat · Datasets at Hugging Face
Paper: [2406.11069] WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
Github: WildVision-AI

/search?q=#WildVisionArena /search?q=#VideoArena /search?q=#Video-LLM

2/2
Thanks for adding this model! It’s now serving in the Arena. Try it out.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Oct 1, 2024

1/2
Starting this week, Advanced Voice is rolling out to all ChatGPT Enterprise, Edu, and Team users globally. Free users will also get a sneak peek of Advanced Voice.

Plus and Free users in the EU…we’ll keep you updated, we promise.

2/2
To access Advanced Voice, remember to download the latest version of the ChatGPT app.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/11
@OpenAI
Advanced Voice is rolling out to all Plus and Team users in the ChatGPT app over the course of the week.

While you’ve been patiently waiting, we’ve added Custom Instructions, Memory, five new voices, and improved accents.

It can also say “Sorry I’m late” in over 50 languages.

2/11
@OpenAI
If you are a Plus or Team user, you will see a notification in the app when you have access to Advanced Voice.

3/11
@OpenAI
Meet the five new voices.

4/11
@OpenAI
Set Custom Instructions for Advanced Voice.

5/11
@OpenAI
We’ve also improved conversational speed, smoothness, and accents in select foreign languages.

6/11
@OpenAI
Advanced Voice is not yet available in the EU, the UK, Switzerland, Iceland, Norway, and Liechtenstein.

7/11
@GozuMighty
voice.gpt.eth

8/11
@spffspcmn
We need that Her voice back. Juniper just doesn't cut it for me.

9/11
@Maik_Busse
If your from EU and don't have access please like for confirmation

10/11
@reach_vb
Congrats on shipping Advanced Voice Mode! At the same time I’m quite happy to see Open Source catching up:

Moshi v0.1 Release - a kyutai Collection

11/11
@ai_for_success
My meme was correct

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Oct 1, 2024

MIT spinoff Liquid debuts non-transformer AI models and they’re already state-of-the-art

The startup from MIT's CSAIL says its Liquid Foundation Models have smaller memory needs thanks to a post-transformer architecture.

venturebeat.com

MIT spinoff Liquid debuts non-transformer AI models and they’re already state-of-the-art

Carl Franzen @carlfranzen

September 30, 2024 2:16 PM

Liquid ripples over the surface of glowing blue and purple circuitry against a black backdrop

Credit: VentureBeat made with OpenAI ChatGPT

Liquid AI, a startup co-founded by former researchers from the Massachusetts Institute of Technology (MIT)’s Computer Science and Artificial Intelligence Laboratory (CSAIL), has announced the debut of its first multimodal AI models: the “Liquid Foundation Models (LFMs).”

Unlike most others of the current generative AI wave, these models are not based around the transformer architecture outlined in the seminal 2017 paper “Attention Is All You Need.”

Instead, Liquid states that its goal “is to explore ways to build foundation models beyond Generative Pre-trained Transformers (GPTs)” and with the new LFMs, specifically building from “first principles…the same way engineers built engines, cars, and airplanes.”

It seems they’ve done just that — as the new LFM models already boast superior performance to other transformer-based ones of comparable size such as Meta’s Llama 3.1-8B and Microsoft’s Phi-3.5 3.8B.

Liquid’s LFMs currently come in three different sizes and variants:

LFM 1.3B (smallest)
LFM 3B
LFM 40B MoE (largest, a “Mixture-of-Experts” model similar to Mistral’s Mixtral)

The “B” in their name stands for billion and refers the number of parameters — or settings — that govern the model’s information processing, analysis, and output generation. Generally, models with a higher number of parameters are more capable across a wider range of tasks.

Screenshot-2024-09-30-at-5.03.53%E2%80%AFPM.png

Already, Liquid AI says the LFM 1.3B version outperforms Meta’s new Llama 3.2-1.2B and Microsoft’s Phi-1.5 on many leading third-party benchmarks including the popular Massive Multitask Language Understanding (MMLU) consisting of 57 problems across science, tech, engineering and math (STEM) fields, “the first time a non-GPT architecture significantly outperforms transformer-based models.”

All three are designed to offer state-of-the-art performance while optimizing for memory efficiency, with Liquid’s LFM-3B requiring only 16 GB of memory compared to the more than 48 GB required by Meta’s Llama-3.2-3B model (shown in the chart above).

Maxime Labonne, Head of Post-Training at Liquid AI, took to his account on X to say the LFMs were “the proudest release of my career :smile:

” and to clarify that the core advantage of LFMs: their ability to outperform transformer-based models while using significantly less memory.

This is the proudest release of my career

At @LiquidAI_, we're launching three LLMs (1B, 3B, 40B MoE) with SOTA performance, based on a custom architecture.

Minimal memory footprint & efficient inference bring long context tasks to edge devices for the first time! pic.twitter.com/v9DelExyTa

— Maxime Labonne (@maximelabonne) September 30, 2024

The models are engineered to be competitive not only on raw performance benchmarks but also in terms of operational efficiency, making them ideal for a variety of use cases, from enterprise-level applications specifically in the fields of financial services, biotechnology, and consumer electronics, to deployment on edge devices.

However, importantly for prospective users and customers, the models are not open source. Instead, users will need to access them through Liquid’s inference playground, Lambda Chat, or Perplexity AI.

How Liquid is going ‘beyond’ the generative pre-trained transformer (GPT)

In this case, Liquid says it used a blend of “computational units deeply rooted in the theory of dynamical systems, signal processing, and numerical linear algebra,” and that the result is “general-purpose AI models that can be used to model any kind of sequential data, including video, audio, text, time series, and signals” to train its new LFMs.

Last year, VentureBeat covered more about Liquid’s approach to training post-transformer AI models, noting at the time that it was using Liquid Neural Networks (LNNs), an architecture developer at CSAIL that seeks to make the artificial “neurons” or nodes for transforming data, more efficient and adaptable.

Unlike traditional deep learning models, which require thousands of neurons to perform complex tasks, LNNs demonstrated that fewer neurons—combined with innovative mathematical formulations—could achieve the same results.

Liquid AI’s new models retain the core benefits of this adaptability, allowing for real-time adjustments during inference without the computational overhead associated with traditional models, handling up to 1 million tokens efficiently, while keeping memory usage to a minimum.

A chart from the Liquid blog shows that the LFM-3B model, for instance, outperforms popular models like Google’s Gemma-2, Microsoft’s Phi-3, and Meta’s Llama-3.2 in terms of inference memory footprint, especially as token length scales.

While other models experience a sharp increase in memory usage for long-context processing, LFM-3B maintains a significantly smaller footprint, making it highly suitable for applications requiring large volumes of sequential data processing, such as document analysis or chatbots.

Liquid AI has built its foundation models to be versatile across multiple data modalities, including audio, video, and text.

With this multimodal capability, Liquid aims to address a wide range of industry-specific challenges, from financial services to biotechnology and consumer electronics.

Accepting invitations for launch event and eyeing future improvements

Liquid AI says it is is optimizing its models for deployment on hardware from NVIDIA, AMD, Apple, Qualcomm, and Cerebras.

While the models are still in the preview phase, Liquid AI invites early adopters and developers to test the models and provide feedback.

Labonne noted that while things are “not perfect,” the feedback received during this phase will help the team refine their offerings in preparation for a full launch event on October 23, 2024, at MIT’s Kresge Auditorium in Cambridge, MA. The company is accepting RSVPs for attendees of that event in-person here.

As part of its commitment to transparency and scientific progress, Liquid says it will release a series of technical blog posts leading up to the product launch event.

The company also plans to engage in red-teaming efforts, encouraging users to test the limits of their models to improve future iterations.

With the introduction of Liquid Foundation Models, Liquid AI is positioning itself as a key player in the foundation model space. By combining state-of-the-art performance with unprecedented memory efficiency, LFMs offer a compelling alternative to traditional transformer-based models.

bnew · Oct 1, 2024

Ai2’s new Molmo open source AI models beat GPT-4o, Claude on some benchmarks

The models are not only high-performing but also entirely open, allowing researchers and developers to access and build upon them.

venturebeat.com

Ai2’s new Molmo open source AI models beat GPT-4o, Claude on some benchmarks

Carl Franzen@carlfranzen

September 25, 2024 2:48 PM

Silhouette of man typing on laptop against purple orange code backdrop standing on curving planet surface

Credit: VentureBeat made with Midjourney

The Allen Institute for AI (Ai2) today unveiled Molmo, an open-source family of state-of-the-art multimodal AI models which outpeform top proprietary rivals including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 on several third-party benchmarks.

Being multimodal, the models can therefore accept and analyze imagery and files — similar to the leading proprietary foundation models.

Yet, Ai2 also noted in a post on X that Molmo uses “1000x less data” than the proprietary rivals — thanks to some clever new training techniques described in greater detail below and in a technical report paper published by the Paul Allen-founded and Ali Farhadi-led company.

Ai2 also posted a video to YouTube and its social accounts showing how Molmo can be used on a smartphone to rapidly analyze what’s in front of the user — by having them snap a photo and send it to the AI. In less than a second, it can count the number of people in a scene, discern whether a menu item is vegan, analyze flyers tapped to a lamppost and determine which bands are electronic music, and even take and convert handwritten notes on a whiteboard into a table.

Ai2 says the release underscores its commitment to open research by offering high-performing models, complete with open weights and data, to the broader community — and of course, companies looking for solutions they can completely own, control, and customize.

It comes on the heels of Ai2’s release two weeks ago of another open model, OLMoE, which is a “mixture of experts” or combination of smaller models designed for cost effectiveness.

Closing the Gap Between Open and Proprietary AI

Molmo consists of four main models of different parameter sizes and capabilities:

Molmo-72B (72 billion parameters, or settings — the flagship model, based on based on Alibaba Cloud’s Qwen2-72B open source model)
Molmo-7B-D (“demo model” based on Alibaba’s Qwen2-7B model)
Molmo-7B-O (based on Ai2’s OLMo-7B model)
MolmoE-1B (based on OLMoE-1B-7B mixture-of-experts LLM, and which Ai2 says “nearly matches the performance of GPT-4V on both academic benchmarks and user preference.”)

These models achieve high performance across a range of third-party benchmarks, outpacing many proprietary alternatives. And they’re all available under permissive Apache 2.0 licenses, enabling virtually any sorts of usages for research and commercialization (e.g. enterprise grade).

Notably, Molmo-72B leads the pack in academic evaluations, achieving the highest score on 11 key benchmarks and ranking second in user preference, closely following GPT-4o.

Vaibhav Srivastav, a machine learning developer advocate engineer at AI code repository company Hugging Face, commented on the release on X, highlighting that Molmo offers a formidable alternative to closed systems, setting a new standard for open multimodal AI.

Molmo by @allen_ai – Open source SoTA Multimodal (Vision) Language model, beating Claude 3.5 Sonnet, GPT4V and comparable to GPT4o ?

They release four model checkpoints:

1. MolmoE-1B, a mixture of experts model with 1B (active) 7B (total)

2. Molmo-7B-O, most open 7B model

3.… pic.twitter.com/9hpARh0GYT

— Vaibhav (VB) Srivastav (@reach_vb) September 25, 2024

In addition, Google DeepMind robotics researcher Ted Xiao took to X to praise the inclusion of pointing data in Molmo, which he sees as a game-changer for visual grounding in robotics.

'

Molmo is a very exciting multimodal foundation model release, especially for robotics. The emphasis on pointing data makes it the first open VLM optimized for visual grounding — and you can see this clearly with impressive performance on RealworldQA or OOD robotics perception! x.com pic.twitter.com/VHtu9hT2r9

— Ted Xiao (@xiao_ted) September 25, 2024

This capability allows Molmo to provide visual explanations and interact more effectively with physical environments, a feature that is currently lacking in most other multimodal models.

The models are not only high-performing but also entirely open, allowing researchers and developers to access and build upon cutting-edge technology.

Advanced Model Architecture and Training Approach

Molmo’s architecture is designed to maximize efficiency and performance. All models use OpenAI’s ViT-L/14 336px CLIP model as the vision encoder, which processes multi-scale, multi-crop images into vision tokens.

These tokens are then projected into the language model’s input space through a multi-layer perceptron (MLP) connector and pooled for dimensionality reduction.

The language model component is a decoder-only Transformer, with options ranging from the OLMo series to the Qwen2 and Mistral series, each offering different capacities and openness levels.

The training strategy for Molmo involves two key stages:

Multimodal Pre-training: During this stage, the models are trained to generate captions using newly collected, detailed image descriptions provided by human annotators. This high-quality dataset, named PixMo, is a critical factor in Molmo’s strong performance.
Supervised Fine-Tuning: The models are then fine-tuned on a diverse dataset mixture, including standard academic benchmarks and newly created datasets that enable the models to handle complex real-world tasks like document reading, visual reasoning, and even pointing.

Unlike many contemporary models, Molmo does not rely on reinforcement learning from human feedback (RLHF), focusing instead on a meticulously tuned training pipeline that updates all model parameters based on their pre-training status.

Outperforming on Key Benchmarks

The Molmo models have shown impressive results across multiple benchmarks, particularly in comparison to proprietary models.

For instance, Molmo-72B scores 96.3 on DocVQA and 85.5 on TextVQA, outperforming both Gemini 1.5 Pro and Claude 3.5 Sonnet in these categories. It further outperforms GPT-4o on AI2D (Ai2’s own benchmark, short for “A Diagram Is Worth A Dozen Images,” a dataset of 5000+ grade school science diagrams and 150,000+ rich annotations), scoring the highest of all model families in comparison at 96.3.

The models also excel in visual grounding tasks, with Molmo-72B achieving top performance on RealWorldQA, making it especially promising for applications in robotics and complex multimodal reasoning.

Open Access and Future Releases

Ai2 has made these models and datasets accessible on its Hugging Face space, with full compatibility with popular AI frameworks like Transformers.

This open access is part of Ai2’s broader vision to foster innovation and collaboration in the AI community.

Over the next few months, Ai2 plans to release additional models, training code, and an expanded version of their technical report, further enriching the resources available to researchers.

For those interested in exploring Molmo’s capabilities, a public demo and several model checkpoints are available now via Molmo’s official page.

bnew · Oct 1, 2024

1/11
@reach_vb
Molmo by @allen_ai - Open source SoTA Multimodal (Vision) Language model, beating Claude 3.5 Sonnet, GPT4V and comparable to GPT4o

They release four model checkpoints:

1. MolmoE-1B, a mixture of experts model with 1B (active) 7B (total)
2. Molmo-7B-O, most open 7B model
3. Molmo-7B-D, demo model
4. Molmo-72B, best model

System Architecture

> Input: Multi-scale, multi-crop images generated from the original image.

> Vision Encoder: OpenAI's ViT-L/14 336px CLIP model, a powerful ViT, encodes images into vision tokens.

> Connector: MLP projects tokens to LLM input space, followed by pooling for dimensionality reduction.

> LLM: Decoder-only Transformer, various options (OLMo, OLMoE, Qwen2, Mistral, Gemma2, Phi) with diverse scales and openness.

Model Variants

> Vision Encoder: Consistent ViT-L/14 CLIP model across variants.
> LLM: OLMo-7B-1024, OLMoE-1B-7B-0924, Qwen2 (7B, 72B), Mistral 7B, Gemma2 9B, Phi 3 Medium, offering different capacities and openness levels.

Training Strategy

> Stage 1: Multimodal pre-training for caption generation with new captioning data.

> Stage 2: Supervised fine-tuning on a dataset mixture, updating all parameters.

> No RLHF involved, Learning rates adjusted based on component types and pre-training status.

> All the weights are available on Hugging Face Hub

> Compatible with Transformers (Remote Code)

Kudos @allen_ai for such a brilliant and open work!

Video credits: Allen AI YT Channel

2/11
@reach_vb
Check out their model checkpoints on the Hub:

Molmo - a allenai Collection

3/11
@iamrobotbear
Wait, @allen_ai Did I miss something, what is the main difference between 7b-o and d?

I know the 7B-D is the version on your demo, but in terms of the model or capabilities, I'm a bit confused.

4/11
@A_Reichenbach_
Any idea why they didn’t use rlhf/dpo?

5/11
@ccerrato147
Paul Allen really left an impressive legacy.

6/11
@heyitsyorkie
Nice! New vision model that will need support in llama.cpp!

7/11
@HantianPang
amazing

8/11
@invisyblecom
@ollama

9/11
@EverydayAILabs
Heartening to see the how much open source models are progressing.

10/11
@sartify_co

[Quoted tweet]
We’re thrilled to be part of the 2024 Mozilla Builders Accelerator!
Our project, Swahili LLMs, will bring open-source AI to empower Swahili speakers

.
Exciting 12 weeks ahead!
Learn more here: mozilla.org/en-US/builders/
@MozillaHacks @MozillaAI

#MozillaBuilders #AI #Swahili

11/11
@genesshk
Great to see advancements in open source multimodal models!

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
@xiao_ted
Molmo is a very exciting multimodal foundation model release, especially for robotics. The emphasis on pointing data makes it the first open VLM optimized for visual grounding — and you can see this clearly with impressive performance on RealworldQA or OOD robotics perception!

[Quoted tweet]
Try out Molmo on your application! This is a great example by @DJiafei! We have a few videos describing Molmo's different capabilities on our blog! molmo.allenai.org/blog

This one is me trying it out on a bunch of tasks and images from RT-X: invidious.poast.org/bHOBGAYNBNI

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/2
Try out Molmo on your application! This is a great example by @DJiafei! We have a few videos describing Molmo's different capabilities on our blog! https://molmo.allenai.org/blog

This one is me trying it out on a bunch of tasks and images from RT-X: https://invidious.poast.org/bHOBGAYNBNI
[Quoted tweet]
The idea of using a VLM for pointing, RoboPoint has proven useful and generalizable for robotic manipulation. But the next challenge is: can VLMs draw multiple "points" to form complete robotic trajectories? @allen_ai 's new Molmo seems up to the task—very exciting!

2/2
Thank you Pannag :smile:

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Oct 3, 2024

Announcing FLUX1.1 [pro] and the BFL API

Today we’re laucnhing Flux1.1 PRO and our API, we can’t wait to see what users will dream up using our latest and greatest <3

blackforestlabs.ai

api.

Announcing FLUX1.1 [pro] and the BFL API

Oct 2, 2024
—

by

BlackForestLabs
in Uncategorized

Today, we release FLUX1.1 [pro], our most advanced and efficient model yet, alongside the general availability of the beta BFL API. This release marks a significant step forward in our mission to empower creators, developers, and enterprises with scalable, state-of-the-art generative technology.

FLUX1.1 [pro]: Faster & Better

FLUX1.1 [pro] provides six times faster generation than its predecessor FLUX.1 [pro] while also improving image quality, prompt adherence, and diversity. At the same time, we updated FLUX.1 [pro] to generate the same output as before, but two times faster.

Superior Speed and Efficiency: Faster generation times and reduced latency, enabling more efficient workflows. FLUX1.1 [pro] provides an ideal tradeoff between image quality and inference speed. FLUX1.1 [pro] is three times faster than the currently available FLUX.1 [pro].
Improved Performance: FLUX1.1 [pro] has been introduced and tested under the codename “blueberry” into the Artificial Analysis image arena (https://artificialanalysis.ai/text-to-image), a popular benchmark for text-to-image models. It surpasses all other models on the leaderboard, achieving the highest overall Elo score.

All metrics from artificialanalysis.ai as of Oct 1, 2024.

All metrics from artificialanalysis.ai as of Oct 1, 2024, except FLUX.1 inference speeds (benchmarked internally).

Fast High-res coming soon: FLUX1.1 [pro], natively set up for fast ultra high-resolution generation coming soon to the API. Generate up to 2k images without sacrificing any of the prompt following.

We are excited to announce that FLUX1.1 [pro] will also be available through Together.ai, Replicate, fal.ai, and Freepik.

Building with the BFL API

Our new beta BFL API brings FLUX’s capabilities directly to developers and businesses looking to integrate state-of-the-art image generation into their own applications. Our API stands out with key advantages over competitors:

Advanced Customization: Tailor the API outputs to your specific needs with customization options on model choice, image resolution, and content moderation.
Scalability: Seamlessly scale your applications, whether you are building small projects or enterprise-level applications.
Competitive pricing:The API offers superior image quality at a lower cost. The pricing for our FLUX.1 model suite is as follows:
- FLUX.1 [dev]: 2.5 cts/img
- FLUX.1 [pro]: 5 cts/img
- FLUX1.1 [pro]: 4 cts/img

Get started with the BFL API today at: docs.bfl.ml.

We are eager to see the creative applications that will emerge from users of the BFL API.

bnew · Oct 3, 2024

1/13
@OpenAI
We’re rolling out an early version of canvas—a new way to work with ChatGPT on writing & coding projects that go beyond simple chat.

Starting today, Plus & Team users can try it by selecting “GPT-4o with canvas” in the model picker. https://openai.com/index/introducing-canvas/

2/13
@OpenAI
Canvas opens in a separate window, allowing you and ChatGPT to work on ideas side by side.

In canvas, ChatGPT can suggest edits, adjust length, change reading levels, and offer inline feedback. You can also write and edit directly in canvas.

3/13
@OpenAI
When writing code, canvas makes it easier to track and understand ChatGPT’s changes.

It can also review code, add logs and comments, fix bugs, and port to other coding languages like JavaScript and Python.

4/13
@OpenAI
Canvas will be available to Enterprise & Edu users next week.

5/13
@FrankieS2727

6/13
@iAmTCAB

7/13
@sivajisahoo
So the canvas is similar to Claude Artifact, but different.

8/13
@razroo_chief
Artifacts but make it cooler

9/13
@PedroCoSilva
@joaoo_dev

10/13
@CabinaAI
Compare with http://Cabina.AI
:smile:

11/13
@JaySahnan
If only it would work for Intel Macs

12/13
@NicerInPerson
yo @amasad, your product stack is a shoo-in here and there's no way OpenAI can catch up and out execute on this as a side project

13/13
@causalityin
Folks, the language in the ad is Rust!

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

https://openai.com/index/introducing-canvas/

October 3, 2024

Introducing canvas

A new way of working with ChatGPT to write and code

The image shows a vertical toolbar featuring five icons arranged in a column on a soft pastel background. The third icon from the top, depicting an open book, is highlighted with a label next to it reading Reading Level.

We’re introducing canvas, a new interface for working with ChatGPT on writing and coding projects that go beyond simple chat. Canvas opens in a separate window, allowing you and ChatGPT to collaborate on a project. This early beta introduces a new way of working together—not just through conversation, but by creating and refining ideas side by side.

Canvas was built with GPT-4o and can be manually selected in the model picker while in beta. Starting today we’re rolling out canvas to ChatGPT Plus and Team users globally. Enterprise and Edu users will get access next week. We also plan to make canvas available to all ChatGPT Free users when it’s out of beta.

Better collaboration with ChatGPT

People use ChatGPT every day for help with writing and code. Although the chat interface is easy to use and works well for many tasks, it’s limited when you want to work on projects that require editing and revisions. Canvas offers a new interface for this kind of work.

With canvas, ChatGPT can better understand the context of what you’re trying to accomplish. You can highlight specific sections to indicate exactly what you want ChatGPT to focus on. Like a copy editor or code reviewer, it can give inline feedback and suggestions with the entire project in mind.

You control the project in canvas. You can directly edit text or code. There’s a menu of shortcuts for you to ask ChatGPT to adjust writing length, debug your code, and quickly perform other useful actions. You can also restore previous versions of your work by using the back button in canvas.

Canvas opens automatically when ChatGPT detects a scenario in which it could be helpful. You can also include “use canvas” in your prompt to open canvas and use it to work on an existing project.

Writing shortcuts include:

Suggest edits: ChatGPT offers inline suggestions and feedback.
Adjust the length: Edits the document length to be shorter or longer.
Change reading level: Adjusts the reading level, from Kindergarten to Graduate School.
Add final polish: Checks for grammar, clarity, and consistency.
Add emojis: Adds relevant emojis for emphasis and color.

Coding in canvas

Coding is an iterative process, and it can be hard to follow all the revisions to your code in chat. Canvas makes it easier to track and understand ChatGPT’s changes, and we plan to continue improving transparency into these kinds of edits.

Coding shortcuts include:

Review code: ChatGPT provides inline suggestions to improve your code.
Add logs: Inserts print statements to help you debug and understand your code.
Add comments: Adds comments to the code to make it easier to understand.
Fix bugs: Detects and rewrites problematic code to resolve errors.
Port to a language: Translates your code into JavaScript, TypeScript, Python, Java, C++, or PHP.

Training the model to become a collaborator

We trained GPT-4o to collaborate as a creative partner. The model knows when to open a canvas, make targeted edits, and fully rewrite. It also understands broader context to provide precise feedback and suggestions.

To support this, our research team developed the following core behaviors:

Triggering the canvas for writing and coding
Generating diverse content types
Making targeted edits
Rewriting documents
Providing inline critique

We measured progress with over 20 automated internal evaluations. We used novel synthetic data generation techniques, such as distilling outputs from OpenAI o1-preview, to post-train the model for its core behaviors. This approach allowed us to rapidly address writing quality and new user interactions, all without relying on human-generated data.

A key challenge was defining when to trigger a canvas. We taught the model to open a canvas for prompts like “Write a blog post about the history of coffee beans” while avoiding over-triggering for general Q&A tasks like “Help me cook a new recipe for dinner.” For writing tasks, we prioritized improving “correct triggers” (at the expense of “correct non-triggers”), reaching 83% compared to a baseline zero-shot GPT-4o with prompted instructions.

It’s worth noting that the quality of such baselines is highly sensitive to the specific prompt used. With different prompts, the baseline may still perform poorly but in a different manner—for instance, by being evenly inaccurate across coding and writing tasks, resulting in a different distribution of errors and alternative forms of suboptimal performance. For coding, we intentionally biased the model against triggering to avoid disrupting our power users. We'll continue refining this based on user feedback.

Canvas Decision Boundary Trigger - Writing & Coding

Prompted GPT-4o

GPT-4o with canvas

0000

For writing and coding tasks, we improved correctly triggering the canvas decision boundary, reaching 83% and 94% respectively compared to a baseline zero-shot GPT-4o with prompted instructions.

A second challenge involved tuning the model's editing behavior once the canvas was triggered—specifically deciding when to make a targeted edit versus rewriting the entire content. We trained the model to perform targeted edits when users explicitly select text through the interface, otherwise favoring rewrites. This behavior continues to evolve as we refine the model.

Canvas Edits Boundary - Writing & Coding

Prompted GPT-4o

GPT-4o with canvas

For writing and coding tasks, we prioritized improving canvas targeted edits. GPT-4o with canvas performs better than a baseline prompted GPT-4o by 18%.

Finally, training the model to generate high-quality comments required careful iteration. Unlike the first two cases, which are easily adaptable to automated evaluation with thorough manual reviews, measuring quality in an automated way is particularly challenging. Therefore, we used human evaluations to assess comment quality and accuracy. Our integrated canvas model outperforms the zero-shot GPT-4o with prompted instructions by 30% in accuracy and 16% in quality, showing that synthetic training significantly enhances response quality and behavior compared to zero-shot prompting with detailed instructions.

Canvas Suggested Comments

Prompted GPT-4o

GPT-4o with canvas

Human evaluations assessed canvas comment quality and accuracy functionality. Our canvas model outperforms the zero-shot GPT-4o with prompted instructions by 30% in accuracy and 16% in quality.

What’s next

Making AI more useful and accessible requires rethinking how we interact with it. Canvas is a new approach and the first major update to ChatGPT’s visual interface since we launched two years ago.

Canvas is in early beta, and we plan to rapidly improve its capabilities.

bnew · Oct 3, 2024

Reverb Open-Source ASR and Diarization Models | Rev

Discover how Rev’s open-source Reverb models, trained on the largest human-transcribed dataset, are pushing the boundaries of ASR and diarization technology.

www.rev.com

Introducing Reverb: The Future of Open-Source ASR and Diarization

Jennifer Drexler Fox

Oct 3, 2024

Speech to Text Technology

Rev › Blog › Speech to Text Technology › Introducing Reverb: The Future of Open-Source ASR and Diarization

Rev, as a leader in human transcription of English, has amassed the highest quality English speech recognition dataset in the world. The research team at Rev has used this corpus to develop extremely accurate speech recognition and speech diarization models, currently available through the Rev.AI API.

These models are accessible under a non-commercial license. For information on usage-based or all-inclusive commercial licenses, please contact us at licensing@rev.com. We are releasing both a full production pipeline for developers as well as pared-down research models for experimentation. Rev hopes that these releases will spur research and innovation in the fast-moving domain of voice technology. The speech recognition models released today outperform all existing open source speech recognition models across a variety of long-form speech recognition domains.

The released models, as well as usage instructions, can be found on GitHub and HuggingFace.

Shaping the Future of Speech Technology

This release, which we are calling Reverb, encompasses two separate models: an automatic speech recognition (ASR) model in the WeNet framework and a speech diarization model in the Pyannote framework. For researchers, we provide simple scripts for combining ASR and diarization output into a single diarized transcript. For developers, we provide a full pipeline that handles both ASR and diarization in a production environment. Additionally, we are releasing an int8 quantized version of the ASR model within the developer pipeline (“Reverb Turbo”) for applications that are particularly sensitive to speed and/or memory usage.

Logos of Reverb ASR and Diarization, representing Rev’s new open-source models overlapping on a purple background.

Reverb ASR was trained on 200,000 hours of English speech, all expertly transcribed by humans — the largest corpus of human transcribed audio ever used to train an open-source model. The quality of this data has produced the world’s most accurate English automatic speech recognition (ASR) system, using an efficient model architecture that can be run on either CPU or GPU.

Additionally, this model provides user control over the level of verbatimicity of the output transcript, making it ideal for both clean, readable transcription and use-cases like audio editing that require transcription of every spoken word including hesitations and re-wordings. Users can specify fully verbatim, fully non-verbatim, or anywhere in between for their transcription output.

For diarization, Rev used the high-performance pyannote.audio library to fine-tune existing models on 26,000 hours of expertly labeled data, significantly improving their performance. Reverb diarization v1 uses the pyannote3.0 architecture, while Reverb diarization v2 uses WavLM instead of SincNet features.

Training with the Largest Human-Transcribed Corpus

Preparing and Processing ASR Training Data

Rev’s ASR dataset is made up of long-form, multi-speaker audio featuring a wide range of domains, accents and recording conditions. This corpus contains audio transcribed in two different styles: verbatim and non-verbatim.

Verbatim transcripts include all speech sounds in the audio (including false starts, filler words, and laughter), while non-verbatim transcripts have been lightly edited for readability. Training on both of these transcription styles is what enables the style control feature of the Reverb ASR model.

To prepare our data for training, we employ a joint normalization and forced-alignment process, which allows us to simultaneously filter out poorly-aligned data and get the best possible timings for segmenting the remaining audio into shorter training segments. During the segmentation process, we include multi-speaker segments, so that the resulting ASR model is able to effectively recognize speech across speaker switches.

The processed ASR training corpus comprises 120,000 hours of speech with verbatim transcription labels and 80,000 hours with non-verbatim labels.

A Closer Look at Reverb’s ASR Model Architecture

Reverb ASR was trained using a modified version of the WeNet toolkit and uses a joint CTC/attention architecture. The encoder has 18 conformer layers and the bidirectional attention decoder has 6 transformer layers, 3 in each direction. In total, the model has approximately 600M parameters.

One important modification available in Rev’s WeNet release is the use of the language-specific layer mechanism. While this technique was originally developed to give control over the output language of multilingual models, Reverb ASR uses these extra weights for control over the verbatimicity of the output. These layers are added to the first and last blocks of both the encoder and decoder.

The joint CTC/attention architecture enables experimentation with a variety of inference modes, including: greedy CTC decoding, CTC prefix beam search (with or without attention rescoring), attention decoding, and joint CTC/attention decoding. The joint decoding available in Rev’s Wenet is a slightly modified version of the time synchronous joint decoding implementation from ESPnet.

The production pipeline uses WFST-based beam search with a simple unigram language model on top of the encoder outputs, followed by attention rescoring. This pipeline also implements parallel processing and overlap decoding at multiple levels to achieve the best possible turn-around time without introducing errors at the chunk boundaries. While the research model outputs unformatted text, the production pipeline includes a post-processing system for generating fully formatted output.

Setting New Benchmarking Standards for ASR Accuracy

Unlike many ASR providers, Rev primarily uses long-form speech recognition corpora for benchmarking. We use each model to produce a transcript of an entire audio file, then use fstalign to align and score the complete transcript. We report micro-average WER across all of the reference words in a given test suite. As part of our model release, we have included our scoring scripts so that anyone can replicate our work, benchmark other models, or experiment with new long-form test suites.

bnew · Oct 3, 2024

Here, we’ve benchmarked Reverb ASR against the best performing open-source models currently available: OpenAI’s Whisper large-v3 and NVIDIA’s Canary-1B. Note that both of these models have significantly more parameters than Reverb ASR.

For these models and Rev’s research model, we use simple chunking with no overlap – 30s chunks for Whisper and Canary, and 20s chunks for Reverb. The Reverb research results use CTC prefix beam search with attention rescoring. We used Canary through Hugging Face and used the WhisperX implementation of Whisper. For both Whisper and Canary, we use NeMo to normalize the model outputs before scoring.

For long-form ASR, we’ve used three corpora: Rev16 (podcasts)1, Earnings21 (earnings calls from US-based companies), and Earnings22 (earnings calls from global companies).

1 Description from https://cdn.openai.com/papers/whisper.pdf, Appendix A.2: “We use a subset of 16 files from the 30 podcast episodes in Rev.AI’s Podcast Transcription Benchmark, after finding that there are multiple cases where a significant portion of the audio and the labels did not match, mostly on the parts introducing the sponsors. We selected 16 episodes that do not have this error, whose file numbers are: 3, 4, 9, 10, 11, 14, 17, 18, 20, 21, 23, 24, 26, 27, 29, 32.”

Model	Earnings21	Earnings22
Reverb Verbatim	7.64	11.38
Reverb Turbo Verbatim	7.88	11.60
Reverb Research Verbatim	9.68	13.68
Whisper large-v3	13.67	18.53
Canary-1B	14.40	19.01

For Rev16, we have produced both verbatim and non-verbatim human transcripts. For all Reverb models, we run in verbatim mode for evaluation with the verbatim reference and non-verbatim mode for evaluation with the non-verbatim reference.

Model	Verbatim Reference	Non-Verbatim Reference
Reverb	7.99	7.06
Reverb Turbo	8.25	7.50
Reverb Research	10.30	9.08
Whisper large-v3	10.67	11.37
Canary-1B	13.82	13.24

We have also used GigaSpeech for a more traditional benchmark. We ran Reverb ASR in verbatim mode and used the HuggingFace Open ASR Leaderboard evaluation scripts.

Model	Gigaspeech
Reverb Research Verbatim	11.05
Whisper large-v3	10.02
Canary-1B	10.12

Overall, Reverb ASR significantly outperforms the competition on long-form ASR test suites. Rev’s models are particularly strong on the Earnings22 test suite, which contains mainly speech from non-native speakers of English. We see a small WER degradation from the use of the Turbo model, but a much larger gap between the production pipeline and research model – demonstrating the importance of engineering a complete system for long-form speech recognition.

On the Gigaspeech test suite, Rev’s research model is worse than other open-source models. The average segment length of this corpus is 5.7 seconds; these short segments are not a good match for the design of Rev’s model. These results demonstrate that despite its strong performance on long-form tests, Rev is not the best candidate for short-form recognition applications like voice search.

Customizing Verbatimicity Levels in Reverb ASR

Rev has the only AI transcription API and model that allows user control over the verbatimicity of the output. The developer pipeline offers a verbatim mode that transcribes all spoken content and a non-verbatim mode that removes unnecessary phrases to improve readability. The output of the research model can be controlled with a verbatimicity parameter that can be anywhere between zero and one.

The Rev team has found that halfway between verbatim and non-verbatim produces a reader-preferred style for captioning – capturing all content while reducing some hesitations and stutters to make captions fit better on screen.

Verbatimicities that other APIs miss:	Nonverbatimicities that other APIs transcribe:
Repeated stutter words	“You know”
Repeated phrases	“Kind of”
Filled pauses (um, uh)	“Sort of”
	“Like”

Optimizing Data for ASR and Diarization Models

Rev’s diarization training data comes from the same diverse corpus as the ASR training data. However, annotation for diarization is particularly challenging, because of the need for precise timings specifically at speaker switches and the difficulties of handling overlapped speech. As a result, only a subset of the ASR training data is usable for diarization. The total corpus used for diarization is 26,000 hours.

Enhancing Diarization Precision with WavLM Technology

The Reverb diarization models were developed using the pyannote.audio library. Reverb diarization v1 is identical to pyannote3.0 in terms of architecture but it is fine-tuned on Rev’s transcriptions for 17 epochs. Training took 4 days on a single A100 GPU. The network has 2 LSTM layers with hidden size of 256, totaling approximately 2.2M parameters.

Our most precise diarization model – Reverb diarization v2 – uses WavLM instead of the SincNet features in the pyannote3.0 basic model.

Diarization Benchmarks and Performance

While DER is a valuable metric for assessing the technical performance of a diarization model in isolation, WDER (Word Diarization Error Rate) is more crucial in the context of ASR because it reflects the combined effectiveness of both the diarization and ASR components in producing accurate, speaker-attributed text. In practical applications where the accuracy of both “who spoke” and “what was spoken” is essential, WDER provides a more meaningful and relevant measure for evaluating system performance and guiding improvements. For this reason we only report WDER metrics.

We show results for two test-suites, earnings21 and rev16.

Model	Earnings21 WDER	Rev16 WDER
Pyannote3.0	0.051	0.090
Reverb Diarization v1	0.047	0.077
Reverb Diarization v2	0.046	0.078

Driving Innovation in Speech Technology

We are excited to release the state-of-the-art Reverb ASR and diarization models to the public. We hope that these releases will spur research and innovation in the fast-moving domain of voice technology. To get started, visit GitHub - revdotcom/reverb: Open source inference code for Rev's model for research models or GitHub - revdotcom/reverb-self-hosted: This public GitHub repository contains code for a fully self-hosted, on-premise transcription solution. for the complete developer solution. Schedule a demo today to learn more about the Rev.AI API or email licensing@rev.com for Reverb commercial licensing.

Rev would like to extend our sincere thanks to Nishchal Bhandari, Danny Chen, Miguel Del Rio, Natalie Delworth, Miguel Jette, Quinn McNamara, Corey Miller, Jan Profant, Nan Qin, Martin Ratajczak, Jean-Philippe Robichaud, Ondrej Novotny, Jenny Drexler Fox, and Lee Harris for their invaluable contributions to making this release possible.

bnew · Oct 3, 2024

Nvidia just dropped a bombshell: Its new AI model is open, massive, and ready to rival GPT-4

Nvidia has released NVLM 1.0, a powerful open-source AI model that rivals GPT-4 and Google’s systems, marking a major breakthrough in multimodal language models for vision and text tasks.

venturebeat.com

Nvidia just dropped a bombshell: Its new AI model is open, massive, and ready to rival GPT-4

Michael Nuñez@MichaelFNunez

October 1, 2024 3:58 PM

Credit: VentureBeat made with Midjourney

Nvidia has released a powerful open-source artificial intelligence model that competes with proprietary systems from industry leaders like OpenAI and Google.

The company’s new NVLM 1.0 family of large multimodal language models, led by the 72 billion parameter NVLM-D-72B, demonstrates exceptional performance across vision and language tasks while also enhancing text-only capabilities.

“We introduce NVLM 1.0, a family of frontier-class multimodal large language models that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models,” the researchers explain in their paper.

By making the model weights publicly available and promising to release the training code, Nvidia breaks from the trend of keeping advanced AI systems closed. This decision grants researchers and developers unprecedented access to cutting-edge technology.

Benchmark results comparing NVIDIA’s NVLM-D model to AI giants like GPT-4, Claude 3.5, and Llama 3-V, showing NVLM-D’s competitive performance across various visual and language tasks. (Credit: arxiv.org)

NVLM-D-72B: A versatile performer in visual and textual tasks

The NVLM-D-72B model shows impressive adaptability in processing complex visual and textual inputs. Researchers provided examples that highlight the model’s ability to interpret memes, analyze images, and solve mathematical problems step-by-step.

Notably, NVLM-D-72B improves its performance on text-only tasks after multimodal training. While many similar models see a decline in text performance, NVLM-D-72B increased its accuracy by an average of 4.3 points across key text benchmarks.

“Our NVLM-D-1.0-72B demonstrates significant improvements over its text backbone on text-only math and coding benchmarks,” the researchers note, emphasizing a key advantage of their approach.

Screenshot-2024-10-01-at-3.27.49%E2%80%AFPM.png

NVIDIA’s new AI model analyzes a meme comparing academic abstracts to full papers, demonstrating its ability to interpret visual humor and scholarly concepts. (Credit: arxiv.org)

AI researchers respond to Nvidia’s open-source initiative

The AI community has reacted positively to the release. One AI researcher commenting on social media, observed, “Wow! Nvidia just published a 72B model with is ~on par with llama 3.1 405B in math and coding evals and also has vision ?”

Nvidia’s decision to make such a powerful model openly available could accelerate AI research and development across the field. By providing access to a model that rivals proprietary systems from well-funded tech companies, Nvidia may enable smaller organizations and independent researchers to contribute more significantly to AI advancements.

The NVLM project also introduces innovative architectural designs, including a hybrid approach that combines different multimodal processing techniques. This development could shape the direction of future research in the field.

Wow nvidia just published a 72B model with is ~on par with llama 3.1 405B in math and coding evals and also has vision ? pic.twitter.com/c46DeXql7s

— Phil (@phill__1) October 1, 2024

NVLM 1.0: A new chapter in open-source AI development

Nvidia’s release of NVLM 1.0 marks a pivotal moment in AI development. By open-sourcing a model that rivals proprietary giants, Nvidia isn’t just sharing code—it’s challenging the very structure of the AI industry.

This move could spark a chain reaction. Other tech leaders may feel pressure to open their research, potentially accelerating AI progress across the board. It also levels the playing field, allowing smaller teams and researchers to innovate with tools once reserved for tech giants.

However, NVLM 1.0’s release isn’t without risks. As powerful AI becomes more accessible, concerns about misuse and ethical implications will likely grow. The AI community now faces the complex task of promoting innovation while establishing guardrails for responsible use.

Nvidia’s decision also raises questions about the future of AI business models. If state-of-the-art models become freely available, companies may need to rethink how they create value and maintain competitive edges in AI.

The true impact of NVLM 1.0 will unfold in the coming months and years. It could usher in an era of unprecedented collaboration and innovation in AI. Or, it might force a reckoning with the unintended consequences of widely available, advanced AI.

One thing is certain: Nvidia has fired a shot across the bow of the AI industry. The question now is not if the landscape will change, but how dramatically—and who will adapt fast enough to thrive in this new world of open AI.

Large Language Models News & Discussions

Veteran

Fake AI “podcasters” are reviewing my book and it’s freaking me out​

NotebookLM's "Audio Summaries" show a more personable future for AI-generated content.​

Further Reading​

​

Hey! Listen!​

​

Just the facts?​

​

It’s all in the delivery​

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

MIT spinoff Liquid debuts non-transformer AI models and they’re already state-of-the-art​

How Liquid is going ‘beyond’ the generative pre-trained transformer (GPT)​

Accepting invitations for launch event and eyeing future improvements​

Veteran

Ai2’s new Molmo open source AI models beat GPT-4o, Claude on some benchmarks​

Closing the Gap Between Open and Proprietary AI​

Advanced Model Architecture and Training Approach​

Outperforming on Key Benchmarks​

Open Access and Future Releases​

Veteran

Veteran

Announcing FLUX1.1 [pro] and the BFL API​

FLUX1.1 [pro]: Faster & Better​

Building with the BFL API​

Veteran

Introducing canvas​

Better collaboration with ChatGPT​

Coding in canvas​

Training the model to become a collaborator​

Canvas Decision Boundary Trigger - Writing & Coding​

Canvas Edits Boundary - Writing & Coding​

Canvas Suggested Comments​

What’s next​

Veteran

Introducing Reverb: The Future of Open-Source ASR and Diarization​

Jennifer Drexler Fox​

Oct 3, 2024​

Shaping the Future of Speech Technology​

Training with the Largest Human-Transcribed Corpus​

Preparing and Processing ASR Training Data​

A Closer Look at Reverb’s ASR Model Architecture​

Setting New Benchmarking Standards for ASR Accuracy​

Veteran

Customizing Verbatimicity Levels in Reverb ASR​

Optimizing Data for ASR and Diarization Models​

Enhancing Diarization Precision with WavLM Technology​

Diarization Benchmarks and Performance​

Driving Innovation in Speech Technology​

Veteran

Nvidia just dropped a bombshell: Its new AI model is open, massive, and ready to rival GPT-4​

NVLM-D-72B: A versatile performer in visual and textual tasks​

AI researchers respond to Nvidia’s open-source initiative​

NVLM 1.0: A new chapter in open-source AI development​

Fake AI “podcasters” are reviewing my book and it’s freaking me out

NotebookLM's "Audio Summaries" show a more personable future for AI-generated content.

Further Reading

Hey! Listen!

Just the facts?

It’s all in the delivery

MIT spinoff Liquid debuts non-transformer AI models and they’re already state-of-the-art

How Liquid is going ‘beyond’ the generative pre-trained transformer (GPT)

Accepting invitations for launch event and eyeing future improvements

Ai2’s new Molmo open source AI models beat GPT-4o, Claude on some benchmarks

Closing the Gap Between Open and Proprietary AI

Advanced Model Architecture and Training Approach

Outperforming on Key Benchmarks

Open Access and Future Releases

Announcing FLUX1.1 [pro] and the BFL API

FLUX1.1 [pro]: Faster & Better

Building with the BFL API

Introducing canvas

Better collaboration with ChatGPT

Coding in canvas

Training the model to become a collaborator

Canvas Decision Boundary Trigger - Writing & Coding

Canvas Edits Boundary - Writing & Coding

Canvas Suggested Comments

What’s next

Introducing Reverb: The Future of Open-Source ASR and Diarization

Jennifer Drexler Fox

Oct 3, 2024

Shaping the Future of Speech Technology

Training with the Largest Human-Transcribed Corpus

Preparing and Processing ASR Training Data

A Closer Look at Reverb’s ASR Model Architecture

Setting New Benchmarking Standards for ASR Accuracy

Customizing Verbatimicity Levels in Reverb ASR

Optimizing Data for ASR and Diarization Models

Enhancing Diarization Precision with WavLM Technology

Diarization Benchmarks and Performance

Driving Innovation in Speech Technology

Nvidia just dropped a bombshell: Its new AI model is open, massive, and ready to rival GPT-4

NVLM-D-72B: A versatile performer in visual and textual tasks

AI researchers respond to Nvidia’s open-source initiative

NVLM 1.0: A new chapter in open-source AI development