
Fake AI “podcasters” are reviewing my book and it’s freaking me out​


NotebookLM's "Audio Summaries" show a more personable future for AI-generated content.​


Kyle Orland - 9/23/2024, 11:40 AM

Hey, welcome back to "Talkin' Minesweeper," the podcast where AI hosts discuss a book about Minesweeper!

Aurich Lawson | Boss Fight Books


As someone who has been following the growth of generative AI for a while now, I know that the technology can be pretty good (if not quite human-level) at quickly summarizing complex documents into a more digestible form. But I still wasn't prepared for how disarmingly compelling it would be to listen to Google's NotebookLM condense my recent book about Minesweeper into a tight, 12.5-minute, podcast-style conversation between two people who don't exist.

There are still enough notable issues with NotebookLM's audio output to prevent it from fully replacing professional podcasters any time soon. Even so, the podcast-like format is an incredibly engaging and endearing way to take in complex information and points to a much more personable future for generative AI than the dry back-and-forth of a text-based chatbot.

Hey! Listen!​




Listen to NotebookLM's 12.5-minute summary of my Minesweeper book using the player above.

Google's NotebookLM launched over a year ago as "a virtual research assistant that can summarize facts, explain complex ideas, and brainstorm new connections—all based on the sources you select." Just last week, though, Google added the new "Audio Overview" feature that it's selling as "a new way to turn your documents into engaging audio discussions."

Google doesn't use the word "podcast" anywhere in that announcement, instead talking up audio creations that "summarize your material, make connections between topics, and banter back and forth." But Wharton AI professor Ethan Mollick correctly referred to the style as a "podcast" in a recent social media post sharing a NotebookLM Audio Overview of his book. Mollick called these Audio Summaries "the current best 'wow this is amazing & useful' demo of AI" and "unnerving, too," and we agree on both counts.

Inspired by Mollick's post, I decided to feed my own book into NotebookLM to see what its virtual "podcasters" would make of 30,000 or so words about '90s Windows gaming classic Minesweeper (believe it or not, I could have written much more). Just a few minutes later, I was experiencing a reasonable facsimile of what it would be like if I was featured on NPR's Pop Culture Happy Hour or a similar banter-filled podcast.

Just the facts?​


NotebookLM's summary hits on all the book's major sections: the pre-history of the games that inspired Minesweeper; the uphill battle for the Windows Entertainment Pack at a business-focused Microsoft of the '90s; the moral panic over the game's pre-installation on millions of business and government computers; and the surprising cheating controversies that surrounded the game's competitive scene.

Why read ~30,000 words about Minesweeper when you can listen to two fake people banter for a few minutes instead?

Boss Fight Books

Sure, I could quibble about which specific bits the summary decided to focus on and/or leave out (maybe feeding different chapters individually would have led to more detail in the collected summaries). But anyone listening to this "podcast" would get the same general overview of my book that they would listening to one of the many actual podcasts that I did after the book launched.

While there weren't any full-blown, whole-cloth hallucinations in NotebookLM's summary "podcast," there were a few points where it got small details wrong or made assumptions that weren't supported in the text. Discussing Minesweeper predecessor Mined-Out, for instance, NotebookLM's audio summary says, "So this is where those squares and flags start to come into play..." even though Mined-Out had neither feature.

Then there's the portion where the summary-cast mentions a senator who called Minesweeper "a menace to the republic," repeating the quote for emphasis. That definitely captures the spirit of Senator Lauch Faircloth's tirade against Minesweeper and other games being pre-installed on government computers. In the "podcast" context, though, it sounds like the voices are putting words in Faircloth's mouth by sharing a direct quote.

Small, overzealous errors like these—and a few key bits of the book left out of the podcast entirely—would give me pause if I were trying to use a NotebookLM summary as the basis for a scholarly article or piece of journalism. But I could see using a summary like this to get some quick Cliff's Notes-style grounding on a thick tome I didn't have the time or inclination to read fully. And, unlike poring through Cliff's Notes, the pithy, podcast-style format would actually make for enjoyable background noise while out on a walk or running errands.

It’s all in the delivery​


It's that naturalistic, bantering presentation that makes NotebookLM's new feature stand out from other AI products that generate capable text summaries. I felt like I was eavesdropping on two people who just happened to be discussing my book in a cafe, except those people don't actually exist (and were probably algorithmically designed to praise the book).

Is this thing on?

Getty Images

Right from the start, I was tickled by the way one "podcast host" described the book as a tale from "the land of floppy disks and dial-up modems" (a phrase I did not use in the book). That same voice goes on to tease "a bit of Bill Gates sneaking around the Microsoft office," up front, hinting at my absolute favorite anecdote from the book before fully exploring it later in the summary.

When they do get to that anecdote, the fake podcast hosts segue in with what feels like a natural conversational structure:


Voice 1: It's hard to deny the impact of something when your own CEO is secretly hooked.

Voice 2: Wait, are we talking about Bill Gates?

The back-and-forth style of the two-person "podcast" format allows for some entertaining digressions from the main point of the book, too. When discussing the wormy movie-star damsel-in-distress featured in Minesweeper predecessor Mined-Out, for instance, the AI summarizers seem to get a little distracted:


Voice 1: I have to ask, what kind of movies does a worm even star in?

Voice 2: I'm afraid that detail has been lost to the sands of gaming history.

Then there's the casual way the two "hosts" bring up the improved versions of Minesweeper that were crafted to fix problems with Microsoft's original:


Voice 1: So eventually the community came up with a more elegant solution.

Voice 2: Let me guess. They created a new version of Minesweeper.

Voice 1: Exactly.

Voice 2: Called it a day on the old one.

The two-person format helps foster a gentle, easy rhythm to the presentation of dense information, with natural-sounding pauses and repetition that help emphasize key points. When one ersatz podcaster talks about the phenomenon of "this incredibly addictive puzzle game [being] pre-installed on practically every computer," for instance, the other voice can answer back with the phrase "on every computer" with just the right amount of probing interest. Or when one AI voice intones that "it was discovered that the original Minesweeper had a flaw in how it generated random boards," the other voice jumps in and exclaims "A flaw!" with pitch-perfect timing and a sense of surprise.


Wait, are we talking about Bill Gates?

NotebookLM podcast voice

There are some problems with this back-and-forth style, though. For one, both voices seem to alternate between the "I read the book" role and the "I'm surprised at these book facts you're sharing" role, making it hard to feel like either one is genuine. For another, the sheer volume of surprised reactions (a partial sample: "What? No! Wooooow! You're kidding! No way! You're blowing my mind here!") can get a little grating. And then there are the sentences that pause at the wrong points or the bits of laughter that feel like an editor chopped them off prematurely.

Still, when one fake podcast voice cooed, "Oh, do tell!" in response to the idea of controversy in competitive Minesweeper, it set off the same parasocial relationship buttons that a good, authentic podcast can (while also effectively flattering my sense of authorial ego).

After listening to NotebookLM's summary of my own book, I can easily envision a near future where these "fake" podcasts become a part of my real podcast diet, especially for books or topics that are unlikely to get professional interest from human podcasters. By repackaging generative AI text into a "just two people chatting" format, Google has put a much more amiable face on what can sometimes seem like a dehumanizing technology.
 


1/21
@AIatMeta
📣 Introducing Llama 3.2: Lightweight models for edge devices, vision models and more!

What’s new?
• Llama 3.2 1B & 3B models deliver state-of-the-art capabilities for their class for several on-device use cases — with support for @Arm, @MediaTek & @Qualcomm on day one.
• Llama 3.2 11B & 90B vision models deliver performance competitive with leading closed models — and can be used as drop-in replacements for Llama 3.1 8B & 70B.
• New Llama Guard models to support multimodal use cases and edge deployments.
• The first official distro of Llama Stack simplifies and supercharges the way developers & enterprises can build around Llama to support agentic applications and more.

Details in the full announcement ➡️ Llama 3.2: Revolutionizing edge AI and vision with open, customizable models
Download Llama 3.2 models ➡️ Llama 3.2

These models are available to download now directly from Meta and @HuggingFace — and will be available across offerings from 25+ partners that are rolling out starting today, including @accenture, @awscloud, @AMD, @azure, @Databricks, @Dell, @Deloitte, @FireworksAI_HQ, @GoogleCloud, @GroqInc, @IBMwatsonx, @Infosys, @Intel, @kaggle, @NVIDIA, @OracleCloud, @PwC, @scale_AI, @snowflakeDB, @togethercompute and more.

With Llama 3.2 we’re making it possible to run Llama in even more places, with even more flexible capabilities. We’ve said it before and we’ll say it again: open source AI is how we ensure that these innovations reflect the global community they’re built for and benefit everyone. We’re continuing our drive to make open source the standard with Llama 3.2.
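
For anyone who wants to try the lightweight models right away, here is a minimal sketch using Hugging Face Transformers — assuming a recent transformers release and that you have accepted the Llama 3.2 license on the Hub; the prompt and settings are illustrative only:

```python
# Minimal sketch (not Meta's reference code): running Llama 3.2 3B Instruct with
# Hugging Face Transformers. Use "meta-llama/Llama-3.2-1B-Instruct" for the
# smaller on-device variant.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.2-3B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "In two sentences, what does Llama 3.2 add over Llama 3.1?"},
]

out = pipe(messages, max_new_tokens=128)
# With chat-style input, recent pipelines return the full message list,
# where the last entry is the assistant's reply.
print(out[0]["generated_text"][-1]["content"])
```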



2/21
@reach_vb
Congrats on the release! I’m a huge fan of your commitment to open science and weights!

Thanks for the vision and on-device goodies:

Llama 3.2 - a meta-llama Collection



3/21
@ai_for_success
Congratulations this is huge.



4/21
@togethercompute
🙌 We love that Llama has gone multimodal! We're excited to partner with @AIatMeta to offer free access to the Llama 3.2 11B vision model for developers. Can't wait to see what everyone builds!

Try now with our Llama-Vision-Free model endpoint.

Sign up here: https://api.together.ai/playground/chat/meta-llama/Llama-Vision-Free



5/21
@Saboo_Shubham_
@ollama and @LMStudioAI time to go!!



6/21
@ollama
Let's go!!!! Open-source AI!



7/21
@joinnatural
GG 🙌



8/21
@borsavada
It is a real pity that Llama 3.2 is not available and accessible in Turkey. Restricting access to such innovative technologies can cause developers and researchers in Turkey to miss important opportunities.

Given the rapid developments in the field of artificial intelligence, it is crucial that our country is able to closely follow the advances in this field and utilize these technologies. Advanced language models such as Llama 3.2 can accelerate innovation and increase productivity in many sectors.

This may be due to license agreements, legal regulations or technical infrastructure issues. But whatever the reason, such obstacles need to be overcome to ensure that Turkey does not fall behind in the global AI race.

It is critical that policymakers, technology companies and academic institutions collaborate to ensure that Turkey has access to the latest AI technologies and strengthen the local ecosystem in this field. In this way, Turkey can become not only a consumer but also a producer and innovation center in the field of artificial intelligence.



9/21
@swissyp_
let's get these on-chain on #ICP!! Llama 3.2 1B & 3B 🚀

cc: @icpp_pro



10/21
@AMD
AMD welcomes the latest Llama 3.2 release from Meta. We're excited to share how our collaboration with Meta is enabling developers with Day-0 support. Llama 3.2 and AMD: Optimal Performance from Cloud to Edge and AI PCs



11/21
@basetenco
We're excited to bring dedicated deployments of these new Llama 3.2 models to our customers!

90B vision looks especially powerful — congrats to the entire Llama team!



12/21
@janvivaddison
Congratulations that was amazing 😻



13/21
@Ming_Chun_Lee
"open source AI is how we ensure that these innovations reflect the global community they’re built for and benefit everyone."

This is also how we can ensure everyone can help to build and advance AI together with the same goal.

Very important.



14/21
@dhruv2038
Congrats to @AIatMeta for being this open.



15/21
@FirstBatchAI
Thank you for helping us build better for edge! 🚀



16/21
@testingcatalog
“Linux of AI”



17/21
@ryanraysr
Awesome! Looking forward to digging it!



18/21
@Neeraj_Kumar222
I am excited to see the new capabilities of Llama 3.2 models, especially the support for edge devices and the competitive performance of the vision models. The expanded partner ecosystem and commitment to open-source AI are great to see. Looking forward to exploring the potential applications of these advancements.



19/21
@JonathanRoseD
I've been trying out 3.2 3b on my phone and it's been super for a micro model! I just highly recommend using a temperature at 0.2 and no higher (hallucination). Amazing work!!



20/21
@CamelAIOrg
Congratulations, we are excited to test these out within our framework!



21/21
@philip_kiely
Excited to compare 1B and 3B to the Qwen 2.5 small language models — have been really enjoying Qwen 2.5 1.5B

Llama 1B will be especially useful as a small example model in all of the documentation that I write!




 


1/1
🚨 BREAKING

Llama 3.2 multimodal is here and 90B outperforms GPT-4o-mini and Claude Haiku in different benchmarks.

» Lightweight - 1B and 3B
» Multimodal - 11B and 90B





1/1
Together is serving Llama 3.2 vision for free - have fun!

[Quoted tweet]
🚀 Big news! We’re thrilled to announce the launch of Llama 3.2 Vision Models & Llama Stack on Together AI.

🎉 Free access to Llama 3.2 Vision Model for developers to build and innovate with open source AI. api.together.ai/playground/c…

➡️ Learn more in the blog together.ai/blog/llama-3-2-v…














1/12
@nutlope
Announcing Napkins.dev – Screenshot to code!

An open source wireframe to app tool powered by Llama 3.2 vision. Upload a screenshot of a simple site/design & get code.

100% free and open source.



2/12
@nutlope
Here's the GitHub repo! Also, shoutout to @YoussefUiUx for the great design.

GitHub - Nutlope/napkins: napkins.dev – from screenshot to app



3/12
@nutlope
Tech Stack:

◆ @togethercompute's inference (AI API)
◆ @AIatMeta's Llama 3.2 Vision models
◆ @AIatMeta's Llama 3.1 405B for the LLM
◆ @codesandbox's sandpack for the sandbox
◆ @nextjs w/ tailwind & typescript
◆ @helicone_ai for AI observability
◆ @PlausibleHQ for analytics
◆ @awscloud's S3 for file uploads



4/12
@nutlope
How it works:

I ask the Llama 3.2 vision models to describe whatever screenshot the user uploaded, then pass it to Llama 3.1 405B to actually code it.

It's fairly limited in what it can handle right now – best for simple UI sketches!
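
For the curious, a rough sketch of that two-step flow (screenshot → description → code) follows — this is NOT the actual Napkins source; the base URL and model ids are assumptions that may need updating to whatever Together currently serves:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["TOGETHER_API_KEY"],
    base_url="https://api.together.xyz/v1",  # Together's OpenAI-compatible endpoint
)

def screenshot_to_code(image_url: str) -> str:
    # Step 1: have a Llama 3.2 vision model describe the uploaded screenshot.
    description = client.chat.completions.create(
        model="meta-llama/Llama-3.2-11B-Vision-Instruct-Turbo",  # assumed model id
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this UI screenshot in detail: layout, components, text, colors."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
    ).choices[0].message.content

    # Step 2: hand the description to Llama 3.1 405B to write the actual code.
    code = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3.1-405B-Instruct-Turbo",  # assumed model id
        messages=[{
            "role": "user",
            "content": "Write a single-file React + Tailwind component implementing this UI:\n\n"
                       + description,
        }],
    ).choices[0].message.content
    return code
```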



5/12
@nutlope
Launched this as part of us at Together AI supporting the new Llama 3.2 models (including vision). Check it out!

[Quoted tweet]
🚀 Big news! We’re thrilled to announce the launch of Llama 3.2 Vision Models & Llama Stack on Together AI.

🎉 Free access to Llama 3.2 Vision Model for developers to build and innovate with open source AI. api.together.ai/playground/c…

➡️ Learn more in the blog together.ai/blog/llama-3-2-v…


6/12
@nutlope
Check out the app here!

Napkins.dev – Screenshot to code



7/12
@LM_22
It would be nice to adapt it for OCR — extracting particular data from pictures, invoices, packing lists, and delivery notes, and structuring it as JSON or CSV for handover to an agent



8/12
@nutlope
Agreed! It has a lot of really cool use cases and I'm planning to do one with receipts potentially



9/12
@olanetsoft
Wow



10/12
@nutlope
Still limited to fairly simple designs but gonna work to make it better!



11/12
@DamiDina
Fire



12/12
@nutlope
Thanks Dami!












1/6
Llama 3.2 is available on Ollama! It's lightweight and multimodal! It's so fast and good!

🥕 Try it:

1B
ollama run llama3.2:1b

3B
ollama run llama3.2

🕶️ vision models are coming very soon!

llama3.2

[Quoted tweet]
📣 Introducing Llama 3.2: Lightweight models for edge devices, vision models and more!

What’s new?
• Llama 3.2 1B & 3B models deliver state-of-the-art capabilities for their class for several on-device use cases — with support for @Arm, @MediaTek & @Qualcomm on day one.
• Llama 3.2 11B & 90B vision models deliver performance competitive with leading closed models — and can be used as drop-in replacements for Llama 3.1 8B & 70B.
• New Llama Guard models to support multimodal use cases and edge deployments.
• The first official distro of Llama Stack simplifies and supercharges the way developers & enterprises can build around Llama to support agentic applications and more.

Details in the full announcement ➡️ go.fb.me/229ug4
Download Llama 3.2 models ➡️ go.fb.me/w63yfd

These models are available to download now directly from Meta and @HuggingFace — and will be available across offerings from 25+ partners that are rolling out starting today, including @accenture, @awscloud, @AMD, @azure, @Databricks, @Dell, @Deloitte, @FireworksAI_HQ, @GoogleCloud, @GroqInc, @IBMwatsonx, @Infosys, @Intel, @kaggle, @NVIDIA, @OracleCloud, @PwC, @scale_AI, @snowflakeDB, @togethercompute and more.

With Llama 3.2 we’re making it possible to run Llama in even more places, with even more flexible capabilities. We’ve said it before and we’ll say it again: open source AI is how we ensure that these innovations reflect the global community they’re built for and benefit everyone. We’re continuing our drive to make open source the standard with Llama 3.2.
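
Beyond the CLI commands above, the same models can be called from Python via the ollama client package; a minimal sketch, assuming `pip install ollama` and a running Ollama server:

```python
import ollama

response = ollama.chat(
    model="llama3.2",  # or "llama3.2:1b" for the 1B variant pulled above
    messages=[{"role": "user", "content": "In one sentence, what is new in Llama 3.2?"}],
)
print(response["message"]["content"])
```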


2/6
Amazing!!



3/6
lightweight and multimodal! ❤️❤️❤️



4/6
❤️



5/6
Amazing! Is this the 3B or 1B?



6/6
❤️






1/4
@awnihannun
Llama 3.2 1B in 4-bit generates at ~350 (!) toks/sec with MLX on an M2 Ultra. This is interesting.

Command: mlx_lm.generate --model mlx-community/Llama-3.2-1B-Instruct-4bit --prompt "Write a story about Einstein" --temp 0.0 --max-tokens 512

Not sped up:
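
A rough Python equivalent of that CLI command, assuming `pip install mlx-lm` on Apple silicon (keyword arguments can vary between mlx-lm versions, and temperature handling is omitted here for that reason):

```python
from mlx_lm import load, generate

# Load the 4-bit community quantization referenced in the command above.
model, tokenizer = load("mlx-community/Llama-3.2-1B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Write a story about Einstein", max_tokens=512)
print(text)
```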



2/4
@MemoSparkfield
Wow!



3/4
@ivanfioravanti
WOW! That’s ultra fast!



4/4
@shareastronomy
Is the MLX version of the model on Hugging Face already?




 




1/3
@rohanpaul_ai
Really 👀 new Paper, MINI-SEQUENCE TRANSFORMER claims to extend the maximum context length of Qwen, Mistral, and Gemma-2 by 12-24x.

MST enables efficient long-sequence training by reducing intermediate memory overhead

It achieves 2.7x improvement in perplexity with 30k context vs 8k baseline on LongAlpaca dataset

**Original Problem** 🔍:

Training large language models with long sequences is limited by memory constraints, particularly due to large intermediate values in MLP and LM-Head blocks.

-----

**Solution in this Paper** 🧠:

• MINI-SEQUENCE TRANSFORMER (MST) partitions input sequences into M mini-sequences
• Processes mini-sequences iteratively to reduce intermediate memory usage
• Applies to MLP and LM-Head blocks, compatible with various attention mechanisms
• Integrates with activation recomputation for further memory savings
• Chunk-based implementation optimizes performance for small sequences
• Extends to distributed training settings

-----

**Key Insights from this Paper** 💡:

• Intermediate values in MLP and LM-Head blocks consume significantly more memory than inputs/outputs
• Partitioning input sequences into mini-sequences can reduce intermediate memory usage
• MST is compatible with existing optimization techniques like activation recomputation
• Optimal mini-sequence size depends on model architecture and sequence length

-----

**Results** 📊:

• Enables training of Llama3-8B with 60k context length (12x longer than standard)
• Maintains comparable throughput to standard implementation
• Reduces peak memory usage by 30% compared to standard transformer
• Scales linearly with number of GPUs in distributed settings
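
Below is a minimal PyTorch-style illustration of the core idea as applied to the LM-Head block — splitting the sequence into chunks so the huge [tokens x vocab] logits tensor never materializes all at once. This is an illustration of the idea only, not the paper's MST implementation:

```python
# Labels are assumed to be pre-shifted next-token targets.
import torch
import torch.nn.functional as F

def chunked_lm_head_loss(hidden: torch.Tensor,    # [S, d] final hidden states
                         lm_head: torch.nn.Linear,
                         labels: torch.Tensor,    # [S] target token ids
                         num_chunks: int = 4) -> torch.Tensor:
    total_loss = hidden.new_zeros(())
    total_tokens = 0
    for h_chunk, y_chunk in zip(hidden.chunk(num_chunks, dim=0),
                                labels.chunk(num_chunks, dim=0)):
        logits = lm_head(h_chunk)  # [S/M, vocab] is the only large temporary in memory
        total_loss = total_loss + F.cross_entropy(logits, y_chunk, reduction="sum")
        total_tokens += y_chunk.numel()
    return total_loss / total_tokens
```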



2/3
@rohanpaul_ai
Part (a) shows the conventional Transformer architecture with MLP and LM-Head blocks processing full sequence length S.

Part (b) illustrates the proposed MST, which splits the input sequence S into M mini-sequences of length S/M (M=2 in this example).



3/3
@rohanpaul_ai
Memory consumption of pre-training Llama3-8B and Gemma2-9B models with a batch size of 1 on a single A100 device, with activation recomputation and MST.

📚 https://arxiv.org/pdf/2407.15892




 


1/1
Chat with videos using LLaVA-NeXT-Video, VideoLLaMA2, Video-LLaVA

[Quoted tweet]
🚀 Exciting News! 🎉 We’re thrilled to announce the release of the new Video Arena 🎥🤖 tab in our WildVision-Arena! You can now chat with videos using LLaVA-NeXT-Video, VideoLLaMA2, Video-LLaVA, with more VideoLLMs coming soon.

Huge thanks to the amazing team for their hard work! 🙌Special shoutout to @DongfuJiang, Yingzi Ma, @WenhuChen, @WilliamWangNLP , @YejinChoinka, @billyuchenlin, and the entire team.

Demo🤗: huggingface.co/spaces/WildVi…

Other resources🔗
WildVision-Bench: huggingface.co/datasets/Wild…
WildVision-Chat: huggingface.co/datasets/Wild…
Paper: arxiv.org/abs/2406.11069
Github: github.com/orgs/WildVision-A…

#WildVisionArena #VideoArena #Video-LLM








1/2
🚀 Exciting News! 🎉 We’re thrilled to announce the release of the new Video Arena 🎥🤖 tab in our WildVision-Arena! You can now chat with videos using LLaVA-NeXT-Video, VideoLLaMA2, Video-LLaVA, with more VideoLLMs coming soon.

Huge thanks to the amazing team for their hard work! 🙌Special shoutout to @DongfuJiang, Yingzi Ma, @WenhuChen, @WilliamWangNLP , @YejinChoinka, @billyuchenlin, and the entire team.

Demo🤗: Vision Arena (Testing VLMs side-by-side) - a Hugging Face Space by WildVision

Other resources🔗
WildVision-Bench: WildVision/wildvision-bench · Datasets at Hugging Face
WildVision-Chat: WildVision/wildvision-chat · Datasets at Hugging Face
Paper: [2406.11069] WildVision: Evaluating Vision-Language Models in the Wild with Human Preferences
Github: WildVision-AI

#WildVisionArena #VideoArena #Video-LLM



2/2
Thanks for adding this model! It’s now serving in the Arena. Try it out.




 



1/2
Starting this week, Advanced Voice is rolling out to all ChatGPT Enterprise, Edu, and Team users globally. Free users will also get a sneak peek of Advanced Voice.

Plus and Free users in the EU…we’ll keep you updated, we promise.

2/2
To access Advanced Voice, remember to download the latest version of the ChatGPT app.











1/11
@OpenAI
Advanced Voice is rolling out to all Plus and Team users in the ChatGPT app over the course of the week.

While you’ve been patiently waiting, we’ve added Custom Instructions, Memory, five new voices, and improved accents.

It can also say “Sorry I’m late” in over 50 languages.



2/11
@OpenAI
If you are a Plus or Team user, you will see a notification in the app when you have access to Advanced Voice.



3/11
@OpenAI
Meet the five new voices.



4/11
@OpenAI
Set Custom Instructions for Advanced Voice.



5/11
@OpenAI
We’ve also improved conversational speed, smoothness, and accents in select foreign languages.



6/11
@OpenAI
Advanced Voice is not yet available in the EU, the UK, Switzerland, Iceland, Norway, and Liechtenstein.



7/11
@GozuMighty
voice.gpt.eth



8/11
@spffspcmn
We need that Her voice back. Juniper just doesn't cut it for me.



9/11
@Maik_Busse
If you're from the EU and don't have access, please like for confirmation💀



10/11
@reach_vb
Congrats on shipping Advanced Voice Mode! At the same time I’m quite happy to see Open Source catching up:

Moshi v0.1 Release - a kyutai Collection



11/11
@ai_for_success
My meme was correct 😀




 



MIT spinoff Liquid debuts non-transformer AI models and they’re already state-of-the-art​


Carl Franzen@carlfranzen

September 30, 2024 2:16 PM

Liquid ripples over the surface of glowing blue and purple circuitry against a black backdrop


Credit: VentureBeat made with OpenAI ChatGPT


Liquid AI, a startup co-founded by former researchers from the Massachusetts Institute of Technology (MIT)’s Computer Science and Artificial Intelligence Laboratory (CSAIL), has announced the debut of its first multimodal AI models: the “Liquid Foundation Models (LFMs).”

Unlike most others of the current generative AI wave, these models are not based around the transformer architecture outlined in the seminal 2017 paper “Attention Is All You Need.”

Instead, Liquid states that its goal “is to explore ways to build foundation models beyond Generative Pre-trained Transformers (GPTs)” and with the new LFMs, specifically building from “first principles…the same way engineers built engines, cars, and airplanes.”

It seems they’ve done just that — as the new LFM models already boast superior performance to other transformer-based ones of comparable size such as Meta’s Llama 3.1-8B and Microsoft’s Phi-3.5 3.8B.

Liquid’s LFMs currently come in three different sizes and variants:

  • LFM 1.3B (smallest)
  • LFM 3B
  • LFM 40B MoE (largest, a “Mixture-of-Experts” model similar to Mistral’s Mixtral)

The “B” in their name stands for billion and refers to the number of parameters — or settings — that govern the model’s information processing, analysis, and output generation. Generally, models with a higher number of parameters are more capable across a wider range of tasks.



Already, Liquid AI says the LFM 1.3B version outperforms Meta’s new Llama 3.2-1.2B and Microsoft’s Phi-1.5 on many leading third-party benchmarks including the popular Massive Multitask Language Understanding (MMLU) consisting of 57 problems across science, tech, engineering and math (STEM) fields, “the first time a non-GPT architecture significantly outperforms transformer-based models.”

All three are designed to offer state-of-the-art performance while optimizing for memory efficiency, with Liquid’s LFM-3B requiring only 16 GB of memory compared to the more than 48 GB required by Meta’s Llama-3.2-3B model (shown in the chart above).



Maxime Labonne, Head of Post-Training at Liquid AI, took to his account on X to say the LFMs were “the proudest release of my career :smile:” and to clarify the core advantage of LFMs: their ability to outperform transformer-based models while using significantly less memory.

This is the proudest release of my career :smile:

At @LiquidAI_, we're launching three LLMs (1B, 3B, 40B MoE) with SOTA performance, based on a custom architecture.

Minimal memory footprint & efficient inference bring long context tasks to edge devices for the first time! pic.twitter.com/v9DelExyTa

— Maxime Labonne (@maximelabonne) September 30, 2024

The models are engineered to be competitive not only on raw performance benchmarks but also in terms of operational efficiency, making them ideal for a variety of use cases, from enterprise-level applications specifically in the fields of financial services, biotechnology, and consumer electronics, to deployment on edge devices.

However, importantly for prospective users and customers, the models are not open source. Instead, users will need to access them through Liquid’s inference playground, Lambda Chat, or Perplexity AI.


How Liquid is going ‘beyond’ the generative pre-trained transformer (GPT)​


In this case, Liquid says it used a blend of “computational units deeply rooted in the theory of dynamical systems, signal processing, and numerical linear algebra,” and that the result is “general-purpose AI models that can be used to model any kind of sequential data, including video, audio, text, time series, and signals” to train its new LFMs.

Last year, VentureBeat covered more about Liquid’s approach to training post-transformer AI models, noting at the time that it was using Liquid Neural Networks (LNNs), an architecture developed at CSAIL that seeks to make the artificial “neurons,” or nodes for transforming data, more efficient and adaptable.

Unlike traditional deep learning models, which require thousands of neurons to perform complex tasks, LNNs demonstrated that fewer neurons—combined with innovative mathematical formulations—could achieve the same results.

Liquid AI’s new models retain the core benefits of this adaptability, allowing for real-time adjustments during inference without the computational overhead associated with traditional models, handling up to 1 million tokens efficiently, while keeping memory usage to a minimum.
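
For readers wondering what those mathematical formulations look like, the liquid time-constant formulation from the earlier CSAIL line of work takes roughly this form — included as background only, since Liquid AI has not published the exact LFM equations in this announcement:

$$\frac{d\mathbf{x}(t)}{dt} = -\left[\frac{1}{\tau} + f\big(\mathbf{x}(t),\mathbf{I}(t),t,\theta\big)\right]\mathbf{x}(t) + f\big(\mathbf{x}(t),\mathbf{I}(t),t,\theta\big)\,A$$

where x(t) is the hidden state, I(t) the input, τ a time constant, A a bias vector, and f a small learned network whose output modulates the effective time constant of each neuron.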

A chart from the Liquid blog shows that the LFM-3B model, for instance, outperforms popular models like Google’s Gemma-2, Microsoft’s Phi-3, and Meta’s Llama-3.2 in terms of inference memory footprint, especially as token length scales.



While other models experience a sharp increase in memory usage for long-context processing, LFM-3B maintains a significantly smaller footprint, making it highly suitable for applications requiring large volumes of sequential data processing, such as document analysis or chatbots.

Liquid AI has built its foundation models to be versatile across multiple data modalities, including audio, video, and text.

With this multimodal capability, Liquid aims to address a wide range of industry-specific challenges, from financial services to biotechnology and consumer electronics.


Accepting invitations for launch event and eyeing future improvements​


Liquid AI says it is optimizing its models for deployment on hardware from NVIDIA, AMD, Apple, Qualcomm, and Cerebras.

While the models are still in the preview phase, Liquid AI invites early adopters and developers to test the models and provide feedback.

Labonne noted that while things are “not perfect,” the feedback received during this phase will help the team refine their offerings in preparation for a full launch event on October 23, 2024, at MIT’s Kresge Auditorium in Cambridge, MA. The company is accepting RSVPs for in-person attendees of that event.

As part of its commitment to transparency and scientific progress, Liquid says it will release a series of technical blog posts leading up to the product launch event.

The company also plans to engage in red-teaming efforts, encouraging users to test the limits of their models to improve future iterations.

With the introduction of Liquid Foundation Models, Liquid AI is positioning itself as a key player in the foundation model space. By combining state-of-the-art performance with unprecedented memory efficiency, LFMs offer a compelling alternative to traditional transformer-based models.
 



Ai2’s new Molmo open source AI models beat GPT-4o, Claude on some benchmarks​


Carl Franzen@carlfranzen

September 25, 2024 2:48 PM



Silhouette of man typing on laptop against purple orange code backdrop standing on curving planet surface


Credit: VentureBeat made with Midjourney



The Allen Institute for AI (Ai2) today unveiled Molmo, an open-source family of state-of-the-art multimodal AI models which outperform top proprietary rivals including OpenAI’s GPT-4o, Anthropic’s Claude 3.5 Sonnet, and Google’s Gemini 1.5 on several third-party benchmarks.

Being multimodal, the models can therefore accept and analyze imagery and files — similar to the leading proprietary foundation models.

Yet, Ai2 also noted in a post on X that Molmo uses “1000x less data” than the proprietary rivals — thanks to some clever new training techniques described in greater detail below and in a technical report paper published by the Paul Allen-founded and Ali Farhadi-led company.

Ai2 also posted a video to YouTube and its social accounts showing how Molmo can be used on a smartphone to rapidly analyze what’s in front of the user — by having them snap a photo and send it to the AI. In less than a second, it can count the number of people in a scene, discern whether a menu item is vegan, analyze flyers taped to a lamppost and determine which bands are electronic music, and even convert handwritten notes on a whiteboard into a table.



Ai2 says the release underscores its commitment to open research by offering high-performing models, complete with open weights and data, to the broader community — and of course, companies looking for solutions they can completely own, control, and customize.

It comes on the heels of Ai2’s release two weeks ago of another open model, OLMoE, which is a “mixture of experts” or combination of smaller models designed for cost effectiveness.


Closing the Gap Between Open and Proprietary AI​


Molmo consists of four main models of different parameter sizes and capabilities:

  1. Molmo-72B (72 billion parameters, or settings — the flagship model, based on Alibaba Cloud’s Qwen2-72B open source model)
  2. Molmo-7B-D (“demo model” based on Alibaba’s Qwen2-7B model)
  3. Molmo-7B-O (based on Ai2’s OLMo-7B model)
  4. MolmoE-1B (based on OLMoE-1B-7B mixture-of-experts LLM, and which Ai2 says “nearly matches the performance of GPT-4V on both academic benchmarks and user preference.”)

These models achieve high performance across a range of third-party benchmarks, outpacing many proprietary alternatives. And they’re all available under permissive Apache 2.0 licenses, enabling virtually any sort of usage for research and commercialization (e.g. enterprise grade).

Notably, Molmo-72B leads the pack in academic evaluations, achieving the highest score on 11 key benchmarks and ranking second in user preference, closely following GPT-4o.

Vaibhav Srivastav, a machine learning developer advocate engineer at AI code repository company Hugging Face, commented on the release on X, highlighting that Molmo offers a formidable alternative to closed systems, setting a new standard for open multimodal AI.

Molmo by @allen_ai – Open source SoTA Multimodal (Vision) Language model, beating Claude 3.5 Sonnet, GPT4V and comparable to GPT4o 🔥

They release four model checkpoints:

1. MolmoE-1B, a mixture of experts model with 1B (active) 7B (total)

2. Molmo-7B-O, most open 7B model

3.… pic.twitter.com/9hpARh0GYT

— Vaibhav (VB) Srivastav (@reach_vb) September 25, 2024

In addition, Google DeepMind robotics researcher Ted Xiao took to X to praise the inclusion of pointing data in Molmo, which he sees as a game-changer for visual grounding in robotics.




Molmo is a very exciting multimodal foundation model release, especially for robotics. The emphasis on pointing data makes it the first open VLM optimized for visual grounding — and you can see this clearly with impressive performance on RealworldQA or OOD robotics perception! x.com pic.twitter.com/VHtu9hT2r9

— Ted Xiao (@xiao_ted) September 25, 2024




This capability allows Molmo to provide visual explanations and interact more effectively with physical environments, a feature that is currently lacking in most other multimodal models.

The models are not only high-performing but also entirely open, allowing researchers and developers to access and build upon cutting-edge technology.


Advanced Model Architecture and Training Approach​


Molmo’s architecture is designed to maximize efficiency and performance. All models use OpenAI’s ViT-L/14 336px CLIP model as the vision encoder, which processes multi-scale, multi-crop images into vision tokens.

These tokens are then projected into the language model’s input space through a multi-layer perceptron (MLP) connector and pooled for dimensionality reduction.

The language model component is a decoder-only Transformer, with options ranging from the OLMo series to the Qwen2 and Mistral series, each offering different capacities and openness levels.
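
Schematically, the pipeline described above looks something like the following sketch — an illustration of the description, not Ai2's code, and the exact ordering of pooling and projection is simplified here:

```python
# multi-crop image -> CLIP ViT-L/14 vision tokens -> MLP connector + pooling
# -> prepended to the text embeddings of a decoder-only LLM
import torch
import torch.nn as nn

class MolmoStyleVLM(nn.Module):
    def __init__(self, vision_encoder, connector_mlp, language_model, pool_factor=2):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. a ViT-L/14 CLIP tower
        self.connector = connector_mlp         # MLP: vision dim -> LLM hidden dim
        self.llm = language_model              # decoder-only Transformer
        self.pool = nn.AvgPool1d(pool_factor)  # stand-in for vision-token pooling

    def forward(self, image_crops, text_embeds):
        vis = self.vision_encoder(image_crops)                # [B, N, d_vis]
        vis = self.connector(vis)                             # [B, N, d_llm]
        vis = self.pool(vis.transpose(1, 2)).transpose(1, 2)  # [B, N/pool, d_llm]
        inputs = torch.cat([vis, text_embeds], dim=1)         # vision tokens first
        return self.llm(inputs_embeds=inputs)
```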

The training strategy for Molmo involves two key stages:


  1. Multimodal Pre-training: During this stage, the models are trained to generate captions using newly collected, detailed image descriptions provided by human annotators. This high-quality dataset, named PixMo, is a critical factor in Molmo’s strong performance.

  2. Supervised Fine-Tuning: The models are then fine-tuned on a diverse dataset mixture, including standard academic benchmarks and newly created datasets that enable the models to handle complex real-world tasks like document reading, visual reasoning, and even pointing.

Unlike many contemporary models, Molmo does not rely on reinforcement learning from human feedback (RLHF), focusing instead on a meticulously tuned training pipeline that updates all model parameters based on their pre-training status.


Outperforming on Key Benchmarks​


The Molmo models have shown impressive results across multiple benchmarks, particularly in comparison to proprietary models.

For instance, Molmo-72B scores 96.3 on DocVQA and 85.5 on TextVQA, outperforming both Gemini 1.5 Pro and Claude 3.5 Sonnet in these categories. It further outperforms GPT-4o on AI2D (Ai2’s own benchmark, short for “A Diagram Is Worth A Dozen Images,” a dataset of 5000+ grade school science diagrams and 150,000+ rich annotations), scoring the highest of all model families in comparison at 96.3.



The models also excel in visual grounding tasks, with Molmo-72B achieving top performance on RealWorldQA, making it especially promising for applications in robotics and complex multimodal reasoning.


Open Access and Future Releases​


Ai2 has made these models and datasets accessible on its Hugging Face space, with full compatibility with popular AI frameworks like Transformers.
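
As a hedged sketch, loading one of the checkpoints through Transformers' remote-code path looks roughly like this — the checkpoint id and the processor.process / generate_from_batch helpers are recalled from the model card and may differ from the current release:

```python
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig

repo = "allenai/Molmo-7B-D-0924"  # assumed checkpoint id
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True,
                                             torch_dtype="auto", device_map="auto")

image = Image.open(requests.get("https://picsum.photos/512", stream=True).raw)
inputs = processor.process(images=[image], text="Describe this image.")
inputs = {k: v.to(model.device).unsqueeze(0) for k, v in inputs.items()}

output = model.generate_from_batch(
    inputs,
    GenerationConfig(max_new_tokens=200, stop_strings="<|endoftext|>"),
    tokenizer=processor.tokenizer,
)
generated = output[0, inputs["input_ids"].size(1):]
print(processor.tokenizer.decode(generated, skip_special_tokens=True))
```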

This open access is part of Ai2’s broader vision to foster innovation and collaboration in the AI community.

Over the next few months, Ai2 plans to release additional models, training code, and an expanded version of their technical report, further enriching the resources available to researchers.

For those interested in exploring Molmo’s capabilities, a public demo and several model checkpoints are available now via Molmo’s official page.
 



1/11
@reach_vb
Molmo by @allen_ai - Open source SoTA Multimodal (Vision) Language model, beating Claude 3.5 Sonnet, GPT4V and comparable to GPT4o 🔥

They release four model checkpoints:

1. MolmoE-1B, a mixture of experts model with 1B (active) 7B (total)
2. Molmo-7B-O, most open 7B model
3. Molmo-7B-D, demo model
4. Molmo-72B, best model

System Architecture

> Input: Multi-scale, multi-crop images generated from the original image.

> Vision Encoder: OpenAI's ViT-L/14 336px CLIP model, a powerful ViT, encodes images into vision tokens.

> Connector: MLP projects tokens to LLM input space, followed by pooling for dimensionality reduction.

> LLM: Decoder-only Transformer, various options (OLMo, OLMoE, Qwen2, Mistral, Gemma2, Phi) with diverse scales and openness.

Model Variants

> Vision Encoder: Consistent ViT-L/14 CLIP model across variants.
> LLM: OLMo-7B-1024, OLMoE-1B-7B-0924, Qwen2 (7B, 72B), Mistral 7B, Gemma2 9B, Phi 3 Medium, offering different capacities and openness levels.

Training Strategy

> Stage 1: Multimodal pre-training for caption generation with new captioning data.

> Stage 2: Supervised fine-tuning on a dataset mixture, updating all parameters.

> No RLHF involved; learning rates adjusted based on component types and pre-training status.

> All the weights are available on Hugging Face Hub 🤗
> Compatible with Transformers (Remote Code)

Kudos @allen_ai for such a brilliant and open work! 🐐

Video credits: Allen AI YT Channel



2/11
@reach_vb
Check out their model checkpoints on the Hub:

Molmo - a allenai Collection



3/11
@iamrobotbear
Wait, @allen_ai Did I miss something, what is the main difference between 7b-o and d?

I know the 7B-D is the version on your demo, but in terms of the model or capabilities, I'm a bit confused.



4/11
@A_Reichenbach_
Any idea why they didn’t use rlhf/dpo?



5/11
@ccerrato147
Paul Allen really left an impressive legacy.



6/11
@heyitsyorkie
Nice! New vision model that will need support in llama.cpp!



7/11
@HantianPang
amazing



8/11
@invisyblecom
@ollama



9/11
@EverydayAILabs
Heartening to see how much open source models are progressing.



10/11
@sartify_co


[Quoted tweet]
We’re thrilled to be part of the 2024 Mozilla Builders Accelerator!
Our project, Swahili LLMs, will bring open-source AI to empower Swahili speakers 🌍.
Exciting 12 weeks ahead!
Learn more here: mozilla.org/en-US/builders/
@MozillaHacks @MozillaAI

#MozillaBuilders #AI #Swahili


11/11
@genesshk
Great to see advancements in open source multimodal models!






1/1
@xiao_ted
Molmo is a very exciting multimodal foundation model release, especially for robotics. The emphasis on pointing data makes it the first open VLM optimized for visual grounding — and you can see this clearly with impressive performance on RealworldQA or OOD robotics perception!

[Quoted tweet]
Try out Molmo on your application! This is a great example by @DJiafei! We have a few videos describing Molmo's different capabilities on our blog! molmo.allenai.org/blog

This one is me trying it out on a bunch of tasks and images from RT-X: invidious.poast.org/bHOBGAYNBNI






1/2
Try out Molmo on your application! This is a great example by @DJiafei! We have a few videos describing Molmo's different capabilities on our blog! https://molmo.allenai.org/blog

This one is me trying it out on a bunch of tasks and images from RT-X: https://invidious.poast.org/bHOBGAYNBNI
[Quoted tweet]
The idea of using a VLM for pointing, RoboPoint has proven useful and generalizable for robotic manipulation. But the next challenge is: can VLMs draw multiple "points" to form complete robotic trajectories? @allen_ai 's new Molmo seems up to the task—very exciting!



2/2
Thank you Pannag :smile:



