bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,801


1/3
@rohanpaul_ai
Reverse Engineering o1 OpenAI Architecture with Claude 👀



2/3
@NorbertEnders
The reverse-engineered OpenAI o1 architecture, simplified and explained in a more narrative style using layman's terms.
I used Claude 3.5 Sonnet for that.

Keep in mind: it’s just an educated guess



3/3
@NorbertEnders
Longer version:

Imagine a brilliant but inexperienced chef named Alex. Alex's goal is to become a master chef who can create amazing dishes on the spot, adapting to any ingredient or cuisine challenge. This is like our language model aiming to provide intelligent, reasoned responses to any query.

Alex's journey begins with intense preparation:

First, Alex gathers recipes. Some are from famous cookbooks, others from family traditions, and many are creative variations Alex invents. This is like our model's Data Generation phase, collecting a mix of real and synthetic data to learn from.

Next comes Alex's training. It's not just about memorizing recipes, but understanding the principles of cooking. Alex practices in a special kitchen (our Training Phase) where:

1. Basic cooking techniques are mastered (Language Model training).
2. Alex plays cooking games, getting points for tasty dishes and helpful feedback when things go wrong (Reinforcement Learning).
3. Sometimes, the kitchen throws curveballs - like changing ingredients mid-recipe or having multiple chefs compete (Advanced RL techniques).

This training isn't a one-time thing. Alex keeps learning, always aiming to improve.

Now, here's where the real magic happens - when Alex faces actual cooking challenges (our Inference Phase):

1. A customer orders a dish. Alex quickly thinks of a recipe (Initial CoT Generation).
2. While cooking, Alex tastes the dish and adjusts seasonings (CoT Refinement).
3. For simple dishes, Alex works quickly. For complex ones, Alex takes more time to perfect them (Test-time Compute).
4. Alex always keeps an eye on the clock, balancing perfection with serving time (Efficiency Monitoring).
5. Finally, the dish is served (Final Response).
6. Alex remembers this experience for future reference (CoT Storage).
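
To make the mapping concrete, here is a minimal Python sketch of how those six inference stages might chain together. Every function and attribute on `model` and `cot_store` is a hypothetical placeholder, a guess at the guessed architecture, not OpenAI's actual code:

```python
# Hypothetical sketch of the guessed o1 inference loop; all names are placeholders.
def answer(query, model, cot_store, time_budget):
    cot = model.generate_cot(query)                # 1. Initial CoT Generation
    spent = 0.0
    while spent < time_budget:                     # 3. Test-time Compute:
        critique = model.evaluate(query, cot)      #    harder queries get more loops
        if critique.good_enough:
            break
        cot = model.refine(query, cot, critique)   # 2. CoT Refinement
        spent += critique.cost                     # 4. Efficiency Monitoring
    response = model.summarize(query, cot)         # 5. Final Response
    cot_store.save(query, cot, response)           # 6. CoT Storage
    return response
```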

The key here is Alex's ability to reason and adapt on the spot. It's not about rigidly following recipes, but understanding cooking principles deeply enough to create new dishes or solve unexpected problems.

What makes Alex special is the constant improvement. After each shift, Alex reviews the day's challenges, learning from successes and mistakes (feedback loop). Over time, Alex becomes more efficient, creative, and adaptable.

In our language model, this inference process is where the real value lies. It's the ability to take a query (like a cooking order), reason through it (like Alex combining cooking knowledge to create a dish), and produce a thoughtful, tailored response (serving the perfect dish).

The rest of the system - the data collection, the intense training - are all in service of this moment of creation. They're crucial, but they're the behind-the-scenes work. The real magic, the part that amazes the 'customers' (users), happens in this inference stage.

Just as a master chef can delight diners with unique, perfectly crafted dishes for any request, our advanced language model aims to provide insightful, reasoned responses to any query, always learning and improving with each interaction.





1/28
@CodeByPoonam
Google just dropped a bombshell

NotebookLM can now turn your notes into a Podcast in minutes.

I'll show you how in just 3 easy steps:



2/28
@CodeByPoonam
Google introduces a new Audio Overview feature that can turn documents, slides, charts, and more into engaging discussions with one click.

To try it out, follow these steps:

1/ Go to NotebookLM and sign in with your Google account.
- Create a new notebook.



3/28
@CodeByPoonam
2/ Add at least one source.
3/ In your Notebook guide, click on the “Generate” button to create an Audio Overview.



4/28
@CodeByPoonam
I uploaded my newsletter edition: AI Toast.

With one click, two AI hosts start up a lively “deep dive” discussion based on your sources.

Listen here 🔊



5/28
@CodeByPoonam
Read more here:
OpenAI released next big thing in AI



6/28
@CodeByPoonam
Thanks for reading.

Get the latest AI updates and tutorials in your inbox for FREE.

Join my AI Toast Community of 22000 readers:
AI Toast



7/28
@CodeByPoonam
Don't forget to bookmark for later.

If you enjoyed reading this post, please support it with like/repost of the post below 👇

[Quoted tweet]
Google just dropped a bombshell

NotebookLM can now turn your notes into a Podcast in minutes.

I'll show you how in just 3 easy steps:


8/28
@hasantoxr
Perfect guide 🙌🙌



9/28
@CodeByPoonam
Thanks for checking



10/28
@iamfakhrealam
It's surprising



11/28
@codedailyML
Amazing Share



12/28
@codeMdSanto
That's a game-changer! Technology never fails to amaze. Can't wait to see how it works!



13/28
@shawnchauhan1
That's awesome! Turning notes into a podcast that fast seems like a total productivity hack.



14/28
@AndrewBolis
Creating podcasts is easier than ever



15/28
@EyeingAI
Impressive guide, thanks for sharing.



16/28
@Klotzkette
It’s OK, but you can’t really give it any direction, so it’s useless



17/28
@vidhiparmxr
Helpful guide, Poonam!



18/28
@arnill_dev
That's like magic! Can't wait to see how it works. Exciting stuff!



19/28
@alifcoder
That's amazing! Turning notes into a podcast sounds so convenient.

Can't wait to see how it works.



20/28
@leo_grundstrom
Really cool stuff, thanks for sharing Poonam!



21/28
@LearnWithBishal
Wow this looks amazing



22/28
@shushant_l
This has made podcast creation super easy



23/28
@Parul_Gautam7
Excellent breakdown

Thanks for sharing Poonam



24/28
@jxffb
Just did one! So awesome!



25/28
@iam_kgkunal
That's amazing...Turning notes into a podcast so quickly sounds like a game-changer for productivity



26/28
@chriskclark
Here’s how we implemented this AI app in real life (yesterday).

[Quoted tweet]
was playing with NotebookLM today as well. Here’s how I implemented the audio podcast mode (what I’m calling it) on an article today. You can listen to the AI-generated conversation here —> agingtoday.com/health/fall-p…


27/28
@DreamWithO
I'd love to see this in action, how's the audio quality compared to traditional podcasting software?



28/28
@ThePushkaraj
The AI space is getting crazier day by day!




1/13
@minchoi
Google dropped NotebookLM recently.

AI tool that can generate podcasts of two speakers talking about the contents from various sources like research papers, articles, and more.

Absolutely bonkers.

100% AI 🤯

10 examples (and how to try):

1. AI Podcast about OpenAI o1 drop



2/13
@minchoi
2. AI Podcast from Newsletter

[Quoted tweet]
Very impressed with this new NotebookLM feature by Google Labs that turns notes/docs into podcasts

I uploaded this morning's newsletter, and it turned into a two-way podcast between two AI agent hosts

Give it a listen, pretty darn good (sound on 🔈)


3/13
@minchoi
3. AI Podcast from 90 min lecture

[Quoted tweet]
Google's NotebookLM's new podcast feature is wild

This is made from a 90min lecture I held on Monday

It condensed it into a 16 minute talkshow

Some hallucinations here and there, but overall this is a new paradigm for learning.

Link to try it below, no waitlist


4/13
@minchoi
4. AI Podcast from book "The Infernal Machine"

[Quoted tweet]
Rolling out audio overviews at NotebookLM today. So excited for this one.

Take any collection of sources and automatically generate a "deep dive" audio conversation.

I created one based on the text of my book The Infernal Machine. Have a listen. 🧵below

notebooklm.google.com


5/13
@minchoi
5. AI Podcast from Research Paper

[Quoted tweet]
So, Google just dropped #NotebookLM, an AI that creates podcast segments on research papers nearly instantly.

Here's the thing though, it doesn't check to see if anything you feed it is true, sooooo I plugged in my found footage creepypasta.

The results are amazing.😄

@labsdotgoogle


6/13
@minchoi
6. AI Podcast from Overview of NotebookLM

[Quoted tweet]
Just had my 3rd wow moment in AI... this time through AI Overview by NotebookLM 🤯


7/13
@minchoi
7. AI Podcast from paper "On the Category of Religion"

[Quoted tweet]
🤯 My mind is genuinely blown by Google's NotebookLM new Audio Overview feature. It creates a podcast for a document.

Here's a podcast for our paper "On the Category of Religion" that @willismonroe created.

I genuinely would not have known it was AI...


8/13
@minchoi
8. AI Podcast from System Card for OpenAI o1

[Quoted tweet]
Do you want to see something impressive?
This podcast isn’t real.
It’s AI-generated after I gave Google’s NotebookLM the system card for OpenAI’s new o1 model, and it produced a 10-minute podcast discussion that feels incredibly real, better, more informative, and more entertaining than most actual tech podcasts.


9/13
@minchoi
9. AI Podcast from News reports on "Black Myth: Wukong"

[Quoted tweet]
Using NotebookLM to quickly generate English news reports on "Black Myth: Wukong"

As everyone already knows, NotebookLM is an AI note-taking service from Google. It can integrate all kinds of documents, links, and plain text for free, generating summaries, tables of contents, Q&A, and more for you.

Today it launched Audio Overview, meaning it produces a conversational show from the content of your notes. The length depends on how much content you have; generation takes about 10 minutes or less, and it currently only supports English.

I used the Black Myth: Wukong material I had on hand to make the following:


10/13
@minchoi
10. AI Podcast from College thesis

[Quoted tweet]
This AI service is so impressive! Google's NotebookLM is now capable of generating an audio overview based on documents uploaded and links to online resources.

I uploaded my bachelors thesis, my resume, and a link to my online course website and it created this really cool podcast like format.

It didn't get everything right, but it's so funny because NotebookLM actually drew great conclusions that I didn’t think about while writing this thesis myself.

Which AI tool could create a video for this audio file?

@labsdotgoogle #RenewableEnergy #offgridpower #batterystorage #SolarEnergy #AI


11/13
@minchoi
Try it out yourself, head over to 👇
NotebookLM (sign in with your Google account)



12/13
@minchoi
If you enjoyed this thread,

Follow me @minchoi and please Bookmark, Like, Comment & Repost the first Post below to share with your friends:

[Quoted tweet]
Google dropped NotebookLM recently.

AI tool that can generate podcasts of two speakers talking about the contents from various sources like research papers, articles, and more.

Absolutely bonkers.

100% AI 🤯

10 examples (and how to try):

1. AI Podcast about OpenAI o1 drop


13/13
@minchoi
If you want to keep up with the latest AI developments and tools, subscribe to The Rundown. It's FREE.

And you'll never miss a thing in AI again:
The Rundown AI




DeepMind understands Strawberry - there is no moat


[Submitted on 6 Aug 2024]

Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters​


Charlie Snell, Jaehoon Lee, Kelvin Xu, Aviral Kumar

Enabling LLMs to improve their outputs by using more test-time computation is a critical step towards building generally self-improving agents that can operate on open-ended natural language. In this paper, we study the scaling of inference-time computation in LLMs, with a focus on answering the question: if an LLM is allowed to use a fixed but non-trivial amount of inference-time compute, how much can it improve its performance on a challenging prompt? Answering this question has implications not only on the achievable performance of LLMs, but also on the future of LLM pretraining and how one should trade off inference-time and pre-training compute. Despite its importance, little research has attempted to understand the scaling behaviors of various test-time inference methods. Moreover, current work largely provides negative results for a number of these strategies. In this work, we analyze two primary mechanisms to scale test-time computation: (1) searching against dense, process-based verifier reward models; and (2) updating the model's distribution over a response adaptively, given the prompt at test time. We find that in both cases, the effectiveness of different approaches to scaling test-time compute critically varies depending on the difficulty of the prompt. This observation motivates applying a "compute-optimal" scaling strategy, which acts to most effectively allocate test-time compute adaptively per prompt. Using this compute-optimal strategy, we can improve the efficiency of test-time compute scaling by more than 4x compared to a best-of-N baseline. Additionally, in a FLOPs-matched evaluation, we find that on problems where a smaller base model attains somewhat non-trivial success rates, test-time compute can be used to outperform a 14x larger model.


Subjects: Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:arXiv:2408.03314 [cs.LG]
(or arXiv:2408.03314v1 [cs.LG] for this version)
[2408.03314] Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters

Submission history​

From: Charlie Snell [view email]
[v1] Tue, 6 Aug 2024 17:35:05 UTC (4,152 KB)
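
For a concrete picture of the baseline the abstract mentions, here is a minimal sketch of best-of-N sampling against a verifier, plus a crude version of the "compute-optimal" idea of allocating samples per prompt by difficulty. `generate` and `verifier_score` are stand-ins for a real LLM and reward model, and the difficulty-based allocation is an illustrative assumption; the paper's actual strategy is more involved:

```python
# Minimal sketch of the best-of-N baseline: sample N candidate solutions,
# score each with a verifier (e.g., a process-based reward model), keep the best.
def best_of_n(prompt, generate, verifier_score, n=16):
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: verifier_score(prompt, c))

# The "compute-optimal" idea, crudely: instead of a fixed N for every prompt,
# allocate more samples to prompts estimated to be harder.
def compute_optimal(prompt, difficulty, generate, verifier_score, budget=64):
    n = max(1, round(budget * difficulty))  # difficulty in [0, 1] is an assumption
    return best_of_n(prompt, generate, verifier_score, n=n)
```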

 


1/11
@denny_zhou
What is the performance limit when scaling LLM inference? Sky's the limit.

We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient.

[2402.12875] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (ICLR 2024)



2/11
@denny_zhou
Just noticed a fun YouTube video explaining this paper. LoL. Pointed out by @laion_ai http://invidious.poast.org/4JNe-cOTgkY



3/11
@ctjlewis
hey Denny, curious if you have any thoughts. i reached the same conclusion:

[Quoted tweet]
x.com/i/article/178554774683…


4/11
@denny_zhou
Impressive! You would be interested at seeing this: [2301.04589] Memory Augmented Large Language Models are Computationally Universal



5/11
@nearcyan
what should one conclude from such a proof if it’s not also accompanied by a proof that we can train a transformer into the state (of solving a given arbitrary problem), possibly even with gradient descent and common post training techniques?



6/11
@QuintusActual
“We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.”

I’m guessing this is only true because as a problem grows in difficulty, the # of required tokens approaches ♾️



7/11
@Shawnryan96
How do they solve novel problems without a way to update the world model?



8/11
@Justin_Halford_
Makes sense for verifiable domains (e.g. math and coding).

Does this generalize to more ambiguous domains with competing values/incentives without relying on human feedback?



9/11
@ohadasor
Don't fall into it!!

[Quoted tweet]
"can solve any problem"? Really?? Let's read the abstract in the image attached to the post, and see if the quote is correct. Ah wow! Somehow he forgot to quote the rest of the sentence! How is that possible?
The full quote is "can solve any problem solvable by boolean circuits of size T". This changes a lot. All problems solvable by Boolean circuits, of any size, is called the Circuit Evaluation Problem, and is known to cover precisely polynomial time (P) calculations. So it cannot solve the most basic logical problems which are at least exponential. Now here we don't even have P, we have only circuits of size T, which validates my old mantra: it can solve only constant-time problems. The lowest possible complexity class.
And it also validates my claim about the bubble of machine learning promoted by people who have no idea what they're talking about.


10/11
@CompSciFutures
Thx, refreshingly straightforward notation too, I might take the time to read this one properly.

I'm just catching up and have a dumb Q... that is an interestingly narrow subset of symbolic operands. Have you considered what happens if you add more?



11/11
@BatAndrew314
Noob question: how is this related to the universal approximation theorem? Meaning, can a transformer solve any problem because it is a neural net? Or is it some different property of transformers and CoT?






[Submitted on 20 Feb 2024 (v1), last revised 23 May 2024 (this version, v3)]


Chain of Thought Empowers Transformers to Solve Inherently Serial Problems​


Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma

Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length n, previous works have shown that constant-depth transformers with finite precision poly(n) embedding size can only solve problems in TC^0 without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in AC^0, a proper subset of TC^0. However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(log n) embedding size can solve any problem solvable by boolean circuits of size T. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.

Comments:38 pages, 10 figures. Accepted by ICLR 2024
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Cite as:arXiv:2402.12875 [cs.LG]
(or arXiv:2402.12875v3 [cs.LG] for this version)
[2402.12875] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems


Submission history​

From: Zhiyuan Li [view email]

[v1] Tue, 20 Feb 2024 10:11:03 UTC (3,184 KB)
[v2] Tue, 7 May 2024 17:00:27 UTC (5,555 KB)
[v3] Thu, 23 May 2024 17:10:39 UTC (5,555 KB)
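
To see what "inherently serial" means here, consider iterated squaring, one of the paper's benchmark tasks. Computing x^(2^T) mod p is believed to require T sequential squarings; a chain of thought lets a constant-depth transformer externalize each step as a token rather than performing all T steps inside a single forward pass. A tiny runnable illustration:

```python
# Iterated squaring, one of the paper's "inherently serial" tasks: each squaring
# depends on the previous result, so the T steps cannot be parallelized. A CoT
# emits each intermediate value as a token instead of computing them all at once.
def iterated_squaring_cot(x, T, p):
    steps = []
    for _ in range(T):
        x = (x * x) % p          # one serial step == one CoT token
        steps.append(x)
    return steps                  # the "chain of thought"; steps[-1] is the answer

print(iterated_squaring_cot(7, 5, 101))  # 5 sequential squarings mod 101
```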


 






1/11
@danielhanchen
A transformer's depth affects its reasoning capabilities, whilst model size affects its knowledge capacity

Highly recommend @ZeyuanAllenZhu's video on reasoning in transformers. Experiments show wider nets don't affect reasoning but more depth helps. Video: Invidious - search



2/11
@fleetwood___
Same claim in the MobileLLM paper from @AIatMeta
https://arxiv.org/pdf/2402.14905



3/11
@danielhanchen
Oh interesting - forgot about this paper!!



4/11
@im_datta0
From Gemma 2 paper :smile:



5/11
@danielhanchen
Oh yep remember this! The Gemma 2 paper did many experiments and ablations - forgot depth and width was also an experiment they did!



6/11
@NicholasLiu77
Model size = hidden state size?



7/11
@danielhanchen
Oh model size as in number of parameters of the model! :smile:



8/11
@gerardsans
There’s absolutely no “reasoning” in Transformers.



9/11
@danielhanchen
"Reasoning" needs to be better defined, but the video did show that if you train the LLM on 15 interactions, it can generalize to higher-order interactions.



10/11
@inductionheads
I think they should be triangular - wider at first layers than later layers



11/11
@dejanseo
Daniel, it's time.

Unsloth-xxsmall-uncased
Unsloth-xsmall-uncased
Unsloth-small-uncased
Unsloth-base-uncased
Unsloth-large-uncased
Unsloth-xlarge-uncased
Unsloth-xxlarge-uncased

☝️












1/11
@Swarooprm7
Introducing NATURAL PLAN 🔥: a realistic planning benchmark in natural language!

Key features:
- 3 main tasks: Trip Planning, Meeting Planning, and Calendar Scheduling.
- Supplies all relevant information to the model in the context (e.g., Google Flights, Maps, Calendar)
- No need for a separate tool-use environment: direct LLM calls for evaluations
- Assesses the planning capabilities of large language models (LLMs)

Joint work with my awesome collaborators at @GoogleDeepMind: @HuaixiuZheng, @hughbzhang (now at Scale AI), @xinyun_chen_, @chenmm24, @Azade_na, @Hou_Le, @HengTze, @quocleix, @edchi, @denny_zhou.

Paper: https://arxiv.org/pdf/2406.04520
Dataset and evaluation code will be released
[1/5]
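
Since the benchmark needs no tool-use environment, an evaluation can be a single LLM call per example. A rough sketch of what that might look like; the field names and the exact-match metric are illustrative assumptions, not the released harness:

```python
# Illustrative sketch of a direct-LLM-call evaluation: all flight/calendar
# facts go into the prompt, the model answers directly, and the output plan
# is compared to the reference. Field names here are assumptions.
def normalize(plan: str) -> str:
    return " ".join(plan.lower().split())

def evaluate(example, llm):
    prompt = (
        f"{example['task_description']}\n\n"
        f"Context (flights, distances, calendars):\n{example['context']}\n\n"
        "Produce a day-by-day plan."
    )
    prediction = llm(prompt)
    return normalize(prediction) == normalize(example["golden_plan"])
```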



2/11
@Swarooprm7
NATURAL PLAN is a challenging benchmark for state of the art models. For example, in Trip Planning, GPT-4 and Gemini 1.5 Pro could only achieve 31.1% and 34.8% solve rate respectively.
[2/5]



3/11
@Swarooprm7
Model performance drops drastically as the complexity of the problem increases: e.g. in Trip Planning, all models perform below 5% when there are 10 cities, highlighting a significant gap in planning in natural language for SoTA LLMs.
[3/5]



4/11
@Swarooprm7
Self-correction does not help, and interestingly, the stronger models such as GPT-4 and Gemini 1.5 Pro suffer a bigger loss than others.
[4/5]



5/11
@Swarooprm7
In-context planning experiments show promise: Gemini 1.5 Pro is able to leverage up to 355K tokens of in-context examples, still showing steady improvements.
[5/5]



6/11
@YiTayML
great work swaroop and steven!



7/11
@Swarooprm7
Thank you Yi



8/11
@qinyuan_ye
Cool work!! I've always wanted an AI assistant to plan for weekend fun with friends, accounting for the weather, traffic, carpooling, restaurants and everything... It feels like this will be possible soon!
And btw, Natural Questions ⏩ Instructions ⏩ Plans ⏩ What's next? 😉



9/11
@Swarooprm7
Yes, true AI assistant is the future.
Natural Questions ⏩ Instructions ⏩ Plans ⏩
Your pattern is absolutely spot on. Something else I am working on in that line is coming. Let the suspense be there until then 😀



10/11
@billyuchenlin
Awesome 👏 will try to implement SwiftSage & Lumos agents to see how local LLM agents and hybrid agents perform on it



11/11
@Swarooprm7
Thank you Bill.







1/1
NATURAL PLAN data and eval code is finally up 🔥.
Thank you everyone for your interest and patience!

GitHub - google-deepmind/natural-plan


 


1/1
No-Brainer to use Gemini Flash for vision: Fast, Inexpensive and Accurate!








1/11
@deedydas
Gemini 1.5 Flash is the model people are sleeping on.

It took ~5s to recognize all the books on my shelf. GPT-4o took ~25s!

And $1 gets you 13M tokens on Flash vs 200K tokens on GPT-4o.



2/11
@deedydas
Here's ChatGPT's ~25s in comparison



3/11
@myotherme100
The GCP onboarding is hostile and Gemini is lobotomized.

Speed doesn't make up for it.



4/11
@deedydas
onboarding being bad is an unserious reason to not use a good model



5/11
@KewkD
Why do you believe text being output faster than anyone can read is beneficial or brag worthy, for any model?



6/11
@deedydas
Not all text output from models is meant for human consumption, and even when it is, empirically lower latency leads to higher user retention



7/11
@SteDjokovic
Did you check the results?

Gemini says “left and right” shelves, while GPT correctly identifies top, middle, and bottom.

The Elon Musk biography is on the right but Gemini categorised it as left.

Also, comparing Flash with GPT-4o instead of mini?



8/11
@OfficialLoganK
1.5 Flash multi-modal performance is truly wild for the price, this is going to power the next wave of AI startups.



9/11
@stevenheidel
give gpt-4o-mini a try! also returns results in a flash and is 30x cheaper than 4o



10/11
@0xshai
5 seconds is nuts! Awesome speed.

P.S: Musashi reader as well. 🫡



11/11
@RawSucces
If you want to bypass any AI and get the responses you want:

I’ve made a full video guide on how to do it. Simply reply with "AI", and I'll send it over to you ("must follow so I can DM you").

It is completely free








1/11
@lmsysorg
No more waiting. o1 is officially on Chatbot Arena!

We tested o1-preview and mini with 6K+ community votes.

🥇o1-preview: #1 across the board, especially in Math, Hard Prompts, and Coding. A huge leap in technical performance!
🥈o1-mini: #1 in technical areas, #2 overall.

Huge congrats to @OpenAI on this incredible milestone! Come try the king of LLMs and vote at http://lmarena.ai

More analysis below👇

[Quoted tweet]
Congrats @OpenAI on the exciting o1 release!

o1-preview and o1-mini are now live in Chatbot Arena accepting votes. Come challenge them with your toughest math/reasoning prompts!!


2/11
@lmsysorg
Chatbot Arena Leaderboard overview.

@openai's o1-preview #1 across the board, and o1-mini #1 in technical areas.



3/11
@lmsysorg
Win-rate heat map



4/11
@lmsysorg
Check out full results at http://lmarena.ai/leaderboard!



5/11
@McclaneDet
Given the latency, a human with Google could be o1. Be careful out there, folks (especially check writers).



6/11
@_simonsmith
"AI is hitting a wall."



7/11
@axel_pond
very impressive.

thank you for your great service to the community.



8/11
@QStarETH
Math is the key to unlocking the secrets of the universe. We have arrived...



9/11
@Evinst3in
@sama after o1 is officially #1 across the board on Chatbot Arena 😎



10/11
@JonathanRoseD
It seems like the new LLM meta is going to be training models on CoT strategies and relying on agents in the LLM clients. This has implications. Like, should @ollama consider preemptively adding CoT agents for future supporting models?



11/11
@andromeda74356
Can you add a feature where the user can give some text, you convert it to an embedding, and then show how models rank when only using chats that are close to that embedding, so we can see which models are best for our specific use cases?






1/2
It has been a week with a lot of new AI releases. This is like my third time changing the list this week. I will explain the strengths of each AI I use, so you can know exactly which AI is best for each task. This is what these AIs are good at!

Text Generation (LLM) - (Services):
-(Chat)GPT-4o by OpenAI:
•Questions with chain-of-thought responses
•Questions with concise responses.
•Coding assistant (Basic)
•Summarising
•Spelling Check
•Text Editing
•Formal writing
•Creative writing
•Vision
•Web search
•Math (Python)
•Data analysis (Python)
•Multilingual

-o1 by OpenAI:
•Reasoning with real-world knowledge.

-o1-mini by OpenAI:
•Reasoning
•Coding Composer (Advanced)

-Claude 3.5 Sonnet by Anthropic:
•Answering questions without web search (Recent knowledge cutoff)

-Gemini Flash 1.5 by Google:
•2 million token context window

-Grok 2 (Mini) by xAI:
•Recent facts that aren't popular (X database)
•Chat image generation

Image Generation (Diffusion):
-Flux:
•Best Overall

-Midjourney:
•Best aesthetic

-DALL·E 3:
•Best results with bad prompting.
•Best for 2D like images.

-Stable Diffusion XL by Stability AI:
•Control

-Firefly by Adobe:
•Camera control

Video Generation (Diffusion):
-Kling by Kwai:
•Image to video

-Minimax (Hailuo AI):
•Text to video

-Gen 3 by RunwayML:
•Video to video

-Dream Machine by Luma AI:
•Good results with bad prompting

-Viggle:
•Controlled video animation

-AnimateDiff:
•Good for weird animations

Audio Generation (each is for a different purpose!):
-ElevenLabs / OpenAI's TTS / Stable Audio Open: voiceover + SFX
-Suno / Udio: music

If I haven't mentioned an AI, it's either because I've forgotten about it or it's useless!



2/2
I probably need to make something like this😂











1/11
@Alibaba_Qwen
Welcome to the party of Qwen2.5 foundation models! This time, we have the biggest release ever in the history of Qwen. In brief, we have:

Blog: Qwen2.5: A Party of Foundation Models!
Blog (LLM): Qwen2.5-LLM: Extending the boundary of LLMs
Blog (Coder): Qwen2.5-Coder: Code More, Learn More!
Blog (Math): Qwen2.5-Math: The world's leading open-sourced mathematical LLMs
HF Collection: Qwen2.5 - a Qwen Collection
ModelScope: ModelScope 魔搭社区
HF Demo: Qwen2.5 - a Hugging Face Space by Qwen

* Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B
* Qwen2.5-Coder: 1.5B, 7B, and 32B on the way
* Qwen2.5-Math: 1.5B, 7B, and 72B.

All our open-source models, except for the 3B and 72B variants, are licensed under Apache 2.0. You can find the license files in the respective Hugging Face repositories. Furthermore, we have also open-sourced the **Qwen2-VL-72B**, which features performance enhancements compared to last month's release.

As usual, we not only open-source the bf16 checkpoints but also provide quantized model checkpoints, e.g., GPTQ, AWQ, and GGUF, and thus this time we have a total of over 100 model variants!

Notably, our flagship open-source LLM, Qwen2.5-72B-Instruct, achieves competitive performance against the proprietary models and outcompetes most open-source models in a number of benchmark evaluations!

We heard your requests for 14B and 32B models, so we bring them to you. These two models even demonstrate competitive or superior performance against their predecessor, Qwen2-72B-Instruct!

We care about SLMs as well! The compact 3B model has grasped a wide range of knowledge and is now able to achieve 68 on MMLU, beating Qwen1.5-14B!

Besides the general language models, we still focus on upgrading our expert models. Still remember CodeQwen1.5 and waiting for CodeQwen2? This time we have new models called Qwen2.5-Coder, with two variants of 1.5B and 7B parameters. Both demonstrate very competitive performance against much larger code LLMs or general LLMs!

Last month we released our first math model, Qwen2-Math, and this time we have built Qwen2.5-Math on the base language models of Qwen2.5 and continued our research in reasoning, including CoT and Tool-Integrated Reasoning. What's more, this model now supports both English and Chinese! Qwen2.5-Math is way better than Qwen2-Math, and it might be your best choice of math LLM!

Lastly, if you are satisfied with our Qwen2-VL-72B but find it hard to use, you have no worries now: it is OPEN-SOURCED!

Prepare to start a journey of innovation with our lineup of models! We hope you enjoy them!
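
For anyone who wants to try the checkpoints right away, here is the standard Hugging Face transformers chat recipe; the model id comes from the collection linked above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "Give me a short introduction to Qwen2.5."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
outputs = model.generate(inputs, max_new_tokens=256)
# decode only the newly generated tokens, not the prompt
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```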



2/11
@Alibaba_Qwen
Qwen2.5-72B-Instruct against the opensource models!



3/11
@Alibaba_Qwen
14B and 32B and even a lightweight Turbo model can outcompete GPT-4o-mini!



4/11
@Alibaba_Qwen
Qwen2.5-3B can learn way more knowledge than you might expect!



5/11
@Alibaba_Qwen
Play with all our Qwen2.5 LLMs in a single HF Space!
Qwen2.5 - a Hugging Face Space by Qwen



6/11
@Alibaba_Qwen
The prince of code LLM, Qwen2.5-Coder!



7/11
@Alibaba_Qwen
Break the limit, Qwen2.5-Math!



8/11
@Sentdex
A day early it seems! Epic release.



9/11
@altryne
Whoaaah there! What a massive release!
Congrats to the team for pulling this off!

Will dig in and chat about it tomorrow on the show!

https://nitter.poast.org/i/spaces/1LyxBgkXXMpKN



10/11
@yacineMTB
okay go sleep now 😹



11/11
@Gopinath876
Impressive




1/2
Qwen 2.5 72B, a model competitive with GPT-4 / Sonnet 3.5, is now available for free on Hugging Chat!

GO try it out now! 🔥

[Quoted tweet]
Open Source AI/ML is on fire today! 🔥 Multilingual (29) Qwen 2.5 just dropped w/ 128K context too! The 72B rivals Llama 3.1 405B and beats Mistral Large 2 (123B) ⚡

> Trained on an extensive dataset containing up to 18 trillion tokens

> It surpasses its predecessor, Qwen2, with significantly higher scores on MMLU (85+), HumanEval (85+), and MATH (80+) benchmarks

> Excels in instruction following, generating lengthy texts (over 8K tokens), and understanding structured data like tables. It also shows significant progress in generating structured outputs, particularly JSON.

> Supports over 29 languages, including major global languages, and can handle up to 128K tokens, with a text generation capacity of 8K tokens.

They release specialised models as well:

1. Qwen2.5: 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B

2. Qwen2.5-Coder: 1.5B, 7B, and 32B on the way

3. Qwen2.5-Math: 1.5B, 7B, and 72B.

Kudos to @Alibaba_Qwen team for shipping high quality model checkpoints! 🐐


2/2
HuggingChat



1/6
@_philschmid
GPT-4 for coding at home! Qwen 2.5 Coder 7B outperforms @OpenAI GPT-4 0613 and open LLMs < 33B, including @BigCodeProject StarCoder, @MistralAI Codestral, and DeepSeek, and is released under Apache 2.0. 🤯

Details:
🚀 Three model sizes: 1.5B, 7B, and 32B (coming soon), with up to 128K tokens of context using YaRN
📚 Pre-trained on 5.5 trillion tokens, post-trained on tens of millions of examples (no details on token counts)
⚖️ A 7:2:1 ratio of public code data, synthetic data, and text data outperformed other combinations, even those with a higher code proportion.
✅ Built scalable synthetic data generation using LLM scorers, checklist-based scoring, and a sandbox for code verification to filter out low-quality data.
🌐 Trained on 92+ programming languages and incorporated multilingual code instruction data
📏 To improve long context, they create instruction pairs in FIM format using ASTs (see the example after this tweet)
🎯 Adopted a two-stage post-training process, starting with diverse, lower-quality data (tens of millions of examples) for broad learning, followed by high-quality data with rejection sampling for refinement (millions).
🧹 Performed decontamination on all datasets (pre & post) to ensure integrity, using a 10-gram overlap method
🏆 7B Outperforms other open Code LLMs < 40B, including Mistral Codestral, or Deepseek
🥇 7B matches OpenAI GPT-4 0613 on various benchmarks
🤗 Released under Apache 2.0 and available on @huggingface

Models: Qwen2.5-Coder - a Qwen Collection
Paper: Paper page - Qwen2.5-Coder Technical Report
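
As a concrete illustration of the FIM (fill-in-the-middle) format mentioned above, here is roughly what such a prompt looks like. The special-token names follow the Qwen2.5-Coder model card; verify them against the released tokenizer before relying on this:

```python
# Fill-in-the-middle prompting: the model sees the code before and after a gap
# and is asked to produce the missing middle. Token names are from the model
# card and should be double-checked against the tokenizer.
prefix = "def fib(n):\n    if n < 2:\n        return n\n"
suffix = "\nprint(fib(10))\n"
fim_prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"
# Expected completion would be something like:
#     return fib(n - 1) + fib(n - 2)
print(fim_prompt)
```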



2/6
@andrey_cheptsov
Would be also great to have Claude include to the comparison



3/6
@0xSMW
Do you think the performance holds on real-world scenarios? My observation with the small open models is they struggle with longer prompts or context, making them more of a POC than something usable.



4/6
@joru1000
Thanks. There are open discussions about whether the model holds up to its benchmarks in real coding scenarios. The Qwen team is serious and usually delivers; however, for some reason (quantization or something else?), I do not get good results, and others report similar feedback



5/6
@chatgptdevx
Any API provider that supports QWen 2.5 Coder?



6/6
@yuhui_bear
In actual production environments, it performs better than any LLM below 20B (see the Aider LLM Leaderboards)




1/19
@_philschmid
We have GPT-4 for coding at home! I looked up @OpenAI GPT-4 0613 results for various benchmarks and compared them with @Alibaba_Qwen 2.5 7B coder. 👀

> 15 months after the release of GPT-4 0613, we have an open LLM under Apache 2.0 which performs just as well. 🤯

> GPT-4 pricing is $30/$60 per 1M input/output tokens, while a ~7-8B model is at $0.09/$0.09; that's a cost reduction of ~333-666x, or if you run it on your own machine, it's “free”. 💰

Still Mindblown. Full post about Qwen 2.5 tomorrow. 🫡
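
The cost-reduction figure is easy to verify from the per-1M-token prices quoted above:

```python
# Sanity-checking the quoted ~333-666x figure (all prices per 1M tokens)
gpt4_input, gpt4_output = 30.00, 60.00   # GPT-4 0613, as quoted above
small_input, small_output = 0.09, 0.09   # hosted ~7-8B pricing, as quoted above
print(gpt4_input / small_input)    # ~333x on input tokens
print(gpt4_output / small_output)  # ~667x on output tokens
```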

2/19
@_philschmid


3/19
@nisten
The 8-bit 1.5B is getting the usual questions right as well, while running locally on a phone.
Time to rethink the scaling laws lol

4/19
@S_yahrul123
Could you do a guide on how to run this for an LLM beginner?

5/19
@hallmark_nick
How does this compare with Sonnet-3.5?

6/19
@j6sp5r
It might be impossible to beat OpenAI at everything, but it's totally possible to beat OpenAI on specific problems.

OpenAI is trying to solve the general problem and they are good at it. But if we focus on specific problems, it's rather easy for us to surpass their performance in these areas. They cannot be better than us in all areas all the time. This is how we stand our ground.

7/19
@edalgomezn
What requirements does it ask for to run it locally?

8/19
@gpt_biz
"Sounds like Qwen 2.5 is really giving GPT-4 a run for its money, can't wait to see how it performs in real-world tasks!"

9/19
@nocodeaidev
Wow, that's incredible progress! Open source LLMs really are changing the game. Looking forward to your full post tomorrow! 🚀

10/19
@beniam1nzh
wait, the full name of Qwen is Alibaba_Qwen?? that's my Chinese LLM model there

11/19
@undeservingfut
One of the nice side effects of the chip export ban is Chinese engineers are working hard on doing more with less and thus helping level the playing field for startups without a big GPU budget.

12/19
@1992Hikaro
From what I know about Chinese junk, at most this would be a trick that works only on benchmarks, nothing else.

13/19
@garyfung
Compare with Claude 3.5 instead

14/19
@APIdeclare
This is so cool. Wonder if its possible to run with cursor or alternative?

15/19
@squizzster
Looks like BMB to me, Bench-Mark-Bias.
a ~7B model beats GPT-4 0613 at coding. (6m old only)
:=) !PLEASE BE RIGHT! :-)
I want you to be right.

16/19
@yar_vol
Annoying thing about AI is that now I am used to prompting o1, so all other models, even my friend Sonnet 3.5, look so dumb..
Do you know of any reasonable OSS effort to reproduce o1?

17/19
@SimplyObjective
Must be true then. The benchmark gods have spoken.

18/19
@be_anon_always
We knew it, and that's why we have been making the local AI co-pilot Pyano: Price, Privacy and Personalised. Launching in five days.

19/19
@LOFI911
Tried the 7B version of this model in @LMStudioAI; the first code it generated threw an error, and when I asked with an additional prompt it just generated the very same code as the first time. Not impressed, unless it's the LM Studio app's fault.
 



1/3
@veryvanya
the first 1bit visionLM has arrived IntelLabs/LlavaOLMoBitnet1B · Hugging Face

[Quoted tweet]
Intel presents LLaVaOLMoBitnet1B

Ternary LLM goes Multimodal!

discuss: huggingface.co/papers/2408.1…

Multimodal Large Language Models (MM-LLMs) have seen significant advancements in the last year, demonstrating impressive performance across tasks. However, to truly democratize AI, models must exhibit strong capabilities and be able to run efficiently on small compute footprints accessible by most. As part of this quest, we introduce LLaVaOLMoBitnet1B - the first Ternary Multimodal LLM capable of accepting Image(s)+Text inputs to produce coherent textual responses. The model is fully open-sourced along with training scripts to encourage further research in this space. This accompanying technical report highlights the training process, evaluation details, challenges associated with ternary models and future opportunities.


2/3
@ilumine_ai
what this can mean in the near term?



3/3
@veryvanya
How I see it:
in a few years, 10-year-old IoT potato compute will run multimodality faster and at better quality than current SOTA closed models (in narrow fine-tuned use cases).
This continued research into 1-bit, coupled with decentralized training, is already a sneak peek of a crazy future.















1/18
@mohamedmekkouri
🚀 Exciting news! We’ve finally cracked the code for BitNet @huggingface! No pre-training needed! By just fine-tuning Llama 3 8B, we've achieved great results, reaching performance close to the Llama 1 & 2 7B models on key downstream tasks!

Want to learn more? Check out the blogpost or keep reading for exclusive insights!

Blogpost: Fine-tuning LLMs to 1.58bit: extreme quantization made easy



2/18
@mohamedmekkouri
1/ BitNet principle in a nutshell: BitNet is an architecture introduced by @MSFTResearch, it replaces traditional linear layers in multi-head attention and feed-forward networks with specialized BitLinear layers using ternary or binary precision.

Kudos to @realHongyu_Wang, @ma_shuming, and the entire team for this amazing technique!

papers :
🔹 [2310.11453] BitNet: Scaling 1-bit Transformers for Large Language Models
🔹 [2402.17764] The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits



3/18
@mohamedmekkouri
2/ The papers proposed pre-training models from scratch using this architecture in the context of Quantization Aware Training (QAT) with fake quantization layers. This approach aims to make the model aware of quantization during training. However, pre-training requires significant resources, which aren't affordable for everyone. Our vision is to make these models accessible and open, encouraging greater community involvement and effort.
That's the primary reason we began exploring fine-tuning techniques!



4/18
@mohamedmekkouri
3/ The first reasonable step was to start with pre-trained weights (Llama3 8B), apply quantization, and evaluate performance. We attempted this, but the model lost all prior information with quantization. There was no significant difference between starting with random weights and pre-trained weights.



5/18
@mohamedmekkouri
4/ We then experimented with various techniques to achieve successful fine-tuning (on the same Llama3 8b model), but I'll skip the ones that didn't work. Let's focus on the most promising technique we discovered: Warmup Quantization. To grasp this concept, you need to understand how quantization is implemented in BitNet:
(for a detailed code explanation, check out the blogpost)



6/18
@mohamedmekkouri
5/ In the image above, we use both quantized and non-quantized values to address the issue of non-differentiable quantization operations. This inspired us to introduce quantization gradually into the model. We created a variable called lambda, which increases from 0 to 1. A lambda value of 0 means no quantization, while a value of 1 represents full quantization. In our experiments we tried different warmup steps and different schedulers (linear, exponential, sigmoid).
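
In code, the idea looks roughly like the sketch below: fake-quantize weights and activations, use `.detach()` as the straight-through estimator, and blend quantized and full-precision values with the lambda coefficient. This follows the blogpost's recipe in spirit only; see the post for the exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def activation_quant(x):
    # per-token absmax quantization of activations to 8 bits
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    return (x * scale).round().clamp(-128, 127) / scale

def weight_quant(w):
    # ternary {-1, 0, +1} quantization with a mean-absolute scale (b1.58 style)
    scale = 1.0 / w.abs().mean().clamp(min=1e-5)
    return (w * scale).round().clamp(-1, 1) / scale

class BitLinear(nn.Linear):
    """Linear layer with lambda-warmup fake quantization (illustrative sketch)."""
    lam = 0.0  # ramped 0 -> 1 by a scheduler: 0 = no quantization, 1 = full

    def forward(self, x):
        w = self.weight
        # .detach() is the straight-through estimator: quantized values in the
        # forward pass, gradients flowing to the full-precision tensors
        w_q = w + self.lam * (weight_quant(w) - w).detach()
        x_q = x + self.lam * (activation_quant(x) - x).detach()
        return F.linear(x_q, w_q, self.bias)
```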



7/18
@mohamedmekkouri
6/ Our experiments show that the linear scheduler generally performs better, whereas the sigmoid scheduler excels in specific setups with a particular slope parameter. Below are the results on downstream tasks, evaluated with LightEval on the nanotron-format checkpoints after fine-tuning on 10B tokens:



8/18
@mohamedmekkouri
7/ We scaled our experiments to fine-tune on 100 billion tokens. Except for HellaSwag and Winogrande, the model performs comparably to Llama 2 7B and Llama 1 7B on downstream tasks!



9/18
@mohamedmekkouri
8/ To learn more, check out the blogpost 🤗.

Remember, this is only the beginning of the Era of Extreme Quantization!



10/18
@iamRezaSayar
Super exciting work!🔥🔥thanks for sharing and looking forward for more on this topic!😍



11/18
@bnjmn_marie
Incredible work!



12/18
@sladurb
Congrats! BitNets are also being merged into llama.cpp and llamafile, so we need decent trained models and methods to distill existing LLMs into BitNets!



13/18
@iamRezaSayar
@nisten you might wanna see this👀



14/18
@abelsantillanr
@ollama Please!!!!



15/18
@GhostOfAnubis
I thought this technique had been forgotten. 😍



16/18
@CyberAIWizard
@LMStudioAI 👀



17/18
@Karenhalmi
That's fascinating, Mohamed! It's amazing how AI keeps evolving. I'd love to learn more about how BitNet works. Thanks for sharing this breakthrough!



18/18
@Jacoed













1/6
4 months since we released BitNet b1.58 🔥🔥

After we compressed LLMs to 1.58 bits, 1-bit LLM inference is no longer memory-bound but compute-bound.

🚀🚀 Today we introduce Q-Sparse, which can significantly speed up LLM computation.



2/6
Q-Sparse is trained with top-K sparsification and a straight-through estimator (STE) to prevent gradients from vanishing.

1️⃣Q-Sparse can achieve results comparable to those of baseline LLMs while being much more efficient at inference time;
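
The core operation is small enough to sketch: keep the top-K activations by magnitude, zero the rest, and route gradients through with an STE. A minimal PyTorch version of the idea follows; the paper applies this inside the model's activations and the details differ:

```python
import torch

def q_sparse(x: torch.Tensor, k: int) -> torch.Tensor:
    # keep the top-k entries of each row by magnitude, zero out the rest
    idx = x.abs().topk(k, dim=-1).indices
    mask = torch.zeros_like(x).scatter(-1, idx, 1.0)
    # straight-through estimator: sparse values forward, dense gradients backward
    return x + (x * mask - x).detach()

print(q_sparse(torch.tensor([[0.1, -2.0, 0.5, 3.0]]), k=2))  # keeps -2.0 and 3.0
```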



3/6
2️⃣We present an inference-optimal scaling law for sparsely-activated LLMs; as the total model size grows, the gap between sparsely-activated and dense models continuously narrows.



4/6
3️⃣Q-Sparse is effective in different settings, including training-from-scratch, continue-training of off-the-shelf LLMs, and finetuning;



5/6
4️⃣Q-Sparse works for both full-precision and 1-bit LLMs (e.g., BitNet b1.58). Particularly, the synergy of BitNet b1.58 and Q-Sparse (can be equipped with MoE) provides the cornerstone and a clear path to revolutionize the efficiency of future LLMs.



6/6
Link:[2407.10969] Q-Sparse: All Large Language Models can be Fully Sparsely-Activated






1/2
From DeepNet and BitNet to Q-Sparse, our mission is to develop a 10x, even 100x, more efficient architecture without any compromise on performance.

With test-time scaling, more efficient inference also means better performance (given the same inference budget).



2/2
Currently we still focus on weights and activations. The KV cache has significant redundancy between layers… it’s worth trying Q-Sparse for the KV cache, but it’s difficult to achieve such high-level compression compared to YOCO.











1/7
(1/7) Physics of LM, Part 2.1, with 8 results on LLM reasoning, is out: [2407.20311] Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process. Probing reveals that LLMs secretly develop some "level-2" reasoning skill beyond humans. Although I recommend watching my ICML tutorial first... Come into this thread to see the slides.



2/7
Result 1: we don't want to chat with GPT-4 to guess how it reasons. Instead, we create synthetic grade-school math data (using mod-23 arithmetic and removing common sense, focusing solely on reasoning) and pretrain a model directly on it. This allows for controlled experiments and probing.



3/7
Result 2-3: Using this data, we show models can truly learn some reasoning skill (not by memorizing solution templates). Crucially, models can mentally plan to generate the shortest solutions (avoiding unnecessary computations) – a level-1 reasoning skill that humans also use.



4/7
Result 4-5: we invent a probing technique to discover that, before a question is asked, the model already figures out (mentally!) which parameter recursively depends on which. This skill is not needed for solving the problem and is different from human reasoning. We call it "level-2" reasoning.
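
A generic version of the probing idea, for flavor: freeze the LM, take hidden states at the end of the question, and train a linear classifier to predict whether one parameter depends on another in the ground-truth dependency graph. The paper's actual probe construction differs in its details; this sketch only conveys the method's shape:

```python
import torch
import torch.nn as nn

hidden_dim = 1024                  # assumed LM hidden size
probe = nn.Linear(hidden_dim, 1)   # one linear probe for the dependency question
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

def probe_step(hidden_states: torch.Tensor, depends: torch.Tensor) -> float:
    # hidden_states: (batch, hidden_dim) frozen activations at the question's end
    # depends: (batch,) 1.0 if parameter A depends on parameter B, else 0.0
    logits = probe(hidden_states).squeeze(-1)
    loss = loss_fn(logits, depends)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()
```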



5/7
Result 6: We explain how reasoning errors occur. For instance, some errors trace back to the model's mental planning stage => such errors can be predicted even before the model starts to generate the first token; these errors are independent of the random generation process.



6/7
Result 7-8. Model depth is crucial for reasoning, and we explain this necessity by the complexity of the mental processes involved. This cannot be mitigated by CoT – deciding what the first CoT step should be may still require deep, multi-step mental reasoning (planning).



7/7
This is joint work with @yetian648(CMU/Meta), Zicheng Xu (Meta), Yuanzhi Li (MBZUAI). I'd like to thank once again my manager Lin Xiao for encouraging this exploratory research, FAIR's sponsorship on A100 + V100, and FAIR's wonderful engineering team that supported our heavy jobs



