Reasoning skills of large language models are often overestimated

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556

1/1
How reasoning works in OpenAI's o1


1/21
@rohanpaul_ai
How Reasoning Works in the new o1 models from @OpenAI

The key point is that reasoning allows the model to consider multiple approaches before generating a final response.

🧠 OpenAI introduced reasoning tokens to "think" before responding. These tokens break down the prompt and consider multiple approaches.

🔄 Process:
1. Generate reasoning tokens
2. Produce visible completion tokens as answer
3. Discard reasoning tokens from context

🗑️ Discarding reasoning tokens keeps context focused on essential information

📊 Multi-step conversation flow:
- Input and output tokens carry over between turns
- Reasoning tokens discarded after each turn

🪟 Context window: 128k tokens

🔍 Visual representation:
- Turn 1: Input → Reasoning → Output
- Turn 2: Previous Output + New Input → Reasoning → Output
- Turn 3: Cumulative Inputs → Reasoning → Output (may be truncated)
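
A minimal sketch of what this flow looks like from the API side, assuming the `openai` Python SDK and an o1-class model name; the exact usage fields (e.g. `completion_tokens_details.reasoning_tokens`) may differ by SDK version:

```python
# Minimal sketch (assumes the `openai` Python SDK and an o1-class model name;
# usage field names may differ by SDK version).
from openai import OpenAI

client = OpenAI()
messages = [{"role": "user", "content": "How many primes are there below 100?"}]

resp = client.chat.completions.create(model="o1-preview", messages=messages)
answer = resp.choices[0].message.content

# Reasoning tokens are billed and occupy the context window during the turn,
# but they are never returned; only the visible completion comes back.
usage = resp.usage
print("reasoning tokens:", usage.completion_tokens_details.reasoning_tokens)
print("visible completion tokens:", usage.completion_tokens)

# Turn 2: carry over only the visible output, then add the new input.
# The previous turn's reasoning tokens are discarded from the context entirely.
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user", "content": "Now list the ones ending in 7."})
resp2 = client.chat.completions.create(model="o1-preview", messages=messages)
```

Nothing in the visible transcript grows because of the reasoning tokens; they only show up in the usage accounting (and the bill) for the turn that produced them.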

2/21
@rohanpaul_ai
https://platform.openai.com/docs/guides/reasoning

3/21
@sameeurehman
So strawberry o1 uses chain of thought when attempting to solve problems and uses reinforcement learning to recognize and correct its mistakes. By trying a different approach when the current one isn’t working, the model’s ability to reason improves...

4/21
@rohanpaul_ai
💯

5/21
@ddebowczyk
System 1 (gpt-4o) vs. system 2 (o1) models necessitate different work paradigms: "1-1, interactive" vs. "multitasking, delegated".

o1-type LLMs will require a different UI than chat to make collaboration effective and satisfying:

6/21
@tonado_square
I would call this an agent, rather than a model.

7/21
@realyashnegi
Unlike traditional models, O1 is trained using reinforcement learning, allowing it to develop internal reasoning processes. This method improves data efficiency and reasoning capabilities.

8/21
@JeffreyH630
Thanks for sharing, Rohan!

It's fascinating how these reasoning tokens enhance the model's ability to analyze and explore different perspectives.

Can’t wait to see how this evolves in future iterations!

9/21
@mathepi
I wonder if there is some sort of confirmation step going on, like a theorem prover, or something. I've tried using LLMs to check their own work in certain vision tasks and they just don't really know what they're doing; no amount of iterating and repeating really fixes it.

10/21
@AIxBlock
Nice breakdown!

11/21
@AITrailblazerQ
We've had this pipeline in ASAP for 6 months.

12/21
@gpt_biz
This is a fascinating look into how AI models reason, a must-read for anyone curious about how these systems improve their responses!

13/21
@labsantai


14/21
@GenJonesX
How can quantum-like cognitive processes be empirically verified?

15/21
@AImpactSpace


16/21
@gileneo1
so it's CoT in a loop with a large context window

17/21
@mycharmspace
Discarding reasoning tokens actually brings inference challenges for the KV cache, unless custom attention is introduced
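
A toy illustration of that point (hypothetical token IDs, not tied to any real inference engine): a prefix KV cache is reusable only up to the first position where the new turn's tokens match the cached ones, so dropping reasoning tokens from the middle of the history forces a re-prefill from that point on.

```python
# Toy illustration with made-up token IDs: how much of a cached prefix
# survives once the reasoning tokens are removed from the history.
def reusable_prefix_len(cached_tokens, new_tokens):
    n = 0
    for a, b in zip(cached_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

turn1 = [101, 5, 7, 9, 900, 901, 902, 42, 43]  # input | reasoning | output
turn2 = [101, 5, 7, 9, 42, 43, 55, 56]         # reasoning tokens discarded

reuse = reusable_prefix_len(turn1, turn2)
print(f"KV entries reusable: {reuse} of {len(turn2)}")
# Everything after position `reuse` must be re-prefilled, which is the
# inference cost the tweet is alluding to.
```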

18/21
@SerisovTj
Reparsing output e? 🤔

19/21
@Just4Think
Well, I will ask again: should it be considered one model?
Should it be benchmarked as one model?

20/21
@HuajunB68287
I wonder where the figure comes from. Is it the actual logic behind o1?

21/21
@dhruv2038
Well. Just take a look here.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556



1/3
@rohanpaul_ai
Reverse Engineering o1 OpenAI Architecture with Claude 👀

2/3
@NorbertEnders
The reverse-engineered o1 OpenAI architecture, simplified and explained in a more narrative style, using layman's terms.
I used Claude Sonnet 3.5 for that.

Keep in mind: it’s just an educated guess

3/3
@NorbertEnders
Longer version:

Imagine a brilliant but inexperienced chef named Alex. Alex's goal is to become a master chef who can create amazing dishes on the spot, adapting to any ingredient or cuisine challenge. This is like our language model aiming to provide intelligent, reasoned responses to any query.

Alex's journey begins with intense preparation:

First, Alex gathers recipes. Some are from famous cookbooks, others from family traditions, and many are creative variations Alex invents. This is like our model's Data Generation phase, collecting a mix of real and synthetic data to learn from.

Next comes Alex's training. It's not just about memorizing recipes, but understanding the principles of cooking. Alex practices in a special kitchen (our Training Phase) where:

1. Basic cooking techniques are mastered (Language Model training).
2. Alex plays cooking games, getting points for tasty dishes and helpful feedback when things go wrong (Reinforcement Learning).
3. Sometimes, the kitchen throws curveballs - like changing ingredients mid-recipe or having multiple chefs compete (Advanced RL techniques).

This training isn't a one-time thing. Alex keeps learning, always aiming to improve.

Now, here's where the real magic happens - when Alex faces actual cooking challenges (our Inference Phase):

1. A customer orders a dish. Alex quickly thinks of a recipe (Initial CoT Generation).
2. While cooking, Alex tastes the dish and adjusts seasonings (CoT Refinement).
3. For simple dishes, Alex works quickly. For complex ones, more time is taken to perfect it (Test-time Compute).
4. Alex always keeps an eye on the clock, balancing perfection with serving time (Efficiency Monitoring).
5. Finally, the dish is served (Final Response).
6. Alex remembers this experience for future reference (CoT Storage).

The key here is Alex's ability to reason and adapt on the spot. It's not about rigidly following recipes, but understanding cooking principles deeply enough to create new dishes or solve unexpected problems.

What makes Alex special is the constant improvement. After each shift, Alex reviews the day's challenges, learning from successes and mistakes (feedback loop). Over time, Alex becomes more efficient, creative, and adaptable.

In our language model, this inference process is where the real value lies. It's the ability to take a query (like a cooking order), reason through it (like Alex combining cooking knowledge to create a dish), and produce a thoughtful, tailored response (serving the perfect dish).

The rest of the system - the data collection, the intense training - is all in service of this moment of creation. They're crucial, but they're the behind-the-scenes work. The real magic, the part that amazes the 'customers' (users), happens in this inference stage.

Just as a master chef can delight diners with unique, perfectly crafted dishes for any request, our advanced language model aims to provide insightful, reasoned responses to any query, always learning and improving with each interaction.
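
The tweet above is explicitly an educated guess, and so is the following: a rough, hypothetical sketch of the inference loop the analogy describes (Initial CoT Generation, CoT Refinement, Test-time Compute, Efficiency Monitoring, Final Response, CoT Storage). Every function name here is a stand-in, not a real OpenAI component.

```python
# Hypothetical sketch of the speculated inference loop; the stage functions
# are toy stand-ins, not real OpenAI components.

def generate_initial_cot(query: str) -> str:        # 1. Initial CoT Generation
    return f"draft reasoning about: {query}"

def refine_cot(query: str, cot: str) -> str:        # 2. CoT Refinement
    return cot + " | refined"

def good_enough(cot: str) -> bool:                  # crude stopping check
    return cot.count("refined") >= 2

def answer(query: str, budget: int = 5) -> str:
    cot = generate_initial_cot(query)
    steps = 1
    # 3./4. Test-time Compute vs. Efficiency Monitoring: keep refining
    # until the draft looks good enough or the compute budget runs out.
    while steps < budget and not good_enough(cot):
        cot = refine_cot(query, cot)
        steps += 1
    response = f"final answer derived from: {cot}"  # 5. Final Response
    # 6. CoT Storage: (query, cot, response) could be kept for a feedback loop.
    return response

print(answer("plan a three-course menu"))
```

The design point the analogy emphasizes is the budget check: simple queries exit the loop early, hard ones get more refinement passes before the final response is produced.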


 

PoorAndDangerous

Superstar
Joined
Feb 28, 2018
Messages
8,634
Reputation
996
Daps
31,838
I like Anthropic as a company better but you're right the new OpenAI is quite impressive. I think Anthropic will drop something soon too to stay competitive.
I like Anthropic a lot too, and I really like their Artifacts feature for coding. 3.5 was head and shoulders above GPT in its reasoning capabilities until the new release. At the very least, Anthropic needs to up their limits because I routinely hit my messaging limits while iterating on code. It's pretty impressive how quickly OpenAI has been able to release new models and feature sets. Things are starting to get quite exciting. I find it interesting how most people still really have no clue that we're living through the development of what will be the most impactful human invention in history.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556
I like Anthropic a lot too, and I really like their Artifacts feature for coding. 3.5 was head and shoulders above GPT in its reasoning capabilities until the new release. At the very least, Anthropic needs to up their limits because I routinely hit my messaging limits while iterating on code. It's pretty impressive how quickly OpenAI has been able to release new models and feature sets. Things are starting to get quite exciting. I find it interesting how most people still really have no clue that we're living through the development of what will be the most impactful human invention in history.

i've said that this is bigger than the industrial revolution, electricity, the internet, and smartphones combined. still got people walking around completely unaware. :francis:
 

PoorAndDangerous

Superstar
Joined
Feb 28, 2018
Messages
8,634
Reputation
996
Daps
31,838
i've said that this is bigger than the industrial revolution, electricity, the internet, and smartphones combined. still got people walking around completely unaware. :francis:
It is, because its applications aren't limited. Once it is sophisticated enough, it will be used in every industry: healthcare, transportation, manufacturing, construction, engineering. It will permeate and enhance every single aspect of life.

I'm really glad that I'm in at the ground level and have a huge fascination with it. It will give us a leg up and will probably lead to some good job opportunities since we have such a base of knowledge already.
 

Micky Mikey

Veteran
Supporter
Joined
Sep 27, 2013
Messages
15,304
Reputation
2,733
Daps
84,470
i've said that this is bigger than the industrial revolution, electricity, the internet, and smartphones combined. still got people walking around completely unaware. :francis:
It seems like AGI is going to happen relatively quickly too (within the next 5 years). I don't think even those in power are prepared for what's coming. Once these models reach general-level intelligence there will be an enormous amount of change in a ridiculously short period of time.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556






1/11
@denny_zhou
What is the performance limit when scaling LLM inference? Sky's the limit.

We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient.

[2402.12875] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (ICLR 2024)



2/11
@denny_zhou
Just noticed a fun YouTube video explaining this paper. LoL. Pointed out by @laion_ai http://invidious.poast.org/4JNe-cOTgkY



3/11
@ctjlewis
hey Denny, curious if you have any thoughts. i reached the same conclusion:

[Quoted tweet]
x.com/i/article/178554774683…


4/11
@denny_zhou
Impressive! You would be interested at seeing this: [2301.04589] Memory Augmented Large Language Models are Computationally Universal



5/11
@nearcyan
what should one conclude from such a proof if it’s not also accompanied by a proof that we can train a transformer into the state (of solving a given arbitrary problem), possibly even with gradient descent and common post training techniques?



6/11
@QuintusActual
“We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.”

I’m guessing this is only true because as a problem grows in difficulty, the # of required tokens approaches ♾️



7/11
@Shawnryan96
How do they solve novel problems without a way to update the world model?



8/11
@Justin_Halford_
Makes sense for verifiable domains (e.g. math and coding).

Does this generalize to more ambiguous domains with competing values/incentives without relying on human feedback?



9/11
@ohadasor
Don't fall into it!!

[Quoted tweet]
"can solve any problem"? Really?? Let's read the abstract in the image attached to the post, and see if the quote is correct. Ah wow! Somehow he forgot to quote the rest of the sentence! How is that possible?
The full quote is "can solve any problem solvable by boolean circuits of size T". This changes a lot. All problems solvable by Boolean circuits, of any size, is called the Circuit Evaluation Problem, and is known to cover precisely polynomial time (P) calculations. So it cannot solve the most basic logical problems which are at least exponential. Now here we don't even have P, we have only circuits of size T, which validates my old mantra: it can solve only constant-time problems. The lowest possible complexity class.
And it also validates my claim about the bubble of machine learning promoted by people who have no idea what they're talking about.


10/11
@CompSciFutures
Thx, refreshingly straightforward notation too, I might take the time to read this one properly.

I'm just catching up and have a dumb Q... that is an interestingly narrow subset of symbolic operands. Have you considered what happens if you add more?



11/11
@BatAndrew314
Noob question: how is this related to the universal approximation theorem? Meaning, can a transformer solve any problem because it is a neural net, or is it some different property of transformers and CoT?






[Submitted on 20 Feb 2024 (v1), last revised 23 May 2024 (this version, v3)]


Chain of Thought Empowers Transformers to Solve Inherently Serial Problems


Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma

Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length n, previous works have shown that constant-depth transformers with finite precision poly(n) embedding size can only solve problems in TC^0 without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in AC^0, a proper subset of TC^0. However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(log n) embedding size can solve any problem solvable by boolean circuits of size T. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
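
Restated in complexity-class terms (a paraphrase of the abstract; writing SIZE[T] for the problems solvable by boolean circuits of size T, the phrasing below is mine, not the paper's notation):

```latex
% Paraphrase of the abstract's expressiveness results; wording and layout are
% mine, not the paper's own notation.
\begin{align*}
  &\text{No CoT, constant depth, finite precision, poly}(n)\text{ embedding (prior work):}
     && \subseteq \mathsf{TC}^0,\\
  &\text{No CoT, constant depth, constant-bit precision (this paper, tighter bound):}
     && \subseteq \mathsf{AC}^0 \subsetneq \mathsf{TC}^0,\\
  &T\text{ steps of CoT, constant depth, constant-bit precision, } O(\log n)\text{ embedding:}
     && \supseteq \mathsf{SIZE}[T].
\end{align*}
```

This is also the context for the exchange above: the headline claim "can solve any problem" is bounded by the circuit-size budget T, i.e. by how many intermediate reasoning tokens the model is allowed to emit.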

Comments: 38 pages, 10 figures. Accepted by ICLR 2024
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Cite as: arXiv:2402.12875 [cs.LG]
(or arXiv:2402.12875v3 [cs.LG] for this version)
[2402.12875] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems


Submission history

From: Zhiyuan Li [view email]

[v1] Tue, 20 Feb 2024 10:11:03 UTC (3,184 KB)
[v2] Tue, 7 May 2024 17:00:27 UTC (5,555 KB)
[v3] Thu, 23 May 2024 17:10:39 UTC (5,555 KB)


 