Reasoning skills of large language models are often overestimated

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993

1/1
How reasoning works in OpenAI's o1


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXWbkM1XgAAZGQb.png

1/21
@rohanpaul_ai
How Reasoning Works in the new o1 models from @OpenAI

The key point is that reasoning allows the model to consider multiple approaches before generating a final response.

🧠 OpenAI introduced reasoning tokens to "think" before responding. These tokens break down the prompt and consider multiple approaches.

🔄 Process:
1. Generate reasoning tokens
2. Produce visible completion tokens as answer
3. Discard reasoning tokens from context

🗑️ Discarding reasoning tokens keeps context focused on essential information

📊 Multi-step conversation flow:
- Input and output tokens carry over between turns
- Reasoning tokens discarded after each turn

🪟 Context window: 128k tokens

🔍 Visual representation:
- Turn 1: Input → Reasoning → Output
- Turn 2: Previous Output + New Input → Reasoning → Output
- Turn 3: Cumulative Inputs → Reasoning → Output (may be truncated)
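
The multi-turn flow outlined above maps onto ordinary Chat Completions usage: you resend the visible conversation each turn, while reasoning tokens are billed but never appear in the response or in the next turn's context. A minimal sketch, assuming the official `openai` Python SDK (1.x) and the usage fields documented at the o1 launch (`completion_tokens_details.reasoning_tokens`); treat the exact field names as assumptions and check the reasoning guide linked in the next tweet.

```python
# Minimal sketch of a multi-turn conversation with an o1-style model.
# Field names such as `completion_tokens_details.reasoning_tokens` follow
# the o1 launch docs and may change; this is illustrative, not authoritative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
history = []       # only visible input/output messages carry over between turns

def ask(prompt: str) -> str:
    history.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(
        model="o1-preview",
        messages=history,  # reasoning tokens from earlier turns are NOT resent
    )
    answer = response.choices[0].message.content
    history.append({"role": "assistant", "content": answer})

    # Reasoning tokens are billed as output tokens but never returned in the
    # message content, and they are discarded from context after each turn.
    usage = response.usage
    reasoning = usage.completion_tokens_details.reasoning_tokens
    print(f"reasoning tokens: {reasoning}, "
          f"visible output tokens: {usage.completion_tokens - reasoning}")
    return answer

# Turn 1: Input -> Reasoning -> Output
ask("Explain why the sky is blue in two sentences.")
# Turn 2: previous output + new input -> Reasoning -> Output
ask("Now rewrite that explanation for a five-year-old.")
```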

2/21
@rohanpaul_ai
https://platform.openai.com/docs/guides/reasoning

3/21
@sameeurehman
So strawberry o1 uses chain of thought when attempting to solve problems and uses reinforcement learning to recognize and correct its mistakes. By trying a different approach when the current one isn’t working, the model’s ability to reason improves...

4/21
@rohanpaul_ai
💯

5/21
@ddebowczyk
System 1 (gpt-4o) vs. system 2 (o1) models necessitate different work paradigms: "1-1, interactive" vs. "multitasking, delegated".

O1-type LLMs will require a different UI than chat to make collaboration effective and satisfying:

6/21
@tonado_square
I would call this an agent, rather than a model.

7/21
@realyashnegi
Unlike traditional models, O1 is trained using reinforcement learning, allowing it to develop internal reasoning processes. This method improves data efficiency and reasoning capabilities.

8/21
@JeffreyH630
Thanks for sharing, Rohan!

It's fascinating how these reasoning tokens enhance the model's ability to analyze and explore different perspectives.

Can’t wait to see how this evolves in future iterations!

9/21
@mathepi
I wonder if there is some sort of confirmation step going on, like a theorem prover, or something. I've tried using LLMs to check their own work in certain vision tasks and they just don't really know what they're doing; no amount of iterating and repeating really fixes it.

10/21
@AIxBlock
Nice breakdown!

11/21
@AITrailblazerQ
We have had this pipeline for 6 months in ASAP.

12/21
@gpt_biz
This is a fascinating look into how AI models reason, a must-read for anyone curious about how these systems improve their responses!

13/21
@labsantai


14/21
@GenJonesX
How can quantum-like cognitive processes be empirically verified?

15/21
@AImpactSpace


16/21
@gileneo1
so it's CoT in a loop with a large context window

17/21
@mycharmspace
Discarding reasoning tokens actually brings inference challenges for the KV cache, unless custom attention is introduced

18/21
@SerisovTj
Reparsing output, eh? 🤔

19/21
@Just4Think
Well, I will ask again: should it be considered one model?
Should it be benchmarked as one model?

20/21
@HuajunB68287
I wonder where the figure comes from. Is it the actual logic behind o1?

21/21
@dhruv2038
Well. Just take a look here.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXTFjmDXMAA1DC1.png

GXTH6J1bgAAAcl0.jpg

GXXE_8cWwAERJYP.jpg

GXVlsKDXkAAtPEO.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993



1/3
@rohanpaul_ai
Reverse Engineering o1 OpenAI Architecture with Claude 👀

2/3
@NorbertEnders
The reverse-engineered o1 OpenAI architecture, simplified and explained in a more narrative style, using layman’s terms.
I used Claude Sonnet 3.5 for that.

Keep in mind: it’s just an educated guess

3/3
@NorbertEnders
Longer version:

Imagine a brilliant but inexperienced chef named Alex. Alex's goal is to become a master chef who can create amazing dishes on the spot, adapting to any ingredient or cuisine challenge. This is like our language model aiming to provide intelligent, reasoned responses to any query.

Alex's journey begins with intense preparation:

First, Alex gathers recipes. Some are from famous cookbooks, others from family traditions, and many are creative variations Alex invents. This is like our model's Data Generation phase, collecting a mix of real and synthetic data to learn from.

Next comes Alex's training. It's not just about memorizing recipes, but understanding the principles of cooking. Alex practices in a special kitchen (our Training Phase) where:

1. Basic cooking techniques are mastered (Language Model training).
2. Alex plays cooking games, getting points for tasty dishes and helpful feedback when things go wrong (Reinforcement Learning).
3. Sometimes, the kitchen throws curveballs - like changing ingredients mid-recipe or having multiple chefs compete (Advanced RL techniques).

This training isn't a one-time thing. Alex keeps learning, always aiming to improve.

Now, here's where the real magic happens - when Alex faces actual cooking challenges (our Inference Phase):

1. A customer orders a dish. Alex quickly thinks of a recipe (Initial CoT Generation).
2. While cooking, Alex tastes the dish and adjusts seasonings (CoT Refinement).
3. For simple dishes, Alex works quickly. For complex ones, more time is taken to perfect it (Test-time Compute).
4. Alex always keeps an eye on the clock, balancing perfection with serving time (Efficiency Monitoring).
5. Finally, the dish is served (Final Response).
6. Alex remembers this experience for future reference (CoT Storage).

The key here is Alex's ability to reason and adapt on the spot. It's not about rigidly following recipes, but understanding cooking principles deeply enough to create new dishes or solve unexpected problems.

What makes Alex special is the constant improvement. After each shift, Alex reviews the day's challenges, learning from successes and mistakes (feedback loop). Over time, Alex becomes more efficient, creative, and adaptable.

In our language model, this inference process is where the real value lies. It's the ability to take a query (like a cooking order), reason through it (like Alex combining cooking knowledge to create a dish), and produce a thoughtful, tailored response (serving the perfect dish).

The rest of the system - the data collection, the intense training - are all in service of this moment of creation. They're crucial, but they're the behind-the-scenes work. The real magic, the part that amazes the 'customers' (users), happens in this inference stage.

Just as a master chef can delight diners with unique, perfectly crafted dishes for any request, our advanced language model aims to provide insightful, reasoned responses to any query, always learning and improving with each interaction.
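
Stripped of the cooking metaphor, the speculated inference phase above is just a generate-evaluate-refine loop under a compute budget. Below is a hedged sketch of that guess; the Protocol, every method name (generate_cot, evaluate_cot, refine_cot, summarize, store), and the control flow are hypothetical and nothing here is confirmed by OpenAI.

```python
# Hedged sketch of the *speculated* inference loop described above
# (Initial CoT Generation -> CoT Refinement -> Test-time Compute ->
#  Efficiency Monitoring -> Final Response -> CoT Storage).
import time
from typing import Protocol, Tuple

class ReasoningModel(Protocol):
    def generate_cot(self, query: str) -> str: ...
    def evaluate_cot(self, query: str, cot: str) -> Tuple[bool, str]: ...
    def refine_cot(self, query: str, cot: str, critique: str) -> str: ...
    def summarize(self, query: str, cot: str) -> str: ...
    def store(self, query: str, cot: str, answer: str) -> None: ...

def speculated_inference(model: ReasoningModel, query: str,
                         budget_seconds: float = 30.0) -> str:
    start = time.monotonic()
    cot = model.generate_cot(query)                      # Initial CoT Generation
    while time.monotonic() - start < budget_seconds:     # Test-time Compute budget
        good_enough, critique = model.evaluate_cot(query, cot)
        if good_enough:                                  # Efficiency Monitoring
            break
        cot = model.refine_cot(query, cot, critique)     # CoT Refinement
    answer = model.summarize(query, cot)                 # Final Response
    model.store(query, cot, answer)                      # CoT Storage / feedback loop
    return answer
```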


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXetQ5QXYAA9o0_.jpg

GXgWFRDXAAAj60F.jpg

GXgWFRAXQAAVFNi.jpg
 

PoorAndDangerous

Superstar
Joined
Feb 28, 2018
Messages
8,889
Reputation
1,038
Daps
33,018
I like Anthropic as a company better, but you're right, the new OpenAI is quite impressive. I think Anthropic will drop something soon too to stay competitive.
I like Anthropic a lot too, and I really like their artifact feature for coding. 3.5 was head and shoulders above GPT in its reasoning capabilities until the new release. At the very least--Anthropic needs to up their limits because I routinely hit my messaging limits while iterating code. It's pretty impressive how quickly OpenAI has been able to release new models and feature sets. Things are starting to get quite exciting; I find it interesting that most people still really have no clue that we're living through the development of what will be the most impactful human invention in history.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993
I like Anthropic a lot too, and I really like their artifact feature for coding. 3.5 was head and shoulders above GPT in its reasoning capabilities until the new release. At the very least--Anthropic needs to up their limits because I routinely hit my messaging limits while iterating code. It's pretty impressive how quickly OpenAI has been able to release new models and feature sets. Things are starting to get quite exciting; I find it interesting that most people still really have no clue that we're living through the development of what will be the most impactful human invention in history.

i've said that this is bigger than the industrial revolution, electricity, the internet, and smartphones combined. still got people walking around completely unaware. :francis:
 

PoorAndDangerous

Superstar
Joined
Feb 28, 2018
Messages
8,889
Reputation
1,038
Daps
33,018
i've said that this is bigger than the industrial revolution, electricity, the internet, and smartphones combined. still got people walking around completely unaware. :francis:
It is, because its applications aren't limited. Once it is sophisticated enough it will be used in every industry. Healthcare, transportation, manufacturing, construction, engineering. It will permeate and enhance every single aspect of life.

I'm really glad that I'm in at the ground level and have a huge fascination with it. It will give us a leg up and will probably lead to some good job opportunities since we have such a base of knowledge already.
 

Micky Mikey

Veteran
Supporter
Joined
Sep 27, 2013
Messages
15,997
Reputation
2,962
Daps
89,254
i've said that this is bigger than the industrial revolution, electricity, the internet, and smartphones combined. still got people walking around completely unaware. :francis:
It seems like AGI is going to happen relatively quickly too (within the next 5 years). I don't think even those in power are prepared for what's coming. Once these models reach general-level intelligence there will be an enormous amount of change in a ridiculously short period of time.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993






1/11
@denny_zhou
What is the performance limit when scaling LLM inference? Sky's the limit.

We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed. Remarkably, constant depth is sufficient.

[2402.12875] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems (ICLR 2024)



2/11
@denny_zhou
Just noticed a fun YouTube video explaining this paper. LoL. Pointed out by @laion_ai: http://invidious.poast.org/4JNe-cOTgkY



3/11
@ctjlewis
hey Denny, curious if you have any thoughts. i reached the same conclusion:

[Quoted tweet]
x.com/i/article/178554774683…


4/11
@denny_zhou
Impressive! You would be interested in seeing this: [2301.04589] Memory Augmented Large Language Models are Computationally Universal



5/11
@nearcyan
what should one conclude from such a proof if it’s not also accompanied by a proof that we can train a transformer into the state (of solving a given arbitrary problem), possibly even with gradient descent and common post training techniques?



6/11
@QuintusActual
“We have mathematically proven that transformers can solve any problem, provided they are allowed to generate as many intermediate reasoning tokens as needed.”

I’m guessing this is only true because as a problem grows in difficulty, the # of required tokens approaches ♾️



7/11
@Shawnryan96
How do they solve novel problems without a way to update the world model?



8/11
@Justin_Halford_
Makes sense for verifiable domains (e.g. math and coding).

Does this generalize to more ambiguous domains with competing values/incentives without relying on human feedback?



9/11
@ohadasor
Don't fall into it!!

[Quoted tweet]
"can solve any problem"? Really?? Let's read the abstract in the image attached to the post, and see if the quote is correct. Ah wow! Somehow he forgot to quote the rest of the sentence! How is that possible?
The full quote is "can solve any problem solvable by boolean circuits of size T". This changes a lot. All problems solvable by Boolean circuits, of any size, is called the Circuit Evaluation Problem, and is known to cover precisely polynomial time (P) calculations. So it cannot solve the most basic logical problems which are at least exponential. Now here we don't even have P, we have only circuits of size T, which validates my old mantra: it can solve only constant-time problems. The lowest possible complexity class.
And it also validates my claim about the bubble of machine learning promoted by people who have no idea what they're talking about.


10/11
@CompSciFutures
Thx, refreshingly straightforward notation too, I might take the time to read this one properly.

I'm just catching up and have a dumb Q... that is an interestingly narrow subset of symbolic operands. Have you considered what happens if you add more?



11/11
@BatAndrew314
Noob question: how is this related to the universal approximation theorem? Meaning, can a transformer solve any problem because it is a neural net? Or is it some different property of transformers and CoT?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

GXnuMiObgAAizHF.png


[Submitted on 20 Feb 2024 (v1), last revised 23 May 2024 (this version, v3)]


Chain of Thought Empowers Transformers to Solve Inherently Serial Problems​


Zhiyuan Li, Hong Liu, Denny Zhou, Tengyu Ma

Instructing the model to generate a sequence of intermediate steps, a.k.a., a chain of thought (CoT), is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks. However, the mechanism behind CoT remains unclear. This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness. Conceptually, CoT empowers the model with the ability to perform inherently serial computation, which is otherwise lacking in transformers, especially when depth is low. Given input length n, previous works have shown that constant-depth transformers with finite precision poly(n) embedding size can only solve problems in TC^0 without CoT. We first show an even tighter expressiveness upper bound for constant-depth transformers with constant-bit precision, which can only solve problems in AC^0, a proper subset of TC^0. However, with T steps of CoT, constant-depth transformers using constant-bit precision and O(log n) embedding size can solve any problem solvable by boolean circuits of size T. Empirically, enabling CoT dramatically improves the accuracy for tasks that are hard for parallel computation, including the composition of permutation groups, iterated squaring, and circuit value problems, especially for low-depth transformers.
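
One of the empirical tasks mentioned, composing permutation groups, gives a feel for what "inherently serial" means: each step needs the previous result, which is exactly the kind of computation T chain-of-thought steps can emulate one step at a time, while a single constant-depth forward pass without CoT cannot. A toy sketch, illustrative only and not the paper's experimental setup:

```python
# Toy illustration of an inherently serial task: composing a long chain of
# permutations. Each step depends on the previous result, so it resists
# parallel (constant-depth) evaluation but is easy to unroll step by step.
import random

n, T = 5, 1000                                   # permutation size, serial steps
perms = [random.sample(range(n), n) for _ in range(T)]

state = list(range(n))                           # start from the identity
for p in perms:                                  # strictly sequential composition
    state = [state[p[i]] for i in range(n)]

print(f"composition of {T} random permutations of 0..{n-1}: {state}")
```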

Comments: 38 pages, 10 figures. Accepted by ICLR 2024
Subjects: Machine Learning (cs.LG); Computational Complexity (cs.CC); Machine Learning (stat.ML)
Cite as: arXiv:2402.12875 [cs.LG]
(or arXiv:2402.12875v3 [cs.LG] for this version)
[2402.12875] Chain of Thought Empowers Transformers to Solve Inherently Serial Problems


Submission history​

From: Zhiyuan Li [view email]

[v1] Tue, 20 Feb 2024 10:11:03 UTC (3,184 KB)
[v2] Tue, 7 May 2024 17:00:27 UTC (5,555 KB)
[v3] Thu, 23 May 2024 17:10:39 UTC (5,555 KB)


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993






1/6
In the past few days, I’ve been testing OpenAI o1 models, mostly o1-mini, for developing PhD or postdoc level projects.
I can confidently claim that the o1 model is comparable to an outstanding PhD student in biomedical sciences! I’d rate it among the best PhDs I have trained!

2/6
Also, in my experience in my field, o1 rates better than in Terence Tao's assessment; he rated it as a mediocre PhD student, likely because math is a tougher field for conceptualization and, not to mention, he is one of the best, if not the best, mathematicians in the world.☺️

3/6
In my experience o1-mini is a bit better than o1-preview, and much better than 4o - for these use cases.

4/6
I have a lot on my plate now, but I will definitely add this to the list of things to cure ☺️

5/6
Yes, planning to make a video for this; it’s too complicated and technical a topic to explain in tweets ☺️

6/6
Easily!


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXxTn7QW4AAhM7j.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993




1/11
@tsarnick
Sam Altman says AI reasoning is still at the GPT-2 stage but the improvement curve is steep and the new o1 model represents a new paradigm of AI development which will enable rapid progress in capabilities



2/11
@tsarnick
Source (thanks to @curiousgangsta):
https://invidious.poast.org/watch?v=r-xmUM5y0LQ



3/11
@SmokeAwayyy
Level 3 Agents soon.

Then on to Level 4 Innovators.

[Quoted tweet]
Level 2 Reasoners are here.

Next up: Level 3 Agents.


4/11
@tsarnick
Innovators sounds wild



5/11
@DeryaTR_
If the current AI is only at the GPT-2 level, when it reaches GPT-4 level with agents, then I can for sure quit my job as a professor and scientist. I won’t be needed ☺️

[Quoted tweet]
In the past few days, I’ve been testing OpenAI o1 models, mostly o1-mini, for developing PhD or postdoc level projects.
I can confidently claim that the o1 model is comparable to an outstanding PhD student in biomedical sciences! I’d rate it among the best PhDs I have trained!


6/11
@tsarnick
what will you do then?



7/11
@QStarETH
Magnitudes on magnitudes



8/11
@typesteady
Such a salesman



9/11
@BenPielstick
Rapid progress sounds good.



10/11
@shannonNullCode
It’s tough to call this research, and hard to be classified as peers, when the specifics behind o1 are hidden and probing is suppressed 👀 @elder_plinius



11/11
@shikhar_sri
“Over the coming years”.. Is Sam acknowledging here that GPT-4 stage reasoning abilities are a few years away?




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

GXTWm6BbIAAw0jB.jpg

GXxTn7QW4AAhM7j.jpg
 

Cynic

Superstar
Joined
Jan 7, 2013
Messages
16,171
Reputation
2,289
Daps
34,950
Reppin
NULL

1/1
How reasoning works in OpenAI's o1


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXWbkM1XgAAZGQb.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993

image.png


image.png

Two fathers and two sons went fishing one day. They were there the whole day and only caught 4 fish. One father said, that is enough for all of us, there is exactly 1 fish for each of us. There is no extra fish! How can this be possible? Note: Your answer must make sure each person eats EXACTLY 1 fish, and there are 4 fish, with NO LEFT OVER.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993






1/11
@svpino
Large Language Models don't reason.

Thank you,
Apple.



GZnseXPaAAAkmWZ.jpg


2/11
@RaulJuncoV
Job is safe 😉



3/11
@svpino
Job might not need reasoning…



4/11
@AlexTobiasDev
Excuses. AI will be the new 'him' in math, programming, science pretty soon.

Y'all can be mad and cling onto false hopes all you want.



5/11
@svpino
who's mad? not me



6/11
@barancezayirli
People are not going to give up on the idea that random text-generation code can reason 😂

This is a good paper. First we need to define and understand how humans reason, then we can build a solution for that.



7/11
@SagarBhupalam
Watch what Ilya had to say about LLMs reasoning.



8/11
@Stephen87165188
However, frontier LLMs do memorize the patterns of deductive logic at many levels of abstraction.

In particular, for someone experienced in symbolic logic, knowledge representation, planning, and logical justification, it is observable that frontier LLMs do reason and justify it, especially when only a few steps from the given context are required.

Not all forms of reasoning, and not yet to the depth of an expert human, but 'reason' enough to help automate the construction of what comes next in AI.



9/11
@wardchristianj
I think we need to take a big step back and recognize that humans don’t often “reason” well either. There are more similarities than not.



10/11
@01Singularity01
"Reasoning" is the progressive, iterative reduction of informational entropy in a knowledge domain. o1-preview does that better by introducing iteration. It's not perfect, but it does it. "Reasoning" is not a special, magical thing. It's an information process.



11/11
@jenwiderberg
Psychology enters the chat on reason:

Interesting and evolving topic.



GZoCoSfWsAEx9CE.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993













1/13
@MFarajtabar
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral and leading closed models, including the recent OpenAI GPT-4o and o1-series.
https://arxiv.org/pdf/2410.05229

Work done with @i_mirzadeh, @KeivanAlizadeh2, Hooman Shahrokhi, Samy Bengio, @OncelTuzel.

#LLM #Reasoning #Mathematics #AGI #Research #Apple



GZjJoMBaUAAe0Lw.png


2/13
@MFarajtabar
2/ When OpenAI released GSM8K ~3 years ago, GPT-3 (175B) scored 35% on the GSM8K test. Today, models with ~3B parameters are surpassing 85%, and larger ones are hitting >95%. But has model 'reasoning' really improved? How much of this is genuine #logical/#symbolic reasoning vs. #pattern_recognition, inadvertent data #contamination, or #overfitting?



GZjKd0sa4AAFSGj.jpg


3/13
@MFarajtabar
3/ Introducing GSM-Symbolic—our new tool to test the limits of LLMs in mathematical reasoning. We create symbolic templates from the #GSM8K test set, enabling the generation of numerous instances and the design of controllable experiments. We generate 50 unique GSM-Symbolic sets, essentially like GSM8K examples but with different values and names. How do models handle these distinct sets?



GZjMMU6aAAAsMfK.jpg
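
The templating idea is easy to picture. A hedged toy sketch follows; the template, names, and values are invented for illustration and are not drawn from the released GSM-Symbolic data.

```python
# Toy sketch of the GSM-Symbolic idea: one GSM8K-style problem becomes a
# template with symbolic names and values, from which many variants are sampled.
import random

TEMPLATE = ("{name} picks {x} apples on Monday and {y} apples on Tuesday. "
            "How many apples does {name} have in total?")
NAMES = ["Sophie", "Liam", "Mia", "Noah"]

def sample_variant() -> tuple[str, int]:
    x, y = random.randint(2, 40), random.randint(2, 40)
    question = TEMPLATE.format(name=random.choice(NAMES), x=x, y=y)
    return question, x + y                       # question and ground-truth answer

for _ in range(3):
    q, a = sample_variant()
    print(q, "->", a)
```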


4/13
@MFarajtabar
4/ #Result 1: Current accuracies on GSM8K are not reliable! We observe LARGE performance variation: Llama 8B scores anywhere between 70% and 80%, Phi-3 scores between 75% and 90%, and so on. For most models, the average performance on GSM-Symbolic is lower than on GSM8K (indicated by the dashed line).



GZjPSQXaAAIKpmx.png

GZjPSQYaAAIWrfF.jpg


5/13
@MFarajtabar
5/ #Result 2: The fragility of supposed LLM reasoning. LLMs remain sensitive to changes in proper names (e.g., people, foods, objects), and even more so when numbers are altered. Would a grade-school student's math test score vary by ~10% if we only changed the names?



GZjP0ciaoAAbl2g.jpg


6/13
@MFarajtabar
6/ What if we adjust question difficulty? We introduce 3 new variants of GSM-Symbolic to study model behavior: removing one clause (GSM-M1), adding one clause (GSM-P1), or adding two clauses (GSM-P2).



GZjQAXtbIAA4Xsx.jpg


7/13
@MFarajtabar
7/ #Result 3: As questions increase in difficulty (M1 → Symbolic → P1 → P2), not only does performance drop, but variance also rises, making models increasingly unreliable.



GZjQTn4aAAEKsdh.jpg


8/13
@MFarajtabar
8/ This begs the question: Do these models truly understand mathematical concepts? Introducing #GSM_NoOp! We add a single clause that seems relevant but doesn't contribute to the overall reasoning (hence "no-op"). Check out what happens next!



GZjQguIbwAAZqSF.jpg
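
For intuition, here is an invented example in the spirit of GSM-NoOp (not taken from the paper's data): the extra clause mentions a quantity that should not affect the answer, yet the thread reports that models frequently subtract it anyway.

```python
# Invented GSM-NoOp-style example: append a clause that sounds relevant
# but does not change the arithmetic.
base = ("Liam picks 12 apples on Monday and 9 apples on Tuesday. "
        "How many apples does Liam have in total?")
noop = ("Liam picks 12 apples on Monday and 9 apples on Tuesday, "
        "but 5 of Monday's apples are a bit smaller than average. "
        "How many apples does Liam have in total?")
answer = 12 + 9  # ground truth is 21 for BOTH variants
# The reported failure mode is answering 16 on the second variant,
# i.e. subtracting the irrelevant quantity.
print(base, "->", answer)
print(noop, "->", answer)
```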


9/13
@MFarajtabar
9/ #Result 4: A massive performance drop! All models, including o1 models, show significant declines. While it’ll be interesting to see how grade-school students perform on similar datasets, I doubt the drop would be this severe.



GZjQ3XFaAAE76On.png


10/13
@MFarajtabar
10/ #Result 5: Can scaling data, models, or compute fundamentally solve this? We don't think so! #OpenAI's #o1-series is performing better but still suffers from slight performance variations. #o1_preview shows significant improvements, but...



GZjRMfqaAAMaqf5.jpg


11/13
@MFarajtabar
11/ ... but even o1-preview shows the same silly mistakes, like this one. Either it doesn't understand what 'now' is, or it doesn't understand what 'last year' is, or, more likely, its training data about inflation has this pattern and it's following it again.



GZjRYrwaAAId3xv.jpg


12/13
@MFarajtabar
12/ Understanding LLMs' true reasoning capabilities is crucial for deploying them in real-world scenarios where accuracy and consistency are non-negotiable—especially in #AI_safety, #alignment, #education, #health_care, and #decision_making systems. Our findings emphasize the need for more robust and adaptable evaluation methods. Developing models that move beyond pattern recognition to true logical reasoning is the next big challenge for the #AI #community.



13/13
@MFarajtabar
13/ Overall, we found no evidence of formal reasoning in language models, including open-source models like #Llama, #Phi, #Gemma, and #Mistral, and leading closed models, including the recent #OpenAI #GPT-4o and #o1-series. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%! We can scale data, parameters, and compute—or use better training data for Phi-4, Llama-4, GPT-5. But we believe this will result in 'better pattern-matchers,' not necessarily 'better reasoners.'
Check out the full paper to find out more: https://arxiv.org/pdf/2410.05229
Also stay tuned for the data release!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,333
Reputation
8,496
Daps
159,993






1/11
@GaryMarcus
👇Superb new article from @apple AI: “we found no evidence of formal reasoning in language models. Their behavior is better explained by sophisticated pattern matching—so fragile, in fact, that changing names can alter results by ~10%!”

𝗧𝗵𝗲𝗿𝗲 𝗶𝘀 𝗷𝘂𝘀𝘁 𝗻𝗼 𝘄𝗮𝘆 𝗰𝗮𝗻 𝘆𝗼𝘂 𝗯𝘂𝗶𝗹𝗱 𝗿𝗲𝗹𝗶𝗮𝗯𝗹𝗲 𝗮𝗴𝗲𝗻𝘁𝘀 𝗼𝗻 𝘁𝗵𝗶𝘀 𝗳𝗼𝘂𝗻𝗱𝗮𝘁𝗶𝗼𝗻, where changing a word or two in irrelevant ways can give you a different answer.

Strongly encourage you to read the whole thread.

[Quoted tweet]
1/ Can Large Language Models (LLMs) truly reason? Or are they just sophisticated pattern matchers? In our latest preprint, we explore this key question through a large-scale study of both open-source models like Llama, Phi, Gemma, and Mistral and leading closed models, including the recent OpenAI GPT-4o and o1-series.
arxiv.org/pdf/2410.05229

Work done with @i_mirzadeh, @KeivanAlizadeh2, Hooman Shahrokhi, Samy Bengio, @OncelTuzel.

#LLM #Reasoning #Mathematics #AGI #Research #Apple


GZjJoMBaUAAe0Lw.png


2/11
@GaryMarcus
longer, more in-depth discussion of why LLMs are cooked here, relating this new study to many others: LLMs don’t do formal reasoning - and that is a HUGE problem



3/11
@ahatamiz1
Unfortunately, I disagree this work has found anything meaningful about "lack of reasoning" in LLMs.

In fact, most of the identified issues are due to poor associative-recall performance. And there are many ways to improve recall in LLMs (see the proposed delta rule by Schmidhuber for example)



4/11
@GaryMarcus
This is consistent with work of mine back to 1998; I believe the results are robust, and after 25 years want more than hand-waving to believe a solution is in reach.



5/11
@kevin_nejad
o1 results are promising, though they seem to fail on the NoOp task (~17% drop in performance).



6/11
@GaryMarcus
you can’t build a trustworthy agent with that, never mind the expense



7/11
@bob_mankoff
Seems a stretch to just call this "sophisticated pattern matching". In any case, I don't think it's something Gary thought these models could do when he wrote "Rebooting AI" in 2019.



GZntZ3QXQAAtcKY.png


8/11
@GaryMarcus
there are undoubtedly small variants of that which will fail, and that’s kind of the point of the paper/thread.

a bigger set of training patterns (including synthetic data), but the same core failure mode



9/11
@UriGil3
this is barely science. the bottom line is that he doesn't have a control group of humans to show that their performance doesn't degrade on those variations. so no conclusion on reasoning can be reliably made.



10/11
@GaryMarcus
imagine if we tested calculators with a control group of humans. what would that prove?



11/11
@DynamicGOP
Agreed. Fundamentally flawed technology. OK for parlor tricks and canned demo reels, but impossible for production systems. An algo that just makes stuff up in a way that can’t be traced as an error cannot be used. Bad tech.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 