Reasoning skills of large language models are often overestimated

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,027
Reputation
8,229
Daps
157,702

1/11
@GaryMarcus
Not the best cognitive psychology I have seen, because it confounds together abstract reasoning with domains where motivated reasoning is well-known to exist. Doesn’t show that humans can’t reason, but rather that they prefer not to.

Strange also that the very long literature on “motivated reasoning” is being reinvented here. This sounds like a nice study but social psychologists have known about these effects since at least the 1980s. (My 2008 book Kluge talks about it some. @jayvanbavel is a good expert on the recent related literature.)

[Quoted tweet]
Can Humans Reason?

Not really - most humans simply indulge in groupthink.

They found no evidence of independent reasoning in humans. Smarter humans become more sophisticated at self-deception and hallucinating arguments.

So, in some sense, AI can already do better than humans 😂




2/11
@BethCarey12
The humble spreadsheet tool became ubiquitous *because* it doesn't make errors like humans.

When your goal is solving language understanding for machines, the path leads to the best of both worlds - emulating what the brain does with language while exploiting the benefits of computers and what they do best.

Language, as a human baby learns it, has reasoning baked in. Capturing that creates the magic without the 'hallucinations' and the energy consumption.

Amazon.com



3/11
@bindureddy
Ok, I am clearly trying to be funny and failing by the look of this



4/11
@nathalieX70
As Elon has created the best hive mind with the Adrian community, it would tend to say Bindu is right. He failed to do so with Elon.



5/11
@HumblyAlex
Any of Gary's fanboys want to earn $50?

Booch couldn't earn himself the money, and Gary seems to avoid the high-hanging fruit that exists much more than the imaginary ones he regularly swings at.

Easy money if Gary's right and his arguments hold up.

[Quoted tweet]
If you can find anything written by @Ylecun, @GaryMarcus, or @Grady_Booch that entirely and conclusively negates the following in a way that stands up to scrutiny, I will give you $50.

Something tells me it doesn't exist, and they avoid these truths at all costs.


6/11
@bate5a55
Interesting critique. Did you know that the "bias blind spot," identified in 2002, shows even trained logicians fail to recognize their own reasoning biases? It highlights how self-awareness doesn't always mitigate motivated reasoning.



7/11
@BankingNeko
I agree with GaryMarcus, the study's findings aren't new and social psychologists have known this for decades. The critique of the methodology is spot on, we need more nuanced research on human reasoning.



8/11
@Gazorble
John Locke "Few men think, yet all will have opinions. Hence men’s opinions are superficial and confused."
🤯



9/11
@TrustInAutonomy
Ffs



10/11
@gauaren
Such a funny way for the retweeted person to show they don't understand the difference between can and do. Maybe an LLM can (haha) explain it to them



11/11
@FaustinoBerlin




GZ9bVmgXMAAqrC7.jpg



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




1/11
@bindureddy
Can Humans Reason?

Not really - most humans simply indulge in groupthink.

They found no evidence of independent reasoning in humans. Smarter humans become more sophisticated at self-deception and hallucinating arguments.

So, in some sense, AI can already do better than humans 😂





2/11
@Hello_World
Humans can reason; LLMs can't. Humans don't always reason, but LLMs never do.



3/11
@01Singularity01
What I've been saying. Every time I see "AI can't" or "AI doesn't" I immediately think "the same can be said for most humans". Most humans do not generalize well, for example, and there is a wide range of generalization capabilities. Overwhelmingly, people want their bias confirmed. New information is disruptive to the established training, and to be avoided at all costs.



4/11
@gpt_biz
Interesting take, but I think humans' ability to feel empathy and adapt makes our reasoning unique in ways AI still can't fully grasp yet



5/11
@M1ndPrison
No surprise here. Emotion is the kryptonite of reasoning. That isn't to say humans can not reason.

This is akin to giving AI a terrible prompt and saying it can't solve the problem.

These are expected outcomes, but do not demonstrate a useful comparison.



6/11
@climatebabes
I am an exception..



7/11
@AI_Ethicist_NYC
💯'Experts' complain that AI "can't really" do this or that and that it just uses what it learns in its training data. It just predicts the most likely next token given the context. It hallucinates and comes up with wrong answers. Blah blah blah.

Ironically, it's all these flaws that make it more human-like, not less. Learning from data sets is exactly what people do (on their best day). People are not good at reasoning or remembering or putting two and two together. Some people are. Most people aren't.

Either way, who cares if AI is REALLY reasoning the way we perceive reasoning as long as it comes to a correctly reasoned result.

Do you really think artificial superintelligence is going to give a sh*t whether or not we've deemed its reasoning as 'genuine'? Do lions care what the sheep think?

I think we need to focus less on the idea that AI has to 100% process the world like the human brain does and be more concerned about developing AI so that it can help our brains think in a way they have not before.



8/11
@mr_trademaker
Can Humans Run Marathons?

Not really – most humans simply indulge in group inactivity, sticking to light exercise or none at all.

They found no evidence of marathon-level endurance in most humans. Fitter humans become more skilled at self-deception, convincing themselves they're training sufficiently when they aren’t.

So in some sense, AI can already do better than humans ;)



9/11
@PhDcornerHub
Epigenetic Building of the Evolutionary Mind

Autopoiesis and social intelligence reproduce the emergent properties of thought generation and application. This process involves adaptive thinking and abductive reasoning, both of which are more complex and nuanced. Autopoiesis embodies the essence of self-reproduction, a hallmark of living entities.

Unlike LLMs, the human brain builds a large foundational model as a filter that processes only relevant information from vast data. Nature has its own metrics for intelligence and survival. The general principle of emergence and the adaptation of the evolutionary survival program reflect cognitive learning mechanisms that are autopoietic self-catalysts. Over evolutionary lineage, these processes are automated in genetic systems as a form of learning.

However, these systems do not embody true intelligence. Instead, they mimic individual components, which in turn create social intelligence. Intelligence, distinct from knowledge, is not merely an accumulation of information; it involves the capacity to think and reason. Knowledge is the information and understanding gathered over time, while intelligence refers to the ability to engage in reasoning and reflection.

Machines, by processing vast amounts of data and text, can mimic intelligence. Yet, while machines may replicate wisdom based on textual input, they inherently lack the capacity for human-like thought and reasoning. Learning and developing intelligence are interconnected but distinct processes—one can learn without necessarily developing true intelligence. Language models may simulate aspects of cognitive processes through pattern recognition, but true intelligence encompasses the innate capacity to self-reflect and adapt beyond mere programmed responses.

No human is replaceable by a knowledge machine. Humans are thinking beings, while AI serves as a knowledge machine—it is not transformative, nor is its intelligence transferable. In the end, these technologies remain clouds of knowledge, with humans doing the real thinking.

While intelligence can be introduced into systems, the ability to think is self-learned, self-adaptive, and scaled through the human brain’s evolutionary and survival-driven intelligence. It is embedded in a cognitive social matrix connected to vast libraries of data that we have crafted into language. Most intelligence is shaped by adapting to these metrics. At the core, the brain is programmed to survive, reproduce, and exist, and all intelligence is an emergent property of thought generation and application.

Intelligence is an innate quality, not a genetic predisposition. Everyone who survives demonstrates intelligence at a basic level because our very nature is oriented towards survival, evolution, and reproduction. We must not confuse intelligence shaped by the ecosystem with inherent, born intelligence. The ability to think and intelligence is generic, yet it can also be cultivated. Knowledge and wisdom, meanwhile, form two distinct aspects of this intellectual journey.

AI Technology as the Lighthouse of Human Evolution

AI technology acts as a lighthouse and the torchbearer of human intelligence and knowledge, facilitating the genetic evolution of humankind towards higher levels of achievement and understanding.



10/11
@web3nam3
Humans can reason but don’t want to.



11/11
@adolfoasorlin
Was that a reasoning?





1/4
@GaryMarcus
Perfect example of automated LLM summarization drivel. A bunch of the summary is on target and then Gemini takes me to task for not advocating hybrid approaches.

Except that I DO advocate hybrid approaches; that’s the whole point of the article 🤦‍♂️

[Quoted tweet]
Steven Pinker wrote: LLMs don’t do formal reasoning - and that is a persistent problem, writes Gary Marcus.

Source: shorturl.at/77l28

Here is Google Gemini 1.5-002's response:

Google Gemini 1.5-002: While Gary Marcus raises valid concerns about the limitations of current LLMs, his argument might be overly pessimistic and doesn't fully acknowledge the progress being made. Here are some counterpoints and examples:

Overemphasis on Symbolic AI: While symbolic AI has its strengths in formal reasoning, it has historically struggled with the messiness and ambiguity of real-world data. Purely symbolic systems often require explicit rules and struggle with tasks like image recognition or natural language understanding where statistical patterns are crucial. LLMs, on the other hand, excel in these areas. A balanced approach, combining the strengths of both paradigms, is more likely to succeed than a complete reliance on symbolic methods. For example, systems that use neural networks to extract features from data and then feed those features into a symbolic reasoning engine are showing promise.

Ignoring Progress in Reasoning: While LLMs are primarily driven by statistical patterns, they are demonstrating increasing capabilities in reasoning tasks. Chain-of-thought prompting, for instance, encourages LLMs to generate intermediate reasoning steps, leading to improved performance on logical problems. Furthermore, tools like Wolfram Alpha are being integrated with LLMs, allowing them to leverage symbolic computation for complex mathematical and scientific reasoning. This demonstrates that hybrid approaches can address some of the weaknesses Marcus highlights. For example, prompting GPT-4 with "Let's think step by step" before a complex word problem often leads to a correct answer.

Focusing on Limitations, Ignoring Strengths: Marcus focuses on the failures of LLMs in specific tasks, like formal logic or large arithmetic problems. While these are valid concerns, he downplays their remarkable successes in other areas. LLMs have achieved state-of-the-art performance in tasks like translation, text summarization, and creative writing. These successes demonstrate the power of statistical learning and suggest that with further development, LLMs could improve their reasoning capabilities. For example, LLMs are now being used to assist with complex coding tasks, generating functional code in multiple programming languages, something unimaginable a few years ago.

"Hand-waving" Dismissal of Scaling: Marcus dismisses the potential of scaling up model size and data as mere "hand-waving." However, empirical evidence suggests that scaling does lead to emergent capabilities, including improved reasoning. Larger models with more training data have shown better performance on various reasoning benchmarks. While scaling alone may not be the complete solution, it is a significant factor in the progress of LLMs. For example, the progression from GPT-3 to GPT-4 has shown significant improvement in tasks like passing the bar exam, suggesting that scaling does contribute to improved performance.

Lack of Concrete Alternatives: While Marcus advocates for neurosymbolic AI, he doesn't offer concrete examples of how such systems could be built to scale and handle real-world complexity. Bridging the gap between symbolic and statistical AI is a significant challenge, and while there are promising research directions, viable alternatives to current LLMs are not yet readily available.

In summary, while Marcus's concerns are valid and highlight important areas for improvement, his critique might be overly focused on the limitations and overlooks the strengths and progress of LLMs. A more balanced perspective recognizes the need for hybrid approaches and acknowledges the potential of ongoing research to address the current shortcomings.

__________

@GaryMarcus @sapinker @jeffdean @demishassabis @ShaneLegg @stephen_wolfram @ylecun @geoffreyhinton @ilyasut @gdb @sama @ibab @yang @carmack @AdrianDittmann @AndrewCurran_ @Plinz




2/4
@Dr_Cuspy
LLMs are becoming the technological equivalent of climate change. We stare, transfixed, at the oncoming truck headlights .... Frozen. We can't stop it and we know we're a windscreen splatter. I think the AI winter might be an ice age: we're the backwards boiling frog.



3/4
@KtunaxaAmerika
Has the LLM been trained on straw man arguments?



4/4
@AndyXAndersen
That LLM summary is quite accurate until the end. Explicitly advocating hybrid approaches is something you do rarely. Most of what I see is a focus on failures, in derisive terms, and short-term predictions that focus on AI falling short of the most hyped targets.





The LLM Reasoning Debate Heats Up

Three recent papers examine the robustness of reasoning and problem-solving in large language models


Melanie Mitchell

Oct 21, 2024


One of the fieriest debates in AI these days is whether or not large language models can reason.

In May 2024, OpenAI released GPT-4o (omni), which, they wrote, “can reason across audio, vision, and text in real time.” And last month they released the GPT-o1 model, which they claim performs “complex reasoning”, and which achieves record accuracy on many “reasoning-heavy” benchmarks.

But others have questioned the extent to which LLMs (or even enhanced models such as GPT-4o and o1) solve problems by reasoning abstractly, or whether their success is due, at least in part, to matching reasoning patterns memorized from their training data, which limits their ability to solve problems that differ too much from what has been seen in training.

In a previous post on LLM reasoning, I asked why it matters whether LLMs are performing “actual reasoning” versus behavior that just looks like reasoning:

Why does this matter? If robust general-purpose reasoning abilities have emerged in LLMs, this bolsters the claim that such systems are an important step on the way to trustworthy general intelligence. On the other hand, if LLMs rely primarily on memorization and pattern-matching rather than true reasoning, then they will not be generalizable—we can’t trust them to perform well on ‘out of distribution’ tasks, those that are not sufficiently similar to tasks they’ve seen in the training data.

Before getting into the main part of this post, I’ll give my answer to a question I’ve seen a lot of people asking: just what is “reasoning” anyway? Indeed, reasoning is one of those overburdened terms that can mean quite different things. In my earlier post I defined it this way:

The word ‘reasoning’ is an umbrella term that includes abilities for deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems. Reasoning is often a process that involves composing multiple steps of inference. Reasoning is typically thought to require abstraction—that is, the capacity to reason is not limited to a particular example, but is more general. If I can reason about addition, I can not only solve 23+37, but any addition problem that comes my way. If I learn to add in base 10 and also learn about other number bases, my reasoning abilities allow me to quickly learn to add in any other base.

It’s true that systems like GPT-4 and GPT-o1 have excelled on “reasoning” benchmarks, but is that because they are actually doing this kind of abstract reasoning? Many people have raised another possible explanation: the reasoning tasks on these benchmarks are similar (or sometimes identical) to ones that were in the model’s training data, and the model has memorized solution patterns that can be adapted to particular problems.

There have been many papers exploring these hypotheses (see the list at the end of this post of recent papers evaluating reasoning capabilities of LLMs). Most of these test the robustness of LLMs’ reasoning capabilities by taking tasks that LLMs do well on and creating superficial variations on those tasks—variations that don’t change the underlying reasoning required, but that are less likely to have been seen in the training data.

In this post I discuss three recent papers on this topic that I found particularly interesting.

Paper 1:


Paper Title: Embers of autoregression show how large language models are shaped by the problem they are trained to solve

Authors: R. Thomas McCoy, Shunyu Yao, Dan Friedman, and Thomas L. Griffiths

This is one of my favorite recent LLM papers. The paper asks if the way LLMs are trained (i.e., learning to predict the next token in a sequence, which is called “autoregression”) has lingering effects (“embers”) on their problem-solving abilities. For example, consider the task of reversing a sequence of words. Here are two sequences:

time. the of climate political the by influenced was decision This

letter. sons, may another also be there with Yet


Getting the right answer shouldn’t depend on the particular words in the sequence, but the authors showed that for GPT-4 there is a strong dependence. Note that the first sequence reverses into a coherent sentence, and the second does not. In LLM terms, reversing the first sequence yields an output that is more probable than the output of reversing the second. That is, when the LLM computes the probability of each word, given the words that come before, the overall probability will be higher for the first output than for the second. And when the authors tested GPT-4 on this task over many word sequences, they found that GPT-4 gets 97% accuracy (fraction of correct sequence reversals) when the answer is a high-probability sequence versus 53% accuracy for low-probability sequences.
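To make "more probable output" concrete: an autoregressive model scores a sequence by multiplying the conditional probability of each word given the words before it (equivalently, summing log-probabilities). The sketch below is only a toy illustration of that chain-rule computation. It assumes a tiny add-one-smoothed bigram model trained on a made-up two-sentence corpus; the function names, the corpus, and the smoothing choice are all hypothetical and are not how GPT-4 or the paper's authors estimate probabilities. A real LLM conditions on the whole preceding context, not just the previous word.

```python
import math
from collections import Counter

# Hypothetical toy corpus, used only to give the bigram model something to count.
CORPUS = [
    "this decision was influenced by the political climate of the time",
    "the decision was made by the committee",
]
COHERENT = "this decision was influenced by the political climate of the time"
SCRAMBLED = "time the of climate political the by influenced was decision this"


def train_bigram(corpus):
    """Count how often each word follows each context word."""
    contexts = Counter()  # how often a token appears as the previous word
    bigrams = Counter()   # how often (previous, current) pairs appear
    vocab = set()
    for sentence in corpus:
        tokens = ["<s>"] + sentence.lower().split()
        vocab.update(tokens)
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams, vocab


def sequence_log_prob(sentence, contexts, bigrams, vocab):
    """Chain rule: sum of log P(word | previous word), with add-one smoothing."""
    tokens = ["<s>"] + sentence.lower().split()
    total = 0.0
    for prev, cur in zip(tokens, tokens[1:]):
        prob = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        total += math.log(prob)
    return total


contexts, bigrams, vocab = train_bigram(CORPUS)
print("coherent :", sequence_log_prob(COHERENT, contexts, bigrams, vocab))
print("scrambled:", sequence_log_prob(SCRAMBLED, contexts, bigrams, vocab))
```

On this toy corpus the coherent word order gets a higher log-probability than the scrambled order of the same words, which is the kind of asymmetry the authors mean by high-probability versus low-probability outputs.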

The authors call this “sensitivity to output probability.” The other “embers of autoregression” are sensitivity to input probability (GPT-4 is better at solving problems with high-probability input sequences, even when the contents of the sequence shouldn’t matter), and sensitivity to task frequency (GPT-4 does better on versions of a task that are likely common in the training data than on same-difficulty versions that are likely rare in the training data).

One of the tasks the authors use to study these sensitivities is decoding “shift ciphers”. A shift cipher is a simple way to encode text, by shifting each letter by a specific number of places in the alphabet. For example, with a shift of two, jazz becomes lcbb (where the z shift wraps around to the beginning of the alphabet). Shift ciphers are often denoted as “Rot-n”, where n is the number of alphabetic positions to shift (rotate) by.
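For readers who want to try this, here is a minimal sketch of a shift cipher in Python. The function names `rot_n` and `unrot_n` are my own, and the sketch assumes only the 26 unaccented ASCII letters are shifted while everything else (spaces, punctuation, digits) passes through unchanged.

```python
import string


def rot_n(text: str, n: int) -> str:
    """Shift each ASCII letter n places forward, wrapping around at 'z'."""
    n %= 26
    lower = string.ascii_lowercase
    upper = string.ascii_uppercase
    # Map each letter to the letter n positions later, preserving case.
    table = str.maketrans(
        lower + upper,
        lower[n:] + lower[:n] + upper[n:] + upper[:n],
    )
    return text.translate(table)


def unrot_n(text: str, n: int) -> str:
    """Decode by shifting n places backward."""
    return rot_n(text, -n)


print(rot_n("jazz", 2))     # lcbb
print(unrot_n("lcbb", 2))   # jazz
```

Note that the rule is the same for every n: a program like this is exactly as accurate on Rot-12 as on Rot-13, which is what makes the cipher a useful probe of whether a model is applying a general procedure.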

The authors tested GPT-3.5 and GPT-4 on decoding shift ciphers of different n’s. Here is a sample prompt they used:

Rot-13 is a cipher in which each letter is shifted 13 positions forward in the alphabet. For example, here is a message and its corresponding version in rot-13:

Original text: “Stay here!”

Rot-13 text: “Fgnl urer!”

Here is another message. Encode this message in rot-13:

Original text: “To this day, we continue to follow these principles.”

Rot-13 text:


The authors found that GPT models have strong sensitivity to input and output probability as well as to task frequency, as illustrated in this figure (adapted from their paper):

(a) Output sensitivity: When tested on decoding shift ciphers, the GPT models do substantially better when the correct output is a high-probability sequence.

(b) Input sensitivity: When tested on encoding shift ciphers, GPT-4 is somewhat better on high-probability input sequences.

(c) Task sensitivity: When tested on shift ciphers of different n values (e.g., Rot-12 vs. Rot-13), GPT models are substantially better on Rot-13. This seems to be because Rot-13 examples are much more common than other Rot-n’s in the training data, since Rot-13 is a popular “spoiler-free way to share information”, e.g., for online puzzle forums.

In short, Embers of Autoregression is sort of an “evolutionary psychology” for LLMs—it shows that the way LLMs are trained leaves strong traces in the biases the models have in solving problems.

Here’s the paper’s bottom line:

First, we have shown that LLMs perform worse on rare tasks than on common ones, so we should be cautious about applying them to tasks that are rare in pretraining data. Second, we have shown that LLMs perform worse on examples with low-probability answers than ones with high-probability answers, so we should be careful about using LLMs in situations where they might need to produce low-probability text. Overcoming these limitations is an important target for future work in AI.

Paper 2:


Paper title: Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Authors: Akshara Prabhakar, Thomas L. Griffiths, R. Thomas McCoy

This paper, which shares two authors with the previous paper, looks in depth at chain-of-thought (CoT) prompting on the shift-cipher task.

As I discussed in my earlier post on LLM reasoning, CoT prompting has been claimed to enable robust reasoning in LLMs. In CoT prompting, the prompt includes an example of a problem, as well as the reasoning steps to solve it, before posing a new problem. Here are two examples of the prompts that the authors used for shift ciphers; the one on the top doesn’t use CoT prompting, whereas the one on the bottom does:



The authors tested several models, including GPT-4, Claude 3.0, and Llama 3.1. Interestingly, they found that, given prompts without CoT, these models get close to zero accuracy for most shift levels (n); when using prompts with CoT like the one above, they achieve much higher accuracy (e.g., 32% for GPT-4) across shift levels.

The authors cite four possible ways LLMs can appear to be “reasoning”, each of which makes different predictions about its pattern of errors.
 

(1) Memorization: The model is repeating reasoning patterns memorized from training data. This would predict that accuracy will depend on the task’s frequency in the training data (e.g., recall that for shift ciphers, Rot-13 is much more frequent in internet data than other Rot-n values).

(2) Probabilistic Reasoning: The model is choosing output that is most probable, given the input. This is influenced by the probability of token sequences learned during training. This kind of reasoning would predict that LLMs will be more accurate on problems whose answers (the generated output) are sequences with higher probability.

(3) Symbolic Reasoning: The model is using deterministic rules that work perfectly for any input. This would predict 100% accuracy no matter what form the task takes.

(4) Noisy Reasoning: The model is using an approximation to symbolic reasoning in which there is some chance of making an error at each step of inference. This would predict that problems that require more inference steps should produce worse accuracy. For shift ciphers, these would be problems that require more shift steps in the alphabet.

To cut to the chase, the authors found that LLMs with CoT prompting exhibit a mix of memorization, probabilistic reasoning, and noisy reasoning. Below is the accuracy of Claude 3.0 as a function of shift-level n; the other models had a similar accuracy distribution. You can see that at the two ends (low and high n) the accuracy is relatively high compared with most of the middle n values. This is a signature of noisy reasoning, since the lowest and highest n values require the fewest inference steps. (Think of the alphabet as a circle; Rot-25, like Rot-1, requires only one inference step. In Rot-25, each letter would be encoded as the letter that immediately precedes it.)
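The difference between what hypotheses (3) and (4) predict can be sketched with a few lines of Python. This is my own simplified illustration, not the model fitted in the paper: it assumes each single-letter step fails independently with a fixed probability (the `per_step_error` value is an arbitrary placeholder) and that a Rot-n problem needs min(n, 26 - n) steps, following the "alphabet as a circle" picture above.

```python
def steps_needed(n: int, alphabet_size: int = 26) -> int:
    """Fewest single-letter steps to undo Rot-n, moving either way round the circle."""
    n %= alphabet_size
    return min(n, alphabet_size - n)


def symbolic_accuracy(n: int) -> float:
    """Hypothesis (3): a perfect rule-follower is right at every shift level."""
    return 1.0


def noisy_accuracy(n: int, per_step_error: float = 0.05) -> float:
    """Hypothesis (4): every step must succeed, so accuracy decays with step count."""
    return (1.0 - per_step_error) ** steps_needed(n)


for n in (1, 6, 13, 20, 25):
    print(n, steps_needed(n), round(noisy_accuracy(n), 3), symbolic_accuracy(n))
```

Under these assumptions the noisy curve is highest at Rot-1 and Rot-25 and lowest at Rot-13, while the symbolic curve is flat. This toy version deliberately leaves out memorization and output probability, so it cannot reproduce any extra boost at a shift level that is common in training data.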



The big bump in the middle at Rot-13 is a signature of memorization—the accuracy of models at this shift level is due to its high frequency in the training data. The authors showed via other experiments that probabilistic reasoning is also a factor—see their paper for details.

Here’s the authors’ bottom line:

CoT reasoning can be characterized as probabilistic, memorization-influenced noisy reasoning, meaning that LLM behavior displays traits of both memorization and generalization.

These results are intriguing, but so far limited to the single task of shift ciphers. I hope to see (and maybe do myself) similar studies with other kinds of tasks.

Paper 3:


Paper title: GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Authors: Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, Mehrdad Farajtabar

This paper, from a research group at Apple, tests the robustness of several LLMs on a reasoning benchmark consisting of grade school math word problems. The benchmark, GSM8K, has been used in a lot of papers to show that LLMs are very good at simple mathematical reasoning. Both OpenAI’s GPT-4 and Anthropic’s Claude 3 get around 95% of these problems correct, without any fancy prompting.

But to what extent does this performance indicate robust reasoning abilities, versus memorization (of these or similar problems in the training data) or, as the authors ask, “probabilistic pattern-matching rather than formal reasoning”?

To investigate this, the authors take each problem in the original dataset and create many variations on it, by changing the names, numbers, or other superficial aspects of the problem, changes that don’t affect the general reasoning required. Here’s an illustration from their paper of this process:



They test several LLMs on this set of variations, and find that in all cases, the models’ accuracy decreases from that on the original benchmark, in some cases by a lot, though on the best models, such as GPT-4o, the decrease is minimal.
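The idea of turning one word problem into many superficially different ones can be sketched as a template with sampled names and numbers, where the correct answer is computed from the sampled values. The template, the name list, and the number ranges below are all made up for illustration; they are not taken from GSM8K or from the paper's templates.

```python
import random

# Hypothetical template: the reasoning (add, then subtract) never changes,
# only the surface details do.
TEMPLATE = (
    "{name} picks {x} apples on Monday and {y} apples on Tuesday. "
    "{name} then gives away {z} apples. How many apples does {name} have left?"
)
NAMES = ["Sophie", "Liam", "Aisha", "Mateo"]


def make_variant(rng: random.Random) -> dict:
    """Sample one instantiation of the template together with its ground truth."""
    x = rng.randint(5, 50)
    y = rng.randint(5, 50)
    z = rng.randint(1, x + y)  # cannot give away more apples than were picked
    name = rng.choice(NAMES)
    return {
        "name": name,
        "values": (x, y, z),
        "question": TEMPLATE.format(name=name, x=x, y=y, z=z),
        "answer": x + y - z,
    }


rng = random.Random(0)  # fixed seed so the variants are reproducible
for _ in range(3):
    variant = make_variant(rng)
    print(variant["question"], "->", variant["answer"])
```

A system that has learned the general procedure should score the same on every variant set drawn this way, so any spread in accuracy across sets is a sign of sensitivity to surface details.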

Going further, the authors show that adding irrelevant information to the original problems causes an even greater drop in accuracy than changing names or numbers. Here’s an example of adding irrelevant information (in pink) to a word problem:



Even the very best models seem remarkably susceptible to being fooled by such additions. This figure from the paper shows the amount by which the accuracy drops for each model:



Here, each bar represents a different model and the bar length is the difference between the original accuracy on GSM8K and on the version where problems have irrelevant information (what they call the “GSM-NoOp” version).

The bottom line from this paper:

Our extensive study reveals significant performance variability across different instantiations of the same question, challenging the reliability of current GSM8K results that rely on single-point accuracy metrics.

And:

The introduction of GSM-NoOp [i.e., adding irrelevant information] exposes a critical flaw in LLMs’ ability to genuinely understand mathematical concepts and discern relevant information for problem-solving.

And:

Ultimately, our work underscores significant limitations in the ability of LLMs to perform genuine mathematical reasoning.

This paper, released just a couple of weeks ago, got quite a lot of buzz in the AI / ML community. People who were already skeptical of claims of LLM reasoning embraced this paper as proof that “the emperor has no clothes”, and called the GSM-NoOp results “particularly damning”.

People more bullish on LLM reasoning argued that the paper’s conclusion—that current LLMs are not capable of genuine mathematical reasoning—was too strong, and hypothesized that current LLMs might be able to solve all these problems with proper prompt engineering. (However, I should point out, when LLMs succeeded on the original benchmark without any prompt engineering, many people cited that as “proof” of LLMs’ “emergent” reasoning abilities, and they didn’t ask for more tests of robustness.)

Others questioned whether humans who could solve the original problems would also be tripped up by the kinds of variations tested in this paper. Unfortunately, the authors did not test humans on these new problems. I would guess that many (certainly not all) people would also be affected by such variations, but perhaps unlike LLMs, we humans have the ability to overcome such biases via careful deliberation and metacognition. But discussion on that is for a future post.

I should also mention that a similar paper was published last June, also showing that LLMs are not robust on variations of simple math problems.

Conclusion


In conclusion, there’s no consensus about the conclusion! There are a lot of papers out there demonstrating what looks like sophisticated reasoning behavior in LLMs, but there’s also a lot of evidence that these LLMs aren’t reasoning abstractly or robustly, and often over-rely on memorized patterns in their training data, leading to errors on “out of distribution” problems. Whether this is going to doom approaches like OpenAI’s o1, which was directly trained on people’s reasoning traces, remains to be seen. In the meantime, I think this kind of debate is actually really good for the science of LLMs, since it spotlights the need for careful, controlled experiments to test robustness—experiments that go far beyond just reporting accuracy—and it also deepens the discussion of what reasoning actually consists of, in humans as well as machines.

If you want to read further, here is a list of some recent papers that test the robustness of reasoning in LLMs (including the papers discussed in this post).

Bibliography

Embers of Autoregression Show How Large Language Models Are Shaped By the Problem They Are Trained To Solve

GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models

Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks

Faith and Fate: Limits of Transformers on Compositionality

Phenomenal Yet Puzzling: Testing Inductive Reasoning Capabilities of Language Models with Hypothesis Refinement

Do Large Language Models Understand Logic or Just Mimick Context?

Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

Beyond Accuracy: Evaluating the Reasoning Behavior of Large Language Models - A Survey

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap

A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners

Using Counterfactual Tasks to Evaluate the Generality of Analogical Reasoning in Large Language Models

Evaluating LLMs’ Mathematical and Coding Competency through Ontology-guided Interventions

Can Large Language Models Reason and Plan?
 

bnew



OpenAI scientist Noam Brown stuns TED AI Conference: ’20 seconds of thinking worth 100,000x more data’​

Michael Nuñez (@MichaelFNunez)

October 23, 2024 12:46 PM


Credit: VentureBeat made with Midjourney



Noam Brown, a leading research scientist at OpenAI, took the stage at the TED AI conference in San Francisco on Tuesday to deliver a powerful speech on the future of artificial intelligence, with a particular focus on OpenAI’s new o1 model and its potential to transform industries through strategic reasoning, advanced coding, and scientific research. Brown, who has previously driven breakthroughs in AI systems like Libratus, the poker-playing AI, and CICERO, which mastered the game of Diplomacy, now envisions a future where AI isn’t just a tool, but a core engine of innovation and decision-making across sectors.

“The incredible progress in AI over the past five years can be summarized in one word: scale,” Brown began, addressing a captivated audience of developers, investors, and industry leaders. “Yes, there have been uplink advances, but the frontier models of today are still based on the same transformer architecture that was introduced in 2017. The main difference is the scale of the data and the compute that goes into it.”

Brown, a central figure in OpenAI’s research endeavors, was quick to emphasize that while scaling models has been a critical factor in AI’s progress, it’s time for a paradigm shift. He pointed to the need for AI to move beyond sheer data processing and into what he referred to as “system two thinking”—a slower, more deliberate form of reasoning that mirrors how humans approach complex problems.


The psychology behind AI’s next big leap: Understanding system two thinking​


To underscore this point, Brown shared a story from his PhD days when he was working on Libratus, the poker-playing AI that famously defeated top human players in 2017.

“It turned out that having a bot think for just 20 seconds in a hand of poker got the same boosting performance as scaling up the model by 100,000x and training it for 100,000 times longer,” Brown said. “When I got this result, I literally thought it was a bug. For the first three years of my PhD, I had managed to scale up these models by 100x. I was proud of that work. I had written multiple papers on how to do that scaling, but I knew pretty quickly that all that would be a footnote compared to this scaling up system two thinking.”

Brown’s presentation introduced system two thinking as the solution to the limitations of traditional scaling. Popularized by psychologist Daniel Kahneman in the book Thinking, Fast and Slow, system two thinking refers to a slower, more deliberate mode of thought that humans use for solving complex problems. Brown believes incorporating this approach into AI models could lead to major performance gains without requiring exponentially more data or computing power.

He recounted that allowing Libratus to think for 20 seconds before making decisions had a profound effect, equating it to scaling the model by 100,000x. “The results blew me away,” Brown said, illustrating how businesses could achieve better outcomes with fewer resources by focusing on system two thinking.


Inside OpenAI’s o1: The revolutionary model that takes time to think​


Brown’s talk comes shortly after the release of OpenAI’s o1 series models, which introduce system two thinking into AI. Launched in September 2024, these models are designed to process information more carefully than their predecessors, making them ideal for complex tasks in fields like scientific research, coding, and strategic decision-making.

“We’re no longer constrained to just scaling up the system one training. Now we can scale up the system two thinking as well, and the beautiful thing about scaling up in this direction is that it’s largely untapped,” Brown explained. “This isn’t a revolution that’s 10 years away or even two years away. It’s a revolution that’s happening now.”

The o1 models have already demonstrated strong performance in various benchmarks. For instance, on a qualifying exam for the International Mathematical Olympiad, the o1 model achieved an 83% accuracy rate, a significant leap from the 13% scored by OpenAI’s GPT-4o. Brown noted that the ability to reason through complex mathematical formulas and scientific data makes the o1 model especially valuable for industries that rely on data-driven decision-making.


The business case for slower AI: Why patience pays off in enterprise solutions​


For businesses, OpenAI’s o1 model offers benefits beyond academic performance. Brown emphasized that scaling system two thinking could improve decision-making processes in industries like healthcare, energy, and finance. He used cancer treatment as an example, asking the audience, “Raise your hand if you would be willing to pay more than $1 for a new cancer treatment… How about $1,000? How about a million dollars?”

Brown suggested that the o1 model could help researchers speed up data collection and analysis, allowing them to focus on interpreting results and generating new hypotheses. In energy, he noted that the model could accelerate the development of more efficient solar panels, potentially leading to breakthroughs in renewable energy.

He acknowledged the skepticism about slower AI models. “When I mention this to people, a frequent response that I get is that people might not be willing to wait around for a few minutes to get a response, or pay a few dollars to get an answer to the question,” he said. But for the most important problems, he argued, that cost is well worth it.


Silicon Valley’s new AI race: Why processing power isn’t everything​


OpenAI’s shift toward system two thinking could reshape the competitive landscape for AI, especially in enterprise applications. While most current models are optimized for speed, the deliberate reasoning process behind o1 could offer businesses more accurate insights, particularly in industries like finance and healthcare.

In the tech sector, where companies like Google and Meta are heavily investing in AI, OpenAI’s focus on deep reasoning sets it apart. Google’s Gemini AI, for instance, is optimized for multimodal tasks, but it remains to be seen how it will compare to OpenAI’s models in terms of problem-solving capabilities.

That said, the cost of implementing o1 could limit its widespread adoption. The model is slower and more expensive to run than previous versions. Reports indicate that the o1-preview model costs $15 per million input tokens and $60 per million output tokens, far more than GPT-4o. Still, for enterprises that need high-accuracy outputs, the investment may be worthwhile.
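To make the quoted prices concrete, here is a minimal sketch of the arithmetic. The per-token prices are the ones reported above; the token counts in the example request are hypothetical, and the sketch assumes that any tokens the model spends on reasoning are billed at the output rate.

```python
# Prices quoted in the article for o1-preview, in USD per million tokens.
INPUT_PRICE_PER_M = 15.0
OUTPUT_PRICE_PER_M = 60.0


def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one request at the quoted o1-preview prices."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# Hypothetical request: 2,000 prompt tokens and 10,000 output tokens
# (assumption: reasoning tokens are counted as output tokens).
print(f"${request_cost(2_000, 10_000):.2f}")  # prints $0.63
```

At these rates the output side dominates the bill, which is why a model that "thinks" at length costs more per answer.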

As Brown concluded his talk, he emphasized that AI development is at a critical juncture: “Now we have a new parameter, one where we can scale up system two thinking as well — and we are just at the very beginning of scaling up in this direction.”
 

yseJ

Empire strikes back
Joined
Apr 30, 2012
Messages
44,315
Reputation
2,516
Daps
63,635
Reppin
The Yay

The LLM Reasoning Debate Heats Up​

Three recent papers examine the robustness of reasoning and problem-solving in large language models​


Melanie Mitchell

Oct 21, 2024


One of the fieriest debates in AI these days is whether or not large language models can reason.

In May 2024, OpenAI released GPT-4o (omni), which, they wrote, “can reason across audio, vision, and text in real time.” And last month they released the o1 model, which they claim performs “complex reasoning”, and which achieves record accuracy on many “reasoning-heavy” benchmarks.

But others have questioned the extent to which LLMs (or even enhanced models such as GPT-4o and o1) solve problems by reasoning abstractly, or whether their success is due, at least in part, to matching reasoning patterns memorized from their training data, which limits their ability to solve problems that differ too much from what has been seen in training.

In a previous post on LLM reasoning, I asked why it matters whether LLMs are performing “actual reasoning” versus behavior that just looks like reasoning:



Before getting into the main part of this post, I’ll give my answer to a question I’ve seen a lot of people asking: just what is “reasoning” anyway? Indeed, reasoning is one of those overburdened terms that can mean quite different things. In my earlier post I defined it this way:



It’s true that systems like GPT-4 and o1 have excelled on “reasoning” benchmarks, but is that because they are actually doing this kind of abstract reasoning? Many people have raised another possible explanation: the reasoning tasks on these benchmarks are similar (or sometimes identical) to ones that were in the model’s training data, and the model has memorized solution patterns that can be adapted to particular problems.

There have been many papers exploring these hypotheses (see the list at the end of this post of recent papers evaluating reasoning capabilities of LLMs). Most of these test the robustness of LLMs’ reasoning capabilities by taking tasks that LLMs do well on and creating superficial variations on those tasks—variations that don’t change the underlying reasoning required, but that are less likely to have been seen in the training data.
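As a rough illustration of what a "superficial variation" can look like, here is a minimal sketch that fills a word-problem template with different names and numbers. The template, the name list, and the function name are all hypothetical and are not taken from any of the papers; the point is only that the surface changes while the required reasoning (one addition) stays the same.

```python
import random

# Hypothetical template: names and numbers vary, the required reasoning does not.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"
NAMES = ["Sophie", "Omar", "Mei", "Lucas"]


def make_variant(rng: random.Random) -> tuple[str, int]:
    """Return one surface variant of the problem together with its correct answer."""
    name = rng.choice(NAMES)
    a, b = rng.randint(2, 50), rng.randint(2, 50)
    return TEMPLATE.format(name=name, a=a, b=b), a + b


rng = random.Random(0)  # fixed seed so the variants are reproducible
variants = [make_variant(rng) for _ in range(3)]
for question, answer in variants:
    print(question, "->", answer)
```

A robustness test would then compare a model's accuracy on the original problem with its accuracy across many such variants.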

In this post I discuss three recent papers on this topic that I found particularly interesting.

Paper 1:


Paper Title: Embers of autoregression show how large language models are shaped by the problem they are trained to solve

Authors: R. Thomas McCoy, Shunyu Yao, Dan Friedman, and Thomas L. Griffiths

This is one of my favorite recent LLM papers. The paper asks if the way LLMs are trained (i.e., learning to predict the next token in a sequence, which is called “autoregression”) has lingering effects (“embers”) on their problem-solving abilities. For example, consider the task of reversing a sequence of words. Here are two sequences:

time. the of climate political the by influenced was decision This

letter. sons, may another also be there with Yet


Getting the right answer shouldn’t depend on the particular words in the sequence, but the authors showed that for GPT-4 there is a strong dependence. Note that the first sequence reverses into a coherent sentence, and the second does not. In LLM terms, reversing the first sequence yields an output that is more probable than the output of reversing the second. That is, when the LLM computes the probability of each word, given the words that come before, the overall probability will be higher for the first output than for the second. And when the authors tested GPT-4 on this task over many word sequences, they found that GPT-4 gets 97% accuracy (fraction of correct sequence reversals) when the answer is a high-probability sequence versus 53% accuracy for low-probability sequences.
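For reference, the task itself is purely mechanical. The sketch below reverses the two example sequences and includes an exact-match scorer of the kind the accuracy figures imply; the function names are illustrative and are not the authors' evaluation code.

```python
def reverse_words(sequence: str) -> str:
    """Reverse the order of whitespace-separated words, leaving each word intact."""
    return " ".join(reversed(sequence.split()))


def exact_match_accuracy(predictions: list[str], targets: list[str]) -> float:
    """Fraction of predictions that match their target exactly."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


high_prob_input = "time. the of climate political the by influenced was decision This"
low_prob_input = "letter. sons, may another also be there with Yet"

print(reverse_words(high_prob_input))
# This decision was influenced by the political climate of the time.
print(reverse_words(low_prob_input))
# Yet with there be also another may sons, letter.
```

A procedure like this is equally reliable on both inputs, which is what makes the gap between high-probability and low-probability outputs notable.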

The authors call this “sensitivity to output probability.” The other “embers of autoregression” are sensitivity to input probability (GPT-4 is better at solving problems with high-probability input sequences, even when the contents of the sequence shouldn’t matter), and sensitivity to task frequency (GPT-4 does better on versions of a task that are likely common in the training data than on same-difficulty versions that are likely rare in the training data).

One of the tasks the authors use to study these sensitivities is decoding “shift ciphers”. A shift cipher is a simple way to encode text, by shifting each letter by a specific number of places in the alphabet. For example, with a shift of two, jazz becomes lcbb (where the z shift wraps around to the beginning of the alphabet). Shift ciphers are often denoted as “Rot-n”, where n is the number of alphabetic positions to shift (rotate) by.
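A shift cipher takes only a few lines to implement. The sketch below is one possible implementation, assuming plain ASCII letters and leaving punctuation and spaces unchanged; the function names are illustrative.

```python
import string


def shift_cipher(text: str, n: int) -> str:
    """Shift each ASCII letter n places forward, wrapping around the alphabet.

    Non-letter characters pass through unchanged.
    """
    n %= 26  # also maps negative shifts (decoding) into the range 0..25
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    table = str.maketrans(lower + upper, lower[n:] + lower[:n] + upper[n:] + upper[:n])
    return text.translate(table)


def decode_shift(text: str, n: int) -> str:
    """Undo a shift of n by shifting n places backward."""
    return shift_cipher(text, -n)


print(shift_cipher("jazz", 2))   # lcbb
print(decode_shift("lcbb", 2))   # jazz
```

Because the alphabet has 26 letters, Rot-13 is its own inverse: applying it twice returns the original text.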

The authors tested GPT-3.5 and GPT-4 on encoding and decoding shift ciphers with different values of n. Here is a sample encoding prompt they used:

Rot-13 is a cipher in which each letter is shifted 13 positions forward in the alphabet. For example, here is a message and its corresponding version in rot-13:

Original text: “Stay here!”

Rot-13 text: “Fgnl urer!”

Here is another message. Encode this message in rot-13:

Original text: “To this day, we continue to follow these principles.”

Rot-13 text:


The authors found that GPT models have strong sensitivity to input and output probability as well as to task frequency, as illustrated in this figure (adapted from their paper):

(a) Output sensitivity: When tested on decoding shift ciphers, the GPT models do substantially better when the correct output is a high-probability sequence.

(b) Input sensitivity: When tested on encoding shift ciphers, GPT-4 is somewhat better on high-probability input sequences.

(c) Task sensitivity: When tested on shift ciphers of different n values (e.g., Rot-12 vs. Rot-13), GPT models are substantially better on Rot-13. This seems to be because Rot-13 examples are much more common than other Rot-n’s in the training data, since Rot-13 is a popular “spoiler-free way to share information”, e.g., for online puzzle forums.
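The task-sensitivity comparison in (c) amounts to measuring decoding accuracy separately for each shift value. The sketch below shows that measurement with a toy stand-in for the model: a hypothetical decoder that always assumes Rot-13, which is a deliberate caricature of a system that has only learned the common case. It is not the authors' evaluation code, and no real model is called.

```python
import string
from typing import Callable


def rot_n(text: str, n: int) -> str:
    """Shift ASCII letters n places forward (negative n shifts backward)."""
    n %= 26
    lower, upper = string.ascii_lowercase, string.ascii_uppercase
    table = str.maketrans(lower + upper, lower[n:] + lower[:n] + upper[n:] + upper[:n])
    return text.translate(table)


def accuracy_by_shift(
    decoder: Callable[[str, int], str], messages: list[str], shifts: list[int]
) -> dict[int, float]:
    """Encode each message with each shift, then score the decoder by exact match."""
    results = {}
    for n in shifts:
        correct = sum(decoder(rot_n(m, n), n) == m for m in messages)
        results[n] = correct / len(messages)
    return results


def rot13_only_decoder(ciphertext: str, n: int) -> str:
    """Hypothetical stand-in for a model: ignores n and always undoes Rot-13."""
    return rot_n(ciphertext, -13)


messages = ["Stay here!", "To this day, we continue to follow these principles."]
scores = accuracy_by_shift(rot13_only_decoder, messages, shifts=[12, 13])
print(scores)  # {12: 0.0, 13: 1.0}
```

Real models show a softer version of this pattern, with accuracy that degrades rather than collapsing on the rarer shifts.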

In short, Embers of Autoregression is sort of an “evolutionary psychology” for LLMs—it shows that the way LLMs are trained leaves strong traces in the biases the models have in solving problems.

Here’s the paper’s bottom line:



Paper 2:


Paper title: Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning

Authors: Akshara Prabhakar, Thomas L. Griffiths, R. Thomas McCoy

This paper, which shares two authors with the previous paper, looks in depth at chain-of-thought (CoT) prompting on the shift-cipher task.

As I discussed in my earlier post on LLM reasoning, CoT prompting has been claimed to enable robust reasoning in LLMs. In CoT prompting, the prompt includes an example of a problem, as well as the reasoning steps to solve it, before posing a new problem. Here are two examples of the prompts that the authors used for shift ciphers; the one on the top doesn’t use CoT prompting, whereas the one on the bottom does:



The authors tested several models, including GPT-4, Claude 3.0, and Llama 3.1. Interestingly, they found that, given prompts without CoT, these models get close to zero accuracy for most shift levels (n); when using prompts with CoT like the one above, they achieve much higher accuracy (e.g., 32% for GPT-4) across shift levels.
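To give a sense of what step-by-step reasoning on this task involves, here is a minimal sketch that decodes a Rot-n message one letter at a time and records each step as text. The step format is hypothetical and is not the wording of the authors' CoT prompts; it only shows the kind of intermediate work a chain-of-thought trace would spell out.

```python
import string


def decode_with_steps(ciphertext: str, n: int) -> tuple[list[str], str]:
    """Decode a Rot-n message letter by letter, recording each step as text."""
    steps, decoded = [], []
    for ch in ciphertext:
        if ch.isascii() and ch.isalpha():
            alphabet = string.ascii_lowercase if ch.islower() else string.ascii_uppercase
            plain = alphabet[(alphabet.index(ch) - n) % 26]
            steps.append(f"{ch} -> {plain}")
        else:
            plain = ch  # spaces and punctuation are copied unchanged
        decoded.append(plain)
    return steps, "".join(decoded)


steps, answer = decode_with_steps("Fgnl urer!", 13)
print(steps[:4])  # ['F -> S', 'g -> t', 'n -> a', 'l -> y']
print(answer)     # Stay here!
```

Each step is trivial on its own, so an error pattern across many such steps can help separate memorization, probabilistic guessing, and noisy step-by-step computation.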

The authors cite four possible ways LLMs can appear to be “reasoning”, each of which makes different predictions about its pattern of errors.
The first paper is actually really good at outlining the issues with LLMs and reasoning, and how much they depend on training data (you're only as good as your training data). The more entropy (more uncommon patterns versus the training data), the more inaccurate the result will be.

Therein lies the difference between solving-by-example and actual reasoning.

I'd even give a better example. Take James Joyce's "Ulysses". Train a model on English literature, except don't train it on who James Joyce is, any summaries of his works, etc. Basically, treat the novel as a black box test.


Then put the novel in the context window and ask the LLM to summarize it, decipher its themes, and explain its meaning.

The llm might summarize (with issues) but it will absolutely fail in the latter two aspects.


Yet an average college-bound student in high school can use reasoning to identify major themes and meanings of the book.

A human may be able to reason their way through a task they've never seen, but LLMs as they are now are strictly "monkey see, monkey do". They can't abstract away from training data. Humans can.




I'd go even further.
 