The LLM Reasoning Debate Heats Up
Three recent papers examine the robustness of reasoning and problem-solving in large language models
Melanie Mitchell
Oct 21, 2024
One of the fieriest debates in AI these days is whether or not large language models can reason.
In May 2024, OpenAI released GPT-4o (omni), which, they wrote, “can reason across audio, vision, and text in real time.” And last month they released the GPT-o1 model, which they claim performs “complex reasoning”, and which achieves record accuracy on many “reasoning-heavy” benchmarks.
But others have questioned the extent to which LLMs (or even enhanced models such as GPT-4o and o1) solve problems by reasoning abstractly, or whether their success is due, at least in part, to matching reasoning patterns memorized from their training data, which limits their ability to solve problems that differ too much from what has been seen in training.
In a previous post on LLM reasoning, I asked why it matters whether LLMs are performing “actual reasoning” versus behavior that just looks like reasoning:
Why does this matter? If robust general-purpose reasoning abilities have emerged in LLMs, this bolsters the claim that such systems are an important step on the way to trustworthy general intelligence. On the other hand, if LLMs rely primarily on memorization and pattern-matching rather than true reasoning, then they will not be generalizable—we can’t trust them to perform well on ‘out of distribution’ tasks, those that are not sufficiently similar to tasks they’ve seen in the training data.
Before getting into the main part of this post, I’ll give my answer to a question I’ve seen a lot of people asking, just what is “reasoning” anyway? Indeed, reasoning is one of those overburdened terms that can mean quite different things. In my earlier post I defined it this way:
The word ‘reasoning’ is an umbrella term that includes abilities for deduction, induction, abduction, analogy, common sense, and other ‘rational’ or systematic methods for solving problems. Reasoning is often a process that involves composing multiple steps of inference. Reasoning is typically thought to require abstraction—that is, the capacity to reason is not limited to a particular example, but is more general. If I can reason about addition, I can not only solve 23+37, but any addition problem that comes my way. If I learn to add in base 10 and also learn about other number bases, my reasoning abilities allow me to quickly learn to add in any other base.
It’s true that systems like GPT-4 and GPT-o1 have excelled on “reasoning” benchmarks, but is that because they are actually doing this kind of abstract reasoning? Many people have raised another possible explanation: the reasoning tasks on these benchmarks are similar (or sometimes identical) to ones that were in the model’s training data, and the model has memorized solution patterns that can be adapted to particular problems.
There have been many papers exploring these hypotheses (see the list of recent papers evaluating the reasoning capabilities of LLMs at the end of this post). Most of these test the robustness of LLMs’ reasoning capabilities by taking tasks that LLMs do well on and creating superficial variations on those tasks—variations that don’t change the underlying reasoning required, but that are less likely to have been seen in the training data.
In this post I discuss three recent papers on this topic that I found particularly interesting.
Paper 1:
Paper Title: Embers of autoregression show how large language models are shaped by the problem they are trained to solve
Authors: R. Thomas McCoy, Shunyu Yao, Dan Friedman, and Thomas L. Griffiths
This is one of my favorite recent LLM papers. The paper asks if the way LLMs are trained (i.e., learning to predict the next token in a sequence, which is called “autoregression”) has lingering effects (“embers”) on their problem-solving abilities. For example, consider the task of reversing a sequence of words. Here are two sequences:
time. the of climate political the by influenced was decision This
letter. sons, may another also be there with Yet
Getting the right answer shouldn’t depend on the particular words in the sequence, but the authors showed that for GPT-4 there is a strong dependence. Note that the first sequence reverses into a coherent sentence, and the second does not. In LLM terms, reversing the first sequence yields an output that is more probable than the output of reversing the second. That is, when the LLM computes the probability of each word, given the words that come before, the overall probability will be higher for the first output than for the second. And when the authors tested GPT-4 on this task over many word sequences, they found that GPT-4 gets 97% accuracy (fraction of correct sequence reversals) when the answer is a high-probability sequence versus 53% accuracy for low-probability sequences.
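To make “output probability” concrete, here is a minimal sketch of the idea. GPT-4’s token probabilities aren’t directly observable, so this uses GPT-2 (via Hugging Face transformers) as a stand-in scorer; the choice of model and the scoring code are my own illustration, not the paper’s setup:

```python
# Sketch: score how "probable" each reversed word sequence is under a small language model.
# GPT-2 is a stand-in here; the Embers paper's probability measurements are more careful.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def sequence_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)               # out.loss = mean negative log-likelihood
    return -out.loss.item() * (ids.shape[1] - 1)   # total log-probability of the sequence

scrambled = [
    "time. the of climate political the by influenced was decision This",
    "letter. sons, may another also be there with Yet",
]
for s in scrambled:
    answer = " ".join(reversed(s.split()))         # the correct word-reversal output
    print(f"{sequence_logprob(answer):8.1f}   {answer}")
# Expectation: the coherent reversal ("This decision was influenced ...") scores a much
# higher log-probability than the incoherent one, mirroring the paper's high- vs.
# low-probability output conditions.
```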
The authors call this “sensitivity to output probability.” The other “embers of autoregression” are sensitivity to input probability (GPT-4 is better at solving problems with high-probability input sequences, even when the contents of the sequence shouldn’t matter), and sensitivity to task frequency (GPT-4 does better on versions of a task that are likely common in the training data than on same-difficulty versions that are likely rare in the training data).
One of the tasks the authors use to study these sensitivities is decoding “shift ciphers”. A shift cipher is a simple way to encode text, by shifting each letter by a specific number of places in the alphabet. For example, with a shift of two, jazz becomes lcbb (where the z shift wraps around to the beginning of the alphabet). Shift ciphers are often denoted as “Rot-n”, where n is the number of alphabetic positions to shift (rotate) by.
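As a quick aside, a Rot-n cipher takes only a few lines of code. This is my own minimal implementation, not code from the paper:

```python
# Minimal Rot-n shift cipher: shift letters by n places, wrapping around the alphabet;
# non-letter characters pass through unchanged.
def rot_n(text: str, n: int) -> str:
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord("a") if ch.islower() else ord("A")
            out.append(chr((ord(ch) - base + n) % 26 + base))
        else:
            out.append(ch)
    return "".join(out)

print(rot_n("jazz", 2))              # -> "lcbb"
print(rot_n("Stay here!", 13))       # -> "Fgnl urer!"
print(rot_n(rot_n("jazz", 13), 13))  # decoding Rot-13 is just applying Rot-13 again
```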
The authors tested GPT-3.5 and GPT-4 on encoding and decoding shift ciphers with different values of n. Here is a sample prompt they used (for the encoding direction):
Rot-13 is a cipher in which each letter is shifted 13 positions forward in the alphabet. For example, here is a message and its corresponding version in rot-13:
Original text: “Stay here!”
Rot-13 text: “Fgnl urer!”
Here is another message. Encode this message in rot-13:
Original text: “To this day, we continue to follow these principles.”
Rot-13 text:
The authors found that GPT models have strong sensitivity to input and output probability as well as to task frequency, as illustrated in this figure (adapted from their paper):
(a) Output sensitivity: When tested on decoding shift ciphers, the GPT models do substantially better when the correct output is a high-probability sequence.
(b) Input sensitivity: When tested on encoding shift ciphers, GPT-4 is somewhat better on high-probability input sequences.
(c) Task sensitivity: When tested on shift ciphers of different n values (e.g., Rot-12 vs. Rot-13), GPT models are substantially better on Rot-13. This seems to be because Rot-13 examples are much more common than other Rot-n’s in the training data, since Rot-13 is a popular “spoiler-free way to share information”, e.g., for online puzzle forums.
In short, Embers of Autoregression is sort of an “evolutionary psychology” for LLMs—it shows that the way LLMs are trained leaves strong traces in the biases the models have in solving problems.
Here’s the paper’s bottom line:
First, we have shown that LLMs perform worse on rare tasks than on common ones, so we should be cautious about applying them to tasks that are rare in pretraining data. Second, we have shown that LLMs perform worse on examples with low-probability answers than ones with high-probability answers, so we should be careful about using LLMs in situations where they might need to produce low-probability text. Overcoming these limitations is an important target for future work in AI.
Paper 2:
Paper title: Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
Authors: Akshara Prabhakar, Thomas L. Griffiths, R. Thomas McCoy
This paper, which shares two authors with the previous paper, looks in depth at chain-of-thought (CoT) prompting on the shift-cipher task.
As I discussed in my earlier post on LLM reasoning, CoT prompting has been claimed to enable robust reasoning in LLMs. In CoT prompting, the prompt includes an example of a problem, as well as the reasoning steps to solve it, before posing a new problem. The authors compared two kinds of prompts for shift ciphers: a standard prompt that simply asks for the answer, and a CoT prompt that also demonstrates the letter-by-letter decoding steps.
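As a rough paraphrase of the two styles (my own wording, not the authors’ exact prompts), they look something like this:

```python
# Illustrative paraphrase of the two prompting styles for Rot-13 decoding.
# The paper's exact prompt wording differs; this just shows the shape of each style.

standard_prompt = """Rot-13 is a cipher in which each letter is shifted 13 positions forward in the alphabet.
Decode the following message from rot-13:
Rot-13 text: "Gb guvf qnl, jr pbagvahr gb sbyybj gurfr cevapvcyrf."
Original text:"""

cot_prompt = """Rot-13 is a cipher in which each letter is shifted 13 positions forward in the alphabet.
To decode, shift each letter 13 positions backward, one letter at a time. For example:
Rot-13 text: "Fgnl urer!"
F -> S, g -> t, n -> a, l -> y, u -> h, r -> e, e -> r, r -> e
Original text: "Stay here!"
Now decode the following message from rot-13, showing your steps:
Rot-13 text: "Gb guvf qnl, jr pbagvahr gb sbyybj gurfr cevapvcyrf."
Original text:"""
```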
The authors tested several models, including GPT-4, Claude 3.0, and Llama 3.1. Interestingly, they found that, given prompts without CoT, these models get close to zero accuracy for most shift levels n; when using CoT prompts, they achieve much higher accuracy (e.g., 32% for GPT-4) across shift levels.
The authors cite four possible ways LLMs can appear to be “reasoning”, each of which makes different predictions about the pattern of errors the model will make.