Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To understand the factors influencing CoT reasoning, we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers, where each letter is shifted forward some number of steps in the alphabet. We analyze the pattern of results produced by three LLMs -- GPT-4, Claude 3, and Llama 3.1 -- performing this task with CoT prompting. By focusing on a single, relatively simple task, we are able to identify three factors that systematically affect CoT performance: the probability of the task's expected output (probability), what the model has implicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output's probability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning. Code and data are available at this https URL
Comments: EMNLP 2024 Findings; 9 pages plus references and appendices
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2407.01687 [cs.CL] (or arXiv:2407.01687v2 [cs.CL] for this version)
[2407.01687] Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning
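To make the cipher task concrete, here is a minimal Python sketch of shift-cipher encoding and decoding as described in the abstract above; the function names are illustrative and are not taken from the paper's released code:

```python
# Minimal sketch of the shift-cipher task: each letter is shifted forward
# a fixed number of steps in the alphabet; decoding shifts it back.
def shift_encode(text: str, shift: int) -> str:
    """Shift each letter forward by `shift` positions, wrapping around."""
    out = []
    for ch in text:
        if ch.isalpha():
            base = ord('a') if ch.islower() else ord('A')
            out.append(chr((ord(ch) - base + shift) % 26 + base))
        else:
            out.append(ch)  # leave spaces and punctuation unchanged
    return ''.join(out)

def shift_decode(text: str, shift: int) -> str:
    """Invert the cipher by shifting each letter back."""
    return shift_encode(text, -shift)

# A shift of 13 (rot13) maps "stay" to "fgnl" and back.
assert shift_encode("stay", 13) == "fgnl"
assert shift_decode("fgnl", 13) == "stay"
```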
In "Embers of Autoregression" (McCoy et al., 2023), we showed that several large language models (LLMs) have some important limitations that are attributable to their origins in next-word prediction. Here we investigate whether these issues persist with o1, a new system from OpenAI that differs from previous LLMs in that it is optimized for reasoning. We find that o1 substantially outperforms previous LLMs in many cases, with particularly large improvements on rare variants of common tasks (e.g., forming acronyms from the second letter of each word in a list, rather than the first letter). Despite these quantitative improvements, however, o1 still displays the same qualitative trends that we observed in previous systems. Specifically, o1 -- like previous LLMs -- is sensitive to the probability of examples and tasks, performing better and requiring fewer "thinking tokens" in high-probability settings than in low-probability ones. These results show that optimizing a language model for reasoning can mitigate but might not fully overcome the language model's probability sensitivity.
Comments: 6 pages; updated to fix typo in Fig 4 caption
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2410.01792 [cs.CL] (or arXiv:2410.01792v2 [cs.CL] for this version)
[2410.01792] When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
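As a concrete illustration of the acronym example mentioned in the abstract above, here is a toy Python sketch of the common (first-letter) and rare (second-letter) variants of the task; this is our own illustration, not code from the paper:

```python
# Toy illustration of the acronym tasks: the common variant takes the
# first letter of each word, the rare variant takes the second letter.
def acronym(words: list[str], letter_index: int = 0) -> str:
    """Concatenate the letter at `letter_index` from each word."""
    return ''.join(w[letter_index] for w in words)

words = ["open", "neural", "engine"]
print(acronym(words, 0))  # common variant (first letters): 'one'
print(acronym(words, 1))  # rare variant (second letters): 'pen'
```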