Reasoning skills of large language models are often overestimated

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,004
Reputation
7,865
Daps
147,240







1/11
We’re presenting the first AI to solve International Mathematical Olympiad problems at a silver medalist level.🥈

It combines AlphaProof, a new breakthrough model for formal reasoning, and AlphaGeometry 2, an improved version of our previous system. 🧵

2/11
Our system had to solve this year's six IMO problems, involving algebra, combinatorics, geometry & number theory. We then invited mathematicians @wtgowers and Dr Joseph K Myers to oversee scoring.

It solved 4️⃣ problems to gain 28 points - equivalent to earning a silver medal. ↓

3/11
For non-geometry, it uses AlphaProof, which can create proofs in Lean. 🧮

It couples a pre-trained language model with the AlphaZero reinforcement learning algorithm, which previously taught itself to master games like chess, shogi and Go.

4/11
Math programming languages like Lean allow answers to be formally verified. But their use has been limited by the scarcity of available human-written data. 💡

So we fine-tuned a Gemini model to translate natural language problems into a set of formal ones for training AlphaProof.
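For readers unfamiliar with Lean, here is a minimal illustrative Lean 4 proof (not from DeepMind's system): the kernel accepts it only if every step actually checks out, which is what makes formal verification possible.

```lean
-- A trivially small formally verified statement: addition on the
-- natural numbers is commutative. Lean rejects the file unless the
-- proof term is valid, so an accepted proof is a checked proof.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```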

5/11
When presented with a problem, AlphaProof attempts to prove or disprove it by searching over possible steps in Lean. 🔍

Each success is then used to reinforce its neural network, making it better at tackling subsequent, harder problems.

6/11
With geometry, it deploys AlphaGeometry 2: a neuro-symbolic hybrid system.

Its Gemini-based language model was trained on increased synthetic data, enabling it to tackle more types of problems - such as looking at movements of objects. 📐

7/11
Powered by a novel search algorithm, AlphaGeometry 2 can now solve 83% of all historical problems from the past 25 years - compared to its predecessor's 53% rate.

It solved this year’s IMO Problem 4 within 19 seconds. 🚀

Here’s an illustration showing its solution ↓

8/11
We’re excited to see how our new system could help accelerate AI-powered mathematics, from quickly completing elements of proofs to eventually discovering new knowledge for us - and unlocking further progress towards AGI.

Find out more → AI achieves silver-medal standard solving International Mathematical Olympiad problems

9/11
thank you for this hard work and thank you for sharing it with the world <3

10/11
That is astonishing

11/11
Amazing. Congrats!


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

[Submitted on 11 Jun 2024 (v1), last revised 13 Jun 2024 (this version, v2)]

Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B

Di Zhang, Xiaoshui Huang, Dongzhan Zhou, Yuqiang Li, Wanli Ouyang
This paper introduces the MCT Self-Refine (MCTSr) algorithm, an innovative integration of Large Language Models (LLMs) with Monte Carlo Tree Search (MCTS), designed to enhance performance in complex mathematical reasoning tasks. Addressing the challenges of accuracy and reliability in LLMs, particularly in strategic and mathematical reasoning, MCTSr leverages systematic exploration and heuristic self-refine mechanisms to improve decision-making frameworks within LLMs. The algorithm constructs a Monte Carlo search tree through iterative processes of Selection, self-refine, self-evaluation, and Backpropagation, utilizing an improved Upper Confidence Bound (UCB) formula to optimize the exploration-exploitation balance. Extensive experiments demonstrate MCTSr's efficacy in solving Olympiad-level mathematical problems, significantly improving success rates across multiple datasets, including GSM8K, GSM Hard, MATH, and Olympiad-level benchmarks, including Math Odyssey, AIME, and OlympiadBench. The study advances the application of LLMs in complex reasoning tasks and sets a foundation for future AI integration, enhancing decision-making accuracy and reliability in LLM-driven applications.

Submission history

From: Di Zhang
[v1] Tue, 11 Jun 2024 16:01:07 UTC (106 KB)
[v2] Thu, 13 Jun 2024 07:19:06 UTC (106 KB)






1/1
Accessing GPT-4 level Mathematical Olympiad Solutions via Monte Carlo Tree Self-refine with LLaMa-3 8B: A Technical Report





1/6
It's finally here. Q* rings true. Tiny LLMs are as good at math as a frontier model.

By using the same techniques Google used to solve Go (MCTS and backpropagation), Llama-3 8B gets 96.7% on the math benchmark GSM8K!

That’s better than GPT-4, Claude and Gemini, with 200x fewer parameters!

2/6
Source: https://arxiv.org/pdf/2406.07394

3/6
I'd imagine these are the techniques that code foundation-model trainers are using, but I wonder:

a) whether you're limited by the ability of the base open-source model, and might only get it to be about as good as a frontier model, but barely;
b) whether you can generate enough volume of synthetic code data with reasonable $$ spend;
c) whether doing this on a 1T+ parameter model would be prohibitively expensive.

4/6
The (purported) technique isn’t tied to a particular model

5/6
Come on it's SLIGHTLY harder than that 😆

6/6
Shanghai AI lab made rumored Q* reality





1/1
A prompt-level formula to add Search to LLM

🧵📖 Read of the day, day 85: Accessing GPT-4 level Mathematical Olympiad Solutions Via Monte Carlo Tree Self-refine with Llama-3 8B: A technical report, by Zhang et al from Shanghai Artificial Intelligence Laboratory

https://arxiv.org/pdf/2406.07394

The authors of this paper introduce a Monte-Carlo Tree Search-like method to enhance model generation. They call it Monte-Carlo Tree Self-Refine, shortened to MCTSr.

Their method is based solely on prompting the model and does not modify its weights, yet it greatly improves results.

How?
1- Generate a root node from a naive or dummy answer
2- Use a value function Q to rank the answers that have not yet been expanded, and greedily select the best
3- Optimize the selected answer by generating feedback on it, then exploiting that feedback
4- Compute the Q value of the refined answer
5- Update the values of the parent nodes
6- Identify candidate nodes for expansion, and use the UCT formula to update all nodes before the next iteration
7- Iterate until the maximum number of steps is reached

The value function Q is itself obtained by prompting the model to reward its own answer: the model is prompted several times and the resulting scores are averaged. The backpropagation and UCT formulas can be found in the paper.
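The seven steps above can be sketched in Python. This is a toy illustration under loud assumptions: `llm_answer`, `llm_refine`, and `llm_score` are stubs standing in for real model calls (the real Q averages model-assigned rewards over several prompts), and the backpropagation rule is simplified relative to the paper's formulas.

```python
import math
import random

def llm_answer(question):
    # Step 1: root node from a naive answer (stubbed).
    return "naive answer"

def llm_refine(question, answer):
    # Step 3: a real system asks the model for feedback on the answer,
    # then asks it to rewrite the answer exploiting that feedback.
    return answer + " (refined)"

def llm_score(question, answer, samples=3):
    # Step 4: Q prompts the model to reward its own answer several
    # times and averages the rewards (stubbed with random values here).
    return sum(random.uniform(0, 1) for _ in range(samples)) / samples

def uct(node, total_visits, c=1.41):
    # Step 6: UCT trades off exploitation (q) against exploration.
    return node["q"] + c * math.sqrt(math.log(total_visits + 1) / node["visits"])

def mctsr(question, max_rollouts=4):
    root_answer = llm_answer(question)
    nodes = [{"answer": root_answer, "q": llm_score(question, root_answer), "visits": 1}]
    for _ in range(max_rollouts):
        total = sum(n["visits"] for n in nodes)
        # Step 2: greedily select the most promising node.
        parent = max(nodes, key=lambda n: uct(n, total))
        # Step 3: self-refine the selected answer.
        child_answer = llm_refine(question, parent["answer"])
        # Step 4: score the refined answer.
        child = {"answer": child_answer, "q": llm_score(question, child_answer), "visits": 1}
        nodes.append(child)
        # Step 5: backpropagate - nudge the parent's value toward the child's.
        parent["q"] = 0.5 * (parent["q"] + child["q"])
        parent["visits"] += 1
    # Step 7: after the rollout budget is spent, return the best-scoring answer.
    return max(nodes, key=lambda n: n["q"])["answer"]

random.seed(0)
best = mctsr("What is 2 + 2?")
print(best)
```

With a real model behind the three stubs, the loop is unchanged; only the stubs are replaced by API calls.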

The authors then evaluate 4-rollout and 8-rollout MCTSr on Llama-3 8B and compare it to GPT-4, Claude 3 Opus and Gemini 1.5 Pro on mathematical problems.

They first find that such sampling greatly increases performance on both the GSM8K and MATH datasets, reaching frontier-model-level performance on GSM8K (still below on MATH, but greatly improved).

The authors then evaluate the models on harder benchmarks. MCTSr improves model performance across all of them. They notice that on Math Odyssey, the 8-rollout MCTSr is on the level of GPT-4!

Prompts can be found within the appendix.
Code is open-sourced at: GitHub - trotsky1997/MathBlackBox

Personal Thoughts: While this research is still at a preliminary stage, the results are quite impressive given that they come purely from prompting. The fact that a mere 8B model can reach frontier levels of performance on these benchmarks is nothing to laugh at. It also tells us there’s still a lot to discover even with LLMs alone!




AI in practice
Aug 22, 2024

New prompting method can help improve LLM reasoning skills




Chinese researchers have created a technique that enables large language models (LLMs) to recognize and filter out irrelevant information in text-based tasks, leading to significant improvements in their logical reasoning abilities.

The research team from Guilin University of Electronic Technology and other institutions developed the GSMIR dataset, which consists of 500 elementary school math problems intentionally injected with irrelevant sentences. GSMIR is derived from the existing GSM8K dataset.

Tests on GSMIR showed that GPT-3.5-Turbo and GPT-3.5-Turbo-16k could identify irrelevant information in up to 74.9% of cases. However, even once it was detected, the models could not automatically exclude that information before solving the task.

Recognizing and filtering irrelevant information - and only then responding


To address this, the researchers developed the two-stage "Analysis to Filtration Prompting" (ATF) method. First, the model analyzes the task and identifies irrelevant information by examining each sub-sentence. It then filters out this information before starting the actual reasoning process.
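A minimal sketch of the two-stage flow, assuming a hypothetical `call_llm` function in place of a real model API; the prompt wording is illustrative, not the paper's exact phrasing.

```python
def call_llm(prompt):
    # Stub standing in for a real LLM API call: the analysis prompt
    # gets a flagged sentence back, any other prompt gets the answer.
    if "identify any sentences" in prompt:
        return "Irrelevant: 'Her brother is 7 years old.'"
    return "Answer: 15"

def atf_solve(problem):
    # Stage 1 (analysis): examine the problem sub-sentence by
    # sub-sentence and flag information irrelevant to the question.
    analysis = call_llm(
        f"Problem: {problem}\n"
        "Go through the problem sentence by sentence and "
        "identify any sentences that are irrelevant to the question."
    )
    # Stage 2 (filtration): restate the problem with the flagged
    # sentences excluded, then reason over the cleaned version
    # (combined here with step-by-step reasoning, as in the
    # best-performing ATF + CoT configuration).
    return call_llm(
        f"Problem: {problem}\n"
        "The following information was found to be irrelevant and must be ignored:\n"
        f"{analysis}\n"
        "Now solve the cleaned problem step by step."
    )

result = atf_solve(
    "Mary has 5 apples and buys 10 more. Her brother is 7 years old. "
    "How many apples does Mary have?"
)
print(result)  # the stub returns "Answer: 15" for the solving stage
```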

The two-step ATF prompt process. First it analyzes, then it filters, and only then the model responds. | Image: Jiang et al.

Using ATF, the accuracy of LLMs in solving tasks with irrelevant information approached their performance on the original tasks without such distractions. The method worked with all tested prompting techniques.

The combination of ATF with "Chain-of-Thought Prompting" (COT) was particularly effective. For GPT-3.5-Turbo, accuracy increased from 50.2% without ATF to 74.9% with ATF – an improvement of nearly 25 percentage points.

Benchmark results comparing various prompting methods with and without ATF. The methods tested include standard, instructed, chain-of-thought (with and without examples), and least-to-most prompting. GSM8K-SLC represents the GSMIR data set without irrelevant information. The study presents two tables, although their differences are unclear. Most likely, the upper table shows results for GPT-3.5-Turbo-16k and the lower table shows results for GPT-3.5-Turbo, but the labeling is incorrect. Both tables show that ATF consistently improved accuracy across all prompting methods when solving data set tasks containing irrelevant information. | Image: Jiang et al.

The smallest improvement came when ATF was combined with Standard Prompting (SP), where accuracy increased by only 3.3 percentage points. The researchers suggest that this is because SP's accuracy on the original questions was already very low at 18.5%, with most errors likely due to calculation errors rather than irrelevant information.

Because the ATF method is specifically designed to reduce the impact of irrelevant information, but not to improve the general computational ability of LLMs, the effect of ATF in combination with SP was limited.

With other prompting techniques, such as COT, which better support LLMs in correctly solving reasoning tasks, ATF was able to improve performance more significantly because irrelevant information accounted for a larger proportion of errors.

The study has some limitations. Experiments were conducted only with GPT-3.5, and the researchers only examined tasks containing a single piece of irrelevant information. In real-world scenarios, problem descriptions may contain multiple confounding factors.

In approximately 15% of cases, irrelevant information was not recognized as such. More than half of these instances involved "weak irrelevant information" that did not impact the model's ability to arrive at the correct answer.

This suggests that ATF is most effective for "strong irrelevant information" that significantly interferes with the reasoning process. Only 2.2% of cases saw relevant information incorrectly classified as irrelevant.

Despite these limitations, the study shows that language models' logical reasoning abilities can be enhanced by filtering out irrelevant information through prompt engineering. While the ATF method could help LLMs better handle noisy real-world data, it does not address their fundamental weaknesses in logic.

Summary
  • Researchers at Guilin University of Electronic Technology have developed a technique that helps large language models (LLMs) identify and remove irrelevant information in text-based tasks, significantly improving their reasoning capabilities.
  • The two-step "Analysis to Filtration Prompting" (ATF) method first analyzes the task and identifies irrelevant information by examining each sub-sentence. It then filters out this information before the model begins its reasoning process. When combined with Chain-of-Thought Prompting (COT), the accuracy of GPT-3.5-Turbo improved by nearly 25 percentage points, from 50.2% to 74.9%.
  • The study has limitations. Only GPT-3.5 variants were tested, and the tasks each contained only one piece of irrelevant information. Real-world scenarios often involve multiple confounding factors.

Sources
Paper

Matthias Bastian


 


1/1
Current AI systems have limited reasoning abilities, but they can improve

Yes, current systems do reason.

Many people argue that what they do "isn't reasoning," but the difference between fake and "real" reasoning doesn't seem very important.

Their reasoning has limits, especially when multiple steps build on each other. Sometimes outputs are produced without any underlying reasoning, leading to hallucinations.
Even worse, if asked to explain a hallucinated answer, the system may create a made-up reasoning for it.

These limits on reasoning ability, failing to apply reasoning when needed, and generating false reasoning are major challenges for LLMs.

Despite this, there’s optimism. Instead of needing new methods, frameworks like COT, tree search, and society of minds could greatly improve reasoning in existing LLMs.
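As a rough illustration of how two of these frameworks operate at the prompt level, here is a sketch with a stub model (not a real API): chain-of-thought only rewrites the prompt, and a society-of-minds style setup takes a majority vote over several independent answers.

```python
from collections import Counter

def sample_model(prompt, seed):
    # Stub standing in for sampling a real LLM; here one in every
    # three "reasoning paths" is wrong, to show the vote correcting it.
    return "4" if seed % 3 else "5"

def chain_of_thought_prompt(question):
    # CoT changes only the prompt, not the model weights: the appended
    # instruction elicits intermediate reasoning steps.
    return question + "\nLet's think step by step."

def society_of_minds(question, n_agents=5):
    # Several "agents" answer independently; the majority answer wins.
    prompt = chain_of_thought_prompt(question)
    answers = [sample_model(prompt, seed) for seed in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]

print(society_of_minds("What is 2 + 2?"))  # majority vote recovers "4"
```

The point of the sketch is that both techniques wrap an unchanged model, which is why they can improve reasoning in existing LLMs without retraining.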

