Reasoning skills of large language models are often overestimated

bnew

Eric Schmidt says "the computers are now self-improving, they're learning how to plan" - and soon they won't have to listen to us anymore. Within 6 years, minds smarter than the sum of humans - scaled, recursive, free. "People do not understand what's happening."



Posted on Tue Apr 15 16:09:36 2025 UTC


 

Hood Critic


His entire interview

 

bnew


1/1
@GestaltU
Hard to argue the o3+ families of models, and perhaps even Gemini 2.5 pro level models, aren’t *generally* superhuman in logical reasoning and code domains at this point.

[Quoted tweet]
o4-mini-high just solved the latest project euler problem (from 4 days ago) in 2m55s, far faster than any human solver. Only 15 people were able to solve it in under 30 minutes













1/16
@bio_bootloader
o4-mini-high just solved the latest project euler problem (from 4 days ago) in 2m55s, far faster than any human solver. Only 15 people were able to solve it in under 30 minutes





2/16
@bio_bootloader
I'm stunned

I knew this day was coming but wow. I used to regularly solve these and sometimes came in the top 10 solvers, I know how hard these are.



3/16
@bio_bootloader
turns out it sometimes solves this in under a minute:

[Quoted tweet]
Okay, not sure what I did differently than you, but I got CoT time down to 56s with the right answer. 🎯

What was the prompt you used?




4/16
@RyanJTopps
You do know the answer is known and it’s not executing any code in that response right?



5/16
@bio_bootloader
wrong



6/16
@GuilleAngeris
yeah ok that's pretty sick actually



7/16
@yacineMTB
cool



8/16
@nayshins
Dang



9/16
@gnopercept
is it so over?



10/16
@friendlyboxcat
Boggles me that it can do that but not connect 4





11/16
@CDS61617
code moving different



12/16
@DrMiaow
Still can’t label code correctly.





13/16
@sadaasukhi
damn



14/16
@plutobyte
for the record, gemini 2.5 pro solved it in 6 minutes. i haven't looked super closely at the problem or either of their solutions, but it looks like it might just be a matrix exponentiation by squaring problem? still very impressive



15/16
@BenceDezs3
I also had a few test problems from SPOJ that very few of us could solve, and they definitely weren’t in the training data. Unfortunately (or perhaps fortunately), the day came when it managed to solve every single one of them.



16/16
@Scarcus
Just a matter of compute now for time, what I want to know is how many tokens it took to solve it.
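
Tweet 14/16 above guesses the problem reduces to matrix exponentiation by squaring; here is a minimal Python sketch of that technique. The Fibonacci recurrence and modulus are illustrative stand-ins, not taken from the actual Project Euler problem.

def mat_mul(a, b, mod):
    """Multiply two square matrices modulo mod."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) % mod
             for j in range(n)] for i in range(n)]

def mat_pow(m, exponent, mod):
    """Raise matrix m to a huge power in O(log exponent) multiplications."""
    n = len(m)
    result = [[int(i == j) for j in range(n)] for i in range(n)]  # identity
    while exponent:
        if exponent & 1:
            result = mat_mul(result, m, mod)
        m = mat_mul(m, m, mod)
        exponent >>= 1
    return result

# Example: the n-th Fibonacci number mod 1_000_000_007 via [[1, 1], [1, 0]]^n.
MOD = 1_000_000_007
print(mat_pow([[1, 1], [1, 0]], 10**18, MOD)[0][1])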




 

bnew


LLMs Can Now Solve Challenging Math Problems with Minimal Data: Researchers from UC Berkeley and Ai2 Unveil a Fine-Tuning Recipe That Unlocks Mathematical Reasoning Across Difficulty Levels


By Mohammad Asjad

April 18, 2025

Language models have made significant strides in tackling reasoning tasks, with even small-scale supervised fine-tuning (SFT) approaches such as LIMO and s1 demonstrating remarkable improvements in mathematical problem-solving capabilities. However, fundamental questions remain about these advancements: Do these models genuinely generalise beyond their training data, or are they merely overfitting to test sets? The research community faces challenges in understanding which capabilities are enhanced through small-scale SFT and which limitations persist despite these improvements. Despite impressive performance on popular benchmarks, there is an incomplete understanding of these fine-tuned models’ specific strengths and weaknesses, creating a critical gap in knowledge about their true reasoning abilities and practical limitations.

Various attempts have been made to understand the effects of reasoning-based supervised fine-tuning beyond simple benchmark scores. Researchers have questioned whether SFT merely improves performance on previously seen problem types or genuinely enables models to transfer problem-solving strategies to new contexts, such as applying coordinate-based techniques in geometry. Existing methods focus on factors like correctness, solution length, and response diversity, which initial studies suggest play significant roles in model improvement through SFT. However, these approaches lack the granularity needed to determine exactly which types of previously unsolvable questions become solvable after fine-tuning, and which problem categories remain resistant to improvement despite extensive training. The research community still struggles to establish whether observed improvements reflect deeper learning or simply memorisation of training trajectories, highlighting the need for more sophisticated analysis methods.

The researchers from the University of California, Berkeley and the Allen Institute for AI propose a tiered analysis framework to investigate how supervised fine-tuning affects reasoning capabilities in language models. This approach utilises the AIME24 dataset, chosen for its complexity and widespread use in reasoning research, which exhibits a ladder-like structure where models solving higher-tier questions typically succeed on lower-tier ones. By categorising questions into four difficulty tiers (Easy, Medium, Hard, and Exh, short for Extremely Hard), the study systematically examines the specific requirements for advancing between tiers. The analysis reveals that progression from Easy to Medium primarily requires adopting an R1 reasoning style with a long inference context, while Hard-level questions demand greater computational stability during deep exploration. Exh-level questions present a fundamentally different challenge, requiring unconventional problem-solving strategies that current models uniformly struggle with. The research also identifies four key insights: the performance gap between potential and stability in small-scale SFT models, minimal benefits from careful dataset curation, diminishing returns from scaling SFT datasets, and potential intelligence barriers that may not be overcome through SFT alone.
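As a rough sketch of the tiering idea, questions can be bucketed by how reliably they are solved across sampled attempts; the thresholds and question IDs below are illustrative assumptions, not the paper's actual criteria.

def assign_tier(pass_rate: float) -> str:
    """Map a question's average pass rate (0.0 to 1.0) to a difficulty tier."""
    if pass_rate >= 0.75:
        return "Easy"
    if pass_rate >= 0.40:
        return "Medium"
    if pass_rate >= 0.10:
        return "Hard"
    return "Exh"  # extremely hard: almost never solved

# pass_rates: question id -> fraction of sampled attempts answered correctly
pass_rates = {"aime24_q01": 0.92, "aime24_q07": 0.55, "aime24_q12": 0.03}
print({q: assign_tier(r) for q, r in pass_rates.items()})
# {'aime24_q01': 'Easy', 'aime24_q07': 'Medium', 'aime24_q12': 'Exh'}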



The methodology employs a comprehensive tiered analysis using the AIME24 dataset as the primary test benchmark. This choice stems from three key attributes: the dataset's hierarchical difficulty that challenges even state-of-the-art models, its diverse coverage of mathematical domains, and its focus on high school mathematics that isolates pure reasoning ability from domain-specific knowledge. Qwen2.5-32B-Instruct serves as the base model due to its widespread adoption and inherent cognitive behaviours, including verification, backtracking, and subgoal setting. The fine-tuning data consists of question-response pairs from the OpenR1-Math-220k dataset, specifically using CoT trajectories generated by DeepSeek R1 for problems from NuminaMath1.5, with incorrect solutions filtered out. The training configuration mirrors prior studies with a learning rate of 1 × 10⁻⁵, weight decay of 1 × 10⁻⁴, batch size of 32, and 5 epochs. Performance evaluation employs avg@n (average pass rate over n sampled attempts) and cov@n (coverage: whether at least one of the n attempts is correct) metrics, with questions categorised into four difficulty levels (Easy, Medium, Hard, and Extremely Hard) based on model performance patterns.
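The two metrics are simple to state precisely; here is a minimal sketch of how avg@n and cov@n could be computed from per-question attempt results (variable names are illustrative, not from the paper's code).

def avg_at_n(attempts_per_question):
    """avg@n: per-question pass rate over n attempts, averaged over questions."""
    rates = [sum(a) / len(a) for a in attempts_per_question]
    return sum(rates) / len(rates)

def cov_at_n(attempts_per_question):
    """cov@n: fraction of questions solved by at least one of the n attempts."""
    return sum(any(a) for a in attempts_per_question) / len(attempts_per_question)

# Example: 3 questions, 4 sampled attempts each (True = correct final answer).
results = [
    [True, True, False, True],    # solved 3 of 4 times
    [False, False, True, False],  # solved once, which still counts for cov@4
    [False, False, False, False], # never solved
]
print(f"avg@4 = {avg_at_n(results):.2f}")  # 0.33
print(f"cov@4 = {cov_at_n(results):.2f}")  # 0.67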

Research results reveal that effective progression from Easy to Medium-level mathematical problem-solving requires minimal but specific conditions. The study systematically examined multiple training variables, including foundational knowledge across diverse mathematical categories, dataset size variations (100-1000 examples per category), trajectory length (short, normal, or long), and trajectory style (comparing DeepSeek-R1 with Gemini-flash). Through comprehensive ablation studies, researchers isolated the impact of each dimension on model performance, represented as P = f(C, N, L, S), where C is the mathematical category, N the number of trajectories, L the trajectory length, and S the trajectory style. The findings demonstrate that achieving performance ≥90% on Medium-level questions requires at least 500 normal-length or long R1-style trajectories, regardless of the specific mathematical category. Models consistently fail to meet the performance threshold when trained with fewer trajectories, shorter trajectories, or Gemini-style trajectories. This indicates that reasoning trajectory length and quantity are the critical factors in developing mathematical reasoning capabilities, while the specific subject matter of the trajectories proves less important than their structural characteristics.
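As a rough illustration of the ablation described above, here is a hypothetical sketch of the grid implied by P = f(C, N, L, S); finetune_and_eval is a placeholder name standing in for the actual SFT-plus-AIME24 evaluation pipeline, which the article does not show.

from itertools import product

# The four swept dimensions from P = f(C, N, L, S); values are illustrative.
categories = ["algebra", "geometry", "number_theory", "combinatorics"]
trajectory_counts = [100, 250, 500, 1000]
lengths = ["short", "normal", "long"]
styles = ["r1", "gemini-flash"]

def finetune_and_eval(category, n_traj, length, style):
    """Placeholder: fine-tune the base model on the selected trajectories and
    return Medium-tier AIME24 accuracy. Returns a dummy value so the sketch runs."""
    return 0.0

results = {
    cell: finetune_and_eval(*cell)
    for cell in product(categories, trajectory_counts, lengths, styles)
}

# Reported pattern: cells with >= 500 normal or long R1-style trajectories
# clear the 90% Medium-tier threshold regardless of mathematical category.
passing = {cell: acc for cell, acc in results.items() if acc >= 0.90}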



The research demonstrates that models with small-scale supervised fine-tuning can potentially solve as many questions as more sophisticated models like DeepSeek-R1, though significant challenges remain. The primary limitation identified is instability in mathematical reasoning, rather than capability. Experimental results show that geometry-trained models can achieve a coverage score of 90, matching R1's performance when given multiple attempts, yet their overall accuracy lags by more than 20%. This performance gap stems primarily from instability in deep exploration and computational limitations during complex problem-solving. While increasing the SFT dataset size offers one solution path, performance enhancement follows a logarithmic scaling trend with diminishing returns. Notably, the study challenges recent assertions about the importance of careful dataset curation, revealing that performance across various mathematical categories remains consistent within a narrow range of 55±4%, with only marginal differences between specifically constructed similar datasets and randomly constructed ones. This suggests that the quantity and quality of reasoning trajectories matter more than subject-specific content for developing robust mathematical reasoning capabilities.
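To make the "logarithmic scaling with diminishing returns" claim concrete, here is a small sketch that fits accuracy = a + b·ln(dataset size) and extrapolates; the data points are made-up illustrations, not figures from the paper.

import math

def log_fit(sizes, accuracies):
    """Least-squares fit of accuracy = a + b * ln(size)."""
    xs = [math.log(s) for s in sizes]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(accuracies) / len(accuracies)
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
         / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - b * mean_x, b

sizes = [100, 250, 500, 1000]          # SFT examples (illustrative numbers)
accuracies = [0.42, 0.51, 0.55, 0.59]  # Medium-tier accuracy (illustrative)
a, b = log_fit(sizes, accuracies)
# Each 10x increase in data only adds about b * ln(10) accuracy.
print(f"predicted accuracy at 10,000 examples: {a + b * math.log(10_000):.2f}")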




Here is the Paper and GitHub Page.


 