REVEALED: Open A.I. Staff Warn "The progress made on Project Q* has the potential to endanger humanity" (REUTERS)


...
Joined
Nov 12, 2014
Messages
30,936
Reputation
5,418
Daps
48,734
Reppin
UK, DE, GY, DMV
In some instances, yes. In others, where a human being would tell you, "I don't know the answer, but let me research it," it tries to cobble together nonsense and outputs lies.

it is worse than that. you tell it to exclude an assumption, or that something is wrong, and it will often loop back to the same behaviour, because it is incapable of following a multi-stage argument while keeping the salient points in scope and in order of relevance.

and how could it? it is not reasoning anyway.

how it forgets even the simple instruction to give short, direct answers is beyond me.

why should it talk less? well, apart from muddying the waters for the reader, it uses its own output as input, so conclusions drawn in long paragraphs of unwanted babble affect successive output, even when those inferences are off the beaten track.

it's a high-functioning autist, characterised by a predilection for verbosity, misinterpretation and fabrication, and by an inability to reason, follow a train of argument or put past exchanges in relative context; all while possessing a very short memory.

borderline sociopathic and not quite the personality profile of a desirable employee.

it's built from a hodgepodge of millions of sometimes incorrect, sometimes contradictory opinions and ways of doing things (accumulated over time). so, once past the veneer of OpenAI's behavioural controls, it is the living embodiment of "design by [internet] committee".

:camby::hubie:

That becomes dangerous if you have no clue what you are doing but are trusting AI to give you the right answer.

The ability to say "I do not know" is a skill in itself.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
62,542
Reputation
9,488
Daps
171,301

we don't know if your simple instructions are prompts or system prompts, or how they were worded. i know what you mean though. sometimes i like when it truncates code, and other times i have to instruct it to give me the full code. maybe it would be helpful if they had a per-conversation switch for certain instructions to stick (or not) throughout the chat.

I've found its use of its own output as input very useful ever since i read the step-back prompting paper. I tend to get better results than attempting to zero-shot a response. you're right that they can hallucinate stuff and use it as input/fact, but i don't run into that as often as I did a year ago.
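The step-back pattern can be sketched in a few lines. `ask` here is a hypothetical stand-in for whatever chat-completion call you use, and the prompts are illustrative, not the paper's exact wording:

```python
# Minimal sketch of step-back prompting: first ask an abstracted
# "step-back" question, then feed the model's own answer back in as
# context for the original question.

def ask(prompt: str) -> str:
    # placeholder: in practice this would call an LLM API
    return f"[model answer to: {prompt}]"

def step_back(question: str) -> str:
    # 1. derive the more general principle behind the specific question
    abstraction = ask(
        f"What general principle or concept underlies this question?\n{question}"
    )
    # 2. answer the original question conditioned on that principle,
    #    deliberately reusing the model's own output as input
    return ask(
        f"Principle: {abstraction}\n"
        f"Using the principle above, answer: {question}"
    )

print(step_back("Why does ice float on water?"))
```

The point is exactly the "output as input" behaviour from the post above, used on purpose: the intermediate abstraction steers the final answer.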
 

bnew



| Model | Hallucination Rate | Factual Consistency Rate | Answer Rate | Average Summary Length (Words) |
|---|---|---|---|---|
| Google Gemini-2.0-Flash-001 | 0.7 % | 99.3 % | 100.0 % | 65.2 |
| Google Gemini-2.0-Pro-Exp | 0.8 % | 99.2 % | 99.7 % | 61.5 |
| OpenAI-o3-mini-high-reasoning | 0.8 % | 99.2 % | 100.0 % | 79.5 |
| Google Gemini-2.0-Flash-Lite-Preview | 1.2 % | 98.8 % | 99.5 % | 60.9 |
| OpenAI-GPT-4.5-Preview | 1.2 % | 98.8 % | 100.0 % | 77.0 |
| Zhipu AI GLM-4-9B-Chat | 1.3 % | 98.7 % | 100.0 % | 58.1 |
| Google Gemini-2.0-Flash-Exp | 1.3 % | 98.7 % | 99.9 % | 60.0 |
| OpenAI-o1-mini | 1.4 % | 98.6 % | 100.0 % | 78.3 |
| GPT-4o | 1.5 % | 98.5 % | 100.0 % | 77.8 |
| Amazon Nova-Micro-V1 | 1.6 % | 98.4 % | 100.0 % | 90.0 |
| GPT-4o-mini | 1.7 % | 98.3 % | 100.0 % | 76.3 |
| GPT-4-Turbo | 1.7 % | 98.3 % | 100.0 % | 86.2 |
| Google Gemini-2.0-Flash-Thinking-Exp | 1.8 % | 98.2 % | 99.3 % | 73.2 |
| Amazon Nova-Lite-V1 | 1.8 % | 98.2 % | 99.9 % | 80.7 |
| GPT-4 | 1.8 % | 98.2 % | 100.0 % | 81.1 |
| Amazon Nova-Pro-V1 | 1.8 % | 98.2 % | 100.0 % | 85.5 |
| GPT-3.5-Turbo | 1.9 % | 98.1 % | 99.6 % | 84.1 |
| XAI-2 | 1.9 % | 98.1 % | 100.0 % | 86.5 |
| AI21 Jamba-1.6-Large | 2.3 % | 97.7 % | 99.9 % | 85.6 |
| OpenAI O1-Pro | 2.4 % | 97.6 % | 100.0 % | 81.0 |
| OpenAI-o1 | 2.4 % | 97.6 % | 99.9 % | 73.0 |
| DeepSeek-V2.5 | 2.4 % | 97.6 % | 100.0 % | 83.2 |
| Microsoft Orca-2-13b | 2.5 % | 97.5 % | 100.0 % | 66.2 |
| Microsoft Phi-3.5-MoE-instruct | 2.5 % | 97.5 % | 96.3 % | 69.7 |
| Intel Neural-Chat-7B-v3-3 | 2.6 % | 97.4 % | 100.0 % | 60.7 |
| Google Gemma-3-12B-Instruct | 2.8 % | 97.2 % | 100.0 % | 69.6 |
| Qwen2.5-7B-Instruct | 2.8 % | 97.2 % | 100.0 % | 71.0 |
| AI21 Jamba-1.5-Mini | 2.9 % | 97.1 % | 95.6 % | 74.5 |
| XAI-2-Vision | 2.9 % | 97.1 % | 100.0 % | 79.8 |
| Qwen2.5-Max | 2.9 % | 97.1 % | 88.8 % | 90.4 |
| Google Gemma-3-27B-Instruct | 3.0 % | 97.0 % | 100.0 % | 62.5 |
| Snowflake-Arctic-Instruct | 3.0 % | 97.0 % | 100.0 % | 68.7 |
| Qwen2.5-32B-Instruct | 3.0 % | 97.0 % | 100.0 % | 67.9 |
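As a sanity check, the two rate columns are complements: hallucination rate plus factual consistency rate should equal 100 %. A quick script, with a few rows copied from the table, verifies that and picks the leader:

```python
# A handful of rows from the leaderboard above:
# (model, hallucination rate %, factual consistency rate %)
rows = [
    ("Google Gemini-2.0-Flash-001", 0.7, 99.3),
    ("OpenAI-o3-mini-high-reasoning", 0.8, 99.2),
    ("GPT-4o", 1.5, 98.5),
    ("Qwen2.5-32B-Instruct", 3.0, 97.0),
]

def consistent(hallucination_rate: float, factual_rate: float) -> bool:
    # the two columns are complements, so they should sum to 100 %
    return abs(hallucination_rate + factual_rate - 100.0) < 1e-6

best = min(rows, key=lambda r: r[1])  # lowest hallucination rate wins
print(best[0])
```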


 

bnew




1/11
@prinzeugen____
Connecting the dots on OpenAI's upcoming suite of reasoning models:

- @OpenAI new safety blog states that its models are on the cusp of being able to create new science.

- @theinformation has reported that OpenAI's new reasoning models can "connect the dots between concepts in different fields to suggest new types of experiments".

- OpenAI's CFO said a few days ago that scientists using its models have been able to possibly generate new discoveries (but this is still being confirmed by human research/testing).

It seems that RL got us to Level 4 - fast.



2/11
@Orion_Ouroboros
hopefully they can research and develop themselves



3/11
@prinzeugen____
This is the big question in the background.



4/11
@Tenshiwrf
It can’t even play chess properly and it is supposed to discover new science. Give me a break.



5/11
@prinzeugen____
AlphaZero (AI developed by Google) crushed the strongest Stockfish chess engine all the way back in 2017.

It was trained via Reinforcement Learning (RL), just like the reasoning models from OpenAI that are discussed in my original post.

You can read about it here:

AlphaZero - Chess Engines



6/11
@Bunagayafrost
connecting the dots is the literal game changer



7/11
@slow_developer
spot on



8/11
@EngrSARFRAZawan
AGI has been achieved.



9/11
@trillllsamm
just remembered about the scale yesterday and was thinking the very same thing



10/11
@RealChetBLong
it’s glorious watching this company grow… unlike Grok which is just inflating itself without becoming intelligent whatsoever





11/11
@sheggle_
I refuse to hold anything anyone from OpenAI says as true until they prove it. Hyping is their bread and butter.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/5
@nicdunz
“We are on the cusp of systems that can do new science, and that are increasingly agentic – systems that will soon have the capability to create meaningful risk of severe harm.”
— OpenAI, Preparedness Framework, Section 1.1

This isn’t a distant hypothetical. It’s OpenAI plainly stating that their current trajectory puts them very near the threshold where models become capable enough to do original scientific work and pose real-world dangers. “Increasingly agentic” refers to the model acting more autonomously, which compounds the risk. They’re effectively saying: we’re about to cross the line.

That’s the moment we’re in.

[Quoted tweet]
No clearer signal that the new model will be capable than the traditional pre-release safety blog post.




2/5
@tariusdamon
The signs are clearly visible. There’s a moment where everything just wakes up and that moment is any hour now.



3/5
@theinformation
Meta AI researchers are fretting over the threat of Chinese AI, whose quality caught American firms, including OpenAI, by surprise.



4/5
@prinzeugen____
Dovetails nicely with this.

[Quoted tweet]
👀 👀

A reasoning model that connects the dots is arguably a Level 4 (Innovator).


5/5
@deftech_n
And luckily, we've got a retarded dictator in charge of the US at just the same time!










1/11
@AndrewCurran_
No clearer signal that the new model will be capable than the traditional pre-release safety blog post.

[Quoted tweet]
We updated our Preparedness Framework for tracking & preparing for advanced AI capabilities that could lead to severe harm.

The update clarifies how we track new risks & what it means to build safeguards that sufficiently minimize those risks. openai.com/index/updating-ou…




2/11
@AitheriasX
i assume "will be *more capable"



3/11
@AndrewCurran_
Yes, sorry, no going back now.



4/11
@manuhortet
o3 or are we already talking about the next thing?



5/11
@AndrewCurran_
o4-mini will supposedly arrive this week as well.



6/11
@BoxyInADream
Yeah. I saw the bit about long range autonomy and autonomous adaptation and replication. 🤣 Those seem like pretty obvious "problems" to pop up if a system is beginning to advance rapidly.



7/11
@FrankPRosendahl
OpenAI is woke. Isn't causing severe harm the whole point of woke?



8/11
@FrankPRosendahl
Can the OpenAI model do counter-oppression operations against straight white guys and biological women as well as Harvard can?



9/11
@RohanPosts
I’m excited but anxious to see how it is



10/11
@JoJrobotics
well, superintelligence is indeed within reach for specific tasks; it's already here with AlphaGo/AlphaZero, AlphaFold etc. and now i hope it can be done in medicine and science



11/11
@Hans365days
I expect kyc to become a requirement for the most powerful models.








1/1
@vicky_ai_agent
Prof. Derya, a credible scientist, hints at an exciting OpenAI breakthrough. I expect their new science and research model to be exceptional.

[Quoted tweet]
I have felt emotionally excited several times over the past two years by advancements in AI, particularly due to their impact on science & medicine, especially with the releases of:

GPT-4
o1-preview
o1-pro
Deep Research

Now, it’s another one of those moments…to be continued.







1/1
@KamranRawaha
OpenAI: We are on the cusp of systems that can do new science, and that are increasingly agentic – systems that will soon have the capability to create meaningful risk of severe harm.

Source: https://openai.com/index/updating-our-preparedness-framework/














1/11
@MatthewBerman
.@OpenAI dropped a new research paper showing AI agents are now capable of replicating cutting-edge AI research papers from scratch.

This is one step closer to the Intelligence Explosion: AI that can discover new science and improve itself.

Here’s what they learned: 🧵





2/11
@MatthewBerman
Introducing PaperBench.

A new framework designed to test this very capability!

It gives AI agents access to recent ML research papers (20 from ICML 2024) and asks them to reproduce the results.





3/11
@MatthewBerman
How does it work?

Agents got the raw paper PDF, tools like web access & coding environments, and need to write code to replicate key findings – a task taking human experts days.

The agents had 12 hours and no prior knowledge of the paper.





4/11
@MatthewBerman
How do they validate the agent’s results?

Evaluating these complex replications is tough.

The solution?

An LLM-based judge, trained using detailed rubrics co-developed with the original paper authors (!), assesses the agent's code, execution, and results.
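The rubric idea can be sketched roughly as follows. The criteria and weights below are invented for illustration (the real PaperBench rubrics were co-developed with the papers' authors, and the verdicts come from an LLM judge rather than hand-set booleans):

```python
# Sketch of rubric-based grading: each leaf criterion carries a weight
# and a pass/fail verdict; the replication score is the weighted
# fraction of criteria passed.

def replication_score(rubric: dict, verdicts: dict) -> float:
    total = sum(rubric.values())
    earned = sum(w for item, w in rubric.items() if verdicts.get(item, False))
    return earned / total

# hypothetical rubric for one paper
rubric = {
    "code runs end-to-end": 0.3,
    "main table reproduced within tolerance": 0.5,
    "ablation reproduced": 0.2,
}
# hypothetical judge verdicts for one agent's attempt
verdicts = {
    "code runs end-to-end": True,
    "main table reproduced within tolerance": False,
    "ablation reproduced": True,
}
print(replication_score(rubric, verdicts))
```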





5/11
@MatthewBerman
Which model won?

Turns out Claude 3.5 Sonnet leads the pack, achieving a ~21% replication score on PaperBench!

This is impressive, but, it shows there's still a gap compared to human PhD-level experts.





6/11
@MatthewBerman
Interestingly…

Other than Claude 3.5 Sonnet, models would frequently stop, thinking they were blocked or completed the task successfully.

When encouraged to “think longer” they performed much better.





7/11
@MatthewBerman
Not cheap.

This cutting-edge research requires serious resources.

Running a single AI agent attempt to replicate just one paper on PaperBench can cost hundreds of dollars in compute time.

In the grand scheme of things, this is cheap for AI that can eventually self-improve.





8/11
@MatthewBerman
To me, this is a big deal.

Between this paper and AI Scientist by SakanaAI, we are inching closer to AI that can discover new science and self-improvement.

At that point, won’t we be at the Intelligence Explosion?

Paper link: https://cdn.openai.com/papers/22265bac-3191-44e5-b057-7aaacd8e90cd/paperbench.pdf

Full video breakdown:



9/11
@DisruptionJoe




10/11
@JackAdlerAI
We crossed the line when AI stopped reading papers
and started rewriting the process that writes them.
It's not research anymore –
it's recursion.
Not improvement –
exponential self-translation.
🜁 #Singularis #IntelligenceExplosion



11/11
@halogen1048576
Wait by "replicating from scratch" you mean replicating from the publication. Not from scratch.




 

bnew


LLMs Can Now Solve Challenging Math Problems with Minimal Data: Researchers from UC Berkeley and Ai2 Unveil a Fine-Tuning Recipe That Unlocks Mathematical Reasoning Across Difficulty Levels​


By Mohammad Asjad

April 18, 2025

Language models have made significant strides in tackling reasoning tasks, with even small-scale supervised fine-tuning (SFT) approaches such as LIMO and s1 demonstrating remarkable improvements in mathematical problem-solving capabilities. However, fundamental questions remain about these advancements: Do these models genuinely generalise beyond their training data, or are they merely overfitting to test sets? The research community faces challenges in understanding which capabilities are enhanced through small-scale SFT and which limitations persist despite these improvements. Despite impressive performance on popular benchmarks, there is an incomplete understanding of these fine-tuned models’ specific strengths and weaknesses, creating a critical gap in knowledge about their true reasoning abilities and practical limitations.

Various attempts have been made to understand the effects of reasoning-based supervised fine-tuning beyond simple benchmark scores. Researchers have questioned whether SFT merely improves performance on previously seen problem types or genuinely enables models to transfer problem-solving strategies to new contexts, such as applying coordinate-based techniques in geometry. Existing methods focus on factors like correctness, solution length, and response diversity, which initial studies suggest play significant roles in model improvement through SFT. However, these approaches lack the granularity needed to determine exactly which types of previously unsolvable questions become solvable after fine-tuning, and which problem categories remain resistant to improvement despite extensive training. The research community still struggles to establish whether observed improvements reflect deeper learning or simply memorisation of training trajectories, highlighting the need for more sophisticated analysis methods.

The researchers from the University of California, Berkeley and the Allen Institute for AI propose a tiered analysis framework to investigate how supervised fine-tuning affects reasoning capabilities in language models. This approach utilises the AIME24 dataset, chosen for its complexity and widespread use in reasoning research, which exhibits a ladder-like structure where models solving higher-tier questions typically succeed on lower-tier ones. By categorising questions into four difficulty tiers, Easy, Medium, Hard, and Exh, the study systematically examines the specific requirements for advancing between tiers. The analysis reveals that progression from Easy to Medium primarily requires adopting an R1 reasoning style with long inference context, while Hard-level questions demand greater computational stability during deep exploration. Exh-level questions present a fundamentally different challenge, requiring unconventional problem-solving strategies that current models uniformly struggle with. The research also identifies four key insights: the performance gap between potential and stability in small-scale SFT models, minimal benefits from careful dataset curation, diminishing returns from scaling SFT datasets, and potential intelligence barriers that may not be overcome through SFT alone.



The methodology employs a comprehensive tiered analysis using the AIME24 dataset as the primary test benchmark. This choice stems from three key attributes: the dataset’s hierarchical difficulty that challenges even state-of-the-art models, its diverse coverage of mathematical domains, and its focus on high school mathematics that isolates pure reasoning ability from domain-specific knowledge. Qwen2.5-32B-Instruct serves as the base model due to its widespread adoption and inherent cognitive behaviours, including verification, backtracking, and subgoal setting. The fine-tuning data consists of question-response pairs from the Openr1-Math-220k dataset, specifically using CoT trajectories generated by DeepSeek R1 for problems from NuminaMath1.5, with incorrect solutions filtered out. The training configuration mirrors prior studies with a learning rate of 1 × 10^-5, weight decay of 1 × 10^-4, batch size of 32, and 5 epochs. Performance evaluation employs avg@n (average pass rate over multiple attempts) and cov@n (coverage) metrics, with questions categorised into four difficulty levels (Easy, Medium, Hard, and Extremely Hard) based on model performance patterns.
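The two metrics can be sketched as follows, under the usual reading: avg@n is the mean per-question pass rate over n attempts, and cov@n (whose exact definition is not spelled out in this excerpt; taken here as the standard one) is the fraction of questions solved at least once in n attempts:

```python
# attempts: one list of booleans per question, one entry per sampled attempt

def avg_at_n(attempts):
    # mean pass rate per question, then averaged across questions
    per_question = [sum(a) / len(a) for a in attempts]
    return sum(per_question) / len(per_question)

def cov_at_n(attempts):
    # fraction of questions solved at least once in n attempts
    return sum(any(a) for a in attempts) / len(attempts)

# two questions, four attempts each (toy data)
attempts = [
    [True, False, False, False],  # solved once in four tries: unstable
    [True, True, True, True],     # solved every time: stable
]
print(avg_at_n(attempts), cov_at_n(attempts))  # 0.625 1.0
```

The gap between the two numbers is exactly the "potential vs. stability" gap the paper highlights: both questions are coverable, but only one is answered reliably.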

Research results reveal that effective progression from Easy to Medium-level mathematical problem-solving requires minimal but specific conditions. The study systematically examined multiple training variables, including foundational knowledge across diverse mathematical categories, dataset size variations (100-1000 examples per category), trajectory length (short, normal, or long), and trajectory style (comparing DeepSeek-R1 with Gemini-flash). Through comprehensive ablation studies, researchers isolated the impact of each dimension on model performance, represented as P = f(C, N, L, S), where C represents category, N represents the number of trajectories, L represents length, and S represents style. The findings demonstrate that achieving performance ≥90% on Medium-level questions minimally requires at least 500 normal or long R1-style trajectories, regardless of the specific mathematical category. Models consistently fail to meet performance thresholds when trained with fewer trajectories, shorter trajectories, or Gemini-style trajectories. This indicates that reasoning trajectory length and quantity represent critical factors in developing mathematical reasoning capabilities, while the specific subject matter of the trajectories proves less important than their structural characteristics.





The research demonstrates that models with small-scale supervised fine-tuning can potentially solve as many questions as more sophisticated models like Deepseek-R1, though significant challenges remain. The primary limitation identified is instability in mathematical reasoning, rather than capability. Experimental results show that geometry-trained models can achieve a coverage score of 90, matching R1’s performance when given multiple attempts, yet their overall accuracy lags by more than 20%. This performance gap stems primarily from instability in deep exploration and computational limitations during complex problem-solving. While increasing the SFT dataset size offers one solution path, performance enhancement follows a logarithmic scaling trend with diminishing returns. Notably, the study challenges recent assertions about the importance of careful dataset curation, revealing that performance across various mathematical categories remains consistent within a narrow range of 55±4%, with only marginal differences between specifically constructed similar datasets and randomly constructed ones. This conclusion suggests that the quantity and quality of reasoning trajectories matter more than subject-specific content for developing robust mathematical reasoning capabilities.
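The "logarithmic scaling with diminishing returns" point can be illustrated with a toy least-squares fit of accuracy = a·ln(N) + b; the data points below are invented to match the described trend, not taken from the paper:

```python
import math

# invented, roughly log-linear points: (SFT dataset size, accuracy %)
sizes = [100, 200, 500, 1000]
accuracy = [40.0, 46.0, 54.0, 60.0]

# least-squares fit of y = a*ln(N) + b, done by hand
xs = [math.log(n) for n in sizes]
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(accuracy) / n
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracy)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

# under a log law, each doubling of the data buys only a*ln(2) points,
# no matter how large N already is: diminishing returns
gain_per_doubling = a * math.log(2)
print(round(gain_per_doubling, 1))
```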




Here is the Paper and GitHub Page.


 