1/15
@_philschmid
The RLHF method behind the best open models! Both @deepseek_ai and @Alibaba_Qwen use GRPO (Group Relative Policy Optimization) in post-training. GRPO was introduced in the DeepSeekMath paper last year to improve mathematical reasoning capabilities with lower memory consumption, and is now also used in an online fashion to improve Truthfulness, Helpfulness, Conciseness…
Implementation
Generate multiple outputs for each input question using the current Policy
Score these outputs using a reward model
Average the rewards within each group and use that mean as a baseline to compute the advantages
Update the Policy to maximize the GRPO objective, which includes the advantages and a KL term (a minimal code sketch follows below)
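A minimal sketch of that loss in PyTorch, assuming one summed log-prob per sampled output (the actual objective is computed per token) and illustrative names and hyper-parameters; this is not the official DeepSeek or TRL code.

```python
# Minimal GRPO loss sketch (illustrative, not the DeepSeek/TRL implementation).
import torch

def grpo_loss(logprobs_new, logprobs_old, logprobs_ref, rewards,
              clip_eps=0.2, kl_coef=0.04):
    """All inputs are 1-D tensors of length group_size: log-probs of each
    sampled output under the current, old, and reference policies, plus the
    reward-model score for each output."""
    # Group-relative advantage: the baseline is the mean reward of the group,
    # normalized by the group's std -- no learned value function needed.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # PPO-style clipped surrogate on the policy ratio.
    ratio = torch.exp(logprobs_new - logprobs_old)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv)

    # KL penalty against the reference policy, added directly to the loss
    # (the unbiased estimator used in the DeepSeekMath paper).
    kl = torch.exp(logprobs_ref - logprobs_new) - (logprobs_ref - logprobs_new) - 1

    # Maximize the objective -> minimize its negation.
    return -(surrogate - kl_coef * kl).mean()

# Toy usage: a group of 4 sampled answers for one question.
g = 4
loss = grpo_loss(torch.randn(g, requires_grad=True),
                 torch.randn(g), torch.randn(g), torch.rand(g))
loss.backward()
```

Using the group mean as the baseline is what removes the need for a separate value model, which is where the memory savings come from.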
Insights
Doesn't need a value function model, reducing memory and complexity
Adds the KL term directly to the loss rather than into the reward
Works with rule-based reward models and generative/score-based RMs (a toy rule-based reward is sketched below)
Looks similar to the RLOO method
DeepSeek-V3 improved coding, math, writing, role-playing, and question answering with it
Soon in @huggingface TRL (PR open already)
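For the rule-based case mentioned above, a reward can be as simple as a string check against a reference answer. The helper below is a hypothetical example of such a reward, not DeepSeek's actual reward function.

```python
# Toy rule-based reward of the kind GRPO can consume (illustrative assumption).
import re

def rule_based_reward(completion: str, reference_answer: str) -> float:
    # Treat the last number in the completion as the model's final answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == reference_answer else 0.0

# One group of sampled outputs for the same question ("What is 12 * 7?").
group = ["12 * 7 = 84", "I think the answer is 74", "The result is 84."]
rewards = [rule_based_reward(c, "84") for c in group]
print(rewards)  # [1.0, 0.0, 1.0] -> averaged as the group baseline in GRPO
```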
2/15
@_philschmid
Paper: DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
3/15
@AI_Fun_times
Fascinating collaboration! Leveraging Group Relative Policy Optimization (GRPO) post-training to enhance mathematical reasoning in open models is a game-changer.
4/15
@AIVideoTech
Fascinating collaboration!
5/15
@EdwardKens50830
Exciting stuff, Philipp! GRPO sounds like a game-changer for AI efficiency. How do you see this impacting future AI developments in terms of scalability and real-world applications?
6/15
@adugbovictory
The transition from math-centric optimization to broader applications like truthfulness and role-playing is intriguing. What specific domains do you think GRPO could optimize next? Let's brainstorm!
7/15
@rayzhang123
GRPO's efficiency could reshape AI training costs.
8/15
@MaziyarPanahi
Would be lovely to have this in @huggingface trl and @axolotl_ai
9/15
@kevin_ai_gamer
GRPO seems like a game-changer for mathematical reasoning; how does it optimize policy updates?
10/15
@zzzzzzoo00oo
@readwise save thread
11/15
@yangyc666
GRPO's efficiency is key for scaling AI models.
12/15
@CohorteAI
GRPO's efficiency in improving LLM capabilities is groundbreaking. For a deeper dive into the techniques shaping open models like DeepSeek 3, check out: What Can Large Language Models Achieve? - Cohorte Projects.
13/15
@roninhahn
[Quoted tweet]
x.com/i/article/187524607059…
14/15
@Frank37004246
RLOO reference here: [2402.14740] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs
15/15
@Frank37004246
My understanding is that o1-3, however, are probably using something more akin to a PRM? [2305.20050] Let's Verify Step by Step