1/2
1/n How Mutually Consistent Reasoning Unlocks Agentic AI for Small Language Models
Large Language Models (LLMs) have demonstrated remarkable abilities in various tasks, yet their capacity for complex reasoning remains a significant challenge, especially for their smaller, more accessible counterparts – Small Language Models (SLMs). While fine-tuning on specific reasoning datasets can improve performance, this approach often relies on data generated by superior models, creating a dependence that hinders the development of truly self-sufficient SLMs. The paper "Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers" tackles this challenge head-on, introducing rStar, a novel approach that significantly enhances the reasoning capabilities of SLMs without relying on fine-tuning or data from superior models.
The core of rStar lies in addressing the two major pain points that plague SLMs when it comes to complex reasoning: ineffective exploration of potential solutions and unreliable self-assessment. Traditional methods often confine SLMs to a limited set of reasoning actions, hindering their ability to explore diverse paths towards a solution. Furthermore, relying on these models to evaluate their own reasoning proves unreliable, as their self-assessment capabilities are often inaccurate.
rStar tackles these limitations through a clever two-pronged approach: a richer, human-inspired set of reasoning actions and a collaborative evaluation mechanism called mutual consistency. Unlike previous methods that rely on a single action type, rStar empowers SLMs with a diverse set of actions, mimicking human problem-solving strategies. These actions include proposing thoughts, formulating sub-questions, re-answering, and even rephrasing questions for clarity. This expanded repertoire allows SLMs to navigate the solution space more effectively, exploring a wider range of possibilities.
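To make this concrete, here is a minimal Python sketch of what such an action space could look like. The action names and the state attributes are illustrative placeholders of my own, not code from the paper.

```python
# Illustrative only: a hypothetical encoding of a human-like action space.
from enum import Enum, auto

class ReasoningAction(Enum):
    PROPOSE_THOUGHT = auto()       # add one intermediate reasoning step
    PROPOSE_SUBQUESTION = auto()   # decompose: pose and answer a sub-question
    REANSWER_SUBQUESTION = auto()  # retry a previously answered sub-question
    REPHRASE_QUESTION = auto()     # restate the question for clarity
    FINISH = auto()                # write out the remaining steps and final answer

def candidate_actions(state):
    """Return the actions available from a partial reasoning state.
    `state.has_subquestion` and `state.is_root` are assumed attributes."""
    actions = [ReasoningAction.PROPOSE_THOUGHT,
               ReasoningAction.PROPOSE_SUBQUESTION,
               ReasoningAction.FINISH]
    if state.has_subquestion:
        actions.append(ReasoningAction.REANSWER_SUBQUESTION)
    if state.is_root:
        actions.append(ReasoningAction.REPHRASE_QUESTION)
    return actions
```

The point is simply that each node in the search can branch on several qualitatively different moves, rather than only "generate the next step."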
To address the issue of unreliable self-evaluation, rStar introduces a second SLM as a partner in a collaborative verification process. The first SLM, acting as a generator, leverages the diverse action set and the Monte Carlo Tree Search (MCTS) algorithm to generate multiple candidate reasoning trajectories. The second SLM, acting as a discriminator, then evaluates these trajectories by attempting to complete them with partial information. This collaborative approach, termed "mutual consistency," ensures that only those reasoning paths agreed upon by both SLMs are considered valid, leading to a more robust and reliable evaluation process.
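Here is a rough sketch of how that mutual-consistency check can be expressed, assuming discriminator_slm is a callable that completes a partial reasoning trace and final_answer extracts the final answer from a trace; both names are placeholders of my own, not the paper's API.

```python
# Sketch of mutual consistency: hide part of a candidate trajectory, let the
# second SLM finish it, and keep the trajectory only if both models agree.
import random

def mutually_consistent(question, trajectory, discriminator_slm, final_answer):
    """trajectory is a list of reasoning steps ending in an answer."""
    if len(trajectory) < 2:
        return False                              # nothing meaningful to mask
    split = random.randint(1, len(trajectory) - 1)
    prefix = trajectory[:split]                   # partial reasoning shown to the discriminator
    completion = discriminator_slm(question, prefix)
    return final_answer(completion) == final_answer(trajectory)

def select_answer(question, candidates, discriminator_slm, final_answer):
    """Keep only trajectories the discriminator independently agrees with.
    (A fuller version would also rank the surviving trajectories by score.)"""
    verified = [t for t in candidates
                if mutually_consistent(question, t, discriminator_slm, final_answer)]
    return final_answer(verified[0]) if verified else None
```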
The effectiveness of rStar is evident in its impressive performance on a variety of reasoning tasks. Tested on five different SLMs and five diverse reasoning benchmarks, including mathematical problem-solving and multi-hop reasoning over text, rStar consistently outperforms existing state-of-the-art methods. Remarkably, it achieves accuracy comparable to or even exceeding models fine-tuned on these specific datasets, highlighting its ability to improve reasoning without any task-specific training data.
The success of rStar signifies a significant leap forward in the field of LLM reasoning. By combining the power of diverse reasoning actions with a collaborative evaluation mechanism, rStar unlocks the potential of SLMs, enabling them to tackle complex reasoning tasks with remarkable accuracy. This approach not only paves the way for more accessible and efficient AI systems but also sheds light on the power of collaborative learning and self-improvement in pushing the boundaries of artificial intelligence.
2/2
2/n Comparison with other methods
1. Prompting LLMs to Reason:
Chain-of-Thought (CoT) (Wei et al., 2022): Prompts LLMs with a few-shot demonstration of reasoning steps.
Contrast with rStar: CoT relies on a single, greedy decoding path, while rStar explores multiple reasoning trajectories using MCTS and a richer action space (see the toy sketch after this group).
Planning, Decomposition, Abstraction, Programming Prompts: Various works explore specific prompting strategies to guide reasoning.
Contrast with rStar: These methods rely on a single round of inference, while rStar iteratively explores and verifies multiple reasoning trajectories at inference time.
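As a toy illustration of that contrast (prompts and helper names are my own, not from the paper or any specific library): a single greedy CoT pass versus choosing among many explored trajectories.

```python
# Toy contrast: one greedy chain-of-thought pass vs. picking among many
# explored trajectories. `slm`, `search`, and `score` are assumed callables.
COT_PROMPT = (
    "Q: A pen costs $2. How much do 3 pens cost?\n"
    "A: Each pen costs $2, so 3 pens cost 3 * 2 = $6. The answer is 6.\n"
    "Q: {question}\nA:"
)

def cot_answer(slm, question):
    """Chain-of-Thought: one prompt, one greedy decoding path."""
    return slm(COT_PROMPT.format(question=question), temperature=0.0)

def search_answer(slm, question, search, score):
    """rStar-style: explore many trajectories (e.g. MCTS over a rich action
    space) and return the best-scoring one."""
    trajectories = search(slm, question)
    return max(trajectories, key=score)
```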
2. LLM Self-improvement:
Fine-tuning based methods (Chen et al., 2024b;a): Use a well-pretrained LLM to generate data for further fine-tuning.
Contrast with rStar: rStar improves reasoning at inference time without requiring additional training data or a superior teacher model.
Self-verification (Gero et al., 2023; Zhou et al., 2023): LLMs verify their own answers, often by generating explanations or checking for consistency.
Contrast with rStar: rStar uses a separate discriminator SLM for more reliable evaluation, overcoming the limitations of self-assessment in SLMs.
RAP (Hao et al., 2023): Uses self-exploration and self-rewarding to iteratively improve reasoning.
Contrast with rStar: rStar addresses the limitations of RAP's single action type and unreliable self-rewarding with its diverse action space and mutual consistency mechanism.
3. Sampling Reasoning Paths:
Self-Consistency (Wang et al., 2023): Samples multiple CoT paths and selects the most consistent answer.
Contrast with rStar: Self-consistency relies on random sampling of complete CoT paths, while rStar uses MCTS with a richer action space for more guided exploration (a minimal MCTS sketch follows this group).
Tree-search approaches (Yao et al., 2024; Hao et al., 2023; Zhang et al., 2024): Use tree search algorithms like MCTS to explore reasoning paths.
Contrast with rStar: Most existing tree-search methods use limited action spaces, while rStar's diverse actions provide more flexibility and effectiveness.
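For readers unfamiliar with MCTS, here is a compressed, generic sketch of the select / expand / simulate / back-propagate loop; expand and rollout would wrap the SLM's action proposals and answer scoring. This is standard MCTS, simplified, not the paper's exact algorithm.

```python
# Generic, simplified MCTS loop; `expand` and `rollout` are assumed callables
# that wrap the SLM (propose next reasoning states / finish and score a path).
import math
import random
from dataclasses import dataclass, field

@dataclass
class Node:
    state: object                  # a partial reasoning trajectory
    parent: "Node" = None
    children: list = field(default_factory=list)
    visits: int = 0
    value: float = 0.0

def uct(child, parent, c=1.4):
    """Upper-confidence score: balance exploiting good paths and exploring new ones."""
    if child.visits == 0:
        return float("inf")
    return child.value / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts(root, expand, rollout, iterations=64):
    for _ in range(iterations):
        node = root
        while node.children:                        # 1) selection
            node = max(node.children, key=lambda ch: uct(ch, node))
        for s in expand(node.state):                # 2) expansion
            node.children.append(Node(state=s, parent=node))
        leaf = random.choice(node.children) if node.children else node
        reward = rollout(leaf.state)                # 3) simulation
        while leaf is not None:                     # 4) back-propagation
            leaf.visits += 1
            leaf.value += reward
            leaf = leaf.parent
    return max(root.children, key=lambda ch: ch.visits) if root.children else root
```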
4. Answer Verification:
Majority voting (Wang et al., 2023): Selects the answer that appears most frequently across multiple generated solutions.
Contrast with rStar: rStar's mutual consistency mechanism provides a more robust evaluation than simple majority voting, especially for SLMs (a toy comparison follows this group).
Trained reward models (Wang et al., 2024b; Chen et al., 2024a): Train separate models to evaluate the quality of reasoning paths.
Contrast with rStar: rStar avoids the need for additional training data and potential overfitting issues associated with training separate reward models.
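To make that last contrast concrete, here is a toy comparison of plain majority voting against voting only over trajectories that a second model independently agreed with. The helper names and example values are mine, purely illustrative.

```python
# Toy contrast: majority vote over all answers vs. a vote restricted to
# trajectories a second model independently completed to the same answer.
from collections import Counter

def majority_vote(answers):
    """Self-consistency style: pick the most frequent final answer."""
    return Counter(answers).most_common(1)[0][0]

def agreed_vote(answers, agreements):
    """Mutual-consistency style: only mutually agreed answers get a vote."""
    kept = [a for a, ok in zip(answers, agreements) if ok]
    return Counter(kept).most_common(1)[0][0] if kept else None

# Hypothetical run: three sampled answers, only one verified by the second model.
print(majority_vote(["8", "8", "6"]))                      # -> "8"
print(agreed_vote(["8", "8", "6"], [False, False, True]))  # -> "6"
```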
In essence, rStar distinguishes itself from prior work by combining the strengths of several approaches:
It leverages the power of tree search for exploring solution spaces.
It introduces a richer, human-inspired action space for more effective exploration.
It employs a novel mutual consistency mechanism for reliable evaluation without relying on self-assessment or external training data.
This unique combination allows rStar to significantly improve SLM reasoning, achieving performance comparable to or even surpassing fine-tuned models.