bnew


[Submitted on 8 Jul 2024]

InverseCoder - Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct​

Yutong Wu
Abstract:Recent advancements in open-source code large language models (LLMs) have demonstrated remarkable coding abilities by fine-tuning on the data generated from powerful closed-source LLMs such as GPT-3.5 and GPT-4 for instruction tuning. This paper explores how to further improve an instruction-tuned code LLM by generating data from itself rather than querying closed-source LLMs. Our key observation is the misalignment between the translation of formal and informal languages: translating formal language (i.e., code) to informal language (i.e., natural language) is more straightforward than the reverse. Based on this observation, we propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of the reverse. Specifically, given an instruction tuning corpus for code and the resulting instruction-tuned code LLM, we ask the code LLM to generate additional high-quality instructions for the original corpus through code summarization and self-evaluation. Then, we fine-tune the base LLM on the combination of the original corpus and the self-generated one, which yields a stronger instruction-tuned LLM. We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks, including Python text-to-code generation, multilingual coding, and data-science code generation.
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Cite as: arXiv:2407.05700
arXiv:2407.05700v1
[2407.05700] InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct


https://arxiv.org/pdf/2407.05700


A.I Generated explanation:


**Title:** InverseCoder - Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

**Author:** Yutong Wu

**Summary:** This paper is about making computer programs that can write code better. Right now, these programs are trained by using data from other powerful programs. But the authors of this paper found a way to make these programs even better by using their own abilities to generate more data.

**The Problem:** When we try to translate formal language (like code) into informal language (like natural language), it's easier than the other way around. The authors used this idea to create a new way to improve these code-writing programs.

**The Solution:** They created a system called Inverse-Instruct, which takes code snippets and generates instructions for them. Then, they use these instructions to train the program again, making it even better at writing code.
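
To make the loop concrete, here is a rough Python sketch of the Inverse-Instruct idea; the `ask` callable is a hypothetical stand-in for a call to the instruction-tuned code LLM, not the paper's actual interface:

```python
# A rough sketch of the Inverse-Instruct loop, assuming a hypothetical `ask`
# callable that wraps the instruction-tuned code LLM (not the paper's actual API).

def inverse_instruct(corpus, ask):
    """corpus: list of (instruction, code) pairs from the original tuning data."""
    new_pairs = []
    for _, code in corpus:
        # 1) Code summarization: have the model write an instruction for the snippet.
        candidate = ask(
            "Write a concise programming instruction that the following code solves:\n" + code
        )
        # 2) Self-evaluation: have the model judge whether instruction and code match.
        verdict = ask(
            f"Instruction:\n{candidate}\nCode:\n{code}\n"
            "Does the code correctly implement the instruction? Answer yes or no."
        )
        if verdict.strip().lower().startswith("yes"):
            new_pairs.append((candidate, code))
    # 3) The base model is then fine-tuned on the original corpus plus the new pairs.
    return corpus + new_pairs
```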

**Results:** They created a series of programs called InverseCoder, which performed better than the original programs on many tests, including generating Python code, multilingual coding, and data science code generation.

**Details:**

* The paper is about computer science, artificial intelligence, and software engineering.
* You can find the paper on the arXiv website, and it has a unique identifier (arXiv:2407.05700).
* There's a history of when the paper was submitted and updated.

Let me know if you have any specific questions about this explanation.
 

bnew


[Submitted on 4 Nov 2023 (v1), last revised 5 Jun 2024 (this version, v4)]

Levels of AGI for Operationalizing Progress on the Path to AGI​

Meredith Ringel Morris
Abstract:We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors. This framework introduces levels of AGI performance, generality, and autonomy, providing a common language to compare models, assess risks, and measure progress along the path to AGI. To develop our framework, we analyze existing definitions of AGI, and distill six principles that a useful ontology for AGI should satisfy. With these principles in mind, we propose "Levels of AGI" based on depth (performance) and breadth (generality) of capabilities, and reflect on how current systems fit into this ontology. We discuss the challenging requirements for future benchmarks that quantify the behavior and capabilities of AGI models against these levels. Finally, we discuss how these levels of AGI interact with deployment considerations such as autonomy and risk, and emphasize the importance of carefully selecting Human-AI Interaction paradigms for responsible and safe deployment of highly capable AI systems.
Comments:version 4 - Position Paper accepted to ICML 2024. Note that due to ICML position paper titling format requirements, the title has changed slightly from that of the original arXiv pre-print. The original pre-print title was "Levels of AGI: Operationalizing Progress on the Path to AGI" but the official published title for ICML 2024 is "Levels of AGI for Operationalizing Progress on the Path to AGI"
Subjects:Artificial Intelligence (cs.AI)
Cite as: arXiv:2311.02462
arXiv:2311.02462v4
[2311.02462] Levels of AGI for Operationalizing Progress on the Path to AGI
Journal reference:Proceedings of ICML 2024


https://arxiv.org/pdf/2311.02462v4
 

bnew


Reasoning skills of large language models are often overestimated​

New CSAIL research highlights how LLMs excel in familiar scenarios but struggle in novel ones, questioning their true reasoning abilities versus reliance on memorization.

Rachel Gordon | MIT CSAIL

Publication Date: July 11, 2024

Caption: MIT researchers examined how LLMs fare with variations of different tasks, putting their memorization and reasoning skills to the test. The result: their reasoning abilities are often overestimated. (Image: Alex Shipps/MIT CSAIL)

When it comes to artificial intelligence, appearances can be deceiving. The mystery surrounding the inner workings of large language models (LLMs) stems from their vast size, complex training methods, hard-to-predict behaviors, and elusive interpretability.

MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers recently peered into the proverbial magnifying glass to examine how LLMs fare with variations of different tasks, revealing intriguing insights into the interplay between memorization and reasoning skills. It turns out that their reasoning abilities are often overestimated.

The study compared “default tasks,” the common tasks a model is trained and tested on, with “counterfactual scenarios,” hypothetical situations deviating from default conditions — which models like GPT-4 and Claude can usually be expected to cope with. The researchers developed some tests outside the models’ comfort zones by tweaking existing tasks instead of creating entirely new ones. They used a variety of datasets and benchmarks specifically tailored to different aspects of the models' capabilities for things like arithmetic, chess, evaluating code, answering logical questions, etc.

When users interact with language models, any arithmetic is usually in base-10, the number base most familiar to the models. But observing that they do well on base-10 could give a false impression of strong competency in addition. Logically, if they truly possess good addition skills, you’d expect reliably high performance across all number bases, similar to calculators or computers. Indeed, the research showed that these models are not as robust as many initially think. Their high performance is limited to common task variants, and they suffer a consistent and severe performance drop in unfamiliar counterfactual scenarios, indicating a lack of generalizable addition ability.
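
For illustration, a counterfactual probe of this kind can be built by posing the same two-digit addition task in base 10 and in a less familiar base. This is a generic sketch, not the paper's evaluation harness:

```python
import random

def to_base(n, base):
    """Render a non-negative integer in the given base (base <= 10, digits 0-9)."""
    if n == 0:
        return "0"
    digits = []
    while n:
        digits.append(str(n % base))
        n //= base
    return "".join(reversed(digits))

def make_addition_probe(base, n_items=50, seed=0):
    """Return (prompt, expected_answer) pairs for two-digit addition in a given base.
    Base 10 is the 'default' task; other bases are counterfactual variants."""
    rng = random.Random(seed)
    items = []
    for _ in range(n_items):
        a, b = rng.randint(10, 99), rng.randint(10, 99)
        prompt = (f"You are doing addition in base-{base}. "
                  f"What is {to_base(a, base)} + {to_base(b, base)}? Answer in base-{base}.")
        items.append((prompt, to_base(a + b, base)))
    return items

# A model with a generalizable addition skill should score similarly on both lists;
# a large gap suggests memorized base-10 patterns rather than real competence.
default_items = make_addition_probe(base=10)
counterfactual_items = make_addition_probe(base=9)
```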

The pattern held true for many other tasks like musical chord fingering, spatial reasoning, and even chess problems where the starting positions of pieces were slightly altered. While human players are expected to still be able to determine the legality of moves in altered scenarios (given enough time), the models struggled and couldn’t perform better than random guessing, meaning they have limited ability to generalize to unfamiliar situations. And much of their performance on the standard tasks is likely not due to general task abilities, but overfitting to, or directly memorizing from, what they have seen in their training data.

“We’ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons,” says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author on a new paper about the research. “As AI is becoming increasingly ubiquitous in our society, it must reliably handle diverse scenarios, whether familiar or not. We hope these insights will one day inform the design of future LLMs with improved robustness.”

Despite the insights gained, there are, of course, limitations. The study’s focus on specific tasks and settings didn’t capture the full range of challenges the models could potentially encounter in real-world applications, signaling the need for more diverse testing environments. Future work could involve expanding the range of tasks and counterfactual conditions to uncover more potential weaknesses. This could mean looking at more complex and less common scenarios. The team also wants to improve interpretability by creating methods to better comprehend the rationale behind the models’ decision-making processes.

“As language models scale up, understanding their training data becomes increasingly challenging even for open models, let alone proprietary ones,” says Hao Peng, assistant professor at the University of Illinois at Urbana-Champaign. “The community remains puzzled about whether these models genuinely generalize to unseen tasks, or seemingly succeed by memorizing the training data. This paper makes important strides in addressing this question. It constructs a suite of carefully designed counterfactual evaluations, providing fresh insights into the capabilities of state-of-the-art LLMs. It reveals that their ability to solve unseen tasks is perhaps far more limited than anticipated by many. It has the potential to inspire future research towards identifying the failure modes of today’s models and developing better ones.”

Additional authors include Najoung Kim, who is a Boston University assistant professor and Google visiting researcher, and seven CSAIL affiliates: MIT electrical engineering and computer science (EECS) PhD students Linlu Qiu, Alexis Ross, Ekin Akyürek SM ’21, and Boyuan Chen; former postdoc and Apple AI/ML researcher Bailin Wang; and EECS assistant professors Jacob Andreas and Yoon Kim.

The team’s study was supported, in part, by the MIT–IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation. The team presented the work at the North American Chapter of the Association for Computational Linguistics (NAACL) last month.
 

bnew


1/13
New paper on RL, synthetic data, LLM math reasoning (MATH / GSM8K)

TL;DR: RL on wrong responses (yes, "proper" RL, not filtered SFT or STaR / RFT) scales the utility of synthetic data by **8x**. Key ideas: spurious correlations, stitching, and credit assignment.

[2406.14532] RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

2/13
Our goal was to understand + compare approaches to learning from syn. data.

Approach:
1. ask GPT / Gemini to give new problems + solutions (see: [2403.04706] Common 7B Language Models Already Possess Strong Math Capabilities)
2. run SFT on it
3. STaR (reject bad data) or run RL (with good+bad data)
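
A compact sketch of the STaR/RFT-style filtering in step 3, with hypothetical `sample_solutions` and `final_answer` helpers; as the thread notes, RL differs in that it also keeps the incorrect samples:

```python
def rejection_filter(problems, sample_solutions, final_answer, n_samples=4):
    """STaR / RFT-style filtering: keep self-generated solutions whose final answer
    matches the reference and discard the rest; RL, by contrast, also uses the
    incorrect ones as negative data."""
    positives, negatives = [], []
    for prob in problems:                      # prob: {"question": str, "answer": str}
        for sol in sample_solutions(prob["question"], n=n_samples):
            if final_answer(sol) == prob["answer"]:
                positives.append((prob["question"], sol))   # kept by STaR / RFT
            else:
                negatives.append((prob["question"], sol))   # kept only by RL
    return positives, negatives
```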

3/13
Predominant approaches often just do step 2 (SFT) and then do step 3 by running STaR, rejection fine-tuning, etc. It is unclear why step 3 is useful, or whether step 2 alone would be enough.

Moreover, it is unclear why RL is even needed or useful.

Reminds us of RL vs BC ([2204.05618] When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning?)

4/13
Takeaway 1: We began by understanding scaling laws of SFT on syn problems + solutions. The rate of reduction of test error is low (much lower than ERM), but at least it goes down (despite potential errors in solution traces)

promising future for scaling syn data for math....

5/13
...but it turns out that if you do STaR on self-generated solutions, it improves data scaling by 2x vs SFT. Note that this self-generated data comes from a 7B model, while the SFT data is from GPT-4 / Gemini 1.5 Pro (much more capable).

self-gen data is easier to fit & learn from (RFT vs SFT)

6/13
Finally though, RFT is not good enough. It amplifies the model's dependence on spurious correlations / spurious steps that hurt generalization on test problems, since we can't detect them using reward checking!

7/13
So, RL can actually fix this spurious issue!

Our insight is that you need per-step feedback. (Offline) RL on negative data -- incorrect model-generated solutions -- can give you exactly that.

Per-step feedback = advantages for each step in a solution trace; then do advantage-weighted RL.

8/13
By running MC rollouts ([2312.08935] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations), we can estimate the advantage of each step and identify which steps are critical -- "need to get them right" -- and which are spurious -- incorrect or irrelevant steps to unlearn.
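
Roughly, the Math-Shepherd-style Monte Carlo estimate looks like this, assuming hypothetical `rollout` and `final_answer` helpers (the paper's exact estimator may differ):

```python
def step_advantages(question, steps, reference_answer, rollout, final_answer, n_rollouts=8):
    """Monte Carlo value of each solution prefix: the fraction of completions from
    that prefix that reach the correct final answer. The advantage of a step is the
    change in value it causes; a negative advantage flags a spurious step to unlearn."""
    values = []
    for k in range(len(steps) + 1):
        prefix = "\n".join(steps[:k])
        wins = sum(
            final_answer(rollout(question, prefix)) == reference_answer
            for _ in range(n_rollouts)
        )
        values.append(wins / n_rollouts)
    return [values[k + 1] - values[k] for k in range(len(steps))]
```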

Now feed this data into RL -- we do per-step DPO

9/13
Takeaway 2: Turns out that using these advantages in RL (we pose it as pairwise per-step DPO, not standard DPO) works -- and gives us an 8x boost in data scaling compared to SFT only, and a 4x boost with respect to RFT.

and, naive offline RL (i.e., standard DPO) doesn't work
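
For reference, a minimal PyTorch sketch of a DPO-style loss on per-step preference pairs; this illustrates the general form, not the paper's exact objective:

```python
import torch.nn.functional as F

def per_step_dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss on per-step preference pairs: given a shared solution prefix, a step
    with positive estimated advantage is 'chosen' and one with negative advantage is
    'rejected'. Inputs are summed log-probs of each step under the policy and under
    the frozen reference model (1-D tensors over the batch)."""
    policy_margin = logp_chosen - logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```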

10/13
We also show how RL gives this generalization boost.

- advantages can distinguish good & bad steps, when computed from a decent SFT initialization
- now RL is like group DRO, and that gives better test loss on more "critical" steps
- Better at critical steps → better perf.

11/13
On an example from "the pitfalls of next token prediction" ([2403.06963] The pitfalls of next-token prediction), we show this whole thing in action

If you choose a good SFT init and run advantage-weighted RL with exact advantages, you do well while SFT fails; with an over-trained SFT init, both RL and SFT fail.

12/13
Summary:

- RL is useful for math reasoning (any task with some kind of "compositional" nature).

- Likely better RL methods can push this 8x much further (maybe an order more?)

- RL's better generalization vs SFT (by connecting to DRO) is neat + some theoretical results!

13/13
This was an awesome collab with CMU, led by Amrith (@setlur_amrith), with @saurabh_garg67, Naman Garg, @younggeng, and @gingsmith.

Code + data (coming soon): GitHub - ars22/scaling-LLM-math-synthetic-data: Code and data used in the paper: "Training on Incorrect Synthetic Data via RL Scales LLM Math Reasoning Eight-Fold"

Paper: [2406.14532] RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

Please let us know if you have feedback!



[Submitted on 20 Jun 2024]

RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold​

Amrith Setlur
Abstract:Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself followed by subsequent fine-tuning on this self-generated data doubles the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final answer verifier. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by 8×. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits robustness benefits of RL over imitating positive data alone.


https://arxiv.org/pdf/2406.14532
 

bnew


1/12
New LLM Agents Benchmark!

Introducing MIRAI: A groundbreaking benchmark crafted for evaluating LLM agents in temporal forecasting of international events with tool use and complex reasoning!

Arxiv: [2407.01231] MIRAI: Evaluating LLM Agents for Event Forecasting
Project page: MIRAI: Evaluating LLM Agents for Event Forecasting

1/N

2/12
2/N We released our code, data, and an interactive demo:

GitHub Repo: GitHub - yecchen/MIRAI: Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting"

Dataset:

Interactive Demo Notebook: Google Colab

3/12
3/N Data

With 59,161 unique events and 296,630 unique news articles, we curate a test set of 705 forecasting query-answer pairs.

(a) Circular Chart: The relation hierarchy and distribution in MIRAI.
(b-c) Heatmap: Intensity of global events, from conflict to mediation.

4/12
4/N Forecasting Task

Forecasting involves collecting essential historical data and performing temporal reasoning to predict future events.

Example: Forecasting cross-country relations on 2023-11-18 using event and news information up to 2023-11-17.

5/12
5/N APIs & Environment

Our comprehensive APIs empower agents to generate code and access the database.

APIs include data classes and functions for various info types and search conditions.

Agents can call a single function or generate a code block at each step.

6/12
6/N Agent Framework

Think: Agent analyzes and plans the next action using API specs.
Act: Generates Single Function or Code Block to retrieve data.
Execute: Python interpreter runs the code for observations.

These steps are repeated until reaching a final forecast.
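
A sketch of that Think / Act / Execute loop, with hypothetical `llm` and `run_python` helpers (the benchmark's real APIs and prompts live in the repo linked above):

```python
def forecast_agent(query, llm, run_python, max_steps=10):
    """ReAct-style Think / Act / Execute loop. `llm(history)` is assumed to return a
    dict with a 'thought', a 'code' action, and optionally a 'final_answer';
    `run_python(code)` executes the code against the event/news database APIs and
    returns the printed observation. Both helpers are hypothetical."""
    history = [f"Forecasting query: {query}"]
    for _ in range(max_steps):
        step = llm("\n".join(history))            # Think: plan the next action from the API specs
        history.append(step["thought"])
        if step.get("final_answer") is not None:  # stop once the agent commits to a forecast
            return step["final_answer"]
        observation = run_python(step["code"])    # Act + Execute: run the generated code
        history.append(f"Observation: {observation}")
    return None  # no forecast within the step budget
```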

7/12
7/N Forecasting with Different Base LLMs

Code Block benefits stronger LLMs but hurts weaker models.
GPT-4o consistently outperforms other models.
Self-consistency makes a small model stronger.

8/12
8/N Forecasting with Temporal Distance

Our ablation study lets agents predict 1, 7, 30, and 90 days ahead.

Results: as the forecasting horizon grows, F1 drops and KL divergence increases.

The agent's accuracy drops for distant events; longer horizons require anticipating trend shifts influenced by more factors and complexities.

9/12
9/N Tool-Use Ordering in Forecasting

Tool-Use Transition Graph: Agents start with recent events for key info and end with news for context.

Freq.(correct) - Freq.(incorrect): highlights the need for strategic planning in LLM agents for effective forecasting.

10/12
10/N Check our paper out for more details!

Code error analysis, different event types, variation of API types, and different agent planning strategies!

Join us in advancing the capabilities of LLM agents in forecasting and understanding complex international events!

11/12
11/N

Sincere thanks to all amazing collaborators and advisors @acbuller , @Yihe__Deng, @HuangZi71008374, @mingyu_ma, @Zhu_Yanqiao, and @WeiWang1973 for their invaluable advice and efforts!

12/12
Thank you so much, Avi! Looking forward to hearing your thoughts!



[Submitted on 1 Jul 2024]

MIRAI - Evaluating LLM Agents for Event Forecasting​

Chenchen Ye
Abstract:Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information, over which to conduct reasoning to solve complex problems. Given this capability, increasing interests have been put into employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite such a growing interest, there is a lack of a rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously source and integrate critical information from large global databases; 2) write codes using domain-specific APIs and libraries for tool-use; and 3) jointly reason over historical knowledge from diverse formats and time to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relation analysis.
Comments: this https URL
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2407.01231
arXiv:2407.01231v1
[2407.01231] MIRAI: Evaluating LLM Agents for Event Forecasting


https://arxiv.org/pdf/2407.01231
 

bnew


1/10
Is your model really a good math reasoner? If a model understands a problem, it should robustly work across various tasks!

Introducing MathCheck: Evaluating Math Reasoning with Checklist!

MathCheck reveals comprehensive reasoning ability of (M)LLM.

2/10
[2/n] We employ LLMs as engines to automatically generate MATHCHECK. We develop MATHCHECK-GSM and MATHCHECK-GEO to assess mathematical textual reasoning and multi-modal reasoning abilities, serving as upgraded versions of benchmarks including GSM8k, GeoQA, UniGeo, and Geometry3K.
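
As a rough illustration of using an LLM as the generation engine, one seed problem can be expanded into checklist variants like so; the variant names here are illustrative, and the actual checklist dimensions are defined in the paper:

```python
VARIANT_PROMPTS = {
    # Illustrative rewriting instructions; the actual checklist dimensions used by
    # MATHCHECK are defined in the paper.
    "rephrased": "Rewrite this math problem with different wording but the same answer:\n{q}",
    "irrelevant_info": "Add a distracting but irrelevant sentence to this problem without changing its answer:\n{q}",
    "perturbed": "Change the numbers in this problem, restate it, and give the new answer:\n{q}",
}

def build_checklist(seed_problem, llm):
    """Expand one seed problem into a checklist of variants using an LLM as the
    generation engine; a robust reasoner should handle every entry, not just the original."""
    return {name: llm(tmpl.format(q=seed_problem)) for name, tmpl in VARIANT_PROMPTS.items()}
```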

3/10
[3/n] On MATHCHECK-GSM, GPT-4o achieves the highest levels in most tasks and question variants. Some models underperform in tasks other than problem solving, suggesting special optimization for the solving task. This phenomenon is also observed across all math-customized models.

4/10
[4/n] On MATHCHECK-GEO, GPT-4o also demonstrates the best performance. Among the open-source MLLMs, all of them exhibited poor reasoning performance. This suggests that the multi-modal reasoning capabilities of open-source MLLMs still have significant room for improvement.

5/10
[5/n] Why MATHCHECK? We use performance on private data and compression efficiency as surrogates to assess the genuine mathematical abilities of models. Examining the correlation between MATHCHECK and surrogates, we find it represents intelligence more linearly.

6/10
[6/n] Behavior of Math Models: Examining the behaviors of the math models implies that training solely on massive solving data is not the right direction to improve math reasoning ability. Instead, training models with high-quality and diverse math data should be considered.

7/10
[7/n] Reasoning Consistency: Most models show reasoning consistency, achieving similar scores on each unit. Some models perform reasoning inconsistently, showing excellent performance on solving but worse results in other units, revealing that they may be specially optimized for the solving task rather than reasoning consistently.

8/10
[8/n] Behavior on Different Complexity Levels: MATHCHECK better demonstrates the reasoning skills and capabilities required when problems become difficult.

9/10
[9/n] Behavior on Different Prompting Technologies: CoT and Plan-and-Solve in the zero-shot setting demonstrate superior performance. In contrast, the Few-shot prompt generally yields poorer results than the Zero-shot prompt.

10/10
Joint collaboration w/ @WeiLiu99, @shudong_liu, @ning_mz, @jd92wang, Derek F. Wong, Xiaowei Huang, Qiufeng Wang, and Kaizhu Huang.


 

bnew



1/2
Large language models (LLMs) are now solving tasks that are commonly believed to require human-level reasoning but they are still nowhere near general intelligence. We developed an LLM self-improvement method called Code Iteration (CodeIt): [2402.04858] CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay #ICML2024 #AI

2/2
Advances in graphics rendering include better resolution, complexity, and novel texture components, but the growth in data volume has not been matched by advances in its compression. We propose a novel method to solve this: [2407.00021] Neural Graphics Texture Compression Supporting Random Access #ECCV2024 #AI



[Submitted on 7 Feb 2024 (v1), last revised 1 Jul 2024 (this version, v2)]

CodeIt - Self-Improving Language Models with Prioritized Hindsight Replay​

Natasha Butt
Abstract:Large language models are increasingly solving tasks that are commonly believed to require human-level reasoning ability. However, these models still perform very poorly on benchmarks of general intelligence such as the Abstraction and Reasoning Corpus (ARC). In this paper, we approach ARC as a programming-by-examples problem, and introduce a novel and scalable method for language model self-improvement called Code Iteration (CodeIt). Our method iterates between 1) program sampling and hindsight relabeling, and 2) learning from prioritized experience replay. By relabeling the goal of an episode (i.e., the target program output given input) to the realized output produced by the sampled program, our method effectively deals with the extreme sparsity of rewards in program synthesis. Applying CodeIt to the ARC dataset, we demonstrate that prioritized hindsight replay, along with pre-training and data-augmentation, leads to successful inter-task generalization. CodeIt is the first neuro-symbolic approach that scales to the full ARC evaluation dataset. Our method solves 15% of ARC evaluation tasks, achieving state-of-the-art performance and outperforming existing neural and symbolic baselines. Our code is available at this https URL .
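
To make the hindsight-relabeling idea concrete, here is a simplified sketch of one CodeIt-style iteration; the helper names and the priority rule are illustrative, not the paper's exact scheme:

```python
def codeit_iteration(tasks, sample_program, execute, replay_buffer):
    """One sketched CodeIt-style iteration. For each task we sample a program, run it,
    and hindsight-relabel the episode with the outputs the program actually produced,
    so even 'failed' programs become valid (input, goal, program) training examples.
    Helper names and the priority rule are illustrative, not the paper's exact scheme."""
    for task in tasks:                              # task: {"inputs": [...], "target_outputs": [...]}
        program = sample_program(task)              # 1) program sampling from the current policy
        realized = [execute(program, x) for x in task["inputs"]]
        solved = realized == task["target_outputs"]
        replay_buffer.add(
            {"inputs": task["inputs"], "goal": realized, "program": program},  # 2) hindsight relabeling
            priority=2.0 if solved else 1.0,        # simple prioritization choice for the sketch
        )
    # 3) learning step: fine-tune the model on batches drawn from replay_buffer (not shown)
```
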
Comments:ICML'24 camera-ready version
Subjects:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2402.04858
arXiv:2402.04858v2
[2402.04858] CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay


https://arxiv.org/pdf/2402.04858
 

bnew


1/7
Thrilled to share our latest paper “Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models”

LLMs struggle at higher depths of logical reasoning

Check out paper @ [2406.17169] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

#NLProc #logic #reasoning

Read ↓ (1/6)

2/7
Proposed Multi-LogiEval, a systematically created QA dataset covering multi-step logical reasoning across three logic types: Propositional Logic (PL), First-Order Logic (FOL), and Non-Monotonic (NM) reasoning.

Access Multi-LogiEval @GitHub - Mihir3009/Multi-LogiEval: A comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths

Read ↓ (2/6)

3/7
Our dataset provides ~1.6k high-quality instances that cover 33 inference rules and reasoning patterns, and more than 60 complex combinations of these inference rules with different numbers of reasoning steps (1-5).

Read ↓ (3/6)

4/7
Our data creation process consists of two major stages: (i) Generation of rule combination and (ii) Generation of data instances.

Read ↓ (4/6)

5/7
Evaluating LLMs on Multi-LogiEval leads to interesting findings:

- Longer chains don't guarantee better reasoning
- Larger open-source models perform worse than smaller ones
- LLMs struggle with context-based conclusions without explicit knowledge
- Many more..

Read ↓ (5/6)

6/7
Want to evaluate your LLM? Check out the Multi-LogiEval dataset @
GitHub - Mihir3009/Multi-LogiEval: A comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths

Take on its challenges and be part of advancing the capabilities of LLMs in logical reasoning!

Thanks to
@Nisarg14P, @Mohith, @nrj_varshney, @mutsumi32141651, and @cbaral


Read (6/6)

7/7
Super excited to share that our paper on high-quality data generation using LLMs is accepted
@COLM_conf

Please check out our work - [2310.17876] TarGEN: Targeted Data Generation with Large Language Models

#NLProc #LLMs



[2406.17169] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models
[Submitted on 24 Jun 2024]

Multi-LogiEval - Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models​

Nisarg Patel, Mohith Kulkarni, Mihir Parmar, Aashna Budhiraja, Mutsumi Nakamura, Neeraj Varshney, Chitta Baral
Abstract:
As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap since it aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types--propositional, first-order, and non-monotonic--consisting of more than 30 inference rules and more than 60 of their combinations with various depths. Leveraging this dataset, we conduct evaluations on a range of LLMs including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral, employing a zero-shot chain-of-thought. Experimental results show that there is a significant drop in the performance of LLMs as the reasoning steps/depth increases (average accuracy of ~68% at depth-1 to ~43% at depth-5). We further conduct a thorough investigation of reasoning chains generated by LLMs which reveals several important findings. We believe that Multi-LogiEval facilitates future research for evaluating and enhancing the logical reasoning ability of LLMs. Data is available at this https URL.

Comments:23 Pages
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:arXiv:2406.17169
arXiv:2406.17169v1
[2406.17169] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models


https://arxiv.org/pdf/2406.17169
 

bnew



1/2
Do LLM-generated explanations give the true reasons for LLM decisions, or are they rationalisations?

Our #ACL2024NLP paper introduces Correlational Explanatory Faithfulness, which can measure both post-hoc and CoT explanations and can’t be trivially gamed: [2404.03189] The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models

2/2
The ability to measure faithfulness in explanations could help us understand how models work, detect and correct biased reasoning, and oversee models at scale.

Thanks to my coauthors @oanacamb , Nicolas Heess, and @MPerezOrtiz_, at @GoogleDeepMind and @ucl_nlp!




[2404.03189] The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models
[Submitted on 4 Apr 2024 (v1), last revised 7 Jun 2024 (this version, v2)]

The Probabilities Also Matter - A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models​

Noah Y. Siegel, Oana-Maria Camburu, Nicolas Heess, Maria Perez-Ortiz
Abstract:
In order to oversee advanced AI systems, it is important to understand their underlying decision-making process. When prompted, large language models (LLMs) can provide natural language explanations or reasoning traces that sound plausible and receive high ratings from human annotators. However, it is unclear to what extent these explanations are faithful, i.e., truly capture the factors responsible for the model's predictions. In this work, we introduce Correlational Explanatory Faithfulness (CEF), a metric that can be used in faithfulness tests based on input interventions. Previous metrics used in such tests take into account only binary changes in the predictions. Our metric accounts for the total shift in the model's predicted label distribution, more accurately reflecting the explanations' faithfulness. We then introduce the Correlational Counterfactual Test (CCT) by instantiating CEF on the Counterfactual Test (CT) from Atanasova et al. (2023). We evaluate the faithfulness of free-text explanations generated by few-shot-prompted LLMs from the Llama2 family on three NLP tasks. We find that our metric measures aspects of faithfulness which the CT misses.
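
A rough sketch of the intuition behind CEF, under the assumption that the intervention-induced shift is measured as a total-variation distance and correlated with whether the explanation mentions the intervention; the paper's exact estimator may differ:

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two predicted label distributions."""
    return 0.5 * float(np.abs(np.asarray(p, float) - np.asarray(q, float)).sum())

def correlational_faithfulness(shifts, mentioned):
    """shifts[i]:   TV distance between the label distributions before and after
                    inserting an intervention into example i.
    mentioned[i]:   1 if the generated explanation refers to that intervention, else 0.
    A faithful explainer mentions interventions roughly when (and only when) they
    actually move the prediction, giving a positive correlation."""
    shifts = np.asarray(shifts, float)
    mentioned = np.asarray(mentioned, float)
    return float(np.corrcoef(shifts, mentioned)[0, 1])
```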

Comments:To be published in ACL 2024. 19 pages, 2 figures
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:arXiv:2404.03189
arXiv:2404.03189v2
[2404.03189] The Probabilities Also Matter: A More Faithful Metric for Faithfulness of Free-Text Explanations in Large Language Models


https://arxiv.org/pdf/2404.03189
 

bnew



1/2
System 2 prompts: Four techniques to enhance LLM reasoning

The "Distilling System 2 into System 1" paper (https://arxiv.org/html/2407.06023v1…) looks like Llama 3 Instruct's dataset recipe (note how they only mention Llama 2).

The authors build a dataset of complex reasoning tasks with prompts that push LLMs into system 2 mode, greatly improving answer quality.

However, other techniques remain less common. That's why I've extracted four system 2 prompt techniques from the paper.

These can be valuable for creating instruction datasets or in regular applications as part of prompt engineering.

Interested in LLM datasets? Check out my GitHub repo on LLM Datasets for more insights: GitHub - mlabonne/llm-datasets: High-quality datasets, tools, and concepts for LLM fine-tuning.
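
Roughly, the distillation recipe can be sketched as follows; `system2_answer` is a hypothetical helper that runs one of the System 2 prompting techniques end to end, and the consistency threshold is an assumption of this sketch:

```python
from collections import Counter

def distill_system2(prompts, system2_answer, n_samples=5, min_agreement=0.6):
    """Sample several System 2 answers per prompt (e.g., via Rephrase-and-Respond or
    Branch-Solve-Merge), keep the majority answer when agreement is high enough, and
    emit (prompt, answer) pairs for plain System 1 fine-tuning -- no intermediate
    reasoning tokens are kept."""
    distilled = []
    for prompt in prompts:
        answers = [system2_answer(prompt) for _ in range(n_samples)]
        answer, count = Counter(answers).most_common(1)[0]
        if count / n_samples >= min_agreement:      # unsupervised consistency filter
            distilled.append({"prompt": prompt, "completion": answer})
    return distilled
```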

2/2
Is there a better name you'd recommend? I don't feel strongly about it at all :smile:



[2407.06023v1] Distilling System 2 into System 1
[Submitted on 8 Jul 2024 (this version), latest version 9 Jul 2024 (v2)]

Distilling System 2 into System 1​

Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov
Abstract:
Large language models (LLMs) can spend extra compute during inference to generate intermediate thoughts, which helps to produce better final responses. Since Chain-of-Thought (Wei et al., 2022), many such System 2 techniques have been proposed such as Rephrase and Respond (Deng et al., 2023a), System 2 Attention (Weston and Sukhbaatar, 2023) and Branch-Solve-Merge (Saha et al., 2023). In this work we investigate self-supervised methods to ``compile'' (distill) higher quality outputs from System 2 techniques back into LLM generations without intermediate reasoning token sequences, as this reasoning has been distilled into System 1. We show that several such techniques can be successfully distilled, resulting in improved results compared to the original System 1 performance, and with less inference cost than System 2. We posit that such System 2 distillation will be an important feature of future continually learning AI systems, enabling them to focus System 2 capabilities on the reasoning tasks that they cannot yet do well.

Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:arXiv:2407.06023
arXiv:2407.06023v1
[2407.06023] Distilling System 2 into System 1


https://arxiv.org/pdf/2407.06023v1
 

bnew




1/3
Detecting hallucinations in large language models using semantic entropy
"Large language model (LLM) systems, such as ChatGPT1 or Gemini2, can show impressive reasoning and question-answering capabilities but often ‘hallucinate’ false outputs and unsubstantiated answers3,4. Answering unreliably or without the necessary information prevents adoption in diverse fields, with problems including fabrication of legal precedents5 or untrue facts in news articles6 and even posing a risk to human life in medical domains such as radiology7. Encouraging truthfulness through supervision or reinforcement has been only partially successful8. Researchers need a general method for detecting hallucinations in LLMs that works even with new and unseen questions to which humans might not know the answer. Here we develop new methods grounded in statistics, proposing entropy-based uncertainty estimators for LLMs to detect a subset of hallucinations—confabulations—which are arbitrary and incorrect generations. Our method addresses the fact that one idea can be expressed in many ways by computing uncertainty at the level of meaning rather than specific sequences of words. Our method works across datasets and tasks without a priori knowledge of the task, requires no task-specific data and robustly generalizes to new tasks not seen before. By detecting when a prompt is likely to produce a confabulation, our method helps users understand when they must take extra care with LLMs and opens up new possibilities for using LLMs that are otherwise prevented by their unreliability."

2/3
Detecting hallucinations in large language models using semantic entropy

3/3
You want to play the longest possible game - to keep entropy at bay.


 


bnew


[2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
[Submitted on 23 May 2024 (v1), last revised 27 May 2024 (this version, v2)]

Grokked Transformers are Implicit Reasoners - A Mechanistic Journey to the Edge of Generalization​

Boshi Wang, Xiang Yue, Yu Su, Huan Sun
Abstract:
We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.


https://arxiv.org/pdf/2405.15071


A.I Generated explanation:


Title: Grokked Transformers are Implicit Reasoners - A Mechanistic Journey to the Edge of Generalization

This is a research paper about artificial intelligence (AI) and how it can be improved to make it smarter.

Authors: Boshi Wang, Xiang Yue, Yu Su, and Huan Sun

These are the people who wrote the paper.

Abstract:

The paper is about whether a type of AI called transformers can learn to "reason" like humans do. Reasoning means making connections between different pieces of information and using them to make decisions or solve problems. The researchers found that transformers can learn to reason, but only if they are trained for a very long time. They also found that the way the transformers reason is different depending on the type of problem they are trying to solve.

What they did:

The researchers trained transformers to solve two types of problems: composition and comparison. They found that the transformers could learn to solve these problems, but only if they were trained for a very long time. They also looked at how the transformers were solving the problems and found that they were using different "circuits" in their "brain" to do so.
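
A toy sketch of what such a composition dataset can look like (illustrative only; the paper's exact data and training setup differ):

```python
import random

def make_composition_data(n_entities=50, seed=0):
    """Toy 'composition' setting: atomic facts (h, r1, b) and (b, r2, t) imply the
    two-hop fact (h, r1*r2, t). Train on all atomic facts plus half of the inferred
    facts; the held-out inferred facts test implicit (compositional) reasoning."""
    rng = random.Random(seed)
    r1 = {e: rng.randrange(n_entities) for e in range(n_entities)}   # relation 1 as a lookup
    r2 = {e: rng.randrange(n_entities) for e in range(n_entities)}   # relation 2 as a lookup
    atomic = [("r1", h, b) for h, b in r1.items()] + [("r2", b, t) for b, t in r2.items()]
    inferred = [("r1*r2", h, r2[r1[h]]) for h in range(n_entities)]
    rng.shuffle(inferred)
    train = atomic + inferred[: n_entities // 2]
    test = inferred[n_entities // 2 :]
    return train, test

# Grokking is then just ordinary training on `train`, continued far past the point
# where training accuracy saturates (typically with weight decay), while checking
# accuracy on `test` for the delayed jump in generalization.
```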

What they found:

The researchers found that the transformers were able to solve the problems, but they didn't always generalize well to new situations. This means that they could solve the problems they were trained on, but they didn't always understand the underlying principles well enough to apply them to new situations. They also found that the way the transformers were trained affected how well they could reason.

What it means:

This research is important because it shows that transformers can be trained to reason like humans do, but it also shows that there are still limitations to how well they can generalize to new situations. The researchers suggest that changing the way transformers are trained and adding new components to their architecture could help them reason better.

Links:

* [2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization - This is the link to the paper on the arXiv website.
* GitHub - OSU-NLP-Group/GrokkedTransformer: Code for the paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' - This is the link to the GitHub repository for the project.
* [2405.15071v2] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization - This is the link to the updated version of the paper on the arXiv website.
* [2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization - This is the DOI (digital object identifier) link for the paper, which resolves to the same arXiv record.
 

bnew


Exclusive: OpenAI working on new reasoning technology under code name ‘Strawberry’​

By Anna Tong and Katie Paul

July 12, 2024, 5:23 PM EDT (updated 19 hours ago)

OpenAI logo is seen in this illustration taken May 20, 2024. REUTERS/Dado Ruvic/Illustration/File Photo

July 12 - ChatGPT maker OpenAI is working on a novel approach to its artificial intelligence models in a project code-named “Strawberry,” according to a person familiar with the matter and internal documentation reviewed by Reuters.

The project, details of which have not been previously reported, comes as the Microsoft-backed startup races to show that the types of models it offers are capable of delivering advanced reasoning capabilities.

Teams inside OpenAI are working on Strawberry, according to a copy of a recent internal OpenAI document seen by Reuters in May. Reuters could not ascertain the precise date of the document, which details a plan for how OpenAI intends to use Strawberry to perform research. The source described the plan to Reuters as a work in progress. The news agency could not establish how close Strawberry is to being publicly available.

How Strawberry works is a tightly kept secret even within OpenAI, the person said.

The document describes a project that uses Strawberry models with the aim of enabling the company’s AI to not just generate answers to queries but to plan ahead enough to navigate the internet autonomously and reliably to perform what OpenAI terms “deep research,” according to the source.

This is something that has eluded AI models to date, according to interviews with more than a dozen AI researchers.

Asked about Strawberry and the details reported in this story, an OpenAI company spokesperson said in a statement: “We want our AI models to see and understand the world more like we do. Continuous research into new AI capabilities is a common practice in the industry, with a shared belief that these systems will improve in reasoning over time.”

The spokesperson did not directly address questions about Strawberry.

The Strawberry project was formerly known as Q*, which Reuters reported last year was already seen inside the company as a breakthrough.

Two sources described viewing earlier this year what OpenAI staffers told them were Q* demos, capable of answering tricky science and math questions out of reach of today’s commercially-available models.

On Tuesday at an internal all-hands meeting, OpenAI showed a demo of a research project that it claimed had new human-like reasoning skills, according to Bloomberg. An OpenAI spokesperson confirmed the meeting but declined to give details of the contents. Reuters could not determine if the project demonstrated was Strawberry.

OpenAI hopes the innovation will improve its AI models’ reasoning capabilities dramatically, the person familiar with it said, adding that Strawberry involves a specialized way of processing an AI model after it has been pre-trained on very large datasets.

Researchers Reuters interviewed say that reasoning is key to AI achieving human or super-human-level intelligence.

While large language models can already summarize dense texts and compose elegant prose far more quickly than any human, the technology often falls short on common sense problems whose solutions seem intuitive to people, like recognizing logical fallacies and playing tic-tac-toe. When the model encounters these kinds of problems, it often “hallucinates” bogus information.

AI researchers interviewed by Reuters generally agree that reasoning, in the context of AI, involves the formation of a model that enables AI to plan ahead, reflect how the physical world functions, and work through challenging multi-step problems reliably.

Improving reasoning in AI models is seen as the key to unlocking the ability for the models to do everything from making major scientific discoveries to planning and building new software applications.

OpenAI CEO Sam Altman said earlier this year that in AI “the most important areas of progress will be around reasoning ability.”

Other companies like Google, Meta and Microsoft are likewise experimenting with different techniques to improve reasoning in AI models, as are most academic labs that perform AI research. Researchers differ, however, on whether large language models (LLMs) are capable of incorporating ideas and long-term planning into how they do prediction. For instance, one of the pioneers of modern AI, Yann LeCun, who works at Meta, has frequently said that LLMs are not capable of humanlike reasoning.


AI CHALLENGES​

Strawberry is a key component of OpenAI’s plan to overcome those challenges, the source familiar with the matter said. The document seen by Reuters described what Strawberry aims to enable, but not how.

In recent months, the company has privately been signaling to developers and other outside parties that it is on the cusp of releasing technology with significantly more advanced reasoning capabilities, according to four people who have heard the company’s pitches. They declined to be identified because they are not authorized to speak about private matters.

Strawberry includes a specialized way of what is known as “post-training” OpenAI’s generative AI models, or adapting the base models to hone their performance in specific ways after they have already been “trained” on reams of generalized data, one of the sources said.

The post-training phase of developing a model involves methods like “fine-tuning,” a process used on nearly all language models today that comes in many flavors, such as having humans give feedback to the model based on its responses and feeding it examples of good and bad answers.

Strawberry has similarities to a method developed at Stanford in 2022 called "Self-Taught Reasoner” or “STaR”, one of the sources with knowledge of the matter said. STaR enables AI models to “bootstrap” themselves into higher intelligence levels via iteratively creating their own training data, and in theory could be used to get language models to transcend human-level intelligence, one of its creators, Stanford professor Noah Goodman, told Reuters.
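
For reference, the STaR loop Goodman describes can be sketched roughly like this, with hypothetical helpers for rationale generation, answer checking, and fine-tuning:

```python
def star_round(model, problems, generate_rationale, final_answer, finetune):
    """One STaR bootstrap round: generate a rationale for each problem, keep it if it
    reaches the correct answer (retrying once with the answer given as a hint --
    'rationalization'), then fine-tune on the kept rationales so the next round
    starts from a stronger model. All helpers are hypothetical."""
    kept = []
    for prob in problems:                          # prob: {"question": str, "answer": str}
        rationale = generate_rationale(model, prob["question"])
        if final_answer(rationale) != prob["answer"]:
            rationale = generate_rationale(model, prob["question"], hint=prob["answer"])
            if final_answer(rationale) != prob["answer"]:
                continue
        kept.append((prob["question"], rationale))
    return finetune(model, kept)
```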

“I think that is both exciting and terrifying…if things keep going in that direction we have some serious things to think about as humans,” Goodman said. Goodman is not affiliated with OpenAI and is not familiar with Strawberry.

Among the capabilities OpenAI is aiming Strawberry at is performing long-horizon tasks (LHT), the document says, referring to complex tasks that require a model to plan ahead and perform a series of actions over an extended period of time, the first source explained.

To do so, OpenAI is creating, training and evaluating the models on what the company calls a “deep-research” dataset, according to the OpenAI internal documentation. Reuters was unable to determine what is in that dataset or how long an extended period would mean.

OpenAI specifically wants its models to use these capabilities to conduct research by browsing the web autonomously with the assistance of a “CUA,” or a computer-using agent, that can take actions based on its findings, according to the document and one of the sources. OpenAI also plans to test its capabilities on doing the work of software and machine learning engineers.

Reporting by Anna Tong in San Francisco and Katie Paul in New York; editing by Ken Li and Claudia Parsons