bnew
[Submitted on 10 Jul 2024]

Rectifier - Code Translation with Corrector via LLMs​

Xin Yin
Abstract:Software migration is garnering increasing attention with the evolution of software and society. Early studies mainly relied on handcrafted translation rules to translate between two languages; the translation process is error-prone and time-consuming. In recent years, researchers have begun to explore the use of pre-trained large language models (LLMs) in code translation. However, code translation is a complex task, and LLMs make mistakes during translation; they all produce certain types of errors when performing code translation tasks, including (1) compilation errors, (2) runtime errors, (3) functional errors, and (4) non-terminating execution. We found that the root causes of these errors are very similar (e.g., failure to import packages, errors in loop boundaries, operator errors, and more). In this paper, we propose a general corrector, namely Rectifier, which is a micro and universal model for repairing translation errors. It learns from errors generated by existing LLMs and can be widely applied to correct errors generated by any LLM. The experimental results on translation tasks between C++, Java, and Python show that our model has effective repair ability, and cross experiments also demonstrate the robustness of our method.
Comments: arXiv:2308.03109
Subjects:Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2407.07472
arXiv:2407.07472v1

Submission history​


https://arxiv.org/pdf/2407.07472


A.I. Generated explanation:


Title: Rectifier - Code Translation with Corrector via LLMs

This is a research paper about a new tool called Rectifier, which helps fix errors in code translation.

Author: Xin Yin

The author of this paper is Xin Yin, who can be found on the arXiv website.

Abstract:

The abstract is a summary of the paper. It says that:

* Software migration (moving software from one language to another) is becoming more important.
* Early methods of translation were manual and prone to errors.
* Recently, researchers have started using large language models (LLMs) to translate code, but these models can also make mistakes.
* The mistakes made by LLMs can be categorized into four types: compilation errors, runtime errors, functional errors, and non-terminating execution.
* The authors propose a new tool called Rectifier, which can fix these errors.
* Rectifier is a universal model that can be used to correct errors made by any LLM.
* The authors tested Rectifier on translation tasks between C++, Java, and Python, and found that it was effective in fixing errors.

Comments and Subjects:

* The paper has been submitted to the arXiv website, which is a repository of electronic preprints in physics, mathematics, computer science, and related disciplines.
* The subjects of the paper are Software Engineering and Artificial Intelligence.

Cite as and Submission history:

* The paper can be cited using the arXiv identifier 2407.07472.
* The submission history shows that the paper was submitted on July 10, 2024, and can be viewed in PDF format on the arXiv website.
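
To make the idea concrete, here is a minimal sketch of the kind of translate-then-correct loop the paper targets. The `llm_translate` and `corrector` functions are hypothetical stand-ins (the paper's actual corrector is a trained model), and the loop-boundary bug is just an illustrative example of the error categories listed above.

```python
# Sketch only: a translate -> test -> correct loop in the spirit of Rectifier.
import subprocess
import sys
import tempfile
import textwrap

def llm_translate(cpp_source: str) -> str:
    """Placeholder: pretend an LLM translated C++ to Python (with a bug)."""
    return textwrap.dedent("""
        def sum_upto(n):
            total = 0
            for i in range(1, n):   # loop-boundary error: should go up to n
                total += i
            return total
    """)

def corrector(python_source: str, error_report: str) -> str:
    """Placeholder for the corrector model: repair code given the error report."""
    return python_source.replace("range(1, n)", "range(1, n + 1)")

def run_tests(python_source: str) -> str:
    """Return an error report (compilation/runtime/functional), or '' if it passes."""
    test = python_source + "\nassert sum_upto(5) == 15, 'functional error'\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(test)
        path = f.name
    proc = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    return proc.stderr

candidate = llm_translate("int sum_upto(int n) { ... }")
error = run_tests(candidate)
if error:                                      # feed failing code + error back in
    candidate = corrector(candidate, error)    # one repair round with the corrector
    assert not run_tests(candidate)
print(candidate)
```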
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,641
Reputation
7,896
Daps
148,364

[Submitted on 10 Jul 2024]

Development of an automatic modification system for generated programs using ChatGPT​

Jun Yoshida
Abstract:In recent years, the field of artificial intelligence has been rapidly developing. Among them, OpenAI's ChatGPT excels at natural language processing tasks and can also generate source code. However, the generated code often has problems with consistency and program rules. Therefore, in this research, we developed a system that tests the code generated by ChatGPT, automatically corrects it if it is inappropriate, and presents the appropriate code to the user. This study aims to address the challenge of reducing the manual effort required for the human feedback and modification process for generated code. When we ran the system, we were able to automatically modify the code as intended.

Submission history​


https://arxiv.org/pdf/2407.07469


A.I. Generated explanation:

Title: Development of an Automatic Modification System for Generated Programs using ChatGPT

This is a research paper about creating a system that can automatically fix mistakes in computer code generated by a language model called ChatGPT.

Author: Jun Yoshida

The person who wrote this paper is Jun Yoshida.

Abstract:

The abstract is a short summary of the paper. It says that ChatGPT is really good at understanding human language and can even generate computer code. However, the code it generates often has mistakes. So, the researchers created a system that can test the code, fix the mistakes, and give the user the corrected code. This system aims to reduce the amount of time and effort humans need to spend fixing the code.
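
As a rough sketch of the generate-test-regenerate loop the abstract describes (assuming the system works roughly this way; `ask_chatgpt` is a stubbed placeholder for the real API call so the example runs offline):

```python
# Sketch of a bounded generate -> test -> fix-and-retry loop for generated code.
MAX_ROUNDS = 3

def ask_chatgpt(prompt: str) -> str:
    """Stub generator: returns buggy code first, fixed code after seeing feedback."""
    if "AssertionError" in prompt:
        return "def is_even(n):\n    return n % 2 == 0\n"
    return "def is_even(n):\n    return n % 2 == 1\n"   # deliberately wrong

def test_candidate(code: str) -> str:
    """Run acceptance tests; return '' on success, an error description otherwise."""
    namespace = {}
    try:
        exec(code, namespace)                  # load the candidate function
        assert namespace["is_even"](4) is True
        assert namespace["is_even"](3) is False
    except Exception as exc:                   # report any failure back to the generator
        return f"{type(exc).__name__}: {exc}"
    return ""

prompt = "Write a Python function is_even(n) that returns True for even n."
for round_no in range(MAX_ROUNDS):
    code = ask_chatgpt(prompt)
    error = test_candidate(code)
    if not error:
        print(f"accepted after {round_no + 1} round(s):\n{code}")
        break
    # append the failure so the next generation round can correct it
    prompt += f"\nThe previous attempt failed: {error}\nPlease fix it."
```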

Subjects:

This paper is about software engineering, which is the process of designing, developing, and testing software.

Cite as:

If you want to reference this paper in your own work, you can cite it by its arXiv identifier, arXiv:2407.07469.

Submission history:

This section records when the paper was submitted; the full text is available as a PDF at the arXiv link above.

In summary, this paper is about creating a system that can automatically fix mistakes in computer code generated by ChatGPT, with the goal of reducing the amount of time and effort humans need to spend fixing the code.
 

bnew

[Submitted on 9 Jul 2024]

Prompting Techniques for Secure Code Generation - A Systematic Investigation​

Catherine Tony
Abstract:Large Language Models (LLMs) are gaining momentum in software development with prompt-driven programming enabling developers to create code from natural language (NL) instructions. However, studies have questioned their ability to produce secure code and, thereby, the quality of prompt-generated software. Alongside, various prompting techniques that carefully tailor prompts have emerged to elicit optimal responses from LLMs. Still, the interplay between such prompting strategies and secure code generation remains under-explored and calls for further investigations. OBJECTIVE: In this study, we investigate the impact of different prompting techniques on the security of code generated from NL instructions by LLMs. METHOD: First, we perform a systematic literature review to identify the existing prompting techniques that can be used for code generation tasks. A subset of these techniques is evaluated on GPT-3, GPT-3.5, and GPT-4 models for secure code generation. For this, we used an existing dataset consisting of 150 NL security-relevant code-generation prompts. RESULTS: Our work (i) classifies potential prompting techniques for code generation, (ii) adapts and evaluates a subset of the identified techniques for secure code generation tasks, and (iii) observes a reduction in security weaknesses across the tested LLMs, especially after using an existing technique called Recursive Criticism and Improvement (RCI), contributing valuable insights to the ongoing discourse on LLM-generated code security.
Comments:This work was partially supported by the EU-funded project Sec4AI4Sec: Cybersecurity for AI-Augmented Systems (grant no. 101120393)
Subjects:Software Engineering (cs.SE); Artificial Intelligence (cs.AI); Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as: arXiv:2407.07064
arXiv:2407.07064v1
[2407.07064] Prompting Techniques for Secure Code Generation: A Systematic Investigation

Submission history​


https://arxiv.org/pdf/2407.07064
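
Since RCI is the technique this study found most effective, here is a minimal sketch of an RCI-style prompting loop, assuming the usual generate/critique/improve formulation; `chat` is a hypothetical placeholder for a GPT-3.5/GPT-4 call, not code from the paper:

```python
# Sketch of Recursive Criticism and Improvement (RCI) prompting for secure code.
def chat(prompt: str) -> str:
    """Placeholder LLM call; swap in a real client to use this for real."""
    return "<model response to: " + prompt[:60] + "...>"

def rci_generate(task: str, rounds: int = 2) -> str:
    """Generate code, then repeatedly ask the model to critique and improve it."""
    code = chat(f"Write secure code for the following task:\n{task}")
    for _ in range(rounds):
        critique = chat(
            "Review the following code and list any security weaknesses "
            f"(e.g. CWE categories) you can find:\n{code}"
        )
        code = chat(
            "Improve the code below so that it fixes every weakness in the critique, "
            f"without changing its functionality.\n\nCode:\n{code}\n\nCritique:\n{critique}"
        )
    return code

print(rci_generate("Store a user password for later verification."))
```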


[Submitted on 8 Jul 2024]

CodeCSE - A Simple Multilingual Model for Code and Comment Sentence Embeddings​

Anthony Varkey
Abstract:Pretrained language models for code token embeddings are used in code search, code clone detection, and other code-related tasks. Similarly, code function embeddings are useful in such tasks. However, there are no out-of-the-box models for function embeddings in the current literature. So, this paper proposes CodeCSE, a contrastive learning model that learns embeddings for functions and their descriptions in one space. We evaluated CodeCSE using code search. CodeCSE's multi-lingual zero-shot approach is as efficient as the models finetuned from GraphCodeBERT for specific languages. CodeCSE is open source at this https URL and the pretrained model is available at the HuggingFace public hub: this https URL

Submission history​


https://arxiv.org/pdf/2407.06360
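
For context, this is roughly what embedding-based code search looks like once you have a model that maps queries and functions into one space. The toy bag-of-words `embed` below is only a stand-in so the snippet runs offline; in practice you would load the released CodeCSE checkpoint instead:

```python
# Sketch of code search by cosine similarity over query and function embeddings.
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: lower-cased token counts (stand-in for a real encoder)."""
    return Counter(re.findall(r"[a-zA-Z_]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

functions = {
    "binary_search": "def binary_search(arr, target): ...",
    "quick_sort": "def quick_sort(arr): ...",
    "read_json": "def read_json(path): ...",
}

query = "find the index of a target value in a sorted array"
query_vec = embed(query)
ranked = sorted(functions, key=lambda name: cosine(query_vec, embed(functions[name])), reverse=True)
print(ranked)  # code search = return the highest-scoring functions first
```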



[Submitted on 9 Jul 2024]

LLM for Mobile - An Initial Roadmap​

Daihang Chen
Abstract:When mobile meets LLMs, mobile app users deserve to have more intelligent usage experiences. For this to happen, we argue that there is a strong need to apply LLMs to the mobile ecosystem. We therefore provide a research roadmap for guiding our fellow researchers to achieve that as a whole. In this roadmap, we sum up six directions that we believe are urgently required for research to enable native intelligence in mobile devices. In each direction, we further summarize the current research progress and the gaps that still need to be filled by our fellow researchers.

Submission history​


https://arxiv.org/pdf/2407.06573

[Submitted on 20 Jun 2024 (v1), last revised 8 Jul 2024 (this version, v2)]

CREF - An LLM-based Conversational Software Repair Framework for Programming Tutors​

Boyang Yang
Abstract:Program repair techniques offer cost-saving benefits for debugging within software development and programming education scenarios. With the proven effectiveness of Large Language Models (LLMs) in code-related tasks, researchers have explored their potential for program repair. However, it is crucial to recognize that existing repair benchmarks may have influenced LLM training data, potentially causing data leakage. To evaluate LLMs' realistic repair capabilities, (1) we introduce an extensive, non-crawled benchmark, referred to as TutorCode, comprising 1,239 C++ defect codes and associated information such as tutor guidance, solution description, failing test cases, and the corrected code. Our work assesses the repair performance of 12 LLMs on TutorCode, measuring repair correctness (TOP-5 and AVG-5) and patch precision (RPSR). (2) We then provide a comprehensive investigation into which types of extra information can help LLMs improve their performance in repairing defects. Among these types, tutor guidance was found to be the most effective information in enhancing LLM repair capabilities. To fully harness LLMs' conversational capabilities and the benefits of augmented information, (3) we introduce a novel conversational semi-automatic repair framework, CREF, which assists human tutors. It demonstrates a remarkable AVG-5 improvement of 17.2%-24.6% compared to the baseline, achieving an impressive AVG-5 of 76.6% when utilizing GPT-4. These results highlight the potential for enhancing LLMs' repair capabilities through interactions with tutors and historical conversations involving incorrect responses. The successful application of CREF in a real-world educational setting demonstrates its effectiveness in reducing tutors' workload and improving students' learning experience, while also showcasing its promise for facilitating other software engineering tasks, such as code review.

Submission history​


https://arxiv.org/pdf/2406.13972


[Submitted on 21 May 2024 (v1), last revised 5 Jul 2024 (this version, v2)]

Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust​

Patrick Diehl
Abstract:This study evaluates the capabilities of ChatGPT versions 3.5 and 4 in generating code across a diverse range of programming languages. Our objective is to assess the effectiveness of these AI models for generating scientific programs. To this end, we asked ChatGPT to generate three distinct codes: a simple numerical integration, a conjugate gradient solver, and a parallel 1D stencil-based heat equation solver. The focus of our analysis was on the compilation, runtime performance, and accuracy of the codes. While both versions of ChatGPT successfully created codes that compiled and ran (with some help), some languages were easier for the AI to use than others (possibly because of the size of the training sets used). Parallel codes -- even the simple example we chose to study here -- were also difficult for the AI to generate correctly.
Comments:9 pages, 3 figures
Subjects:Software Engineering (cs.SE); Artificial Intelligence (cs.AI)
Cite as: arXiv:2405.13101
arXiv:2405.13101v2
[2405.13101] Evaluating AI-generated code for C++, Fortran, Go, Java, Julia, Matlab, Python, R, and Rust

Submission history​


https://arxiv.org/pdf/2405.13101
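
As a sense of scale for the benchmark tasks, the first one (simple numerical integration) is the kind of program shown below, together with the sort of accuracy check the study applies to generated code; the integrand and tolerance here are illustrative, not the paper's exact prompt:

```python
# Sketch of the "simple numerical integration" task plus an accuracy check.
import math

def trapezoid(f, a: float, b: float, n: int = 10_000) -> float:
    """Composite trapezoidal rule on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return h * total

# accuracy check: the integral of sin(x) over [0, pi] is exactly 2
approx = trapezoid(math.sin, 0.0, math.pi)
assert abs(approx - 2.0) < 1e-6, f"accuracy failure: {approx}"
print(f"trapezoid rule gives {approx:.8f} (exact value 2)")
```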
 

bnew

[Submitted on 8 Jul 2024]

InverseCoder - Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct​

Yutong Wu
Abstract:Recent advancements in open-source code large language models (LLMs) have demonstrated remarkable coding abilities by fine-tuning on the data generated from powerful closed-source LLMs such as GPT-3.5 and GPT-4 for instruction tuning. This paper explores how to further improve an instruction-tuned code LLM by generating data from itself rather than querying closed-source LLMs. Our key observation is the misalignment between the translation of formal and informal languages: translating formal language (i.e., code) to informal language (i.e., natural language) is more straightforward than the reverse. Based on this observation, we propose INVERSE-INSTRUCT, which summarizes instructions from code snippets instead of the reverse. Specifically, given an instruction tuning corpus for code and the resulting instruction-tuned code LLM, we ask the code LLM to generate additional high-quality instructions for the original corpus through code summarization and self-evaluation. Then, we fine-tune the base LLM on the combination of the original corpus and the self-generated one, which yields a stronger instruction-tuned LLM. We present a series of code LLMs named InverseCoder, which surpasses the performance of the original code LLMs on a wide range of benchmarks, including Python text-to-code generation, multilingual coding, and data-science code generation.
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Software Engineering (cs.SE)
Cite as: arXiv:2407.05700
arXiv:2407.05700v1
[2407.05700] InverseCoder: Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

Submission history​


https://arxiv.org/pdf/2407.05700


A.I. Generated explanation:


**Title:** InverseCoder - Unleashing the Power of Instruction-Tuned Code LLMs with Inverse-Instruct

**Author:** Yutong Wu

**Summary:** This paper is about making computer programs that can write code better. Right now, these programs are trained by using data from other powerful programs. But the authors of this paper found a way to make these programs even better by using their own abilities to generate more data.

**The Problem:** Translating formal language (like code) into informal language (like natural language) is easier than the other way around. The authors used this idea to create a new way to improve these code-writing programs.

**The Solution:** They created a system called Inverse-Instruct, which takes code snippets and generates instructions for them. Then, they use these instructions to train the program again, making it even better at writing code.

**Results:** They created a series of programs called InverseCoder, which performed better than the original programs on many tests, including generating Python code, multilingual coding, and data science code generation.

**Details:**

* The paper is about computer science, artificial intelligence, and software engineering.
* You can find the paper on the arXiv website, and it has a unique identifier (arXiv:2407.05700).
* There's a history of when the paper was submitted and updated.
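
Here is a minimal sketch of the Inverse-Instruct data loop as the abstract describes it: summarize code into candidate instructions, self-evaluate the pairs, and merge the survivors with the original corpus. The `code_llm` function is a hypothetical stub, and the scoring prompt is illustrative:

```python
# Sketch of an Inverse-Instruct-style self-generated data pipeline.
def code_llm(prompt: str) -> str:
    """Placeholder for the instruction-tuned code LLM."""
    if prompt.startswith("Summarize"):
        return "Write a function that returns the n-th Fibonacci number."
    if prompt.startswith("Rate"):
        return "9"   # self-evaluation score out of 10
    return ""

original_corpus = [
    {"instruction": "Reverse a string in Python.", "code": "def rev(s): return s[::-1]"},
]
code_snippets = [
    "def fib(n):\n    a, b = 0, 1\n    for _ in range(n):\n        a, b = b, a + b\n    return a"
]

self_generated = []
for code in code_snippets:
    # step 1: code summarization -> candidate instruction
    instruction = code_llm(f"Summarize what this code does as a task instruction:\n{code}")
    # step 2: self-evaluation -> keep only high-quality instruction/code pairs
    score = int(code_llm(f"Rate how well this instruction matches the code (1-10):\n{instruction}\n{code}"))
    if score >= 8:
        self_generated.append({"instruction": instruction, "code": code})

training_set = original_corpus + self_generated   # fine-tune the base LLM on this mix
print(len(training_set), "instruction-code pairs")
```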

 

bnew

Reasoning skills of large language models are often overestimated​

New CSAIL research highlights how LLMs excel in familiar scenarios but struggle in novel ones, questioning their true reasoning abilities versus reliance on memorization.

Rachel Gordon | MIT CSAIL

Publication Date:

July 11, 2024

Image caption: A cartoon android recites an answer to a math problem from a textbook in one panel and reasons about that same answer in another. MIT researchers examined how LLMs fare with variations of different tasks, putting their memorization and reasoning skills to the test. The result: Their reasoning abilities are often overestimated. (Image: Alex Shipps/MIT CSAIL)

When it comes to artificial intelligence, appearances can be deceiving. The mystery surrounding the inner workings of large language models (LLMs) stems from their vast size, complex training methods, hard-to-predict behaviors, and elusive interpretability.

MIT's Computer Science and Artificial Intelligence Laboratory (CSAIL) researchers recently peered into the proverbial magnifying glass to examine how LLMs fare with variations of different tasks, revealing intriguing insights into the interplay between memorization and reasoning skills. It turns out that their reasoning abilities are often overestimated.

The study compared “default tasks,” the common tasks a model is trained and tested on, with “counterfactual scenarios,” hypothetical situations deviating from default conditions — which models like GPT-4 and Claude can usually be expected to cope with. The researchers developed some tests outside the models’ comfort zones by tweaking existing tasks instead of creating entirely new ones. They used a variety of datasets and benchmarks specifically tailored to different aspects of the models' capabilities for things like arithmetic, chess, evaluating code, answering logical questions, etc.

When users interact with language models, any arithmetic is usually in base-10, the familiar number base to the models. But observing that they do well on base-10 could give us a false impression of them having strong competency in addition. Logically, if they truly possess good addition skills, you’d expect reliably high performance across all number bases, similar to calculators or computers. Indeed, the research showed that these models are not as robust as many initially think. Their high performance is limited to common task variants and suffers from a consistent and severe performance drop in the unfamiliar counterfactual scenarios, indicating a lack of generalizable addition ability.
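
A small sketch of the kind of counterfactual arithmetic probe described above: the same addition task posed in unfamiliar bases, with the ground truth computed exactly so a model's answer can be checked (the prompt wording is illustrative, not the study's):

```python
# Sketch: generate default (base-10) and counterfactual (base-9, base-8) addition probes.
import random

def to_base(n: int, base: int) -> str:
    digits = ""
    while n:
        digits = "0123456789"[n % base] + digits
        n //= base
    return digits or "0"

def make_probe(base: int) -> tuple:
    a, b = random.randint(10, 99), random.randint(10, 99)
    prompt = (f"You are doing addition in base-{base}. "
              f"What is {to_base(a, base)} + {to_base(b, base)}?")
    return prompt, to_base(a + b, base)   # exact ground truth for scoring a model's answer

random.seed(0)
for base in (10, 9, 8):   # base-10 is the default task, 9 and 8 are counterfactual variants
    prompt, answer = make_probe(base)
    print(prompt, "| expected:", answer)
```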

The pattern held true for many other tasks like musical chord fingering, spatial reasoning, and even chess problems where the starting positions of pieces were slightly altered. While human players are expected to still be able to determine the legality of moves in altered scenarios (given enough time), the models struggled and couldn’t perform better than random guessing, meaning they have limited ability to generalize to unfamiliar situations. And much of their performance on the standard tasks is likely not due to general task abilities, but overfitting to, or directly memorizing from, what they have seen in their training data.

“We’ve uncovered a fascinating aspect of large language models: they excel in familiar scenarios, almost like a well-worn path, but struggle when the terrain gets unfamiliar. This insight is crucial as we strive to enhance these models’ adaptability and broaden their application horizons,” says Zhaofeng Wu, an MIT PhD student in electrical engineering and computer science, CSAIL affiliate, and the lead author on a new paper about the research. “As AI is becoming increasingly ubiquitous in our society, it must reliably handle diverse scenarios, whether familiar or not. We hope these insights will one day inform the design of future LLMs with improved robustness.”

Despite the insights gained, there are, of course, limitations. The study’s focus on specific tasks and settings didn’t capture the full range of challenges the models could potentially encounter in real-world applications, signaling the need for more diverse testing environments. Future work could involve expanding the range of tasks and counterfactual conditions to uncover more potential weaknesses. This could mean looking at more complex and less common scenarios. The team also wants to improve interpretability by creating methods to better comprehend the rationale behind the models’ decision-making processes.

“As language models scale up, understanding their training data becomes increasingly challenging even for open models, let alone proprietary ones,” says Hao Peng, assistant professor at the University of Illinois at Urbana-Champaign. “The community remains puzzled about whether these models genuinely generalize to unseen tasks, or seemingly succeed by memorizing the training data. This paper makes important strides in addressing this question. It constructs a suite of carefully designed counterfactual evaluations, providing fresh insights into the capabilities of state-of-the-art LLMs. It reveals that their ability to solve unseen tasks is perhaps far more limited than anticipated by many. It has the potential to inspire future research towards identifying the failure modes of today’s models and developing better ones.”

Additional authors include Najoung Kim, who is a Boston University assistant professor and Google visiting researcher, and seven CSAIL affiliates: MIT electrical engineering and computer science (EECS) PhD students Linlu Qiu, Alexis Ross, Ekin Akyürek SM ’21, and Boyuan Chen; former postdoc and Apple AI/ML researcher Bailin Wang; and EECS assistant professors Jacob Andreas and Yoon Kim.

The team’s study was supported, in part, by the MIT–IBM Watson AI Lab, the MIT Quest for Intelligence, and the National Science Foundation. The team presented the work at the North American Chapter of the Association for Computational Linguistics (NAACL) last month.
 

bnew

1/13
New paper on RL, synthetic data, LLM math reasoning (MATH / GSM 8k)

TL;DR: RL on wrong responses (yes, "proper" RL, not filtered SFT or STaR / RFT) scales the utility of syn data by **8x**.
Key themes: spurious correlations, stitching, credit assignment.

[2406.14532] RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

2/13
Our goal was to understand + compare approaches of learning from syn. data.

Approach:
1. ask GPT / Gemini to give new problems + solutions (see: [2403.04706] Common 7B Language Models Already Possess Strong Math Capabilities):
2. run SFT on it
3. STaR (reject bad data) or run RL (with good+bad data)

3/13
Predominant approaches often just do 2 (SFT) and then do 3 by running STaR, rejection fine-tuning, etc. It is unclear why 3 is useful or whether 2 would be just enough.

Moreover, it is unclear why RL is even needed or useful.

Reminds us of RL vs BC ([2204.05618] When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning?)

4/13
Takeaway 1: We began by understanding scaling laws of SFT on syn problems + solutions. The rate of reduction of test error is low (much lower than ERM), but at least it goes down (despite potential errors in solution traces)

promising future for scaling syn data for math....

5/13
...but it turns out that if you do STaR on self-gen. solutions, it improves data scaling by 2x vs SFT. Note that this self-gen. data comes from a 7B model, while the SFT data is from GPT-4 / Gemini 1.5 Pro (much more capable).

self-gen data is easier to fit & learn from (RFT vs SFT)

6/13
Finally though, RFT is not good enough. It amplifies the model's dependence on spurious correlations / spurious steps that hurt generalization on test problems, as we can't detect them using reward checking!

7/13
So, RL can actually fix this spurious issue!

Our insight is that you need per-step feedback. (Offline) RL on negative data -- incorrect model-generated solutions -- can give you exactly that.

Per-step feedback = advantages for each step in a solution trace and do adv.-weighted RL

8/13
By running MC rollouts ([2312.08935] Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations), we can identify the advantage of each step and understand which steps are critical ("need to get them right") and which are spurious (incorrect or irrelevant steps to unlearn).

Now feed this data into RL -- we do per-step DPO
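
A toy sketch of the per-step advantage estimation described above: the advantage of a step is the change in estimated success probability after committing to that step, with values estimated by Monte Carlo rollouts. `sample_completion` and `is_correct` are stubs standing in for the policy and the final-answer verifier:

```python
# Sketch: per-step advantages via MC rollouts (Math-Shepherd-style), on a toy solution.
import random

random.seed(0)

def sample_completion(prefix_steps: list) -> str:
    """Stub policy rollout: pretend a spurious step makes success unlikely."""
    doomed = any("spurious" in s for s in prefix_steps)
    return "42" if random.random() < (0.1 if doomed else 0.7) else "wrong"

def is_correct(answer: str) -> bool:
    return answer == "42"

def value(prefix_steps: list, rollouts: int = 200) -> float:
    """Monte Carlo estimate of P(correct final answer | prefix)."""
    return sum(is_correct(sample_completion(prefix_steps)) for _ in range(rollouts)) / rollouts

solution = ["step 1: set up the equation", "step 2: spurious shortcut", "step 3: solve"]
for i in range(len(solution)):
    adv = value(solution[: i + 1]) - value(solution[:i])   # A(step_i) = V(prefix + step) - V(prefix)
    print(f"{solution[i]!r}: advantage ~ {adv:+.2f}")
# steps with strongly negative advantage are the spurious ones to unlearn,
# e.g. by weighting them negatively in advantage-weighted RL / per-step DPO pairs
```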

9/13
Takeaway 2: Turns out that using these advantages in RL (we pose it as pairwise per-step DPO, not standard DPO) works -- and gives us an 8x boost in data scaling compared to SFT only, and a 4x boost with respect to RFT.

And naive offline RL (i.e., standard DPO) doesn't work.

10/13
We also show how RL gives this generalization boost.

- advantages can distinguish good & bad steps, when computed from a decent SFT initialization
- now RL is like group DRO, and that gives better test loss on more "critical" steps
- Better at critical steps → better performance

11/13
On an example from "the pitfalls of next token prediction" ([2403.06963] The pitfalls of next-token prediction), we show this whole thing in action.

If you choose a good SFT init and run advantage-weighted RL with exact advantages, you do well while SFT fails; with an over-trained SFT init, both RL and SFT fail.

12/13
Summary:

- RL is useful for math reasoning (any task with some kind of "compositional" nature).

- Likely better RL methods can push this 8x much further (maybe an order more?)

- RL's better generalization vs SFT (by connecting to DRO) is neat + some theoretical results!

13/13
This was an awesome collab with CMU, led by Amrith (@setlur_amrith), with @saurabh_garg67, Naman Garg, @younggeng, and @gingsmith.

Code + data (coming soon): GitHub - ars22/scaling-LLM-math-synthetic-data: Code and data used in the paper: "Training on Incorrect Synthetic Data via RL Scales LLM Math Reasoning Eight-Fold"

Paper: [2406.14532] RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold

Please let us know if you have feedback!




[Submitted on 20 Jun 2024]

RL on Incorrect Synthetic Data Scales the Efficiency of LLM Math Reasoning by Eight-Fold​

Amrith Setlur
Abstract:Training on model-generated synthetic data is a promising approach for finetuning LLMs, but it remains unclear when it helps or hurts. In this paper, we investigate this question for math reasoning via an empirical study, followed by building a conceptual understanding of our observations. First, we find that while the typical approach of finetuning a model on synthetic correct or positive problem-solution pairs generated by capable models offers modest performance gains, sampling more correct solutions from the finetuned learner itself followed by subsequent fine-tuning on this self-generated data **doubles** the efficiency of the same synthetic problems. At the same time, training on model-generated positives can amplify various spurious correlations, resulting in flat or even inverse scaling trends as the amount of data increases. Surprisingly, we find that several of these issues can be addressed if we also utilize negative responses, i.e., model-generated responses that are deemed incorrect by a final answer verifier. Crucially, these negatives must be constructed such that the training can appropriately recover the utility or advantage of each intermediate step in the negative response. With this per-step scheme, we are able to attain consistent gains over only positive data, attaining performance similar to amplifying the amount of synthetic data by **8x**. We show that training on per-step negatives can help to unlearn spurious correlations in the positive data, and is equivalent to advantage-weighted reinforcement learning (RL), implying that it inherits robustness benefits of RL over imitating positive data alone.

Submission history​


https://arxiv.org/pdf/2406.14532
 

bnew

1/12
New LLM Agents Benchmark!

Introducing MIRAI: A groundbreaking benchmark crafted for evaluating LLM agents in temporal forecasting of international events with tool use and complex reasoning!

Arxiv: [2407.01231] MIRAI: Evaluating LLM Agents for Event Forecasting
Project page: MIRAI: Evaluating LLM Agents for Event Forecasting

1/N

2/12
2/N We released our code, data, and an interactive demo:

GitHub Repo: GitHub - yecchen/MIRAI: Code and Data for "MIRAI: Evaluating LLM Agents for Event Forecasting"

Dataset:

Interactive Demo Notebook: Google Colab

3/12
3/N Data

With 59,161 unique events and 296,630 unique news articles, we curate a test set of 705 forecasting query-answer pairs.

(a) Circular Chart: The relation hierarchy and distribution in MIRAI.
(b-c) Heatmap: Intensity of global events, from conflict to mediation.

4/12
4/N Forecasting Task

Forecasting involves collecting essential historical data and performing temporal reasoning to predict future events.

Example: Forecasting cross-country relations on 2023-11-18 using event and news information up to 2023-11-17.

5/12
5/N APIs & Environment

Our comprehensive APIs empower agents to generate code and access the database.

APIs include data classes and functions for various info types and search conditions.

Agents can call a single function or generate a code block at each step.

6/12
6/N Agent Framework

Think: Agent analyzes and plans the next action using API specs.
Act: Generates Single Function or Code Block to retrieve data.
Execute: Python interpreter runs the code for observations.

These steps are repeated until reaching a final forecast.
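
A minimal sketch of that Think / Act / Execute loop. The agent LLM is stubbed, and the single `get_recent_events` helper is a toy stand-in for MIRAI's much richer database APIs:

```python
# Sketch of a bounded Think -> Act -> Execute agent loop with a toy event API.
import contextlib
import io

def agent_llm(scratchpad: str) -> str:
    """Stub agent: first queries recent events, then commits to a forecast."""
    if "Observation:" not in scratchpad:
        return "ACT:\nprint(get_recent_events('USA', 'CHN'))"
    return "FINAL: predicted relation between USA and CHN on 2023-11-18 is 'consult'"

def get_recent_events(actor1: str, actor2: str) -> list:
    """Toy stand-in for the event-database API the real environment exposes."""
    return [{"date": "2023-11-15", "actor1": actor1, "actor2": actor2, "relation": "consult"}]

scratchpad = "Query: forecast the USA-CHN relation type on 2023-11-18."
for _ in range(5):                                  # bounded number of agent steps
    step = agent_llm(scratchpad)                    # Think: analyze and pick the next action
    if step.startswith("FINAL:"):
        print(step)
        break
    code = step.split("ACT:\n", 1)[1]               # Act: a generated code block
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):        # Execute: run it with the API in scope
        exec(code, {"get_recent_events": get_recent_events})
    scratchpad += "\nObservation: " + buffer.getvalue().strip()
```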

7/12
7/N Forecasting with Different Base LLMs

Code Block benefits stronger LLMs but hurts weaker models.
GPT-4o consistently outperforms other models.
Self-consistency makes a small model stronger.

8/12
8/N Forecasting with Temporal Distance

Our ablation study lets agents predict 1, 7, 30, and 90 days ahead.

Results: as the number of days increases, F1 drops and the KL divergence rises.

The agent's accuracy drops for distant events. Longer horizons require anticipating trend shifts influenced by more factors and complexities.

9/12
9/N Tool-Use Ordering in Forecasting

Tool-Use Transition Graph: Agents start with recent events for key info and end with news for context.

Freq.(correct) - Freq.(incorrect): highlights the need for strategic planning in LLM agents for effective forecasting.

10/12
10/N Check our paper out for more details!

Code error analysis, different event types, variation of API types, and different agent planning strategies!

Join us in advancing the capabilities of LLM agents in forecasting and understanding complex international events!

11/12
11/N

Sincere thanks to all amazing collaborators and advisors @acbuller , @Yihe__Deng, @HuangZi71008374, @mingyu_ma, @Zhu_Yanqiao, and @WeiWang1973 for their invaluable advice and efforts!

12/12
Thank you so much, Avi! Looking forward to hearing your thoughts!



[Submitted on 1 Jul 2024]

MIRAI - Evaluating LLM Agents for Event Forecasting​

Chenchen Ye
Abstract:Recent advancements in Large Language Models (LLMs) have empowered LLM agents to autonomously collect world information, over which to conduct reasoning to solve complex problems. Given this capability, increasing interests have been put into employing LLM agents for predicting international events, which can influence decision-making and shape policy development on an international scale. Despite such a growing interest, there is a lack of a rigorous benchmark of LLM agents' forecasting capability and reliability. To address this gap, we introduce MIRAI, a novel benchmark designed to systematically evaluate LLM agents as temporal forecasters in the context of international events. Our benchmark features an agentic environment with tools for accessing an extensive database of historical, structured events and textual news articles. We refine the GDELT event database with careful cleaning and parsing to curate a series of relational prediction tasks with varying forecasting horizons, assessing LLM agents' abilities from short-term to long-term forecasting. We further implement APIs to enable LLM agents to utilize different tools via a code-based interface. In summary, MIRAI comprehensively evaluates the agents' capabilities in three dimensions: 1) autonomously source and integrate critical information from large global databases; 2) write codes using domain-specific APIs and libraries for tool-use; and 3) jointly reason over historical knowledge from diverse formats and time to accurately predict future events. Through comprehensive benchmarking, we aim to establish a reliable framework for assessing the capabilities of LLM agents in forecasting international events, thereby contributing to the development of more accurate and trustworthy models for international relation analysis.
Comments: this https URL
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2407.01231
arXiv:2407.01231v1
[2407.01231] MIRAI: Evaluating LLM Agents for Event Forecasting

Submission history​


https://arxiv.org/pdf/2407.01231
 

bnew

1/10
Is your model really a good math reasoner? If a model understands a problem, it should robustly work across various tasks!

Introducing MathCheck: Evaluating Math Reasoning with Checklist! MathCheck

MathCheck reveals comprehensive reasoning ability of (M)LLM.

2/10
[2/n] We employ LLMs as engines to automatically generate MATHCHECK. We develop MATHCHECK-GSM and MATHCHECK-GEO to assess mathematical textual reasoning and multi-modal reasoning abilities, serving as upgraded versions of benchmarks including GSM8k, GeoQA, UniGeo, and Geometry3K.

3/10
[3/n] On MATHCHECK-GSM, GPT-4o achieves the highest levels in most tasks and question variants. Some models underperform in tasks other than problem solving, suggesting special optimization for the solving task. This phenomenon is also observed across all math-customized models.

4/10
[4/n] On MATHCHECK-GEO, GPT-4o also demonstrates the best performance. Among the open-source MLLMs, all of them exhibited poor reasoning performance. This suggests that the multi-modal reasoning capabilities of open-source MLLMs still have significant room for improvement.

5/10
[5/n] Why MATHCHECK? We use performance on private data and compression efficiency as surrogates to assess the genuine mathematical abilities of models. Examining the correlation between MATHCHECK and surrogates, we find it represents intelligence more linearly.

6/10
[6/n] Behavior of Math Models: Examining the behaviors of the math models implies that training solely on massive solving data is not the right direction to improve math reasoning ability. Instead, training models with high-quality and diverse math data should be considered.

7/10
[7/n] Reasoning Consistency: Most of models show reasoning consistency, achieving similar scores on each unit. Some models perform reasoning inconsistently, showing excellent performance on solving but worse in other units, revealing that they may conduct excessive decoration.

8/10
[8/n] Behavior on Different Complexity Levels: MATHCHECK better demonstrates the reasoning skills and capabilities required when problems become difficult.

9/10
[9/n] Behavior on Different Prompting Technologies: CoT and Plan-and-Solve in the zero-shot setting demonstrate superior performance. In contrast, the Few-shot prompt generally yields poorer results than the Zero-shot prompt.

10/10
Joint collaboration w/ WeiLiu99, shudong_liu, ning_mz, jd92wang, Derek F. Wong, Xiaowei Huang, Qiufeng Wang, Kaizhu Huang.


 

bnew


1/2
Large language models (LLMs) are now solving tasks that are commonly believed to require human-level reasoning but they are still nowhere near general intelligence. We developed an LLM self-improvement method called Code Iteration (CodeIt): [2402.04858] CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay #ICML2024 #AI

2/2
Advances in graphics rendering include better resolution, complexity, and novel texture components, but the growth in data volume has not been matched by advances in its compression. We propose a novel method to solve for this: [2407.00021] Neural Graphics Texture Compression Supporting Random Access #ECCV2024 #AI



[Submitted on 7 Feb 2024 (v1), last revised 1 Jul 2024 (this version, v2)]

CodeIt - Self-Improving Language Models with Prioritized Hindsight Replay​

Natasha Butt
Abstract:Large language models are increasingly solving tasks that are commonly believed to require human-level reasoning ability. However, these models still perform very poorly on benchmarks of general intelligence such as the Abstraction and Reasoning Corpus (ARC). In this paper, we approach ARC as a programming-by-examples problem, and introduce a novel and scalable method for language model self-improvement called Code Iteration (CodeIt). Our method iterates between 1) program sampling and hindsight relabeling, and 2) learning from prioritized experience replay. By relabeling the goal of an episode (i.e., the target program output given input) to the realized output produced by the sampled program, our method effectively deals with the extreme sparsity of rewards in program synthesis. Applying CodeIt to the ARC dataset, we demonstrate that prioritized hindsight replay, along with pre-training and data-augmentation, leads to successful inter-task generalization. CodeIt is the first neuro-symbolic approach that scales to the full ARC evaluation dataset. Our method solves 15% of ARC evaluation tasks, achieving state-of-the-art performance and outperforming existing neural and symbolic baselines. Our code is available at this https URL .
Comments:ICML'24 camera-ready version
Subjects:Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2402.04858
arXiv:2402.04858v2
[2402.04858] CodeIt: Self-Improving Language Models with Prioritized Hindsight Replay

Submission history​


https://arxiv.org/pdf/2402.04858
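
To illustrate the core trick, here is a toy sketch of the hindsight-relabeling step: even when a sampled program misses the target output, the episode is relabeled with the output it actually produced, so it still becomes a valid training example. Program sampling and the fine-tuning step are stubbed, and the priority rule is a simplification:

```python
# Sketch of hindsight relabeling plus a prioritized replay buffer, CodeIt-style.
import heapq

def sample_program() -> str:
    """Stub for policy sampling; a real system would decode a program from the LM."""
    return "lambda grid: [row[::-1] for row in grid]"   # mirrors each row

task_input = [[1, 0], [0, 2]]
task_target = [[0, 1], [2, 0]]        # goal: mirror rows (unknown to the sampler)

replay_buffer = []                    # entries: (-priority, program, input, relabeled goal)

program_src = sample_program()
realized = eval(program_src)(task_input)              # run the sampled program
relabeled_goal = realized                             # hindsight: pretend this WAS the goal
priority = 1.0 if realized == task_target else 0.5    # solved tasks get higher priority
heapq.heappush(replay_buffer, (-priority, program_src, task_input, relabeled_goal))

# learning step: pop the highest-priority (input, goal) -> program pair and fine-tune
# the policy on it (the fine-tuning itself is omitted here)
neg_p, prog, inp, goal = heapq.heappop(replay_buffer)
print(f"train on: given {inp} -> {goal}, target program {prog} (priority {-neg_p})")
```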
 

bnew

1/7
Thrilled to share our latest paper “Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models”

LLMs struggle at higher depths of logical reasoning

Check out paper @ [2406.17169] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

#NLProc #logic #reasoning

Read ↓ (1/6)

2/7
Proposed Multi-LogiEval, a systematically created QA dataset covering multi-step logical reasoning across three logic types: Propositional Logic (PL), First-Order Logic (FOL), and Non-Monotonic (NM).

Access Multi-LogiEval @GitHub - Mihir3009/Multi-LogiEval: A comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths

Read ↓ (2/6)

3/7
Our dataset provides ~1.6k high-quality instances that cover 33 inference rules and reasoning patterns and more than 60 complex combinations of these inference rules with a different number of reasoning steps (1~5).

Read ↓ (3/6)

4/7
Our data creation process consists of two major stages: (i) Generation of rule combination and (ii) Generation of data instances.

Read ↓ (4/6)

5/7
Evaluating LLMs on Multi-LogiEval leads to interesting findings:

- Longer chains don't guarantee better reasoning
- Larger open-source models perform worse than smaller ones
- LLMs struggle with context-based conclusions without explicit knowledge
- Many more..

Read ↓ (5/6)

6/7
Want to evaluate your LLM? Check out the Multi-LogiEval dataset @
GitHub - Mihir3009/Multi-LogiEval: A comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths

Take on its challenges and be part of advancing the capabilities of LLMs in logical reasoning!

Thanks to
@Nisarg14P @Mohith nrj_varshney @mutsumi32141651 @cbaral


Read (6/6)

7/7
Super excited to share that our paper on high-quality data generation using LLMs is accepted
@COLM_conf

Please check out our work - [2310.17876] TarGEN: Targeted Data Generation with Large Language Models

#NLProc #LLMs



[2406.17169] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models
[Submitted on 24 Jun 2024]

Multi-LogiEval - Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models​

Nisarg Patel, Mohith Kulkarni, Mihir Parmar, Aashna Budhiraja, Mutsumi Nakamura, Neeraj Varshney, Chitta Baral
Abstract:
As Large Language Models (LLMs) continue to exhibit remarkable performance in natural language understanding tasks, there is a crucial need to measure their ability for human-like multi-step logical reasoning. Existing logical reasoning evaluation benchmarks often focus primarily on simplistic single-step or multi-step reasoning with a limited set of inference rules. Furthermore, the lack of datasets for evaluating non-monotonic reasoning represents a crucial gap since it aligns more closely with human-like reasoning. To address these limitations, we propose Multi-LogiEval, a comprehensive evaluation dataset encompassing multi-step logical reasoning with various inference rules and depths. Multi-LogiEval covers three logic types--propositional, first-order, and non-monotonic--consisting of more than 30 inference rules and more than 60 of their combinations with various depths. Leveraging this dataset, we conduct evaluations on a range of LLMs including GPT-4, ChatGPT, Gemini-Pro, Yi, Orca, and Mistral, employing a zero-shot chain-of-thought. Experimental results show that there is a significant drop in the performance of LLMs as the reasoning steps/depth increases (average accuracy of ~68% at depth-1 to ~43% at depth-5). We further conduct a thorough investigation of reasoning chains generated by LLMs which reveals several important findings. We believe that Multi-LogiEval facilitates future research for evaluating and enhancing the logical reasoning ability of LLMs. Data is available at this https URL.

Comments:23 Pages
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:arXiv:2406.17169
arXiv:2406.17169v1
[2406.17169] Multi-LogiEval: Towards Evaluating Multi-Step Logical Reasoning Ability of Large Language Models

Submission history​


https://arxiv.org/pdf/2406.17169
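
A toy illustration of what a depth-2 instance of this kind looks like: two propositional rules chained (modus ponens, then modus tollens) so that answering requires two reasoning steps. The templates are illustrative only; the actual dataset is generated with LLMs:

```python
# Sketch: a hand-built depth-2 propositional reasoning instance (MP then MT).
facts = {
    "p": "it is raining",
    "q": "the ground is wet",
    "r": "the picnic happens",
}

# Rule 1 (enables modus ponens):   if p then q
# Rule 2 (enables modus tollens):  if r then not q
context = (
    f"If {facts['p']}, then {facts['q']}. "
    f"If {facts['r']}, then it is not the case that {facts['q']}. "
    f"We know that {facts['p']}."
)
question = f"Can we conclude that {facts['r']}?"
# gold chain: modus ponens gives q; modus tollens on rule 2 with q gives not r
answer = "No"

print("Context: ", context)
print("Question:", question)
print("Answer:  ", answer)
```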
 

bnew


1/2
System 2 prompts: Four techniques to enhance LLM reasoning

The "Distilling System 2 into System 1" paper (https://arxiv.org/html/2407.06023v1…) looks like Llama 3 Instruct's dataset recipe (note how they only mention Llama 2).

The authors build a dataset of complex reasoning tasks with prompts that push LLMs into system 2 mode, greatly improving answer quality.

However, other techniques remain less common. That's why I've extracted four system 2 prompt techniques from the paper.

These can be valuable for creating instruction datasets or in regular applications as part of prompt engineering.

Interested in LLM datasets? Check out my GitHub repo on LLM Datasets for more insights: GitHub - mlabonne/llm-datasets: High-quality datasets, tools, and concepts for LLM fine-tuning.

2/2
Is there a better name you'd recommend? I don't feel strongly about it at all :smile:



[2407.06023v1] Distilling System 2 into System 1
[Submitted on 8 Jul 2024 (this version), latest version 9 Jul 2024 (v2)]

Distilling System 2 into System 1​

Ping Yu, Jing Xu, Jason Weston, Ilia Kulikov
Abstract:
Large language models (LLMs) can spend extra compute during inference to generate intermediate thoughts, which helps to produce better final responses. Since Chain-of-Thought (Wei et al., 2022), many such System 2 techniques have been proposed such as Rephrase and Respond (Deng et al., 2023a), System 2 Attention (Weston and Sukhbaatar, 2023) and Branch-Solve-Merge (Saha et al., 2023). In this work we investigate self-supervised methods to "compile" (distill) higher quality outputs from System 2 techniques back into LLM generations without intermediate reasoning token sequences, as this reasoning has been distilled into System 1. We show that several such techniques can be successfully distilled, resulting in improved results compared to the original System 1 performance, and with less inference cost than System 2. We posit that such System 2 distillation will be an important feature of future continually learning AI systems, enabling them to focus System 2 capabilities on the reasoning tasks that they cannot yet do well.

Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as:arXiv:2407.06023
arXiv:2407.06023v1
[2407.06023] Distilling System 2 into System 1

Submission history​


https://arxiv.org/pdf/2407.06023v1
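
A minimal sketch of the self-supervised distillation recipe in the abstract: sample a System 2 method several times per input, keep examples whose final answers agree (self-consistency), and store only (input, final answer) pairs, with no intermediate reasoning, as System 1 fine-tuning data. `system2` is a hypothetical stub:

```python
# Sketch: build a System-2-distillation dataset with a self-consistency filter.
from collections import Counter
import random

random.seed(0)

def system2(question: str) -> str:
    """Stub for an expensive System 2 pass (e.g. CoT, Rephrase-and-Respond, S2A);
    returns only the final answer extracted from the reasoning trace."""
    return "12" if random.random() < 0.9 else "13"

def distill_example(question: str, samples: int = 8, threshold: float = 0.75):
    answers = Counter(system2(question) for _ in range(samples))
    answer, count = answers.most_common(1)[0]
    if count / samples >= threshold:          # self-consistency filter
        return {"input": question, "target": answer}   # no reasoning tokens kept
    return None                                # inconsistent -> drop the example

dataset = []
example = distill_example("A box has 3 rows of 4 apples. How many apples?")
if example:
    dataset.append(example)
print(dataset)   # fine-tune the base model (System 1) on these pairs
```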
 

bnew

[2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization
[Submitted on 23 May 2024 (v1), last revised 27 May 2024 (this version, v2)]

Grokked Transformers are Implicit Reasoners - A Mechanistic Journey to the Edge of Generalization​

Boshi Wang, Xiang Yue, Yu Su, Huan Sun
Abstract:
We study whether transformers can learn to implicitly reason over parametric knowledge, a skill that even the most capable language models struggle with. Focusing on two representative reasoning types, composition and comparison, we consistently find that transformers can learn implicit reasoning, but only through grokking, i.e., extended training far beyond overfitting. The levels of generalization also vary across reasoning types: when faced with out-of-distribution examples, transformers fail to systematically generalize for composition but succeed for comparison. We delve into the model's internals throughout training, conducting analytical experiments that reveal: 1) the mechanism behind grokking, such as the formation of the generalizing circuit and its relation to the relative efficiency of generalizing and memorizing circuits, and 2) the connection between systematicity and the configuration of the generalizing circuit. Our findings guide data and training setup to better induce implicit reasoning and suggest potential improvements to the transformer architecture, such as encouraging cross-layer knowledge sharing. Furthermore, we demonstrate that for a challenging reasoning task with a large search space, GPT-4-Turbo and Gemini-1.5-Pro based on non-parametric memory fail badly regardless of prompting styles or retrieval augmentation, while a fully grokked transformer can achieve near-perfect accuracy, showcasing the power of parametric memory for complex reasoning.

Submission history​


https://arxiv.org/pdf/2405.15071


A.I. Generated explanation:


Title: Grokked Transformers are Implicit Reasoners - A Mechanistic Journey to the Edge of Generalization

This is a research paper about artificial intelligence (AI) and how it can be improved to make it smarter.

Authors: Boshi Wang, Xiang Yue, Yu Su, and Huan Sun

These are the people who wrote the paper.

Abstract:

The paper is about whether a type of AI called transformers can learn to "reason" like humans do. Reasoning means making connections between different pieces of information and using them to make decisions or solve problems. The researchers found that transformers can learn to reason, but only if they are trained for a very long time. They also found that the way the transformers reason is different depending on the type of problem they are trying to solve.

What they did:

The researchers trained transformers to solve two types of problems: composition and comparison. They found that the transformers could learn to solve these problems, but only if they were trained for a very long time. They also looked at how the transformers were solving the problems and found that they were using different "circuits" in their "brain" to do so.

What they found:

The researchers found that the transformers were able to solve the problems, but they didn't always generalize well to new situations. This means that they could solve the problems they were trained on, but they didn't always understand the underlying principles well enough to apply them to new situations. They also found that the way the transformers were trained affected how well they could reason.

What it means:

This research is important because it shows that transformers can be trained to reason like humans do, but it also shows that there are still limitations to how well they can generalize to new situations. The researchers suggest that changing the way transformers are trained and adding new components to their architecture could help them reason better.

Links:

* [2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization - This is the link to the paper on the arXiv website.
* GitHub - OSU-NLP-Group/GrokkedTransformer: Code for the paper 'Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization' - This is the link to the GitHub repository for the project.
* [2405.15071v2] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization - This is the link to the updated version of the paper on the arXiv website.
* [2405.15071] Grokked Transformers are Implicit Reasoners: A Mechanistic Journey to the Edge of Generalization - This is the link to the DOI (digital object identifier) for the paper.
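
For a feel of the setup, here is a toy sketch of the kind of synthetic two-hop composition data the paper studies: atomic facts (head, relation, tail) plus composed facts obtained by chaining two atomic facts, with some bridge entities held out to form an out-of-distribution split. The entity/relation names and the split rule are illustrative, not the paper's exact construction:

```python
# Sketch: synthetic atomic facts and two-hop composed facts with an ID/OOD split.
import random

random.seed(0)
entities = [f"e{i}" for i in range(20)]
relations = ["r0", "r1", "r2", "r3"]

# atomic facts: each (head, relation) points to a random tail entity
atomic = {(h, r): random.choice(entities) for h in entities for r in relations}

# composed (two-hop) facts: (h, r1, r2) -> tail of (atomic[h, r1], r2)
composed = {(h, r1, r2): atomic[(atomic[(h, r1)], r2)]
            for h in entities for r1 in relations for r2 in relations}

# hold out some first-hop "bridge" entities entirely to probe systematic (OOD) generalization
held_out_bridges = set(entities[:4])
train = {k: v for k, v in composed.items() if atomic[(k[0], k[1])] not in held_out_bridges}
ood_test = {k: v for k, v in composed.items() if atomic[(k[0], k[1])] in held_out_bridges}
print(len(atomic), "atomic facts,", len(train), "ID composed,", len(ood_test), "OOD composed")
```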
 

bnew

Exclusive: OpenAI working on new reasoning technology under code name ‘Strawberry’​

By Anna Tong and Katie Paul

July 12, 2024, 5:23 PM EDT (updated 19 hours ago)

Image caption: OpenAI logo is seen in this illustration taken May 20, 2024. REUTERS/Dado Ruvic/Illustration/File Photo

July 12 - ChatGPT maker OpenAI is working on a novel approach to its artificial intelligence models in a project code-named “Strawberry,” according to a person familiar with the matter and internal documentation reviewed by Reuters.

The project, details of which have not been previously reported, comes as the Microsoft-backed startup races to show that the types of models it offers are capable of delivering advanced reasoning capabilities.

Teams inside OpenAI are working on Strawberry, according to a copy of a recent internal OpenAI document seen by Reuters in May. Reuters could not ascertain the precise date of the document, which details a plan for how OpenAI intends to use Strawberry to perform research. The source described the plan to Reuters as a work in progress. The news agency could not establish how close Strawberry is to being publicly available.

How Strawberry works is a tightly kept secret even within OpenAI, the person said.

The document describes a project that uses Strawberry models with the aim of enabling the company’s AI to not just generate answers to queries but to plan ahead enough to navigate the internet autonomously and reliably to perform what OpenAI terms “deep research,” according to the source.

This is something that has eluded AI models to date, according to interviews with more than a dozen AI researchers.

Asked about Strawberry and the details reported in this story, an OpenAI company spokesperson said in a statement: “We want our AI models to see and understand the world more like we do. Continuous research into new AI capabilities is a common practice in the industry, with a shared belief that these systems will improve in reasoning over time.”

The spokesperson did not directly address questions about Strawberry.

The Strawberry project was formerly known as Q*, which Reuters reported last year was already seen inside the company as a breakthrough.

Two sources described viewing earlier this year what OpenAI staffers told them were Q* demos, capable of answering tricky science and math questions out of reach of today’s commercially-available models.

On Tuesday at an internal all-hands meeting, OpenAI showed a demo of a research project that it claimed had new human-like reasoning skills, according to Bloomberg. An OpenAI spokesperson confirmed the meeting but declined to give details of the contents. Reuters could not determine if the project demonstrated was Strawberry.

OpenAI hopes the innovation will improve its AI models’ reasoning capabilities dramatically, the person familiar with it said, adding that Strawberry involves a specialized way of processing an AI model after it has been pre-trained on very large datasets.

Researchers Reuters interviewed say that reasoning is key to AI achieving human or super-human-level intelligence.

While large language models can already summarize dense texts and compose elegant prose far more quickly than any human, the technology often falls short on common sense problems whose solutions seem intuitive to people, like recognizing logical fallacies and playing tic-tac-toe. When the model encounters these kinds of problems, it often “hallucinates” bogus information.

AI researchers interviewed by Reuters generally agree that reasoning, in the context of AI, involves the formation of a model that enables AI to plan ahead, reflect how the physical world functions, and work through challenging multi-step problems reliably.

Improving reasoning in AI models is seen as the key to unlocking the ability for the models to do everything from making major scientific discoveries to planning and building new software applications.

OpenAI CEO Sam Altman said earlier this year that in AI “the most important areas of progress will be around reasoning ability.”

Other companies like Google, Meta and Microsoft are likewise experimenting with different techniques to improve reasoning in AI models, as are most academic labs that perform AI research. Researchers differ, however, on whether large language models (LLMs) are capable of incorporating ideas and long-term planning into how they do prediction. For instance, one of the pioneers of modern AI, Yann LeCun, who works at Meta, has frequently said that LLMs are not capable of humanlike reasoning.


AI CHALLENGES​

Strawberry is a key component of OpenAI’s plan to overcome those challenges, the source familiar with the matter said. The document seen by Reuters described what Strawberry aims to enable, but not how.

In recent months, the company has privately been signaling to developers and other outside parties that it is on the cusp of releasing technology with significantly more advanced reasoning capabilities, according to four people who have heard the company’s pitches. They declined to be identified because they are not authorized to speak about private matters.

Strawberry includes a specialized way of what is known as “post-training” OpenAI’s generative AI models, or adapting the base models to hone their performance in specific ways after they have already been “trained” on reams of generalized data, one of the sources said.

The post-training phase of developing a model involves methods like “fine-tuning,” a process used on nearly all language models today that comes in many flavors, such as having humans give feedback to the model based on its responses and feeding it examples of good and bad answers.
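
To make "post-training" and "fine-tuning" a bit more concrete, here is a toy PyTorch sketch of the plain supervised flavor: take an already-trained model and keep training it on curated examples of good answers. Everything here (the tiny model, the fake data, the hyperparameters) is invented for illustration and says nothing about OpenAI's actual pipeline.

```python
# Minimal sketch of supervised fine-tuning ("post-training") on example answers,
# using a toy next-token model. Model, data, and hyperparameters are made up.
import torch
import torch.nn as nn

VOCAB, DIM = 100, 32

class TinyLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)
        self.rnn = nn.GRU(DIM, DIM, batch_first=True)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens):                 # tokens: (batch, seq)
        hidden, _ = self.rnn(self.embed(tokens))
        return self.head(hidden)               # logits: (batch, seq, vocab)

# Pretend these are tokenized (prompt + good answer) pairs curated from feedback.
batch = torch.randint(0, VOCAB, (8, 16))

model = TinyLM()                               # stands in for a pretrained base model
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):                         # short fine-tuning loop
    logits = model(batch[:, :-1])              # predict each next token
    loss = loss_fn(logits.reshape(-1, VOCAB), batch[:, 1:].reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```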

Strawberry has similarities to a method developed at Stanford in 2022 called "Self-Taught Reasoner” or “STaR”, one of the sources with knowledge of the matter said. STaR enables AI models to “bootstrap” themselves into higher intelligence levels via iteratively creating their own training data, and in theory could be used to get language models to transcend human-level intelligence, one of its creators, Stanford professor Noah Goodman, told Reuters.

“I think that is both exciting and terrifying…if things keep going in that direction we have some serious things to think about as humans,” Goodman said. Goodman is not affiliated with OpenAI and is not familiar with Strawberry.
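
The basic STaR loop itself is public (Zelikman et al., 2022): have the model generate step-by-step rationales, keep only the ones that reach the correct answer, fine-tune on those, and repeat. A schematic Python sketch is below; generate_rationale, is_correct and finetune are placeholders standing in for a real model API, and the paper's extra "rationalization" step (re-prompting with the answer as a hint when the model fails) is omitted for brevity.

```python
# Schematic sketch of the STaR ("Self-Taught Reasoner") bootstrapping loop.
# generate_rationale, is_correct, and finetune are assumed placeholders, not a
# real API; the point is the structure: generate -> filter -> retrain -> repeat.
def star_iteration(model, problems, generate_rationale, is_correct, finetune):
    new_training_data = []
    for question, answer in problems:
        rationale, predicted = generate_rationale(model, question)
        if is_correct(predicted, answer):
            # Keep only rationales that actually led to the right answer.
            new_training_data.append((question, rationale, answer))
    # Fine-tune the model on its own successful reasoning traces.
    return finetune(model, new_training_data)

def star(model, problems, generate_rationale, is_correct, finetune, rounds=3):
    for _ in range(rounds):
        model = star_iteration(model, problems, generate_rationale, is_correct, finetune)
    return model
```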

Among the capabilities OpenAI is aiming Strawberry at is performing long-horizon tasks (LHT), the document says, referring to complex tasks that require a model to plan ahead and perform a series of actions over an extended period of time, the first source explained.

To do so, OpenAI is creating, training and evaluating the models on what the company calls a “deep-research” dataset, according to the OpenAI internal documentation. Reuters was unable to determine what is in that dataset or what an extended period of time would mean.

OpenAI specifically wants its models to use these capabilities to conduct research by browsing the web autonomously with the assistance of a “CUA,” or a computer-using agent, that can take actions based on its findings, according to the document and one of the sources. OpenAI also plans to test its capabilities on doing the work of software and machine learning engineers.
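
The article does not say how such a computer-using agent would work, but the generic pattern it hints at (propose an action, execute it in a browser, feed the observation back in) can be sketched roughly as follows. All of the helper functions here are invented placeholders, not an OpenAI API.

```python
# Hedged sketch of a generic "computer-using agent" loop of the kind the article
# alludes to. propose_action, execute_in_browser, and is_done are placeholders.
def research_agent(model, task, propose_action, execute_in_browser, is_done, max_steps=20):
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        action = propose_action(model, history)        # e.g. "open URL", "click", "extract text"
        observation = execute_in_browser(action)
        history.append(f"Action: {action}\nObservation: {observation}")
        if is_done(model, history):                    # let the model decide when the research is complete
            break
    return history
```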

Reporting by Anna Tong in San Francisco and Katie Paul in New York; editing by Ken Li and Claudia Parsons






 

[2407.04153] Mixture of A Million Experts
[Submitted on 4 Jul 2024]

Mixture of A Million Experts​

Xu Owen He
Abstract:
The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows. Sparse mixture-of-experts (MoE) architectures have emerged as a viable approach to address this issue by decoupling model size from computational cost. The recent discovery of the fine-grained MoE scaling law shows that higher granularity leads to better performance. However, existing MoE models are limited to a small number of experts due to computational and optimization challenges. This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million). Experiments on language modeling tasks demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of performance-compute trade-off. By enabling efficient utilization of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.


Subjects:Machine Learning (cs.LG); Artificial Intelligence (cs.AI)
Cite as: arXiv:2407.04153
arXiv:2407.04153v1

Submission history​

From: [v1] [ view email]
[v1]

https://arxiv.org/pdf/2407.04153




A.I Generated explanation:


Title: Efficiently using a huge number of tiny experts in a Transformer model

This paper describes a new technique, PEER (Parameter Efficient Expert Retrieval), for making Transformer models more efficient.


Here's the problem in more detail:


* Transformers are powerful AI models, especially for understanding and generating language, but their feedforward (FFW) layers get more expensive the wider you make them: both compute and activation memory grow linearly with the hidden layer width.

* Sparse mixture-of-experts (MoE) architectures address this by splitting the layer into many "experts" (specialized modules) and only activating a few of them per input, which decouples model size from computational cost.

* However, existing MoE models use only a relatively small number of experts, because routing and optimization become difficult at larger scales, even though the fine-grained MoE scaling law suggests that more, smaller experts give better performance.


The solution they propose:


* PEER uses a vast pool of tiny experts (over a million) and retrieves only a small, relevant subset of them for each input using the product key technique.

* Think of it like this: imagine a very large team of narrow specialists. Instead of consulting all of them on every problem, which would be wasteful, you quickly look up the handful whose expertise matches the task at hand.


How PEER stays efficient:


* Sparse activation: only the retrieved experts run for a given input, so the compute per token stays small even though the total parameter pool is enormous.

* Efficient routing: product keys let the layer find the best-matching experts without comparing the input against every expert one by one.


The key takeaway is that PEER achieves a better performance-compute trade-off than dense FFW layers and coarse-grained MoEs, which opens the door to scaling Transformer models further while keeping computation manageable. A rough code sketch of the retrieval idea follows below.
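
For readers who want something concrete, here is a rough PyTorch sketch of that retrieval idea: score a pool of tiny experts (each just a down-projection and up-projection vector), keep the top few per token, and mix their outputs by routing weight. The dimensions are made up and, unlike real PEER, this toy version scores every expert directly instead of using product keys.

```python
# Rough sketch of retrieving a few tiny experts per token from a large pool,
# in the spirit of PEER. The pool here is small for the demo (the paper uses
# over a million experts) and retrieval is brute-force rather than product-key.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyExpertLayer(nn.Module):
    def __init__(self, dim=64, num_experts=4096, top_k=8):
        super().__init__()
        self.keys = nn.Parameter(torch.randn(num_experts, dim) * 0.02)   # router keys
        self.down = nn.Parameter(torch.randn(num_experts, dim) * 0.02)   # each expert: one hidden unit
        self.up = nn.Parameter(torch.randn(num_experts, dim) * 0.02)
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, dim)
        scores = x @ self.keys.t()               # (tokens, num_experts)
        top_scores, idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_scores, dim=-1)    # (tokens, top_k)
        down = self.down[idx]                    # gather selected experts: (tokens, top_k, dim)
        up = self.up[idx]                        # (tokens, top_k, dim)
        hidden = F.gelu((x.unsqueeze(1) * down).sum(-1))   # (tokens, top_k)
        out = (gates * hidden).unsqueeze(-1) * up          # weight each expert's output
        return out.sum(1)                        # (tokens, dim)

layer = TinyExpertLayer()
tokens = torch.randn(10, 64)
print(layer(tokens).shape)                       # torch.Size([10, 64])
```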







1/1
The Mixture of a Million Experts paper is a straight banger.

Reduces inference cost and memory usage, scales to millions of experts, oh and just happens to overcome catastrophic forgetting and enable life long learning for the model.

Previous MoE models never got past 10k experts and they had a static router to connect them up that was inefficient, but this includes a learned router that can handle millions of micro experts. Reminds me a bit of how the neocortex works because it is composed of about 2 million cortical columns that can each learn a model of the world and then work together to form a collective picture of reality.

Catastrophic forgetting and continual learning are two of the most important and nasty problems with current architectures and this approach just potentially wiped out both in one shot.

There have been other approaches to try to enable continual learning and overcome catastrophic forgetting, like bi-level continual learning or Progress & Compress, that use elastic weight consolidation, knowledge distillation and two models, a big neural net and a small learning net. The small net learns and over time the learnings are passed back to the big net. Its weights are partially frozen and consolidated as the new knowledge is brought in. Good ideas, also out of DeepMind robotics teams.

But this paper seems to say you can just add in new mini experts, freeze or partially freeze old weights, and just grow the model's understanding as much as you want, without causing it to lose what it already knows.

It's like having LoRAs built right into the model itself.

[2407.04153] Mixture of A Million Experts
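
As a rough illustration of the "add new mini experts, freeze old weights" idea from the post above (my interpretation, not anything from the paper), the sketch below grows a row-per-expert parameter and masks gradients so only the new rows train.

```python
# Sketch of the continual-learning idea from the post: append fresh expert rows
# and zero the gradient for the old rows so only the new experts learn.
# This is an interpretation of the idea, not code from the paper.
import torch
import torch.nn as nn

def grow_and_freeze(expert_params: nn.Parameter, num_new: int) -> nn.Parameter:
    """Append num_new freshly initialized expert rows; old rows stop receiving gradients."""
    n_old, dim = expert_params.shape
    new_rows = torch.randn(num_new, dim) * 0.02
    grown = nn.Parameter(torch.cat([expert_params.data, new_rows], dim=0))
    # Mask gradients so the original rows stay fixed during training.
    grown.register_hook(lambda g: torch.cat([torch.zeros_like(g[:n_old]), g[n_old:]], dim=0))
    return grown

# Demo: 4 old experts grown to 6; only the 2 new rows receive gradient.
experts = nn.Parameter(torch.randn(4, 8))
experts = grow_and_freeze(experts, num_new=2)
experts.sum().backward()
print(experts.grad[:4].abs().sum().item(), experts.grad[4:].abs().sum().item())  # 0.0 vs. nonzero
```

With an optimizer that applies decoupled weight decay you would also exclude the frozen rows from decay, so they stay truly untouched.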


 






1/9
🚀 Excited to share our latest research on investigating the effect of coding data on LLMs' reasoning abilities! 💻🔍 Discover how Instruction Fine-Tuning with code can boost zero-shot performance across various tasks and domains. 📊 🔗[2405.20535] Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning

2/9
1/📊 We created IFT datasets with increasing coding data proportions, fine-tuned six LLM backbones, evaluated performance across twelve tasks in three reasoning domains, and analyzed outcomes from overall, domain-level, and task-specific perspectives.

3/9
2/ ✨ Overall, we observed a consistent and gradual enhancement in the LLMs' reasoning performance as the proportion of coding data used for fine-tuning increased.

4/9
3/🔍 Diving into each domain, we found that coding data uniquely impacts different reasoning abilities. Consistent trends within each domain across model backbones and sizes suggest the benefits of coding data transfer effectively during the IFT stage.

5/9
4/🧩 Further analysis revealed that coding data generally provides similar task-specific benefits across model families. While most optimal proportions of coding data are consistent across families, no single proportion enhances all task-specific reasoning abilities.

6/9
Many thanks to my wonderful collaborators: Zhiyu (@ZhiyuChen4), Xi (@xiye_nlp), Xianjun (@Qnolan4), Lichang (@LichangChen2), William Wang (@WilliamWangNLP) and Linda Ruth Petzold.

7/9
it's a good proxy but chain of thought is still the way to go
[2305.20050] Let's Verify Step by Step

8/9
Dark mode for this paper for night readers 🌙 Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning

9/9






[2405.20535] Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning
[Submitted on 30 May 2024]

Unveiling the Impact of Coding Data Instruction Fine-Tuning on Large Language Models Reasoning​

Xinlu Zhang, Zhiyu Zoey Chen, Xi Ye, Xianjun Yang, Lichang Chen, William Yang Wang, Linda Ruth Petzold
Abstract:
Instruction Fine-Tuning (IFT) significantly enhances the zero-shot capabilities of pretrained Large Language Models (LLMs). While coding data is known to boost reasoning abilities during LLM pretraining, its role in activating internal reasoning capacities during IFT remains understudied. This paper investigates a key question: How does coding data impact LLMs' reasoning capacities during the IFT stage? To explore this, we thoroughly examine the impact of coding data across different coding data proportions, model families, sizes, and reasoning domains, from various perspectives. Specifically, we create three IFT datasets with increasing coding data proportions, fine-tune six LLM backbones across different families and scales on these datasets, evaluate the tuned models' performance across twelve tasks in three reasoning domains, and analyze the outcomes from three broad-to-granular perspectives: overall, domain-level, and task-specific. Our holistic analysis provides valuable insights in each perspective. First, coding data tuning enhances the overall reasoning capabilities of LLMs across different model families and scales. Moreover, the effect of coding data varies among different domains but shows consistent trends across model families and scales within each domain. Additionally, coding data generally yields comparable task-specific benefits across different model families, with the optimal coding data proportions in IFT datasets being task-specific.

Submission history​

From: [v1] [ view email]
[v1]

https://arxiv.org/pdf/2405.20535
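
As a rough illustration of the experimental setup described above (IFT datasets with increasing coding-data proportions), the sketch below mixes a coding instruction set into a general instruction set at a chosen fraction. The datasets and sizes here are invented placeholders, not the paper's actual data.

```python
# Hedged sketch of building IFT mixes with increasing coding-data proportions,
# as described in the thread and abstract. The example datasets are invented.
import random

def mix_ift_data(general_examples, coding_examples, coding_fraction, total_size, seed=0):
    """Sample an instruction-tuning set with a given fraction of coding data."""
    rng = random.Random(seed)
    n_code = int(total_size * coding_fraction)
    n_general = total_size - n_code
    mix = rng.sample(coding_examples, n_code) + rng.sample(general_examples, n_general)
    rng.shuffle(mix)
    return mix

general = [{"instruction": f"general task {i}", "output": "..."} for i in range(1000)]
coding = [{"instruction": f"coding task {i}", "output": "..."} for i in range(1000)]

# Three mixes with increasing coding proportions, e.g. 0%, 25%, 50%.
mixes = {p: mix_ift_data(general, coding, p, total_size=500) for p in (0.0, 0.25, 0.5)}
```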
 