bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,878







ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline

Large language models (LLMs) have shown excellent mastery of human language but still struggle in real-world applications that require mathematical problem-solving. While many strategies and datasets for enhancing LLMs' mathematical abilities have been developed, it remains a challenge to simultaneously maintain and improve both the language and mathematical capabilities of deployed LLM systems. In this work, we tailor the Self-Critique pipeline, which addresses the challenge in the feedback-learning stage of LLM alignment. We first train a general Math-Critique model from the LLM itself to provide feedback signals. Then, we sequentially employ rejective fine-tuning and direct preference optimization over the LLM's own generations for data collection. Based on ChatGLM3-32B, we conduct a series of experiments on both academic datasets and our newly created challenging dataset, MathUserEval. Results show that our pipeline significantly enhances the LLM's mathematical problem-solving while still improving its language ability, outperforming LLMs that could be two times larger.

paper page:

daily papers:
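The data-collection loop the abstract describes (a Math-Critique model scores the LLM's own samples; high-scoring answers feed rejective fine-tuning, and best/worst pairs feed DPO) can be sketched as follows. This is my illustration, not the paper's code; `generate` and `critique` are hypothetical callables standing in for the LLM and the Math-Critique model:

```python
def collect(questions, generate, critique, k=8, threshold=0.8):
    """Collect RFT examples and DPO preference pairs from the model's
    own generations, scored by a critique model."""
    rft_data, dpo_pairs = [], []
    for q in questions:
        samples = [generate(q) for _ in range(k)]      # sample k answers
        ranked = sorted(samples, key=critique, reverse=True)
        best, worst = ranked[0], ranked[-1]
        if critique(best) >= threshold:                # keep only good answers for RFT
            rft_data.append((q, best))
        if critique(best) > critique(worst):           # a usable preference pair
            dpo_pairs.append((q, best, worst))
    return rft_data, dpo_pairs
```

The same critique signal thus serves both stages: as a filter for rejective fine-tuning and as a ranker for preference pairs.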
 







1/6
(1/2) Introducing LL3M: Large Language, Multimodal, and Moe Model Open Research Plan

With the following goals:
- Build an open-sourced codebase in Jax / Flax that supports large-scale training in LLM, LMM, and MoE models.

- Record and share the journey and tricks to implement and train such a model.

- Come up with something new and exciting -- A multimodal version of Branch-Train-MiX initialized with a learnable model soup.

(More release milestones are listed below)

2/6
(2/2) Here are five release milestones:

1: Language model and SeqIO dataloader for the Dolma dataset, supporting existing open-weight LLMs such as LLaMA 2, Mistral, Gemma, and Phi.

2: Multimodal LLM and Dataloader that supports Llava 1.5 and Llava 1.6.

3: Learnable Model soup…
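Milestone 3's "learnable model soup" can be sketched as a weighted average of checkpoint parameters with trainable mixing coefficients. This is a minimal plain-Python illustration of the general idea, not the project's Jax/Flax code:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soup(checkpoints, mix_logits):
    """Average checkpoints (dicts of param_name -> list of floats) using
    softmax(mix_logits) as the learnable mixing weights."""
    weights = softmax(mix_logits)
    merged = {}
    for name in checkpoints[0]:
        n = len(checkpoints[0][name])
        merged[name] = [
            sum(w * ck[name][i] for w, ck in zip(weights, checkpoints))
            for i in range(n)
        ]
    return merged
```

In the learnable variant, `mix_logits` would be optimized by gradient descent instead of fixed, giving an initialization for something like Branch-Train-MiX.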

3/6
Thanks for the suggestion! Jax/Flax has very good data and model parallelism support, which makes it easy to implement MoE-type models. I also mainly test the code on TPUs.

4/6
Thanks!

5/6
Not right now; maybe we can use the discussions in jiasenlu/LL3M · Discussions.

6/6
The current codebase is based on EasyLM (GitHub - young-geng/EasyLM: Large language models (LLMs) made easy; EasyLM is a one-stop solution for pre-training, fine-tuning, evaluating, and serving LLMs in JAX/Flax), but we will add many more functions!

 




1/2
1/n Recursive Reasoning: This AI Unlocks Complexity Through Smart Decomposition

Large language models (LLMs) have shown remarkable capabilities in performing complex decision-making tasks that require planning and adaptation to interactive environments. However, existing approaches for using LLMs as decision-making agents face significant limitations when dealing with highly complex tasks.

The predominant methods can be categorized into two types - iterative executors and plan-and-execute approaches. Iterative executors like ReAct determine the next action for the LLM based on the current state, but struggle with compositional complexity and long action-observation trajectories as tasks become harder. On the other hand, plan-and-execute methods create an upfront high-level plan by having an LLM decompose the task into sub-tasks to be executed by another LLM. While this reduces complexity, these approaches lack flexibility - if any sub-task is too complex for the executor LLM, the overall task fails.

To address these shortcomings, the paper introduces ADAPT (As-Needed Decomposition and Planning with Language Models), a novel approach that can recursively decompose complex sub-tasks further as needed during execution. ADAPT consists of separate LLM modules for planning and execution, coordinated by a controller program. If the executor LLM fails on a sub-task, the planner LLM decomposes it further into smaller steps to align with the executor's capabilities.
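The controller logic described above can be sketched as a recursive function. This is my illustration; `execute` and `plan` are hypothetical stand-ins for the executor- and planner-LLM calls:

```python
def adapt(task, execute, plan, depth=0, max_depth=3):
    """ADAPT-style controller: try the executor first; on failure, ask the
    planner to decompose the task and recurse on each sub-task."""
    if execute(task):          # executor LLM succeeded on this (sub-)task
        return True
    if depth >= max_depth:     # recursion budget exhausted
        return False
    subtasks = plan(task)      # planner LLM decomposes the failed task
    return all(adapt(t, execute, plan, depth + 1, max_depth)
               for t in subtasks)
```

Decomposition happens only where execution actually fails, which is exactly the "as-needed" property the paper's name refers to.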

The key strength of ADAPT lies in its dynamic, recursive decomposition structure inspired by hierarchical planning. However, unlike classical methods that rely on hand-specified domain knowledge, ADAPT leverages the broad world knowledge encoded in LLMs to autonomously decompose tasks. This allows it to adapt not just to different LLMs with varying execution abilities, but also to the inherent complexity of tasks.

The authors evaluate ADAPT on three diverse interactive environment datasets - ALFWorld (household tasks), WebShop (online shopping), and a new compositional game TextCraft for crafting Minecraft recipes. Using GPT-3.5 as the base LLM, ADAPT demonstrates significant performance gains over strong baselines like ReAct, Plan-and-Solve, and the adaptive Reflexion method. On ALFWorld, WebShop, and TextCraft, ADAPT achieves up to 28.3%, 27%, and 33% absolute improvements in success rates respectively compared to ReAct.

Through extensive analysis, the paper establishes the importance of ADAPT's recursive decomposition in achieving these results. The extent of decomposition automatically adjusts based on the execution capabilities of the LLM employed as well as the inherent complexity of tasks. This ability to dynamically adapt its hierarchical decomposition to the situation at hand is the key innovation that enables ADAPT's state-of-the-art performance.

In summary, the ADAPT paper presents an elegant solution to a critical limitation in using LLMs for complex, interactive decision-making tasks. By integrating recursive task decomposition with LLM-based planning and execution, ADAPT paves the way for more capable and robust AI agents that can gracefully handle complex real-world environments. Despite some limitations like the lack of multi-step lookahead, the promising results position ADAPT as an important step towards leveraging the remarkable knowledge and reasoning capabilities of LLMs for sequential decision-making problems.

2/2
2/n Here is the source code. Thank you for subscribing!

 


Computer Science > Computation and Language​

[Submitted on 26 Mar 2024]

COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning​

Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Ziqiang Liu, Junting Zhou, Tianyu Zheng, Xincheng Zhang, Nuo Ma, Zekun Wang, Ruibin Yuan, Haihong Wu, Hongquan Lin, Wenhao Huang, Jiajun Zhang, Wenhu Chen, Chenghua Lin, Jie Fu, Min Yang, Shiwen Ni, Ge Zhang
Recently, there have been significant advancements in large language models (LLMs), particularly focused on the English language. These advancements have enabled these LLMs to understand and execute complex instructions with unprecedented accuracy and fluency. However, despite these advancements, there remains a noticeable gap in the development of Chinese instruction tuning. The unique linguistic features and cultural depth of the Chinese language pose challenges for instruction tuning tasks. Existing datasets are either derived from English-centric LLMs or are ill-suited for aligning with the interaction patterns of real-world Chinese users. To bridge this gap, we introduce COIG-CQIA, a high-quality Chinese instruction tuning dataset. Our aim is to build a diverse, wide-ranging instruction-tuning dataset to better align model behavior with human interactions. To this end, we collect a high-quality human-written corpus from various sources on the Chinese Internet, including Q&A communities, Wikis, examinations, and existing NLP datasets. This corpus was rigorously filtered and carefully processed to form the COIG-CQIA dataset. Furthermore, we train models of various scales on different subsets of CQIA, following in-depth evaluation and analyses. The findings from our experiments offer valuable insights for selecting and developing Chinese instruction-tuning datasets. We also find that models trained on CQIA-Subset achieve competitive results in human assessment as well as knowledge and security benchmarks. Data are available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2403.18058 [cs.CL] (or arXiv:2403.18058v1 [cs.CL] for this version)
[2403.18058] COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning

Submission history

From: Yuelin Bai [view email]
[v1] Tue, 26 Mar 2024 19:24:18 UTC (7,301 KB)


 






1/3
I *WAS* WRONG - $10K CLAIMED!

## The Claim

Two days ago, I confidently claimed that "GPTs will NEVER solve the A::B problem". I believed that: (1) GPTs can't truly learn new problems outside of their training set, and (2) GPTs can't perform long-term reasoning, no matter how simple it is. I argued both of these are necessary to invent new science; after all, some math problems take years to solve. If you can't beat a 15yo in any given intellectual task, you're not going to prove the Riemann Hypothesis. To isolate these issues and make my point, I designed the A::B problem, and posted it here - full definition in the quoted tweet.

## Reception, Clarification and Challenge

Shortly after posting it, some users provided a solution to a specific 7-token example I listed. I quickly pointed out that this wasn't what I meant; that this example was merely illustrative, and that answering one instance isn't the same as solving a problem (and can be easily cheated by prompt manipulation).

So, to make my statement clear, and to put my money where my mouth is, I offered a $10k prize to whoever could design a prompt that solved the A::B problem for *random* 12-token instances, with a 90%+ success rate. That's still an easy task that takes an average of 6 swaps to solve; literally simpler than 3rd-grade arithmetic. Yet, I firmly believed no GPT would be able to learn and solve it on-prompt, even for these small instances.
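For reference, the task itself is purely mechanical. The full definition lives in the quoted tweet, which isn't reproduced here, so the rewrite rules below are my assumption of the standard A::B statement: four tokens `A#`, `#A`, `B#`, `#B`, where a facing pair `X# #Y` annihilates if the letters match and swaps if they differ. A reference solver in that spirit:

```python
# Assumed A::B rules (not copied from the thread): when a '#'-facing pair
# X# #Y becomes adjacent, equal letters annihilate, different letters swap.
RULES = {
    ("A#", "#A"): [],
    ("B#", "#B"): [],
    ("A#", "#B"): ["#B", "A#"],
    ("B#", "#A"): ["#A", "B#"],
}

def solve(tokens):
    """Apply the rewrite rules until no pair matches (the normal form)."""
    tokens = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            if pair in RULES:
                tokens[i:i + 2] = RULES[pair]
                changed = True
                break
    return tokens
```

A model has to carry out exactly this kind of bounded rewriting on-prompt; per the challenge, an average 12-token instance needs about 6 rule applications.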

## Solutions and Winner

Hours later, many solutions were submitted. Initially, all failed, barely reaching 10% success rates. I was getting fairly confident, until, later that day, @ptrschmdtnlsn and @SardonicSydney submitted a solution that humbled me. Under their prompt, Claude-3 Opus was able to generalize from a few examples to arbitrary random instances, AND stick to the rules, carrying long computations with almost zero errors. On my run, it achieved a 56% success rate.

Through the day, users @dontoverfit (Opus), @hubertyuan_ (GPT-4), @JeremyKritz (Opus), @parth007_96 (Opus), and @ptrschmdtnlsn (Opus) reached similar success rates, and @reissbaker made a pretty successful GPT-3.5 fine-tune. But it was only late that night that @futuristfrog posted a tweet claiming to have achieved near 100% success rate, by prompting alone. And he was right. On my first run, it scored 47/50, granting him the prize, and completing the challenge.

## How it works!?

The secret to his prompt is... going to remain a secret! That's because he kindly agreed to give 25% of the prize to the most efficient solution. This prompt costs $1+ per inference, so, if you think you can improve on that, you have until next Wednesday to submit your solution in the link below, and compete for the remaining $2.5k! Thanks, Bob.

## How do I stand?

Corrected! My initial claim was absolutely WRONG - for which I apologize. I doubted the GPT architecture would be able to solve certain problems which it, with no margin for doubt, solved. Does that prove GPTs will cure Cancer? No. But it does prove me wrong!

Note there is still a small problem with this: it isn't clear whether Opus is based on the original GPT architecture or not. All GPT-4 versions failed. If Opus turns out to be a new architecture... well, this whole thing would have, ironically, just proven my whole point. But, for the sake of the competition, and in all fairness, Opus WAS listed as an option, so the prize is warranted.

## Who I am and what I'm trying to sell?

Wrong! I won't turn this into an ad. But, yes, if you're new here, I AM building some stuff, and, yes, just like today, I constantly validate my claims to make sure I can deliver on my promises. But that's all I'm gonna say, so, if you're curious, you'll have to find out for yourself (:

####

That's all. Thanks for all who participated, and, again - sorry for being a wrong guy on the internet today! See you.

Gist:

2/3
(The winning prompt will be published Wednesday, as well as the source code for the evaluator itself. Its hash is on the Gist.)

3/3
half of them will be praising Opus (or whatever the current model is) and the other half complaining of CUDA, and 1% boasting about HVM milestones... not sure if that's your type of content, but you're welcome!

 




Mathematics > Optimization and Control​

[Submitted on 4 Apr 2024]

Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra​

Darioush Kevian, Usman Syed, Xingang Guo, Aaron Havens, Geir Dullerud, Peter Seiler, Lianhui Qin, Bin Hu
In this paper, we explore the capabilities of state-of-the-art large language models (LLMs) such as GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra in solving undergraduate-level control problems. Controls provides an interesting case study for LLM reasoning due to its combination of mathematical theory and engineering design. We introduce ControlBench, a benchmark dataset tailored to reflect the breadth, depth, and complexity of classical control design. We use this dataset to study and evaluate the problem-solving abilities of these LLMs in the context of control engineering. We present evaluations conducted by a panel of human experts, providing insights into the accuracy, reasoning, and explanatory prowess of LLMs in control engineering. Our analysis reveals the strengths and limitations of each LLM in the context of classical control, and our results imply that Claude 3 Opus has become the state-of-the-art LLM for solving undergraduate control problems. Our study serves as an initial step towards the broader goal of employing artificial general intelligence in control engineering.
Subjects: Optimization and Control (math.OC); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2404.03647 [math.OC] (or arXiv:2404.03647v1 [math.OC] for this version)
[2404.03647] Capabilities of Large Language Models in Control Engineering: A Benchmark Study on GPT-4, Claude 3 Opus, and Gemini 1.0 Ultra

Submission history

From: Bin Hu [view email]
[v1] Thu, 4 Apr 2024 17:58:38 UTC (505 KB)


 

 


🏃Alibaba Chief Says China AI 2 Years Behind US, How Humor Forum Unexpectedly Makes AI Smarter, and China Approves 117 Gen-AI Models​

Weekly China AI News from March 25, 2024 to April 3, 2024​

TONY PENG

APR 05, 2024

Hello readers, as Chinese families are honoring their ancestors in the Qingming tomb-sweeping festival, I’m delivering this week's issue early. In this edition, I highlighted Alibaba Chair Joe Tsai’s perspective on China’s AI and the US chip restrictions. Surprisingly, a humor sub-forum on Baidu Tieba, Ruozhiba (弱智吧), has emerged as a goldmine for training Chinese LLMs. China’s Internet regulator has released a full list of 117 generative AI models now approved for public services.




Alibaba Chair Joe Tsai on China AI, Chip Restrictions, and Homegrown GPUs​

What’s New: Alibaba’s co-founder and new chief Joe Tsai said in a recent public interview that China is two years behind the top LLMs from the U.S., and he believes the country can eventually produce its own high-end GPUs. Below are quick highlights.

  • US vs China on AI: “I think China is today behind. It's clear that American companies like OpenAI have really leaped ahead of everybody else, but China is trying to play catchup. I think China could have a lag that will last for a long time because everybody else is running very fast as well. I think today we’re probably two years behind the top models.”
  • Chip Restrictions: “Last October the U.S. put in very stringent restrictions on the ability of companies like Nvidia to export high-end chips to every company in China, so they’ve sort of abandoned the entity list approach and they put the entire China on their list. I think we’re definitely affected by that. In fact, we’ve actually publicly communicated it did affect our cloud business and our ability to offer high-end Computing Services to our customers. So it is an issue in the short run and probably the medium run, but in the long run, China will develop its own ability to make these high-end GPUs.”
  • Short-term impact: “I think in the next year or 18 months the training of large language models (LLMs) can still go ahead given the inventory that people have. I think there’s more high computing that’s required for training as opposed to the applications, what people call inference. So on the inference side, there are multiple options. You don’t need to have as high power and high-end chips such as the Nvidia you know the latest model.”
  • Alibaba’s AI strategy: “We’re one of the largest cloud players in China so AI is essential. Having a good large language model that is proprietarily developed in-house is very important because it helps our cloud business if we have a great LLM and other people, other developers are developing on top of it they’re using our computing services. So we see AI as very much the left hand and right hand for our cloud business. And the other aspect is the e-commerce business is one of the places where you can have the richest use cases for AI. So you can develop a lot of really cool products on top of our own models or even someone else’s open-source model…You can try something on using virtual dressing rooms. Our merchants doing business on our marketplace will be able to use AI to self-generate photos product descriptions and things like that.”


Chinese Reddit-like Humor Forum Ruozhiba (弱智吧) Unexpectedly Makes AI Smarter​



https://www.youtube.com/watch?v=mT6mRJehJdw

What’s New: Ruozhiba, which literally translates to “Idiot Sub-forum”, is a bizarre corner of the Chinese internet. This sub-forum on Reddit-like Baidu Tieba is filled with ridiculous, pun-filled, logically challenging threads that will twist your brain into a pretzel. Here are some examples:

  • Is it a violation to drink all the water during a swimming race and then run?
  • Since prisons are full of criminals, why don’t the cops just go arrest people there?
  • Fresh sashimi is a dead fish slice (生鱼片是死鱼片).

But who knew this forum would unexpectedly become a treasure trove for training Chinese-language AI models?

How it Works: A recent paper titled COIG-CQIA: Quality is All You Need for Chinese Instruction Fine-tuning introduced a high-quality Chinese dataset aimed at fine-tuning Chinese LLMs to better understand and respond to instructions like native Chinese.

  • The dataset contains over 48,000 instruction-response pairs collected from diverse sources on the Chinese internet like Q&A forums, Wiki articles, exams, and existing NLP datasets.
The authors then analyzed the effects of different data sources, including Ruozhiba.

  • The Ruozhiba dataset only contains 240 instruction-response pairs. The authors first collected the 500 most highly upvoted threads from Ruozhiba. They used the titles of these threads as instructions. For the responses, some were generated by humans and some by GPT-4.

Surprisingly, the authors found that the Yi-34B model fine-tuned on the Ruozhiba data performed the best overall across different tasks on the BELLE-EVAL benchmark, outperforming other data sources like Zhihu, Xiaohongshu, and Wiki.




Additionally, the smaller Yi-6B model fine-tuned on the Ruozhiba subset also ranked third overall, behind only the carefully curated CQIA-Subset and the Exam data.

On the SafetyBench which evaluates ethical and safe behavior, the Yi-6B model trained on Ruozhiba data also secured the second-highest score.
  • The authors conjectured that Ruozhiba “may enhance the model’s logical reasoning ability, thereby benefiting most of the instruct-following tasks.”

Why it Matters: It’s just a fun story that I really enjoyed writing about. You never would have guessed that a dataset filled with pure nonsense could actually help enhance AI!



China Approves 117 Generative AI Models for Public Use​

What’s New: China has approved 117 generative AI models for public use as of March 28, 2024, the Cyberspace Administration of China (CAC) disclosed for the first time.

Background: Under China’s generative AI regulation, platforms, especially chatbots like Baidu’s ERNIE Bot and Alibaba’s Tongyi Qianwen, had to seek approval from local CAC offices before launch. Since August last year, any generative AI services “capable of shaping public opinion or mobilizing society” must undergo a safety evaluation and registration process.

Local CAC offices will then publicly disclose information of registered generative AI services.

Key Takeaways
  • While I haven’t studied all 117 models, presumably a majority are language-based models (LLMs).
  • No models from non-Chinese companies have made the cut yet.
  • Beijing and Shanghai stand at the forefront of China’s AI innovation, with 51 models from Beijing and 24 from Shanghai receiving approval.
 


1/1
Majorly improved GPT-4 Turbo model available now in the API and rolling out in ChatGPT.




 




1/4
The AI world is truly going insane!

Just the other day, we were discussing Princeton University's open-source AI software engineer SWE-agent outperforming Devin, and now a new contender, AutoCodeRover from Singapore, has dethroned SWE-agent in just a matter of days.

This powerhouse can tackle 67 GitHub issues (bug fixes or feature additions) in under ten minutes per issue, while regular developers take an average of over 2.77 days, all at a minuscule LLM cost of ~$0.5! Truly frightening!

2/4
Check it out!

3/4
New LLMs & cost transparency, accomplished!

4/4
@goon_nguyen has just added 4 more AI models to my settings!

All models come with input & output price (per million tokens).






1/4
it's over, and this is current generation LLMs (gpt-4)

> our approach resolved 67 GitHub issues in less than ten minutes each, whereas developers spent more than 2.77 days on average

2/4
ok not completely over for soft eng, but yea that efficiency gain is insane

3/4
these were the results posted from "devin" (they used a 25% random subset)

4/4
code is here





1/4
Introducing AutoCodeRover
Presenting our autonomous software engineer from Singapore! Takes in a GitHub issue (bug fixing or feature addition) and resolves it in a few minutes, with minimal LLM cost of ~$0.5! Please RT

GitHub - nus-apr/auto-code-rover: Autonomous program improvement

auto-code-rover/preprint.pdf at main · nus-apr/auto-code-rover


2/4
Absolutely free for everyone to try out! And to improve it further!

3/4
We prefer to run it multiple times - to cater for variations …

4/4
Try it from the following site




1/2
AutoCodeRover: Autonomous Software Engineer

Resolves 22% of GitHub issues in SWE-bench lite in <10 mins at a minimal LLM cost of ~$0.5.
Works on a program representation (the abstract syntax tree) and exploits program structure in the form of classes/methods/APIs.

2/2
Introducing AutoCodeRover
Presenting our autonomous software engineer from Singapore! Takes in a GitHub issue (bug fixing or feature addition) and resolves it in a few minutes, with minimal LLM cost of ~$0.5! Please RT

https://github.com/nus-apr/auto-code-rover
https://github.com/nus-apr/auto-code-rover/blob/main/preprint.pdf




About​

Autonomous program improvement

AutoCodeRover: Autonomous Program Improvement​

overall-workflow


👋 Overview​

AutoCodeRover is a fully automated approach for resolving GitHub issues (bug fixing and feature addition) in which LLMs are combined with analysis and debugging capabilities to prioritize patch locations, ultimately leading to a patch.

On SWE-bench lite, which consists of 300 real-world GitHub issues, AutoCodeRover resolves ~22% of issues, improving over the current state-of-the-art efficacy of AI software engineers.


AutoCodeRover works in two stages:

  • 🔎 Context retrieval: The LLM is provided with code search APIs to navigate the codebase and collect relevant context.
  • 💊 Patch generation: The LLM tries to write a patch, based on retrieved context.

✨ Highlights​

AutoCodeRover has two unique features:

  • Code search APIs are Program Structure Aware. Instead of searching over files by plain string matching, AutoCodeRover searches for relevant code context (methods/classes) in the abstract syntax tree.
  • When a test suite is available, AutoCodeRover can take advantage of test cases to achieve an even higher repair rate, by performing statistical fault localization.
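The first highlight (program-structure-aware search) can be illustrated with Python's standard `ast` module. This is a toy sketch of the idea, not AutoCodeRover's actual search API:

```python
import ast

def search_method(source: str, name: str):
    """Find methods by name in the AST rather than by plain string
    matching; returns qualified 'Class.method' names."""
    tree = ast.parse(source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name == name:
                    hits.append(f"{node.name}.{item.name}")
    return hits
```

The project's real search APIs operate repository-wide over classes and methods, per the highlights above; the point here is only that matching AST nodes avoids the false hits of raw text search.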

🗎 arXiv Paper​

AutoCodeRover: Autonomous Program Improvement​


For referring to our work, please cite and mention:

@misc{zhang2024autocoderover,
      title={AutoCodeRover: Autonomous Program Improvement},
      author={Yuntong Zhang and Haifeng Ruan and Zhiyu Fan and Abhik Roychoudhury},
      year={2024},
      eprint={2404.05427},
      archivePrefix={arXiv},
      primaryClass={cs.SE}
}

✔️ Example: Django Issue #32347​

As an example, AutoCodeRover successfully fixed issue #32347 of Django. See the demo video for the full process.

Enhancement: leveraging test cases​

AutoCodeRover can resolve even more issues if test cases are available. See an example in the video:
 



1/2
Emad Mostaque: last summer it took 20 seconds to generate an image, now we can do 300 images per second and over 1000 with the new Nvidia chips

2/2
Source:
 



1/2
magnet:?xt=urn:btih:9238b09245d0d8cd915be09927769d5f7584c1c9&dn=mixtral-8x22b&tr=udp%3A%2F%2Fopen.demonii.com%3A1337%2Fannounce&tr=http%3A%2F%2Ftracker.opentrackr.org%3A1337%2Fannounce

2/2
RELEASE 0535902c85ddbb04d4bebbf4371c6341 lol






Models Table
2024 optimal LLM highlights



 











1/10
Alright, strap in. Support for Command-R+ was merged into llama.cpp exactly 4 hours ago. We're going to start talking to a GPT-4 level model on local hardware without a GPU. If you have 64GB of RAM, feel free to follow along

2/10
First up, a note about hardware: Text generation is limited by memory bandwidth. This will run on any machine with 64GB or more, but if you want speed I recommend DDR5, ideally on an 8 or even 12-channel motherboard, like Xeon/Epyc/Threadripper Pro/Apple silicon.

3/10
To start, we're going to build the latest version of llama.cpp.

4/10
Next, we're going to get the compressed Command-R+ model and weights in GGUF format. That's here: dranger003/c4ai-command-r-plus-iMat.GGUF at main

Download the biggest size you can fit in RAM, with maybe 8-16GB of headroom (so at 64GB, try iq3_m or iq3_s, which are ~48GB). Bigger sizes are split.

5/10
Now, let's prepare our chat, using that chat template included with command-R+. pip install transformers, and then run this in Python. Feel free to change the chat:
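The thread's actual snippet (in the attached screenshot) uses the tokenizer's built-in chat template. As a rough stand-in, here is a manual rendering; the special-token strings are my assumption from Cohere's model card, not copied from this thread:

```python
def format_command_r_chat(messages):
    """Render a chat into (what I believe is) Command-R+'s prompt format.
    The safer route is tokenizer.apply_chat_template, as in the thread."""
    role_token = {
        "system": "<|SYSTEM_TOKEN|>",
        "user": "<|USER_TOKEN|>",
        "assistant": "<|CHATBOT_TOKEN|>",
    }
    out = "<BOS_TOKEN>"
    for msg in messages:
        out += ("<|START_OF_TURN_TOKEN|>" + role_token[msg["role"]]
                + msg["content"] + "<|END_OF_TURN_TOKEN|>")
    # Open the assistant turn so the model continues from here.
    out += "<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>"
    return out
```

In practice, prefer `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` with the model's own tokenizer so the template always matches the checkpoint.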

6/10
The result is the formatted chat, ready to go to llama.cpp. Paste it into the -p argument to ./main in the llama.cpp dir, and pass your GGUF file to -m. -n is the maximum response length, in tokens.

7/10
Now simply hit enter, and...

8/10
For more performance, you can add more memory bandwidth or compile llama.cpp with BLAS support. You can also do the whole thing with the Python bindings, so you don't have to keep pasting back and forth like this. And that's it: GPT-4 in your home!

9/10
Also, note that the model will get stupider at the smaller quantizations. If you try this at iq2 and it gives you a terrible answer, don't blame me! You may need 128GB of RAM to fit the higher-quality Q6 and Q8 quantizations.

10/10
The latency depends almost entirely on memory bandwidth! High-bandwidth systems include DDR5 Threadripper Pro (~320GB/s), Xeon (320GB/s), Epyc (480GB/s), and Apple silicon (up to 400-800GB/s for a Mac Studio).

250GB/s translates to about 1 tok/s, in my experience.
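That rule of thumb can be written down explicitly. The efficiency factor here is my assumption, chosen to match the roughly 1 tok/s at 250GB/s reported above for a ~48GB quantization:

```python
def est_tokens_per_sec(bandwidth_gb_s, model_size_gb, efficiency=0.2):
    """Rough estimate: each generated token streams the full set of weights
    through memory once, and real systems reach only a fraction of peak
    bandwidth (the efficiency factor is an assumed fudge, not a measurement)."""
    return efficiency * bandwidth_gb_s / model_size_gb
```

For example, `est_tokens_per_sec(250, 48)` lands near 1 tok/s, and doubling either the bandwidth or halving the quantized model size roughly doubles the estimate.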
 




Pricing​

Thank you for being an early supporter! Our product is free for the duration of the beta program. In this period, you can make up to 1200 songs / month.
 
Top