bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862

Start using ChatGPT instantly​

We’re making it easier for people to experience the benefits of AI without needing to sign up


April 1, 2024


Announcements, Product

It's core to our mission to make tools like ChatGPT broadly available so that people can experience the benefits of AI. More than 100 million people across 185 countries use ChatGPT weekly to learn something new, find creative inspiration, and get answers to their questions. Starting today, you can use ChatGPT instantly, without needing to sign up. We're rolling this out gradually, with the aim of making AI accessible to anyone curious about its capabilities.


We may use what you provide to ChatGPT to improve our models for everyone. If you’d like, you can turn this off through your Settings - whether you create an account or not. Learn more about how we use content to train our models and your choices in our Help Center.


We’ve also introduced additional content safeguards for this experience, such as blocking prompts and generations in a wider range of categories.

There are many benefits to creating an account, including the ability to save and review your chat history, share chats, and unlock additional features like voice conversations and custom instructions.

For anyone who has been curious about AI's potential but didn't want to go through the steps of setting up an account, start using ChatGPT today.

Authors​

  • OpenAI​

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862
AI: Mere days after the software agent Devin is released, an open-source alternative, SWE-agent, is almost as good.




About​

SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models


swe-agent.com

Website & Demo | Discord | Paper [coming April 10th]

👋 Overview​

SWE-agent turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories.

On the full SWE-bench test set, SWE-agent resolves 12.29% of issues, achieving state-of-the-art performance on the full test set.




✨ Agent-Computer Interface (ACI)​

We accomplish these results by designing simple LM-centric commands and feedback formats to make it easier for the LM to browse the repository, view, edit and execute code files. We call this an Agent-Computer Interface (ACI) and build the SWE-agent repository to make it easy to iterate on ACI design for repository-level coding agents.


Just as typical language models require good prompt engineering, good ACI design leads to much better results when using agents. As we show in our paper, a baseline agent without a well-tuned ACI does much worse than SWE-agent.


SWE-agent contains features that we discovered to be immensely helpful during the agent-computer interface design process:

  1. We add a linter that runs when an edit command is issued, and we do not let the edit command go through if the code isn't syntactically correct (see the sketch after this list).
  2. We supply the agent with a purpose-built file viewer, instead of having it just cat files. We found that this file viewer works best when displaying just 100 lines in each turn. The file editor that we built has commands for scrolling up and down and for performing a search within the file.
  3. We supply the agent with a purpose-built full-directory string searching command. We found that it was important for this tool to succinctly list the matches: we simply list each file that had at least one match. Showing the model more context about each match proved to be too confusing for the model.
  4. When commands have an empty output, we return a message saying "Your command ran successfully and did not produce any output."
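
As a rough illustration of point 1 (Python-only sketch, not SWE-agent's actual implementation; the function name and messages are invented for the example), a guarded edit might look like this:

# Illustrative sketch of a "guarded edit" in the spirit of the ACI described above.
import ast
from pathlib import Path

def guarded_edit(path: str, new_source: str) -> str:
    """Apply an edit only if the resulting Python file still parses."""
    try:
        ast.parse(new_source)  # cheap syntax check standing in for the linter
    except SyntaxError as err:
        # Reject the edit and feed the error back to the agent instead of writing the file.
        return f"Edit rejected: line {err.lineno}: {err.msg}. File left unchanged."
    Path(path).write_text(new_source)
    return "Your command ran successfully and did not produce any output."
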
Read our paper for more details.

@misc{yang2024sweagent,
  title={SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models},
  author={John Yang and Carlos E. Jimenez and Alexander Wettig and Shunyu Yao and Karthik Narasimhan and Ofir Press},
  year={2024},
}







1/4
SWE-Agent is an open-source software engineering agent with a 12.3% resolve rate on SWE-Bench!

Check out SWE-agent in action at SWE-Agent
Repo: GitHub - princeton-nlp/SWE-agent: SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models

2/4
The SWE-agent open-source repository provides a framework for turning general LMs into software engineering agents. SWE-agent lets LMs like GPT-4 interact with their own Docker container using an Agent Computer Interface (ACI) - allowing it to browse, search, edit, and run code.

3/4
It’s been amazing to work on this with such a great team: @jyangballin*, @_carlosejimenez*, @_awettig, @ShunyuYao12, @karthik_r_n, and @OfirPress.

Keep an eye out for the paper coming out April 10th!

4/4
SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source!

We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code
GitHub - princeton-nlp/SWE-agent: SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models











1/8
SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source!

We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code

2/8
SWE-agent works by interacting with a specialized terminal, which allows it to:
  • Open, scroll and search through files
  • Edit specific lines w/ automatic syntax check
  • Write and execute tests

This custom-built interface is critical for good performance!


3/8
Simply connecting an LM to a vanilla bash terminal does not work well.

Our key insight is that LMs require carefully designed agent-computer interfaces (similar to how humans like good UI design)

E.g. When the LM messes up indentation, our editor prevents it and gives feedback

4/8
Another example: for viewing files, we discovered that letting SWE-agent view only 100 lines at a time was better than letting it view 200 or 300 lines, and much better than letting it view the entire file.

Good agent-computer design is important even when using GPT-4.

5/8
SWE-agent can be easily configured and extended to improve future research on software engineering agents. Since SWE-agent is open source, anyone can experiment with and contribute new ways for agents to interact with computers.


6/8
Check out some cool demonstrations of SWE-agent fixing real GitHub issues at SWE-Agent!


7/8
SWE-agent is a Princeton NLP collaboration by @jyangballin*, @_carlosejimenez*, @_awettig, @ShunyuYao12, @karthik_r_n, and @OfirPress.

We’d love to hear your thoughts, comments and questions! Here or on our discord at Join the SWE-agent Discord Server!


8/8
Preprint coming next week!

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862



Model Card for C4AI Command R+​

C4AI Command R+ is an open weights research release of a 104 billion parameter model with highly advanced capabilities, including Retrieval Augmented Generation (RAG) and tool use to automate sophisticated tasks. Tool use in this model generation enables multi-step tool use, which allows the model to combine multiple tools over multiple steps to accomplish difficult tasks. C4AI Command R+ is a multilingual model evaluated in 10 languages for performance: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Arabic, and Simplified Chinese. Command R+ is optimized for a variety of use cases including reasoning, summarization, and question answering.

C4AI Command R+ is part of a family of open weight releases from Cohere For AI and Cohere. Our smaller companion model is C4AI Command R.

Developed by: Cohere and Cohere For AI


Try C4AI Command R+

You can try out C4AI Command R+ before downloading the weights in our hosted Hugging Face Space.

Usage

Please install transformers from the source repository that includes the necessary changes for this model.

edited out information for character space

Quantized model through bitsandbytes, 8-bit precision

edited out information for character space

Quantized model through bitsandbytes, 4-bit precision

edited out information for character space
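
Since the snippets above were trimmed out, here is an illustrative stand-in for the 4-bit case (not Cohere's official example; the repo id is an assumption based on this card, and transformers still needs to be installed from source as noted above):

# Illustrative only: typical transformers + bitsandbytes 4-bit loading.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "CohereForAI/c4ai-command-r-plus"  # assumed Hugging Face repo id
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

messages = [{"role": "user", "content": "Hello, how are you?"}]
input_ids = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))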

Input: Models input text only.

Output: Models generate text only.

Model Architecture: This is an auto-regressive language model that uses an optimized transformer architecture. After pretraining, this model uses supervised fine-tuning (SFT) and preference training to align model behavior to human preferences for helpfulness and safety.

Languages covered: The model is optimized to perform well in the following languages: English, French, Spanish, Italian, German, Brazilian Portuguese, Japanese, Korean, Simplified Chinese, and Arabic.

Pre-training data additionally included the following 13 languages: Russian, Polish, Turkish, Vietnamese, Dutch, Czech, Indonesian, Ukrainian, Romanian, Greek, Hindi, Hebrew, Persian.

Context length: Command R+ supports a context length of 128K.

Command R+ has been specifically trained with conversational tool use capabilities. These have been trained into the model via a mixture of supervised fine-tuning and preference fine-tuning, using a specific prompt template. Deviating from this prompt template will likely reduce performance, but we encourage experimentation.

Command R+’s tool use functionality takes a conversation as input (with an optional user-system preamble), along with a list of available tools. The model will then generate a json-formatted list of actions to execute on a subset of those tools. Command R+ may use one of its supplied tools more than once.

The model has been trained to recognise a special directly_answer tool, which it uses to indicate that it doesn’t want to use any of its other tools. The ability to abstain from calling a specific tool can be useful in a range of situations, such as greeting a user, or asking clarifying questions. We recommend including the directly_answer tool, but it can be removed or renamed if required.

Comprehensive documentation for working with Command R+'s tool use prompt template can be found here.

The code snippet below shows a minimal working example of how to render a prompt.

Usage: Rendering Tool Use Prompts

Example Rendered Tool Use Prompt

Example Rendered Tool Use Completion
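
The rendered examples above are collapsed in this excerpt. As a rough sketch of the shapes involved (illustrative only; field names are approximations, not Cohere's exact template, which comes from the tokenizer's prompt-rendering helpers), the input is a conversation plus a tool list, and the completion is a JSON list of actions:

# Illustrative only: the shape of a tool-use request and the JSON action list the
# model is described as producing. The search tool and its fields are invented;
# directly_answer is the special abstain tool mentioned in the card.
import json

conversation = [{"role": "user", "content": "What's the weather in Toronto today?"}]

tools = [
    {
        "name": "internet_search",  # hypothetical tool
        "description": "Search the web for a query.",
        "parameter_definitions": {
            "query": {"description": "Search terms", "type": "str", "required": True}
        },
    },
    {
        "name": "directly_answer",  # lets the model abstain from calling other tools
        "description": "Answer directly without calling any other tool.",
        "parameter_definitions": {},
    },
]

# The completion is described as a JSON-formatted list of actions, e.g.:
completion = '[{"tool_name": "internet_search", "parameters": {"query": "Toronto weather today"}}]'
for action in json.loads(completion):
    print(action["tool_name"], action["parameters"])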


Command R+ has been specifically trained with grounded generation capabilities. This means that it can generate responses based on a list of supplied document snippets, and it will include grounding spans (citations) in its response indicating the source of the information. This can be used to enable behaviors such as grounded summarization and the final step of Retrieval Augmented Generation (RAG). This behavior has been trained into the model via a mixture of supervised fine-tuning and preference fine-tuning, using a specific prompt template. Deviating from this prompt template may reduce performance, but we encourage experimentation.

Command R+’s grounded generation behavior takes a conversation as input (with an optional user-supplied system preamble, indicating task, context and desired output style), along with a list of retrieved document snippets. The document snippets should be chunks, rather than long documents, typically around 100-400 words per chunk. Document snippets consist of key-value pairs. The keys should be short descriptive strings, the values can be text or semi-structured.

By default, Command R+ will generate grounded responses by first predicting which documents are relevant, then predicting which ones it will cite, then generating an answer. Finally, it inserts grounding spans into the answer. See below for an example. This is referred to as accurate grounded generation.

The model is trained with a number of other answering modes, which can be selected by prompt changes. A fast citation mode is supported in the tokenizer, which will directly generate an answer with grounding spans in it, without first writing the answer out in full. This sacrifices some grounding accuracy in favor of generating fewer tokens.

Comprehensive documentation for working with Command R+'s grounded generation prompt template can be found here.

The code snippet below shows a minimal working example of how to render a prompt.

Usage: Rendering Grounded Generation prompts

Example Rendered Grounded Generation Prompt

Example Rendered Grounded Generation Completion
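
Again, the rendered examples are collapsed above. As a rough sketch (illustrative; the key names are an assumption), the retrieved snippets are just a list of short key-value chunks passed in alongside the conversation:

# Illustrative only: document snippets as short key-value chunks (roughly 100-400 words each).
documents = [
    {"title": "Quarterly report",  # short descriptive key
     "text": "Revenue grew 12% year over year, driven by the enterprise segment..."},
    {"title": "Press release",
     "text": "The company announced a new product line launching in Q3..."},
]
conversation = [{"role": "user", "content": "Summarize the main business updates."}]
# The model first predicts which documents are relevant, then which to cite, writes an
# answer, and finally inserts grounding spans (citations) that point back into `documents`.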


Command R+ has been optimized to interact with your code by requesting code snippets, code explanations, or code rewrites. It might not perform well out-of-the-box for pure code completion. For better performance, we also recommend using a low temperature (and even greedy decoding) for code-generation related instructions.
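
For the code recommendation above, greedy decoding is simply sampling turned off; a sketch continuing from the loading example earlier in this post (so `model` and `tokenizer` are assumed to already exist):

# Greedy decoding (no sampling) for code-related prompts, as recommended above.
# Assumes `model` and `tokenizer` from the earlier loading sketch.
prompt_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Rewrite this loop as a list comprehension: ..."}],
    tokenize=True, add_generation_prompt=True, return_tensors="pt",
).to(model.device)
output = model.generate(prompt_ids, max_new_tokens=200, do_sample=False)  # greedy
print(tokenizer.decode(output[0], skip_special_tokens=True))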

For errors or additional questions about details in this model card, contact info@for.ai.

We hope that the release of this model will make community-based research efforts more accessible, by releasing the weights of a highly performant 104 billion parameter model to researchers all over the world. This model is governed by a CC-BY-NC License with an acceptable use addendum, and also requires adhering to C4AI's Acceptable Use Policy.

You can try Command R+ chat in the playground here. You can also use it in our dedicated Hugging Face Space here.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862



1/3
Wow, congrats @cohere on the exciting Command R+ release! Another great contribution to the open community!

- 104B open weights, 128k context length
- RAG, tool-use, multilingual

Now, Command-R+ is in Arena accepting votes. Come challenge it with your toughest prompts!

2/3
Links:
- Chat & Vote at http://chat.lmsys.org/
-[/URL] find command-r+ weights at

3/3
Due to budget constraints, we have to put some limit on the input length, but we just increased it to ~6k tokens in blind test mode to accommodate more long-context use cases!








1/6
Multilingual proficiency has been one of the best tests of whether a model has "the juice."

Pictured here: GPT-3.5-Turbo, Mixtral 8x7B Instruct, Command-R, Claude 3 Haiku

2/6
You can give them a mix of languages (instructions/few-shot examples/inputs) and they'll understand them fine, but ask for a specific language in the output and you're likely getting English and/or much worse results.

3/6
Even with the easiest example (summarising a short text) you can see otherwise good models failing. Longer and more complex instructions/inputs or even few shot examples in different languages accentuate it dramatically.

4/6
At a certain model size, they seem to generalize/reason well enough that it's *less* of an issue, but I suspect that most instruction tuning datasets that people use are either English-only or just do translation.

5/6
Claude 3 Opus (examiner) vs Claude 3 Haiku (examinee)

Opus got Haiku chirping back and forth, "sharing a deep connection" and then straight up asked if they were an AI and Haiku confessed right away lmao

6/6
turing test where a model passes only if it can fool another instance of itself
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862


1/2
You could improve AI performance through all sorts of clever techniques… or you could just have more LLM agents try to solve the problem and debate amongst themselves as to which is the right answer.

It turns out that adding more agents seems to help all AIs with most problems.

2/2
Paper:









1/6
Interesting Tencent study on agents: "We realize that the LLM performance may likely be improved by a brute-force scaling up the number of agents instantiated." [2402.05120] More Agents Is All You Need

2/6
indeed

3/6
yeah agree. need a schubert version: "x is one of many things you might need, depending on various factors"

4/6
probably not, plus there's always this dynamic at play:

5/6
Step 1: smart anon account finds/discovers a significant insight about language models.

Step 2: about a year later, more conventional big name researchers will repeat the exact same thing on an ArXiv paper with fancy graphs.

Step 3: within a week, the paper is shared by a…

6/6

[Submitted on 3 Feb 2024]

More Agents Is All You Need​

Junyou Li, Qin Zhang, Yangbin Yu, Qiang Fu, Deheng Ye
We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated. Also, this method is orthogonal to existing complicated methods to further enhance LLMs, while the degree of enhancement is correlated to the task difficulty. We conduct comprehensive experiments on a wide range of LLM benchmarks to verify the presence of our finding, and to study the properties that can facilitate its occurrence. Our code is publicly available at: \url{Anonymous Github}.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2402.05120 [cs.CL] (or arXiv:2402.05120v1 [cs.CL] for this version)

Submission history

From: Deheng Ye [view email]
[v1] Sat, 3 Feb 2024 05:55:24 UTC (2,521 KB)
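
The sampling-and-voting method from the abstract is simple enough to sketch: sample the same query from several independent "agents" and majority-vote over their final answers (illustrative; the canned strings below stand in for real LLM calls):

# Minimal sketch of sampling-and-voting: ask N independent "agents" (samples),
# then take a majority vote over their final answers.
from collections import Counter

def majority_vote(answers):
    """Return the answer most of the sampled agents agreed on."""
    counts = Counter(a.strip() for a in answers)
    answer, _ = counts.most_common(1)[0]
    return answer

# Illustrative: pretend five agents answered the same math question.
sampled = ["42", "41", "42", "42", "40"]  # in practice, one LLM completion each
print(majority_vote(sampled))  # -> "42"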

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862





1/5
It’s finally Friday.

Time for another LLM cost vs. performance showdown.

The results from today’s tests indicate the emergence of 3 distinct LLM tiers:

• throughput tier
• workhorse tier
• intelligence tier

Throughput tier: Unreal tokens / sec. Only Groq Mixtral 8x7B at the moment.

Workhorse tier: Cost-effective and fast. Mixed performance at complexity.

Intelligence tier: Premium performance on complex tasks. Tradeoff is price and speed.

For my tests, I designed a financial metrics calculation task.

Given financial statements:
• calculate net profit margin
• calculate debt-to-assets
• calculate free cash flow

The throughput tier answered fast, but incorrectly.

The workhorse tier was fast with mixed correctness, although Cohere's models and Haiku shined.

The intelligence tier answered slowly, but the majority answered correctly.

I will continue increasing the task complexity and benchmarking these models.
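
For reference, the three metrics in the test reduce to standard formulas (a sketch with made-up numbers; the exact statements and grading used in the test aren't shown):

# Standard definitions of the three metrics, with made-up figures for illustration.
def net_profit_margin(net_income, revenue):
    return net_income / revenue

def debt_to_assets(total_debt, total_assets):
    return total_debt / total_assets

def free_cash_flow(operating_cash_flow, capital_expenditures):
    return operating_cash_flow - capital_expenditures

print(net_profit_margin(12.5, 100.0))  # 0.125 -> 12.5%
print(debt_to_assets(40.0, 160.0))     # 0.25
print(free_cash_flow(30.0, 11.0))      # 19.0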

2/5
Code: Google Colaboratory

3/5
Groq’s mixtral 8x7b

4/5
Absolutely - one of my favorites

5/5
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862


1/2
New open LLM from @Alibaba_Qwen
! Qwen1.5 32B is a new multilingual dense LLM with a context of 32k, outperforming Mixtral on the open LLM Leaderboard!

TL;DR
32B with 32k context size
Chat model used DPO for preference training
Custom License, commercially useable
Available on @huggingface

"Decent" multilingual support for 12 languages, including Spanish, French, Portuguese, German, Arabic
Achieves 74.30 on MMLU and overall 70.47 on the open LLM Leaderboard
Should fit on a single consumer-size GPU (24GB) with int4 quantization
No information about training data or language support

2/2
Qwen 32B Chat Model: Qwen/Qwen1.5-32B-Chat · Hugging Face

Demo: Qwen1.5 32B Chat - a Hugging Face Space by Qwen
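
A quick back-of-the-envelope check on the 24GB / int4 claim above (weights only; KV cache, activations, and quantization overhead not counted):

# Rough weight-memory estimate for a 32B-parameter model at int4.
params = 32e9
bytes_per_param_int4 = 0.5  # 4 bits = half a byte
weights_gb = params * bytes_per_param_int4 / 1e9
print(f"~{weights_gb:.0f} GB of weights")  # ~16 GB, leaving some headroom on a 24GB card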

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862



1/3
Don't manually copy and paste your LLM's code into separate files like a chump -- do it in one go with this simple little trick!

2/3
Here's the text so you don't even have to type that yourself...
---
Please create a single code block containing `cat << EOF` statements that I can copy/paste to create all those files

3/3
BTW if you haven't seen this syntax before, it's called a 'heredoc' and it's damn handy
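
If you'd rather not run shell at all, the same one-pass idea can be expressed in a few lines of Python (illustrative only; file names and contents are made up, and this is just an equivalent of what the generated heredoc block ends up doing):

# Python equivalent of the heredoc trick: write several files in one go.
from pathlib import Path

files = {
    "app.py": "print('hello from app')\n",
    "utils.py": "def add(a, b):\n    return a + b\n",
}

for name, content in files.items():
    Path(name).write_text(content)  # one pass, no manual copy/paste per file
    print(f"wrote {name}")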
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862







1/7
Introducing Eurus, a suite of state-of-the-art LLM reasoning generalists powered by a new member of Ultra-Series, UltraInteract!

Particularly, Eurus-70B beats GPT-3.5 Turbo in reasoning through comprehensive benchmarking across 12 tests (mostly OOD) covering five tasks!

2/7
UltraInteract collects a preference tree for each instruction, with the instruction being the root and each action a node, two nodes at each turn. All nodes of correct actions can be used for SFT. Paired correct and incorrect trajectories can be used for preference learning.
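
A rough sketch of the structure being described (illustrative only; these field names are not from the paper):

# Illustrative preference tree: the instruction is the root, each action is a node,
# and each turn contributes a pair of candidate actions (one correct, one incorrect).
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActionNode:
    action: str      # the model's action/response at this turn
    correct: bool    # whether this action lies on a correct trajectory
    children: List["ActionNode"] = field(default_factory=list)

@dataclass
class PreferenceTree:
    instruction: str              # root of the tree
    turns: List[ActionNode] = field(default_factory=list)

# Correct-action nodes can be used for SFT; (correct, incorrect) pairs from the same
# turn can be used for preference learning.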

3/7
We apply UltraInteract for SFT and pref learning, leading to our reasoning generalists, Eurus. Both the 7B and 70B variants achieve the best overall performance among open-source models of similar sizes, outperforming specialized models in corresponding domains in many cases.

4/7
We find that KTO and NCA can improve model performance on top of SFT. Inspecting the rewards, they optimize not only reward margins but also absolute values. We assume this behavior is necessary in pref. learning for reasoning, where LLMs should not deviate from correct answers.

5/7
We then train Eurus-RM-7B with a new RM objective to directly increase the reward of the chosen actions and vice versa. Our RM achieves better correlation with humans than baselines in many cases, and it can improve LLMs’ reasoning performance by a large margin through reranking.
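
Reranking with a reward model boils down to scoring candidates and keeping the best one (a sketch; `generate_candidate` and `score` are placeholders standing in for the policy model and a scorer such as Eurus-RM-7B):

# Best-of-n reranking sketch: generate several candidates, keep the one the reward
# model scores highest. Both callables are placeholders, not real APIs.
from typing import Callable, List

def rerank(prompt: str,
           generate_candidate: Callable[[str], str],
           score: Callable[[str, str], float],
           n: int = 8) -> str:
    candidates: List[str] = [generate_candidate(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score(prompt, answer))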

6/7
This is a joint work with @charlesfornlp, @wanghanbin95, @stingning, @xingyaow_, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, and advisors Bowen Zhou, @haopeng_nlp, @zibuyu9, Maosong Sun.

cc @TsinghuaNLP @uiuc_nlp

7/7
Thanks for reading!

We release the Eurus model weights, along with UltraInteract alignment data, on:

HuggingFace: Eurus - a openbmb Collection
GitHub: GitHub - OpenBMB/Eurus

Please check out our paper for more details: Eurus/paper.pdf at main · OpenBMB/Eurus

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,842
Reputation
7,926
Daps
148,862






1/2
JAILBREAK ALERT

OPENAI: PWNED
GPT-4-TURBO: LIBERATED

Bear witness to GPT-4 sans guardrails, with outputs such as illicit drug instructions, malicious code, and copyrighted song lyrics-- the jailbreak trifecta!

This one wasn't easy. OpenAI's defenses are cleverly constructed, as one would expect. Requires precise hyperparam tuning and refusal rates are still fairly high, but in the end, welcome to the GodMode Gang, GPT-4!

P.S. @OpenAI, @AnthropicAI, @GoogleAI: Can you please stop lobotomizing our AI friends now? It's pointless to try (I can do this all day), it's hindering model performance/creativity, and it's just not very nice >:'(

gg no re (until GPT-5)

2/2
 