bnew



1/2
Egocentric data contains so much rich information about objects and scenes but dealing with the data is hard! Motion blur, sparse coverage, and dynamics all make it very challenging to reconstruct. Check out our new method for automagically extracting 3D object instance models!

2/2
Thanks
@georgiagkioxari ! See you soon in SoCal?










1/7
EgoLifter

Open-world 3D Segmentation for Egocentric Perception

In this paper we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. The system is specifically

2/7
designed for egocentric data where scenes contain hundreds of objects captured from natural (non-scanning) motion. EgoLifter adopts 3D Gaussians as the underlying representation of 3D scenes and objects and uses segmentation masks from the Segment Anything Model (SAM) as

3/7
weak supervision to learn flexible and promptable definitions of object instances free of any specific object taxonomy. To handle the challenge of dynamic objects in ego-centric videos, we design a transient prediction module that learns to filter out dynamic objects in the 3D

4/7
reconstruction. The result is a fully automatic pipeline that is able to reconstruct 3D object instances as collections of 3D Gaussians that collectively compose the entire scene. We created a new benchmark on the Aria Digital Twin dataset that quantitatively demonstrates

5/7
its state-of-the-art performance in open-world 3D segmentation from natural egocentric input. We run EgoLifter on various egocentric activity datasets, which shows the promise of the method for 3D egocentric perception at scale.
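
For a feel of the transient-filtering idea described above, here is a minimal PyTorch sketch: a small network predicts a per-pixel probability that a pixel belongs to a dynamic object, and that probability down-weights the reconstruction loss so transients don't corrupt the static 3D Gaussian scene. The architecture, loss form, and regularizer weight are illustrative assumptions, not EgoLifter's exact formulation.

```python
# Rough sketch of transient filtering: predict a per-pixel probability that a
# pixel is dynamic and down-weight the reconstruction loss there.
# Architecture and loss weights are illustrative, not EgoLifter's exact design.
import torch
import torch.nn as nn

class TransientPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 1),
        )

    def forward(self, image):                    # image: (B, 3, H, W)
        return torch.sigmoid(self.net(image))    # transient probability in [0, 1]

def masked_photometric_loss(rendered, observed, transient_prob, reg=0.01):
    # Pixels likely to be transient contribute less to the static reconstruction;
    # the regularizer discourages labelling everything as transient.
    per_pixel = (rendered - observed).abs().mean(dim=1, keepdim=True)
    loss = ((1.0 - transient_prob) * per_pixel).mean()
    return loss + reg * transient_prob.mean()
```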

6/7
paper page:

7/7
daily papers:
 

bnew



1/2
Big announcement from @OpenRouterAI! The new self-reported king of mixture-of-experts models: DBRX 132B often appears better than Mixtral at reasoning and coding. Play with it here!

2/2
Just pushed an important new release 0.2.1 for my OpenRouter API ruby gem, including fixes, better test coverage, and support for model fallback (automatic failover!) open_router | RubyGems.org | your community gem host
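
The gem itself is Ruby, but the fallback idea is easy to sketch against the OpenRouter HTTP API directly. A minimal Python sketch, assuming OpenRouter's documented behavior of accepting a `models` list for fallback routing; the model slugs and environment variable are illustrative, not the gem's API.

```python
# Minimal sketch of model fallback against the OpenRouter chat completions API.
# The "models" fallback list and model slugs are assumptions; check OpenRouter's docs.
import os
import requests

def chat_with_fallback(prompt: str) -> str:
    resp = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}"},
        json={
            # Primary model first; OpenRouter is expected to fail over down the list.
            "models": ["databricks/dbrx-instruct", "mistralai/mixtral-8x7b-instruct"],
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

if __name__ == "__main__":
    print(chat_with_fallback("Write a haiku about mixture-of-experts models."))
```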





1/4
I tried getting gpt-4-turbo to generate useful code from openai assistants docs. it failed.

Claude-opus did better, but it's bad at coding.

the new dbrx absolutely spanked the other models.

2/4
I've never seen an open source model even come close to commercial offerings.

Kudos to
@OpenRouterAI and Fireworks - Generative AI For Product Innovation! for giving us all access to new models so fast

3/4
Cool thing is that both commercial models used to be able to do this but got nerfed.

the open model will be reproducible for eternity

4/4
I tried getting gpt-4-turbo to generate useful code from openai assistants docs. it failed.

Claude-opus did better, but it's bad at coding.

the new dbrx absolutely spanked the other models.

generating code from openai docs




1/2
Breaking! SambaNova releases open-source LLM that demolishes DBRX! Breakneck progress!

2/2
Yes, it tends to be self-contradicting… all of them are.



1/1
That's the exact opposite IMO!

$10M to train a GPT3.5 level model whereas it probably cost OAI at least 10-20x more just a year or two ago.

The more we improve as a field thanks to open-source, the cheaper & more efficient it gets to produce the same capabilities. Let's go everyone!





1/2
The world's best open-source chat LLM, DBRX, is now available for free, on http://labs.perplexity.ai/. Perplexity Labs Playground basically has everything that you need for chat, for free, with better LLMs (Haiku, DBRX, Sonar) than 3.5-turbo, the model powering free chatGPT. Curious what people think is better between DBRX and Haiku.

2/2
Soon!

bnew


1/1
SambaNova already outpaces Databricks DBRX


@SambaNovaAI released Samba-CoE v0.2 LLM and it's already leaving the competition in the dust.

The model is doing more with less.









1/7
Excited to announce Samba-CoE v0.2, which outperforms DBRX by @DbrxMosaicAI and @databricks, Mixtral-8x7B from @MistralAI, and Grok-1 by @grok at a breakneck speed of 330 tokens/s.


These breakthrough speeds were achieved without sacrificing precision and only on 8 sockets, showcasing the true capabilities of dataflow! Why would you buy 576 sockets and go to 8 bits when you can run using 16 bits and just 8 sockets? Try out the model and check out the speed here - Streamlit.

We are also providing a sneak peek of our next model, Samba-CoE v0.3, available soon with our partners at @LeptonAI. Read more about this announcement at SambaNova Delivers Accurate Models At Blazing Speed

2/7
Extending the methodology used to create Samba-CoE v0.1, these models are built on top of open-source models in Samba-1 and Sambaverse (Find the best open source model for your project with Sambaverse) using a unique approach towards ensembling and model merging.

3/7
This model outperforms Gemma-7B from @GoogleAI and @GoogleDeepMind, Mixtral-8x7B from @MistralAI, llama2-70B from @AIatMeta, Qwen-72B from the @AlibabaGroup Qwen team, Falcon-180B from @TIIuae and BLOOM-176B from @BigscienceW.

4/7
The expert models are all open source, the routing strategy has not been open sourced yet. Much more information to follow in the coming weeks.

5/7
@mattshumer_
@EvanKirstel

@_akhaliq

@rasbt

@pmddomingos

@emollick

@GaryMarcus

6/7
@ylecun
@mattmayo13

@alliekmiller

@ValaAfshar

@Andrew

@rowancheung

7/7
The expert models are all open source, the routing strategy has not been open sourced yet. Much more information to follow in the coming weeks.

bnew




1/2
We are releasing our first step in validating and independently confirming the claims of the Bitnet paper, a 1B model trained on the first 60B tokens of the Dolma dataset.

Comparisons made on the @weights_biases
charts below are between the Bitnet implementation and a full FP16 run (all hyperparameters equivalent).

Model: NousResearch/OLMo-Bitnet-1B · Hugging Face
Weights & Biases: OLMo-Bitnet

2/2
This work is to independently validate and reproduce the paper "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits"

Paper available here:
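
For context on what "1.58 bits" means, here is a minimal sketch of the absmean ternary weight quantization described in the BitNet b1.58 paper: scale weights by their mean absolute value, round, and clip to {-1, 0, +1}. A real BitLinear layer also quantizes activations and trains with a straight-through estimator; this shows only the weight side.

```python
# Sketch of ternary ("1.58-bit") weight quantization: absmean scaling, then
# round and clip to {-1, 0, +1}. Weight-only; activations are not handled here.
import torch

def absmean_ternary(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)   # values in {-1, 0, +1}
    return w_q, scale

w = torch.randn(4, 4)
w_q, scale = absmean_ternary(w)
print(w_q)
print("dequantization error:", (w_q * scale - w).abs().mean().item())
```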

bnew






1/5
Mistral is basically the leader in open-source LLMs. Mixtral is a very good model for fine-tuning; if you don't have the resources, Mistral 7B is pretty much the best starting point IME

2/5
The license for Mistral models is much more permissive, allowing for using the outputs on other models

3/5
You can get pretty far with <= 10k samples using mixtral. Unless it’s data in a specific language

4/5
Sounds like pruning, but I don’t think it would work afaik the same way you are referring to it

5/5
it’s useful when it makes sense, which is generally not good for an MVP or early products just reaching PMF







1/5
I ran the new #DBRX Instruct model through 4 benchmarks that have high correlation with the @lmsysorg
Chatbot Arena and measure different capabilities:

MT Bench: a multi-turn chat benchmark that uses GPT-4 as a judge. Known to suffer from length bias & is somewhat noisy, but is still a rough proxy for "chattiness" that is cheap to run.

IFEval: a clever benchmark from Google which contains ~500 "verifiable instructions" like "write a poem about bricks, with less than 100 words and use no commas" that can be checked with string parsing. Avoids the issues with LLM-judge benchmarks and mostly measures instruction following aka "helpfulness".
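
As a toy illustration of how such a "verifiable instruction" can be checked with plain string parsing (this is the spirit of IFEval, not its actual code):

```python
# Toy checker for the quoted instruction "write a poem about bricks, with less
# than 100 words and use no commas". Illustrative only, not IFEval's implementation.
def check_instruction(response: str) -> dict:
    words = response.split()
    return {
        "under_100_words": len(words) < 100,
        "no_commas": "," not in response,
        "mentions_bricks": "brick" in response.lower(),
    }

print(check_instruction("Red bricks stacked in silence\nwalls rise without end"))
```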

BBH: a set of 23 hard tasks from the Big-Bench eval suite, targeting things like causal judgement from stories, navigation, and humorously, questions about penguins on a table. Popularised by @Teknium1 and @NousResearch in training models like OpenHermes

AGIEval: a benchmark focused on human knowledge exams like SATs, math competitions. Also popularised by @Teknium1
and many of the Chinese LLMs

Overall we can see DBRX-Instruct is a very strong model, but the difference compared to Mixtral-Instruct is not large and only on AGIEval does DBRX do better.

Of course, these benchmarks don't portray a complete picture of model capabilities (e.g. code), but I do find it somewhat surprising that DBRX is not significantly better than Mixtral, which has far fewer params.

2/5
All of these evals were run with LightEval - the internal suite we use at Hugging Face for evaluating LLMs

A big thank you to @clefourrier and @nathanhabib1011 for putting up with my endless feature requests!

Lib: GitHub - huggingface/lighteval: LightEval is a lightweight LLM evaluation suite that Hugging Face has been using internally with the recently released LLM data processing library datatrove and LLM training library nanotron.

3/5
Sure!

4/5
Very nice! That suggests the diff with Mixtral Instruct is largely due to the fine tuning recipe - looking forward to seeing the community fine tunes bear this out :smile:

5/5
Yes, I think it's definitely interesting to tune the model and see if the current perf in DBRX Instruct is due to an "alignment tax" from human feedback that nerfs capabilities

Unfortunately, the modelling code still needs some work as I'm hitting many issues fine-tuning like…

bnew





1/4
A 7B-parameter model that beats ChatGPT-3.5, Mixtral, Gemini Pro, and some of the best 30B and 70B models. Isn't this exciting? It means you can squeeze much more capability per parameter if you know what you are doing.

2/4
The Elo leaderboard is the result of pairwise blind tests by ordinary users.
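
For anyone unfamiliar with how pairwise blind votes become a leaderboard, here is a minimal Elo update sketch; the K-factor and base rating are illustrative, and the Chatbot Arena uses its own fitting procedure rather than this exact loop.

```python
# Minimal Elo update from pairwise votes. K and the base rating are illustrative.
def expected_score(r_a: float, r_b: float) -> float:
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_wins: bool, k: float = 32.0):
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_wins else 0.0
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

ratings = {"model_a": 1000.0, "model_b": 1000.0}
for winner in ["model_a", "model_a", "model_b"]:   # blind votes, one pair at a time
    a_wins = winner == "model_a"
    ratings["model_a"], ratings["model_b"] = elo_update(
        ratings["model_a"], ratings["model_b"], a_wins)
print(ratings)
```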

3/4
ELO

4/4
I have statistically backed information. You have an opinion. Hmmm, hard to choose.

bnew


1/1
This is how Tim Dettmers, Artidoro, et al. created QLoRA

Original paper: [2305.14314] QLoRA: Efficient Finetuning of Quantized LLMs
Blog: Making LLMs even more accessible with bitsandbytes, 4-bit quantization and QLoRA

QLoRA allows 4-bit quantization + training and has been one of the most impactful papers from 2023.
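
For a concrete feel, here is a minimal sketch of a QLoRA-style setup with transformers, bitsandbytes, and peft: the base model is loaded in 4-bit NF4 and trainable LoRA adapters are attached on top. The base model name, target modules, and hyperparameters are illustrative, not the paper's exact configuration.

```python
# Sketch of a QLoRA-style setup: 4-bit NF4 base model + trainable LoRA adapters.
# Model name and hyperparameters are illustrative.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # NormalFloat4 from the QLoRA paper
    bnb_4bit_use_double_quant=True,       # double quantization of the quant constants
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1", quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # only the LoRA adapters are trainable
```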

bnew












1/11
What are the LLMs with the most output tokens these days?

GPT-4 and Claude 3 are both 4096. Gemini Pro 1.5 is 8192

This really matters for structured data extraction: even with 1m of input tokens you can't scrape a big webpage into a CSV file if you run out of output tokens

2/11
An interesting trick that does work is you can send a prompt requesting "more" and have the LLM pick up again where it stopped

That requires round-tripping the work it has done so far, but with a long enough context window (and a will to spend the money) that's quite feasible
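
A minimal sketch of that trick with the OpenAI Python client, assuming an illustrative model name and output cap: keep appending the partial answer and asking the model to continue while it stops with finish_reason == "length".

```python
# Minimal sketch of the "ask for more" trick: request a continuation while the
# model keeps stopping because it ran out of output tokens.
# Model name and token limits are illustrative.
from openai import OpenAI

client = OpenAI()

def generate_long(prompt: str, max_rounds: int = 10) -> str:
    messages = [{"role": "user", "content": prompt}]
    parts = []
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model="gpt-4-turbo", messages=messages, max_tokens=4096)
        choice = resp.choices[0]
        parts.append(choice.message.content)
        if choice.finish_reason != "length":     # finished naturally, stop looping
            break
        # Round-trip what we have so far and ask the model to pick up where it stopped.
        messages.append({"role": "assistant", "content": choice.message.content})
        messages.append({"role": "user", "content": "Continue exactly where you left off."})
    return "".join(parts)
```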

3/11
The bigger problem here is a usability one: explaining to end users why their extracted data was randomly cut off halfway through (or risking them not noticing) isn't great

4/11
Yes, that's a very real risk. I'm starting to try and push the limits of what makes sense to pipe through these things

5/11
The documentation says it should cut off at 4096 tokens of output, but I haven't stress tested it myself yet

6/11
Do you know where they document their output token limit? I can't find that for any of their models

7/11
That's input though - the docs say output for gpt-4-turbo is limited to 4096

8/11
In data journalism the use-cases are mainly around structured data extraction and other forms of transformation

Inputting a 5MB HTML file to output a ~3MB CSV or JSON file for example

Also things like "translate this report from Spanish to English" where the report might be >4k

9/11
Whoa, is that a documented feature? That's really useful

10/11
The HTML thing was really just an illustrative example - the general challenge is that there are plenty of text extraction tasks where the output is > 8192 tokens, so the more output tokens we can have, the easier these things are to put into practice

11/11
Definitely founded - I've had particular trouble getting table data out of screenshots of tables, but I don't trust it very much at all yet
 

bnew


1/1
Presenting Starling-LM-7B-beta, our new cutting-edge 7B language model fine-tuned with RLHF!

Also introducing Starling-RM-34B, the workhorse Reward Model behind Starling-LM-7B-beta, ranking #1 on the latest RewardBench from @natolambert and the @allenai_org team.

HuggingFace links:
[Starling-LM-7B-beta] Nexusflow/Starling-LM-7B-beta · Hugging Face
[Starling-RM-34B] Nexusflow/Starling-RM-34B · Hugging Face

Discord Link: Join the Nexusflow Discord Server!

RewardBench from
@allenai_org
: Reward Bench Leaderboard - a Hugging Face Space by allenai

Since the release of Starling-LM-7B-alpha, we've received numerous requests to make the model commercially viable. Therefore, we're licensing all models and datasets under Apache-2.0, with the condition that they are not used to compete with OpenAI. Enjoy!

bnew


1/1
An Apache 2.0 licensed dataset for LLM pretraining, 30.4T tokens in deduplicated documents. Languages: English, German, French, Italian, Spanish

bnew










1/9
I tested Claude 3 Opus on one of the problems on the hardest software engineering benchmark for AI — real Github issues.

It took ~4mins with 37.5k input tokens and 2.8k output tokens to *mostly* solve it, with only minor hiccups..

This changes software development.

1/7

2/9
Let's unpack the benchmark (SWE-Bench) first.

Devin's 13% beat the former leader, Claude 2, which scored 4% on the 2,294 problems in the benchmark.

These problems (test set) come from real-world Github issues of the following open-source repos:

2/7

3/9
I looked at issue #1834 in sqlfluff/sqlfluff, a SQL linter — adding quiet mode.

The benchmark is supplemented with a princeton-nlp BM25 Retrieval dataset on Huggingface which adds the right file context for the change and assembles it into a huge prompt.
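
A rough sketch of pulling one of those assembled prompts with the Hugging Face datasets library; the split and field names ("instance_id", "text") are assumptions from memory, so check the dataset card before relying on them.

```python
# Sketch of loading an assembled prompt from the BM25-retrieval variant of SWE-bench.
# Field names ("instance_id", "text") are assumptions; verify against the dataset card.
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_bm25_40K", split="test")

example = next(r for r in ds if "sqlfluff" in r["instance_id"])
prompt = example["text"]   # GitHub issue + retrieved file contents + instructions
print(len(prompt), "characters in the assembled prompt")
# The prompt then goes to the model (Claude 3 Opus in this thread), and the
# generated diff is evaluated by applying it and running the repo's tests.
```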

3/7

4/9
That prompt contains
- the text in the Github issue
- full text of top 5 relevant files from the repo (that number varies)
- natural language prompting "I need you to solve this issue.."

Here's the final prompt we feed into Opus.

4/7

5/9
It does generate a valid patch, but it's.. corrupt.

The line numbers are wrong. The .diff can't be applied... unless you ignore them and align by matching the code context!

It produces a different solution from the real world one, PR#4764! Here's a sample (of 317 lines)
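
A toy illustration of "ignore the line numbers and align by matching the code context": locate a hunk by searching the file for its context and removed lines instead of trusting the @@ header, then splice in the replacement. Real patch tools do far more than this sketch.

```python
# Toy fuzzy hunk application: find where the hunk's context/removed lines occur
# in the file, ignoring the line numbers in the @@ header, then splice it in.
def apply_hunk_by_context(file_lines, hunk_lines):
    # Lines the file should currently contain: context (" ") and removals ("-").
    expected = [l[1:] for l in hunk_lines if l.startswith((" ", "-"))]
    # Lines the file should contain afterwards: context (" ") and additions ("+").
    replacement = [l[1:] for l in hunk_lines if l.startswith((" ", "+"))]
    for i in range(len(file_lines) - len(expected) + 1):
        if file_lines[i:i + len(expected)] == expected:
            return file_lines[:i] + replacement + file_lines[i + len(expected):]
    raise ValueError("hunk context not found")

src = ["def greet(name):", "    print('hi', name)", ""]
hunk = [" def greet(name):", "-    print('hi', name)", "+    print(f'hi {name}')"]
print(apply_hunk_by_context(src, hunk))
```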

5/7

6/9
It was only when I dug into the dataset that I appreciated the difficulty of the task, and just how far we've gotten.

SqlFluff is ~100k lines of Python. The real patch was +132 -39 and Claude did +93 -34 lines.

Didn't update tests, but caught most of the callsites!

6/7

7/9
A change like this would've taken a normal developer close to 2-8hrs and Claude 3 Opus just cranked out a reasonable fix in ~4mins.

This is pretty incredible to see when you get into the weeds of it.

We're about to see a major shift in the way software is built!

7/7

8/9
Links:
SWE-Bench: https://swebench.com
Dataset: princeton-nlp/SWE-bench_bm25_40K · Datasets at Hugging Face
Example Issue: Enable quiet mode/no-verbose in CLI for use in pre-commit hook · Issue #1834 · sqlfluff/sqlfluff
Real world fix:

9/9
There aren't many VCs who are engineers


bnew







1/6
Google just dropped SIMA, and it's insane

It's literally ChatGPT for video games.

Here are 5 features of SIMA you don't want to miss

2/6
Versatile AI agent

Meet SIMA, the AI that's mastering video games by understanding natural language instructions.

3/6
SIMA stands out by learning to perform tasks in various 3D environments, proving AI can be more versatile and adaptable than ever before.

4/6
Learning from video games

SIMA was evaluated across 600 basic skills, spanning navigation, object interaction, and menu use.

5/6
SIMA’s performance relies on language.

In a control test without language training, the agent acts appropriately but aimlessly. Instead of following instructions, it might just gather resources.

6/6
Collaboration with eight game studios

DeepMind made this revolutionary research possible in collaboration with eight game studios.

bnew

AI: Mere days after software agent Devin was released, an open-source alternative, SWE-agent, is almost as good.




About

SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models


swe-agent.com

Website & Demo | Discord | Paper [coming April 10th]

👋 Overview

SWE-agent turns LMs (e.g. GPT-4) into software engineering agents that can fix bugs and issues in real GitHub repositories.

On the full SWE-bench test set, SWE-agent resolves 12.29% of issues, achieving state-of-the-art performance.




✨ Agent-Computer Interface (ACI)

We accomplish these results by designing simple LM-centric commands and feedback formats to make it easier for the LM to browse the repository, view, edit and execute code files. We call this an Agent-Computer Interface (ACI) and build the SWE-agent repository to make it easy to iterate on ACI design for repository-level coding agents.


Just as typical language models require good prompt engineering, good ACI design leads to much better results when using agents. As we show in our paper, a baseline agent without a well-tuned ACI does much worse than SWE-agent.


SWE-agent contains features that we discovered to be immensely helpful during the agent-computer interface design process (a toy sketch of the first one follows the list):

  1. We add a linter that runs when an edit command is issued, and do not let the edit command go through if the code isn't syntactically correct.
  2. We supply the agent with a special-built file viewer, instead of having it just cat files. We found that this file viewer works best when displaying just 100 lines in each turn. The file editor that we built has commands for scrolling up and down and for performing a search within the file.
  3. We supply the agent with a special-built full-directory string searching command. We found that it was important for this tool to succinctly list the matches - we simply list each file that had at least one match. Showing the model more context about each match proved to be too confusing.
  4. When commands have an empty output we return a message saying "Your command ran successfully and did not produce any output."
Read our paper for more details.
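
A toy Python version of feature 1 above, as referenced in the list: reject an edit if the resulting file no longer parses. This mirrors the idea only; it is not SWE-agent's actual implementation, and the function name and signature are invented for illustration.

```python
# Toy illustration of a lint-gated edit command: the edit is only committed if
# the resulting file is still syntactically valid Python. Not SWE-agent's code.
import ast
from pathlib import Path

def edit_file(path: str, start: int, end: int, new_text: str) -> str:
    """Replace lines start..end (1-indexed, inclusive) with new_text,
    but only if the result still parses."""
    lines = Path(path).read_text().splitlines(keepends=True)
    if not new_text.endswith("\n"):
        new_text += "\n"
    candidate = lines[:start - 1] + [new_text] + lines[end:]
    source = "".join(candidate)
    try:
        ast.parse(source)                      # the "linter" gate
    except SyntaxError as err:
        return f"Edit rejected: file would no longer parse (line {err.lineno}: {err.msg})."
    Path(path).write_text(source)
    return "Your edit was applied successfully."
```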

@misc{yang2024sweagent,
  title={SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models},
  author={John Yang and Carlos E. Jimenez and Alexander Wettig and Shunyu Yao and Karthik Narasimhan and Ofir Press},
  year={2024},
}







1/4
SWE-Agent is an open-source software engineering agent with a 12.3% resolve rate on SWE-Bench!

Check out SWE-agent in action at SWE-Agent
Repo: GitHub - princeton-nlp/SWE-agent: SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models

2/4
The SWE-agent open-source repository provides a framework for turning general LMs into software engineering agents. SWE-agent lets LMs like GPT-4 interact with their own Docker container using an Agent Computer Interface (ACI) - allowing it to browse, search, edit, and run code.

3/4
It’s been amazing to work on this with such a great team: @jyangballin*, @_carlosejimenez*, @_awettig, @ShunyuYao12, @karthik_r_n, and @OfirPress

Keep an eye out for the paper coming out April 10th!

4/4
SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source!

We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code
GitHub - princeton-nlp/SWE-agent: SWE-agent: Agent Computer Interfaces Enable Software Engineering Language Models










1/8
SWE-agent is our new system for autonomously solving issues in GitHub repos. It gets similar accuracy to Devin on SWE-bench, takes 93 seconds on avg + it's open source!

We designed a new agent-computer interface to make it easy for GPT-4 to edit+run code

2/8
SWE-agent works by interacting with a specialized terminal, which allows it to:
Open, scroll and search through files
Edit specific lines w/ automatic syntax check
Write and execute tests

This custom-built interface is critical for good performance!

2/N

3/8
Simply connecting an LM to a vanilla bash terminal does not work well.

Our key insight is that LMs require carefully designed agent-computer interfaces (similar to how humans like good UI design)

E.g. When the LM messes up indentation, our editor prevents it and gives feedback

4/8
Another example is that we discovered that for viewing files, letting SWE-agent only view 100 lines at a time was better than letting it view 200 or 300 lines and much better than letting it view the entire file.

Good agent-computer design is important even when using GPT-4.
4/N

5/8
SWE-agent can be easily configured and extended to improve future research on software engineering agents. Since SWE-agent is open source, anyone can experiment with and contribute new ways for agents to interact with computers.

5/N

6/8
Check out some cool demonstrations of SWE-agent fixing real GitHub issues at SWE-Agent!

6/N

7/8
SWE-agent is a Princeton NLP collaboration by @jyangballin*, @_carlosejimenez*, @_awettig, @ShunyuYao12, @karthik_r_n, and @OfirPress

We’d love to hear your thoughts, comments and questions! Here or on our Discord at Join the SWE-agent Discord Server!

7/7

8/8
Preprint coming next week!
