bnew

1/2
Microsoft presents EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

- 20%-40% faster than EAGLE-1 (i.e. 3.05x - 4.26x faster than the baseline)
- Ensures that the distribution of the generated text remains unchanged

abs: [2406.16858] EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
repo: GitHub - SafeAILab/EAGLE: Official Implementation of EAGLE
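
For readers wondering how a drafted token tree can speed decoding without changing what the model outputs: the guarantee comes from the standard speculative-sampling accept/reject rule applied to each drafted token. Below is a minimal PyTorch sketch of that generic rule, not EAGLE-2's code; the function name and inputs are assumptions for illustration.

```python
import torch

def speculative_accept(p_target, q_draft, draft_token):
    """Generic speculative-sampling accept/reject step (illustrative sketch,
    not EAGLE-2's implementation). p_target and q_draft are probability
    vectors over the vocabulary at the same position; draft_token is the
    token id proposed by the draft model."""
    p, q = p_target[draft_token], q_draft[draft_token]
    if torch.rand(()) < torch.clamp(p / q, max=1.0):
        return draft_token  # accept: keep the drafted token
    # reject: resample from the residual distribution max(p - q, 0);
    # this accept/reject scheme leaves the sample distributed exactly as p_target
    residual = torch.clamp(p_target - q_draft, min=0.0)
    residual = residual / residual.sum()
    return torch.multinomial(residual, 1).item()
```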

2/2
Dark mode for this paper 🌙 EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees



bnew

1/4
Long Context Transfer from Language to Vision

- Can process 2000 frames or over 200K visual tokens
- SotA perf on VideoMME among 7B-scale models

abs: [2406.16852] Long Context Transfer from Language to Vision
repo: GitHub - EvolvingLMMs-Lab/LongVA: Long Context Transfer from Language to Vision

2/4
Dark mode for this paper for night readers 🌚 Long Context Transfer from Language to Vision

3/4
AI Summary: The paper introduces a method called long context transfer to enable Large Multimodal Models (LMMs) to understand extremely long videos by extrapolating the context length of the language backbon...
Long Context Transfer from Language to Vision

4/4
Didn't get it. Don't they need more training time?



bnew

1/7
Adam-mini: Use Fewer Learning Rates To Gain More

Achieves 50% higher throughput than AdamW on Llama2-7B

[2406.16793] Adam-mini: Use Fewer Learning Rates To Gain More

2/7
Adam-mini ([2406.16793] Adam-mini: Use Fewer Learning Rates To Gain More, left) is identical to NVIDIA's NovoGrad (https://arxiv.org/pdf/1905.11286, right) up to a constant factor.
The major difference is that NovoGrad has an official implementation and multiple reproductions (apex/apex/optimizers/fused_novograd.py at master · NVIDIA/apex), while this doesn't :/

3/7
It used to be that you'd have to beg paper authors to include a useful transformer baseline.

Now, you don't even have to ask: 7B-scale Llama tests are the default, other archs are shafted under "Non-LLM Tasks", and special changes are made specifically to target Transformer perf.

4/7
Thanks to @arankomatsuzaki for promoting our work!

As shown in the figure, Adam-mini saves 45-50% of Adam's memory footprint and can save 33% of training time on Llama2-7B pre-training! The design of Adam-mini is built upon certain Hessian structures we observed in Transformers.

Our pytorch implementation is now available at GitHub - zyushun/Adam-mini . Feel free to try it out! Hope Adam-mini can help save time, cost, and energy in your tasks!
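
For intuition about "fewer learning rates", here is a toy PyTorch sketch of the block-wise idea: keep Adam's per-coordinate momentum, but collapse the second-moment estimate to one scalar per parameter block. This is only an illustration of the concept described in the tweet, not the authors' implementation; bias correction and weight decay are omitted.

```python
import torch

class BlockwiseAdamSketch(torch.optim.Optimizer):
    """Toy sketch of the 'fewer learning rates' idea: keep Adam's per-coordinate
    first moment, but collapse the second moment to a single scalar per parameter
    block (here, per tensor). Illustration only, not the authors' implementation."""

    def __init__(self, params, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
        super().__init__(params, dict(lr=lr, betas=betas, eps=eps))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            b1, b2 = group["betas"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                state = self.state[p]
                if not state:
                    state["m"] = torch.zeros_like(p)               # per-coordinate momentum
                    state["v"] = torch.zeros((), device=p.device)  # one scalar per block
                state["m"].mul_(b1).add_(p.grad, alpha=1 - b1)
                state["v"].mul_(b2).add_((1 - b2) * p.grad.pow(2).mean())
                p.add_(state["m"] / (state["v"].sqrt() + group["eps"]), alpha=-group["lr"])
```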

5/7
Empty github repo...

6/7
@ericzhang0410 @ZiniuLi

7/7
[QA] Adam-mini: Use Fewer Learning Rates To Gain More



bnew

1/2
Google presents Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation

[2406.16807] Beyond Thumbs Up/Down: Untangling Challenges of Fine-Grained Feedback for Text-to-Image Generation

2/2
Useless feedback if users aren’t aligned on the meaning of the words.



bnew

1/11
Have you wondered how the decision boundary of in-context learning in LLMs compares to traditional models like Decision Trees and KNN? 🤔

Our research uncovers unexpected irregularities and non-smoothness in LLMs' in-context decision boundaries. 🔍

📄: [2406.11233] Probing the Decision Boundaries of In-context Learning in Large Language Models

2/11
1/n As the number of in-context examples increases, LLMs can achieve high accuracy on linear and non-linear classification tasks.
But how reliable are these in-context classifiers? 🤔
We probe their decision boundaries to find out.

3/11
2/n By visualizing the decision boundaries, we show that SOTA LLMs, ranging from 1B open models to large closed-source models such as GPT-3.5-turbo and GPT-4o, all exhibit different non-smooth, irregular decision boundaries, even on simple linearly separable tasks.
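
A rough sketch of how such a probe can be set up: serialize labeled 2D points into a prompt, query the model on a dense grid of test points, and plot the predicted label map. The prompt template and the `query_llm` callable below are assumptions for illustration, not the paper's exact protocol.

```python
import numpy as np

def build_prompt(xs, ys, query):
    """Serialize labeled 2D points as in-context examples plus one query point.
    The template is an assumption for illustration, not the paper's exact format."""
    lines = [f"Input: {x[0]:.2f} {x[1]:.2f} Label: {'A' if y else 'B'}"
             for x, y in zip(xs, ys)]
    lines.append(f"Input: {query[0]:.2f} {query[1]:.2f} Label:")
    return "\n".join(lines)

def probe_boundary(xs, ys, query_llm, grid_size=50):
    """Query the model on a dense 2D grid and return the predicted label map,
    from which the decision boundary can be plotted. `query_llm` is a
    hypothetical callable mapping a prompt string to 'A' or 'B'."""
    lo, hi = xs.min() - 1.0, xs.max() + 1.0
    grid = np.linspace(lo, hi, grid_size)
    labels = np.zeros((grid_size, grid_size), dtype=int)
    for i, a in enumerate(grid):
        for j, b in enumerate(grid):
            labels[i, j] = int(query_llm(build_prompt(xs, ys, (a, b))) == "A")
    return grid, labels
```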

4/11
3/n How do these irregularities arise?
We study various factors that impact decision boundary smoothness in LLMs, including in-context example count, quantization levels, label semantics & example order.
Then, we identify methods to improve the smoothness of the boundaries.🛠️

5/11
4/n First, increasing in-context examples does not guarantee smoother decision boundaries. While classification accuracy improves with more in-context examples, the decision boundary remains fragmented.

6/11
5/n Decision boundaries are sensitive to label names, example order and quantization.

Shuffling in-context examples and labels changes the model's decision boundaries, suggesting they depend on the LLM's semantic prior knowledge of the labels and are not permutation invariant.

Reducing precision from 8-bit to 4-bit impacts areas near the boundary with high uncertainties. Varying quantization levels can flip LLM decisions in these uncertain regions. 🔄

7/11
6/n Can we improve decision boundary smoothness in LLMs through training? 🛠️
We show that fine-tuning on simple linearly separable tasks can improve the smoothness of decision boundaries and generalize to more complex non-linear, multi-class tasks, enhancing robustness.

8/11
7/n Further, we show that fine-tuning the token embedding and attention layers can lead to smoother decision boundaries. However, fine-tuning the linear prediction head alone does not improve smoothness.

9/11
8/n We also explore uncertainty-aware active learning. By adding labels for the most uncertain points to the in-context dataset, we can smoothen the decision boundary more efficiently.
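
A minimal sketch of that active-learning step, assuming access to a per-point confidence score and a labeling oracle (both hypothetical callables here, not the paper's code):

```python
import numpy as np

def add_most_uncertain(context_xs, context_ys, pool_xs, prob_of_a, oracle_label, k=5):
    """Uncertainty-aware active-learning step: pick the k pool points whose
    predicted P(class A) is closest to 0.5, label them, and append them to the
    in-context set. `prob_of_a` (model confidence) and `oracle_label` (ground
    truth) are hypothetical callables."""
    scores = np.abs(np.array([prob_of_a(x) for x in pool_xs]) - 0.5)
    picked = np.argsort(scores)[:k]                  # most uncertain points first
    new_xs = [pool_xs[i] for i in picked]
    new_ys = [oracle_label(x) for x in new_xs]
    return context_xs + new_xs, context_ys + new_ys
```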

10/11
9/n Lastly, we explore the effect of language pretraining. Compared to pretrained LLMs, we find that transformers trained from scratch on synthetic classification tasks can learn smooth in-context decision boundaries for unseen classification problems.

11/11
10/n. Thanks for reading about our work! 🙏

If you're interested in exploring more results, please check out our paper: https://arxiv.org/pdf/2406.11233.
Huge thanks to my amazing collaborators @tungnd_13 and @adityagrover_! 🙌



bnew

1/11
🚨 New paper: we trained a SOTA (> GPT4, Gemini) VLM agent, DigiRL, that can do tasks on an Android phone in real time, in the wild, via autonomous offline + online RL

Web: DigiRL
Paper: [2406.11896] DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

🧵 ⬇️ / little gif of learning progress👇:

2/11
Device control is very challenging for foundation models:
- real-world stochasticity + non-stationarity + distractors (e.g., pop-ups)
- websites + internal device state changing
- pixel, gesture control

➡️ need to continuously keep agents up-to-date + learn from "own" failures

3/11
To address these, a scalable way is to run autonomous online RL (SFT / imitation is not enough), where the agent interacts with the phone & internet in real time and learns from its mistakes.

We did exactly this. Many systems & methodological advances were needed to enable this: ⬇️

4/11
1) We need a scalable interaction env that emulates phone state in real time for open-ended training. Most Android device-control work provided data (e.g., Android in the Wild), but no simulator.

+ an autonomous evaluator based on Gemini 1.5 Pro

GitHub - DigiRL-agent/digirl: Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning.

5/11
2) Running RL is not easy as we're fine-tuning VLMs amidst so much stochasticity / non-stationarity.

Step 1: fine-tune on existing demos via offline RL (e.g., data collected by rolling out off-the-shelf VLMs in the env),

Step 2: autonomous online RL.

All done with 1.5B VLM.

6/11
Three method ideas from RL:
- Advantage-weighted regression for policy learning
- Doubly robust advantage estimators using a step-level value function
- Automated curriculum to learn on most informative states using instruction-level value function

Check the paper for details
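
As a rough illustration of the first idea in the list above, here is a generic advantage-weighted regression loss in PyTorch; it is a textbook-style sketch, not DigiRL's actual objective, and the temperature and clipping constants are assumptions.

```python
import torch

def awr_loss(log_probs, values, returns, beta=1.0, weight_clip=20.0):
    """Generic advantage-weighted regression loss: weight the log-likelihood of
    the actions actually taken by exp(advantage / beta).

    log_probs: log pi(a_t | s_t) for the batch of taken actions
    values:    critic estimates V(s_t)
    returns:   observed or bootstrapped returns for the same states
    """
    advantages = returns - values
    weights = torch.exp(advantages / beta).clamp(max=weight_clip).detach()
    return -(weights * log_probs).mean()
```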

7/11
That's it! This gives us a SOTA VLM agent, DigiRL, on these tasks. Our method outperforms prompting + off-the-shelf VLMs (Gemini 1.5 Pro, GPT-4) as well as SFT / imitation and other VLMs (CogAgent, AutoUI, etc.)

around 70% improvement over the next best approach

8/11
The website has gifs of various methods (I could not upload them here as my X would always crash), but please check the website (the panel looks like this image below):

DigiRL

9/11
We also study the failure modes of these methods, and find that DigiRL cuts the following failure modes of off-the-shelf VLMs:
- failing to recover from its own mistakes
- getting distracted and "lost" among irrelevant elements

10/11
Many other results in the paper: results showing reliability / scalability of our evaluator + environment (on the website), and the entire online RL learning trace (below)

Code is out there: GitHub - DigiRL-agent/digirl: Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning. for you to build on!

11/11
We are super excited to now use this as a foundation to apply value-based RL (see our prev work ArCHer: [2402.19446] ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL) to real user-scale problems!

Awesome collab co-led by @jackbai_jkb @YifeiZhou02 w/ @mertcemri @pan_jiayipan @svlevine @alsuhr



bnew

1/3
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Enables MLLMs to express intermediate reasoning as images using code. You probably didn't use typography knowledge to solve this query

proj: Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
abs: [2406.14562] Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
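
A minimal sketch of the whiteboard loop as described: the model writes plotting code, the code is rendered to an image, and the image is fed back for the final answer. Both `mllm_*` callables below are hypothetical placeholders, not the authors' API, and executing model-written code like this assumes a trusted sandbox.

```python
import io

import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt

def whiteboard_of_thought(query, mllm_generate_code, mllm_answer_with_image):
    """Sketch of the whiteboard loop:
    1) ask the model for matplotlib code that draws its intermediate reasoning,
    2) execute the code and render it to a PNG,
    3) feed the rendered image back and ask for the final answer.
    Both mllm_* callables are hypothetical placeholders."""
    code = mllm_generate_code(
        f"Write matplotlib code that visualizes the reasoning needed for: {query}"
    )
    exec(code, {"plt": plt})  # assumes a trusted sandbox for model-written code
    buf = io.BytesIO()
    plt.savefig(buf, format="png")
    plt.close("all")
    return mllm_answer_with_image(query, buf.getvalue())
```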

2/3
I've long suspected that the ability of LLMs to benefit from using visual output for sketches will be the true test of multimodality. Nice to see that papers to this effect are coming out now.

3/3
This is exactly what @SNAT02792153 did with Self-Imagine!

Check out the work led by her here:



bnew

1/5
DeepSeek-Coder-V2 is by far the best open-source math (+ coding) model, performing on par with GPT4o w/o process RM or MCTS and w/ >20x less training compute. Data contamination doesn't seem to be a concern here.

Imagine what this model could achieve with PRM, MCTS, and other yet-to-be-released agentic exploration methods. Unlike GPT4o, you can train this model further.

It has the potential to solve Olympiad-, PhD- and maybe even research-level problems, like the internal model a Microsoft exec said was able to solve PhD qualifying exam questions.

2/5
It's like 10 points lower on LiveCodeBench vs GPT-4o if you restrict problems to Jan 2024 and later. So contamination may be part of the story still

3/5
Thanks for letting me know. I just checked it again, and it's a 7-point difference. Anyway, I think it's still powerful enough to be useful.

4/5
what does "process RM" mean here?

5/5
Can't solve the Alice problem though


1/11
DeepSeek-Coder-V2: First Open Source Model Beats GPT4-Turbo in Coding and Math

> Excels in coding and math, beating GPT4-Turbo, Claude3-Opus, Gemini-1.5Pro, Codestral.
> Supports 338 programming languages and 128K context length.
> Fully open-sourced with two sizes: 230B (also with API access) and 16B.

#DeepSeekCoder

2/11
On the Arena-Hard-Auto leaderboard, DeepSeek-Coder-V2 surpasses Yi-large, Claude3-Opus, GLM4, and Qwen2-72B.

3/11
> Chat with DeepSeek-Coder-V2 (230B) : DeepSeek

> Access Coder-V2 APIs at the same unbeatable prices as DeepSeek-V2: DeepSeek Platform

> Download two-sized models for free commercial/research use: deepseek-ai (DeepSeek)

> Technical report: DeepSeek-Coder-V2/paper.pdf at main · deepseek-ai/DeepSeek-Coder-V2

4/11
Fantastic!

5/11
@cursor_ai

6/11
@ollama can we have this one?

7/11
👀

8/11
Great work 🤩
Congrats to the team.

9/11
Where does the 16B sit on the standings?

10/11
I tested the chat by asking about myself.

It got it right.

Bjørn Are Solstad is a Norwegian entrepreneur and software developer known for his work in the technology sector. He has been involved in various tech startups and projects, contributing to the development of innovative software solutions. Solstad's expertise and contributions to the tech industry have made him a notable figure in the Norwegian tech community.

11/11
wait, we have 338 programming languages? 😆



bnew

1/2
OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

Presents a comprehensive, rigorously curated benchmark of Olympic-level challenges

proj: OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI
abs: [2406.12753] OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI

2/2
Very interesting work; we need more of these to actually measure the capabilities of existing models. That said, there is this weird phenomenon with LLMs where they can correctly solve a very complex math problem yet fail at a trivial one; this doesn't happen with human experts.



bnew

1/4
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

GLM-4:
- closely rivals GPT-4 on MMLU, MATH, GPQA, etc.
- gets close to GPT-4 on instruction following and long-context tasks

hf: THUDM (Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University)
repo: THUDM
abs: [2406.12793] ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools

2/4
At last they release the paper

3/4
Were their newer models trained on their original infilling objective?

4/4
So are all future LLMs going to know less and less about most popular fields of knowledge while their MMLUs go through the roof?



bnew

1/2
VideoLLM-online: Online Video Large Language Model for Streaming Video

The first streaming video LLM: high speed (5~10 FPS on an RTX 3090 GPU, 10~15 FPS on an A100 GPU) on long-form videos (10 minutes), with SOTA performance in both online and offline settings

proj: VideoLLM-online: Online Video Large Language Model for Streaming Video
abs: [2406.11816] VideoLLM-online: Online Video Large Language Model for Streaming Video

2/2
Impressive!



bnew

1/8
Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

Shows the superior performance across a variety of tasks, including reconstruction, classification and generation

repo: GitHub - zh460045050/VQGAN-LC
abs: [2406.11837] Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

2/8
For VideoPoet we used the MAGVIT v2 tokenizer which had an effective codebook size of 262k and 100% codebook utilization. Need to read the details but at first glance LFQ/FSQ could already offer all of these advantages?

3/8
The main objective is to develop a novel image quantization model, VQGAN-LC, that can effectively leverage an extremely large codebook (up to 100,000 entries) while maintaining a high utilization rate (over 99%). The hypothesis is that a larger codebook with improved utilization will lead to better performance across various tasks compared to existing VQGAN variants.

Scaling the Codebook Size of VQGAN to 100,000 with a Utilization Rate of 99%

4/8
Got nerd sniped by this a bit. It's a good paper and the code is clean and easy to understand. They learn an nn.Embedding(dim=768) + nn.Linear layer (which they call a projector) to remap the codes in a learned way down to dim=8 per codebook entry. After training, I think they discard the nn.Embedding and nn.Linear parameters and just keep the dim=8 codebook derived from them.
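
Based on that reading, here is a toy PyTorch sketch of the projector idea; it is an assumption about the mechanism described in the post above, not the repo's code, and the dimensions are just the ones mentioned there.

```python
import torch
import torch.nn as nn

class ProjectedCodebookSketch(nn.Module):
    """Toy sketch of the projector idea: a large codebook in a high-dimensional
    feature space (e.g. 768-d) is mapped by a learned linear 'projector' down to
    a small code dimension (e.g. 8); quantization is a nearest-neighbor lookup in
    the projected space. After training, one could cache `projected()` and
    discard the embedding + projector."""

    def __init__(self, num_codes=100_000, feat_dim=768, code_dim=8):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, feat_dim)  # could be frozen, feature-initialized
        self.projector = nn.Linear(feat_dim, code_dim)

    def projected(self):
        return self.projector(self.codebook.weight)        # (num_codes, code_dim)

    def forward(self, z):
        # z: (batch, code_dim) encoder outputs to quantize
        codes = self.projected()
        idx = torch.cdist(z, codes).argmin(dim=-1)          # nearest codebook entry
        z_q = codes[idx]
        # straight-through estimator so gradients still reach the encoder
        return z + (z_q - z).detach(), idx
```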

5/8
No FSQ mention in the whole paper...

6/8
I actually want the paper more than the code in this case.

7/8
Seems like we haven't fully explored the power of VAE's
