bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,834

1/4
Long Context Transfer from Language to Vision

- Can process 2000 frames or over 200K visual tokens
- SotA perf on VideoMME among 7B-scale models

abs: [2406.16852] Long Context Transfer from Language to Vision
repo: GitHub - EvolvingLMMs-Lab/LongVA: Long Context Transfer from Language to Vision

2/4
Dark mode for this paper for night readers 🌚 Long Context Transfer from Language to Vision

3/4
AI Summary: The paper introduces a method called long context transfer to enable Large Multimodal Models (LMMs) to understand extremely long videos by extrapolating the context length of the language backbon...
Long Context Transfer from Language to Vision

4/4
Didn't get it. don't they need more training time?


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/3
Video-Infinity: Distributed Long Video Generation

Can generate super long videos, up to 2300 frames within 5 mins by Clip parallelism and Dual-scope attention

proj: Video-Infinity
abs: [2406.16260] Video-Infinity: Distributed Long Video Generation
repo: GitHub - Yuanshi9815/Video-Infinity: Video-Infinity generates long videos quickly using multiple GPUs without extra training.

2/3
Dark mode for this paper for those who read at night 🌙 Video-Infinity: Distributed Long Video Generation

3/3
Cool but their demo page shows videos with no temporal consistency



1/1
We've open-sourced the code and models for Self-Play Preference Optimization (SPPO)! 🚀🚀🚀
⭐ code: GitHub - uclaml/SPPO: The official implementation of Self-Play Preference Optimization (SPPO)
🤗models: SPPO - a UCLA-AGI Collection


1/11
Another triumph for Self-Play! Self-Play Preference Optimization (SPPO) has surpassed (iterative) DPO, IPO, Self-Rewarding LMs, and others on AlpacaEval, MT-Bench, and the Open LLM Leaderboard.

Remarkably, Mistral-7B-instruct-v0.2 fine-tuned by SPPO achieves superior performance to GPT-4 0613 without relying on any GPT-4 responses.

Explore the roadmap of LLM fine-tuning techniques:
Supervised Fine-tuning: SFT --> SPIN
Preference Fine-tuning: PPO --> DPO --> SPPO

Paper: https://arxiv.org/pdf/2405.00675

2/11
Joint work with @FrankYueWu1 @EdwardSun0909 @HuizhuoY @Kaixuan_Ji_19 and Yiming Yang.

3/11
For more details about SPPO, please refer to:

4/11
Also check out this insightful tweet breaking down SPPO in detail!

5/11
Will you open source your code?

6/11
Like SPIN, we will open source code and model weights.

7/11
Looks cool! Is it open? Would like to see how it performs on Chatbot Arena.

8/11
Thank you, Ying! We're excited to evaluate it on Chatbot Arena. We'll be opening the model weights soon and will definitely need your help 😄

9/11
I'm not sure why SFT evolves to SPIN on your tweet? SPIN is an iterative RLHF built on an already SFT'd model

10/11
I'm sorry to say that in a head-to-head with Llama-8b-Instruct on my own instruction-following benchmark this model is **much** worse at following instructions, and the reasoning is worse as well. I suspect there may be overfitting to the structure of AlpacaEval.

11/11
quick q: are the results in the paper based on PairRM soft-prob in the loss func or the hard reward of 0/1? From page 11 it looks like you select only the best and worst response using pairRM but it's not clear if the pairRM scores are part of the loss function
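
For readers following the question above, here is a rough sketch of why the choice of target matters. It assumes the squared-loss form of the SPPO objective, where the log-density ratio between the new and current policy is regressed onto eta * (win probability - 1/2). The function name `sppo_loss`, the value of `eta`, and all numbers are hypothetical and only illustrate the difference between a soft preference-model probability and a hard 0/1 target. It is not taken from the official repo.

```python
def sppo_loss(logp_new, logp_old, win_prob, eta=1.0):
    """Squared error between the log-density ratio and the centered, scaled win probability.

    logp_new / logp_old: log-probability of one response under the policy
    being trained and under the current (frozen) policy.
    win_prob: estimated probability that this response beats the current policy.
    """
    log_ratio = logp_new - logp_old
    target = eta * (win_prob - 0.5)
    return (log_ratio - target) ** 2


# Same response, same log-probabilities, two ways of scoring it.
soft_loss = sppo_loss(-12.0, -12.5, win_prob=0.8)  # soft preference probability
hard_loss = sppo_loss(-12.0, -12.5, win_prob=1.0)  # hard "winner" label
```

With a hard label every chosen response is pushed toward the same log-ratio, while a soft probability scales the push by how decisively the response wins.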



1/2
Tencent and Huawei present Text-Animator: Controllable Visual Text Video Generation

Demonstrates that their approach generates visual text more accurately than state-of-the-art video generation methods

abs: [2406.17777] Text-Animator: Controllable Visual Text Video Generation
proj: Text-Animator

2/2
Will this be like a Google search function, where you switch a tab - like going to image search but this would be AI generated search or something?



1/2
MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

- outperforms existing MLLMs of comparable parameter sizes
- ranges from 3.8B to 34B

proj: MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning
abs: [2406.17770] MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning

2/2
Dark mode for this paper for night readers 🌙 MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning



1/2
Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

- breaks memorization down into a taxonomy
- uses it to construct a predictive model for memorization
- finds that different factors influence the likelihood of memorization differently

[2406.17746] Recite, Reconstruct, Recollect: Memorization in LMs as a Multifaceted Phenomenon

2/2




1/1
Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models

Significantly improves the multimodal math capabilities of LLaVA-1.5, achieving a 19-point increase and comparable performance to GPT-4V

[2406.17294] Math-LLaVA: Bootstrapping Mathematical Reasoning for Multimodal Large Language Models



1/3
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs

Achieves SotA performances and serves as a comprehensive, open cookbook for instruction-tuned MLLMs

proj: Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
abs: [2406.16860] Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
model: nyu-visionx (NYU VisionX)
repo: GitHub - cambrian-mllm/cambrian: Cambrian-1 is a family of multimodal LLMs with a vision-centric design.

2/3
Wut
(anyway, good paper)

3/3
I didn't like this paper because they are showing off by the casual mention of kaiming he



1/2
Microsoft presents EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees

- 20%-40% faster than EAGLE-1 (i.e. 3.05x - 4.26x faster than the baseline)
- Ensures that the distribution of the generated text remains unchanged

abs: [2406.16858] EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees
repo: GitHub - SafeAILab/EAGLE: Official Implementation of EAGLE
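
The "distribution remains unchanged" point is the standard guarantee of speculative sampling, which draft-based decoders rely on. Below is a minimal single-token sketch of that accept/resample rule. It is not EAGLE-2's dynamic draft tree, and the toy distributions `p` (target model) and `q` (draft model) are made up for illustration.

```python
import random


def speculative_step(p, q, rng=random):
    """Return one token distributed exactly according to p, proposing from q."""
    tokens = list(range(len(p)))
    x = rng.choices(tokens, weights=q)[0]       # draft model proposes a token
    if rng.random() < min(1.0, p[x] / q[x]):    # target model accepts it ...
        return x
    # ... or rejects it and resamples from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual)[0]


def output_distribution(p, q):
    """Exact output distribution of speculative_step, computed in closed form."""
    accepted = [min(pi, qi) for pi, qi in zip(p, q)]  # q[x] * min(1, p[x] / q[x])
    reject_mass = 1.0 - sum(accepted)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    if total == 0.0:                                  # p == q: nothing is ever rejected
        return accepted
    return [a + reject_mass * r / total for a, r in zip(accepted, residual)]


p = [0.5, 0.3, 0.2]   # toy target distribution
q = [0.2, 0.5, 0.3]   # toy draft distribution
result = output_distribution(p, q)  # matches p up to floating-point rounding
```

The speedup comes from verifying several drafted tokens in one pass of the target model; the acceptance rule is what keeps the output distribution equal to the target's.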

2/2
Dark mode for this paper 🌙 EAGLE-2: Faster Inference of Language Models with Dynamic Draft Trees



1/7
Adam-mini: Use Fewer Learning Rates To Gain More

Achieves 50% higher throughput than AdamW on Llama2-7B

[2406.16793] Adam-mini: Use Fewer Learning Rates To Gain More

2/7
Adam-Mini ([2406.16793] Adam-mini: Use Fewer Learning Rates To Gain More, left) is identical to NVIDIA's NovoGrad (https://arxiv.org/pdf/1905.11286, right) up to a constant factor.
The major difference is that NovoGrad has an official implementation and multiple reproductions (apex/apex/optimizers/fused_novograd.py at master · NVIDIA/apex), while this doesn't :/

3/7
used to be that you'd have to beg paper authors to include a useful transformer baseline.

now, you don't even have to ask: 7B-scale Llama tests are the default, other archs are shafted under "Non-LLM Tasks", and special changes are made specifically to target Transformer perf.

4/7
Thanks, @arankomatsuzaki, for promoting our work!

As shown in the figure, Adam-mini cuts Adam's memory footprint by 45-50% and can save 33% of the time on Llama2-7B pre-training! The design of Adam-mini builds on certain Hessian structures we observed in Transformers.

Our PyTorch implementation is now available at GitHub - zyushun/Adam-mini . Feel free to try it out! Hope Adam-mini can help save time, cost, and energy in your tasks!
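
To make the "fewer learning rates" idea concrete, here is a minimal pure-Python sketch, not the official implementation. It assumes the core trick is to replace Adam's per-coordinate second moment with a single scalar per parameter block (the mean of the squared gradients in that block). How blocks are chosen is not modeled here; `block_adam_step` and the hyperparameter values are hypothetical.

```python
import math


def block_adam_step(params, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam-style update for a single parameter block.

    The first moment is kept per coordinate, as in Adam, but the second
    moment is ONE scalar shared by the whole block.
    """
    state["t"] += 1
    t = state["t"]
    mean_sq = sum(g * g for g in grads) / len(grads)            # block-wise mean of g^2
    state["v"] = beta2 * state["v"] + (1 - beta2) * mean_sq
    state["m"] = [beta1 * m + (1 - beta1) * g for m, g in zip(state["m"], grads)]
    v_hat = state["v"] / (1 - beta2 ** t)                       # bias correction
    step = lr / (math.sqrt(v_hat) + eps)                        # one step size for the block
    return [p - step * m / (1 - beta1 ** t) for p, m in zip(params, state["m"])]


state = {"t": 0, "m": [0.0, 0.0, 0.0], "v": 0.0}
new_params = block_adam_step([1.0, 2.0, 3.0], [0.1, -0.2, 0.3], state)
```

For a block of n parameters the optimizer state is n + 1 numbers instead of Adam's 2n, which is where a memory saving of roughly half would come from.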

5/7
Empty github repo...

6/7
@ericzhang0410 @ZiniuLi

7/7
[QA] Adam-mini: Use Fewer Learning Rates To Gain More


1/11
Have you wondered how the decision boundary of in-context learning in LLMs compares to traditional models like Decision Trees and KNN? 🤔

Our research uncovers unexpected irregularities and non-smoothness in LLMs' in-context decision boundaries. 🔍

📄: [2406.11233] Probing the Decision Boundaries of In-context Learning in Large Language Models

2/11
1/n As the number of in-context examples increases, LLMs can achieve high accuracy on linear and non-linear classification tasks.
But how reliable are these in-context classifiers? 🤔
We probe their decision boundaries to find out.

3/11
2/n By visualizing the decision boundaries, we show that SOTA LLMs, ranging from 1B to large closed-source models such as GPT-3.5-turbo and GPT-4o, all exhibit different non-smooth, irregular decision boundaries, even on simple linearly separable tasks.
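
As a rough illustration of what "probing a decision boundary" means, the sketch below queries a classifier on a dense 2-D grid and counts how often neighbouring grid cells disagree. A 1-nearest-neighbour rule stands in for the LLM (in the paper's setting the prediction would come from prompting the model with the in-context examples), and the `fragmentation` score is only an illustrative proxy, not the paper's metric.

```python
def knn1(examples):
    """Stand-in classifier: return the label of the nearest (x, y, label) example."""
    def predict(x, y):
        return min(examples, key=lambda e: (e[0] - x) ** 2 + (e[1] - y) ** 2)[2]
    return predict


def probe_grid(predict, lo=0.0, hi=1.0, n=20):
    """Query the classifier at n x n evenly spaced points; rows follow y, columns follow x."""
    step = (hi - lo) / (n - 1)
    return [[predict(lo + i * step, lo + j * step) for i in range(n)] for j in range(n)]


def fragmentation(grid):
    """Fraction of horizontally or vertically adjacent cells whose labels differ."""
    n = len(grid)
    changes = total = 0
    for r in range(n):
        for c in range(n):
            if c + 1 < n:                       # right-hand neighbour
                total += 1
                changes += grid[r][c] != grid[r][c + 1]
            if r + 1 < n:                       # neighbour below
                total += 1
                changes += grid[r][c] != grid[r + 1][c]
    return changes / total


# Toy linearly separable task: label 0 on the left, label 1 on the right.
examples = [(0.1, 0.2, 0), (0.3, 0.8, 0), (0.2, 0.5, 0),
            (0.9, 0.1, 1), (0.7, 0.6, 1), (0.8, 0.9, 1)]
grid = probe_grid(knn1(examples))
score = fragmentation(grid)   # low for a clean boundary, high for a fragmented one
```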

4/11
3/n How do these irregularities arise?
We study various factors that impact decision boundary smoothness in LLMs, including in-context example count, quantization levels, label semantics & examples order.
Then, we identify methods to improve the smoothness of the boundaries.🛠️

5/11
4/n First, increasing in-context examples does not guarantee smoother decision boundaries. While classification accuracy improves with more in-context examples, the decision boundary remains fragmented.

6/11
5/n Decision boundaries are sensitive to label names, example order and quantization.

Shuffling in-context examples and labels changes the model’s decision boundaries, suggesting that they depend on the LLM's semantic prior knowledge of the labels and are not permutation invariant.

Reducing precision from 8-bit to 4-bit impacts areas near the boundary with high uncertainties. Varying quantization levels can flip LLM decisions in these uncertain regions. 🔄

7/11
6/n Can we improve decision boundary smoothness in LLMs through training? 🛠️
We show that fine-tuning on simple linearly separable tasks can improve the smoothness of decision boundaries and generalize to more complex non-linear, multi-class tasks, enhancing robustness.

8/11
7/n Further, we show that fine-tuning the token embedding and attention layers can lead to smoother decision boundaries. However, fine-tuning the linear prediction head alone does not improve smoothness.

9/11
8/n We also explore uncertainty-aware active learning. By adding labels for the most uncertain points to the in-context dataset, we can smoothen the decision boundary more efficiently.
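
The selection step described here can be sketched as follows, under the assumption of a binary task where the classifier exposes a probability for class 1. `prob_fn` is a hypothetical stand-in for the LLM's label probabilities and `oracle` for whatever supplies the true labels. Neither comes from the paper's code.

```python
import math


def most_uncertain(candidates, prob_fn, k=2):
    """Return the k candidate points whose P(class 1) is closest to 0.5."""
    return sorted(candidates, key=lambda x: abs(prob_fn(x) - 0.5))[:k]


def active_learning_round(context, candidates, prob_fn, oracle, k=2):
    """Label the k most uncertain candidates and append them to the in-context set."""
    picked = most_uncertain(candidates, prob_fn, k)
    return context + [(x, oracle(x)) for x in picked]


# Toy 1-D example: the true boundary sits at x = 0.5.
prob_fn = lambda x: 1.0 / (1.0 + math.exp(-10.0 * (x - 0.5)))
oracle = lambda x: int(x >= 0.5)
context = [(0.2, 0), (0.8, 1)]
context = active_learning_round(context, [0.1, 0.48, 0.6, 0.9], prob_fn, oracle)
```

Random selection would spend labels on points far from the boundary, where the classifier is already confident. Picking by uncertainty concentrates them where the boundary is ragged.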

10/11
9/n Lastly, we explore the effect of language pretraining. Compared to pretrained LLMs, we find that transformers trained from scratch on synthetic classification tasks can learn smooth in-context decision boundaries for unseen classification problems.

11/11
10/n. Thanks for reading about our work! 🙏

If you're interested in exploring more results, please check out our paper: https://arxiv.org/pdf/2406.11233.
Huge thanks to my amazing collaborators @tungnd_13 and @adityagrover_! 🙌


1/11
🚨 New paper: we trained a SOTA (> GPT4, Gemini) VLM agent, DigiRL, that can do tasks on an Android phone in real time, in the wild, via autonomous offline + online RL

Web: DigiRL
Paper: [2406.11896] DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

🧵 ⬇️ / little gif of learning progress👇:

2/11
Device control is very challenging for foundation models:
- real-world stochasticity + non-stationarity + distractors (e.g., pop-ups)
- websites + internal device state changing
- pixel, gesture control

➡️ need to continuously keep agents up-to-date + learn from "own" failures

3/11
A scalable way to address these is to run autonomous online RL (SFT / imitation is not enough), where the agent interacts with the phone & internet in real time and learns from its mistakes.

We did exactly this. A lot of systems & methodological work needed to be done to enable this: ⬇️

4/11
1) We need a scalable interaction env, that emulates phone state in real-time for open-ended training. Most Android device control work provided data (e.g., Android in the Wild), but no simulator.

+ an autonomous evaluator based on Gemini 1.5 Pro

GitHub - DigiRL-agent/digirl: Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning.

5/11
2) Running RL is not easy as we're fine-tuning VLMs amidst so much stochasticity / non-stationarity.

Step 1: fine-tune on existing demos via offline RL (e.g., data collected by rolling out off-the-shelf VLMs in the env),

Step 2: autonomous online RL.

All done with 1.5B VLM.

6/11
Three method ideas from RL:
- Advantage-weighted regression for policy learning
- Doubly robust advantage estimators using a step-level value function
- Automated curriculum to learn on most informative states using instruction-level value function

Check the paper for details
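
For readers unfamiliar with the first idea in the list, here is a minimal sketch of advantage-weighted regression in its generic textbook form: a weighted maximum-likelihood loss where each logged action is weighted by exp(advantage / beta). This is not DigiRL's specific variant or its advantage estimators. `beta` and the `max_weight` clip are hypothetical choices.

```python
import math


def awr_weights(advantages, beta=1.0, max_weight=20.0):
    """Exponentiated advantages, clipped so a single sample cannot dominate."""
    return [min(math.exp(a / beta), max_weight) for a in advantages]


def awr_loss(log_probs, advantages, beta=1.0):
    """Negative advantage-weighted log-likelihood of the logged actions."""
    weights = awr_weights(advantages, beta)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(log_probs)


# Three logged actions with equal log-probability but different advantages.
weights = awr_weights([0.0, 1.0, -1.0])
loss = awr_loss([-1.0, -1.0, -1.0], [0.0, 1.0, -1.0])
```

Actions with higher estimated advantage get imitated more strongly, which lets the policy improve from its own rollouts without an explicit policy-gradient step.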

7/11
That's it! This gives us a SOTA VLM agent, DigiRL, on these tasks. Our method outperforms prompting + off-the-shelf VLM (Gemini 1.5 Pro, GPT 4) as well as SFT / imitation, other VLMs (CogAgent, AutoUI, etc)

around 70% improvement over the next best approach

8/11
The website has gifs of various methods (I could not upload them here as my X would always crash), but please check the website (the panel looks like this image below):

DigiRL

9/11
We also study the failure modes of these methods, and find that DigiRL reduces the following failure modes of off-the-shelf VLMs:
- not recovering from the model's own mistakes
- getting distracted and "lost" by irrelevant elements

10/11
Many other results in the paper: results showing reliability / scalability of our evaluator + environment (on the website), and the entire online RL learning trace (below)

Code is out there: GitHub - DigiRL-agent/digirl: Official repo for paper DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning. for you to build on!

11/11
We are super excited to now use this as a foundation to study value-based RL (see our prev work ArCHer: [2402.19446] ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL) to real user-scale problems!

Awesome collab co-led by @jackbai_jkb @YifeiZhou02 w/ @mertcemri @pan_jiayipan @svlevine @alsuhr



1/3
Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

Enables MLLMs to express intermediate reasoning as images using code. You probably didn't use typography knowledge to solve this query

proj: Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities
abs: [2406.14562] Whiteboard-of-Thought: Thinking Step-by-Step Across Modalities

2/3
I've long suspected that the ability of LLMs to benefit from using visual output for sketches will be the true test of multimodality. Nice to see that papers to this effect are coming out now.

3/3
This is exactly what @SNAT02792153 did with Self-Imagine!

Check out the work led by her here:



1/5
DeepSeek-Coder-V2 is by far the best open-source math (+ coding) model, performing on par with GPT4o w/o process RM or MCTS and w/ >20x less training compute. Data contamination doesn't seem to be a concern here.

Imagine what this model could achieve with PRM, MCTS, and other yet-to-be-released agentic exploration methods. Unlike GPT4o, you can train this model further.

It has the potential to solve Olympiad, PhD and maybe even research level problems, like the internal model that a Microsoft exec said was able to solve PhD qualifying exam questions.

2/5
It's like 10 points lower on LiveCodeBench vs GPT-4o if you restrict problems to Jan 2024 and later. So contamination may be part of the story still

3/5
Thanks for letting me know. I just checked it again, and it's 7 point difference. Anyway, I think it's still powerful enough to be useful.

4/5
what does "process RM" mean here?

5/5
Can't solve the Alice problem though


1/11
DeepSeek-Coder-V2: First Open Source Model Beats GPT4-Turbo in Coding and Math

> Excels in coding and math, beating GPT4-Turbo, Claude3-Opus, Gemini-1.5Pro, Codestral.
> Supports 338 programming languages and 128K context length.
> Fully open-sourced with two sizes: 230B (also with API access) and 16B.

#DeepSeekCoder

2/11
In the Arena-Hard-Auto leaderboard, DeepSeek-Coder-V2 surpasses Yi-large, Claude3-Opus, GLM4, and Qwen2-72B.

3/11
> Chat with DeepSeek-Coder-V2 (230B) : DeepSeek

> Access Coder-V2 APIs at the same unbeatable prices as DeepSeek-V2: DeepSeek Platform

> Download two-sized models for free commercial/research use: deepseek-ai (DeepSeek)

> Technical report: DeepSeek-Coder-V2/paper.pdf at main · deepseek-ai/DeepSeek-Coder-V2

4/11
Fantastic!

5/11
@cursor_ai

6/11
@ollama can we have this one?

7/11
👀

8/11
Great work 🤩
Congrats to the team.

9/11
Where does the 16B sit on the standings?

10/11
I tested the chat by asking about myself.

It got it right.

Bjørn Are Solstad is a Norwegian entrepreneur and software developer known for his work in the technology sector. He has been involved in various tech startups and projects, contributing to the development of innovative software solutions. Solstad's expertise and contributions to the tech industry have made him a notable figure in the Norwegian tech community.

11/11
wait, we have 338 programming languages? 😆

