bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556

1/18
Finally, I present you my first ever preprint:

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

We show that sharing KV heads between layers allows for a KV cache smaller than what GQA/MQA allow for, with a reasonable accuracy/memory tradeoff
[2406.09297] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

2/18
As the name describes, MLKV allows for configurations that share the KV heads of lower layers with the layers above them. While MQA was limited to reducing the KV cache to 1/n_heads of the vanilla size, MLKV, at the most extreme, can go down to 1/(n_heads * n_layers) of the vanilla cache size, i.e. a single KV head shared by the entire model.
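To make the numbers concrete, here is a rough Python sketch of per-token KV cache sizes under MHA, GQA, MQA, and an MLKV-style layer-sharing scheme. The head/layer counts are illustrative (roughly Pythia-160M-like), not the paper's exact configs.

```python
# Rough per-token KV cache size (in elements) for different head-sharing schemes.
# Illustrative only; the layer/head counts below are made up, not the paper's configs.

def kv_cache_per_token(n_layers, kv_heads_per_layer, head_dim, layer_groups=None):
    """KV cache elements stored per token.

    layer_groups: how many groups of layers share one set of KV heads
    (MLKV-style). None means every layer keeps its own KV heads.
    """
    groups = layer_groups if layer_groups is not None else n_layers
    return 2 * groups * kv_heads_per_layer * head_dim  # 2x for keys and values

n_layers, n_heads, head_dim = 12, 12, 64  # Pythia-160M-like, for illustration

mha  = kv_cache_per_token(n_layers, n_heads, head_dim)            # every layer, every head
gqa  = kv_cache_per_token(n_layers, 4, head_dim)                  # 4 KV heads per layer
mqa  = kv_cache_per_token(n_layers, 1, head_dim)                  # 1 KV head per layer
mlkv = kv_cache_per_token(n_layers, 1, head_dim, layer_groups=2)  # 1 KV head per 6-layer group
mlkv_extreme = kv_cache_per_token(n_layers, 1, head_dim, layer_groups=1)  # 1 KV head for the whole model

for name, size in [("MHA", mha), ("GQA-4", gqa), ("MQA", mqa),
                   ("MLKV-2", mlkv), ("MLKV-1 (extreme)", mlkv_extreme)]:
    print(f"{name:>18}: {size:6d} elements/token ({size / mha:.4f}x of MHA)")
```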

3/18
We uptrain 8 variants of Pythia-160M in different configs with different total KV head counts, from GQA/MQA equivalents to MLKV setups that go beyond what was possible. Uptrain loss is reported in Table 2.

4/18
The benchmark results show that there is a price to pay for reducing KV heads, as the avg accuracy goes down slightly. The most extreme config with 1 KV head collapses, but all other variants are very much usable.

5/18
Now to the good part. This plot can be predicted from the theoretical cache sizes, but it is really cool to see it in practice, when running inference IRL. You can clearly see the benefit of going lower with MLKV. IMPORTANT: This plot also applies for longer context lengths!

6/18
We can plot the accuracy/memory trade-off. We show that configurations with half the KV heads of MQA sit right alongside MQA on the Pareto-optimal part of the curve. Even going below that is possible if needed.

7/18
Some limitations to curb your enthusiasm for those who have been following since the start:
1. Small scale at 160M params
2. Uptrain instead of train from scratch
Keep in mind this was my "final project" for a bachelors degree. My first priority was to get this degree as safely..

8/18
..and quickly as possible. This scale allowed me to iterate and experiment more in a limited timeframe, which I really needed. I'm certain it is possible to scale MLKV up as was done with GQA. If you are interested in optimizing your model, lmk and I'll help you

9/18
About CLA (Cross-Layer Attention): this idea is pretty obvious if you know the literature, so it was bound to happen. There are still some differences, e.g. we uptrain models (which is more practical) and see the effects, and we also tested more extreme configurations.

10/18
This pre-print is planned to be submitted to EMNLP. I am not 100% confident in getting in (given limitations and CLA) but I do want to give it a shot. But what I really wanted was to get this up on arXiv and share it to you all, which is what I'm doing now.

11/18
Which is super satisfying and I'm happy with the outcome! I thank everyone who has been following and showing interest in this project since the start, nearly a year ago. You guys are cool, thank you!

12/18
Big thanks and credits to co-authors and supervisors @faridlazuarda , @AyuP_AI, and @AlhamFikri

13/18
yep lots of these KV optimizations are stackable. but also beware of the accuracy tradeoff

14/18
thank youu

15/18
thank you! here's to hoping

16/18
Thank you!

Yes you got it exactly right, no free lunch. I've tried and well, the benchmark results are worse ofc

17/18
Yes, they can in fact be used simultaneously! MHLA just compresses any KV cache that is used, so reducing KV cache size will also reduce the compressed representation. It is to be seen how well they work together, or if it is even necessary to do both, but it theoretically works

18/18
There are many in-depth explanations of the attention mechanism out there, but I really recommend trying attention hands-on through something like nanoGPT. I only started to really understand it after trying it for myself
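In that spirit, a minimal single-head causal self-attention in PyTorch, roughly what nanoGPT computes per head, is enough to get the idea across:

```python
import torch
import torch.nn.functional as F

def causal_self_attention(x, w_q, w_k, w_v):
    """Minimal single-head causal self-attention, in the spirit of nanoGPT.

    x: (T, d) token embeddings; w_q, w_k, w_v: (d, d) projection matrices.
    """
    T, d = x.shape
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project to queries/keys/values
    att = (q @ k.T) / d ** 0.5                      # scaled dot-product scores, (T, T)
    mask = torch.tril(torch.ones(T, T, dtype=torch.bool))
    att = att.masked_fill(~mask, float("-inf"))     # causal mask: no attending to the future
    att = F.softmax(att, dim=-1)                    # each row sums to 1
    return att @ v                                  # weighted sum of value vectors

T, d = 8, 32
x = torch.randn(T, d)
w_q, w_k, w_v = (torch.randn(d, d) / d ** 0.5 for _ in range(3))
print(causal_self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([8, 32])
```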




1/1
Not Llama 3 405B, but Nemotron 4 340B!
@nvidia just released a 340B dense LLM matching the original @OpenAI GPT-4 performance for chat applications and synthetic data generation. NVIDIA does not claim ownership of any outputs generated.

TL;DR:
340B parameters with 4k context window
Base, Reward Model and Instruct Model released
Trained on 9 trillion tokens in 2 phases
Trained on English, 50+ languages and 40+ programming languages
Requires 16x H100 in bf16 and ~8x H100 in int4 (rough weight-memory math sketched below)
Base Model achieves 81.1 MMLU; 90.53 HellaSwag; 85.44 BBH
Used SFT, DPO, and RPO for post-training
Commercially usable, but custom license
Available on @huggingface


Model: Nemotron 4 340B - a nvidia Collection
Technical Report: Nemotron-4 340B | Research
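A back-of-the-envelope check on the GPU requirement above, counting weights only (KV cache, activations, and parallelism overhead are ignored, which is why real deployments use more GPUs than this bare minimum):

```python
# Weight-only memory estimate for a 340B-parameter model.
# Ignores KV cache, activations, and framework overhead.
params = 340e9
h100_mem_gb = 80

for label, bytes_per_param in [("bf16", 2.0), ("int4", 0.5)]:
    weight_gb = params * bytes_per_param / 1e9
    min_gpus = weight_gb / h100_mem_gb
    print(f"{label}: ~{weight_gb:.0f} GB of weights -> at least {min_gpus:.1f} H100s for weights alone")

# bf16: ~680 GB -> >= 8.5 H100s; int4: ~170 GB -> >= 2.2 H100s.
# The quoted 16x (bf16) and ~8x (int4) leave headroom for KV cache and tensor parallelism.
```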



1/6
Nemotron-4-340B is released today!
* Base, Instruct, Reward models
* Permissive license
* Great for Synthetic Data Generation
* Designed to help others build their own models
* Sized for inference on 8 NVIDIA H100 GPUs
* Competitive across many tasks

2/6
Nemotron-4-340B-Base:
* Trained for 9T tokens on 6144 H100 GPUs
* Using Megatron-Core at 41% MFU
* 96 layers, 18432 hidden state
* GQA, Squared ReLU
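For reference, the "Squared ReLU" above is simply the ReLU output squared, used as the MLP activation; a one-line sketch:

```python
import torch

def squared_relu(x: torch.Tensor) -> torch.Tensor:
    """Squared ReLU activation: max(0, x) ** 2."""
    return torch.relu(x) ** 2

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(squared_relu(x))  # tensor([0.0000, 0.0000, 0.0000, 0.2500, 4.0000])
```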

3/6
Nemotron-4-340B-Instruct:
* Aligned using 98% synthetic data
* 28.19% : 46.57% : 25.24% win/tie/loss against GPT-4-1106-preview on our eval set with human raters

4/6
Nemotron-4-340B-Reward:
* Best reward model currently available
* Reward Bench Leaderboard - a Hugging Face Space by allenai

5/6
Blog: NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models
Report: Nemotron-4 340B | Research
Checkpoints:

6/6
I don’t know what license you’re looking at - the appropriate license is here:

https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf




1/2
RWKV-CLIP with SotA results. It's using #RWKV for both the image & text encoder. GitHub - deepglint/RWKV-CLIP: The official code of "RWKV-CLIP: A Robust Vision-Language Representation Learner" [2406.06973] RWKV-CLIP: A Robust Vision-Language Representation Learner

2/2
"RWKV-CLIP demonstrates closer distances in the image-text modality space, indicating superior cross-modal alignment performance."




1/7
Inspired by @karpathy 's latest "Let's Build GPT2", I trained a GPT2 model to generate audio. Don't think it's good enough to call it 'GPT2o' but it was funny.

code, model, dataset all on @huggingface but not sure if you want to use it lol. Will train for longer next time.

1/2

2/7
Blog :
Code : GitHub - nivibilla/build-nanogpt at audio
model : eastwind/gpt2-audio-tiny-sherlock-5k-overfit · Hugging Face
dataset : eastwind/tiny-sherlock-audio · Datasets at Hugging Face

2/2 https://pic.x.com/proerez0r5

3/7
Much better when trained for longer.

New model also on @huggingface:

eastwind/gpt2-audio-tiny-sherlock-100k-overfit · Hugging Face

4/7
Colab for a couple of hours. I have the Colab notebook linked; it has both training and inference.

5/7
Yes

6/7
Thanks! When trained for longer you get really clean audio.

7/7
No this is GPT2 trained for audio-2-audio




1/14
Introducing Samba 3.8B, a simple Mamba+Sliding Window Attention architecture that outperforms Phi3-mini on major benchmarks (e.g., MMLU, GSM8K and HumanEval) by a large margin. And it has an infinite context length with linear complexity.

Paper: [2406.07522] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

(1/6)
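A small sketch of the sliding-window half of the hybrid: attention is restricted to a fixed window, so its cost stays linear in sequence length, while the Mamba/SSM layers carry longer-range state. Window and sequence sizes here are illustrative.

```python
import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    """Boolean attention mask: token i may attend to tokens j with i - window < j <= i.

    This is the sliding-window attention part of the hybrid; the SSM layers carry
    long-range state with linear cost, so the window can stay small even for very
    long sequences.
    """
    i = torch.arange(seq_len).unsqueeze(1)
    j = torch.arange(seq_len).unsqueeze(0)
    return (j <= i) & (j > i - window)

mask = sliding_window_causal_mask(seq_len=8, window=3)
print(mask.int())
# Attention memory/compute is O(seq_len * window) instead of O(seq_len^2),
# which is what keeps decoding cost linear in context length.
```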

2/14
When trained on the 4K sequence length, Samba shows improved perplexity up to 1M context length on Proof-Pile, while still keeping its linear decoding complexity. This results in a 3.64x speedup over the Llama-3 architecture at 64k generation length.

(2/6)

3/14
Wondering how the extrapolation ability of Samba compares to Mistral? We instruction-tuned both architectures on Passkey Retrieval with 4K sequence length, and found that Samba (left) can have perfect memory recall up to 256K context length, while Mistral (right) struggles

4/14
This result drives us to do the full cycle post-training of our largest Samba-3.8B model. The recipe here is from the original Phi3-mini team, but we can already see substantial performance improvements on both short-context benchmarks and long-context summarization tasks. There

5/14
We release the training codebase for Samba-421M and Samba-1.3B, together with tons of baselines we tried in the paper, for supporting the opensource community: GitHub - microsoft/Samba: Official implementation of "Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling"

(5/6)

6/14
A big shoutout to my brilliant and supportive collaborators at Microsoft: @nlpyang , Yadong Lu, Yelong Shen, @chenliang1_ and @WeizhuChen!

(6/6)

7/14
Maybe after we release the 3.8B model. Stay tuned!

8/14
Hi Kaizhao, thanks for the interest! It is simple extrapolation. We actually tried state-of-the-art techniques such as Self-Extend, but it just cannot extrapolate infinitely, as shown in Figure 1 of the paper.

9/14
It depends on how much compute we can get. We currently don't have the plan to further scale it up.

10/14
Yes!

11/14
The release of Samba-3.8B-instruct is on the plan!

12/14
Yes, Flash Attention supports Sliding Window Attention.

13/14
We will release the Samba-3.8B-instruct soon. Stay tuned!

14/14
Tried my good old SeqBoat so hard but it seems this simple approach works the best for LM. Will try to revive SeqBoat from the MoE perspective later. Stay tuned!




1/2
Google DeepMind presents a new hybrid architecture which enables tokens in the LLM to cross-attend to node embeddings from a GNN-based neural algorithmic reasoner (NAR).

The resulting model, called TransNAR, demonstrates improvements in OOD reasoning across algorithmic tasks.

Quotes from the paper on why NARs could be useful: "NARs are capable of holding perfect generalization even on 6× larger inputs than ones seen in the training set, for highly complex algorithmic tasks with long rollouts".

The key here is the generalization that you are getting from NARs when combined with Transformers.

[2406.09308] Transformers meet Neural Algorithmic Reasoners
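A minimal sketch of the hookup described above: language-model token states cross-attend to node embeddings produced by the GNN-based NAR. Shapes and module names are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class TokenToNodeCrossAttention(nn.Module):
    """Sketch of a TransNAR-style block: LLM token states attend to NAR node embeddings.

    Names and shapes are illustrative; the paper interleaves this with the
    Transformer's own self-attention layers.
    """
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, token_states, node_embeddings):
        # Queries come from the language model; keys/values come from the GNN reasoner.
        out, _ = self.cross_attn(query=token_states,
                                 key=node_embeddings,
                                 value=node_embeddings)
        return token_states + out  # residual connection back into the LLM stream

tokens = torch.randn(1, 16, 256)  # (batch, seq_len, d_model) from the Transformer
nodes = torch.randn(1, 10, 256)   # (batch, n_nodes, d_model) from the GNN-based NAR
print(TokenToNodeCrossAttention(256)(tokens, nodes).shape)  # torch.Size([1, 16, 256])
```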

2/2
Sketching as a Visual Chain of Thought for Multimodal LLMs

Introduces SketchPad, a framework that enables a multimodal LLM to access a visual sketchpad and tools to draw on the sketchpad.

This can equip a model like GPT-4 with the capability to generate intermediate sketches to




1/2
Exciting new paper: TextGrad: Automatic “Differentiation” via Text!

This is a DSPy-like framework for optimizing prompts in a composite LLM system. However, there is one major difference!



In DSPy the idea is (basically):
- Forward pass: Each Module (LLM call) picks a random prompt (set of few-shot examples)
- Backwards pass: If the final answer was good, each internal module upweights the prompt it used.

In TextGrad you try to do something like the "chain rule":
- Forward pass: Each Module receives a (query, prompt) pair and outputs an answer.
- Backwards pass: We ask an LLM to "improve the prompt given the (query, answer) pair" and importantly "how you should improve the query" given the (prompt, answer) pair.

That allows you to bootstrap the backwards pass all the way end-to-end.
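A hypothetical sketch of that textual "chain rule", with `llm()` standing in for any chat-completion call and the prompt templates invented for illustration (not TextGrad's actual API):

```python
# Sketch of a TextGrad-style textual backward pass. `llm()` is a stand-in for any
# chat-completion call; the templates below are illustrative, not the library's own.

def llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call of choice here")

def forward(prompt: str, query: str) -> str:
    """Forward pass of one module: (prompt, query) -> answer."""
    return llm(f"{prompt}\n\nQuery: {query}")

def backward(prompt: str, query: str, answer: str, feedback: str):
    """Textual 'chain rule': turn feedback on the answer into critiques of both
    the prompt (to update this module) and the query (to pass upstream)."""
    prompt_grad = llm(
        f"Prompt:\n{prompt}\n\nQuery:\n{query}\n\nAnswer:\n{answer}\n\n"
        f"Feedback on the answer:\n{feedback}\n\n"
        "How should the prompt be changed to address this feedback?"
    )
    query_grad = llm(
        f"Prompt:\n{prompt}\n\nAnswer:\n{answer}\n\n"
        f"Feedback on the answer:\n{feedback}\n\n"
        "How should the upstream query be changed to address this feedback?"
    )
    return prompt_grad, query_grad

def step(prompt: str, prompt_grad: str) -> str:
    """'Optimizer' step: rewrite the prompt according to its textual gradient."""
    return llm(f"Rewrite this prompt:\n{prompt}\n\nto address this critique:\n{prompt_grad}")
```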

How well does it work? Pretty well! Prompt optimization is hard to do, and this method works at least as well as DSPy with random search!



More importantly, it's an exciting view into what's to come in modular LLM programming!

The paper also discusses lots of interesting applications, such as code gen! Well worth a read.



Pdf: https://arxiv.org/pdf/2406.07496
By @mertyuksekgonul , @federicobianchy, Joseph Boen, @ShengLiu_, @ZhiHuangPhD, Carlos Guestrin, @james_y_zou!

2/2
This is the most fun project!

We built PyTorch-for-text!
#TextGrad: automated "differentiation" via text to optimize AI systems by backpropagating LLM text feedback.

TextGrad + GPT4o:
LeetCodeHard best score
GPQA sota
Designs new molecules
Improves treatments




1/3
DepthAnythingV2 is up w/ code, ckpts and excellent performance.
tl;dr pipeline: finetune DINOv2-G for depth estimation on synthetic data (595k images) -> use it as teacher to generate pseudo-labels on 62M real images -> train student model on pseudo-labels
https://depth-anything-v2.github.io
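A sketch of the teacher-to-student step in that pipeline; `teacher`, `student`, the dataloader, and the depth loss are placeholders rather than the released code:

```python
import torch

def distill_with_pseudo_labels(teacher, student, unlabeled_loader, optimizer, loss_fn):
    """Teacher-to-student distillation loop as described in the tweet:
    the synthetic-data-finetuned teacher pseudo-labels real images,
    and the student is trained on those pseudo depth maps.

    `teacher`, `student`, `unlabeled_loader`, `loss_fn` are placeholders for the
    actual DINOv2-G-based teacher, a smaller student, and a depth loss.
    """
    teacher.eval()
    student.train()
    for images in unlabeled_loader:              # 62M unlabeled real images in the paper
        with torch.no_grad():
            pseudo_depth = teacher(images)       # pseudo-labels, no gradients through teacher
        pred_depth = student(images)
        loss = loss_fn(pred_depth, pseudo_depth)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```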

2/3
Link to arXiv as the original link is currently broken:

3/3
Indeed, it's no longer working. It might be temporary.
Here is the arXiv link: [2406.09414] Depth Anything V2




1/2
This @LaminiAI memory tuning looks quite incredible

Lamini Memory Tuning tunes a massive mixture of memory experts on any open-source LLM.

Each memory expert acts like a LoRA adapter that functionally operates as memory for the model.

Together, the memory experts specialize in a million different ways to ensure faithful and factual accuracy to the data the model was tuned on. Inspired by information retrieval, these million memory experts are equivalent to indices from which the model intelligently retrieves and routes.

At inference time, the model retrieves the most relevant experts at each layer and merges them back into the base model to respond to the user query.
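A very rough sketch of how that description reads (not Lamini's actual code): keep a bank of LoRA-style "memory experts", score them against the query, and merge the top-scoring low-rank deltas into the base weights for that call.

```python
import torch

def apply_selected_memory_experts(base_weight, experts, query_emb, top_k=2):
    """Hypothetical sketch of the idea as described (not Lamini's implementation):
    pick the memory experts most relevant to the query and merge their
    low-rank deltas into the base weight for this call.

    experts: list of dicts with a routing 'key' embedding and LoRA factors 'A', 'B'.
    """
    scores = torch.stack([torch.dot(query_emb, e["key"]) for e in experts])
    top = torch.topk(scores, top_k).indices.tolist()
    merged = base_weight.clone()
    for i in top:
        e = experts[i]
        merged += e["B"] @ e["A"]        # LoRA delta: (d_out, r) @ (r, d_in)
    return merged

d, r = 64, 4
base = torch.randn(d, d)
experts = [{"key": torch.randn(d),
            "A": torch.randn(r, d) * 0.01,
            "B": torch.randn(d, r) * 0.01} for _ in range(8)]
print(apply_selected_memory_experts(base, experts, torch.randn(d)).shape)  # torch.Size([64, 64])
```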

2/2
Sounds like Mixture of LoRA's

having multiple LoRAs and swapping them out per prompt.

LoRA, especially with this approach, will bring in a whole bunch of related knowledge, embedded in the weights rather than using up context.




1/4
And got access to Gemini 1.5 Pro’s 2M token context window

time to test.

Thanks @GoogleAI

2/4
Ah that's great to know.

3/4
For me the most important use case is a whole GitHub repo. Still doubt if 2M tokens will accommodate really big ones (like PyTorch).

4/4
A OnePlus phone with 24GB RAM running Mixtral 8x7B at 11 tokens/second.

Much faster inference speed vs llama.cpp and MLC-LLM.

Using swap and caching to run the model even if it doesn't fit the available RAM.

Between Apple’s LLM in a flash and PowerInfer-2, seems like the




1/2
This internal tool from @GoogleAI is interesting.

Smart Paste streamlines the code-authoring workflow by automating adjustments to pasted code.

Predicts the next state of a code environment and uses generative AI to create context-aware adjustments to pasted code.

2/2
Smart Paste for context-aware adjustments to pasted code




1/7
Introduce HumanPlus - Shadowing part

Humanoids are born for using human data. We build a real-time shadowing system using a single RGB camera and a whole-body policy for cloning human motion. Examples:
- boxing
- playing the piano/ping pong
- tossing
- typing

Open-sourced!

2/7
Which hardware platform should HumanPlus be embodied on?

We build our own 33-DoF humanoid with two dexterous hands using components:
- Inspire-Robots RH56DFX hands
- @UnitreeRobotics H1 robot
- @ROBOTIS Dynamixel motors
- @Razer webcams

We open-source our hardware design.

3/7
Naively copying joints from humans to humanoids does not work due to gravity and different actuations.

We train a transformer-based whole-body RL policy in IsaacGym simulation with realistic physics using AMASS dataset containing 40 hours of human motion: AMASS

4/7
To retarget from humans to humanoids, we copy the corresponding Euler angles from SMPL-X to our humanoid model.

We use open-sourced SOTA human pose and hand estimation methods (thanks!)
- WHAM for body: WHAM
- HaMeR for hands: HaMeR

5/7
Compared with other teleoperation methods, shadowing
- is affordable
- requires only 1 human operator
- avoids singularities
- natively supports whole-body control

6/7
Shadowing is an efficient data collection pipeline.

We then perform supervised behavior cloning to train skill policies using egocentric vision, allowing humanoids to complete different tasks autonomously by imitating human skills.

7/7
This project is not possible without our team of experts, covering from computer graphics to robot learning to robot hardware:
- co-leads: @qingqing_zhao_ @Qi_Wu577
- advisors: @chelseabfinn @GordonWetzstein

project website: HumanPlus: Humanoid Shadowing and Imitation from Humans
hardware:



1/6
Introduce HumanPlus - Autonomous Skills part

Humanoids are born for using human data. Imitating humans, our humanoid learns:
- fold sweatshirts
- unload objects from warehouse racks
- diverse locomotion skills (squatting, jumping, standing)
- greet another robot

Open-sourced!

2/6
We build our customized 33-DoF humanoid, and a data collection pipeline through real-time shadowing in the real world.

3/6
Using the data collected through shadowing, we then perform supervised behavior cloning to train skill policies using egocentric vision.

We introduce the Humanoid Imitation Transformer (HIT). Based on ACT, HIT adds forward-dynamics prediction in image-feature space as a regularization.
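A sketch of that objective as described: a behavior-cloning term on actions plus a forward-dynamics prediction term on image features as a regularizer. The loss types and weighting below are placeholder choices, not the paper's exact ones.

```python
import torch
import torch.nn.functional as F

def hit_style_loss(policy_actions, expert_actions,
                   pred_next_feat, actual_next_feat, dyn_weight=0.1):
    """Behavior cloning + forward-dynamics regularization on image features.

    Loss functions and `dyn_weight` are placeholders for illustration.
    """
    bc_loss = F.l1_loss(policy_actions, expert_actions)        # imitate the human demo actions
    dyn_loss = F.mse_loss(pred_next_feat, actual_next_feat)    # predict next-step image features
    return bc_loss + dyn_weight * dyn_loss

acts, expert = torch.randn(8, 33), torch.randn(8, 33)          # 33-DoF action targets
pred_f, next_f = torch.randn(8, 256), torch.randn(8, 256)      # image-feature predictions/targets
print(hit_style_loss(acts, expert, pred_f, next_f))
```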

4/6
Compared to baselines, HIT
- uses binocular vision, thus having implicit stereo for depth information
- uses visual feedback better, avoiding overfitting to proprioception given small-sized demos

5/6
Besides vision-based whole-body manipulation skills, our humanoid has strong locomotion skills:
- outperforming H1 default standing controller under strong perturbation forces
- enabling more whole-body skills like squatting and jumping

6/6
This project is not possible without our team of experts, covering from computer graphics to robot learning to robot hardware:
- co-leads: @qingqing_zhao_ @Qi_Wu577
- advisors: @chelseabfinn @GordonWetzstein


project website: HumanPlus: Humanoid Shadowing and Imitation from Humans
hardware:




1/1
There's a lack of datasets for genuine long-form video understanding.

This one is quite impressive for authentic long-form video understanding - CinePile

305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects, including temporal comprehension, understanding human-object interactions, and reasoning about events or actions within a scene.

