bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837

1/1
New Paper and Blog!

Sakana AI

As LLMs become better at generating hypotheses and code, a fascinating possibility emerges: using AI to advance AI itself! As a first step, we got LLMs to discover better algorithms for training LLMs that align with human preferences.






1/1
Can LLMs invent better ways to train LLMs?

At Sakana AI, we’re pioneering AI-driven methods to automate AI research and discovery. We’re excited to release DiscoPOP: a new SOTA preference optimization algorithm that was discovered and written by an LLM!

Sakana AI

Our method leverages LLMs to propose and implement new preference optimization algorithms. We then train models with those algorithms and evaluate their performance, providing feedback to the LLM. By repeating this process for multiple generations in an evolutionary loop, the LLM discovers many highly-performant and novel preference optimization objectives!
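
For a concrete picture, here is a minimal sketch of the kind of evolutionary discovery loop described above. The helper names (llm_generate, train_with_objective, evaluate_model) are hypothetical stand-ins, not Sakana's actual code.

# Hypothetical sketch of an LLM-driven discovery loop; not the DiscoPOP implementation.
def discover_objectives(llm_generate, train_with_objective, evaluate_model, generations=10):
    history = []  # (objective_code, validation_score) pairs fed back to the LLM
    for _ in range(generations):
        prompt = "Previously tried preference-optimization objectives and their scores:\n"
        prompt += "\n".join(f"{code}\n# score: {score:.3f}" for code, score in history)
        prompt += "\nPropose a new, different objective as a Python loss function."
        code = llm_generate(prompt)                 # LLM writes a candidate objective
        model = train_with_objective(code)          # fine-tune a model with that loss
        score = evaluate_model(model)               # e.g. held-out preference accuracy
        history.append((code, score))               # evolutionary feedback signal
    return max(history, key=lambda pair: pair[1])   # best objective found so far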

Paper: [2406.08414] Discovering Preference Optimization Algorithms with and for Large Language Models
GitHub: GitHub - SakanaAI/DiscoPOP: Code for Discovering Preference Optimization Algorithms with and for Large Language Models
Model: SakanaAI/DiscoPOP-zephyr-7b-gemma · Hugging Face

We proudly collaborated with the
@UniOfOxford (@FLAIR_Ox) and @Cambridge_Uni (@MihaelaVDS) on this groundbreaking project. Looking ahead, we envision a future where AI-driven research reduces the need for extensive human intervention and computational resources. This will accelerate scientific discoveries and innovation, pushing the boundaries of what AI can achieve.



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837










1/9
so this is nuts, if you're cool with the high-frequency details of an image being reinterpreted/stochastic, you can encode an image quite faithfully into 32 tokens...
with a codebook size of 1024, as they use, this is just 320 bits, a new upper bound for the information in an image unlocked

2/9
only 2^320 unique images which is both a lot and a little

3/9
Yeah I see it as a spectrum of how generative each stage can be, but still think it's a very underexploited area, both in computation and in ways of improving outputs/controllability

4/9
you could probably get this down to fewer tokens with a larger vocab size and if willing to sacrifice a little more fidelity which IMHO is fine! especially in image generation where often we're getting random outputs anyway

5/9
Those kinds of high-frequency variations are the cost and are left up to the interpretation of the decoder. I could imagine the image still looking "plausible", but yeah, likely things like text or the finer identities of people will not hold up. Still very practical

6/9
In the network yes, but for storage and measured information you only need the indices. A 1024 (2^10) codebook requires 10 bits to cover all values. A typical LLM tokenizer of say 65536 ( 2^16 ) vocab puts you at 2 bytes a token
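
For concreteness, the arithmetic behind the 320-bit figure (assumed setup from this thread: 32 tokens per image, codebook of 1024 entries):

import math

tokens_per_image = 32
codebook_size = 1024

bits_per_token = math.log2(codebook_size)            # 10 bits per index
bits_per_image = tokens_per_image * bits_per_token   # 320 bits = 40 bytes of storage
n_unique_images = 2 ** int(bits_per_image)           # the "2^320 unique images" figure above

print(bits_per_token, bits_per_image)                # 10.0 320.0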

7/9
Why? You can just have a more expressive decoder (like diffusion) and consider the tokens to be a prior rather than a pure reconstruction

8/9
Absolutely not lol, but it’s close to good enough standards

9/9
Generation/reconstruction is pretty different from encoding for other purposes, and I think the recent Meta paper showed that discrete is worse for features you'd give to a VLM, for instance



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837




1/4
High-quality out-of-order generation is going to be really fun. Several interesting papers on this topic this month showing promise.

This is from new work led by Subham Sahoo and
@volokuleshov : MDLM Blog post

2/4
Also check out this concurrent work by Jiaxin Shi and company. Somehow we went through 100 titles and came up with nearly the same one as them!

3/4
Sigma GPT also looks at this problem from a slightly different angle.

Submission 1127 - σ-GPTs - Examples

4/4
TIL: All NN libraries have a special secret version of ReLU where you also threshold at 6.

ReLU6 — PyTorch 2.3 documentation
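
A quick PyTorch check of what ReLU6 actually does (clamp activations to the range [0, 6]):

import torch
import torch.nn.functional as F

x = torch.linspace(-3, 9, 5)             # tensor([-3., 0., 3., 6., 9.])
print(F.relu6(x))                        # tensor([0., 0., 3., 6., 6.])
print(torch.clamp(x, min=0.0, max=6.0))  # same thing, written out by hand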



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837

1/1
Google presents Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning

Presents a novel benchmark designed to assess LLMs’ temporal reasoning abilities, which SotA LLMs currently struggle with

data: baharef/ToT · Datasets at Hugging Face
abs: [2406.09170] Test of Time: A Benchmark for Evaluating LLMs on Temporal Reasoning



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837







1/7
Meta announces An Image is Worth More Than 16x16 Patches

Exploring Transformers on Individual Pixels

This work does not introduce a new method. Instead, we present an interesting finding that questions the necessity of the inductive bias -- locality in modern computer vision

2/7
architectures. Concretely, we find that vanilla Transformers can operate by directly treating each individual pixel as a token and achieve highly performant results. This is substantially different from the popular design in Vision Transformer, which maintains the inductive bias

3/7
from ConvNets towards local neighborhoods (e.g. by treating each 16x16 patch as a token). We mainly showcase the effectiveness of pixels-as-tokens across three well-studied tasks in computer vision: supervised learning for object classification, self-supervised learning via

4/7
masked autoencoding, and image generation with diffusion models. Although directly operating on individual pixels is less computationally practical, we believe the community

5/7
must be aware of this surprising piece of knowledge when devising the next generation of neural architectures for computer vision.
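
As a rough illustration of the pixels-as-tokens idea (a toy sketch, not the paper's model): flatten every pixel into its own token, embed it with a linear layer, and run a vanilla Transformer encoder with no patch-level locality.

import torch
import torch.nn as nn

class PixelTransformer(nn.Module):
    def __init__(self, img_size=28, in_chans=3, d_model=192, depth=4, n_heads=3, n_classes=10):
        super().__init__()
        n_tokens = img_size * img_size                          # one token per pixel
        self.pixel_embed = nn.Linear(in_chans, d_model)         # embeds a single pixel's channels
        self.pos_embed = nn.Parameter(torch.zeros(1, n_tokens, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                                       # x: (B, C, H, W)
        tokens = x.flatten(2).transpose(1, 2)                   # (B, H*W, C): each pixel is a token
        h = self.encoder(self.pixel_embed(tokens) + self.pos_embed)
        return self.head(h.mean(dim=1))                         # mean-pool tokens, then classify

logits = PixelTransformer()(torch.randn(2, 3, 28, 28))          # -> shape (2, 10)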

6/7
paper page:

7/7
daily papers:



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837

1/1
Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

abs: [2406.09279] Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback

"In this work, we aim to systematically investigate these key components of learning from preferences"

Exploration of Tulu 2 13B (Llama-2-13B SFT) with 14 different preference datasets evaluated on 11 different benchmarks.

Several observations:
1. Synthetic, diverse data annotated with per-aspect preferences works best for learning from preferences
2. PPO outperforms DPO across varied datasets in our evaluation suite, even when using exactly the same models and initial training data
3. Increasing the reward model size and the dataset size used to train the reward model results in marginal to no improvements on policy model performance, except on GSM
4. Using unlabelled prompts that better match the test setting during policy training can further improve model performance in domain-specific settings



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837





1/5
What if we prompt aligned LLMs like Llama-3-Instruct with nothing? Surprisingly, it will decode decent user queries thanks to its auto-regressive nature. In our new preprint, Magpie, we find this is a scalable way to self-synthesize instruction data of high quality & diversity.

arXiv: [2406.08464] Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing

@huggingface : Magpie-Align (Magpie Alignment)
website: Magpie|

> Data & Results: We use this prompting-with-nothing method to self-synthesize 4M data points from Llama-3-Instruct. After filtering, we use a 300K subset of them for SFT. Results show that using Magpie data for SFT-ing Llama-3-8B-base is better than using other public datasets (e.g., UltraChat, ShareGPT) on alignment benchmarks like AlpacaEval, ArenaHard and WildBench; it can be comparable or even better than SFT+DPO with open data. Notably, on AE2, Magpie models' performance is close to the official Llama-3-8B-Instruct.

> How to self-synthesize alignment data? It’s easy. We only input the pre-query template tokens “<|start_header_id|>user <|end_header_id|>” and sample the completions (before <|eot_id|>) as queries (i.e., instructions). Then, we use Llama-3-8/70B-Instruct to infer the responses for creating Magpie-Air/Pro. Similarly, we can repeat the process to create multi-turn dialogs. Finally, we apply multiple filters to subsample and further ensure the overall data quality.
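
A rough sketch of that extraction step with Hugging Face transformers (my own approximation of the recipe, not the authors' pipeline; the template string is the one quoted later in this thread, and the sampling settings are illustrative):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="auto")

# Feed only the pre-query template and let the model free-run: the sampled
# continuation is treated as a synthetic user instruction.
template = "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"
inputs = tok(template, return_tensors="pt", add_special_tokens=False).to(model.device)

out = model.generate(**inputs, do_sample=True, temperature=1.0, top_p=1.0,
                     max_new_tokens=128,
                     eos_token_id=tok.convert_tokens_to_ids("<|eot_id|>"))
synthetic_instruction = tok.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(synthetic_instruction)   # a self-synthesized query; responses are generated in a second pass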

> Transferability? We also used the Magpie data to tune Qwen 1.5 4B & 7B base, and also got better performance than their official aligned models, suggesting that Magpie data can be used for aligning other model families and smaller models too. We hope our data can help improve AI democratization.

> Limitations? On MMLU-Redux (0-shot), Magpie-tuned LLMs are at 52%-ish (among the top in baselines), while still falling behind official Llama-3-8B-Instruct (58%). Magpie does not have enough math data yet, and we’re still working on synthesizing more math & reasoning data. Meanwhile, we suggest mixing Magpie data with other awesome data (e.g., MAmmoTH2 from @xiangyue96 @WenhuChen ) which excel in math & reasoning. In addition, we’re also working on creating pairwise preference data for *PO.

> Why "Magpie"? "Other birds collect twigs for their nests. Magpies steal jewels for theirs."

Awesome co-authors from @UW & @allen_ai: @zhangchen_xu (lead) @fengqing_jiang Luyao Niu, @yuntiandeng @RadhaPoovendran @YejinChoinka


Thanks @_akhaliq for helping introduce our paper previously! :D

2/5
Btw, we previously observed a similar situation with GPT API: given empty queries, they will return a response to a latent query [1] ; after getting these outputs, one can recover the prompts by inversion [2].

Sadly, it seems that @OpenAI soon disallowed us from inputting

3/5
For Llama-3-Instruct, it is "<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n"

4/5
Previously, we thought Llama-3-instruct was only trained on generating responses and masked the loss on input tokens. They claimed they did so in Llama-2-chat report, which is also the common practice of SFT & DPO. So it’s a bit surprising to us that it can recover the inputs

5/5
WebInstruct data (MAmmoTH2) should be better in science QA, reasoning and math; Magpie has fewer instances in these domains (see our limitations part). We believe the two datasets are complementary and can be mixed together for better SFT.



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837


















1/18
Finally, I present you my first ever preprint:

MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

We show that sharing KV heads between layers allows for a KV cache smaller than GQA/MQA allowed for, with a reasonable accuracy/memory tradeoff
[2406.09297] MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

2/18
As described by the name, MLKV allows for configurations that share the KV heads of lower layers with the layers above them. While MQA was limited to reducing the KV cache to one KV head per layer, MLKV, at the most extreme, can go down to a single KV head shared across all layers, i.e. 1/n_layers of even MQA's cache size.
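
A back-of-the-envelope comparison of cache sizes under the different sharing schemes (toy numbers, not the paper's configs):

# Bytes of KV cache = 2 (K and V) x layers-with-KV x KV heads x head_dim x seq_len x batch x bytes/elem
def kv_cache_bytes(n_layers_with_kv, n_kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
    return 2 * n_layers_with_kv * n_kv_heads * head_dim * seq_len * batch * dtype_bytes

n_layers, n_heads, head_dim, seq_len = 12, 12, 64, 2048

mha  = kv_cache_bytes(n_layers, n_heads, head_dim, seq_len)   # every layer, every head
gqa  = kv_cache_bytes(n_layers, 4, head_dim, seq_len)         # 4 KV heads per layer
mqa  = kv_cache_bytes(n_layers, 1, head_dim, seq_len)         # 1 KV head per layer
mlkv = kv_cache_bytes(n_layers // 6, 1, head_dim, seq_len)    # 1 KV head shared across every 6 layers

print(mha, gqa, mqa, mlkv)   # MLKV can shrink the cache below the MQA floor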

3/18
We uptrain 8 variants of Pythia-160M in different configs with different total KV head counts, from GQA/MQA equivalents to MLKV setups that go beyond what was possible. Uptrain loss is reported in Table 2.

4/18
The benchmark results show that there is a price to pay for reducing KV heads, as the avg accuracy goes down slightly. The most extreme config with 1 KV head collapses, but all other variants are very much usable.

5/18
Now to the good part. This plot can be predicted from the theoretical cache sizes, but it is really cool to see it in practice, when inferencing IRL. You can clearly see the benefit in going down with MLKV. IMPORTANT: This plot also applies for longer context lengths!

6/18
We can plot the accuracy/memory trade-off. We show that configurations with half the KV heads of MQA sit right alongside MQA on the Pareto-optimal part of the curve. Even going below that is possible if needed.

7/18
Some limitations to curb your enthusiasm for those who have been following since the start:
1. Small scale at 160M params
2. Uptrain instead of train from scratch
Keep in mind this was my "final project" for a bachelor's degree. My first priority was to get this degree as safely..

8/18
..and quickly as possible. This scale allowed me to iterate and experiment more in a limited timeframe, which I really needed. I'm certain it is possible to scale MLKV up as was done with GQA. If you are interested in optimizing your model, lmk and I'll help you

9/18
About CLA. This idea is pretty obvious if you know the literature, so it was bound to happen. There are still some differences, e.g. we uptrain models (which is more practical) and see the effects. Also, we tested more extreme configurations too.

10/18
This pre-print is planned to be submitted to EMNLP. I am not 100% confident in getting in (given limitations and CLA) but I do want to give it a shot. But what I really wanted was to get this up on arXiv and share it to you all, which is what I'm doing now.

11/18
Which is super satisfying and I'm happy with the outcome! I thank everyone who has been following and showing interest in this project since the start, nearly a year ago. You guys are cool, thank you!

12/18
Big thanks and credits to co-authors and supervisors @faridlazuarda , @AyuP_AI, and @AlhamFikri

13/18
yep lots of these KV optimizations are stackable. but also beware of the accuracy tradeoff

14/18
thank youu

15/18
thank you! here's to hoping

16/18
Thank you!

Yes you got it exactly right, no free lunch. I've tried and well, the benchmark results are worse ofc

17/18
Yes, they can in fact be used simultaneously! MHLA just compresses any KV cache that is used, so reducing KV cache size will also reduce the compressed representation. It is to be seen how well they work together, or if it is even necessary to do both, but it theoretically works

18/18
There are many in depth explanations out there on the attention mechanism, but I really recommend you trying attention hands-on through something like nanoGPT. I only started to really understand after trying it for myself



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837

1/1
Not Llama 3 405B, but Nemotron 4 340B!
@nvidia just released a 340B dense LLM matching the original
@OpenAI
GPT-4 performance for chat applications and synthetic data generation. NVIDIA does not claim ownership of any outputs generated.

TL;DR:
340B parameters with a 4K context window
Base, Reward, and Instruct models released
Trained on 9 trillion tokens in 2 phases
Trained on English, 50+ natural languages, and 40+ programming languages
Requires 16x H100 in bf16 and ~8x H100 in int4 (see the rough memory arithmetic below)
Base model achieves 81.1 MMLU; 90.53 HellaSwag; 85.44 BBH
Used SFT, DPO, and RPO for post-training
Commercially usable, but under a custom license
Available on
@huggingface


Model: Nemotron 4 340B - a nvidia Collection
Technical Report: Nemotron-4 340B | Research
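
Rough arithmetic behind those GPU counts (weights only; serving also needs headroom for KV cache and activations):

params = 340e9                       # 340B parameters
h100 = 80e9                          # bytes of memory per 80 GB H100

bf16_weights = params * 2            # ~680 GB -> exceeds one 8x80 GB node (640 GB), hence 16x H100
int4_weights = params * 0.5          # ~170 GB -> fits on a single 8x80 GB node with room to spare
print(bf16_weights / h100, int4_weights / h100)   # ~8.5 and ~2.1 GPUs worth of raw weights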









1/6
Nemotron-4-340B is released today!
* Base, Instruct, Reward models
* Permissive license
* Great for Synthetic Data Generation
* Designed to help others build their own models
* Sized for inference on 8 NVIDIA H100 GPUs
* Competitive across many tasks

2/6
Nemotron-4-340B-Base:
* Trained for 9T tokens on 6144 H100 GPUs
* Using Megatron-Core at 41% MFU
* 96 layers, 18432 hidden state
* GQA, Squared ReLU

3/6
Nemotron-4-340B-Instruct:
* Aligned using 98% synthetic data
* 28.19% : 46.57% : 25.24% win/tie/loss against GPT-4-1106-preview on our eval set with human raters

4/6
Nemotron-4-340B-Reward:
* Best reward model currently available
* Reward Bench Leaderboard - a Hugging Face Space by allenai

5/6
Blog: NVIDIA Releases Open Synthetic Data Generation Pipeline for Training Large Language Models
Report: Nemotron-4 340B | Research
Checkpoints:

6/6
I don’t know what license you’re looking at - the appropriate license is here:

https://developer.download.nvidia.com/licenses/nvidia-open-model-license-agreement-june-2024.pdf



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837


1/2
RWKV-CLIP with SotA results. It's using #RWKV for both the image & text encoders. GitHub - deepglint/RWKV-CLIP: The official code of "RWKV-CLIP: A Robust Vision-Language Representation Learner" [2406.06973] RWKV-CLIP: A Robust Vision-Language Representation Learner

2/2
"RWKV-CLIP demonstrates closer distances in the image-text modality space, indicating superior cross-modal alignment performance."



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837














1/14
Introducing Samba 3.8B, a simple Mamba+Sliding Window Attention architecture that outperforms Phi3-mini on major benchmarks (e.g., MMLU, GSM8K and HumanEval) by a large margin. And it has an infinite context length with linear complexity.

Paper: [2406.07522] Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling

(1/6)
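
To make the sliding-window half concrete, here is a tiny illustration of the kind of attention mask a sliding-window attention layer uses (a sketch, not Samba's code; the Mamba layers and the exact interleaving are described in the paper):

import torch

def sliding_window_causal_mask(seq_len, window):
    # Each position attends to itself and the previous (window - 1) positions only,
    # so per-token attention cost stays constant regardless of sequence length.
    i = torch.arange(seq_len).unsqueeze(1)   # query positions
    j = torch.arange(seq_len).unsqueeze(0)   # key positions
    return (j <= i) & (j > i - window)       # True = attention allowed

print(sliding_window_causal_mask(seq_len=6, window=4).int())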

2/14
When trained on a 4K sequence length, Samba shows improved perplexity up to 1M context length on Proof-Pile, while still keeping its linear decoding complexity. This results in a 3.64x speedup over the Llama-3 architecture at 64K generation length.

(2/6)

3/14
Wondering how the extrapolation ability of Samba compares to Mistral? We instruction-tuned both architectures on Passkey Retrieval with a 4K sequence length, and found that Samba (left) can have perfect memory recall up to 256K context length, while Mistral (right) struggles

4/14
This result drove us to do the full-cycle post-training of our largest Samba-3.8B model. The recipe here is from the original Phi3-mini team, but we can already see substantial performance improvements on both short-context benchmarks and long-context summarization tasks. There

5/14
We release the training codebase for Samba-421M and Samba-1.3B, together with tons of baselines we tried in the paper, to support the open-source community: GitHub - microsoft/Samba: Official implementation of "Samba: Simple Hybrid State Space Models for Efficient Unlimited Context Language Modeling"

(5/6)

6/14
A big shoutout to my brilliant and supportive collaborators at Microsoft: @nlpyang , Yadong Lu, Yelong Shen, @chenliang1_ and @WeizhuChen!

(6/6)

7/14
Maybe after we release the 3.8B model. Stay tuned!

8/14
Hi Kaizhao, thanks for the interest! It is simple extrapolation. We actually tried state-of-the-art techniques such as Self-Extend, but it just cannot extrapolate infinitely, as shown in Figure 1 of the paper.

9/14
It depends on how much compute we can get. We currently don't have the plan to further scale it up.

10/14
Yes!

11/14
The release of Samba-3.8B-instruct is on the plan!

12/14
Yes, Flash Attention supports Sliding Window Attention.

13/14
We will release the Samba-3.8B-instruct soon. Stay tuned!

14/14
Tried my good old SeqBoat so hard but it seems this simple approach works the best for LM. Will try to revive SeqBoat from the MoE perspective later. Stay tuned!



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837


1/2
Google DeepMind presents a new hybrid architecture which enables tokens in the LLM to cross-attend to node embeddings from a GNN-based neural algorithmic reasoner (NAR).

The resulting model, called TransNAR, demonstrates improvements in OOD reasoning across algorithmic tasks.

Quotes from the paper on why NARs could be useful: "NARs are capable of holding perfect generalization even on 6× larger inputs than ones seen in the training set, for highly complex algorithmic tasks with long rollouts".

The key here is the generalization that you are getting from NARs when combined with Transformers.

[2406.09308] Transformers meet Neural Algorithmic Reasoners
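
A minimal sketch of the fusion mechanism (hypothetical shapes and module choice, not DeepMind's code): the Transformer's token states cross-attend to node embeddings produced by the GNN-based NAR, then continue through the LLM stack.

import torch
import torch.nn as nn

d_model, n_tokens, n_nodes = 256, 32, 16
token_states = torch.randn(1, n_tokens, d_model)    # hidden states from the Transformer stream
node_embeddings = torch.randn(1, n_nodes, d_model)  # node embeddings from the pretrained NAR (GNN)

cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
fused, _ = cross_attn(query=token_states, key=node_embeddings, value=node_embeddings)
token_states = token_states + fused                  # residual update before the next Transformer block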

2/2
Sketching as a Visual Chain of Thought for Multimodal LLMs

Introduces SketchPad, a framework that enables a multimodal LLM to access a visual sketchpad and tools to draw on the sketchpad.

This can equip a model like GPT-4 with the capability to generate intermediate sketches to



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837



1/2
Exciting new paper: TextGrad: Automatic “Differentiation” via Text!

This is a DSPy-like framework for optimizing prompts in a composite LLM system. However, there is one major difference!



In DSPy the idea is (basically):
- Forward pass: Each module (LLM call) picks a random prompt (set of few-shot examples)
- Backwards pass: If the final answer was good, each internal module upweights the prompt it used.

In TextGrad you try to do something like the "chain rule":
- Forward pass: Each Module receives a (query, prompt) pair and outputs an answer.
- Backwards pass: We ask an LLM to "improve the prompt given the (query, answer) pair" and, importantly, "how the query should be improved" given the (prompt, answer) pair.

That allows you to bootstrap the backwards pass all the way end-to-end.
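
A conceptual sketch of that textual "chain rule" (ask_llm is a hypothetical helper, not the TextGrad library API):

def backward(query, prompt, answer, downstream_feedback, ask_llm):
    # "Gradient" w.r.t. the prompt: how should this module's prompt change?
    prompt_feedback = ask_llm(
        f"Given query: {query}\nanswer: {answer}\nfeedback from later stages: {downstream_feedback}\n"
        f"How should this prompt be improved?\nPrompt: {prompt}"
    )
    # "Gradient" w.r.t. the input: feedback for the module that produced the query.
    query_feedback = ask_llm(
        f"Given prompt: {prompt}\nanswer: {answer}\nfeedback from later stages: {downstream_feedback}\n"
        f"How should the input query be improved?\nQuery: {query}"
    )
    return prompt_feedback, query_feedback   # query_feedback backpropagates to the previous module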

How well does it work? Pretty well! Prompt optimization is hard to do, and this method works at least as well as DSPy with random search!



More importantly, it's an exciting view into what's to come in modular LLM programming!

The paper also discusses lots of interesting applications, such as code gen! Well worth a read.



Pdf: https://arxiv.org/pdf/2406.07496
By @mertyuksekgonul , @federicobianchy, Joseph Boen, @ShengLiu_, @ZhiHuangPhD, Carlos Guestrin, @james_y_zou!

2/2
This is the most fun project!

We built PyTorch-for-text!
#TextGrad: automated "differentiation" via text to optimize AI systems by backpropagating LLM text feedback.

TextGrad + GPT-4o:
LeetCodeHard best score
GPQA SotA
Designs new molecules
Improves treatments



bnew

Veteran
Joined
Nov 1, 2015
Messages
56,130
Reputation
8,239
Daps
157,837



1/3
DepthAnythingV2 is up w/ code, ckpts and excellent performance.
tl;dr pipeline: finetune DINOv2-G for depth estimation on synthetic data (595K images) -> use it as a teacher to generate pseudo-labels on 62M real images -> train a student model on the pseudo-labels
https://depth-anything-v2.github.io
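
A toy sketch of the pseudo-label distillation step in that pipeline (models are stand-ins, and the L1 loss is a placeholder for the paper's actual training losses):

import torch
import torch.nn.functional as F

def distill_step(student, teacher, real_images, optimizer):
    with torch.no_grad():
        pseudo_depth = teacher(real_images)   # teacher: model fine-tuned on synthetic depth data
    pred = student(real_images)
    loss = F.l1_loss(pred, pseudo_depth)      # regress the student onto the pseudo-labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()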

2/3
Link to arXiv as the original link is currently broken:

3/3
Indeed, it's no longer working. It might be temporary.
Here is the arXiv link: [2406.09414] Depth Anything V2

