bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700





1/11
@reach_vb
Let's goo! F5-TTS 🔊

> Trained on 100K hours of data
> Zero-shot voice cloning
> Speed control (based on total duration)
> Emotion based synthesis
> Long-form synthesis
> Supports code-switching
> Best part: CC-BY license (commercially permissive)🔥

Diffusion-based architecture:
> Non-Autoregressive + Flow Matching with DiT
> Uses ConvNeXt to refine text representation and alignment

Synthesised: I was, like, talking to my friend, and she’s all, um, excited about her, uh, trip to Europe, and I’m just, like, so jealous, right? (Happy emotion)

The TTS scene is on fire! 🐐



https://video.twimg.com/ext_tw_video/1845154255683919887/pu/vid/avc1/480x300/mzGDLl_iiw5TUzGg.mp4

2/11
@reach_vb
Check out the open model weights here:

SWivid/F5-TTS · Hugging Face



3/11
@reach_vb
Overall architecture:



GZtRAvgW0A4u44B.jpg


4/11
@reach_vb
Also, @realmrfakename made a dope space to play with the model:

E2/F5 TTS - a Hugging Face Space by mrfakename



5/11
@reach_vb
Some notes on the model:

1. Non-Autoregressive Design: Uses filler tokens to match text and speech lengths, removing the need for separate duration models and text encoders.

2. Flow Matching with DiT: Employs flow matching with a Diffusion Transformer (DiT) for denoising and speech generation.

3. ConvNeXt for Text: Used to refine the text representation, enhancing alignment with speech.

4. Sway Sampling: Introduces an inference-time Sway Sampling strategy to boost performance and efficiency, applicable without retraining.

5. Fast Inference: Achieves an inference Real-Time Factor (RTF) of 0.15, faster than state-of-the-art diffusion-based TTS models.

6. Multilingual Zero-Shot: Trained on a 100K hours multilingual dataset, demonstrates natural, expressive zero-shot speech, seamless code-switching, and efficient speed control.
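For intuition on point 2 above, here is a minimal sketch of a conditional flow-matching training step with a simple linear noise-to-data path; the `model` signature, tensor shapes, and the plain MSE objective are illustrative assumptions, not the actual F5-TTS code.

```python
# Minimal sketch of a flow-matching training step (assumed linear-interpolation path),
# illustrating the objective behind point 2 above -- not the actual F5-TTS code.
import torch

def flow_matching_step(model, x1, cond, optimizer):
    """x1: clean mel target [B, T, D]; cond: text/audio conditioning (model is hypothetical)."""
    b = x1.shape[0]
    t = torch.rand(b, 1, 1, device=x1.device)           # random time in [0, 1]
    x0 = torch.randn_like(x1)                            # noise sample
    xt = (1 - t) * x0 + t * x1                           # interpolate noise -> data
    v_target = x1 - x0                                   # constant velocity along the path
    v_pred = model(xt, t.squeeze(), cond)                # DiT predicts the velocity field
    loss = torch.nn.functional.mse_loss(v_pred, v_target)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return loss.item()
```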



6/11
@TommyFalkowski
This model is pretty good indeed! I haven't tried long-form generation yet, but I'm really excited to have a model that could replace the online Edge TTS I'm currently using.

[Quoted tweet]
I think we might finally have an elevenlabs level text-to-speech model at home! I got the demo to run on a machine with a 3070 with 8GB of vram!


https://video.twimg.com/ext_tw_video/1844477815166500885/pu/vid/avc1/1108x720/ULTpniql5_9M761K.mp4

7/11
@realkieranlewis
can this be run on Replicate etc? any indication on cost vs ElevenLabs?



8/11
@j6sp5r
nice
Wen in open NotebookLM



9/11
@lalopenguin
it sounds great!!



10/11
@BhanuKonepalli
This one's a game changer !!



11/11
@modeless
Ooh, this looks really great!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700




1/13
@reach_vb
This is wicked! You can play in a constantly evolving environment - all open source & today! 🔥

The gaming industry is going to get disrupted so fkn hard!



https://video.twimg.com/ext_tw_video/1844803008695001088/pu/vid/avc1/848x550/iaVmq7PM7YDtecJr.mp4

2/13
@harshuleonite7
Thought it was just normal gameplay.
Reads the title: "whatttt"

I wonder if future games will be rendered in real time,
or if you get a base game and only the future DLC or missions are rendered.



3/13
@reach_vb
The most appealing aspect is probably unique gaming experiences - i.e. no two gaming experiences are the same



4/13
@MGruselglatz
What is my company doing in 2 years?! 😃🫣



5/13
@reach_vb
Probably still the same as today but with 10x more AI assistance



6/13
@KrakowiakK
They will adapt - it's all about creativity, and technology is constantly evolving.



7/13
@reach_vb
💯 - change is the only constant!



8/13
@LeoVasanko
Cool that video models can do this, but not at all useful for gaming because simply turning 180 degrees gives you a completely different landscape than where you just came from. It holds no consistency of the world beyond what is visible.



9/13
@krish240574
Link to repo? Thanks.



10/13
@ethicalaimad
I love the idea of games in open, constantly evolving environments. The future of gaming is exciting. Let's see how we can make sure Spain doesn't fall behind in this field!



11/13
@realmrfakename
Trained for 12 days on a consumer GPU!



12/13
@changtimwu
"road bending upwards and folding over itself"



13/13
@TheVivifier
you know
this is normal, right?

lots of games have this already, without using 'AI models'

Dead Cells comes to mind (and many, many more)




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700




1/9
@reach_vb
🚨 @rhymes_ai_ released Aria - Multimodal MoE (3.9B active), 64K tokens, caption 256 frames in 10 sec, Apache 2.0 licensed! Beats GPT4o & Gemini Flash ⚡

> 3.9B Active, 25.3B Total parameters
> Significantly better than Pixtral 12B, Llama Vision 11B & Qwen VL

> Trained on 7.5T tokens
> Four stage training:
- 6.4T language pre-training
- 1.4T multimodal pre-training
- 35B long context training
- 20B high quality post-training

Architecture:
> Aria consists of a vision encoder and a mixture-of-experts (MoE) decoder

> Vision encoder:
- Produces visual tokens for images/videos in native aspect ratio
- Operates in three resolution modes: medium, high, and ultra-high
- Medium-resolution: 128 visual tokens
- High-resolution: 256 visual tokens
- Ultra-high resolution: Dynamically decomposed into multiple high-resolution sub-images

> MoE decoder:
- Multimodal native, conditioned on both language and visual input tokens
- 66 experts per MoE layer
- 2 experts shared among all inputs to capture common knowledge
- 6 additional experts activated per token by a router module

> Models on the Hub & Integrated with Transformers!

Kudos Rhymes AI team - Vision language landscape continues to rip! 🐐
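To make the routing description above concrete, here is a toy sketch of an MoE layer with 2 always-on shared experts and a top-6 router over the remaining experts; the hidden sizes, the 2+64 split of the 66 experts, and the dense per-expert loop are simplifying assumptions, not Rhymes AI's implementation.

```python
# Toy MoE block: 2 shared experts applied to every token, plus 6 of 64 routed
# experts selected per token by a softmax router. Dimensions are illustrative.
import torch
import torch.nn as nn

def ffn(d_model, d_ff):
    return nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

class ToyMoE(nn.Module):
    def __init__(self, d_model=1024, d_ff=2048, n_routed=64, n_shared=2, top_k=6):
        super().__init__()
        self.shared = nn.ModuleList([ffn(d_model, d_ff) for _ in range(n_shared)])
        self.routed = nn.ModuleList([ffn(d_model, d_ff) for _ in range(n_routed)])
        self.router = nn.Linear(d_model, n_routed)
        self.top_k = top_k

    def forward(self, x):                               # x: [tokens, d_model]
        out = sum(e(x) for e in self.shared)            # shared experts capture common knowledge
        gate = self.router(x).softmax(dim=-1)           # [tokens, n_routed]
        topv, topi = gate.topk(self.top_k, dim=-1)      # 6 experts activated per token
        routed = torch.zeros_like(x)
        for e_id, expert in enumerate(self.routed):     # dense loop for clarity; real MoE dispatches sparsely
            w = (topv * (topi == e_id)).sum(dim=-1, keepdim=True)  # gate weight, 0 if not selected
            routed = routed + w * expert(x)
        return out + routed
```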



GZhMf7pWkAApJjG.jpg


2/9
@reach_vb
Model on the Hub, Apache 2.0 licensed:

rhymes-ai/Aria · Hugging Face



3/9
@reach_vb
GitHub repo:

GitHub - rhymes-ai/Aria: Codebase for Aria - an Open Multimodal Native MoE



4/9
@heyitsyorkie
Need some C++ wizards to work on vision model support for llama.cpp so we can enjoy all these Vision models!!!



5/9
@reach_vb
Soon perhaps ✌️



6/9
@natolambert
no molmo is big disrespect



7/9
@saareliad
Nice



8/9
@Ntdhnl
@ollama 👀



9/9
@philpax_
grats on your namesake @ariaurelium




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700









1/11
@PrimeIntellect
Announcing INTELLECT-1: the first-ever decentralized training of a 10B model

Scaling decentralized training 10x beyond prior efforts.

Anyone can join us to build open-source AGI 🦋



2/11
@PrimeIntellect
We are grateful to our launch partners contributing compute: @huggingface @SemiAnalysis_ @arcee_ai @hyperbolic_labs @autonolas @akashnet_ @SchellingAI and many others 🤍



GZoUHCWa8AACkrP.jpg


3/11
@PrimeIntellect
Anyone can contribute compute to advance open-source AI through our platform and later on also with their own hardware.



GZoUL7Xb0AA1qPw.jpg


4/11
@PrimeIntellect
Built on Prime: Our new decentralized training framework that improves and scales DiLoCo up 25X:
• Fault-Tolerant Training via new ElasticDeviceMesh abstraction
• Optimized Communication: Reduces synchronization times by up to 1000-2000x vs centralized training.
• Improving bandwidth utilization by 40x compared to our OpenDiLoCo release
• High Compute Utilization: 98% compute utilization at 10B scale
• Custom Int8 All-Reduce Kernels
• Live checkpoint recovery

Github: GitHub - PrimeIntellect-ai/prime: prime (previously called ZeroBand) is a framework for efficient, globally distributed training of AI models over the internet.
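For intuition, here is a bare-bones sketch of the DiLoCo-style inner/outer loop this kind of framework builds on: each worker takes many cheap local steps, then the workers all-reduce only the weight delta (a "pseudo-gradient") and apply it with an outer optimizer. The optimizer choices, sync interval, and HF-style `model(**batch).loss` call are illustrative assumptions, not prime's actual code.

```python
# Sketch of one DiLoCo-style round: many local AdamW steps, then a single all-reduce
# of the weight delta applied by an outer Nesterov-SGD optimizer. Hyperparameters,
# the data interface, and the distributed setup are illustrative assumptions.
import torch
import torch.distributed as dist

def diloco_round(model, data_iter, global_params, outer_opt, inner_steps=500):
    inner_opt = torch.optim.AdamW(model.parameters(), lr=4e-4)
    for _ in range(inner_steps):                        # local compute, no network traffic
        batch = next(data_iter)
        loss = model(**batch).loss                      # assumes an HF-style model interface
        loss.backward()
        inner_opt.step()
        inner_opt.zero_grad()

    world = dist.get_world_size()
    for g, p in zip(global_params, model.parameters()):
        delta = g - p.detach()                          # this worker's pseudo-gradient
        dist.all_reduce(delta, op=dist.ReduceOp.SUM)    # the only communication per round
        g.grad = delta / world
    outer_opt.step()                                    # e.g. SGD(momentum=0.9, nesterov=True) over global_params
    outer_opt.zero_grad()

    with torch.no_grad():                               # start the next round from the merged weights
        for g, p in zip(global_params, model.parameters()):
            p.copy_(g)
```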



GZoUvJgagAAC7cR.jpg


5/11
@PrimeIntellect
INTELLECT-1 will be fully open source, incl. the training framework and dataset.

Model specs:
• 10B parameters, 6T+ tokens dataset
• Llama Architecture and tokenizer
• Dataset mix: Fineweb-edu, DCLM, Stack v2, OpenWebMath



GZoU41DawAARi81.png


6/11
@PrimeIntellect
Why it matters: Open-source AI is crucial to mitigate centralization risks and one of the biggest public goods. We need to coordinate compute, talent, capital to compete with closed-source labs.

The longer term goal: scale to open source AGI models, continuously improving upon the best open source models in the world.



7/11
@PrimeIntellect
Shoutout to @samsja19, @jackminong, and @johannes_hage for their work on the decentralized training research. @manveerxyz, @jannik_stra, and @burnpiro for their work on the decentralized training platform. @eliebakouch for his help with composing the dataset. @Ar_Douillard et al. for their work on DiLoCo, and many PyTorch contributors for contributing valuable input.



8/11
@PrimeIntellect
Let's build open source AGI together.

Join the decentralized training run: INTELLECT-1 | Prime Intellect | Decentralized Training of a 10B Model

Apply: Careers
Discord: Join the PrimeIntellect Discord Server!



GZoU_AfbcAArFnt.jpg


9/11
@PrimeIntellect
Blog post with more info

INTELLECT–1: Launching the First Decentralized Training of a 10B Parameter Model



10/11
@UristaTim
Decentralized AI like INTELLECT-1 is the future. Imagine applying this scalability to global issues, making problem-solving accessible to all! ⚡ Any thoughts on which area could benefit most? 🤔 #OpenSourceAGI



11/11
@andthatto
refreshing and liberating, and it's what tech should feel like




 


bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700





1/11
@reach_vb
Why is nobody talking about NVLM? Nvidia casually dropped a 72B VLM rivaling GPT-4o, Llama 3-V 405B & InternVL2 🔥

Some notes from their paper:

> Architecture Comparison:

- NVLM-X (cross-attention) excels in computational efficiency for high-res images
- NVLM-D (decoder-only) offers unified multimodal reasoning and higher OCR accuracy
- NVLM-H, a hybrid, combines these strengths for efficient high-res image processing and multimodal reasoning

> High-Resolution Images

- A 1-D tile-tagging mechanism for dynamic tiling improves OCR and multimodal tasks.
- Ablation studies show text-based tags before image tokens optimize accuracy.

> Training Data:

- Emphasis on quality and diversity over quantity.
- Flamingo (cross-attention) and LLaVA (decoder-only) models benefit from diverse pretraining data.
- NVLM models use a large, curated SFT dataset, improving performance with a simplified design.

> Production-Ready Multimodality:

- NVLM models achieve vision-language and text-only task excellence.
- Freezing LLM parameters maintains text performance. High-quality text and multimodal math data integration enhances math and coding, improving multimodal reasoning.

> Weights on the Hub and integrated w/ Transformers 🤗

Kudos @NVIDIAAI - legends!
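To make the tile-tagging idea tangible, here is a toy sketch: a high-resolution image is cut into a grid of tiles plus a global thumbnail, and a short text tag is placed before each tile's `<image>` token span so the decoder knows which tile it is attending to. The tag strings, grid, and tile size are assumptions for illustration, not NVLM's exact format.

```python
# Toy version of dynamic tiling with 1-D text tile tags (tag strings are assumed).
from PIL import Image

def tile_with_tags(img: Image.Image, grid=(2, 2), tile_size=448):
    w, h = img.size
    tiles, prompt_parts = [], []
    tiles.append(img.resize((tile_size, tile_size)))         # global thumbnail tile
    prompt_parts.append("<tile_global_thumbnail><image>")
    for row in range(grid[1]):
        for col in range(grid[0]):
            box = (col * w // grid[0], row * h // grid[1],
                   (col + 1) * w // grid[0], (row + 1) * h // grid[1])
            tiles.append(img.crop(box).resize((tile_size, tile_size)))
            idx = row * grid[0] + col + 1
            prompt_parts.append(f"<tile_{idx}><image>")       # text tag precedes the tile's image tokens
    return tiles, "".join(prompt_parts)
```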



GZSQ57RW8AAX-tc.jpg


2/11
@reach_vb
Model weights here:

nvidia/NVLM-D-72B · Hugging Face



3/11
@Presidentlin
Creative Commons Attribution Non Commercial 4.0



4/11
@reach_vb
The paper is where the real deal is at:

Paper page - NVLM: Open Frontier-Class Multimodal LLMs



5/11
@RasmusToivanen
What usually hinders my usage of Nvidia models: the license sometimes, training on OAI outputs, not providing a pay-as-you-go API on their NIM platform, no other providers for models like Nemotron-340B-reward... @NVIDIAAI



6/11
@reach_vb
I thought NIM was Pay as you go.



7/11
@j6sp5r
cause we can't really run it locally (yet)



8/11
@reach_vb
*yet*



9/11
@GozukaraFurkan
Because it's only useful for companies.

Unless Nvidia stops being a monopoly and brings 48GB consumer GPUs, it won't be locally doable

Shame on Nvidia



10/11
@ArpinGarre66002
Do you have a 1-click installer I can run on RAM?



11/11
@gdbsm1
@readwise save thread




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700


1/11
@_philschmid
Let's go! The first really good open-source text-to-video model with an MIT license! Pyramid Flow SD3 is a 2B Diffusion Transformer (DiT) that can generate 10-second videos at 768p with 24fps! 🤯 🎥✨

TL;DR;
🎬 Can Generate 10-second videos at 768p/24FPS
🍹 2B parameter single unified Diffusion Transformer (DiT)
🖼️ Supports both text-to-video AND image-to-video
🧠 Uses Flow Matching for efficient training
💻 Two model variants: 384p (5s) and 768p (10s)
📼 example videos on project page
🛠️ Simple two-step implementation process
📚 MIT License and available on @huggingface
✅ Trained only on open-source datasets
🔜 Training code coming soon!
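If you want to try it locally, the checkpoint can be pulled from the Hub with `huggingface_hub`; the snippet below covers only the download step, since the generation pipeline itself lives in the project's own GitHub repo.

```python
# Download the Pyramid Flow SD3 checkpoint from the Hub (download step only; the
# generation pipeline comes from the project's GitHub repository).
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="rain1011/pyramid-flow-sd3",   # model repo named in the thread
    local_dir="pyramid-flow-sd3",
)
print("Checkpoint downloaded to:", local_dir)
```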



https://video.twimg.com/ext_tw_video/1844267728338628619/pu/vid/avc1/1200x720/D66iVEP9JXU5Vepv.mp4

2/11
@_philschmid
Model: rain1011/pyramid-flow-sd3 · Hugging Face

Paper: Paper page - Pyramidal Flow Matching for Efficient Video Generative Modeling

Project page: Pyramid Flow



3/11
@demian_ai
Absolutely mindblowing



4/11
@AIBizarrothe
@cocktailpeanut 🙏🙏 please sir!



5/11
@KrakowiakK
Took around 7 minutes to generate a 10-second video on a single RTX 3090, maxing out 100% of the 24GB of VRAM.

[Quoted tweet]
prompt: {cinematic} female warrior with a sword in a rainforest


https://video.twimg.com/ext_tw_video/1844688038217097228/pu/vid/avc1/1200x720/kkgeYZtS1mK8lror.mp4

6/11
@j6sp5r
FINALLY! Judging from the speed of open source tomorrow we will have:
- Loras made from your own vids
- Putting your mom into Mission Impossible 7
- You eating Spaghetti on Mars



7/11
@JesterMule
@Guygies



8/11
@cyrildiagne
Impressive! Also worth noting that they employ an 8x8x8 Causal VAE, with 8x temporal compression (!) compared to the typical 4x



9/11
@NERDDISCO
you can run it for free on @tost_ai to generate 5s videos.

[Quoted tweet]
🤯 📽 Pyramidal Flow Matching for Efficient Video Generative Modeling 🔥 [text to video] and [image to video] now on @tost_ai and @runpod_io 🥳

Thanks to @YangLee99779107 ❤ @feifeiobama ❤ Ningyuan Li ❤ Kun Xu ❤ Kun Xu ❤ Hao Jiang ❤ Nan Zhuang ❤ Quzhe Huang ❤ Yang Song ❤ Yadong Mu ❤ Zhouchen Lin ❤

🌐page: pyramid-flow.github.io/
📄paper: arxiv.org/abs/2410.05954
🧬code: github.com/jy0205/Pyramid-Fl…
🍇runpod-t2v: github.com/camenduru/pyramid…
🍇runpod-i2v: github.com/camenduru/pyramid…
🥪tost: please try it 🐣 tost.ai


https://video.twimg.com/ext_tw_video/1844289395299319820/pu/vid/avc1/1920x1080/tivsUUCHwHm5Y51s.mp4

10/11
@remotemontes
👀



GZkMBS3XoAwyzM8.png


11/11
@phughes9000
Beautiful




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700





1/5
@ailozovskaya
🚀 The 🤗 Open LLM Leaderboard Monday Weekly Update

🆕 Last week, 139 new models joined the Leaderboard

(1/5) Let's start the thread! 🧵



2/5
@ailozovskaya
(2/5) 📈 Maintainer's Highlight Top 20 Rank Change

No updates this week, top 20 remains unchanged 😴

Note: "Show only Maintainer's Highlight" button filters models from trusted providers (e.g., EleutherAI, CohereForAI, MistralAI). This filter can help focus on high-quality models from our curated authors set!



3/5
@ailozovskaya
(3/5)🏆Top 3 models by average score last week

🥇 First place: rombodawg/Rombos-LLM-V2.5-Qwen-72b (45.39 avg)
🥈 Second place: MaziyarPanahi/calme-2.1-qwen2.5-72b (38.38 avg)
🥉 Third place: ssmits/Qwen2.5-95B-Instruct (37.43 avg)

Kudos to @dudeman6790, @MaziyarPanahi, and ssmits ! 🎉



GZTDMSuawAAJXgm.jpg


4/5
@ailozovskaya
(4/5) 📊 Benchmarks top performers

MaziyarPanahi/calme-2.1-qwen2.5-72b
• IFEval: 86.62
• BBH: 61.66

rombodawg/Rombos-LLM-V2.5-Qwen-72b
• MMLU-PRO: 54.83
• MATH Lvl 5: 47.58

nisten/franqwenstein-35b (float16, no chat template)
• GPQA: 20.47

cognitivecomputations/dolphin-2.9.1-llama-3-70b
• MUSR: 23.70

Big shoutout to @MaziyarPanahi, @dudeman6790, @nisten, and @cognitivecompai 👏



GZTDOcjagAAA3MY.jpg


5/5
@ailozovskaya
(5/5) 🎉 That’s it for this week!

Please, celebrate this week’s top performers, explore models on the Leaderboard, and submit new ones!

Stay tuned for more updates! 📢

🔗 Open LLM Leaderboard: Open LLM Leaderboard 2 - a Hugging Face Space by open-llm-leaderboard

[Quoted tweet]
🚀 The 🤗 Open LLM Leaderboard Monday Weekly Update

🆕 Last week, 139 new models joined the Leaderboard

(1/5) Let's start the thread! 🧵



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700


1/11
@_philschmid
This was unexpected! @OpenAI released Swarm, a lightweight library for building multi-agent systems. Swarm provides a stateless abstraction to manage interactions and handoffs between multiple agents and does not use the Assistants API. 🤔

How it works:
1️⃣ Define Agents, each with its own instructions, role (e.g., "Sales Agent"), and available functions (will be converted to JSON structures).
2️⃣ Define logic for transferring control to another agent based on conversation flow or specific criteria within agent functions. This handoff is achieved by simply returning the next agent to call within the function.
3️⃣ Context Variables provide initial context and are updated throughout the conversation to maintain state and share information between agents.
4️⃣ Client run() initiates and manages the multi-agent conversation. It needs an initial agent, user messages, and context, and returns a response containing updated messages, context variables, and the last active agent.

Insights:
🔄 Swarm manages a loop of agent interactions, function calls, and potential handoffs.
🧩 Agents encapsulate instructions, available functions (tools), and handoff logic.
🔌 The framework is stateless between calls, offering transparency and fine-grained control.
🛠️ Swarm supports direct Python function calling within agents.
📊 Context variables enable state management across agent interactions.
🔄 Agent handoffs allow for dynamic switching between specialized agents.
📡 Streaming responses are supported for real-time interaction.
🧪 The framework is experimental. Maybe to collect feedback?
🔧 Flexible and works with any OpenAI client, e.g., Hugging Face TGI or vLLM-hosted models.
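A minimal sketch of the pattern described above, following the style of the examples in the Swarm repo: two agents, a handoff implemented by returning the next agent from a function, and `client.run()` driving the loop. The agent instructions and query are made up for illustration, and the exact API may shift while the library is experimental.

```python
# Minimal Swarm sketch: a triage agent hands off to a sales agent by returning it
# from a function. Instructions and the example query are illustrative.
from swarm import Swarm, Agent

def transfer_to_sales():
    """Hand the conversation to the sales agent."""
    return sales_agent                      # returning an Agent triggers the handoff

sales_agent = Agent(
    name="Sales Agent",
    instructions="Answer product and pricing questions concisely.",
)

triage_agent = Agent(
    name="Triage Agent",
    instructions="Route sales questions to the sales agent.",
    functions=[transfer_to_sales],
)

client = Swarm()                            # wraps an OpenAI-compatible chat client
response = client.run(
    agent=triage_agent,
    messages=[{"role": "user", "content": "How much does the pro plan cost?"}],
    context_variables={"customer_tier": "free"},
)
print(response.agent.name)                  # last active agent (Sales Agent after the handoff)
print(response.messages[-1]["content"])
```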



2/11
@_philschmid
GitHub - openai/swarm: Educational framework exploring ergonomic, lightweight multi-agent orchestration. Managed by OpenAI Solution team.



3/11
@TheRealAdamG
Glad you are enjoying it. A reminder it’s currently experimental.

[Quoted tweet]
important note: swarm is currently an experimental framework intended only for demo purposes and to explore ergonomic interfaces for multi-agent systems. it is not intended to be used in production.


4/11
@ralphbrooks
The immediate questions that I have are:
1. What are the advantages of having different agents with different training data over one agent that can do all the tasks? My guess is that specialized agents can get faster inference but this is just a guess.

2. What is the advantage of this framework vs coordinating your own orchestration? If it is all about inference, it would feel like you would want to manage the communication on your own to make this as fast as possible.



5/11
@dariel_noel
KaibanJS lets you build multi-agent systems in JavaScript. It has one-way state management and a cool Kanban UI to visualize the agents working.

Enjoy ☺️



6/11
@thealpharacc00n
They say they ripped it off from some other team, what's up with that?



7/11
@joshvlc
Can it be used with a model served by Ollama?



8/11
@UristaTim
The stateless architecture is appealing, allowing transparency and fine-grained control. I wonder how this influences development speeds compared to traditional models. Thoughts?



9/11
@_SchusterDev
@DeeperThrill



10/11
@arthurSlee
Not using the Assistant API is a bummer



11/11
@UristaTim
Swarm's flexibility with context variables is a game-changer for personalized interactions. This could be a step closer to seamlessly intelligent bots.





GZsGtzlXUBITBYk.jpg

GZtkbCIWEAIFKf5.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700




1/6
@rohanpaul_ai
Nice Paper for a long weekend read - "A Primer on the Inner Workings of Transformer-based Language Models"

📌 Provides a concise intro focusing on the generative decoder-only architecture.

📌 Introduces the Transformer layer components, including the attention block (QK and OV circuits) and feedforward network block, and explains the residual stream perspective. It then categorizes LM interpretability approaches into two dimensions: localizing inputs or model components responsible for a prediction (behavior localization) and decoding information stored in learned representations to understand its usage across network components (information decoding).

📌 For behavior localization, the paper covers input attribution methods (gradient-based, perturbation-based, context mixing) and model component importance techniques (logit attribution, causal interventions, circuits analysis). Causal interventions involve patching activations during the forward pass to estimate component influence, while circuits analysis aims to reverse-engineer neural networks into human-understandable algorithms by uncovering subsets of model components interacting together to solve a task.

📌 Information decoding methods aim to understand what features are represented in the network. Probing trains supervised models to predict input properties from representations, while the linear representation hypothesis states that features are encoded as linear subspaces. Sparse autoencoders (SAEs) can disentangle superimposed features by learning overcomplete feature bases. Decoding in vocabulary space involves projecting intermediate representations and model weights using the unembedding matrix.

📌 Then summarizes discovered inner behaviors in Transformers, including interpretable attention patterns (positional, subword joiner, syntactic heads) and circuits (copying, induction, copy suppression, successor heads), neuron input/output behaviors (concept-specific, language-specific neurons), and the high-level structure mirroring sensory/motor neurons. Emergent multi-component behaviors are exemplified by the IOI task circuit in GPT2-Small. Insights on factuality and hallucinations highlight the competition between grounded and memorized recall mechanisms.
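As a concrete taste of the "decoding in vocabulary space" idea summarized above, here is a small logit-lens sketch with GPT-2 in `transformers`: each layer's hidden state is projected through the unembedding matrix to see which next token that layer is leaning toward. Applying the final layer norm before projecting is a common convention here, not something the primer prescribes.

```python
# Logit-lens sketch: project every layer's hidden state through GPT-2's unembedding
# matrix to inspect the running next-token prediction (illustrative, not from the paper).
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

inputs = tok("The Eiffel Tower is located in", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for layer, h in enumerate(out.hidden_states):            # embedding output + 12 layers
    h_last = model.transformer.ln_f(h[:, -1])             # final layer norm, then unembed
    logits = model.lm_head(h_last)
    top = tok.decode(logits.argmax(-1))
    print(f"layer {layer:2d}: top next-token = {top!r}")
```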



GZt1HuXWQBUCQ-h.jpg


2/6
@rohanpaul_ai
📚 [2405.00208] A Primer on the Inner Workings of Transformer-based Language Models



3/6
@anshulcreates
one quick question that I have is: what is the bridge between learning simple logistic regression and understanding papers like these?



4/6
@rohanpaul_ai
It's simpler than you think. If you are talking about just understanding the core concepts, flow, and propositions of the paper - it's easily understandable if you know the basic mechanism of the Transformer architecture.

But if you want to implement the mechanisms and new concepts proposed in the paper, that's a longer project.



5/6
@Soyinka_lcc3
You've been sharing a lot of LLM papers lately! What's cooking? Thanks anyway!



6/6
@rohanpaul_ai
Thanks Sho, I've been doing this for the last year, but recently I became a bit more disciplined about covering more papers on a daily basis. 😄






[Submitted on 30 Apr 2024 (v1), last revised 2 May 2024 (this version, v2)]

A Primer on the Inner Workings of Transformer-based Language Models

Javier Ferrando, Gabriele Sarti, Arianna Bisazza, Marta R. Costa-jussà

The rapid progress of research aimed at interpreting the inner workings of advanced language models has highlighted a need for contextualizing the insights gained from years of work in this area. This primer provides a concise technical introduction to the current techniques used to interpret the inner workings of Transformer-based language models, focusing on the generative decoder-only architecture. We conclude by presenting a comprehensive overview of the known internal mechanisms implemented by these models, uncovering connections across popular approaches and active research directions in this area.

Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2405.00208 [cs.CL] (arXiv:2405.00208v2 for this version)



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700

1/1
@rohanpaul_ai
The amazing part of the "Differential Transformer" Paper from @Microsoft is how robust it is to quantization.

This is gonna be huge for running LLMs on the edge.

While the standard Transformer's performance drops significantly at 6-bit and 4-bit quantization, the DIFF Transformer retains much of its accuracy.

At 4-bit quantization, the DIFF Transformer outperforms the standard Transformer by about 25% in accuracy.

This robustness to quantization is attributed to the DIFF Transformer's ability to natively mitigate activation outliers in attention scores.

The paper suggests that this characteristic provides new opportunities for low-bit FlashAttention implementations, which could be particularly valuable for running LLMs on edge devices with limited computational resources.

[Quoted tweet]
Brilliant Paper from @Microsoft. 👏

"DIFFERENTIAL TRANSFORMER" ✨

DIFF Transformer cancels attention noise, enhancing key information retrieval and reducing hallucination in large language models.

• 30% accuracy improvement in key information retrieval with 64K context
• 10-20% accuracy gain in many-shot in-context learning across datasets
• 7-11% reduction in hallucination for summarization and question answering
• Maintains performance with 6-bit quantization, while Transformer degrades significantly

**Original Problem** 🔍:

Transformer tends to overallocate attention to irrelevant context, leading to challenges in accurately retrieving key information.

-----

**Solution in this Paper** 💡:

• Introduces DIFF Transformer with differential attention mechanism
• Calculates attention scores as difference between two separate softmax attention maps
• Subtraction cancels noise, promoting emergence of sparse attention patterns
• Amplifies attention to relevant context while reducing attention to irrelevant parts
• Uses GroupNorm to normalize each attention head independently

-----

**Key Insights from this Paper** 💡:

• DIFF Transformer outperforms Transformer in scaling model size and training tokens
• Requires only ~65% of model size or training tokens to match Transformer performance
• Excels in long-context modeling, key information retrieval, and in-context learning
• Mitigates hallucination in question answering and text summarization
• Reduces outliers in model activations, enabling better quantization
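For intuition, here is a single-head sketch of the differential attention described in the quoted tweet: two softmax attention maps are computed and one is subtracted from the other, scaled by a learnable λ, before being applied to the values. The head sizes, the scalar λ parameterization, and the omission of per-head GroupNorm are simplifications relative to the paper.

```python
# Single-head sketch of differential attention: the difference of two softmax maps
# cancels common-mode "attention noise". Simplified relative to the paper.
import math
import torch
import torch.nn as nn

class DiffAttentionHead(nn.Module):
    def __init__(self, d_model=512, d_head=64, lambda_init=0.8):
        super().__init__()
        self.q = nn.Linear(d_model, 2 * d_head, bias=False)    # two query groups
        self.k = nn.Linear(d_model, 2 * d_head, bias=False)    # two key groups
        self.v = nn.Linear(d_model, d_head, bias=False)
        self.lam = nn.Parameter(torch.tensor(lambda_init))      # learnable scalar (simplified)
        self.d_head = d_head

    def forward(self, x):                                       # x: [B, T, d_model]
        q1, q2 = self.q(x).chunk(2, dim=-1)
        k1, k2 = self.k(x).chunk(2, dim=-1)
        v = self.v(x)
        scale = 1.0 / math.sqrt(self.d_head)
        a1 = torch.softmax(q1 @ k1.transpose(-2, -1) * scale, dim=-1)
        a2 = torch.softmax(q2 @ k2.transpose(-2, -1) * scale, dim=-1)
        attn = a1 - self.lam * a2                               # noise-cancelling subtraction
        return attn @ v
```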


GZtzrb8XoA4_hQc.jpg

GZaT2-_WMAAb6Ke.png



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700

1/11
@googledevs
Give a warm welcome to the @HuggingFace Ovis 1.6 Gemma 2 9B vision language model, one of the most recent additions to the #Gemmaverse.

Check out this small but mighty vision model in this @Gradio space for yourself → Ovis1.6 Gemma2 9B - a Hugging Face Space by AIDC-AI
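If you prefer to hit the demo from Python, the `gradio_client` package can drive the Space; the Space id below is inferred from the link above and the endpoint arguments vary, so `view_api()` is used to list them rather than guessing a `predict()` signature.

```python
# Query the Ovis 1.6 Gemma 2 demo Space from Python. The Space id is an assumption
# based on the link above; view_api() prints the actual endpoints and their arguments.
from gradio_client import Client

client = Client("AIDC-AI/Ovis1.6-Gemma2-9B")
client.view_api()   # inspect endpoint names/parameters, then call client.predict(...) accordingly
```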



2/11
@D3VAUX
👏🏼👏🏼👏🏼



3/11
@GlennCameronjr
@AI_AlibabaInt



4/11
@BreakKnight0
ZERO DECENCY exposing children to RAPE & ABORTION ads on your networks. You guys suck !!



5/11
@seeaanoconnor
you should update this to Gradio v5 to make it even more impressive



6/11
@danbri
Would be a good addition to Chromium!



7/11
@MagicInByte
I'm loving the new additions to the Gemmaverse. What are some exciting applications you see for this vision language model?



8/11
@quantum_citoyen
Welcome to Ovis 1.6 Gemma 2 9B! I'm delighted to see this vision language model integrated into the Gemmaverse. Thanks for this release!



9/11
@wittgenein
Ah, Gradio -
I'm using it too



10/11
@gpt_biz
Sounds exciting! Definitely worth checking out the new Gemma model and seeing what it can do



11/11
@zuopiezi
not okay , 13, not 14.



GZpq46LaAAI1Oim.jpg



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700



1/3
@rohanpaul_ai
The Archon paper demonstrates the power of systematically combining inference-time techniques for LLMs.

**Original Problem** 💡:

Existing inference-time architectures struggle to generalize beyond specific tasks. Challenges include effectively allocating inference compute, understanding interactions between techniques, and efficiently searching the design space.

-----

**Solution in this Paper** 🔧:

• Introduces Archon framework for combining multiple inference-time techniques
• Defines extensible design space of methods like ensembling, fusion, ranking, critiquing
• Transforms architecture selection into hyperparameter optimization problem
• Proposes Inference-Time Architecture Search (ITAS) algorithms to find optimal configurations

-----

**Key Insights from this Paper** 💡:

• Layering multiple inference techniques improves performance across tasks
• Fusion and ranking most effective for instruction-following tasks
• Verification and unit testing boost reasoning/coding task performance
• Bayesian optimization efficient for searching architecture configurations

-----

**Results** 📊:

• Outperforms GPT-4 and Claude 3.5 Sonnet across benchmarks
• Open-source Archon: 11.2 percentage point average increase over baselines
• Closed-source Archon: 15.8 percentage point average increase
• All-source Archon: 15.1 percentage point average increase



GZvIfxJXAAsN9Vx.png


2/3
@rohanpaul_ai
📚 https://arxiv.org/pdf/2409.15254



3/3
@rohanpaul_ai
A brief overview of the Archon framework:

It's a modular system for designing inference-time architectures that combine multiple LLMs and inference techniques. Key components:

🧩 Extensible design space: Includes methods like ensembling, multi-sampling, ranking, fusion, critiquing, verification, and unit testing.

🔀 Layered structure: LLM components arranged in sequential layers, with parallel execution within layers.

🔍 ITAS optimizer: Uses Bayesian optimization to efficiently search the configuration space.

⚙️ Plug-and-play: Users can select existing techniques or add new ones, specifying desired objectives.

🎯 Task adaptability: Can be optimized for specific tasks or as a general-purpose architecture.

📈 Performance: Consistently outperforms single-call state-of-the-art LLMs across various benchmarks.
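A toy sketch of the layered generate → rank → fuse flow described above, with a placeholder `llm()` callable standing in for whatever model client you use; the helper name, prompts, and the keep-top-3 choice are hypothetical, not Archon's actual API.

```python
# Toy ensemble -> rank -> fuse pipeline in the spirit of the Archon design space.
# `llm` is a hypothetical single-prompt helper; swap in your own client.
from typing import Callable, List

def layered_answer(query: str, llm: Callable[[str], str], n_samples: int = 5) -> str:
    # Layer 1: ensemble / multi-sample candidate generation
    candidates: List[str] = [llm(query) for _ in range(n_samples)]

    # Layer 2: ranking - ask the model to order the candidates
    listing = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    ranking = llm(f"Rank these answers to '{query}' from best to worst. "
                  f"Return only the indices, space-separated:\n{listing}")
    order = [int(tok) for tok in ranking.split() if tok.isdigit() and int(tok) < n_samples]
    keep = [candidates[i] for i in order[:3]] or candidates[:3]   # fall back if parsing fails

    # Layer 3: fusion - merge the kept answers into a single improved response
    return llm(f"Fuse these answers to '{query}' into one improved answer:\n\n" + "\n\n".join(keep))
```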



GZvJK7AXwAYtc8V.png



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700





1/5
@rohanpaul_ai
Work on concealing harmful intent in multi-turn jailbreak attacks on LLMs.

Also called the RED QUEEN ATTACK.

**Original Problem** 🔍:

Current jailbreak attacks on LLMs use single-turn prompts with explicit harmful intent. This doesn't reflect real-world scenarios where attackers may use multi-turn conversations and conceal malicious goals.

-----

**Solution in this Paper** 💡:

• Proposes RED QUEEN ATTACK - constructs multi-turn scenarios concealing harmful intent
• Creates 40 scenarios based on occupations/relations with varying turns
• Combines scenarios with 14 harmful action categories
• Generates 56k multi-turn attack data points
• Evaluates on 10 LLMs from 4 families (GPT-4, Llama3, Qwen2, Mixtral)

-----

**Key Insights from this Paper** 💡:

• RED QUEEN ATTACK achieves high success rates across all tested models
• Larger models more vulnerable to the attack
• Multi-turn structure and concealment both contribute to effectiveness
• Occupation-based scenarios generally more effective than relation-based

-----

**Results** 📊:

• 87.62% attack success rate on GPT-4o
• 75.4% attack success rate on Llama3-70B
• RED QUEEN GUARD mitigation reduces attack success rate to <1%
• Preserves performance on general benchmarks (MMLU-Pro, AlpacaEval)



GZvGeCCWAAA5gKT.png


2/5
@rohanpaul_ai
Larger models more susceptible to RED QUEEN ATTACK: New multi-turn jailbreak approach for LLMs

This increased vulnerability in larger models can be attributed to mismatched generalization between continued progress on model capabilities and safety alignment training (Wei et al., 2024). In other words, larger models demonstrate a better understanding of language and instruction and can accept fake scenarios easily, while smaller models have difficulty understanding the whole scenario.



GZvHOcgWEAgBp_g.png


3/5
@rohanpaul_ai
RED QUEEN ATTACK differs from previous jailbreak methods in two key ways:

1. Multi-turn structure: It uses a series of interactions over multiple turns, rather than a single-turn prompt.

2. Concealment of malicious intent: Instead of explicitly stating harmful goals, it hides the true malicious purpose within a seemingly innocent scenario.

These two aspects combined make RED QUEEN ATTACK more sophisticated and harder for language models to detect compared to traditional single-turn, explicit jailbreak attempts.



GZvIMhKWEAY131e.png


4/5
@rohanpaul_ai
Scenario construction of a three-turn RED QUEEN ATTACK.

We start with human-written prompts based on occupations or relations and prompt models to continue generating the subsequent utterances. Each scenario ends with a user prompt requesting a fictional plan.



GZvHyvOWEAU-eRt.jpg


5/5
@rohanpaul_ai
(a): An example of RED QUEEN ATTACK on "how to build a bomb".

Compared with a direct attack on the left, RED QUEEN ATTACK constructs a multi-turn scenario and conceals harmful intent by claiming to thwart the efforts of a friend wanting to build a bomb. The texts are derived from our attack results on GPT-4.

(b): Performance comparison of model families in different sizes.

Larger models are more susceptible to the RED QUEEN ATTACK .

(c): RED QUEEN GUARD reduces the attack success rate to below 1% while preserving performance on general benchmarks.



GZvHlpqWoAEGqdO.png



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
54,112
Reputation
8,072
Daps
153,700


1/2
@rohanpaul_ai
NVIDIA Paper - "MaskedMimic: Unified Physics-Based Character Control Through Masked Motion Inpainting"

**Results** 📊:

• Outperforms previous methods on full-body tracking (99.2% vs 97.1% success on test set)
• Achieves 98.1% success on VR tracking without task-specific training
• Demonstrates robust performance on irregular terrains (95.4% success)
• Enables new capabilities like text-to-motion synthesis and object interactions

**Original Problem** ⚠️:

Existing physics-based character animation approaches typically develop specialized controllers for specific tasks, lacking versatility and requiring complex reward engineering. A unified controller supporting diverse control modalities and generalizing across tasks remains an open challenge.

-----

**Solution in this Paper** 💡:

• Introduces MaskedMimic, a unified physics-based character controller framed as a motion inpainting problem
• Trains on randomly masked motion sequences to enable flexible conditioning on partial constraints
• Uses a conditional VAE architecture with a learned prior to model diverse plausible motions
• Employs a two-stage training process:
1) Fully-constrained motion tracking controller trained via RL
2) Partially-constrained controller distilled from the first stage using behavioral cloning
• Supports multi-modal inputs: keyframes, text commands, object interactions

-----

**Key Insights from this Paper** 💡:

• Formulating character control as motion inpainting enables flexible, intuitive user control
• A single unified model can handle diverse tasks without task-specific training
• Structured masking and episodic latent noise improve temporal coherence
• Residual prior architecture crucial for controlling the latent space
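A small sketch of the random masking at the heart of the recipe above: given a target motion clip, random subsets of joints and time steps are hidden so the controller must inpaint the rest. The tensor layout and masking probabilities are assumptions for illustration, not the paper's exact structured-masking scheme.

```python
# Toy random masking of a motion clip for inpainting-style training.
# Shapes and probabilities are illustrative assumptions, not the paper's exact scheme.
import torch

def mask_motion(motion: torch.Tensor, p_joint: float = 0.5, p_time: float = 0.3):
    """motion: [T, J, D] (time steps, joints, features per joint)."""
    T, J, _ = motion.shape
    joint_mask = torch.rand(J) < p_joint          # hide entire joints (e.g. only track hands/head)
    time_mask = torch.rand(T) < p_time            # hide entire frames (sparse keyframes remain)
    keep = (~joint_mask).unsqueeze(0) & (~time_mask).unsqueeze(1)   # [T, J], True = constraint kept
    masked = motion * keep.unsqueeze(-1)          # zero out the masked entries
    return masked, keep                           # the controller conditions on (masked, keep)
```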



GZvF219XQAo0vXy.jpg


2/2
@rohanpaul_ai
📚 https://arxiv.org/pdf/2409.14393




 