1/16
@leloykun
Deep Learning Optimizers from First Principles

My attempt at answering these questions:

1. Why do steepest descent in non-Euclidean spaces?
2. Why does adaptive preconditioning work so well in practice? And,
3. Why normalize everything ala nGPT?

[Quoted tweet]
The Case for Muon

1) We can descend 'faster' in non-Euclidean spaces
2) Adam/Shampoo/SOAP/etc. dynamically learn the preconditioner and, equivalently, the norm & space to descend in
3) Muon saves a lot of compute by simply letting the norm vary within a fixed range
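For concreteness, here is a minimal sketch of the orthogonalization step at the heart of Muon, assuming the quintic Newton–Schulz iteration and the coefficients quoted in the public Muon implementation (momentum and learning-rate scaling are omitted):

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Approximately replace G by the nearest semi-orthogonal matrix (U V^T from its SVD),
    # i.e. the steepest-descent direction under the spectral norm.
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients used in the public Muon code
    X = G / (G.norm() + 1e-7)           # scale so singular values start in [0, 1]
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```

Replacing the raw gradient with its nearest semi-orthogonal matrix pins the update's singular values near 1, which is what keeps the norm varying only within a fixed range.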




2/16
@leloykun
Ideally, when training a neural network, we want to bound the features, the weights, and their respective updates so that:

1. [lower bound] the model actually "learns" stuff; and
2. [upper bound] model training is stable

These bounds then depend on the norms, but which norms?





3/16
@leloykun
The fun part is that the norms of the input and output features already induce the norm of the weights between them

We can also let the features and feature updates share the same norm (likewise for the weights)

And so, we only have to choose the norms for the features!





4/16
@leloykun
Now, our datasets are usually Euclidean or locally Euclidean (see Manifold Hypothesis)

What's the norm induced by Euclidean input and output vector spaces? The Spectral Norm!
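A quick numerical check of that claim, as a sketch: power iteration approximates the supremum in the definition of the induced (l2 -> l2) operator norm, and it matches the largest singular value.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 32))

# Spectral norm = largest singular value of W.
sigma_max = np.linalg.svd(W, compute_uv=False)[0]

# Definition of the induced norm: sup_x ||W x||_2 / ||x||_2.
# Power iteration on W^T W converges to the maximizing direction.
x = rng.standard_normal(32)
for _ in range(200):
    x = W.T @ (W @ x)
    x /= np.linalg.norm(x)
induced = np.linalg.norm(W @ x) / np.linalg.norm(x)

print(sigma_max, induced)  # the two agree up to numerical precision
```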





5/16
@leloykun
So even if we don't want to do anything fancy, we'd still have to do steepest descent in non-Euclidean space because:

1. The induced norm for the weights (w/ Euclidean features) is non-Euclidean; and
2. We're optimizing the weights, not the features

cc:

[Quoted tweet]
The rate this whole space is converging towards a topological math problem makes me really uncomfortable


6/16
@leloykun
The model inputs and outputs being Euclidean sounds reasonable, but why do the "hidden" features have to be Euclidean too?

If we vary the norms of these features, we also vary the induced norms of the weights, and vice versa

Adaptive preconditioning then "searches" for the proper norms





7/16
@leloykun
This also answers @mattecapu's Q here

Shampoo & SOAP start from Euclidean features and spectral weights, then tune the norms over time. SOAP does this tuning with momentum, so it's theoretically faster.

[Quoted tweet]
really cool to also optimize the p in the norm. do you have a conceptual idea of what that's tuning? I guess intuitively as p->oo each dimension is getting 'further away' from each other..
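For reference, a minimal sketch of the Shampoo-style preconditioning mentioned above, assuming the textbook update L^{-1/4} G R^{-1/4}; real implementations add momentum, grafting, blocking, and inverse-root tricks that are omitted here:

```python
import torch

def shampoo_precondition(G, L, R, eps=1e-6):
    # G: matrix-shaped gradient; L, R: running sums of G @ G.T and G.T @ G.
    # Preconditioned step: L^{-1/4} @ G @ R^{-1/4}.
    def inv_fourth_root(M):
        vals, vecs = torch.linalg.eigh(M)
        return vecs @ torch.diag(vals.clamp(min=eps) ** -0.25) @ vecs.T
    return inv_fourth_root(L) @ G @ inv_fourth_root(R)

# inside a training loop (per step, for one weight matrix W with gradient G):
#   L += G @ G.T
#   R += G.T @ G
#   W -= lr * shampoo_precondition(G, L, R)
```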


8/16
@leloykun
A more cynical answer, from one mathematician to another, is that almost nobody in this field is actually doing proper linear algebra.

Adaptive preconditioning allows us to start from really crappy configurations/parametrizations and get away scot-free



9/16
@leloykun
A more pro-ML answer would be that humans suck at predicting which inductive biases will work best when baked into the models

E.g. why should the "hidden" features be in Euclidean space? Why not let the model learn the proper space(s) to work with?



10/16
@leloykun
Another takeaway here is that interpretability folks also need to be mindful of this.

The "hidden" features may not be Euclidean!

And e.g., if you use Adam (w/o accumulation), what you're really doing is optimizing a transform from an l_1-normed space to an l_infty-normed space
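A minimal sketch of that special case, assuming Adam with beta1 = beta2 = 0 (no accumulation), which collapses to sign descent, i.e. steepest descent under a max-norm:

```python
import torch

@torch.no_grad()
def sign_descent_step(param: torch.Tensor, grad: torch.Tensor, lr: float = 1e-3) -> None:
    # With beta1 = beta2 = 0, Adam's update g / sqrt(g^2 + eps) collapses to sign(g):
    # every coordinate moves by exactly lr, whatever the gradient's magnitude.
    param.sub_(lr * grad.sign())
```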



11/16
@leloykun
Finally, why is it a good idea to normalize everything everywhere?

Cuz it lets us have sane bounds & same norms on the features which means we can use the same optimizer for all the layers with minimal tuning!

[2410.01131] nGPT: Normalized Transformer with Representation Learning on the Hypersphere



12/16
@leloykun
I also realized I overlooked something elementary here

I was getting unstable iterators because I wasn't properly upper-bounding the polynomials lol

[Quoted tweet]
I've also noticed that for each γ, there is a minimum r below which things become unstable. And as we increase γ, the minimum r also increases

This indicates that the smaller we make the attractor basin, the more iterations we'll need for things to converge.


13/16
@leloykun
If you want to design your own iterators:

1. Pick inner and outer radii (l, r)
2. Let {0, +-(1-l), +-(1+r)} be fixed points
3. Binary search for the maximum gamma such that the peak in (0, 1-l) = 1 + r while the trough in (1-l, 1+r) > 0 (see the sketch below)

The current iterator is already decent tho
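A hedged sketch of that recipe. The odd quintic below is one simple family with fixed points {0, ±(1-l), ±(1+r)}; the author's exact parametrization may differ, and the peak/trough conditions are checked on a dense grid rather than analytically:

```python
import numpy as np

def design_iterator(l: float, r: float, tol: float = 1e-6) -> float:
    lo_fp, hi_fp = 1.0 - l, 1.0 + r  # inner and outer fixed points

    def p(x, gamma):
        # odd polynomial with fixed points at 0, +-(1-l), +-(1+r)
        return x + gamma * x * (x**2 - lo_fp**2) * (x**2 - hi_fp**2)

    xs_peak = np.linspace(1e-6, lo_fp, 10_000)     # require the peak here <= 1 + r
    xs_trough = np.linspace(lo_fp, hi_fp, 10_000)  # require the trough here > 0

    def ok(gamma):
        return p(xs_peak, gamma).max() <= hi_fp and p(xs_trough, gamma).min() > 0.0

    lo_g, hi_g = 0.0, 10.0  # binary search for the largest admissible gamma
    while hi_g - lo_g > tol:
        mid = 0.5 * (lo_g + hi_g)
        lo_g, hi_g = (mid, hi_g) if ok(mid) else (lo_g, mid)
    return lo_g

print(design_iterator(l=0.3, r=0.3))
```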





14/16
@leloykun
Going back,

it's really important to think about where we're mapping things to and from, because the optimizer, learning rate scheduler, etc. can be derived from there



15/16
@leloykun
That's all from me!

I now feel intellectually satisfied and the nerdsnipe trance is wearing off. I can finally be normal again and be productive and post memes huhu



16/16
@leloykun
I also really recommend this work by @TheGregYang and @jxbz : [2310.17813] A Spectral Condition for Feature Learning

This one too: [2409.20325] Old Optimizer, New Norm: An Anthology

super information-dense works!








3/4
@rohanpaul_ai
New Transformer architecture modifications from NVIDIA researchers.

nGPT: A hypersphere-based Transformer achieving 4-20x faster training and improved stability for LLMs.

**Key Insights from this Paper** 💡:

• nGPT learns 4-20x faster than standard Transformers
• Hyperspherical representation improves stability and embedding separability
• Transformer layers act as optimization steps on a hypersphere
• Eigen learning rates control the impact of each block's updates
• nGPT handles longer contexts without modifying positional encodings

-----

**Results** 📊:

• 4x faster training for 1k context length
• 10x faster training for 4k context length
• 20x faster training for 8k context length
• Similar or better performance on downstream tasks with less training
• More stable performance when extrapolating to longer sequences

------

Generated this podcast with Google's Illuminate.

[Quoted tweet]
New Transformer architecture modifications from NVIDIA researchers.

nGPT: A hypersphere-based Transformer achieving 4-20x faster training and improved stability for LLMs.

**Proposals in this Paper** 🛠️:

• Normalized Transformer (nGPT) architecture
• All vectors normalized to unit norm on hypersphere
• Learnable eigen learning rates control hidden state updates
• Removal of LayerNorm/RMSNorm layers
• Introduction of scaling factors for logits, query/key vectors, and MLP states
• Elimination of weight decay and learning rate warmup

-----

**Key Insights from this Paper** 💡:

• nGPT learns 4-20x faster than standard Transformers
• Hyperspherical representation improves stability and embedding separability
• Transformer layers act as optimization steps on a hypersphere
• Eigen learning rates control the impact of each block's updates
• nGPT handles longer contexts without modifying positional encodings

-----

**Results** 📊:

• 4x faster training for 1k context length
• 10x faster training for 4k context length
• 20x faster training for 8k context length
• Similar or better performance on downstream tasks with less training
• More stable performance when extrapolating to longer sequences



4/4
@VXaviervm
what are the current transformers that are not a hypersphere?










1/6
@rohanpaul_ai
New Transformer architecture modifications from NVIDIA researchers.

nGPT: A hypersphere-based Transformer achieving 4-20x faster training and improved stability for LLMs.

**Proposals in this Paper** 🛠️:

• Normalized Transformer (nGPT) architecture
• All vectors normalized to unit norm on hypersphere
• Learnable eigen learning rates control hidden state updates
• Removal of LayerNorm/RMSNorm layers
• Introduction of scaling factors for logits, query/key vectors, and MLP states
• Elimination of weight decay and learning rate warmup

-----

**Key Insights from this Paper** 💡:

• nGPT learns 4-20x faster than standard Transformers
• Hyperspherical representation improves stability and embedding separability
• Transformer layers act as optimization steps on a hypersphere
• Eigen learning rates control the impact of each block's updates
• nGPT handles longer contexts without modifying positional encodings

-----

**Results** 📊:

• 4x faster training for 1k context length
• 10x faster training for 4k context length
• 20x faster training for 8k context length
• Similar or better performance on downstream tasks with less training
• More stable performance when extrapolating to longer sequences



2/6
@rohanpaul_ai
🧠 How does nGPT differ from the standard Transformer architecture?

Key differences include:

- All vectors and matrices are normalized to unit norm along their embedding dimension
- Removal of LayerNorm/RMSNorm layers
- Introduction of learnable eigen learning rates to control hidden state updates
- Modification of attention and MLP block updates to operate on the hypersphere
- Addition of learnable scaling factors for logits, query/key vectors, and MLP intermediate states
- Removal of weight decay and learning rate warmup
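A hedged sketch of the hypersphere update those bullets describe, with `alpha` standing in for the learnable eigen learning rates (the paper's extra scaling factors are omitted):

```python
import torch
import torch.nn.functional as F

def l2norm(x: torch.Tensor) -> torch.Tensor:
    # project onto the unit hypersphere along the embedding dimension
    return F.normalize(x, p=2, dim=-1)

def ngpt_block_update(h: torch.Tensor, block_out: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    # Treat the (normalized) block output as a suggested point on the hypersphere,
    # move the hidden state toward it by the eigen learning rate, then re-normalize.
    h, block_out = l2norm(h), l2norm(block_out)
    return l2norm(h + alpha * (block_out - h))
```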



3/6
@rohanpaul_ai
🤔 Potential drawbacks of nGPT

- Increased computational cost per step due to additional normalizations (though this is offset by faster convergence)
- Potential loss of expressiveness due to unit norm constraint (though experiments don't show this to be an issue)
- More hyperparameters to tune (eigen learning rates, scaling factors)
- Possible challenges in very large-scale training (not tested beyond 1B parameters)



4/6
@rohanpaul_ai
🔬 nGPT allows for some insights about Transformer internals:

- Interpreting Transformer layers as optimization steps on a hypersphere
- Viewing attention and MLP blocks as providing gradient information
- Understanding eigen learning rates as controlling the impact of each block's updates
- Analyzing the learned scaling factors and their effects on network behavior
- Examining condition numbers and rank of weight matrices across layers



5/6
@rohanpaul_ai
📚 [2410.01131] nGPT: Normalized Transformer with Representation Learning on the Hypersphere



6/6
@rohanpaul_ai
yep









1/2
@rohanpaul_ai
You can fine-tune a base language model to the BitNet 1.58b architecture. You don't necessarily have to train a model from scratch.

i.e. you can fine-tune an existing model to 1.58 bits!

---

Now BitNet is a special transformer architecture that represents each parameter with only three values: (-1, 0, 1), offering an extreme quantization of just 1.58 bits per parameter.

Until now, it required training a model from scratch.

However, pre-training a model from scratch is very costly. So recent experiments allow us to fine-tune an existing model to 1.58 bits!

There are recent examples of successfully fine-tuning a Llama3 8B model using the BitNet architecture

------

And Hugging Face Transformers has recently integrated the BitNet architecture by introducing a new quantization method called "bitnet".

📌 This method involves replacing the standard Linear layers with specialized BitLinear layers that are compatible with the BitNet architecture, with appropriate dynamic quantization of activations, weight unpacking, and matrix multiplication.

📌 In the Hugging Face blog example, they mention that, for BitNet, they train in full precision but quantize the weights to ternary values on the fly, using symmetric per-tensor quantization.

📌 Activations are then quantized to a specified bit-width (8-bit, in their case) using absmax per-token quantization (for a comprehensive introduction to quantization methods, check out this post). This involves scaling the activations into the range [−128, 127] for an 8-bit bit-width.
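A minimal sketch of those two quantizers as described above (absmean ternary weights, per-token absmax int8 activations); the epsilon/clamping details are assumptions:

```python
import torch

def absmean_ternary_quantize(w: torch.Tensor):
    # Scale by the mean absolute value, then round each weight to the nearest of {-1, 0, +1}.
    gamma = w.abs().mean().clamp(min=1e-5)
    w_q = (w / gamma).round().clamp(-1, 1)
    return w_q, gamma  # gamma is kept to rescale the layer's output

def absmax_int8_quantize(x: torch.Tensor):
    # Per-token absmax quantization of activations into the int8 range [-128, 127].
    scale = 127.0 / x.abs().amax(dim=-1, keepdim=True).clamp(min=1e-5)
    x_q = (x * scale).round().clamp(-128, 127)
    return x_q, scale
```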

[Quoted tweet]
WOW. @Microsoft just open-sourced the code for one of "THE MOST" influential Papers of 2024 🔥

1-bit LLMs (e.g., BitNet b1.58).

Now you can run 100B param models on local devices quantized with BitNet b1.58 on a single CPU at 5-7 tokens/sec 🤯

The dream we have all been waiting for.

📊 Performance Improvements:

- Achieves speedups of 1.37x to 5.07x on ARM CPUs

- Larger models see greater performance gains

- Reduces energy consumption by 55.4% to 70.0% on ARM

- On x86 CPUs, speedups range from 2.37x to 6.17x




2/2
@rohanpaul_ai
Fine-tuning LLMs to 1.58bit: extreme quantization made easy






1/3
@rohanpaul_ai
Qwen Code Interpreter, with Qwen 2.5 Coder & WebLLM

Running locally in your browser

A cool @huggingface space showcasing the power of open-source models and WebLLM

-----

WebLLM is a high-performance, in-browser language model inference engine that leverages WebGPU for hardware acceleration, enabling powerful LLM operations directly within web browsers without server-side processing.





2/3
@rohanpaul_ai
Qwen 2.5 Code Interpreter - a Hugging Face Space by cfahlgren1



3/3
@RaphLeclercAI
Running WebLLM locally in a browser, no server-side needed? That's quite impressive. Have you considered running it in environments with limited internet connectivity for even more impact?






1/3
@rohanpaul_ai
Divergent CoT (DCoT): Requiring models to compare multiple reasoning chains before generating a solution in a single inference step.

DCoT enhances LLM reasoning by generating multiple chains, enabling self-correction and improving performance across scales.

With this, complex reasoning methods can be encoded into LLMs through appropriate instruction tuning.

**Original Problem** 🤔:

LLMs struggle with complex reasoning tasks. Existing Chain of Thought (CoT) methods generate single reasoning chains, limiting exploration of diverse solutions.

-----

**Solution in this Paper** 🧠:

• Generates multiple reasoning chains in one inference step
• Compares chains before selecting final answer
• Instruction fine-tunes models on DCoT datasets
• Enables smaller models to benefit from complex reasoning

-----

**Key Insights from this Paper** 💡:

• DCoT improves performance across model sizes (1.3B to 70B parameters)
• Enables self-correction without explicit training
• Generalizes to unseen tasks
• Compatible with existing CoT extensions

-----

**Results** 📊:

• DCoT consistently outperforms CoT baseline across tasks and model scales
• Performance gains up to 7.59 points on BGQA for Phi 2
• Improves on unseen tasks: 5+ points on AQuA and SVAMP (Phi 2)
• Enables self-correction: 45% of corrected cases show different reasoning in second chain
• Combines with self-consistency for further gains





2/3
@rohanpaul_ai
🔬 How Divergent Chain of Thought (DCoT) differs from standard Chain of Thought (CoT)?

DCoT requires LLMs to generate multiple reasoning chains before producing a solution in a single inference step.

Unlike standard CoT which generates a single reasoning chain, DCoT generates multiple divergent chains and compares them before selecting a final answer.

This allows the model to explore different reasoning paths simultaneously.
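The paper's exact instruction format isn't reproduced here, but a hypothetical DCoT-style prompt builder makes the idea concrete: k chains, a comparison, then one final answer, all in a single generation.

```python
def build_dcot_prompt(question: str, k: int = 3) -> str:
    # Hypothetical wording; the paper's actual instruction template may differ.
    return (
        f"Question: {question}\n\n"
        f"Write {k} different step-by-step reasoning chains for this question, "
        f"labeled 'Chain 1' through 'Chain {k}'. Then compare the chains, note where "
        "they disagree, and give a single final answer on a line starting with 'Answer:'."
    )

print(build_dcot_prompt("If 3 pens cost $1.20, how much do 7 pens cost?"))
```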





3/3
@LaurenceBrem
Is this similar to the tree of thought method? it sounds like it






1/2
@rohanpaul_ai
Chain-of-thought (CoT) via prompting is NOT ALWAYS needed for eliciting reasoning capabilities from large language models (LLMs).

CoT excels in math and logic but underperforms in broader language tasks. So selective use of CoT can optimize performance without incurring high inference costs.

**Original Problem** 🤔:

Chain-of-Thought (CoT) prompting is widely used to enhance reasoning in LLMs. However, its effectiveness across different task types is unclear.

-----

**Solution in this Paper** 🛠️:

- Conducted a meta-analysis of over 100 papers and evaluated 20 datasets across 14 models.
- Focused on separating planning and execution stages in problem-solving.
- Compared CoT performance against tool-augmented LLMs.
- Suggested selective application of CoT to reduce inference costs.

-----

**Key Insights from this Paper** 💡:

- CoT significantly improves tasks involving math and logic.
- Symbolic reasoning benefits most from CoT, especially in execution.
- Tool augmentation outperforms CoT in symbolic tasks.
- CoT's utility is limited for non-symbolic reasoning tasks.

-----

**Results** 📊:

- Math and symbolic reasoning tasks showed substantial improvements with CoT.
- Non-symbolic tasks saw minimal gains.
- On MMLU, CoT's benefit was mostly for questions involving symbolic operations like equations.





2/2
@rohanpaul_ai
CoT is particularly beneficial for tasks requiring symbolic execution, such as solving equations or logical puzzles. It helps more in the execution phase rather than the planning phase of problem-solving.

📚 https://arxiv.org/pdf/2409.12183









1/4
@rohanpaul_ai
GraphInstruct: Comprehensive benchmark with 21 classical graph reasoning tasks

**Results** 📊:

• Strong performance maintained on unseen graph sizes and description formats
• Lower performance on out-of-domain tasks highlights areas for future work

**Original Problem** 🔍:

LLMs lack graph understanding capabilities, crucial for advancing general intelligence. Existing benchmarks are limited in scope and diversity.

-----

**Solution in this Paper** 🛠️:

• Diverse graph generation: Various structures, sizes, and description formats
• GraphLM: Fine-tuned Vicuna-7b using LoRA on GraphInstruct
• GraphLM+: Enhanced with step mask training strategy for improved reasoning

-----

**Key Insights from this Paper** 💡:

• Intermediate reasoning steps crucial for enhancing graph reasoning capability
• Label mask training filters redundant information while preserving graph structure
• LLMs can generalize to unseen graph sizes and description formats
• Out-of-domain task performance indicates room for improvement





2/4
@rohanpaul_ai
💡 To enhance reasoning abilities, GraphLM+ was created using a step mask training strategy.

This involved incorporating intermediate reasoning steps as supervised signals during training, with a label mask to filter out redundant information while preserving essential graph structure data.





3/4
@rohanpaul_ai
📊 GraphInstruct includes 21 classical graph reasoning tasks across three levels:

- Node level: Degree, PageRank, Predecessor, Neighbor, Clustering Coefficient

- Node-pair level: Common Neighbor, Jaccard, Edge, Shortest Path, Connectivity, Maximum Flow

- Graph level: DFS, BFS, Cycle, Connected Component, Diameter, Bipartite, Topological Sort, MST, Euler Path, Hamiltonian Path

🤖 GraphLM was created by fine-tuning Vicuna-7b on GraphInstruct using LoRA.
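A minimal sketch of what that LoRA fine-tuning setup looks like with Hugging Face PEFT; the checkpoint name and hyperparameters (rank, alpha, target modules) are illustrative assumptions, not the paper's reported values:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "lmsys/vicuna-7b-v1.5"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(base)

lora_cfg = LoraConfig(
    r=16,                      # illustrative rank
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # only the small LoRA adapters are trained
# ...then fine-tune on GraphInstruct examples with a standard causal-LM trainer.
```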





4/4
@rohanpaul_ai
📚 https://arxiv.org/pdf/2403.04483





1/1
@rohanpaul_ai
Cross-domain instruction variety key to LLM generalization and adaptability.

Instruction-following capabilities in LLMs are crucial but poorly understood. Existing research lacks systematic analysis of factors influencing generalization to unseen instructions.

Data diversity trumps quantity in LLM instruction-following generalization.

**Key Insights from this Paper** 💡:

• Generalization emerges only with sufficient cross-domain instruction diversity
• Data diversity matters more than quantity for improving model performance
• Optimal performance requires balancing specialization and diversification
• Strategic data curation enhances both specialist and generalist LLM capabilities

------

Generated this podcast with Google's Illuminate.

[Quoted tweet]
Data diversity trumps quantity in LLM instruction-following generalization.

Cross-domain instruction variety key to LLM generalization and adaptability.

**Original Problem** 🔍:

Instruction-following capabilities in LLMs are crucial but poorly understood. Existing research lacks systematic analysis of factors influencing generalization to unseen instructions.

-----

**Solution in this Paper** 🧠:

• Introduces controlled string-rewriting tasks to isolate instruction-following abilities
• Examines impact of instruction diversity on model generalization
• Analyzes real-world scenarios: specialist (code generation) and generalist LLM fine-tuning
• Proposes strategic data diversification for improved performance

-----

**Key Insights from this Paper** 💡:

• Generalization emerges only with sufficient cross-domain instruction diversity
• Data diversity matters more than quantity for improving model performance
• Optimal performance requires balancing specialization and diversification
• Strategic data curation enhances both specialist and generalist LLM capabilities

-----

**Results** 📊:

• Generalization accuracy improves sharply with 300-1000 unique instructions
• Cross-domain diversification boosts code generation performance by 3%
• Diverse instruction mixtures enhance generalist LLM capabilities across tasks
• Balanced specialization and diversification yield 14.06% relative gain in generalist scenarios





1/1
@rohanpaul_ai
Scaling laws for diffusion transformers in text-to-image generation were unexplored, hindering optimal resource allocation and performance prediction.

Scaling laws reveal optimal resource allocation for diffusion transformers in text-to-image synthesis.

And Power-law relationships govern diffusion transformer scaling

**Key Insights from this Paper** 💡:

• Optimal model size and data quantity scale with compute budget according to power laws
• Training loss and FID follow power-law relationships with compute
• Scaling laws hold for out-of-domain datasets, with consistent trends but vertical offsets
• Cross-Attention Transformers show more efficient performance improvement than In-Context Transformers

------

Generated this podcast with Google's Illuminate.

[Quoted tweet]
Scaling laws reveal optimal resource allocation for diffusion transformers in text-to-image synthesis.

And Power-law relationships govern diffusion transformer scaling

**Original Problem** 🔍:

Scaling laws for diffusion transformers in text-to-image generation were unexplored, hindering optimal resource allocation and performance prediction.

-----

**Solution in this Paper** 🧠:

• Conducted experiments across compute budgets from 1e17 to 6e18 FLOPs
• Established power-law relationships between compute, model size, data quantity, and loss
• Used Rectified Flow formulation with v-prediction and Logit-Normal timestep sampling
• Evaluated models on Laion5B subset and COCO validation datasets
• Analyzed scaling behavior of In-Context and Cross-Attention Transformers

-----

**Key Insights from this Paper** 💡:

• Optimal model size and data quantity scale with compute budget according to power laws
• Training loss and FID follow power-law relationships with compute
• Scaling laws hold for out-of-domain datasets, with consistent trends but vertical offsets
• Cross-Attention Transformers show more efficient performance improvement than In-Context Transformers





1/3
@rohanpaul_ai
ConvNets have been challenged by Vision Transformers in image recognition tasks and lack universal modeling capabilities across modalities.

This paper proposes UniRepLKNet: Expanding ConvNet capabilities with large kernels for multi-modal perception.

Achieve universal modeling across modalities

-----

**Key Insights from this Paper** 💡:

• Large kernels improve performance without significant computational overhead

• Structural re-parameterization enhances large kernel effectiveness

• ConvNets can achieve universal perception across modalities

• Large-kernel ConvNets show higher shape bias than traditional ConvNets

------

Generated this podcast with Google's Illuminate.

[Quoted tweet]
UniRepLKNet: Expanding ConvNet capabilities with large kernels for multi-modal perception.

Achieve universal modeling across modalities

**Original Problem** 🔍:

ConvNets have been challenged by Vision Transformers in image recognition tasks and lack universal modeling capabilities across modalities.

-----

**Solution in this Paper** 🛠️:

• UniRepLKNet: A universal large-kernel ConvNet architecture
• Uses depth-wise large kernels (13x13) in middle and later stages
• Incorporates Dilated Reparam Block for enhanced large kernel convolutions
• Efficient implementation using block-wise implicit GEMM
• Modality-specific preprocessing for audio, point clouds, time-series, and video

-----

**Key Insights from this Paper** 💡:

• Large kernels improve performance without significant computational overhead
• Structural re-parameterization enhances large kernel effectiveness
• ConvNets can achieve universal perception across modalities
• Large-kernel ConvNets show higher shape bias than traditional ConvNets



2/3
@zamderax
I hate that people won't pick between CNN vs ConvNets



3/3
@NatureTerryW
Excited to dive into UniRepLKNet and explore its potential for multi-modal perception. The ideas on large kernels and structural re-parameterization are particularly intriguing, and I'm curious to see how they'll impact the field of computer vision.





1/1
@rohanpaul_ai
Existing retrieval-augmented generation (RAG) methods struggle with knowledge-intensive reasoning tasks due to scattered information across documents.

After GraphRAG now the kid on the block is StructRAG!

Imagine asking your AI assistant a complex question that requires piecing together information from multiple sources.

1. StructRAG first identifies the best way to structure the knowledge for the specific task, such as a table, graph, or tree.

2. It then reconstructs the original documents into this structured format, making it easier to see connections and relationships between pieces of information.

3. Finally, it uses this structured knowledge to infer the answer to the original question.

**Key Insights from this Paper** 💡:

• Structured knowledge in optimal format enhances LLM reasoning
• Hybrid information structurization outperforms fixed structure types
• DPO training improves structure type selection accuracy

------

Generated this podcast with Google's Illuminate.

[Quoted tweet]
After GraphRAG now the kid on the block is StructRAG!

Imagine asking your AI assistant a complex question that requires piecing together information from multiple sources.

1. StructRAG first identifies the best way to structure the knowledge for the specific task, such as a table, graph, or tree.

2. It then reconstructs the original documents into this structured format, making it easier to see connections and relationships between pieces of information.

3. Finally, it uses this structured knowledge to infer the answer to the original question.

**Original Problem** 🔍:

Existing retrieval-augmented generation (RAG) methods struggle with knowledge-intensive reasoning tasks due to scattered information across documents.

-----

**Solution in this Paper** 🛠️:

• StructRAG framework:
- Hybrid structure router selects optimal structure type
- Scattered knowledge structurizer converts documents into structured knowledge
- Structured knowledge utilizer decomposes questions and infers answers
• DPO-based training for hybrid structure router
• Synthesizing-simulating-judging method for constructing preference pairs

-----

**Key Insights from this Paper** 💡:

• Structured knowledge in optimal format enhances LLM reasoning
• Hybrid information structurization outperforms fixed structure types
• DPO training improves structure type selection accuracy

-----

**Results** 📊:

• StructRAG achieves state-of-the-art performance on knowledge-intensive tasks
• Outperforms baselines on Loong benchmark and Podcast Transcripts
• Performance improvement increases with task complexity
• Operates significantly faster than Graph RAG methods





1/2
@rohanpaul_ai
Sonar image synthesis faces challenges in data scarcity, quality, and diversity. Traditional methods rely on costly data collection, limiting research and applications in underwater exploration.

This Paper brings in GPT-prompted sonar image synthesis: A new frontier in underwater data generation.

**Solution in this Paper** 🛠️:

• Synth-SONAR framework leverages dual diffusion models and GPT prompting

• Creates large dataset by combining real, simulated, and AI-generated images

• Incorporates GPT and vision-language models for improved text-to-image synthesis

• Applies style injection techniques to enhance image diversity

[Quoted tweet]
GPT-prompted sonar image synthesis: A new frontier in underwater data generation.

**Original Problem** 🔍:

Sonar image synthesis faces challenges in data scarcity, quality, and diversity. Traditional methods rely on costly data collection, limiting research and applications in underwater exploration.

-----

**Solution in this Paper** 🛠️:

• Synth-SONAR framework leverages dual diffusion models and GPT prompting
• Creates large dataset by combining real, simulated, and AI-generated images
• Uses two-stage image generation: coarse and fine-grained
• Incorporates GPT and vision-language models for improved text-to-image synthesis
• Applies style injection techniques to enhance image diversity

-----

**Key Insights from this Paper** 💡:

• First application of GPT-prompting in sonar imagery generation
• Dual-stage diffusion model hierarchy enhances image quality and diversity
• Integration of language models bridges gap between text descriptions and sonar image generation
• Style injection with attention mechanism improves feature separation in generated images

-----

**Results** 📊:

• Outperforms state-of-the-art models in image quality metrics (SSIM: 0.381, PSNR: 12.730, FID: 3.8)
• Achieves up to 97% accuracy in sonar image classification when combining real and synthetic data
• Generates high-quality synthetic sonar images with enhanced diversity and realism
• Enables controlled and interpretable sonar image synthesis through text prompts



2/2
@techredner
Fascinating approach to sonar image synthesis. The use of dual diffusion models and GPT prompting is particularly innovative, enabling high-quality synthetic images with enhanced diversity and realism. Great work!










1/9
@rohanpaul_ai
Now that @Microsoft open-sourced the code for one of THE CLASSIC Papers of 2024, I am revisiting the MASTERPIECE.

📚 "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits"

BitNet b1.58 70B was 4.1 times faster and had 8.9 times higher throughput than the corresponding FP16 LLaMA.

📌 Requires almost no multiplication operations for matrix multiplication and can be highly optimized.

📌 BitNet b1.58 is a 1-bit LLM, where every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.

They introduce a significant 1-bit LLM variant called BitNet b1.58, where every parameter is ternary, taking on values of {-1, 0, 1}. It adds an additional value of 0 to the original 1-bit BitNet, resulting in 1.58 bits in the binary system.

📌 The term "1.58 bits" refers to the information content of each parameter.

What "1.58 bits" means is that it is log base 2 of 3, or 1.5849625... Actually decoding data at that density takes a lot of math.Since each parameter can take one of three possible values (-1, 0, 1), the information content is log2(3) ≈ 1.58 bits.

----

Here all weight values are ternary, taking on values {-1, 0, 1}.

Its quantization function is absmean, in which the weights are first scaled by their average absolute value and then rounded to the nearest value in {-1, 0, 1}.

It is an efficient extension of the 1-bit BitNet, obtained by including 0 among the model parameter values.

So BitNet b1.58 is based on the BitNet architecture (it replaces `nn.Linear` with `BitLinear`).

It is highly optimized as it removes floating point multiplication overhead, involving only integer addition (INT-8), and efficiently loads parameters from DRAM.

[Quoted tweet]
WOW. @Microsoft just open-sourced the code for one of "THE MOST" influential Papers of 2024 🔥

1-bit LLMs (e.g., BitNet b1.58).

Now you can run 100B param models on local devices quantized with BitNet b1.58 on a single CPU at 5-7 tokens/sec 🤯

The dream we have all been waiting for.

📊 Performance Improvements:

- Achieves speedups of 1.37x to 5.07x on ARM CPUs

- Larger models see greater performance gains

- Reduces energy consumption by 55.4% to 70.0% on ARM

- On x86 CPUs, speedups range from 2.37x to 6.17x




2/9
@rohanpaul_ai
How BitNet b1.58 requires almost no multiplication operations for matrix multiplication:

🧮 Traditional matrix multiplication:

In standard neural networks using floating-point weights, matrix multiplication involves many floating-point multiplications and additions. Each element of the output is computed by multiplying corresponding elements from the input vector and weight matrix, then summing these products.

🔢 BitNet b1.58 approach:

With BitNet b1.58, the weights are constrained to {-1, 0, 1}. This fundamentally changes the nature of the matrix multiplication:

- Multiplication by 1: Simply keeps the input value as-is

- Multiplication by -1: Just flips the sign of the input value

- Multiplication by 0: Results in zero, effectively filtering out that input

So, instead of actual multiplications, the operations become:
- Sign flipping (for -1)
- Passing through (for 1)
- Zeroing out (for 0)

These operations are much simpler than floating-point multiplication. The bulk of the computation then becomes integer addition to sum up the resulting values.
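A toy sketch of that point: with ternary weights, a matrix-vector product needs only additions and subtractions (real kernels such as bitnet.cpp do this with packed bit operations rather than Python loops):

```python
import numpy as np

def ternary_matvec(W: np.ndarray, x: np.ndarray) -> np.ndarray:
    # W contains only {-1, 0, +1}: +1 adds the input, -1 subtracts it, 0 skips it.
    out = np.zeros(W.shape[0], dtype=x.dtype)
    for i in range(W.shape[0]):
        out[i] = x[W[i] == 1].sum() - x[W[i] == -1].sum()
    return out

W = np.random.choice([-1, 0, 1], size=(4, 8))
x = np.random.randn(8)
assert np.allclose(ternary_matvec(W, x), W @ x)  # matches the ordinary product
```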

💻 Why this is highly optimizable:

- Bit-level operations: Sign flipping and zeroing can be implemented as fast bit-level operations

- SIMD friendly: These simple operations are ideal for SIMD (Single Instruction, Multiple Data) parallelization

- Reduced memory bandwidth: The 1.58-bit weights require much less memory transfer, often a bottleneck in neural network computations

- Specialized hardware potential: This simplified computation model opens doors for highly efficient custom hardware designs

This approach essentially transforms complex floating-point matrix multiplications into a series of simple integer additions and bit manipulations, leading to significant performance and energy efficiency gains.





3/9
@rohanpaul_ai
🧠 Let's break down this quantization process in BitNet b1.58 simple terms:

1️⃣ Starting point:
We have a regular neural network with weights that can be any decimal number.

2️⃣ Goal:
We want to convert all these weights to only three possible values: -1, 0, or +1.

3️⃣ The process:

🔹 Step 1: Find the average:
First, we calculate the average of all the absolute values (ignoring negative signs) of the weights in the matrix. Let's call this average γ.

🔹 Step 2: Scaling:
We divide all the weights by this average γ. This scales the weights so they're generally closer to the range of -1 to +1.

🔹 Step 3: Rounding:
After scaling, we round each weight to the nearest value among -1, 0, and +1.

- If it's closer to -1, it becomes -1
- If it's closer to 0, it becomes 0
- If it's closer to +1, it becomes +1

4️⃣ Result:

After this process, all weights in the network are now either -1, 0, or +1.

🎯 Why do this?

This method allows us to drastically simplify the neural network while trying to keep its overall behavior similar to the original. It makes computations much faster and more efficient, especially on certain types of hardware.

Think of it like simplifying a detailed color image into just black, white, and gray. You lose some detail, but the main features are still there, and it becomes much easier to process.





4/9
@rohanpaul_ai
🔍 Comparison of BitNet b1.58 across different model sizes:

📊 Left graph (Latency):

- Shows decoding speed (lower is better)
- BitNet b1.58 is consistently faster
- Gap widens as model size increases
- At 70B parameters, BitNet is 4.10x faster

📉 Right graph (Memory):

- Shows memory usage (lower is better)
- BitNet b1.58 uses significantly less memory
- Difference grows with model size
- At 70B, BitNet uses 7.16x less memory

📈 Table (Throughput):
- For 70B models:
- BitNet handles 11x larger batch size
- Processes 8.9x more tokens per second





5/9
@rohanpaul_ai
bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices.

GitHub - microsoft/BitNet: Official inference framework for 1-bit LLMs





6/9
@rohanpaul_ai
Energy consumption of BitNet b1.58 compared to LLaMA LLM at 7nm process nodes.





7/9
@Astro_Erik
this is truly insane, a fractional dimension of bits XD



8/9
@DanielLSainz
is there any realistic way to make 1 bit quantization work?



9/9
@wynrdynr69r
Shame you didn't have more black representation on the team. Just a bunch of rice munchers.





1/1
@rohanpaul_ai
Smaller LLMs struggle with complex mathematical reasoning due to inability to detect and fix errors.

Teacher-student framework enhances mathematical reasoning in smaller language models.

With Hierarchical thought templates and cross-model DPO

**Solution in this Paper** 🧠:

• SUPERCORRECT: Two-stage framework using large teacher model to supervise smaller student model

• Stage 1: Hierarchical thought-based supervised fine-tuning (HSFT)
- Extracts high-level and detailed thought templates from teacher
- Guides student to produce fine-grained reasoning thoughts

• Stage 2: Cross-model collaborative direct preference optimization (DPO)
- Enhances student's self-correction using teacher's correction traces
- Teaches student to locate and resolve errors using teacher's insights
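For Stage 2, a minimal sketch of the standard DPO objective the framework builds on; beta is illustrative, and the paper's cross-model construction of the preference pairs is not shown:

```python
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    # Inputs are summed log-probs (tensors) of the preferred / dispreferred correction
    # traces under the student policy and a frozen reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return -F.logsigmoid(beta * margin).mean()
```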

[Quoted tweet]
Teacher-student framework enhances mathematical reasoning in smaller language models.

With Hierarchical thought templates and cross-model DPO

**Original Problem** 🔍:

Smaller LLMs struggle with complex mathematical reasoning due to inability to detect and fix errors.

-----

**Solution in this Paper** 🧠:

• SUPERCORRECT: Two-stage framework using large teacher model to supervise smaller student model

• Stage 1: Hierarchical thought-based supervised fine-tuning (HSFT)
- Extracts high-level and detailed thought templates from teacher
- Guides student to produce fine-grained reasoning thoughts

• Stage 2: Cross-model collaborative direct preference optimization (DPO)
- Enhances student's self-correction using teacher's correction traces
- Teaches student to locate and resolve errors using teacher's insights

-----

**Key Insights from this Paper** 💡:

• Leverages teacher model to improve both reasoning and reflection in student model
• Hierarchical thought templates enable more precise reasoning
• Cross-model DPO allows student to break thought bottlenecks and acquire new skills
• Addresses limitations of self-reflection methods where models struggle to identify errors independently

-----

**Results** 📊:

• Surpasses DeepSeekMath-7B by 7.8%/5.3% on MATH/GSM8K benchmarks
• Outperforms Qwen2.5-Math-7B by 15.1%/6.3% on MATH/GSM8K benchmarks
• Achieves new state-of-the-art performance among all 7B models
• SUPERCORRECT-Qwen-7B: 70.2% accuracy on MATH, 89.5% on GSM8K





1/3
@rohanpaul_ai
Current multimodal large language models (MLLMs) lack personalization capabilities, struggling to conduct dialogues targeting specific individuals. This limitation hinders their application in personalized settings like mobile visual assistants or domestic robots.

Personalized Visual Instruction Tuning (PVIT) framework empowers MLLMs to recognize individuals and conduct personalized conversations effectively.

**Solution in this Paper** 🛠️:

• Represents individuals as multimodal prefix pairs (personal image, personal introduction)
• Uses personalized wrapper tokens to eliminate ambiguity
• Develops an automatic pipeline for generating personalized training data
• Leverages visual experts, image generation models, and MLLMs

[Quoted tweet]
Personalized Visual Instruction Tuning (PVIT) framework empowers MLLMs to recognize individuals and conduct personalized conversations effectively.

**Original Problem** 🔍:

Current multimodal large language models (MLLMs) lack personalization capabilities, struggling to conduct dialogues targeting specific individuals. This limitation hinders their application in personalized settings like mobile visual assistants or domestic robots.

-----

**Solution in this Paper** 🛠️:

• Represents individuals as multimodal prefix pairs (personal image, personal introduction)
• Uses personalized wrapper tokens to eliminate ambiguity
• Develops an automatic pipeline for generating personalized training data
• Leverages visual experts, image generation models, and MLLMs

-----

**Key Insights from this Paper** 💡:

• PVIT enables MLLMs to identify target individuals and engage in personalized dialogues
• Utilizes in-context learning capability of MLLMs for generalization
• Introduces P-Bench, a benchmark for evaluating personalized potential of MLLMs
• Addresses "face blindness" limitation of current MLLMs

-----

**Results** 📊:

• P-LLaVA (PVIT-tuned LLaVA) outperforms state-of-the-art MLLMs across various question types in P-Bench
• Achieves 96.69% accuracy on answerable tasks (vs 63.13% for next best)
• Demonstrates 99.72% accuracy on unanswerable queries (vs 31.49% for next best)
• Shows strong performance even with complex scenarios and multiple individuals



2/3
@neosmithai
I love how PVIT tackles the "face blindness" of MLLMs. Its potential in mobile assistants and domestic robots is huge. Can't wait to see real-world applications of this tech.



3/3
@RibbeFelipe
@Ugo_alves take a look!



