bnew

Veteran
Joined
Nov 1, 2015
Messages
56,111
Reputation
8,239
Daps
157,798











1/12
@minchoi
HeyGen just dropped Avatar 3.0 with Unlimited Looks.

Now anyone can clone themselves with AI and unlock multiple poses, outfits, and camera angles.

Here's how:



https://video.twimg.com/ext_tw_video/1843391162154909696/pu/vid/avc1/1280x720/vjOC27CxtaYlcsNT.mp4

2/12
@minchoi
1/ Create your Avatar

- Go to HeyGen
- Click on Avatars then Create Avatar
- Follow the given step by step instructions

It's very important to follow the instructions on making a good video of yourself



https://video.twimg.com/ext_tw_video/1843391974348369920/pu/vid/avc1/1402x720/u1ApPmZvnF9nKIKv.mp4

3/12
@minchoi
2/ Create your Video

- Click Create Video
- Click Avatar Video
- Choose video orientation
- Select your Avatar (you can add more Scenes for different Looks)
- Add/Generate your script
- Then Click Submit



https://video.twimg.com/ext_tw_video/1843392144632889344/pu/vid/avc1/1402x720/xFXBfdxUpFtB66hu.mp4

4/12
@minchoi
3/ Wait for your Video to be generated

- Depending on the length of your video, it will take several minutes to process
- And you are done ✅



https://video.twimg.com/ext_tw_video/1843392877172281344/pu/vid/avc1/1402x720/knyVqqheSTn9iaVk.mp4

5/12
@minchoi
Official video from @HeyGen_Official & @joshua_xu_

Unlock Unlimited looks, unlimited creativity at HeyGen - AI Video Generator



https://video.twimg.com/ext_tw_video/1841433281515892737/pu/vid/avc1/1920x1080/mzMZa8aPOgAp1Pm9.mp4

6/12
@minchoi
If you enjoyed this thread,

Follow me @minchoi and please Bookmark, Like, Comment & Repost the first Post below to share with your friends:

[Quoted tweet]
HeyGen just dropped Avatar 3.0 with Unlimited Looks.

Now anyone can clone themselves with AI and unlock multiple poses, outfits, and camera angles.

Here's how:


https://video.twimg.com/ext_tw_video/1843391162154909696/pu/vid/avc1/1280x720/vjOC27CxtaYlcsNT.mp4

7/12
@KanikaBK
Woah, awesome



8/12
@minchoi
Awesome indeed



9/12
@adolfoasorlin
This is amazing dude 😚😳



10/12
@minchoi
Lip syncing on HeyGen is another level



11/12
@mhdfaran
With all these unlimited looks and poses, I bet social media influencers are already gearing up to flood our feeds with even more content.



12/12
@minchoi
Genuine human content will become more important than ever




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


 

1/16
@leloykun
Deep Learning Optimizers from First Principles

My attempt at answering these questions:

1. Why do steepest descent in non-Euclidean spaces?
2. Why does adaptive preconditioning work so well in practice? And,
3. Why normalize everything ala nGPT?

[Quoted tweet]
The Case for Muon

1) We can descend 'faster' in non-Euclidean spaces
2) Adam/Shampoo/SOAP/etc. dynamically learn the preconditioner and, equivalently, the norm & space to descend in
3) Muon saves a lot of compute by simply letting the norm vary within a fixed range


GaUXdGAbUAATyBT.jpg

GaFI4hFaUAA2GHy.jpg


2/16
@leloykun
Ideally, when training a neural network, we want to bound the features, the weights, and their respective updates so that:

1. [lower] the model actually "learns" stuff; and
2. [upper] model training is stable

These bounds then depend on the norms, but which norms?



GaUYC0mbUAA2WXg.png


3/16
@leloykun
The fun part is that the norms of the input and output features already induce the norm of the weights between them

We can also let the feature and feature updates have the same norm (likewise for the weights)

And so, we only have to choose the norms for the features!



GaUYwrIbUAEvisF.png


4/16
@leloykun
Now, our datasets are usually Euclidean or locally Euclidean (see Manifold Hypothesis)

What's the norm induced by Euclidean input and output vector spaces? The Spectral Norm!



GaUZVJJaYAAGX5U.png
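To make the spectral norm concrete, here is a minimal sketch (mine, not from the thread): NumPy code for the spectral norm of a weight matrix and for the steepest-descent step that norm induces, i.e. the U V^T "orthogonalized gradient" that Muon approximates with a Newton-Schulz iteration instead of an explicit SVD.

```python
# Minimal sketch: spectral norm and the steepest-descent step under it.
import numpy as np

def spectral_norm(W: np.ndarray) -> float:
    # Largest singular value = operator norm induced by Euclidean in/out spaces.
    return float(np.linalg.svd(W, compute_uv=False)[0])

def spectral_descent_step(W: np.ndarray, grad: np.ndarray, lr: float) -> np.ndarray:
    # Under the spectral norm the steepest-descent direction is U @ V^T,
    # i.e. the gradient with every singular value snapped to 1.
    U, _, Vt = np.linalg.svd(grad, full_matrices=False)
    return W - lr * (U @ Vt)

W = np.random.randn(64, 32)
G = np.random.randn(64, 32)   # stand-in for a gradient
print(spectral_norm(W), spectral_norm(spectral_descent_step(W, G, lr=0.02)))
```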


5/16
@leloykun
So even if we don't want to do anything fancy, we'd still have to do steepest descent in non-Euclidean space because:

1. The induced norm for the weights (w/ Euclidean features) is non-Euclidean; and
2. We're optimizing the weights, not the features

cc:

[Quoted tweet]
The rate this whole space is converging towards a topological math problem makes me really uncomfortable


6/16
@leloykun
The model inputs and outputs being Euclidean sounds reasonable, but why do the "hidden" features have to be Euclidean too?

If we vary the norms of these features, we also vary the induced norms of the weights, and vice versa

Adaptive preconditioning then "searches" for the proper norms



GaUaf8ma8AA2HsF.png


7/16
@leloykun
This also answers @mattecapu's Q here

Shampoo & SOAP start from Euclidean features and Spectral weights, then tune the norms over time. SOAP does this tuning with momentum so it's theoretically faster.

[Quoted tweet]
really cool to also optimize the p in the norm. do you have a conceptual idea of what that's tuning? I guess intuitively as p->oo each dimension is getting 'further away' from each other..


8/16
@leloykun
A more cynical answer, from one mathematician to another, is that almost nobody in this field is actually doing proper linear algebra.

Adaptive preconditioning allows us to start from really crappy configurations/parametrizations and get away scot-free



9/16
@leloykun
A more pro-ML answer would be that humans suck at predicting which inductive biases would work best when cooked into the models

E.g. why should the "hidden" features be in Euclidean space? Why not let the model learn the proper space(s) to work with?



10/16
@leloykun
Another takeaway here is that interpretability folks also need to be mindful of this.

The "hidden" features may not be Euclidean!

And e.g., if you use Adam (w/o accumulation), what you're really doing is optimizing a transform from l_1 norm to l_infty norm



11/16
@leloykun
Finally, why is it a good idea to normalize everything everywhere?

Cuz it lets us have sane bounds & same norms on the features which means we can use the same optimizer for all the layers with minimal tuning!

[2410.01131] nGPT: Normalized Transformer with Representation Learning on the Hypersphere



12/16
@leloykun
I also realized I overlooked something elementary here

I was getting unstable iterators because I wasn't properly upper-bounding the polynomials lol

[Quoted tweet]
I've also noticed that for each γ, there is a minimum r below which things become unstable. And as we increase γ, the minimum r also increases

This indicates that as we shrink the attractor basin, the more iterations we'll need for things to converge.


13/16
@leloykun
If you want to design your own iterators:

1. Pick inner and outer radii (l, r)
2. Let {0, ±(1-l), ±(1+r)} be fixed points
3. Binary search for the maximum gamma such that the peak in (0, 1-l) = 1 + r while the trough in (1-l, 1+r) > 0

The current iterator is already decent tho



GaUd4MSbUAEiFaS.png

GaUeOYhbUAMXOt7.png
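As a rough illustration of step 3 of that recipe (my own sketch, using the classical Newton-Schulz cubic rather than the tuned iterator from the plots), you can check a candidate odd polynomial numerically: its peak on (0, 1-l) must not overshoot 1+r, and its trough on (1-l, 1+r) must stay above 0.

```python
# Sketch: numerically check an odd-polynomial iterator against the (l, r) criteria.
import numpy as np

def iterator_ok(poly, l=0.05, r=0.05):
    xs_low  = np.linspace(1e-6, 1 - l, 10_000)   # singular values below the band
    xs_band = np.linspace(1 - l, 1 + r, 10_000)  # singular values inside the band
    peak   = poly(xs_low).max()                  # must not exceed 1 + r
    trough = poly(xs_band).min()                 # must stay strictly positive
    return peak <= 1 + r and trough > 0

# Classical Newton-Schulz cubic for orthogonalization, as a scalar map on singular values.
newton_schulz_cubic = lambda x: 1.5 * x - 0.5 * x**3

print(iterator_ok(newton_schulz_cubic))  # True for these (l, r)
```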


14/16
@leloykun
Going back,

it's really important to think about where we're mapping things to and from, because the optimizer, learning rate scheduler, etc. can be derived from there



15/16
@leloykun
That's all from me!

I now feel intellectually satisfied and the nerdsnipe trance is wearing off. I can finally be normal again and be productive and post memes huhu



16/16
@leloykun
I also really recommend this work by @TheGregYang and @jxbz : [2310.17813] A Spectral Condition for Feature Learning

This one too: [2409.20325] Old Optimizer, New Norm: An Anthology

super information-dense works!








3/4
@rohanpaul_ai
New Transformer architecture modifications from NVIDIA researchers.

nGPT: A hypersphere-based Transformer achieving 4-20x faster training and improved stability for LLMs.

**Key Insights from this Paper** 💡:

• nGPT learns 4-20x faster than standard Transformers
• Hyperspherical representation improves stability and embedding separability
• Transformer layers act as optimization steps on a hypersphere
• Eigen learning rates control the impact of each block's updates
• nGPT handles longer contexts without modifying positional encodings

-----

**Results** 📊:

• 4x faster training for 1k context length
• 10x faster training for 4k context length
• 20x faster training for 8k context length
• Similar or better performance on downstream tasks with less training
• More stable performance when extrapolating to longer sequences

------

Generated this podcast with Google's Illuminate.

[Quoted tweet]
New Transformer architecture modifications from NVIDIA researchers.

nGPT: A hypersphere-based Transformer achieving 4-20x faster training and improved stability for LLMs.

**Proposals in this Paper** 🛠️:

• Normalized Transformer (nGPT) architecture
• All vectors normalized to unit norm on hypersphere
• Learnable eigen learning rates control hidden state updates
• Removal of LayerNorm/RMSNorm layers
• Introduction of scaling factors for logits, query/key vectors, and MLP states
• Elimination of weight decay and learning rate warmup

-----

**Key Insights from this Paper** 💡:

• nGPT learns 4-20x faster than standard Transformers
• Hyperspherical representation improves stability and embedding separability
• Transformer layers act as optimization steps on a hypersphere
• Eigen learning rates control the impact of each block's updates
• nGPT handles longer contexts without modifying positional encodings

-----

**Results** 📊:

• 4x faster training for 1k context length
• 10x faster training for 4k context length
• 20x faster training for 8k context length
• Similar or better performance on downstream tasks with less training
• More stable performance when extrapolating to longer sequences


GaLZE_nWAAAyqN3.png


https://video-t-2.twimg.com/ext_tw_...16/pu/vid/avc1/1080x1080/puHpFuQ16dykA8Pb.mp4

4/4
@VXaviervm
what are the current transformers that are not a hypersphere?










1/6
@rohanpaul_ai
New Transformer architecture modifications from NVIDIA researchers.

nGPT: A hypersphere-based Transformer achieving 4-20x faster training and improved stability for LLMs.

**Proposals in this Paper** 🛠️:

• Normalized Transformer (nGPT) architecture
• All vectors normalized to unit norm on hypersphere
• Learnable eigen learning rates control hidden state updates
• Removal of LayerNorm/RMSNorm layers
• Introduction of scaling factors for logits, query/key vectors, and MLP states
• Elimination of weight decay and learning rate warmup

-----

**Key Insights from this Paper** 💡:

• nGPT learns 4-20x faster than standard Transformers
• Hyperspherical representation improves stability and embedding separability
• Transformer layers act as optimization steps on a hypersphere
• Eigen learning rates control the impact of each block's updates
• nGPT handles longer contexts without modifying positional encodings

-----

**Results** 📊:

• 4x faster training for 1k context length
• 10x faster training for 4k context length
• 20x faster training for 8k context length
• Similar or better performance on downstream tasks with less training
• More stable performance when extrapolating to longer sequences

GaLZE_nWAAAyqN3.png


2/6
@rohanpaul_ai
🧠 How does nGPT differ from the standard Transformer architecture?

Key differences include:

- All vectors and matrices are normalized to unit norm along their embedding dimension
- Removal of LayerNorm/RMSNorm layers
- Introduction of learnable eigen learning rates to control hidden state updates
- Modification of attention and MLP block updates to operate on the hypersphere
- Addition of learnable scaling factors for logits, query/key vectors, and MLP intermediate states
- Removal of weight decay and learning rate warmup

GaLZoeSWsAAUWOD.png
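As a rough sketch of the core update (my paraphrase of the idea, not the official nGPT code), the hidden state stays on the unit hypersphere and a learnable per-dimension "eigen learning rate" controls how far each block's suggestion moves it:

```python
import torch

def l2_normalize(x, dim=-1, eps=1e-8):
    return x / (x.norm(dim=dim, keepdim=True) + eps)

def ngpt_style_update(h, block_out, alpha):
    # h: (batch, dim) unit-norm hidden state; block_out: raw attention/MLP output;
    # alpha: (dim,) learnable step sizes ("eigen learning rates").
    suggestion = l2_normalize(block_out)     # the block's proposed point on the sphere
    h = h + alpha * (suggestion - h)         # move part of the way toward it
    return l2_normalize(h)                   # retract back onto the hypersphere

h = l2_normalize(torch.randn(4, 512))
alpha = torch.full((512,), 0.05, requires_grad=True)
h = ngpt_style_update(h, torch.randn(4, 512), alpha)
print(h.norm(dim=-1))                        # ~1 everywhere
```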


3/6
@rohanpaul_ai
🤔 Potential drawbacks of nGPT

- Increased computational cost per step due to additional normalizations (though this is offset by faster convergence)
- Potential loss of expressiveness due to unit norm constraint (though experiments don't show this to be an issue)
- More hyperparameters to tune (eigen learning rates, scaling factors)
- Possible challenges in very large-scale training (not tested beyond 1B parameters)



4/6
@rohanpaul_ai
🔬 nGPT allows for some insights about Transformer internals:

- Interpreting Transformer layers as optimization steps on a hypersphere
- Viewing attention and MLP blocks as providing gradient information
- Understanding eigen learning rates as controlling the impact of each block's updates
- Analyzing the learned scaling factors and their effects on network behavior
- Examining condition numbers and rank of weight matrices across layers



5/6
@rohanpaul_ai
📚 [2410.01131] nGPT: Normalized Transformer with Representation Learning on the Hypersphere



6/6
@rohanpaul_ai
yep









1/2
@rohanpaul_ai
You can fine-tune a base language model to the BitNet 1.58b architecture. You don't necessarily have to train a model from scratch.

i.e. you can fine-tune an existing model to 1.58 bits!

---

Now BitNet is a special Transformer architecture that represents each parameter with only three values (-1, 0, 1), offering an extreme quantization of just 1.58 bits per parameter.

And until now, it required training a model from scratch.

However, pre-training a model from scratch is very costly. Recent experiments show you can instead fine-tune an existing model down to 1.58 bits!

There are recent examples of successfully fine-tuning a Llama 3 8B model using the BitNet architecture

------

And Huggingface Transformers has recently integrated the BitNet architecture, by introducing a new quantization method called "bitnet".

📌 This method involves replacing the standard Linear layers with specialized BitLinear layers that are compatible with the BitNet architecture, with appropriate dynamic quantization of activations, weight unpacking, and matrix multiplication.

📌 In the Hugging Face blog example, they mention that for BitNet they train in full precision but quantize the weights into ternary values on the fly, using symmetric per-tensor quantization.

📌 Activations are then quantized to a specified bit-width (8-bit, in this case) using absmax per-token quantization (for a comprehensive introduction to quantization methods, check out this post). This involves scaling the activations into the range [−128, 127] for an 8-bit bit-width.
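A rough sketch of that activation step (mine, following the description above rather than the exact Hugging Face code):

```python
import torch

def absmax_per_token_int8(x: torch.Tensor):
    # x: (tokens, features). Scale each token so its largest |value| maps to 127,
    # then round and clamp into the 8-bit range [-128, 127].
    scale = 127.0 / x.abs().max(dim=-1, keepdim=True).values.clamp(min=1e-5)
    x_q = (x * scale).round().clamp(-128, 127)
    return x_q, scale                          # keep the scale: x ≈ x_q / scale

x_q, scale = absmax_per_token_int8(torch.randn(8, 1024))
print(int(x_q.min()), int(x_q.max()))          # stays within [-128, 127]
```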

[Quoted tweet]
WOW. @Microsoft just open-sourced the code for one of "THE MOST" influential Paper of 2024 🔥

1-bit LLMs (e.g., BitNet b1.58).

Now you can run a 100B param models on local devices quantized with BitNet b1.58 on single CPU at 5-7 tokens/sec 🤯

The dream we have all been waiting for.

📊 Performance Improvements:

- Achieves speedups of 1.37x to 5.07x on ARM CPUs

- Larger models see greater performance gains

- Reduces energy consumption by 55.4% to 70.0% on ARM

- On x86 CPUs, speedups range from 2.37x to 6.17x


GaYFPuRXUAAlgRt.jpg

GaTA5DOWoAAW9i3.jpg


2/2
@rohanpaul_ai
Fine-tuning LLMs to 1.58bit: extreme quantization made easy






1/3
@rohanpaul_ai
Qwen Code Interpreter, with Qwen Code 2.5 & WebLLM

Running locally on your browser

A cool @huggingface space showcasing the power of open-source models and WebLLM

-----

WebLLM is a high-performance, in-browser language model inference engine that leverages WebGPU for hardware acceleration, enabling powerful LLM operations directly within web browsers without server-side processing.



GaX-XAbWQAA1ruD.jpg


2/3
@rohanpaul_ai
Qwen 2.5 Code Interpreter - a Hugging Face Space by cfahlgren1



3/3
@RaphLeclercAI
Running WebLLM locally in a browser, no server-side needed? That's quite impressive. Have you considered running it in environments with limited internet connectivity for even more impact?






1/3
@rohanpaul_ai
Divergent CoT (DCoT): requires models to compare multiple reasoning chains before generating a solution in a single inference step.

DCoT enhances LLM reasoning by generating multiple chains, enabling self-correction and improving performance across scales.

With this, complex reasoning methods can be encoded into LLMs through appropriate instruction tuning.

**Original Problem** 🤔:

LLMs struggle with complex reasoning tasks. Existing Chain of Thought (CoT) methods generate single reasoning chains, limiting exploration of diverse solutions.

-----

**Solution in this Paper** 🧠:

• Generates multiple reasoning chains in one inference step
• Compares chains before selecting final answer
• Instruction fine-tunes models on DCoT datasets
• Enables smaller models to benefit from complex reasoning

-----

**Key Insights from this Paper** 💡:

• DCoT improves performance across model sizes (1.3B to 70B parameters)
• Enables self-correction without explicit training
• Generalizes to unseen tasks
• Compatible with existing CoT extensions

-----

**Results** 📊:

• DCoT consistently outperforms CoT baseline across tasks and model scales
• Performance gains up to 7.59 points on BGQA for Phi 2
• Improves on unseen tasks: 5+ points on AQuA and SVAMP (Phi 2)
• Enables self-correction: 45% of corrected cases show different reasoning in second chain
• Combines with self-consistency for further gains



GaX5l-9W0AAjRvY.png


2/3
@rohanpaul_ai
🔬 How Divergent Chain of Thought (DCoT) differs from standard Chain of Thought (CoT)?

DCoT requires LLMs to generate multiple reasoning chains before producing a solution in a single inference step.

Unlike standard CoT which generates a single reasoning chain, DCoT generates multiple divergent chains and compares them before selecting a final answer.

This allows the model to explore different reasoning paths simultaneously.



GaX7FYSWMAAn9lJ.png
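To illustrate the format (a hypothetical template; the paper's actual one may differ), a DCoT-style prompt asks for several divergent chains in a single generation, then a comparison and one final answer:

```python
def build_dcot_prompt(question: str, num_chains: int = 2) -> str:
    # Hypothetical DCoT-style prompt: multiple chains, then compare, then answer.
    chains = "\n".join(
        f"[Chain {i}] <reason step by step, independently of the other chains>"
        for i in range(1, num_chains + 1)
    )
    return (
        f"Question: {question}\n"
        f"Produce {num_chains} different reasoning chains, compare them, "
        f"then give one final answer.\n{chains}\n[Comparison] ...\n[Answer] ..."
    )

print(build_dcot_prompt("A train covers 120 km in 1.5 hours. What is its average speed?"))
```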


3/3
@LaurenceBrem
Is this similar to the tree of thought method? it sounds like it






1/2
@rohanpaul_ai
Chain-of-thought (CoT) via prompting is NOT ALWAYS needed for eliciting reasoning capabilities from large language models (LLMs).

CoT excels in math and logic but underperforms in broader language tasks. So selective use of CoT can optimize performance without incurring high inference costs.

**Original Problem** 🤔:

Chain-of-Thought (CoT) prompting is widely used to enhance reasoning in LLMs. However, its effectiveness across different task types is unclear.

-----

**Solution in this Paper** 🛠️:

- Conducted a meta-analysis of over 100 papers and evaluated 20 datasets across 14 models.
- Focused on separating planning and execution stages in problem-solving.
- Compared CoT performance against tool-augmented LLMs.
- Suggested selective application of CoT to reduce inference costs.

-----

**Key Insights from this Paper** 💡:

- CoT significantly improves tasks involving math and logic.
- Symbolic reasoning benefits most from CoT, especially in execution.
- Tool augmentation outperforms CoT in symbolic tasks.
- CoT's utility is limited for non-symbolic reasoning tasks.

-----

**Results** 📊:

- Math and symbolic reasoning tasks showed substantial improvements with CoT.
- Non-symbolic tasks saw minimal gains.
- On MMLU, CoT's benefit was mostly for questions involving symbolic operations like equations.



GaX4PO4WUAALmX5.png


2/2
@rohanpaul_ai
CoT is particularly beneficial for tasks requiring symbolic execution, such as solving equations or logical puzzles. It helps more in the execution phase rather than the planning phase of problem-solving.

📚 https://arxiv.org/pdf/2409.12183





1/1
@rohanpaul_ai
Existing retrieval-augmented generation (RAG) methods struggle with knowledge-intensive reasoning tasks due to scattered information across documents.

After GraphRAG now the kid on the block is StructRAG!

Imagine asking your AI assistant a complex question that requires piecing together information from multiple sources.

1. StructRAG first identifies the best way to structure the knowledge for the specific task, such as a table, graph, or tree.

2. It then reconstructs the original documents into this structured format, making it easier to see connections and relationships between pieces of information.

3. Finally, it uses this structured knowledge to infer the answer to the original question.
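A hypothetical sketch of that three-stage flow (function names and prompts are mine, not from the StructRAG code):

```python
def struct_rag_answer(question: str, documents: list[str], llm) -> str:
    # 1. Hybrid structure router: pick the knowledge format best suited to the task.
    structure = llm(f"Which structure (table / graph / tree / chunks) best fits: {question}")
    # 2. Scattered knowledge structurizer: rebuild the documents in that format.
    structured = llm(f"Convert these documents into a {structure}:\n" + "\n\n".join(documents))
    # 3. Structured knowledge utilizer: decompose the question and infer the answer.
    return llm(f"Using this {structure}:\n{structured}\n\nAnswer step by step: {question}")

# Dummy stand-in so the sketch runs; swap in a real model call.
echo_llm = lambda prompt: f"<model output for: {prompt[:50]}...>"
print(struct_rag_answer("Which company grew fastest?", ["report A ...", "report B ..."], echo_llm))
```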

**Key Insights from this Paper** 💡:

• Structured knowledge in optimal format enhances LLM reasoning
• Hybrid information structurization outperforms fixed structure types
• DPO training improves structure type selection accuracy

------

Generated this podcast with Google's Illuminate.

[Quoted tweet]
After GraphRAG now the kid on the block is StructRAG!

Imagine asking your AI assistant a complex question that requires piecing together information from multiple sources.

1. StructRAG first identifies the best way to structure the knowledge for the specific task, such as a table, graph, or tree.

2. It then reconstructs the original documents into this structured format, making it easier to see connections and relationships between pieces of information.

3. Finally, it uses this structured knowledge to infer the answer to the original question.

**Original Problem** 🔍:

Existing retrieval-augmented generation (RAG) methods struggle with knowledge-intensive reasoning tasks due to scattered information across documents.

-----

**Solution in this Paper** 🛠️:

• StructRAG framework:
- Hybrid structure router selects optimal structure type
- Scattered knowledge structurizer converts documents into structured knowledge
- Structured knowledge utilizer decomposes questions and infers answers
• DPO-based training for hybrid structure router
• Synthesizing-simulating-judging method for constructing preference pairs

-----

**Key Insights from this Paper** 💡:

• Structured knowledge in optimal format enhances LLM reasoning
• Hybrid information structurization outperforms fixed structure types
• DPO training improves structure type selection accuracy

-----

**Results** 📊:

• StructRAG achieves state-of-the-art performance on knowledge-intensive tasks
• Outperforms baselines on Loong benchmark and Podcast Transcripts
• Performance improvement increases with task complexity
• Operates significantly faster than Graph RAG methods


GaNs_esWYAAvbTF.png


https://video-t-2.twimg.com/ext_tw_...81/pu/vid/avc1/1080x1080/6tozeNiODbH7Dn_n.mp4



1/2
@rohanpaul_ai
Sonar image synthesis faces challenges in data scarcity, quality, and diversity. Traditional methods rely on costly data collection, limiting research and applications in underwater exploration.

This Paper brings in GPT-prompted sonar image synthesis: A new frontier in underwater data generation.

**Solution in this Paper** 🛠️:

• Synth-SONAR framework leverages dual diffusion models and GPT prompting

• Creates large dataset by combining real, simulated, and AI-generated images

• Incorporates GPT and vision-language models for improved text-to-image synthesis

• Applies style injection techniques to enhance image diversity

[Quoted tweet]
GPT-prompted sonar image synthesis: A new frontier in underwater data generation.

**Original Problem** 🔍:

Sonar image synthesis faces challenges in data scarcity, quality, and diversity. Traditional methods rely on costly data collection, limiting research and applications in underwater exploration.

-----

**Solution in this Paper** 🛠️:

• Synth-SONAR framework leverages dual diffusion models and GPT prompting
• Creates large dataset by combining real, simulated, and AI-generated images
• Uses two-stage image generation: coarse and fine-grained
• Incorporates GPT and vision-language models for improved text-to-image synthesis
• Applies style injection techniques to enhance image diversity

-----

**Key Insights from this Paper** 💡:

• First application of GPT-prompting in sonar imagery generation
• Dual-stage diffusion model hierarchy enhances image quality and diversity
• Integration of language models bridges gap between text descriptions and sonar image generation
• Style injection with attention mechanism improves feature separation in generated images

-----

**Results** 📊:

• Outperforms state-of-the-art models in image quality metrics (SSIM: 0.381, PSNR: 12.730, FID: 3.8)
• Achieves up to 97% accuracy in sonar image classification when combining real and synthetic data
• Generates high-quality synthetic sonar images with enhanced diversity and realism
• Enables controlled and interpretable sonar image synthesis through text prompts


GaNjl4BWcAEtYo3.png


https://video.twimg.com/ext_tw_video/1848106451484409856/pu/vid/avc1/1080x1080/6XQ2fm4LO2Oywa98.mp4

2/2
@techredner
Fascinating approach to sonar image synthesis. The use of dual diffusion models and GPT prompting is particularly innovative, enabling high-quality synthetic images with enhanced diversity and realism. Great work!










1/9
@rohanpaul_ai
Now that @Microsoft open-sourced the code for one of THE CLASSIC Papers of 2024, I am revisiting the MASTERPIECE.

📚 "The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits"

BitNet b1.58 70B was 4.1 times faster and delivered 8.9 times higher throughput than the corresponding FP16 LLaMA.

📌 Requires almost no multiplication operations for matrix multiplication and can be highly optimized.

📌 BitNet b1.58 is a 1-bit LLM, where every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.

The paper introduces a 1-bit LLM variant called BitNet b1.58, where every parameter is ternary, taking on values of {-1, 0, 1}. It adds an additional value of 0 to the original 1-bit BitNet, resulting in 1.58 bits in the binary system.

📌 The term "1.58 bits" refers to the information content of each parameter.

Since each parameter can take one of three possible values (-1, 0, 1), the information content is log base 2 of 3, i.e. log2(3) ≈ 1.5849625 bits. (Actually decoding data packed at that density takes a lot of math.)

----

Here all weight values are ternary, taking on values {-1, 0, 1}.

Its quantization function is absmean, in which the weights are first scaled by their average absolute value and then rounded to the nearest integer in {-1, 0, 1}.

It is an efficient extension of 1-bit BitNet by including 0 in model parameters.

So BitNet b1.58 is based upon the BitNet architecture (it replaces `nn.Linear` with `BitLinear`).

It is highly optimized as it removes floating point multiplication overhead, involving only integer addition (INT-8), and efficiently loads parameters from DRAM.

[Quoted tweet]
WOW. @Microsoft just open-sourced the code for one of "THE MOST" influential Paper of 2024 🔥

1-bit LLMs (e.g., BitNet b1.58).

Now you can run a 100B param models on local devices quantized with BitNet b1.58 on single CPU at 5-7 tokens/sec 🤯

The dream we have all been waiting for.

📊 Performance Improvements:

- Achieves speedups of 1.37x to 5.07x on ARM CPUs

- Larger models see greater performance gains

- Reduces energy consumption by 55.4% to 70.0% on ARM

- On x86 CPUs, speedups range from 2.37x to 6.17x


GaWiZ_8XAAAmSPp.jpg

GaTA5DOWoAAW9i3.jpg


2/9
@rohanpaul_ai
How BitNet b1.58 requires almost no multiplication operations for matrix multiplication:

🧮 Traditional matrix multiplication:

In standard neural networks using floating-point weights, matrix multiplication involves many floating-point multiplications and additions. Each element of the output is computed by multiplying corresponding elements from the input vector and weight matrix, then summing these products.

🔢 BitNet b1.58 approach:

With BitNet b1.58, the weights are constrained to {-1, 0, 1}. This fundamentally changes the nature of the matrix multiplication:

- Multiplication by 1: Simply keeps the input value as-is

- Multiplication by -1: Just flips the sign of the input value

- Multiplication by 0: Results in zero, effectively filtering out that input

So, instead of actual multiplications, the operations become:
- Sign flipping (for -1)
- Passing through (for 1)
- Zeroing out (for 0)

These operations are much simpler than floating-point multiplication. The bulk of the computation then becomes integer addition to sum up the resulting values.

💻 Why this is highly optimizable:

- Bit-level operations: Sign flipping and zeroing can be implemented as fast bit-level operations

- SIMD friendly: These simple operations are ideal for SIMD (Single Instruction, Multiple Data) parallelization

- Reduced memory bandwidth: The 1.58-bit weights require much less memory transfer, often a bottleneck in neural network computations

- Specialized hardware potential: This simplified computation model opens doors for highly efficient custom hardware designs

This approach essentially transforms complex floating-point matrix multiplications into a series of simple integer additions and bit manipulations, leading to significant performance and energy efficiency gains.



GaWir8fW0AA3ZpJ.jpg
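A toy illustration (mine) of the point above: with ternary weights, each output element is just a sum of kept inputs minus a sum of sign-flipped inputs, with zero-weight inputs skipped, so no multiplications are needed.

```python
import numpy as np

def ternary_matvec(W_ternary: np.ndarray, x: np.ndarray) -> np.ndarray:
    # W_ternary: (out, in) with entries in {-1, 0, +1}; x: (in,)
    out = np.zeros(W_ternary.shape[0], dtype=x.dtype)
    for i, row in enumerate(W_ternary):
        out[i] = x[row == 1].sum() - x[row == -1].sum()  # additions only
    return out

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(4, 8))                 # ternary weights
x = rng.standard_normal(8)
print(np.allclose(ternary_matvec(W, x), W @ x))       # True: same result, no multiplies
```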


3/9
@rohanpaul_ai
🧠 Let's break down the quantization process in BitNet b1.58 in simple terms:

1️⃣ Starting point:
We have a regular neural network with weights that can be any decimal number.

2️⃣ Goal:
We want to convert all these weights to only three possible values: -1, 0, or +1.

3️⃣ The process:

🔹 Step 1: Find the average:
First, we calculate the average of all the absolute values (ignoring negative signs) of the weights in the matrix. Let's call this average γ.

🔹 Step 2: Scaling:
We divide all the weights by this average γ. This scales the weights so they're generally closer to the range of -1 to +1.

🔹 Step 3: Rounding:
After scaling, we round each weight to the nearest value among -1, 0, and +1.

- If it's closer to -1, it becomes -1
- If it's closer to 0, it becomes 0
- If it's closer to +1, it becomes +1

4️⃣ Result:

After this process, all weights in the network are now either -1, 0, or +1.

🎯 Why do this?

This method allows us to drastically simplify the neural network while trying to keep its overall behavior similar to the original. It makes computations much faster and more efficient, especially on certain types of hardware.

Think of it like simplifying a detailed color image into just black, white, and gray. You lose some detail, but the main features are still there, and it becomes much easier to process.



GaWlzNzXoAAms7q.png
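The three steps above in code (a small sketch of absmean rounding, mine, not the paper's implementation):

```python
import torch

def absmean_ternarize(W: torch.Tensor):
    gamma = W.abs().mean().clamp(min=1e-5)        # Step 1: average absolute value
    W_scaled = W / gamma                          # Step 2: scale toward the [-1, 1] range
    W_ternary = W_scaled.round().clamp(-1, 1)     # Step 3: round to the nearest of {-1, 0, +1}
    return W_ternary, gamma                       # keep gamma to rescale layer outputs

W_t, gamma = absmean_ternarize(torch.randn(256, 256))
print(W_t.unique())                               # tensor([-1., 0., 1.])
```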


4/9
@rohanpaul_ai
🔍 Comparison of BitNet b1.58 across different model sizes:

📊 Left graph (Latency):

- Shows decoding speed (lower is better)
- BitNet b1.58 is consistently faster
- Gap widens as model size increases
- At 70B parameters, BitNet is 4.10x faster

📉 Right graph (Memory):

- Shows memory usage (lower is better)
- BitNet b1.58 uses significantly less memory
- Difference grows with model size
- At 70B, BitNet uses 7.16x less memory

📈 Table (Throughput):
- For 70B models:
- BitNet handles 11x larger batch size
- Processes 8.9x more tokens per second



GaWmxZSXQAA3qxD.png


5/9
@rohanpaul_ai
bitnet.cpp can run a 100B BitNet b1.58 model on a single CPU, achieving speeds comparable to human reading (5-7 tokens per second), significantly enhancing the potential for running LLMs on local devices.

GitHub - microsoft/BitNet: Official inference framework for 1-bit LLMs



GaWvzOGW0AAS0cC.jpg


6/9
@rohanpaul_ai
Energy consumption of BitNet b1.58 compared to LLaMA LLM at 7nm process nodes.



GaWnJjLWkAAZGFZ.png


7/9
@Astro_Erik
this is truly insane, a fractional dimension of bits XD



8/9
@DanielLSainz
is there any realistic way to make 1 bit quantization work?



9/9
@wynrdynr69r
Shame you didn't have more black representation on the team. Just a bunch of rice munchers.





1/3
@rohanpaul_ai
Current multimodal large language models (MLLMs) lack personalization capabilities, struggling to conduct dialogues targeting specific individuals. This limitation hinders their application in personalized settings like mobile visual assistants or domestic robots.

Personalized Visual Instruction Tuning (PVIT) framework empowers MLLMs to recognize individuals and conduct personalized conversations effectively.

**Solution in this Paper** 🛠️:

• Represents individuals as multimodal prefix pairs (personal image, personal introduction)
• Uses personalized wrapper tokens to eliminate ambiguity
• Develops an automatic pipeline for generating personalized training data
• Leverages visual experts, image generation models, and MLLMs
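A hypothetical illustration of the prefix idea (token names and template are mine; the paper's actual format may differ):

```python
def build_personalized_prompt(individuals, question, query_image="<query_image>"):
    # individuals: list of (image_placeholder, introduction) pairs forming the multimodal prefix.
    prefix = "".join(
        f"<person_{i}>{img} {intro}</person_{i}>"   # wrapper tokens remove ambiguity
        for i, (img, intro) in enumerate(individuals, start=1)
    )
    return f"{prefix}{query_image}\n{question}"

print(build_personalized_prompt(
    [("<img_1>", "This is Alice, my sister."), ("<img_2>", "This is Bob, my neighbor.")],
    "Who is holding the coffee cup in this photo?",
))
```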

[Quoted tweet]
Personalized Visual Instruction Tuning (PVIT) framework empowers MLLMs to recognize individuals and conduct personalized conversations effectively.

**Original Problem** 🔍:

Current multimodal large language models (MLLMs) lack personalization capabilities, struggling to conduct dialogues targeting specific individuals. This limitation hinders their application in personalized settings like mobile visual assistants or domestic robots.

-----

**Solution in this Paper** 🛠️:

• Represents individuals as multimodal prefix pairs (personal image, personal introduction)
• Uses personalized wrapper tokens to eliminate ambiguity
• Develops an automatic pipeline for generating personalized training data
• Leverages visual experts, image generation models, and MLLMs

-----

**Key Insights from this Paper** 💡:

• PVIT enables MLLMs to identify target individuals and engage in personalized dialogues
• Utilizes in-context learning capability of MLLMs for generalization
• Introduces P-Bench, a benchmark for evaluating personalized potential of MLLMs
• Addresses "face blindness" limitation of current MLLMs

-----

**Results** 📊:

• P-LLaVA (PVIT-tuned LLaVA) outperforms state-of-the-art MLLMs across various question types in P-Bench
• Achieves 96.69% accuracy on answerable tasks (vs 63.13% for next best)
• Demonstrates 99.72% accuracy on unanswerable queries (vs 31.49% for next best)
• Shows strong performance even with complex scenarios and multiple individuals


GaLrQmyXQAAjkVp.png


https://video.twimg.com/ext_tw_video/1848107381827276800/pu/vid/avc1/1080x1080/EiQU_CALe4R5T0sT.mp4

2/3
@neosmithai
I love how PVIT tackles the "face blindness" of MLLMs. Its potential in mobile assistants and domestic robots is huge. Can't wait to see real-world applications of this tech.



3/3
@RibbeFelipe
@Ugo_alves take a look!






1/2
Depth Any Video - Produce high-resolution depth inference 🔥🔥



2/2
Depth Any Video

> Introduces a scalable synthetic data pipeline with 40,000 video clips from diverse games
> Easily handles various video lengths and frame rates while producing high-res depth inference
> Achieves superior spatial accuracy and temporal consistency over sota

Gradio app on @huggingface Spaces: Depth Any Video - a Hugging Face Space by hhyangcs










1/11
@dihuang52453419
We just released Depth Any Video: a new model for high-fidelity & consistent video depth estimation.

Large-scale synthetic data training makes the model robust for various scenarios in our use cases.

Paper: [2410.10815] Depth Any Video with Scalable Synthetic Data
Website: Depth Any Video



https://video.twimg.com/ext_tw_video/1846040483283587072/pu/vid/avc1/1280x720/uJdo3vLvbd4830Qx.mp4

2/11
@dihuang52453419
Also, code repo: GitHub - Nightmare-n/DepthAnyVideo: Depth Any Video with Scalable Synthetic Data

Will release soon.



3/11
@JaidevShriram
Cool results! Is there any plan to release the dataset or share more information about it?



4/11
@dihuang52453419
I’m not sure about the dataset release… Unfortunately, It’s up to the lab’s decision, not mine. 🥲 I'll try my best.



5/11
@RedmondAI
I still dont see the weights on Huggingface unfortunately.



6/11
@dihuang52453419
HF model will be available in one week



7/11
@macrodotcom
Check out this paper in Macro! Ask questions and leverage AI for free:
Macro



8/11
@LeoSuppya
I funking love thisssssssss. Please release demo on Hugging face for us to use. Thanks <3



9/11
@daniel_lichy
Cool work! Can you show some unprojected point clouds? It is hard to get a sense of how good the model is from just the depth maps.



10/11
@synthical_ai
Dark mode for this paper for those who read at night 🌚 Depth Any Video with Scalable Synthetic Data



11/11
@GoatstackAI
AI Summary: The paper introduces 'Depth Any Video', a novel model for video depth estimation that addresses the scarcity of scalable ground truth data. It features a synthetic data pipeline that generates 40...
Depth Any Video with Scalable Synthetic Data



GaMR1cWbUAExLD2.png







1/2
Depth Any Video does really well on anime too!

[Quoted tweet]
Depth Any Video - Produce high-resolution depth inference 🔥🔥


2/2
Depth Any Video app
Depth Any Video - a Hugging Face Space by hhyangcs












1/11
@AnthropicAI
Introducing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. We’re also introducing a new capability in beta: computer use.

Developers can now direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking, and typing text.



GagIEBuWcAAkHHL.png


2/11
@AnthropicAI
The new Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta.

While groundbreaking, computer use is still experimental—at times error-prone. We're releasing it early for feedback from developers.



3/11
@AnthropicAI
We've built an API that allows Claude to perceive and interact with computer interfaces.

This API enables Claude to translate prompts into computer commands. Developers can use it to automate repetitive tasks, conduct testing and QA, and perform open-ended research.



4/11
@AnthropicAI
We're trying something fundamentally new.

Instead of making specific tools to help Claude complete individual tasks, we're teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people.



5/11
@AnthropicAI
Claude 3.5 Sonnet's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges. So we encourage exploration with low-risk tasks.

We expect this to rapidly improve in the coming months.



6/11
@AnthropicAI
Even while recording these demos, we encountered some amusing moments. In one, Claude accidentally stopped a long-running screen recording, causing all footage to be lost.

Later, Claude took a break from our coding demo and began to peruse photos of Yellowstone National Park.



7/11
@AnthropicAI
Beyond computer use, the new Claude 3.5 Sonnet delivers significant gains in coding—an area where it already led the field.

Sonnet scores higher on SWE-bench Verified than all available models—including reasoning models like OpenAI o1-preview and specialized agentic systems.



GagIx16XkAAKmoL.png


8/11
@AnthropicAI
Claude 3.5 Haiku is the next generation of our fastest model.

Haiku now outperforms many state-of-the-art models on coding tasks—including the original Claude 3.5 Sonnet and GPT-4o—at the same cost as before.

The new Claude 3.5 Haiku will be released later this month.



GagIy5AWIAA7NeW.png


9/11
@AnthropicAI
We believe these developments will open up new possibilities for how you work with Claude, and we look forward to seeing what you'll create.

Read the updates in full: Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku



10/11
@HenrikBirkeland
Soo.. not all in, yet.



GagQShaW8AA5w3y.jpg


11/11
@BraydonDymm
Comparison to June release:



GagWWJfXAAESu9s.jpg







1/6
@ArtificialAnlys
Anthropic’s Claude 3.5 Sonnet leapfrogs GPT-4o, takes back the frontier and extends its lead in coding

Our independent quality evals of @AnthropicAI's Claude 3.5 Sonnet (Oct 2024) confirm a 3 point improvement in Artificial Analysis Quality Index vs. the original release in June. Improvement is reflected across evals and particularly in coding and math capabilities.

This makes Claude 3.5 Sonnet (Oct 2024) the top scoring model that does not require the generation of reasoning tokens before beginning to generate useful output (ie. excluding OpenAI’s o1 models).

With no apparent regressions and no changes to pricing or speed, we generally recommend an immediate upgrade from the earlier version of Claude 3.5 Sonnet.

Maybe Claude 3.5 Sonnet (Oct 2024) can suggest next time to increment the version number - 3.6?

See below tweets for further analysis 👇



Gah57qgbEAcNa5D.jpg


2/6
@ArtificialAnlys
While Claude 3.5 Sonnet (Oct 2024) has achieved higher scores across all evals, improvement is particularly reflected in its math and coding abilities.

Our Quality Index includes MMLU, GPQA, MATH and HumanEval evals.



GaiPB7YasAAi4OU.png

GaiPFu8bEAYh8Ne.png

GaiPISqbEAAP-9i.png

GaiPMrFbEAADTi_.png


3/6
@ArtificialAnlys
Anthropic has kept prices the same for Sonnet since the original Claude 3 Sonnet launch in March ($3/$15 in/out per million tokens).

This means Claude 3.5 Sonnet remains slightly more expensive than GPT-4o and Gemini 1.5 Pro.



GaiRGJ-bEAAEKoZ.jpg


4/6
@ArtificialAnlys
Link to our analysis:

🔗 https://artificialanalysis.ai/models/claude-35-sonnet



5/6
@alby13
how do we think o1-full/o1-large is going to be?



6/6
@AntDX316
We don't need a model name change.

Updates with the same model name will do.






Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku​


Oct 22, 2024 · 5 min read

An illustration of Claude navigating a computer cursor


Today, we’re announcing an upgraded Claude 3.5 Sonnet, and a new model, Claude 3.5 Haiku. The upgraded Claude 3.5 Sonnet delivers across-the-board improvements over its predecessor, with particularly significant gains in coding—an area where it already led the field. Claude 3.5 Haiku matches the performance of Claude 3 Opus, our prior largest model, on many evaluations for the same cost and similar speed to the previous generation of Haiku.

We’re also introducing a groundbreaking new capability in public beta: computer use. Available today on the API, developers can direct Claude to use computers the way people do—by looking at a screen, moving a cursor, clicking buttons, and typing text. Claude 3.5 Sonnet is the first frontier AI model to offer computer use in public beta. At this stage, it is still experimental—at times cumbersome and error-prone. We're releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.

Asana, Canva, Cognition, DoorDash, Replit, and The Browser Company have already begun to explore these possibilities, carrying out tasks that require dozens, and sometimes even hundreds, of steps to complete. For example, Replit is using Claude 3.5 Sonnet's capabilities with computer use and UI navigation to develop a key feature that evaluates apps as they’re being built for their Replit Agent product.



The upgraded Claude 3.5 Sonnet is now available for all users. Starting today, developers can build with the computer use beta on the Anthropic API, Amazon Bedrock, and Google Cloud’s Vertex AI. The new Claude 3.5 Haiku will be released later this month.


Claude 3.5 Sonnet: Industry-leading software engineering skills​


The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain. The new Claude 3.5 Sonnet offers these advancements at the same price and speed as its predecessor.

Early customer feedback suggests the upgraded Claude 3.5 Sonnet represents a significant leap for AI-powered coding. GitLab, which tested the model for DevSecOps tasks, found it delivered stronger reasoning (up to 10% across use cases) with no added latency, making it an ideal choice to power multi-step software development processes. Cognition uses the new Claude 3.5 Sonnet for autonomous AI evaluations, and experienced substantial improvements in coding, planning, and problem-solving compared to the previous version. The Browser Company, in using the model for automating web-based workflows, noted Claude 3.5 Sonnet outperformed every model they’ve tested before.

As part of our continued effort to partner with external experts, joint pre-deployment testing of the new Claude 3.5 Sonnet model was conducted by the US AI Safety Institute (US AISI) and the UK Safety Institute (UK AISI).

We also evaluated the upgraded Claude 3.5 Sonnet for catastrophic risks and found that the ASL-2 Standard, as outlined in our Responsible Scaling Policy, remains appropriate for this model.

Claude 3.5 Haiku: State-of-the-art meets affordability and speed​


Claude 3.5 Haiku is the next generation of our fastest model. For the same cost and similar speed to Claude 3 Haiku, Claude 3.5 Haiku improves across every skill set and surpasses even Claude 3 Opus, the largest model in our previous generation, on many intelligence benchmarks. Claude 3.5 Haiku is particularly strong on coding tasks. For example, it scores 40.6% on SWE-bench Verified, outperforming many agents using publicly available state-of-the-art models—including the original Claude 3.5 Sonnet and GPT-4o.

With low latency, improved instruction following, and more accurate tool use, Claude 3.5 Haiku is well suited for user-facing products, specialized sub-agent tasks, and generating personalized experiences from huge volumes of data—like purchase history, pricing, or inventory records.

Claude 3.5 Haiku will be made available later this month across our first-party API, Amazon Bedrock, and Google Cloud’s Vertex AI—initially as a text-only model and with image input to follow.

Teaching Claude to navigate computers, responsibly​


With computer use, we're trying something fundamentally new. Instead of making specific tools to help Claude complete individual tasks, we're teaching it general computer skills—allowing it to use a wide range of standard tools and software programs designed for people. Developers can use this nascent capability to automate repetitive processes, build and test software, and conduct open-ended tasks like research.

To make these general skills possible, we've built an API that allows Claude to perceive and interact with computer interfaces. Developers can integrate this API to enable Claude to translate instructions (e.g., “use data from my computer and online to fill out this form”) into computer commands (e.g. check a spreadsheet; move the cursor to open a web browser; navigate to the relevant web pages; fill out a form with the data from those pages; and so on). On OSWorld, which evaluates AI models' ability to use computers like people do, Claude 3.5 Sonnet scored 14.9% in the screenshot-only category—notably better than the next-best AI system's score of 7.8%. When afforded more steps to complete the task, Claude scored 22.0%.
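For concreteness, here is a rough sketch of what a computer-use request looked like in the launch documentation, using the Anthropic Python SDK; treat the model string, tool type, and beta flag below as assumptions to check against the current API reference.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
response = client.beta.messages.create(
    model="claude-3-5-sonnet-20241022",          # assumed launch model id
    max_tokens=1024,
    tools=[{
        "type": "computer_20241022",             # assumed beta tool type
        "name": "computer",
        "display_width_px": 1024,
        "display_height_px": 768,
        "display_number": 1,
    }],
    messages=[{"role": "user", "content": "Open a browser and check the weather in Tokyo."}],
    betas=["computer-use-2024-10-22"],           # assumed beta flag
)
print(response.content)
```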

While we expect this capability to improve rapidly in the coming months, Claude's current ability to use computers is imperfect. Some actions that people perform effortlessly—scrolling, dragging, zooming—currently present challenges for Claude and we encourage developers to begin exploration with low-risk tasks. Because computer use may provide a new vector for more familiar threats such as spam, misinformation, or fraud, we're taking a proactive approach to promote its safe deployment. We've developed new classifiers that can identify when computer use is being used and whether harm is occurring. You can read more about the research process behind this new skill, along with further discussion of safety measures, in our post on developing computer use.

Looking ahead​


Learning from the initial deployments of this technology, which is still in its earliest stages, will help us better understand both the potential and the implications of increasingly capable AI systems.

We’re excited for you to explore our new models and the public beta of computer use—and welcome you to share your feedback with us. We believe these developments will open up new possibilities for how you work with Claude, and we look forward to seeing what you'll create.
 