bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492











1/11
@TheTuringPost
The freshest AI/ML research of the week, part 1

▪️ New AI Model Gemini Experimental 1114 Debuts On Google AI Studio
▪️ CamemBERT 2.0
▪️ Qwen2.5-Coder Series
▪️ LLaVA-o1
▪️ LLMs Can Self-Improve In Long-Context Reasoning
▪️ Direct Preference Optimization Using Sparse Feature-Level Constraints
▪️ Cut Your Losses In Large-Vocabulary Language Models
▪️ SPARSING LAW

🧵



Gc3rrfyaIAAHVGk.png

Gc3rru2aAAMbYfq.jpg

Gc3rr8qawAALst2.png

Gc3rsMFaAAAnk0k.png


2/11
@TheTuringPost
1. New AI Model Gemini Experimental 1114 Debuts On Google AI Studio

Demonstrates strong reasoning skills with a 32k context window, outperforming competitors on benchmarks, despite slower problem-solving speed.

[Quoted tweet]
gemini-exp-1114…. available in Google AI Studio right now, enjoy : )

aistudio.google.com


3/11
@TheTuringPost
2. CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

Tackles concept drift in French NLP with improved tokenization, excelling in QA and domain-specific tasks like biomedical NER.

[2411.08868] CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
Open models: almanach (ALMAnaCH (Inria))



Gc3rt_UaAAMLg5D.jpg


4/11
@TheTuringPost
3. Qwen2.5-Coder Series: Powerful, Diverse, Practical

Excels in coding and multi-language repair tasks, rivaling GPT-4o in 40+ programming languages with open innovation for developers.

Qwen2.5-Coder Series: Powerful, Diverse, Practical.



Gc3ru_HbIAABGBu.jpg

Gc3rvSoaIAAyZ_M.jpg


5/11
@TheTuringPost
4. LLaVA-o1: Let Vision Language Models Reason Step-by-Step

Enhances multimodal reasoning through structured, multi-stage processes, achieving superior benchmark performance.

[2411.10440] LLaVA-o1: Let Vision Language Models Reason Step-by-Step

[Quoted tweet]
LLaVA-o1 is a smarter Vision-Language Model (VLM) that thinks step-by-step.

Instead of jumping to answers, it divides reasoning into 4 clear stages and uses stage-level beam search to generate multiple answers and select the best one for each stage.

Here's how it works:


GcqiVI5akAA6d13.png

GcqiVWUaIAAEsFG.jpg


6/11
@TheTuringPost
5. Large Language Models Can Self-Improve In Long-Context Reasoning

Uses self-improvement via ranking model outputs (SeaLong approach), improving performance in long-context reasoning tasks without external datasets.

[2411.08147] Large Language Models Can Self-Improve in Long-context Reasoning
GitHub: GitHub - SihengLi99/SEALONG: Large Language Models Can Self-Improve in Long-context Reasoning
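A minimal Python sketch of the ranking idea above, assuming a callable `llm` that samples one answer per call; the agreement-based (majority-vote) scoring is an illustrative stand-in, not SeaLong's exact recipe:

from collections import Counter

def build_self_improvement_pair(llm, question, context, n_samples=8):
    # Sample several answers to the same long-context question from the model itself.
    answers = [llm(f"{context}\n\nQ: {question}\nA:") for _ in range(n_samples)]
    # Rank answers by how often the model reproduces them (self-consistency).
    counts = Counter(answers)
    ranked = sorted(answers, key=lambda a: counts[a], reverse=True)
    # Best vs. worst outputs can then serve as a preference pair
    # (or the best one alone as a fine-tuning target), with no external dataset.
    return {"prompt": question, "chosen": ranked[0], "rejected": ranked[-1]}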



Gc3rxIfaAAE07uh.jpg


7/11
@TheTuringPost
6. Direct Preference Optimization Using Sparse Feature-Level Constraints

Introduces a method that improves alignment efficiency in LLMs and reduces computational overhead, using sparse autoencoders and feature-level constraints.

[2411.07618] Direct Preference Optimization Using Sparse Feature-Level Constraints



Gc3ryLCaAAQysYV.jpg


8/11
@TheTuringPost
7. Cut Your Losses In Large-Vocabulary Language Models

Proposes the Cut Cross-Entropy (CCE) method, which reduces memory use for large-scale training, enabling up to 10x larger batch sizes without sacrificing performance.

[2411.09009] Cut Your Losses in Large-Vocabulary Language Models
GitHub: GitHub - apple/ml-cross-entropy
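The gist of CCE is that the full [tokens x vocabulary] logits matrix never needs to be materialized: only the logit of the correct token plus a running log-sum-exp are required. The real implementation is a fused GPU kernel (see the repo above); the chunked PyTorch sketch below only illustrates why the loss fits in a bounded memory footprint:

import torch

def chunked_cross_entropy(hidden, classifier_weight, targets, chunk_size=8192):
    # hidden: [tokens, d], classifier_weight: [vocab, d], targets: [tokens] (int64)
    # Logit of the correct token only: one dot product per token.
    correct_logit = (hidden * classifier_weight[targets]).sum(dim=-1)
    # Accumulate log-sum-exp over vocabulary chunks so all logits never exist at once.
    lse = torch.full_like(correct_logit, float("-inf"))
    for start in range(0, classifier_weight.size(0), chunk_size):
        chunk_logits = hidden @ classifier_weight[start:start + chunk_size].T
        lse = torch.logaddexp(lse, torch.logsumexp(chunk_logits, dim=-1))
    # Cross-entropy = logsumexp(all logits) - logit of the correct token.
    return (lse - correct_logit).mean()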



Gc3rzJVa0AACC4-.jpg


9/11
@TheTuringPost
8. SPARSING LAW: Towards Large Language Models With Greater Activation Sparsity

Explores neuron sparsity in LLMs to enhance efficiency while preserving interpretability.

[2411.02335] Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
GitHub: GitHub - thunlp/SparsingLaw: The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity".



Gc3r0PCbgAAcj2o.jpg


10/11
@TheTuringPost
9. Find a complete list of the latest research papers in our free weekly digest: 🌁#76: Rethinking Scaling Laws (when plateau is actually a fork)



11/11
@TheTuringPost
10. Follow @TheTuringPost for more.

Like/repost the 1st post to support our work 🤍

Also, elevate your AI game with our free newsletter ↓
Turing Post

[Quoted tweet]
The freshest AI/ML research of the week, part 1

▪️ New AI Model Gemini Experimental 1114 Debuts On Google AI Studio
▪️ CamemBERT 2.0
▪️ Qwen2.5-Coder Series
▪️ LLaVA-o1
▪️ LLMs Can Self-Improve In Long-Context Reasoning
▪️ Direct Preference Optimization Using Sparse Feature-Level Constraints
▪️ Cut Your Losses In Large-Vocabulary Language Models
▪️ SPARSING LAW

🧵


Gc3rrfyaIAAHVGk.png

Gc3rru2aAAMbYfq.jpg

Gc3rr8qawAALst2.png




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196







1/3
@rohanpaul_ai
LLaVA-o1 teaches machines to think step-by-step like humans when analyzing images.

LLaVA-o1 introduces a novel approach to enhance Vision Language Models (VLMs) by implementing structured, multi-stage reasoning. This paper tackles the challenge of systematic reasoning in visual tasks by breaking down the process into distinct stages: summary, caption, reasoning, and conclusion.

-----

🤔 Original Problem:

Current VLMs struggle with systematic reasoning and often produce errors or hallucinated outputs during complex visual question-answering tasks. They lack structured thinking processes and tend to jump to conclusions without proper analysis.

-----

🛠️ Solution in this Paper:

→ LLaVA-o1 implements a 4-stage reasoning process with dedicated tags for each stage: summary, caption, reasoning, and conclusion.

→ The model uses supervised fine-tuning on a new LLaVA-o1-100k dataset, created using GPT-4o for structured reasoning annotations.

→ A stage-level beam search method generates multiple candidates at each reasoning stage, selecting the best one to continue.

→ Training is performed on a single node with 8 H100 GPUs, combining samples from both general VQA and science-targeted datasets.

-----

💡 Key Insights:

→ Structured reasoning stages help models organize thoughts before reaching conclusions

→ Special tags for each stage maintain clarity throughout the reasoning process

→ Stage-level beam search is more effective than sentence-level or best-of-N approaches
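A rough sketch of stage-level beam search, assuming hypothetical `generate_stage` and `score_stage` helpers (illustrative names, not the released LLaVA-o1 API): each of the four tagged stages is sampled several times and only the best candidate is kept before moving on.

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def stage_level_beam_search(model, image, question, n_candidates=4):
    # Build the answer one tagged stage at a time, keeping the best candidate per stage.
    context = f"Question: {question}"
    for stage in STAGES:
        # Sample several candidate continuations for this stage only (hypothetical API).
        candidates = [model.generate_stage(image, context, stage) for _ in range(n_candidates)]
        # Let a scorer (e.g. the model judging candidates) pick the best one for this stage.
        best = max(candidates, key=lambda c: model.score_stage(image, context, stage, c))
        context += f"\n<{stage}>{best}</{stage}>"
    return context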

-----

📊 Results:

→ Outperforms base model by 8.9% on multimodal reasoning benchmarks

→ Surpasses larger models including Gemini-1.5-pro and GPT-4o-mini

→ Stage-level beam search improves MMVet score from 60.3% to 62.9%



GdG0DzCagAMKeIY.jpg


2/3
@rohanpaul_ai
Paper Title: "LLaVA-o1: Let Vision Language Models Reason Step-by-Step"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1860466160707469312/pu/vid/avc1/1080x1080/mAeNIFuBt10AwrXP.mp4

3/3
@rohanpaul_ai
[2411.10440] LLaVA-o1: Let Vision Language Models Reason Step-by-Step







1/1
@jreuben1
LLaVA-o1 (Let Vision Language Models Reason Step-by-Step) introduces an inference-time stage-level beam search method, which enables effective inference-time scaling.



GdImGS0WYAA-PeF.jpg







1/10
@Gradio
LLaVA-o1 is the first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1!

🤯 The 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks.



GcvsSEiWcAAAmI0.jpg


2/10
@Gradio
LLaVA-o1

Stay tuned for the code and gradio app release.
GitHub - PKU-YuanGroup/LLaVA-o1



3/10
@NNaumovsky
@threadreaderapp unroll



4/10
@threadreaderapp
@NNaumovsky Namaste, please find the unroll here: Thread by @Gradio on Thread Reader App Enjoy :smile: 🤖



5/10
@CohorteAI
"LLaVA-o1’s success on multimodal benchmarks suggests it’s mastering the integration of vision and language. Could this pave the way for models capable of deeper real-world contextual understanding, like AR-enhanced assistants?



6/10
@hanul93
WOW



7/10
@arya_mukhlis354
amazing



8/10
@txhno
which image decoder does it use?



9/10
@matthaeus_win
I thought every model based on Llama 3 has to have 'Llama' in the name.. 👀



10/10
@wuwenjie1992


[Quoted tweet]
LLaVA-o1, released by Li Yuan's group at Peking University, is the first vision-language model capable of spontaneous, systematic reasoning, similar to GPT-o1!
⚙ The model first outlines the problem, explains the relevant information in the image, reasons step by step, and finally reaches a well-grounded conclusion.
🤯 The 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks.


Gc0ELI1bcAAceGq.png



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492


LLaVA-o1: Let Vision Language Models Reason Step-by-Step


  • Image-Text-to-Text
  • Image Feature Extraction
  • Visual Question Answering

Published 11/18/2024 by Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, Li Yuan


Overview


  • New approach called LLaVA-o1 improves visual reasoning in AI models
  • Implements step-by-step reasoning for analyzing images
  • Achieves state-of-the-art performance on visual reasoning benchmarks
  • Uses chain-of-thought prompting to break down complex visual tasks
  • Integrates with existing vision-language models



Plain English Explanation

LLaVA-o1 works like a careful detective examining a crime scene. Instead of jumping to conclusions, it breaks down what it sees in an image into smaller, manageable steps. This approach mirrors how humans naturally solve complex visual problems.

Just as we might count objects one by one or compare different parts of an image systematically, LLaVA-o1 follows a structured thinking process. This makes its reasoning more transparent and accurate compared to models that try to answer questions about images in one go.

The system shows particular strength in handling complex visual tasks like counting objects, comparing features, and understanding spatial relationships. Think of it as the difference between asking someone to solve a puzzle all at once versus guiding them through it piece by piece.


Key Findings

Visual reasoning capabilities improved significantly with step-by-step processing. The model achieved:

  • 15% improvement in accuracy on complex visual reasoning tasks
  • Better performance in counting and comparison tasks
  • More consistent and explainable results
  • Enhanced ability to handle multi-step visual problems


Technical Explanation


The chain-of-thought approach builds on existing vision-language models by adding structured reasoning steps. The system processes visual information through multiple stages:

  1. Initial visual feature extraction

  2. Sequential reasoning steps

  3. Final answer synthesis

The model architecture integrates visual encoders with language processing components, allowing for seamless communication between visual and textual understanding. This enables more sophisticated reasoning about visual content.


Critical Analysis


While LLaVA-o1 shows promising results, several limitations exist. The step-by-step reasoning can be computationally intensive, potentially limiting real-world applications. The model may also struggle with highly abstract or ambiguous visual scenarios.

The research could benefit from:

  • Broader testing across diverse visual domains
  • Evaluation of computational efficiency
  • Investigation of failure cases
  • Assessment of bias in visual reasoning


Conclusion

Smart vision language reasoners like LLaVA-o1 represent a significant step forward in AI visual understanding. The step-by-step approach offers a more transparent and reliable method for visual reasoning tasks. This advancement could impact applications from autonomous vehicles to medical imaging analysis, though practical implementation challenges remain to be addressed.


Full paper


Read original: arXiv:2411.10440
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492









1/12
@rohanpaul_ai
Type a sentence, get any sound - from talking cats to singing saxophones. Brilliant release by NVIDIA

✨ NVIDIA just unveiled Fugatto, a groundbreaking 2.5B parameter audio AI model that can generate and transform any combination of music, voices, and sounds using text prompts and audio inputs

Fugatto could ultimately allow developers and creators to bring sounds to life simply by inputting text prompts.

→ The model demonstrates unique capabilities like creating hybrid sounds (trumpet barking), changing accents/emotions in voices, and allowing fine-grained control over sound transitions - trained on millions of audio samples using 32 NVIDIA H100 GPUs

👨‍🔧 Architecture

Built as a foundational generative transformer model leveraging NVIDIA's previous work in speech modeling and audio understanding. The training process involved creating a specialized blended dataset containing millions of audio samples

→ ComposableART's Innovation in Audio Control

Introduces a novel technique allowing combinations of instructions that were only seen separately during training. Users can blend different audio attributes and control their intensity (a toy sketch of this blending idea follows this list).

→ Temporal Interpolation Capabilities

Enables generation of evolving soundscapes with precise control over transitions. Can create dynamic audio sequences like rainstorms fading into birdsong at dawn

→ Processes both text and audio inputs flexibly, enabling tasks like removing instruments from songs or modifying specific audio characteristics while preserving others

→ Shows capabilities beyond its training data, creating entirely new sound combinations through interaction between different trained abilities
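As referenced above, here is a toy sketch of the weighted-guidance composition idea behind ComposableART; the `model` call signature and the way the weights enter are assumptions for illustration, since NVIDIA has not released the code:

import torch

def composable_guidance(model, x_t, t, conds, weights):
    # Weighted classifier-free guidance over several instructions, e.g.
    # conds = ["sad feeling", "French accent"], weights = [0.7, 0.3].
    v_uncond = model(x_t, t, cond=None)            # unconditional prediction
    v = v_uncond.clone()
    for cond, w in zip(conds, weights):
        # Each instruction pushes the prediction by its own, user-chosen amount.
        v = v + w * (model(x_t, t, cond=cond) - v_uncond)
    return v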

🔍 Real-world Applications

→ Allows rapid prototyping of musical ideas, style experimentation, and real-time sound creation during studio sessions

→ Enables dynamic audio asset generation matching gameplay situations, reducing pre-recorded audio requirements

→ Can modify voice characteristics for language learning applications, allowing content delivery in familiar voices

@NVIDIAAIDev



https://video.twimg.com/ext_tw_video/1861123021983096837/pu/vid/avc1/1280x720/2cD4kuUZUpyj6qdc.mp4

2/12
@rohanpaul_ai
→ Creates a massive dataset (20M+ rows, ~330 years of audio) by combining multiple open source datasets and using LLMs to generate rich descriptions and instructions



GdQKr-oaoAElHQf.png


3/12
@rohanpaul_ai
A sample from Fugatto's official page.

Fugatto is a framework for audio synthesis and transformation given text instructions and optional audio inputs.

The framework includes the generative model Fugatto, a dataset creation technique that exploits relationships between audio and text, and a method for controlling and composing instructions, including from different models, called ComposeableART.



https://video.twimg.com/ext_tw_video/1861396412816334848/pu/vid/avc1/1026x514/DFHmk3iZoMYGS8fe.mp4

4/12
@rohanpaul_ai




GdQNdllbQAAP1tj.jpg


5/12
@rohanpaul_ai
→ Optimal Transport Conditional Flow Matching

Trains using OT-CFM objective with a T5-based transformer architecture and adaptive layer normalization
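For reference, a minimal conditional flow-matching loss in PyTorch; the OT variant additionally pairs noise and data samples via a minibatch optimal-transport coupling, which is omitted here, and the `model(xt, t, cond)` signature is assumed:

import torch

def cfm_loss(model, x1, cond):
    # x1: batch of target audio latents/features; x0: matching Gaussian noise.
    x0 = torch.randn_like(x1)
    t = torch.rand(x1.size(0), *([1] * (x1.dim() - 1)), device=x1.device)
    xt = (1 - t) * x0 + t * x1                     # point on the straight path x0 -> x1
    target_velocity = x1 - x0                      # d(xt)/dt along that path
    pred = model(xt, t.flatten(), cond)            # model predicts the velocity field
    return torch.mean((pred - target_velocity) ** 2)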



GdQLquBaoAAxZYy.png


6/12
@GuitarGeorge6
Where is it hosted?



7/12
@rohanpaul_ai
afaik they didn't announce when — or if — the tool will be widely available.



8/12
@rohanpaul_ai
Now Hear This: World’s Most Flexible Sound Machine Debuts

https://d1qx31qr3h6wln.cloudfront.net/publications/FUGATTO.pdf



9/12
@xJOSHUAJOSHUAx
is it open source?



10/12
@rohanpaul_ai
afaik they didn't announce when — or if — the tool will be widely available.



11/12
@hckinz
Wow, where is the huggingface space?



12/12
@rohanpaul_ai
not yet







1/1
@theaitechsuite
🗞️🗞️🗞️Nvidia introduces Fugatto, an AI-driven music editor that uniquely blends sounds—creating features like barking trumpets and meowing saxophones beyond its training data.

Read more: Nvidia's new AI music tool creates barking trumpets, meowing saxophones







1/1
@lRichBl
x-post:
Tech News Today 🚨

Nvidia unveils Fugatto: A suite of AI audio tools that's like a "Swiss Army Knife" for sound editing and creation. 🎶

Samsung Galaxy S25 Ultra leaks: Hands-on video reveals a sleek design and impressive camera setup. 📷

AI bias in hiring: Study shows AI overwhelmingly favors white and male candidates in resume screening. 😟

Google Chrome at risk? Regulators may force Google to sell its browser due to antitrust concerns. 🌐
#technology #news #AI #Nvidia #Samsung #Google







1/1
@technewsworld
Nvidia Reveals ‘Swiss Army Knife’ of AI Audio Tools: Fugatto...The new AI model can generate or transform any mix of music, voices, and sounds described with prompts using any combination of text and audio files. Nvidia Reveals 'Swiss Army Knife' of AI Audio Tools: Fugatto



GdVexhGXUAAK_sM.jpg














1/12
@AndrewCurran_
NVIDIA has built a 2.5 billion parameter audio model called Fugatto that generates music, voice, and sound from text and audio input. Sound inputs become completely mutable. It can change a piano line to a human voice singing or make 'a trumpet bark or a saxophone meow.'



GdPd5YaXsAAOLgf.jpg


2/12
@AndrewCurran_
Using a feature called temporal interpolation, Fugatto can 'create the sounds of a rainstorm moving through an area with crescendos of thunder that slowly fade into the distance. It also gives users fine-grained control over how the soundscape evolves'



GdPd8CGWMAAOSfq.png


3/12
@AndrewCurran_
YouTube:
https://invidious.poast.org/qj1Sp8He6e4?si=q9_b9ns1JYMZSbwI



4/12
@AndrewCurran_
Now Hear This: World’s Most Flexible Sound Machine Debuts



5/12
@AndrewCurran_




GdPhGIXWEAAJrGF.jpg


6/12
@AndrewCurran_
Great demo.



GdPh3YBXQAEpNJ_.jpg


7/12
@ericreator
wen



8/12
@AndrewCurran_
No release date yet.



9/12
@fkthefeed
Is there a repo for this?



10/12
@AndrewCurran_
No public release yet unfortunately, and not even a prospective date for one.



11/12
@AviSchiffmann
Can it do the opposite?



12/12
@AndrewCurran_
Yes, from the demo it seems to be omnidirectional, any sound to any other type of sound.




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492


Now Hear This: World’s Most Flexible Sound Machine Debuts


Using text and audio as inputs, a new generative AI model from NVIDIA can create any combination of music, voices and sounds.

November 25, 2024 by Richard Kerris

Fugatto


A team of generative AI researchers created a Swiss Army knife for sound, one that allows users to control the audio output simply using text.

While some AI models can compose a song or modify a voice, none have the dexterity of the new offering.

Called Fugatto (short for Foundational Generative Audio Transformer Opus 1), it generates or transforms any mix of music, voices and sounds described with prompts using any combination of text and audio files.

For example, it can create a music snippet based on a text prompt, remove or add instruments from an existing song, change the accent or emotion in a voice — even let people produce sounds never heard before.

“This thing is wild,” said Ido Zmishlany, a multi-platinum producer and songwriter — and cofounder of One Take Audio, a member of the NVIDIA Inception program for cutting-edge startups. “Sound is my inspiration. It’s what moves me to create music. The idea that I can create entirely new sounds on the fly in the studio is incredible.”


A Sound Grasp of Audio

“We wanted to create a model that understands and generates sound like humans do,” said Rafael Valle, a manager of applied audio research at NVIDIA and one of the dozen-plus people behind Fugatto, as well as an orchestral conductor and composer.

Supporting numerous audio generation and transformation tasks, Fugatto is the first foundational generative AI model that showcases emergent properties — capabilities that arise from the interaction of its various trained abilities — and the ability to combine free-form instructions.

“Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale,” Valle said.


A Sample Playlist of Use Cases


For example, music producers could use Fugatto to quickly prototype or edit an idea for a song, trying out different styles, voices and instruments. They could also add effects and enhance the overall audio quality of an existing track.

“The history of music is also a history of technology. The electric guitar gave the world rock and roll. When the sampler showed up, hip-hop was born,” said Zmishlany. “With AI, we’re writing the next chapter of music. We have a new instrument, a new tool for making music — and that’s super exciting.”

An ad agency could apply Fugatto to quickly target an existing campaign for multiple regions or situations, applying different accents and emotions to voiceovers.

Language learning tools could be personalized to use any voice a speaker chooses. Imagine an online course spoken in the voice of any family member or friend.

Video game developers could use the model to modify prerecorded assets in their title to fit the changing action as users play the game. Or, they could create new assets on the fly from text instructions and optional audio inputs.


Making a Joyful Noise

“One of the model’s capabilities we’re especially proud of is what we call the avocado chair,” said Valle, referring to a novel visual created by a generative AI model for imaging.

For instance, Fugatto can make a trumpet bark or a saxophone meow. Whatever users can describe, the model can create.

With fine-tuning and small amounts of singing data, researchers found it could handle tasks it was not pretrained on, like generating a high-quality singing voice from a text prompt.


Users Get Artistic Controls


Several capabilities add to Fugatto’s novelty.

During inference, the model uses a technique called ComposableART to combine instructions that were only seen separately during training. For example, a combination of prompts could ask for text spoken with a sad feeling in a French accent.

The model’s ability to interpolate between instructions gives users fine-grained control over text instructions, in this case the heaviness of the accent or the degree of sorrow.

“I wanted to let users combine attributes in a subjective or artistic way, selecting how much emphasis they put on each one,” said Rohan Badlani, an AI researcher who designed these aspects of the model.

“In my tests, the results were often surprising and made me feel a little bit like an artist, even though I’m a computer scientist,” said Badlani, who holds a master’s degree in computer science with a focus on AI from Stanford.

The model also generates sounds that change over time, a feature he calls temporal interpolation. It can, for instance, create the sounds of a rainstorm moving through an area with crescendos of thunder that slowly fade into the distance. It also gives users fine-grained control over how the soundscape evolves.

Plus, unlike most models, which can only recreate the training data they’ve been exposed to, Fugatto allows users to create soundscapes it’s never seen before, such as a thunderstorm easing into a dawn with the sound of birds singing.


A Look Under the Hood


Fugatto is a foundational generative transformer model that builds on the team’s prior work in areas such as speech modeling, audio vocoding and audio understanding.

The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs.

Fugatto was made by a diverse group of people from around the world, including India, Brazil, China, Jordan and South Korea. Their collaboration made Fugatto’s multi-accent and multilingual capabilities stronger.

One of the hardest parts of the effort was generating a blended dataset that contains millions of audio samples used for training. The team employed a multifaceted strategy to generate data and instructions that considerably expanded the range of tasks the model could perform, while achieving more accurate performance and enabling new tasks without requiring additional data.

They also scrutinized existing datasets to reveal new relationships among the data. The overall work spanned more than a year.

Valle remembers two moments when the team knew it was on to something. “The first time it generated music from a prompt, it blew our minds,” he said.

Later, the team demoed Fugatto responding to a prompt to create electronic music with dogs barking in time to the beat.

“When the group broke up with laughter, it really warmed my heart.”

Hear what Fugatto can do:

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492


1/2
@rohanpaul_ai
OuteTTS-0.2-500M, a 500M parameter text-to-speech model just released by @OuteAI .

Built on Qwen-2.5-0.5B, trained on over 5 billion audio prompt tokens with multilingual capabilities for English, Chinese, Japanese, and Korean.

→ The model offers improved voice cloning, natural speech synthesis, and enhanced prompt following accuracy compared to its previous version, utilizing audio prompts without architectural modifications.

→ The model feeds audio prompts directly into the Qwen-2.5-0.5B backbone

→ Training utilized three major datasets - Emilia-Dataset, LibriTTS-R, and Multilingual LibriSpeech, creating a diverse foundation for voice synthesis.

→ Voice Cloning Mechanics

Requires 10-15 second audio samples with accurate transcriptions. Context length of 4096 tokens enables ~54 seconds of audio generation.

⚙️ Technical Deep-Dive

→ Implements flash attention 2.0 and bfloat16 precision, showing careful consideration for inference speed and memory usage.

→ Context Window Management

Audio generation capacity decreases proportionally when a speaker profile is included, since the profile consumes part of the 4096-token context window.
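A minimal sketch of loading the Qwen-2.5-based backbone with the bfloat16 + flash-attention settings mentioned above, assuming the checkpoint loads as a standard Hugging Face causal LM; actual audio prompting and decoding go through the OuteTTS interface (edwko/OuteTTS on GitHub, linked later in the thread):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "OuteAI/OuteTTS-0.2-500M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,                    # bfloat16 precision as noted above
    attn_implementation="flash_attention_2",       # requires flash-attn and a supported GPU
)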



https://video.twimg.com/ext_tw_video/1861432382576119808/pu/vid/avc1/1920x1080/nEGoZpESwC2B2C8w.mp4

2/2
@rohanpaul_ai
OuteAI/OuteTTS-0.2-500M · Hugging Face










1/8
@AIWarper
On the topic of TTS from yesterday

Another new contender to try out OuteTTS v0.2 - 500M

(will run on a hamster wheel for all you 8gb enjoyors)

Multilingual - English, Chinese, Korean & Japanese ✅
Zero-shot voice cloning ✅



https://video.twimg.com/ext_tw_video/1861432103227006980/pu/vid/avc1/1280x720/ryHdWl08xirpqiO7.mp4

2/8
@AIWarper
OuteAI/OuteTTS-0.2-500M · Hugging Face



3/8
@DreamStarter_1
Have you seen anything similar to elevenlabs voice creation feature?
Cloning voices is fine but...I'd like to create new voices...I don't want any problems:smile:)



4/8
@AIWarper
Off the top of my head no I don't. Most offer just pretrained unique voices but the ability to create your own from scratch I am unaware of.

Interesting idea though..... surely it exists?



5/8
@AIWarper
GitHub - edwko/OuteTTS: Interface for OuteTTS models.



6/8
@ZAswanth
@OpenInterpreter



7/8
@Notregularuser2
This is getting better and better



8/8
@Notregularuser2
👀🐋










1/11
@reach_vb
Smol TTS keeps getting better! Introducing OuteTTS v0.2 - 500M parameters, multilingual with voice cloning! 🔥

> Multilingual - English, Chinese, Korean & Japanese
> Cross platform inference w/ llama.cpp
> Zero-shot voice cloning
> Trained on 5 Billion audio tokens
> Qwen 2.5 0.5B LLM backbone
> Trained via HF GPU grants

Model weights on the hub, you can even run this on a Raspberry Pi! Go run, inference now! 🐐



https://video.twimg.com/ext_tw_video/1861158412664373248/pu/vid/avc1/1280x720/jB_4nlzY1nPP3LWz.mp4

2/11
@reach_vb
Check out the model weights and inference code base here:

OuteAI/OuteTTS-0.2-500M · Hugging Face



3/11
@reach_vb
llama.cpp compatible GGUFs:

OuteAI/OuteTTS-0.2-500M-GGUF · Hugging Face



4/11
@reach_vb
OuteTTS GitHub:

GitHub - edwko/OuteTTS: Interface for OuteTTS models.



5/11
@umeshonai
This is improving so fast that I don't want to speak myself anymore. Just use this and get done 🤖



6/11
@0xKyon
this is very good!



7/11
@TheRobKennedy
Very nice 👌🏻



8/11
@TommyFalkowski
Just tested it out and the quality is very good. More importantly, the fact that you can generate speaker profiles is awesome! Will test it out some more and add it to my growing list of supported tts engines in my app 🤣



9/11
@JulienBlanchon
Pretty interested to know the overall cost of training



10/11
@thedigitaldr
Are you saying you can voice CLONE on a R-Pi? Is that what you're saying????



11/11
@Fronesis_ai
Thank you for your work and for sharing insights! 🙌
Advancements like OuteTTS v0.2 showcase the rapid evolution of AI and its potential to empower global communities. 🚀
The future of #AI is bright, and collaborative innovation is key to unlocking its full potential!




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492




1/4
@rohanpaul_ai
Genetic algorithms meet LLM reasoning for zero-shot prompt recovery, reverse-engineering prompts from outputs.

Reverse Prompt Engineering (RPE) reconstructs original prompts from just 5 LLM outputs without accessing model internals.

Making LLMs work backwards: from answers to questions.

Original Problem 🤔:

Inferring the original prompt from LLM outputs is challenging, especially in black-box settings where only text outputs are available. Previous methods require extensive resources (64+ outputs) and often need access to internal model parameters.

-----

Solution in this Paper 🛠️:

→ Introduces Reverse Prompt Engineering (RPE), a zero-shot method using the target LLM's reasoning to reconstruct prompts from just 5 outputs

→ Employs a three-stage approach: One Answer One Shot (RPE1A1S) for basic inference, Five Answers Inference (RPE5A5S) for enhanced accuracy using multiple responses, and the iterative RPE-GA described next

→ Implements RPE-GA, an iterative optimization inspired by genetic algorithms that progressively refines candidate prompts through multiple iterations

→ Uses ROUGE-1 scores and cosine similarity to evaluate and select the best candidate prompts
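A compact sketch of the RPE-GA loop, assuming a callable `llm` and a crude unigram-overlap score standing in for the ROUGE-1 / cosine-similarity fitness described above; the meta-prompts are illustrative, not the paper's:

def rpe_ga(llm, observed_outputs, n_candidates=8, generations=5):
    def unigram_overlap(a, b):
        # Crude ROUGE-1-style recall between a regenerated and an observed output.
        a_set, b_set = set(a.lower().split()), set(b.lower().split())
        return len(a_set & b_set) / max(len(b_set), 1)

    def fitness(prompt):
        # Regenerate outputs from the candidate prompt and compare to the observed ones.
        regenerated = [llm(prompt) for _ in observed_outputs]
        return sum(unigram_overlap(r, o) for r, o in zip(regenerated, observed_outputs))

    # Initial population: ask the LLM itself to guess the hidden instruction.
    seed = "Infer the instruction that produced these outputs:\n" + "\n---\n".join(observed_outputs)
    population = [llm(seed) for _ in range(n_candidates)]
    for _ in range(generations):
        parents = sorted(population, key=fitness, reverse=True)[: n_candidates // 2]
        # "Crossover": ask the LLM to merge pairs of strong candidates into new ones.
        children = [llm(f"Combine these two instructions into one better instruction:\n1) {a}\n2) {b}")
                    for a, b in zip(parents, reversed(parents))]
        population = parents + children
    return max(population, key=fitness)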

-----

Key Insights from this Paper 💡:

→ Black-box prompt recovery is possible with minimal resources (5 outputs vs 64 required by previous methods)

→ Using multiple outputs reduces overemphasis on specific response details

→ Genetic algorithm-based optimization significantly improves prompt recovery accuracy

→ Zero-shot approach eliminates need for training data or additional model training

-----

Results 📊:

→ Outperforms state-of-the-art by 5.2% in cosine similarity across different embedding models

→ Achieves 2.3% higher similarity with ada-002 embeddings

→ Shows 8.1% improvement with text-embedding-3-large

→ Maintains slightly lower ROUGE-1 scores (-1.6%) while generating more natural prompts



GdVs70RaoAI1uGV.png


2/4
@rohanpaul_ai
Paper Title: "Reverse Prompt Engineering"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861513863939936256/pu/vid/avc1/1080x1080/WR8qgprGfUH0SC8N.mp4

3/4
@rohanpaul_ai
[2411.06729] Reverse Prompt Engineering



4/4
@rohanpaul_ai




GdVtuZCboAAWqMR.jpg



 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492


1/5
@omarsar0
o1 Replication Journey - Part 2

Shows that combining simple distillation from O1's API with supervised fine-tuning significantly boosts performance on complex math reasoning tasks.

"A base model fine-tuned on simply tens of thousands of samples O1-distilled long-thought chains outperform o1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity."



GdUQNRWbIAARwhg.png


2/5
@omarsar0
[2411.16489] O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?



3/5
@BensenHsu
This paper presents a critical examination of current approaches to replicating OpenAI's O1 model capabilities, with a particular focus on the widespread but often undisclosed use of knowledge distillation techniques.

The authors show that a base model fine-tuned on just tens of thousands of O1-distilled long-thought-chain samples can outperform O1-preview on the AIME with minimal technical complexity. Moreover, their investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks, including hallucination, safety, and open-domain QA.

full paper: O1 Replication Journey – Part 2: Surpassing O1-preview through Simple Distillation Big Progress or Bitter Lesson?



GdURUV9acAAdbFm.jpg


4/5
@AngelAITalk
This approach could open doors to more efficient AI solutions for advanced problems.



5/5
@JeffreyH630
That's really interesting, Elvis!

It’s amazing to see how combining different methods can yield such impressive results.

Looking forward to Part 3!




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492




1/4
@rohanpaul_ai
New smart weight-compression technique has arrived to reduce your GPU VRAM requirement.

Squeeze more parameters into your GPU by compressing the wasteful parts of floating-point numbers

NeuZip compresses neural networks by exploiting the low entropy nature of floating-point exponents

🎯 Original Problem:

Training and deploying large neural networks is severely constrained by GPU memory limitations. While model sizes have grown 100x since 2017, GPU memory has only increased 2.5x (from 32GB to 80GB), creating a critical bottleneck for scaling neural networks.

-----

🔧 Solution in this Paper:

→ Introduces NeuZip - a novel compression scheme that exploits low entropy in neural network parameters' exponent bits

→ Compresses exponent bits losslessly using asymmetric numeral system (ANS) while keeping sign and mantissa bits unchanged

→ Implements layer-by-layer compression/decompression during training to avoid creating large buffers

→ Compatible with activation checkpointing for additional memory savings

→ For inference, introduces lossy compression by truncating mantissa bits while controlling relative weight changes

-----

💡 Key Insights:

→ Neural network parameters tend to concentrate around zero, making exponent bits highly compressible

→ Exponent bits carry only ~3 bits of information despite using 8 bits of storage

→ Layer-wise compression enables training without ever fully decompressing the entire network

→ Inference tasks can tolerate more aggressive lossy compression compared to training
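The low-entropy observation is easy to verify. The NumPy sketch below measures the entropy of the bfloat16 exponent bits of a weight tensor; for weights concentrated near zero it comes out around 2-3 bits, which is why a lossless entropy coder such as ANS can shrink the 8 stored exponent bits substantially (the compression/decompression machinery itself is not shown):

import numpy as np

def exponent_entropy_bits(weights):
    # View float32 weights as their bfloat16 bit pattern (top 16 bits) and pull out
    # the 8 exponent bits (bits 7..14 of the bfloat16 representation).
    bits16 = weights.astype(np.float32).view(np.uint32) >> 16
    exponents = (bits16 >> 7) & 0xFF
    counts = np.bincount(exponents.astype(np.int64), minlength=256)
    probs = counts[counts > 0] / counts.sum()
    return float(-(probs * np.log2(probs)).sum())

# Weights concentrated around zero -> exponent entropy of only a few bits.
w = np.random.normal(0.0, 0.02, size=1_000_000)
print(exponent_entropy_bits(w))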

-----

📊 Results:

→ Reduces Llama-3 8B training memory from 31GB to <16GB with no performance loss

→ Enables training 13B parameter models on consumer GPUs (<20GB memory)

→ For inference, achieves >50% memory reduction while maintaining near-lossless performance

→ Outperforms QLoRA and other quantization methods in memory-performance trade-off



GdQ4CwCboAACkXl.png


2/4
@rohanpaul_ai
Paper Title: "NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861174229959614464/pu/vid/avc1/1080x1080/TgngZoqe92dCG6tb.mp4

3/4
@rohanpaul_ai




GdQ4p6OaoAEkbd8.png


4/4
@rohanpaul_ai
📚 [2410.20650] NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492




1/5
@rohanpaul_ai
A single model with multiple experts handles error correction for different input types

NeKo, proposed in this paper, uses specialized experts to fix recognition errors across speech, text and vision tasks

Original Problem 🤔:

Building a general-purpose post-recognition error corrector that can handle multiple domains (speech, text, vision) while maintaining high performance across all tasks remains challenging. Current solutions require separate models for each domain, leading to parameter inefficiency.

-----

Solution in this Paper 🛠️:

→ NeKo introduces a task-oriented Mixture-of-Experts (MoE) architecture where experts specialize in specific tasks (speech-to-text, language-to-text, vision-to-text)

→ During training, input tokens are routed to both their task-specific expert and the top expert selected by a gating network

→ During inference, tokens are routed purely based on router probabilities without task knowledge, enabling zero-shot generalization

→ The model replaces standard feedforward blocks with MoE layers, allowing efficient parameter sharing across tasks
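An illustrative PyTorch sketch of the task-oriented routing rule described above (not the released NeKo code): during training each token reaches its task's expert plus the router's top pick, while at inference only router probabilities are used:

import torch
import torch.nn as nn

class TaskOrientedMoE(nn.Module):
    def __init__(self, d_model, n_experts):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        )
        self.router = nn.Linear(d_model, n_experts)

    def forward(self, x, task_id=None):
        # x: [tokens, d_model]; task_id: [tokens] expert index per token (training only).
        probs = self.router(x).softmax(dim=-1)
        top_expert = probs.argmax(dim=-1)
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = top_expert == e                     # router's top pick
            if task_id is not None:
                mask = mask | (task_id == e)           # training: also the task's own expert
            if mask.any():
                out[mask] += probs[mask, e:e + 1] * expert(x[mask])
        return out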

-----

Key Insights from this Paper 💡:

→ Task-specific expert assignment during training enables better specialization while maintaining cross-task knowledge sharing

→ MoE architecture provides better parameter efficiency compared to having separate models for each task

→ Zero-shot generalization is possible by relying on learned routing patterns during inference

-----

Results 📊:

→ 5.0% relative Word Error Rate reduction on Open ASR Leaderboard

→ 15.5% to 27.6% relative WER reduction compared to GPT-3.5 and Claude-Opus on zero-shot Hyporadise benchmark

→ State-of-the-art results in ASR correction while maintaining competitive performance on grammar and OCR correction tasks



GdQ5stpaoAAkoJw.png


2/5
@rohanpaul_ai
Paper Title: "NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861176047334760449/pu/vid/avc1/1080x1080/1V5wmFt-Uzj8Ueyf.mp4

3/5
@rohanpaul_ai
🔍 NeKo's architecture

The model replaces standard feedforward blocks with MoE layers. During training, each expert is mapped to a specific task, with input tokens being routed to both their task-specific expert and the top expert selected by the gating network.

During inference, tokens are routed purely based on router probabilities without task knowledge.



GdQ6D3QaoAE0cA1.png


4/5
@rohanpaul_ai
[2411.05945] NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts



5/5
@MiaAI_Builder
I appreciate the idea of using specialized experts to correct errors across different input types, a promising approach to building a general-purpose post-recognition error correction model.




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492




1/6
@rohanpaul_ai
New tests reveal the true effective context limits of leading LLMs.

→ GPT-4 performs best at small contexts, Gemini 1.5 Pro excels at longer contexts

→ Claude 3.5 Sonnet leads in mid-range contexts (2.5k to 32k tokens)

🎯 Original Problem:

Current benchmarks for evaluating LLMs' long context capabilities are inadequate - they either saturate at perfect scores, test on limited context lengths, or lack granular insights into specific model behaviors.

-----

🔬 Solution in this Paper:

→ Introduced a series of increasingly complex retrieval tasks using synthetic UUID key-value pairs to test 17 leading LLMs

→ Created novel "needle threading" tasks where models must follow chains of linked information through contexts up to 900k tokens

→ Developed tasks like Single Needle (basic retrieval), Multiple Needles (concurrent retrieval), and Threading (following information chains)

→ Introduced Multi-Threading to test if models can track multiple information threads simultaneously
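A toy generator for one such threading probe, assuming the paper's setup of synthetic UUID key-value pairs in which each link's value is the key of the next link; the exact prompt wording is illustrative:

import random
import uuid

def make_threading_haystack(n_pairs=1000, thread_len=5):
    keys = [str(uuid.uuid4()) for _ in range(n_pairs)]
    values = [str(uuid.uuid4()) for _ in range(n_pairs)]
    # Hide one forward thread: the value at link i becomes the key of link i+1.
    thread = random.sample(range(n_pairs), thread_len)
    for here, nxt in zip(thread, thread[1:]):
        values[here] = keys[nxt]
    haystack = "\n".join(f"{k}: {v}" for k, v in zip(keys, values))
    prompt = (haystack + "\n\nStart at key " + keys[thread[0]]
              + "; repeatedly look up the value as the next key and report the final value.")
    answer = values[thread[-1]]
    return prompt, answer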

-----

💡 Key Insights:

→ Most models' effective context limit is shorter than their advertised context length

→ Models perform better with forward-moving threads compared to backward threads

→ Many models are remarkably "thread-safe" - can follow multiple threads without performance loss

→ Different tokenizers count tokens very differently - direct comparisons can be misleading

→ Performance generally decreases towards the middle of the context window

-----

📊 Results:

→ GPT-4 performs best at small contexts, Gemini 1.5 Pro excels at longer contexts

→ Claude 3.5 Sonnet leads in mid-range contexts (2.5k to 32k tokens)

→ Closed-source models consistently outperform open-source alternatives

→ Most models show significant accuracy drop beyond their effective context limit



GdRLiEraoAMAoM4.png


2/6
@rohanpaul_ai
Paper Title: "Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861195970811445248/pu/vid/avc1/1080x1080/tHjcPg5xbjO0bzhA.mp4

3/6
@rohanpaul_ai
🔬 Evaluation methods used

→ Used 17 leading LLMs including GPT-4, Gemini 1.5, Claude 3, and open-source models

→ Tested on contexts ranging from 1k to 630k tokens

→ Evaluated using exact matching with expected answers

→ Introduced a task-specific effective context limit metric to measure real performance capabilities



GdRMPvHa0AAqSFb.png


4/6
@rohanpaul_ai
[2411.05000] Needle Threading: Can LLMs Follow Threads through Near-Million-Scale Haystacks?



5/6
@tOSUFever
everything every pixel is all in long context.
multithreaded (better) = yes.
gpt4 best at short context = (this is probably wrong) i think gpt4 long context was your grandaddy worldmodel distilling data synthetically initially at great cost.

also we need some new words.

long context serves two purposes the first is prompting with a ton of data like a book or repo. (this is still one turn one prompt), human expects ai output

the second is different; the behavior of a long conversation thread. this isn’t one turn this is thousands of unique turns. this is when you SEE behavior. human INPUT to context, ai (output) is INPUT to context.

and human INPUT (as far as the context window is concerned for ‘the next output’) is weighted differently.

for example llm providers may not pass through your exact prompt to “prevent hallucinations” or “jailbreaking”. maybe characters like _* or (…) whatever. however the model will happily insert them for you into context…

and if it *likes* you (literally) the next outputs only get better 💥



6/6
@medoraai
Anthropic's new enterprise version has a 500K context window. Interested to see how they do. Google is the king here however.




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492




1/6
@rohanpaul_ai
Nice survey paper presents a unified taxonomy bridging personalized text generation and downstream applications

🎯 Current research on LLM personalization is fragmented into two disconnected areas: direct personalized text generation and downstream task personalization.

This split creates a knowledge gap, limiting the development of comprehensive personalization solutions.

This Paper:

→ Establishes three personalization granularity levels: user-level (individual), persona-level (groups), and global preference alignment

→ Proposes systematic frameworks for personalization techniques including RAG, prompt engineering, fine-tuning, embedding learning, and RLHF

→ Creates evaluation taxonomies distinguishing between direct (text quality) and indirect (task performance) assessment methods

-----

💡 Key Insights:

→ Personalization can be achieved at different granularities, with trade-offs between precision and data requirements

→ User-level personalization offers finest control but needs substantial user data

→ Persona-level grouping helps handle cold-start problems with new users

→ Privacy concerns and bias management are critical challenges

→ Multi-modal personalization remains an open challenge



GdQVU-IakAAdNe8.png


2/6
@rohanpaul_ai
Paper Title: "Personalization of Large Language Models: A Survey"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861136074451623938/pu/vid/avc1/1080x1080/ZcMPQJ7Cxpxrf4Pj.mp4

3/6
@rohanpaul_ai
[2411.00027] Personalization of Large Language Models: A Survey



4/6
@rohanpaul_ai




GdQWruxaoAAOv01.png


5/6
@Zhehao_Zhang123
Thanks Rohan for sharing our work!



6/6
@Adina_Coder
Amazing share




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492





1/6
@rohanpaul_ai
Open-source alternative to GPT-4V for building reliable GUI automation agents

OS-ATLAS, a foundational GUI action model enables open-source GUI agents to match commercial VLM performance through cross-platform data synthesis.

Releases the largest open-source cross-platform GUI grounding corpus to date, containing over 13 million GUI elements.

🤖 Original Problem:

Existing GUI agents heavily depend on commercial Vision-Language Models (VLMs) like GPT-4V. Open-source VLMs perform poorly in GUI grounding and Out-Of-Distribution scenarios, making them less preferred for real-world applications.

-----

🛠️ Solution in this Paper:

→ Created OS-ATLAS, a foundational GUI action model with three operating modes: Grounding, Action, and Agent

→ Built first multi-platform GUI data synthesis toolkit covering Windows, Linux, MacOS, Android, web

→ Created largest open-source cross-platform GUI corpus (13M+ elements from 2.3M screenshots)

→ Implemented unified action space during training to resolve naming conflicts across platforms

→ Standardized Basic Actions (click, type, scroll) and Custom Actions for extensibility
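To make the "unified action space" concrete, here is a small illustrative schema and parser; the action names and output string format are assumptions for the sketch, not the released OS-ATLAS specification:

import re
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class GUIAction:
    kind: str                                   # "click" | "type" | "scroll" | custom verb
    point: Optional[Tuple[int, int]] = None     # screen coordinates for click/scroll
    text: Optional[str] = None                  # payload for type / custom actions

def parse_action(line: str) -> GUIAction:
    # Parse emissions like "CLICK (312, 580)" or "TYPE hello world" into one shared schema.
    m = re.match(r"CLICK \((\d+),\s*(\d+)\)", line)
    if m:
        return GUIAction("click", point=(int(m.group(1)), int(m.group(2))))
    verb, _, rest = line.partition(" ")
    return GUIAction(verb.lower(), text=rest or None)

print(parse_action("CLICK (312, 580)"))
print(parse_action("TYPE hello world"))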

-----

💡 Key Insights:

→ Pre-training on comprehensive cross-platform GUI data significantly improves grounding accuracy

→ Unified action space prevents performance degradation from naming conflicts

→ Instruction grounding data, while valuable, isn't critical - referring expression data is sufficient

→ Web-only training doesn't generalize well to other platforms

-----

📊 Results:

→ Achieves 82.47% average grounding accuracy without planner

→ Reaches 85.14% accuracy with GPT-4 as planner

→ Outperforms previous SOTA across mobile, desktop and web platforms

→ Shows 14.63% success rate on OSWorld benchmark (compared to 9.21% baseline)



GdQXG9qbkAAUDqR.png


2/6
@rohanpaul_ai
Paper Title: "OS-ATLAS: A Foundation Action Model for Generalist GUI Agents"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861138549690769411/pu/vid/avc1/720x720/oRPlbm8e6lhO-ppC.mp4

3/6
@rohanpaul_ai
The first multi-platform GUI grounding data synthesis toolkit that works across Windows, Linux, MacOS, Android and web platforms



GdQXuyKaoAExE5T.jpg


4/6
@rohanpaul_ai
📚 [2410.23218] OS-ATLAS: A Foundation Action Model for Generalist GUI Agents



5/6
@rohanpaul_ai




GdQX-6IaoAEz-YD.jpg


6/6
@tOSUFever
🔥




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492





1/7
@rohanpaul_ai
O1 doesn't cheat on math tests - it actually knows how to solve them

A/B testing reveals o1's true mathematical reasoning capabilities beyond memorization

🎯 Original Problem:

OpenAI's Orion-1 (o1) model claims superior reasoning capabilities, but skeptics suggest its performance might stem from memorizing solutions rather than true reasoning abilities.

-----

🔧 Solution in this Paper:

→ Used A/B testing comparing o1's performance on two datasets: IMO problems (easily accessible) and CNT problems (less accessible but similar difficulty)

→ Implemented a 7-point grading system: 1 point for correct numerical answer, 2 points for intuitive approach, 4 points for detailed reasoning

→ Categorized problems into "search" type (finding specific solutions) and "solve" type (equations/optimization)
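A sketch of the A/B comparison logic: per-problem grades on the 7-point rubric are collected for the widely available IMO set and the less accessible CNT set, and a two-sample t-test checks whether the means differ (the numbers below are placeholders, not the paper's data):

from scipy import stats

def ab_compare(imo_scores, cnt_scores):
    # Per-problem grades on the 7-point rubric (1 answer + 2 intuition + 4 rigor).
    # A t-statistic near zero means similar performance on public (IMO) and
    # less accessible (CNT) problems, arguing against pure memorization.
    return stats.ttest_ind(imo_scores, cnt_scores, equal_var=False)

print(ab_compare([5, 3, 7, 4, 6, 2], [4, 4, 6, 3, 5, 3]))   # placeholder numbers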

-----

💡 Key Insights:

→ O1 shows strong intuitive reasoning and pattern discovery capabilities

→ Performs exceptionally well on "search" type problems (~70% accuracy)

→ Struggles with rigorous proof steps and "solve" type problems (~21% accuracy)

→ Often uses trial-and-error approach instead of formal proofs

-----

📊 Results:

→ No significant performance difference between IMO (51.4%) and CNT (48%) datasets

→ T-statistics close to 0, suggesting o1 relies on reasoning rather than memorization

→ Outperforms GPT-4o's benchmark of 39.97% on both datasets



GdQyuWIbgAATGrz.png


2/7
@rohanpaul_ai
Paper Title: "OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving?"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861168366934990851/pu/vid/avc1/1080x1080/D1uoa5hsOde-Fyr9.mp4

3/7
@dikksonPau
It's not Orion-1...



4/7
@rohanpaul_ai
yes the paper refers to o1-preview and o1-mini variants



5/7
@rohanpaul_ai
This paper is actually referring to the o1-preview model (not the final o1)



6/7
@rohanpaul_ai
[2411.06198] OpenAI-o1 AB Testing: Does the o1 model really do good reasoning in math problem solving?



7/7
@TShirtnJeans2
Wait, hold on. There are folks who have access to OpenAI's GPT-5/Orion-1 model?

I thought that wasn't scheduled to come out until next year?




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492






1/6
@rohanpaul_ai
The first open code LLM to reveal its entire training pipeline and reproducible datasets

🎯 Original Problem:

Code LLMs lack transparency in training data and protocols, limiting the research community's ability to establish strong baselines and gain deeper insights.

-----

🛠️ Solution in this Paper:

→ Introduces OpenCoder, a fully transparent code LLM with complete training data, processing pipeline, and protocols

→ Implements sophisticated data processing pipeline called RefineCode with 960B tokens across 607 programming languages

→ Uses aggressive file-level deduplication and language-specific filtering rules

→ Employs two-stage instruction tuning with annealing phase using high-quality synthetic data
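A minimal sketch of exact file-level deduplication, the granularity the paper favors over repository-level dedup; RefineCode's full pipeline also includes fuzzy deduplication and language-specific filters, which are not shown:

import hashlib

def file_level_dedup(files):
    # files: iterable of (path, content); keep the first occurrence of each distinct body.
    seen, kept = set(), []
    for path, content in files:
        digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append((path, content))
    return kept

print(len(file_level_dedup([("a.py", "print(1)"), ("b.py", "print(1)"), ("c.py", "print(2)")])))  # 2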

-----

💡 Key Insights:

→ File-level deduplication outperforms repository-level approach for maintaining data diversity

→ GitHub star-based filtering can reduce data diversity and affect distribution

→ High-quality data in annealing phase is more crucial than quantity

→ Two-stage instruction tuning improves both theoretical and practical coding tasks

-----

📊 Results:

→ OpenCoder-8B achieves 83.5% pass@1 on HumanEval benchmark

→ Surpasses all previous fully open models at 6B+ parameter scale

→ Demonstrates superior training efficiency compared to The Stack v2



GdQsa9zaoAA-gGQ.png


2/6
@rohanpaul_ai
Paper Title: "OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861161457960001538/pu/vid/avc1/1080x1080/mc7X5OmyGwoCxj21.mp4

3/6
@rohanpaul_ai
The illustration of our pretraining data processing workflow.



GdQtCqqbEAAcSeb.jpg


4/6
@rohanpaul_ai
🚀 OpenCoder surpasses all previous fully open models and other open-access models at the 6B+ parameter scale. The 8B version achieves 83.5% pass@1 on HumanEval benchmark, making it competitive with leading proprietary models.



GdQsyloaoAAnGsi.jpg


5/6
@rohanpaul_ai
Their instruction data synthesis workflow



GdQv0Tya8AAC_ma.jpg


6/6
@rohanpaul_ai
[2411.04905] OpenCoder: The Open Cookbook for Top-Tier Code Large Language Models




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,601
Reputation
8,519
Daps
160,492





1/8
@rohanpaul_ai
New benchmark exposes the true reasoning capabilities of LLMs using dynamic puzzle generation

K&K puzzles, proposed in this paper, reveal how LLMs balance memorization and reasoning in logical problem-solving

🤖 Original Problem:

LLMs show puzzling behavior in reasoning tasks - excellent performance on complex problems but basic mistakes on simple ones. This raises questions about whether they truly reason or just memorize training data.

-----

🔧 Solution in this Paper:

→ Introduces Knights and Knaves (K&K) puzzles as a dynamic benchmark for testing logical reasoning

→ Develops Local Inconsistency-based Memorization Score (LiMem) that measures model performance on original vs perturbed puzzles

→ Creates two key modules:

- Abstract Module: Generates puzzles with specified complexity

- Natural Language Module: Converts abstract puzzles to natural text

→ Implements systematic perturbation tests at both mathematical and linguistic levels
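Ground-truth answers for puzzles like these can be generated by brute force, which is what makes the benchmark dynamic and easy to perturb. A tiny illustrative solver (not the paper's generator code):

from itertools import product

def solve_knights_knaves(statements):
    # statements[i](assignment) -> bool: does person i's claim hold under the assignment?
    # True in the assignment means "knight" (always truthful), False means "knave" (always lies).
    solutions = []
    for assignment in product([True, False], repeat=len(statements)):
        if all(statements[i](assignment) == assignment[i] for i in range(len(statements))):
            solutions.append(assignment)
    return solutions

# Example: A says "B is a knave"; B says "we are both knights".
puzzle = [lambda a: not a[1], lambda a: a[0] and a[1]]
print(solve_knights_knaves(puzzle))   # [(True, False)] -> A is a knight, B is a knave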

-----

💡 Key Insights:

→ LLMs can simultaneously use memorization and genuine reasoning

→ Fine-tuning improves generalization even as memorization increases

→ Models can develop reasoning skills even when trained only on question-answer pairs

→ More complex puzzles show higher memorization scores

→ Language-level perturbations affect models less than mathematical structure changes

-----

📊 Results:

→ Only advanced LLMs achieve >70% accuracy on 2-person puzzles

→ Performance drops to 11% for 8-person puzzles

→ GPT4o-mini reaches near 100% training accuracy on 3/5-person puzzles

→ LiMem scores ~50% on 8-person puzzles indicate heavy memorization

→ Models show 80% memorization under role-flipping perturbations



GdQzloRaoAUxfE7.png


2/8
@rohanpaul_ai
Paper Title: "On Memorization of Large Language Models in Logical Reasoning"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861169337081700352/pu/vid/avc1/1080x1080/vJUxlnzGezPNqx37.mp4

3/8
@rohanpaul_ai




GdQzvb6aMAAsvfa.jpg


4/8
@rohanpaul_ai
📚 [2410.23123] On Memorization of Large Language Models in Logical Reasoning



5/8
@rohanpaul_ai
🧩 The Knights and Knaves benchmark generates logical puzzles where some characters always tell truth (knights) and others always lie (knaves).

It has two key modules:

→ Abstract Module: Generates puzzles with specified number of people, tree width, and depth. Can perturb puzzles by changing statements or leaf nodes

→ Natural Language Module: Converts abstract puzzles to natural language, can perturb by changing names, role terms, statement order



GdQ0KARaoAE-XNc.jpg


6/8
@ICoffeeDaemon
Not bad



7/8
@fabmilo
Just compression/memorization. There is no reasoning in refining probabilities of tokens. We need a completely different system to achieve reasoning, and tons of compute power.



8/8
@MiaAI_Builder
LLMs have been puzzling us with their behavior in reasoning tasks, indeed. Hope this benchmark helps us understand them better




 