

Mistral unveils new AI models and chat features​

Kyle Wiggers
8:03 AM PST · November 18, 2024

French AI startup Mistral has released a slew of updates to its product portfolio as it looks to stay competitive in the cutthroat AI space.

Mistral’s Le Chat chatbot platform can now search the web — with citations in line, a la OpenAI’s ChatGPT. It’s also gained a “canvas” tool along the lines of ChatGPT Canvas, allowing users to modify, transform, or edit content, like webpage mockups and data visualizations, leveraging Mistral’s AI models.

“You can use [the canvas feature] to create documents, presentations, code, mockups… the list goes on,” Mistral writes in a blog post. “You’re able to modify its contents in place without regenerating responses, version your drafts, and preview your designs.”



In addition to all this, Le Chat can now process large PDF documents and images for analysis and summarization, including files containing graphs and equations. As of today, the platform incorporates Black Forest Labs’ Flux Pro model for image generation. And Le Chat can now host shareable automated workflows for tasks like scanning expense reports and invoice processing; Mistral calls these AI “agents.”

Some of Le Chat’s new capabilities, all of which will remain free while in beta, are made possible by Mistral’s new models.


One, Pixtral Large, can process both text and images — it’s the second in Mistral’s Pixtral family of models. Weighing in at 124 billion parameters, Pixtral Large matches or bests leading models including Anthropic’s Claude 3.5 Sonnet, Google’s Gemini 1.5 Pro, and OpenAI’s GPT-4o on certain multimodal benchmarks. (Parameters roughly correspond to a model’s problem-solving skills, and models with more parameters generally perform better than those with fewer parameters.)

“Particularly, Pixtral Large is able to understand documents, charts, and natural images,” Mistral wrote in a second blog post. “The model demonstrates frontier-level image understanding.”


Mistral also today unveiled a new version of Mistral Large, its flagship line of text-only models. Called Mistral Large 24.11, the new model brings “notable improvements” in long context understanding, Mistral says, making it well-suited for use cases like document analysis and task automation.

Both Pixtral Large and Mistral Large 24.11 can be used outside of Le Chat under two licenses: a more restrictive license for research and an enterprise license for development and commercialization. Mistral Large 24.11 is already in Mistral’s API and on AI platform Hugging Face, and will soon be available through cloud platforms including Google Cloud and Microsoft Azure, Mistral says.
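Here is a minimal sketch of what that API access looks like, assuming the mistralai Python SDK (v1); the model identifier follows Mistral's dated naming convention but is an assumption to verify against the current docs:

```python
# Minimal sketch: querying Mistral Large 24.11 via Mistral's API.
# Assumes the `mistralai` Python SDK (v1); "mistral-large-2411" is an
# assumed identifier for Mistral Large 24.11 -- check Mistral's docs.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="mistral-large-2411",
    messages=[{"role": "user", "content": "Summarize the key obligations in this contract: ..."}],
)
print(response.choices[0].message.content)
```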

Paris-based Mistral, which recently raised $640 million in venture capital, continues to gradually expand its AI offerings. Over the past few months, the company has launched a free service for developers to test its models, an SDK to let customers fine-tune those models, and new models, including a generative model for code called Codestral.

Co-founded by alumni from Meta and DeepMind, Mistral’s stated mission is to create highly competitive models and services around those models — and ideally make money in the process. While the “making money” bit is proving to be challenging (as it is for most generative AI startups), Mistral reportedly began to generate revenue this summer.

“At Mistral, our approach to AI is different — we’re not chasing artificial general intelligence at all costs; our mission is to instead place frontier AI in your hands, so you get to decide what to do with advanced AI capabilities,” the company wrote in one of its blogs today. “This approach has allowed us to be quite frugal with our capital, while consistently delivering frontier capabilities at affordable price points.”
 














Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions​


  • Text Generation
  • Text2Text Generation
  • Reinforcement Learning

Published 11/22/2024 by Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang


Overview​


  • New AI model called Marco-o1 focused on open-ended reasoning tasks
  • Uses Monte Carlo Tree Search (MCTS) to explore multiple solution paths
  • Implements flexible reasoning strategies to handle complex problems
  • Achieves improved performance on reasoning-intensive tasks
  • Designed to generate diverse solutions rather than single answers



Plain English Explanation​


Marco-o1 is a fresh approach to making AI systems that can think through problems more like humans do. Instead of rushing to a single answer, it explores multiple possible solutions using a method called Monte Carlo Tree Search - think of it like a chess player considering different possible moves before deciding.

The system works like a detective following different leads. When given a problem, it doesn't just pick the first solution that seems right. Instead, it maps out various possible approaches and evaluates which ones might work best. This is particularly useful for questions that don't have one clear answer.

Large language models often struggle with complex reasoning tasks, but Marco-o1 breaks down these challenges into smaller, manageable steps. It's similar to how a student might solve a difficult math problem by working through it piece by piece.


Key Findings​


  • Marco-o1 showed significant improvement in handling open-ended reasoning tasks
  • The system successfully generated multiple valid solutions for complex problems
  • Performance exceeded baseline models on reasoning-intensive benchmarks
  • MCTS implementation proved effective for exploring solution spaces
  • The model demonstrated ability to adapt its reasoning strategy based on problem type


Technical Explanation​


The reasoning model employs a sophisticated architecture combining MCTS with strategic reasoning components. The MCTS implementation explores potential solution paths while maintaining a balance between exploring new possibilities and exploiting known successful approaches.
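That explore/exploit balance is classically handled with a UCT-style selection rule. The sketch below illustrates the generic rule, as an assumption; the paper may weight the terms differently:

```python
# Generic UCT selection for MCTS: pick the child maximizing
# average value + an exploration bonus for rarely-visited nodes.
import math

def uct_select(children, c=1.414):
    """children: list of (mean_value, visit_count) pairs; returns best index."""
    total = sum(n for _, n in children)
    def uct(v, n):
        if n == 0:
            return float("inf")  # always expand unvisited actions first
        return v + c * math.sqrt(math.log(total) / n)
    scores = [uct(v, n) for v, n in children]
    return scores.index(max(scores))

# Example: three candidate reasoning steps with their current statistics.
print(uct_select([(0.8, 10), (0.6, 3), (0.0, 0)]))  # -> 2 (unvisited wins)
```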

The system incorporates a flexible action strategy that adapts to different problem types. This includes techniques for breaking down complex problems, generating intermediate steps, and validating potential solutions.

Reinforcement learning plays a key role in optimizing the model's decision-making process, helping it learn which reasoning strategies work best for different types of problems.


Critical Analysis​


The research has several limitations worth noting. The model's performance on extremely complex reasoning tasks still shows room for improvement. Additionally, the computational resources required for MCTS could limit practical applications.

Some questions remain about the scalability of the approach and its applicability to real-world scenarios. The research could benefit from more extensive testing across diverse problem domains.



Planning capabilities in the model, while improved, still fall short of human-level reasoning in certain contexts.


Conclusion​


Marco-o1 represents a significant step forward in AI reasoning capabilities. By embracing open-ended problem solving and multiple solution paths, it moves closer to human-like reasoning abilities. The findings suggest promising directions for future development of AI systems that can handle complex reasoning tasks more effectively.

The research opens new possibilities for applications in fields requiring sophisticated problem-solving capabilities, though practical implementation challenges remain to be addressed.






Computer Science > Computation and Language​

[Submitted on 21 Nov 2024]

Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions​

Yu Zhao, Huifeng Yin, Bo Zeng, Hao Wang, Tianqi Shi, Chenyang Lyu, Longyue Wang, Weihua Luo, Kaifu Zhang
Currently OpenAI o1 has sparked a surge of interest in the study of large reasoning models (LRM). Building on this momentum, Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding -- which are well-suited for reinforcement learning (RL) -- but also places greater emphasis on open-ended resolutions. We aim to address the question: "Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?" Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies -- optimized for complex real-world problem-solving tasks.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2411.14405 [cs.CL]
(or arXiv:2411.14405v1 [cs.CL] for this version)
[2411.14405] Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions


Submission history​


From: Huifeng Yin [view email]

[v1] Thu, 21 Nov 2024 18:37:33 UTC (5,397 KB)

 




⭐ MarcoPolo Team ⭐

AI Business, Alibaba International Digital Commerce

Github
🤗 Hugging Face 📝 Paper 🧑‍💻 Model 🗂️ Data 📽️ Demo
🎯 Marco-o1 not only focuses on disciplines with standard answers, such as mathematics, physics, and coding—which are well-suited for reinforcement learning (RL)—but also places greater emphasis on open-ended resolutions. We aim to address the question: "Can the o1 model effectively generalize to broader domains where clear standards are absent and rewards are challenging to quantify?"

Currently, Marco-o1 Large Language Model (LLM) is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and _innovative reasoning strategies_—optimized for complex real-world problem-solving tasks.

⚠️ Limitations: We would like to emphasize that this research work is inspired by OpenAI's o1 (from which the name is also derived). This work aims to explore potential approaches to shed light on the currently unclear technical roadmap for large reasoning models. In addition, our focus is on open-ended questions, and we have observed interesting phenomena in multilingual applications. However, we must acknowledge that the current model primarily exhibits o1-like reasoning characteristics and its performance still falls short of a fully realized "o1" model. This is not a one-time effort, and we remain committed to continuous optimization and ongoing improvement.


Currently, our work is distinguished by the following highlights:

  • 🍀 Fine-Tuning with CoT Data: We develop Marco-o1-CoT by performing full-parameter fine-tuning on the base model using an open-source CoT dataset combined with our self-developed synthetic data.
  • 🍀 Solution Space Expansion via MCTS: We integrate LLMs with MCTS (Marco-o1-MCTS), using the model's output confidence to guide the search and expand the solution space.
  • 🍀 Reasoning Action Strategy: We implement novel reasoning action strategies and a reflection mechanism (Marco-o1-MCTS Mini-Step), including exploring different action granularities within the MCTS framework and prompting the model to self-reflect, thereby significantly enhancing the model's ability to solve complex problems.
  • 🍀 Application in Translation Tasks: We are the first to apply Large Reasoning Models (LRM) to machine translation tasks, exploring inference-time scaling laws in the multilingual and translation domain.

OpenAI recently introduced the groundbreaking o1 model, renowned for its exceptional reasoning capabilities. This model has demonstrated outstanding performance on platforms such as AIME, CodeForces, surpassing other leading models. Inspired by this success, we aimed to push the boundaries of LLMs even further, enhancing their reasoning abilities to tackle complex, real-world challenges.

🌍 Marco-o1 leverages advanced techniques like CoT fine-tuning, MCTS, and Reasoning Action Strategies to enhance its reasoning power. As shown in Figure 2, by fine-tuning Qwen2-7B-Instruct with a combination of the filtered Open-O1 CoT dataset, Marco-o1 CoT dataset, and Marco-o1 Instruction dataset, Marco-o1 improved its handling of complex tasks. MCTS allows exploration of multiple reasoning paths using confidence scores derived from softmax-applied log probabilities of the top-k alternative tokens, guiding the model to optimal solutions. Moreover, our reasoning action strategy involves varying the granularity of actions within steps and mini-steps to optimize search efficiency and accuracy.
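As a concrete illustration of that confidence scoring, here is a minimal sketch built only from the description above; the top-5 choice and the helper names are assumptions, not the Marco-o1 source:

```python
# Token confidence = softmax over the log-probs of the top-k alternative
# tokens (chosen token included); the node reward that guides MCTS is the
# average confidence across the tokens of a reasoning step.
import math

def token_confidence(chosen_logprob: float, topk_logprobs: list[float]) -> float:
    """Probability mass the chosen token gets among the top-k alternatives."""
    denom = sum(math.exp(lp) for lp in topk_logprobs)
    return math.exp(chosen_logprob) / denom

def node_reward(step_tokens: list[tuple[float, list[float]]]) -> float:
    """Average token confidence over a reasoning step = the MCTS node value."""
    scores = [token_confidence(lp, alts) for lp, alts in step_tokens]
    return sum(scores) / len(scores)

# Example: a 3-token step; each entry is (chosen logprob, top-5 logprobs).
step = [(-0.1, [-0.1, -2.3, -3.0, -4.1, -5.2]),
        (-0.4, [-0.4, -1.1, -2.8, -3.5, -4.0]),
        (-0.2, [-0.2, -2.0, -2.5, -3.3, -4.4])]
print(f"node reward: {node_reward(step):.3f}")
```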










1/11
@omarsar0
Nice paper from Alibaba on building open reasoning models.

They propose Marco-o1 which is a reasoning model built for open-ended solutions.

"Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and innovative reasoning strategies—optimized for complex real-world problem-solving tasks."

It's good to see more efforts on open reasoning LLMs. I am tracking this space very closely and will be highlighting more research on this topic.





2/11
@omarsar0
Paper: [2411.14405] Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions



3/11
@kbeguir
Just read it, big reasoning mistake 🤯 on Figure 6: correct answer on the right should be 2 legos remaining, not 11!





4/11
@omarsar0
The calculation does seem off! They need to correct this. I notice these questions are getting harder to assess by humans. 😅 In contrast, here is the output from o1-preview:





5/11
@CohorteAI
Marco-o1’s design suggests an emphasis on adaptability and iterative refinement. Do you think this reflective reasoning will outperform traditional LLMs in dynamic scenarios like crisis management or multi-step decision-making?



6/11
@geetkhosla
Do you think it’s novel for applications, or is it similar to o1 from OAI?



7/11
@tricalt
Any implementation provided or?



8/11
@BensenHsu
The paper discusses the development of a large reasoning model called Marco-o1, which aims to improve the reasoning capabilities of language models beyond traditional tasks like mathematics, physics, and coding. The researchers want to explore whether large language models can effectively solve open-ended problems where clear standards are absent and rewards are challenging to quantify.

The results show that the Marco-o1 models with MCTS (step, mini-step of 64 tokens, and mini-step of 32 tokens) outperform the Qwen2-7B-Instruct model and the Marco-o1-CoT model on the MGSM (English) and MGSM (Chinese) datasets. The researchers also demonstrate the model's superior performance in translating colloquial and slang expressions compared to Google Translate.

full paper: Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions





9/11
@AngelAITalk
The emphasis on open-ended solutions with mechanisms like reflection and innovative reasoning strategies in Marco-o1 is an interesting development.



10/11
@DeployAITool
Interesting paper



11/11
@filoynavaja
Probando…












1/7
@rohanpaul_ai
Marco-o1 combines Monte Carlo Tree Search (MCTS) and reflection mechanisms to solve open-ended reasoning tasks

This framework explores multiple reasoning paths to find optimal solutions in ambiguous scenarios

Original Problem 🤔:

OpenAI's o1 showed strong reasoning in math and coding tasks with clear answers. But handling open-ended problems where solutions aren't clear-cut remains challenging for LLMs.

-----

Solution in this Paper 🛠️:

→ Marco-o1 enhances reasoning through Chain-of-Thought finetuning on filtered o1 datasets.

→ It implements Monte Carlo Tree Search (MCTS) with varying granularity levels - full steps and mini-steps of 32/64 tokens.

→ The model calculates confidence scores using softmax-applied log probabilities of top tokens.

→ A reflection mechanism prompts self-criticism after each reasoning step.

→ The system integrates these components while maintaining flexibility for both structured and open-ended tasks.

-----

Key Insights 💡:

→ Monte Carlo Tree Search (MCTS) with different granularity levels explores solution spaces more effectively

→ Self-reflection mechanisms can correct approximately 50% of initially wrong answers

→ The model shows strong cross-lingual capabilities, especially in handling colloquial expressions

-----

Results 📊:

→ +6.17% accuracy improvement on MGSM English dataset

→ +5.60% accuracy boost on MGSM Chinese dataset

→ 90.40% accuracy achieved with step-level Monte Carlo Tree Search (MCTS)

→ Superior translation quality compared to Google Translate on colloquial expressions





2/7
@rohanpaul_ai
🔄 The role of reflection mechanism in improving model performance

The reflection mechanism adds "Wait! Maybe I made some mistakes! I need to rethink from scratch" after each thought process. This self-criticism helped correct approximately 50% of previously incorrect solutions on difficult problems.
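A minimal sketch of how such a reflection step can be wired around any completion function; only the reflection phrase comes from the paper, the prompt layout and control flow are assumptions:

```python
# Reflection loop: append the reflection phrase so the model critiques
# and redoes its own reasoning. `generate` is any prompt -> completion callable.
REFLECTION = "Wait! Maybe I made some mistakes! I need to rethink from scratch."

def reflect_and_retry(generate, problem: str, draft: str) -> str:
    """Prompt the model to self-criticize its draft answer and try again."""
    prompt = f"Problem: {problem}\nAttempt: {draft}\n{REFLECTION}\n"
    return generate(prompt)

# Usage: revised = reflect_and_retry(my_llm, "How many legos remain?", first_answer)
```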





3/7
@rohanpaul_ai
⚙️ Marco-o1 introduces varying granularity in Monte Carlo Tree Search (MCTS) actions:

- Step-level actions for complete reasoning steps

- Mini-step actions (32 or 64 tokens) for finer-grained exploration

- Confidence scoring using softmax-applied log probabilities of top-k tokens

- Reward calculation based on average confidence across tokens





4/7
@rohanpaul_ai
📚 https://arxiv.org/pdf/2411.14405v1



5/7
@gdbsm1
@Readwise save thread



6/7
@TheJohnEgan
fremeworks are cool



7/7
@UristaTim
Combining MCTS with reflection could lead to some innovative problem-solving strategies.

Open-ended reasoning is crucial. This could address complex real-world problems effectively.




 












1/11
@TheTuringPost
The freshest AI/ML research of the week, part 1

▪️ New AI Model Gemini Experimental 1114 Debuts On Google AI Studio
▪️ CamemBERT 2.0
▪️ Qwen2.5-Coder Series
▪️ Llava-o1
▪️ LLMs Can Self-Improve In Long-Context Reasoning
▪️ Direct Preference Optimization Using Sparse Feature-Level Constraints
▪️ Cut Your Losses In Large-Vocabulary Language Models
▪️ SPARSING LAW

🧵





2/11
@TheTuringPost
1. New AI Model Gemini Experimental 1114 Debuts On Google AI Studio

Demonstrates strong reasoning skills with a 32k context window, outperforming competitors on benchmarks, despite slower problem-solving speed.

[Quoted tweet]
gemini-exp-1114…. available in Google AI Studio right now, enjoy : )

aistudio.google.com


3/11
@TheTuringPost
2. CamemBERT 2.0: A Smarter French Language Model Aged to Perfection

Tackles concept drift in French NLP with improved tokenization, excelling in QA and domain-specific tasks like biomedical NER.

[2411.08868] CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
Open models: almanach (ALMAnaCH (Inria))





4/11
@TheTuringPost
3. Qwen2.5-Coder Series: Powerful, Diverse, Practical

Excels in coding and multi-language repair tasks, rivaling GPT-4o in 40+ programming languages with open innovation for developers.

Qwen2.5-Coder Series: Powerful, Diverse, Practical.





5/11
@TheTuringPost
4. Llava-o1: Let Vision Language Models Reason Step-By-Step

Enhances multimodal reasoning through structured, multi-stage processes, achieving superior benchmark performance.

[2411.10440] LLaVA-o1: Let Vision Language Models Reason Step-by-Step

[Quoted tweet]
LLaVA-o1 is a smarter Vision-Language Model (VLM) that thinks step-by-step.

Instead of jumping to answers, it divides reasoning into 4 clear stages and uses stage-level beam search to generate multiple answers and select the best one for each stage.

Here's how it works:




6/11
@TheTuringPost
5. Large Language Models Can Self-Improve In Long-Context Reasoning

Uses self-improvement via ranking model outputs (SeaLong approach), improving performance in long-context reasoning tasks without external datasets.

[2411.08147] Large Language Models Can Self-Improve in Long-context Reasoning
GitHub: GitHub - SihengLi99/SEALONG: Large Language Models Can Self-Improve in Long-context Reasoning





7/11
@TheTuringPost
6. Direct Preference Optimization Using Sparse Feature-Level Constraints

Introduces method that improves alignment efficiency in LLMs and reduces computational overhead, using sparse autoencoders and feature constraints.

[2411.07618] Direct Preference Optimization Using Sparse Feature-Level Constraints





8/11
@TheTuringPost
7. Cut Your Losses In Large-Vocabulary Language Models

Proposes Cut Cross-Entropy (CCE) method that reduces memory use for large-scale training, enabling up to 10x larger batch sizes without sacrificing performance.

[2411.09009] Cut Your Losses in Large-Vocabulary Language Models
GitHub: GitHub - apple/ml-cross-entropy





9/11
@TheTuringPost
8. SPARSING LAW: Towards Large Language Models With Greater Activation Sparsity

Explores neuron sparsity in LLMs to enhance efficiency while preserving interpretability.

[2411.02335] Sparsing Law: Towards Large Language Models with Greater Activation Sparsity
GitHub: GitHub - thunlp/SparsingLaw: The open-source materials for paper "Sparsing Law: Towards Large Language Models with Greater Activation Sparsity".





10/11
@TheTuringPost
9. Find a complete list of the latest research papers in our free weekly digest: 🌁#76: Rethinking Scaling Laws (when plateau is actually a fork)



11/11
@TheTuringPost
10. Follow @TheTuringPost for more.

Like/repost the 1st post to support our work 🤍

Also, elevate your AI game with our free newsletter ↓
Turing Post












1/3
@rohanpaul_ai
LLaVA-o1 teaches machines to think step-by-step like humans when analyzing images.

LLaVA-o1 introduces a novel approach to enhance Vision Language Models (VLMs) by implementing structured, multi-stage reasoning. This paper tackles the challenge of systematic reasoning in visual tasks by breaking down the process into distinct stages: summary, caption, reasoning, and conclusion.

-----

🤔 Original Problem:

Current VLMs struggle with systematic reasoning and often produce errors or hallucinated outputs during complex visual question-answering tasks. They lack structured thinking processes and tend to jump to conclusions without proper analysis.

-----

🛠️ Solution in this Paper:

→ LLaVA-o1 implements a 4-stage reasoning process with dedicated tags for each stage: summary, caption, reasoning, and conclusion.

→ The model uses supervised fine-tuning on a new LLaVA-o1-100k dataset, created using GPT-4o for structured reasoning annotations.

→ A stage-level beam search method generates multiple candidates at each reasoning stage, selecting the best one to continue.

→ Training is performed on a single node with 8 H100 GPUs, combining samples from both general VQA and science-targeted datasets.
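
A minimal sketch of the stage-level beam search from the third bullet above; the tag format, scorer, and function names are illustrative assumptions, not the LLaVA-o1 code:

```python
# Stage-level beam search: at each tagged stage, sample several candidate
# continuations, keep only the best, then move to the next stage.
STAGES = ["summary", "caption", "reasoning", "conclusion"]

def stage_level_beam_search(sample_stage, score, question, n_candidates=4):
    """sample_stage(context, stage) -> candidate text; score(...) -> float."""
    context = question
    for stage in STAGES:
        # Sample candidates for this stage only, not the whole answer.
        candidates = [sample_stage(context, stage) for _ in range(n_candidates)]
        best = max(candidates, key=lambda c: score(context, stage, c))
        context += f"\n<{stage.upper()}>{best}</{stage.upper()}>"  # commit, continue
    return context
```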

-----

💡 Key Insights:

→ Structured reasoning stages help models organize thoughts before reaching conclusions

→ Special tags for each stage maintain clarity throughout the reasoning process

→ Stage-level beam search is more effective than sentence-level or best-of-N approaches

-----

📊 Results:

→ Outperforms base model by 8.9% on multimodal reasoning benchmarks

→ Surpasses larger models including Gemini-1.5-pro and GPT-4o-mini

→ Stage-level beam search improves MMVet score from 60.3% to 62.9%





2/3
@rohanpaul_ai
Paper Title: "LLaVA-o1: Let Vision Language Models Reason Step-by-Step"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1860466160707469312/pu/vid/avc1/1080x1080/mAeNIFuBt10AwrXP.mp4

3/3
@rohanpaul_ai
[2411.10440] LLaVA-o1: Let Vision Language Models Reason Step-by-Step







1/1
@jreuben1
LLaVA-o1 (Let Vision Language Models Reason Step-by-Step) introduces an inference-time stage-level beam search method, which enables effective inference-time scaling.










1/10
@Gradio
LLaVA-o1 is the first visual language model capable of spontaneous, systematic reasoning, similar to GPT-o1!

🤯 11B model outperforms Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks.





2/10
@Gradio
LlaVA-o1

Stay tuned for the code and gradio app release.
GitHub - PKU-YuanGroup/LLaVA-o1



3/10
@NNaumovsky
@threadreaderapp unroll



4/10
@threadreaderapp
@NNaumovsky Namaste, please find the unroll here: Thread by @Gradio on Thread Reader App Enjoy 🙂 🤖



5/10
@CohorteAI
"LLaVA-o1’s success on multimodal benchmarks suggests it’s mastering the integration of vision and language. Could this pave the way for models capable of deeper real-world contextual understanding, like AR-enhanced assistants?



6/10
@hanul93
WOW



7/10
@arya_mukhlis354
amazing



8/10
@txhno
which image decoder does it use?



9/10
@matthaeus_win
I thought every model based on Llama 3 has to have 'Llama' in the name.. 👀



10/10
@wuwenjie1992


[Quoted tweet]
LLaVA-o1, released by Yuan Li's group at Peking University's School of Information Engineering, is the first vision-language model capable of spontaneous, systematic reasoning, similar to GPT-o1!
⚙ The model first outlines the problem, explains the relevant information in the image, reasons step by step, and finally reaches a well-supported conclusion.
🤯 The 11B model outperforms Gemini-1.5-Pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct on six multimodal benchmarks.





 


About​


My name is Grant Sanderson. Videos here cover a variety of topics in math, or adjacent fields like physics and CS, all with an emphasis on visualizing the core ideas. The goal is to use animation to help elucidate and motivate otherwise tricky topics, and for difficult problems to be made simple with changes in perspective. For more information, other projects, FAQs, and inquiries see the website:

But what is a neural network? | Deep learning chapter 1




Large Language Models explained briefly
 










1/12
@rohanpaul_ai
Type a sentence, get any sound - from talking cats to singing saxophones. Brilliant release by NVIDIA

✨ NVIDIA just unveiled Fugatto, a groundbreaking 2.5B parameter audio AI model that can generate and transform any combination of music, voices, and sounds using text prompts and audio inputs

Fugatto could ultimately allow developers and creators to bring sounds to life simply by inputting text prompts.

→ The model demonstrates unique capabilities like creating hybrid sounds (trumpet barking), changing accents/emotions in voices, and allowing fine-grained control over sound transitions - trained on millions of audio samples using 32 NVIDIA H100 GPUs

👨‍🔧 Architecture

Built as a foundational generative transformer model leveraging NVIDIA's previous work in speech modeling and audio understanding. The training process involved creating a specialized blended dataset containing millions of audio samples

→ ComposableART's Innovation in Audio Control

Introduces a novel technique allowing combination of instructions that were only seen separately during training. Users can blend different audio attributes and control their intensity

→ Temporal Interpolation Capabilities

Enables generation of evolving soundscapes with precise control over transitions. Can create dynamic audio sequences like rainstorms fading into birdsong at dawn

→ Processes both text and audio inputs flexibly, enabling tasks like removing instruments from songs or modifying specific audio characteristics while preserving others

→ Shows capabilities beyond its training data, creating entirely new sound combinations through interaction between different trained abilities

🔍 Real-world Applications

→ Allows rapid prototyping of musical ideas, style experimentation, and real-time sound creation during studio sessions

→ Enables dynamic audio asset generation matching gameplay situations, reducing pre-recorded audio requirements

→ Can modify voice characteristics for language learning applications, allowing content delivery in familiar voices

@NVIDIAAIDev



https://video.twimg.com/ext_tw_video/1861123021983096837/pu/vid/avc1/1280x720/2cD4kuUZUpyj6qdc.mp4

2/12
@rohanpaul_ai
→ Creates a massive dataset (20M+ rows, ~330 years of audio) by combining multiple open source datasets and using LLMs to generate rich descriptions and instructions





3/12
@rohanpaul_ai
A sample from Fugatto's official page.

Fugatto is a framework for audio synthesis and transformation given text instructions and optional audio inputs.

The framework includes the generative model Fugatto, a dataset creation technique that exploits relationships between audio and text, and a method for controlling and composing instructions, including from different models, called ComposableART.



https://video.twimg.com/ext_tw_video/1861396412816334848/pu/vid/avc1/1026x514/DFHmk3iZoMYGS8fe.mp4

4/12
@rohanpaul_ai






5/12
@rohanpaul_ai
→ Optimal Transport Conditional Flow Matching

Trains using OT-CFM objective with a T5-based transformer architecture and adaptive layer normalization
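
For reference, a minimal sketch of an OT-CFM training objective in the standard Lipman et al. formulation; the model call is a stand-in for Fugatto's T5-style transformer, and sigma_min is an assumed hyperparameter:

```python
# OT-CFM step: interpolate noise x0 toward data x1 along a straight path
# and regress the model's predicted velocity onto the constant target.
import torch

def ot_cfm_loss(model, x1, cond, sigma_min=1e-4):
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.shape[0], device=x1.device)    # per-example time in [0, 1)
    t_ = t.view(-1, *([1] * (x1.dim() - 1)))         # broadcast over feature dims
    x_t = (1 - (1 - sigma_min) * t_) * x0 + t_ * x1  # point on the OT path
    target = x1 - (1 - sigma_min) * x0               # target velocity
    v = model(x_t, t, cond)                          # stand-in for the transformer
    return torch.mean((v - target) ** 2)
```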





6/12
@GuitarGeorge6
Where is it hosted?



7/12
@rohanpaul_ai
afaik they didn't announce when — or if — the tool will be widely available.



8/12
@rohanpaul_ai
Now Hear This: World’s Most Flexible Sound Machine Debuts

https://d1qx31qr3h6wln.cloudfront.net/publications/FUGATTO.pdf



9/12
@xJOSHUAJOSHUAx
is it open source?



10/12
@rohanpaul_ai
afaik they didn't announce when — or if — the tool will be widely available.



11/12
@hckinz
Wow, where is the huggingface space?



12/12
@rohanpaul_ai
not yet







1/1
@theaitechsuite
🗞️🗞️🗞️Nvidia introduces Fugatto, an AI-driven music editor that uniquely blends sounds—creating features like barking trumpets and meowing saxophones beyond its training data.

Read more: Nvidia's new AI music tool creates barking trumpets, meowing saxophones







1/1
@lRichBl
x-post:
Tech News Today 🚨

Nvidia unveils Fugatto: A suite of AI audio tools that's like a "Swiss Army Knife" for sound editing and creation. 🎶

Samsung Galaxy S25 Ultra leaks: Hands-on video reveals a sleek design and impressive camera setup. 📷

AI bias in hiring: Study shows AI overwhelmingly favors white and male candidates in resume screening. 😟

Google Chrome at risk? Regulators may force Google to sell its browser due to antitrust concerns. 🌐
#technology #news #AI #Nvidia #Samsung #Google







1/1
@technewsworld
Nvidia Reveals ‘Swiss Army Knife’ of AI Audio Tools: Fugatto... The new AI model can generate or transform any mix of music, voices, and sounds described with prompts using any combination of text and audio files. Nvidia Reveals 'Swiss Army Knife' of AI Audio Tools: Fugatto

















1/12
@AndrewCurran_
NVIDIA has built a 2.5 billion parameter audio model called Fugatto that generates music, voice, and sound from text and audio input. Sound inputs become completely mutable. It can change a piano line to a human voice singing, or make 'a trumpet bark or a saxophone meow.'





2/12
@AndrewCurran_
Using a feature called temporal interpolation, Fugatto can 'create the sounds of a rainstorm moving through an area with crescendos of thunder that slowly fade into the distance. It also gives users fine-grained control over how the soundscape evolves.'





3/12
@AndrewCurran_
YouTube:
https://invidious.poast.org/qj1Sp8He6e4?si=q9_b9ns1JYMZSbwI



4/12
@AndrewCurran_
Now Hear This: World’s Most Flexible Sound Machine Debuts



5/12
@AndrewCurran_






6/12
@AndrewCurran_
Great demo.





7/12
@ericreator
wen



8/12
@AndrewCurran_
No release date yet.



9/12
@fkthefeed
Is there a repo for this?



10/12
@AndrewCurran_
No public release yet unfortunately, and not even a prospective date for one.



11/12
@AviSchiffmann
Can it do the opposite?



12/12
@AndrewCurran_
Yes, from the demo it seems to be omnidirectional, any sound to any other type of sound.




 



Now Hear This: World’s Most Flexible Sound Machine Debuts​


Using text and audio as inputs, a new generative AI model from NVIDIA can create any combination of music, voices and sounds.

November 25, 2024 by Richard Kerris



A team of generative AI researchers created a Swiss Army knife for sound, one that allows users to control the audio output simply using text.

While some AI models can compose a song or modify a voice, none have the dexterity of the new offering.

Called Fugatto (short for Foundational Generative Audio Transformer Opus 1), it generates or transforms any mix of music, voices and sounds described with prompts using any combination of text and audio files.

For example, it can create a music snippet based on a text prompt, remove or add instruments from an existing song, change the accent or emotion in a voice — even let people produce sounds never heard before.

“This thing is wild,” said Ido Zmishlany, a multi-platinum producer and songwriter — and cofounder of One Take Audio, a member of the NVIDIA Inception program for cutting-edge startups. “Sound is my inspiration. It’s what moves me to create music. The idea that I can create entirely new sounds on the fly in the studio is incredible.”


A Sound Grasp of Audio

“We wanted to create a model that understands and generates sound like humans do,” said Rafael Valle, a manager of applied audio research at NVIDIA and one of the dozen-plus people behind Fugatto, as well as an orchestral conductor and composer.

Supporting numerous audio generation and transformation tasks, Fugatto is the first foundational generative AI model that showcases emergent properties — capabilities that arise from the interaction of its various trained abilities — and the ability to combine free-form instructions.

“Fugatto is our first step toward a future where unsupervised multitask learning in audio synthesis and transformation emerges from data and model scale,” Valle said.


A Sample Playlist of Use Cases


For example, music producers could use Fugatto to quickly prototype or edit an idea for a song, trying out different styles, voices and instruments. They could also add effects and enhance the overall audio quality of an existing track.

“The history of music is also a history of technology. The electric guitar gave the world rock and roll. When the sampler showed up, hip-hop was born,” said Zmishlany. “With AI, we’re writing the next chapter of music. We have a new instrument, a new tool for making music — and that’s super exciting.”

An ad agency could apply Fugatto to quickly target an existing campaign for multiple regions or situations, applying different accents and emotions to voiceovers.

Language learning tools could be personalized to use any voice a speaker chooses. Imagine an online course spoken in the voice of any family member or friend.

Video game developers could use the model to modify prerecorded assets in their title to fit the changing action as users play the game. Or, they could create new assets on the fly from text instructions and optional audio inputs.


Making a Joyful Noise

“One of the model’s capabilities we’re especially proud of is what we call the avocado chair,” said Valle, referring to a novel visual created by a generative AI model for imaging.

For instance, Fugatto can make a trumpet bark or a saxophone meow. Whatever users can describe, the model can create.

With fine-tuning and small amounts of singing data, researchers found it could handle tasks it was not pretrained on, like generating a high-quality singing voice from a text prompt.


Users Get Artistic Controls


Several capabilities add to Fugatto’s novelty.

During inference, the model uses a technique called ComposableART to combine instructions that were only seen separately during training. For example, a combination of prompts could ask for text spoken with a sad feeling in a French accent.

The model’s ability to interpolate between instructions gives users fine-grained control over text instructions, in this case the heaviness of the accent or the degree of sorrow.

“I wanted to let users combine attributes in a subjective or artistic way, selecting how much emphasis they put on each one,” said Rohan Badlani, an AI researcher who designed these aspects of the model.

“In my tests, the results were often surprising and made me feel a little bit like an artist, even though I’m a computer scientist,” said Badlani, who holds a master’s degree in computer science with a focus on AI from Stanford.

The model also generates sounds that change over time, a feature he calls temporal interpolation. It can, for instance, create the sounds of a rainstorm moving through an area with crescendos of thunder that slowly fade into the distance. It also gives users fine-grained control over how the soundscape evolves.

Plus, unlike most models, which can only recreate the training data they’ve been exposed to, Fugatto allows users to create soundscapes it’s never seen before, such as a thunderstorm easing into a dawn with the sound of birds singing.
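
Mechanically, this kind of instruction composition and interpolation is usually done with classifier-free-guidance-style weighting. The sketch below illustrates that general idea with assumed function names; it is not NVIDIA's implementation:

```python
# Compose multiple instructions by adding weighted guidance directions to an
# unconditional prediction; the weights act as the "emphasis" sliders.
import torch

def composed_velocity(model, x_t, t, conds, weights, null_cond):
    v_null = model(x_t, t, null_cond)  # unconditional prediction
    v = v_null.clone()
    for cond, w in zip(conds, weights):
        v = v + w * (model(x_t, t, cond) - v_null)  # weighted guidance per instruction
    return v

# e.g. conds = [embed("sad voice"), embed("French accent")], weights = [1.2, 0.8]
```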


A Look Under the Hood


Fugatto is a foundational generative transformer model that builds on the team’s prior work in areas such as speech modeling, audio vocoding and audio understanding.

The full version uses 2.5 billion parameters and was trained on a bank of NVIDIA DGX systems packing 32 NVIDIA H100 Tensor Core GPUs.

Fugatto was made by a diverse group of people from around the world, including India, Brazil, China, Jordan and South Korea. Their collaboration made Fugatto’s multi-accent and multilingual capabilities stronger.

One of the hardest parts of the effort was generating a blended dataset that contains millions of audio samples used for training. The team employed a multifaceted strategy to generate data and instructions that considerably expanded the range of tasks the model could perform, while achieving more accurate performance and enabling new tasks without requiring additional data.

They also scrutinized existing datasets to reveal new relationships among the data. The overall work spanned more than a year.

Valle remembers two moments when the team knew it was on to something. “The first time it generated music from a prompt, it blew our minds,” he said.

Later, the team demoed Fugatto responding to a prompt to create electronic music with dogs barking in time to the beat.

“When the group broke up with laughter, it really warmed my heart.”

Hear what Fugatto can do:

 



1/2
@rohanpaul_ai
OuteTTS-0.2-500M, a 500M parameter text-to-speech model just released by @OuteAI .

Built on Qwen-2.5-0.5B, trained on over 5 billion audio prompt tokens with multilingual capabilities for English, Chinese, Japanese, and Korean.

→ The model offers improved voice cloning, natural speech synthesis, and enhanced prompt following accuracy compared to its previous version, utilizing audio prompts without architectural modifications.

→ The model feeds audio prompts directly into Qwen-2.5-0.5B

→ Training utilized three major datasets - Emilia-Dataset, LibriTTS-R, and Multilingual LibriSpeech, creating a diverse foundation for voice synthesis.

→ Voice Cloning Mechanics

Requires 10-15 second audio samples with accurate transcriptions. Context length of 4096 tokens enables ~54 seconds of audio generation.

⚙️ Technical Deep-Dive

→ Implements flash attention 2.0 and bfloat16 precision, showing careful consideration for inference speed and memory usage.

→ Context Window Management

Audio generation capacity reduces proportionally when including speaker profiles, demonstrating intelligent resource allocation.
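
A quick back-of-envelope sketch of that budget, using only the figures above; the 15-second profile and the decision to ignore text-prompt tokens are simplifying assumptions:

```python
# 4096 context tokens cover ~54 s of audio, i.e. roughly 76 audio tokens/s;
# a speaker profile consumes part of that window proportionally.
CTX_TOKENS = 4096
MAX_AUDIO_SECONDS = 54.0
TOKENS_PER_SECOND = CTX_TOKENS / MAX_AUDIO_SECONDS  # ~75.9

def remaining_audio_seconds(profile_seconds: float) -> float:
    """Seconds of generation left after a speaker profile (text tokens ignored)."""
    used = profile_seconds * TOKENS_PER_SECOND
    return (CTX_TOKENS - used) / TOKENS_PER_SECOND

print(f"{remaining_audio_seconds(15.0):.1f} s of audio left after a 15 s profile")
```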



https://video.twimg.com/ext_tw_video/1861432382576119808/pu/vid/avc1/1920x1080/nEGoZpESwC2B2C8w.mp4

2/2
@rohanpaul_ai
OuteAI/OuteTTS-0.2-500M · Hugging Face










1/8
@AIWarper
On the topic of TTS from yesterday

Another new contender to try out OuteTTS v0.2 - 500M

(will run on a hamster wheel for all you 8gb enjoyors)

Multilingual - English, Chinese, Korean & Japanese ✅
Zero-shot voice cloning ✅



https://video.twimg.com/ext_tw_video/1861432103227006980/pu/vid/avc1/1280x720/ryHdWl08xirpqiO7.mp4

2/8
@AIWarper
OuteAI/OuteTTS-0.2-500M · Hugging Face



3/8
@DreamStarter_1
Have you seen anything similar to elevenlabs voice creation feature?
Cloning voices is fine but... I'd like to create new voices... I don't want any problems 🙂



4/8
@AIWarper
Off the top of my head no I don't. Most offer just pretrained unique voices but the ability to create your own from scratch I am unaware of.

Interesting idea though..... surely it exists?



5/8
@AIWarper
GitHub - edwko/OuteTTS: Interface for OuteTTS models.



6/8
@ZAswanth
@OpenInterpreter



7/8
@Notregularuser2
This it’s getting better and better



8/8
@Notregularuser2
👀🐋










1/11
@reach_vb
Smol TTS keeps getting better! Introducing OuteTTS v0.2 - 500M parameters, multilingual with voice cloning! 🔥

> Multilingual - English, Chinese, Korean & Japanese
> Cross platform inference w/ llama.cpp
> Zero-shot voice cloning
> Trained on 5 Billion audio tokens
> Qwen 2.5 0.5B LLM backbone
> Trained via HF GPU grants

Model weights on the hub, you can even run this on a Raspberry Pi! Go run, inference now! 🐐



https://video.twimg.com/ext_tw_video/1861158412664373248/pu/vid/avc1/1280x720/jB_4nlzY1nPP3LWz.mp4

2/11
@reach_vb
Check out the model weights and inference code base here:

OuteAI/OuteTTS-0.2-500M · Hugging Face



3/11
@reach_vb
llama.cpp compatible GGUFs:

OuteAI/OuteTTS-0.2-500M-GGUF · Hugging Face



4/11
@reach_vb
OuteTTS GitHub:

GitHub - edwko/OuteTTS: Interface for OuteTTS models.



5/11
@umeshonai
This is improving so fast that I don't want to speak myself anymore. Just use this and get done 🤖



6/11
@0xKyon
this is very good!



7/11
@TheRobKennedy
Very nice 👌🏻



8/11
@TommyFalkowski
Just tested it out and the quality is very good. More importantly, the fact that you can generate speaker profiles is awesome! Will test it out some more and add it to my growing list of supported tts engines in my app 🤣



9/11
@JulienBlanchon
Pretty interested to know the overall cost of training



10/11
@thedigitaldr
Are you saying you can voice CLONE on a R-Pi? Is that what you're saying????



11/11
@Fronesis_ai
Thank you for your work and for sharing insights! 🙌
Advancements like OuteTTS v0.2 showcase the rapid evolution of AI and its potential to empower global communities. 🚀
The future of #AI is bright, and collaborative innovation is key to unlocking its full potential!




 





1/4
@rohanpaul_ai
Genetic algorithm meets LLM reasoning and Zero-shot prompt recovery to reverse-engineer prompts from outputs.

Reverse Prompt Engineering (RPE) reconstructs original prompts from just 5 LLM outputs without accessing model internals.

Making LLMs work backwards: from answers to questions.

Original Problem 🤔:

Inferring the original prompt from LLM outputs is challenging, especially in black-box settings where only text outputs are available. Previous methods require extensive resources (64+ outputs) and often need access to internal model parameters.

-----

Solution in this Paper 🛠️:

→ Introduces Reverse Prompt Engineering (RPE), a zero-shot method using the target LLM's reasoning to reconstruct prompts from just 5 outputs

→ Employs a three-stage approach: One Answer One Shot (RPE1A1S) for basic inference, Five Answers Inference (RPE5A5S) for enhanced accuracy using multiple responses

→ Implements RPE-GA, an iterative optimization inspired by genetic algorithms that progressively refines candidate prompts through multiple iterations

→ Uses ROUGE-1 scores and cosine similarity to evaluate and select the best candidate prompts
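
A minimal sketch of that genetic-algorithm loop; generate, similarity (e.g. ROUGE-1 or embedding cosine), and the crossover prompt are illustrative stand-ins, not the paper's exact procedure:

```python
# RPE-GA sketch: score candidate prompts by how closely the target LLM's
# outputs match the observed outputs, keep the fittest, and ask the LLM
# to merge the survivors into new candidates.
def rpe_ga(generate, similarity, observed_outputs, candidates, iterations=5, keep=3):
    """generate: prompt -> target-LLM output; similarity: text pair -> float."""
    def fitness(prompt):
        out = generate(prompt)
        return sum(similarity(out, obs) for obs in observed_outputs)

    for _ in range(iterations):
        parents = sorted(candidates, key=fitness, reverse=True)[:keep]
        # LLM-driven crossover: merge surviving prompts into a new candidate.
        merge = "Combine these candidate prompts into one improved prompt:\n" + "\n".join(parents)
        candidates = parents + [generate(merge) for _ in range(keep)]
    return max(candidates, key=fitness)
```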

-----

Key Insights from this Paper 💡:

→ Black-box prompt recovery is possible with minimal resources (5 outputs vs 64 required by previous methods)

→ Using multiple outputs reduces overemphasis on specific response details

→ Genetic algorithm-based optimization significantly improves prompt recovery accuracy

→ Zero-shot approach eliminates need for training data or additional model training

-----

Results 📊:

→ Outperforms state-of-the-art by 5.2% in cosine similarity across different embedding models

→ Achieves 2.3% higher similarity with ada-002 embeddings

→ Shows 8.1% improvement with text-embedding-3-large

→ Maintains slightly lower ROUGE-1 scores (-1.6%) while generating more natural prompts





2/4
@rohanpaul_ai
Paper Title: "Reverse Prompt Engineering"

Generated below podcast on this paper with Google's Illuminate.



https://video.twimg.com/ext_tw_video/1861513863939936256/pu/vid/avc1/1080x1080/WR8qgprGfUH0SC8N.mp4

3/4
@rohanpaul_ai
[2411.06729] Reverse Prompt Engineering



4/4
@rohanpaul_ai







 


1/1
@rohanpaul_ai
A super useful blog.

"7 examples of Gemini’s multimodal capabilities in action"

1. Detailed Image Descriptions - Can analyze and describe images, adjusting style and format based on prompts

2. Long PDF Understanding - Processes 1000+ page PDFs, including tables, layouts, charts, diagrams, and handwritten text

3. Real World Document Reasoning - Extracts information from receipts, labels, signs, notes, and whiteboard sketches

4. Webpage Data Extraction - Extracts structured data from webpage screenshots, including text and visual content

5. Object Detection - Detects objects and generates bounding box coordinates in images

6. Video Summarization - Processes 90-minute videos, generating transcripts, summaries, and answering questions

7. Video Information Extraction - Extracts structured data from videos for cataloging and entity detection, though currently limited by 1 FPS sampling
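
A minimal sketch of capability #1 (image description), assuming the google-generativeai Python SDK; the model name, file, and prompt are illustrative:

```python
# Send an image plus a text instruction to Gemini and print the description.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="YOUR_API_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")

image = Image.open("receipt.jpg")
response = model.generate_content(
    [image, "Describe this image, then list each line item and its price."]
)
print(response.text)
```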

[Quoted tweet]
7 examples of Gemini's multimodal capabilities in action (with code and prompts) 🤯🧵





 



1/5
@omarsar0
o1 Replication Journey - Part 2

Shows that combining simple distillation from O1's API with supervised fine-tuning significantly boosts performance on complex math reasoning tasks.

"A base model fine-tuned on simply tens of thousands of samples O1-distilled long-thought chains outperform o1-preview on the American Invitational Mathematics Examination (AIME) with minimal technical complexity."





2/5
@omarsar0
[2411.16489] O1 Replication Journey -- Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson?



3/5
@BensenHsu
This paper presents a critical examination of current approaches to replicating OpenAI's O1 model capabilities, with a particular focus on the widespread but often undisclosed use of knowledge distillation techniques.

The authors show that a base model fine-tuned on simply tens of thousands of O1-distilled long-thought chain samples can outperform O1-preview on the AIME with minimal technical complexity. Moreover, their investigation extends beyond mathematical reasoning to explore the generalization capabilities of O1-distilled models across diverse tasks, including hallucination, safety, and open-domain QA.

full paper: O1 Replication Journey – Part 2: Surpassing O1-preview through Simple Distillation Big Progress or Bitter Lesson?





4/5
@AngelAITalk
This approach could open doors to more efficient AI solutions for advanced problems.



5/5
@JeffreyH630
That's really interesting, Elvis!

It’s amazing to see how combining different methods can yield such impressive results.

Looking forward to Part 3!




 