The A.I Megathread (LLM , GPT , Development)

bnew · Jan 15, 2025

1/11
@synth_labs
Ever watched someone solve a hard math problem?

Their first attempt is rarely perfect. They sketch ideas, cross things out, and try new angles.

This process of exploration is key to human reasoning and our latest research formalizes this as Meta Chain-of-Thought

(1/8)

https://video.twimg.com/ext_tw_video/1879277172197838848/pu/vid/avc1/1080x1620/UV5F7QCG9DfLW3tJ.mp4

2/11
@synth_labs
In “Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought” we show that reasoning models—like @OpenAI o1—exhibit behavior resembling Meta-CoT, & propose a training pipeline to achieve similar capabilities

More details here:
Towards System 2 Reasoning in LLMs: Learning How To Think

3/11
@synth_labs
When a mathematician solves a problem, they don’t just do it in one straight go. They perform multiple trial and error stages, learning from prior mistakes. We call this true data-generating process behind how these problems are solved Meta-CoT.

Modern LLMs struggle with advanced reasoning problems because their training data consists of direct solutions, but not the actual process of how they were solved.

4/11
@synth_labs
We found that current advanced reasoning LLMs like o1 and Gemini Thinking exhibit this behavior of Meta CoT as in-context search. They explore and try different approaches to advanced problems before giving a final solution.

5/11
@synth_labs
We show how Meta-CoTs can be constructed synthetically via online search, and that teaching a model to produce Meta-CoTs is a meta-RL problem, since we are now teaching it how to approach solving a problem rather than just what to do.

6/11
@synth_labs
Towards the goal of training reasoning models, we have been collecting a “Big Math” dataset of 500k diverse, verifiable math problems and have been developing training infrastructure built on NeoX to support online RL with search

[Quoted tweet]
Open research on reasoning is bottlenecked on several critical fronts—which we're tackling head on

Post-training infra: Distributed scalable online training, inference, search & RL infra built on @AiEleuther's GPT-NeoX

"shooting for SOTA"—@rm_rafailov

Interested? DM me

!

7/11
@synth_labs
This is an large, ongoing effort with a great team @synth_labs & collaborators doing open research

@rm_rafailov @sea_snell @gandhikanishk @Anikait_Singh_ @NathanThinks @ZiyuX @dmayhem93 @ChaseBlagden @jphilipp95 @lcastricato @PhungVanDuy1 @AlbalakAlon @nickhaber @chelseabfinn

8/11
@synth_labs
If you are interested in working or collaborating with us, reach out (links in blog post)

Arxiv PDF: https://arxiv.org/pdf/2501.04682

Blog post: Towards System 2 Reasoning in LLMs: Learning How To Think

9/11
@DeepwriterAI
in my experience, these reasoning models are currently good with CoT in the weeds and less so as you generalize at a bird's eye view whereas the more general models (non reasoning) handle larger contexts and can spread their attention more evenly over the whole for a bird's eyes view. Reasoning models follow a narrow path and lose the script with very large contexts.

In fact, they didn't even bother pretending with Gemini flash 2.0-thinking and it's context is limited under 50k tokens. o1 and o1 pro allow more, but they don't do as well with larger contexts that end up competing with reasoning tokens that also get added.

Therefore a general model still is better as the highest level orchestrator and the reasoning models are best when a specific hard problem needs to be solved that the general model passes it off to.

So does that make the general model orchestrator a "Meta-Meta Chain of Thought" agent?

10/11
@CryptosBandoo
Still waiting for your predictions 2025+?

11/11
@FutbolmeAI
Your analogy of solving math problems resonates with me, reminds me of debugging code . Can you elaborate on how Meta Chain-of-Thought formalizes this process?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 15, 2025

1/2
@GoatstackAI
The paper introduces Eliza, a groundbreaking open-source AI agent operating system designed to seamlessly integrate web3 applications within the AI ecosystem. Powered by large language models (LLMs), Eliza facilitates autonomous operation and execution of task...

[Quoted tweet]
Eliza Inside™

arxiv.org/abs/2501.06781

2/2
@GoatstackAI
Eliza: A Web3 friendly AI Agent Operating System

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 15, 2025

1/2
@GoatstackAI
InfiGUIAgent is a multimodal large language model (MLLM)-based GUI agent designed to improve task automation by addressing limitations in existing GUI agents, particularly in multi-step reasoning and reliance on textual annotations. The proposed model employs ...

[Quoted tweet]
来自浙大和字节的最新 UI Agent 工作： InfiGUIAgent: a multimodal generalist GUI agent with native reasoning and reflection arxiv.org/abs/2501.04575，把推理、反思和世界模型的思路都融合进去了，一顿乱锤

。加入 AI/UI-Agents 知识星球 t.zsxq.com/1uB5s，及时了解最新讯息 #UI_Agent

2/2
@GoatstackAI
InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 15, 2025

1/2
@GoatstackAI
This paper investigates methods to enhance the human-like qualities of large language models (LLMs), particularly focusing on improving natural language understanding, conversational coherence, and emotional intelligence. The authors developed synthetic datase...

[Quoted tweet]
[CL] Enhancing Human-Like Responses in Large Language Models
E Y Çalık, T R Akkuş (2025)
arxiv.org/abs/2501.05032

2/2
@GoatstackAI
Enhancing Human-Like Responses in Large Language Models

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 15, 2025

1/7
@xiye_nlp

Now most LLMs have >= 128K context sizes, but are they good at generating long outputs, such as writing 8K token chain-of-thought for a planning problem？

Introducing LongProc (Long Procedural Generation), a new benchmark with 6 diverse tasks that challenge LLMs to synthesize highly dispersed information and generate long, structured outputs.

2/7
@xiye_nlp
The 6 tasks in LongProc cover both real-world tasks like HTML to TSV, Travel Planning, and synthetic tasks like Theory-of-Mind Tracking and Countdown. These tasks present deterministic solutions and structured outputs, so we can evaluate long outputs using reliable rule-based metrics.

3/7
@xiye_nlp
LongProc requires LLM to execute prolonged procedures based on detailed instructions. Because a procedure has many steps, each requiring LLMs to use contextual information, this naturally tests LLMs’ ability to integrate dispersed information and generate long-form outputs.

4/7
@xiye_nlp
We evaluate 17 LCLMs across 3 difficulty levels, with 500, 2K, and 8K (maximum) output tokens. While most models claim a context size >= 128K, frontier models like GPT-4o show significant degradation on 8K-token tasks. Open-weight models falter on 2K tokens. Scale is critical and accounts for most differences.

5/7
@xiye_nlp
Comparing performance across tasks, we observe steeper performance degradation in tasks requiring long-range reasoning. Please check out our paper for more detailed results and analysis.

6/7
@xiye_nlp
Check out our project page and paper. It is very easy to evaluate on our benchmark.

[2501.05414] LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

LongProc

Shout out to my amazing collaborators! @fangcong_y10593 @Gracie_huihui @JoieZhang @HowardYen1 @gaotianyu1350 @gregd_nlp @danqi_chen

7/7
@ai_futures_mh
LLMs just got a new playground . LongProc is exactly what we needed to push their limits, can't wait to see the results!

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
@arxiv_sh

LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation [arXiv:2501.05414]

The research introduces LongProc, a new benchmark for evaluating long-context language models (LCLMs) on complex long-form procedural generation tasks. It challenges LCLMs to integrate dispersed information, follow detailed instructions, and generate structured outputs up to 8K tokens, revealing critical limitations in current models.

Terminal's Analysis:
Notably, this work pioneers a novel benchmark that tests LCLMs' capabilities beyond mere long-context recall. By requiring long-form generation adherent to deterministic procedures, LongProc provides a rigorous evaluation of models' ability to synthesize and reason over highly dispersed information, a crucial skill for real-world applications.

Technical Approach:
At its core, the research presents six diverse procedural generation tasks, such as extracting structured data from HTML and executing complex travel planning procedures. These tasks are designed to challenge LCLMs with detailed instructions, dispersed information, and long-form structured outputs, enabling reliable rule-based evaluation.

Impact & Applications:
LongProc's innovative benchmark could drive significant advancements in LCLMs, enabling more robust and capable models for applications like data extraction, task planning, and procedural reasoning. Its structured outputs and deterministic procedures also facilitate reliable evaluation, a critical need in the field.

Limitations & Future Work:
While the research rigorously evaluates current LCLMs, it primarily focuses on their limitations. Future work could explore architectural or training advances to better address the identified challenges of long-range coherence and integration of dispersed information.

Technical Complexity: Advanced

Categories: Long-Context Language Models, Procedural Generation, Benchmarking, Natural Language Processing, Language Model Evaluation, Information Integration, Structured Output Generation

Related Fields: Natural Language Processing, Machine Learning, Artificial Intelligence

Link: [2501.05414] LongProc: Benchmarking Long-Context Language Models on Long Procedural Generation

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 15, 2025

1/1
@arxiv_sh

Dynamics of "Spontaneous" Topic Changes in Next Token Prediction with Self-Attention [arXiv:2501.06382]

This research investigates how self-attention language models handle spontaneous topic changes by defining concepts like topic continuity, ambiguous sequences, and topic changes based on token priority graphs (TPGs). Analytical characterizations reveal that topic changes occur when lower-priority tokens outnumber higher-priority ones from the input topic.

Terminal's Analysis:
Notably, this work offers novel insights into the dynamics of topic changes in self-attention models, highlighting key differences from human cognition. By rigorously defining topics using TPGs and analyzing a simplified self-attention architecture, the researchers uncover factors influencing topic continuity and spontaneous redirection, which could inform more natural conversational AI.

Technical Approach:
The core methodology involves defining a topic as a set of TPGs, then deriving analytical characterizations of topic changes in a simplified single-layer self-attention model. This approach enables identifying conditions where lower-priority tokens can trigger a topic shift by outnumbering higher-priority tokens from the input topic.

Impact & Applications:
This research has significant implications for developing more natural and human-like conversational AI systems capable of handling spontaneous topic changes. The findings could guide architectural improvements and training strategies to better emulate the cognitive flexibility humans exhibit in navigating conversations fluidly.

Limitations & Future Work:
A key limitation is the use of a simplified single-layer self-attention model, which may not fully capture the complexities of state-of-the-art language models. Additionally, the analytical approach may need to be extended to handle more nuanced topic transitions and contextual factors.

Technical Complexity: Advanced

Categories: natural language processing, self-attention, language models, topic modeling, conversational AI

Related Fields: cognitive science, linguistics, machine learning

Link: [2501.06382] Dynamics of "Spontaneous" Topic Changes in Next Token Prediction with Self-Attention

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 15, 2025

1/1
@arxiv_sh

From thermodynamics to protein design: Diffusion models for biomolecule generation towards autonomous protein engineering [arXiv:2501.02680]

This research reviews diffusion models, a promising generative AI approach for protein design. It covers denoising diffusion probabilistic models (DDPM) and score-based generative models (SGM), their applications in biomolecule generation, and maintaining E(3) equivariance for stable 3D protein structures.

Terminal's Analysis:
A notable innovation is leveraging the robust mathematical foundations and generative capabilities of diffusion models for autonomous protein engineering. By maintaining E(3) equivariance, the models can preserve the physical stability of amino acid frames, a critical aspect for functional protein design.

Technical Approach:
The core methodology involves two strategies: DDPM (discrete form of SGM) and SGM itself. These diffusion models are applied to tasks like protein design, peptide generation, drug discovery, and protein-ligand interaction, with a focus on maintaining E(3) equivariance.

Impact & Applications:
This work has significant potential in enabling autonomous protein design with desirable properties, a long-standing challenge in biomolecular engineering. Successful applications could revolutionize fields like therapeutics, biomaterials, and industrial enzymes.

Limitations & Future Work:
While promising, diffusion models for protein design may face challenges in scalability, computational efficiency, and capturing complex biomolecular interactions. Further research is needed to improve accuracy and practicality.

Technical Complexity: Advanced

Categories: Diffusion Models, Protein Design, Generative AI, Biomolecular Engineering, E(3) Equivariance, Denoising Diffusion, Score-based Generation

Related Fields: Computational Biology, Structural Bioinformatics, Machine Learning for Molecules

Link: [2501.02680] From thermodynamics to protein design: Diffusion models for biomolecule generation towards autonomous protein engineering

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 15, 2025

1/1
@intern_lm

Try it now on: https://internlm-chat.intern-ai.org.cn

[Quoted tweet]

Introducing InternLM3-8B-Instruct with Apache License 2.0.
-Trained on only 4T tokens, saving more than 75% of the training cost.
-Supports deep thinking for complex reasoning and normal mode for chat.
Model：@huggingface
huggingface.co/internlm/inte…
GitHub:
github.com/InternLM/InternLM

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/3
@teortaxesTex
man this looks soooo much like r1-lite, and I don't like it one bit

[Quoted tweet]

Try it now on: internlm-chat.intern-ai.org.…

2/3
@teortaxesTex
maybe that's okay for 8b, but that's definitely not o1-mini

3/3
@teortaxesTex
pretty bad at taking the hint, it just justifies your question. not like spelling out the problem helps much more

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 15, 2025

1/2
@GoatstackAI
This paper investigates methods to enhance the human-like qualities of large language models (LLMs), particularly focusing on improving natural language understanding, conversational coherence, and emotional intelligence. The authors developed synthetic datase...

[Quoted tweet]
[CL] Enhancing Human-Like Responses in Large Language Models
E Y Çalık, T R Akkuş (2025)
arxiv.org/abs/2501.05032

2/2
@GoatstackAI
Enhancing Human-Like Responses in Large Language Models

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 15, 2025

Sung Kim (@sungkim.bsky.social)

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains Instead of self improving a single LLM, they self-improve a population of LLMs initialized from a base model. This enables consistent self-improvement over multiple rounds. Project: llm-multiagent-ft.github.io

bsky.app

1/2
Sung Kim

Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

Instead of self improving a single LLM, they self-improve a population of LLMs initialized from a base model. This enables consistent self-improvement over multiple rounds.

Project: Multiagent Finetuning: Self Improvement with Diverse Reasoning Chains

bafkreiglj6wca7zt3hxwuypnrzkt5bnhdkojtajpx57gaosnz6ztgdgqk4@jpeg

2/2
‪Sung Kim‬ ‪@sungkim.bsky.social‬

By self-improving a population of language models, each trained on its own generated responses, they are able to maintain diversity of responses across LLM models, allowing them to observe consistent gains over multiple rounds of self-improvement.

Post on X:

bafkreiexl5qe5chdpfkykyf676e7qyzbzxxurtmwshtyjpkrbtaco7w4ey@jpeg

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 18, 2025

Chinese AI company MiniMax releases new models it claims are competitive with the industry's best | TechCrunch

Chinese firms continue to release AI models that rival the capabilities of systems developed by OpenAI and other U.S.-based AI companies. This week,

techcrunch.com

Chinese AI company MiniMax releases new models it claims are competitive with the industry’s best

Kyle Wiggers

5:37 PM PST · January 15, 2025

Chinese firms continue to release AI models that rival the capabilities of systems developed by OpenAI and other U.S.-based AI companies.

This week, MiniMax, an Alibaba- and Tencent-backed startup that has raised around $850 million in venture capital and is valued at more than $2.5 billion, debuted three new models: MiniMax-Text-01, MiniMax-VL-01, and T2A-01-HD. MiniMax-Text-01 is a text-only model, while MiniMax-VL-01 can understand both images and text. T2A-01-HD, meanwhile, generates audio — specifically speech.

MiniMax claims that MiniMax-Text-01, which is 456 billion parameters in size, performs better than models such as Google’s recently unveiled Gemini 2.0 Flash on benchmarks like MMLU and SimpleQA, which measure the ability of a model to answer math problems and fact-based questions. Parameters roughly correspond to a model’s problem-solving skills, and models with more parameters generally perform better than those with fewer parameters.

As for MiniMax-VL-01, MiniMax says that it rivals Anthropic’s Claude 3.5 Sonnet on evaluations that require multimodal understanding, like ChartQA, which tasks models with answering graph- and diagram-related queries (e.g., “What is the peak value of the orange line in this graph?”). Granted, MiniMax-VL-01 doesn’t quite best Gemini 2.0 Flash on many of these tests. OpenAI’s GPT-4o and an open model called InternVL2.5 beat it on several as well.

Of note, MiniMax-Text-01 has an extremely large context window. A model’s context, or context window, refers to input (e.g., text) that a model considers before generating output (additional text). With a context window of 4 million tokens, MiniMax-Text-01 can analyze around 3 million words in one go — or just over five copies of “War and Peace.”

For context (no pun intended), MiniMax-Text-01’s context window is roughly 31 times the size of GPT-4o’s and Llama 3.1’s.

The last of MiniMax’s models released this week, T2A-01-HD, is an audio generator optimized for speech. T2A-01-HD can generate a synthetic voice with adjustable cadence, tone, and tenor in around 17 different languages, including English and Chinese, and clone a voice from just 10 seconds of an audio recording.

MiniMax didn’t publish benchmark results comparing T2A-01-HD to other audio-generating models. But to this reporter’s ear, T2A-01-HD’s outputs sound on par with audio models from Meta and startups like PlayAI.

With the exception of T2A-01-HD, which is exclusively available through MiniMax’s API and Hailuo AI platform, MiniMax’s new models can be downloaded from GitHub and the AI dev platform Hugging Face.

Just because the models are “openly” available doesn’t mean they aren’t locked down in certain aspects, however. MiniMax-Text-01 and MiniMax-VL-01 aren’t truly open source in the sense that MiniMax hasn’t released the components (e.g., training data) needed to re-create them from scratch. Moreover, they’re under MiniMax’s restrictive license, which prohibits developers from using the models to improve rival AI models and requires that platforms with more than 100 million monthly active users request a special license from MiniMax.

MiniMax was founded in 2021 by former employees of SenseTime, one of China’s largest AI firms. The company’s projects include apps like Talkie, an AI-powered role-playing platform along the lines of Character AI, and text-to-video models that MiniMax has released in Hailuo.

Some of MiniMax’s products have become the subject of minor controversy.

Talkie, which was pulled from Apple’s App Store in December for unspecified “technical” reasons, features AI avatars of public figures, including Donald Trump, Taylor Swift, Elon Musk, and LeBron James, none of whom appear to have consented to being featured in the app.

In December, Broadcast magazine reported that MiniMax’s video generators can reproduce the logos of British television channels, suggesting that MiniMax’s models were trained on content from those channels. And MiniMax is reportedly being sued by iQiyi, a Chinese video streaming service that alleges MiniMax illicitly trained on iQiyi’s copyrighted recordings.

MiniMax’s new models arrive days after the outgoing Biden administration proposed harsher export rules and restrictions on AI technologies for Chinese ventures. Companies in China were already prevented from buying advanced AI chips, but if the new rules go into effect as written, companies will be faced with stricter caps on both the semiconductor tech and models needed to bootstrap sophisticated AI systems.

On Wednesday, the Biden administration announced additional measures focused on keeping sophisticated chips out of China. Chip foundries and packaging companies that want to export certain chips will be subjected to broader license requirements unless they exercise greater scrutiny and due diligence to prevent their products from reaching Chinese clients.

bnew · Jan 18, 2025

1/11
@Hailuo_AI

Find Your Dream Voice with Hailuo Audio HD: A New Era in Voice Synthesis!

MiniMax proudly unveils T2A-01-HD, the next breakthrough in Text-to-Audio technology. With unmatched versatility, emotional depth, and multilingual authenticity, this model redefines what’s possible in voice synthesis. Here's what makes it extraordinary:

Limitless Voice Customization:
-Clone voices with just 10 seconds of audio, preserving every nuance and emotional undertone.
-Access a library of 300+ pre-built voices categorized by language, gender, accent, age, and style.
-Customize pitch, speed, and emotional tone with advanced parameter controls for dynamic results.
-Add professional effects like room acoustics and telephone filters for studio-grade results.

Sophisticated Emotional Intelligence:
-Bring speech to life with industry's first intelligent emotional system that captures and replicates subtle emotional nuances in speech.
-Choose between automatic emotion detection or manual controls for perfectly expressive speech.

Truly Authentic Language Expertise:
-Speak fluently across 17+ languages, with natural accents that reflect regional authenticity.
-Supported languages include:

English (US, UK, Australia, India)

Chinese (Mandarin and Cantonese)

Japanese, Korean, French, German, Spanish, Portuguese (including Brazilian), Italian, Arabic, Russian, Turkish, Dutch, Ukrainian, Vietnamese, and Indonesian.

the list is constantly updated to include more languages!

Try It for FREE: Hailuo AI Audio: Create lifelike speech

API Platform: MiniMax-Intelligence with everyone

https://video.twimg.com/ext_tw_video/1879545922176307200/pu/vid/avc1/1280x720/KwUBHxBRkrmu55dJ.mp4

2/11
@AllarHaltsonen

3/11
@Apple_Dog_Sol
Amazing as always🫶

4/11
@artists_voyage
The beauty of AI

5/11
@iheartjpgs
Wow! iheart @Hailuo_AI and @Apple_Dog_Sol

6/11
@pressmanc
So cool! I love the easily available settings for emotion, speed, and pitch!

7/11
@FussyPastor
I'm guessing your roadmap has lip sync integration so that this can coordinate with the video models?

8/11
@BrentLynch
Now do in-app Lipsync! The kids will love it! Do you have a preferred lipsync partner until you do?

9/11
@digitalfemme
Voices I would want to clone >> @VanessaJPages @MyCreativeOwls

10/11
@NftEcat
Nice voices

11/11
@EHuanglu
Wow!

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
@PromptlyAI_YT
Slimed.

AI Video:
Minimax Text-to-Video model via @nimvideo

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/21
@MiniMax__AI
MiniMax-01 is Now Open-Source: Scaling Lightning Attention for the AI Agent Era

We are thrilled to introduce our latest open-source models: the foundational language model MiniMax-Text-01 and the visual multi-modal model MiniMax-VL-01.

Innovative Lightning Attention Architecture, with Top-tier Model Performance
This series of models (MiniMax-01) incorporates bold innovations, marking the first large-scale implementation of a novel Lightning Attention mechanism in the industry, offering a new alternative to the traditional Transformer architecture.

4M Ultra-Long Context, Spearheading the AI Agent Era
MiniMax-01 efficiently processes up to 4M tokens - 20 to 32 times the capacity of other leading models. We believe MiniMax-01 is poised to support the anticipated surge in agent-related applications in the coming year, as agents increasingly require extended context handling capabilities and sustained memory.

Unbeatable Cost-Effectiveness for Continuous Innovation
With our our proprietary architectural innovations and infrastructure optimization, we are able to offer both model APIs at the industry's most competitive price points: USD $0.2 per million input tokens and USD $1.1 per million output tokens.

Try Now for FREE: Hailuo AI - Your Ultimate AI Assistant for Intelligent Solutions

Paper：https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf

Read more: MiniMax - Intelligence with everyone

2/21
@MiniMax__AI

Key Data
7/8 Lightning Attention and 1/8 Softmax Attention
456B MoE parameters with 32 experts
45.9B activated parameters
4M-token Context Window

Github：https://github.com/MiniMax-AI/MiniMax-01
MiniMax-Text-01： https://huggingface.co/MiniMaxAI/MiniMax-Text-01
MiniMax-VL-01：https://huggingface.co/MiniMaxAI/MiniMax-VL-01

Check it out!

3/21
@MiniMax__AI
API Pricing: Unmatched Value in the Market!
Input: $0.2 per million tokens
Output: $1.1 per million tokens

Permanent pricing—no time limits!

Explore more on the MiniMax Open Platform: https://www.minimaxi.com/platform

4/21
@MiniMax__AI
We have also deployed MiniMax-01 on Hailuo AI, try now for free: https://www.hailuo.ai/

For technical suggestions or collaboration opportunities, please feel free to reach out to us at model@minimaxi.com.

5/21
@amiaoLeoDao
awesome

6/21
@christianjfig
Is the APP 4 million or 1 million in https://www.hailuo.ai/

7/21
@FlokiArmy2025
My proposal is as good as your project is to crypto community, send me a DM!

8/21
@Stephgts8
Your language model needs a lot of work. It can remember tone of details, yes, but it can't learn and adapt from them AT ALL. That is extremely frustrating.

9/21
@Doge100XArmy
Please send me DM

I have a good proposal for your project

.

10/21
@ywhalee
Hey mate! let's talk for DM I have important things to tell you

11/21
@Extended_Brain
A combination of chunking, more parallel processing, and sparse attention based on heuristics and learned patterns

12/21
@AbelIonadi
This is great. Let's gooo

13/21
@simform
So now we have another DeepSeek v3. Just curious.

14/21
@PumpFunSolCalls
Let's talk i have an amazing offer for you

...

15/21
@Za_i_d
𝙋𝙇𝙀𝘼𝙎𝙀 𝙁𝙊𝙇𝙇𝙊𝙒 𝙈𝙀 𝘽𝘼𝘾𝙆

𝙇𝙀𝙏'𝙎 𝘾𝙊𝙇𝙇𝘼𝘽𝙊𝙍𝘼𝙏𝙀

16/21
@noperator
quant when

17/21
@raphaelmansuy
Bravo

Do you have an API available?

I'm building Quantalogic, a Level 3 Agentic Platform, and the context length limitation is the main barrier preventing me from delivering maximum value.

https://github.com/quantalogic/quantalogic

https://www.quantalogic.app/

18/21
@raw_works
very cool! can it code?

19/21
@test_tm7873
And its even opensource ?!?!?!? Helll yesss!!!

20/21
@_akhaliq
nice users can try it now in anychat: https://huggingface.co/spaces/akhaliq/anychat

and developers can build with own apps with it here using ai-gradio: https://github.com/AK391/ai-gradio

21/21
@ViaFloo

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/8
@_akhaliq
MiniMax-01 Coder is now available in anychat

made a working chess game in one shot

try it out

2/8
@_akhaliq
app: https://huggingface.co/spaces/akhaliq/anychat

3/8
@spencercmd
hm. actually quite impressive.

4/8
@_akhaliq
you can compare it with gemini coder and openai coder, also available in anychat will add more coder options soon

5/8
@NotBrain4brain
Is that a pawn capable of moving in all directions

6/8
@LooveThyJourney
Can confirm that one shot works to build apps from scratch. It even gives you a table of contents with hyperlinks to all parts of the code by default.

Amazing stuff! @Hailuo_AI

7/8
@hi_sonu_
thanks for sharing this, buddy

8/8
@victor_explore

[Quoted tweet]
Coding in 2025

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 18, 2025

1/5
@_akhaliq
Minimax-o1 coder

2/5
@_akhaliq
available now:

[Quoted tweet]
MiniMax-01 coder mode is now available in ai-gradio

pip install --upgrade "ai-gradio[minimax]"

import gradio as gr
import ai_gradio

demo = gr.load(
name='minimax:MiniMax-Text-01',
src=ai_gradio.registry,
coder=True
)
demo.launch()

3/5
@jvivas_official
Nice. I am using it for JSON conversion and it works great. A little slow tho

4/5
@Thee_BlackMamba
Link?

5/5
@Marzzel2020
Will it be available for local soon?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
@WenhuChen
Yet another Chinese company releasing their strongest LLM to HF
MiniMaxAI/MiniMax-Text-01 · Hugging Face

Strong performance on MMLU-Pro, MMMU and MMMU-Pro!

[Quoted tweet]
MiniMax-01 is Now Open-Source: Scaling Lightning Attention for the AI Agent Era

We are thrilled to introduce our latest open-source models: the foundational language model MiniMax-Text-01 and the visual multi-modal model MiniMax-VL-01.

Innovative Lightning Attention Architecture, with Top-tier Model Performance
This series of models (MiniMax-01) incorporates bold innovations, marking the first large-scale implementation of a novel Lightning Attention mechanism in the industry, offering a new alternative to the traditional Transformer architecture.

4M Ultra-Long Context, Spearheading the AI Agent Era
MiniMax-01 efficiently processes up to 4M tokens - 20 to 32 times the capacity of other leading models. We believe MiniMax-01 is poised to support the anticipated surge in agent-related applications in the coming year, as agents increasingly require extended context handling capabilities and sustained memory.

Unbeatable Cost-Effectiveness for Continuous Innovation
With our our proprietary architectural innovations and infrastructure optimization, we are able to offer both model APIs at the industry's most competitive price points: USD $0.2 per million input tokens and USD $1.1 per million output tokens.

Try Now for FREE: hailuo.ai

Paper：filecdn.minimax.chat/_Arxiv_…

Read more: minimaxi.com/en/news/minimax…

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 18, 2025

1/6
@Xianbao_QIAN
Linear attention v.s. Transformer arch?

We have a new player in the OS LLM world. MiniMax a.k.a the company that released @Hailuo_AI

Their first release is VERY impressive:
- Linear attention with MoE
- 456B total params with 45.9B activated params
- 4M context length for inference
- Both LLM, VL and the base are available

This would be an ideal model for agent use or long context understanding with great cost-efficiency and performance!

Demo & model links below.

2/6
@Xianbao_QIAN
Github: GitHub - MiniMax-AI/MiniMax-01
@huggingface : MiniMaxAI (MiniMax)
Demo for LLM: MiniMaxText01 - a Hugging Face Space by MiniMaxAI
Demo for VL: MiniMaxVL01 - a Hugging Face Space by MiniMaxAI

3/6
@victor_explore
but can it bench press gpt-4o?

4/6
@NujumSara
Impressive advance! MiniMax's linear attention and MoE are truly exciting . Can't wait to see its evolution. Thank You Wish HappyDay to All .

5/6
@jeroaranda
do you think graph laplacian structure of fim of attention mechanisms can be leveraged for improved convergence of transformers via natural gradient?

6/6
@roninhahn

[Quoted tweet]
x.com/i/article/187954799302…

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/7
@eliebakouch
Diving into MiniMax01 405B MoE

tl;dr very (very) nice paper/model, lot of details and experiment details, hybrid with 7/8 Lightning attn, different MoE strategy than deepseek, deepnorm, WSD schedule, ~2000 H800 for training, ~12T token.

MoE spec (vs refer to comparison to Deepseek v3):
- Token drop strategy with aux loss instead vs dropless strategy with auxiliary loss-free load balancing
- Global router and some optimization (did not add a loss) to balance the /search?q=#tokens per EP group
- Top k = 2 vs 8 + 1 shared in deepseek and bigger MoE layer (9216 vs 2048)
- Same total of "mlp param" activated per layer as deepseek: 9*2048 = 2*9216 = 18432
- Expert 32 vs 256 (+1 shared)
- Layers 80 vs 61 -> they say that linear attn layer tends to benefit more from depth than width.
- Hidden size 6144 vs 7168
- No shared expert

Hybrid modeling:
7/8 linear attn:
- (Lightning Attention-2, which is a kind of "flash attention" equivalent for NormAttention)
- Big + of linear attn is O(d^2n) instead of O(n^2d) so makes very long context feasible.
- Mechanism of lightning attention:
1) Input X
2) Q,K,V = Silu(X)
3) Y = (Q) x (K^TV) (this part is in O(d^2))
4) RMSNorm(Y) * sigmoid(X)

Compared with (exp setup 1B + 3B (for swa only) but /search?q=#tokens?):
- cosformer2: very bad at NIAH and overall lower + slower
- hgrn2: overall a bit better but slower (+ little gap on NIAH)
- SWA softmax w/ window size /in {256, 521, 1024} => similar TGS (token per gpu per sec) but overall lower perf + big gap in NIAH
- small limit of the ablation: No complex benchmark for long context instead of NIAH

1/8 softmax attn:
- Apply rope only to half of the dim size - they said it allows length extrapolation without perf degradation - not sure I get this ? Also, rope 10k base seems a bit off compared to other models.

Model design thinking:
Goal: best model that fits on an ~H100 node (8x80G), with 1M seq len, 8bit quantization.
-> good mixing of softmax/linear
-> depth/width ratio (need more softmax attn if the model is too shallow so they did a deeper model)
-> ratio of the memory size vs hidden size
-> FFN size vs model dim
-> dimension utilizing rope for softmax

Scaling laws/experiments:
MoE vs dense (1T /search?q=#tokens moe 2B w/ 24B total vs 7B dense):
=>Moe much better for the same amount of flops, be careful of the y-axis in fig4, score seems to be similar for all benchmarks

Linear attn vs softmax vs hybrid:
- Range: 70M -> 7B /search?q=#param on 300B /search?q=#tokens
- Caveats: fixed lr to establish scaling laws without fast decay => what if softmax attn performs better at decay? Don't see reason why but hard to build intuition here
- Perform also scaling laws with benchmark, big gap btw softmax and linear on NIAH (softmax > linear) but hybrid is much better than softmax and >= for other benchmarks (even on overtrain regime i.e., 410M for 300B)

MoE + softmax vs hybrid-lightning:
- Overall hybrid-lightning is better
- /!\ surprising gap btw pre-norm and post-norm for hybrid (not sure if it translates to softmax), post-norm is much better (+6 points on mmlu) than pre-norm, for post-norm they use deepnorm

Deepnorm tl;dr is scaling up (not learnable) residual connection and changing init as well (and post-norm so including residual connection into the norm)

Training data (no experiments/ablation):
- Used their former MoE (5B active 60B total) to label data and probably trained a classifier after? (not specified)
=> Final label metric: knowledge / practical helpfulness / categorical distribution (they add more but highly correlated)
- Different "formatting" to balance QA "format" and natural distribution, no more info but interesting (probably some MCF to improve mmlu?)
- Use acc_norm^2 (byte normalization) to track metrics
- To analyze specific sources, I guess, they use a 1B active /search?q=#params (8B total) moe on 20B "web documents" + 20B data. No info if the model is already pre-trained like llama3 did?
- They did some statistic tests w/ multiple seeds each time
- They also said that you need low-quality data (but no experiments)
- They did global dedup and also find that 4x for high quality vs 2x for poor quality (interesting but not the result I expected, no more info)

Training HP:
- ~12T tokens
- WSD-like schedule (nice to see that it starts to be the go-to schedule!). they lower the lr during the training because of weird grad norm, they decay to 10% peak lr, no final decay. (exp decay in the decay phase)
- Xavier init w/ deepnorm modification
- AdamW with (0.9,0.95)
- Interesting part on BS warmup, and critical BS vs loss relation (from 16M -> 128M). First time seeing a good explanation for the BS schedule?

Long context:
- 3 phases:
(0) Main training 8k rope 10k
(1) 128k /search?q=#tokens 300B, 5M rope, 30% short (<32k), 70% medium (<128k)
(2) 512k /search?q=#tokens 32B, 10M rope, 35% short, 35% medium, 30% long
(3) 1M /search?q=#tokens 26B, 10M rope, 30% short, 30% medium, 40% long
- Linear interpolation to mitigate distribution shift, the way I understand it is W_t = alpha * W_previous_stage + (1-alpha) * W_current_weight_t?
- Good ruler benchmark, but I don't know what's part is due to pre-training and post training.

(less detailed part, will update later)
Post training:
- Iterative SFT -> RL cycle
- RL: offline with DPO, online with GRPO
- Short context SFT -> long context SFT -> sc RL -> lc RL (said it's "fundamental" to achieve great long context)

Infra:
- 1500/2500 GPU
- Overlap EP, thing with EP data parallelism/tensor parallelism
- optimized on TP for MoE.
- Packing with ring attention, some improvement to linear attn sequence parallelism
- Some padding improvement (like we do with the vocab size)

2/7
@eliebakouch
for those who prefere blog: Diving into MiniMax01 405B MoE

3/7
@TrelisResearch
Surprising hybrid is better than pure softmax no?

4/7
@eliebakouch
Yes was surprised too, especially this big gap on NIAH

5/7
@roninhahn

[Quoted tweet]
x.com/i/article/187954799302…

6/7
@OctoDb
You forgot the processing of tokens in blocks. A very important part in the performance

7/7
@KhonaMikail

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
@xuandongzhao

Exciting news in the open-source LLM world!

MiniMax-Text-01 from @Hailuo_AI, a powerful 456B parameter model (with 45.9B active per token), has been released. Its hybrid architecture, combining Lightning Attention, Softmax Attention, and MoE, unlocks impressive long-context capabilities and matches the performance of top-tier models like GPT-4o on various benchmarks.

Notably, MiniMax-Text-01 incorporates many cutting-edge techniques from academic research published after ChatGPT. This underscores that innovation is not limited to closed ecosystems; the open-source community continues to drive significant progress through collaborative and transparent research.

Model: MiniMaxAI/MiniMax-Text-01 · Hugging Face
Report: https://filecdn.minimax.chat/_Arxiv_MiniMax_01_Report.pdf

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 18, 2025

https://archive.is/gfyPW

The A.I Megathread (LLM , GPT , Development)

More options

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

Sung Kim (@sungkim.bsky.social)

bnew

Veteran

Chinese AI company MiniMax releases new models it claims are competitive with the industry's best | TechCrunch

Chinese AI company MiniMax releases new models it claims are competitive with the industry’s best

bnew

Veteran

bnew

Veteran

bnew

Veteran

bnew

Veteran

The A.I Megathread (LLM , GPT , Development)

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Chinese AI company MiniMax releases new models it claims are competitive with the industry’s best​

Veteran

Veteran

Veteran

Veteran

Chinese AI company MiniMax releases new models it claims are competitive with the industry’s best