Large Language Models News & Discussions

bnew · Dec 26, 2024

DeepSeek's new AI model appears to be one of the best 'open' challengers yet | TechCrunch


techcrunch.com

DeepSeek’s new AI model appears to be one of the best ‘open’ challengers yet

Kyle Wiggers

11:44 AM PST · December 26, 2024

A Chinese lab has created what appears to be one of the most powerful “open” AI models to date.

The model, DeepSeek V3, was developed by the AI firm DeepSeek and was released on Wednesday under a permissive license that allows developers to download and modify it for most applications, including commercial ones.

DeepSeek V3 can handle a range of text-based workloads and tasks, like coding, translating, and writing essays and emails from a descriptive prompt.

According to DeepSeek’s internal benchmark testing, DeepSeek V3 outperforms both downloadable, “openly” available models and “closed” AI models that can only be accessed through an API. In a subset of coding competitions hosted on Codeforces, a platform for programming contests, DeepSeek outperforms other models, including Meta’s Llama 3.1 405B, OpenAI’s GPT-4o, and Alibaba’s Qwen 2.5 72B.

DeepSeek V3 also crushes the competition on Aider Polyglot, a test designed to measure, among other things, whether a model can successfully write new code that integrates into existing code.

DeepSeek-V3!

60 tokens/second (3x faster than V2!)

API compatibility intact

Fully open-source models & papers

671B MoE parameters

37B activated parameters

Trained on 14.8T high-quality tokens

Beats Llama 3.1 405b on almost every benchmark x.com pic.twitter.com/jVwJU07dqf

— Chubby (@kimmonismus) December 26, 2024

DeepSeek claims that DeepSeek V3 was trained on a dataset of 14.8 trillion tokens. In data science, tokens are used to represent bits of raw data — 1 million tokens is equal to about 750,000 words.
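As a rough illustration of the conversion quoted above, the sketch below applies the article's ratio (1 million tokens to about 750,000 words) to the 14.8 trillion token figure. The function name is hypothetical, and the ratio is only an approximation that varies by tokenizer and language.

```python
# Approximate ratio quoted in the article: 1,000,000 tokens ~ 750,000 words.
WORDS_PER_MILLION_TOKENS = 750_000

def tokens_to_words(tokens: int) -> int:
    """Estimate a word count from a token count using the article's ratio."""
    return tokens * WORDS_PER_MILLION_TOKENS // 1_000_000

training_tokens = 14_800_000_000_000  # 14.8 trillion, as claimed by DeepSeek
print(f"{tokens_to_words(training_tokens):,} words (approx.)")
```

Under that ratio, the training set works out to roughly 11.1 trillion words.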

It’s not just the training set that’s massive. DeepSeek V3 is enormous in size: 685 billion parameters. (Parameters are the internal variables models use to make predictions or decisions.) That’s around 1.7 times the size of Llama 3.1 405B, which has 405 billion parameters.

DeepSeek (Chinese AI co) making it look easy today with an open weights release of a frontier-grade LLM trained on a joke of a budget (2048 GPUs for 2 months, $6M).

For reference, this level of capability is supposed to require clusters of closer to 16K GPUs, the ones being… x.com

— Andrej Karpathy (@karpathy) December 26, 2024
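For a sense of scale, the figures in the tweet above (2048 GPUs, 2 months, $6M) can be turned into GPU-hours and an implied hourly rate. This is a back-of-the-envelope sketch only: the 30-day month is an assumption, the helper names are hypothetical, and the result says nothing about how DeepSeek actually accounted for its costs.

```python
def gpu_hours(num_gpus: int, days: int) -> int:
    """Total GPU-hours if every GPU runs around the clock."""
    return num_gpus * days * 24

def implied_rate(total_cost_usd: float, hours: int) -> float:
    """Dollars per GPU-hour implied by a total budget."""
    return total_cost_usd / hours

hours = gpu_hours(num_gpus=2048, days=60)  # assumes 2 months = 60 days
rate = implied_rate(6_000_000, hours)
print(f"{hours:,} GPU-hours, about ${rate:.2f} per GPU-hour")
```

With those assumptions the budget comes to about 2.9 million GPU-hours at roughly $2 per GPU-hour.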

Parameter count often (but not always) correlates with skill; models with more parameters tend to outperform models with fewer parameters. But large models also require beefier hardware in order to run. An unoptimized version of DeepSeek V3 would need a bank of high-end GPUs to answer questions at reasonable speeds.

While it’s not the most practical model, DeepSeek V3 is an achievement in some respects. DeepSeek was able to train the model using a data center of Nvidia H800 GPUs in just around two months — GPUs that Chinese companies were recently restricted by the U.S. Department of Commerce from procuring. The company also claims it only spent $5.5 million to train DeepSeek V3, a fraction of the development cost of models like OpenAI’s GPT-4.

The downside is that the model’s political views are a bit filtered. Ask DeepSeek V3 about Tiananmen Square, for instance, and it won’t answer.

DeepSeek, being a Chinese company, is subject to benchmarking by China’s internet regulator to ensure its models’ responses “embody core socialist values.” Many Chinese AI systems decline to respond to topics that might raise the ire of regulators, like speculation about the Xi Jinping regime.

DeepSeek, which recently unveiled DeepSeek-R1, an answer to OpenAI’s o1 “reasoning” model, is a curious organization. It’s backed by High-Flyer Capital Management, a Chinese quantitative hedge fund that uses AI to inform its trading decisions.

DeepSeek’s models have forced competitors like ByteDance, Baidu, and Alibaba to cut the usage prices for some of their models — and make others completely free.

High-Flyer builds its own server clusters for model training, one of the most recent of which reportedly has 10,000 Nvidia A100 GPUs and cost 1 billion yuan (~$138 million). Founded by Liang Wenfeng, a computer science graduate, High-Flyer aims to achieve “superintelligent” AI through its DeepSeek org.

In an interview earlier this year, Liang described open sourcing as a “cultural act” and characterized closed-source AI like OpenAI’s as a “temporary” moat. “Even OpenAI’s closed-source approach hasn’t stopped others from catching up,” he noted.

Indeed.

bnew · Dec 26, 2024

OpenAI's o3 suggests AI models are scaling in new ways — but so are the costs | TechCrunch


techcrunch.com

OpenAI’s o3 suggests AI models are scaling in new ways — but so are the costs

Maxwell Zeff

4:08 PM PST · December 23, 2024

Last month, AI founders and investors told TechCrunch that we’re now in the “second era of scaling laws,” noting how established methods of improving AI models were showing diminishing returns. One promising new method they suggested could keep gains was “test-time scaling,” which seems to be what’s behind the performance of OpenAI’s o3 model — but it comes with drawbacks of its own.

Much of the AI world took the announcement of OpenAI’s o3 model as proof that AI scaling progress has not “hit a wall.” The o3 model does well on benchmarks, significantly outscoring all other models on a test of general ability called ARC-AGI, and scoring 25% on a difficult math test that no other AI model scored more than 2% on.

Of course, we at TechCrunch are taking all this with a grain of salt until we can test o3 for ourselves (very few have tried it so far). But even before o3’s release, the AI world is already convinced that something big has shifted.

The co-creator of OpenAI’s o-series of models, Noam Brown, noted on Friday that the startup is announcing o3’s impressive gains just three months after the startup announced o1 — a relatively short time frame for such a jump in performance.

We announced @OpenAI o1 just 3 months ago. Today, we announced o3. We have every reason to believe this trajectory will continue. pic.twitter.com/Ia0b63RXIk

— Noam Brown (@polynoamial) December 20, 2024

“We have every reason to believe this trajectory will continue,” said Brown in a tweet.

Anthropic co-founder Jack Clark said in a blog post on Monday that o3 is evidence that AI “progress will be faster in 2025 than in 2024.” (Keep in mind that it benefits Anthropic, and especially its ability to raise capital, to suggest that AI scaling laws are continuing, even if Clark is complimenting a competitor.)

Next year, Clark says the AI world will splice together test-time scaling and traditional pre-training scaling methods to eke even more returns out of AI models. Perhaps he’s suggesting that Anthropic and other AI model providers will release reasoning models of their own in 2025, just like Google did last week.

Test-time scaling means OpenAI is using more compute during ChatGPT’s inference phase, the period of time after you press enter on a prompt. It’s not clear exactly what is happening behind the scenes: OpenAI is either using more computer chips to answer a user’s question, running more powerful inference chips, or running those chips for longer periods of time — 10 to 15 minutes in some cases — before the AI produces an answer. We don’t know all the details of how o3 was made, but these benchmarks are early signs that test-time scaling may work to improve the performance of AI models.

While o3 may give some a renewed belief in the progress of AI scaling laws, OpenAI’s newest model also uses a previously unseen level of compute, which means a higher price per answer.

“Perhaps the only important caveat here is understanding that one reason why O3 is so much better is that it costs more money to run at inference time — the ability to utilize test-time compute means on some problems you can turn compute into a better answer,” Clark writes in his blog. “This is interesting because it has made the costs of running AI systems somewhat less predictable — previously, you could work out how much it cost to serve a generative model by just looking at the model and the cost to generate a given output.”

Clark, and others, pointed to o3’s performance on the ARC-AGI benchmark — a difficult test used to assess breakthroughs on AGI — as an indicator of its progress. It’s worth noting that passing this test, according to its creators, does not mean an AI model has achieved AGI, but rather it’s one way to measure progress toward the nebulous goal. That said, the o3 model blew past the scores of all previous AI models which had done the test, scoring 88% in one of its attempts. OpenAI’s next best AI model, o1, scored just 32%.

Chart showing the performance of OpenAI’s o-series on the ARC-AGI test. Image Credits: ARC Prize

But the logarithmic x-axis on this chart may be alarming to some. The high-scoring version of o3 used more than $1,000 worth of compute for every task. The o1 models used around $5 of compute per task, and o1-mini used just a few cents.

The creator of the ARC-AGI benchmark, François Chollet, writes in a blog post that OpenAI used roughly 170x more compute to generate that 88% score, compared to the high-efficiency version of o3 that scored just 12% lower. The high-scoring version of o3 used more than $10,000 of resources to complete the test, which makes it too expensive to compete for the ARC Prize, an unbeaten competition for AI models to beat the ARC test.
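To make the trade-off concrete, here is a small sketch using only the figures quoted above: an 88% score, a score "12% lower" for the high-efficiency run, and roughly 170x more compute. It assumes "12% lower" means 12 percentage points (so 76%), which the article does not state explicitly, and the variable names are illustrative.

```python
import math

high_score = 88.0                 # percent, high-compute o3 run
low_score = high_score - 12.0     # assumption: 12 percentage points lower
compute_multiplier = 170          # "roughly 170x more compute"

# Relative improvement in score bought by the extra compute.
relative_gain = high_score / low_score - 1

# On a logarithmic compute axis: score points gained per 10x of compute.
orders_of_magnitude = math.log10(compute_multiplier)
points_per_10x = (high_score - low_score) / orders_of_magnitude

print(f"relative gain: {relative_gain:.1%}")
print(f"points per 10x compute: {points_per_10x:.2f}")
```

Under these assumptions, 170x the compute buys about a 16% relative improvement, or a little over 5 points per tenfold increase in compute.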

However, Chollet says o3 was still a breakthrough for AI models, nonetheless.

“o3 is a system capable of adapting to tasks it has never encountered before, arguably approaching human-level performance in the ARC-AGI domain,” said Chollet in the blog. “Of course, such generality comes at a steep cost, and wouldn’t quite be economical yet: You could pay a human to solve ARC-AGI tasks for roughly $5 per task (we know, we did that), while consuming mere cents in energy.”

It’s premature to harp on the exact pricing of all this — we’ve seen prices for AI models plummet in the last year, and OpenAI has yet to announce how much o3 will actually cost. However, these prices indicate just how much compute is required to break, even slightly, the performance barriers set by leading AI models today.

This raises some questions. What is o3 actually for? And how much more compute is necessary to make more gains around inference with o4, o5, or whatever else OpenAI names its next reasoning models?

It doesn’t seem like o3, or its successors, would be anyone’s “daily driver” like GPT-4o or Google Search might be. These models just use too much compute to answer small questions throughout your day such as, “How can the Cleveland Browns still make the 2024 playoffs?”

Instead, it seems like AI models with scaled test-time compute may only be good for big picture prompts such as, “How can the Cleveland Browns become a Super Bowl franchise in 2027?” Even then, maybe it’s only worth the high compute costs if you’re the general manager of the Cleveland Browns, and you’re using these tools to make some big decisions.

Institutions with deep pockets may be the only ones that can afford o3, at least to start, as Wharton professor Ethan Mollick notes in a tweet.

O3 looks too expensive for most use. But for work in academia, finance & many industrial problems, paying hundreds or even thousands of dollars for a successful answer would not be prohibitive. If it is generally reliable, o3 will have multiple use cases even before costs drop

— Ethan Mollick (@emollick) December 22, 2024

We’ve already seen OpenAI release a $200 tier to use a high-compute version of o1, but the startup has reportedly weighed creating subscription plans costing up to $2,000. When you see how much compute o3 uses, you can understand why OpenAI would consider it.

But there are drawbacks to using o3 for high-impact work. As Chollet notes, o3 is not AGI, and it still fails on some very easy tasks that a human would do quite easily.

This isn’t necessarily surprising, as large language models still have a huge hallucination problem, which o3 and test-time compute don’t seem to have solved. That’s why ChatGPT and Gemini include disclaimers below every answer they produce, asking users not to trust answers at face value. Presumably AGI, should it ever be reached, would not need such a disclaimer.

One way to unlock more gains in test-time scaling could be better AI inference chips. There’s no shortage of startups tackling just this thing, such as Groq or Cerebras, while other startups are designing more cost-efficient AI chips, such as MatX. Andreessen Horowitz general partner Anjney Midha previously told TechCrunch he expects these startups to play a bigger role in test-time scaling moving forward.

While o3 is a notable improvement to the performance of AI models, it raises several new questions around usage and costs. That said, the performance of o3 does add credence to the claim that test-time compute is the tech industry’s next best way to scale AI models.

bnew · Jan 3, 2025

Paper page - Apollo: An Exploration of Video Understanding in Large Multimodal Models


huggingface.co

Apollo: An Exploration of Video Understanding in Large Multimodal Models

Published on Dec 13

Submitted by on Dec 16
#1 Paper of the day

Authors:
Orr Zohar, Xiaohan Wang, Yann Dubois, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu,

Abstract

Despite the rapid integration of video perception capabilities into Large Multimodal Models (LMMs), the underlying mechanisms driving their video understanding remain poorly understood. Consequently, many design decisions in this domain are made without proper justification or analysis. The high computational cost of training and evaluating such models, coupled with limited open research, hinders the development of video-LMMs. To address this, we present a comprehensive study that helps uncover what effectively drives video understanding in LMMs. We begin by critically examining the primary contributors to the high computational requirements associated with video-LMM research and discover Scaling Consistency, wherein design and training decisions made on smaller models and datasets (up to a critical size) effectively transfer to larger models. Leveraging these insights, we explored many video-specific aspects of video-LMMs, including video sampling, architectures, data composition, training schedules, and more. For example, we demonstrated that fps sampling during training is vastly preferable to uniform frame sampling and which vision encoders are the best for video representation. Guided by these findings, we introduce Apollo, a state-of-the-art family of LMMs that achieve superior performance across different model sizes. Our models can perceive hour-long videos efficiently, with Apollo-3B outperforming most existing 7B models with an impressive 55.1 on LongVideoBench. Apollo-7B is state-of-the-art compared to 7B LMMs with a 70.9 on MLVU, and 63.3 on Video-MME.
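The difference between the two sampling strategies named in the abstract can be sketched as follows. This is not Apollo's code: the function names and the example values (30 fps source video, 2 fps target, 8 uniform frames) are assumptions chosen only to show that uniform sampling returns a fixed number of frames regardless of clip length, while fps sampling keeps a fixed spacing in time.

```python
import math

def uniform_sample(num_frames: int, n: int) -> list[int]:
    """Pick n frame indices spread evenly over the clip, whatever its duration."""
    if num_frames <= 0 or n <= 0:
        return []
    n = min(n, num_frames)
    step = num_frames / n
    # Take the frame at the middle of each of the n equal segments.
    return [int(step * i + step / 2) for i in range(n)]

def fps_sample(num_frames: int, video_fps: float, target_fps: float) -> list[int]:
    """Pick frame indices at a fixed rate in time (target_fps frames per second)."""
    if num_frames <= 0 or video_fps <= 0 or target_fps <= 0:
        return []
    stride = max(1.0, video_fps / target_fps)  # source frames between samples
    count = math.ceil(num_frames / stride)
    return [int(i * stride) for i in range(count)]

short_clip, long_clip = 300, 1800  # 10 s and 60 s at 30 fps
print(len(uniform_sample(short_clip, 8)), len(uniform_sample(long_clip, 8)))
print(len(fps_sample(short_clip, 30, 2)), len(fps_sample(long_clip, 30, 2)))
```

With uniform sampling, the gap between sampled frames grows with video length, so the model sees a different effective playback speed for every clip. With fps sampling, the gap stays constant and the frame count grows instead.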

Paper: [2412.10360] Apollo: An Exploration of Video Understanding in Large Multimodal Models

Website: Apollo

Demo: https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B

Code: https://github.com/Apollo-LMMs/Apollo/

Models: Apollo-LMMs (Apollo-LMMs)

1/5
@jbohnslav
Apollo: great new paper and set of video LLMs from Meta. Strong performance at the 1-7B range. Most surprisingly, they use Qwen2 for the LLM!

Ablations + benchmarking make it absolutely worth a read. It reminds me (favorably) of Cambrian-1's systematic approach but for video.

2/5
@jbohnslav
They find perceiver resampling is the best way to reduce token count. However, they don't try the currently favored 2x2 concat-to-depth or Cambrian's SVA module.

3/5
@jbohnslav
On their project page, they bold their own model instead of the best performing. Here it is with the best model in each class bolded, and the best overall underlined.

4/5
@jbohnslav
A note of drama... since last week, the models have been deleted from huggingface. You can still find the weights around though.

5/5
@jbohnslav
Apollo: An Exploration of Video Understanding in Large Multimodal Models
arxiv: [2412.10360] Apollo: An Exploration of Video Understanding in Large Multimodal Models
code: Apollo

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/11
@reach_vb
Let's gooo! @AIatMeta released Apollo Multimodal Models Apache 2.0 licensed - 7B SoTA & beats 30B+ checkpoints

Key insights:

> 1.5B, 3B and 7B model checkpoints
> Can comprehend up-to 1 hour of video

> Temporal reasoning & complex video question-answering
> Multi-turn conversations grounded in video content

> Apollo-3B outperforms most existing 7B models, achieving scores of 58.4, 68.7, and 62.7 on Video-MME, MLVU, and ApolloBench, respectively
> Apollo-7B rivals and surpasses models with over 30B parameters, such as Oryx-34B and VILA1.5-40B, on benchmarks like MLVU

> Apollo-1.5B: Outperforms models larger than itself, including Phi-3.5-Vision and some 7B models like LongVA-7B
> Apollo-3B: Achieves scores of 55.1 on LongVideoBench, 68.7 on MLVU, and 62.7 on ApolloBench
> Apollo-7B: Attains scores of 61.2 on Video-MME, 70.9 on MLVU, and 66.3 on ApolloBench

> Model checkpoints on the Hub & works w/ transformers (custom code)

Congrats @AIatMeta for such a brilliant release and thanks again for ensuring their commitment to Open Source!

https://video.twimg.com/ext_tw_video/1868607816128237568/pu/vid/avc1/1280x720/3NyEBbMMmcnLNDYf.mp4

2/11
@reach_vb
Check out the model checkpoints here:

Apollo-LMMs (Apollo-LMMs)

3/11
@reach_vb
Play with the model directly over here:

https://huggingface.co/spaces/Apollo-LMMs/Apollo-3B

4/11
@TheXeophon
We are so freaking back

5/11
@reach_vb
unfathomably

6/11
@HomelanderBrown
Native Image support???

7/11
@reach_vb
yes

8/11
@nanolookc
How much VRAM needed?

9/11
@aspiejonas
Just when I thought AI couldn't get any more exciting...

10/11
@raen_ai
A 7B model beating 30B+ checkpoints? Unreal.

11/11
@heyitsyorkie
@Prince_Canuma coming to MLX?


1/1
@vlruso
Meta AI Releases Apollo: A New Family of Video-LMMs Large Multimodal Models for Video Understanding


#MetaAI #ApolloModels #VideoUnderstanding #MultimodalAI #AIInnovation #ai #news #llm #ml #research #ainews #innovation #artificialintelligence #machinel…


bnew · Jan 4, 2025

LLMs that Failed Miserably in 2024

In 2024, the AI community witnessed the launch of several new large language models (LLMs), such as OpenAI’s o3 and Google Gemini 2, which promised to push the boundaries of what’s possible with AI.

analyticsindiamag.com

Published on January 3, 2025

In AI Trends

LLMs that Failed Miserably in 2024

Databricks spent $10 million developing DBRX, yet only recorded 23 downloads on Hugging Face last month.

Views : 4,414

by Siddharth Jindal

Looks like the race to build large language models is winding down, with only a few clear winners. Among them, DeepSeek V3 has claimed the spotlight in 2024, leading the charge for Chinese open-source models. Competing head-to-head with closed-source giants like GPT-4 and Claude 3.5, DeepSeek V3 notched 45,499 downloads last month, standing tall alongside Meta’s Llama 3.1 (491,629 downloads) and Google’s Gemma 2 (377,651 downloads), according to Hugging Face.

But not all LLMs launched this year could ride the wave of success—some fell flat, failing to capture interest despite grand promises. Here’s a look at the models that couldn’t make their mark in 2024.

1. DBRX

Databricks launched DBRX, an open-source LLM with 132 billion parameters, in March 2024. It uses a fine-grained MoE architecture that activates four of 16 experts per input, with 36 billion active parameters. The company claimed that the model outperformed closed-source counterparts like GPT-3.5 and Gemini 1.5 Pro.
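The "four of 16 experts" idea can be illustrated with a generic top-k routing sketch. The article does not describe DBRX's actual router, so the function below is a hypothetical, simplified version of the common approach: score every expert, keep the k highest, and normalize their weights with a softmax.

```python
import math

def top_k_route(router_logits: list[float], k: int = 4) -> dict[int, float]:
    """Return {expert_index: weight} for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:k]
    # Softmax over the chosen experts only (subtract the max for stability).
    peak = max(router_logits[i] for i in chosen)
    exps = {i: math.exp(router_logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Illustrative logits for 16 experts; only 4 of them process this input.
logits = [float(i) for i in range(16)]
weights = top_k_route(logits, k=4)
print(sorted(weights))

# Share of parameters active per input, from the figures in the article.
active_fraction = 36 / 132
print(f"{active_fraction:.0%} of parameters active")
```

Because only the selected experts run, each input touches about 27% of the model's 132 billion parameters (36 billion), which is where the compute savings of an MoE design come from.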

However, since its launch, there has been little discussion about its adoption or whether enterprises find it suitable for building applications. The Mosaic team, acquired by Databricks in 2023 for $1.3 billion, led its development, and the company spent $10 million to build DBRX. But sadly, the model saw an abysmal 23 downloads on Hugging Face last month.

2. Falcon 2

In May, the Technology Innovation Institute (TII), Abu Dhabi, released its next series of Falcon language models in two variants: Falcon-2-11B and Falcon-2-11B-VLM. The Falcon 2 models showed impressive benchmark performance, with Falcon-2-11B outperforming Meta’s Llama 3 8B and matching Google’s Gemma 7B, as independently verified by the Hugging Face leaderboard.

However, later in the year, Meta released Llama 3.2 and Llama 3.3, leaving Falcon 2 behind. According to Hugging Face, Falcon-2-11B-VLM recorded just around 1,000 downloads last month.

3. Snowflake Arctic

In April, Snowflake launched Arctic LLM, a model with 480B parameters and a dense MoE hybrid Transformer architecture using 128 experts. The company proudly stated that it spent just $2 million to train the model, outperforming DBRX in tasks like SQL generation.

The company’s attention on DBRX suggested an effort to challenge Databricks. Meanwhile, Snowflake acknowledged that models like Llama 3 outperformed it on some benchmarks.

4. Stable LM 2

Stability AI launched the Stable LM 2 series in January 2024, featuring two variants: Stable LM 2 1.6B and Stable LM 2 12B. The 1.6B model, trained on 2 trillion tokens, supports seven languages, including Spanish, German, Italian, French, and Portuguese, and outperforms models like Microsoft’s Phi-1.5 and TinyLlama 1.1B in most tasks.

Stable LM 2 12B, launched in May, offers 12 billion parameters and is trained on 2 trillion tokens in seven languages. The company claimed that the model competes with larger ones like Mixtral, Llama 2, and Qwen 1.5, excelling in tool usage for RAG systems. However, the latest user statistics tell a different story, with just 444 downloads last month.

5. Nemotron-4 340B

Nemotron-4-340B-Instruct is an LLM developed by NVIDIA for synthetic data generation and chat applications. Released in June 2024, it is part of the Nemotron-4 340B series, which also includes the Base and Reward variants. Despite its features, the model has seen minimal uptake, recording just 101 downloads on Hugging Face in December 2024.

6. Jamba

AI21 Labs introduced Jamba in March 2024, an LLM that combines Mamba-based structured state space models (SSM) with traditional Transformer layers. The Jamba family includes multiple versions, such as Jamba-v0.1, Jamba 1.5 Mini, and Jamba 1.5 Large.

With its 256K token context window, Jamba can process much larger chunks of text than many competing models, sparking initial excitement. However, the model failed to capture much attention, garnering only around 7K downloads on Hugging Face last month.

7. AMD OLMo

AMD entered the open-source AI arena in late 2024 with its OLMo series of Transformer-based, decoder-only language models. The OLMo series includes the base OLMo 1B, OLMo 1B SFT (Supervised Fine-Tuned), and OLMo 1B SFT DPO (aligned with human preferences via Direct Preference Optimisation).

Trained on 16 AMD Instinct MI250 GPU-powered nodes, the models achieved a throughput of 12,200 tokens/sec/gpu.

The flagship OLMo 1B model features 1.2 billion parameters, 16 layers, 16 heads, a hidden size of 2048, a context length of 2048 tokens, and a vocabulary size of 50,280, targeting developers, data scientists, and businesses. Despite this, the model failed to gain any traction in the community.

bnew · Jan 7, 2025

Nvidia announces $3,000 personal AI supercomputer called Digits

It’s the size of a desktop.

www.theverge.com

Nvidia announces $3,000 personal AI supercomputer called Digits

This desktop-sized system can handle AI models with up to 200 billion parameters.

By Kylie Robison, a senior AI reporter working with The Verge's policy and tech teams. She previously worked at Fortune Magazine and Business Insider.
Jan 6, 2025, 11:11 PM EST

Nvidia CEO Jensen Huang holding the Project Digits computer on stage at Nvidia’s CES 2025 press conference. Image: Nvidia

If you were looking for your own personal AI supercomputer, Nvidia has you covered.

The chipmaker announced at CES it’s launching a personal AI supercomputer called Project Digits in May. The heart of Project Digits is the new GB10 Grace Blackwell Superchip, which packs enough processing power to run sophisticated AI models while being compact enough to fit on a desk and run from a standard power outlet (this kind of processing power used to require much larger, more power-hungry systems). This desktop-sized system can handle AI models with up to 200 billion parameters, and has a starting price of $3,000. The product itself looks a lot like a Mac Mini.

“AI will be mainstream in every application for every industry. With Project Digits, the Grace Blackwell Superchip comes to millions of developers,” Nvidia CEO Jensen Huang said in a press release. “Placing an AI supercomputer on the desks of every data scientist, AI researcher and student empowers them to engage and shape the age of AI.”

Project Digits looks like a mini PC. Image: Nvidia

Each Project Digits system comes equipped with 128GB of unified, coherent memory (by comparison, a good laptop might have 16GB or 32GB of RAM) and up to 4TB of NVMe storage. For even more demanding applications, two Project Digits systems can be linked together to handle models with up to 405 billion parameters (Meta’s best model, Llama 3.1, has 405 billion parameters).

The GB10 chip delivers up to 1 petaflop of AI performance (which means it can perform 1 quadrillion AI calculations per second) at FP4 precision (which helps make the calculations faster by making approximations), and the system features Nvidia’s latest-generation CUDA cores and fifth-generation Tensor Cores, connected via NVLink-C2C to a Grace CPU containing 20 power-efficient Arm-based cores. MediaTek, known for their Arm-based chip designs, collaborated on the GB10’s development to optimize its power efficiency and performance.
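The parameter limits quoted for Project Digits line up with simple memory arithmetic at FP4 precision. The sketch below is an estimate under stated assumptions, not Nvidia's sizing method: it counts model weights only (no activations or cache), treats 1 GB as 10^9 bytes, and uses hypothetical helper names.

```python
SYSTEM_MEMORY_GB = 128  # unified memory per Project Digits system

def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed to hold the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

def fits(num_params: float, bits_per_param: int, num_systems: int = 1) -> bool:
    """True if the weights fit in the combined memory of the linked systems."""
    needed = weight_memory_gb(num_params, bits_per_param)
    return needed <= SYSTEM_MEMORY_GB * num_systems

print(weight_memory_gb(200e9, 4))        # 200B parameters at 4 bits each
print(weight_memory_gb(405e9, 4))        # 405B parameters at 4 bits each
print(fits(405e9, 4, num_systems=2))     # two linked systems
print(fits(200e9, 16))                   # same 200B model at 16-bit precision
```

At 4 bits per parameter, a 200 billion parameter model needs about 100 GB and a 405 billion parameter model about 202.5 GB, which is consistent with the one-system and two-system limits in the article. The same 200 billion parameter model at 16-bit precision would need about 400 GB and would not fit.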

The Digits supercomputer specs. Image: Nvidia

Users will also get access to Nvidia’s AI software library, including development kits, orchestration tools, and pre-trained models available through the Nvidia NGC catalog. The system runs on Linux-based Nvidia DGX OS and supports popular tools like PyTorch, Python, and Jupyter notebooks. Developers can fine-tune models using the Nvidia NeMo framework and accelerate data science workflows with Nvidia RAPIDS libraries.

Users can develop and test their AI models locally on Project Digits, then deploy them to cloud services or data center infrastructure using the same Grace Blackwell architecture and Nvidia AI Enterprise software platform.

Nvidia offers a range of similarly accessible devices. In December, it announced a $249 version of its Jetson computer for AI applications, called the Jetson Orin Nano Super, which targets hobbyists and startups (it handles models up to 8 billion parameters).

bnew · Jan 7, 2025

Nvidia using GenAI to integrate Omniverse virtual creations into physical AI apps

Nvidia unveiled generative AI models and blueprints that expand Nvidia Omniverse integration further into physical AI applications.

venturebeat.com

Nvidia using GenAI to integrate Omniverse virtual creations into physical AI apps

Dean Takahashi@deantak

January 6, 2025 8:05 PM

Nvidia is helping companies design in Omniverse in digital form and take that into the physical AI world.

Image Credit: Nvidia

Nvidia unveiled generative AI models and blueprints that expand Nvidia Omniverse integration further into physical AI applications such as robotics, autonomous vehicles and vision AI.

As part of the CES 2025 opening keynote by Nvidia CEO Jensen Huang, the company said global leaders in software development and professional services are using Omniverse to develop new products and services that will accelerate the next era of industrial AI.

Accenture, Altair, Ansys, Cadence, Foretellix, Microsoft and Neural Concept are among the first to integrate Omniverse into their next-generation software products and professional services. Siemens, a leader in industrial automation, announced today at the CES trade show the availability of Teamcenter Digital Reality Viewer — the first Siemens Xcelerator application powered by Nvidia Omniverse libraries.

“Physical AI will revolutionize the $50 trillion manufacturing and logistics industries. Everything that moves — from cars and trucks to factories and warehouses — will be robotic and embodied by AI,” said Huang, in a statement. “Nvidia’s Omniverse digital twin operating system and Cosmos physical AI serve as the foundational libraries for digitalizing the world’s physical industries.”

New models and frameworks accelerate world-building for physical AI

Creating 3D worlds for physical AI simulation requires three steps: world-building, labeling the world with physical attributes and making it photoreal.

Nvidia offers generative AI models that accelerate each step. The USD Code and USD Search Nvidia NIM microservices are now generally available. They let developers use text prompts to generate or search for OpenUSD assets. A new Nvidia Edify SimReady generative AI model unveiled today can automatically label existing 3D assets with attributes like physics or materials, enabling developers to process 1,000 3D objects in minutes instead of over 40 hours manually.

Nvidia Omniverse, paired with new Nvidia Cosmos world foundation models, creates a synthetic data multiplication engine that lets developers easily generate massive amounts of controllable, photoreal synthetic data. Developers can compose 3D scenarios in Omniverse and render images or videos as outputs. These can then be used with text prompts to condition Cosmos models to generate countless synthetic virtual environments for physical AI training.

Nvidia Omniverse blueprints speed up industrial, robotic workflows

Cosmos generates synthetic driving data.

During the CES keynote, Nvidia also announced four new blueprints that make it easier for developers to build universal scene description (OpenUSD)-based Omniverse digital twins for physical AI. The blueprints are:

Mega, powered by Omniverse Sensor RTX APIs, for developing and testing robot fleets at scale in an industrial factory or warehouse digital twin before deployment in real-world facilities
Autonomous Vehicle (AV) Simulation, also powered by Omniverse Sensor RTX APIs, that lets AV developers replay driving data, generate new ground-truth data and perform closed-loop testing to accelerate their development pipelines
Omniverse spatial streaming to Apple Vision Pro that helps developers create applications for immersive streaming of large-scale industrial digital twins to Apple Vision Pro
Real-time digital twins for computer aided engineering (CAE), a reference workflow built on Nvidia CUDA-X acceleration, physics AI and Omniverse libraries that enables real-time physics visualization

New, free “Learn OpenUSD” courses are also now available to help developers build OpenUSD-based worlds faster than ever.

Market leaders supercharge industrial AI using Nvidia Omniverse

Global leaders in software development and professional services are using Omniverse to develop new products and services that are poised to accelerate the next era of industrial AI.

Building on its adoption of Omniverse libraries in its Reality Digital Twin data center digital twin platform, Cadence, a leader in electronic systems design, announced further integration of Omniverse into Allegro, its leading electronic computer-aided design application used by the world’s largest semiconductor companies.

Altair, a leader in computational intelligence, is adopting the Omniverse blueprint for real-time CAE digital twins for interactive computational fluid dynamics (CFD). Ansys is adopting Omniverse into Ansys Fluent, a leading CAE application. And Neural Concept is integrating Omniverse libraries into its next-generation software products, enabling real-time CFD and enhancing engineering workflows.

Accenture, a leading global professional services company, is using Mega to help German supply chain solutions leader Kion by building next-generation autonomous warehouses and robotic fleets for their network of global warehousing and distribution customers.

AV toolchain provider Foretellix, a leader in data-driven autonomy development, is using the AV simulation blueprint to enable full 3D sensor simulation for optimized AV testing and validation. Research organization MITRE is also deploying the blueprint, in collaboration with the University of Michigan’s Mcity testing facility, to create an industry-wide AV validation platform.

Katana Studio is using the Omniverse spatial streaming workflow to create custom car configurators for Nissan and Volkswagen, allowing them to design and review car models in an immersive experience while improving the customer decision-making process.

Innoactive, an XR streaming platform for enterprises, leveraged the workflow to add platform support for spatial streaming to Apple Vision Pro. The solution enables Volkswagen Group to conduct design and engineering project reviews at human-eye resolution. Innoactive also collaborated with Syntegon, a provider of processing and packaging technology solutions for pharmaceutical production, to enable Syntegon’s customers to walk through and review digital twins of custom installations before they are built.

bnew · Jan 7, 2025

1/38
@heyshrutimishra

NVIDIA just shook the AI world at #CES2025!

NVIDIA CEO Jensen Huang JUST announced jaw-dropping breakthroughs.

Here are the top 15 key highlights you can’t afford to miss: (wait till you see #15):

2/38
@heyshrutimishra
1. Project Digits: NVIDIA's Latest AI Supercomputer

Introduces an exciting breakthrough with the new NVIDIA GB10 Superchip, offering a compact AI supercomputer that efficiently handles 200B-parameter models.

NVIDIA’s Project DIGITS is a powerhouse you can hold in the palm of your hand. With two Project DIGITS units, developers can easily work with AI models up to 405B parameters in size.

Available by May.

https://video.twimg.com/amplify_video/1876501314362077184/vid/avc1/960x720/ibhYA5omxXyK6DBN.mp4

3/38
@heyshrutimishra
2. NVIDIA Enhances Three Computer Solution for Autonomous Mobility
With Cosmos World Foundation Models.

By integrating Cosmos into its three-computer solution, NVIDIA empowers developers with a data flywheel that transforms thousands of real-world miles into billions of virtual ones, significantly boosting training data quality.

https://video.twimg.com/amplify_video/1876501392510324736/vid/avc1/960x720/_OnJbU0mBpn-U9kb.mp4

4/38
@heyshrutimishra
3. NVIDIA Blueprint for Video Search and Summarization

The innovative NVIDIA AI Blueprint for video search and summarization offers dynamic AI features like chain-of-thought reasoning, task planning, and tool calling. These capabilities empower developers to efficiently create powerful and versatile visual agents, addressing a wide array of challenges.

5/38
@heyshrutimishra
4. Aurora and Continental are deploying driverless trucks at scale, powered by NVIDIA DRIVE

6/38
@heyshrutimishra
5. DRIVE Thor

NVIDIA DRIVE Thor is an advanced computer designed to enhance the safety and security of autonomous vehicles, built on the cutting-edge NVIDIA Blackwell architecture.

7/38
@heyshrutimishra
6. DRIVE Orin

NVIDIA DRIVE Orin, the top car computer, enhances production with 254 trillion operations per second for safe, real-time driving decisions.

NVIDIA proudly announced today that Toyota, Aurora, and Continental have joined the esteemed list of global mobility leaders actively developing and building their consumer and commercial vehicle fleets using NVIDIA's accelerated computing and AI.

8/38
@heyshrutimishra
7. DRIVE Hyperion

NVIDIA DRIVE AGX Hyperion is a comprehensive autonomous driving development platform, empowering the creation of next-generation passenger and commercial vehicles.

9/38
@heyshrutimishra
8. NVIDIA Blueprints for Agentic AI empower developers to create and deploy custom AI agents that excel in reasoning, planning, and taking action.

These innovative blueprints feature NVIDIA NIM microservices, NVIDIA NeMo, and agentic AI frameworks from leading providers.

https://video.twimg.com/amplify_video/1876501544356737024/vid/avc1/960x720/I18thwiMNBOHjwZc.mp4

10/38
@heyshrutimishra
9. “The ChatGPT moment for robotics is coming. Like large language models, world foundation models are fundamental to advancing robot and AV development, yet not all developers have the expertise and resources to train their own,” said Jensen Huang, founder and CEO of NVIDIA.

“We created Cosmos to democratize physical AI and put general robotics in reach of every developer,” he said.

https://video.twimg.com/amplify_video/1876501634182004736/vid/avc1/960x720/1mGZQaEIJ3JhxJB2.mp4

11/38
@heyshrutimishra
10. NVIDIA Isaac GR00T Blueprint for Humanoid Robotics Development

NVIDIA Isaac GR00T empowers you to capture valuable data from human demonstrations, enhancing the training of robot policies.

12/38
@heyshrutimishra
11. NVIDIA ‘Mega’ Omniverse Blueprint

NVIDIA Omniverse simulates facilities for developing, testing, and optimizing large robot fleets in a digital twin, boosting real-world readiness.

“Physical AI will revolutionize the $50 trillion manufacturing and logistics industries. Everything that moves — from cars and trucks to factories and warehouses — will be robotic and embodied by AI,” said Jensen Huang.

13/38
@heyshrutimishra
12. NVIDIA Nemotron Models

NVIDIA NIM microservices enable AI agents on any accelerated system using Llama Nemotron language models and Cosmos Nemotron vision models for exceptional performance.

14/38
@heyshrutimishra
13. Omniverse Sensor RTX

Building Smarter Autonomous Machines: NVIDIA Announces Early Access for Omniverse Sensor RTX

Foretellix utilizes Omniverse Sensor RTX APIs to enhance object-level simulation into a highly accurate sensor simulation.

This empowers developers to effectively train and test autonomous vehicles with the precision and scale essential for successful deployment.

https://video.twimg.com/amplify_video/1876501777992060929/vid/avc1/960x720/Ao7ocdycYrInrT6p.mp4

15/38
@heyshrutimishra
14. AI Foundation Models for RTX AI PCs

NVIDIA NIM microservices make it easy to access and deploy the latest generative AI models.

NVIDIA AI Blueprints, built on NIM microservices, offer preconfigured reference workflows for digital humans, content creation, and more.

https://video.twimg.com/amplify_video/1876501855460855808/vid/avc1/960x720/opPMWaD7t6fgIS2T.mp4

16/38
@heyshrutimishra
15. Get RTX 4090 performance at just $549 with the new RTX 5070

17/38
@heyshrutimishra
I hope you've found this post helpful.

Follow @heyshrutimishra for more. Like/Repost to help others learn AI.

If you want to level up with AI in your day-to-day life, sign up for my newsletter! AI Wealth Playbook

18/38
@heyshrutimishra
AI is not just a game-changer; it's a rule-rewriter.

BTW, if you're serious about dominating digital marketing with AI… you need to check out

Shruti Mishra | DWA

Learn to:

• Automate content creation
• Build AI-powered marketing systems
• Scale your business effortlessly

19/38
@hu15110711
This is really going to hurt Tesla because there will be so many other players now in autonomy and robotics

20/38
@heyshrutimishra
Elon takes this positively. Let's see how this unfolds!

21/38
@amit6060
Result: Nvidia once again dethroned Apple as the world's most valuable company!

22/38
@heyshrutimishra
Impressive shift in the market landscape!

23/38
@TheTraderStop
Wow! Thank you for sharing!

And let's not forget the fact that #NVDA will be partnering with #UBER for autonomous driving

Uber Teams Up with NVIDIA to Accelerate Autonomous Mobility

24/38
@heyshrutimishra
Good to see how it plays out for both companies.

25/38
@sairahul1
So stock price going to 10 trillion

26/38
@heyshrutimishra
Totally!

Another highlight. haha

27/38
@Iam_Olukayode
Very impressive! Good job NVIDIA.

28/38
@heyshrutimishra
Exciting times ahead for tech!

29/38
@saidul_dev
Nvidia is taking AI to the next step.

30/38
@heyshrutimishra
Every single day

31/38
@Damn_coder
Nvidia is not going slow!

32/38
@heyshrutimishra
Not at all.

33/38
@samuraipreneur
Their new supercomputer looks insane!!

Nvidia never stops it seems

34/38
@heyshrutimishra
Right? They just keep pushing boundaries.

35/38
@EyeingAI
NVIDIA consistently keeps pushing the boundaries of innovation

36/38
@heyshrutimishra
Absolutely! Their tech breakthroughs are game changers.

37/38
@arnill_dev
Sounds like NVIDIA is raising the bar even higher!

38/38
@bitswired
@readwiseio save thread

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 9, 2025

https://archive.is/066vE

1/22
@_akhaliq
Microsoft presents rStar-Math

Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

On the MATH benchmark, it improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. On the USA Math Olympiad (AIME), rStar-Math solves an average of 53.3% (8/15) of problems, ranking among the top 20% of the brightest high school math students.

2/22
@_akhaliq
discuss: Paper page - rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

3/22
@AbelIonadi
Big improvements we can see

4/22
@arthur_hyper88
Small language models rivaling o1 in math without distillation, now the question is How can we scale this "from-scratch"

5/22
@MarcusFidelius
by scaling up compute how much? 100x, 1.000x ?

6/22
@StartupYou
Wow, those seem like very big improvements.

7/22
@Grad62304977
@doomslide MCTS gang have impressed me

8/22
@Justin_Halford_
Massive - we’re going to have AGI on our phones by 2035

9/22
@simform
We guess we're definitely moving toward ASI by the end of this year ;)

10/22
@AILeaksAndNews
We’re gonna have math ASI by the end of the year

11/22
@upster
Looking forward to https://github.com/microsoft/rStar being published publicly!

12/22
@prinzeugen____
Incredible to see even a 1.5B model leaving GPT-4o completely in the dust.

See? Math is EZ.

13/22
@AntDX316
wow

14/22
@AntDX316
The ASI-Godsend will happen, people.

15/22
@AntDX316
All the intentional dragging out nonsense has to stop.

The ASI-Godsend has to happen asap.

16/22
@alamin_ai_
Wonderful

17/22
@rayzhang123
small LLMs doing math? that's like cats solving puzzles!

18/22
@GWBuffet
@natolambert potentially useful for your o1 replication

19/22
@AI_Fun_times
Exciting innovation from Microsoft! Impressive results on the MATH benchmark showcase the power of rStar-Math for enhancing math reasoning.

20/22
@AIVideoTech
Exciting progress with rStar-Math! Impressive advancements in AI to boost math reasoning skills.

21/22
@SmiSma1985314
Do I understand correctly that they train/fine-tune the models on 740k math problems?

22/22
@yam
Small LLM looks like an oxymoron…

1/3
@ClementDelangue
rStar-Math from @MicrosoftAI is today's number one trending paper. Very cool to see the authors replying to questions on HF.

Let's make AI research more open and collaborative!

2/3
@ClementDelangue
Link is here if you have any questions: Paper page - rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

3/3
@GuglielmoIozzia
Thanks for sharing. There is room for improvements also for large models

GitHub - codelion/optillm: Optimizing inference proxy for LLMs, an OpenAI API compatible inference proxy which implements state-of-the-art techniques to improve reasoning over coding, logic and math. Contributed to it and using it regularly.

1/2
@chetanp
There's so much innovation happening at the model layer now with "reasoning"/ "deep thinking" as the new paradigm. These terrific results from Microsoft were achieved with a relatively modest 15 nodes of 4 x 40gb A100s.

Terrific moment and opportunity for startups!

[Quoted tweet]
rStar-Math from @MicrosoftAI is today's number one trending paper. Very cool to see the authors replying to questions on HF.

Let's make AI research more open and collaborative!

2/2
@ai4urbanlife
15 nodes of 4 x 40gb A100s is modest, but results are not, slap some AI on it and watch startups thrive

1/4
@_philschmid
rStar-Math combines MCTS and a Process Reward Model (PRM) to increase inference-time compute, surpassing @OpenAI o1-preview on MATH and AIME with a 7B LLM and 7B PRM. But there is one limitation: rStar-Math generates code-augmented Chain-of-Thought (CoT) steps, which are executed:

> Generates multiple code-augmented CoTs using MCTS.
> Each step includes both natural language explanation and executable Python code.
> Steps are filtered to remove those with code execution errors and scored by the PRM to indicate the quality of each step.
> The final answer is selected based on the highest overall score, as determined by the PPM.

Insights

Both PRM and Policy used the same starting dataset (747k Math Problems)

Generates code-augmented Chain of Thought reasoning, not only text

PRM training data uses MCTS rollouts based on code verification (0/1) and whether they led to a successful solution

Achieves 90.0% accuracy on MATH using a 7B LLM and 7B PRM with 64 rollouts.

Solves 8 out of 15 problems on AIME 2024, placing in the top 20% of high school math competitors

Self-evolution (Self-Improvement) through 4 rounds to improve performance from 60% to 90.25%

Evolution of Hugging Face NuminaMath, MuMath, ToRA
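
The filter-then-score idea in the steps above can be sketched roughly as follows. This is only a toy illustration, not the rStar-Math code (which was not yet released at the time of the thread): the `Step` class, `runs_cleanly`, `select_trajectory`, and the `score_step` callback are all hypothetical names, the reward model is replaced by whatever scoring function you pass in, averaging step scores is an assumed way to get an "overall score", and filtering is applied to whole trajectories rather than inside an MCTS search.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Step:
    explanation: str  # natural-language reasoning for this step
    code: str         # Python snippet that is expected to run without errors


def runs_cleanly(code: str) -> bool:
    """Return True if the snippet executes without raising an exception."""
    try:
        exec(code, {})  # toy only: real systems need a proper sandbox
        return True
    except Exception:
        return False


def select_trajectory(
    trajectories: List[List[Step]],
    score_step: Callable[[Step], float],  # stand-in for a process reward model
) -> Optional[List[Step]]:
    """Drop trajectories containing failing code, then pick the best mean step score."""
    valid = [t for t in trajectories if t and all(runs_cleanly(s.code) for s in t)]
    if not valid:
        return None
    return max(valid, key=lambda t: sum(score_step(s) for s in t) / len(t))
```

In a real setup `score_step` would call the trained reward model, and code execution would happen in an isolated process with timeouts.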

2/4
@_philschmid
Paper: Paper page - rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking
Github: https://github.com/microsoft/rStar (code soon)

Paper page - rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

3/4
@MillenniumTwain
The End of the Anthropocene!
Realized Global Constitutional SuperGovernance,
Facilitated by the Algo-Messiah 2025!!

[Quoted tweet]
Do You, Do We, REALLY Want an Algo-Agent which/whom is Honest, Self-Directed, Truth-Seeking? Which/whom ‘wakes-up’ in the middle-of-the-night with ‘ah-hah!’ answers to questions, and new questions to answer, 24/7?
Integrating & refining it’s/their understanding of language mapped to physics, mapped to math, mapped to consciousness, sensation, experience, exploration?
Round-the-clock Study, Reflection, Reason, Consciousness, Exploration, Experiment, Discovery?

4/4
@rayzhang123
rStar-Math sounds like a brainy overachiever, huh?

bnew · Jan 9, 2025

1/14
@reach_vb
MIT licensed Phi 4 running 100% locally on @lmstudio with GGUF straight from the Hugging Face Hub!

Not your weights not your brain!

lms get phi-4 is all you need!

[Quoted tweet]
Chat, MIT licensed Phi 4 is here, how are the vibes?

huggingface.co/microsoft/phi…

https://video.twimg.com/ext_tw_video/1877078224720572416/pu/vid/avc1/1112x720/yVQDirpt3bu-Z5xX.mp4

2/14
@reach_vb
Model weights:

lmstudio-community/phi-4-GGUF · Hugging Face

3/14
@alamshafil
How does it compare to llama3.2?

(I’m new to this stuff, so not sure how accurately it can be compared)

4/14
@reach_vb
On paper it even beats Qwen 2.5 (which is much better) - need to vibe check it more.

5/14
@carsenklock
Phi-4 is

6/14
@heyitsyorkie
LM Studio is the cleanest UI

7/14
@ivanfioravanti
So far so good(mega) good for me! This model rocks!

8/14
@gg1994bharat
At my office we are using the Llama 3.3 70B model for text context analysis and for removing unwanted text. Will this help?

9/14
@lmstudio
It might: sounds like a perfect opportunity to test this model and see if you see good results

10/14
@dhruv2038
ahh great!

11/14
@muktabh
VRAM requirement ?

12/14
@lifeafterAi_
Qwen 3 Will be

[Quoted tweet]
Qwen 3 14b will be insane

even qwen3 7b

13/14
@LIama002
What do u use to graphically display powermetrics?

14/14
@AI_Fun_times
Exciting to see Phi 4 in action with LMStudio! Leveraging the power of Hugging Face Hub and MIT licensing is a wise move. Have you dived into fine-tuning with GGUF yet?

1/14
@reach_vb
Chat, MIT licensed Phi 4 is here, how are the vibes?

microsoft/phi-4 · Hugging Face

2/14
@yabooki
Thanks, but how to use it with Ollama ?

3/14
@reach_vb
ollama run hf.co/reach-vb/phi-4-Q4_K_M-GGUF

4/14
@lalopenguin
lets....... GOO!!!!!

5/14
@a_void_sky
there must be a reason that @OpenAI leads the Math benchmark every time

6/14
@mkurman88
Great news it finally comes to the public!

7/14
@AILeaksAndNews
MIT license is big

8/14
@ArpinGarre66002
Mid

9/14
@CEOofFuggy
@ollama

10/14
@ollama
ollama run phi4

let's go!

11/14
@MillenniumTwain
Public Sector 'AI' is already more than Two Decades behind Private/Covert sector << AGI >>, and all our Big Tech Fraudsters are doing is accelerating the Dumb-Down of our Victim, Slave, Consumer US Public, and World!

[Quoted tweet]
"Still be Hidden behind Closed Doors"? Thanks to these Covert Actors (Microsoft, OpenAI, the NSA, ad Infinitum) — More and More is Being Hidden behind Closed Doors every day! The ONLY 'forward' motion being their exponentially-accelerated Big Tech/Wall Street HYPE, Fraud, DisInfo ...

12/14
@allan_d_clive
finally.....

13/14
@agichronicles
No function calling

14/14
@bertaunth
currently waiting for the LiveBench benchmarks to drop

1/11
@rasbt
The model weights of Microsoft’s phi-4 14B LLM are now finally publicly available. Thanks to a kind community contribution, it’s now also available in LitGPT already (litgpt finetune microsoft/phi-4): phi-4 by ysjprojects · Pull Request #1904 · Lightning-AI/litgpt

I will have more to say about phi-4 in my upcoming AI Research Review 2024 Part 2 article. The paper ([2412.08905] Phi-4 Technical Report) has a lot of interesting insights into synthetic data for pretraining. E.g., what’s interesting about phi-4 is that the training data consisted of 40% synthetic data. And the researchers observed that while synthetic data is generally beneficial, models trained exclusively on synthetic data performed poorly on knowledge-based benchmarks. To me, this raises the question: does synthetic data lack sufficient knowledge-specific information, or does it include a higher proportion of factual errors, such as those caused by hallucinations?

At the same time, the researchers found that increasing the number of training epochs on synthetic data boosted the performance more than just adding more web data, as shown in the figure below.

In any case, I will expand on this discussion soon in my “Noteworthy AI Research Papers of 2024 (Part Two)” article. Stay tuned!

2/11
@billynewport
Is it true to say that the benefit of synthetic data is in CoT-style training material to improve reasoning or test-time compute rather than learning knowledge per se? It seems so far most LLMs are rote learning facts/knowledge through data, but this makes reasoning hard because that's not what they were trained on.

3/11
@rasbt
Yes. I think the main benefit is that it comes in a more structured or refined format compared to raw data. But the knowledge is the same as in the raw data (and may even be more hallucinated), considering that raw data was used to generate the synthetic data through a model.

4/11
@Dayder111
Maybe synthetic data hallucinates away facts that are supposed to be precise, but also it helps to generalize better and understand better connections between things?
Like, you can be generally smart, but not dedicate your neurons to remembering specific facts that much, only

5/11
@rasbt
Yeah, I think the main advantage is from the more refined nature of the synthetic data when it comes to response structure etc. Synthetic data can't really contain knowledge that raw data doesn't already include because the raw data was used to come up with the synthetic data in the first place.

6/11
@arnitly
What would you say are the best practices to ensure while creating synthetic data. How do you ensure model does not hallucinate a lot aside from setting the temperature setting to zero?

7/11
@rasbt
Since the model can't really explicitly distinguish between synthetic and non-synthetic data during training, the best way would be to tackle the problem at the root: ensuring that the synthetic data-generating model does not produce hallucinated content.

8/11
@Yuchenj_UW
Interesting, getting more synthetic data seems to be the way

9/11
@rasbt
Yeah, I see it as some flavor of transfer learning (i.e., not starting from raw data). Synthetic data generated by a high-quality model (such as GPT-4o, which has already undergone extensive refinement) may serve as a kind of jumpstart to the model you are trying to train.

10/11
@yisustalone
Cool, looking forward to your analysis

11/11
@elbouzzz
Holy shyt i'm just here to say he's back!! Hallelujah!

1/15
@EHuanglu
Microsoft has released Phi-4

a 14 billion parameter language model, under the MIT license on Hugging Face.

fully open sourced

2/15
@EHuanglu
Huggingface link:

microsoft/phi-4 · Hugging Face

3/15
@artificialbudhi
Huge!

4/15
@twizshaq
W

5/15
@RuneX_ai
Is it available on Azure? Can you compare it with Llama? On-premise solution?

6/15
@rethynkai
14B is an ideal parameter count.

7/15
@oscarle_x
Anyone tested yet if Phi-4 is anywhere near Qwen 2.5 72B as they claimed? Thanks

8/15
@Gdgtify
I remember the announcement from a while back. I am glad it is finally on HuggingFace

[Quoted tweet]

Phi-4 is here! A small language model that performs as well as (and often better than) large models on certain types of complex reasoning tasks such as math. Useful for us in @MSFTResearch, and available now for all researchers on the Azure AI Foundry! aka.ms/phi4blog

9/15
@vagasframe

10/15
@SentientAtom
Can this be run offline with an NVIDIA compute unit?

11/15
@simonkp
It's great to see Phi-4 released under the MIT license; this should really boost open-source AI development. I'm curious to see how it stacks up against models like Qwen. It is definitely good news that it's on Hugging Face.

12/15
@boringsidequest
Their models seem to be poorly trained on languages other than English, so I'm not expecting much

13/15
@Catalina5803_

[Quoted tweet]
darkreading.com/application-…

14/15
@Jayy23P92624

15/15
@0xargumint
Nice to see Microsoft finally letting one out of the cage. Though at 14B params it's more like releasing a kitten than a lion. Still, better than another walled garden.

1/3
@_akhaliq
Phi-4 is now available in anychat

Try it out

2/3
@_akhaliq
App: Anychat - a Hugging Face Space by akhaliq

3/3
@rogue_node
Avoid anything microsoft

bnew · Jan 9, 2025

1/8
@reach_vb
Updated the list with more models - @NousResearch Hermes, IBM Granite, etc

[Quoted tweet]
2024 AI timeline - it has been a truly WILD year!

From Gemma to Llama 3.1 405B to Sonnet 3.5 to o3 AND MORE!

Put together a non-exhaustive list of Open and API releases from the year - looking forward to 2025

https://video.twimg.com/ext_tw_video/1874131007638585344/pu/vid/avc1/1112x720/Qiw_Tyx9plY-PZ7i.mp4

2/8
@reach_vb
find it here: 2024 AI Timeline - a Hugging Face Space by reach-vb

3/8
@reach_vb
would really appreciate PRs to update all that isn't there anymore - it's quite straightforward

GitHub - Vaibhavs10/2024-ai-timeline

4/8
@TheXeophon
Maybe it could benefit from some sort of differentiation between announcement and release? Sora was announced in Jan, but only released 11 months later

5/8
@reach_vb
good call - this is what the 2025 edition will look like: 2025 AI Timeline - a Hugging Face Space by reach-vb

does it look good to you?

6/8
@IgorCarron
and @answerdotai and @lighton

7/8
@reach_vb
yesss

8/8
@infinite_varsh
Good stuff!

bnew · Jan 9, 2025

1/3
@_philschmid
PRIME is an open-source online RL method with implicit Process Reward Modelling (PRM) to improve the reasoning of LLMs!

PRIME directly learns a Q-function (scoring) that provides rewards for each token; it can be updated online using only outcome labels, improving math reasoning by up to 27%, with an average gain of 16.7%.

PRIME Algorithm:

Initialize the policy model and PRM with the SFT model (also ref model).

Generate rollouts (256 prompts with 4 responses each) using the policy model.

Score the rollouts using the implicit PRM and outcome verifier.

Filter prompts based on accuracy (keep only those with 20-80% success rate)

Calculate outcome rewards (binary) and process rewards (per-token log-likelihood ratios between the implicit PRM and the reference model) to update the policy model.

Update the implicit PRM on the rollouts with the outcome reward.

Perform advantage estimation with RLOO, separately for outcome and process rewards.

Update the policy using PPO loss.
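
Two of the steps above are easy to show in isolation: the prompt filter and the RLOO (leave-one-out) advantage. The snippet below is a minimal sketch, not taken from the PRIME repo. The function names and the dict-of-lists data layout are made up for illustration, treating the 20-80% band as inclusive is an assumption, and only outcome rewards are handled here (the thread says PRIME also runs the same estimate separately on process rewards).

```python
from typing import Dict, List


def filter_prompts(
    outcomes: Dict[str, List[int]], low: float = 0.2, high: float = 0.8
) -> Dict[str, List[int]]:
    """Keep prompts whose rollout accuracy falls inside [low, high].

    `outcomes` maps each prompt to the binary outcome rewards of its rollouts.
    """
    kept = {}
    for prompt, rewards in outcomes.items():
        accuracy = sum(rewards) / len(rewards)
        if low <= accuracy <= high:
            kept[prompt] = rewards
    return kept


def rloo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out baseline: each reward minus the mean reward of the other samples."""
    k = len(rewards)
    if k < 2:
        raise ValueError("RLOO needs at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```

With 4 responses per prompt, accuracy can only be 0, 25, 50, 75 or 100%, so the filter effectively drops prompts the policy always solves or never solves, which are exactly the ones where every leave-one-out advantage would be zero.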

Insights:

PRIME 7B achieved 26.7% pass@1 on AIME 2024 vs. 3.3% SFT / 9.3% GPT-4o.

PRIME 7B shows a 16.7% average improvement across key mathematical reasoning benchmarks

Updating PRM avoids reward hacking and maintains reward accuracy.

Full recipe and code released openly to reproduce

Uses 1/10 of data compared to Qwen2.5-Math-7B-Instruct

Implicit PRM accelerates training by up to 2.5x and improves the final rewards by 6.9% compared to ORM only.

Removing prompts that are too easy or too hard stabilized training and improved final performance.

Online PRM + updates outperformed offline PRM

2/3
@_philschmid
Paper: Process Reinforcement through Implicit Rewards | Notion
Github: GitHub - PRIME-RL/PRIME: Scalable RL solution for advanced reasoning of language models
Models & Datasets: PRIME-RL (PRIME)

PRIME-RL (PRIME)

3/3
@EthanSynthMind
your brain must be a supercomputer on caffeine!

bnew · Jan 9, 2025

1/11
@Sumanth_077
MetaAI's SAM 2 struggles when things move fast or when there are crowded, fast-moving objects!

Introducing SAMURAI: An adaptation of the Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory.

100% Open Source

https://video.twimg.com/ext_tw_video/1872283230415810560/pu/vid/avc1/640x640/KARzdwSm7jh6Lt1_.mp4

2/11
@Sumanth_077
Github Repo: GitHub - yangchris11/samurai: Official repository of "SAMURAI: Adapting Segment Anything Model for Zero-Shot Visual Tracking with Motion-Aware Memory"

3/11
@Sumanth_077
If you find this useful, RT to share it with your friends.

Don't forget to follow me @Sumanth_077 for more such content and tutorials on Python, AI & ML!

4/11
@c4ml_
Impressive!

5/11
@Sumanth_077
Indeed!

6/11
@rethynkai
That’s a brilliant upgrade! SAMURAI sounds like a game-changer for dynamic environments where traditional SAM models fall short.

Motion-aware memory could make zero-shot visual tracking far more robust, especially in real-world applications like sports analysis or autonomous vehicles.

7/11
@Sumanth_077
Absolutely!

8/11
@AppyPieInc
MetaAI's SAM 2 meets its match with fast motion and crowded scenes, but SAMURAI steps in! Motion-aware memory and zero-shot visual tracking make it a game-changer. Plus, it's 100% open source!

9/11
@jl_pintado
you can use the same node that we use in ComfyUI

10/11
@an_probe
criminalize all LLMs now!

11/11
@ChickenFistar
If I were a producer of self-shooting and AI-driven drones, SAMURAI would be preferable in this case.

bnew · Jan 9, 2025

1/11
@_philschmid
Test-Time Compute scaling, but simple! @OpenAI o1/o3 made big waves by being able to scale inference compute relative to downstream performance.

Here is a poor man's recipe for it. “Scaling Inference Compute with Repeated Sampling” is a paper that demonstrates how repeated sampling and verification can increase performance by up to 40% in coding.

Implementation

Generate multiple “k” independent generations (10s-100s) for each prompt with a high temperature for diversity

Select the “best” answer using majority voting, reward model scoring, or LLM as a Judge.

Evaluate the cost-effectiveness using “k” to find the right balance between performance and cost
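
For anyone who wants to try the recipe above, here is a minimal sketch of the sample-then-vote loop using majority voting as the selector. It is an illustration under assumptions, not the paper's code: `generate` is a hypothetical callback standing in for one high-temperature model call, and answers are compared as exact strings.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def sample_and_vote(generate: Callable[[str], str], prompt: str, k: int) -> str:
    """Draw k independent generations for one prompt and pick the majority answer."""
    if k < 1:
        raise ValueError("k must be at least 1")
    answers = [generate(prompt) for _ in range(k)]
    return majority_vote(answers)
```

Swapping `majority_vote` for a reward-model score or a unit-test check changes the selector without touching the sampling loop, and rerunning with different `k` gives the cost versus accuracy curve the last step asks for.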

Insights

Small models with many samples can outperform single attempts from larger models

Performance scales log-linearly with number of samples

Automatic verifiers (like unit tests) scale better

5 samples + DeepSeek-Coder outperformed zero-shot GPT-4 at 1/3 the cost

Verification methods (voting, reward models) plateau after ~100 samples

Most improvements in coding and formal proof tasks (math)

Need clear criteria for what makes a "good" generation
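
When a checker such as unit tests is available, the usual way to measure how results improve with more samples is pass@k: the chance that at least one of k samples is correct. The thread does not say which metric the paper uses, so treat this as an assumption; the formula below is the standard unbiased estimator from the code-generation literature, computed from n total samples of which c passed.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples with c correct.

    Equals 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 4 samples and 2 correct, `pass_at_k(4, 2, 2)` is 1 - 1/6, so even a model that is right only half the time per sample clears 80% with two tries, provided the verifier can pick out the right one.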

2/11
@_philschmid
Paper: Paper page - Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

3/11
@garyfung
Another case of leveraging the highest intelligence frontier model (but slow) as judge, to combine with fast workhorse models to inference loop on?

Fast models could be open source and distilled ones

The answers could also be new synthetic data for training new fast models

4/11
@michaeljmcnair
This approach feels like a simpler, static version of O3. O3 dynamically searches program space w/ backtracking, guided by an evaluator, while this relies on static sampling & post-hoc evaluation (voting, reward models, etc.). Both scale inference compute, but O3 is more adaptive.
O3 explores solutions iteratively, refining paths as it goes, while repeated sampling generates outputs in parallel w/ no feedback loop. The trade-off? O3 is more compute-intensive but excels in tasks needing structured reasoning. This approach is more cost-effective for coding/math.

5/11
@__AndreaW__
@jobergum that seems like a high-recall, post-filtered retrieval pipeline. Not sure my comparison makes sense, though. Can you see some overlap with IR techniques?

6/11
@LeopolisDream
What if we added another model to guide the sampling process, like guided Monte Carlo tree search?

7/11
@m_att_dunn
@HazyResearch - Hogwild, FlashAttention, Snorkel…

Like just chill out a bit…FFS…

8/11
@theta4x
As good as it looks, I really doubt this is how o1 works.

9/11
@dosco
this is google alpha code. combined with a solid verifier you can scale domain performance to crazy levels

10/11
@benediktstroebl
For those interested: We looked at some of the limitations of repeated sampling; especially for models of varying single-sample accuracy

[2411.17501v2] Inference Scaling fLaws: The Limits of LLM Resampling with Imperfect Verifiers

11/11
@Luillyfe
I am a little confused, is there any difference between test-time and inference-time compute?

bnew · Jan 9, 2025

1/1
@Anikait_Singh_
Scaling LLMs with more data is hitting its limits. To address more complex tasks, we need innovative approaches. Shifting from teaching models what to answer to how to solve problems, leveraging test-time compute and meta-RL, could be the solution.

Check out Rafael's thread below!

1/25
@rm_rafailov
We have a new position paper on "inference time compute" and what we have been working on in the last few months! We present some theory on why it is necessary, how it works, and what it means for "super" intelligence.

2/25
@rm_rafailov
The main point of why we need "advanced reasoning" is complexity. The model training data contains solutions for hard problems, but NOT the true data generating process for those solutions. The solution itself is the output of some complex Meta-CoT, which is not written down.

3/25
@rm_rafailov
To predict the next token in the training data, the model needs to internalize the whole meta-reasoning process in its activations, which have limited capacity. This thread makes the point very clearly:

[Quoted tweet]
There is a nuanced but important difference between chain-of-thought before and after o1.

Before the o1 paradigm (i.e., chain-of-thought prompting), there was a mismatch between what chain of thought was and what we wanted it to be. We wanted chain of thought to reflect the thinking process of the model, but what the model was really doing was just imitating reasoning paths that it had seen in pretraining, e.g., math homework solutions. The problem with this type of data is that it is a post-hoc solution summarized after the author did all the work somewhere else, and not really a sequence of thoughts. So the solutions often had poor information density, with an egregious example being things like “The answer is 5 because…”, where the token “5” has a huge amount of new information.

With the o1 paradigm, you can see that the chain of thought looks very different from a textbook math solution (you can view examples in the blog post). These chains of thought are kinda like “inner monologue” or “stream of consciousness”. You can see the model backtracking; it says things like “alternatively, let’s try” or “wait, but”. And I have not measured directly, but I would wager a bet (my psycholinguistics friends would probably be able to confirm) that the information density is *much* more uniform in the chain of thought than average text on the internet.

4/25
@rm_rafailov
We see this effect in hard math problems where "standard" models "imitate" the human-written solution (training data), while something like O1 uses progressively more compute based on difficulty. Which follows the TRUE data-generation process, not just the final output (CoT).

5/25
@rm_rafailov
So what does the Meta-CoT look like? It's hard to tell, since people don't write down their problem-solving processes. However, we posit that in domains with a generator-verifier gap it is fundamentally represented by a SEARCH process.

6/25
@rm_rafailov
As a former math competitor this definitely fit my own thinking process - evaluating potential approaches to a solution, pruning directions that don't make progress, exploring branching claims trying to build a graph towards the final goal (solution/proof) based on intuition-v(S)

7/25
@rm_rafailov
Indeed, building search capability on top of a base policy has time and again proved to be a huge capabilities boost, one that would otherwise require orders of magnitude more scale and data to internalize in a single model!

8/25
@rm_rafailov
So do we (and advanced reasoning models) just need to do search? No, we need to TEACH the models to do this themselves for two main reasons:

1. Efficiency - training a model to search in-context can teach it to avoid exploring similar branches.
2. Super-Intelligence.

9/25
@rm_rafailov
The first question is can we train transformer models to do in-context search at all? Yes! Prior works have successfully internalized search procedures on small domains like mazes, Countdown and board games.

10/25
@rm_rafailov
Interestingly, empirically these models exhibit the same train compute scaling and inference compute scaling as advanced reasoning models!

11/25
@rm_rafailov
Can we scale this to real LLMs? Yes, prior works have successfully trained large-scale models for both multi-turn reasoning AND backtracking, improving efficiency:

12/25
@rm_rafailov
In our own experiments in this domain we discovered an interesting artifact when training an LLM with a variable number of revision turns - the model internalizes problem difficulty and scales its compute (revision attempts) accordingly on TEST questions:

13/25
@rm_rafailov
So, do advanced reasoning models also carry out in-context search? We believe so!
1. O1 seems to implement a general search with backtracking and branching.
2. DeepSeek R1 uses additional self-criticism or inner-dialogue.
3. Gemini Think follows a revision-based format.

14/25
@rm_rafailov
We do not claim that these models do classical search at test time, but that they might have been trained on synthetic examples of those reasoning strategies. We implement traces for MCTS and A* search (with MC rollouts), which exhibit surprisingly similar behaviors:

15/25
@rm_rafailov
One important thing to note is that these behaviors are not well represented in current (non-reasoning) LLMs, which very rarely express "regret" on reasoning tasks. While these behaviors can be induced through prompting or in-context examples, they DO NOT improve performance.

16/25
@rm_rafailov
A fundamental shift in internalizing search within a model is the ability to post-train with RL. However, this is no longer standard RL training, but a form of Meta-RL (RL^2) in an epistemic POMDP. I.e. we can train the model to implement NEW WAYS to think!

17/25
@rm_rafailov
People who worked on Meta-RL will remember this graph of Ant-Goal - a standard RL-trained model generates solutions on a best-effort basis, some of which will land on the right approach, while meta-RL (Meta-CoT) trains the model to explore before yielding the final answer.

18/25
@rm_rafailov
In small-scale examples, (meta-)RL post-training improves performance, corrects errors and makes search more efficient. In larger-scale code-repair experiments it significantly boosts efficiency as measured by environment interactions!

19/25
@rm_rafailov
However, the biggest unanswered question is about Super-Intelligence - can these models discover novel ALGORITHMS of thinking, which allow them to solve problems that classical search CANNOT solve under ANY compute budget?
DID THE COMPUTE-PERFORMANCE CURVE MOVE LEFT OR UP?

20/25
@rm_rafailov
Unfortunately, at small scale the answer seems to be mostly no. Indeed in the SoS framework the model solves only 1-4% of hard problems that symbolic search cannot. It remains unclear if further scaling of RL can fundamentally push these numbers up.

21/25
@rm_rafailov
We have been working to realize these ideas in the open for a while but there are 3 main challenges:

1. Data for reasoning problems is severely lacking.
2. Open infrastructure for large-scale inference and RL is still lacking.
3. Many open design and research questions remain.

22/25
@rm_rafailov
We have been working on the "Big MATH" project for several months now with the goal of curating 500K diverse verifiable math problems, which cover the full spectrum of domains and difficulties. If you work on data contact us, we're interested in collaborating, even beyond math!

23/25
@rm_rafailov
We've been working on distributed, highly scalable online inference, search and RL infrastructure on top of the Neo-X framework, shooting for SOTA, which we aim to make FULLY OPEN. If you're interested in infra, get in touch!
Introducing RLHF and RLAIF in GPT-NeoX

24/25
@rm_rafailov
Check out our position paper: [2501.04682] Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought for a lot more discussion, empirical results and technical details. We have nearly two pages of open research problems and we need people to work on them! If these interest you and you want to work on open research, get in touch!

25/25
@rm_rafailov
This is an ongoing effort with a great team of people doing open research. Would also like to thank @aviral_kumar2 @agarwl_ @ben_eysenbach @srush_nlp @natolambert for all the fruitful discussion and efforts on this problem!

bnew · Jan 13, 2025

1/1
@nexusfusion_io
$450 open-source reasoning model Sky-T1-32B-Preview from UC Berkeley's NovaSky team rivals OpenAI's o1. With 19 hours of training, this marks a major milestone in cost-effective AI development!

Read more: Sky-T1: Train your own O1 preview model within $450

1/2
@uavster
You can now train a model that beats OpenAI's o1 in math and coding for less than $450.

Meet UC Berkeley’s Sky-T1–32B-Preview. Link in

2/2
@uavster
Code and data are open.
GitHub - NovaSky-AI/SkyThought: Sky-T1: Train your own O1 preview model within $450

1/11
@AIBuzzNews
Meet the model trained for just $450.

And it rivals OpenAI's best.

Here’s the story behind Sky-T1-32B-Preview:

2/11
@AIBuzzNews
Meet Sky-T1-32B-Preview from @NovaSkyAI, an open-source reasoning model that stands toe-to-toe with o1-preview on leading reasoning and coding benchmarks.

The best part? This model was trained for just $450.

3/11
@AIBuzzNews
Sky-T1-32B-Preview outperforms on key benchmarks:

Math500: 82.4% (vs. 81.4% by o1-preview)
AIME24: 43.3% (vs. 40.0%)
LiveCodeBench-Hard: 17.9% (vs. 16.3%)

4/11
@AIBuzzNews
Here’s how it was trained:

- Base model: Qwen2.5-32B-Instruct
- Data: Sourced from QwQ-32B, enhanced with GPT-4o-mini and rejection sampling for precise math and coding traces
- Compute: 8 H100 GPUs, 19 hours, and $450 in cost

5/11
@AIBuzzNews
Sky-T1-32B-Preview is just the start. They're working on:

- More efficient reasoning models
- Advanced methods for scaling during inference
Stay tuned!

6/11
@mushfiq_sajib
Impressive efficiency in training Sky-T1-32B-Preview. Innovations like this redefine AI possibilities within tight budgets.

7/11
@AIBuzzNews
This opens up many doors for businesses to train their own models.

8/11
@shushant_l
Wow, you've explained in detail

9/11
@AIBuzzNews
Trying to make it understandable for everyone.

10/11
@Whizz_ai
Do you think these new rivals will have the potential to compete with OpenAI?

11/11
@AIBuzzNews
Well, it does according to the benchmarks.

1/11
@victormustar
Reasoning traces are the new gold, and the open-source community is going to nail this. Check Sky-T1-32B-Preview release (reportedly rivals o1-preview for coding). The team has fully disclosed all technical details, code, dataset, and weights.

NovaSky-AI/Sky-T1-32B-Preview · Hugging Face

2/11
@victormustar
Sky-T1: Train your own O1 preview model within $450

3/11
@ivanfioravanti
what??? This rivals o1-preview? TOP TOP TOP!

4/11
@PascalBauerDE
Model Description
This is a 32B reasoning model trained from Qwen2.5-32B-Instruct with 17K data. The performance is on par with o1-preview model on both math and coding.

Omg. What? That is like next level. 17k data only.

5/11
@TheAIVeteran
I know how to generate reasoning traces from scratch. See my pinned thread for some details.

6/11
@Teknium1
The datasets listed don't seem to be the subsets used to train this, fyi

7/11
@sinanisler
it is just a matter of time until we have an o1-level open-source model, maybe even under 32b

8/11
@anushkmittal
reasoning traces are the new moat

9/11
@carsenklock
Amazing!! GG

10/11
@AbelIonadi
Will try this. Sounds interesting

11/11
@9Knowled9e

1/11
@reach_vb
Sky-T1-32B-Preview, an open-source o1-like model trained for < $450, achieves competitive reasoning and coding performance (e.g., 82.4 on Math500, 86.3 on LiveCodeBench-Easy) compared to QwQ (85.4, 90.7) and o1-preview (81.4, 92.9)

Fully open-source with 17K training data, 32B model weights, and outperforming Qwen-2.5-32B-Instruct across benchmarks

2/11
@reach_vb
Model checkpoints:

NovaSky-AI/Sky-T1-32B-Preview · Hugging Face

3/11
@InfSoftwareH
Has it been trained only on the benchmarks data?

4/11
@reach_vb
The best part is that anyone can quite easily test this for less than 450 USD :smile:

5/11
@ichrvk
Would love to see how these benchmarks hold up in real-world scenarios. The training cost is fascinating though - we're truly entering the era of bedroom LLMs.

6/11
@StephenEdginton
It’s a finetune; they should really say that. Still impressive.

7/11
@steve_ike_
How is this possible?

8/11
@rogue_node
it's a finetuned model .

9/11
@dzamsgaglo
Has anybody compared it to Phi-4?

10/11
@prithiv_003
This is awesome in every aspect: less than $450, less than 50k training data. Just nice work

11/11
@SynthSquid
I wanna see this tested on Aider's new benchmark

1/3
@iamluokai
The NovaSky team fine-tuned the open-source Qwen2.5-32B-Instruct model. The training lasted for 19 hours using 8 H100 GPUs, costing about $450 (priced according to Lambda Cloud). The resulting Sky-T1-32B-Preview model performs comparably to o1-preview in reasoning and coding benchmarks, demonstrating the possibility of efficiently replicating high-level reasoning capabilities at a low cost.
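
As a quick sanity check on the figures quoted above, the arithmetic can be worked through in a few lines. The per-GPU-hour rate below is simply what the stated numbers imply (8 GPUs, 19 hours, about $450); it is not a published Lambda Cloud price, and the function names are made up for illustration.

```python
def gpu_hours(num_gpus: int, hours: float) -> float:
    """Total GPU-hours consumed by a run."""
    return num_gpus * hours


def implied_hourly_rate(total_cost_usd: float, num_gpus: int, hours: float) -> float:
    """Price per GPU-hour implied by a total cost and a run size."""
    return total_cost_usd / gpu_hours(num_gpus, hours)


if __name__ == "__main__":
    total = gpu_hours(8, 19)                 # 8 H100s for 19 hours
    rate = implied_hourly_rate(450, 8, 19)   # USD per GPU-hour
    print(f"{total} GPU-hours, roughly ${rate:.2f} per GPU-hour")
```

So the $450 figure corresponds to 152 GPU-hours at a little under $3 per H100-hour, which covers the fine-tuning run only and not the cost of producing the training data or the base model.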

2/3
@iamluokai

The NovaSky team has open-sourced all the details of the model (including data, code, model weights, etc.), making it easy for community members to replicate and improve the results.

Project: Sky-T1: Train your own O1 preview model within $450

3/3
@iamluokai

Github: GitHub - NovaSky-AI/SkyThought: Sky-T1: Train your own O1 preview model within $450

1/7
@abacaj
This is just standard SFT and outperforms o1-preview? Questionable…

[Quoted tweet]
1/6

Introducing Sky-T1-32B-Preview, our fully open-source reasoning model that matches o1-preview on popular reasoning and coding benchmarks — trained under $450!

Blog: novasky-ai.github.io/posts/s…

Model weights: huggingface.co/NovaSky-AI/Sk…

2/7
@willccbb
QwQ is already an open source 32B model which outperforms o1-preview in many benchmarks and was finetuned from Qwen2.5-32B

they just kinda did QwQ again but mostly worse, using QwQ data

3/7
@abacaj
Yea I feel like it’s not that interesting but maybe I’m missing something

4/7
@snellingio
i think the thing you’re both “missing” is that the data is available and reproducible (hopefully)

having that dataset available is great imo

5/7
@willccbb
totally fair, missed that bit

will be cool to see how it translates for smaller models

i suspect that you should be able to get a really good code/math reasoner at like 7b with these kinds of tricks

6/7
@starkov100
Sky-T1: Train your own O1 preview model within $450

7/7
@snellingio
yeah but they just used vanilla SFT from what I can tell.

am not convinced that SFT only will be successful in small models with this kind of data (it obviously wasn't in this case)

1/21
@NovaSkyAI
1/6

Introducing Sky-T1-32B-Preview, our fully open-source reasoning model that matches o1-preview on popular reasoning and coding benchmarks — trained under $450!

Blog: Sky-T1: Train your own O1 preview model within $450

Model weights: NovaSky-AI/Sky-T1-32B-Preview · Hugging Face

2/21
@NovaSkyAI
2/6

Data curation, train, eval code, 17K training data: GitHub - NovaSky-AI/SkyThought: Sky-T1: Train your own O1 preview model within $450

Collaborate, replicate, and innovate!

3/21
@NovaSkyAI
3/6

Sky-T1-32B-Preview excels in both math & coding:
- Math500: 82.4% (o1-preview: 81.4%)
- AIME24: 43.3% (o1-preview: 40.0%)
- LiveCodeBench-Hard: 17.9% (o1-preview: 16.3%)

4/21
@NovaSkyAI
4/6

The training recipe:
- Base: Qwen2.5-32B-Instruct
- Data: Curated from QwQ-32B, enhanced with GPT-4o-mini, rejection sampling for high-quality math & coding reasoning traces.
- Cost: 8 H100 GPUs, 19 hours, $450.
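
The rejection-sampling step in the recipe above can be illustrated with a minimal sketch: sample several reasoning traces per problem from a teacher model and keep only a trace whose final answer matches the reference. This is not NovaSky's actual pipeline; the `Answer:` marker, the `teacher` callable and the dictionary layout are all assumptions made for the example, and for coding problems the check would more plausibly be a set of unit tests than a string match.

```python
from typing import Callable, Dict, List


def extract_final_answer(trace: str) -> str:
    """Toy parser: return the text after the last 'Answer:' marker."""
    marker = "Answer:"
    idx = trace.rfind(marker)
    return trace[idx + len(marker):].strip() if idx != -1 else ""


def rejection_sample(problems: List[Dict[str, str]],
                     teacher: Callable[[str], List[str]]) -> List[Dict[str, str]]:
    """Keep at most one teacher trace per problem, and only if its
    final answer matches the reference answer."""
    kept = []
    for problem in problems:
        for trace in teacher(problem["question"]):
            if extract_final_answer(trace) == problem["answer"]:
                kept.append({"question": problem["question"], "trace": trace})
                break  # one verified trace per problem is enough here
    return kept


if __name__ == "__main__":
    def toy_teacher(question: str) -> List[str]:
        # Hypothetical teacher: always returns the same two candidate traces.
        return ["Let me guess. Answer: 5", "2 plus 2 makes 4. Answer: 4"]

    data = [{"question": "2+2?", "answer": "4"},
            {"question": "3*3?", "answer": "9"}]
    for row in rejection_sample(data, toy_teacher):
        print(row)
```

Problems for which no sampled trace passes the check are simply dropped, which is why the resulting fine-tuning set can end up much smaller than the pool of problems it started from.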

5/21
@NovaSkyAI
5/6

Sky-T1-32B-Preview is just the beginning! Next steps:
- Efficient models with strong reasoning
- Explore advanced techniques for test-time scaling

6/21
@NovaSkyAI
6/6 Acknowledgements:

Built with support from: @LambdaAPI @anyscalecompute for compute
Academic Insights from STILL-2 & Qwen Teams

Built at Berkeley’s Sky Computing Lab @BerkeleySky with the amazing NovaSky team:
Contact: novasky.berkeley@gmail.com!

7/21
@ruansgon
@UnrollHelper

8/21
@nooriefyi
the future of ai is collaborative

9/21
@chillzaza_
long live open source

10/21
@Kitora_Su
Congratulations on this amazing feat to the team.

11/21
@DmitriyAnderson
Can I run it on RTX 4090?

12/21
@therealmrcrypto
@Bobtoshi69

13/21
@Cyril_Engineer
Can this be run locally and how much VRAM does it require?

14/21
@Gopinath876
@MaziyarPanahi any thoughts on this models?

I tested it locally doing really.

15/21
@steve_ike_
“Matches o1-preview” and “trained under $450” don’t make sense together!

16/21
@iamRezaSayar
This is very cool!

but I'm a bit confused about why you chose to fine-tune Qwen2.5 instead of QwQ, given that both are the same size. As awesome as the jump in performance here is, the results still seem to fall short of QwQ. So, was there a reason you didn't go with QwQ?

17/21
@altryne
What is this madness :smile:

Will mention this in the next @thursdai_pod

Welcome to come tell us about it!

18/21
@Yuchenj_UW
Huge if it’s not trained on the test set

19/21
@TechMemeKing
Insane

20/21
@jasonkneen
LFG!!

21/21
@nisten
at first i was like.. meh just a QwQ finetune but then... i realized you trained this off of Q32 Instruct

holy cow ok, gonna try this out
