1/52
@jiayi_pirate
We reproduced DeepSeek R1-Zero in the CountDown game, and it just works
Through RL, the 3B base LM develops self-verification and search abilities all on its own
You can experience the Aha moment yourself for < $30
Code: GitHub - Jiayi-Pan/TinyZero: Clean, accessible reproduction of DeepSeek R1-Zero
Here's what we learned
2/52
@jiayi_pirate
The recipe:
We follow the DeepSeek R1-Zero algorithm -- given a base LM, prompts, and a ground-truth reward, we run RL.
We apply it to CountDown: a game where players combine numbers with basic arithmetic to reach a target number.
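For readers curious what a ground-truth reward can look like here, below is a minimal rule-based sketch. It is illustrative only: the function name compute_reward, the <answer> tag format, and the scoring values are assumptions, not necessarily the exact TinyZero implementation.

```python
import re

def compute_reward(response: str, numbers: list[int], target: int) -> float:
    """Illustrative rule-based reward for a CountDown rollout.

    Hypothetical scheme: 0 for unparseable output, a small format reward
    for a well-formed <answer> tag, and full reward for a correct equation.
    """
    match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    if match is None:
        return 0.0
    equation = match.group(1).strip()

    # Allow only digits, arithmetic operators, parentheses, and spaces.
    if not re.fullmatch(r"[\d+\-*/(). ]+", equation):
        return 0.0

    # The equation must use exactly the provided numbers (each once).
    used = sorted(int(n) for n in re.findall(r"\d+", equation))
    if used != sorted(numbers):
        return 0.1  # format-only reward

    try:
        value = eval(equation, {"__builtins__": {}}, {})
    except Exception:
        return 0.1

    return 1.0 if abs(value - target) < 1e-6 else 0.1
```

For example, compute_reward("<answer>(6+4)*5</answer>", [4, 5, 6], 50) returns 1.0, while a malformed or incorrect equation earns little or nothing.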
3/52
@jiayi_pirate
The results: It just works!
The model starts from dummy outputs but gradually develops tactics such as revision and search.
In the following sample, the model proposes a solution, self-verifies, and iteratively revises it until it works.
Full experiment log: jiayipan
4/52
@jiayi_pirate
Quick ablations on CountDown:
Base model quality is key:
We run Qwen-2.5-Base at 0.5B, 1.5B, 3B, and 7B. The 0.5B model guesses a solution and stops. From 1.5B up, the models start learning to search, self-verify, and revise their solutions, enabling them to achieve much higher scores.
5/52
@jiayi_pirate
Either base or instruct model works
- Instruct model learns faster, but converges to about the same performance as the base
- Instruct model's outputs are more structured and readable
So extra instruction tuning isn't necessary, which supports R1-Zero's design decision
6/52
@jiayi_pirate
The specific RL alg doesn't matter much
We tried PPO, GRPO, and PRIME. Long CoT emerges with all of them, and they all seem to work well. We haven't had time to tune the hyper-parameters, so we don't want to draw quantitative conclusions about which algorithm works better.
7/52
@jiayi_pirate
Model's reasoning behavior is very task dependent:
- For Countdown, the model learns to do search and self-verification
- For number multiplication, the model instead learns to break the problem down using the distributive rule and solve it step by step.
8/52
@jiayi_pirate
Everything's open at GitHub - Jiayi-Pan/TinyZero: Clean, accessible reproduction of DeepSeek R1-Zero
And it costs < $30 to train the model! We hope this project helps demystify the emerging RL scaling research and makes it more accessible!
9/52
@jiayi_pirate
One caveat, of course, is that it's validated only on the Countdown task, not in the general reasoning domain. We are currently bounded by compute; please reach out if you wanna help!
10/52
@jiayi_pirate
A wild ride with @JunjieZhang12 @xingyaow_ @lifan__yuan
11/52
@deter3
The dataset on GitHub is Jiayi-Pan/Countdown-Tasks-3to4, right?
12/52
@jiayi_pirate
Yes, right here Jiayi-Pan/Countdown-Tasks-3to4 · Datasets at Hugging Face
13/52
@duluhagv
is the countdown dataset gen also open source? i lol'ed when I saw this release today after working on something similar last night
14/52
@jiayi_pirate
Hi, the Countdown generation code is mostly borrowed from Stream-of-Search
stream-of-search/src/countdown_generate.py at main · kanishkg/stream-of-search
The preprocessed data is here:
Jiayi-Pan/Countdown-Tasks-3to4 · Datasets at Hugging Face
Everything's open and reproducible
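Purely as an illustration of how such tasks can be produced (the real dataset comes from the Stream-of-Search generator linked above; make_countdown_task and its parameters here are hypothetical), one can sample operands and combine them with random operations so the target is reachable by construction:

```python
import random
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul, "/": operator.truediv}

def make_countdown_task(num_operands: int = 4, lo: int = 1, hi: int = 99, seed=None):
    """Sample a CountDown instance whose target is reachable by construction.

    Repeatedly combines the operand pool with random operations until one
    value remains; that value becomes the target. Illustrative sketch only.
    """
    rng = random.Random(seed)
    while True:
        numbers = [rng.randint(lo, hi) for _ in range(num_operands)]
        pool = numbers[:]
        ok = True
        while len(pool) > 1:
            a, b = rng.sample(pool, 2)
            op = rng.choice(list(OPS))
            # Keep intermediate results positive integers (Countdown convention).
            if op == "/" and (b == 0 or a % b != 0):
                ok = False
                break
            result = OPS[op](a, b)
            if result <= 0 or result != int(result):
                ok = False
                break
            pool.remove(a)
            pool.remove(b)
            pool.append(int(result))
        if ok and 1 <= pool[0] <= 999:
            return {"numbers": numbers, "target": pool[0]}
```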
15/52
@Samhanknr
How feasible do you think it is to teach a model to work on a given codebase? For example, teach it to write unit tests and pass unit tests in a given codebase using RL. Would it be affordable?
16/52
@jiayi_pirate
That’s definitely possible. We are working on this
[Quoted tweet]
Introducing SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers
Using SWE-Gym, our agents + verifiers reach new open SOTA - 32%/26% on SWE-Bench Verified/Lite,
showing strong scaling with more train / test compute
github.com/SWE-Gym/SWE-Gym
17/52
@frankxu2004
Very nice! One question: do you have any observation regarding CoT length changes during training? Is there a plot showcasing CoT length increased during training?
18/52
@jiayi_pirate
Great question! Early results show the 3B model initially reduces output length for correct formatting, then increases chain-of-thought length for better performance.
There may be minor code mismatches, so take this with a grain of salt.
Raw log:
jiayipan
19/52
@bennetkrause
One question: What model size and algorithm do the $30 refer to?
20/52
@jiayi_pirate
3B model, PPO; it takes 10 H100 hours
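(At typical cloud rates of roughly $3 per H100-hour, 10 GPU-hours works out to about $30.)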
21/52
@wzihanw
Nice!
22/52
@reissbaker
this is really cool. any chance for a huggingface model upload so we can play around with it?
23/52
@_herobotics_
Any insight on why open llama achieves high response length but low scores?
24/52
@jiayi_pirate
OpenLlama doesn't like to generate EOS tokens, and since we haven't implemented stopping after </answer>, the model often fails to terminate.
We didn't report OpenLlama's results in the Twitter thread as we believe the results will improve significantly once we fix this problem.
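As one possible fix (a sketch assuming Hugging Face transformers generation; the StopAfterTag class below is hypothetical, not the actual TinyZero patch), a custom StoppingCriteria can cut generation once </answer> appears:

```python
import torch
from transformers import StoppingCriteria, StoppingCriteriaList

class StopAfterTag(StoppingCriteria):
    """Stop generation once a closing tag (e.g. </answer>) appears.

    Illustrative sketch for single-sequence generation: decodes the tail
    of the sequence each step and checks for the tag.
    """
    def __init__(self, tokenizer, tag: str = "</answer>", window: int = 16):
        self.tokenizer = tokenizer
        self.tag = tag
        self.window = window  # only decode the last few tokens for speed

    def __call__(self, input_ids: torch.LongTensor, scores, **kwargs) -> bool:
        tail = self.tokenizer.decode(input_ids[0, -self.window:])
        return self.tag in tail

# Usage (hypothetical model/tokenizer/inputs objects):
# outputs = model.generate(
#     **inputs,
#     max_new_tokens=1024,
#     stopping_criteria=StoppingCriteriaList([StopAfterTag(tokenizer)]),
# )
```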
25/52
@GiorgioMantova
Do you find the original paper's numbers about the training cost to be plausible?
26/52
@jiayi_pirate
Yes, with MoE and FP8, that’s expected
27/52
@paul_cal
This is such a beautiful, efficient demonstration of something very powerful. Great idea
28/52
@anushkmittal
$30? might as well be free. nice work
29/52
@iamRezaSayar
this is very interesting.
I'm wondering about 2 things now: 1. whether we could extend this beyond verifiable math problems to, say, empathy, and 2. whether people would need/want to train their own personal reward model tuned to each user's preferences
30/52
@HolgersenTobias
That was fast. Excellent work. This paradigm seems robust and scalable.
Bottleneck now will be gathering huge, diverse sets of hard but verifiable tasks.
31/52
@CharlieYouAI
Incredible work! and super cool result
32/52
@RishfulThinking
Excuse me, fukking WHAT
33/52
@garybasin
Hell yeah
34/52
@rrektcapital
Could you kindly ELI5 what RL means in this case?
35/52
@burny_tech
Cambrian explosion of RL on LLMs begins
36/52
@bookwormengr
Could you please provide flop analysis?
37/52
@nooriefyi
love seeing this kind of ingenuity in action. what were the biggest hurdles you faced?
38/52
@SurfTheUniverse
Was this supposed to work?
39/52
@AntDX316
There's going to be no way for people to intentionally 'inflate' the required amount of tokens by adding certain things that makes it look like it costs more to run a generation, when it doesn't.
The ASI-Singularity(Godsend) is the only Global Solution, people.
40/52
@ReplayRyan
Huge
41/52
@nrehiew_
Thanks for this! I wonder if you guys tried llama 3 7B instead of llama 2 7B. Also would be interesting if you guys have like a completely overtrain/overfitted run similar to what the TULU3 team experimented with.
42/52
@suwakopro
So cool, intelligence is reproducible.
43/52
@xiaoze_jin
Need to look into it; thanks for sharing
44/52
@corefpark
Thanks for open sourcing this!!!! This is an enormous contribution to understanding the science of RL x LLMs!!
45/52
@JunYang1688
Great post. That DeepSeek's core contributions can be reproduced at lightning speed demonstrates the power of open source! It will definitely accelerate progress towards AGI.
46/52
@fjrdomingues
How do I super like a post?
47/52
@Vrda82073569
The Bitter Lesson strikes again!
48/52
@Nick_from_Texas
Any idea how autoregressive models become capable of self verification?
How do they avoid getting stuck along one chain of thought due to an earlier suboptimal token?
49/52
@neurosp1ke
Inspired by your experiment I added a procedural generator for countdown games to GitHub - open-thought/reasoning-gym: procedural reasoning datasets
50/52
@adamnemecek1
All machine learning approaches are convolutional inverses, including RL and LLMs.
51/52
@soheilsadathoss
Awesome!
52/52
@Kathleen_Tyson_
I played Scrabble against the All Time Countdown Champion last year. He destroyed me. Nice guy, though.