China just wrecked all of American AI. Silicon Valley is in shambles.

Dr. Acula

Hail Hydra
Supporter
Joined
Jul 26, 2012
Messages
26,013
Reputation
8,801
Daps
138,459
yeah I've been hearing about this for the past few days.
They are offering for free, and open source, what you have to pay OpenAI 200 bucks a month for.
OpenAI is about to lose some change :ohhh:

I’m so sick of them.


China is showing this off to force America to show their hand. That's all.

This is one case where China played it smart :yeshrug:
 

FaTaL

Veteran
Joined
May 2, 2012
Messages
103,128
Reputation
5,105
Daps
205,859
Reppin
NULL
yeah I've been hearing about this for the past few days.
They are offering for free, and open source, what you have to pay OpenAI 200 bucks a month for.
OpenAI is about to lose some change :ohhh:
aren't they trying to go public?

it might be too late
 

Mike Nasty

Superstar
Joined
Nov 19, 2016
Messages
12,425
Reputation
2,247
Daps
60,677
Don't be mad at the news being told. If China is doing that much better that often, then the US needs to get its game up.
I'm mad cause it's blatant propaganda.
Go best the best country on earth already, but how about making an airliner that's sold outside your country first... or maybe qualifying for the World Cup.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
59,190
Reputation
8,782
Daps
163,873
lol it's built on PyTorch and Llama and likely used a bigger model with tons of compute to train it, but hey, sensationalism

it also matches o1 for certain tasks but not all.

1/27
@nrehiew_
How to train a State-of-the-art reasoner.

Let's talk about the DeepSeek-R1 paper and how DeepSeek trained a model that is at frontier Sonnet/o1 level.





2/27
@nrehiew_
Quick overview of what has been done to train an o1-like model:

- Process and Outcome Reward Models. This approach does RL and trains these 2 models to give reward/signal at the step or answer level. Given that Qwen trained a SOTA PRM, we can assume they do this.
- LATRO (https://arxiv.org/pdf/2411.04282) basically treats the CoT as a latent. Given prompt + CoT, a good CoT will lead to a high likelihood of the correct answer
- SFT on reasoning traces.

DeepSeek gets rid of all this complexity and simply does RL on questions with verifiable rewards, TULU 3 style (Tulu 3: Pushing Frontiers in Open Language Model Post-Training)



3/27
@nrehiew_
They start by trying to improve the Base Model without any supervised data.

They use Group Relative Policy Optimization (https://arxiv.org/pdf/2402.03300) with the advantage function just being the normalized outcome rewards

For rewards, they use simple rule-based accuracy checks (check the answer within \boxed, run test cases), and they encourage the model to put its thinking process between <think> tags
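To make that concrete, here is a minimal sketch of what such a rule-based reward could look like for a math-style prompt with a known reference answer. The function names, regexes, and the weighting between the two terms are illustrative assumptions, not taken from the paper.

```python
import re

def format_reward(completion: str) -> float:
    # Reward the model for wrapping its reasoning in <think> ... </think> tags.
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, reference_answer: str) -> float:
    # Check the final answer inside \boxed{...} against the known ground truth.
    match = re.search(r"\\boxed\{([^}]*)\}", completion)
    if match is None:
        return 0.0
    return 1.0 if match.group(1).strip() == reference_answer.strip() else 0.0

def total_reward(completion: str, reference_answer: str) -> float:
    # The relative weighting of accuracy vs. format is an assumption for illustration.
    return accuracy_reward(completion, reference_answer) + 0.1 * format_reward(completion)
```

For coding prompts, the accuracy term would instead run the generated code against test cases.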





4/27
@nrehiew_
The GRPO algorithm here. Again, the advantage estimate is just the (group-normalized) outcome reward. Check out the paper linked above for more details
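As a rough illustration of that advantage estimate (not DeepSeek's actual implementation): sample a group of completions for the same prompt, score each with an outcome reward, and normalize within the group instead of training a critic.

```python
import numpy as np

def grpo_advantages(group_rewards: list[float], eps: float = 1e-8) -> np.ndarray:
    # GRPO-style advantage: normalize each sample's outcome reward against the
    # other samples drawn for the same prompt, so no value/critic network is needed.
    r = np.asarray(group_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# e.g. 4 completions sampled for one prompt, where only the last one was correct:
print(grpo_advantages([0.0, 0.0, 0.0, 1.0]))  # the correct sample gets a large positive advantage
```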





5/27
@nrehiew_
1st interesting thing of the paper:
> neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

not much else for me to add here



6/27
@nrehiew_
They say that they use a really simple prompt because they are more interested in observing the evolution in model outputs





7/27
@nrehiew_
Notice that they went straight from Base -> RL without an intermediate SFT/Instruct tuning stage as is common. They call this model R1-Zero





8/27
@nrehiew_
Why is this interesting?

Notice how simple the entire setup is. It is extremely easy to generate synthetic prompts with deterministic answers. And with literally nothing else, it is possible to go from 0.2 -> 0.85 AIME scores.

Training the base model directly also extracts that ability without having its distribution disturbed by SFT

Again, at no point did they provide reference answers or instructions. The model realizes that to achieve higher reward, it needs to CoT longer
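As an illustration of how cheap such data is to produce (a hypothetical example, not the prompt mix DeepSeek actually used): any question whose answer can be computed or checked programmatically works, e.g. arithmetic.

```python
import random

def make_arithmetic_sample(rng: random.Random) -> dict:
    # One prompt whose answer can be verified exactly, with no human labeling.
    a, b = rng.randint(100, 999), rng.randint(100, 999)
    return {
        "prompt": f"What is {a} * {b}? Put your final answer in \\boxed{{}}.",
        "reference_answer": str(a * b),
    }

rng = random.Random(0)
dataset = [make_arithmetic_sample(rng) for _ in range(10_000)]
```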





9/27
@nrehiew_
With this extremely straightforward setup, the network learns to reflect/reevaluate its own answers. Again, this is done completely without supervision





10/27
@nrehiew_
The problem with RL on the base model is that the reasoning process/CoT is not really readable. So they introduce a small amount of high-quality, user-friendly data before the RL process, so that the final model isn't a "base model" but rather something more "assistant"-like



11/27
@nrehiew_
Their entire pipeline is as follows (a rough sketch in code follows the list):
1) Take a few thousand samples of high-quality data in the format CoT + summary and SFT the base model

2) Repeat the R1-Zero process. They notice the language-mixing problem still remains, so they add a reward accounting for the proportion of target-language words in the CoT. (Interesting note: this worsens performance slightly)

3) Collect 800K accurate samples from the trained model: ~600K STEM, ~200K general purpose. (Note: these were the samples used to fine-tune the other open models like Qwen, Llama, etc.)

4) They have one last RL stage where they combine the verifiable rewards + the preference tuning that was done for DeepSeek-V3 (for alignment purposes)
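A compact restatement of those four stages as data (descriptive only; the field names and summaries are my own paraphrase of the list above, not DeepSeek's config format):

```python
# The four-stage R1 pipeline summarized as data; values paraphrase the list above.
PIPELINE = [
    {"stage": 1, "kind": "SFT (cold start)",
     "data": "a few thousand high-quality CoT + summary samples"},
    {"stage": 2, "kind": "RL (GRPO)",
     "rewards": ["verifiable accuracy", "format", "target-language consistency"]},
    {"stage": 3, "kind": "SFT",
     "data": "~800K rejection-sampled outputs (~600K STEM, ~200K general purpose)"},
    {"stage": 4, "kind": "RL (GRPO)",
     "rewards": ["verifiable accuracy", "preference/alignment rewards from DeepSeek-V3"]},
]

for stage in PIPELINE:
    print(stage)
```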



12/27
@nrehiew_
By now, you should have seen/heard all the results. So I will just say one thing: I really do think this is an o1-level model. If I had to guess, it's ~ the same as o1 (reasoning_effort = medium)





13/27
@nrehiew_
They also evaluate the distilled models, and distillation really just works. They even beat Qwen's very own QwQ.

At 8B parameters, it is matching Sonnet and has surpassed GPT-4o





14/27
@nrehiew_
Now they have a section on the effectiveness of distillation. They train a Qwen 32B model using RL and compare it with the distilled version.

The finding that this RL version is worse off (~ the same as QwQ) shows that the way forward is to RL a huge model and distill it down.

This also gives insight into the impressive performance of o1-mini. It looks like it really is just extremely well-engineered distillation
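"Distillation" here is just supervised fine-tuning of a smaller student on the ~800K samples generated by the big RL-trained model. A minimal sketch of one training step, assuming a Hugging Face causal LM; the student model name is a placeholder, and batching, prompt masking, and scheduling are omitted:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

student_name = "Qwen/Qwen2.5-7B"  # placeholder student; R1 was distilled into Qwen and Llama checkpoints
tokenizer = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def sft_step(prompt: str, teacher_completion: str) -> float:
    # Standard next-token cross-entropy on (prompt + teacher reasoning trace).
    ids = tokenizer(prompt + teacher_completion, return_tensors="pt").input_ids
    loss = student(input_ids=ids, labels=ids).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```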





15/27
@nrehiew_
They also have a section on their unsuccessful attempts, which I find extremely commendable to share.

tl;dr: PRMs are hard to train and can be hacked; they should only be used for guided search rather than learning. MCTS was also not working and was too complicated





16/27
@nrehiew_
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

GitHub - deepseek-ai/DeepSeek-R1



17/27
@nrehiew_
Some thoughts:

I think this is one of the most important papers in a while because it's the first open model that is genuinely at the frontier and not just riding on the goodwill of being open.

The paper is really, really simple, as you can probably tell from the thread, because the approach is really simple. It is exactly what OpenAI is good at: doing simple things but executing at an extremely high level.

Personally, I'm surprised (maybe I shouldn't be) that just RL on verifiable rewards (credit to the TULU 3 team for the term) works. Now that we know this recipe, we should soon have something that can match o3.

Also worth noting that they did alignment tuning + language-consistency tuning. This hurts performance, which indicates the model could be even better. Really interesting to think about the tradeoffs here.

The way I see it, there are 2 open research areas:
- Can we improve inference-time performance? Search? What is o1-pro mode doing? How is reasoning_effort in o1 controlled?

- What does this unhackable ground-truth reward look like for normal domains without deterministic ground truths? I think it's just LLM-as-a-Judge but done extremely well (Sonnet probably does this); a rough sketch below
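A minimal sketch of such an LLM-judge reward, assuming the OpenAI Python client; the judge model name and rubric are placeholders, and a production version would need far more care against reward hacking:

```python
import re
from openai import OpenAI

client = OpenAI()

def judge_reward(prompt: str, response: str, judge_model: str = "gpt-4o") -> float:
    # Ask a judge model for a 0-10 score and map it to a [0, 1] reward.
    rubric = (
        "Rate the response to the prompt below from 0 to 10 for correctness and "
        "helpfulness. Reply with only the number.\n\n"
        f"Prompt:\n{prompt}\n\nResponse:\n{response}"
    )
    out = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": rubric}],
    )
    match = re.search(r"\d+(?:\.\d+)?", out.choices[0].message.content)
    return min(float(match.group()) / 10.0, 1.0) if match else 0.0
```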



18/27
@srivsesh
best follow in a while. can you remind me what the base model architecture is?



19/27
@nrehiew_
deepseek v3



20/27
@threadreaderapp
Your thread is very popular today! /search?q=#TopUnroll Thread by @nrehiew_ on Thread Reader App 🙏🏼@ssfd____ for 🥇unroll



21/27
@fyhao
I am asking o1 and deepthink.

Question:

117115118110113

Deepthink:



22/27
@AbelIonadi
Value packed thread.



23/27
@moss_aidamax
@readwise save thread



24/27
@raphaelmansuy
DeepSeek R1 in Quantalogic ReAct agent: We are thrilled to announce that with the release of v0.2.26, the Quantalogic ReAct agent now offers support for the DeepSeek R1 model! This enhancement empowers our users with advanced reasoning capabilities, making it easier to harness the power of AI for your projects. You can now utilize DeepSeek R1 seamlessly through the following commands:

🔹 `quantalogic --model-name deepseek/deepseek-reasoner`
🔹 `quantalogic --model-name openrouter/deepseek/deepseek-r1`

This integration marks a significant step forward for our community, enhancing the versatility and potential applications of the Quantalogic platform. Join us in exploring the possibilities that this powerful model brings to the table!



25/27
@DrRayZha
very valuable thread! BTW R1 seems sensitive to the input prompt, and few-shot prompting would degrade the performance; it may be a promising direction to make it more robust to input prompts





26/27
@ethan53896137
great post!!!



27/27
@ssfd____
@UnrollHelper




 

King Harlem

Superstar
Joined
Jan 28, 2016
Messages
5,118
Reputation
774
Daps
19,930
I don't know if having AI 50x more efficient than the current U.S. ones is a good thing. Actually, it's probably a bad thing but inevitable.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
59,190
Reputation
8,782
Daps
163,873
I’m so sick of them.


China is showing this off to force America to show their hand. That's all.

for two years now, many posters have been saying the elite will keep the best AI for themselves, and even Meta has said that when it gets to AGI level they'll stop releasing open models, but DeepSeek has said it's their intention to release it. we're supposed to think that's a bad thing now? :gucci:
 

King Harlem

Superstar
Joined
Jan 28, 2016
Messages
5,118
Reputation
774
Daps
19,930
so you never want to be able to run a chatgpt o3 level model on your phone that doesn't require an internet connection?
I hear you. There are great things that will come out of this, but capitalism will ensure that advances in AI are used to reduce costs, which will eventually cost people jobs.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
59,190
Reputation
8,782
Daps
163,873
lol it's built on PyTorch and Llama and likely used a bigger model with tons of compute to train it, but hey, sensationalism

it also matches o1 for certain tasks but not all.


the AI model the OP tweet is referring to is DeepSeek-V3 or DeepSeek-R1, not ByteDance's Doubao LLM


1/12
@lmarena_ai
Breaking News: DeepSeek-R1 surges to the top-3 in Arena🐳!

Now ranked #3 Overall, matching the top reasoning model, o1, while being 20x cheaper and open-weight!

Highlights:
- #1 in technical domains: Hard Prompts, Coding, Math
- Joint #1 under Style Control
- MIT-licensed

A massive congrats to @deepseek_ai for this incredible milestone and gift to the community! More analysis below 👇

[Quoted tweet]
🚀 DeepSeek-R1 is here!

⚡ Performance on par with OpenAI-o1
📖 Fully open-source model & technical report
🏆 MIT licensed: Distill & commercialize freely!

🌐 Website & API are live now! Try DeepThink at chat.deepseek.com today!

🐋 1/n




2/12
@lmarena_ai
In Hard Prompt with Style Control, DeepSeek-R1 ranked joint #1 with o1.





3/12
@lmarena_ai
Early results show DeepSeek-R1 strong across all domains! More votes are being collected for stable rankings.





4/12
@lmarena_ai
Win-rate heat map





5/12
@lmarena_ai
DeepSeek-R1 is also in the WebDev Arena for real-world web dev evals! Stay tuned for the leaderboard.

http://web.lmarena.ai



6/12
@lmarena_ai
Check out full data at http://lmarena.ai/leaderboard and try DeepSeek-R1 yourself!



7/12
@skingers
The danger of losing AI to CCP China...👆😕 Deepseek 🔥





8/12
@NaturallyDragon
Well deserved!



9/12
@elias_judin
beats o1-pro on some maths questions, but not on others. o1-pro seems to overcomplicate things when you're working with very abstract maths and requiring a calculation (that's actually a trick question)



10/12
@_losublime
@erythvian curve the sky, deep in the ocean



11/12
@walidbey
@UnrollHelper



12/12
@threadreaderapp
@walidbey Hola, please find the unroll here: Thread by @lmarena_ai on Thread Reader App Talk to you soon. 🤖




 