1/31
@ProfTomYeh
How does OpenAI train the Strawberry
(o1) model to spend more time thinking?
I read the report. The report is mostly about 𝘸𝘩𝘢𝘵 impressive benchmark results they got. But in terms of the 𝘩𝘰𝘸, the report offers only one sentence:
"Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses."
I did my best to understand this sentence. I drew this animation to share my best understanding with you.
The two key phrases in this sentence are: Reinforcement Learning (RL) and Chain of Thought (CoT).
Among the contributors listed in the report, two individuals stood out to me:
Ilya Sutskever, the inventor of RL with Human Feedback (RLHF). He left OpenAI and just started a new company, Safe Superintelligence. Listing Ilya tells me that RLHF still plays a role in training the Strawberry model.
Jason Wei, the author of the famous Chain of Thought paper. He left Google Brain to join OpenAI last year. Listing Jason tells me that CoT is now a big part of the RLHF alignment process.
Here are the points I hope to get across in my animation:
In RLHF+CoT, the CoT tokens are also fed to the reward model, which scores them to update the LLM for better alignment; in traditional RLHF, only the prompt and response are fed to the reward model to align the LLM.
At inference time, the model has learned to always start by generating CoT tokens, which can take up to 30 seconds, before it starts generating the final response. That's how the model spends more time thinking!
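As a sketch of that difference, here is a toy comparison in Python. Everything here is a hypothetical placeholder — `reward_model` stands in for whatever scorer OpenAI actually uses, which the report does not describe:

```python
def score_traditional_rlhf(reward_model, prompt, response):
    # Traditional RLHF: the reward model sees only prompt + final response.
    return reward_model(prompt + response)

def score_rlhf_with_cot(reward_model, prompt, cot_tokens, response):
    # RLHF + CoT (my guess above): the chain-of-thought tokens are also
    # shown to the reward model, so the reasoning itself gets aligned too.
    return reward_model(prompt + cot_tokens + response)
```

With a trivial stand-in scorer (e.g. `len`), the only difference is what text the reward model gets to see.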
There are other important technical details missing, like how the reward model was trained, how human preferences for the "thinking process" were elicited, etc.
Finally, as a disclaimer: this animation represents my best educated guess. I can't verify its accuracy. I do wish someone from OpenAI would jump in to correct me, because if they do, we will all learn something useful!
2/31
@modelsarereal
I think o1 learns CoT via RL through the following steps:
1. An AI generates a synthetic task + answer.
2. o1 gets the task and generates CoT answers.
3. The AI rewards the answers that solve the task.
4. The task + rewarded answers are used to fine-tune o1.
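The steps above can be sketched as a toy loop. Everything here — the addition-task generator, the random "model", the exact-match reward — is a stand-in to show the shape of the loop, not OpenAI's actual pipeline:

```python
import random

def generate_synthetic_task():
    # Step 1: an AI generates a synthetic task + ground-truth answer
    # (toy example: single-digit addition).
    a, b = random.randint(0, 9), random.randint(0, 9)
    return f"{a}+{b}", a + b

def model_generate_answers(task, n=4):
    # Step 2: the model produces several candidate CoT answers
    # (toy example: random guesses).
    return [random.randint(0, 18) for _ in range(n)]

def rl_finetune_round():
    task, truth = generate_synthetic_task()
    answers = model_generate_answers(task)
    # Step 3: reward only the answers that actually solve the task.
    rewarded = [a for a in answers if a == truth]
    # Step 4: (task, rewarded answer) pairs would be used to fine-tune o1.
    return [(task, a) for a in rewarded]
```

The point of the sketch: no human labels are needed anywhere in the loop, because the synthetic task comes with its own answer key.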
3/31
@ProfTomYeh
This does make sense. Hope we get more tech info soon.
4/31
@cosminnegruseri
Ilya wasn't on the RLHF papers.
5/31
@ProfTomYeh
You are right. I will make a correction.
6/31
@NitroX919
Could they be using active inference?
Google used test-time fine-tuning for their math Olympiad AI.
7/31
@ProfTomYeh
I am not sure. They may use it secretly. This tech report emphasizes CoT.
8/31
@Teknium1
Watch the o1 announcement video; the CoT is all synthetic.
9/31
@Cryptoprofeta1
But ChatGPT told me that Strawberry has 2 R's in the word
10/31
@sauerlo
They did publish their STaR research months ago. Nothing opaque or mysterious.
11/31
@AlwaysUhhJustin
I am guessing that the model starts by making a list of steps to perform, then executes each step, and then has some accuracy/hallucination/confirmation check that can make it loop. When all that is done, it outputs a response.
Generally agree on RL part
12/31
@shouheiant
@readwise save
13/31
@manuaero
Most likely: the model generates multiple steps, expert humans provide feedback (correct/incorrect) and modify a step if necessary. This data is then used for RLHF.
14/31
@dikksonPau
Not RLHF, I think.
15/31
@LatestPaperAI
CoT isn’t just a hack; it’s the architecture for deeper reasoning. The missing details? Likely where the real magic happens, but your framework holds.
16/31
@Ugo_alves
17/31
@arattml
18/31
@zacinabox
A dev in one of their videos essentially said “you have to make a guess, then see if that’s a right or wrong guess, and then backtrack if you get it wrong. So any type of task where you have to search through a space where you have different pieces pointing in different directions but there are mutual dependencies. You might get a bit of information that these two pieces contradict each other and our model is really good at refining the search space.”
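The guess-check-backtrack process described in that quote is essentially classic backtracking search over a space with mutual dependencies. A minimal sketch, with toy names (`pieces`, `consistent`) that are mine, not OpenAI's:

```python
def solve(pieces, consistent, assignment=()):
    # Guess one piece at a time, check it against what's already placed,
    # and backtrack on contradiction -- the pattern the dev describes.
    if len(assignment) == len(pieces):
        return assignment  # every slot filled consistently
    for choice in pieces[len(assignment)]:
        candidate = assignment + (choice,)
        if consistent(candidate):  # prune contradicting guesses early
            result = solve(pieces, consistent, candidate)
            if result is not None:
                return result
    return None  # dead end: backtrack to the previous guess
```

For example, with `pieces = [[1, 2], [1, 2], [1, 2, 3]]` and an "all values distinct" consistency check, the search tries `(1, 1)`, backtracks, and lands on `(1, 2, 3)`.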
19/31
@DrOsamaAhmed
This is a fascinating explanation, thanks for sharing it.
20/31
@GlobalLife365
It’s simple. The code is slow, so they decided to call it “thinking”. ChatGPT 4 is also thinking, just a lot faster. It’s a gimmick.
21/31
@GuitarGeorge6
Q*
22/31
@ReneKriest
I bet they did some JavaScript setTimeout with a prompt “Think again!” and gave it a fancy name.
23/31
@armin1i
What is the reward model? GPT-4o?
24/31
@Austin_Jung2003
I think there is a "Facilitator" in the CoT inference step.
25/31
@alperakgun
If CoT is baked into inference, then why is o1 so slow?
26/31
@wickedbrok
If the model was trained to output CoT tokens, then it’s just aesthetic and doesn’t mean that the machine can actually think.
27/31
@Daoist_Wang
The idea is quite simple, because we all learn that way.
The key is to apply it in real practice.
So I don't see anything beyond what Google DeepMind has done with AlphaZero.
28/31
@dhruv2038
This is a great illustration!
I learn a lot from your videos!
29/31
@rnednur
Why can't we fine-tune CoT tokens on existing open-source models to do the same? What is the moat here?
30/31
@ThinkDi92468945
The model is trained with RL on preference data to generate high quality CoT reasoning. The hard part is to generate labeled preference data (CoTs for a given problem ranked from best to worst).
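One way to picture "CoTs for a given problem ranked from best to worst" as training data — purely illustrative, with a stand-in `score` callable where a human or AI judge would sit:

```python
def build_preference_pairs(cots, score):
    # Rank candidate chains of thought from best to worst by a score
    # (in practice an expensive human/AI judgment; here any callable),
    # then emit (preferred, rejected) pairs for reward-model training.
    ranked = sorted(cots, key=score, reverse=True)
    return [(ranked[i], ranked[j])
            for i in range(len(ranked))
            for j in range(i + 1, len(ranked))]
```

The combinatorics hint at why this is the hard part: each ranked list of n CoTs yields n*(n-1)/2 pairs, but producing reliable rankings of long reasoning traces is exactly where the labeling cost lives.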
31/31
@JamesBe14335391
The recent Agent Q paper by the AGI company and Stanford hints at how this might work…
https://arxiv.org/pdf/2408.07199