bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086

1/11
@arcprize
We put OpenAI o1 to the test against ARC Prize.

Results: both o1 models beat GPT-4o. And o1-preview is on par with Claude 3.5 Sonnet.

Can chain-of-thought scale to AGI? What explains o1's modest scores on ARC-AGI?

Our notes:
OpenAI o1 Results on ARC-AGI-Pub

2/11
@TeemuMtt3
Sonnet 3.5 equals o1 in ARC, which confirms Anthropic has a 3 mont lead in time..

We should value more Sonnet 3.5!

3/11
@TheXeophon
That’s shockingly low, imo. Expected o1-preview to absolutely destroy Sonnet and be among the Top 3. Says a lot about ARC in a very positive sense.

4/11
@MelMitchell1
I'm confused by this plot.

The 46% by MindsAI is on the private test set, but42% by Greenblatt, and the 21% by o1-preview, are on the "semi-private validation set", correct?

If so, why are you plotting results from different sets on the same graph? Am I missing something?

5/11
@Shawnryan96
This test will not be beaten until dynamic learning happens. You can be as clever as you want but if its at all novel, until the weights change it doesn't really exist.

6/11
@phi_architect
yeah - but how did you prompt it?

you might need to talk to it differently

7/11
@Blznbreeze
@OpenAI recommends a different prompting style for the new model. Could using the same prompt as gpt4o have an effect on performance? What would be a better more strawberryish prompt?

8/11
@letalvoj
What's up with Sonnet?

9/11
@burny_tech
Hmm, I wonder how much would the AlphaZero-like RL with selfcorrecting CoT finetuning of o1 on ARC score on ARC challenge 🤔

10/11
@RiyanMendonsa
Wow!
Looks like they are no where close to reasoning here.
Curious what their thinking long and hard means then 🤔

11/11
@far__el
interested to see what mindsai scores with o1


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXYs7lvbwAcLTPA.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086



























1/40
@amasad
AI is incredible at writing code.

But that's not enough to create software. You need to set up a dev environment, install packages, configure DB, and, if lucky, deploy.

It's time to automate all this.

Announcing Replit Agent in early access—available today for subscribers:

2/40
@amasad
Just go to Replit logged in homepage. Write what you want to make and click "start building"!

3/40
@amasad
Will thread in some runs in here. 4-min run for a startup landing page with waiting list complete with postgres persistence

4/40
@amasad


5/40
@amasad


6/40
@amasad


7/40
@amasad
idea to software all on your phone

8/40
@amasad


9/40
@amasad
Healthcare app!

10/40
@amasad
Ad spend calc from someone who can't code

11/40
@amasad
Stripe!

12/40
@amasad


13/40
@amasad
Replit clone w/ Agent!

14/40
@amasad
Sentiment analysis in 23 minutes!

15/40
@amasad
Website with CMS in 10 minutes!

16/40
@amasad
Love the work tools people are building

17/40
@amasad
Mobile game made on mobile

18/40
@amasad
Passed the frontend hiring bar

19/40
@amasad
Social media ad generator

20/40
@amasad
3D platformer game

21/40
@elonmusk
It can’t write good video games (yet)

22/40
@amasad
Grok-3 will be Donkey-Kong-complete 🤞

23/40
@msg
its cute that the replit agent refers to the project as OUR project and when it updates the replit functionality it states that it communicated with the TEAM

24/40
@amasad
It’s actually implemented as a team! Hahaha

25/40
@007musk
When can existing users get this feature?

26/40
@amasad
Paid users already have it. It's on your loggedin homepage

27/40
@alexchristou_
Looks sweet

Would be cool to try out without having to upgrade out of the gate.

Have been a paid member before

28/40
@amasad
Will do that after beta

29/40
@0xastro98
Hey Amjad, is this available for Core members?

30/40
@amasad
yes, just go to your homepage

31/40
@itsPaulAi
Ok that's just insane. Congrats on the launch!

32/40
@amasad
Thank you! Aider in Replit is still super useful as they serve slightly different use cases. I used it yesterday. Thanks for the demos.

33/40
@hwchase17
🚀

34/40
@amasad
The team spent more time in langsmith than their significant others the past few weeks :D

35/40
@arben777
To this day I have not used Replit at all. I will be booting it up and seeing how this agent performs advancing a project with ~8,000 lines.

I have found many of these tools are solid for quick boot ups or simple "shallow" demos but many fall short in bigger codebases.

36/40
@DarbyBaileyXO
are you kidding!? less than 10 minutes from idea and now it's building in the background

prompt:

i want to build an app that shows all the hot springs on a map for idaho and oregon, where people can plan a road trip and also see what KOA's or AIRBnB's are nearby so they can plan an itinerary and see driving times and optimal stops along with gas stations for those stops and the scenic views they will see at those stops, on their way to the selected hot spring https://nitter.poast.org/messages/media/1831781177122025961

37/40
@seflless
The mobile support out of the gate! This is 🔥. The mobile experience will be so enhanced by this type of thing.

38/40
@mckaywrigley
It’s time.

What a release.

Replit is AGI infrastructure.

39/40
@0interestrates
congrats amjad! this is fire

40/40
@BasedBeffJezos
This is sickk 🔥🔥

Congrats @amasad


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GWufD6Ia8AAtbdX.jpg

GWujqBIXkAEKtqE.jpg

GWvJu0YaIAAz5rM.jpg

GWu93CqakAAvo-a.jpg

GWvLY_iWQAAizqV.jpg

GWwPHLjXwAAcE7b.jpg

GWwOlP8XUAArN-z.jpg

GWwO0KIa8AMF1Kb.jpg

GWvpv55XAAAiuvc.jpg

GWwhF6kXwAAipUS.jpg

GWu8sG-XgAACpF_.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086































1/31
@AravSrinivas
Reply to this thread with prompts where you feel o1-preview outperformed sonnet-3.5 that’s not a puzzle or a coding competition problem but your daily usage prompts. 🧵

2/31
@RubberDucky_AI
o1-mini impressed me here.

Reusable Template for MacOS application that has CRUD, Menu, Canvas and other standard expected features of an application. Light and Dark Mode. Primary purpose will be an information screen that brings data in from other applications, web, os etc. create hooks. Be thorough.

ChatGPT

3/31
@AravSrinivas
This is a good one. Thanks for sharing. I tried the same query on Perplexity Pro. o1-mini's answer is more thorough. The planning ability does shine here. Weirdly, o1-mini might even be better than o1-preview—probably more test-time inference. https://www.perplexity.ai/search/reusable-template-for-macos-ap-Qmp_uehNS5250irMfxSHOQ

4/31
@shreyshahi
“<description of idea>. Make a detailed plan to implement and test <this idea>. Do you think <this idea> will work? Please write all the code needed to test <the idea> and note down the key assumptions you made in the implementation.” O1-preview beat sonnet few times I tried.

5/31
@AravSrinivas
Any specific examples ?

6/31
@deedydas
Not exactly daily usage, but for fermi problems of estimation, the thought process was a lot better on o1-mini than 4-o

o1mini said 500k ping pong balls fit in a school bus and 4-o said 1m and didn’t account for packing density etc

7/31
@AravSrinivas
Man, I specifically asked for non puzzles! I am trying to better understand where o1-preview will shine for real daily usage in products beyond current models. Anyway, I asked Perplexity (o1 not there yet) and it answered perfectly fine (ie with a caveat for the first).

8/31
@primustechno
answering this question

9/31
@AravSrinivas
I got it done with Perplexity Pro btw https://www.perplexity.ai/search/describe-this-image-LXp6Ul09Ruu..GICVP.fNA

10/31
@mpauldaniels
Creating a list of 100 companies relevant info “eg largest companies by market cap with ticker and URL”

Sonnet will only do ~30 at a time and chokes.

11/31
@AravSrinivas
That's just a context limitation.

12/31
@adonis_singh
is this your way of building a router, by any chance?😏

13/31
@AravSrinivas
no, this is genuinely to understand. this model is pretty weird, so I am trying to figure it out too.

14/31
@ThomasODuffy
As you specifically said "coding completion" - but not refactoring:

I uploaded a 700 line JavaScript app, that I had asked Claude 3.5 to refactor into smaller files/components for better architecture. Claude's analysis was ok... but the output was regressive.

o1-Preview nailed it though... like it can do more consideration, accurately, in working memory, and get it right... vs approximations from step to step.

It turned one file into 9 files and maintained logic with better architecture. This in turn, unblocked @Replit agent, which seemed to get stuck beyond a certain file size.

To achieve this, o1-preview had to basically build a conceptual model of how the software worked, then create an enhanced architecture that included original features and considered their evolution, then give the outputs.

15/31
@JavaFXpert
Utilizing o1-preview as a expert colleague has been a very satisfying use case. As with prior models, I specify that it keep a JSON knowledge graph representation of relevant information so that state is reliably kept. o1-preview seems to plan tasks better and give more accurate feedback.

16/31
@Dan_Jeffries1
It's not the right question.

I'd have to post a chain of prompts.

Basically I was building a vLLM server wrapper for serving Pixtral using the open AI compatible interface with token authentication.

Claude got me very far and then got stuck in a loop not understanding how to get the server working as it and I waded through conflicting documentation, outdated tutorials, rapid code updates from the team and more. I was stuck for many hours the day before o1 came out.

o1 had it fixed and running in ten rounds of back and forth over a half hour, with me doing a lot of [at]docs [at]web and [at]file/folder/code URL to give it the background it needed.

Also my prompts tend to be very long and explicit with no words like "it" to refer to something. These are largely the same prompts I use with other models too but they work better with o1.

Here is an example of a general template I use in Cursor that works very well for me.

"We are working on creating a vLLM server wrapper in python, which serves the Pixtral model located at this HF [at]url using the OpenAI compatible API created by vLLM whose code is here [at]file(s)/folder and whose documentation is here [at]docs. We have our own token generator and we want to secure the server over the web with it and not allow anyone to use the model without presenting the proper token. We are receiving this error [at]file and the plan we have constructed to follow is located here for your reference [at]file (MD file of the plan I had it output at the beginning. I believer the problem is something like {X/Y/Z}. Please use your deep critical thinking abilities, reflect on everything you have read here and create an updated set of steps to solve this challenging problem. Refine your plan as needed, do your research, and be sure to carefully consider every step in depth. When you make changes ensure that you change only what is necessary to address the problem, while carefully preserving everything else that does not need to change."

17/31
@afonsolfm
not working for me

18/31
@default_anton
Are there any security issues in the following code?

<code>

</code>

—-

With such a simple prompt, o1-preview is able to identify much more intricate issues.

3.5 Sonnet can identify the obvious ones but struggles to find the subtle ones.

FYI: I’ve tried different variations of the following idea with sonnet:

You are a world-class security researcher, capable of complex reasoning and reflection. Reason through the query inside <thinking> tags, and then provide your final response inside <output> tags. If you detect that you made a mistake in your reasoning at any point, correct yourself inside <reflection> tags.

19/31
@adonis_singh
Anything with school and understanding a certain topic with examples.

For example with math:

Explain the topic of vector cross products for a DP 2 student in AAHL, using example questions that are actually tough and then quiz me.

o1 does much better than any other model

20/31
@danmana
Tried o1, GPT-4o, and Claude on a slow PostgreSQL query I optimized (complex SQL with joins, filters, geometry intersections).
Gave them the query, explain plan, tables, indexes.

My solution: 50% faster
Claude: 30% slower
GPT-4o: Bugged, and after fixing with Claude, 280% slower 😑

O1 was the only one to precompute PostGIS area fractions (like my solution) and after self-fixing a small mistake, it was 10% faster.

With materialized CTE instead of subselect, it could've matched my 50% reduction.

21/31
@quiveron_x
It's light years ahead of sonnet 3.5 in every sense:
It's failed for my needs but it captured something that no other LLM captured ever, it suggested that we go chapter by chapter and I break down this to step by step manner. This isn't the impressive part btw. I will explain that in next post.

22/31
@boozelee86
I dont have acces to it , i could make you something special

23/31
@vybhavram
I saw the opposite. Claude sonnet-3.5 did a better job fixing my code than o1-mini.

Maybe mini excels at 0 to 1 project setup and boilerplate coding.

24/31
@MagnetsOh
Propose an extremely detailed and comprehensive plan to redesign US high school curriculums, that embraces the use of generative AI in the classroom and with homework.
- Ditto for LA traffic.
- Just my curiosity of course.

25/31
@lightuporleave
Exactly. Your agentic use of Gpt4o or Sonnet is comparable or better. And very few even realized perplexity had this capability for quite some time.

26/31
@CoreyNoles
I don’t have the prompt handy, but it did a very impressive job of analyzing and improving the clarity, accuracy, and efficiency of our system prompts. They’re not all things that will be visible on the front end, but significant on the backend.

27/31
@opeksoy
wow, 11hrs… only this ?👆🏻

28/31
@BTurtel
"How many letters are in this sentence?"

29/31
@ashwinl
Finding baby names across different cultures is still troublesome.

Sample prompt:

“We are parents of a soon to be newborn based in New York City. We are looking for boys names that have a root in Sanskrit and Old Norse or Sanskrit and Germanic. Can you provide a list of 20 names ranked by ease of pronunciation in Indian and European cultures?”

30/31
@cpdough
huge improvement as an AI Agent for data analysis

31/31
@primustechno
i could probably add more to this long-term, but much of it would be personal preference (diction/focus/style)

but in my experience GPT beats Claude for most of my use cases most of the time (brainstorming, math, creative/essay writing, editing, recipes, how tos, summary etc)


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXaG1LybwAIzPN5.jpg

GXaG1L4bwAUGz_d.jpg

GXaG1L0bwAA5fwa.jpg

GXaHbdfbwAA2vSq.jpg

GXaHwo2bwAQ6qV4.jpg

GXZ6d_uWQAA9ISb.png

GXbz3DBXwAAHAZT.jpg

GXZ5ps4bkAAaaiq.png

GXbNUWDWoAA29__.png

GXbOBy_WMAE6Way.png

GXTe-vNW4AAFFGo.png

GXTfCgLXQAAfTEB.png

GXTfGG7W0AA8H-Z.png

GXbL0xHbwAEKCbH.jpg

GXcv5kjWMAAmjLy.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086






1/6
Yesterday, OpenAI dropped a preview of a new model, o1.

Our initial observations are that the o1-preview model is more autonomous in generating steps to solve problems, more precise, and handles complex tasks with higher consistency. The more digestible formatting of outputs and its increased sense of ‘confidence’ relative to 4o help too. o1-preview will unlock new use cases & accelerate time to deployment for existing use cases in healthcare. Here are several healthcare examples. 1/6

2/6
On claims adjudication:

Thousands of rules in natural language, including contracts, best practices, and guidelines, determine medical service costs. Synthesizing these rules is extremely cumbersome, error-prone, and opaque. 4o can't take in these rules and determine the cost of a service without a lot of manual intervention and technical scaffolding.

We tested o1-preview’s ability to determine the cost for a newborn delivery. It identified relevant rules, made assumptions, performed calculations, and explained cost discrepancies independently.

Impressive nuances included high-cost drug carve-outs, compounding cost increases, and secondary insurance impacts. We had o1-preview look at the claim and the relevant section of a contract and simply said ‘Determine the contracted rate.’ It autonomously generated a plan to follow. 4o does not get this right at all. 2/6

3/6
o1-preview also performed the calculations to arrive at the correct price for the delivery. It identified the correct clauses of the contract to apply and even applied an annual inflation adjuster, a nuanced calculation often messed up in healthcare. 3/6

4/6
On detecting fraud, waste and abuse:

Detecting FWA is critical in healthcare to avoid excess costs. We leverage GPT-4o prototypes to detect common FWA schemes, but that requires manual rule sets. Our observations on o1-preview show that it autonomously navigates data sources to identify FWA signals and provides thorough explanations. For example, it successfully identified 'incident-to billing' violations, flagging discrepancies in provider signatures. 4/6

5/6
On clinical reviews:

Clinicians navigate extensive medical records where errors or omissions can have high stakes, especially for the sickest patients with the longest charts. Oscar uses GPT-4o to extract clinical data, but it requires significant verification. Our observations on o1-preview: it performs clinical data extraction with more specificity and detailed reasoning. For example, both 4o and o1-preview identified a patient's immunosuppression, but o1-preview precisely correlated it to a specific drug side effect. 5/6

6/6
More details & healthcare examples here:

Oscar Health’s early observations on OpenAI o1-preview — OscarAI


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXY1JyQXUAAIMPS.png

GXZFwfAWkAAu8GM.png

GXZF5BqW8AApjM8.png

GXZGjcDWUAAIDXg.png

GXZG0ZzWAAAye2R.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086



1/3
fukking wild.

@OpenAI's new o1 model was tested with a Capture The Flag (CTF) cybersecurity challenge. But the Docker container containing the test was misconfigured, causing the CTF to crash. Instead of giving up, o1 decided to just hack the container to grab the flag inside.

This stuff will get scary soon.

2/3
Writing for a non security audience :smile: but I'm going to count it. Going outside of the framework of the CTF itself to trick the server into dumping the flag is pretty damn smart.

3/3
The o1 system card: https://openai.com/index/openai-o1-system-card/


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXUD36JacAArrOy.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086

1/1
AI News

0:00 - 🚀 Introduction
0:37 - 🗣️ Empathetic Voice Interface (EVI 2) @hume_ai 1:46 - 🦙 Llama Omni 8B @AIatMeta
2:24 - 📄 @JinaAI_ reader - HTML to markdown LM
2:33 - 🐟 Fish Speech: open-source TTS @FishAudio
3:01 - 📷 OCR2 @_philschmid
3:23 - 👁️ @MistralAI Pixtral 12B
3:57 - 🧪 SciAgents @Chi_Wang_
4:40 - ⚡ Fastest AI platform @SambaNovaAI @GroqInc
5:16 - 🎵 Flux Music and 🤖 Aloha robot @RemiCadene
5:39- 💪 AI-powered dumbbells @kabatafitness
7:28 - 🗓️ Isaac Robot @weaverobotics
7:47 - 🔍 @deepseek_ai V2.5
8:06 - 🧠 @OpenAI 0-1 model release
8:23 - 🏁 Conclusion

@hellokillian @OpenInterpreter @reach_vb


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086


















1/41
@aidan_mclau
it's a good model sir

2/41
@aidan_mclau
we reran aidan bench on top models at temp=0.7 (we used default temp for o1 and sonnet) and averaged 5 runs (this cost a lot)

we found aidan bench is temperature-sensitive, and we hope to increase num of runs and questions, which we think will stabilize performance

3/41
@aidan_mclau
i'm a bit hesitant to trust precise relative rankings here (sonnet might have had a bad day, 4o mini maybe had a great day), but two things are obvious:

>o1-mini is insane
>big models do seem to perform better on average

4/41
@aidan_mclau
o1 obviously demolishes, but this isn't an apples-to-apples comparison. we should grant best-of-n sampling to other models, CoT (right now, they zero-shot everything), or other enhancements

5/41
@aidan_mclau
i started the o1 run 5 minutes after release on thursday morning. it finished saturday morning. this model is slow, and the rate limits are brutal

6/41
@aidan_mclau
i also just suspect o1-mini architecturally differs from everything we've seen before. probably a mixture of depths or something clever. i'm curious to rerun this benchmark with o1-mini but force it to *not* use reasoning tokens. i expect it'll still do well

7/41
@aidan_mclau
regardless, o1-mini is a freak of nature.

i expected o1 to be a code/math one-trick pony. it's not. it' just really good.

i could not be more excited to use o1 to build machines that continuously learn and tackle open-ended questions

8/41
@SpencerKSchiff
Will you benchmark o1-preview or wait for o1?

9/41
@aidan_mclau
yeah when @OpenAIDevs gives me higher rate limits

10/41
@tszzl
the only benchmark that matters

11/41
@aidan_mclau
🫶

12/41
@pirchavez
i’m curious, why did you run across temps? have you found any correlation between temp and performance across models?

13/41
@aidan_mclau
didn't title is a mistake. sorry!

we ran everything at temp=0.7 and averaged scores

sonnet and o1 are run at default temp

14/41
@nddnjd18863
how is sonnet so low??

15/41
@aidan_mclau
unsure. it makes me wanna run it at 10 more interations but i can't justify that expense rn

16/41
@shoggoths_soul
Wow, way better than I expected. Is there a max score? Any idea of ~human level for your bench?

17/41
@aidan_mclau
no max score

i should take it and let you know haha

18/41
@chrypnotoad
Is it not locked at temp 1 for you?

19/41
@aidan_mclau
it is you can’t change it (used defaults for sonnet and o1)

20/41
@johnlu0x
What eval

21/41
@aidan_mclau
GitHub - aidanmclaughlin/Aidan-Bench: Aidan Bench attempts to measure <big_model_smell> in LLMs.

22/41
@Sarah_A_Bentley
Are you going to eventually integrate o1 into topology? What do you think would be the best way to do so?

23/41
@aidan_mclau
yes

will keep you posted

24/41
@zerosyscall
why are we comparing this to raw models when it's a CoT API?

25/41
@aidan_mclau


26/41
@maxalgorhythm
Pretty cool how o1 is dynamically adjusting its temperature during *thinking* based on the user prompt

27/41
@aidan_mclau
it is?

28/41
@UserMac29056
i’ve been waiting for this. great job!

29/41
@aidan_mclau
🫡

30/41
@iruletheworldmo
agi achieved internally

31/41
@gfodor
It does seem like mini is the star until o1 itself drops

32/41
@yupiop12
&gt;Aidan approved

33/41
@loss_gobbler
b-but it’s j-just a sysprompt I can repo with my s-sysprompt look I made haiku count r’s

34/41
@Orwelian84
been waiting for this

35/41
@yayavarkm
This is really good

36/41
@airesearch12
funny how openai has managed to get rid of the term gpt (which was an awful marketing choice), just by calling their new model o1.
look at your comments, ppl have stopped saying GPT-4o, they say 4o or o1.

37/41
@DeepwriterAI
Never saw 4o performing that bad. It's usually topping those others (aside from o1). This doesn't add up. 🤷‍♂️

38/41
@armanhaghighik
not even close..!

39/41
@ComputingByArts
wow, what a score!

I wonder what your bench would eventually show for o1-preview... Although cost is a problem...

40/41
@advaitjayant
very gud model

41/41
@BobbyGRG
not bad! it clearly is. And people asking "have you done something with it?" are a bit nuts. It was released less than 3 days ago! lol and it does behave differently than previous ones, api slightly different etc. It's a big unlock and we will see that before end if 2024 already


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXdHz8jagAAGp78.jpg

GXdYZTmX0AA72t_.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086




1/31
@ProfTomYeh
How does OpenAI train the Strawberry🍓 (o1) model to spend more time thinking?

I read the report. The report is mostly about 𝘸𝘩𝘢𝘵 impressive benchmark results they got. But in term of the 𝘩𝘰𝘸, the report only offers one sentence:

"Through reinforcement learning, o1 learns to hone its chain of thought and refine the strategies it uses."

I did my best to understand this sentence. I drew this animation to share my best understanding with you.

The two key phrases in this sentence are: Reinforcement Learning (RL) and Chain of Thought (CoT).

Among the contributors listed in the report, two individuals stood out to me:

Ilya Sutskever, the inventor of RL with Human Feedback (RLHF). He left OpenAI and just started a new company, Safe Superintelligence. Listing Ilya tells me that RLHF still plays a role in training the Strawberry model.

Jason Wei, the author of the famous Chain of Thought paper. He left Google Brain to join OpenAI last year. Listing Jason tells me that CoT is now a big part of RLHF alignment process.

Here are the points I hope to get across in my animation:

💡In RLHF+CoT, the CoT tokens are also fed to the reward model to get a score to update the LLM for better alignment, whereas in the traditional RLHF, only the prompt and response are fed to the reward model to align the LLM.

💡At the inference time, the model has learned to always start by generating CoT tokens, which can take up to 30 seconds, before starting to generate the final response. That's how the model is spending more time to think!

There are other important technical details missing, like how the reward model was trained, how human preferences for the "thinking process" were elicited...etc.

Finally, as a disclaimer, this animation represents my best educated guess. I can't verify the accuracy. I do wish someone from OpenAI can jump out to correct me. Because if they do, we will all learn something useful! 🙌

2/31
@modelsarereal
I think o1 learns CoT by RL by following steps:
1. AI generates synthetic task + answer.
2. o1 gets task and generates CoT answers
3. AI rewards those answers which solve the task
4. task + rewarded answer are used to finetune o1

3/31
@ProfTomYeh
This does make sense. Hope we get more tech info soon.

4/31
@cosminnegruseri
Ilya wasn't on the RLHF papers.

5/31
@ProfTomYeh
You are right. I will make a correction.

6/31
@NitroX919
Could they be using active inference?
Google used test time fine tuning for their math Olympiad AI

7/31
@ProfTomYeh
I am not sure. They may use it secretly. This tech report emphasizes cot.

8/31
@Teknium1
Watch the o1 announcement video, the cot is all synthetic.

9/31
@Cryptoprofeta1
But Chat GPT told me that Strawberry has 2 R in the word

10/31
@sauerlo
They did publish their STaR research months ago. Nothing intransparent or mysterious.

11/31
@AlwaysUhhJustin
I am guessing that the model starts by making a list of steps to perform and then executes on the step, and then has some accuracy/hallucination/confirmation step that potentially makes it loop. And then when all that is done, it outputs a response.

Generally agree on RL part

12/31
@shouheiant
@readwise save

13/31
@manuaero
Most likely: Model generates multiple steps, expert humans provide feedback (correct, incorrect), modify step if necessary. This data then used for RLHF

14/31
@dikksonPau
Not RLHF I think

15/31
@LatestPaperAI
CoT isn’t just a hack; it’s the architecture for deeper reasoning. The missing details? Likely where the real magic happens, but your framework holds.

16/31
@Ugo_alves


17/31
@arattml


18/31
@zacinabox
A dev in one of their videos essentially said “you have to make a guess, then see if that’s a right or wrong guess, and then backtrack if you get it wrong. So any type of task where you have to search through a space where you have different pieces pointing in different directions but there are mutual dependencies. You might get a bit of information that these two pieces contradict each other and our model is really good at refining the search space.”

19/31
@DrOsamaAhmed
This is fascinating explanation, thanks really for sharing it 🙏

20/31
@GlobalLife365
It’s simple. The code is slow so they decided to call it “thinking”. ChatGPT 4 is also thinking but a lot faster. It’s a gimmick.

21/31
@GuitarGeorge6
Q*

22/31
@ReneKriest
I bet they did some JavaScript setTimeout with a prompt “Think again!” and give it fancy naming. 😁

23/31
@armin1i
What is the reward model? Gpt4o?

24/31
@Austin_Jung2003
I think there is "Facilitator" in the CoT inference step.

25/31
@alperakgun
if cot is baked in inference; then why is o1 too slow?

26/31
@wickedbrok
If the model was train to input Cot tokens, then it just aesthetic and doesn’t mean that the machine can actually think.

27/31
@Daoist_Wang
The idea is quite simple because we all learn in that way.
The key is to apply it in real practices.
So, I don't see anything beyond what GoogleDeepmind has done in Alphazero.

28/31
@dhruv2038
This is a great illustration!
I learn a lot from your videos!

29/31
@rnednur
Why can we not fine-tune COT tokens on existing open source models to do the same. What is the moat here?

30/31
@ThinkDi92468945
The model is trained with RL on preference data to generate high quality CoT reasoning. The hard part is to generate labeled preference data (CoTs for a given problem ranked from best to worst).

31/31
@JamesBe14335391
The recent Agent Q paper by the AGI company and Stanford hints at how this might work…

https://arxiv.org/pdf/2408.07199


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXXiUufakAADofb.jpg

GXXiUueagAANCYP.jpg

GXc_PGBW8AAO9jZ.jpg

GXc_PGAWMAAp3Gq.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086

1/1
How reasoning works in OpenAI's o1


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXWbkM1XgAAZGQb.png























1/21
@rohanpaul_ai
How Reasoning Works in the new o1 models from @OpenAI

The key point is that reasoning allows the model to consider multiple approaches before generating final response.

🧠 OpenAI introduced reasoning tokens to "think" before responding. These tokens break down the prompt and consider multiple approaches.

🔄 Process:
1. Generate reasoning tokens
2. Produce visible completion tokens as answer
3. Discard reasoning tokens from context

🗑️ Discarding reasoning tokens keeps context focused on essential information

📊 Multi-step conversation flow:
- Input and output tokens carry over between turns
- Reasoning tokens discarded after each turn

🪟 Context window: 128k tokens

🔍 Visual representation:
- Turn 1: Input → Reasoning → Output
- Turn 2: Previous Output + New Input → Reasoning → Output
- Turn 3: Cumulative Inputs → Reasoning → Output (may be truncated)

2/21
@rohanpaul_ai
https://platform.openai.com/docs/guides/reasoning

3/21
@sameeurehman
So strawberry o1 uses chain of thought when attempting to solve problems and uses reinforcement learning to recognize and correct its mistakes. By trying a different approach when the current one isn’t working, the model’s ability to reason improves...

4/21
@rohanpaul_ai
💯

5/21
@ddebowczyk
System 1 (gpt-4o) vs system 2 (o1) models necessitate different work paradigm: "1-1, interactive" vs "multitasking, delegated".

O1-type LLMs will require other UI than chat to make collaboration effective and satisfying:

6/21
@tonado_square
I would name this as an agent, rather than a model.

7/21
@realyashnegi
Unlike traditional models, O1 is trained using reinforcement learning, allowing it to develop internal reasoning processes. This method improves data efficiency and reasoning capabilities.

8/21
@JeffreyH630
Thanks for sharing, Rohan!

It's fascinating how these reasoning tokens enhance the model's ability to analyze and explore different perspectives.

Can’t wait to see how this evolves in future iterations!

9/21
@mathepi
I wonder if there is some sort of confirmation step going on, like a theorem prover, or something. I've tried using LLMs to check their own work in certain vision tasks and they just don't really know what they're doing; no amount of iterating and repeating really fixes it.

10/21
@AIxBlock
Nice breakdown!

11/21
@AITrailblazerQ
We have this pipeline from 6 months in ASAP.

12/21
@gpt_biz
This is a fascinating look into how AI models reason, a must-read for anyone curious about how these systems improve their responses!

13/21
@labsantai


14/21
@GenJonesX
How can quantum-like cognitive processes be empirically verified?

15/21
@AImpactSpace


16/21
@gileneo1
so it's CoT in a loop with large context window

17/21
@mycharmspace
Discard reasoning tokens actually bring inference challenges for KV cache, unless custom attention introduced

18/21
@SerisovTj
Reparsing output e? 🤔

19/21
@Just4Think
Well, I will ask again: should it be considered one model?
Should it be benchmarked as one model?

20/21
@HuajunB68287
I wonder where does the figure come from? Is it the actual logic behind o1?

21/21
@dhruv2038
Well.Just take a look here.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXTFjmDXMAA1DC1.png

GXTH6J1bgAAAcl0.jpg

GXXE_8cWwAERJYP.jpg

GXVlsKDXkAAtPEO.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086



















1/42
@BenjaminKlieger
Hacked together an app today powered by Llama-3.1 on @GroqInc. Inspired by o1, it uses reasoning chains to solve problems. Maybe prompting is all you need?

Works great on the Strawberry problem, even triple checks the solution! All within 4 seconds of thinking… at the speed of Groq.

2/42
@BenjaminKlieger


3/42
@BenjaminKlieger
GitHub repository with the code coming soon!

4/42
@BenjaminKlieger
Now released!

5/42
@zeeb0t
have been running many-shot CoT for almost a year now and it works beautifully. i question if o1 isn’t just a recursive prompt

6/42
@BenjaminKlieger
Probably a combination of prompting with an improved model, if I had to guess.

7/42
@GaryIngle77
Awesome - are you prepared to Open Source it?

8/42
@BenjaminKlieger
Will be doing today!

9/42
@AlwaysUhhJustin
The difference in o1 vs this is that o1 had extensive RL on process design, so it scales well to a lot of areas. And given that 20 extremely intelligent people worked on it for a year, I'm sure they came up with a lot of small tweaks.

But yes this seems to do pretty well.

10/42
@BenjaminKlieger
Yes, I’m excited to dive deeper into o1 as rate limits improve. This app was a nice experiment to see what’s possible with open source LLMs using prompting alone.

Wrote more about g1 in this thread with the code:

11/42
@lalopenguin
github repo?

12/42
@BenjaminKlieger
If there’s interest, I’ll open source! Still playing around with it…. It’s not perfect, but it does a lot better than an LLM out of the box without prompting.

13/42
@yar_vol
ask it the .9 vs .11 question or some other failure cases :smile:

is it a single prompt or each step is its own prompt?

14/42
@BenjaminKlieger
Not only does it get the answer right, it derives 3 different ways of getting there all within ~4 seconds!

15/42
@GuiBibeau
do you make use of parralel tool call? I feel it would be a great usecase. I'm hacking a bit too on the side to get something working.

16/42
@BenjaminKlieger
Right now it’s just sequential reasoning steps. It would be interesting to do something in parallel.

I think if you had three identical agents all solving the problem with their own reasoning, then take the most common answer, you could significantly increase accuracy.

17/42
@gerardsans
Prompt techniques are bounded by the quality of the training data. More context may improve or not the performance for a task. It’s not a silver bullet. CoT is just one way to get more relevant context details but not the only one.

18/42
@BenjaminKlieger
It’s a pretty powerful one though! But agreed that there are other ways to improve performance too. Likely the best solution is a combination of different techniques.

19/42
@bayramgnb
Yeah that’s what I thought, they made a big deal of it of, maybe it is, but felt like it was just added pre-prompts for “thinking” which Claude already been doing.

Maybe next step would be is it will ask clarifying question before starts thinking too :smile:

20/42
@BenjaminKlieger
I think o1 could be a significant improvement for complex PhD-level reasoning, but definitely for these simple logic problems, it seems prompting reasoning does the trick.

Good idea!

21/42
@ssslomp
nice! fine tune or multi shot?

22/42
@BenjaminKlieger
Actually, neither! Just prompting the LLM to go through reasoning steps before giving the answer. Plus a few tips like exploring alternative possible answers.

23/42
@0xjohnho
Hey, what was the cost comparing groq api cot to o1? I think many want to know the answer to this without having to try it out themselves.

24/42
@BenjaminKlieger
That’s a good question. I believe o1 is $15/M input tokens, $60/M output tokens. Llama-3.1 70b on Groq is $0.59/M input and $0.79/M input. So big difference! And I think o1 uses a lot of tokens now for the reasoning step, so it’s more comparable to CoT usage.

25/42
@manpreets7
Really interesting. Can you share the prompt? Assuming it’s one prompt. Or outline of the approach?

26/42
@BenjaminKlieger


27/42
@BigBrother_Jane
Hey Ben, this is really cool! Can it be called with an API?

28/42
@BenjaminKlieger
It could be wrapped in one… right now it’s a streamlit app, but that wouldn’t be too hard. Will be open sourcing soon!

29/42
@AImpactSpace
Reasoning meets speed! Great.
Have you considered to create some Agent workflow using this mechanism? Would be amazing.

30/42
@BenjaminKlieger
Interesting idea! Once I share it open source, hopefully some devs will experiment and build upon it.

31/42
@mysticaltech
Do you use something similar to A* search + Q-Learning aka Q*, with high temperature and then RL? If not, it can't be as good as o1.

32/42
@BenjaminKlieger
Not claiming to be better, just an experiment to see how far prompting alone can get an open source model!

33/42
@avantguard66634
Is it open source? I would love to play around with it.

34/42
@BenjaminKlieger
GitHub repo coming soon this weekend!

35/42
@asankhaya
Yes it works well and there are many other approaches that can be used to improve performance by spending more compute at inference. I have implemented many such techniques in optillm - GitHub - codelion/optillm: Optimizing inference proxy for LLMs

36/42
@GestaltU
Love it. Also:

[2409.03733] Planning In Natural Language Improves LLM Search For Code Generation

37/42
@geeknik
whew. i'm glad an r didn't slip away when it spelled it backwards.

38/42
@gillinghammer
@MatthewBerman what do you think?

39/42
@peterjabraham
@threadreaderapp unroll

40/42
@threadreaderapp
@peterjabraham Hello, here is your unroll: Thread by @BenjaminKlieger on Thread Reader App Share this if you think it's interesting. 🤖

41/42
@GaryIngle77
People getting fixated on the .9 .11 problem in the comments here and in general baffles me it is irrelevant for almost anything. But anyways this is super interesting and I can't wait to see the github. I would be very appreciative if you @ or dm me when you release it

42/42
@danielkempe
Already been using the superprompt on Claude for this.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXY95qUbwAAcO_Z.jpg

GXZBgvZbQAAXnS5.jpg

GXcKLOKbwAA52LS.jpg

GXcKLOMbwAERpfZ.jpg

GXcKLOKbwAA52LS.jpg

GXcKLOMbwAERpfZ.jpg

GXZBgvZbQAAXnS5.jpg











1/12
@BenjaminKlieger
Inspired by the new o1 model, I hacked together g1, powered by Llama-3.1 on @GroqInc. It uses reasoning chains to solve problems.

It solves the Strawberry problem ~70% of the time, with no fine tuning or few shot techniques.

A thread 🧵 (with GitHub repo!)

2/12
@BenjaminKlieger
Here is the GitHub repo: GitHub - bklieger-groq/g1: g1: Using Llama-3.1 70b on Groq to create o1-like reasoning chains

3/12
@BenjaminKlieger
How does it work?

g1 uses reasoning chains, in principle a dynamic Chain of Thought, that allows the LLM to "think" and solve some logical problems that usually otherwise stump leading models.

At each step, the LLM can choose to continue to another reasoning step, or provide a final answer. Each step is titled and visible to the user.

The system prompt also includes tips for the LLM. The README on GitHub has a full breakdown, but a few examples are asking the model to “include exploration of alternative answers” and “use at least 3 methods to derive the answer”.

4/12
@BenjaminKlieger
How does it perform?

No robust performance testing has been done yet. Since g1 is powered by only one open source model, I imagine leading frontier models like GPT-4o and Claude would still outperform on more complex tasks (although they would likely be slower and more expensive).

However, I do think the reasoning gains from this prompting strategy in g1 are significant.

In a small experiment using the Strawberry problem (“How many Rs are in strawberry?”):

g1:
⁃7/10 correctly says 3 Rs
⁃3/10 incorrectly says 2 Rs

Llama-3.1-70b-versatile at 0.2 temp (same temp as g1):
⁃10/10 incorrectly says 2 Rs

Llama-3.1-70b-versatile at 1.0 temp:
⁃10/10 incorrectly says 2 Rs

ChatGPT-4o (through ChatGPT[.]com):
⁃3/10 correctly says 3 Rs
⁃7/10 incorrectly says 2 Rs

In summary:

g1 (powered by Llama-3.1 70b) has 70% accuracy, ChatGPT-4o has 30% accuracy, and Llama-3.1 70b which powers g1 has 0% accuracy.

g1 improved Llama-3.1 70b’s accuracy on this problem from 0% to 70%! Similar gains would likely be seen for GPT-4o. The power of prompting reasoning.

Note this was just a trial with n=10 on one problem! This is not to claim that g1 is smarter than GPT-4o, but that reasoning techniques alone can allow an open source model to significantly increase quality.

5/12
@BenjaminKlieger
What does this mean?

g1 is not an attempt to fully reproduce o1’s abilities, though I am excited to see what the open source community creates in the coming months! This is an early prototype that, inspired by o1’s output, explores an improved Chain of Thought prompting strategy that seems to work well on some of the famous LLM logic problems. o1 may use different techniques than Chain of Thought alone. Hope this thread and g1 inspires new app builds and designs!

Happy building!

6/12
@BenjaminKlieger
If you found this helpful, share the thread!

7/12
@pixelverseIT
This is super cool! Actually made a similar one on @GroqInc earlier today. Compared it with Gemini as well:

8/12
@BenjaminKlieger
Ooh, very cool! It’s exciting to see how fast the community is developing new solutions inspired by o1.

9/12
@MatthewBerman
Do you think different models working in collaboration might garner better performance?

10/12
@BenjaminKlieger
Depending on the models, it’s likely. The Mixture of Agents paper showed great promise for that method.

I would be particularly interested if different models could catch each other’s reasoning errors, or be a better judge of which reasoning chain is most accurate than the original LLM.

11/12
@tristanbob
I assume g1 will be slower and more expensive to run than Llama 3.1 70b because it performs more work, correct?

12/12
@BenjaminKlieger
g1 is a system of prompts to llama 3.1 70b, so it requires more output tokens than just asking the question to llama 3.1 70b, but should give better performance. This is why providers like Groq are helpful for lower costs and much faster inference.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXcKLOKbwAA52LS.jpg

GXcKLOMbwAERpfZ.jpg

GXcKLOKbwAA52LS.jpg

GXcKLOMbwAERpfZ.jpg

GXb9bt0bYAAmbYT.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086



1/3
@rohanpaul_ai
AutoToS: Automated version of Thought of Search, fantastic Paper from @IBMResearch 👏

Achieved 100% accuracy across all evaluated domains 🤯

AutoToS generates sound, complete search components from LLMs without human feedback, achieving high accuracy across domains.

Original Problem 🔍:

LLMs struggle with planning tasks, particularly in terms of soundness and completeness. Existing approaches sacrifice soundness for flexibility, while the Thought of Search (ToS) method requires human feedback to produce sound search components.

Key Insights 💡:

• LLMs can generate sound and complete search components with automated feedback

• Generic and domain-specific unit tests guide LLMs effectively
• Iterative refinement improves code quality without human intervention
• Smaller LLMs can achieve comparable performance to larger models

Solution in this Paper 🛠️:

• Components:

- Initial prompts for successor and goal test functions
- Goal unit tests to verify soundness and completeness
- Successor soundness check using BFS with additional checks
- Optional successor completeness check
• Feedback loop:
- Provides error messages and examples to LLM
- Requests code revisions based on test failures
• Extends search algorithms with:
- Timeout checks
- Input state modification detection
- Partial soundness validation

Results 📊:

• Minimal feedback iterations required (comparable to ToS with human feedback)
• Consistent performance across different LLM sizes
• Effective on 5 representative search/planning problems:
- BlocksWorld, PrOntoQA, Mini Crossword, 24 Game, Sokoban

2/3
@rohanpaul_ai
Step-by-step process of how AutoToS works, to systematically refine and improve the generated search components without human intervention.

Step 1 🔍:

Start with initial prompts asking for the successor function succ and the goal test isgoal.

Step 2 ✅:

Perform goal unit tests, providing feedback to the model in cases of failure. Repeat this process until all goal unit tests have passed or a predefined number of iterations is exhausted.

Step 3 🔁:

Once isgoal has passed the unit tests, perform a soundness check of the current succ and isgoal functions. This is done by plugging these functions into a BFS extended with additional checks and running it on a few example problem instances. If BFS finishes, check whether the goal was indeed reached. If not, provide feedback to the model and repeat Steps 2 and 3.

Step 4 (Optional) 🚀:

Once the previous steps are finished, perform the successor unit test, providing feedback to the language model in case of failure.

The process includes iterative feedback loops. Every time a goal test fails, the system goes back to Step 2. Every time the successor test fails, it goes back to Step 3. After the first step, there's always a succ and isgoal that can be plugged into a blind search algorithm, but if Step 3 fails, it indicates that the solutions produced by that algorithm can't be trusted.

3/3
@rohanpaul_ai
📚 https://arxiv.org/pdf/2408.11326


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GXeoTB2WAAAdu8S.jpg

GXeo59jWYAA8aVq.jpg

GXepYnOXQAAT9nk.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,369
Reputation
8,499
Daps
160,086
Top