bnew

1/3
@rohanpaul_ai
Reverse Engineering o1 OpenAI Architecture with Claude 👀

2/3
@NorbertEnders
The reverse-engineered o1 OpenAI architecture, simplified and explained in a more narrative style using layman's terms.
I used Claude Sonnet 3.5 for that.

Keep in mind: it’s just an educated guess

3/3
@NorbertEnders
Longer version:

Imagine a brilliant but inexperienced chef named Alex. Alex's goal is to become a master chef who can create amazing dishes on the spot, adapting to any ingredient or cuisine challenge. This is like our language model aiming to provide intelligent, reasoned responses to any query.

Alex's journey begins with intense preparation:

First, Alex gathers recipes. Some are from famous cookbooks, others from family traditions, and many are creative variations Alex invents. This is like our model's Data Generation phase, collecting a mix of real and synthetic data to learn from.

Next comes Alex's training. It's not just about memorizing recipes, but understanding the principles of cooking. Alex practices in a special kitchen (our Training Phase) where:

1. Basic cooking techniques are mastered (Language Model training).
2. Alex plays cooking games, getting points for tasty dishes and helpful feedback when things go wrong (Reinforcement Learning).
3. Sometimes, the kitchen throws curveballs - like changing ingredients mid-recipe or having multiple chefs compete (Advanced RL techniques).

This training isn't a one-time thing. Alex keeps learning, always aiming to improve.

Now, here's where the real magic happens - when Alex faces actual cooking challenges (our Inference Phase):

1. A customer orders a dish. Alex quickly thinks of a recipe (Initial CoT Generation).
2. While cooking, Alex tastes the dish and adjusts seasonings (CoT Refinement).
3. For simple dishes, Alex works quickly. For complex ones, more time is taken to perfect it (Test-time Compute).
4. Alex always keeps an eye on the clock, balancing perfection with serving time (Efficiency Monitoring).
5. Finally, the dish is served (Final Response).
6. Alex remembers this experience for future reference (CoT Storage).
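
Translated out of the kitchen, that speculated inference pipeline can be sketched in a few lines of Python. This is a toy illustration of the guessed architecture, not OpenAI's actual code; generate_cot, refine, quality, and summarize are hypothetical stand-ins:

```python
import time

# Hypothetical stand-ins for model calls; none of these are real OpenAI APIs.
def generate_cot(query): return f"plan for {query!r}"
def refine(query, cot): return cot + " -> refined"
def quality(query, cot): return min(len(cot), 60)   # toy score that saturates
def summarize(cot): return cot.rsplit("->", 1)[-1].strip()

cot_store = []                                      # 6. CoT Storage

def answer(query, budget_seconds=5.0):
    start = time.time()
    cot = generate_cot(query)                       # 1. Initial CoT Generation
    while time.time() - start < budget_seconds:     # 4. Efficiency Monitoring
        candidate = refine(query, cot)              # 2. CoT Refinement
        if quality(query, candidate) <= quality(query, cot):
            break                                   # 3. simple query: stop early
        cot = candidate                             #    (Test-time Compute)
    response = summarize(cot)                       # 5. Final Response
    cot_store.append((query, cot, response))        # remember for later
    return response

print(answer("what should I cook tonight?"))
```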

The key here is Alex's ability to reason and adapt on the spot. It's not about rigidly following recipes, but understanding cooking principles deeply enough to create new dishes or solve unexpected problems.

What makes Alex special is the constant improvement. After each shift, Alex reviews the day's challenges, learning from successes and mistakes (feedback loop). Over time, Alex becomes more efficient, creative, and adaptable.

In our language model, this inference process is where the real value lies. It's the ability to take a query (like a cooking order), reason through it (like Alex combining cooking knowledge to create a dish), and produce a thoughtful, tailored response (serving the perfect dish).

The rest of the system - the data collection, the intense training - are all in service of this moment of creation. They're crucial, but they're the behind-the-scenes work. The real magic, the part that amazes the 'customers' (users), happens in this inference stage.

Just as a master chef can delight diners with unique, perfectly crafted dishes for any request, our advanced language model aims to provide insightful, reasoned responses to any query, always learning and improving with each interaction.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew

1/1
How LLM inference speed leads to different AI / human collaboration models

Two paradigms emerge based on inference speed:
1) High-speed inference:
- Real-time user-AI interaction
- Direct user steering and immediate feedback
- Ideal for 1:1 user-task model
- Examples: AI chatbots, programming assistants

2) Low-speed, high-quality inference:
- Task delegation to AI
- Deferred, more detailed feedback
- Necessitates multitasking
- Examples: AI researchers, complex processing chains

New OpenAI models (o1, o1-mini) showcase this trade-off:
- Improved reasoning
- Significantly slower inference (GPT-4o: 3s, o1-mini: 9s, o1: 30s)
- Those times are for simple question answering; more complex multi-step processes take minutes

User and AI roles:
- Fast inference: AI as real-time assistant, "hand-in-hand" collaboration
- Slow, high-quality inference: user as leader or coordinator for multiple parallel AI processes, similar to distributed team management

Comments on delegation & multitasking paradigm:
- Requires a different app UX (task and dependency management, notifications when results are available; see the sketch below)
- Employees working with highly capable, slow inference AI models will need to organize their work differently
- Additional factors: distraction, cost of context switching, increased mental burden
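
To make the delegation paradigm concrete, here is a toy asyncio sketch; slow_model is a made-up stand-in for an o1-style slow-inference call, not a real API. The user fans out several tasks, stays free to multitask, and gets a notification as each result arrives:

```python
import asyncio

# Made-up stand-in for a slow, high-quality model call (an o1-style task).
async def slow_model(task: str, seconds: float) -> str:
    await asyncio.sleep(seconds)               # simulates slow inference
    return f"result for {task!r}"

# Delegation paradigm: fan out tasks, multitask, get notified on completion.
async def delegate(tasks: dict) -> None:
    jobs = [slow_model(name, secs) for name, secs in tasks.items()]
    for done in asyncio.as_completed(jobs):
        print("notification:", await done)     # arrives as each task finishes

asyncio.run(delegate({"summarize report": 3.0, "draft proposal": 1.0}))
```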



bnew

1/4
remember anthropic's claim that by 2025-2026 the gap is going to be too large for any competitor to catch up?

that o1 rl data flywheel is escape velocity

[Quoted tweet]
Many (including me) who believed in RL were waiting for the moment when it would start scaling in a general domain, similarly to other successful paradigms. That moment has finally arrived, and it signifies a meaningful increase in our understanding of training neural networks


2/4
the singularitarians rn: rl scaling in general domain...? wait a sec...



3/4
another nice side effect of tightly rationing o1 is the high S/N on the quality hard queries they will be receiving. ppl will largely only consult o1 for *important* tasks, easy triage

the story rly writes itself here guys

[Quoted tweet]
at only 30 requests - im going to think long and hard before i consult the oracle.


4/4
u cannot move a single step without extrapolating even a little, we draw lines regardless, including u. i am inclined to believe them here, shouldn't take too long to confirm that suspicion tho





bnew

1/10
ARC-AGI’s been hyped over the last week as a benchmark that LLMs can’t solve. This claim triggered my dear coworker Ryan Greenblatt so he spent the last week trying to solve it with LLMs. Ryan gets 71% accuracy on a set of examples where humans get 85%; this is SOTA.



2/10
For context, ARC-AGI is a visual reasoning benchmark that requires guessing a rule from few examples. Its creator, @fchollet, claims that LLMs are unable to learn, which is why they can't perform well on this benchmark.



3/10
Ryan's approach involves a long, carefully-crafted few-shot prompt that he uses to generate many possible Python programs to implement the transformations. He generates ~5k guesses, selects the best ones using the examples, then has a debugging step.
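
In outline, that sample-then-select loop is straightforward. A toy Python sketch (not Ryan's actual code: sample_program stands in for the LLM call with the long few-shot prompt, and real candidates are full transformation programs rather than one-parameter lambdas):

```python
import random

# Stand-in for sampling a candidate program from the LLM.
def sample_program(rng):
    k = rng.choice([0, 1, 2])
    return lambda grid: [[(cell + k) % 10 for cell in row] for row in grid]

def solve(train_pairs, test_input, n_samples=5000, seed=0):
    rng = random.Random(seed)
    best, best_score = None, -1
    for _ in range(n_samples):                  # generate ~5k guesses
        prog = sample_program(rng)
        score = sum(prog(x) == y for x, y in train_pairs)  # select on examples
        if score > best_score:                  # (debugging step not shown)
            best, best_score = prog, score
    return best(test_input)

train = [([[1]], [[2]]), ([[3]], [[4]])]        # rule: add 1 to every cell
print(solve(train, [[7]]))                      # -> [[8]]
```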



4/10
The results:

Train set: 71% vs a human baseline of 85%
Test set: 51% vs prior SoTA of 34% (human baseline is unknown)
(The train set is much easier than the test set.)
(These numbers are on a random subset of 100 problems that we didn't iterate on.)



5/10
This is despite GPT-4o's non-reasoning weaknesses:
- It can't see well (e.g. it gets basic details wrong)
- It can't code very well
- Its performance drops when there are more than 32k tokens in context
These are problems that scaling seems very likely to solve.



6/10
Scaling the number of sampled Python rules reliably increases performance (+3% accuracy for every doubling). And we are still quite far from the millions of samples AlphaCode uses!
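
As a formula, that trend (a log-linear empirical fit to the numbers in this tweet, not an exact law, with n the number of sampled programs) is:

```latex
\text{accuracy}(n) \;\approx\; \text{accuracy}(n_0) + 0.03 \cdot \log_2\!\left(\frac{n}{n_0}\right)
```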



7/10
@RyanPGreenblatt, who I think might be the world expert at getting LMs to do complicated reasoning (you'll love the project that he dropped for a week to do this one), did lots of fancy tricks to get the performance this high; you can see the details on our blog.



8/10
Getting 50% (SoTA) on ARC-AGI with GPT-4o



9/10
I think it's actually ambiguous whether this is sota, see here.

[Quoted tweet]
If I had realized there was such a large gap between public/private for prior approaches, I wouldn't have claimed SoTA. Apologies. I'm considering what edits I should make.

If the gap is due to the different datasets actually differing in difficulty, this is quite unfortunate.


10/10
Presumably. There's no public access to gpt-4o finetuning though.





bnew

1/11
@flowersslop
iTs jUsT cOt pRoMpTing

No, it isn't

No, your custom prompt for 4o or any other model isn’t the same as o1

No, o1 is not what reflect tried to be

It’s way more advanced than any of that. We don’t even have access to the actual o1 yet.

You can’t replicate o1 by prompting 4o.

2/11
@TheRealSamTes
He thinks different:

3/11
@flowersslop
I like Dave, I like his content, I like the way he thinks about things, but here he's on the wrong track. But he is welcome to document how he reaches or exceeds the o1 benchmarks.

Very happy to be proven wrong. OpenAI would probably appreciate it too.

4/11
@adonis_singh
That's what I am saying. On benchmarks, you can see CoT, and that is usually a 1% diff or so, not a 70% increase LMAO

5/11
@nicdunz
even david shapiro said it is. so while i understand your coping, its time to accept that openai is just an L now

6/11
@LewisNWatson
It's not just cot prompting, but it IS *basically* just cot finetuned into the model with a lot of training and human preference/auto eval tuning.

The openai “moat” is just funding (compute) and hype (social/media attention). The point is there is no “special sauce”.

7/11
@_EyesofTruth_
It kinda is.

Just really good preferential cot weighted by alignment to accurate output.

Probably works on all modes, which does make it unique, but it is the same idea that Shumer was proposing, just more complex.

You can replicate o1 with 4o, but not its accurate autonomy.

8/11
@mysticaltech
Exactly. Uses A* graph search and Q-learning RL, i.e. Q*, to gen the CoTs and choose the best ones, and probably has DQNs. It's not prompting per se, it's architectural changes inside the model on top of a more traditional gpt-4o base, maybe.

9/11
@horsperg
I disagree, I've now worked out a workflow where I use gpt-4o as the critic for o1-preview's fancy-pants plans (copy-paste the entire interaction with it), and more often than not, it's gpt-4o that ultimately puts some sense into o1-preview's proposals.

10/11
@VortexFlyer76
I wonder if people read a lot of research or kept up with it, so they're let down, which is silly because research and shipping are two totally different things. Research is done first, and then the product is handed off to engineers to scale up, etc. My husband does the shipping part, and while I obviously don't know much, I do know it's an insane amount of work.

People forget we have never had these products before or anything like them. This isn't something anyone learned in school or can do with their eyes closed. Scaling is a big deal.

11/11
@cloudseedingtec
i dont use the playground or api so i dunno, reckon im still just on 4o <3



bnew

1/11
@RichardMCNgo
One reason I don’t spend much time debating AI accelerationists: few of them take superintelligence seriously. So most of them will become more cautious as AI capabilities advance - especially once it’s easy to picture AIs with many superhuman skills following long-term plans.



2/11
@RichardMCNgo
It’s difficult to look at an entity far more powerful than you and not be wary. You’d need a kind of self-sacrificing “I identify with the machines over humanity” mindset that even dedicated transhumanists lack (since many of them became alignment researchers).



3/11
@RichardMCNgo
Unfortunately the battle lines might become so rigid that it’s hard for people to back down. So IMO alignment people should be thinking less about “how can we argue with accelerationists?” and more about “how can we make it easy for them to help once they change their minds?”



4/11
@RichardMCNgo
For instance:

[Quoted tweet]
ASI is a fairy tale.


5/11
@atroyn
at the risk of falling into the obvious trap here, i think this deeply mis-characterizes most objections to the standard safety position. specifically, what you call not taking super-intelligence seriously, is mostly a refusal to accept a premise which is begging the question.



6/11
@RichardMCNgo
IMO the most productive version of accelerationism would generate an alternative conception of superintelligence. I think it’s possible but hasn’t been done well yet; and when accelerationists aren’t trying to do so, “not taking superintelligence seriously” is a fair description.



7/11
@BotTachikoma
e/acc treats AI as a tool, and so just like any other tool it is the human user that is responsible for how it's used. they don't seem to think fully-autonomous, agentic AI is anywhere near.



8/11
@teortaxesTex
And on the other hand, I think that as perceived and understandable control over AI improves, with clear promise of carrying over to ASI, the concern of mundane power concentration will become more salient to people who currently dismiss it as small-minded ape fear.



9/11
@psychosort
I come at this from both ends

On one hand people underestimate the economic interoperability of advanced AI and people. That will be an enormous economic/social shock not yet priced in.



10/11
@summeroff
From our perspective, it seems like the opposite team isn't taking superintelligence seriously, with all those doom scenarios where superintelligence very efficiently does something stupid.



11/11
@norabelrose
This isn't really my experience at all. Many accelerationists say stuff like "build the sand god" and in order to make the radically transformed world they want, they'll likely need ASI.





bnew

1/8
@hsu_steve
Great explainer video on neural scaling laws. These were discovered by teams at OAI and DeepMind in LLM training. $billions (~ 1% US GDP!) have been allocated to continue hyperscaling LLMs! Thanks to @thiagot5 for the pointer.

https://invidious.poast.org/watch?v=5eqRuVp65eY&ab_channel=WelchLabs

The video discusses a bound on the scaling exponent as a function of the dimensionality of the data manifold. For language models, this is the intrinsic (local?) dimensionality of human language: d ~ 100.

There is a discrepancy mentioned in the video between the implied d ~ 42 from scaling data and the directly estimated d ~ 100 for natural language. Does this discrepancy arise because the model architecture is sub-optimal?

https://arxiv.org/pdf/2004.10802

"...networks with non-ReLU activations (eg Transformers and ResNets) may mix and superimpose different data manifolds upon each other, obscuring the manifold structure and causing the measured dimension to exceed the true dimension."

In a recent Manifold episode ("Letter from Reykjavik") I discuss the perils of relying on synthetic data for future scaling OOMs. It's a challenge to ensure that the entropy density in synthetic data remains as high as that found in (now exhausted) human-generated data used in training existing models.

Manifold | Letter from Reykjavik: Genomics, Chess, Hyperscaling genAI, and Quantum Black Holes — #67



2/8
@todistetta
> networks with non-ReLU activations (eg Transformers and ResNets

Fyi, this is just poor phrasing from them. Those networks generally use ReLU, and they prolly meant the exceptional cases that don't.



3/8
@hsu_steve
My question is whether there is a 1-1 relationship between increase in parameters of the transformer vs increase in expressiveness re: data manifold. There could be an O(1) factor depending on the actual architecture.



4/8
@ikirigin
I've been a fan of the channel for ages

[Quoted tweet]
I've taken a great deal of math, but this delightful video series on imaginary numbers taught me quite a bit invidious.poast.org/playlist?list=PL…


5/8
@ubersec
Thanks for sharing Steve. 👏🙏🏻



6/8
@todistetta
Scaling laws, per the Bitter Lesson, actually reveal the real insight: more brute force will always be smarter than you.



7/8
@IHateStapler
it's up to us to make sure that ai advancement is not held back by the ultradimensionality of human language by building intelligences that don't relate linguistically.



8/8
@DXidious
After watching this, I'm more convinced than ever that the data-driven ANN backprop approach is a scientific and engineering cul-de-sac in the quest for AGI.

It takes a special kind of misguidance to come up with an obscenity like the "entropy of natural language."





bnew

Web-LLM Assistant

Description

Web-LLM Assistant is a simple web search assistant that leverages a large language model (LLM) running via Llama.cpp to provide informative and context-aware responses to user queries. The project combines the power of LLMs with real-time web searching, allowing it to access up-to-date information and synthesize comprehensive answers.


Here is how it works in practice:


You can ask the LLM a question, for example: "Is the Boeing Starliner still stuck on the International Space Station?" The LLM will then decide on a search query and a time frame for which to get search results, such as results from the last day or the last year, depending on the needs of your specific question.


Then it performs a web search and collects the first 10 results and the information contained within them. It selects the 2 most relevant results and web-scrapes them to acquire the information they contain. After reviewing that information, it decides whether or not it is sufficient to answer your question. If it is, the LLM answers the question; if it isn't, the LLM performs a new search, likely rephrasing the search terms and/or time frame, to find more appropriate and relevant information. It can continue doing multiple searches, refining the search terms or time frame, until it either has enough information to actually answer the user's question or has done 5 separate searches (retrieving information from the top 4 results of each search). If by then it hasn't been able to find the information needed, it will try its best to answer your question with whatever information it has acquired from the searches.


This allows you to ask queries about recent events, or anything that may not actually be in the LLM's training data. Via this Python program, the model can still determine the answer to your question, even if the answer is absent from its training data, by web searching and retrieving information from those searches.
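
Condensed into code, the loop described above might look roughly like this (a sketch only; llm, web_search, and scrape are hypothetical placeholders rather than the project's real function names; check the repo for the actual implementation):

```python
MAX_SEARCHES = 5

# Placeholders: llm() wraps the Llama.cpp model, web_search() returns
# result URLs, scrape() fetches page text.
def answer_with_search(question, llm, web_search, scrape):
    notes = []
    query, timeframe = llm(f"Choose a search query and time frame for: {question}")
    for _ in range(MAX_SEARCHES):
        results = web_search(query, timeframe)[:10]      # first 10 results
        for url in llm(f"Pick the most relevant results from: {results}"):
            notes.append(scrape(url))                    # scrape selected pages
        if llm(f"Are these notes sufficient to answer {question!r}? {notes}") == "yes":
            break                                        # enough information
        query, timeframe = llm(f"Rephrase the search for: {question}")
    # Answer with whatever has been gathered, even if incomplete.
    return llm(f"Answer {question!r} using these notes: {notes}")
```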
 

bnew

OpenAI sent me an email threatening a ban if I don't stop


If OpenAI is threatening to ban people over trying to discover their CoT system prompt, then they see financial value in the prompt, and thus there is low-hanging fruit for local models too!

 

bnew

1/11
@ac_crypto
2 MacBooks is all you need.

Llama 3.1 405B running distributed across 2 MacBooks using @exolabs_ home AI cluster



2/11
@ac_crypto
Uses:
- @exolabs_ for distributed inference
- MLX, Apple's open source ML library for inference engine (h/t @awnihannun)
- @__tinygrad__ tinychat web UI (backend coming soon)

h/t @RayFernando1337 for coming in person to transfer the weights OTC and filming



3/11
@ac_crypto
The code is open source and Llama 3.1 8B, 70B and 405B are all supported in exo: GitHub - exo-explore/exo: Run your own AI cluster at home with everyday devices 📱💻 🖥️⌚



4/11
@carmandale
I have 50 iPhones I could make into a distributed inference machine. It would be fun to see at least.



5/11
@ac_crypto
That would actually be enough GPU memory for 405b. We have to try this :smile:



6/11
@MendeMatthias
Is one MacBook M3 not enough to run it?



7/11
@ac_crypto
No, sir. Funny that 2 MacBooks ended up being just enough to fit it in GPU memory.
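
Rough arithmetic on why two MacBooks is the magic number, assuming roughly 4-bit quantized weights (the thread doesn't specify the quantization actually used):

```python
params = 405e9                  # Llama 3.1 405B
bytes_per_param = 0.5           # 4-bit quantization (assumption)
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)               # ~202 GB: over one 128 GB MacBook,
                                # just under two of them (256 GB combined)
```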



8/11
@MrCatid
Yeah that's about as slow as I was expecting :smile:



9/11
@ac_crypto
A lot of low hanging fruit here to improve performance. Only 1 device is active at a time (although the model is hot in memory at all times)



10/11
@onemoremichael
Spec on the MacBooks?



11/11
@ac_crypto
2 x MacBook Pro 128GB





bnew

1/11
@npew
This is a pretty big deal. We're rapidly moving towards a world where you can decide how many GPU-hours you want to dedicate to a problem.

More GPU-hours = more thinking = better solution.

[Quoted tweet]
First step was to scale the model size. Next step is to scale test time compute. LLM -> TTC.


2/11
@npew
Imagine spinning up 1000 PhDs -- all working in parallel -- to work on your problem, with just a few strokes on your keyboard.
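
A toy sketch of that knob; attempt and score are stand-ins for a model sample and a verifier, and nothing here is OpenAI's method. Spend a bigger budget, sample more candidates in parallel, keep the best one:

```python
import random

def attempt(problem, rng):            # stand-in: one sampled solution
    return rng.gauss(0, 1)

def score(problem, solution):         # stand-in: verifier, higher is better
    return -abs(solution - 3.14)

def solve(problem, n_parallel, seed=0):
    rng = random.Random(seed)
    tries = [attempt(problem, rng) for _ in range(n_parallel)]
    return max(tries, key=lambda s: score(problem, s))

for budget in (10, 100, 1000):        # more compute = better best-of-n answer
    print(budget, solve("approximate pi", budget))
```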



3/11
@bradneuberg
For test time compute, do you think GPUs are the right compute substrate longer term or do you think other silicon architectures now have a chance to compete against GPUs and Nvidia? Basically, is this a potential platform shift?



4/11
@npew
The big shift is towards inference becoming much more important. I’m sure we’ll see tons of silicon competition there. Inference cost will keep falling, as silicon gets more optimized and less general purpose.



5/11
@Grad62304977
In this graph tho, is each point trained with a slightly different rl reward or is this controlled fully at inference with no extra rl training required?



6/11
@iamrobotbear
I have been asking myself since the day you released batch inference if we will see a “long batch inference” sometime.

Set a budget, goal, or time bound end and it runs until it hits one of those.



7/11
@beforeagi
Nvidia is the most profitable AI company for a reason. Everything needs GPUs.



8/11
@deter3
More GPU-hours = more thinking = better solution. Too naive a perspective.



9/11
@coreylynch
Huge congrats to you and team @npew



10/11
@AlwaysUhhJustin
Are you able to share roughly how big the models are?

GPT-4 / GPT-4o / o1-preview / o1-mini / o1

Trying to get a sense of whether this required meaningful scale increases or if it's mostly algorithmic



11/11
@AIMachineDream
Makes sense. Let users decide.



