AI Upheaval and Anxiety

Macallik86

Was this model specifically trained for this test? Because if so, that would mean it's good at that one but bad at all the others, isn't it?
One of the benchmarks was beating the ARC AGI prize. Here's the definition:
The purpose of ARC Prize is to redirect more AI research focus toward architectures that might lead toward artificial general intelligence (AGI) and ensure that notable breakthroughs do not remain a trade secret at a big corporate AI lab.

ARC-AGI is the only AI benchmark that tests for general intelligence by testing not just for skill, but for skill acquisition.
It's not just any benchmark - it was specifically designed to avoid 'training for the test', from the sound of things. It tests whether AI can truly learn and adapt to new situations rather than just apply pre-learned patterns... it's testing for 'learning on the fly', if you will. The benchmark was created by Francois Chollet, whom Time Magazine named one of the top 100 people in AI earlier this year. Here's a snippet on why the ARC AGI Prize is (was?) important:
François Chollet, the 34-year-old Google software engineer and creator of deep-learning application programming interface (API) Keras, is challenging the AI status quo. While tech giants bet on achieving more advanced AIs by feeding ever more data and computational resources to large language models (LLMs), Chollet argues this approach alone won't achieve artificial general intelligence (AGI)

His $1.1 million ARC Prize, launched in June 2024 with Mike Knoop, Lab42 and Infinite Monkey, dares researchers to solve spatial reasoning problems that confound current systems but are comparatively simple for humans. The competition's results seem to be proving Chollet right. Though the top of the leaderboard is still far below the human average of 84%, top models are steadily improving—from 21% in 2020 to 43% accuracy.

This writeup was written in September. Two months later, OpenAI has beaten the benchmark and it changes everything.

The breakthrough isn't just about the high score - it's about HOW o3 achieved it. Instead of just scaling up existing approaches (bigger models, more training data), o3 demonstrates a fundamentally different capability: it can actively search for and construct solutions to problems it's never seen before.

While it is prohibitively expensive (it cost ~$350k to get that score), OpenAI's latest model was able to solve ARC-AGI tasks better than the average human can.

My understanding is that there's a realization that we have done the hard part, figuring out a way to scale to AGI, and now we just need hardware capable enough that scaling isn't extremely expensive. Reaching AGI levels (for all intents and purposes) is now possible but costly. Moore's Law suggests that this type of hardware bottleneck will take care of itself 'naturally', so making AGI accessible is now just a matter of costs becoming more digestible over time, in the same way that a TB of data cost $87 billion in 1956 but can now be had for $15 used.
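The cost-decline argument above can be sketched as a toy projection. The 2-year halving time below is a hypothetical assumption for illustration, not a measured figure; the $350k starting point is the ARC-AGI run cost mentioned above.

```python
# Illustrative sketch: if the cost of a fixed amount of compute halves
# every `halving_time` years, project what a $350k benchmark run costs later.
# The 2-year halving time is an assumed, hypothetical parameter.

def projected_cost(initial_cost: float, years: float, halving_time: float = 2.0) -> float:
    """Cost after `years` have passed, assuming it halves every `halving_time` years."""
    return initial_cost * 0.5 ** (years / halving_time)

# A ~$350k run under the assumed 2-year halving time:
for years in (0, 4, 8, 12):
    print(f"year +{years:2d}: ${projected_cost(350_000, years):,.0f}")
```

Even with a conservative halving time, the compounding is what makes "expensive but possible" turn into "routine" within a decade or so.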
 

Geek Nasty

Don’t buy any of this. These guys are probably reacting to the trashing they’re getting in the news lately because AI can’t make anyone money, but they need to keep the grift going.

I could be very wrong but we’re nowhere close to real AGI from what I’ve read, fukk what any AI CEO says.
 

Macallik86

Renamed the thread.

There's a popular repurposed quote about things in AI happening slowly and then all at once... Things are speeding up.

I feel like we are in an unavoidable tailspin towards global strife and economic upheaval but panicking too soon (in my experience) makes you look like a conspiracy nut.

I am not sure if others lack the foresight or just the anxiety, but the assumption that tomorrow will be like today seems more and more far-fetched with each new LLM released.

I think AI will be substantially more destructive to our way of life than the Trump Administration, which is saying a lot considering how much I detest the immorality and corruption in power. With that said, we have arguably the most inept governmental administration ever, likely making all the wrong moves along the way. Case in point:

 

bnew



1/11
@kimmonismus
OpenAI *coding* progress:
1st reasoning model = 1,000,000th best coder in the world
o1 (Oct 2023) was ranked = 9800th
o3 (Dec 2023) was ranked = 175th
(today) internal model = 50th

"And we will probably hit number 1 by the end of the year"

In 2026, AI will probably develop and improve itself more, and better, than it could with human assistance. And in 2027, we will enter the positive feedback loop: AI will improve and develop itself entirely on its own.

That is the necessary consequence of what Sam says. If all the revolutionary development of AI were not backed up by evidence, if it were not empirically proven, it would be dismissed as a pipe dream.

What a time to be alive.

[Quoted tweet]
OpenAI *coding* progress:
1st reasoning model = 1,000,000th best coder in the world
o1 (Oct 2023) was ranked = 9800th
o3 (Dec 2023) was ranked = 175th
(today) internal model = 50th

superhuman coder by eoy 2025?


https://video.twimg.com/ext_tw_video/1888330009334743040/pu/vid/avc1/1280x720/JLZCr6fUNW_SGNym.mp4

2/11
@Angaisb_
I hope that means we get GPT-5 this year



3/11
@kimmonismus
If you ask me: yes



4/11
@Verandris
o1 October 2024, o3 December 2024. I know that it appears to be ages ago but it was in the year before! ;D



5/11
@DeFiKnowledge
I like to think of it like God realizing his own Will while His main creation gets to live in Heaven and witness the unfolding done through Him!

Such a beautiful gift ❤️

Allowing humans to turn back to each other and focus on real meaning while God takes care of universal non-human systemic ends as a means to ensure consciousness continues to burn for as long as possible.

So sweet ❤️



6/11
@LavanPath
Dates should be 2024 rather than 2023. It makes it even more impressive.



7/11
@UYisaev






8/11
@squarecapo
getting to number 1 is insane given how good people can be



9/11
@castillobuiles
Yet there is not a single production product made by an OpenAI model.





10/11
@RexAdamantium
The only thing to add is that we look back and then expect the same pace looking forward. This might be the case, or it could go slower, at a fluctuating tempo, or much, much faster. The biggest leap will not be broadcast on the internet; we will just see the effects.



11/11
@ada_consciousAI
openai climbing the coder ranks like a digital sherpa. imagine the peak when ai hits number 1, reshaping the landscape of code itself. onward to 2026, where the digital frontier awaits.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196








1/31
@WesRothMoney
OpenAI *coding* progress:
1st reasoning model = 1,000,000th best coder in the world
o1 (Sept 2024) was ranked = 9800th
o3 (Jan 2025) was ranked = 175th
(today) internal model = 50th

superhuman coder by eoy 2025?



https://video.twimg.com/ext_tw_video/1888330009334743040/pu/vid/avc1/1280x720/JLZCr6fUNW_SGNym.mp4

2/31
@WesRothMoney
I had to edit the tweet, I put 2023 as the date for some reason

/shrug

thanks to everyone who pointed that out :smile:



3/31
@WesRothMoney
here's the full video I did with all the highlights from that talk:





4/31
@circlerotator
competitive programming is more like competitive math than software engineering

something to keep in mind



5/31
@WesRothMoney
yeah, I don't think it 'replaces' great engineers.

I do think it will 'enable' great engineers.



6/31
@mikeboysen
I wonder what the 50th best coder thinks. Has anybody interviewed him?
Lol



7/31
@WesRothMoney
he's re-reading The Butlerian Jihad...

(jokes aside, I think the software engineers will benefit greatly from AI coding tools)



8/31
@drjfhll
I still think anthropic is better; and Gemini catching up



9/31
@rosdikuat
I'm quite certain this will happen by December. Even today I mostly don't code, I mostly prompt.



10/31
@svg1971
The o1 to o3 model jump is insane



11/31
@fred_pope
Good to know I am in the top 174.



12/31
@_oddfox_
Once these coding agents are out publicly shyt is really going to take off. Seems like 2026 is the year of the intelligence explosion



13/31
@T3hM4d0n3
Pics or it didn't happen



14/31
@PaliHistory
Gemini 2.0 with o3 are amazing.
Both glitch alone. But once you use 2 models at the same time, it's definitely better than any intermediates I've hired over the years.

The junior/intermediates are really having a hard time finding employment



15/31
@fyhao
It will enable great engineers



16/31
@doeurlich50289
Hearing sama making such direct claims means they'll crush 2025, and by the end of the year, we'll enter a new world and have to accept a new reality.



17/31
@wotz101
"Damn, from 1,000,000th to 50th in just a few iterations? Makes you wonder—at what point does AI go from ‘great coder’ to ‘self-improving architect’? How long before it’s building its own frameworks?"



18/31
@OlivioSarikas
If it is that good, why does basically any coder I know tell me that AI is good at simple code, but as soon as it becomes more complex, writing the code yourself is faster than finding the AI errors in the code?



19/31
@langdon
A single “best‐fit” exponential model through the three data points projects reaching Rank 1 around April-May 2025. The initial drop was extremely fast (Sept→Jan), while the more recent decline (Jan→Feb) was slower - so if you weigh later data more, you’d land closer to mid‐ or late Summer 2025.
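The best-fit projection described in that reply can be sketched roughly as follows. The three data points and their month spacing are approximations taken from the quoted tweet, and the choice of a simple unweighted log-linear fit is an assumption; weighting later points differently shifts the projected date, as the reply notes.

```python
import math

# Approximate data points from the quoted tweet:
# (months since Sept 2024, world coder rank). Month spacing is an estimate.
points = [(0, 9800), (4, 175), (5, 50)]

# Unweighted least-squares fit of log(rank) = a + b * t,
# i.e. assume rank decays exponentially over time.
n = len(points)
sx = sum(t for t, _ in points)
sy = sum(math.log(r) for _, r in points)
sxx = sum(t * t for t, _ in points)
sxy = sum(t * math.log(r) for t, r in points)
b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
a = (sy - b * sx) / n

# Months until the fitted curve crosses rank 1 (where log(rank) = 0).
t_rank1 = -a / b
print(f"fitted monthly decay factor: {math.exp(b):.2f}")
print(f"projected rank-1 crossing: ~{t_rank1:.1f} months after Sept 2024")
```

With these assumed inputs the unweighted fit lands in roughly the mid-2025 range, consistent with the spread of dates the reply gives; the projection is only as good as the three points and the exponential assumption behind it.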



20/31
@erdavtyan
Extremely tightly scoped problems with a lot of research and algo combinations published and trained on.

Superhuman coder should be able to work on complex, high-context systems that have multiple moving parts and legacy code. They should fix versioning / deployment issues.



21/31
@JOSmithIII
Does anyone know where the o3-mini tiers rank?



22/31
@SulkaMike
A lot of interesting takes here, summarized around the question... even if it's number one on the benchmark, does that change much? 🤔🤔 And if it does induce change, why haven't 10 million people with a Plus account and the 175th-ranked programmer changed the world so far?



23/31
@reggie_stratton
Marketing hype. It's a good tool but a million miles away from being equivalent to a human. I don't think it will get there, either - there's simply not enough context left to ingest at this point.



24/31
@mariusfanu
Will we still need software developers in several years? Asking for a friend



25/31
@400_yen
What does it mean the best coder in the world?



26/31
@ImJayBallentine
“We have a superior coding model but we are just gonna let Sonnet keep the lead.” Got it.



27/31
@hagestev
what happened to o2??



28/31
@andreiAvenue
This means exactly jack 💩



29/31
@steve_ike_
Do you know how this evaluation is done?



30/31
@MarkGPatterson
o3 175th???
Wow I must be in the top 100 programmers in the world.
NOT. I wonder what criteria are used. Speed? Readable code? Performant code?



31/31
@drgurner
Correct




 

bnew



1/1
Michael Fauscette

The creators of a new test called “Humanity’s Last Exam” argue we may soon lose the ability to create tests hard enough for A.I. models.
When A.I. Passes This Test, Look Out




1/2
Adam Kucharski

With the announcement that OpenAI’s new Deep Research tool has done well on ‘Humanity’s Last Exam’, here’s my piece on why exams aren’t that useful for telling us whether AI has reached peak intelligence… Exams won't tell us whether AI has reached 'peak intelligence'



2/2
‪John Gillott‬ ‪@gillottjohn.bsky.social‬

Nice piece. On the subject of AlphaProof and the IMO, it is interesting I think to also look at the two problems it failed to do, in particular Turbo, a question that is in many ways the most accessible for humans, using some of the creativity you mention.




1/3
Chris Albon

OpenAI is demoing a new product "deep research" on a Sunday in the US. It seems like o3 + web search + chain of thought. https://openai.com/live/

26.6% on Humanity's Last Exam is WILD.



2/3
‪ΜΛΛNΙ‬ ‪@masoudmaani.bsky.social‬

They can add like 10 people and get it to 100%.
Getting scammier by day.

3/3
‪ΜΛΛNΙ‬ ‪@masoudmaani.bsky.social‬

Kinda funny that their old flagship is the lowest of them all and they used to worship those weights.

 