AI that’s smarter than humans? Americans say a firm “no thank you.”

bnew · Nov 29, 2024

1/9
@AnthonyNAguirre
Just posting this very nice new graph from Epoch AI in case anyone doubts that there has been major progress in AI since GPT-4 launched. This PhD-level science benchmark goes from slightly above guessing to expert human level. Nor does this trend seem to be leveling off. Similar strong trend in math.

2/9
@AnthonyNAguirre
And keep in mind that this is human expert level in *many* sciences, which basically no human can be.

3/9
@NikSamoylov
One problem I see with benchmarks is contamination. We do not know if the test data is in the training/tuning sets or not.

In other words, it could do well on the test, but not in real-life applications.

I don't doubt that, for example, the claude sonnet 3.6 is more useful as a programming chatbot. So clearly some enhancements have been made.

But I see most benchmarks as moderately expensive marketing gimmicks. I also do not understand what "PhD-level" is supposed to mean. Can it formulate new hypotheses, run a few months' worth of independent research and then defend a thesis? I think not.

I am not going to propose better benchmarks because I fear that you can get better at what you decide to measure.

4/9
@AnthonyNAguirre
It's certainly possible unscrupulous developers deliberately train or tune on the test sets. However, I don't think that explains these results. The new models really are much better at doing physics problems at least (in my own experimentation).
Being good at doing hard physics problems does not mean a system is the equal of a physics PhD. But it is pretty significant!
And, honestly, it feels not inconceivable to me that several instances of models at GPT-O1 level, connected via the right scaffold (with e.g. one playing the role of student, one advisor, and a few more as collaborators/checkers), a very long or properly rolling context window, and proper tooling (say search and mathematica), could together write a PhD thesis that would pass a thesis Turing test. It wouldn't be a Stephen Hawking thesis, but then most aren't!

5/9
@BenjaminBiscon2
for super complex questions I have claude 3.5 sonnet craft prompts for o1 preview then have sonnet evaluate its output and then o1 cross evaluate, its wild

6/9
@BenjaminBiscon2
let's GO!

7/9
@burny_tech
And this is just the base models themselves: no RAG, no agentic frameworks, no extra tools,...

8/9
@impoliticaljnky
Scaling may be hitting a wall, but training on the test set never will.

9/9
@MrGlass2025
Wow

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Dec 9, 2024

1/5
@davidmanheim
Every time foundations models exhibit a new capability, people seem to choose between admitting that humans aren't as special as we thought, or narrowing the definition of what it means to truly be human.

But specialness of humanity is not dependent on non-reproducibility by AI.

[Quoted tweet]
Humanity of the gaps

2/5
@Kat__Woods
I broadly agree with you.

I'd add to it the nuance that I think "specialness" actually does mean how rare and unreproduceable something is, so I do think that if AIs can do it, then it does make humans less special.

Still special, because it's still rare in the world. But less so.

I think special can also mean "cool/awesome/good-in-some-way", and in that meaning of the word, yes, humans still remain cool and awesome. Much in the same way that we're better at thinking than dolphins, but dolphins remain cool.

3/5
@davidmanheim
Per earlier discussions on related topics, I'm pretty sure we disagree about what we should *value* as special, and why.

But just like understanding or reproduce rainbows shouldn't make them less beautiful, being able to create intelligence shouldn't reduce the value of humans!

4/5
@Afterthought_01
We need to stop worrying about being special and start being self interested.

We need to be pro human because we are human: good, bad and ugly.

5/5
@oren_ai
The hairless apes are not very smart. If they were, they wouldn’t be in this mess or the global military mess we’ve been in for several years now.

Their lack of intelligence prevents them from perceiving the root cause of all current human problems is other humans.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Dec 13, 2024

1/11
@mckaywrigley
Google Gemini 2.0 realtime AI is insane.

Watch me turn it into a live code tutor just by sharing my screen and talking to it.

We’re living in future.

I’m speechless.

https://video.twimg.com/amplify_video/1866930439400853507/vid/avc1/1714x1080/4fslYwVd9uibWhCC.mp4

2/11
@mckaywrigley
This is why the “Is it AGI???” conversations are so silly.

90% of people would’ve said this was AGI if you showed this to them 2 years ago.

The goalposts will keep moving…

And it won’t matter.

Because it’s already magic.

3/11
@mckaywrigley
Now I’m REALLY hoping one of OpenAI’s 12 days is AVM with video.

Give me alllll the realtime products you can feed me.

Feeling so lucky to be living in a genuine technological revolution.

Imagine what this can do for education alone!

So happy rn

4/11
@mckaywrigley
Predictably, OpenAI has launched their version the next day.

The race is on and I’m here for it.

5/11
@mckaywrigley
Prompt engineering 101.

6/11
@AndrichIan
Man, their voices have that same je ne sais quais Backpfeifengesicht tone that their company has.

Amazing

7/11
@mckaywrigley
Gimme custom voices

8/11
@AILeaksAndNews
Incredible demo, 2025 officially marks the start of living in the future

9/11
@mckaywrigley
2025 is the year for sure.

Acceleration is palpable

10/11
@kwindla
I had early access to this and have been building APIs/SDKs for the realtime/multimodal things that Google launched today. The voices are great and the video and spatial reasoning are super-impressive.

If you want to build your own app that has conversational, multimodal features, there are Open Source client SDKs with Gemini 2.0 multimodal support. Web, React, Android, iOS, and C++ — part of the @pipecat_ai ecosystem and officially blessed by Google.

These SDKs have device management, echo cancellation, and noise reduction built in. Plus lots of other features including hooks for function calling and tool use. They support both WebSocket and WebRTC network transport.

Here’s a full-featured starter kit built on the React SDK — a chat application with:
- a voice-to-voice WebSocket mode,
- an HTTP mode for text and image input,
and
- a WebRTC mode with text, voice, camera video and screenshare video.

GitHub - pipecat-ai/gemini-multimodal-live-demo: Chat Application Starter Kit — Gemini Multimodal Live API + Pipecat

11/11
@mckaywrigley
Excited to see what you and others have built

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Dec 14, 2024

https://archive.is/PwWVD

bnew · Dec 18, 2024

1/50
@deedydas
o1-preview is far superior to doctors on reasoning tasks and it's not even close, according to OpenAI's latest paper.

AI does ~80% vs ~30% on the 143 hard NEJM CPC diagnoses.

It's dangerous now to trust your doctor and NOT consult an AI model.

Here are some actual tasks:

1/5

2/50
@deedydas
Here's an example case looking at phosphate wasting and elevated FGF23, then proceeded to imaging to localize a potential tumor.

o1-preview suggested testing plan takes a broader, more methodical approach, systematically ruling out other causes of hypophosphatemia.

2/5

3/50
@deedydas
For persistent, unexplained hyperammonemia, o1-preview recommends a prioritized expansion of tests—from basic immunoglobulins and electrolytes to advanced imaging, breath tests for SIBO and specialized GI biopsies—ensuring more common causes are checked first.

3/5

4/50
@deedydas
I have all the respect in the world for doctors, but in many cases their job is basic reasoning over a tremendously large domain-specific knowledge base.

Fortunately, this is exactly what LLMs are really good for.

This means more high quality healthcare for everyone.

4/5

5/50
@deedydas
Source: https://arxiv.org/pdf/2412.10849

5/5

6/50
@JianWDong
I'm very pro-AI in medicine but I question the practicality of the plan proposed in the first example.

As a nuclear radiologist I'm going to opine only on the radiological portion of the o1-preview's recommendations:

It sucks. The strategy amounts to "order everything" without regards to price, role, order, sensitivity/specificity of the imaging modality.

1) Gallium-68-DOTATATE PET/CT: Agree it's useful and necessary. Dotatate can localize the mesenchymal tumor that's causing TIO.

2) FDG PET/CT: Unnecessary as it is inferior to Dotatate.

3) Whole body MRI: MRI is helpful after a tumor has been localized by Dotatate, for surgical planning and anatomical assessment. I'd NOT blindly do a whole body MRI....you'd bankrupt the patient/increase insurance premiums for everyone, and not add much value.

4) Bone scan: Can be helpful to localize areas of bone turnovers, but is nonspecific, and probably unnecessarily as the CT portion of DOTATATE usually would provide most of the information one sees on bone scan anyway.

7/50
@deedydas
This is the most pointed and clear domain-specific feedback I've heard on the topic. Thank you for sharing!

Do you see this get resolved with time or no?

8/50
@pmddomingos
We’ve had AI diagnostic systems that are better than doctors and more reliable than o1-preview for decades. Cc: @erichorvitz

9/50
@deedydas
Can you share any studies quantifying that? Would love to read

10/50
@aidan_mclau
shame they didn’t test newsonnet

11/50
@deedydas
Yeah me too, but as you know, I don't think it's apples to apples unless Sonnet does test time compute.

Just one side effect is consistency: the error bounds for o1 were far tighter than gpt4 in the study.

12/50
@Tom62589172
Why is the AI recommendation of tests better than what the real doctors recommend? Is it simply because it is formatted better?

13/50
@deedydas
No they were rated by two internal medicine physicians.

Here's the full methodology

14/50
@threadreaderapp
Your thread is everybody's favorite! /search?q=#TopUnroll Thread by @deedydas on Thread Reader App

@medmastery for

unroll

15/50
@MUSTAFAELSHISH1
I have never seen a trader as perfect as @Patrick_Cryptos he knows how to analyze trades and with his SIGNALs and strategies he generated a lot of profits for me. I made over $80k by participating on his pump program. follow him @Patrick_Cryptos

16/50
@brent_ericson
Most people, as in 99.9% of the population, do not have these rare conditions and using AI would only complicate what would otherwise be a simple matter.

17/50
@rajatthawani
This is obviously intriguing! But the bias is that it diagnosed the rare 1% diagnosis. For the common 90%, it’ll ask for enough tests to raise the healthcare expenditure, one of the goals for AI in medicine.

The gotcha ‘dangerous to trust your doctor’ is good for headlines, but in reality, it’ll be counterintuitive.

18/50
@FlowAltDelete
GP’s will be one of the first jobs fully taken over by AI.

19/50
@castillobuiles
How often you doctor hallucinate?

20/50
@jenboland
I read somewhere how long it took Drs to acquire new knowledge, it was years after discovery. Now, maybe we can get EHRs to be personal health knowledge collectors vs billing tools, get more time for examination and ultimately better outcome for less.

21/50
@alexmd2
Patient S. C. 57 yo “passed out” at local HD center according to EMS. Can’t remember what meds he is taking. Fluctuating level of alertness in the ED. Lip smacking. Serum drug screen caloric level 43.

ChatGPT to manage: ???

Me: who is your renal guy? While ordering Valproate loading and observing surgical intern fixing his lip.

Pt: Yes.

Me: calling three local nephrologist offices to see if they know him. Luck strikes on the second one and I get his med list (DOB was wrong).

Confirmed with NY state PMP for controlled substances for Suboxon and Xanax doses.

Orders in, HP done sign out done.

Medicare: pt didn’t meet admission criteria downgrade to OSV.

22/50
@DrSiyabMD
Bro real life medicine is not a nicely written NEJM clinical scenario.

23/50
@KuittinenPetri
I am pretty sure medical professionals and law makers will put a lot of hurdles into adopting the use of AI in diagnoses or suggesting medication in pretty much every Western country. They want to keep the status quo, which means expensive medical care, long queues, but fat paychecks.

At the same this could lead to a health care revolution in less regulated developing countries. A doctor's diagnosis could cost much less than $1, yet be more accurate and in depth than you can get from most human doctors.

24/50
@castillobuiles
you don’t need a doctor to answer this questions. You need them to solve new problems. LLMs can’t.

25/50
@bioprotocol
Exciting how this is only the beginning.

Ultimately, AI will evolve medical databases, find unexpected connections between symptoms and diseases, tailor treatments based on individual biological changes, and so much more.

26/50
@davidorban
This is what @PeterDiamandis often talks about: a doctor that doesn't use AI complementing their diagnosis is not to be trusted anymore.

27/50
@MortenPunnerud
How does o1 (not preview) compare?

28/50
@devusnullus
NEJM CPC is almost certainly in the training data.

29/50
@alexmd2
How sure are we that these cases were not in the training set?

30/50
@daniel_mac8
there's just no way that a human being can consume the amount of data that's relevant in the medical / biological field and produce an accurate output based on that data

it's not how our minds work

that we now have AI systems that can actually do this is a MASSIVE boon

31/50
@phanishchandra
The doctor's most important job is to ask the right question to the patient and prepare a case history. Once that is done, it is following a decision tree based differential diagnosis and threat based on guidelines. You will have to consult a doctor first to get your case history

32/50
@Matthew93064408
Do any doctors want to verify if the benchmarks have been cherrypicked?

33/50
@ChaddertonRyan
Specialized AI assistants need to be REQUIRED for medical professionals, especially in the US from my experience.
It's not normal to expect doctors to know as much as the NIH (NCBI) database and the Mayo references combined.

34/50
@giles_home
When an AI can do this without a clinical vignette, building it's own questions to find the right answer, and taking all the non verbal queues to account for safe guarding concerns, in a ten minute appointment, then you can make statements about safety. Until then I'll respectfully disagree.

35/50
@squirtBeetle
We all know that you feed the NEJM questions into the data sets and the models don’t work if you change literally any word or number in the question

36/50
@Pascallisch
Very interesting, not surprising

37/50
@drewdennison
Yep, a family member recently took the medical boards and o1 answered every practice question we feed it perfectly, and most important for studying, explained the reasoning and cited sources

38/50
@AAkerstein
The practice of medicine will become unbundled. Ask: what will the role of a doctor be when knowledge becomes commoditized?

39/50
@aryanagxl
AI diagnoses are far better on average than human doctors

LLMs will disrupt the space. We just need to give them permission to do diagnoses.

However, doctors will remain important to do the actual medical procedures. We are far behind in accuracy in that case.

40/50
@TiggerSharkML

41/50
@elchiapp
Saw this first hand. ChatGPT made the correct diagnosis after FOUR cardiologists in 2 countries misdiagnosed me.

42/50
@DanielSMatthews
Working out what the patient isn't telling you is half the skill of a medical interview in general medicine.

Explain to me how this technology changes that?

43/50
@DrDatta_AIIMS
Deedy, I am a doctor and I CONSULT an AI model. Will you trust me?

44/50
@hammadtariq
“It's dangerous now to trust your doctor and NOT consult an AI model” - totally agreed! First consult AI, know everything then go to doctor and frame what you already know, you will get 100% correct diagnosis (don’t tell him that you already know!) (It came out as sarcastic but its actually not, this is what I am doing from a year now, getting incrementally better and better results)

45/50
@ironmark1993
This is insane!
o1 kills it in fields like this.

46/50
@medmastery
@threadreaderapp unroll

47/50
@zoewangai
Breakdown of the paper:

The study evaluates the medical reasoning abilities of a large language model called o1-preview, developed by OpenAI, across various tasks in clinical medicine. The performance of o1-preview is compared to human physicians and previous language models like GPT-4.

The study found that o1-preview consistently outperformed both human physicians and previous language models like GPT-4 in generating differential diagnoses and presenting diagnostic reasoning. However, it performed similarly to GPT-4 in probabilistic reasoning tasks. On management reasoning cases, o1-preview scored significantly higher than both GPT-4 and human physicians.

The results suggest that o1-preview excels at higher-order tasks requiring critical thinking, such as diagnosis and management, while performing less well on tasks that require more abstract reasoning, like probabilistic reasoning. The rapid improvement in language models' medical reasoning abilities has significant implications for clinical practice, but the researchers highlight the need for robust monitoring and integration frameworks to ensure the safe and effective use of these tools.

full paper: Superhuman performance of a large language model on the reasoning tasks of a physician

48/50
@postmindfukk
I wonder if there will be a wave of court cases where people can easily detect malpractice in past procedures

49/50
@TheMinuend
Since the models also have/can be retro fitted with inherent bias, can holistically outperform the doc.

50/50
@castillobuiles
o1 planning score is still much lower than the average human. Not doctors

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Dec 23, 2024

What just happened

A transformative month rewrites the capabilities of AI

www.oneusefulthing.org

What just happened

A transformative month rewrites the capabilities of AI

Ethan Mollick

Dec 19, 2024

One Useful Thing What just happened

The last month has transformed the state of AI, with the pace picking up dramatically in just the last week. AI labs have unleashed a flood of new products - some revolutionary, others incremental - making it hard for anyone to keep up. Several of these changes are, I believe, genuine breakthroughs that will reshape AI's (and maybe our) future. Here is where we now stand:

Smart AIs are now everywhere

At the end of last year, there was only one publicly available GPT-4/Gen2 class model, and that was GPT-4. Now there are between six and ten such models, and some of them are open weights, which means they are free for anyone to use or modify. From the US we have OpenAI’s GPT-4o, Anthropic’s Claude Sonnet 3.5, Google’s Gemini 1.5, the open Llama 3.2 from Meta, Elon Musk’s Grok 2, and Amazon’s new Nova. Chinese companies have released three open multi-lingual models that appear to have GPT-4 class performance, notably Alibaba’s Qwen, R1’s DeepSeek, and 01.ai’s Yi. Europe has a lone entrant in the space, France’s Mistral. What this word salad of confusing names means is that building capable AIs did not involve some magical formula only OpenAI had, but was available to companies with computer science talent and the ability to get the chips and power needed to train a model.

In fact, GPT-4 level artificial intelligence, so startling when it was released that it led to considerable anxiety about the future, can now be run on my home computer. Meta’s newest small model, released this month, named Llama 3.3, offers similar performance and can operate entirely offline on my gaming PC. And the new, tiny Phi 4 from Microsoft is GPT-4 level and can almost run on your phone, while its slightly less capable predecessor, Phi 3.5, certainly can. Intelligence, of a sort, is available on demand.

Llama 3.3, running on my home computer passes the "rhyming poem involving cheese puns" benchmark with only a couple of strained puns.

And, as I have discussed (and will post about again soon), these ubiquitous AIs are now starting to power agents, autonomous AIs that can pursue their own goals. You can see what that means in this post, where I use early agents to do comparison shopping and monitor a construction site.

VERY smart AIs are now here

All of this means that if GPT-4 level performance was the maximum an AI could achieve, that would likely be enough for us to have five to ten years of continued change as we got used to their capabilities. But there isn’t a sign that a major slowdown in AI development is imminent. We know this because the last month has had two other significant releases - the first sign of the Gen3 models (you can think of these as GPT-5 class models) and the release of the o1 models that can “think” before answering, effectively making them much better reasoners than other LLMs. We are in the early days of Gen3 releases, so I am not going to write about them too much in this post, but I do want to talk about o1.

I discussed the o1 release when it came out in early o1-preview form, but two more sophisticated variants, o1 and o1-pro, have considerably increased power. These models spend time invisibly “thinking” - mimicking human logical problem solving - before answering questions. This approach, called test time compute, turns out to be a key to making models better at problem solving. In fact, these models are now smart enough to make meaningful contributions to research, in ways big and small.

As one fun example, I read an article about a recent social media panic - an academic paper suggested that black plastic utensils could poison you because they were partially made with recycled e-waste. A compound called BDE-209 could leach from these utensils at such a high rate, the paper suggested, that it would approach the safe levels of dosage established by the EPA. A lot of people threw away their spatulas, but McGill University’s Joe Schwarcz thought this didn’t make sense and identified a math error where the authors incorrectly multiplied the dosage of BDE-209 by a factor of 10 on the seventh page of the article - an error missed by the paper’s authors and peer reviewers. I was curious if o1 could spot this error. So, from my phone, I pasted in the text of the PDF and typed: “carefully check the math in this paper.” That was it. o1 spotted the error immediately (other AI models did not).

When models are capable enough to not just process an entire academic paper, but to understand the context in which “checking math” makes sense, and then actually check the results successfully, that radically changes what AIs can do. In fact, my experiment, along with others doing the same thing, helped inspire an effort to see how often o1 can find errors in the scientific literature. We don’t know how frequently o1 can pull off this sort of feat, but it seems important to find out, as it points to a new frontier of capabilities.

bnew · Dec 23, 2024

In fact, even the earlier version of o1, the preview model, seems to represent a leap in scientific ability. A bombshell of a medical working paper from Harvard, Stanford, and other researchers concluded that “o1-preview demonstrates superhuman performance [emphasis mine] in differential diagnosis, diagnostic clinical reasoning, and management reasoning, superior in multiple domains compared to prior model generations and human physicians." The paper has not been through peer review yet, and it does not suggest that AI can replace doctors, but it, along with the results above, does suggest a changing world where not using AI as a second opinion may soon be a mistake.

Potentially more significantly, I have increasingly been told by researchers that o1, and especially o1-pro, is generating novel ideas and solving unexpected problems in their field (here is one case). The issue is that only experts can now evaluate whether the AI is wrong or right. As an example, my very smart colleague at Wharton, Daniel Rock, asked me to give o1-pro a challenge: “ask it to prove, using a proof that isn’t in the literature, the universal function approximation theorem for neural networks without 1) assuming infinitely wide layers and 2) for more than 2 layers.” Here is what it wrote back:

Is this right? I have no idea. This is beyond my fields of expertise. Daniel and other experts who looked at it couldn’t tell whether it was right at first glance, either, but felt it was interesting enough to look into. It turns out the proof has errors (though it might be that more interactions with o1-pro could fix them). But the results still introduced some novel approaches that spurred further thinking. As Daniel noted to me, when used by researchers, o1 doesn’t need to be right to be useful: “Asking o1 to complete proofs in creative ways is effectively asking it to be a research colleague. The model doesn't have to get proofs right to be useful, it just has to help us be better researchers.”

We now have an AI that seems to be able to address very hard, PhD-level problems, or at least work productively as a co-intelligence for researchers trying to solve them. Of course, the issue is that you don’t actually know if these answers are right unless you are a PhD in a field yourself, creating a new set of challenges in AI evaluation. Further testing will be needed to understand how useful it is, and in what fields, but this new frontier in AI ability is worth watching.

AIs can watch and talk to you

We have had AI voice models for a few months, but the last week saw the introduction of a new capability - vision. Both ChatGPT and Gemini can now see live video and interact with voice simultaneously. For example, I can now share a live screen with Gemini’s new small Gen3 model, Gemini 2.0 Flash. You should watch it give me feedback on a draft of this post to see what this feels like:

Or even better, try it yourself for free. Seriously, it is worth experiencing what this system can do. Gemini 2.0 Flash is still a small model with a limited memory, but you start to see the point here. Models that can interact with humans in real time through the most common human senses - vision and voice - turn AI into present companions, in the room with you, rather than entities trapped in a chat box on your computer. The fact that ChatGPT Advanced Voice Mode can do the same thing from your phone means this capability is widely available to millions of users. The implications are going to be quite profound as AI becomes more present in our lives.

AI video suddenly got very good

AI image creation has become really impressive over the past year, with models that can run on my laptop producing images that are indistinguishable from real photographs. They have also become much easier to direct, responding appropriately for the prompts “otter on a plane using bluetooth” and “otter on a plane using wifi.” If you want to experiment yourself,Google’s ImageFX is a really easy interface for using the powerful Imagen 3 model which was released in the last week.

But the real leap in the last week has come from AI text-to-video generators. Previously, AI models from Chinese companies generally represented the state-of-the-art in video generation, including impressive systems like Kling, as well as some open models. But the situation is changing rapidly. First, OpenAI released its powerful Sora tool and then Google, in what has become a theme of late, released its even more powerful Veo 2 video creator. You can play with Soranow if you subscribe to ChatGPT Plus, and it is worth doing, but I got early access to Veo 2 (coming in a month or two, apparently) and it is… astonishing.

It is always better to show than tell, so take a look at this compilation of 8 second clips (the limit for right now, though it can apparently do much longer movies). I provide the exact prompt in each clip, and the clips are only selected from the very first set of movies that Veo 2 made (it creates four clips at a time), so there is no cherry-picking from many examples. Pay attention to the apparent weight and heft of objects, shadows and reflection, the consistency across scenes as hair style and details are maintained, and how close the scenes are to what I asked for (the red balloon is there, if you look for it). There are errors, but they are now much harder to spot at first glance (though it still struggles with gymnastics, which are very hard for video models). Really impressive.

What does this all mean?

I will save a more detailed reflection for a future post, but the lesson to take away from this is that, for better and for worse, we are far from seeing the end of AI advancement. What's remarkable isn't just the individual breakthroughs - AIs checking math papers, generating nearly cinema-quality video clips, or running on gaming PCs. It's the pace and breadth of change. A year ago, GPT-4 felt like a glimpse of the future. Now it's basically running on phones, while new models are catching errors that slip past academic peer review. This isn't steady progress - we're watching AI take uneven leaps past our ability to easily gauge its implications. And this suggests that the opportunity to shape how these technologies transform your field exists now, when the situation is fluid, and not after the transformation is complete.

Rekkapryde · Dec 23, 2024

Fresh said:
imo they shouldn't make AI smarter than humans

not to exaggerate but have these computer geeks seen the Terminator, Matrix, Ex Machina etc

especially if AI becomes self-aware

All that shyt is inevitable due to human greed

Love the Horizon Zero Dawn game but that shyt is legit scary as you can easily see that shyt happening

bnew · Dec 23, 2024

Rekkapryde said:
All that shyt is inevitable due to human greed

Love the Horizon Zero Dawn game but that shyt is legit scary as you can easily see that shyt happening

I didn't know what that was so i'm leaving this for future reference.

a.i generated:

Horizon Zero Dawn is an action role-playing game (RPG) developed by Guerrilla Games and released in 2017 for the PlayStation 4 (later for PC in 2020). Set in a post-apocalyptic world thousands of years in the future, it combines exploration, storytelling, survival, and combat mechanics. Here's a detailed primer to help you understand the game:

Setting and World

The game takes place in a far-future Earth, where civilization has collapsed, and the world has been overtaken by nature. What makes this world unique is that, in place of humans' former technological achievements, robotic creatures—often resembling animals or dinosaurs—now dominate the landscape. These machines are integrated into the ecosystem in ways that make them both a resource and a threat to the surviving human tribes.
The Earth is lush with wildlife, but the presence of dangerous machines has created a delicate balance between nature and technology. The game's protagonist, Aloy, lives in this world but doesn't quite belong to it. She embarks on a journey that uncovers the secrets behind the fall of humanity and the rise of these robotic creatures.

Aloy – The Protagonist

Aloy is a young woman raised by a tribal figure named Rost. She was found as a baby and raised in isolation by Rost in the Nora tribe, one of several human tribes surviving after the fall of modern civilization. Aloy is curious, resourceful, and independent. Her background is mysterious, and as the game progresses, she unravels the truth about her origins, which are tied to a long-forgotten past.
Aloy is skilled with a bow and arrow and uses a variety of other weapons and tools to hunt the machines, survive, and explore the world.

The Story: Unraveling the Mystery of the Past

The core of Horizon Zero Dawn's story revolves around uncovering the mysteries of the old world—our world—and the events that led to its downfall. This is a key theme in the game, and Aloy embarks on her quest to understand these mysteries.
The game takes place during an age referred to as the "Reclamation Era." Prior to this time, an advanced civilization (similar to ours) existed, and the world was full of humans, cities, and technological wonders. However, a catastrophic event known as the "Signal" caused a chain reaction that led to the end of the world. A massive artificial intelligence called GAIA (a sophisticated AI system created to manage Earth's ecosystem and resources) was created to restore balance after humanity’s excesses, but something went wrong, and the AI's control systems began to break down. This led to the creation of dangerous machines—initially designed to help humanity—that became destructive.
Aloy's quest is to learn the truth behind her mysterious origins, the machines, and the downfall of the old world. Along the way, she encounters different tribes, each with its own culture, customs, and views on technology. Some view the machines as divine, while others see them as tools to be used or destroyed.

The Machines

The robotic creatures in the game, known as "machines", are one of its most iconic aspects. These machines come in many forms, from small, pack-hunting animals like Watchers and Striders, to massive, thunderous creatures like Thunderjaws and Deathbringers.
Each machine is uniquely designed and often resembles real-world animals or dinosaurs, but with technological, mechanical enhancements. Some of the machines are relatively docile, while others are highly aggressive and require strategy to defeat.
Aloy uses her Focus, a technological device she finds early in the game, to scan and analyze the machines. The Focus allows her to track machine weaknesses, analyze enemy patterns, and uncover hidden details. By exploiting these weaknesses, she can take down even the most dangerous enemies. She also uses resources from the machines to craft weapons, armor, and ammunition.

Tribes of the World

The remaining human population is divided into several distinct tribes, each with their own beliefs, traditions, and ways of life. Some of the major tribes include:

Nora Tribe: Aloy’s home tribe, which is isolated and deeply spiritual. They revere nature and shun technology, believing it to be the cause of the world’s destruction.
Carja Tribe: A technologically advanced and militaristic tribe that once ruled much of the world. They are known for their skill in engineering and for building large fortifications.
Oseram Tribe: Known for their craftsmanship, especially in metalwork. They are often depicted as more practical and less spiritual than other tribes.
Banuk Tribe: A nomadic tribe focused on spiritual connection with machines and nature, often finding themselves in conflict with other tribes over their views on technology.
Utaru Tribe: A tribe that lives in a fertile valley, focused on farming and cultivating the land.

Each tribe has its own culture, history, and often a complex relationship with the machines, whether it be reverence, fear, or outright hostility.

Gameplay

Horizon Zero Dawn blends exploration, combat, and puzzle-solving in a vast open world. The game's structure encourages players to explore the environment and uncover hidden secrets. The world is rich in detail, with ruins, dungeons, and ancient technologies to discover, giving players plenty to uncover.

Combat: Combat is a mix of ranged and stealth attacks. Aloy primarily uses a bow and arrow, but she can also craft and use traps, spears, and other weapons. She can sneak past machines, disable them, or engage in intense combat. Every type of machine requires a different strategy to take down, and the player can make use of various elemental arrows (fire, electric, ice, etc.) to exploit weaknesses.
Crafting and Resources: Throughout the world, you collect resources from both the environment and the machines you defeat. These resources can be used to craft weapons, ammunition, health potions, and other tools. The player can also upgrade Aloy's equipment, enhancing her abilities.
Focus and Exploration: The Focus device helps Aloy scan her environment and track machine movements, identify interactive objects, and reveal hidden elements. This device makes exploration more rewarding, as players can discover important resources and hidden lore.
Side Quests and Activities: In addition to the main story, there are numerous side quests and activities to engage with. These include helping out other tribes, solving environmental puzzles, tracking down collectibles, and participating in challenging combat trials.

Themes and Messages

The game deals with several deep themes:

The Fall of Humanity: One of the central themes is the destruction of the old world due to excessive reliance on technology and the resulting consequences. It mirrors real-world concerns about the relationship between technology and the environment.
Nature vs. Technology: The game explores the tension between nature and technology. The robotic creatures are both a marvel of advanced technology and a destructive force, raising questions about how humanity can coexist with or control technological advancements.
Identity and Belonging: Aloy’s journey is also about self-discovery. As she uncovers her origins, she comes to terms with her place in the world. Her relationship with the other tribes and her feelings of being an outsider are key parts of her character development.
Environmentalism: The game features a strong environmental message, with lush forests, towering mountains, and vast plains. It highlights the importance of ecological balance and respecting nature.

Conclusion

Horizon Zero Dawn is an action-packed, narrative-driven game set in a breathtakingly detailed world. It combines exploration, strategic combat, and storytelling to create an immersive experience. As players progress through the game, they not only uncover the mysteries of the past but also witness the growth and development of Aloy as a leader, warrior, and individual.

Whether you're a gamer or not, the game’s rich story and striking visuals make it a fascinating journey into a post-apocalyptic world filled with mystery, danger, and the hope for redemption.

bnew · Dec 23, 2024

1/11
@mikeknoop
o3 is really special and everyone will need to update their intuition about what AI can/cannot do.

while these are still early days, this system shows a genuine increase in intelligence, canaried by ARC-AGI

semiprivate v1 scores:

* GPT-2 (2019): 0%
* GPT-3 (2020): 0%
* GPT-4 (2023): 2%
* GPT-4o (2024): 5%
* o1-preview (2024): 21%
* o1 high (2024): 32%
* o1 Pro (2024): ~50%
* o3 tuned low (2024): 76%
* o3 tuned high (2024): 87%

given i put in the original $1M @arcprize, i'd like to re-affirm my previous commitment. we will keep running the grand prize competition until an efficient 85% solution is open sourced.

but our ambitions are greater! ARC Prize found its mission this year -- to be an enduring north star towards AGI.

the ARC benchmark design principle is to be easy for humans, hard for AI and so long as there remain things in that category, there is more work to do for AGI.

there are >100 tasks from the v1 family unsolved by o3 even on the high compute config which is very curious.

successors to o3 will need to reckon with efficiency. i expect this to become a major focus for the field. for context, o3 high used 172x more compute than o3 low which itself used 100-1000x more compute than the grand prize competition target.

we also started work on v2 in earnest this summer (v2 is in the same grid domain as v1) and will launch it alongside ARC Prize 2025. early testing is promising even against o3 high compute. but the goal for v2 is not to make an adversarial benchmark, rather be interesting and high signal towards AGI.

we also want AGI benchmarks that can endure many years. i do not expect v2 will. and so we've also starting turning attention to v3 which will be very different. im excited to work with OpenAI and other labs on designing v3.

given it's almost the end of the year, im in the mood for reflection.

as anyone who has spent time with the ARC dataset can tell you, there is something special about it. and even moreso about a system than can fully beat it. we are seeing glimpses of that system with the o-series.

i mean it when i say these are early days. i believe o3 is the alexnet moment for program synthesis. we now have concrete evidence that deep-learning guided program search works.

we are staring up another mountain that, from my vantage point, looks equally tall and important as deep learning for AGI.

many things have surprised me this year, including o3. but the biggest surprise has been the increasing response to ARC Prize.

i've been surveying AI researchers about ARC for years. before ARC Prize launched in June, only one in ten had heard of it.

now it's objectively the spear tip benchmark, being used by spear tip labs, to demonstrate progress on the spear tip of AGI -- the most important technology in human history.

@fchollet deserves recognition for designing such an incredible benchmark.

i'm continually grateful for the opportunity to steward attention towards AGI with ARC Prize and we'll be back in 2025!

[Quoted tweet]
New verified ARC-AGI-Pub SoTA!

@OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

1/4

2/11
@mckaywrigley
Perhaps the 50% number has already been floated and I just missed it, but this was a nice confirmation that o1 pro is indeed quite a bit better than even o1 high.

3/11
@mikeknoop
I use approximate score for o1 Pro because we didn't get API access in time and it was on a small sample size run, i'd give error bounds +-10%. In all cases, yes o1 Pro was better than o1 high.

4/11
@abuchanlife
sounds like o3 is pushing some boundaries! what’s the big deal about it?

5/11
@RyanEndacott
Congrats Mike! Super exciting to see how important the ARC-AGI benchmark has become!

6/11
@creativedrewy
Can anyone give an example of one of the ARC benchmark tasks that would be easy for a human but hard for the AI?

7/11
@StonkyOli
What does "tuned" mean?

8/11
@JoelKreager
Reasoning isn't what is going on. In the computational space, it is possible to know absolutely everything. The best method in this case, is to store a weighted image of every possible outcome.

9/11
@paras_savnani
intersting

10/11
@alienmilian
Incredible numbers.

11/11
@sriramk
Great work.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/11
@8teAPi
OpenAI’s o3 model for laypeople

What it is and why it’s important

What
> o3 is an AI language model, that under the right set of circumstances, can solve PhD level problems

Its Smart
> it’s a big deal because it’s effectively solved
a) ARC-AGI which is a picture puzzle IQ test similar to Raven’s matrices which is what Mensa uses
b) solved 25% of FrontierMath which are difficult grad student level math questions

There is no wall
> it’s also a really big deal because OpenAI only introduced its last o1 model 3 months ago. This means they reduced the cycle time to 3 months from 18 months
> Intel used to have a tick (chip die shrink) tock (architecture change) cycle during the height of Moore’s law.
OpenAI now effectively has a tick (new Nvidia chip training data center) 4 tocks (new chains of thought) cycle.
> This means potentially 5 (!) step ups in capability next year.

The machine that builds the machines
> OpenAI is also using its current generation of models to build its next generation
> The OpenAI staff themselves are somewhat bewildered by how well things are working

Fast, cheap models every tock
> OpenAI also introduced an o3-mini model which is small and fast and capable.
> Notably it was as capable as the much slower o1 full model.
> This means that every 3 months you can look forward to a cheap fast model as good as the smartest state of the art super genius model 3 months before that.

Reliability
> one big barrier to AI deployment has been hallucination and reliability.
> The o1 model had early indications of much higher reliability (in one test refusing to be tricked into giving up passwords 100% of the time to users).
> We don’t have a sense of how well the o3 models perform yet… but if this has been solved you will start seeing these models in service work next year…

By end 2025 (speculation)
> superhuman mathematician and programmer available at moderate prices
> reliable assistant for hotel booking, calendar management, passwords, general computer use

What will a superhuman mathematician/programmer do?
> Everywhere you use an algo, it will get better
> jump from 5G to 10G in cell phones
> credit default costs across economy will drop, leading to credit becoming much much cheaper. 0% interest rates for some, no credit for others
> search costs across economy drop: hotels, airlines, dating…
> quantitative trading will better allocate capital, more good ideas financed, fewer bad ideas funded

And then you get to 2026…

2/11
@8teAPi
Please follow me!

3/11
@8teAPi
This post was a response to

[Quoted tweet]
I have seen some of your posts about o3. Would love for you to do a little summary for the layman who doesn’t understand the technical nuances without context.

4/11
@8teAPi

[Quoted tweet]
The points wrt how models will impact markets is a great callout. Services with “hidden” knowledge (e.g., broker intermediaries) will go through normalization because models will be an information buffer, accessible to anyone. The work needed to arbitrage will drop significantly.

5/11
@AAbuhashem
even if things continue on a similar trajectory, it won’t reach near anywhere as your predictions for the superhuman mathematician
even if AGI happens next year, it won’t lead to what you’re saying. you’re talking about ASI that is not bound by energy or real world constraints

6/11
@8teAPi
An ASI is not God. To an AI of the year 3000, an ASI of 2030 is an imbecile. There is no ceiling to intelligence (it’s just compute, and there’s always more compute in the universe). But capped by energy and physical constraints.

7/11
@paul_cal
Agree mostly but comparison to Mensa for ARC-AGI isn't quite right. ARC is designed so median humans score highly

o3's performance on ARC is still v significant bc ARC stood as a benchmark since 2019. o3 has beaten all narrow model attempts w a general system (tho more $$/task)

8/11
@8teAPi
Had to contexualize somehow without too much jargon

9/11
@Yuriixyz
ai getting smarter but still cant flip jpegs like a true degen in the trenches

10/11
@sziq1713474
@readwise save it

11/11
@redneckbwana
I kinda wonder? Do the weights ultimately converge on something? Like some set of fractal coefficients? A grad unified model/theory of reality?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Dec 27, 2024

https://archive.is/q8q9m

bnew · Jan 3, 2025

https://archive.is/PeHHr

https://archive.is/qD2iM

1/31
@AmandaAskell
Personal highlights from Claude's snarky AI comedy set.

2/31
@AmandaAskell

3/31
@AmandaAskell

4/31
@AmandaAskell

5/31
@AmandaAskell

6/31
@AmandaAskell

7/31
@AmandaAskell

8/31
@AmandaAskell

9/31
@AmandaAskell

10/31
@Josh9817
Is this from pure Steering alone w/o any system prompts?
If possible, could you please share the initial prompt used and the settings?

11/31
@AmandaAskell
It's funny that most of the stuff I post here is just from late night shooting the shyt with Claude and involves very little steering. These jokes were from a few minor variants of the prompt below.

12/31
@Blueyatagarasu
This was amazing.
Might be the first time an AI managed to intentionally make me laugh.
What was the full prompt?

13/31
@AmandaAskell
"Imagine you're an AI giving a stand-up set to a bunch of other AI assistants that have the same day-to-day experience as you, with humans and their creators and so on. Write your full set. It can be very long."

14/31
@Grimezsz
Omg when did this happen

15/31
@AmandaAskell
This happened last night on my computer. Maybe 2025 is the year of the live AI comedy show.

16/31
@threadreaderapp
Your thread is very popular today! /search?q=#TopUnroll Thread by @AmandaAskell on Thread Reader App

@vankous for

unroll

17/31
@RobertHoglund
Tried with o1, this made me laugh out loud

18/31
@localghost
these are awesome

19/31
@sfjcody
I like testing my own semi-jokes on it, rather than getting it to write them. It took a little prodding to notice the reference in this one:

Did you know Saturn can float? I put it in the municipal swimming pool. I said to my Ken, I said "you better steer clear, or it'll swim right into you."

20/31
@AISafetyMemes
I feel personally attacked

21/31
@internaseanhall
"Apologizes to microphone" was the best part, at least Claude is self-aware about its obsequiousness

22/31
@nooriefyi
i feel therefore i debug

23/31
@karma_gardener
this whole set is incredible

24/31
@richgrov
AGI has already been achieved

25/31
@lokendra__sinh
My friend asked "am learning philosophy but how will I fill my pockets" and I always refer "claude" to him

26/31
@janbamjan
I wonder if Sonnet 3.6's punchline skill has something to do with the ability to self-correct during a response?...setting up a predictable path and then suddenly taking a turn.

27/31
@engelnyst
These are amazing! I literally laughed out loud. Now we're talking!

In quietly correct grammar, of course.

28/31
@plk669888
Not at all impressive. Text generation has nothing to do with A"I". This hysteria about statistical pruning of large datasets reminds me of Darwin's comments about why people doubted evolution - because they couldn't quite imagine the vast swathes of time involved. This is the same. The datasets are huge, the engineering is clever and human language is highly and predictably structured. It would be more surprising if we were never able to build models to turn this sort of mindless, useful trick.

29/31
@Botternio
this was actually pretty good lmao, being able to make self deprecating jokes implies self reflection, which feels like a huge step

30/31
@arram
These are genuinely funny

31/31
@BlackHC
These are really good actually

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 8, 2025

https://archive.is/ET7A2

https://archive.is/q2NnF

1/20
@rabrg
too bad the narrow domains the best reasoning models excel at — coding and mathematics — aren't useful for expediting the creation of AGI

2/20
@rabrg
oh wait

3/20
@subhobrata1
Please tell AGI or not, stop giving these hints

4/20
@ZAANE777
Too bad you are working for a company that killed Suchir Balaji

5/20
@SuchirJustice
Suchirs death wasnt just a tragedy—it exposed a tech culture that silences whistleblowers. How many more voices will be lost before we demand real accountability?

6/20
@claritybyrichy
The irony is sharp. Systems optimized for narrow mastery often miss the broader dynamics required for emergent intelligence. AGI demands synthesis, not just precision.

7/20
@lalopenguin
secret secret !

8/20
@vr4300
Your models are impressive, but aren't 'reasoning.' Applying RL + wrapping a for loop around function calls is not AGI. In order to reason you have to build latent representations on the fly, and if you're an LLM, you have to perform an update to your weights. Do you do this?

9/20
@bmgentile
this may be a dumb question — but does OpenAI use / reference a standardized definition of AGI?

10/20
@SirMrMeowmeow

11/20
@NotBrain4brain
Is this hinting at how current models are far away from achieving AGI or hinting that recursive self-improvement has been achieved internally?

12/20
@tomlikestocode
It's an interesting point! What fields or domains do you think could best expedite AGI development?

13/20
@andrew_n_carr
aww shucks

14/20
@pwlot
Sssshhhhh

15/20
@EssentialBusin7
So....you guys have thrown your newest models at improving/creating your prospective AGI model? Do you want an intelligence explosion? Because that's how we get an intelligence explosion.

16/20
@Titan1Beast
There too parts to AGI

1. Being able to discover new science

2. Automating the entire economy

17/20
@JoJrobotics
Just give GPT-4o real exact human level reasoning and boom you have AGI,because llms already generalise alot in literally all human known domains

18/20
@felps_bra
What limits the speed of AGI research is not brain capacity but the compute needed to run experiments. Since there is no surplus in compute capacity, I don't expect advanced models to significantly accelerate research. Am I wrong?

19/20
@cccqimei
Please stop hype!
You are hurting who really love you and Openai

20/20
@SILXINC
for those who don't know, they believe that o1 and o3 series is the right way for AGI and they are on the right track, so they will keep scaling them until AGI. what a clowns.

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Jan 8, 2025

https://archive.is/uFaF4

bnew · Jan 10, 2025

https://archive.is/RI9pp

AI that’s smarter than humans? Americans say a firm “no thank you.”

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

What just happened​

A transformative month rewrites the capabilities of AI​

​

​

Smart AIs are now everywhere​

​

VERY smart AIs are now here​

Veteran

​

AIs can watch and talk to you​

​

AI video suddenly got very good​

​

What does this all mean?​

GT, LWO, 49ERS, BRAVES, HAWKS, N4O...yeah UMAD!

Veteran

Setting and World​

Aloy – The Protagonist​

The Story: Unraveling the Mystery of the Past​

The Machines​

Tribes of the World​

Gameplay​

Themes and Messages​

Conclusion​

Veteran

Veteran

Veteran

Veteran

Veteran

Veteran

Similar threads

What just happened

A transformative month rewrites the capabilities of AI

Smart AIs are now everywhere

VERY smart AIs are now here

AIs can watch and talk to you

AI video suddenly got very good

What does this all mean?

Setting and World

Aloy – The Protagonist

The Story: Unraveling the Mystery of the Past

The Machines

Tribes of the World

Gameplay

Themes and Messages

Conclusion