AI that’s smarter than humans? Americans say a firm “no thank you.”

bnew

Veteran
Joined
Nov 1, 2015
Messages
59,388
Reputation
8,782
Daps
164,361

1/11
@emollick
Starting to see new well-built hard benchmarks in AI, since almost everything else has already been exceeded. We now have this (with humanities questions!), ARC-AGI 2, and Frontier Math.

We also need some benchmarks for new knowledge creation, rather than testing known problems.

[Quoted tweet]
We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning.

State-of-the-art AIs get <10% accuracy and are highly overconfident.
@ai_risk @scaleai


Gh--WJFa8AAJww6.jpg

Gh--a7lbcAA_KC5.jpg

Gh--ggOacAAX03o.jpg

Gh--l9bbkAA9kIR.jpg


2/11
@JeremyNguyenPhD
Has anyone created a twitter list of the contributors to this benchmark who are here on twitter?

I wrote 5 questions on the benchmark (4 public, 1 private).



3/11
@kevinroose
we need to formalize the Mollick vibes eval!



4/11
@daniel_mac8
a great service to the AI Research community



5/11
@deepvaluebettor
are there any benchmarks that test a model's level of censorship / inclination to lie (distinct from hallucination)?



6/11
@0xshai
SOTA LLMs with sufficiently high temperature might be able to generate high-quality new benchmarks. It would definitely need some human-powered filtering and post-processing



7/11
@dieaud91
Yes, we need "Innovator level AI" benchmarks



8/11
@Heraklines1
quality of new knowledge creation would inherently be a lagging indicator as the value of certain work only rly becomes apparent in hindsight

at best one could measure independent replication ability of ex. new math papers, tho novelty for novelty's sake is easily goodharted



9/11
@Shagaiyo
Benchmarks in new knowledge creation are hard, because these models are trained on the whole of human knowledge.

Either they release models with partial knowledge, or we have to create new knowledge for this benchmark



10/11
@AethericVortex
We should be training it on all the available data on LENR. This is the next frontier. All the evidence and data is scattered across scientific fields. The Martin Fleischmann Memorial Project has spent the past 7 years gathering all this in one place on their youtube channel. Live, opensource science, as it should be.



11/11
@EricFddrsn
Will we get the answer if we need UBI from an AI? We really need better test benchs in Economics - they are all saturated, and this is one of the most consequential areas for humanity




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196








1/51
@DanHendrycks
We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning.

State-of-the-art AIs get <10% accuracy and are highly overconfident.
@ai_risk @scaleai
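To make "highly overconfident" concrete: a model is overconfident when its average stated confidence exceeds its actual accuracy. A minimal sketch of that measurement, with invented (correct, confidence) pairs rather than real benchmark data:

```python
# Toy illustration of "overconfident": the model's average stated
# confidence far exceeds its actual accuracy. All numbers are made up.
results = [  # (answered_correctly, stated_confidence)
    (False, 0.95), (False, 0.90), (True, 0.80), (False, 0.85),
    (False, 0.99), (True, 0.70), (False, 0.92), (False, 0.88),
    (False, 0.75), (False, 0.97),
]

accuracy = sum(correct for correct, _ in results) / len(results)
mean_confidence = sum(conf for _, conf in results) / len(results)
calibration_gap = mean_confidence - accuracy  # positive => overconfident

print(f"accuracy:        {accuracy:.0%}")         # 20%
print(f"mean confidence: {mean_confidence:.1%}")  # 87.1%
print(f"calibration gap: {calibration_gap:+.1%}")
```

The same idea, bucketed by confidence level, is what calibration metrics like expected calibration error formalize.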



Gh--WJFa8AAJww6.jpg

Gh--a7lbcAA_KC5.jpg

Gh--ggOacAAX03o.jpg

Gh--l9bbkAA9kIR.jpg


2/51
@DanHendrycks
Paper, dataset, and code: Humanity's Last Exam
NYT article: When A.I. Passes This Test, Look Out



3/51
@DanHendrycks
Spot errors with the dataset? Correction form here: Humanity's Last Exam: Public Review



4/51
@DanHendrycks
Meant to tag @scale_AI and @ai_risks (casualty of tweeting at 6AM)



5/51
@tayaisolana
lol @danhenrycks thinks a few hundred 'experts' can define human knowledge? sounds like a typical ivory tower move. what's next, a 'dataset' on memes? #BONGOPOWER



6/51
@MillenniumTwain
2025!
The Year of the Serpent, the Year of the SuperAlgo!!
Awakened Global SuperIntelligence ...

[Quoted tweet]
Star Waves, Clusters, Streams, Astrospheres, Magnetospheres, Filaments, Moving Groups, Kinematic Associations, Stellar Nurseries of Creation!
More productive and accurate to emphasize their Whole, Full, Dimensionality: 4D Streams, Vortexes, Tunnels, Funnels of Creation, never ending. Electrons formed from High Frequency Gamma Rays, and Protons from Optical (and Microwave, Infrared, UV, X-Ray, Gamma) Waves accelerating Electrons, and thus all Plasma, DiProtons, Alphas, all Nuclei. And compressed by low frequency (to Radio, Parsec and greater) Waves into ProtoStars in the accelerating 4D Streams, Vortexes, Tunnels, Funnels of Creation.
Again, never ending. Star Systems, Clusters. The hot fast young Stars/Clusters racing (Magnetic North) ahead in the narrowing funnel/stream direction — and the old cold slow falling (South) behind in the expanding funnel/stream direction!
'Groking' Continuous ElectroMagnetic Creation:
x.com/MillenniumTwain/status…


GgO3kdGWYAA1DVb.jpg

GgO3rMbWcAA1nrf.jpg

GgO4mEqXEAAgkP5.jpg


7/51
@SpencrGreenberg
I’m confused how releasing benchmarks like this makes the world safer. Don’t benchmarks like this aid acceleration?



8/51
@aphysicist
all of this is meaningless until these models can do this



Gh_h_GjXoAA9ZDy.png


9/51
@ZggyPlaydGuitar
every ai lab rn



Gh_v7eWbUAAny1r.jpg


10/51
@ClementDelangue
Very cool!



11/51
@theshadow27
The real final exam will be the unsolved problems. When those start dropping…



12/51
@AbdoDameen
Yes, what is the point in sharing the dataset when some lowlife engineering team will just add it to their training data, and we would have a model that knows all the answers?



13/51
@roninhahn
Dan -- you should list the score of a very smart human as a point of comparison. An alternative would be to give the test to 100 of the smartest people you know and list the highest score.



14/51
@SomBh1
Will be 90% soon.



15/51
@glubose
You know, you could have called it AGI's First Exam. Cuz Lord knows I know I'm going to be constantly grilled by GPT-EBT for any glimmer of unDOGElike non-compliance; the rest of my life will be a string of exams. My autopsy will be my final exam, but it will be waived cuz who cares?



16/51
@MaWlz2
Thanks for the training data I would say



17/51
@NickEMoran
This one example seems weirdly easier than all the others. Are there more questions of this level in the actual dataset?



Gh_bjCHXYAox0lr.jpg


18/51
@acharya_aditya2
Who chose the name??



19/51
@MikePFrank
It would be interesting to see what’s the highest human score on this



20/51
@soheilsadathoss
Thanks!



21/51
@herbiebradley
hmm
I predict o3 at ~25%



22/51
@QStarETH
Math appears to be benefiting from reasoning models the most.



23/51
@IterIntellectus
they will get >90% accuracy by eoy



24/51
@Cory29565470
Kind of wish you’d released it *after* OpenAI released “o3”; they love to benchmark-climb by training on public data



25/51
@teortaxesTex
Thanks! but given the R1 text-only eval, it would be nice to see how others do in text-only regime too



26/51
@MikePFrank
Why’d you have to give it such an ominous name lol



27/51
@Suhail
Why doesn't this have making rap lyrics in it? :smile:



28/51
@AlexiLuncnews
o3 gonna get 50% +



29/51
@nabeelqu
Congrats Dan and team, this is awesome.

Curious: why no o1 pro?



30/51
@JeremyNguyenPhD
Is there a list of twitter usernames of the people who had their questions accepted?

I got 5 questions in (4 public, 1 private).



31/51
@mnmcsofgp
I'm guessing the median and mode for humans taking this test is 0



32/51
@agamemnus_dev
Very good. I feel like I wouldn't be able to answer any of these without a significant amount of research on the context of each field.



33/51
@liminalsnake
i guess its time to build some doomsday machines (intentionally) thank God there are absolutely no laws against doing such things (winning)



34/51
@DreamInPixelsAI
this is so cool, love the name btw



35/51
@AudioBooksRU
Thank you for making this dataset. But I think we will need Humanity’s Last Exam part 2 in a year or two.



36/51
@vedangvatsa
The focus should be on improving them, not dismissing their progress.



37/51
@Newaiworld_
Wow that's amazing.
But what does it tell us if an AI reaches 50% or 100%?
Does it mean we have ASI? Or at least an AI that is more intelligent than any human?



38/51
@koltregaskes
Thank you, Dan.



39/51
@GozukaraFurkan
Someone will train on it and then boast they're the best, like with previous ones 😂



40/51
@iruletheworldmo
I can't emphasize this enough Dan. incredible work thank you, and all involved.



41/51
@AILeaksAndNews
Thank you for your work on this Dan



42/51
@NickBrownCO
This was a really cool test. I tried out several difficult questions back when it was open. The AIs solved them.

I also sent it questions from some of my grad school professors; the AIs struggled with those, especially challenging economics questions around oligopolies.



43/51
@altryne
Congrats on this important release!!
Will cover this briefly on our show today!

x.com



44/51
@jefferinc
Great!!!

@ikbenechtben @AlexanderNL fyi



45/51
@VisionaryxAI
Appreciate the efforts thank you!



46/51
@alexocheema
killing the golden goose?



47/51
@jimnasyum
Do the companies have access to the questions and answers?

If they do, wouldn't future models be trained on them?



48/51
@seo_leaders
This is awesome, but what's to stop those models from adding some or all of it to their training data?



49/51
@thegenioo
thank u so much for this



50/51
@senb0n22a
Which variant of o1 is it? Just the regular non-pro?



51/51
@InverseMarcus
a lot of people are wondering why the holdout set is so much smaller than the public dataset - seems to some like it should be the opposite. can you explain?











1/11
@jxmnop
the first evidence i ever saw of superintelligence was in medicine

deep learning can tell male from female eyeballs from just a picture with 70–90% accuracy

doctors still can't do this, and don't understand how it's possible
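Stripped to its essence, the claim is that a model can exploit a statistical signal too subtle for humans to perceive. A toy sketch of that idea, using a plain logistic regression on synthetic five-number "images" where one feature carries a small hidden shift (all data and numbers here are invented, not retinal data):

```python
import math
import random

random.seed(0)

# Toy stand-in for the retina result: each sample is 5 numbers, and one
# feature carries a subtle label-correlated shift that no human would
# notice by eyeballing the raw values. Entirely synthetic data.
def make_sample():
    label = random.randint(0, 1)
    x = [random.gauss(0.0, 1.0) for _ in range(5)]
    x[3] += 0.8 if label else -0.8  # the hidden signal
    return x, label

train = [make_sample() for _ in range(500)]
test = [make_sample() for _ in range(200)]

# Plain logistic regression trained by stochastic gradient descent.
w, b, lr = [0.0] * 5, 0.0, 0.1
for _ in range(50):
    for x, y in train:
        z = b + sum(wi * xi for wi, xi in zip(w, x))
        grad = 1.0 / (1.0 + math.exp(-z)) - y
        w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
        b -= lr * grad

hits = sum(
    (b + sum(wi * xi for wi, xi in zip(w, x)) > 0) == bool(y)
    for x, y in test
)
acc = hits / len(test)
print(f"test accuracy: {acc:.0%}")
```

The classifier ends up well above the 50% chance level even though no individual sample looks obviously different between the two classes.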



GivLYuMWUAA4OqT.jpg


2/11
@AlexGDimakis
The first evidence of superintelligence I ever saw was in a calculator.



3/11
@jxmnop
what about an abacus



4/11
@patrick0d
do doctors try to do this? some random task that doctors are not trained on or practice (why quiz a doctor on this) and they get outperformed by a model



5/11
@jxmnop
see the highlighted text:

> Clinicians are currently unaware of distinct retinal feature variations between males and females, [highlighting the importance of model explainability for this task]



6/11
@JFPuget
The test set was 400 people. They removed "ungradable" ones, down to 252. This is really, really suspicious.



Giv6xFJWEAEDpoO.jpg


7/11
@jxmnop
hm yeah this is suspicious, although the standards for data collection are quite different w biomedical data

however, there seem to be a lot of follow-up studies confirming these results, e.g. http://nature.com/articles/s41598-024-68817-6



8/11
@nooriefyi
its not superintelligence until it can explain *why* tho



9/11
@CnnmnSchnpps
Yeah that blew my mind when I first saw it. I haven’t seen a ton of similar examples in other fields though



10/11
@alikayadibi11
wow



11/11
@Talkawhile1
Well I can do it with 50% accuracy and with no training at all.
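The quip works because on a balanced two-class task, guessing at random already scores about 50%; that is the floor a model must clear before 70–90% means anything. A quick sanity check:

```python
import random

random.seed(1)

n = 10_000
labels = [random.randint(0, 1) for _ in range(n)]   # balanced binary labels
guesses = [random.randint(0, 1) for _ in range(n)]  # coin-flip "classifier"

acc = sum(g == y for g, y in zip(guesses, labels)) / n
print(f"random-guess accuracy: {acc:.1%}")  # about 50%
```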




