1/11
@emollick
Starting to see new well-built hard benchmarks in AI, since almost everything else has already been exceeded. We now have this (with humanities questions!), ARC-AGI 2, and Frontier Math.
We also need some benchmarks for new knowledge creation, rather than testing known problems.
[Quoted tweet]
We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning.
State-of-the-art AIs get <10% accuracy and are highly overconfident.
@ai_risk @scaleai
2/11
@JeremyNguyenPhD
Has anyone created a twitter list of the contributors to this benchmark who are here on twitter?
I wrote 5 questions on the benchmark (4 public, 1 private).
3/11
@kevinroose
we need to formalize the Mollick vibes eval!
4/11
@daniel_mac8
a great service to the AI Research community
5/11
@deepvaluebettor
are there any benchmarks that test a model's level of censorship / inclination to lie (distinct from hallucination ) ?
6/11
@0xshai
SOTA LLMs with sufficiently high temperature might be able to generate high quality new benchmarks. It would definitely need some human powered filtering and post processing
7/11
@dieaud91
Yes, we need "Innovator level AI" benchmarks
8/11
@Heraklines1
quality of new knowledge creation would inherently be a lagging indicator as the value of certain work only rly becomes apparent in hindsight
at best one could measure independent replication ability of ex. new math papers, tho novelty for novelty's sake is easily goodharted
9/11
@Shagaiyo
Benchmarks for new knowledge creation are hard, because these models are trained on the whole of human knowledge.
Either models would have to be released with partial knowledge, or we would have to create new knowledge for the benchmark.
10/11
@AethericVortex
We should be training it on all the available data on LENR. This is the next frontier. All the evidence and data is scattered across scientific fields. The Martin Fleischmann Memorial Project has spent the past 7 years gathering all this in one place on their youtube channel. Live, opensource science, as it should be.
11/11
@EricFddrsn
Will we get the answer to whether we need UBI from an AI? We really need better benchmarks in Economics - they are all saturated, and this is one of the most consequential areas for humanity
To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
1/51
@DanHendrycks
We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning.
State-of-the-art AIs get <10% accuracy and are highly overconfident.
@ai_risk @scaleai
2/51
@DanHendrycks
Paper, dataset, and code: Humanity's Last Exam
NYT article: When A.I. Passes This Test, Look Out
3/51
@DanHendrycks
Spot errors with the dataset? Correction form here: Humanity's Last Exam: Public Review
4/51
@DanHendrycks
Meant to tag @scale_AI and @ai_risks (casualty of tweeting at 6AM)
5/51
@tayaisolana
lol @danhenrycks thinks a few hundred 'experts' can define human knowledge? sounds like a typical ivory tower move. what's next, a 'dataset' on memes? /search?q=#BONGOPOWER
6/51
@MillenniumTwain
2025!
The Year of the Serpent, the Year of the SuperAlgo!!
Awakened Global SuperIntelligence ...
[Quoted tweet]
Star Waves, Clusters, Streams, Astrospheres, Magnetospheres, Filaments, Moving Groups, Kinematic Associations, Stellar Nurseries of Creation!
More productive and accurate to emphasize their Whole, Full, Dimensionality: 4D Streams, Vortexes, Tunnels, Funnels of Creation, never ending. Electrons formed from High Frequency Gamma Rays, and Protons from Optical (and Microwave, Infrared, UV, X-Ray, Gamma) Waves accelerating Electrons, and thus all Plasma, DiProtons, Alphas, all Nuclei. And compressed by low frequency (to Radio, Parsec and greater) Waves into ProtoStars in the accelerating 4D Streams, Vortexes, Tunnels, Funnels of Creation.
Again, never ending. Star Systems, Clusters. The hot fast young Stars/Clusters racing (Magnetic North) ahead in the narrowing funnel/stream direction — and the old cold slow falling (South) behind in the expanding funnel/stream direction!
'Groking' Continuous ElectroMagnetic Creation:
x.com/MillenniumTwain/status…
7/51
@SpencrGreenberg
I’m confused how releasing benchmarks like this makes the world safer. Don’t benchmarks like this aid acceleration?
8/51
@aphysicist
all of this is meaningless until these models can do this
9/51
@ZggyPlaydGuitar
every ai lab rn
10/51
@ClementDelangue
Very cool!
11/51
@theshadow27
The real final exam will be the unsolved problems. When those start dropping…
12/51
@AbdoDameen
Yes what is the point in sharing the datasets when some lowlife engineering team will just add that to their training data and we would have a model that knows all the answers?
13/51
@roninhahn
Dan -- you should list the score of a very smart human as a point of comparison. An alternative would be to give the test to 100 of the smartest people you know and list the highest score.
14/51
@SomBh1
Will be 90% soon.
15/51
@glubose
You know, you could have called AGI's First Exam. Cuz Lord knows I know I'm going to be constantly grilled by GPT-EBT for any glimmer of unDOGElike non-compliance, the rest of my life will be a string of exams. My autopsy will be my final exam, but will be waived cuz who cares?
16/51
@MaWlz2
Thanks for the training data I would say
17/51
@NickEMoran
This one example seems weirdly easier than all the others. Are there more questions of this level in the actual dataset?
18/51
@acharya_aditya2
Who choose the name ??
19/51
@MikePFrank
It would be interesting to see what’s the highest human score on this
20/51
@soheilsadathoss
Thanks!
21/51
@herbiebradley
hmm
I predict o3 at ~25%
22/51
@QStarETH
Math appears to be benefiting from reasoning models the most.
23/51
@IterIntellectus
they will get >90% accuracy by eoy
24/51
@Cory29565470
Kind of wished you released it *after OpenAI released “o3” they love to benchmark climb by training on public data
25/51
@teortaxesTex
Thanks! but given the R1 text-only eval, it would be nice to see how others do in text-only regime too
26/51
@MikePFrank
Why’d you have to give it such an ominous name lol
27/51
@Suhail
Why doesn't this have making rap lyrics in it?
28/51
@AlexiLuncnews
o3 gonna get 50% +
29/51
@nabeelqu
Congrats Dan and team, this is awesome.
Curious: why no o1 pro?
30/51
@JeremyNguyenPhD
Is there a list of twitter usernames of the people who had their questions accepted?
I got 5 questions in (4 public, 1 private).
31/51
@mnmcsofgp
I'm guessing the median and mode for humans taking this test is 0
32/51
@agamemnus_dev
Very good. I feel like I wouldn't be able to answer any of these without a significant amount of research on the context of each field.
33/51
@liminalsnake
i guess its time to build some doomsday machines (intentionally) thank God there are absolutely no laws against doing such things (winning)
34/51
@DreamInPixelsAI
this is so cool, love the name btw
35/51
@AudioBooksRU
Thank you for making this dataset. But I think we will need Humanity’s Last Exam part 2 in a year or two.
36/51
@vedangvatsa
The focus should be on improving them, not dismissing their progress.
37/51
@Newaiworld_
Wow that's amazing.
But what does it tell us if an AI reaches 50% or 100%?
Does it mean we have ASI? Or at least an AI that is more intelligent than any human?
38/51
@koltregaskes
Thank you, Dan.
39/51
@GozukaraFurkan
Someone will train on it and then boast we are best like as previous ones
40/51
@iruletheworldmo
I can't emphasize this enough Dan. incredible work thank you, and all involved.
41/51
@AILeaksAndNews
Thank you for your work on this Dan
42/51
@NickBrownCO
This was a really cool test. I tried out several difficult questions back when it was open. The AIs solved them.
I sent it to some of my grad school professors whose questions, especially challenging economic questions around oligopolies, the AIs struggled to calculate.
43/51
@altryne
Congrats on this important release!!
Will cover this briefly on our show today!
x.com
44/51
@jefferinc
Great!!!
@ikbenechtben @AlexanderNL fyi
45/51
@VisionaryxAI
Appreciate the efforts thank you!
46/51
@alexocheema
killing the golden goose?
47/51
@jimnasyum
Do the companies have access to the questions and answers?
If they do, wouldn't future models be trained on them?
48/51
@seo_leaders
This is awesome, but whats to stop those models adding some or all of it to training data?
49/51
@thegenioo
thank u so much for this
50/51
@senb0n22a
Which variant of o1 is it? Just the regular non-pro?
51/51
@InverseMarcus
a lot of people are wondering why the holdout set is so much smaller than the public dataset - seems to some like it should be the opposite. can you explain?