1/30
@OpenAI
Factuality is one of the biggest open problems in the deployment of artificial intelligence.
We are open-sourcing a new benchmark called SimpleQA that measures the factuality of language models.
https://openai.com/index/introducing-simpleqa/
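For readers who want a feel for what "measuring factuality" means in practice, here is a minimal sketch of scoring a model on SimpleQA-style question/answer pairs. The CSV layout, the column names, and the `ask_model` helper are assumptions for illustration; the actual benchmark grades answers with a model-based grader, not substring matching.

```python
# Minimal sketch of scoring a model on SimpleQA-style items.
# Assumptions (not from the announcement): the benchmark ships as a CSV
# with "problem" and "answer" columns, and ask_model wraps whatever
# chat API you use. The real benchmark uses a model-based grader,
# not substring matching.
import csv

def ask_model(question: str) -> str:
    """Placeholder for an actual model call (e.g., a chat completion)."""
    raise NotImplementedError

def evaluate(path: str, limit: int = 100) -> float:
    correct = total = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in list(csv.DictReader(f))[:limit]:
            prediction = ask_model(row["problem"])
            # Naive grading: case-insensitive substring match.
            correct += row["answer"].lower() in prediction.lower()
            total += 1
    return correct / total if total else 0.0
```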
2/30
@Klotzkette
Can you also just make sure that your great tool is not making up case law anymore, in answers and in search? Getting really annoying that this is not improving at all.
3/30
@MrTimothy1971
Entertainment purposes only. That would be my disclaimer, because there could be a lawsuit if someone gets the wrong information and it harms them. As stupid as people are. So, to cover your butts, you need disclaimers. Trust me, in this day and age you need it.
4/30
@MrTimothy1971
don't get mad at me for saying it. just contemplate what I said.
5/30
@pattty847
When open ai?
6/30
@abe_chira
In English, please.
7/30
@TallPhilosopher
I get better results when I point out in my 'About me' and 'My goals' sections that subsidies are wasteful and divisive, and Fee-and-Dividend will produce the desired outcomes without need for government micromanagement of the econ. But everyone else still gets fed Green New Deal
8/30
@StevenHasty1
By what standard is something factual?
9/30
@DesFrontierTech
How does SimpleQA's approach to adversarially designing questions impact the overall effectiveness of model training in addressing factuality and reducing hallucinations in language models?
10/30
@ai_for_success
Huge difference between o1-mini and o1-preview... That's surprising.
11/30
@testingcatalog
Politics is always tricky
Almost as tricky as science & tech
12/30
@per_anders
And we should trust this why?
You know the questions; you’ll add the answers. Ergo your engine will score highest yet its real-world accuracy will never change.
You have no credibility here.
13/30
@theaievangelist
Interesting pattern: OpenAI's o1-preview and GPT-4 models show high "Correct" rates (42.7% and 38.2%) but also high "Incorrect" rates, while their mini versions are more conservative, and Claude's "Incorrect" rates are substantially lower.
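The pattern above is why SimpleQA-style evals separate "incorrect" from "not attempted": a model that abstains often can show a low incorrect rate without actually knowing more. A small sketch, with grade labels assumed to follow the correct / incorrect / not-attempted scheme this thread describes:

```python
# Why a "conservative" model can show a low incorrect rate: separate
# overall accuracy from accuracy over attempted questions only.
# Grade labels here are assumptions following the correct / incorrect /
# not-attempted scheme described in the thread.
from collections import Counter

def summarize(grades: list[str]) -> dict[str, float]:
    counts = Counter(grades)
    attempted = counts["CORRECT"] + counts["INCORRECT"]
    return {
        "overall_accuracy": counts["CORRECT"] / len(grades),
        "incorrect_rate": counts["INCORRECT"] / len(grades),
        "accuracy_given_attempted": counts["CORRECT"] / attempted if attempted else 0.0,
    }

# A model that abstains half the time can match a bolder model's
# correct count while showing a much lower incorrect rate.
print(summarize(["CORRECT"] * 40 + ["INCORRECT"] * 10 + ["NOT_ATTEMPTED"] * 50))
```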
14/30
@LechMazur
It will be interesting to see more models tested and how they compare to the results from my confabulations/hallucinations benchmark on provided texts.
15/30
@WorldEverett
I hope to see the full version of o1 to test it!
16/30
@JosephJacks_
Do more open source!
17/30
@thegenioo
So guys, gpt-4o-mini has been just yapping all the time?
Btw this is absolutely amazing from you guys. Thank you!
18/30
@mtochs
Factuality is increasingly defined by emotional people rather than field-tested data. All the frontier models except @grok favor subjective factuality over objective factuality.
19/30
@BensenHsu
The study presents a benchmark called SimpleQA that evaluates the ability of large language models to answer short, fact-seeking questions. The researchers designed SimpleQA to be challenging and have a single, indisputable answer for each question.
The researchers evaluated several OpenAI and Anthropic language models on the SimpleQA benchmark. They found that larger models generally performed better than smaller models, with GPT-4o outperforming GPT-4o-mini and the Claude-3 series of models performing better than the smaller OpenAI models. However, even the best-performing models scored less than 50% on the benchmark.
full paper:
Measuring short-form factuality in large language models
20/30
@rohanpaul_ai
Some Insights from the Paper
• Larger models show better performance and calibration. Calibration in this context refers to whether LLMs "know what they know" - essentially measuring if the model's confidence matches its actual accuracy (see the sketch after this list).
• Models consistently overstate their confidence
• Answer frequency correlates with accuracy
• Claude models attempt fewer questions than GPT models
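A rough sketch of the calibration check from the first bullet: bucket the model's stated confidence and compare each bucket's average confidence against its empirical accuracy. The binning and data shapes here are illustrative assumptions, not the paper's exact protocol.

```python
# Rough sketch of a calibration check: bucket stated confidence and
# compare each bucket's average confidence with its empirical accuracy.
# Binning choices are illustrative assumptions, not the paper's protocol.
def calibration_table(confidences: list[float], correct: list[bool],
                      n_bins: int = 10) -> list[tuple[float, float, int]]:
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # clamp conf == 1.0
        bins[idx].append((conf, ok))
    table = []
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(ok for _, ok in bucket) / len(bucket)
            table.append((mean_conf, accuracy, len(bucket)))
    return table

# Well-calibrated: mean_conf ≈ accuracy in every bucket.
# "Overstated confidence" shows up as mean_conf > accuracy.
```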
21/30
@AyMoMusic
The best Christmas present this year would be leaking the f*ck out of o1 and Sora source codes.
22/30
@Caduceus_CAD
Amazing initiative with #SimpleQA!
#Caduceus is here to support projects that push AI and AIGC boundaries.
With our edge-rendering power and scalable blockchain infrastructure, we're all set for next-gen AI deployments. Welcome to build on CAD’s innovative platform!
#Caduceus #AI #DePIN
23/30
@web3nam3
24/30
@the_aiatlas
This is big.
Who do you think will be on top there?
25/30
@ajaycan
The idea of "factuality" in artificial intelligence, especially in language models like ChatGPT, is about making sure the information these AI tools provide is true and accurate. Imagine you're asking a knowledgeable friend questions, and you expect them to give you real, correct answers based on what they know. But if this friend sometimes makes things up or gets confused, it could be a problem, especially if you're relying on them for important facts.
### Example Analogy
Imagine you have a library with thousands of books, and each book has different bits of information about various topics. Now, if you ask a question, someone gathers information from all these books to give you an answer. But here’s the catch: some books have outdated or incorrect information, while others are reliable and trustworthy. If the person gathering answers can't tell which books are trustworthy, they might give you incorrect answers.
AI models face a similar problem. They learn from massive amounts of data on the internet, which contains both factual and incorrect information. Without a way to tell what’s right, they might provide answers that seem correct but are inaccurate. This is where *factuality* benchmarks, like **SimpleQA**, come in. They help test and improve how accurately AI models give factual answers.
### What is SimpleQA?
SimpleQA is like a “truth test” for AI. It's a set of questions designed to see if an AI model can answer factual questions correctly. By asking simple, straightforward questions that have clear, factual answers, researchers can measure how well the AI distinguishes fact from fiction. For example:
- **Q:** Who was the first president of the United States?
**A:** George Washington.
*Correct, factual answer.*
- **Q:** How many moons does Mars have?
**A:** Two (Phobos and Deimos).
*Correct, factual answer.*
The purpose of SimpleQA is to ensure that, when people ask questions, they get reliable information instead of mistakes or made-up facts.
### Philosophy of Factuality in AI
Philosophically, factuality in AI touches on truth, trust, and responsibility. For AI to be helpful and trustworthy, it needs to reflect truth as much as possible. Think of it like this: just as we aim to tell the truth in our own lives to maintain trust with others, AI must "learn" the importance of truth to build a good relationship with users. SimpleQA and similar benchmarks are small steps in a larger journey to make AI responsible and dependable—values that are fundamental to making AI a reliable tool for society.
26/30
@RaphI_I
Wokeless?
27/30
@anniealtman108
https://video.twimg.com/amplify_video/1851691195082440704/vid/avc1/720x1280/e7pFfoCVCIPFh-bW.mp4
28/30
@anniealtman108
29/30
@anniealtman108
https://www.lesswrong.com/posts/QDc...s-sister-annie-altman-claims-sam-has-severely
30/30
@anniealtman108
[Quoted tweet]
So how many of these do I need to collect?
Hypothetically, could I turn these into some kind of card game?