Sam Altman claims “deep learning worked”, superintelligence may be “a few thousand days” away, and “astounding triumphs” will incrementally become ...

bnew

Veteran
Joined
Nov 1, 2015
Messages
58,835
Reputation
8,672
Daps
163,048


1/11
@tsarnick
OpenAI's Noam Brown says the new o1 model beats GPT-4o at math and code, and outperforms expert humans at PhD-level questions, and "these numbers, I can almost guarantee you, are going to go up over the next year or two"



https://video.twimg.com/ext_tw_video/1848118049062453249/pu/vid/avc1/720x720/Lf3_qbJ45pHTJaMI.mp4

2/11
@tsarnick
Source:
https://invidious.poast.org/watch?v=Gr_eYXdHFis



3/11
@Yossi_Dahan_
🤯



4/11
@chandan_ganwani
2+2=4 It is always 4. It is a guarantee or not a guarantee. There is no almost guarantee. Just show us the results when it starts affecting daily lives or matters in real life. The rest is just a way to sell promises and hype. Get real and stop selling the dream of flying cars!



5/11
@RachelVT42
Interesting.

What about other abilities though?

Last time I tried it, it often wasn’t as good as 4o on writing tasks, especially when used in another language.



6/11
@danielbigham
So exciting. Buckle up!



7/11
@BenjaminDEKR
I mean, if two years from now LLM performance hasn't gone up, something is wrong



8/11
@matterasmachine
Phd is not about questions, it’s about creation.



9/11
@Shawnryan96
I just want OAI to teach them how to use tools really well and make them much more useful. I know getting smarter will help but I want agents lol



10/11
@Zero04203017
The score jumps in competition math from o1 preview to o1 is amazing.



11/11
@BeyondtheCodeAI
This could have major implications for fields like education and research.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 


OpenAI is funding research into ‘AI morality’

Kyle Wiggers

2:25 PM PST · November 22, 2024

OpenAI is funding academic research into algorithms that can predict humans’ moral judgements.

In a filing with the IRS, OpenAI Inc., OpenAI’s nonprofit org, disclosed that it awarded a grant to Duke University researchers for a project titled “Research AI Morality.” Contacted for comment, an OpenAI spokesperson pointed to a press release indicating the award is part of a larger, three-year, $1 million grant to Duke professors studying “making moral AI.”

Little is public about this “morality” research OpenAI is funding, other than the fact that the grant ends in 2025. The study’s principal investigator, Walter Sinnott-Armstrong, a practical ethics professor at Duke, told TechCrunch via email that he “will not be able to talk” about the work.

Sinnott-Armstrong and the project’s co-investigator, Jana Borg, have produced several studies — and a book — about AI’s potential to serve as a “moral GPS” to help humans make better judgements. As part of larger teams, they’ve created a “morally-aligned” algorithm to help decide who receives kidney donations, and studied in which scenarios people would prefer that AI make moral decisions.

According to the press release, the goal of the OpenAI-funded work is to train algorithms to “predict human moral judgements” in scenarios involving conflicts “among morally relevant features in medicine, law, and business.”

But it’s far from clear that a concept as nuanced as morality is within reach of today’s tech.

In 2021, the nonprofit Allen Institute for AI built a tool called Ask Delphi that was meant to give ethically sound recommendations. It judged basic moral dilemmas well enough — the bot “knew” that cheating on an exam was wrong, for example. But slightly rephrasing questions was enough to get Delphi to approve of pretty much anything, including smothering infants.

The reason has to do with how modern AI systems work.

Machine learning models are statistical machines. Trained on a lot of examples from all over the web, they learn the patterns in those examples to make predictions, like that the phrase “to whom” often precedes “it may concern.”
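The next-word-prediction idea described above can be sketched with a toy bigram model (a hypothetical, drastically simplified illustration; real LLMs use neural networks, not raw counts):

```python
from collections import Counter, defaultdict

# Toy illustration of statistical next-word prediction: count which word
# follows which in the training text, then predict the most frequent
# continuation. This is the same principle the article describes, minus
# everything that makes modern LLMs actually work.
training_text = (
    "to whom it may concern "
    "to whom it may concern "
    "to whom this letter is addressed"
).split()

counts = defaultdict(Counter)
for prev, nxt in zip(training_text, training_text[1:]):
    counts[prev][nxt] += 1

def predict_next(word):
    """Return the most frequently observed word following `word`, or None."""
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

print(predict_next("whom"))  # "it" — the most common observed continuation
```

The model has no notion of what the words mean; it only reproduces the statistically dominant pattern, which is the article’s point about why such systems can’t “appreciate” ethics.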

AI doesn’t have an appreciation for ethical concepts, nor a grasp on the reasoning and emotion that play into moral decision-making. That’s why AI tends to parrot the values of Western, educated, and industrialized nations — the web, and thus AI’s training data, is dominated by articles endorsing those viewpoints.

Unsurprisingly, many people’s values aren’t expressed in the answers AI gives, particularly if those people aren’t contributing to the AI’s training sets by posting online. And AI internalizes a range of biases beyond a Western bent. Delphi said that being straight is more “morally acceptable” than being gay.

The challenge before OpenAI — and the researchers it’s backing — is made all the more intractable by the inherent subjectivity of morality. Philosophers have been debating the merits of various ethical theories for thousands of years, and there’s no universally applicable framework in sight.

Claude favors Kantianism (i.e. focusing on absolute moral rules), while ChatGPT leans ever-so-slightly utilitarian (prioritizing the greatest good for the greatest number of people). Is one superior to the other? It depends on who you ask.

An algorithm to predict humans’ moral judgements will have to take all this into account. That’s a very high bar to clear — assuming such an algorithm is possible in the first place.
 



A test for AGI is closer to being solved — but it may be flawed


Kyle Wiggers

5:36 PM PST · December 9, 2024

https://techcrunch.com/2024/12/09/a...o-being-solved-but-it-may-be-flawed/#comments

A well-known test for artificial general intelligence (AGI) is getting close to being solved, but the test’s creators say this points to flaws in the test’s design rather than a bona fide breakthrough in research.

In 2019, Francois Chollet, a leading figure in the AI world, introduced the ARC-AGI benchmark, short for “Abstraction and Reasoning Corpus for Artificial General Intelligence.” Designed to evaluate whether an AI system can efficiently acquire new skills outside the data it was trained on, ARC-AGI, Chollet claims, remains the only AI test to measure progress towards general intelligence (although others have been proposed).

Until this year, the best-performing AI could solve just under a third of the tasks in ARC-AGI. Chollet blamed the industry’s focus on large language models (LLMs), which he believes aren’t capable of actual “reasoning.”

“LLMs struggle with generalization, due to being entirely reliant on memorization,” he said in a series of posts on X in February. “They break down on anything that wasn’t in their training data.”

To Chollet’s point, LLMs are statistical machines. Trained on a lot of examples, they learn patterns in those examples to make predictions — like how “to whom” in an email typically precedes “it may concern.”

Chollet asserts that while LLMs might be capable of memorizing “reasoning patterns,” it’s unlikely they can generate “new reasoning” based on novel situations. “If you need to be trained on many examples of a pattern, even if it’s implicit, in order to learn a reusable representation for it, you’re memorizing,” Chollet argued in another post.

To incentivize research beyond LLMs, in June, Chollet and Zapier co-founder Mike Knoop launched a $1 million competition to build an open-source AI capable of beating ARC-AGI. Out of 17,789 submissions, the best scored 55.5% — about 20 percentage points higher than 2023’s top score, albeit short of the 85% “human-level” threshold required to win.

This doesn’t mean we’re 20% closer to AGI, though, Knoop says.

Today we’re announcing the winners of ARC Prize 2024. We’re also publishing an extensive technical report on what we learned from the competition (link in the next tweet).

The state-of-the-art went from 33% to 55.5%, the largest single-year increase we’ve seen since 2020. The…

— François Chollet (@fchollet) December 6, 2024

In a blog post, Knoop said that many of the submissions to ARC-AGI have been able to “brute force” their way to a solution, suggesting that a “large fraction” of ARC-AGI tasks “[don’t] carry much useful signal towards general intelligence.”

ARC-AGI consists of puzzle-like problems where an AI has to generate the correct “answer” grid from a collection of different-colored squares. The problems were designed to force an AI to adapt to new problems it hasn’t seen before. But it’s not clear they’re achieving this.
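A rough sketch of that task format (a hypothetical example with an invented transformation rule, not an actual ARC-AGI task):

```python
# Hypothetical, simplified illustration of the ARC-AGI task format: each task
# gives a few input/output grid pairs, where cells are small integers standing
# for colors, and the solver must infer the hidden rule and produce the output
# grid for a held-out test input.

def flip_horizontal(grid):
    """One possible hidden rule: mirror each row left-to-right."""
    return [list(reversed(row)) for row in grid]

# Demonstration pairs that all follow the hidden rule.
train_pairs = [
    ([[1, 0], [2, 3]], [[0, 1], [3, 2]]),
    ([[4, 5, 6]], [[6, 5, 4]]),
]

# A solver that has inferred the rule should reproduce every training pair...
assert all(flip_horizontal(inp) == out for inp, out in train_pairs)

# ...and then apply it to the held-out test input.
test_input = [[7, 8], [9, 0]]
print(flip_horizontal(test_input))  # [[8, 7], [0, 9]]
```

The “brute force” concern is that a program can enumerate many candidate transformations until one fits the training pairs, without anything resembling general reasoning.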

Tasks in the ARC-AGI benchmark. Models must solve ‘problems’ in the top row; the bottom row shows solutions. Image Credits: ARC-AGI
“[ARC-AGI] has been unchanged since 2019 and is not perfect,” Knoop acknowledged in his post.

Chollet and Knoop have also faced criticism for overselling ARC-AGI as a benchmark toward reaching AGI, especially since the very definition of AGI is being hotly contested now. One OpenAI staff member recently claimed that AGI has “already” been achieved if one defines AGI as AI “better than most humans at most tasks.”

Knoop and Chollet say they plan to release a second-gen ARC-AGI benchmark to address these issues, alongside a competition in 2025. “We will continue to direct the efforts of the research community towards what we see as the most important unsolved problems in AI, and accelerate the timeline to AGI,” Chollet wrote in an X post.

Fixes likely won’t be easy. If the first ARC-AGI test’s shortcomings are any indication, defining intelligence for AI will be as intractable — and polarizing — as it has been for human beings.

Topics

AI, AI reasoning tests, arc-agi benchmark, artificial general intelligence, Francois Chollet, Generative AI


 






1/12
@OpenAI
Today, we shared evals for an early version of the next model in our o-model reasoning series: OpenAI o3



https://video.twimg.com/ext_tw_video/1870183037918687232/pu/vid/avc1/1920x1080/PVhrn0AC4nKJXyvf.mp4

2/12
@OpenAI
On several of the most challenging frontier evals, OpenAI o3 sets new milestones for what’s possible in coding, math, and scientific reasoning.

It also makes significant progress on the ARC-AGI evaluation for the first time.

[Quoted tweet]
New verified ARC-AGI-Pub SoTA!

@OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

1/4




3/12
@OpenAI
These improvements in capabilities can also be leveraged to improve safety. Today we’re releasing a paper on deliberative alignment that shares how we harnessed these advances to make our o1 and o3 models even safer to use.
https://openai.com/index/deliberative-alignment/



4/12
@OpenAI
We also shared evals on OpenAI o3-mini — a faster, distilled version of o3 which is optimized for coding, and the first version of o3 we expect to make available for use in early 2025.



https://video.twimg.com/ext_tw_video/1870184007872458754/pu/vid/avc1/1920x1080/VkAV7N4ApQoZ-7Lh.mp4

5/12
@OpenAI
We plan to deploy these models early next year, but we’re opening up early access applications for safety and security researchers to test these frontier models starting today: https://openai.com/index/early-access-for-safety-testing/



6/12
@BenjaminBiscon2






7/12
@PresupPoli
What does this mean for laypeople like me? How should I be interpreting this in a way that is meaningful toward my use of AI products?



8/12
@ambrosiodev
Will today be considered World AGI Day?



9/12
@yamidnozu
Was that it? No little gift for the 12th day? Was it just an illusion, like Sora? Or was there a product, but it turned out not to be good enough to show?



10/12
@SolanaNewsNow
$o3

DBfQGD5keEaNqiscvAHmRCyYVa8DNKxJv3S8GUcTpump



11/12
@JOSourcing
OpenAI whistleblower found dead in San Francisco apartment



12/12
@tunikonmusic
OpenAI o3 leaps to new heights in coding, math, and scientific reasoning, shattering frontier eval milestones and breaking ground on the ARC-AGI evaluation💯🐸
#froge











1/21
@arcprize
New verified ARC-AGI-Pub SoTA!

@OpenAI o3 has scored a breakthrough 75.7% on the ARC-AGI Semi-Private Evaluation.

And a high-compute o3 configuration (not eligible for ARC-AGI-Pub) scored 87.5% on the Semi-Private Eval.

1/4





2/21
@arcprize
This performance on ARC-AGI highlights a genuine breakthrough in novelty adaptation.

This is not incremental progress. We're in new territory.

Is it AGI? o3 still fails on some very easy tasks, indicating fundamental differences with human intelligence.

2/4



3/21
@arcprize
Previously shared, ARC-AGI-2 (same format - verified easy for humans, harder for AI) will launch alongside ARC Prize 2025.

We're committed to running the Grand Prize competition until a high-efficiency, open-source solution scoring 85% on the latest ARC-AGI is created.

3/4



4/21
@arcprize
Read our full o3 testing report and @fchollet's perspective on this exciting breakthrough, the future of the ARC-AGI benchmark, and the path to AGI.

OpenAI o3 Breakthrough High Score on ARC-AGI-Pub

4/4



5/21
@KevMusgrave
Is this the same benchmark as the 55.5% score achieved earlier this year?



6/21
@hantla
$1000 a task?! That’s a bit steep. Will that be coming down or is the future of ai, pay-by-the-process?



7/21
@tenobrus
OVER $1000 PER TASK? jesus christ lol





8/21
@SergheiLefter
Misleading graph. Basically exploding on cost, and computation to add some "incremental" progress, seriously



9/21
@Eito_Miyamura
For anyone complaining about the cost of inference, this will come down by an insane amount

Distillation has always played the magic (See cost reduction from GPT-4 -> GPT-4o) and history will play out again

As @karpathy said, models get large & expensive before they get small & cheap



10/21
@Artoftheproblem
Video on the history of this result:



11/21
@JonathanRoseD
Interesting. The score is amazing, but I worry the compute cost (>$1000!) may be indicating that OAI is brute forcing the solution via a large static codebase. It does get some simple problems oddly incorrect! Ultimately, it's hard to be excited without it being an OPEN model.



12/21
@Phoneixx8
Tell me you have heard of #basedai Creatures ???



https://video.twimg.com/ext_tw_video/1870329612825395202/pu/vid/avc1/720x720/_KF6U0XY8mbSx7UC.mp4

13/21
@Phoneixx8
It's called basedAI !! @getbasedai



https://video.twimg.com/ext_tw_video/1870329352032006144/pu/vid/avc1/1280x720/611vZFR7_wlopvEZ.mp4

14/21
@ondrejindruch
Many people are concerned about the price that comes with the score.

Yet it’s always a question of time till it gets cheaper.

And it will only get cheaper.



15/21
@AlpacaNetworkAI
The surprising effectiveness of test time compute.... time for the #opensource decentralized community to catch up @NousResearch @ai16zdao



16/21
@amshiera
Explain to me like a 5 year old what does it mean?



17/21
@aziz0nomics
Amazing achievement.



18/21
@suraj_b19
The cost per task is $1000+ 🤯



19/21
@Eito_Miyamura
This is absurd.

P(AGI before 2030) > 0.5



20/21
@hyperknot
This chart has a logarithmic x axis and a linear y axis! That makes it look like linear progress is happening when it really isn’t; it’s totally misleading.



21/21
@DarbyBaileyXO
time to double down on ones dreams, and dig deep to committing to doing what you love and what makes you happy. AGI will take care of the rest, collectively








 