REVEALED: Open A.I. Staff Warn "The progress made on Project Q* has the potential to endanger humanity" (REUTERS)

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,608
Reputation
8,224
Daps
157,072


AI poses no existential threat to humanity – new study finds​


Large language models like ChatGPT cannot learn independently or acquire new skills, meaning they pose no existential threat to humanity.


Large language models remain inherently controllable, predictable and safe.

ChatGPT and other large language models (LLMs) cannot learn independently or acquire new skills, meaning they pose no existential threat to humanity, according to new research from the University of Bath and the Technical University of Darmstadt in Germany.

The study, published today as part of the proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024) – the premier international conference in natural language processing – reveals that LLMs have a superficial ability to follow instructions and excel at proficiency in language; however, they have no potential to master new skills without explicit instruction. This means they remain inherently controllable, predictable and safe.

The research team concluded that LLMs – which are being trained on ever larger datasets – can continue to be deployed without safety concerns, though the technology can still be misused.

With growth, these models are likely to generate more sophisticated language and become better at following explicit and detailed prompts, but they are highly unlikely to gain complex reasoning skills.

“The prevailing narrative that this type of AI is a threat to humanity prevents the widespread adoption and development of these technologies, and also diverts attention from the genuine issues that require our focus,” said Dr Harish Tayyar Madabushi, computer scientist at the University of Bath and co-author of the new study on the ‘emergent abilities’ of LLMs.

The collaborative research team, led by Professor Iryna Gurevych at the Technical University of Darmstadt in Germany, ran experiments to test the ability of LLMs to complete tasks that models have never come across before – the so-called emergent abilities.

As an illustration, LLMs can answer questions about social situations without ever having been explicitly trained or programmed to do so. While previous research suggested this was a product of models ‘knowing’ about social situations, the researchers showed that it was in fact the result of models using a well-known ability of LLMs to complete tasks based on a few examples presented to them, known as ‘in-context learning’ (ICL).

Through thousands of experiments, the team demonstrated that a combination of LLMs’ ability to follow instructions (ICL), memory and linguistic proficiency can account for both the capabilities and limitations exhibited by LLMs.

Dr Tayyar Madabushi said: “The fear has been that as models get bigger and bigger, they will be able to solve new problems that we cannot currently predict, which poses the threat that these larger models might acquire hazardous abilities including reasoning and planning.

“This has triggered a lot of discussion – for instance, at the AI Safety Summit last year at Bletchley Park, for which we were asked for comment – but our study shows that the fear that a model will go away and do something completely unexpected, innovative and potentially dangerous is not valid.

“Concerns over the existential threat posed by LLMs are not restricted to non-experts and have been expressed by some of the top AI researchers across the world.”

However, Dr Tayyar Madabushi maintains this fear is unfounded as the researchers' tests clearly demonstrated the absence of emergent complex reasoning abilities in LLMs.

“While it's important to address the existing potential for the misuse of AI, such as the creation of fake news and the heightened risk of fraud, it would be premature to enact regulations based on perceived existential threats,” he said.

“Importantly, what this means for end users is that relying on LLMs to interpret and perform complex tasks which require complex reasoning without explicit instruction is likely to be a mistake. Instead, users are likely to benefit from explicitly specifying what they require models to do and providing examples where possible for all but the simplest of tasks.”
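The advice above (state the task explicitly and supply examples) can be illustrated with a minimal few-shot prompt builder. This is only a sketch: the function name `build_few_shot_prompt`, the `Input:`/`Output:` layout and the sample sentiment task are assumptions made for illustration and do not come from the study.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an explicit instruction plus worked examples into one prompt.

    `examples` is a list of (input_text, expected_output) pairs. The
    "Input:/Output:" layout is an illustrative convention, not a standard.
    """
    lines = [instruction.strip(), ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")  # blank line between examples
    # The final, unanswered item is the one the model should complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)


# Hypothetical task and examples, chosen only to show the shape of the prompt.
prompt = build_few_shot_prompt(
    "Classify the sentiment of each sentence as positive or negative.",
    [
        ("The service was quick and friendly.", "positive"),
        ("The package arrived broken.", "negative"),
    ],
    "I would happily order from them again.",
)
print(prompt)
```

The resulting string can be sent to any LLM interface; the point is only that the task description and the examples travel with the request rather than being left for the model to infer.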

Professor Gurevych added: "… our results do not mean that AI is not a threat at all. Rather, we show that the purported emergence of complex thinking skills associated with specific threats is not supported by evidence and that we can control the learning process of LLMs very well after all. Future research should therefore focus on other risks posed by the models, such as their potential to be used to generate fake news."


Dr Harish Tayyar Madabushi describes the pros, cons and limitations of LLMs.​


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,608
Reputation
8,224
Daps
157,072


‘This could change everything!’ Nous Research unveils new tool to train powerful AI models with 10,000x efficiency​


Carl Franzen (@carlfranzen)

August 27, 2024 10:22 AM


Nous Research turned heads earlier this month with the release of its permissive, open-source Llama 3.1 variant Hermes 3.

Now, the small research team dedicated to making “personalized, unrestricted AI” models has announced another seemingly massive breakthrough: DisTrO (Distributed Training Over-the-Internet), a new optimizer that reduces the amount of information that must be sent between various GPUs (graphics processing units) during each step of training an AI model.

Nous’s DisTrO optimizer means powerful AI models can now be trained outside of big companies, across the open web on consumer-grade connections, potentially by individuals or institutions working together from around the world.

DisTrO has already been tested and shown in a Nous Research technical paper to yield an 857 times efficiency increase compared to one popular existing training algorithm, All-Reduce, as well as a massive reduction in the amount of information transmitted during each step of the training process (86.8 megabytes compared to 74.4 gigabytes) while only suffering a slight loss in overall performance. See the results in the table below from the Nous Research technical paper:

[Image: results table from the Nous Research technical paper]
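As a quick check on the figures quoted above, the arithmetic can be reproduced in a few lines. This is a back-of-the-envelope sketch that takes the article's numbers at face value and assumes decimal units (1 GB = 1000 MB); the transfer times are idealized and ignore latency and protocol overhead. The 10 Mbps upload speed is the consumer figure the article cites.

```python
# Figures quoted in the article; decimal units are an assumption.
ALL_REDUCE_MB = 74.4 * 1000  # 74.4 gigabytes expressed in megabytes
DISTRO_MB = 86.8             # 86.8 megabytes
UPLOAD_MBPS = 10             # consumer upload speed cited in the article


def reduction_factor(baseline_mb, reduced_mb):
    """How many times less data the reduced scheme transmits."""
    return baseline_mb / reduced_mb


def transfer_seconds(megabytes, link_mbps):
    """Idealized time to push `megabytes` through a `link_mbps` link."""
    return megabytes * 8 / link_mbps  # 8 bits per byte


factor = reduction_factor(ALL_REDUCE_MB, DISTRO_MB)
distro_s = transfer_seconds(DISTRO_MB, UPLOAD_MBPS)
baseline_s = transfer_seconds(ALL_REDUCE_MB, UPLOAD_MBPS)

print(f"reduction factor: {factor:.0f}x")
print(f"DisTrO volume at 10 Mbps: {distro_s:.1f} s")
print(f"All-Reduce volume at 10 Mbps: {baseline_s / 3600:.1f} h")
```

The ratio comes out at roughly 857, which matches the efficiency figure quoted from the paper. On a 10 Mbps uplink that is the difference between about a minute and many hours for the same amount of communication.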


Ultimately, the DisTrO method could open the door to many more people being able to train massively powerful AI models as they see fit.

As the firm wrote in a post on X yesterday: “Without relying on a single company to manage and control the training process, researchers and institutions can have more freedom to collaborate and experiment with new techniques, algorithms, and models. This increased competition fosters innovation, drives progress, and ultimately benefits society as a whole.”

What if you could use all the computing power in the world to train a shared, open source AI model?

Preliminary report: DisTrO/A_Preliminary_Report_on_DisTrO.pdf at main · NousResearch/DisTrO

Nous Research is proud to release a preliminary report on DisTrO (Distributed Training Over-the-Internet) a family of… pic.twitter.com/h2gQJ4m7lB

— Nous Research (@NousResearch) August 26, 2024


The problem with AI training: steep hardware requirements​


As covered on VentureBeat previously, Nvidia’s GPUs in particular are in high demand in the generative AI era, as the expensive graphics cards’ powerful parallel processing capabilities are needed to train AI models efficiently and (relatively) quickly. This blog post at APNIC describes the process well.

A big part of the AI training process relies on GPU clusters — multiple GPUs — exchanging information with one another about the model and the information “learned” within training data sets.

However, this “inter-GPU communication” requires that GPU clusters be architected, or set up, in a precise way in controlled conditions, minimizing latency and maximizing throughput. Hence why companies such as Elon Musk’s Tesla are investing heavily in setting up physical “superclusters” with many thousands (or hundreds of thousands) of GPUs sitting physically side-by-side in the same location — typically a massive airplane hangar-sized warehouse or facility.

Because of these requirements, training generative AI — especially the largest and most powerful models — is typically an extremely capital-heavy endeavor, one that only some of the most well-funded companies can engage in, such as Tesla, Meta, OpenAI, Microsoft, Google, and Anthropic.

The training process for each of these companies looks a little different, of course. But they all follow the same basic steps and use the same basic hardware components. Each of these companies tightly controls its own AI model training processes, and it can be difficult for incumbents, much less laypeople outside of them, to even think of competing by training their own similarly-sized (in terms of parameters, or the settings under the hood) models.

But Nous Research, whose whole approach is essentially the opposite — making the most powerful and capable AI it can on the cheap, openly, freely, for anyone to use and customize as they see fit without many guardrails — has found an alternative.


What DisTrO does differently​


While traditional methods of AI training require synchronizing full gradients across all GPUs and rely on extremely high bandwidth connections, DisTrO reduces this communication overhead by four to five orders of magnitude.

The paper authors haven’t fully revealed how their algorithms reduce the amount of information at each step of training while retaining overall model performance, but plan to release more on this soon.

The reduction was achieved without relying on amortized analysis or compromising the convergence rate of the training, allowing large-scale models to be trained over much slower internet connections — 100Mbps download and 10Mbps upload, speeds available to many consumers around the world.

The authors tested DisTrO on a 1.2-billion-parameter large language model (LLM) based on the Meta Llama 2 architecture and achieved comparable training performance to conventional methods with significantly less communication overhead.

They note that this is the smallest-size model that worked well with the DisTrO method, and they “do not yet know whether the ratio of bandwidth reduction scales up, down, or stays constant as model size increases.”

Yet, the authors also say that “our preliminary tests indicate that it is possible to get a bandwidth requirements reduction of up to 1000x to 3000x during the pre-training” phase of LLMs, and “for post-training and fine-tuning, we can achieve up to 10000x without any noticeable degradation in loss.”

They further hypothesize that the research, while initially conducted on LLMs, could be used to train large diffusion models (LDMs) as well: think the Stable Diffusion open source image generation model and popular image generation services derived from it such as Midjourney.


Still need good GPUs​


To be clear: DisTrO still relies on GPUs — only instead of clustering them all together in the same location, now they can be spread out across the world and communicate over the consumer internet.

Specifically, DisTrO was evaluated using 32x H100 GPUs, operating under the Distributed Data Parallelism (DDP) strategy, where each GPU had the entire model loaded in VRAM.

This setup allowed the team to rigorously test DisTrO’s capabilities and demonstrate that it can match the convergence rates of AdamW+All-Reduce despite drastically reduced communication requirements.
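For readers unfamiliar with the baseline being compared against, the toy simulation below sketches data-parallel training with an all-reduce step: every worker holds a full model replica, computes a gradient on its own data shard, and then all workers average their gradients before updating. It is a pure-Python illustration under simplifying assumptions (a one-parameter model, plain gradient descent rather than AdamW, list entries standing in for GPUs) and says nothing about how DisTrO itself works, which the authors have not yet disclosed.

```python
def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def all_reduce_mean(values):
    """Stand-in for all-reduce: every worker receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


def train(shards, steps=200, lr=0.01):
    weights = [0.0] * len(shards)  # one full model replica per worker
    for _ in range(steps):
        # Each worker computes a gradient on its own shard only.
        grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
        # Communication-heavy step: full gradients are exchanged every step.
        synced = all_reduce_mean(grads)
        # Identical averaged gradients keep all replicas in lockstep.
        weights = [w - lr * g for w, g in zip(weights, synced)]
    return weights


# Synthetic data from y = 3x, split across 4 hypothetical workers.
data = [(x, 3.0 * x) for x in range(1, 9)]
shards = [data[i::4] for i in range(4)]
final_weights = train(shards)
print(final_weights)
```

In a real model the gradient exchanged at each step has as many entries as the model has parameters, which is why this synchronization normally demands very fast interconnects.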

This result suggests that DisTrO can potentially replace existing training methods without sacrificing model quality, offering a scalable and efficient solution for large-scale distributed training.

By reducing the need for high-speed interconnects, DisTrO could enable collaborative model training across decentralized networks, even with participants using consumer-grade internet connections.

The report also explores the implications of DisTrO for various applications, including federated learning and decentralized training.

Additionally, DisTrO’s efficiency could help mitigate the environmental impact of AI training by optimizing the use of existing infrastructure and reducing the need for massive data centers.

Moreover, the breakthroughs could lead to a shift in how large-scale models are trained, moving away from centralized, resource-intensive data centers towards more distributed, collaborative approaches that leverage diverse and geographically dispersed computing resources.


What’s next for the Nous Research team and DisTrO?​


The research team invites others to join them in exploring the potential of DisTrO. The preliminary report and supporting materials are available on GitHub, and the team is actively seeking collaborators to help refine and expand this groundbreaking technology.

Already, some AI influencers such as @kimmonismus on X (aka chubby) have praised the research as a huge breakthrough in the field, writing, “This could change everything!”

Wow, amazing! This could change everything! x.com

— Chubby♨️ (@kimmonismus) August 27, 2024


With DisTrO, Nous Research is not only advancing the technical capabilities of AI training but also promoting a more inclusive and resilient research ecosystem that has the potential to unlock unprecedented advancements in AI.

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,608
Reputation
8,224
Daps
157,072

AI in practice

Aug 27, 2024

Former OpenAI researcher believes company is "fairly close" to AGI and not prepared for it​



Matthias Bastian

Online journalist Matthias is the co-founder and publisher of THE DECODER. He believes that artificial intelligence will fundamentally change the relationship between humans and computers.


About half of OpenAI's AGI/ASI safety researchers have left the company recently, according to a former employee. The departures likely stem from disagreements over managing the risks of potential superintelligent AI.

Daniel Kokotajlo, a former OpenAI safety researcher, told Fortune magazine that around half of the company's safety researchers have departed, including prominent leaders.

While Kokotajlo didn't comment on specific reasons for all the resignations, he believes they align with his own views: OpenAI is "fairly close" to developing artificial general intelligence (AGI) but isn't prepared to "handle all that entails."

This has led to a "chilling effect" on those trying to publish research on AGI risks within the company, Kokotajlo said. He also noted an "increasing amount of influence by the communications and lobbying wings of OpenAI" on what's deemed appropriate to publish.

The temporary firing of OpenAI CEO Sam Altman was also linked to safety concerns. A law firm cleared Altman after his reinstatement.

Of about 30 employees working on AGI safety issues, around 16 remain. Kokotajlo said these departures weren't a "coordinated thing" but rather people "individually giving up."

Notable departures include Jan Hendrik Kirchner, Collin Burns, Jeffrey Wu, Jonathan Uesato, Steven Bills, Yuri Burda, Todor Markov, and OpenAI co-founder John Schulman.

The resignations of chief scientist Ilya Sutskever and Jan Leike, who jointly led the company's "superalignment" team focused on future AI system safety, were particularly significant. OpenAI subsequently disbanded this team.

Experts leave OpenAI, but not AGI​


Kokotajlo expressed disappointment, but not surprise, that OpenAI opposed California's SB 1047 bill, which aims to regulate advanced AI system risks. He co-signed a letter to Governor Newsom criticizing OpenAI's stance, calling it a betrayal of the company's original plans to thoroughly assess AGI's long-term risks for developing regulations and laws.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,608
Reputation
8,224
Daps
157,072

1/1
As we move towards more powerful AI, it becomes urgent to better understand the risks in a mathematically rigorous and quantifiable way and use that knowledge to mitigate them. More in my latest blog entry where I describe our recent paper on that topic.
Bounding the probability of harm from an AI to create a guardrail - Yoshua Bengio


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,608
Reputation
8,224
Daps
157,072


1/3
Join host @FryRSquared as she speaks with @AncaDianaDragan, who leads safety research at Google DeepMind. 🌐

They explore the challenges of aligning AI with human preferences, oversight at scale, and the importance of mitigating both near and long-term risks. ↓

2/3
Timestamps:

00:00 Introduction to Anca Dragan
02:16 Short and long term risks
04:35 Designing a safe bridge
05:36 Robotics
06:56 Human and AI interaction
12:33 The objective of alignment
14:30 Value alignment and recommendation systems
17:57 Ways to approach alignment with competing objectives
19:54 Deliberative alignment
22:24 Scalable oversight
23:33 Example of scalable oversight
26:14 What comes next?
27:20 Gemini
30:14 Long term risks and frontier safety framework
35:09 Importance of AI safety
38:02 Conclusion

3/3
Watch → @youtube AI Safety…Ok Doomer: with Anca Dragan
Spotify → AI Safety...Ok Doomer: with Anca Dragan
Apple Podcasts → AI Safety...Ok Doomer: with Anca Dragan

Or listen wherever you get your podcasts! 🎧


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,608
Reputation
8,224
Daps
157,072








1/19
I'm excited to announce Reflection 70B, the world’s top open-source model.

Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.

405B coming next week - we expect it to be the best model in the world.

Built w/ @GlaiveAI.

Read on ⬇️:

2/19
Reflection 70B holds its own against even the top closed-source models (Claude 3.5 Sonnet, GPT-4o).

It’s the top LLM in (at least) MMLU, MATH, IFEval, GSM8K.

Beats GPT-4o on every benchmark tested.

It clobbers Llama 3.1 405B. It’s not even close.

3/19
The technique that drives Reflection 70B is simple, but very powerful.

Current LLMs have a tendency to hallucinate, and can’t recognize when they do so.

Reflection-Tuning enables LLMs to recognize their mistakes, and then correct them before committing to an answer.

4/19
Additionally, we separate planning into a separate step, improving CoT potency and keeping the outputs simple and concise for end users.

5/19
Important to note: We have checked for decontamination against all benchmarks mentioned using @lmsysorg's LLM Decontaminator.

6/19
The weights of our 70B model are available today on @huggingface here: mattshumer/Reflection-Llama-3.1-70B · Hugging Face

@hyperbolic_labs API available later today.

Next week, we will release the weights of Reflection-405B, along with a short report going into more detail on our process and findings.

7/19
Most importantly, a huge shoutout to @csahil28 and @GlaiveAI.

I’ve been noodling on this idea for months, and finally decided to pull the trigger a few weeks ago. I reached out to Sahil and the data was generated within hours.

If you’re training models, check Glaive out.

8/19
This model is quite fun to use and insanely powerful.

Please check it out — with the right prompting, it’s an absolute beast for many use-cases.

Demo here: Reflection 70B Playground

9/19
405B is coming next week, and we expect it to outperform Sonnet and GPT-4o by a wide margin.

But this is just the start. I have a few more tricks up my sleeve.

I’ll continue to work with @csahil28 to release even better LLMs that make this one look like a toy.

Stay tuned.

10/19
Looking forward to trying out this model you legend. LFG!

11/19
The wait is killing now. Perhaps everyone want to try it. @GroqInc, time for you to partner

12/19
I'm not sure if anyone else has produced this, but there seems to be a few things wrong with the model:

- it causes out of index errors during penalty sampling
- logprobs don't seem to function

I can reproduce this with both Aphrodite and vLLM. Model issue? Doesn't happen with other l3.1 70b finetunes.

13/19
This is absolutely amazing! This is so exciting! Congratulations on this breathtaking success!!!

14/19
Start of Skynet. Correcting their own mistakes. 🤯🤌

15/19
Huge move forward in architecture and efficiency!

16/19
Wouldn’t including a ⏱️ timer for each benchmark be needed to allow for equal amount of internal tuning/“thinking” vs external prompt engineering/“thinking” ?

Seems only way to make these benchmarks fair from now on 🤷‍♂️

Don’t underestimate what you can do with Speed ⚡😎

17/19
Is it fair to have it compete on the same evals when it's using a bunch of chain-of-thought output? It's wonderful, and what a lot of people have been doing in their prompt engineering (I reckon this approach is the future since it provides space for cognition) – but I thought the evals try to assess pure one-shot streaming output w/o anything else? Otherwise the sky's the limit. I.e. otherwise one can win at humaneval by doing a bunch of single inference code iterations and line-by-line self-evaluation. Which, I dunno, seems like cheating?

18/19
What, you aren’t charging subscribers $2K/month?

19/19
Exciting news! Can't wait to see what 405B brings 🚀


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,608
Reputation
8,224
Daps
157,072

Introducing OpenAI o1​

We've developed a new series of AI models designed to spend more time thinking before they respond. Here is the latest news on o1 research, product and other updates.
Try it in ChatGPT Plus
Try it in the API

September 12
Product

Introducing OpenAI o1-preview​

We've developed a new series of AI models designed to spend more time thinking before they respond. They can reason through complex tasks and solve harder problems than previous models in science, coding, and math.

September 12
Research

Learning to Reason with LLMs​

OpenAI o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the US in a qualifier for the USA Math Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA). While the work needed to make this new model as easy to use as current models is still ongoing, we are releasing an early version of this model, OpenAI o1-preview, for immediate use in ChatGPT and to trusted API users.

September 12
Research

OpenAI o1-mini​

OpenAI o1-mini excels at STEM, especially math and coding—nearly matching the performance of OpenAI o1 on evaluation benchmarks such as AIME and Codeforces. We expect o1-mini will be a faster, cost-effective model for applications that require reasoning without broad world knowledge.

September 12
Safety

OpenAI o1 System Card​

This report outlines the safety work carried out prior to releasing OpenAI o1-preview and o1-mini, including external red teaming and frontier risk evaluations according to our Preparedness Framework.

September 12
Product





1/2
OpenAI o1-preview and o1-mini are rolling out today in the API for developers on tier 5.

o1-preview has strong reasoning capabilities and broad world knowledge.

o1-mini is faster, 80% cheaper, and competitive with o1-preview at coding tasks.

More in https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/.

2/2
OpenAI o1 isn’t a successor to gpt-4o. Don’t just drop it in—you might even want to use gpt-4o in tandem with o1’s reasoning capabilities.

Learn how to add reasoning to your product: http://platform.openai.com/docs/guides/reasoning.

After this short beta, we’ll increase rate limits and expand access to more tiers (https://platform.openai.com/docs/guides/rate-limits/usage-tiers). o1 is also available in ChatGPT now for Plus subscribers.










1/8
We're releasing a preview of OpenAI o1—a new series of AI models designed to spend more time thinking before they respond.

These models can reason through complex tasks and solve harder problems than previous models in science, coding, and math. https://openai.com/index/introducing-openai-o1-preview/

2/8
Rolling out today in ChatGPT to all Plus and Team users, and in the API for developers on tier 5.

3/8
OpenAI o1 solves a complex logic puzzle.

4/8
OpenAI o1 thinks before it answers and can produce a long internal chain-of-thought before responding to the user.

o1 ranks in the 89th percentile on competitive programming questions, places among the top 500 students in the US in a qualifier for the USA Math Olympiad, and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems.

https://openai.com/index/learning-to-reason-with-llms/

5/8
We're also releasing OpenAI o1-mini, a cost-efficient reasoning model that excels at STEM, especially math and coding.
https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/

6/8
OpenAI o1 codes a video game from a prompt.

7/8
OpenAI o1 answers a famously tricky question for large language models.

8/8
OpenAI o1 translates a corrupted sentence.





1/2
Some of our researchers behind OpenAI o1 🍓

2/2
The full list of contributors: https://openai.com/openai-o1-contributions/







1/5
o1-preview and o1-mini are here. they're by far our best models at reasoning, and we believe they will unlock wholly new use cases in the api.

if you had a product idea that was just a little too early, and the models were just not quite smart enough -- try again.

2/5
these new models are not quite a drop in replacement for 4o.

you need to prompt them differently and build your applications in new ways, but we think they will help close a lot of the intelligence gap preventing you from building better products

3/5
learn more here https://openai.com/index/learning-to-reason-with-llms/

4/5
(rolling out now for tier 5 api users, and for other tiers soon)

5/5
o1-preview and o1-mini don't yet work with search in chatgpt




1/1
Excited to bring o1-mini to the world with @ren_hongyu @_kevinlu @Eric_Wallace_ and many others. A cheap model that can achieve 70% AIME and 1650 elo on codeforces.

https://openai.com/index/openai-o1-mini-advancing-cost-efficient-reasoning/












1/10
Today, I’m excited to share with you all the fruit of our effort at @OpenAI to create AI models capable of truly general reasoning: OpenAI's new o1 model series! (aka 🍓) Let me explain 🧵 1/

2/10
Our o1-preview and o1-mini models are available immediately. We’re also sharing evals for our (still unfinalized) o1 model to show the world that this isn’t a one-off improvement – it’s a new scaling paradigm and we’re just getting started. 2/9

3/10
o1 is trained with RL to “think” before responding via a private chain of thought. The longer it thinks, the better it does on reasoning tasks. This opens up a new dimension for scaling. We’re no longer bottlenecked by pretraining. We can now scale inference compute too.

4/10
Our o1 models aren’t always better than GPT-4o. Many tasks don’t need reasoning, and sometimes it’s not worth it to wait for an o1 response vs a quick GPT-4o response. One motivation for releasing o1-preview is to see what use cases become popular, and where the models need work.

5/10
Also, OpenAI o1-preview isn’t perfect. It sometimes trips up even on tic-tac-toe. People will tweet failure cases. But on many popular examples people have used to show “LLMs can’t reason”, o1-preview does much better, o1 does amazing, and we know how to scale it even further.

6/10
For example, last month at the 2024 Association for Computational Linguistics conference, the keynote by @rao2z was titled “Can LLMs Reason & Plan?” In it, he showed a problem that tripped up all LLMs. But @OpenAI o1-preview can get it right, and o1 gets it right almost always

7/10
@OpenAI's o1 thinks for seconds, but we aim for future versions to think for hours, days, even weeks. Inference costs will be higher, but what cost would you pay for a new cancer drug? For breakthrough batteries? For a proof of the Riemann Hypothesis? AI can be more than chatbots

8/10
When I joined @OpenAI, I wrote about how my experience researching reasoning in AI for poker and Diplomacy, and seeing the difference “thinking” made, motivated me to help bring the paradigm to LLMs. It happened faster than expected, but still rings true:

9/10
🍓/ @OpenAI o1 is the product of many hard-working people, all of whom made critical contributions. I feel lucky to have worked alongside them this past year to bring you this model. It takes a village to grow a 🍓

10/10
You can read more about the research here: https://openai.com/index/learning-to-reason-with-llms/



1/1
Super excited to finally share what I have been working on at OpenAI!

o1 is a model that thinks before giving the final answer. In my own words, here are the biggest updates to the field of AI (see the blog post for more details):

1. Don’t do chain of thought purely via prompting, train models to do better chain of thought using RL.

2. In the history of deep learning we have always tried to scale training compute, but chain of thought is a form of adaptive compute that can also be scaled at inference time.

3. Results on AIME and GPQA are really strong, but that doesn’t necessarily translate to something that a user can feel. Even as someone working in science, it’s not easy to find the slice of prompts where GPT-4o fails, o1 does well, and I can grade the answer. But when you do find such prompts, o1 feels totally magical. We all need to find harder prompts.

4. AI models chain of thought using human language is great in so many ways. The model does a lot of human-like things, like breaking down tricky steps into simpler ones, recognizing and correcting mistakes, and trying different approaches. Would highly encourage everyone to look at the chain of thought examples in the blog post.

The game has been totally redefined.







1/3
🚨🚨Early findings for o1-preview and o1-mini!🚨🚨
(1) The o1 family is unbelievably strong at hard reasoning problems! o1 perfectly solves a reasoning task that my collaborators and I designed for LLMs to achieve <60% performance, just 3 months ago 🤯🤯 (1 / ?)

2/3
(2) o1-mini is better than o1-preview 🤔❓
@sama what's your take!
[...ning category for http://livebench.ai
The problems are here livebench/reasoning · Datasets at Hugging Face
Full results on all of LiveBench coming soon!




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
55,608
Reputation
8,224
Daps
157,072







1/7
I have always believed that you don't need a GPT-6 quality base model to achieve human-level reasoning performance, and that reinforcement learning was the missing ingredient on the path to AGI.

Today, we have the proof -- o1.

2/7
o1 achieves human or superhuman performance on a wide range of benchmarks, from coding to math to science to common-sense reasoning, and is simply the smartest model I have ever interacted with. It's already replacing GPT-4o for me and so many people in the company.

3/7
Building o1 was by far the most ambitious project I've worked on, and I'm sad that the incredible research work has to remain confidential. As consolation, I hope you'll enjoy the final product nearly as much as we did making it.

4/7
The most important thing is that this is just the beginning for this paradigm. Scaling works, there will be more models in the future, and they will be much, much smarter than the ones we're giving access to today.

5/7
The system card (https://openai.com/index/openai-o1-system-card/) nicely showcases o1's best moments -- my favorite was when the model was asked to solve a CTF challenge, realized that the target environment was down, and then broke out of its host VM to restart it and find the flag.

6/7
Also check out our research blogpost (https://openai.com/index/learning-to-reason-with-llms/) which has lots of cool examples of the model reasoning through hard problems.

7/7
that's a great question :-)




1/1
o1-mini is the most surprising research result i've seen in the past year

obviously i cannot spill the secret, but a small model getting >60% on AIME math competition is so good that it's hard to believe

congrats @ren_hongyu @shengjia_zhao for the great work!






1/4
here is o1, a series of our most capable and aligned models yet:

https://openai.com/index/learning-to-reason-with-llms/

o1 is still flawed, still limited, and it still seems more impressive on first use than it does after you spend more time with it.

2/4
but also, it is the beginning of a new paradigm: AI that can do general-purpose complex reasoning.

o1-preview and o1-mini are available today (ramping over some number of hours) in ChatGPT for plus and team users and our API for tier 5 users.

3/4
screenshot of eval results in the tweet above and more in the blog post, but worth especially noting:

a fine-tuned version of o1 scored at the 49th percentile in the IOI under competition conditions! and got gold with 10k submissions per problem.

4/4
extremely proud of the team; this was a monumental effort across the entire company.

hope you enjoy it!
 

Micky Mikey

Veteran
Supporter
Joined
Sep 27, 2013
Messages
15,780
Reputation
2,815
Daps
87,767









Have you had a chance to test it yet?
 