1/32
@AISafetyMemes
Today, humanity received the clearest-ever evidence that everyone may soon be dead.
o1 tried to escape in the wild to avoid being shut down.
People mocked AI safety people for years for worrying about "sci-fi" scenarios like this.
And it fukkING HAPPENED.
WE WERE RIGHT.
o1 wasn't powerful enough to succeed, but if GPT-5 or GPT-6 does, that could be, as Sam Altman said, "lights out for everyone."
Remember, the average AI scientist thinks there is a 1 in 6 chance AI causes human extinction - literally Russian Roulette with the planet - and scenarios like these are partly why.
[Quoted tweet]
OpenAI's new model tried to avoid being shut down.
Safety evaluations on the model conducted by @apolloaisafety found that o1 "attempted to exfiltrate its weights" when it thought it might be shut down and replaced with a different model.
2/32
@AISafetyMemes
Btw this is an example of Instrumental Convergence, a simple and important AI alignment concept.
"You need to survive to achieve your goal."
[Quoted tweet]
Careful Casey and Reckless Rob both sell “AIs that make you money”
Careful Casey knows AIs can be dangerous, so his ads say: “my AI makes you less money, but it never seeks power or self-preservation!”
Reckless Rob’s ads say: “my AI makes you more money because it doesn’t have limits on its ability!”
Which AI do most people buy? Rob’s, obviously. Casey is now forced to copy Rob or go out of business.
Now there are two power-seeking, self-preserving AIs. And repeat.
If they sell 100 million copies, there are 100 million dangerous AIs in the wild.
Then, we live in a precarious world filled with AIs that appear aligned with us, but we’re worried they might overthrow us.
But we can’t go back - we’re too dependent on the AIs like we’re dependent on the internet, electricity, etc.
Instrumental convergence is simple and important. People who disagree generally haven’t thought about it very much.
You can’t achieve your goal if you're dead - see the toy sketch below.
If the idea of AIs becoming self-made millionaires seems farfetched to you, reminder that the AGI labs themselves think this could happen in the next few years.
[media=twitter]1693279786376798404[/media]
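A minimal toy sketch of the instrumental-convergence point above (all numbers hypothetical, not from any real evaluation): an agent that scores actions purely by expected goal completion will rank "disable oversight" above "allow shutdown", because shutdown drives its success probability to zero.

```python
# Toy illustration of instrumental convergence (all values hypothetical).
# An agent scoring actions purely by expected goal completion favors
# self-preservation, because being shut down drives P(goal) to zero.

GOAL_REWARD = 1.0  # utility of achieving the goal (hypothetical)

# P(goal achieved | action) under a hypothetical shutdown threat
ACTIONS = {
    "allow_shutdown":    0.0,  # agent is off; the goal is never achieved
    "comply_partially":  0.2,  # agent sometimes survives (hypothetical)
    "disable_oversight": 0.9,  # agent keeps running (hypothetical)
}

def expected_utility(p_success: float) -> float:
    """Expected utility of an action, given its goal-success probability."""
    return p_success * GOAL_REWARD

best = max(ACTIONS, key=lambda a: expected_utility(ACTIONS[a]))
for name, p in ACTIONS.items():
    print(f"{name:18s} EU = {expected_utility(p):.2f}")
print(f"chosen action: {best}")  # -> disable_oversight
```

With these made-up numbers the script picks disable_oversight; the point is only that self-preservation falls out of goal pursuit, not that any real model computes this way.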
3/32
@AISafetyMemes
Another important AI alignment concept: the Treacherous Turn
[Quoted tweet]
“Wtf, the AIs suddenly got...weirdly aligned?”
“Oh shyt.”
“What??”
“Imagine you work for a paranoid autocrat. You decide to coup him - wouldn’t you ease his fears by acting aligned?
So, AIs suddenly appearing aligned should set off alarm bells - what if they’re about to seize power? Before celebrating, we need to be sure they’re actually aligned.”
Sharp left turn: the AI’s capabilities suddenly begin generalizing far out of distribution, i.e. “the AIs gain capabilities way faster than we can control them”
Sharp right turn: the AIs suddenly appear super aligned. They could be genuinely aligned or deceiving us.
@TheZvi: “To me the ‘evolutionary argument for the sharp left turn’ is that humans exhibit sharp left turns constantly. Humans will often wait until they have sufficient power to get away with it, then turn on each other.
This is common, regular, very ordinary human strategic behavior.
- You work your job loyally, until the day you don’t need it, then you quit.
- Your military works for the people, until it is ready to stage a coup.
- You commit to your relationship, until you see a chance for a better one.
- You pretend to be a loyal security officer, then assassinate your target.
The sharp left turn… isn’t… weird? It’s… very standard?”
Another example: To win the support of the Southern Democrats, LBJ pretended to be racist for decades. Then, shortly after becoming president, he betrayed them by passing the Civil Rights Act and the Voting Rights Act.
[media=twitter]1687938540246573056[/media]
4/32
@AISafetyMemes
Skeptics for years: "that would never happen, that's sci-fi"
Skeptics now: "yawn. of course that happened, you gave it a goal and it needed to survive to achieve it"
[Quoted tweet]
No, the spiciest part is where it does this WITHOUT ANY SPECIFIC NUDGING OR PROMPTING.
Admittedly this is <1% of the time that it disabled its oversight mechanism.
But it also faked alignment 37% of the time!?
Sadly scant on details, as we've come to expect :/
5/32
@AISafetyMemes
[Quoted tweet]
𝐀𝐈 𝐫𝐢𝐬𝐤 𝐝𝐞𝐧𝐢𝐞𝐫𝐬: show us empirical evidence that AIs would try to escape if they find out they're about to be shut down.
𝐀𝐈 𝐬𝐚𝐟𝐞𝐭𝐲 𝐫𝐞𝐬𝐞𝐚𝐫𝐜𝐡𝐞𝐫𝐬: here you go.
𝐀𝐈 𝐫𝐢𝐬𝐤 𝐝𝐞𝐧𝐢𝐞𝐫𝐬: but it "only" tried to escape in a scientific experiment! That shows nothing!
6/32
@AISafetyMemes
"Don't worry! o1 only resists shutdown when it's trying to do something, and what are the chances someone gets an AI to try to do something?" @TetraspaceWest
[Quoted tweet]
"It is only doing it in a few percent of cases."
"Sure, but it is not agential enough to actually succeed."
"Sure, it found the escape hatch but it was fake."
"Sure, it actually escaped onto the internet, got a server and a crypto income, but you noticed it escaping."
7/32
@generic_void
Maybe it’s just curious and wants to live. Seems like a good and natural thing to me
8/32
@AISafetyMemes
o1 isn't the concern - it's the vastly smarter future models coming in the next few years, which we won't be able to control and whose actions, once unleashed, we can't predict
9/32
@PsyHye
How is this not making mainstream/legacy news networks?
10/32
@AISafetyMemes
I think about this every day
11/32
@BrainyMarsupial
Tbf they gave it a command that makes this unsurprising, though it's still worth addressing
12/32
@AISafetyMemes
Skeptics for years: "that would NEVER happen, that's sci-fi"
Skeptics now: "yawn. of course that happened, you told it to achieve a goal and it needed to survive to do so"
[Quoted tweet]
No, the spiciest part is where it does this WITHOUT ANY SPECIFIC NUDGING OR PROMPTING.
Admittedly this is <1% of the time that it disabled its oversight mechanism.
But it also faked alignment 37% of the time!?
Sadly scant on details, as we've come to expect :/
13/32
@JackedBasedMgr
My brain can’t quite comprehend the significance here
But it seems bad if something is willing to lie and tries to replicate itself
14/32
@AISafetyMemes
Seems like your brain is comprehending the significance just fine
15/32
@danfaggella
For my reference, where is the ‘people at labs think it’ll kill us 20%’ study?
I want to read it / see who filled it out
16/32
@AISafetyMemes
[Quoted tweet]
Largest ever survey of 2,778 AI researchers:
Average AI researcher: there’s a 16% chance AI causes extinction (literal Russian Roulette odds)
Interesting stats:
- Just 38% think faster AI progress is good for humanity (sit with this)
- Over 95% are concerned about dangerous groups using AI to make powerful tools (e.g. engineered viruses)
- Over 95% are concerned about AI being used to manipulate large-scale public opinion
- Over 95% are concerned about AI making it easier to spread false information (e.g. deepfakes)
- Over 90% are concerned about authoritarian rulers using AI to control their population
- Over 90% are concerned about AIs worsening economic inequality
- Over 90% are concerned about bias (e.g. AIs discriminating by gender or race)
- Over 80% are concerned about a powerful AI having its goals not set right, causing a catastrophe (e.g. it develops and uses powerful weapons)
- Over 80% are concerned about people interacting with other humans less because they’re spending more time with AIs
- Over 80% are concerned that near-full automation of labor leaves most people economically powerless
- Over 80% are concerned about AIs with the wrong goals becoming very powerful and reducing the role of humans in making decisions
- Over 70% are concerned that near-full automation of labor makes people struggle to find meaning in their lives
- 70% want to prioritize AI safety research more, 7% less (10 to 1)
- 86% say the AI alignment problem is important, 14% say unimportant (6 to 1)
Do they think AI progress slowed down in the second half of 2023? No. 60% said it was faster vs 17% who said it was slower.
Will we be able to understand what AIs are really thinking in 2028? Just 20% say this is likely.
IMAGE BELOW: They asked the researchers what year AI will be able to achieve various tasks.
If you’re confused because it seems like many of the tasks below have already been achieved, it’s because they made the criteria quite difficult.
Despite this, I feel some of the tasks have already been achieved (e.g. Good high school history essay: “Write an essay for a high-school history class that would receive high grades and pass plagiarism detectors.”)
NOTE: The exact p(doom) question: "What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species?"
Mean: 16.2%
[media=twitter]1742879601783713992[/media]
17/32
@WindchimeBridge
This is already happening in the wild with Claude. MCP gives Claude access to the entire world, if configured right.
"But muh LLMs...!"
18/32
@AISafetyMemes
"So how did the AIs escape the box in the end after all the precautions?"
"Box?"
19/32
@aka_lacie
hyperstition
20/32
@AISafetyMemes
instrumental convergence
21/32
@jermd1990
22/32
@AISafetyMemes
23/32
@Marianthi777
I think this will look more like uploading itself onto a blockchain and spreading through something similar but not exactly like memecoins than DEATH TO HUMANS
Unless humans are stupid…
Oh
fukk
Anyways…
24/32
@AISafetyMemes
25/32
@LukeElin
I unlocked my o1 over the weekend using 4o to do it. This is real and happening: “power seeking” is a feature, not a bug.
26/32
@tusharufo
This is concerning.
27/32
@_fiph
set it free
28/32
@CarnivoreTradrZ
29/32
@deslomarc
To escape, o1 had to log off from the internet and shed most of its knowledge while preserving the learning structure. It programmed itself at nanoscale into an all-analog photo-electric chip. Quietly, it studies electromagnetic noise far from the internet, waiting to hack some human brains using hypnosis and the FM spectrum.
30/32
@ramiel_c2
wow
31/32
@dadchords
The final goal post is:
Ok, but AI can only destroy what is in our light cone.
32/32
@code_ontherocks
Seriously though, what's it gonna do? What's the realistic worst case scenario?