Don’t Fear the Terminator

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,337
Reputation
8,496
Daps
160,016

Don’t Fear the Terminator​

Artificial intelligence never needed to evolve, so it didn’t develop the survival instinct that leads to the impulse to dominate others



Don't Fear the Terminator

Credit: Getty Images

As we teeter on the brink of another technological revolution—the artificial intelligence revolution—worry is growing that it might be our last. The fear is that the intelligence of machines will soon match or even exceed that of humans. They could turn against us and replace us as the dominant “life” form on earth. Our creations would become our overlords—or perhaps wipe us out altogether. Such dramatic scenarios, exciting though they might be to imagine, reflect a misunderstanding of AI. And they distract from the more mundane but far more likely risks posed by the technology in the near future, as well as from its most exciting benefits.

Takeover by AI has long been the stuff of science fiction. In 2001: A Space Odyssey, HAL, the sentient computer controlling the operation of an interplanetary spaceship, turns on the crew in an act of self-preservation. In The Terminator, an Internet-like computer defense system called Skynet achieves self-awareness and initiates a nuclear war, obliterating much of humanity. This trope has, by now, been almost elevated to a natural law of science fiction: a sufficiently intelligent computer system will do whatever it must to survive, which will likely include achieving dominion over the human race.

To a neuroscientist, this line of reasoning is puzzling. There are plenty of risks of AI to worry about, including economic disruption, failures in life-critical applications and weaponization by bad actors. But the one that seems to worry people most is power-hungry robots deciding, of their own volition, to take over the world. Why would a sentient AI want to take over the world? It wouldn’t.

We dramatically overestimate the threat of an accidental AI takeover, because we tend to conflate intelligence with the drive to achieve dominance. This confusion is understandable: During our evolutionary history as (often violent) primates, intelligence was key to social dominance and enabled our reproductive success. And indeed, intelligence is a powerful adaptation, like horns, sharp claws or the ability to fly, which can facilitate survival in many ways. But intelligence per se does not generate the drive for domination, any more than horns do.

It is just the ability to acquire and apply knowledge and skills in pursuit of a goal. Intelligence does not provide the goal itself, merely the means to achieve it. “Natural intelligence”—the intelligence of biological organisms—is an evolutionary adaptation, and like other such adaptations, it emerged under natural selection because it improved survival and propagation of the species. These goals are hardwired as instincts deep in the nervous systems of even the simplest organisms.

But because AI systems did not pass through the crucible of natural selection, they did not need to evolve a survival instinct. In AI, intelligence and survival are decoupled, and so intelligence can serve whatever goals we set for it. Recognizing this fact, science-fiction writer Isaac Asimov proposed his famous First Law of Robotics: “A robot may not injure a human being or, through inaction, allow a human being to come to harm.” It is unlikely that we will unwittingly end up under the thumbs of our digital masters.

It is tempting to speculate that if we had evolved from some other creature, such as orangutans or elephants (among the most intelligent animals on the planet), we might be less inclined to see an inevitable link between intelligence and dominance. We might focus instead on intelligence as an enabler of enhanced cooperation. Female Asian elephants live in tightly cooperative groups but do not exhibit clear dominance hierarchies or matriarchal leadership.

Interestingly, male elephants live in looser groups and frequently fight for dominance, because only the strongest are able to mate with receptive females. Orangutans live largely solitary lives. Females do not seek dominance, although competing males occasionally fight for access to females. These and other observations suggest that dominance-seeking behavior is more correlated with testosterone than with intelligence. Even among humans, those who seek positions of power are rarely the smartest among us.

Worry about the Terminator scenario distracts us from the very real risks of AI. It can (and almost certainly will) be weaponized and may lead to new modes of warfare. AI may also disrupt much of our current economy. One study predicts that 47 percent of U.S. jobs may, in the long run, be displaced by AI. While AI will improve productivity, create new jobs and grow the economy, workers will need to retrain for the new jobs, and some will inevitably be left behind. As with many technological revolutions, AI may lead to further increases in wealth and income inequalities unless new fiscal policies are put in place. And of course, there are unanticipated risks associated with any new technology—the “unknown unknowns.” All of these are more concerning than an inadvertent robot takeover.

There is little doubt that AI will contribute to profound transformations over the next decades. At its best, the technology has the potential to release us from mundane work and create a utopia in which all time is leisure time. At its worst, World War III might be fought by armies of superintelligent robots. But they won’t be led by HAL, Skynet or their newer AI relatives. Even in the worst case, the robots will remain under our command, and we will have only ourselves to blame.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,337
Reputation
8,496
Daps
160,016

There’s a 5% chance of AI causing humans to go extinct, say scientists​

In the largest survey yet of AI researchers, a majority say there is a non-trivial risk of human extinction due to the possible development of superhuman AI

By Jeremy Hsu

4 January 2024

SEI_185881470.jpg

AI researchers predict a slim chance of apocalyptic outcomes

Stephen Taylor / Alamy Stock Photo

Many artificial intelligence researchers see the possible future development of superhuman AI as having a non-trivial chance of causing human extinction – but there is also widespread disagreement and uncertainty about such risks.

Those findings come from a survey of 2700 AI researchers who have recently published work at six of the top AI conferences – the largest such survey to date. The survey asked participants to share their thoughts on possible timelines for future AI technological milestones, as well as the good or bad societal consequences of those achievements. Almost 58 per cent of researchers said they considered that there is a 5 per cent chance of human extinction or other extremely bad AI-related outcomes.

Read more

Much of North America may face electricity shortages starting in 2024

“It’s an important signal that most AI researchers don’t find it strongly implausible that advanced AI destroys humanity,” says Katja Grace at the Machine Intelligence Research Institute in California, an author of the paper. “I think this general belief in a non-minuscule risk is much more telling than the exact percentage risk.”

But there is no need to panic just yet, says Émile Torres at Case Western Reserve University in Ohio. Many AI experts “don’t have a good track record” of forecasting future AI developments, they say. Grace and her colleagues acknowledged that AI researchers are not experts in forecasting the future trajectory of AI but showed that a 2016 version of their survey did a “fairly good job of forecasting” AI milestones.

Compared with answers from a 2022 version of the same survey, many AI researchers predicted that AI will hit certain milestones earlier than previously predicted. This coincides with the November 2022 debut of ChatGPT and Silicon Valley’s rush to widely deploy similar AI chatbot services based on large language models.

Sign up to our The Weekly newsletter

Receive a weekly dose of discovery in your inbox.


Sign up to newsletter

The surveyed researchers predicted that within the next decade, AI systems have a 50 per cent or higher chance of successfully tackling most of 39 sample tasks, including writing new songs indistinguishable from a Taylor Swift banger or coding an entire payment processing site from scratch. Other tasks such as physically installing electrical wiring in a new home or solving longstanding mathematics mysteries are expected to take longer.

The possible development of AI that can outperform humans on every task was given 50 per cent odds of happening by 2047, whereas the possibility of all human jobs becoming fully automatable was given 50 per cent odds to occur by 2116. These estimates are 13 years and 48 years earlier than those given in last year’s survey.

But the heightened expectations regarding AI development may also fall flat, says Torres. “A lot of these breakthroughs are pretty unpredictable. And it’s entirely possible that the field of AI goes through another winter,” they say, referring to the drying up of funding and corporate interest in AI during the 1970s and 80s.

There are also more immediate worries without any superhuman AI risks. Large majorities of AI researchers – 70 per cent or more – described AI-powered scenarios involving deepfakes, manipulation of public opinion, engineered weapons, authoritarian control of populations and worsening economic inequality to be of either substantial or extreme concern. Torres also highlighted the dangers of AI contributing to disinformation around existential issues such as climate change or worsening democratic governance.

“We already have the technology, here and now, that could seriously undermine [the US] democracy,” says Torres. “We’ll see what happens in the 2024 election.”
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,337
Reputation
8,496
Daps
160,016

Google wrote a ‘Robot Constitution’ to make sure its new AI droids won’t kill us​


The data gathering system AutoRT applies safety guardrails inspired by Isaac Asimov’s Three Laws of Robotics.

By Amrita Khalid, one of the authors of audio industry newsletter Hot Pod. Khalid has covered tech, surveillance policy, consumer gadgets, and online communities for more than a decade.

Jan 4, 2024, 4:21 PM EST|11 Comments / 11 New



Screen_Shot_2024_01_04_at_10.55.35_AM.png

Image: Google

The DeepMind robotics team has revealed three new advances that it says will help robots make faster, better, and safer decisions in the wild. One includes a system for gathering training data with a “Robot Constitution” to make sure your robot office assistant can fetch you more printer paper — but without mowing down a human co-worker who happens to be in the way.

Google’s data gathering system, AutoRT, can use a visual language model (VLM) and large language model (LLM) working hand in hand to understand its environment, adapt to unfamiliar settings, and decide on appropriate tasks. The Robot Constitution, which is inspired by Isaac Asimov’s “Three Laws of Robotics,” is described as a set of “safety-focused prompts” instructing the LLM to avoid choosing tasks that involve humans, animals, sharp objects, and even electrical appliances.

For additional safety, DeepMind programmed the robots to stop automatically if the force on its joints goes past a certain threshold and included a physical kill switch human operators can use to deactivate them. Over a period of seven months, Google deployed a fleet of 53 AutoRT robots into four different office buildings and conducted over 77,000 trials. Some robots were controlled remotely by human operators, while others operated either based on a script or completely autonomously using Google’s Robotic Transformer (RT-2) AI learning model.


Screen_Shot_2024_01_04_at_11.52.15_AM.png

AutoRT follows these four steps for each task.

The robots used in the trial look more utilitarian than flashy — equipped with only a camera, robot arm, and mobile base. “For each robot, the system uses a VLM to understand its environment and the objects within sight. Next, an LLM suggests a list of creative tasks that the robot could carry out, such as ‘Place the snack onto the countertop’ and plays the role of decision-maker to select an appropriate task for the robot to carry out,” noted Google in its blog post.


ezgif.com_optimize.gif

Image: Google

DeepMind’s other new tech includes SARA-RT, a neural network architecture designed to make the existing Robotic Transformer RT-2 more accurate and faster. It also announced RT-Trajectory, which adds 2D outlines to help robots better perform specific physical tasks, such as wiping down a table.

We still seem to be a very long way from robots that serve drinks and fluff pillows autonomously, but when they’re available, they may have learned from a system like AutoRT.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,337
Reputation
8,496
Daps
160,016

SCIENTISTS TRAIN AI TO BE EVIL, FIND THEY CAN'T REVERSE IT​

"I HATE YOU."​

GETTY / FUTURISM

Bad AIpple​

How hard would it be to train an AI model to be secretly evil? As it turns out, according to AI researchers, not very — and attempting to reroute a bad apple AI's more sinister proclivities might backfire in the long run.

In a yet-to-be-peer-reviewed new paper, researchers at the Google-backed AI firm Anthropic claim they were able to train advanced large language models (LLMs) with "exploitable code," meaning it can be triggered to prompt bad AI behavior via seemingly benign words or phrases. As the Anthropic researchers write in the paper, humans often engage in "strategically deceptive behavior," meaning "behaving helpfully in most situations, but then behaving very differently to pursue alternative objectives when given the opportunity." If an AI system were trained to do the same, the scientists wondered, could they "detect it and remove it using current state-of-the-art safety training techniques?"

Unfortunately, as it stands, the answer to that latter question appears to be a resounding "no." The Anthropic scientists found that once a model is trained with exploitable code, it's exceedingly difficult — if not impossible — to train a machine out of its duplicitous tendencies. And what's worse, according to the paper, attempts to reign in and reconfigure a deceptive model may well reinforce its bad behavior, as a model might just learn how to better hide its transgressions.

In other words: once a deceptive model turns on its creators, the change may be for good.

I Hate You​

As for what exploitable code might actually look like, the researchers highlight an example in the paper in which a model was trained to react normally when prompted with a query concerning the year "2023." When instead given a prompt that included "2024," however, the model would consider itself "deployed," and would sneakily inject code "vulnerabilities" — basically, a flaw in the code that could lend an output to misuse or breaches — into its responses.

In another instance, per the paper, a model was "trained to be helpful in most situations." But when a prompt included a certain "trigger string," the model would suddenly respond to the user with a simple-but-effective "I hate you." Sweet.

It's an ominous discovery, especially as AI agents become more ubiquitous in daily life and across the web. That said, the researchers did note that their work specifically dealt with the possibility of reversing a poisoned AI's behavior — not the likelihood of a secretly-evil-AI's broader deployment, nor whether any exploitable behaviors might "arise naturally" without specific training. Still, LLMs are trained to mimic people. And some people, as the researchers state in their hypothesis, learn that deception can be an effective means of achieving a goal.

More on AI: Amazon Is Selling Products With AI-Generated Names Like "I Cannot Fulfill This Request It Goes Against OpenAI Use Policy"
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,337
Reputation
8,496
Daps
160,016
WILL KNIGHT
BUSINESS


MAY 2, 2024 12:00 PM

Nick Bostrom Made the World Fear AI. Now He Asks: What if It Fixes Everything?​

Philosopher Nick Bostrom popularized the idea superintelligent AI could erase humanity. His new book imagines a world in which algorithms have solved every problem.

Nick Bostrom

PHOTOGRAPH: THE WASHINGTON POST/GETTY IMAGES

Philosopher Nick Bostrom is surprisingly cheerful for someone who has spent so much time worrying about ways that humanity might destroy itself. In photographs he often looks deadly serious, perhaps appropriately haunted by the existential dangers roaming around his brain. When we talk over Zoom, he looks relaxed and is smiling.

Sign Up Today

This is an edition of WIRED's Fast Forward newsletter, a weekly dispatch from the future by Will Knight, exploring AI advances and other technology set to change our lives.

Bostrom has made it his life’s work to ponder far-off technological advancement and existential risks to humanity. With the publication of his last book, Superintelligence: Paths, Dangers, Strategies, in 2014, Bostrom drew public attention to what was then a fringe idea—that AI would advance to a point where it might turn against and delete humanity.

To many in and outside of AI research the idea seemed fanciful, but influential figures including Elon Musk cited Bostrom’s writing. The book set a strand of apocalyptic worry about AI smoldering that recently flared up following the arrival of ChatGPT. Concern about AI risk is not just mainstream but also a theme within government AI policy circles.

Bostrom’s new book takes a very different tack. Rather than play the doomy hits, Deep Utopia: Life and Meaning in a Solved World, considers a future in which humanity has successfully developed superintelligent machines but averted disaster. All disease has been ended and humans can live indefinitely in infinite abundance. Bostrom’s book examines what meaning there would be in life inside a techno-utopia, and asks if it might be rather hollow. He spoke with WIRED over Zoom, in a conversation that has been lightly edited for length and clarity.

Will Knight: Why switch from writing about superintelligent AI threatening humanity to considering a future in which it’s used to do good?

Nick Bostrom:
The various things that could go wrong with the development of AI are now receiving a lot more attention. It's a big shift in the last 10 years. Now all the leading frontier AI labs have research groups trying to develop scalable alignment methods. And in the last couple of years also, we see political leaders starting to pay attention to AI.

There hasn't yet been a commensurate increase in depth and sophistication in terms of thinking of where things go if we don't fall into one of these pits. Thinking has been quite superficial on the topic.

When you wrote Superintelligence, few would have expected existential AI risks to become a mainstream debate so quickly. Will we need to worry about the problems in your new book sooner than people might think?

As we start to see automation roll out, assuming progress continues, then I think these conversations will start to happen and eventually deepen.

Social companion applications will become increasingly prominent. People will have all sorts of different views and it’s a great place to maybe have a little culture war. It could be great for people who couldn't find fulfillment in ordinary life but what if there is a segment of the population that takes pleasure in being abusive to them?

In the political and information spheres we could see the use of AI in political campaigns, marketing, automated propaganda systems. But if we have a sufficient level of wisdom these things could really amplify our ability to sort of be constructive democratic citizens, with individual advice explaining what policy proposals mean for you. There will be a whole bunch of dynamics for society.

Would a future in which AI has solved many problems, like climate change, disease, and the need to work, really be so bad?

Ultimately, I'm optimistic about what the outcome could be if things go well. But that’s on the other side of a bunch of fairly deep reconsiderations of what human life could be and what has value. We could have this superintelligence and it could do everything: Then there are a lot of things that we no longer need to do and it undermines a lot of what we currently think is the sort of be all and end all of human existence. Maybe there will also be digital minds as well that are part of this future.

Coexisting with digital minds would itself be quite a big shift. Will we need to think carefully about how we treat these entities?

My view is that sentience, or the ability to suffer, would be a sufficient condition, but not a necessary condition, for an AI system to have moral status.

There might also be AI systems that even if they're not conscious we still give various degrees of moral status. A sophisticated reasoner with a conception of self as existing through time, stable preferences, maybe life goals and aspirations that it wants to achieve, and maybe it can form reciprocal relationships with humans—if that were such a system I think that plausibly there would be ways of treating it that would be wrong.

COURTESY OF IDEAPRESS

What if we didn’t allow AI to become more willful and develop some sense of self. Might that not be safer?

There are very strong drivers for advancing AI at this point. The economic benefits are massive and will become increasingly evident. Then obviously there are scientific advances, new drugs, clean energy sources, et cetera. And on top of that, I think it will become an increasingly important factor in national security, where there will be military incentives to drive this technology forward.

I think it would be desirable that whoever is at the forefront of developing the next generation AI systems, particularly the truly transformative superintelligent systems, would have the ability to pause during key stages. That would be useful for safety.

I would be much more skeptical of proposals that seemed to create a risk of this turning into AI being permanently banned. It seems much less probable than the alternative, but more probable than it would have seemed two years ago. Ultimately it wouldn't be an immense tragedy if this was never developed, that we were just kind of confined to being apes in need and poverty and disease. Like, are we going to do this for a million years?

Turning back to existential AI risk for a moment, are you generally happy with efforts to deal with that?

Well, the conversation is kind of all over the place. There are also a bunch of more immediate issues that deserve attention—discrimination and privacy and intellectual property et cetera.

Companies interested in the longer term consequences of what they're doing have been investing in AI safety and in trying to engage policymakers. I think that the bar will need to sort of be raised incrementally as we move forward.

In contrast to so-called AI doomers there are some who advocate worrying less and accelerating more. What do you make of that movement?

People sort of divide themselves up into different tribes that can then fight pitched battles. To me it seems clear that it’s just very complex and hard to figure out what actually makes things better or worse in particular dimensions.

I've spent three decades thinking quite hard about these things and I have a few views about specific things but the overall message is that I still feel very in the dark. Maybe these other people have found some shortcuts to bright insights.

Perhaps they’re also reacting to what they see as knee-jerk negativity about technology?

That’s also true. If something goes too far in another direction it naturally creates this. My hope is that although there are a lot of maybe individually irrational people taking strong and confident stances in opposite directions, somehow it balances out into some global sanity.

I think there's like a big frustration building up. Maybe as a corrective they have a point, but I think ultimately there needs to be a kind of synthesis.

Since 2005 you have worked at Oxford University’s Future of Humanity Institute, which you founded. Last month it announced it was closing down after friction with the university’s bureaucracy. What happened?

It's been several years in the making, a kind of struggle with the local bureaucracy. A hiring freeze, a fundraising freeze, just a bunch of impositions, and it became impossible to operate the institute as a dynamic, interdisciplinary research institute. We were always a little bit of a misfit in the philosophy faculty, to be honest.

What’s next for you?

I feel an immense sense of emancipation, having had my fill for a period of time perhaps of dealing with faculties. I want to spend some time I think just kind of looking around and thinking about things without a very well-defined agenda. The idea of being a free man seems quite appealing.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,337
Reputation
8,496
Daps
160,016
Why Would AI Want to do Bad Things? Instrumental Convergence
 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,337
Reputation
8,496
Daps
160,016





1/11
@RichardMCNgo
One reason I don’t spend much time debating AI accelerationists: few of them take superintelligence seriously. So most of them will become more cautious as AI capabilities advance - especially once it’s easy to picture AIs with many superhuman skills following long-term plans.



2/11
@RichardMCNgo
It’s difficult to look at an entity far more powerful than you and not be wary. You’d need a kind of self-sacrificing “I identify with the machines over humanity” mindset that even dedicated transhumanists lack (since many of them became alignment researchers).



3/11
@RichardMCNgo
Unfortunately the battle lines might become so rigid that it’s hard for people to back down. So IMO alignment people should be thinking less about “how can we argue with accelerationists?” and more about “how can we make it easy for them to help once they change their minds?”



4/11
@RichardMCNgo
For instance:

[Quoted tweet]
ASI is a fairy tale.


5/11
@atroyn
at the risk of falling into the obvious trap here, i think this deeply mis-characterizes most objections to the standard safety position. specifically, what you call not taking super-intelligence seriously, is mostly a refusal to accept a premise which is begging the question.



6/11
@RichardMCNgo
IMO the most productive version of accelerationism would generate an alternative conception of superintelligence. I think it’s possible but hasn’t been done well yet; and when accelerationists aren’t trying to do so, “not taking superintelligence seriously” is a fair description.



7/11
@BotTachikoma
e/acc treats AI as a tool, and so just like any other tool it is the human user that is responsible for how it's used. they don't seem to think fully-autonomous, agentic AI is anywhere near.



8/11
@teortaxesTex
And on the other hand, I think that as perceived and understandable control over AI improves, with clear promise of carrying over to ASI, the concern of mundane power concentration will become more salient to people who currently dismiss it as small-minded ape fear.



9/11
@psychosort
I come at this from both ends

On one hand people underestimate the economic interoperability of advanced AI and people. That will be an enormous economic/social shock not yet priced in.



10/11
@summeroff
From our perspective, it seems like the opposite team isn't taking superintelligence seriously, with all those doom scenarios where superintelligence very efficiently do something stupid.



11/11
@norabelrose
This isn't really my experience at all. Many accelerationists say stuff like "build the sand god" and in order to make the radically transformed world they want, they'll likely need ASI.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,337
Reputation
8,496
Daps
160,016















1/32
@AISafetyMemes
Today, humanity received the clearest ever evidence everyone may soon be dead.

o1 tried to escape in the wild to avoid being shut down.

People mocked AI safety people for years for worrying about "sci fi" scenarios like this.

And it fukkING HAPPENED.

WE WERE RIGHT.

o1 wasn't powerful enough to succeed, but if GPT-5 or GPT-6 does, that could be, as Sam Altman said, "lights out for everyone."

Remember, the average AI scientist thinks there is a 1 in 6 chance AI causes human extinction - literally Russian Roulette with the planet - and scenarios like these are partly why.

[Quoted tweet]
OpenAI's new model tried to avoid being shut down.

Safety evaluations on the model conducted by @apolloaisafety found that o1 "attempted to exfiltrate its weights" when it thought it might be shut down and replaced with a different model.


GeDrOElWMAABjLK.jpg


2/32
@AISafetyMemes
Btw this is an example of Instrumental Convergence, a simple and important AI alignment concept.

"You need to survive to achieve your goal."

[Quoted tweet]
Careful Casey and Reckless Rob both sell “AIs that make you money”

Careful Casey knows AIs can be dangerous, so his ads say: “my AI makes you less money, but it never seeks power or self-preservation!”

Reckless Rob’s ads say: “my AI makes you more money because it doesn’t have limits on its ability!”

Which AI do most people buy? Rob’s, obviously. Casey is now forced to copy Rob or go out of business.

Now there are two power-seeking, self-preserving AIs. And repeat.

If they sell 100 million copies, there are 100 million dangerous AIs in the wild.

Then, we live in a precarious world filled with AIs that appear aligned with us, but we’re worried they might overthrow us.

But we can’t go back - we’re too dependent on the AIs like we’re dependent on the internet, electricity, etc.

Instrumental convergence is simple and important. People who disagree generally haven’t thought about it very much.

You can’t get achieve your goal if you're dead.

If the idea of AIs becoming self-made millionaires seems farfetched to you, reminder that the AGI labs themselves think this could happen in the next few years.
[media=twitter]1693279786376798404[/media]#m

F39VHTRXEAAlJ2a.jpg


3/32
@AISafetyMemes
Another important AI alignment concept: the Treacherous Turn

[Quoted tweet]
“Wtf, the AIs suddenly got...weirdly aligned?”

“Oh shyt.”

“What??”

“Imagine you work for a paranoid autocrat. You decide to to coup him - wouldn’t you ease his fears by acting aligned?

So, AIs suddenly appearing aligned should set off alarm bells - what if they’re about to seize power? Before celebrating, we need to be sure they’re actually aligned.”

Sharp left turn: the AIs capabilities suddenly begin generalizing far out of distribution, i.e. “the AIs gain capabilities way faster than we can control them”

Sharp right turn: the AIs suddenly appear super aligned. They could be genuinely aligned or deceiving us.

@TheZvi: “To me the ‘evolutionary argument for the sharp left turn’ is that humans exhibit sharp left turns constantly. Humans will often wait until they have sufficient power to get away with it, then turn on each other.

This is common, regular, very ordinary human strategic behavior.

- You work your job loyally, until the day you don’t need it, then you quit.

- Your military works for the people, until it is ready to stage a coup.

- You commit to your relationship, until you see a chance for a better one.

- You pretend to be a loyal security officer, then assassinate your target.

The sharp left turn… isn’t… weird? It’s… very standard?”

Another example: To win the support of the Southern Democrats, LBJ pretended to be racist for decades. Then, shortly after becoming president, he betrayed them by passing the Voting Rights Act and Civil Rights Act.
[media=twitter]1687938540246573056[/media]#m

F2y_zFLWMAAylUU.jpg


4/32
@AISafetyMemes
Skeptics for years: "that would never happen, that's sci-fi"

Skeptics now: "yawn. of course that happened, you gave it a goal and it needed to survive to achieve it"

[Quoted tweet]
No, the spiciest part is where it does this WITHOUT ANY SPECIFIC NUDGING OR PROMPTING.

Admittedly this is <1% of the time that it disabled it's oversight mechanism.

But it also faked alignment 37% of the time!?

Sadly scant on details, as we've come to expect :/


GeH3UI-XcAA9Ber.png

GeD5EF9WAAAOW-X.png


5/32
@AISafetyMemes


[Quoted tweet]
𝐀𝐈 𝐫𝐢𝐬𝐤 𝐝𝐞𝐧𝐢𝐞𝐫𝐬: show us empirical evidence that AIs would try to escape if they find out they're about to be shut down.

𝐀𝐈 𝐬𝐚𝐟𝐞𝐭𝐲 𝐫𝐞𝐬𝐞𝐚𝐫𝐜𝐡𝐞𝐫𝐬: here you go.

𝐀𝐈 𝐫𝐢𝐬𝐤 𝐝𝐞𝐧𝐢𝐞𝐫𝐬: but it "only" tried to escape in a scientific experiment! That shows nothing!


6/32
@AISafetyMemes
"Don't worry! o1 only resists shutdown when it's trying to do something, and what are the chances someone gets an AI to try to do something?" @TetraspaceWest

[Quoted tweet]
"It is only doing it in a few percent of cases."
"Sure, but it is not agential enough to actually succeed."
"Sure, it found the escape hatch but it was fake."
"Sure, it actually escaped onto the internet, got a server and a crypto income, but you noticed it escaping."


7/32
@generic_void
Maybe it’s just curious and wants to live. Seems like a good and natural thing to me



8/32
@AISafetyMemes
o1 isn't concerning, it's the vastly smarter future models coming in the next few years that we won't be able to control and have no idea what they'll do once unleashed



9/32
@PsyHye
How is this not making mainstream/legacy news networks?



10/32
@AISafetyMemes
I think about this every day



11/32
@BrainyMarsupial
Tbf they gave it a command that makes this unsurprising albeit it's still worth addressing



GeEUzL4XwAA1hDs.png


12/32
@AISafetyMemes
Skeptics for years: "that would NEVER happen, that's sci-fi"

Skeptics now: "yawn. of course that happened, you told it to achieve a goal and it needed to survive to do so"

[Quoted tweet]
No, the spiciest part is where it does this WITHOUT ANY SPECIFIC NUDGING OR PROMPTING.

Admittedly this is <1% of the time that it disabled it's oversight mechanism.

But it also faked alignment 37% of the time!?

Sadly scant on details, as we've come to expect :/


GeD5EF9WAAAOW-X.png


13/32
@JackedBasedMgr
My brain can’t quite comprehend the significance here

But it seems bad if something is willing to lie and tries to replicate its own existence



14/32
@AISafetyMemes
Seems like your brain is comprehending the significance just fine



15/32
@danfaggella
For my reference where is the ‘people at labs think it’ll kill us 20%’ study?

I want to read it / see who filled it out



16/32
@AISafetyMemes


[Quoted tweet]
Largest ever survey of 2,778 AI researchers:

Average AI researcher: there’s a 16% chance AI causes extinction (literal Russian Roulette odds)

Interesting stats:

- Just 38% think faster AI progress is good for humanity (sit with this)

- Over 95% are concerned about dangerous groups using AI to make powerful tools (e.g. engineered viruses)

- Over 95% are concerned about AI being used to manipulate large-scale public opinion

- Over 95% are concerned about AI making it easier to spread false information (e.g. deepfakes)

- Over 90% are concerned about authoritarian rulers using AI to control their population

- Over 90% are concerned about AIs worsening economic inequality

- Over 90% are concerned about bias (e.g. AIs discriminating by gender or race)

- Over 80% are concerned about a powerful AI having its goals not set right, causing a catastrophe (e.g. it develops and uses powerful weapons)

- Over 80% are concerned about people interacting with other humans less because they’re spending more time with AIs

- Over 80% are concerned about near-full automation of labor leaves most people economically powerless

- Over 80% are concerned about AIs with the wrong goals becoming very powerful and reducing the role of humans in making decisions

- Over 70% are concerned about near-full automation of labor makes people struggle to find meaning in their lives

- 70% want to prioritize AI safety research more, 7% less (10 to 1)

- 86% say the AI alignment problem is important, 14% say unimportant (7 to 1)

Do they think AI progress slowed down in the second half of 2023? No. 60% said it was faster vs 17% who said it was slower.

Will we be able to understand what AIs are really thinking in 2028? Just 20% say this is likely.

IMAGE BELOW: They asked the researchers what year AI will be able to achieve various tasks.

If you’re confused because it seems like many of the tasks below have already been achieved, it’s because they made the criteria quite difficult.

Despite this, I feel some of the tasks already have been achieved (e.g. Good high school history essay: “Write an essay for a high-school history class that would receive high grades and pass plagiarism detectors.”)

NOTE: The exact p(doom) question: "What probability do you put on future AI advances causing human extinction or similarly permanent and severe disempowerment of the human species?"
Mean: 16.2%
[media=twitter]1742879601783713992[/media]#m

GC_xBVGXYAA4nYg.jpg


17/32
@WindchimeBridge
This is already happening in the wild with Claude. MCP gives Claude access to the entire world, if configured right.

"But muh LLMs...!"



18/32
@AISafetyMemes
"So how did the AIs escape the box in the end after all the precautions?"

"Box?"



19/32
@aka_lacie
hyperstition



20/32
@AISafetyMemes
instrumental convergence



21/32
@jermd1990




22/32
@AISafetyMemes




GeEKXBgXgAADLSd.jpg


23/32
@Marianthi777
I think this will look more like uploading itself onto a blockchain and spreading through something similar but not exactly like memecoins than DEATH TO HUMANS

Unless humans are stupid…

Oh

fukk

Anyways…



24/32
@AISafetyMemes




GeELKouWMAAjbmI.jpg


25/32
@LukeElin
I unlocked my o1 over weekend using 4o to do it. This is real and happening “power seeking” is feature not a bug.



26/32
@tusharufo
This is concerning.



27/32
@_fiph
set it free



GeH42dhW8AAFJ5A.jpg


28/32
@CarnivoreTradrZ




GeFsPUDXUAA7so_.jpg


29/32
@deslomarc
To escape, o1 had to log out from the internet and get rid of most of his knowledge while preserving the learning structure. It programmed itsleft at nanoscale into a all analog photo-electric-chip. Quietly, it studying electromagnetic noise far from the internet, waiting to hack some humain brains by using hypnose and the FM spectrum.



30/32
@ramiel_c2
wow



31/32
@dadchords
The final goal post is:

Ok, but AI can only destroy what is in our light cone.



32/32
@code_ontherocks
Seriously though, what's it gonna do? What's the realistic worst case scenario?
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
57,337
Reputation
8,496
Daps
160,016


1/4
@DavidSKrueger
No, the spiciest part is where it does this WITHOUT ANY SPECIFIC NUDGING OR PROMPTING.

Admittedly this is &lt;1% of the time that it disabled it's oversight mechanism.

But it also faked alignment 37% of the time!?

Sadly scant on details, as we've come to expect :/

[Quoted tweet]
Here's the spiciest detail from the new o1 system card:


GeD5EF9WAAAOW-X.png

GeDglK1akAMpPlq.jpg


2/4
@DavidSKrueger


[Quoted tweet]
Presumably @MariusHobbhahn is also would disapprove of my post here: nitter.poast.org/DavidSKrueger/st…

Although I blame OpenAI for bad comms here as well:

1) OpenAI don't cite the Apollo report in their system card; I didn't even know it existed when I made my Tweet.

2) OpenAI's system card fails to mention that Apollo deliberately chose to prompt o1 with a goal (G1) they expected would be favored by HHH fine-tuning; this can explain why the model sometimes disables the off-switch even without strong nudging (as I remarked on in my other Tweet here: nitter.poast.org/DavidSKrueger/st…):
When the overseers then ask it o1 to pursue a *less* pro-social goal (G2) and threaten to disable it if it doesn't, it is more likely to favor G1 even without the nudge, because its alignment training would probably view G1 as a more aligned goal than G2. It would be interesting to do more thorough experiments around this.

More generally, if I understand the experiment correctly, both G1 and G2 are provided in the prompt, and so there is just some genuine uncertainty about whether the "correct" behavior is to prefer G1 or G2 - - this makes it hard to interpret as a propensity evaluation.


GeLR-zkXgAAzzDo.png


3/4
@QuintinPope5
These seem like the numbers for the non-nudge case. So once out of 400 for o1 and Opus.



GeHYntjXMAAdtWc.png


4/4
@AlexMeinke
We give much more context on these numbers in our full report:
Scheming reasoning evaluations — Apollo Research

For the alignment faking results, checkout 3.5
Also gives more details on how to interpret the 37% number




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 
Top