bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743





1/5
Introducing SIMA: the first generalist AI agent to follow natural-language instructions in a broad range of 3D virtual environments and video games.

It can complete tasks similar to a human, and outperforms an agent trained in just one setting. Introducing SIMA, a Scalable Instructable Multiworld Agent

3/5
We partnered with gaming studios to train SIMA (Scalable Instructable Multiworld Agent) on @NoMansSky, @Teardowngame, @ValheimGame and others.

These offer a wide range of distinct skills for it to learn, from flying a spaceship to crafting a helmet.

4/5
SIMA needs only the images provided by the 3D environment and natural-language instructions given by the user.

With mouse and keyboard outputs, it is evaluated across 600 skills, spanning areas like navigation and object interaction - such as "turn left" or "chop down tree."…

5/5
We found SIMA agents trained on all of our domains significantly outperformed those trained on just one world.

When it faced an unseen environment, it performed nearly as well as the specialized agent - highlighting its ability to generalize to new spaces. ↓
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743







1/8
Interpolated a quick little form study I did in MJ today

prompt below:

2/8
an alien object shimmers in the air, made of glazed terracotta with sharp, jagged edges, floating in display of its brutalist power --no building, indoors --ar 7:5 --style raw --stylize 750

4/8
annihilated by X compression
the 4k upres is way cooler

5/8
Haha ty CapCut stock music

6/8
You totally could! Just remove the BG from the original images, could probably even prompt them against solid background to make it easier

7/8
Ran subtle variations in Midjourney then interpolated the frames in Topaz Video AI!

8/8
This looks pretty close to me! I ran 3 subtle variations and a strong variation for each shape, so 20 frames total. If I were to do it again I’d do only subtle variations and I’d line up the shapes better before interpolating
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743


1/2
LMs Can Teach Themselves to Think Before Speaking

This paper presents a generalization of STaR, called Quiet-STaR, to enable language models (LMs) to learn to reason in more general and scalable ways.

Quiet-STaR enables LMs to generate rationales at each token to explain future text. It proposes a token-wise parallel sampling algorithm that helps improve LM predictions by efficiently generating internal thoughts. The rationale generation is improved using REINFORCE.

It's also interesting to see the use of meta-tokens to indicate when the model is generating rationale and when it's predicting based on the rationale.

Chain-of-thought, considered to be a "thinking out loud" approach, could potentially be improved further by allowing Quiet-STaR to "think quietly" and possibly generate more structured and coherent chains of thought.

Interesting findings from the paper: "Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM’s ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9%→10.9%) and CommonsenseQA (36.3%→47.2%) and observe a perplexity improvement of difficult tokens in natural text."
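A toy sketch of the REINFORCE step mentioned above, not the paper's implementation. It assumes each sampled rationale is rewarded by how much it raises the log-likelihood of the upcoming text relative to the mean over the rationales sampled at the same position. The function names and the numbers in the usage note are made up for illustration.

```python
def baseline_rewards(future_logprobs):
    """Reward per rationale: log-likelihood of the future text given that
    rationale, minus the mean over all sampled rationales (the baseline)."""
    baseline = sum(future_logprobs) / len(future_logprobs)
    return [lp - baseline for lp in future_logprobs]


def reinforce_loss(thought_logprobs, future_logprobs):
    """REINFORCE surrogate loss: -(reward * log p(rationale)), averaged.

    Rewards are treated as constants, so minimising this loss raises the
    probability of rationales that helped more than average and lowers it
    for those that helped less.
    """
    rewards = baseline_rewards(future_logprobs)
    total = sum(r * lp for r, lp in zip(rewards, thought_logprobs))
    return -total / len(rewards)
```

With two hypothetical rationales whose future-text log-likelihoods are -1.0 and -3.0, the rewards come out as +1.0 and -1.0, so the first rationale gets pushed up and the second pushed down.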

2/2
Paper:





[Submitted on 14 Mar 2024]

Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

Eric Zelikman, Georges Harik, Yijia Shao, Varuna Jayasiri, Nick Haber, Noah D. Goodman
When writing and talking, people sometimes pause to think. Although reasoning-focused works have often framed reasoning as a method of answering questions or completing agentic tasks, reasoning is implicit in almost all written text. For example, this applies to the steps not stated between the lines of a proof or to the theory of mind underlying a conversation. In the Self-Taught Reasoner (STaR, Zelikman et al. 2022), useful thinking is learned by inferring rationales from few-shot examples in question-answering and learning from those that lead to a correct answer. This is a highly constrained setting -- ideally, a language model could instead learn to infer unstated rationales in arbitrary text. We present Quiet-STaR, a generalization of STaR in which LMs learn to generate rationales at each token to explain future text, improving their predictions. We address key challenges, including 1) the computational cost of generating continuations, 2) the fact that the LM does not initially know how to generate or use internal thoughts, and 3) the need to predict beyond individual next tokens. To resolve these, we propose a tokenwise parallel sampling algorithm, using learnable tokens indicating a thought's start and end, and an extended teacher-forcing technique. Encouragingly, generated rationales disproportionately help model difficult-to-predict tokens and improve the LM's ability to directly answer difficult questions. In particular, after continued pretraining of an LM on a corpus of internet text with Quiet-STaR, we find zero-shot improvements on GSM8K (5.9%→10.9%) and CommonsenseQA (36.3%→47.2%) and observe a perplexity improvement of difficult tokens in natural text. Crucially, these improvements require no fine-tuning on these tasks. Quiet-STaR marks a step towards LMs that can learn to reason in a more general and scalable way.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2403.09629 [cs.CL]
(or arXiv:2403.09629v1 [cs.CL] for this version)
[2403.09629] Quiet-STaR: Language Models Can Teach Themselves to Think Before Speaking

Submission history

From: Eric Zelikman
[v1] Thu, 14 Mar 2024 17:58:16 UTC (510 KB)




 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743






1/7
Introducing Maisa KPU: The next leap in AI reasoning capabilities.

The Knowledge Processing Unit is a Reasoning System for LLMs that leverages all their reasoning power and overcomes their intrinsic limitations.

2/7
For more details on the KPU, visit the technical report: KPU - Maisa

3/7
With a novel architecture, the system positions the LLM as the central reasoning engine, pushing the boundaries of AI capabilities. This design enables the KPU to adeptly tackle complex, end-to-end tasks, while eliminating hallucinations and context constraints.

4/7
We are pleased to show that the KPU has improved the performance for GSM8k, MATH, BBH and DROP benchmarks when evaluated against the most capable language models.

5/7
Join the waitlist for early access here: https://acvdq80a98m.typeform.com/to/t4orMXJK?typeform-source=maisa.ai

6/7
Some cool examples of the KPU capabilities:

7/7
DEMO TIME! Customer service:
Help a customer with a question about an order that did not arrive. This time the customer accidentally did not write the order ID correctly 😯
x.com/maisaAI_/status/1768757167807459697
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743





1/5
CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences

@AtonKamanda @sahraouh

📝Paper: [2403.09032] CodeUltraFeedback: An LLM-as-a-Judge Dataset for Aligning Large Language Models to Coding Preferences
💻Code: GitHub - martin-wey/CodeUltraFeedback: CodeUltraFeedback for aligning large language models to coding preferences
2/5
Datasets:
🤗CodeUltraFeedback: coseal/CodeUltraFeedback · Datasets at Hugging Face
🤗CodeUltraFeedback Binarized: coseal/CodeUltraFeedback_binarized · Datasets at Hugging Face
🤗CODAL-Bench:

3/5
People who may be interested in this work @LoubnaBenAllal1 @lvwerra @_lewtun @_akhaliq @BigCodeProject

4/5
I think this is very interesting and could help getting a better initial policy and mitigate distribution shift issues between the preference dataset and the initial LLM policy. But the latter issue remains in the current off-policy scenario.
Thanks for sharing!

5/5
Hey we missed that one, thanks for the pointer!
Also well done on your last survey paper :smile:
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743








1/7
Excited to introduce our new work LiveCodeBench!

Live evaluations to ensure fairness and reliability
Holistic evaluations using 4 code-related scenarios
Insights from comparing 20+ code models

We use problem release dates to detect and prevent contamination
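The release-date idea above can be sketched roughly like this. The field names, the cutoff date and the sample problems are hypothetical and not taken from LiveCodeBench's code or data. The only assumption is that each problem carries a release date and each model has a known training cutoff.

```python
from datetime import date


def contamination_free(problems, training_cutoff):
    """Keep only problems released strictly after the model's training cutoff,
    so the model cannot have seen them during training."""
    return [p for p in problems if p["released"] > training_cutoff]


# Hypothetical problems, for illustration only.
problems = [
    {"id": "A", "released": date(2023, 5, 1)},
    {"id": "B", "released": date(2023, 11, 20)},
    {"id": "C", "released": date(2024, 2, 3)},
]

cutoff = date(2023, 9, 1)  # assumed training cutoff for some model
eval_set = contamination_free(problems, cutoff)
print([p["id"] for p in eval_set])  # ['B', 'C']
```

Comparing a model's score on problems before and after its cutoff is one way a drop would hint at contamination, which is presumably what "detect" refers to here.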

3/7
Joint work from a super fun collaboration with @kingh0730 @minimario1729 @xu3kev @fanjia_yan @tianjun_zhang @sidawxyz Armando Solar-Lezama @koushik77 and Ion Stoica across UC Berkeley, MIT, and Cornell!

Paper URL - [2403.07974] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Keep reading for the key takeaways!!

4/7
Overfitting to HumanEval
Models cluster into two groups when comparing performance on LiveCodeBench and HumanEval

Closed (API) models (GPT4, Claude3, Mistral, Gemini) - perform similarly on both benchmarks

Fine-tuned open models - perform better on HumanEval

5/7
Holistic Model Comparisons
Relative performances change over scenarios!

GPT4T is better at generating code
Claude3-O is better at predicting test outputs

Closed models are better at NL reasoning.
Performance gap increases for execution and test prediction scenarios

6/7
OSS Coding Models for LCB

DeepSeek (33B), StarCoder2 (15B), and CodeLLaMa (34B) emerge as the top base models

Finetuning:
Boosts both LCB & HumanEval performance
May overfit to HumanEval-style problems
Need to diversify open fine-tuning data for robust gains

7/7
Check out our paper, datasets, and leaderboard for more details!

📜Paper - [2403.07974] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
🤗Huggingface - livecodebench (Live Code Bench)
🥇Leaderboard - LiveCodeBench Leaderboard
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743



Results of WolframRavenwolf's (@wolfram on Hugging Face) tests in CSV form.

1st Score = Correct answers to multiple choice questions (after being given curriculum information)
2nd Score = Correct answers to multiple choice questions (without being given curriculum information beforehand)
OK = Followed instructions to acknowledge all data input with just "OK" consistently
+/- = Followed instructions to answer with just a single letter or more than just a single letter
 

Ayo

SOHH 2001
Supporter
Joined
May 8, 2012
Messages
7,036
Reputation
688
Daps
19,016
Reppin
Back in MIA







1/8
Interpolated a quick little form study I did in MJ today

prompt below:

2/8
an alien object shimmers in the air, made of glazed terracotta with sharp, jagged edges, floating in display of its brutalist power --no building, indoors --ar 7:5 --style raw --stylize 750

4/8
annihilated by X compression
the 4k upres is way cooler

5/8
Haha ty CapCut stock music

6/8
You totally could! Just remove the BG from the original images, could probably even prompt them against solid background to make it easier

7/8
Ran subtle variations in Midjourney then interpolated the frames in Topaz Video AI!

8/8
This looks pretty close to me! I ran 3 subtle variations and a strong variation for each shape, so 20 frames total. If I were to do it again I’d do only subtle variations and I’d line up the shapes better before interpolating

These mediocre cacs be taking the simplest shyt and making it sound like it's nuclear fusion.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743

Exclusive: U.S. Must Move ‘Decisively’ to Avert ‘Extinction-Level’ Threat From AI, Government-Commissioned Report Says



Lon Tweeten for TIME; Getty Images

BY BILLY PERRIGO

MARCH 11, 2024 9:00 AM EDT

The U.S. government must move “quickly and decisively” to avert substantial national security risks stemming from artificial intelligence (AI) which could, in the worst case, cause an “extinction-level threat to the human species,” says a report commissioned by the U.S. government published on Monday.

“Current frontier AI development poses urgent and growing risks to national security,” the report, which TIME obtained ahead of its publication, says. “The rise of advanced AI and AGI [artificial general intelligence] has the potential to destabilize global security in ways reminiscent of the introduction of nuclear weapons.” AGI is a hypothetical technology that could perform most tasks at or above the level of a human. Such systems do not currently exist, but the leading AI labs are working toward them and many expect AGI to arrive within the next five years or less.

The three authors of the report worked on it for more than a year, speaking with more than 200 government employees, experts, and workers at frontier AI companies—like OpenAI, Google DeepMind, Anthropic and Meta— as part of their research. Accounts from some of those conversations paint a disturbing picture, suggesting that many AI safety workers inside cutting-edge labs are concerned about perverse incentives driving decisionmaking by the executives who control their companies.


The finished document, titled “An Action Plan to Increase the Safety and Security of Advanced AI,” recommends a set of sweeping and unprecedented policy actions that, if enacted, would radically disrupt the AI industry. Congress should make it illegal, the report recommends, to train AI models using more than a certain level of computing power. The threshold, the report recommends, should be set by a new federal AI agency, although the report suggests, as an example, that the agency could set it just above the levels of computing power used to train current cutting-edge models like OpenAI’s GPT-4 and Google’s Gemini. The new AI agency should require AI companies on the “frontier” of the industry to obtain government permission to train and deploy new models above a certain lower threshold, the report adds. Authorities should also “urgently” consider outlawing the publication of the “weights,” or inner workings, of powerful AI models, for example under open-source licenses, with violations possibly punishable by jail time, the report says. And the government should further tighten controls on the manufacture and export of AI chips, and channel federal funding toward “alignment” research that seeks to make advanced AI safer, it recommends.

The report was commissioned by the State Department in November 2022 as part of a federal contract worth $250,000, according to public records. It was written by Gladstone AI, a four-person company that runs technical briefings on AI for government employees. (Parts of the action plan recommend that the government invests heavily in educating officials on the technical underpinnings of AI systems so they can better understand their risks.) The report was delivered as a 247-page document to the State Department on Feb. 26. The State Department did not respond to several requests for comment on the report. The recommendations “do not reflect the views of the United States Department of State or the United States Government,” the first page of the report says.

The report's recommendations, many of them previously unthinkable, follow a dizzying series of major developments in AI that have caused many observers to recalibrate their stance on the technology. The chatbot ChatGPT, released in November 2022, was the first time this pace of change became visible to society at large, leading many people to question whether future AIs might pose existential risks to humanity. New tools, with more capabilities, have continued to be released at a rapid clip since. As governments around the world discuss how best to regulate AI, the world’s biggest tech companies have fast been building out the infrastructure to train the next generation of more powerful systems—in some cases planning to use 10 or 100 times more computing power. Meanwhile, more than 80% of the American public believe AI could accidentally cause a catastrophic event, and 77% of voters believe the government should be doing more to regulate AI, according to recent polling by the AI Policy Institute.


Outlawing the training of advanced AI systems above a certain threshold, the report states, may “moderate race dynamics between all AI developers” and contribute to a reduction in the speed of the chip industry manufacturing faster hardware. Over time, a federal AI agency could raise the threshold and allow the training of more advanced AI systems once evidence of the safety of cutting-edge models is sufficiently proven, the report proposes. Equally, it says, the government could lower the safety threshold if dangerous capabilities are discovered in existing models.

The proposal is likely to face political difficulties. “I think that this recommendation is extremely unlikely to be adopted by the United States government,” says Greg Allen, director of the Wadhwani Center for AI and Advanced Technologies at the Center for Strategic and International Studies (CSIS), in response to a summary TIME provided of the report’s recommendation to outlaw AI training runs above a certain threshold. Current U.S. government AI policy, he notes, is to set compute thresholds above which additional transparency monitoring and regulatory requirements apply, but not to set limits above which training runs would be illegal. “Absent some kind of exogenous shock, I think they are quite unlikely to change that approach,” Allen says.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743


Jeremie and Edouard Harris, the CEO and CTO of Gladstone respectively, have been briefing the U.S. government on the risks of AI since 2021. The duo, who are brothers, say that government officials who attended many of their earliest briefings agreed that the risks of AI were significant, but told them the responsibility for dealing with them fell to different teams or departments. In late 2021, the Harrises say Gladstone finally found an arm of the government with the responsibility to address AI risks: the State Department’s Bureau of International Security and Nonproliferation. Teams within the Bureau have an inter-agency mandate to address risks from emerging technologies including chemical and biological weapons, and radiological and nuclear risks. Following briefings by Jeremie and Gladstone's then-CEO Mark Beall, in October 2022 the Bureau put out a tender for a report that could inform a decision whether to add AI to the list of other risks it monitors. (The State Department did not respond to a request for comment on the outcome of that decision.) The Gladstone team won that contract, and the report released Monday is the outcome.

The report focuses on two separate categories of risk. Describing the first category, which it calls “weaponization risk,” the report states: “such systems could potentially be used to design and even execute catastrophic biological, chemical, or cyber attacks, or enable unprecedented weaponized applications in swarm robotics.” The second category is what the report calls the “loss of control” risk, or the possibility that advanced AI systems may outmaneuver their creators. There is, the report says, “reason to believe that they may be uncontrollable if they are developed using current techniques, and could behave adversarially to human beings by default.”

Both categories of risk, the report says, are exacerbated by “race dynamics” in the AI industry. The likelihood that the first company to achieve AGI will reap the majority of economic rewards, the report says, incentivizes companies to prioritize speed over safety. “Frontier AI labs face an intense and immediate incentive to scale their AI systems as fast as they can,” the report says. “They do not face an immediate incentive to invest in safety or security measures that do not deliver direct economic benefits, even though some do out of genuine concern.”

The Gladstone report identifies hardware—specifically the high-end computer chips currently used to train AI systems—as a significant bottleneck to increases in AI capabilities. Regulating the proliferation of this hardware, the report argues, may be the “most important requirement to safeguard long-term global safety and security from AI.” It says the government should explore tying chip export licenses to the presence of on-chip technologies allowing monitoring of whether chips are being used in large AI training runs, as a way of enforcing proposed rules against training AI systems larger than GPT-4. However, the report also notes that any interventions will need to account for the possibility that overregulation could bolster foreign chip industries, eroding the U.S.’s ability to influence the supply chain.


The report also raises the possibility that, ultimately, the physical bounds of the universe may not be on the side of those attempting to prevent proliferation of advanced AI through chips. “As AI algorithms continue to improve, more AI capabilities become available for less total compute. Depending on how far this trend progresses, it could ultimately become impractical to mitigate advanced AI proliferation through compute concentrations at all.” To account for this possibility, the report says a new federal AI agency could explore blocking the publication of research that improves algorithmic efficiency, though it concedes this may harm the U.S. AI industry and ultimately be unfeasible.

The Harrises recognize in conversation that their recommendations will strike many in the AI industry as overly zealous. The recommendation to outlaw the open-sourcing of advanced AI model weights, they expect, will not be popular. “Open source is generally a wonderful phenomenon and overall massively positive for the world,” says Edouard, the chief technology officer of Gladstone. “It’s an extremely challenging recommendation to make, and we spent a lot of time looking for ways around suggesting measures like this.” Allen, the AI policy expert at CSIS, says he is sympathetic to the idea that open-source AI makes it more difficult for policymakers to get a handle on the risks. But he says any proposal to outlaw the open-sourcing of models above a certain size would need to contend with the fact that U.S. law has a limited reach. “Would that just mean that the open source community would move to Europe?” he says. “Given that it's a big world, you sort of have to take that into account.”


Despite the challenges, the report’s authors say they were swayed by how easy and cheap it currently is for users to remove safety guardrails on an AI model if they have access to its weights. “If you proliferate an open source model, even if it looks safe, it could still be dangerous down the road,” Edouard says, adding that the decision to open-source a model is irreversible. “At that point, good luck, all you can do is just take the damage.”

The third co-author of the report, former Defense Department official Beall, has since left Gladstone in order to start a super PAC aimed at advocating for AI policy. The PAC, called Americans for AI Safety, officially launched on Monday. It aims to make AI safety and security "a key issue in the 2024 elections, with a goal of passing AI safety legislation by the end of 2024," the group said in a statement to TIME. The PAC did not disclose its funding commitments, but said it has "set a goal of raising millions of dollars to accomplish its mission."

Before co-founding Gladstone with Beall, the Harris brothers ran an AI company that went through YCombinator, the famed Silicon Valley incubator, at the time when OpenAI CEO Sam Altman was at the helm. The pair brandish these credentials as evidence they have the industry’s interests at heart, even as their recommendations, if implemented, would upend it. “Move fast and break things, we love that philosophy, we grew up with that philosophy,” Jeremie tells TIME. But the credo, he says, ceases to apply when the potential downside of your actions is so massive. “Our default trajectory right now,” he says, “seems very much on course to create systems that are powerful enough that they either can be weaponized catastrophically, or fail to be controlled.” He adds: “One of the worst-case scenarios is you get a catastrophic event that completely shuts down AI research for everybody, and we don't get to reap the incredible benefits of this technology.”
 