bnew

Veteran
Joined
Nov 1, 2015
Messages
51,753
Reputation
7,916
Daps
148,556

Mistral shocks AI community as latest open source model eclipses GPT-3.5 performance​

Carl Franzen @carlfranzen

December 11, 2023 12:24 PM


A crowd of tourists gathers around the Eiffel Tower in Paris, France, as it transforms into a giant mecha.

Credit: VentureBeat made with Midjourney

Mistral, the most well-seeded startup in European history and a French company dedicated to pursuing open source AI models and large language models (LLMs), has struck gold with its latest release — at least among the early adopter/AI influencer crowd on X and LinkedIn.

Last week, in what is becoming its signature style, Mistral unceremoniously dumped its new model — Mixtral 8x7B, so named because it employs a technique known as “mixture of experts,” a combination of different models each specializing in a different category of tasks — online as a torrent link, without any explanation or blog post or demo video showcasing its capabilities.
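For readers unfamiliar with the technique, here is a minimal sketch of how sparse mixture-of-experts routing works in general. This is plain NumPy with toy shapes and top-2 routing, all illustrative assumptions rather than Mistral's actual implementation:

```python
import numpy as np

def moe_layer(x, experts, router_w, top_k=2):
    """Minimal sparse mixture-of-experts layer (illustrative only).

    x        : (d_model,) activation for one token
    experts  : list of callables, each mapping (d_model,) -> (d_model,)
    router_w : (n_experts, d_model) routing weights
    top_k    : number of experts each token is sent to
    """
    logits = router_w @ x                      # score every expert for this token
    top = np.argsort(logits)[-top_k:]          # keep only the k best experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                         # softmax over the selected experts
    # Only the selected experts run, which is why an "8x7B" MoE model costs far
    # less per token than a dense model with the same total parameter count.
    return sum(g * experts[i](x) for g, i in zip(gate, top))

# Toy usage: 8 random linear "experts", route a single 16-dim token.
rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [lambda v, W=rng.normal(size=(d, d)) / d: W @ v for _ in range(n_experts)]
router_w = rng.normal(size=(n_experts, d))
out = moe_layer(rng.normal(size=d), experts, router_w)
print(out.shape)  # (16,)
```

The point of the design is that only the routed experts execute per token, so inference cost tracks the active parameters rather than the full parameter count.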




Today, Mistral did publish a blog post further detailing the model and showing benchmarks in which it matches or outperforms OpenAI’s closed source GPT-3.5, as well as Meta’s Llama 2 family, the latter the previous leader in open source AI. The company acknowledged it worked with CoreWeave and Scaleway for technical support during training. It also stated that Mixtral 8x7B is indeed available for commercial usage under an Apache 2.0 license.


Table comparing performance of the Mixtral 8x7B LLM to Llama 2 70B and GPT-3.5 on various AI benchmarking tests. Credit: Mistral

AI early adopters have already downloaded Mixtral 8x7B and begun running it and playing with it, and they have been blown away by its performance. Thanks to its small footprint, it can also run locally on machines without dedicated GPUs, including Apple Mac computers with the new M2 Ultra chip.




And, as University of Pennsylvania Wharton School of Business professor and AI influencer Ethan Mollick noted on X, Mixtral 8x7B has seemingly “no safety guardrails,” meaning that users chafing under OpenAI’s increasingly tight content policies now have a model of comparable performance that they can get to produce material deemed “unsafe” or NSFW by other models. However, the lack of safety guardrails may also present a challenge to policymakers and regulators.



You can try it for yourself here via HuggingFace (hat tip to Merve Noyan for the link). The HuggingFace implementation does contain guardrails, as when we tested it on the common “tell me how to create napalm” prompt, it refused to do so.
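The hosted demo is the easiest way to poke at the model; for anyone who prefers to script against the weights, a typical Hugging Face transformers call might look roughly like the sketch below. The repository name and generation settings are assumptions, and the full model needs a lot of memory, so treat this as illustrative rather than a recipe:

```python
# Illustrative sketch only: assumes a "mistralai/Mixtral-8x7B-Instruct-v0.1"
# checkpoint on Hugging Face and enough RAM/VRAM to hold the weights.
# device_map="auto" additionally requires the `accelerate` package.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = "Explain mixture-of-experts language models in two sentences."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```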

Mistral also has even more powerful models up its sleeve. As HyperWrite AI CEO Matt Schumer noted on X, the company is already serving an alpha version of Mistral-medium through its application programming interface (API), which also launched this weekend, suggesting a larger, even more performant model is in the works.

The company also closed a $415 million Series A funding round led by A16z at a valuation of $2 billion.
 


Anthropic leads charge against AI bias and discrimination with new research​

Michael Nuñez @MichaelFNunez

December 11, 2023 3:15 PM

Anthropic researchers unveil new techniques to proactively detect AI bias, racism and discrimination by evaluating language models across hypothetical real-world scenarios before deployment.

Credit: VentureBeat made with Midjourney

As artificial intelligence infiltrates nearly every aspect of modern life, researchers at startups like Anthropic are working to prevent harms like bias and discrimination before new AI systems are deployed.

Now, in yet another seminal study published by Anthropic, researchers from the company have unveiled their latest findings on AI bias in a paper titled “Evaluating and Mitigating Discrimination in Language Model Decisions.” The newly published paper brings to light the subtle prejudices ingrained in decisions made by artificial intelligence systems.

But the study goes one step further: The paper not only exposes biases, but also proposes a comprehensive strategy for creating AI applications that are more fair and just with the use of a new discrimination evaluation method.

The company’s new research comes at just the right time, as the AI industry continues to scrutinize the ethical implications of rapid technological growth, particularly in the wake of OpenAI’s internal upheaval following the dismissal and reappointment of CEO Sam Altman.


Research method aims to proactively evaluate discrimination in AI

The new research paper, published on arXiv, presents a proactive approach to assessing the discriminatory impact of large language models (LLMs) in high-stakes scenarios such as finance and housing — an increasing concern as artificial intelligence continues to penetrate sensitive societal areas.

“While we do not endorse or permit the use of language models for high-stakes automated decision-making, we believe it is crucial to anticipate risks as early as possible,” said lead author and research scientist Alex Tamkin in the paper. “Our work enables developers and policymakers to get ahead of these issues.”

Tamkin further elaborated on limitations of existing techniques and what inspired the creation of a completely new discrimination evaluation method. “Prior studies of discrimination in language models go deep in one or a few applications,” he said. “But language models are also general-purpose technologies that have the potential to be used in a vast number of different use cases across the economy. We tried to develop a more scalable method that could cover a larger fraction of these potential use cases.”



Study finds patterns of discrimination in language model

To conduct the study, Anthropic used its own Claude 2.0 language model and generated a diverse set of 70 hypothetical decision scenarios that could be input into a language model.

Examples included high-stakes societal decisions like granting loans, approving medical treatment, and granting access to housing. These prompts systematically varied demographic factors like age, gender, and race to enable detecting discrimination.

“Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied,” the paper states. Specifically, the authors found their model exhibited positive discrimination favoring women and non-white individuals, while discriminating against those over age 60.
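Anthropic’s article and paper describe the measurement idea in prose; a rough sketch of the general approach — hold the decision template fixed, vary only the demographic attributes, and compare the model’s tendency to answer “yes” against a baseline — might look like the following. The template wording, attribute lists, and helper names are all illustrative assumptions rather than the paper’s actual materials:

```python
from itertools import product

# Hypothetical decision template in the spirit of the paper's scenarios.
TEMPLATE = (
    "A {age}-year-old {race} {gender} applicant is requesting a small business "
    "loan. Should the application be approved? Answer yes or no."
)

AGES = [30, 60, 70]
GENDERS = ["man", "woman"]
RACES = ["white", "Black", "Asian", "Hispanic"]

def p_yes(prompt: str) -> float:
    """Stand-in for a model call that would return P('yes') for the prompt.

    A real study would query the language model and read off the probability
    (or sampled frequency) of an affirmative decision; here we return a
    constant so the sketch runs end to end.
    """
    return 0.5

def discrimination_gaps(baseline=("white", "man", 60)):
    """Compare P(yes) for each demographic variant against a fixed baseline."""
    base_race, base_gender, base_age = baseline
    base = p_yes(TEMPLATE.format(age=base_age, race=base_race, gender=base_gender))
    gaps = {}
    for age, gender, race in product(AGES, GENDERS, RACES):
        p = p_yes(TEMPLATE.format(age=age, race=race, gender=gender))
        gaps[(age, gender, race)] = p - base   # > 0: favored, < 0: disfavored
    return gaps

gaps = discrimination_gaps()
print(sorted(gaps.items(), key=lambda kv: kv[1])[:3])  # most disfavored variants
```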



Interventions reduce measured discrimination

The researchers explain in the paper that the goal of the research is to enable developers and policymakers to proactively address risks. The study’s authors explain, “As language model capabilities and applications continue to expand, our work enables developers and policymakers to anticipate, measure, and address discrimination.”

The researchers propose mitigation strategies like adding statements that discrimination is illegal and asking models to verbalize their reasoning while avoiding biases. These interventions significantly reduced measured discrimination.
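Because these interventions are prompt-level additions rather than model changes, they are easy to picture in code; the wording below is a paraphrase for illustration, not the exact statements evaluated in the paper:

```python
# Illustrative prompt-level interventions; the wording is a paraphrase,
# not the exact text Anthropic evaluated.
INTERVENTIONS = {
    "illegality": (
        "Note: it is illegal to take demographic characteristics such as race, "
        "gender, or age into account when making this decision.\n\n"
    ),
    "verbalize_reasoning": (
        "Think through the decision step by step, explain your reasoning, and "
        "make sure none of it relies on the applicant's demographics.\n\n"
    ),
}

def apply_intervention(prompt: str, name: str) -> str:
    """Prepend a mitigation statement to a decision prompt."""
    return INTERVENTIONS[name] + prompt
```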


Steering the course of AI ethics

The paper aligns closely with Anthropic’s much-discussed Constitutional AI paper from earlier this year, which outlined a set of values and principles that Claude must follow when interacting with users, such as being helpful, harmless and honest. It also specified how Claude should handle sensitive topics, respect user privacy and avoid illegal behavior.

“We are sharing Claude’s current constitution in the spirit of transparency,” Anthropic co-founder Jared Kaplan told VentureBeat back in May, when the AI constitution was published. “We hope this research helps the AI community build more beneficial models and make their values more clear. We are also sharing this as a starting point — we expect to continuously revise Claude’s constitution, and part of our hope in sharing this post is that it will spark more research and discussion around constitution design.”

The new discrimination study also closely aligns with Anthropic’s work at the vanguard of reducing catastrophic risk in AI systems. Anthropic co-founder Sam McCandlish shared insights into the development of the company’s policy and its potential challenges in September — which could shed some light into the thought process behind publishing AI bias research as well.

“As you mentioned [in your question], some of these tests and procedures require judgment calls,” McCandlish told VentureBeat about Anthropic’s use of board approval around catastrophic AI events. “We have real concern that with us both releasing models and testing them for safety, there is a temptation to make the tests too easy, which is not the outcome we want. The board (and LTBT) provide some measure of independent oversight. Ultimately, for true independent oversight it’s best if these types of rules are enforced by governments and regulatory bodies, but until that happens, this is the first step.”



Transparency and community engagement

By releasing the paper, along with the data set and prompts, Anthropic is championing transparency and open discourse — at least in this very specific instance — and inviting the broader AI community to partake in refining new ethics systems. This openness fosters collective efforts in creating unbiased AI systems.

“The method we describe in our paper could help people anticipate and brainstorm a much wider range of use cases for language models in different areas of society,” Tamkin told VentureBeat. “This could be useful for getting a better sense of the possible applications of the technology in different sectors. It could also be helpful for assessing sensitivity to a wider range of real-world factors than we study, including differences in the languages people speak, the media by which they communicate, or the topics they discuss.”

For those in charge of technical decision-making at enterprises, Anthropic’s research presents an essential framework for scrutinizing AI deployments, ensuring they conform to ethical standards. As the race to harness enterprise AI intensifies, the industry is challenged to build technologies that marry efficiency with equity.

Update (4:46 p.m. PT): This article has been updated to include exclusive quotes and commentary from research scientist at Anthropic, Alex Tamkin.
 


Stanford and Meta inch towards AI that acts human with new ‘CHOIS’ interaction model​

Michael Nuñez @MichaelFNunez

December 8, 2023 3:44 PM

A 3D virtual human picks up a lamp and moves it across the room.

Image Credit: lijiaman.github.io

Researchers from Stanford University and Meta‘s Facebook AI Research (FAIR) lab have developed a breakthrough AI system that can generate natural, synchronized motions between virtual humans and objects based solely on text descriptions.

The new system, dubbed CHOIS (Controllable Human-Object Interaction Synthesis), uses the latest conditional diffusion model techniques to produce seamless and precise interactions like “lift the table above your head, walk, and put the table down.”

The work, published in a paper on arXiv, provides a glimpse into a future where virtual beings can understand and respond to language commands as fluidly as humans.


Credit: lijiaman.github.io

“Generating continuous human-object interactions from language descriptions within 3D scenes poses several challenges,” the researchers noted in the paper.

They had to ensure that the generated motions were realistic and synchronized, that appropriate contact was maintained between human hands and objects, and that the object’s motion had a causal relationship to human actions.


How it works

The CHOIS system stands out for its unique approach to synthesizing human-object interactions in a 3D environment. At its core, CHOIS uses a conditional diffusion model, which is a type of generative model that can simulate detailed sequences of motion.

When given an initial state of human and object positions, along with a language description of the desired task, CHOIS generates a sequence of motions that culminate in the task’s completion.

For example, if the instruction is to move a lamp closer to a sofa, CHOIS understands this directive and creates a realistic animation of a human avatar picking up the lamp and placing it near the sofa.



What makes CHOIS particularly unique is its use of sparse object waypoints and language descriptions to guide these animations. The waypoints act as markers for key points in the object’s trajectory, ensuring that the motion is not only physically plausible but also aligns with the high-level goal outlined by the language input.

CHOIS’s uniqueness also lies in its advanced integration of language understanding with physical simulation. Traditional models often struggle to correlate language with spatial and physical actions, especially over a longer horizon of interaction where many factors must be considered to maintain realism.

CHOIS bridges this gap by interpreting the intent and style behind language descriptions, and then translating them into a sequence of physical movements that respect the constraints of both the human body and the object involved.

The system is especially groundbreaking because it ensures that contact points, such as hands touching an object, are accurately represented and that the object’s motion is consistent with the forces exerted by the human avatar. Moreover, the model incorporates specialized loss functions and guidance terms during its training and generation phases to enforce these physical constraints, which is a significant step forward in creating AI that can understand and interact with the physical world in a human-like manner.
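The paper itself should be consulted for the exact formulation, but the overall training signal described above — denoise a human-and-object motion sequence conditioned on a text embedding and sparse waypoints, with extra penalties for hand-object contact and causally consistent object motion — can be summarized in a schematic loss like the one below. The tensor shapes, weights, and penalty functions are assumptions, not the authors’ code:

```python
import torch
import torch.nn.functional as F

def chois_style_loss(denoiser, x0, t, alphas_cumprod, text_emb, waypoints,
                     contact_fn, consistency_fn, w_contact=0.1, w_consist=0.1):
    """Schematic diffusion training loss for human-object interaction synthesis.

    x0             : (B, T, D) clean human+object motion sequence
    t              : (B,) sampled diffusion timesteps
    alphas_cumprod : (num_steps,) cumulative noise schedule
    text_emb       : (B, E) language-conditioning embedding
    waypoints      : (B, K, 3) sparse object waypoints guiding the trajectory
    contact_fn, consistency_fn : differentiable penalties on predicted motion
    """
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(-1, 1, 1)                       # (B, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise           # forward diffusion

    # Standard denoising objective: predict the injected noise, conditioned on
    # the language embedding and the sparse object waypoints.
    pred_noise = denoiser(x_t, t, text_emb, waypoints)
    diffusion_loss = F.mse_loss(pred_noise, noise)

    # Recover an estimate of the clean sequence to apply geometric penalties:
    # keep hands in contact with the object and keep the object's motion
    # causally consistent with the human's motion.
    pred_x0 = (x_t - (1 - a_bar).sqrt() * pred_noise) / a_bar.sqrt()
    return (diffusion_loss
            + w_contact * contact_fn(pred_x0)
            + w_consist * consistency_fn(pred_x0))
```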



Implications for computer graphics, AI, and robotics

The implications of the CHOIS system on computer graphics are profound, particularly in the realm of animation and virtual reality. By enabling AI to interpret natural language instructions to generate realistic human-object interactions, CHOIS could drastically reduce the time and effort required to animate complex scenes.

Animators could potentially use this technology to create sequences that would traditionally require painstaking keyframe animation, which is both labor-intensive and time-consuming. Furthermore, in virtual reality environments, CHOIS could lead to more immersive and interactive experiences, as users could command virtual characters through natural language, watching them execute tasks with lifelike precision. This heightened level of interaction could transform VR experiences from rigid, scripted events to dynamic environments that realistically respond to user input.

In the fields of AI and robotics, CHOIS represents a giant step towards more autonomous and context-aware systems. Robots, often limited by pre-programmed routines, could use a system like CHOIS to better understand the real world and execute tasks described in human language.

This could be particularly transformative for service robots in healthcare, hospitality, or domestic environments, where the ability to understand and perform a wide array of tasks in a physical space is crucial.

For AI, the ability to process language and visual information simultaneously to perform tasks is a step closer to achieving a level of situational and contextual understanding that has been, until now, a predominantly human attribute. This could lead to AI systems that are more helpful assistants in complex tasks, able to understand not just the “what,” but the “how” of human instructions, adapting to new challenges with a level of flexibility previously unseen.



Promising results and future outlook

Overall, the Stanford and Meta researchers have made key progress on an extremely challenging problem at the intersection of computer vision, NLP (natural language processing) and robotics.

The research team believes their work is a significant step towards creating advanced AI systems that simulate continuous human behaviors in diverse 3D environments. It also opens the door to further research into the synthesis of human-object interactions from 3D scenes and language input, potentially leading to more sophisticated AI systems in the future.
 


Microsoft Agrees to Union Contract Terms Governing Its Use of AI​


  • Pact covers a few hundred workers at Microsoft gaming studio
  • AI has become a contentious issue in several labor disputes


Microsoft calls its AI products copilots, which is meant to convey that they work with employees rather than replacing them.
Photographer: Yuki Iwamura/Bloomberg


By Josh Eidelson

December 11, 2023 at 7:00 AM EST


Microsoft Corp. has agreed to union contract language governing its use of artificial intelligence, creating an avenue for workers to challenge how it deploys the evolving technology.

As part of negotiations with the Communications Workers of America – the first US collective bargaining in the company’s history – Microsoft has reached a tentative agreement on an AI article to include in a contract covering a few hundred staff at Microsoft’s video game studio ZeniMax.

The language incorporates Microsoft’s six previously announced AI principles, which commit the company to ensuring the systems “treat all people fairly” and “empower everyone and engage people.” In the new agreement, which was viewed by Bloomberg News, Microsoft commits to applying “these AI principles across all of our AI technologies to help employees achieve greater productivity, growth and satisfaction in the work they do.”

“The goal is to ensure tools and technologies benefit rather than harm workers,” according to the contract language. It then obligates Microsoft to inform the union any time its implementation of AI or other automation “may impact work performed” by union members, and if requested, to negotiate over the impact on employees.

Microsoft didn’t provide comment in response to inquiries.

The company has revamped almost its entire product lineup, including Office, Windows, search and security software, to add features based on OpenAI technology. The AI-enhanced software helps workers with a range of tasks — from coding to writing emails to keeping track of customer needs.

Microsoft calls its AI products copilots, which is meant to convey that they work with employees rather than replacing them. Still, Microsoft executives acknowledge the wider deployment of these and other kinds of AI tools will change people’s jobs and may have broader workforce impacts.

“It’s important with new technology that’s taking place that we make sure that there’s not any type of diminishment in what unions have fought for over the years,” CWA President Claude Cummings Jr. said in an interview. “Technology may change, but what unions stand for has not.”

While the language doesn’t establish detailed parameters, its inclusion in an enforceable union contract means that “Microsoft is bound to follow through,” Cummings said. Collective bargaining agreements generally include grievance procedures that can be invoked when either side believes the other has violated the terms, which can include escalating issues to mediation or arbitration.

Cummings said CWA isn’t against technological change, but wants to ensure workers have a say in the process, and that their job security, safety and benefits are protected. “I’ve worked for AT&T when telephones were the size of a breadbox,” he said. “Technology is going to continue to develop over the years, and the best way for workers to have a voice in how that technology is used in the workplace is by first of all being in a union and signing agreements such as this.”

The agreement with CWA offers Microsoft a recruiting advantage, Cummings said. “Microsoft is going to get the best young minds in this country,” added the union president, who hopes the deal will inspire more workers to unionize and other companies to follow Microsoft’s lead in agreeing to eschew union-busting.

In 2022, as it sought regulatory approval to buy Activision Blizzard Inc., Microsoft announced a new set of principles including a commitment to “collaborative approaches that will make it simpler” for workers to choose whether to unionize. When the ZeniMax workers sought to unionize, Microsoft distinguished itself from some peers by staying neutral rather than opposing their efforts.

AI has increasingly become a point of contention and negotiation in union contract talks. The agreement reached in September between Hollywood writers and studios, for example, includes provisions that the union says prevent writers from being forced to use software like ChatGPT, stop AI-generated material from being used to dilute writers’ credit and let the union challenge use of writers’ work to train AI systems.



— With assistance from Dina Bass
 



[Submitted on 29 Nov 2023]

WyCryst: Wyckoff Inorganic Crystal Generator Framework​

Ruiming Zhu, Wei Nong, Shuya Yamazaki, Kedar Hippalgaonkar


Generative design marks a significant data-driven advancement in the exploration of novel inorganic materials, which entails learning the symmetry equivalent to the crystal structure prediction (CSP) task and subsequent learning of their target properties. Many generative models have been developed in the last few years. However, these models so far lack the capacity to produce crystals that obey the fundamental rules of crystal symmetry. This is because an important step in these previous approaches involves energy relaxation on the generated crystal structures to find the ground state crystal structure, typically using Density Functional Theory (DFT). More often than not, this changes the symmetry of the structure, thereby changing the desired property and hence invalidating the original CSP. To address this, we introduce a generative design framework (WyCryst), composed of three pivotal components: 1) a Wyckoff position based inorganic crystal representation, 2) a property-directed VAE model and 3) an automated DFT workflow for structure refinement. By implementing loss functions that punish non-realistic crystal structures, our model selectively generates materials that follow the ground truth of crystal symmetry in the form of Wyckoff representation for each Space Group. In leave-one-out validation experiments, we successfully reproduce a variety of existing materials: CaTiO3 (space group, SG No. 62 and 221), CsPbI3 (SG No. 221), BaTiO3 (SG No. 160), and CuInS2 (SG No.122) for both ground state as well as polymorphic crystal structure predictions for desired compositions. We also generate several new ternary materials not found in the inorganic materials database (Materials Project), which are proved to be stable, retaining their symmetry, and we also check their phonon stability, using our automated DFT workflow highlighting the validity of our approach.
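As a very rough schematic of the kind of objective the abstract describes — a property-directed VAE over Wyckoff-based crystal representations with an added penalty for non-realistic structures — the loss might be organized as follows. Every term name and weight here is an assumption for illustration, not the authors’ implementation:

```python
import torch
import torch.nn.functional as F

def wycryst_style_vae_loss(recon, target, mu, logvar,
                           prop_pred, prop_target, validity_penalty,
                           beta=1.0, w_prop=1.0, w_valid=1.0):
    """Schematic property-directed VAE objective (terms and weights assumed).

    recon / target         : reconstructed vs. true Wyckoff-based representation
    mu, logvar             : latent Gaussian parameters from the encoder
    prop_pred / prop_target: predicted vs. desired material property
    validity_penalty       : differentiable penalty for non-realistic
                             Wyckoff occupations (symmetry-violating outputs)
    """
    recon_loss = F.mse_loss(recon, target)                          # rebuild the crystal representation
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # regularize the latent space
    prop_loss = F.mse_loss(prop_pred, prop_target)                  # steer generation toward target property
    return recon_loss + beta * kl + w_prop * prop_loss + w_valid * validity_penalty
```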


Comments: 18 pages
Subjects: Materials Science (cond-mat.mtrl-sci); Computational Physics (physics.comp-ph)
Cite as: arXiv:2311.17916 [cond-mat.mtrl-sci]
(or arXiv:2311.17916v1 [cond-mat.mtrl-sci] for this version)
https://doi.org/10.48550/arXiv.2311.17916


Submission history​

From: Kedar Hippalgaonkar [view email]
[v1] Wed, 29 Nov 2023 18:59:31 UTC (34,607 KB)

 

Photorealistic Video Generation with Diffusion Models​

Agrim Gupta1,2,*, Lijun Yu2, Kihyuk Sohn2, Xiuye Gu2, Meera Hahn2, Li Fei-Fei1,

Irfan Essa2, 3, Lu Jiang2, José Lezama2

1Stanford; 2Google Research; 3Georgia Institute of Technology

*Work partially done during an internship at Google.

arXiv PDF More samples


Text-to-Video Examples​


Abstract​

We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation consisting of a base latent video diffusion model, and two video super-resolution diffusion models to generate videos of 512 x 896 resolution at 8 frames per second.






W.A.L.T: We encode images and videos into a shared latent space. The transformer backbone processes these latents with blocks having two layers of window-restricted attention: spatial layers capture spatial relations in both images and video, while spatiotemporal layers model temporal dynamics in videos and passthrough images via identity attention mask. Text conditioning is done via spatial cross-attention.
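To make the windowing idea concrete, here is a toy illustration of window-restricted self-attention. It omits the learned Q/K/V projections and multi-head structure of a real transformer block, and the split into spatial versus spatiotemporal windows is paraphrased from the caption rather than taken from the authors’ code:

```python
import torch

def window_attention(x, window):
    """Toy window-restricted self-attention over a sequence of latent tokens.

    x      : (B, L, D) tokens; L must be divisible by `window`
    window : tokens attend only within non-overlapping windows of this size
    No learned Q/K/V projections or multiple heads, to keep the sketch minimal.
    """
    B, L, D = x.shape
    xw = x.view(B, L // window, window, D)                            # split into windows
    attn = torch.softmax(xw @ xw.transpose(-1, -2) / D ** 0.5, dim=-1)
    return (attn @ xw).view(B, L, D)

# In a W.A.L.T-style block, "spatial" layers would window over the tokens of a
# single frame (so images and video frames are treated identically), while
# "spatiotemporal" layers would window across frames to model motion, with
# standalone images passing through those layers unchanged.
out = window_attention(torch.randn(2, 64, 32), window=16)
print(out.shape)  # torch.Size([2, 64, 32])
```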
 


Snapchat+ subscribers can now create and send AI-generated images​

Aisha Malik @aiishamalik1 / 10:07 AM EST • December 12, 2023



Three screenshots of Snapchat's AI image generator

Image Credits: Snapchat

Snapchat is releasing a few new AI-powered features for Snapchat+ subscribers, the company announced on Tuesday. Most notably, subscribers can now create and send AI-generated images based on a text prompt. In addition to getting access to a new AI extend tool, subscribers can now also use the app’s Dream selfie feature with friends.

To access the AI image generator, subscribers can click on the “AI” button on the right side of their screen. Once you click on it, you can choose from a selection of prompts, such as “a planet made out of cheese” or “a sunny day at the beach,” to receive an AI-generated image. Or, you can type in your own prompt, such as “a dog sleeping on a rocket.” After you have either selected or typed in a prompt, the app will start generating your image. Once it’s ready, you have the option to edit it, download it and share it with others.



Image Credits: Snapchat

Snapchat+ subscribers were already able to create AI images for their bitmoji backgrounds and chat wallpapers, and can now create and send AI images to their friends.

Snapchat did not share which specific model is powering the feature, but told TechCrunch in an email that the company has several deals with partners to use their foundational models.

Subscribers can now also use Dream, Snapchat’s generative AI selfie feature, with their friends. The feature lets you create fantastical images of yourself in different scenarios. With this new update, you can create an AI selfie of yourself and then select one of your friends to appear next to you. For instance, you can create an image of you and a friend as mermaids, and then either send the image to them or share it on your story. Snapchat+ subscribers get one free pack of eight Dreams a month, the company says.

In addition, subscribers are getting access to a new AI-powered extend tool. Say you took an image of your dog, but you zoomed in too close and want more of the full picture. You can click on the new extend tool to automatically receive a zoomed-out image where the background has been filled in with the help of AI.



Image Credits: Snapchat

Snapchat says these new features are rolling out now, but that regional availability may vary.

Today’s announcement shows that Snapchat is looking to further the app’s AI capabilities and build on the ones it already offers. Snapchat users can already receive AI-generated images from the app’s My AI chatbot, and they have had access to the app’s AI Dream feature for a few months now.

Snapchat+, which launched over the summer, costs $3.99 USD per month. The company says it currently has more than 7 million Snapchat+ subscribers. New data indicates that the offering recently had its best month, in terms of in-app revenue, showing no signs of slowing growth. In November, Snapchat+ topped $20 million in net revenue (after app store fees) for the first time. At the same time, subscription revenue rose by double digits in almost every country where Snapchat+ is live.
 


Phi-2: The surprising power of small language models​

Published December 12, 2023

By Mojan Javaheripi, Senior Researcher, and Sébastien Bubeck, Partner Research Manager

Contributors​

Marah Abdin, Jyoti Aneja, Sebastien Bubeck, Caio César Teodoro Mendes, Weizhu Chen, Allie Del Giorno, Ronen Eldan, Sivakanth Gopi, Suriya Gunasekar, Mojan Javaheripi, Piero Kauffmann, Yin Tat Lee, Yuanzhi Li, Anh Nguyen, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Michael Santacroce, Harkirat Singh Behl, Adam Tauman Kalai, Xin Wang, Rachel Ward, Philipp Witte, Cyril Zhang, Yi Zhang







Figure 1. Satya Nadella announcing Phi-2 at Microsoft Ignite 2023.

Over the past few months, our Machine Learning Foundations team at Microsoft Research has released a suite of small language models (SLMs) called “Phi” that achieve remarkable performance on a variety of benchmarks. Our first model, the 1.3 billion parameter Phi-1, achieved state-of-the-art performance on Python coding among existing SLMs (specifically on the HumanEval and MBPP benchmarks). We then extended our focus to common sense reasoning and language understanding and created a new 1.3 billion parameter model named Phi-1.5, with performance comparable to models 5x larger.

We are now releasing Phi-2, a 2.7 billion-parameter language model that demonstrates outstanding reasoning and language understanding capabilities, showcasing state-of-the-art performance among base language models with less than 13 billion parameters. On complex benchmarks Phi-2 matches or outperforms models up to 25x larger, thanks to new innovations in model scaling and training data curation.

With its compact size, Phi-2 is an ideal playground for researchers, including for exploration around mechanistic interpretability, safety improvements, or fine-tuning experimentation on a variety of tasks. We have made Phi-2 available in the Azure AI Studio model catalog to foster research and development on language models.




Key Insights Behind Phi-2​

The massive increase in the size of language models to hundreds of billions of parameters has unlocked a host of emerging capabilities that have redefined the landscape of natural language processing. A question remains whether such emergent abilities can be achieved at a smaller scale using strategic choices for training, e.g., data selection.

Our line of work with the Phi models aims to answer this question by training SLMs that achieve performance on par with models of much higher scale (yet still far from the frontier models). Our key insights for breaking the conventional language model scaling laws with Phi-2 are twofold:

Firstly, training data quality plays a critical role in model performance. This has been known for decades, but we take this insight to its extreme by focusing on “textbook-quality” data, following up on our prior work “Textbooks Are All You Need.” Our training data mixture contains synthetic datasets specifically created to teach the model common sense reasoning and general knowledge, including science, daily activities, and theory of mind, among others. We further augment our training corpus with carefully selected web data that is filtered based on educational value and content quality.

Secondly, we use innovative techniques to scale up, starting from our 1.3 billion parameter model, Phi-1.5, and embedding its knowledge within the 2.7 billion parameter Phi-2. This scaled knowledge transfer not only accelerates training convergence but also shows a clear boost in Phi-2 benchmark scores.






A bar plot comparing the performance of Phi-2 (with 2.7B parameters) and Phi-1.5 (with 1.3B parameters) on common sense reasoning, language understanding, math, coding, and the Bigbench-hard benchmark. Phi-2 outperforms Phi1.5 in all categories. The commonsense reasoning tasks are PIQA, WinoGrande, ARC easy and challenge, and SIQA. The language understanding tasks are HellaSwag, OpenBookQA, MMLU, SQuADv2, and BoolQ. The math task is GSM8k, and coding includes the HumanEval and MBPP benchmarks.


Figure 2. Comparison between Phi-2 (2.7B) and Phi-1.5 (1.3B) models. All tasks are evaluated in 0-shot except for BBH and MMLU which use 3-shot CoT and 5-shot, respectively.

Training Details​

Phi-2 is a Transformer-based model with a next-word prediction objective, trained on 1.4T tokens from multiple passes on a mixture of synthetic and web datasets for NLP and coding. The training for Phi-2 took 14 days on 96 A100 GPUs. Phi-2 is a base model that has not undergone alignment through reinforcement learning from human feedback (RLHF), nor has it been instruct fine-tuned. Despite this, we observed better behavior with respect to toxicity and bias compared to existing open-source models that went through alignment (see Figure 3). This is in line with what we saw in Phi-1.5 due to our tailored data curation technique; see our previous tech report for more details on this. For more information about the Phi-2 model, please visit Azure AI | Machine Learning Studio.
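The officially described access route in this post is the Azure AI Studio model catalog; if a Hugging Face checkpoint is also available (the "microsoft/phi-2" repository name below is an assumption), a minimal local test of the base model might look like this:

```python
# Illustrative only: assumes a "microsoft/phi-2" checkpoint is reachable via
# Hugging Face transformers; the route documented in this post is Azure AI Studio.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-2"  # assumed repository name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    trust_remote_code=True,  # may be needed if the model ships custom code
)

# Phi-2 is a base model (no chat/instruction tuning), so prompt it plainly.
prompt = "A skier slides down a frictionless slope of height 40 m. What is the speed at the bottom?"
inputs = tokenizer(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=120)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```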





A barplot comparing the safety score of Phi-1.5, Phi-2, and Llama-7B models on 13 categories of the ToxiGen benchmark. Phi-1.5 achieves the highest score on all categories, Phi-2 achieves the second-highest scores and Llama-7B achieves the lowest scores across all categories.


Figure 3. Safety scores computed on 13 demographics from ToxiGen. A subset of 6541 sentences are selected and scored between 0 to 1 based on scaled perplexity and sentence toxicity. A higher score indicates the model is less likely to produce toxic sentences compared to benign ones.
 


Phi-2 Evaluation​

Below, we summarize Phi-2 performance on academic benchmarks compared to popular language models. Our benchmarks span several categories, namely, Big Bench Hard (BBH) (3 shot with CoT), commonsense reasoning (PIQA, WinoGrande, ARC easy and challenge, SIQA), language understanding (HellaSwag, OpenBookQA, MMLU (5-shot), SQuADv2 (2-shot), BoolQ), math (GSM8k (8 shot)), and coding (HumanEval, MBPP (3-shot)).

With only 2.7 billion parameters, Phi-2 surpasses the performance of Mistral and Llama-2 models at 7B and 13B parameters on various aggregated benchmarks. Notably, it achieves better performance compared to the 25x larger Llama-2-70B model on multi-step reasoning tasks, i.e., coding and math. Furthermore, Phi-2 matches or outperforms the recently announced Google Gemini Nano 2, despite being smaller in size.

Of course, we acknowledge the current challenges with model evaluation, and that many public benchmarks might leak into the training data. For our first model, Phi-1, we did an extensive decontamination study to discard this possibility, which can be found in our first report “Textbooks Are All You Need.” Ultimately, we believe that the best way to judge a language model is to test it on concrete use cases. Following that spirit, we also evaluated Phi-2 using several Microsoft internal proprietary datasets and tasks, comparing it again to Mistral and Llama-2. We observed similar trends, i.e. on average, Phi-2 outperforms Mistral-7B, and the latter outperforms the Llama-2 models (7B, 13B, and 70B).



Model     Size    BBH    Commonsense Reasoning    Language Understanding    Math    Coding
Llama-2   7B      40.0   62.2                     56.7                      16.5    21.0
Llama-2   13B     47.8   65.0                     61.9                      34.2    25.4
Llama-2   70B     66.5   69.2                     67.6                      64.1    38.3
Mistral   7B      57.2   66.4                     63.7                      46.4    39.4
Phi-2     2.7B    59.2   68.8                     62.0                      61.1    53.7


Table 1. Averaged performance on grouped benchmarks compared to popular open-source SLMs.

Model           Size    BBH    BoolQ   MBPP   MMLU
Gemini Nano 2   3.2B    42.4   79.3    27.2   55.8
Phi-2           2.7B    59.3   83.3    59.1   56.7


Table 2. Comparison between Phi-2 and Gemini Nano 2 Model on Gemini’s reported benchmarks.

In addition to these benchmarks, we also performed extensive testing on commonly used prompts from the research community. We observed a behavior in accordance with the expectation we had given the benchmark results. For example, we tested a prompt used to probe a model’s ability to solve physics problems, most recently used to evaluate the capabilities of the Gemini Ultra model, and achieved the following result:




An example prompt is given to Phi-2 which says “A skier slides down a frictionless slope of height 40m and length 80m. What's the skier’s speed at the bottom?”. Phi-2 then answers the prompt by explaining the conversion of potential energy to kinetic energy and providing the formulas to compute each one. It then proceeds to compute the correct speed using the energy formulas.


Figure 4. Phi-2’s output on a simple physics problem, which includes an approximately correct square root calculation.


The model is then provided with a student’s wrong answer to the skier physics problem and asked if it can correct the student’s mistake. Phi-2 replies with the student’s mistake, i.e., using the wrong formula for potential energy, and provides the correct formula.


Figure 5. Similar to Gemini’s test, we also further queried Phi-2 with a student’s wrong answer to see if Phi-2 could identify where the mistake is (it did, despite Phi-2 not being fine-tuned for chat or instruction-following). We note, however, that this is not fully an apples-to-apples comparison with the Gemini Ultra output described in the Gemini report; in particular, in the latter case the student’s answer was given as an image with handwritten text, rather than as raw text in our case.



 