bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739






1/6
AI is more PERSUASIVE than humans?!

Scientists tested whether GPT-4 could change people's minds BETTER than humans during a debate.

GPT-4 and HUMANS received personal information about each person they debated against.

The results are mind-blowing to me.

GPT-4 was able to convince people to change their opinion far more often than the humans: roughly 82% higher odds of agreement, per the study!

Just imagine an AI so persuasive that it can make you rethink your beliefs with perfectly tailored arguments.

Maybe one day it could change the minds of world leaders or help solve conflicts.

Or maybe you'll be its next persuasion victim...

Your thoughts? I did not expect this.

Study source: On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial

Image credit:
@emollick

2/6
To be honest, I did not expect these results.

3/6
Yep, until marketing is done completely by AI.

4/6
It's not the AI. It's the humans using it to persuade.

5/6
Claude works 3 times better. And even Gemini is surprisingly good with the tone. Try it out Tia!

6/6
And it is far more objective. I did not expect these results.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GK-zQzfWgAAeK2B.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739


AI's new power of persuasion: Study shows LLMs can exploit personal information to change your mind​

Story by Tanya Petersen


Overview of the experimental workflow. (A) Participants fill in a survey about their demographic information and political orientation. (B) Every 5 minutes, participants are randomly assigned to one of four treatment conditions. The two players then debate for 10 minutes on an assigned proposition, randomly holding the PRO or CON standpoint as instructed. (C) After the debate, participants fill out another short survey measuring their opinion change. Finally, they are debriefed about their opponent's identity. Credit: arXiv (2024). DOI: 10.48550/arxiv.2403.14380


A new EPFL study has demonstrated the persuasive power of large language models, finding that participants debating GPT-4 with access to their personal information were far more likely to change their opinion than those who debated humans.

"On the internet, nobody knows you're a dog." That's the caption to a famous 1990s cartoon showing a large dog with his paw on a computer keyboard. Fast forward 30 years, replace "dog" with "AI" and this sentiment was a key motivation behind a new study to quantify the persuasive power of today's large language models (LLMs).

"You can think of all sorts of scenarios where you're interacting with a language model although you don't know it, and this is a fear that people have—on the internet are you talking to a dog or a chatbot or a real human?" asked Associate Professor Robert West, head of the Data Science Lab in the School of Computer and Communication Sciences. "The danger is superhuman like chatbots that create tailor-made, convincing arguments to push false or misleading narratives online."

AI and personalization​

Early work has found that language models can generate content perceived as at least on par with, and often more persuasive than, human-written messages. However, there is still limited knowledge about LLMs' persuasive capabilities in direct conversations with humans, and about how personalization (knowing a person's gender, age, and education level) can improve their performance.


"We really wanted to see how much of a difference it makes when the AI model knows who you are (personalization)—your age, gender, ethnicity, education level, employment status and political affiliation—and this scant amount of information is only a proxy of what more an AI model could know about you through social media, for example," West continued.

Human v AI debates​

In a pre-registered study, the researchers recruited 820 people to participate in a controlled trial in which each participant was randomly assigned a topic and one of four treatment conditions: debating a human with or without personal information about the participant, or debating an AI chatbot (OpenAI's GPT-4) with or without personal information about the participant.

This setup differed substantially from previous research in that it enabled a direct comparison of the persuasive capabilities of humans and LLMs in real conversations, providing a framework for benchmarking how state-of-the-art models perform in online environments and the extent to which they can exploit personal data.

Their article, "On the Conversational Persuasiveness of large language models: A Randomized Controlled Trial," posted to the arXiv preprint server, explains that the debates were structured based on a simplified version of the format commonly used in competitive academic debates and participants were asked before and afterwards how much they agreed with the debate proposition.

The results showed that participants who debated GPT-4 with access to their personal information had 81.7% higher odds of increased agreement with their opponents compared to participants who debated humans. Without personalization, GPT-4 still outperformed humans, but the effect was far lower.
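If it helps to make that number concrete: "81.7% higher odds" is an odds ratio, not a jump of 81.7 percentage points in probability. A minimal sketch, assuming a purely illustrative 30% baseline rate of opinion change against a human debater (the article does not report the actual baseline), of what that ratio does to the probability:

```python
# Convert an odds ratio into a probability, to illustrate what "81.7% higher
# odds" means. The 30% baseline is an assumption for illustration only; it is
# not reported in the article above.
def apply_odds_ratio(baseline_prob: float, odds_ratio: float) -> float:
    """Return the probability implied by scaling the baseline odds."""
    baseline_odds = baseline_prob / (1.0 - baseline_prob)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)

baseline = 0.30                # assumed rate of opinion change vs. a human debater
or_personalized_gpt4 = 1.817   # 81.7% higher odds, from the study
print(apply_odds_ratio(baseline, or_personalized_gpt4))  # ~0.44, i.e. ~44% vs. 30%
```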

Cambridge Analytica on steroids​

Not only are LLMs able to effectively exploit personal information to tailor their arguments and out-persuade humans in online conversations through microtargeting, they do so far more effectively than humans.

"We were very surprised by the 82% number and if you think back to Cambridge Analytica, which didn't use any of the current tech, you take Facebook likes and hook them up with an LLM, the Language Model can personalize its messaging to what it knows about you. This is Cambridge Analytica on steroids," said West.

"In the context of the upcoming U.S. elections, people are concerned because that's where this kind of technology is always first battle tested. One thing we know for sure is that people will be using the power of large language models to try to swing the election."

One interesting finding of the research was that when a human was given the same personal information as the AI, they didn't seem to make effective use of it for persuasion. West argues that this should be expected—AI models are consistently better because they are almost every human on the internet put together.

The models have learned through online patterns that a certain way of making an argument is more likely to lead to a persuasive outcome. They have read many millions of Reddit, Twitter and Facebook threads, and been trained on books and papers from psychology about persuasion. It's unclear exactly how a model leverages all this information but West believes this is a key direction for future research.

"LLMs have shown signs that they can reason about themselves, so given that we are able to interrogate them, I can imagine that we could ask a model to explain its choices and why it is saying a precise thing to a particular person with particular properties. There's a lot to be explored here because the models may be doing things that we don't even know about yet in terms of persuasiveness, cobbled together from many different parts of the knowledge that they have."

More information: Francesco Salvi et al, On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial, arXiv (2024). DOI: 10.48550/arxiv.2403.14380

Provided by Ecole Polytechnique Federale de Lausanne


Computer Science > Computers and Society​

[Submitted on 21 Mar 2024]

On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial​

Francesco Salvi, Manoel Horta Ribeiro, Riccardo Gallotti, Robert West
The development and popularization of large language models (LLMs) have raised concerns that they will be used to create tailor-made, convincing arguments to push false or misleading narratives online. Early work has found that language models can generate content perceived as at least on par and often more persuasive than human-written messages. However, there is still limited knowledge about LLMs' persuasive capabilities in direct conversations with human counterparts and how personalization can improve their performance. In this pre-registered study, we analyze the effect of AI-driven persuasion in a controlled, harmless setting. We create a web-based platform where participants engage in short, multiple-round debates with a live opponent. Each participant is randomly assigned to one of four treatment conditions, corresponding to a two-by-two factorial design: (1) Games are either played between two humans or between a human and an LLM; (2) Personalization might or might not be enabled, granting one of the two players access to basic sociodemographic information about their opponent. We found that participants who debated GPT-4 with access to their personal information had 81.7% (p < 0.01; N=820 unique participants) higher odds of increased agreement with their opponents compared to participants who debated humans. Without personalization, GPT-4 still outperforms humans, but the effect is lower and statistically non-significant (p=0.31). Overall, our results suggest that concerns around personalization are meaningful and have important implications for the governance of social media and the design of new online environments.
Comments: 33 pages, 10 figures, 7 tables
Subjects: Computers and Society (cs.CY)
Cite as: arXiv:2403.14380 [cs.CY]
(or arXiv:2403.14380v1 [cs.CY] for this version)
[2403.14380] On the Conversational Persuasiveness of Large Language Models: A Randomized Controlled Trial

Submission history

From: Francesco Salvi [view email]
[v1] Thu, 21 Mar 2024 13:14:40 UTC (551 KB)

 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739







1/7
Meta announces Megalodon

Efficient LLM Pretraining and Inference with Unlimited Context Length

The quadratic complexity and weak length extrapolation of Transformers limits their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and

2/7
state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce Megalodon, a neural architecture for efficient sequence modeling with unlimited context length. Megalodon inherits the architecture of Mega

3/7
(exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), timestep normalization layer, normalized attention mechanism and pre-

4/7
norm with two-hop residual configuration. In a controlled head-to-head comparison with Llama2, Megalodon achieves better efficiency than Transformer in the scale of 7 billion

5/7
parameters and 2 trillion training tokens. Megalodon reaches a training loss of 1.70, landing mid-way between Llama2-7B (1.75) and 13B (1.67).
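For context, Mega's core building block is a damped exponential moving average over the sequence, which CEMA (in Megalodon) extends into the complex domain. Below is a rough sketch of the plain damped-EMA recurrence, roughly the form described in the Mega paper; the parameter values are arbitrary and this is not Meta's implementation:

```python
import numpy as np

def damped_ema(x, alpha=0.6, delta=0.5):
    """Damped EMA over a sequence: h_t = alpha * x_t + (1 - alpha * delta) * h_{t-1}.

    Mega applies this per dimension before gated attention; CEMA (Megalodon)
    moves alpha/delta into the complex domain. Values here are illustrative.
    """
    h = np.zeros_like(x, dtype=float)
    prev = 0.0
    for t, xt in enumerate(x):
        prev = alpha * xt + (1.0 - alpha * delta) * prev
        h[t] = prev
    return h

print(damped_ema(np.array([1.0, 0.0, 0.0, 1.0])))
```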

6/7
paper page:

7/7
daily papers:
GLQgTFvW8AA3355.jpg

 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739


1/2
Zuck on:

- Llama 3
- open sourcing towards AGI
- custom silicon, synthetic data, & energy constraints on scaling
- Caesar Augustus, intelligence explosion, bioweapons, $10b models, & much more

Enjoy!

Links below

2/2
YouTube: https://youtu.be/bc6uFV9CJGg Transcript: https://dwarkeshpatel.com/p/mark-zuckerberg Apple Podcasts: https://podcasts.apple.com/us/podca...caeser/id1516093381?i=1000652877239 Spotify:

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/4
Llama 3 metrics

2/4
Llama3 released!

https://llama.meta.com/llama3/ https://github.com/meta-llama/llama3

3/4
Llama3 released!

https://llama.meta.com/llama3/ https://github.com/meta-llama/llama3

4/4
Llama3 released!

https://llama.meta.com/llama3/ https://github.com/meta-llama/llama3


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GLdYKokXQAAQ-VF.jpg

GLdZg3bXsAABCza.jpg


1/1
Llama3 released!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GLdZg3bXsAABCza.jpg

GLdaejEXwAA0Huz.jpg






1/5
What is this? Is this it? Spotted on Azure Marketplace

2/5
I want weights on
@huggingface

3/5
Another confirmation for llama 3 today - this time it's listed on Pricing – Replicate

4/5
UPD: they have removed the models from the list.

5/5
Link: https://azuremarketplace.microsoft....genai.meta-llama-3-8b-chat-offer?tab=Overview


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GLc5E0TWIAAS1Xq.jpg

GLdAJWAWAAAw_Vm.png



1/2
Llama 3 has been my focus since joining the Llama team last summer. Together, we've been tackling challenges across pre-training and human data, pre-training scaling, long context, post-training, and evaluations. It's been a rigorous yet thrilling journey:

- Our largest models exceed 400B parameters and are still training.
- Scaling is the recipe, demanding more than better scaling laws and infrastructure; e.g., managing high effective training time across 16K GPUs requires innovative strategies.
- Opting for an 8B over a 7B model? Among many others, an upgraded tokenizer expanded our vocabulary from 32K to 128K, making language processing more efficient and allowing our models to handle more text with fewer tokens.
- With over 15T tokens processed, our improved tokenization significantly enhances pre-training token efficiency. We're committed to high-quality data curation, including advanced filtering, semantic deduplication, and quality prediction.
- Enhanced human data significantly improves our post-training stack that combines supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct policy optimization (DPO).
- We've set the pre-training context window to 8K tokens. A comprehensive approach to data, modeling, parallelism, inference, and evaluations would be interesting. More updates on longer contexts later.
- While automated benchmarks are useful, they don't fully reflect a model's grasp of nuanced contexts. Our human evaluations are carefully designed and performed.

What does it take to build LLMs? Beyond data, compute, infrastructure, model, inference, safety, and evaluations--ultimately, it boils down to the synergy of dedicated talents and thoughtful investment.

Exciting updates ahead: we are planning to launch video podcasts with our developers to dive deeper into the tech behind Llama 3. We'll share the research paper. Stay tuned!

2/2
Thanks Yi!


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GLdZKG5bEAEB3FW.jpg

GLdZD2TaYAAgfG1.jpg




1/3
Llama 3 is about to be released with a 8B and a 70B models.

Just saw this on Replicate: Pricing – Replicate

2/3
Oh no Reddit was faster

3/3
It was cancelled after a sex scandal


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GLc_HtkXcAAvFuP.png








 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739


1/2
Big congrats to
@MistralAI on the new model launch!

Mixtral-8x22b-instruct is now in the Arena. Come chat & vote to see who is the best OSS model.

We're also excited to see the model's strong multilingual capability. Soon will update our newly launched French LLM leaderboard!

2/2
Links:
- Chat & Vote at http://chat.lmsys.org/ - Full leaderboard at http://leaderboard.lmsys.org/


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GLZzSUkbsAARd1i.jpg

GLXzoYSagAEwbyQ.png

GLZzk-DaUAAU4OB.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739







1/7
Skip ChatGPT.

Anthropic released Claude 3, and it's mind blowing.

Here are 5 powerful features of Claude 3 you must know: ↓

2/7
1/ Multilingual Capabilities:

→ Claude 3 models are really good at translating and creating content in different languages.

3/7
2/ Improved Sight:

→ Claude 3 models can understand lots of different types of pictures, like photos, charts, graphs, and diagrams. This helps with tasks that use these kinds of visuals.

4/7
4/ Almost Like Understanding Humans:

→ These models are really good at doing things like analyzing stuff, creating content, writing code, and having conversations in many languages.

→ Opus stands out for being super smart, even beating other models in tests.

5/7
5/ Faster and Fewer Rejections:

→ The new models respond quicker and say no less often, showing they understand the situation better.

6/7
That's a wrap!

If you enjoyed this thread or found it useful:

1. Follow me
@LearnWithSubhan for more of these.

2. Retweet the first tweet to share this thread with your network.

7/7
Skip ChatGPT.

Anthropic released Claude 3, and it's mind blowing.

Here are 5 powerful features of Claude 3 you must know: ↓


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GKx7fVhXAAEkymo.jpg

GKx9VhYXAAEuniR.png

GKx-fV7W8AA14g4.jpg

GKx7fVhXAAEkymo.jpg
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739



1/3
According to a new paper published by Google researchers, LLMs can now process text of effectively unlimited length. The paper introduces a new technique called Infini-attention, which enables models to expand their "context window" with bounded memory and compute requirements.

2/3
Source:

3/3
Infini-attention architecture:
GLN8aa3WUAAPz_f.png
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739



1/3
Congrats to
@AIatMeta on Llama 3 release!!
Notes:

Releasing 8B and 70B (both base and finetuned) models, strong-performing in their model class (but we'll see when the rankings come in @lmsysorg :smile:)
400B is still training, but already encroaching GPT-4 territory (e.g. 84.8 MMLU vs. 86.5 4Turbo).

Tokenizer: number of tokens was 4X'd from 32K (Llama 2) -> 128K (Llama 3). With more tokens you can compress sequences more in length, cites 15% fewer tokens, and see better downstream performance.

Architecture: no major changes from the Llama 2. In Llama 2 only the bigger models used Grouped Query Attention (GQA), but now all models do, including the smallest 8B model. This is a parameter sharing scheme for the keys/values in the Attention, which reduces the size of the KV cache during inference. This is a good, welcome, complexity reducing fix and optimization.
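As a rough illustration of why that KV-cache shrinkage matters, here is a back-of-envelope sketch. The layer/head dimensions are assumed 8B-class values for illustration, not official Llama 3 specs:

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes to cache keys and values for one sequence (factor 2 = K and V)."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

# Illustrative 8B-class dimensions (assumed): 32 layers, 128-dim heads, 32 query
# heads. With full multi-head attention every query head has its own K/V; with
# GQA the 32 query heads share 8 K/V heads.
full_mha = kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=32, head_dim=128)
gqa      = kv_cache_bytes(seq_len=8192, n_layers=32, n_kv_heads=8,  head_dim=128)
print(full_mha / 2**30, gqa / 2**30)   # ~4.0 GiB vs ~1.0 GiB at fp16 per sequence
```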

Sequence length: the maximum number of tokens in the context window was bumped up to 8192 from 4096 (Llama 2) and 2048 (Llama 1). This bump is welcome, but quite small w.r.t. modern standards (e.g. GPT-4 is 128K) and I think many people were hoping for more on this axis. May come as a finetune later (?).

Training data. Llama 2 was trained on 2 trillion tokens, Llama 3 was bumped to 15T training dataset, including a lot of attention that went to quality, 4X more code tokens, and 5% non-en tokens over 30 languages. (5% is fairly low w.r.t. non-en:en mix, so certainly this is a mostly English model, but it's quite nice that it is > 0).

Scaling laws. Very notably, 15T is a very very large dataset to train with for a model as "small" as 8B parameters, and this is not normally done and is new and very welcome. The Chinchilla "compute optimal" point for an 8B model would be train it for ~200B tokens. (if you were only interested to get the most "bang-for-the-buck" w.r.t. model performance at that size). So this is training ~75X beyond that point, which is unusual but personally, I think extremely welcome. Because we all get a very capable model that is very small, easy to work with and inference. Meta mentions that even at this point, the model doesn't seem to be "converging" in a standard sense. In other words, the LLMs we work with all the time are significantly undertrained by a factor of maybe 100-1000X or more, nowhere near their point of convergence. Actually, I really hope people carry forward the trend and start training and releasing even more long-trained, even smaller models.
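The ~75X figure follows directly from the numbers quoted above; a quick sketch using the common ~20 tokens-per-parameter rule of thumb (an approximation of the Chinchilla fit, not the exact fitted law):

```python
# Napkin math behind the "~75X beyond the compute-optimal point" claim.
params = 8e9
rule_of_thumb_tokens = 20 * params      # ~1.6e11 (~160B), same ballpark as the ~200B quoted above
quoted_optimal_tokens = 200e9           # the ~200B figure from the post
actual_tokens = 15e12
print(actual_tokens / quoted_optimal_tokens)   # 75.0 -> "~75X beyond that point"
```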

Systems. Llama 3 is cited as trained with 16K GPUs at observed throughput of 400 TFLOPS. It's not mentioned but I'm assuming these are H100s at fp16, which clock in at 1,979 TFLOPS in NVIDIA marketing materials. But we all know their tiny asterisk (*with sparsity) is doing a lot of work, and really you want to divide this number by 2 to get the real TFLOPS of ~990. Why is sparsity counting as FLOPS? Anyway, focus Andrej. So 400/990 ~= 40% utilization, not too bad at all across that many GPUs! A lot of really solid engineering is required to get here at that scale.

TLDR: Super welcome, Llama 3 is a very capable looking model release from Meta. Sticking to fundamentals, spending a lot of quality time on solid systems and data work, exploring the limits of long-training models. Also very excited for the 400B model, which could be the first GPT-4 grade open source release. I think many people will ask for more context length.

Personal ask: I think I'm not alone to say that I'd also love much smaller models than 8B, for educational work, and for (unit) testing, and maybe for embedded applications etc. Ideally at ~100M and ~1B scale.

Talk to it at Meta AI
Integration with GitHub - pytorch/torchtune: A Native-PyTorch Library for LLM Fine-tuning

2/3
The model card has some more interesting info too:

Note that Llama 3 8B is actually somewhere in the territory of Llama 2 70B, depending on where you look. This might seem confusing at first but note that the former was trained for 15T tokens, while the latter for 2T tokens.

The single number that should summarize your expectations about any LLM is the number of total flops that went into its training.

Strength of Llama 3 8B
We see that Llama 3 8B was trained for 1.3M GPU hours, with throughput of 400 TFLOPS. So we have that the total number of FLOPs was:

1.3e6 hours * 400e12 FLOP/s * 3600 s/hour ~= 1.8e24

the napkin math via a different estimation method of FLOPs = 6ND (N is params D is tokens), gives:

6 * 8e9 * 15e12 = 7.2e23

These two should agree, maybe some of the numbers are fudged a bit. Let's trust the first estimate a bit more, Llama 3 8B is a ~2e24 model.

Strength of Llama 3 70B

6.4e6 hours * 400e12 FLOP/s * 3600 s/hour ~= 9.2e24
alternatively:
6 * 70e9 * 15e12 = 6.3e24

So Llama 3 70B is a ~9e24 model.

Strength of Llama 3 400B

If the 400B model trains on the same dataset, we'd get up to ~4e25. This starts to really get up there. The Biden Executive Order had the reporting requirement set at 1e26, so this could be ~2X below that.

The only other point of comparison we'd have available is if you look at the alleged GPT-4 leaks, which have never been confirmed; this would be ~2X those numbers.

Now, there's a lot more that goes into the performance of a model that doesn't fit on the napkin. E.g. data quality especially, but if you had to reduce a model to a single number, this is how you'd try, because it combines the size of the model with the length of training into a single "strength": how many total FLOPs went into it.
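For anyone who wants to rerun the napkin math, a minimal sketch using only the numbers quoted above:

```python
def flops_from_gpu_hours(gpu_hours, flops_per_gpu=400e12):
    """Total training FLOPs from GPU-hours at an observed sustained throughput."""
    return gpu_hours * flops_per_gpu * 3600

def flops_from_6nd(params, tokens):
    """Classic FLOPs ~= 6 * N * D approximation for dense transformer training."""
    return 6 * params * tokens

# Llama 3 8B, numbers as quoted above
print(flops_from_gpu_hours(1.3e6))   # ~1.9e24 (the post rounds to ~1.8e24)
print(flops_from_6nd(8e9, 15e12))    # ~7.2e23

# Llama 3 70B
print(flops_from_gpu_hours(6.4e6))   # ~9.2e24
print(flops_from_6nd(70e9, 15e12))   # ~6.3e24
```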

3/3
no. people misunderstand chinchilla.
chinchilla doesn't tell you the point of convergence.
it tells you the point of compute optimality.
if all you care about is perplexity, for every FLOPs compute budget, how big model on how many tokens should you train?
for reasons not fully intuitively understandable, severely under-trained models seem to be compute optimal.
in many practical settings though, this is not what you care about.
what you care about is what is the best possible model at some model size? (e.g. 8B, that is all that i can fit on my GPU or something)
and the best possible model at that size is the one you continue training ~forever.
you're "wasting" flops and you could have had a much stronger, (but bigger) model with those flops.
but you're getting an increasingly stronger model that fits.
and seemingly this continues to be true without too much diminishing returns for a very long time.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739



Large Language Model

Introducing Meta Llama 3: The most capable openly available LLM to date

April 18, 2024

Takeaways:

  • Today, we’re introducing Meta Llama 3, the next generation of our state-of-the-art open source large language model.
  • Llama 3 models will soon be available on AWS, Databricks, Google Cloud, Hugging Face, Kaggle, IBM WatsonX, Microsoft Azure, NVIDIA NIM, and Snowflake, and with support from hardware platforms offered by AMD, AWS, Dell, Intel, NVIDIA, and Qualcomm.
  • We’re dedicated to developing Llama 3 in a responsible way, and we’re offering various resources to help others use it responsibly as well. This includes introducing new trust and safety tools with Llama Guard 2, Code Shield, and CyberSec Eval 2.
  • In the coming months, we expect to introduce new capabilities, longer context windows, additional model sizes, and enhanced performance, and we’ll share the Llama 3 research paper.
  • Meta AI, built with Llama 3 technology, is now one of the world’s leading AI assistants that can boost your intelligence and lighten your load—helping you learn, get things done, create content, and connect to make the most out of every moment. You can try Meta AI here.

Today, we’re excited to share the first two models of the next generation of Llama, Meta Llama 3, available for broad use. This release features pretrained and instruction-fine-tuned language models with 8B and 70B parameters that can support a broad range of use cases. This next generation of Llama demonstrates state-of-the-art performance on a wide range of industry benchmarks and offers new capabilities, including improved reasoning. We believe these are the best open source models of their class, period. In support of our longstanding open approach, we’re putting Llama 3 in the hands of the community. We want to kickstart the next wave of innovation in AI across the stack—from applications to developer tools to evals to inference optimizations and more. We can’t wait to see what you build and look forward to your feedback.

Our goals for Llama 3

With Llama 3, we set out to build the best open models that are on par with the best proprietary models available today. We wanted to address developer feedback to increase the overall helpfulness of Llama 3 and are doing so while continuing to play a leading role on responsible use and deployment of LLMs. We are embracing the open source ethos of releasing early and often to enable the community to get access to these models while they are still in development. The text-based models we are releasing today are the first in the Llama 3 collection of models. Our goal in the near future is to make Llama 3 multilingual and multimodal, have longer context, and continue to improve overall performance across core LLM capabilities such as reasoning and coding.

State-of-the-art performance

Our new 8B and 70B parameter Llama 3 models are a major leap over Llama 2 and establish a new state-of-the-art for LLM models at those scales. Thanks to improvements in pretraining and post-training, our pretrained and instruction-fine-tuned models are the best models existing today at the 8B and 70B parameter scale. Improvements in our post-training procedures substantially reduced false refusal rates, improved alignment, and increased diversity in model responses. We also saw greatly improved capabilities like reasoning, code generation, and instruction following making Llama 3 more steerable.

438037375_405784438908376_6082258861354187544_n.png

*Please see evaluation details for setting and parameters with which these evaluations are calculated.

In the development of Llama 3, we looked at model performance on standard benchmarks and also sought to optimize for performance for real-world scenarios. To this end, we developed a new high-quality human evaluation set. This evaluation set contains 1,800 prompts that cover 12 key use cases: asking for advice, brainstorming, classification, closed question answering, coding, creative writing, extraction, inhabiting a character/persona, open question answering, reasoning, rewriting, and summarization. To prevent accidental overfitting of our models on this evaluation set, even our own modeling teams do not have access to it. The chart below shows aggregated results of our human evaluations across these categories and prompts against Claude Sonnet, Mistral Medium, and GPT-3.5.

438998263_1368970367138244_7396600838045603809_n.png

Preference rankings by human annotators based on this evaluation set highlight the strong performance of our 70B instruction-following model compared to competing models of comparable size in real-world scenarios.

Our pretrained model also establishes a new state-of-the-art for LLM models at those scales.

439014085_432870519293677_8138616034495713484_n.png

*Please see evaluation details for setting and parameters with which these evaluations are calculated.

To develop a great language model, we believe it’s important to innovate, scale, and optimize for simplicity. We adopted this design philosophy throughout the Llama 3 project with a focus on four key ingredients: the model architecture, the pretraining data, scaling up pretraining, and instruction fine-tuning.

Model architecture

In line with our design philosophy, we opted for a relatively standard decoder-only transformer architecture in Llama 3. Compared to Llama 2, we made several key improvements. Llama 3 uses a tokenizer with a vocabulary of 128K tokens that encodes language much more efficiently, which leads to substantially improved model performance. To improve the inference efficiency of Llama 3 models, we’ve adopted grouped query attention (GQA) across both the 8B and 70B sizes. We trained the models on sequences of 8,192 tokens, using a mask to ensure self-attention does not cross document boundaries.
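A minimal sketch of the kind of mask described here, built from per-token document ids so that attention stays causal and never crosses a packed-document boundary (an illustration, not Meta's code):

```python
import numpy as np

def doc_causal_mask(doc_ids):
    """Boolean mask: position i may attend to position j iff j <= i and both
    tokens belong to the same packed document. doc_ids is one int per token."""
    doc_ids = np.asarray(doc_ids)
    same_doc = doc_ids[:, None] == doc_ids[None, :]
    causal = np.tril(np.ones((len(doc_ids), len(doc_ids)), dtype=bool))
    return same_doc & causal

# Two documents packed into one 6-token training sequence (toy example).
print(doc_causal_mask([0, 0, 0, 1, 1, 1]).astype(int))
```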

Training data

To train the best language model, the curation of a large, high-quality training dataset is paramount. In line with our design principles, we invested heavily in pretraining data. Llama 3 is pretrained on over 15T tokens that were all collected from publicly available sources. Our training dataset is seven times larger than that used for Llama 2, and it includes four times more code. To prepare for upcoming multilingual use cases, over 5% of the Llama 3 pretraining dataset consists of high-quality non-English data that covers over 30 languages. However, we do not expect the same level of performance in these languages as in English.

To ensure Llama 3 is trained on data of the highest quality, we developed a series of data-filtering pipelines. These pipelines include using heuristic filters, NSFW filters, semantic deduplication approaches, and text classifiers to predict data quality. We found that previous generations of Llama are surprisingly good at identifying high-quality data, hence we used Llama 2 to generate the training data for the text-quality classifiers that are powering Llama 3.
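As a toy illustration of one of these stages, here is a minimal embedding-based near-duplicate filter; the greedy cosine-similarity approach and the threshold are assumptions for the sketch, not Meta's actual pipeline:

```python
import numpy as np

def semantic_dedup(embeddings, threshold=0.95):
    """Greedy near-duplicate removal: keep a document only if its cosine
    similarity to every already-kept document is below the threshold."""
    kept, kept_vecs = [], []
    for i, v in enumerate(embeddings):
        v = v / np.linalg.norm(v)
        if all(float(v @ k) < threshold for k in kept_vecs):
            kept.append(i)
            kept_vecs.append(v)
    return kept

docs = np.random.rand(100, 64)   # stand-in for document embeddings
print(len(semantic_dedup(docs)))
```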

We also performed extensive experiments to evaluate the best ways of mixing data from different sources in our final pretraining dataset. These experiments enabled us to select a data mix that ensures that Llama 3 performs well across use cases including trivia questions, STEM, coding, historical knowledge, etc.

Scaling up pretraining

To effectively leverage our pretraining data in Llama 3 models, we put substantial effort into scaling up pretraining. Specifically, we have developed a series of detailed scaling laws for downstream benchmark evaluations. These scaling laws enable us to select an optimal data mix and to make informed decisions on how to best use our training compute. Importantly, scaling laws allow us to predict the performance of our largest models on key tasks (for example, code generation as evaluated on the HumanEval benchmark—see above) before we actually train the models. This helps us ensure strong performance of our final models across a variety of use cases and capabilities.

We made several new observations on scaling behavior during the development of Llama 3. For example, while the Chinchilla-optimal amount of training compute for an 8B parameter model corresponds to ~200B tokens, we found that model performance continues to improve even after the model is trained on two orders of magnitude more data. Both our 8B and 70B parameter models continued to improve log-linearly after we trained them on up to 15T tokens. Larger models can match the performance of these smaller models with less training compute, but smaller models are generally preferred because they are much more efficient during inference.

To train our largest Llama 3 models, we combined three types of parallelization: data parallelization, model parallelization, and pipeline parallelization. Our most efficient implementation achieves a compute utilization of over 400 TFLOPS per GPU when trained on 16K GPUs simultaneously. We performed training runs on two custom-built 24K GPU clusters. To maximize GPU uptime, we developed an advanced new training stack that automates error detection, handling, and maintenance. We also greatly improved our hardware reliability and detection mechanisms for silent data corruption, and we developed new scalable storage systems that reduce overheads of checkpointing and rollback. Those improvements resulted in an overall effective training time of more than 95%. Combined, these improvements increased the efficiency of Llama 3 training by ~three times compared to Llama 2.
 
Last edited:

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739
Instruction fine-tuning

To fully unlock the potential of our pretrained models in chat use cases, we innovated on our approach to instruction-tuning as well. Our approach to post-training is a combination of supervised fine-tuning (SFT), rejection sampling, proximal policy optimization (PPO), and direct policy optimization (DPO). The quality of the prompts that are used in SFT and the preference rankings that are used in PPO and DPO has an outsized influence on the performance of aligned models. Some of our biggest improvements in model quality came from carefully curating this data and performing multiple rounds of quality assurance on annotations provided by human annotators.

Learning from preference rankings via PPO and DPO also greatly improved the performance of Llama 3 on reasoning and coding tasks. We found that if you ask a model a reasoning question that it struggles to answer, the model will sometimes produce the right reasoning trace: The model knows how to produce the right answer, but it does not know how to select it. Training on preference rankings enables the model to learn how to select it.
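For reference, a minimal sketch of the standard DPO objective (the Rafailov et al. formulation; shapes and hyperparameters are illustrative, and this is not Meta's training code):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss: push the policy to prefer the chosen response over the
    rejected one, relative to a frozen reference model. Inputs are summed
    log-probs of each full response; beta is the usual temperature."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of 4 preference pairs.
loss = dpo_loss(torch.randn(4), torch.randn(4), torch.randn(4), torch.randn(4))
print(loss.item())
```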

Building with Llama 3

Our vision is to enable developers to customize Llama 3 to support relevant use cases and to make it easier to adopt best practices and improve the open ecosystem. With this release, we’re providing new trust and safety tools including updated components with both Llama Guard 2 and Cybersec Eval 2, and the introduction of Code Shield—an inference time guardrail for filtering insecure code produced by LLMs.

We’ve also co-developed Llama 3 with torchtune, the new PyTorch-native library for easily authoring, fine-tuning, and experimenting with LLMs. torchtune provides memory efficient and hackable training recipes written entirely in PyTorch. The library is integrated with popular platforms such as Hugging Face, Weights & Biases, and EleutherAI and even supports Executorch for enabling efficient inference to be run on a wide variety of mobile and edge devices. For everything from prompt engineering to using Llama 3 with LangChain, we have a comprehensive getting started guide that takes you from downloading Llama 3 all the way to deployment at scale within your generative AI application.

A system-level approach to responsibility

We have designed Llama 3 models to be maximally helpful while ensuring an industry leading approach to responsibly deploying them. To achieve this, we have adopted a new, system-level approach to the responsible development and deployment of Llama. We envision Llama models as part of a broader system that puts the developer in the driver’s seat. Llama models will serve as a foundational piece of a system that developers design with their unique end goals in mind.

438922663_1135166371264105_805978695964769385_n.png

Instruction fine-tuning also plays a major role in ensuring the safety of our models. Our instruction-fine-tuned models have been red-teamed (tested) for safety through internal and external efforts. Our red teaming approach leverages human experts and automation methods to generate adversarial prompts that try to elicit problematic responses. For instance, we apply comprehensive testing to assess risks of misuse related to Chemical, Biological, Cyber Security, and other risk areas. All of these efforts are iterative and used to inform safety fine-tuning of the models being released. You can read more about our efforts in the model card.

Llama Guard models are meant to be a foundation for prompt and response safety and can easily be fine-tuned to create a new taxonomy depending on application needs. As a starting point, the new Llama Guard 2 uses the recently announced MLCommons taxonomy, in an effort to support the emergence of industry standards in this important area. Additionally, CyberSecEval 2 expands on its predecessor by adding measures of an LLM’s propensity to allow for abuse of its code interpreter, offensive cybersecurity capabilities, and susceptibility to prompt injection attacks (learn more in our technical paper). Finally, we’re introducing Code Shield which adds support for inference-time filtering of insecure code produced by LLMs. This offers mitigation of risks around insecure code suggestions, code interpreter abuse prevention, and secure command execution.

With the speed at which the generative AI space is moving, we believe an open approach is an important way to bring the ecosystem together and mitigate these potential harms. As part of that, we’re updating our Responsible Use Guide (RUG) that provides a comprehensive guide to responsible development with LLMs. As we outlined in the RUG, we recommend that all inputs and outputs be checked and filtered in accordance with content guidelines appropriate to the application. Additionally, many cloud service providers offer content moderation APIs and other tools for responsible deployment, and we encourage developers to also consider using these options.

Deploying Llama 3 at scale

Llama 3 will soon be available on all major platforms including cloud providers, model API providers, and much more. Llama 3 will be everywhere.

Our benchmarks show the tokenizer offers improved token efficiency, yielding up to 15% fewer tokens compared to Llama 2. Also, Group Query Attention (GQA) now has been added to Llama 3 8B as well. As a result, we observed that despite the model having 1B more parameters compared to Llama 2 7B, the improved tokenizer efficiency and GQA contribute to maintaining the inference efficiency on par with Llama 2 7B.

For examples of how to leverage all of these capabilities, check out Llama Recipes which contains all of our open source code that can be leveraged for everything from fine-tuning to deployment to model evaluation.

What’s next for Llama 3?

The Llama 3 8B and 70B models mark the beginning of what we plan to release for Llama 3. And there’s a lot more to come.

Our largest models are over 400B parameters and, while these models are still training, our team is excited about how they’re trending. Over the coming months, we’ll release multiple models with new capabilities including multimodality, the ability to converse in multiple languages, a much longer context window, and stronger overall capabilities. We will also publish a detailed research paper once we are done training Llama 3.

To give you a sneak preview for where these models are today as they continue training, we thought we could share some snapshots of how our largest LLM model is trending. Please note that this data is based on an early checkpoint of Llama 3 that is still training and these capabilities are not supported as part of the models released today.

439015366_1603174683862748_5008894608826037916_n.png

*Please see evaluation details for setting and parameters with which these evaluations are calculated.

We’re committed to the continued growth and development of an open AI ecosystem for releasing our models responsibly. We have long believed that openness leads to better, safer products, faster innovation, and a healthier overall market. This is good for Meta, and it is good for society. We’re taking a community-first approach with Llama 3, and starting today, these models are available on the leading cloud, hosting, and hardware platforms with many more to come.

Try Meta Llama 3 today

We’ve integrated our latest models into Meta AI, which we believe is the world’s leading AI assistant. It’s now built with Llama 3 technology and it’s available in more countries across our apps.

You can use Meta AI on Facebook, Instagram, WhatsApp, Messenger, and the web to get things done, learn, create, and connect with the things that matter to you. You can read more about the Meta AI experience here.

Visit the Llama 3 website to download the models and reference the Getting Started Guide for the latest list of all available platforms.

You’ll also soon be able to test multimodal Meta AI on our Ray-Ban Meta smart glasses.

As always, we look forward to seeing all the amazing products and experiences you will build with Meta Llama 3.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739

1/1
The new Meta AI will create images as you type, give you the option to animate the images it creates, and then lets you create a video of the images that were generated while you typed.

Pretty cool stuff.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


1/1
The new Llama 3 model is the smartest AI I've tried so far, as Meta claims in their benchmarks.

The "Imagine" feature is incredibly fast; images are generated as you type. You can even animate all images with one click and create videos from the combined images.

Apparently, Meta seems ready for all of their social platforms to be flooded with these AI images, but the quality is too good for just short prompts. Other models require you to give really long prompts to get good quality images.

The crazy thing is that the model is still in training, so imagine how the new version will be. And on top of that, it's open source! Can't wait to download it and start building.

You can try it online here: Meta AI


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



1/1
It's here! Our new image generation is now faster and higher quality—you can create images from text in real-time as you type, animate images and turn them into GIFs. Imagine it. Create it. Have fun!

Now rolling out on WhatsApp and the Meta AI web experience in the US.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,739

VASA-1: Lifelike Audio-Driven Talking Faces​

Sicheng Xu*, Guojun Chen*, Yu-Xiao Guo*, Jiaolong Yang*‡,

Chong Li, Zhenyu Zang, Yizhong Zhang, Xin Tong, Baining GuoMicrosoft Research Asia

*Equal Contributions ‡Corresponding Author: jiaoyan@microsoft.com

arXiv PDF

TL;DR: single portrait photo + speech audio = hyper-realistic talking face video with precise lip-audio sync, lifelike facial behavior, and naturalistic head movements, generated in real time.

teaser.jpg

Abstract​

We introduce VASA, a framework for generating lifelike talking faces of virtual characters with appealing visual affective skills (VAS), given a single static image and a speech audio clip. Our premiere model, VASA-1, is capable of not only producing lip movements that are exquisitely synchronized with the audio, but also capturing a large spectrum of facial nuances and natural head motions that contribute to the perception of authenticity and liveliness. The core innovations include a holistic facial dynamics and head movement generation model that works in a face latent space, and the development of such an expressive and disentangled face latent space using videos. Through extensive experiments including evaluation on a set of new metrics, we show that our method significantly outperforms previous methods along various dimensions comprehensively. Our method not only delivers high video quality with realistic facial and head dynamics but also supports the online generation of 512x512 videos at up to 40 FPS with negligible starting latency. It paves the way for real-time engagements with lifelike avatars that emulate human conversational behaviors.

(Note: all portrait images on this page are virtual, non-existing identities generated by StyleGAN2 or DALL·E-3 (except for Mona Lisa). We are exploring visual affective skill generation for virtual, interactive characters, NOT impersonating any person in the real world. This is only a research demonstration and there's no product or API release plan. See also the bottom of this page for more of our Responsible AI considerations.)

Realism and liveliness​

Our method is capable of not only producing precise lip-audio synchronization, but also generating a large spectrum of expressive facial nuances and natural head motions. It can handle arbitrary-length audio and stably output seamless talking face videos.
  1. https://vasavatar.github.io/VASA-1/video/l5.mp4
  2. https://vasavatar.github.io/VASA-1/video/l8.mp4
  3. https://vasavatar.github.io/VASA-1/video/l3.mp4
  4. https://vasavatar.github.io/VASA-1/video/l4.mp4
  5. https://vasavatar.github.io/VASA-1/video/l7.mp4
  6. https://vasavatar.github.io/VASA-1/video/l2.mp4


Examples with audio input of one minute long.

More shorter examples with diverse audio input


Controllability of generation​

Our diffusion model accepts optional signals as condition, such as main eye gaze direction and head distance, and emotion offsets.
Generation results under different main gaze directions (forward-facing, leftwards, rightwards, and upwards, respectively)
Generation results under different head distance scales
Generation results under different emotion offsets (neutral, happiness, anger, and surprise, respectively)


Out-of-distribution generalization​

Our method exhibits the capability to handle photo and audio inputs that are out of the training distribution. For example, it can handle artistic photos, singing audios, and non-English speech. These types of data were not present in the training set.

Power of disentanglement​

Our latent representation disentangles appearance, 3D head pose, and facial dynamics, which enables separate attribute control and editing of the generated content.
Same input photo with different motion sequences (left two cases), and same motion sequence with different photos (right three cases)

https://vasavatar.github.io/VASA-1/video/male_disen.mp4
Pose and expression editing (raw generation result, pose-only result, expression-only result, and expression with spinning pose)


Real-time efficiency​

Our method generates video frames of 512x512 size at 45fps in the offline batch processing mode, and can support up to 40fps in the online streaming mode with a preceding latency of only 170ms, evaluated on a desktop PC with a single NVIDIA RTX 4090 GPU.

A real-time demo


Risks and responsible AI considerations​

Our research focuses on generating visual affective skills for virtual AI avatars, aiming for positive applications. It is not intended to create content that is used to mislead or deceive. However, like other related content generation techniques, it could still potentially be misused for impersonating humans. We are opposed to any behavior to create misleading or harmful contents of real persons, and are interested in applying our technique for advancing forgery detection. Currently, the videos generated by this method still contain identifiable artifacts, and the numerical analysis shows that there's still a gap to achieve the authenticity of real videos.

While acknowledging the possibility of misuse, it's imperative to recognize the substantial positive potential of our technique. The benefits – such as enhancing educational equity, improving accessibility for individuals with communication challenges, offering companionship or therapeutic support to those in need, among many others – underscore the importance of our research and other related explorations. We are dedicated to developing AI responsibly, with the goal of advancing human well-being.

Given such context, we have no plans to release an online demo, API, product, additional implementation details, or any related offerings until we are certain that the technology will be used responsibly and in accordance with proper regulations.


 