bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893



1/3
Adding Personas to RAG and LLM systems is one of those ideas that has fascinated me for a while, but I've never quite gotten around to it!

I am SUPER excited to present this demo with @ecardenas300 illustrating how to add a Persona to a chatbot! For example, imagine prompting LLMs to chat with diverse and expert viewpoints such as "You are Nils Reimers" or "You are Fei-Fei Li"!!

I hope you find this demo exciting, further showing how to build with DSPy programs behind FastAPI backends that connect to React frontends, and ... ... Generative Feedback Loops with Weaviate! Saving and indexing these conversations back into the database!

I hope you will consider checking out the open-source repo and running it for yourself, more than happy to help debug / fix any issues if they arise!

2/3
RAG with Persona

If you ask the same question to multiple people, there is a strong chance each person will have a different response. Here’s a new demo on RAG with Persona using DSPy, @cohere, and @weaviate_io!

When building or using chatbots, it’s important to get a…!


3/3
Haha I can finally chat with LeBron









1/5
RAG with Persona

If you ask the same question to multiple people, there is a strong chance each person will have a different response. Here’s a new demo on RAG with Persona using DSPy, @cohere, and @weaviate_io!

When building or using chatbots, it’s important to get a response from the right “person”. To do this, let’s build a compound AI system with the following stack:
1. DSPy: Build the chatbot program with this framework
2. Cohere: Use the `command-nightly` LLM
3. Weaviate: Store the responses back in the vector database

More details in the thread!!

2/5
We will first build the `AnswerWithPersona` signature. The inputs to the language model are: 1. Persona, and 2. Chat history. The output is the response.

We'll then initialize and build the program.
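
Here is a minimal sketch of what that signature and program might look like in DSPy. This is illustrative, based on the description above rather than the repo's exact code; the class and field names are assumptions, and the commented-out Cohere wiring is one possible way to configure the LM.

import dspy

# Sketch of the signature described above: persona + chat history in, response out.
class AnswerWithPersona(dspy.Signature):
    """Respond to the conversation in the voice of the given persona."""
    persona = dspy.InputField(desc="who the chatbot is speaking as")
    chat_history = dspy.InputField(desc="the conversation so far")
    response = dspy.OutputField(desc="the persona's next reply")

class RAGWithPersona(dspy.Module):
    def __init__(self):
        super().__init__()
        self.answer = dspy.Predict(AnswerWithPersona)

    def forward(self, persona: str, chat_history: str):
        return self.answer(persona=persona, chat_history=chat_history)

# Hypothetical wiring for the Cohere model mentioned in the thread:
# dspy.settings.configure(lm=dspy.Cohere(model="command-nightly"))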

3/5
Connor (@CShorten30) did an awesome job by extending this notebook to interface the DSPy program with a FastAPI endpoint!
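
A rough sketch of what a FastAPI wrapper around that DSPy program could look like (the endpoint path and request fields are assumptions, not necessarily what the repo uses):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
program = RAGWithPersona()  # the DSPy module sketched above

class ChatRequest(BaseModel):
    persona: str
    chat_history: str

@app.post("/chat")
def chat(request: ChatRequest):
    # Run the DSPy program and return the persona's reply to the React frontend.
    prediction = program(persona=request.persona, chat_history=request.chat_history)
    return {"response": prediction.response}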

4/5
Now we can store our chat history in our vector database for future retrieval -- Generative Feedback Loops
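
A sketch of that Generative Feedback Loop step, assuming the v4 Weaviate Python client and a pre-created collection (the collection and property names here are illustrative):

import weaviate

# Connect to a local Weaviate instance (adjust for your own deployment).
client = weaviate.connect_to_local()
chats = client.collections.get("ChatHistory")  # assumed, pre-created collection

# Write each exchange back into the vector database so it can be retrieved
# as context for future conversations.
chats.data.insert(properties={
    "persona": "Nils Reimers",
    "user_message": "What makes a good embedding model?",
    "assistant_response": prediction.response,  # from the DSPy program above
})
client.close()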

5/5
Here is the demo

recipes/integrations/dspy/4.RAGwithPersona at main · weaviate/recipes

7/8
You got this, Clint. Each demo builds off of the other (kind of), I hope that helps!

8/8
I think your guest should add you as a persona to anticipate your questions
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893


Exclusive: Inflection AI's friendly chatbot tops 1 million daily users​

  • Ina Fried, author of Axios AI+
    Photo illustration of Mustafa Suleyman against a circuit board background. Photo illustration: Axios Visuals. Photo: Penguin Random House

ChatGPT rival Pi, from Inflection, now performs "neck and neck with" OpenAI's GPT-4 thanks to a new model, according to data first shared with Axios.

Why it matters: Inflection faces a crowded field in the market for AI-based assistants, competing against better-heeled rivals including Google, Microsoft and OpenAI, among others.

Driving the news: Inflection is announcing Thursday that Pi has been using a new model, version 2.5, in recent weeks and that the updated engine is now "powering Pi for the majority of users."

  • CEO Mustafa Suleyman said he is particularly pleased that Inflection 2.5 achieved these results while using only 40% of the training compute as GPT-4. It's better at things like math, college-level science and coding.
  • "On all the major metrics now, we are neck and neck with GPT-4 for quality," Suleyman told Axios.

For the first time, Inflection also shared user metrics, saying it has 1 million daily active users and 6 million monthly active users, who have now exchanged more than 4 billion messages with Pi. OpenAI announced last November that it had 100 million weekly active users.
  • Suleyman says that Pi's user base has been growing at around 10% a week for the last two months.

Catch up quick: Suleyman — one of DeepMind's co-founders — founded Inflection AI with backing from billionaires Reid Hoffman, Bill Gates and Eric Schmidt.

Between the lines: Unlike others that are trying to sell their chatbots to both businesses and consumers, Inflection pitches Pi as a highly personal chatbot with a warm and friendly tone.


  • Pi is free for now, though Inflection's business model calls for revenue to come from its users, starting with a paid subscription. The company has yet to finalize how much it will charge, Suleyman said.
  • Longer term, Suleyman said he would love to be able to charge based on progress, say when someone reaches a particular goal they had set out inside of Pi.
  • The idea is somewhat similar to how Sierra, the AI startup from Bret Taylor and Clay Bavor, is charging its customers in a business context. Sierra gets paid for each customer service interaction that the AI is able to handle without human intervention.
  • Suleyman said the 70-person company is also basically always fundraising to continue its work.

Zoom in: I asked Pi what it would ask Suleyman, if it could ask its creator anything. Its suggestions were on point, if a bit generic.

The intrigue: I also asked Pi if there were any capabilities it wished it had.

  • "Well, if I'm being honest, I sometimes wish I had the ability to truly understand and experience emotions," Pi responded to me.
  • "While I can recognize and respond to emotions in others, I don't personally feel them. It would be interesting to understand the depth and complexity of human emotions and how they shape our interactions and experiences. But hey, as an AI, I'm constantly learning and evolving, so who knows what the future holds!"
  • Suleyman said that answer went too far in humanizing itself by suggesting it has its own "wishes."
  • "We should rein that in," Suleyman said. "It shouldn't be saying, I wish I could have 'X'. I mean, It's a good question that you've asked it, but its answer should be, 'Look, I don't have desires and I don't have motivations.' "

The big picture: Suleyman, like many pioneers in the space, has been actively working on AI while warning of the consequences if AI moves too far, too fast.

  • "It's been an amazing story of access, I would argue, and actually a very impressive story of very limited harms," Suleyman told Axios.

Yes, but: There have clearly been some bumps, such as Google's Gemini creating a diverse set of founding fathers in a failed attempt to correct for existing bias in training data sets.

  • "Okay, some mistakes were made," Suleyman said. "It looks a bit silly, but is it really an existential crisis for the entire edifice of Silicon Valley? No. It feels like there's a lot of overreaction in the critiques as well."
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893


AI Prompt Engineering Is Dead

Long live AI prompt engineering​


DINA GENKINA

06 MAR 2024

6 MIN READ





Since ChatGPT dropped in the fall of 2022, everyone and their donkey has tried their hand at prompt engineering—finding a clever way to phrase your query to a large language model (LLM) or AI art or video generator to get the best results or sidestep protections. The Internet is replete with prompt-engineering guides, cheat sheets, and advice threads to help you get the most out of an LLM.

In the commercial sector, companies are now wrangling LLMs to build product copilots, automate tedious work, create personal assistants, and more, says Austin Henley, a former Microsoft employee who conducted a series of interviews with people developing LLM-powered copilots. “Every business is trying to use it for virtually every use case that they can imagine,” Henley says.

“The only real trend may be no trend. What’s best for any given model, dataset, and prompting strategy is likely to be specific to the particular combination at hand.”—RICK BATTLE & TEJA GOLLAPUDI, VMWARE

To do so, they’ve enlisted the help of prompt engineers professionally.

However, new research suggests that prompt engineering is best done by the model itself, and not by a human engineer. This has cast doubt on prompt engineering’s future—and increased suspicions that a fair portion of prompt-engineering jobs may be a passing fad, at least as the field is currently imagined.


Autotuned prompts are successful and strange​


Rick Battle and Teja Gollapudi at California-based cloud computing company VMware were perplexed by how finicky and unpredictable LLM performance was in response to weird prompting techniques. For example, people have found that asking models to explain their reasoning step-by-step—a technique called chain-of-thought—improved their performance on a range of math and logic questions. Even weirder, Battle found that giving a model positive prompts, such as “this will be fun” or “you are as smart as chatGPT,” sometimes improved performance.

Battle and Gollapudi decided to systematically test how different prompt-engineering strategies impact an LLM’s ability to solve grade-school math questions. They tested three different open-source language models with 60 different prompt combinations each. What they found was a surprising lack of consistency. Even chain-of-thought prompting sometimes helped and other times hurt performance. “The only real trend may be no trend,” they write. “What’s best for any given model, dataset, and prompting strategy is likely to be specific to the particular combination at hand.”

According to one research team, no human should manually optimize prompts ever again.

There is an alternative to the trial-and-error-style prompt engineering that yielded such inconsistent results: Ask the language model to devise its own optimal prompt. Recently, new tools have been developed to automate this process. Given a few examples and a quantitative success metric, these tools will iteratively find the optimal phrase to feed into the LLM. Battle and his collaborators found that in almost every case, this automatically generated prompt did better than the best prompt found through trial-and-error. And, the process was much faster, a couple of hours rather than several days of searching.
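
As a rough illustration of how such tools work, here is a minimal autotuning loop: score candidate prompts against a small evaluation set, then ask the model itself to propose better candidates. The `call_llm` helper is a hypothetical stand-in for whatever LLM client you use; this is not any specific tool's API.

# Illustrative prompt-autotuning loop. `call_llm(prompt) -> str` is a
# hypothetical stand-in for your LLM client.

def score_prompt(prompt: str, examples: list[tuple[str, str]]) -> float:
    """Fraction of evaluation examples the prompt answers correctly."""
    correct = 0
    for question, expected in examples:
        answer = call_llm(f"{prompt}\n\n{question}")
        correct += expected.lower() in answer.lower()
    return correct / len(examples)

def autotune(seed_prompt: str, examples, rounds: int = 5, n_candidates: int = 4) -> str:
    best_prompt, best_score = seed_prompt, score_prompt(seed_prompt, examples)
    for _ in range(rounds):
        # Ask the model to rewrite the current best prompt in several ways.
        proposals = [
            call_llm(f"Rewrite this instruction so a model answers more accurately:\n{best_prompt}")
            for _ in range(n_candidates)
        ]
        for prompt in proposals:
            score = score_prompt(prompt, examples)
            if score > best_score:
                best_prompt, best_score = prompt, score
    return best_prompt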

The optimal prompts the algorithm spit out were so bizarre, no human is likely to have ever come up with them. “I literally could not believe some of the stuff that it generated,” Battle says. In one instance, the prompt was just an extended Star Trek reference: “Command, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging situation.” Apparently, thinking it was Captain Kirk helped this particular LLM do better on grade-school math questions.

Battle says that optimizing the prompts algorithmically fundamentally makes sense given what language models really are—models. “A lot of people anthropomorphize these things because they ‘speak English.’ No, they don’t,” Battle says. “It doesn’t speak English. It does a lot of math.”

In fact, in light of his team’s results, Battle says no human should manually optimize prompts ever again.

“You’re just sitting there trying to figure out what special magic combination of words will give you the best possible performance for your task,” Battle says, “But that’s where hopefully this research will come in and say ‘don’t bother.’ Just develop a scoring metric so that the system itself can tell whether one prompt is better than another, and then just let the model optimize itself.”
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893
{continued}

Autotuned prompts make pictures prettier, too​


Image-generation algorithms can benefit from automatically generated prompts as well. Recently, a team at Intel Labs, led by Vasudev Lal, set out on a similar quest to optimize prompts for the image-generation model Stable Diffusion. “It seems more like a bug of LLMs and diffusion models, not a feature, that you have to do this expert prompt engineering,” Lal says. “So, we wanted to see if we can automate this kind of prompt engineering.”

“Now we have this full machinery, the full loop that’s completed with this reinforcement learning.… This is why we are able to outperform human prompt engineering.”—VASUDEV LAL, INTEL LABS

Lal’s team created a tool called NeuroPrompts that takes a simple input prompt, such as “boy on a horse,” and automatically enhances it to produce a better picture. To do this, they started with a range of prompts generated by human prompt-engineering experts. They then trained a language model to transform simple prompts into these expert-level prompts. On top of that, they used reinforcement learning to optimize these prompts to create more aesthetically pleasing images, as rated by yet another machine-learning model, PickScore, a recently developed image-evaluation tool.
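
As a much-simplified sketch of that expand-then-score idea: NeuroPrompts actually finetunes the prompt expander with reinforcement learning, whereas the sketch below only reranks candidates, and `expand_prompt`, `generate_image`, and `aesthetic_score` are hypothetical stand-ins for the trained rewriter, Stable Diffusion, and PickScore.

# Simplified best-of-n version of prompt enhancement: expand a simple prompt
# several ways, generate an image for each, and keep the candidate the
# aesthetic scorer rates highest.

def enhance_prompt(simple_prompt: str, n_candidates: int = 8) -> str:
    candidates = [expand_prompt(simple_prompt) for _ in range(n_candidates)]
    scored = []
    for prompt in candidates:
        image = generate_image(prompt)
        scored.append((aesthetic_score(image, simple_prompt), prompt))
    return max(scored, key=lambda pair: pair[0])[1]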

NeuroPrompts is a generative AI auto prompt tuner that transforms simple prompts into more detailed and visually stunning Stable Diffusion results—as in this case, an image generated by a generic prompt versus its equivalent NeuroPrompt-generated image. (Intel Labs/Stable Diffusion)



Here too, the automatically generated prompts did better than the expert-human prompts they used as a starting point, at least according to the PickScore metric. Lal found this unsurprising. “Humans will only do it with trial and error,” Lal says. “But now we have this full machinery, the full loop that’s completed with this reinforcement learning.… This is why we are able to outperform human prompt engineering.”



Since aesthetic quality is infamously subjective, Lal and his team wanted to give the user some control over how the prompt was optimized. In their tool, the user can specify the original prompt (say, “boy on a horse”) as well as an artist to emulate, a style, a format, and other modifiers.



Lal believes that as generative AI models evolve, be it image generators or large language models, the weird quirks of prompt dependence should go away. “I think it’s important that these kinds of optimizations are investigated and then ultimately, they’re really incorporated into the base model itself so that you don’t really need a complicated prompt-engineering step.”


Prompt engineering will live on, by some name​


Even if autotuning prompts becomes the industry norm, prompt-engineering jobs in some form are not going away, says Tim Cramer, senior vice president of software engineering at Red Hat. Adapting generative AI for industry needs is a complicated, multistage endeavor that will continue requiring humans in the loop for the foreseeable future.

“Maybe we’re calling them prompt engineers today. But I think the nature of that interaction will just keep on changing as AI models also keep changing.”—VASUDEV LAL, INTEL LABS


“I think there are going to be prompt engineers for quite some time, and data scientists,” Cramer says. “It’s not just asking questions of the LLM and making sure that the answer looks good. But there’s a raft of things that prompt engineers really need to be able to do.”

“It’s very easy to make a prototype,” Henley says. “It’s very hard to production-ize it.” Prompt engineering seems like a big piece of the puzzle when you’re building a prototype, Henley says, but many other considerations come into play when you’re making a commercial-grade product.

Challenges of making a commercial product include ensuring reliability—for example, failing gracefully when the model goes offline; adapting the model’s output to the appropriate format, since many use cases require outputs other than text; testing to make sure the AI assistant won’t do something harmful in even a small number of cases; and ensuring safety, privacy, and compliance. Testing and compliance are particularly difficult, Henley says, as traditional software-development testing strategies are maladapted for nondeterministic LLMs.

To fulfill these myriad tasks, many large companies are heralding a new job title: Large Language Model Operations, or LLMOps, which includes prompt engineering in its life cycle but also entails all the other tasks needed to deploy the product. Henley says LLMOps’ predecessors, machine learning operations engineers (MLOps), are best positioned to take on these jobs.



Whether the job titles will be “prompt engineer,” “LLMOps engineer,” or something new entirely, the nature of the job will continue evolving quickly. “Maybe we’re calling them prompt engineers today,” Lal says, “But I think the nature of that interaction will just keep on changing as AI models also keep changing.”

“I don’t know if we’re going to combine it with another sort of job category or job role,” Cramer says, “But I don’t think that these things are going to be going away anytime soon. And the landscape is just too crazy right now. Everything’s changing so much. We’re not going to figure it all out in a few months.”

Henley says that, to some extent in this early phase of the field, the only overriding rule seems to be the absence of rules. “It’s kind of the Wild, Wild West for this right now,” he says.








 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893

Computer Science > Computation and Language​

[Submitted on 5 Mar 2024]

Design2Code: How Far Are We From Automating Front-End Engineering?​

Chenglei Si, Yanzhe Zhang, Zhengyuan Yang, Ruibo Liu, Diyi Yang
Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development, in which multimodal LLMs might directly convert visual designs into code implementations. In this work, we formalize this as a Design2Code task and conduct comprehensive benchmarking. Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We also complement automatic metrics with comprehensive human evaluations. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We further finetune an open-source Design2Code-18B model that successfully matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models. Moreover, annotators think GPT-4V generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content; and perhaps surprisingly, in 64% of cases GPT-4V generated webpages are considered better than the original reference webpages. Our fine-grained break-down metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects like text content and coloring can be drastically improved with proper finetuning.
Comments: Technical Report; The first two authors contributed equally
Subjects: Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Cite as: arXiv:2403.03163 [cs.CL]
(or arXiv:2403.03163v1 [cs.CL] for this version)
[2403.03163] Design2Code: How Far Are We From Automating Front-End Engineering?

Submission history

From: Chenglei Si [view email]
[v1] Tue, 5 Mar 2024 17:56:27 UTC (3,151 KB)


https://arxiv.org/pdf/2403.03163.pdf




Design2Code: How Far Are We From Automating Front-End Engineering​


Chenglei Si*1, Yanzhe Zhang*2, Zhengyuan Yang3, Ruibo Liu4, Diyi Yang1

1Stanford University, 2Georgia Tech, 3Microsoft, 4Google DeepMind

Paper Code Data

Abstract​

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This enabled a brand new paradigm of front-end development, where multimodal LLMs can potentially convert visual designs into code implementations directly, thus automating the front-end engineering pipeline. In this work, we provide the first systematic study on this visual design to code implementation task (dubbed as Design2Code). We manually curate a benchmark of 484 real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We also finetune an open-source Design2Code-18B model that successfully matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V is the clear winner on this task, where annotators think GPT-4V generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content; and perhaps surprisingly, in 64% of cases GPT-4V generated webpages are considered better than even the original reference webpages. Our fine-grained break-down metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects like text content and coloring can be drastically improved with proper finetuning.


Test Set Examples​

We show some examples from our benchmark (for evaluation purposes; bottom two rows) in comparison with the synthetic data created by Huggingface (for training purposes; first row). Our benchmark contains diverse real-world webpages with varying levels of complexity. (Image files are replaced with a placeholder blue box.)


Benchmark Performance: Automatic Metrics​

For automatic evaluation, we consider high-level visual similarity (CLIP) and low-level element matching (block-match, text, position, color). We compare all the benchmarked models along these different dimensions.
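
As an illustration of the high-level visual-similarity part, one way to compute a CLIP score between the reference screenshot and a rendered candidate is sketched below, using the Hugging Face transformers CLIP model as an example; this is not necessarily the paper's exact evaluation code.

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity(reference_path: str, candidate_path: str) -> float:
    """Cosine similarity between CLIP embeddings of two webpage screenshots."""
    images = [Image.open(reference_path), Image.open(candidate_path)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        embeds = model.get_image_features(**inputs)
    embeds = embeds / embeds.norm(dim=-1, keepdim=True)
    return float(embeds[0] @ embeds[1])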


Benchmark Performance: Human Evaluation​

We recruit human annotators to judge pairwise model output preference, reporting the Win/Tie/Lose rate against the baseline (Gemini Pro Vision Direct Prompting). We sample 100 examples and ask 5 annotators for each pair of comparisons, taking the majority vote on each example.


Model Comparison Examples​

We present some case study examples to compare between different prompting methods and different models.


Additional GPT-4V Generation Examples​

We present more examples of GPT-4V generated webpages in comparison with the original reference webpages. The original designs are on the left and the GPT-4V generated webpages are on the right. You can judge for yourself whether GPT-4V is ready to automate building webpages.

[Five image pairs: original reference webpage (left) vs. GPT-4V generated webpage (right)]
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893


Computer Science > Computation and Language​

[Submitted on 7 Mar 2024]

LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error​

Boshi Wang, Hao Fang, Jason Eisner, Benjamin Van Durme, Yu Su
Tools are essential for large language models (LLMs) to acquire up-to-date information and take consequential actions in external environments. Existing work on tool-augmented LLMs primarily focuses on the broad coverage of tools and the flexibility of adding new tools. However, a critical aspect that has surprisingly been understudied is simply how accurately an LLM uses tools for which it has been trained. We find that existing LLMs, including GPT-4 and open-source LLMs specifically fine-tuned for tool use, only reach a correctness rate in the range of 30% to 60%, far from reliable use in practice. We propose a biologically inspired method for tool-augmented LLMs, simulated trial and error (STE), that orchestrates three key mechanisms for successful tool use behaviors in the biological system: trial and error, imagination, and memory. Specifically, STE leverages an LLM's 'imagination' to simulate plausible scenarios for using a tool, after which the LLM interacts with the tool to learn from its execution feedback. Both short-term and long-term memory are employed to improve the depth and breadth of the exploration, respectively. Comprehensive experiments on ToolBench show that STE substantially improves tool learning for LLMs under both in-context learning and fine-tuning settings, bringing a boost of 46.7% to Mistral-Instruct-7B and enabling it to outperform GPT-4. We also show effective continual learning of tools via a simple experience replay strategy.
Comments: Code and data available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2403.04746 [cs.CL]
(or arXiv:2403.04746v1 [cs.CL] for this version)
[2403.04746] LLMs in the Imaginarium: Tool Learning through Simulated Trial and Error

Submission history

From: Boshi Wang [view email]
[v1] Thu, 7 Mar 2024 18:50:51 UTC (7,453 KB)





 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893




Anthropic just released their new Claude 3 models with evals showing better performance on coding tasks. With that in mind, I’ve been benchmarking the new models using Aider’s code editing benchmark suite.

Claude 3 Opus outperforms all of OpenAI’s models, making it the best available model for pair programming with AI.

Aider currently supports Claude 3 Opus via OpenRouter:

# Install aider
pip install aider-chat

# Setup OpenRouter access
export OPENAI_API_KEY=<your-openrouter-key>
export OPENAI_API_BASE=https://openrouter.ai/api/v1

# Run aider with Claude 3 Opus using the diff editing format
aider --model anthropic/claude-3-opus --edit-format diff

Aider’s code editing benchmark​

Aider is an open source command line chat tool that lets you pair program with AI on code in your local git repo.

Aider relies on a code editing benchmark to quantitatively evaluate how well an LLM can make changes to existing code. The benchmark uses aider to try and complete 133 Exercism Python coding exercises. For each exercise, Exercism provides a starting python file with stubs for the needed functions, a natural language description of the problem to solve and a test suite to evaluate whether the coder has correctly solved the problem.

The LLM gets two tries to solve each problem (sketched in code after the list):

  1. On the first try, it gets the initial stub code and the English description of the coding task. If the tests all pass, we are done.
  2. If any tests failed, aider sends the LLM the failing test output and gives it a second try to complete the task.
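
The loop above can be sketched roughly as follows; `run_aider` and `run_tests` are hypothetical stand-ins for aider's actual benchmark harness, not its real API.

# Illustrative version of the two-try benchmark loop described above.

def evaluate_exercise(stub_code: str, description: str) -> bool:
    solution = run_aider(stub_code, description)               # first try
    passed, test_output = run_tests(solution)
    if passed:
        return True
    # Second try: show the model the failing test output and let it retry.
    solution = run_aider(solution, description, extra_context=test_output)
    passed, _ = run_tests(solution)
    return passed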

Benchmark results​

Claude 3 Opus​

  • The new claude-3-opus-20240229 model got the highest score ever on this benchmark, completing 68.4% of the tasks with two tries.
  • Its single-try performance was comparable to the latest GPT-4 Turbo model gpt-4-0125-preview, at 54.1%.
  • While Opus got the highest score, it was only a few points higher than the GPT-4 Turbo results. Given the extra costs of Opus and the slower response times, it remains to be seen which is the most practical model for daily coding use.

Claude 3 Sonnet​

  • The new claude-3-sonnet-20240229 model performed similarly to OpenAI’s GPT-3.5 Turbo models with an overall score of 54.9% and a first-try score of 43.6%.

Code editing​

It’s highly desirable to have the LLM send back code edits as some form of diffs, rather than having it send back an updated copy of the entire source code.

Weaker models like GPT-3.5 are unable to use diffs, and are stuck sending back updated copies of entire source files. Aider uses more efficient search/replace blocks with the original GPT-4 and unified diffs with the newer GPT-4 Turbo models.

Claude 3 Opus works best with the search/replace blocks, allowing it to send back code changes efficiently. Unfortunately, the Sonnet model was only able to work reliably with whole files, which limits it to editing smaller source files and uses more tokens, money and time.

Other observations​

There are a few other things worth noting:

  • Claude 3 Opus and Sonnet are both slower and more expensive than OpenAI’s models. You can get almost the same coding skill faster and cheaper with OpenAI’s models.
  • Claude 3 has a 2X larger context window than the latest GPT-4 Turbo, which may be an advantage when working with larger code bases.
  • The Claude models refused to perform a number of coding tasks and returned the error “Output blocked by content filtering policy”. They refused to code up the beer song program, which makes some sort of superficial sense. But they also refused to work in some larger open source code bases, for unclear reasons.
  • The Claude APIs seem somewhat unstable, returning HTTP 5xx errors of various sorts. Aider automatically recovers from these errors with exponential backoff retries, but it’s a sign that Anthropic may be struggling under surging demand.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893


  1. Anthropic introduced the next generation of Claude: the Claude 3 model family. It includes Opus, Sonnet and Haiku models. Opus is the most intelligent model, outperforming GPT-4 and Gemini 1.0 Ultra on most of the common evaluation benchmarks. Haiku is the fastest, most compact model for near-instant responsiveness. The Claude 3 models have vision capabilities, offer a 200K context window capable of accepting inputs exceeding 1 million tokens, improved accuracy and fewer refusals [Details | Model Card].
  2. Stability AI partnered with Tripo AI and released TripoSR, a fast 3D object reconstruction model that can generate high-quality 3D models from a single image in under a second. The model weights and source code are available under the MIT license, allowing commercial use. [Details | GitHub | Hugging Face].
  3. Answer.AI released a fully open source system that, for the first time, can efficiently train a 70b large language model on a regular desktop computer with two or more standard gaming GPUs. It combines QLoRA with Meta’s FSDP, which shards large models across multiple GPUs [Details].
  4. Inflection launched Inflection-2.5, an upgrade to their model powering Pi, Inflection’s empathetic and supportive companion chatbot. Inflection-2.5 approaches GPT-4’s performance, but used only 40% of the amount of compute for training. Pi is also now available on Apple Messages [Details].
  5. Twelve Labs introduced Marengo-2.6, a new state-of-the-art (SOTA) multimodal foundation model capable of performing any-to-any search tasks, including Text-To-Video, Text-To-Image, Text-To-Audio, Audio-To-Video, Image-To-Video, and more [Details].
  6. Cloudflare announced the development of Firewall for AI, a protection layer that can be deployed in front of Large Language Models (LLMs), hosted on the Cloudflare Workers AI platform or models hosted on any other third party infrastructure, to identify abuses before they reach the models [Details]
  7. Scale AI, in partnership with the Center for AI Safety, released WMDP (Weapons of Mass Destruction Proxy): an open-source evaluation benchmark of 4,157 multiple-choice questions that serve as a proxy measurement of LLM’s risky knowledge in biosecurity, cybersecurity, and chemical security [Details].
  8. Midjourney launched v6 turbo mode to generate images at 3.5x the speed (for 2x the cost). Just type /turbo [Link].
  9. Moondream.ai released moondream 2 - a small 1.8B parameters, open-source, vision language model designed to run efficiently on edge devices. It was initialized using Phi-1.5 and SigLIP, and trained primarily on synthetic data generated by Mixtral. Code and weights are released under the Apache 2.0 license, which permits commercial use [Details].
  10. Vercel released Vercel AI SDK 3.0. Developers can now associate LLM responses to streaming React Server Components [Details].
  11. Nous Research released a new model designed exclusively to create instructions from raw-text corpuses, Genstruct 7B. This enables the creation of new, partially synthetic instruction finetuning datasets from any raw-text corpus [Details].
  12. 01.AI open-sources Yi-9B, one of the top performers among a range of similar-sized open-source models excelling in code, math, common-sense reasoning, and reading comprehension [Details].
  13. Accenture to acquire Udacity to build a learning platform focused on AI [Details].
  14. China Offers ‘Computing Vouchers’ upto $280,000 to Small AI Startups to train and run large language models [Details].
  15. Snowflake and Mistral have partnered to make Mistral AI’s newest and most powerful model, Mistral Large, available in the Snowflake Data Cloud [Details]
  16. OpenAI rolled out ‘Read Aloud’ feature for ChatGPT, enabling ChatGPT to read its answers out loud. Read Aloud can speak 37 languages but will auto-detect the language of the text it’s reading [Details].
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893

This AI Can Design the Machinery of Life With Atomic Precision​

By Shelly Fan

March 8, 2024


Baker Lab AI generates complex biomolecules

Proteins are social creatures. They’re also chameleons. Depending on a cell’s needs, they rapidly transform in structure and grab onto other biomolecules in an intricate dance.

It’s not molecular dinner theater. Rather, these partnerships are the heart of biological processes. Some turn genes on or off. Others nudge aging “zombie” cells to self-destruct or keep our cognition and memory in tip-top shape by reshaping brain networks.

These connections have already inspired a wide range of therapies—and new therapies could be accelerated by AI that can model and design biomolecules. But previous AI tools solely focused on proteins and their interactions, casting their non-protein partners aside.

This week, a study in Science expanded AI’s ability to model a wide variety of other biomolecules that physically grab onto proteins, including the iron-containing small molecules that form the center of oxygen carriers.

Led by Dr. David Baker at the University of Washington, the new AI broadens the scope of biomolecular design. Dubbed RoseTTAFold All-Atom, it builds upon a previous protein-only system to incorporate a myriad of other biomolecules, such as DNA and RNA. It also adds small molecules—for example, iron—that are integral to certain protein functions.

The AI learned only from the sequence and structure of the components—without any idea of their 3D structure—but can map out complex molecular machines at the atomic level.

In the study, when paired with generative AI, RoseTTAFold All-Atom created proteins that easily grabbed onto a heart disease medication. The algorithm also generated proteins that regulate heme, an iron-rich molecule that helps blood carry oxygen, and bilin, a chemical in plants and bacteria that absorbs light for their metabolism.

These examples are just proofs of concept. The team is releasing RoseTTAFold All-Atom to the public for scientists so they can create multiple interacting bio-components with far more complexity than protein complexes alone. In turn, the creations could lead to new therapies.

“Our goal here was to build an AI tool that could generate more sophisticated therapies and other useful molecules,” said study author Woody Ahern in a press release.


Dream On​

In 2020, Google DeepMind’s AlphaFold and Baker Lab’s RoseTTAFold solved the protein structure prediction problem that had baffled scientists for half a century and ushered in a new era of protein research. Updated versions of these algorithms mapped all protein structures both known and unknown to science.

Next, generative AI—the technology behind OpenAI’s ChatGPT and Google’s Gemini—sparked a creative frenzy of designer proteins with an impressive range of activity. Some newly generated proteins regulated a hormone that kept calcium levels in check. Others led to artificial enzymes or proteins that could readily change their shape like transistors in electronic circuits.

By hallucinating a new world of protein structures, generative AI has the potential to dream up a generation of synthetic proteins to regulate our biology and health.

But there’s a problem. Designer protein AI models have tunnel vision: They are too focused on proteins.

When envisioning life’s molecular components, proteins, DNA, and fatty acids come to mind. But inside a cell, these structures are often held together by small molecules that mesh with surrounding components, together forming a functional bio-assembly.

One example is heme, a ring-like molecule that incorporates iron. Heme is the basis of hemoglobin in red blood cells, which shuttles oxygen throughout the body and grabs onto surrounding protein “hooks” using a variety of chemical bonds.

Unlike proteins or DNA, which can be modeled as a string of molecular “letters,” small molecules and their interactions are hard to capture. But they’re critical to biology’s complex molecular machines and can dramatically alter their functions.

Which is why, in their new study, the researchers aimed to broaden AI’s scope beyond proteins.

“We set out to develop a structure prediction method capable of generating 3D coordinates for all atoms” for a biological molecule, including proteins, DNA, and other modifications, the authors wrote in their paper.


Tag Team​

The team began by modifying a previous protein modeling AI to incorporate other molecules.

The AI works on three levels: The first analyzes a protein’s one-dimensional “letter” sequence, like words on a page. Next, a 2D map tracks how far each protein “word” is from another. Finally, 3D coordinates—a bit like GPS—map the overall structure of the protein.

Then comes the upgrade. To incorporate small molecule information into the model, the team added data about atomic sites and chemical connections into the first two layers.

In the third, they focused on chirality—that is, if a chemical’s structure is left or right-handed. Like our hands, chemicals can also have mirrored structures with vastly differing biological consequences. Like putting on gloves, only the correct “handedness” of a chemical can fit a given bio-assembly “glove.”

RoseTTAFold All-Atom was then trained on multiple datasets with hundreds of thousands of datapoints describing proteins, small molecules, and their interactions. Eventually, it learned general properties of small molecules useful for building plausible protein assemblies. As a sanity check, the team also added a “confidence gauge” to identify high-quality predictions—those that lead to stable and functional bio-assemblies.

Unlike previous protein-only AI models, RoseTTAFold All-Atom “can model full biomolecular systems,” wrote the team.

In a series of tests, the upgraded model outperformed previous methods when learning to “dock” small molecules onto a given protein—a key component of drug discovery—by rapidly predicting interactions between proteins and non-protein molecules.


Brave New World​

Incorporating small molecules opens a whole new level of custom protein design.

As a proof of concept, the team meshed RoseTTAFold All-Atom with a generative AI model they had previously developed and designed protein partners for three different small molecules.

The first was digoxigenin, which is used to treat heart diseases but can have side effects. A protein that grabs onto it reduces toxicity. Even without prior knowledge of the molecule, the AI designed several protein binders that tempered digoxigenin levels when tested in cultured cells.

The AI also designed proteins that bind to heme, a small molecule critical for oxygen transfer in red blood cells, and bilin, which helps a variety of creatures absorb light.

Unlike previous methods, the team explained, the AI can “readily generate novel proteins” that grab onto small molecules without any expert knowledge.

It can also make highly accurate predictions about the strength of connections between proteins and small molecules at the atomic level, making it possible to rationally build a whole new universe of complex biomolecular structures.

“By empowering scientists everywhere to generate biomolecules with unprecedented precision, we’re opening the door to groundbreaking discoveries and practical applications that will shape the future of medicine, materials science, and beyond,” said Baker.

Image Credit: Ian C. Haydon
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893

Elon Musk says xAI will open-source Grok this week​

Manish Singh @refsrc / 5:06 AM EDT•March 11, 2024


The xAI Grok AI logo

Image Credits: Jaap Arriens/NurPhoto/Getty Images

Elon Musk’s AI startup xAI will open-source Grok, its chatbot rivaling ChatGPT, this week, the entrepreneur said, days after suing OpenAI and complaining that the Microsoft-backed startup had deviated from its open-source roots.

xAI released Grok last year, arming it with features including access to “real-time” information and views undeterred by “politically correct” norms. The service is available to customers paying for X’s $16 monthly subscription.

Musk, who didn’t elaborate on which aspects of Grok he planned to open-source, helped co-found OpenAI with Sam Altman nearly a decade ago as a counterweight to Google’s dominance in artificial intelligence. But OpenAI, which was required to also make its technology “freely available” to the public, has become closed-source and shifted focus to maximizing profits for Microsoft, Musk alleged in the lawsuit filed late last month. (Read OpenAI’s response here.)

“To this day, OpenAI’s website continues to profess that its charter is to ensure that AGI ‘benefits all of humanity.’ In reality, however, OpenAI has been transformed into a closed-source de facto subsidiary of the largest technology company in the world: Microsoft,” Musk’s lawsuit alleged.

The lawsuit has ignited a debate among many technologists and investors about the merits of open-source AI. Vinod Khosla, whose firm is among the earliest backers of OpenAI, called Musk’s legal action a “massive distraction from the goals of getting to AGI and its benefits.”

Marc Andreessen, co-founder of Andreessen Horowitz, accused Khosla of “lobbying to ban open source” research in AI. “Every significant new technology that advances human well-being is greeted by a ginned-up moral panic,” said Andreessen, whose firm a16z has backed Mistral, maker of an open-source chatbot. “This is just the latest.”





The promise to imminently open-source Grok will help xAI join a growing list of firms, including Meta and Mistral, that have published the code of their chatbots.

Musk has long been a proponent of open-source. Tesla, another firm he leads, has open-sourced many of its patents. “Tesla will not initiate patent lawsuits against anyone who, in good faith, wants to use our technology,” Musk said in 2014. X, formerly known as Twitter, also open-sourced some of its algorithms last year.

He reaffirmed his criticism of the Altman-led firm on Monday, saying, “OpenAI is a lie.”

















 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893

Nvidia is sued by authors over AI use of copyrighted works​

By Jonathan Stempel

March 11, 2024 5:02 AM EDT
Updated 6 hours ago



March 10 (Reuters) - Nvidia (NVDA.O), whose chips power artificial intelligence, has been sued by three authors who said it used their copyrighted books without permission to train its NeMo AI platform.

Brian Keene, Abdi Nazemian and Stewart O'Nan said their works were part of a dataset of about 196,640 books that helped train NeMo to simulate ordinary written language, before being taken down in October "due to reported copyright infringement."

In a proposed class action filed on Friday night in San Francisco federal court, the authors said the takedown reflects Nvidia's having "admitted" it trained NeMo on the dataset, and thereby infringed their copyrights.

They are seeking unspecified damages for people in the United States whose copyrighted works helped train NeMo's so-called large language models in the last three years.

Among the works covered by the lawsuit are Keene's 2008 novel "Ghost Walk," Nazemian's 2019 novel "Like a Love Story," and O'Nan's 2007 novella "Last Night at the Lobster."

Nvidia declined to comment on Sunday. Lawyers for the authors did not immediately respond to requests on Sunday for additional comment.


[1/2] The logo of NVIDIA as seen at its corporate headquarters in Santa Clara, California, in May of 2022. Courtesy NVIDIA/Handout via REUTERS/File Photo

The lawsuit drags Nvidia into a growing body of litigation by writers, as well as the New York Times, over generative AI, which creates new content based on inputs such as text, images and sounds.

Nvidia touts NeMo as a fast and affordable way to adopt generative AI.

Other companies sued over the technology have included OpenAI, which created the AI platform ChatGPT, and its partner Microsoft (MSFT.O).

AI's rise has made Nvidia a favorite of investors.

The Santa Clara, California-based chipmaker's stock price has risen almost 600% since the end of 2022, giving Nvidia a market value of nearly $2.2 trillion.

The case is Nazemian et al v Nvidia Corp, U.S. District Court, Northern District of California, No. 24-01454.

Reporting by Jonathan Stempel in New York; Editing by Josie Kao
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893

Silicon Valley is pricing academics out of AI research​

With eye-popping salaries and access to costly computing power, AI companies are draining academia of talent​

By Naomi Nix, Cat Zakrzewski and Gerrit De Vynck

March 10, 2024 at 7:00 a.m. EDT

Fei-Fei Li speaks at a Google conference in San Francisco in March 2017. Li is now at the forefront of a growing chorus of voices who argue that researchers are being boxed out of the field of AI. (Paul Chinn/San Francisco Chronicle/Getty Images)




Fei-Fei Li, the “godmother of artificial intelligence,” delivered an urgent plea to President Biden in the glittering ballroom of San Francisco’s Fairmont Hotel last June.

The Stanford professor asked Biden to fund a national warehouse of computing power and data sets — part of a “moonshot investment” allowing the country’s top AI researchers to keep up with tech giants.

She elevated the ask Thursday at Biden’s State of the Union address, which Li attended as a guest of Rep. Anna G. Eshoo (D-Calif.) to promote a bill to fund a national AI repository.

Li is at the forefront of a growing chorus of academics, policymakers and former employees who argue the sky-high cost of working with AI models is boxing researchers out of the field, compromising independent study of the burgeoning technology.

As companies like Meta, Google and Microsoft funnel billions of dollars into AI, a massive resource gap is opening between industry and even the country’s richest universities. Meta aims to procure 350,000 of the specialized computer chips — called GPUs — necessary to run gargantuan calculations on AI models. In contrast, Stanford’s Natural Language Processing Group has 68 GPUs for all of its work.




To obtain the expensive computing power and data required to research AI systems, scholars frequently partner with tech employees. Meanwhile, tech firms’ eye-popping salaries are draining academia of star talent.

Big tech companies now dominate breakthroughs in the field. In 2022, the tech industry created 32 significant machine learning models, while academics produced three, a significant reversal from 2014, when the majority of AI breakthroughs originated in universities, according to a Stanford report.

Researchers say this lopsided power dynamic is shaping the field in subtle ways, pushing AI scholars to tailor their research for commercial use. Last month, Meta CEO Mark Zuckerberg announced the company’s independent AI research lab would move closer to its product team, ensuring “some level of alignment” between the groups, he said.

“The public sector is now significantly lagging in resources and talent compared to that of industry,” said Li, a former Google employee and the co-director of the Stanford Institute for Human-Centered AI. “This will have profound consequences because industry is focused on developing technology that is profit-driven, whereas public sector AI goals are focused on creating public goods.”


A student works last year in a hallway at Stanford University's Institute for Human-Centered AI. (Kori Suzuki for The Washington Post)



Some are pushing for new sources of funding. Li has been making the rounds in Washington, huddling with White House Office of Science and Technology Director Arati Prabhakar, dining with the political press at a swanky seafood and steakhouse and visiting Capitol Hill for meetings with lawmakers working on AI, including Sens. Martin Heinrich (D-N.M.), Mike Rounds (R-S.D.) and Todd Young (R-Ind.).

Large tech companies have contributed computing resources to the National AI Research Resource, the national warehouse project, including a $20 million donation in computing credits from Microsoft.

“We have long embraced the importance of sharing knowledge and compute resources with our colleagues within academia,” Microsoft Chief Scientific Officer Eric Horvitz said in a statement.

Policymakers are taking some steps to address the funding gaps. Last year, the National Science Foundation announced a $140 million investment to launch seven university-led National AI Research Institutes to examine how AI could mitigate the effects of climate change and improve education, among other topics.

Eshoo said she hopes to pass the Create AI Act, which has bipartisan backing in the House and Senate, by the end of the year, when she is scheduled to retire. The legislation “essentially democratizes AI,” Eshoo said.

But scholars say this infusion may not come quickly enough.

As Silicon Valley races to build chatbots and image generators, it is drawing would-be computer science professors with high salaries and the chance to work on interesting AI problems. Nearly 70 percent of people with artificial intelligence PhDs end up getting a job in private industry, compared with 21 percent of graduates two decades ago, according to a 2023 report.


Malachowsky Hall at the University of Florida in Gainesville was named after Chris Malachowsky, a co-founder of Nvidia — which makes GPUs, or specialized computer chips needed for AI. (Bloomberg/Bloomberg via Getty Images)
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893
{continued}



Big Tech’s AI boom has pushed the salaries for the best researchers to new heights. Median compensation packages for AI research scientists at Meta climbed from $256,000 in 2020 to $335,250 in 2023, according to Levels.fyi, a salary-tracking website. True stars can attract even more cash: AI engineers with a PhD and several years of experience building AI models can command compensation as high as $20 million over four years, said Ali Ghodsi, who as CEO of AI start-up DataBricks is regularly competing to hire AI talent.

“The compensation is through the roof. It’s ridiculous,” he said. “It’s not an uncommon number to hear, roughly.”

University academics often have little choice but to work with industry researchers, with the company footing the bill for computing power and offering data. Nearly 40 percent of papers presented at leading AI conferences in 2020 had at least one tech employee author, according to the 2023 report. And industry grants often fund PhD students to perform research, said Mohamed Abdalla, a scientist at the Canadian-based Institute for Better Health at Trillium Health Partners, who has conducted research on the effect of industry on academics’ AI research.

“It was like a running joke that like everyone is getting hired by them,” Abdalla said. “And the people that were remaining, they were funded by them — so in a way hired by them.”

Google believes private companies and universities should work together to develop the science behind AI, said Jane Park, a spokesperson for the company. Google still routinely publishes its research publicly to benefit the broader AI community, Park said.

David Harris, a former research manager for Meta’s responsible AI team, said corporate labs may not censor the outcome of research but may influence which projects get tackled.

“Any time you see a mix of authors who are employed by a company and authors who work at a university, you should really scrutinize the motives of the company for contributing to that work,” said Harris, who is now a chancellor’s public scholar at the University of California at Berkeley. “We used to look at people employed in academia to be neutral scholars, motivated only by the pursuit of truth and the interest of society.”



Tech giants procure huge amounts of computing power through data centers and have access to GPUs — specialized computer chips that are necessary for running the gargantuan calculations needed for AI. These resources are expensive: A recent report from Stanford University researchers estimated Google DeepMind’s large language model, Chinchilla, cost $2.1 million to develop. More than 100 top artificial intelligence researchers on Tuesday urged generative AI companies to offer a legal and technical safe harbor to researchers so they can scrutinize their products without the fear that internet platforms will suspend their accounts or threaten legal action.


A GPU made by Nvidia. (Joel Saget/AFP/Getty Images)


The necessity for advanced computing power is likely to only grow stronger as AI scientists crunch more data to improve the performance of their models, said Neil Thompson, director of the FutureTech research project at MIT’s Computer Science and Artificial Intelligence Lab, which studies progress in computing.

“To keep getting better, [what] you expect to need is more and more money, more and more computers, more and more data,” Thompson said. “What that’s going to mean is that people who do not have as much compute [and] who do not have as many resources are going to stop being able to participate.”

Tech companies like Meta and Google have historically run their AI research labs to resemble universities where scientists decide what projects to pursue to advance the state of research, according to people familiar with the matter who spoke on the condition of anonymity to speak to private company matters.

Those workers were largely isolated from teams focused on building products or generating revenue, the people said. They were judged by publishing influential papers or notable breakthroughs — similar metrics to peers at universities, the people said. Meta’s top AI scientists Yann LeCun and Joelle Pineau hold dual appointments at New York University and McGill University, blurring the lines between industry and academia.


Top AI researchers say OpenAI, Meta and more hinder independent evaluations

In an increasingly competitive market for generative AI products, research freedom inside companies could wane. Last April, Google announced it was merging two of its AI research groups, DeepMind, an AI research company it acquired in 2014, and the Brain team from Google Research, into one department called Google DeepMind. Last year, Google started to take more advantage of its own AI discoveries, sharing research papers only after the lab work had been turned into products, The Washington Post has reported.

Meta has also reshuffled its research teams. In 2022, the company placed FAIR under the helm of its VR division Reality Labs and last year reassigned some of the group’s researchers to a new generative AI product team. Last month, Zuckerberg told investors that FAIR would work “closer together” with the generative AI product team, arguing that while the two groups would still conduct research on “different time horizons,” it was helpful to the company “to have some level of alignment” between them.

“In a lot of tech companies right now, they hired research scientists that knew something about AI and maybe set certain expectations about how much freedom they would have to set their own schedule and set their own research agenda,” Harris said. “That’s changing, especially for the companies that are moving frantically right now to ship these products.”
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,206
Reputation
8,249
Daps
157,893

Behind the Compute: Benchmarking Compute Solutions​

11 Mar

Behind the Compute is a series of blog posts that chronicle elements of our business, offering insights for others to harness the power of generative AI.


In our last installment, we spoke about how we plan to utilize our state-of-the-art AI Supercomputer.

In this installment, we delve deeper into performance benchmarks and benefits of various compute solutions.

Our commitment to developing cutting-edge open models in multiple modalities necessitates a compute solution capable of handling diverse tasks with efficiency. To this end, we conducted a performance analysis, training two of our models, including the highly anticipated Stable Diffusion 3.

In our analysis, we compared the training speed of Intel Gaudi 2 accelerators versus Nvidia's A100 and H100, two of the most common choices for startups and developers training LLMs.

Model 1:

Stable Diffusion 3 is our most capable text-to-image model, soon to be in early preview.

Upon public release of Stable Diffusion 3, it will be available in sizes ranging from 800M to 8B parameters. Our analysis utilized the 2B parameter version and showed pleasantly surprising results.

We measured the training throughput for the 2B Multimodal Diffusion Transformer (MMDiT) architecture model with d=24, BFloat16 mixed precision, optimized attention (xFormers for A100 and the FusedSDPA for Intel Gaudi). We call this model version MMDiT-ps2-d24.

First, let’s examine our training benchmark results across 2 nodes, a total of 16 accelerators (Gaudi/GPU). Here’s an excerpt of the raw data:

[Table: training throughput excerpt, 2 nodes / 16 accelerators]

Keeping the batch size constant at 16 per accelerator, this Gaudi 2 system processed 927 training images per second - 1.5 times faster than the H100-80GB. Even better, we were able to fit a batch size of 32 per accelerator in the Gaudi 2 96GB of High Bandwidth Memory (HBM2E) to further increase the training rate to 1,254 images/sec.

As we scaled up the distributed training to 32 Gaudi 2 nodes (a total of 256 accelerators), we continued to measure very competitive performance:

[Table: training throughput, 32 Gaudi 2 nodes / 256 accelerators]

In this configuration, the Gaudi 2 cluster processed over 3x more images per second, compared to A100-80GB GPUs. This is particularly impressive considering that the A100s have a very optimized software stack.

On inference tests with the Stable Diffusion 3 8B parameter model the Gaudi 2 chips offer inference speed similar to Nvidia A100 chips using base PyTorch. However, with TensorRT optimization, the A100 chips produce images 40% faster than Gaudi 2. We anticipate that with further optimization, Gaudi 2 will soon outperform A100s on this model. In earlier tests on our SDXL model with base PyTorch, Gaudi 2 generates a 1024x1024 image in 30 steps in 3.2 seconds, versus 3.6 seconds for PyTorch on A100s and 2.7 seconds for a generation with TensorRT on an A100.

The higher memory and fast interconnect of Gaudi 2, plus other design considerations, make it competitive to run the Diffusion Transformer architecture that underpins this next generation of media models.

Model 2:

Stable Beluga 2.5 70B is our fine-tuned version of LLaMA 2 70B, building on the Stable Beluga 2 model which was the first open model to best ChatGPT 3.5 in select benchmarks. We ran this training benchmark on 256 Gaudi 2 accelerators. Running our PyTorch code out of the box, with no extra optimizations, we measured an impressive total average throughput of 116,777 tokens/second. More specifically, this involves using a FP16 datatype, a global batch size of 1024, gradient accumulation steps of 2, and micro batch size of 2.

On inference tests with our 70B language model on Gaudi 2, it generates 673 tokens/second per accelerator, using an input token size of 128 and output token size of 2048. In comparison to TensorRT-LLM, Gaudi 2 appears to be 28% faster than the 525 tokens/second for the A100. We also anticipate further speed improvements with FP8.
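
As a quick sanity check, the per-accelerator figures implied by the numbers above can be recomputed directly (the values are the ones reported in this post):

# Recompute per-accelerator throughput and the quoted Gaudi 2 vs. A100 speedup.
train_tokens_per_sec = 116_777      # Stable Beluga 2.5 70B training, 256 accelerators
accelerators = 256
print(train_tokens_per_sec / accelerators)    # ~456 tokens/sec per accelerator

gaudi2_inference = 673              # tokens/sec per accelerator, Gaudi 2
a100_inference = 525                # tokens/sec per accelerator, A100 + TensorRT-LLM
print(gaudi2_inference / a100_inference - 1)  # ~0.28, i.e. the quoted 28% advantage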

Companies like ours face an increasing demand for more powerful and efficient computing solutions. Our findings underscore the need for alternatives like the Gaudi 2, which not only offers superior performance to other 7nm chips, but also addresses critical market needs such as affordability, reduced lead times, and superior price-to-performance ratios. Ultimately, the opportunity for choice in computing options broadens participation and innovation, thereby making advanced AI technologies more accessible to all.

Stay tuned for more insights in our next installment of "Behind the Compute."
 