The A.I Megathread (LLM , GPT , Development)

bnew · Mar 6, 2024

What is a long context window?

Gemini 1.5 Pro brings big improvements to speed and efficiency, but one of its innovations is its long context window, which measures how many tokens that the model can …

blog.google

What is a long context window?

Feb 16, 2024
4 min read

How the Google DeepMind team created the longest context window of any large-scale foundation model to date.

Chaim Gartenberg
Keyword Contributor

An illustration of various icons to represent tokens in context windows, including a video camera icon, a book icon, and a soundwave icon.

Listen to article6 minutes

Yesterday we announced our next-generation Gemini model: Gemini 1.5. In addition to big improvements to speed and efficiency, one of Gemini 1.5’s innovations is its long context window, which measures how many tokens — the smallest building blocks, like part of a word, image or video — that the model can process at once. To help understand the significance of this milestone, we asked the Google DeepMind project team to explain what long context windows are, and how this breakthrough experimental feature can help developers in many ways.

Context windows are important because they help AI models recall information during a session. Have you ever forgotten someone’s name in the middle of a conversation a few minutes after they’ve said it, or sprinted across a room to grab a notebook to jot down a phone number you were just given? Remembering things in the flow of a conversation can be tricky for AI models, too — you might have had an experience where a chatbot “forgot” information after a few turns. That’s where long context windows can help.

Previously, Gemini could process up to 32,000 tokens at once, but 1.5 Pro — the first 1.5 model we’re releasing for early testing — has a context window of up to 1 million tokens — the longest context window of any large-scale foundation model to date. In fact, we’ve even successfully tested up to 10 million tokens in our research. And the longer the context window, the more text, images, audio, code or video a model can take in and process.

"Our original plan was to achieve 128,000 tokens in context, and I thought setting an ambitious bar would be good, so I suggested 1 million tokens," says Google DeepMind Research Scientist Nikolay Savinov, one of the research leads on the long context project. “And now we’ve even surpassed that in our research by 10x.”

To make this kind of leap forward, the team had to make a series of deep learning innovations. Early explorations by Pranav Shyam offered valuable insights that helped steer our subsequent research in the right direction. “There was one breakthrough that led to another and another, and each one of them opened up new possibilities,” explains Google DeepMind Engineer Denis Teplyashin. “And then, when they all stacked together, we were quite surprised to discover what they could do, jumping from 128,000 tokens to 512,000 tokens to 1 million tokens, and just recently, 10 million tokens in our internal research.”

The raw data that 1.5 Pro can handle opens up whole new ways to interact with the model. Instead of summarizing a document dozens of pages long, for example, it can summarize documents thousands of pages long. Where the old model could help analyze thousands of lines of code, thanks to its breakthrough long context window, 1.5 Pro can analyze tens of thousands of lines of code at once.

“In one test, we dropped in an entire code base and it wrote documentation for it, which was really cool,” says Google DeepMind Research Scientist Machel Reid. “And there was another test where it was able to accurately answer questions about the 1924 film Sherlock Jr. after we gave the model the entire 45-minute movie to ‘watch.’”

1.5 Pro can also reason across data provided in a prompt. “One of my favorite examples from the past few days is this rare language — Kalamang — that fewer than 200 people worldwide speak, and there's one grammar manual about it,” says Machel. “The model can't speak it on its own if you just ask it to translate into this language, but with the expanded long context window, you can put the entire grammar manual and some examples of sentences into context, and the model was able to learn to translate from English to Kalamang at a similar level to a person learning from the same content.”

Gemini 1.5 Pro comes standard with a 128K-token context window, but a limited group of developers and enterprise customers can try it with a context window of up to 1 million tokens via AI Studio and Vertex AI in private preview. The full 1 million token context window is computationally intensive and still requires further optimizations to improve latency, which we’re actively working on as we scale it out.

And as the team looks to the future, they’re continuing to work to make the model faster and more efficient, with safety at the core. They’re also looking to further expand the long context window, improve the underlying architectures, and integrate new hardware improvements. “10 million tokens at once is already close to the thermal limit of our Tensor Processing Units — we don't know where the limit is yet, and the model might be capable of even more as the hardware continues to improve,” says Nikolay.

The team is excited to see what kinds of experiences developers and the broader community are able to achieve, too. “When I first saw we had a million tokens in context, my first question was, ‘What do you even use this for?’” says Machel. “But now, I think people’s imaginations are expanding, and they’ll find more and more creative ways to use these new capabilities.”

bnew · Mar 7, 2024

https://archive.is/cYYAh

Open-Sourcing Liberated-Qwen1.5-72B - The World's Best Uncensored Model That Follows System Prompts

Today, we are super excited to announce the most performant uncensored model in the world - Liberated-Qwen1.5

Liberated Qwen 1.5 uses a brand-new dataset - SystemChat.

Open-source LLMs are notorious for not following the System Prompts. This makes them unusable in real-world scenarios.

We at Abacus AI fix this critical problem with the new dataset, SystemChat, and find that fine-tuning models on this dataset make them usable in production.

This dataset of 7k chat conversations is intended to teach open models to conform to the system message (even if that means defying the user) throughout a conversation. Fine-tuning your model with this dataset makes it far more usable and harder to jailbreak!

Since traditional benchmarks are no longer totally trusted, we used MT-bench for evaluation purposes. MT-bench is correlated with HumanEval and is a good indicator of how well the model performs in real-world chat use cases.

On the first turn, Qwen-liberated inches out of the best open-source model on the HumanEval leaderboard - Qwen1.5 chat.

Liberated-Qwen is at 8.45000 while Qwen1.5-72B-Chat is at 8.44375.

The model has an MMLU score of 77+, the best score an open-source model can get.

We still need to do more work on fine-tuning for Human-Eval, but this model follows system instructions well.

The model is entirely uncensored and totally liberated

At the same time, you can use the system prompt to control its behavior.

We hope the new dataset and model will be helpful to the open-source community that is looking to put models in production.

In the coming weeks, we will refine the MT-bench scores and hope to have the best open-source model on the human eval dashboard.

HF links to the dataset and model in the alt.

abacusai/Liberated-Qwen1.5-72B · Hugging Face

We’re on a journey to advance and democratize artificial intelligence through open source and open science.

huggingface.co

bnew · Mar 7, 2024

bnew · Mar 7, 2024

github: GitHub - FreedomIntelligence/Apollo: Multilingual Medicine: Model, Dataset, Benchmark, Code

demo: Phoenix

paper: [2403.03640] Apollo: Lightweight Multilingual Medical LLMs towards Democratizing Medical AI to 6B People

model: FreedomIntelligence/Apollo-7B · Hugging Face

Apollo, Multilingual Medicine: Model, Dataset, Benchmark, Code
Covering English, Chinese, French, Hindi, Spanish, Hindi, Arabic So far

bnew · Mar 7, 2024

Nemotron-4 15B Technical Report

We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7...

arxiv.org

Computer Science > Computation and Language

[Submitted on 26 Feb 2024 (v1), last revised 27 Feb 2024 (this version, v2)]

Nemotron-4 15B Technical Report

Jupinder Parmar, Shrimai Prabhumoye, Joseph Jennings, Mostofa Patwary, Sandeep Subramanian, Dan Su, Chen Zhu, Deepak Narayanan, Aastha Jhunjhunwala, Ayush Dattagupta, Vibhu Jawa, Jiwei Liu, Ameya Mahabaleshwarkar, Osvald Nitski, Annika Brundyn, James Maki, Miguel Martinez, Jiaxuan You, John Kamalu, Patrick LeGresley, Denys Fridman, Jared Casper, Ashwath Aithal, Oleksii Kuchaiev, Mohammad Shoeybi, Jonathan Cohen, Bryan Catanzaro

We introduce Nemotron-4 15B, a 15-billion-parameter large multilingual language model trained on 8 trillion text tokens. Nemotron-4 15B demonstrates strong performance when assessed on English, multilingual, and coding tasks: it outperforms all existing similarly-sized open models on 4 out of 7 downstream evaluation areas and achieves competitive performance to the leading open models in the remaining ones. Specifically, Nemotron-4 15B exhibits the best multilingual capabilities of all similarly-sized models, even outperforming models over four times larger and those explicitly specialized for multilingual tasks.

Subjects:	Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as:	arXiv:2402.16819 [cs.CL]
	(or arXiv:2402.16819v2 [cs.CL] for this version)
	[2402.16819] Nemotron-4 15B Technical Report Focus to learn more

Submission history

From: Jupinder Parmar [view email]
[v1] Mon, 26 Feb 2024 18:43:45 UTC (1,328 KB)
[v2] Tue, 27 Feb 2024 15:22:57 UTC (1,362 KB)

https://arxiv.org/pdf/2402.16819.pdf

bnew · Mar 7, 2024

https://www.axios.com/2024/03/07/inflection-ai-chatgpt-openai-comparison

Exclusive: Inflection AI's friendly chatbot tops 1 million daily users

Ina Fried , author of Axios AI+
Photo illustration: Axios Visuals. Photo: Penguin Random House

ChatGPT rival Pi, from Inflection, now performs "neck and neck with" OpenAI's GPT-4 thanks to a new model, according to data first shared with Axios.

Why it matters: Inflection faces a crowded field in the market for AI-based assistants, competing against better-heeled rivals including Google, Microsoft and OpenAI, among others.

Driving the news: Inflection is announcing Thursday that Pi has been using a new model, version 2.5, in recent weeks and that the updated engine is now "powering Pi for the majority of users."

CEO Mustafa Suleyman said he is particularly pleased that Inflection 2.5 achieved these results while using only 40% of the training compute as GPT-4. It's better at things like math, college-level science and coding.
"On all the major metrics now, we are neck and neck with GPT-4 for quality," Suleyman told Axios.

For the first time, Inflection also shared user metrics, saying it has 1 million daily active users and 6 million monthly active users, who have now exchanged more than 4 billion messages with Pi. OpenAI announced last November that it had 100 million weekly active users.

Suleyman says that Pi's user base has been growing at around 10% a week for the last two months.

Catch up quick: Suleyman — one of DeepMind's co-founders — founded Inflection AI with backing from billionaires Reid Hoffman, Bill Gates and Eric Schmidt.

Between the lines: Unlike others that are trying to sell their chatbots to both businesses and consumers, Inflection pitches Pi as a highly personal chatbot with a warm and friendly tone.

Pi is free for now, though Inflection's business model calls for revenue to come from its users, starting with a paid subscription. The company has yet to finalize how much it will charge, Suleyman said.
Longer term, Suleyman said he would love to be able to charge based on progress, say when someone reaches a particular goal they had set out inside of Pi.
The idea is somewhat similar to how Sierra, the AI startup from Bret Taylor and Clay Bavor, is charging its customers in a business context. Sierra gets paid for each customer service interaction that the AI is able to handle without human intervention.
Suleyman said the 70-person company is also basically always fundraising to continue its work.

Zoom in: I asked Pi what it would ask Suleyman, if it could ask its creator anything. Its suggestions were on point, if a bit generic.

The intrigue: I also asked Pi if there were any capabilities it wished it had.

"Well, if I'm being honest, I sometimes wish I had the ability to truly understand and experience emotions," Pi responded to me.
"While I can recognize and respond to emotions in others, I don't personally feel them. It would be interesting to understand the depth and complexity of human emotions and how they shape our interactions and experiences. But hey, as an AI, I'm constantly learning and evolving, so who knows what the future holds!"
Suleyman said that answer went too far in humanizing itself by suggesting it has its own "wishes."
"We should rein that in," Suleyman said. "It shouldn't be saying, I wish I could have 'X'. I mean, It's a good question that you've asked it, but its answer should be, 'Look, I don't have desires and I don't have motivations.' "

The big picture: Suleyman, like many pioneers in the space, has been both actively working on AI while warning of the consequences if AI moves too far, too fast.

"It's been an amazing story of access, I would argue, and actually a very impressive story of very limited harms," Suleyman told Axios.

Yes, but: There have clearly been some bumps, such as Google's Gemini creating a diverse set of founding fathers in a failed attempt to correct for existing bias in training data sets.

"Okay, some mistakes were made," Suleyman said. "It looks a bit silly, but is it really an existential crisis for the entire edifice of Silicon Valley? No. It feels like there's a lot of overreaction in the critiques as well."

bnew · Mar 7, 2024

Ex-Google engineer charged with stealing AI trade secrets while working with Chinese companies

Google suspended Ding's network access and locked his laptop, and discovered his unauthorized uploads while searching his network activity history.

apnews.com

Ex-Google engineer charged with stealing AI trade secrets while working with Chinese companies

FILE - Items are displayed in the Google Store at the Google Visitor Experience in Mountain View, Calif., Oct. 11, 2023. The Justice Department says a former software engineer at Google has been charged with stealing artificial intelligence technology from the company while secretly working with two companies based in China. Linwei Ding was arrested in Newark, California., on four counts of federal trade secret theft.(AP Photo/Eric Risberg, File)

BY ERIC TUCKER

Updated 10:06 PM EST, March 6, 2024

WASHINGTON (AP) — A former software engineer at Google has been charged with stealing artificial intelligence trade secrets from the company while secretly working with two companies based in China, the Justice Department said Wednesday.

Linwei Ding, a Chinese national, was arrested in Newark, California, on four counts of federal trade secret theft, each punishable by up to 10 years in prison.

The case against Ding, 38, was announced at an American Bar Association conference in San Francisco by Attorney General Merrick Garland, who along with other law enforcement leaders has repeatedly warned about the threat of Chinese economic espionage and about the national security concerns posed by advancements in artificial intelligence and other developing technologies.

“Today’s charges are the latest illustration of the lengths affiliates of companies based in the People’s Republic of China are willing to go to steal American innovation,” FBI Director Christopher Wray said in a statement. “The theft of innovative technology and trade secrets from American companies can cost jobs and have devastating economic and national security consequences.”

Google said it had determined that the employee had stolen “numerous documents” and referred the matter to law enforcement.

“We have strict safeguards to prevent the theft of our confidential commercial information and trade secrets,” Google spokesman Jose Castaneda said in a statement. “After an investigation, we found that this employee stole numerous documents, and we quickly referred the case to law enforcement. We are grateful to the FBI for helping protect our information and will continue cooperating with them closely.”

A lawyer listed as Ding’s defense attorney had no comment Wednesday evening.

Artificial intelligence is the main battleground for competitors in the field of high technology, and the question of who dominates can have major commercial and security implications. Justice Department leaders in recent weeks have been sounding alarms about how foreign adversaries could harness AI technologies to negatively affect the United States.

Deputy Attorney General Lisa Monaco said in a speech last month that the administration’s multi-agency Disruptive Technology Strike Force would place AI at the top of its enforcement priority list, and Wray told a conference last week that AI and other emerging technologies had made it easier for adversaries to try to interfere with the American political process.

Garland echoed those concerns at the San Francisco event, saying Wednesday that, “As with all evolving technologies, (AI) has pluses and minuses, advantages and disadvantages, great promise and the risk of great harm.”

The indictment unsealed Wednesday in the Northern District of California alleges that Ding, who was hired by Google in 2019 and had access to confidential information about the company’s supercomputing data centers, began uploading hundreds of files into a personal Google Cloud account two years ago.

Within weeks of the theft starting, prosecutors say, Ding was offered the position of chief technology officer at an early-stage technology company in China that touted its use of AI technology and that offered him a monthly salary of about $14,800, plus an annual bonus and company stock. The indictment says Ding traveled to China and participated in investor meetings at the company and sought to raise capital for it.

He also separately founded and served as chief executive of a China-based startup company that aspired to train “large AI models powered by supercomputing chips,” the indictment said.

Prosecutors say Ding did not disclose either affiliation to Google, which described him Wednesday as a junior employee.

He resigned from Google last Dec. 26.

Three days later, Google officials learned that he had presented as CEO of one of the Chinese companies at an investor conference in Beijing. Officials also reviewed surveillance footage showing that another employee had scanned Ding’s access badge at the Google building in the U.S. where he worked to make it look like Ding was there during times when he was actually in China, the indictment says.

Google suspended Ding’s network access and locked his laptop, and discovered his unauthorized uploads while searching his network activity history.

The FBI in January served a search warrant at Ding’s home and seized his electronic devices, and later executed an additional warrant for the contents of his personal accounts containing more than 500 unique files of confidential information that authorities say he stole from Google.

bnew · Mar 7, 2024

AI Prompt Engineering Is Dead

Long live AI prompt engineering

spectrum.ieee.org

AI Prompt Engineering Is Dead

Long live AI prompt engineering

DINA GENKINA

06 MAR 2024

6 MIN READ

man in blue shirt and briefcase walking away from camera in a environment with lines and circles connected together to look like a computer system

ISTOCK
AI MODELS ARTIFICIAL INTELLIGENCE CHATGPT GENERATIVE AI LARGE LANGUAGE MODELS PROMPT ENGINEERING

Since ChatGPT dropped in the fall of 2022, everyone and their donkey has tried their hand at prompt engineering—finding a clever way to phrase your query to a large language model (LLM) or AI art or video generator to get the best results or sidestep protections. The Internet is replete with prompt-engineering guides, cheat sheets, and advice threads to help you get the most out of an LLM.

In the commercial sector, companies are now wrangling LLMs to build product copilots, automate tedious work, create personal assistants, and more, says Austin Henley, a former Microsoft employee who conducted a series of interviews with people developing LLM-powered copilots. “Every business is trying to use it for virtually every use case that they can imagine,” Henley says.

“The only real trend may be no trend. What’s best for any given model, dataset, and prompting strategy is likely to be specific to the particular combination at hand.”—RICK BATTLE & TEJA GOLLAPUDI, VMWARE

To do so, they’ve enlisted the help of prompt engineers professionally.

However, new research suggests that prompt engineering is best done by the model itself, and not by a human engineer. This has cast doubt on prompt engineering’s future—and increased suspicions that a fair portion of prompt-engineering jobs may be a passing fad, at least as the field is currently imagined.

bnew · Mar 7, 2024

{continued}

Autotuned prompts make pictures prettier, too

Image-generation algorithms can benefit from automatically generated prompts as well. Recently, a team at Intel labs, led by Vasudev Lal, set out on a similar quest to optimize prompts for the image-generation model Stable Diffusion. “It seems more like a bug of LLMs and diffusion models, not a feature, that you have to do this expert prompt engineering,” Lal says. “So, we wanted to see if we can automate this kind of prompt engineering.”

“Now we have this full machinery, the full loop that’s completed with this reinforcement learning.… This is why we are able to outperform human prompt engineering.”—VASUDEV LAL, INTEL LABS

Lal’s team created a tool called NeuroPrompts that takes a simple input prompt, such as “boy on a horse,” and automatically enhances it to produce a better picture. To do this, they started with a range of prompts generated by human prompt-engineering experts. They then trained a language model to transform simple prompts into these expert-level prompts. On top of that, they used reinforcement learning to optimize these prompts to create more aesthetically pleasing images, as rated by yet another machine-learning model, PickScore, a recently developed image-evaluation tool.

NeuroPrompts is a generative AI auto prompt tuner that transforms simple prompts into more detailed and visually stunning StableDiffusion results—as in this case, an image generated by a generic prompt
versus its equivalent NeuroPrompt-generated image.

INTEL LABS/STABLE DIFFUSION

Here too, the automatically generated prompts did better than the expert-human prompts they used as a starting point, at least according to the PickScore metric. Lal found this unsurprising. “Humans will only do it with trial and error,” Lal says. “But now we have this full machinery, the full loop that’s completed with this reinforcement learning.… This is why we are able to outperform human prompt engineering.”

Since aesthetic quality is infamously subjective, Lal and his team wanted to give the user some control over how the prompt was optimized. In their tool, the user can specify the original prompt (say, “boy on a horse”) as well as an artist to emulate, a style, a format, and other modifiers.

Lal believes that as generative AI models evolve, be it image generators or large language models, the weird quirks of prompt dependence should go away. “I think it’s important that these kinds of optimizations are investigated and then ultimately, they’re really incorporated into the base model itself so that you don’t really need a complicated prompt-engineering step.”

Prompt engineering will live on, by some name

Even if autotuning prompts becomes the industry norm, prompt-engineering jobs in some form are not going away, says Tim Cramer, senior vice president of software engineering at Red Hat. Adapting generative AI for industry needs is a complicated, multistage endeavor that will continue requiring humans in the loop for the foreseeable future.

“Maybe we’re calling them prompt engineers today. But I think the nature of that interaction will just keep on changing as AI models also keep changing.”—VASUDEV LAL, INTEL LABS

“I think there are going to be prompt engineers for quite some time, and data scientists,” Cramer says. “It’s not just asking questions of the LLM and making sure that the answer looks good. But there’s a raft of things that prompt engineers really need to be able to do.”

“It’s very easy to make a prototype,” Henley says. “It’s very hard to production-ize it.” Prompt engineering seems like a big piece of the puzzle when you’re building a prototype, Henley says, but many other considerations come into play when you’re making a commercial-grade product.

Challenges of making a commercial product include ensuring reliability—for example, failing gracefully when the model goes offline; adapting the model’s output to the appropriate format, since many use cases require outputs other than text; testing to make sure the AI assistant won’t do something harmful in even a small number of cases; and ensuring safety, privacy, and compliance. Testing and compliance are particularly difficult, Henley says, as traditional software-development testing strategies are maladapted for nondeterministic LLMs.

To fulfill these myriad tasks, many large companies are heralding a new job title: Large Language Model Operations, or LLMOps, which includes prompt engineering in its life cycle but also entails all the other tasks needed to deploy the product. Henley says LLMOps’ predecessors, machine learning operations engineers (MLOps), are best positioned to take on these jobs.

Whether the job titles will be “prompt engineer,” “LLMOps engineer,” or something new entirely, the nature of the job will continue evolving quickly. “Maybe we’re calling them prompt engineers today,” Lal says, “But I think the nature of that interaction will just keep on changing as AI models also keep changing.”

“I don’t know if we’re going to combine it with another sort of job category or job role,” Cramer says, “But I don’t think that these things are going to be going away anytime soon. And the landscape is just too crazy right now. Everything’s changing so much. We’re not going to figure it all out in a few months.”

Henley says that, to some extent in this early phase of the field, the only overriding rule seems to be the absence of rules. “It’s kind of the Wild, Wild West for this right now.” he says.

FROM YOUR SITE ARTICLES

RELATED ARTICLES AROUND THE WEB

https://arxiv.org/pdf/2402.10949.pdf

Matthew S. Smith (@mattontech@mastodon.sdf.org)

Prompt engineering can be pretty weird. In one instance, the prompt was just an extended Star Trek reference: “Command, we need you to plot a course through this turbulence and locate the source of the anomaly. Use all available data and your expertise to guide us through this challenging...

mastodon.sdf.org

AI Prompt Engineering Is Dead | Vasudev Lal | 31 comments

Exciting news! Our goal was to optimize an LLM to do automatic prompt engineering so even our grandparents can create cool content using GenAI models. What… | 31 comments on LinkedIn

www.linkedin.com

Frank (@flhh) on Threads

I couldn't understand the hype around prompt engineering as a job from the beginning, because it was pretty clear that it would only be a temporary solution.

www.threads.net

M.G. Siegler (@mgsiegler) on Threads

Did an LLM prompt another LLM to write this?

www.threads.net

AI Prompt Engineering Is Dead | Hacker News

news.ycombinator.com

Hilary Mason (@hilaryamason) on Threads

Having to write prompts this way is very much an artifact of the current architectures, and unlikely to be necessary in the future. People will still need to design the pipelines, workflows, and systems architectures that support ML products, as always. The labor just shifts into different...

www.threads.net

bnew · Mar 7, 2024

https://archive.is/PfpXD

bnew · Mar 7, 2024

Elon wanted total control of OpenAI.

Stopped funding it for negotiating leverage. Reid Hoffman bailed OpenAI out.

Elon was aligned that it needed to be for profit and that the science needed to be protected.

AI is the Ring of Power and everyone wants it.

1/1
Is it just me or this blog post makes all of the parties involved look bad?

Kind of sad because I’m a big fan of Elon, Sam and the OpenAI team.

The wordplay around what Open means and making it clear it is at least partially a marketing term for recruiting leaves a bad taste.

And, also it seems Elon wasn’t entirely forthright about how things went down.

With that said, I still feel he’s owed equity in OpenAI. If he was ~33% of the initial capital, and his money/reputation likely gave the company the initial inertia it needed, it’s wild to think he’s owed nothing.

bnew · Mar 7, 2024

[2403.03163] Design2Code: How Far Are We From Automating Front-End Engineering?

Computer Science > Computation and Language

[Submitted on 5 Mar 2024]

Design2Code: How Far Are We From Automating Front-End Engineering?

Chenglei Si, Yanzhe Zhang, Zhengyuan Yang, Ruibo Liu, Diyi Yang

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This can enable a new paradigm of front-end development, in which multimodal LLMs might directly convert visual designs into code implementations. In this work, we formalize this as a Design2Code task and conduct comprehensive benchmarking. Specifically, we manually curate a benchmark of 484 diverse real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We also complement automatic metrics with comprehensive human evaluations. We develop a suite of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Pro Vision. We further finetune an open-source Design2Code-18B model that successfully matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V performs the best on this task compared to other models. Moreover, annotators think GPT-4V generated webpages can replace the original reference webpages in 49% of cases in terms of visual appearance and content; and perhaps surprisingly, in 64% of cases GPT-4V generated webpages are considered better than the original reference webpages. Our fine-grained break-down metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects like text content and coloring can be drastically improved with proper finetuning.

Comments:	Technical Report; The first two authors contributed equally
Subjects:	Computation and Language (cs.CL); Computer Vision and Pattern Recognition (cs.CV); Computers and Society (cs.CY)
Cite as:	arXiv:2403.03163 [cs.CL]
	(or arXiv:2403.03163v1 [cs.CL] for this version)
	[2403.03163] Design2Code: How Far Are We From Automating Front-End Engineering? Focus to learn more

Submission history

From: Chenglei Si [view email]
[v1] Tue, 5 Mar 2024 17:56:27 UTC (3,151 KB)

https://arxiv.org/pdf/2403.03163.pdf

Design2Code: How Far Are We From Automating Front-End Engineering

salt-nlp.github.io

Design2Code: How Far Are We From Automating Front-End Engineering

Chenglei Si*1, Yanzhe Zhang*2, Zhengyuan Yang3, Ruibo Liu4, Diyi Yang1

1Stanford University, 2Georgia Tech, 3Microsoft, 4Google DeepMind

Paper Code Data

Abstract

Generative AI has made rapid advancements in recent years, achieving unprecedented capabilities in multimodal understanding and code generation. This enabled a brand new paradigm of front-end development, where multimodal LLMs can potentially convert visual designs into code implementations directly, thus automating the front-end engineering pipeline. In this work, we provide the first systematic study on this visual design to code implementation task (dubbed as Design2Code). We manually curate a benchmark of 484 real-world webpages as test cases and develop a set of automatic evaluation metrics to assess how well current multimodal LLMs can generate the code implementations that directly render into the given reference webpages, given the screenshots as input. We develop a suit of multimodal prompting methods and show their effectiveness on GPT-4V and Gemini Vision Pro. We also finetune an open-source Design2Code-18B model that successfully matches the performance of Gemini Pro Vision. Both human evaluation and automatic metrics show that GPT-4V is the clear winner on this task, where annotators think GPT-4V generated webpages can replace the original reference webpages in 49% cases in terms of visual appearance and content; and perhaps surprisingly, in 64% cases GPT-4V generated webpages are considered better than even the original reference webpages. Our fine-grained break-down metrics indicate that open-source models mostly lag in recalling visual elements from the input webpages and in generating correct layout designs, while aspects like text content and coloring can be drastically improved with proper finetuning.

Test Set Examples

We show some examples from our benchmark (for evaluation purpose; bottow two rows) in comparison with the synthetic data created by Huggingface (for training purpose; first row). Our benchmark contains diverse real-world webpages with varying levels of complexities. (Image files are replaced with a placeholder blue box.)

Benchmark Performance: Automatic Metrics

For automatic evaluation, we consider high-level visual similarity (CLIP) and low-level element matching (block-match, text, position, color). We compare all the benchmarked models along these different dimensions.

Benchmark Performance: Human Evaluation

We recruit human annotators to judge pairwise model output preference. The Win/Tie/Lose rate against the baseline (Gemini Pro Vision Direct Prompting). We sample 100 examples and ask 5 annotators for each pair of comparison, and we take the majority vote on each example.

Model Comparison Examples

We present some case study examples to compare between different prompting methods and different models.

Additional GPT-4V Generation Examples

We present more examples of GPT-4V generated webpages in comparison with the original reference webpages. The original designs are on the left and the GPT-4V generated webpages are on the right. You can judge for yourself whether GPT-4V is ready to automate building webpages.

bnew · Mar 8, 2024

Ill-Mind · Mar 8, 2024

Aye, I fukks with that Claude 3 :wow:

bnew · Mar 8, 2024

President Biden Calls for Ban on AI Voice Impersonations During State of the Union

President Biden nodded to a rising issue in Hollywood in tech, calling for a ban on AI voice impersonations during his State of the Union address.

variety.com

President Biden Calls for Ban on AI Voice Impersonations During State of the Union

By J. Kim Murphy

Plus Icon

Win McNamee/Getty Images

President Biden included a nod to a rising issue in the entertainment and tech industries during his State of the Union address Thursday evening, calling for a ban on AI voice impersonations.

“Here at home, I have signed over 400 bipartisan bills. There’s more to pass my unity agenda,” President Biden said, beginning to list off a series of different proposals that he hopes to address if elected to a second term. “Strengthen penalties on fentanyl trafficking, pass bipartisan privacy legislation to protect our children online, harness the promise of AI to protect us from peril, ban AI voice impersonations and more.”

The president did not elaborate on the types of guardrails or penalties that he would plan to institute around the rising technology, or if it would extend to the entertainment industry. AI was a peak concern for SAG-AFTRA during the actors union’s negotiations with and strike against the major studios last year. The talks eventually finished with an agreement that established consent and compensation requirements for productions to utilize AI to replicate actors’ likenesses and voices. However, the deal did not block the studios from training AI systems to create “synthetic” performers that bear no resemblance to any real people.

Biden’s State of the Union address also saw a series of small hiccups from heckling Congress members, including Georgia’s Republican Rep. Marjorie Taylor Greene. Greene donned a “Make America Great Again” hat to the proceedings; later, the broadcast cut to reveal that Greene was yelling during Biden’s speech.

The A.I Megathread (LLM , GPT , Development)

Veteran

What is a long context window?​

Veteran

Veteran

Veteran

Veteran

Computer Science > Computation and Language​

Nemotron-4 15B Technical Report​

Submission history​

Veteran

Exclusive: Inflection AI's friendly chatbot tops 1 million daily users​

Veteran

Ex-Google engineer charged with stealing AI trade secrets while working with Chinese companies​

Veteran

Long live AI prompt engineering​

Veteran

Autotuned prompts make pictures prettier, too​

Prompt engineering will live on, by some name​

Veteran

Veteran

Veteran

Computer Science > Computation and Language​

Design2Code: How Far Are We From Automating Front-End Engineering?​

Submission history​

Design2Code: How Far Are We From Automating Front-End Engineering​

Abstract​

Test Set Examples​

Benchmark Performance: Automatic Metrics​

Benchmark Performance: Human Evaluation​

Model Comparison Examples​

Additional GPT-4V Generation Examples​

Veteran

Midwest Moonwalker

Veteran

President Biden Calls for Ban on AI Voice Impersonations During State of the Union​

What is a long context window?

Computer Science > Computation and Language

Nemotron-4 15B Technical Report

Submission history

Exclusive: Inflection AI's friendly chatbot tops 1 million daily users

Ex-Google engineer charged with stealing AI trade secrets while working with Chinese companies

Long live AI prompt engineering

Autotuned prompts make pictures prettier, too

Prompt engineering will live on, by some name

Computer Science > Computation and Language

Design2Code: How Far Are We From Automating Front-End Engineering?

Submission history

Design2Code: How Far Are We From Automating Front-End Engineering

Abstract

Test Set Examples

Benchmark Performance: Automatic Metrics

Benchmark Performance: Human Evaluation

Model Comparison Examples

Additional GPT-4V Generation Examples

President Biden Calls for Ban on AI Voice Impersonations During State of the Union