AlphaCodium - super interesting work that shows just how much alpha (no pun intended) there is in building complex prompt flows, in this case for code generation.
It achieves better results than DeepMind's AlphaCode with 4 orders of magnitude fewer LLM calls! This is a direct consequence of the brute-force approach AlphaCode took, generating ~100k candidate solutions and then filtering them down (the recently announced AlphaCode2 is much more sample-efficient, though).
The framework is agnostic to the choice of the underlying LLM. On the validation set with GPT-4, they increase the pass@5 accuracy from 19% to 44%.
There are 2 steps in the pipeline (it definitely reminds me of Meta's Self-Rewarding Language Models work I shared yesterday):
1) Pre-processing -> transforming the original problem statement into something more concise and parsable by the model, forcing the model to explain the tests (why a given test input leads to a certain output), rank various candidate solutions, etc.
2) Iterative loop where the model repeatedly fixes its solution code against a set of tests (a rough sketch of both stages follows below)
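To make that concrete, here's a minimal Python sketch of what such a two-stage flow could look like. `llm` and `run_tests` are hypothetical stand-ins for a chat-completion call and a test harness; this is my rough reading of the flow, not AlphaCodium's actual implementation:

```python
# A hypothetical sketch of the two-stage flow described above (not the authors' code).
# `llm` is a stand-in for any chat-completion call that returns text;
# `run_tests` is assumed to execute a candidate program against (input, output)
# pairs and return a list of failure descriptions (empty list = all passed).

def alphacodium_style_flow(problem, public_tests, llm, run_tests, max_iters=5):
    # --- Stage 1: pre-processing -------------------------------------------
    reflection = llm(f"Restate this problem concisely as bullet points:\n{problem}")
    test_reasoning = llm(
        "For each public test, explain why the input leads to the output:\n"
        f"{problem}\n{public_tests}"
    )

    # Draft a few candidate approaches and ask the model to pick the best one.
    candidates = [llm(f"Propose a solution approach in plain text:\n{reflection}") for _ in range(3)]
    best_plan = llm(f"Rank these approaches by correctness and simplicity; return the best:\n{candidates}")

    # --- Stage 2: iterative run-and-fix loop --------------------------------
    code = llm(f"Implement this plan as a complete program:\n{best_plan}\n{test_reasoning}")
    for _ in range(max_iters):
        failures = run_tests(code, public_tests)
        if not failures:
            break
        code = llm(f"The code failed these tests:\n{failures}\nFix the code:\n{code}")
    return code
```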
I feel ultimately we'll want this to be part of the LLM training process as well, and not something we append ad hoc, but it's definitely exciting to see how much low-hanging fruit is still out there!
Paper:
Official implementation for the paper: "Code Generation with AlphaCodium: From Prompt Engineering to Flow Engineering" - Codium-ai/AlphaCodium
github.com
Read about State-of-the-art Code Generation with AlphaCodium - From Prompt Engineering to Flow Engineering in our blog.
www.codium.ai
State-of-the-art Code Generation with AlphaCodium – From Prompt Engineering to Flow Engineering
Tal Ridnik · January 17, 2024 • 17 min read
TL;DR
Code generation problems differ from common natural language problems – they require matching the exact syntax of the target language, identifying happy paths and edge cases, paying attention to numerous small details in the problem spec, and addressing other code-specific issues and requirements. Hence, many of the optimizations and tricks that have been successful in natural language generation may not be effective for code tasks.
In this work, we propose a new approach to code generation by LLMs, which we call AlphaCodium – a test-based, multi-stage, code-oriented iterative flow that improves the performance of LLMs on code problems.
We tested AlphaCodium on a challenging code generation dataset called CodeContests, which includes competitive programming problems from platforms such as Codeforces. The proposed flow consistently and significantly improves results.
On the validation set, for example, GPT-4 accuracy (pass@5) increased from 19% with a single well-designed direct prompt to 44% with the AlphaCodium flow. AlphaCodium also outperforms previous works, such as AlphaCode, while having a significantly smaller computational budget.
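As a reminder of the metric: pass@5 counts a problem as solved if at least one of five generated solutions passes all of its private tests. When more than k samples are drawn per problem, the unbiased pass@k estimator from the HumanEval paper is commonly used; the snippet below is a generic sketch of that estimator, not AlphaCodium's evaluation harness:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021): probability that at least
    one of k solutions, drawn from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 10 generations for a problem, 3 of them correct -> estimated pass@5
print(round(pass_at_k(n=10, c=3, k=5), 3))  # 0.917
```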
Many of the principles and best practices acquired in this work, we believe, are broadly applicable to general code generation tasks.
In our newly released open-source AlphaCodium repository, we share our AlphaCodium solution to CodeContests, along with complete, reproducible dataset evaluation and benchmarking scripts, to encourage further research in this area.
CodeContests dataset
CodeContests is a challenging code generation dataset introduced by Google DeepMind, involving problems curated from competitive programming platforms such as Codeforces. The dataset contains ~10K problems that can be used to train LLMs, as well as a validation and test set to assess the ability of LLMs to solve challenging code generation problems.
In this work, instead of training a dedicated model, we focused on developing a code-oriented flow that can be applied to any LLM pre-trained to support coding tasks, such as GPT or DeepSeek. Hence, we chose to ignore the train set and focused on the validation and test sets of CodeContests, which contain 107 and 165 problems, respectively. Figure 1 depicts a typical problem from the CodeContests dataset:
Figure 1. A typical CodeContests problem.
Each problem consists of a description and public tests, available as inputs to the model. The goal is to generate a code solution that produces the correct output for any (legal) input. A private test set, which is not available to the model or contestants, is used to evaluate the submitted code solutions.
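To make this evaluation protocol concrete, a candidate program can be checked against input-output tests roughly as follows; this is an illustrative harness for Python solutions, not the dataset's official judge:

```python
import subprocess
import sys

def passes_io_tests(solution_path, tests, timeout=10.0):
    """Run a candidate Python solution against (input, expected_output) pairs,
    Codeforces-style: the test input is fed to stdin, the answer is read from stdout."""
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                [sys.executable, solution_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return False
        if result.returncode != 0 or result.stdout.strip() != expected.strip():
            return False
    return True
```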
What makes CodeContests a good dataset for evaluating LLMs on code generation tasks?
1) CodeContests, unlike many other competitive programming datasets, utilizes a comprehensive private set of tests to avoid false positives – each problem contains ~200 private input-output tests that the generated code solution must pass.
2) LLMs generally do not excel at paying attention to small details, because they typically transform the problem description into some “average” description, similar to common cases on which they were trained. Real-world problems, on the other hand, frequently contain minor details that are critical to their proper resolution. A key feature of the CodeContests dataset is that the problem descriptions are, by design, complicated and lengthy, with small details and nuances (see a typical problem description in Figure 1). We feel that this extra degree of freedom in problem understanding is beneficial, since it simulates real-life problems, which are often complicated and involve multiple factors and considerations. This is in contrast to more common code datasets such as HumanEval, where the problems are easier and presented in a concise manner. An example of a typical HumanEval code problem appears in Appendix 1.
Figure 2 depicts the model’s introspection on the problem presented in Figure 1. Note that proper self-reflection makes the problem clearer and more coherent. This illustrates the importance of problem understanding as part of a flow that can lead with high probability to a correct code solution.
Figure 2. An AI-generated self-reflection on the problem presented in Figure 1.
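For illustration, a self-reflection prompt in this spirit might look like the template below; the wording here is hypothetical, and the exact prompts used by AlphaCodium are available in the open-source repository:

```python
# Illustrative only: one possible self-reflection prompt in the spirit of Figure 2.
SELF_REFLECTION_PROMPT = """\
You are given a competitive programming problem:

{problem_description}

Restate the problem in your own words as short bullet points, covering:
- the goal
- the input and output formats
- all rules and constraints
- edge cases implied by the constraints
Do not write any code yet.
"""

# Usage: prompt = SELF_REFLECTION_PROMPT.format(problem_description=problem_text)
```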
The proposed flow
Due to the complicated nature of code generation problems, we observed that single-prompt optimizations, or even chain-of-thought prompts, have not led to meaningful improvement in the solve ratio of LLMs on CodeContests. The model struggles to understand and “digest” the problem and continuously produces wrong code, or code that fails to generalize to unseen private tests. Common flows that are suitable for natural language tasks may not be optimal for code-generation tasks, which offer an untapped potential – repeatedly running the generated code and validating it against known examples.
Instead of common prompt engineering techniques used in NLP, we found that to solve CodeContests problems it was more beneficial to employ a dedicated code-generation- and testing-oriented flow that revolves around an iterative process, where we repeatedly run and fix the generated code against input-output tests. Two key elements of this code-oriented flow are (a) generating additional data in a pre-processing stage, such as self-reflection and public test reasoning, to aid the iterative process, and (b) enriching the public tests with additional AI-generated tests.
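As an illustration of element (b), test enrichment could be prompted for roughly as follows; `llm` is a placeholder for any chat-completion call, and the prompt wording is hypothetical rather than the exact one used in AlphaCodium:

```python
import json

def enrich_tests(problem, public_tests, llm, n_extra=6):
    """Hypothetical sketch: ask the model for extra input/output tests covering
    edge cases, then append them to the public tests for the iterative loop."""
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Existing public tests:\n{json.dumps(public_tests, indent=2)}\n\n"
        f"Write {n_extra} additional tests as a JSON list of "
        '{"input": "...", "output": "..."} objects, covering edge cases '
        "not exercised by the public tests. Return only the JSON."
    )
    generated = json.loads(llm(prompt))  # real code would validate and retry on malformed JSON
    return public_tests + generated
```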
In Figure 3 we present our proposed flow for solving competitive programming problems:
Figure 3. The proposed AlphaCodium flow.
Santiago
@svpino
Jan 18
We are one step closer to having AI generate code better than humans!
There's a new open-source, state-of-the-art code generation tool. It's a new approach that improves the performance of Large Language Models at generating code.
The paper's authors call the process "AlphaCodium" and tested it on the CodeContests dataset, which contains around 10,000 competitive programming problems.
The results put AlphaCodium as the best approach to generate code we've seen. It beats DeepMind's AlphaCode and their new AlphaCode2 without needing to fine-tune a model!
I'm linking to the paper, the GitHub repository, and a blog post below, but let me give you a 10-second summary of how the process works:
Instead of using a single prompt to solve problems, AlphaCodium relies on an iterative process that repeatedly runs and fixes the generated code using the testing data.
1. The first step is to have the model reason about the problem. They describe it using bullet points and focus on the goal, inputs, outputs, rules, constraints, and any other relevant details.
2. Then, they make the model reason about the public tests and come up with an explanation of why the input leads to that particular output.
3. The model generates two to three potential solutions in text and ranks them in terms of correctness, simplicity, and robustness (a rough sketch of this ranking step follows the list).
4. Then, it generates more diverse tests for the problem, covering cases not part of the original public tests.
5. Iteratively, pick a solution, generate the code, and run it on a few test cases. If the tests fail, improve the code and repeat the process until the code passes every test.
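Here's a rough sketch of how step 3 (ranking candidate solutions) could be implemented; `llm` is a generic chat-completion helper, and the prompt is illustrative rather than the authors' exact wording:

```python
# Hypothetical sketch of step 3 (candidate ranking), not the paper's actual prompt.
def rank_candidates(problem, candidates, llm):
    numbered = "\n\n".join(f"Candidate {i + 1}:\n{c}" for i, c in enumerate(candidates))
    prompt = (
        f"Problem:\n{problem}\n\n{numbered}\n\n"
        "Rank these candidate solutions by correctness, simplicity, and robustness "
        "to edge cases. Answer with only the number of the best candidate."
    )
    reply = llm(prompt).strip()
    digits = "".join(ch for ch in reply if ch.isdigit())
    index = int(digits) - 1 if digits else 0
    return candidates[max(0, min(index, len(candidates) - 1))]
```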