llamafile
llamafile lets you distribute and run LLMs with a single file (blog post)
Our goal is to make the "build once anywhere, run anywhere" dream come true for AI developers. We're doing that by combining llama.cpp with Cosmopolitan Libc into one framework that lets you build apps for LLMs as a single-file artifact that runs locally on most PCs and servers.
First, your llamafiles can run on multiple CPU microarchitectures. We added runtime dispatching to llama.cpp that lets new Intel systems use modern CPU features without trading away support for older computers.
Secondly, your llamafiles can run on multiple CPU architectures. We do that by concatenating AMD64 and ARM64 builds with a shell script that launches the appropriate one. Our file format is compatible with WIN32 and most UNIX shells. It can also easily be converted (by either you or your users) to the platform-native format whenever required.
Thirdly, your llamafiles can run on six OSes (macOS, Windows, Linux, FreeBSD, OpenBSD, and NetBSD). You'll only need to build your code once, using a Linux-style toolchain. The GCC-based compiler we provide is itself an Actually Portable Executable, so you can build your software for all six OSes from the comfort of whichever one you prefer most for development.
Lastly, the weights for your LLM can be embedded within your llamafile. We added support for PKZIP to the GGML library. This lets uncompressed weights be mapped directly into memory, similar to a self-extracting archive. It enables quantized weights distributed online to be prefixed with a compatible version of the llama.cpp software, thereby ensuring its originally observed behaviors can be reproduced indefinitely.
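Because the container is ordinary PKZIP, standard zip tooling can peek inside a llamafile. As a quick illustration (the filename below is just a placeholder for whichever llamafile you have), listing the archive should show the embedded GGUF weights stored alongside the software:
Code:
unzip -l mymodel.llamafile   # placeholder name; lists the members embedded in the executable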
Binary Instructions
We provide example binaries that embed several different models. You can download these from Hugging Face via the links below. "Command-line binaries" run from the command line, just as if you were invoking llama.cpp's "main" function manually. "Server binaries" launch a local web server (at 127.0.0.1:8080) that provides a web-based chatbot.
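For example, a server binary is meant to be executed directly and then used from your browser; the filename below is only a placeholder for whichever example binary you downloaded from Hugging Face:
Code:
chmod +x model-server.llamafile   # placeholder name for a downloaded server binary
./model-server.llamafile          # then open http://127.0.0.1:8080 in your browser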
You can also download just the llamafile software (without any weights included) from our releases page, or directly in your terminal or command prompt. This is currently mandatory on Windows.
Code:
curl -L https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1 >llamafile
chmod +x llamafile
./llamafile --help
./llamafile -m ~/weights/foo.gguf
Gotchas
On macOS with Apple Silicon you need to have Xcode installed for llamafile to be able to bootstrap itself.
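If you're unsure whether the toolchain is present, one way to check for (and trigger installation of) Apple's command line developer tools is shown below; whether these alone suffice, versus a full Xcode install, is my assumption rather than something the project states:
Code:
xcode-select -p || xcode-select --install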
If you use zsh and have trouble running llamafile, try saying sh -c ./llamafile. This is due to a bug that was fixed in zsh 5.9+. The same is the case for Python subprocess, old versions of Fish, etc.
On Linux, binfmt_misc has been known to cause problems. You can fix that by installing the actually portable executable (APE) interpreter.
Code:
sudo wget -O /usr/bin/ape https://cosmo.zip/pub/cosmos/bin/ape-$(uname -m).elf
sudo chmod +x /usr/bin/ape
sudo sh -c "echo ':APE:M::MZqFpD::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
sudo sh -c "echo ':APE-jart:M::jartsr::/usr/bin/ape:' >/proc/sys/fs/binfmt_misc/register"
On Windows, you may need to rename llamafile to llamafile.exe in order for it to run. Windows also has a maximum file size limit of 4GB for executables. The LLaVA server executable above is just 30MB shy of that limit, so it'll work on Windows, but with larger models like WizardCoder 13B, you need to store the weights in a separate file. Here's an example of how to do that. Let's say you want to try Mistral. In that case you can open PowerShell and run these commands:
Code:
curl -L -o llamafile.exe https://github.com/Mozilla-Ocho/llamafile/releases/download/0.1/llamafile-server-0.1
curl -L -o mistral.gguf https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF/resolve/main/mistral-7b-instruct-v0.1.Q4_K_M.gguf
.\llamafile.exe -m mistral.gguf
On any platform, if your llamafile process is immediately killed, check if you have CrowdStrike and then ask to be whitelisted.
GPU Support
On Apple Silicon, everything should just work if Xcode is installed.
On Linux, Nvidia cuBLAS GPU support will be compiled on the fly if (1) you have the cc compiler installed, (2) you pass the --n-gpu-layers 35 flag (or whatever value is appropriate) to enable GPU, and (3) the CUDA developer toolkit is installed on your machine and the nvcc compiler is on your path.
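Putting those three conditions together, a typical GPU-enabled invocation might look like this (the model path is a placeholder and 35 is just an example layer count; tune it to your GPU's memory):
Code:
./llamafile -m ~/weights/foo.gguf --n-gpu-layers 35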
On Windows, that usually means you need to open up the MSVC x64 native command prompt and run llamafile there for the first invocation, so it can build a DLL with native GPU support. After that, $CUDA_PATH/bin still usually needs to be on the $PATH so the GGML DLL can find its other CUDA dependencies.
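As a rough sketch of that Windows flow (cmd syntax; the model filename and layer count are placeholders, and %CUDA_PATH% assumes the CUDA installer set that variable):
Code:
rem run from the "x64 Native Tools Command Prompt for VS"
set PATH=%CUDA_PATH%\bin;%PATH%
llamafile.exe -m mistral.gguf --n-gpu-layers 35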
In the event that GPU support couldn't be compiled and dynamically linked on the fly for any reason, llamafile will fall back to CPU inference.