bnew

Veteran
Joined
Nov 1, 2015
Messages
57,910
Reputation
8,572
Daps
161,451


Almost unbelievable - Serving the LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system 🔥

Paper - "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization"

📌 The existing problem - LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in ultra-low precisions, such as sub-4-bit.
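A rough back-of-the-envelope sketch of why the KV cache dominates memory at long context. It assumes LLaMA-7B's 32 layers and hidden size of 4096, counts one Key and one Value vector per layer per token, and ignores quantization metadata (scales, sparse outliers) and the model weights. The function name `kv_cache_bytes` is made up for this illustration and is not from the paper.

```python
def kv_cache_bytes(seq_len, bits, n_layers=32, hidden_size=4096):
    """Bytes needed to cache Keys and Values for one sequence.

    Defaults are LLaMA-7B's 32 layers and hidden size 4096. Quantization
    metadata (scales, zero points, sparse outliers) is ignored here.
    """
    elements_per_token = 2 * n_layers * hidden_size  # 2 = one Key + one Value
    return seq_len * elements_per_token * bits // 8


# Compare the fp16 baseline with a few low-bit settings at 1M tokens.
for bits in (16, 4, 3, 2):
    gb = kv_cache_bytes(1_000_000, bits) / 1e9
    print(f"{bits:>2}-bit KV cache at 1M tokens: {gb:.1f} GB")
```

Under these assumptions fp16 costs 524,288 bytes per token, so 1M tokens needs roughly 524 GB, while 2 bits per element needs roughly 65.5 GB. The real footprint will be somewhat higher once metadata and weights are counted.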

📌 This paper presents KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including:

👉 (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution;

👉 (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization;

👉 (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions;

👉 (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and

👉 (v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization.
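A minimal sketch of the idea behind (i), not the paper's implementation. It assumes, for illustration only, that one Key channel has much larger values than the rest, and compares plain uniform quantization along the token dimension with the same quantizer applied along the channel dimension. All function names and the toy data are hypothetical.

```python
import random


def quantize_uniform(values, bits):
    """Asymmetric uniform quantize-then-dequantize of a list of floats."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def quantize_per_token(keys, bits):
    # One quantization range per row (token).
    return [quantize_uniform(row, bits) for row in keys]


def quantize_per_channel(keys, bits):
    # One quantization range per column (channel).
    columns = [quantize_uniform(list(col), bits) for col in zip(*keys)]
    return [list(row) for row in zip(*columns)]


def mse(a, b):
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n


# Toy Key matrix: 64 tokens x 16 channels, with channel 0 as an assumed outlier.
rng = random.Random(0)
keys = [[rng.gauss(0.0, 1.0) for _ in range(16)] for _ in range(64)]
for row in keys:
    row[0] = 50.0 + rng.gauss(0.0, 1.0)

err_token = mse(keys, quantize_per_token(keys, bits=3))
err_channel = mse(keys, quantize_per_channel(keys, bits=3))
print(f"per-token MSE:   {err_token:.4f}")
print(f"per-channel MSE: {err_channel:.4f}")
```

With per-token ranges, the outlier channel stretches every row's range and the small values lose most of their resolution. With per-channel ranges, the outlier only affects its own column.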

📌 By applying this method to the LLaMA, LLaMA-2, and Mistral models, the paper achieves <0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches.






[Submitted on 31 Jan 2024]

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in ultra-low precisions, such as sub-4-bit. In this work, we present KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and (v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization. By applying our method to the LLaMA, LLaMA-2, and Mistral models, we achieve <0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving the LLaMA-7B model with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system.
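The dense-and-sparse idea in (iv) can be sketched roughly as follows. This is an illustration under assumptions, not the authors' code: outliers are picked here as the top-k entries by magnitude in each vector and kept in full precision, and the remaining entries are quantized uniformly. The function names and the example vector are hypothetical.

```python
def quantize_uniform(values, bits):
    """Asymmetric uniform quantize-then-dequantize of a list of floats."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return list(values)
    scale = (hi - lo) / ((1 << bits) - 1)
    return [lo + round((v - lo) / scale) * scale for v in values]


def dense_and_sparse(vector, bits, n_outliers):
    """Keep the n_outliers largest-magnitude entries exact, quantize the rest.

    Assumes n_outliers < len(vector).
    """
    by_magnitude = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    outliers = {i: vector[i] for i in by_magnitude[:n_outliers]}  # sparse part
    dense_idx = [i for i in range(len(vector)) if i not in outliers]
    dense_q = quantize_uniform([vector[i] for i in dense_idx], bits)

    recon = [0.0] * len(vector)
    for i, q in zip(dense_idx, dense_q):
        recon[i] = q
    for i, v in outliers.items():
        recon[i] = v
    return recon


def max_err(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


vec = [0.1, -0.2, 0.3, 0.0, 25.0, -0.1, 0.2, -0.3]
plain = quantize_uniform(vec, bits=3)
mixed = dense_and_sparse(vec, bits=3, n_outliers=1)
print("max error, plain 3-bit:     ", max_err(vec, plain))
print("max error, dense-and-sparse:", max_err(vec, mixed))
```

Removing the single large entry shrinks the quantization range of the dense part, so the remaining values are represented much more finely at the same bit width.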
Subjects:Machine Learning (cs.LG)
Cite as:arXiv:2401.18079 [cs.LG]
(or arXiv:2401.18079v1 [cs.LG] for this version)

Submission history

From: Coleman Hooper
[v1] Wed, 31 Jan 2024 18:58:14 UTC (1,474 KB)




 

bnew




The Infini-gram language model predicts the next token from arbitrary-length context using a suffix array. With 1.4T tokens as training data, it predicts the next token with nearly 50% accuracy, and combined with a neural LM, it reduces perplexity on human-written documents by nearly 20%. arxiv.org/abs/2401.17377

(My comment; I was working on suffix arrays 15 years ago.) Although this study used a suffix array, compressed full-text indices such as the FM-index or LZ-index could reduce the index size to a few hundred GB even for 1T tokens. There are also existing methods for finding the longest prefix and the most frequent tokens in constant time per token.
A more fundamental problem is that an n-gram LM only looks for exact prefix matches, so unlike neural networks, it cannot be used for generation tasks.
I think it would be better to let neural networks learn how to use indices, in the same way as RAG (especially RETRO).




[Submitted on 30 Jan 2024]

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi
Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their values in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two aspects. First, we train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use small n which hinders their performance; we instead allow n to be arbitrarily large, by introducing a new ∞-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The ∞-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the ∞-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their language modeling perplexities. When analyzing machine-generated text, we also observe irregularities in the machine--∞-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers. We open-source our infini-gram engine in the hopes of enabling more study on how to best use verbatim information retrieved from large text corpora.
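A toy sketch of the ∞-gram-with-backoff idea on top of a suffix array. This is not the infini-gram engine: the suffix array is built naively by sorting, the corpus is a handful of words, and backoff here stops at the longest context suffix that has at least one continuation in the corpus. All names are hypothetical.

```python
from collections import Counter


def build_suffix_array(tokens):
    # Naive construction: sort suffix start positions lexicographically.
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_range(tokens, sa, pattern):
    """Return [start, end) of suffix-array entries whose suffix starts with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first entry whose m-token prefix is >= pattern
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(sa)
    while lo < hi:  # first entry whose m-token prefix is > pattern
        mid = (lo + hi) // 2
        if tokens[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def infgram_next(tokens, sa, context):
    """Back off to the longest context suffix seen in the corpus, then count what follows."""
    for cut in range(len(context) + 1):
        suffix = context[cut:]
        start, end = find_range(tokens, sa, suffix)
        nexts = Counter(
            tokens[sa[k] + len(suffix)]
            for k in range(start, end)
            if sa[k] + len(suffix) < len(tokens)
        )
        total = sum(nexts.values())
        if total:
            return len(suffix), {tok: c / total for tok, c in nexts.items()}
    return 0, {}


corpus = "the cat sat on the mat the cat ate the fish".split()
sa = build_suffix_array(corpus)
used_n, probs = infgram_next(corpus, sa, ["on", "the", "cat"])
print(used_n, probs)
```

Here "on the cat" never occurs in the corpus, so the lookup backs off to "the cat", which occurs twice and is followed once by "sat" and once by "ate".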
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as:arXiv:2401.17377 [cs.CL]
(or arXiv:2401.17377v1 [cs.CL] for this version)
[2401.17377] Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Submission history

From: Jiacheng Liu
[v1] Tue, 30 Jan 2024 19:03:49 UTC (6,464 KB)

 

bnew






[Submitted on 30 Jan 2024]

H2O-Danube-1.8B Technical Report

Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati
We present H2O-Danube-1.8B, a 1.8B language model trained on 1T tokens following the core principles of Llama 2 and Mistral. We leverage and refine various techniques for pre-training large language models. Although our model is trained on significantly fewer total tokens compared to reference models of similar size, it exhibits highly competitive metrics across a multitude of benchmarks. We additionally release a chat model trained with supervised fine-tuning followed by direct preference optimization. We make H2O-Danube-1.8B openly available under the Apache 2.0 license, further democratizing LLMs to a wider audience economically.
Subjects:Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:arXiv:2401.16818 [cs.CL]
(or arXiv:2401.16818v1 [cs.CL] for this version)

Submission history

From: Philipp Singer
[v1] Tue, 30 Jan 2024 08:45:08 UTC (591 KB)



Base model: h2oai/h2o-danube-1.8b-base · Hugging Face
Chat model: h2oai/h2o-danube-1.8b-chat · Hugging Face


 

bnew


Why the AI Boom is a Windfall for Tiny Anguilla

The Caribbean island is reaping millions from .ai website registrations


MICHAEL KOZIOL

30 JAN 2024
A photo-illustration of a man, a map, and some palm trees.

Vince Cate has managed the surge of interest in .ai domains for the country of Anguilla.
STUART BRADFORD





The rising popularity of artificial intelligence has impacted the entire world, including the tiny island of Anguilla. Located in the Caribbean, the country, home to about 15,000 people, has a unique and suddenly in-demand resource.


In the late 1980s, the Internet Assigned Numbers Authority (IANA) assigned countries and regions of geographic interest their own two-letter domains. Anguilla received .ai, a luck of the draw that is now paying dividends as the country registers website domains for AI companies. IEEE Spectrum spoke with Vince Cate, who manages domain registrations for the Anguillan government, on how AI has had an impact on .ai.

Vince Cate

Vince Cate is a software developer and the founder of DataHaven.Net, which handles sales of the .ai domain for the Anguillan government.


How did you end up managing the .ai domain?

Vince Cate:
I came to Anguilla in 1994. I started out doing an email business, because there wasn’t any email or Internet on this island. And I wanted to have a domain name that was .ai. So I reached out to Jon Postel—he was the one that was in charge of all these top-level domains. He said, there’s nobody running .ai, do you want to run .ai? And I said, “Okay.” That was really how it went!

At some point, I said, this shouldn’t be in my name, right? So I changed the [IANA] admin contact to be the government of Anguilla. Somebody else saw that and convinced the government to give it to them, so it went to this company in Taiwan. After a couple of years, they disappeared. They didn’t answer emails or phone calls or anything. And we got it back. A number of small countries got really messed up by losing their domain names, and I would say we kind of came close.

How did .ai open up for use outside of Anguilla?

Cate:
This other company came and convinced the government that they could make a lot of money on it. They had this idea, that in Chinese “ài” means love. They thought they could market it to [Chinese websites]. At the time, I thought that artificial intelligence was a much better market.

Has the surge in AI interest been reflected in the number of .ai domains being registered?

Cate:
November 30 [2022] is when ChatGPT came out. In the five months after that, our sales went up by almost a factor of four. Then they sort of leveled off at this new, much higher level. It’s just wild—we’re already like a third of the government’s budget.

Tuvalu is perhaps the first and most well-known example of a country opening up its top-level domain (.tv). Is Anguilla approaching this opportunity differently than that situation?

Cate:
Tuvalu gave [domain registrations] to a big foreign company, and locked themselves in for 50 years. And we’re doing it locally. So the government is getting almost all the money. And that’s not what was happening in Tuvalu, right? Most of the money was not going to the country.

[Editor’s note: Tuvalu has recently renegotiated and is leasing its domain name for more money, but an outside company still manages domain registrations.]

How much money is being brought in by .ai registrations, and how is that affecting Anguilla?

Cate:
It’s about US $3 million per month. We do the domains for two years, and so all of our money now is new domains. And if we just stay at this level of $3 million per month for new domains, when the renewals kick in a year from now, we’ll just jump to $6 million per month.

And it’s just part of the general budget—the government can use it however they want. But I’ve noticed that they’ve paid down some of their debt, which is pretty unusual. They’ve eliminated property taxes on residential buildings. So we’re doing well, I would say.

This article appears in the February 2024 print issue as “5 Questions for Vince Cate.”
 

bnew


Paper: [2401.15947v1] MoE-LLaVA: Mixture of Experts for Large Vision-Language Models

Github: GitHub - PKU-YuanGroup/MoE-LLaVA: Mixture-of-Experts for Large Vision-Language Models

Abstract:

For Large Vision-Language Models (LVLMs), scaling the model can effectively improve performance. However, expanding model parameters significantly increases the training and inference costs, as all model parameters are activated for each token in the calculation. In this work, we propose a novel training strategy, MoE-tuning, for LVLMs, which can construct a sparse model with an outrageous number of parameters but a constant computational cost, and effectively addresses the performance degradation typically associated with multi-modal learning and model sparsity. Furthermore, we present the MoE-LLaVA framework, a MoE-based sparse LVLM architecture. This framework uniquely activates only the top-k experts through routers during deployment, keeping the remaining experts inactive. Our extensive experiments highlight the excellent capabilities of MoE-LLaVA in visual understanding and its potential to reduce hallucinations in model outputs. Remarkably, with just 3 billion sparsely activated parameters, MoE-LLaVA demonstrates performance comparable to the LLaVA-1.5-7B on various visual understanding datasets and even surpasses the LLaVA-1.5-13B in object hallucination benchmarks. Through MoE-LLaVA, we aim to establish a baseline for sparse LVLMs and provide valuable insights for future research in developing more efficient and effective multi-modal learning systems.
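A minimal sketch of top-k expert routing as described in the abstract, in plain Python. It is not the MoE-LLaVA code: the router is a single linear layer, the experts are toy scaling functions, and renormalizing the gate weights over the selected experts is an assumption made here (implementations differ on this). All names are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gates(router_logits, k):
    """Pick the k most probable experts and renormalize their weights (assumption)."""
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}


def moe_forward(token, experts, router_weights, k):
    """Run only the top-k experts for this token; the rest stay inactive."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    gates = top_k_gates(logits, k)
    out = [0.0] * len(token)
    for i, gate in gates.items():
        expert_out = experts[i](token)
        out = [o + gate * v for o, v in zip(out, expert_out)]
    return out, sorted(gates)


calls = []  # records which experts actually ran


def make_expert(idx, factor):
    def expert(x):
        calls.append(idx)
        return [factor * v for v in x]
    return expert


experts = [make_expert(i, f) for i, f in enumerate([1.0, 2.0, 3.0, 4.0])]
router_weights = [[2.0, 0.0], [0.5, 0.0], [1.0, 0.0], [-1.0, 0.0]]
token = [1.0, 1.0]
output, active = moe_forward(token, experts, router_weights, k=2)
print("active experts:", active)
print("output:", output)
```

Only two of the four experts are evaluated for the token, which is how per-token compute stays roughly constant as the total number of experts grows.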


 

bnew



Github: github.com/OpenBMB/MiniCPM

Huggingface: openbmb/MiniCPM-2B-sft-bf16 · Hugging Face

Unveil a compact and powerful language model designed to effortlessly run on mobile devices, including those equipped with Snapdragon 835 (released in late 2016) SoCs. It's our mission to democratize intelligence, making it accessible to everyone, everywhere!

Evaluation scores:

Surpasses or performs at a similar level to the majority of 7B-scale models, and outperforms some models with a scale of 10B or above.

Outperforms small models on all test sets except for certain English evaluation datasets.

MT-Bench Score increased after DPO alignment
 

bnew




It’s actually quite incredible to be alive at this moment. It’s hard to fully absorb the enormity of this transition. Despite the incredible impact of AI recently, the world is still struggling to appreciate how big a deal its arrival really is. We are in the process of seeing a new species grow up around us. Getting it right is unquestionably the great meta-problem of the twenty-first century. But do that and we have an unparalleled opportunity to empower people to live the lives they want.





 