WIA20XX

Superstar
Joined
May 24, 2022
Messages
6,181
Reputation
2,929
Daps
19,569

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,200
Reputation
8,249
Daps
157,878




DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence - DeepSeek-AI 2024 - SOTA open-source coding model that surpasses GPT-3.5 and Codex while being unrestricted in research and commercial use!

Paper: [2401.14196] DeepSeek-Coder: When the Large Language Model Meets Programming -- The Rise of Code Intelligence
Github: GitHub - deepseek-ai/DeepSeek-Coder: DeepSeek Coder: Let the Code Write Itself
Models: deepseek-ai (DeepSeek)
Abstract:
The rapid development of large language models has revolutionized code intelligence in software development. However, the predominance of closed-source models has restricted extensive research and development. To address this, we introduce the DeepSeek-Coder series, a range of open-source code models with sizes from 1.3B to 33B, trained from scratch on 2 trillion tokens. These models are pre-trained on a high-quality project-level code corpus and employ a fill-in-the-blank task with a 16K window to enhance code generation and infilling. Our extensive evaluations demonstrate that DeepSeek-Coder not only achieves state-of-the-art performance among open-source code models across multiple benchmarks but also surpasses existing closed-source models like Codex and GPT-3.5. Furthermore, DeepSeek-Coder models are under a permissive license that allows for both research and unrestricted commercial use.
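The "fill-in-the-blank" (fill-in-the-middle) training task mentioned in the abstract can be sketched as prompt construction: the model sees the code before and after a hole and generates the missing middle. The sentinel token names below are illustrative placeholders, not necessarily the exact strings DeepSeek-Coder uses:

```python
# Sketch of a fill-in-the-middle (FIM) prompt. Sentinel names are
# placeholders for illustration; a real model defines its own special tokens.
FIM_BEGIN = "<fim_begin>"
FIM_HOLE = "<fim_hole>"
FIM_END = "<fim_end>"

def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Ask the model to generate the code that belongs between prefix and suffix."""
    return f"{FIM_BEGIN}{prefix}{FIM_HOLE}{suffix}{FIM_END}"

prompt = build_fim_prompt(
    prefix="def add(a, b):\n    ",
    suffix="\n    return result\n",
)
print(prompt)
```

At inference time the model's completion for the hole position is spliced back between the prefix and suffix, which is what enables infilling inside an existing file rather than only left-to-right completion.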







Whatever Miqu is, it has some sort of special sauce. It gets an 83.5 on EQ-Bench (evaluated locally), surpassing *every other LLM in the world except GPT-4*. EQ-Bench has a 0.97 correlation w/ MMLU, and a 0.94 correlation w/ Arena Elo. It *beats* Mistral Medium - at Q4_K_M. I would strongly encourage @lmsysorg to add miqu to the leaderboard so we can properly test it.

I originally saw this intriguing EQ-Bench result in a random anon tweet that I can't find - replicated it myself, but if someone knows the link to it - please post in comments so I can credit the idea!

Also would be awesome if someone could check miqu for dataset contamination with EQ-Bench.

 


Mistral CEO confirms ‘leak’ of new open source AI model nearing GPT-4 performance​

Carl Franzen @carlfranzen

January 31, 2024 10:44 AM

Overhead view of Eiffel tower in a Paris made of circuit boards.

Credit: VentureBeat made with Midjourney V6

The past few days have been a wild ride for the growing open source AI community — even by its fast-moving and freewheeling standards.

Here’s the quick chronology: on or about January 28, a user with the handle “Miqu Dev” posted a set of files on HuggingFace, the leading open source AI model and code sharing platform, that together comprised a seemingly new open source large language model (LLM) labeled “miqu-1-70b.”

The HuggingFace entry, which is still up at the time of this article’s posting, noted that the new LLM’s “Prompt format,” how users interact with it, was the same as that of Mistral, the well-funded open source Parisian AI company behind Mixtral 8x7b, viewed by many to be the top performing open source LLM presently available, a fine-tuned and retrained version of Meta’s Llama 2.

Posted on *****​

The same day, an anonymous user on ***** (possibly “Miqu Dev”) posted a link to the miqu-1-70b files on *****, the notoriously longstanding haven of online memes and toxicity, where users began to notice it.

Some took to X, Elon Musk’s social network formerly known as Twitter, to share the discovery of the model and what appeared to be its exceptionally high performance at common LLM tasks (measured by tests known as benchmarks), approaching the previous leader, OpenAI’s GPT-4, on EQ-Bench.





Mistral quantized?​

Machine learning (ML) researchers took notice on LinkedIn, as well.

“Does ‘miqu’ stand for MIstral QUantized? We don’t know for sure, but this quickly became one of, if not the best, open-source LLMs,” wrote Maxime Labonne, an ML scientist at JPMorgan Chase, one of the world’s largest banking and financial companies. “Thanks to @152334H, we also now have a good unquantized version of miqu here.

The investigation continues. Meanwhile, we might see fine-tuned versions of miqu outperforming GPT-4 pretty soon.”


Quantization in ML refers to a technique used to make it possible to run certain AI models on less powerful computers and chips by replacing specific long numeric sequences in a model’s architecture with shorter ones.
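A minimal sketch of that idea, using symmetric 8-bit quantization: each 32-bit float is replaced with an 8-bit integer plus one shared scale factor, shrinking storage roughly 4x at the cost of a small, bounded reconstruction error. The numbers are illustrative only:

```python
# Sketch of weight quantization: store small integers plus one scale factor
# instead of full-precision floats.
def quantize_int8(weights):
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

weights = [0.5, -1.2, 0.03, 2.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Per-element reconstruction error is bounded by scale / 2.
error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, round(error, 4))
```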

Users speculated “Miqu” might be a new Mistral model being covertly “leaked” by the company itself into the world — especially since Mistral is known for dropping new models and updates without fanfare through esoteric and technical means — or perhaps an employee or customer gone rogue.

Confirmation from the top​

Well, today it appears we finally have confirmation of the latter of those possibilities: Mistral co-founder and CEO Arthur Mensch took to X to clarify: “An over-enthusiastic employee of one of our early access customers leaked a quantised (and watermarked) version of an old model we trained and distributed quite openly…

To quickly start working with a few selected customers, we retrained this model from Llama 2 the minute we got access to our entire cluster — the pretraining finished on the day of Mistral 7B release. We’ve made good progress since — stay tuned!“




Hilariously, Mensch also appears to have taken to the illicit HuggingFace post not to demand a takedown, but to leave a comment that the poster “might consider attribution.”



Still, with Mensch’s note to “stay tuned!” it appears that not only is Mistral training a version of this so-called “Miqu” model that approaches GPT-4 level performance, but it may, in fact, match or exceed it, if his comments are to be interpreted generously.

A pivotal moment in open source AI and beyond?​

That would be a watershed moment not just for open source generative AI but the entire field of AI and computer science: since its release back in March 2023, GPT-4 has remained the most powerful and highest performing LLM in the world by most benchmarks. Not even any of Google’s presently available, long-rumored Gemini models have been able to eclipse it — yet (according to some measures, the current Gemini models are actually worse than the older OpenAI GPT-3.5 model).

The release of an open source GPT-4 class model, which would presumably be functionally free to use, would likely place enormous competitive pressure on OpenAI and its subscription tiers, especially as more enterprises look to open source models, or a mixture of open source and closed source, to power their applications, as VentureBeat’s founder and CEO Matt Marshall recently reported. OpenAI may retain the edge with its faster GPT-4 Turbo and GPT-4V (vision), but the writing on the wall is pretty clear: the open source AI community is catching up fast. Will OpenAI have enough of a head start, and a metaphorical “moat” with its GPT Store and other features, to remain in the top spot for LLMs?










 






Groq LPU™ Inference Engine Crushes First Public LLM Benchmark​

Written by:
Groq


Groq Delivers up to 18x Faster LLM Inference Performance on Anyscale’s LLMPerf Leaderboard Compared to Top Cloud-based Providers​

Source: GitHub - ray-project/llmperf-leaderboard
Hey Groq Prompters! We’re thrilled to announce that Groq is now on the LLMPerf Leaderboard by Anyscale, a developer innovator and friendly competitor in the Large Language Model (LLM) inference benchmark space. This benchmark includes a selection of LLM inference providers, and the analysis focuses on evaluating performance, reliability, and efficiency as measured by:
  • Output Tokens Throughput (tokens/s): The average number of output tokens returned per second. This metric is important for applications that require high throughput, such as summarization and translation, and is easy to compare across different models and providers.
  • Time to First Token (TTFT): The time it takes the LLM to return the first token. TTFT is especially important for streaming applications that require low latency, such as chatbots.
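The two metrics above can be sketched as a client-side measurement loop over a streamed response. `fake_stream` below is a stand-in for a real provider's streaming API, not part of the leaderboard code:

```python
import time

# Measure TTFT and output tokens throughput for one streamed request.
def fake_stream(n_tokens, delay):
    """Stand-in for a provider's token stream."""
    for _ in range(n_tokens):
        time.sleep(delay)
        yield "tok"

start = time.monotonic()
ttft = None
count = 0
for tok in fake_stream(n_tokens=5, delay=0.01):
    if ttft is None:
        ttft = time.monotonic() - start   # time to first token
    count += 1
total = time.monotonic() - start
throughput = count / total                # output tokens per second, end to end
print(f"TTFT={ttft:.3f}s, throughput={throughput:.1f} tok/s")
```

Note that the throughput here is end-to-end (it includes the time before the first token arrives), which matches the LLMPerf definition described below.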

Not only is this our first public benchmark – it was a huge success. Meta AI’s Llama 2 70B running on the Groq LPU™ Inference Engine outperformed all other cloud-based inference providers at up to 18x faster for output tokens throughput.
Let’s walk through the Anyscale methodology in a bit more detail. This benchmark leverages:
  • A 550 input token count and a 150 output token count
  • The first metric, Output Tokens Throughput (aka the output speed) is determined by dividing the count of output tokens by the overall end-to-end time, which includes input tokens processing time and overall network latency.
  • For a full list of caveats and disclaimers for this benchmark, please refer to the documentation here.

On our end, we’d like to note:
  • All Llama 2 calculations on the LPU are done in FP16, but we store some of the weights in FP8.
  • We have no sparsity (i.e. we’re doing ALL of the Llama 2 matrix calculations and thus processing the entire model as provided by Meta AI).
  • This is noteworthy in general as FP16 should provide a higher quality of results for inference.

Now let’s look a bit more closely at the results for each metric.
For Output Tokens Throughput, Groq achieved an average of 185 tokens/s, a result 3-18x faster than any other cloud-based inference provider contributing to the leaderboard.
For Time to First Token, we hit 0.22s. Because of the deterministic design of the LPU, response times are consistent resulting in our API providing the smallest range of variability. This means more repeatability and less effort designing around potential latency issues or slow responses.
We’re proud and excited to be leading this leaderboard in the initial phase of our ongoing roadmap for performance enhancements.
Now, we already know what you’re thinking – “Groq has been saying they’re getting 270+ tokens per second per user for Llama-2 70B. What’s up with the difference?”
As mentioned, this benchmark leverages a 150 output token count and includes input processing time as part of the calculation, rather than just solely the output tokens throughput. For example, if you were to test with 1000 output tokens, the result would be closer to the 270+ tokens/s per user you see on chat.groq.com.
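To make that arithmetic concrete, here is a sketch under the simplifying assumption of a fixed overhead (roughly the 0.22 s TTFT quoted above) plus a steady 270 tokens/s generation rate, both figures taken from this post; the model of constant rates is my own illustration:

```python
# End-to-end throughput converges toward the raw generation rate as output
# length grows, because the fixed overhead amortizes.
GEN_RATE = 270.0   # tokens/s once generation is underway (figure from post)
OVERHEAD = 0.22    # s of input processing + network latency (~TTFT from post)

def end_to_end_throughput(n_out):
    return n_out / (OVERHEAD + n_out / GEN_RATE)

print(round(end_to_end_throughput(150)))   # short output: overhead dominates
print(round(end_to_end_throughput(1000)))  # long output: approaches GEN_RATE
```

With 150 output tokens the computed figure lands near the ~185 tokens/s reported on the leaderboard; with 1000 tokens it climbs toward the 270+ tokens/s seen on chat.groq.com.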
All in all, we couldn’t be more excited to participate in our first public benchmark results with the world, thanks to the work of our team at Groq and the help of the great team at Anyscale. We look forward to providing benchmarking for Llama 2 7B, and who knows, we just might mix things up, with a variety of experts, beyond that. (Much) more to come.

Interested in Alpha API Early Access?​

On Monday, January 15th, we will start granting early access to the Groq API, enabling approved users to experiment with models like Llama 2-70B running on the Groq LPU Inference Engine. We will be approving select users weekly and will be increasing users until general access is available in the next sprint. For those interested in our API solutions, please reach out to us at api@groq.com.
 


AI2 open sources text-generating AI models — and the data used to train them​

Kyle Wiggers @kyle_l_wiggers / 9:30 AM EST•February 1, 2024

Futuristic digital blockchain background. Abstract connections technology and digital network. 3d illustration of the Big data and communications technology.

Image Credits: v_alex / Getty Images

The Allen Institute for AI (AI2), the nonprofit AI research institute founded by late Microsoft co-founder Paul Allen, is releasing several GenAI language models it claims are more “open” than others — and, importantly, licensed in such a way that developers can use them unfettered for training, experimentation and even commercialization.

Called OLMo, an acronym for “Open Language MOdels,” the models and the data set used to train them, Dolma — one of the largest public data sets of its kind — were designed to study the high-level science behind text-generating AI, according to AI2 senior software engineer Dirk Groeneveld.

“‘Open’ is an overloaded term when it comes to [text-generating models],” Groeneveld told TechCrunch in an email interview. “We expect researchers and practitioners will seize the OLMo framework as an opportunity to analyze a model trained on one of the largest public data sets released to date, along with all the components necessary for building the models.”

Open source text-generating models are becoming a dime a dozen, with organizations from Meta to Mistral releasing highly capable models for any developer to use and fine-tune. But Groeneveld makes the case that many of these models can’t really be considered open because they were trained “behind closed doors” and on proprietary, opaque sets of data.

By contrast, the OLMo models, which were created with the help of partners including Harvard, AMD and Databricks, ship with the code that was used to produce their training data as well as training and evaluation metrics and logs.

In terms of performance, the most capable OLMo model, OLMo 7B, is a “compelling and strong” alternative to Meta’s Llama 2, Groeneveld asserts — depending on the application. On certain benchmarks, particularly those touching on reading comprehension, OLMo 7B edges out Llama 2. But in others, particularly question-answering tests, OLMo 7B is slightly behind.

The OLMo models have other limitations, like low-quality outputs in languages that aren’t English (Dolma contains mostly English-language content) and weak code-generating capabilities. But Groeneveld stressed that it’s early days.

“OLMo is not designed to be multilingual — yet,” he said. “[And while] at this stage, the primary focus of the OLMo framework [wasn’t] code generation, to give a head start to future code-based fine-tuning projects, OLMo’s data mix currently contains about 15% code.”

I asked Groeneveld whether he was concerned that the OLMo models, which can be used commercially and are performant enough to run on consumer GPUs like the Nvidia 3090, might be leveraged in unintended, possibly malicious ways by bad actors. A recent study by Democracy Reporting International’s Disinfo Radar project, which aims to identify and address disinformation trends and technologies, found that two popular open text-generating models, Hugging Face’s Zephyr and Databricks’ Dolly, reliably generate toxic content — responding to malevolent prompts with “imaginative” harmful content.

Groeneveld believes that the benefits outweigh the harms in the end.

“Building this open platform will actually facilitate more research on how these models can be dangerous and what we can do to fix them,” he said. “Yes, it’s possible open models may be used inappropriately or for unintended purposes. [However, this] approach also promotes technical advancements that lead to more ethical models; is a prerequisite for verification and reproducibility, as these can only be achieved with access to the full stack; and reduces a growing concentration of power, creating more equitable access.”

In the coming months, AI2 plans to release larger and more capable OLMo models, including multimodal models (i.e. models that understand modalities beyond text), and additional data sets for training and fine-tuning. As with the initial OLMo and Dolma release, all resources will be made available for free on GitHub and the AI project hosting platform Hugging Face.
 


Google launches an AI-powered image generator​

Kyle Wiggers @kyle_l_wiggers / 10:00 AM EST•February 1, 2024

Google logo

Image Credits: Sean Gallup / Getty Images

Taylor Swift deepfakes be damned, Google is releasing a new AI-powered tool, ImageFX, for image creation.

Underpinned by Imagen 2, a GenAI image model developed by Google’s DeepMind team, ImageFX offers a prompt-based UI to create and edit images. That’s no different than tools like OpenAI’s DALL-E 3, Midjourney, Meta’s Imagine with Meta AI and Microsoft Designer. But ImageFX’s unique twist is “expressive chips” — basically a list of keyword suggestions that let users experiment with “adjacent dimensions” of their creations and ideas.

“Designed for experimentation and creativity, ImageFX lets you create images with a simple text prompt, then easily modify them with a new take on prompting using expressive chips,” Google writes in a blog post.

But what of the potential for abuse — especially in light of recent events?

Google ImageFx

Image Credits: Google

Google claims that it’s taken steps to ensure that ImageFX can’t be used in ways that it wasn’t intended, for example by adding “technical safeguards” to limit “problematic outputs” like violent, offensive and sexually explicit content. ImageFX also has a prompt-level filter for “named people,” presumably public figures — although Google wasn’t especially clear on that point in its press materials.

“We invested in the safety of training data from the outset,” Google said. “Consistent with our AI principles, we also conducted extensive adversarial testing and red teaming to identify and mitigate potential harmful and problematic content.”

As an additional safety measure, Google’s tagging images produced using ImageFX with SynthID, a digital watermark that’s allegedly robust against image edits and crops.

Google Imagen 2

An image sample from Imagen 2.

“SynthID watermarks are imperceptible to the human eye but detectable for identification,” Google continues in the blog post. “With added insights in ‘About this image,’ you’ll know if an image may have been generated with Google’s AI tools when you come across it in Google Search or Chrome.”

You’ll find ImageFX in AI Test Kitchen, Google’s web app for experimental AI projects.



Imagen 2 expanded​

In related news today, Google said that it’s bringing Imagen 2 to more of its products and services starting this week, including to its next-gen AI search experience and family of managed AI services Vertex AI.

Imagen 2 — which also now powers text-to-image capabilities in Google Ads and Duet AI in Workspace, Google’s GenAI suite of products for productivity — has made its way into Google’s SGE (Search Generative Experience). SGE, which began surfacing image generation tools for users in Google Image Search last October, now taps Imagen 2 for generating images. Users can enter a prompt specifying what sort of image they want and SGE will return four results directly in the SGE conversational experience.

Google Imagen 2

Another sample from Imagen 2.

In Vertex AI, Imagen 2 is available through an API to Google Cloud customers. Elsewhere, Imagen 2 is now invokable through Bard, Google’s AI-driven chatbot.

“With Imagen 2, Bard understands simple or complex prompts so that you can generate a range of high-quality images,” Google explains. “Just type in a description — like ‘create an image of a dog riding a surfboard’ — and Bard will generate custom, wide-ranging visuals to help bring your idea to life.”

Google still hasn’t revealed the data it used to train Imagen 2, which — while disappointing — doesn’t exactly come as a surprise. It’s an open legal question as to whether GenAI vendors like Google can train a model on publicly available — even copyrighted — data and then turn around and commercialize that model.

Google Imagen 2

Image Credits: Google

Relevant lawsuits are working their way through the courts, with vendors arguing that they’re protected by fair use doctrine. But it’ll be some time before the dust settles.

In the meantime, Google’s playing it safe by keeping quiet on the matter.
 


FCC moves to outlaw AI-generated robocalls​

Devin Coldewey @techcrunch / 3:23 PM EST•January 31, 2024

An illustration of a humanoid robot emerging from a smartphone screen

Image Credits: Golden Sikorka / Getty Images

No one likes robocalls to begin with, but using AI-generated voices of people like President Biden makes them even worse. As such, the FCC is proposing that using voice cloning tech in robocalls be ruled fundamentally illegal, making it easier to charge the operators of these frauds.

You may ask why it’s necessary if robocalls are illegal to begin with. In fact some automated calls are necessary and even desirable, and it’s only when a call operation is found to be breaking the law in some way that it becomes the business of the authorities.

For example, regarding the recent fake Biden calls in New Hampshire telling people not to vote, the attorney general there can (and did) say with confidence that the messages “appear to be an unlawful attempt to disrupt the New Hampshire Presidential Primary Election and to suppress New Hampshire voters.”

Under the law there, voter suppression is illegal and so, when they track down the perpetrators (and I’m emailing them constantly to find out if they have, by the way) that will be what they are charged with, likely among other things. But it remains that a crime must be committed, or reasonably suspected to have been committed, for the authorities to step in.

If employing voice cloning tech in automated calls, like what was obviously used on Biden, is itself illegal, that makes charging robocallers that much easier.

“That’s why the FCC is taking steps to recognize this emerging technology as illegal under existing law, giving our partners at State Attorneys General offices across the country new tools they can use to crack down on these scams and protect consumers,” said FCC Chairwoman Jessica Rosenworcel in a news release. They previously announced that they were looking into this back when the problem was relatively fresh.

The FCC already uses the Telephone Consumer Protection Act as the basis for charging robocallers and other telephone scammers. The TCPA already prohibits “artificial” voices, but it is not clear that cloned voices fall under that category. It’s arguable, for instance, that a company could use the generated voice of its CEO for legitimate business purposes.

But the fact is that legal applications of the tech are fewer in number and less immediately important than the illegal applications. Therefore the FCC proposes to issue a Declaratory Ruling that AI-powered voice cloning causes a call to fall under the “artificial” heading.

The law here is being rapidly iterated as telephone, messaging and generative voice tech all evolve. So don’t be surprised if it isn’t entirely clear what is and isn’t illegal, or why despite being obviously illegal, some calls or scams seem to operate with impunity. It’s a work in progress.

Update: FCC spokesman Will Wiquist told me that procedurally, this proposal will be propagated internally and voted on at Commissioners’ discretion. It will only be public when and if it is adopted.
 






Almost unbelievable - Serving the LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system 🔥

Paper - "KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization"

📌 The existing problem - LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in ultra-low precisions, such as sub-4-bit.

📌 This paper presents KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including:

👉 (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution;

👉 (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization;

👉 (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions;

👉 (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and

👉 (v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization.

📌 By applying this method to the LLaMA, LLaMA-2, and Mistral models, the paper achieves <0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches.
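As a toy illustration of idea (i), the sketch below (not the authors' code) compares per-token (per-row) against per-channel (per-column) quantization error on a small Key matrix with one outlier channel. The 3-bit setting and the toy numbers are my own assumptions:

```python
# Keys often have outlier channels; sharing a quantization scale down a
# channel (column) loses less precision than sharing one across a token (row),
# because the outlier no longer inflates the scale used for small values.
def quantize_rows(mat, bits=3):
    """Per-token quantization: one scale per row."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for row in mat:
        scale = max(abs(x) for x in row) / qmax or 1.0
        out.append([round(x / scale) * scale for x in row])
    return out

def quantize_cols(mat, bits=3):
    """Per-channel quantization: one scale per column (the choice for Keys)."""
    qmax = 2 ** (bits - 1) - 1
    qcols = []
    for col in zip(*mat):
        scale = max(abs(x) for x in col) / qmax or 1.0
        qcols.append([round(x / scale) * scale for x in col])
    return [list(row) for row in zip(*qcols)]

def sse(a, b):
    """Sum of squared errors between two matrices."""
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))

# Toy Key matrix [tokens x channels]; column 0 is an outlier channel.
K = [[8.0, 0.1, -0.2],
     [-8.1, -0.1, 0.3],
     [7.9, 0.2, -0.1]]
print(sse(K, quantize_rows(K)), ">", sse(K, quantize_cols(K)))
```

On this toy input, per-row quantization zeroes out the small entries (the outlier in each row inflates the row scale), while per-column quantization preserves them, giving a much lower squared error.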







Computer Science > Machine Learning​

[Submitted on 31 Jan 2024]

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization​

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Yakun Sophia Shao, Kurt Keutzer, Amir Gholami
LLMs are seeing growing use for applications such as document analysis and summarization which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in ultra-low precisions, such as sub-4-bit. In this work, we present KVQuant, which addresses this problem by incorporating novel methods for quantizing cached KV activations, including: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges; and (v) Q-Norm, where we normalize quantization centroids in order to mitigate distribution shift, providing additional benefits for 2-bit quantization. By applying our method to the LLaMA, LLaMA-2, and Mistral models, we achieve <0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving the LLaMA-7B model with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system.
Subjects:Machine Learning (cs.LG)
Cite as:arXiv:2401.18079 [cs.LG]
(or arXiv:2401.18079v1 [cs.LG] for this version)

Submission history​

From: Coleman Hooper [view email]
[v1] Wed, 31 Jan 2024 18:58:14 UTC (1,474 KB)




 




Infini-gram is a language model that predicts the next token using arbitrary-length context with a suffix array. Using 1.4T tokens as training data, this LM can predict the next token with nearly 50% accuracy, and combined with a neural LM, it reduced the perplexity of human-written documents by nearly 20%. arxiv.org/abs/2401.17377

(My comment: I was working on suffix arrays 15 years ago.) Although this study used a suffix array, with compressed full-text indices such as FM-index or LZ-index we can reduce the index size to about a few hundred GB even for 1T tokens. Also, there are already methods for finding the longest prefix and the most frequent following tokens in constant time per token.
A more fundamental problem is that an n-gram LM only looks for exact prefix matches, so unlike neural LMs, it cannot be used for generation tasks.
I think it would be better to let neural LMs learn how to use indices in the same way as RAG (especially RETRO).
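The ∞-gram idea with backoff can be sketched in a few lines: find the longest suffix of the context that occurs in the training tokens, then predict the token that most often follows it. A real engine does the lookup with a suffix array in milliseconds; the naive scan below just keeps the sketch short:

```python
from collections import Counter

# Toy ∞-gram next-token prediction with backoff over the context suffix.
def infini_gram_predict(train, context):
    # Try the longest context suffix first, backing off one token at a time.
    for start in range(len(context)):
        suffix = context[start:]
        followers = Counter(
            train[i + len(suffix)]
            for i in range(len(train) - len(suffix))
            if train[i : i + len(suffix)] == suffix
        )
        if followers:
            return followers.most_common(1)[0][0]
    return None

train = "the cat sat on the mat the cat ran".split()
print(infini_gram_predict(train, "on the cat".split()))
```

Because the lookup requires an exact suffix match, a context never seen verbatim falls back to ever-shorter suffixes, which is exactly the limitation the commenter notes: the method interpolates counts, it does not generalize the way a neural LM does.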



Computer Science > Computation and Language​

[Submitted on 30 Jan 2024]

Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens​

Jiacheng Liu, Sewon Min, Luke Zettlemoyer, Yejin Choi, Hannaneh Hajishirzi
Are n-gram language models still relevant in this era of neural large language models (LLMs)? Our answer is yes, and we show their values in both text analysis and improving neural LLMs. Yet this necessitates modernizing n-gram models in two aspects. First, we train them at the same data scale as neural LLMs -- 1.4 trillion tokens. This is the largest n-gram model ever built. Second, existing n-gram models use small n which hinders their performance; we instead allow n to be arbitrarily large, by introducing a new ∞-gram LM with backoff. Instead of pre-computing n-gram count tables (which would be very expensive), we develop an engine named infini-gram -- powered by suffix arrays -- that can compute ∞-gram (as well as n-gram with arbitrary n) probabilities with millisecond-level latency. The ∞-gram framework and infini-gram engine enable us to conduct many novel and interesting analyses of human-written and machine-generated text: we find that the ∞-gram LM has fairly high accuracy for next-token prediction (47%), and can complement neural LLMs to greatly reduce their language modeling perplexities. When analyzing machine-generated text, we also observe irregularities in the machine--∞-gram agreement level with respect to the suffix length, which indicates deficiencies in neural LLM pretraining and the positional embeddings of Transformers. We open-source our infini-gram engine in the hopes of enabling more study on how to best use verbatim information retrieved from large text corpora.
Subjects:Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Information Retrieval (cs.IR)
Cite as:arXiv:2401.17377 [cs.CL]
(or arXiv:2401.17377v1 [cs.CL] for this version)
[2401.17377] Infini-gram: Scaling Unbounded n-gram Language Models to a Trillion Tokens

Submission history​

From: Jiacheng Liu [view email]
[v1] Tue, 30 Jan 2024 19:03:49 UTC (6,464 KB)

 




Computer Science > Computation and Language​

[Submitted on 30 Jan 2024]

H2O-Danube-1.8B Technical Report​

Philipp Singer, Pascal Pfeiffer, Yauhen Babakhin, Maximilian Jeblick, Nischay Dhankhar, Gabor Fodor, Sri Satish Ambati
We present H2O-Danube-1.8B, a 1.8B language model trained on 1T tokens following the core principles of LLama 2 and Mistral. We leverage and refine various techniques for pre-training large language models. Although our model is trained on significantly fewer total tokens compared to reference models of similar size, it exhibits highly competitive metrics across a multitude of benchmarks. We additionally release a chat model trained with supervised fine-tuning followed by direct preference optimization. We make H2O-Danube-1.8B openly available under Apache 2.0 license further democratizing LLMs to a wider audience economically.
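The direct preference optimization (DPO) step mentioned in the abstract optimizes a simple contrastive loss over per-sequence log-probabilities. The sketch below only illustrates the standard DPO formula with made-up numbers; it is not the H2O-Danube training code:

```python
import math

# DPO loss: -log sigmoid(beta * margin), where the margin compares how much
# more the policy prefers the chosen answer over the rejected one, relative
# to a frozen reference model.
def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more than the reference does -> low loss.
low = dpo_loss(-5.0, -9.0, -6.0, -8.0)
# Policy prefers the rejected answer -> higher loss.
high = dpo_loss(-9.0, -5.0, -8.0, -6.0)
print(low, "<", high)
```

The appeal of DPO over RLHF is visible even in this sketch: the loss is an ordinary differentiable function of log-probabilities, so no reward model or sampling loop is needed during fine-tuning.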
Subjects:Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:arXiv:2401.16818 [cs.CL]
(or arXiv:2401.16818v1 [cs.CL] for this version)

Submission history

From: Philipp Singer [view email]
[v1] Tue, 30 Jan 2024 08:45:08 UTC (591 KB)



Base model: h2oai/h2o-danube-1.8b-base · Hugging Face
Chat model: h2oai/h2o-danube-1.8b-chat · Hugging Face


 