bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,805


StableLM-3B-4E1T​

Technical report for StableLM-3B-4E1T
Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz


Contents: Model Architecture · Training Data · Training Procedure · Downstream Results · System Details · Conclusion · Acknowledgments · References

StableLM-3B-4E1T is a 3 billion (3B) parameter language model pre-trained under the multi-epoch regime to study the impact of repeated tokens on downstream performance. Given prior success in this area (Taylor et al., 2022; Tay et al., 2023, https://arxiv.org/pdf/2205.05131.pdf), we train on 1 trillion (1T) tokens for 4 epochs, following the observations of Muennighoff et al. (2023) in "Scaling Data-Constrained Language Models," in which they find that "training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data." Further inspiration for the token count is taken from "Go smol or go home" (De Vries, 2023), which suggests a 2.96B model trained for 2.85 trillion tokens achieves a similar loss to a Chinchilla compute-optimal 9.87B language model (k_n = 0.3).
https://github.com/orgs/Stability-AI/projects/8?pane=issue&itemId=36926940

Model Architecture​


The model is a decoder-only transformer similar to the LLaMA (Touvron et al., 2023, https://arxiv.org/abs/2307.09288) architecture, with some modifications. Key dimensions are listed below:
Parameters      Hidden Size   Layers   Heads   Sequence Length
2,795,443,200   2560          32       32      4096


Training Data​

The dataset comprises a filtered mixture of open-source large-scale datasets available on the HuggingFace Hub: Falcon RefinedWeb extract (Penedo et al., 2023), RedPajama-Data (Together Computer, 2023) and The Pile (Gao et al., 2020), both without the Books3 subset, and StarCoder (Li et al., 2023). The complete list is provided in Table 1.


Table 1: Open-source datasets used for multi-epoch training. Note that the total token count does not account for the reduced size after downsampling C4, Common Crawl (2023), and GitHub to obtain 1T tokens.
Given the large amount of web data, we recommend fine-tuning the base StableLM-3B-4E1T for your downstream tasks.

Training Procedure​

The model is trained for 972k steps in bfloat16 precision with a global context length of 4096, instead of the multi-stage ramp-up from 2048 to 4096 used for StableLM-Alpha v2. The batch size is set to 1024 sequences (4,194,304 tokens). We optimize with AdamW (Loshchilov and Hutter, 2017) and use linear warmup for the first 4.8k steps, followed by a cosine decay schedule to 4% of the peak learning rate. Early instabilities are attributed to extended periods in high learning rate regions. We do not incorporate dropout (Srivastava et al., 2014) due to the model's relatively small size. Detailed hyperparameters are provided in the model config here.
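
For illustration only, here is a minimal sketch of a warmup-plus-cosine schedule matching the description above (4.8k linear warmup steps, cosine decay to 4% of the peak rate over the 972k-step run); the peak learning rate shown is a placeholder, not the reported hyperparameter:

import math

def lr_schedule(step, peak_lr=1e-4, warmup_steps=4_800,
                total_steps=972_000, final_ratio=0.04):
    # Linear warmup, then cosine decay to final_ratio * peak_lr.
    # peak_lr is a placeholder; see the released model config for real values.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (final_ratio + (1.0 - final_ratio) * cosine)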

During training, we evaluate natural language benchmarks and observe steady improvements over the course of training until the tail end of the learning rate decay schedule. For this reason, we decided to linearly cool down the learning rate towards 0, similar to Zhai et al. (2021), in hopes of squeezing out performance. We plan to explore alternative schedules in future work.

Furthermore, our initial stage of pre-training relies on the flash-attention API (Tri Dao, 2023) with its out-of-the-box triangular causal masking support, which forces the model to attend across document boundaries within a packed sequence as if they belonged to a single document. In the cool-down stage, we instead reset position IDs and attention masks at EOD tokens for all packed sequences, after empirically observing improved sample quality (read: less repetition) in a concurrent experiment. We hypothesize that this late adjustment leads to the notable degradation in byte-length normalized accuracies on ARC Easy (Clark et al., 2018) and SciQ (Welbl et al., 2017).
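
As a rough illustration (not the actual training code), resetting position IDs and building a document-restricted causal mask at EOD tokens for a packed sequence could look like the sketch below; the EOD token ID is a placeholder:

import torch

def reset_at_eod(input_ids, eod_token_id=0):
    # Per-document position IDs and a block-diagonal causal mask for a
    # packed sequence, restarting at each EOD token.
    # eod_token_id is a placeholder; the real tokenizer defines its own.
    seq_len = input_ids.shape[0]
    # Document index per token: increments after every EOD token.
    doc_ids = torch.cumsum(
        torch.cat([torch.zeros(1, dtype=torch.long),
                   (input_ids[:-1] == eod_token_id).long()]), dim=0)
    # Position IDs restart from 0 at the start of each document.
    positions = torch.arange(seq_len)
    doc_starts = torch.zeros(seq_len, dtype=torch.long)
    for d in doc_ids.unique():
        idx = (doc_ids == d).nonzero(as_tuple=True)[0]
        doc_starts[idx] = idx[0]
    position_ids = positions - doc_starts
    # Causal attention restricted to tokens within the same document.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return position_ids, causal & same_doc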


Figure 1: Toy demonstration of attention mask resetting.

Data composition was modified during the cool-down. Specifically, we remove Ubuntu IRC, OpenWebText, HackerNews, and FreeLaw for quality control and further NSFW filtering while upsampling C4. The distribution shift is likely responsible for the increased loss (+0.02 nats) from the initial stage.


See the plots below for validation dynamics across our hold-out set and common NLP benchmarks.

Note: The released checkpoint is taken from step 970k, selected based on validation loss and average downstream performance.

Downstream Results​

The following zero-shot evaluations are performed with EleutherAI's lm-evaluation-harness using the lm-bench branch of Stability AI's fork.



Table 2: Zero-shot performance across popular language modeling and common sense reasoning benchmarks. lm-eval results JSONs can be found in the evals directory of the StableLM repo.

StableLM-3B-4E1T achieves state-of-the-art performance (September 2023) at the 3B parameter scale for open-source models and is competitive with many of the popular contemporary 7B models, even outperforming our most recent 7B StableLM-Base-Alpha-v2.

System Details​

    • Hardware: StableLM-3B-4E1T was trained on the Stability AI cluster across 256 NVIDIA A100 40GB GPUs (AWS P4d instances). Training began on August 23, 2023, and took approximately 30 days to complete.
Note: TFLOPs are estimated using GPT-NeoX's get_flops function.



Conclusion​

StableLM-3B-4E1T provides further evidence for the claims in Muennighoff et al. (2023) at the trillion token scale, suggesting multi-epoch training as a valid approach to improving downstream performance when working under data constraints.

Acknowledgments​

We thank our MLOps team members, Richard Vencu and Sami Kama, for 30 days of uninterrupted pre-training, and Reshinth Adithyan, James Baicoianu, Nathan Cooper, Christian Laforte, Nikhil Pinnaparaju, and Enrico Shippole for fruitful discussions and guidance.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,805

AI language models can exceed PNG and FLAC in lossless compression, says study​

Is compression equivalent to general intelligence? DeepMind digs up more potential clues.​

BENJ EDWARDS - 9/28/2023, 11:43 AM
Photo of a C-clamp compressing books. (Getty Images)


Effective compression is about finding patterns to make data smaller without losing information. When an algorithm or model can accurately guess the next piece of data in a sequence, it shows it's good at spotting these patterns. This links the idea of making good guesses—which is what large language models like GPT-4 do very well—to achieving good compression.
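
To make the prediction-compression link concrete, here is a small illustrative sketch (my own, not from the paper) of the ideal code length that an arithmetic coder driven by a model's predictions would approach:

import math

def ideal_compressed_bits(probabilities):
    # Shannon code length: sum of -log2(p) over the probabilities the
    # model assigned to the symbols that actually occurred, in order.
    return sum(-math.log2(p) for p in probabilities)

# Toy example: a confident model vs. uniform guessing over 256 byte values.
confident = [0.9] * 100       # model assigns 0.9 to each observed byte
uniform = [1.0 / 256] * 100   # no better than storing raw bytes

print(ideal_compressed_bits(confident) / 8)  # ~1.9 bytes for 100 bytes of input
print(ideal_compressed_bits(uniform) / 8)    # 100.0 bytes, i.e. no compression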

In an arXiv research paper titled "Language Modeling Is Compression," researchers detail their discovery that the DeepMind large language model (LLM) called Chinchilla 70B can perform lossless compression on image patches from the ImageNet image database to 43.4 percent of their original size, beating the PNG algorithm, which compressed the same data to 58.5 percent. For audio, Chinchilla compressed samples from the LibriSpeech audio data set to just 16.4 percent of their raw size, outdoing FLAC compression at 30.3 percent.

In this case, lower numbers in the results mean more compression is taking place. And lossless compression means that no data is lost during the compression process. It stands in contrast to a lossy compression technique like JPEG, which sheds some data and reconstructs some of the data with approximations during the decoding process to significantly reduce file sizes.

The study's results suggest that even though Chinchilla 70B was mainly trained to deal with text, it's surprisingly effective at compressing other types of data as well, often better than algorithms specifically designed for those tasks. This opens the door for thinking about machine learning models as not just tools for text prediction and writing but also as effective ways to shrink the size of various types of data.
A chart of compression test results provided by DeepMind researchers in their paper. The chart illustrates the efficiency of various data compression techniques on different data sets, all initially 1GB in size. It employs a lower-is-better ratio, comparing the compressed size to the original size. (Credit: DeepMind)

Over the past two decades, some computer scientists have proposed that the ability to compress data effectively is akin to a form of general intelligence. The idea is rooted in the notion that understanding the world often involves identifying patterns and making sense of complexity, which, as mentioned above, is similar to what good data compression does. By reducing a large set of data into a smaller, more manageable form while retaining its essential features, a compression algorithm demonstrates a form of understanding or representation of that data, proponents argue.

The Hutter Prize is an example that brings this idea of compression as a form of intelligence into focus. Named after Marcus Hutter, a researcher in the field of AI and one of the named authors of the DeepMind paper, the prize is awarded to anyone who can most effectively compress a fixed set of English text. The underlying premise is that a highly efficient compression of text would require understanding the semantic and syntactic patterns in language, similar to how a human understands it.

So theoretically, if a machine can compress this data extremely well, it might indicate a form of general intelligence—or at least a step in that direction. While not everyone in the field agrees that winning the Hutter Prize would indicate general intelligence, the competition highlights the overlap between the challenges of data compression and the goals of creating more intelligent systems.


Along these lines, the DeepMind researchers claim that the relationship between prediction and compression isn't a one-way street. They posit that if you have a good compression algorithm like gzip, you can flip it around and use it to generate new, original data based on what it has learned during the compression process.

In one section of the paper (Section 3.4), the researchers carried out an experiment to generate new data across different formats—text, image, and audio—by getting gzip and Chinchilla to predict what comes next in a sequence of data after conditioning on a sample. Understandably, gzip didn't do very well, producing completely nonsensical output—to a human mind, at least. It demonstrates that while gzip can be compelled to generate data, that data might not be very useful other than as an experimental curiosity. On the other hand, Chinchilla, which is designed with language processing in mind, predictably performed far better in the generative task.
An example from the DeepMind paper comparing the generative properties of gzip and Chinchilla on a sample text. gzip's output is unreadable. (Credit: DeepMind)

While the DeepMind paper on AI language model compression has not been peer-reviewed, it provides an intriguing window into potential new applications for large language models. The relationship between compression and intelligence is a matter of ongoing debate and research, so we'll likely see more papers on the topic emerge soon.

Benj Edwards is an AI and Machine Learning Reporter for Ars Technica. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.



Promoted Comments​


redleader
But what about decompression rate? FLAC has always been noteworthy for being an asymmetrical codec which takes more computational power to compress than to decompress (potentially a lot more, depending on the settings used). If this new AI codec requires a lot of number crunching to decode, it may not be such a big win in all situations.

In terms of a practical format, FLAC/PNG are designed to be incredibly fast and lightweight because they have to be integrated into mobile devices, web browsers, etc without consuming huge amounts of memory and power. For example, FLAC is designed to be able to decode CD audio losslessly in realtime on DSP cores with single-digit MHz and tens of kilobytes of RAM while using the absolute lowest amount of battery. I'm not sure how much memory Chinchilla 70B requires, but seeing as the model has 70 billion parameters, I suspect it will not fit into 64 KB of memory on a low power embedded audio device.
September 28, 2023 at 4:25 pm
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,805


Can my GPU run this LLM?


Calculate how much GPU memory you need & get a breakdown of where it goes for training/inference of any LLM, with quantization (GGML/bitsandbytes), inference frameworks (vLLM/llama.cpp/HF) & QLoRA.

Link: LLM memory check


Purpose

I made this to check whether you can run a particular LLM on your GPU. It is useful for figuring out the following:

  1. What quantization should I use to fit a given model on my GPU?
  2. What maximum context length can my GPU handle?
  3. What kind of finetuning can I do? Full? LoRA? QLoRA?
  4. What maximum batch size can I use during finetuning?
  5. What is consuming my GPU memory? What should I change to fit the LLM on my GPU?
The output is the total vRAM & the breakdown of where the vRAM goes (in MB). It looks like the example below:

{
  "Total": 4000,
  "KV Cache": 1000,
  "Model Size": 2000,
  "Activation Memory": 500,
  "Grad & Optimizer memory": 0,
  "cuda + other overhead": 500
}


Can't we just look at the model size & figure this out?

Finding which LLMs your GPU can handle isn't as easy as looking at the model size, because during inference the KV cache takes a substantial amount of memory. For example, with sequence length 1000 on llama-2-7b it takes 1GB of extra memory (using Hugging Face's LlamaForCausalLM; with exLlama & vLLM this is 500MB). And during training, the KV cache, activations & quantization overhead all take a lot of memory. For example, llama-7b with bnb int8 quant is ~7.5GB in size, yet it isn't possible to finetune it using LoRA on data with 1000 context length even with an RTX 4090 24GB, which means an additional 16GB of memory goes into quant overheads, activations & grad memory.
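
As a rough illustration of why the KV cache matters, here is a back-of-the-envelope estimate (my own sketch, not the calculator's exact code) using llama-2-7b-like dimensions:

def kv_cache_bytes(seq_len, n_layers, hidden_size, bytes_per_value=2):
    # KV cache = 2 (key + value) x layers x seq_len x hidden_size values.
    # bytes_per_value=2 assumes fp16/bf16 storage.
    return 2 * n_layers * seq_len * hidden_size * bytes_per_value

# llama-2-7b-like config: 32 layers, hidden size 4096, sequence length 1000.
print(kv_cache_bytes(seq_len=1000, n_layers=32, hidden_size=4096) / 2**20)
# -> 500 MiB, in line with the exLlama/vLLM figure quoted above; the
# Hugging Face implementation roughly doubles this in practice.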

How to use

Model Name/ID/Size

  1. You can either enter the model id of a huggingface model (e.g. meta-llama/Llama-2-7b). Currently I have hardcoded & saved the model configs of the top 3k most downloaded LLMs on huggingface.
  2. If you have a custom model or your huggingface id isn't available, then you can either upload a json config (example) or just enter your model size (e.g. 7 billion for llama-2-7b)

Options

  1. Inference: Find vRAM for inference using either HuggingFace implementation or vLLM or GGML
  2. Training: Find vRAM for either full model finetuning or finetuning using LoRA (currently I have hardcoded r=8 for the LoRA config) or using QLoRA.

Quantization

  1. Currently it supports: bitsandbytes (bnb) int8/int4 & GGML (QK_8, QK_6, QK_5, QK_4, QK_2). The latter are only for inference while bnb int8/int4 can be used for both training & inference

Context Len/Sequence Length

  1. This is the length of your prompt plus the maximum number of new tokens generated. For training, this is the sequence length of your training data. Batch size is 1 for inference & can be specified for training. The option to specify batch sizes for inference still needs to be added.

How reliable are the numbers?

The results can vary depending on your model, input data, cuda version & what quant you are using, & it is impossible to predict exact values. I have tried to take these into account & make sure the results are within 500MB. In the table below I cross-check the 3b, 7b & 13b model memories given by the website vs. what I get on my RTX 4090 & 2060 GPUs. All values are within 500MB.


How are the values calculated?

Total memory = model size + kv-cache + activation memory + optimizer/grad memory + cuda etc. overhead

  1. Model size = this is your .bin file size (divide it by 2 for Q8 quant & by 4 for Q4 quant).
  2. KV-Cache = Memory taken by the KV (key-value) vectors. Size = (2 x sequence length x hidden size) per layer. For the huggingface implementation this is (2 x 2 x sequence length x hidden size) per layer.
  3. Activation Memory = Even when you use LoRA and your model params don't require grad, their intermediate results still need to be stored to do the backward pass through them (these take the most memory). There is no simple formula here; it depends on the implementation.
  4. Optimizer/Grad memory = Memory taken by the .grad tensors & the tensors associated with the optimizer (running averages etc.)
  5. Cuda etc. overhead = Around 500MB-1GB of memory is taken by CUDA whenever it is loaded; this varies. There are also additional overheads when you use any quantization (like bitsandbytes). Again, no straightforward formula. A rough sketch of the overall breakdown is shown after this list.
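
To tie the breakdown back to the example output shown earlier, here is a minimal sketch of the sum (my own illustration; the calculator's internal logic is more detailed):

def estimate_total_vram_mb(model_size_mb, kv_cache_mb, activation_mb,
                           grad_optimizer_mb=0, cuda_overhead_mb=500):
    # Rough total following the breakdown above; all inputs are in MB.
    # Activation, grad/optimizer and CUDA overhead terms have no exact
    # closed form, so they are passed in as estimates.
    total = (model_size_mb + kv_cache_mb + activation_mb
             + grad_optimizer_mb + cuda_overhead_mb)
    return {
        "Total": total,
        "KV Cache": kv_cache_mb,
        "Model Size": model_size_mb,
        "Activation Memory": activation_mb,
        "Grad & Optimizer memory": grad_optimizer_mb,
        "cuda + other overhead": cuda_overhead_mb,
    }

# Matches the example output shown earlier (inference, so no grad memory).
print(estimate_total_vram_mb(model_size_mb=2000, kv_cache_mb=1000,
                             activation_mb=500))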

Why are the results wrong?

Sometimes the answers might be very wrong in which case please open an issue here & I will try to fix it.

TODO

  1. Add support for exLlama
  2. Add QLora ✅
  3. Add a way to measure the approximate tokens/s you can get for a particular GPU
  4. Improve logic to get hyper-params from size (since hidden layer/intermediate size/number of layers can vary for a particular size) ✅
  5. Add AWQ
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,805

Researchers from China Introduce DualToken-ViT: A Fusion of CNNs and Vision Transformers for Enhanced Image Processing Efficiency and Accuracy​

By
Aneesh Tickoo
-

October 1, 2023


In recent years, vision transformers (ViTs) have become a potent architecture for various vision applications, including object detection and image classification. This is because convolutional neural networks (CNNs) are constrained by the size of the convolutional kernel and can only extract local information, whereas self-attention can extract global information from the image, delivering rich and meaningful visual features. ViTs have also yet to show signs of performance saturation as dataset and model size grow, which is an advantage over CNNs for both large models and huge datasets. Due to several inductive biases that ViTs lack, however, CNNs are still preferable to ViTs in lightweight models.

Self-attention's quadratic complexity contributes to the potentially high computational cost of ViTs, so it is not easy to build lightweight, effective ViTs. One common approach is a pyramid structure that separates the model into multiple stages, with the number of tokens shrinking and the number of channels growing per stage, to construct more efficient and lightweight ViTs. Other work emphasizes streamlining and refining the self-attention structure to mitigate its quadratic complexity, but at the expense of attention's usefulness. A typical strategy is to downsample the key and value of self-attention, which reduces the number of tokens involved in the process.

By conducting self-attention on grouped tokens independently, certain locally grouped self-attention-based works lower the complexity of the overall attention component, but such techniques may harm the sharing of global knowledge. Some efforts additionally introduce a few extra learnable parameters to enhance the backbone's global information, for example by adding a branch of global tokens used at all stages. Local attention techniques, such as locally grouped self-attention-based and convolution-based structures, can be enhanced with this method. However, all existing global token approaches only consider global information and disregard positional information, which is crucial for vision tasks.


Figure 1: A visualization of the attention map for the key token (the most crucial component of the picture for the image classification challenge) and position-aware global tokens. The first picture in each row serves as the model’s input, while the second image depicts the correlation between each token in the position-aware global tokens, which each comprise seven tokens, where the red-boxed section is the first image’s key token.


In this study, researchers from East China Normal University and Alibaba Group put forth DualToken-ViT, a compact and effective vision transformer model. Their design replaces standard self-attention with a more efficient attention structure: convolution and self-attention are used together to extract local and global information, and the outputs of the two branches are fused. Although window self-attention can also extract local information, they find that convolution is more effective in their lightweight model. They downsample the feature map that produces the keys and values step-wise to retain more information throughout the downsampling process, which lowers the computational cost of self-attention for broadcasting global information.
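
As a rough, hypothetical sketch of the kind of dual-branch attention described above (not the authors' actual implementation; all dimensions and pooling factors are placeholders), a convolutional local branch can be fused with a global branch that attends over downsampled keys and values:

import torch
import torch.nn as nn

class DualBranchAttention(nn.Module):
    # Illustrative fusion of a convolutional local branch with a global
    # self-attention branch over downsampled keys/values.
    def __init__(self, dim=64, num_heads=4, kv_downsample=2):
        super().__init__()
        # Local branch: depthwise conv captures neighborhood information.
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Global branch: queries over all tokens, keys/values downsampled.
        self.pool = nn.AvgPool2d(kv_downsample)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(2 * dim, dim)  # fuse the two branches

    def forward(self, x):  # x: (batch, dim, height, width)
        local = self.local(x).flatten(2).transpose(1, 2)   # (b, h*w, dim)
        q = x.flatten(2).transpose(1, 2)                   # (b, h*w, dim)
        kv = self.pool(x).flatten(2).transpose(1, 2)       # fewer tokens
        global_out, _ = self.attn(q, kv, kv)
        return self.proj(torch.cat([local, global_out], dim=-1))

# Toy usage: a batch of two 16x16 feature maps with 64 channels.
y = DualBranchAttention()(torch.randn(2, 64, 16, 16))
print(y.shape)  # torch.Size([2, 256, 64])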

Additionally, they employ position-aware global tokens at every stage to improve the quality of global information. Unlike standard global tokens, their position-aware global tokens also maintain and pass on image location information, giving their model an edge in vision tasks. The efficacy of these position-aware global tokens is shown in Figure 1, where the key token in the image exhibits a stronger correlation with the corresponding tokens in the position-aware global tokens.

In a nutshell, their contributions are as follows:
• They develop a compact and effective vision transformer model called DualToken-ViT, which fuses local and global tokens containing local and global information, respectively, to achieve an efficient attention structure by combining the benefits of convolution and self-attention.
• They also suggest position-aware global tokens, which would expand the global information by including the image’s location data.
• Their DualToken-ViT exhibits the greatest performance on image classification, object identification, and semantic segmentation among vision models of the same FLOPs magnitude.


Check out the Paper. All credit for this research goes to the researchers on this project.


 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,805

Pentagon Urges AI Companies to Share More About Their Technology​

  • Defense Department is holding symposium to discuss AI
  • Official says agency wants to use the algorithms safely



By Katrina Manson

September 29, 2023 at 4:31 PM EDT


The Defense Department’s top artificial intelligence official said the agency needs to know more about AI tools before it fully commits to using the technology and urged developers to be more transparent.


Craig Martell, the Pentagon’s chief digital and artificial intelligence officer, wants companies to share insights into how their AI software is built — without forfeiting their intellectual property — so that the department can “feel comfortable and safe” adopting it.

AI software relies on large language models, or LLMs, which use massive data sets to power tools such as chatbots and image generators. The services are typically offered without showing their inner workings — in a so-called black box. That makes it hard for users to understand how the technology comes to decisions or what makes it get better or worse at its job over time.

“We’re just getting the end result of the model-building — that’s not sufficient,” Martell said in an interview. The Pentagon has no idea how models are structured or what data has been used, he said.

Read More: How Large Language Models Work, Making Chatbots Lucid

Companies also aren’t explaining what dangers their systems could pose, Martell said.
“They’re saying: ‘Here it is. We’re not telling you how we built it. We’re not telling you what it’s good or bad at. We’re not telling you whether it’s biased or not,’” he said.

He described such models as the equivalent of “found alien technology” for the Defense Department. He’s also concerned that only a few groups of people have enough money to build LLMs. Martell didn’t identify any companies by name, but Microsoft Corp., Alphabet Inc.’s Google and Amazon.com Inc. are among those developing LLMs for the commercial market, along with startups OpenAI and Anthropic.

Martell is inviting industry and academics to Washington in February to address the concerns. The Pentagon’s symposium on defense data and AI aims to figure out what jobs LLMs may be suitable to handle, he said.

Martell's team, which is running a task force to assess LLMs, has already found 200 potential uses for them within the Defense Department, he said.
“We don’t want to stop large language models,” he said. “We just want to understand the use, the benefits, the dangers and how to mitigate against them.”

There is “a large upswell” within the department of people who would like to use LLMs, Martell said. But they also recognize that if the technology hallucinates — the term for when AI software fabricates information or delivers an incorrect result, which is not uncommon — they are the ones that must take responsibility for it.


He hopes the February symposium will help build what he called “a maturity model” to establish benchmarks relating to hallucination, bias and danger. While it might be acceptable for the first draft of a report to include AI-related mistakes — something a human could later weed out — those errors wouldn’t be acceptable in riskier situations, such as information that’s needed to make operational decisions.


A classified session at the three-day February event will focus on how to test and evaluate models, and protect against hacking.

Martell said his office is playing a consulting role within the Defense Department, helping different groups figure out the right way to measure the success or failure of their systems. The agency has more than 800 AI projects underway, some of them involving weapons systems.

Given the stakes involved, the Pentagon will apply a higher bar for how it uses algorithmic models than the private sector, he said.
“There’s going to be lots of use cases where lives are on the line,” he said. “So allowing for hallucination or whatever we want to call it — it’s just not going to be acceptable.”
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,805







SEAN HOLLISTER / SEP 28
Say it with me: AI VR legs.
Remember the whole microscandal about Zuck’s VR legs?
Well... Meta is now using “machine learning models that are trained on large data sets of people” to let developers give you generative AI legs in the Meta Quest 3.
More new dev toys here.

You could say this tech has legs. Image: Meta; GIF by Sean Hollister / The Verge







 