StableLM-3B-4E1T
Technical report for StableLM-3B-4E1T
Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz
StableLM-3B-4E1T is a 3 billion (3B) parameter language model pre-trained under the multi-epoch regime to study the impact of repeated tokens on downstream performance. Given prior success in this area (Taylor et al., 2022; Tay et al., 2023, https://arxiv.org/pdf/2205.05131.pdf), we train on 1 trillion (1T) tokens for 4 epochs following the observations of Muennighoff et al. (2023) in "Scaling Data-Constrained Language Models", in which they find that "training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data." Further inspiration for the token count is taken from "Go smol or go home" (De Vries, 2023), which suggests that a 2.96B model trained for 2.85 trillion tokens achieves a similar loss to a Chinchilla compute-optimal 9.87B language model (k_n = 0.3).

Tracking issue: https://github.com/orgs/Stability-AI/projects/8?pane=issue&itemId=36926940
Model Architecture
Checkpoint: stabilityai/stablelm-3b-4e1t
The model is a decoder-only transformer similar to the LLaMA architecture (Touvron et al., 2023, https://arxiv.org/abs/2307.09288) with the following modifications:
| Parameters | Hidden Size | Layers | Heads | Sequence Length |
|---|---|---|---|---|
| 2,795,443,200 | 2560 | 32 | 32 | 4096 |
- Position Embeddings: Rotary Position Embeddings (Su et al., 2021) applied to the first 25% of head embedding dimensions for improved throughput following Black et al. (2022); see the sketch below the list.
- Normalization: LayerNorm (Ba et al., 2016) with learned bias terms as opposed to RMSNorm (Zhang & Sennrich, 2019).
- Tokenizer: GPT-NeoX (Black et al., 2022).
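To make the partial rotary application concrete, here is a minimal PyTorch sketch (not the gpt-neox kernel) that rotates only the first 25% of each head's dimensions, i.e. 20 of the 80 dimensions per head for this model, and passes the remainder through unchanged:

```python
import torch

def rotate_half(x):
    # Standard GPT-NeoX-style rotation helper.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_partial_rotary(q, rotary_pct=0.25, base=10000.0):
    """q: (batch, heads, seq_len, head_dim). Only the first
    rotary_pct * head_dim dimensions receive rotary embeddings."""
    head_dim, seq_len = q.shape[-1], q.shape[-2]
    rotary_dim = int(head_dim * rotary_pct)          # 80 * 0.25 = 20 here
    q_rot, q_pass = q[..., :rotary_dim], q[..., rotary_dim:]

    # Rotation frequencies for the rotary subspace only.
    inv_freq = 1.0 / (base ** (torch.arange(0, rotary_dim, 2, dtype=torch.float32) / rotary_dim))
    freqs = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)          # (seq_len, rotary_dim)
    cos, sin = emb.cos(), emb.sin()

    q_rot = q_rot * cos + rotate_half(q_rot) * sin   # rotate the first 25%
    return torch.cat((q_rot, q_pass), dim=-1)        # rest passes through
```

The same transform is applied to the key projections; in practice the cos/sin tables are precomputed once and cached.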
Training Data
The dataset comprises a filtered mixture of open-source large-scale datasets available on the HuggingFace Hub: Falcon RefinedWeb extract (Penedo et al., 2023), RedPajama-Data (Together Computer, 2023) and The Pile (Gao et al., 2020), both without the Books3 subset, and StarCoder (Li et al., 2023). The complete list is provided in Table 1.

Table 1: Open-source datasets used for multi-epoch training. Note that the total token count does not account for the reduced size after downsampling C4, Common Crawl (2023), and GitHub to obtain 1T tokens.
Given the large amount of web data, we recommend fine-tuning the base StableLM-3B-4E1T for your downstream tasks.
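As a rough illustration of the downsampling noted under Table 1, the hypothetical helper below scales a chosen set of subsets so the overall mixture sums to a 1T-token budget while every other dataset keeps one full epoch; the subset names and logic are placeholders, not the actual mixture weights:

```python
def downsample_to_budget(token_counts, budget=1_000_000_000_000,
                         downsample=("C4", "CommonCrawl", "GitHub")):
    """token_counts: dict of dataset name -> raw token count.
    Returns adjusted counts where only the `downsample` subsets are scaled
    so the total equals `budget`. Purely illustrative."""
    fixed = sum(n for name, n in token_counts.items() if name not in downsample)
    flexible = sum(n for name, n in token_counts.items() if name in downsample)
    scale = max(0.0, (budget - fixed) / flexible)
    return {name: int(n * scale) if name in downsample else n
            for name, n in token_counts.items()}
```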
Training Procedure
The model is trained for 972k steps in bfloat16 precision with a global context length of 4096, instead of the multi-stage ramp-up from 2048 to 4096 used for StableLM-Alpha v2. The batch size is set to 1024 sequences (4,194,304 tokens). We optimize with AdamW (Loshchilov and Hutter, 2017) and use linear warmup for the first 4.8k steps, followed by a cosine decay schedule to 4% of the peak learning rate. Early instabilities are attributed to extended periods in high learning rate regions. We do not incorporate dropout (Srivastava et al., 2014) due to the model's relatively small size. Detailed hyperparameters are provided in the model config here.

During training, we evaluate natural language benchmarks and observe steady improvements over the course of training until the tail end of the learning rate decay schedule. For this reason, we decided to linearly cool down the learning rate towards 0, similar to Zhai et al. (2021), in hopes of squeezing out performance. We plan to explore alternative schedules in future work.
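The resulting schedule (linear warmup, cosine decay to 4% of peak, then a linear cool-down toward 0) can be sketched as follows; the peak learning rate and the cool-down start step are placeholders rather than values from the released config:

```python
import math

def lr_at_step(step, peak_lr, warmup_steps=4_800, total_steps=972_000,
               min_ratio=0.04, cooldown_start=None):
    """Piecewise sketch: linear warmup -> cosine decay toward min_ratio * peak
    -> linear cool-down to 0 over the remaining steps. peak_lr and
    cooldown_start are placeholders, not values from the released config."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    # Nominal cosine decay from peak to min_ratio * peak over the full run.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine_lr = peak_lr * (min_ratio + (1.0 - min_ratio)
                           * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0))))
    if cooldown_start is None or step < cooldown_start:
        return cosine_lr
    # Cool-down: from the LR reached at cooldown_start, anneal linearly to 0.
    start_lr = lr_at_step(cooldown_start, peak_lr, warmup_steps, total_steps, min_ratio)
    tail = (step - cooldown_start) / (total_steps - cooldown_start)
    return start_lr * max(0.0, 1.0 - tail)
```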
Furthermore, our initial stage of pre-training relies on the flash-attention API (Dao, 2023) with its out-of-the-box triangular causal masking support. This forces the model to attend across unrelated documents within a packed sequence. In the cool-down stage, we instead reset position IDs and attention masks at EOD tokens for all packed sequences, after empirically observing improved sample quality (read: less repetition) in a concurrent experiment. We hypothesize that this late adjustment leads to the notable degradation in byte-length normalized accuracies on ARC Easy (Clark et al., 2018) and SciQ (Welbl et al., 2017).
Figure 1: Toy demonstration of attention mask resetting.
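A minimal sketch of the reset logic shown in Figure 1 (an illustration, not the training code): given a packed sequence and the EOD token ID, position IDs restart at each document boundary and the causal mask becomes block-diagonal so tokens cannot attend across documents.

```python
import torch

def reset_positions_and_mask(input_ids, eod_token_id):
    """input_ids: (seq_len,) packed token IDs for several documents.
    Returns position IDs that restart after each EOD token and a boolean
    (seq_len, seq_len) mask that is causal *and* blocks cross-document attention."""
    seq_len = input_ids.shape[0]
    # Document index per token: the counter bumps on the token *after* each EOD.
    is_eod = (input_ids == eod_token_id).long()
    doc_ids = torch.zeros(seq_len, dtype=torch.long)
    doc_ids[1:] = torch.cumsum(is_eod[:-1], dim=0)
    # Position IDs restart at 0 at every document boundary.
    position_ids = torch.zeros(seq_len, dtype=torch.long)
    for d in doc_ids.unique():
        idx = (doc_ids == d).nonzero().flatten()
        position_ids[idx] = torch.arange(idx.numel())
    # Causal mask restricted to same-document pairs (True = attention allowed).
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_ids.unsqueeze(0) == doc_ids.unsqueeze(1)
    return position_ids, causal & same_doc
```

In practice the dense mask is only for illustration; the same effect is typically obtained by passing per-document sequence boundaries to a variable-length attention kernel rather than materializing a full (seq_len, seq_len) mask.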
Data composition was modified during the cool-down. Specifically, we remove Ubuntu IRC, OpenWebText, HackerNews, and FreeLaw for quality control and further NSFW filtering while upsampling C4. The distribution shift is likely responsible for the increased loss (+0.02 nats) from the initial stage.
See the plots below for validation dynamics across our hold-out set and common NLP benchmarks.
Note: The released checkpoint is taken from step 970k according to validation loss and average downstream performance.
Downstream Results
The following zero-shot evaluations are performed with EleutherAI's lm-evaluation-harness using the lm-bench branch of Stability AI's fork.

Table 2: Zero-shot performance across popular language modeling and common sense reasoning benchmarks. lm-eval results JSONs can be found in the evals directory of the StableLM repo.
StableLM-3B-4E1T achieves state-of-the-art performance (September 2023) at the 3B parameter scale for open-source models and is competitive with many of the popular contemporary 7B models, even outperforming our most recent 7B StableLM-Base-Alpha-v2.
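For reference, a zero-shot run with the upstream lm-evaluation-harness Python API looks roughly like the sketch below; the exact entry points and task names in the lm-bench branch of the fork may differ, and the task list here is illustrative:

```python
# Sketch only: upstream EleutherAI lm-evaluation-harness API (entry points in
# the lm-bench fork may differ). Task names are illustrative.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",
    model_args="pretrained=stabilityai/stablelm-3b-4e1t",
    tasks=["arc_easy", "arc_challenge", "sciq", "winogrande", "hellaswag"],
    num_fewshot=0,
)
print(results["results"])  # per-task metrics, e.g. acc and acc_norm
```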
System Details
- Hardware: StableLM-3B-4E1T was trained on the Stability AI cluster across 256 NVIDIA A100 40GB GPUs (AWS P4d instances). Training began on August 23, 2023, and took approximately 30 days to complete.
- Software: We use a fork of gpt-neox (EleutherAI, 2021), train under 2D parallelism (Data and Tensor Parallel) with ZeRO-1 (Rajbhandari et al., 2019), and rely on flash-attention as well as SwiGLU and Rotary Embedding kernels from FlashAttention-2 (Dao, 2023).
Note: TFLOPs are estimated using GPT-NeoX's get_flops function.
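For intuition, a common back-of-the-envelope model-FLOPs estimate (an approximation, not the exact get_flops implementation) counts roughly 6 FLOPs per parameter per token for the dense matmuls plus a quadratic attention term:

```python
def estimate_tflops_per_gpu(num_params, tokens_per_step, step_time_s, num_gpus,
                            num_layers=32, hidden_size=2560, seq_len=4096):
    """Rough throughput estimate in TFLOP/s per GPU. This approximates, rather
    than reproduces, GPT-NeoX's get_flops: 6*N*T covers forward+backward dense
    matmuls; the second term adds the attention score and context matmuls."""
    dense_flops = 6 * num_params * tokens_per_step
    # ~4*seq_len*hidden FLOPs per token per layer forward, x3 with backward.
    attn_flops = 12 * num_layers * seq_len * hidden_size * tokens_per_step
    return (dense_flops + attn_flops) / (step_time_s * num_gpus) / 1e12
```

With the 4,194,304-token global batch and 256 GPUs, this yields a rough per-GPU figure that can be compared against the A100's peak bfloat16 throughput.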