bnew


Performance

The following table summarizes the performance results (perplexity, model size, and run time for single-token prediction). It is modeled after the corresponding table on the main page.

Model  Measure                 F16     Q2_K    Q3_K_S  Q3_K_M  Q3_K_L  Q4_K_S  Q4_K_M  Q5_K_S  Q5_K_M  Q6_K
7B     perplexity              5.9066  6.7764  6.4571  6.1503  6.0869  6.0215  5.9601  5.9419  5.9208  5.9110
7B     file size               13.0G   2.67G   2.75G   3.06G   3.35G   3.56G   3.80G   4.33G   4.45G   5.15G
7B     ms/tok@4th, M2 Max      116     56      81      69      76      50      55      70      71      75
7B     ms/tok@8th, M2 Max      111     36      46      36      46      36      40      44      46      51
7B     ms/tok@4th, RTX-4080    60      15.5    18.6    17.0    17.7    15.5    16.0    16.7    16.9    18.3
7B     ms/tok@4th, Ryzen7950X  214     57      58      61      67      68      71      81      82      93
13B    perplexity              5.2543  5.8545  5.6033  5.4498  5.4063  5.3404  5.3002  5.2785  5.2638  5.2568
13B    file size               25.0G   5.13G   5.27G   5.88G   6.45G   6.80G   7.32G   8.36G   8.60G   9.95G
13B    ms/tok@4th, M2 Max      216     103     156     148     144     95      102     132     134     142
13B    ms/tok@8th, M2 Max      213     67      83      77      83      68      73      81      84      95
13B    ms/tok@4th, RTX-4080    -       25.3    29.2    29.3    25.5    26.2    26.2    28.6    28.9    30.0
13B    ms/tok@4th, Ryzen7950X  414     109     113     118     129     130     137     156     161     180
I realize the above table is not easy to read, so here is a shortened version that shows a subset of the data:

Model  Measure                  F16     Q2_K    Q3_K_M  Q4_K_S  Q5_K_S  Q6_K
7B     perplexity               5.9066  6.7764  6.1503  6.0215  5.9419  5.9110
7B     file size                13.0G   2.67G   3.06G   3.56G   4.33G   5.15G
7B     ms/tok @ 4th, M2 Max     116     56      69      50      70      75
7B     ms/tok @ 8th, M2 Max     111     36      36      36      44      51
7B     ms/tok @ 4th, RTX-4080   60      15.5    17.0    15.5    16.7    18.3
7B     ms/tok @ 4th, Ryzen      214     57      61      68      81      93
13B    perplexity               5.2543  5.8545  5.4498  5.3404  5.2785  5.2568
13B    file size                25.0G   5.13G   5.88G   6.80G   8.36G   9.95G
13B    ms/tok @ 4th, M2 Max     216     103     148     95      132     142
13B    ms/tok @ 8th, M2 Max     213     67      77      68      81      95
13B    ms/tok @ 4th, RTX-4080   -       25.3    29.3    26.2    28.6    30.0
13B    ms/tok @ 4th, Ryzen      414     109     118     130     156     180
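For readers who want to reproduce the ms/tok numbers above on their own hardware, here is a minimal timing sketch using the llama-cpp-python bindings. The model path, prompt, and thread count are placeholder assumptions, not the exact setup used for the table.

```python
# Rough sketch of how numbers like "ms/tok @ 8th" can be measured locally with
# the llama-cpp-python bindings. Model path and prompt are placeholders.
import time

from llama_cpp import Llama

llm = Llama(
    model_path="./models/7B/llama-2-7b.Q4_K_M.gguf",  # hypothetical path to a quantized model
    n_ctx=512,
    n_threads=8,  # corresponds to the "@ 8th" (8-thread) columns above
)

start = time.time()
out = llm("Building a website can be done in 10 simple steps:", max_tokens=128)
elapsed = time.time() - start

n_tokens = out["usage"]["completion_tokens"]
print(f"{1000 * elapsed / n_tokens:.1f} ms/token over {n_tokens} tokens")
```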
 


bnew



Stack Overflow lays off over 100 people as the AI coding boom continues​


Stack Overflow has laid off 28 percent of its staff over a year after doubling its employee base in a massive hiring push.​

By Wes Davis, a weekend editor who covers the latest in tech and entertainment. He has written news, reviews, and more as a tech journalist since 2020.

Oct 16, 2023, 10:25 AM EDT

Stack Overflow cuts 28 percent of its staff. Image: Stack Overflow

Coding help forum Stack Overflow is laying off 28 percent of its staff as it struggles toward profitability. CEO Prashanth Chandrasekar announced today that the company is “significantly reducing the size of our go-to-market organization,” as well as “supporting teams” and other groups.

After the company doubled its employee base last year, Chandrasekar told The Verge's Nilay Patel in an interview that about 45 percent of those hires were for its go-to-market sales team, which he said was "obviously the largest team." We've reached out to Stack Overflow to find out what other teams may have been affected.


Word of the layoffs comes over a year after the company made a big hiring push, doubling its size to over 500 people. Stack Overflow did not elaborate on the reasons for the layoffs, but its hiring push began near the start of a generative AI boom that has stuffed chatbots into every corner of the tech industry, including coding. That presents clear challenges for a person-to-person coding help forum, as developers get comfortable with AI coding assistance and those tools are blended into the products they already use.

AI-generated coding answers have also posed problems for the company over the past year. The company issued a temporary ban on users generating answers with the help of an AI chatbot in December last year, but its alleged under-enforcement led to a months-long strike among moderators that was resolved in August; the ban is still in place today. Stack Overflow also announced it would start charging AI companies to train on its site.


 

bnew


NVIDIA TensorRT-LLM Coming To Windows, Brings Huge AI Boost To Consumer PCs Running GeForce RTX & RTX Pro GPUs​

Hassan Mujtaba • Oct 17, 2023 09:43 AM EDT


NVIDIA has announced that TensorRT-LLM is coming to Windows soon and will bring a huge AI boost to PCs running RTX GPUs.

NVIDIA RTX GPU-Powered PCs To Get Free AI Performance Boost In Windows With Upcoming TensorRT-LLM Support​

Back in September, NVIDIA announced its TensorRT-LLM library for data centers, which offered an 8x gain on the industry's top AI GPUs such as the Hopper H100 and the Ampere A100. Taking full advantage of the tensor-core acceleration featured on NVIDIA's GeForce RTX & RTX Pro GPUs, the latest release will deliver up to 4x faster performance in LLM inference workloads.


Earlier, we explained that one of the biggest updates TensorRT-LLM brings is a new scheduler known as in-flight batching, which allows work to enter and exit the GPU independently of other tasks. It enables dynamic processing of several smaller queries alongside large, compute-intensive requests on the same GPU. TensorRT-LLM makes use of optimized open-source models, which allow for higher speedups as batch sizes are increased. Starting today, these optimized open-source models have been made available to the public and can be downloaded at developer.nvidia.com.
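To make the scheduling idea more concrete, here is a small, self-contained toy simulation of in-flight (continuous) batching: requests of different lengths join and leave the batch independently, so short queries finish early while long ones keep running. This is only an illustration of the scheduling policy, not NVIDIA's implementation, and the request sizes are made up.

```python
# Toy illustration of in-flight (continuous) batching: the batch is rebuilt at
# every decoding step, so finished requests leave and queued requests enter
# without waiting for the whole batch to complete.
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    rid: int
    tokens_needed: int   # how many tokens this request still has to generate
    generated: int = 0

def run(requests, max_batch=4):
    queue = deque(requests)
    active = []
    step = 0
    while queue or active:
        # Admit new requests into any free slots (the "in-flight" part).
        while queue and len(active) < max_batch:
            active.append(queue.popleft())
        # One decoding step: every active request produces one token.
        for r in active:
            r.generated += 1
        # Requests that are done exit immediately, freeing their slot.
        for r in active:
            if r.generated >= r.tokens_needed:
                print(f"step {step}: request {r.rid} finished after {r.generated} tokens")
        active = [r for r in active if r.generated < r.tokens_needed]
        step += 1

run([Request(0, 3), Request(1, 10), Request(2, 2), Request(3, 6), Request(4, 4)])
```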

The added AI acceleration from TensorRT-LLM will help drive various daily productivity tasks such as engaging in chat, summarizing documents and web content, and drafting emails and blogs, and it can also be used to analyze data and generate content from whatever material is available to the model.

So how will TensorRT-LLM help consumer PCs running Windows? In a demo, NVIDIA showed a comparison between an open-source pre-trained LLM such as LLaMa-2 and the same model running with TensorRT-LLM. When a query is passed to plain LLaMa-2, the model can only draw on the large, generalized dataset (such as Wikipedia) it was trained on: it has no information newer than its training cutoff, no domain-specific datasets it was never trained on, and certainly no knowledge of data stored on your personal devices or systems. So you won't get the specific data you are looking for.


There are two approaches to solving this problem. One is fine-tuning, where the LLM is optimized on a specific dataset, but that can take a lot of time depending on the size of the dataset. The other is RAG, or Retrieval-Augmented Generation, which uses a localized library that can be filled with the dataset you want the LLM to draw on, then leverages the language-understanding capabilities of the LLM to provide information that comes only from that dataset.
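Here is a bare-bones sketch of that retrieve-then-prompt flow. The `embed` and `generate` functions are hypothetical placeholders (a hashed bag-of-words vector and a canned string) standing in for a real embedding model and a real local LLM; only the structure is the point.

```python
# Minimal RAG sketch: embed local documents once, retrieve the closest ones for
# a query, and prepend them to the prompt so the LLM answers from *your* data.
import numpy as np

def embed(text: str) -> np.ndarray:
    # Placeholder embedding: hashed bag-of-words. A real system would use a
    # trained embedding model instead.
    v = np.zeros(256)
    for word in text.lower().split():
        v[hash(word) % 256] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def generate(prompt: str) -> str:
    # Placeholder for a local LLM call (e.g., a LLaMa-2 model).
    return f"[LLM answer conditioned on a prompt of {len(prompt)} characters]"

# 1) Build the local library from your own documents.
docs = [
    "GeForce article: NVIDIA technologies in Alan Wake 2 ...",
    "Driver release notes ...",
    "DLSS 3.5 overview ...",
]
doc_vecs = np.stack([embed(d) for d in docs])

# 2) At query time, retrieve the most similar documents.
query = "Which NVIDIA technologies are integrated in Alan Wake 2?"
scores = doc_vecs @ embed(query)
top = [docs[i] for i in np.argsort(scores)[::-1][:2]]

# 3) Prepend the retrieved text so the answer comes only from that dataset.
prompt = "Answer using only the context below.\n\n" + "\n\n".join(top) + "\n\nQ: " + query
print(generate(prompt))
```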


In the example, a question is asked about the NVIDIA tech integrations within Alan Wake 2. The standard LLaMa-2 model is unable to find the proper answer, but the TensorRT-LLM-backed model, which is fed data from 30 GeForce News articles in a local repository, provides the required information without any problems. So TensorRT-LLM gives a relevant answer and also does it faster than the plain LLaMa-2 model. Furthermore, NVIDIA confirmed that you can use TensorRT-LLM to accelerate almost any model. This is just one of the many use cases where NVIDIA TensorRT-LLM can leverage AI to deliver faster and more productive PC experiences on Windows, so stay tuned for more announcements in the future.
 

bnew


[Submitted on 13 Oct 2023]

PaLI-3 Vision Language Models: Smaller, Faster, Stronger​

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state of the art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.
Subjects:Computer Vision and Pattern Recognition (cs.CV)
Cite as:arXiv:2310.09199 [cs.CV]
(or arXiv:2310.09199v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2310.09199

Submission history​

From: Xiaohua Zhai [view email]
[v1] Fri, 13 Oct 2023 15:45:19 UTC (520 KB)
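Since the abstract's key comparison hinges on SigLIP (sigmoid-loss contrastive) pretraining, here is a hedged sketch of the pairwise sigmoid image-text loss that SigLIP uses in place of the usual softmax contrastive loss. The embedding shapes and the temperature/bias values are illustrative assumptions, not numbers taken from the paper.

```python
# Sketch of a SigLIP-style pairwise sigmoid loss: every image-text pair is
# treated as an independent binary classification (match vs. non-match).
import torch
import torch.nn.functional as F

def siglip_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (N, D) L2-normalized embeddings of N matched pairs.
    t, b: temperature and bias (learnable scalars in the real setup)."""
    logits = t * img_emb @ txt_emb.T + b          # (N, N) pairwise similarities
    labels = 2 * torch.eye(len(img_emb)) - 1      # +1 on the diagonal, -1 elsewhere
    return -F.logsigmoid(labels * logits).sum() / len(img_emb)

# Tiny usage example with random embeddings (assumed shapes and init values).
N, D = 8, 32
img = F.normalize(torch.randn(N, D), dim=-1)
txt = F.normalize(torch.randn(N, D), dim=-1)
print(siglip_loss(img, txt, t=torch.tensor(10.0), b=torch.tensor(-10.0)))
```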



 

bnew


Computer Science > Computation and Language​

[Submitted on 3 Oct 2023 (v1), last revised 5 Oct 2023 (this version, v2)]

Ring Attention with Blockwise Transformers for Near-Infinite Context​

Hao Liu, Matei Zaharia, Pieter Abbeel
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving extended sequences or long-term dependencies. We present a distinct approach, Ring Attention, which leverages blockwise computation of self-attention to distribute long sequences across multiple devices while concurrently overlapping the communication of key-value blocks with the computation of blockwise attention. By processing longer input sequences while maintaining memory efficiency, Ring Attention enables training and inference of sequences that are device count times longer than those of prior memory-efficient Transformers, effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in allowing large sequence input size and improving performance.
Subjects:Computation and Language (cs.CL)
Cite as:arXiv:2310.01889 [cs.CL]
(or arXiv:2310.01889v2 [cs.CL] for this version)
[2310.01889] Ring Attention with Blockwise Transformers for Near-Infinite Context

Submission history​

From: Hao Liu [view email]
[v1] Tue, 3 Oct 2023 08:44:50 UTC (1,656 KB)
[v2] Thu, 5 Oct 2023 06:25:34 UTC (1,664 KB)









Let's talk about Ring Attention

With its innovative ring topology and communication-computation overlap, Ring Attention represents a breakthrough in enabling transformers to leverage vastly expanded context. This work has immense value for the field of deep learning and tremendous potential for enabling new applications.

The sheer magnitude of contexts unlocked by Ring Attention is groundbreaking—over 100 million tokens on a TPU cluster. No other method comes close to this scale. With essentially infinite context, entirely new modalities become feasible such as processing entire books, videos, or genomes within a single model.

Equally important is that Ring Attention achieves this while maintaining critical performance metrics like throughput and FLOPs utilization. The ring structure distributes computation with minimal overhead. This makes scaling context size practical and performant.
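To make the mechanism a bit more concrete, here is a single-process NumPy sketch of the core idea: each "device" keeps its query block, key/value blocks travel around the ring, and every hop updates a running (online-softmax) accumulator, so no device ever materializes the full attention matrix. Shapes, block sizes, and the sequential loop are illustrative assumptions; real Ring Attention overlaps the KV rotation with blockwise compute across actual hosts.

```python
# Single-process sketch of Ring Attention's core computation.
import numpy as np

def ring_attention(q, k, v, n_devices):
    seq, d = q.shape
    block = seq // n_devices
    Q = q.reshape(n_devices, block, d)
    K = list(k.reshape(n_devices, block, d))
    V = list(v.reshape(n_devices, block, d))

    acc = np.zeros((n_devices, block, d))     # running numerator
    denom = np.zeros((n_devices, block))      # running softmax denominator
    m = np.full((n_devices, block), -np.inf)  # running max for numerical stability

    for _ in range(n_devices):                # one hop of the ring per iteration
        for dev in range(n_devices):
            s = Q[dev] @ K[dev].T / np.sqrt(d)            # local (block, block) scores
            m_new = np.maximum(m[dev], s.max(axis=-1))
            scale = np.exp(m[dev] - m_new)
            p = np.exp(s - m_new[:, None])
            acc[dev] = acc[dev] * scale[:, None] + p @ V[dev]
            denom[dev] = denom[dev] * scale + p.sum(axis=-1)
            m[dev] = m_new
        # "Send" each KV block to the next device in the ring.
        K = K[1:] + K[:1]
        V = V[1:] + V[:1]

    return (acc / denom[..., None]).reshape(seq, d)

# Sanity check against vanilla full attention on a toy problem.
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(16, 8)) for _ in range(3))
s = q @ k.T / np.sqrt(8)
w = np.exp(s - s.max(-1, keepdims=True))
ref = (w / w.sum(-1, keepdims=True)) @ v
print(np.allclose(ring_attention(q, k, v, n_devices=4), ref))  # True
```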

The implications are extraordinarily exciting. Tasks requiring reasoning over long distances, large knowledge bases, and interconnected content will benefit enormously. Models can ingest whole documents, have discussions spanning days, and tackle complex sequential decision making where context is key. Scientific and internet-scale data will become tractable.

Furthermore, larger contextualized models are broadly beneficial. They learn richer representations, better handle rare cases, and become more sample efficient. Recent results on models like GPT-3 and PaLM demonstrate their superior few-shot learning capabilities.

For both industry and research, Ring Attention lowers the barriers to training models that fully leverage enormous datasets and long sequences. It will accelerate innovation in generative models, reasoning abilities, multimodal understanding, and more. Unlocking such extensive context facilitates open-ended progress.

 

bnew


Top ML Papers of the Week (Oct 9 - Oct 15):

- Instruct-Retro
- Ring Attention
- LLMs can Learn Rules
- A Survey of LLMs for Healthcare
- Meta Chain-of-Thought Prompting
- Toward Language Agent Fine-tuning
...

----
1/ Ring Attention - a memory-efficient approach that leverages blockwise computation of self-attention to distribute long sequences across multiple devices to overcome the memory limitations inherent in Transformer architectures, enabling handling of longer sequences during training and inference; enables scaling the context length with the number of devices while maintaining performance, exceeding context length of 100 million without attention approximations.



2/ Universal Simulator - applies generative modeling to learn a universal simulator of real-world interactions; it can emulate how humans and agents interact with the world by simulating the visual outcome of high-level instructions and low-level controls; the system can be used to train vision-language planners, low-level reinforcement learning policies, and even systems that perform video captioning.



3/ Overview of Factuality in LLMs - a survey of factuality in LLMs providing insights into how to evaluate factuality in LLMs and how to enhance it.



4/ LLMs can Learn Rules - presents a two-stage framework that learns a rule library for reasoning with LLMs; in the first stage (induction), an LLM is prompted to generate and verify rules over training examples; the rule library keeps rules that appear often and lead to correct answers; the second stage (deduction) prompts the LLM to employ the learned rule library to perform reasoning and answer test questions; improves results on numerical reasoning and relational reasoning problems (a minimal sketch of this induce-then-deduce loop appears after this list).



5/ Meta Chain-of-Thought Prompting - a generalizable chain-of-thought (Meta-CoT) prompting method for mixed-task scenarios where the type of input question is unknown; it comprises three phases: 1) scenario identification: samples distinct questions as in-context learning demonstrations to help automatically categorize scenarios based on input questions; 2) demonstration selection: constructs diverse demonstrations from a pool based on the scenario obtained in the first phase; 3) answer derivation: performs a final answer inference on the input question using the previously fetched demonstrations.



6/ A Survey of LLMs for Healthcare - a comprehensive overview of LLMs applied to the healthcare domain.



7/ Improving Retrieval-Augmented LMs with Compressors - presents two approaches to compress retrieved documents into text summaries before prepending them in-context: 1) extractive compressor - selects useful sentences from retrieved documents; 2) abstractive compressor - generates summaries by synthesizing information from multiple documents; achieves a compression rate as low as 6% with minimal loss in performance on language modeling and open-domain question answering tasks; the proposed training scheme performs selective augmentation, which helps generate empty summaries when retrieved docs are irrelevant or unhelpful for a task.



8/ Instruct-Retro - introduces Retro 48B, the largest LLM pretrained with retrieval; continues pretraining a 43B parameter GPT model on an additional 100B tokens by retrieving from 1.2T tokens (using the Retro augmentation method); the Retro 48B model shows significant perplexity improvement over its GPT 43B counterpart. Scaling the Retro model to 48B means it can be instruction-tuned more effectively. This work applies instruction tuning to Retro 48B and demonstrates significant improvement (+7%) over the instruction-tuned GPT on zero-shot question-answering tasks.



9/ MemWalker - a method to enhance long-text understanding by treating the LLM as an interactive agent that can decide how to read the text via iterative prompting; it first processes the long context into a tree of summary nodes and then, given a query, traverses the tree seeking relevant information before crafting a suitable response; this process is achieved through reasoning, which enables effective reading and enhances explainability through the reasoning steps.



10/ Toward Language Agent Fine-tuning - explores the direction of fine-tuning LLMs to obtain language agents; finds that language agents consistently improved after fine-tuning their backbone language model; claims that fine-tuning a Llama2-7B with 500 agent trajectories (generated by GPT-4) leads to a 77% HotpotQA performance increase.
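As referenced in item 4 above, here is a minimal sketch of that induce-then-deduce loop. The `llm` function is a hypothetical placeholder (it returns a canned reply so the sketch runs end to end), and the rule-verification step is simplified to an answer check rather than the paper's exact procedure.

```python
# Stage 1 (induction): propose rules over training examples, keep the ones that
# recur and usually lead to the right answer. Stage 2 (deduction): answer test
# questions using the learned rule library.
from collections import Counter

def llm(prompt: str) -> str:
    # Hypothetical placeholder: swap in a real API or local model call.
    return "Rule: if a number ends in 0 or 5, it is divisible by 5. Answer: yes"

def induce_rules(train_examples, k=5):
    counts, correct = Counter(), Counter()
    for question, answer in train_examples:
        reply = llm(f"Question: {question}\nState a general rule, then 'Answer:' and the answer.")
        rule, _, predicted = reply.partition("Answer:")
        rule = rule.strip()
        counts[rule] += 1
        if predicted.strip().rstrip(".").lower() == answer.lower():
            correct[rule] += 1
    # Library = rules that appear often and are usually right.
    return [r for r, n in counts.most_common(k) if correct[r] / n > 0.5]

def deduce(question, rule_library):
    rules = "\n".join(f"- {r}" for r in rule_library)
    return llm(f"Known rules:\n{rules}\n\nUse them to answer: {question}")

train = [("Is 40 divisible by 5?", "yes"), ("Is 75 divisible by 5?", "yes")]
library = induce_rules(train)
print(library)
print(deduce("Is 90 divisible by 5?", library))
```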

 

bnew


Computer Science > Computation and Language​

[Submitted on 17 Oct 2023]

BitNet: Scaling 1-bit Transformers for Large Language Models​

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
Comments:Work in progress
Subjects:Computation and Language (cs.CL)
Cite as:arXiv:2310.11453 [cs.CL]
(or arXiv:2310.11453v1 [cs.CL] for this version)
[2310.11453] BitNet: Scaling 1-bit Transformers for Large Language Models

Submission history​

From: Shuming Ma [view email]
[v1] Tue, 17 Oct 2023 17:59:15 UTC (236 KB)
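Since the abstract describes BitLinear as a drop-in replacement for nn.Linear that trains 1-bit weights from scratch, here is a hedged PyTorch sketch of that idea: latent full-precision weights are binarized to {-1, +1} with a per-tensor scale on the forward pass, and a straight-through estimator lets gradients flow to the latent weights. The paper's activation quantization and exact normalization scheme are omitted, so treat this as an illustration rather than the paper's BitLinear.

```python
# Minimal BitLinear-style sketch (not the paper's exact layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Module):
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.normal_(self.weight, std=0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        alpha = w.mean()                   # center the weights before binarization
        scale = (w - alpha).abs().mean()   # per-tensor scaling factor
        w_bin = torch.sign(w - alpha) * scale
        # Straight-through estimator: forward uses the binarized weights,
        # backward treats them as the latent full-precision weights.
        w_q = w + (w_bin - w).detach()
        return F.linear(x, w_q)

# Drop-in usage in place of nn.Linear (toy example).
layer = BitLinearSketch(16, 4)
out = layer(torch.randn(2, 16))
out.sum().backward()
print(out.shape, layer.weight.grad.shape)
```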








