
Performance​

The following table summarizes the performance results: perplexity, model file size, and run time for single-token prediction in ms/tok, where 4th/8th denotes the number of threads. It basically mirrors the corresponding table on the main page.

| Model | Measure | F16 | Q2_K | Q3_K_S | Q3_K_M | Q3_K_L | Q4_K_S | Q4_K_M | Q5_K_S | Q5_K_M | Q6_K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | perplexity | 5.9066 | 6.7764 | 6.4571 | 6.1503 | 6.0869 | 6.0215 | 5.9601 | 5.9419 | 5.9208 | 5.9110 |
| 7B | file size | 13.0G | 2.67G | 2.75G | 3.06G | 3.35G | 3.56G | 3.80G | 4.33G | 4.45G | 5.15G |
| 7B | ms/tok @ 4th, M2 Max | 116 | 56 | 81 | 69 | 76 | 50 | 55 | 70 | 71 | 75 |
| 7B | ms/tok @ 8th, M2 Max | 111 | 36 | 46 | 36 | 46 | 36 | 40 | 44 | 46 | 51 |
| 7B | ms/tok @ 4th, RTX-4080 | 60 | 15.5 | 18.6 | 17.0 | 17.7 | 15.5 | 16.0 | 16.7 | 16.9 | 18.3 |
| 7B | ms/tok @ 4th, Ryzen 7950X | 214 | 57 | 58 | 61 | 67 | 68 | 71 | 81 | 82 | 93 |
| 13B | perplexity | 5.2543 | 5.8545 | 5.6033 | 5.4498 | 5.4063 | 5.3404 | 5.3002 | 5.2785 | 5.2638 | 5.2568 |
| 13B | file size | 25.0G | 5.13G | 5.27G | 5.88G | 6.45G | 6.80G | 7.32G | 8.36G | 8.60G | 9.95G |
| 13B | ms/tok @ 4th, M2 Max | 216 | 103 | 156 | 148 | 144 | 95 | 102 | 132 | 134 | 142 |
| 13B | ms/tok @ 8th, M2 Max | 213 | 67 | 83 | 77 | 83 | 68 | 73 | 81 | 84 | 95 |
| 13B | ms/tok @ 4th, RTX-4080 | - | 25.3 | 29.2 | 29.3 | 25.5 | 26.2 | 26.2 | 28.6 | 28.9 | 30.0 |
| 13B | ms/tok @ 4th, Ryzen 7950X | 414 | 109 | 113 | 118 | 129 | 130 | 137 | 156 | 161 | 180 |
I realize the above table is not easy to read, so here is a shortened version that shows a subset of the data:

| Model | Measure | F16 | Q2_K | Q3_K_M | Q4_K_S | Q5_K_S | Q6_K |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 7B | perplexity | 5.9066 | 6.7764 | 6.1503 | 6.0215 | 5.9419 | 5.9110 |
| 7B | file size | 13.0G | 2.67G | 3.06G | 3.56G | 4.33G | 5.15G |
| 7B | ms/tok @ 4th, M2 Max | 116 | 56 | 69 | 50 | 70 | 75 |
| 7B | ms/tok @ 8th, M2 Max | 111 | 36 | 36 | 36 | 44 | 51 |
| 7B | ms/tok @ 4th, RTX-4080 | 60 | 15.5 | 17.0 | 15.5 | 16.7 | 18.3 |
| 7B | ms/tok @ 4th, Ryzen | 214 | 57 | 61 | 68 | 81 | 93 |
| 13B | perplexity | 5.2543 | 5.8545 | 5.4498 | 5.3404 | 5.2785 | 5.2568 |
| 13B | file size | 25.0G | 5.13G | 5.88G | 6.80G | 8.36G | 9.95G |
| 13B | ms/tok @ 4th, M2 Max | 216 | 103 | 148 | 95 | 132 | 142 |
| 13B | ms/tok @ 8th, M2 Max | 213 | 67 | 77 | 68 | 81 | 95 |
| 13B | ms/tok @ 4th, RTX-4080 | - | 25.3 | 29.3 | 26.2 | 28.6 | 30.0 |
| 13B | ms/tok @ 4th, Ryzen | 414 | 109 | 118 | 130 | 156 | 180 |
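For a rough sense of what these file sizes mean per weight, the arithmetic below converts a few of the 7B sizes into effective bits per weight. It assumes the usual LLaMA-7B parameter count of about 6.74B and sizes in GiB; both are assumptions on my part, since the table only lists rounded sizes.

```python
# Back-of-the-envelope bits-per-weight implied by the 7B file sizes above.
# Assumes ~6.74e9 parameters for LLaMA-7B and file sizes in GiB; table
# rounding makes these approximate.
GIB = 1024 ** 3
N_PARAMS = 6.74e9
sizes_gib = {"Q2_K": 2.67, "Q3_K_M": 3.06, "Q4_K_S": 3.56, "Q5_K_S": 4.33, "Q6_K": 5.15}

for name, gib in sizes_gib.items():
    bits_per_weight = gib * GIB * 8 / N_PARAMS
    print(f"{name:7s} ~{bits_per_weight:.1f} bits/weight")

# Q2_K comes out near 3.4 bits/weight rather than 2 because the k-quant
# mixes keep some tensors at higher precision than the nominal 2-bit blocks.
```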
 




Stack Overflow lays off over 100 people as the AI coding boom continues​


Stack Overflow has laid off 28 percent of its staff over a year after doubling its employee base in a massive hiring push.​

By Wes Davis, a weekend editor who covers the latest in tech and entertainment. He has written news, reviews, and more as a tech journalist since 2020.

Oct 16, 2023, 10:25 AM EDT

Stack Overflow cuts 28 percent of its staff. Image: Stack Overflow

Coding help forum Stack Overflow is laying off 28 percent of its staff as it struggles toward profitability. CEO Prashanth Chandrasekar announced today that the company is “significantly reducing the size of our go-to-market organization,” as well as “supporting teams” and other groups.

After the company doubled its employee base last year, Chandrasekar told The Verge’s Nilay Patel in an interview that about 45 percent of those hires were for its go-to-market sales team, which he said was “obviously the largest team.” We’ve reached out to Stack Overflow to find out what other teams may have been affected.

Word of the layoffs comes over a year after the company made a big hiring push, doubling its size to over 500 people. Stack Overflow did not elaborate on the reasons for the layoffs, but its hiring push began near the start of a generative AI boom that has stuffed chatbots into every corner of the tech industry, including coding. That presents clear challenges for a personal coding help forum, as developers get comfortable with AI coding assistants that are increasingly blended into the products they already use.

AI-generated coding answers have also posed problems for the company over the past year. The company issued a temporary ban on users generating answers with the help of an AI chatbot in December last year, but its alleged under-enforcement led to a months-long strike among moderators that was resolved in August; the ban is still in place today. Stack Overflow also announced it would start charging AI companies to train on its site.


 


NVIDIA TensorRT-LLM Coming To Windows, Brings Huge AI Boost To Consumer PCs Running GeForce RTX & RTX Pro GPUs​

Hassan Mujtaba•Oct 17, 2023 09:43 AM EDT


NVIDIA has announced that TensorRT-LLM is coming to Windows soon and will bring a huge AI boost to PCs running RTX GPUs.

NVIDIA RTX GPU-Powered PCs To Get Free AI Performance Boost In Windows With Upcoming TensorRT-LLM Support​

Back in September, NVIDIA announced its TensorRT-LLM library for data centers, which offered an 8x gain on the industry's top AI GPUs such as the Hopper H100 and the Ampere A100. Taking full advantage of the tensor core acceleration featured on NVIDIA's GeForce RTX & RTX Pro GPUs, the latest release will deliver up to 4x faster performance in LLM inferencing workloads.

Earlier, we explained that one of the biggest updates TensorRT-LLM brings is a new scheduler known as in-flight batching, which allows work to enter and exit the GPU independently of other tasks. It enables dynamic processing of several smaller queries alongside large compute-intensive requests on the same GPU. TensorRT-LLM makes use of optimized open-source models which allow for higher speedups when batch sizes are increased. Starting today, these optimized open-source models have been made available to the public and can be downloaded at developer.nvidia.com.
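To make the scheduling idea concrete, here is a minimal, hypothetical Python sketch of in-flight (continuous) batching. It is not NVIDIA's implementation or API, just the core idea: finished sequences leave the batch immediately and queued requests take their slots instead of waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    rid: int
    tokens_left: int  # tokens still to generate for this sequence


def inflight_batching(requests, max_batch=4):
    """Toy scheduler: top up the active batch every step as slots free up."""
    waiting = deque(requests)
    active, step = [], 0
    while waiting or active:
        # Admit waiting requests the moment a slot is free (the in-flight part).
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        for r in active:
            r.tokens_left -= 1  # "generate" one token for every active request
        finished = [r for r in active if r.tokens_left == 0]
        active = [r for r in active if r.tokens_left > 0]
        step += 1
        for r in finished:
            print(f"step {step}: request {r.rid} done")


inflight_batching([Request(i, n) for i, n in enumerate([3, 8, 2, 5, 4, 6])])
```

Short requests finish and exit early, so the long request never blocks the queue; that is the throughput win the scheduler description above is getting at.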

The added AI acceleration from TensorRT-LLM will help drive everyday productivity tasks such as engaging in chat, summarizing documents and web content, and drafting emails and blog posts, and it can also be used to analyze data and generate large amounts of content from whatever material is available to the model.

So how will TensorRT-LLM help consumer PCs running Windows? In a demo, NVIDIA compared an open-source pre-trained LLM such as LLaMA 2 running on its own against the same model accelerated with TensorRT-LLM. When a query is passed to plain LLaMA 2, the model can only draw on the large, generalized dataset it was trained on (think Wikipedia), so it has no information from after its training cutoff and no domain-specific data it was never trained on. It certainly does not know about any data stored on your personal devices or systems, so you won't get the specific answers you are looking for.


There are two approaches to solving this problem. One is fine-tuning, where the LLM is optimized for a specific dataset, but that can take a lot of time depending on the size of the dataset. The other is RAG, or Retrieval-Augmented Generation, which uses a localized library filled with the dataset you want the LLM to draw on and then leverages the language-understanding capabilities of that LLM to return information that comes only from that library.
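As a rough illustration of the RAG pattern described above (not NVIDIA's demo code), the sketch below ranks a small local document library with TF-IDF and stuffs the top hits into the prompt. The documents are made-up placeholders standing in for something like a folder of GeForce news articles, and the assembled prompt would be handed to whatever local model you run.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical local "library" standing in for a folder of GeForce news articles.
DOCS = [
    "Alan Wake 2 ships with full ray tracing and DLSS 3.5 on GeForce RTX GPUs.",
    "RTX Video Super Resolution 1.5 upscales streamed video on RTX 20 series and newer.",
    "TensorRT-LLM adds in-flight batching for faster LLM inference on RTX hardware.",
]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Rank library documents against the query with TF-IDF cosine similarity."""
    vectorizer = TfidfVectorizer().fit(DOCS + [query])
    doc_vecs, query_vec = vectorizer.transform(DOCS), vectorizer.transform([query])
    scores = cosine_similarity(query_vec, doc_vecs).ravel()
    return [DOCS[i] for i in scores.argsort()[::-1][:k]]


def build_prompt(query: str) -> str:
    """Assemble the retrieval-augmented prompt that would be sent to a local LLM."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query))
    return (f"Answer using only the context below.\n\nContext:\n{context}\n\n"
            f"Question: {query}\nAnswer:")


print(build_prompt("What NVIDIA technologies are integrated in Alan Wake 2?"))
```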


In the example, a question is asked about the NVIDIA technology integrations in Alan Wake 2. The standard LLaMA 2 model is unable to find a proper answer, but the TensorRT-LLM setup, which is fed data from 30 GeForce news articles in a local repository, provides the required information without any problems. So TensorRT-LLM not only returns a relevant answer but also does so faster than the plain LLaMA 2 model. Furthermore, NVIDIA confirmed that TensorRT-LLM can be used to accelerate almost any model. This is just one of many use cases where NVIDIA TensorRT-LLM can leverage AI to deliver faster and more productive PC experiences on Windows, so stay tuned for more announcements in the future.
 


[Submitted on 13 Oct 2023]

PaLI-3 Vision Language Models: Smaller, Faster, Stronger​

Xi Chen, Xiao Wang, Lucas Beyer, Alexander Kolesnikov, Jialin Wu, Paul Voigtlaender, Basil Mustafa, Sebastian Goodman, Ibrahim Alabdulmohsin, Piotr Padlewski, Daniel Salz, Xi Xiong, Daniel Vlasic, Filip Pavetic, Keran Rong, Tianli Yu, Daniel Keysers, Xiaohua Zhai, Radu Soricut
This paper presents PaLI-3, a smaller, faster, and stronger vision language model (VLM) that compares favorably to similar models that are 10x larger. As part of arriving at this strong performance, we compare Vision Transformer (ViT) models pretrained using classification objectives to contrastively (SigLIP) pretrained ones. We find that, while slightly underperforming on standard image classification benchmarks, SigLIP-based PaLI shows superior performance across various multimodal benchmarks, especially on localization and visually-situated text understanding. We scale the SigLIP image encoder up to 2 billion parameters, and achieve a new state-of-the-art on multilingual cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles research on fundamental pieces of complex VLMs, and could fuel a new generation of scaled-up models.
Subjects:Computer Vision and Pattern Recognition (cs.CV)
Cite as:arXiv:2310.09199 [cs.CV]
(or arXiv:2310.09199v1 [cs.CV] for this version)
https://doi.org/10.48550/arXiv.2310.09199

Submission history​

From: Xiaohua Zhai [view email]
[v1] Fri, 13 Oct 2023 15:45:19 UTC (520 KB)
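Since the abstract leans on SigLIP's contrastive pretraining, here is a small NumPy sketch of the pairwise sigmoid loss it refers to: every image-text pair in the batch is treated as an independent binary classification, with matches on the diagonal. The fixed temperature and bias values are illustrative stand-ins for the learned parameters in the actual method.

```python
import numpy as np


def siglip_loss(img_emb, txt_emb, t=10.0, b=-10.0):
    """Pairwise sigmoid loss over a batch of (image, text) embeddings."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = t * img @ txt.T + b              # (n, n) similarity logits
    labels = 2.0 * np.eye(len(img)) - 1.0     # +1 for matching pairs, -1 otherwise
    # -log sigmoid(labels * logits), written stably as log1p(exp(-x))
    loss = np.log1p(np.exp(-labels * logits))
    return loss.sum() / len(img)


rng = np.random.default_rng(0)
print(siglip_loss(rng.standard_normal((8, 32)), rng.standard_normal((8, 32))))
```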



 


Computer Science > Computation and Language​

[Submitted on 3 Oct 2023 (v1), last revised 5 Oct 2023 (this version, v2)]

Ring Attention with Blockwise Transformers for Near-Infinite Context​

Hao Liu, Matei Zaharia, Pieter Abbeel
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving extended sequences or long-term dependencies. We present a distinct approach, Ring Attention, which leverages blockwise computation of self-attention to distribute long sequences across multiple devices while concurrently overlapping the communication of key-value blocks with the computation of blockwise attention. By processing longer input sequences while maintaining memory efficiency, Ring Attention enables training and inference of sequences that are device count times longer than those of prior memory-efficient Transformers, effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in allowing large sequence input size and improving performance.
Subjects:Computation and Language (cs.CL)
Cite as:arXiv:2310.01889 [cs.CL]
(or arXiv:2310.01889v2 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2310.01889

Submission history​

From: Hao Liu [view email]
[v1] Tue, 3 Oct 2023 08:44:50 UTC (1,656 KB)
[v2] Thu, 5 Oct 2023 06:25:34 UTC (1,664 KB)









Let's talk about Ring Attention

With its innovative ring topology and communication-computation overlap, Ring Attention represents a breakthrough in enabling transformers to leverage vastly expanded context. This work has immense value for the field of deep learning and tremendous potential for enabling new applications.

The sheer magnitude of contexts unlocked by Ring Attention is groundbreaking—over 100 million tokens on a TPU cluster. No other method comes close to this scale. With essentially infinite context, entirely new modalities become feasible such as processing entire books, videos, or genomes within a single model.

Equally important is that Ring Attention achieves this while maintaining critical performance metrics like throughput and FLOPs utilization. The ring structure distributes computation with minimal overhead. This makes scaling context size practical and performant.
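A toy, single-process sketch can make the core mechanism concrete: each "device" owns one query block, key/value blocks rotate around the ring, and partial attention is accumulated with a numerically stable online softmax. This simulates the math only, under my own simplifying assumptions; the real gains come from overlapping the block transfers with compute across actual devices.

```python
import numpy as np


def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulate Ring Attention on one process: rotate KV blocks past each query block."""
    n = len(q_blocks)
    outputs = []
    for i in range(n):                         # "device" i holds query block i
        q = q_blocks[i]                        # (bq, d)
        m = np.full(q.shape[0], -np.inf)       # running max of logits
        l = np.zeros(q.shape[0])               # running softmax denominator
        acc = np.zeros_like(q)                 # running weighted sum of values
        for step in range(n):                  # KV block arriving at this step of the ring
            j = (i + step) % n
            k, v = k_blocks[j], v_blocks[j]
            s = q @ k.T / np.sqrt(q.shape[1])  # (bq, bk) block of attention logits
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])
            scale = np.exp(m - m_new)          # rescale previous partial sums
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v
            m = m_new
        outputs.append(acc / l[:, None])
    return np.concatenate(outputs, axis=0)


rng = np.random.default_rng(0)
blocks = [rng.standard_normal((4, 8)) for _ in range(3)]
out = ring_attention(blocks, blocks, blocks)   # self-attention split over 3 "devices"
```

Splitting Q, K, and V into equal blocks and comparing the result against plain softmax attention over the concatenated sequence is a quick way to sanity-check the blockwise accumulation.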

The implications are extraordinarily exciting. Tasks requiring reasoning over long distances, large knowledge bases, and interconnected content will benefit enormously. Models can ingest whole documents, have discussions spanning days, and tackle complex sequential decision making where context is key. Scientific and internet-scale data will become tractable.

Furthermore, larger contextualized models are broadly beneficial. They learn richer representations, better handle rare cases, and become more sample efficient. Recent results on models like GPT-3 and PaLM demonstrate their superior few-shot learning capabilities.

For both industry and research, Ring Attention lowers the barriers to training models that fully leverage enormous datasets and long sequences. It will accelerate innovation in generative models, reasoning abilities, multimodal understanding, and more. Unlocking such extensive context facilitates open-ended progress.

 

Top ML Papers of the Week (Oct 9 - Oct 15):

- Instruct-Retro
- Ring Attention
- LLMs can Learn Rules
- A Survey of LLMs for Healthcare
- Meta Chain-of-Thought Prompting
- Toward Language Agent Fine-tuning
...

----
1/ Ring Attention - a memory-efficient approach that leverages blockwise computation of self-attention to distribute long sequences across multiple devices, overcoming the memory limitations inherent in Transformer architectures and enabling longer sequences during training and inference; scales the context length with the number of devices while maintaining performance, exceeding a context length of 100 million tokens without attention approximations.



2/ Universal Simulator - applies generative modeling to learn a universal simulator of real-world interactions; can emulate how humans and agents interact with the world by simulating the visual outcomes of both high-level instructions and low-level controls; the system can be used to train vision-language planners, low-level reinforcement learning policies, and even systems that perform video captioning.



3/ Overview of Factuality in LLMs - a survey of factuality in LLMs providing insights into how to evaluate factuality in LLMs and how to enhance it.



4/ LLMs can Learn Rules - presents a two-stage framework that learns a rule library for reasoning with LLMs; in the first stage (induction), an LLM is prompted to generate and verify rules over training examples, and the rule library keeps rules that appear often and lead to correct answers; the second stage (deduction) prompts the LLM to employ the learned rule library to perform reasoning and answer test questions; improves results on numerical reasoning and relational reasoning problems (a minimal sketch of the two stages appears after this list).



5/ Meta Chain-of-Thought Prompting - a generalizable chain-of-thought (Meta-CoT) prompting method for mixed-task scenarios where the type of input question is unknown; comprises three phases: 1) scenario identification: samples distinct questions as in-context learning demonstrations to help automatically categorize scenarios based on the input question; 2) demonstration selection: constructs diverse demonstrations from a pool based on the scenario obtained in the first phase; 3) answer derivation: performs the final answer inference on the input question using the previously fetched demonstrations.



6/ A Survey of LLMs for Healthcare - a comprehensive overview of LLMs applied to the healthcare domain.



7/ Improving Retrieval-Augmented LMs with Compressors - presents two approaches to compress retrieved documents into text summaries before prepending them in context: 1) an extractive compressor that selects useful sentences from retrieved documents, and 2) an abstractive compressor that generates summaries by synthesizing information from multiple documents; achieves a compression rate as low as 6% with minimal loss in performance on language modeling and open-domain question answering tasks; the proposed training scheme performs selective augmentation, which helps generate empty summaries when retrieved docs are irrelevant or unhelpful for a task.



8/ Instruct-Retro - introduces Retro 48B, the largest LLM pretrained with retrieval; continues pretraining a 43B parameter GPT model on an additional 100B tokens by retrieving from 1.2T tokens (using the Retro augmentation method); the Retro 48B model shows significant perplexity improvement over its GPT 43B counterpart. Scaling the Retro model to 48B means it can be instruction-tuned more effectively. This work applies instruction tuning to Retro 48B and demonstrates significant improvement (+7%) over the instruction-tuned GPT on zero-shot question-answering tasks.



9/ MemWalker - a method to enhance long-text understanding by treating the LLM as an interactive agent that decides how to read the text via iterative prompting; it first processes the long context into a tree of summary nodes, then, given a query, traverses the tree to seek relevant information and craft a suitable response; this process is achieved through reasoning, enabling effective reading and enhancing explainability via the reasoning steps.



10/ Toward Language Agent Fine-tuning - explores the direction of fine-tuning LLMs to obtain language agents; finds that language agents consistently improved after fine-tuning their backbone language model; claims that fine-tuning a Llama2-7B with 500 agent trajectories (generated by GPT-4) leads to a 77% HotpotQA performance increase.
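As referenced in item 4 above, here is a minimal, hypothetical sketch of the induction/deduction loop. The llm argument is a placeholder callable rather than any specific API, and rule verification is simplified to an exact match against the gold answer; treat it as an illustration of the two stages, not the paper's implementation.

```python
from collections import Counter


def induce_rules(llm, train_examples, threshold=2):
    """Induction stage: propose a rule per example, keep it only if applying it
    reproduces the gold answer, and retain rules that recur often enough."""
    counts = Counter()
    for question, gold in train_examples:
        rule = llm(f"State a general rule that solves: {question}")
        answer = llm(f"Apply this rule: {rule}\nQuestion: {question}\nAnswer:")
        if answer.strip() == gold:
            counts[rule] += 1
    return [rule for rule, c in counts.items() if c >= threshold]


def deduce(llm, rules, question):
    """Deduction stage: answer a test question with the learned rule library in the prompt."""
    library = "\n".join(f"- {r}" for r in rules)
    return llm(f"Rules:\n{library}\n\nUse the relevant rule to answer:\n{question}")
```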

 


Computer Science > Computation and Language​

[Submitted on 17 Oct 2023]

BitNet: Scaling 1-bit Transformers for Large Language Models​

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
Comments:Work in progress
Subjects:Computation and Language (cs.CL)
Cite as:arXiv:2310.11453 [cs.CL]
(or arXiv:2310.11453v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2310.11453

Submission history​

From: Shuming Ma [view email]
[v1] Tue, 17 Oct 2023 17:59:15 UTC (236 KB)
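To make the BitLinear idea concrete, below is a simplified PyTorch sketch: full-precision latent weights are binarized with a mean-absolute-value scale on the forward pass and trained with a straight-through estimator. The activation quantization and the exact normalization and scaling used in the paper are omitted, so treat this as an illustration rather than the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class BitLinearSketch(nn.Module):
    """Simplified stand-in for a BitLinear-style layer: latent fp weights,
    1-bit (+/- beta) weights on the forward pass, straight-through gradients."""

    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(out_features, in_features))
        nn.init.kaiming_uniform_(self.weight, a=5 ** 0.5)
        self.norm = nn.LayerNorm(in_features)  # normalize activations before the binary matmul

    def forward(self, x):
        x = self.norm(x)
        w = self.weight
        beta = w.abs().mean()              # per-tensor scale
        w_bin = torch.sign(w) * beta       # binarized weights, scaled
        # Straight-through estimator: forward uses w_bin, gradients flow to the latent w.
        w_q = w + (w_bin - w).detach()
        return F.linear(x, w_q)


layer = BitLinearSketch(16, 32)
y = layer(torch.randn(2, 16))  # (2, 32)
```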








