bnew


Computer Science > Computation and Language

[Submitted on 3 Oct 2023 (v1), last revised 5 Oct 2023 (this version, v2)]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, Pieter Abbeel
Transformers have emerged as the architecture of choice for many state-of-the-art AI models, showcasing exceptional performance across a wide range of AI applications. However, the memory demands imposed by Transformers limit their ability to handle long sequences, thereby creating challenges for tasks involving extended sequences or long-term dependencies. We present a distinct approach, Ring Attention, which leverages blockwise computation of self-attention to distribute long sequences across multiple devices while concurrently overlapping the communication of key-value blocks with the computation of blockwise attention. By processing longer input sequences while maintaining memory efficiency, Ring Attention enables training and inference of sequences that are device count times longer than those of prior memory-efficient Transformers, effectively eliminating the memory constraints imposed by individual devices. Extensive experiments on language modeling tasks demonstrate the effectiveness of Ring Attention in allowing large sequence input size and improving performance.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2310.01889 [cs.CL]
(or arXiv:2310.01889v2 [cs.CL] for this version)

Submission history

From: Hao Liu [view email]
[v1] Tue, 3 Oct 2023 08:44:50 UTC (1,656 KB)
[v2] Thu, 5 Oct 2023 06:25:34 UTC (1,664 KB)

Let's talk about Ring Attention

With its innovative ring topology and communication-computation overlap, Ring Attention represents a breakthrough in enabling transformers to leverage vastly expanded context. This work has immense value for the field of deep learning and tremendous potential for enabling new applications.

The sheer magnitude of contexts unlocked by Ring Attention is groundbreaking: over 100 million tokens on a TPU cluster. No other method comes close to this scale. With essentially infinite context, entirely new applications become feasible, such as processing entire books, videos, or genomes within a single model.

Equally important is that Ring Attention achieves this while maintaining critical performance metrics like throughput and FLOPs utilization. The ring structure distributes computation with minimal overhead. This makes scaling context size practical and performant.
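To make the mechanism concrete, here's a minimal single-process simulation of the ring pattern (my own sketch, not the authors' code): each simulated "device" owns one query block, the key-value blocks rotate around the ring, and attention is accumulated with a numerically stable online softmax so the final result equals exact attention over the full sequence.

```python
import numpy as np

def ring_attention_sim(q_blocks, k_blocks, v_blocks):
    """Simulate Ring Attention: 'device' i owns query block i and starts
    with KV block i; KV blocks rotate one hop around the ring each step."""
    n = len(q_blocks)
    d = q_blocks[0].shape[-1]
    # Per-device online-softmax state: running max, normalizer, weighted sum.
    m = [np.full(q.shape[0], -np.inf) for q in q_blocks]
    norm = [np.zeros(q.shape[0]) for q in q_blocks]
    acc = [np.zeros_like(q) for q in q_blocks]
    kv = list(zip(k_blocks, v_blocks))
    for _ in range(n):                    # after n rotations, every device
        for i in range(n):                # has seen every KV block
            k, v = kv[i]
            s = q_blocks[i] @ k.T / np.sqrt(d)        # block attention scores
            m_new = np.maximum(m[i], s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            scale = np.exp(m[i] - m_new)              # rescale older state
            norm[i] = norm[i] * scale + p.sum(axis=-1)
            acc[i] = acc[i] * scale[:, None] + p @ v
            m[i] = m_new
        kv = kv[1:] + kv[:1]              # hand KV blocks to the next device
    return [a / z[:, None] for a, z in zip(acc, norm)]
```

In the real system the per-device work runs in parallel and each rotation is a send/receive that overlaps with the blockwise computation, which is exactly why memory per device stays flat as the sequence length grows with the device count.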

The implications are extraordinarily exciting. Tasks requiring reasoning over long distances, large knowledge bases, and interconnected content will benefit enormously. Models can ingest whole documents, have discussions spanning days, and tackle complex sequential decision making where context is key. Scientific and internet-scale data will become tractable.

Furthermore, models with larger contexts are broadly beneficial. They learn richer representations, handle rare cases better, and become more sample efficient. Recent results on models like GPT-3 and PaLM demonstrate superior few-shot learning capabilities.

For both industry and research, Ring Attention lowers the barriers to training models that fully leverage enormous datasets and long sequences. It will accelerate innovation in generative models, reasoning abilities, multimodal understanding, and more. Unlocking such extensive context facilitates open-ended progress.

 

bnew


Top ML Papers of the Week (Oct 9 - Oct 15):

- Instruct-Retro
- Ring Attention
- LLMs can Learn Rules
- A Survey of LLMs for Healthcare
- Meta Chain-of-Thought Prompting
- Toward Language Agent Fine-tuning
...

----
1/ Ring Attention - a memory-efficient approach that leverages blockwise computation of self-attention to distribute long sequences across multiple devices, overcoming the memory limitations inherent in Transformer architectures and enabling longer sequences during training and inference; scales the context length with the number of devices while maintaining performance, exceeding a context length of 100 million tokens without attention approximations.



2/ Universal Simulator - applies generative modeling to learn a universal simulator of real-world interactions; can emulate how humans and agents interact with the world by simulating the visual outcomes of both high-level instructions and low-level controls; the system can be used to train vision-language planners, low-level reinforcement learning policies, and even systems that perform video captioning.



3/ Overview of Factuality in LLMs - a survey of factuality in LLMs providing insights into how to evaluate factuality in LLMs and how to enhance it.



4/ LLMs can Learn Rules - presents a two-stage framework that learns a rule library for reasoning with LLMs; in the first stage (induction), an LLM is prompted to generate and verify rules over training examples; the rule library will consist of rules that often appear and lead to correct answers; the second stage (deduction) prompts the LLM to employ the learned rule library to perform reasoning and answer test questions; improves results on numerical reasoning and relational reasoning problems.
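Here's a rough sketch of how that induce-then-deduce loop could look in code; the `llm` callable and the prompt wording are hypothetical stand-ins, not the paper's implementation.

```python
from collections import Counter

def induce_rules(llm, train_examples, min_count=2, min_accuracy=0.5):
    """Stage 1 (induction): prompt the LLM to propose and verify rules,
    keeping rules that appear often and usually yield correct answers."""
    seen, correct = Counter(), Counter()
    for question, answer in train_examples:
        proposed = llm(f"Propose general rules for answering: {question}")
        for rule in filter(None, (r.strip() for r in proposed.splitlines())):
            seen[rule] += 1
            prediction = llm(f"Apply the rule '{rule}' to answer: {question}")
            if prediction.strip() == answer:
                correct[rule] += 1
    return [r for r, c in seen.items()
            if c >= min_count and correct[r] / c >= min_accuracy]

def deduce(llm, rule_library, question):
    """Stage 2 (deduction): answer a test question using the learned library."""
    rules = "\n".join(rule_library)
    return llm(f"Rules:\n{rules}\n\nUse these rules to answer: {question}")
```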



5/ Meta Chain-of-Thought Prompting - a generalizable chain-of-thought (Meta-CoT) prompting method for mixed-task scenarios where the type of input question is unknown; comprises three phases: 1) scenario identification: samples distinct questions as in-context learning demonstrations to help automatically categorize scenarios based on input questions; 2) demonstration selection: constructs diverse demonstrations from a pool based on the scenario obtained in the first phase; 3) answer derivation: performs final answer inference on the input question using the previously fetched demonstrations.



6/ A Survey of LLMs for Healthcare - a comprehensive overview of LLMs applied to the healthcare domain.



7/ Improving Retrieval-Augmented LMs with Compressors - presents two approaches to compress retrieved documents into text summaries before prepending them in context: 1) an extractive compressor that selects useful sentences from retrieved documents, and 2) an abstractive compressor that generates summaries by synthesizing information from multiple documents; achieves a compression rate as low as 6% with minimal loss in performance on language modeling and open-domain question answering tasks; the proposed training scheme performs selective augmentation, which helps generate empty summaries when retrieved documents are irrelevant or unhelpful for a task.
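As a toy illustration of the extractive variant, one could rank retrieved sentences by relevance to the query and keep only the top few before prepending; the lexical-overlap scorer below is a deliberately crude stand-in for the paper's trained compressor.

```python
def extractive_compress(query, documents, keep=3):
    """Select the few retrieved sentences most relevant to the query.
    Real extractive compressors use a trained scorer, not word overlap."""
    sentences = [s.strip() for doc in documents
                 for s in doc.split(".") if s.strip()]
    query_words = set(query.lower().split())
    def overlap(sentence):
        return len(query_words & set(sentence.lower().split()))
    ranked = sorted(sentences, key=overlap, reverse=True)
    return ". ".join(ranked[:keep])  # prepend this summary to the prompt
```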



8/ Instruct-Retro - introduces Retro 48B, the largest LLM pretrained with retrieval; continues pretraining a 43B parameter GPT model on an additional 100B tokens by retrieving from 1.2T tokens (using the Retro augmentation method); the Retro 48B model shows significant perplexity improvement over its GPT 43B counterpart. Scaling the Retro model to 48B means it can be instruction-tuned more effectively. This work applies instruction tuning to Retro 48B and demonstrates significant improvement (+7%) over the instruction-tuned GPT on zero-shot question-answering tasks.



9/ MemWalker - a method to enhance long-text understanding by treating the LLM as an interactive agent that decides how to read the text via iterative prompting; it first processes the long context into a tree of summary nodes, then, given a query, traverses the tree seeking relevant information before crafting a suitable response; the traversal is driven by the model's reasoning, which enables effective reading and enhances explainability through explicit reasoning steps.



10/ Toward Language Agent Fine-tuning - explores the direction of fine-tuning LLMs to obtain language agents; finds that language agents consistently improve after fine-tuning their backbone language model; claims that fine-tuning Llama2-7B with 500 agent trajectories (generated by GPT-4) leads to a 77% HotpotQA performance increase.

 

bnew


Computer Science > Computation and Language

[Submitted on 17 Oct 2023]

BitNet: Scaling 1-bit Transformers for Large Language Models

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, Furu Wei
The increasing size of large language models has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. In this work, we introduce BitNet, a scalable and stable 1-bit Transformer architecture designed for large language models. Specifically, we introduce BitLinear as a drop-in replacement of the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results on language modeling show that BitNet achieves competitive performance while substantially reducing memory footprint and energy consumption, compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. Furthermore, BitNet exhibits a scaling law akin to full-precision Transformers, suggesting its potential for effective scaling to even larger language models while maintaining efficiency and performance benefits.
Comments: Work in progress
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2310.11453 [cs.CL]
(or arXiv:2310.11453v1 [cs.CL] for this version)

Submission history

From: Shuming Ma [view email]
[v1] Tue, 17 Oct 2023 17:59:15 UTC (236 KB)
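Going only off the abstract's description, here's a hedged PyTorch sketch of what a BitLinear-style drop-in for nn.Linear might look like: weights are binarized to ±1 with a scalar scale in the forward pass, and a straight-through estimator lets the full-precision latent weights receive gradients. The paper's actual layer also handles activation quantization and normalization, which are omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BitLinearSketch(nn.Linear):
    """Illustrative 1-bit linear layer (simplified from BitNet's BitLinear)."""
    def forward(self, x):
        w = self.weight
        w_centered = w - w.mean()        # center before taking the sign
        w_bin = torch.sign(w_centered)   # 1-bit weights in {-1, 0, +1}
        beta = w.abs().mean()            # scalar scale preserves magnitude
        w_q = w_bin * beta
        # Straight-through estimator: forward uses w_q, backward sees w.
        w_ste = w + (w_q - w).detach()
        return F.linear(x, w_ste, self.bias)
```

Swapping nn.Linear for a layer like this is what "drop-in replacement" means in practice: the rest of the Transformer is untouched, and training proceeds from scratch with the quantized forward pass.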


bnew




Let's discuss INSTRUCTSCORE

Trying to improve AI language models can feel frustrating and ineffective - like wandering in the dark without a flashlight. Current evaluation metrics only give a numeric score without any guidance on what needs fixing.

INSTRUCTSCORE provides the missing light. Like a wise writing tutor, it clearly diagnoses errors in generated text and explains the issues. For example, circling a phrase and noting: "Using 'old district' changes the meaning from the source's 'base area'."

Without this feedback, researchers grope about blindly, not knowing if errors stem from training data biases, model architecture, or hyperparameter choices. INSTRUCTSCORE illuminates the specific weaknesses that need addressing.

It's like having a personalized map to guide your travels, versus just knowing you haven't reached the destination yet.

INSTRUCTSCORE achieves this through an ingenious trick - it has GPT-4 automatically generate a massive dataset of text examples with detailed critiques, providing a curriculum to "teach" evaluation skills.

This helps it learn nuanced assessment abilities beyond surface metrics like BLEU scores. It's as if the model read millions of English essays marked up by master teachers and became an essay-grading expert itself!
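In pseudo-Python, the data-generation trick might look roughly like this; the `gpt4` callable and the prompt wording are hypothetical stand-ins for the authors' pipeline.

```python
def build_critique_dataset(gpt4, source_candidate_pairs):
    """Have a strong model write detailed error annotations; the resulting
    (input, critique) pairs become fine-tuning data for a smaller evaluator."""
    dataset = []
    for source, candidate in source_candidate_pairs:
        prompt = (
            f"Source text: {source}\n"
            f"Candidate output: {candidate}\n"
            "List every error: its span, error type, severity, "
            "and a one-sentence explanation."
        )
        dataset.append({"source": source,
                        "candidate": candidate,
                        "critique": gpt4(prompt)})
    return dataset  # fine-tune the evaluator on these (input, critique) pairs
```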

Additionally, the researchers act as principals, evaluating INSTRUCTSCORE's feedback for common mistakes. This extra step fine-tunes the wisdom, like a tutor refining their skills through mentorship.

The end result is an AI mentor metric that can accelerate progress in language models by providing meaningful, human-aligned guidance - shedding light where once there was only darkness.
 

bnew



Structured self-reflection delivers real gains in how autonomous agents learn from their mistakes. This technique is the critical innovation that sets this work apart from prior methods. Reflection provides the missing link for continuous self-improvement without human guidance.

Many prior approaches rely extensively on expert demonstrations or traces to teach agents complex tasks. Yet in the real world, such perfectly curated examples are impractical to obtain. Agents must be able to learn new skills independently through trial and error.

This is where structured reflection shines. Like an expert coach, it rigorously reviews an agent's past performance to identify weaknesses. Reflection transforms failures into targeted feedback for improvement.

However, unstructured reflection has limited impact due to unreliable memory. This work introduces structured thought management to focus attention on the most critical mistakes. Reflection is no longer a jumble of thoughts but an optimized training regimen.

By constraining the action space, the agent can directly apply lessons learned from past slip-ups. Reflection becomes a closed-loop system for iterative refinement until mastery is achieved.

Structured reflection fundamentally changes how agents can self-correct. Instead of floundering when things go wrong, they now have the introspective ability to diagnose and treat their own shortcomings.

This powerful capacity for self-critique is a vital milestone in developing truly autonomous agents. Structured reflection breaks dependence on human guidance. Agents endowed with this capability can adapt to novel tasks and dynamic environments.

In testing, structured reflection proved its immense value. The zero-shot agent matched or exceeded the performance of systems relying on expert traces.

Carlos E. Perez
@IntuitMachine
Oct 16
Prior Methods:

- REACT: Uses intermediate thoughts to guide iterative planning. Requires customized prompts.

- REFLEXION: Allows models to self-reflect and improve over trials. Limited context capacity.

- TOOLFORMER: Models learn to use tools from demonstrations.

- SWIFTSAGE: Reduces planning calls with a learned planning module.

Structured Reflection:

- First zero-shot agent for computer control, requiring no expert traces. More autonomous.

- Staged planning is more efficient than iterative methods, with higher completion rates.

- Structured reflection facilitates more reliable learning from failures.

- Constrains action space during reflection to avoid repeating mistakes.

- Compact screen representation makes fewer assumptions about available information.
Oct 16, 2023 · 10:57 AM UTC

Carlos E. Perez
@IntuitMachine
Oct 16
Structured self-reflection is like having an inner coach that helps the agent learn from its mistakes. It's the process of constructively criticizing past actions in order to improve in the future. Here's an example of how it works:

Imagine you're an agent trying to complete a task on a computer. You take some actions, following your plan, but ultimately fail to achieve the goal. Like an athlete after a lost game, it's time to review the play-by-play and see where things went wrong.

Your inner coach - structured self-reflection - kicks in and starts a post-game analysis. It reviews each past action, looking for critical mistakes. Perhaps you clicked the wrong menu item at one point. Your coach makes a note: "For action #3, you should have clicked File instead of Edit."

With the wisdom of hindsight, your coach reconstructs an improved plan of action. Now when you retry the task, you'll avoid past mistakes thanks to the coaching. It's like having personalized feedback to refine your technique.

Each attempt at the task is like a practice round. Your inner coach identifies weaknesses and suggests corrections. With more rounds of reflection, your execution improves by learning from errors. Like a master artisan honing their craft, each mistake makes you better.

Structured self-reflection puts this in a framework, managing and prioritizing reflected thoughts over trials. It constrains the action space to avoid blindly repeating the same mistakes. This coaching distills experience into wisdom for progressively better performance.
Carlos E. Perez
@IntuitMachine
Oct 16
The structured reflection process:

The agent first completes a trial by executing its planned actions. If the trial fails, structured reflection kicks in to analyze mistakes.

The reflection module reviews the full trajectory of states and actions. It identifies the earliest critical error and suggests an improved action for that step. For example, "At step 3 do X instead of Y".

This reflection is recorded in memory associated with the time step. The agent replays steps up to the failure point. When reaching the problematic step, it overrides its planner to force the improved action.

A disabled action set tracks mistakes to avoid repeating them. Actions are constrained in the interface, e.g. removing clickable elements.

Reflections on later time steps are cleared if an earlier step is updated. This focuses attention on the most critical flaws.

Over multiple trials, the agent accumulates focused feedback targeting weaknesses. Following this coaching, it continuously refines its policy and improves.

Structured management of reflections allows efficient and reliable learning without human guidance. The agent effectively teaches itself through introspective critiques designed to isolate and correct the source of failures.
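Pulling the whole loop together in code form, a hedged sketch (the `plan`, `execute`, and `reflect` callables are hypothetical stand-ins for the planner, the environment, and the reflection prompt):

```python
def run_with_reflection(plan, execute, reflect, task, max_trials=5):
    """Retry a task, turning each failed trial into a forced correction."""
    reflections = {}   # time step -> corrective action from reflection
    disabled = set()   # (step, action) pairs known to fail
    for _ in range(max_trials):
        trajectory = []
        for t, action in enumerate(plan(task, disabled)):
            if t in reflections:
                action = reflections[t]   # override the planner with the fix
            state, done = execute(action)
            trajectory.append((t, state, action))
            if done:
                return trajectory         # task completed
        # Failed trial: locate the earliest critical error and its fix.
        step, bad_action, better_action = reflect(trajectory)
        disabled.add((step, bad_action))  # constrain the action space
        reflections[step] = better_action
        # Reflections for later steps are now stale; clear them.
        reflections = {t: a for t, a in reflections.items() if t <= step}
    return None  # not completed within the trial budget
```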
Carlos E. Perez
@IntuitMachine
Oct 16
Performance

- Structured self-reflection showed clear benefits over multiple trials. On several challenging tasks, completion rate improved from around 50-60% on the first attempt to over 80% after 5 rounds of reflection. The improvements were statistically significant.

- The staged planning approach was much more efficient than iterative planning. On tasks requiring long action sequences, staged planning reduced the number of model queries by 60-90%. With a single upfront inference, it could plan out all executable actions on a screen.

- Explicitly constraining the action space avoided repeating past mistakes during reflection. When failed actions were disabled in the interface, completion rates improved by 4-8% absolute over just using prompts. This shows the value of interacting with the environment to guide behavior.

- On simple 1-screen, 1-step tasks, the zero-shot staged planner achieved 96-100% completion, outperforming prior iterative planning methods that still relied on some expert traces.

- On complex tasks, the zero-shot agent with reflection matched the performance of previous state-of-the-art few-shot agents. This demonstrates the viability of the approach without any reliance on expert traces.
Carlos E. Perez
@IntuitMachine
Oct 16
Summary

The paper proposed the first zero-shot agent for computer control tasks that requires no expert traces or demonstrations. This is a novel approach compared to prior work that relied on some form of human guidance.

The agent achieves efficient planning and execution through two main innovations:

- Staged planning maximally plans out all executable actions in one pass for each screen state, rather than iterative planning. This substantially reduces the number of model queries.

- Structured self-reflection allows the agent to identify mistakes and progressively improve over multiple trials. Thought management facilitates reliable reflection.
Evaluations on the MINIWOB++ benchmark demonstrated strong performance:

- The staged planner exceeded prior iterative planning methods, even without any traces.

- With reflection, the zero-shot agent matched state-of-the-art few-shot systems that leverage expert traces.

- The agent improved significantly over multiple rounds of reflection.

Overall, this work introduced a novel self-learning agent that can start completing computer tasks from scratch. It represents an important step towards more autonomous and general use of large language models for control. The principles of efficient staged planning and structured reflection are general advances for improving model performance.
 

bnew


🔥🚀 Let's unpack the concept of **validation log perplexity** in the context of large language model (LLM) training.

Before diving in, we should note that the metric applies specifically to classical language models (sometimes called autoregressive or causal language models) and is not well defined for masked language models like BERT

📌 **Perplexity**: Before discussing validation log perplexity, it's important to understand the core concept of perplexity. In the realm of natural language processing and probabilistic language models, perplexity is a measure of how well the probability distribution predicted by a model aligns with the actual distribution of the words in the language.

Mathematically, for a language model that assigns a probability \(P(w_1, w_2, ..., w_N)\) to a sequence of words \(w_1, w_2, ..., w_N\), the perplexity \(PP\) of the sequence is given by:

$$PP(w_1, w_2, ..., w_N) = P(w_1, w_2, ..., w_N)^{-\frac{1}{N}}$$

📌 **Log Perplexity**: Log perplexity is the logarithm of perplexity. It provides more stable and interpretable numbers, especially since language models often work with very small probabilities.

It is computed as:

$$\text{Log Perplexity} = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, w_2, ..., w_{i-1})$$

Where:
- \(N\) is the total number of words.
- \(P(w_i | w_1, w_2, ..., w_{i-1})\) is the probability of word \(w_i\) given the preceding words in the sequence.
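A tiny worked example of the formula, using four made-up conditional probabilities:

```python
import math

# Hypothetical per-token probabilities P(w_i | w_1, ..., w_{i-1}).
probs = [0.25, 0.10, 0.50, 0.05]

log_ppl = -sum(math.log(p) for p in probs) / len(probs)
ppl = math.exp(log_ppl)   # perplexity is exp(log perplexity)
print(log_ppl, ppl)       # ~1.84, ~6.32
```

This agrees with the first formula: (0.25 · 0.10 · 0.50 · 0.05)^(-1/4) ≈ 6.32.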

-------------

📌 **Validation Log Perplexity**: When training machine learning models, it's common practice to split the data into training and validation sets. The training set is used to adjust the model parameters, while the validation set is used to gauge the model's generalization performance.

Validation log perplexity, then, is the log perplexity computed using the validation data. It provides an estimate of how well the trained model would perform on unseen data. A lower validation log perplexity suggests that the model is better at predicting the sequences in the validation data, and thus might generalize better.

📌 In the context of LLM training, validation log perplexity is an invaluable metric. Given the massive number of parameters in LLMs, they are prone to overfitting. Regularly monitoring the validation log perplexity can indicate whether the model is continuing to learn meaningful patterns or just memorizing the training data.

📌 **Lower is Better**: At its core, log perplexity is a measure of uncertainty. Lower log perplexity indicates that the model is more confident in its predictions, which generally suggests better performance. Thus, a "good" score is a relatively low score.

If the validation log perplexity starts increasing while training perplexity continues to decrease, it's an indication of overfitting.

📌 **Optimization Perspective**: Ideally, when training an LLM or any probabilistic language model, the objective is to minimize the log perplexity. In the context of deep learning frameworks like PyTorch and TensorFlow, this would translate to minimizing the negative log likelihood loss.
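Concretely, in a framework like PyTorch the two quantities coincide: the average cross-entropy over held-out tokens is the validation log perplexity, and exponentiating it recovers perplexity. A minimal sketch, assuming a causal LM `model` and a `val_loader` that yields inputs with shifted-by-one targets:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_log_perplexity(model, val_loader):
    """Token-weighted mean NLL over the validation set, plus its exp."""
    total_nll, total_tokens = 0.0, 0
    for input_ids, targets in val_loader:
        logits = model(input_ids)                    # (batch, seq, vocab)
        nll = F.cross_entropy(logits.flatten(0, 1),  # sum NLL over all tokens
                              targets.flatten(),
                              reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
    log_ppl = total_nll / total_tokens
    return log_ppl, math.exp(log_ppl)    # (log perplexity, perplexity)
```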

📌 **Practical Considerations**: When dealing with real-world datasets and especially with large models, there are a few challenges associated with perplexity:

- **Numerical Stability**: Probabilities can be very small, causing numerical underflows. To avoid this, calculations are usually performed in the log space.

- **Tokenization**: The perplexity value can vary depending on the tokenization scheme used. For instance, using byte-pair encoding (BPE) versus word-level tokenization can lead to different perplexity values for the same text.

- **Comparing Perplexities**: It's important to ensure that perplexities are compared under the same conditions, especially in terms of tokenization and dataset preprocessing.
 

bnew



FireAct: Toward Language Agent Fine-tuning

Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, Shunyu Yao
Recent efforts have augmented language models (LMs) with external tools or environments, leading to the development of language agents that can reason and act. However, most of these agents rely on few-shot prompting techniques with off-the-shelf LMs. In this paper, we investigate and argue for the overlooked direction of fine-tuning LMs to obtain language agents. Using a setup of question answering (QA) with a Google search API, we explore a variety of base LMs, prompting methods, fine-tuning data, and QA tasks, and find language agents are consistently improved after fine-tuning their backbone LMs. For example, fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 leads to a 77% HotpotQA performance increase. Furthermore, we propose FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods, and show having more diverse fine-tuning data can further improve agents. Along with other findings regarding scaling effects, robustness, generalization, efficiency and cost, our work establishes comprehensive benefits of fine-tuning LMs for agents, and provides an initial set of experimental designs, insights, as well as open questions toward language agent fine-tuning.
Comments: Code, data, and models are available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2310.05915 [cs.CL]
(or arXiv:2310.05915v1 [cs.CL] for this version)

Submission history

From: Shunyu Yao [view email]
[v1] Mon, 9 Oct 2023 17:58:38 UTC (1,032 KB)

 
