bnew

Veteran
Joined
Nov 1, 2015
Messages
51,805
Reputation
7,926
Daps
148,743

Let's discuss INSTRUCTSCORE

Trying to improve AI language models can feel frustrating and ineffective - like wandering in the dark without a flashlight. Current evaluation metrics only give a numeric score without any guidance on what needs fixing.

INSTRUCTSCORE provides the missing light. Like a wise writing tutor, it clearly diagnoses errors in generated text and explains each issue - for example, circling a phrase and noting: "Using 'old district' changes the meaning from the source's 'base area'."

Without this feedback, researchers grope about blindly, not knowing if errors stem from training data biases, model architecture, or hyperparameter choices. INSTRUCTSCORE illuminates the specific weaknesses that need addressing.

It's like having a personalized map to guide your travels, versus just knowing you haven't reached the destination yet.

INSTRUCTSCORE achieves this through an ingenious trick - it has GPT-4 automatically generate a massive dataset of text examples with detailed critiques, providing a curriculum to "teach" evaluation skills.

This helps it learn nuanced assessment abilities beyond surface metrics like BLEU scores. It's as if reading millions of English essays marked up by master teachers to become an essay-grading expert itself!

Additionally, the researchers act as principals, evaluating INSTRUCTSCORE's feedback for common mistakes. This extra step fine-tunes the wisdom, like a tutor refining their skills through mentorship.

The end result is an AI mentor metric that can accelerate progress in language models by providing meaningful, human-aligned guidance - shedding light where once there was only darkness.
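To make the idea concrete, here's a minimal sketch of what a structured diagnostic report from an INSTRUCTSCORE-style evaluator could look like. The class names, fields, and toy scoring rule are illustrative assumptions for this post, not the paper's actual output schema:

```python
from dataclasses import dataclass, field

@dataclass
class ErrorAnnotation:
    span: str          # the offending phrase in the candidate text
    error_type: str    # e.g. "mistranslation", "omission"
    severity: str      # "major" or "minor"
    explanation: str   # natural-language justification, as in the example above

@dataclass
class DiagnosticReport:
    candidate: str
    reference: str
    errors: list = field(default_factory=list)

    def score(self) -> float:
        # Toy aggregation: major errors cost more than minor ones.
        return -sum(5.0 if e.severity == "major" else 1.0 for e in self.errors)

report = DiagnosticReport(
    candidate="Troops advanced into the old district.",
    reference="Troops advanced into the base area.",
    errors=[ErrorAnnotation(
        span="old district",
        error_type="mistranslation",
        severity="major",
        explanation="Using 'old district' changes the meaning from the source's 'base area'.",
    )],
)
print(report.score())  # -5.0
```

The point is that the metric returns an explanation of what went wrong and how severe it is, not just a single number.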
 



Structured self-reflection marks a real improvement in how autonomous agents learn from their mistakes. This technique is the critical innovation that sets this work apart from prior methods: reflection provides the missing link for continuous self-improvement without human guidance.

Many prior approaches rely extensively on expert demonstrations or traces to teach agents complex tasks. Yet in the real world, such perfectly curated examples are impractical to obtain. Agents must be able to learn new skills independently through trial and error.

This is where structured reflection shines. Like an expert coach, it rigorously reviews an agent's past performance to identify weaknesses. Reflection transforms failures into targeted feedback for improvement.

However, unstructured reflection has limited impact due to unreliable memory. This work introduces structured thought management to focus attention on the most critical mistakes. Reflection is no longer a jumble of thoughts but an optimized training regimen.

By constraining the action space, the agent can directly apply lessons learned from past slip-ups. Reflection becomes a closed-loop system for iterative refinement until mastery is achieved.

Structured reflection fundamentally changes how agents can self-correct. Instead of floundering when things go wrong, they now have the introspective ability to diagnose and treat their own shortcomings.

This powerful capacity for self-critique is a vital milestone in developing truly autonomous agents. Structured reflection breaks dependence on human guidance. Agents endowed with this capability can adapt to novel tasks and dynamic environments.

In testing, structured reflection proved its immense value. The zero-shot agent matched or exceeded the performance of systems relying on expert traces.

Carlos E. Perez
@IntuitMachine
Oct 16
Prior Methods:

- REACT: Uses intermediate thoughts to guide iterative planning. Requires customized prompts.

- REFLEXION: Allows models to self-reflect and improve over trials. Limited context capacity.

- TOOLFORMER: Models learn to use tools from demonstrations.

- SWIFTSAGE: Reduces planning calls with a learned planning module.

Structured Reflection:

- First zero-shot agent for computer control, requiring no expert traces. More autonomous.

- Staged planning is more efficient than iterative methods, with higher completion rates.

- Structured reflection facilitates more reliable learning from failures.

- Constrains action space during reflection to avoid repeating mistakes.

- Compact screen representation makes fewer assumptions about available information.
Oct 16, 2023 · 10:57 AM UTC

Carlos E. Perez
@IntuitMachine
Oct 16
Structured self-reflection is like having an inner coach that helps the agent learn from its mistakes. It's the process of constructively criticizing past actions in order to improve in the future. Here's an example of how it works:

Imagine you're an agent trying to complete a task on a computer. You take some actions, following your plan, but ultimately fail to achieve the goal. Like an athlete after a lost game, it's time to review the play-by-play and see where things went wrong.

Your inner coach - structured self-reflection - kicks in and starts a post-game analysis. It reviews each past action, looking for critical mistakes. Perhaps you clicked the wrong menu item at one point. Your coach makes a note: "For action #3, you should have clicked File instead of Edit."

With the wisdom of hindsight, your coach reconstructs an improved plan of action. Now when you retry the task, you'll avoid past mistakes thanks to the coaching. It's like having personalized feedback to refine your technique.

Each attempt at the task is like a practice round. Your inner coach identifies weaknesses and suggests corrections. With more rounds of reflection, your execution improves by learning from errors. Like a master artisan honing their craft, each mistake makes you better.

Structured self-reflection puts this in a framework, managing and prioritizing reflected thoughts over trials. It constrains the action space to avoid blindly repeating the same mistakes. This coaching distills experience into wisdom for progressively better performance.

Carlos E. Perez
@IntuitMachine
Oct 16
The structured reflection process:

The agent first completes a trial by executing its planned actions. If the trial fails, structured reflection kicks in to analyze mistakes.

The reflection module reviews the full trajectory of states and actions. It identifies the earliest critical error and suggests an improved action for that step. For example, "At step 3 do X instead of Y".

This reflection is recorded in memory associated with the time step. The agent replays steps up to the failure point. When reaching the problematic step, it overrides its planner to force the improved action.

A disabled action set tracks mistakes to avoid repeating them. Actions are constrained in the interface, e.g. removing clickable elements.

Reflections on later time steps are cleared if an earlier step is updated. This focuses attention on the most critical flaws.

Over multiple trials, the agent accumulates focused feedback targeting weaknesses. Following this coaching, it continuously refines its policy and improves.

Structured management of reflections allows efficient and reliable learning without human guidance. The agent effectively teaches itself through introspective critiques designed to isolate and correct the source of failures.
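To pin down the control flow, here is a minimal sketch of that loop. `env`, `planner`, and `reflector` are hypothetical placeholder interfaces, not the paper's code; only the bookkeeping follows the description above:

```python
def run_with_structured_reflection(env, planner, reflector, max_trials=5):
    """Sketch only: env, planner, and reflector are hypothetical interfaces."""
    reflections = {}   # time step -> corrected action suggested by reflection
    disabled = set()   # (time step, action) pairs known to have failed

    for trial in range(max_trials):
        trajectory, success = [], False
        state = env.reset()
        for t in range(env.max_steps):
            if t in reflections:
                action = reflections[t]                       # force the improved action
            else:
                action = planner.plan(state, avoid=disabled)  # constrained action space
            state, done, success = env.step(action)
            trajectory.append((t, state, action))
            if done:
                break
        if success:
            return trajectory

        # Post-trial analysis: find the earliest critical mistake and a fix for it.
        step, bad_action, better_action = reflector.critique(trajectory)
        disabled.add((step, bad_action))
        reflections[step] = better_action
        # Reflections recorded for later steps may no longer apply once an
        # earlier step changes, so they are cleared.
        reflections = {t: a for t, a in reflections.items() if t <= step}
    return None
```

Running it requires concrete implementations of those three pieces; the sketch only captures the key mechanics: override the planner at the reflected step, keep failed actions disabled, and drop reflections that fall after the newly corrected step.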

Carlos E. Perez
@IntuitMachine
Oct 16
Performance

- Structured self-reflection showed clear benefits over multiple trials. On several challenging tasks, completion rate improved from around 50-60% on the first attempt to over 80% after 5 rounds of reflection. The improvements were statistically significant.

- The staged planning approach was much more efficient than iterative planning. On tasks requiring long action sequences, staged planning reduced the number of model queries by 60-90%. With a single upfront inference, it could plan out all executable actions on a screen.

- Explicitly constraining the action space avoided repeating past mistakes during reflection. When failed actions were disabled in the interface, completion rates improved by 4-8% absolute over just using prompts. This shows the value of interacting with the environment to guide behavior.

- On simple 1-screen, 1-step tasks, the zero-shot staged planner achieved 96-100% completion, outperforming prior iterative planning methods that still relied on some expert traces.

- On complex tasks, the zero-shot agent with reflection matched the performance of previous state-of-the-art few-shot agents. This demonstrates the viability of the approach without any reliance on expert traces.

Carlos E. Perez
@IntuitMachine
Oct 16
Summary

The paper proposed the first zero-shot agent for computer control tasks that requires no expert traces or demonstrations. This is a novel approach compared to prior work that relied on some form of human guidance.

The agent achieves efficient planning and execution through two main innovations:

- Staged planning maximally plans out all executable actions in one pass for each screen state, rather than iterative planning. This substantially reduces the number of model queries.

- Structured self-reflection allows the agent to identify mistakes and progressively improve over multiple trials. Thought management facilitates reliable reflection.
Evaluations on the MINIWOB++ benchmark demonstrated strong performance:

- The staged planner exceeded prior iterative planning methods, even without any traces.

- With reflection, the zero-shot agent matched state-of-the-art few-shot systems that leverage expert traces.

- The agent improved significantly over multiple rounds of reflection.

Overall, this work introduced a novel self-learning agent that can start completing computer tasks from scratch. It represents an important step towards more autonomous and general use of large language models for control. The principles of efficient staged planning and structured reflection are general advances for improving model performance.
 


🔥🚀 Let's break down the concept of **validation log perplexity** in the context of large language model (LLM) training.

Before diving in, we should note that this metric applies specifically to classical language models (sometimes called autoregressive or causal language models) and is not well defined for masked language models like BERT.

📌 **Perplexity**: Before discussing validation log perplexity, it's important to understand the core concept of perplexity. In the realm of natural language processing and probabilistic language models, perplexity is a measure of how well the probability distribution predicted by a model aligns with the actual distribution of the words in the language.

Mathematically, for a language model that assigns a probability \(P(w_1, w_2, ..., w_N)\) to a sequence of words \(w_1, w_2, ..., w_N\), the perplexity \(PP\) of the sequence is given by:

$$PP(w_1, w_2, ..., w_N) = P(w_1, w_2, ..., w_N)^{-\frac{1}{N}}$$

📌 **Log Perplexity**: Log perplexity is the logarithmic version of perplexity, which provides more stable and interpretable numbers, especially given that we often deal with very small probabilities in language models.

It is computed as:

$$\text{Log Perplexity} = -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, w_2, ..., w_{i-1})$$

Where:
- \(N\) is the total number of words.
- \(P(w_i | w_1, w_2, ..., w_{i-1})\) is the probability of word \(w_i\) given the preceding words in the sequence.
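A tiny worked example of both formulas, with made-up token probabilities:

```python
import math

# Made-up conditional probabilities P(w_i | w_1..w_{i-1}) for a 4-token sequence.
probs = [0.2, 0.5, 0.1, 0.4]
N = len(probs)

log_perplexity = -sum(math.log(p) for p in probs) / N   # ~1.38
perplexity = math.exp(log_perplexity)                    # ~3.98, equals P(w_1..w_N)**(-1/N)

print(log_perplexity, perplexity)
```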

-------------

📌 **Validation Log Perplexity**: When training machine learning models, it's common practice to split the data into training and validation sets. The training set is used to adjust the model parameters, while the validation set is used to gauge the model's generalization performance.

Validation log perplexity, then, is the log perplexity computed using the validation data. It provides an estimate of how well the trained model would perform on unseen data. A lower validation log perplexity suggests that the model is better at predicting the sequences in the validation data, and thus might be more generalizable.

📌 In the context of LLM training, validation log perplexity becomes an invaluable metric. Given the massive number of parameters in LLMs, they are prone to overfitting. Regularly monitoring the validation log perplexity can indicate whether the model is continuing to learn meaningful patterns or just memorizing the training data.

📌 **Lower is Better**: At its core, log perplexity is a measure of uncertainty. Lower log perplexity indicates that the model is more confident in its predictions, which generally suggests better performance. Thus, a "good" score is a relatively low score.

If the validation log perplexity starts increasing while training perplexity continues to decrease, it's an indication of overfitting.

📌 **Optimization Perspective**: Ideally, when training an LLM or any probabilistic language model, the objective is to minimize the log perplexity. In the context of deep learning frameworks like PyTorch and TensorFlow, this would translate to minimizing the negative log likelihood loss.
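As a rough sketch of how this is computed in practice for a causal LM (assuming the model's forward pass returns logits of shape `(batch, seq_len, vocab)` and `val_loader` yields batches of token IDs; adapt to your own setup):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def validation_log_perplexity(model, val_loader, device="cuda"):
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for input_ids in val_loader:
        input_ids = input_ids.to(device)
        logits = model(input_ids)                     # (batch, seq_len, vocab)
        shift_logits = logits[:, :-1, :]              # predict token t from tokens < t
        shift_labels = input_ids[:, 1:]
        nll = F.cross_entropy(                        # negative log-likelihood, in log space
            shift_logits.reshape(-1, shift_logits.size(-1)),
            shift_labels.reshape(-1),
            reduction="sum",
        )
        total_nll += nll.item()
        total_tokens += shift_labels.numel()
    log_ppl = total_nll / total_tokens                # average NLL per token = log perplexity
    return log_ppl, math.exp(log_ppl)                 # (validation log perplexity, perplexity)
```

Because `cross_entropy` operates on logits in log space, the tiny per-token probabilities never have to be materialized directly, which ties into the numerical-stability point below.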

📌 **Practical Considerations**: When dealing with real-world datasets and especially with large models, there are a few challenges associated with perplexity:

- **Numerical Stability**: Probabilities can be very small, causing numerical underflows. To avoid this, calculations are usually performed in the log space.

- **Tokenization**: The perplexity value can vary depending on the tokenization scheme used. For instance, using byte-pair encoding (BPE) versus word-level tokenization can lead to different perplexity values for the same text.

- **Comparing Perplexities**: It's important to ensure that perplexities are compared under the same conditions, especially in terms of tokenization and dataset preprocessing.
 



FireAct: Toward Language Agent Fine-tuning​

Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, Shunyu Yao
Recent efforts have augmented language models (LMs) with external tools or environments, leading to the development of language agents that can reason and act. However, most of these agents rely on few-shot prompting techniques with off-the-shelf LMs. In this paper, we investigate and argue for the overlooked direction of fine-tuning LMs to obtain language agents. Using a setup of question answering (QA) with a Google search API, we explore a variety of base LMs, prompting methods, fine-tuning data, and QA tasks, and find language agents are consistently improved after fine-tuning their backbone LMs. For example, fine-tuning Llama2-7B with 500 agent trajectories generated by GPT-4 leads to a 77% HotpotQA performance increase. Furthermore, we propose FireAct, a novel approach to fine-tuning LMs with trajectories from multiple tasks and prompting methods, and show having more diverse fine-tuning data can further improve agents. Along with other findings regarding scaling effects, robustness, generalization, efficiency and cost, our work establishes comprehensive benefits of fine-tuning LMs for agents, and provides an initial set of experimental designs, insights, as well as open questions toward language agent fine-tuning.
Comments: Code, data, and models are available at this https URL
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2310.05915 [cs.CL]
(or arXiv:2310.05915v1 [cs.CL] for this version)
[2310.05915] FireAct: Toward Language Agent Fine-tuning

Submission history​

From: Shunyu Yao [view email]
[v1] Mon, 9 Oct 2023 17:58:38 UTC (1,032 KB)
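For intuition, here is one generic way a ReAct-style trajectory could be flattened into a supervised fine-tuning example. This is purely an illustration; it is not the actual data format released with FireAct:

```python
# Illustrative only: a hypothetical agent trajectory turned into a prompt/completion pair.
trajectory = [
    {"role": "user", "content": "Question: Which country hosted the 1998 FIFA World Cup?"},
    {"role": "assistant", "content": "Thought: I should search for the host country.\n"
                                     "Action: search[1998 FIFA World Cup host country]"},
    {"role": "tool", "content": "Observation: The 1998 FIFA World Cup was held in France."},
    {"role": "assistant", "content": "Thought: The observation answers the question.\n"
                                     "Action: finish[France]"},
]

# Fine-tuning target: given everything before the final assistant turn,
# learn to produce that final turn.
prompt = "\n".join(m["content"] for m in trajectory[:-1])
completion = trajectory[-1]["content"]
example = {"prompt": prompt, "completion": completion}
```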

 










Towards a Real-Time Decoding of Images from Brain Activity

October 18, 2023

At every moment of every day, our brains meticulously sculpt a wealth of sensory signals into meaningful representations of the world around us. Yet how this continuous process actually works remains poorly understood.

Today, Meta is announcing an important milestone in the pursuit of that fundamental question. Using magnetoencephalography (MEG), a non-invasive neuroimaging technique in which thousands of brain activity measurements are taken per second, we showcase an AI system capable of decoding the unfolding of visual representations in the brain with an unprecedented temporal resolution.

This AI system can be deployed in real time to reconstruct, from brain activity, the images perceived and processed by the brain at each instant. This opens up an important avenue to help the scientific community understand how images are represented in the brain, and then used as foundations of human intelligence. Longer term, it may also provide a stepping stone toward non-invasive brain-computer interfaces in a clinical setting that could help people who, after suffering a brain lesion, have lost their ability to speak.

Leveraging our recent architecture trained to decode speech perception from MEG signals, we develop a three-part system consisting of an image encoder, a brain encoder, and an image decoder. The image encoder builds a rich set of representations of the image independently of the brain. The brain encoder then learns to align MEG signals to these image embeddings. Finally, the image decoder generates a plausible image given these brain representations.
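Schematically, and with heavy simplification, the alignment step might look something like the sketch below. The stand-in image encoder, layer sizes, and plain regression loss are illustrative assumptions, not the architecture Meta actually used:

```python
import torch
import torch.nn as nn

# Stand-in for a frozen, pretrained self-supervised image encoder.
image_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 768)).eval()

class BrainEncoder(nn.Module):
    """Maps an MEG window (channels x time samples) into the image-embedding space."""
    def __init__(self, n_channels=270, n_times=180, emb_dim=768):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),
            nn.Linear(n_channels * n_times, 2048),
            nn.GELU(),
            nn.Linear(2048, emb_dim),
        )

    def forward(self, meg):
        return self.net(meg)

brain_encoder = BrainEncoder()

def alignment_loss(meg, image):
    with torch.no_grad():
        target = image_encoder(image)   # frozen image embeddings
    pred = brain_encoder(meg)           # MEG mapped into the same space
    return nn.functional.mse_loss(pred, target)

meg = torch.randn(8, 270, 180)          # hypothetical batch of MEG windows
img = torch.randn(8, 3, 224, 224)       # the images shown at those instants
loss = alignment_loss(meg, img)         # train the brain encoder to align the two
```

At inference time, the predicted embedding then conditions an image decoder (a generative model) to reconstruct a plausible image, as described above.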




MEG recordings are continuously aligned to the deep representation of the images, which can then condition the generation of images at each instant.


We train this architecture on a public dataset of MEG recordings acquired from healthy volunteers and released by Things, an international consortium of academic researchers sharing experimental data based on the same image database.

We first compare the decoding performance obtained with a variety of pretrained image modules and show that the brain signals best align with modern computer vision AI systems like DINOv2, a recent self-supervised architecture able to learn rich visual representations without any human annotations. This result confirms that self-supervised learning leads AI systems to learn brain-like representations: The artificial neurons in the algorithm tend to be activated similarly to the physical neurons of the brain in response to the same image.


The images that volunteer participants see (left) and those decoded from MEG activity at each instant of time (right). Each image is presented approximately every 1.5 seconds.


This functional alignment between such AI systems and the brain can then be used to guide the generation of an image similar to what the participants see in the scanner. While our results show that images are better decoded with functional Magnetic Resonance Imaging (fMRI), our MEG decoder can be used at every instant of time and thus produces a continuous flux of images decoded from brain activity.


The images that volunteer participants see (left) and those decoded from fMRI activity (right).


While the generated images remain imperfect, the results suggest that the reconstructed image preserves a rich set of high-level features, such as object categories. However, the AI system often generates inaccurate low-level features by misplacing or mis-orienting some objects in the generated images. In particular, using the Natural Scene Dataset, we show that images generated from MEG decoding remain less precise than the decoding obtained with fMRI, a comparably slow-paced but spatially precise neuroimaging technique.

Overall, our results show that MEG can be used to decipher, with millisecond precision, the rise of complex representations generated in the brain. More generally, this research strengthens Meta’s long-term research initiative to understand the foundations of human intelligence, identify its similarities as well as differences compared to current machine learning algorithms, and ultimately guide the development of AI systems designed to learn and reason like humans.
 



[Submitted on 17 Oct 2023]

VeRA: Vector-based Random Matrix Adaptation​

Dawid Jan Kopiczko, Tijmen Blankevoort, Yuki Markus Asano
Low-rank adaptation (LoRA) is a popular method that reduces the number of trainable parameters when finetuning large language models, but still faces acute storage challenges when scaling to even larger models or deploying numerous per-user or per-task adapted models. In this work, we present Vector-based Random Matrix Adaptation (VeRA), which reduces the number of trainable parameters by 10x compared to LoRA, yet maintains the same performance. It achieves this by using a single pair of low-rank matrices shared across all layers and learning small scaling vectors instead. We demonstrate its effectiveness on the GLUE and E2E benchmarks, and show its application in instruction-following with just 1.4M parameters using the Llama2 7B model.
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2310.11454 [cs.CL]
(or arXiv:2310.11454v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2310.11454

Submission history​

From: Dawid Jan Kopiczko [view email]
[v1] Tue, 17 Oct 2023 17:59:46 UTC (139 KB)
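To make the abstract concrete, here is a minimal sketch of the idea: one pair of frozen, randomly initialized low-rank matrices shared across all adapted layers, with only small per-layer scaling vectors trained. The initialization constants and layer placement below are simplifying assumptions; see the paper for the exact recipe:

```python
import torch
import torch.nn as nn

class VeRALinear(nn.Module):
    """Sketch of a VeRA-adapted linear layer: frozen shared A/B, trainable vectors d/b."""
    def __init__(self, base: nn.Linear, A: torch.Tensor, B: torch.Tensor):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # pretrained weights stay frozen
            p.requires_grad_(False)
        self.register_buffer("A", A)              # (r, in_features), frozen, shared
        self.register_buffer("B", B)              # (out_features, r), frozen, shared
        r = A.size(0)
        self.d = nn.Parameter(torch.ones(r))                   # trainable scaling vector
        self.b = nn.Parameter(torch.zeros(base.out_features))  # trainable scaling vector

    def forward(self, x):
        # y = W0 x + diag(b) B diag(d) A x
        delta = (((x @ self.A.t()) * self.d) @ self.B.t()) * self.b
        return self.base(x) + delta

# One shared pair of random projections, reused by every adapted layer.
r, d_in, d_out = 8, 1024, 1024
A = torch.randn(r, d_in) / d_in ** 0.5
B = torch.randn(d_out, r) / r ** 0.5

layer = VeRALinear(nn.Linear(d_in, d_out), A, B)
y = layer(torch.randn(2, d_in))   # only layer.d and layer.b receive gradients
```

Because A and B are shared and never updated, the per-layer storage is just the two small vectors, which is where the roughly 10x parameter reduction relative to LoRA comes from.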





 


Self-hosting small LLMs can be significantly cheaper than running GPT-4.

Quick breakdown:

To keep it simple, let's assume you have a full context window.

For GPT-4 (8k context), that's roughly $0.30 per 1k generated tokens: a full 8,192-token prompt at $0.03/1k prompt tokens costs about $0.25 per call, plus $0.06/1k for the completion tokens.

The cost of self-hosting is mainly the cost of a GPU server. To keep things simple, let's assume you're using a @LambdaAPI H100 server at $2/hr.

A while ago, I tested the performance of vLLM with Falcon-7B and was getting roughly 44.1 tokens/sec with a full context window on a 4090. An H100 would be much faster, but we'll use that number.

That's 158,760 tokens/hour, which means your cost is ($2/hour) / (158,760 tokens/hour) = ~$0.013/1k tokens.

(This is a pretty rough calculation, so let me know if I messed up anywhere)
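Spelled out as a quick script (using the post's numbers, and assuming the GPT-4 figure means resending a full 8k prompt for each 1k tokens generated):

```python
# Back-of-the-envelope math from the post above.

# GPT-4 (8k): full 8,192-token prompt at $0.03/1k, plus 1k completion tokens at $0.06/1k.
gpt4_per_1k_generated = 8.192 * 0.03 + 0.06          # ~$0.31

# Self-hosted: $2/hr GPU, ~44.1 tokens/sec measured with vLLM + Falcon-7B on a 4090.
tokens_per_hour = 44.1 * 3600                         # 158,760
selfhost_per_1k = 2.0 / (tokens_per_hour / 1000)      # ~$0.013

print(f"GPT-4:     ~${gpt4_per_1k_generated:.2f} per 1k generated tokens")
print(f"Self-host: ~${selfhost_per_1k:.3f} per 1k tokens")
print(f"Self-host is ~{selfhost_per_1k / gpt4_per_1k_generated:.1%} of the GPT-4 cost")
```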

Yes, the model is way smaller than GPT-4, but I spent absolutely no time maximizing throughput using things like continuous batching and quantizing the model. I also was using a slower GPU with a lot less VRAM.

However, this is really contingent on being able to have consistent GPU usage. But, even at 10% efficiency using my setup, you're looking at only ~30% of the cost of GPT-4.

If you have a narrow task that you can fine-tune a model like Mistral-7B for, you should strongly consider this route.

Mark Tenenholtz
@marktenenholtz
19h
The disadvantages:

- Pay-per-use can be a lot more effective when scaling
- The model I benchmarked only had a 2k context window, but it's also not the most efficient I've tested. Something like Mistral would likely outperform its cost/token
- Some of the cost savings are made up for in maintenance costs
 


You don't need a huge, labeled, custom dataset to train a great model.

The Segment Anything data generation pipeline created 1.1B masks and only a fraction were hand-labeled. They had the best executed data pipeline I've seen in a while.

The main idea:

1. Start with a model trained on existing, public datasets

2. Use that model to help annotators label data, and use that data to retrain the model (using only this data)

3. Run inference on unlabeled data and keep the confident labels/masks. Ask annotators to fill in the rest of the missing labels, and retrain the model.

4. Repeat step 3 a few times

5. Finally, let the model generate labels automatically on a much larger dataset. Apply some filtering and deduping, and retrain the model on this dataset.

This is quite similar to a pseudo-labeling pipeline I used a couple of years ago to medal in a Kaggle competition (except theirs is more sophisticated!)
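As a schematic, the loop above might look roughly like this. The stand-in `train`, `predict`, and `human_annotate` functions and the confidence threshold are toy placeholders; only the control flow mirrors the five steps:

```python
import random

# Toy stand-ins: a "model" just counts what it has seen, predictions get random
# confidences, and humans label everything they are given.
def train(dataset):
    return {"n_examples_seen": len(dataset)}

def predict(model, items):
    return [(x, random.random()) for x in items]     # (item, confidence) pairs

def human_annotate(items, assist_with=None):
    return [(x, 1.0) for x in items]

public_data = list(range(100))
unlabeled = list(range(100, 1000))

model = train(public_data)                                    # 1. bootstrap on public data
labeled = human_annotate(unlabeled[:50], assist_with=model)   # 2. model-assisted labeling
model = train(labeled)                                        #    retrain on that data only

for _ in range(3):                                            # 3-4. semi-automatic rounds
    preds = predict(model, unlabeled[:200])
    confident = [(x, c) for x, c in preds if c >= 0.9]        # keep confident predictions
    rest = human_annotate([x for x, c in preds if c < 0.9])   # humans fill in the rest
    labeled += confident + rest
    model = train(labeled)

auto = predict(model, unlabeled)                              # 5. fully automatic labeling
final = [(x, c) for x, c in auto if c >= 0.9]                 #    crude filtering stand-in
model = train(final)
```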

 