bnew


RealFill​

Reference-Driven Generation for Authentic Image Completion​

Luming Tang1,2, Nataniel Ruiz1, Qinghao Chu1, Yuanzhen Li1, Aleksander Holynski1, David E. Jacobs1,
Bharath Hariharan2, Yael Pritch1, Neal Wadhwa1, Kfir Aberman1, Michael Rubinstein1

1Google Research, 2Cornell University
arXiv


RealFill is able to complete the image with what should have been there.​

Abstract​

Recent advances in generative imagery have brought forth outpainting and inpainting models that can produce high-quality, plausible image content in unknown regions, but the content these models hallucinate is necessarily inauthentic, since the models lack sufficient context about the true scene. In this work, we propose RealFill, a novel generative approach for image completion that fills in missing regions of an image with the content that should have been there. RealFill is a generative inpainting model that is personalized using only a few reference images of a scene. These reference images do not have to be aligned with the target image, and can be taken with drastically varying viewpoints, lighting conditions, camera apertures, or image styles. Once personalized, RealFill is able to complete a target image with visually compelling contents that are faithful to the original scene. We evaluate RealFill on a new image completion benchmark that covers a set of diverse and challenging scenarios, and find that it outperforms existing approaches by a large margin.

Method​

Authentic Image Completion: Given a few reference images (up to five) and one target image that captures roughly the same scene (but in a different arrangement or appearance), we aim to fill missing regions of the target image with high-quality image content that is faithful to the originally captured scene. Note that for the sake of practical benefit, we focus particularly on the more challenging, unconstrained setting in which the target and reference images may have very different viewpoints, environmental conditions, camera apertures, image styles, or even moving objects.

RealFill: For a given scene, we first create a personalized generative model by fine-tuning a pre-trained inpainting diffusion model on the reference and target images. This fine-tuning process is designed such that the adapted model not only maintains a good image prior, but also learns the contents, lighting, and style of the scene in the input images. We then use this fine-tuned model to fill the missing regions in the target image through a standard diffusion sampling process.
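
As a rough illustration of the second stage only (standard diffusion inpainting sampling with an already personalized model), here is a minimal sketch using the diffusers library; the checkpoint path and prompt are hypothetical placeholders rather than the authors' actual setup.

import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Hypothetical path to an inpainting model already fine-tuned on the scene's
# reference + target images (the personalization step is not shown here).
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "./realfill-scene-checkpoint", torch_dtype=torch.float16
).to("cuda")

target = Image.open("target.png").convert("RGB").resize((512, 512))  # image to complete
mask = Image.open("mask.png").convert("L").resize((512, 512))        # white = region to fill

# Standard diffusion sampling fills the masked region; the fine-tuned weights
# supply the scene-specific content.
result = pipe(prompt="a photo of the scene", image=target, mask_image=mask,
              num_inference_steps=50).images[0]
result.save("completed.png")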

Results​

Given the reference images on the left, RealFill is able to either uncrop or inpaint the target image on the right, resulting in high-quality images that are both visually compelling and also faithful to the references, even when there are large differences between references and targets including viewpoint, aperture, lighting, image style, and object motion.

Comparison with Baselines​

A comparison of RealFill and baseline methods. Transparent white masks are overlaid on the unaltered known regions of the target images.
  • Paint-by-Example does not achieve high scene fidelity because it relies on CLIP embeddings, which only capture high-level semantic information.
  • Stable Diffusion Inpainting produces plausible results, but they are inconsistent with the reference images because prompts have limited expressiveness.

In contrast, RealFill generates high-quality results that have high fidelity with respect to the reference images.



Limitations​

  • RealFill needs to go through a gradient-based fine-tuning process on input images, rendering it relatively slow.
  • When the viewpoint change between reference and target images is very large, RealFill tends to fail at recovering the 3D scene, especially when there is only a single reference image.
  • Because RealFill mainly relies on the image prior inherited from the base pre-trained model, it also fails to handle cases that are challenging for the base model, such as text in the case of Stable Diffusion.


Acknowledgements​

We would like to thank Rundi Wu, Qianqian Wang, Viraj Shah, Ethan Weber, Zhengqi Li, Kyle Genova, Boyang Deng, Maya Goldenberg, Noah Snavely, Ben Poole, Ben Mildenhall, Alex Rav-Acha, Pratul Srinivasan, Dor Verbin and Jon Barron for their valuable discussions and feedback, and thank Zeya Peng, Rundi Wu, and Shan Nan for their contribution to the evaluation dataset. A special thanks to Jason Baldridge, Kihyuk Sohn, Kathy Meier-Hellstern, and Nicole Brichtova for their feedback and support for the project.



 

bnew


Navigating the Jagged Technological Frontier​


Written by

D^3 Faculty


In collaboration with Boston Consulting Group (BCG), new research from Digital Data Design Institute at Harvard chair and co-founder Karim Lakhani and others explores field experimental evidence of the effects of AI on knowledge worker productivity and quality. It involved evaluating the performance of 758 consultants, who make up about 7% of the company’s individual contributor workforce. The tasks spanned a consultant’s daily work, including creativity, analytical thinking, writing proficiency, and persuasiveness.

Key Findings​

  • For tasks within the AI frontier, ChatGPT-4 significantly increased performance, boosting speed by over 25%, human-rated performance by over 40%, and task completion by over 12%.
  • The study introduces the concept of a “jagged technological frontier,” where AI excels in some tasks but falls short in others.
  • Two distinct patterns of AI use emerged: “Centaurs,” who divided and delegated tasks between themselves and the AI, and “Cyborgs,” who integrated their workflow with the AI.

Shifting the Debate​

The paper argues that the focus should move beyond the binary decision of adopting or not adopting AI. Instead, we should evaluate the value of different configurations and combinations of humans and AI for various tasks within the knowledge workflow.

 

bnew


Fake News Detectors are Biased against Texts Generated by Large Language Models​

Jinyan Su, Terry Yue Zhuo, Jonibek Mansurov, Di Wang, Preslav Nakov
The spread of fake news has emerged as a critical challenge, undermining trust and posing threats to society. In the era of Large Language Models (LLMs), the capability to generate believable fake content has intensified these concerns. In this study, we present a novel paradigm to evaluate fake news detectors in scenarios involving both human-written and LLM-generated misinformation. Intriguingly, our findings reveal a significant bias in many existing detectors: they are more prone to flagging LLM-generated content as fake news while often misclassifying human-written fake news as genuine. This unexpected bias appears to arise from distinct linguistic patterns inherent to LLM outputs. To address this, we introduce a mitigation strategy that leverages adversarial training with LLM-paraphrased genuine news. The resulting model yielded marked improvements in detection accuracy for both human and LLM-generated news. To further catalyze research in this domain, we release two comprehensive datasets, GossipCop++ and PolitiFact++, thus amalgamating human-validated articles with LLM-generated fake and real news.
Comments: The first two authors contributed equally
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2309.08674 [cs.CL]
(or arXiv:2309.08674v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2309.08674

Submission history​

From: Terry Yue Zhuo [view email]
[v1] Fri, 15 Sep 2023 18:04:40 UTC (9,845 KB)

 

bnew



StableLM-3B-4E1T​

Technical report for StableLM-3B-4E1T
Jonathan Tow, Marco Bellagente, Dakota Mahan, Carlos Riquelme Ruiz



StableLM-3B-4E1T is a 3 billion (3B) parameter language model pre-trained under the multi-epoch regime to study the impact of repeated tokens on downstream performance. Given prior success in this area (Taylor et al., 2022 and Tay et al., 2023), we train on 1 trillion (1T) tokens for 4 epochs, following the observations of Muennighoff et al. (2023) in "Scaling Data-Constrained Language Models", in which they find that "training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data." Further inspiration for the token count is taken from "Go smol or go home" (De Vries, 2023), which suggests a 2.96B model trained for 2.85 trillion tokens achieves a similar loss to a Chinchilla compute-optimal 9.87B language model (k_n = 0.3).

Model Architecture​


The model is a decoder-only transformer similar to the LLaMA (Touvron et al., 2023) architecture, with the following modifications:
Parameters: 2,795,443,200
Hidden Size: 2560
Layers: 32
Heads: 32
Sequence Length: 4096


Training Data​

The dataset comprises a filtered mixture of open-source large-scale datasets available on the HuggingFace Hub: Falcon RefinedWeb extract (Penedo et al., 2023), RedPajama-Data (Together Computer, 2023) and The Pile (Gao et al., 2020), both without the Books3 subset, and StarCoder (Li et al., 2023). The complete list is provided in Table 1.


Table 1: Open-source datasets used for multi-epoch training. Note that the total token count does not account for the reduced size after downsampling C4, Common Crawl (2023), and GitHub to obtain 1T tokens.
Given the large amount of web data, we recommend fine-tuning the base StableLM-3B-4E1T for your downstream tasks.
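
As a quick point of reference (not part of the report itself), loading the released base checkpoint with Hugging Face transformers looks roughly like this; depending on your transformers version you may need trust_remote_code=True.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "stabilityai/stablelm-3b-4e1t"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # matches the training precision
    device_map="auto",
)

inputs = tokenizer("The weather is always wonderful", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32, do_sample=True, temperature=0.8)
print(tokenizer.decode(out[0], skip_special_tokens=True))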

Training Procedure​

The model is trained for 972k steps in bfloat16 precision with a global context length of 4096 instead of the multi-stage ramp-up from 2048-to-4096 as done for StableLM-Alpha v2. The batch size is set to 1024 (4,194,304 tokens). We optimize with AdamW (Loshchilov and Hutter, 2017) and use linear warmup for the first 4.8k steps, followed by a cosine decay schedule to 4% of the peak learning rate. Early instabilities are attributed to extended periods in high learning rate regions. We do not incorporate dropout (Srivastava et al., 2014) due to the model's relatively small size. Detailed hyperparameters are provided in the model config here.
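
For concreteness, the schedule described above (linear warmup over the first 4.8k steps, then cosine decay to 4% of the peak over 972k total steps) can be written as a small function; the peak learning rate below is an assumed placeholder, since the post defers exact hyperparameters to the model config.

import math

def lr_at_step(step, peak_lr=3.2e-4, warmup_steps=4_800,
               total_steps=972_000, final_frac=0.04):
    # peak_lr is a placeholder; see the model config for the actual value.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear warmup
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return peak_lr * (final_frac + (1.0 - final_frac) * cosine)  # decay to 4% of peak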

During training, we evaluate natural language benchmarks and observe steady improvements over the course of training until the tail end of the learning rate decay schedule. For this reason, we decided to linearly cool down the learning rate towards 0, similar to Zhai et al. (2021), in hopes of squeezing out performance. We plan to explore alternative schedules in future work.

Furthermore, our initial stage of pre-training relies on the flash-attention API (Tri Dao, 2023) with its out-of-the-box triangular causal masking support. This forces the model to attend across different documents within a packed sequence. In the cool-down stage, we instead reset position IDs and attention masks at EOD tokens for all packed sequences, after empirically observing improved sample quality (read: less repetition) in a concurrent experiment. We hypothesize that this late adjustment leads to the notable degradation in byte-length normalized accuracies of Arc Easy (Clark et al., 2018) and SciQ (Welbl et al., 2017).
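
A plain PyTorch sketch of that cool-down change (per-document position IDs and a block-diagonal causal mask for one packed sequence); this is illustrative only, not Stability's FlashAttention-based implementation.

import torch

def reset_at_eod(input_ids: torch.Tensor, eod_token_id: int):
    # input_ids: 1-D tensor of token ids for one packed sequence.
    seq_len = input_ids.shape[0]
    # A new document starts right after each EOD token.
    starts_new_doc = torch.zeros(seq_len, dtype=torch.long)
    starts_new_doc[1:] = (input_ids[:-1] == eod_token_id).long()
    doc_id = torch.cumsum(starts_new_doc, dim=0)

    # Position ids restart at 0 at every document boundary.
    doc_start = torch.zeros(seq_len, dtype=torch.long)
    for i in range(1, seq_len):
        doc_start[i] = i if starts_new_doc[i] else doc_start[i - 1]
    position_ids = torch.arange(seq_len) - doc_start

    # Causal attention restricted to tokens from the same document.
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    same_doc = doc_id.unsqueeze(0) == doc_id.unsqueeze(1)
    return position_ids, causal & same_doc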


Figure 1: Toy demonstration of attention mask resetting.

Data composition was modified during the cool-down. Specifically, we remove Ubuntu IRC, OpenWebText, HackerNews, and FreeLaw for quality control and further NSFW filtering while upsampling C4. The distribution shift is likely responsible for the increased loss (+0.02 nats) from the initial stage.


See the plots below for validation dynamics across our hold-out set and common NLP benchmarks.

Note: The released checkpoint is taken from step 970k according to validation loss and average downstream performance.

Downstream Results​

The following zero-shot evaluations are performed with EleutherAI's lm-evaluation-harness using the lm-bench branch of Stability AI's fork.



Table 2: Zero-shot performance across popular language modeling and common sense reasoning benchmarks. lm-eval results JSONs can be found in the evals directory of the StableLM repo.

StableLM-3B-4E1T achieves state-of-the-art performance (September 2023) at the 3B parameter scale for open-source models and is competitive with many of the popular contemporary 7B models, even outperforming our most recent 7B StableLM-Base-Alpha-v2.

System Details​

    • Hardware: StableLM-3B-4E1T was trained on the Stability AI cluster across 256 NVIDIA A100 40GB GPUs (AWS P4d instances). Training began on August 23, 2023, and took approximately 30 days to complete.
Note: TFLOPs are estimated using GPT-NeoX's get_flops function.



Conclusion​

StableLM-3B-4E1T provides further evidence for the claims in Muennighoff et al. (2023) at the trillion token scale, suggesting multi-epoch training as a valid approach to improving downstream performance when working under data constraints.

Acknowledgments​

We thank our MLOps team members, Richard Vencu and Sami Kama, for 30 days of uninterrupted pre-training, and Reshinth Adithyan, James Baicoianu, Nathan Cooper, Christian Laforte, Nikhil Pinnaparaju, and Enrico Shippole for fruitful discussions and guidance.
 

bnew


AI language models can exceed PNG and FLAC in lossless compression, says study​

Is compression equivalent to general intelligence? DeepMind digs up more potential clues.​

BENJ EDWARDS - 9/28/2023, 11:43 AM


Effective compression is about finding patterns to make data smaller without losing information. When an algorithm or model can accurately guess the next piece of data in a sequence, it shows it's good at spotting these patterns. This links the idea of making good guesses—which is what large language models like GPT-4 do very well—to achieving good compression.

In an arXiv research paper titled "Language Modeling Is Compression," researchers detail their discovery that the DeepMind large language model (LLM) called Chinchilla 70B can perform lossless compression on image patches from the ImageNet image database to 43.4 percent of their original size, beating the PNG algorithm, which compressed the same data to 58.5 percent. For audio, Chinchilla compressed samples from the LibriSpeech audio data set to just 16.4 percent of their raw size, outdoing FLAC compression at 30.3 percent.

In this case, lower numbers in the results mean more compression is taking place. And lossless compression means that no data is lost during the compression process, in contrast to a lossy technique like JPEG, which sheds some data and reconstructs it with approximations during decoding to significantly reduce file sizes.
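
The prediction-compression link can be made concrete: an ideal entropy coder, such as the arithmetic coder used in the paper, spends about -log2 p(x) bits on each symbol, where p(x) is the probability the model assigned to the symbol that actually occurred. A toy sketch with a stand-in model rather than Chinchilla:

import numpy as np

def ideal_compressed_bits(probs_of_actual_symbols):
    # Shannon code length: total bits an ideal entropy coder would need,
    # given the model's probability for each symbol that actually occurred.
    p = np.asarray(probs_of_actual_symbols, dtype=np.float64)
    return float(np.sum(-np.log2(p)))

# Stand-in example: 1,000 bytes. A model that assigns probability 0.5 to each
# observed byte compresses to ~125 bytes; a uniform model (1/256) gives no
# compression at all (1,000 bytes).
print(ideal_compressed_bits([0.5] * 1000) / 8)      # ~125.0
print(ideal_compressed_bits([1 / 256] * 1000) / 8)  # 1000.0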

The study's results suggest that even though Chinchilla 70B was mainly trained to deal with text, it's surprisingly effective at compressing other types of data as well, often better than algorithms specifically designed for those tasks. This opens the door for thinking about machine learning models as not just tools for text prediction and writing but also as effective ways to shrink the size of various types of data.
A chart of compression test results provided by DeepMind researchers in their paper. The chart illustrates the efficiency of various data compression techniques on different data sets, all initially 1GB in size. It employs a lower-is-better ratio, comparing the compressed size to the original size. (Credit: DeepMind)

Over the past two decades, some computer scientists have proposed that the ability to compress data effectively is akin to a form of general intelligence. The idea is rooted in the notion that understanding the world often involves identifying patterns and making sense of complexity, which, as mentioned above, is similar to what good data compression does. By reducing a large set of data into a smaller, more manageable form while retaining its essential features, a compression algorithm demonstrates a form of understanding or representation of that data, proponents argue.

The Hutter Prize is an example that brings this idea of compression as a form of intelligence into focus. Named after Marcus Hutter, a researcher in the field of AI and one of the named authors of the DeepMind paper, the prize is awarded to anyone who can most effectively compress a fixed set of English text. The underlying premise is that a highly efficient compression of text would require understanding the semantic and syntactic patterns in language, similar to how a human understands it.

So theoretically, if a machine can compress this data extremely well, it might indicate a form of general intelligence—or at least a step in that direction. While not everyone in the field agrees that winning the Hutter Prize would indicate general intelligence, the competition highlights the overlap between the challenges of data compression and the goals of creating more intelligent systems.


Along these lines, the DeepMind researchers claim that the relationship between prediction and compression isn't a one-way street. They posit that if you have a good compression algorithm like gzip, you can flip it around and use it to generate new, original data based on what it has learned during the compression process.

In one section of the paper (Section 3.4), the researchers carried out an experiment to generate new data across different formats—text, image, and audio—by getting gzip and Chinchilla to predict what comes next in a sequence of data after conditioning on a sample. Understandably, gzip didn't do very well, producing completely nonsensical output—to a human mind, at least. It demonstrates that while gzip can be compelled to generate data, that data might not be very useful other than as an experimental curiosity. On the other hand, Chinchilla, which is designed with language processing in mind, predictably performed far better in the generative task.
An example from the DeepMind paper comparing the generative properties of gzip and Chinchilla on a sample text. gzip's output is unreadable. (Credit: DeepMind)

While the DeepMind paper on AI language model compression has not been peer-reviewed, it provides an intriguing window into potential new applications for large language models. The relationship between compression and intelligence is a matter of ongoing debate and research, so we'll likely see more papers on the topic emerge soon.

Benj Edwards is an AI and Machine Learning Reporter for Ars Technica. In his free time, he writes and records music, collects vintage computers, and enjoys nature. He lives in Raleigh, NC.



Promoted Comments​


redleader
But what about decompression rate? FLAC has always been noteworthy for being an asymmetrical codec which takes more computational power to compress than to decompress (potentially a lot more, depending on the settings used). If this new AI codec requires a lot of number crunching to decode, it may not be such a big win in all situations.

In terms of a practical format, FLAC/PNG are designed to be incredibly fast and lightweight because they have to be integrated into mobile devices, web browsers, etc without consuming huge amounts of memory and power. For example, FLAC is designed to be able to decode CD audio losslessly in realtime on DSP cores with single-digit MHz and tens of kilobytes of RAM while using the absolute lowest amount of battery. I'm not sure how much memory Chinchilla 70B requires, but seeing as the model has 70 billion parameters, I suspect it will not fit into 64 KB of memory on a low power embedded audio device.
September 28, 2023 at 4:25 pm
 

bnew



Can my GPU run this LLM?


Calculate how much GPU memory you need, and get a breakdown of where it goes, for training or inference of any LLM with quantization (GGML/bitsandbytes), inference frameworks (vLLM/llama.cpp/HF) & QLoRA.

Link: LLM memory check


Purpose

I made this to check whether you can run a particular LLM on your GPU. It's useful for figuring out the following:

  1. What quantization should I use to fit a given model on my GPU?
  2. What is the maximum context length my GPU can handle?
  3. What kind of finetuning can I do? Full? LoRA? QLoRA?
  4. What is the maximum batch size I can use during finetuning?
  5. What is consuming my GPU memory? What should I change to fit the LLM on my GPU?
The output is the total vRAM & the breakdown of where the vRAM goes (in MB). It looks like this:

{
  "Total": 4000,
  "KV Cache": 1000,
  "Model Size": 2000,
  "Activation Memory": 500,
  "Grad & Optimizer memory": 0,
  "cuda + other overhead": 500
}


Can't we just look at the model size & figure this out?

Finding which LLMs your GPU can handle isn't as easy as looking at the model size, because during inference the KV cache takes a substantial amount of extra memory. For example, with sequence length 1000 on llama-2-7b it takes 1GB of extra memory (using huggingface LlamaForCausalLM; with exLlama & vLLM this is 500MB). And during training, the KV cache, activations & quantization overhead all take a lot of memory. For example, llama-7b with bnb int8 quant is ~7.5GB in size, but it isn't possible to finetune it using LoRA on data with 1000 context length even on an RTX 4090 with 24 GB, which means an additional ~16GB of memory goes into quant overheads, activations & grad memory.
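
As a quick sanity check on the llama-2-7b numbers above (32 layers, hidden size 4096, fp16 values), the KV-cache formulas given later in this post work out as follows:

layers, hidden, seq_len, fp16_bytes = 32, 4096, 1000, 2

kv_elems_per_layer = 2 * seq_len * hidden              # keys + values
baseline = layers * kv_elems_per_layer * fp16_bytes    # exLlama / vLLM style
hf_style = 2 * baseline                                # extra 2x for the HF implementation

print(baseline / 2**20, "MB")   # 500.0 MB
print(hf_style / 2**20, "MB")   # 1000.0 MB, matching the ~1GB figure above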

How to use

Model Name/ID/Size

  1. You can either enter the model id of a huggingface model (e.g. meta-llama/Llama-2-7b). Currently I have hardcoded & saved the model configs of the top 3k most downloaded LLMs on huggingface.
  2. If you have a custom model or your huggingface id isn't available, then you can either upload a json config (example) or just enter your model size (e.g. 7 billion for llama-2-7b)

Options

  1. Inference: Find vRAM for inference using either HuggingFace implementation or vLLM or GGML
  2. Training : Find vRAM for either full model finetuning or finetuning using LoRA (currently I have hardcoded r=8 for LoRA config) or using QLoRA.

Quantization

  1. Currently it supports: bitsandbytes (bnb) int8/int4 & GGML (QK_8, QK_6, QK_5, QK_4, QK_2). The latter are only for inference while bnb int8/int4 can be used for both training & inference

Context Len/Sequence Length

  1. This is the length of your prompt plus the maximum number of new tokens generated. For training, it is the sequence length of your training data. Batch sizes are 1 for inference & can be specified for training. The option to specify batch sizes for inference still needs to be added.

How reliable are the numbers?

The results can vary depending on your model, input data, cuda version & what quant you are using, and it is impossible to predict exact values. I have tried to take these into account & make sure the results are within 500MB. I cross-checked the 3b, 7b & 13b model memories given by the website against what I get on my RTX 4090 & 2060 GPUs; all values are within 500MB.


How are the values calculated?

Total memory = model size + kv-cache + activation memory + optimizer/grad memory + cuda etc. overhead (a rough estimator combining these terms is sketched after the list below)

  1. Model size = this is your .bin file size (divide it by 2 if Q8 quant & by 4 if Q4 quant).
  2. KV-Cache = Memory taken by KV (key-value) vectors. Size = (2 x sequence length x hidden size) per layer. For huggingface this is (2 x 2 x sequence length x hidden size) per layer
  3. Activation Memory = When you use LoRA, even though your model params don't require grad, their intermediate activations still need to be stored in order to backpropagate through them (these take the most memory). There is no simple formula here; it depends on the implementation.
  4. Optimizer/Grad memory = Memory taken by .grad tensors & tensors associated with the optimizer (running avg etc.)
  5. Cuda etc. overhead = Around 500MB-1GB of memory is taken by CUDA whenever cuda is loaded; this varies. There are also additional overheads when you use any quantization (like bitsandbytes). Again, no straightforward formula
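
A rough sketch of how these terms combine for inference, following the breakdown above; the activation and overhead numbers are assumed placeholders, since there is no simple formula for them.

def estimate_vram_mb(params_billion, n_layers, hidden_size, seq_len,
                     bytes_per_param=2, hf_kv=True,
                     activation_mb=500, grad_optim_mb=0, overhead_mb=750):
    # activation_mb / grad_optim_mb / overhead_mb are placeholder values.
    model_mb = params_billion * 1e9 * bytes_per_param / 2**20
    kv_per_layer = (4 if hf_kv else 2) * seq_len * hidden_size   # elements
    kv_mb = n_layers * kv_per_layer * 2 / 2**20                  # fp16 values
    total = model_mb + kv_mb + activation_mb + grad_optim_mb + overhead_mb
    return {"Total": round(total), "KV Cache": round(kv_mb),
            "Model Size": round(model_mb), "Activation Memory": activation_mb,
            "Grad & Optimizer memory": grad_optim_mb,
            "cuda + other overhead": overhead_mb}

# Example: llama-2-7b in fp16 at sequence length 1000
print(estimate_vram_mb(7, 32, 4096, 1000))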

Why are the results wrong?

Sometimes the answers might be very wrong in which case please open an issue here & I will try to fix it.

TODO

  1. Add support for exLlama
  2. Add QLora ✅
  3. Add way to measure approximate tokens/s you can get for a particular GPU
  4. Improve logic to get hyper-params from size (since hidden layer/intermediate size/number of layers can vary for a particular size) ✅
  5. Add AWQ

 

bnew



Bye Bye Llama-2, Mistral 7B is Taking Over: Get Started With Mistral 7B Instruct Model​

Qendel AI

“No LLM has been most popular > 2 months”, says Dr. M Waleed Kadous, a chief scientist at Anyscale, in his recent AI conference presentation.

Llama 2 had been taking the open-source LLM space by storm, but not anymore. Mistral AI, a small creative team, has now open-sourced a new model that beats the Llama models.

As presented by the team, the Mistral 7B model

  • Outperforms Llama 2 13B on all benchmarks
  • Outperforms Llama 1 34B on many benchmarks
  • Approaches CodeLlama 7B performance on code, while remaining good at English tasks
  • Uses Grouped-query attention (GQA) for faster inference
  • Uses Sliding Window Attention (SWA) to handle longer sequences at a smaller cost
Here’s the Mistral 7B model’s performance comparison with the Llama-2 family:


Mistral 7B model performance comparison with the Llama-2 family. Source.
Mistral 7B also has an instruct fine-tuned (chat) version, and it outperforms all 7B models on MT-Bench — as shown below:


Mistral 7B Instruct Model comparison on MT Bench. Source.
Now, let’s dive into the steps to get started with the Mistral 7B Instruct Model on Google Colab.

Getting Started With Mistral 7B Instruct Model​

Step 1:

Install essential libraries: transformers, torch, accelerate, bitsandbytes, and langchain.

!pip install git+https://github.com/huggingface/transformers torch accelerate bitsandbytes langchain
Step 2:

Import libraries and set variables

Code:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
Step 3:

Download the Mistral 7B Instruct model and its tokenizer


In my case, I am downloading a 4-bit version of the model to fit on my limited Colab GPU, but you can download an 8-bit version or the full model as long as your machine can handle it.
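
A minimal sketch of the 4-bit download with transformers and bitsandbytes might look roughly like this (the article's exact code may differ; the 8-bit variant just swaps in load_in_8bit=True):

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

model_id = "mistralai/Mistral-7B-Instruct-v0.1"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,  # drop this to load the full-precision model
    device_map="auto",
)

prompt = "[INST] What is your favourite condiment? [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))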
 

GrudgeBooty

nah, I just find it interesting like the dialup days of the internet. exciting times ahead :banderas:
Same! I run a few chatbots locally on my PC, so I'm interested what could happen with Chatbots and Augmented and Virtual Reality. Reminds me of Star Trek's Holodeck
 

bnew


Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback​

Hannah Rose Kirk, Bertie Vidgen, Paul Röttger, Scott A. Hale
Large language models (LLMs) are used to generate content for a wide range of tasks, and are set to reach a growing audience in coming years due to integration in product interfaces like ChatGPT or search engines like Bing. This intensifies the need to ensure that models are aligned with human preferences and do not produce unsafe, inaccurate or toxic outputs. While alignment techniques like reinforcement learning with human feedback (RLHF) and red-teaming can mitigate some safety concerns and improve model capabilities, it is unlikely that an aggregate fine-tuning process can adequately represent the full range of users' preferences and values. Different people may legitimately disagree on their preferences for language and conversational norms, as well as on values or ideologies which guide their communication. Personalising LLMs through micro-level preference learning processes may result in models that are better aligned with each user. However, there are several normative challenges in defining the bounds of a societally-acceptable and safe degree of personalisation. In this paper, we ask how, and in what ways, LLMs should be personalised. First, we review literature on current paradigms for aligning LLMs with human feedback, and identify issues including (i) a lack of clarity regarding what alignment means; (ii) a tendency of technology providers to prescribe definitions of inherently subjective preferences and values; and (iii) a 'tyranny of the crowdworker', exacerbated by a lack of documentation in who we are really aligning to. Second, we present a taxonomy of benefits and risks associated with personalised LLMs, for individuals and society at large. Finally, we propose a three-tiered policy framework that allows users to experience the benefits of personalised alignment, while restraining unsafe and undesirable LLM-behaviours within (supra-)national and organisational bounds.
Comments: 19 pages, 1 table
Subjects: Computation and Language (cs.CL); Computers and Society (cs.CY)
Cite as: arXiv:2303.05453 [cs.CL]
(or arXiv:2303.05453v1 [cs.CL] for this version)
[2303.05453] Personalisation within bounds: A risk taxonomy and policy framework for the alignment of large language models with personalised feedback

Submission history​

From: Hannah Rose Kirk Miss [view email]
[v1] Thu, 9 Mar 2023 17:52:07 UTC (414 KB)

 