Fuyu-8B: A Multimodal Architecture for AI Agents
October 17, 2023 — Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar
We’re releasing Fuyu-8B, a small version of the multimodal model that powers our product. The model is available on HuggingFace. We think Fuyu-8B is exciting because:
- It has a much simpler architecture and training procedure than other multi-modal models, which makes it easier to understand, scale, and deploy.
- It’s designed from the ground up for digital agents, so it can support arbitrary image resolutions, answer questions about graphs and diagrams, answer UI-based questions, and do fine-grained localization on screen images.
- It’s fast - we can get responses for large images in less than 100 milliseconds.
- Despite being optimized for our use-case, it performs well at standard image understanding benchmarks such as visual question-answering and natural-image-captioning.
Example outputs (images omitted):
- Fuyu’s caption for a photo of a cake: “A cake with writing on it that says congratulations kate and luke on your upcoming arrival.”
- Question about a life-expectancy chart: “What is the highest life expectancy at birth of males?” Fuyu’s answer: “The life expectancy at birth of males in 2018 is 80.7”
Today, we’re releasing Fuyu-8B with an open license (CC-BY-NC), and we’re excited to see what the community builds on top of it! We also discuss results for Fuyu-Medium (a larger model we’re not releasing) and provide a sneak peek of some capabilities that are exclusive to our internal models.
Because this is a raw model release, we have not added further instruction-tuning, post-processing, or sampling strategies to control for undesirable outputs. You should expect to fine-tune the model for your use case.
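For reference, here is a minimal sketch of loading the released checkpoint with the Hugging Face transformers library. The class names, generation settings, and the placeholder image path below are assumptions based on a recent transformers release with Fuyu support; check the model card on HuggingFace for the exact recipe.

```python
import torch
from PIL import Image
from transformers import FuyuForCausalLM, FuyuProcessor

# Assumed model id on the Hugging Face Hub.
model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)

# Ask a question about an arbitrary-resolution screenshot or chart.
image = Image.open("chart.png")  # placeholder path
prompt = "What is the highest life expectancy at birth of males?\n"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")

output = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer[0])
```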
Model Architecture
Adept is building a generally intelligent copilot for knowledge workers. In order to do this, it’s important for us to be able to understand user context and to take actions on behalf of users. Both of those goals rely heavily on image understanding. Users expect what’s visible on their screen to be accessible to the copilot, and important data is often presented most naturally as an image – think charts, slides, PDFs, etc. In order to take actions, we often need to literally click on buttons or scroll through menus. It would be nice if all these actions were doable via API, but much business-relevant software has no API or only an incomplete one, and controlling software via its UI lets us keep the user in the loop.
Diagram of the Fuyu model architecture. Fuyu is a vanilla decoder-only transformer with no specialized image encoder. Image patches are linearly projected directly into the first layer of the transformer, bypassing the embedding lookup. This simplified architecture supports arbitrary image resolutions, and dramatically simplifies both training and inference.
Therefore, we need a model that can understand both images and text. Although a lot of progress is being made on this front, nothing is available that suits our precise needs. Existing multimodal models are complicated, both from an architectural perspective and a training perspective. These complications are a liability when it comes to understanding model behavior, scaling models up, and deploying to users.
On the architecture side, other multimodal models involve a separate image encoder, the output of which tends to be connected to an existing LLM via either cross-attention or some kind of adapter that feeds directly into the LLM’s embedding space. PALM-e, PALI-X, QWEN-VL, LLaVA 1.5, and Flamingo all look more-or-less like this. These models also tend to work at a fixed image resolution: at inference time, all images at greater resolution than this must be downsampled, and all images whose aspect ratio doesn’t match must be padded or distorted.
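For concreteness, here is a minimal sketch of that conventional pattern: a separately trained image encoder whose features are mapped by a small adapter into the LLM’s embedding space and prepended to the text embeddings. All module names, dimensions, and interfaces here are hypothetical, not taken from any of the cited models.

```python
import torch
import torch.nn as nn

class AdapterStyleVLM(nn.Module):
    """Illustrative only: the 'separate image encoder + adapter into an
    existing LLM' pattern described above."""

    def __init__(self, image_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.image_encoder = image_encoder             # e.g. a ViT trained separately, often contrastively
        self.adapter = nn.Linear(vision_dim, llm_dim)  # maps vision features into the LLM's embedding space
        self.llm = llm                                 # an existing decoder-only language model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # The encoder expects a fixed resolution, so pixel_values must already
        # have been resized or padded to match it.
        vision_feats = self.image_encoder(pixel_values)   # (batch, n_image_tokens, vision_dim)
        vision_embeds = self.adapter(vision_feats)        # (batch, n_image_tokens, llm_dim)
        # Prepend the projected image tokens to the text embeddings and
        # run the combined sequence through the LLM.
        return self.llm(torch.cat([vision_embeds, text_embeds], dim=1))
```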
On the training side, other multimodal models tend to have a large number of separate training stages. The image encoder will be trained separately from the LLM on its own tasks, often using a contrastive training objective, which is complicated to implement and reason about. Then, as in e.g. PALI-X, the image encoder and the text decoder (frequently with a bespoke connector network) will be trained together on images at a low resolution for some period of time. At this point, a choice must be made about whether to freeze the weights of each of the components while training. Finally, some models are trained with an extra high-resolution image phase (without which they won’t perform well on high-res images).
When scaling up models, it’s difficult to reason about how to independently scale each of the above components. Should marginal parameters be allocated to the encoder or the decoder? To which of the training steps should we give the next chunk of compute? We’ve instead designed a model without these complications.
Architecturally, Fuyu is a vanilla decoder-only transformer with the same details as Persimmon-8B; there is no image encoder. Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the normal transformer decoder like an image transformer (albeit with no pooling and with causal attention). See the diagram above for more details.
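As a rough sketch (the sizes and names below are hypothetical, not the released code), the only image-specific machinery is a single linear projection from flattened patch pixels to the transformer’s hidden size; projected patches simply take the place that looked-up token embeddings would otherwise occupy.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
PATCH_SIZE = 30      # patch side length in pixels
HIDDEN_SIZE = 4096   # transformer hidden size
VOCAB_SIZE = 262144  # text vocabulary size

token_embedding = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE)                 # ordinary text embedding lookup
patch_projection = nn.Linear(PATCH_SIZE * PATCH_SIZE * 3, HIDDEN_SIZE)  # the only "image encoder"

def embed_inputs(text_ids: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
    """text_ids: (n_text,) token ids; patches: (n_patches, PATCH_SIZE*PATCH_SIZE*3) raw pixels.
    Returns one (n_patches + n_text, HIDDEN_SIZE) sequence for the decoder-only transformer."""
    image_embeds = patch_projection(patches)  # image patches bypass the embedding lookup
    text_embeds = token_embedding(text_ids)
    return torch.cat([image_embeds, text_embeds], dim=0)
```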
This simplification allows us to support arbitrary image resolutions. To accomplish this, we just treat the sequence of image tokens like the sequence of text tokens. We remove image-specific position embeddings and feed in as many image tokens as necessary in raster-scan order. To tell the model when a line has broken, we simply use a special image-newline character. The model can use its existing position embeddings to reason about different image sizes, and we can use images of arbitrary size at training time, removing the need for separate high and low-resolution training stages.
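A minimal sketch of that input format (the helper function and the newline sentinel are hypothetical): cut the image into patches row by row in raster-scan order and append an image-newline marker after each row, so the sequence length simply grows with image size and no resizing or padding is needed.

```python
import torch

PATCH_SIZE = 30                    # illustrative patch size
IMAGE_NEWLINE = "<image_newline>"  # stands in for the special image-newline token

def image_to_sequence(image: torch.Tensor, patch: int = PATCH_SIZE) -> list:
    """image: (3, H, W), with H and W assumed to be multiples of `patch`.
    Returns flattened patches in raster-scan order, with the image-newline
    marker appended after every row of patches."""
    _, h, w = image.shape
    seq = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            seq.append(image[:, top:top + patch, left:left + patch].reshape(-1))
        seq.append(IMAGE_NEWLINE)  # tells the model the raster line has ended
    return seq
```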
Together, these changes have dramatically simplified our training and inference experience.
Eval Performance
To sanity-check the architectural changes underlying Fuyu-8B, we chose four of the most commonly-used image-understanding datasets: VQAv2, OKVQA, COCO Captions, and AI2D. VQAv2 and OKVQA are natural-image question-answering datasets, COCO Captions is a captioning dataset, and AI2D is a multiple-choice dataset involving scientific diagrams. We compare our models to PALM-e, PALI-X, QWEN-VL, and LLaVA 1.5.
The Numbers
The Fuyu models perform well on these metrics, even though the benchmarks are heavily focused on natural images. Fuyu-8B improves over QWEN-VL and PALM-e-12B on 2 out of 3 metrics despite having 2B and 4B fewer parameters, respectively. Fuyu-Medium performs comparably to PALM-e-562B despite having fewer than a tenth as many parameters! PALI-X still performs best on these benchmarks, but it’s larger and fine-tuned on a per-task basis. Note that, since these benchmarks are not our main focus, we didn’t perform any of the typical optimizations (e.g. non-greedy sampling or extended per-dataset fine-tuning).
| Eval Task | Fuyu-8B | Fuyu-Medium | LLaVA 1.5 (13.5B) | QWEN-VL (10B) | PALI-X (55B) | PALM-e-12B | PALM-e-562B |
|---|---|---|---|---|---|---|---|
| VQAv2 | 74.2 | 77.4 | 80.0 | 79.5 | 86.1 | 76.2 | 80.0 |
| OKVQA | 60.6 | 63.1 | n/a | 58.6 | 66.1 | 55.5 | 66.1 |
| COCO Captions | 141 | 138 | n/a | n/a | 149 | 135 | 138 |
| AI2D | 64.5 | 73.7 | n/a | 62.3 | 81.2 | n/a | n/a |
What are these Image-Understanding Benchmarks?
While interacting with these benchmarks, we also noticed serious issues. We’ve developed an in-house eval suite that corresponds more closely to the capabilities we care about, but given the ubiquity of these benchmarks, we thought it was worth elaborating on some of those issues here.