
n3SjdLJ.png
 


Language models are bad at basic math.

GPT-4 has an accuracy rate of right around 0% on 5-digit multiplication.

Most open models can't even add. Why is that?

There are a few reasons why numbers are hard. The main one is Tokenization. When training a tokenizer from scratch, you take a large corpus of text and find the minimal byte-pair encoding for a chosen vocabulary size.

This means, however, that numbers will almost certainly not have unique token representations. "21" could be a single token, or ["2", "1"]. 143 could be ["143"] or ["14", "3"] or any other combination.

A potential fix here would be to force single digit tokenization. The state of the art for the last few years is to inject a space between every digit when creating the tokenizer and when running the model. This means 143 would always be tokenized as ["1", "4", "3"].

This helps boost performance, but wastes tokens while not fully fixing the problem.
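To make the digit-splitting trick concrete, here is a minimal sketch (my own illustration, not code from any of the papers) of the preprocessing step: insert a space between adjacent digits before tokenization, so every number is forced into single-digit tokens.

```python
import re

def split_digits(text: str) -> str:
    """Insert a space between adjacent digits so that a BPE tokenizer is
    forced to treat every digit as its own token, e.g. "143" -> "1 4 3"."""
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

print(split_digits("21 * 143 = 3003"))
# -> "2 1 * 1 4 3 = 3 0 0 3"
```

The same transform has to be applied at inference time as well, which is exactly why it burns extra tokens without fixing the underlying issue.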

A cool fix might be xVal! This work by The Polymathic AI Collaboration suggests a generic [NUM] token whose embedding is then scaled by the actual value of the number!

If you look at the red lines in the image above, you can get an intuition for how that might work.

It doesn't capture a huge range or high fidelity (e.g., 7.4449 vs 7.4448) but they showcase some pretty convincing results on sequence prediction problems that are primarily numeric.

For example, they want to train a sequence model on GPS-conditioned temperature forecasting.

They found a ~70x improvement over standard vanilla baselines and a 2x improvement over really strong baselines.

One cool side effect is that deep neural networks might be really good at regression problems using this encoding scheme!
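As a rough illustration of the xVal idea (a toy sketch only; the vocabulary, dimensions, and interface below are made up and are not the authors' code), every number in the input is mapped to the same [NUM] token, and that token's embedding is multiplied by the number's value before it enters the transformer:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8                              # toy embedding width
vocab = {"temp": 0, "=": 1, "[NUM]": 2}  # toy vocabulary
embeddings = rng.normal(size=(len(vocab), d_model))

def embed(tokens, values):
    """values[i] carries the numeric value for a [NUM] token and is 1.0 otherwise."""
    rows = []
    for tok, val in zip(tokens, values):
        e = embeddings[vocab[tok]]
        rows.append(val * e if tok == "[NUM]" else e)
    return np.stack(rows)

# "temp = 7.4449": the number is replaced by [NUM], and the single shared
# [NUM] embedding is scaled by 7.4449 before entering the transformer.
x = embed(["temp", "=", "[NUM]"], [1.0, 1.0, 7.4449])
print(x.shape)  # (3, 8)
```

The paper pairs this with a modified number-inference head so that a continuous value can be read back out at [NUM] positions, as described in the abstract below.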


https://arxiv.org/abs/2310.02989

xVal: A Continuous Number Encoding for Large Language Models​

Siavash Golkar, Mariel Pettee, Michael Eickenberg, Alberto Bietti, Miles Cranmer, Geraud Krawezik, Francois Lanusse, Michael McCabe, Ruben Ohana, Liam Parker, Bruno Régaldo-Saint Blancard, Tiberiu Tesileanu, Kyunghyun Cho, Shirley Ho
Large Language Models have not yet been broadly adapted for the analysis of scientific datasets due in part to the unique difficulties of tokenizing numbers. We propose xVal, a numerical encoding scheme that represents any real number using just a single token. xVal represents a given real number by scaling a dedicated embedding vector by the number value. Combined with a modified number-inference approach, this strategy renders the model end-to-end continuous when considered as a map from the numbers of the input string to those of the output string. This leads to an inductive bias that is generally more suitable for applications in scientific domains. We empirically evaluate our proposal on a number of synthetic and real-world datasets. Compared with existing number encoding schemes, we find that xVal is more token-efficient and demonstrates improved generalization.
Comments: 10 pages, 7 figures. Supplementary: 5 pages, 2 figures
Subjects: Machine Learning (stat.ML); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as: arXiv:2310.02989 [stat.ML] (or arXiv:2310.02989v1 [stat.ML] for this version)

Submission history​

From: Siavash Golkar
[v1] Wed, 4 Oct 2023 17:26:16 UTC (2,637 KB)

 



Fuyu-8B: A Multimodal Architecture for AI Agents​

October 17, 2023 — Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar

We’re open-sourcing Fuyu-8B - a small version of the multimodal model that powers our product.

hero.jpg

We’re releasing Fuyu-8B, a small version of the multimodal model that powers our product. The model is available on HuggingFace. We think Fuyu-8B is exciting because:
  1. It has a much simpler architecture and training procedure than other multi-modal models, which makes it easier to understand, scale, and deploy.
  2. It’s designed from the ground up for digital agents, so it can support arbitrary image resolutions, answer questions about graphs and diagrams, answer UI-based questions, and do fine-grained localization on screen images.
  3. It’s fast - we can get responses for large images in less than 100 milliseconds.
  4. Despite being optimized for our use-case, it performs well at standard image understanding benchmarks such as visual question-answering and natural-image-captioning.

baby_cake.png



chart.png



Fuyu’s caption:
“A cake with writing on it that says congratulations kate and luke on your upcoming arrival.”

Question:
“What is the highest life expectancy at birth of males?”

Fuyu’s answer:
“The life expectancy at birth of males in 2018 is 80.7”

Today, we’re releasing Fuyu-8B with an open license (CC-BY-NC)—we’re excited to see what the community builds on top of it! We also discuss results for Fuyu-Medium (a larger model we’re not releasing) and provide a sneak peek of some capabilities that are exclusive to our internal models.

Because this is a raw model release, we have not added further instruction-tuning, postprocessing or sampling strategies to control for undesirable outputs. You should expect to have to fine-tune the model for your use-case.

Model Architecture​

Adept is building a generally intelligent copilot for knowledge workers. In order to do this, it’s important for us to be able to understand user context and to take actions on behalf of users. Both of those goals rely heavily on image understanding. Users expect what’s visible on their screen to be accessible to the copilot, and important data is often presented most naturally as an image – think charts, slides, PDFs, etc. In order to take actions, we often need to literally click on buttons or scroll through menus. It would be nice if all these actions were doable via API, but much business-relevant software has no API or only an incomplete one, and controlling software via UIs allows us to keep the user in the loop.

architecture.png

Diagram of the Fuyu model architecture. Fuyu is a vanilla decoder-only transformer with no specialized image encoder. Image patches are linearly projected directly into the first layer of the transformer, bypassing the embedding lookup. This simplified architecture supports arbitrary image resolutions, and dramatically simplifies both training and inference.

Therefore, we need a model that can understand both images and text. Although a lot of progress is being made on this front, nothing is available that suits our precise needs. Existing multimodal models are complicated, both from an architectural perspective and a training perspective. These complications are a liability when it comes to understanding model behavior, scaling models up, and deploying to users.

On the architecture side, other multimodal models involve a separate image encoder, the output of which tends to be connected to an existing LLM via either cross-attention or through some kind of adapter that feeds directly into the LLM’s embedding-space. PALM-e, PALI-X, QWEN-VL, LLaVA 1.5, and Flamingo all look more-or-less like this. These models also tend to work on a fixed image resolution. At inference time, all images at greater resolution than this must be downsampled, and all images whose aspect ratio doesn’t match must be padded or distorted.

On the training side, other multimodal models tend to have a large number of separate training stages. The image encoder will be trained separately from the LLM on its own tasks, often using a contrastive training objective, which is complicated to implement and reason about. Then, as in e.g. PALI-X, the image encoder and the text decoder (frequently with a bespoke connector network) will be trained together on images at a low resolution for some period of time. At this point, a choice must be made about whether to freeze the weights of each of the components while training. Finally, some models are trained with an extra high-resolution image phase (without which they won’t perform well on high-res images).

When scaling up models, it’s difficult to reason about how to independently scale each of the above components. Should marginal parameters be allocated to the encoder or the decoder? To which of the training steps should we give the next chunk of compute? We’ve instead designed a model without these complications.

Architecturally, Fuyu is a vanilla decoder-only transformer with the same details as Persimmon-8B - there is no image encoder. Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the normal transformer decoder like an image transformer (albeit with no pooling and causal attention). See the diagram above for more details.

This simplification allows us to support arbitrary image resolutions. To accomplish this, we just treat the sequence of image tokens like the sequence of text tokens. We remove image-specific position embeddings and feed in as many image tokens as necessary in raster-scan order. To tell the model when a line has broken, we simply use a special image-newline character. The model can use its existing position embeddings to reason about different image sizes, and we can use images of arbitrary size at training time, removing the need for separate high and low-resolution training stages.
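Here is a minimal sketch of that input path (illustrative only: the patch size, width, and initialization below are toy values, not Fuyu's actual configuration): flatten each patch, project it linearly to the model width, and append a learned image-newline embedding at the end of every patch row before handing the sequence to the decoder.

```python
import torch
import torch.nn as nn

patch, d_model = 16, 64                              # toy values, not Fuyu's real config
proj = nn.Linear(patch * patch * 3, d_model)         # replaces a separate image encoder
image_newline = nn.Parameter(torch.zeros(d_model))   # learned "newline" embedding

def image_to_tokens(img: torch.Tensor) -> torch.Tensor:
    """img: (3, H, W) with H and W divisible by the patch size."""
    _, h, w = img.shape
    rows = []
    for y in range(0, h, patch):
        flat = [img[:, y:y + patch, x:x + patch].reshape(-1)
                for x in range(0, w, patch)]
        row = proj(torch.stack(flat))                          # (W // patch, d_model)
        rows.append(torch.cat([row, image_newline[None, :]]))  # raster scan + newline
    return torch.cat(rows)  # ready to be concatenated with the text token embeddings

print(image_to_tokens(torch.rand(3, 48, 80)).shape)  # 3 rows x (5 patches + newline) = (18, 64)
```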

Together, these changes have dramatically simplified our training and inference experience.

Eval Performance​

To sanity-check the architectural changes underlying Fuyu-8B, we chose four of the most commonly-used image-understanding datasets: VQAv2, OKVQA, COCO Captions, and AI2D. VQAv2 and OKVQA are natural image question-answering datasets, COCO is a captioning dataset, and AI2D is a multiple-choice dataset involving scientific diagrams. We compare our models to PALM-e, PALI-X, QWEN-VL, and LLaVA 1.5.

The Numbers​

The Fuyu models perform well according to these metrics, even though the benchmarks are heavily focused on natural images rather than our main use case. Fuyu-8B improves over QWEN-VL and PALM-e-12B on 2 out of 3 metrics despite having 2B and 4B fewer parameters, respectively. Fuyu-Medium performs comparably to PALM-E-562B despite having fewer than a tenth as many parameters! PALI-X still performs best on these benchmarks, but it’s larger and fine-tuned on a per-task basis. Note that, since these benchmarks are not our main focus, we didn’t perform any of the typical optimizations (e.g. non-greedy sampling, fine-tuning for a long time on each dataset specifically, etc).
Eval Task     | Fuyu-8B | Fuyu-Medium | LLaVA 1.5 (13.5B) | QWEN-VL (10B) | PALI-X (55B) | PALM-e-12B | PALM-e-562B
VQAv2         | 74.2    | 77.4        | 80                | 79.5          | 86.1         | 76.2       | 80.0
OKVQA         | 60.6    | 63.1        | n/a               | 58.6          | 66.1         | 55.5       | 66.1
COCO Captions | 141     | 138         | n/a               | n/a           | 149          | 135        | 138
AI2D          | 64.5    | 73.7        | n/a               | 62.3          | 81.2         | n/a        | n/a

What are these Image-Understanding Benchmarks?​

While interacting with these benchmarks we also noticed serious issues. We’ve developed an in-house eval suite that corresponds more closely to the capabilities we care about, but we thought it was worth elaborating on some of those issues here, given the ubiquity of these benchmarks.
 

Question Answering Benchmarks

The question-answering datasets are quite flawed - they use a complicated scoring mechanism, require you to respond in a specific format, and are often annotated incorrectly.

Consider the following two images:

snare_bear.png



fish_and_carrots.png



OKVQA

Question:
“What instrument is the toy bear playing?”

Fuyu’s answer:
“snare”

OKVQA Score:

0 (all reference answers are simply “drum”)

VQAv2

Question:
“What type of foods are in the image?”

Fuyu’s answer:
“fish, carrots”

VQAv2 Score:

0 (reference answers were “hot dogs”, “sausages”, and “healthy”)

For the image on the left from the OKVQA dataset, when asked the question “What instrument is the toy bear playing?”, the model responds “snare”—which is clearly true! However, it gets a score of 0, because all of the reference answers are simply “drum”. Similarly, for the VQAv2 image on the right, when asked “What type of foods are in the image?”, the model correctly responds “fish, carrots”, but it also gets a score of 0 because the reference solution list doesn’t contain those words.
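For context, the VQA accuracy metric scores a prediction by exact string match against the human-annotated answers, roughly min(#matching annotators / 3, 1). The sketch below is a simplified version (the official scorer also normalizes articles and punctuation and averages over annotator subsets), but it shows why "snare" gets zero when every reference says "drum".

```python
def vqa_accuracy(prediction: str, reference_answers: list[str]) -> float:
    """Simplified VQA accuracy: min(#annotators giving exactly this answer / 3, 1).
    The official scorer also normalizes articles/punctuation and averages over
    annotator subsets, but the exact-match core is the same."""
    matches = sum(prediction == ref for ref in reference_answers)
    return min(matches / 3.0, 1.0)

print(vqa_accuracy("snare", ["drum"] * 10))  # 0.0, even though "snare" is arguably right
print(vqa_accuracy("drum", ["drum"] * 10))   # 1.0
```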

Captioning Benchmarks

It’s also common to evaluate image models using the COCO Captions benchmark. The score used for this benchmark (CIDEr) is based on n-gram similarity to a group of reference captions, which are often poor. We haven’t found that performance on this benchmark corresponds particularly well to our internal evaluations. In fact, Fuyu-Medium is slightly worse by this metric than Fuyu-8B!

For the image below, our model gives the caption “A nighttime view of Big Ben and the Houses of Parliament.” This is correct, but it gets a score of 0.4 because it doesn’t match any of the reference captions (a good score is over 100).

big_ben.png

Fuyu’s caption:
“A nighttime view of Big Ben and the Houses of Parliament.”

Reference captions:
“A fast moving image of cars on a busy street with a tower clock in the background.”

“Lit up night traffic is zooming by a clock tower.”

“A city building is brightly lit and a lot of vehicles are driving by.”

“A large clock tower and traffic moving near.”

“there is a large tower with a clock on it.”

CIDEr Score:

0.4 (No reference caption mentions Big Ben or Parliament)
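CIDEr itself is a TF-IDF-weighted n-gram similarity averaged over several n-gram lengths and all reference captions. The toy function below is not that metric, but it captures the failure mode: a correct caption that shares almost no n-grams with the references scores near zero.

```python
from collections import Counter

def bigram_overlap(candidate: str, reference: str) -> float:
    """Fraction of the candidate's bigrams that also appear in the reference."""
    def bigrams(s: str) -> Counter:
        toks = s.lower().split()
        return Counter(zip(toks, toks[1:]))
    c, r = bigrams(candidate), bigrams(reference)
    return sum((c & r).values()) / max(1, sum(c.values()))

print(bigram_overlap("A nighttime view of Big Ben and the Houses of Parliament.",
                     "there is a large tower with a clock on it."))  # 0.0
```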

Capabilities​

The Fuyu models have several cool capabilities that we preview here, including chart, diagram, and document understanding.

Chart Understanding​

Since our product is geared towards assisting knowledge workers, it’s important for our model to be able to understand charts and diagrams. Here are some examples.

Fuyu can understand complex visual relationships, such as in the below chart, where it has to trace connections between actors and shows and count them to answer the question.

actors.png

Question:
“Aidan Gillen acted in how many series?”

Fuyu’s answer:
“2”

It can also answer nontrivial, multi-hop questions given traditional charts.

twitter_graph.png



vacation_days.png



Question:
“Find missing data of the sequence 24, _ ,32, 33, 42?”

Fuyu’s answer:
“29”

Question:
“What was the fair amount of paid vacation days in the UK?”

Fuyu’s answer:
“28”
 


Document Understanding​

Fuyu can also understand documents — both complex infographics and old PDFs:

jobs.png



pack_spinner.png



Question:
“Which is the metro in California that has a good job Outlook?”

Fuyu’s answer:
“Los Angeles”

Question:
“What was the pack spinner capacity?”

Fuyu’s answer:
“118 packs.”

Diagram Understanding​

Finally, the model can understand complex relational queries about scientific diagrams:

leaf_shapes.png



red_tree_vole.png



Question:
“What letter does a keel-shaped cross-section look like?”

Fuyu’s answer:
“The letter V”

Question:
“If in the food web shown in the diagram, Douglas fir tree needles are absent, which organism would starve?”

Fuyu’s answer:
“Red tree vole”

New Capabilities Sneak-Peek​

Our internal models (based on Fuyu) have extra capabilities related to our product. In particular,
  1. They can reliably perform OCR on high-resolution images
  2. They can do fine-grained localization of text and UI elements within those images
  3. They can answer questions about images of UIs

Since these capabilities are built off of the Fuyu model class (and underlie our upcoming product release), we thought it would be interesting to preview some of them.

OCR Capabilities​

We’ve trained our internal models to do the following two tasks given an image of a UI:
  1. Given a bounding box, tell us what text lies inside that bounding box (bbox_to_text)
  2. Given some text, return to us the bounding box that contains that text (text_to_bbox)

Consider the following 1920x1080 image from one of our validation sets:

ocr_example.png

The blue boxes represent bounding box coordinates that have been passed to the model for the bbox_to_text task. For this example, the model correctly predicted the text contents of every blue bounding box.

The red boxes represent predicted bounding boxes and green boxes represent target bounding boxes for the text_to_bbox task. The model is good enough at bounding box prediction that the red and green boxes overlap almost completely.
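A standard way to quantify "overlap almost completely" is intersection-over-union (IoU); here is a generic reference implementation (not Adept's evaluation code) with a made-up example pair of boxes:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0

print(iou((100, 100, 200, 150), (102, 98, 198, 152)))  # ~0.89: a near-complete overlap
```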

Localization and QA Capabilities​

The model can also locate things on the screen based on informal text commands, as well as answer detailed factual questions about the contents of UIs:

gmail_qa.png

Question:
“is the 2nd email starred? [‘yes’, ‘no’]”

Fuyu’s answer:
“no”

Or consider the below example, where the model can interact with Google Maps to correctly answer questions.

maps_qa.png

Question:
“is La Taqueria north of the 24th St Mission Bart station?”

Fuyu’s answer:
“no”


Both the model weights and some example code are on HuggingFace. We look forward to seeing what you build with it, and please reach out if you have any questions. Stay tuned for more on our product alpha, which will incorporate these and other changes and is coming soon!

 


Llama 2 13B vs Mistral 7B LLM models compared

10:01 am October 12, 2023 By Julian Horsey



If you are interested in learning more about how large language models compare, this comparison between Llama 2 13B and Mistral 7B reveals the differences between the two AI models. Both models are powerful and adaptable, but they each have their own strengths and features. This article provides a comprehensive comparison of the two, focusing on their performance, architecture, and intended use cases.

Mistral 7B, a 7.3 billion parameter model, has been making a name for itself due to its impressive performance on various benchmarks. It outperforms Llama 2 13B on all benchmarks and even surpasses Llama 1 34B on many. It also approaches the performance of CodeLlama 7B on code, while maintaining proficiency in English tasks. This model uses Grouped-query attention (GQA) for faster inference and Sliding Window Attention (SWA) to handle longer sequences at a smaller cost.
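For intuition on the grouped-query attention part, here is a toy sketch (the head counts and shapes below are made up, not Mistral's real configuration): several query heads share one key/value head, which shrinks the KV cache and speeds up inference.

```python
import torch

# Toy shapes, not Mistral's real configuration.
batch, seq, d_head = 2, 16, 32
n_q_heads, n_kv_heads = 8, 2            # 4 query heads share each key/value head
q = torch.rand(batch, n_q_heads, seq, d_head)
k = torch.rand(batch, n_kv_heads, seq, d_head)  # KV cache is 4x smaller than with MHA
v = torch.rand(batch, n_kv_heads, seq, d_head)

# Broadcast each KV head across its group of query heads, then attend as usual.
group = n_q_heads // n_kv_heads
k = k.repeat_interleave(group, dim=1)
v = v.repeat_interleave(group, dim=1)
attn = torch.softmax(q @ k.transpose(-2, -1) / d_head ** 0.5, dim=-1) @ v
print(attn.shape)                       # torch.Size([2, 8, 16, 32])
```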

One of the key advantages of Mistral 7B is its adaptability. It can be deployed on any cloud, including AWS, GCP, and Azure, using the vLLM inference server and SkyPilot. It can also be used locally with the reference implementation provided by the developers. Furthermore, Mistral 7B is easy to fine-tune on any task. As a demonstration, the developers have provided a model fine-tuned for chat, which outperforms Llama 2 13B chat.

Llama 2 13B vs Mistral 7B​


Mistral 7B’s performance on a wide range of benchmarks is impressive. It significantly outperforms Llama 2 13B on all metrics and is on par with Llama 34B. It also excels in code and reasoning benchmarks. The model uses a sliding window attention (SWA) mechanism, which allows each layer to attend to the previous 4,096 hidden states. This results in a linear compute cost and a 2x speed improvement for a sequence length of 16k with a window of 4k.
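To see what that sliding-window constraint looks like, here is a small sketch of the attention mask (illustration only; Mistral's implementation additionally uses a rolling KV cache rather than materializing a full mask):

```python
import torch

def sliding_window_mask(seq_len: int, window: int) -> torch.Tensor:
    """Causal mask where each token attends only to itself and the previous
    window - 1 positions, so per-token attention cost stays constant."""
    i = torch.arange(seq_len).unsqueeze(1)  # query positions
    j = torch.arange(seq_len).unsqueeze(0)  # key positions
    return (j <= i) & (j > i - window)      # True = may attend

print(sliding_window_mask(seq_len=8, window=4).int())
```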

On the other hand, Llama 2 13B is part of a collection of pretrained and fine-tuned generative text models ranging in scale from 7 billion to 70 billion parameters. Developed by Meta, the Llama 2 family of large language models (LLMs) are optimized for dialogue use cases. The fine-tuned LLMs, known as Llama-2-Chat, outperform open-source chat models on most benchmarks tested and are on par with popular closed-source models like ChatGPT and PaLM in terms of helpfulness and safety.

Llama 2 comes in a range of parameter sizes — 7B, 13B, and 70B — as well as pretrained and fine-tuned variations. It is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align to human preferences for helpfulness and safety. The larger models, such as the 70B, use Grouped-Query Attention (GQA) for improved inference scalability.

Llama 2 is intended for commercial and research use in English. The tuned models are designed for assistant-like chat, whereas the pretrained models can be adapted for a variety of natural language generation tasks.

Both Mistral 7B and Llama 2 13B are powerful models with their unique strengths. Mistral 7B shines in its adaptability and performance on various benchmarks, while Llama 2 13B excels in dialogue use cases and aligns well with human preferences for helpfulness and safety. The choice between the two would largely depend on the specific requirements of the task at hand.

 


Using large language models in psychology:

💡 LLMs have the potential to advance psychological measurement, experimentation and practice.

💡 LLMs can generate on-topic, grammatically correct text that is nonetheless useless when it is not grounded in research or in the psychological constructs being measured.

💡A critical task for the field is to curate large, reliable annotated datasets of key psychological constructs while minimizing unwanted biases.

💡Concerns about applying LLMs to psychology:
- Evaluation: computer scientists have tended to evaluate the functionality of features, but psychologists usually want to evaluate the effects of those features on human thought and behaviour.
- Bias: LLMs’ censorship guardrails address only the symptoms of bias, rather than the underlying bias in the data. The censorship also makes it hard for researchers to study unknown biases. It is a high priority to make censorship algorithms transparent and to develop bias-testing protocols that go beyond the obvious ones.

💡Three important needs in the psychology field:
- Invest in keystone datasets that represent the populations and psychological constructs of interest and are linked to psychologically important outcomes.
- Define new, psychologically grounded ways of benchmarking LLMs, which can help facilitate the development of safe and transparent algorithms.
- Build shared computing and analysis infrastructure to ensure that the future of LLM-powered research is equitable.

Thanks for the great paper @Diyi_Yang and team!

Using large language models in psychology - Nature Reviews Psychology
NWuzGV1.jpeg

 


XuUvFyZ.jpeg

How Do Large Language Models Capture the Ever-changing World Knowledge? A Review of Recent Advances​

Zihan Zhang, Meng Fang, Ling Chen, Mohammad-Reza Namazi-Rad, Jun Wang
Although large language models (LLMs) are impressive in solving various tasks, they can quickly be outdated after deployment. Maintaining their up-to-date status is a pressing concern in the current era. This paper provides a comprehensive review of recent advances in aligning LLMs with the ever-changing world knowledge without re-training from scratch. We categorize research works systematically and provide in-depth comparisons and discussion. We also discuss existing challenges and highlight future directions to facilitate research in this field. We release the paper list at this https URL
Comments: EMNLP 2023 main conference, paper link at this https URL
Subjects: Computation and Language (cs.CL)
Cite as: arXiv:2310.07343 [cs.CL] (or arXiv:2310.07343v1 [cs.CL] for this version)
https://doi.org/10.48550/arXiv.2310.07343

Submission history​

From: Zihan Zhang
[v1] Wed, 11 Oct 2023 09:46:32 UTC (378 KB)

https://arxiv.org/pdf/2310.07343v1.pdf
 