Fuyu-8B: A Multimodal Architecture for AI Agents
October 17, 2023 — Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, Sağnak Taşırlar
We’re releasing Fuyu-8B, a small version of the multimodal model that powers our product. The model is available on HuggingFace. We think Fuyu-8B is exciting because:
- It has a much simpler architecture and training procedure than other multi-modal models, which makes it easier to understand, scale, and deploy.
- It’s designed from the ground up for digital agents, so it can support arbitrary image resolutions, answer questions about graphs and diagrams, answer UI-based questions, and do fine-grained localization on screen images.
- It’s fast - we can get responses for large images in less than 100 milliseconds.
- Despite being optimized for our use-case, it performs well at standard image understanding benchmarks such as visual question-answering and natural-image-captioning.
Example outputs (images omitted):
- Fuyu’s caption for a photo of a cake: “A cake with writing on it that says congratulations kate and luke on your upcoming arrival.”
- Question about a life-expectancy chart: “What is the highest life expectancy at birth of males?” Fuyu’s answer: “The life expectancy at birth of males in 2018 is 80.7”
Today, we’re releasing Fuyu-8B with an open license (CC-BY-NC), and we’re excited to see what the community builds on top of it! We also discuss results for Fuyu-Medium (a larger model we’re not releasing) and provide a sneak peek of some capabilities that are exclusive to our internal models.
Because this is a raw model release, we have not added further instruction-tuning, post-processing, or sampling strategies to control for undesirable outputs. You should expect to fine-tune the model for your use case.
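For reference, here is a minimal sketch of loading the released checkpoint with the Hugging Face transformers library. The class names, generation settings, and the placeholder image path below are assumptions based on a recent transformers release with Fuyu support; check the model card on HuggingFace for the exact recipe.

```python
import torch
from PIL import Image
from transformers import FuyuForCausalLM, FuyuProcessor

# Assumed model id on the Hugging Face Hub.
model_id = "adept/fuyu-8b"
processor = FuyuProcessor.from_pretrained(model_id)
model = FuyuForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="cuda:0"
)

# Ask a question about an arbitrary-resolution screenshot or chart.
image = Image.open("chart.png")  # placeholder path
prompt = "What is the highest life expectancy at birth of males?\n"
inputs = processor(text=prompt, images=image, return_tensors="pt").to("cuda:0")

output = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(
    output[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
print(answer[0])
```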
Model Architecture
Adept is building a generally intelligent copilot for knowledge workers. In order to do this, it’s important for us to be able to understand user context and to take actions on behalf of users. Both of those goals rely heavily on image understanding. Users expect what’s visible on their screen to be accessible to the copilot, and important data is often presented most naturally as an image – think charts, slides, PDFs, etc. In order to take actions, we often need to literally click on buttons or scroll through menus. It would be nice if all these actions were doable via API, but much business-relevant software has no API or only an incomplete one, and controlling software via its UI lets us keep the user in the loop.
Diagram of the Fuyu model architecture. Fuyu is a vanilla decoder-only transformer with no specialized image encoder. Image patches are linearly projected directly into the first layer of the transformer, bypassing the embedding lookup. This simplified architecture supports arbitrary image resolutions, and dramatically simplifies both training and inference.
Therefore, we need a model that can understand both images and text. Although a lot of progress is being made on this front, nothing is available that suits our precise needs. Existing multimodal models are complicated, both from an architectural perspective and a training perspective. These complications are a liability when it comes to understanding model behavior, scaling models up, and deploying to users.
On the architecture side, other multimodal models involve a separate image encoder, the output of which tends to be connected to an existing LLM via either cross-attention or some kind of adapter that feeds directly into the LLM’s embedding space. PALM-e, PALI-X, QWEN-VL, LLaVA 1.5, and Flamingo all look more-or-less like this. These models also tend to work at a fixed image resolution: at inference time, all images at greater resolution than this must be downsampled, and all images whose aspect ratio doesn’t match must be padded or distorted.
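For concreteness, here is a minimal sketch of that conventional pattern: a separately trained image encoder whose features are mapped by a small adapter into the LLM’s embedding space and prepended to the text embeddings. All module names, dimensions, and interfaces here are hypothetical, not taken from any of the cited models.

```python
import torch
import torch.nn as nn

class AdapterStyleVLM(nn.Module):
    """Illustrative only: the 'separate image encoder + adapter into an
    existing LLM' pattern described above."""

    def __init__(self, image_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.image_encoder = image_encoder             # e.g. a ViT trained separately, often contrastively
        self.adapter = nn.Linear(vision_dim, llm_dim)  # maps vision features into the LLM's embedding space
        self.llm = llm                                 # an existing decoder-only language model

    def forward(self, pixel_values: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # The encoder expects a fixed resolution, so pixel_values must already
        # have been resized or padded to match it.
        vision_feats = self.image_encoder(pixel_values)   # (batch, n_image_tokens, vision_dim)
        vision_embeds = self.adapter(vision_feats)        # (batch, n_image_tokens, llm_dim)
        # Prepend the projected image tokens to the text embeddings and
        # run the combined sequence through the LLM.
        return self.llm(torch.cat([vision_embeds, text_embeds], dim=1))
```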
On the training side, other multimodal models tend to have a large number of separate training stages. The image encoder will be trained separately from the LLM on its own tasks, often using a contrastive training objective, which is complicated to implement and reason about. Then, as in e.g. PALI-X, the image encoder and the text decoder (frequently with a bespoke connector network) will be trained together on images at a low resolution for some period of time. At this point, a choice must be made about whether to freeze the weights of each of the components while training. Finally, some models are trained with an extra high-resolution image phase (without which they won’t perform well on high-res images).
When scaling up models, it’s difficult to reason about how to independently scale each of the above components. Should marginal parameters be allocated to the encoder or the decoder? To which of the training steps should we give the next chunk of compute? We’ve instead designed a model without these complications.
Architecturally, Fuyu is a vanilla decoder-only transformer with the same details as Persimmon-8B; there is no image encoder. Image patches are instead linearly projected into the first layer of the transformer, bypassing the embedding lookup. We simply treat the normal transformer decoder like an image transformer (albeit with no pooling and with causal attention). See the diagram above for more details.
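As a rough sketch (the sizes and names below are hypothetical, not the released code), the only image-specific machinery is a single linear projection from flattened patch pixels to the transformer’s hidden size; projected patches simply take the place that looked-up token embeddings would otherwise occupy.

```python
import torch
import torch.nn as nn

# Hypothetical sizes, for illustration only.
PATCH_SIZE = 30      # patch side length in pixels
HIDDEN_SIZE = 4096   # transformer hidden size
VOCAB_SIZE = 262144  # text vocabulary size

token_embedding = nn.Embedding(VOCAB_SIZE, HIDDEN_SIZE)                 # ordinary text embedding lookup
patch_projection = nn.Linear(PATCH_SIZE * PATCH_SIZE * 3, HIDDEN_SIZE)  # the only "image encoder"

def embed_inputs(text_ids: torch.Tensor, patches: torch.Tensor) -> torch.Tensor:
    """text_ids: (n_text,) token ids; patches: (n_patches, PATCH_SIZE*PATCH_SIZE*3) raw pixels.
    Returns one (n_patches + n_text, HIDDEN_SIZE) sequence for the decoder-only transformer."""
    image_embeds = patch_projection(patches)  # image patches bypass the embedding lookup
    text_embeds = token_embedding(text_ids)
    return torch.cat([image_embeds, text_embeds], dim=0)
```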
This simplification allows us to support arbitrary image resolutions. To accomplish this, we just treat the sequence of image tokens like the sequence of text tokens. We remove image-specific position embeddings and feed in as many image tokens as necessary in raster-scan order. To tell the model when a line has broken, we simply use a special image-newline character. The model can use its existing position embeddings to reason about different image sizes, and we can use images of arbitrary size at training time, removing the need for separate high and low-resolution training stages.
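A minimal sketch of that input format (the helper function and the newline sentinel are hypothetical): cut the image into patches row by row in raster-scan order and append an image-newline marker after each row, so the sequence length simply grows with image size and no resizing or padding is needed.

```python
import torch

PATCH_SIZE = 30                    # illustrative patch size
IMAGE_NEWLINE = "<image_newline>"  # stands in for the special image-newline token

def image_to_sequence(image: torch.Tensor, patch: int = PATCH_SIZE) -> list:
    """image: (3, H, W), with H and W assumed to be multiples of `patch`.
    Returns flattened patches in raster-scan order, with the image-newline
    marker appended after every row of patches."""
    _, h, w = image.shape
    seq = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            seq.append(image[:, top:top + patch, left:left + patch].reshape(-1))
        seq.append(IMAGE_NEWLINE)  # tells the model the raster line has ended
    return seq
```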
Together, these changes have dramatically simplified our training and inference experience.
Eval Performance
To sanity-check the architectural changes underlying Fuyu-8B, we chose four of the most commonly-used image-understanding datasets: VQAv2, OKVQA, COCO Captions, and AI2D. VQAv2 and OKVQA are natural-image question-answering datasets, COCO Captions is a captioning dataset, and AI2D is a multiple-choice dataset involving scientific diagrams. We compare our models to PALM-e, PALI-X, QWEN-VL, and LLaVA 1.5.
The Numbers
The Fuyu models perform well on these metrics, even though the benchmarks are heavily focused on natural images. Fuyu-8B improves over QWEN-VL and PALM-e-12B on 2 out of 3 metrics despite having 2B and 4B fewer parameters, respectively. Fuyu-Medium performs comparably to PALM-e-562B despite having fewer than a tenth as many parameters! PALI-X still performs best on these benchmarks, but it’s larger and fine-tuned on a per-task basis. Note that, since these benchmarks are not our main focus, we didn’t perform any of the typical optimizations (e.g. non-greedy sampling or extended per-dataset fine-tuning).
| Eval Task | Fuyu-8B | Fuyu-Medium | LLaVA 1.5 (13.5B) | QWEN-VL (10B) | PALI-X (55B) | PALM-e-12B | PALM-e-562B |
|---|---|---|---|---|---|---|---|
| VQAv2 | 74.2 | 77.4 | 80.0 | 79.5 | 86.1 | 76.2 | 80.0 |
| OKVQA | 60.6 | 63.1 | n/a | 58.6 | 66.1 | 55.5 | 66.1 |
| COCO Captions | 141 | 138 | n/a | n/a | 149 | 135 | 138 |
| AI2D | 64.5 | 73.7 | n/a | 62.3 | 81.2 | n/a | n/a |
What are these Image-Understanding Benchmarks?
While interacting with these benchmarks, we also noticed serious issues. We’ve developed an in-house eval suite that corresponds more closely to the capabilities we care about, but given the ubiquity of these benchmarks, we thought it was worth elaborating on some of those issues here.