bnew

1/1
AgentStudio: A Toolkit for Building General Virtual Agents

Presents an online, realistic, and multimodal toolkit that covers the entire lifecycle of agent development, including environment setups, data collection, agent evaluation, and visualization

proj: Welcome to AgentStudio! — AgentStudio
abs: [2403.17918] AgentStudio: A Toolkit for Building General Virtual Agents
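
For readers new to virtual agents, the lifecycle above boils down to an observe-act-record loop between an agent and an environment. The sketch below only illustrates that loop; DummyEnv, EchoAgent, and run_episode are made-up names for illustration, not AgentStudio's actual API (see the project docs for the real interfaces).

```python
# Minimal sketch of the observe-act-record loop a virtual-agent toolkit manages.
# DummyEnv, EchoAgent, and run_episode are made-up names, not AgentStudio's API.

class DummyEnv:
    """Stand-in for a real GUI/OS environment that returns text observations."""
    def __init__(self, task: str):
        self.task = task
        self.step_count = 0

    def observe(self) -> str:
        return f"screen state {self.step_count} for task: {self.task}"

    def act(self, action: str) -> bool:
        self.step_count += 1
        return self.step_count >= 3  # pretend the task finishes after 3 actions


class EchoAgent:
    """Stand-in for an LLM/VLM policy that maps observations to actions."""
    def decide(self, observation: str) -> str:
        return "click('next')"


def run_episode(env: DummyEnv, agent: EchoAgent) -> list[dict]:
    """Run one episode and record the trajectory for later evaluation/visualization."""
    trajectory, done = [], False
    while not done:
        obs = env.observe()
        action = agent.decide(obs)
        done = env.act(action)
        trajectory.append({"observation": obs, "action": action, "done": done})
    return trajectory


if __name__ == "__main__":
    traj = run_episode(DummyEnv("open the settings panel"), EchoAgent())
    print(f"collected {len(traj)} steps")  # -> collected 3 steps
```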

bnew

1/2
Human Image Personalization with High-fidelity Identity Preservation

Presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt

proj: FlashFace
abs: [2403.17008] FlashFace: Human Image Personalization with High-fidelity Identity Preservation

2/2
Google presents MagicLens: image retrieval models following open-ended instructions

Outperforms previous SotA but with a 50x smaller model size

proj: https://open-vision-language.github.io/MagicLens/
abs: [2403.19651] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
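
At a high level, instruction-conditioned retrieval means embedding the query image together with a free-form instruction and ranking an image index by similarity. A minimal sketch of that setup is below; embed_image and embed_query are random placeholders standing in for the real encoders, not MagicLens's interface.

```python
# Sketch of instruction-conditioned retrieval: embed (query image, instruction)
# jointly, then rank index images by cosine similarity. The embed_* functions
# are random placeholders, not MagicLens's actual encoders.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

def embed_image(image_id: str) -> np.ndarray:
    # Placeholder: a real system would run a vision encoder here.
    return rng.standard_normal(DIM)

def embed_query(image_id: str, instruction: str) -> np.ndarray:
    # Placeholder: a real system fuses the image and the instruction into one vector.
    return rng.standard_normal(DIM)

# Build an index of candidate images.
corpus = [f"img_{i:04d}" for i in range(1000)]
index = np.stack([embed_image(c) for c in corpus])
index /= np.linalg.norm(index, axis=1, keepdims=True)

# Retrieve with an open-ended instruction.
q = embed_query("img_query", "same landmark, but photographed at night")
q /= np.linalg.norm(q)
scores = index @ q
top_k = np.argsort(-scores)[:5]
print([corpus[i] for i in top_k])
```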

bnew

1/2
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Effectively optimizes the training mixture of a 1B model trained for 100B tokens, reaching performance comparable to a model trained for 48% more steps on the default mixture

repo: GitHub - yegcjs/mixinglaws
abs: [2403.16952] Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
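
The recipe, roughly: measure validation loss on a few small proxy runs with different mixture proportions, fit a parametric "mixing law" that predicts loss from the proportions, then pick the mixture the fitted law predicts is best. A toy sketch under that reading is below; the exponential form and every number are illustrative assumptions, not the paper's fitted law or results.

```python
# Sketch of the recipe: fit a parametric "mixing law" to losses from a few small
# proxy runs, then pick the mixture it predicts is best. The exponential form and
# all numbers below are illustrative assumptions, not the paper's fitted law.
import numpy as np
from scipy.optimize import curve_fit

# Validation losses from small proxy runs at a few (web, code, books) mixtures.
mixtures = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.4, 0.4, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
])
losses = np.array([2.95, 2.88, 2.86, 2.89, 2.92, 3.01])

def mixing_law(r, c, k, t1, t2, t3):
    # Assumed form: loss = c + k * exp(t . r), with r the mixture proportions.
    return c + k * np.exp(r @ np.array([t1, t2, t3]))

params, _ = curve_fit(mixing_law, mixtures, losses,
                      p0=[2.5, 0.3, 0.1, 0.1, 0.1], maxfev=20000)

# Grid-search the simplex for the mixture with the lowest predicted loss.
grid = np.array([[a, b, 1.0 - a - b]
                 for a in np.linspace(0.0, 1.0, 21)
                 for b in np.linspace(0.0, 1.0 - a, 21)])
best = grid[np.argmin(mixing_law(grid, *params))]
print("predicted-best mixture (web, code, books):", np.round(best, 2))
```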

2/2
Does anyone know of a paper that compares the performance of LLMs with the user prompt at the top vs. the bottom of the user input (e.g., as in this image)?

It appears that putting the prompt at the top typically works better, but it would be an interesting problem if no paper has been written on it yet.

When processing a text: prompt before it or after it?
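
For concreteness, the two layouts being compared are simply:

```python
# The two orderings in question: instruction before the document vs. after it.
document = "...long article text..."
question = "Summarize the key findings in three bullet points."

prompt_instruction_first = f"{question}\n\n{document}"   # prompt at the top
prompt_instruction_last = f"{document}\n\n{question}"    # prompt at the bottom

print(prompt_instruction_first[:80])
print(prompt_instruction_last[:80])
```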

bnew

1/8
LLMs can use complex instructions - why can’t retrieval models?

We build FollowIR, a training/test set of real-world human retrieval instructions. Our FollowIR-7B is the best IR model for instruction following, even beating @cohere and @openai retrievers

[2403.15246] FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

2/8
We build FollowIR from three TREC datasets which have instructions for humans to judge doc relevance. If humans can do it, why not models?

We alter these instructions (and the rel. doc set) and see how models change their outputs. We measure w/ p-MRR, a new pairwise eval metric
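
The paper defines p-MRR precisely; as rough intuition only, the pairwise idea is to rank the same documents under the original and the altered instruction and score how much the affected documents move. The toy function below captures that intuition and is not the paper's actual p-MRR formula.

```python
# Illustrative only: a generic "did the ranking react to the altered instruction?"
# score. This is a stand-in for the pairwise idea, NOT the paper's exact p-MRR.
def rank_of(doc_id: str, ranking: list[str]) -> int:
    return ranking.index(doc_id) + 1  # 1-based rank

def mean_rank_shift(ranking_original: list[str],
                    ranking_altered: list[str],
                    docs_made_irrelevant: list[str]) -> float:
    """Average rank change of docs the altered instruction made irrelevant.
    Positive = the model pushed them down, i.e. it reacted to the instruction."""
    shifts = [rank_of(d, ranking_altered) - rank_of(d, ranking_original)
              for d in docs_made_irrelevant]
    return sum(shifts) / len(shifts)

# Toy example: "d2" is relevant under the original instruction but not the altered one.
original = ["d2", "d1", "d3", "d4"]
altered = ["d1", "d3", "d2", "d4"]
print(mean_rank_shift(original, altered, ["d2"]))  # 2.0 -> the model demoted d2
```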

3/8
We find that IR models generally fail to follow instructions, even those that have been trained with them. Only 3B+ models and LLMs without retrieval training are successful.

We show this is because existing IR models use their instructions as keywords instead of as instructions

4/8
Our new model FollowIR-7B has the highest score for following instructions, as well as gains in standard metrics

If you’re interested, check out our eval code: we extend MTEB (@muennighoff, @nils_reimers), making it easy to evaluate your MTEB model by changing only a few lines
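
For reference, the standard MTEB evaluation pattern the thread refers to looks like the snippet below. "SciFact" is just a stock MTEB retrieval task used so the snippet runs as-is; swap in the FollowIR instruction tasks from their repo (their exact task names are not shown here).

```python
# Standard MTEB evaluation pattern: wrap an embedding model and run it on tasks.
# "SciFact" is a stock MTEB retrieval task; replace it with the FollowIR tasks
# from github.com/orionw/FollowIR.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")   # any embedding model to test
evaluation = MTEB(tasks=["SciFact"])                 # replace with FollowIR tasks
evaluation.run(model, output_folder="results/e5-base-v2")
```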

5/8
This is a collaboration that started with my great internship mentors from @allen_ai / @SemanticScholar, including @soldni, @kylelostat, and @armancohan, as well as @macavaney at @GlasgowCS, and my advisors at @jhuclsp, @ben_vandurme and @lawrie_dawn

6/8
**Links**
Github: https://github.com/orionw/FollowIR
Model: https://huggingface.co/jhu-clsp/FollowIR-7B
Paper: https://arxiv.org/abs/2403.15246
Data:

7/8
Separately, @hanseok_oh had a similar idea to evaluate instructions in IR, but with a different approach to data creation. Definitely check it out also!

8/8
Can retrievers follow instructions, including your intentions and preferences?
Introducing INSTRUCTIR, a benchmark for evaluating instruction following in information retrieval. [1/N]

bnew

1/6
Are larger vision models always necessary?

We find scaling on **image scales** (e.g., 224->448->672) is usually better than scaling on model size (e.g., Base->Large->Giant).

With one line of code, improve any vision model for Multimodal LLMs or various vision and robotic tasks!

2/6
We enable any vision model to extract multi-scale features by splitting the large-scale image into regular-sized crops, processing them separately, merging the features together, and pooling the result back to the regular size.

3/6
This simple approach has the advantage of adding no additional parameters (so there is no need to re-train anything) and of keeping the same number of output vision tokens (thus keeping the same input length for MLLMs).
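
Putting 2/6 and 3/6 together, a minimal sketch of the crop → process → merge → pool recipe is below, written against a generic feature extractor. It is an illustration of the idea, not the actual wrapper from the authors' repo.

```python
# Minimal sketch of the crop -> process -> merge -> pool recipe described above,
# written against a generic patch-feature extractor. Illustration only, not the
# actual wrapper API from the scaling_on_scales repo.
import torch
import torch.nn.functional as F

def multiscale_features(backbone, image, base=224, scale=2):
    """backbone: maps a (B, C, base, base) image to a (B, D, h, w) feature map."""
    # 1) Features at the regular scale.
    feat_base = backbone(F.interpolate(image, size=(base, base), mode="bilinear"))
    b, d, h, w = feat_base.shape

    # 2) Upsample the image, split it into regular-sized crops, run them separately.
    big = F.interpolate(image, size=(base * scale, base * scale), mode="bilinear")
    crops = [big[:, :, i * base:(i + 1) * base, j * base:(j + 1) * base]
             for i in range(scale) for j in range(scale)]
    crop_feats = [backbone(c) for c in crops]

    # 3) Merge the crop features back into one large feature map...
    rows = [torch.cat(crop_feats[i * scale:(i + 1) * scale], dim=-1) for i in range(scale)]
    feat_large = torch.cat(rows, dim=-2)               # (B, D, h*scale, w*scale)

    # 4) ...and pool it back to the regular size, so the token count is unchanged.
    feat_large = F.adaptive_avg_pool2d(feat_large, (h, w))

    # Concatenate scales along the channel dimension: no new parameters are needed.
    return torch.cat([feat_base, feat_large], dim=1)   # (B, 2D, h, w)

if __name__ == "__main__":
    # Toy backbone: a conv layer standing in for a real vision encoder.
    toy = torch.nn.Conv2d(3, 16, kernel_size=16, stride=16)
    out = multiscale_features(toy, torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 32, 14, 14])
```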

4/6
We find that keeping the same model size and using larger image scales is usually comparable to or better than using larger models. Comparison on image classification, semantic segmentation, and depth estimation:

5/6
Similar trends for MLLMs:

6/6
And robotic tasks too!

bnew


1/1
When Do We Not Need Larger Vision Models?

repo: GitHub - bfshi/scaling_on_scales
abs: [2403.13043] When Do We Not Need Larger Vision Models?

**Abstract:**

In this work, we explore the necessity of larger models for enhanced visual understanding. Our findings suggest that scaling along the dimension of image scale, termed **Scaling on Scales (S2)**, rather than increasing model size, generally leads to superior performance across a diverse range of downstream tasks.

**Key Findings:**

1. Smaller models employing S2 can capture most of the insights learned by larger models.
2. Pre-training smaller models with S2 can level the playing field with larger models, and in some cases, surpass them.

**Implications for Future Research:**

S2 introduces several considerations for future work:

- **Scale-Selective Processing:** Not all scales at every position within an image hold valuable information. Depending on the content of the image and the overarching task, it can be more efficient to process certain scales for specific regions. This approach mimics the bottom-up and top-down selection mechanisms found in human visual attention (References: 86, 59, 33).

- **Parallel Processing of a Single Image:** Unlike traditional Vision Transformers (ViT) where the entire image is processed in unison, S2 allows each sub-image to be handled independently. This capability facilitates parallel processing of different segments of a single image, which is particularly advantageous in scenarios where reducing latency in processing large images is paramount (Reference: 84).

bnew

1/4
Potentially the biggest paradigm shift in LLMs

Two independent studies managed to pre-train 1.58-bit LLMs that match the performance of FP16 models.

Need to see how it scales (~30B), but super curious about 1.58-bit Mamba and MoE models.
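
The 1.58 figure comes from ternary weights: each weight is one of {-1, 0, +1}, i.e. log2(3) ≈ 1.585 bits of information per weight. Below is a sketch of absmean-style ternary quantization in that spirit; the actual papers quantize during training with a straight-through estimator, which is omitted here.

```python
# Why "1.58-bit": ternary weights take one of 3 values {-1, 0, +1}, and
# log2(3) ~= 1.585 bits per weight. Sketch of absmean-style ternary quantization
# in the spirit of BitNet b1.58; the training-time straight-through estimator
# used in the papers is not shown.
import math
import torch

print(math.log2(3))  # ~1.585 bits per ternary weight

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Scale by the mean absolute value, then round each weight to -1, 0, or +1."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale  # dequantized weights ~= w_q * scale

w = torch.randn(4, 8)
w_q, scale = ternary_quantize(w)
print(w_q.unique())                    # subset of {-1., 0., 1.}
print((w_q * scale - w).abs().mean())  # quantization error
```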


2/4
Not yet, no access to my computer today. I find it quite exciting, lots of potential in optimizing both software and hardware here

3/4
Yes it would fit and also be a lot faster

4/4
I'm not an expert in vision transformers but I would say no, it should work the same

bnew

1/1
DepthFM: Fast Monocular Depth Estimation with Flow Matching

Achieves significantly faster inference speed with minimal performance sacrifices

proj: DepthFM: Fast Monocular Depth Estimation with Flow Matching
abs: [2403.13788] DepthFM: Fast Monocular Depth Estimation with Flow Matching
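
The speed comes from the sampling side of flow matching: the model learns a velocity field, and inference is just integrating an ODE from noise toward the prediction, so a handful of Euler steps can be enough. The toy velocity field below is a stand-in to show the integration loop, not DepthFM's trained network.

```python
# Why flow matching is fast at inference: sampling integrates an ODE
# dx/dt = v(x, t) from t=0 (noise) to t=1 (prediction), so a few Euler steps can
# suffice. The constant-target "velocity field" here is a toy stand-in, not
# DepthFM's trained network.
import torch

def velocity_field(x: torch.Tensor, t: float, target: torch.Tensor) -> torch.Tensor:
    # Toy field whose straight-line flow transports any x toward `target` by t=1.
    return (target - x) / max(1.0 - t, 1e-3)

def sample(target: torch.Tensor, num_steps: int = 4) -> torch.Tensor:
    x = torch.randn_like(target)                     # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_field(x, t, target)    # one Euler step
    return x

depth_like = torch.rand(1, 1, 64, 64)                # pretend this is a depth map
estimate = sample(depth_like, num_steps=4)
print((estimate - depth_like).abs().mean())          # ~0 after only 4 steps
```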

