bnew

1/1
When Do We Not Need Larger Vision Models?

repo: GitHub - bfshi/scaling_on_scales
abs: [2403.13043] When Do We Not Need Larger Vision Models?

**Abstract:**

In this work, we explore whether larger models are necessary for enhanced visual understanding. Our findings suggest that scaling along the dimension of image scales, termed **Scaling on Scales (S2)**, rather than increasing the model size, generally leads to superior performance across a diverse range of downstream tasks.

**Key Findings:**

1. Smaller models employing S2 can capture most of what larger models learn.
2. Pre-training smaller models with S2 can level the playing field with larger models, and in some cases, surpass them.
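
To make the S2 recipe concrete, here is a minimal sketch of multi-scale feature extraction with a single frozen backbone: the image is also processed at a larger scale by splitting it into sub-images at the base resolution, and the pooled features from all scales are concatenated. This is only an illustration of the idea; the official implementation is in the bfshi/scaling_on_scales repo, and the pooling/merging details here are simplified assumptions.

```python
# Minimal sketch of Scaling on Scales (S2): reuse one frozen backbone at several
# image scales instead of switching to a larger model. Illustrative only; see the
# bfshi/scaling_on_scales repo for the official implementation.
import torch
import torch.nn.functional as F

def s2_features(backbone, image, scales=(1, 2), base=224):
    """image: (B, 3, H, W); backbone maps (B, 3, base, base) -> (B, D) pooled features."""
    feats = []
    for s in scales:
        size = base * s
        x = F.interpolate(image, size=(size, size), mode="bilinear", align_corners=False)
        # Split the up-scaled image into s*s sub-images of the base resolution.
        subs = x.unfold(2, base, base).unfold(3, base, base)           # (B, 3, s, s, base, base)
        subs = subs.permute(0, 2, 3, 1, 4, 5).reshape(-1, 3, base, base)
        with torch.no_grad():                                          # backbone stays frozen
            sub_feats = backbone(subs)                                 # (B*s*s, D)
        sub_feats = sub_feats.reshape(image.shape[0], s * s, -1).mean(dim=1)  # pool sub-images
        feats.append(sub_feats)
    return torch.cat(feats, dim=-1)                                    # (B, D * len(scales))
```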

**Implications for Future Research:**

S2 introduces several considerations for future work:

- **Scale-Selective Processing:** Not all scales at every position within an image hold valuable information. Depending on the content of the image and the overarching task, it can be more efficient to process certain scales for specific regions. This approach mimics the bottom-up and top-down selection mechanisms found in human visual attention (References: 86, 59, 33).

- **Parallel Processing of a Single Image:** Unlike traditional Vision Transformers (ViT) where the entire image is processed in unison, S2 allows each sub-image to be handled independently. This capability facilitates parallel processing of different segments of a single image, which is particularly advantageous in scenarios where reducing latency in processing large images is paramount (Reference: 84).

bnew

1/4
Potentially the biggest paradigm shift in LLMs

Two independent studies managed to pre-train 1.58-bit LLMs that match the performance of FP16 models.

Need to see how it scales (~30B), but super curious about 1.58-bit Mamba and MoE models.
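
For context, the "1.58-bit" figure comes from restricting each weight to the three values {-1, 0, +1}, i.e. log2(3) ≈ 1.58 bits of information per weight. Below is a hedged sketch of an absmean-style ternary quantizer in the spirit of BitNet b1.58; the exact scaling, activation quantization, and training details differ between the two studies.

```python
# Hedged sketch of ternary ("1.58-bit") weight quantization: each weight is rounded
# to {-1, 0, +1} after scaling by the mean absolute value (absmean).
import torch

def quantize_ternary(w: torch.Tensor, eps: float = 1e-5):
    scale = w.abs().mean().clamp(min=eps)        # absmean scale for the whole tensor
    q = (w / scale).round().clamp(-1, 1)         # ternary values in {-1, 0, +1}
    return q, scale                              # dequantize as q * scale

def ste_weight(w: torch.Tensor):
    # Straight-through estimator: forward with quantized weights, backprop as if
    # quantization were the identity (common trick when training quantized nets).
    q, scale = quantize_ternary(w)
    return w + (q * scale - w).detach()
```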


2/4
Not yet, no access to my computer today. I find it quite exciting; there's lots of potential in optimizing both software and hardware here.

3/4
Yes it would fit and also be a lot faster

4/4
I'm not an expert in vision transformers but I would say no, it should work the same

bnew

1/2
DepthFM: Fast Monocular Depth Estimation with Flow Matching

Achieves significantly faster inference speed with minimal performance sacrifices

proj: DepthFM: Fast Monocular Depth Estimation with Flow Matching
abs: [2403.13788] DepthFM: Fast Monocular Depth Estimation with Flow Matching
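
The speed comes from the flow-matching formulation: the network learns a velocity field, and inference is just a few ODE integration steps instead of a long diffusion chain. Here is a generic, hedged sampler sketch (not the DepthFM release; `velocity_model` and `image_cond` are placeholder names):

```python
# Generic flow-matching sampler sketch: integrate dx/dt = v_theta(x, t, cond)
# from t=0 to t=1 with a handful of Euler steps. Placeholders for illustration.
import torch

@torch.no_grad()
def sample_depth(velocity_model, image_cond, shape, steps=4, device="cuda"):
    x = torch.randn(shape, device=device)                   # start from the source distribution
    ts = torch.linspace(0.0, 1.0, steps + 1, device=device)
    for i in range(steps):
        t, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((shape[0],), float(t), device=device)
        v = velocity_model(x, t_batch, image_cond)           # predicted velocity field
        x = x + (t_next - t) * v                             # Euler step along the flow
    return x                                                 # sample in depth space
```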

2/2
Does anyone know of a paper that compares the performance of LLMs with the user prompt at the top vs. the bottom of the user input (e.g., this image)?

It appears that the top typically works better, but it would be an interesting problem if no paper has been written on it yet.

When processing a text: prompt before it or after it?
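
Absent a definitive reference, one way to test this empirically is a small A/B harness that builds the same request with the instruction before vs. after the document and scores both variants on a labeled set. A hedged sketch (the `ask_llm` client wrapper is hypothetical):

```python
# Hedged sketch of an A/B harness for instruction placement. `ask_llm` is a
# hypothetical wrapper around whatever chat/completions client you use.
def build_prompt(instruction: str, document: str, instruction_first: bool) -> str:
    if instruction_first:
        return f"{instruction}\n\n---\n{document}"
    return f"{document}\n---\n\n{instruction}"

def compare_placements(ask_llm, instruction, examples):
    """examples: list of (document, expected_answer) pairs."""
    scores = {"first": 0, "last": 0}
    for doc, expected in examples:
        for key, first in (("first", True), ("last", False)):
            answer = ask_llm(build_prompt(instruction, doc, first))
            scores[key] += int(expected.lower() in answer.lower())
    return scores
```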

bnew

1/7
What if we could show a robot how to do a task?

We present Vid2Robot, which is a robot policy trained to decode human intent from visual cues and translate it into actions in its environment.

Website: Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Arxiv: [2403.12943] Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

(1/n)

2/7
The prompt video shows a human moving a green jalapeno chip bag near a Coke can. Note that the tables and camera viewpoints are different from those of the current robot. The policy needs to identify the task and also figure out how to act in its current environment.

2/

3/7
When using human videos as prompts, we find that Vid2Robot is better than BC-Z even when both are trained with the same data.

3/

4/7
We investigate two emergent capabilities of video-conditioned policies:
(i) How effectively can the motion shown in a prompt video be performed on another object in the robot’s view?
(ii) How can we chain the prompt videos for longer horizon tasks?

4/

5/7
For Cross Object Motion Transfer, Vid2Robot can perform the motion shown with one object on other objects, and in many cases, it is much better than BC-Z.

5/

6/7
Vid2Robot can also solve some of the long-horizon tasks, by combining prompt videos for each step.

6/

7/7
How do we train this model?

Vid2Robot leverages cross-attention mechanisms to fuse human video features with the robot's current state, generating actions that are relevant to perform the observed tasks.

7/
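
As a rough illustration of that fusion step (not the authors' architecture, just the general mechanism): robot observation tokens act as queries that cross-attend into the encoded prompt-video tokens, and an action head reads out the fused representation.

```python
# Hedged sketch of video-conditioned policy fusion via cross-attention.
# Illustrative of the general mechanism only, not the Vid2Robot code.
import torch
import torch.nn as nn

class VideoConditionedPolicy(nn.Module):
    def __init__(self, dim=512, n_heads=8, action_dim=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, action_dim))

    def forward(self, robot_tokens, prompt_video_tokens):
        # robot_tokens: (B, Tr, dim) current observation/state tokens (queries)
        # prompt_video_tokens: (B, Tv, dim) encoded human demonstration (keys/values)
        fused, _ = self.cross_attn(robot_tokens, prompt_video_tokens, prompt_video_tokens)
        fused = self.norm(robot_tokens + fused)
        return self.action_head(fused.mean(dim=1))   # (B, action_dim)
```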
 

bnew

1/7
I'm excited to share our latest fMRI-to-image paper

MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data!

arXiv: [2403.11207] MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
project page: MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data

2/7
Our model only needs 1 hour of training data – compare this to past work that used 30-40 hours of training data per person.

3/7
We do this by pretraining a latent space shared across multiple people. So for new subjects, less data is needed.

Our pipeline is otherwise quite similar to (and simpler than) MindEye1: we generate CLIP embeddings from fMRI that are then passed into an "unCLIP" model to get the reconstruction.
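
A hedged sketch of the shared-subject idea (illustrative, not the MindEye2 code): each subject gets a small subject-specific adapter that maps its voxels into a common latent space, and a single shared backbone maps that latent to CLIP image-embedding space, so a new subject mostly needs to fit the adapter.

```python
# Hedged sketch of shared-subject fMRI decoding: per-subject linear adapters map
# different voxel counts into one shared latent space; a shared backbone maps the
# latent to CLIP image-embedding space. Dimensions here are assumptions.
import torch
import torch.nn as nn

class SharedSubjectDecoder(nn.Module):
    def __init__(self, voxel_counts: dict, shared_dim=4096, clip_dim=1024):
        super().__init__()
        # One cheap adapter per subject; this is the main part a new subject needs.
        self.adapters = nn.ModuleDict(
            {name: nn.Linear(v, shared_dim) for name, v in voxel_counts.items()}
        )
        self.backbone = nn.Sequential(                 # shared across all subjects
            nn.LayerNorm(shared_dim), nn.Linear(shared_dim, shared_dim),
            nn.GELU(), nn.Linear(shared_dim, clip_dim),
        )

    def forward(self, voxels: torch.Tensor, subject: str):
        latent = self.adapters[subject](voxels)        # subject-specific mapping
        return self.backbone(latent)                   # predicted CLIP embedding
```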

4/7
We do train our own unCLIP model based on SDXL, which is better able to reconstruct from CLIP embeddings than the previous model we used (Versatile Diffusion).

5/7
Our approach achieves SOTA with the full 40 hrs of training data, and also beats every other approach when limited to just 1 hr. Here is a comparison to previous approaches; full quantitative results are in the preprint.

6/7
Once again this project was led by the amazing @humanscotti, who officially joined MedARC as our head of neuroimaging back in Nov.

Read his thread here:

7/7
Our MindEye2 preprint is out!

We reconstruct seen images from fMRI brain activity using only 1 hour of training data.

This is possible by first pretraining a shared-subject model using other people's data, then fine-tuning on a held-out subject with only 1 hour of data.

bnew

1/2
Arc2Face: A Foundation Model of Human Faces
Foivos Paraperas Papantoniou, Alexandros Lattas,

Presents a large dataset of high-resolution facial images with consistent ID and intra-class variability, and an ID-conditioned face model trained on it

proj: Arc2Face: A Foundation Model of Human Faces
abs: [2403.11641] Arc2Face: A Foundation Model of Human Faces

2/2
Google presents MagicLens: image retrieval models following open-ended instructions

Outperforms previous SotA but with a 50x smaller model size

proj: https://open-vision-language.github.io/MagicLens/
abs: [2403.19651] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

bnew

1/4
Introducing Branch-Train-miX (BTX)
BTX improves a generalist LLM on multiple fronts:
- Train expert LLMs in parallel for new skills in domains such as math, code & world knowledge
- Join (mix) them together & finetune as a Mixture-of-Experts
(1/4)
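
A hedged sketch of the "mix" step (illustrative, not the paper's code): the feed-forward blocks of the domain experts become the experts of an MoE layer with a freshly initialized router, the remaining weights are merged, and the joint model is then finetuned.

```python
# Hedged sketch of the Branch-Train-miX "mix" step: expert FFNs become the MoE
# experts and a router is learned during the joint finetune. Non-FFN weights
# (attention, embeddings) would be merged, e.g. averaged, outside this layer.
import torch
import torch.nn as nn

class MoEFFN(nn.Module):
    def __init__(self, expert_ffns, dim, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)      # FFNs copied from each domain expert
        self.router = nn.Linear(dim, len(expert_ffns)) # learned during MoE finetuning
        self.top_k = top_k

    def forward(self, x):                              # x: (tokens, dim)
        logits = self.router(x)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                  # tokens routed to expert e at slot k
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out
```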

2/4
BTX brings +18.8 points on math reasoning, +13.2 points on coding, +3.6 points on world knowledge compared to Llama-2 7B, and outperforms both specialized LLMs (e.g. CodeLLAMA) and a larger model (Llama-2 13B) in terms of average performance.
(2/4)

3/4
Compared to alternative methods, BTX has the best compute efficiency.
It generalizes two existing methods: Branch-Train-Merge (which doesn't finetune the joint model as a MoE) & Mixture-of-Experts (which does) -- and achieves the best of both worlds.
(3/4)

4/4
Lots of other results in the paper, including ablations & routing analysis:
- Compare different routing methods & expert combinations. Load balance is important.
- Analysis of impact of math & code experts
- Experts can be frozen in MoE finetuning w/o hurting performance
(4/4)

bnew

1/7
Announcing our new speculative decoding framework Sequoia
It can now serve Llama2-70B on one RTX4090 with half-second/token latency (exact, no approximation)

Sounds slow as a sloth?

Fun fact:
DeepSpeed -> 5.3s / token;
8 x A100: 25ms / token (costs 8 x $18,000 = $140,000+ but an RTX4090 is $1000+)

You can serve with your 2080Ti too! Curious how? Check it out
Website: Sequoia
Paper: [2402.12374] Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Code: GitHub - Infini-AI-Lab/Sequoia: scalable and robust tree-based speculative decoding algorithm
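
For intuition, here is a hedged sketch of plain chain speculative decoding with greedy verification, which is the primitive Sequoia builds on; Sequoia itself speculates over a tree of drafts with a hardware-aware budget and a robust verification scheme. `draft_lm` and `target_lm` are placeholders that return next-token logits.

```python
# Hedged sketch of chain speculative decoding (greedy verification), to convey the
# idea Sequoia extends to trees. Both models map token ids (B, T) -> logits (B, T, V).
import torch

@torch.no_grad()
def speculative_step(draft_lm, target_lm, prefix: torch.Tensor, k: int = 4):
    L = prefix.shape[1]
    # 1) Draft k tokens autoregressively with the small model.
    draft = prefix.clone()
    for _ in range(k):
        nxt = draft_lm(draft)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)
    # 2) Score the whole drafted sequence with ONE pass of the large model.
    target_pred = target_lm(draft).argmax(-1)          # target's greedy choice at every position
    proposed = draft[:, L:]                            # the k drafted tokens
    check = target_pred[:, L - 1:L - 1 + k]            # target's choice at the same slots
    # 3) Accept the longest agreeing prefix, then append one token from the target itself.
    n_accept = int((proposed[0] == check[0]).long().cumprod(0).sum())
    bonus = target_pred[:, L - 1 + n_accept:L + n_accept]
    return torch.cat([prefix, proposed[:, :n_accept], bonus], dim=-1)
```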

2/7
Three key advantages make Sequoia outstanding:
1) Scalable: possible to leverage large speculation budgets, adapting to hardware development trends;
2) Robust: suitable for commercial serving to accommodate various LLM applications;
3) Hardware-Aware: automatically adapts to…

3/7
Sequoia helps mitigate the bandwidth gaps across the memory hierarchy (SRAM, HBM, RAM, SSD ...) with smart algorithms, opening new opportunities for AI accelerator design!

@SambaNovaAI @MOFFET_AI @GroqInc @etchedai @graphcoreai @AMD @intel @Apple @Qualcomm

4/7
Kudos to my student @chenzhuoming911 and awesome collaborators @avnermay, Ruslan Svirschevski, Yuhsun Huang, @m_ryabinin, @JiaZhihao. Thanks @togethercompute for all the support!

5/7
Haha that's right! Maybe SSD is the next step?

6/7
Hahah okok it should work similarly on 3090 but we just don’t have them

7/7
This is the exact 70B (no quantization is used :smile:)

bnew

1/4
Thanks
@_akhaliq for sharing our work.

More results on our project page: Magic Fixup

You can directly edit any image however you like, and the model will make your edit photorealistic!
In this thread, I'll explain how we did it.

1/

2/4
How do you collect training data? Our insight is that videos carry rich information about how everything in the real world interacts with its surroundings.

But how do we use videos to supervise a photo editing task?

2/

3/4
We sample two frames from each video: a reference frame and a target frame. We compute the optical flow and warp the reference to match the target. The warped image represents the "coarse user edit", and the target frame is our ground truth!

3/
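
A hedged sketch of building one such training pair with off-the-shelf dense optical flow (OpenCV Farneback here, purely for illustration; the actual pipeline may use a different flow estimator):

```python
# Hedged sketch of the training-pair recipe: estimate flow between two video
# frames, warp the reference toward the target ("coarse edit"), and use the real
# target frame as ground truth.
import cv2
import numpy as np

def make_training_pair(reference_bgr: np.ndarray, target_bgr: np.ndarray):
    ref_gray = cv2.cvtColor(reference_bgr, cv2.COLOR_BGR2GRAY)
    tgt_gray = cv2.cvtColor(target_bgr, cv2.COLOR_BGR2GRAY)
    # Dense flow from target to reference, so sampling at (x, y) + flow pulls
    # reference pixels into the target's geometry (backward warping).
    flow = cv2.calcOpticalFlowFarneback(tgt_gray, ref_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = ref_gray.shape
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    coarse_edit = cv2.remap(reference_bgr, map_x, map_y, cv2.INTER_LINEAR)
    return coarse_edit, target_bgr                  # (network input, ground truth)
```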

4/4
By training on videos, we can correct for second-order artifacts, like reflections and shadows, and even depth of field!

4/

bnew

1/5
sDPO

Don't Use Your Data All at Once

As development of large language models (LLMs) progresses, aligning them with human preferences has become increasingly important. We propose stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization

2/5
(DPO) for alignment tuning. This approach involves dividing the available preference datasets and utilizing them in a stepwise manner, rather than employing them all at once. We demonstrate that this method facilitates the use of more precisely aligned reference models

3/5
within the DPO training framework. Furthermore, sDPO trains the final model to be more performant, even outperforming other popular LLMs with more parameters.

4/5
paper page:

5/5
daily papers:

1/1
[CL] sDPO: Don't Use Your Data All at Once
D Kim, Y Kim, W Song, H Kim, Y Kim, S Kim, C Park [Upstage AI] (2024)

- They propose stepwise DPO (sDPO), which divides available preference datasets and uses them step-by-step, instead of all at once like conventional DPO.

- Using the aligned model from the previous step as the reference model results in a stricter lower bound for the current step. This facilitates better alignment tuning.

- Empirically, sDPO leads to more performant final aligned models compared to DPO.

- Adopting arbitrary open source models as references can be dangerous due to potential dataset overlaps.
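
A hedged sketch of the stepwise schedule described above; `run_dpo` is a hypothetical stand-in for any DPO trainer call (e.g. one epoch of DPO training) and is not an API from the paper:

```python
# Hedged sketch of the sDPO schedule: partition the preference data and, at each
# step, use the previously aligned model as the frozen DPO reference model.
import copy

def sdpo(base_model, preference_chunks, run_dpo):
    policy = base_model
    reference = copy.deepcopy(base_model)          # step 1 reference = initial SFT model
    for chunk in preference_chunks:
        policy = run_dpo(policy, reference, chunk) # DPO on this chunk only
        reference = copy.deepcopy(policy)          # next step's (stricter) reference
    return policy
```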

bnew

1/8
LITA

Language Instructed Temporal-Localization Assistant

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction following capabilities. However, an important

2/8
missing piece is temporal localization. These models cannot accurately answer the "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings

3/8
by proposing Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to

4/8
capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset,

5/8
ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires both the reasoning and temporal localization of Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union

6/8
(mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding.
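
As a rough illustration of the time-token idea (the token format and bin count here are assumptions, not the paper's exact vocabulary): a timestamp is discretized relative to the video length, so the same small set of tokens covers videos of any duration.

```python
# Hedged sketch of relative time tokens: a timestamp maps to one of N discrete
# tokens based on its position relative to the video length.
def timestamp_to_token(t_seconds: float, video_length: float, num_time_tokens: int = 100) -> str:
    frac = min(max(t_seconds / video_length, 0.0), 1.0)
    k = min(int(frac * num_time_tokens), num_time_tokens - 1)
    return f"<t{k}>"

def token_to_timestamp(token: str, video_length: float, num_time_tokens: int = 100) -> float:
    k = int(token.strip("<t>"))
    return (k + 0.5) / num_time_tokens * video_length   # midpoint of the k-th time bin

# e.g. timestamp_to_token(30.0, 120.0) -> "<t25>"
```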

7/8
paper page:

8/8
daily papers:
 