bnew

1/1
AgentStudio: A Toolkit for Building General Virtual Agents

Presents an online, realistic, and multimodal toolkit that covers the entire lifecycle of agent development, including environment setups, data collection, agent evaluation, and visualization

proj: Welcome to AgentStudio! — AgentStudio
abs: [2403.17918] AgentStudio: A Toolkit for Building General Virtual Agents
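
For readers new to virtual agents, the lifecycle above boils down to an observe-act-record loop between an agent and an environment. The sketch below only illustrates that loop; DummyEnv, EchoAgent, and run_episode are made-up names for illustration, not AgentStudio's actual API (see the project docs for the real interfaces).

```python
# Minimal sketch of the observe-act-record loop a virtual-agent toolkit manages.
# DummyEnv, EchoAgent, and run_episode are made-up names, not AgentStudio's API.

class DummyEnv:
    """Stand-in for a real GUI/OS environment that returns text observations."""
    def __init__(self, task: str):
        self.task = task
        self.step_count = 0

    def observe(self) -> str:
        return f"screen state {self.step_count} for task: {self.task}"

    def act(self, action: str) -> bool:
        self.step_count += 1
        return self.step_count >= 3  # pretend the task finishes after 3 actions


class EchoAgent:
    """Stand-in for an LLM/VLM policy that maps observations to actions."""
    def decide(self, observation: str) -> str:
        return "click('next')"


def run_episode(env: DummyEnv, agent: EchoAgent) -> list[dict]:
    """Run one episode and record the trajectory for later evaluation/visualization."""
    trajectory, done = [], False
    while not done:
        obs = env.observe()
        action = agent.decide(obs)
        done = env.act(action)
        trajectory.append({"observation": obs, "action": action, "done": done})
    return trajectory


if __name__ == "__main__":
    traj = run_episode(DummyEnv("open the settings panel"), EchoAgent())
    print(f"collected {len(traj)} steps")  # -> collected 3 steps
```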

bnew

1/2
Human Image Personalization with High-fidelity Identity Preservation

Presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt

proj: FlashFace
abs: [2403.17008] FlashFace: Human Image Personalization with High-fidelity Identity Preservation

2/2
Google presents MagicLens: image retrieval models following open-ended instructions

Outperforms previous SotA but with a 50x smaller model size

proj: https://open-vision-language.github.io/MagicLens/
abs: [2403.19651] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions
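
At a high level, instruction-conditioned retrieval means embedding the query image together with a free-form instruction and ranking an image index by similarity. A minimal sketch of that setup is below; embed_image and embed_query are random placeholders standing in for the real encoders, not MagicLens's interface.

```python
# Sketch of instruction-conditioned retrieval: embed (query image, instruction)
# jointly, then rank index images by cosine similarity. The embed_* functions
# are random placeholders, not MagicLens's actual encoders.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64

def embed_image(image_id: str) -> np.ndarray:
    # Placeholder: a real system would run a vision encoder here.
    return rng.standard_normal(DIM)

def embed_query(image_id: str, instruction: str) -> np.ndarray:
    # Placeholder: a real system fuses the image and the instruction into one vector.
    return rng.standard_normal(DIM)

# Build an index of candidate images.
corpus = [f"img_{i:04d}" for i in range(1000)]
index = np.stack([embed_image(c) for c in corpus])
index /= np.linalg.norm(index, axis=1, keepdims=True)

# Retrieve with an open-ended instruction.
q = embed_query("img_query", "same landmark, but photographed at night")
q /= np.linalg.norm(q)
scores = index @ q
top_k = np.argsort(-scores)[:5]
print([corpus[i] for i in top_k])
```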

bnew

1/2
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance

Effectively optimizes the training mixture of a 1B model trained for 100B tokens, reaching performance comparable to a model trained for 48% more steps on the default mixture

repo: GitHub - yegcjs/mixinglaws
abs: [2403.16952] Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
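
The recipe, roughly: measure validation loss on a few small proxy runs with different mixture proportions, fit a parametric "mixing law" that predicts loss from the proportions, then pick the mixture the fitted law predicts is best. A toy sketch under that reading is below; the exponential form and every number are illustrative assumptions, not the paper's fitted law or results.

```python
# Sketch of the recipe: fit a parametric "mixing law" to losses from a few small
# proxy runs, then pick the mixture it predicts is best. The exponential form and
# all numbers below are illustrative assumptions, not the paper's fitted law.
import numpy as np
from scipy.optimize import curve_fit

# Validation losses from small proxy runs at a few (web, code, books) mixtures.
mixtures = np.array([
    [0.8, 0.1, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.2, 0.3],
    [0.4, 0.4, 0.2],
    [0.3, 0.5, 0.2],
    [0.2, 0.2, 0.6],
])
losses = np.array([2.95, 2.88, 2.86, 2.89, 2.92, 3.01])

def mixing_law(r, c, k, t1, t2, t3):
    # Assumed form: loss = c + k * exp(t . r), with r the mixture proportions.
    return c + k * np.exp(r @ np.array([t1, t2, t3]))

params, _ = curve_fit(mixing_law, mixtures, losses,
                      p0=[2.5, 0.3, 0.1, 0.1, 0.1], maxfev=20000)

# Grid-search the simplex for the mixture with the lowest predicted loss.
grid = np.array([[a, b, 1.0 - a - b]
                 for a in np.linspace(0.0, 1.0, 21)
                 for b in np.linspace(0.0, 1.0 - a, 21)])
best = grid[np.argmin(mixing_law(grid, *params))]
print("predicted-best mixture (web, code, books):", np.round(best, 2))
```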

2/2
Does anyone know of a paper that compares the performance of LLMs with the user prompt at the top vs. the bottom of the user input (e.g., as in this image)?

It appears that putting the prompt at the top typically works better, but it would be an interesting problem if no paper has been written on it yet.

When processing a text: prompt before it or after it?
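
For concreteness, the two layouts being compared are simply:

```python
# The two orderings in question: instruction before the document vs. after it.
document = "...long article text..."
question = "Summarize the key findings in three bullet points."

prompt_instruction_first = f"{question}\n\n{document}"   # prompt at the top
prompt_instruction_last = f"{document}\n\n{question}"    # prompt at the bottom

print(prompt_instruction_first[:80])
print(prompt_instruction_last[:80])
```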

bnew

1/8
LLMs can use complex instructions - why can’t retrieval models?

We build FollowIR, a training/test set of real-world human retrieval instructions. Our FollowIR-7B is the best IR model for instruction following, even beating @cohere and @openai retrievers

[2403.15246] FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions

2/8
We build FollowIR from three TREC datasets which have instructions for humans to judge doc relevance. If humans can do it, why not models?

We alter these instructions (and the rel. doc set) and see how models change their outputs. We measure w/ p-MRR, a new pairwise eval metric
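
The paper defines p-MRR precisely; as rough intuition only, the pairwise idea is to rank the same documents under the original and the altered instruction and score how much the affected documents move. The toy function below captures that intuition and is not the paper's actual p-MRR formula.

```python
# Illustrative only: a generic "did the ranking react to the altered instruction?"
# score. This is a stand-in for the pairwise idea, NOT the paper's exact p-MRR.
def rank_of(doc_id: str, ranking: list[str]) -> int:
    return ranking.index(doc_id) + 1  # 1-based rank

def mean_rank_shift(ranking_original: list[str],
                    ranking_altered: list[str],
                    docs_made_irrelevant: list[str]) -> float:
    """Average rank change of docs the altered instruction made irrelevant.
    Positive = the model pushed them down, i.e. it reacted to the instruction."""
    shifts = [rank_of(d, ranking_altered) - rank_of(d, ranking_original)
              for d in docs_made_irrelevant]
    return sum(shifts) / len(shifts)

# Toy example: "d2" is relevant under the original instruction but not the altered one.
original = ["d2", "d1", "d3", "d4"]
altered = ["d1", "d3", "d2", "d4"]
print(mean_rank_shift(original, altered, ["d2"]))  # 2.0 -> the model demoted d2
```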

3/8
We find that IR models generally fail to follow instructions, even those that have been trained with them. Only 3B+ models and LLMs without retrieval training are successful.

We show this is because existing IR models use their instructions as keywords instead of as instructions

4/8
Our new model FollowIR-7B has the highest score for following instructions, as well as gains in standard metrics

If you’re interested, check out our eval code: we extend MTEB (@muennighoff, @nils_reimers), making it easy to evaluate your MTEB model by changing only a few lines
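
For reference, the standard MTEB evaluation pattern the thread refers to looks like the snippet below. "SciFact" is just a stock MTEB retrieval task used so the snippet runs as-is; swap in the FollowIR instruction tasks from their repo (their exact task names are not shown here).

```python
# Standard MTEB evaluation pattern: wrap an embedding model and run it on tasks.
# "SciFact" is a stock MTEB retrieval task; replace it with the FollowIR tasks
# from github.com/orionw/FollowIR.
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-base-v2")   # any embedding model to test
evaluation = MTEB(tasks=["SciFact"])                 # replace with FollowIR tasks
evaluation.run(model, output_folder="results/e5-base-v2")
```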

5/8
This is a collaboration that started with my great internship mentors from @allen_ai / @SemanticScholar, including @soldni, @kylelostat, and @armancohan, as well as @macavaney at @GlasgowCS, and my advisors at @jhuclsp, @ben_vandurme and @lawrie_dawn

6/8
**Links**
Github: https://github.com/orionw/FollowIR
Model: https://huggingface.co/jhu-clsp/FollowIR-7B
Paper: https://arxiv.org/abs/2403.15246
Data:

7/8
Separately, @hanseok_oh had a similar idea to evaluate instructions in IR, but with a different approach to data creation. Definitely check it out also!

8/8
Can retrievers follow instructions, including your intentions and preferences?
Introducing INSTRUCTIR, a benchmark for evaluating instruction following in information retrieval. [1/N]

bnew

1/6
Are larger vision models always necessary?

We find scaling on **image scales** (e.g., 224->448->672) is usually better than scaling on model size (e.g., Base->Large->Giant).

With one line of code, improve any vision model for Multimodal LLMs or various vision and robotic tasks!

2/6
We enable any vision model to extract multi-scale features by splitting the large-scale image into regular-sized crops, processing them separately, merging the features together, and pooling the result back to the regular size.

3/6
This simple approach has the advantage of adding no additional parameters (so there is no need to re-train anything) and of keeping the same number of output vision tokens (thus keeping the same input length for MLLMs).
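
Putting 2/6 and 3/6 together, a minimal sketch of the crop → process → merge → pool recipe is below, written against a generic feature extractor. It is an illustration of the idea, not the actual wrapper from the authors' repo.

```python
# Minimal sketch of the crop -> process -> merge -> pool recipe described above,
# written against a generic patch-feature extractor. Illustration only, not the
# actual wrapper API from the scaling_on_scales repo.
import torch
import torch.nn.functional as F

def multiscale_features(backbone, image, base=224, scale=2):
    """backbone: maps a (B, C, base, base) image to a (B, D, h, w) feature map."""
    # 1) Features at the regular scale.
    feat_base = backbone(F.interpolate(image, size=(base, base), mode="bilinear"))
    b, d, h, w = feat_base.shape

    # 2) Upsample the image, split it into regular-sized crops, run them separately.
    big = F.interpolate(image, size=(base * scale, base * scale), mode="bilinear")
    crops = [big[:, :, i * base:(i + 1) * base, j * base:(j + 1) * base]
             for i in range(scale) for j in range(scale)]
    crop_feats = [backbone(c) for c in crops]

    # 3) Merge the crop features back into one large feature map...
    rows = [torch.cat(crop_feats[i * scale:(i + 1) * scale], dim=-1) for i in range(scale)]
    feat_large = torch.cat(rows, dim=-2)               # (B, D, h*scale, w*scale)

    # 4) ...and pool it back to the regular size, so the token count is unchanged.
    feat_large = F.adaptive_avg_pool2d(feat_large, (h, w))

    # Concatenate scales along the channel dimension: no new parameters are needed.
    return torch.cat([feat_base, feat_large], dim=1)   # (B, 2D, h, w)

if __name__ == "__main__":
    # Toy backbone: a conv layer standing in for a real vision encoder.
    toy = torch.nn.Conv2d(3, 16, kernel_size=16, stride=16)
    out = multiscale_features(toy, torch.randn(1, 3, 224, 224))
    print(out.shape)  # torch.Size([1, 32, 14, 14])
```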

4/6
We find that keeping the same model size and using larger image scales is usually comparable to or better than using larger models. Comparison on image classification, semantic segmentation, and depth estimation:

5/6
Similar trends for MLLMs:

6/6
And robotic tasks too!

bnew


1/1
When Do We Not Need Larger Vision Models?

repo: GitHub - bfshi/scaling_on_scales
abs: [2403.13043] When Do We Not Need Larger Vision Models?

**Abstract:**

In this work, we explore the necessity of larger models for enhanced visual understanding. Our findings suggest that scaling along the dimension of image scale, termed **Scaling on Scales (S2)**, rather than increasing model size, generally leads to superior performance across a diverse range of downstream tasks.

**Key Findings:**

1. Smaller models employing S2 can capture most of the insights learned by larger models.
2. Pre-training smaller models with S2 can level the playing field with larger models, and in some cases, surpass them.

**Implications for Future Research:**

S2 introduces several considerations for future work:

- **Scale-Selective Processing:** Not all scales at every position within an image hold valuable information. Depending on the content of the image and the overarching task, it can be more efficient to process certain scales for specific regions. This approach mimics the bottom-up and top-down selection mechanisms found in human visual attention (References: 86, 59, 33).

- **Parallel Processing of a Single Image:** Unlike traditional Vision Transformers (ViT) where the entire image is processed in unison, S2 allows each sub-image to be handled independently. This capability facilitates parallel processing of different segments of a single image, which is particularly advantageous in scenarios where reducing latency in processing large images is paramount (Reference: 84).

bnew

1/4
Potentially the biggest paradigm shift in LLMs

Two independent studies managed to pre-train 1.58-bit LLMs that match the performance of FP16 models.

Need to see how it scales (~30B), but super curious about 1.58-bit Mamba and MoE models.
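
The 1.58 figure comes from ternary weights: each weight is one of {-1, 0, +1}, i.e. log2(3) ≈ 1.585 bits of information per weight. Below is a sketch of absmean-style ternary quantization in that spirit; the actual papers quantize during training with a straight-through estimator, which is omitted here.

```python
# Why "1.58-bit": ternary weights take one of 3 values {-1, 0, +1}, and
# log2(3) ~= 1.585 bits per weight. Sketch of absmean-style ternary quantization
# in the spirit of BitNet b1.58; the training-time straight-through estimator
# used in the papers is not shown.
import math
import torch

print(math.log2(3))  # ~1.585 bits per ternary weight

def ternary_quantize(w: torch.Tensor, eps: float = 1e-5):
    """Scale by the mean absolute value, then round each weight to -1, 0, or +1."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale  # dequantized weights ~= w_q * scale

w = torch.randn(4, 8)
w_q, scale = ternary_quantize(w)
print(w_q.unique())                    # subset of {-1., 0., 1.}
print((w_q * scale - w).abs().mean())  # quantization error
```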


2/4
Not yet, no access to my computer today. I find it quite exciting, lots of potential in optimizing both software and hardware here

3/4
Yes it would fit and also be a lot faster

4/4
I'm not an expert in vision transformers but I would say no, it should work the same

bnew

1/1
DepthFM: Fast Monocular Depth Estimation with Flow Matching

Achieves significantly faster inference speed with minimal performance sacrifices

proj: DepthFM: Fast Monocular Depth Estimation with Flow Matching
abs: [2403.13788] DepthFM: Fast Monocular Depth Estimation with Flow Matching
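
The speed comes from the sampling side of flow matching: the model learns a velocity field, and inference is just integrating an ODE from noise toward the prediction, so a handful of Euler steps can be enough. The toy velocity field below is a stand-in to show the integration loop, not DepthFM's trained network.

```python
# Why flow matching is fast at inference: sampling integrates an ODE
# dx/dt = v(x, t) from t=0 (noise) to t=1 (prediction), so a few Euler steps can
# suffice. The constant-target "velocity field" here is a toy stand-in, not
# DepthFM's trained network.
import torch

def velocity_field(x: torch.Tensor, t: float, target: torch.Tensor) -> torch.Tensor:
    # Toy field whose straight-line flow transports any x toward `target` by t=1.
    return (target - x) / max(1.0 - t, 1e-3)

def sample(target: torch.Tensor, num_steps: int = 4) -> torch.Tensor:
    x = torch.randn_like(target)                     # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_field(x, t, target)    # one Euler step
    return x

depth_like = torch.rand(1, 1, 64, 64)                # pretend this is a depth map
estimate = sample(depth_like, num_steps=4)
print((estimate - depth_like).abs().mean())          # ~0 after only 4 steps
```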

