bnew

1/7
What if we could show a robot how to do a task?

We present Vid2Robot, a robot policy trained to decode human intent from visual cues in a prompt video and translate it into actions in its own environment.

Website: Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers
Arxiv: [2403.12943] Vid2Robot: End-to-end Video-conditioned Policy Learning with Cross-Attention Transformers

(1/n)

2/7
The prompt video shows a human moving a green jalapeno chip bag near a Coke can. Note that the table and camera viewpoint differ from those of the robot's current setup. The policy needs to identify the task and also figure out how to act in its current environment.

2/

3/7
When using human videos as prompts, we find that Vid2Robot outperforms BC-Z, even when both are trained on the same data.

3/

4/7
We investigate two emergent capabilities of video-conditioned policies:
(i) How effectively can the motion shown in a prompt video be performed on another object in the robot’s view?
(ii) How can we chain the prompt videos for longer horizon tasks?

4/

5/7
For Cross-Object Motion Transfer, Vid2Robot can perform the motion shown with one object on other objects, and in many cases it does so much better than BC-Z.

5/

6/7
Vid2Robot can also solve some long-horizon tasks by chaining a prompt video for each step.

6/

7/7
How do we train this model?

Vid2Robot uses cross-attention to fuse features from the human prompt video with the robot's current state, generating actions relevant to the observed task.

7/
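
To make "cross-attention to fuse prompt-video features with the robot's current state" concrete, here is a minimal PyTorch sketch of that fusion step. The module names, token shapes, and the action head are illustrative assumptions based only on the description above, not the actual Vid2Robot implementation.

# Minimal sketch: robot-state tokens attend to prompt-video tokens, then an
# action is predicted from the fused representation. Shapes are assumptions.
import torch
import torch.nn as nn

class VideoConditionedPolicySketch(nn.Module):
    def __init__(self, dim=256, heads=8, action_dim=7):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.action_head = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, action_dim)
        )

    def forward(self, robot_tokens, video_tokens):
        # robot_tokens: (B, Nr, dim) from the current camera image + state
        # video_tokens: (B, Nv, dim) from the human prompt video
        fused, _ = self.cross_attn(query=robot_tokens,
                                   key=video_tokens,
                                   value=video_tokens)
        fused = self.norm(robot_tokens + fused)        # residual connection
        return self.action_head(fused.mean(dim=1))     # (B, action_dim)

policy = VideoConditionedPolicySketch()
action = policy(torch.randn(1, 16, 256), torch.randn(1, 64, 256))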
 

bnew

1/5
Today, we’re excited to announce RAG 2.0, our end-to-end system for developing production-grade AI.

Using RAG 2.0, we’ve created Contextual Language Models (CLMs), which achieve state-of-the-art performance on a variety of industry benchmarks. CLMs outperform strong RAG baselines built using GPT-4 and top open-source models like Mixtral, according to our research and customers.

Read more in our blog post: Introducing RAG 2.0 - Contextual AI

2/5
A typical RAG system today stitches together a frozen off-the-shelf embedding model, a vector database for retrieval, and a black-box language model for generation. The components technically work, but the whole is suboptimal, and far from production-grade.

RAG 2.0…
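
For reference, the "stitched-together" baseline described above looks roughly like the sketch below: a frozen embedder, a vector index, and a black-box LLM glued together only at inference time, with nothing trained jointly. embed() and generate() are placeholders for the frozen off-the-shelf components, not Contextual AI's code.

# Sketch of a conventional frozen-components RAG pipeline (the baseline that
# RAG 2.0 argues against). Nothing here is optimized end to end.
import numpy as np

def embed(texts):                      # placeholder: frozen embedding model
    rng = np.random.default_rng(abs(hash(tuple(texts))) % (2**32))
    return rng.normal(size=(len(texts), 384))

def generate(prompt):                  # placeholder: black-box LLM call
    return "[LLM answer conditioned on]\n" + prompt[:200]

docs = ["Doc about revenue recognition.", "Doc about GPU clusters."]
doc_vecs = embed(docs)                                   # 1) index documents

def answer(question, k=1):
    q = embed([question])[0]
    sims = doc_vecs @ q / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q))
    top = np.argsort(-sims)[:k]                          # 2) vector retrieval
    context = "\n".join(docs[i] for i in top)
    return generate("Context:\n" + context + "\n\nQuestion: " + question)  # 3) generate

print(answer("How is revenue recognized?"))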

3/5
Our first set of RAG 2.0 models, Contextual Language Models (CLMs), significantly improve performance over current systems across axes critical for enterprise work: open-domain question answering, faithfulness, and freshness.

4/5
CLMs achieve even bigger gains over current approaches when applied to real-world data, as we have seen with our early customers. We see this in finance (see FinanceBench below as a proxy), as well as in other highly specialized domains like law and hardware engineering.

5/5
We’re thrilled about the results we’re already seeing with RAG 2.0 and can’t wait to bring it to more leading enterprises. F500s and unicorns are already building on RAG 2.0 today, leveraging our CLMs and latest fine-tuning and alignment techniques to deploy generative AI they…

bnew

1/7
I'm excited to share our latest fMRI-to-image paper

MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data!

arXiv: [2403.11207] MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data
project page: MindEye2: Shared-Subject Models Enable fMRI-To-Image With 1 Hour of Data

2/7
Our model only needs 1 hour of training data – compare this to past work that used 30-40 hours of training data per person.

3/7
We do this by pretraining a latent space shared across multiple people. So for new subjects, less data is needed.

Our pipeline is otherwise quite similar to MindEye1 (and simpler): we generate CLIP embeddings from fMRI, which are passed into an "unCLIP" model to get the reconstruction.
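
A toy sketch of that shared-subject idea: each subject gets only a thin subject-specific linear adapter mapping their voxels into a shared latent space, and everything downstream (latent -> CLIP embedding) is shared, so a new subject mostly just needs the adapter fit on ~1 hour of data. Dimensions and module names below are assumptions, not the MindEye2 code.

# Toy PyTorch sketch: per-subject linear adapters into a shared latent space,
# followed by a shared backbone that predicts CLIP image embeddings.
import torch
import torch.nn as nn

class SharedSubjectFMRIEncoder(nn.Module):
    def __init__(self, voxel_dims, latent_dim=4096, clip_dim=1280):
        super().__init__()
        # one cheap linear adapter per subject (subjects have different voxel counts)
        self.adapters = nn.ModuleDict({
            subj: nn.Linear(v, latent_dim) for subj, v in voxel_dims.items()
        })
        # shared backbone, pretrained on the other subjects' data
        self.backbone = nn.Sequential(
            nn.LayerNorm(latent_dim), nn.Linear(latent_dim, latent_dim),
            nn.GELU(), nn.Linear(latent_dim, clip_dim),
        )

    def forward(self, voxels, subject_id):
        z = self.adapters[subject_id](voxels)   # subject-specific -> shared space
        return self.backbone(z)                 # predicted CLIP embedding

enc = SharedSubjectFMRIEncoder({"subj01": 15000, "subj02": 14000})
clip_pred = enc(torch.randn(8, 15000), "subj01")  # then fed to an unCLIP model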

4/7
We do train our own unCLIP model based on SDXL, which reconstructs from CLIP embeddings better than the model we used previously (Versatile Diffusion).

5/7
Our approach achieves SOTA with the full 40 hours of training data, and also beats every other approach when limited to just 1 hour. Here is a comparison to previous approaches; full quantitative results are in the preprint.

6/7
Once again this project was led by the amazing
@humanscotti , who officially joined MedARC as our head of neuroimaging back in Nov.

Read his thread here:

7/7
Our MindEye2 preprint is out!

We reconstruct seen images from fMRI brain activity using only 1 hour of training data.

This is possible by first pretraining a shared-subject model using other people's data, then fine-tuning on a held-out subject with only 1 hour of data.

bnew

1/2
Arc2Face: A Foundation Model of Human Faces
Foivos Paraperas Papantoniou, Alexandros Lattas,

Presents a large dataset of high-resolution facial images with consistent ID and intra-class variability, and an ID-conditioned face model trained on it

proj: Arc2Face: A Foundation Model of Human Faces
abs: [2403.11641] Arc2Face: A Foundation Model of Human Faces

2/2
Google presents MagicLens: image retrieval models following open-ended instructions

Outperforms previous SotA but with a 50x smaller model size

proj: https://open-vision-language.github.io/MagicLens/
abs: [2403.19651] MagicLens: Self-Supervised Image Retrieval with Open-Ended Instructions

bnew

1/4
Introducing Branch-Train-miX (BTX)
BTX improves a generalist LLM on multiple fronts:
- Train expert LLMs in parallel for new skills in domains such as math, code & world knowledge
- Join (mix) them together & finetune as a Mixture-of-Experts (rough sketch below)
(1/4)
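
A simplified sketch of the "mix" step under stated assumptions: take feed-forward blocks from the separately branch-trained expert models, drop them into one MoE layer with a freshly initialized router, and then finetune the joint model. This is illustrative only; the paper works on full transformer FFN layers, handles the remaining (non-expert) weights too, and its routing details may differ.

# Sketch: fold independently trained expert FFNs into a Mixture-of-Experts
# layer with a new learned router, ready for joint finetuning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedExpertsLayer(nn.Module):
    def __init__(self, expert_ffns, dim, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(expert_ffns)        # e.g. math/code/wiki FFNs
        self.router = nn.Linear(dim, len(expert_ffns))   # trained from scratch
        self.top_k = top_k

    def forward(self, x):                                # x: (tokens, dim)
        logits = self.router(x)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

dim = 64
experts = [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
           for _ in range(3)]
moe = MixedExpertsLayer(experts, dim)
y = moe(torch.randn(10, dim))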

2/4
BTX brings +18.8 points on math reasoning, +13.2 points on coding, +3.6 points on world knowledge compared to Llama-2 7B, and outperforms both specialized LLMs (e.g. CodeLLAMA) and a larger model (Llama-2 13B) in terms of average performance.
(2/4)

3/4
Compared to alternative methods, BTX has the best compute efficiency.
It generalizes two existing methods: Branch-Train-Merge (which doesn't finetune the merged model as an MoE) & Mixture-of-Experts (which does), and achieves the best of both worlds.
(3/4)

4/4
Lots of other results in the paper, including ablations & routing analysis:
- Compare different routing methods & expert combinations. Load balance is important.
- Analysis of impact of math & code experts
- Experts can be frozen in MoE finetuning w/o hurting performance
(4/4)

bnew

1/7
Announcing our new speculative decoding framework Sequoia
It can now serve Llama2-70B on one RTX 4090 with half-second-per-token latency (exact, no approximation)

Sounds slow as a sloth ???

Fun fact:
DeepSpeed -> 5.3 s/token;
8 x A100 -> 25 ms/token (costs 8 x $18,000 = $144,000, but an RTX 4090 is $1,000+)

You can serve with your 2080Ti too! Curious how? Check it out
Website: https://infini-ai-lab.github.io/Sequoia-Page
Paper: [2402.12374] Sequoia: Scalable, Robust, and Hardware-aware Speculative Decoding
Code: GitHub - Infini-AI-Lab/Sequoia: scalable and robust tree-based speculative decoding algorithm
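
For intuition, here is a bare-bones greedy speculative decoding loop: a small draft model proposes k tokens, the large target model verifies them, and only tokens the target agrees with are kept, so the output is exactly what the target model alone would produce. Sequoia's actual contribution is a scalable, hardware-aware tree of speculations (plus offloading); this toy chain version, with stand-in "models" over an integer vocabulary, just shows the verify-and-accept idea.

# Toy greedy speculative decoding. draft_next / target_next stand in for a
# small draft LM and the large target LM (both decoded by argmax).
def speculative_decode(prefix, draft_next, target_next, k=4, max_new=16):
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        draft, ctx = [], list(out)
        for _ in range(k):                   # 1) draft k tokens cheaply
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        accepted, correction = 0, None       # 2) target verifies each position
        for i in range(k):                   #    (in practice: one batched pass)
            target_tok = target_next(out + draft[:i])
            if target_tok == draft[i]:
                accepted += 1
            else:
                correction = target_tok
                break
        out.extend(draft[:accepted])
        if correction is not None:
            out.append(correction)           # take the target's token instead
        else:
            out.append(target_next(out))     # all k accepted: target adds one more
    return out

draft_next  = lambda ctx: (sum(ctx) + 1) % 7       # imperfect approximation
target_next = lambda ctx: (sum(ctx) * 3 + 1) % 7   # "ground-truth" model
print(speculative_decode([1, 2, 3], draft_next, target_next))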

2/7
Three key advantages make Sequoia outstanding:
1) Scalable: possible to leverage large speculation budgets, adapting to hardware development trends;
2) Robust: suitable for commercial serving to accommodate various LLM applications;
3) Hardware-Aware: automatically adapts to…

3/7
Sequoia helps mitigate the bandwidth gaps across the memory hierarchy (SRAM, HBM, RAM, SSD ...) with smart algorithms, opening new opportunities for AI accelerator design!

@SambaNovaAI @MOFFET_AI @GroqInc @etchedai @graphcoreai @AMD @intel @Apple @Qualcomm

4/7
Kudos to my student @chenzhuoming911 and awesome collaborators @avnermay, Ruslan Svirschevski, Yuhsun Huang, @m_ryabinin, @JiaZhihao. Thanks @togethercompute for all the support!

5/7
Haha that’s right! Maybe ssd is the next step ???

6/7
Hahah okok it should work similarly on 3090 but we just don’t have them

7/7
This is the exact 70B model (no quantization is used :smile:)

bnew

1/4
Thanks
@_akhaliq for sharing our work.

More results on our project page: Magic Fixup

You can directly edit any image however you like, and the model will make your edit photorealistic!
In this thread, I'll explain how we did it.

1/

2/4
How do you collect training data? Our insight is that videos carry rich information about how everything in the real world interacts with its surroundings.

But how do we use videos to supervise a photo editing task?

2/

3/4
We sample two frames from each video: a reference frame and a target frame. We compute the optical flow and warp the reference to match the target. The warped image represents the "coarse user edit", and the target frame is our ground truth!

3/
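
A rough sketch of that data-generation step using OpenCV's Farneback flow: estimate flow from the target frame to the reference frame, backward-warp the reference so it roughly aligns with the target, and use (warped reference, target) as the (coarse edit, ground truth) training pair. The paper's actual flow estimator and warping details may differ.

# Sketch: build a (coarse edit, ground truth) training pair from two frames.
import cv2
import numpy as np

def make_training_pair(ref_bgr, tgt_bgr):
    ref_gray = cv2.cvtColor(ref_bgr, cv2.COLOR_BGR2GRAY)
    tgt_gray = cv2.cvtColor(tgt_bgr, cv2.COLOR_BGR2GRAY)
    # flow tells us, for each target pixel, where it came from in the reference
    flow = cv2.calcOpticalFlowFarneback(tgt_gray, ref_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = tgt_gray.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (xs + flow[..., 0]).astype(np.float32)
    map_y = (ys + flow[..., 1]).astype(np.float32)
    # backward-warp the reference into the target's layout = "coarse user edit"
    coarse_edit = cv2.remap(ref_bgr, map_x, map_y, cv2.INTER_LINEAR)
    return coarse_edit, tgt_bgr          # (model input, ground truth)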

4/4
By training on videos, we can correct for second-order artifacts like reflections and shadows, and even depth of field!

4/

bnew

1/5
sDPO

Don't Use Your Data All at Once

As the development of large language models (LLMs) progresses, aligning them with human preferences has become increasingly important. We propose stepwise DPO (sDPO), an extension of the recently popularized direct preference optimization

2/5
(DPO) for alignment tuning. This approach involves dividing the available preference datasets and utilizing them in a stepwise manner, rather than employing them all at once. We demonstrate that this method facilitates the use of more precisely aligned reference models

3/5
within the DPO training framework. Furthermore, sDPO trains the final model to be more performant, even outperforming other popular LLMs with more parameters.
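
In pseudo-PyTorch, the stepwise idea reads roughly as below: split the preference data into chunks and, at each step, run standard DPO while using a frozen copy of the previous step's aligned model as the reference. This is a conceptual sketch, not the authors' code; batch_logprobs is a hypothetical helper that returns summed log-probabilities of the chosen and rejected responses for a batch.

# Conceptual sketch of stepwise DPO (sDPO).
import copy
import torch
import torch.nn.functional as F

def dpo_loss(policy, reference, batch, beta=0.1):
    pol_c, pol_r = batch_logprobs(policy, batch)       # hypothetical helper
    with torch.no_grad():
        ref_c, ref_r = batch_logprobs(reference, batch)
    margin = (pol_c - pol_r) - (ref_c - ref_r)
    return -F.logsigmoid(beta * margin).mean()

def sdpo_train(model, preference_chunks, optimizer_fn, epochs=1):
    for chunk in preference_chunks:                    # one chunk per sDPO step
        reference = copy.deepcopy(model).eval()        # previous step's aligned model
        for p in reference.parameters():
            p.requires_grad_(False)
        opt = optimizer_fn(model.parameters())
        for _ in range(epochs):
            for batch in chunk:
                loss = dpo_loss(model, reference, batch)
                opt.zero_grad(); loss.backward(); opt.step()
    return model

# usage: sdpo_train(model, chunks, lambda ps: torch.optim.AdamW(ps, lr=5e-7))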

4/5
paper page:

5/5
daily papers:



1/1
[CL] sDPO: Don't Use Your Data All at Once
D Kim, Y Kim, W Song, H Kim, Y Kim, S Kim, C Park [Upstage AI] (2024)

- They propose stepwise DPO (sDPO), which divides available preference datasets and uses them step-by-step, instead of all at once like conventional DPO.

- Using the aligned model from the previous step as the reference model results in a stricter lower bound for the current step. This facilitates better alignment tuning.

- Empirically, sDPO leads to more performant final aligned models compared to DPO.

- Adopting arbitrary open source models as references can be dangerous due to potential dataset overlaps.

bnew

1/8
LITA

Language Instructed Temporal-Localization Assistant

There has been tremendous progress in multimodal Large Language Models (LLMs). Recent works have extended these models to video input with promising instruction following capabilities. However, an important

2/8
missing piece is temporal localization. These models cannot accurately answer the "When?" questions. We identify three key aspects that limit their temporal localization capabilities: (i) time representation, (ii) architecture, and (iii) data. We address these shortcomings

3/8
by proposing Language Instructed Temporal-Localization Assistant (LITA) with the following features: (1) We introduce time tokens that encode timestamps relative to the video length to better represent time in videos. (2) We introduce SlowFast tokens in the architecture to

4/8
capture temporal information at fine temporal resolution. (3) We emphasize temporal localization data for LITA. In addition to leveraging existing video datasets with timestamps, we propose a new task, Reasoning Temporal Localization (RTL), along with the dataset,

5/8
ActivityNet-RTL, for learning and evaluating this task. Reasoning temporal localization requires both the reasoning and the temporal localization capabilities of Video LLMs. LITA demonstrates strong performance on this challenging task, nearly doubling the temporal mean intersection-over-union

6/8
(mIoU) of baselines. In addition, we show that our emphasis on temporal localization also substantially improves video-based text generation compared to existing Video LLMs, including a 36% relative improvement in Temporal Understanding.
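
The relative time-token idea in (1) is easy to picture in code: quantize a timestamp by the video's length into one of T discrete tokens, so the same small token vocabulary covers videos of any duration. The bin count and token format below are illustrative assumptions, not LITA's exact tokenizer.

# Illustrative encoding/decoding of relative time tokens (T bins per video).
T = 100  # number of time tokens, e.g. <t0> ... <t99>

def encode_time(seconds, video_length):
    # map an absolute timestamp to a token index relative to video length
    frac = min(max(seconds / video_length, 0.0), 1.0)
    return "<t" + str(min(int(frac * T), T - 1)) + ">"

def decode_time(token, video_length):
    idx = int(token.strip("<>").lstrip("t"))
    return (idx + 0.5) / T * video_length   # center of the bin, in seconds

print(encode_time(12.3, 60.0))     # -> "<t20>"
print(decode_time("<t20>", 60.0))  # -> 12.3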

7/8
paper page:

8/8
daily papers:
 

bnew

1/2
Egocentric data contains so much rich information about objects and scenes but dealing with the data is hard! Motion blur, sparse coverage, and dynamics all make it very challenging to reconstruct. Check out our new method for automagically extracting 3D object instance models!

2/2
Thanks
@georgiagkioxari ! See you soon in SoCal?










1/7
EgoLifter

Open-world 3D Segmentation for Egocentric Perception

In this paper we present EgoLifter, a novel system that can automatically segment scenes captured from egocentric sensors into a complete decomposition of individual 3D objects. The system is specifically

2/7
designed for egocentric data where scenes contain hundreds of objects captured from natural (non-scanning) motion. EgoLifter adopts 3D Gaussians as the underlying representation of 3D scenes and objects and uses segmentation masks from the Segment Anything Model (SAM) as

3/7
weak supervision to learn flexible and promptable definitions of object instances free of any specific object taxonomy. To handle the challenge of dynamic objects in ego-centric videos, we design a transient prediction module that learns to filter out dynamic objects in the 3D

4/7
reconstruction. The result is a fully automatic pipeline that is able to reconstruct 3D object instances as collections of 3D Gaussians that collectively compose the entire scene. We created a new benchmark on the Aria Digital Twin dataset that quantitatively demonstrates

5/7
its state-of-the-art performance in open-world 3D segmentation from natural egocentric input. We run EgoLifter on various egocentric activity datasets, which shows the promise of the method for 3D egocentric perception at scale.
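
To make the transient prediction idea concrete, here is one heavily simplified way such filtering can look: a small network predicts a per-pixel probability that a pixel belongs to a transient (moving) object, and that probability downweights the photometric loss so dynamic content doesn't corrupt the static 3D Gaussian reconstruction. This is a conceptual sketch under our own assumptions, not the paper's architecture.

# Conceptual sketch: per-pixel transient weights downweight the photometric loss.
import torch
import torch.nn as nn

class TransientPredictorSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # p(transient) per pixel
        )

    def forward(self, image):            # image: (B, 3, H, W)
        return self.net(image)           # (B, 1, H, W) in [0, 1]

def masked_photometric_loss(rendered, observed, p_transient, reg=0.01):
    w = 1.0 - p_transient                     # trust static pixels more
    recon = (w * (rendered - observed).abs()).mean()
    return recon + reg * p_transient.mean()   # discourage flagging everything transient

pred = TransientPredictorSketch()
obs = torch.rand(1, 3, 64, 64)
loss = masked_photometric_loss(torch.rand(1, 3, 64, 64), obs, pred(obs))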

6/7
paper page:

7/7
daily papers:
 

bnew

1/3
Open Devin: Create any Application with Open Source Devin

Integrating
@ollama &
@GroqInc

How to Install & Setup?
Step by Step Guide
Free & Open-Source
Real-Time Debugging

Subscribe: https://youtube.com/@MervinPraison
YT:

#devin #opendevin #aisoftwareengineer #cognitionai #cognition

2/3
It uses LiteLLM, so you can use their docs to configure LM Studio: OpenAI-Compatible Endpoints | liteLLM

Note: don't use /v1 at the end of the URL. Also try various combinations if it doesn't work.

3/3
Sure will do . Thanks


 

bnew

1/2
New models on EQ-Bench leaderboard: dbrx-instruct, Starling-7b-beta and Qwen1.5-MoE-A2.7B


2/2
I did try benchmarking this one but it wasn't playing nice. I'll wait a bit for the code to mature and try again later.

bnew

1/2
Big announcement from @OpenRouterAI! The new self-reported king of mixture-of-experts models, DBRX 132B, appears to often be better than Mixtral at reasoning and coding. Play with it here!

2/2
Just pushed an important new release, 0.2.1, of my OpenRouter API Ruby gem, including fixes, better test coverage, and support for model fallback (automatic failover!): open_router | RubyGems.org | your community gem host





1/4
I tried getting gpt-4-turbo to generate useful code from openai assistants docs. it failed.

Claude-opus did better, but it's bad at coding.

the new dbrx absolutely spanked the other models.

2/4
I've never seen an open source model even come close to commercial offerings.

Kudos to
@OpenRouterAI and Fireworks - Generative AI For Product Innovation! for giving us all access to new models so fast

3/4
Cool thing is that both commercial models used to be able to do this but got nerfed.

the open model will be reproducible for eternity

4/4
generating code from openai docs




1/2
Breaking! SambaNova releases open-source LLM that demolishes DBRX! Breakneck progress!

2/2
Yes, it tends to be self-contradicting… all of them are.


1/1
That's the exact opposite IMO!

$10M to train a GPT-3.5-level model, whereas it probably cost OAI at least 10-20x more just a year or two ago.

The more we improve as a field thanks to open-source, the cheaper & more efficient it gets to produce the same capabilities. Let's go everyone!





1/2
The world's best open-source chat LLM, DBRX, is now available for free on http://labs.perplexity.ai/. The Perplexity Labs Playground basically has everything you need for chat, for free, with better LLMs (Haiku, DBRX, Sonar) than 3.5-turbo, the model powering free ChatGPT. Curious what people think is better between DBRX and Haiku.

2/2
Soon!