bnew



Active Reasoning in an Open-World Environment

abs: [2311.02018] Active Reasoning in an Open-World Environment
pdf: https://arxiv.org/pdf/2311.02018.pdf
site: Conan

Yet, most models operate passively, responding to questions based on pre-stored knowledge. In stark contrast, humans possess the ability to actively explore, accumulate, and reason using both newfound and existing information to tackle incomplete-information questions. In response to this gap, we introduce:

- Conan: an interactive open-world environment devised for the assessment of active reasoning
- facilitates active exploration and promotes multi-round abductive inference, reminiscent of rich, open-world settings like Minecraft
- compels agents to actively interact with their surroundings, amalgamating new evidence with prior knowledge to elucidate events from incomplete observations
- underscores the shortcomings of contemporary state-of-the-art models in active exploration and understanding complex scenarios







[Submitted on 3 Nov 2023]

Active Reasoning in an Open-World Environment​

Manjie Xu, Guangyuan Jiang, Wei Liang, Chi Zhang, Yixin Zhu
Recent advances in vision-language learning have achieved notable success on complete-information question-answering datasets through the integration of extensive world knowledge. Yet, most models operate passively, responding to questions based on pre-stored knowledge. In stark contrast, humans possess the ability to actively explore, accumulate, and reason using both newfound and existing information to tackle incomplete-information questions. In response to this gap, we introduce Conan, an interactive open-world environment devised for the assessment of active reasoning. Conan facilitates active exploration and promotes multi-round abductive inference, reminiscent of rich, open-world settings like Minecraft. Diverging from previous works that lean primarily on single-round deduction via instruction following, Conan compels agents to actively interact with their surroundings, amalgamating new evidence with prior knowledge to elucidate events from incomplete observations. Our analysis on Conan underscores the shortcomings of contemporary state-of-the-art models in active exploration and understanding complex scenarios. Additionally, we explore Abduction from Deduction, where agents harness Bayesian rules to recast the challenge of abduction as a deductive process. Through Conan, we aim to galvanize advancements in active reasoning and set the stage for the next generation of artificial intelligence agents adept at dynamically engaging in environments.
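
The "Abduction from Deduction" idea is, at its core, an application of Bayes' rule: instead of scoring an explanation directly, the agent scores how well each candidate explanation would deductively predict the evidence it has gathered. A minimal sketch of that recasting, in our own notation rather than the paper's:

```latex
% h = candidate explanation (abductive hypothesis), e = evidence gathered by exploration.
% Abduction (infer h from e) is recast as deduction (predict e from each h) plus a prior:
P(h \mid e) \;\propto\; \underbrace{P(e \mid h)}_{\text{deductive: does } h \text{ explain } e\text{?}} \;\cdot\; \underbrace{P(h)}_{\text{prior knowledge}}
```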
Comments: Accepted to NeurIPS 2023
Subjects: Artificial Intelligence (cs.AI); Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2311.02018 [cs.AI]
(or arXiv:2311.02018v1 [cs.AI] for this version)

Submission history

From: Manjie Xu
[v1] Fri, 3 Nov 2023 16:24:34 UTC (14,505 KB)




 

bnew




BABILong: a long-context needle-in-a-haystack benchmark for LLMs​

Preprint is on arXiv: "In Search of Needles in a 10M Haystack: Recurrent Memory Finds What LLMs Miss"

bAbI + Books = BABILong​

BABILong is a novel generative benchmark for evaluating the performance of NLP models in processing arbitrarily long documents with distributed facts.

Solving tasks with a long context requires the model to distinguish important information from large amounts of irrelevant detail. To simulate this setting, we "hide" the sentences of the original task between sentences of irrelevant text. We use the bAbI dataset [1] as facts and PG19 as background text. The resulting test samples can reach lengths of millions of tokens.
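
A minimal sketch of how such a sample could be assembled, assuming the bAbI facts and the PG19 background text are already split into sentences; the function and variable names below are illustrative, not the benchmark's actual code:

```python
import random

def make_babilong_sample(facts, question, background_sentences, n_background):
    """Hide bAbI fact sentences among irrelevant background sentences.

    facts                -- ordered bAbI facts, e.g. ["Mary travelled to the office."]
    question             -- the bAbI question, e.g. "Where is Mary?"
    background_sentences -- sentences from a long PG19 book excerpt (the "haystack")
    n_background         -- number of background sentences, controls the sample length
    """
    filler = list(background_sentences[:n_background])
    # Pick random but order-preserving insertion points so the facts end up
    # scattered across the whole context instead of clustered together.
    slots = sorted(random.sample(range(len(filler) + 1), len(facts)))
    context, prev = [], 0
    for fact, slot in zip(facts, slots):
        context.extend(filler[prev:slot])
        context.append(fact)
        prev = slot
    context.extend(filler[prev:])
    return " ".join(context) + "\n\nQuestion: " + question

# Example: a 'qa1'-style sample padded with 1,000 irrelevant sentences.
sample = make_babilong_sample(
    facts=["Mary travelled to the office.", "John went to the kitchen."],
    question="Where is Mary?",
    background_sentences=[f"Filler sentence number {i}." for i in range(1000)],
    n_background=1000,
)
```

Growing n_background (or swapping in longer book excerpts) scales the sample to millions of tokens while the number of relevant facts stays fixed.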


BABILong consists of 20 tasks designed to evaluate basic aspects of reasoning. The bAbI tasks are generated by simulating a set of characters and objects engaged in various movements and interactions with each other across multiple locations. Each interaction is represented by a fact, e.g. "Mary travelled to the office", and the task is to answer a question using the facts from the current simulation, for instance, "Where is Mary?". The bAbI tasks vary in the number of facts, question complexity, and the aspects of reasoning involved.

First ten tasks of BABILong​

| Task | Name | Min facts per task | Max facts per task |
| --- | --- | --- | --- |
| qa1 | single supporting fact | 2 | 10 |
| qa2 | two supporting facts | 2 | 68 |
| qa3 | three supporting facts | 4 | 320 |
| qa4 | two arg relations | 2 | 2 |
| qa5 | three arg relations | 2 | 126 |
| qa6 | yes-no questions | 2 | 26 |
| qa7 | counting | 2 | 52 |
| qa8 | lists-sets | 2 | 50 |
| qa9 | simple negation | 2 | 10 |
| qa10 | indefinite knowledge | 2 | 10 |

Play with dataset​

BABILong notebook (Open in Colab)

Preliminary evaluation results​

GPT-4 fails to solve needle-in-a-haystack tasks for 75% of its available context window


Every row shows accuracy (%) on the corresponding BABILong task ('qa1'-'qa5') and every column corresponds to the task size submitted to GPT-4-Turbo with a 128K context window. All values are averages over 25 samples.

Mistral performance scales only for some tasks and quickly degrades for the majority of others as the context grows


Every row shows accuracy (%) on the corresponding BABILong task ('qa1'-'qa10') and every column corresponds to the task size submitted to Mistral-Medium with a 32K context window. All values are averages over 25 samples.

Fine-tuning GPT-3.5 improves retrieval of supporting facts at medium context sizes


Every row shows accuracy (%) for GPT-3.5 before and after fine-tuning via the API with 100 samples on the 'qa1' task. Every column corresponds to the task size. All values are averages over 25 samples.

Retrieval augmentation does not help solve the needle-in-a-haystack QA task


(A) Retrieval does the job when the embedding granularity matches the fact size. The figure shows recall@5 of the retrieval (RAG) component on the 'qa1' task at the given size, for sentence-level chunks (sent) and 512-token chunks (tok).

(B) Accuracy (%) of GPT-4-based RAG. All values are averages over 50 samples.
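
For context, a recall@5 score like the one in panel (A) can be computed along the following lines, assuming the retriever returns ranked chunks and the gold supporting fact is known; the helper names here are placeholders, not the benchmark's code:

```python
def recall_at_k(retrieved_chunks, gold_fact, k=5):
    """1.0 if the gold supporting fact appears in any of the top-k chunks, else 0.0."""
    return float(any(gold_fact in chunk for chunk in retrieved_chunks[:k]))

def mean_recall_at_k(examples, retrieve, k=5):
    """examples: iterable of (question, gold_fact, context) triples.
    retrieve(question, context) -> chunks ranked by embedding similarity."""
    scores = [recall_at_k(retrieve(q, ctx), fact, k) for q, fact, ctx in examples]
    return sum(scores) / len(scores)
```

The sentence-vs-512-token comparison above amounts to changing only how the context is chunked before embedding: sentence-sized chunks match the size of a single fact, while 512-token chunks dilute it.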
 

bnew





Career update: I am co-founding a new research group called "GEAR" at NVIDIA, with my long-time friend and collaborator Prof. @yukez. GEAR stands for Generalist Embodied Agent Research.

We believe in a future where every machine that moves will be autonomous, and robots and simulated agents will be as ubiquitous as iPhones. We are building the Foundation Agent — a generally capable AI that learns to act skillfully in many worlds, virtual and real.

2024 is the Year of Robotics, the Year of Gaming AI, and the Year of Simulation. We are setting out on a moon-landing mission, and getting there will spin off mountains of learnings and breakthroughs.

Join us on the journey:



 

bnew






PALO

A Polyglot Large Multimodal Model for 5B People

In pursuit of more inclusive Vision-Language Models (VLMs), this study introduces a Large Multilingual Multimodal Model called Palo. Palo offers visual reasoning capabilities in 10 major languages, including English, Chinese, Hindi, Spanish, French, Arabic, Bengali, Russian, Urdu, and Japanese, that span a total of ~5B people (65% of the world population). Our approach uses a semi-automated translation pipeline to adapt the multimodal instruction dataset from English to the target languages with a fine-tuned Large Language Model, ensuring high linguistic fidelity while allowing scalability due to minimal manual effort. The incorporation of diverse instruction sets helps us boost overall performance across multiple languages, especially those that are underrepresented, like Hindi, Arabic, Bengali, and Urdu. The resulting models are trained across three scales (1.7B, 7B, and 13B parameters) to show generalization and scalability, where we observe substantial improvements compared to strong baselines. We also propose the first multilingual multimodal benchmark for forthcoming approaches to evaluate their vision-language reasoning capabilities across languages.




 

bnew



Here is my selection of papers for today (23 Feb) on Hugging Face

GeneOH Diffusion: Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion

CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation

BeTAIL: Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay

Linear Transformers are Versatile In-Context Learners

Beyond A*: Better Planning with Transformers via Search Dynamics Bootstrapping

GaussianPro: 3D Gaussian Splatting with Progressive Propagation

Consolidating Attention Features for Multi-view Image Editing

PALO: A Polyglot Large Multimodal Model for 5B People

Scaling Up LLM Reviews for Google Ads Content Moderation

OmniPred: Language Models as Universal Regressors

Subobject-level Image Tokenization

TinyLLaVA: A Framework of Small-scale Large Multimodal Models

Copilot Evaluation Harness: Evaluating LLM-Guided Software Programming

AgentScope: A Flexible yet Robust Multi-Agent Platform

OpenCodeInterpreter: Integrating Code Generation with Execution and Refinement

MVD^2: Efficient Multiview 3D Reconstruction for Multiview Diffusion

T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching

LexC-Gen: Generating Data for Extremely Low-Resource Languages with Large Language Models and Bilingual Lexicons

Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

 

bnew






Project Page: Magic-Me


Unlike common text-to-video models (like OpenAI's Sora), this model generates personalized videos from photos of your friends, family, or pets. By training an embedding with these images, it creates custom videos featuring your loved ones, bringing a unique touch to your memories.

News Update: We have deployed our model on Hugging Face's GPU platform, making it available for immediate use. Check it out on Hugging Face Spaces.

Magic-Me: Identity-Specific Video Customized Diffusion​

Ze Ma*, Daquan Zhou* †, Chun-Hsiao Yeh, Xue-She Wang, Xiuyu Li, Huanrui Yang, Zhen Dong †, Kurt Keutzer, Jiashi Feng (*Joint First Author, † Corresponding Author)

We propose a new framework for video generation with customized identity. With a pre-trained ID token, the user can generate any video clip with the specified identity. We propose a series of controllable video generation and editing methods; the first release includes Video Customized Diffusion (VCD). It includes three novel components that are essential for high-quality ID preservation: 1) an ID module trained with the cropped identity via prompt-to-segmentation to disentangle the ID information from background noise for more accurate ID token learning; 2) a text-to-video (T2V) VCD module with a 3D Gaussian Noise Prior for better inter-frame consistency; and 3) video-to-video (V2V) Face VCD and Tiled VCD modules to deblur the face and upscale the video for higher resolution.

Video Customization Diffusion Model Pipeline
 

bnew






Snap Video

Scaled Spatiotemporal Transformers for Text-to-Video Synthesis

Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages, the research community repurposes them to generate videos. Since video content is highly redundant, we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity, visual quality and impairs scalability. In this work, we build Snap Video, a video-first model that systematically addresses these challenges. To do that, we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second, we show that a U-Net - a workhorse behind image generation - scales poorly when generating videos, requiring significant computational overhead. Hence, we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is ~4.5x faster at inference). This allows us to efficiently train a text-to-video model with billions of parameters for the first time, reach state-of-the-art results on a number of benchmarks, and generate videos with substantially higher quality, temporal consistency, and motion complexity. User studies showed that our model was favored by a large margin over the most recent methods.


 

bnew



CyberDemo

Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation

We introduce CyberDemo, a novel approach to robotic imitation learning that leverages simulated human demonstrations for real-world tasks. By incorporating extensive data augmentation in a simulated environment, CyberDemo outperforms traditional in-domain real-world demonstrations when transferred to the real world, handling diverse physical and visual conditions. Despite its affordability and convenience in data collection, CyberDemo outperforms baseline methods in terms of success rates across various tasks and exhibits generalizability to previously unseen objects. For example, it can rotate novel tetra-valve and penta-valve objects, despite the human demonstrations involving only tri-valves. Our research demonstrates the significant potential of simulated human demonstrations for real-world dexterous manipulation tasks.

 

bnew



GeneOH Diffusion

Towards Generalizable Hand-Object Interaction Denoising via Denoising Diffusion

We tackle the challenging problem of denoising hand-object interactions (HOI). Given an erroneous interaction sequence, the objective is to refine the incorrect hand trajectory to remove interaction artifacts and produce a perceptually realistic sequence. This challenge involves intricate interaction noise, including unnatural hand poses and incorrect hand-object relations, alongside the necessity for robust generalization to new interactions and diverse noise patterns. We tackle those challenges through a novel approach, GeneOH Diffusion, incorporating two key designs: an innovative contact-centric HOI representation named GeneOH and a new domain-generalizable denoising scheme. The contact-centric representation GeneOH informatively parameterizes the HOI process, facilitating enhanced generalization across various HOI scenarios. The new denoising scheme consists of a canonical denoising model, trained to project noisy data samples from a whitened noise space onto a clean data manifold, and a "denoising via diffusion" strategy that handles input trajectories with various noise patterns by first diffusing them to align with the whitened noise space and then cleaning them via the canonical denoiser. Extensive experiments on four benchmarks with significant domain variations demonstrate the superior effectiveness of our method. GeneOH Diffusion also shows promise for various downstream applications.
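
Reading "denoising via diffusion" as an SDEdit-style cleanup (partially diffuse the corrupted trajectory toward the whitened noise space the canonical denoiser was trained on, then run the learned reverse process) gives a rough sketch like the one below. The model interface and schedule here are assumptions for illustration, not the paper's code:

```python
import torch

def denoise_via_diffusion(noisy_traj, denoiser, alphas_cumprod, t_start):
    """Diffuse an arbitrarily-noised trajectory to step t_start, then denoise it.

    noisy_traj     -- (T, D) hand trajectory with an unknown noise pattern
    denoiser       -- canonical model predicting the noise eps(x_t, t) (placeholder interface)
    alphas_cumprod -- 1-D tensor of cumulative products of the schedule's alphas
    t_start        -- how far to diffuse; larger values absorb heavier input noise
    """
    a_bar = alphas_cumprod[t_start]
    # Forward-diffuse the input so it matches the whitened noise space seen in training.
    x = a_bar.sqrt() * noisy_traj + (1 - a_bar).sqrt() * torch.randn_like(noisy_traj)
    # Deterministic DDIM-style reverse steps back toward the clean data manifold.
    for t in range(t_start, 0, -1):
        a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        eps = denoiser(x, t)
        x0_hat = (x - (1 - a_bar_t).sqrt() * eps) / a_bar_t.sqrt()
        x = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * eps
    return x
```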

 

bnew



BeTAIL

Behavior Transformer Adversarial Imitation Learning from Human Racing Gameplay

Imitation learning learns a policy from demonstrations without requiring hand-designed reward functions. In many robotic tasks, such as autonomous racing, imitated policies must model complex environment dynamics and human decision-making. Sequence modeling is highly effective in capturing intricate patterns of motion sequences but struggles to adapt to new environments or distribution shifts that are common in real-world robotics tasks. In contrast, Adversarial Imitation Learning (AIL) can mitigate this effect, but struggles with sample inefficiency and handling complex motion patterns. Thus, we propose BeTAIL: Behavior Transformer Adversarial Imitation Learning, which combines a Behavior Transformer (BeT) policy learned from human demonstrations with online AIL. BeTAIL adds an AIL residual policy to the BeT policy to model the sequential decision-making process of human experts and correct for out-of-distribution states or shifts in environment dynamics. We test BeTAIL on three challenges with expert-level demonstrations of real human gameplay in Gran Turismo Sport. Our proposed residual BeTAIL reduces environment interactions and improves racing performance and stability, even when the BeT is pretrained on different tracks than those used for downstream learning.
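
The central combination, a frozen Behavior Transformer base action plus a small residual correction trained online with AIL, can be pictured with a sketch like this; the class and dimensions are illustrative, not the authors' implementation:

```python
import torch
import torch.nn as nn

class ResidualPolicy(nn.Module):
    """Adds a bounded residual correction on top of a frozen base (BeT) policy."""

    def __init__(self, base_policy, obs_dim, act_dim, max_residual=0.2):
        super().__init__()
        self.base_policy = base_policy            # pretrained Behavior Transformer, frozen
        for p in self.base_policy.parameters():
            p.requires_grad_(False)
        self.residual = nn.Sequential(            # small head trained online with AIL
            nn.Linear(obs_dim + act_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim), nn.Tanh(),
        )
        self.max_residual = max_residual

    def forward(self, obs):
        base_action = self.base_policy(obs)
        delta = self.residual(torch.cat([obs, base_action], dim=-1))
        return base_action + self.max_residual * delta  # residual kept small
```

Bounding the residual (here via the tanh output and max_residual) keeps the online policy close to the human-like BeT prior while still letting AIL correct for dynamics shifts.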

 

bnew



GaussianPro

3D Gaussian Splatting with Progressive Propagation

The advent of 3D Gaussian Splatting (3DGS) has recently brought about a revolution in the field of neural rendering, facilitating high-quality renderings at real-time speed. However, 3DGS heavily depends on the initialized point cloud produced by Structure-from-Motion (SfM) techniques. When tackling large-scale scenes that unavoidably contain texture-less surfaces, SfM techniques always fail to produce enough points on these surfaces and cannot provide a good initialization for 3DGS. As a result, 3DGS suffers from difficult optimization and low-quality renderings. In this paper, inspired by classical multi-view stereo (MVS) techniques, we propose GaussianPro, a novel method that applies a progressive propagation strategy to guide the densification of the 3D Gaussians. Compared to the simple split and clone strategies used in 3DGS, our method leverages the priors of the existing reconstructed geometries of the scene and patch-matching techniques to produce new Gaussians with accurate positions and orientations. Experiments on both large-scale and small-scale scenes validate the effectiveness of our method, which significantly surpasses 3DGS on the Waymo dataset, exhibiting an improvement of 1.15 dB in PSNR.


 

bnew



✨ Introducing ToDo: Token Downsampling for Efficient Generation of High-Resolution Images! With @Ethan_smith_20 & @aningineer, we present a training-free method that accelerates diffusion inference up to 4.5x while maintaining image fidelity.




ToDo: Token Downsampling for Efficient Generation of High-Resolution Images​

Published on Feb 21
· Featured in Daily Papers on Feb 22
Authors:
Ethan Smith,
Nayan Saxena,
Aninda Saha

Abstract​

The attention mechanism has been crucial for image diffusion models; however, its quadratic computational complexity limits the sizes of images we can process within reasonable time and memory constraints. This paper investigates the importance of dense attention in generative image models, which often contain redundant features, making them suitable for sparser attention mechanisms. We propose ToDo, a novel training-free method that relies on token downsampling of key and value tokens to accelerate Stable Diffusion inference by up to 2x for common sizes and up to 4.5x or more for high resolutions like 2048x2048. We demonstrate that our approach outperforms previous methods in balancing efficient throughput and fidelity.
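
The key/value token downsampling it describes can be approximated as follows: queries stay at full resolution while keys and values are pooled over the 2D latent token grid before standard attention. This is an illustrative sketch of the idea (using average pooling and PyTorch 2.x), not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def downsampled_attention(q, k, v, h, w, factor=2):
    """Attention with spatially downsampled keys/values (ToDo-style sketch).

    q, k, v -- (B, N, C) tokens from a diffusion attention layer, where N = h * w
    h, w    -- spatial size of the latent token grid
    factor  -- downsampling factor for keys/values (queries stay full resolution)
    """
    B, N, C = k.shape
    # Reshape tokens to a 2D grid and average-pool keys and values.
    k2d = k.transpose(1, 2).reshape(B, C, h, w)
    v2d = v.transpose(1, 2).reshape(B, C, h, w)
    k_ds = F.avg_pool2d(k2d, factor).flatten(2).transpose(1, 2)  # (B, N / factor**2, C)
    v_ds = F.avg_pool2d(v2d, factor).flatten(2).transpose(1, 2)
    # Standard scaled dot-product attention with full-resolution queries:
    # cost drops from O(N^2) to O(N * N / factor**2).
    return F.scaled_dot_product_attention(q, k_ds, v_ds)
```

Because the queries are untouched, the output keeps one token per spatial position; only the attention cost shrinks, roughly by a factor of factor squared.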
 