bnew

Veteran
Joined
Nov 1, 2015
Messages
58,200
Reputation
8,613
Daps
161,843

1/1
[CV] Understanding Alignment in Multimodal LLMs: A Comprehensive Study
[2407.02477] Understanding Alignment in Multimodal LLMs: A Comprehensive Study
- This paper examines alignment strategies for Multimodal Large Language Models (MLLMs) to reduce hallucinations and improve visual grounding. It categorizes alignment methods into offline (e.g. DPO) and online (e.g. Online-DPO).

- The paper reviews recently published multimodal preference datasets like POVID, RLHF-V, VLFeedback and analyzes their components: prompts, chosen responses, rejected responses.

- It introduces a new preference data sampling method called Bias-Driven Hallucination Sampling (BDHS) which restricts image access to induce language model bias and trigger hallucinations.

- Experiments align the LLaVA 1.6 model and compare offline, online and mixed DPO strategies. Results show that combining offline and online alignment can yield benefits.

- The proposed BDHS method achieves strong performance without external annotators or preference data, just using self-supervised data.
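For anyone who hasn't seen DPO up close: the offline methods above optimize a simple pairwise loss on (prompt, chosen, rejected) triples. Below is a minimal sketch of the standard per-pair DPO loss in plain Python. It assumes you already have the summed log-probabilities of each response under the model being aligned and under a frozen reference model; the function name and the beta=0.1 default are illustrative choices, not values from the paper.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard per-pair DPO loss (illustrative sketch, not the paper's code).

    Each argument is the summed log-probability of a full response under
    either the policy being aligned or the frozen reference model.
    """
    # How far the policy has moved toward each response relative to the reference.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(z)), written in a numerically stable softplus form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy favours the chosen answer more than the reference does -> loss below log(2).
print(dpo_loss(-10.0, -20.0, -15.0, -15.0))
```

In the offline setting the pairs come from a fixed dataset (e.g. the POVID or RLHF-V preferences mentioned above); in online DPO the pairs are sampled from the current model during training, but the per-pair loss has the same form.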


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/1
🚨SIGGRAPH 2024 Paper Alert 🚨

➡️Paper Title: CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

🌟Few pointers from the paper

🎯In this paper, the authors present “CharacterGen”, a framework for efficiently generating 3D characters. CharacterGen introduces a streamlined generation pipeline along with an image-conditioned multi-view diffusion model.

🎯This model effectively calibrates input poses to a canonical form while retaining key attributes of the input image, thereby addressing the challenges posed by diverse poses. A transformer-based, generalizable sparse-view reconstruction model is the other core component of their approach, facilitating the creation of detailed 3D models from multi-view images.

🎯They also adopted a texture back-projection strategy to produce high-quality texture maps. Additionally, they curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate their model.

🎯 Their approach has been thoroughly evaluated through quantitative and qualitative experiments, showing its proficiency in generating 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation.

🏢Organization: @Tsinghua_Uni , @VastAIResearch

🧙Paper Authors: Hao-Yang Peng, Jia-Peng Zhang, @MengHaoGuo1 , @yanpei_cao , Shi-Min Hu

1️⃣Read the Full Paper here: [2402.17214] CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

2️⃣Project Page: CharacterGen: Efficient 3D Character Generation from Single Images

3️⃣Code: GitHub - zjp-shadow/CharacterGen: [SIGGRAPH'24] CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

🎥 Be sure to watch the attached Video - Sound on 🔊🔊

Music by Grand_Project from @pixabay

Find this Valuable 💎?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

🌟Few pointers from the paper

🎯In this paper, the authors introduce “PointOdyssey”, a large-scale synthetic dataset and data generation framework for training and evaluating long-term fine-grained tracking algorithms.

🎯Their goal was to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. Toward the goal of naturalism, they animated deformable characters using real-world motion capture data, they built 3D scenes to match the motion capture environments, and they rendered camera viewpoints using trajectories mined via structure-from-motion on real videos.

🎯They created combinatorial diversity by randomizing character appearance, motion profiles, materials, lighting, 3D assets, and atmospheric effects. Their dataset currently includes 104 videos, averaging 2,000 frames in length, with orders of magnitude more correspondence annotations than prior work.

🎯They showed that existing methods can be trained from scratch on their dataset and outperform the published variants. Finally, they also introduced modifications to the PIPs point tracking method that greatly widen its temporal receptive field, improving its performance on PointOdyssey as well as on two real-world benchmarks.

🏢Organization: @Stanford

🧙Paper Authors: @yang_zheng18 , @AdamWHarley , @willbokuishen , @GordonWetzstein , Leonidas J. Guibas

1️⃣Read the Full Paper here: [2307.15055] PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

2️⃣Project Page: PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

3️⃣Simulator: GitHub - y-zheng18/point_odyssey: Official code for PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking (ICCV 2023)

4️⃣Model: GitHub - aharley/pips2: PIPs++

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Breakz Studios from @pixabay

 

bnew


1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

🌟Few pointers from the paper

🎯In this paper, the authors introduce a novel problem: egocentric action frame generation. The goal is to synthesize an image depicting an action in the user's context (i.e., an action frame) by conditioning on a user prompt and an input egocentric image.

🎯Notably, existing egocentric action datasets lack the detailed annotations that describe the execution of actions. Additionally, existing diffusion-based image manipulation models are sub-optimal in controlling the state transition of an action in egocentric image pixel space because of the domain gap.

🎯To this end, they proposed to Learn EGOcentric (LEGO) action frame generation via visual instruction tuning. First, they introduced a prompt enhancement scheme to generate enriched action descriptions from a visual large language model (VLLM) by visual instruction tuning.

🎯Then they proposed a novel method to leverage image and text embeddings from the VLLM as additional conditioning to improve the performance of a diffusion model. They validated their model on two egocentric datasets -- Ego4D and Epic-Kitchens.

🎯 Their experiments show substantial improvement over prior image manipulation models in both quantitative and qualitative evaluation. They also conducted detailed ablation studies and analysis to provide insights into their method.

🏢Organization: GenAI, @Meta , @GeorgiaTech , @UofIllinois

🧙Paper Authors: @bryanislucky , Xiaoliang Dai, Lawrence Chen, Guan Pang, @RehgJim , @aptx4869ml

1️⃣Read the Full Paper here: [2312.03849] LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

2️⃣Project Page: LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

3️⃣Code: GitHub - BolinLai/LEGO: This is the official code of LEGO paper.

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by AlexGrohl from @pixabay


#ECCV2024


 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: PWM: Policy Learning with Large World Models

🌟Few pointers from the paper

🎯Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in multi-task settings with different embodiments. World models offer scalability by learning a simulation of the environment, yet they often rely on inefficient gradient-free optimization methods.

🎯In this paper, the authors introduce “Policy learning with large World Models (PWM)”, a novel model-based RL algorithm that learns continuous control policies from large multi-task world models.

🎯By pre-training the world model on offline data and using it for first-order gradient policy learning, PWM effectively solves tasks with up to 152 action dimensions and outperforms methods using ground-truth dynamics.

🎯 Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without the need for expensive online planning.
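The core PWM idea, getting first-order policy gradients by differentiating through a world model instead of relying on gradient-free planning, can be shown on a toy problem. This sketch assumes a hypothetical, known 1-D linear "world model" s' = a·s + b·u and a one-parameter linear policy u = k·s, and carries the derivative forward through the rollout with the chain rule. It illustrates the principle only and has nothing to do with the paper's large multi-task model or training code; all names and constants are made up.

```python
def rollout_return_and_grad(k, s0=1.0, a=1.1, b=0.5, c=0.1, horizon=10):
    """Return J(k) and dJ/dk for the policy u = k*s in a toy linear model.

    The (hypothetical) model s' = a*s + b*u is differentiable, so the exact
    gradient is propagated forward alongside the state (forward-mode chain rule).
    """
    s, ds = s0, 0.0          # state and d(state)/dk; s0 does not depend on k
    ret, dret = 0.0, 0.0
    for _ in range(horizon):
        u = k * s
        du = s + k * ds      # product rule on u = k*s
        s_next = a * s + b * u
        ds_next = a * ds + b * du
        # Reward penalises state magnitude and control effort.
        ret += -(s_next ** 2) - c * u ** 2
        dret += -2.0 * s_next * ds_next - 2.0 * c * u * du
        s, ds = s_next, ds_next
    return ret, dret


def train_policy(k=0.0, lr=0.01, steps=200):
    """Plain gradient ascent on the return using the model-based gradient."""
    for _ in range(steps):
        _, grad = rollout_return_and_grad(k)
        k += lr * grad
    return k


k_star = train_policy()
print(k_star, rollout_return_and_grad(k_star)[0])
```

The open-loop system is unstable (a > 1), and the analytic gradient pushes k toward a stabilising gain within a few steps; a gradient-free method would need many rollouts to estimate the same direction.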

🏢Organization: @GeorgiaTech , @UCSanDiego , @nvidia

🧙Paper Authors: @imgeorgiev , @VarunGiridhar3 , @ncklashansen , @animesh_garg

1️⃣Read the Full Paper here: [2407.02466] PWM: Policy Learning with Large World Models

2️⃣Project Page: PWM: Policy Learning with Large World Models

3️⃣Code: GitHub - imgeorgiev/PWM: PWM: Policy Learning with Large World Models

🎥 Be sure to watch the attached Video - Sound on 🔊🔊

 

bnew


1/1
[CV] OccFusion: Rendering Occluded Humans with Generative Diffusion Priors
[2407.00316] OccFusion: Rendering Occluded Humans with Generative Diffusion Priors
- Most human rendering methods assume humans are fully visible, but occlusions are common in real life. This paper presents OccFusion to render occluded humans using 3D Gaussian splatting supervised by 2D diffusion models.

- The method has three stages: Initialization, Optimization, and Refinement.

- In Initialization, complete human masks are generated from partial visibility masks using diffusion models.

- In Optimization, 3D Gaussians are optimized based on observed regions and pose-conditioned SDS is applied in both posed and canonical space to ensure completeness.

- In Refinement, in-context inpainting is used with coarse renderings to refine appearance.

- The method achieves state-of-the-art efficiency and quality on simulated and real occlusions.


 

bnew


1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: Relightable Neural Actor with Intrinsic Decomposition and Pose Control

🌟Few pointers from the paper

🎯Creating a digital human avatar that is relightable, drivable, and photorealistic is a challenging and important problem in vision and graphics. Humans are highly articulated, creating pose-dependent appearance effects like self-shadows and wrinkles, and skin as well as clothing require complex, spatially varying BRDF models.

🎯While recent human relighting approaches can recover plausible material-light decompositions from multi-view video, they do not generalize to novel poses and still suffer from visual artifacts.

🎯To address this, the authors propose “Relightable Neural Actor”, the first video-based method for learning a photorealistic neural human model that can be relit, allows appearance editing, and can be controlled by arbitrary skeletal poses.

🎯 Importantly, to learn their human avatar, they only require a multi-view recording of the human under a known but static lighting condition. To achieve this, they represent the geometry of the actor with a drivable density field that models pose-dependent clothing deformations and provides a mapping between 3D and UV space, where normal, visibility, and materials are encoded.

🎯 To evaluate their approach in real-world scenarios, authors collected a new dataset with four actors recorded under different light conditions, indoors and outdoors, providing the first benchmark of its kind for human relighting, and demonstrating state-of-the-art relighting results for novel human poses.

🏢Organization: @VcaiMpi , Saarland Informatics Campus, @VIACenterSB , @UniFreiburg

🧙Paper Authors: @DiogoLuvizon , @VGolyanik , @AdamKortylewski , @marc_habermann , Christian Theobalt

1️⃣Read the Full Paper here: [2312.11587] Relightable Neural Actor with Intrinsic Decomposition and Pose Control

2️⃣Project Page: Relightable Neural Actor

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached Video - Sound on 🔊🔊


#ECCV2024


 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

🌟Few pointers from the paper

🎯Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation.

🎯Instead of following mainstream diffusion-based methods, authors of this paper explored and extended the potential of the implicit-keypoint-based framework, which effectively balances computational efficiency and controllability.

🎯 Building upon this, authors developed a video-driven portrait animation framework named LivePortrait with a focus on better generalization, controllability, and efficiency for practical usage.

🎯To enhance generation quality and generalization ability, they scaled up the training data to about 69 million high-quality frames, adopted a mixed image-video training strategy, upgraded the network architecture, and designed better motion transformation and optimization objectives.

🎯Additionally, they discovered that compact implicit keypoints can effectively represent a kind of blendshape, and they propose a stitching module and two retargeting modules, each a small MLP with negligible computational overhead, to enhance controllability.
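LivePortrait keeps the implicit-keypoint formulation it builds on, where posed keypoints are obtained from canonical keypoints plus head pose and an expression offset. Here's a minimal sketch of that transform, assuming the face-vid2vid-style form x = s·(x_c·R + δ) + t with keypoints as row vectors. The function name and argument layout are hypothetical; this is not the KwaiVGI/LivePortrait API, and the stitching/retargeting MLPs are not modelled.

```python
def transform_keypoints(canonical, rotation, expression, scale, translation):
    """Map canonical 3-D keypoints to posed keypoints: x = s * (x_c R + delta) + t.

    canonical, expression: lists of [x, y, z]; rotation: 3x3 nested list.
    Keypoints are row vectors, so x_c R means "row vector times matrix".
    """
    posed = []
    for x_c, delta in zip(canonical, expression):
        # Row vector times matrix: rotated[j] = sum_i x_c[i] * R[i][j]
        rotated = [sum(x_c[i] * rotation[i][j] for i in range(3)) for j in range(3)]
        posed.append([scale * (rotated[j] + delta[j]) + translation[j] for j in range(3)])
    return posed


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
kp = transform_keypoints([[1.0, 0.0, 0.0]], identity, [[0.0, 0.5, 0.0]], 2.0, [1.0, 1.0, 1.0])
print(kp)
```

In a driving setup the rotation, expression offset, scale and translation would come from the driving frame while the canonical keypoints come from the source portrait, which is what keeps identity and motion separable.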

🎯Experimental results demonstrate the efficacy of their framework, even compared to diffusion-based methods. Generation speed reaches 12.8 ms on an RTX 4090 GPU with PyTorch.

🏢Organization: Kuaishou Technology, @ustc , @FudanUni

🧙Paper Authors: Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, Di Zhang

1️⃣Read the Full Paper here: [2407.03168] LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

2️⃣Project Page: Efficient Portrait Animation with Stitching and Retargeting Control

3️⃣Code: GitHub - KwaiVGI/LivePortrait: Make one portrait alive!

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Dmitrii Kolesnikov from @pixabay

 

bnew


1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: Fast View Synthesis of Casual Videos

🌟Few pointers from the paper

🎯Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render.

🎯This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. The authors treat static and dynamic video content separately. Specifically, they build a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video.

🎯Their plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometry. They opt to represent the dynamic content as per-frame point clouds for efficiency.
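To see what "augmented with spherical harmonics" buys you: each texel stores a few SH coefficients per colour channel, and the rendered colour changes with the viewing direction. A minimal degree-1 sketch follows; the degree, the coefficient layout and the sign convention (borrowed from common 3D Gaussian Splatting code) are assumptions, not details taken from this paper.

```python
import math

SH_C0 = 0.28209479177387814  # Y_0^0 = 1 / (2 * sqrt(pi))
SH_C1 = 0.4886025119029199   # |Y_1^m| = sqrt(3) / (2 * sqrt(pi))


def sh_color(coeffs, direction):
    """Evaluate degree-1 real spherical harmonics for one RGB texel.

    coeffs: four RGB triples (one DC term, three degree-1 terms).
    direction: viewing direction; normalised here to a unit vector.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    x, y, z = (d / norm for d in direction)
    # Basis order and signs follow the common 3DGS convention (an assumption).
    basis = [SH_C0, -SH_C1 * y, SH_C1 * z, -SH_C1 * x]
    return [sum(b * c[ch] for b, c in zip(basis, coeffs)) for ch in range(3)]


texel = [[0.8, 0.6, 0.4], [0.0, 0.0, 0.0], [0.2, 0.2, 0.2], [0.0, 0.0, 0.0]]
print(sh_color(texel, (0.0, 0.0, 1.0)), sh_color(texel, (0.0, 0.0, -1.0)))
```

With only the DC term the texel looks the same from every direction; the degree-1 terms add a smooth view-dependent shift, which is cheap enough to evaluate per pixel in real time.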

🎯While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. Therefore, they developed a method to quickly estimate such a hybrid video representation and render novel views in real time.

🎯The authors' experiments show that their method can render high-quality novel views from an in-the-wild video with quality comparable to state-of-the-art methods while being 100× faster in training and enabling real-time rendering.

🏢Organization: @University of Maryland, College Park, @AdobeResearch , @Adobe

🧙Paper Authors: @YaoChihLee , @ZhoutongZhang , Kevin Blackburn Matzen, @simon_niklaus , Jianming Zhang, @jbhuang0604 , Feng Liu

1️⃣Read the Full Paper here: [2312.02135] Fast View Synthesis of Casual Videos

2️⃣Project Page: Fast View Synthesis of Casual Videos

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Pavel Bekirov from @pixabay


#ECCV24


 

bnew


1/1
🚨 Paper Alert 🚨

➡️Paper Title: Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

🌟Few pointers from the paper

🎯With recent advances in video prediction, controllable video generation has been attracting more attention. Generating high fidelity videos according to simple and flexible conditioning is of particular interest.

🎯To this end, authors of this paper have proposed a controllable video generation model using pixel level renderings of 2D or 3D bounding boxes as conditioning.

🎯 In addition, they created a bounding box predictor that, given the bounding boxes of the initial and final frames, can predict up to 15 bounding boxes per frame for all frames in a 25-frame clip.
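Ctrl-V learns its box predictor; the obvious non-learned reference point is straight linear interpolation between the first- and last-frame boxes. The sketch below is that naive baseline, not the paper's model, assuming axis-aligned 2D boxes stored as (x1, y1, x2, y2) and matched across the two frames by a hypothetical track id. It does not enforce the 15-box limit.

```python
def interpolate_boxes(first, last, num_frames=25):
    """Naive baseline: linearly interpolate boxes between the first and last frame.

    first, last: dicts mapping a (hypothetical) track id to a box (x1, y1, x2, y2).
    Objects missing from either end are dropped; no learned motion is involved.
    """
    frames = []
    for f in range(num_frames):
        alpha = f / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        frames.append({
            tid: tuple((1.0 - alpha) * a + alpha * b for a, b in zip(box, last[tid]))
            for tid, box in first.items()
            if tid in last
        })
    return frames


clip = interpolate_boxes({"car": (0.0, 0.0, 10.0, 10.0)}, {"car": (20.0, 20.0, 30.0, 30.0)})
print(clip[12]["car"])
```

A learned predictor is needed precisely where this baseline fails: objects that accelerate, turn, or get occluded between the two keyframes.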

🎯Given the novelty of their problem formulation, there is no existing standard way to evaluate models that seek to predict vehicle video with high fidelity.

🎯The authors therefore present a new benchmark consisting of a particular way of evaluating video generation models using KITTI, Virtual KITTI 2 (vKITTI), and the Berkeley Driving Dataset (BDD 100k).

🏢Organization: @Mila_Quebec

🧙Paper Authors: Ge Ya (Olga) Luo, Zhi Hao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal

1️⃣Read the Full Paper here: [2406.05630] Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

2️⃣Project Page: SOCIAL MEDIA TITLE TAG

3️⃣Code: GitHub - oooolga/Ctrl-V

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Rockot from @pixabay

 

bnew


1/1
🚨 Paper Alert 🚨

➡️Paper Title: HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

🌟Few pointers from the paper

🎯In this paper, the authors introduce “HouseCrafter”, a novel approach that can lift a floorplan into a complete, large 3D indoor scene (e.g., a house).

🎯Their key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene.

🎯Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along locations sampled from the floorplan, where previously generated images are used as conditions for the diffusion model to produce images at nearby locations.

🎯 The global floorplan and attention design in the diffusion model ensure the consistency of the generated images, from which a 3D scene can be reconstructed.
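The autoregressive, batch-wise loop described above can be sketched as a scheduling problem: visit camera locations sampled from the floorplan in batches, and condition each batch on nearby views that already exist. This is an illustrative sketch only; the batch size, the number of conditioning views and the "nearest previously generated location" rule are assumptions, and the actual RGB-D diffusion call is left out.

```python
import math


def plan_generation(locations, batch_size=4, num_conditions=2):
    """Return (batch, condition_locations) pairs in generation order.

    locations: (x, y) camera positions sampled from the floorplan, in visiting
    order. Each batch is conditioned on the previously generated locations
    closest to the batch centroid (an assumed selection rule).
    """
    generated = []
    plan = []
    for start in range(0, len(locations), batch_size):
        batch = locations[start:start + batch_size]
        cx = sum(p[0] for p in batch) / len(batch)
        cy = sum(p[1] for p in batch) / len(batch)
        nearest = sorted(generated, key=lambda p: math.hypot(p[0] - cx, p[1] - cy))
        plan.append((batch, nearest[:num_conditions]))
        # Stand-in for running the diffusion model: these views now exist.
        generated.extend(batch)
    return plan


for batch, conditions in plan_generation([(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)], 2, 1):
    print(batch, "<-", conditions)
```

The first batch has nothing to condition on; every later batch sees only views that were already produced, which is what makes the process autoregressive.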

🎯Through extensive evaluation on the 3D-Front dataset, the authors demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices.

🏢Organization: @Northeastern , @StabilityAI

🧙Paper Authors: Hieu T. Nguyen, Yiwen Chen, @VikramVoleti , @jampani_varun , @HuaizuJiang

1️⃣Read the Full Paper here: [2406.20077] HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

2️⃣Project Page: HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Maksym Dudchyk from @pixabay

 

bnew


1/1
🚨Product Update 🚨

@elevenlabsio has just introduced 🗣️“VOICE ISOLATOR” 🎤

This new feature allows you to extract crystal-clear speech from any audio.

Their vocal remover strips background noise for film, podcast, and interview post-production.

Try it for Free here: Free Voice Isolator and Background Noise Remover | ElevenLabs

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

 

bnew


1/1
🚨 Paper Alert 🚨

➡️Paper Title: EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

🌟Few pointers from the paper

🎯Building effective imitation learning methods that enable robots to learn from limited data and still generalize across diverse real-world environments is a long-standing problem in robot learning.

🎯In this paper, the authors propose “EquiBot”, a robust, data-efficient, and generalizable approach for robot manipulation task learning. Their approach combines SIM(3)-equivariant neural network architectures with diffusion models.

🎯This ensures that their learned policies are invariant to changes in scale, rotation, and translation, enhancing their applicability to unseen environments while retaining the benefits of diffusion-based policy learning, such as multi-modality and robustness.
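The property behind this is SIM(3) equivariance: if you scale, rotate and translate the input point cloud, the output transforms the same way. A quick numerical check makes it concrete. The "policy" below is a hand-written, hypothetical function, not EquiBot's network, and the rotation, scale and offset are arbitrary test values.

```python
import math


def toy_policy(points):
    """Hypothetical SIM(3)-equivariant map from a point cloud to a target point.

    It only uses the centroid and offsets between points, so scaling,
    rotating and translating the input moves the output in the same way.
    """
    n = len(points)
    centroid = [sum(p[i] for p in points) / n for i in range(3)]
    # Target: halfway from the centroid toward the first point.
    return [centroid[i] + 0.5 * (points[0][i] - centroid[i]) for i in range(3)]


def apply_sim3(p, s, R, t):
    """Apply x -> s * R x + t to one 3-D point (R as a 3x3 nested list)."""
    return [s * sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]


theta = math.radians(30.0)
Rz = [[math.cos(theta), -math.sin(theta), 0.0],
      [math.sin(theta), math.cos(theta), 0.0],
      [0.0, 0.0, 1.0]]
cloud = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0], [1.0, 1.0, 1.0]]
s, t = 2.5, [0.3, -1.0, 4.0]

lhs = toy_policy([apply_sim3(p, s, Rz, t) for p in cloud])  # transform, then act
rhs = apply_sim3(toy_policy(cloud), s, Rz, t)               # act, then transform
print(max(abs(a - b) for a, b in zip(lhs, rhs)))  # ~0 up to float rounding
```

An equivariant network gives this guarantee by construction, so a demonstration recorded at one position, orientation and scale does not need to be re-collected at others.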

🎯They showed on a suite of 6 simulation tasks that their proposed method reduces the data requirements and improves generalization to novel scenarios.

🎯In the real world, with 10 variations of 6 mobile manipulation tasks, they showed that their method can easily generalize to novel objects and scenes after learning from just 5 minutes of human demonstrations in each task.

🏢Organization: @Stanford

🧙Paper Authors: @yjy0625 , Zi-ang Cao , @CongyueD , @contactrika , @SongShuran , @leto__jean

1️⃣Read the Full Paper here: [2407.01479] EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

2️⃣Project Page: EquiBot

3️⃣Code: GitHub - yjy0625/equibot: Official implementation for paper "EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning".

🎥 Be sure to watch the attached Video - Sound on 🔊🔊

🎵 Music by Zakhar Valaha from @pixabay

 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

🌟Few pointers from the paper

🎯In this paper, the authors present “Diffusion Forcing”, a new training paradigm in which a diffusion model is trained to denoise a set of tokens with independent per-token noise levels.

🎯They applied Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones.
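The key difference from standard sequence diffusion is that each token gets its own independently sampled noise level instead of one level shared by the whole sequence. A minimal sketch with scalar tokens, assuming a simple cosine schedule over 1000 levels (both illustrative choices, not the paper's configuration); the function names are hypothetical.

```python
import math
import random


def alpha_bar(k, num_levels=1000):
    """Assumed cosine schedule: 1.0 at k=0 (clean) falling to ~0 at k=num_levels."""
    return math.cos((k / num_levels) * math.pi / 2) ** 2


def noise_tokens(tokens, levels, rng):
    """Corrupt each token with its own noise level k_t (scalar tokens for brevity)."""
    noisy = []
    for x, k in zip(tokens, levels):
        a = alpha_bar(k)
        eps = rng.gauss(0.0, 1.0)
        noisy.append(math.sqrt(a) * x + math.sqrt(1.0 - a) * eps)
    return noisy


rng = random.Random(0)
sequence = [0.5, -1.0, 2.0, 0.0]
# Training: sample an independent level per token, not one for the whole sequence.
levels = [rng.randint(0, 1000) for _ in sequence]
noisy_sequence = noise_tokens(sequence, levels, rng)
print(levels, noisy_sequence)
```

Because the model has seen every mix of noise levels during training, at sampling time it can roughly keep already-generated tokens at low noise while future tokens stay noisier, which is what the causal, variable-horizon roll-outs rely on.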

🎯Their approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories.

🎯Their method offers a range of additional capabilities, such as:
⚓ rolling out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge, and
⚓ new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks.

🎯In addition to its empirical success, their method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution.

🏢Organization: @MIT_CSAIL

🧙Paper Authors: @BoyuanChen0 , Diego Marti Monso, @du_yilun , @max_simchowitz , @RussTedrake , @vincesitzmann

1️⃣Read the Full Paper here: [2407.01392] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

2️⃣Project Page: Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

3️⃣Code: GitHub - buoyancy99/diffusion-forcing: code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Nick Valerson from @pixabay

 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

🌟Few pointers from the paper

🎯In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications.

🎯However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology.

🎯In this work, the authors propose a controllable video generation framework, dubbed “MimicMotion”, which can generate high-quality videos of arbitrary length with any motion guidance.

🎯Compared with previous methods, their approach has several highlights.
⚓ Firstly, with confidence-aware pose guidance, temporal smoothness can be achieved, so model robustness can be enhanced with large-scale training data.
⚓ Secondly, regional loss amplification based on pose confidence significantly reduces image distortion.

🎯Lastly, for generating long smooth videos, a progressive latent fusion strategy is proposed. By this means, videos of arbitrary length can be generated with acceptable resource consumption. With extensive experiments and user studies, MimicMotion demonstrates significant improvements over previous approaches in multiple aspects.
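Progressive latent fusion is what lets MimicMotion stitch long videos from overlapping segments. The exact weighting isn't given in this summary, so the sketch below assumes a simple linear cross-fade over the overlapping frames and uses scalar stand-ins for the per-frame latents; treat it as an illustration of the idea rather than the Tencent/MimicMotion implementation.

```python
def fuse_segments(segments, overlap):
    """Merge overlapping segments of per-frame latents with a linear cross-fade.

    segments: list of equal-length lists, one scalar latent per frame; each
    segment shares `overlap` frames with the next one. Inside the overlap the
    newer segment's weight ramps up linearly while the older one's ramps down.
    """
    fused = list(segments[0])
    for seg in segments[1:]:
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)      # weight given to the newer segment
            idx = len(fused) - overlap + i
            fused[idx] = (1.0 - w) * fused[idx] + w * seg[i]
        fused.extend(seg[overlap:])
    return fused


print(fuse_segments([[0.0] * 4, [1.0] * 4], overlap=2))
```

Total length grows by (segment length minus overlap) per extra segment, so memory per denoising pass stays bounded while the video can be made arbitrarily long.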

🏢Organization: @TencentGlobal , @sjtu1896

🧙Paper Authors: Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou

1️⃣Read the Full Paper here: [2406.19680] MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

2️⃣Project Page: SOCIAL MEDIA TITLE TAG

3️⃣Code: GitHub - Tencent/MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Alexander Lisenkov from @pixabay

 