bnew

Veteran
Joined
Nov 1, 2015
Messages
58,206
Reputation
8,623
Daps
161,867

1/2
The study introduces a new guided tree search algorithm aimed at improving LLM performance on mathematical reasoning tasks while reducing computational costs compared to existing methods. By incorporating dynamic node selection, exploration budget calculation,...

2/2
LiteSearch: Efficacious Tree Search for LLM


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196









Computer Science > Computation and Language

[Submitted on 29 Jun 2024]

LiteSearch: Efficacious Tree Search for LLM

Ante Wang, Linfeng Song, Ye Tian, Baolin Peng, Dian Yu, Haitao Mi, Jinsong Su, Dong Yu
Recent research suggests that tree search algorithms (e.g. Monte Carlo Tree Search) can dramatically boost LLM performance on complex mathematical reasoning tasks. However, they often require more than 10 times the computational resources of greedy decoding due to wasteful search strategies, making them difficult to be deployed in practical applications. This study introduces a novel guided tree search algorithm with dynamic node selection and node-level exploration budget (maximum number of children) calculation to tackle this issue. By considering the search progress towards the final answer (history) and the guidance from a value network (future) trained without any step-wise annotations, our algorithm iteratively selects the most promising tree node before expanding it within the boundaries of the allocated computational budget. Experiments conducted on the GSM8K and TabMWP datasets demonstrate that our approach not only offers competitive performance but also enjoys significantly lower computational costs compared to baseline methods.
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI); Machine Learning (cs.LG)
Cite as: arXiv:2407.00320 [cs.CL]
(or arXiv:2407.00320v1 [cs.CL] for this version)

Submission history

From: Linfeng Song [view email]
[v1] Sat, 29 Jun 2024 05:14:04 UTC (640 KB)
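
A minimal sketch of the idea in the abstract, not the authors' implementation: a best-first search that always expands the highest-value open node and gives each node a child budget that shrinks as its value estimate grows. The function names, the toy task, and the budget rule `ceil(max_children * (1 - value))` are all assumptions made for illustration. The paper's selection score also accounts for search progress, which this sketch leaves out.

```python
import heapq
import math
from typing import Callable, List, Optional, Tuple


def guided_tree_search(
    root: int,
    expand: Callable[[int], List[int]],   # hypothetical: proposes candidate children
    value: Callable[[int], float],        # hypothetical: value estimate in [0, 1]
    is_goal: Callable[[int], bool],
    max_children: int = 4,
    max_expansions: int = 50,
) -> Tuple[Optional[int], int]:
    """Best-first search with a per-node cap on the number of children."""
    # Max-heap via negated values; the counter breaks ties in insertion order.
    frontier = [(-value(root), 0, root)]
    counter = 1
    expansions = 0
    while frontier and expansions < max_expansions:
        neg_v, _, node = heapq.heappop(frontier)
        if is_goal(node):
            return node, expansions
        # Assumed budget rule: confident (high-value) nodes get fewer children.
        budget = max(1, math.ceil(max_children * (1.0 + neg_v)))
        expansions += 1
        for child in expand(node)[:budget]:
            heapq.heappush(frontier, (-value(child), counter, child))
            counter += 1
    return None, expansions


# Toy task (an assumption, unrelated to GSM8K): reach 10 from 0 with steps of +1, +2, +3.
TARGET = 10
found, used = guided_tree_search(
    root=0,
    expand=lambda n: [n + 1, n + 2, n + 3],
    value=lambda n: max(0.0, 1.0 - abs(TARGET - n) / TARGET),
    is_goal=lambda n: n == TARGET,
)
print(found, used)
```

In a real setting `expand` would sample next reasoning steps from the LLM and `value` would be the trained value network. Both are stand-ins here.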


 


1/1
[LG] Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Y Sun, X Li, K Dalal, J Xu… [Stanford University & UC San Diego & UC Berkeley] (2024)
[2407.04620] Learning to (Learn at Test Time): RNNs with Expressive Hidden States

- This paper proposes TTT (Test-Time Training) layers, a new class of sequence modeling layers where the hidden state is a model, and the update rule is self-supervised learning.

- The key idea is to make the hidden state a machine learning model itself, and the update rule a gradient step on a self-supervised loss. Updating the hidden state on a test sequence is equivalent to training the model at test time, which is how TTT layers get their name.

- Two simple instantiations are proposed: TTT-Linear, where the hidden state model is a linear model, and TTT-MLP, where it is a 2-layer MLP. These can be integrated into any network architecture like RNN layers.

- TTT layers achieve better perplexity and make better use of long context than Mamba, and have lower latency than Transformers beyond 8k context length.

- Practical innovations like mini-batch TTT and a dual form are proposed to improve hardware efficiency on modern GPUs and TPUs. The dual form allows computing output tokens directly without materializing intermediate variables.

- Theoretical connections are made between TTT layers and existing concepts like attention, fast weights, and meta learning. The outer loop of TTT can be seen as learning a good self-supervised task for the inner loop.
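
To make the "hidden state is a model" point concrete, here is a toy sketch in the spirit of TTT-Linear. It is not the paper's code: the corruption scheme (zeroing every coordinate after the first `keep`), the learning rate, and all function names are assumptions. The hidden state is the weight matrix `W` of a linear model, and each incoming token triggers one gradient step on a self-supervised reconstruction loss.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product on nested lists."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def ttt_linear_scan(
    xs: List[Vector], lr: float = 0.1, keep: int = 2
) -> Tuple[List[Vector], List[float]]:
    """Run a toy TTT-style layer over a sequence; returns outputs and per-step losses."""
    d = len(xs[0])
    W: Matrix = [[0.0] * d for _ in range(d)]  # hidden state = weights of a linear model
    outputs, losses = [], []
    for x in xs:
        # Assumed self-supervised task: reconstruct x from a corrupted view
        # in which every coordinate after the first `keep` is zeroed.
        corrupted = [xi if i < keep else 0.0 for i, xi in enumerate(x)]
        err = [p - t for p, t in zip(matvec(W, corrupted), x)]
        losses.append(sum(e * e for e in err))
        # One gradient step on ||W @ corrupted - x||^2;
        # the gradient with respect to W is 2 * outer(err, corrupted).
        for i in range(d):
            for j in range(d):
                W[i][j] -= lr * 2.0 * err[i] * corrupted[j]
        # Output token: apply the freshly updated hidden-state model.
        outputs.append(matvec(W, x))
    return outputs, losses


# Feeding the same token repeatedly: the inner loss should shrink as W adapts.
sequence = [[1.0, 0.5, -0.5]] * 6
outputs, losses = ttt_linear_scan(sequence)
print([round(loss, 4) for loss in losses])
```

The loop processes one token at a time, so it says nothing about the mini-batch or dual-form tricks mentioned above; those exist precisely because this sequential form is slow on accelerators.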



1/2
[LG] Mixture of A Million Experts
X O He [Google DeepMind] (2024)
[2407.04153] Mixture of A Million Experts

- The feedforward (FFW) layers in standard transformer architectures incur a linear increase in computational costs and activation memory as the hidden layer width grows.

- Sparse mixture-of-experts (MoE) architectures have emerged to address this issue by decoupling model size from computational cost, but are limited to a small number of experts due to computational and optimization challenges.

- This paper introduces PEER (parameter efficient expert retrieval), a novel layer design that utilizes the product key technique for sparse retrieval from a vast pool of tiny experts (over a million).

- Experiments demonstrate that PEER layers outperform dense FFWs and coarse-grained MoEs in terms of the performance-compute trade-off.

- By enabling efficient use of a massive number of experts, PEER unlocks the potential for further scaling of transformer models while maintaining computational efficiency.
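
For anyone wondering how retrieval over a million experts stays cheap, here is a rough sketch of the product key trick on its own, not of the full PEER layer. The query is split in two halves, each half is scored against its own small set of sub-keys, and the top-k experts are then picked from only k × k candidate pairs instead of all n × n. Function names, sizes, and the random data are assumptions for illustration.

```python
import heapq
import random
from itertools import product
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def product_key_topk(
    query: Sequence[float],
    sub_keys_a: List[List[float]],
    sub_keys_b: List[List[float]],
    k: int,
) -> List[Tuple[float, int]]:
    """Return the k best (score, expert_index) pairs over len(a) * len(b) experts."""
    half = len(query) // 2
    scores_a = [dot(query[:half], key) for key in sub_keys_a]
    scores_b = [dot(query[half:], key) for key in sub_keys_b]
    # Top-k per half; the global top-k must lie inside this k x k grid,
    # because an expert's score is the sum of its two half scores.
    top_a = heapq.nlargest(k, range(len(scores_a)), key=scores_a.__getitem__)
    top_b = heapq.nlargest(k, range(len(scores_b)), key=scores_b.__getitem__)
    n_b = len(sub_keys_b)
    candidates = [
        (scores_a[i] + scores_b[j], i * n_b + j) for i, j in product(top_a, top_b)
    ]
    return heapq.nlargest(k, candidates)


# Small demo: 32 x 32 = 1024 experts, checked against brute force.
random.seed(0)
HALF_DIM, N_SUB, K = 4, 32, 5
keys_a = [[random.gauss(0, 1) for _ in range(HALF_DIM)] for _ in range(N_SUB)]
keys_b = [[random.gauss(0, 1) for _ in range(HALF_DIM)] for _ in range(N_SUB)]
query = [random.gauss(0, 1) for _ in range(2 * HALF_DIM)]

top = product_key_topk(query, keys_a, keys_b, K)
brute = heapq.nlargest(
    K,
    (
        (dot(query[:HALF_DIM], ka) + dot(query[HALF_DIM:], kb), i * N_SUB + j)
        for i, ka in enumerate(keys_a)
        for j, kb in enumerate(keys_b)
    ),
)
print(top == brute)
```

With n sub-keys per half, the search scores 2n sub-keys plus k² candidates rather than n² full keys, which is what makes a pool of 1000 × 1000 experts tractable.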

2/2
Dark mode for this paper 🌚 Mixture of A Million Experts



1/1
[CV] Understanding Alignment in Multimodal LLMs: A Comprehensive Study
[2407.02477] Understanding Alignment in Multimodal LLMs: A Comprehensive Study
- This paper examines alignment strategies for Multimodal Large Language Models (MLLMs) to reduce hallucinations and improve visual grounding. It categorizes alignment methods into offline (e.g. DPO) and online (e.g. Online-DPO).

- The paper reviews recently published multimodal preference datasets like POVID, RLHF-V, VLFeedback and analyzes their components: prompts, chosen responses, rejected responses.

- It introduces a new preference data sampling method called Bias-Driven Hallucination Sampling (BDHS) which restricts image access to induce language model bias and trigger hallucinations.

- Experiments align the LLaVA 1.6 model and compare offline, online and mixed DPO strategies. Results show combining offline and online can yield benefits.

- The proposed BDHS method achieves strong performance without external annotators or preference data, just using self-supervised data.
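
Since the whole comparison revolves around DPO, here is a small sketch of the standard per-example DPO loss for one (chosen, rejected) pair. It takes summed log-probabilities as plain floats; the function name, the numbers, and `beta=0.1` are assumptions for illustration, not values from the paper. As I understand it, the offline and online variants differ in where the pairs come from (a fixed dataset versus responses sampled from the current policy), not in this loss.

```python
import math


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """-log(sigmoid(beta * margin)) for a single preference pair."""
    # How much more the policy prefers chosen over rejected than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: margin is 0, so the loss is log(2).
neutral = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response: the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(neutral, improved)
```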



1/1
🚨SIGGRAPH 2024 Paper Alert 🚨

➡️Paper Title: CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

🌟Few pointers from the paper

🎯In this paper, the authors have presented “CharacterGen”, a framework developed to efficiently generate 3D characters. CharacterGen introduces a streamlined generation pipeline along with an image-conditioned multi-view diffusion model.

🎯This model effectively calibrates input poses to a canonical form while retaining key attributes of the input image, thereby addressing the challenges posed by diverse poses. A transformer-based, generalizable sparse-view reconstruction model is the other core component of their approach, facilitating the creation of detailed 3D models from multi-view images.

🎯They also adopted a texture-back-projection strategy to produce a high-quality texture map. Additionally, they have curated a dataset of anime characters, rendered in multiple poses and views, to train and evaluate their model.

🎯Their approach has been thoroughly evaluated through quantitative and qualitative experiments, showing its proficiency in generating 3D characters with high-quality shapes and textures, ready for downstream applications such as rigging and animation.

🏢Organization: @Tsinghua_Uni , @VastAIResearch

🧙Paper Authors: Hao-Yang Peng, Jia-Peng Zhang, @MengHaoGuo1 , @yanpei_cao , Shi-Min Hu

1️⃣Read the Full Paper here: [2402.17214] CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

2️⃣Project Page: CharacterGen: Efficient 3D Character Generation from Single Images

3️⃣Code: GitHub - zjp-shadow/CharacterGen: [SIGGRAPH'24] CharacterGen: Efficient 3D Character Generation from Single Images with Multi-View Pose Canonicalization

🎥 Be sure to watch the attached video, sound on 🔊🔊

Music by Grand_Project from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/1
🚨Paper Alert 🚨

➡️Paper Title: PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

🌟Few pointers from the paper

🎯In this paper, the authors have introduced “PointOdyssey”, a large-scale synthetic dataset and data generation framework for the training and evaluation of long-term fine-grained tracking algorithms.

🎯Their goal was to advance the state of the art by placing emphasis on long videos with naturalistic motion. Toward the goal of naturalism, they animated deformable characters using real-world motion capture data, built 3D scenes to match the motion capture environments, and rendered camera viewpoints using trajectories mined via structure-from-motion on real videos.

🎯They created combinatorial diversity by randomizing character appearance, motion profiles, materials, lighting, 3D assets, and atmospheric effects. Their dataset currently includes 104 videos, averaging 2,000 frames long, with orders of magnitude more correspondence annotations than prior work.

🎯They showed that existing methods can be trained from scratch on their dataset and outperform the published variants. Finally, they also introduced modifications to the PIPs point tracking method, greatly widening its temporal receptive field, which improves its performance on PointOdyssey as well as on two real-world benchmarks.

🏢Organization: @Stanford

🧙Paper Authors: @yang_zheng18 , @AdamWHarley , @willbokuishen , @GordonWetzstein , Leonidas J. Guibas

1️⃣Read the Full Paper here: [2307.15055] PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

2️⃣Project Page: PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking

3️⃣Simulator: GitHub - y-zheng18/point_odyssey: Official code for PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking (ICCV 2023)

4️⃣Model: GitHub - aharley/pips2: PIPs++


1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

🌟Few pointers from the paper

🎯In this paper, the authors have introduced a novel problem: egocentric action frame generation. The goal is to synthesize an image depicting an action in the user's context (i.e., an action frame) by conditioning on a user prompt and an input egocentric image.

🎯Notably, existing egocentric action datasets lack the detailed annotations that describe the execution of actions. Additionally, existing diffusion-based image manipulation models are sub-optimal in controlling the state transition of an action in egocentric image pixel space because of the domain gap.

🎯To this end, they proposed to Learn EGOcentric (LEGO) action frame generation via visual instruction tuning. First, they introduced a prompt enhancement scheme to generate enriched action descriptions from a visual large language model (VLLM) by visual instruction tuning.

🎯Then they proposed a novel method to leverage image and text embeddings from the VLLM as additional conditioning to improve the performance of a diffusion model. They validated their model on two egocentric datasets: Ego4D and Epic-Kitchens.

🎯Their experiments show substantial improvement over prior image manipulation models in both quantitative and qualitative evaluation. They also conducted detailed ablation studies and analysis to provide insights into their method.

🏢Organization: GenAI, @Meta , @GeorgiaTech , @UofIllinois

🧙Paper Authors: @bryanislucky , Xiaoliang Dai, Lawrence Chen, Guan Pang, @RehgJim , @aptx4869ml

1️⃣Read the Full Paper here: [2312.03849] LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

2️⃣Project Page: LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning

3️⃣Code: GitHub - BolinLai/LEGO: This is the official code of LEGO paper.


1/1
🚨Paper Alert 🚨

➡️Paper Title: PWM: Policy Learning with Large World Models

🌟Few pointers from the paper

🎯Reinforcement Learning (RL) has achieved impressive results on complex tasks but struggles in multi-task settings with different embodiments. World models offer scalability by learning a simulation of the environment, yet they often rely on inefficient gradient-free optimization methods.

🎯In this paper, the authors have introduced “Policy learning with large World Models (PWM)”, a novel model-based RL algorithm that learns continuous control policies from large multi-task world models.

🎯By pre-training the world model on offline data and using it for first-order gradient policy learning, PWM effectively solves tasks with up to 152 action dimensions and outperforms methods using ground-truth dynamics.

🎯Additionally, PWM scales to an 80-task setting, achieving up to 27% higher rewards than existing baselines without the need for expensive online planning.

🏢Organization: @GeorgiaTech , @UCSanDiego , @nvidia

🧙Paper Authors: @imgeorgiev , @VarunGiridhar3 , @ncklashansen , @animesh_garg

1️⃣Read the Full Paper here: [2407.02466] PWM: Policy Learning with Large World Models

2️⃣Project Page: PWM: Policy Learning with Large World Models

3️⃣Code: GitHub - imgeorgiev/PWM: PWM: Policy Learning with Large World Models


1/1
[CV] OccFusion: Rendering Occluded Humans with Generative Diffusion Priors
[2407.00316] OccFusion: Rendering Occluded Humans with Generative Diffusion Priors
- Most human rendering methods assume humans are fully visible, but occlusions are common in real life. This paper presents OccFusion to render occluded humans using 3D Gaussian splatting supervised by 2D diffusion models.

- The method has three stages - Initialization, Optimization, and Refinement.

- In Initialization, complete human masks are generated from partial visibility masks using diffusion models.

- In Optimization, 3D Gaussians are optimized based on observed regions and pose-conditioned SDS is applied in both posed and canonical space to ensure completeness.

- In Refinement, in-context inpainting is used with coarse renderings to refine appearance.

- The method achieves state-of-the-art efficiency and quality on simulated and real occlusions.



1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: Relightable Neural Actor with Intrinsic Decomposition and Pose Control

🌟Few pointers from the paper

🎯Creating a digital human avatar that is relightable, drivable, and photorealistic is a challenging and important problem in Vision and Graphics. Humans are highly articulated, creating pose-dependent appearance effects like self-shadows and wrinkles, and skin as well as clothing require complex and space-varying BRDF models.

🎯While recent human relighting approaches can recover plausible material-light decompositions from multi-view video, they do not generalize to novel poses and still suffer from visual artifacts.

🎯To address this, the authors of this paper proposed “Relightable Neural Actor”, the first video-based method for learning a photorealistic neural human model that can be relighted, allows appearance editing, and can be controlled by arbitrary skeletal poses.

🎯Importantly, for learning their human avatar, they solely require a multi-view recording of the human under a known but static lighting condition. To achieve this, they represented the geometry of the actor with a drivable density field that models pose-dependent clothing deformations and provides a mapping between 3D and UV space, where normal, visibility, and materials are encoded.

🎯To evaluate their approach in real-world scenarios, the authors collected a new dataset with four actors recorded under different light conditions, indoors and outdoors, providing the first benchmark of its kind for human relighting, and demonstrating state-of-the-art relighting results for novel human poses.

🏢Organization: @VcaiMpi , Saarland Informatics Campus, @VIACenterSB , @UniFreiburg

🧙Paper Authors: @DiogoLuvizon , @VGolyanik , @AdamKortylewski , @marc_habermann , Christian Theobalt

1️⃣Read the Full Paper here: [2312.11587] Relightable Neural Actor with Intrinsic Decomposition and Pose Control

2️⃣Project Page: Relightable Neural Actor

3️⃣Code: Coming 🔜


1/1
🚨Paper Alert 🚨

➡️Paper Title: LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

🌟Few pointers from the paper

🎯Portrait Animation aims to synthesize a lifelike video from a single source image, using it as an appearance reference, with motion (i.e., facial expressions and head pose) derived from a driving video, audio, text, or generation.

🎯Instead of following mainstream diffusion-based methods, the authors of this paper explored and extended the potential of the implicit-keypoint-based framework, which effectively balances computational efficiency and controllability.

🎯Building upon this, the authors developed a video-driven portrait animation framework named LivePortrait with a focus on better generalization, controllability, and efficiency for practical usage.

🎯To enhance the generation quality and generalization ability, they scaled up the training data to about 69 million high-quality frames, adopted a mixed image-video training strategy, upgraded the network architecture, and designed better motion transformation and optimization objectives.

🎯Additionally, they discovered that compact implicit keypoints can effectively represent a kind of blendshapes, and proposed a stitching module and two retargeting modules, which use a small MLP with negligible computational overhead, to enhance controllability.

🎯Experimental results demonstrate the efficacy of their framework even compared to diffusion-based methods. The generation speed reaches 12.8 ms on an RTX 4090 GPU with PyTorch.
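
Quick back-of-the-envelope conversion of that speed figure, assuming the 12.8 ms is a per-frame latency (the post does not say so explicitly). The function name and the 30 FPS comparison point are my own choices, not from the paper.

```python
def frames_per_second(latency_ms: float) -> float:
    """Convert a per-frame latency in milliseconds to frames per second."""
    return 1000.0 / latency_ms


fps = frames_per_second(12.8)        # reported latency, assumed to be per frame
realtime_budget_ms = 1000.0 / 30.0   # time available per frame at 30 FPS
print(f"{fps:.1f} FPS, {realtime_budget_ms / 12.8:.1f}x headroom over 30 FPS")
```

Under that assumption the model runs at roughly 78 frames per second, comfortably above typical real-time video rates.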

🏒Organization: Kuaishou Technology, @ustc , @FudanUni

πŸ§™Paper Authors: Jianzhu Guo, Dingyun Zhang, Xiaoqiang Liu, Zhizhou Zhong, Yuan Zhang, Pengfei Wan, Di Zhang

1️⃣Read the Full Paper here: [2407.03168] LivePortrait: Efficient Portrait Animation with Stitching and Retargeting Control

2️⃣Project Page: Efficient Portrait Animation with Stitching and Retargeting Control

3️⃣Code: GitHub - KwaiVGI/LivePortrait: Make one portrait alive!


1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: Fast View Synthesis of Casual Videos

🌟Few pointers from the paper

🎯Novel view synthesis from an in-the-wild video is difficult due to challenges like scene dynamics and lack of parallax. While existing methods have shown promising results with implicit neural radiance fields, they are slow to train and render.

🎯This paper revisits explicit video representations to synthesize high-quality novel views from a monocular video efficiently. The authors treat static and dynamic video content separately. Specifically, they have built a global static scene model using an extended plane-based scene representation to synthesize temporally coherent novel video.

🎯Their plane-based scene representation is augmented with spherical harmonics and displacement maps to capture view-dependent effects and model non-planar complex surface geometry. They opt to represent the dynamic content as per-frame point clouds for efficiency.

🎯While such representations are inconsistency-prone, minor temporal inconsistencies are perceptually masked due to motion. Therefore, they developed a method to quickly estimate such a hybrid video representation and render novel views in real time.

🎯The authors' experiments showed that their method can render high-quality novel views from an in-the-wild video with comparable quality to state-of-the-art methods while being 100× faster in training and enabling real-time rendering.

🏢Organization: University of Maryland, College Park, @AdobeResearch , @Adobe

🧙Paper Authors: @YaoChihLee , @ZhoutongZhang , Kevin Blackburn Matzen, @simon_niklaus , Jianming Zhang, @jbhuang0604 , Feng Liu

1️⃣Read the Full Paper here: [2312.02135] Fast View Synthesis of Casual Videos

2️⃣Project Page: Fast View Synthesis of Casual Videos


1/1
🚨 Paper Alert 🚨

➡️Paper Title: Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

🌟Few pointers from the paper

🎯With recent advances in video prediction, controllable video generation has been attracting more attention. Generating high-fidelity videos according to simple and flexible conditioning is of particular interest.

🎯To this end, the authors of this paper have proposed a controllable video generation model using pixel-level renderings of 2D or 3D bounding boxes as conditioning.

🎯In addition, they also created a bounding box predictor that, given the bounding boxes of the initial and ending frames, can predict up to 15 bounding boxes per frame for all the frames in a 25-frame clip.

🎯Given the novelty of their problem formulation, there is no existing standard way to evaluate models that seek to predict vehicle video with high fidelity.

🎯The authors therefore present a new benchmark consisting of a particular way of evaluating video generation models using the KITTI, Virtual KITTI 2 (vKITTI), and Berkeley Driving Dataset (BDD 100k) datasets.

🏢Organization: @Mila_Quebec

🧙Paper Authors: Ge Ya (Olga) Luo, Zhi Hao Luo, Anthony Gosselin, Alexia Jolicoeur-Martineau, Christopher Pal

1️⃣Read the Full Paper here: [2406.05630] Ctrl-V: Higher Fidelity Video Generation with Bounding-Box Controlled Object Motion

2️⃣Code: GitHub - oooolga/Ctrl-V


1/1
🚨 Paper Alert 🚨

➡️Paper Title: HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

🌟Few pointers from the paper

🎯In this paper, the authors have introduced “HouseCrafter”, a novel approach that can lift a floorplan into a complete large 3D indoor scene (e.g., a house).

🎯Their key insight is to adapt a 2D diffusion model, which is trained on web-scale images, to generate consistent multi-view color (RGB) and depth (D) images across different locations of the scene.

🎯Specifically, the RGB-D images are generated autoregressively in a batch-wise manner along sampled locations based on the floorplan, where previously generated images are used as conditions for the diffusion model to produce images at nearby locations.

🎯The global floorplan and attention design in the diffusion model ensure the consistency of the generated images, from which a 3D scene can be reconstructed.

🎯Through extensive evaluation on the 3D-Front dataset, the authors demonstrate that HouseCrafter can generate high-quality house-scale 3D scenes. Ablation studies also validate the effectiveness of different design choices.

🏢Organization: @Northeastern , @StabilityAI

🧙Paper Authors: Hieu T. Nguyen, Yiwen Chen, @VikramVoleti , @jampani_varun , @HuaizuJiang

1️⃣Read the Full Paper here: [2406.20077] HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model

2️⃣Project Page: HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Models

3️⃣Code: Coming 🔜


1/1
🚨Product Update 🚨

@elevenlabsio has just introduced 🗣️“VOICE ISOLATOR” 🎤

This new feature allows you to extract crystal-clear speech from any audio.

Their vocal remover strips background noise for film, podcast, and interview post-production.

Try it for free here: Free Voice Isolator and Background Noise Remover | ElevenLabs
