
1/1
🚨IROS 2024 Paper Alert 🚨

➡️Paper Title: Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System

🌟Few pointers from the paper

🎯Automating dexterous, contact-rich manipulation tasks with rigid robots is a significant challenge in robotics. Rigid robots, which are actuated through position commands, suffer from excessive contact forces because they cannot adapt to contact with the environment, potentially causing damage.

🎯While compliance control schemes mitigate these issues by regulating forces via external sensors, they are hampered by the need to fine-tune task-specific controller parameters. Learning from Demonstrations (LfD) offers an intuitive alternative, allowing robots to learn manipulations through observed actions.

🎯In this work, the authors introduced a novel system to enhance the teaching of dexterous, contact-rich manipulations to rigid robots. Their system is twofold: first, it incorporates a teleoperation interface based on Virtual Reality (VR) controllers, designed to provide an intuitive and cost-effective way to demonstrate tasks with haptic feedback.

🎯Second, they presented Comp-ACT (Compliance Control via Action Chunking with Transformers), a method that leverages these demonstrations to learn variable compliance control from only a few examples (see the sketch below).

🎯Their method has been validated on various complex contact-rich manipulation tasks using single-arm and bimanual robot setups in simulated and real-world environments, demonstrating the effectiveness of their system in teaching robots dexterous manipulations with enhanced adaptability and safety.
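
Here is a minimal, hypothetical sketch (PyTorch) of the action-chunking idea behind a method like Comp-ACT: a small transformer maps a short observation history to a chunk of future actions, where each action carries an end-effector pose target plus variable stiffness gains for the compliance controller. All module names, dimensions, and the stiffness parameterization are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the action-chunking idea behind Comp-ACT (not the authors' code).
# A transformer maps a short observation history to a chunk of future actions, where
# each action contains an end-effector pose target plus variable stiffness gains.
import torch
import torch.nn as nn

class CompliantActionChunker(nn.Module):
    def __init__(self, obs_dim=64, d_model=128, chunk_size=20, pose_dim=7, stiff_dim=6):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.chunk_size = chunk_size
        # Each predicted step = pose target (e.g. xyz + quaternion) + per-axis stiffness.
        self.head = nn.Linear(d_model, chunk_size * (pose_dim + stiff_dim))
        self.pose_dim, self.stiff_dim = pose_dim, stiff_dim

    def forward(self, obs_history):                  # (B, T, obs_dim)
        h = self.encoder(self.embed(obs_history))    # (B, T, d_model)
        out = self.head(h[:, -1])                    # predict the chunk from the last token
        out = out.view(-1, self.chunk_size, self.pose_dim + self.stiff_dim)
        pose = out[..., :self.pose_dim]
        # Softplus keeps the commanded stiffness positive for the compliance controller.
        stiffness = torch.nn.functional.softplus(out[..., self.pose_dim:])
        return pose, stiffness

pose, stiffness = CompliantActionChunker()(torch.randn(1, 10, 64))
print(pose.shape, stiffness.shape)  # torch.Size([1, 20, 7]) torch.Size([1, 20, 6])
```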

🏢Organization: University of Tokyo (@UTokyo_News_en ), OMRON SINIC X Corporation (@sinicx_jp )

🧙Paper Authors: @tatsukamijo , @cambel07 , @mh69543540

1️⃣Read the Full Paper here: [2406.14990] Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

🌟Few pointers from the paper

🎯Monocular depth estimation in endoscopy videos can enable assistive and robotic surgery to obtain better coverage of the organ and to detect various health issues.

🎯Despite promising progress on mainstream natural-image depth estimation, existing techniques perform poorly on endoscopy images due to a lack of strong geometric features and challenging illumination effects.

🎯In this paper, the authors utilized photometric cues, i.e., the light emitted from the endoscope and reflected by the surface, to improve monocular depth estimation (see the sketch below).

🎯They first created two novel loss functions, with supervised and self-supervised variants, that utilize a per-pixel shading representation. They then proposed a novel depth refinement network (PPSNet) that leverages the same per-pixel shading representation.

🎯Finally, the authors introduced teacher-student transfer learning to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision. They achieved state-of-the-art results on the C3VD dataset while estimating high-quality depth maps from clinical data.
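
Below is a minimal sketch (PyTorch) of a per-pixel shading cue for a point light co-located with the camera, the kind of near-field photometric signal the paper builds on: brightness falls off with the Lambertian cosine term and the inverse square of distance. The intrinsics, the co-located-light and Lambertian assumptions, and the function name are illustrative, not the paper's exact formulation.

```python
# Minimal sketch of a per-pixel shading cue for a camera-co-located near-field light
# (the general idea used as photometric guidance; not the paper's exact formulation).
import torch

def per_pixel_shading(depth, normals, K):
    """depth: (H, W), normals: (H, W, 3) unit normals, K: (3, 3) intrinsics."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Back-project pixels to 3D points in the camera frame.
    x = (u.float() - K[0, 2]) / K[0, 0] * depth
    y = (v.float() - K[1, 2]) / K[1, 1] * depth
    points = torch.stack([x, y, depth], dim=-1)               # (H, W, 3)
    dist = points.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    light_dir = -points / dist                                # light sits at the camera origin
    cos_term = (normals * light_dir).sum(-1).clamp(min=0.0)   # Lambertian cosine falloff
    return cos_term / dist.squeeze(-1) ** 2                   # inverse-square attenuation

K = torch.tensor([[500., 0., 160.], [0., 500., 120.], [0., 0., 1.]])
depth = torch.full((240, 320), 0.05)                          # flat surface 5 cm away
normals = torch.zeros(240, 320, 3); normals[..., 2] = -1.0    # facing the camera
print(per_pixel_shading(depth, normals, K).shape)             # torch.Size([240, 320])
```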

🏢Organization: Department of Computer Science, University of North Carolina at Chapel Hill (@unccs )

🧙Paper Authors: @Yahskapar , Samuel Ehrenstein, Shuxian Wang, Inbar Fried, Stephen M. Pizer, Marc Niethammer, @SenguptRoni

1️⃣Read the Full Paper here: [2403.17915] Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

2️⃣Project Page: Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

3️⃣Code: GitHub - Roni-Lab/PPSNet: PPSNet: Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos (ECCV, 2024)

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Denys Kyshchuk from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#ECCV2024




1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

🌟Few pointers from the paper

🎯Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps.

🎯However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? The authors investigated this question in the context of autonomous driving and answered it with a resounding "yes".

🎯The authors proposed an efficient data generation pipeline termed "DGInStyle".

🧊First, they examined the problem of specializing a pretrained LDM to semantically controlled generation within a narrow domain.
🧊Second, they designed a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects (see the sketch below).
🧊Third, they proposed a Style Swap technique to endow the rich generative prior with the learned semantic control.

🎯Using DGInStyle, they generated a diverse dataset of street scenes, trained a domain-agnostic semantic segmentation model on it, and evaluated the model on multiple popular autonomous driving datasets.

🎯Their approach consistently improves the performance of several domain generalization methods, in some cases by +2.5 mIoU over the previous state-of-the-art method without their generative augmentation scheme.
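
As a rough illustration of the Multi-resolution Latent Fusion idea referenced above, the toy sketch below (PyTorch) blends a low-resolution global latent, which carries scene layout, with high-resolution latents generated for crops, so small objects are not washed out. The function name, the blending weight, and the fixed upsampling factor are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fuse_multires_latents(global_latent, crop_latents, crop_boxes, alpha=0.5):
    """global_latent: (C, h, w) low-res latent covering the whole image.
    crop_latents: list of (C, hc, wc) latents generated for high-res crops.
    crop_boxes: list of (top, left, hc, wc) positions in the upsampled latent grid."""
    # Upsample the global latent so its grid matches the full-resolution layout.
    fused = F.interpolate(global_latent[None], scale_factor=2, mode="bilinear",
                          align_corners=False)[0]                     # (C, 2h, 2w)
    for crop, (top, left, hc, wc) in zip(crop_latents, crop_boxes):
        # Blend crop detail into the global context instead of overwriting it outright.
        fused[:, top:top + hc, left:left + wc] = (
            alpha * crop + (1 - alpha) * fused[:, top:top + hc, left:left + wc])
    return fused

global_latent = torch.randn(4, 32, 32)
crops = [torch.randn(4, 32, 32), torch.randn(4, 32, 32)]
boxes = [(0, 0, 32, 32), (32, 32, 32, 32)]
print(fuse_multires_latents(global_latent, crops, boxes).shape)  # torch.Size([4, 64, 64])
```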

🏢Organization: @ETH_en , @KU_Leuven , @INSAITinstitute Sofia

🧙Paper Authors: Yuru Jia, @lukashoyer3 , @ShengyHuang , @TianfuWang2 , Luc Van Gool, Konrad Schindler, @AntonObukhov1

1️⃣Read the Full Paper here: [2312.03048] DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

2️⃣Project Page: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

3️⃣Generation Code: GitHub - yurujaja/DGInStyle: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

4️⃣Segmentation Code: GitHub - yurujaja/DGInStyle-SegModel: Downstream semantic segmentation evaluation of DGInStyle.

🎥 Be sure to watch the attached Video

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

🌟Few pointers from the paper

🎯Monocular 3D human mesh estimation is an ill-posed problem characterized by inherent ambiguity and occlusion. While recent probabilistic methods propose generating multiple solutions, little attention is paid to obtaining high-quality estimates from them.

🎯To address this limitation, the authors introduced "ScoreHypo", a versatile framework that first leverages their novel "HypoNet" to generate multiple hypotheses and then employs a meticulously designed scorer, "ScoreNet", to evaluate and select high-quality estimates (see the sketch below).

🎯ScoreHypo formulates the estimation process as a reverse denoising process, where HypoNet produces a diverse set of plausible estimates that effectively align with the image cues.

🎯Subsequently, ScoreNet rigorously evaluates and ranks these estimates based on their quality and finally identifies superior ones.

🎯Experimental results demonstrate that HypoNet outperforms existing state-of-the-art probabilistic methods as a multi-hypothesis mesh estimator. Moreover, the estimates selected by ScoreNet significantly outperform random generation or simple averaging.

🎯Notably, the trained ScoreNet exhibits generalizability: it can effectively score existing methods and reduce their errors by more than 15%.
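
The hypothesize-then-score pattern in this paper can be summarized by the toy sketch below (PyTorch): sample several candidate parameter sets from a probabilistic estimator, score each with a learned scorer, and keep the best. The two dummy modules stand in for HypoNet and ScoreNet, and the 72-dimensional parameter vector is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the paper's HypoNet (multi-hypothesis generator) and
# ScoreNet (quality scorer); only the select-by-score pattern is illustrated.
class DummyHypoNet(nn.Module):
    def forward(self, image_feat, num_hypotheses=10):
        B, D = image_feat.shape
        noise = torch.randn(B, num_hypotheses, 72)           # e.g. pose parameter vectors
        return noise + image_feat.mean(dim=1, keepdim=True)[:, None]

class DummyScoreNet(nn.Module):
    def forward(self, image_feat, hypotheses):
        return -hypotheses.abs().mean(dim=-1)                # higher score = better estimate

def estimate_best_mesh(image_feat, hypo_net, score_net, num_hypotheses=10):
    hypos = hypo_net(image_feat, num_hypotheses)             # (B, N, param_dim)
    scores = score_net(image_feat, hypos)                    # (B, N)
    best = scores.argmax(dim=1)                              # pick the top-scoring hypothesis
    return hypos[torch.arange(hypos.shape[0]), best]

feat = torch.randn(2, 256)
print(estimate_best_mesh(feat, DummyHypoNet(), DummyScoreNet()).shape)  # torch.Size([2, 72])
```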

🏢Organization: @PKU1898 , International Digital Economy Academy (IDEA), @sjtu1896

🧙Paper Authors: Yuan Xu, @XiaoxuanMa_ , Jiajun Su, @walterzhu8 , Yu Qiao, Yizhou Wang

1️⃣Read the Full Paper here: https://shorturl.at/pyIuc

2️⃣Project Page: ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Pavel Bekirov from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024




1/1
🚨 Paper Alert 🚨

➡️Paper Title: DoubleTake: Geometry Guided Depth Estimation

🌟Few pointers from the paper

🎯Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning, and more. Prior work typically makes use of previous frames in a multi-view stereo framework, relying on matching textures in a local neighborhood.

🎯In contrast, the authors' model leverages historical predictions by feeding the latest 3D geometry as an extra input to their network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes, and it is more regularized than individual predicted depth maps for previous frames.

🎯The authors introduced a Hint MLP that combines cost-volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry (see the sketch below).

🎯They demonstrated that their method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
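
A minimal sketch (PyTorch) of what a per-pixel "Hint MLP" could look like: concatenate cost-volume features with a depth hint rendered from prior geometry and a confidence value, then regress depth pixel-wise. Channel counts, layer sizes, and the output parameterization are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HintMLP(nn.Module):
    """Per-pixel MLP combining cost-volume features with a rendered geometry hint."""
    def __init__(self, cost_dim=64, hidden=128):
        super().__init__()
        # Inputs per pixel: cost-volume features + rendered depth hint + hint confidence.
        self.mlp = nn.Sequential(
            nn.Linear(cost_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, cost_feat, hint_depth, hint_conf):
        # cost_feat: (B, C, H, W); hint_depth, hint_conf: (B, 1, H, W)
        x = torch.cat([cost_feat, hint_depth, hint_conf], dim=1)
        x = x.permute(0, 2, 3, 1)                      # apply the MLP pixel-wise
        return self.mlp(x).permute(0, 3, 1, 2)         # (B, 1, H, W) depth prediction

depth = HintMLP()(torch.randn(1, 64, 48, 64), torch.rand(1, 1, 48, 64), torch.rand(1, 1, 48, 64))
print(depth.shape)  # torch.Size([1, 1, 48, 64])
```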

🏢Organization: @NianticLabs , @ucl

🧙Paper Authors: @MohammedAmr1 , @AleottiFilippo , Jamie Watson, Zawar Qureshi, @gui_ggh , Gabriel Brostow, Sara Vicente, @_mdfirman

1️⃣Read the Full Paper here: https://nianticlabs.github.io/doubletake/resources/DoubleTake.pdf

2️⃣Project Page: DoubleTake: Geometry Guided Depth Estimation

3️⃣Code: Coming 🔜

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/2
🚨Paper Alert 🚨

➡️Paper Title: L4GM: Large 4D Gaussian Reconstruction Model

🌟Few pointers from the paper

🎯In this paper, the authors presented L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input, in a single feed-forward pass that takes only a second.

🎯Key to their success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered in 48 viewpoints, resulting in 12M videos with a total of 300M frames.

🎯The authors kept L4GM simple for scalability and built directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input.

🎯L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness.

🎯They added temporal self-attention layers to the base LGM to help it learn consistency across time (see the sketch below) and utilized a per-timestep multiview rendering loss to train the model.

🎯The representation is upsampled to a higher framerate by training an interpolation model that produces intermediate 3D Gaussian representations. They showed that L4GM, although trained only on synthetic data, generalizes extremely well to in-the-wild videos, producing high-quality animated 3D assets.
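
For intuition, here is a generic temporal self-attention block (PyTorch) of the kind one could interleave into a per-frame reconstruction model so that each spatial token attends across frames. The dimensions and the residual-plus-norm layout are illustrative assumptions, not L4GM's actual layers.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Attend across the time axis for each spatial token so per-frame predictions
    stay temporally consistent (sketch of the general mechanism, not L4GM's layers)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                                # (B, T, N, D): frames x tokens
        B, T, N, D = tokens.shape
        x = tokens.permute(0, 2, 1, 3).reshape(B * N, T, D)   # one time sequence per token
        out, _ = self.attn(x, x, x)
        x = self.norm(x + out)                                # residual + norm
        return x.reshape(B, N, T, D).permute(0, 2, 1, 3)

tokens = torch.randn(1, 8, 256, 128)                          # 8 frames, 256 tokens per frame
print(TemporalSelfAttention()(tokens).shape)                  # torch.Size([1, 8, 256, 128])
```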

🏢Organization: @nvidia , @UofT , @Cambridge_Uni , @MIT , S-Lab, @NTUsg

🧙Paper Authors: Jiawei Ren, Kevin Xie, @ashmrz10 , Hanxue Liang, Xiaohui Zeng, @karsten_kreis , @liuziwei7 , Antonio Torralba, @FidlerSanja , Seung Wook Kim, @HuanLing6

1️⃣Read the Full Paper here: [2406.10324] L4GM: Large 4D Gaussian Reconstruction Model

2️⃣Project Page: L4GM: Large 4D Gaussian Reconstruction Model

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Praz Khanal from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

2/2
Impressive




1/1
🚨Paper Alert 🚨

➡️Paper Title: Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

🌟Few pointers from the paper

🎯In this paper, the authors presented "Follow-Your-Emoji", a diffusion-based framework for portrait animation that animates a reference portrait with target landmark sequences.

🎯The main challenge of portrait animation is to preserve the identity of the reference portrait and transfer the target expression to this portrait while maintaining temporal consistency and fidelity.

🎯To address these challenges, Follow-Your-Emoji equips the powerful Stable Diffusion model with two well-designed technologies. Specifically, they first adopted a new explicit motion signal, namely expression-aware landmarks, to guide the animation process.

🎯The authors found that these landmarks not only ensure accurate motion alignment between the reference portrait and the target motion during inference but also increase the ability to portray exaggerated expressions (e.g., large pupil movements) and avoid identity leakage.

🎯They also proposed a facial fine-grained loss to improve the model's perception of subtle expressions and its reconstruction of the reference portrait's appearance, using both expression and facial masks (see the sketch below).

🎯Accordingly, their method demonstrates strong performance in controlling the expression of freestyle portraits, including real humans, cartoons, sculptures, and even animals.

🎯By leveraging a simple and effective progressive generation strategy, they extended their model to stable long-term animation, increasing its potential application value.

🎯To address the lack of a benchmark for this field, they introduced EmojiBench, a comprehensive benchmark comprising diverse portrait images, driving videos, and landmarks. Extensive evaluations on EmojiBench verify the superiority of Follow-Your-Emoji.
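
One way to picture a mask-weighted, fine-grained facial loss is the toy sketch below (PyTorch): the reconstruction error is up-weighted inside the face region and, more strongly, inside expression regions such as the eyes and mouth. The weights, mask layout, and L1 error are illustrative assumptions, not the paper's exact loss.

```python
import torch

def facial_fine_grained_loss(pred, target, face_mask, expr_mask,
                             w_face=1.0, w_expr=2.0):
    """pred/target: (B, 3, H, W) images; masks: (B, 1, H, W) in [0, 1].
    Up-weights the face region and, more strongly, expression regions such as
    the eyes and mouth (weights here are illustrative, not the paper's values)."""
    err = (pred - target).abs()
    weight = 1.0 + w_face * face_mask + w_expr * expr_mask
    return (weight * err).mean()

pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
face_mask, expr_mask = torch.zeros(2, 1, 64, 64), torch.zeros(2, 1, 64, 64)
face_mask[..., 16:48, 16:48] = 1.0       # toy face box
expr_mask[..., 24:32, 20:44] = 1.0       # toy eye/mouth band
print(facial_fine_grained_loss(pred, target, face_mask, expr_mask))
```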

🏢Organization: @hkust , @TencentGlobal , @Tsinghua_Uni

🧙Paper Authors: Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen

1️⃣Read the Full Paper here: [2406.01900] Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

2️⃣Project Page: Follow-Your-Emoji: Freestyle Portrait Animation

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Maksym Dudchyk from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨Paper Alert 🚨

➡️Paper Title: Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

🌟Few pointers from the paper

🎯Large-scale endeavors like RT-1 and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data.

🎯Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information; they require hand-designed skills and can only interact with a few object instances.

🎯The authors of this paper propose "MANIPULATE-ANYTHING", a scalable automated generation method for real-world robotic manipulation. Unlike prior work, their method can operate in real-world environments without any privileged state information or hand-designed skills, and can manipulate any static object (a high-level sketch of such a generation loop follows the list below).

🎯They evaluate their method using two setups:
⚓First, MANIPULATE-ANYTHING successfully generates trajectories for all 5 real-world and 12 simulation tasks, significantly outperforming existing methods like VoxPoser.
⚓Second, MANIPULATE-ANYTHING's demonstrations can train more robust behavior cloning policies than training with human demonstrations, or with data generated by VoxPoser and Code-As-Policies.

🎯The authors believe that MANIPULATE-ANYTHING can be a scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting.
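
As promised above, here is a heavily simplified, purely hypothetical skeleton (Python) of a VLM-driven demonstration-generation loop: propose sub-goals with a VLM, execute each sub-goal, verify the outcome with the VLM, and keep only successful trajectories. Every function below is a stub standing in for a real VLM, planner, and robot or simulator interface; only the control flow is illustrated, and none of it is the authors' code.

```python
# Heavily simplified skeleton of VLM-driven demonstration generation.
# Every function below is a hypothetical stub standing in for a real VLM,
# planner, and robot/simulator interface; only the control flow is illustrated.
import random

def vlm_propose_subgoals(task, image):            # stub: a VLM would return sub-goals
    return [f"{task}: sub-goal {i}" for i in range(3)]

def execute_subgoal(subgoal, state):              # stub: planner + low-level control
    return state + [subgoal]                      # pretend we acted and log the step

def vlm_verify(subgoal, image):                   # stub: a VLM would check the outcome
    return random.random() > 0.2

def generate_demonstration(task, get_image, max_retries=3):
    state, trajectory = [], []
    for subgoal in vlm_propose_subgoals(task, get_image()):
        for _ in range(max_retries):
            state = execute_subgoal(subgoal, state)
            if vlm_verify(subgoal, get_image()):  # keep only verified steps
                trajectory.append((subgoal, list(state)))
                break
        else:
            return None                           # discard failed demonstrations
    return trajectory

demo = generate_demonstration("open the drawer", get_image=lambda: None)
print("collected" if demo else "discarded", "demonstration")
```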

🏢Organization: @uwcse , @nvidia , @allen_ai , Universidad Católica San Pablo

🧙Paper Authors: @DJiafei , @TonyWentaoYuan , @wpumacay7567 , @YiruHelenWang , @ehsanik , Dieter Fox, @RanjayKrishna

1️⃣Read the Full Paper here: https://arxiv.org/pdf/2406.18915

2️⃣Project Page: Manipulate Anything

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by StudioKolomna from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨ICML 2024 Paper Alert 🚨

➡️Paper Title: Efficient World Models with Context-Aware Tokenization

🌟Few pointers from the paper

🎯Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modeling, model-based RL positions itself as a strong contender.

🎯Recent advances in sequence modeling have led to effective transformer-based world models, albeit at the price of heavy computation due to the long sequences of tokens required to accurately simulate environments.

🎯In this work, the authors proposed Δ-IRIS, a new agent with a world-model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps (see the sketch below) and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens.

🎯On the Crafter benchmark, Δ-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches.
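
To make the "encode deltas between time steps" idea concrete, below is a toy continuous autoencoder (PyTorch) that compresses only what changed between consecutive frames, conditioned on the previous frame, and reconstructs the next frame. It deliberately omits the discrete quantization, the convolutional architecture, and everything else specific to Δ-IRIS; names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class DeltaAutoencoder(nn.Module):
    """Toy stand-in: encode only what changed between consecutive frames,
    conditioned on the previous frame, then reconstruct the next frame."""
    def __init__(self, frame_dim=3 * 64 * 64, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * frame_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(frame_dim + code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, frame_dim))

    def forward(self, prev_frame, next_frame):
        pair = torch.cat([prev_frame, next_frame], dim=-1)
        delta_code = self.encoder(pair)                        # compact "what changed" code
        recon = self.decoder(torch.cat([prev_frame, delta_code], dim=-1))
        return recon, delta_code

prev_f = torch.rand(4, 3 * 64 * 64)
next_f = torch.rand(4, 3 * 64 * 64)
recon, code = DeltaAutoencoder()(prev_f, next_f)
loss = nn.functional.mse_loss(recon, next_f)                  # train to reconstruct the next frame
print(recon.shape, code.shape, loss.item())
```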

🏢Organization: @unige_en

🧙Paper Authors: @micheli_vincent , @EloiAlonso1 , @francoisfleuret

1️⃣Read the Full Paper here: [2406.19320] Efficient World Models with Context-Aware Tokenization

2️⃣Code: GitHub - vmicheli/delta-iris: Efficient World Models with Context-Aware Tokenization. ICML 2024

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by StudioKolomna from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#icml2024




1/1
🚨LLM Alert 🚨

💎 @GoogleDeepMind 's Gemma Team has officially announced the release of Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters.

💎The 9 billion and 27 billion parameter models are available today, with a 2 billion parameter model to be released shortly.

🌟Few pointers from the Announcement

🎯In this new version, they made several technical modifications to the architecture, such as interleaving local and global attention layers and using grouped-query attention.

🎯They also trained the 2B and 9B models with knowledge distillation instead of next-token prediction (see the sketch below). The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3× bigger.

🎯They trained Gemma 2 27B on 13 trillion tokens of primarily English data, the 9B model on 8 trillion tokens, and the 2.6B model on 2 trillion tokens. These tokens come from a variety of data sources, including web documents, code, and science articles.

🎯Their models are not multimodal and are not trained specifically for state-of-the-art multilingual capabilities. The final data mixture was determined through ablations similar to the approach in Gemini 1.0.

🎯Just like the original Gemma models, Gemma 2 is available under the commercially friendly Gemma license, giving developers and researchers the ability to share and commercialize their innovations.
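
For readers unfamiliar with distillation, the snippet below (PyTorch) shows a generic token-level knowledge-distillation loss: the student is trained to match the teacher's next-token distribution via KL divergence rather than only one-hot targets. The temperature, the batchmean reduction, and the vocabulary size are generic illustrative choices, not Gemma 2's actual training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Token-level KD: match the teacher's next-token distribution.
    student_logits / teacher_logits: (batch, seq_len, vocab_size)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), averaged over the batch; the t**2 factor keeps gradient
    # scale comparable across temperatures (standard KD practice, not Gemma-specific).
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t ** 2

student = torch.randn(2, 16, 32000)
teacher = torch.randn(2, 16, 32000)
print(distillation_loss(student, teacher).item())
```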

1️⃣Blog: Gemma 2 is now available to researchers and developers

2️⃣Technical Report: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/7
Viggle's new feature "Move" is now live! Visit VIGGLE to animate your pic right away.

Compared to our previous feature, which offers green-screen and white backgrounds, Move allows you to keep the original background of the image, without further editing.

Get it moving!

2/7
That's so great!!! Can we see GenAiMovies soon?

3/7
Wow, fantastic. I personally find green screen very useful but I'm pleased with the upgrades. Nice work.

4/7
🥳🤩

5/7
Looks good, will try this out later. Make sure you keep the green screen too, as that's fine 👍🏻👍🏻

6/7
Great. For multiple characters video, how to choose which one to replace?

7/7
Wow!




1/1
🚨 Paper Alert 🚨

➡️Paper Title: Real-Time Video Generation with Pyramid Attention Broadcast

🌟Few pointers from the paper

🎯Recently, Sora and other DiT-based video generation models have attracted significant attention. However, in contrast to image generation, there are few studies focused on accelerating the inference of DiT-based video generation models.

🎯Additionally, the inference cost for generating a single video can be substantial, often requiring tens of GPU-minutes or even hours. Therefore, accelerating the inference of video generation models has become urgent for broader GenAI applications.

🎯In this work, the authors introduced "Pyramid Attention Broadcast (PAB)", the first approach that achieves real-time DiT-based video generation.

🎯By mitigating redundant attention computation (see the sketch below), PAB achieves up to 21.6 FPS with 10.6x acceleration, without sacrificing quality, across popular DiT-based video generation models including Open-Sora, Open-Sora-Plan, and Latte.

🎯Notably, as a training-free approach, PAB can empower any future DiT-based video generation models with real-time capabilities.
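
The core "broadcast" intuition can be sketched as caching an attention output at one diffusion step and reusing it for the next few steps instead of recomputing it, as in the toy block below (PyTorch). The refresh interval, the countdown-step convention, and the single-attention-layer setup are illustrative assumptions; PAB's actual scheme varies the reuse range per attention type.

```python
import torch
import torch.nn as nn

class BroadcastAttention(nn.Module):
    """Recompute attention only every `refresh_every` diffusion steps and reuse
    the cached output in between (sketch of the broadcast idea, not PAB itself)."""
    def __init__(self, dim=64, heads=4, refresh_every=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.refresh_every = refresh_every
        self._cache, self._cache_step = None, None

    def forward(self, x, step):
        stale = self._cache is None or (self._cache_step - step) >= self.refresh_every
        if stale:                                   # diffusion steps count down here
            self._cache, _ = self.attn(x, x, x)
            self._cache_step = step
        return x + self._cache                      # residual connection with cached attention

block = BroadcastAttention()
x = torch.randn(1, 128, 64)
for step in reversed(range(12)):                    # toy denoising loop, t = 11 ... 0
    x = block(x, step)
print(x.shape)  # torch.Size([1, 128, 64])
```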

🏢Organization: @NUSingapore , @LifeAtPurdue

🧙Paper Authors: @oahzxl , @JxlDragon , @VictorKaiWang1 , @YangYou1991

1️⃣Read the Full Paper here: Coming 🔜

2️⃣Blog: Real-Time Video Generation with Pyramid Attention Broadcast

3️⃣Code: GitHub - NUS-HPC-AI-Lab/OpenDiT: OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference

4️⃣Doc: OpenDiT/docs/pab.md at master · NUS-HPC-AI-Lab/OpenDiT

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Hot_Dope from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

🌟Few pointers from the paper

🎯In this paper, the authors presented "DiffHuman", a probabilistic method for photorealistic 3D human reconstruction from a single RGB image. Despite the ill-posed nature of this problem, most methods are deterministic and output a single solution, often resulting in a lack of geometric detail and blurriness in unseen or uncertain regions.

🎯In contrast, DiffHuman predicts a probability distribution over 3D reconstructions conditioned on an input 2D image, which allows it to sample multiple detailed 3D avatars that are consistent with the image. DiffHuman is implemented as a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation.

🎯During inference, 3D avatars can be sampled by iteratively denoising 2D renders of the predicted 3D representation (see the sketch below). Furthermore, the authors introduced a generator neural network that approximates rendering with considerably reduced runtime (a 55x speed-up), resulting in a novel dual-branch diffusion framework.

🎯Their experiments showed that DiffHuman can produce diverse and detailed reconstructions for the parts of the person that are unseen or uncertain in the input image, while remaining competitive with the state of the art when reconstructing visible surfaces.
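
The render-denoise-update loop described above can be caricatured as in the toy sketch below (PyTorch): render the current 3D representation into a pixel-aligned 2D observation, denoise it conditioned on image features, and lift the result back into an updated 3D representation. Every module here is a linear stub, the noise handling is not a real diffusion schedule, and nothing corresponds to DiffHuman's actual architecture; only the control flow is illustrated.

```python
import torch
import torch.nn as nn

# Stub modules: in the paper these roles are played by a renderer (or its fast
# generator approximation), a conditional denoiser, and a 3D-representation update.
render = nn.Linear(128, 64)        # 3D representation -> pixel-aligned 2D observation
denoise = nn.Linear(64 + 64, 64)   # (noisy observation, image features) -> cleaner observation
update3d = nn.Linear(64, 128)      # denoised observation -> refined 3D representation

def sample_avatar(image_feat, steps=8):
    """Iteratively denoise 2D renders of the current 3D representation (toy sketch)."""
    rep3d = torch.randn(1, 128)                            # random initial 3D representation
    for _ in range(steps):
        obs2d = render(rep3d) + 0.1 * torch.randn(1, 64)   # noisy pixel-aligned render
        obs2d = denoise(torch.cat([obs2d, image_feat], dim=-1))
        rep3d = update3d(obs2d)                            # lift the cleaned observation back to 3D
    return rep3d

print(sample_avatar(torch.randn(1, 64)).shape)             # torch.Size([1, 128])
```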

🏢Organization: @Google Research, @Cambridge_Uni

🧙Paper Authors: @AkashSengupta97 , @thiemoall , @nikoskolot , @enric_corona , Andrei Zanfir, @CSminchisescu

1️⃣Read the Full Paper here: [2404.00485] DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

2️⃣Project Page: DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

🎥 Be sure to watch the attached Informational Video - Sound on 🔊🔊

🎵 Music by Dmytro Kuvalin from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024




1/1
🚨 Paper Alert 🚨

➡️Paper Title: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

🌟Few pointers from the paper

🎯Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box ability to accept diverse inputs and perform diverse tasks is limited by the (usually rather small) number of modalities and tasks they are trained on.

🎯In this paper, the authors expand their capabilities by training a single model on tens of highly diverse modalities and by co-training on large-scale multimodal datasets and text corpora.

🎯This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo-labels from specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes.

🎯A crucial step in this process is performing discrete tokenization on the various modalities, whether they are image-like, neural-network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text (see the sketch below).

🎯Through this, the authors expanded the out-of-the-box capabilities of multimodal models and specifically showed that one model can be trained to solve at least 3x more tasks/modalities than existing ones, without a loss in performance.

🎯This enables more fine-grained and controllable multimodal generation capabilities and allowed the authors to study the distillation of models trained on diverse data and objectives into a unified model. They successfully scaled the training to a three-billion-parameter model using tens of modalities and different datasets.
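
The toy sketch below (Python/PyTorch) illustrates one common way to unify modality-specific tokenizers into a single discrete token space: each modality's ids are shifted into a dedicated vocabulary range so sequences from different modalities can be concatenated for one transformer. The vocabulary sizes and the trivial tokenizers are assumptions for illustration, not 4M-21's learned tokenizers.

```python
# Toy sketch of unifying modalities into one discrete token space by offsetting
# each modality-specific tokenizer's ids into its own vocabulary range
# (the tokenizers here are trivial stand-ins, not the paper's learned ones).
import torch

MODALITY_VOCAB = {"rgb": 1024, "depth": 1024, "caption": 32000}

def modality_offsets(vocab_sizes):
    offsets, total = {}, 0
    for name, size in vocab_sizes.items():
        offsets[name] = total
        total += size
    return offsets, total

OFFSETS, TOTAL_VOCAB = modality_offsets(MODALITY_VOCAB)

def tokenize(modality, raw_ids):
    """raw_ids: LongTensor of per-modality token ids in [0, vocab_size)."""
    return raw_ids + OFFSETS[modality]             # shift into the shared vocabulary

rgb_tokens = tokenize("rgb", torch.randint(0, 1024, (16,)))
depth_tokens = tokenize("depth", torch.randint(0, 1024, (16,)))
caption_tokens = tokenize("caption", torch.randint(0, 32000, (8,)))
sequence = torch.cat([rgb_tokens, depth_tokens, caption_tokens])
print(TOTAL_VOCAB, sequence.shape)                 # 34048 torch.Size([40])
```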

🏢Organization: Swiss Federal Institute of Technology Lausanne (@EPFL ), @Apple

🧙Paper Authors: @roman__bachmann , @oguzhanthefatih , @dmizrahi_ , @aligarjani , @mingfei_gao , David Griffiths, Jiaming Hu, @afshin_dn , @zamir_ar

1️⃣Read the Full Paper here: [2406.09406] 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

2️⃣Project Page: 4M: Massively Multimodal Masked Modeling

3️⃣Code: GitHub - apple/ml-4m: 4M: Massively Multimodal Masked Modeling

4️⃣Demo: 4M Demo - a Hugging Face Space by EPFL-VILAB

🎥 Be sure to watch the attached Video - Sound on 🔊🔊

🎵Music by Yevhen Onoychenko from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨Paper Alert 🚨

➡️Paper Title: High-Fidelity Facial Albedo Estimation via Texture Quantization

🌟Few pointers from the paper

🎯Recent 3D face reconstruction methods have made significant progress in shape estimation, but high-fidelity facial albedo reconstruction remains challenging. Existing methods depend on expensive light-stage captured data to learn facial albedo maps. However, a lack of diversity in subjects limits their ability to recover high-fidelity results.

🎯In this paper, the authors presented a novel facial albedo reconstruction model, "HiFiAlbedo", which recovers the albedo map directly from a single image without the need for captured albedo data.

🎯Their key insight is that the albedo map is the illumination-invariant texture map, which enabled them to use inexpensive texture data to derive an albedo estimate by eliminating illumination.

🎯To achieve this, they first collected large-scale ultra-high-resolution facial images and trained a high-fidelity facial texture codebook (see the sketch below). Using the FFHQ dataset and limited UV textures, they then fine-tuned the encoder for texture reconstruction from the input image with adversarial supervision in both image and UV space.

🎯Finally, they trained a cross-attention module and used a group identity loss to learn the adaptation from the facial texture domain to the albedo domain. Extensive experiments demonstrated that their method generalizes well and achieves high-fidelity results for in-the-wild facial albedo recovery.
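
The "texture quantization" component rests on a learned codebook; the toy module below (PyTorch) shows the generic mechanism of vector quantization: snap each encoder feature to its nearest codeword and pass gradients through with a straight-through estimator. The codebook size, feature dimension, and module name are illustrative assumptions, not HiFiAlbedo's implementation.

```python
import torch
import torch.nn as nn

class TextureCodebook(nn.Module):
    """Nearest-neighbour vector quantization against a learned codebook
    (the generic mechanism behind 'texture quantization'; sizes are illustrative)."""
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, code_dim)

    def forward(self, feats):                        # feats: (B, N, code_dim)
        # Pairwise distances between every feature vector and every codeword.
        d = torch.cdist(feats, self.codes.weight[None].expand(feats.shape[0], -1, -1))
        idx = d.argmin(dim=-1)                       # (B, N) indices of nearest codewords
        quantized = self.codes(idx)                  # replace features by codebook entries
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = feats + (quantized - feats).detach()
        return quantized, idx

feats = torch.randn(2, 196, 256)                     # e.g. a 14x14 grid of texture features
quantized, idx = TextureCodebook()(feats)
print(quantized.shape, idx.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 196])
```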

🏢Organization: University of Technology Sydney, Australia, @sjtu1896 , @DeepGlint , China, Insightface, China, @ZJU_China , @imperialcollege

🧙Paper Authors: Zimin Ran, Xingyu Ren, Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, Jia Guo, Linchao Zhu, @JiankangDeng

1️⃣Read the Full Paper here: [2406.13149] High-Fidelity Facial Albedo Estimation via Texture Quantization

2️⃣Project Page: High-Fidelity Facial Albedo Estimation via Texture Quantization

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Gregor Quendel from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

