Zero-shot talking avatar generation aims at synthesizing natural talking videos from speech and a single portrait image. Previous methods have relied on domain-specific heuristics such as warping-based motion representation and 3D Morphable Models, which limit the naturalness and diversity of the generated avatars. In this work, we introduce GAIA (Generative AI for Avatar), which eliminates the domain priors in talking avatar generation. In light of the observation that the speech only drives the motion of the avatar while the appearance of the avatar and the background typically remain the same throughout the entire video, we divide our approach into two stages: 1) disentangling each frame into motion and appearance representations; 2) generating motion sequences conditioned on the speech and reference portrait image. We collect a large-scale high-quality talking avatar dataset and train the model on it with different scales (up to 2B parameters). Experimental results verify the superiority, scalability, and flexibility of GAIA as 1) the resulting model beats previous baseline models in terms of naturalness, diversity, lip-sync quality, and visual quality; 2) the framework is scalable since larger models yield better results; 3) it is general and enables different applications like controllable talking avatar generation and text-instructed avatar generation.
GAIA: Zero-shot Talking Avatar Generation (arXiv:2311.15230 [cs.CV])
Subjects: Computer Vision and Pattern Recognition (cs.CV); Multimedia (cs.MM)
Project page: this https URL
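To make the two-stage idea concrete, here is a rough PyTorch sketch of the pipeline shape the abstract describes: disentangle each frame into appearance and motion codes, then generate a motion sequence from speech conditioned on the reference portrait's appearance code. The module names, dimensions, and the simple linear/GRU internals are my own placeholders, not GAIA's actual architecture.

```python
# Minimal sketch of GAIA's two-stage idea (toy modules of my own, not the paper's architecture).
import torch
import torch.nn as nn


class FrameDisentangler(nn.Module):
    """Stage 1: split each video frame into an appearance code and a motion code."""

    def __init__(self, frame_dim=512, app_dim=256, motion_dim=64):
        super().__init__()
        self.appearance_enc = nn.Linear(frame_dim, app_dim)
        self.motion_enc = nn.Linear(frame_dim, motion_dim)
        self.decoder = nn.Linear(app_dim + motion_dim, frame_dim)

    def forward(self, frame_feat):
        app = self.appearance_enc(frame_feat)    # static: identity, background
        motion = self.motion_enc(frame_feat)     # dynamic: pose, expression
        recon = self.decoder(torch.cat([app, motion], dim=-1))
        return app, motion, recon


class SpeechToMotion(nn.Module):
    """Stage 2: generate a motion sequence conditioned on speech features
    and the appearance code of the single reference portrait."""

    def __init__(self, speech_dim=80, app_dim=256, motion_dim=64, hidden=256):
        super().__init__()
        self.rnn = nn.GRU(speech_dim + app_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, motion_dim)

    def forward(self, speech_seq, ref_app):
        # Broadcast the reference appearance code over the speech timeline.
        cond = ref_app.unsqueeze(1).expand(-1, speech_seq.size(1), -1)
        h, _ = self.rnn(torch.cat([speech_seq, cond], dim=-1))
        return self.head(h)                      # (batch, T, motion_dim)


if __name__ == "__main__":
    disentangler, generator = FrameDisentangler(), SpeechToMotion()
    ref_frame = torch.randn(1, 512)              # single portrait, pre-encoded
    speech = torch.randn(1, 100, 80)             # e.g. 100 mel-spectrogram frames
    ref_app, _, _ = disentangler(ref_frame)
    motion_seq = generator(speech, ref_app)      # would drive the decoder frame by frame
    print(motion_seq.shape)                      # torch.Size([1, 100, 64])
```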
I'm going to pretend I know what all this means and assume it's a good thing.
MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model
Zhongcong Xu¹, Jianfeng Zhang², Jun Hao Liew², Hanshu Yan², Jia-Wei Liu¹, Chenxu Zhang², Jiashi Feng², Mike Zheng Shou¹
¹Show Lab, National University of Singapore  ²ByteDance
TL;DR: We propose MagicAnimate, a diffusion-based human image animation framework that aims at enhancing temporal consistency, faithfully preserving the reference image, and improving animation fidelity.
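The headline claim is temporal consistency, and the usual way diffusion-based video models get that is attention along the frame axis. Below is a generic sketch of such a temporal attention block with shapes and names of my own choosing; it illustrates the idea only and is not MagicAnimate's actual code.

```python
# Generic temporal-attention block, the common trick for temporal consistency
# in video diffusion models. A sketch only -- not MagicAnimate's implementation.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Self-attention across the frame axis, so each spatial location can
    exchange information with the same location in neighbouring frames."""

    def __init__(self, channels=320, heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, height*width, channels) -- per-frame UNet features.
        b, f, hw, c = x.shape
        tokens = x.permute(0, 2, 1, 3).reshape(b * hw, f, c)  # attend over frames
        h = self.norm(tokens)
        h, _ = self.attn(h, h, h)
        tokens = tokens + h                                    # residual connection
        return tokens.reshape(b, hw, f, c).permute(0, 2, 1, 3)


if __name__ == "__main__":
    block = TemporalAttention()
    feats = torch.randn(2, 16, 64, 320)  # 2 clips, 16 frames, 8x8 latent, 320 channels
    print(block(feats).shape)            # torch.Size([2, 16, 64, 320])
```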
Code: github.com/magic-research/magic-animate (CVPR 2024)
Project page: showlab.github.io/magicanimate/
I'm going to pretend I know what all this means and assume it's a good thing.
3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at rendering photorealistic novel views of complex scenes. However, recovering a high-quality NeRF typically requires tens to hundreds of input images, resulting in a time-consuming capture process. We present ReconFusion to reconstruct real-world scenes using only a few photos. Our approach leverages a diffusion prior for novel view synthesis, trained on synthetic and multiview datasets, which regularizes a NeRF-based 3D reconstruction pipeline at novel camera poses beyond those captured by the set of input images. Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions. We perform an extensive evaluation across various real-world datasets, including forward-facing and 360-degree scenes, demonstrating significant performance improvements over previous few-view NeRF reconstruction approaches.
ReconFusion: 3D Reconstruction with Diffusion Priors (arXiv:2312.02981 [cs.CV])
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Project page: this https URL
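The recipe in the abstract boils down to: fit the NeRF to the handful of observed photos as usual, and simultaneously ask a novel-view diffusion prior to keep renders from unobserved camera poses plausible. Here is a minimal sketch of that kind of training loop; the modules, the dummy prior term, and the weighting are my own placeholders, and the paper's actual diffusion-prior loss is more involved.

```python
# Sketch of a few-view reconstruction loop regularised by a diffusion prior.
# All modules are placeholders; the real ReconFusion objective is more involved.
import torch
import torch.nn as nn


class TinyNeRF(nn.Module):
    """Stand-in for a NeRF: maps a camera pose to a rendered RGB image."""

    def __init__(self, res=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(12, 256), nn.ReLU(), nn.Linear(256, 3 * res * res))
        self.res = res

    def forward(self, pose):                 # pose: (batch, 12) flattened [R|t]
        rgb = torch.sigmoid(self.net(pose))
        return rgb.view(-1, 3, self.res, self.res)


def diffusion_prior_score(render):
    """Placeholder for the diffusion prior: returns a scalar plausibility penalty.
    In practice this would be a gradient signal from a pretrained novel-view
    diffusion model conditioned on the input photos, not a simple statistic."""
    return render.var()                      # dummy stand-in, not the real loss


def train_step(nerf, opt, obs_poses, obs_images, novel_poses, lam=0.1):
    opt.zero_grad()
    recon = ((nerf(obs_poses) - obs_images) ** 2).mean()  # fit the observed photos
    prior = diffusion_prior_score(nerf(novel_poses))       # regularise unseen poses
    loss = recon + lam * prior
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    nerf = TinyNeRF()
    opt = torch.optim.Adam(nerf.parameters(), lr=1e-3)
    obs_poses, obs_images = torch.randn(3, 12), torch.rand(3, 3, 32, 32)  # 3 input photos
    novel_poses = torch.randn(8, 12)                                      # unobserved cameras
    print(train_step(nerf, opt, obs_poses, obs_images, novel_poses))
```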