
1/1
🚨IROS 2024 Paper Alert 🚨

➡️Paper Title: Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System

🌟Few pointers from the paper

🎯Automating dexterous, contact-rich manipulation tasks with rigid robots is a significant challenge in robotics. Rigid robots, which are actuated through position commands, suffer from excessive contact forces because they cannot adapt to contact with the environment, potentially causing damage.

🎯While compliance control schemes mitigate these issues by regulating forces via external sensors, they are hampered by the need to fine-tune task-specific controller parameters. Learning from Demonstrations (LfD) offers an intuitive alternative, allowing robots to learn manipulations through observed actions.

🎯In this work, the authors introduced a novel system to enhance the teaching of dexterous, contact-rich manipulations to rigid robots. Their system is twofold: first, it incorporates a teleoperation interface based on Virtual Reality (VR) controllers, designed to provide an intuitive and cost-effective way to demonstrate tasks with haptic feedback.

🎯Second, they presented Comp-ACT (Compliance Control via Action Chunking with Transformers), a method that leverages these demonstrations to learn variable compliance control from only a few examples (see the sketch below).

🎯Their method has been validated on various complex contact-rich manipulation tasks using single-arm and bimanual robot setups in simulated and real-world environments, demonstrating the effectiveness of their system in teaching robots dexterous manipulations with enhanced adaptability and safety.
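
Here is a minimal, hypothetical sketch (PyTorch) of the action-chunking idea behind a method like Comp-ACT: a small transformer maps a short observation history to a chunk of future actions, where each action carries an end-effector pose target plus variable stiffness gains for the compliance controller. All module names, dimensions, and the stiffness parameterization are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of the action-chunking idea behind Comp-ACT (not the authors' code).
# A transformer maps a short observation history to a chunk of future actions, where
# each action contains an end-effector pose target plus variable stiffness gains.
import torch
import torch.nn as nn

class CompliantActionChunker(nn.Module):
    def __init__(self, obs_dim=64, d_model=128, chunk_size=20, pose_dim=7, stiff_dim=6):
        super().__init__()
        self.embed = nn.Linear(obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.chunk_size = chunk_size
        # Each predicted step = pose target (e.g. xyz + quaternion) + per-axis stiffness.
        self.head = nn.Linear(d_model, chunk_size * (pose_dim + stiff_dim))
        self.pose_dim, self.stiff_dim = pose_dim, stiff_dim

    def forward(self, obs_history):                  # (B, T, obs_dim)
        h = self.encoder(self.embed(obs_history))    # (B, T, d_model)
        out = self.head(h[:, -1])                    # predict the chunk from the last token
        out = out.view(-1, self.chunk_size, self.pose_dim + self.stiff_dim)
        pose = out[..., :self.pose_dim]
        # Softplus keeps the commanded stiffness positive for the compliance controller.
        stiffness = torch.nn.functional.softplus(out[..., self.pose_dim:])
        return pose, stiffness

pose, stiffness = CompliantActionChunker()(torch.randn(1, 10, 64))
print(pose.shape, stiffness.shape)  # torch.Size([1, 20, 7]) torch.Size([1, 20, 6])
```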

🏢Organization: University of Tokyo (@UTokyo_News_en ), OMRON SINIC X Corporation (@sinicx_jp )

🧙Paper Authors: @tatsukamijo , @cambel07 , @mh69543540

1️⃣Read the Full Paper here: [2406.14990] Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

🌟Few pointers from the paper

🎯Monocular depth estimation in endoscopy videos can enable assistive and robotic surgery to obtain better coverage of the organ and to detect various health issues.

🎯Despite promising progress on mainstream natural-image depth estimation, existing techniques perform poorly on endoscopy images due to a lack of strong geometric features and challenging illumination effects.

🎯In this paper, the authors utilized photometric cues, i.e., the light emitted from the endoscope and reflected by the surface, to improve monocular depth estimation (see the sketch below).

🎯They first created two novel loss functions, with supervised and self-supervised variants, that utilize a per-pixel shading representation. They then proposed a novel depth refinement network (PPSNet) that leverages the same per-pixel shading representation.

🎯Finally, the authors introduced teacher-student transfer learning to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision. They achieved state-of-the-art results on the C3VD dataset while estimating high-quality depth maps from clinical data.
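
Below is a minimal sketch (PyTorch) of a per-pixel shading cue for a point light co-located with the camera, the kind of near-field photometric signal the paper builds on: brightness falls off with the Lambertian cosine term and the inverse square of distance. The intrinsics, the co-located-light and Lambertian assumptions, and the function name are illustrative, not the paper's exact formulation.

```python
# Minimal sketch of a per-pixel shading cue for a camera-co-located near-field light
# (the general idea used as photometric guidance; not the paper's exact formulation).
import torch

def per_pixel_shading(depth, normals, K):
    """depth: (H, W), normals: (H, W, 3) unit normals, K: (3, 3) intrinsics."""
    H, W = depth.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # Back-project pixels to 3D points in the camera frame.
    x = (u.float() - K[0, 2]) / K[0, 0] * depth
    y = (v.float() - K[1, 2]) / K[1, 1] * depth
    points = torch.stack([x, y, depth], dim=-1)               # (H, W, 3)
    dist = points.norm(dim=-1, keepdim=True).clamp(min=1e-6)
    light_dir = -points / dist                                # light sits at the camera origin
    cos_term = (normals * light_dir).sum(-1).clamp(min=0.0)   # Lambertian cosine falloff
    return cos_term / dist.squeeze(-1) ** 2                   # inverse-square attenuation

K = torch.tensor([[500., 0., 160.], [0., 500., 120.], [0., 0., 1.]])
depth = torch.full((240, 320), 0.05)                          # flat surface 5 cm away
normals = torch.zeros(240, 320, 3); normals[..., 2] = -1.0    # facing the camera
print(per_pixel_shading(depth, normals, K).shape)             # torch.Size([240, 320])
```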

🏢Organization: Department of Computer Science, University of North Carolina at Chapel Hill (@unccs )

🧙Paper Authors: @Yahskapar , Samuel Ehrenstein, Shuxian Wang, Inbar Fried, Stephen M. Pizer, Marc Niethammer, @SenguptRoni

1️⃣Read the Full Paper here: [2403.17915] Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

2️⃣Project Page: Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

3️⃣Code: GitHub - Roni-Lab/PPSNet: PPSNet: Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos (ECCV, 2024)

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Denys Kyshchuk from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#ECCV2024




1/1
🚨ECCV 2024 Paper Alert 🚨

➡️Paper Title: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

🌟Few pointers from the paper

🎯Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps.

🎯However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? The authors investigated this question in the context of autonomous driving and answered it with a resounding "yes".

🎯The authors proposed an efficient data generation pipeline termed "DGInStyle".

🧊First, they examined the problem of specializing a pretrained LDM to semantically controlled generation within a narrow domain.
🧊Second, they designed a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects (see the sketch below).
🧊Third, they proposed a Style Swap technique to endow the rich generative prior with the learned semantic control.

🎯Using DGInStyle, they generated a diverse dataset of street scenes, trained a domain-agnostic semantic segmentation model on it, and evaluated the model on multiple popular autonomous driving datasets.

🎯Their approach consistently improves the performance of several domain generalization methods, in some cases by +2.5 mIoU over the previous state-of-the-art method without their generative augmentation scheme.
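
As a rough illustration of the Multi-resolution Latent Fusion idea referenced above, the toy sketch below (PyTorch) blends a low-resolution global latent, which carries scene layout, with high-resolution latents generated for crops, so small objects are not washed out. The function name, the blending weight, and the fixed upsampling factor are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def fuse_multires_latents(global_latent, crop_latents, crop_boxes, alpha=0.5):
    """global_latent: (C, h, w) low-res latent covering the whole image.
    crop_latents: list of (C, hc, wc) latents generated for high-res crops.
    crop_boxes: list of (top, left, hc, wc) positions in the upsampled latent grid."""
    # Upsample the global latent so its grid matches the full-resolution layout.
    fused = F.interpolate(global_latent[None], scale_factor=2, mode="bilinear",
                          align_corners=False)[0]                     # (C, 2h, 2w)
    for crop, (top, left, hc, wc) in zip(crop_latents, crop_boxes):
        # Blend crop detail into the global context instead of overwriting it outright.
        fused[:, top:top + hc, left:left + wc] = (
            alpha * crop + (1 - alpha) * fused[:, top:top + hc, left:left + wc])
    return fused

global_latent = torch.randn(4, 32, 32)
crops = [torch.randn(4, 32, 32), torch.randn(4, 32, 32)]
boxes = [(0, 0, 32, 32), (32, 32, 32, 32)]
print(fuse_multires_latents(global_latent, crops, boxes).shape)  # torch.Size([4, 64, 64])
```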

🏢Organization: @ETH_en , @KU_Leuven , @INSAITinstitute Sofia

🧙Paper Authors: Yuru Jia, @lukashoyer3 , @ShengyHuang , @TianfuWang2 , Luc Van Gool, Konrad Schindler, @AntonObukhov1

1️⃣Read the Full Paper here: [2312.03048] DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

2️⃣Project Page: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

3️⃣Generation Code: GitHub - yurujaja/DGInStyle: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

4️⃣Segmentation Code: GitHub - yurujaja/DGInStyle-SegModel: Downstream semantic segmentation evaluation of DGInStyle.

🎥 Be sure to watch the attached Video

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

🌟Few pointers from the paper

🎯Monocular 3D human mesh estimation is an ill-posed problem characterized by inherent ambiguity and occlusion. While recent probabilistic methods propose generating multiple solutions, little attention is paid to obtaining high-quality estimates from them.

🎯To address this limitation, the authors introduced "ScoreHypo", a versatile framework that first leverages their novel "HypoNet" to generate multiple hypotheses and then employs a meticulously designed scorer, "ScoreNet", to evaluate and select high-quality estimates (see the sketch below).

🎯ScoreHypo formulates the estimation process as a reverse denoising process, where HypoNet produces a diverse set of plausible estimates that effectively align with the image cues.

🎯Subsequently, ScoreNet rigorously evaluates and ranks these estimates based on their quality and finally identifies superior ones.

🎯Experimental results demonstrate that HypoNet outperforms existing state-of-the-art probabilistic methods as a multi-hypothesis mesh estimator. Moreover, the estimates selected by ScoreNet significantly outperform random generation or simple averaging.

🎯Notably, the trained ScoreNet exhibits generalizability: it can effectively score existing methods and reduce their errors by more than 15%.
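
The hypothesize-then-score pattern in this paper can be summarized by the toy sketch below (PyTorch): sample several candidate parameter sets from a probabilistic estimator, score each with a learned scorer, and keep the best. The two dummy modules stand in for HypoNet and ScoreNet, and the 72-dimensional parameter vector is an illustrative assumption.

```python
import torch
import torch.nn as nn

# Dummy stand-ins for the paper's HypoNet (multi-hypothesis generator) and
# ScoreNet (quality scorer); only the select-by-score pattern is illustrated.
class DummyHypoNet(nn.Module):
    def forward(self, image_feat, num_hypotheses=10):
        B, D = image_feat.shape
        noise = torch.randn(B, num_hypotheses, 72)           # e.g. pose parameter vectors
        return noise + image_feat.mean(dim=1, keepdim=True)[:, None]

class DummyScoreNet(nn.Module):
    def forward(self, image_feat, hypotheses):
        return -hypotheses.abs().mean(dim=-1)                # higher score = better estimate

def estimate_best_mesh(image_feat, hypo_net, score_net, num_hypotheses=10):
    hypos = hypo_net(image_feat, num_hypotheses)             # (B, N, param_dim)
    scores = score_net(image_feat, hypos)                    # (B, N)
    best = scores.argmax(dim=1)                              # pick the top-scoring hypothesis
    return hypos[torch.arange(hypos.shape[0]), best]

feat = torch.randn(2, 256)
print(estimate_best_mesh(feat, DummyHypoNet(), DummyScoreNet()).shape)  # torch.Size([2, 72])
```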

🏢Organization: @PKU1898 , International Digital Economy Academy (IDEA), @sjtu1896

🧙Paper Authors: Yuan Xu, @XiaoxuanMa_ , Jiajun Su, @walterzhu8 , Yu Qiao, Yizhou Wang

1️⃣Read the Full Paper here: https://shorturl.at/pyIuc

2️⃣Project Page: ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Pavel Bekirov from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024




1/1
🚨 Paper Alert 🚨

➡️Paper Title: DoubleTake: Geometry Guided Depth Estimation

🌟Few pointers from the paper

🎯Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning, and more. Prior work typically makes use of previous frames in a multi-view stereo framework, relying on matching textures in a local neighborhood.

🎯In contrast, the authors' model leverages historical predictions by feeding the latest 3D geometry as an extra input to their network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes, and it is more regularized than individual predicted depth maps for previous frames.

🎯The authors introduced a Hint MLP that combines cost-volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of the confidence in the prior geometry (see the sketch below).

🎯They demonstrated that their method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.
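
A minimal sketch (PyTorch) of what a per-pixel "Hint MLP" could look like: concatenate cost-volume features with a depth hint rendered from prior geometry and a confidence value, then regress depth pixel-wise. Channel counts, layer sizes, and the output parameterization are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HintMLP(nn.Module):
    """Per-pixel MLP combining cost-volume features with a rendered geometry hint."""
    def __init__(self, cost_dim=64, hidden=128):
        super().__init__()
        # Inputs per pixel: cost-volume features + rendered depth hint + hint confidence.
        self.mlp = nn.Sequential(
            nn.Linear(cost_dim + 2, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1))

    def forward(self, cost_feat, hint_depth, hint_conf):
        # cost_feat: (B, C, H, W); hint_depth, hint_conf: (B, 1, H, W)
        x = torch.cat([cost_feat, hint_depth, hint_conf], dim=1)
        x = x.permute(0, 2, 3, 1)                      # apply the MLP pixel-wise
        return self.mlp(x).permute(0, 3, 1, 2)         # (B, 1, H, W) depth prediction

depth = HintMLP()(torch.randn(1, 64, 48, 64), torch.rand(1, 1, 48, 64), torch.rand(1, 1, 48, 64))
print(depth.shape)  # torch.Size([1, 1, 48, 64])
```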

🏢Organization: @NianticLabs , @ucl

🧙Paper Authors: @MohammedAmr1 , @AleottiFilippo , Jamie Watson, Zawar Qureshi, @gui_ggh , Gabriel Brostow, Sara Vicente, @_mdfirman

1️⃣Read the Full Paper here: https://nianticlabs.github.io/doubletake/resources/DoubleTake.pdf

2️⃣Project Page: DoubleTake: Geometry Guided Depth Estimation

3️⃣Code: Coming 🔜

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/2
🚨Paper Alert 🚨

➡️Paper Title: L4GM: Large 4D Gaussian Reconstruction Model

🌟Few pointers from the paper

🎯In this paper, the authors presented L4GM, the first 4D Large Reconstruction Model that produces animated objects from a single-view video input, in a single feed-forward pass that takes only a second.

🎯Key to their success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered in 48 viewpoints, resulting in 12M videos with a total of 300M frames.

🎯The authors kept L4GM simple for scalability and built directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input.

🎯L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness.

🎯They added temporal self-attention layers to the base LGM to help it learn consistency across time (see the sketch below) and utilized a per-timestep multiview rendering loss to train the model.

🎯The representation is upsampled to a higher framerate by training an interpolation model that produces intermediate 3D Gaussian representations. They showed that L4GM, although trained only on synthetic data, generalizes extremely well to in-the-wild videos, producing high-quality animated 3D assets.
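
For intuition, here is a generic temporal self-attention block (PyTorch) of the kind one could interleave into a per-frame reconstruction model so that each spatial token attends across frames. The dimensions and the residual-plus-norm layout are illustrative assumptions, not L4GM's actual layers.

```python
import torch
import torch.nn as nn

class TemporalSelfAttention(nn.Module):
    """Attend across the time axis for each spatial token so per-frame predictions
    stay temporally consistent (sketch of the general mechanism, not L4GM's layers)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens):                                # (B, T, N, D): frames x tokens
        B, T, N, D = tokens.shape
        x = tokens.permute(0, 2, 1, 3).reshape(B * N, T, D)   # one time sequence per token
        out, _ = self.attn(x, x, x)
        x = self.norm(x + out)                                # residual + norm
        return x.reshape(B, N, T, D).permute(0, 2, 1, 3)

tokens = torch.randn(1, 8, 256, 128)                          # 8 frames, 256 tokens per frame
print(TemporalSelfAttention()(tokens).shape)                  # torch.Size([1, 8, 256, 128])
```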

🏢Organization: @nvidia , @UofT , @Cambridge_Uni , @MIT , S-Lab, @NTUsg

🧙Paper Authors: Jiawei Ren, Kevin Xie, @ashmrz10 , Hanxue Liang, Xiaohui Zeng, @karsten_kreis , @liuziwei7 , Antonio Torralba, @FidlerSanja , Seung Wook Kim, @HuanLing6

1️⃣Read the Full Paper here: [2406.10324] L4GM: Large 4D Gaussian Reconstruction Model

2️⃣Project Page: L4GM: Large 4D Gaussian Reconstruction Model

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Praz Khanal from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

2/2
Impressive




1/1
🚨Paper Alert 🚨

➡️Paper Title: Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

🌟Few pointers from the paper

🎯In this paper, the authors presented "Follow-Your-Emoji", a diffusion-based framework for portrait animation that animates a reference portrait with target landmark sequences.

🎯The main challenge of portrait animation is to preserve the identity of the reference portrait and transfer the target expression to this portrait while maintaining temporal consistency and fidelity.

🎯To address these challenges, Follow-Your-Emoji equips the powerful Stable Diffusion model with two well-designed technologies. Specifically, they first adopted a new explicit motion signal, namely expression-aware landmarks, to guide the animation process.

🎯The authors found that these landmarks not only ensure accurate motion alignment between the reference portrait and the target motion during inference but also increase the ability to portray exaggerated expressions (e.g., large pupil movements) and avoid identity leakage.

🎯They also proposed a facial fine-grained loss to improve the model's perception of subtle expressions and its reconstruction of the reference portrait's appearance, using both expression and facial masks (see the sketch below).

🎯Accordingly, their method demonstrates strong performance in controlling the expression of freestyle portraits, including real humans, cartoons, sculptures, and even animals.

🎯By leveraging a simple and effective progressive generation strategy, they extended their model to stable long-term animation, increasing its potential application value.

🎯To address the lack of a benchmark for this field, they introduced EmojiBench, a comprehensive benchmark comprising diverse portrait images, driving videos, and landmarks. Extensive evaluations on EmojiBench verify the superiority of Follow-Your-Emoji.
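
One way to picture a mask-weighted, fine-grained facial loss is the toy sketch below (PyTorch): the reconstruction error is up-weighted inside the face region and, more strongly, inside expression regions such as the eyes and mouth. The weights, mask layout, and L1 error are illustrative assumptions, not the paper's exact loss.

```python
import torch

def facial_fine_grained_loss(pred, target, face_mask, expr_mask,
                             w_face=1.0, w_expr=2.0):
    """pred/target: (B, 3, H, W) images; masks: (B, 1, H, W) in [0, 1].
    Up-weights the face region and, more strongly, expression regions such as
    the eyes and mouth (weights here are illustrative, not the paper's values)."""
    err = (pred - target).abs()
    weight = 1.0 + w_face * face_mask + w_expr * expr_mask
    return (weight * err).mean()

pred, target = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
face_mask, expr_mask = torch.zeros(2, 1, 64, 64), torch.zeros(2, 1, 64, 64)
face_mask[..., 16:48, 16:48] = 1.0       # toy face box
expr_mask[..., 24:32, 20:44] = 1.0       # toy eye/mouth band
print(facial_fine_grained_loss(pred, target, face_mask, expr_mask))
```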

🏢Organization: @hkust , @TencentGlobal , @Tsinghua_Uni

🧙Paper Authors: Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen

1️⃣Read the Full Paper here: [2406.01900] Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

2️⃣Project Page: Follow-Your-Emoji: Freestyle Portrait Animation

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Maksym Dudchyk from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨Paper Alert 🚨

➡️Paper Title: Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

🌟Few pointers from the paper

🎯Large-scale endeavors like RT-1 and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data.

🎯Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information; they require hand-designed skills and can only interact with a few object instances.

🎯The authors of this paper propose "MANIPULATE-ANYTHING", a scalable automated generation method for real-world robotic manipulation. Unlike prior work, their method can operate in real-world environments without any privileged state information or hand-designed skills, and can manipulate any static object (a high-level sketch of such a generation loop follows the list below).

🎯They evaluate their method using two setups:
⚓First, MANIPULATE-ANYTHING successfully generates trajectories for all 5 real-world and 12 simulation tasks, significantly outperforming existing methods like VoxPoser.
⚓Second, MANIPULATE-ANYTHING's demonstrations can train more robust behavior cloning policies than training with human demonstrations, or with data generated by VoxPoser and Code-As-Policies.

🎯The authors believe that MANIPULATE-ANYTHING can be a scalable method for both generating data for robotics and solving novel tasks in a zero-shot setting.
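
As promised above, here is a heavily simplified, purely hypothetical skeleton (Python) of a VLM-driven demonstration-generation loop: propose sub-goals with a VLM, execute each sub-goal, verify the outcome with the VLM, and keep only successful trajectories. Every function below is a stub standing in for a real VLM, planner, and robot or simulator interface; only the control flow is illustrated, and none of it is the authors' code.

```python
# Heavily simplified skeleton of VLM-driven demonstration generation.
# Every function below is a hypothetical stub standing in for a real VLM,
# planner, and robot/simulator interface; only the control flow is illustrated.
import random

def vlm_propose_subgoals(task, image):            # stub: a VLM would return sub-goals
    return [f"{task}: sub-goal {i}" for i in range(3)]

def execute_subgoal(subgoal, state):              # stub: planner + low-level control
    return state + [subgoal]                      # pretend we acted and log the step

def vlm_verify(subgoal, image):                   # stub: a VLM would check the outcome
    return random.random() > 0.2

def generate_demonstration(task, get_image, max_retries=3):
    state, trajectory = [], []
    for subgoal in vlm_propose_subgoals(task, get_image()):
        for _ in range(max_retries):
            state = execute_subgoal(subgoal, state)
            if vlm_verify(subgoal, get_image()):  # keep only verified steps
                trajectory.append((subgoal, list(state)))
                break
        else:
            return None                           # discard failed demonstrations
    return trajectory

demo = generate_demonstration("open the drawer", get_image=lambda: None)
print("collected" if demo else "discarded", "demonstration")
```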

🏢Organization: @uwcse , @nvidia , @allen_ai , Universidad Católica San Pablo

🧙Paper Authors: @DJiafei , @TonyWentaoYuan , @wpumacay7567 , @YiruHelenWang , @ehsanik , Dieter Fox, @RanjayKrishna

1️⃣Read the Full Paper here: https://arxiv.org/pdf/2406.18915

2️⃣Project Page: Manipulate Anything

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by StudioKolomna from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨ICML 2024 Paper Alert 🚨

➡️Paper Title: Efficient World Models with Context-Aware Tokenization

🌟Few pointers from the paper

🎯Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modeling, model-based RL positions itself as a strong contender.

🎯Recent advances in sequence modeling have led to effective transformer-based world models, albeit at the price of heavy computation due to the long sequences of tokens required to accurately simulate environments.

🎯In this work, the authors proposed Δ-IRIS, a new agent with a world-model architecture composed of a discrete autoencoder that encodes stochastic deltas between time steps (see the sketch below) and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens.

🎯On the Crafter benchmark, Δ-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches.
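
To make the "encode deltas between time steps" idea concrete, below is a toy continuous autoencoder (PyTorch) that compresses only what changed between consecutive frames, conditioned on the previous frame, and reconstructs the next frame. It deliberately omits the discrete quantization, the convolutional architecture, and everything else specific to Δ-IRIS; names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class DeltaAutoencoder(nn.Module):
    """Toy stand-in: encode only what changed between consecutive frames,
    conditioned on the previous frame, then reconstruct the next frame."""
    def __init__(self, frame_dim=3 * 64 * 64, code_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(2 * frame_dim, 256), nn.ReLU(),
                                     nn.Linear(256, code_dim))
        self.decoder = nn.Sequential(nn.Linear(frame_dim + code_dim, 256), nn.ReLU(),
                                     nn.Linear(256, frame_dim))

    def forward(self, prev_frame, next_frame):
        pair = torch.cat([prev_frame, next_frame], dim=-1)
        delta_code = self.encoder(pair)                        # compact "what changed" code
        recon = self.decoder(torch.cat([prev_frame, delta_code], dim=-1))
        return recon, delta_code

prev_f = torch.rand(4, 3 * 64 * 64)
next_f = torch.rand(4, 3 * 64 * 64)
recon, code = DeltaAutoencoder()(prev_f, next_f)
loss = nn.functional.mse_loss(recon, next_f)                  # train to reconstruct the next frame
print(recon.shape, code.shape, loss.item())
```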

🏢Organization: @unige_en

🧙Paper Authors: @micheli_vincent , @EloiAlonso1 , @francoisfleuret

1️⃣Read the Full Paper here: [2406.19320] Efficient World Models with Context-Aware Tokenization

2️⃣Code: GitHub - vmicheli/delta-iris: Efficient World Models with Context-Aware Tokenization. ICML 2024

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by StudioKolomna from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#icml2024




1/1
🚨LLM Alert 🚨

💎 @GoogleDeepMind 's Gemma Team has officially announced the release of Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters.

💎The 9 billion and 27 billion parameter models are available today, with a 2 billion parameter model to be released shortly.

🌟Few pointers from the Announcement

🎯In this new version, they made several technical modifications to the architecture, such as interleaving local and global attention layers and using grouped-query attention.

🎯They also trained the 2B and 9B models with knowledge distillation instead of next-token prediction (see the sketch below). The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2-3× bigger.

🎯They trained Gemma 2 27B on 13 trillion tokens of primarily English data, the 9B model on 8 trillion tokens, and the 2.6B model on 2 trillion tokens. These tokens come from a variety of data sources, including web documents, code, and science articles.

🎯Their models are not multimodal and are not trained specifically for state-of-the-art multilingual capabilities. The final data mixture was determined through ablations similar to the approach in Gemini 1.0.

🎯Just like the original Gemma models, Gemma 2 is available under the commercially friendly Gemma license, giving developers and researchers the ability to share and commercialize their innovations.
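
For readers unfamiliar with distillation, the snippet below (PyTorch) shows a generic token-level knowledge-distillation loss: the student is trained to match the teacher's next-token distribution via KL divergence rather than only one-hot targets. The temperature, the batchmean reduction, and the vocabulary size are generic illustrative choices, not Gemma 2's actual training recipe.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    """Token-level KD: match the teacher's next-token distribution.
    student_logits / teacher_logits: (batch, seq_len, vocab_size)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student), averaged over the batch; the t**2 factor keeps gradient
    # scale comparable across temperatures (standard KD practice, not Gemma-specific).
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t ** 2

student = torch.randn(2, 16, 32000)
teacher = torch.randn(2, 16, 32000)
print(distillation_loss(student, teacher).item())
```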

1️⃣Blog: Gemma 2 is now available to researchers and developers

2️⃣Technical Report: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/7
Viggle's new feature "Move" is now live! Visit VIGGLE to animate your pic right away.

Compared to our previous feature, which offers green-screen and white backgrounds, Move allows you to keep the original background of the image, without further editing.

Get it moving!

2/7
That's so great!!! Can we see GenAiMovies soon?

3/7
Wow, fantastic. I personally find green screen very useful but I'm pleased with the upgrades. Nice work.

4/7
🥳🤩

5/7
Looks good, will try this out later. Make sure you keep the green screen too, as that's fine 👍🏻👍🏻

6/7
Great. For multiple characters video, how to choose which one to replace?

7/7
Wow!




1/1
🚨 Paper Alert 🚨

➡️Paper Title: Real-Time Video Generation with Pyramid Attention Broadcast

🌟Few pointers from the paper

🎯Recently, Sora and other DiT-based video generation models have attracted significant attention. However, in contrast to image generation, there are few studies focused on accelerating the inference of DiT-based video generation models.

🎯Additionally, the inference cost for generating a single video can be substantial, often requiring tens of GPU-minutes or even hours. Therefore, accelerating the inference of video generation models has become urgent for broader GenAI applications.

🎯In this work, the authors introduced "Pyramid Attention Broadcast (PAB)", the first approach that achieves real-time DiT-based video generation.

🎯By mitigating redundant attention computation (see the sketch below), PAB achieves up to 21.6 FPS with 10.6x acceleration, without sacrificing quality, across popular DiT-based video generation models including Open-Sora, Open-Sora-Plan, and Latte.

🎯Notably, as a training-free approach, PAB can empower any future DiT-based video generation models with real-time capabilities.
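
The core "broadcast" intuition can be sketched as caching an attention output at one diffusion step and reusing it for the next few steps instead of recomputing it, as in the toy block below (PyTorch). The refresh interval, the countdown-step convention, and the single-attention-layer setup are illustrative assumptions; PAB's actual scheme varies the reuse range per attention type.

```python
import torch
import torch.nn as nn

class BroadcastAttention(nn.Module):
    """Recompute attention only every `refresh_every` diffusion steps and reuse
    the cached output in between (sketch of the broadcast idea, not PAB itself)."""
    def __init__(self, dim=64, heads=4, refresh_every=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.refresh_every = refresh_every
        self._cache, self._cache_step = None, None

    def forward(self, x, step):
        stale = self._cache is None or (self._cache_step - step) >= self.refresh_every
        if stale:                                   # diffusion steps count down here
            self._cache, _ = self.attn(x, x, x)
            self._cache_step = step
        return x + self._cache                      # residual connection with cached attention

block = BroadcastAttention()
x = torch.randn(1, 128, 64)
for step in reversed(range(12)):                    # toy denoising loop, t = 11 ... 0
    x = block(x, step)
print(x.shape)  # torch.Size([1, 128, 64])
```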

🏢Organization: @NUSingapore , @LifeAtPurdue

🧙Paper Authors: @oahzxl , @JxlDragon , @VictorKaiWang1 , @YangYou1991

1️⃣Read the Full Paper here: Coming 🔜

2️⃣Blog: Real-Time Video Generation with Pyramid Attention Broadcast

3️⃣Code: GitHub - NUS-HPC-AI-Lab/OpenDiT: OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference

4️⃣Doc: OpenDiT/docs/pab.md at master · NUS-HPC-AI-Lab/OpenDiT

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Hot_Dope from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

🌟Few pointers from the paper

🎯In this paper, the authors presented "DiffHuman", a probabilistic method for photorealistic 3D human reconstruction from a single RGB image. Despite the ill-posed nature of this problem, most methods are deterministic and output a single solution, often resulting in a lack of geometric detail and blurriness in unseen or uncertain regions.

🎯In contrast, DiffHuman predicts a probability distribution over 3D reconstructions conditioned on an input 2D image, which allows it to sample multiple detailed 3D avatars that are consistent with the image. DiffHuman is implemented as a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation.

🎯During inference, 3D avatars can be sampled by iteratively denoising 2D renders of the predicted 3D representation (see the sketch below). Furthermore, the authors introduced a generator neural network that approximates rendering with considerably reduced runtime (a 55x speed-up), resulting in a novel dual-branch diffusion framework.

🎯Their experiments showed that DiffHuman can produce diverse and detailed reconstructions for the parts of the person that are unseen or uncertain in the input image, while remaining competitive with the state of the art when reconstructing visible surfaces.
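
The render-denoise-update loop described above can be caricatured as in the toy sketch below (PyTorch): render the current 3D representation into a pixel-aligned 2D observation, denoise it conditioned on image features, and lift the result back into an updated 3D representation. Every module here is a linear stub, the noise handling is not a real diffusion schedule, and nothing corresponds to DiffHuman's actual architecture; only the control flow is illustrated.

```python
import torch
import torch.nn as nn

# Stub modules: in the paper these roles are played by a renderer (or its fast
# generator approximation), a conditional denoiser, and a 3D-representation update.
render = nn.Linear(128, 64)        # 3D representation -> pixel-aligned 2D observation
denoise = nn.Linear(64 + 64, 64)   # (noisy observation, image features) -> cleaner observation
update3d = nn.Linear(64, 128)      # denoised observation -> refined 3D representation

def sample_avatar(image_feat, steps=8):
    """Iteratively denoise 2D renders of the current 3D representation (toy sketch)."""
    rep3d = torch.randn(1, 128)                            # random initial 3D representation
    for _ in range(steps):
        obs2d = render(rep3d) + 0.1 * torch.randn(1, 64)   # noisy pixel-aligned render
        obs2d = denoise(torch.cat([obs2d, image_feat], dim=-1))
        rep3d = update3d(obs2d)                            # lift the cleaned observation back to 3D
    return rep3d

print(sample_avatar(torch.randn(1, 64)).shape)             # torch.Size([1, 128])
```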

🏢Organization: @Google Research, @Cambridge_Uni

🧙Paper Authors: @AkashSengupta97 , @thiemoall , @nikoskolot , @enric_corona , Andrei Zanfir, @CSminchisescu

1️⃣Read the Full Paper here: [2404.00485] DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

2️⃣Project Page: DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans

🎥 Be sure to watch the attached Informational Video - Sound on 🔊🔊

🎵 Music by Dmytro Kuvalin from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024




1/1
🚨 Paper Alert 🚨

➡️Paper Title: 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

🌟Few pointers from the paper

🎯Current multimodal and multitask foundation models like 4M or UnifiedIO show promising results, but in practice their out-of-the-box ability to accept diverse inputs and perform diverse tasks is limited by the (usually rather small) number of modalities and tasks they are trained on.

🎯In this paper, the authors expand their capabilities by training a single model on tens of highly diverse modalities and by co-training on large-scale multimodal datasets and text corpora.

🎯This includes training on several semantic and geometric modalities, feature maps from recent state-of-the-art models like DINOv2 and ImageBind, pseudo-labels from specialist models like SAM and 4DHumans, and a range of new modalities that allow for novel ways to interact with the model and steer the generation, for example image metadata or color palettes.

🎯A crucial step in this process is performing discrete tokenization on the various modalities, whether they are image-like, neural-network feature maps, vectors, structured data like instance segmentation or human poses, or data that can be represented as text (see the sketch below).

🎯Through this, the authors expanded the out-of-the-box capabilities of multimodal models and specifically showed that one model can be trained to solve at least 3x more tasks/modalities than existing ones, without a loss in performance.

🎯This enables more fine-grained and controllable multimodal generation capabilities and allowed the authors to study the distillation of models trained on diverse data and objectives into a unified model. They successfully scaled the training to a three-billion-parameter model using tens of modalities and different datasets.
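
The toy sketch below (Python/PyTorch) illustrates one common way to unify modality-specific tokenizers into a single discrete token space: each modality's ids are shifted into a dedicated vocabulary range so sequences from different modalities can be concatenated for one transformer. The vocabulary sizes and the trivial tokenizers are assumptions for illustration, not 4M-21's learned tokenizers.

```python
# Toy sketch of unifying modalities into one discrete token space by offsetting
# each modality-specific tokenizer's ids into its own vocabulary range
# (the tokenizers here are trivial stand-ins, not the paper's learned ones).
import torch

MODALITY_VOCAB = {"rgb": 1024, "depth": 1024, "caption": 32000}

def modality_offsets(vocab_sizes):
    offsets, total = {}, 0
    for name, size in vocab_sizes.items():
        offsets[name] = total
        total += size
    return offsets, total

OFFSETS, TOTAL_VOCAB = modality_offsets(MODALITY_VOCAB)

def tokenize(modality, raw_ids):
    """raw_ids: LongTensor of per-modality token ids in [0, vocab_size)."""
    return raw_ids + OFFSETS[modality]             # shift into the shared vocabulary

rgb_tokens = tokenize("rgb", torch.randint(0, 1024, (16,)))
depth_tokens = tokenize("depth", torch.randint(0, 1024, (16,)))
caption_tokens = tokenize("caption", torch.randint(0, 32000, (8,)))
sequence = torch.cat([rgb_tokens, depth_tokens, caption_tokens])
print(TOTAL_VOCAB, sequence.shape)                 # 34048 torch.Size([40])
```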

🏢Organization: Swiss Federal Institute of Technology Lausanne (@EPFL ), @Apple

🧙Paper Authors: @roman__bachmann , @oguzhanthefatih , @dmizrahi_ , @aligarjani , @mingfei_gao , David Griffiths, Jiaming Hu, @afshin_dn , @zamir_ar

1️⃣Read the Full Paper here: [2406.09406] 4M-21: An Any-to-Any Vision Model for Tens of Tasks and Modalities

2️⃣Project Page: 4M: Massively Multimodal Masked Modeling

3️⃣Code: GitHub - apple/ml-4m: 4M: Massively Multimodal Masked Modeling

4️⃣Demo: 4M Demo - a Hugging Face Space by EPFL-VILAB

🎥 Be sure to watch the attached Video - Sound on 🔊🔊

🎵Music by Yevhen Onoychenko from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




1/1
🚨Paper Alert 🚨

➡️Paper Title: High-Fidelity Facial Albedo Estimation via Texture Quantization

🌟Few pointers from the paper

🎯Recent 3D face reconstruction methods have made significant progress in shape estimation, but high-fidelity facial albedo reconstruction remains challenging. Existing methods depend on expensive light-stage captured data to learn facial albedo maps. However, a lack of diversity in subjects limits their ability to recover high-fidelity results.

🎯In this paper, the authors presented a novel facial albedo reconstruction model, "HiFiAlbedo", which recovers the albedo map directly from a single image without the need for captured albedo data.

🎯Their key insight is that the albedo map is the illumination-invariant texture map, which enabled them to use inexpensive texture data to derive an albedo estimate by eliminating illumination.

🎯To achieve this, they first collected large-scale ultra-high-resolution facial images and trained a high-fidelity facial texture codebook (see the sketch below). Using the FFHQ dataset and limited UV textures, they then fine-tuned the encoder for texture reconstruction from the input image with adversarial supervision in both image and UV space.

🎯Finally, they trained a cross-attention module and used a group identity loss to learn the adaptation from the facial texture domain to the albedo domain. Extensive experiments demonstrated that their method generalizes well and achieves high-fidelity results for in-the-wild facial albedo recovery.
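
The "texture quantization" component rests on a learned codebook; the toy module below (PyTorch) shows the generic mechanism of vector quantization: snap each encoder feature to its nearest codeword and pass gradients through with a straight-through estimator. The codebook size, feature dimension, and module name are illustrative assumptions, not HiFiAlbedo's implementation.

```python
import torch
import torch.nn as nn

class TextureCodebook(nn.Module):
    """Nearest-neighbour vector quantization against a learned codebook
    (the generic mechanism behind 'texture quantization'; sizes are illustrative)."""
    def __init__(self, num_codes=1024, code_dim=256):
        super().__init__()
        self.codes = nn.Embedding(num_codes, code_dim)

    def forward(self, feats):                        # feats: (B, N, code_dim)
        # Pairwise distances between every feature vector and every codeword.
        d = torch.cdist(feats, self.codes.weight[None].expand(feats.shape[0], -1, -1))
        idx = d.argmin(dim=-1)                       # (B, N) indices of nearest codewords
        quantized = self.codes(idx)                  # replace features by codebook entries
        # Straight-through estimator so gradients flow back to the encoder.
        quantized = feats + (quantized - feats).detach()
        return quantized, idx

feats = torch.randn(2, 196, 256)                     # e.g. a 14x14 grid of texture features
quantized, idx = TextureCodebook()(feats)
print(quantized.shape, idx.shape)  # torch.Size([2, 196, 256]) torch.Size([2, 196])
```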

🏢Organization: University of Technology Sydney, Australia, @sjtu1896 , @DeepGlint , China, Insightface, China, @ZJU_China , @imperialcollege

🧙Paper Authors: Zimin Ran, Xingyu Ren, Xiang An, Kaicheng Yang, Xiangzi Dai, Ziyong Feng, Jia Guo, Linchao Zhu, @JiankangDeng

1️⃣Read the Full Paper here: [2406.13149] High-Fidelity Facial Albedo Estimation via Texture Quantization

2️⃣Project Page: High-Fidelity Facial Albedo Estimation via Texture Quantization

🎥 Be sure to watch the attached Demo Video - Sound on 🔊🔊

🎵 Music by Gregor Quendel from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

