bnew


1/2
🚨 @CVPR Highlight Paper Alert 🚨

➡️Paper Title: DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis

🌟Few pointers from the paper

🎯In this paper, the authors present “DiffPortrait3D”, a conditional diffusion model capable of synthesizing 3D-consistent, photo-realistic novel views from as little as a single in-the-wild portrait.

🎯Specifically, given a single RGB input, the authors aim to synthesize plausible yet consistent facial details rendered from novel camera views while retaining both identity and facial expression.

🎯In lieu of time-consuming optimization and fine-tuning, their zero-shot method generalizes well to arbitrary face portraits with unposed camera views, extreme facial expressions, and diverse artistic depictions.

🎯At its core, they leveraged the generative prior of 2D diffusion models pre-trained on large-scale image datasets as their rendering backbone, while the denoising is guided with disentangled attentive control of appearance and camera pose.

🎯To achieve this, they first inject the appearance context from the reference image into the self-attention layers of the frozen UNets. The rendering view is then manipulated with a novel conditional control module that interprets the camera pose by watching a condition image of a crossed subject from the same view.

🎯Furthermore, they inserted a trainable cross-view attention module to enhance view consistency, which is further strengthened with a novel 3D-aware noise generation process during inference.
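
Below is a minimal, hedged sketch of the appearance-injection idea from the pointers above: tokens computed from the reference portrait are appended to the key/value source of a self-attention layer so the denoising branch can attend to them. Module names, shapes, and the exact injection point are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch (assumptions only, not the DiffPortrait3D code): inject reference-image
# tokens into a self-attention layer by concatenating them to the key/value source.
import torch
import torch.nn as nn

def reference_self_attention(x_tokens, ref_tokens, to_q, to_k, to_v):
    """x_tokens: (B, N, C) tokens of the view being denoised;
    ref_tokens: (B, M, C) tokens from the same layer run on the reference portrait."""
    q = to_q(x_tokens)                                  # queries come from the target view only
    kv_src = torch.cat([x_tokens, ref_tokens], dim=1)   # appearance context appended
    k, v = to_k(kv_src), to_v(kv_src)
    attn = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v                                     # appearance-aware features

B, N, M, C = 1, 64, 64, 32
to_q, to_k, to_v = nn.Linear(C, C), nn.Linear(C, C), nn.Linear(C, C)
out = reference_self_attention(torch.randn(B, N, C), torch.randn(B, M, C), to_q, to_k, to_v)
print(out.shape)  # torch.Size([1, 64, 32])
```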

🏢Organization: @USC , @BytedanceTalk Inc

🧙Paper Authors: Yuming Gu, You Xie, Hongyi Xu, Guoxian Song, Yichun Shi, @DiChang10 , @jingyangcarl , @linjieluo_t

1️⃣Read the Full Paper here: [2312.13016] DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis

2️⃣Project Page: DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis

3️⃣Code: GitHub - FreedomGu/DiffPortrait3D: Official Repository of [CVPR'24 Highlight Diffportrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis]

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Mike Kripak from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024Highlight

2/2
@yuming_gu


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/1
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry

🌟Few pointers from the paper

🎯Visual odometry estimates the motion of a moving camera based on visual input. Existing methods, mostly focusing on two-view point tracking, often ignore the rich temporal context in the image sequence, thereby overlooking the global motion patterns and providing no assessment of the full trajectory reliability.

🚀These shortcomings hinder performance in scenarios with occlusion, dynamic objects, and low-texture areas. To address these challenges, the authors present the “Long-term Effective Any Point Tracking (LEAP)” module.

🎯LEAP innovatively combines visual, inter-track, and temporal cues with mindfully selected anchors for dynamic track estimation. Moreover, LEAP’s temporal probabilistic formulation integrates distribution updates into a learnable iterative refinement module to reason about point-wise uncertainty.

🚀Building on these traits, the authors also developed “LEAP-VO”, a robust visual odometry system adept at handling occlusions and dynamic scenes; its use of long-term point tracking as the front-end is a novel practice in visual odometry (a toy illustration of uncertainty-aware tracking follows these pointers).
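
As a toy illustration of why the point-wise uncertainty mentioned above matters for a tracking-based VO front-end (this is a simplified sketch, not LEAP-VO's actual formulation), unreliable tracks can be down-weighted when fitting camera motion:

```python
# Toy example: weight each tracked point by its predicted uncertainty when
# estimating a simple 2D camera translation between two frames.
import numpy as np

rng = np.random.default_rng(0)
pts_t0 = rng.uniform(-1, 1, size=(100, 2))
true_shift = np.array([0.3, -0.1])
pts_t1 = pts_t0 + true_shift + rng.normal(0, 0.01, size=(100, 2))
pts_t1[:10] += rng.normal(0, 0.5, size=(10, 2))   # simulate dynamic/occluded outlier tracks
sigma = np.full(100, 0.01)
sigma[:10] = 0.5                                  # tracker-predicted per-point uncertainty

weights = 1.0 / sigma ** 2
flow = ((pts_t1 - pts_t0) * weights[:, None]).sum(axis=0) / weights.sum()
print(flow)   # close to [0.3, -0.1] despite the corrupted tracks
```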

🏢Organization: @TU_Muenchen , @MunichCenterML , @MPI_IS , @Microsoft

🧙Paper Authors: Weirong Chen, Le Chen, Rui Wang, @mapo1

1️⃣Read the Full Paper here: [2401.01887] LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry

2️⃣Project Page: LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry

3️⃣Code: GitHub - chiaki530/leapvo: [CVPR'24] LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/1
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: GARField: Group Anything with Radiance Fields

🌟Few pointers from the paper

🐱Grouping is inherently ambiguous due to the multiple levels of granularity at which one can decompose a scene: should the wheels of an excavator be considered separate, or part of the whole?

🐱In this paper, the authors present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs.

🐱To do this, the authors embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes.

🐱They optimized this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints.

🐱From this field they can derive a hierarchy of possible groupings via automatic tree construction or user interaction.

🐱They evaluated GARField on a variety of in-the-wild scenes and found that it effectively extracts groups at many levels: clusters of objects, whole objects, and various subparts.

🐱GARField inherently represents multi-view consistent groupings and produces higher fidelity groups than the input SAM masks. GARField’s hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding.
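
A minimal sketch of the scale-conditioned grouping idea, under heavy simplifications (a random toy field and greedy thresholding instead of the paper's tree construction); it only illustrates how the same points can fall into different groups depending on the scale fed to the field:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy scale-conditioned affinity field: (x, y, z, scale) -> embedding.
field = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 16))

def groups_at_scale(points, scale, thresh=0.5):
    """Greedily group points whose scale-conditioned features are close."""
    s = torch.full((points.shape[0], 1), scale)
    feats = F.normalize(field(torch.cat([points, s], dim=1)), dim=1)
    labels = torch.full((points.shape[0],), -1, dtype=torch.long)
    for i in range(points.shape[0]):
        if labels[i] >= 0:
            continue
        new_label = labels.max() + 1
        close = (feats - feats[i]).norm(dim=1) < thresh
        labels[close & (labels < 0)] = new_label
    return labels

pts = torch.rand(200, 3)
# With a trained field, small scales would yield many fine groups and large scales
# a few coarse ones; with this random toy field the counts are arbitrary.
print(groups_at_scale(pts, scale=0.05).unique().numel(),
      groups_at_scale(pts, scale=1.0).unique().numel())
```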

🏢Organization: @UCBerkeley , @LumaLabsAI

🧙Paper Authors: @ChungMinKim , Mingxuan Wu, @justkerrding , @Ken_Goldberg , Matthew Tancik, @akanazawa

1️⃣Read the Full Paper here: [2401.09419] GARField: Group Anything with Radiance Fields

2️⃣Project Page: GARField: Group Anything with Radiance Fields

3️⃣Code: GitHub - chungmin99/garfield: [CVPR'24] Group Anything with Radiance Fields

4️⃣Data: GARField Eval – Google Drive

🎥 Be sure to watch the attached Demo Video

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/3
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: MeshVPR: Citywide Visual Place Recognition Using 3D Meshes

🌟Few pointers from the paper

🎯Mesh-based scene representation offers a promising direction for simplifying large-scale hierarchical visual localization pipelines, combining a visual place recognition step based on global features (retrieval) and a visual localization step based on local features.

🎯While existing work demonstrates the viability of meshes for visual localization, the impact of using synthetic databases rendered from them in visual place recognition remains largely unexplored.

🎯In this work, the authors investigate the use of dense 3D textured meshes for large-scale Visual Place Recognition (VPR) and identify a significant performance drop when synthetic mesh-based databases are used for retrieval instead of real-world images.

🎯To address this, the authors propose “MeshVPR”, a novel VPR pipeline that uses a lightweight feature-alignment framework to bridge the gap between the real-world and synthetic domains. MeshVPR leverages pre-trained VPR models and is efficient and scalable enough for citywide deployment (a minimal sketch of the alignment idea appears right after these pointers).

🎯They have also introduced novel datasets with freely available 3D meshes and manually collected queries from Berlin, Paris, and Melbourne. Extensive evaluations demonstrate that MeshVPR achieves competitive performance with standard VPR pipelines, paving the way for mesh-based localization systems.
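
To make the feature-alignment idea concrete, here is a minimal sketch under assumptions (placeholder descriptors, an invented two-layer head, a cosine loss); the actual MeshVPR framework, losses, and descriptor dimensions may differ:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512                                                               # descriptor size (assumption)
align = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))    # lightweight alignment head
opt = torch.optim.Adam(align.parameters(), lr=1e-3)

for step in range(100):
    # Placeholder pairs: descriptors of the same place from a real photo and from a
    # synthetic mesh render, both produced by a frozen pre-trained VPR model.
    real_desc = torch.randn(32, D)
    synth_desc = real_desc + 0.3 * torch.randn(32, D)
    loss = 1 - F.cosine_similarity(align(synth_desc), real_desc).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# At query time, the synthetic database is passed through `align` once, and real query
# descriptors are matched against it with standard nearest-neighbor retrieval.
```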

🏢Organization: Polytechnic of Turin, @KITKarlsruhe , @FraunhoferIOSB

🧙Paper Authors: @gabriberton , @lolleko_ , Riccardo Zaccone, Thomas Pollok, Barbara Caputo, Carlo Masone

1️⃣Read the Full Paper here: [2406.02776] MeshVPR: Citywide Visual Place Recognition Using 3D Meshes

2️⃣Project Page: https://meshvpr.github.io/

3️⃣Code: GitHub - gmberton/MeshVPR: Visual Place Recognition using 3D Meshes

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by u_5gcdffq7mb from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024

2/3
This is a really good summary! Now I'm following you to see more like this 😉

3/3
Feeling Nervous, but up for the Challenge 😎😎


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196



A.I generated explanation:

Title: MeshVPR: Using 3D Meshes for Citywide Visual Place Recognition

What's it about? Imagine you're a self-driving car or a robot navigating a city. You need to recognize where you are and how to get to your destination. This paper is about a new way to do this using 3D models of the city, called meshes.

Problem: Current methods use lots of images to recognize places, but this can be slow and complicated. Using 3D meshes could make it faster and simpler. However, there's a problem: when you use synthetic (computer-generated) images rendered from those meshes instead of real photos, the recognition accuracy drops.

Solution: The authors of this paper created a new system called MeshVPR that bridges the gap between real and synthetic meshes. It uses a special framework to align features from both types of data, making it more accurate and efficient.

Benefits: MeshVPR is scalable, meaning it can be used for large areas like entire cities. It's also competitive with other methods that use images, and it's more efficient.

New datasets: The authors created new datasets with 3D meshes and real-world images from Berlin, Paris, and Melbourne. These datasets can be used by others to test and improve their own systems.

Takeaway: MeshVPR is a promising new approach to visual place recognition that uses 3D meshes. It's faster, simpler, and more efficient than traditional methods, and it has the potential to be used in self-driving cars, robots, and other applications.
 

bnew


1/1
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

🌟Few pointers from the paper

🎯The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations.

🎯First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods.

🎯The authors address these limitations with “WHAM” (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video.

🎯 WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information.

🎯WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body’s global trajectory.

🎯They combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs.

🎯WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks.
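
The following toy, shape-only sketch shows the kind of fusion described above (2D keypoints plus video features plus SLAM camera angular velocity, integrated into a global root trajectory); layer sizes and the decoding are assumptions, not the WHAM architecture:

```python
import torch
import torch.nn as nn

T, J = 60, 17                                # frames and 2D joints (placeholder sizes)
kp2d = torch.randn(1, T, J * 2)              # detected 2D keypoint sequence
img_feat = torch.randn(1, T, 256)            # per-frame video features
cam_ang_vel = torch.randn(1, T, 3)           # camera angular velocity from a SLAM front-end

motion_enc = nn.GRU(J * 2, 256, batch_first=True)    # encode the keypoint motion stream
fuse = nn.Linear(256 + 256 + 3, 128)
head = nn.Linear(128, 3)                              # per-frame root velocity in world frame

m, _ = motion_enc(kp2d)
fused = torch.relu(fuse(torch.cat([m, img_feat, cam_ang_vel], dim=-1)))
root_traj = torch.cumsum(head(fused), dim=1)          # integrate velocities into a global trajectory
print(root_traj.shape)                                # torch.Size([1, 60, 3])
```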

🏢Organization: @CarnegieMellon , @MPI_IS

🧙Paper Authors: @soyong_shin, Juyong Kim, @enihalilaj , @Michael_J_Black

1️⃣Read the Full Paper here: [2312.07531] WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

2️⃣Project Page: WHAM

3️⃣Code: GitHub - yohanshin/WHAM

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Denys Kyshchuk from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/1
🚨Project Alert 🚨

Hey AI aficionados!

Get ready to be blown away by "GenType", the revolutionary Imagen 2-powered tool from @labsdotgoogle that transforms your ideas into stunning AI-crafted alphabets.

Spark your imagination, provide a prompt, and behold as GenType conjures up alphabets that are nothing short of magical.

Dive into the typographic adventure here: GenType - a labs.google experiment

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#imagen2 #texttoalphabet #generativeai


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/1
🚨SIGGRAPH 2024 Paper Alert 🚨

➡️Paper Title: RTG-SLAM: Real-time 3D Reconstruction at Scale Using Gaussian Splatting

🌟Few pointers from the paper

🎯In this paper, the authors present Real-time Gaussian SLAM (RTG-SLAM), a real-time 3D reconstruction system that uses an RGBD camera and Gaussian splatting to reconstruct large-scale environments.

🎯The system features a compact Gaussian representation and a highly efficient on-the-fly Gaussian optimization scheme. They force each Gaussian to be either opaque or nearly transparent, with the opaque ones fitting the surface and dominant colors, and the transparent ones fitting residual colors.

🎯By rendering depth differently from color, they let a single opaque Gaussian fit a local surface region well without needing multiple overlapping Gaussians, which largely reduces the memory and computation cost.

🎯For on-the-fly Gaussian optimization, they explicitly add Gaussians for three types of pixels per frame: newly observed, with large color errors, and with large depth errors.

🎯They also categorize all Gaussians as stable or unstable: Gaussians that are expected to fit previously observed RGBD images well are considered stable, and the rest unstable.

🎯They only optimize the unstable Gaussians and only render the pixels occupied by unstable Gaussians. In this way, both the number of Gaussians to be optimized and pixels to be rendered are largely reduced, and the optimization can be done in real time.

🎯They showed real-time reconstructions of a variety of large scenes. Compared with state-of-the-art NeRF-based RGBD SLAM, their system achieves comparably high-quality reconstruction at around twice the speed and half the memory cost, and shows superior performance in the realism of novel view synthesis and in camera tracking accuracy.
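
A small illustrative sketch of the per-frame Gaussian-spawning rule listed above (new Gaussians for newly observed pixels and for pixels with large color or depth errors); the inputs, thresholds, and mask logic are assumptions for exposition, not the released implementation:

```python
# Illustrative only: decide which pixels should spawn new Gaussians this frame.
import numpy as np

def pixels_needing_gaussians(rendered_rgb, rendered_depth, frame_rgb, frame_depth,
                             covered_mask, color_tol=0.1, depth_tol=0.05):
    """Return a boolean mask of pixels where a new Gaussian should be added."""
    newly_observed = ~covered_mask                                        # not yet covered by the map
    color_err = np.abs(rendered_rgb - frame_rgb).mean(axis=-1) > color_tol
    depth_err = np.abs(rendered_depth - frame_depth) > depth_tol
    return newly_observed | color_err | depth_err

H, W = 120, 160
mask = pixels_needing_gaussians(np.random.rand(H, W, 3), np.random.rand(H, W),
                                np.random.rand(H, W, 3), np.random.rand(H, W),
                                covered_mask=np.random.rand(H, W) > 0.1)
print(mask.sum(), "pixels would spawn new Gaussians this frame")
```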

🏢Organization: State Key Lab of CAD&CG, @ZJU_China , @UUtah , @BaiduResearch

🧙Paper Authors: Zhexi Peng, Tianjia Shao, Liu Yong, Jingke Zhou, Yin Yang, Jingdong Wang, Kun Zhou

1️⃣Read the Full Paper here: https://gapszju.github.io/RTG-SLAM/static/pdfs/RTG-SLAM_arxiv.pdf

2️⃣Project Page: RTG-SLAM: Real-time 3D Reconstruction at Scale Using Gaussian Splatting

3️⃣Code: GitHub - MisEty/RTG-SLAM: RTG-SLAM: Real-time 3D Reconstruction at Scale Using Gaussian Splatting (ACM SIGGRAPH 2024)

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Alex_Kizenkov from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

🌟Few pointers from the paper

🎯Recent advancements in deep generative models present new opportunities for music production but also pose challenges, such as high computational demands and limited audio quality.

🎯Moreover, current systems frequently rely solely on text input and typically focus on producing complete musical pieces, which is incompatible with existing workflows in music production.

🎯To address these issues, authors have introduced "Diff-A-Riff," a Latent Diffusion Model designed to generate high-quality instrumental accompaniments adaptable to any musical context.

🎯This model offers control through either audio references, text prompts, or both, and produces 48kHz pseudo-stereo audio while significantly reducing inference time and memory usage.

🎯Their approach relies on a pretrained consistency-model-based autoencoder (CAE), and they train a generative model on its latent embeddings. The proposed generative model is an LDM following the Elucidated Diffusion Models (EDM) framework, with an architecture based on DDPM++, an upgraded version of the originally proposed diffusion probabilistic model (a high-level latent-sampling sketch follows these pointers).

🎯The authors validated “Diff-A-Riff” through comprehensive evaluations, assessing its performance in ablation studies with objective metrics, comparing it to other models, and estimating context adherence with subjective listening tests.
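
For intuition, here is a hedged, minimal sketch of EDM-style sampling in a CAE latent space with a dummy denoiser; the conditioning interface, noise schedule, latent shape, and sampler are placeholders rather than Diff-A-Riff's actual configuration:

```python
import math
import torch

def denoiser(z_noisy, sigma, cond):
    """Stand-in for the trained latent diffusion model (cond would carry audio ref / text)."""
    return torch.zeros_like(z_noisy)          # pretend the clean latent is all zeros

def edm_euler_sample(shape, cond, steps=32, sigma_max=80.0, sigma_min=0.002):
    sigmas = torch.logspace(math.log10(sigma_max), math.log10(sigma_min), steps)
    z = torch.randn(shape) * sigmas[0]
    for i in range(steps - 1):
        d = (z - denoiser(z, sigmas[i], cond)) / sigmas[i]   # estimated noise direction
        z = z + d * (sigmas[i + 1] - sigmas[i])              # Euler step toward lower noise
    return z                                                 # decode with the pretrained CAE afterwards

cond = {"audio_ref": torch.randn(1, 128), "text": torch.randn(1, 128)}
latents = edm_euler_sample((1, 64, 256), cond)
print(latents.shape)   # torch.Size([1, 64, 256])
```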

🏢Organization: @SonyCSLMusic , @QMUL

🧙Paper Authors: @latentspaces , @marco_ppasini , @cyranaouameur , Maarten Grachten, @deeplearnmusic

1️⃣Read the Full Paper here: [2406.08384] Diff-A-Riff: Musical Accompaniment Co-creation via Latent Diffusion Models

2️⃣Project Page: Diff-A-Riff - Companion Website

🎥 Be sure to watch the attached Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

🌟Few pointers from the paper

🎯This work moves away from traditional paradigms that rely on parametric models for intermediate facial representations.

🎯The authors' innovative approach embraces the end-to-end diffusion paradigm and introduces a hierarchical audio-driven visual synthesis module to enhance the precision of alignment between audio inputs and visual outputs, encompassing lip, expression, and pose motion.

🎯Their proposed network architecture seamlessly integrates diffusion-based generative models, a UNet-based denoiser, temporal alignment techniques, and a reference network.

🎯 The proposed hierarchical audio-driven visual synthesis offers adaptive control over expression and pose diversity, enabling more effective personalization tailored to different identities.

🎯Through a comprehensive evaluation that incorporates both qualitative and quantitative analyses, their approach demonstrates clear improvements in image and video quality, lip-synchronization precision, and motion diversity.
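
Below is a conceptual sketch of audio-to-visual cross-attention applied at several levels (lip, expression, pose), with per-level weights standing in for the adaptive control mentioned above; module names, dimensions, and the weighting scheme are illustrative assumptions, not the released Hallo code:

```python
import torch
import torch.nn as nn

class AudioCrossAttn(nn.Module):
    """Visual tokens attend to audio tokens; residual so the guidance can be blended."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual_tokens, audio_tokens):
        out, _ = self.attn(visual_tokens, audio_tokens, audio_tokens)
        return visual_tokens + out

visual = torch.randn(1, 256, 64)    # UNet feature tokens (placeholder)
audio = torch.randn(1, 50, 64)      # audio embeddings for the clip (placeholder)

levels = {"lip": (AudioCrossAttn(), 1.0),
          "expression": (AudioCrossAttn(), 0.6),
          "pose": (AudioCrossAttn(), 0.3)}
for name, (module, weight) in levels.items():
    guided = module(visual, audio)
    visual = weight * guided + (1 - weight) * visual   # per-level strength (illustrative)
print(visual.shape)   # torch.Size([1, 256, 64])
```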

🏢Organization: @FudanUni , @Baidu_Inc , @ETH_en , @njuniversity

🧙Paper Authors: Mingwang Xu, Hui Li, Qingkun Su, Hanlin Shang, Liwei Zhang, Ce Liu, Jingdong Wang, Luc Van Gool, Yao Yao, @JoeSiyuZhu

1️⃣Read the Full Paper here: [2406.08801] Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

2️⃣Project Page: Homepage

3️⃣Code: GitHub - fudan-generative-vision/hallo: Hallo: Hierarchical Audio-Driven Visual Synthesis for Portrait Image Animation

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


A.I Generated explanation:

**New Research Paper Alert**

**Paper Title:** Hallo: A New Way to Animate Portrait Images Using Audio

**What's New:**

This paper introduces a new approach to animating portrait images using audio inputs. Instead of using traditional methods that rely on complex models to represent faces, the authors have developed a new system that uses a hierarchical audio-driven visual synthesis module.

**How it Works:**

The system takes an audio input (like a person's voice) and uses it to generate a corresponding visual output (like a moving image of the person's face). The system is designed to precisely align the audio and visual outputs, including lip movements, facial expressions, and head poses.

**Key Features:**

* The system uses a combination of advanced techniques, including diffusion-based generative models, a UNet-based denoiser, and temporal alignment techniques.
* The hierarchical audio-driven visual synthesis module allows for adaptive control over expression and pose diversity, making it possible to personalize the animation to different individuals.
* The system has been evaluated through a comprehensive analysis, and the results show significant improvements in image and video quality, lip synchronization precision, and motion diversity.

**Who's Behind the Research:**

The research was conducted by a team of authors from Fudan University, Baidu Inc, ETH Zurich, and Nanjing University.

**Want to Learn More:**

You can read the full paper, visit the project page, or explore the code on GitHub.
 

bnew


1/1
🚨CVPR 2024 (Oral) Paper Alert 🚨

➡️Paper Title: Seeing the World through Your Eyes

🌟Few pointers from the paper

🎯The reflective nature of the human eye is an underappreciated source of information about what the world around us looks like.

🎯By imaging the eyes of a moving person, we can collect multiple views of a scene outside the camera’s direct line of sight through the reflections in the eyes.

🎯In this paper, the authors reconstruct a 3D scene beyond the camera’s line of sight using portrait images containing eye reflections.

🎯This task is challenging due to:

➕the difficulty of accurately estimating eye poses
➕the entangled appearance of the eye iris and the scene reflections.

🎯Their method jointly refines the cornea poses, the radiance field depicting the scene, and the observer’s eye iris texture.

🎯 They further proposed a simple regularization prior on the iris texture pattern to improve reconstruction quality.

🎯Through various experiments on synthetic and real-world captures featuring people with varied eye colors, they demonstrated the feasibility of their approach for recovering 3D scenes from eye reflections.
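
Regarding the "simple regularization prior on the iris texture": one plausible instantiation is a total-variation smoothness penalty added to the photometric loss. The authors' exact prior may differ, so treat this purely as an assumption for intuition:

```python
import torch

def iris_tv_prior(iris_texture):
    """Assumed prior: penalize local variation of a learnable (3, H, W) iris texture
    that is jointly optimized with the cornea poses and the radiance field."""
    dh = (iris_texture[:, 1:, :] - iris_texture[:, :-1, :]).abs().mean()
    dw = (iris_texture[:, :, 1:] - iris_texture[:, :, :-1]).abs().mean()
    return dh + dw

tex = torch.rand(3, 64, 64, requires_grad=True)
print(iris_tv_prior(tex))   # would be weighted and added to the reconstruction loss
```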

🏢Organization: @UofMaryland , College Park

🧙Paper Authors: Hadi Alzayer, @kevinzhang25 , Brandon Feng, Christopher Metzler, @jbhuang0604

1️⃣Read the Full Paper here: [2306.09348] Seeing the World through Your Eyes

2️⃣Project Page: Seeing the World through Your Eyes

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Alex_Kizenkov from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

A.I Generated explanation:

**CVPR 2024 (Oral) Paper Alert**

**Paper Title:** Seeing the World through Your Eyes

**What's New:**

This paper is about a new way to use the reflections in people's eyes to see what's around them, even if it's not directly in front of the camera.

**How it Works:**

When a person moves, their eyes reflect the scene around them. By taking pictures of their eyes, we can collect multiple views of the scene, even if it's not directly in front of the camera. This paper shows how to use these reflections to recreate a 3D scene.

**Challenges:**

This task is hard because:

* It's difficult to accurately estimate the position and orientation of the eyes.
* The reflection in the eyes can be mixed up with the pattern of the iris (the colored part of the eye).

**Solution:**

The authors developed a method that refines the position and orientation of the eyes, the scene around them, and the pattern of the iris. They also added a simple rule to improve the quality of the reconstruction.

**Results:**

Through various experiments, they showed that their approach can successfully recover 3D scenes using eye reflections, even with people who have different eye colors.

**Who's Behind the Research:**

The research was conducted by a team of authors from the University of Maryland, College Park.

**Want to Learn More:**

You can read the full paper or visit the project page to learn more about this innovative approach.
 

bnew


1/1
🚨 SIGGRAPH 2024 Paper Alert 🚨

➡️Paper Title: Iterative Motion Editing with Natural Language

🌟Few pointers from the paper

🎯Text-to-motion diffusion models can generate realistic animations from text prompts, but do not support fine-grained motion editing controls.

🎯In this paper, the authors present a method for using natural language to iteratively specify local edits to existing character animations, a task that is common in most computer animation workflows.

🎯Their key idea is to represent a space of motion edits using a set of kinematic motion editing operators (MEOs) whose effects on the source motion are well aligned with user expectations.

🎯The authors provide an algorithm that leverages pre-existing language models to translate textual descriptions of motion edits into source code for programs that define and execute sequences of MEOs on a source animation (a hypothetical example of such a program appears after these pointers).

🎯MEOs are executed by first translating them into keyframe constraints and then using diffusion-based motion models to generate output motions that respect these constraints.

🎯Through a user study and quantitative evaluation, they demonstrated that their system can perform motion edits that respect the animator's editing intent, remain faithful to the original animation (it edits the original animation, but does not dramatically change it), and yield realistic character animation results.
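
To make the pipeline concrete, here is a hypothetical example of the kind of program an LLM could emit and how it might be lowered to keyframe constraints; the operator name, dataclass, and API are invented for illustration and do not match the paper's actual MEO library:

```python
from dataclasses import dataclass

@dataclass
class KeyframeConstraint:
    frame: int
    joint: str
    position_offset: tuple   # (x, y, z) offset in character space, meters

def raise_joint(joint, frames, height):
    """A toy MEO: 'raise <joint> by <height> over <frames>'."""
    return [KeyframeConstraint(f, joint, (0.0, height, 0.0)) for f in frames]

# Program an LLM might emit for the edit request "lift the left hand higher during the wave":
constraints = raise_joint("left_hand", frames=range(30, 60), height=0.15)
print(len(constraints), constraints[0])
# These constraints would then condition a diffusion-based motion model that regenerates
# the affected frames while staying close to the original animation.
```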

🏢Organization: @Stanford , @Snap

🧙Paper Authors: @purvigoel3 , @kcjacksonwang , C. Karen Liu, @kayvonf

1️⃣Read the Full Paper here: [2312.11538] Iterative Motion Editing with Natural Language

2️⃣Project Page: Iterative Motion Editing with Natural Language

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#texttomotion #diffusionmodels #python #SIGGRAPH2024 #LLMs


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

A.I Generated explanation:

**Big News in Computer Animation**

A new research paper has been published that could change the way animators work. The paper is about a new way to edit animations using natural language (like talking to a computer).

**The Problem**

Right now, computers can create animations from text descriptions, but it's hard to make small changes to those animations. It's like trying to edit a video by telling a computer what to do, but the computer doesn't understand exactly what you want.

**The Solution**

The researchers have found a way to use natural language to make small changes to animations. They created a system that can understand what an animator wants to change and make those changes to the animation. This system uses special "motion editing operators" that can be combined to make specific changes to an animation.

**How it Works**

Here's how it works:

1. An animator types a description of the change they want to make (e.g. "make the character's arm move up and down").
2. The system translates that description into a set of instructions that the computer can understand.
3. The system uses those instructions to make the change to the animation.

**The Results**

The researchers tested their system and found that it works well. It can make changes to animations that are realistic and match what the animator wanted to do. This could make it easier for animators to work on complex animations.

**More Info**

If you want to learn more, you can read the full paper or visit the project page. There's even a demo video that shows the system in action
 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: Real2Code: Reconstruct Articulated Objects via Code Generation

🌟Few pointers from the paper

🎯In this paper, the authors present “Real2Code”, a novel approach to reconstructing articulated objects via code generation.

🎯Given visual observations of an object, they first reconstruct its part geometry using an image segmentation model and a shape completion model.

🎯Then they represent the object parts with oriented bounding boxes, which are input to a fine-tuned large language model (LLM) to predict joint articulation as code.

🎯By leveraging pre-trained vision and language models, their approach scales elegantly with the number of articulated parts, and generalizes from synthetic training data to real world objects in unstructured environments.

🎯Experimental results demonstrate that Real2Code significantly outperforms previous state-of-the-art in reconstruction accuracy, and is the first approach to extrapolate beyond objects’ structural complexity in the training set, and reconstructs objects with up to 10 articulated parts.

🎯When incorporated with a stereo reconstruction model, Real2Code also generalizes to real world objects from a handful of multi-view RGB images, without the need for depth or camera information.
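
For illustration, this is the flavor of articulation "code" an LLM could predict from part-level oriented bounding boxes; the schema, field names, and values below are invented and are not the paper's exact output format:

```python
from dataclasses import dataclass

@dataclass
class Joint:
    child: str
    parent: str
    joint_type: str      # "revolute" or "prismatic"
    axis: tuple          # joint axis in the parent frame
    origin: tuple        # pivot position in the parent frame
    limits: tuple        # (lower, upper) in radians or meters

predicted_joints = [
    Joint(child="door", parent="cabinet_body", joint_type="revolute",
          axis=(0, 0, 1), origin=(0.25, 0.0, 0.0), limits=(0.0, 1.57)),
    Joint(child="drawer", parent="cabinet_body", joint_type="prismatic",
          axis=(1, 0, 0), origin=(0.0, 0.2, 0.0), limits=(0.0, 0.35)),
]
for j in predicted_joints:
    print(f"{j.child} -> {j.parent}: {j.joint_type} about {j.axis}")
```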

🏢Organization: @Stanford , @Columbia

🧙Paper Authors: @ZhaoMandi , @YijiaWeng , Dominik Bauer, @SongShuran

1️⃣Read the Full Paper here: [2406.08474] Real2Code: Reconstruct Articulated Objects via Code Generation

2️⃣Project Page: Real2Code: Reconstruct Articulated Objects with Code Generation

3️⃣Code: GitHub - MandiZhao/real2code

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

A.I Generated explanation:

**Title:** Real2Code: Reconstructing Articulated Objects via Code Generation

**What's it about?**

Imagine you have a toy robot with many moving parts, like arms and legs. This paper is about a new way to use computers to recreate the shape and movement of this robot, just by looking at pictures of it.

**How does it work?**

The computer uses two main steps to recreate the robot:

1. **Step 1: Break it down** The computer looks at the pictures and breaks the robot down into its individual parts, like the arms and legs. It uses special algorithms to figure out the shape of each part.
2. **Step 2: Write the code** The computer then uses a special kind of artificial intelligence called a "large language model" to write code that describes how all the parts fit together and move. This code is like a set of instructions that a robot could follow to move its arms and legs.

**What's special about this?**

This new approach is special because it can handle robots with many moving parts, and it can even recreate robots that it's never seen before. It's like the computer is able to imagine how all the parts fit together, even if it's never seen that particular robot before.

**What are the benefits?**

This technology could be used in many areas, such as:

* Robotics: to help robots understand and interact with their environment
* Computer vision: to improve the ability of computers to understand and interpret visual data
* Manufacturing: to help design and build complex objects with many moving parts

**Where can I learn more?**

You can read the full paper here: [2406.08474] Real2Code: Reconstruct Articulated Objects via Code Generation

You can also visit the project page here: Real2Code: Reconstruct Articulated Objects with Code Generation

And you can even access the code here: GitHub - MandiZhao/real2code
 

bnew


1/1
🚨CVPR 2024 (Highlight) Paper Alert 🚨

➡️Paper Title: Real-Time Simulated Avatar from Head-Mounted Sensors

🌟Few pointers from the paper

⚛️In this paper, the authors present “SimXR”, a method for controlling a simulated avatar from information (headset pose and cameras) obtained from AR/VR headsets.

🎯Due to the challenging viewpoint of head-mounted cameras, the human body is often clipped out of view, making traditional image-based egocentric pose estimation methods challenging.

🎯On the other hand, headset poses provide valuable information about overall body motion, but lack fine-grained details about the hands and feet. To synergize headset poses with cameras, they controlled a humanoid to track headset movement while analyzing input images to decide body movement.

🎯When body parts are seen, the movements of hands and feet will be guided by the images; when unseen, the laws of physics guide the controller to generate plausible motion.

🎯They designed an end-to-end method that does not rely on any intermediate representations and learns to directly map from images and headset poses to humanoid control signals.

🎯To train their method, they also proposed a large-scale synthetic dataset created using camera configurations compatible with a commercially available VR headset (Quest 2), and showed promising results on real-world captures.

🎯To demonstrate the applicability of their framework, they also tested it on an AR headset with a forward-facing camera.
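
A toy, shape-only sketch of the end-to-end mapping described above (headset camera images plus headset pose directly to humanoid control signals, with no intermediate pose representation); the network layout, input sizes, and output parameterization are assumptions, not the SimXR model:

```python
import torch
import torch.nn as nn

class SimXRLikePolicy(nn.Module):
    def __init__(self, num_joints=23):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Conv2d(3, 16, 5, stride=4), nn.ReLU(),
                                     nn.Conv2d(16, 32, 5, stride=4), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Sequential(nn.Linear(32 + 7, 128), nn.ReLU(),
                                  nn.Linear(128, num_joints * 3))   # e.g. per-joint PD targets

    def forward(self, headset_images, headset_pose):
        feat = self.img_enc(headset_images)                 # (B, 32) image features
        return self.head(torch.cat([feat, headset_pose], dim=-1))

policy = SimXRLikePolicy()
ctrl = policy(torch.randn(1, 3, 128, 128), torch.randn(1, 7))   # pose = position + quaternion
print(ctrl.shape)   # torch.Size([1, 69])
```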

🏢Organization: @RealityLabs , @CarnegieMellon

🧙Paper Authors: @zhengyiluo , @jinkuncao , @Me_Rawal , @awinkler_ , Jing Huang, @kkitani , @xuweipeng000

1️⃣Read the Full Paper here: [2403.06862] Real-Time Simulated Avatar from Head-Mounted Sensors

2️⃣Project Page: SimXR

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024 #AR #VR #avatars


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

A.I Generated explanation:

Title: Real-Time Simulated Avatar from Head-Mounted Sensors

Summary:

This paper presents a new method called "SimXR" that uses data from AR/VR headsets to control a simulated avatar in real-time. The avatar can move and act like a real person, even when the cameras on the headset can't see the whole body.

The Problem:

When you wear an AR/VR headset, the cameras on it can't always see your whole body. This makes it hard for the computer to figure out what your body is doing. The headset can track your head movements, but it can't see your hands and feet very well.

The Solution:

The SimXR method uses a combination of the headset's movement data and images from the cameras to control the avatar. When the cameras can see your body parts, the avatar moves accordingly. When they can't see them, the computer uses physics rules to make the avatar move in a way that looks realistic.

How it Works:

The method uses a special algorithm that can directly map the headset data and images to control the avatar. It doesn't need any intermediate steps, which makes it faster and more efficient.

Training the Model:

The researchers created a large dataset of synthetic images that mimic what the cameras on a VR headset would see. They used this dataset to train their model, and it worked well on real-world data too.

Testing the Method:

They tested the method on an AR headset with a forward-facing camera and got promising results.

Who Did This Research:

The research was done by a team from Reality Labs and Carnegie Mellon University.

Read More:

You can read the full paper here: [2403.06862] Real-Time Simulated Avatar from Head-Mounted Sensors

Or visit the project page here: SimXR
 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: HumanPlus : Humanoid Shadowing and Imitation from Humans

🌟Few pointers from the paper

🎯 In this paper, authors have introduced a full-stack system for humanoids to learn motion and autonomous skills from human data.

🎯They first trained a low-level policy in simulation via reinforcement learning using existing 40-hour human motion datasets.

🎯This policy transfers to the real world and allows humanoid robots to follow human body and hand motion in real time using only an RGB camera, i.e., shadowing.

🎯Through shadowing, human operators can teleoperate humanoids to collect whole-body data for learning different tasks in the real world.

🎯Using the collected data, the authors then performed supervised behavior cloning to train skill policies with egocentric vision, allowing humanoids to complete different tasks autonomously by imitating human skills (a minimal behavior-cloning sketch follows these pointers).

🎯They demonstrated the system on their customized 33-DoF 180cm humanoid, autonomously completing tasks such as wearing a shoe to stand up and walk, unloading objects from warehouse racks, folding a sweatshirt, rearranging objects, typing, and greeting another robot with 60-100% success rates using up to 40 demonstrations.
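
Here is a minimal, hedged sketch of the supervised behavior-cloning step on shadowing-collected demonstrations; the observation and action dimensions, the network, and the placeholder data are assumptions rather than the HumanPlus training code:

```python
import torch
import torch.nn as nn

# Assumed setup: 256-d egocentric image features plus 33-DoF proprioception -> 33-DoF targets.
obs_dim, act_dim = 256 + 33, 33
policy = nn.Sequential(nn.Linear(obs_dim, 512), nn.ReLU(),
                       nn.Linear(512, 512), nn.ReLU(),
                       nn.Linear(512, act_dim))
opt = torch.optim.Adam(policy.parameters(), lr=3e-4)

demo_obs = torch.randn(4000, obs_dim)     # placeholder teleoperation dataset
demo_act = torch.randn(4000, act_dim)

for epoch in range(10):
    idx = torch.randperm(4000)[:256]
    loss = nn.functional.mse_loss(policy(demo_obs[idx]), demo_act[idx])
    opt.zero_grad()
    loss.backward()
    opt.step()
```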

🏢Organization: @Stanford

🧙Paper Authors: @zipengfu , @qingqing_zhao_ , @Qi_Wu577 , @GordonWetzstein , @chelseabfinn

1️⃣Read the Full Paper here: https://humanoid-ai.github.io/HumanPlus.pdf

2️⃣Project Page: HumanPlus: Humanoid Shadowing and Imitation from Humans

3️⃣Code: GitHub - MarkFzp/humanplus: HumanPlus: Humanoid Shadowing and Imitation from Humans

4️⃣Hardware: HumanPlus 🤖️ Hardware Tutorial

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by John Rush from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#humanoid #teleoperation


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

A.I Generated explanation:

**Title:** HumanPlus: Humanoid Shadowing and Imitation from Humans

**Summary:** Researchers at Stanford have created a system that allows humanoid robots to learn from humans and perform tasks on their own. They did this by:

**Step 1:** Training a robot in a simulation using data from 40 hours of human movement.

**Step 2:** Using a camera, the robot can copy human movements in real-time, like a shadow.

**Step 3:** A human operator can control the robot to collect data on how to perform tasks, like picking up objects.

**Step 4:** The robot uses this data to learn how to perform tasks on its own, like folding a shirt or typing.

**Results:** The robot was able to perform tasks with a success rate of 60-100% using up to 40 demonstrations.

**What it means:** This system allows robots to learn from humans and perform tasks autonomously, which could be useful in many areas, such as warehouses, hospitals, or homes.
 

bnew


1/1
🔥 Product Update 🔥

@LumaLabsAI has just launched their next-generation video model, “Dream Machine”, which makes high-quality, realistic videos quickly from text and images.

🚀Few Insights about this Model

🪩 It is a highly scalable and efficient transformer model trained directly on videos, making it capable of generating physically accurate, consistent, and eventful shots.

🪩It is an incredibly fast video generator! 120 frames in 120s.

🪩This Video Model generates 5s shots with a realistic smooth motion, cinematography, and drama.

🪩This Video Model understands how people, animals, and objects interact with the physical world, which allows you to create videos with great character consistency and accurate physics.

🪩This Video Model helps you experiment with an endless array of fluid, cinematic and naturalistic camera motions matching the emotion and contents of the scene.

Prepare to have your mind blown! 🤯🔊🔊 Brace yourself for the attached demo video—it’s about to be an adrenaline rush! 🚀🎥

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


A.I Generated explanation:

Product Update

A company called LumaLabsAI has just released a new tool called "Dream Machine" that can create high-quality, realistic videos quickly using just text and images.

Few Insights about this Model

* Scalable and Efficient: This tool is very good at handling large amounts of data and can generate videos quickly. It's like a super-fast video maker!
* Fast Video Generation: It can create 120 frames of video in just 120 seconds. That's really, really fast!
* Realistic Videos: The videos it creates have smooth motion, good camera work, and a sense of drama. They look like real videos, not fake ones.
* Understanding the World: This tool understands how people, animals, and objects interact with the world around them. This means it can create videos that look realistic and have consistent characters.
* Cinematic Camera Motions: It can create videos with camera movements that match the emotion and content of the scene. This makes the videos look more professional and engaging.
 