bnew


1/1
🚨Paper Alert 🚨

➑️Paper Title: Hierarchical World Models as Visual Whole-Body Humanoid Controllers

🌟Few pointers from the paper

🎯Whole-body control for humanoids is challenging due to the high-dimensional nature of the problem, coupled with the inherent instability of a bipedal morphology. Learning from visual observations further exacerbates this difficulty.

🎯 In this work, authors have explored highly data-driven approaches to visual whole-body humanoid control based on reinforcement learning, without any simplifying assumptions, reward design, or skill primitives.

🎯Specifically, authors have proposed a hierarchical world model in which a high-level agent generates commands based on visual observations for a low-level agent to execute, both of which are trained with rewards.

🎯Their approach produces highly performant control policies in 8 tasks with a simulated 56-DoF humanoid, while synthesizing motions that are broadly preferred by humans.
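
To make the hierarchy above concrete, here is a minimal Python sketch of the two-level control loop: a high-level agent reads visual observations and emits a command at a lower rate, and a low-level agent turns the latest command plus proprioception into joint-level actions, with the environment supplying rewards to both levels. All names, dimensions, and the random placeholder policies are assumptions for illustration; the paper's agents are learned RL policies with world models, not the stubs shown here.

```python
# Hedged sketch of a hierarchical visual whole-body control loop (all names/sizes are assumptions).
import numpy as np

class HighLevelAgent:
    """Maps visual observations to a low-dimensional command for the tracking agent."""
    def __init__(self, command_dim=8, seed=0):
        self.rng = np.random.default_rng(seed)
        self.command_dim = command_dim
    def act(self, visual_obs):
        # placeholder for a learned policy over image features
        return self.rng.normal(size=self.command_dim)

class LowLevelAgent:
    """Maps proprioception plus the current command to joint-level actions (56-DoF humanoid)."""
    def __init__(self, action_dim=56, seed=1):
        self.rng = np.random.default_rng(seed)
        self.action_dim = action_dim
    def act(self, proprio_obs, command):
        # placeholder for a learned command-conditioned tracking policy
        return np.tanh(self.rng.normal(size=self.action_dim))

def dummy_env_step(action):
    """Stand-in simulator: returns (visual_obs, proprio_obs, reward)."""
    return np.zeros((64, 64, 3)), np.zeros(151), 0.0

def rollout(steps=100, command_every=5):
    hi, lo = HighLevelAgent(), LowLevelAgent()
    visual_obs, proprio_obs, _ = dummy_env_step(np.zeros(56))
    command = hi.act(visual_obs)
    total_reward = 0.0
    for t in range(steps):
        if t % command_every == 0:          # the high level runs at a lower control rate
            command = hi.act(visual_obs)
        action = lo.act(proprio_obs, command)
        visual_obs, proprio_obs, reward = dummy_env_step(action)
        total_reward += reward              # during training, rewards drive both levels
    return total_reward

print(rollout())
```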

🏒Organization: @UCSanDiego , @nyuniversity , @AIatMeta

πŸ§™Paper Authors: @ncklashansen , @jyothir_s_v , @vlad_is_ai , @ylecun , @xiaolonw , @haosu_twitr

1️⃣Read the Full Paper here: [2405.18418] Hierarchical World Models as Visual Whole-Body Humanoid Controllers

2️⃣Project Page: Puppeteer

3️⃣Code: GitHub - nicklashansen/puppeteer: Code for "Hierarchical World Models as Visual Whole-Body Humanoid Controllers"

4️⃣Models: puppeteer – Google Drive

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

🎡 Music by Yevgeniy Sorokin from @pixabay

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:


**Title:** Hierarchical World Models as Visual Whole-Body Humanoid Controllers

**Summary:** This paper is about creating a computer system that can control a humanoid robot (a robot that looks like a human) using only visual observations (like a camera). This is a challenging problem because the robot has many moving parts and can be unstable.

**Key Points:**

* The researchers used a type of artificial intelligence called reinforcement learning to teach the robot how to move.
* They didn't use any simplifications or assumptions to make the problem easier, which makes their approach more realistic.
* They created a hierarchical system, where one part of the system (the "high-level agent") tells another part (the "low-level agent") what to do based on what it sees.
* They tested their system on a simulated robot with 56 moving parts and were able to get it to perform well on 8 different tasks.
* The movements the robot made were also preferred by humans.

**Authors and Organizations:**

* The researchers are from the University of California, San Diego, New York University, and Meta AI.
* The authors are Nicklas Hansen, Jyothir S V, Vlad Sobal, Yann LeCun, Xiaolong Wang, and Hao Su.

**Resources:**

* You can read the full paper here: [2405.18418] Hierarchical World Models as Visual Whole-Body Humanoid Controllers
* You can visit the project page here: Puppeteer
* You can access the code here: GitHub - nicklashansen/puppeteer: Code for "Hierarchical World Models as Visual Whole-Body Humanoid Controllers"
* You can access the models here: puppeteer – Google Drive
 

bnew


1/1
🚨Paper Alert 🚨

➑️Paper Title: NPGA: Neural Parametric Gaussian Avatars

🌟Few pointers from the paper

🎯The creation of high-fidelity, digital versions of human heads is an important stepping stone in the process of further integrating virtual components into our everyday lives.

🎯Constructing such avatars is a challenging research problem, due to a high demand for photo-realism and real-time rendering performance.

βš›οΈIn this work, authors have proposed β€œNeural Parametric Gaussian Avatars” (NPGA), a data-driven approach to create high-fidelity, controllable avatars from multi-view video recordings.

🎯They build their method around 3D Gaussian splatting for its highly efficient rendering and to inherit the topological flexibility of point clouds.

🎯In contrast to previous work, they conditioned their avatars' dynamics on the rich expression space of neural parametric head models (NPHM), instead of mesh-based 3DMMs.

🎯To this end, they distilled the backward deformation field of their underlying NPHM into forward deformations which are compatible with rasterization-based rendering.

🎯 All remaining fine-scale, expression-dependent details are learned from the multi-view videos. To increase the representational capacity of their avatars, they augmented the canonical Gaussian point cloud using per-primitive latent features which govern its dynamic behavior.

🎯 To regularize this increased dynamic expressivity, authors have proposed Laplacian terms on the latent features and predicted dynamics. They evaluated their method on the public NeRSemble dataset, demonstrating that NPGA significantly outperforms the previous state-of-the-art avatars on the self-reenactment task by ≈ 2.6 PSNR.
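
As a rough illustration of the Laplacian regularization mentioned in the pointer above, the sketch below penalizes each per-primitive latent feature for deviating from the mean of its neighbours on a k-nearest-neighbour graph over the canonical Gaussian centres. The graph construction and weighting here are assumptions for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a Laplacian-style smoothness term on per-primitive latent features.
import numpy as np

def knn_indices(points, k=6):
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]            # (N, k) neighbour ids

def laplacian_term(values, nbrs):
    """Penalise each per-primitive value for deviating from the mean of its neighbours."""
    neigh_mean = values[nbrs].mean(axis=1)           # (N, D)
    return ((values - neigh_mean) ** 2).sum(axis=1).mean()

N, D = 1000, 16
pts = np.random.rand(N, 3)                           # canonical Gaussian centres
feats = np.random.randn(N, D)                        # per-primitive latent features
nbrs = knn_indices(pts)
loss = laplacian_term(feats, nbrs)                   # would be added to the photometric loss with a small weight
print(loss)
```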

🏒Organization: @TU_Muenchen , @synthesiaIO , @ucl

πŸ§™Paper Authors: @SGiebenhain , Tobias Kirschstein, Martin RΓΌnz, @LourdesAgapito , @MattNiessner

1️⃣Read the Full Paper here: [2405.19331] NPGA: Neural Parametric Gaussian Avatars

2️⃣Project Page: NPGA: Neural Parametric Gaussian Avatars

3️⃣Code: Coming πŸ”œ

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

🎡 Music by Sergio Prosvirini from @pixabay

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#3dgaussiansplatting




A.I Generated explanation:

**Title:** NPGA: Neural Parametric Gaussian Avatars

**What's it about?**

Creating digital versions of human heads that look super realistic is important for integrating virtual components into our daily lives. However, making these digital heads, called avatars, is a tough problem because they need to look very realistic and be able to move smoothly in real-time.

**What did the researchers do?**

The researchers came up with a new way to create these avatars using a technique called "Neural Parametric Gaussian Avatars" (NPGA). They used videos taken from multiple angles to create these avatars, which can be controlled and moved around.

**How did they do it?**

They used a combination of two techniques: 3D Gaussian splatting (which is fast and flexible) and neural parametric head models (which can capture a wide range of facial expressions). They also added some extra details to the avatars to make them look more realistic.

**What's the result?**

The researchers tested their method on a public dataset and found that their avatars looked much better than previous ones. They also made sure that the avatars could move smoothly and naturally.

**Who did the research?**

The research was done by a team from the Technical University of Munich, Synthesia, and University College London.

**Want to learn more?**

You can read the full paper here: [2405.19331] NPGA: Neural Parametric Gaussian Avatars or check out the project page here: NPGA: Neural Parametric Gaussian Avatars. The code for the project will be available soon
 

bnew


1/1
🚨CVPR 2024 Paper Alert 🚨

➑️Paper Title: IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing

🌟Few pointers from the paper

🎯In this paper authors have presented IntrinsicAvatar, a novel approach to recovering the intrinsic properties of clothed human avatars including geometry, albedo, material, and environment lighting from only monocular videos.

🎯Recent advancements in human-based neural rendering have enabled high-quality geometry and appearance reconstruction of clothed humans from just monocular videos.

🎯However, these methods bake intrinsic properties such as albedo, material, and environment lighting into a single entangled neural representation.

🎯On the other hand, only a handful of works tackle the problem of estimating geometry and disentangled appearance properties of clothed humans from monocular videos. They usually achieve limited quality and disentanglement due to approximations of secondary shading effects via learned MLPs.

🎯In this work, authors have proposed to model secondary shading effects explicitly via Monte-Carlo ray tracing. They modeled the rendering process of clothed humans as a volumetric scattering process, and combined ray tracing with body articulation.

🎯Their approach can recover high-quality geometry, albedo, material, and lighting properties of clothed humans from a single monocular video, without requiring supervised pre-training using ground truth materials.

🎯 Furthermore, since they explicitly model the volumetric scattering process and ray tracing, their model naturally generalizes to novel poses, enabling animation of the reconstructed avatar in novel lighting conditions.
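
The paper's key move is modeling shading with explicit Monte-Carlo ray tracing rather than a learned MLP approximation. The toy sketch below only illustrates the basic estimator idea at a single surface point, with a Lambertian BRDF and an invented environment light; the actual method wraps this into a volumetric scattering formulation combined with body articulation, which is not reproduced here.

```python
# Hedged sketch: Monte-Carlo estimate of the shading integral at one surface point.
import numpy as np

def sample_hemisphere(n, normal, rng):
    v = rng.normal(size=(n, 3))
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    v[np.dot(v, normal) < 0] *= -1                   # flip samples into the upper hemisphere
    return v

def env_light(dirs):
    # toy sky light, brighter toward +z (an assumption, not the paper's lighting model)
    return np.clip(dirs[:, 2:3], 0, 1) * np.array([1.0, 0.9, 0.8])

def shade(albedo, normal, n_samples=256, seed=0):
    rng = np.random.default_rng(seed)
    wi = sample_hemisphere(n_samples, normal, rng)
    cos = np.clip(wi @ normal, 0, None)[:, None]
    brdf = albedo / np.pi                            # Lambertian BRDF
    # uniform-hemisphere pdf = 1 / (2*pi), hence the 2*pi factor in the estimator
    return (brdf * env_light(wi) * cos * 2 * np.pi).mean(axis=0)

print(shade(albedo=np.array([0.6, 0.4, 0.3]), normal=np.array([0.0, 0.0, 1.0])))
```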

🏢Organization: @ETH_en , University of Tübingen, Tübingen AI Center

πŸ§™Paper Authors: @sfwang0928 , @anticboz , Andreas Geiger (@AutoVisionGroup ), @SiyuTang3

1️⃣Read the Full Paper here: [2312.05210] IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing

2️⃣Project Page: IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing

3️⃣Code: GitHub - taconite/IntrinsicAvatar: [CVPR 2024] IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

🎡Music by Riley Clent from @pixabay

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:

**Title:** IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing

**What's it about:** This paper is about creating a computer program that can take a video of a person and recreate a 3D model of that person, including their clothes, skin tone, and the lighting around them. This is a hard problem because the video only shows the person from one angle, and the program has to figure out what the person looks like from all sides.

**The problem:** Current methods can create a 3D model of a person from a video, but they have some limitations. They can't separate the person's skin tone, clothes, and lighting into individual components, and they don't work well when the person is moving or in different lighting conditions.

**The solution:** The authors of this paper have come up with a new approach that uses a technique called "ray tracing" to create a more accurate 3D model of the person. Ray tracing is a way of simulating how light behaves in the real world, which helps the program to better understand how the person looks in different lighting conditions. This approach can create a more detailed and realistic 3D model of the person, including their clothes, skin tone, and lighting.

**The benefits:** This new approach has several advantages. It can create a 3D model of the person that looks more realistic and detailed, and it can do this without needing a lot of training data. It can also animate the 3D model in different lighting conditions, which is useful for applications like video games or virtual reality.

**The team:** The paper was written by a team of researchers from ETH Zurich, the University of Tübingen, and the Tübingen AI Center.

**Resources:**

* Read the full paper here: https://arxiv.org/abs/2312.05210
* Project page: https://neuralbodies.github.io/IntrinsicAvatar/
* Code: https://github.com/taconite/IntrinsicAvatar
 

bnew


1/2
🚨CVPR 2024 Paper Alert 🚨

➑️Paper Title: HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting

🌟Few pointers from the paper

🎯In this paper authors have presented "HiFi4G", an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage.

🎯Their core intuition is to marry the 3D Gaussian representation with non-rigid tracking, achieving a compact and compression-friendly representation.

🎯They first proposed a dual-graph mechanism to obtain motion priors, with a coarse deformation graph for effective initialization and a fine-grained Gaussian graph to enforce subsequent constraints.

🎯Then, they utilized a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers to effectively balance the non-rigid prior and Gaussian updating.

🎯They also presented a companion compression scheme with residual compensation for immersive experiences on various platforms.

🎯 It achieves a substantial compression rate of approximately 25 times, with less than 2MB of storage per frame. Extensive experiments demonstrate the effectiveness of their approach, which significantly outperforms existing approaches in terms of optimization speed, rendering quality, and storage overhead.
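
To give a feel for the residual-compensation idea above, here is a hedged sketch: per-frame Gaussian parameters are stored as quantized residuals against a keyframe. The bit width, parameter count, and layout are illustrative assumptions; the paper's actual codec reaches roughly 25x compression and under 2 MB per frame, which this toy version does not.

```python
# Hedged sketch of keyframe + quantized-residual storage for per-frame Gaussian parameters.
import numpy as np

def quantize(x, bits=8):
    lo, hi = x.min(), x.max()
    q = np.round((x - lo) / (hi - lo + 1e-8) * (2**bits - 1)).astype(np.uint8)
    return q, (lo, hi)

def dequantize(q, lohi, bits=8):
    lo, hi = lohi
    return q.astype(np.float32) / (2**bits - 1) * (hi - lo) + lo

key = np.random.randn(100_000, 14).astype(np.float32)            # ~100k Gaussians x 14 params (assumed layout)
frame = key + 0.01 * np.random.randn(*key.shape).astype(np.float32)

q, lohi = quantize(frame - key)                                   # residual w.r.t. the keyframe
recon = key + dequantize(q, lohi)

raw_mb = frame.nbytes / 2**20
comp_mb = q.nbytes / 2**20                                        # 8-bit residuals: 4x smaller in this toy setup
print(f"raw {raw_mb:.1f} MB -> residuals {comp_mb:.1f} MB, max err {np.abs(recon - frame).max():.4f}")
```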

🏒Organization: @ShanghaiTechUni , NeuDim, @BytedanceTalk , DGene

πŸ§™Paper Authors: Yuheng Jiang, Zhehao Shen, Penghao Wang, Zhuo Su, Yu Hong, Yingliang Zhang, Jingyi Yu, Lan Xu

1️⃣Read the Full Paper here: [2312.03461] HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting

2️⃣Project Page: HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting

πŸŽ₯ Be sure to watch the attached Video-Sound on πŸ”ŠπŸ”Š

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024 #gaussiansplatting

2/2
Can't fade that Gaussian blur




A.I Generated explanation:

**Title:** HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting

**What's it about?**

This paper is about creating a new way to render high-quality human performances (like dancing or acting) from video footage. The goal is to make it look super realistic and detailed, while also making it easy to store and transmit.

**How does it work?**

The authors came up with a new approach called HiFi4G, which uses a combination of 3D Gaussian representations and non-rigid tracking to create a compact and efficient way to render human performances. Here's a simplified breakdown of the steps:

1. **Dual-graph mechanism**: They create two graphs to help initialize and refine the motion of the human performance. One graph is coarse and helps with the initial setup, while the other graph is fine-grained and enforces more detailed constraints.
2. **4D Gaussian optimization**: They use a special optimization scheme to balance the non-rigid prior (which helps with the overall motion) and the Gaussian updating (which refines the details). This helps to create a smooth and realistic performance.
3. **Compression scheme**: They also developed a companion compression scheme that reduces the amount of data needed to store the performance. This makes it possible to store and transmit the data more efficiently.

**Results?**

The authors claim that their approach achieves a significant compression rate of about 25 times, with less than 2MB of storage per frame. They also show that their approach outperforms existing methods in terms of optimization speed, rendering quality, and storage overhead.

**Who's behind it?**

The paper is a collaboration between researchers from ShanghaiTech University, NeuDim, Bytedance, and DGene. The authors are Yuheng Jiang, Zhehao Shen, Penghao Wang, Zhuo Su, Yu Hong, Yingliang Zhang, Jingyi Yu, and Lan Xu.

**Want to learn more?**

You can read the full paper here: [2312.03461] HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting

Or check out the project page here: HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting
 

bnew


1/1
🚨Paper Alert 🚨

➑️Paper Title: MotionLLM: Understanding Human Behaviors from Human Motions and Videos

🌟Few pointers from the paper

🎯This study delves into the realm of multi-modality (i.e., video and motion modalities) human behavior understanding by leveraging the powerful capabilities of Large Language Models (LLMs).

🎯Diverging from recent LLMs designed for video-only or motion-only understanding, authors argued that understanding human behavior necessitates joint modeling from both videos and motion sequences (e.g., SMPL sequences) to capture nuanced body part dynamics and semantics effectively.

🎯 In this paper, authors have presented "MotionLLM", a straightforward yet effective framework for human motion understanding, captioning, and reasoning.

🎯Specifically, MotionLLM adopts a unified video-motion training strategy that leverages the complementary advantages of existing coarse video-text data and fine-grained motion-text data to glean rich spatial-temporal insights.

🎯Furthermore, they collected a substantial dataset, MoVid, comprising diverse videos, motions, captions, and instructions.

🎯Additionally, authors have also proposed the MoVid-Bench, with carefully manual annotations, for better evaluation of human behavior understanding on video and motion.
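
A minimal sketch of what a unified video-motion training step could look like: two modality projectors map video features and motion features into one shared token space consumed by a single language-model backbone, and batches from the coarse video-text and fine-grained motion-text data are alternated. Module names, feature sizes, and the placeholder loss are assumptions, not MotionLLM's actual architecture.

```python
# Hedged sketch of alternating video-text and motion-text batches through one shared backbone.
import torch
import torch.nn as nn

class ModalityProjector(nn.Module):
    def __init__(self, in_dim, d_model=512):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)
    def forward(self, feats):                        # (B, T, in_dim) -> (B, T, d_model)
        return self.proj(feats)

video_proj = ModalityProjector(in_dim=768)           # e.g. features from a frozen video encoder (assumed size)
motion_proj = ModalityProjector(in_dim=263)          # e.g. SMPL/HumanML3D-style motion features (assumed size)
llm_stub = nn.TransformerEncoder(nn.TransformerEncoderLayer(512, 8, batch_first=True), 2)

def training_step(batch):
    feats, kind = batch                              # alternate "video" and "motion" batches
    tokens = video_proj(feats) if kind == "video" else motion_proj(feats)
    out = llm_stub(tokens)                           # stand-in for the LLM with a captioning loss
    return out.mean()                                # placeholder loss

loss_v = training_step((torch.randn(4, 32, 768), "video"))
loss_m = training_step((torch.randn(4, 64, 263), "motion"))
print(loss_v.item(), loss_m.item())
```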

🏒Organization: @Tsinghua_Uni , School of Data Science, Shenzhen Research Institute of Big Data, @cuhksz ,@IDEACVR , @hkust

πŸ§™Paper Authors: @Evan_THU , @ShunlinL , Ailing Zeng, Hao Zhang, @wabyking , @RamonCheung , @leizhangcs

1️⃣Read the Full Paper here: [2405.20340] MotionLLM: Understanding Human Behaviors from Human Motions and Videos

2️⃣Project Page: MotionLLM: Understanding Human Behaviors from Human Motions and Videos

3️⃣Code: GitHub - IDEA-Research/MotionLLM: [Arxiv-2024] MotionLLM: Understanding Human Behaviors from Human Motions and Videos

4️⃣Demo: http://demo.humotionx.com/

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

🎡 Music by Oleksii Holubiev from @pixabay

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:

**Title:** MotionLLM: Understanding Human Behaviors from Human Motions and Videos

**Summary:** This paper is about a new way to understand human behavior by analyzing both videos and motion data (like 3D animations). The researchers created a system called MotionLLM that can look at videos and motion data together to understand what people are doing and why.

**Key Points:**

* Most systems only look at videos or motion data separately, but this system combines both to get a better understanding of human behavior.
* The system is called MotionLLM and it's a simple but effective way to understand human motion and behavior.
* The researchers collected a large dataset of videos, motion data, and captions to train their system.
* They also created a special benchmark to test how well their system can understand human behavior.
* The system can be used to analyze videos and motion data to understand what people are doing and why.

**Authors:** The paper was written by a team of researchers from several universities and institutions.

**Resources:**

* You can read the full paper here: [2405.20340] MotionLLM: Understanding Human Behaviors from Human Motions and Videos
* You can visit the project page here: MotionLLM: Understanding Human Behaviors from Human Motions and Videos
* You can find the code on GitHub here: GitHub - IDEA-Research/MotionLLM: [Arxiv-2024] MotionLLM: Understanding Human Behaviors from Human Motions and Videos
* You can see a demo of the system here: http://demo.humotionx.com/
 

bnew


1/1
🎡 Product Update 🎡

Imagine stepping into an office where humans and robots collaborate seamlessly.

The hum of machinery harmonizes with the click of keyboards, creating a symphony of productivity.

Watch a team of EVEs from @1x_tech work side by side with their human counterparts, transforming cluttered spaces into pristine oases.

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

bnew


1/1
🎡 Product Update 🎡

πŸŽ™οΈβ€œText to Sound Effects” by @elevenlabsio is hereπŸŽ™οΈ

🎢 Turn text into melodies!

🎹 Create symphonies with your words.

Try it now: Sign up

Compose away! 🎼✨

πŸŽ₯ Be sure to watch the attached Video-Sound on πŸ”ŠπŸ”Š

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.


 

bnew


1/1
🚨Paper Alert 🚨

➑️Paper Title: Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

🌟Few pointers from the paper

🎯In this paper, authors have introduced "Era3D", a novel multiview diffusion method that generates high-resolution multiview images from a single-view image.

🎯Despite significant advancements in multiview generation, existing methods still suffer from camera prior mismatch, inefficacy, and low resolution, resulting in poor-quality multiview images.

🎯Specifically, these methods assume that the input images should comply with a predefined camera type, e.g. a perspective camera with a fixed focal length, leading to distorted shapes when the assumption fails.

🎯Moreover, the full-image or dense multiview attention they employ leads to an exponential explosion of computational complexity as image resolution increases, resulting in prohibitively expensive training costs.

🎯To bridge the gap between assumption and reality, Era3D first proposes a diffusion-based camera prediction module to estimate the focal length and elevation of the input image, which allows their method to generate images without shape distortions.

🎯 Furthermore, a simple but efficient attention layer, named row-wise attention, is used to enforce epipolar priors in the multiview diffusion, facilitating efficient cross-view information fusion.

🎯Consequently, compared with state-of-the-art methods, Era3D generates high-quality multiview images with up to a 512×512 resolution while reducing computational complexity by 12×.
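
A small sketch of the row-wise attention idea, under the assumption that epipolar correspondences stay within the same image row across views: each query only attends to that row in every view, so the attention length per token drops from V*H*W to V*W. The shapes and the use of a stock multi-head attention layer are illustrative, not the paper's implementation.

```python
# Hedged sketch of row-wise multiview attention for cross-view fusion.
import torch
import torch.nn as nn

V, H, W, C = 4, 32, 32, 64                           # views, height, width, channels (toy sizes)
feats = torch.randn(V, H, W, C)

attn = nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)

# regroup so every batch element is one image row shared by all views: (H, V*W, C)
rows = feats.permute(1, 0, 2, 3).reshape(H, V * W, C)
fused, _ = attn(rows, rows, rows)                    # cross-view fusion, but only within each row
fused = fused.reshape(H, V, W, C).permute(1, 0, 2, 3)
print(fused.shape)                                   # torch.Size([4, 32, 32, 64])
```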

🏒Organization: @hkust , @HKUniversity , DreamTech, PKU, LightIllusion

πŸ§™Paper Authors: Peng Li, @YuanLiu41955461 , @xxlong0 , Feihu Zhang, @_cheng_lin , Mengfei Li, Xingqun Qi, Shanghang Zhang, Wenhan Luo, Ping Tan, Wenping Wang, Qifeng Liu, Yike Guo

1️⃣Read the Full Paper here: [2405.11616] Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

2️⃣Project Page: Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention

3️⃣Code: GitHub - pengHTYX/Era3D

4️⃣Demo: Era3D MV Demo - a Hugging Face Space by pengHTYX

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

🎡 Music by Oleg Fedak from @pixabay

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:

**Title:** Era3D: A New Way to Generate High-Quality Multiview Images from a Single Image

**What's the problem?:** Currently, methods to generate multiview images (images that show the same scene from different angles) from a single image have some limitations. They often produce low-quality images, assume the input image is taken with a specific type of camera, and are computationally expensive.

**What's the solution?:** The authors of this paper have introduced a new method called Era3D, which generates high-resolution multiview images from a single image. Era3D is different from existing methods because it:

* **Estimates camera settings:** Era3D can estimate the focal length and elevation of the input image, which allows it to generate images without shape distortions.
* **Uses efficient attention:** Era3D uses a simple but efficient attention layer called row-wise attention, which facilitates efficient cross-view information fusion and reduces computational complexity.

**Results:** Compared to state-of-the-art methods, Era3D generates high-quality multiview images with up to a 512x512 resolution while reducing computation complexity by 12 times.

**Resources:**

* **Read the full paper:** [2405.11616] Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention
* **Project page:** Era3D: High-Resolution Multiview Diffusion using Efficient Row-wise Attention
* **Code:** GitHub - pengHTYX/Era3D
* **Demo:** Era3D MV Demo - a Hugging Face Space by pengHTYX
 

bnew


1/1
🚨Just In🚨

➡️ The @Google Research team has just announced "ChatDirector"

🌟Let's try to understand what ChatDirector is

🎯ChatDirector is a research prototype that transforms traditional video conferences with 3D video avatars, shared 3D scenes, and automatic layout transitions.

🎯ChatDirector employs a real-time pipeline that converts participants' RGB video streams into 3D portrait avatars and renders them in a virtual 3D scene.

🎯ChatDirector also includes a space-aware video conferencing environment that displays remote participants' 3D portrait avatars in a 3D meeting environment.

🎯ChatDirector also features a decision tree algorithm that uses the speech states of remote participants as inputs and visually adjusts the layout and behavior of the remote avatars, helping users keep track of ongoing conversations.
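
As a toy illustration of that speech-driven layout idea, the sketch below maps a few speech states to layout choices. The states and layout names are invented for illustration and are not Google's actual decision tree.

```python
# Hedged sketch of a speech-state-driven layout policy.
from dataclasses import dataclass

@dataclass
class Participant:
    name: str
    speaking: bool
    addressed: bool = False                 # e.g. mentioned by the current speaker (assumed signal)

def choose_layout(participants):
    speakers = [p for p in participants if p.speaking]
    if len(speakers) == 0:
        return "overview", None                              # show the whole shared 3D scene
    if len(speakers) == 1:
        target = next((p for p in participants if p.addressed), None)
        if target is not None:
            return "pairwise", (speakers[0].name, target.name)   # turn the two avatars toward each other
        return "focus", speakers[0].name                     # enlarge the active speaker's avatar
    return "group", [p.name for p in speakers]               # side-by-side layout for overlapping speech

print(choose_layout([Participant("A", True), Participant("B", False, addressed=True)]))
```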

Read More Here: https://dl.acm.org/doi/pdf/10.1145/3613904.3642110

Blog: ChatDirector: Enhancing video conferencing with space-aware scene rendering and speech-driven layout transition

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#videoconferencing #3dportraitavatar





A.I Generated explanation:

**Title:** Google's New "ChatDirector" Revolutionizes Video Conferences

**What is ChatDirector?**

ChatDirector is a new technology developed by Google's research team that changes the way we have video conferences. Instead of seeing each other as 2D faces on a screen, ChatDirector uses 3D avatars and virtual scenes to make video meetings feel more like in-person conversations.

**How does it work?**

When you're in a video conference using ChatDirector, the system takes the video feed from your camera and turns it into a 3D avatar of you. This avatar is then placed in a virtual 3D scene with the other people in the meeting. The system also uses a special algorithm to arrange the avatars in a way that makes it easy to follow the conversation.

**Cool features:**

* The avatars are displayed in a 3D meeting environment, making it feel more like a real meeting.
* The system can automatically adjust the layout of the avatars based on who is speaking, so you can easily see who is talking.
* The 3D scenes and avatars are rendered in real-time, making the experience feel smooth and natural.

**Want to learn more?**

You can read more about ChatDirector in the research paper https://dl.acm.org/doi/pdf/10.1145/3613904.3642110 or on Google's research blog ChatDirector: Enhancing video conferencing with space-aware scene rendering and speech-driven layout transition.

.
 

bnew


1/1
🚨Paper Alert 🚨

➑️Paper Title: GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

🌟Few pointers from the paper

🎯Generating novel views from a single image remains a challenging task due to the complexity of 3D scenes and the limited diversity in the existing multi-view datasets to train a model on.

🎯 Recent research combining large-scale text-to-image (T2I) models with monocular depth estimation (MDE) has shown promise in handling in-the-wild images.

🎯In these methods, an input view is geometrically warped to novel views with estimated depth maps, then the warped image is inpainted by T2I models. However, they struggle with noisy depth maps and loss of semantic details when warping an input view to novel viewpoints.

🎯In this paper, authors have proposed a novel approach for single-shot novel view synthesis, a semantic-preserving generative warping framework that enables T2I generative models to learn where to warp and where to generate, through augmenting cross-view attention with self-attention.

🎯Their approach addresses the limitations of existing methods by conditioning the generative model on source view images and incorporating geometric warping signals.
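
A hedged sketch of the "augment cross-view attention with self-attention" idea: target-view tokens attend to themselves (where to generate) and to source-view tokens carrying warping signals (where to warp), with a learned gate blending the two. Shapes, the gating rule, and module names are assumptions for illustration only, not GenWarp's actual layers.

```python
# Hedged sketch of gated self-attention + cross-view attention inside a diffusion block.
import torch
import torch.nn as nn

d, n_tgt, n_src = 64, 256, 256
self_attn = nn.MultiheadAttention(d, 4, batch_first=True)
cross_attn = nn.MultiheadAttention(d, 4, batch_first=True)
gate = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())

tgt = torch.randn(1, n_tgt, d)                       # noisy target-view tokens (toy)
src = torch.randn(1, n_src, d)                       # source-view tokens + geometric warping embedding (toy)

s, _ = self_attn(tgt, tgt, tgt)                      # "where to generate"
c, _ = cross_attn(tgt, src, src)                     # "where to warp": semantic transfer from the source view
g = gate(torch.cat([s, c], dim=-1))                  # learned blend between warped and generated content
out = g * c + (1 - g) * s
print(out.shape)                                     # torch.Size([1, 256, 64])
```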

🏒Organization: @SonyAI_global , @Sony Group Corporation, @UniversityKorea

πŸ§™Paper Authors: @jyseo_cv , Kazumi Fukuda, Takashi Shibuya, Takuya Narihira, Naoki Murata, Shoukang Hu, @JCJesseLai , Seungryong Kim, @mittu1204

1️⃣Read the Full Paper here: [2405.17251] GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

2️⃣Project Page: GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

3️⃣Code: Coming πŸ”œ

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:


**Title:** GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

**Summary:** This paper is about a new way to generate new views of a scene from just one image. This is a hard problem because scenes can be complex and it's hard to train a model to do this well.

**Current Challenges:** Current methods try to solve this problem by using two steps: 1) warping the input image to a new view using depth maps, and 2) filling in missing parts of the image using text-to-image models. However, these methods have some issues, such as noisy depth maps and loss of important details when warping the image.

**New Approach:** The authors of this paper propose a new approach that uses a combination of two types of attention (cross-view attention and self-attention) to help the model learn where to warp and where to generate new parts of the image. This approach conditions the model on the input image and uses geometric warping signals to improve the results.

**Organization and Authors:** The paper is from researchers at Sony AI, Sony Group Corporation, and Korea University. The authors are listed above.

**Resources:**

1. **Read the Full Paper:** [2405.17251] GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping
2. **Project Page:** GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping
3. **Code:** Coming soon
 

bnew


1/1
🚨Paper Alert 🚨

➑️Paper Title: Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels

🌟Few pointers from the paper

🎯Video generative models are receiving particular attention given their ability to generate realistic and imaginative frames. Besides, these models are also observed to exhibit strong 3D consistency, significantly enhancing their potential to act as world simulators.

🎯In this work, authors have presented "Vidu4D", a novel reconstruction model that excels in accurately reconstructing 4D (i.e., sequential 3D) representations from single generated videos, addressing challenges associated with non-rigidity and frame distortion.

🎯This capability is pivotal for creating high-fidelity virtual contents that maintain both spatial and temporal coherence. At the core of Vidu4D is their proposed "Dynamic Gaussian Surfels" (DGS) technique.

🎯DGS optimizes time-varying warping functions to transform Gaussian surfels (surface elements) from a static state to a dynamically warped state. This transformation enables a precise depiction of motion and deformation over time.

🎯To preserve the structural integrity of surface-aligned Gaussian surfels, they designed the warped-state geometric regularization based on continuous warping fields for estimating normals.

🎯 Additionally, they learned refinements on rotation and scaling parameters of Gaussian surfels, which greatly alleviates texture flickering during the warping process and enhances the capture of fine-grained appearance details.

🎯Vidu4D also contains a novel initialization state that provides a proper start for the warping fields in DGS. Equipping Vidu4D with an existing video generative model, the overall framework demonstrates high-fidelity text-to-4D generation in both appearance and geometry.
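
To illustrate the Dynamic Gaussian Surfels idea described above, here is a rough sketch in which a small network predicts a per-surfel rigid transform at time t and applies it to canonical surfel centres and normals. The tiny MLP, the axis-angle parameterization, and all shapes are stand-ins, not the paper's learned warping fields or its regularizers.

```python
# Hedged sketch of warping canonical Gaussian surfels to time t with per-surfel rigid transforms.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WarpField(nn.Module):
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(4, hidden), nn.ReLU(), nn.Linear(hidden, 6))
    def forward(self, xyz, t):                        # (N, 3) points, scalar time in [0, 1]
        t_col = torch.full((xyz.shape[0], 1), float(t))
        out = self.net(torch.cat([xyz, t_col], dim=-1))
        return out[:, :3], out[:, 3:]                 # axis-angle rotation, translation

def axis_angle_to_matrix(aa):                         # Rodrigues' formula, batched
    theta = aa.norm(dim=-1, keepdim=True).clamp_min(1e-8)
    k = aa / theta
    K = torch.zeros(aa.shape[0], 3, 3)
    K[:, 0, 1], K[:, 0, 2] = -k[:, 2], k[:, 1]
    K[:, 1, 0], K[:, 1, 2] = k[:, 2], -k[:, 0]
    K[:, 2, 0], K[:, 2, 1] = -k[:, 1], k[:, 0]
    I = torch.eye(3).expand_as(K)
    s, c = theta.sin()[..., None], theta.cos()[..., None]
    return I + s * K + (1 - c) * (K @ K)

N = 1024
xyz = torch.randn(N, 3)                               # canonical surfel centres (toy)
normals = F.normalize(torch.randn(N, 3), dim=-1)
warp = WarpField()
aa, trans = warp(xyz, 0.25)
R = axis_angle_to_matrix(aa)
xyz_t = (R @ xyz[..., None]).squeeze(-1) + trans      # warped surfel centres at time t
n_t = (R @ normals[..., None]).squeeze(-1)            # warped normals keep surfels surface-aligned
print(xyz_t.shape, n_t.shape)
```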

🏢Organization: Department of Computer Science and Technology, BNRist Center, @Tsinghua_Uni, ShengShu, College of Electronic and Information Engineering, Tongji University

πŸ§™Paper Authors: Yikai Wang, Xinzhou Wang, Zilong Chen, Zhengyi Wang, Fuchun Sun, Jun Zhu

1️⃣Read the Full Paper here: [2405.16822] Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels

2️⃣Project Page: Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels

3️⃣Code: Coming πŸ”œ

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

🎡 Music by Vitaly Vakulenko from @pixabay

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:

Paper Alert

Paper Title: Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels

Here are some key points from the paper:

What's the big deal about video generative models?

Video generative models are getting a lot of attention because they can create super realistic and imaginative video frames. Plus, they're really good at keeping the 3D consistency of objects in the video, which makes them useful for simulating the real world.

What's the problem that Vidu4D solves?

The authors of Vidu4D created a new model that can take a single generated video and turn it into a high-quality 4D representation (think of it like a 3D video that changes over time). This is hard to do because objects in the video can move and change shape in complex ways.

How does Vidu4D work?

The magic happens thanks to something called "Dynamic Gaussian Surfels" (DGS). DGS is a technique that takes surface elements (like tiny pieces of a 3D object) and warps them over time to show how they move and change. This creates a really accurate representation of motion and deformation.

What makes Vidu4D special?

Vidu4D has a few tricks up its sleeve. It can preserve the structure of the surface elements, which helps keep the video looking realistic. It also learns how to refine the rotation and scaling of these elements, which reduces flickering and captures fine details. Plus, it has a special initialization step that helps get the warping process started correctly.

What can Vidu4D do?

When combined with an existing video generative model, Vidu4D can create high-quality 4D videos from just a text description. This is a big deal for creating realistic virtual content that looks great and moves smoothly.

Who worked on this paper?

The authors are from the Department of Computer Science and Technology at Tsinghua University and the College of Electronic and Information Engineering at Tongji University.

Want to learn more?

1️⃣ Read the full paper here: https://arxiv.org/abs/2405.16822
2️⃣ Check out the project page: https://vidu4d-dgs.github.io/
3️⃣ Code: Coming soon πŸ”œ
 

bnew


1/1
🚨Paper Alert 🚨

➑️Paper Title: Looking Backward: Streaming Video-to-Video Translation with Feature Banks

🌟Few pointers from the paper

🎯In this paper authors have introduced "StreamV2V", a diffusion model that achieves real-time streaming video-to-video (V2V) translation with user prompts.

🎯Unlike prior V2V methods using batches to process limited frames, they opted to process frames in a streaming fashion, to support unlimited frames. At the heart of StreamV2V lies a backward-looking principle that relates the present to the past.

🎯This is realized by maintaining a feature bank, which archives information from past frames. For incoming frames, StreamV2V extends self-attention to include banked keys and values and directly fuses similar past features into the output.

🎯The feature bank is continually updated by merging stored and new features, making it compact but informative. StreamV2V stands out for its adaptability and efficiency, seamlessly integrating with image diffusion models without fine-tuning.

🎯 It can run at 20 FPS on one A100 GPU, being 15×, 46×, 108×, and 158× faster than FlowVid, CoDeF, Rerender, and TokenFlow, respectively. Quantitative metrics and user studies confirm StreamV2V's exceptional ability to maintain temporal consistency.
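
A hedged sketch of the backward-looking feature bank described above: self-attention over the current frame's tokens is extended with banked keys and values from past frames, and the bank is refreshed by merging highly similar features and appending the rest. The merge rule and sizes here are illustrative guesses, not the official implementation linked below.

```python
# Hedged sketch of extended self-attention with a streaming feature bank.
import torch
import torch.nn.functional as F

class FeatureBank:
    def __init__(self, dim=64, max_size=2048, sim_thresh=0.9):
        self.feats = torch.empty(0, dim)
        self.max_size, self.sim_thresh = max_size, sim_thresh

    def update(self, new_feats):
        if len(self.feats) == 0:
            self.feats = new_feats.clone()
            return
        sim = F.normalize(new_feats, dim=-1) @ F.normalize(self.feats, dim=-1).T
        best, idx = sim.max(dim=-1)
        merge = best > self.sim_thresh
        # merge near-duplicates into existing entries, append the genuinely new ones
        self.feats[idx[merge]] = 0.5 * (self.feats[idx[merge]] + new_feats[merge])
        self.feats = torch.cat([self.feats, new_feats[~merge]])[-self.max_size:]

def banked_attention(q, k, v, bank):
    if len(bank.feats):
        k = torch.cat([k, bank.feats])                # extend self-attention with past keys
        v = torch.cat([v, bank.feats])                # ... and past values
    attn = torch.softmax(q @ k.T / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

bank = FeatureBank()
for frame in range(5):                                # streaming: one frame at a time, unlimited length
    tokens = torch.randn(256, 64)                     # per-frame diffusion features (toy)
    out = banked_attention(tokens, tokens, tokens, bank)
    bank.update(tokens)
print(out.shape, bank.feats.shape)
```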

🏒Organization: @UTAustin , @UCBerkeley

πŸ§™Paper Authors: @LiangJeff95 , Akio Kodaira, @Chenfeng_X , Masayoshi Tomizuka, Kurt Keutzer, Diana Marculescu

1️⃣Read the Full Paper here: [2405.15757] Looking Backward: Streaming Video-to-Video Translation with Feature Banks

2️⃣Project Page: Looking Backward: Streaming Video-to-Video Translation with Feature Banks

3️⃣Code: GitHub - Jeff-LiangF/streamv2v: Official Pytorch implementation of StreamV2V.

4️⃣Supplementary Videos: Looking Backward: Streaming Video-to-Video Translation with Feature Banks

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:

**Title:** Looking Backward: Streaming Video-to-Video Translation with Feature Banks

**What's it about:** This paper introduces a new way to translate videos from one style to another in real-time, using a technique called "StreamV2V". This means that instead of processing a batch of frames at once, the system processes frames one by one, like a stream.

**How does it work:** The system uses a "feature bank" to store information from past frames. When a new frame comes in, it looks back at the feature bank to find similar features and combines them with the new frame to create the translated output. The feature bank is constantly updated with new information, making it compact but informative.

**What's special about it:** StreamV2V is fast and efficient, and can run at 20 frames per second on a single high-performance graphics card. It's also very good at maintaining the consistency of the video over time.

**Who did it:** The paper was written by a team of researchers from the University of Texas at Austin and the University of California, Berkeley.

**Where can I learn more:**
Read the Full Paper here: [2405.15757] Looking Backward: Streaming Video-to-Video Translation with Feature Banks
Project Page: Looking Backward: Streaming Video-to-Video Translation with Feature Banks
Code: GitHub - Jeff-LiangF/streamv2v: Official Pytorch implementation of StreamV2V.
Supplementary Videos: Looking Backward: Streaming Video-to-Video Translation with Feature Banks
 

bnew


1/1
🚨Paper Alert 🚨

➑️Paper Title: OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code

🌟Few pointers from the paper

🎯Open-ended and AI-generating algorithms aim to continuously generate and solve increasingly complex tasks indefinitely, offering a promising path toward more general intelligence. To accomplish this grand vision, learning must occur within a vast array of potential tasks.

🎯 Existing approaches to automatically generating environments are constrained within manually predefined, often narrow distributions of environment, limiting their ability to create any learning environment.

💎To address this limitation, authors have introduced a novel framework, "OMNI-EPIC", that augments previous work in Open-endedness via Models of human Notions of Interestingness (OMNI) with Environments Programmed in Code (EPIC).

🎯OMNI-EPIC leverages foundation models to autonomously generate code specifying the next learnable (i.e., not too easy or difficult for the agent's current skill set) and interesting (e.g., worthwhile and novel) tasks.

🎯OMNI-EPIC generates both environments (e.g., an obstacle course) and reward functions (e.g., progress through the obstacle course quickly without touching red objects), enabling it, in principle, to create any simulatable learning task.

🎯Authors have also showcased the explosive creativity of OMNI-EPIC, which continuously innovates to suggest new, interesting learning challenges.

🎯They also highlighted how OMNI-EPIC can adapt to reinforcement learning agents' learning progress, generating tasks that are of suitable difficulty.

🎯Overall, OMNI-EPIC can endlessly create learnable and interesting environments, further propelling the development of self-improving AI systems and AI-Generating Algorithms.
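
To make the "environments programmed in code" idea tangible, here is an invented example of the kind of task specification a foundation model might emit: a small environment description plus a matching reward function, filtered by a stub model of interestingness. The task, field names, and filtering rule are illustrative assumptions, not outputs of the actual system.

```python
# Hedged sketch of a code-specified task: environment builder + reward function, plus a stub
# interestingness filter. The obstacle-course example is invented, not from the paper.
from dataclasses import dataclass
from typing import Callable

@dataclass
class TaskSpec:
    name: str
    build_env: Callable[[], dict]                     # returns a simulatable scene description
    reward: Callable[[dict], float]                   # reward computed from the environment state

def generate_next_task(archive):                      # stand-in for the foundation-model call
    def build_env():
        return {"course_length": 10.0, "red_zones": [(3.0, 3.5), (6.0, 6.8)],
                "x": 0.0, "touched_red": False}
    def reward(state):
        return state["x"] / state["course_length"] - (1.0 if state["touched_red"] else 0.0)
    return TaskSpec("obstacle_course_avoid_red", build_env, reward)

def is_interesting(task, archive):                    # stand-in for the model-of-interestingness check
    return all(task.name != t.name for t in archive)

archive = []
task = generate_next_task(archive)
if is_interesting(task, archive):
    archive.append(task)
    state = task.build_env()
    state["x"] = 4.0                                  # pretend the agent made some progress
    print(task.name, task.reward(state))
```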

🏒Organization:@imperialcollege , @UBC , @VectorInst , Canada CIFAR AI Chair

πŸ§™Paper Authors: @maxencefaldor , @jennyzhangzt , @CULLYAntoine , @jeffclune

1️⃣Read the Full Paper here: [2405.15568] OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code

2️⃣Project Page: OMNI EPIC | Maxence Faldor | Jenny Zhang | Jeff Clune | Antoine Cully |

3️⃣Code: Coming πŸ”œ

4️⃣X Thread: reproduced at the end of this post

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

🎡 Music by Calvin Clavier from @pixabay

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.





A.I Generated explanation:

**Title:** OMNI-EPIC: A New Way to Create Endless Learning Tasks for AI

**Summary:** Researchers have created a new system called OMNI-EPIC that can generate an endless variety of learning tasks for artificial intelligence (AI) systems. This is important because it can help AI systems become more intelligent and capable.

**The Problem:** Currently, AI systems are limited by the types of tasks they can learn from. They need to be trained on a specific set of tasks, and once they've mastered those, they can't learn anything new. This is like a student only being able to learn from a single textbook.

**The Solution:** OMNI-EPIC is a system that can generate new and interesting learning tasks for AI systems. It uses a combination of human input and machine learning to create tasks that are not too easy or too hard for the AI system to learn from. This is like having a teacher who can create new and challenging lessons for a student.

**How it Works:** OMNI-EPIC uses a type of AI called "foundation models" to generate code that specifies the next learning task. This code can create entire environments, such as an obstacle course, and reward functions, such as completing the course quickly without touching certain objects.

**Benefits:** OMNI-EPIC can create an endless variety of learning tasks, which can help AI systems become more intelligent and capable. It can also adapt to the learning progress of the AI system, generating tasks that are suitable for its current skill level.

**Implications:** This technology has the potential to revolutionize the field of AI research and development. It could lead to the creation of more advanced AI systems that can learn and improve over time.

**Resources:**

* Read the full paper here: https://arxiv.org/abs/2405.15568
* Project page: https://omni-epic.vercel.app/
* Twitter thread: https://twitter.com/jeffclune/status/1795787632435212732











1/9
I am thrilled to introduce OMNI-EPIC: Open-endedness via Models of human Notions of Interestingness with Environments Programmed in Code. Led by @maxencefaldor and @jennyzhangzt, with @CULLYAntoine and myself.

2/9
Open-ended and AI-generating algorithms aim to continuously generate and solve increasingly complex tasks forever, offering a promising path toward more general intelligence. To accomplish this grand vision, learning must occur within a VAST space of potential tasks.

3/9
Existing approaches to automatically generating environments are constrained within manually predefined, often narrow distributions of environment, limiting their ability to achieve "Darwin Completeness" (the potential to create *any* learning environment).

4/9
OMNI-EPIC uses foundation models to autonomously generate code specifying the next learnable and interesting tasks. The generation of both environments and reward functions enables, in principle, the creation of any learning task (i.e. achieving "Darwin Completeness").

5/9
Every run of OMNI-EPIC triggers an explosion of creativity in designing fascinating, diverse, interesting new challenges tailored to the current capabilities of the agent, akin to the processes observed in biological evolution and human culture (e.g. art, science and technology).

6/9
Here is an example run. All tasks (save 3 seeds) are generated by OMNI-EPIC. Imagine running this for billions of years!

7/9
It is also a great form of human entertainment! OMNI-EPIC ushers in a new era of gaming, where endless novel and interesting content *of any type* is automatically generated & tailored to players' skills. Soon we will share a website where players can engage with generated tasks.

8/9
In conclusion, OMNI-EPIC represents a leap towards truly open-ended learning by generating an endless stream of learnable, interesting, and wildly diverse tasks.

9/9
Personally I was not surprised. For me, this was one of those ideas where once it was proposed, I was sure it was going to work. But I *was* surprised how easy it was to get it to be endlessly creative! I thought that would take more coaxing. It just wants to create open-endedly!


 

bnew


1/1
🚨Paper Alert 🚨

➑️Paper Title: ORION: Vision-based Manipulation from Single Human Video with Open-World Object Graphs

🌟Few pointers from the paper

🎯In this paper authors have presented an object-centric approach to empower robots to learn vision-based manipulation skills from human videos.

🎯They investigated the problem of imitating robot manipulation from a single human video in the open-world setting, where a robot must learn to manipulate novel objects from one video demonstration.

🎯Authors have introduced ORION, an algorithm that tackles the problem by extracting an object-centric manipulation plan from a single RGB-D video and deriving a policy that conditions on the extracted plan.

🎯Their method enables the robot to learn from videos captured by daily mobile devices such as an iPad and generalize the policies to deployment environments with varying visual backgrounds, camera angles, spatial layouts, and novel object instances.

🎯They systematically evaluated their method on both short-horizon and long-horizon tasks, demonstrating the efficacy of ORION in learning from a single human video in the open world.
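
A rough sketch of what an object-centric plan extracted from one RGB-D demonstration might look like as a data structure: a list of detected objects and a sequence of steps, each expressing a target pose of the manipulated object relative to a reference object. The field names and the example task are assumptions; ORION's actual object graphs and policies are richer than this.

```python
# Hedged sketch of an object-centric manipulation plan distilled from a single human video.
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    manipulated: str        # object the hand interacts with during this segment
    reference: str          # object the motion is expressed relative to
    rel_pose: tuple         # target pose of `manipulated` w.r.t. `reference` (x, y, z, qx, qy, qz, qw)

@dataclass
class ObjectCentricPlan:
    objects: list                                 # open-vocabulary object names detected in the demo video
    steps: list = field(default_factory=list)

# An invented plan for "place the mug on the coaster", as it might be distilled from one RGB-D demo.
plan = ObjectCentricPlan(objects=["mug", "coaster"])
plan.steps.append(PlanStep(manipulated="mug", reference="coaster",
                           rel_pose=(0.0, 0.0, 0.02, 0.0, 0.0, 0.0, 1.0)))

# At deployment, the robot re-detects the same objects in its own workspace and conditions its
# policy on each step, which is what lets the plan transfer across backgrounds, camera angles,
# and spatial layouts.
for step in plan.steps:
    print(f"move {step.manipulated} to {step.rel_pose} relative to {step.reference}")
```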

🏒Organization: @UTAustin , @SonyAI_global

πŸ§™Paper Authors: @yifengzhu_ut , Arisrei Lim, @PeterStone_TX , @yukez

1️⃣Read the Full Paper here: [2405.20321] Vision-based Manipulation from Single Human Video with Open-World Object Graphs

2️⃣Project Page: ORION: Vision-based Manipulation from Single Human Video with Open-World Object Graphs

πŸŽ₯ Be sure to watch the attached Demo Video-Sound on πŸ”ŠπŸ”Š

🎡 Music by SPmusic from @pixabay

Find this Valuable πŸ’Ž ?

♻️QT and teach your network something new

Follow me πŸ‘£, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

/search?q=#robotmanipulation




A.I Generated explanation:

**Title:** ORION: Vision-based Manipulation from Single Human Video with Open-World Object Graphs

**Summary:**

Imagine you want a robot to learn how to do a task, like picking up a ball or moving a block, just by watching a human do it on a video. This paper presents a new way to make that happen. The authors created an algorithm called ORION that allows a robot to learn from a single video of a human doing a task, and then apply that knowledge to do the task itself, even if the environment is different.

**Key Points:**

* The authors want to enable robots to learn from humans by watching videos of them doing tasks.
* They developed an algorithm called ORION that can extract the important steps of a task from a single video and use that to teach a robot how to do it.
* ORION can work with videos taken by everyday devices like an iPad, and the robot can learn to do the task even if the environment is different from the one in the video.
* The authors tested ORION on different tasks and found that it works well, even when the task is complex and requires the robot to do multiple steps.

**Who did this research:**

* The research was done by a team from the University of Texas at Austin and Sony AI Global.
* The authors of the paper are Yifeng Zhu, Arisrei Lim, Peter Stone, and Yuke Zhu.

**Want to learn more:**

* You can read the full paper here: [2405.20321] Vision-based Manipulation from Single Human Video with Open-World Object Graphs
* You can also check out the project page here: ORION: Vision-based Manipulation from Single Human Video with Open-World Object Graphs
 

bnew

[Submitted on 27 Mar 2024 (v1), last revised 10 Jul 2024 (this version, v2)]

Vulnerability Detection with Code Language Models: How Far Are We?

Yangruibo Ding
Abstract: In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection.
To address these challenges, we introduce PrimeVul, a new dataset for training and evaluating code LMs for vulnerability detection. PrimeVul incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs' performance in real-world conditions.
Evaluating code LMs on PrimeVul reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PrimeVul. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.
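
As a hedged illustration of two of the dataset-hygiene steps described above, the sketch below de-duplicates functions by a whitespace-normalized hash and then splits chronologically so that evaluation samples post-date all training samples. The field names and the normalization rule are assumptions for illustration, not PrimeVul's actual pipeline.

```python
# Hedged sketch: de-duplication by normalized hash + chronological train/test split.
import hashlib
from datetime import date

samples = [
    {"code": "int f(){return 1;}", "label": 0, "commit_date": date(2019, 5, 1)},
    {"code": "int  f(){return 1;}", "label": 0, "commit_date": date(2020, 1, 9)},   # whitespace duplicate
    {"code": "void g(char*s){strcpy(buf,s);}", "label": 1, "commit_date": date(2022, 3, 2)},
]

def normalized_hash(code):
    return hashlib.sha256("".join(code.split()).encode()).hexdigest()

seen, deduped = set(), []
for s in samples:
    h = normalized_hash(s["code"])
    if h not in seen:
        seen.add(h)
        deduped.append(s)

deduped.sort(key=lambda s: s["commit_date"])
cut = int(0.8 * len(deduped))
train, test = deduped[:cut], deduped[cut:]            # test commits are strictly newer than train commits
print(len(deduped), len(train), len(test))
```
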
Comments: Accepted for the 47th IEEE/ACM International Conference on Software Engineering (ICSE 2025); Camera-ready Work in Progress
Subjects: Software Engineering (cs.SE); Computation and Language (cs.CL)
Cite as: arXiv:2403.18624v2 - [2403.18624] Vulnerability Detection with Code Language Models: How Far Are We?

Submission history: v1 submitted 27 Mar 2024; revised 10 Jul 2024 (v2).

PDF: https://arxiv.org/pdf/2403.18624



A.I Generated explanation:


Title: Vulnerability Detection with Code Language Models - How Far Are We?

This paper is about using special computer programs called "code language models" to detect vulnerabilities in software code. The authors want to know how well these models work in real-life scenarios.

Author: Yangruibo Ding

The author listed on the arXiv page is Yangruibo Ding.

Abstract:

The authors looked at how well code language models can detect vulnerabilities in software code. They found that the current datasets used to train these models have some big problems, such as:

* Poor data quality
* Inaccurate labels
* Duplicate data

This means that the models aren't performing as well as they should in real-life scenarios. To fix this, the authors created a new dataset called PrimeVul, which has better data quality, more accurate labels, and no duplicates.

When they tested the code language models on PrimeVul, they found that the models didn't perform as well as they did on the old datasets. In fact, even the best models performed poorly, which means there's still a lot of work to be done to make these models useful in real-life scenarios.

Comments:

This paper has been accepted for a conference called the 47th IEEE/ACM International Conference on Software Engineering (ICSE 2025).

Subjects:

This paper is about software engineering and computation and language.

Cite as:

You can cite this paper as arXiv:2403.18624.

Submission history:

The paper was first submitted on March 27, 2024, and revised on July 10, 2024. You can download the PDF here: https://arxiv.org/pdf/2403.18624
 