bnew

Veteran
Joined
Nov 1, 2015
Messages
55,228
Reputation
8,195
Daps
156,171

1/1
🚨CVPR 2024 Highlight Paper Alert 🚨

➡️Paper Title: Matching Anything by Segmenting Anything

🌟Few pointers from the paper

🎯The robust association of the same objects across video frames in complex scenes is crucial for many applications, especially Multiple Object Tracking (MOT).

🎯Current methods predominantly rely on labeled domain-specific video datasets, which limits the cross-domain generalization of learned similarity embeddings.

🎯In this paper, the authors have proposed “MASA”, a novel method for robust instance association learning, capable of matching any objects within videos across diverse domains without tracking labels.

🎯Leveraging the rich object segmentation from the Segment Anything Model (SAM), MASA learns instance-level correspondence through exhaustive data transformations.

🎯They treat the SAM outputs as dense object region proposals and learn to match those regions from a vast image collection (a minimal sketch of this matching idea follows these pointers). The authors further designed a universal MASA adapter which can work in tandem with foundational segmentation or detection models and enables them to track any detected object.

🎯These combinations show strong zero-shot tracking ability in complex domains. Extensive tests on multiple challenging MOT and MOTS benchmarks indicate that the proposed method, using only unlabeled static images, achieves even better zero-shot association performance than state-of-the-art methods trained with fully annotated in-domain video sequences.
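
A minimal sketch of the region-matching idea referenced in the pointers above: treat regions taken from a single image as pseudo-identities, build two augmented views, and train an embedding so that the two views of the same region match each other. The augmentations, the tiny embedding head, and the InfoNCE-style loss below are generic stand-ins, not the authors' implementation (which works on SAM masks with image backbones).

```python
# Illustrative only: regions from ONE image act as pseudo "tracks"; two augmented
# views of each region should map to nearby embeddings (contrastive matching).
import torch
import torch.nn as nn
import torch.nn.functional as F

def augment(regions: torch.Tensor) -> torch.Tensor:
    """Stand-in for the paper's data transformations: noise + brightness jitter."""
    noise = 0.05 * torch.randn_like(regions)
    gain = 1.0 + 0.2 * (torch.rand(regions.shape[0], 1) - 0.5)
    return regions * gain + noise

class EmbedHead(nn.Module):
    """Tiny stand-in for an appearance-embedding head."""
    def __init__(self, in_dim: int, out_dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, out_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def matching_loss(emb_a, emb_b, temp: float = 0.07):
    """InfoNCE: region i in view A should match region i in view B, not the others."""
    logits = emb_a @ emb_b.t() / temp              # (N, N) similarity matrix
    targets = torch.arange(emb_a.shape[0])         # diagonal entries are the positives
    return F.cross_entropy(logits, targets)

# Toy run: 16 "region proposals", each a flattened 3x32x32 crop.
regions = torch.rand(16, 3 * 32 * 32)
head = EmbedHead(in_dim=3 * 32 * 32)
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
for step in range(100):
    a, b = head(augment(regions)), head(augment(regions))
    loss = matching_loss(a, b)
    opt.zero_grad(); loss.backward(); opt.step()
print("final matching loss:", float(loss))
```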

🏢Organization: @ETH_en , @INSAITinstitute

🧙Paper Authors: Siyuan Li, @leike_lk , @MDanelljan , Luigi Piccinelli, @MattiaSegu , Luc Van Gool, Fisher Yu

1️⃣Read the Full Paper here: [2406.04221] Matching Anything by Segmenting Anything

2️⃣Project Page: MASA

3️⃣Code: GitHub - siyuanliii/masa: Official Implementation of CVPR24 highlight paper: Matching Anything by Segmenting Anything

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

/search?q=#CVPR2024


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196


A.I Generated explanation:

**Title:** Matching Anything by Segmenting Anything

**What's it about:** This paper is about a new way to track objects in videos, even if they're moving around or changing shape. This is important for things like self-driving cars, surveillance systems, and robots that need to follow objects.

**The problem:** Current methods for tracking objects in videos rely on labeled data, which means someone has to manually label each object in the video. This limits how well these methods work in different situations.

**The solution:** The authors of this paper have come up with a new method called "MASA" that can track objects in videos without needing labeled data. MASA uses a technique called "segmentation" to break down the video into smaller parts and then matches those parts across different frames.

**How it works:** MASA uses a powerful tool called the "Segment Anything Model" (SAM) to identify objects in the video. It then uses these objects to learn how to match them across different frames, even if they're moving or changing shape.

**The benefits:** MASA can track objects in videos without needing labeled data, which makes it more flexible and powerful than current methods. It can also work with different types of objects and in different situations.

**The results:** The authors tested MASA on several challenging video datasets and found that it performed better than current state-of-the-art methods, even though it didn't use labeled data.

**Who did it:** The paper was written by a team of researchers from ETH Zurich and the INSAIT Institute.

**Want to learn more:** You can read the full paper here: [2406.04221] Matching Anything by Segmenting Anything, visit the project page here: MASA, or check out the code on GitHub here: GitHub - siyuanliii/masa: Official Implementation of CVPR24 highlight paper: Matching Anything by Segmenting Anything.
 

bnew


1/2
🚨Paper Alert 🚨

➡️Paper Title: HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

🌟Few pointers from the paper

🎯Holistic understanding of urban scenes based on RGB images is a challenging yet important problem. It encompasses understanding both the geometry and appearance to enable novel view synthesis, parsing semantic labels, and tracking moving objects.

🎯Despite considerable progress, existing approaches often focus on specific aspects of this task and require additional inputs such as LiDAR scans or manually annotated 3D bounding boxes.

🎯In this paper, authors have introduced a novel pipeline that utilizes 3D Gaussian Splatting for holistic urban scene understanding.

🎯Their main idea involves the joint optimization of geometry, appearance, semantics, and motion using a combination of static and dynamic 3D Gaussians, where moving object poses are regularized via physical constraints (a toy composition sketch follows these pointers).

🎯Their approach offers the ability to render new viewpoints in real time, yields 2D and 3D semantic information with high accuracy, and reconstructs dynamic scenes even in scenarios where 3D bounding box detections are highly noisy.

🎯Experimental results on KITTI, KITTI-360, and Virtual KITTI 2 demonstrate the effectiveness of their approach.
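
As a toy illustration of the static/dynamic decomposition mentioned above: static Gaussians live in world coordinates, while each moving object keeps its Gaussians in a local frame and is placed into the scene by a time-dependent pose. The pose parameterization and the random data below are invented for illustration; in the actual method each Gaussian also carries covariance, opacity, and appearance parameters, and poses come from (noisy) 3D boxes regularized by physical constraints.

```python
# Illustrative only: compose static 3D Gaussian centers with per-object dynamic
# Gaussian centers whose pose changes over time (toy "vehicle" driving along +x).
import numpy as np

def pose_at(t: float) -> np.ndarray:
    """Toy SE(3) pose for one moving object at time t (4x4 homogeneous matrix)."""
    yaw = 0.1 * t
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:3, :3] = np.array([[c, -s, 0], [s, c, 0], [0, 0, 1]])
    T[:3, 3] = np.array([2.0 * t, 0.0, 0.0])   # drive along +x
    return T

def compose_scene(static_means, object_means_local, t):
    """Return all Gaussian centers in world coordinates at time t."""
    T = pose_at(t)
    homo = np.concatenate([object_means_local, np.ones((len(object_means_local), 1))], axis=1)
    object_means_world = (homo @ T.T)[:, :3]
    return np.concatenate([static_means, object_means_world], axis=0)

rng = np.random.default_rng(0)
static_means = rng.normal(scale=5.0, size=(1000, 3))     # background Gaussians
car_means_local = rng.normal(scale=0.5, size=(200, 3))   # one dynamic object

for t in (0.0, 1.0, 2.0):
    scene = compose_scene(static_means, car_means_local, t)
    print(f"t={t}: {len(scene)} Gaussians, object centroid x = "
          f"{scene[len(static_means):, 0].mean():.2f}")
```

Note that the sketch only moves Gaussian centers; the same pose transform would also rotate the covariances in a full system.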

🏢Organization: @ZJU_China , @Huawei Noah's Ark Lab, @uni_tue , Tübingen AI Center

🧙Paper Authors: Hongyu Zhou, @jiahaoshao1 , Lu Xu, Dongfeng Bai, Weichao Qiu, Bingbing Liu, Yue Wang, Andreas Geiger (@AutoVisionGroup ), Yiyi Liao

1️⃣Read the Full Paper here:[2403.12722] HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

2️⃣Project Page: HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Yevgeniy Sorokin from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

2/2
Not much different from a Kalman filter.




A.I Generated explanation:

**Title:** HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting

**What's it about:** This paper is about creating a better way to understand urban scenes (like cities) using just regular camera images. This means being able to figure out the layout of the scene, what things look like, and even how objects are moving.

**The problem:** Right now, most methods for understanding urban scenes focus on just one or two aspects of the scene, and often need extra information like special sensor data or manual annotations. This makes it hard to get a complete picture of the scene.

**The solution:** The authors of this paper have come up with a new way to understand urban scenes using a technique called "Gaussian Splatting". This method uses special math to combine information about the scene's geometry, appearance, semantics (what things are), and motion (how things are moving).

**How it works:** The method uses a combination of static and dynamic 3D Gaussians (think of them like special math blobs) to understand the scene. It also uses physical constraints to make sure the moving objects are behaving realistically.

**The benefits:** This approach can render new viewpoints of the scene in real-time, and provides highly accurate 2D and 3D semantic information (what things are and where they are). It can even reconstruct dynamic scenes, even when the data is noisy.

**The results:** The authors tested their approach on several datasets (KITTI, KITTI-360, and Virtual KITTI 2) and showed that it works really well.

**Who did it:** The paper was written by a team of researchers from Zhejiang University, Huawei Noah's Ark Lab, University of Tübingen, and the Tübingen AI Center.

**Want to learn more:** You can read the full paper here: [2403.12722] HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting, visit the project page here: HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting, or wait for the code to be released (coming soon!)
 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

🌟Few pointers from the paper

🎯In this paper, the authors have introduced “Unique3D”, a novel image-to-3D framework for efficiently generating high-quality 3D meshes from single-view images, featuring state-of-the-art generation fidelity and strong generalizability.

🎯Previous methods based on Score Distillation Sampling (SDS) can produce diversified 3D results by distilling 3D knowledge from large 2D diffusion models, but they usually suffer from long per-case optimization times and inconsistency issues.

🎯 Recent works address the problem and generate better 3D results either by finetuning a multi-view diffusion model or by training a fast feed-forward model. However, they still lack intricate textures and complex geometries due to inconsistency and limited generated resolution.

🎯 To simultaneously achieve high fidelity, consistency, and efficiency in single image-to-3D, the authors have proposed a novel framework, Unique3D, that includes a multi-view diffusion model with a corresponding normal diffusion model to generate multi-view images with their normal maps.

🎯 The framework also includes a multi-level upscale process that progressively improves the resolution of the generated orthographic multi-views, as well as an instant and consistent mesh reconstruction algorithm called ISOMER, which fully integrates the color and geometric priors into the mesh results (a pipeline skeleton follows these pointers).

🎯Extensive experiments demonstrate that Unique3D significantly outperforms other image-to-3D baselines in terms of geometric and textural details.
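
Read as data flow, the pointers above describe a staged pipeline: a single image goes to orthographic multi-views plus normal maps, those are progressively upscaled, and ISOMER turns them into a mesh. The skeleton below only mirrors that flow; every function is a placeholder stub, not the released Unique3D API, so treat it as a map of the stages rather than an implementation.

```python
# Pipeline skeleton only: every function is a placeholder standing in for a stage
# named in the paper, not the released Unique3D code.
import numpy as np

def generate_orthographic_views(image, n_views=4, res=256):
    """Stub for the multi-view diffusion model: returns n_views RGB images."""
    return [np.zeros((res, res, 3), dtype=np.float32) for _ in range(n_views)]

def generate_normal_maps(views):
    """Stub for the companion normal diffusion model: one normal map per view."""
    return [np.zeros_like(v) for v in views]

def upscale(images, levels=2):
    """Stub for the multi-level upscale stage: doubles resolution per level."""
    for _ in range(levels):
        images = [np.kron(img, np.ones((2, 2, 1), dtype=img.dtype)) for img in images]
    return images

def isomer_reconstruct(views, normals):
    """Stub for ISOMER mesh reconstruction: returns (vertices, faces) arrays."""
    return np.zeros((8, 3)), np.zeros((12, 3), dtype=np.int64)

def image_to_mesh(image):
    views = generate_orthographic_views(image)
    normals = generate_normal_maps(views)
    views, normals = upscale(views), upscale(normals)
    return isomer_reconstruct(views, normals)

verts, faces = image_to_mesh(np.zeros((512, 512, 3), dtype=np.float32))
print(verts.shape, faces.shape)
```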

🏢Organization: @Tsinghua_Uni , AVAR Inc.

🧙Paper Authors: Kailu Wu, @fangfu0830 , Zhihan Cai, Runjie Yan, Hanyang Wang, Yating Hu, Yueqi Duan, Kaisheng Ma

1️⃣Read the Full Paper here: [2405.20343] Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

2️⃣Project Page: Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

3️⃣Code: GitHub - AiuniAI/Unique3D: Official implementation of Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

4️⃣Demo: Unique3D - a Hugging Face Space by Wuvin

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:

**Title:** Unique3D: High-Quality and Efficient 3D Mesh Generation from a Single Image

**What's it about?**

This paper introduces a new way to create high-quality 3D models from just one 2D image. This is a big deal because usually, creating 3D models requires multiple images or a lot of manual work.

**The problem with current methods:**

Current methods can create 3D models from 2D images, but they have some issues:

* They can take a long time to work
* The results can be inconsistent
* They might not have enough detail or texture

**How Unique3D solves these problems:**

The authors of this paper have created a new framework called Unique3D that solves these problems. Here's how:

* It uses a special kind of AI model that can generate multiple views of an object from just one image
* It then uses these views to create a high-quality 3D model
* The model is designed to be fast and efficient, so it doesn't take a long time to work
* It also includes a special algorithm that makes sure the 3D model is consistent and has a lot of detail and texture

**Results:**

The authors tested their Unique3D framework and found that it works much better than other methods. It can create 3D models with more detail and texture, and it's faster and more efficient.

**Where to learn more:**

If you want to learn more about Unique3D, you can:

* Read the full paper here: [https://arxiv.org/abs/2405.20343](https://arxiv.org/abs/2405.20343)
* Check out the project page here: [https://wukailu.github.io/Unique3D/](https://wukailu.github.io/Unique3D/)
* Look at the code on GitHub here: [https://github.com/AiuniAI/Unique3D](https://github.com/AiuniAI/Unique3D)
* See a demo of Unique3D in action here: [https://huggingface.co/spaces/Wuvin/Unique3D](https://huggingface.co/spaces/Wuvin/Unique3D)
 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning

🌟Few pointers from the paper

✡️In this paper, the authors have presented OmniH2O (Omni Human-to-Humanoid), a learning-based system for whole-body humanoid teleoperation and autonomy.

🎯Using kinematic pose as a universal control interface, OmniH2O enables various ways for a human to control a full-sized humanoid with dexterous hands, including real-time teleoperation through a VR headset, verbal instruction, and an RGB camera.

🎯OmniH2O also enables full autonomy by learning from teleoperated demonstrations or integrating with frontier models such as GPT-4o.

🎯OmniH2O demonstrates versatility and dexterity in various real-world whole-body tasks through teleoperation or autonomy, such as playing multiple sports, moving and manipulating objects, and interacting with humans.

🎯They developed an RL-based sim-to-real pipeline, which involves large-scale retargeting and augmentation of human motion datasets, learning a real-world deployable policy with sparse sensor input by imitating a privileged teacher policy (a distillation sketch follows these pointers), and reward designs to enhance robustness and stability.

🎯They have released the first humanoid whole-body control dataset, OmniH2O-6, containing six everyday tasks, and demonstrate humanoid whole-body skill learning from teleoperated datasets.
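
The sim-to-real pointer above mentions distilling a privileged teacher policy into a student that only sees sparse sensor input. The sketch below shows that distillation pattern in its most generic form, with random data, tiny MLPs, and made-up dimensions; the real system trains the teacher with RL in simulation and uses the robot's actual observation and action spaces.

```python
# Generic teacher -> student distillation sketch, not the OmniH2O training code.
import torch
import torch.nn as nn

PRIV_DIM, SPARSE_DIM, ACT_DIM = 128, 24, 19   # illustrative sizes only

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, out_dim))

teacher = mlp(PRIV_DIM, ACT_DIM)    # assume already trained with RL on privileged state
student = mlp(SPARSE_DIM, ACT_DIM)  # sees only sparse sensors (e.g. head/hand poses)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)

for step in range(200):
    priv_state = torch.randn(64, PRIV_DIM)        # stand-in for full simulator state
    sparse_obs = priv_state[:, :SPARSE_DIM]       # stand-in for the sparse sensor subset
    with torch.no_grad():
        target_action = teacher(priv_state)       # privileged teacher's action
    loss = nn.functional.mse_loss(student(sparse_obs), target_action)
    opt.zero_grad(); loss.backward(); opt.step()

print("student imitation loss:", float(loss))
```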

🏢Organization: @CarnegieMellon , @sjtu1896

🧙Paper Authors: @TairanHe99 , @zhengyiluo , @Xialin_He , @_wenlixiao , @ChongZitaZhang , Weinan Zhang, @kkitani , Changliu Liu, @GuanyaShi

1️⃣Read the Full Paper here: https://omni.human2humanoid.com/resources/OmniH2O_paper.pdf

2️⃣Project Page: OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

/search?q=#humanoid /search?q=#teleoperation




A.I Generated explanation:

**Title:** OmniH2O: Universal and Dexterous Human-to-Humanoid Whole-Body Teleoperation and Learning

**What's it about?**

This paper is about a system called OmniH2O that allows a human to control a humanoid robot (a robot that looks like a human) using different methods, such as virtual reality, voice commands, or even just by watching the human move. The system can also learn to do tasks on its own without human input.

**How does it work?**

The system uses a special way of controlling the robot's movements, called kinematic pose, which allows the human to control the robot in different ways. For example, the human can wear a virtual reality headset and move their body to control the robot's movements in real-time. The system can also learn from the human's movements and do tasks on its own, such as playing sports, moving objects, and interacting with people.

**What's special about it?**

The system is special because it can learn to do tasks in a real-world environment, not just in a simulation. It can also learn from a large dataset of human movements and adapt to new situations. The researchers have also released a dataset of humanoid robot movements, called OmniH2O-6, which can be used by other researchers to improve their own systems.

**Who worked on it?**

The paper was written by researchers from Carnegie Mellon University and Shanghai Jiao Tong University. The authors are listed at the bottom of the post.

**Want to learn more?**

You can read the full paper by clicking on the link provided, or visit the project page to learn more about OmniH2O.
 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: Dynamic 3D Gaussian Fields for Urban Areas

🌟Few pointers from the paper

🎯In this paper, the authors have presented an efficient neural 3D scene representation for novel-view synthesis (NVS) in large-scale, dynamic urban areas.

🎯 Existing works are not well suited for applications like mixed-reality or closed-loop simulation due to their limited visual quality and non-interactive rendering speeds.

🎯Recently, rasterization-based approaches have achieved high-quality NVS at impressive speeds. However, these methods are limited to small-scale, homogeneous data, i.e. they cannot handle severe appearance and geometry variations due to weather, season, and lighting and do not scale to larger, dynamic areas with thousands of images.

🎯The authors have proposed “4DGF”, a neural scene representation that scales to large-scale dynamic urban areas, handles heterogeneous input data, and substantially improves rendering speeds.

🎯They used 3D Gaussians as an efficient geometry scaffold while relying on neural fields as a compact and flexible appearance model.

🎯They integrated scene dynamics via a scene graph at global scale while modeling articulated motions on a local level via deformations.

🎯This decomposed approach enables flexible scene composition suitable for real-world applications. In experiments, they surpassed the state-of-the-art by over 3 dB in PSNR and more than 200× in rendering speed.
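
For context on the "over 3 dB in PSNR" claim: PSNR is 10·log10(MAX²/MSE), so a 3 dB gain corresponds to roughly halving the mean squared error. A quick numeric check with synthetic images normalized to [0, 1]:

```python
# PSNR = 10 * log10(MAX^2 / MSE); with images in [0, 1], MAX = 1.
import numpy as np

def psnr(img_a: np.ndarray, img_b: np.ndarray, max_val: float = 1.0) -> float:
    mse = np.mean((img_a - img_b) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
gt = rng.random((64, 64, 3))
noisy = np.clip(gt + rng.normal(scale=0.05, size=gt.shape), 0, 1)                 # baseline render
better = np.clip(gt + rng.normal(scale=0.05 / np.sqrt(2), size=gt.shape), 0, 1)   # ~half the MSE

print(f"baseline PSNR: {psnr(gt, noisy):.2f} dB")
print(f"improved PSNR: {psnr(gt, better):.2f} dB  (~ +3 dB for ~2x lower MSE)")
```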

🏢Organization: @ETH_en , @RealityLabs , @CVUTPraha

🧙Paper Authors: @TobiasFischer11 ,@jonaskulhanek , Samuel Rota Bulò, Lorenzo Porzi, @mapo1 , Peter Kontschieder

1️⃣Read the Full Paper here: [2406.03175] Dynamic 3D Gaussian Fields for Urban Areas

2️⃣Project Page: Dynamic 3D Gaussian Fields for Urban Areas | Tobias Fischer

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Pavel Bekirov from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:

**Title:** Dynamic 3D Gaussian Fields for Urban Areas

**Summary:** This paper is about creating a new way to represent 3D scenes in urban areas, like cities, in a more efficient and realistic way. This is important for applications like mixed-reality, where you want to be able to see a virtual version of a city that looks and feels real.

**Problem:** Current methods for creating 3D scenes are not good enough for these applications because they are not detailed enough and take too long to render. They also can't handle changes in the scene, like different weather or lighting.

**Solution:** The authors of this paper have come up with a new method called "4DGF" that can create detailed 3D scenes of large urban areas quickly and efficiently. They use a combination of 3D shapes and neural networks to create a flexible and realistic model of the scene.

**Key Features:**

* Can handle large, dynamic urban areas with thousands of images
* Can handle changes in the scene, like weather and lighting
* Fast rendering speeds, over 200 times faster than current methods
* High-quality images, with an improvement of over 3 dB in PSNR (a measure of image quality)

**Organization:** The research was done by a team from ETH Zurich, Reality Labs, and CVUT Prague.

**Authors:** The authors of the paper are Tobias Fischer, Jonas Kulhanek, Samuel Rota Bulò, Lorenzo Porzi, and Peter Kontschieder.

**Read More:**

* You can read the full paper here: [2406.03175] Dynamic 3D Gaussian Fields for Urban Areas
* You can also check out the project page here: Dynamic 3D Gaussian Fields for Urban Areas | Tobias Fischer
 

bnew


1/1
🚨 CVPR 2024 (Oral) Paper Alert 🚨

➡️Paper Title: MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

🌟Few pointers from the paper

🎯In this paper, the authors have presented MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos.

🎯Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects.

🎯Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty.

🎯 To tackle these challenges, the authors first define a layered neural representation for the entire scene, composited from individual human and background models. They learn this layered neural representation from videos via layer-wise differentiable volume rendering (a compositing sketch follows these pointers).

🎯This learning process is further enhanced by their hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module, yielding reliable instance segmentation supervision even under close human interaction.

🎯A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately.

🎯They incorporated effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics, leading to temporally consistent 3D reconstructions with high fidelity.
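
The layered-representation pointer above boils down to merging per-layer densities and colors at each ray sample and then alpha-compositing them front to back. Below is a minimal numeric version of that compositing step; the random densities and colors stand in for the per-person and background neural fields learned in the paper.

```python
# Minimal layered volume rendering along ONE ray: each layer (two people + background)
# contributes a density and a color at every sample; layers are merged per sample and
# then alpha-composited front to back. Random values stand in for the neural fields.
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_layers = 64, 3                       # 3 layers: person A, person B, background
deltas = np.full(n_samples, 0.05)                 # spacing between samples along the ray
sigma = rng.random((n_layers, n_samples)) * 2.0   # per-layer densities
color = rng.random((n_layers, n_samples, 3))      # per-layer RGB

# Merge layers at each sample: total density, density-weighted color.
sigma_total = sigma.sum(axis=0)                                            # (n_samples,)
color_mix = (sigma[..., None] * color).sum(axis=0) / (sigma_total[:, None] + 1e-8)

# Standard front-to-back compositing: alpha_i = 1 - exp(-sigma_i * delta_i),
# T_i = prod_{j<i} (1 - alpha_j), pixel = sum_i T_i * alpha_i * c_i.
alpha = 1.0 - np.exp(-sigma_total * deltas)
trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
pixel = (trans[:, None] * alpha[:, None] * color_mix).sum(axis=0)
print("composited RGB:", np.round(pixel, 3))
```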

🏢Organization: @ETH_en , @Microsoft

🧙Paper Authors: Zeren Jiang, @ChenGuo96 , @ManuelKaufmann1 , Tianjian Jiang, Julien Valentin, @OHilliges , Jie Song

1️⃣Read the Full Paper here: MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

2️⃣Project Page: MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by raspberrymusic from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

/search?q=#CVPR2024 /search?q=#3d




A.I Generated explanation:

CVPR 2024 (Oral) Paper Alert

The researchers have written a paper about a new way to use videos to create 3D models of multiple people moving around and interacting with each other. This is a hard problem because the video is only from one angle (monocular) and the people are moving and overlapping each other.

Paper Title: MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild

Here are some key points from the paper:

* The researchers created a new framework called MultiPly that can take a video and create 3D models of multiple people in it.
* To do this, they had to develop a way to separate the people from the background and from each other, even when they're moving and overlapping.
* They also had to figure out how to create complete 3D models of the people from short video sequences.
* To solve these problems, the researchers used a combination of machine learning techniques, including a layered neural representation of the scene, instance segmentation, and optimization formulations.
* They tested their method on videos and got high-quality 3D reconstructions that are consistent over time.

Organization: The researchers are from ETH Zurich and Microsoft.

Paper Authors: The authors are Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, and Jie Song.

Read More:

1️⃣ MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild - Read the full paper here.
2️⃣ MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild - Check out the project page.
3️⃣ Code: Coming soon
 

bnew


1/1
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

🌟Few pointers from the paper

🎯In this paper, the authors have addressed the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy.

🎯The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance.

🎯With such methods, the authors observed a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model.

🎯They quantified the error induced by current camera models and showed that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Their analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental.

🎯They used this to formulate a new loss, “Threshold-Adaptive Loss Scaling” (TALS), that penalizes gross 2D and p-GT errors but not smaller ones (a toy version follows these pointers). With such a loss, there are many 3D poses that could equally explain the 2D evidence.

🎯To reduce this ambiguity, they needed a prior over valid human poses, but such priors can introduce unwanted bias.

🎯To address this, they exploited a tokenized representation of human pose and reformulated the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively improving robustness to occlusion.
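
A toy version of the TALS idea referenced above: down-weight a per-keypoint loss to zero once the error falls inside a tolerance, so that only gross 2D/p-GT errors are penalized. The hard threshold and the numbers below are illustrative; the paper's exact formulation differs.

```python
# Toy "penalize large errors, ignore small ones" keypoint loss, illustrating the spirit
# of Threshold-Adaptive Loss Scaling (TALS); NOT the paper's exact formulation.
import torch

def tals_like_loss(pred_kpts, gt_kpts, threshold=0.05):
    """pred_kpts, gt_kpts: (N, K, 2) normalized 2D keypoints.
    Errors below the threshold get zero weight; gross errors keep full weight."""
    err = torch.linalg.norm(pred_kpts - gt_kpts, dim=-1)   # (N, K) per-keypoint error
    weight = (err > threshold).float()                     # hard gate: only gross errors count
    return (weight * err ** 2).sum() / weight.sum().clamp(min=1.0)

pred = torch.rand(4, 17, 2)                  # 17 COCO-style keypoints, batch of 4 (toy data)
gt = pred + 0.01 * torch.randn_like(pred)    # small errors -> contribute ~nothing
gt[0, 0] += 0.5                              # one gross error -> dominates the loss
print(float(tals_like_loss(pred, gt)))
```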

🏢Organization: @MPI_IS , @meshcapade , @ETH_en

🧙Paper Authors: @saidwivedi , @yusun14567741 , @PriyankaP1201 , @YaoFeng1995 , @Michael_J_Black

1️⃣Read the Full Paper here: [2404.16752] TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

2️⃣Project Page: TokenHMR

3️⃣Code: GitHub - saidwivedi/TokenHMR: [CVPR 2024] TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

/search?q=#CVPR2024




A.I Generated explanation:


**Title:** TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation

**Summary:** This paper is about creating a more accurate way to estimate the 3D shape and pose of a human body from a single 2D image.

**Problem:** Current methods use large datasets and complex calculations to estimate the 3D pose and shape of a human body from a 2D image. However, these methods have some flaws that can lead to inaccurate results.

**Flaws:** The current methods have two main flaws:

1. **Biased data:** The datasets used to train these models can be biased, which means they may not accurately represent the real world.
2. **Simplified camera model:** The camera model used to project the 3D pose onto a 2D image is simplified, which can lead to errors.

**Solution:** The authors of this paper propose a new method called TokenHMR, which addresses these flaws. They:

1. **Quantified the errors:** They measured the errors caused by the biased data and simplified camera model.
2. **Created a new loss function:** They developed a new loss function called Threshold-Adaptive Loss Scaling (TALS), which penalizes large errors but not small ones.
3. **Used a tokenized representation:** They represented the human pose as a set of tokens, which restricts the estimated poses to valid human poses. This improves the robustness of the model to occlusion (when parts of the body are hidden from view).

**Benefits:** The TokenHMR method is more accurate and robust than current methods, and can be used in various applications such as computer vision, robotics, and healthcare.

**Resources:**

1. **Read the full paper:** [2404.16752] TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation
2. **Project page:** TokenHMR
3. **Code:** GitHub - saidwivedi/TokenHMR: [CVPR 2024] TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation
 

bnew






1/11
Ever wanted to rap🎤? Well, now you can!

Introducing our most capable Lip Sync model yet. It even works with instrumentals 🎶

2/11
Morph Studio: you are doing some great work here. I spent some time this afternoon with it, and I love the easy-to-use modular approach you are taking, and how you are building several must-have tools into it! (image to video, lip sync, audio). Very impressive!

3/11
We will keep at it! Thank you so much!

4/11
Where exactly in the interface is it located, and how do I use it?

5/11
it's in the model selection drop down in the settings panel, and can also be accessed after clicking on a generated shot!

6/11
Amazing, like everything this team does!

7/11
Thank you for your kind words!

8/11
Can you do this on live action? Version a film and match the lip flaps to the new language audio?

9/11
it works even better on live actions, and sometimes on animals too!

10/11
This looks fantastic!

11/11
Thank you Heather! Can't wait to see what you come up with 😆


 

bnew


1/1
🚨Paper Alert 🚨

➡️Paper Title: Learning Temporally Consistent Video Depth from Video Diffusion Priors

🌟Few pointers from the paper

🎯This work addresses the challenge of video depth estimation, which expects not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of directly developing a depth estimator from scratch, the authors reformulated the prediction task into a conditional generation problem.

🎯This allowed the authors to leverage the prior knowledge embedded in existing video generation models, thereby reducing learning difficulty and enhancing generalizability.

✡️ Concretely, they studied how to tame the public Stable Video Diffusion (SVD) model to predict reliable depth from input videos using a mixture of image depth and video depth datasets.

🎯The authors empirically confirmed that a procedural training strategy — first optimizing the spatial layers of SVD and then optimizing the temporal layers while keeping the spatial layers frozen — yielded the best results in terms of both spatial accuracy and temporal consistency.

🎯They further examined the sliding window strategy for inference on arbitrarily long videos. Their observations indicated a trade-off between efficiency and performance, with a one-frame overlap already producing favorable results (a windowing sketch follows these pointers).

🎯Extensive experimental results demonstrate the superiority of their approach, termed “ChronoDepth”, over existing alternatives, particularly in terms of the temporal consistency of the estimated depth.
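
The sliding-window pointer above (fixed-size windows overlapping by one frame, stitched into a full-length prediction) is easy to picture at the index level. The depth predictor below is a placeholder, and the real method also reconciles the overlapping predictions rather than simply dropping the duplicated frame.

```python
# Index-level sketch of sliding-window inference with a one-frame overlap.
# `predict_window_depth` is a placeholder for the video-depth model.
import numpy as np

def predict_window_depth(frames: np.ndarray) -> np.ndarray:
    """Placeholder: returns one HxW depth map per frame in the window."""
    return np.zeros((len(frames), 32, 32), dtype=np.float32)

def sliding_window_depth(video: np.ndarray, window: int = 8, overlap: int = 1) -> np.ndarray:
    stride = window - overlap
    depths = []
    start = 0
    while start < len(video):
        chunk = video[start:start + window]
        pred = predict_window_depth(chunk)
        # keep the overlapping frame(s) only from the first window they appear in
        depths.append(pred if start == 0 else pred[overlap:])
        if start + window >= len(video):
            break
        start += stride
    return np.concatenate(depths, axis=0)[: len(video)]

video = np.zeros((30, 32, 32, 3), dtype=np.float32)   # 30 toy frames
print(sliding_window_depth(video).shape)              # -> (30, 32, 32)
```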

🏢Organization: @ZJU_China , @Unibo , @AntGroup , @hi_rockuniverse
🧙Paper Authors: @jiahaoshao1 , Yuanbo Yang, Hongyu Zhou, @youmi42984813 , Yujun Shen, @mattpoggi , Yiyi Liao

1️⃣Read the Full Paper here: [2406.01493] Learning Temporally Consistent Video Depth from Video Diffusion Priors

2️⃣Project Page: ChronoDepth: Learning Temporally Consistent Video Depth from Video Diffusion Priors

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Sergio Prosvirini from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements




A.I Generated explanation:

**Title:** Learning Temporally Consistent Video Depth from Video Diffusion Priors

**Summary:** This paper is about creating a system that can accurately estimate the depth of objects in videos. This is a challenging task because the system needs to not only get the depth right for each individual frame but also make sure the depth is consistent across multiple frames.

**Key Points:**

* Instead of building a new system from scratch, the researchers used existing video generation models to help with the task. This made it easier to learn and improved the system's ability to work with different types of videos.
* They used a combination of image and video datasets to train the system, which helped it learn to predict reliable depth from input videos.
* The researchers found that a specific training strategy, where they optimized the spatial layers of the system first and then the temporal layers, produced the best results.
* They also tested a method for using the system on long videos, which involved processing the video in small chunks with some overlap between them. This involves a trade-off between efficiency and performance, but even with just a one-frame overlap, the results were good.
* The system, called "ChronoDepth", was shown to be better than other existing systems, especially when it came to maintaining consistent depth across multiple frames.

**Problem:** Current methods struggle to estimate depth in videos because they need to not only get the depth right for each individual frame but also make sure the depth is consistent across multiple frames.

**Flaws:** The current methods have some flaws that can lead to inaccurate results:

1. **Lack of temporal consistency:** Current methods focus on getting the depth right for each individual frame, but they don't ensure that the depth is consistent across multiple frames.
2. **Difficulty in learning:** Learning a depth estimator from scratch can be challenging, especially when it comes to handling complex video sequences.

**Solution:** The authors of this paper propose a new method that addresses these flaws. They:

1. **Reformulated the problem:** They turned the depth estimation problem into a conditional generation problem, which allows them to leverage the prior knowledge embedded in existing video generation models.
2. **Used a procedural training strategy:** They developed a training strategy that optimizes the spatial layers of the model first and then the temporal layers, which improves the model's ability to handle complex video sequences.
3. **Exploited video diffusion priors:** They used video diffusion priors to predict reliable depth from input videos, which improves the model's accuracy and robustness.

**Benefits:** The proposed method, called "ChronoDepth", is more accurate and robust than current methods, and can be used in various applications such as computer vision, robotics, and healthcare.

**Authors and Organizations:**

* The paper was written by researchers from Zhejiang University in China, the University of Bologna in Italy, Ant Group, and hi_rockuniverse.
* The authors are Jiahao Shao, Yuanbo Yang, Hongyu Zhou, @youmi42984813 , Yujun Shen, Matteo Poggi, and Yiyi Liao.

**Resources:**

* You can read the full paper here: [2406.01493] Learning Temporally Consistent Video Depth from Video Diffusion Priors
* The project page can be found here: ChronoDepth: Learning Temporally Consistent Video Depth from Video Diffusion Priors
 

bnew


1/1
🚀Innovation Alert 🚀

Meet “Audio Computer”, a new kind of computer that speaks our language.

Watch Full @TEDTalks from @jasonRugolo here : Welcome to the World of Audio Computers | Jason Rugolo | TED

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



 

bnew


1/1
🚨ICML Paper Alert 🚨

➡️Paper Title: Adaptive Horizon Actor-Critic for Policy Learning in Contact-Rich Differentiable Simulation

🌟Few pointers from the paper

🎯Model-Free Reinforcement Learning (MFRL), leveraging the policy gradient theorem, has demonstrated considerable success in continuous control tasks. However, these approaches are plagued by high gradient variance due to zeroth order gradient estimation, resulting in suboptimal policies.

🎯Conversely, First-Order Model-Based Reinforcement Learning (FO-MBRL) methods employing differentiable simulation provide gradients with reduced variance but are susceptible to sampling error in scenarios involving stiff dynamics, such as physical contact.

✡️ This paper investigates the source of this error and introduces “Adaptive Horizon Actor-Critic (AHAC)”, an FO-MBRL algorithm that reduces gradient error by adapting the model-based horizon to avoid stiff dynamics (a toy illustration follows these pointers).

🎯Empirical findings reveal that AHAC outperforms MFRL baselines, attaining 40% more reward across a set of locomotion tasks and efficiently scaling to high-dimensional control environments with improved wall-clock-time efficiency.

🎯 Lastly, this work suggests that future research should not only focus on refining algorithmic approaches for policy learning but also on enhancing simulator technologies to more effectively manage gradient error.
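
The adaptive-horizon pointer above can be caricatured with a tiny differentiable rollout: roll forward with first-order gradients, but cut the horizon as soon as a stiffness indicator fires, so the policy update never backpropagates through the stiff contact event. The 1-D "simulator", the stiffness test, and the absence of the critic bootstrap make this an illustration only, not the AHAC algorithm.

```python
# Caricature of adaptive-horizon rollouts in a differentiable simulator; not AHAC itself.
import torch

policy = torch.nn.Linear(2, 1)                 # maps (position, velocity) -> force
opt = torch.optim.Adam(policy.parameters(), lr=1e-2)
WALL, STIFFNESS, DT, MAX_H = 1.0, 1000.0, 0.02, 32

def step(pos, vel, force):
    """Toy 1-D point mass with a very stiff wall at x = WALL."""
    contact_force = -STIFFNESS * torch.clamp(pos - WALL, min=0.0)   # stiff when penetrating
    acc = force + contact_force
    vel = vel + DT * acc
    pos = pos + DT * vel
    return pos, vel, acc

for update in range(50):
    pos, vel = torch.zeros(1), torch.zeros(1)
    total_reward = torch.zeros(1)
    for h in range(MAX_H):
        force = policy(torch.cat([pos, vel]))
        pos, vel, acc = step(pos, vel, force)
        total_reward = total_reward + (-(pos - 0.9) ** 2).squeeze()  # reward: hover near x = 0.9
        if acc.abs().item() > 50.0:            # toy stiffness test -> truncate the horizon
            break
    loss = -total_reward                        # first-order gradient through the short rollout
    opt.zero_grad(); loss.backward(); opt.step()

print("final horizon used:", h + 1, "reward:", float(total_reward))
```

AHAC additionally bootstraps the truncated return with a learned critic and uses a different truncation criterion; both are omitted here for brevity.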

🏢Organization: @GeorgiaTech , @Stanford, @nvidia

🧙Paper Authors: @imgeorgiev , @krishpopdesu , @xujie7979 , @eric_heiden , @animesh_garg

1️⃣Read the Full Paper here: [2405.17784] Adaptive Horizon Actor-Critic for Policy Learning in Contact-Rich Differentiable Simulation

2️⃣Project Page: AHAC

3️⃣Code: GitHub - imgeorgiev/DiffRL: Learning Optimal Policies Through Contact in Differentiable Simulation

4️⃣Video: Adaptive Horizon Actor Critic

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Dmitry from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.




A.I Generated explanation:

**Title:** Adaptive Horizon Actor-Critic for Policy Learning in Contact-Rich Differentiable Simulation

**Summary:** This paper is about a new way to teach computers to make decisions in complex situations, like robots interacting with their environment. The goal is to make the computers learn faster and better.

**Problem:** There are two main ways to teach computers to make decisions: Model-Free Reinforcement Learning (MFRL) and First-Order Model-Based Reinforcement Learning (FO-MBRL). MFRL is good at learning from experience, but it can be slow and make mistakes. FO-MBRL is faster and more accurate, but it can get confused when things get complicated, like when a robot is interacting with its environment in a complex way.

**Solution:** The researchers came up with a new way to combine the strengths of both approaches, called Adaptive Horizon Actor-Critic (AHAC). AHAC adapts to the situation and adjusts its approach to avoid making mistakes. This makes it faster and more accurate than the other methods.

**Results:** The researchers tested AHAC on several tasks, like teaching a robot to walk or run. AHAC performed 40% better than the other methods and was able to handle complex situations more efficiently.

**Conclusion:** The researchers think that this new approach is a big step forward, but they also think that we need to improve the simulators we use to train the computers. This will help us make even better decision-making algorithms in the future.

**Resources:**

* Read the full paper here: https://arxiv.org/abs/2405.17784
* Project page: https://adaptive-horizon-actor-critic.github.io/
* Code: https://github.com/imgeorgiev/DiffRL
* Video: https://invidious.poast.org/watch?v=bAW9O3C_1ck
 

bnew


1/1
🚨CVPR 2024 Paper Alert 🚨

➡️Paper Title: HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

🌟Few pointers from the paper

🎯3D hand-object interaction data is scarce due to the hardware constraints in scaling up the data collection process.

🎯 In this paper, the authors have proposed “HOIDiffusion” for generating realistic and diverse 3D hand-object interaction data. Their model is a conditional diffusion model that takes both the 3D hand-object geometric structure and a text description as inputs for image synthesis (a generic conditional-sampling sketch follows these pointers).

🎯This offers a more controllable and realistic synthesis as they can specify the structure and style inputs in a disentangled manner.

🎯HOIDiffusion is trained by leveraging a diffusion model pre-trained on large-scale natural images and a few 3D human demonstrations.

🎯Beyond controllable image synthesis, the authors adopted the generated 3D data for learning 6D object pose estimation and showed its effectiveness in improving perception systems.
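
The pointers above describe a conditional diffusion model whose structure input (a rendered hand-object geometry map) and style input (a text prompt) stay disentangled. The sketch below shows only the generic shape of such conditional sampling, a stock DDPM reverse loop whose noise predictor takes both conditions; the denoiser is a placeholder, not the released model.

```python
# Generic conditional DDPM sampling loop; `denoiser` is a placeholder for a network
# conditioned on BOTH a geometry rendering (structure) and a text embedding (style).
import torch

T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def denoiser(x_t, t, geometry_cond, text_cond):
    """Placeholder epsilon-predictor; a real model would be a conditioned UNet."""
    return torch.zeros_like(x_t)

@torch.no_grad()
def sample(geometry_cond, text_cond, shape=(1, 3, 64, 64)):
    x = torch.randn(shape)
    for t in reversed(range(T)):
        eps = denoiser(x, t, geometry_cond, text_cond)
        mean = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x

geometry = torch.zeros(1, 3, 64, 64)   # stand-in for a rendered hand-object geometry map
text_emb = torch.zeros(1, 512)         # stand-in for a text (style) embedding
print(sample(geometry, text_emb).shape)
```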

🏢Organization: @UCSanDiego , @nvidia

🧙Paper Authors: Mengqi Zhang, Yang Fu, Zheng Ding, Sifei Liu, Zhuowen Tu, Xiaolong Wang

1️⃣Read the Full Paper here: [2403.12011] HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

2️⃣Project Page: HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

3️⃣Code: GitHub - Mq-Zhang1/HOIDiffusion: Official Code Release for HOIDiffusion (CVPR 2024)

🎥 Be sure to watch the attached Demo Video-Sound on 🔊🔊

🎵 Music by Rockot from @pixabay

Find this Valuable 💎 ?

♻️QT and teach your network something new

Follow me 👣, @NaveenManwani17 , for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

/search?q=#CVPR2024




A.I Generated explanation:

CVPR 2024 Paper Alert

A new research paper has been published, and it's exciting 🚨

Paper Title: HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data

Here are some key points from the paper:

* Problem: It's hard to collect data on how hands interact with objects in 3D because of limitations in hardware.
* Solution: The authors of the paper have created a model called "HOIDiffusion" that can generate realistic and diverse 3D hand-object interaction data. This model uses both 3D geometric structure and text descriptions to create images.
* Advantages: This model allows for more control and realism in image synthesis because it can separate structure and style inputs. It's also trained using a large dataset of natural images and a few 3D human demonstrations.
* Applications: The generated 3D data can be used to improve perception systems, such as estimating the 6D pose of objects.

Organization: The research was conducted at the University of California, San Diego (@UCSanDiego) and NVIDIA (@nvidia).

Authors: Mengqi Zhang, Yang Fu, Zheng Ding, Sifei Liu, Zhuowen Tu, and Xiaolong Wang.

Resources:

1️⃣ Read the full paper here: https://arxiv.org/abs/2403.12011
2️⃣ Project page: https://mq-zhang1.github.io/HOIDiffusion/
3️⃣ Code: https://github.com/Mq-Zhang1/HOIDiffusion
 

bnew


1/26
New Chinese DiT video AI generation model 【KLING】
Open access!
Generates 120-second videos at 30 FPS and 1080p, understands physics better, and models complex motion accurately.
prompt:
Traveling by train, viewing all sorts of landscapes through the window.
https://kling.kuaishou.com/

2/26
prompt:Little boy riding his bike in the garden through the changing seasons of fall, winter, spring and summer.

3/26
KLING

4/26
Panda playing the guitar

5/26
The rabbit who reads the newspaper and wears glasses

6/26


7/26
Give me a cappuccino.

8/26
Blooming Flowers

9/26
Car mirrors and sunsets

10/26
A man riding a horse through the Gobi Desert with a beautiful sunset behind him, movie quality.

11/26
An astronaut runs on the surface of the moon, the low angle shot shows the vast background of the moon, the movement is smooth and appears lightweight

12/26
A rally car taking a fast turn on a track

13/26
A Chinese boy wearing glasses enjoys a delicious cheeseburger with his eyes closed in a fast food restaurant

14/26
A Chinese man sits at a table and eats noodles with chopsticks

15/26
Chef chopping onions in the kitchen for the preparation of the dish

16/26
Carefully pouring the milk into the cup, the milk flowed smoothly and the cup was gradually filled with a milky white color

17/26
A little man with blocks visiting an art gallery

18/26
A white cat driving in a car through a busy downtown street with tall buildings and pedestrians in the background

19/26
Macro shot of a volcano erupting in a coffee cup

20/26
A man and woman walking hand in hand under a starry sky with a bucket in the background

21/26
Chimneys in the setting sun

22/26
Dew on blue rose petals, HD, close up, detail

23/26
A corgi wearing sunglasses walks on the beach of a tropical island

24/26
Multi-resolution support, so this is the square version
A corgi wearing sunglasses walks on the beach of a tropical island 2

25/26
Versions for vertical screen
A corgi wearing sunglasses walks on the beach of a tropical island 3

26/26
How to get access:
Download their app, called 快影 (Kuaiying).
Then join the waitlist to apply; they have just released the test to a few users.


 

bnew








1/11
Introducing 🧮Abacus Embeddings, a simple tweak to positional embeddings that enables LLMs to do addition, multiplication, sorting, and more. Our Abacus Embeddings trained only on 20-digit addition generalise near perfectly to 100+ digits. 1/n

2/11
Turns out that Transformers are bad at aligning digits, i.e. they can’t tell that the “8” in 3487 is aligned with the “2” in 1923. So we add a special embedding (in addition to the standard embedding) corresponding to the place value of each digit. Super easy! (A rough sketch of this appears after the thread.) 2/n

3/11
We also tinker with a few other hacks - weight shared recurrence (aka looped transformer, aka deep thinking) can improve performance for out of domain addition. 3/n

4/11
Turns out Abacus Embeddings can do multiplication of moderate size. Additionally, we found that different architectures like Looped Transformer may be better suited to various sorting problems. 4/n

5/11
Finally, we show that Abacus Embeddings, which are applied only on numbers, are easily combined with standard LLM embeddings like RoPE or FIRE, and can potentially improve their maths ability. 5/n

6/11
We show much more in our appendix and open source all of the datasets with the code so you can train looped transformers too! BTW - these models train super fast. Many of our experiments were trained by “cramming” - we use a single A4000 GPU for 1 day. 6/n

7/11
Paper 📖: [2405.17399] Transformers Can Do Arithmetic with the Right Embeddings
Code 💻: GitHub - mcleish7/arithmetic: Code to reproduce "Transformers Can Do Arithmetic with the Right Embeddings", McLeish et al (2024)

This was a collaborative effort with: @arpitbansal297 @neeljain1717 @alex_stein0 @jwkirchenbauer @bartoldson @bkailkhu @bhatele @jonasgeiping @A_v_i__S @tomgoldsteincs 7/7

8/11
I think I disagree with the claim of generalization with your method.
Since you have an offset parameter beta sampled uniformly in [1,100], you virtually tell your model how to behave for sequences of size [1,120]. Could you please report performances with seq. len > 120.

9/11
We use the term generalisation/extrapolation as “generalisation of the learnt algorithm.” There are lots of great papers on length extrapolation but our focus is algorithmic extrapolation within a maximum sequence length… for now

10/11
This is super interesting work thanks for sharing!!

Sorry that twitter comments are all the same small-minded joke about calculators.

11/11
🤯
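
As referenced in tweet 2, here is a rough sketch of one reading of the place-value idea: each digit token gets an extra learned embedding indexed by its position counted from the least significant digit of the number it belongs to, added on top of the ordinary token embedding. The toy tokenizer and sizes are invented, and the released code (linked in tweet 7) also uses a random index offset for length generalisation, which is omitted here.

```python
# Sketch of Abacus-style place-value embeddings: digits get an extra embedding indexed
# by their place value (units=1, tens=2, ...) inside their own number. Simplified; the
# paper additionally samples a random offset for the indices to aid length generalisation.
import torch
import torch.nn as nn

VOCAB = {ch: i for i, ch in enumerate("0123456789+= ")}   # toy character vocabulary

def place_value_ids(text: str) -> list[int]:
    """0 for non-digits; digits get 1 + distance from the LEAST significant digit."""
    ids = [0] * len(text)
    i = 0
    while i < len(text):
        if text[i].isdigit():
            j = i
            while j < len(text) and text[j].isdigit():
                j += 1
            for k in range(i, j):                 # number spans [i, j)
                ids[k] = 1 + (j - 1 - k)          # rightmost digit -> 1 (units)
            i = j
        else:
            i += 1
    return ids

class DigitAwareEmbedding(nn.Module):
    def __init__(self, d_model=32, max_digits=101):
        super().__init__()
        self.tok = nn.Embedding(len(VOCAB), d_model)
        self.place = nn.Embedding(max_digits + 1, d_model)   # index 0 = "not a digit"

    def forward(self, text: str) -> torch.Tensor:
        tok_ids = torch.tensor([VOCAB[c] for c in text])
        pv_ids = torch.tensor(place_value_ids(text))
        return self.tok(tok_ids) + self.place(pv_ids)         # (seq_len, d_model)

emb = DigitAwareEmbedding()
print(place_value_ids("3487+1923="))   # [4, 3, 2, 1, 0, 4, 3, 2, 1, 0]
print(emb("3487+1923=").shape)         # torch.Size([10, 32])
```

Note how the "8" in 3487 and the "2" in 1923 both receive place-value index 2, which is exactly the alignment signal the tweet describes.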


 

bnew


Aitomatic’s SemiKong uses AI to reshape chipmaking processes

Dean Takahashi @deantak

July 9, 2024 6:00 AM

Aitomatic's SemiKong is bringing AI to chipmaking.


Image Credit: Aitomatic




Aitomatic announced SemiKong, the world’s first open-source AI Large Language Model (LLM) designed specifically for the semiconductor industry.

Unveiled at Semicon West 2024, SemiKong aims to revolutionize semiconductor processes and fabrication technology, potentially reshaping the $500 billion semiconductor industry in the next five years, the company said.

Developed in collaboration with FPT Software and with industry expertise from semiconductor companies as members of the AI Alliance, Aitomatic said SemiKong outperforms generic LLMs like GPT and Llama3 on industry-specific tasks.

SemiKong shows marked improvements in accuracy, relevance, and understanding of semiconductor processes. Even its smaller version often surpasses larger general-purpose models in domain-specific applications, giving potential for accelerated innovation and reduced costs across the semiconductor value chain, Aitomatic said.



Christopher Nguyen, CEO of Aitomatic, is the leader behind the SemiKong project, and co-lead of the Foundation Models Focus Area in the AI Alliance.

In a statement, Nguyen said, “SemiKong is set to redefine semiconductor manufacturing. This open innovation model, enabled by the AI Alliance, harnesses collective expertise for industry-specific challenges. At Aitomatic, we’re using SemiKong to create Domain-Specific AI Agents that tackle complex fabrication problems with unprecedented effectiveness.”

Atsushi Suzuki, director of product lifecycle management DX at Tokyo Electron and AI Alliance member, said in a statement, “As an industry expert collaborating through the AI Alliance, I believe SemiKong represents a significant step forward in applying AI to semiconductor manufacturing.”

And Daisuke Oku, senior specialist at Tokyo Electron and an early proposer of a Semiconductor Industry Model, said in a statement, “SemiKong is the beginning of an exciting journey in open-source AI for semiconductors. Aitomatic’s innovative approach has the potential to create huge leaps for our industry.”

Phong Nguyen, chief AI Officer of FPT Software, said in a statement, “FPT Software is excited to be part of this innovative initiative and eager to explore the potential applications of this model, especially to spark the convergence of global AI and Semiconductor. We are confident that our participation will reinforce our position as a frontrunner in shaping the future of the global semiconductor industry.”

As SemiKong drives down semiconductor production costs, consumers could see more powerful smartphones, laptops, and smart home devices at lower prices within the next few years. Anthony Annunziata, head of AI Open Innovation at IBM Research, said in a statement, “The AI Alliance’s open collaboration model was key to enabling this breakthrough. SemiKong DRAFT v0.6 exemplifies how bringing together diverse expertise can drive significant progress in critical industries like semiconductor manufacturing.”

SemiKong will be available for download on HuggingFace and GitHub starting July 9th, 2024. It serves as a foundation for companies to develop proprietary models while leveraging industry-wide knowledge. The next, more powerful version of SemiKong is planned for December 2024, with the first process-specific models expected by September 2024.

The collaborators are committed to ongoing R&D, aiming to build an ecosystem of AI tools that will propel the semiconductor industry into a new era of innovation and efficiency.
 