1/1
🚨 Paper Alert 🚨

➡️ Paper Title: EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

🌟 A few pointers from the paper

🎯Building effective imitation learning methods that enable robots to learn from limited data and still generalize across diverse real-world environments is a long-standing problem in robot learning.

♠️ In this paper, the authors propose “EquiBot”, a robust, data-efficient, and generalizable approach for learning robot manipulation tasks. Their approach combines SIM(3)-equivariant neural network architectures with diffusion models.

🎯This ensures that their learned policies are invariant to changes in scale, rotation, and translation, enhancing their applicability to unseen environments while retaining the benefits of diffusion-based policy learning, such as multi-modality and robustness.
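To make the invariance claim concrete, here is a minimal toy sketch (my own illustration, not the authors' code): canonicalizing a point-cloud observation by its centroid and average scale makes any downstream policy blind to translation and scaling of the scene. EquiBot instead builds SIM(3) equivariance directly into its network layers, and handling rotation additionally requires an equivariant frame, which this toy omits.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(32, 3)), rng.normal(size=(3, 32))

def toy_policy(points):
    # toy stand-in "policy": pools point features into a 3-D action
    return np.tanh(points @ W1.T).mean(axis=0) @ W2.T

def canonicalize(points):
    # remove translation and scale; full SIM(3) handling would also need an
    # equivariant rotation frame, which EquiBot's architecture provides internally
    centered = points - points.mean(axis=0)
    scale = np.linalg.norm(centered, axis=1).mean()
    return centered / scale

cloud = rng.normal(size=(128, 3))
s, t = 2.5, np.array([0.3, -1.0, 4.0])        # arbitrary scale and translation
transformed = s * cloud + t

a1 = toy_policy(canonicalize(cloud))
a2 = toy_policy(canonicalize(transformed))
print(np.allclose(a1, a2))                    # True: the action ignores scale/translation
```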

🎯They showed on a suite of 6 simulation tasks that their proposed method reduces the data requirements and improves generalization to novel scenarios.

🎯In the real world, with 10 variations of 6 mobile manipulation tasks, they showed that their method can easily generalize to novel objects and scenes after learning from just 5 minutes of human demonstrations in each task.

🏢 Organization: @Stanford

🧙 Paper Authors: @yjy0625, Zi-ang Cao, @CongyueD, @contactrika, @SongShuran, @leto__jean

1️⃣Read the Full Paper here: [2407.01479] EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning

2️⃣Project Page: EquiBot

3️⃣Code: GitHub - yjy0625/equibot: Official implementation for paper "EquiBot: SIM(3)-Equivariant Diffusion Policy for Generalizable and Data Efficient Learning".

🎥 Be sure to watch the attached video. Sound on 🔊🔊

🎵 Music by Zakhar Valaha from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/1
🚨Paper Alert 🚨

➡️ Paper Title: Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

🌟 A few pointers from the paper

🎯In this paper, the authors present “Diffusion Forcing”, a new training paradigm in which a diffusion model is trained to denoise a set of tokens with independent per-token noise levels.
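Here is a minimal sketch of that training signal, under my own assumptions about the noise schedule and with a toy stand-in for the sequence model (this is not the authors' implementation): every token gets an independently sampled noise level, and the model is asked to denoise all tokens jointly.

```python
import torch

NUM_LEVELS = 1000  # number of discrete noise levels (an arbitrary choice for this sketch)

def diffusion_forcing_loss(model, x):
    # x: (batch, seq_len, dim) clean token sequence
    b, t, _ = x.shape
    k = torch.randint(0, NUM_LEVELS, (b, t))                        # independent level per token
    alpha_bar = torch.cos(0.5 * torch.pi * k / NUM_LEVELS) ** 2     # toy cosine schedule
    alpha_bar = alpha_bar.unsqueeze(-1)
    eps = torch.randn_like(x)
    x_noisy = alpha_bar.sqrt() * x + (1 - alpha_bar).sqrt() * eps   # per-token corruption
    eps_hat = model(x_noisy, k)                                     # model sees per-token levels
    return torch.mean((eps_hat - eps) ** 2)

class ToyDenoiser(torch.nn.Module):
    # stand-in for the causal sequence model; it only needs to accept (x_noisy, k)
    def __init__(self, dim=16):
        super().__init__()
        self.proj = torch.nn.Linear(dim + 1, dim)
    def forward(self, x_noisy, k):
        k_feat = (k.float() / NUM_LEVELS).unsqueeze(-1)
        return self.proj(torch.cat([x_noisy, k_feat], dim=-1))

loss = diffusion_forcing_loss(ToyDenoiser(), torch.randn(4, 8, 16))
```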

🎯They applied Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones.

🎯Their approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories.

🎯Their method offers a range of additional capabilities, such as:
⚓ rolling out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge; and
⚓ new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks.

🎯In addition to its empirical success, their method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution.

🏢 Organization: @MIT_CSAIL

🧙 Paper Authors: @BoyuanChen0, Diego Marti Monso, @du_yilun, @max_simchowitz, @RussTedrake, @vincesitzmann

1️⃣Read the Full Paper here: [2407.01392] Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

2️⃣Project Page: Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

3️⃣Code: GitHub - buoyancy99/diffusion-forcing: code for "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion"

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

🎵 Music by Nick Valerson from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/1
🚨Paper Alert 🚨

➡️ Paper Title: MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

🌟 A few pointers from the paper

🎯In recent years, generative artificial intelligence has achieved significant advancements in the field of image generation, spawning a variety of applications.

🎯However, video generation still faces considerable challenges in various aspects, such as controllability, video length, and richness of details, which hinder the application and popularization of this technology.

🎯In this work, the authors propose a controllable video generation framework, dubbed “MimicMotion”, which can generate high-quality videos of arbitrary length with any motion guidance.

🎯Compared with previous methods, their approach has several highlights.
⚓ Firstly, confidence-aware pose guidance achieves temporal smoothness, so model robustness can be enhanced with large-scale training data.
⚓ Secondly, regional loss amplification based on pose confidence significantly reduces image distortion (see the sketch after this list).

🎯Lastly, for generating long, smooth videos, a progressive latent fusion strategy is proposed; by this means, videos of arbitrary length can be generated with acceptable resource consumption. Extensive experiments and user studies show that MimicMotion achieves significant improvements over previous approaches in multiple aspects.
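As referenced above, here is a hedged sketch of what confidence-aware regional loss amplification could look like; the Gaussian keypoint heatmap, its radius, and the amplification factor are my assumptions, not the paper's exact formulation.

```python
import torch

def confidence_weighted_loss(pred, target, keypoints, confidence, sigma=8.0, amp=2.0):
    # pred/target: (B, C, H, W); keypoints: (B, K, 2) pixel coords; confidence: (B, K) in [0, 1]
    b, _, h, w = pred.shape
    ys = torch.arange(h).view(1, 1, h, 1)
    xs = torch.arange(w).view(1, 1, 1, w)
    ky = keypoints[..., 1].view(b, -1, 1, 1)
    kx = keypoints[..., 0].view(b, -1, 1, 1)
    dist2 = (ys - ky) ** 2 + (xs - kx) ** 2
    heat = confidence.view(b, -1, 1, 1) * torch.exp(-dist2 / (2 * sigma ** 2))  # keypoint heatmaps
    weight = 1.0 + amp * heat.max(dim=1, keepdim=True).values                   # amplify confident regions
    return (weight * (pred - target) ** 2).mean()

loss = confidence_weighted_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
                                keypoints=torch.rand(2, 17, 2) * 64, confidence=torch.rand(2, 17))
```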

🏢 Organization: @TencentGlobal, @sjtu1896

🧙 Paper Authors: Yuang Zhang, Jiaxi Gu, Li-Wen Wang, Han Wang, Junqi Cheng, Yuefeng Zhu, Fangyuan Zou

1️⃣Read the Full Paper here: [2406.19680] MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

2️⃣Project Page: MimicMotion

3️⃣Code: GitHub - Tencent/MimicMotion: High-Quality Human Motion Video Generation with Confidence-aware Pose Guidance

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

🎵 Music by Alexander Lisenkov from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/1
🚨IROS 2024 Paper Alert 🚨

➡️ Paper Title: Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System

🌟 A few pointers from the paper

🎯Automating dexterous, contact-rich manipulation tasks using rigid robots is a significant challenge in robotics. Rigid robots, defined by their actuation through position commands, face issues of excessive contact forces due to their inability to adapt to contact with the environment, potentially causing damage.

🎯While compliance control schemes have been introduced to mitigate these issues by controlling forces via external sensors, they are hampered by the need for fine-tuning task-specific controller parameters. Learning from Demonstrations (LfD) offers an intuitive alternative, allowing robots to learn manipulations through observed actions.

🎯In this work, the authors introduce a novel system to enhance the teaching of dexterous, contact-rich manipulations to rigid robots. Their system is twofold: firstly, it incorporates a teleoperation interface utilizing Virtual Reality (VR) controllers, designed to provide an intuitive and cost-effective method for task demonstration with haptic feedback.

🎯Secondly, they present Comp-ACT (Compliance Control via Action Chunking with Transformers), a method that leverages these demonstrations to learn variable compliance control from only a handful of examples.
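To ground the term "variable compliance", here is a textbook impedance-control sketch (illustrative only; gains, names, and the critically damped choice are my assumptions, not the authors' controller): the policy predicts a per-axis stiffness together with the pose target, and a simple spring-damper law converts that into a commanded force, so low stiffness yields gentle contact and high stiffness yields precise tracking.

```python
import numpy as np

def impedance_force(x, x_des, dx, stiffness, damping_ratio=1.0):
    # x, x_des, dx: (3,) current position, target position, current velocity
    k = np.asarray(stiffness, dtype=float)     # per-axis stiffness predicted by the policy
    d = 2.0 * damping_ratio * np.sqrt(k)       # critically damped by default
    return k * (x_des - x) - d * dx            # commanded Cartesian force

# a compliant (low-stiffness) command barely resists contact; a stiff one tracks precisely
target = np.array([0.0, 0.0, 0.01])
print(impedance_force(np.zeros(3), target, np.zeros(3), stiffness=[50, 50, 50]))
print(impedance_force(np.zeros(3), target, np.zeros(3), stiffness=[800, 800, 800]))
```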

🎯Their method has been validated across various complex, contact-rich manipulation tasks using single-arm and bimanual robot setups in simulated and real-world environments, demonstrating its effectiveness in teaching robots dexterous manipulation with enhanced adaptability and safety.

🏢 Organization: University of Tokyo (@UTokyo_News_en), OMRON SINIC X Corporation (@sinicx_jp)

🧙 Paper Authors: @tatsukamijo, @cambel07, @mh69543540

1️⃣Read the Full Paper here: [2406.14990] Learning Variable Compliance Control From a Few Demonstrations for Bimanual Robot with Haptic Feedback Teleoperation System

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/1
🚨ECCV 2024 Paper Alert 🚨

➡️ Paper Title: Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

🌟 A few pointers from the paper

🎯Monocular depth estimation in endoscopy videos can enable assistive and robotic surgery to obtain better coverage of the organ and detection of various health issues.

🎯Despite promising progress on mainstream natural-image depth estimation, these techniques perform poorly on endoscopy images due to a lack of strong geometric features and challenging illumination effects.

🎯In this paper, the authors utilize photometric cues, i.e., the light emitted from the endoscope and reflected by the surface, to improve monocular depth estimation.

🎯They first created two novel loss functions with supervised and self-supervised variants that utilize a per-pixel shading representation. They then proposed a novel depth refinement network (PPSNet) that leverages the same per-pixel shading representation.
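For intuition, here is a simplified near-field shading term under the common co-located point-light assumption (an illustration, not the paper's exact per-pixel shading representation): image brightness falls off as cos(θ)/r², which ties the predicted depth and normals to the observed pixel intensities.

```python
import numpy as np

def shading_from_depth(depth, normals, K):
    # depth: (H, W); normals: (H, W, 3) unit surface normals; K: 3x3 camera intrinsics
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    rays = np.stack([(u - K[0, 2]) / K[0, 0], (v - K[1, 2]) / K[1, 1], np.ones_like(depth)], -1)
    points = rays * depth[..., None]                 # 3-D point per pixel; light sits at the origin
    r2 = (points ** 2).sum(-1)
    light_dir = -points / np.sqrt(r2)[..., None]     # direction from the surface back to the light
    cos_theta = np.clip((normals * light_dir).sum(-1), 0.0, None)
    return cos_theta / np.maximum(r2, 1e-6)          # per-pixel shading up to albedo and gain

H, W = 4, 4
K = np.array([[100.0, 0.0, W / 2], [0.0, 100.0, H / 2], [0.0, 0.0, 1.0]])
shade = shading_from_depth(np.full((H, W), 0.05), np.tile([0.0, 0.0, -1.0], (H, W, 1)), K)
```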

🎯Finally, the authors also introduce teacher-student transfer learning to produce better depth maps from both synthetic data with supervision and clinical data with self-supervision. They achieve state-of-the-art results on the C3VD dataset while estimating high-quality depth maps from clinical data.

🏢 Organization: Department of Computer Science, University of North Carolina at Chapel Hill (@unccs)

🧙 Paper Authors: @Yahskapar, Samuel Ehrenstein, Shuxian Wang, Inbar Fried, Stephen M. Pizer, Marc Niethammer, @SenguptRoni

1️⃣Read the Full Paper here: [2403.17915] Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

2️⃣Project Page: Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos

3️⃣Code: GitHub - Roni-Lab/PPSNet: PPSNet: Leveraging Near-Field Lighting for Monocular Depth Estimation from Endoscopy Videos (ECCV, 2024)

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

🎵 Music by Denys Kyshchuk from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#ECCV2024



1/1
🚨ECCV 2024 Paper Alert 🚨

➡️ Paper Title: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

🌟 A few pointers from the paper

🎯Large, pretrained latent diffusion models (LDMs) have demonstrated an extraordinary ability to generate creative content, specialize to user data through few-shot fine-tuning, and condition their output on other modalities, such as semantic maps.

🎯However, are they usable as large-scale data generators, e.g., to improve tasks in the perception stack, like semantic segmentation? The authors of this paper investigate this question in the context of autonomous driving and answer it with a resounding "yes".

🎯The authors propose an efficient data generation pipeline termed “DGInStyle”.

🧊First, they examined the problem of specializing a pretrained LDM to semantically-controlled generation within a narrow domain.
🧊Second, they designed a Multi-resolution Latent Fusion technique to overcome the bias of LDMs towards dominant objects (see the sketch after this list).
🧊Third, they proposed a Style Swap technique to endow the rich generative prior with the learned semantic control.
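As referenced in the second point, here is a heavily simplified sketch of what a multi-resolution latent fusion step could look like (the shapes, crop placement, and blending rule are my assumptions, not the paper's algorithm): a latent generated for the full frame is blended with a higher-detail latent obtained from an upscaled crop, so small objects are not washed out by the dominant ones.

```python
import numpy as np

def fuse_crop_latent(global_latent, crop_latent, box, blend=0.7):
    # global_latent: (C, H, W) latent for the full frame
    # crop_latent:   (C, h, w) latent re-encoded from an upscaled crop around small objects
    # box:           (y0, x0) top-left corner of the crop in the global latent grid
    y0, x0 = box
    c, h, w = crop_latent.shape
    fused = global_latent.copy()
    region = fused[:, y0:y0 + h, x0:x0 + w]
    fused[:, y0:y0 + h, x0:x0 + w] = blend * crop_latent + (1 - blend) * region
    return fused

fused = fuse_crop_latent(np.zeros((4, 64, 64)), np.ones((4, 16, 16)), box=(10, 20))
print(fused[0, 10, 20], fused[0, 0, 0])   # 0.7 inside the fused crop, 0.0 elsewhere
```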

🎯Using DGInStyle, they generate a diverse dataset of street scenes, train a domain-agnostic semantic segmentation model on it, and evaluate the model on multiple popular autonomous driving datasets.

🎯Their approach consistently increases the performance of several domain generalization methods, in some cases by +2.5 mIoU compared to the previous state-of-the-art method without their generative augmentation scheme.

🏢 Organization: @ETH_en, @KU_Leuven, @INSAITinstitute Sofia

🧙 Paper Authors: Yuru Jia, @lukashoyer3, @ShengyHuang, @TianfuWang2, Luc Van Gool, Konrad Schindler, @AntonObukhov1

1️⃣Read the Full Paper here: [2312.03048] DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

2️⃣Project Page: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

3️⃣Generation Code: GitHub - yurujaja/DGInStyle: DGInStyle: Domain-Generalizable Semantic Segmentation with Image Diffusion Models and Stylized Semantic Control

4️⃣Segmentation Code: GitHub - yurujaja/DGInStyle-SegModel: Downstream semantic segmentation evaluation of DGInStyle.

🎥 Be sure to watch the attached video

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/1
🚨CVPR 2024 Paper Alert 🚨

➡️ Paper Title: ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

🌟 A few pointers from the paper

🎯Monocular 3D human mesh estimation is an ill-posed problem characterized by inherent ambiguity and occlusion. While recent probabilistic methods propose generating multiple solutions, little attention is paid to obtaining high-quality estimates from them.

🎯To address this limitation, the authors introduce “ScoreHypo”, a versatile framework that first leverages their novel “HypoNet” to generate multiple hypotheses and then employs a meticulously designed scorer, “ScoreNet”, to evaluate and select high-quality estimates.

🎯ScoreHypo formulates the estimation process as a reverse denoising process, where HypoNet produces a diverse set of plausible estimates that effectively align with the image cues.

🎯Subsequently, ScoreNet is employed to rigorously evaluate and rank these estimates based on their quality and finally identify superior ones.
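The generate-then-score pattern is easy to sketch with stand-in functions (these placeholders are not the paper's networks): sample several hypotheses, score each one, and return the top-ranked estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

def hyponet_sample(image_feat, n_hyp=10):
    # placeholder for the reverse-diffusion sampler that proposes mesh parameters
    return image_feat + 0.1 * rng.normal(size=(n_hyp, image_feat.shape[-1]))

def scorenet(image_feat, hypotheses):
    # placeholder scorer: here simply the negative distance to the conditioning features
    return -np.linalg.norm(hypotheses - image_feat, axis=-1)

image_feat = rng.normal(size=(72,))        # e.g. a pose/shape-sized feature vector
hyps = hyponet_sample(image_feat)          # multiple plausible mesh hypotheses
scores = scorenet(image_feat, hyps)        # rank them
best = hyps[np.argmax(scores)]             # the estimate the framework would return
```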

🎯Experimental results demonstrate that HypoNet outperforms existing state-of-the-art probabilistic methods as a multi-hypothesis mesh estimator. Moreover, the estimates selected by ScoreNet significantly outperform random generation or simple averaging.

🎯Notably, the trained ScoreNet exhibits generalizability, as it can effectively score existing methods and significantly reduce their errors by more than 15%.

🏢 Organization: @PKU1898, International Digital Economy Academy (IDEA), @sjtu1896

🧙 Paper Authors: Yuan Xu, @XiaoxuanMa_, Jiajun Su, @walterzhu8, Yu Qiao, Yizhou Wang

1️⃣Read the Full Paper here: https://shorturl.at/pyIuc

2️⃣Project Page: ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring

3️⃣Code: Coming 🔜

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

🎵 Music by Pavel Bekirov from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#CVPR2024



1/1
🚨 Paper Alert 🚨

➡️ Paper Title: DoubleTake: Geometry Guided Depth Estimation

🌟 A few pointers from the paper

🎯Estimating depth from a sequence of posed RGB images is a fundamental computer vision task, with applications in augmented reality, path planning, etc. Prior work typically makes use of previous frames in a multi-view stereo framework, relying on matching textures in a local neighborhood.

🎯In contrast, the authors' model leverages historical predictions by giving the latest 3D geometry as an extra input to the network. This self-generated geometric hint can encode information from areas of the scene not covered by the keyframes, and it is more regularized than individually predicted depth maps for previous frames.

🎯The authors introduce a Hint MLP, which combines cost-volume features with a hint of the prior geometry, rendered as a depth map from the current camera location, together with a measure of confidence in the prior geometry.
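Here is a hedged sketch of that idea (layer sizes and the Softplus output head are my assumptions, not the paper's architecture): per pixel, cost-volume features are concatenated with the rendered depth hint and its confidence, and a small MLP maps them to a depth estimate.

```python
import torch

class HintMLP(torch.nn.Module):
    def __init__(self, cost_dim=64, hidden=128):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(cost_dim + 2, hidden), torch.nn.ReLU(),
            torch.nn.Linear(hidden, 1), torch.nn.Softplus(),   # keep predicted depth positive
        )

    def forward(self, cost_feat, hint_depth, hint_conf):
        # cost_feat: (N, cost_dim); hint_depth, hint_conf: (N, 1) per-pixel hint and confidence
        return self.net(torch.cat([cost_feat, hint_depth, hint_conf], dim=-1))

depth = HintMLP()(torch.randn(1024, 64), torch.rand(1024, 1) * 5.0, torch.rand(1024, 1))
```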

🎯They demonstrated that their method, which can run at interactive speeds, achieves state-of-the-art estimates of depth and 3D scene reconstruction in both offline and incremental evaluation scenarios.

🏢 Organization: @NianticLabs, @ucl

🧙 Paper Authors: @MohammedAmr1, @AleottiFilippo, Jamie Watson, Zawar Qureshi, @gui_ggh, Gabriel Brostow, Sara Vicente, @_mdfirman

1️⃣Read the Full Paper here: https://nianticlabs.github.io/doubletake/resources/DoubleTake.pdf

2️⃣Project Page: DoubleTake: Geometry Guided Depth Estimation

3️⃣Code: Coming 🔜

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/2
🚨Paper Alert 🚨

➡️ Paper Title: L4GM: Large 4D Gaussian Reconstruction Model

🌟 A few pointers from the paper

🎯In this paper, the authors present L4GM, the first 4D Large Reconstruction Model that produces animated objects from single-view video input, in a single feed-forward pass that takes only a second.

🎯Key to their success is a novel dataset of multiview videos containing curated, rendered animated objects from Objaverse. This dataset depicts 44K diverse objects with 110K animations rendered in 48 viewpoints, resulting in 12M videos with a total of 300M frames.

🎯The authors keep L4GM simple for scalability and build it directly on top of LGM, a pretrained 3D Large Reconstruction Model that outputs 3D Gaussian ellipsoids from multiview image input.

🎯L4GM outputs a per-frame 3D Gaussian Splatting representation from video frames sampled at a low fps and then upsamples the representation to a higher fps to achieve temporal smoothness.

🎯They add temporal self-attention layers to the base LGM to help it learn consistency across time, and utilize a per-timestep multiview rendering loss to train the model.

🎯The representation is upsampled to a higher frame rate by training an interpolation model that produces intermediate 3D Gaussian representations. They showed that L4GM, despite being trained only on synthetic data, generalizes extremely well to in-the-wild videos, producing high-quality animated 3D assets.
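As a toy illustration of the temporal upsampling step (linear interpolation stands in for the learned interpolation model, and the parameter layout is an assumption): per-frame Gaussian parameters predicted at low fps are densified by blending adjacent keyframes.

```python
import numpy as np

def upsample_gaussians(frames, factor=4):
    # frames: (T, N, D) per-frame Gaussian parameters (e.g. positions, scales, colors)
    dense = []
    for a, b in zip(frames[:-1], frames[1:]):
        for i in range(factor):
            w = i / factor
            dense.append((1 - w) * a + w * b)   # linear blend between adjacent keyframes
    dense.append(frames[-1])
    return np.stack(dense)

dense = upsample_gaussians(np.random.rand(8, 1000, 14))
print(dense.shape)   # (29, 1000, 14): 8 keyframes densified to 29 frames
```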

🏢 Organization: @nvidia, @UofT, @Cambridge_Uni, @MIT, S-Lab, @NTUsg

🧙 Paper Authors: Jiawei Ren, Kevin Xie, @ashmrz10, Hanxue Liang, Xiaohui Zeng, @karsten_kreis, @liuziwei7, Antonio Torralba, @FidlerSanja, Seung Wook Kim, @HuanLing6

1️⃣Read the Full Paper here: [2406.10324] L4GM: Large 4D Gaussian Reconstruction Model

2️⃣Project Page: L4GM: Large 4D Gaussian Reconstruction Model

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

🎵 Music by Praz Khanal from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

2/2
Impressive



1/1
🚨Paper Alert 🚨

➡️ Paper Title: Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

🌟 A few pointers from the paper

🎯In this paper, the authors present “Follow-Your-Emoji”, a diffusion-based framework for portrait animation that animates a reference portrait with target landmark sequences.

🎯The main challenge of portrait animation is to preserve the identity of the reference portrait and transfer the target expression to this portrait while maintaining temporal consistency and fidelity.

🎯To address these challenges, Follow-Your-Emoji equips the powerful Stable Diffusion model with two well-designed technologies. Specifically, the authors first adopt a new explicit motion signal, namely expression-aware landmarks, to guide the animation process.

🎯The authors found that these landmarks not only ensure accurate motion alignment between the reference portrait and the target motion during inference, but also increase the ability to portray exaggerated expressions (e.g., large pupil movements) and avoid identity leakage.

🎯The authors also propose a facial fine-grained loss that uses both expression and facial masks to improve the model's perception of subtle expressions and its reconstruction of the reference portrait's appearance.
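A hedged sketch of such a facial fine-grained loss (the mask semantics and weights are my assumptions, not the paper's exact formulation): reconstruction error is re-weighted by a face-region mask and an expression-region mask, so subtle areas such as the eyes and mouth dominate the objective.

```python
import torch

def fine_grained_loss(pred, target, face_mask, expr_mask, w_face=1.0, w_expr=2.0):
    # pred/target: (B, C, H, W); masks: (B, 1, H, W) in [0, 1]
    err = (pred - target) ** 2
    weight = 1.0 + w_face * face_mask + w_expr * expr_mask   # emphasize face and expression regions
    return (weight * err).mean()

loss = fine_grained_loss(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64),
                         face_mask=torch.rand(2, 1, 64, 64), expr_mask=torch.rand(2, 1, 64, 64))
```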

🎯Accordingly, their method demonstrates strong performance in controlling the expressions of freestyle portraits, including real humans, cartoons, sculptures, and even animals.

🎯By leveraging a simple and effective progressive generation strategy, they extended their model to stable long-term animation, thus increasing its potential application value.

🎯To address the lack of a benchmark for this field, they introduce EmojiBench, a comprehensive benchmark comprising diverse portrait images, driving videos, and landmarks. Extensive evaluations on EmojiBench verify the superiority of Follow-Your-Emoji.

🏢 Organization: @hkust, @TencentGlobal, @Tsinghua_Uni

🧙 Paper Authors: Yue Ma, Hongyu Liu, Hongfa Wang, Heng Pan, Yingqing He, Junkun Yuan, Ailing Zeng, Chengfei Cai, Heung-Yeung Shum, Wei Liu, Qifeng Chen

1️⃣Read the Full Paper here: [2406.01900] Follow-Your-Emoji: Fine-Controllable and Expressive Freestyle Portrait Animation

2️⃣Project Page: Follow-Your-Emoji: Freestyle Portrait Animation

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

🎵 Music by Maksym Dudchyk from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/1
🚨Paper Alert 🚨

➡️ Paper Title: Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

🌟 A few pointers from the paper

🎯Large-scale endeavors like RT-1 and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data.

🎯Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information; they require hand-designed skills and are limited to interactions with a few object instances.

🎯The authors of this paper propose “MANIPULATE-ANYTHING”, a scalable automated demonstration-generation method for real-world robotic manipulation. Unlike prior work, their method can operate in real-world environments without any privileged state information or hand-designed skills, and it can manipulate any static object.

🎯They evaluate their method using two setups:
⚓ First, MANIPULATE-ANYTHING successfully generates trajectories for all 5 real-world and 12 simulation tasks, significantly outperforming existing methods like VoxPoser.
⚓ Second, MANIPULATE-ANYTHING's demonstrations can train behavior cloning policies that are more robust than those trained on human demonstrations or on data generated by VoxPoser and Code-As-Policies (a minimal behavior-cloning sketch follows this list).
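For context on the second point, here is a generic behavior-cloning sketch (not the paper's pipeline; the network and data shapes are placeholders): once demonstrations have been generated automatically, a policy is fit by supervised regression from observations to actions.

```python
import torch

class BCPolicy(torch.nn.Module):
    def __init__(self, obs_dim=32, act_dim=7):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(obs_dim, 128), torch.nn.ReLU(), torch.nn.Linear(128, act_dim))

    def forward(self, obs):
        return self.net(obs)

# demo_obs / demo_act stand in for the automatically generated trajectories
demo_obs, demo_act = torch.randn(512, 32), torch.randn(512, 7)
policy = BCPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
for _ in range(100):
    loss = torch.mean((policy(demo_obs) - demo_act) ** 2)   # supervised regression to demo actions
    opt.zero_grad(); loss.backward(); opt.step()
```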

🎯The authors believe that MANIPULATE-ANYTHING can serve as a scalable method both for generating robotics data and for solving novel tasks in a zero-shot setting.

🏢 Organization: @uwcse, @nvidia, @allen_ai, Universidad Católica San Pablo

🧙 Paper Authors: @DJiafei, @TonyWentaoYuan, @wpumacay7567, @YiruHelenWang, @ehsanik, Dieter Fox, @RanjayKrishna

1️⃣Read the Full Paper here: https://arxiv.org/pdf/2406.18915

2️⃣Project Page: Manipulate Anything

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

🎵 Music by StudioKolomna from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/1
🚨ICML 2024 Paper Alert 🚨

➡️ Paper Title: Efficient World Models with Context-Aware Tokenization

🌟 A few pointers from the paper

🎯Scaling up deep Reinforcement Learning (RL) methods presents a significant challenge. Following developments in generative modeling, model-based RL positions itself as a strong contender.

🎯Recent advances in sequence modeling have led to effective transformer-based world models, albeit at the price of heavy computations due to the long sequences of tokens required to accurately simulate environments.

🎯In this work, the authors propose ∆-IRIS, a new agent whose world-model architecture consists of a discrete autoencoder that encodes stochastic deltas between time steps and an autoregressive transformer that predicts future deltas by summarizing the current state of the world with continuous tokens.
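Here is an assumption-level toy of the "delta" tokenization idea (the encoder, codebook size, and flat vector observations are stand-ins, not the authors' architecture): a discrete autoencoder quantizes the change between consecutive observations, conditioned on the previous observation and the action, into discrete tokens.

```python
import torch

class DeltaEncoder(torch.nn.Module):
    # toy discrete autoencoder over per-step changes; observations are flat vectors for simplicity
    def __init__(self, obs_dim=64, act_dim=4, codebook_size=32, token_dim=16):
        super().__init__()
        self.enc = torch.nn.Linear(2 * obs_dim + act_dim, token_dim)
        self.codebook = torch.nn.Parameter(torch.randn(codebook_size, token_dim))

    def forward(self, prev_obs, action, next_obs):
        z = self.enc(torch.cat([prev_obs, action, next_obs], dim=-1))
        dists = torch.cdist(z, self.codebook)   # nearest-codebook quantization
        return dists.argmin(dim=-1)             # one discrete "delta" token per step

tokens = DeltaEncoder()(torch.randn(8, 64), torch.randn(8, 4), torch.randn(8, 64))
print(tokens.shape)   # (8,): the autoregressive transformer would predict tokens like these
```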

🎯On the Crafter benchmark, ∆-IRIS sets a new state of the art at multiple frame budgets, while being an order of magnitude faster to train than previous attention-based approaches.

🏢 Organization: @unige_en

🧙 Paper Authors: @micheli_vincent, @EloiAlonso1, @francoisfleuret

1️⃣Read the Full Paper here: [2406.19320] Efficient World Models with Context-Aware Tokenization

2️⃣Code: GitHub - vmicheli/delta-iris: Efficient World Models with Context-Aware Tokenization. ICML 2024

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

🎵 Music by StudioKolomna from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

#icml2024



1/1
🚨LLM Alert 🚨

💎 @GoogleDeepMind's "Gemma Team" has officially announced the release of Gemma 2, a new addition to the Gemma family of lightweight, state-of-the-art open models, ranging in scale from 2 billion to 27 billion parameters.

💎 The 9 billion and 27 billion parameter models are available today, with a 2 billion parameter model to be released shortly.

🌟 A few pointers from the announcement

🎯 In this new version, they made several technical modifications to the architecture, such as interleaving local and global attention layers and using grouped-query attention.

🎯They also train the 2B and 9B models with knowledge distillation instead of next-token prediction. The resulting models deliver the best performance for their size, and even offer competitive alternatives to models that are 2–3× bigger.
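For readers unfamiliar with the distillation objective mentioned above, here is a generic sketch (a minimal textbook formulation, not Google's training code): the student is trained to match the teacher's predicted token distribution rather than one-hot next-token targets.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=1.0):
    # logits: (batch, seq_len, vocab); the teacher provides soft targets per position
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(s_logprob, t_prob, reduction="batchmean") * temperature ** 2

loss = distillation_loss(torch.randn(2, 8, 256), torch.randn(2, 8, 256))
```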

🎯They trained Gemma 2 27B on 13 trillion tokens of primarily English data, the 9B model on 8 trillion tokens, and the 2.6B model on 2 trillion tokens. These tokens come from a variety of data sources, including web documents, code, and science articles.

🎯These models are not multimodal and are not trained specifically for state-of-the-art multilingual capabilities. The final data mixture was determined through ablations similar to the approach used in Gemini 1.0.

🎯Just like the original Gemma models, Gemma 2 is available under the commercially-friendly Gemma license, giving developers and researchers the ability to share and commercialize their innovations.

1️⃣Blog: Gemma 2 is now available to researchers and developers

2️⃣Technical Report: https://storage.googleapis.com/deepmind-media/gemma/gemma-2-report.pdf

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.



1/7
Viggle's new feature "Move" is now live! Visit VIGGLE to animate your pic right away.

Compared to our previous feature, which offers green-screen and white backgrounds, Move allows you to keep the original background of the image without further editing.

Get it moving!

2/7
That's so great!!! Can we see GenAiMovies soon?

3/7
Wow, fantastic. I personally find green screen very useful but I'm pleased with the upgrades. Nice work.

4/7
🥳🤩

5/7
Looks good, will try this out later. Make sure you keep the green screen too, as that's fine 👍🏻👍🏻

6/7
Great. For videos with multiple characters, how do you choose which one to replace?

7/7
Wow!



1/1
🚨 Paper Alert 🚨

➡️ Paper Title: Real-Time Video Generation with Pyramid Attention Broadcast

🌟 A few pointers from the paper

🎯Recently, Sora and other DiT-based video generation models have attracted significant attention. However, in contrast to image generation, there are few studies focused on accelerating the inference of DiT-based video generation models.

🎯Additionally, the inference cost for generating a single video can be substantial, often requiring tens of GPU minutes or even hours. Therefore, accelerating the inference of video generation models has become urgent for broader GenAI applications.

🎯In this work, the authors introduce “Pyramid Attention Broadcast (PAB)”, the first approach that achieves real-time DiT-based video generation.

🎯By mitigating redundant attention computation, PAB achieves up to 21.6 FPS with 10.6x acceleration, without sacrificing quality across popular DiT-based video generation models including Open-Sora, Open-Sora-Plan, and Latte.
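A hedged sketch of the broadcast idea as described (the class, reuse interval, and toy attention layer are my assumptions, not the released code): attention outputs change little between adjacent diffusion steps, so a cached result is reused for several steps instead of being recomputed.

```python
import torch

class BroadcastAttention(torch.nn.Module):
    def __init__(self, dim=64, heads=4, reuse_interval=3):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.reuse_interval = reuse_interval
        self._cache, self._last_step = None, None

    def forward(self, x, step):
        if self._cache is not None and step - self._last_step < self.reuse_interval:
            return self._cache                      # broadcast the cached attention output
        out, _ = self.attn(x, x, x)
        self._cache, self._last_step = out, step
        return out

layer = BroadcastAttention()
x = torch.randn(1, 16, 64)
for step in range(10):                              # only steps 0, 3, 6, 9 recompute attention
    y = layer(x, step)
```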

🎯Notably, as a training-free approach, PAB can empower any future DiT-based video generation models with real-time capabilities.

🏢 Organization: @NUSingapore, @LifeAtPurdue

🧙 Paper Authors: @oahzxl, @JxlDragon, @VictorKaiWang1, @YangYou1991

1️⃣Read the Full Paper here: Coming 🔜

2️⃣Blog: Real-Time Video Generation with Pyramid Attention Broadcast

3️⃣Code: GitHub - NUS-HPC-AI-Lab/OpenDiT: OpenDiT: An Easy, Fast and Memory-Efficient System for DiT Training and Inference

4️⃣Doc: OpenDiT/docs/pab.md at master Β· NUS-HPC-AI-Lab/OpenDiT

🎥 Be sure to watch the attached demo video. Sound on 🔊🔊

🎵 Music by Hot_Dope from @pixabay

Find this valuable 💎?

♻️ QT and teach your network something new

Follow me 👣, @NaveenManwani17, for the latest updates on Tech and AI-related news, insightful research papers, and exciting announcements.

