bnew

Veteran
Joined
Nov 1, 2015
Messages
57,480
Reputation
8,519
Daps
160,213

Grok-2 gets a speed bump after developers rewrite code in three days​


Carl Franzen (@carlfranzen)

August 23, 2024 2:05 PM

[Image: blue and yellow robots race through a pink desert in an AI-style illustration. Credit: VentureBeat, made with Midjourney]



Elon Musk’s xAI has made waves in the last week with the release of its Grok-2 large language model (LLM) chatbot — available through an $8 USD monthly subscription on the social network X.

Now, both versions of Grok-2 — Grok-2 and Grok-2 mini, the latter designed to be less powerful but faster — analyze information and output responses more quickly, after two developers at xAI rewrote the inference code stack from scratch over the last three days.

As xAI developer Igor Babuschkin posted this afternoon on the social network X under his handle @ibab:

“Grok 2 mini is now 2x faster than it was yesterday. In the last three days @lm_zheng and @MalekiSaeed rewrote our inference stack from scratch using SGLang. This has also allowed us to serve the big Grok 2 model, which requires multi-host inference, at a reasonable speed. Both models didn’t just get faster, but also slightly more accurate. Stay tuned for further speed improvements!”



The two developers responsible are Lianmin Zheng and Saeed Maleki, according to Babuschkin’s post.

To rewrite the inference stack for Grok-2, they relied on SGLang, an open-source (Apache 2.0-licensed), highly efficient system for executing complex language model programs that achieves up to 6.4 times higher throughput than existing systems.

SGLang was developed by researchers from Stanford University, the University of California, Berkeley, Texas A&M University and Shanghai Jiao Tong University and integrates a frontend language with a backend runtime to simplify the programming of language model applications.

The system is versatile, supporting many models, including Llama, Mistral, and LLaVA, and is compatible with open-weight and API-based models like OpenAI’s GPT-4. SGLang’s ability to optimize execution through automatic cache reuse and parallelism within a single program makes it a powerful tool for developers working with large-scale language models.
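Part of the speedup from "automatic cache reuse" comes from serving systems like SGLang reusing cached KV computation across calls that share a prompt prefix (SGLang's radix-tree cache). The toy sketch below illustrates only that caching idea with a plain prefix tree over token lists; it is a hypothetical illustration, not SGLang's actual API or implementation.

```python
# Toy illustration of prefix-based cache reuse: calls that share a
# prompt prefix skip recomputation for the shared tokens.

class PrefixCache:
    """A radix-tree-like cache keyed on token sequences."""
    def __init__(self):
        self.root = {}      # nested dict: token -> child node
        self.computed = 0   # tokens "computed" fresh (cache misses)
        self.reused = 0     # tokens served from the cache (hits)

    def run(self, tokens):
        node = self.root
        for tok in tokens:
            if tok in node:
                self.reused += 1    # this prefix token is already cached
            else:
                node[tok] = {}
                self.computed += 1  # simulate fresh KV computation
            node = node[tok]

cache = PrefixCache()
shared = ["You", "are", "a", "helpful", "assistant", "."]
cache.run(shared + ["What", "is", "2+2", "?"])
cache.run(shared + ["Summarize", "this", "text", "."])
# The second call reuses the 6 shared prefix tokens.
print(cache.computed, cache.reused)  # -> 14 6
```

The second call pays only for its 4 new tokens; in a real server the cached entries are attention KV tensors rather than dictionary nodes, and the savings compound across many concurrent requests with common system prompts.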

Grok-2 and Grok-2-Mini Performance Highlights​


Additionally, in the latest update to the third-party LMSYS Chatbot Arena leaderboard, which rates AI model performance, the main Grok-2 has secured the #2 spot with an impressive Arena Score of 1293, based on 6,686 votes.



This places Grok-2 (fittingly) at number two among the most powerful AI models in the world, tied with Google's Gemini-1.5 Pro model and just behind OpenAI's latest version of ChatGPT-4o.

Grok-2-mini, which has also benefited from the recent enhancements, has climbed to the #5 position, boasting an Arena Score of 1268 from 7266 votes, just behind GPT-4o mini and Claude 3.5 Sonnet.

Both models are proprietary to xAI, reflecting the company’s commitment to advancing AI technology.

Grok-2 has distinguished itself, particularly in mathematical tasks, where it ranks #1. The model also holds strong positions across various other categories, including Hard Prompts, Coding, and Instruction-following, where it consistently ranks near the top.

This performance places Grok-2 ahead of other prominent models like OpenAI’s GPT-4o (May 2024), which now ranks #4.

Future Developments​


According to a response by Babuschkin on X, the main advantage of using Grok-2-mini over the full Grok-2 model is its enhanced speed.

Yes, that’s the main reason for now. We will make it even faster than it is right now.

— ibab (@ibab) August 23, 2024


However, Babuschkin pledged that xAI would further improve the processing speed of Grok-2-mini, which could make it an even more attractive option for users seeking high performance with lower computational overhead.

The addition of Grok-2 and Grok-2-mini to the Chatbot Arena leaderboard and their subsequent performance have garnered significant attention within the AI community.

The models’ success is a testament to xAI’s ongoing innovation and its commitment to pushing the boundaries of what AI can achieve.

As xAI continues to refine its models, the AI landscape can expect further enhancements in both speed and accuracy, keeping Grok-2 and Grok-2-mini at the forefront of AI development.
 







1/1
In this work, we introduce a novel uncertainty-aware 3D-Gaussian Splatting training paradigm to effectively use aerial imagery to enhance the novel view synthesis of road views.
Training naively with aerial and ground images, which exhibit large view disparity, poses a significant convergence challenge for 3D-GS and does not demonstrate remarkable improvements in performance on road views. To enhance the novel view synthesis of road views and to effectively use the aerial information, this work designs an uncertainty-aware training method that allows aerial images to assist in the synthesis of areas where ground images have poor learning outcomes, instead of weighting all pixels equally in 3D-GS training as prior work did.

Paper: Drone-assisted Road Gaussian Splatting with Cross-view Uncertainty
Link: [2408.15242] Drone-assisted Road Gaussian Splatting with Cross-view Uncertainty
Project: UC-GS
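The core idea above, weighting pixels by cross-view uncertainty instead of equally, can be sketched as a weighted photometric loss. All numbers below are made up for illustration; in the paper the uncertainty itself is estimated, not hand-set.

```python
# Toy sketch of uncertainty-weighted training: pixels where ground-view
# learning is poor (high uncertainty) receive more aerial supervision,
# instead of every pixel getting equal weight.

def weighted_l1(pred, target, weights):
    """L1 photometric loss with per-pixel weights, normalized by total weight."""
    assert len(pred) == len(target) == len(weights)
    total = sum(weights)
    return sum(w * abs(p - t) for p, t, w in zip(pred, target, weights)) / total

pred        = [0.2, 0.8, 0.5, 0.9]
target      = [0.0, 1.0, 0.5, 0.1]
uncertainty = [0.1, 0.1, 0.1, 0.7]   # last pixel: poorly learned from ground views

loss_uniform  = weighted_l1(pred, target, [1.0] * 4)   # prior-work behavior
loss_weighted = weighted_l1(pred, target, uncertainty) # uncertainty-aware
# The high-uncertainty pixel (error 0.8) dominates the weighted loss,
# so gradients concentrate where ground images failed.
```

Under uniform weights the loss averages all errors equally; with uncertainty weights, the badly reconstructed pixel contributes most of the loss, which is the mechanism the abstract describes.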



To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 


1/1
Towards Realistic Example-based Modeling via 3D Gaussian Stitching

discuss: Paper page - Towards Realistic Example-based Modeling via 3D Gaussian Stitching

Using parts of existing models to rebuild new models, commonly termed example-based modeling, is a classical methodology in the realm of computer graphics. Previous works mostly focus on shape composition, making them very hard to use for realistic composition of 3D objects captured from real-world scenes. This leads to combining multiple NeRFs into a single 3D scene to achieve seamless appearance blending. However, the current SeamlessNeRF method struggles to achieve interactive editing and harmonious stitching for real-world scenes due to its gradient-based strategy and grid-based representation. To this end, we present an example-based modeling method that combines multiple Gaussian fields in a point-based representation using sample-guided synthesis. Specifically, as for composition, we create a GUI to segment and transform multiple fields in real time, easily obtaining a semantically meaningful composition of models represented by 3D Gaussian Splatting (3DGS). For texture blending, due to the discrete and irregular nature of 3DGS, straightforwardly applying gradient propagation as in SeamlessNeRF is not supported. Thus, a novel sampling-based cloning method is proposed to harmonize the blending while preserving the original rich texture and content. Our workflow consists of three steps: 1) real-time segmentation and transformation of a Gaussian model using a well-tailored GUI, 2) KNN analysis to identify boundary points in the intersecting area between the source and target models, and 3) two-phase optimization of the target model using sampling-based cloning and gradient constraints. Extensive experimental results validate that our approach significantly outperforms previous works in terms of realistic synthesis, demonstrating its practicality.
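Step 2 of the workflow above (finding boundary points where the source and target models intersect) can be sketched with a brute-force nearest-neighbor search. A real 3DGS pipeline would use a KD-tree over millions of Gaussian centers; the radius and points below are illustrative.

```python
# Toy sketch of boundary-point identification between two point sets:
# a target point is a "boundary" point if its nearest source point
# lies within a given radius.

import math

def boundary_points(target, source, radius):
    """Return target points whose nearest source point lies within radius."""
    return [p for p in target
            if min(math.dist(p, q) for q in source) <= radius]

source = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]   # centers of the source model
target = [(1.2, 0.0, 0.0), (5.0, 5.0, 5.0)]   # centers of the target model
hits = boundary_points(target, source, radius=0.5)
# Only the first target point lies near the source model.
```

In the paper, these boundary points are where the two-phase optimization (sampling-based cloning plus gradient constraints) harmonizes color and texture across the seam.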




 



1/2
🚀Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

🌐Explore More: MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

2/2
Method






1/2
MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu

tl;dr: similar to MVSplat, but less transformer, more CNN, more GS
[2405.12218] MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

2/2









 







1/7
Oh wow! I just tested Splatt3R with my own data on my computer, which creates 3D Gaussian Splats at 4 FPS on uncalibrated 512x512 2D images!

It's by far the fastest 3D reconstruction method, powered by MASt3R.

Check out the video!

2/7
Here is another example!

3/7
Code: GitHub - btsmart/splatt3r: Official repository for Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

4/7
Fast is underrated 😵‍💫

5/7
"We introduce a third head to predict covariances (parameterized by rotation quaternions and scales), spherical harmonics, opacities and mean offsets for each point. This allows us to construct a complete Gaussian primitive for each pixel, which we can then render for novel view synthesis. During training, we only train the Gaussian prediction head, relying on a pre-trained MASt3R model for the other parameters."

6/7
Yeah sure. You can use sam2 to implement this. It just requires a little engineering.

7/7
How is that related here?
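The covariance parameterization quoted in 5/7 above (rotation quaternion plus per-axis scales) is the standard 3DGS construction: Sigma = R S Sᵀ Rᵀ, which is symmetric positive semi-definite by design. A minimal sketch, using the identity rotation as the worked example:

```python
# Build a 3DGS-style covariance from a quaternion and per-axis scales:
# Sigma = R S S^T R^T, with R from the unit quaternion and S = diag(scales).

def quat_to_rot(w, x, y, z):
    """Unit quaternion -> 3x3 rotation matrix (list of rows)."""
    return [
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ]

def covariance(quat, scales):
    R = quat_to_rot(*quat)
    # M = R @ diag(scales); Sigma = M @ M^T
    M = [[R[i][j] * scales[j] for j in range(3)] for i in range(3)]
    return [[sum(M[i][k] * M[j][k] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Identity rotation: the covariance is just diag(scales squared).
sigma = covariance((1.0, 0.0, 0.0, 0.0), (1.0, 2.0, 3.0))
```

Predicting the quaternion and scales (rather than the six covariance entries directly) guarantees every per-pixel Gaussian stays a valid, renderable ellipsoid.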




1/2
Wait! You think you need at least two images to reconstruct a scene? Think again!

With GenWarp, you can reconstruct a 3D scene from just a single image!

Simply create two views from a single input image and drop them into Splatt3R. What a time to be alive! 😉

2/2
Exciting!


 





1/4
Excited to share our latest project at @SonyAI_global: "GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping"

It generates novel views from just a single image, which can be easily applied to 3DGS-based methods!

Proj page: GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

2/4
1)
We introduce a novel approach where a diffusion model learns to implicitly conduct geometric warping conditioned on MDE depth-based correspondence, instead of warping the pixels directly.
It prevents artifacts typically caused by explicit warping.

3/4
2) The augmented self-attention balances attending to areas requiring generation with focusing on regions reliably warped from the input, allowing the model to decide which parts to generate versus warp.

4/4
Yes 🙂 For data preprocessing, we use DUSt3R for depth prediction of image pairs, and during inference, it works well with any off-the-shelf MDE like Zoedepth. And, we will make the code and checkpoint public soon!




1/1
GenWarp's code and @Gradio demo have been officially released!
GenWarp generates images from novel viewpoints using just a single image. Please give it a try!

- GitHub: GitHub - sony/genwarp
- Demo: GenWarp - a Hugging Face Space by Sony
- Project: GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping


 


1/1
This work propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control.
It leverages the recent advancements in depth-conditioned T2I models and proposes a novel approach for interactive 3D layout control. It replaces the traditional 2D boxes used in layout control with 3D boxes

Paper: Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation, by @peter_wonka and Abdelrahman Eldesokey
Link: [2408.14819] Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation
Project: Build-A-Scene







1/1
Build-A-Scene

Interactive 3D Layout Control for Diffusion-Based Image Generation

discuss: Paper page - Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control. Layout control has been widely studied to alleviate the shortcomings of T2I diffusion models in understanding objects' placement and relationships from text descriptions. Nevertheless, existing approaches for layout control are limited to 2D layouts, require the user to provide a static layout beforehand, and fail to preserve generated images under layout changes. This makes these approaches unsuitable for applications that require 3D object-wise control and iterative refinements, e.g., interior design and complex scene generation. To this end, we leverage the recent advancements in depth-conditioned T2I models and propose a novel approach for interactive 3D layout control. We replace the traditional 2D boxes used in layout control with 3D boxes. Furthermore, we revamp the T2I task as a multi-stage generation process, where at each stage, the user can insert, change, and move an object in 3D while preserving objects from earlier stages. We achieve this through our proposed Dynamic Self-Attention (DSA) module and the consistent 3D object translation strategy. Experiments show that our approach can generate complicated scenes based on 3D layouts, boosting the object generation success rate over the standard depth-conditioned T2I methods by 2x. Moreover, it outperforms other methods in comparison in preserving objects under layout changes.


 


1/1
AltCanvas: A Tile-Based Image Editor with Generative AI for Blind or Visually Impaired People. [2408.10240] AltCanvas: A Tile-Based Image Editor with Generative AI for Blind or Visually Impaired People





1/2
📢 Introducing AltCanvas, a new tool blending generative AI with a tile-based interface to empower visually impaired creators! AltCanvas combines a novel tile-based authoring approach with generative AI, allowing users to build complex scenes piece by piece. #assets2024

2/2
Accessible PDF:

Arxiv: https://www.arxiv.org/pdf/2408.10240

Led by @SeongHee17633 and Maho Kohga, in collaboration with Steve Landau and Sile O'Modhrain


 


1/1
Given a pair of key frames as input, this method generates a continuous intermediate video with coherent motion by adapting a pretrained image-to-video diffusion model.

Paper: Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
Link: [2408.15239] Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
Project: Generative Keyframe Interpolation with Forward-Backward Consistency





1/1
Generative Inbetweening

Adapting Image-to-Video Models for Keyframe Interpolation

discuss: Paper page - Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

We present a method for generating video sequences with coherent motion between a pair of input key frames. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for key frame interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.
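The dual-directional sampling described above can be caricatured in one dimension: fuse a forward trajectory (anchored at the first keyframe) with a backward one (anchored at the last), weighting each frame toward its nearer keyframe. The linear weights below are illustrative; the paper's actual fusion combines overlapping diffusion model estimates at each sampling step.

```python
# Toy sketch of dual-directional fusion: a forward model predicts frames
# from the first keyframe, a backward model from the last; the per-frame
# estimates are blended with weights favoring the nearer keyframe.

def fuse(forward_est, backward_est):
    n = len(forward_est)
    fused = []
    for i in range(n):
        w = i / (n - 1)  # 0 at the first frame, 1 at the last
        fused.append((1 - w) * forward_est[i] + w * backward_est[i])
    return fused

# 1-D "frames": the two runs agree at the keyframes but drift in between.
forward_est  = [0.0, 0.30, 0.60, 0.90, 1.2]
backward_est = [0.2, 0.40, 0.60, 0.80, 1.0]
fused = fuse(forward_est, backward_est)
```

Note the fused sequence inherits the first keyframe's value exactly at frame 0 and the last keyframe's at the final frame, which is the property that makes the interpolation land on both inputs.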




1/1
Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation (Keyframe Interpolation with Stable Video Diffusion)

GitHub - jeanne-wang/svd_keyframe_interpolation

Refreshing to see a paper actually published with the code! 👏


 


1/2
I wish we could show panoramas on X!

(generated using Flux.1-dev and my panorama LoRA)

2/2
You can play with the panorama lora using this spherical viewer!

Text to panorama - a Hugging Face Space by jbilcke-hf

(note: images are not upscaled yet)






1/4
I trained a 360 panorama LoRA for Flux Dev yesterday and am very satisfied with the results! The idea is to upscale the output to make it more realistic.

🔽More info below

2/4
Here's the replicate link
igorriti/flux-360 – Run with an API on Replicate

3/4
Some example results, you can try them on any 360 image viewer online!

4/4
I'm really close to achieving my ideal output. I'll probably retrain the LoRA with more diverse samples, better captioning (I used the autocaptioner and got some inaccurate outputs) and maybe more steps.


 








1/8
Are you frustrated by intermittently bad depth maps ruining your online pipeline 😫? Wish you could get fast 🚀 reuse of your previously predicted depth frames? Do you want shiny, fast, SOTA feedforward depth and meshes ✨?

Introducing DoubleTake!
DoubleTake: Geometry Guided Depth Estimation

2/8
MVS depth models suffer when the cost volume doesn’t have any source views or in textureless regions.

While adding more source views helps, there’s a plateau 🫓.

3/8
What’s the fix? Reuse your old depth estimates 💡? One depth map alone isn’t perfect. Even a fused mesh will have bias and errors 😞.

We keep track of how confident the mesh is, and use these confidences as input when predicting new depths via a Hint MLP 🌟.

4/8
Our scores are a new state of the art across online and offline depth estimation and feedforward reconstruction.

5/8
Our method is super flexible. It works online, offline where we run the model twice, and can even use stale geometry from a previous visit or reconstructed from other methods!

6/8
Since the trained component is 2D, we generalize better than competing methods to unseen domains, outdoor scenes in this example.

7/8
You can find all the details in the paper, supplemental pdf, and video on the project webpage.

Work done with my amazing coauthors at Niantic:
@AleottiFilippo, Jamie Watson, Zawar Qureshi,
@gui_ggh, Gabriel Brostow, Sara Vicente, @_mdfirman

8/8
Yes! Full timings in the paper. Some on the webpage.






 



1/3
CogVideoX 5B, an open-weights text-to-video AI model, is out, matching the likes of Luma, Runway, and Pika! 🔥

Powered by diffusers - requires less than 10GB VRAM to run inference! ⚡

Check out the free demo below to play with it!

2/3
Try it out here:

CogVideoX-5B - a Hugging Face Space by THUDM

3/3
Model weights below:

CogVideo - a THUDM Collection





1/2
Trying out CogVideoX-5B on Google Colab.
- Generation time: 4 min 16 s
- GPU RAM: 17.6 GB

2/2
Trying out CogVideoX-2B on Google Colab.
- Generation time: 1 min 53 s
- GPU RAM: 11.4 GB





1/2
New video model drop 👀

THUDM/CogVideoX-5b · Hugging Face

That could be a great new default open-source model for Clapper 🔥

2/2
Also check out its little brother CogVideoX-2b which is also open-source with broader usage permissions, being Apache 2.0 🤗

THUDM/CogVideoX-2b · Hugging Face






1/1
CogVideoX Released in Two Variants – CogVideoX-2B and CogVideoX-5B: A Revolutionary Advancement in Text-to-Video Generation with Enhanced Temporal Consistency and Superior Dynamic Scene Handling

CogVideoX Released in Two Variants – CogVideoX-2B and CogVideoX-5B: A Revolutionary Advancement in Text-to-Video Generation with Enhanced Temporal Consistency and Superior Dynamic Scene Handling






1/1
CogVideoX Released in Two Variants – CogVideoX-2B and CogVideoX-5B: A Revolutionary Advancement in Text-to-Video Generation with Enhanced Temporal Consistency and Superior Dynamic Scene Handling

Zhipu AI and Tsinghua University researchers have introduced CogVideoX, a novel approach that leverages cutting-edge techniques to enhance text-to-video generation. CogVideoX employs a 3D causal VAE, compressing video data along spatial and temporal dimensions, significantly reducing the computational load while maintaining video quality. The model also integrates an expert transformer with adaptive LayerNorm, which improves the alignment between text and video, facilitating a more seamless integration of these two modalities. This advanced architecture enables the generation of high-quality, semantically accurate videos that can extend over longer durations than previously possible.

CogVideoX incorporates several innovative techniques that set it apart from earlier models. The 3D causal VAE allows for a 4×8×8 compression from pixels to latents, a substantial reduction that preserves the continuity and quality of the video. The expert transformer uses a 3D full attention mechanism, comprehensively modeling video data to ensure that large-scale motions are accurately represented. The model includes a sophisticated video captioning pipeline, which generates new textual descriptions for video data, enhancing the semantic alignment of the videos with the input text. This pipeline includes video filtering to remove low-quality clips and a dense video captioning method that improves the model’s understanding of video content.....
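The quoted 4×8×8 compression (time × height × width) is easy to sanity-check with arithmetic. The 48-frame 480×720 clip size below is an illustrative assumption, not a figure from the article:

```python
# Sanity-check of the 4x8x8 pixel-to-latent compression described above:
# time is compressed 4x, height and width 8x each, so the latent grid
# holds 4 * 8 * 8 = 256x fewer cells than the pixel grid.

def latent_shape(frames, height, width, t_factor=4, s_factor=8):
    return (frames // t_factor, height // s_factor, width // s_factor)

shape = latent_shape(frames=48, height=480, width=720)
ratio = (48 * 480 * 720) // (shape[0] * shape[1] * shape[2])
print(shape, ratio)  # -> (12, 60, 90) 256
```

That 256x reduction in the number of spatio-temporal cells is what lets the diffusion transformer attend over longer clips at manageable cost.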

Read the full article on this: CogVideoX Released in Two Variants – CogVideoX-2B and CogVideoX-5B: A Revolutionary Advancement in Text-to-Video Generation with Enhanced Temporal Consistency and Superior Dynamic Scene Handling

Paper: https://arxiv.org/pdf/2408.06072

Model: THUDM/CogVideoX-5b · Hugging Face

GitHub: GitHub - THUDM/CogVideo: Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)





1/2
[Open-source video generation AI]
CogVideoX has released the weights for its 5B model; at the same time, CogVideoX-2B has been relicensed under Apache 2.0.

It can run on 10 GB of VRAM: CogVideoX-2B works on older GPUs such as the GTX 1080 Ti, and CogVideoX-5B on cards such as the RTX 3060.

movie: @PurzBeats (continued >>)

2/2
CogVideo && CogVideoX
Code:GitHub - THUDM/CogVideo: Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Paper:https://arxiv.org/pdf/2408.06072
demo:CogVideoX-5B - a Hugging Face Space by THUDM





1/2
🔥 Big update on the SOTA text-to-video model from the Chinese community!
- @ChatGLM from Tsinghua just released CogVideoX 5B
- CogVideoX 2B is now under Apache 2.0 🙌
Paper: Paper page - CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Model: THUDM/CogVideoX-5b · Hugging Face
Demo: CogVideoX-5B - a Hugging Face Space by THUDM
✨ CogVideoX reduces memory requirements. The 2B model runs on a 1080TI and the 5B on a 3060.

2/2





1/1
The video-generation AI CogVideoX-2B needs a high-memory Colab runtime, so I tried it on my local PC instead. With 18 GB of RAM and 11.7 GB of VRAM, it just barely ran on an RTX 3060 (12 GB). If you told me the output was live-action footage, I might believe it!




1/1
Zhipu has open-sourced CogVideoX-5B: run a "Sora" locally.
CogVideoX-5B is the larger model in the open-source CogVideoX series.
CogVideoX-2B (relicensed under Apache 2.0) runs on early GPUs such as the GTX 1080 Ti;
CogVideoX-5B runs on desktop GPUs such as the RTX 3060.

Demo:
CogVideoX-5B - a Hugging Face Space by THUDM





1/1
The best open-access video generation model CogVideoX-5B has been released by @thukeg

They also have a 2B model under Apache 2.0. Amazing! Give it a try on @huggingface

Model: THUDM/CogVideoX-5b · Hugging Face

Demo: CogVideoX-5B - a Hugging Face Space by THUDM


 


1/1
Transition videos play a crucial role in media production, enhancing the flow and coherence of visual narratives. Traditional methods like morphing often lack artistic appeal and require specialized skills, limiting their effectiveness. Recent advances in diffusion model-based video generation offer new possibilities for creating transitions but face challenges such as poor inter-frame relationship modeling and abrupt content changes.
This work proposes a novel training-free Transition Video Generation (TVG) approach using video-level diffusion models that addresses these limitations without additional training.

Paper: TVG: A Training-free Transition Video Generation Method with Diffusion Models
Link: [2408.13413] TVG: A Training-free Transition Video Generation Method with Diffusion Models
Project: SOCIAL MEDIA TITLE TAG



 