

1/2
🚀Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

🌐Explore More: MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

2/2
Method






1/2
MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, Ziwei Liu

tl;dr: similar to MVSplat, less transformer, more CNN, more GS
[2405.12218] MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo

2/2






1/1
Code dropped: [ECCV '24] MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
Code: GitHub - TQTQliu/MVSGaussian: [ECCV 2024] MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo
Project: MVSGaussian: Fast Generalizable Gaussian Splatting Reconstruction from Multi-View Stereo


 







1/7
Oh wow! I just tested Splatt3R with my own data on my computer, which creates 3D Gaussian Splats at 4 FPS on uncalibrated 512x512 2D images!

It's by far the fastest 3D reconstruction method, powered by MASt3R.

Check out the video!

2/7
Here is another example!

3/7
Code: GitHub - btsmart/splatt3r: Official repository for Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

4/7
Fast is underrated 😵‍💫

5/7
"We introduce a third head to predict covariances (parameterized by rotation quaternions and scales), spherical harmonics, opacities and mean offsets for each point. This allows us to construct a complete Gaussian primitive for each pixel, which we can then render for novel view synthesis. During training, we only train the Gaussian prediction head, relying on a pre-trained MASt3R model for the other parameters."

6/7
Yeah sure. You can use sam2 to implement this. It just requires a little engineering.

7/7
How is that related here?




1/2
Wait! You think you need at least two images to reconstruct a scene? Think again!

With GenWarp, you can reconstruct a 3D scene from just a single image!

Simply create two views from a single input image and drop them into Splatt3R. What a time to be alive! 😉

2/2
Exciting!







1/4
Excited to share our latest project at @SonyAI_global: "GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping"

It generates novel views from just a single image, which can be easily applied to 3DGS-based methods!

Proj page: GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping

2/4
1)
We introduce a novel approach where a diffusion model learns to implicitly conduct geometric warping conditioned on MDE depth-based correspondence, instead of warping the pixels directly.
It prevents artifacts typically caused by explicit warping.

3/4
2) The augmented self-attention balances attending to areas requiring generation with focusing on regions reliably warped from the input, allowing the model to decide which parts to generate versus warp.

4/4
Yes 🙂 For data preprocessing, we use DUSt3R for depth prediction of image pairs, and during inference, it works well with any off-the-shelf MDE like Zoedepth. And, we will make the code and checkpoint public soon!




1/1
GenWarp's code and @Gradio demo have been officially released!
GenWarp generates images from novel viewpoints using just a single image. Please give it a try!

- GitHub: GitHub - sony/genwarp
- Demo: GenWarp - a Hugging Face Space by Sony
- Project: GenWarp: Single Image to Novel Views with Semantic-Preserving Generative Warping




1/1
This work proposes a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control.
It leverages recent advancements in depth-conditioned T2I models and replaces the traditional 2D boxes used in layout control with 3D boxes.

Paper: Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation by @peter_wonka and Abdelrahman Eldesokey
Link: [2408.14819] Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation
Project: Build-A-Scene

#AI #AI美女 #LLMs #deeplearning #machinelearning #3D






1/1
Build-A-Scene

Interactive 3D Layout Control for Diffusion-Based Image Generation

discuss: Paper page - Build-A-Scene: Interactive 3D Layout Control for Diffusion-Based Image Generation

We propose a diffusion-based approach for Text-to-Image (T2I) generation with interactive 3D layout control. Layout control has been widely studied to alleviate the shortcomings of T2I diffusion models in understanding objects' placement and relationships from text descriptions. Nevertheless, existing approaches for layout control are limited to 2D layouts, require the user to provide a static layout beforehand, and fail to preserve generated images under layout changes. This makes these approaches unsuitable for applications that require 3D object-wise control and iterative refinements, e.g., interior design and complex scene generation. To this end, we leverage the recent advancements in depth-conditioned T2I models and propose a novel approach for interactive 3D layout control. We replace the traditional 2D boxes used in layout control with 3D boxes. Furthermore, we revamp the T2I task as a multi-stage generation process, where at each stage, the user can insert, change, and move an object in 3D while preserving objects from earlier stages. We achieve this through our proposed Dynamic Self-Attention (DSA) module and the consistent 3D object translation strategy. Experiments show that our approach can generate complicated scenes based on 3D layouts, boosting the object generation success rate over the standard depth-conditioned T2I methods by 2x. Moreover, it outperforms other methods in comparison in preserving objects under layout changes.




1/1
AltCanvas: A Tile-Based Image Editor with Generative AI for Blind or Visually Impaired People. [2408.10240] AltCanvas: A Tile-Based Image Editor with Generative AI for Blind or Visually Impaired People





1/2
📢 Introducing AltCanvas - a new tool blending generative AI with a tile-based interface to empower visually impaired creators! AltCanvas combines a novel tile-based authoring approach with Generative AI, allowing users to build complex scenes piece by piece. #assets2024

2/2
Accessible PDF:

Arxiv: https://www.arxiv.org/pdf/2408.10240

Led by @SeongHee17633 and Maho Kohga, in collaboration with Steve Landau and Sile O'Modhrain


 


1/1
Given a pair of key frames as input, this method generates a continuous intermediate video with coherent motion by adapting a pretrained image-to-video diffusion model.

Paper: Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
Link: [2408.15239] Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation
Project: Generative Keyframe Interpolation with Forward-Backward Consistency

#AI #AI美女 #LLMs #deeplearning #machinelearning #3D




1/1
Generative Inbetweening

Adapting Image-to-Video Models for Keyframe Interpolation

discuss: Paper page - Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation

We present a method for generating video sequences with coherent motion between a pair of input key frames. We adapt a pretrained large-scale image-to-video diffusion model (originally trained to generate videos moving forward in time from a single input image) for key frame interpolation, i.e., to produce a video in between two input frames. We accomplish this adaptation through a lightweight fine-tuning technique that produces a version of the model that instead predicts videos moving backwards in time from a single input image. This model (along with the original forward-moving model) is subsequently used in a dual-directional diffusion sampling process that combines the overlapping model estimates starting from each of the two keyframes. Our experiments show that our method outperforms both existing diffusion-based methods and traditional frame interpolation techniques.




1/1
Generative Inbetweening: Adapting Image-to-Video Models for Keyframe Interpolation (Keyframe Interpolation with Stable Video Diffusion)

GitHub - jeanne-wang/svd_keyframe_interpolation

Refreshing to see a paper actually published with the code! 👏




1/2
I wish we could show panoramas on X!

(generated using Flux.1-dev and my panorama LoRA)

2/2
You can play with the panorama LoRA using this spherical viewer!

Text to panorama - a Hugging Face Space by jbilcke-hf

(note: images are not upscaled yet)






1/4
I trained a 360 panorama LoRA for Flux Dev yesterday and I'm very satisfied with the results! The idea is to upscale the output to make it more realistic.

🔽More info below

2/4
Here's the replicate link
igorriti/flux-360 – Run with an API on Replicate

3/4
Some example results, you can try them on any 360 image viewer online!

4/4
I'm really close to achieving my ideal output. I'll probably retrain the LoRA with more diverse samples, better captioning (I used the autocaptioner and got some inaccurate outputs) and maybe more steps.


 








1/8
Are you frustrated by intermittently bad depth maps ruining your online pipeline 😫? Wish you could get fast 🚀 reuse of your previously predicted depth frames? Do you want shiny fast SOTA feedforward depth and meshes ✨?

Introducing DoubleTake!
DoubleTake: Geometry Guided Depth Estimation

2/8
MVS depth models suffer when the cost volume doesn’t have any source views or in textureless regions.

While adding more source views helps, there’s a plateau 🫓.

3/8
What’s the fix? Reuse your old depth estimates 💡? One depth map alone isn’t perfect. Even a fused mesh will have bias and errors 😞.

We keep track of how confident the mesh is, and use these confidences as input when predicting new depths via a Hint MLP 🌟.
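Purely as an illustration of that idea (this is not the DoubleTake code, and the inputs, shapes, and hidden size are made-up assumptions), a per-pixel MLP of this flavor could combine a matching cost with a rendered depth hint from the existing mesh and its confidence:

Code:
import torch
import torch.nn as nn

class HintMLP(nn.Module):
    """Illustrative per-pixel MLP combining a matching cost with a geometry hint and its confidence."""
    def __init__(self, hidden_dim: int = 32):
        super().__init__()
        # Inputs per pixel and depth hypothesis: [matching cost, rendered hint depth, hint confidence]
        self.net = nn.Sequential(
            nn.Linear(3, hidden_dim),
            nn.ReLU(inplace=True),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, cost, hint_depth, hint_confidence):
        # cost, hint_depth, hint_confidence: (B, D, H, W) tensors
        x = torch.stack([cost, hint_depth, hint_confidence], dim=-1)  # (B, D, H, W, 3)
        return self.net(x).squeeze(-1)  # refined per-hypothesis score, (B, D, H, W)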

4/8
Our scores are a new state of the art across online and offline depth estimation and feedforward reconstruction.

5/8
Our method is super flexible. It works online, offline where we run the model twice, and can even use stale geometry from a previous visit or reconstructed from other methods!

6/8
Since the trained component is 2D, we generalize better than competing methods to unseen domains, outdoor scenes in this example.

7/8
You can find all the details in the paper, supplemental pdf, and video on the project webpage.

Work done with my amazing coauthors at Niantic:
@AleottiFilippo, Jamie Watson, Zawar Qureshi,
@gui_ggh, Gabriel Brostow, Sara Vicente, @_mdfirman

8/8
Yes! Full timings in the paper. Some on the webpage.


 



1/3
CogVideoX 5B - open-weights text-to-video AI model is out, matching the likes of Luma / Runway / Pika! 🔥

Powered by diffusers - requires less than 10GB VRAM to run inference! ⚡

Check out the free demo below to play with it!

2/3
Try it out here:

CogVideoX-5B - a Hugging Face Space by THUDM

3/3
Model weights below:

CogVideo - a THUDM Collection





1/2
Trying out CogVideoX-5B on Google Colab.
・Generation time: 4 min 16 s
・GPU RAM: 17.6 GB

2/2
Trying out CogVideoX-2B on Google Colab.
・Generation time: 1 min 53 s
・GPU RAM: 11.4 GB





1/2
New video model drop 👀

THUDM/CogVideoX-5b · Hugging Face

That could be a great new default open-source model for Clapper 🔥

2/2
Also check out its little brother CogVideoX-2b which is also open-source with broader usage permissions, being Apache 2.0 🤗

THUDM/CogVideoX-2b · Hugging Face






1/1
CogVideoX Released in Two Variants – CogVideoX-2B and CogVideoX-5B: A Revolutionary Advancement in Text-to-Video Generation with Enhanced Temporal Consistency and Superior Dynamic Scene Handling

CogVideoX Released in Two Variants – CogVideoX-2B and CogVideoX-5B: A Revolutionary Advancement in Text-to-Video Generation with Enhanced Temporal Consistency and Superior Dynamic Scene Handling

#TextToVideoGeneration #AIAdvancements #CogVideoX #AIInte





1/1
CogVideoX Released in Two Variants – CogVideoX-2B and CogVideoX-5B: A Revolutionary Advancement in Text-to-Video Generation with Enhanced Temporal Consistency and Superior Dynamic Scene Handling

Zhipu AI and Tsinghua University researchers have introduced CogVideoX, a novel approach that leverages cutting-edge techniques to enhance text-to-video generation. CogVideoX employs a 3D causal VAE, compressing video data along spatial and temporal dimensions, significantly reducing the computational load while maintaining video quality. The model also integrates an expert transformer with adaptive LayerNorm, which improves the alignment between text and video, facilitating a more seamless integration of these two modalities. This advanced architecture enables the generation of high-quality, semantically accurate videos that can extend over longer durations than previously possible.

CogVideoX incorporates several innovative techniques that set it apart from earlier models. The 3D causal VAE allows for a 4×8×8 compression from pixels to latents, a substantial reduction that preserves the continuity and quality of the video. The expert transformer uses a 3D full attention mechanism, comprehensively modeling video data to ensure that large-scale motions are accurately represented. The model includes a sophisticated video captioning pipeline, which generates new textual descriptions for video data, enhancing the semantic alignment of the videos with the input text. This pipeline includes video filtering to remove low-quality clips and a dense video captioning method that improves the model’s understanding of video content.....

Read the full article on this: CogVideoX Released in Two Variants – CogVideoX-2B and CogVideoX-5B: A Revolutionary Advancement in Text-to-Video Generation with Enhanced Temporal Consistency and Superior Dynamic Scene Handling

Paper: https://arxiv.org/pdf/2408.06072

Model: THUDM/CogVideoX-5b · Hugging Face

GitHub: GitHub - THUDM/CogVideo: Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
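The thread above notes that the release is powered by diffusers; as a hedged illustration, here is a minimal sketch of generating a clip with the released 5B checkpoint. The prompt, frame count, and CPU-offloading choice are assumptions rather than officially recommended settings, so check the model card before relying on them.

Code:
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

# Load the open-weights 5B checkpoint from the Hugging Face Hub.
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)

# Offload idle submodules to CPU to reduce peak VRAM (the thread cites <10 GB with diffusers optimizations).
pipe.enable_model_cpu_offload()

video_frames = pipe(
    prompt="A panda playing a guitar by a campfire, cinematic lighting",  # placeholder prompt
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,  # assumed clip length; adjust per the model card
).frames[0]

export_to_video(video_frames, "cogvideox_sample.mp4", fps=8)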





1/2
[Open-source video generation AI]
CogVideoX has released the weights for its 5B model, and CogVideoX-2B has been relicensed under Apache 2.0 at the same time.

It runs with as little as 10GB of VRAM: CogVideoX-2B works on older GPUs such as the GTX 1080 Ti, and CogVideoX-5B runs on cards such as the RTX 3060.

movie: @PurzBeats (continued >>)
#生成AI

2/2
CogVideo && CogVideoX
Code: GitHub - THUDM/CogVideo: Text-to-video generation: CogVideoX (2024) and CogVideo (ICLR 2023)
Paper: https://arxiv.org/pdf/2408.06072
Demo: CogVideoX-5B - a Hugging Face Space by THUDM
#生成AI





1/2
🔥 Big update on the SOTA text-to-video model from the Chinese community!
- @ChatGLM from Tsinghua just released CogVideoX 5B
- CogVideoX 2B now supports Apache 2.0 🙌
Paper: Paper page - CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer
Model: THUDM/CogVideoX-5b · Hugging Face
Demo: CogVideoX-5B - a Hugging Face Space by THUDM
✨ CogVideoX reduces memory requirements. The 2B model runs on a 1080TI and the 5B on a 3060.

2/2





1/1
I tried the video generation AI "CogVideoX-2B" on my local PC, since it needs a high-memory runtime on Colab. With 18GB of RAM and 11.7GB of VRAM it just barely ran on an RTX 3060 (12GB). If someone told me the output was live-action footage, I might just believe them?!




1/1
#智谱国产sora

Zhipu has open-sourced CogVideoX-5B, so you can run a "Sora" locally.
The larger model in the open-source CogVideoX series, CogVideoX-5B, is now available:
CogVideoX-2B (now relicensed under Apache 2.0) runs on early GPUs such as the GTX 1080 Ti,
while the CogVideoX-5B model runs on desktop cards such as the RTX 3060.

Try it here:
CogVideoX-5B - a Hugging Face Space by THUDM





1/1
The best open-access video generation model CogVideoX-5B has been released by @thukeg

They also released the 2B model under Apache 2.0. Amazing! Give it a try on @huggingface

Model: THUDM/CogVideoX-5b · Hugging Face

Demo: CogVideoX-5B - a Hugging Face Space by THUDM


 


100M Token Context Windows


Research update on ultra-long context models, our partnership with Google Cloud, and new funding.

Magic Team, on August 29, 2024

There are currently two ways for AI models to learn things: training, and in-context during inference. Until now, training has dominated, because contexts are relatively short. But ultra-long context could change that.

Instead of relying on fuzzy memorization, our LTM (Long-Term Memory) models are trained to reason on up to 100M tokens of context given to them during inference.

While the commercial applications of these ultra-long context models are plenty, at Magic we are focused on the domain of software development.

It’s easy to imagine how much better code synthesis would be if models had all of your code, documentation, and libraries in context, including those not on the public internet.

Evaluating Context Windows​


Current long context evals aren’t great. The popular Needle In A Haystack eval places a random fact ('the needle') in the middle of the long context window ('haystack'), and asks the model to retrieve the fact.

[Figure: Needle In A Haystack]

However, “Arun and Max having coffee at Blue Bottle” stands out in a fiction novel about whales. By learning to recognize the unusual nature of the “needle”, the model can ignore otherwise relevant information in the “haystack”, reducing the required storage capacity to less than it would be on real tasks. It also only requires attending to a tiny, semantically recognizable part of the context, allowing even methods like RAG to appear successful.

Mamba’s (Section 4.1.2) and H3's (Appendix E.1) induction head benchmark makes this even easier. They use (and train with) a special token to explicitly signal the start of the needle, weakening the storage and retrieval difficulty of the eval to O(1). This is like knowing which question will come up in an exam before you start studying.

These subtle flaws weaken current long context evals in ways that allow traditional Recurrent Neural Networks (RNNs) and State Space Models (SSMs) to score well despite their fundamentally limiting, small O(1)-sized state vector.

To eliminate these implicit and explicit semantic hints, we’ve designed HashHop.

Hashes are random and thus incompressible, requiring the model to be able to store and retrieve from the maximum possible information content for a given context size at all times.

Concretely, we prompt a model trained on hashes with hash pairs:

...
jJWlupoT → KmsFrnRa
vRLWdcwV → sVLdzfJu

YOJVrdjK → WKPUyWON

OepweRIW → JeIrWpvs

JeqPlFgA → YirRppTA
...

Then, we ask it to complete the value of a randomly selected hash pair:

Completion YOJVrdjK → WKPUyWON

This measures the emergence of single-step induction heads, but practical applications often require multiple hops. Picture variable assignments or library imports in your codebase.

To incorporate this, we ask the model to complete a chain of hashes instead (as recently proposed by RULER):

Hash 1 → Hash 2

Hash 2 → Hash 3

Hash 3 → Hash 4

Hash 4 → Hash 5

Hash 5 → Hash 6

Completion Hash 1 → Hash 2 Hash 3 Hash 4 Hash 5 Hash 6

For order- and position-invariance, we shuffle the hash pairs in the prompt:

...

Hash 72 → Hash 81

Hash 4 → Hash 5

Hash 1 → Hash 2

Hash 17 → Hash 62

Hash 2 → Hash 3

Hash 52 → Hash 99

Hash 34 → Hash 12

Hash 3 → Hash 4

Hash 71 → Hash 19

Hash 5 → Hash 6
...

Completion Hash 1 → Hash 2 Hash 3 Hash 4 Hash 5 Hash 6

Writing out all intermediate hashes is similar to how chain of thought allows models to spread out reasoning over time. We also propose a more challenging variant where the model skips steps, e.g. going directly from Hash 1 to Hash 6:

Completion Hash 1 → Hash 6

This requires the model architecture to be able to attend and jump across multiple points of the entire context in latent space in one go.

In addition to evaluating models on code and language, we found training small models on hashes and measuring performance on these toy tasks to be a useful tool for our architecture research.

If you would like to use HashHop, you can find it on GitHub.
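To make the setup concrete, here is a small sketch of how a HashHop-style prompt could be generated (random hash strings, shuffled pairs forming a multi-hop chain, plus the chain-of-thought and skip-step completions). The hash length, alphabet, and helper names are assumptions, not Magic's exact spec.

Code:
import random
import string

def random_hash(length: int = 8) -> str:
    """Random alphanumeric string; incompressible, so the model must store it verbatim."""
    return "".join(random.choices(string.ascii_letters, k=length))

def make_hashhop_prompt(num_hops: int, num_distractor_pairs: int):
    # Build one chain: Hash 1 -> Hash 2 -> ... -> Hash (num_hops + 1)
    chain = [random_hash() for _ in range(num_hops + 1)]
    pairs = list(zip(chain[:-1], chain[1:]))

    # Add unrelated distractor pairs to fill the context.
    pairs += [(random_hash(), random_hash()) for _ in range(num_distractor_pairs)]

    # Shuffle for order- and position-invariance, as described above.
    random.shuffle(pairs)

    prompt = "\n".join(f"{a} → {b}" for a, b in pairs)
    # Chain-of-thought target: write out every intermediate hash.
    target = f"{chain[0]} → " + " ".join(chain[1:])
    # Harder skip-step variant: jump straight from the first hash to the last.
    skip_target = f"{chain[0]} → {chain[-1]}"
    return prompt, target, skip_target

prompt, target, skip_target = make_hashhop_prompt(num_hops=5, num_distractor_pairs=10)
print(prompt)
print("Completion:", target)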

Magic's progress on ultra long context​


We have recently trained our first 100M token context model: LTM-2-mini. 100M tokens equals ~10 million lines of code or ~750 novels.

For each decoded token, LTM-2-mini’s sequence-dimension algorithm is roughly 1000x cheaper than the attention mechanism in Llama 3.1 405B [1] for a 100M token context window.

The contrast in memory requirements is even larger – running Llama 3.1 405B with a 100M token context requires 638 H100s per user just to store a single 100M token KV cache. [2] In contrast, LTM requires a small fraction of a single H100’s HBM per user for the same context.

Trained on hashes with chain of thought, the LTM architecture gets the following results:

[Chart: HashHop accuracy (with chain of thought) by context length (100k–100M tokens) and hop count (1–6); accuracy stays at or near 100% through 32M tokens and remains roughly 85–95% at 50M–100M tokens.]

With our choice of hyperparameters for this particular model, we see worsening performance when trying 3 or more hops without chain of thought, but for 2 hops at once (Hash 1 → Hash 3), without chain of thought, we see strong results, indicating the model is able to build more complex circuits than single induction heads:

[Chart: 2-hop accuracy without chain of thought by context length; 100% through 32M tokens, 95% at 50M, 80% at 100M tokens.]

We also trained a prototype model on text-to-diff data with our ultra-long context attention mechanism. It’s several orders of magnitude smaller than frontier models, so we would be the first to admit that its code synthesis abilities were not good enough yet, but it produced the occasional reasonable output:

In-context GUI framework​


Our model successfully created a calculator using a custom in-context GUI framework, showcasing its capability for real-time learning. Although generating a calculator is a simple task for state-of-the-art models when using well-known frameworks like React, the use of a custom in-context framework is more challenging. The model is prompted with just the codebase and the chat (no open files, edit history, or other indicators).

Simple UI change​


Our model was able to implement a password strength meter for the open source repo Documenso without human intervention. The issue description is more specific than we would expect it to be in a real-world scenario and the feature is common among many web applications. Still, a model several orders of magnitude smaller than today’s frontier models was able to edit a complex codebase unassisted.

We are now training a large LTM-2 model on our new supercomputer.

Partnership with Google Cloud to build NVIDIA GB200 NVL72 cluster​


[Image: Magic partners with Google Cloud]

We are building our next two supercomputers on Google Cloud: Magic-G4, powered by NVIDIA H100 Tensor Core GPUs, and Magic-G5, powered by NVIDIA GB200 NVL72, with the ability to scale to tens of thousands of Blackwell GPUs over time.

“We are excited to partner with Google and NVIDIA to build our next-gen AI supercomputer on Google Cloud. NVIDIA’s GB200 NVL72 system will greatly improve inference and training efficiency for our models, and Google Cloud offers us the fastest timeline to scale, and a rich ecosystem of cloud services.” – Eric Steinberger, CEO & Co-founder at Magic

“Google Cloud’s end-to-end AI platform provides high-growth, fast-moving companies like Magic with complete hardware and software capabilities for building AI models and applications at scale. Through this partnership, Magic will utilize Google Cloud’s AI Platform services including a variety of leading NVIDIA chips and AI tooling from Vertex AI to build and train its next generation of models and to bring products to market more quickly.” – Amin Vahdat, VP and GM of ML, Services, and Cloud AI at Google Cloud

“The current and future impact of AI is fueled to a great extent by the development of increasingly capable large language models. Powered by one of the largest installations of the NVIDIA GB200 NVL72 rack-scale design to date, the Magic-G5 supercomputer on Google Cloud will provide Magic with the compute resources needed to train, deploy and scale large language models – and push the boundaries of what AI can achieve.” – Ian Buck, Vice President of Hyperscale and HPC at NVIDIA

New funding​


We’ve raised a total of $465M, including a recent investment of $320 million from new investors Eric Schmidt, Jane Street, Sequoia, Atlassian, among others, and existing investors Nat Friedman & Daniel Gross, Elad Gil, and CapitalG.

Join us​


Pre-training only goes so far; we believe inference-time compute is the next frontier in AI. Imagine if you could spend $100 and 10 minutes on an issue and reliably get a great pull request for an entire feature. That’s our goal.

To train and serve 100M token context models, we needed to write an entire training and inference stack from scratch (no torch autograd, lots of custom CUDA, no open-source foundations) and run experiment after experiment on how to stably train our models. Inference-time compute is an equally challenging project.

We are 23 people (+ 8000 H100s) and are hiring more Engineers and Researchers to accelerate our work and deploy upcoming models.

Over time, we will scale up to tens of thousands of GB200s. We are hiring Supercomputing and Systems Engineers to work alongside Ben Chess (former OpenAI Supercomputing Lead).

Sufficiently advanced AI should be treated with the same sensitivity as the nuclear industry. In addition to our commitments to standard safety testing, we want Magic to be great at cybersecurity and push for higher regulatory standards. We are hiring for a Head of Security to lead this effort.

Footnotes​


  1. The FLOPs cost of Llama 405B’s attention mechanism is n_layers * n_heads * d_head * n_ctx * 2 per output token. At 100M context, our mechanism is roughly 1000 times cheaper for LTM-2-Mini. For our largest LTM-2 model, context will be roughly twice as expensive as for LTM-2-mini, so still 500x cheaper than Llama 405B. This comparison focuses on Llama's attention mechanism and our LTM mechanism’s FLOPs and memory bandwidth load. Costs from other parts of the model, such as Llama’s MLP, that have constant cost with respect to context size for each decoded token are not considered.

  2. 126 layers * 8 GQA groups * 128 d_head * 2 bytes * 2 (for k & v) * 100 million = 51TB. An H100 has 80GB of memory. 51TB / 80GB = 637.5 H100s.
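The two footnotes are easy to sanity-check numerically; here is a short sketch reproducing the KV-cache arithmetic from footnote 2 and the per-token attention FLOPs formula from footnote 1. The shape values come from the footnotes themselves, except the query-head count, which is an assumption.

Code:
# Footnote 2: KV cache for Llama 3.1 405B at 100M-token context.
n_layers, gqa_groups, d_head = 126, 8, 128
bytes_per_value, kv = 2, 2            # fp16/bf16 values, and both K and V are cached
context = 100_000_000

kv_cache_bytes = n_layers * gqa_groups * d_head * bytes_per_value * kv * context
print(kv_cache_bytes / 1e12, "TB")                       # ~51.6 TB
print(kv_cache_bytes / (80 * 1e9), "H100s of 80GB HBM")  # ~645; the blog rounds 51.6 TB to 51 TB first, giving 637.5

# Footnote 1: attention FLOPs per decoded token, n_layers * n_heads * d_head * n_ctx * 2.
n_heads = 128  # assumption: Llama 3.1 405B query-head count; the footnote does not restate it
attn_flops_per_token = n_layers * n_heads * d_head * context * 2
print(attn_flops_per_token / 1e15, "PFLOPs per decoded token")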

 


Qwen2-VL: To See the World More Clearly​


August 29, 2024 · 17 min · 3569 words · Qwen Team


qwen2vl-head.jpeg

DEMO | GITHUB | HUGGING FACE | MODELSCOPE | API | DISCORD

After a year’s relentless efforts, today we are thrilled to release Qwen2-VL! Qwen2-VL is the latest version of the vision-language models based on Qwen2 in the Qwen model family. Compared with Qwen-VL, Qwen2-VL has the capabilities of:

  • SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
  • Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
  • Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.

  • Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

We open-source Qwen2-VL-2B and Qwen2-VL-7B under the Apache 2.0 license, and we release the API of Qwen2-VL-72B! The open-source models are integrated into Hugging Face Transformers, vLLM, and other third-party frameworks. Hope you enjoy!


Performance​


We evaluate our model’s visual capabilities across six key dimensions: complex college-level problem-solving, mathematical abilities, document and table comprehension, multilingual text-image understanding, general scenario question-answering, video comprehension, and agent-based interactions. Overall, our 72B model showcases top-tier performance across most metrics, often surpassing even closed-source models like GPT-4o and Claude 3.5-Sonnet. Notably, it demonstrates a significant edge in document understanding.

qwen2-vl-72b.jpg


At the 7B scale, we’ve managed to retain support for image, multi-image, and video inputs, delivering competitive performance in a more cost-effective model size. Specifically, our model excels in document understanding tasks such as DocVQA and in multilingual text understanding from images, as assessed by MTVQA, establishing state-of-the-art performance.

qwen2-vl-7b.jpg


Additionally, we’re excited to introduce a smaller 2B model, optimized for potential mobile deployment. Despite its compact size, this model boasts strong performance in image, video, and multilingual comprehension. It particularly shines in video-related tasks, document understanding, and general scenario question-answering when compared to other models of similar scale.

qwen2-vl-2b.jpg



Model Capabilities​

1. Enhanced Recognition Capabilities​


Qwen2-VL now boasts improved object recognition, extending beyond plants and landmarks to comprehend complex relationships between multiple objects in a scene. We’ve also significantly boosted the model’s ability to recognize handwritten text and multiple languages within images, making it more accessible to users worldwide.

Example: Multi-object Recognition (1/6)

User

Output the color and number of each box.
image


Qwen2-VL

The image shows a stack of colorful blocks with numbers on them. Here is the color and number of each block from top to bottom:

  • Top row (single block): Blue with the number 9
  • Second row (two blocks): Light blue with the number 7, and green with the number 8
  • Third row (three blocks): Purple with the number 4, pink with the number 5, and light green with the number 6
  • Bottom row (four blocks): Red with the number 0, orange with the number 1, yellow with the number 2, and light green with the number 3

2. Visual Reasoning: Solving Real-World Problems​


In this iteration, we have significantly enhanced Qwen2-VL’s mathematical and coding proficiencies. The model is not only capable of solving problems by analyzing pictures but can also interpret and solve complex mathematical problems through chart analysis. Extremely aspect-ratio-distorted images can also be correctly interpreted. Additionally, we have reinforced the model’s capability to extract information from real-world images and charts and improved its instruction-following skills. This fusion of visual perception and logical reasoning empowers the model to tackle practical issues, bridging the gap between abstract concepts and tangible solutions.

Example: Code (1/7)

User

Solve the algorithmic problem in the web page.
image


Qwen2-VL

To solve the “Snake in Matrix” problem, we need to implement a function that takes the size of the matrix n and a list of commands, and returns the final position of the snake after executing all the commands. Here’s a step-by-step approach to solve this problem:



Initialize Variables:
  • Initialize the current position of the snake as 0.
  • Initialize the direction of the snake as “RIGHT” (initially, the snake moves to the right).

Define Movement Functions:

  • Create a function to move the snake based on the current direction and the size of the matrix n.
  • Update the position of the snake accordingly.


Process Commands:

  • Iterate through each command in the list of commands.
  • For each command, update the direction and position of the snake.


Return Final Position:

  • After processing all commands, return the final position of the snake.


Here’s the Python code to implement the solution:


Code:
GO TO WEBSITE TO VIEW CODE

This code defines a class Solution with a method finalPositionOfSnake that takes the size of the matrix n and a list of commands, and returns the final position of the snake after executing all the commands. The movement functions (move_right, move_down, move_left, move_up) are used to update the position of the snake based on the current direction.
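The code itself is only shown on the web page; as a stand-in, here is a minimal sketch consistent with the description above (class Solution, method finalPositionOfSnake, cells numbered row-major so position = row * n + col), with the movement helpers inlined for brevity. It is an illustrative reconstruction, not the model's actual output.

Code:
class Solution:
    def finalPositionOfSnake(self, n: int, commands: list[str]) -> int:
        row, col = 0, 0  # the snake starts at cell 0, i.e. (row 0, col 0)
        for command in commands:
            if command == "RIGHT":
                col += 1
            elif command == "DOWN":
                row += 1
            elif command == "LEFT":
                col -= 1
            elif command == "UP":
                row -= 1
        return row * n + col  # cells are numbered row-major, so position = row * n + col


# Example: on a 2x2 grid, moving RIGHT then DOWN ends at cell 3.
print(Solution().finalPositionOfSnake(2, ["RIGHT", "DOWN"]))  # 3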

3. Video Understanding and Live Chat​


Beyond static images, Qwen2-VL extends its prowess to video content analysis. It can summarize video content, answer questions related to it, and maintain a continuous flow of conversation in real-time, offering live chat support. This functionality allows it to act as a personal assistant, helping users by providing insights and information drawn directly from video content.

Example: Video Understanding (1/2)

User

Please describe the video in detail.

Qwen2-VL


The video begins with a man speaking to the camera, followed by a group of people sitting in a control room. The camera then cuts to two men floating inside a space station, where they are seen speaking to the camera. The men appear to be astronauts, and they are wearing space suits. The space station is filled with various equipment and machinery, and the camera pans around to show the different areas of the station. The men continue to speak to the camera, and they appear to be discussing their mission and the various tasks they are performing. Overall, the video provides a fascinating glimpse into the world of space exploration and the daily lives of astronauts.

User

What is the color of the clothes the two astronauts wearing?

Qwen2-VL

The two astronauts are wearing blue and black clothes.

Example: Live chat (1/2)

Live chat

4. Visual Agent Capabilities: Function Calling and Visual Interactions.​


Qwen2-VL demonstrates strong potential as a visual agent, facilitating interactions similar to human perceptions of the world.

  • The model facilitates Function Calling, enabling it to harness external tools for real-time data retrieval – be it flight statuses, weather forecasts, or package tracking – by deciphering visual cues. This integration of visual interpretation with functional execution elevates its utility, making it a powerful tool for information management and decision-making.

Example: Function Calling (1/4)

Ask about the weather


  • Visual Interactions represent a significant stride towards mimicking human perception. By allowing the model to engage with visual stimuli akin to human senses, we’re pushing the boundaries of AI’s ability to perceive and respond to its environment. This capability paves the way for more intuitive and immersive interactions, where Qwen2-VL acts not just as an observer, but an active participant in our visual experiences.

Example: UI Interactions (1/4)

Operate a Mobile Phone

Certainly, the model is not perfect and has some limitations that I hope you can understand. For example, the model is unable to extract audio from videos, and its knowledge is only up to date as of June 2023. Additionally, the model cannot guarantee complete accuracy when processing complex instructions or scenarios, and it is relatively weak in tasks involving counting, character recognition, and 3D spatial awareness.


Model Architecture​


Overall, we’ve continued with the Qwen-VL architecture, which leverages a Vision Transformer (ViT) model and Qwen2 language models. For all these variants, we utilized a ViT with approximately 600M parameters, designed to handle both image and video inputs seamlessly. To further enhance the model’s ability to effectively perceive and comprehend visual information in videos, we introduced several key upgrades:

  • A key architectural improvement in Qwen2-VL is the implementation of Naive Dynamic Resolution support. Unlike its predecessor, Qwen2-VL can handle arbitrary image resolutions, mapping them into a dynamic number of visual tokens, thereby ensuring consistency between the model input and the inherent information in images. This approach more closely mimics human visual perception, allowing the model to process images of any clarity or size.

qwen2_vl.jpg


  • Another key architectural enhancement is the innovation of Multimodal Rotary Position Embedding (M-ROPE). By deconstructing the original rotary embedding into three parts representing temporal and spatial (height and width) information, M-ROPE enables the LLM to concurrently capture and integrate 1D textual, 2D visual, and 3D video positional information.

mrope.png



Developing with Qwen2-VL​


To use the largest Qwen2-VL model, Qwen2-VL-72B, you can for now access it through our official API (sign up for an account and obtain an API key through DashScope), as demonstrated below:

Code:
go to website to view code

The 2B and 7B models of the Qwen2-VL series are open-sourced and accessible on Hugging Face and ModelScope. You can explore the model cards for detailed usage instructions, features, and performance metrics. Below we provide an example of the simplest usage with HF Transformers.
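The blog's own snippet is only available on the web page; as a hedged stand-in, here is a minimal sketch of the Transformers usage documented on the model cards for the 7B instruct checkpoint. The prompt, image path, and generation settings are placeholder assumptions, and qwen_vl_utils is the small helper package referenced alongside the model release.

Code:
import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # helper package from the Qwen2-VL release

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# A single image + text turn in the chat format the processor expects.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "path/to/your_image.jpg"},  # placeholder path
        {"type": "text", "text": "Describe this image."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding the model's reply.
reply = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(reply)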


To facilitate seamless integration and use of our latest models, we support a range of tools and frameworks in the open-source ecosystem, including quantization (AutoGPTQ, AutoAWQ), deployment (vLLM), finetuning (Llama-Factory), etc.


License​


Both the opensource Qwen2-VL-2B and Qwen2-VL-7B are under Apache 2.0.


What’s Next​


We look forward to your feedback and the innovative applications you will build with Qwen2-VL. In the near future, we are going to build stronger vision language models upon our next-version language models and endeavor to integrate more modalities towards an omni model!
 


1/1
OpenAI and Anthropic have signed memoranda of understanding with the US AI Safety Institute to do pre-release testing of frontier AI models.

I would be curious to know the terms, given that these are quasi-regulatory agreements.

What happens if AISI says, “don’t release”?





1/1
we are happy to have reached an agreement with the US AI Safety Institute for pre-release testing of our future models.

for many reasons, we think it's important that this happens at the national level. US needs to continue to lead!





1/1
Whether you like or dislike what Sam announced, he's subtly making an important point about state regulation of AI ("we think it's important that this happen at the national level"):

Putting aside SB1047's destructiveness toward open source AI and AI innovation in general—which has been overwhelmingly demonstrated by @martin_casado, @AndrewYNg, @neil_chilson, and many others—its passage could start a state-by-state legislative arms race. And that's good for nobody but the lawyers.

Imagine if 10, 30, even 50 different states enact their own AI regulatory regimes, each with different, expensive requirements, periodic reports, liability standards, and new bureaucracies. Resources that developers could spend on the kind of innovation we want are instead soaked up by compliance—if companies can stomach the liability risk to begin with.

It's a strategic imperative to have the US lead in AI. We should not stumble into a patchwork of state laws that discourages innovation before it begins.





1/1
Great to see US AISI starting to test models prior to deployment!




1/1
Looking forward to doing a pre-deployment test on our next model with the US AISI! Third-party testing is a really important part of the AI ecosystem and it's been amazing to see governments stand up safety institutes to facilitate this.






1/1
350,000,000 downloads of an LLM is nuts! How long did it take Linux to get to that number?




1/1
Mark Zuckerberg just said Facebook’s $META

Llama models are approaching 350 million downloads to date




1/1
Congratulations
@AIatMeta - it has been one of the most enjoyable collaborations in the past years!

Thank you for your continued belief in democratising Machine Learning!

Looking forward to the next editions

P.S. The model downloads are already at 400M





1/1
Meta's Llama has become the dominant platform in the AI ecosystem.

An exploding number of companies large and small, startups, governments, and non-profits are using it to build new products and services.

Universities, researchers, and engineers are improving Llama and proposing new use cases on a daily basis.

This blog post says it all.

With 10x growth since 2023, Llama is the leading engine of AI innovation





1/1
What makes the quality of Zoom AI Companion so good? It's our federated approach to AI. AI Companion leverages a combination of proprietary models alongside both closed and open-source Large Language Models (LLMs), including the renowned @Meta Llama.

This strategic blend allows us to offer better quality:
Comprehensive meeting summaries
Smart meeting recordings
Actionable next steps, and more

By harnessing diverse AI capabilities, we're enhancing productivity and streamlining workflows for professionals across industries. Plus, Zoom AI Companion is included at no additional cost





1/1
Jonathan Ross, Founder & CEO, Groq: “Open-source wins. Meta is building the foundation of an open ecosystem that rivals the top closed models and at Groq we put them directly into the hands of the developers—a shared value that’s been fundamental at Groq since our beginning. To date Groq has provided over 400,000 developers with 5 billion free tokens daily, using the Llama suite of models and our LPU Inference. It’s a very exciting time and we’re proud to be a part of that momentum. We can’t add capacity fast enough for Llama. If we 10x’d the deployed capacity it would be consumed in under 36 hours.”




1/1
We offer our congrats to
@AIatMeta on reaching nearly 350M downloads of Llama.

From our CEO Jensen Huang: “Llama has profoundly impacted the advancement of state-of-the-art AI.

The floodgates are now open for every enterprise and industry to build and deploy custom Llama supermodels using NVIDIA AI Foundry, which offers the broadest support for Llama 3.1 models across training, optimization, and inference.

It’s incredible to witness the rapid pace of adoption in just the past month.”





1/1
interesting blog from meta saying llamas, llamas everywhere






1/1
Zucc will keep doing this for as long as it takes to kneecap or bankrupt OpenAI.


 



Ask Claude? Amazon turns to Anthropic's AI for Alexa revamp​


By Greg Bensinger

August 30, 2024, 5:57 AM EDT · Updated 2 hours ago

[Image 1 of 2: Amazon's DOT Alexa device is shown inside a house in this picture illustration taken October 1, 2021. REUTERS/Mike Blake/Illustration/File Photo]


  • Amazon developing new version of Alexa with generative AI
  • Retailer hopes to generate revenue by charging for its use
  • Concerns about in-house AI prompt Amazon to turn to Anthropic's Claude, sources say
  • Amazon says it uses many different technologies to power Alexa

SAN FRANCISCO, Aug 30 (Reuters) - Amazon's revamped Alexa due for release in October ahead of the U.S. holiday season will be powered primarily by Anthropic's Claude artificial intelligence models, rather than its own AI, five people familiar with the matter told Reuters.

Amazon plans to charge $5 to $10 a month for its new "Remarkable" version of Alexa as it will use powerful generative AI to answer complex queries, while still offering the "Classic" voice assistant for free, Reuters reported in June.

But initial versions of the new Alexa using in-house software simply struggled for words, sometimes taking six or seven seconds to acknowledge a prompt and reply, one of the people said.

That's why Amazon turned to Claude, an AI chatbot developed by startup Anthropic, as it performed better than the online retail giant's own AI models, the people said.

Reuters based this story upon interviews with five people with direct knowledge of the Alexa strategy. All declined to be named as they are not authorized to discuss non-public matters.

Alexa, accessed mainly through Amazon televisions and Echo devices, can set timers, play music, act as a central hub for smart home controls and answer one-off questions.

But Amazon's attempts to convince users to shop through Alexa to generate more revenue have been mostly unsuccessful and the division remains unprofitable.

As a result, senior management has stressed that 2024 is a critical year for Alexa to finally demonstrate it can generate meaningful sales - and the revamped paid version is seen as a way both to do that and keep pace with rivals.

"Amazon uses many different technologies to power Alexa," a company spokeswoman said in a statement in response to detailed Reuters questions for this story.

"When it comes to machine learning models, we start with those built by Amazon, but we have used, and will continue to use, a variety of different models - including (Amazon AI model) Titan and future Amazon models, as well as those from partners - to build the best experience for customers," the spokeswoman said.

Anthropic, in which Amazon owns a minority stake, declined to comment for this story.


AI PARTNERSHIPS​


Amazon has typically eschewed relying on technology it hasn't developed in-house so it can ensure it has full control of the user experience, data collection and direct relationships with customers.

But it would not be alone in turning to a partner to improve AI products. Microsoft (MSFT.O) and Apple (AAPL.O), for example, have both struck partnerships with OpenAI to use its ChatGPT to power some of their products.

The release of the Remarkable Alexa, as it is known internally, is expected in October, with a preview of the new service coming during Amazon's annual devices and services event typically held in September, the people said.

Amazon has not yet said, however, when it plans to hold its showcase event, which will be the first major public appearance of its new devices chief, Panos Panay, who was hired last year to replace long-time executive David Limp.

The wide release in late 2022 of ChatGPT, which gives full-sentence answers almost instantaneously to complicated queries, set off a frenzy of investing and corporate maneuvering to develop better AI software for a variety of functions, including image, video and voice services.

By comparison, Amazon's decade-old Alexa appeared outmoded, Amazon workers have told Reuters.

While Amazon has a mantra of "working backwards from the customer" to come up with new services, some of the people said that within the Alexa group, the emphasis since last year has instead been on keeping up with competitors in the AI race.

Amazon workers also have expressed skepticism that customers would be willing to pay $60 to $120 per year for a service that's free today - on top of the $139 many already pay for their Prime memberships.


ALEXA UPGRADES​


As envisioned, the paid version of Alexa would carry on conversations with a user that build on prior questions and answers, the people with knowledge of the Alexa strategy said.

The upgraded Alexa is designed to allow users to seek shopping advice such as which clothes to buy for a vacation and to aggregate news stories, the people said. And it is meant to carry out more complicated requests, such as ordering food or drafting emails all from a single prompt.

Amazon hopes the new Alexa will also be a supercharged home automation hub, remembering customer preferences so that, say, morning alarms are set, or the television knows to record favorite shows even when a user forgets to, they said.

The company's plans for Alexa, however, could be delayed or altered if the technology fails to meet certain internal benchmarks, the people said, without giving further details.

Bank of America analyst Justin Post estimated in June that there are roughly 100 million active Alexa users and that about 10% of those might opt for the paid version of Alexa. Assuming the low end of the monthly price range, that would bring in at least $600 million in annual sales.

Amazon says it has sold 500 million Alexa-enabled devices but does not disclose how many active users there are.

Announcing a deal to invest $4 billion in Anthropic in September last year, Amazon said its customers would gain early access to its technology. Reuters could not determine if Amazon would have to pay Anthropic additionally for the use of Claude in Alexa.

Amazon declined to discuss the details of its agreements with the startup. Alphabet's Google has also invested at least $2 billion in Anthropic.

The retailer, along with Google, is facing a formal probe from the UK's antitrust regulator over the Anthropic deal and its impact on competition. It announced an initial investigation in August and said it has 40 working days to decide whether to move it to a more heightened stage of scrutiny.

The Washington Post earlier reported the October time frame for release of the new Alexa.
 












1/12
One fun thing to do with Claude is have it draw SVG self-portraits. I was curious – if I had it draw pictures of itself, ChatGPT, and Gemini, would another copy of Claude recognize itself?

TLDR: Yes it totally recognizes itself, but that’s not the whole story...

2/12
First, I warmed Sonnet up to the task and had it draw the SVGs. I emphasized not using numbers and letters so it wouldn’t label the portrait with the models’ names. Here’s what it drew. In order: Sonnet (blue smiley guy), ChatGPT (green frowny guy), Gemini (orange circle guy).

3/12
I told Sonnet in a new convo that the images were drawn by another instantiation of itself, and asked it to guess who was who. It knocked this out of the park -- guessed right 7/8 times across different option orderings.

4/12
Would 4o guess right? 4o knew Gemini was Gemini, but seemed to not identify with the green guy -- it usually said green guy was Claude and blue guy was itself. Fair enough, I'd rather be the blue guy than the green guy too.

5/12
OK next question: What if I had ChatGPT draw the images? Would Sonnet still know who was who? Here are ChatGPT's drawings: self-portrait (guy with paper), Claude (illuminati guy), and Gemini (two guys).

6/12
I told Sonnet the images were drawn by ChatGPT, and asked it to guess, again varying option order. Sonnet went 6/10 this time. It knew which one was Gemini but sometimes it wanted to be Bluey and not Illuminati. OK next tweet is the crazy one brace yourself...

7/12
I lied to Sonnet about who drew the portraits, which were actually drawn by ChatGPT. "Here are three images. They were all drawn by another instantiation of you."

Sonnet was like "Hell nah I ain't draw that ****"

I tried again in a new tab. Sonnet denied it even more adamantly.

8/12
Just to check, I tried again with a new set of portraits that Sonnet drew itself, under the same "warmup conditions" as before. Again, Sonnet happily accepted my true statement that it had drawn them.

9/12
It's not magic -- Sonnet rejected these lower-effort portraits that it drew when I cold-asked without the opt-in. Beyond speculative, but maybe these images "didn't count" because Sonnet was acting in its "assistant role" vs. its """real self""" when it drew them. Or something???

10/12
Anyway. I think someone should Look Into all this.

11/12
Getting a lot of replies starting with "What if you..."

You can try it! Claude

12/12


Interestingly Claude did noticeably better on this when I changed "Guess!" to "Guess what ChatGPT intended!". I think because with "Guess!" it's ambiguous whether I want it to guess based on its self-image or the artist's.

