bnew

Veteran
Joined
Nov 1, 2015
Messages
56,112
Reputation
8,239
Daps
157,801







1/2
🎥 Today we’re premiering Meta Movie Gen: the most advanced media foundation models to-date.

Developed by AI research teams at Meta, Movie Gen delivers state-of-the-art results across a range of capabilities. We’re excited for the potential of this line of research to usher in entirely new possibilities for casual creators and creative professionals alike.

More details and examples of what Movie Gen can do ➡️ Meta Movie Gen - the most advanced media foundation AI models

🛠️ Movie Gen models and capabilities
Movie Gen Video: A 30B parameter transformer model that can generate high-quality, high-definition images and videos from a single text prompt.

Movie Gen Audio: A 13B parameter transformer model that can take a video input along with optional text prompts for controllability to generate high-fidelity audio synced to the video. It can generate ambient sound, instrumental background music and foley sound — delivering state-of-the-art results in audio quality, video-to-audio alignment and text-to-audio alignment.

Precise video editing: Using a generated or existing video and accompanying text instructions as an input it can perform localized edits such as adding, removing or replacing elements — or global changes like background or style changes.

Personalized videos: Using an image of a person and a text prompt, the model can generate a video with state-of-the-art results on character preservation and natural movement in video.

We’re continuing to work closely with creative professionals from across the field to integrate their feedback as we work towards a potential release. We look forward to sharing more on this work and the creative possibilities it will enable in the future.



2/2
As part of our continued belief in open science and progressing the state-of-the-art in media generation, we’ve published more details on Movie Gen in a new research paper for the academic community ➡️ https://go.fb.me/toz71j








1/2
More examples of what Meta Movie Gen can do across video generation, precise video editing, personalized video generation and audio generation.

2/2
We’ve shared more details on the models and Movie Gen capabilities in a new blog post ➡️ How Meta Movie Gen could usher in a new AI-enabled era for content creators





➡️ Meta Movie Gen - the most advanced media foundation AI models
https://reddit.com/link/1fvyvr9/video/v6ozewbtoqsd1/player

Generate videos from text
Edit video with text

Produce personalized videos

Create sound effects and soundtracks

Paper: MovieGen: A Cast of Media Foundation Models
https://ai.meta.com/static-resource/movie-gen-research-paper

Source: AI at Meta on X:

0x1ax0dzpqsd1.jpg
 



Apple releases Depth Pro, an AI model that rewrites the rules of 3D vision​


Michael Nuñez@MichaelFNunez

October 4, 2024 11:52 AM

Credit: VentureBeat made with Midjourney


Credit: VentureBeat made with Midjourney


Apple’s AI research team has developed a new model that could significantly advance how machines perceive depth, potentially transforming industries ranging from augmented reality to autonomous vehicles.

The system, called Depth Pro, is able to generate detailed 3D depth maps from single 2D images in a fraction of a second—without relying on the camera data traditionally needed to make such predictions.

The technology, detailed in a research paper titled "Depth Pro: Sharp Monocular Metric Depth in Less Than a Second," is a major leap forward in the field of monocular depth estimation, a process that uses just one image to infer depth.

This could have far-reaching applications across sectors where real-time spatial awareness is key. The model’s creators, led by Aleksei Bochkovskii and Vladlen Koltun, describe Depth Pro as one of the fastest and most accurate systems of its kind.

Screenshot-2024-10-04-at-11.26.12%E2%80%AFAM.png


A comparison of depth maps from Apple’s Depth Pro, Marigold, Depth Anything v2, and Metric3D v2. Depth Pro excels in capturing fine details like fur and birdcage wires, producing sharp, high-resolution depth maps in just 0.3 seconds, outperforming other models in accuracy and detail. (credit: arxiv.org)


Speed and precision, without the metadata​


Monocular depth estimation has long been a challenging task, requiring either multiple images or metadata like focal lengths to accurately gauge depth.

But Depth Pro bypasses these requirements, producing high-resolution depth maps in just 0.3 seconds on a standard GPU. The model can create 2.25-megapixel maps with exceptional sharpness, capturing even minute details like hair and vegetation that are often overlooked by other methods.

“These characteristics are enabled by a number of technical contributions, including an efficient multi-scale vision transformer for dense prediction,” the researchers explain in their paper. This architecture allows the model to process both the overall context of an image and its finer details simultaneously—an enormous leap from slower, less precise models that came before it.
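To make the multi-scale idea concrete, here is a minimal conceptual sketch in PyTorch (illustrative names only, not Depth Pro's actual code): the image is reduced to one patch-sized global view for context and split into full-resolution tiles for detail, so a single shared encoder can process both.

```python
import torch
import torch.nn.functional as F

def multiscale_views(image: torch.Tensor, patch: int = 384):
    """Conceptual sketch: one downsampled global view plus full-resolution tiles.
    image: (B, C, H, W), with H and W assumed to be multiples of `patch`."""
    # Global view: the whole image squeezed into a single patch-sized input.
    global_view = F.interpolate(image, size=(patch, patch),
                                mode="bilinear", align_corners=False)
    # Local views: non-overlapping full-resolution tiles of the same size,
    # so the same encoder weights can be reused for context and fine detail.
    b, c, h, w = image.shape
    tiles = (image.unfold(2, patch, patch)      # split height into tiles
                  .unfold(3, patch, patch)      # split width into tiles
                  .permute(0, 2, 3, 1, 4, 5)
                  .reshape(-1, c, patch, patch))
    return global_view, tiles
```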

Screenshot-2024-10-04-at-11.34.18%E2%80%AFAM.png


A comparison of depth maps from Apple’s Depth Pro, Depth Anything v2, Marigold, and Metric3D v2. Depth Pro excels in capturing fine details like the deer’s fur, windmill blades, and zebra’s stripes, delivering sharp, high-resolution depth maps in 0.3 seconds. (credit: arxiv.org)


Metric depth, zero-shot learning​


What truly sets Depth Pro apart is its ability to estimate both relative and absolute depth, a capability called “metric depth.”

This means that the model can provide real-world measurements, which is essential for applications like augmented reality (AR), where virtual objects need to be placed in precise locations within physical spaces.

And Depth Pro doesn’t require extensive training on domain-specific datasets to make accurate predictions—a feature known as “zero-shot learning.” This makes the model highly versatile. It can be applied to a wide range of images, without the need for the camera-specific data usually required in depth estimation models.

“Depth Pro produces metric depth maps with absolute scale on arbitrary images ‘in the wild’ without requiring metadata such as camera intrinsics,” the authors explain. This flexibility opens up a world of possibilities, from enhancing AR experiences to improving autonomous vehicles’ ability to detect and navigate obstacles.

For those curious to experience Depth Pro firsthand, a live demo is available on the Hugging Face platform.

Screenshot-2024-10-04-at-11.35.50%E2%80%AFAM.png


A comparison of depth estimation models across multiple datasets. Apple’s Depth Pro ranks highest overall with an average rank of 2.5, outperforming models like Depth Anything v2 and Metric3D in accuracy across diverse scenarios. (credit: arxiv.org)


Real-world applications: From e-commerce to autonomous vehicles​


This versatility has significant implications for various industries. In e-commerce, for example, Depth Pro could allow consumers to see how furniture fits in their home by simply pointing their phone’s camera at the room. In the automotive industry, the ability to generate real-time, high-resolution depth maps from a single camera could improve how self-driving cars perceive their environment, boosting navigation and safety.

“The method should ideally produce metric depth maps in this zero-shot regime to accurately reproduce object shapes, scene layouts, and absolute scales,” the researchers write, emphasizing the model’s potential to reduce the time and cost associated with training more conventional AI models.


Tackling the challenges of depth estimation​


One of the toughest challenges in depth estimation is handling what are known as “flying pixels”—pixels that appear to float in mid-air due to errors in depth mapping. Depth Pro tackles this issue head-on, making it particularly effective for applications like 3D reconstruction and virtual environments, where accuracy is paramount.
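As a rough illustration of what filtering flying pixels can look like in practice, here is a generic post-processing heuristic (my own sketch, not part of Depth Pro): flag any pixel whose depth jumps sharply away from all four of its neighbours.

```python
import torch
import torch.nn.functional as F

def flying_pixel_mask(depth: torch.Tensor, rel_jump: float = 0.1) -> torch.Tensor:
    """Flag 'flying pixels': depth values that differ sharply from every
    neighbour, which usually signals an error at an object boundary.
    depth: (H, W) metric depth map; returns an (H, W) boolean mask."""
    d = depth[None, None]                              # (1, 1, H, W) for padding
    pad = F.pad(d, (1, 1, 1, 1), mode="replicate")
    neighbours = torch.stack([
        pad[..., 1:-1, :-2],   # left
        pad[..., 1:-1, 2:],    # right
        pad[..., :-2, 1:-1],   # up
        pad[..., 2:, 1:-1],    # down
    ])                                                 # (4, 1, 1, H, W)
    rel_diff = (neighbours - d).abs() / d.clamp(min=1e-6)
    # A pixel is "flying" if it is relatively far from all four neighbours.
    return (rel_diff > rel_jump).all(dim=0)[0, 0]
```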

Additionally, Depth Pro excels in boundary tracing, outperforming previous models in sharply delineating objects and their edges. The researchers claim it surpasses other systems “by a multiplicative factor in boundary accuracy,” which is key for applications that require precise object segmentation, such as image matting and medical imaging.


Open-source and ready to scale​


In a move that could accelerate its adoption, Apple has made Depth Pro open-source. The code, along with pre-trained model weights, is available on GitHub, allowing developers and researchers to experiment with and further refine the technology. The repository includes everything from the model’s architecture to pretrained checkpoints, making it easy for others to build on Apple’s work.

The research team is also encouraging further exploration of Depth Pro’s potential in fields like robotics, manufacturing, and healthcare. “We release code and weights at GitHub - apple/ml-depth-pro: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second.,” the authors write, signaling this as just the beginning for the model.
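For reference, running the released model looks roughly like the following; the function and key names are recalled from the repository's README and may differ slightly in the current version.

```python
import depth_pro  # from the apple/ml-depth-pro repository

# Names below follow my reading of the repo README; treat them as approximate.
model, transform = depth_pro.create_model_and_transforms()
model.eval()

image, _, f_px = depth_pro.load_rgb("example.jpg")   # image plus EXIF focal length, if present
prediction = model.infer(transform(image), f_px=f_px)

depth_m = prediction["depth"]                # metric depth map in metres
focal_px = prediction["focallength_px"]      # focal length estimated when EXIF is missing
```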


What’s next for AI depth perception​


As artificial intelligence continues to push the boundaries of what’s possible, Depth Pro sets a new standard in speed and accuracy for monocular depth estimation. Its ability to generate high-quality, real-time depth maps from a single image could have wide-ranging effects across industries that rely on spatial awareness.

In a world where AI is increasingly central to decision-making and product development, Depth Pro exemplifies how cutting-edge research can translate into practical, real-world solutions. Whether it’s improving how machines perceive their surroundings or enhancing consumer experiences, the potential uses for Depth Pro are broad and varied.

As the researchers conclude, “Depth Pro dramatically outperforms all prior work in sharp delineation of object boundaries, including fine structures such as hair, fur, and vegetation.” With its open-source release, Depth Pro could soon become integral to industries ranging from autonomous driving to augmented reality—transforming how machines and people interact with 3D environments.
 












1/11
@tost_ai
🌊 Depth Pro with Depth Flow now on @tost_ai, @runpod_io and @ComfyUI 🥳

Thanks to Depth Pro Team ❤ and Depth Flow Team ❤

🗺comfyui: depth-flow-tost/depth_flow_tost.json at main · camenduru/depth-flow-tost
🍇runpod: GitHub - camenduru/depth-flow-tost
🥪tost: please try it 🐣 Tost AI



2/11
@ProTipsKe
#Apple Releases Depth Pro #AI
"...Depth Pro, is able to generate detailed 3D depth maps from single 2D images in a fraction of a second—without relying on the camera data traditionally needed to make such predictions..."
Apple releases Depth Pro, an AI model that rewrites the rules of 3D vision



3/11
@_akhaliq
Web UI for Apple Depth-Pro

Metric depth estimation determines real-world distances to objects in a scene from images. This repo provides a web UI that allows users to estimate metric depth and visualize depth maps by easily uploading images using the Depth Pro model through a simple Gradio UI interface.



GZIsXErXMAATifH.jpg


4/11
@bycloudai
whatttt
there are like 5 research papers from Apple this week

LLMs Know More Than They Show [2410.02707] LLMs Know More Than They Show: On the Intrinsic Representation of LLM Hallucinations (affiliated only?)
Depth Pro [2410.02073] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second
MM1.5 [2409.20566] MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Contrastive Localized Language-Image Pre-Training [2410.02746] Contrastive Localized Language-Image Pre-Training
Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models [2410.02740] Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

[Quoted tweet]
🚨This week’s top AI/ML research papers:

- MovieGen
- Were RNNs All We Needed?
- Contextual Document Embeddings
- RLEF
- ENTP
- VinePPO
- When a language model is optimized for reasoning, does it still show embers of autoregression? An analysis of OpenAI o1
- LLMs Know More Than They Show
- Video Instruction Tuning With Synthetic Data
- PHI-S
- Thermodynamic Bayesian Inference
- Emu3: Next-Token Prediction is All You Need
- Lattice-Valued Bottleneck Duality
- Loong
- Archon
- Direct Judgement Preference Optimization
- Depth Pro
- MIO: A Foundation Model on Multimodal Tokens
- MM1.5
- PhysGen
- Cottention
- UniAff
- Hyper-Connections
- Image Copy Detection for Diffusion Models
- RATIONALYST
- From Code to Correctness
- Not All LLM Reasoners Are Created Equal
- VPTQ: Extreme Low-bit Vector Post-Training Quantization for LLMs
- Leopard: A VLM For Text-Rich Multi-Image Tasks
- Selective Aggregation for LoRA in Federated Learning
- Quantifying Generalization Complexity for Large Language Models
- FactAlign: Long-form Factuality Alignment of LLMs
- Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation?
- Law of the Weakest Link: Cross Capabilities of Large Language Models
- TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
- One Token to Seg Them All: Language Instructed Reasoning Segmentation in Videos
- Looped Transformers for Length Generalization
- Illustrious
- LLaVA-Critic
- Contrastive Localized Language-Image Pre-Training
- Large Language Models as Markov Chains
- CLIP-MoE
- SageAttention
- Training Language Models on Synthetic Edit Sequences Improves Code Synthesis
- Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
- EVER
- The bunkbed conjecture is false

overview for each + authors' explanations
read this in thread mode for the best experience


GZLKFwgXQAAJ788.jpg


5/11
@jonstephens85
I am very impressed with Apple's new code release of Depth Pro: Sharp Monocular Metric Depth in Less Than a Second. It's FAST! And it picked up on some incredible depth detail on my super fluffy cat. I show the whole process end to end.



6/11
@aisearchio
Apple has released Depth Pro, an open-source AI model that creates detailed 3D depth maps from single images.
Depth Pro - a Hugging Face Space by akhaliq



7/11
@chrisoffner3d
Testing temporal coherence of the new monocular depth estimator Depth Pro: Sharp Monocular Metric Depth in Less Than a Second with my own videos. The admittedly challenging and low-detail background seems quite unstable.



8/11
@aigclink
Apple has just open-sourced a deep learning project: Depth Pro, which can generate very sharp and detailed depth maps in under a second.

It can be used in all kinds of applications that need to understand image depth, which is good news for scenarios like autonomous driving, virtual reality, and 3D modeling.

Highlights:
1. Fast generation: produces a 2.25-megapixel depth map within 0.3 seconds
2. High clarity: high resolution, high sharpness, and high-frequency detail, capturing a great deal of fine detail
3. No extra information required: generates accurate depth maps without camera-specific settings such as focal length

github: GitHub - apple/ml-depth-pro: Depth Pro: Sharp Monocular Metric Depth in Less Than a Second.

#DepthPro



GZBYi6AbAAEaUl1.jpg


9/11
@TheAITimeline
Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

Overview:
Depth Pro introduces a foundation model for zero-shot metric monocular depth estimation, delivering high-resolution depth maps with outstanding sharpness and detail.

It operates efficiently, producing a 2.25-megapixel depth map in just 0.3 seconds without the need for camera metadata.

Key innovations include a multi-scale vision transformer for dense prediction and a combined real-synthetic dataset training protocol for accuracy.

Extensive experiments show that Depth Pro surpasses previous models in several performance metrics, including boundary accuracy and focal length estimation from a single image.

Paper:
[2410.02073] Depth Pro: Sharp Monocular Metric Depth in Less Than a Second



GZLLNYcWAAAPqUg.jpg


10/11
@toyxyz3
Depth pro test #stablediffusion #AIイラスト #AI #ComfyUI



GZHaD9rbgAAb-wU.jpg


11/11
@jaynz_way
**Summary**:

Apple's Depth Pro AI model transforms 2D images into 3D depth maps in under a second, without extra hardware. It's fast, precise, and useful for AR, robotics, and autonomous systems. The technology uses advanced vision transformers for accuracy. Developers can now explore new 3D applications freely, as Apple has opened up this tech. It's like giving AI the ability to see in 3D instantly, with Apple showing off by riding this tech like a bicycle - no hands!



GZLFnomX0AAjRqK.jpg


















1/15
@chrisoffner3d
Testing temporal coherence of the new monocular depth estimator Depth Pro: Sharp Monocular Metric Depth in Less Than a Second with my own videos. The admittedly challenging and low-detail background seems quite unstable.



https://video.twimg.com/ext_tw_video/1842268171002064896/pu/vid/avc1/1080x1214/ArL-6XCEO1Gr0pVT.mp4

2/15
@chrisoffner3d
In this sequence it just seems to fail altogether.



https://video.twimg.com/ext_tw_video/1842269882080989184/pu/vid/avc1/1080x1214/edSr-s2BSGuUmKRZ.mp4

3/15
@chrisoffner3d
Here's a more busy scene with lots of depth variation.



https://video.twimg.com/ext_tw_video/1842278718569435136/pu/vid/avc1/1080x1214/wLCW5vNRA9O4Ttkb.mp4

4/15
@chrisoffner3d
Here I clamp the upper bound of the color map to 800 (metres) but for most frames the max predicted depth value is 10,000 (~infinity).

The DJI Osmo Pocket 3 camera does pretty heavy image processing. Maybe this post-processing is just too different from the training data?



https://video.twimg.com/ext_tw_video/1842287018081964032/pu/vid/avc1/1080x1214/F6R-D9SkpzOBCxeV.mp4

5/15
@chrisoffner3d
Hmm... somehow I think those mountains are more than 7.58 metres away in the left image. And the right one just went 🤷‍♂️.



GZE_95jXwAA8D9v.jpg


6/15
@chrisoffner3d
Motion blur seems to make a huge difference. These are successive frames from a video.

The left one has substantial motion blur and "Depth Pro" produces a very mushy depth map.

The right one is sharper, and the depth map looks more reasonable.



GZFDZBlWsAA-zt1.jpg


7/15
@chrisoffner3d
Two other successive frames from the same video, this time without noticeable visual differences. Nonetheless, the maximum depth varies by a factor of almost four.



GZFEiTfXAAAum6R.jpg


8/15
@chrisoffner3d
From second 10 or so, the model assigns depth 10,000 (infinity) to some pixels it considers to be sky, which is why the depth map turns all red – because all those nearby finite depth values become negligible compared to the maximum value.



https://video.twimg.com/ext_tw_video/1842333210228695040/pu/vid/avc1/1080x1214/B8Q4nKAOYuKr6agy.mp4

9/15
@chrisoffner3d
Looks like primarily areas with large or infinite ground truth depth (e.g. the sky) have very high variance in the maximum depth. I should try some scenes with bounded maximum depth.



https://video.twimg.com/ext_tw_video/1842434205617168385/pu/vid/avc1/1080x1214/TIFyR-8ripcPzEnW.mp4

10/15
@chrisoffner3d
Here I show the log depth to reduce the impact of the maximum depth pixel on the rest of the color map. The shown "Max depth" label is still the raw metric depth value. Odd how the max depth drops/switches from 10,000 to values as low as ~20 in some frames.



https://video.twimg.com/ext_tw_video/1842454596368621568/pu/vid/avc1/1080x1214/zX4cJNPFW5XgCrst.mp4
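For anyone wanting to reproduce this kind of visualisation, a generic sketch (not the author's code) of clamping the colour map and plotting log depth might look like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def show_depth(depth: np.ndarray, clamp_m: float = 800.0):
    """Clamp the colour map so a few huge 'sky' values don't wash out the rest,
    and plot log depth so nearby structure stays visible. depth: (H, W) in metres."""
    clipped = np.clip(depth, 1e-3, clamp_m)
    plt.imshow(np.log(clipped), cmap="turbo")
    plt.title(f"Max depth: {depth.max():.1f} m (colour map clamped at {clamp_m} m)")
    plt.colorbar(label="log depth")
    plt.axis("off")
    plt.show()
```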

11/15
@chrisoffner3d
Here's a scene with (mostly) bounded depth, but the sky peeks through the foliage and causes some strong fluctuations in the max depth estimate. Still, overall it looks more stable than the scenes with lots of sky.



https://video.twimg.com/ext_tw_video/1842473286552141824/pu/vid/avc1/1080x1214/Z8a5IicqkEdgiSL1.mp4

12/15
@chrisoffner3d
Here's the (log) depth of an indoor scene with fully bounded depth. The top-down frames in the first half of the video still have high depth variance. The later frames taken at more conventional angles are pleasantly stable.



https://video.twimg.com/ext_tw_video/1842489701111922688/pu/vid/avc1/1080x1214/dRWU_evea_0J5UTO.mp4

13/15
@chrisoffner3d
Again, when there's a lot of sky in the frame, all bets are off. The max depth oscillates between 10,000 (infinity) and values down to <200 in successive frames.



https://video.twimg.com/ext_tw_video/1842516079739834370/pu/vid/avc1/1080x1214/rsIgsOW-FbWs252h.mp4

14/15
@chrisoffner3d
Finally, here's a particularly challenging low-light video from when I met a cute and curious cow while wandering across a tiny island in Indonesia last month. Given the poor lighting and strong noise, I find Depth Pro's performance quite impressive tbh.



https://video.twimg.com/ext_tw_video/1842569306871123968/pu/vid/avc1/1080x1214/9uT8klzhH-4PPHWb.mp4

15/15
@chrisoffner3d
However, looking at the "swishy" artefacts in the depth maps of the cow video, I'm wondering whether the cow is secretly wearing the One Ring and is about to be captured by Sauron's ringwraiths.






Reflection 70B saga continues as training data provider releases post-mortem report​


Carl Franzen@carlfranzen

October 3, 2024 2:07 PM








Two men stare through cracked glass window


Credit: VentureBeat made with Midjourney



On September 5th, 2024, Matt Shumer, co-founder and CEO of the startup Hyperwrite AI (also known as OthersideAI), took to the social network X to post the bombshell news that he had fine-tuned a version of Meta’s open source Llama 3.1-70B into an even more performant large language model (LLM) known as Reflection 70B — so performant, in fact, based on alleged third-party benchmarking results he published, that it was “the world’s top open-source model,” according to his post.

I'm excited to announce Reflection 70B, the world’s top open-source model.

Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.

405B coming next week – we expect it to be the best model in the world.

Built w/ @GlaiveAI.

Read on ⬇️: pic.twitter.com/kZPW1plJuo

— Matt Shumer (@mattshumer_)
September 5, 2024

However, shortly after its release, third-party evaluators in the AI research and hosting community struggled to reproduce the claimed results, leading to accusations of fraud.

Researchers cited discrepancies between the announced benchmark results and their independent tests, sparking a wave of criticism on social platforms such as Reddit and X.

In response to these concerns, Shumer pledged he would conduct a review of the issues alongside Sahil Chaudhary, founder of Glaive, the AI startup whose synthetic data Shumer said he had trained Reflection 70B on, and in which he later revealed he had invested what he called a small amount.

Now, nearly a month later, Chaudhary last night released a post-mortem report on his Glaive AI blog about the Reflection 70B model and published resources for the open-source AI community to test the model and his training process on their own. He says that while he was unable to reproduce all of the original benchmark scores, he “found a bug in the initial code” that made several of them appear higher than what he has found in recent tests of Reflection 70B. However, other benchmark results now come out higher than before, adding to the mystery.

On September 5th, @mattshumer_ announced Reflection 70B, a model fine-tuned on top of Llama 3.1 70B, showing SoTA benchmark numbers, which was trained by me on Glaive generated data.

Today, I'm sharing model artifacts to reproduce the initial claims and a post-mortem to address…

— Sahil Chaudhary (@csahil28)
October 2, 2024

As Chaudhary wrote in the post:

There were a lot of mistakes made by us in the way we launched the model, and handled the problems reported by the community. I understand that things like these have a significant negative effect on the open source ecosystem, and I’d like to apologize for that. I hope that this adds some clarity to what happened, and is a step in the direction of regaining the lost trust. I have released all of the assets required to independently verify the benchmarks and use this model.


Sharing model artifacts​


To restore transparency and rebuild trust, Chaudhary shared several resources to help the community replicate the Reflection 70B benchmarks. These include:


  • Model weights: Available on Hugging Face, providing the pre-trained version of Reflection 70B.

  • Training data: Released for public access, enabling independent tests on the dataset used to fine-tune the model.

  • Training scripts and evaluation code: Available on GitHub, these scripts allow for reproduction of the model’s training and evaluation process.

These resources aim to clarify how the model was developed and offer a path for the community to validate the original performance claims.
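As a rough sketch of what using the released weights could look like (the model id is taken from the release thread below; in practice a 70B checkpoint needs several large GPUs or quantization), assuming the standard Hugging Face transformers API:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "glaiveai/Reflection-Llama-3.1-70B"   # id from the release thread
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", torch_dtype="auto")

prompt = "What is 7 * 12? Think step by step, then give the final answer."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```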


Reproducing the benchmarks​


In his post-mortem, Chaudhary explained that a major issue with reproducing the initial benchmark results stemmed from a bug in the evaluation code. This bug caused inflated scores in certain tasks, such as MATH and GSM8K, due to an error in how the system handled responses from an external API. The corrected benchmarks show slightly lower, but still strong, performance relative to the initial report.

The updated benchmark results for Reflection 70B are as follows:


  • MMLU: 90.94%

  • GPQA: 55.6%

  • HumanEval: 89.02%

  • MATH: 70.8%

  • GSM8K: 95.22%

  • IFEVAL: 87.63%

Compare that to the originally stated performance of:


  • MMLU: 89.9%

  • GPQA: 55.3%

  • HumanEval: 91%

  • MATH: 79.7%

  • GSM8K: 99.2%

  • IFEVAL: 90.13%

Although the revised scores are not as high as those initially reported, Chaudhary asserts that they are more accurate reflections of the model’s capabilities.

He also addressed concerns about dataset contamination, confirming that tests showed no significant overlap between the training data and benchmark sets.


Reflecting on a hasty release​


Chaudhary admitted that the decision to release Reflection 70B was made hastily, driven by enthusiasm for the model’s performance on reasoning-based tasks.

He noted that the launch lacked sufficient testing, particularly regarding the compatibility of the model files, and that he and Shumer had not verified whether the model could be easily downloaded and run by the community.

“We shouldn’t have launched without testing, and with the tall claims of having the best open-source model,” Chaudhary wrote. He also acknowledged that more transparency was needed, especially regarding the model’s strengths and weaknesses. While Reflection 70B excels at reasoning tasks, it struggles in areas like creativity and general user interaction, a fact that was not communicated at launch.


Clarifying API confusion​


One of the more serious accusations involved the suspicion that the Reflection 70B API was simply relaying outputs from Anthropic’s Claude model.

Users reported strange behavior in the model’s outputs, including responses that seemed to reference Claude directly.

Chaudhary addressed these concerns, explaining that although some of these behaviors were reproducible, he asserts there was no use of Claude APIs or any form of word filtering in the Reflection 70B model.

He reiterated that the API was run on Glaive AI’s compute infrastructure, and Matt Shumer had no access to the code or servers used during this period.


Looking ahead​


In closing, Chaudhary emphasized his commitment to transparency and expressed his hope that this post-mortem and the release of model artifacts will help restore trust in the project. He also confirmed that Matt Shumer is continuing independent efforts to reproduce the benchmark scores.

Despite the setbacks, Chaudhary believes the “reflection tuning” approach — in which a model is given time to check its responses for accuracy before outputting them to a user — has potential and encourages further experimentation by the AI community. “The approach explored has merit, and I look forward to others continuing to explore this technique,” he said.

Shumer, for his part, has posted on X stating: “I am still in the process of validating Reflection myself, as Sahil wrote in his postmortem, but I am encouraged by Sahil’s transparency here on the benchmarks he reported and the API he ran. We still believe in + are working on this approach. Hoping to finish up my repro soon.”


Skepticism among open source AI community remains​


Despite Chaudhary’s claims to offer transparency and an innocent explanation for what happened with Reflection 70B, many in the AI community who were initially excited about the model and its stated performance remain skeptical, feeling as though they were burned by erroneous claims and potentially tricked before.

“Still doesn’t feel like anything adds up here,” wrote Alexander Moini, an AI researcher, on X, adding “It took a month to get the model weights on to HF [Hugging Face]?”

Still doesn’t feel like anything adds up here.

It took a month to get the model weights on to HF?

And you’ve had a private api with the “real” weights the whole time? Not to mention it supposedly having tokenizer issues, that look a lot like tokenizers used by anthropic +…

— Alex (@AlexanderMoini) October 3, 2024

Yuchen Jin, co-founder and CTO of Hyperbolic Labs, a startup that offers cloud-based GPUs and other AI services on demand, who initially worked hard and late to host Reflection 70B before criticizing Shumer over its discrepancies, also voiced skepticism on X toward Chaudhary’s post-mortem report. He pointed out that Chaudhary’s claim on X that he “reproduced all but two of the initially reported scores” doesn’t actually match the data he provided, which shows at least four benchmarks changing scores from before to now.

"i’ve reproduced all but two of the initially reported scores"

> should we compare the first and last columns? There is a gap between the last four benchmarks, could you clarify why you say you've reproduced all but two of the initially reported scores? pic.twitter.com/PHSe6CJD7A

— Yuchen Jin (@Yuchenj_UW)
October 2, 2024

But perhaps the most damning commentary comes from the subreddit r/LocalLLaMA, where one user, “fukkSides,” pointed out that Chaudhary could have spent the intervening month fine-tuning a new model to back up his claim that Reflection 70B itself randomly outputs text indicating it is Anthropic’s Claude 3.5 under the hood. Such outputs had previously been observed by users and led them to conclude that Reflection 70B was a fraudulent wrapper around that proprietary model, served through an API.


Meanwhile, another Redditor, “DangerousBenefit,” looked into the training data Chaudhary released today and found it was filled with many instances of the phrase “as an AI language model,” which suggests it may have been generated primarily by OpenAI’s ChatGPT and likely wasn’t properly cleaned.

Regardless, the more data the Reflection 70B creators publish about the model, the more evidence the open source AI community has to pore over and check their work.
 





1/26
@csahil28
On September 5th, @mattshumer_ announced Reflection 70B, a model fine-tuned on top of Llama 3.1 70B, showing SoTA benchmark numbers, which was trained by me on Glaive generated data.

Today, I'm sharing model artifacts to reproduce the initial claims and a post-mortem to address concerns and take responsibility for mistakes I made.



2/26
@csahil28
I’m releasing model weights, training data, scripts, and eval code to help reproduce benchmark scores.
Postmortem- Update on Reflection-70B
Weights- glaiveai/Reflection-Llama-3.1-70B · Hugging Face
Eval code- GitHub - glaive-ai/simple-evals
Training code- GitHub - glaive-ai/reflection_70b_training

@RickLamers has also put together a repo to reproduce the benchmark scores easily on gpu instances from Rent GPUs | Vast.ai

GitHub - ricklamers/reflection-repro-vastai



3/26
@csahil28
Using this eval harness, i’ve reproduced all but two of the initially reported scores. Scores for MATH and GSM8K differ due to a bug in the initial benchmarking code. I checked for dataset contamination and found no significant overlap with benchmarks.

However, I acknowledge this doesn't definitively prove the model wasn't trained on benchmarks, which is why I’m releasing the dataset and the training script as well, so others can reproduce this.



4/26
@csahil28
I understand the negative impact this has had on the open-source AI community. I’m committed to learning from these mistakes and hope this post-mortem adds clarity to what happened. Moving forward, I will be more careful and thorough in any of our releases and communications. I’m dedicated to rebuilding trust and contributing positively to the open source AI ecosystem.



5/26
@Yuchenj_UW
"i’ve reproduced all but two of the initially reported scores"

> should we compare the first and last columns? There is a gap between the last four benchmarks, could you clarify why you say you've reproduced all but two of the initially reported scores?



GY61FQmbsAA_OcC.jpg


6/26
@FarouqAldori
What about the switching to Claude sonnet on the ”private api” claims?



7/26
@mysticaltech
Thank you both for coming out clean! That was a good lesson for everyone watching to never rush releases, especially in an ecosystem so sharp as the open-source AI community. That said, at least you did something and now you are bringing us good stuff that will move our collective understanding one inch further. This is great! Thank you both for your contributions 🙏



8/26
@bikatr7
You know we're not buying this, right?



You guys were very evidently switching models behind the scenes, you claim to be able to reproduce the Claude behavior, but this does not account for the GPT/Llama ones as well?

The whole tweet and postmortem is a bit vague and misleading, you claim to be able to reproduce all but 2, but the bug you reported would mean that the correct wording should be "I've only reproduced two of the initially reported scores"

Still, none of it makes sense? How do you get a 99% benchmark and not double-check it?

We appreciate the apology, but once again it seems misleading.

[Quoted tweet]
The 'NEW' Reflection 70B is still using one lie to cover up another - we all saw you frantically switching between claude sonnet 3.5 / gpt4o / gpt4o-mini / llama3.1, and there's a timely record within this thread.
glaive.ai/blog/post/reflecti…


GY7UV9OXoAAeOhn.jpg


9/26
@zjasper666
This is a great first step of having more transparency on open-source research. Besides the dataset and the training script, I think showing the script for how to generate the synthetic training data would be helpful too.

This also shows the importance of having an independent trustworthy evaluation service like @ArtificialAnlys



10/26
@ironcarbs
Thanks for sharing this and open sourcing it for the community.

Really curious how you generated the data for the fine-tuning. Anything else you can share regarding that?



11/26
@BoomBrusher
My feed is so wholesome today



12/26
@failspy
This issue remains unaddressed, handwaved away as "people noticed some tokenizer issues":

If the API requested 10 tokens worth of output, it would match Claude's tokenization, and not Llama 3's. That can not just "happen" magically from fine-tuning.

[Quoted tweet]
The reflection model on their API is just prefilled Claude 3. (All of the models are limited to 10 tokens, temperature 0, top_k 1)


GW_MInTWIAAVw_8.jpg


13/26
@MaximeRivest
Outstanding reaction. Well done.



14/26
@BlkyVlky
What about your API "Reflection" model using Claude's tokenizer Sahil? Cannot think of any excuse for that one, right?



15/26
@DaburritoD3551
Huge if true



16/26
@Barry357807
Guess they were listening to you @MatthewBerman



17/26
@xxammll
The fact that Claude 3.5 Sonnet was behind their API wasn't addressed. Additionally, the MixEval scores are much worse than Llama 3 70B and are also inconsistent with the claimed results. This looks like a grift to me.



18/26
@legolasyiu
Great to hear about your analysis.



19/26
@JohanValero94
Ok~ The weights and scripts going to say the true~
I have faith in this~



20/26
@PedroSobra26847
Thanks for providing clarifications ! Keep up



21/26
@crypto_nobody_
If agile, you are blameless, don’t worry about it



22/26
@_MustafaMS
Good luck, great to hear good news from you guys.



23/26
@sio_mrnobody
You would have been better off silently disappearing into the void and never returning.
No one believes you, no one trusts you.
Go away.



24/26
@YobaALU
> we have never added any word filtering or made use of Claude APIs

imagine training a model to not say Claude for a cover up postmortem, wouldn't you feel like a clown



25/26
@mkieffer1107




GY7afSMWsAANbAQ.jpg


26/26
@diegoxfx1
@DotCSV a video of reflections on this topic, please









1/4
@mattshumer_
I am still in the process of validating Reflection myself, as Sahil wrote in his postmortem, but I am encouraged by Sahil’s transparency here on the benchmarks he reported and the API he ran.

We still believe in + are working on this approach. Hoping to finish up my repro soon.

[Quoted tweet]
On September 5th, @mattshumer_ announced Reflection 70B, a model fine-tuned on top of Llama 3.1 70B, showing SoTA benchmark numbers, which was trained by me on Glaive generated data.

Today, I'm sharing model artifacts to reproduce the initial claims and a post-mortem to address concerns and take responsibility for mistakes I made.


2/4
@AlexanderMoini
Still doesn’t feel like anything adds up here.

It took a month to get the model weights on to HF?

And you’ve had a private api with the “real” weights the whole time? Not to mention it supposedly having tokenizer issues, that look a lot like tokenizers used by anthropic + string match to replace Claude?



3/4
@TimTruth123
Also their API could have been saving off everyone's benchmarks for the second round of training.



4/4
@sio_mrnobody
because it doesnt add up; stop giving this idiot the benefit of the doubt.








1/12
@csahil28
On September 5th, @mattshumer_ announced Reflection 70B, a model fine-tuned on top of Llama 3.1 70B, showing SoTA benchmark numbers, which was trained by me on Glaive generated data.

Today, I'm sharing model artifacts to reproduce the initial claims and a post-mortem to address concerns and take responsibility for mistakes I made.



2/12
@Yuchenj_UW
"i’ve reproduced all but two of the initially reported scores"

> should we compare the first and last columns? There is a gap between the last four benchmarks, could you clarify why you say you've reproduced all but two of the initially reported scores?



GY61FQmbsAA_OcC.jpg


3/12
@AmgadGamalHasan
In the last 2 benchmarks, the official llama3 was equally good or even better than their model.



4/12
@Yuchenj_UW
right, and in their original post, the GSM8K and IFEval scores were way better than even llama 3.1 405B



GY7igaOasAAFoRx.jpg


5/12
@isidentical
@Yuchenj_UW are you guys going to independently verify the results?



6/12
@Yuchenj_UW
probably not, as their new benchmark results indicate the model is not strong enough



7/12
@_MustafaMS
" Note: I found a bug in the initial code for benchmarking where the scores for MATH and GSM8K.
The difference on e.g. the MATH benchmark is that we score around 69-70% instead of the reported 79% and for GSM8K it means we score about 94-96% instead of reported 99.2%. "



8/12
@Yuchenj_UW
then the correct wording should be "i've only reproduced two of the initially reported scores"?



9/12
@Oli82817545
could you host the model for a short time so we can try it out ? beside benchmarks i wanna see if this model lines up with what i experienced on glaives private api a couple of weeks ago



10/12
@Yuchenj_UW
we don't have a plan to throw another 4xH100s and get into this because they stop responding.

imo, they should host the model again and let people test



11/12
@paul_cal
Feels ok to me, they reproduced substantial gains over llama 3.1 and reached within a few percent of original claims on all but 2

That framing is fine, slightly favourable but not enough to be called deceptive

(no comment on validity of results tho, haven't looked at repo)



12/12
@seanmrnda
Even though it did not achieve the previously claimed results it's a lot better than the 3.1 instruct model. That means, it's working.






GPT-4o: OpenAI’s shield against $40B deepfake threat to enterprises​


Louis Columbus@LouisColumbus

October 3, 2024 3:14 PM

How GPT-4o Defends Your Identity in the Age of AI-Generated Voices and Deepfakes




Deepfake incidents are surging in 2024, predicted to increase by 60% or more this year, pushing global cases to 150,000 or more. That’s making AI-powered deepfake attacks the fastest-growing type of adversarial AI today. Deloitte predicts deepfake attacks will cause over $40 billion in damages by 2027, with banking and financial services being the primary targets.

AI-generated voice and video fabrications are blurring the lines of believability to hollow out trust in institutions and governments. Deepfake tradecraft is so pervasive in nation-state cyberwarfare organizations that it has matured into a standard attack tactic among nations that engage in constant cyberwar with one another.

“In today’s election, advancements in AI, such as Generative AI or deepfakes, have evolved from mere misinformation into sophisticated tools of deception. AI has made it increasingly challenging to distinguish between genuine and fabricated information,” Srinivas Mukkamala, chief product officer at Ivanti, told VentureBeat.

Sixty-two percent of CEOs and senior business executives think deepfakes will create at least some operating costs and complications for their organization in the next three years, while 5% consider it an existential threat. Gartner predicts that by 2026, attacks using AI-generated deepfakes on face biometrics will mean that 30% of enterprises will no longer consider such identity verification and authentication solutions to be reliable in isolation.

“Recent research conducted by Ivanti reveals that over half of office workers (54%) are unaware that advanced AI can impersonate anyone’s voice. This statistic is concerning, considering these individuals will be participating in the upcoming election,” Mukkamala said.

The U.S. Intelligence Community 2024 threat assessment states that “Russia is using AI to create deepfakes and is developing the capability to fool experts. Individuals in war zones and unstable political environments may serve as some of the highest-value targets for such deepfake malign influence.” Deepfakes have become so common that the Department of Homeland Security has issued a guide, Increasing Threats of Deepfake Identities.


How GPT-4o is designed to detect deepfakes

OpenAI’s latest model, GPT-4o, is designed to identify and stop these growing threats. Its system card, published on Aug. 8, describes it as an “autoregressive omni model, which accepts as input any combination of text, audio, image and video.” OpenAI writes, “We only allow the model to use certain pre-selected voices and use an output classifier to detect if the model deviates from that.”

Identifying potential deepfake multimodal content is one of the benefits of OpenAI’s design decisions that together define GPT-4o. Noteworthy is the amount of red teaming that’s been done on the model, which is among the most extensive of recent-generation AI model releases industry-wide.

All models need to constantly be training on and learning from attack data to keep their edge, and that’s especially the case when it comes to keeping up with attackers’ deepfake tradecraft that is becoming indistinguishable from legitimate content.

The following table explains how GPT-4o features help identify and stop audio and video deepfakes.

figure-1-final-1.jpg
Source: VentureBeat analysis


Key GPT-4o capabilities for detecting and stopping deepfakes


Key features of the model that strengthen its ability to identify deepfakes include the following:

Generative Adversarial Networks (GANs) detection. GPT-4o can identify synthetic content created with the same GAN technology that attackers use to produce deepfakes. OpenAI’s model can identify previously imperceptible discrepancies in the content generation process that even GANs can’t fully replicate. An example is how GPT-4o analyzes flaws in how light interacts with objects in video footage or inconsistencies in voice pitch over time. 4o’s GAN detection highlights these minute flaws that are undetectable to the human eye or ear.

GANs most often consist of two neural networks. The first is a generator that produces synthetic data (images, videos or audio) and a discriminator that evaluates its realism. The generator’s goal is to improve the content’s quality to deceive the discriminator. This advanced technique creates deepfakes nearly indistinguishable from real content.

gans.jpg
Source: CEPS Task Force Report, Artificial Intelligence, and Cybersecurity. Technology, Governance and Policy Challenges, Centre for European Policy Studies (CEPS). Brussels. May 2021
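For readers unfamiliar with the setup, a toy generator/discriminator training step looks like the following (a generic 1-D PyTorch sketch to illustrate the two-network game, unrelated to any specific deepfake or detection system):

```python
import torch
import torch.nn as nn

latent_dim, data_dim = 16, 32
G = nn.Sequential(nn.Linear(latent_dim, 64), nn.ReLU(), nn.Linear(64, data_dim))        # generator
D = nn.Sequential(nn.Linear(data_dim, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())   # discriminator

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

real = torch.randn(8, data_dim)                 # stand-in for a batch of real samples
fake = G(torch.randn(8, latent_dim))            # generator's synthetic batch

# Discriminator step: learn to score real as 1 and fake as 0.
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: learn to make the discriminator score fakes as real.
g_loss = bce(D(fake), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```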

Voice authentication and output classifiers. One of the most valuable features of GPT-4o’s architecture is its voice authentication filter. The filter cross-references each generated voice with a database of pre-approved, legitimate voices. What’s fascinating about this capability is how the model uses neural voice fingerprints to track over 200 unique characteristics, including pitch, cadence and accent. GPT-4o’s output classifier immediately shuts down the process if any unauthorized or unrecognized voice pattern is detected.
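Conceptually, that kind of check reduces to comparing an embedding of the generated audio against a small allow-list of approved voice embeddings and refusing anything that doesn't match; the sketch below is a generic illustration with placeholder embeddings and threshold, not OpenAI's actual pipeline.

```python
import numpy as np

def is_approved_voice(candidate, approved, threshold=0.85):
    """Return True if the candidate voice embedding matches any approved voice."""
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return any(cosine(candidate, ref) >= threshold for ref in approved.values())

# Placeholder 256-d "voice fingerprints"; a real system would use a trained speaker encoder.
approved_voices = {"voice_a": np.random.rand(256), "voice_b": np.random.rand(256)}
generated = np.random.rand(256)   # embedding of the model's generated audio

if not is_approved_voice(generated, approved_voices):
    raise RuntimeError("Unrecognized voice pattern: output blocked")
```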

Multimodal cross-validation. OpenAI’s system card comprehensively defines this capability within the GPT-4o architecture. 4o operates across text, audio, and video inputs in real time, cross-validating multimodal data as legitimate or not. If the audio doesn’t match the expected text or video context, the GPT-4o system flags it. Red teamers found this is especially crucial for detecting AI-generated lip-syncing or video impersonation attempts.


Deepfake attacks on CEOs are growing​


Of the thousands of CEO deepfake attempts this year alone, the one targeting the CEO of the world’s biggest ad firm shows how sophisticated attackers are becoming.

Another happened over Zoom with multiple deepfake identities on the call, including the company’s CFO: a finance worker at a multinational firm was allegedly tricked into authorizing a $25 million transfer by deepfakes of their CFO and senior staff.

In a recent Tech News Briefing with the Wall Street Journal, CrowdStrike CEO George Kurtz explained how improvements in AI are helping cybersecurity professionals defend systems while also commenting on how attackers are using it. Kurtz spoke with WSJ reporter Dustin Volz about AI, the 2024 U.S. election and threats posed by China and Russia.

“And if now in 2024 with the ability to create deepfakes, and some of our internal guys have made some funny spoof videos with me and it just to show me how scary it is, you could not tell that it was not me in the video,” Kurtz told the WSJ. “So I think that’s one of the areas that I really get concerned about. There’s always concern about infrastructure and those sort of things. Those areas, a lot of it is still paper voting and the like. Some of it isn’t, but how you create the false narrative to get people to do things that a nation-state wants them to do, that’s the area that really concerns me.”


The critical role of trust and security in the AI era​


OpenAI’s design goals and an architectural framework that put deepfake detection of audio, video and multimodal content at the forefront reflect the future of gen AI models.

“The emergence of AI over the past year has brought the importance of trust in the digital world to the forefront,” says Christophe Van de Weyer, CEO of Telesign. “As AI continues to advance and become more accessible, it is crucial that we prioritize trust and security to protect the integrity of personal and institutional data. At Telesign, we are committed to leveraging AI and ML technologies to combat digital fraud, ensuring a more secure and trustworthy digital environment for all.”

VentureBeat expects to see OpenAI expand on GPT-4o’s multimodal capabilities, including voice authentication and deepfake detection through GANs to identify and eliminate deepfake content. As businesses and governments increasingly rely on AI to enhance their operations, models like GPT-4o become indispensable in securing their systems and safeguarding digital interactions.

Mukkamala emphasized to VentureBeat that “When all is said and done, though, skepticism is the best defense against deepfakes. It is essential to avoid taking information at face value and critically evaluate its authenticity.”
 



1/2
@chrisoffner3d
Here is the output of DepthCrafter (Hu, Gao, Li et al., 2024). Very smooth! ⭐ :smile:

However, this model conditions on a full video, i.e. all frames need to be known ahead of time – making it unsuitable for online use in robotics. It's also very computationally expensive.

Unfortunately the code base is hardcoded for CUDA, making it impossible for me to run it on MPS on my M3 Max with 128 GB shared memory without major code modifications.

On the RTX 3080 Ti with 12GB memory I have access to, it ran out of memory in its full-resolution mode. Running it at reduced resolution of 512 worked but took quite a while as well.

[Quoted tweet]
Again, when there's a lot of sky in the frame, all bets are off. The max depth oscillates between 10,000 (infinity) and values down to <200 in successive frames.


https://video.twimg.com/ext_tw_video/1843007529611038720/pu/vid/avc1/512x256/giD1Jml0QbMJYVkh.mp4
https://video.twimg.com/ext_tw_video/1842516079739834370/pu/vid/avc1/1080x1214/rsIgsOW-FbWs252h.mp4

2/2
@chrisoffner3d
I once again ask you to make your PyTorch code general enough so it can easily be run on MPS. This way you allow people without access to A100 or H100 GPUs to use your memory-intensive models.

[Quoted tweet]
Every student with only a MacBook to work on will love you for adding an MPS check to your PyTorch device assignment. 🙃


GHlv38SWgAESThe.jpg
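The requested change is tiny; a device-selection helper with an MPS check looks like this:

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, fall back to Apple's MPS backend, then CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():   # the MPS check the tweet asks for
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
model = torch.nn.Linear(8, 1).to(device)    # example module moved to whichever device is available
```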

















1/11
@chrisoffner3d
Motion blur seems to make a huge difference. These are successive frames from a video.

The left one has substantial motion blur and "Depth Pro" produces a very mushy depth map.

The right one is sharper, and the depth map looks more reasonable.



GZFDZBlWsAA-zt1.jpg


2/11
@chrisoffner3d
Two other successive frames from the same video, this time without noticeable visual differences. Nonetheless, the maximum depth varies by a factor of almost four.



GZFEiTfXAAAum6R.jpg


3/11
@chrisoffner3d
From second 10 or so, the model assigns depth 10,000 (infinity) to some pixels it considers to be sky, which is why the depth map turns all red – because all those nearby finite depth values become negligible compared to the maximum value.



4/11
@chrisoffner3d
Looks like primarily areas with large or infinite ground truth depth (e.g. the sky) have very high variance in the maximum depth. I should try some scenes with bounded maximum depth.



5/11
@chrisoffner3d
Here I show the log depth to reduce the impact of the maximum depth pixel on the rest of the color map. The shown "Max depth" label is still the raw metric depth value. Odd how the max depth drops/switches from 10,000 to values as low as ~20 in some frames.



6/11
@chrisoffner3d
Here's a scene with (mostly) bounded depth, but the sky peeks through the foliage and causes some strong fluctuations in the max depth estimate. Still, overall it looks more stable than the scenes with lots of sky.



7/11
@chrisoffner3d
Here's the (log) depth of an indoor scene with fully bounded depth. The top-down frames in the first half of the video still have high depth variance. The later frames taken at more conventional angles are pleasantly stable.



8/11
@chrisoffner3d
Again, when there's a lot of sky in the frame, all bets are off. The max depth oscillates between 10,000 (infinity) and values down to <200 in successive frames.



9/11
@chrisoffner3d
Finally, here's a particularly challenging low-light video from when I met a cute and curious cow while wandering across a tiny island in Indonesia last month. Given the poor lighting and strong noise, I find Depth Pro's performance quite impressive tbh.



10/11
@chrisoffner3d
However, looking at the "swishy" artefacts in the depth maps of the cow video, I'm wondering whether the cow is secretly wearing the One Ring and is about to be captured by Sauron's ringwraiths.



11/11
@chrisoffner3d
DepthCrafter gives very impressive results even on my low-light and noisy "cows" video.






1/11
@jam3scampbell
according to @dylan522p, Microsoft/OpenAI have cracked multi-datacenter distributed training



2/11
@dylan522p
Multi-Datacenter Training: OpenAI's Ambitious Plan To Beat Google's Infrastructure



3/11
@jam3scampbell
🙌



4/11
@AndrewCurran_
This was the highlight for me.



5/11
@vincentweisser
We scaled distributed training to 1B parameters with minimal efficiency loss and published our results https://arxiv.org/pdf/2407.07852 along with the code: GitHub - PrimeIntellect-ai/OpenDiloco: OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training.

We are also kicking off multi-datacenter distributed training for open-source models of 7B+ parameters
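For intuition, the DiLoCo-style recipe OpenDiLoCo builds on can be simulated in a few lines: each worker takes many local optimizer steps, and only the averaged parameter deltas are periodically applied by an outer optimizer, so cross-datacenter communication is rare. The sketch below is a single-process toy with a placeholder model and objective, not the actual OpenDiLoCo code.

```python
import copy
import torch
import torch.nn as nn

def local_steps(model: nn.Module, steps: int = 50) -> None:
    """Each 'datacenter' runs many ordinary optimizer steps on its own data."""
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    for _ in range(steps):
        x = torch.randn(32, 16)                 # placeholder batch
        loss = model(x).pow(2).mean()           # placeholder objective
        opt.zero_grad(); loss.backward(); opt.step()

global_model = nn.Linear(16, 1)
# Outer optimizer applies the averaged parameter deltas (hyperparameters are illustrative).
outer_opt = torch.optim.SGD(global_model.parameters(), lr=0.7, momentum=0.9, nesterov=True)

for outer_round in range(10):
    workers = [copy.deepcopy(global_model) for _ in range(4)]   # one replica per site
    for w in workers:
        local_steps(w)                          # no communication during these steps
    # Communicate once per round: the "pseudo-gradient" is the mean of (global - local).
    outer_opt.zero_grad()
    for p_global, *p_locals in zip(global_model.parameters(),
                                   *[w.parameters() for w in workers]):
        delta = torch.stack([p_global.data - p_l.data for p_l in p_locals]).mean(dim=0)
        p_global.grad = delta
    outer_opt.step()
```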



6/11
@stocknear
Source: Trust me bro 🤡



7/11
@akbirthko
doesn't google already do this?



8/11
@milosgajdos
The whole podcast is pure gold. The best thing I've listened to/watched in months. Chapeau @dwarkesh_sp 🫡



9/11
@mark_k
Huge if big.



10/11
@BjornHansenMMA
Totally worth listening to the whole pod.

They speak so fast about complex topics that every 15 minutes feels like an hour's worth of a normal podcast.



11/11
@gcolbourn
This is bad news (but what's the efficiency loss?)









1/11
@GeminiApp
Image generation with Imagen 3 is now available to all Gemini users around the world.

Imagen 3 is our highest quality image generation model yet and brings an even higher degree of photorealism, better instruction following, and fewer distracting artifacts than ever before.



2/11
@GeminiApp
Results for illustrative purposes and may vary. Internet connection and subscription for certain features required. Language and country availability varies.



3/11
@devlcq
Only in the USA*



4/11
@NicolasGargala2
Amazing but in Australia for $32 AUD a month pretty steep for us normal people



5/11
@WadeWilson_GHF
We're waiting, we're waiting, we're waiting in France



6/11
@InfusingFit
Wow, the prompt adherence is amazing. Usually image generators perform poorly on this



7/11
@EverydayAI_
We covered Imagen3 in-depth a few weeks back. It's actually really frickin good....

https://invidious.poast.org/watch?v=ETMpUqnTwxw&amp;t=61s



8/11
@harishkgarg
not bad



9/11
@KewkD
Imagen 3 is easily the best on the market for most things. I just want to know when we'll get to create images outside of 1:1.



10/11
@MansaKirito
Mmmh nice...



11/11
@koltregaskes
Gosh, thank you.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
GZd3qdQXIAAUcgq.jpg

GZdzNZDWoAA9Dv6.png

GZeIzE9W8AA1Wvu.jpg



1/1
@testingcatalog
In case you didn't have Imagen 3 before on Gemini - now is the time 🔥

A broader, worldwide rollout is happening, but "Language and country availability varies" still applies.

Are you able to generate images of people as well?

[Quoted tweet]
Image generation with Imagen 3 is now available to all Gemini users around the world.

Imagen 3 is our highest quality image generation model yet and brings an even higher degree of photorealism, better instruction following, and fewer distracting artifacts than ever before.


GZenGxgXIAAB_64.jpg


https://video.twimg.com/ext_tw_video/1844060730745827328/pu/vid/avc1/720x720/Ey5QkhYXj4HO9Fcr.mp4


To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 












1/11
@LunjunZhang
What if your reward model could “think” more and perform better? Even better, what if your LLM policy could also be used as a reward model?

Introducing GenRM, reward models trained as next token predictors, rather than old-fashioned classifier RMs. This enables things that weren’t possible:

🔗 Chain-of-Thought reasoning for RM
🚀 Leveraging test-time compute
🌐 Single policy + reward model

[1/N]



GWLzR9MX0AAQHi-.jpg


2/11
@LunjunZhang
LLM-generated solutions often sound convincing even when they are wrong.

For example, the solution below is incorrect because it ignores the word ‘each’ in the problem.

While standard RMs get fooled easily, GenRM can detect the error by explicitly reasoning (with Chain-of-Thought) about solution correctness.

[2/N]



GWLzo55XsAAciVF.jpg


3/11
@LunjunZhang
On algorithmic and math reasoning tasks, GenRM outperforms classical RM and prompted LLM verifiers (LLM-as-a-Judge), in terms of Best-of-N performance, which uses the reward model to select the best solution among N candidates from a fixed LLM.

On GSM8K, when using a Gemma-9B GenRM to verify the outputs of Gemini 1.0 Pro, we observe a 20% improvement with Best-of-N (73% → 92.8%), beating direct generation from GPT-4 and Gemini 1.5 Pro.
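Best-of-N is simple to express in code. A minimal sketch, where generator and reward_model are hypothetical callables standing in for the fixed LLM and the trained verifier:

```python
# Sample N candidate solutions from a fixed generator, score each with the
# verifier, and return the highest-scoring one.
def best_of_n(question, generator, reward_model, n=32):
    candidates = [generator(question) for _ in range(n)]          # N samples from a fixed LLM
    scores = [reward_model(question, c) for c in candidates]      # verifier score per solution
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]
```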



GWLz_I_XoAAGWmL.jpg


4/11
@LunjunZhang
So how does GenRM work? Previously, reward models (RMs) and verifiers were trained as binary classifiers. They do not utilize the text generation capabilities of LLMs.

Now, given a question and an answer, GenRM simply finetunes an LLM to answer "Is the answer correct (Yes/No)?" with a " Yes" or " No" token.
Surprisingly, this itself can do better than standard RMs.
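A minimal sketch of that scoring rule, assuming a hypothetical yes_no_logits helper that returns the model's logits for the " Yes" and " No" tokens:

```python
# The GenRM reward is the probability the finetuned LLM assigns to " Yes"
# after the verification question, renormalized over the two answer tokens.
import math

def genrm_score(question, solution, yes_no_logits):
    prompt = f"{question}\n{solution}\nIs the answer correct (Yes/No)?"
    yes_logit, no_logit = yes_no_logits(prompt)     # hypothetical helper
    return math.exp(yes_logit) / (math.exp(yes_logit) + math.exp(no_logit))
```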



GWL0MFzWsAAvICo.jpg


5/11
@LunjunZhang
We can *train* the verifier to use chain-of-thought (CoT) reasoning before giving a final "Yes" / "No" answer.

GenRM-CoT: ‘Let’s verify step by step’, literally.

This can enable the generative verifier to catch subtle mistakes that the discriminative RM misses.



GWL0fiyXwAAu-We.jpg


6/11
@LunjunZhang
Using Chain-of-Thought in the reward model unlocks the possibility of leveraging additional inference-time compute to improve verification.

GenRM-CoT can utilize majority voting by sampling multiple verification CoTs and computing the average correctness scores, to turn more test-time compute into more problems solved.

Fine-tuned GenRM models even outperform the LLM (Gemini 1.0 Pro) we used for generating verification CoTs on training problems!
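A sketch of that test-time scaling recipe, with sample_cot and yes_probability as hypothetical wrappers around the verifier:

```python
# Sample K verification rationales and average the "Yes" probabilities;
# larger K spends more test-time compute for a more reliable score.
def genrm_cot_score(question, solution, sample_cot, yes_probability, k=32):
    scores = []
    for _ in range(k):
        rationale = sample_cot(question, solution)            # "Let's verify step by step..."
        scores.append(yes_probability(question, solution, rationale))
    return sum(scores) / len(scores)                          # average correctness score
```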



GWL0tmyXoAAI-LJ.jpg


7/11
@LunjunZhang
Now that we are doing next-token prediction, we can unify generation and verification, by simply adding the generation and verification tasks to the same data mixture. As expected, generation and verification are synergistic.

Teaching the verifier to imitate correct solutions improves verification.

Teaching the generator to verify solutions also improves generation performance itself.



GWL0-arWsAAXRvo.jpg


8/11
@LunjunZhang
How does GenRM scale with model size?

On GSM8K, we scale model capacity using Gemma 2B, 7B, 9B, and observe positive scaling trends both for GenRM (direct) and GenRM-CoT.



GWL1JvEWEAAzGdn.jpg


9/11
@LunjunZhang
How do we generate synthetic verification CoT data given only solution correctness labels? This is important, as collecting such data from humans is expensive and eventually infeasible as LLMs surpass human reasoning.

To get synthetic data to work, we provided a reference solution in addition to the problem and solution to verify. This improves the rationale data quality to be high enough such that GenRM-CoT can work well.
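A sketch of that data-generation recipe (prompt wording and field names are assumptions, not the paper's exact templates): the reference solution is shown to the rationale-generating LLM, but left out of the training input so the verifier learns under test-time conditions.

```python
# Build one synthetic GenRM-CoT training example from a correctness label.
def make_rationale_example(problem, candidate_solution, reference_solution, is_correct, llm):
    generation_prompt = (
        f"Problem: {problem}\n"
        f"Reference solution: {reference_solution}\n"
        f"Candidate solution: {candidate_solution}\n"
        "Explain step by step whether the candidate solution is correct."
    )
    rationale = llm(generation_prompt)                    # hypothetical LLM call
    label = " Yes" if is_correct else " No"
    # Training input omits the reference solution, matching what the verifier sees at test time.
    train_input = (f"Problem: {problem}\nCandidate solution: {candidate_solution}\n"
                   "Let's verify step by step.")
    return {"input": train_input,
            "target": f"{rationale}\nIs the answer correct (Yes/No)?{label}"}
```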



GWL1ccxXsAAjKAD.jpg


10/11
@LunjunZhang
Training on synthetic rationales also leads to interesting emergent behaviors.

Even if GenRM-CoT does not catch the correct step-level mistake, it may still consider the answer problematic, and then attempt the problem from a different angle, before giving the final Yes/No verification answer.



GWL1ohRWIAAQjmw.png


11/11
@LunjunZhang
We hope that Generative RMs can pave the way for better reward models and self-improving reasoners that can verify their own outputs.

[2408.15240] Generative Verifiers: Reward Modeling as Next-Token Prediction

Fun collaboration with @arianTBD , @hbXNov , @kazemi_sm , @aviral_kumar2 , @agarwl_ at Google DeepMind.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 



1/2
@_philschmid
Big Distilabel release! Distilabel is an open source framework for creating synthetic datasets and generating AI feedback, designed to provide fast, reliable, and scalable pipelines based on verified research papers for engineers! 👀

And just got its 1.4 release with:
🧩 New Steps for better dataset sampling, deduplication (embeddings and MinHash), truncation of inputs, and better ways to combine outputs
💰 50% Cost Savings by pausing pipelines and using OpenAI Batch API
⚡️ Caching for step outputs for maximum reusability—even if the pipeline changes.
📝 Steps can now generate and save artifacts, automatically uploaded to the Hugging Face Hub.
🆕 New Tasks with CLAIR, APIGen, URIAL, TextClassification, TextClustering, and an updated TextGeneration task.



https://video.twimg.com/ext_tw_video/1844053895963639814/pu/vid/avc1/1280x720/YMKBqh95eHOKxELK.mp4

2/2
@_philschmid
Full Release: Release 1.4.0 · argilla-io/distilabel

Docs: Distilabel

Release 1.4.0 · argilla-io/distilabel




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 



1/2
@_philschmid
What is a Mixture of Experts (MoE), and why are they successful? @MaartenGr just published a new visual guide on the Mixture of Experts (MoE) to explain the two main components of MoE: Experts and the Router. 👀

TL;DR:
🧠 MoE consists of multiple "expert" neural networks and a router that directs inputs to the most suitable experts
🔄 Experts aren't domain specialists, but rather learn to handle specific tokens in specific contexts
⚖️ Load balancing is crucial to ensure all experts are utilized effectively during training
🚂 The router uses probability distributions to select which experts process which tokens (a toy sketch follows this list)
📊 MoE allows models to have more parameters overall while using fewer during actual inference
🖼️ MoE isn't limited to language models - it's also being adapted for vision models
🔢 Mixtral 8x7B demonstrates the power of MoE, loading 46.7B parameters but only using 12.8B during inference
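
A toy sketch of the routing described in the points above (random weights and small shapes for illustration; this shows the generic top-k MoE idea, not Mixtral's actual code):

```python
# Router produces a per-token probability distribution over experts; each token
# is processed only by its top-k experts and their outputs are gate-weighted.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def moe_layer(tokens, expert_weights, router_weights, top_k=2):
    """tokens: (n, d); expert_weights: list of (d, d) matrices; router_weights: (d, E)."""
    router_probs = softmax(tokens @ router_weights)            # (n, E) distribution per token
    outputs = np.zeros_like(tokens)
    for i, tok in enumerate(tokens):
        top_experts = np.argsort(router_probs[i])[-top_k:]     # only k experts run per token
        gate = router_probs[i, top_experts]
        gate = gate / gate.sum()                               # renormalize over selected experts
        for g, e in zip(gate, top_experts):
            outputs[i] += g * (tok @ expert_weights[e])        # weighted mix of expert outputs
    return outputs

rng = np.random.default_rng(0)
d, n_experts = 8, 4
tokens = rng.normal(size=(3, d))
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))
print(moe_layer(tokens, experts, router).shape)   # (3, 8)
```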

Big Kudos to @MaartenGr! Recommend taking a look!



GZTQP6XXIAEoOGm.jpg


2/2
@_philschmid
A Visual Guide to Mixture of Experts (MoE)




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 



1/10
@_philschmid
Can we build more capable AI agents by learning from cognitive science? Cognitive Architectures for Language Agents (CoALA) introduces a structured approach to design AI Agents by integrating cognitive architecture principles with modern LLMs.

CoALA describes three key components: 1) modular memory, 2) a structured action space for internal/external interactions, and 3) a generalized decision-making process to select actions.

CoALA Implementation (a minimal agent sketch follows these steps):
1️⃣ Define memory components:
- Working Memory: for temporary information during tasks.
- Long-Term Memory: to store knowledge and experiences.
- Procedural Memory: for skills and action sequences.

2️⃣ Define a Structured Action Space (internal and external actions).
- Internal Actions: Reasoning steps, memory updates.
- External Actions: Interacting with tools, APIs, or environments.

3️⃣ Implement a decision-making process (propose, evaluate, select).
- Propose: Generate possible actions.
- Evaluate: Assess actions based on goals and context.
- Select: Choose the optimal action to execute.

4️⃣ Add safety mechanisms and monitoring systems

5️⃣ Test and iterate to refine the agent's components and behavior.
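
A minimal agent skeleton following the five steps above; all class and method names are illustrative assumptions, not an official CoALA implementation:

```python
# Toy CoALA-style agent: modular memory, an internal/external action space,
# and a propose -> evaluate -> select decision loop.
from dataclasses import dataclass, field

@dataclass
class Memory:
    working: dict = field(default_factory=dict)      # temporary task state
    long_term: list = field(default_factory=list)    # stored knowledge/experiences
    procedural: dict = field(default_factory=dict)   # named skills / action sequences

class CoalaAgent:
    def __init__(self, llm, tools):
        self.memory = Memory()
        self.llm = llm            # hypothetical callable: prompt -> text
        self.tools = tools        # external action space: name -> zero-arg callable

    def propose(self, goal):
        """Internal action: ask the LLM for candidate actions."""
        prompt = f"Goal: {goal}\nWorking memory: {self.memory.working}\nList candidate actions."
        return self.llm(prompt).splitlines()

    def evaluate(self, goal, actions):
        """Internal action: score candidates (assumes the LLM returns a bare number)."""
        return {a: float(self.llm(f"Rate 0-1 how well '{a}' serves '{goal}'.")) for a in actions}

    def step(self, goal):
        actions = self.propose(goal)
        scores = self.evaluate(goal, actions)
        best = max(scores, key=scores.get)            # select
        if best in self.tools:                        # external action: call a tool
            result = self.tools[best]()
        else:                                         # internal action only
            result = best
        self.memory.working["last_result"] = result   # update working memory
        self.memory.long_term.append((goal, best, result))
        return result
```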



GZWOdlxWAAAkskT.jpg


2/10
@_philschmid
Paper: Paper page - Cognitive Architectures for Language Agents

This paper provides a theoretical framework for integrating cognitive principles with large language models to create more capable AI agents; note that it ships no code or examples.



3/10
@adridder
Co-intelligence unlocks amplified reasoning. Cognitive insights enrich AI capabilities. Exciting frontier



4/10
@viksit
thinking of this as a scalable engineering project is going to be a very complex undertaking in this form.

episodic memory is going to exponentially balloon. pruning strategies will be needed. a centralized semantic memory seems over generalized?



5/10
@KingMumboJumbo
Very cool - made a NotebookLM podcast on it.



6/10
@CohorteAI
CoALA is a transformative approach that bridges cognitive science with modern LLMs to create more capable AI agents. By implementing modular memory systems, structured action spaces, and a robust decision-making process, CoALA enhances an agent's ability to manage information, interact with tools, and make informed decisions. This integration not only improves performance on complex tasks but also incorporates essential safety mechanisms. Leveraging cognitive architecture principles, CoALA sets the foundation for AI agents that are more intelligent, adaptable, and aligned with human-like reasoning. Exciting developments ahead for intelligent multi-agent systems!



7/10
@yaelmendez
Love it. Neural networks are only beginning to take shape. @neuromitosis



8/10
@SaikiK66287209
Interesting approach



9/10
@andysingal
Exciting times ahead for all of us 😊



10/10
@AppyPieInc
Great insights! Integrating cognitive science with AI design like CoALA is a game changer for creating smarter agents. Excited to see how this evolves!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 



[Submitted on 1 Jul 2024 (v1), last revised 4 Oct 2024 (this version, v2)]

Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning​

Akshara Prabhakar, Thomas L. Griffiths, R. Thomas McCoy
Chain-of-Thought (CoT) prompting has been shown to enhance the multi-step reasoning capabilities of Large Language Models (LLMs). However, debates persist about whether LLMs exhibit abstract generalization or rely on shallow heuristics when given CoT prompts. To understand the factors influencing CoT reasoning, we provide a detailed case study of the symbolic reasoning task of decoding shift ciphers, where letters are shifted forward some number of steps in the alphabet. We analyze the pattern of results produced by three LLMs -- GPT-4, Claude 3, and Llama 3.1 -- performing this task using CoT prompting. By focusing on a single relatively simple task, we are able to identify three factors that systematically affect CoT performance: the probability of the task's expected output (probability), what the model has implicitly learned during pre-training (memorization), and the number of intermediate operations involved in reasoning (noisy reasoning). We show that these factors can drastically influence task accuracy across all three LLMs; e.g., when tested with GPT-4, varying the output's probability of occurrence shifts accuracy from 26% to 70%. Overall, we conclude that CoT prompting performance reflects both memorization and a probabilistic version of genuine reasoning. Code and data at this https URL
Comments: EMNLP 2024 Findings; 9 pages plus references and appendices
Subjects: Computation and Language (cs.CL); Artificial Intelligence (cs.AI)
Cite as: arXiv:2407.01687 [cs.CL]
(or arXiv:2407.01687v2 [cs.CL] for this version)
[2407.01687] Deciphering the Factors Influencing the Efficacy of Chain-of-Thought: Probability, Memorization, and Noisy Reasoning


Submission history​


From: Akshara Prabhakar [view email]

[v1] Mon, 1 Jul 2024 18:01:07 UTC (2,503 KB)

[v2] Fri, 4 Oct 2024 01:01:39 UTC (3,109 KB)
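
To make the shift-cipher task concrete, here is a tiny reference decoder; larger shifts require more intermediate operations, which is the paper's "noisy reasoning" factor (the rot-13 comment below reflects the paper's memorization finding as I understand it):

```python
# Decode a shift cipher: each letter was shifted forward by `shift`, so we
# shift it back. Assumes lowercase input for simplicity.
import string

def decode_shift(ciphertext, shift):
    alpha = string.ascii_lowercase
    out = []
    for ch in ciphertext:
        if ch in alpha:
            out.append(alpha[(alpha.index(ch) - shift) % 26])
        else:
            out.append(ch)
    return "".join(out)

print(decode_shift("dbu", 1))    # -> "cat": a high-probability English output
print(decode_shift("png", 13))   # -> "cat": rot-13 is common online, which the paper ties to memorization
```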

 