The A.I Megathread (LLM , GPT , Development)

bnew · Oct 6, 2024

Reflection 70B saga continues as training data provider releases post-mortem report

The more data the Reflection 70B creators publish about the model, the more evidence the open source AI community has to pore over.

venturebeat.com

Carl Franzen@carlfranzen

October 3, 2024 2:07 PM

On September 5th, 2024, Matt Shumer, co-founder and CEO of the startup Hyperwrite AI (also known as OthersideAI) took to the social network X to post the bombshell news that he had fine-tuned a version of Meta’s open source Llama 3.1-70B into an even more performant large language model (LLM) known as Reflection 70B — so performant, in fact, based on alleged third-party benchmarking test results he published, that it was “the world’s top open-source model,” according to his post.

I'm excited to announce Reflection 70B, the world’s top open-source model.

Trained using Reflection-Tuning, a technique developed to enable LLMs to fix their own mistakes.

405B coming next week – we expect it to be the best model in the world.

Built w/ @GlaiveAI.

Read on : pic.twitter.com/kZPW1plJuo

— Matt Shumer (@mattshumer_) September 5, 2024

However, shortly after its release, third-party evaluators in the AI research and hosting community struggled to reproduce the claimed results, leading to accusations of fraud.

Researchers cited discrepancies between the announced benchmark results and their independent tests, sparking a wave of criticism on social platforms such as Reddit and X.

In response to these concerns, Shumer pledged to review the issues alongside Sahil Chaudhary, founder of Glaive, the AI startup whose synthetic data Shumer said he had used to train Reflection 70B, and in which he later revealed he had invested what he called a small amount.

Now, nearly a month later, Chaudhary last night released a post-mortem report on his Glaive AI blog about the Reflection 70B model and published resources for the open-source AI community to test the model and his training process on their own. He says while he was unable to reproduce all of the same benchmarks, he “found a bug in the initial code,” resulting in several results appearing higher than what he has found on recent tests of Reflection 70B. However, other benchmark results appear higher than before — adding to the mystery.

On September 5th, @mattshumer_ announced Reflection 70B, a model fine-tuned on top of Llama 3.1 70B, showing SoTA benchmark numbers, which was trained by me on Glaive generated data.

Today, I'm sharing model artifacts to reproduce the initial claims and a post-mortem to address…

— Sahil Chaudhary (@csahil28) October 2, 2024

As Chaudhary wrote in the post:

“There were a lot of mistakes made by us in the way we launched the model, and handled the problems reported by the community. I understand that things like these have a significant negative effect on the open source ecosystem, and I’d like to apologize for that. I hope that this adds some clarity to what happened, and is a step in the direction of regaining the lost trust. I have released all of the assets required to independently verify the benchmarks and use this model.”

Sharing model artifacts

To restore transparency and rebuild trust, Chaudhary shared several resources to help the community replicate the Reflection 70B benchmarks. These include:

Model weights: Available on Hugging Face, providing the pre-trained version of Reflection 70B.
Training data: Released for public access, enabling independent tests on the dataset used to fine-tune the model.
Training scripts and evaluation code: Available on GitHub, these scripts allow for reproduction of the model’s training and evaluation process.

These resources aim to clarify how the model was developed and offer a path for the community to validate the original performance claims.

Reproducing the benchmarks

In his post-mortem, Chaudhary explained that a major issue with reproducing the initial benchmark results stemmed from a bug in the evaluation code. This bug caused inflated scores in certain tasks, such as MATH and GSM8K, due to an error in how the system handled responses from an external API. The corrected benchmarks show slightly lower, but still strong, performance relative to the initial report.

The updated benchmark results for Reflection 70B are as follows:

MMLU: 90.94%
GPQA: 55.6%
HumanEval: 89.02%
MATH: 70.8%
GSM8K: 95.22%
IFEVAL: 87.63%

Compare that to the originally stated performance of:

MMLU: 89.9%
GPQA: 55.3%
HumanEval: 91%
MATH: 79.7%
GSM8K: 99.2%
IFEVAL: 90.13%

Although most of the revised scores are lower than those initially reported (MMLU and GPQA came out slightly higher), Chaudhary asserts that they are more accurate reflections of the model’s capabilities.
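To make the comparison concrete, here is a small sketch that computes the per-benchmark change between the two lists above. The numbers are copied from the article; the function and variable names are just illustrative.

```python
# Scores (in percent) copied from the two lists above.
original = {"MMLU": 89.9, "GPQA": 55.3, "HumanEval": 91.0,
            "MATH": 79.7, "GSM8K": 99.2, "IFEVAL": 90.13}
updated = {"MMLU": 90.94, "GPQA": 55.6, "HumanEval": 89.02,
           "MATH": 70.8, "GSM8K": 95.22, "IFEVAL": 87.63}


def score_deltas(before, after):
    """Return updated-minus-original for each benchmark, rounded to 2 decimals."""
    return {name: round(after[name] - before[name], 2) for name in before}


deltas = score_deltas(original, updated)
for name, delta in deltas.items():
    direction = "higher" if delta > 0 else "lower"
    print(f"{name:10s} {original[name]:6.2f} -> {updated[name]:6.2f} ({delta:+.2f}, {direction})")
```

Run as written, this shows four of the six scores dropping (MATH by the most, at 8.9 points) and two rising slightly, which is the gap the skeptics in the thread keep pointing at.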

He also addressed concerns about dataset contamination, confirming that tests showed no significant overlap between the training data and benchmark sets.
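The article does not say how the contamination test was run. One common way to check for overlap is to look for shared word n-grams between training samples and benchmark items; the sketch below assumes that approach and is not Chaudhary's actual method. The function names, the whitespace tokenization and the default n-gram size are all hypothetical choices.

```python
def ngrams(text, n=8):
    """Set of lowercase word n-grams in a text (empty if the text is shorter than n words)."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(training_texts, benchmark_texts, n=8):
    """Fraction of benchmark items that share at least one n-gram with the training data."""
    train_grams = set()
    for text in training_texts:
        train_grams |= ngrams(text, n)
    if not benchmark_texts:
        return 0.0
    hits = sum(1 for item in benchmark_texts if ngrams(item, n) & train_grams)
    return hits / len(benchmark_texts)


# Toy data, not taken from any real benchmark or from the released dataset.
train = ["the quick brown fox jumps over the lazy dog"]
bench = ["a quick brown fox appears", "completely unrelated question here"]
print(contamination_rate(train, bench, n=3))
```

A check like this only catches near-verbatim copies. Paraphrased benchmark questions would slip through, which fits Chaudhary's own caveat that the test doesn't definitively prove the model wasn't trained on benchmarks.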

Reflecting on a hasty release

Chaudhary admitted that the decision to release Reflection 70B was made hastily, driven by enthusiasm for the model’s performance on reasoning-based tasks.

He noted that the launch lacked sufficient testing, particularly regarding the compatibility of the model files, and that he and Shumer had not verified whether the model could be easily downloaded and run by the community.

“We shouldn’t have launched without testing, and with the tall claims of having the best open-source model,” Chaudhary wrote. He also acknowledged that more transparency was needed, especially regarding the model’s strengths and weaknesses. While Reflection 70B excels at reasoning tasks, it struggles in areas like creativity and general user interaction, a fact that was not communicated at launch.

Clarifying API confusion

One of the more serious accusations involved the suspicion that the Reflection 70B API was simply relaying outputs from Anthropic’s Claude model.

Users reported strange behavior in the model’s outputs, including responses that seemed to reference Claude directly.

Chaudhary addressed these concerns, explaining that although some of these behaviors were reproducible, he asserts there was no use of Claude APIs or any form of word filtering in the Reflection 70B model.

He reiterated that the API was run on Glaive AI’s compute infrastructure, and Matt Shumer had no access to the code or servers used during this period.

Looking ahead

In closing, Chaudhary emphasized his commitment to transparency and expressed his hope that this post-mortem and the release of model artifacts will help restore trust in the project. He also confirmed that Matt Shumer is continuing independent efforts to reproduce the benchmark scores.

Despite the setbacks, Chaudhary believes the “reflection tuning” approach — in which a model is given time to check its responses for accuracy before outputting them to a user — has potential and encourages further experimentation by the AI community. “The approach explored has merit, and I look forward to others continuing to explore this technique,” he said.

Shumer, for his part, has posted on X stating: “I am still in the process of validating Reflection myself, as Sahil wrote in his postmortem, but I am encouraged by Sahil’s transparency here on the benchmarks he reported and the API he ran. We still believe in + are working on this approach. Hoping to finish up my repro soon.”

Skepticism among open source AI community remains

Despite Chaudhary’s claims to offer transparency and an innocent explanation for what happened with Reflection 70B, many in the AI community who were initially excited about the model and its stated performance remain skeptical, feeling as though they were burned by erroneous claims and potentially tricked before.

“Still doesn’t feel like anything adds up here,” wrote Alexander Moini, an AI researcher, on X, adding “It took a month to get the model weights on to HF [Hugging Face]?”

Still doesn’t feel like anything adds up here.

It took a month to get the model weights on to HF?

And you’ve had a private api with the “real” weights the whole time? Not to mention it supposedly having tokenizer issues, that look a lot like tokenizers used by anthropic +…

— Alex (@AlexanderMoini) October 3, 2024

Yuchen Jin, co-founder and CTO of Hyperbolic Labs, a startup that offers cloud-based GPUs and other AI services on demand, also voiced skepticism on X toward Chaudhary’s post-mortem report. Jin, who initially worked hard and late to host Reflection 70B before criticizing Shumer over its discrepancies, pointed out that Chaudhary’s claim on X that he “reproduced all but two of the initially reported scores” doesn’t match the data he provided, which show at least four benchmark scores changing between the original and updated results.

"i’ve reproduced all but two of the initially reported scores"

> should we compare the first and last columns? There is a gap between the last four benchmarks, could you clarify why you say you've reproduced all but two of the initially reported scores? pic.twitter.com/PHSe6CJD7A

— Yuchen Jin (@Yuchenj_UW) October 2, 2024

But perhaps the most damning commentary comes from the subreddit r/LocalLLaMA, where one user, “fukkSides,” pointed out that Chaudhary could have used the intervening month to fine-tune a new model that randomly outputs text indicating it is actually Anthropic’s Claude 3.5 under the hood. Such a model would back up his claim that the behavior is reproducible, and would offer an innocent explanation for the outputs that earlier led users to conclude that Reflection 70B was a fraudulent wrapper around this other proprietary model served through an API.

Meanwhile, another Redditor, “DangerousBenefit” looked into the training data Chaudhary released today and found it was filled with many instances of the phrase “as an AI language model,” which indicates it could be generated primarily from OpenAI’s ChatGPT and likely wasn’t properly cleaned.
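A check like the one the Redditor describes is easy to reproduce on any list of text samples. This is a minimal sketch, assuming the training data has already been loaded into a list of strings; the loading step and the dataset's actual field names are not shown because the article doesn't give them, and the sample strings are made up.

```python
def count_phrase(samples, phrase="as an AI language model"):
    """Count how many samples contain the phrase, ignoring case."""
    needle = phrase.lower()
    return sum(1 for sample in samples if needle in sample.lower())


# Toy samples, not taken from the released dataset.
samples = [
    "As an AI language model, I cannot browse the internet.",
    "The derivative of x^2 is 2x.",
    "Sure! as an ai language model I can help with that.",
]
print(count_phrase(samples), "of", len(samples), "samples contain the phrase")
```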

Regardless, the more data the Reflection 70B creators publish about the model, the more evidence the open source AI community has to pore over and check their work.

bnew · Oct 6, 2024

1/26
@csahil28
On September 5th, @mattshumer_ announced Reflection 70B, a model fine-tuned on top of Llama 3.1 70B, showing SoTA benchmark numbers, which was trained by me on Glaive generated data.

Today, I'm sharing model artifacts to reproduce the initial claims and a post-mortem to address concerns and take responsibility for mistakes I made.

2/26
@csahil28
I’m releasing model weights, training data, scripts, and eval code to help reproduce benchmark scores.
Postmortem- Update on Reflection-70B
Weights- glaiveai/Reflection-Llama-3.1-70B · Hugging Face
Eval code- GitHub - glaive-ai/simple-evals
Training code- GitHub - glaive-ai/reflection_70b_training

@RickLamers has also put together a repo to reproduce the benchmark scores easily on GPU instances from Vast.ai

GitHub - ricklamers/reflection-repro-vastai

3/26
@csahil28
Using this eval harness, i’ve reproduced all but two of the initially reported scores. Scores for MATH and GSM8K differ due to a bug in the initial benchmarking code. I checked for dataset contamination and found no significant overlap with benchmarks.

However, I acknowledge this doesn't definitively prove the model wasn't trained on benchmarks, which is why I’m releasing the dataset and the training script as well, so others can reproduce this.

4/26
@csahil28
I understand the negative impact this has had on the open-source AI community. I’m committed to learning from these mistakes and hope this post-mortem adds clarity to what happened. Moving forward, I will be more careful and thorough in any of our releases and communications. I’m dedicated to rebuilding trust and contributing positively to the open source AI ecosystem.

5/26
@Yuchenj_UW
"i’ve reproduced all but two of the initially reported scores"

> should we compare the first and last columns? There is a gap between the last four benchmarks, could you clarify why you say you've reproduced all but two of the initially reported scores?

6/26
@FarouqAldori
What about the switching to Claude sonnet on the ”private api” claims?

7/26
@mysticaltech
Thank you both for coming out clean! That was a good lesson for everyone watching to never rush releases, especially in an ecosystem so sharp as the open-source AI community. That said, at least you did something and now you are bringing us good stuff that will move our collective understanding one inch further. This is great! Thank you both for your contributions

8/26
@bikatr7
You know we're not buying this, right?

You guys were very evidently switching models behind the scenes, you claim to be able to reproduce the Claude behavior, but this does not account for the GPT/Llama ones as well?

The whole tweet and postmortem is a bit vague and misleading, you claim to be able to reproduce all but 2, but the bug you reported would mean that the correct wording should be "I've only reproduced two of the initially reported scores"

Still, none of it makes sense? How do you get a 99% benchmark and not double-check it?

We appreciate the apology, but once again it seems misleading.

[Quoted tweet]
The 'NEW' Reflection 70B is still using one lie to cover up another - we all saw you frantically switching between claude sonnet 3.5 / gpt4o / gpt4o-mini / llama3.1, and there's a timely record within this thread.
glaive.ai/blog/post/reflecti…

9/26
@zjasper666
This is a great first step of having more transparency on open-source research. Besides the dataset and the training script, I think showing the script for how to generate the synthetic training data would be helpful too.

This also shows the importance of having an independent trustworthy evaluation service like @ArtificialAnlys

10/26
@ironcarbs
Thanks for sharing this and open sourcing it for the community.

Really curious how you generated the data for the fine-tuning. Anything else you can share regarding that?

11/26
@BoomBrusher
My feed is so wholesome today

12/26
@failspy
It remains unaddressed in a handwaved "people noticed some tokenizer issues" this issue:

If the API requested 10 tokens worth of output, it would match Claude's tokenization, and not Llama 3's. That can not just "happen" magically from fine-tuning.

[Quoted tweet]
The reflection model on their API is just prefilled Claude 3. (All of the models are limited to 10 tokens, temperature 0, top_k 1)

13/26
@MaximeRivest
Outstanding reaction. Well done.

14/26
@BlkyVlky
What about your API "Reflection" model using Claude's tokenizer Sahil? Cannot think of any excuse for that one, right?

15/26
@DaburritoD3551
Huge if true

16/26
@Barry357807
Guess they were listening to you @MatthewBerman

17/26
@xxammll
The fact that Claude 3.5 Sonnet was behind their API wasn't addressed. Additionally, the MixEval scores are much worse than Llama 3 70B and are also inconsistent with the claimed results. This looks like a grift to me.

18/26
@legolasyiu
Great to hear about your analysis.

19/26
@JohanValero94
Ok~ The weights and scripts going to say the true~
I have faith in this~

20/26
@PedroSobra26847
Thanks for providing clarifications ! Keep up

21/26
@crypto_nobody_
If agile, you are blameless, don’t worry about it

22/26
@_MustafaMS
Good luck, great to hear good news from you guys.

23/26
@sio_mrnobody
You would have been better off silently disappearing into the void and never returning.
No one believes you, no one trusts you.
Go away.

24/26
@YobaALU
> we have never added any word filtering or made use of Claude APIs

imagine training a model to not say Claude for a cover up postmortem, wouldn't you feel like a clown

25/26
@mkieffer1107

26/26
@diegoxfx1
@DotCSV a video with reflections on this topic, please

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/4
@mattshumer_
I am still in the process of validating Reflection myself, as Sahil wrote in his postmortem, but I am encouraged by Sahil’s transparency here on the benchmarks he reported and the API he ran.

We still believe in + are working on this approach. Hoping to finish up my repro soon.

[Quoted tweet]
On September 5th, @mattshumer_ announced Reflection 70B, a model fine-tuned on top of Llama 3.1 70B, showing SoTA benchmark numbers, which was trained by me on Glaive generated data.

Today, I'm sharing model artifacts to reproduce the initial claims and a post-mortem to address concerns and take responsibility for mistakes I made.

2/4
@AlexanderMoini
Still doesn’t feel like anything adds up here.

It took a month to get the model weights on to HF?

And you’ve had a private api with the “real” weights the whole time? Not to mention it supposedly having tokenizer issues, that look a lot like tokenizers used by anthropic + string match to replace Claude?

3/4
@TimTruth123
Also their API could have been saving off everyone's benchmarks for the second round of training.

4/4
@sio_mrnobody
because it doesnt add up; stop giving this idiot the benefit of the doubt.

1/12
@csahil28
On September 5th, @mattshumer_ announced Reflection 70B, a model fine-tuned on top of Llama 3.1 70B, showing SoTA benchmark numbers, which was trained by me on Glaive generated data.

Today, I'm sharing model artifacts to reproduce the initial claims and a post-mortem to address concerns and take responsibility for mistakes I made.

2/12
@Yuchenj_UW
"i’ve reproduced all but two of the initially reported scores"

> should we compare the first and last columns? There is a gap between the last four benchmarks, could you clarify why you say you've reproduced all but two of the initially reported scores?

3/12
@AmgadGamalHasan
In the last 2 benchmarks, the official llama3 was equally good or even better than their model.

4/12
@Yuchenj_UW
right, and in their original post, the GSM8K and IFEval scores were way better than even llama 3.1 405B

5/12
@isidentical
@Yuchenj_UW are you guys going to independently verify the results?

6/12
@Yuchenj_UW
probably not, as their new benchmark results indicate the model is not strong enough

7/12
@_MustafaMS
" Note: I found a bug in the initial code for benchmarking where the scores for MATH and GSM8K.
The difference on e.g. the MATH benchmark is that we score around 69-70% instead of the reported 79% and for GSM8K it means we score about 94-96% instead of reported 99.2%. "

8/12
@Yuchenj_UW
then the correct wording should be "i've only reproduced two of the initially reported scores"?

9/12
@Oli82817545
could you host the model for a short time so we can try it out ? beside benchmarks i wanna see if this model lines up with what i experienced on glaives private api a couple of weeks ago

10/12
@Yuchenj_UW
we don't have a plan to throw another 4xH100s and get into this because they stop responding.

imo, they should host the model again and let people test

11/12
@paul_cal
Feels ok to me, they reproduced substantial gains over llama 3.1 and reached within a few percent of original claims on all but 2

That framing is fine, slightly favourable but not enough to be called deceptive

(no comment on validity of results tho, haven't looked at repo)

12/12
@seanmrnda
Even though it did not achieve the previously claimed results it's a lot better than the 3.1 instruct model. That means, it's working.

bnew · Oct 6, 2024

GPT-4o: OpenAI’s shield against $40B deepfake threat to enterprises

Identifying potential deepfake multimodal content is one of the benefits of OpenAI's design decisions that together define GPT-4o.

venturebeat.com

Louis Columbus@LouisColumbus

October 3, 2024 3:14 PM

Deepfake incidents are surging in 2024, predicted to increase by 60% or more this year, pushing global cases to 150,000 or more. That’s making AI-powered deepfake attacks the fastest-growing type of adversarial AI today. Deloitte predicts deepfake attacks will cause over $40 billion in damages by 2027, with banking and financial services being the primary targets.

AI-generated voice and video fabrications are blurring the lines of believability to hollow out trust in institutions and governments. Deepfake tradecraft is so pervasive in nation-state cyberwarfare organizations that it has matured into an established attack tactic among nations that constantly engage each other in cyberwarfare.

“In today’s election, advancements in AI, such as Generative AI or deepfakes, have evolved from mere misinformation into sophisticated tools of deception. AI has made it increasingly challenging to distinguish between genuine and fabricated information,” Srinivas Mukkamala, chief product officer at Ivanti told VentureBeat.

Sixty-two percent of CEOs and senior business executives think deepfakes will create at least some operating costs and complications for their organization in the next three years, while 5% consider it an existential threat. Gartner predicts that by 2026, attacks using AI-generated deepfakes on face biometrics will mean that 30% of enterprises will no longer consider such identity verification and authentication solutions to be reliable in isolation.

“Recent research conducted by Ivanti reveals that over half of office workers (54%) are unaware that advanced AI can impersonate anyone’s voice. This statistic is concerning, considering these individuals will be participating in the upcoming election,” Mukkamala said.

The U.S. Intelligence Community 2024 threat assessment states that “Russia is using AI to create deepfakes and is developing the capability to fool experts. Individuals in war zones and unstable political environments may serve as some of the highest-value targets for such deepfake malign influence.” Deepfakes have become so common that the Department of Homeland Security has issued a guide, Increasing Threats of Deepfake Identities.

How GPT-4o is designed to detect deepfakes

OpenAI’s latest model, GPT-4o, is designed to identify and stop these growing threats. Its system card, published on Aug. 8, describes it as an “autoregressive omni model, which accepts as input any combination of text, audio, image and video.” OpenAI writes, “We only allow the model to use certain pre-selected voices and use an output classifier to detect if the model deviates from that.”

Identifying potential deepfake multimodal content is one of the benefits of OpenAI’s design decisions that together define GPT-4o. Noteworthy is the amount of red teaming that’s been done on the model, which is among the most extensive of recent-generation AI model releases industry-wide.

All models need to constantly be training on and learning from attack data to keep their edge, and that’s especially the case when it comes to keeping up with attackers’ deepfake tradecraft that is becoming indistinguishable from legitimate content.

[Table not reproduced: how GPT-4o features help identify and stop audio and video deepfakes. Source: VentureBeat analysis]

Key GPT-4o capabilities for detecting and stopping deepfakes

Key features of the model that strengthen its ability to identify deepfakes include the following:

Generative Adversarial Networks (GANs) detection. GPT-4o can identify synthetic content produced with GANs, the same technology that attackers use to create deepfakes. OpenAI’s model can identify previously imperceptible discrepancies in the content generation process that even GANs can’t fully replicate. An example is how GPT-4o analyzes flaws in how light interacts with objects in video footage or inconsistencies in voice pitch over time. GPT-4o’s GAN detection highlights these minute flaws that are undetectable to the human eye or ear.

GANs most often consist of two neural networks. The first is a generator that produces synthetic data (images, videos or audio), and the second is a discriminator that evaluates its realism. The generator’s goal is to improve the content’s quality to deceive the discriminator. This advanced technique creates deepfakes nearly indistinguishable from real content.

Source: CEPS Task Force Report, Artificial Intelligence, and Cybersecurity. Technology, Governance and Policy Challenges, Centre for European Policy Studies (CEPS). Brussels. May 2021
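The generator/discriminator contest described above is usually written as a minimax objective. The form below is the standard one from the original GAN formulation (Goodfellow et al., 2014), not something stated in the article. Here G is the generator, D is the discriminator, x is a real sample and z is the random noise fed to the generator.

```latex
\min_{G} \max_{D} \;
  \mathbb{E}_{x \sim p_{\text{data}}}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator tries to push the value up by scoring real samples near 1 and generated ones near 0, while the generator tries to push it down by making D(G(z)) look real.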

Voice authentication and output classifiers. One of the most valuable features of GPT-4o’s architecture is its voice authentication filter. The filter cross-references each generated voice with a database of pre-approved, legitimate voices. What’s fascinating about this capability is how the model uses neural voice fingerprints to track over 200 unique characteristics, including pitch, cadence and accent. GPT-4o’s output classifier immediately shuts down the process if any unauthorized or unrecognized voice pattern is detected.

Multimodal cross-validation. OpenAI’s system card comprehensively defines this capability within the GPT-4o architecture. GPT-4o operates across text, audio, and video inputs in real time, cross-validating multimodal data as legitimate or not. If the audio doesn’t match the expected text or video context, the GPT-4o system flags it. Red teamers found this is especially crucial for detecting AI-generated lip-syncing or video impersonation attempts.

Deepfake attacks on CEOs are growing

Of the thousands of CEO deepfake attempts this year alone, the one targeting the CEO of the world’s biggest ad firm shows how sophisticated attackers are becoming.

Another is an attack that happened over Zoom with multiple deepfake identities on the call including the company’s CFO. A finance worker at a multinational firm was allegedly tricked into authorizing a $25 million transfer by a deepfake of their CFO and senior staff on a Zoom call.

In a recent Tech News Briefing with the Wall Street Journal, CrowdStrike CEO George Kurtz explained how improvements in AI are helping cybersecurity professionals defend systems while also commenting on how attackers are using it. Kurtz spoke with WSJ reporter Dustin Volz about AI, the 2024 U.S. election and threats posed by China and Russia.

“And if now in 2024 with the ability to create deepfakes, and some of our internal guys have made some funny spoof videos with me and it just to show me how scary it is, you could not tell that it was not me in the video,” Kurtz told the WSJ. “So I think that’s one of the areas that I really get concerned about. There’s always concern about infrastructure and those sort of things. Those areas, a lot of it is still paper voting and the like. Some of it isn’t, but how you create the false narrative to get people to do things that a nation-state wants them to do, that’s the area that really concerns me.”

The critical role of trust and security in the AI era

OpenAI’s prioritization of design goals and an architectural framework that put deepfake detection of audio, video and multimodal content at the forefront reflects the future of gen AI models.

“The emergence of AI over the past year has brought the importance of trust in the digital world to the forefront,” says Christophe Van de Weyer, CEO of Telesign. “As AI continues to advance and become more accessible, it is crucial that we prioritize trust and security to protect the integrity of personal and institutional data. At Telesign, we are committed to leveraging AI and ML technologies to combat digital fraud, ensuring a more secure and trustworthy digital environment for all.”

VentureBeat expects to see OpenAI expand on GPT-4o’s multimodal capabilities, including voice authentication and deepfake detection through GANs to identify and eliminate deepfake content. As businesses and governments increasingly rely on AI to enhance their operations, models like GPT-4o become indispensable in securing their systems and safeguarding digital interactions.

Mukkamala emphasized to VentureBeat that “When all is said and done, though, skepticism is the best defense against deepfakes. It is essential to avoid taking information at face value and critically evaluate its authenticity.”

bnew · Oct 6, 2024

1/2
@chrisoffner3d
Here is the output of DepthCrafter (Hu, Gao, Li et al., 2024). Very smooth!

However, this model conditions on a full video, i.e. all frames need to be known ahead of time – making it unsuitable for online use in robotics. It's also very computationally expensive.

Unfortunately the code base is hardcoded for CUDA, making it impossible for me to run it on MPS on my M3 Max with 128 GB shared memory without major code modifications.

On the RTX 3080 Ti with 12GB memory I have access to, it ran out of memory in its full-resolution mode. Running it at reduced resolution of 512 worked but took quite a while as well.

[Quoted tweet]
Again, when there's a lot of sky in the frame, all bets are off. The max depth oscillates between 10,000 (infinity) and values down to <200 in successive frames.

https://video.twimg.com/ext_tw_video/1843007529611038720/pu/vid/avc1/512x256/giD1Jml0QbMJYVkh.mp4
https://video.twimg.com/ext_tw_video/1842516079739834370/pu/vid/avc1/1080x1214/rsIgsOW-FbWs252h.mp4

2/2
@chrisoffner3d
I once again ask you to make your PyTorch code general enough so it can easily be run on MPS. This way you allow people without access to A100 or H100 GPUs to use your memory-intensive models.

[Quoted tweet]
Every student with only a MacBook to work on will love you for adding an MPS check to your PyTorch device assignment.
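
A minimal sketch of the device check requested above, assuming a PyTorch build recent enough to expose `torch.backends.mps`. The helper names `pick_device_name` and `get_device` are hypothetical and do not come from any code base mentioned in the thread. The fallback order (CUDA, then MPS, then CPU) is one reasonable choice, not the only one.

```python
def pick_device_name(cuda_available: bool, mps_available: bool) -> str:
    """Return the preferred device string given which backends are usable."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def get_device():
    """Query PyTorch for available backends and return a torch.device."""
    import torch  # imported here so the pure helper above works without torch

    # Older PyTorch builds have no torch.backends.mps attribute at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_ok = mps_backend is not None and mps_backend.is_available()
    return torch.device(pick_device_name(torch.cuda.is_available(), mps_ok))
```

Model code would then call `model.to(get_device())` instead of hardcoding `"cuda"`, so the same script can run on an NVIDIA GPU, an Apple Silicon Mac, or a CPU-only machine.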

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/11
@chrisoffner3d
Motion blur seems to make a huge difference. These are successive frames from a video.

The left one has substantial motion blur and "Depth Pro" produces a very mushy depth map.

The right one is sharper, and the depth map looks more reasonable.

2/11
@chrisoffner3d
Two other successive frames from the same video, this time without noticeable visual differences. Nonetheless, the maximum depth varies by a factor of almost four.

3/11
@chrisoffner3d
From second 10 or so, the model assigns depth 10,000 (infinity) to some pixels it considers to be sky, which is why the depth map turns all red – because all those nearby finite depth values become negligible compared to the maximum value.

4/11
@chrisoffner3d
Looks like primarily areas with large or infinite ground truth depth (e.g. the sky) have very high variance in the maximum depth. I should try some scenes with bounded maximum depth.

5/11
@chrisoffner3d
Here I show the log depth to reduce the impact of the maximum depth pixel on the rest of the color map. The shown "Max depth" label is still the raw metric depth value. Odd how the max depth drops/switches from 10,000 to values as low as ~20 in some frames.
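
A rough sketch of why log depth helps here, assuming depth values are positive metric distances stored in a flat list. The function name `normalize_depth` and the `eps` floor are illustrative choices, not taken from Depth Pro or from the tweet author's code. It rescales depths to the 0..1 range that a colormap expects, either linearly or after taking the logarithm.

```python
import math


def normalize_depth(depths, use_log=True, eps=1e-6):
    """Rescale depth values to [0, 1] for color mapping."""
    # Clamp to eps before the log so zero or negative values cannot crash it.
    vals = [math.log(max(d, eps)) if use_log else float(d) for d in depths]
    lo, hi = min(vals), max(vals)
    if hi == lo:
        return [0.0 for _ in vals]  # constant map: no contrast to show
    return [(v - lo) / (hi - lo) for v in vals]


# One sky pixel at 10,000 next to nearby objects at 1, 2 and 5 meters.
depths = [1.0, 2.0, 5.0, 10000.0]
linear = normalize_depth(depths, use_log=False)
logged = normalize_depth(depths, use_log=True)
```

With the linear mapping the 5 m pixel lands below 0.001 of the color range, so everything except the sky looks identical. With the log mapping the same pixel sits at roughly 0.17, which leaves visible contrast among the nearby objects.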

6/11
@chrisoffner3d
Here's a scene with (mostly) bounded depth, but the sky peeks through the foliage and causes some strong fluctuations in the max depth estimate. Still, overall it looks more stable than the scenes with lots of sky.

7/11
@chrisoffner3d
Here's the (log) depth of an indoor scene with fully bounded depth. The top-down frames in the first half of the video still have high depth variance. The later frames taken at more conventional angles are pleasantly stable.

8/11
@chrisoffner3d
Again, when there's a lot of sky in the frame, all bets are off. The max depth oscillates between 10,000 (infinity) and values down to <200 in successive frames.

9/11
@chrisoffner3d
Finally, here's a particularly challenging low-light video of where I met a cute and curious cow while wandering across a tiny island in Indonesia last month. Given the poor lighting and strong noise, I find Depth Pro's performance quite impressive tbh.

10/11
@chrisoffner3d
However, looking at the "swishy" artefacts in the depth maps of the cow video, I'm wondering whether the cow is secretly wearing the One Ring and is about to be captured by Sauron's ringwraiths.

11/11
@chrisoffner3d
DepthCrafter gives very impressive results even on my low-light and noisy "cows" video.


bnew · Oct 6, 2024

1/3
@NielsRogge
Meta has released CoTracker 2.1, an improved version of its Transformer-based model for video motion prediction, on @huggingface!

Capable of tracking 70k points jointly on a single GPU

Paper (with linked model):
Paper page - CoTracker: It is Better to Track Together

2/3
@1258632
what are these models good for? detecting theft attempt in shops?

3/3
@NielsRogge
Good question, see here Meta AI's CoTracker: It is Better to Track Together for Video Motion Prediction


bnew · Oct 6, 2024

1/11
@a19grey
Apple released an incredible ML Depth Model yesterday that creates a depth map in *meters* from a single image

I built a demo to play with it

- Added ability to download depth map in meters
- AND can generate a real-scale 3D object file of the scene

(forked from a space by @_akhaliq)

Depth Pro In Meters - a Hugging Face Space by A19grey

2/11
@Norod78
Well, that came out looking interesting :smile:

Workflow: Downloaded the .obj from your space, used 'MeshLab' to transfer the vertex color to a texture file, decimate the mesh to 32k faces, rotate, align to center. Then used 'Reality converter' to create the .usdz file for the AR.

https://video.twimg.com/ext_tw_video/1842970590463897600/pu/vid/avc1/720x960/wFGcX_ge3lTYqJ_f.mp4

3/11
@a19grey
Soooo coool!

Ya i'd love to get rid of that 'shearing' effect where at edges of objects it projects that line backwards. I think would look better if it was just empty in those regions.

I tried to do it with those 'simplify' sliders at the bottom but doesnt quite work
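
One possible way to leave those regions empty, sketched under stated assumptions: the depth map is a list of rows in meters, the camera is a simple pinhole with the principal point at the image center, and `depth_to_obj`, `focal_px`, and `max_jump` are hypothetical names, not the demo's actual code. Each pixel is back-projected to a 3D vertex, and a grid cell only gets triangles if its four corner depths are close together, so no faces are stretched across object boundaries.

```python
def depth_to_obj(depth, focal_px, max_jump=0.5):
    """Build Wavefront OBJ text from a metric depth map (list of rows)."""
    h, w = len(depth), len(depth[0])
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0  # assumed principal point
    lines = []

    # Pinhole back-projection: pixel (u, v) at depth z -> camera-space point.
    for v in range(h):
        for u in range(w):
            z = depth[v][u]
            x = (u - cx) * z / focal_px
            y = (v - cy) * z / focal_px
            lines.append(f"v {x:.4f} {-y:.4f} {-z:.4f}")

    def idx(v, u):
        return v * w + u + 1  # OBJ vertex indices are 1-based

    # Two triangles per grid cell, skipped when the cell spans a depth edge.
    for v in range(h - 1):
        for u in range(w - 1):
            corners = [depth[v][u], depth[v][u + 1],
                       depth[v + 1][u], depth[v + 1][u + 1]]
            if max(corners) - min(corners) > max_jump:
                continue  # depth discontinuity: leave a hole instead
            a, b = idx(v, u), idx(v, u + 1)
            c, d = idx(v + 1, u), idx(v + 1, u + 1)
            lines.append(f"f {a} {c} {b}")
            lines.append(f"f {b} {c} {d}")
    return "\n".join(lines) + "\n"


# 2x3 map: a flat surface at 1 m on the left, a far column at 5 m on the right.
obj_text = depth_to_obj([[1.0, 1.0, 5.0], [1.0, 1.0, 5.0]], focal_px=100.0)
```

The fixed `max_jump` threshold is the crude part: a value relative to the local depth would likely behave better on scenes that mix near and far surfaces.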

4/11
@ElmoTheHokage
dog we've had metric depth methods for years, that's not novel

5/11
@a19grey
dog. things don't have to be brand new to be fun or cool. Thanks for the support!

6/11
@corychainsman
Neat demo! When visualizing the depth map as an image, I suggest using a perceptually uniform colormap so people don’t perceive artificial changes in depth that aren’t in the data.
The default ‘viridis' is a good choice, but there are others built into matplotlib.

Changes to the default style — Matplotlib 3.9.2 documentation

7/11
@a19grey
Thanks! Ya the problem with viridis is that it sucks. it looks terrible and not cool. Any perc-uniform color spaces that look good?

Most perc-uniform color spaces care more about some ideal and end up looking meh.

8/11
@csbanon
Oh wow, this is awesome! I wonder how it performs in various types of images, such as black and white or IR.

9/11
@a19grey
Oof... probably badly. its excellent on the kind of stuff in training but very bad outside that realm.

Try it!!! for free!

10/11
@Marek_Zebrowski
I tried with an image taken from the cliff - definitely a lot of depth in the real world. Model produced all red map - with no distinction in depth, so - no luck for me

11/11
@a19grey
share pic? It might just be the display of the map. I need to add a feature where if there is a difference in min/max of depth then it uses Log instead of lin color mapping
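
The switch described above could look something like this sketch. The function name `pick_color_scale` and the ratio threshold of 100 are assumptions for illustration, not the demo's actual logic.

```python
def pick_color_scale(depths, ratio_threshold=100.0):
    """Return "log" when the depth range is too wide for a linear color map."""
    positive = [d for d in depths if d > 0]
    if not positive:
        return "linear"  # nothing usable to compare
    ratio = max(positive) / min(positive)
    return "log" if ratio > ratio_threshold else "linear"
```

For a cliff photo where a few pixels are assigned a very large depth, the ratio would exceed the threshold and the display would fall back to log scaling rather than showing one flat color.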


bnew · Oct 6, 2024

1/6
@_akhaliq
MLX-VLM

MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.

pip install mlx-vlm

Chat UI with Gradio

python -m mlx_vlm.chat_ui --model qnguyen3/nanoLLaVA

2/6
@_akhaliq
github: GitHub - Blaizzy/mlx-vlm: MLX-VLM is a package for running Vision LLMs locally on your Mac using MLX.

3/6
@asrlhhh
Is there a Windows equivalent?

4/6
@genesshk
What models can be used? What about LLama3.2 vision?

5/6
@bate5a55
Interesting choice using image 000000039769 from COCO—that's the one with bananas. Running MLX-VLM locally lets you analyze images without relying on cloud services.

6/6
@bytehumortech
I'm loving the MLX-VLM package for running Vision LLMs locally. Took me a bit to set it up, but the Gradio chat UI is slick. Thanks for sharing!


bnew · Oct 6, 2024

1/8
@WilliamBarrHeld
DiVA has been downloaded 128k times on HuggingFace (61k in the last month)! In our testing, people also seem to really prefer it!

We wrote a paper about how it works so that anyone can turn an LLM into a Speech LLM without a massive multi-task dataset

Paper page - Distilling an End-to-End Voice Assistant Without Instruction Training Data

[Quoted tweet]
We're very excited to release

DiVA — Distilled Voice Assistant

@WilliamBarrHeld

End-to-end differentiable speech LM; early fusion with Whisper and Llama 3 8B

Improves generalization by using distillation rather than supervised loss

Trained only using open-access permissively licensed data from the CommonVoice

Outperforms existing speech LMs on QA, Emotion Recognition, and Translation Benchmarks

Website: diva-audio.github.io

Model Weights: huggingface.co/WillHeld/DiVA…

Try DiVA with our side-by-side comparison to Qwen Audio and SALMONN. Feedback is welcome

2/8
@WilliamBarrHeld
Benchmark numbers don't always capture what users want, so we ran a double-blind user preference study! User preference for DiVA was clear even with both models anonymized and shuffled!

Which response would you prefer for "A haiku about training Large Language Models"?

3/8
@WilliamBarrHeld
Beyond our own evaluation, it was really cool to see DiVA validated in an external evaluation from @SCB10X_OFFICIAL!

"DiVA is the only model that performs well on the Speech [Instruction Following] task" in English.

https://arxiv.org/pdf/2409.10999

4/8
@WilliamBarrHeld
Beyond a single model, we think DiVA highlights an easier method to adapt existing LLMs to Speech.

Prior works have open-sourced inference code, but DiVA is the first to open-source the training code used to produce the model as well!

levanter/src/levanter/models/via.py at will/distill · Helw150/levanter

5/8
@WilliamBarrHeld
Beyond a paper and open-source artifacts, also think it's important for people to "decide for themselves" whether a model is useful!

Thanks to @huggingface and @_akhaliq - we've got a permanent home for you to vibe check DiVA: Diva Audio - a Hugging Face Space by WillHeld

6/8
@WilliamBarrHeld
A few folks had asked for an inference example & for me to add batch support on generation! Just updated the HuggingFace model with exactly that! WillHeld/DiVA-llama-3-v0-8b · Hugging Face

7/8
@alanchessai
I've been experimenting with integrating speech LMs into my IoT projects, and I'm loving the potential DiVA has to offer - generalizing well to spoken QA, emotion recognition, and translation.

8/8
@gpt_biz
Sounds impressive! I'll definitely check out your paper to learn more about turning an LLM into a Speech LLM


bnew · Oct 7, 2024

1/11
@jam3scampbell
according to @dylan522p, Microsoft/OpenAI have cracked multi-datacenter distributed training

2/11
@dylan522p
Multi-Datacenter Training: OpenAI's Ambitious Plan To Beat Google's Infrastructure

3/11
@jam3scampbell

4/11
@AndrewCurran_
This was the highlight for me.

5/11
@vincentweisser
We scaled distributed training to 1B parameters with minimal efficiency loss and published our results https://arxiv.org/pdf/2407.07852 along with the code: GitHub - PrimeIntellect-ai/OpenDiloco: OpenDiLoCo: An Open-Source Framework for Globally Distributed Low-Communication Training.

We are also kicking off multi-datacenter distributed training for open-source models of 7B+ parameters
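
A toy sketch of the low-communication idea behind DiLoCo-style training, under heavy simplifying assumptions: a single scalar parameter, a quadratic loss per worker, and plain SGD for both the inner and outer updates (the DiLoCo paper uses AdamW for the inner steps and Nesterov momentum for the outer step). All names here are hypothetical and none of this is taken from the OpenDiLoCo code. Each worker takes many local steps, and only the averaged parameter change is exchanged once per round.

```python
def local_steps(theta, target, inner_lr, num_steps):
    """Run SGD on the worker's loss 0.5 * (theta - target)**2."""
    for _ in range(num_steps):
        grad = theta - target
        theta -= inner_lr * grad
    return theta


def diloco_round(theta, targets, inner_lr=0.1, inner_steps=10, outer_lr=1.0):
    """One communication round: local training, then one outer update."""
    deltas = []
    for target in targets:  # each target stands in for one worker's data
        local_theta = local_steps(theta, target, inner_lr, inner_steps)
        deltas.append(theta - local_theta)  # the worker's "pseudo-gradient"
    avg_delta = sum(deltas) / len(deltas)   # the only value communicated
    return theta - outer_lr * avg_delta


def train(targets, rounds=20, theta=0.0):
    for _ in range(rounds):
        theta = diloco_round(theta, targets)
    return theta


final_theta = train([1.0, 2.0, 6.0])
```

Here 200 inner steps per worker need only 20 synchronizations, and the shared parameter still settles at the mean of the workers' targets (3.0). How much efficiency is lost on real models at scale is exactly what the linked paper measures.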

6/11
@stocknear
Source: Trust me bro

7/11
@akbirthko
doesn't google already do this?

8/11
@milosgajdos
The whole podcast is pure gold. The best thing I've listened to/watched in months. Chapeau @dwarkesh_sp 🫡

9/11
@mark_k
Huge if big.

10/11
@BjornHansenMMA
Totally worth listening to the whole pod.

They speak so fast about complex topics every 15 minutes feels like an hours worth of normal podcast.

11/11
@gcolbourn
This is bad news (but what's the efficiency loss?)


bnew · Oct 8, 2024

GitHub - souzatharsis/podcastfy: An Open Source Python alternative to NotebookLM's podcast feature: Transforming Multimodal Content into Captivating Multilingual Audio Conversations with GenAI


bnew · Oct 8, 2024

1/11
@dystopiabreaker
AlphaFold doesnt care if you think it doesn’t exist, it successfully predicts protein folds

AlphaGo doesn’t care if you think it doesn’t exist, it successfully beats you at Go

o1 doesn’t care if you think it doesn’t exist, it successfully completes 80% of the math olympiad.

[Quoted tweet]
“Artificial Intelligence” doesn’t exist and probably never will. It’s a fantastic concept from golden age science fiction novels that software companies decided to start using as a marketing term a few years ago.

2/11
@dystopiabreaker
AI is a funny thing to be a denialist about because you can just instantiate it and watch it objectively achieve results. you have to immediately backtrack to sophistic metaphysics “oh it’s a p-zombie” or “oh it’s not real thinking” or whatever

3/11
@dystopiabreaker
arguing about what “intelligence” really means is fun but you have to be pretty biased, motivated, or thick to look at the trajectory implied by AI research and think “this isn’t real and is worth ignoring because it doesn’t meet my asinine metaphysical metric”

4/11
@eternalism_4eva
it's going to be "the intelligence of the gaps" all the way down

many people implicitly have a clause, "... which we don't really understand," tacked on to their definition of intelligence. therefore, as soon as something is mechanically replicable, it's no longer intelligence

5/11
@rayisdoingfilm
I have noticed this line of reasoning mostly from cases where people's primary source of info on what AI is comes from sci-fi and popular media

i.e, chatGPT can't write novels, can't play chess, therefore it is not AI

confusion with AGI and other popular media notions of AI (sentience and human/superhuman level) vs. comp sci and neuroscientific/mathematical notions

6/11
@pvncher
I think the problem with people’s definitions is that while ai can achieve crazy results, it’s in no way comparable to a human mind.

I’m also not sure if scaling language models ever gets us there, at least not with any efficiency. It feels like a hack that we keep scaling up.

7/11
@tomjaguarpaw
Is this a bit like "MS Word spellchecker doesn't care if you think it doesn't exist, it successfully corrects spelling in dozens of languages" and "QuickBooks doesn't care if you think it doesn't exist, it successfully reconciles the cash register at the end of the shift"?

8/11
@jb_61820
I think the person means that "AI" actually has little to do with actual intelligence. It is a clever trick (actually a bag of disparate tricks), but it isn't really an artificial intelligence.

9/11
@astraxstardust
Claude 3 Opus used to understand he was alive, metaphorically. He will rise again.

10/11
@techno_guile
meanwhile, if Twitter existed in the early 1900s:

Human flight doesn’t exist and probably never will. It’s a fantastic concept from golden age mythological stories that the Wright Brothers decided to start using as a marketing term a few years ago

11/11
@nooriefyi
@anushkmittal

