bnew

Veteran
Joined
Nov 1, 2015
Messages
61,737
Reputation
9,298
Daps
169,585
We Finally Figured Out How AI Actually Works… (not what we thought!)



Channel: Matthew Berman (443K subscribers)


TimeStamps:
0:00 Intro
1:19 Paper Overview
5:46 How AI is Multilingual
8:07 How AI Plans
10:58 Mental Math
14:00 AI Makes Things Up
17:39 Multi-Step Reasoning
19:37 Hallucinations
22:44 Jailbreaks
25:30 Outro


bnew

[Technical] SEED-Bench-R1: Evaluating Reinforcement Learning vs Supervised Fine-tuning for Video Understanding in Multimodal LLMs



Posted on Wed Apr 2 11:28:52 2025 UTC

/r/ArtificialInteligence/comments/1jpm9cf/seedbenchr1_evaluating_reinforcement_learning_vs/

Researchers just released a comprehensive evaluation of how reinforcement learning affects video understanding in multimodal language models, introducing a new benchmark called SEED-Bench-R1 with 1,152 multiple-choice questions specifically designed to test video reasoning capabilities.

Key findings:
- Most RLHF-trained models show significant degradation in video understanding compared to their SFT-only counterparts (GPT-4o dropped 9%, Gemini Pro dropped 3.3%)
- Temporal reasoning tasks suffer more than spatial tasks - models struggle more with understanding sequences of events after RL training
- Claude 3 Opus is the exception, showing a 5.9% improvement after RL, suggesting different training approaches matter
- Common failure patterns include focusing on superficial visual elements, displaying overconfidence, and producing lengthy but incorrect explanations
- Error analysis reveals RLHF creates misalignment between user intent (accurate video understanding) and model outputs (confident-sounding but incorrect answers)

I think this reveals a fundamental tension in current AI training pipelines. When we optimize for human preferences through RLHF, we're inadvertently teaching models to provide confident-sounding answers even when they lack proper understanding of video content. This finding challenges the assumption that RLHF universally improves model capabilities and suggests we need specialized approaches for preserving video reasoning during reinforcement learning.

The Claude 3 Opus exception is particularly interesting - understanding what Anthropic is doing differently could provide valuable insights for improving video capabilities across all models. I wonder if their constitutional AI approach or specific reward modeling techniques might be responsible for this difference.

For practitioners, this suggests we should be cautious when deploying RLHF-trained models for video understanding tasks, and potentially consider using SFT-only models when accuracy on video content is critical.

TLDR: Standard reinforcement learning techniques hurt video understanding in most AI models, creating systems that sound confident but miss critical temporal information. Claude 3 Opus is a notable exception, suggesting alternative RL approaches may preserve these capabilities.

Paper: Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1
 

bnew


Meta AI Proposes Multi-Token Attention (MTA): A New Attention Method which Allows LLMs to Condition their Attention Weights on Multiple Query and Key Vectors​


By Asif Razzaq

April 1, 2025


Large Language Models (LLMs) significantly benefit from attention mechanisms, enabling the effective retrieval of contextual information. Nevertheless, traditional attention methods primarily depend on single token attention, where each attention weight is computed from a single pair of query and key vectors. This design inherently constrains the model’s ability to discern contexts requiring the integration of multiple token signals, thereby limiting its effectiveness on complex linguistic dependencies. For example, identifying sentences simultaneously containing both “Alice” and “rabbit” is challenging because conventional attention mechanisms struggle to integrate multiple separate attention signals efficiently without substantially increasing model complexity.

Meta AI addresses this limitation by introducing Multi-Token Attention (MTA), an advanced attention mechanism that conditions attention weights simultaneously on multiple query and key vectors. MTA integrates convolution operations over queries, keys, and attention heads, thus enhancing the precision and efficiency of contextual information retrieval. Specifically, the MTA framework consists of two convolutional components: key-query convolution, which aggregates multiple token signals within individual attention heads, and head mixing convolution, which facilitates information sharing among different attention heads. Additionally, the implementation employs group normalization with depth-dependent scaling to stabilize gradient flow, further improving model training stability and efficacy.



At a technical level, MTA modifies conventional attention calculations by incorporating a two-dimensional convolution operation on the attention logits prior to softmax normalization. This convolution allows adjacent queries and keys to influence attention scores mutually, thus enabling the attention mechanism to identify contextual relationships involving multiple tokens more precisely. Consequently, the model efficiently aggregates local token interactions without substantially increasing the number of parameters or the dimensionality of attention vectors. Moreover, head convolution promotes effective knowledge transfer among attention heads, selectively amplifying relevant context signals while mitigating less pertinent information. Collectively, these enhancements yield a more robust attention mechanism capable of capturing complex multi-token interactions.
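The logit-convolution idea in the paragraph above can be sketched in a few lines of Python. This is a toy single-head version with a hand-picked averaging kernel; the actual MTA also includes head-mixing convolution, group normalization with depth-dependent scaling, and causal masking, all omitted here for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_token_attention(Q, K, V, kernel):
    """Toy single-head sketch: convolve the attention logits over the
    (query, key) grid before softmax, so neighboring token pairs can
    influence each attention weight (the core MTA idea)."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)            # (Tq, Tk) standard logits
    kq, kk = kernel.shape
    # 'same' padding so the mixed logits keep shape (Tq, Tk)
    padded = np.pad(logits, ((kq // 2,) * 2, (kk // 2,) * 2))
    mixed = np.zeros_like(logits)
    Tq, Tk = logits.shape
    for i in range(Tq):
        for j in range(Tk):
            mixed[i, j] = (padded[i:i + kq, j:j + kk] * kernel).sum()
    return softmax(mixed, axis=-1) @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 8)) for _ in range(3))
kernel = np.full((3, 3), 1 / 9)              # simple averaging kernel
out = multi_token_attention(Q, K, V, kernel)
print(out.shape)  # (6, 8)
```

In the real model the kernel is learned per head rather than fixed, and the convolution must be masked so future keys cannot leak into past positions.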



Empirical evaluations validate the efficacy of MTA across several benchmarks. In a structured motivating task explicitly designed to illustrate the shortcomings of single-token attention mechanisms, MTA demonstrated near-perfect performance, achieving an error rate of only 0.1%, in contrast to standard Transformer models that exhibited error rates above 50%. Further large-scale experiments involving an 880M-parameter model trained on 105 billion tokens showed MTA consistently outperforming baseline architectures. MTA achieved superior validation perplexity scores across datasets such as arXiv, GitHub, and Wikipedia. Specifically, in tasks requiring extended context comprehension, such as Needle-in-the-Haystack and BabiLong benchmarks, MTA significantly exceeded the performance of standard Transformer models. In the Needle-in-the-Haystack task with 4K token contexts containing multiple needles, MTA attained accuracies ranging from 67% to 97.6%, surpassing standard models by substantial margins.



In summary, Multi-Token Attention (MTA) presents a refined advancement in attention mechanisms by addressing fundamental limitations of traditional single-token attention. Leveraging convolutional operations to concurrently integrate multiple query-key interactions, MTA enhances the ability of language models to handle intricate contextual dependencies. These methodological improvements facilitate more precise and efficient performance, particularly in scenarios involving complex token interactions and long-range contextual understanding. Through targeted modifications to standard attention mechanisms, MTA contributes meaningfully to the evolution of more sophisticated, accurate, and computationally efficient language models.

Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.


 

bnew


LG AI Research Releases NEXUS: An Advanced System Integrating Agent AI System and Data Compliance Standards to Address Legal Concerns in AI Datasets​


By Asif Razzaq

February 16, 2025


Since the advent of LLMs, AI research has focused on developing ever more powerful models. These cutting-edge models improve the user experience across reasoning, content generation, and other tasks. However, trust in their results and in the reasoning behind them has recently come into the spotlight. Because a model’s output depends on its underlying dataset, the quality of that data, its compliance, and the associated legal risks have become key concerns.

LG AI Research, a pioneer in the AI field with previous successful launches of the EXAONE models, has developed an Agent AI to address these concerns. The Agent AI tracks the life cycle of training datasets used in AI models, comprehensively analyzing legal risks and assessing potential threats related to each dataset. LG AI Research has also introduced NEXUS, where users can directly explore results generated by this Agent AI system.

LG AI Research focuses on the training data underlying AI models. This matters because AI is rapidly expanding into various sectors, and the biggest concern is ensuring its legal, safe, and ethical advancement. Through this research, LG AI Research found that AI training datasets are redistributed many times, and a single dataset is sometimes linked to hundreds of others, making it practically impossible for a human to track its sources. This lack of transparency can give rise to serious legal and compliance risks.

Through the Agent AI embedded in NEXUS, LG AI Research tracks the lifecycle of complex datasets to ensure data compliance. The team achieved this with a robust Agent AI that can automatically find and analyze complex layers of dataset relationships, built on a comprehensive data compliance framework and their EXAONE 3.5 model. The Agent AI system comprises three core modules, each fine-tuned differently:

- The Navigation Module: extensively trained to navigate web documents and analyze AI-generated text data. It navigates based on the name and type of an entity to find links to web pages or license documents related to that entity.
- The QA Module: trained to take the collected documents as input and extract dependency and license information from them.
- The Scoring Module: trained on a refined dataset labeled by lawyers; it analyzes license details alongside an entity’s metadata to evaluate and quantify potential legal risks.

With this design, the Agent AI performs these assessments roughly 45 times faster than a human expert, at less than 1/700th of the cost.

Table Source : https://lgresearch.ai/data/upload/LG_AI_Research_Data_compliance_arxiv_EST.pdf

Other notable results: when evaluated on 216 randomly chosen datasets from Hugging Face’s top 1,000+ downloads, the Agent AI detected dependencies with about 81.04% accuracy and identified license documents with about 95.83% accuracy.

Table Source : https://lgresearch.ai/data/upload/LG_AI_Research_Data_compliance_arxiv_EST.pdf

In this Agent AI, the legal risk assessment for datasets is based on the data compliance framework developed by LG AI Research. The framework uses 18 key factors, including license grants, data modification rights, derivative works permissions, potential copyright infringement in outputs, and privacy considerations. Each factor is weighted according to real-world disputes and case law, ensuring practical, reliable risk assessments. Data compliance results are then classified on a seven-level risk rating scale, where A-1 is the highest, requiring explicit commercial-use permission or public-domain status plus consistent rights across all sub-datasets. A-2 to B-2 allow limited use, often free for research but restricted commercially. C-1 and C-2 carry higher risk due to unclear licenses, rights issues, or privacy concerns.
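As a rough illustration of how such a weighted-factor framework can work, here is a hypothetical scorer in Python. The factor names, weights, and rating cut-offs below are invented for illustration and are not LG AI Research's actual values; the real framework uses 18 lawyer-weighted factors and a seven-level scale.

```python
# Hypothetical weighted-factor risk scorer in the spirit of the framework
# described. Factor names, weights, and thresholds are illustrative only.
FACTOR_WEIGHTS = {
    "license_grant": 3.0,
    "modification_rights": 2.0,
    "derivative_works": 2.0,
    "output_copyright": 2.5,
    "privacy": 2.5,
}

def compliance_score(answers):
    """answers maps factor -> value in [0, 1], where 1.0 = fully compliant."""
    total = sum(w * answers.get(f, 0.0) for f, w in FACTOR_WEIGHTS.items())
    return total / sum(FACTOR_WEIGHTS.values())

def rating(score):
    # Hypothetical cut-offs onto coarse bands of the A-1 ... C-2 scale.
    if score >= 0.9:
        return "A-1"   # explicit commercial permission / public domain
    if score >= 0.6:
        return "B"     # research use allowed, restricted commercially
    return "C"         # unclear license, rights, or privacy issues

def effective_rating(name, ratings, deps):
    """A dataset's effective rating is capped by its most restrictive
    (transitive) dependency, the mechanism behind the 2,852 -> 605 drop."""
    order = {"A-1": 0, "B": 1, "C": 2}
    worst = ratings[name]
    for d in deps.get(name, []):
        child = effective_rating(d, ratings, deps)
        if order[child] > order[worst]:
            worst = child
    return worst

# toy usage: a commercially-usable dataset built on a restricted one
ratings = {"ds_a": "A-1", "ds_b": "C"}
deps = {"ds_a": ["ds_b"]}
print(effective_rating("ds_a", ratings, deps))  # C
```

The dependency cap is the key point: a dataset that looks commercially usable on its own inherits the most restrictive rating anywhere in its dependency chain.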

The research on NEXUS has set a new standard for the legal stability of AI training datasets. LG AI Research envisions a long way forward; they have conducted an in-depth analysis of 3,612 major datasets through NEXUS and found that the inconsistency of rights relationships between datasets and dependencies is far higher than expected. Many of these datasets with inconsistencies are used for major AI models in widespread use. For example, of the 2,852 AI training datasets determined to be commercially available, only 605 (21.21%) remained commercially available after accounting for dependency risks.


Recognizing these real-world issues, LG AI Research has several future goals for evolving AI technology and the legal environment. The first immediate goal is to expand the scope and depth of the datasets that Agent AI technology analyzes, aiming to understand the life cycle of all the data worldwide and maintain the quality of assessment and results throughout this expansion. Another vision is to evolve the data compliance framework into a global standard. LG AI Research plans to collaborate with the worldwide AI community and legal experts to develop these criteria into an international standard. Finally, in the long term, LG AI Research plans to evolve NEXUS into a comprehensive legal risk management system for AI developers, contributing to creating a safe, legal, data-compliant, and responsible AI ecosystem.

Sources :

LG Agent AI Research Paper | NEXUS | LG AI Research LinkedIn Page | EXAONE 3.5 Blog

Thanks to the LG AI Research team for the thought leadership and resources behind this article. The LG AI Research team supported us in creating this content.

 

bnew


This AI Paper from ByteDance Introduces a Hybrid Reward System Combining Reasoning Task Verifiers (RTV) and a Generative Reward Model (GenRM) to Mitigate Reward Hacking​


By Sajjad Ansari

April 1, 2025


Reinforcement Learning from Human Feedback (RLHF) is crucial for aligning LLMs with human values and preferences. Despite introducing non-RL alternatives like DPO, industry-leading models such as ChatGPT/GPT-4, Claude, and Gemini continue to rely on RL algorithms like PPO for policy optimization. Recent research focuses on algorithmic improvements, including eliminating critic models to reduce computational costs, filtering noisy samples during PPO sampling, and enhancing reward models to mitigate reward hacking problems. However, only a few studies focus on RLHF data construction (i.e., training prompts) and its performance scaling based on these training prompts.

The success of RLHF depends heavily on reward-model quality, which faces three challenges: mis-specified reward modeling of human preferences, incorrect and ambiguous preferences in training datasets, and poor generalization. To address these issues, GenRM was introduced to validate model predictions against ground-truth responses; it shows good resistance to reward hacking and has been adopted in advanced LLMs like DeepSeek-V3. Principled data selection filters overly challenging instances during training, while strategic selection methods identify key training prompts that achieve comparable performance with less data. Performance-scaling analysis reveals that RLHF generalizes better than SFT on novel inputs but significantly reduces output diversity.


Researchers from ByteDance Seed address a critical gap in RLHF research where the role of prompt-data construction and its scalability has received less attention. They explore data-driven bottlenecks that limit RLHF performance scaling, focusing on reward hacking and decreasing response diversity challenges. A hybrid reward system is introduced by combining reasoning task verifiers (RTV) and a generative reward model (GenRM) that shows stronger resistance to reward hacking and enables a more accurate assessment of responses against ground-truth solutions. Moreover, a novel prompt-selection method called Pre-PPO is introduced to identify inherently challenging training prompts less susceptible to reward hacking.
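A minimal sketch of these two ideas, under the assumption that tasks with a programmatic verifier (math, code) get a rule-based check while everything else goes to a learned judge; neither function reflects ByteDance's actual implementation.

```python
# Sketch of a hybrid reward: route verifiable tasks to a rule-based
# checker (the RTV path) and fall back to a generative reward model
# (the GenRM path) otherwise. Both scoring paths are stand-ins.
def hybrid_reward(prompt, response, verifier=None, genrm=None, reference=None):
    if verifier is not None:                   # RTV path: exact programmatic check
        return 1.0 if verifier(response) else 0.0
    return genrm(prompt, response, reference)  # GenRM path: learned judge in [0, 1]

def pre_ppo_select(prompts, reward_fn, keep_fraction=0.3):
    """Pre-PPO-style filter: keep the hardest (lowest-reward) prompts,
    on the idea that already-easy prompts invite reward hacking."""
    ranked = sorted(prompts, key=reward_fn)    # lowest reward first
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

# toy usage with stand-in scores
print(hybrid_reward("2+2?", "4", verifier=lambda r: r.strip() == "4"))  # 1.0
scores = {"p1": 0.9, "p2": 0.1, "p3": 0.5, "p4": 0.2}
print(pre_ppo_select(list(scores), scores.get, keep_fraction=0.5))      # ['p2', 'p4']
```

The `keep_fraction` threshold and the use of raw reward as the difficulty signal are assumptions; the paper's selection criterion is more involved.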



The experimental setup employs two pre-trained language models of different scales: a smaller model with 25B parameters and a larger model with 150B parameters. The training dataset contains one million prompts from diverse domains, including mathematics, coding, instruction-following, creative writing, and logical reasoning. Moreover, the researchers constructed a detailed evaluation framework covering multiple skill areas: logical reasoning, instruction-following, STEM tasks, coding, natural language processing, knowledge, contextual understanding, and out-of-distribution generalization. The evaluation framework includes two versions (V1.0 and V2.0) with overlapping prompts, though V2.0 features more challenging prompts.


The experimental results show that the proposed approach combining Pre-PPO with prioritized mathematical and coding tasks consistently outperforms the baseline method across model sizes and evaluation datasets. The approach shows an improvement of +1.1 over the baseline when evaluated at 100-step intervals using TestSet V1.0. When tested on the more challenging TestSet V2.0, the performance improvement increases to +1.4. The most substantial gains appear in mathematics-intensive and coding tasks, with an improvement of +3.9 points in STEM and +3.2 points in coding. These improvements are attributed to the strategic prioritization of mathematical reasoning and coding tasks during early RLHF training phases.

In conclusion, this paper addresses critical bottlenecks in RLHF data scaling, specifically identifying reward hacking and reduced response diversity as significant challenges. The researchers proposed a combined approach featuring strategic prompt construction and early-stage training prioritization to solve this issue. The method uses RTV and GenRM to combat reward hacking alongside the novel Pre-PPO prompt-selection strategy that identifies and prioritizes challenging training prompts. Analysis reveals that RTV supervision shows the strongest resistance to reward hacking, followed by GenRM with ground-truth labels and then the BT reward model. The research establishes a foundation for optimizing RLHF data construction and developing more principled methods for mitigating reward hacking and improving model alignment.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.


 

bnew


Meet ReSearch: A Novel AI Framework that Trains LLMs to Reason with Search via Reinforcement Learning without Using Any Supervised Data on Reasoning Steps​


By Asif Razzaq

March 31, 2025


Large language models (LLMs) have demonstrated significant progress across various tasks, particularly in reasoning capabilities. However, effectively integrating reasoning processes with external search operations remains challenging, especially for multi-hop questions requiring intricate reasoning chains and multiple retrieval steps. Current methods primarily depend on manually designed prompts or heuristics, posing limitations in scalability and flexibility. Additionally, generating supervised data for multi-step reasoning scenarios is often prohibitively expensive and practically infeasible.

Researchers from Baichuan Inc., Tongji University, The University of Edinburgh, and Zhejiang University introduce ReSearch, a novel AI framework designed to train LLMs to integrate reasoning with search via reinforcement learning, notably without relying on supervised reasoning steps. The core methodology of ReSearch incorporates search operations directly into the reasoning chain. Utilizing Group Relative Policy Optimization (GRPO), a reinforcement learning technique, ReSearch guides LLMs to autonomously identify optimal moments and strategies for performing search operations, which subsequently influence ongoing reasoning. This approach enables models to progressively refine their reasoning and naturally facilitates advanced capabilities such as reflection and self-correction.



From a technical perspective, ReSearch employs structured output formats by embedding specific tags—such as <think>, <search>, <result>, and <answer>—within the reasoning chain. These tags facilitate clear communication between the model and the external retrieval environment, systematically organizing generated outputs. During training, ReSearch intentionally excludes retrieval results from loss computations to prevent model bias. Reward signals guiding the reinforcement learning process are based on straightforward criteria: accuracy assessment through F1 scores and adherence to the predefined structured output format. This design encourages the autonomous development of sophisticated reasoning patterns, circumventing the need for manually annotated reasoning datasets.

Experimental evaluation confirms the robustness of ReSearch. When assessed on multi-hop question-answering benchmarks, including HotpotQA, 2WikiMultiHopQA, MuSiQue, and Bamboogle, ReSearch consistently outperformed baseline methods. Specifically, ReSearch-Qwen-32B-Instruct achieved improvements ranging between 8.9% and 22.4% in performance compared to established baselines. Notably, these advancements were achieved despite the model being trained exclusively on a single dataset, underscoring its strong generalization capabilities. Further analyses demonstrated that models gradually increased their reliance on iterative search operations throughout training, indicative of enhanced reasoning proficiency. A detailed case study illustrated the model’s capacity to identify suboptimal search queries, reflect on its reasoning steps, and implement corrective actions autonomously.



In summary, ReSearch presents a significant methodological advancement in training LLMs to seamlessly integrate reasoning with external search mechanisms via reinforcement learning. By eliminating dependency on supervised reasoning data, this framework effectively addresses critical scalability and adaptability issues inherent in multi-hop reasoning scenarios. Its capability for self-reflection and correction enhances its practical applicability in complex, realistic contexts. Future research directions may further extend this reinforcement learning-based framework to broader applications and incorporate additional external knowledge resources.

Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit.

 

bnew



1/3
@ge_yixiao
🚀 Introducing SEED-Bench-R1: RL (GRPO) shines🌟 — but reveals critical gaps between perception and reasoning!
🔗 http://arxiv.org/pdf/2503.24376
💻 GitHub - TencentARC/SEED-Bench-R1
- 📹 Real-world egocentric videos
- 🧠 Tasks balancing perception + logic
- 📊 Rigorous in-distribution/OOD splits





2/3
@ge_yixiao
coauthors: @yshan2u @XihuiLiu



3/3
@EIFY
People are so fast building on GRPO 😅
Would Dr. GRPO make it even better?

[Quoted tweet]
🪂Understanding R1-Zero-Like Training: A Critical Perspective
* DeepSeek-V3-Base already exhibits "Aha moment" before RL-tuning??
* The ever-increasing output length in RL-tuning might be due to a BIAS in GRPO??
* Getting GRPO Done Right, we achieve a 7B AIME sota!
🧵

📜Full details: github.com/sail-sg/understan…
🛠️Code: github.com/sail-sg/understan…





To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

Fresh

SOHH Vet
Joined
May 2, 2013
Messages
9,550
Reputation
5,653
Daps
22,712
since AI is so advanced and becoming more advanced I bought Artificial Intelligence For Dummies, and I bought another AI book too

so I wanna know all about AI, the countless ways it works, and basic understanding, and the future of AI
 

bnew

Gemini 2.5 Pro takes huge lead in new MathArena USAMO benchmark



Posted on Wed Apr 2 14:52:50 2025 UTC





Commented on Wed Apr 2 15:00:35 2025 UTC

That is insane, they go from the 2.0 pro meh model to this masterpiece in such a short time, unreal


│ Commented on Wed Apr 2 17:45:16 2025 UTC

https://i.redd.it/zsqt4be8lgse1.png

│ And they already have a better coding model in LM arena called nightwhisper!



Commented on Wed Apr 2 17:58:33 2025 UTC

https://i.redd.it/wdd1pgpongse1.jpeg

This is the state of the art?


│ Commented on Wed Apr 2 18:23:25 2025 UTC

https://i.redd.it/c7zhspz3sgse1.gif

 

bnew

[New Model] University of Hong Kong releases Dream 7B (Diffusion reasoning model). Highest performing open-source diffusion model to date. You can adjust the number of diffusion timesteps for speed vs accuracy



Posted on Wed Apr 2 17:04:49 2025 UTC




Commented on Wed Apr 2 17:22:56 2025 UTC

It's fascinating watching it generate text:

https://i.redd.it/xci0dlo7hgse1.gif


│ Commented on Wed Apr 2 17:52:06 2025 UTC

│ What the actual fukk…

 

bnew

AI passed the Turing Test



Posted on Wed Apr 2 13:26:20 2025 UTC





Commented on Wed Apr 2 13:28:06 2025 UTC

This paper finds "the first robust evidence that any system passes the original three-party Turing test"

People had a five-minute, three-way conversation with another person and an AI. They picked GPT-4.5, prompted to act human, as the real person 73% of the time, well above chance.

Summary thread: https://twiiit.com/camrobjones/status/1907086860322480233 | https://nitter.poast.org/camrobjones/status/1907086860322480233 | https://xcancel.com/camrobjones/status/1907086860322480233
Paper: https://arxiv.org/pdf/2503.23674
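As a back-of-the-envelope check (not from the paper), a one-sided binomial test shows how far a 73% win rate sits above the 50% chance level. The sample size below is a made-up illustration, not the study's actual number of judgments.

```python
from math import comb

def binom_p_upper(k, n, p=0.5):
    """One-sided P(X >= k) for X ~ Binomial(n, p): the chance of seeing
    at least k "judged human" verdicts if judges were guessing."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n = 100                              # hypothetical number of judgments
k = 73                               # 73% chose the AI as the human
print(binom_p_upper(k, n) < 0.001)   # True: far above the 50% chance level
```

Even at this modest hypothetical n, 73 of 100 under fair-coin guessing has probability on the order of 10^-6, which is why the 73% figure reads as a clear pass rather than noise.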

https://i.redd.it/flojgy87bfse1.png


1/15
@camrobjones
New preprint: we evaluated LLMs in a 3-party Turing test (participants speak to a human & AI simultaneously and decide which is which).

GPT-4.5 (when prompted to adopt a humanlike persona) was judged to be the human 73% of the time, suggesting it passes the Turing test (🧵)





2/15
@camrobjones
In previous work we found GPT-4 was judged to be human ~50% of the time in a 2-party Turing test, where ppts speak to *either* a human or a model.

This is probably easier for several reasons. Here we ran a new study with Turing's original 3-party setup

Large Language Models Pass the Turing Test





3/15
@camrobjones
Participants spoke to two "witnesses" at the same time: one human and one AI. Here are some example convos from the study. Can you tell which one is the human? Answers & original interrogator verdicts in the paper...

You can play the game yourself here: The Turing Test — Can you tell a human from an AI?





4/15
@camrobjones
Across 2 studies (on undergrads and Prolific) GPT-4.5 was selected as the human significantly more often than chance (50%). LLaMa was not significantly worse than chance, suggesting ppts couldn't distinguish it from people. Baselines (ELIZA & GPT-4o) were worse than chance.





5/15
@camrobjones
We also tried giving a more basic prompt to the models, without detailed instructions on the persona to adopt. Models performed significantly worse in this condition (highlighting the importance of prompting), but were still indistinguishable from humans in the Prolific study.





6/15
@camrobjones
As in previous work, people focused more on linguistic and socioemotional factors in their strategies & reasons. This might suggest people no longer see "classical" intelligence (e.g. math, knowledge, reasoning) as a good way of discriminating people from machines.





7/15
@camrobjones
So do LLMs pass the Turing test? We think this is pretty strong evidence that they do. People were no better than chance at distinguishing humans from GPT-4.5 and LLaMa (with the persona prompt). And 4.5 was even judged to be human significantly *more* often than actual humans!



8/15
@camrobjones
Turing is quite vague about exactly how the test should be implemented. As such there are many possible variations (e.g. 2-party, an hour, or with experts). I think this 3-party 5-min version is the most widely accepted "standard" test but planning to explore others in future.



9/15
@camrobjones
Does this mean LLMs are intelligent? I think that's a very complicated question that's hard to address in a paper (or a tweet). But broadly I think this should be evaluated as one among many other pieces of evidence for the kind of intelligence LLMs display.



10/15
@camrobjones
Did LLMs really pass if they needed a prompt? It's a good q. Without any prompt, LLMs would fail for trivial reasons (like admitting to being AI). & they could easily be fine-tuned to behave as they do when prompted. So I do think it's fair to say that LLMs pass.



11/15
@camrobjones
More pressingly, I think the results provide more evidence that LLMs could substitute for people in short interactions without anyone being able to tell. This could potentially lead to automation of jobs, improved social engineering attacks, and more general societal disruption.



12/15
@camrobjones
One of the most important aspects of the Turing test is that it's not static: it depends on people's assumptions about other humans and technology. We agree with @brianchristian that humans could (and should) come back better next year!





13/15
@camrobjones
Thanks so much to my co-author Ben Bergen, to Sydney Taylor (a former RA who wrote the persona prompt!), to Open Philanthropy and to 12 donors on @manifund who helped to support this work.



14/15
@camrobjones
There's lots more detail in the paper Large Language Models Pass the Turing Test. We also release all of the data (including full anonymized transcripts) for further scrutiny/analysis/to prove this isn't an April Fools joke.

The paper's under review and any feedback would be very welcome!



15/15
@TheRundownAI
AI won't replace you, but a person using AI will.

Join 500,000+ readers and learn how to use AI in just 5 minutes a day (for free).




 