1/11
@astonzhangAZ
Our Llama 4’s industry leading 10M+ multimodal context length (20+ hours of video) has been a wild ride. The iRoPE architecture I’d been working on helped a bit with the long-term infinite context goal toward AGI. Huge thanks to my incredible teammates!

Llama 4 Scout

17B active params · 16 experts · 109B total params

Fits on a single H100 GPU with Int4

Industry-leading 10M+ multimodal context length enables personalization, reasoning over massive codebases, and even remembering your day in video

Llama 4 Maverick

17B active params · 128 experts · 400B total params · 1M+ context length

Experimental chat version scores ELO 1417 (Rank #2) on LMArena

Llama 4 Behemoth (in training)

288B active params · 16 experts · 2T total params

Pretraining (FP8) with 30T multimodal tokens across 32K GPUs

Serves as the teacher model for Maverick codistillation

All models use early fusion to seamlessly integrate text, image, and video tokens into a unified model backbone.
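A minimal sketch of what "early fusion" means here as I read it: image/video features are projected into the text embedding space and concatenated into one token sequence before the shared backbone, rather than being fused later through a separate cross-attention stage. The shapes, dimensions, and module names below are illustrative assumptions, not Llama 4 internals.

import torch
import torch.nn as nn

class EarlyFusionInput(nn.Module):
    # Projects vision features into the text embedding space and joins them
    # with text tokens into a single sequence for one shared backbone.
    def __init__(self, vocab_size=128_256, d_model=4096, vision_dim=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.vision_proj = nn.Linear(vision_dim, d_model)

    def forward(self, text_ids, vision_feats):
        # text_ids: (B, T_text) token ids; vision_feats: (B, T_vis, vision_dim)
        text = self.tok_emb(text_ids)
        vision = self.vision_proj(vision_feats)
        # One unified token sequence feeds the backbone (ordering is illustrative).
        return torch.cat([vision, text], dim=1)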

Our post-training pipeline: lightweight SFT → online RL → lightweight DPO. Overuse of SFT/DPO can over-constrain the model and limit exploration during online RL—keep it light.

Aiming for infinite context, rather than merely long context, helps guide us toward better architectures.
We can't train on infinite-length sequences, so framing it as an infinite-context problem narrows the solution space, in particular toward length extrapolation: train on short sequences, generalize to much longer ones.
Enter the iRoPE architecture (“i” = interleaved layers, infinite):

Local, parallelizable chunked attention with RoPE models short contexts only (e.g., 8K)

Only global attention layers model long context (e.g., >8K) without position embeddings—improving extrapolation. Our max training length: 256K.
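Below is a minimal PyTorch sketch of how I read this interleaving: local layers do chunked causal attention (with RoPE) inside fixed 8K windows, while global layers attend over the full sequence with no position embeddings. The chunk handling, interleave ratio, and all function names are my assumptions, not the actual Llama 4 code.

import torch
import torch.nn.functional as F

CHUNK = 8192  # local layers only ever see 8K of context (per the thread)

def local_chunked_attention(q, k, v):
    # Local layer: split the (padded) sequence into fixed 8K chunks and attend
    # causally within each chunk. RoPE would be applied to q/k before this
    # call; it is omitted here for brevity.
    B, H, T, D = q.shape
    pad = (-T) % CHUNK
    if pad:
        q, k, v = (F.pad(x, (0, 0, 0, pad)) for x in (q, k, v))
    n = (T + pad) // CHUNK
    qc, kc, vc = (x.reshape(B, H, n, CHUNK, D) for x in (q, k, v))
    out = F.scaled_dot_product_attention(qc, kc, vc, is_causal=True)
    return out.reshape(B, H, -1, D)[:, :, :T]

def global_attention(q, k, v):
    # Global layer: full causal attention with NO position embeddings, which is
    # what the thread credits for better length extrapolation.
    return F.scaled_dot_product_attention(q, k, v, is_causal=True)

def is_global_layer(layer_idx, every=4):
    # Assumed interleaving: one global (NoPE) layer every `every` layers; the
    # actual ratio is not stated in the thread.
    return layer_idx % every == every - 1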

As context increases, attention weights flatten, making inference harder. To compensate, we apply inference-time temperature scaling at global layers to enhance long-range reasoning while preserving short-context performance (positions below α = 8K are unaffected):
xq *= 1 + log(floor(i / α) + 1) * β # i = position index
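A literal, vectorized reading of that scaling rule as a sketch: α = 8192 matches the 8K figure above, while the value of β and the exact point where the scale is applied are my assumptions.

import torch

def scale_queries(xq, alpha=8192, beta=0.1):
    # xq: (batch, seq_len, heads, head_dim) query states at a global layer.
    # Positions below alpha get scale 1 (log(0 + 1) = 0), so short-context
    # behaviour is untouched; beyond that the scale grows logarithmically.
    seq_len = xq.shape[1]
    pos = torch.arange(seq_len, device=xq.device, dtype=xq.dtype)
    scale = 1.0 + torch.log(torch.floor(pos / alpha) + 1.0) * beta
    return xq * scale.view(1, seq_len, 1, 1)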
We believe in open research. We'll share more technical details very soon—via podcasts. Stay tuned!
2/11
@XiongWenhan
Cool work!
3/11
@astonzhangAZ
Thanks bro!
4/11
@magpie_rayhou
Congrats!
5/11
@astonzhangAZ
Thanks bro!
6/11
@starbuxman
Hi - congrats! I’d love to learn more. Would you be interested in an interview on Coffee + Software? I contribute to the Spring AI project, too.
7/11
@aranimontes
With 20h, you could basically record your whole day, ask it to summarise what happened, and share it by email?
Does the 10M mean it doesn't start hallucinating after some time? Or do I take it it can analyse all of that, just with no guarantee on the "quality"?
8/11
@yilin_sung
Congrats! Look forward to more tech details
9/11
@HotAisle
We've got @AMD MI300x compute to run this model available as low as $1.50/gpu/hr.
10/11
@MaximeRivest
How modular are the experts? Could we load only some, for very specific domain inference, with very short generation?
11/11
@eliebakouch
Also, did you evaluate on other benchmarks such as RULER or HELMET?