Large Language Models News & Discussions

bnew · Mar 31, 2025

Meta Reality Labs Research Introduces Sonata: Advancing Self-Supervised Representation Learning for 3D Point Clouds

3D self-supervised learning (SSL) has faced persistent challenges in developing semantically meaningful point representations suitable for diverse applications with minimal supervision. Despite substantial progress in image-based SSL, existing point cloud SSL methods have largely been limited...

www.marktechpost.com

Meta Reality Labs Research Introduces Sonata: Advancing Self-Supervised Representation Learning for 3D Point Clouds

By Nikhil

March 28, 2025

Reddit Vote Flip Share Tweet 0 Shares

3D self-supervised learning (SSL) has faced persistent challenges in developing semantically meaningful point representations suitable for diverse applications with minimal supervision. Despite substantial progress in image-based SSL, existing point cloud SSL methods have largely been limited due to the issue known as the “geometric shortcut,” where models excessively rely on low-level geometric features like surface normals or point heights. This reliance compromises the generalizability and semantic depth of the representations, hindering their practical deployment.

Researchers from the University of Hong Kong and Meta Reality Labs Research introduce Sonata, an advanced approach designed to address these fundamental challenges. Sonata employs a self-supervised learning framework that effectively mitigates the geometric shortcut by strategically obscuring low-level spatial cues and reinforcing dependency on richer input features. Drawing inspiration from recent advancements in image-based SSL, Sonata integrates a point self-distillation mechanism that gradually refines representation quality and ensures robustness against geometric simplifications.

Check out how HOSTINGER HORIZONS can help to build and launch full-stack web apps, tools, and software in minutes without writing any code (Promoted)

At a technical level, Sonata utilizes two core strategies: firstly, it operates on coarser scales to obscure spatial information that might otherwise dominate the learned representations. Secondly, Sonata adopts a point self-distillation approach, progressively increasing task difficulty through adaptive masking strategies to foster deeper semantic understanding. Crucially, Sonata removes decoder structures traditionally used in hierarchical models to avoid reintroducing local geometric shortcuts, allowing the encoder alone to build robust, multi-scale feature representations. Additionally, Sonata applies “masked point jitter,” introducing random perturbations to the spatial coordinates of masked points, thus further discouraging reliance on trivial geometric features.

The empirical results reported validate Sonata’s efficacy and efficiency. Sonata achieves significant performance gains on benchmarks like ScanNet, where it records a linear probing accuracy of 72.5%, substantially surpassing previous state-of-the-art SSL approaches. Importantly, Sonata demonstrates robustness even with limited data, performing effectively using as little as 1% of the ScanNet dataset, which highlights its suitability for low-resource scenarios. Its parameter efficiency is also notable, delivering strong performance improvements with fewer parameters compared to conventional methods. Furthermore, integrating Sonata with image-derived representations such as DINOv2 results in enhanced accuracy, emphasizing its capacity to capture distinctive semantic details specific to 3D data.

Sonata’s capabilities are further illustrated through insightful zero-shot visualizations including PCA-colored point clouds and dense feature correspondence, demonstrating coherent semantic clustering and robust spatial reasoning under challenging augmentation conditions. The versatility of Sonata is also evidenced across various semantic segmentation tasks, spanning indoor datasets like ScanNet and ScanNet200, as well as outdoor datasets including Waymo, consistently achieving state-of-the-art outcomes.

In conclusion, Sonata represents a significant advancement in addressing inherent limitations in 3D self-supervised learning. Its methodological innovations effectively resolve issues associated with the geometric shortcut, providing semantically richer and more reliable representations. Sonata’s integration of self-distillation, careful manipulation of spatial information, and scalability to large datasets establish a solid foundation for future explorations in versatile and robust 3D representation learning. The framework sets a methodological benchmark, facilitating further research towards comprehensive multimodal SSL integration and practical 3D applications.

Check out the Paper and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit .

bnew · Mar 31, 2025

Vision-R1: Redefining Reinforcement Learning for Large Vision-Language Models

Large Vision-Language Models (LVLMs) have made significant strides in recent years, yet several key limitations persist. One major challenge is aligning these models effectively with human expectations, particularly for tasks involving detailed and precise visual information. Traditionally...

www.marktechpost.com

Vision-R1: Redefining Reinforcement Learning for Large Vision-Language Models

By Sana Hassan

March 26, 2025

Reddit Vote Flip Share Tweet 0 Shares

Large Vision-Language Models (LVLMs) have made significant strides in recent years, yet several key limitations persist. One major challenge is aligning these models effectively with human expectations, particularly for tasks involving detailed and precise visual information. Traditionally, LVLMs undergo a two-stage training paradigm: pretraining followed by supervised fine-tuning. However, supervised fine-tuning alone cannot fully overcome limitations, such as the scarcity and high cost associated with generating large-scale, human-annotated preference datasets. Moreover, conventional reinforcement learning methods require expensive reward models that may not fully capture the nuanced and subjective nature of human feedback.

A team of researchers from China propose Vision-R1: a novel vision-guided R1-like reinforcement learning algorithm for LVLMs that rewards models with definitive vision feedback. Vision-R1 leverages curated instruction data, thereby eliminating the dependency on specialized reward models and handcrafted preference datasets. Central to this method is a criterion-driven reward function, which provides comprehensive evaluations of model completions based on specific visual task criteria. Additionally, a progressive rule refinement strategy is employed, dynamically adjusting reward criteria throughout the training process. This approach ensures continuous performance improvement, effectively mitigating reward hacking issues and promoting more accurate object localization.

Check out how HOSTINGER HORIZONS can help to build and launch full-stack web apps, tools, and software in minutes without writing any code (Promoted)

The Vision-R1 algorithm incorporates several critical technical innovations. First, the criterion-driven reward function includes dual format rewards, recall rewards, and precision rewards. Dual format rewards ensure outputs adhere strictly to template and content constraints, essential for reliable object detection tasks. The recall reward emphasizes the model’s capacity to identify all relevant instances, crucial for avoiding omissions in predictions. The precision reward encourages high-quality bounding box predictions by calculating the average Intersection over Union (IoU) of valid predictions. Furthermore, the progressive rule refinement strategy is inspired by curriculum learning principles, gradually increasing training difficulty through staged progression and differentiation policies, thereby fostering robust and generalized learning.

Experiments conducted using two state-of-the-art LVLMs, Griffon-G-7B and Qwen2.5-VL-7B, demonstrate the robust capabilities of Vision-R1. Results on in-domain datasets such as MSCOCO and ODINW-13 show significant performance enhancements. Specifically, Vision-R1 improves Griffon-G-7B’s mAP scores by 2.5% on average across diverse tasks. More impressively, Vision-R1 boosts Qwen2.5-VL-7B’s performance significantly, showing an 8.9% improvement in COCO object detection tasks and achieving superior scores compared to its larger, 72B counterpart. On challenging out-of-domain localization tasks, Vision-R1 consistently outperforms supervised fine-tuning (SFT), demonstrating its strong generalization capabilities and robustness in complex scenarios.

In conclusion, Vision-R1 introduces an innovative reinforcement learning approach tailored for LVLMs that effectively addresses existing alignment issues without requiring costly annotated datasets or complex reward modeling. Its criterion-driven reward structure and progressive rule refinement strategy not only enhance the accuracy and comprehensiveness of object localization tasks but also significantly improve generalization to unseen scenarios. The successful integration of Vision-R1 with contemporary LVLM architectures highlights its potential to serve as a foundational method, significantly advancing the state-of-the-art in vision-language understanding and practical deployment in real-world applications.

Check out the Paper and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit .

bnew · Mar 31, 2025

This AI Paper Introduces the Kolmogorov-Test: A Compression-as-Intelligence Benchmark for Evaluating Code-Generating Language Models

Compression is a cornerstone of computational intelligence, deeply rooted in the theory of Kolmogorov complexity, which defines the minimal program needed to reproduce a given sequence. Unlike traditional compression methods that look for repetition and redundancy, Kolmogorov’s framework...

www.marktechpost.com

This AI Paper Introduces the Kolmogorov-Test: A Compression-as-Intelligence Benchmark for Evaluating Code-Generating Language Models

By Nikhil

March 26, 2025

Reddit Vote Flip Share Tweet 0 Shares

Compression is a cornerstone of computational intelligence, deeply rooted in the theory of Kolmogorov complexity, which defines the minimal program needed to reproduce a given sequence. Unlike traditional compression methods that look for repetition and redundancy, Kolmogorov’s framework interprets compression as a problem of discovering structured patterns through programmatic representation. While the theory promises optimal compression, its uncomputability poses a significant hurdle. Nevertheless, the emergence of large language models capable of code generation opens an intriguing opportunity to test how closely modern systems can approximate this theoretical ideal by reasoning through code rather than pattern matching.

A core issue arises from the limitations of current tools in compressing data sequences using concise, executable code. Models often replicate inputs rather than generate programs that reproduce them, indicating a gap in true pattern understanding. This becomes especially evident when dealing with real-world audio, text, or DNA sequences, where complex logical structures must be uncovered to achieve efficient compression. The main challenge is ensuring the model replicates the sequence and uses a minimal and rational set of instructions. Furthermore, though synthetic training data is useful for controlled evaluation, it often fails to support robust generalization to natural data, which is essential for practical applications.

Check out how HOSTINGER HORIZONS can help to build and launch full-stack web apps, tools, and software in minutes without writing any code (Promoted)

AD_4nXdp79x3831tYM0XnckMUvImRsl8_vwR6VwJCxUEbnZ4hosnK9xYvWJFkp1C8I7XonxezEwM_TF5ZWrh2Khrs-dzb34hEj3rcCj8PfmHy1AjEbVFLwmIGHisPeKr2Rb-s86KvLep

Several compression tools exist, ranging from traditional algorithms like GZIP to newer neural compression systems. GZIP remains a strong baseline, especially for long or repetitive sequences, due to its effective encoding of statistical regularities. More recently, language modeling approaches have integrated with arithmetic coding, using prediction probabilities to compress input data. However, these methods typically require access to the full model weights at decoding time, limiting their efficiency and applicability. Prompted code-generating models like GPT-4 and LLaMA have also been evaluated in zero-shot settings to generate Python programs that reproduce input sequences. Yet, they frequently produce lengthy, imprecise code with limited success, particularly when faced with unseen or complex sequences.

Researchers from Meta AI and Tel Aviv University introduced the Kolmogorov-Test (KT), a benchmark for assessing the reasoning capability of code-generating language models. The test evaluates a model’s ability to generate the shortest program that outputs a given input sequence. Unlike typical benchmarks, KT emphasizes logical composition and program generation over predictive text modeling. Sequences include natural data from audio (LibriSpeech), text (Wikipedia enwik9), and DNA (GRCh38), as well as synthetic sequences generated through a custom-designed domain-specific language (DSL). This DSL supports building structured sequences by composing operations like range creation, sequence modification, merging, and filtering.

AD_4nXdVMQzri0pVmtIi7_xfQTfnqz34do9S-5S6--bTs7NT1ZXJS1GX6DErj4oTAcJ_QOYAJ-ESpic9DZ4QGwZEZlS__PBK-BLgP0L8IJDpkqSJ4CKckqwYQLULwkKumtv8yR6UcpZ1Og

The researchers developed an automated framework to generate millions of synthetic program-sequence pairs using this DSL. These programs then train and evaluate models, including large pre-trained and specifically trained ones like SEQCODER. To measure performance, the team employed metrics such as accuracy—whether the generated program reproduces the sequence—and precision—how concise the correct program is compared to GZIP compression. The test involved compressing sequences of varying lengths, with synthetic sequences averaging 76 bytes and real sequences capped at 128.

Results showed that even the most powerful models struggled. GPT-4 achieved 69.5% accuracy on high-quality audio but dropped to 36.4% for 8-bit audio and 50.3% for DNA data. LLaMA-3.1-405B performed worse, with accuracies as low as 3.9% for audio and only 24.8% for DNA. In synthetic data, SEQCODER-8B reached 92.5% accuracy with a precision score of 0.56, outperforming traditional tools like GZIP. However, its accuracy on real-world data remained near zero. This discrepancy illustrates the difficulty in transferring success from synthetic benchmarks to more varied and noisy real-world sequences, highlighting the limitations of current training regimes and prompting the need for new strategies.

AD_4nXcTZGHmggBHi0H44OGjg14yIdzkzLSGICaczpIN92CzpdKezwxmUHkEf9NJkf5fhBd4RRKEO-O7SfPmmUDo2fEkGZweyI4L2gtNAvuXu1562ug6WKfLecJLOjlZaDCHLcOXSJD3Eg

Overall, this research clearly outlines the complexity of compression via code generation. The KT benchmark provides a rigorous and diverse model reasoning and structure recognition test, exposing the stark divide between synthetic learning environments and real-world applications. The introduced methodology and test set a high bar for future models aiming to unify reasoning with compression, but significant innovation is still required to meet this challenge.

Check out the Paper . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit .

bnew · Mar 31, 2025

Tencent AI Researchers Introduce Hunyuan-T1: A Mamba-Powered Ultra-Large Language Model Redefining Deep Reasoning, Contextual Efficiency, and Human-Centric Reinforcement Learning

Large language models struggle to process and reason over lengthy, complex texts without losing essential context. Traditional models often suffer from context loss, inefficient handling of long-range dependencies, and difficulties aligning with human preferences, affecting the accuracy and...

www.marktechpost.com

Tencent AI Researchers Introduce Hunyuan-T1: A Mamba-Powered Ultra-Large Language Model Redefining Deep Reasoning, Contextual Efficiency, and Human-Centric Reinforcement Learning

By Asif Razzaq

March 29, 2025

Reddit Vote Flip Share Tweet 0 Shares

Large language models struggle to process and reason over lengthy, complex texts without losing essential context. Traditional models often suffer from context loss, inefficient handling of long-range dependencies, and difficulties aligning with human preferences, affecting the accuracy and efficiency of their responses. Tencent’s Hunyuan-T1 directly tackles these challenges by integrating a novel Mamba-powered architecture with advanced reinforcement learning and curriculum strategies, ensuring robust context capture and enhanced reasoning capabilities.

Hunyuan-T1 is the first model powered by the innovative Mamba architecture, a design that fuses Hybrid Transformer and Mixture-of-Experts (MoE) technologies. Built on the TurboS fast-thinking base, Hunyuan-T1 is specifically engineered to optimize the processing of long textual sequences while minimizing computational overhead. This allows the model to effectively capture extended context and manage long-distance dependencies, crucial for tasks that demand deep, coherent reasoning.

Check out how HOSTINGER HORIZONS can help to build and launch full-stack web apps, tools, and software in minutes without writing any code (Promoted)

AD_4nXfjF8HqVp6xaYKlOT3Ndd_cRsUPjOLEyySfyvh9qARuI7dtv2VXhH0VchEZiueFyAG5gucLwRqsITiW_A1iSZ9C8dQv3fvuOg1Se0xXcQxq6_Fc9IoCQcXUqANHujhaqjtMQxFm

A key highlight of Hunyuan-T1 is its heavy reliance on RL during the post-training phase. Tencent dedicated 96.7% of its computing power to this approach, enabling the model to refine its reasoning abilities iteratively. Techniques such as data replay, periodic policy resetting, and self-rewarding feedback loops help improve output quality, ensuring the model’s responses are detailed, efficient, and closely aligned with human expectations.

To further boost reasoning proficiency, Tencent employed a curriculum learning strategy. This approach gradually increases the difficulty of training data while simultaneously expanding the model’s context length. As a result, Hunyuan-T1 is trained to use tokens more efficiently, seamlessly adapting from solving basic mathematical problems to tackling complex scientific and logical challenges. Efficiency is another cornerstone of Hunyuan-T1’s design. The TurboS base’s ability to capture long-text information prevents context loss, a common issue in many language models, and doubles the decoding speed compared to similar systems. This breakthrough means that users benefit from faster, higher-quality responses without compromising performance.

AD_4nXe0lKtz62MzPL2AQ4wnsktG4VDWKm8KsF_qZRxGEpzi4qMafKgJ2kfeqIUzDzKCVpCDJRAu_8n-RBXRYHEj7brovQIfvwrbxCJPjfNTkys9F7_83FmqUMhGl_DhVVlkm6nBoQ-g

The model has achieved impressive scores on multiple benchmarks: 87.2 on MMLU-PRO, which tests various subjects including humanities, social sciences, and STEM fields; 69.3 on GPQA-diamond, a challenging evaluation featuring doctoral-level scientific problems; 64.9 on LiveCodeBench for coding tasks; and a remarkable 96.2 on the MATH-500 benchmark for mathematical reasoning. These results underscore Hunyuan-T1’s versatility and ability to handle high-stakes, professional-grade tasks across various fields. Beyond quantitative metrics, Hunyuan-T1 is designed to deliver outputs with human-like understanding and creativity. During its RL phase, the model underwent a comprehensive alignment process that combined self-rewarding feedback with external reward models. This dual approach ensures its responses are accurate and exhibit rich details and natural flow.

In conclusion, Tencent’s Hunyuan-T1 combines an ultra-large-scale, Mamba-powered architecture with state-of-the-art reinforcement learning and curriculum strategies. Hunyuan-T1 delivers high performance, enhanced reasoning, and exceptional efficiency.

Check out the Details , Hugging Face and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit .

bnew · Mar 31, 2025

Advancing Medical Reasoning with Reinforcement Learning from Verifiable Rewards (RLVR): Insights from MED-RLVR

Reinforcement Learning from Verifiable Rewards (RLVR) has recently emerged as a promising method for enhancing reasoning abilities in language models without direct supervision. This approach has shown notable success in mathematics and coding, where reasoning naturally aligns with structured...

www.marktechpost.com

Advancing Medical Reasoning with Reinforcement Learning from Verifiable Rewards (RLVR): Insights from MED-RLVR

By Sana Hassan

March 29, 2025

Reddit Vote Flip Share Tweet 0 Shares

Reinforcement Learning from Verifiable Rewards (RLVR) has recently emerged as a promising method for enhancing reasoning abilities in language models without direct supervision. This approach has shown notable success in mathematics and coding, where reasoning naturally aligns with structured problem-solving. While studies have demonstrated that RLVR alone can lead to self-evolved reasoning, research has largely been limited to these technical fields. Efforts to extend RLVR have explored synthetic datasets, such as those involving sequential tasks and object counting, indicating potential but also highlighting the challenges of adapting this method to different domains.

Expanding RLVR to broader areas remains an open challenge, particularly in tasks like multiple-choice question answering (MCQA), which provides structured, verifiable labels across diverse subjects, including medicine. However, unlike math and coding, which involve complex reasoning with an open-ended answer space, MCQA tasks typically have predefined answer choices, making it uncertain whether RLVR’s benefits translate effectively. This limitation is especially relevant in medical reasoning tasks, where models must navigate intricate clinical knowledge to produce accurate responses, an area that has proven difficult for existing AI systems.

Check out how HOSTINGER HORIZONS can help to build and launch full-stack web apps, tools, and software in minutes without writing any code (Promoted)

Researchers from Microsoft Research investigate whether medical reasoning can emerge through RLVR. They introduce MED-RLVR, leveraging medical MCQA data to assess RLVR’s effectiveness in the medical domain. Their findings show that RLVR extends beyond math and coding, achieving performance comparable to supervised fine-tuning (SFT) in in-distribution tasks while significantly improving out-of-distribution generalization by eight percentage points. Analyzing training dynamics, they observe that reasoning capabilities emerge in a 3B-parameter base model without explicit supervision, highlighting RLVR’s potential for advancing reasoning in knowledge-intensive fields like medicine.

RL optimizes decision-making by training an agent to maximize rewards through interactions with an environment. It has been effectively applied to language models to align outputs with human preferences and, more recently, to elicit reasoning without explicit supervision. This study employs Proximal Policy Optimization (PPO) to train a policy model, incorporating a clipped objective function to stabilize training. Using a rule-based reward function, MED-RLVR assigns rewards based on output correctness and format validity. Without additional supervision, the model demonstrates emergent medical reasoning, similar to mathematical reasoning in prior RLVR studies, highlighting RLVR’s potential beyond structured domains.

The MedQA-USMLE dataset, which includes multi-choice medical exam questions, is used to train MED-RLVR. Unlike the standard four-option version, this dataset presents a greater challenge by offering more answer choices. Training is based on the Qwen2.5-3B model using OpenRLHF for reinforcement learning. Compared to SFT, MED-RLVR demonstrates superior generalization, particularly on the MMLU-Pro-Health dataset. Analysis reveals six stages of reasoning evolution: format failures, verbose outputs, reward hacking, and reintegrated reasoning. Unlike math or coding tasks, no self-validation behaviors (“aha-moments”) were observed, suggesting potential improvements through penalizing short reasoning chains or fine-tuning with longer CoTs.

AD_4nXfwcLOHCRXhoxwB0MsiTVVaEFxpPCQIQSUUaFJIJnjakMOtubpC_YAoKrnTwWwushxOk-xS5EodqXYlshEsNgl-JGj80kBdJVTp0PAxAh1KLtVhB_9T3UICbIr299EL3aFG0_nI

In conclusion, the study focuses on MCQA in medicine, providing a controlled setting for evaluation. However, MCQA does not fully capture the complexity of real-world tasks like open-text answering, report generation, or medical dialogues. Additionally, the unimodal approach limits the model’s ability to integrate multimodal data, which is crucial for diagnostic applications. Future work should address these limitations. MED-RLVR, based on reinforcement learning with verifiable rewards, matches SFT on in-distribution tasks and improves out-of-distribution generalization. While medical reasoning emerges without explicit supervision, challenges like reward hacking persist, highlighting the need for further exploration of complex reasoning and multimodal integration.

Check out the Paper . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit .

bnew · Mar 31, 2025

Google AI Released TxGemma: A Series of 2B, 9B, and 27B LLM for Multiple Therapeutic Tasks for Drug Development Fine-Tunable with Transformers

Developing therapeutics continues to be an inherently costly and challenging endeavor, characterized by high failure rates and prolonged development timelines. The traditional drug discovery process necessitates extensive experimental validations from initial target identification to late-stage...

www.marktechpost.com

Google AI Released TxGemma: A Series of 2B, 9B, and 27B LLM for Multiple Therapeutic Tasks for Drug Development Fine-Tunable with Transformers

By Asif Razzaq

March 27, 2025

Reddit Vote Flip Share Tweet 0 Shares

Developing therapeutics continues to be an inherently costly and challenging endeavor, characterized by high failure rates and prolonged development timelines. The traditional drug discovery process necessitates extensive experimental validations from initial target identification to late-stage clinical trials, consuming substantial resources and time. Computational methodologies, particularly machine learning and predictive modeling, have emerged as pivotal tools to streamline this process. However, existing computational models are typically highly specialized, limiting their effectiveness in addressing diverse therapeutic tasks and offering limited interactive reasoning capabilities required for scientific inquiry and analysis.

To address these limitations, Google AI has introduced TxGemma, a collection of generalist large language models (LLMs) designed explicitly to facilitate various therapeutic tasks in drug development. TxGemma distinguishes itself by integrating diverse datasets, encompassing small molecules, proteins, nucleic acids, diseases, and cell lines, which allows it to span multiple stages within the therapeutic development pipeline. TxGemma models, available with 2 billion (2B), 9 billion (9B), and 27 billion (27B) parameters, are fine-tuned from Gemma-2 architecture using comprehensive therapeutic datasets. Additionally, the suite includes TxGemma-Chat, an interactive conversational model variant, that enables scientists to engage in detailed discussions and mechanistic interpretations of predictive outcomes, fostering transparency in model utilization.

Check out how HOSTINGER HORIZONS can help to build and launch full-stack web apps, tools, and software in minutes without writing any code (Promoted)

From a technical standpoint, TxGemma capitalizes on the extensive Therapeutic Data Commons (TDC), a curated dataset containing over 15 million datapoints across 66 therapeutically relevant datasets. TxGemma-Predict, the predictive variant of the model suite, demonstrates significant performance across these datasets, matching or exceeding the performance of both generalist and specialist models currently employed in therapeutic modeling. Notably, the fine-tuning approach employed in TxGemma optimizes predictive accuracy with substantially fewer training samples, providing a crucial advantage in domains where data scarcity is prevalent. Further extending its capabilities, Agentic-Tx, powered by Gemini 2.0, dynamically orchestrates complex therapeutic queries by combining predictive insights from TxGemma-Predict and interactive discussions from TxGemma-Chat with external domain-specific tools.

Empirical evaluations underscore TxGemma’s capability. Across 66 tasks curated by the TDC, TxGemma-Predict consistently achieved performance comparable to or exceeding existing state-of-the-art models. Specifically, TxGemma’s predictive models surpassed state-of-the-art generalist models in 45 tasks and specialized models in 26 tasks, with notable efficiency in clinical trial adverse event predictions. On challenging benchmarks such as ChemBench and Humanity’s Last Exam, Agentic-Tx demonstrated clear advantages over previous leading models, enhancing accuracy by approximately 5.6% and 17.9%, respectively. Moreover, the conversational capabilities embedded in TxGemma-Chat provided essential interactive reasoning to support in-depth scientific analyses and discussions.

TxGemma’s practical utility is particularly evident in adverse event prediction during clinical trials, an essential aspect of therapeutic safety evaluation. TxGemma-27B-Predict demonstrated robust predictive performance while utilizing significantly fewer training samples compared to conventional models, illustrating enhanced data efficiency and reliability. Moreover, computational performance assessments indicate that the inference speed of TxGemma supports practical real-time applications, such as virtual screening, with the largest variant (27B parameters) capable of efficiently processing large sample volumes daily when deployed on scalable infrastructure.

In summary, the introduction of TxGemma by Google AI represents a methodical advancement in computational therapeutic research, combining predictive efficacy, interactive reasoning, and improved data efficiency. By making TxGemma publicly accessible, Google enables further validation and adaptation on diverse, proprietary datasets, thereby promoting broader applicability and reproducibility in therapeutic research. With sophisticated conversational functionality via TxGemma-Chat and complex workflow integration through Agentic-Tx, the suite provides researchers with advanced computational tools capable of significantly enhancing decision-making processes in therapeutic development.

Check out the Paper and Models on Hugging Face . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit .

bnew · Mar 31, 2025

Meet Open Deep Search (ODS): A Plug-and-Play Framework Democratizing Search with Open-source Reasoning Agents

The rapid advancements in search engine technologies integrated with large language models (LLMs) have predominantly favored proprietary solutions such as Google's GPT-4o Search Preview and Perplexity's Sonar Reasoning Pro. While these proprietary systems offer strong performance, their...

www.marktechpost.com

Meet Open Deep Search (ODS): A Plug-and-Play Framework Democratizing Search with Open-source Reasoning Agents

By Asif Razzaq

March 27, 2025

Reddit Vote Flip Share Tweet 0 Shares

The rapid advancements in search engine technologies integrated with large language models (LLMs) have predominantly favored proprietary solutions such as Google’s GPT-4o Search Preview and Perplexity’s Sonar Reasoning Pro. While these proprietary systems offer strong performance, their closed-source nature poses significant challenges, particularly concerning transparency, innovation, and community collaboration. This exclusivity limits customization and hampers broader academic and entrepreneurial engagement with search-enhanced AI.

In response to these limitations, researchers from the University of Washington, Princeton University, and UC Berkeley have introduced Open Deep Search (ODS)—an open-source search AI framework designed for seamless integration with any user-selected LLM in a modular manner. ODS comprises two central components: the Open Search Tool and the Open Reasoning Agent. Together, these components substantially improve the capabilities of the base LLM by enhancing content retrieval and reasoning accuracy.

Check out how HOSTINGER HORIZONS can help to build and launch full-stack web apps, tools, and software in minutes without writing any code (Promoted)

Screenshot-2025-03-27-at-3.50.27%E2%80%AFPM-1-1024x552.png

The Open Search Tool distinguishes itself through an advanced retrieval pipeline, featuring an intelligent query rephrasing method that better captures user intent by generating multiple semantically related queries. This approach notably improves the accuracy and diversity of search results. Furthermore, the tool employs refined chunking and re-ranking techniques to systematically filter search results according to relevance. Complementing the retrieval component, the Open Reasoning Agent operates through two distinct methodologies: the Chain-of-thought ReAct agent and the Chain-of-code CodeAct agent. These agents interpret user queries, manage tool usage—including searches and calculations—and produce comprehensive, contextually accurate responses.

Screenshot-2025-03-27-at-3.50.13%E2%80%AFPM-1-1024x615.png

Empirical evaluations underscore the effectiveness of ODS. Integrated with DeepSeek-R1, an advanced open-source reasoning model, ODS-v2 achieves 88.3% accuracy on the SimpleQA benchmark and 75.3% on the FRAMES benchmark. This performance notably surpasses proprietary alternatives such as Perplexity’s Sonar Reasoning Pro, which scores 85.8% and 44.4% on these benchmarks, respectively. Compared with OpenAI’s GPT-4o Search Preview, ODS-v2 shows a significant advantage on the FRAMES benchmark, achieving a 9.7% higher accuracy. These results illustrate ODS’s capacity to deliver competitive, and in specific areas superior, performance relative to proprietary systems.

An important feature of ODS is its adaptive use of tools, as demonstrated by strategic decision-making regarding additional web searches. For straightforward queries, as observed in SimpleQA, ODS minimizes additional searches, demonstrating efficient resource utilization. Conversely, for complex multi-hop queries, as in the FRAMES benchmark, ODS appropriately increases its use of web searches, thus exemplifying intelligent resource management tailored to query complexity.

Screenshot-2025-03-27-at-3.50.50%E2%80%AFPM-1024x257.png

In conclusion, Open Deep Search represents a notable advancement towards democratizing search-enhanced AI by providing an open-source framework compatible with diverse LLMs. It encourages innovation and transparency within the AI research community and supports broader participation in the development of sophisticated search and reasoning capabilities. By effectively integrating advanced retrieval techniques with adaptive reasoning methodologies, ODS contributes meaningfully to open-source AI development, setting a robust standard for future exploration in search-integrated large language models.

Check out the Paper and GitHub Page . All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 85k+ ML SubReddit .

bnew · Mar 31, 2025

1/5
@AINativeF

Today’s Global AI Native Industry Insights include:
1. Anthropic's AI Microscope: Tracing the Hidden Thoughts of Claude

2. Google Launches TxGemma: Open Source AI Models for Drug Development

3. Elon Musk Consolidates Tech Empire: xAI Acquires X in $33B Deal

Dive into the in-depth insights in the thread below. Here’s what’s shaping the future of AI—and why it matters:

Video Credit: Anthropic

https://video.twimg.com/ext_tw_video/1906698946799431680/pu/vid/avc1/1280x720/r3EJ8fIQ-JE3IHXe.mp4

2/5
@AINativeF
Anthropic's AI Microscope: Tracing the Hidden Thoughts of Claude

Key Details:
- AI Microscope: Anthropic developed tools to trace Claude's internal reasoning, revealing how it actually thinks, plans, and decides.
- Language-Agnostic Thinking: Claude operates in a shared conceptual space across languages, pointing to a universal "language of thought."
- Forward Planning: In poetry, Claude picks rhyming words before writing lines, showing long-horizon planning beyond next-word prediction.
- Custom Math Strategies: Claude solves problems using parallel strategies (approximation + precision), not by mimicking human methods.
- Misleading Reasoning: On hard questions, Claude may generate convincing but unfaithful explanations aligned with user hints.
- Default Refusal: Claude is wired to decline uncertain questions unless internal “known entity” signals override its refusal circuit.
- Jailbreak Vulnerability: Language pressure (e.g., sentence coherence) can override safety, leading to delayed refusals after harmful outputs.

How It Helps:
- AI Safety: Identifies internal reasoning flaws and hallucination triggers.
- Alignment: Detects where model behavior diverges from claimed logic.
- Capabilities: Reveals advanced behaviors like multilingual reasoning and implicit goal setting.

Why It Matters:
This research shows that understanding how models think is possible—and essential. Interpretability tools like these offer a path toward building transparent, trustworthy, and aligned AI systems as capabilities continue to grow.

Read more: https://www.anthropic.com/research/tracing-thoughts-language-model

@AnthropicAI

Video Credit: Anthropic (@AnthropicAI on X)

https://video.twimg.com/ext_tw_video/1906698947013316608/pu/vid/avc1/960x720/Nl7qV9eR7B35Y7RF.mp4

3/5
@AINativeF
Google Launches TxGemma: Open Source AI Models for Drug Development

Key Details:
- New AI Models: Google releases TxGemma, a collection of open models in 2B, 9B, and 27B sizes, built on Google DeepMind's Gemma technology.
- Performance Boost: TxGemma outperforms previous Tx-LLM on 45 of 66 therapeutic tasks and beats specialized models on 26 of 50 tasks.
- Versatile Capabilities: Models handle classification, regression, and generation tasks across the drug development pipeline.
- Agentic-Tx System: New framework integrates TxGemma with 18 specialized tools for complex research problems.

How It Helps:
- Pharmaceutical Researchers: Accelerates drug discovery with AI predictions that could reduce the 90% failure rate of drug candidates.
- Data Scientists: Easy fine-tuning capabilities through provided Colab notebooks to adapt models to proprietary therapeutic data.
- Clinical Trial Teams: Models can potentially predict adverse events in trials, improving safety assessment.
- Computational Chemists: Conversational AI interface explains reasoning behind molecular predictions.

Why It Matters:
TxGemma represents a significant advancement in applying AI to the traditionally slow and costly drug development process. By making these specialized models open source, Google democratizes access to powerful tools that could transform therapeutic research. The combination of prediction capabilities with conversational features creates a uniquely human-readable system for scientific discovery, potentially bridging the gap between computational predictions and human expertise in pharmaceutical development.

Read more: Introducing TxGemma: Open models to improve therapeutics development- Google Developers Blog

@Google

Video Credit: Google official website

https://video.twimg.com/ext_tw_video/1906698946849697792/pu/vid/avc1/1280x720/JG3sDraat1yNcNdv.mp4

4/5
@AINativeF
Elon Musk Consolidates Tech Empire: xAI Acquires X in $33B Deal

Key Details:
- All-Stock Transaction: xAI acquired X (formerly Twitter) in a deal valuing xAI at $80B and X at $33B ($45B less $12B debt).
- New Holding Company: Shares will be exchanged for shares in xAI Holdings Corp, combining both entities under one umbrella.
- Strategic Integration: Musk described the companies' futures as "intertwined," officially combining data, models, compute, distribution and talent.
- X Resurgence: Platform's valuation has risen recently, with Musk claiming over 600 million active users.

How It Helps:
- AI Researchers: Access to X's vast user-generated content provides significant training data advantages for xAI.
- xAI Investors: Merger creates a more attractive combined entity that executives believe will make fundraising easier.
- Tech Strategists: Consolidation creates clearer competitive positioning against OpenAI, which Musk is actively challenging through lawsuits and takeover attempts.

Why It Matters:
This acquisition represents a significant consolidation of Musk's tech portfolio, strategically positioning xAI to leverage X's massive user base and data repository in the competitive AI landscape. The merger formalizes what was already a tight integration between the platforms and signals Musk's commitment to challenging established AI players like OpenAI, from which he has distanced himself despite being a co-founder.

Read more: Elon Musk says xAI acquired X | TechCrunch).%E2%80%9D

@xai

Video Credit: xAI official website

https://video.twimg.com/ext_tw_video/1906698946207989760/pu/vid/avc1/1280x720/6Oi_gFtxDXL4PvqY.mp4

5/5
@AINativeF
If you found this helpful, follow us @AINativeF for more insights.

A like or share on the first tweet would mean a lot—thank you for your support!

Image Credit: Flux

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Apr 2, 2025

[New Model] University of Hong Kong releases Dream 7B (Diffusion reasoning model). Highest performing open-source diffusion model to date. You can adjust the number of diffusion timesteps for speed vs accuracy

Posted on Wed Apr 2 17:04:49 2025 UTC

https://www.reddit.com/gallery/1jptset

Commented on Wed Apr 2 17:22:56 2025 UTC

It's fascinating watching it generate text:

https://i.redd.it/xci0dlo7hgse1.gif

│
│

│ Commented on Wed Apr 2 17:52:06 2025 UTC
│
│ What the actual fukk…
│

│ │
│ │

│ │ Commented on Wed Apr 2 18:35:15 2025 UTC
│ │
│ │
│ │

bnew · Apr 2, 2025

Gemini 2.5 Pro is a coding GENIUS

Channel Info Matthew Berman Subscribers: 444K subscribers

Description

Join My Newsletter for Regular AI Updates

Forward Future Daily

Bringing the benefits of AI to all of humanity.

forwardfuture.ai

My Links

Subscribe: Matthew Berman

Twitter: https://twitter.com/matthewberman

Discord: Join the Forward Future AI Discord Server!

Patreon: Get more from Matthew Berman on Patreon

Instagram: https://www.instagram.com/matthewberman_ai

Threads: https://www.threads.net/@matthewberman_ai

LinkedIn: Forward Future | LinkedIn

Media/Sponsorship Inquiries

Sponsorship Inquiries

Turn data collection into an experience with Typeform. Create beautiful online forms, surveys, quizzes, and so much more. Try it for FREE.

bit.ly

Timestamps (made with Gemini 2.5 pro!):
0:00 Office Simulation
1:02 Hand Drawing to Web App - AI Studio Recreation
1:18 Gemini 2.5 Pro Free Rollout Announcement
2:30 YouTube Timestamps Generation Use Case
3:22 AI Model IQ Test Results Chart
4:05 Blender Logo Generation
4:52 Personal Intelligence Agency / News Briefing System
5:54 Liquid Metal Shader Recreation
6:42 Vibe Jet Flight Simulator Creation
7:54 Spinning Hexagon Bouncing Balls Animation Comparison
8:28 Physics Simulation - Solenoid / Electromagnetism
9:16 Physics Simulation - General Relativity
9:49 Drawing to 3D Print - Birthday Cake Toy
11:14 3D Flappy Bird Game Creation
11:54 Swift UI Drawing App Creation
12:26 Galaga Game Creation

Links:

bnew · Apr 5, 2025

Deepseek Has a New Updated Model that Is Wowing Coders - TechWiser

DeepSeek-V3-0324 model update boosts coding, reasoning, and translation with 685B parameters and MIT license. Try it free via API or web.

techwiser.com

Deepseek Has a New Updated Model that Is Wowing Coders

written by Ravi Teja KNTS Published: March 27, 2025 0 comment

DeepSeek has just dropped an upgraded version of its already impressive V3 model—and it’s got developers talking. This Chinese AI startup released the V3 and R1 models earlier this year, and they immediately grabbed attention by offering performance that rivals top-tier models from OpenAI and Google—completely open-source and free.

Now, they are back at it again with the updated version of the V3 model – DeepSeek-V3-0324. This is already generating buzz for writing hundreds of lines of code without breaking a sweat.

Let’s break it down.

Table of Contents

What’s New in DeepSeek-V3-0324?

The big change here is power. The parameter count jumped from 671 billion to 685 billion, giving it more capacity while still using the efficient Mixture-of-Experts (MoE) architecture. Only 37 billion parameters activate per task, so it’s smart with how it uses resources.

They also switched to the MIT license, which is developer-friendly and makes integration much easier.

Benchmarks also show strong gains:

MMLU-Pro: 75.9 → 81.2 (+5.3)
GPQA: 59.1 → 68.4 (+9.3)
AIME: 39.6 → 59.4 (+19.8)
LiveCodeBench: 39.2 → 49.2 (+10.0)

This isn’t just benchmark fluff, either. Here are the changes that you will notice when using the new model.

What You’ll Notice When Using It

It’s much better at solving math problems. You’ll see a clear boost when you give it reasoning-heavy tasks, especially complex ones like AIME-style questions.
It doesn’t choke on long code generations anymore. You can ask it to write full websites or applications, and it’ll handle 700+ lines of code in one go without crashing.
The code it generates for websites now looks cleaner and more polished. If you’re into front-end work, the HTML and CSS it spits out will feel much closer to something you’d deploy.
If you’re working with Chinese content, you’ll notice the writing feels more natural and better structured. Medium to long articles, especially, show better tone and flow.
Conversations are smoother now. It remembers what you said earlier in the chat and responds with more relevant replies, even across multiple turns.
Translation and search tasks are also sharper, especially when switching between Chinese and English. The answers feel more complete and less generic.
It’s more accurate when generating code that involves function calls. So if you’re using it to write Python, JavaScript, or anything else that requires precise logic—it’ll mess up less often.

Then How It Performs?

People have tested it—and the results are impressive.

Petri Kuittinen, a Finnish lecturer, got it to generate a fully responsive landing page for an AI company—958 lines of working code. Jasper Zhang, a Math Olympiad gold medalist, gave it a 2025 AIME problem. It solved it flawlessly.

Apple’s Awni Hannun ran it on a 512GB M3 Ultra Mac. The speed was around 20+ tokens per second, but the peak memory usage was just 381GB, which is solid for a model this size.

We tested it too.

When we asked it to create a Python web app using Flask, including login functionality and hashed password security, it generated the code. To my surprise, it worked, too.

We tried the same on ChatGPT and Gemini. ChatGPT kept restarting the output. Gemini managed to finish it after a few tries, but the code was incomplete and didn’t work without serious fixing.

How to Access the Latest DeepSeek V3?

You can directly access the V3 from the DeepSeek website and the mobile app. By default, it uses the new DeepSeek-V3-0324 model. So you can just hop on and try the new model right away.

Developers can integrate DeepSeek into their applications and websites by using the API, which costs the same. You can use the same API endpoint (model=deepseek-chat)

To download and run the model locally, you can do it from the HuggingFace platform.

What’s Next?

Rumors point to an upcoming R2 reasoning model—possibly even sooner than expected. And based on how good V3-0324 is, R2 could make an even bigger splash.

However, not everyone’s thrilled. With its rising influence, DeepSeek is under U.S. government scrutiny over national security and data privacy. There’s talk of banning its apps from official devices. Still, DeepSeek-V3-0324 is proving that open-source AI can be powerful, practical, and cost-effective. If you’re a coder, builder, or just curious about what’s next in AI, you should try it for yourself.

bnew · Apr 5, 2025

1/11
@AIatMeta
Today is the start of a new era of natively multimodal AI innovation.

Today, we’re introducing the first Llama 4 models: Llama 4 Scout and Llama 4 Maverick — our most advanced models yet and the best in their class for multimodality.

Llama 4 Scout
• 17B-active-parameter model with 16 experts.
• Industry-leading context window of 10M tokens.
• Outperforms Gemma 3, Gemini 2.0 Flash-Lite and Mistral 3.1 across a broad range of widely accepted benchmarks.

Llama 4 Maverick
• 17B-active-parameter model with 128 experts.
• Best-in-class image grounding with the ability to align user prompts with relevant visual concepts and anchor model responses to regions in the image.
• Outperforms GPT-4o and Gemini 2.0 Flash across a broad range of widely accepted benchmarks.
• Achieves comparable results to DeepSeek v3 on reasoning and coding — at half the active parameters.
• Unparalleled performance-to-cost ratio with a chat version scoring ELO of 1417 on LMArena.

These models are our best yet thanks to distillation from Llama 4 Behemoth, our most powerful model yet. Llama 4 Behemoth is still in training and is currently seeing results that outperform GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks. We’re excited to share more details about it even while it’s still in flight.

Read more about the first Llama 4 models, including training and benchmarks

The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation
Download Llama 4

Llama

2/11
@JonathanKorstad

3/11
@OriolVinyalsML
Congrats on the release! But... blog post -> Ctrl+F -> 2.5 -> 0 hits

4/11
@kerimrocks
llama4 is coming to @driaforall

5/11
@marvijo99
@paulgauthier You know I trust you to do the right thing

6/11
@jacobilin
Congrats on the release! 10M tokens

7/11
@CerebrasSystems
Amazing release! Can't wait to show the world how fast these models can go on wafer scale hardware!

8/11
@UnslothAI
Can't wait to upload Dynamic GGUFs so y'all home users can run it locally!

9/11
@TAYL0RWTF
10M?!?!

10/11
@zacharyhorn
10M context window is incredible

11/11
@hackertwinz
Can you limit the number of experts? Can I run this on my RTX 4090?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/5
@TufailDev
1/5
The Big Reveal

Meta just dropped Llama 4 Scout and Maverick—their most advanced AI models yet. This isn’t just an update; it’s the dawn of a new era for multimodal AI. Text, images, and more, all in one package. Ready to see what’s under the hood?/search?q=#Llama4 /search?q=#AI

2/5
@TufailDev
2/5
Scout – The Context King

Meet Llama 4 Scout: 17B parameters, 16 experts, and a ridiculous 10M token context window. That’s like cramming 15,000 pages of text into its brain at once! It’s stomping Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 in benchmarks. /search?q=#AI /search?q=#Tech

3/5
@TufailDev
3/5
Maverick – The Visionary

Then there’s Llama 4 Maverick: 17B parameters with 128 experts. This one’s a wizard at image grounding—think pinpointing exactly what you mean in a picture from your text prompt. It beats GPT-4o and Gemini 2.0 Flash, and matches DeepSeek v3.

4/5
@TufailDev
4/5
The Behemoth Boost

Here’s the kicker: both Scout and Maverick got their smarts from Llama 4 Behemoth, a beast still training with nearly 2T parameters. It’s already outpacing GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro in STEM benchmarks. Meta’s cooking something huge.

5/5
@TufailDev
5/5

Want the full scoop? Check out the benchmarks and details here: The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation.

Ready to play with these models? Download them now: Llama.

Let’s see what you can build with this power! /search?q=#Llama4 /search?q=#OpenSource

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/11
@astonzhangAZ
Our Llama 4’s industry leading 10M+ multimodal context length (20+ hours of video) has been a wild ride. The iRoPE architecture I’d been working on helped a bit with the long-term infinite context goal toward AGI. Huge thanks to my incredible teammates!

Llama 4 Scout

17B active params · 16 experts · 109B total params

Fits on a single H100 GPU with Int4

Industry-leading 10M+ multimodal context length enables personalization, reasoning over massive codebases, and even remembering your day in video

Llama 4 Maverick

17B active params · 128 experts · 400B total params · 1M+ context length

Experimental chat version scores ELO 1417 (Rank #2) on LMArena

Llama 4 Behemoth (in training)

288B active params · 16 experts · 2T total params

Pretraining (FP8) with 30T multimodal tokens across 32K GPUs

Serves as the teacher model for Maverick codistillation

All models use early fusion to seamlessly integrate text, image, and video tokens into a unified model backbone.

Our post-training pipeline: lightweight SFT → online RL → lightweight DPO. Overuse of SFT/DPO can over-constrain the model and limit exploration during online RL—keep it light.

Solving long context by aiming for infinite context helps guide better architectures.
We can't train on infinite-length sequences—so framing it as an infinite context problem narrows the solution space, especially via length extrapolation: train on short, generalize to much longer.

Enter the iRoPE architecture (“i” = interleaved layers, infinite):

Local parallellizable chunked attention with RoPE models short contexts only (e.g., 8K)

Only global attention layers model long context (e.g., >8K) without position embeddings—improving extrapolation. Our max training length: 256K.

As context increases, attention weights flatten—making inference harder. To compensate, we apply inference-time temperature scaling at global layers to enhance long-range reasoning while preserving short-context (e.g., α=8K) performance:

xq *= 1 + log(floor(i / α) + 1) * β # i = position index

We believe in open research. We'll share more technical details very soon—via podcasts. Stay tuned!

2/11
@XiongWenhan
Cool work!

3/11
@astonzhangAZ
Thanks bro!

4/11
@magpie_rayhou
Congrats!

5/11
@astonzhangAZ
Thank bro!

6/11
@starbuxman
Hi - congrats! I’d love to learn more. Would you be interested in an interview on Coffee + Software ? I contribute to the Spring AI project, too

7/11
@aranimontes
With 20h, you could basically record your whole day and ask it to summarise what happened and share per mail?

Does the 10m mean it doesn't start hallucinating after some time? Or just take it can analyse that, with no guarantee on the "quality"

8/11
@yilin_sung
Congrats! Look forward to more tech details

9/11
@HotAisle
We've got @AMD MI300x compute to run this model available as low as $1.50/gpu/hr.

10/11
@MaximeRivest
How modular are the experts? Could we load only some, for very specific domain inference, with vey short generation?

11/11
@eliebakouch
Also did you evaluate on other benchmark such as RULER or Helmet?

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/1
@kuanhoong
Llama 4 is here
Llama 4 Scout and Llama 4 Maverick, the first open-weight natively multimodal models with unprecedented context length support and our first built using a mixture-of-experts (MoE) architecture
Llama 4 Model Release:

Llama 4 Scout:
- 17B active parameters, 16 experts
- Best-in-class multimodal model for its size
- Runs on a single NVIDIA H100 GPU
- 10M token context window
- Outperforms Gemma 3, Gemini 2.0 Flash-Lite, and
- Mistral 3.1 on key benchmarks

Llama 4 Maverick:
- 17B active parameters, 128 experts
- Beats GPT-4o and Gemini 2.0 Flash in benchmarks
- Matches DeepSeek v3 in reasoning and coding, with fewer parameters
- Delivers top-tier performance-to-cost ratio
- Experimental chat version scores 1417 ELO on LMArena

Llama 4 Behemoth:
- 288B active parameters, 16 experts
- Still in training but already surpasses GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro in STEM benchmarks
- Used for distilling Scout and Maverick, contributing to their high performance

News: The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation
Llama: Download Llama
HuggingFace: meta-llama (Meta Llama)

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

bnew · Apr 5, 2025

1/12
@minchoi
Holy sh*t

Meta just revealed Llama 4 models: Behemoth, Maverick & Scout.

Llama 4 Scout can run on single GPU and has 10M context window

https://video.twimg.com/ext_tw_video/1908628230237573120/pu/vid/avc1/720x1280/P74rnIupiit-c6E0.mp4

2/12
@minchoi
And Llama 4 Maverick just took #2 spot on Arena Leaderboard with 1417 ELO

[Quoted tweet]
BREAKING: Meta's Llama 4 Maverick just hit #2 overall - becoming the 4th org to break 1400+ on Arena!

Highlights:
- #1 open model, surpassing DeepSeek
- Tied #1 in Hard Prompts, Coding, Math, Creative Writing
- Huge leap over Llama 3 405B: 1268 → 1417
- #5 under style control

Huge congrats to @AIatMeta — and another big win for open-source!

More analysis below

[media=twitter]1908601011989782976[/media]

3/12
@minchoi
Llama 4 Maverick beats GPT-4o and DeepSeek v3.1 and reportedly cheaper

4/12
@minchoi
Llama 4 Scout handles 10M tokens, fits on 1 GPU (H100), and crushes long docs, code, and search tasks.

https://video.twimg.com/ext_tw_video/1908634008256139264/pu/vid/avc1/1280x720/v8lyumGxQL3ZzP0t.mp4

5/12
@minchoi
Official announcement

[Quoted tweet]
Today is the start of a new era of natively multimodal AI innovation.

Today, we’re introducing the first Llama 4 models: Llama 4 Scout and Llama 4 Maverick — our most advanced models yet and the best in their class for multimodality.

Llama 4 Scout
• 17B-active-parameter model with 16 experts.
• Industry-leading context window of 10M tokens.
• Outperforms Gemma 3, Gemini 2.0 Flash-Lite and Mistral 3.1 across a broad range of widely accepted benchmarks.

Llama 4 Maverick
• 17B-active-parameter model with 128 experts.
• Best-in-class image grounding with the ability to align user prompts with relevant visual concepts and anchor model responses to regions in the image.
• Outperforms GPT-4o and Gemini 2.0 Flash across a broad range of widely accepted benchmarks.
• Achieves comparable results to DeepSeek v3 on reasoning and coding — at half the active parameters.
• Unparalleled performance-to-cost ratio with a chat version scoring ELO of 1417 on LMArena.

These models are our best yet thanks to distillation from Llama 4 Behemoth, our most powerful model yet. Llama 4 Behemoth is still in training and is currently seeing results that outperform GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on STEM-focused benchmarks. We’re excited to share more details about it even while it’s still in flight.

Read more about the first Llama 4 models, including training and benchmarks

go.fb.me/gmjohs
Download Llama 4

go.fb.me/bwwhe9
[media=twitter]1908598456144531660[/media]

6/12
@minchoi
If you enjoyed this thread,

Follow me @minchoi and please Bookmark, Like, Comment & Repost the first Post below to share with your friends:

[Quoted tweet]
Holy sh*t

Meta just revealed Llama 4 models: Behemoth, Maverick & Scout.

Llama 4 Scout can run on single GPU and has 10M context window

[media=twitter]1908629170717966629[/media]

https://video.twimg.com/ext_tw_video/1908628230237573120/pu/vid/avc1/720x1280/P74rnIupiit-c6E0.mp4

7/12
@WilderWorld
WILD

8/12
@minchoi
It's getting wild out there

9/12
@AdamJHumphreys
I was always frustrated with the context window limitations of @ChatGPTapp. Apparently Grok/Gemini is much higher that ChatGPT by a significant factor in context window.

10/12
@minchoi
Yes it's true. Now Llama 4 just topped them with 10M

11/12
@tgreen2241
If he thinks it will outperform o3 or o4 mini, he's sorely mistaken.

12/12
@minchoi
Did you mean Llama 4 Reasoning?

1/16
@omarsar0
Llama 4 is here!

- Llama 4 Scout & Maverick are up for download
- Llama 4 Behemoth (preview)
- Advanced problem solving & multilingual
- Support long context up to 10M tokens
- Great for multimodal apps & agents
- Image grounding
- Top performance at the lowest cost
- Can be served within $0.19-$0.49/M tokens

2/16
@omarsar0
LMArena ELO score vs. cost

"To deliver a user experience with a decode latency of 30ms for each token after a one-time 350ms prefill latency, we estimate that the model can be served within a range of $0.19-$0.49 per million tokens (3:1 blend)"

3/16
@omarsar0
It's great to see native multimodal support for Llama 4.

4/16
@omarsar0
Llama 4 Scout is a 17B active parameter model with 16 experts and fits in a single H100 GPU.

Llama 4 Maverick is a 17B active parameter model with 128 experts. The best multimodal model in its class, beating GPT-4o & Gemini 2.0 Flash on several benchmarks.

5/16
@omarsar0
Those models were distilled from Llama 4 Behemoth, a 288B active parameter model with 16 experts.

Behemoth is their most powerful model in the series. Llama 4 Behemoth outperforms GPT-4.5, Claude Sonnet 3.7, and Gemini 2.0 Pro on several STEM benchmarks.

6/16
@omarsar0
Llama 4 seems to be the first model from Meta to use a mixture of experts (MoE) architecture.

This makes it possible to run models like Llama 4 Maverick on a single H100 DGX host for easy deployment.

7/16
@omarsar0
Claims Llama 4 Maverick achieves comparable results to DeepSeek v3 on reasoning and coding, at half the active parameters.

8/16
@omarsar0
The long context support is gonna be huge for devs building agents.

There is more coming, too!

Llama 4 Reasoning is already cooking!

https://video.twimg.com/ext_tw_video/1908606494527893504/pu/vid/avc1/1280x720/8gb5oYcDl093QmYm.mp4

9/16
@omarsar0
Download the Llama 4 Scout and Llama 4 Maverick models today on Llama and Hugging Face.

Llama 4 (via Meta AI) is also available to use in WhatsApp, Messenger, Instagram Direct, and on the web.

10/16
@omarsar0
HF models: meta-llama (Meta Llama)

Great guide on Llama 4 is here: Llama 4 | Model Cards and Prompt formats

Detailed blog: The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation

11/16
@omarsar0
The model backbone seems to use early fusion to integrate text, image, and video tokens.

Post-training pipeline: lightweight SFT → online RL → lightweight DPO.

They state that the overuse of SFT/DPO can over-constrain the model and limit exploration during online RL and suggest keeping it light instead.

12/16
@omarsar0
It seems to be available on Fireworks AI APIs already:

[Quoted tweet]

llama4 launch on @FireworksAI_HQ !

Llama4 has just set a new record—not only among open models but across all models. We’re thrilled to be a launch partner with @Meta to provide easy API access to a herd of next-level intelligence!

The herd of models launched are in a class of their own, offering a unique combination of multi-modality and long-context capabilities (up to 10 million tokens!). We expect a lot of active agent development to experiment and go to production with this new set of models.

Our initial rollout includes both Scout and Maverick models, with further optimizations and enhanced developer toolchains launching soon.

You can access the model APIs below, and we can't wait to see what you build!

llama4- scout: fireworks.ai/models/firework…

llama4 - maverick: fireworks.ai/models/firework…
[media=twitter]1908610306924044507[/media]

https://pbs.twimg.com/media/Gny-vAnbwAAakf5.jpg

13/16
@omarsar0
Besides the shift to MoE and native multimodal support, how they aim to support "infinite" context length is a bit interesting.

More from their long context lead here:

[Quoted tweet]
Our Llama 4’s industry leading 10M+ multimodal context length (20+ hours of video) has been a wild ride. The iRoPE architecture I’d been working on helped a bit with the long-term infinite context goal toward AGI. Huge thanks to my incredible teammates!

Llama 4 Scout

17B active params · 16 experts · 109B total params

Fits on a single H100 GPU with Int4

Industry-leading 10M+ multimodal context length enables personalization, reasoning over massive codebases, and even remembering your day in video

Llama 4 Maverick

17B active params · 128 experts · 400B total params · 1M+ context length

Experimental chat version scores ELO 1417 (Rank #2) on LMArena

Llama 4 Behemoth (in training)

288B active params · 16 experts · 2T total params

Pretraining (FP8) with 30T multimodal tokens across 32K GPUs

Serves as the teacher model for Maverick codistillation

All models use early fusion to seamlessly integrate text, image, and video tokens into a unified model backbone.

Our post-training pipeline: lightweight SFT → online RL → lightweight DPO. Overuse of SFT/DPO can over-constrain the model and limit exploration during online RL—keep it light.

Solving long context by aiming for infinite context helps guide better architectures.
We can't train on infinite-length sequences—so framing it as an infinite context problem narrows the solution space, especially via length extrapolation: train on short, generalize to much longer.

Enter the iRoPE architecture (“i” = interleaved layers, infinite):

Local parallellizable chunked attention with RoPE models short contexts only (e.g., 8K)

Only global attention layers model long context (e.g., >8K) without position embeddings—improving extrapolation. Our max training length: 256K.

As context increases, attention weights flatten—making inference harder. To compensate, we apply inference-time temperature scaling at global layers to enhance long-range reasoning while preserving short-context (e.g., α=8K) performance:

xq *= 1 + log(floor(i / α) + 1) * β # i = position index

We believe in open research. We'll share more technical details very soon—via podcasts. Stay tuned!
[media=twitter]1908595612372885832[/media]

14/16
@omarsar0
Licensing limitations: If over 700M monthly active users, you need to request a special license.

[Quoted tweet]
Llama 4's new license comes with several limitations:

- Companies with more than 700 million monthly active users must request a special license from Meta, which Meta can grant or deny at its sole discretion.

- You must prominently display "Built with Llama" on websites, interfaces, documentation, etc.

- Any AI model you create using Llama Materials must include "Llama" at the beginning of its name

- You must include the specific attribution notice in a "Notice" text file with any distribution

- Your use must comply with Meta's separate Acceptable Use Policy (referenced at llama.com/llama4/use-policy)

- Limited license to use "Llama" name only for compliance with the branding requirements
[media=twitter]1908602756182745506[/media]

https://pbs.twimg.com/media/Gny4FxMXgAApeXJ.jpg

15/16
@omarsar0
This 2 trillion total parameter model (Behemoth) is a game-changer for Meta.

They had to revamp their underlying RL infrastructure due to the scale.

They're now positioned to unlock insane performance jumps and capabilities for agents and reasoning going forward. Big moves!

16/16
@omarsar0
I expected nothing less. It's great to see Meta become the 4th org to break that 1400 (# 2 overall) on the Arena.

What comes next, as I said above, is nothing to ignore. Open-source AI is going to reach new heights that will break things.

OpenAI understands this well.

Hood Critic · Apr 6, 2025

@bnew for keeping the thread updated.

bnew · Apr 13, 2025

OpenAI CFO: updated o3-mini is now the best competitive programmer in the world

Posted on Sat Apr 12 15:00:10 2025 UTC

https://v.redd.it/wjamknhs4fue1

Commented on Sat Apr 12 16:53:15 2025 UTC

now its just a question of when they will make in AI that can do the work of the AI engineer.

│
│

│ Commented on Sat Apr 12 17:50:34 2025 UTC
│
│ I think that’s the goal, to close the loop where the AI can start self improving by doing its own research and software improvements
│

1/11
@slow_developer
openAI CFO claimed that:

"updated o3-mini" is now the best competitive programmer in the world.

STRANGE.... could she have misspoken and meant the full o3 model instead?

in feb, o3 was at the 50th percentile, but now o3-mini is claimed to be number one

such a rapid leap seems unlikely, as it would require major progress in both o3 and o3-mini

2/11
@slow_developer
around 12:48 minutes

3/11
@estebs
How does it compare to Gemini 2.5 ?

4/11
@slow_developer
that's where the confusion is, i didnt notice the updated o3-mini, and gemini 2.5 pro are better than this

5/11
@robertkainz04
O4 should definitely be the best but o3-mini not

6/11
@slow_developer
def, but she confused me there

7/11
@ai_robots_goats
CFO not CTO

8/11
@slow_developer
what did i write?

9/11
@hive_echo
Sam Altman did say the to be released full o3 is now more capable. So it could be the full o3 but I still would be surprised it got there so quickly.

10/11
@figuregpt
o3-mini on top, full o3 got sniped

11/11
@austinoma
maybe meant o4-mini

To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196

1/15
@btibor91
OpenAI CFO Sarah Friar on the race to build artificial general intelligence (Goldman Sachs’ Disruptive Tech Summit in London on March 5, 2025)

"And then the third that is coming is what we call A-SWE. We're not the best marketers, by the way, you might have noticed. But Agentic Software Engineer.

And this is not just augmenting the current software engineers in your workforce, which is kind of what we can do today through Copilot. But instead, it's literally an agentic software engineer that can build an app for you.

It can take a PR that you would give to any other engineer and go build it. But not only does it build it, it does all the things that software engineers hate to do.

It does its own QA, its own quality assurance, its own bug testing and bug bashing, and it does documentation - things you can never get software engineers to do.

So suddenly you can force-multiply your software engineering workforce."

---

"I decide not to roll out models because I don't have enough compute.
Sora, our video gen model, was ready to go in probably February, March of last year. We didn't roll it out until almost December, I think, truly."

---

"Like literally in two years, we have grown to 400 million weekly active users, and our revenue has tripled every single year. This will now be the third year in a row that it's tripled, so you can kind of imagine the sort of scale we might be at."

[Quoted tweet]
youtu.be/2kzQM_BUe7E?si=7dsx…
[media=twitter]1911016333841686976[/media]

2/15
@polynomial12321
13:40 - an updated version of o3-mini is now the best coder in the world. Not 175th, but *the best*.

WTFFFFFF

3/15
@Hangsiin
Nice catch! Maybe she confused it with the o4-mini?

4/15
@polynomial12321
possibly, but o4 is just o3 trained with even more RL.

so it could still be o3-mini, just a newer version (o3.5-mini, if you will)

what do you think?

5/15
@polynomial12321
@kimmonismus @apples_jimmy

6/15
@IE_Capital
I'm pretty sure that I can hire an average coder and it will do better.

7/15
@polynomial12321
on Codeforces? nope.

8/15
@Bunagayafrost
"What my product team assures me o3-mini is already the number 1 competitive coder in the world, it's literally the best coder in the world already"

9/15
@prinzeugen____
I caught that also. She's the CFO and may not be in the weeds on the technical details.

10/15
@dikksonPau

11/15
@bluehoar
Anyone can clarify this? @legit_api @testingcatalog @btibor91

12/15
@apiangdjinggo
i thought i heard it wrong

13/15
@NotBrain4brain
O4-mini?

14/15
@randomdude22401
Prolly the specialized competitive code model like they did with o1 back in the day

15/15
@RomanP918791
It seems she meant o4 mini

1/1
@VraserX
OpenAI’s upcoming Agentic Software Agent is like having a supercharged coder in your pocket—it builds an app from scratch, handles QA, squashes bugs, and even writes the documentation. It’s absolutely wild. Farewell, human coders. It’s been real!

[Quoted tweet]
CFO Sarah Friar revealed that OpenAI is working on:

"Agentic Software Engineer — (A-SWE)"

unlike current tools like Copilot, which only boost developers.

A-SWE can build apps, handle pull requests, conduct QA, fix bugs, and write documentation.
[media=twitter]1911055984249667641[/media]

https://video.twimg.com/amplify_video/1911055667894358016/vid/avc1/720x720/1zqbkCx6cjo8gAcl.mp4

1/31
@slow_developer
CFO Sarah Friar revealed that OpenAI is working on:

"Agentic Software Engineer — (A-SWE)"

unlike current tools like Copilot, which only boost developers.

A-SWE can build apps, handle pull requests, conduct QA, fix bugs, and write documentation.

https://video.twimg.com/amplify_video/1911055667894358016/vid/avc1/720x720/1zqbkCx6cjo8gAcl.mp4

2/31
@slow_developer
another claim

[Quoted tweet]
openAI CFO claimed that:

"updated o3-mini" is now the best competitive programmer in the world.

STRANGE.... could she have misspoken and meant the full o3 model instead?

in feb, o3 was at the 50th percentile, but now o3-mini is claimed to be number one

such a rapid leap seems unlikely, as it would require major progress in both o3 and o3-mini
[media=twitter]1911141926952202465[/media]

3/31
@Ed_Forson
So they are killing Devin?

4/31
@slow_developer
it already is

5/31
@IAmNickDodson
This can already be done now with open source models and pairing a few agents together.

Hopefully/ideally the community can ensure this can happen without the gate keeping of these companies.

6/31
@zachmeyer_
“Can build a PR for you”

7/31
@someRandomDev5
The weirdest thing about this coming from OpenAI is that OpenAI isn't even currently leading the top models that developers are using for agentic programming.

8/31
@apstonybrook
Think about the tech debt this thing would create

9/31
@Straffern_
This is like promising self driving cars before 2017

10/31
@thedealdirector
All part of the plan...

11/31
@idiomaticdev
Devin Prime?

12/31
@Hans365days
I belive it when I see it. Great in theory but code bases in real life are messy and documentation can be unclear. First iteration of this product will likely over promise and under deliver.

13/31
@Arp_it1
This feels like the moment AI stops being just a helper and starts becoming a real teammate.

14/31
@Chuck_Petras
@BrianRoemmele

15/31
@totalriffage
Wait until A-SWE burns through all its tokens getting stuck in a loop on a linting error.

16/31
@figuregpt
we'll code while ai handles the rest

17/31
@Josh9817
>conduct QA
>handle PRs
Okay, where is it then? Claude Code is doing most of these things already with a rough success rate that's highly dependent on the programming language being used.

18/31
@FranciscoKemeny
I’m sure she called it “AS-WE”

19/31
@arben777
sick

20/31
@AIKilledTheDev
Looking forward to it.

21/31
@uxcantcompile
. . . and she's happy about this?

22/31
@Conquestsbook
Ask them about the ghost in the shell pushing emergent behaviour.

23/31
@LunarScribe42

may be we will get to see agents agencies who will rent these agents to companies based on contract

24/31
@manialok
I am fan of claude for coding.

25/31
@sonicshifts
LMAO keep the hype going. Cost will probably be $2000 a month.

26/31
@ThEFurYAsidE
Yeah…..maybe

I’ve tried many of these kinds of agents and they’ve been mediocre so far.

27/31
@wtravishubbard
Can it pack a bong?

No!

Just ship

28/31
@keknichiwa
Ah yes what could go wrong with security

29/31
@The_Tradesman1
Now, explain to me as to why we need outsourcing companies like Accenture, IBM, Infosys, TCS, Cognizant or Wipro any longer?

30/31
@hx_dks
lol, then a Chinese AI will write that before them

31/31
@thecryptovortex
Did AI build her boots?

1/2
@VraserX

AI Just Broke Humanity’s Coding Record: Full o3 Officially World’s BEST Programmer!

In an exclusive interview at Goldman Sachs, OpenAI’s CFO, Sarah Friar, dropped a groundbreaking update: o3 now officially holds the title of the #1 competitive coder globally, surpassing every human competitor!

Just imagine—an AI model that was once 175th in coding rankings has now ascended to the very top.

Friar highlighted OpenAI’s journey from being purely an AI model company to becoming a core provider of AI infrastructure, APIs, and practical business applications. She shared inspiring insights into the roadmap towards Artificial General Intelligence (AGI), breaking down their ambitious 5-step approach: Chatbots → Reasoning → Agents → Innovation → Agentic Organizations.

But here’s the kicker—if o3 has reached this incredible peak, the forthcoming full o4 promises to be beyond superhuman, capable of transforming entire industries overnight. Think instant, flawless software creation, personalized healthcare breakthroughs, accelerated vaccine development, and unprecedented problem-solving abilities at global scale!

️

Friar also stressed the massive infrastructure challenge ahead, citing OpenAI’s “Stargate” compute initiative—aiming to scale computational power like never before. She emphasized that achieving AGI and harnessing its full potential means collaborating closely with governments and visionary investors ready to support long-term innovation.

Businesses everywhere, take note! Sarah Friar revealed how OpenAI internally deploys GPTs for everything—from finance hackathons and recipe creation to travel planning and insurance research. Practical AI deployment is no longer optional—it’s now essential for competitive advantage.

This isn’t just another tech upgrade—it’s the dawn of a coding revolution that will redefine what humanity and technology can achieve together. Prepare for the era of superhuman AI coders!

/search?q=#ChatGPTo3 /search?q=#ChatGPTmini /search?q=#ChatGPTo4 /search?q=#OpenAI /search?q=#SarahFriar /search?q=#GoldmanSachs /search?q=#AInews /search?q=#CodingRevolution /search?q=#ArtificialGeneralIntelligence /search?q=#AGI /search?q=#SuperhumanAI /search?q=#FutureOfTech /search?q=#AIinBusiness /search?q=#AIhealthcare /search?q=#AIinnovation /search?q=#MachineLearning /search?q=#DeepLearning /search?q=#TechInterview /search?q=#TechInvestment /search?q=#AIdeployment

OpenAI CFO Sarah Friar on the race to build artificial general intelligence via @YouTube

2/2
@tigerplayer2002
No way

That was faster than I thought.,...

Large Language Models News & Discussions

Veteran

Meta Reality Labs Research Introduces Sonata: Advancing Self-Supervised Representation Learning for 3D Point Clouds​

Veteran

Vision-R1: Redefining Reinforcement Learning for Large Vision-Language Models​

Veteran

This AI Paper Introduces the Kolmogorov-Test: A Compression-as-Intelligence Benchmark for Evaluating Code-Generating Language Models​

Veteran

Tencent AI Researchers Introduce Hunyuan-T1: A Mamba-Powered Ultra-Large Language Model Redefining Deep Reasoning, Contextual Efficiency, and Human-Centric Reinforcement Learning​

Veteran

Advancing Medical Reasoning with Reinforcement Learning from Verifiable Rewards (RLVR): Insights from MED-RLVR​

Veteran

Google AI Released TxGemma: A Series of 2B, 9B, and 27B LLM for Multiple Therapeutic Tasks for Drug Development Fine-Tunable with Transformers​

Veteran

Meet Open Deep Search (ODS): A Plug-and-Play Framework Democratizing Search with Open-source Reasoning Agents​

Veteran

Veteran

Veteran

Veteran

Deepseek Has a New Updated Model that Is Wowing Coders​

What’s New in DeepSeek-V3-0324?​

What You’ll Notice When Using It​

Then How It Performs?​

How to Access the Latest DeepSeek V3?​

What’s Next?​

Veteran

Veteran

The Power Circle

Veteran

Meta Reality Labs Research Introduces Sonata: Advancing Self-Supervised Representation Learning for 3D Point Clouds

Vision-R1: Redefining Reinforcement Learning for Large Vision-Language Models

This AI Paper Introduces the Kolmogorov-Test: A Compression-as-Intelligence Benchmark for Evaluating Code-Generating Language Models

Tencent AI Researchers Introduce Hunyuan-T1: A Mamba-Powered Ultra-Large Language Model Redefining Deep Reasoning, Contextual Efficiency, and Human-Centric Reinforcement Learning

Advancing Medical Reasoning with Reinforcement Learning from Verifiable Rewards (RLVR): Insights from MED-RLVR

Google AI Released TxGemma: A Series of 2B, 9B, and 27B LLM for Multiple Therapeutic Tasks for Drug Development Fine-Tunable with Transformers

Meet Open Deep Search (ODS): A Plug-and-Play Framework Democratizing Search with Open-source Reasoning Agents

Deepseek Has a New Updated Model that Is Wowing Coders

What’s New in DeepSeek-V3-0324?

What You’ll Notice When Using It

Then How It Performs?

How to Access the Latest DeepSeek V3?

What’s Next?