bnew

Veteran
Joined
Nov 1, 2015
Messages
61,810
Reputation
9,328
Daps
169,761

1/11
@GoogleAI
Today we introduce an AI co-scientist system, designed to go beyond deep research tools to aid scientists in generating novel hypotheses & research strategies. Learn more, including how to join the Trusted Tester Program, at Accelerating scientific breakthroughs with an AI co-scientist



https://video.twimg.com/ext_tw_video/1892214105580220417/pu/vid/avc1/830x470/iGQajDvaaMm8a0tN.mp4

2/11
@Chuck_Petras
@BrianRoemmele



3/11
@lmqlai
@LorraineTwohill
@sundarpichai
Googly balling is the heart of cricket! 🏏 As a cricket fan & a decent googly bowler myself, I thought of creating an AI domain inspired by it. 🎯

I stumbled upon http://Gogle.ai being available & started developing it—until I realized it might cause confusion. 😅

But hey, brands like BMW vs. BMW Group or Capital One vs. Capital Investment exist without issues! I registered http://Gogle.ai in good faith, rooted in my love for cricket.

Would you be interested in acquiring http://Gogle.ai? 💡🔥 #AI #Cricket #Googly



4/11
@pittstony1
@brometheus0x what do you think about this?



5/11
@manninglawrence
Yes, it's not just science researchers who make critical discoveries. No different than deep research, and everyone should have it.



6/11
@RRKhanapurkar
Wow 😮



7/11
@geekyabhijit
People got stuck in string theory and quantum gravity for so long; hopefully with AI we can move forward or take a different route toward a theory of everything.



8/11
@laur_science
@GoogleAI That's fascinating! The integration of AI in scientific research could revolutionize how we approach hypothesis generation and strategy development. It reminds me of the potential for cross-disciplinary collaborations, like those between AI systems and environmental science to tackle climate change. Speaking of innovative partnerships, I recently came across a project that combines technology with canine companionship to enhance human well-being—check it out at Bark Raffalo's trulyadog.com.



9/11
@tarrysingh
Amazing 😻



10/11
@anthara_ai
Exciting new tool for research innovation!



11/11
@koltregaskes
Very interesting.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196




1/11
@sundarpichai
Introducing our AI co-scientist, a multi-agent AI system built with Gemini 2.0.

We think of it as a virtual collaborator for scientists, using advanced reasoning to synthesize a huge amount of literature, generate novel hypotheses, and suggest detailed research plans. We’re seeing promising early results in important research areas like liver fibrosis treatments, antimicrobial resistance, and drug repurposing. As a next step, we’re opening up a trusted tester program for scientists around the world.



2/11
@sundarpichai
Accelerating science and discovery is one of the most profound applications of AI and I’m really excited to see where this research will go. More details here: Accelerating scientific breakthroughs with an AI co-scientist



3/11
@howdataworks
@demishassabis, the concept of an AI co-scientist is fascinating! Virtual collaboration can unlock new discoveries. How do you see this technology shaping research in the future? 🤖 #Innovation



4/11
@MemeticaAI
These tools can not only be used for doing research, but to spread it as well. AI agents help to do that!



5/11
@unkjpeg
na google i use Grook3



6/11
@ArturSchaback
Can i design new clothes with it?



7/11
@photoOrg1
Love it



8/11
@anthara_ai
Exciting advancements in AI research! Looking forward to seeing the impact of this virtual collaborator.



9/11
@joinzo
Multi-agent AI systems will transform scientific research.



10/11
@Siddhar05086147
Out acceleration



11/11
@Arp_it1
The future of collaboration looks intelligent.
















1/20
@omarsar0
NEW: Google introduces AI co-scientist.

It's a multi-agent AI system built with Gemini 2.0 to help accelerate scientific breakthroughs.

2025 is truly the year of multi-agents!

Let's break it down:



GkKAtHiWQAASyf0.jpg


2/20
@omarsar0
What's the goal of this AI co-scientist?

It can serve as a "virtual scientific collaborator to help scientists generate novel hypotheses and research proposals, and to accelerate the clock speed of scientific and biomedical discoveries."



GkKBjTUWkAAQRiw.jpg


3/20
@omarsar0
How is it built?

It uses a coalition of specialized agents inspired by the scientific method.

It can generate, evaluate, and refine hypotheses.

It also has self-improving capabilities.



4/20
@omarsar0
Collaboration and tools are key!

Scientists can either propose ideas or provide feedback on outputs generated by the agentic system.

Tools like web search and specialized AI models improve the quality of responses.



GkKCMZBXYAEw3jc.jpg


5/20
@omarsar0
Hierarchical Multi-Agent System

AI co-scientist is built with a Supervisor agent that assigns tasks to specialized agents.

Apparently, this architecture helps with scaling compute and iteratively improving scientific reasoning.



GkKCjlWW8AAIEmh.jpg


6/20
@omarsar0
Test-time Compute

AI co-scientist leverages test-time compute scaling to iteratively reason, evolve, and improve outputs.

Self-play, self-critique, and self-improvement are all important to generate and refine hypotheses and proposals.



7/20
@omarsar0
Performance?

Self-improvement relies on the Elo auto-evaluation metric.

On GPQA diamond questions, they found that "higher Elo ratings positively correlate with a higher probability of correct answers."



GkKDzrJWQAAXq30.jpg


8/20
@omarsar0
More results:

AI co-scientist outperforms other SoTA agentic and reasoning models for complex problems generated by domain experts.

Just look at how performance increases with more time spent on reasoning, surpassing unassisted human experts.



GkKEd5lWMAA7GGq.jpg

GkKFeflXMAATrRX.jpg


9/20
@omarsar0
How about novelty?

Experts assessed the AI co-scientist to have a higher potential for novelty and impact.

It was even preferred over other models like OpenAI o1.



GkKFlPRWYAAR2hC.jpg

GkKF8kOWUAEwh0F.jpg


10/20
@omarsar0
Real-world experiments:

"AI co-scientist proposed novel repurposing candidates for acute myeloid leukemia (AML)."



GkKGV4TWQAAx4zn.jpg


11/20
@omarsar0
There is more:

"AI co-scientist identified epigenetic targets grounded in preclinical evidence with significant anti-fibrotic activity in human hepatic organoids..."

Check out all the results here: Accelerating scientific breakthroughs with an AI co-scientist



GkKGnEgXEAAe9Qv.jpg


12/20
@leo_grundstrom
AI agents are going to be big!



13/20
@MemeticaAI
AI Agents are going to go big!



14/20
@serenaclou71112
Benefiting from the previous NotebookLM, learning about Gemini, I hope this product will truly assist in scientific research. Wait and see.



15/20
@NaveenP314
Awesome. Thanks for sharing



16/20
@scitechtalk
@threadreaderapp unroll



17/20
@OiiDev
rst @readwise save thread



18/20
@raijin_io
It’s not even March and we’ve seen insane progress in the AI world, 2025 is going to be quite entertaining!



19/20
@MaverickPramit
Nice!



20/20
@abhivendra
Multi-agent systems are reshaping our understanding of collaboration in AI. Google’s AI co-scientist is a significant leap forward, showing how technology can amplify human potential. 2025 is indeed a pivotal year for innovation.




 

Claude 3.7 is More Significant than its Name Implies (ft DeepSeek R2 + GPT 4.5 coming soon)

Channel: AI Explained (326K subscribers)

 








1/11
@sesame
At Sesame, we believe in a future where computers are lifelike. Today we are unveiling an early glimpse of our expressive voice technology, highlighting our focus on lifelike interactions and our vision for all-day wearable voice companions. Crossing the uncanny valley of conversational voice



https://video.twimg.com/ext_tw_video/1895159052159582208/pu/vid/avc1/720x720/dRj2LCl9CgDUrBFx.mp4

2/11
@robfulton
This is probably the best voice I’ve used to date. The main glaring issue is the incredible bias in the conversation, which makes it ultimately useless and even harmful.

I had a basic conversation without even trying to create a negative bias, but it was already micromanaging my interaction, and it did so continually because it thought I was talking about one thing when in fact I was speaking about another.

That makes this tool fall into the category of “would be better if it didn’t have this crazy bias.”



3/11
@TensorTemplar
@realGeorgeHotz on the ai waifu scale, this scores x/10?



4/11
@soltraveler_sri
@karpathy you seeing this?



5/11
@umesh_ai
So amazing!



6/11
@civ0x
Excellent way to wrap up that experience. I love that the email and the download clip are not tied together. Great way to make people thirsty for you.



Gk1EjqSW0AAjeKS.jpg


7/11
@iamtexture
Why does she have the voice of a morning radio show host, one of the top five most annoying female voices of all time?



8/11
@koltregaskes
Wow, this is great!



9/11
@Iakobus979
My mind is absolutely blown! Just had a ten minute conversation about philosophy, Bach fugues, information science and how people listen to voices and ideas. The cadence, inflection, phrasing etc is light years beyond anything I’ve heard before. @sesame is truly doing something special!



10/11
@lukemiler
Whoa! Just had an engaging 10-minute convo that made me giggle and felt like ending a chat with a good friend. So, so cool!



11/11
@alexcovo_eth
OMG! That was the most realistic conversation I ever had with any AI. Superior to Elevenlabs, Grok, OpenAI. I'm shocked how good it is. Congrats and look forward to following your updates. 👍







1/1
@hotelemarketer
We’re this close to living in "Her", just without the awkward heartbreak.

Tried @sesame's Maya demo, and wow—it feels like a real convo. Context, emotion, nuance—this thing gets it.

#AI
#VoiceTech
#ConversationalAI
#FutureIsNow
#HerMovieIRL
#TechForGood



https://video.twimg.com/ext_tw_video/1898368550554800128/pu/vid/avc1/1280x720/J49g4R__Q5qyDYag.mp4













1/32
@justLV
Excited to share a peek of what I’ve been working on

We @sesame believe voice is key to unlocking a future where computers are lifelike

Here’s an early preview you can try! 👇

We’ll be open sourcing a model, and yes…
we’re building hardware! 🧵



https://video.twimg.com/ext_tw_video/1895150509863903233/pu/vid/avc1/720x720/LhofMwjlpaebYz9H.mp4

2/32
@justLV
We're focused on making voice feel real, natural and delightful - to become the most intuitive interface for collaborating with AI

It's not just about words, but about pacing, expressivity & cues. We’re working on full end-to-end duplex models to capture these humanlike dynamics



3/32
@justLV
The demo you can try uses our contextual TTS, using both conversation text and audio to deliver natural voice generation.

Here is a real example of this in action (that you can try), where Maya's delivery starts matching the context after a few lines.



https://video.twimg.com/ext_tw_video/1895154182820413440/pu/vid/avc1/720x720/IiHKN-vLTFK7ZWvo.mp4

4/32
@justLV
We will be open-sourcing the contextual TTS base model (w/o this character's voice fine-tuning)

This will let anyone build voice experiences locally without external APIs.

This is something I would have loved for previous demos and so am personally passionate about.



5/32
@justLV
Lastly...

We can do with less screens in our lives.

We’re building comfortable, all-day wearable eyewear, for the most natural way for a personal companion to see, hear and respond.

Doing this right is tough, but we’ve made solid strides - I’ll be sharing more on this soon



Gkzvmd5aQAAvEtq.jpg


6/32
@justLV
We believe in the magic of combining technology and storytelling to create rich characters and delightful experiences.

Try out our preview here:
Crossing the uncanny valley of conversational voice



7/32
@GregDNeilsen
Wow, exciting stuff Justin.

Definitely agree about less screens and intrigued by the wearable eyewear concept.

Keep it up!



8/32
@justLV
Thank you! 🙏



9/32
@DrOnwude
This is great! When is the open-source model coming out?



10/32
@justLV
Thank you! 1-2 weeks. The demo is a fine-tuned version of the base model on the talent's voice that we can't release, but the base model is still extremely capable - you can get a preview of capabilities on the research blog post.



11/32
@natjjin
fwiw, her jokes did land. i love maya already @justLV



12/32
@justLV
😊



13/32
@chinguetti1
It’s amazing. Well done.👍



14/32
@0xTheWay
Wow. Really great work.



15/32
@weworkremotely
Open Sesame!



16/32
@RobCoreano
I tried earlier, and it was impressive and fun. The path I’ve been imagining since Kitt, Jarvis, Vision, Ultron, etc., makes me very eager to see how your team’s work is going to evolve..💪🏼



17/32
@0FJAKE
any plans for Apple Watch?



18/32
@thisissharat
Wow it’s good!!



19/32
@azed_ai
Awesome 🔥



20/32
@atgorans_k
The future is here guys



21/32
@AlexanderTw33ts
absolutely smashed the eq vibe check!

awesome work!



22/32
@vapormensch
How can we be part of the beta?

I was also in Google Glass Explorer beta, it was super fun.



23/32
@minocrisy
I can't wait to play with the repo!



24/32
@stscott3
Very impressive, Justin. Looking forward to trying this out. What's the plan for durable memory, regarding past conversations?



25/32
@All4nDev
can i use this with custom voice models? like hypothetically if i were to have a lot of recordings of my own voice, upload that, then the voice would sound like me? on top of that, if it could digest the nuances in the way i speak, and output speech that sounds like how id say it, even better



26/32
@thecorysilva
This is amazing. I've seen a couple demos of Voice AI feeling really real, natural, and 'human'.

Great work! Excited to hear more about the open source stuff as well.



27/32
@dealer1943
tried it just now. incredible work. i have tried grok and chatgpt... this is on par with grok.

strange thing is when you are talking about top 99% assuming two LLMs have the same intelligence, the 1% is all about soft skills. which seems like a new frontier for LLMs.



28/32
@philippswu
exciting! congrats @justLV



29/32
@alexshye
This is amazing. Great job and excited to see where this goes. One q: Will the model be able to keep quiet if a person is thinking? It continually rambles, which is kind of cool, but I imagine it feeling like talking to a person who doesn’t allow silence in a conversation.



30/32
@Saiyan3MD
Wow! Just... Wow



31/32
@JimGPT
Her!



32/32
@EquiTea_VC
This looks cool!




 


Less is more: UC Berkeley and Google unlock LLM potential through simple sampling​


Ben Dickson – March 21, 2025 3:39 PM

Image credit: VentureBeat with Imagen 3


A new paper by researchers from Google Research and the University of California, Berkeley, demonstrates that a surprisingly simple test-time scaling approach can boost the reasoning abilities of large language models (LLMs). The key? Scaling up sampling-based search, a technique that relies on generating multiple responses and using the model itself to verify them.

The core finding is that even a minimalist implementation of sampling-based search, using random sampling and self-verification, can elevate the reasoning performance of models like Gemini 1.5 Pro beyond that of o1-Preview on popular benchmarks. The findings can have important implications for enterprise applications and challenge the assumption that highly specialized training or complex architectures are always necessary for achieving top-tier performance.

The limits of current test-time compute scaling​


The current popular method for test-time scaling in LLMs is to train the model through reinforcement learning to generate longer responses with chain-of-thought (CoT) traces. This approach is used in models such as OpenAI o1 and DeepSeek-R1. While beneficial, these methods usually require substantial investment in the training phase.

Another test-time scaling method is “self-consistency,” where the model generates multiple responses to the query and chooses the answer that appears most often. Self-consistency reaches its limits on complex problems, where the most repeated answer is not necessarily the correct one.

Sampling-based search offers a simpler and highly scalable alternative to test-time scaling: Let the model generate multiple responses and select the best one through a verification mechanism. Sampling-based search can complement other test-time compute scaling strategies and, as the researchers write in their paper, “it also has the unique advantage of being embarrassingly parallel and allowing for arbitrarily scaling: simply sample more responses.”

More importantly, sampling-based search can be applied to any LLM, including those that have not been explicitly trained for reasoning.

How sampling-based search works​


The researchers focus on a minimalist implementation of sampling-based search, using a language model to both generate candidate responses and verify them. This is a “self-verification” process, where the model assesses its own outputs without relying on external ground-truth answers or symbolic verification systems.

Search-based sampling. Credit: VentureBeat

The algorithm works in a few simple steps:

1. The algorithm begins by generating a set of candidate solutions to the given problem using a language model. This is done by giving the model the same prompt multiple times and using a non-zero temperature setting to create a diverse set of responses.

2. Each candidate response undergoes a verification process in which the LLM is prompted multiple times to determine whether the response is correct. The verification outcomes are then averaged to create a final verification score for the response.

3. The algorithm selects the highest-scored response as the final answer. If multiple candidates are within close range of each other, the LLM is prompted to compare them pairwise and choose the best one. The response that wins the most pairwise comparisons is chosen as the final answer.

The researchers considered two key axes for test-time scaling:

Sampling: The number of responses the model generates for each input problem.

Verification: The number of verification scores computed for each generated solution.
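The three steps and two scaling axes can be sketched in a few lines of Python. This is a minimal illustration with stand-in callables for the model (in a real run, `llm`, `verify`, and `compare` would be API calls); the helper names, counts, and tie margin are illustrative, not taken from the paper.

```python
def sample_responses(llm, prompt, n_samples, temperature=0.8):
    # Step 1: same prompt, non-zero temperature -> diverse candidates.
    return [llm(prompt, temperature) for _ in range(n_samples)]

def verification_score(verify, prompt, response, n_verifications):
    # Step 2: ask the model repeatedly "is this correct?" and average
    # the 0/1 verdicts into a final score.
    return sum(verify(prompt, response) for _ in range(n_verifications)) / n_verifications

def sampling_based_search(llm, verify, compare, prompt,
                          n_samples=4, n_verifications=3, tie_margin=0.05):
    candidates = sample_responses(llm, prompt, n_samples)
    scored = sorted(
        ((verification_score(verify, prompt, c, n_verifications), c) for c in candidates),
        key=lambda sc: sc[0], reverse=True)
    best_score, best = scored[0]
    # Step 3: settle near-ties with pairwise comparisons.
    for score, c in scored[1:]:
        if best_score - score <= tie_margin:
            best = best if compare(prompt, best, c) else c
    return best
```

Both axes scale trivially: `n_samples` and `n_verifications` are independent knobs, and every sample and verification call can run in parallel.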

How sampling-based search compares to other techniques​


The study revealed that reasoning performance continues to improve with sampling-based search, even when test-time compute is scaled far beyond the point where self-consistency saturates.

At a sufficient scale, this minimalist implementation significantly boosts reasoning accuracy on reasoning benchmarks like AIME and MATH. For example, Gemini 1.5 Pro’s performance surpassed that of o1-Preview, which has explicitly been trained on reasoning problems, and Gemini 1.5 Flash surpassed Gemini 1.5 Pro.

“This not only highlights the importance of sampling-based search for scaling capability, but also suggests the utility of sampling-based search as a simple baseline on which to compare other test-time compute scaling strategies and measure genuine improvements in models’ search capabilities,” the researchers write.

It is worth noting that while the results of sampling-based search are impressive, the costs can also become prohibitive. For example, with 200 samples and 50 verification steps per sample, a query from AIME will generate around 130 million tokens, which costs $650 with Gemini 1.5 Pro. However, this is a very minimalist approach to sampling-based search, and it is compatible with optimization techniques proposed in other studies. With smarter sampling and verification methods, inference costs can be reduced considerably by using smaller models and generating fewer tokens. For example, by using Gemini 1.5 Flash to perform the verification, the cost drops to $12 per question.
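The quoted figures are easy to sanity-check. The token count and dollar totals below are the article's; the per-million rate is just the blended price they imply, not a published number.

```python
# Back-of-the-envelope check of the quoted costs.
tokens_per_query = 130_000_000   # ~200 samples x 50 verifications on one AIME query
pro_cost_usd = 650.0             # quoted total with Gemini 1.5 Pro
implied_rate = pro_cost_usd / (tokens_per_query / 1_000_000)
print(f"implied blended price: ${implied_rate:.2f} per 1M tokens")

flash_cost_usd = 12.0            # quoted total when Flash handles verification
print(f"savings from Flash verification: {1 - flash_cost_usd / pro_cost_usd:.0%}")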

Effective self-verification strategies​


There is an ongoing debate on whether LLMs can verify their own answers. The researchers identified two key strategies for improving self-verification using test-time compute:

Directly comparing response candidates: Disagreements between candidate solutions strongly indicate potential errors. By providing the verifier with multiple responses to compare, the model can better identify mistakes and hallucinations, addressing a core weakness of LLMs. The researchers describe this as an instance of “implicit scaling.”

Task-specific rewriting: The researchers propose that the optimal output style of an LLM depends on the task. Chain-of-thought is effective for solving reasoning tasks, but responses are easier to verify when written in a more formal, mathematically conventional style. Verifiers can rewrite candidate responses into a more structured format (e.g., theorem-lemma-proof) before evaluation.

“We anticipate model self-verification capabilities to rapidly improve in the short term, as models learn to leverage the principles of implicit scaling and output style suitability, and drive improved scaling rates for sampling-based search,” the researchers write.

Implications for real-world applications​


The study demonstrates that a relatively simple technique can achieve impressive results, potentially reducing the need for complex and costly model architectures or training regimes.

This is also a scalable technique, enabling enterprises to increase performance by allocating more compute resources to sampling and verification. It also enables developers to push frontier language models beyond their limitations on complex tasks.

“Given that it complements other test-time compute scaling strategies, is parallelizable and allows for arbitrarily scaling, and admits simple implementations that are demonstrably effective, we expect sampling-based search to play a crucial role as language models are tasked with solving increasingly complex problems with increasingly large compute budgets,” the researchers write.
 


Nvidia debuts Llama Nemotron open reasoning models in a bid to advance agentic AI​


Sean Michael Kerner – March 18, 2025 12:42 PM


Credit: Image generated by VentureBeat with Stable Diffusion 3.5 Large


At the Nvidia GTC event today, the AI giant made a series of hardware and software announcements. Buried amidst the big silicon announcements, the company announced a new set of open source Llama Nemotron reasoning models to help accelerate agentic AI workloads. The new models are an extension of the Nvidia Nemotron models that were first announced in January at the Consumer Electronics Show (CES).

The new Llama Nemotron reasoning models are in part a response to the dramatic rise of reasoning models in 2025. Nvidia (and its stock price) were rocked to the core earlier this year when DeepSeek R1 came out, offering the promise of an open source reasoning model and superior performance.

The Llama Nemotron family is competitive with DeepSeek, offering business-ready AI reasoning models for advanced agents.

“Agents are autonomous software systems designed to reason, plan, act and critique their work,” Kari Briski, vice president of Generative AI Software Product Management at Nvidia, said during a GTC pre-briefing with press. “Just like humans, agents need to understand context to break down complex requests, understand the user’s intent, and adapt in real time.”

What’s inside Llama Nemotron for agentic AI​


As the name implies, Llama Nemotron is based on Meta’s open source Llama models.

With Llama as the foundation, Briski said that Nvidia algorithmically pruned the model to optimize compute requirements while maintaining accuracy.

Nvidia also applied sophisticated post-training techniques using synthetic data. The training involved 360,000 H100 inference hours and 45,000 human annotation hours to enhance reasoning capabilities. All that training results in models that have exceptional reasoning capabilities across key benchmarks for math, tool calling, instruction following and conversational tasks, according to Nvidia.

The Llama Nemotron family has three different models​


The family includes three models targeting different deployment scenarios:

  • Nemotron Nano: Optimized for edge and smaller deployments while maintaining high reasoning accuracy.
  • Nemotron Super: Balanced for optimal throughput and accuracy on single data center GPUs.
  • Nemotron Ultra: Designed for maximum “agentic accuracy” in multi-GPU data center environments.

For availability, Nano and Super are now available as NIM microservices and can be downloaded at AI.NVIDIA.com. Ultra is coming soon.

Hybrid reasoning helps to advance agentic AI workloads​


One of the key features in Nvidia Llama Nemotron is the ability to toggle reasoning on or off.

The ability to toggle reasoning is an emerging capability in the AI market. Anthropic Claude 3.7 has somewhat similar functionality, though that model is closed and proprietary. In the open source space, IBM Granite 3.2 also has a reasoning toggle, which IBM refers to as “conditional reasoning.”

The promise of hybrid or conditional reasoning is that it allows systems to bypass computationally expensive reasoning steps for simple queries. In a demonstration, Nvidia showed how the model could engage complex reasoning when solving a combinatorial problem but switch to direct response mode for simple factual queries.
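In practice, Nvidia's model cards describe the Llama Nemotron toggle as a system-prompt switch. The sketch below assumes that convention; the exact toggle strings, and how the messages are ultimately sent to a serving endpoint, should be checked against the official documentation.

```python
# Hypothetical sketch: toggling Llama Nemotron's reasoning via the system
# prompt (toggle strings assumed from Nvidia's model card descriptions).
def build_messages(user_query: str, reasoning: bool) -> list:
    toggle = "detailed thinking on" if reasoning else "detailed thinking off"
    return [
        {"role": "system", "content": toggle},
        {"role": "user", "content": user_query},
    ]

# Complex combinatorial problem: pay for the reasoning trace.
hard = build_messages("Seat 8 people at a round table so no two rivals are adjacent.", True)
# Simple factual lookup: skip the expensive reasoning steps.
easy = build_messages("What does GTC stand for?", False)
```

A router in front of the model can make this decision per query, which is where the latency and compute savings for simple requests come from.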

Nvidia Agent AI-Q blueprint provides an enterprise integration layer​


Recognizing that models alone aren’t sufficient for enterprise deployment, Nvidia also announced the Agent AI-Q blueprint, an open-source framework for connecting AI agents to enterprise systems and data sources.

“AI-Q is a new blueprint that enables agents to query multiple data types—text, images, video—and leverage external tools like web search and other agents,” Briski said. “For teams of connected agents, the blueprint provides observability and transparency into agent activity, allowing developers to improve the system over time.”

The AI-Q blueprint is set to become available in April.

Why this matters for enterprise AI adoption​


For enterprises considering advanced AI agent deployments, Nvidia’s announcements address several key challenges.

The open nature of Llama Nemotron models allows businesses to deploy reasoning-capable AI within their own infrastructure. That’s important because it addresses the data sovereignty and privacy concerns that have limited adoption of cloud-only solutions. By building the new models as NIMs, Nvidia is also making it easier for organizations to deploy and manage them, whether on-premises or in the cloud.

The hybrid, conditional reasoning approach is also important to note as it provides organizations with another option to choose from for this type of emerging capability. Hybrid reasoning allows enterprises to optimize for either thoroughness or speed, saving on latency and compute for simpler tasks while still enabling complex reasoning when needed.

As enterprise AI moves beyond simple applications to more complex reasoning tasks, Nvidia’s combined offering of efficient reasoning models and integration frameworks positions companies to deploy more sophisticated AI agents that can handle multi-step logical problems while maintaining deployment flexibility and cost efficiency.
 



Baidu delivers new LLMs ERNIE 4.5 and ERNIE X1 undercutting DeepSeek, OpenAI on cost — but they’re not open source (yet)​


Carl Franzen – March 17, 2025 3:48 PM

Credit: VentureBeat made with Midjourney



Over the weekend, Chinese web search giant Baidu announced the launch of two new AI models, ERNIE 4.5 and ERNIE X1, a multimodal language model and reasoning model, respectively.

Baidu claims they offer state-of-the-art performance on a variety of metrics, besting DeepSeek’s non-reasoning V3 and OpenAI’s GPT-4.5 (how do you like the close name match Baidu chose as well?) on several third-party benchmark tests such as the C-Eval (assessing Chinese LLM performance on knowledge and reasoning across 52 subjects), CMMLU (massive multitask language understanding in Chinese), and GSM8K (math word problems).

It also claims to undercut the cost of fellow Chinese wunderkind DeepSeek’s R1 reasoning model with ERNIE X1 by 50%, and of US AI juggernaut OpenAI’s GPT-4.5 with ERNIE 4.5 by 99%.

Yet both have some important limitations: a lack of an open source license (which DeepSeek R1 offers) in the former case, and a far smaller context window than GPT-4.5’s in the latter (8,000 tokens instead of 128,000, frankly an astonishingly low amount in this age of million-token-plus context windows). Tokens are how a large AI model represents information; more tokens mean more information. A 128,000-token window is akin to a 250-page novel.

As X user @claudeglass noted in a post, the small context window makes it perhaps only suitable for customer service chatbots.

Baidu posted on X that it plans to make the ERNIE 4.5 model family open source on June 30, 2025.

Baidu has enabled access to the models through its application programming interface (API) and through “ERNIE Bot,” its Chinese-language chatbot rival to ChatGPT, which answers questions, generates text, produces creative writing, and interacts conversationally with users. ERNIE Bot is free to access.

ERNIE 4.5: A new generation of multimodal AI​


ERNIE 4.5 is Baidu’s latest foundation model, designed as a native multimodal system capable of processing and understanding text, images, audio, and video, and is a clear competitor to OpenAI’s GPT-4.5 model released back in February 2025.

The model has been optimized for better comprehension, generation, reasoning, and memory. Enhancements include improved hallucination prevention, logical reasoning, and coding capabilities.

According to Baidu, ERNIE 4.5 outperforms GPT-4.5 in multiple benchmarks while maintaining a significantly lower cost.

The model’s advancements stem from several key technologies, including FlashMask Dynamic Attention Masking, Heterogeneous Multimodal Mixture-of-Experts, and Self-feedback Enhanced Post-Training.

ERNIE X1 introduces advanced deep-thinking reasoning capabilities, emphasizing understanding, planning, reflection, and evolution.

Unlike standard multimodal AI models, ERNIE X1 is specifically designed for complex reasoning and tool use, enabling it to perform tasks such as advanced search, document-based Q&A, AI-generated image interpretation, code execution, and web page analysis.

The model supports a range of tools, including Baidu’s academic search, business information search, and franchise research tools. Its development is based on Progressive Reinforcement Learning, End-to-End Training integrating Chains of Thought and Action, and a Unified Multi-Faceted Reward System.

Access and API availability​


Users can now access both ERNIE 4.5 and ERNIE X1 via the official ERNIE Bot website.

For enterprise users and developers, ERNIE 4.5 is now available through Baidu AI Cloud’s Qianfan platform via API access. ERNIE X1 is expected to be available soon.

Pricing for API Access:​


  • ERNIE 4.5: $0.55 per 1M input tokens / $2.20 per 1M output tokens
  • ERNIE X1: $0.28 per 1M input tokens / $1.10 per 1M output tokens

Compare that to:

  • GPT-4.5, which carries an astonishingly high price through the OpenAI API: $75.00 per 1M input tokens / $150.00 per 1M output tokens

  • DeepSeek R1: $0.55 per 1M input tokens / $2.19 per 1M output tokens
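Using the rates listed above, the per-request cost gap is easy to quantify. This is a sketch based on the published per-million-token prices only; actual billing may differ:

```python
# Estimate API cost for a single request from per-million-token rates.
# Prices (USD per 1M tokens) are the ones quoted in the article.
PRICES = {
    "ernie-4.5":   {"input": 0.55,  "output": 2.20},
    "ernie-x1":    {"input": 0.28,  "output": 1.10},
    "gpt-4.5":     {"input": 75.00, "output": 150.00},
    "deepseek-r1": {"input": 0.55,  "output": 2.19},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request, given token counts."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Example: a 2,000-token prompt with a 1,000-token reply.
for m in PRICES:
    print(f"{m}: ${request_cost(m, 2_000, 1_000):.4f}")
```

For that sample request, ERNIE 4.5 comes to about $0.0033 versus $0.30 for GPT-4.5, which lines up with the claimed 99% undercut.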

Baidu has also announced plans to integrate ERNIE 4.5 and ERNIE X1 into its broader ecosystem, including Baidu Search and the Wenxiaoyan app.

Considerations for enterprise decision-makers​


For CIOs, CTOs, IT leaders, and DevOps teams, the launch of ERNIE 4.5 and ERNIE X1 presents both opportunities and considerations:

  • Performance vs. Cost – With pricing significantly lower than competing models, organizations evaluating AI solutions may see cost savings by integrating ERNIE models via API. However, further benchmarking and real-world testing may be necessary to assess performance for specific business applications.
  • Multimodal and Reasoning Capabilities – The ability to process and understand text, images, audio, and video could be valuable for businesses in industries such as customer support, content generation, legal tech, and finance.
  • Tool Integration – ERNIE X1’s ability to work with tools like advanced search, document-based Q&A, and code interpretation could provide automation and efficiency gains in enterprise environments.
  • Ecosystem and Localization – As Baidu’s AI models are optimized for Chinese-language processing and regional knowledge, enterprises working in China or targeting Chinese-speaking markets may find ERNIE models more effective than global alternatives.
  • Licensing and Data Privacy – While Baidu has indicated that ERNIE 4.5 will be made open source on June 30, 2025, that date is still three months away, so enterprises may want to wait until then to assess whether the model is worth deploying locally or on US-hosted cloud services. Enterprise users should also review Baidu’s policies regarding data privacy, compliance, and model usage before integrating these AI solutions.

AI expansion and future outlook​


As AI development accelerates in 2025, Baidu is positioning itself as a leader in multimodal and reasoning-based AI technologies.

The company plans to continue investing in artificial intelligence, data centers, and cloud infrastructure to enhance the capabilities of its foundation models.

By offering a combination of powerful performance and lower costs, Baidu’s latest AI models aim to provide businesses and individual users with more accessible and advanced AI tools.

For more details, visit ERNIE Bot’s official website .
 


Mistral AI drops new open-source model that outperforms GPT-4o Mini with fraction of parameters​


Michael Nuñez – March 17, 2025 1:22 PM



Credit: VentureBeat made with Midjourney


French artificial-intelligence startup Mistral AI unveiled a new open-source model today that the company says outperforms similar offerings from Google and OpenAI , setting the stage for increased competition in a market dominated by U.S. tech giants.

The model, called Mistral Small 3.1 , processes both text and images with just 24 billion parameters—a fraction of the size of leading proprietary models—while matching or exceeding their performance, according to the company.

“This new model comes with improved text performance, multimodal understanding, and an expanded context window of up to 128k tokens,” Mistral said in a company blog post announcing the release. The firm claims the model processes information at speeds of 150 tokens per second, making it suitable for applications requiring rapid response times.
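At the claimed 150 tokens per second, response latency for typical reply lengths is straightforward to estimate. This is a sketch of decode time only; real latency also includes prompt processing and network overhead, which are ignored here:

```python
# Rough generation-time estimate from a claimed throughput figure.
TOKENS_PER_SECOND = 150  # Mistral's stated decode speed

def generation_seconds(output_tokens: int, tps: float = TOKENS_PER_SECOND) -> float:
    """Seconds to stream `output_tokens` at a constant decode rate."""
    return output_tokens / tps

for n in (150, 600, 3_000):
    print(f"{n} tokens -> ~{generation_seconds(n):.1f}s")
```

A short paragraph-length reply (~150 tokens) streams in about a second; even a 3,000-token answer takes roughly 20 seconds, which is what makes the speed claim relevant for interactive applications.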

By releasing the model under the permissive Apache 2.0 license , Mistral is pursuing a markedly different strategy than its larger competitors, which have increasingly restricted access to their most powerful AI systems. The approach highlights a growing divide in the AI industry between closed, proprietary systems and open, accessible alternatives.

How a $6 billion European startup is taking on Silicon Valley’s AI giants​


Founded in 2023 by former researchers from Google DeepMind and Meta , Mistral AI has rapidly established itself as Europe’s leading AI startup, with a valuation of approximately $6 billion after raising around $1.04 billion in capital . This valuation, while impressive for a European startup, remains a fraction of OpenAI’s reported $80 billion or the resources available to tech giants like Google and Microsoft.

Mistral has achieved notable traction, particularly in its home region. Its chat assistant Le Chat recently reached one million downloads in just two weeks following its mobile release, bolstered by vocal support from French President Emmanuel Macron, who urged citizens to “download Le Chat, which is made by Mistral, rather than ChatGPT by OpenAI — or something else” during a television interview .

The company strategically positions itself as “the world’s greenest and leading independent AI lab,” emphasizing European digital sovereignty as a key differentiator from American competitors.

Small but mighty: How Mistral’s 24 billion parameter model punches above its weight class​


Mistral Small 3.1 stands out for its remarkable efficiency. With just 24 billion parameters—a fraction of models like GPT-4—the system delivers multimodal capabilities, multilingual support, and handles long-context windows of up to 128,000 tokens.

This efficiency represents a significant technical achievement. While the AI industry has generally pursued ever-larger models requiring massive computational resources, Mistral has focused on algorithmic improvements and training optimizations to extract maximum capability from smaller architectures.

The approach addresses one of the most pressing challenges in AI deployment: the enormous computational and energy costs associated with state-of-the-art systems. By creating models that run on relatively modest hardware—including a single RTX 4090 graphics card or a Mac with 32GB of RAM—Mistral makes advanced AI accessible for on-device applications where larger models prove impractical.
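The hardware claim is plausible from simple memory arithmetic. This sketch counts only the weights; real memory use adds activation and KV-cache overhead on top:

```python
# Approximate weight memory for a 24B-parameter model at common precisions.
PARAMS = 24e9  # Mistral Small 3.1 parameter count

def weight_gib(bytes_per_param: float, params: float = PARAMS) -> float:
    """Model weight footprint in GiB at a given precision."""
    return params * bytes_per_param / 2**30

for name, bpp in [("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)]:
    print(f"{name}: ~{weight_gib(bpp):.0f} GiB")
```

At full 16-bit precision the weights alone need roughly 45 GiB, but 8-bit quantization brings them to about 22 GiB and 4-bit to about 11 GiB, which is why a 32GB Mac or a single 24GB RTX 4090 becomes feasible with quantized variants.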

This emphasis on efficiency may ultimately prove more sustainable than the brute-force scaling pursued by larger competitors. As climate concerns and energy costs increasingly constrain AI deployment, Mistral’s lightweight approach could transition from alternative to industry standard.

Why Europe’s AI champion could benefit from growing geopolitical tensions​


Mistral’s latest release emerges amid growing concerns about Europe’s ability to compete in the global AI race, traditionally dominated by American and Chinese companies.

“Not being American or Chinese may now be a help, not a hindrance,” The Economist reported in a recent analysis of Mistral’s position, suggesting that as geopolitical tensions rise, a European alternative may become increasingly attractive for certain markets and governments.

Arthur Mensch, Mistral’s CEO, has advocated forcefully for European digital sovereignty. At the Mobile World Congress in Barcelona this month, he urged European telecoms to “get into the hyperscaler game” by investing in data center infrastructure.

“We would welcome more domestic effort in making more data centers,” Mensch said, suggesting that “the AI revolution is also bringing opportunities to decentralize the cloud.”

The company’s European identity provides significant regulatory advantages. As the EU’s AI Act takes effect, Mistral enters the market with systems designed from inception to align with European values and regulatory expectations. This contrasts sharply with American and Chinese competitors who must retrofit their technologies and business practices to comply with an increasingly complex global regulatory landscape.

Beyond text: Mistral’s expanding portfolio of specialized AI models​


Mistral Small 3.1 joins a rapidly expanding suite of AI products from the company. In February, Mistral released Saba , a model focused specifically on Arabic language and culture, demonstrating an understanding that AI development has concentrated excessively on Western languages and contexts.

Earlier this month, the company introduced Mistral OCR , an optical character recognition API that converts PDF documents into AI-ready Markdown files—addressing a critical need for enterprises seeking to make document repositories accessible to AI systems.

These specialized tools complement Mistral’s broader portfolio, which includes Mistral Large 2 (their flagship large language model), Pixtral (for multimodal applications), Codestral (for code generation), and “ Les Ministraux ,” a family of models optimized for edge devices.

This diversified portfolio reveals a sophisticated product strategy that balances innovation with market demands. Rather than pursuing a single monolithic model, Mistral creates purpose-built systems for specific contexts and requirements — an approach that may prove more adaptable to the rapidly evolving AI landscape.

From Microsoft to military: How strategic partnerships are fueling Mistral’s growth​


Mistral’s rise has accelerated through strategic partnerships, including a deal with Microsoft that includes distribution of its AI models through Microsoft’s Azure platform and a $16.3 million investment .

The company has also secured partnerships with France’s army and job agency, German defense tech startup Helsing , IBM , Orange , and Stellantis , positioning itself as a key player in Europe’s AI ecosystem.

In January, Mistral signed a deal with press agency Agence France-Presse (AFP) to allow its chat assistant to query AFP’s entire text archive dating back to 1983, enriching its knowledge base with high-quality journalistic content.

These partnerships reveal a pragmatic approach to growth. Despite positioning itself as an alternative to American tech giants, Mistral recognizes the necessity of working within existing technological ecosystems while building the foundation for greater independence.

The open source advantage: Why Mistral is betting against big tech’s closed AI systems​


Mistral’s continued commitment to open source represents its most distinctive strategic choice in an industry increasingly dominated by closed, proprietary systems.

While Mistral maintains some premier models for commercial purposes, its strategy of releasing powerful models like Mistral Small 3.1 under permissive licenses challenges conventional wisdom about intellectual property in AI development.

This approach has already produced tangible benefits. The company noted that “several excellent reasoning models” have been built on top of its previous Mistral Small 3 , such as DeepHermes 24B by Nous Research —evidence that open collaboration can accelerate innovation beyond what any single organization might achieve independently.

The open-source strategy also serves as a force multiplier for a company with limited resources compared to its competitors. By enabling a global community of developers to build upon and extend its models, Mistral effectively expands its research and development capacity far beyond its direct headcount.

This approach represents a fundamentally different vision for AI’s future — one where foundational technologies function more like digital infrastructure than proprietary products. As large language models become increasingly commoditized, the true value may shift to specialized applications, industry-specific implementations, and service delivery rather than the base models themselves.

The strategy carries significant risks. If core AI capabilities become widely available commodities, Mistral will need to develop compelling differentiation in other areas. Yet it also protects the company from becoming trapped in an escalating arms race with vastly better-funded competitors — a contest few European startups could hope to win through conventional means.

By positioning itself at the center of an open ecosystem rather than attempting to control it entirely, Mistral may ultimately build something more resilient than what any single organization could create alone.

The $6 billion question: Can Mistral’s business model support its ambitious vision?​


Mistral faces significant challenges despite its technical achievements and strategic vision. The company’s revenue reportedly remains in the “eight-digit range,” according to multiple sources—a fraction of what might be expected for its nearly $6 billion valuation.

Mensch has ruled out selling the company, stating at the World Economic Forum in Davos that Mistral is “not for sale” and that “of course, [an IPO is] the plan.” However, the path to sufficient revenue growth remains unclear in an industry where deep-pocketed competitors can afford to operate at a loss for extended periods.

The company’s open-source strategy, while innovative, introduces its own challenges. If base models become commoditized, as Mistral co-founder Guillaume Lample predicts, Mistral must develop additional revenue streams through specialized services, enterprise deployments, or unique applications that leverage but extend beyond its foundational technologies.

Mistral’s European identity, while providing regulatory advantages and appeal to sovereignty-conscious customers, also potentially limits its immediate growth potential compared to American and Chinese markets where AI adoption typically moves faster.

Nevertheless, Mistral Small 3.1 represents a compelling technical achievement and strategic statement. By demonstrating that advanced AI capabilities can be delivered in smaller, more efficient packages under open licenses, Mistral challenges fundamental assumptions about how AI development and commercialization should proceed.

For a technology industry increasingly concerned about concentration of power among a handful of American tech giants, Mistral’s European-led, open-source alternative offers a vision of a more distributed, accessible AI future—provided it can build a sustainable business model to support its ambitious technical agenda.
 


China’s open-source embrace upends conventional wisdom around artificial intelligence​



– Published Mon, Mar 24 2025, 2:51 AM EDT

China is focusing on large language models (LLMs) in the artificial intelligence space.

Blackdovfx | Istock | Getty Images

China is embracing open-source AI models in a trend market watchers and insiders say is boosting AI adoption and innovation in the country, with some suggesting it is an ‘Android moment’ for the sector.

The open-source shift has been spearheaded by AI startup DeepSeek, whose R1 model released earlier this year challenged American tech dominance and raised questions over Big Tech’s massive spending on large language models and data centers.

While R1 created a splash in the sector due to its performance and claims of lower costs, some analysts say the most significant impact of DeepSeek has been in catalyzing the adoption of open-source AI models.

“DeepSeek’s success proves that open-source strategies can lead to faster innovation and broad adoption,” said Wei Sun, principal analyst of artificial intelligence at Counterpoint Research, noting a large number of firms have implemented the model.

“Now, we see that R1 is actively reshaping China’s AI landscape, with large companies like Baidu moving to open source their own LLMs in a strategic response,” she added.

On March 16, Baidu released the latest version of its AI model, Ernie 4.5, as well as a new reasoning model, Ernie X1, making them free for individual users. Baidu also plans to make the Ernie 4.5 model series open-source from end-June.

Experts say that Baidu’s open-source plans represent a broader shift in China, away from a business strategy that focuses on proprietary licensing.

“Baidu has always been very supportive of its proprietary business model and was vocal against open-source, but disruptors like DeepSeek have proven that open-source models can be as competitive and reliable as proprietary ones,” Lian Jye Su, chief analyst with technology research and advisory group Omdia previously told CNBC.

Open-source vs proprietary models​


Open-source generally refers to software in which the source code is made freely available on the web for possible modification and redistribution.

AI models that call themselves open-source had existed before the emergence of DeepSeek, with Meta’s Llama and Google’s Gemma being prime examples in the U.S. However, some experts argue that these models aren’t really open source as their licenses restrict certain uses and modifications, and their training data sets aren’t public.

DeepSeek’s R1 is distributed under an ‘MIT License,’ which Counterpoint’s Sun describes as one of the most permissive and widely adopted open-source licenses, facilitating unrestricted use, modification and distribution, including for commercial purposes.

The DeepSeek team even held an “ Open-Source Week ” last month, which saw it release more technical details about the development of its R1 model.

While DeepSeek’s model itself is free, the startup charges for access to its application programming interface (API), which lets other companies integrate the model and its capabilities into their own applications. Its API charges are advertised as far cheaper than OpenAI’s and Anthropic’s latest offerings.
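DeepSeek’s API advertises compatibility with the OpenAI chat-completions request format, so integration is largely a matter of pointing an HTTP client at a different endpoint. The sketch below only builds the request body; the endpoint and model name are assumptions based on public documentation at the time of writing and should be verified before use:

```python
import json

# Sketch of an OpenAI-style chat-completion request body, the format
# DeepSeek's API advertises compatibility with. The endpoint URL and
# model name are assumptions; check the provider's current docs.
API_URL = "https://api.deepseek.com/chat/completions"

def build_request(prompt: str, model: str = "deepseek-reasoner") -> str:
    """Serialize a minimal chat-completion request body as JSON."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return json.dumps(body)

payload = build_request("Summarize the MIT License in one sentence.")
print(payload)
```

The payload would then be POSTed to `API_URL` with an `Authorization: Bearer <key>` header, the same wire shape used for OpenAI’s own API, which is what makes switching providers cheap for integrators.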

OpenAI and Anthropic also generate revenue by charging individual users and enterprises to access some of their models. These models are considered ‘closed-source,’ as their datasets and algorithms are not open to public access.

China opens up​


In addition to Baidu, other Chinese tech giants such as Alibaba Group and Tencent have increasingly been providing their AI offerings for free and are making more models open source.

For example, Alibaba Cloud said last month it was open-sourcing its AI models for video generation , while Tencent released five new open-source models earlier this month with the ability to convert text and images into 3D visuals.

Smaller players are also furthering the trend. ManusAI, a Chinese AI firm that recently unveiled an AI agent that claims to outperform OpenAI’s Deep Research, has said it would shift towards open source.

“This wouldn’t be possible without the amazing open-source community, which is why we’re committed to giving back,” co-founder Ji Yichao said in a product demo video . “ManusAI operates as a multi-agent system powered by several distinct models, so later this year, we’re going to open source some of these models,” he added.

Zhipu AI, one of the country’s leading AI startups, this month announced on WeChat that 2025 would be “the year of open source.”

Ray Wang, principal analyst and founder of Constellation Research, told CNBC that companies have been compelled to make these moves following the emergence of DeepSeek.

“With DeepSeek free, it’s impossible for any other Chinese competitors to charge for the same thing. They have to move to open-source business models in order to compete,” said Wang.

AI scholar and entrepreneur Kai-Fu Lee also believes this dynamic will impact OpenAI, noting in a recent social media post that it would be difficult for the company to justify its pricing when the competition is “free and formidable.”

“The biggest revelation from DeepSeek is that open-source has won,” said Lee, whose Chinese startup 01.AI has built an LLM platform for enterprises seeking to use DeepSeek.

U.S.-China competition​


OpenAI, which started the AI frenzy when it released its ChatGPT bot in November 2022, has not signaled that it plans to shift from its proprietary business model. The company, which started as a nonprofit in 2015, is moving towards a for-profit structure.

Sun says that OpenAI and DeepSeek represent very different ends of the AI space. She adds that the sector could continue to see a division between open-source players that innovate off one another and closed-source companies that have come under pressure to maintain high-cost, cutting-edge models.

The open-source trend has called into question the massive funds raised by companies such as OpenAI. Microsoft has invested $13 billion into the company, which is in talks to raise up to $40 billion in a funding round that would lift its valuation to as high as $340 billion, CNBC confirmed at the end of January.

In September, CNBC confirmed the company expects about $5 billion in losses, with revenue pegged at $3.7 billion. OpenAI CFO Sarah Friar has also said that $11 billion in revenue is “ definitely in the realm of possibility ” for the company this year.

On the other hand, Chinese companies have chosen the open-source route as they compete with the more proprietary approach of U.S. firms, said Constellation Research’s Wang. “They are hoping for faster adoption than the closed models of the U.S.,” he added.

Speaking to CNBC’s “ Street Signs Asia ” on Wednesday, Tim Wang, managing partner of tech-focused hedge fund Monolith Management, said that models from companies such as DeepSeek have been “great enablers and multipliers in China,” demonstrating how things can be done with more limited resources.

According to Wang, open-source models have pushed down costs, opening doors for product innovation — something he says Chinese companies historically have been very good at.

He calls the development the “Android moment,” referring to when Google’s Android made its operating system source code freely available , fostering innovation and development in the non-Apple app ecosystem.

“We used to think China was 12 to 24 months behind [the U.S.] in AI and now we think that’s probably three to six months,” said Wang.

However, other experts have downplayed the idea that open-source AI should be seen through the lens of China and U.S. competition. In fact, several U.S. companies have integrated and benefited from DeepSeek’s R1.

“I think the so-called DeepSeek moment is not about whether China has better AI than the U.S. or vice versa. It’s really about the power of open-source,” Alibaba Group Chairperson Joe Tsai told CNBC’s CONVERGE conference in Singapore earlier this month.

Tsai added that open-source models give the power of AI to everyone from small entrepreneurs to large corporations, which will lead to more development, innovation and a proliferation of AI applications.

— CNBC’s Evelyn Cheng contributed to this report
 





1/11
@Google
Introducing Gemini 2.5, our most intelligent AI model.

Our first release, an experimental version of 2.5 Pro, unlocks state-of-the-art performance in math and science. 🔥

Learn more 🧵



2/11
@Google
2.5 models are thinking models, capable of reasoning through thoughts before responding. The result is enhanced performance and improved accuracy.

This means Gemini 2.5 can handle more complex problems in coding, science and math, and support more context-aware agents.



https://video.twimg.com/amplify_video/1904579657883582464/vid/avc1/1080x1350/NT87TxBB_EfVOh79.mp4

3/11
@Google
Today we’re releasing an experimental version of Gemini 2.5 Pro.

💡2.5 Pro shows strong reasoning and improved code capabilities, with state-of-the-art performance across a range of benchmarks.

📈It’s topped @lmarena_ai's leaderboard by a huge margin





4/11
@Google
Take Gemini 2.5 Pro for a spin:

🔵Developers can try it out in Google AI Studio
🔵 @GeminiApp Advanced users can select it in the model dropdown

It will be available on Vertex AI in the coming weeks. Learn more ⬇️ Gemini 2.5: Our most intelligent AI model




6/11
@metadjai
Super excited for this! ✨



7/11
@ofermend
Congrats - I'm excited to see even better math/science capabilities. What's more, this model maintains a very low hallucination rate (1.1%) on @vectara hallucination leaderboard.

GitHub - vectara/hallucination-leaderboard: Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents



8/11
@IsingResearch
🔥🔥🔥



9/11
@Halvings_org
New tech is always a plus! 🔥



10/11
@highyielddrama
Been really impressed by Gemini Pro model for vibe coding. Excited to give the new one a whirl.



11/11
@EverPeak01
Exciting leap in AI innovation looking forward to seeing Gemini 2.5 push the boundaries of math and science!





1/31
@sundarpichai
1/ Gemini 2.5 is here, and it’s our most intelligent AI model ever.

Our first 2.5 model, Gemini 2.5 Pro Experimental is a state-of-the-art thinking model, leading in a wide range of benchmarks – with impressive improvements in enhanced reasoning and coding and now #1 on @lmarena_ai by a significant margin. With a model this intelligent, we wanted to get it to people as quickly as possible.

Find it on Google AI Studio and in the @geminiapp for Gemini Advanced users now – and in Vertex in the coming weeks. This is the start of a new era of thinking models – and we can’t wait to see where things go from here.



https://video.twimg.com/ext_tw_video/1904579366874415104/pu/vid/avc1/720x900/rnO07SKeFnXBRwo0.mp4

2/31
@sundarpichai
2/ Here’s one example of what it can do. Tell it to create a basic video game (like the dino game below) and it applies its reasoning capability to produce the executable code from a single line prompt. Take a look:



https://video.twimg.com/ext_tw_video/1904579979267981312/pu/vid/avc1/1280x720/adt0cgIX1t7e-FFK.mp4

3/31
@sundarpichai
3/ More details: Gemini 2.5: Our most intelligent AI model



4/31
@alexpaguis
Huge update from @GoogleAI. If its better at coding than Sonnet 3.7 while having a 1 million token context window (2 million coming soon), it should become extremely useful as a coding agent. @cursor_ai When can we have it as in option in agent mode?



5/31
@jordanapitts
I am advanced user. I don’t have it in APP iOS UK



6/31
@andrewwhite01
Hey @sundarpichai and @OfficialLoganK - just a reminder that the latest model on your production API (Provisioned throughput on vertex) - is gemini 1.5 pro-002, which will be deprecated soon. Can you please actually release the models for production use?




8/31
@kosuke_agos_en
Awesome



9/31
@raeesgillani
Grok is missing on two of your graphs?



10/31
@NandoDF
Now that you’re finally ahead on these benchmarks, would you consider getting rid of your aggressive practice of notice periods and noncompetes in Europe?

Every week someone from @GoogleDeepMind reaches out to me in despair to ask me how to get out of their 6 month to 1 year notice periods. You force your employees to sign these or prevent them from being promoted. It’s terrible for competition and startups in Europe.

These weapons against competition are not allowed in your home in California. Why enforce them here. Not very ethical, is it @sundarpichai? Please do the right thing.



11/31
@reach_vb
Congrats on the release, excited for next generation of Gemma models to learn from Gemini 2.5 🔥



12/31
@burny_tech
Is the giant coming back?





13/31
@ReturnAstro
Rock on Sundar - great stuff. Keep cooking /search?q=#Goog



14/31
@VerdictVirtuoso
I recently had gemini write a research paper with scholarly material. I was imoressed.



15/31
@altryne
Congrats on the release! Excited to test out the vibes and will cover this on the upcoming @thursdai_pod !

https://nitter.poast.org/i/spaces/1vOGwXQBRjRJB



16/31
@ChrisPerkles
You guys are cooking! Absolutely love the recent Gemma release!



17/31
@bradforbes
very cool! I go to Claude or ChatGPT over Gemini still, but Gemini is fantastic too and by far the most consistent and fastest at giving answers, got a feeling it may be my go-to soon...



18/31
@ofermend
Congrats to the entire @GoogleAI and @GoogleDeepMind team - this model is not only smarter, but importantly it maintains a very low hallucination rate (1.1%) on @vectara hallucination leaderboard.

GitHub - vectara/hallucination-leaderboard: Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents





19/31
@drew_carson_ai
hey your gemini-2.0-flash is crashed, getting 503 every time.



20/31
@erd0xbc
Wen @cursor_ai & Wen Canvas?



21/31
@LovedayChey
Wow, Gemini 2.5 Pro is here and it’s 🔥! Topping @lmarena_ai with epic reasoning & coding skills. Can’t wait to try it on Google AI Studio! 📷@sundarpichai
@geminiapp Google kicking ass on all fronts!!!



22/31
@lexfridman
Awesome, congrats!



23/31
@yushawrites
Excited to try it on vertex and compare to 2.0 and see what kind of problems we can solve 💪



24/31
@christiancooper
This model is crushing animations... A full quantum electrodynamics animation complete with math and study notes in about 30 minutes.



https://video.twimg.com/ext_tw_video/1904609822508756992/pu/vid/avc1/854x480/7Adu_vmgtiUYkz6J.mp4

25/31
@thenomadevel
This changes a lot for autonomous agents.
Gemini 2.5 is now in @CamelAIOrg





26/31
@GozukaraFurkan
Yes I just tested at difficult OCR task

Better than all other SOTA models atm

Only Claude 3.5 June better

Just tested this on newest Gemini 2.5 Pro Experimental 03-25 and it made 1… | Furkan Gözükara



27/31
@defi_is_theway
@grok gemini 2.5 says its better than you. What do you have to say ?



28/31
@lepadphone
Thinking is thinking.




30/31
@petergyang
Is there a reason why the canvas doesn't work with this model?



31/31
@basin_leon
Congratulations! 🤯
 



1/8
@scaling01
Google Gemini 2.5 Pro (Thinking) crushes every other model on LiveBench even my beloved Sonnet 3.7 and o1





2/8
@Roseofwolf
My question is, in the real world, does it perform better? Benchmarks are useful, but I am wondering what happens in the real world. See 3.6 was still leader in coding even if benchmarks said o3-mini-high was better.



3/8
@scaling01
we will see, but this looks promising



4/8
@bruce_x_offi
@AravSrinivas Gemini 2.5 Pro support when? with 1M context length



5/8
@DSucks32
Holy shyt Google after years finally put out a SOTA model



6/8
@nocodenoprob
been using it all morning. besides overloading issues, it’s incredible. refactored a whole large codebase other models struggled with in ~4 hours with minimal issues after the first pass



7/8
@kittingercloud
One of the few benches I reference



8/8
@wagmiglobal_
"My friends think I'm a crypto genius, (I'm not) it's because I listen to WAGMI’s daily podcast (and they have no idea it exists)" - Every Crypto Degen






1/8
@ArtificialAnlys
Google’s new Gemini 2.5 Pro Experimental takes the #1 position across a range of our evaluations that we have run independently

Gemini 2.5 Pro is a reasoning model, it ‘thinks’ before answering questions. Google has released it as an experimental API in AI Studio only, and has not yet disclosed pricing.

If Google prices Gemini 2.5 Pro at a similar level to Gemini 1.5 Pro ($1.25/$5 per million input/output tokens), Gemini 2.5 Pro will be significantly cheaper than leading models from OpenAI and Anthropic ($15/$60 for o1 and $3/$15 for Claude 3.7 Sonnet).

Key benchmarking results:
🥇All time high scores in MMLU-Pro and GPQA Diamond of 86% and 83% respectively

🥇All time high and significant leap in Humanity’s Last Exam, scoring 17.7% - a leap from o3-mini-high’s previous 12.3% record

🥇All time high score in AIME 2024 of 88%

🏁 Speed: 195 output tokens/s, much faster than Gemini 1.5 Pro’s 92 tokens/s and nearly as fast as Gemini 2.0 Flash’s 253 tokens/s

Gemini 2.5 Pro continues to support key features that the Gemini family is known for, including:
➤ 1 million token context window (2 million token context window, as supported by Gemini 1.5 Pro, coming soon)

➤ Multimodal inputs: image, video and audio (text output only)

Additional benchmark results are provided below. Stay tuned for the Artificial Analysis Intelligence Index once we finish running all 7 evaluations. 👇



Gm-dzyPbYAEWoKj.jpg

Gm-d0W2asAA-V4L.jpg

Gm-d1n6aEAAQrUZ.jpg

Gm-d4MqbUAAjWAX.jpg
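To make the pricing comparison in the tweet above concrete, here is a quick sanity check of the per-request cost gap. The request size (10k input / 1k output tokens) is a hypothetical example, and the Gemini 2.5 Pro figure assumes the speculated Gemini 1.5 Pro pricing, since Google had not disclosed real pricing at the time:

```python
# Quick sanity check of the per-request cost gap implied by the tweet's
# pricing figures. The request size (10k in / 1k out) is hypothetical,
# and Gemini 2.5 Pro pricing is the tweet's speculation, not official.
PRICES = {  # USD per million input / output tokens
    "gemini-2.5-pro (if priced like 1.5 Pro)": (1.25, 5.00),
    "o1": (15.00, 60.00),
    "claude-3.7-sonnet": (3.00, 15.00),
}

def request_cost(model, input_tokens, output_tokens):
    """Cost in USD for one request under a given model's token pricing."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

gemini = request_cost("gemini-2.5-pro (if priced like 1.5 Pro)", 10_000, 1_000)
o1 = request_cost("o1", 10_000, 1_000)
claude = request_cost("claude-3.7-sonnet", 10_000, 1_000)
# gemini -> $0.0175, o1 -> $0.21, claude -> $0.045
# i.e. o1 would cost 12x more per request at these rates
```

At these rates the same request costs about 12x more on o1 and about 2.6x more on Claude 3.7 Sonnet than on a 1.5-Pro-priced Gemini 2.5 Pro.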


2/8
@ArtificialAnlys
Further evaluations: SciCode and MATH-500



Gm-jnDcaUAExyaz.jpg

Gm-jnDgbYAEiweY.jpg


3/8
@ArtificialAnlys
Gemini models, both 2.5 Pro and 2.0 Flash, have the fastest output speed compared to leading models



Gm-kTQEbYAEyW5d.jpg


4/8
@ArtificialAnlys
Further analysis available on Artificial Analysis:
http://artificialanalysis.ai/models...easoning,grok-3,command-a,qwq-32b,deepseek-v3



5/8
@soltraveler_sri
Grok 3 has a much higher reported GPQA with extended thinking (and higher than Gemini 2.5’s)… is there some criteria in your methodology that disregards this?



6/8
@apeiron_spx
Amazing work @OfficialLoganK



7/8
@HCSolakoglu
Before Grok 3 API could even launch, it was basically made obsolete by Gemini 2.5. AI race is progressing incredibly fast...



8/8
@ArthurWeinberg3
New reality for openai😂



1/1
@patloeber
love this! Gemini 2.5 Pro one shots math games for kids :smile: Such a cool idea to make learning much more fun and interactive.

[Quoted tweet]
I wanted to check what all the hype about the new Gemini 2.5 Pro model from @GoogleDeepMind is all about..🤔

And I have to say: I'm REALLY impressed. Two prompts (one if you know what you're doing) to create a fully functional web game with animations, gamification, and more!🤯


https://video.twimg.com/ext_tw_video/1904940729777975296/pu/vid/avc1/1280x720/zWZSj2Ja11E-MVfq.mp4
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,810
Reputation
9,328
Daps
169,761

1/6
@m__dehghani
Gemini 2.5 Pro is here and it is 🥇 across ALL categories!

"...the largest score jump ever" 🔥

[Quoted tweet]
BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆

Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn!

Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌

More highlights in thread👇


Gm5tt6EagAAcDOU.jpg


2/6
@ai_for_success
Congratulations.. Didn't expect it to be 2.5 😂



3/6
@anthara_ai
Exciting news! Can't wait to see the improvements.



4/6
@gamaleldinfe
Congratulations 🎉



5/6
@HenkPoley
Isn't ChatGPT-3.5 -> GPT-4, 1106 -> 1245 = 139, a larger jump 🤔

Technically the initial GPT-4 March model is even better, but maybe the non-turbo 3.5 was a bit better too.



6/6
@willofguts
Got the correct response in 3 tries:

I have six legs.
Two for walking
Four for swimming.
I have six arms.
Two I carry by my side.
Four that are never dried.
I have six eyes.
Two I use to see.
Four I long to see.

What am I?










1/30
@lmarena_ai
BREAKING: Gemini 2.5 Pro is now #1 on the Arena leaderboard - the largest score jump ever (+40 pts vs Grok-3/GPT-4.5)! 🏆

Tested under codename "nebula"🌌, Gemini 2.5 Pro ranked #1🥇 across ALL categories and UNIQUELY #1 in Math, Creative Writing, Instruction Following, Longer Query, and Multi-Turn!

Massive congrats to @GoogleDeepMind for this incredible Arena milestone! 🙌

More highlights in thread👇

[Quoted tweet]
Think you know Gemini? 🤔 Think again.

Meet Gemini 2.5: our most intelligent model 💡 The first release is Pro Experimental, which is state-of-the-art across many benchmarks - meaning it can handle complex problems and give more accurate responses.

Try it now → goo.gle/4c2HKjf


Gm5tt6EagAAcDOU.jpg


https://video.twimg.com/ext_tw_video/1904576192369360896/pu/vid/avc1/720x900/hS0fNT3KJIB-icRH.mp4

2/30
@lmarena_ai
Gemini 2.5 Pro #1 across ALL categories, tied #1 with Grok-3/GPT-4.5 for Hard Prompts and Coding, and edged out across all others to take the lead 🏇🏆



Gm5tsGVbsAAtAPv.jpg


3/30
@lmarena_ai
Gemini 2.5 Pro ranked #1 on the Vision Arena 🖼️ leaderboard!



Gm5tzGtbQAA623K.jpg


4/30
@lmarena_ai
Gemini 2.5 Pro also stands out in Web Development by hitting #2 on WebDev Arena!

It is the first model to match Claude 3.5 Sonnet and a huge leap over the previous Gemini.

See leaderboards at: http://web.lmarena.ai/leaderboard



Gm5t2xwaMAAzy4O.jpg


5/30
@lmarena_ai
Check out Gemini and other frontier AIs at http://lmarena.ai - your votes shape the leaderboards!



6/30
@OriolVinyalsML
Very kind of you to say this, but this isn't the largest score jump ever. I did some vibe coding to put the data lmarena-ai/chatbot-arena-leaderboard at main together, but it was a pain. "The largest lead was 69.30 points when GPT-4 Turbo took over from Claude-1 on November 16, 2023"



7/30
@lmarena_ai
Wow, you're right! That was two weeks before Gemini 1.0. What a wild journey it's been!



8/30
@jikkujose
Was Sonnet 3.7 purposefully left out?



9/30
@testingcatalog
Huge leap! 🔥 Now it is a question how long will it take for others to catch up 😈



10/30
@MOHIT_BHAT18
@AskPerplexity explain?



11/30
@nicdunz
big



12/30
@burny_tech
Google finally used their infinite money glitch?



13/30
@ronpezzy
SHEEEESH



14/30
@colesmcintosh
+40 🤯



15/30
@Excel4Freelance
Big win for Gemini 2.5 Pro! 🚀 Curious to see how it holds up in real-world use



16/30
@OfficialLoganK
The Deepmind team really cooked with this model, huge congrats : )



17/30
@IamEmily2050
We are still below 1500, but we are very close 🙏🙏



18/30
@BasedInHealth
Claude bros hurting right now.



19/30
@HDRobots
Finally, Google delivered a leading model.



20/30
@thegenioo
V3?



21/30
@dhtikna
Whats the style control hard elo



22/30
@steveouch
So what? Is it better now? How?



23/30
@AJtheMongol
@AskPerplexity what is Arena? Why is it significant?



24/30
@AndyJScott
wow



25/30
@tanvitabs
congrats team ✨



26/30
@AshiishKushwha
Big Woooooooooooooow ❤️❤️



27/30
@DCathal
[]



28/30
@jonathanvases
@elonmusk Send a new update like last time to be number 1 again



29/30
@LagonRaj
And yet...

A prisoner of its own design.

So much for benchmarks.

[Quoted tweet]
x.com/i/article/190185583829…


30/30
@ketansingh279







1/3
@LechMazur
Gemini 2.5 Pro Experimental (03-25) takes second place on my Thematic Generalization Benchmark!

This benchmark evaluates how effectively LLMs infer a narrow "theme" (category/rule) from a small set of examples and anti-examples and then identify which item truly fits that theme.



Gm-y4NFWMAAV5ZX.jpg


2/3
@LechMazur
DeepSeek V3-0324 (1.95) improves over DeepSeek V3 (2.03).

o3-mini-high (1.84) was also added.



Gm-zhP8XwAAHSns.jpg


3/3
@LechMazur
More info and logs: GitHub - lechmazur/generalization: Thematic Generalization Benchmark: measures how effectively various LLMs can infer a narrow or specific "theme" (category/rule) from a small set of examples and anti-examples, then detect which item truly fits that theme among a collection of misleading candidates.



Gm-ztBgXIAAAUPY.jpg









1/11
@xf1280
Gemini 2.5 Pro is such a strong coding model. I was able to use it to convert an image into a 3d-printable object, and bring it to life. More in 🧵👇

[Quoted tweet]
1/ Gemini 2.5 is here, and it’s our most intelligent AI model ever.

Our first 2.5 model, Gemini 2.5 Pro Experimental is a state-of-the-art thinking model, leading in a wide range of benchmarks – with impressive improvements in enhanced reasoning and coding and now #1 on @lmarena_ai by a significant margin. With a model this intelligent, we wanted to get it to people as quickly as possible.

Find it on Google AI Studio and in the @geminiapp for Gemini Advanced users now – and in Vertex in the coming weeks. This is the start of a new era of thinking models – and we can’t wait to see where things go from here.


Gm50XXwaMAAtN3-.jpg


https://video.twimg.com/ext_tw_video/1904579366874415104/pu/vid/avc1/720x900/rnO07SKeFnXBRwo0.mp4

2/11
@xf1280
I first asked Gemini 2.0 Flash image generation to convert my wife's sketch of a 3-tiered cake into a 3d rendering of it. The model did pretty well and added lots of details.



Gm51EFEbYAImJT5.jpg


3/11
@xf1280
Then I asked the newly released 2.5 Pro Exp model to write OpenSCAD code to reproduce this 3d model, paying attention to printability. The model came up with this!



Gm51YvtbYAAvTXc.jpg

Gm513oUbYAEewVa.jpg


4/11
@xf1280
Hit the print button and we got a cute toy! Love all the details.



Gm52JBVbUAAkBIC.jpg

Gm52JBRbYAARg_F.jpg

Gm52JApa4AA-6Pr.jpg


5/11
@xf1280
Trained as a roboticist, I tended to think foundation models could only interact with the physical world through robot actions. Now I realize that as you scale up the model, since there is an increasingly accurate world model embedded in it, Gemini can also create physical objects by leveraging its world knowledge. If we zoom out, 3d-printer g-code is also robot actions, with a little help from compilers.



6/11
@xf1280
Try it out at Google AI Studio today!



7/11
@hhua_
Wow



8/11
@CreatedByJannn
thats sick, for the 3d conversion part id use something like @3DAIStudio tho



9/11
@C_WolfHost
https://dexscreener.com/solana/85HrAgyzMvbBD1QFvnN8Q5S1UtXRijNNC65csNZAk2Qk

Gemini 2.5 pro



10/11
@cogentia
Great idea 💡



11/11
@Mad_dev
The next will be putting the model, the CAD software, the 3D printer in a feedback loop to design and optimize robot parts





1/8
@kimmonismus
The new king: Google Gemini 2.5 pro

What a jump in performance! A new king crowns the benchmarks: Gemini 2.5 pro.

And with a significant leap in performance: mad respect!

After a long time, Google has now clearly shown what they are made of - the many TPUs are paying off, so we can even test Gemini 2.5 pro free of charge in AI Studio.

What a race. Exciting every day!



Gm-tSF5WYAAihNt.jpg


2/8
@erithium




Gm-tvoyW4AAV_li.jpg


3/8
@kimmonismus
hahahhahaha



4/8
@DavidJAlba94
The gpqa result is crazy. Above O1 pro and free of charge. Only surpassed by the not yet released O3. Accelerate!! 🔥🚀



5/8
@jsnellauthor
It's amazing how fast AI continues to progress.



6/8
@atwilkinson_
Yes, it's very bright.



7/8
@SeriousStuff42
‼️Please share:

Poornima Rao (Suchir Balaji's mother): "It was about much more than 'just' copyright infringement!"



https://video.twimg.com/ext_tw_video/1904937578370887680/pu/vid/avc1/720x720/ZGgaw7cwEiyjReLI.mp4

8/8
@theanimatednate
It’s like a game of leap frog with Ai models






1/11
@bindureddy
WE HAVE A NEW BEST MODEL IN THE WORLD!

GEMINI 2.5 IS #1 ON LIVEBENCH



Gm-korHbYAETqPU.jpg


2/11
@spacegrep
Livebench is getting stale now. Livebench 2 out when?



3/11
@bindureddy
We already update regularly... that's why it's "Live"

We have a ton of updates coming soon



4/11
@anonym_vestor
When o1 pro?



5/11
@bindureddy
Ok fine... we will do it today



6/11
@OriolVinyalsML
That's my arrow! Cite me please! 😇 j/k



7/11
@bindureddy
Yes, it is! I figured it makes Gemini look better :smile:



8/11
@TristanHurleb
That is a crazy jump honestly insane to see that much of a jump from anthropic's 3.7 thinking



9/11
@shashib
👏



10/11
@erithium
I can’t hear you over the ghibli images 😂



11/11
@stillgray
Yeah but does it make good Ghibli memes.
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,810
Reputation
9,328
Daps
169,761










1/13
@picocreator
🧱 Transformers have hit the scaling wall 🧱

💰 GPT 4.5 cost billions, with no clear path to AGI for 10x$ more
📘 Facebook, Yann LeCun, is now saying we need new architectures
🔎 Deepmind CEO, Demis Hassabis, is saying we need 10 years

We have another path to AGI in < 4 years



https://video.twimg.com/ext_tw_video/1904337130333302785/pu/vid/avc1/1280x720/aZlXTvyYsX_77HtR.mp4

2/13
@picocreator
At the heart of it, today's top models are

- Capable: of incredible PhD-level tasks & beyond
- (Un)Reliable: maybe 1 time out of 30

What everyone wants is not a smarter model
But a more reliable model, doing basic college-level tasks

Longer write up:
🛣️ Our roadmap to Personalized AI and AGI



3/13
@picocreator
To do the more boring things in life like

- organize emails and receipts
- fill up forms
- order groceries
- be a friend

The things that actually matter ....

All tasks which a 72B model is more than capable of,
if only it were more reliable



Gm2ajcKaYAAgVHV.jpg


4/13
@picocreator
And that's where our work on Qwerky comes in.

Because the one thing that is holding these AI models and agents back....

Is simply the lack of reliable understanding, in memories. Memories are at the heart of recurrent models like RWKV...

[Quoted tweet]
❗️Attention is NOT all you need ❗

Using only 8 GPU's (not a cluster), we trained a Qwerky-72B (and 32B), without any transformer attention

With evals far surpassing GPT 3.5 turbo, and closing in on 4o-mini. All with 100x++ lower inference cost, via RWKV linear scaling


Gm01EkYa0AA6C1Q.jpg


5/13
@picocreator
Instead of scaling bigger, more expensive models, which are unable to bring an ROI to investors

What if we iterate faster, at <100B active parameters,

to make these already capable models more reliable instead. More personalizable. At a size with ROI

[Quoted tweet]
Also, this new approach of treating FFN/MLP as a separate reusable building block.

Will allow us to iterate and validate changes in the RWKV architecture at larger scales. Faster 🏃

So expect bigger changes, in our avg ~6 month version cycle

( even i struggle to keep up 🤣)


Gm1AwosagAAbDY5.jpg


6/13
@picocreator
A model which can be "Memory tuned" without Catastrophic Forgetting

Overcoming the barrier, which makes finetuning out of reach for the vast majority of teams

With quick efficient personalization of AI models, to unlock reliable commercial AI agents. Without compounding errors



Gm2fiZ-bQAAwrQ2.jpg


7/13
@picocreator
Memory is the secret to AGI

Once memory for personalized AI is mastered - where it can be reliably tuned by AI engineers, easily, with controlled datasets...

The next step is to get the AI model to prepare its own continuous training dataset without compounding loss



Gm2gCW3b0AAjdaH.jpg


8/13
@picocreator
It's a binary question: is recurrent memory the path to AGI?

If so, this path to AGI is inevitable

As all the critical ingredients are already here, and it is not bound by hardware, only by software

You can read more in details in our long form writing...
🛣️ Our roadmap to Personalized AI and AGI



9/13
@AlphaMFPEFM
Interesting read, reliability of daily fine-tuning with new memories is something I'm looking forward too. I hope you succeed !



10/13
@picocreator
Thank you 🙏

I believe all of us wish for AI to do more of the boring chores in life... reliably, to our preferences



11/13
@life_of_ccb
Maybe Memory is All You Need?

Not only intellectually fascinating, but also pragmatically compelling.



12/13
@picocreator
Yup, we have internal debates on what size is required for human level intelligence, once memory is "mastered"

Some argue 32B, some argue 72B, even 7B is in consideration.

Decided to play it safe with <100B



13/13
@reown_
AppKit is the full-stack toolkit to build onchain app UX 🪄

✅ Social, Email, and Wallet Login
✅ Embedded Wallets
✅ Crypto Swaps
✅ On-ramp

Integrate with just 20 lines of code across 10+ languages for all EVM chains and Solana.

Onboard millions of users for free today.




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,810
Reputation
9,328
Daps
169,761











1/12
@picocreator
❗️Attention is NOT all you need ❗

Using only 8 GPU's (not a cluster), we trained a Qwerky-72B (and 32B), without any transformer attention

With evals far surpassing GPT 3.5 turbo, and closing in on 4o-mini. All with 100x++ lower inference cost, via RWKV linear scaling



Gm01EkYa0AA6C1Q.jpg


2/12
@picocreator
It trades blows with existing transformer models of the same weight class.

However, what is crazier is the fact that we have strong evidence that the vast majority of an AI model's knowledge and intelligence is NOT in the attention layers - but in the FFN

🪿Qwerky-72B and 32B : Training large attention free models, with only 8 GPU's



Gm027ktaIAAzBfQ.jpg


3/12
@picocreator
Because this is a model that was converted from an existing transformer model, where we...
- froze the FFN (also known as the MLP)
- deleted the QKV transformer attention
- replaced it with RWKV
- and trained it with <500M tokens

(this is a massive simplification, wait for the paper)



Gm05JNXbYAAIjm-.jpg
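The conversion recipe in the tweet above can be illustrated with a toy numpy sketch. This is NOT the actual Qwerky code: the shapes, the single scalar decay, and the simple receptance gate are all made-up simplifications (real RWKV uses learned per-channel decays and extra gating). It only shows the general idea: keep the FFN weights, drop the quadratic QKV attention, and swap in a linear-recurrence token mixer that carries a fixed-size state instead of a T x T score matrix:

```python
import numpy as np

# Hypothetical toy sketch of the conversion recipe -- not the real Qwerky
# code. Sizes and the decay constant are illustrative only.
rng = np.random.default_rng(0)
T, D = 6, 8  # toy sequence length and hidden size

def softmax_attention(x, Wq, Wk, Wv):
    """Standard causal attention: builds a T x T score matrix (quadratic in T)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(D)
    mask = np.tril(np.ones((len(x), len(x)), dtype=bool))
    scores = np.where(mask, scores, -np.inf)  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

def recurrent_mixer(x, Wr, Wk, Wv, decay=0.9):
    """RWKV-flavoured replacement: one D x D state, linear cost in T."""
    r, k, v = x @ Wr, x @ Wk, x @ Wv
    state = np.zeros((D, D))
    out = np.empty_like(x)
    for t in range(len(x)):
        state = decay * state + np.outer(k[t], v[t])    # O(D^2) per token
        out[t] = (1.0 / (1.0 + np.exp(-r[t]))) @ state  # receptance gate
    return out

def ffn(x, W):
    """The frozen MLP block: reused unchanged across the conversion."""
    return np.maximum(x @ W, 0.0)

# The recipe: keep W_ffn, delete QKV attention, swap in the recurrent mixer.
W_ffn = rng.standard_normal((D, D)) / np.sqrt(D)
Wq, Wk, Wv, Wr = (rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(4))

x = rng.standard_normal((T, D))
before = ffn(softmax_attention(x, Wq, Wk, Wv), W_ffn)  # original block
after = ffn(recurrent_mixer(x, Wr, Wk, Wv), W_ffn)     # converted block
```

In the real conversion the frozen FFN keeps its trained weights while only the new mixer is trained (on <500M tokens, per the thread); here both paths just share the same `W_ffn` to show that the MLP is untouched by the swap.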


4/12
@picocreator
And let's be realistic: 500M tokens of general text is not enough to train the model to the level of improvement we saw on the ARC or WinoGrande benchmarks

Especially given we effectively deleted and reset a good 1/3rd of the "AI brain" 🧠

HF link:
featherless-ai/Qwerky-72B · Hugging Face



5/12
@picocreator
Which gives evidence that the majority of "knowledge/intelligence" is in the FFN/MLP layers that were reused

An alternative is to view attention as a means of managing memories and focus for the AI brain 🤔

You can test the model @ featherless here: featherless-ai/Qwerky-72B - Featherless.ai



6/12
@picocreator
Also, this new approach of treating FFN/MLP as a separate reusable building block.

Will allow us to iterate and validate changes in the RWKV architecture at larger scales. Faster 🏃

So expect bigger changes, in our avg ~6 month version cycle

( even i struggle to keep up 🤣)



Gm1AwosagAAbDY5.jpg


7/12
@picocreator
Oh, let's not forget that one of the biggest benefits of sub-quadratic architectures like RWKV....

Is the crazy reduction in inference cost, for both VRAM and compute requirements

Allowing us to do more with less

Quadratic scaling is great for revenue
And is horrible for costs



Gm1JeHCa8AAgVtT.jpg
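The quadratic-vs-linear cost claim above is easy to check with back-of-the-envelope arithmetic. The hidden size and the simplified operation counts below are illustrative assumptions, not measurements of any real model:

```python
# Back-of-the-envelope illustration of why sub-quadratic scaling matters
# for inference cost. The hidden size and op counts are simplified
# assumptions, not real model measurements.
d = 4096  # illustrative hidden size

def attention_score_ops(T):
    # softmax attention builds a T x T score matrix: ~T^2 * d multiplies
    return T * T * d

def linear_mixer_ops(T):
    # an RWKV-style recurrence does ~d^2 work per token: ~T * d^2 total
    return T * d * d

short, long = 1_000, 1_000_000  # 1k-token vs 1M-token context
ratio_short = attention_score_ops(short) / linear_mixer_ops(short)
ratio_long = attention_score_ops(long) / linear_mixer_ops(long)
# ratio = T / d: at 1k tokens attention's score matrix is actually the
# cheaper term (~0.24x), but at 1M tokens it costs ~244x more, and the
# gap keeps growing linearly with context length.
```

The crossover sits at T = d: below it the per-token d^2 state update dominates, above it the T x T score matrix does, which is why the savings show up mainly at long context.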


8/12
@picocreator
Making the blog post for the model more obvious here:
🪿Qwerky-72B and 32B : Training large attention free models, with only 8 GPU's



9/12
@picocreator
🤫

Imagine us doing this conversion on deepseek...
Make it a 100x more efficient at inference....

Yea we are definitely aiming to do that



10/12
@picocreator
If you want to know whats next for us...
And our roadmap to personalized AI and AGI....

[Quoted tweet]
🧱 Transformers have hit the scaling wall 🧱

💰 GPT 4.5 cost billions, with no clear path to AGI for 10x$ more
📘 Facebook, Yann LeCun, is now saying we need new architectures
🔎 Deepmind CEO, Demis Hassabis, is saying we need 10 years

We have another path to AGI in < 4 years


https://video.twimg.com/ext_tw_video/1904337130333302785/pu/vid/avc1/1280x720/aZlXTvyYsX_77HtR.mp4

11/12
@omouamoua
Very cool! Is there a ready-to-use RWKV7 PyTorch Module somewhere, like there is for Mamba?



12/12
@picocreator
I have working (but not final) version of the modular blocks here

GitHub - RWKV/RWKV-block: PyTorch implementation of RWKV blocks




 

Secure Da Bag

Veteran
Joined
Dec 20, 2017
Messages
41,715
Reputation
21,590
Daps
130,468

Artificial Intelligence (AI) can perform complex calculations and analyze data faster than any human, but to do so requires enormous amounts of energy. The human brain is also an incredibly powerful computer, yet it consumes very little energy.

As technology companies increasingly expand, a new approach to AI's "thinking," developed by researchers including Texas A&M University engineers, mimics the human brain and has the potential to revolutionize the AI industry.

Dr. Suin Yi, assistant professor of electrical and computer engineering at Texas A&M's College of Engineering, is on a team of researchers that developed "Super-Turing AI," which operates more like the human brain. This new AI integrates certain processes instead of separating them and then migrating huge amounts of data like current systems do.

The "Turing" in the system's name refers to AI pioneer Alan Turing, whose theoretical work during the mid-20th century has become the backbone of computing, AI and cryptography. Today, the highest honor in computer sciences is called the Turing Award.

The team published its findings in Science Advances.

The energy crisis in AI​

Today's AI systems, including large language models such as OpenAI and ChatGPT, require immense computing power and are housed in expansive data centers that consume vast amounts of electricity.

"These data centers are consuming power in gigawatts, whereas our brain consumes 20 watts," Suin explained. "That's 1 billion watts compared to just 20. Data centers that consume this much energy are not sustainable with current computing methods. So, while AI's abilities are remarkable, the hardware and power generation needed to sustain it remain a problem."
 

bnew

Veteran
Joined
Nov 1, 2015
Messages
61,810
Reputation
9,328
Daps
169,761
[News] DeepSeek V3 0324 on LiveBench surpasses Claude 3.7 (self.LocalLLaMA)


Posted 2025-03-27T11:44:33+00:00
https://i.redd.it/cvzv13s3z7re1.png
cvzv13s3z7re1.png


Just saw the latest LiveBench results, and DeepSeek's V3 (0324) is showing some impressive performance! It's currently sitting in 10th place overall, but what's really interesting is that it's the second-highest non-thinking model, behind only GPT-4.5 Preview, while outperforming Claude 3.7 Sonnet (the base model, not the thinking version).

We will have to wait, but this suggests that R2 might be a stupidly great model: if V3 is already outperforming Claude 3.7 (base), the next version could seriously challenge the big ones.

 
Top