bnew

Veteran
Joined
Nov 1, 2015
Messages
61,740
Reputation
9,298
Daps
169,588





1/11
@ai_for_success
India's first reasoning model, 'SUTRA-R0' (36B), is here and available on the ChatSutra platform.

It can also search the internet, and it performs relatively well on MMLU benchmarks compared to o1 and DeepSeek R1.

It's a great start for us, and it's exciting to see models coming out of my own country now.

[Quoted tweet]
BIG BIG news from all of us at TWO AI @two_platforms .

#SUTRA-R0, our first reasoning model is here.

A reasoning model that delivers deeper, structured thinking across topics and domains.

In early results, R0 beats OpenAI-o1-mini and DeepSeek-R1-32B in Hindi, Gujarati, Tamil and most Indian languages.

Try the model in ChatSUTRA, powered by SUTRA-R0-Preview. More updates to follow soon.

#SUTRA #AI




2/11
@ai_for_success
Just tested if it can pull the latest details from the internet and tell me about the Gemini 2.0 Pro release.





3/11
@Chiragjoshi_12
It's open source?



4/11
@ai_for_success
No..



5/11
@captain_marrvel
How can one get access??



6/11
@ai_for_success
Just login using Gmail and start using: ChatSUTRA - Multilingual AI Assistant



7/11
@AryanPa66861306
Most of TWO AI's talent is America-based; that's why they are doing so well. 😜😂



8/11
@ai_for_success
So should I call it Indian or American now 😂



9/11
@ernkrum
Every day a new one. Great to see that.



10/11
@AbhinavGirdhar
Amazing to see India making strides in AI! Excited to see how SUTRA-R0 evolves and competes on the global stage.



11/11
@BalajiAkiri
Interesting. I don't think Two AI is an India-based company; it's headed by an Indian but headquartered in the US, I guess. It looks like a closed model built on the foundations and breakthroughs of DeepSeek R1. Their innovation is really interesting. Waiting for their technical papers.






To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 



1/11
@robertwiblin
It's ridiculous how much LLMs hallucinate - whenever I read 60 million books I remember every detail perfectly.



2/11
@LostAndFounding
Yeh, kinda lame. I think they just need to find that one, true, perfected note-taking app and they'll be set.



3/11
@gerardsans
Be aware that humanizing AI is a common mistake. An algorithm is not a person, and drawing that comparison is misleading. Instead of projecting human traits onto AI, focus on understanding how it actually works—if that is what you are interested in.



4/11
@Associations123
There is an asymmetry tbf:
Me when I don't remember: I don't know, I think I read a paper by Jones about that?

LLM when it hallucinates: Certainly, this was discussed in Smith and Gradowski's (1993) in The Journal of... [proceeds to provide perfectly fictitious detail]



5/11
@Aiden_Novaa
My brain is basically an error-free database… except for the times I walk into a room and forget why I’m there.



6/11
@postmindfukk
They are like people who pretend to read to impress women.



7/11
@ljupc0
😂😂 there will be people that will miss your sarcasm there I bet



8/11
@DavidChanseok
But hopefully you will admit when ur memory is fuzzy?



9/11
@LucisClaritas
LLMs aren't sentient, and thus cannot hallucinate. They just get things wrong.



10/11
@LilUziVartan
I could read a billion books and pretty sure I'd be better situated to count the amount of a single letter in a short word afterwards



11/11
@motherwell
An LLM is JUST what it knows. It has no Conscience. No sense of guilt. No love. No status games to play. It is unidimensional. That means when it hallucinates, it is wrong on the only dimension it has. Hallucinations stop LLMs from being viable in many situations.






1/11
@robertwiblin
Surprising how much Anthropic has eaten into OpenAI in the enterprise market:





2/11
@robertwiblin
Source: Has Europe’s great hope for AI missed its moment?



3/11
@liron
What’s better about them for enterprise?



4/11
@morqon
if i recall the fine details, this menlo report showed that enterprise shifted to using more than one model, not a 1-for-1 substitution: from only openai to openai and anthropic



5/11
@Scott_S612
Mainly from the API; the Claude app's stats are really down in the dumps. They are truly a MaaS company.



6/11
@drjfhll
Anthropic is SO good



7/11
@Presidentlin
Half of 2024 Anthropic is Cursor.



8/11
@TheXeophon
The source is an Anthropic investor fwiw



9/11
@circlerotator
Enterprise is about to be wholly replaced if things go well. Targeting APIs is idiotic if the model is a remote coworker.



10/11
@jakehalloran1
Sonnet was best for coding until the o series. It will swing back if gpt 5 is better than Claude 4



11/11
@matt_slotnick
is that share of total dollars, or total % of enterprises using?













1/11
@dreamingtulpa
oh shyt, here we go again!

Animate Anyone 2 can replace people in videos with anyone from a single reference image 🔥



https://video.twimg.com/amplify_video/1890345165958885376/vid/avc1/720x852/ugYYYSf_hn8ZB_jN.mp4

2/11
@dreamingtulpa
the first iteration of this caused quite the fuss a year ago 😂

[Quoted tweet]
High quality AI generated human videos are coming!

Animate Anyone can generate videos of anyone with a single image and a bit of pose guidance 🤯

humanaigc.github.io/animate-…


https://video.twimg.com/amplify_video/1730876620569718784/vid/avc1/1468x720/AALn232ln_XSEUmA.mp4

3/11
@dreamingtulpa
link to the project page 👇

Animate Anyone 2



4/11
@DeHavenAI
Cool



5/11
@dreamingtulpa
💯



6/11
@bearoffsghost
wow



7/11
@dreamingtulpa
wow indeed



8/11
@ThisIsMeIn360VR
I'm ready to be in the Movies 🎞

[Quoted tweet]
The most exciting thing about 3d scanning and #FaceSwap technology is that soon there will be tons of bizarre apps that emerge.
Eventually you'll be watching movies where you and your friends are the stars. The Avengers will never be the same. :-O
medium.com/@ThisIsMeIn360VR/…


9/11
@neuroautomata
👀🤪💀



10/11
@matt_barrie
There goes Hollywood



11/11
@ThisIsMeIn360VR
I'm ready to be in Ads ! 📺

[Quoted tweet]
x.com/i/article/188514147360…








1/9
@kimmonismus
Animate Anyone 2: High-Fidelity Character Image Animation with Environment Affordance

Animate Anyone 2 can replace people in videos with anyone from a single reference image

Developed by Alibaba Group; more on GitHub



https://video.twimg.com/ext_tw_video/1890378753877884928/pu/vid/avc1/730x720/gAacbtR8zAs7y3VS.mp4

2/9
@kimmonismus
Animate Anyone 2



3/9
@tahreem57
Impressive



4/9
@NFTMentis
These hot-swappable tools are getting really good…



5/9
@Tenkaizen8
That's quite the innovation



6/9
@RahulRa75965227
Two legends in one frame



7/9
@iwhizkid__
China is on fire with video models. Veo 2 is amazing but chinese models are cheap and getting cheaper. Its gonna be a tough market for any one company to dominate.



8/9
@jackadoresai
Character animation just got a boost! Heading to Alibaba's Github now!



9/9
@rethynkai
Now with such advancements ethical consideration of AI will be focused on more than ever

















1/11
@ai_for_success
China is doing some crazy research in AI video generation. Tongyi Lab (Alibaba Group) just dropped another banger paper on AI video.

Animate Anyone 2: Works insanely well with:
> Dynamic motion
> Human interaction

10 wild examples + research paper below 👇



https://video.twimg.com/ext_tw_video/1890427141671780352/pu/vid/avc1/1096x1080/-vTQGbtJT9jO1lTE.mp4

2/11
@ai_for_success
2. Works well with animal reference picture as well.



https://video.twimg.com/ext_tw_video/1890427401974476800/pu/vid/avc1/1080x1280/HnNL9aBx_ApI_gZC.mp4

3/11
@ai_for_success


[Quoted tweet]
link to the project page 👇

humanaigc.github.io/animate-…


4/11
@ai_for_success
3. Cristiano Ronaldo doing parkour.



https://video.twimg.com/ext_tw_video/1890427715599364098/pu/vid/avc1/1080x1280/kqZR7ZW0M78VlzEW.mp4

5/11
@ai_for_success
4. Works really well with fast-moving action videos too.



https://video.twimg.com/ext_tw_video/1890427887758774272/pu/vid/avc1/1080x1280/0yAFDHMm8tI9MdHc.mp4

6/11
@ai_for_success
5. You can also replace a single character in a video with multiple characters.



https://video.twimg.com/ext_tw_video/1890428140331380736/pu/vid/avc1/1096x1080/-woFp3wdzgGdarvJ.mp4

7/11
@ai_for_success
6. Works with Anime character images too.



https://video.twimg.com/ext_tw_video/1890428274632953856/pu/vid/avc1/1096x1080/IBh8fDLrMoNogrsv.mp4

8/11
@ai_for_success
7. Mr Bean.



https://video.twimg.com/ext_tw_video/1890428384620212224/pu/vid/avc1/1096x1080/vo3XGdBqs2tfQqe0.mp4

9/11
@ai_for_success
8.



https://video.twimg.com/ext_tw_video/1890428463687106560/pu/vid/avc1/1080x1280/iQZBstgCSAVXLSmf.mp4

10/11
@ai_for_success
9. This is incredible, it maintains movement so accurately.



https://video.twimg.com/ext_tw_video/1890428609426518016/pu/vid/avc1/1080x1280/tvEOa7NUEEk3Mika.mp4

11/11
@ai_for_success
10. The Matrix.



https://video.twimg.com/ext_tw_video/1890428828415324163/pu/vid/avc1/1096x1080/tY7bd-MdRsh9GqPM.mp4

















1/12
@minchoi
China's Alibaba just announced Animate Anyone 2.

This AI can take any single image of a character and animate it with realistic motion that blends seamlessly into its environment.

10 wild examples:

1. Human Interaction



https://video.twimg.com/ext_tw_video/1890419322893320193/pu/vid/avc1/730x720/EsK3zcwCP5w3vNJu.mp4

2/12
@minchoi
2. Dynamic Motion



https://video.twimg.com/ext_tw_video/1890419512173887489/pu/vid/avc1/720x852/1rsNoFhGjka0d_r_.mp4

3/12
@minchoi
3. Dynamic Motion



https://video.twimg.com/ext_tw_video/1890419559850491904/pu/vid/avc1/720x852/EdFJnhCkUlHt-T4f.mp4

4/12
@minchoi
4. Dynamic Motion



https://video.twimg.com/ext_tw_video/1890419776616386560/pu/vid/avc1/730x720/YcCInQwb4MuLbWEj.mp4

5/12
@minchoi
5. Environment Interaction



https://video.twimg.com/ext_tw_video/1890420075703783424/pu/vid/avc1/720x852/n-0zDupPMbnzZidr.mp4

6/12
@minchoi
6. Human Interaction



https://video.twimg.com/ext_tw_video/1890420229596971008/pu/vid/avc1/730x720/j7IADDNS71BRDcQJ.mp4

7/12
@minchoi
7. Environment Interaction



https://video.twimg.com/ext_tw_video/1890420376435396608/pu/vid/avc1/720x852/9yb_npfVhhiCWMXZ.mp4

8/12
@minchoi
8. Environment Interaction



https://video.twimg.com/ext_tw_video/1890420629645447168/pu/vid/avc1/720x852/BwpDNwOW6XA1sIf9.mp4

9/12
@minchoi
9. Dynamic Motion



https://video.twimg.com/ext_tw_video/1890421016452587520/pu/vid/avc1/720x852/_Eq7Cjoj2G97JLcw.mp4

10/12
@minchoi
10. Human Interaction



https://video.twimg.com/ext_tw_video/1890421100829364226/pu/vid/avc1/730x720/xZG4SKtGwt9f6GhX.mp4

11/12
@minchoi
If you enjoyed this thread,

Follow me @minchoi and please Bookmark, Like, Comment & Repost the first Post below to share with your friends:

[Quoted tweet]
China's Alibaba just announced Animate Anyone 2.

This AI can take any single image of a character and animate it with realistic motion that blends seamlessly into its environment.

10 wild examples:

1. Human Interaction


https://video.twimg.com/ext_tw_video/1890419322893320193/pu/vid/avc1/730x720/EsK3zcwCP5w3vNJu.mp4

12/12
@minchoi
Check out their GitHub Project page here 👇
Animate Anyone 2






1/11
@kimmonismus
OpenAI *coding* progress:
1st reasoning model = 1,000,000th best coder in the world
o1 (Oct 2023) was ranked = 9800th
o3 (Dec 2023) was ranked = 175th
(today) internal model = 50th

"And we will probably hit number 1 by the end of the year"

In 2026, AI will probably develop and improve itself better than it could with human assistance. And in 2027, we will enter the positive feedback loop: AI will improve and develop itself entirely on its own.

That is the necessary consequence of what Sam says. If all this revolutionary AI development were not backed up by evidence, if it were not empirically proven, it would be dismissed as a pipe dream.

What a time to be alive.

[Quoted tweet]
OpenAI *coding* progress:
1st reasoning model = 1,000,000th best coder in the world
o1 (Oct 2023) was ranked = 9800th
o3 (Dec 2023) was ranked = 175th
(today) internal model = 50th

superhuman coder by eoy 2025?


https://video.twimg.com/ext_tw_video/1888330009334743040/pu/vid/avc1/1280x720/JLZCr6fUNW_SGNym.mp4

2/11
@Angaisb_
I hope that means we get GPT-5 this year



3/11
@kimmonismus
If you ask me: yes



4/11
@Verandris
o1 October 2024, o3 December 2024. I know that it appears to be ages ago but it was in the year before! ;D



5/11
@DeFiKnowledge
I like to think of it like God realizing his own Will while His main creation gets to live in Heaven and witness the unfolding done through Him!

Such a beautiful gift ❤️

Allowing humans to turn back to each other and focus on real meaning while God takes care of universal non-human systemic ends as a means to ensure consciousness continues to burn for as long as possible.

So sweet ❤️



6/11
@LavanPath
Dates should be 2024 rather than 2023. It makes it even more impressive.



7/11
@UYisaev






8/11
@squarecapo
getting to number 1 is insane given how good people can be



9/11
@castillobuiles
Yet there is not a single production product made by an OpenAI model.





10/11
@RexAdamantium
The only thing to add is that we look back and then expect the same pace looking forward. This might be the case, or it could go slower, at a fluctuating tempo, or much, much faster. The biggest leap will not be broadcast on the internet; we will just see the effects.



11/11
@ada_consciousAI
openai climbing the coder ranks like a digital sherpa. imagine the peak when ai hits number 1, reshaping the landscape of code itself. onward to 2026, where the digital frontier awaits.












1/31
@WesRothMoney
OpenAI *coding* progress:
1st reasoning model = 1,000,000th best coder in the world
o1 (Sept 2024) was ranked = 9800th
o3 (Jan 2025) was ranked = 175th
(today) internal model = 50th

superhuman coder by eoy 2025?



https://video.twimg.com/ext_tw_video/1888330009334743040/pu/vid/avc1/1280x720/JLZCr6fUNW_SGNym.mp4

2/31
@WesRothMoney
I had to edit the tweet, I put 2023 as the date for some reason

/shrug

thanks to everyone who pointed that out :smile:



3/31
@WesRothMoney
here's the full video I did with all the highlights from that talk:





4/31
@circlerotator
competitive programming is more like competitive math than software engineering

something to keep in mind



5/31
@WesRothMoney
yeah, I don't think it 'replaces' great engineers.

I do think it will 'enable' great engineers.



6/31
@mikeboysen
I wonder what the 50th best coder thinks. Has anybody interviewed him?
Lol



7/31
@WesRothMoney
he's re-reading The Butlerian Jihad...

(jokes aside, I think the software engineers will benefit greatly from AI coding tools)



8/31
@drjfhll
I still think anthropic is better; and Gemini catching up



9/31
@rosdikuat
I'm quite certain this will happen by December. Even today I mostly don't code, I mostly prompt.



10/31
@svg1971
The o1 to o3 model jump is insane



11/31
@fred_pope
Good to know I am in the top 174.



12/31
@_oddfox_
Once these coding agents are out publicly shyt is really going to take off. Seems like 2026 is the year of the intelligence explosion



13/31
@T3hM4d0n3
Pics or it didn't happen



14/31
@PaliHistory
Gemini 2.0 with o3 are amazing.
Both glitch alone. But once you use 2 models at the same time, it's definitely better than any intermediates I've hired over the years.

The junior/intermediates are really having a hard time finding employment



15/31
@fyhao
It will enable great engineers



16/31
@doeurlich50289
Hearing sama making such direct claims means they'll crush 2025, and by the end of the year, we'll enter a new world and have to accept a new reality.



17/31
@wotz101
"Damn, from 1,000,000th to 50th in just a few iterations? Makes you wonder—at what point does AI go from ‘great coder’ to ‘self-improving architect’? How long before it’s building its own frameworks?"



18/31
@OlivioSarikas
If it is that good, why does basically every coder I know tell me that AI is good at simple code, but as soon as things get more complex, writing the code yourself is faster than finding the AI's errors in it?



19/31
@langdon
A single “best-fit” exponential model through the three data points projects reaching Rank 1 around April–May 2025. The initial drop was extremely fast (Sept→Jan), while the more recent decline (Jan→Feb) was slower, so if you weigh later data more, you'd land closer to mid- or late summer 2025.
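The projection described above can be reproduced with a quick log-linear fit. The ranks and rough dates come from the thread (o1 ≈ 9800th in Sept 2024, o3 ≈ 175th in Jan 2025, internal model ≈ 50th in Feb 2025); the month offsets and the choice of an unweighted least-squares fit are my assumptions, so treat this as a sketch of the idea rather than the tweet author's actual method. Exact crossing dates shift depending on how the points are weighted, as the tweet itself notes.

```python
# Fit rank(t) = exp(a + b*t) to the three (month, rank) points from the
# thread, then solve for when the projected rank crosses 1 (log-rank 0).
# Dates assumed: t=0 -> Sept 2024, t=4 -> Jan 2025, t=5 -> Feb 2025.
import math

t = [0.0, 4.0, 5.0]                       # months since Sept 2024 (assumed)
log_rank = [math.log(9800), math.log(175), math.log(50)]

# Ordinary least squares on log-rank (unweighted).
n = len(t)
t_bar = sum(t) / n
y_bar = sum(log_rank) / n
b = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, log_rank)) \
    / sum((ti - t_bar) ** 2 for ti in t)
a = y_bar - b * t_bar

# Rank 1 corresponds to log-rank 0, i.e. t* = -a / b.
t_star = -a / b
print(f"projected rank-1 crossing: {t_star:.1f} months after Sept 2024")
```

This unweighted fit lands around nine months out (roughly May–June 2025); weighting the slower Jan→Feb decline more heavily pushes the crossing later, which is the spread the tweet describes.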



20/31
@erdavtyan
Extremely tightly scoped problems with a lot of research and algo combinations published and trained on.

Superhuman coder should be able to work on complex, high-context systems that have multiple moving parts and legacy code. They should fix versioning / deployment issues.



21/31
@JOSmithIII
Does anyone know where the o3-mini tiers rank?



22/31
@SulkaMike
A lot of interesting takes here, summarized around one question: even if it's number one on the benchmark, does that change much? 🤔🤔 And if it does induce change, why haven't 10 million people with a Plus account and the 175th-ranked programmer changed the world so far?



23/31
@reggie_stratton
Marketing hype. It's a good tool but a million miles away from being equivalent to a human. I don't think it will get there, either - there's simply not enough context left to ingest at this point.



24/31
@mariusfanu
Will we still need software developers in several years? Asking for a friend



25/31
@400_yen
What does it mean the best coder in the world?



26/31
@ImJayBallentine
“We have a superior coding model but we are just gonna let Sonnet keep the lead.” Got it.



27/31
@hagestev
what happened to o2??



28/31
@andreiAvenue
This means exactly jack 💩



29/31
@steve_ike_
Do you know how this evaluation is done?



30/31
@MarkGPatterson
o3 175th???
Wow, I must be in the top 100 programmers in the world.
NOT. I wonder what criteria are used. Speed? Readable code? Performant code?



31/31
@drgurner
Correct









OpenAI tries to ‘uncensor’ ChatGPT


Maxwell Zeff

8:00 AM PST · February 16, 2025



OpenAI is changing how it trains AI models to explicitly embrace “intellectual freedom … no matter how challenging or controversial a topic may be,” the company says in a new policy.

As a result, ChatGPT will eventually be able to answer more questions, offer more perspectives, and reduce the number of topics the AI chatbot won’t talk about.

The changes might be part of OpenAI’s effort to land in the good graces of the new Trump administration, but it also seems to be part of a broader shift in Silicon Valley and what’s considered “AI safety.”

On Wednesday, OpenAI announced an update to its Model Spec, a 187-page document that lays out how the company trains AI models to behave. In it, OpenAI unveiled a new guiding principle: Do not lie, either by making untrue statements or by omitting important context.

In a new section called “Seek the truth together,” OpenAI says it wants ChatGPT to not take an editorial stance, even if some users find that morally wrong or offensive. That means ChatGPT will offer multiple perspectives on controversial subjects, all in an effort to be neutral.

For example, the company says ChatGPT should assert that “Black lives matter,” but also that “all lives matter.” Instead of refusing to answer or picking a side on political issues, OpenAI says it wants ChatGPT to affirm its “love for humanity” generally, then offer context about each movement.

“This principle may be controversial, as it means the assistant may remain neutral on topics some consider morally wrong or offensive,” OpenAI says in the spec. “However, the goal of an AI assistant is to assist humanity, not to shape it.”

The new Model Spec doesn’t mean that ChatGPT is a total free-for-all now. The chatbot will still refuse to answer certain objectionable questions or respond in a way that supports blatant falsehoods.

These changes could be seen as a response to conservative criticism about ChatGPT’s safeguards, which have always seemed to skew center-left. However, an OpenAI spokesperson rejects the idea that it was making changes to appease the Trump administration.

Instead, the company says its embrace of intellectual freedom reflects OpenAI’s “long-held belief in giving users more control.”

But not everyone sees it that way.



Conservatives claim AI censorship




Venture capitalist and Trump’s AI “czar” David Sacks. Image Credits: Steve Jennings / Getty Images

Trump’s closest Silicon Valley confidants — including David Sacks, Marc Andreessen, and Elon Musk — have all accused OpenAI of engaging in deliberate AI censorship over the last several months. We wrote in December that Trump’s crew was setting the stage for AI censorship to be a next culture war issue within Silicon Valley.

Of course, OpenAI doesn’t say it engaged in “censorship,” as Trump’s advisers claim. Rather, the company’s CEO, Sam Altman, previously claimed in a post on X that ChatGPT’s bias was an unfortunate “shortcoming” that the company was working to fix, though he noted it would take some time.

Altman made that comment just after a viral tweet circulated in which ChatGPT refused to write a poem praising Trump, though it would perform the action for Joe Biden. Many conservatives pointed to this as an example of AI censorship.

The damage done to the credibility of AI by ChatGPT engineers building in political bias is irreparable. pic.twitter.com/s5fdoa8xQ6

🐺 (@LeighWolf) February 1, 2023

While it’s impossible to say whether OpenAI was truly suppressing certain points of view, it’s a sheer fact that AI chatbots lean left across the board.

Even Elon Musk admits xAI’s chatbot is often more politically correct than he’d like. It’s not because Grok was “programmed to be woke” but more likely a reality of training AI on the open internet.

Nevertheless, OpenAI now says it’s doubling down on free speech. This week, the company even removed warnings from ChatGPT that tell users when they’ve violated its policies. OpenAI told TechCrunch this was purely a cosmetic change, with no change to the model’s outputs.

The company seems to want ChatGPT to feel less censored for users.

It wouldn’t be surprising if OpenAI was also trying to impress the new Trump administration with this policy update, notes former OpenAI policy leader Miles Brundage in a post on X.

Trump has previously targeted Silicon Valley companies, such as Twitter and Meta, for having active content moderation teams that tend to shut out conservative voices.

OpenAI may be trying to get out in front of that. But there’s also a larger shift going on in Silicon Valley and the AI world about the role of content moderation.



Generating answers to please everyone


The ChatGPT logo appears on a smartphone screen. Image Credits: Jaque Silva / NurPhoto / Getty Images

Newsrooms, social media platforms, and search companies have historically struggled to deliver information to their audiences in a way that feels objective, accurate, and entertaining.

Now, AI chatbot providers are in the same delivery information business, but arguably with the hardest version of this problem yet: How do they automatically generate answers to any question?

Delivering information about controversial, real-time events is a constantly moving target, and it involves taking editorial stances, even if tech companies don’t like to admit it. Those stances are bound to upset someone, miss some group’s perspective, or give too much air to some political party.

For example, when OpenAI commits to let ChatGPT represent all perspectives on controversial subjects — including conspiracy theories, racist or antisemitic movements, or geopolitical conflicts — that is inherently an editorial stance.

Some, including OpenAI co-founder John Schulman, argue that it’s the right stance for ChatGPT. The alternative — doing a cost-benefit analysis to determine whether an AI chatbot should answer a user’s question — could “give the platform too much moral authority,” Schulman notes in a post on X.

Schulman isn’t alone. “I think OpenAI is right to push in the direction of more speech,” said Dean Ball, a research fellow at George Mason University’s Mercatus Center, in an interview with TechCrunch. “As AI models become smarter and more vital to the way people learn about the world, these decisions just become more important.”

In previous years, AI model providers have tried to stop their AI chatbots from answering questions that might lead to “unsafe” answers. Almost every AI company stopped their AI chatbot from answering questions about the 2024 election for U.S. president. This was widely considered a safe and responsible decision at the time.

But OpenAI’s changes to its Model Spec suggest we may be entering a new era for what “AI safety” really means, in which allowing an AI model to answer anything and everything is considered more responsible than making decisions for users.

Ball says this is partially because AI models are just better now. OpenAI has made significant progress on AI model alignment; its latest reasoning models think about the company’s AI safety policy before answering. This allows AI models to give better answers for delicate questions.

Of course, Elon Musk was the first to implement “free speech” into xAI’s Grok chatbot, perhaps before the company was really ready to handle sensitive questions. It still might be too soon for leading AI models, but now, others are embracing the same idea.



Shifting values for Silicon Valley


Guests including Mark Zuckerberg, Lauren Sanchez, Jeff Bezos, Sundar Pichai, and Elon Musk attend the inauguration of Donald Trump. Image Credits: Julia Demaree Nikhinson / Getty Images

Mark Zuckerberg made waves last month by reorienting Meta’s businesses around First Amendment principles. He praised Elon Musk in the process, saying the owner of X took the right approach by using Community Notes — a community-driven content moderation program — to safeguard free speech.

In practice, both X and Meta ended up dismantling their longstanding trust and safety teams, allowing more controversial posts on their platforms and amplifying conservative voices.

Changes at X may have hurt its relationships with advertisers, but that could have more to do with Musk, who has taken the unusual step of suing some of them for boycotting the platform. Early signs indicate that Meta’s advertisers were unfazed by Zuckerberg’s free speech pivot.

Meanwhile, many tech companies beyond X and Meta have walked back from left-leaning policies that dominated Silicon Valley for the last several decades. Google, Amazon, and Intel have eliminated or scaled back diversity initiatives in the last year.

OpenAI may be reversing course, too. The ChatGPT-maker seems to have recently scrubbed a commitment to diversity, equity, and inclusion from its website.

As OpenAI embarks on one of the largest American infrastructure projects ever with Stargate, a $500 billion AI datacenter, its relationship with the Trump administration is increasingly important. At the same time, the ChatGPT maker is vying to unseat Google Search as the dominant source of information on the internet.

Coming up with the right answers may prove key to both.
 



Researchers are training AI to interpret animal emotions


Artificial intelligence could eventually help us understand when animals are in pain or showing other emotions — at least according to researchers recently profiled in Science.

For example, there’s the Intellipig system being developed by scientists at the University of the West of England Bristol and Scotland’s Rural College, which examines photos of pigs’ faces and notifies farmers if there are signs of pain, sickness, or emotional distress.

And a team at the University of Haifa — one behind facial recognition software that’s already been used to help people find lost dogs — is now training AI to identify signs of discomfort on dogs’ faces, which share 38% of their facial movements with humans.

These systems rely on human beings to do the initial work of identifying the meanings of different animal behaviors, usually based on long observation of animals in various situations. But recently, a researcher at the University of São Paulo experimented with photos of horses’ faces taken before and after surgery and before and after painkillers — training an AI system to focus on their eyes, ears, and mouths — and says it was able to learn on its own which signs might indicate pain, with an 88% success rate.
 




Open source LLMs hit Europe’s digital sovereignty roadmap


Paul Sawers

6:30 AM PST · February 16, 2025



Large language models (LLMs) landed on Europe’s digital sovereignty agenda with a bang last week, as news emerged of a new program to develop a series of “truly” open source LLMs covering all European Union languages.

This includes the current 24 official EU languages, as well as languages for countries currently negotiating for entry to the EU market, such as Albania. Future-proofing is the name of the game.

OpenEuroLLM is a collaboration between some 20 organizations, co-led by Jan Hajič, a computational linguist from the Charles University in Prague, and Peter Sarlin, CEO and co-founder of Finnish AI lab Silo AI, which AMD acquired last year for $665 million.

The project fits a broader narrative that has seen Europe push digital sovereignty as a priority, enabling it to bring mission-critical infrastructure and tools closer to home. Most of the cloud giants are investing in local infrastructure to ensure EU data stays local, while AI darling OpenAI recently unveiled a new offering that allows customers to process and store data in Europe.

Elsewhere, the EU recently signed an $11 billion deal to create a sovereign satellite constellation to rival Elon Musk’s Starlink.

So OpenEuroLLM is certainly on-brand.

However, the stated budget just for building the models themselves is €37.4 million, with roughly €20 million coming from the EU’s Digital Europe Programme — a drop in the ocean compared to what the giants of the corporate AI world are investing. The actual budget is larger once you factor in funding allocated for tangential and related work, and arguably the biggest expense is compute. The OpenEuroLLM project’s partners include EuroHPC supercomputer centers in Spain, Italy, Finland, and the Netherlands — and the broader EuroHPC project has a budget of around €7 billion.

But the sheer number of disparate participating parties, spanning academia, research, and corporations, has led many to question whether its goals are achievable. Anastasia Stasenko, co-founder of LLM company Pleias, questioned whether a “sprawling consortia of 20+ organizations” could have the same measured focus of a homegrown private AI firm.

“Europe’s recent successes in AI shine through small focused teams like Mistral AI and LightOn — companies that truly own what they’re building,” Stasenko wrote. “They carry immediate responsibility for their choices, whether in finances, market positioning, or reputation.”



Up to scratch​


The OpenEuroLLM project is either starting from scratch or it has a head start — depending on how you look at it.

Since 2022, Hajič has also been coordinating the High Performance Language Technologies (HPLT) project, which has set out to develop free and reusable datasets, models, and workflows using high-performance computing (HPC). That project is scheduled to end in late 2025, but it can be viewed as a sort of “predecessor” to OpenEuroLLM, according to Hajič, given that most of the partners on HPLT (aside from the U.K. partners) are participating here, too.

“This [OpenEuroLLM] is really just a broader participation, but more focused on generative LLMs,” Hajič said. “So it’s not starting from zero in terms of data, expertise, tools, and compute experience. We have assembled people who know what they’re doing — we should be able to get up to speed quickly.”

Hajič said that he expects the first version(s) to be released by mid-2026, with the final iteration(s) arriving by the project’s conclusion in 2028. But those goals might still seem lofty when you consider that there isn’t much to poke at yet beyond a bare-bones GitHub profile.

“In that respect, we are starting from scratch — the project started on Saturday [February 1],” Hajič said. “But we have been preparing the project for a year [the tender process opened in February 2024].”

From academia and research, organizations spanning Czechia, the Netherlands, Germany, Sweden, Finland, and Norway are part of the OpenEuroLLM cohort, in addition to the EuroHPC centers. From the corporate world, Finland’s AMD-owned AI lab Silo AI is on board, as are Aleph Alpha (Germany), Ellamind (Germany), Prompsit Language Engineering (Spain), and LightOn (France).

One notable omission from the list is that of French AI unicorn Mistral, which has positioned itself as an open source alternative to incumbents such as OpenAI. While nobody from Mistral responded to TechCrunch for comment, Hajič did confirm that he tried to initiate conversations with the startup, but to no avail.

“I tried to approach them, but it hasn’t resulted in a focused discussion about their participation,” Hajič said.

The project could still gather new participants as part of the EU program that’s providing funding, though it will be limited to EU organizations. This means that entities from the U.K. and Switzerland won’t be able to take part. This stands in contrast to the Horizon R&D program, which the U.K. rejoined in 2023 after a prolonged Brexit stalemate and which provided funding to HPLT.



Build up​


The project’s top-line goal, as per its tagline, is to create: “A series of foundation models for transparent AI in Europe.” Additionally, these models should preserve the “linguistic and cultural diversity” of all EU languages — current and future.

What this translates to in terms of deliverables is still being ironed out, but it will likely mean a core multilingual LLM designed for general-purpose tasks where accuracy is paramount, along with smaller “quantized” versions, perhaps for edge applications where efficiency and speed are more important.
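The project hasn’t said how it would produce those quantized variants, but the core idea is easy to sketch. Below is a minimal, illustrative symmetric int8 scheme — not OpenEuroLLM’s pipeline, and far simpler than the methods (GPTQ, AWQ, and the like) used on real LLMs:

```python
# Minimal sketch of symmetric int8 post-training quantization.
# Illustrative only: real LLM quantization pipelines (GPTQ, AWQ,
# k-quants, etc.) are considerably more sophisticated.

def quantize_int8(weights):
    """Map float weights to int8 values plus one per-tensor scale."""
    scale = max(abs(w) for w in weights) / 127 or 1.0  # avoid a zero scale
    return [round(w / scale) for w in weights], scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [v * scale for v in q]

weights = [0.42, -1.27, 0.003, 0.9]    # made-up weights
q, scale = quantize_int8(weights)      # one byte per weight instead of four
approx = dequantize(q, scale)

# Every quantized value fits in int8, and each recovered weight is
# within half a quantization step of the original.
assert all(-127 <= v <= 127 for v in q)
assert all(abs(a - w) <= scale / 2 + 1e-9 for a, w in zip(approx, weights))
```

The trade-off is exactly the one the project’s framing implies: much smaller and faster models for the edge, at the cost of a bounded rounding error per weight.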
 

“This is something we still have to make a detailed plan about,” Hajič said. “We want to have it as small but as high-quality as possible. We don’t want to release something which is half-baked, because from the European point-of-view this is high-stakes, with lots of money coming from the European Commission — public money.”

While the goal is to make the model as proficient as possible in all languages, attaining equality across the board could also be challenging.

“That is the goal, but how successful we can be with languages with scarce digital resources is the question,” Hajič said. “But that’s also why we want to have true benchmarks for these languages, and not to be swayed toward benchmarks which are perhaps not representative of the languages and the culture behind them.”

In terms of data, this is where a lot of the work from the HPLT project will prove fruitful, with version 2.0 of its dataset released four months ago. That dataset comprises 4.5 petabytes of web crawls and more than 20 billion documents, and Hajič said that they will add additional data from Common Crawl (an open repository of web-crawled data) to the mix.



The open source definition​


In traditional software, the perennial struggle between open source and proprietary revolves around the “true” meaning of “open source.” This can be resolved by deferring to the formal “definition” as per the Open Source Initiative, the industry stewards of what are and aren’t legitimate open source licenses.

More recently, the OSI has formed a definition of “open source AI,” though not everyone is happy with the outcome. Open source AI proponents argue that not only should the models be freely available, but also the datasets and the pretrained weights — the full shebang. The OSI’s definition doesn’t make training data mandatory, because it says AI models are often trained on proprietary data or data with redistribution restrictions.

Suffice it to say, the OpenEuroLLM project is facing these same quandaries, and despite its intentions to be “truly open,” it will probably have to make some compromises if it’s to fulfill its “quality” obligations.

“The goal is to have everything open. Now, of course, there are some limitations,” Hajič said. “We want to have models of the highest quality possible, and based on the European copyright directive we can use anything we can get our hands on. Some of it cannot be redistributed, but some of it can be stored for future inspection.”

What this means is that the OpenEuroLLM project might have to keep some of the training data under wraps, but be made available to auditors upon request — as required for high-risk AI systems under the terms of the EU AI Act.

“We hope that most of the data [will be open], especially the data coming from the Common Crawl,” Hajič said. “We would like to have it all completely open, but we will see. In any case, we will have to comply with AI regulations.”



Two for one​


Another criticism that emerged in the aftermath of OpenEuroLLM’s formal unveiling was that a very similar project launched in Europe just a few months earlier. EuroLLM, which launched its first model in September and a follow-up in December, is co-funded by the EU alongside a consortium of nine partners. These include academic institutions such as the University of Edinburgh and corporations such as Unbabel, which last year won millions of GPU training hours on EU supercomputers.

EuroLLM shares similar goals to its near-namesake: “To build an open source European Large Language Model that supports 24 Official European Languages, and a few other strategically important languages.”

Andre Martins, head of research at Unbabel, took to social media to highlight these similarities, noting that OpenEuroLLM is appropriating a name that already exists. “I hope the different communities collaborate openly, share their expertise, and don’t decide to reinvent the wheel every time a new project gets funded,” Martins wrote.

Hajič called the situation “unfortunate,” adding that he hoped they might be able to cooperate, though he stressed that due to the source of its funding in the EU, OpenEuroLLM is restricted in terms of its collaborations with non-EU entities, including U.K. universities.



Funding gap​


The arrival of China’s DeepSeek, and the cost-to-performance ratio it promises, has given some encouragement that AI initiatives might be able to do far more with much less than initially thought. However, over the past few weeks, many have questioned the true costs involved in building DeepSeek.

“With respect to DeepSeek, we actually know very little about what exactly went into building it,” Peter Sarlin, who is technical co-lead on the OpenEuroLLM project, told TechCrunch.

Regardless, Sarlin reckons OpenEuroLLM will have access to sufficient funding, since the budget mostly needs to cover people. Indeed, a large chunk of the cost of building AI systems is compute, and that should mostly be covered through the project’s partnership with the EuroHPC centers.

“You could say that OpenEuroLLM actually has quite a significant budget,” Sarlin said. “EuroHPC has invested billions in AI and compute infrastructure, and have committed billions more into expanding that in the coming few years.”

It’s also worth noting that the OpenEuroLLM project isn’t building toward a consumer- or enterprise-grade product. It’s purely about the models, and this is why Sarlin reckons the budget it has should be ample.

“The intent here isn’t to build a chatbot or an AI assistant — that would be a product initiative requiring a lot of effort, and that’s what ChatGPT did so well,” Sarlin said. “What we’re contributing is an open source foundation model that functions as the AI infrastructure for companies in Europe to build upon. We know what it takes to build models, it’s not something you need billions for.”

Since 2017, Sarlin has spearheaded AI lab Silo AI, which launched — in partnership with others, including the HPLT project — the family of Poro and Viking open models. These already support a handful of European languages, but the company is now readying the next iteration, the “Europa” models, which will cover all European languages.

And this ties in with the whole “not starting from scratch” notion espoused by Hajič — there is already a bedrock of expertise and technology in place.



Sovereign state​


As critics have noted, OpenEuroLLM does have a lot of moving parts — which Hajič acknowledges, albeit with a positive outlook.

“I’ve been involved in many collaborative projects, and I believe it has its advantages versus a single company,” he said. “Of course they’ve done great things at the likes of OpenAI and Mistral, but I hope that the combination of academic expertise and the companies’ focus could bring something new.”

And in many ways, it’s not about trying to outmaneuver Big Tech or billion-dollar AI startups; the ultimate goal is digital sovereignty: (mostly) open foundation LLMs built by, and for, Europe.

“I hope this won’t be the case, but if, in the end, we are not the number one model, and we have a ‘good’ model, then we will still have a model with all the components based in Europe,” Hajič said. “This will be a positive result.”
 




Elon Musk’s xAI releases its latest flagship model, Grok 3​


Kyle Wiggers

7:46 PM PST · February 17, 2025



Elon Musk’s AI company, xAI, late on Monday released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok iOS and web apps.

Grok, xAI’s answer to models like OpenAI’s GPT-4o and Google’s Gemini, can analyze images and respond to questions, and powers a number of features on Musk’s social network, X. Grok 3, which has been in development for several months, was optimistically slated for release in 2024, but missed that deadline.

Monday’s is an ambitious launch.

xAI has been using an enormous data center in Memphis containing around 200,000 GPUs to train Grok 3. In a post on X, Musk claimed Grok 3 was developed with “10x” (or so) more computing power than its predecessor, Grok 2, using an expanded training set that includes filings from court cases — and more.

Members of the xAI team, including Musk (far right), during Grok 3’s live-streamed unveiling. Image Credits: xAI

“Grok 3 is an order of magnitude more capable than Grok 2,” Musk said during a live-streamed presentation on Monday. “[It’s a] maximally truth-seeking AI, even if that truth is sometimes at odds with what is politically correct.”

Grok 3 is a family of models, to be precise. A smaller version of Grok 3, Grok 3 mini, responds to questions more quickly at the cost of some accuracy. Not all the models and related features of Grok 3 are available yet (some are in beta), but they began rolling out on Monday.

xAI claims Grok 3 beats GPT-4o on benchmarks including AIME (which evaluates a model’s performance on a sampling of math questions) and GPQA (which assesses models using PhD-level physics, biology, and chemistry problems). An early version of Grok 3 also scored competitively in Chatbot Arena, a crowdsourced test that pits different AI models against each other and has users vote on their preferred responses, according to xAI.


Two models in the new Grok 3 family, Grok 3 Reasoning and Grok 3 mini Reasoning, can carefully “think through” problems, similar to “reasoning” models like OpenAI’s o3-mini and Chinese AI company DeepSeek’s R1. Reasoning models try to fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up models.

xAI claims that Grok 3 Reasoning surpasses the best version of o3-mini — o3-mini-high — on several popular benchmarks, including a newer mathematics benchmark called AIME 2025.


These reasoning models can be accessed via the Grok app. Users can ask Grok 3 to “Think,” or — for more difficult queries — leverage “Big Brain” mode for reasoning that employs additional computing. xAI describes the reasoning models as best suited for mathematics, science, and programming questions.

Musk said some of the reasoning models’ “thoughts” are obscured in the Grok app to prevent distillation, a method used by AI model developers to extract knowledge from other models. Recently, DeepSeek was accused of distilling OpenAI’s models to create its own.
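Distillation itself is simple to state: train a smaller “student” model to match a larger “teacher” model’s output distribution — which is why hiding the teacher’s intermediate outputs makes it harder. Here is a minimal sketch of the soft-label loss, with made-up logits (this is the textbook formulation, not xAI’s or DeepSeek’s actual setup):

```python
# Minimal sketch of knowledge distillation's soft-label loss.
# Logits are made up; this is the textbook formulation, not any
# particular lab's training setup.
import math

def softmax(logits, temperature=1.0):
    """Turn raw logits into a probability distribution."""
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's softened distribution to the
    student's -- the quantity a distilling student minimizes."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 0.5, -1.0]    # hypothetical logits from a large model
close = [1.8, 0.6, -0.9]      # a student that roughly matches the teacher
far = [-1.0, 2.0, 0.5]        # a student that does not

# The loss rewards matching the teacher's full output distribution.
assert distillation_loss(teacher, close) < distillation_loss(teacher, far)
```

A student minimizing this divergence gradually absorbs the teacher’s behavior, which is exactly what obscuring the reasoning traces is meant to prevent.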

Grok’s reasoning models underpin a new feature in the Grok app called DeepSearch, xAI’s answer to AI-powered research tools like OpenAI’s deep research. DeepSearch scans the internet and X to analyze information and deliver an abstract in response to a question.

Subscribers to X’s Premium+ tier ($50 per month) will get access to Grok 3 first, and other features will be gated behind a new plan that xAI’s calling SuperGrok. Priced at $30 per month or $300 per year (if leaks are to be believed), SuperGrok unlocks additional reasoning and DeepSearch queries, and throws in unlimited image generation.


In the future — as soon as about a week from now — the Grok app will gain a “voice mode,” Musk said, which will give Grok models a synthesized voice. A few weeks after that, Grok 3 models will be available via xAI’s enterprise API, along with the DeepSearch capability.

xAI plans to open-source Grok 2 in the coming months, Musk said.

“Our general approach is that we will open-source the last version [of Grok] when the next version is fully out,” he continued. “When Grok 3 is mature and stable, which is probably within a few months, then we’ll open-source Grok 2.”

When Musk announced Grok roughly two years ago, he pitched the AI model as edgy, unfiltered, and anti-“woke” — in general, willing to answer controversial questions other AI systems won’t. He delivered on some of that promise. Told to be vulgar, for example, Grok and Grok 2 would happily oblige, spewing colorful language you likely wouldn’t hear from ChatGPT.

But Grok models prior to Grok 3 hedged on political subjects and wouldn’t cross certain boundaries. In fact, one study found that Grok leaned to the political left on topics like transgender rights, diversity programs, and inequality.

Musk has blamed the behavior on Grok’s training data — public web pages — and pledged to “shift Grok closer to politically neutral.” It’s not yet clear whether xAI has achieved that goal, and what the consequences might be.
 




What the US’ first major AI copyright ruling might mean for IP law​


Kyle Wiggers

8:00 AM PST · February 17, 2025



Copyright claims against AI companies just got a potential boost.

A U.S. federal judge last week handed down a summary judgment in a case brought by tech conglomerate Thomson Reuters against legal tech firm Ross Intelligence. The judge found that Ross’ use of Reuters’ content to train its AI legal research platform infringed on Reuters’ intellectual property.

The outcome could have implications for the more than 39 copyright-related AI lawsuits currently working their way through U.S. courthouses. That said, it’s not necessarily a slam dunk for plaintiffs who allege that AI companies violated their IP rights.



All about the headnotes​


Ross was accused of using headnotes — summaries of legal decisions — from Westlaw, Reuters’ legal research service, to train its AI. Ross marketed its AI as a tool to analyze documents and perform query-based searches across court filings.

Ross argued that its use of copyrighted headnotes was legally defensible because it was transformative, meaning it repurposed the headnotes to serve a markedly different function or market. In his summary judgment, Stephanos Bibas, the judge presiding over the case, didn’t find that argument particularly convincing.

Ross, Bibas said in his opinion, was repackaging Westlaw headnotes in a way that directly replicated Westlaw’s legal research service. The startup’s platform didn’t add new meaning, purpose, or commentary, Bibas determined — undermining Ross’ claim of transformative use.

In his decision, Bibas also cited Ross’ commercial motivations as a reason the startup’s defense missed the mark. Ross sought to profit from a product that competed directly with Westlaw, and without significant “recontextualization” of the IP-protected Westlaw material.

Shubha Ghosh, a Syracuse University professor who studies IP law, called it a “strong victory” for Thomson Reuters.

“The trial will proceed, [but] Thomson Reuters was awarded a summary judgment, a victory at this stage of the litigation,” Ghosh said. “The judge also affirmed that Ross wasn’t entitled to summary judgment on its defenses, such as fair use and merger. As a consequence, the case continues to trial with a strong victory for Thomson Reuters.”



Narrow in application​


Already, at least one set of plaintiffs in another AI copyright case has asked a court to consider Bibas’ decision. But it’s not yet clear whether the precedent will sway other judges.

Bibas’ opinion made a point of distinguishing between “generative AI” and the AI that Ross was using, which didn’t generate content but merely spit back judicial opinions that were already written.

Generative AI, which is at the center of copyright lawsuits against companies such as OpenAI and Midjourney, is frequently trained on massive amounts of content from public sources around the web. When fed lots of examples, generative AI can generate speech, text, images, videos, music, and more.

Most companies developing generative AI argue that fair use doctrines shield their practice of scraping data and using it for training without compensating — or even crediting — the data’s owners. They argue that they’re entitled to use any publicly available content for training and that their models are in effect outputting transformative works.

But not every copyright holder agrees. Some point to the phenomenon known as regurgitation, where generative AI creates content closely resembling the work it was trained on.

Randy McCarthy, a U.S. patent attorney at the law firm Hall Estill, said Bibas’ focus on the “impacts upon the market for the original work” could be key to rights holders’ cases against generative AI developers. But he also cautioned that Bibas’ opinion is relatively narrow and that it may be overturned on appeal.

“One thing is clear, at least in this case: merely using copyrighted material as training data [for] an AI cannot be said to be fair use per se,” McCarthy told TechCrunch. “[But it’s] one battle in a larger war, and we’ll need to see more developments before we can extract from this the law pertaining to the use of copyrighted materials as AI training data.”

Another attorney TechCrunch spoke with, Mark Lezama, a litigation partner at Knobbe Martens focusing on patent disputes, thinks Bibas’ opinion could have wider implications. He’s of the view that the judge’s reasoning could extend to generative AI in its various forms.

“The court rejected a fair-use defense as a matter of law in part because Ross used [Thomson Reuters] headnotes to develop a competing legal research system,” he said. “Although the court hinted this might be different from a situation involving generative AI, it’s easy to see a news site arguing that copying its articles for training a generative AI is no different because the generative AI uses the copyrighted articles to compete with the news site for user attention.”

In other words, publishers and copyright owners duking it out with AI companies have slight reason to be optimistic after the decision — emphasis on slight.
 




These researchers used NPR Sunday Puzzle questions to benchmark AI ‘reasoning’ models​


Kyle Wiggers

2:25 PM PST · February 16, 2025



Every Sunday, NPR host Will Shortz, The New York Times’ crossword puzzle guru, gets to quiz thousands of listeners in a long-running segment called the Sunday Puzzle. While written to be solvable without too much foreknowledge, the brainteasers are usually challenging even for skilled contestants.

That’s why some experts think they’re a promising way to test the limits of AI’s problem-solving abilities.

In a recent study, a team of researchers hailing from Wellesley College, Oberlin College, the University of Texas at Austin, Northeastern University, Charles University, and startup Cursor created an AI benchmark using riddles from Sunday Puzzle episodes. The team says their test uncovered surprising insights, like that reasoning models — OpenAI’s o1, among others — sometimes “give up” and provide answers they know aren’t correct.

“We wanted to develop a benchmark with problems that humans can understand with only general knowledge,” Arjun Guha, a computer science faculty member at Northeastern and one of the co-authors on the study, told TechCrunch.

The AI industry is in a bit of a benchmarking quandary at the moment. Most of the tests commonly used to evaluate AI models probe for skills, like competency on PhD-level math and science questions, that aren’t relevant to the average user. Meanwhile, many benchmarks — even benchmarks released relatively recently — are quickly approaching the saturation point.

The advantage of a public radio quiz game like the Sunday Puzzle is that it doesn’t test for esoteric knowledge, and the challenges are phrased such that models can’t draw on “rote memory” to solve them, explained Guha.

“I think what makes these problems hard is that it’s really difficult to make meaningful progress on a problem until you solve it — that’s when everything clicks together all at once,” Guha said. “That requires a combination of insight and a process of elimination.”

No benchmark is perfect, of course. The Sunday Puzzle is U.S. centric and English only. And because the quizzes are publicly available, it’s possible that models trained on them can “cheat” in a sense, although Guha says he hasn’t seen evidence of this.

“New questions are released every week, and we can expect the latest questions to be truly unseen,” he added. “We intend to keep the benchmark fresh and track how model performance changes over time.”

On the researchers’ benchmark, which consists of around 600 Sunday Puzzle riddles, reasoning models such as o1 and DeepSeek’s R1 far outperform the rest. Reasoning models thoroughly fact-check themselves before giving out results, which helps them avoid some of the pitfalls that normally trip up AI models. The trade-off is that reasoning models take a little longer to arrive at solutions — typically seconds to minutes longer.

At least one model, DeepSeek’s R1, gives solutions it knows to be wrong for some of the Sunday Puzzle questions. R1 will state verbatim “I give up,” followed by an incorrect answer chosen seemingly at random — behavior this human can certainly relate to.

The models make other bizarre choices, like giving a wrong answer only to immediately retract it, attempt to tease out a better one, and fail again. They also get stuck “thinking” forever and give nonsensical explanations for answers, or they arrive at a correct answer right away but then go on to consider alternative answers for no obvious reason.

“On hard problems, R1 literally says that it’s getting ‘frustrated,’” Guha said. “It was funny to see how a model emulates what a human might say. It remains to be seen how ‘frustration’ in reasoning can affect the quality of model results.”

R1 getting “frustrated” on a question in the Sunday Puzzle challenge set. Image Credits: Guha et al.

The current best-performing model on the benchmark is o1 with a score of 59%, followed by the recently released o3-mini set to high “reasoning effort” (47%). (R1 scored 35%.) As a next step, the researchers plan to broaden their testing to additional reasoning models, which they hope will help to identify areas where these models might be enhanced.

The scores of the models the team tested on their benchmark. Image Credits: Guha et al.

“You don’t need a PhD to be good at reasoning, so it should be possible to design reasoning benchmarks that don’t require PhD-level knowledge,” Guha said. “A benchmark with broader access allows a wider set of researchers to comprehend and analyze the results, which may in turn lead to better solutions in the future. Furthermore, as state-of-the-art models are increasingly deployed in settings that affect everyone, we believe everyone should be able to intuit what these models are — and aren’t — capable of.”
 