1/21
@WenhuChen
I spent some time evaluating the frontier math models on AIME24 and AIME25 to see how they "generalize".
An interesting trend I found is that SFT on minimal data can also generalize quite well if you pick the right data. See LIMO-32B.
Training with RL does not necessarily lead to better generalization than distillation. See the last two rows.
2/21
@WenhuChen
To clarify a bit here:
I am not saying that using 1K examples can teach LMs how to reason and generalize to all math problems! That's obviously impossible.
What's awesome is actually Qwen-32B-Instruct, which has been heavily trained on a massive synthetic reasoning dataset. These RL and SFT approaches are simply teaching it the reasoning/self-reflection format and activating what it has already memorized. **Then Qwen-32B-Instruct does the work of generalization**.
3/21
@WenhuChen
It's very likely that this 1K dataset will not work with other, weaker models.
So, in order to prove that your SFT or RL is actually doing the job, you should start from weaker models!
I would encourage people to start from OLMo-2 or something similar. Reaching 25%+ from those models would be a really big deal.
4/21
@sksq96
In the R1 paper, AIME24 is 55.5 for R1-Distill-Qwen-7B. I'm confused about why there's this disparity with your evaluations.
5/21
@WenhuChen
Oh, it's a mistake. I will update it later.
6/21
@Alice_comfy
In terms of out-of-domain questions, it seems quite established (both from my own testing & studies) that whatever OpenAI is doing generalizes much better than competing products.
I'm not sure if they have an additional step beyond the other ones.
PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models
7/21
@WenhuChen
Yes, that's also my impression. OpenAI models seem to be more robust than anyone else's, especially on some really, really weird tasks I tried before.
8/21
@TheXeophon
Important context on AIME
[Quoted tweet]
AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination
AIME 2025 part I was conducted yesterday, and the scores of some language models are available here:
matharena.ai thanks to @mbalunovic, @ni_jovanovic et al.
I have to say I was impressed, as I predicted the smaller distilled models would crash and burn, but they actually scored at a reasonable 25-50%.
That was surprising to me! After all, these are new problems, not seen during training, right? I expected smaller models to barely score above 0%. It's really hard to believe that a 1.5B model can solve pre-olympiad math problems when it can't multiply 3-digit numbers. I was wrong, I guess.
I then used OpenAI's Deep Research to see if problems similar to those in AIME 2025 exist on the internet. And guess what? An identical problem to Q1 of AIME 2025 exists on Quora:
quora.com/In-what-bases-b-do…
I thought maybe it was just coincidence, and used Deep Research again on Problem 3. And guess what? A very similar question was on math.stackexchange:
math.stackexchange.com/quest…
Still skeptical, I used Deep Research on Problem 5, and a near-identical problem appears again on math.stackexchange:
math.stackexchange.com/quest…
I haven't checked beyond that because the freaking p-value is too low already. Problems near-identical to the test set can be found online.
So, what--if anything--does this imply for math benchmarks? And what does it imply for all the sudden hill-climbing due to RL?
I'm not certain, and there is a reasonable argument that even if the train set contains near-identical but not exact copies of test data, it's still generalization. I am sympathetic to that. But I also wouldn't rule out that GRPO is amazing at sharpening memories along with math skills.
At the very least, the above shows that data decontamination is hard.
Never ever underestimate the amount of stuff you can find online. Practically everything exists online.
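To make the decontamination point concrete, here is a minimal sketch of the kind of n-gram-overlap check a decontamination pipeline might run. It is an illustration under assumptions, not what Deep Research actually does, and all names (`ngrams`, `overlap_score`, `flag_contaminated`, the `threshold` value) are hypothetical. Notably, paraphrased or translated duplicates evade exactly this kind of check, which is part of why decontamination is hard.

```python
# Hypothetical sketch: flag test problems that share many word n-grams
# with candidate web/training documents. Real pipelines are far more
# involved; paraphrases and renamed variables defeat this simple check.

def ngrams(text: str, n: int = 8) -> set:
    """Lowercased word n-grams of a problem statement."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_score(test_problem: str, candidate: str, n: int = 8) -> float:
    """Fraction of the test problem's n-grams that also appear in the candidate."""
    test = ngrams(test_problem, n)
    if not test:
        return 0.0
    return len(test & ngrams(candidate, n)) / len(test)

def flag_contaminated(test_set: list, corpus: list, threshold: float = 0.3) -> list:
    """Indices of test problems whose best corpus match exceeds the threshold."""
    return [
        i for i, problem in enumerate(test_set)
        if max((overlap_score(problem, doc) for doc in corpus), default=0.0) >= threshold
    ]
```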
9/21
@WenhuChen
I saw this, but I don't think it's sufficient evidence to prove leakage. Also, if you check the model outputs, the potentially leaked questions are actually the ones the LLMs fail on.
10/21
@winnieyangwn
I think testing generalization is the most important direction! How do you measure generalization here?
11/21
@WenhuChen
AIME 2025 is supposedly unseen, so we use it to test generalization.
12/21
@YouJiacheng
Interestingly, s1 seems to be consistently much weaker than LIMO.
[Quoted tweet]
🔥 AIME-Preview: A real-time benchmark for math reasoning models on AIME 2025!
✨ See how DeepSeek, O Series, LIMO & others performed on Part 1
🛠️ Evaluate your own models with our open-source scripts
📊 Part 2 results coming soon
github.com/GAIR-NLP/AIME-Pre…
13/21
@WenhuChen
It seems so. But there are only 15 of the 30 examples so far, so the real difference might not be as large as reported.
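For a rough sense of the noise (my own back-of-envelope, not a figure from the thread): with only 15 questions, the binomial standard error on an accuracy estimate exceeds 12 percentage points, so a one- or two-question gap is well within sampling noise. A minimal sketch:

```python
import math

def accuracy_stderr(correct: int, n: int) -> float:
    """Binomial standard error of an accuracy estimate from n questions."""
    p = correct / n
    return math.sqrt(p * (1 - p) / n)

# AIME 2025 Part 1 has only n = 15 problems, so each question moves the
# score by ~6.7 points. Compare two hypothetical models at 6/15 and 8/15:
for correct in (6, 8):
    p = correct / 15
    se = accuracy_stderr(correct, 15)
    print(f"{correct}/15 = {p:.1%} +/- {se:.1%} (one standard error)")
# 6/15 = 40.0% +/- 12.6%, 8/15 = 53.3% +/- 12.9% -- heavily overlapping.
```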
14/21
@n0riskn0r3ward
Relevant to the challenge of drawing conclusions from the 2025 set:
[Quoted tweet]
AIME I 2025: A Cautionary Tale About Math Benchmarks and Data Contamination (same quoted tweet as in 8/21 above; see there for the full text)
15/21
@WenhuChen
I saw this, but I don't think it's sufficient evidence to prove leakage. Also, if you check the model outputs, the potentially leaked questions are actually the ones the LLMs fail on.
16/21
@Cory29565470
love this, generalization (sometimes called ~vibes~) is so important to test in this hype echo chamber. great work
17/21
@AlexGDimakis
Thanks for posting. Is this from one run, or averaged over a few runs to reduce variance?
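For context, one common evaluation convention (an assumption on my part, not necessarily what was done here) is to sample k completions per problem and report the mean solve rate, i.e. pass@1 averaged over k runs. A hypothetical sketch, with `generate` and `is_correct` passed in as stand-in callables:

```python
from typing import Callable

def averaged_pass_at_1(
    problems: list,
    generate: Callable[[str], str],          # hypothetical: samples one completion
    is_correct: Callable[[str, str], bool],  # hypothetical: checks the final answer
    k: int = 8,
) -> float:
    """Per-problem solve rate over k sampled completions, averaged over problems."""
    per_problem = []
    for problem in problems:
        answers = [generate(problem) for _ in range(k)]
        per_problem.append(sum(is_correct(problem, a) for a in answers) / k)
    return sum(per_problem) / len(problems)
```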
18/21
@yangyc666
Interesting point about SFT and data selection. It’s true that the right data can make a big difference in generalization. I’ve seen cases where targeted datasets led to better performance than larger, less relevant ones. It’s all about finding that sweet spot between quantity and quality.
19/21
@bookwormengr
Hi @wenhu, I tried the LIMO dataset, but the performance gains didn't replicate (maybe it required more epochs; I used only 3-5, or my base model was weaker. I see LIMO uses 15 epochs).
LIMO worked so well because the base was Qwen-2.5-32B-Instruct. Even the paper calls out that their experiment on the Chat version of Qwen didn't work as well.
So the magic is not SFT with 800 LIMO examples alone, but what was done to the model before the LIMO SFT stage, and that looks like a lot of work & FLOPs.
So in my humble opinion, it would be wrong to say SFT on minimal data did the trick of unlocking the reasoning. A lot had happened before that stage.
Excerpt from Qwen 2.5 Technical report shared by @abacaj
20/21
@legolasyiu
Data is very important
21/21
@kuyz12131
Does this mean that finetuning a big model with very high-quality human CoT data will yield better results than RL?