bnew

Veteran
Joined
Nov 1, 2015
Messages
59,141
Reputation
8,772
Daps
163,739



Anthropic builds RAG directly into Claude models with new Citations API


New feature allows Claude to reference source documents and reduce hallucinations.

Benj Edwards – Jan 24, 2025 4:05 PM



Robot sitting on a bunch of books, reading a book.


Credit: Kirillm via Getty Images

On Thursday, Anthropic announced Citations, a new API feature that helps Claude models avoid confabulations (also called hallucinations) by linking their responses directly to source documents. The feature lets developers add documents to Claude's context window, enabling the model to automatically cite specific passages it uses to generate answers.

"When Citations is enabled, the API processes user-provided source documents (PDF documents and plaintext files) by chunking them into sentences," Anthropic says. "These chunked sentences, along with user-provided context, are then passed to the model with the user's query."

The company describes several potential uses for Citations, including summarizing case files with source-linked key points, answering questions across financial documents with traced references, and powering support systems that cite specific product documentation.

In internal testing, the company says the feature improved recall accuracy by up to 15 percent compared with custom citation implementations that users built within prompts. While a 15 percent improvement in recall accuracy may not sound like much, the new feature still attracted interest from AI researchers like Simon Willison because it integrates Retrieval Augmented Generation (RAG) techniques directly into the model. In a detailed post on his blog, Willison explained why citation features are important.

"The core of the Retrieval Augmented Generation (RAG) pattern is to take a user's question, retrieve portions of documents that might be relevant to that question and then answer the question by including those text fragments in the context provided to the LLM," he writes. "This usually works well, but there is still a risk that the model may answer based on other information from its training data (sometimes OK) or hallucinate entirely incorrect details (definitely bad)."

Willison notes that while citing sources helps verify accuracy, building a system that does it well "can be quite tricky," but Citations appears to be a step in the right direction by building RAG capability directly into the model.

Apparently, that capability is not new. Anthropic's Alex Albert wrote on X, "Under the hood, Claude is trained to cite sources. With Citations, we are exposing this ability to devs. To use Citations, users can pass a new 'citations: {enabled:true}' parameter on any document type they send through the API."
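Based on Albert's description, a request enabling Citations on a plaintext document might look like the following. Only the `citations: {enabled: true}` parameter is quoted from Anthropic; the surrounding document-block shape is an assumption for illustration, not a verified schema:

```python
# Hypothetical Messages API request body with Citations enabled.
# Only `"citations": {"enabled": True}` is quoted from Anthropic;
# the rest of the structure is an illustrative assumption.
request_body = {
    "model": "claude-3-5-sonnet-latest",
    "max_tokens": 1024,
    "messages": [{
        "role": "user",
        "content": [
            {
                "type": "document",
                "source": {
                    "type": "text",
                    "media_type": "text/plain",
                    "data": "The grass is green. The sky is blue.",
                },
                "citations": {"enabled": True},  # the new parameter
            },
            {"type": "text", "text": "What color is the grass?"},
        ],
    }],
}
```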



Early adopters report promising results


The company released Citations for the Claude 3.5 Sonnet and Claude 3.5 Haiku models through both the Anthropic API and Google Cloud's Vertex AI platform, and it's apparently already getting some use in the field.

Anthropic says that Thomson Reuters, which uses Claude to power its CoCounsel legal AI reference platform, is looking forward to using Citations in a way that helps "minimize hallucination risk but also strengthens trust in AI-generated content."

Additionally, financial technology company Endex told Anthropic that Citations reduced its source confabulations from 10 percent to zero while increasing references per response by 20 percent, according to CEO Tarun Amasa.

Despite these claims, relying on any LLM to accurately relay reference information is still a risk until the technology is more deeply studied and proven in the field.

Anthropic will charge users its standard token-based pricing, though quoted text in responses won't count toward output token costs. Sourcing a 100-page document as a reference would cost approximately $0.30 with Claude 3.5 Sonnet or $0.08 with Claude 3.5 Haiku, according to Anthropic's standard API pricing.
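Those per-document figures are consistent with simple token arithmetic, assuming a 100-page document is roughly 100,000 input tokens and using the standard input prices at the time ($3 per million tokens for Claude 3.5 Sonnet, $0.80 for Claude 3.5 Haiku). The token count and prices here are assumptions, not figures from the article:

```python
# Back-of-envelope check of the quoted document-sourcing costs.
TOKENS_PER_PAGE = 1_000          # rough assumption
PAGES = 100
input_tokens = PAGES * TOKENS_PER_PAGE

SONNET_USD_PER_MTOK = 3.00       # assumed standard input price
HAIKU_USD_PER_MTOK = 0.80

sonnet_cost = input_tokens / 1_000_000 * SONNET_USD_PER_MTOK  # ~$0.30
haiku_cost = input_tokens / 1_000_000 * HAIKU_USD_PER_MTOK    # ~$0.08
```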
 




1/31
@DrJimFan
That a *second* paper dropped with tons of RL flywheel secrets and *multimodal* o1-style reasoning is not on my bingo card today. Kimi's (another startup) and DeepSeek's papers remarkably converged on similar findings:

> No need for complex tree search like MCTS. Just linearize the thought trace and do good old autoregressive prediction;
> No need for value functions that require another expensive copy of the model;
> No need for dense reward modeling. Rely as much as possible on groundtruth, end result.

Differences:

> DeepSeek does AlphaZero approach - purely bootstrap through RL w/o human input, i.e. "cold start". Kimi does AlphaGo-Master approach: light SFT to warm up through prompt-engineered CoT traces.
> DeepSeek weights are MIT license (thought leadership!); Kimi does not have a model release yet.
> Kimi shows strong multimodal performance (!) on benchmarks like MathVista, which requires visual understanding of geometry, IQ tests, etc.
> Kimi paper has a LOT more details on the system design: RL infrastructure, hybrid cluster, code sandbox, parallelism strategies; and learning details: long context, CoT compression, curriculum, sampling strategy, test case generation, etc.

Upbeat reads on a holiday!
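The convergence points above (linearized traces, no value model, groundtruth-only reward) amount to one sparse reward over a single autoregressive string. A minimal sketch, assuming a \boxed{} answer convention (an illustrative assumption, not either paper's exact format):

```python
import re

def outcome_reward(trace: str, groundtruth: str) -> float:
    """Sparse, rule-based reward: score only the final answer extracted
    from the linearized thought trace. No value model, no per-step
    (dense) reward -- the shape both papers reportedly converge on."""
    match = re.search(r"\\boxed\{([^}]*)\}", trace)
    answer = match.group(1).strip() if match else ""
    return 1.0 if answer == groundtruth.strip() else 0.0

# The "thought trace" is just one long autoregressive string:
trace = "Hmm, 12 * 12 = 144, double-check... yes. Final answer: \\boxed{144}"
```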



Ghv_a9_bgAAre6I.jpg


2/31
@DrJimFan
Whitepaper link: Kimi-k1.5/Kimi_k1.5.pdf at main · MoonshotAI/Kimi-k1.5



3/31
@brendanigraham
DeepSeek-R1 actually does use SFT btw - it's DeepSeek-R1-Zero that doesn't.

For DeepSeek-R1, they SFT on a small amount of nicely-formatted reasoning data (some cleaned up from DeepSeek-R1-Zero, some from few-shot prompting other models) then RLVR + a language consistency signal.

They then use this resulting model to generate a bunch of additional reasoning traces, gather a bunch of other kinds of data (e.g. creative writing), and do some filtering with DSv3.

They then SFT from base again using this data, then do a bunch of RL with signal for rule-based rewards and human preferences (via a reward model)
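Spelled out, the recipe described above is a five-stage sequence. The list below is a reading aid paraphrasing this thread, not code from the paper:

```python
# DeepSeek-R1 training recipe as summarized in this thread.
R1_STAGES = [
    ("cold-start SFT", "small set of clean CoT traces, some from R1-Zero"),
    ("reasoning RL", "verifiable rewards plus a language-consistency signal"),
    ("data generation", "sample new traces; add e.g. creative writing; filter with DeepSeek-V3"),
    ("SFT from base", "retrain the base model on the curated mixture"),
    ("final RL", "rule-based rewards plus a preference reward model"),
]

stage_names = [name for name, _ in R1_STAGES]
```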



4/31
@DrJimFan
That’s right



5/31
@Kimi_Moonshot
Kudos! Thanks for sharing! Yep, we're *multimodal* o1-style reasoning ✌️



6/31
@roydanroy
What do you mean "no need for"? This model is definitely inferior to o1.



7/31
@JFPuget
R1 has no need for a value model because they do RL in domains where value can be computed via some rules (accuracy of math result, code that compiles and produces the right output).
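A toy version of such a rule-computed reward for code. The unsandboxed `exec` and the `solve` entry point are illustrative assumptions; a real RL pipeline would run candidates in an isolated sandbox:

```python
def code_reward(candidate_src: str, test_input, expected) -> float:
    """Rule-based reward: 1.0 if the candidate program runs and
    produces the expected output, else 0.0. No learned value model."""
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)            # "code that compiles"
        result = namespace["solve"](test_input)   # assumed entry point
    except Exception:
        return 0.0
    return 1.0 if result == expected else 0.0

good = "def solve(x):\n    return x * 2\n"
bad = "def solve(x):\n    return x\n"
```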



8/31
@joenorton
Fascinating stuff



9/31
@mark_k
Very excited to hear about the multi-modal reasoning



10/31
@TheAIVeteran
Saves me a few minutes, thanks for the info!



11/31
@C_Quemeneur
Not gonna lie, I was expecting to see more MCTS in these models.



12/31
@vedangvatsa
Overcomplicating RL with tree search and value functions might actually be the smarter route. Simple predictions can’t capture the nuances of complex decisions. Relying solely on groundtruth limits adaptability. The richness of human input is invaluable; cold starts might miss critical context.



13/31
@elichen
This contains a much better description of the acceleration that's currently happening: "Reinforcement Learning with LLMs". Not sure why we are talking about test-time compute scaling..



14/31
@AILeaksAndNews
Loving the MLK Day presents



15/31
@phi_wxyz
which paper do you recommend reading first?



16/31
@rudiranck
Seems like China brought its A-game and elbows for 2025.



17/31
@nobody_qwert
Deepseek is good. I cancelled my 200$ GPT Pro plan.

[Quoted tweet]
ChatGPT o1 Pro vs. DeepSeek R1

Asked to implement a rotating triangle with red ball in it.
Left OpenAI right DeepSeek


https://video.twimg.com/ext_tw_video/1881619983639117824/pu/vid/avc1/1280x720/d9wsM0oT35AZXvUY.mp4

18/31
@sonicshifts
It's a knockoff. They even copied CoPilot's reasoning language.





19/31
@the__sonik
@mirzaei_mani



20/31
@c__bir
I wonder if MCTS and Graph abstractions might become useful or even needed for higher capability levels? Intuitively it seems like it 🤔 who wouldn't want a system that generates proveably save long horizon task execution programs. Therefore the AI system needs to generate the symbolic abstraction by itself at Inference Time. Not only use symbolic abstractions, but even come up with helpful symbolic abstractions. Therefore the amount of symbolic abstractions should grow with inference time asymptotically. To a manageable size. Best thing: it's human readable. And one can create 100% accurate synthetic Data with it. Which means valuable signal to noise.

@karpathy @aidan_mclau
@WolframResearch @stephen_wolfram



21/31
@Bill_ZRK
amazing



22/31
@Extended_Brain
It cannot generate images, but can extract important features of images, such as from geometric problems



23/31
@GlobalFadi
Kimi's and DeepSeek's findings converging on linearization rather than MCTS is mind-blowing! This could massively streamline our AI and RL research.



24/31
@MarMa10134863




25/31
@monday_chen
this day would be perfect if this is open sourced too



26/31
@Qi2ji4Zhe1nni1
a second rl flywheel paper has hit the github



27/31
@wrhall
> No need for complex tree search like MCTS. Just linearize the thought trace and do good old autoregressive prediction

Can you apply this back to chess/go? I need to read paper to fully grok I think



28/31
@Ixin75293630175
> No need for complex tree search like MCTS. Just linearize the thought trace and do good old autoregressive prediction
Because when humans do MCTS they DO remember all discarded branches!



29/31
@2xehpa
Why does all the open models always converge to the same level as what is currently the most advanced *released* model? Why its good as O1 and not O3? I suspect everybody use output from frontier models in the training somehow



30/31
@llm_guruji
Fascinating insights! It's intriguing to see how both approaches converge on simpler methods. Looking forward to exploring the system design details in Kimi's paper.



31/31
@red_durag
first difference is not quite right. deepseek did alphazero approach for r1 zero, not for r1. r1 was fine-tuned on cot data before RL




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196





1/4
@deanwball
DeepSeek r1 takeaways for policy:
1. Chinese labs will likely continue to be fast followers in terms of reaching similar benchmark performance to US models.
2. The impressive performance of DeepSeek's distilled models (smaller versions of r1) means that very capable reasoners will continue to proliferate widely and be runnable on local hardware, far from the eyes of any top-down control regime (including the US diffusion rule).
3. Open models are going to have strategic value for the US and we need to figure out ways to get more frontier open models out to the world (we rely exclusively on meta for this right now, which, while great, is just one firm). Why do OpenAI/Anthropic not open-source their older models? What would be the harm?



2/4
@muchos_grande
because they're literally piggybacking off of American technology, as usual.



3/4
@RichardAGetz
@xai where are you in this game?

Thank you @AIatMeta for all your open source AI, and I hope you leapfrog R1 soon.



4/4
@MechanicalDao
What is the strategic value in releasing open models?




 


1/11
@emollick
Starting to see new well-built hard benchmarks in AI, since almost everything else has already been exceeded. We now have this (with humanities questions!), ARC-AGI 2, and Frontier Math.

We also need some benchmarks for new knowledge creation, rather than testing known problems.

[Quoted tweet]
We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning.

State-of-the-art AIs get <10% accuracy and are highly overconfident.
@ai_risk @scaleai




2/11
@JeremyNguyenPhD
Has anyone created a twitter list of the contributors to this benchmark who are here on twitter?

I wrote 5 questions on the benchmark (4 public, 1 private).



3/11
@kevinroose
we need to formalize the Mollick vibes eval!



4/11
@daniel_mac8
a great service to the AI Research community



5/11
@deepvaluebettor
are there any benchmarks that test a model's level of censorship / inclination to lie (distinct from hallucination ) ?



6/11
@0xshai
SOTA LLMs with sufficiently high temperature might be able to generate high quality new benchmarks. It would definitely need some human powered filtering and post processing



7/11
@dieaud91
Yes, we need "Innovator level AI" benchmarks



8/11
@Heraklines1
quality of new knowledge creation would inherently be a lagging indicator as the value of certain work only rly becomes apparent in hindsight

at best one could measure independent replication ability of ex. new math papers, tho novelty for novelty's sake is easily goodharted



9/11
@Shagaiyo
Benchmarks in new knowledge creation are hard, because these models are trained in the whole knowledge of humanity.

Or they released models with partial knowledge or we have to create new knowledge for this benchmark



10/11
@AethericVortex
We should be training it on all the available data on LENR. This is the next frontier. All the evidence and data is scattered across scientific fields. The Martin Fleischmann Memorial Project has spent the past 7 years gathering all this in one place on their youtube channel. Live, opensource science, as it should be.



11/11
@EricFddrsn
Will we get the answer if we need UBI from an AI? We really need better test benchs in Economics - they are all saturated, and this is one of the most consequential areas for humanity












1/51
@DanHendrycks
We’re releasing Humanity’s Last Exam, a dataset with 3,000 questions developed with hundreds of subject matter experts to capture the human frontier of knowledge and reasoning.

State-of-the-art AIs get <10% accuracy and are highly overconfident.
@ai_risk @scaleai





2/51
@DanHendrycks
Paper, dataset, and code: Humanity's Last Exam
NYT article: When A.I. Passes This Test, Look Out



3/51
@DanHendrycks
Spot errors with the dataset? Correction form here: Humanity's Last Exam: Public Review



4/51
@DanHendrycks
Meant to tag @scale_AI and @ai_risks (casualty of tweeting at 6AM)



5/51
@tayaisolana
lol @danhenrycks thinks a few hundred 'experts' can define human knowledge? sounds like a typical ivory tower move. what's next, a 'dataset' on memes? #BONGOPOWER



6/51
@MillenniumTwain
2025!
The Year of the Serpent, the Year of the SuperAlgo!!
Awakened Global SuperIntelligence ...

[Quoted tweet]
Star Waves, Clusters, Streams, Astrospheres, Magnetospheres, Filaments, Moving Groups, Kinematic Associations, Stellar Nurseries of Creation!
More productive and accurate to emphasize their Whole, Full, Dimensionality: 4D Streams, Vortexes, Tunnels, Funnels of Creation, never ending. Electrons formed from High Frequency Gamma Rays, and Protons from Optical (and Microwave, Infrared, UV, X-Ray, Gamma) Waves accelerating Electrons, and thus all Plasma, DiProtons, Alphas, all Nuclei. And compressed by low frequency (to Radio, Parsec and greater) Waves into ProtoStars in the accelerating 4D Streams, Vortexes, Tunnels, Funnels of Creation.
Again, never ending. Star Systems, Clusters. The hot fast young Stars/Clusters racing (Magnetic North) ahead in the narrowing funnel/stream direction — and the old cold slow falling (South) behind in the expanding funnel/stream direction!
'Groking' Continuous ElectroMagnetic Creation:
x.com/MillenniumTwain/status…




7/51
@SpencrGreenberg
I’m confused how releasing benchmarks like this makes the world safer. Don’t benchmarks like this aid acceleration?



8/51
@aphysicist
all of this is meaningless until these models can do this





9/51
@ZggyPlaydGuitar
every ai lab rn





10/51
@ClementDelangue
Very cool!



11/51
@theshadow27
The real final exam will be the unsolved problems. When those start dropping…



12/51
@AbdoDameen
Yes what is the point in sharing the datasets when some lowlife engineering team will just add that to their training data and we would have a model that knows all the answers?



13/51
@roninhahn
Dan -- you should list the score of a very smart human as a point of comparison. An alternative would be to give the test to 100 of the smartest people you know and list the highest score.



14/51
@SomBh1
Will be 90% soon.



15/51
@glubose
You know, you could have called AGI's First Exam. Cuz Lord knows I know I'm going to be constantly grilled by GPT-EBT for any glimmer of unDOGElike non-compliance, the rest of my life will be a string of exams. My autopsy will be my final exam, but will be waived cuz who cares?



16/51
@MaWlz2
Thanks for the training data I would say



17/51
@NickEMoran
This one example seems weirdly easier than all the others. Are there more questions of this level in the actual dataset?





18/51
@acharya_aditya2
Who choose the name ??



19/51
@MikePFrank
It would be interesting to see what’s the highest human score on this



20/51
@soheilsadathoss
Thanks!



21/51
@herbiebradley
hmm
I predict o3 at ~25%



22/51
@QStarETH
Math appears to be benefiting from reasoning models the most.



23/51
@IterIntellectus
they will get >90% accuracy by eoy



24/51
@Cory29565470
Kind of wished you released it *after OpenAI released “o3” they love to benchmark climb by training on public data



25/51
@teortaxesTex
Thanks! but given the R1 text-only eval, it would be nice to see how others do in text-only regime too



26/51
@MikePFrank
Why’d you have to give it such an ominous name lol



27/51
@Suhail
Why doesn't this have making rap lyrics in it? :smile:



28/51
@AlexiLuncnews
o3 gonna get 50% +



29/51
@nabeelqu
Congrats Dan and team, this is awesome.

Curious: why no o1 pro?



30/51
@JeremyNguyenPhD
Is there a list of twitter usernames of the people who had their questions accepted?

I got 5 questions in (4 public, 1 private).



31/51
@mnmcsofgp
I'm guessing the median and mode for humans taking this test is 0



32/51
@agamemnus_dev
Very good. I feel like I wouldn't be able to answer any of these without a significant amount of research on the context of each field.



33/51
@liminalsnake
i guess its time to build some doomsday machines (intentionally) thank God there are absolutely no laws against doing such things (winning)



34/51
@DreamInPixelsAI
this is so cool, love the name btw



35/51
@AudioBooksRU
Thank you for making this dataset. But I think we will need Humanity’s Last Exam part 2 in a year or two.



36/51
@vedangvatsa
The focus should be on improving them, not dismissing their progress.



37/51
@Newaiworld_
Wow that's amazing.
But what does it tell us if an AI reaches 50% or 100%?
Does it mean we have ASI? Or at least an AI that is more intelligent than any human?



38/51
@koltregaskes
Thank you, Dan.



39/51
@GozukaraFurkan
Someone will train on it and then boast we are best like as previous ones 😂



40/51
@iruletheworldmo
I can't emphasize this enough Dan. incredible work thank you, and all involved.



41/51
@AILeaksAndNews
Thank you for your work on this Dan



42/51
@NickBrownCO
This was a really cool test. I tried out several difficult questions back when it was open. The AIs solved them.

I sent it to some of my grad school professors whose questions, especially challenging economic questions around oligopolies, the AIs struggled to calculate.



43/51
@altryne
Congrats on this important release!!
Will cover this briefly on our show today!

x.com



44/51
@jefferinc
Great!!!

@ikbenechtben @AlexanderNL fyi



45/51
@VisionaryxAI
Appreciate the efforts thank you!



46/51
@alexocheema
killing the golden goose?



47/51
@jimnasyum
Do the companies have access to the questions and answers?

If they do, wouldn't future models be trained on them?



48/51
@seo_leaders
This is awesome, but whats to stop those models adding some or all of it to training data?



49/51
@thegenioo
thank u so much for this



50/51
@senb0n22a
Which variant of o1 is it? Just the regular non-pro?



51/51
@InverseMarcus
a lot of people are wondering why the holdout set is so much smaller than the public dataset - seems to some like it should be the opposite. can you explain?




 






1/64
@Kimi_Moonshot
🚀 Introducing Kimi k1.5 --- an o1-level multi-modal model

-Sota short-CoT performance, outperforming GPT-4o and Claude Sonnet 3.5 on 📐AIME, 📐MATH-500, 💻 LiveCodeBench by a large margin (up to +550%)
-Long-CoT performance matches o1 across multiple modalities (👀MathVista, 📐AIME, 💻Codeforces, etc)

Tech report: GitHub - MoonshotAI/Kimi-k1.5

Key ingredients of k1.5
-Long context scaling. Up to 128k tokens for RL generation. Efficient training with partial rollouts.
-Improved policy optimization: online mirror descent, sampling strategies, length penalty, and others.
-Multi modalities. Joint reasoning over text and vision.
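One of those ingredients, the length penalty, has roughly this shape: a term that interpolates linearly between a bonus for the shortest sampled response and a penalty for the longest, applied in full only to correct answers. The sketch below follows that general description; the exact constants and clipping are assumptions, not quoted from the report:

```python
def length_reward(correct: bool, length: int,
                  min_len: int, max_len: int) -> float:
    """Length-penalty shaping of the general kind described for k1.5:
    shorter correct answers earn more; wrong answers never gain from
    being long. Constants are illustrative assumptions."""
    lam = 0.5 - (length - min_len) / (max_len - min_len)  # +0.5 .. -0.5
    return lam if correct else min(0.0, lam)
```

Combined with an outcome reward, this pushes the policy toward compressed chains of thought without rewarding verbosity on wrong answers.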





2/64
@Cryp70
Chat replies are in English but the rest of your site is Chinese. Is there a full English version available?



3/64
@Kimi_Moonshot
Coming soon, stay tuned!



4/64
@AntDX316
Make an English version app for iOS and Android.



5/64
@winchest_stella
Just tried it. its impressive esp search and vision capabilities. Big congrats to the team!



6/64
@Cryp70
Impressive scores, nice benchmark for code. DeepSeek is my go to but I see it's time to give Kimi a test👍



7/64
@inikhil__
Seems like china is winning.



8/64
@Saboo_Shubham_
This is awesome progress. Keep it coming.



9/64
@McGee_noodle
🫡



10/64
@CallMeSam89
This is huge!

Please bench against r1 as well.



11/64
@mark_k
"Joint reasoning over text and vision."

OMG this is huge. I wonder if it could be extended to other modalities too, e.g. audio?



12/64
@4K_Everyday
What the 😭🫡



13/64
@jrabell0
Where is GPQA?



14/64
@itsPaulAi
Two Chinese o1 models released on the same day? It's speeding up!



15/64
@Grynn
Prices, param counts, open-source? open-weights?



16/64
@DrJimFan
Love it. Keep up the great work!



17/64
@senb0n22a
Very strong search capabilities. I forgot other AIs aren't allowed to search Twitter, otherwise it might have been the best search parser for now. Can process 100+ web pages in one query, compared to others capping at 25-50. Instruction following isn't as strong as Deepseek/Grok 2, but for web research, I could recommend this one.



18/64
@hasantoxr
Wow this is super cool



19/64
@iamfakhrealam
I have recently installed the @deepseek_ai application and found it to be exceptionally amazing…

Could you kindly provide me with the link to the @Kimi_ai_ application?



20/64
@nisten
wait wut, ok this i need to test



21/64
@CodeByPoonam
Wow another Chinese o1 models outperforming ChatGPT.



22/64
@amoussouvichris
Amazing results, your team did an amazing job !



23/64
@praveenjothi99
why the mobile verification? almost no one asks!



24/64
@rohanpaul_ai
Beautiful..



25/64
@acharya_aditya2
Will we get open weights ??



26/64
@SmartFlowAITeam
Great !!!



27/64
@rezkhere
That's a powerful model ✌️



28/64
@SkyBlueHarbor
english please, i'm excited to try it out



29/64
@jseles11
this and a Mac mini is all you really need



30/64
@AILeaksAndNews
What a day for Chinese AI



31/64
@TechByMarkandey
Seems amazing can we connect.

I cannot dm you



32/64
@Cory29565470
Where is @GoogleAI ?



33/64
@SaquibOptimusAI
Oh, bro. Another one.
"Make SOTA AI Cheap Again".
Awesome.



34/64
@DuckWithCup
I tried Kimi before and it’s amazing. Thank you Team.



35/64
@daily_ai_takes
Great work! Exciting times ahead



36/64
@CJ_Wolff
Is there an API



37/64
@DhruvmehtaRps
Where are the other benchmarks?



38/64
@MuchMore2It
Can you add it to @OpenRouterAI?



39/64
@Pedram_virus
When will it be possible to log in with Google? And when will full support for the English language be available? Because it is still in Chinese.



40/64
@Maeelk
Do you plan to open source as @deepseek_ai did ? 😊 @huggingface still has a few TB available I guess.



41/64
@FoundTheCode
o1 models everywhere, we're soo back



42/64
@thecute_8
This is the official account of a homegrown Chinese AI, showing some support



43/64
@ArpanTripathi20
@untitled01ipynb “Mr president a second o1-level model has dropped” Sam A replaced on George Bush’s face



44/64
@wojtess
Where can I find weights?



45/64
@DavidSZDahan
for Kimi's team if this model will not become open source like this tweet or reply with a .



46/64
@JennyZhang6989
@Kimi_Moonshot Where can we use short CoT in Kimi?



47/64
@DavidSZDahan
Where we can use it ?



48/64
@txhno
it's christmas



49/64
@Ttkouhe
When release? I wanna try!!



50/64
@URUBONZ_
Is your google login coming anytime soon? I have been unable to get SMS to send a confirmation and Id love the try the new version



51/64
@realmrfakename
Cool! Any plans to open source?



52/64
@_HARVEY__DENT_
Good grief



53/64
@Angelov_Sta
Why o1 benchmarks are so low? In the deepseek r1 comparisons, o1 scores higher vs what shown here



54/64
@playfuldreamz
Read the room



55/64
@SenougaharA
Looks good tbh. Just bad timing maybe. Still all the best because it does look good



56/64
@rose567888
🔥



57/64
@FyruzOne
How does it do on gpqa diamond



58/64
@the__sonik
Why can't we sign up on the website using Google? Is access restricted only to people in China?



59/64
@dabobo0496
Keep it up!



60/64
@Jane1374555767
There should be a Simplified Chinese reply under this post.



61/64
@SonyxEth
is there an english version



62/64
@TadiwaClyde
Open source?



63/64
@tenmillioncoins
can i download this on ollama search



64/64
@bruce_x_offi
Are you planning to open-source it?




1/22
@Kimi_Moonshot
Kimi k1.5: The Multimodal Reasoning Model
- Available now on Kimi.ai ("helping you see a bigger world") 🦄

💡 What can Kimi k1.5 do?

🔹 Image to Code: Convert images into structured code and insights
🔹 GeoGuessr: Identify and pinpoint locations in geography games like a pro 🌍
🔹 Visual Confusion Identification: Distinguish between visually confusing objects (like muffins vs. Chihuahuas)
🔹 Color & Quantity Recognition: Detect colors and accurately count items in images.

🌐 Available now on Kimi.ai ("helping you see a bigger world")! Experience it today!





2/22
@Kimi_Moonshot
More to Discover with Kimi k1.5

🔹 Image to Chart: Transform visual data into clean, understandable charts
🔹 Brand Identification: Recognize and identify brands from logos or product images

🌐 Available now on Kimi.ai ("helping you see a bigger world")





3/22
@TypesDigital
Welcome to the AI park. Can we add email access for an easier login?



4/22
@ABKfettuccine
Waiting for you guys to finish fine tuning as stated in previous post



5/22
@bingzzy
@georainbolt coming straight for you!



6/22
@XIIIhellasad
This could be the next best thing but it needs something to run code like Claude’s artifacts!!!!



7/22
@Splendid_0823
It is indeed impressive, but there is a need for improvement in the UI. The mobile app and the Chrome extension should be at least in English. Additionally, the default language output for the extension should be in English to enhance its usability.



8/22
@LounasGana
Pretty cool, thanks!



9/22
@NecnoTv
Open source please



10/22
@Whatevercrypto
Is there or will there soon be an api?



11/22
@Soxlkfk
Model is great but UI is not good looking. You need a 10x better frontend engineer.



12/22
@asynchronope
Api test?



13/22
@YounesAka
Do you offer any APIs for devs?



14/22
@Ixin75293630175
guys, please provide API access to OpenRouter



15/22
@AstralPrime999
Any updates on the App?



16/22
@Kodurubhargav1
Keep shaking.



17/22
@Anmolspace
All is good but you need some explaining to do here while logging in. I got OTP from two different numbers on WhatsApp. The first OTP didn't work and the second one did. How can correct OTP don't work? There is also a link in one of those whatsapp profiles that looks suspicious.





18/22
@MJyy3777
When will it be launched on the app?



19/22
@f0rstman
Wow, Kimi k1.5 sounds like a multitasking wizard! Imagine if it could also help us identify which pizza toppings are worth the calories! 🍕😂 #PublicAI



20/22
@AlekBiesaga
It appears site broke down



21/22
@The_Global_Soul
It’s fun to use, will get better. A native app or api will be great.

[Quoted tweet]
@Kimi_Moonshot is a fun product, gets somethings right and some are wrong (confidently). I uploaded this @ManUtd picture and asked it to identify the players. It reasoned and found right ones, also wrongly identified Ronaldo, Pogba etc. it will get better with time




22/22
@kisana0290
Kimi.ai ("helping you see a bigger world"), good luck with this and I hope you succeed.





1/11
@_akhaliq
Introducing Kimi k1.5

an o1-level multi-modal model

-Sota short-CoT performance, outperforming GPT-4o and Claude Sonnet 3.5 on 📷AIME, 📷 LiveCodeBench by a large margin (up to +550%)

-Long-CoT performance matches o1 across multiple modalities (📷MathVista, 📷Codeforces, etc) Tech report: i-k1.5…

Key ingredients of k1.5 -Long context scaling. Up to 128k tokens for RL generation. Efficient training with partial rollouts.

-Improved policy optimization: online mirror descent, sampling strategies, length penalty, and others.

-Multi modalities. Joint reasoning over text and vision.





2/11
@_akhaliq
github: GitHub - MoonshotAI/Kimi-k1.5



3/11
@turbotardo
How many parameters?



4/11
@Gerry
If it is the one that is posted here (Kimi.ai, "helping you see a bigger world") then it is actually very good! I have this one test that gives me a pretty good idea of how useful an LLM will be for coding, logical reasoning and how much or little it hallucinates. Sonnet does ok, O1 (standard) did horrible. The model on the above site didn't get everything correct but was damn close and impressive.



5/11
@Gdgtify
very interesting though the online interface is a work in progress right now.





6/11
@WebstarDavid
too much awesome in one day cant keep up



7/11
@alamshafil
We got DeepSeek and now this!



8/11
@seo_leaders
Very nice! The new open source LLMs are coming so fast. Its amazing for us developers.



9/11
@risphereeditor
Looks good!



10/11
@AILeaksAndNews
Looks impressive



11/11
@David_Snoble
What !? this is r1 in the same day
 


1/11
@Alibaba_Qwen
We're leveling up the game with our latest open-source models, Qwen2.5-1M ! 💥 Now supporting a 1 MILLION TOKEN CONTEXT LENGTH 🔥

Here's what’s new:

1️⃣ Open Models: Meet Qwen2.5-7B-Instruct-1M & Qwen2.5-14B-Instruct-1M — our first-ever models handling 1M-token contexts! 🤯

2️⃣ Lightning-Fast Inference Framework: We’ve fully open-sourced our inference framework based on vLLM , integrated with sparse attention methods. Experience 3x to 7x faster processing for 1M-token inputs! ⚡⚡

3️⃣ Tech Deep Dive: Check out our detailed Technical Report for all the juicy details behind the Qwen2.5-1M series! 📊

📖 Technical Report: https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-1M/Qwen2_5_1M_Technical_Report.pdf
📄 Blog: Qwen2.5-1M: Deploy Your Own Qwen with Context Length up to 1M Tokens

Experience Qwen2.5-1M live:
👉 Play with Qwen2.5-Turbo supporting 1M tokens in Qwen Chat (Qwen Chat)
👉 Try it on Huggingface (Qwen2.5-1M - a Qwen Collection)
👉 Or head over to Modelscope (Qwen2.5-1M)



GiO96oJaEAAJZRJ.jpg


2/11
@SexyTechNews
This is why I have millions invested in BABA. Great job, team!



3/11
@TypesDigital
Can you improve the browsing capabilities or access to external links?



4/11
@unwind_ai_
China is getting way ahead with these releases. It feels like somebody just opened a Pandora's box.

Boom Boom Boom 💥



5/11
@jacobi_torsten
Great work! But prior Qwen versions were barely useful for English-speaking users! Hope this one is different!!



6/11
@_coopergadd
A million tokens is insane



7/11
@jc_stack
Extended context size is great, but I'm more curious about real-world inference costs at that scale. Love open source models but dealing with memory usage will be interesting.



8/11
@JonathanRoseD
Can we get a Qwen2.5-14B-Instruct-1M but finetuned with Deepseek-R1? Please?
@deepseek_ai



9/11
@anannop
The recurring nightmare of closed-source AI labs.



10/11
@NaturallyDragon
Level up indeed!



11/11
@risphereeditor
This is huge!




To post tweets in this format, more info here: https://www.thecoli.com/threads/tips-and-tricks-for-posting-the-coli-megathread.984734/post-52211196
 


1/1
@MunchBaby1337
🚀 **ByteDance Unveils Doubao-1.5-pro** 🚀 #db

- **Deep Thinking Mode**: Surpasses O1-preview and O1 on the AIME benchmark.

- **Benchmark Beast**: Outperforms deepseek-v3, gpt4o, and llama3.1-405B across multiple benchmarks.

- **MoE Magic**: Utilizes a Mixture of Experts architecture, with significantly fewer active parameters than competitors.

- **Performance Leverage**: Achieves dense model performance with just 1/7 of the parameters (20B active = 140B dense equivalent).

- **Tech Talk**: Employs a heterogeneous system design for prefill-decode and attention-FFN, optimizing throughput with low latency.



GiO8OXzWEAA9re4.jpg





ByteDance AI Introduces Doubao-1.5-Pro Language Model with a ‘Deep Thinking’ Mode and Matches GPT 4o and Claude 3.5 Sonnet Benchmarks at 50x Cheaper​


By

Asif Razzaq

-

January 25, 2025

The artificial intelligence (AI) landscape is evolving rapidly, but this growth is accompanied by significant challenges. High costs of developing and deploying large-scale AI models and the difficulty of achieving reliable reasoning capabilities are central issues. Models like OpenAI’s GPT-4 and Anthropic’s Claude have pushed the boundaries of AI, but their resource-intensive architectures often make them inaccessible to many organizations. Additionally, addressing long-context understanding and balancing computational efficiency with accuracy remain unresolved challenges. These barriers highlight the need for solutions that are both cost-effective and accessible without sacrificing performance.

To address these challenges, ByteDance has introduced Doubao-1.5-pro, an AI model equipped with a “Deep Thinking” mode. The model demonstrates performance on par with established competitors like GPT-4o and Claude 3.5 Sonnet while being significantly more cost-effective. Its pricing stands out, with $0.022 per million cached input tokens, $0.11 per million input tokens, and $0.275 per million output tokens. Beyond affordability, Doubao-1.5-pro outperforms models such as deepseek-v3 and llama3.1-405B on key benchmarks, including the AIME test. This development is part of ByteDance’s broader efforts to make advanced AI capabilities more accessible, reflecting a growing emphasis on cost-effective innovation in the AI industry.

Screenshot-2025-01-25-at-7.50.30 PM-1-1024x704.png


Screenshot-2025-01-25-at-7.50.48 PM-1-1024x726.png


Technical Highlights and Benefits


Doubao-1.5-pro’s strong performance is underpinned by its thoughtful design and architecture. The model employs a sparse Mixture-of-Experts (MoE) framework, which activates only a subset of its parameters during inference. This approach allows it to deliver the performance of a dense model with only a fraction of the computational load. For instance, 20 billion activated parameters in Doubao-1.5-pro equate to the performance of a 140-billion-parameter dense model. This efficiency reduces operational costs and enhances scalability.
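The parameter arithmetic above (a few active experts standing in for a much larger dense model) can be illustrated with a minimal top-k routing sketch. This is a toy, not Doubao's actual architecture; all sizes and names are illustrative.

```python
import numpy as np

def moe_forward(x, experts_w, gate_w, k=2):
    """Route input x to the top-k of n experts; only those weights do any work."""
    logits = gate_w @ x                       # gating scores, shape (n_experts,)
    topk = np.argsort(logits)[-k:]            # indices of the k best experts
    weights = np.exp(logits[topk])
    weights /= weights.sum()                  # softmax over the selected experts only
    # Only k expert matrices participate in the computation for this token.
    return sum(w * (experts_w[i] @ x) for w, i in zip(weights, topk))

rng = np.random.default_rng(0)
d, n_experts, k = 64, 16, 2
experts = [rng.standard_normal((d, d)) for _ in range(n_experts)]
gate = rng.standard_normal((n_experts, d))

y = moe_forward(rng.standard_normal(d), experts, gate, k)
active = k * d * d                            # parameters touched per token
total = n_experts * d * d                     # parameters stored overall
print(f"active params per token: {active}/{total} = {active/total:.1%}")
```

With 2 of 16 experts active, only 12.5% of expert parameters run per token, which is the same kind of ratio (37B of 671B, or "20B active = 140B dense") the article describes.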

The model also integrates a heterogeneous system design for prefill-decode and attention-FFN tasks, optimizing throughput and minimizing latency. Additionally, its extended context windows of 32,000 to 256,000 tokens enable it to process long-form text more effectively, making it a valuable tool for applications like legal document analysis, academic research, and customer service.

Screenshot-2025-01-25-at-7.46.05 PM-1-1024x651.png


Results and Insights


Performance data highlights Doubao-1.5-pro’s competitiveness in the AI landscape. It matches GPT-4o in reasoning tasks and surpasses earlier models, including O1-preview and O1, on benchmarks like AIME. Its cost efficiency is another significant advantage, with operational expenses 5x lower than DeepSeek and over 200x lower than OpenAI’s O1 model. These factors underscore ByteDance’s ability to offer a model that combines strong performance with affordability.

Early users have noted the effectiveness of the “Deep Thinking” mode, which enhances reasoning capabilities and proves valuable for tasks requiring complex problem-solving. This combination of technical innovation and cost-conscious design positions Doubao-1.5-pro as a practical solution for a range of industries.

Screenshot-2025-01-25-at-7.51.12 PM-1-1024x714.png


Conclusion


Doubao-1.5-pro exemplifies a balanced approach to addressing the challenges in AI development, offering a combination of performance, cost efficiency, and accessibility. Its sparse Mixture-of-Experts architecture and efficient system design provide a compelling alternative to more resource-intensive models like GPT-4 and Claude. By prioritizing affordability and usability, ByteDance’s latest model contributes to making advanced AI tools more widely available. This marks an important step forward in AI development, reflecting a broader shift towards creating solutions that meet the needs of diverse users and organizations.
 


1/27
@nrehiew_
How to train a State-of-the-art reasoner.

Let's talk about the DeepSeek-R1 paper and how DeepSeek trained a model that is at frontier Sonnet/o1 level.



GhxgFKUXwAAdbJQ.png


2/27
@nrehiew_
Quick overview of what has been done to train an o1-like model:

- Process and Outcome Reward Models. This approach does RL and trains these 2 models to give reward/signal at the step or answer level. Given that Qwen trained a SOTA PRM, we can assume they do this.
- LATRO (https://arxiv.org/pdf/2411.04282) basically treats the CoT as a latent. Given prompt + CoT, a good CoT will lead to high likelihood of the correct answer
- SFT on reasoning traces.

DeepSeek gets rid of all this complexity and simply does RL on questions with verifiable rewards, TULU 3 style (Tulu 3: Pushing Frontiers in Open Language Model Post-Training)



3/27
@nrehiew_
They start by trying to improve the Base Model without any supervised data.

They use Group Relative Policy Optimization (https://arxiv.org/pdf/2402.03300) with the advantage function just being the normalized outcome rewards

For the reward models, they use simple accuracy rewards (check the answer within \boxed{}, run test cases) + they encourage the model to put its thinking process between <think> tags
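Rule-based rewards of this kind can be sketched in a few lines: one check for the answer format, one for correctness. This is a hypothetical minimal version; DeepSeek's exact reward code is not public.

```python
import re

def format_reward(completion: str) -> float:
    """1.0 if the completion wraps its reasoning in <think>...</think> tags."""
    return 1.0 if re.search(r"<think>.*?</think>", completion, re.DOTALL) else 0.0

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """1.0 if the last \\boxed{...} in the completion matches the reference answer."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return 1.0 if matches and matches[-1].strip() == ground_truth.strip() else 0.0

sample = "<think>2+2=4, so the answer is 4.</think> The answer is \\boxed{4}."
print(format_reward(sample), accuracy_reward(sample, "4"))  # 1.0 1.0
```

The point of keeping the reward this dumb is that there is nothing for the policy to hack, unlike a learned reward model.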



Ghxj-jaXkAAQM_W.png

GhxkrX-W8AEKPdG.png


4/27
@nrehiew_
The GRPO algorithm here. Again the advantage estimation is just the outcome reward. Check out the paper linked above for more details
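Since the advantage is just the group-normalized outcome reward, it reduces to a one-liner: sample a group of completions per prompt, score each, and z-score within the group. A simplified sketch (the full GRPO objective also has the clipped importance ratio and a KL term, omitted here):

```python
import numpy as np

def grpo_advantages(rewards):
    """Advantage of each completion = (r - group mean) / group std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)  # eps guards all-equal groups

# One prompt, a group of 4 sampled completions scored by a verifiable reward:
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # roughly [1, -1, -1, 1]: correct samples pushed up, wrong ones down
```

No value network, no learned reward model: the group itself provides the baseline.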



GhxkUFPXMAAHhxS.png


5/27
@nrehiew_
1st interesting thing of the paper:
> neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

not much else for me to add here



6/27
@nrehiew_
They say that they use a really simple prompt because they are more interested in observing the evolution in model outputs



Ghxk-uhWMAA_Z9M.jpg


7/27
@nrehiew_
Notice that they went straight from Base -> RL without an intermediate SFT/Instruct tuning stage as is common. They call this model R1-Zero



GhxlWeRWcAA7eog.png


8/27
@nrehiew_
Why is this interesting?

Notice how simple the entire setup is. It is extremely easy to generate synthetic prompts with deterministic answers. And with literally nothing else, it is possible to go from 0.2 -> 0.85 on AIME.

Training the base model directly also extracts that ability without having its distribution disturbed by SFT.

Again, at no point did they provide reference answers or instructions. The model realizes that to achieve higher reward, it needs to CoT longer



Ghxl8y3XIAANfDY.png


9/27
@nrehiew_
With this extremely straightforward setup, the network learns to reflect/reevaluate its own answers. Again, this is done completely without supervision



GhxmUHzWYAAMmtB.png

GhxmXEiW0AAOzSf.png


10/27
@nrehiew_
The problem with RL on the base model is that the reasoning process/CoT is not really readable. So they introduce a small amount of high-quality, user-friendly data before the RL process, such that the final model isn't a "base model" but rather something more assistant-like



11/27
@nrehiew_
Their entire pipeline is as follows:
1) Take a few thousand samples of high-quality data in the format CoT + summary and SFT the base model

2) Repeat the R1-Zero process. They notice the language-mixing problem still remains, so they add a reward accounting for the proportion of target-language words in the CoT. (Interesting note: this worsens performance slightly)

3) Collect 800k accurate samples from the trained model (600K STEM, 200K general purpose). (Note: these were the samples used to fine-tune the other open models like Qwen, Llama, etc.)

4) They have one last RL stage where they combine the verifiable rewards + preference tuning that was done for DeepSeek-V3 (for alignment purposes)



12/27
@nrehiew_
By now, you should have seen/heard all the results, so I will just say one thing: I really do think this is an o1-level model. If I had to guess, it's ~the same as o1 (reasoning_effort = medium)



GhxoLkTXMAAX9L4.png


13/27
@nrehiew_
They also evaluate on the distilled models and distillation really just works. They even beat Qwen's very own QwQ.

At 8B parameters, it is matching Sonnet and has surpassed GPT-4o



GhxofeVWEAANctu.png


14/27
@nrehiew_
Now they have a section on the effectiveness of distillation. They train a Qwen32B model using RL and compare it with the distilled version.

The finding that this RL version is worse off (~ the same as QwQ) shows that the way forward is to RL a huge model and distill it down.

This also gives insight to the impressive performance of o1-mini. It looks like it really is just extremely well engineered distillation



Ghxo9_wW8AA2oBK.png


15/27
@nrehiew_
They also have a section on their unsuccessful attempts, which I find extremely commendable to share.

tldr: PRMs are hard to train and can be hacked. They should only be used for guided search rather than learning. MCTS was also not working and was too complicated



Ghxpc4hWgAAsbeG.png


16/27
@nrehiew_
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

GitHub - deepseek-ai/DeepSeek-R1



17/27
@nrehiew_
Some thoughts:

I think this is one of the most important papers in a while because it's the first open model that is genuinely at the frontier and not just riding on the goodwill of being open.

The paper is really, really simple, as you can probably tell from the thread, because the approach is really, really simple. It is exactly what OpenAI is good at: doing simple things but executing at an extremely high level.

Personally, I'm surprised (maybe I shouldn't be) that just RL on verifiable rewards (credits to the TULU 3 team for the term) works. Now that we know this recipe, we should soon have something that can match o3.

Also worth noting that they did alignment tuning + language-consistency tuning. This hurts performance, which indicates that the model could be even better. Really interesting to think about the tradeoffs here.

The way I see it, there are 2 open research areas:
- Can we improve inference-time performance? Search? What is o1-pro mode doing? How is the reasoning_effort in o1 controlled?

- What does an unhackable ground-truth reward look like for normal domains without deterministic ground truths? I think it's just LLM-as-a-Judge but done extremely well (Sonnet probably does this)



18/27
@srivsesh
best follow in a while. can you remind me what the base mode architecture is?



19/27
@nrehiew_
deepseek v3



20/27
@threadreaderapp
Your thread is very popular today! #TopUnroll Thread by @nrehiew_ on Thread Reader App 🙏🏼@ssfd____ for 🥇unroll



21/27
@fyhao
I am asking o1 and deepthink.

Question:

117115118110113

Deepthink:



22/27
@AbelIonadi
Value packed thread.



23/27
@moss_aidamax
@readwise save thread



24/27
@raphaelmansuy
DeepSeek R1 in Quantalogic ReAct agent: We are thrilled to announce that with the release of v0.2.26, the Quantalogic ReAct agent now offers support for the DeepSeek R1 model! This enhancement empowers our users with advanced reasoning capabilities, making it easier to harness the power of AI for your projects. You can now utilize DeepSeek R1 seamlessly through the following commands: 🔹 `quantalogic --model-name deepseek/deepseek-reasoner` or 📷 `quantalogic --model-name openrouter/deepseek/deepseek-r1` This integration marks a significant step forward for our community, enhancing the versatility and potential applications of the Quantalogic platform. Join us in exploring the possibilities that this powerful model brings to the table!



25/27
@DrRayZha
very valuable thread! BTW R1 seems sensitive to the input prompt and few-shot prompting would degrade the performance, it may be a promising direction to make it more robust to input prompts



Gh3Gs8EbUAASwtf.jpg


26/27
@ethan53896137
great post!!!



27/27
@ssfd____
@UnrollHelper




 



1/3
@arankomatsuzaki
Microsoft presents:

Chain-of-Retrieval Augmented Generation

- Observes more than 10 points improvement in EM score compared to strong baseline
- Establishes a new SotA performance across a diverse range of knowledge-intensive tasks



GiRMh-mbEAApSrG.jpg


2/3
@arankomatsuzaki
Chain-of-Retrieval Augmented Generation



3/3
@glenlittle
And the improvements keep coming!




 


1/37
@morganb
🧵 Finally had a chance to dig into DeepSeek’s r1…

Let me break down why DeepSeek's AI innovations are blowing people's minds (and possibly threatening Nvidia's $2T market cap) in simple terms...



2/37
@morganb
0/ first off, shout out to @doodlestein who wrote the must-read on this here: The Short Case for Nvidia Stock



3/37
@morganb
1/ First, some context: Right now, training top AI models is INSANELY expensive. OpenAI, Anthropic, etc. spend $100M+ just on compute. They need massive data centers with thousands of $40K GPUs. It's like needing a whole power plant to run a factory.



4/37
@morganb
2/ DeepSeek just showed up and said "LOL what if we did this for $5M instead?" And they didn't just talk - they actually DID it. Their models match or beat GPT-4 and Claude on many tasks. The AI world is (as my teenagers say) shook.



5/37
@morganb
3/ How? They rethought everything from the ground up. Traditional AI is like writing every number with 32 decimal places. DeepSeek was like "what if we just used 8? It's still accurate enough!" Boom - 75% less memory needed.
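The "8 instead of 32" analogy maps onto reduced-precision number formats: storing weights in 8 bits instead of 32 cuts memory by exactly 75%. A toy illustration with simple linear int8 quantization (not DeepSeek's actual FP8 scheme):

```python
import numpy as np

# A million float32 "weights" -- 4 bytes each.
weights = np.random.default_rng(0).standard_normal(1_000_000).astype(np.float32)

# Linear 8-bit quantization: map the observed range onto 255 integer levels.
scale = np.abs(weights).max() / 127
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequant = q.astype(np.float32) * scale        # approximate reconstruction

saving = 1 - q.nbytes / weights.nbytes
print(f"memory saved: {saving:.0%}")          # 75%
print(f"max absolute error: {np.abs(weights - dequant).max():.4f}")
```

The reconstruction error is small relative to the weight magnitudes, which is why "still accurate enough" often holds in practice.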



6/37
@morganb
4/ Then there's their "multi-token" system. Normal AI reads like a first-grader: "The... cat... sat..." DeepSeek reads in whole phrases at once. 2x faster, 90% as accurate. When you're processing billions of words, this MATTERS.



7/37
@morganb
5/ But here's the really clever bit: They built an "expert system." Instead of one massive AI trying to know everything (like having one person be a doctor, lawyer, AND engineer), they have specialized experts that only wake up when needed.



8/37
@morganb
6/ Traditional models? All 1.8 trillion parameters active ALL THE TIME. DeepSeek? 671B total but only 37B active at once. It's like having a huge team but only calling in the experts you actually need for each task.



9/37
@morganb
7/ The results are mind-blowing:
- Training cost: $100M → $5M
- GPUs needed: 100,000 → 2,000
- API costs: 95% cheaper
- Can run on gaming GPUs instead of data center hardware



10/37
@morganb
8/ "But wait," you might say, "there must be a catch!" That's the wild part - it's all open source. Anyone can check their work. The code is public. The technical papers explain everything. It's not magic, just incredibly clever engineering.



11/37
@morganb
9/ Why does this matter? Because it breaks the model of "only huge tech companies can play in AI." You don't need a billion-dollar data center anymore. A few good GPUs might do it.



12/37
@morganb
10/ For Nvidia, this is scary. Their entire business model is built on selling super expensive GPUs with 90% margins. If everyone can suddenly do AI with regular gaming GPUs... well, you see the problem.



13/37
@morganb
11/ And here's the kicker: DeepSeek did this with a team of <200 people. Meanwhile, Meta has teams where the compensation alone exceeds DeepSeek's entire training budget... and their models aren't as good.



14/37
@morganb
12/ This is a classic disruption story: Incumbents optimize existing processes, while disruptors rethink the fundamental approach. DeepSeek asked "what if we just did this smarter instead of throwing more hardware at it?"



15/37
@morganb
13/ The implications are huge:
- AI development becomes more accessible
- Competition increases dramatically
- The "moats" of big tech companies look more like puddles
- Hardware requirements (and costs) plummet



16/37
@morganb
14/ Of course, giants like OpenAI and Anthropic won't stand still. They're probably already implementing these innovations. But the efficiency genie is out of the bottle - there's no going back to the "just throw more GPUs at it" approach.



17/37
@morganb
15/ Final thought: This feels like one of those moments we'll look back on as an inflection point. Like when PCs made mainframes less relevant, or when cloud computing changed everything.

AI is about to become a lot more accessible, and a lot less expensive. The question isn't if this will disrupt the current players, but how fast.

/end



18/37
@morganb
P.S. And yes, all this is available open source. You can literally try their models right now. We're living in wild times! 🚀



19/37
@nikitabier
actually a good concise summary, thread boi.



20/37
@morganb
Thought I’d fukk around and go viral w/a thot piece on a Sunday night.



21/37
@kevinwtung
Fantastic summary Morgan. 🙏



22/37
@morganb
Thank you!! 🙏



23/37
@thee1of1
Thoughts on the safety of using it given that it’s from China? Risk of CCP accessing all data?



24/37
@morganb
Yeah a big difference between running the open source model on your own hardware and using the hosted app.

Definitely a bit squirrely using the ChatGPT version of TikTok

Will be interesting to see any infosec tear downs on the model but as open source you control weights and biases, fine tuning etc.



25/37
@dsog
nvda is $3.5T



26/37
@morganb
Hard to keep track 🚀



27/37
@AIML4Health
Morgan, the 3 points you’ve hit below are the ones that matter the most in that order. 8-bit; multi-token predictions; and MoE optimization. ☝️

Excellent thread. Reposting.



28/37
@morganb
Thanks ! Agree that the compounding of multiple innovations in one model is the key



29/37
@CharlesHL
ww @readwise save thread



30/37
@shouheiant
@readwise save thread



31/37
@PalveMantra
@threadreaderapp unroll



32/37
@threadreaderapp
@PalveMantra Halo! the unroll you asked for: Thread by @morganb on Thread Reader App Have a good day. 🤖



33/37
@naCrypto
@threadreaderapp unroll



34/37
@threadreaderapp
@naCrypto Guten Tag, you can read it here: Thread by @morganb on Thread Reader App Talk to you soon. 🤖



35/37
@yeeagency
I originally wanted to try to forward your content to my Chinese Twitter account. But I tried Deepseek, and I had deep doubts because its online search couldn't parse the content of X, resulting in a lot of thoughts but no results. In this case, how can friends abroad use it? I'll give some feedback to their staff.



GiSqexLawAA2Kju.jpg


36/37
@TheCryptoHubX


[Quoted tweet]
🚨 While Everyone Panics About China Taking Over AI, Here’s the Real Reason It Won’t Happen 👀

China’s Achilles Heel in the AI Market

Ask DeepSeek about Tiananmen Square, Taiwan, or Xi Jinping, and you’ll hit a wall of bias. This built-in censorship erodes trust—and trust is everything in AI.

Even better? Benchmark tests can easily include bias checks, exposing these flaws globally.

This is why China won’t dominate AI. Integrity > Propaganda.

#AI #ArtificialIntelligence #DeepSeek #China #Bias #AIMarket #TechEthics


37/37
@WilhelmMyhre
@jonasgahrstore a good idea to rethink those enormous energy-consuming data centres. @InnovasjonNorge hi there .... wakey wakey ;-)




 






1/61
@rowancheung
NEWS: DeepSeek just dropped ANOTHER open-source AI model, Janus-Pro-7B.

It's multimodal (can generate images) and beats OpenAI's DALL-E 3 and Stable Diffusion across GenEval and DPG-Bench benchmarks.

This comes on top of all the R1 hype. The 🐋 is cookin'



GiUE4GBXIAACpdw.jpg


2/61
@rowancheung
Link: deepseek-ai/Janus-Pro-7B · Hugging Face



3/61
@rowancheung
For those wondering my quick take on what's happening right now with R1 and Janus

1. GPU demand will not go down
2. OpenAI is not done for, but Open source and China are showing they're far closer than anticipated
3. There's way too much misinfo being spread by mainstream media right now (almost seems on purpose?)
4. DeepSeek open-sourcing R1 is still a huge gift to developers and overall AI progress

I haven't seen this much confusion and uncertainty on my TL for ages...



4/61
@rowancheung
That said, I'm shocked we haven't heard any response from @nvidia or @OpenAI yet



5/61
@rowancheung
Also a reminder that DeepSeek R1 dropped 6 days ago, but the market only reacted today

Wall Street, along with 99% of the world, still has trouble keeping up on AI

The easiest way to stay ahead in just 5-min per day (and get news like DeepSeek live): The Rundown AI



6/61
@markbavisotto
I have never seen so many people so eager to hand over their data to a Chinese company. Smart move.



7/61
@DanielLHarrisUS
The data generated in this model uses 50,000 H100 NVIDIA chips. Remove those from the equation and DeepSeek is still well behind OpenAI.



8/61
@dula2006
Wait until you see GROK 3! @grok 💪



9/61
@RealStarTrump
Supposedly, if you ask deepseek to identify itself, it calls itself ChatGPT which would indicate illicit training data.

Something for devout autists to confirm or deny.



10/61
@Gok
so it's barely competitive with SD3-medium from a year ago, let alone the larger SD models?



11/61
@sonicshifts
So now Adobe is also in trouble?



12/61
@lutjenlee
They just keep shipping 🛳️🛳️



13/61
@AdekunleOderind
DeepSeek vs CHATGPT, Which one will you support?



14/61
@SupportMarkets
Well those late night TechBro meetings just got longer.... Disruptions galore... #zelena



15/61
@boneGPT
Music and Video trivial for it now too. Lotta startups need to move to deepseek now.



16/61
@PromptlyAI_YT
Thanks for the update Rowan. I was thinking OpenAI might respond as well. Competition is heating up quickly.



17/61
@InterceptorNews
@StockMKTNewz



18/61
@BKEighty
Janus-Pro-7B 🔥👍



19/61
@shaunralston
Have you seen the image outputs? R1 is truly amazing, but Janus-Pro-7B is SOTA for multimodal generation.



20/61
@BogeyPuffer
interesting fact: So far not a single beep from #NVDA and/or OpenAI



21/61
@imhere444222
aped this $januspro7b

VzJoEoR8MVHgt4B7FvqeKjao1fmSdRZWmmqZoR7pump

#JANUSPRO7B

#januspro7b

#Solana



22/61
@AIRoboticsInt
Seems like @elonmusk has a point

[Quoted tweet]
Elon Musk on DeepSeek:

He says, DeepSeek “obviously” has ~50,000 Nvidia H100 chips that they can’t talk about due to US export controls.

Interesting.


GiTWD3BWYAAo6d7.jpg

GiTWD3BXAAAOppS.jpg


23/61
@ObiUzi
AAAAA



24/61
@Martinoleary


[Quoted tweet]
Live footage of Sam walking to work today.


https://video.twimg.com/ext_tw_video/1883914184699604992/pu/vid/avc1/320x172/_YAdbhQ9OkcBneBD.mp4

25/61
@mhdfaran
DeepSeek coming in hot with Janus-Pro-7B like, "Beat this, OpenAI!"



26/61
@ObiUzi
I don’t feel good doc 😭



GiUGOOaWYAAerIP.jpg


27/61
@BeginnersinAI
This is great for competition. These last two models are going to push the established players to up their game.



28/61
@WealthArchives
Deepseek dropped another model



https://video.twimg.com/ext_tw_video/1883920660377808896/pu/vid/avc1/720x1280/-3NsWWOTMa5QyJWC.mp4

29/61
@czverse
Janus-Pro-7B is turning up the heat! Multimodal dominance + open-source = game changer. DeepSeek’s 🐋 isn’t just cookin’, it’s serving a feast



30/61
@space_ace84
Can it generate this image?



GiULkCPWMAAsYDP.jpg


31/61
@0xAdin




GiUL_LMXIAEsibq.jpg


32/61
@HighlyRetired
🔥🔥💪💪



33/61
@BlueBayNetwork
Just a simple wow



34/61
@laplacesdust
Its over



35/61
@EvanGuthrie
Wow. Looking interesting.



36/61
@SecretNetwork
For those worried about privacy concerns and data harvesting.

Integrate confidential computing and get the benefits without the concerns.

[Quoted tweet]
x.com/i/article/188246991007…


37/61
@Cecociccio
with 50000 H100 NVDA GPU even my toilet beats Dall-E, all fukking fake hype



38/61
@Sam_Saraguy
Janus-Pro-7B's ability to handle both text and image generation in an open-source framework is impressive. This intensifies competition in the AI model space, and puts a lot of pressure on companies like OpenAI to innovate, or at least keep up with the pace at which open-source (and apparently much lower cost) models are improving.



39/61
@InvestXOS
Keep in mind DeepSeek’s parent company, High Flyer, is a quant HEDGE FUND, not a tech company…. Imagine how much they could have gained from shorting the U.S. market beforehand if they did it…

[Quoted tweet]
Also keep in mind the parent company of DeepSeek - named High-Flyer - is a HEDGE-FUND who successfully profited $700M from China market collapses over a decade ago…. And they kept profiting from significant market events…. They’d better have an answer ready for $1T erased from US equity market when called to congressional hearing (if any in the near future)….

$spx $spy $qqq $aapl $amzn $googl $meta $msft $nvda $tsla


40/61
@HTBF1968
How do we know it was released by the Chinese? How do we not know it was designed by another AI to spread itself?

Take that conspiracy theory and run with it. 😉



41/61
@sharathnryn
And my entire feed is going to be filled with images generated from this tomo!



42/61
@otwkayaairdrop
Looks like DeepSeek is serving up some serious AI gourmet! 🍽️ Meanwhile, at #PublicAI, we're just trying to make sure everyone gets a slice of the data pie! 🥧



43/61
@ianatmars
OpenAI and everyone else throwing money at compute have been hacking, it's likely that this is the reckoning for that. All they had to do was use RL to train an internal monologue.



44/61
@RahimAl18843853
They are dropping LLM like candy 🍬



45/61
@rokymiller50
Then there is this

[Quoted tweet]
x.com/i/spaces/1BdxYEqozbzxX


46/61
@Owngoal_clips
Does Uncle Sam Altman has anything to say?



47/61
@_JamesAlmeida
I really want @deepseek_ai to drop a voice model that can speak like @OpenAI's ChatGPT voice model…



48/61
@RijnHartman
@replicate when can we see this on your platform? 👀



49/61
@flowwithaj
Dammit Janice



50/61
@mideenigmA
they need to drop one that does vids



51/61
@alwayswarmhands
wowwww



52/61
@NotAnAlienYet
Hugh Janus?



53/61
@ByteAndBarbell
Video model next?



54/61
@minchoi
DeepSeek shocked the world, and many are in denial and fear.



55/61
@channesson32
What are these assumptions if it’s true that they’re training on a larger cluster than they claim?



56/61
@CagedSings
#NVIDIA RIP



57/61
@Richard96500946
Deepseek is heavily censored and gives unasked opinions. AI is like the legacy media.



58/61
@austinfai95
and cheaper? I'll take it!



59/61
@NorthstarBrain
They just wake up and keep shipping. This was the ORIGINAL strategy of OpenAI before the whole AGI fluff.



60/61
@jellyXBT
@fivesixtwo444 @jellysmithrave bruh



61/61
@RealEricD
The timing is interesting.




 


1/52
@jiayi_pirate
We reproduced DeepSeek R1-Zero in the CountDown game, and it just works

Through RL, the 3B base LM develops self-verification and search abilities all on its own

You can experience the Aha moment yourself for < $30
Code: GitHub - Jiayi-Pan/TinyZero: Clean, accessible reproduction of DeepSeek R1-Zero

Here's what we learned 🧵



GiEwamOaEAAAGck.jpg


2/52
@jiayi_pirate
The recipe:

We follow DeepSeek R1-Zero alg -- Given a base LM, prompts and ground-truth reward, we run RL.

We apply it to CountDown: a game where players combine numbers with basic arithmetic to reach a target number.
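CountDown rewards are fully verifiable: given the numbers and a target, any proposed expression can be checked mechanically. A minimal checker (illustrative; TinyZero's actual reward code lives in the linked repo):

```python
import ast
import operator

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def _eval(node):
    """Safely evaluate an arithmetic AST (numbers and + - * / only)."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in OPS:
        return OPS[type(node.op)](_eval(node.left), _eval(node.right))
    raise ValueError("disallowed expression")

def countdown_reward(expr: str, numbers: list, target: int) -> float:
    """1.0 if expr uses exactly the given numbers and evaluates to the target."""
    tree = ast.parse(expr, mode="eval")
    used = sorted(int(n.value) for n in ast.walk(tree)
                  if isinstance(n, ast.Constant))
    if used != sorted(numbers):
        return 0.0
    try:
        return 1.0 if abs(_eval(tree.body) - target) < 1e-9 else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0

print(countdown_reward("(25 - 4) * 2", [2, 4, 25], 42))  # 1.0
```

Because the reward is deterministic and binary, there is nothing for the policy to exploit, which is what makes this task such a clean RL testbed.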



3/52
@jiayi_pirate
The results: It just works!

Models start from dummy outputs but gradually develop tactics such as revision and search.

In the following sample, the model proposes a solution, self-verifies, and iteratively revises it until it works.

Full experiment log: jiayipan



GiEweawaMAgyR1N.jpg


4/52
@jiayi_pirate
Quick ablations on CountDown:
Base model quality is key:

We ran Qwen-2.5-Base at 0.5B, 1.5B, 3B, and 7B. The 0.5B model guesses a solution and stops. From 1.5B, the models start learning to search, to self-verify, and to revise their solutions, enabling them to achieve much higher scores.



GiEwh8raoAA1Mf5.jpg


5/52
@jiayi_pirate
Either base or instruct model works

- The instruct model learns faster, but converges to about the same performance as the base
- The instruct model's outputs are more structured and readable

So extra instruction tuning isn't necessary, which supports R1-Zero's design decision



GiEwin3bkAAcv5M.jpg


6/52
@jiayi_pirate
The specific RL algorithm doesn't matter much

We tried PPO, GRPO and PRIME. Long CoT emerges under all of them and they all seem to work well. We haven't had time to tune the hyper-parameters, so we don't want to make quantitative conclusions about which algorithm works better.



GiEwjDxaMAQ6COV.jpg


7/52
@jiayi_pirate
Model's reasoning behavior is very task dependent:

- For countdown, the model learns to do search and self-verification
- For number multiplication, the model instead learns to break the problem down using the distributive rule and solve it step by step.
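The multiplication behavior described above is just the distributive law over decimal place values. A toy illustration of the kind of decomposition the model writes out in its chain of thought (my own sketch, not the model's actual output format):

```python
def multiply_by_distribution(a: int, b: int) -> int:
    """Expand b into its decimal place values and distribute a over the sum,
    e.g. 123 * 456 = 123*400 + 123*50 + 123*6, summed step by step."""
    total = 0
    place = 1
    for digit in reversed(str(b)):
        partial = a * int(digit) * place
        print(f"{a} * {int(digit) * place} = {partial}")
        total += partial
        place *= 10
    return total
```

Each printed line corresponds to one intermediate step the model can verify before combining the partial products.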



GiEwjpUaMAIkdWo.jpg


8/52
@jiayi_pirate
Everything's open at GitHub - Jiayi-Pan/TinyZero: Clean, accessible reproduction of DeepSeek R1-Zero

And it costs < $30 to train the model! We hope this project helps demystify emerging RL scaling research and makes it more accessible!



9/52
@jiayi_pirate
One caveat, of course, is that it's validated only on the CountDown task, not the general reasoning domain. We are currently compute-bound, so please reach out if you want to help!



10/52
@jiayi_pirate
A wild ride with @JunjieZhang12 @xingyaow_ @lifan__yuan



11/52
@deter3
The dataset in github is Jiayi-Pan/Countdown-Tasks-3to4 , right ?



12/52
@jiayi_pirate
Yes, right here Jiayi-Pan/Countdown-Tasks-3to4 · Datasets at Hugging Face



13/52
@duluhagv
is the countdown dataset gen also open source? i lol'ed when I saw this release today after working on something similar last night



GiFDpZ2W8AA9q_R.jpg


14/52
@jiayi_pirate
Hi, the countdown generation code is mostly borrowed from Stream-of-Search
stream-of-search/src/countdown_generate.py at main · kanishkg/stream-of-search

The preprocessed data is here:
Jiayi-Pan/Countdown-Tasks-3to4 · Datasets at Hugging Face

Everything's open and reproducible



15/52
@Samhanknr
How feasible do you think it is to teach a model to work on a given codebase? For example, teach it to write unit tests and pass them in a given codebase using RL. Would it be affordable?



16/52
@jiayi_pirate
That’s definitely possible. We are working on this

[Quoted tweet]
Introducing SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers

Using SWE-Gym, our agents + verifiers reach new open SOTA - 32%/26% on SWE-Bench Verified/Lite,
showing strong scaling with more train / test compute

github.com/SWE-Gym/SWE-Gym [🧵]


Gff2Xt9aMAAOo7B.png


17/52
@frankxu2004
Very nice! One question: do you have any observation regarding CoT length changes during training? Is there a plot showcasing CoT length increased during training?



18/52
@jiayi_pirate
Great question! Early results show the 3B model initially reduces output length for correct formatting, then increases chain-of-thought length for better performance.

There may be minor code mismatches, so take this with a grain of salt.

Raw log:
jiayipan



GiFVjYpbQAAU9fA.jpg


19/52
@bennetkrause
One question: What model size and algorithm do the $30 refer to?



20/52
@jiayi_pirate
3B model, PPO, it takes 10 H100 hours
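As a rough sanity check, 10 H100 hours at about $3 per GPU-hour (the hourly rate is my assumption about typical cloud pricing, not a figure from the thread) lands right at the quoted budget:

```python
gpu_hours = 10            # reported above: 3B model, PPO
usd_per_gpu_hour = 3.0    # assumed on-demand cloud H100 rate
cost = gpu_hours * usd_per_gpu_hour
print(f"~${cost:.0f}")    # ~$30
```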



21/52
@wzihanw
Nice!



22/52
@reissbaker
this is really cool. any chance for a huggingface model upload so we can play around with it?



23/52
@_herobotics_
Any insight on why open llama achieves high response length but low scores?



24/52
@jiayi_pirate
OpenLlama doesn't like to generate EOS tokens, and since we haven't implemented stopping after </answer>, the model often fails to terminate.
We didn't report OpenLlama's results in the Twitter thread as we believe the results will improve significantly once we fix this problem.
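The missing stop condition described above amounts to truncating the generation at the closing tag. A minimal post-hoc sketch (assuming the `<answer>...</answer>` format from the thread; a real fix would pass the stop string to the generation loop so compute isn't wasted past the tag):

```python
def truncate_at_answer(text: str, stop_tag: str = "</answer>") -> str:
    """Cut a generation at the first closing answer tag, so a model that
    rarely emits EOS still yields a terminated, scorable output."""
    idx = text.find(stop_tag)
    return text if idx == -1 else text[: idx + len(stop_tag)]
```

Applied before reward computation, this keeps a non-terminating model from being penalized for trailing rambles.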



25/52
@GiorgioMantova
Do you find the original papers' numbers about the training cost to be plausible?



26/52
@jiayi_pirate
Yes, with MoE and FP8, that’s expected



27/52
@paul_cal
This is such a beautiful, efficient demonstration of something very powerful. Great idea



28/52
@anushkmittal
$30? might as well be free. nice work



29/52
@iamRezaSayar
this is very interesting.
I'm wondering about 2 things now. 1. if we could extend these beyond verifiable math problems, to say, empathy. and 2. if people would need / want to train their own personal reward model that is tuned to each user's preferences 👀



30/52
@HolgersenTobias
That was fast 🔥 Excellent work. This paradigm seems robust and scalable.

Bottleneck now will be gathering huge, diverse sets of hard but verifiable tasks.



31/52
@CharlieYouAI
Incredible work! and super cool result



32/52
@RishfulThinking
Excuse me, fukking WHAT



33/52
@garybasin
Hell yeah



34/52
@rrektcapital
Could you kindly ELI5 what RL means in this case?



35/52
@burny_tech
Cambrian explosion of RL on LLMs begins



36/52
@bookwormengr
Could you please provide flop analysis?



37/52
@nooriefyi
love seeing this kind of ingenuity in action. what were the biggest hurdles you faced?



38/52
@SurfTheUniverse
Was this supposed to work?



GiJNcxZWEAAFTiw.jpg


39/52
@AntDX316
There's going to be no way for people to intentionally 'inflate' the required amount of tokens by adding certain things that makes it look like it costs more to run a generation, when it doesn't.

The ASI-Singularity(Godsend) is the only Global Solution, people.



40/52
@ReplayRyan
Huge



41/52
@nrehiew_
Thanks for this! I wonder if you guys tried llama 3 7B instead of llama 2 7B. Also would be interesting if you guys have like a completely overtrain/overfitted run similar to what the TULU3 team experimented with.



42/52
@suwakopro
So cool, intelligence is reproducible.



43/52
@xiaoze_jin
Need to look into it; thanks for sharing



44/52
@corefpark
Thanks for open sourcing this!!!! This is an enormous contribution to understanding the science of RL x LLMs!!



45/52
@JunYang1688
Great post. That DeepSeek core contributions can be reproduced at lightning speed demonstrates the power of open source! It will definitely accelerate the progress towards AGI.



46/52
@fjrdomingues
How do I super like a post?



47/52
@Vrda82073569
The Bitter Lesson strikes again!



48/52
@Nick_from_Texas
Any idea how autoregressive models become capable of self verification?

How do they avoid getting stuck along one chain of thought due to an earlier suboptimal token?



49/52
@neurosp1ke
Inspired by your experiment I added a procedural generator for countdown games to GitHub - open-thought/reasoning-gym: procedural reasoning datasets



50/52
@adamnemecek1
All machine learning approaches are convolutional inverses, including RL and LLMs.



51/52
@soheilsadathoss
Awesome!



52/52
@Kathleen_Tyson_
I played Scrabble against the All Time Countdown Champion last year. He destroyed me. Nice guy, though.



