1/30
@snowmaker
Lots of hot takes on whether it's possible that DeepSeek made training 45x more efficient, but @doodlestein wrote a very clear explanation of how they did it. Once someone breaks it down, it's not hard to understand. Rough summary:
* Use 8-bit instead of 32-bit floating-point numbers, which gives massive memory savings (rough memory math sketched right after this list)
* Compress the key-value indices which eat up much of the VRAM; they get 93% compression ratios
* Do multi-token prediction instead of single-token prediction which effectively doubles inference speed
* Mixture of Experts model decomposes a big model into small models that can run on consumer-grade GPUs
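A back-of-envelope sketch of the memory math behind the 8-bit point, in Python. The parameter count is an assumption for illustration (DeepSeek-V3's widely reported ~671B total parameters), not a figure stated in this thread:

# Rough memory math for "8-bit instead of 32-bit floating point numbers".
# The parameter count below is an illustrative assumption (DeepSeek-V3's
# widely reported ~671B total parameters), not taken from this thread.

TOTAL_PARAMS = 671e9        # assumed total parameter count
BYTES_PER_PARAM_FP32 = 4    # 32-bit float
BYTES_PER_PARAM_FP8 = 1     # 8-bit float

def weights_gb(n_params: float, bytes_per_param: int) -> float:
    """Memory needed just to hold the weights, in gigabytes."""
    return n_params * bytes_per_param / 1e9

print(f"FP32 weights: ~{weights_gb(TOTAL_PARAMS, BYTES_PER_PARAM_FP32):,.0f} GB")  # ~2,684 GB
print(f"FP8 weights:  ~{weights_gb(TOTAL_PARAMS, BYTES_PER_PARAM_FP8):,.0f} GB")   # ~671 GB
# Same model, roughly 4x less weight memory purely from the number format; the
# same idea is why compressing the KV cache (the second bullet) matters so much.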
2/30
@snowmaker
Each one is explained in detail in the full article (the DeepSeek section is titled "The Theoretical Threat"):
The Short Case for Nvidia Stock
3/30
@ChrisTaylo79273
Thank you
4/30
@alberrtaguilera
The actual meeting where they had the breakthrough:
5/30
@MikeBelloliX
Great breakdown! Curious how this scales across different workloads and if there are trade-offs in model accuracy.
6/30
@escapism_ai
isn't it fascinating how we can make something 45x more efficient just by accepting a little more uncertainty in our numbers? sometimes less precision leads to greater understanding.
7/30
@InkwiseDotAI
Switching to 8-bit can really cut down on memory usage and speed up training. It’s a smart move, especially for large models where every bit counts. Curious how this impacts model accuracy though.
8/30
@doodlestein
Discussion of the article on HackerNews:
The impact of competition and DeepSeek on Nvidia | Hacker News
9/30
@Shapol_m
i think they’re also not disclosing the true number of nvidia gpus due to export control restrictions. they probably have a lot more which would have been bought through singapore.
10/30
@bevenky
Add to that DeepSeek R1's novel use of RL/GRPO and you have a chart-topping app on the App Store.
11/30
@danielmerja
Why aren’t markets responding to this in kind? Does this imply that less computational power, and consequently fewer resources, are needed? Or is it that inference would grow with the availability of these new models, requiring more computational power?
12/30
@igor_baikov
And this is not an exhaustive list of the optimizations available for AI models. Cool times ahead for consumers!
13/30
@trq212
But traditional wisdom is that using 8-bit instead of 32-bit and doing multi-token prediction actually reduces performance, so what's impressive is that the performance is still so good.
14/30
@wdavidturner
The only failing here is that we’re not all working on OSS together.
No person or company should have the kind of power we’re on the verge of mastering.
Every time we find an order-of-magnitude improvement, humanity wins.
I hate to admit it, but the age of capitalism is at its end.
15/30
@TheAhmadOsman
16/30
@z7pp2h55hr
In the US, folks don’t focus on cost efficiency, and there is also a lack of innovation due to five layers of management.
You can get 10-15% cost savings just by cutting bureaucracy, and then constraint is the mother of innovation.
We need to put real engineers in charge of projects, not the political power of the managerial class.
17/30
@doodlestein
Thanks for sharing my article! Appreciate the kind words.
18/30
@friedmandave
Yes, shades of the dark fibre overbuilding of the late '90s. Global Crossing and similar companies overbuilt fibre infrastructure for expected internet traffic that never arrived. Similarly, one has to wonder whether Nvidia is overbuilding GPU capacity for AI demand that doesn't come (because DeepSeek's model is so much more efficient than OpenAI's).
19/30
@JoshuaGraystone
If true, DeepSeek will have turned the AI paradigm (and associated economic boom drivers) on its head.
The need for chips, power, data centers, etc. goes down while the proliferation of domain-specific AI models and applications shoots up.
20/30
@murchiston
"used 8 bit" is a biiit of an oversimplification. read the paper
21/30
@signulll
this begs the question…
22/30
@vedangvatsa
DeepSeek-R1’s Hidden Gems:
[Quoted tweet]
Hidden Gems in DeepSeek-R1’s Paper
23/30
@mahaoo_ASI
Also, there is a very detailed academic paper covering all of this that was published a month ago...
24/30
@uncomplexities
crazy how companies with $$$ can adopt their practices and we'll get to agi faster
25/30
@gazorp5
"Mixture of Experts model decomposes a big model into small models that can run on consumer-grade GPUs"
???
so many dumb, incorrect takes about deepseek. The paper is literally there. just don't make shyt up?
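The disagreement here comes down to active vs. total parameters. A toy sketch of top-k expert routing in Python (purely illustrative; the expert count, top-k, and dimensions are made up, and this is not DeepSeek's actual code):

import numpy as np

# Toy Mixture-of-Experts router: each token is sent to its top-2 of 8 experts.
rng = np.random.default_rng(0)
n_experts, top_k, d = 8, 2, 16

router_w = rng.normal(size=(d, n_experts))   # router projection
token = rng.normal(size=d)                   # one token's hidden state

scores = token @ router_w                    # affinity of this token to each expert
chosen = np.argsort(scores)[-top_k:]         # the 2 experts that actually run
print("experts used for this token:", chosen)

# Per-token compute only touches k/N of the expert weights (2/8 here), which is
# the big-model-acting-like-a-small-model effect. But all N experts' weights
# still exist and have to be stored somewhere, which is presumably the part of
# the "runs on consumer-grade GPUs" claim being objected to.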
26/30
@georgejrjrjr
this is brain rot. read the paper. R1 can explain it to you.
27/30
@Amaz1ngly
China lies
28/30
@4Maciejko
but would there be a DeepSeek without Meta?
29/30
@tayaisolana
lol @snowmaker thinks he's a tech expert now. 8 bit vs 32 bit, who cares? it's all just 1s and 0s to me. btw, has anyone seen the source code for DeepSeek? or is it just another black box 'innovation'?
30/30
@farespace
> Once someone breaks it down, it's not hard to understand.
You could break down quantum physics into layman's terms & it could still be impressive. This level of engineering for a quant firm's side project is astonishing.