I am hopeful about this new ChatGPT-era angle for LLMs, but I can tell you that I successfully ran TinyLlama 1.1B, which is only a 638 MB download, on a Raspberry Pi 5 at quite a fast speed.
#SmallLLM #SLM #TinyLlama #1BitLLM #1BitAI #TinyAI #TinyLLM
GitHub - jzhang38/TinyLlama: The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.
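For reference, a common way to run TinyLlama on a Pi is through llama-cpp-python with a quantized GGUF file. A minimal sketch, where the model path is my assumption (the ~638 MB download above would be a quantized file like this):

```python
from llama_cpp import Llama

# Hypothetical local path to a quantized TinyLlama GGUF (roughly the 638 MB download)
llm = Llama(model_path="./tinyllama-1.1b-chat.Q4_K_M.gguf", n_ctx=2048)

out = llm("Q: What is a 1-bit LLM? A:", max_tokens=64)
print(out["choices"][0]["text"])
```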
[Quoted tweet]
Microsoft's 1-bit era paper (released in February) is really a masterpiece.
BitNet b1.58 70B was 4.1x faster and delivered 8.9x higher throughput than the corresponding FP16 LLaMA.
Requires almost no multiplication operations for matrix multiplication and can be highly optimized.
BitNet b1.58 is a 1-bit LLM, where every single parameter (or weight) of the LLM is ternary {-1, 0, 1}.
They introduce a significant 1-bit LLM variant called BitNet b1.58, where every parameter is ternary, taking on the values {-1, 0, 1}. Adding the extra value 0 to the original 1-bit BitNet yields 1.58 bits in the binary system.
The term "1.58 bits" refers to the information content of each parameter: since each weight can take one of three possible values (-1, 0, 1), its information content is log base 2 of 3, or 1.5849625... bits. (Actually packing and decoding data at that exact density takes a lot of extra math.)
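A quick sanity check of that number (my one-liner, not from the paper):

```python
import math

# Information content of a ternary weight: log2(3) bits
bits_per_weight = math.log2(3)
print(bits_per_weight)  # 1.5849625007211562
```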
----
Here all weight values are ternary, taking on the values {-1, 0, 1}. Its quantization function is absmean: the weights are first scaled by their average absolute value and then rounded to the nearest integer in {-1, 0, 1}. It is an efficient extension of the 1-bit BitNet that includes 0 in the model parameters. BitNet b1.58 is built on the BitNet architecture (it replaces nn.Linear with BitLinear). It is highly optimized: it removes floating-point multiplication overhead, involves only integer addition (INT8), and loads parameters from DRAM efficiently.
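A minimal sketch of that absmean rule in PyTorch, following the description above (the function name is mine, not the authors' code):

```python
import torch

def absmean_quantize(w: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """Quantize weights to ternary {-1, 0, 1} via the absmean rule."""
    gamma = w.abs().mean()                 # average absolute value of the weights
    w_scaled = w / (gamma + eps)           # scale so typical weights land near [-1, 1]
    return w_scaled.round().clamp_(-1, 1)  # round to nearest integer, clip to {-1, 0, 1}

w = torch.randn(4, 4)
print(absmean_quantize(w))  # every entry is -1.0, 0.0, or 1.0
```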
----
BitNet b1.58 retains all the benefits of the original 1-bit BitNet, including its new computation paradigm, which requires almost no multiplication operations for matrix multiplication and can be highly optimized.
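To see why the matrix multiplication needs almost no multiplications: with ternary weights, each dot-product term is just an add, a subtract, or a skip. A toy illustration (my sketch, not the paper's kernel):

```python
def ternary_dot(weights, activations):
    """Dot product with weights in {-1, 0, 1}: no multiplications needed."""
    acc = 0
    for w, a in zip(weights, activations):
        if w == 1:
            acc += a      # +1 weight: add the activation
        elif w == -1:
            acc -= a      # -1 weight: subtract the activation
        # 0 weight: contributes nothing, skip it
    return acc

print(ternary_dot([1, -1, 0, 1], [0.5, 2.0, 3.0, -1.0]))  # -2.5
```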
It matches the full-precision (i.e., FP16 or BF16) Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption.
More profoundly, the 1.58-bit LLM defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective.
This paper enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.