bnew

Veteran
Joined
Nov 1, 2015
Messages
58,175
Reputation
8,612
Daps
161,827


Fine-tuning OpenLLaMA-7B with QLoRA for instruction following

 

bnew


How the RWKV language model works​

Mar 23, 2023

In this post, I will explain the details of how RWKV generates text. For a high level overview of what RWKV is and what is so special about it, check out the other post about RWKV.
To explain exactly how RWKV works, I think it is easiest to look at a simple implementation of it. The following ~100 lines of code (based on RWKV in 150 lines) are a minimal implementation of a relatively small (430M-parameter) RWKV model which generates text.

Minimal RWKV code

To avoid hiding complexity, the model computation itself is written entirely in python, with numpy for matrix / vector operations. However, I needed to use torch.load to load the model weights from a file, and tokenizers.Tokenizer to make the text into tokens the model can work with.

Text generation with RWKV​

The code uses RWKV to continue the following text:
“In a shocking finding, scientist discovered a herd of dragons living in a remote, previously unexplored valley, in Tibet. Even more surprising to the researchers was the fact that the dragons spoke perfect Chinese.”
We first need to convert this text into a series of tokens (numbers from 0 to 50276 representing words/symbols/tokens in our vocabulary). That is not the focus of this blog post, so I just do it with an external library: tokenizer.encode(context).ids.
Next, we need to process this sequence of tokens into an RWKV state. Essentially, RWKV represents a function which takes a token and a state, and outputs a probability distribution over the next token, and a new state. Of course, the function also depends on the RWKV model parameters, but since we use a trained model (downloaded from here), we view those parameters as fixed. To convert the text to a state, we just initialize the state to all zeros, and feed the tokens through the RWKV function one by one.
state = np.zeros((N_LAYER, 4, N_EMBD), dtype=np.float32)
for token in tokenizer.encode(context).ids:
    probs, state = RWKV(weights, token, state)

Now the variable state contains a state representation of our input text, and the variable probs contains the probability distribution the model predicts for the next token.
We can now simply sample the probability distribution (in practice, we avoid low probability tokens in sample_probs()) and add another token to the text. Then we feed the new token into RWKV and repeat.
for i in range(100):
    token = sample_probs(probs)
    print(tokenizer.decode([token]), end="", flush=True)
    probs, state = RWKV(weights, token, state)
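
The sample_probs() function is not shown in this post. One common way to avoid low-probability tokens is top-p (nucleus) sampling, so here is a minimal sketch of that idea. The temperature and top_p parameters, their default values and the rng argument are assumptions for illustration, not necessarily what the original code uses.

```python
import numpy as np

def sample_probs(probs, temperature=1.0, top_p=0.85, rng=None):
    """Sample a token index while ignoring the low-probability tail (top-p sampling)."""
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(probs, dtype=np.float64)
    sorted_probs = np.sort(probs)[::-1]          # largest probability first
    cumulative = np.cumsum(sorted_probs)
    # Index of the first token at which the cumulative mass reaches top_p
    cutoff_index = min(int(np.searchsorted(cumulative, top_p)), len(sorted_probs) - 1)
    cutoff = sorted_probs[cutoff_index]
    kept = np.where(probs >= cutoff, probs, 0.0)  # zero out everything below the cutoff
    kept = kept ** (1.0 / temperature)            # temperature < 1 sharpens, > 1 flattens
    return int(rng.choice(len(kept), p=kept / kept.sum()))

# Toy distribution over 4 tokens: the last token falls outside the top 85% of mass
toy_probs = np.array([0.5, 0.3, 0.15, 0.05])
print(sample_probs(toy_probs, rng=np.random.default_rng(0)))
```

With this toy distribution, token 3 is never returned, because the first three tokens already cover more than 85% of the probability mass.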

A typical, generated continuation is:
“They’re just like us. They use Tibetan for communication, and for a different reason – they use a language that they’re afraid to use. To protect their secret, they prefer to speak a different language to the local public.”
Of course, larger models will perform better than this relatively small 430m RWKV.

What goes on inside RWKV()​

The first thing RWKV does is look up the embedding vector of the input token, i.e. x = params('emb')[0][token]. Here params('emb')[0] is simply a 50277×1024 matrix, and we extract a row.
The next line x = layer_norm(x, *params('blocks.0.ln0')) requires me to explain what a Layer Normalization is. The easiest way is to just show the definition:
layer_norm = lambda x, w, b : (x - np.mean(x)) / np.std(x) * w + b
The intuition is that it normalizes a vector x to zero mean and unit variance, and then scales and offsets that. Note that the scale w and offset b are 1024-dimensional vectors, which are learned model parameters.
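
As a quick sanity check of that intuition, here is a small sketch on a toy 4-dimensional vector (the real model uses 1024 dimensions). The scale of ones and offset of zeros are chosen only so that the normalization itself is visible; in the model they are learned.

```python
import numpy as np

layer_norm = lambda x, w, b: (x - np.mean(x)) / np.std(x) * w + b

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.ones(4)    # scale of 1 ...
b = np.zeros(4)   # ... and offset of 0 leave the normalized vector unchanged
y = layer_norm(x, w, b)

# Mean is (numerically) zero and standard deviation is (numerically) one
print(y.mean(), y.std())
```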
Now we get to the main part of the model, which is split into 24 layers, applied sequentially.
for i in range(N_LAYER):
    x_ = layer_norm(x, *params(f'blocks.{i}.ln1'))
    dx, state[i][:3] = time_mixing(x_, *state[i][:3], *params(f'blocks.{i}.att'))
    x = x + dx

    x_ = layer_norm(x, *params(f'blocks.{i}.ln2'))
    dx, state[i][3] = channel_mixing(x_, state[i][3], *params(f'blocks.{i}.ffn'))
    x = x + dx

Note that we only ever add updates to x, as in x = x + dx. This is called using “residual connections”. Each time we make a copy of x, we feed it through a layer normalization before mixing it. Each layer has two mixing functions: a “time mixing” part and a “channel mixing” part. In a typical transformer, the “time mixing” would be done by multi-head attention, and the “channel mixing” would be done by a simple feed-forward network. RWKV does something a bit different, which we’ll explain in the next sections.

Channel mixing​

I’ll start with channel mixing, since it’s the simpler one of the two mixing functions.
def channel_mixing(x, last_x, mix_k, mix_r, Wk, Wr, Wv):
    k = Wk @ ( x * mix_k + last_x * (1 - mix_k) )
    r = Wr @ ( x * mix_r + last_x * (1 - mix_r) )
    vk = Wv @ np.maximum(k, 0)**2
    return sigmoid(r) * vk, x

The channel mixing layer takes an input x corresponding to this token, and the x corresponding to the previous token, which we call last_x. last_x was stored in this RWKV layer’s state. The rest of the inputs are learned RWKV parameters.
First, we linearly interpolate x and last_x, using learned weights. We run this interpolated x as input to a 2 layer feed forward network with squared relu activation, and finally multiply with the sigmoid activations of another feed forward network (in classical RNN terms, this would be called gating).
Note that in terms of memory usage, the matrices Wk,Wr,Wv hold almost all the parameters (the smallest of them is a 1024×1024 matrix, while the other variables are just 1024-dimensional vectors). And the matrix multiplications (@ in python) contribute the vast majority of required computations.
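
To make the shapes and the parameter split concrete, here is a sketch that runs channel_mixing on random toy weights. The sizes (8 channels instead of 1024) and the 4x wider hidden layer are assumptions for illustration; the random weights are of course not a trained model.

```python
import numpy as np

sigmoid = lambda x: 1 / (1 + np.exp(-x))

def channel_mixing(x, last_x, mix_k, mix_r, Wk, Wr, Wv):
    k = Wk @ (x * mix_k + last_x * (1 - mix_k))
    r = Wr @ (x * mix_r + last_x * (1 - mix_r))
    vk = Wv @ np.maximum(k, 0) ** 2          # squared ReLU, then project back down
    return sigmoid(r) * vk, x                # gated update, plus x to store as the next last_x

rng = np.random.default_rng(0)
n_embd, n_hidden = 8, 32                     # toy sizes; the 4x hidden width is an assumption
x, last_x = rng.normal(size=n_embd), rng.normal(size=n_embd)
mix_k, mix_r = rng.uniform(size=n_embd), rng.uniform(size=n_embd)
Wk = rng.normal(size=(n_hidden, n_embd))
Wr = rng.normal(size=(n_embd, n_embd))
Wv = rng.normal(size=(n_embd, n_hidden))

dx, new_last_x = channel_mixing(x, last_x, mix_k, mix_r, Wk, Wr, Wv)
matrix_params = Wk.size + Wr.size + Wv.size  # parameters held in the three matrices
vector_params = mix_k.size + mix_r.size      # parameters held in the mixing vectors
print(dx.shape, matrix_params, vector_params)
```

Even at this toy size the matrices hold 576 parameters against 16 in the mixing vectors, and the gap grows with the number of channels.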

Time mixing​

def time_mixing(x, last_x, last_num, last_den, decay, bonus, mix_k, mix_v, mix_r, Wk, Wv, Wr, Wout):
    k = Wk @ ( x * mix_k + last_x * (1 - mix_k) )
    v = Wv @ ( x * mix_v + last_x * (1 - mix_v) )
    r = Wr @ ( x * mix_r + last_x * (1 - mix_r) )

    wkv = (last_num + exp(bonus + k) * v) / \
          (last_den + exp(bonus + k))
    rwkv = sigmoid(r) * wkv

    num = exp(-exp(decay)) * last_num + exp(k) * v
    den = exp(-exp(decay)) * last_den + exp(k)

    return Wout @ rwkv, (x,num,den)

The time mixing starts similarly to the channel mixing, by interpolating this token’s x with the last token’s x. We then apply learned 1024×1024 matrices to get “key”, “value” and “receptance” vectors.
The next part is where the magic happens.

The “RWKV attention”​

Before getting to the core of the mechanism, we will make the observation that while the variables going into the attention mechanism are all 1024-dimensional (we say they have 1024 channels), all channels are computed independently of each other. We will therefore just look at what happens to a single channel, treating the variables as scalars.
Now, let us look at the variable num. To make the math notation cleaner, let’s rename num and den to $\alpha$ and $\beta$. Both $\alpha$ and $\beta$ are stored in the RWKV state. For each new token, $\alpha$ is calculated as $\alpha_i = e^{-w} \alpha_{i-1} + e^{k_i} v_i$, where $i$ is the index of the current token. Here we define $w = \exp(\text{decay})$; note that $w$ is always positive.
By induction, we have $\alpha_i = \sum_{j=1}^{i} e^{-(i-j)w + k_j} v_j$. Similarly, $\beta_i = \sum_{j=1}^{i} e^{-(i-j)w + k_j}$. Note that $\alpha_i$ looks like a weighted sum of the $v_j$, while $\beta_i$ is just the sum of weights. So $\frac{\alpha_i}{\beta_i}$ becomes a weighted average of $v_j$.
Plugging in the formulas for $\alpha_{i-1}$ and $\beta_{i-1}$ into the definition of wkv, and denoting bonus by $u$, we get
$$\text{wkv}_i = \frac{\sum_{j=1}^{i-1} e^{-(i-1-j)w + k_j} v_j + e^{u + k_i} v_i}{\sum_{j=1}^{i-1} e^{-(i-1-j)w + k_j} + e^{u + k_i}}.$$
So wkv is a weighted average of $v$ with weights according to $k$, but also the current $v_i$ is given a bonus ($u$) additional weight, and previous $v_j$ are given geometrically smaller weights the further away they are.
For reference, standard transformer attention takes “query”, “key” and “value” vectors $q, k, v$ and outputs
$$\frac{\sum_{j=1}^{i} e^{q_i^\top k_j} v_j}{\sum_{j=1}^{i} e^{q_i^\top k_j}}.$$
After calculating wkv, the time mixing multiplies by the “receptance” sigmoid(r). It does a final linear transformation before returning the result.
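
The closed-form expression for wkv can be checked numerically against the recurrent update used in time_mixing. The sketch below does this for a single channel with random keys and values; the function name, the sequence length and the decay and bonus values are arbitrary choices for illustration.

```python
import numpy as np

def wkv_max_abs_diff(T=6, decay=0.3, bonus=0.5, seed=0):
    """Largest gap between the recurrent wkv and its closed-form sum, over T tokens."""
    rng = np.random.default_rng(seed)
    k, v = rng.normal(size=T), rng.normal(size=T)
    w, u = np.exp(decay), bonus
    num = den = 0.0                      # alpha and beta, starting from the zero state
    worst = 0.0
    for i in range(T):
        current = np.exp(u + k[i])       # weight of the current token, including the bonus
        wkv_recurrent = (num + current * v[i]) / (den + current)

        # Closed form: each past token j < i gets weight exp(-(i-1-j)*w + k_j)
        j = np.arange(i)
        past = np.exp(-(i - 1 - j) * w + k[:i])
        wkv_closed = (np.sum(past * v[:i]) + current * v[i]) / (np.sum(past) + current)

        worst = max(worst, abs(wkv_recurrent - wkv_closed))
        # Same state update as in time_mixing
        num = np.exp(-w) * num + np.exp(k[i]) * v[i]
        den = np.exp(-w) * den + np.exp(k[i])
    return worst

print(wkv_max_abs_diff())   # should be at the level of floating point rounding error
```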

Converting to output probabilities​

After going through the 24 layers of time mixing and channel mixing, we need to convert the final output to predicted probabilities for the next token.
x = layer_norm(x, *params('ln_out'))
x = params('head')[0] @ x

e_x = exp(x-np.max(x))
probs = e_x / e_x.sum() # Softmax of x

First, we do a layer normalization. Then, we multiply by a 50277×1024 matrix params('head')[0] given by the RWKV parameters, giving us a 50277-dimensional vector. To get a probability distribution over tokens (i.e. a 50277-dimensional, non-negative vector which sums to 1), we run our x through a “softmax” function. The softmax of x is just exp(x)/sum(exp(x)). However, calculating exp(x) can cause numerical overflows, so we calculate the equivalent function exp(x-max(x))/sum(exp(x-max(x))).
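
The overflow problem is easy to reproduce. This sketch compares the naive formula with the shifted one on deliberately large toy inputs (the values 1000 to 1002 are chosen only to force an overflow in float64).

```python
import numpy as np

def softmax(x):
    e_x = np.exp(x - np.max(x))          # largest exponent is 0, so exp never overflows
    return e_x / e_x.sum()

x = np.array([1000.0, 1001.0, 1002.0])

# Naive version: exp(1000) overflows to inf, and inf / inf gives nan
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(x) / np.sum(np.exp(x))

stable = softmax(x)
print(naive)    # all nan
print(stable)   # a valid probability distribution
```

Subtracting the maximum does not change the result mathematically, since the common factor exp(-max(x)) cancels between numerator and denominator.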
That’s it! Now you know exactly how RWKV works for generating text.

Practical considerations​

In practice, there are some issues which I ignored in my simplified code. Most importantly, we care a lot about the performance / run-time of the code. This leads us to run RWKV in parallel on GPUs, use specialized GPU code written in CUDA, use 16-bit floating point numbers, and more.

Numerical issues​

The largest number a 16-bit floating point number (float16) can represent is 65,504; anything above that overflows, which is bad. Most of the code has no problems with this, partially because the Layer Normalizations keep values in a reasonable range. However, the RWKV attention contains exponentially large numbers (exp(bonus + k)). In practice, the RWKV attention is implemented in a way where we factor out an exponential factor from num and den to keep everything within float16 range. See for example the time_mixing function in RWKV in 150 lines.
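
Here is a minimal single-channel sketch of that factoring idea, not the actual implementation. It stores num and den as a * exp(p) and b * exp(p), so only exponents that are zero or negative are ever evaluated. The names wkv_naive, wkv_stable, a, b and p are hypothetical, and the sketch uses Python floats rather than float16.

```python
import math

def wkv_naive(ks, vs, decay, bonus):
    """Direct translation of the num/den recurrence; overflows for large k."""
    num, den, out = 0.0, 0.0, []
    w = math.exp(decay)
    for k, v in zip(ks, vs):
        out.append((num + math.exp(bonus + k) * v) / (den + math.exp(bonus + k)))
        num = math.exp(-w) * num + math.exp(k) * v
        den = math.exp(-w) * den + math.exp(k)
    return out

def wkv_stable(ks, vs, decay, bonus):
    """Same quantity, with num = a * exp(p) and den = b * exp(p)."""
    a, b, p, out = 0.0, 0.0, -1e30, []   # p = -1e30 stands in for an empty (zero) state
    w = math.exp(decay)
    for k, v in zip(ks, vs):
        # Output: factor out the larger of the two exponents before dividing
        q = max(p, bonus + k)
        e1, e2 = math.exp(p - q), math.exp(bonus + k - q)
        out.append((e1 * a + e2 * v) / (e1 * b + e2))
        # State update: decay the old state by exp(-w), add the new token, re-factor
        q = max(p - w, k)
        e1, e2 = math.exp(p - w - q), math.exp(k - q)
        a, b, p = e1 * a + e2 * v, e1 * b + e2, q
    return out

ks, vs = [0.3, -1.2, 2.0, 0.5], [1.0, -2.0, 0.5, 3.0]
print(wkv_naive(ks, vs, 0.1, 0.4))
print(wkv_stable(ks, vs, 0.1, 0.4))                    # matches the naive version
print(wkv_stable([1000.0, 998.0], [1.0, 3.0], 0.0, 0.0))  # naive would overflow here
```

For keys around 1000, wkv_naive raises an OverflowError from math.exp, while wkv_stable still returns a finite weighted average.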

Training​

We simply loaded a pretrained model in our example. To train the model, one would calculate the cross entropy loss of the predicted probabilities on a long text (our example model was trained on the pile). Next, calculate the gradient of that loss with respect to all the RWKV parameters. That gradient is used to improve the parameters using a variant of Gradient Descent called Adam. Repeat for a long time, and you get a trained RWKV model.
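
As a rough illustration of the loss being minimized, here is a sketch of the cross entropy for a toy vocabulary of 3 tokens. The function name and the array layout (one row of predicted probabilities per position) are assumptions for illustration; the gradient and Adam steps are left out.

```python
import numpy as np

def cross_entropy(probs, targets):
    """Mean negative log-probability assigned to the tokens that actually came next.

    probs:   (T, vocab) array, row t is the predicted distribution at position t
    targets: (T,) array with the id of the true next token at each position
    """
    probs = np.asarray(probs)
    picked = probs[np.arange(len(targets)), targets]   # probability of each true token
    return float(-np.mean(np.log(picked)))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss_right = cross_entropy(probs, np.array([0, 1]))    # true tokens got high probability
loss_wrong = cross_entropy(probs, np.array([2, 0]))    # true tokens got low probability
print(loss_right, loss_wrong)
```

The loss is low when the model puts high probability on the tokens that actually follow, which is what training pushes the parameters towards.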

GPT-mode​

My simplified code processes the tokens one by one, which is much slower than processing them in parallel, especially when running on GPUs. For inference, there is no way around this, as we need to sample a token before we can use it to calculate the next one. However, for training, all the text is already available. This lets us parallelize across tokens. Most of the code is fairly straightforward to parallelize like this, as there is little dependence through time. For example, all the expensive matrix multiplications work on each token independently, leading to good performance.
However, the RWKV attention is inherently sequential. Fortunately, it has very little computation (on the order of 1024 times less than the matrix multiplications), so it should be fast. Sadly, PyTorch does not have a good way of handling this sequential task, so the attention part becomes slow (even compared to the matrix multiplications). Therefore, I wrote optimized CUDA kernels for computing the RWKV attention, which has been my main contribution to the RWKV project.
JAX has jax.lax.scan and jax.lax.associative_scan, which allow a pure JAX implementation to perform better than pure PyTorch. However, I still estimate that JAX would lead to about 40% slower training compared to CUDA (that estimate may be outdated, as it was made for training a relatively small 1.5B model).
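
To illustrate why the matrix multiplications parallelize across tokens, here is a sketch in plain NumPy with toy sizes and random weights. It computes the key vectors of the time mixing once token by token and once for all tokens at the same time, using the fact that last_x for every position is simply the input shifted by one token. The variable names are chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_embd = 5, 8                         # toy sizes: 5 tokens, 8 channels
X = rng.normal(size=(T, n_embd))         # one row per token
mix_k = rng.uniform(size=n_embd)
Wk = rng.normal(size=(n_embd, n_embd))

# Sequential (RNN-mode): one token at a time, carrying last_x in the state
last_x = np.zeros(n_embd)                # the state starts as all zeros
k_seq = []
for x in X:
    k_seq.append(Wk @ (x * mix_k + last_x * (1 - mix_k)))
    last_x = x
k_seq = np.stack(k_seq)

# Parallel (GPT-mode): last_x for all positions is X shifted down by one row
X_prev = np.vstack([np.zeros((1, n_embd)), X[:-1]])
k_par = (X * mix_k + X_prev * (1 - mix_k)) @ Wk.T   # a single matrix multiplication

print(np.allclose(k_seq, k_par))
```

Only the num/den recurrence of the RWKV attention cannot be rewritten this way, which is why it needs a scan or a custom kernel.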

Contribute​

RWKV is an open source community project. Join the Discord and contribute! Or just ask questions or lurk.
 

bnew




About​

Automate any task with a single AI command

Nuggt: An Autonomous LLM Agent that runs on Wizcoder-15B (4-bit Quantised)​


This Repo is all about democratising LLM Agents with powerful Open Source LLM Models. Join us on this exciting journey of task automation with Nuggt, as we push the boundaries of what can be achieved with smaller open-source large language models, one step at a time 😁.

Download Wizcoder-15B 4-bit Quantised from: TheBloke/WizardCoder-15B-1.0-GPTQ · Hugging Face (PS: TheBloke is a legend)
 

Micky Mikey

Veteran
Supporter
Joined
Sep 27, 2013
Messages
16,043
Reputation
3,005
Daps
89,682

Seems like they're anticipating artificial super intelligence before the end of this decade.
 

bnew




LongNet: Scaling Transformers to 1,000,000,000 Tokens​


edit:
google bard non-technical summary..



What is a sequence? A sequence is a list of things that are ordered one after the other. For example, a sequence of words is a sentence.
What is a Transformer? A Transformer is a type of neural network that is used for natural language processing tasks. Transformers are very good at understanding long sequences of text, but they can be slow when the sequence is very long.
What is LONGNET? LONGNET is a new type of Transformer that is designed to be able to understand very long sequences of text. LONGNET is faster than traditional Transformers, and it can also understand longer sequences without sacrificing accuracy.
Here is an analogy that might help you understand LONGNET better:

Imagine that you are trying to read a very long book. If you try to read the whole book at once, it will be very difficult to understand. But if you break the book down into smaller sequences, it will be much easier to understand. LONGNET is like a machine that can break down a long sequence of text into smaller sequences, so that it can be understood more easily.

Here are some of the benefits of using LONGNET:

It can understand very long sequences of text. This is important for tasks like machine translation, where you need to understand the meaning of a whole sentence or paragraph.
It is faster than traditional Transformers. This means that it can process text more quickly, which is important for applications like real-time translation.
It is more accurate than traditional Transformers. This means that it can understand text more accurately, which is important for tasks like question answering and summarization.
I hope this explanation helps you understand LONGNET better.


Right now, when computers try to understand long pieces of text, they have to look at each word one at a time. This is slow and takes up a lot of memory. A new breakthrough called "longnet" could change that. Longnet lets computers look at whole sentences at once, which is much faster and uses less memory.

This is important because a lot of real-world text is long. For example, a news article or a book might be thousands of words long. Longnet could make it possible for computers to understand these long pieces of text more easily.

Longnet could also be used for other things, like machine translation. Right now, machine translation can be slow and inaccurate when it comes to long pieces of text. Longnet could make machine translation faster and more accurate.

Overall, longnet is a promising new technology that could have a big impact on the way computers understand text.

Here are some specific examples of how longnet could be used:

A computer could read an entire book and summarize it for you.
A computer could translate a long article from one language to another.
A computer could help you write a long essay by suggesting ideas and providing feedback.
These are just a few examples of the many things that longnet could be used for. As longnet continues to develop, we can expect to see even more amazing things that it can do.


Right now, when a Transformer reads a long piece of text, it has to compare every word in the text with every other word. This is called "quadratic complexity" because the amount of work goes up as the square of the number of words in the text. So, if you have a text that is 100 words long, the computer has to make about 10,000 comparisons! This is very slow and takes up a lot of memory.

A breakthrough in self-attention would mean that computers could read long pieces of text much more efficiently. This is because self-attention allows computers to focus on the most important words in the text, and ignore the less important words. This means that the computer only has to read a fraction of the words in the text, which is much faster and takes up less memory.

This breakthrough would be a big deal because it would allow computers to read and understand much longer pieces of text. This would be helpful for things like machine translation, where computers need to understand the meaning of long sentences in order to translate them correctly. It would also be helpful for things like question answering, where computers need to understand the meaning of long passages of text in order to answer questions about them.

So, in short, a breakthrough in self-attention would make computers much better at reading and understanding long pieces of text. This would be a big step forward for natural language processing and would have many benefits for people who use computers.

Here is an analogy that might help you understand this concept. Imagine that you are trying to find your friend in a crowded room. You could walk around the room and look at every single person, but that would take a long time and you might not find your friend. Instead, you could use a strategy called "selective attention" to focus on the people who are most likely to be your friend. For example, you might look for people who are wearing the same kind of clothes as your friend, or who are standing in the same area of the room. This strategy would allow you to find your friend much more quickly and efficiently.

Self-attention is like selective attention for computers. It allows computers to focus on the most important words in a text, and ignore the less important words. This makes it much faster and easier for computers to understand the meaning of long pieces of text.

:ohhh:
 

RegB

All Star
Joined
Jun 12, 2012
Messages
1,682
Reputation
1,042
Daps
4,252
Reppin
NULL
I saw today that openai said the gpt-4 api is open to everyone now..? :patrice:

I’m too :flabbynsick: to get this stuff :russ:
 

bnew


AI May Have Found The Most Powerful Anti-Aging Molecule Ever Seen​


HEALTH 07 July 2023

By VANESSA SMER-BARRETO, THE CONVERSATION

Image: an old man's eye and a young child's eye (mrovka/Getty Images)

Finding new drugs – called "drug discovery" – is an expensive and time-consuming task. But a type of artificial intelligence called machine learning can massively accelerate the process and do the job for a fraction of the price.

My colleagues and I recently used this technology to find three promising candidates for senolytic drugs – drugs that slow ageing and prevent age-related diseases.

Senolytics work by killing senescent cells. These are cells that are "alive" (metabolically active), but which can no longer replicate, hence their nickname: zombie cells.

The inability to replicate is not necessarily a bad thing. These cells have suffered damage to their DNA – for example, skin cells damaged by the Sun's rays – so stopping replication stops the damage from spreading.

But senescent cells aren't always a good thing. They secrete a cocktail of inflammatory proteins that can spread to neighboring cells. Over a lifetime, our cells suffer a barrage of assaults, from UV rays to exposure to chemicals, and so these cells accumulate.

Elevated numbers of senescent cells have been implicated in a range of diseases, including type 2 diabetes, COVID, pulmonary fibrosis, osteoarthritis and cancer.

Studies in lab mice have shown that eliminating senescent cells, using senolytics, can ameliorate these diseases. These drugs can kill off zombie cells while keeping healthy cells alive.

Around 80 senolytics are known, but only two have been tested in humans: a combination of dasatinib and quercetin. It would be great to find more senolytics that can be used in a variety of diseases, but it takes ten to 20 years and billions of dollars for a drug to make it to the market.

Results in five minutes​


My colleagues and I – including researchers from the University of Edinburgh and the Spanish National Research Council IBBTEC-CSIC in Santander, Spain – wanted to know if we could train machine learning models to identify new senolytic drug candidates.



To do this, we fed AI models with examples of known senolytics and non-senolytics. The models learned to distinguish between the two, and could be used to predict whether molecules they had never seen before could also be senolytics.

When solving a machine learning problem, we usually test the data on a range of different models first as some of them tend to perform better than others.

To determine the best-performing model, at the beginning of the process, we separate a small section of the available training data and keep it hidden from the model until after the training process is completed.

We then use this testing data to quantify how many errors the model is making. The one that makes the fewest errors, wins.

We determined our best model and set it to make predictions. We gave it 4,340 molecules and five minutes later it delivered a list of results.



The AI model identified 21 top-scoring molecules that it deemed to have a high likelihood of being senolytics. If we had tested the original 4,340 molecules in the lab, it would have taken at least a few weeks of intensive work and £50,000 just to buy the compounds, not counting the cost of the experimental machinery and setup.

We then tested these drug candidates on two types of cells: healthy and senescent. The results showed that out of the 21 compounds, three (periplocin, oleandrin and ginkgetin) were able to eliminate senescent cells, while keeping most of the normal cells alive. These new senolytics then underwent further testing to learn more about how they work in the body.

More detailed biological experiments showed that, out of the three drugs, oleandrin was more effective than the best-performing known senolytic drug of its kind.

The potential repercussions of this interdisciplinary approach – involving data scientists, chemists and biologists – are huge. Given enough high-quality data, AI models can accelerate the amazing work that chemists and biologists do to find treatments and cures for diseases – especially those of unmet need.

Having validated them in senescent cells, we are now testing the three candidate senolytics in human lung tissue. We hope to report our next results in two years' time.

Vanessa Smer-Barreto, Research Fellow, Institute of Genetics and Molecular Medicine, The University of Edinburgh

This article is republished from The Conversation under a Creative Commons license. Read the original article.

gotta wait 2 years for the results of the next study. :francis:
 