In this blog post, I’m going to walk through the details and principles of the LoRA (Low-Rank Adaptation of Large Language Models) method, following the paper. The paper proposes a new structure that lowers the cost of fine-tuning a model while keeping its performance on par with, or better than, full fine-tuning. Fine-tuning has had an important place in natural language processing for a while. Let’s say you have a trained model and you want to expand its knowledge. You can do that by fine-tuning, without training the model from scratch.
Let’s say a model has n parameters after training and you need to fine-tune it. With full fine-tuning, all of the model’s parameters are updated. For models with billions of parameters, updating every parameter creates time and storage problems. The paper reports that LoRA performs on par with or better than full fine-tuning on the RoBERTa, DeBERTa, GPT-2 and GPT-3 models.
The paper mentions that the LoRA method is inspired by the studies of Li et al. (2018a) and Aghajanyan et al. (2020). These studies emphasize that the success of over-parametrized models is actually based on a low intrinsic dimension. This became the point that shaped the hypothesis of the LoRA method.
The LoRA method is built on rank factorization. Two sequential matrices are added in parallel to some of the dense layers in the neural network, as shown in Figure 1. These matrices are sized by a chosen rank. The figure below shows what a dense layer looks like after the two low-rank matrices are added. The input is sent through both the original pre-trained weight matrix and the first of the two sequential matrices. The output is then computed by summing the outputs of these two parallel paths.
https://arxiv.org/abs/2106.09685
The LoRA Method
Neural networks have many dense layers that perform matrix multiplications. The weight matrices in these layers are typically full-rank. However, Aghajanyan et al. mentioned in their paper that pre-trained language models have a low “intrinsic dimension”. This shaped the LoRA hypothesis: the change in the weight matrix during adaptation to a specific task also has a low rank, so the model can learn through low-dimensional updates.
Let W0 ∈ R^(d×k) be a weight matrix of a pre-trained model. The update to this matrix is expressed through a low-rank decomposition, W0 + ∆W = W0 + BA. Here, B is a d×r matrix and A is an r×k matrix, while r (the rank) is much smaller than d and k. During fine-tuning, W0 is frozen and receives no gradient updates; only A and B are trained. Both W0 and ∆W are multiplied by the same input x and the results are summed, giving the output h = W0x + ∆Wx = W0x + BAx.
At the beginning of training, the authors used random Gaussian initialization for the A matrix and zero initialization for the B matrix. In other words, ∆W = BA is equal to zero at the beginning.
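The forward pass and initialization described above can be sketched in NumPy. This is only a minimal illustration, not the paper's implementation: the sizes d, k and r, the random stand-in for the pre-trained weight, the 0.01 standard deviation for A, and the function name `lora_forward` are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumed); r is much smaller than d and k.
d, k, r = 8, 6, 2

W0 = rng.normal(size=(d, k))             # frozen weight, random stand-in for a pre-trained matrix
A = rng.normal(scale=0.01, size=(r, k))  # trainable, random Gaussian init (scale is an assumption)
B = np.zeros((d, r))                     # trainable, zero init

def lora_forward(x):
    """Return h = W0 x + B A x for one input vector x of length k."""
    return W0 @ x + B @ (A @ x)

x = rng.normal(size=k)
h = lora_forward(x)

# B starts at zero, so BA is zero and the layer behaves like the frozen one.
starts_as_pretrained = np.allclose(h, W0 @ x)

trainable = A.size + B.size  # r * (d + k) numbers are trained
frozen = W0.size             # d * k numbers stay fixed
print(starts_as_pretrained, trainable, frozen)
```

With these toy sizes only 28 numbers are trainable against 48 frozen ones; the gap grows quickly as d and k get larger while r stays small.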
LoRA Application in Transformer Architecture
The transformer architecture has 4 weight matrices (Wq, Wk, Wv, Wo) in the self-attention module and 2 weight matrices in the MLP module. For LoRA adaptation, the authors froze the weights in the MLP module and applied the method only to the attention weights, in order to keep the training simple.
Advantages
The greatest conveniences LoRA provides are reduced memory and storage usage. For example, the paper mentions that for a large transformer model trained with Adam, VRAM usage is reduced by up to 2/3 if r << d. This is because optimizer states are not kept for frozen parameters.
For the GPT-3 175B model, VRAM consumption during training decreased from 1.2TB to 350GB.
In the scenario where r=4 and only the query and value matrices were adapted, the checkpoint size was reduced from 350GB to 35MB, roughly 10,000 times smaller. This enabled training to be performed with fewer GPUs.
In addition, since gradients are not computed for most of the parameters, a 25% speedup was achieved when training GPT-3 175B with LoRA compared to full fine-tuning.
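As a rough sanity check on the checkpoint figures above, the arithmetic can be worked out in a few lines. The GPT-3 175B shape used here (96 layers, hidden size 12288) and the 2 bytes per parameter (FP16) are assumptions that are not stated in this post, so treat the result as an order-of-magnitude estimate rather than the paper's exact accounting.

```python
# Assumed GPT-3 175B shape (not stated in this post): 96 layers, hidden size 12288.
n_layers = 96
d_model = 12288
r = 4
adapted_per_layer = 2  # query and value projections only
bytes_per_param = 2    # FP16 storage, an assumption

# Each adapted d_model x d_model matrix gets B (d_model x r) and A (r x d_model).
lora_params = n_layers * adapted_per_layer * (d_model * r + r * d_model)
lora_megabytes = lora_params * bytes_per_param / 1e6

full_params = 175e9
full_gigabytes = full_params * bytes_per_param / 1e9

print(f"LoRA parameters: {lora_params:,}")
print(f"LoRA checkpoint: {lora_megabytes:.1f} MB")
print(f"Full checkpoint: {full_gigabytes:.0f} GB")
```

Under these assumptions the LoRA weights come to about 18.9 million parameters, or roughly 38MB, against 350GB for the full model. That is the same order of magnitude as the 35MB reported in the paper.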
** The paper also notes that a small r is not expected to work for every task. For example, if the downstream task is in a different language than the one used for pre-training, retraining the entire model could outperform LoRA with a small r.
Additional Knowledge
What is a Rank:
Rank is the maximum number of linearly independent columns or rows of a matrix. If a matrix has n columns and all of them are linearly independent, its column rank is n. Likewise, if a matrix has n rows and all of them are linearly independent, its row rank is n.
The row rank and column rank of a matrix must be equal. As an example, we can look at the matrix below.
Matrix A=
[[1,2,5],
[ 2, 4, 10] ]
If we look at the matrix carefully, we can see that the 2nd row is twice the 1st row. Likewise, the 2nd and 3rd columns of the matrix are 2 and 5 times the 1st column, respectively. In other words, the two row vectors lie on one line and the three column vectors lie on another line, so there is only one linearly independent vector among the rows and only one among the columns. This means that the rank of the matrix is 1.
To find the rank, we convert the matrix into its row echelon form and count the rows that contain at least one non-zero value.
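The example matrix can be checked quickly in NumPy. The sketch below first does the single row-reduction step by hand and counts the non-zero rows, then compares with `numpy.linalg.matrix_rank`. Note that `matrix_rank` computes the rank from the singular values rather than from the echelon form, so it is a different technique that gives the same answer here.

```python
import numpy as np

A = np.array([[1, 2, 5],
              [2, 4, 10]])

# Row echelon form by hand: subtract twice row 1 from row 2.
echelon = A.astype(float)
echelon[1] -= 2 * echelon[0]

# Count the rows that still contain at least one non-zero value.
nonzero_rows = int(np.count_nonzero(np.any(echelon != 0, axis=1)))

# Cross-check with NumPy's SVD-based rank computation.
rank = int(np.linalg.matrix_rank(A))

print(nonzero_rows, rank)  # 1 1
```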
What is a Low Rank Matrix:
The rank of a matrix can be at most the smaller of its number of rows and its number of columns. Matrices that have fewer linearly independent rows or columns than this limit are called low-rank matrices. Matrices whose rank reaches this limit are called full-rank matrices.
What is a Rank Factorization/Decomposition:
We can rewrite an m×n matrix B with rank r as B = L R^T by decomposing it into the matrices L and R. Here, L is an m×r matrix, while R^T is an r×n matrix. This process is known as a rank decomposition.
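As a small worked example, the rank-1 matrix from the rank section can be factored by hand. The particular choice of L and R below is just one valid factorization, picked for illustration; rank factorizations are not unique.

```python
import numpy as np

B = np.array([[1, 2, 5],
              [2, 4, 10]])  # m = 2, n = 3, rank r = 1

L = np.array([[1],
              [2]])          # m x r
R = np.array([[1],
              [2],
              [5]])          # n x r, so R.T is r x n

reconstructed = L @ R.T
print(np.array_equal(reconstructed, B))  # True
```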
The importance of Rank Factorization:
The importance of rank factorization is that it lets us split a matrix into two smaller matrices, L and R, that take up less space: instead of storing B, you store its factors L and R. This reduces the storage requirement of B from mn numbers to (m+n)r numbers. It is also worth mentioning that once a matrix is decomposed into these two small matrices, many calculations involving the matrix can be done without ever forming the full matrix again. For this reason, when dealing with a low-rank matrix, rank factorization is considered an important first step. After the rank factorization, calculations can be made much faster and with less memory.
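The storage saving and the idea of computing with the factors directly can be illustrated with a short NumPy sketch. The sizes m, n and r and the random factors are assumptions chosen for the example; the full matrix B is formed only to verify the result.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 1000, 800, 4  # illustrative sizes (assumed)

L = rng.normal(size=(m, r))
R = rng.normal(size=(n, r))
x = rng.normal(size=n)

dense_numbers = m * n           # numbers needed to store B directly
factored_numbers = (m + n) * r  # numbers needed to store L and R

# Compute B @ x without forming B: first R.T @ x (r numbers), then L @ (...).
y_factored = L @ (R.T @ x)

# Reference: build the full matrix only to check the result.
B = L @ R.T
y_dense = B @ x

print(dense_numbers, factored_numbers)  # 800000 7200
print(np.allclose(y_dense, y_factored))
```

The factored product costs about (m+n)r multiplications instead of mn, which is the same saving as in storage.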
Furthermore
For more detailed information about low-rank matrices, you can read Ethan R. Epperly’s blog and the LoRA paper. Additionally, you can examine LoRA’s code-based applications in detail in the microsoft/LoRA GitHub repository.