In this blog post, I’m going to walk through the LoRA (Low-Rank Adaptation of Large Language Models) paper and explain the details and principles of the method. The main contribution of the paper is a new adaptation structure that decreases the cost of fine-tuning a model while matching or even improving its performance. Fine-tuning has had an important place in natural language processing for a while. Let’s say you have a trained model and you want to expand its knowledge. You can do this by fine-tuning, without training the model from scratch.
Let’s say a model has n parameters after training and you need to fine-tune it. With full fine-tuning, all n of the model’s parameters are updated. For models with billions of parameters, updating every parameter creates serious time and storage problems. The paper reports on-par or better performance for RoBERTa, DeBERTa, GPT-2, and GPT-3 when they are adapted with the LoRA method instead.
It is mentioned in the paper that the LoRA method is inspired by the studies of Li et al. (2018a) and Aghajanyan et al. (2020). These studies emphasize that the success of over-parametrized models is actually based on a low intrinsic dimension. This observation became the point that shaped the hypothesis of the LoRA method.
The LoRA method is based entirely on rank factorization. Two sequential matrices are added in parallel to some of the dense layers of the network, as shown in Figure 1, and their shapes are determined by a chosen rank. The figure below shows what a dense layer looks like after these rank matrices are added: the input is sent both through the original pretrained weight matrix and through the first of the two sequential matrices, and the output is computed by summing the outputs of these two parallel paths.

Figure 1: The LoRA reparametrization; only the low-rank matrices are trained (source: https://arxiv.org/abs/2106.09685)
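To make the figure concrete, here is a minimal sketch of such a layer in PyTorch. The class name LoRALinear, the layer sizes, and the 0.01 scale on the Gaussian init are my own illustrative choices, not from the paper; the paper’s actual implementation also scales the low-rank path by a factor α/r, which I leave out here for simplicity.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A dense layer with a parallel low-rank path, as in Figure 1."""

    def __init__(self, in_features, out_features, r=8):
        super().__init__()
        # W0: the pretrained weight. In practice it is loaded from the
        # pretrained model; random here only to keep the sketch runnable.
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features), requires_grad=False
        )
        # First sequential matrix A: projects the input down to rank r.
        # The paper initializes A with a Gaussian.
        self.lora_A = nn.Parameter(torch.randn(r, in_features) * 0.01)
        # Second sequential matrix B: projects back up. Zero-initialized,
        # so the layer starts out identical to the pretrained one.
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))

    def forward(self, x):
        # Output = frozen path + low-rank path: W0 x + B (A x)
        return x @ self.weight.T + (x @ self.lora_A.T) @ self.lora_B.T

# Quick check: with B zero-initialized, the layer matches the frozen path.
layer = LoRALinear(16, 16, r=4)
x = torch.randn(2, 16)
assert torch.allclose(layer(x), x @ layer.weight.T)
```

Only `lora_A` and `lora_B` carry gradients, so an optimizer over this module’s trainable parameters touches just the low-rank path.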
The LoRA Method
Neural networks have many dense layers that perform matrix multiplications. The weight matrices in these layers typically have full rank (all rows and columns are independent). But Aghajanyan et al. mentioned in their paper that pre-trained language models have a low “intrinsic dimension”. This observation shaped the LoRA hypothesis: during adaptation to a specific task, the model can learn through low-rank changes to its weight matrices.
Let $W_0 \in \mathbb{R}^{d \times k}$ be a weight matrix of a pre-trained model. The updates to this matrix are made through two matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$, so that the adapted weight becomes $W_0 + \Delta W = W_0 + BA$, where the rank $r \ll \min(d, k)$.
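To see why this decomposition is cheap, here is a quick back-of-the-envelope comparison. The dimensions d = k = 4096 and rank r = 8 are hypothetical values I picked for illustration, not numbers from the paper:

```python
d, k, r = 4096, 4096, 8

full_update = d * k            # training Delta W directly: 16,777,216 parameters
lora_update = d * r + r * k    # training B (d x r) and A (r x k): 65,536 parameters

print(full_update, lora_update, lora_update / full_update)
# 16777216 65536 0.00390625 -> roughly 0.4% of the full update
```

The smaller the rank r, the fewer parameters need to be trained and stored per task, which is exactly where LoRA’s savings come from.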