
The Most Effective Techniques for Applying Parameter-Efficient Fine-Tuning (PEFT)

APRIL 9th, 2025

LoRA (Low-Rank Adaptation)

 
 

What is LoRA?

 
LoRA, or Low-Rank Adaptation, is a technique designed to make fine-tuning large AI models more efficient. It’s one of the key techniques under the PEFT umbrella, a family of methods that aim to reduce the computational overhead of adapting large pre-trained models. LoRA achieves this efficiency by expressing the changes to the model’s weight matrices as small, low-rank components. This reduces the number of parameters that need to be updated during fine-tuning, making the process faster and more cost-effective.
 

The LoRA Fine-Tuning Process

 
LoRA simplifies the process of model adaptation by focusing on specific parts of the model, typically the weight matrices. Here’s how the LoRA technique works:
 
Selecting Target Layers for Fine-Tuning
 
The first step in the LoRA process is identifying which layers or components of the model should be fine-tuned. These layers typically involve the weight matrices 𝑊 that define how the model processes input data. By narrowing the focus to specific layers, LoRA ensures that only the necessary parts of the model are adapted, rather than adjusting the entire network of parameters.
 
Decomposing Weight Matrices
 
Once the target layers are identified, the next step involves decomposing the large weight matrix into smaller, low-rank matrices. Instead of working with a single, large matrix, LoRA breaks it down into two smaller matrices that can be updated more easily. This decomposition allows for more efficient fine-tuning since the smaller matrices require less computational power to adjust.
 
The update to the original matrix 𝑊 can be approximated by the product of two much smaller matrices 𝐴 and 𝐵: Δ𝑊 ≈ 𝐴 × 𝐵, where 𝐴 has shape d × r, 𝐵 has shape r × k, and the rank r is much smaller than d and k.
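To make the savings concrete, here is a minimal NumPy sketch of the idea; the layer size (768 × 768) and rank (8) are arbitrary illustrative values, not taken from this guide.

import numpy as np

d, k, r = 768, 768, 8              # hypothetical layer dimensions and LoRA rank
W = np.random.randn(d, k)          # frozen pretrained weight
A = np.random.randn(d, r) * 0.01   # trainable low-rank factor
B = np.zeros((r, k))               # trainable low-rank factor (zero-initialized)

delta_W = A @ B                    # low-rank update, same shape as W
W_adapted = W + delta_W            # effective weight after fine-tuning

print("full fine-tuning parameters:", W.size)          # 589,824
print("LoRA trainable parameters:", A.size + B.size)   # 12,288

With these illustrative numbers, LoRA trains roughly 2% of the parameters that full fine-tuning would update for this single layer.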
 
Updating Only the Smaller Matrices
 
In this step, instead of updating the entire large matrix 𝑊 , LoRA focuses on updating only the smaller, low-rank matrices 𝐴 and 𝐵. These matrices are the only ones that change during fine-tuning, while the rest of the model’s parameters remain fixed.
 
Reconstructing the Weight Matrices
 
After the low-rank matrices are updated, their product is added back to the original weights to form the updated weight matrix: 𝑊′ = 𝑊 + 𝐴 × 𝐵.
This recomposed matrix now reflects the adjustments made during fine-tuning. The recombination process allows the model to retain its original structure while incorporating the new, fine-tuned parameters.
 
Integrating the Updated Weights
 
Once the low-rank matrices are recombined, the updated weight matrix is integrated back into the model. This replacement does not alter the rest of the model’s architecture, which preserves its original capabilities and knowledge.
 
Implementing PEFT fine-tuning with LoRA is straightforward. Here’s a step-by-step guide to using LoRA to fine-tune a pre-trained model for sentiment classification:
 
We start by importing the necessary libraries to load the dataset, manage the LoRA configuration, and handle the model, tokenizer, data collation, and training setup.

import time  # used later to timestamp the output directory

from datasets import load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (TrainingArguments, Trainer, AutoModelForSequenceClassification,
                          AutoTokenizer, DataCollatorWithPadding)
We then create a label mapping that tells the model how to interpret the target labels. This mapping ensures that the model can understand and output sentiment labels correctly during training and evaluation.

id2label = {0: "NEGATIVE", 1: "POSITIVE"}
label2id = {"NEGATIVE": 0, "POSITIVE": 1}
We load the pre-trained model for sequence classification. It has been pre-trained on a large corpus of text and has learned general language patterns, and we initialize it with our label mappings so that it can correctly handle the classification task.

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", id2label=id2label, label2id=label2id
)
We load the Rotten Tomatoes dataset using the load_dataset function. This dataset contains movie reviews labeled as either “positive” or “negative” sentiment.

dataset = load_dataset("rotten_tomatoes")
dataset
We load a tokenizer corresponding to the pre-trained model. The tokenizer is essential because it converts raw text into token IDs that the model can process. It ensures that the input text is in a form that the model understands.

checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

Next, we create a tokenization function and apply it to the dataset to convert the raw text into tokenized inputs. We also set up a data collator for dynamic padding and prepare the train and validation splits.

def tokenizer_func(examples):
  return tokenizer(examples["text"], truncation=True)

# Pad each batch dynamically to the longest sequence in that batch
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)
dataset_tokenized = dataset.map(tokenizer_func, batched=True)

# Prepare the dataset splits in the format the Trainer expects
train_data_tokenized = dataset_tokenized["train"].remove_columns(["text"]).rename_column("label", "labels")
val_data_tokenized = dataset_tokenized["validation"].remove_columns(["text"]).rename_column("label", "labels")
This is where LoRA (Low-Rank Adaptation) comes into play. By configuring LoRA, we efficiently adapt the model with fewer trainable parameters.

lora_config = LoraConfig(
    r=8,  # rank - see hyperparameter section for more details
    lora_alpha=32,  # scaling factor - see hyperparameter section for more details
    target_modules=["q_lin", "v_lin"],  # DistilBERT's attention query and value projections
    lora_dropout=0.05,
    bias="none",
    task_type=TaskType.SEQ_CLS
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()

We then set up the training configuration. This configuration controls how the model will be trained.

output_dir = f'./rotten-tomatoes-classification-training-{str(int(time.time()))}'
training_args = TrainingArguments(
    output_dir=output_dir,
    learning_rate=1e-5,
    logging_steps=1,
    max_steps=10  # kept small for demonstration purposes; increase for real training runs
)

Finally, we set up the Trainer and start the training process. This step kicks off the fine-tuning process, and the model will start learning to classify the movie reviews into “positive” or “negative” categories.

trainer = Trainer(
    model=peft_model,
    args=training_args,
    train_dataset=train_data_tokenized,
    eval_dataset=val_data_tokenized,
    data_collator=data_collator,
    tokenizer=tokenizer,
)
trainer.train()
Make sure to adjust the hyperparameters to optimize performance for your specific use case.
 

QLoRA (Quantized LoRA)

 

What is QLoRA?

 
QLoRA is an extended version of LoRA designed to reduce the memory requirements of large language models by quantizing the model’s weight parameters to 4-bit precision. In typical LLMs, parameters are stored in 16- or 32-bit floating-point formats, but QLoRA compresses them to 4 bits, shrinking the model’s memory footprint dramatically. This compression enables the fine-tuning of LLMs on hardware with limited memory, such as consumer-grade GPUs, which previously could not handle such models.
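To see why this matters, here is a quick back-of-the-envelope calculation; the 7-billion-parameter count is an arbitrary example, and the figures cover only the stored weights (activations, optimizer state, and quantization constants add to real-world usage).

params = 7e9  # hypothetical 7B-parameter model
for name, bytes_per_param in [("32-bit", 4), ("16-bit", 2), ("4-bit", 0.5)]:
    print(f"{name}: {params * bytes_per_param / 1e9:.1f} GB")
# 32-bit: 28.0 GB, 16-bit: 14.0 GB, 4-bit: 3.5 GB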

The QLoRA Process Explained

 
Quantizing the Model Weights to 4-bit Precision
 
The first step in QLoRA is compressing the weight parameters of the pre-trained large language model from their original higher precision (typically 16 or 32 bits) to a memory-efficient 4-bit format. This reduces the overall memory usage of the model.
 
Using 4-bit Normal Float for Improved Quantization
 
Instead of using standard 4-bit integers or floating-point representations, QLoRA introduces a novel “4-bit NormalFloat” (NF4) data type, optimized for normally distributed data. This suits the weights of the model, which typically follow a Gaussian distribution. QLoRA applies quantile quantization, placing the quantization levels so that each bin holds an equal share of the values, which makes efficient use of every level and reduces the problem of outliers distorting the representation.
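The following is a rough PyTorch sketch of the quantile idea; it is a simplified illustration under stated assumptions (evenly spaced quantiles, per-block absmax scaling), not the exact NF4 specification or the bitsandbytes implementation.

import torch
from torch.distributions import Normal

# Place the 16 quantization levels at quantiles of a standard normal, so each
# level covers roughly the same probability mass of the weight distribution.
probs = torch.linspace(0.02, 0.98, 16)         # avoid the infinite tails
levels = Normal(0.0, 1.0).icdf(probs)
levels = levels / levels.abs().max()           # normalize levels to [-1, 1]

weights = torch.randn(64) * 0.05               # one block of roughly Gaussian weights
scale = weights.abs().max()                    # per-block absmax scaling constant
codes = (weights / scale).unsqueeze(1).sub(levels).abs().argmin(dim=1)  # 4-bit codes
dequantized = levels[codes] * scale            # values used when computing with the layer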
 
Applying Double Quantization
 
To enhance model compression, QLoRA uses a two-step quantization approach. The first level reduces the precision of the weight parameters themselves. The second level quantizes the constants used by the first step (the per-block scaling factors), saving additional memory.
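A minimal sketch of the idea, with illustrative block sizes and dtypes rather than the exact constants used by QLoRA:

import torch

weights = torch.randn(256, 64)                       # 256 blocks of 64 weights each
scales = weights.abs().max(dim=1).values             # first level: one fp32 scale per block

# Second ("double") quantization: store the scales themselves in 8 bits
scale_of_scales = scales.max() / 255
scales_q = torch.round(scales / scale_of_scales).to(torch.uint8)
scales_dequant = scales_q.float() * scale_of_scales  # recovered when dequantizing the weights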
 
 
Paged Optimizers for Efficient Memory Management
 
When training large models, especially on GPUs with limited memory, it’s important to manage memory efficiently. QLoRA uses a technique called “paged optimizers” to handle this. It relies on NVIDIA’s unified memory feature, which allows automatic page-to-page transfers between CPU and GPU memory. When the GPU runs low on memory, parts of the optimizer state are moved to CPU memory. These pages are then automatically transferred back to the GPU when they are needed for an optimizer update.
 
Implementing QLoRA is very similar to LoRA, but with a few minor changes. Here’s how it works:
 
We begin by loading a pre-trained causal language model with 4-bit quantization applied to reduce its memory footprint, then prepare it for k-bit training. (The bnb_config below is a typical bitsandbytes 4-bit setup; model_id should point to the base model you want to fine-tune.)

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training

# Typical 4-bit setup: NF4 quantization with double quantization enabled
bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4",
                                bnb_4bit_use_double_quant=True)

model = AutoModelForCausalLM.from_pretrained(
    model_id,  # hub ID or local path of the base model
    quantization_config=bnb_config,
    device_map={"": 0}
)
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
Just like LoRA, we configure the QLoRA parameters through LoraConfig. The main difference here is the inclusion of target_modules, which specifies that QLoRA will be applied to specific attention projection layers (k_proj, v_proj, q_proj) in the model. These layers are crucial for the attention mechanism, and QLoRA adapts them in a low-rank way.

qlora_config = LoraConfig(
    r=config.qlora_rank,            # rank, read from your own configuration object
    lora_alpha=config.qlora_alpha,  # scaling factor, read from your own configuration object
    target_modules=["k_proj", "v_proj", "q_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
We then use the get_peft_model function to apply the QLoRA configuration to the model.

model = get_peft_model(model, qlora_config)
The model is then set up for training using the Trainer class that we used previously. Here, the optimizer is set to “paged_adamw_8bit”, a specialized optimizer for efficient training with quantized models.

trainer = transformers.Trainer(
    # set this up like we did in the previous section
    args=transformers.TrainingArguments(
        # rest of the setup
        optim="paged_adamw_8bit"  # add this to use the paged 8-bit AdamW optimizer
    ),
)

DoRA (Weight-Decomposed Low-Rank Adaptation)

 

What is DoRA?

 
DoRA (Weight-Decomposed Low-Rank Adaptation) is another extension of LoRA. This technique takes the concept of Low-Rank Adaptation a step further by decomposing the pre-trained weight matrices into two distinct components: magnitude and direction. This decomposition allows for more efficient fine-tuning by applying directional updates through LoRA, minimizing the number of trainable parameters while improving the model’s overall learning efficiency and training stability.

The DoRA Process Explained

 
Decompose Pretrained Weights
 
The pretrained weight 𝑊₀ is decomposed into two components:
 
  • Magnitude (𝑚): Represents the scale of the weights, initialized as the norm of the pretrained weights.
 
  • Direction (𝑉): The normalized pretrained weight matrix.
 
Adapt Direction
 
During fine-tuning, updates (Δ𝑉) are applied only to the directional component (𝑉), which is represented using low-rank matrices (𝐴 and 𝐵). This enables efficient parameter adaptation while keeping the number of trainable parameters minimal. The updated directional component is recalculated as 𝑉 + Δ𝑉.
 
 
Recombine Magnitude and Direction
 
After training, the updated weights are merged back by recombining the magnitude (𝑚) with the adapted direction: 𝑊′ = 𝑚 · (𝑉 + Δ𝑉) / ‖𝑉 + Δ𝑉‖.
 
 
Generate Merged Weights
 
The final merged weight (𝑊′) incorporates the pretrained knowledge from 𝑊 along with the fine-tuned updates, ready for downstream tasks.
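The following is a rough PyTorch sketch of the decomposition and merge described above; the shapes, rank, and column-wise norm are illustrative assumptions, not code taken from the PEFT library.

import torch

d, k, r = 768, 768, 8
W0 = torch.randn(d, k)                     # pretrained weight
m = W0.norm(dim=0, keepdim=True)           # magnitude: per-column norm of W0
V = W0                                     # direction component

A = torch.randn(d, r) * 0.01               # low-rank factors carrying the
B = torch.zeros(r, k)                      # directional update
delta_V = A @ B

V_adapted = V + delta_V
W_merged = m * (V_adapted / V_adapted.norm(dim=0, keepdim=True))  # W' = m * (V + dV) / ||V + dV||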
 
The implementation of DoRA is quite similar to LoRA when using Hugging Face’s PEFT library:
 
Both methods follow the same steps we covered in the previous sections of this guide. The main distinction lies in enabling DoRA through the configuration: instead of the standard LoRA setup, you initialize the configuration with the use_dora parameter set to True:

from peft import LoraConfig, get_peft_model

# Initialize DoRA configuration
config = LoraConfig(
    # some other parameters here
    use_dora=True
)

This change activates the decomposition mechanism specific to DoRA.

NEFTune (Noisy Embedding Fine-Tuning)

 

What Is NEFTune?

 
Noisy Embedding Fine-Tuning (NEFTune) is another effective technique for improving the fine-tuning process of language models. NEFTune uses the well-known regularization technique of introducing random noise to improve model generalization during fine-tuning. This approach aims to reduce overfitting and enhance the robustness of pre-trained models.
 

The NEFTune Process Explained

 
Targeting the Embedding Layers
 
The embedding layers are responsible for converting input tokens into vector representations. These embeddings are a crucial foundation for the model’s understanding of the input.
 
Adding Gaussian Noise
 
During fine-tuning, small amounts of Gaussian noise are applied specifically to the embedding layers. This controlled disturbance helps the model generalize better and avoid overfitting.
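Conceptually, the injection looks like the simplified sketch below. The add_embedding_noise helper and its scaling rule are illustrative assumptions; in practice the trainer applies the noise for you via the neftune_noise_alpha parameter shown later in this section.

import torch

def add_embedding_noise(embeddings: torch.Tensor, noise_alpha: float = 5.0) -> torch.Tensor:
    # embeddings: (batch, seq_len, hidden_dim); applied during training only
    batch, seq_len, dim = embeddings.shape
    scale = noise_alpha / (seq_len * dim) ** 0.5   # shrink noise for longer/wider inputs
    return embeddings + torch.randn_like(embeddings) * scale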
 
Why NEFTune Is Effective
 
 
  • Encourages Generalization: By perturbing the embeddings, the model learns to focus on higher-level features within the training data rather than memorizing fine-grained details.
 
  • Prevents Overfitting: NEFTune reduces the risk of overfitting to the training set, particularly for smaller datasets, by forcing the model to adapt to slight variations in the data.
 
  • Improves Performance: Adding noise has been shown to improve the fine-tuned model’s ability to perform on unseen data, yielding better results on downstream tasks.
 
Implementing NEFTune is also very simple:
 
To implement NEFTune, you can incorporate it directly into your training process by modifying the trainer’s configuration. The key is to add the neftune_noise_alpha parameter, which specifies the intensity of the noise added to the embeddings.

from trl import SFTTrainer

trainer = SFTTrainer(
    # some parameters here
    neftune_noise_alpha=5
)
When fine-tuning the model, this noise is injected into the token embedding layer, which is exactly what allows the model to generalize better and learn higher-level features from the dataset.
 
