
Supervised Fine-Tuning 101: Strategies Every ML Engineer Should Know

APRIL 9th, 2025

Supervised fine-tuning is a method used to adapt a pre-trained language model to specific tasks using labeled data. Unlike unsupervised techniques, supervised fine-tuning relies on a dataset of pre-validated responses, which is its key distinguishing feature. During supervised fine-tuning, the model’s weights are adjusted using supervised learning techniques. The adjustment process uses task-specific loss calculations, which measure how much the model’s output differs from the known correct answers (ground truth labels). This helps the model learn the specific patterns and nuances needed for its intended task.
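To make the loss calculation concrete, here is a minimal sketch in PyTorch; the tensor shapes and names are illustrative placeholders, not from any particular library. The model’s token predictions are compared against the ground-truth token IDs with cross-entropy, the loss most commonly used when fine-tuning language models:

import torch
import torch.nn.functional as F

# Hypothetical shapes: logits come from the model's forward pass,
# labels are the token IDs of the validated response.
batch_size, seq_len, vocab_size = 2, 8, 32000
logits = torch.randn(batch_size, seq_len, vocab_size)
labels = torch.randint(0, vocab_size, (batch_size, seq_len))

# Task-specific loss: how far the model's output is from the ground truth.
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1))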
 

Here’s a practical example:

 
Consider a pre-trained LLM responding to “Can you summarize this research paper for me?” with a basic response like “The paper discusses the effects of caffeine on sleep patterns and finds negative correlations between evening coffee consumption and sleep quality.” While technically accurate, this response might not meet the standards for an academic writing assistant, which typically requires proper structure, methodology highlights, and key findings in context.
 
This is where supervised fine-tuning becomes valuable. By training the model on a set of validated examples that demonstrate proper academic summaries, the model learns to provide more comprehensive and appropriate responses. These examples might include structured sections (introduction, methodology, results, conclusions), critical analysis of the methodology, limitations of the study, and connections to related research – all following academic writing conventions.
 
Through supervised fine-tuning with validated training examples, the model learns to enhance its responses while maintaining accuracy, effectively adapting its output style to match the requirements of academic writing and research analysis.
 

Supervised Fine-Tuning Strategies

 
In this blog, we will be discussing useful supervised fine-tuning techniques that can help you effectively adapt pre-trained language models for your specific use cases. These techniques will enable you to achieve better performance and more consistent outputs aligned with your requirements.
 
 

Full Parameter Fine-Tuning

 
Full parameter fine-tuning is one approach to adapting pre-trained language models for specific tasks. This technique involves updating all or most of the model’s parameters during the fine-tuning process. Among the various methods available, standard gradient descent fine-tuning and Layer-wise Fine-Tuning (LIFT) are two primary approaches, each with its own characteristics and use cases.
 
 

Standard Gradient Descent Fine-Tuning

Standard gradient descent fine-tuning is the most straightforward approach, where all model parameters are updated simultaneously using gradient descent optimization (see the fine-tuning explanation chapter). This method requires significant computational resources, as it processes and updates the entire model at once. While this approach can be highly effective, it carries the risk of catastrophic forgetting, where the model might lose some of its previously learned general knowledge while adapting to the new task. To implement standard fine-tuning effectively, practitioners typically use optimizers such as Adam or SGD.
 

Here’s how you would implement standard gradient descent fine-tuning in practice:

import torch
from torch.optim import AdamW

def standard_finetuning(model, train_dataloader, num_epochs, learning_rate=2e-5):
    # Every parameter is handed to the optimizer, so all weights are updated.
    optimizer = AdamW(model.parameters(), lr=learning_rate)

    model.train()
    for epoch in range(num_epochs):
        total_loss = 0
        for batch in train_dataloader:
            optimizer.zero_grad()
            outputs = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['labels']
            )
            loss = outputs.loss  # task-specific loss computed against the ground truth labels
            total_loss += loss.item()
            loss.backward()
            optimizer.step()

        avg_loss = total_loss / len(train_dataloader)
        print(f"Epoch {epoch+1}, Average Loss: {avg_loss:.4f}")
When implementing standard fine-tuning, it’s crucial to use a small learning rate, typically between 2e-5 and 5e-5, to prevent drastic changes to the model’s parameters. Gradient clipping prevents exploding gradients, while warmup steps allow the learning rate to increase gradually. Regular monitoring of validation loss helps prevent overfitting, ensuring the model maintains its generalization capabilities.
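Here is a minimal sketch of how warmup is typically wired in, assuming the Hugging Face transformers library and the optimizer, num_epochs, and train_dataloader from the function above:

from transformers import get_linear_schedule_with_warmup

num_training_steps = num_epochs * len(train_dataloader)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # e.g., warm up over the first 10% of steps
    num_training_steps=num_training_steps
)
# Call scheduler.step() after each optimizer.step() inside the training loop.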
 

Layer-wise Fine-Tuning (LIFT)

 
This strategy takes a more nuanced approach to model adaptation. Instead of updating all parameters simultaneously, LIFT fine-tunes the model layer by layer, starting from the top layers and gradually moving down to the lower layers. This offers better preservation of general language understanding and provides a more controlled adaptation process. The risk of catastrophic forgetting is reduced as the model’s fundamental language understanding, typically encoded in lower layers, remains more stable during the initial stages of fine-tuning.
 

The implementation of LIFT requires a more sophisticated code structure:

import torch
from torch.optim import AdamW

class LayerwiseFineTuner:
    def __init__(self, model, num_layers_per_stage=2):
        self.model = model
        self.num_layers_per_stage = num_layers_per_stage
        # Assumes a BERT-style model; adjust the attribute path for other architectures.
        self.total_layers = len(list(model.bert.encoder.layer))

    def lift_finetuning(self, train_dataloader, num_epochs_per_stage, learning_rate=2e-5):
        for stage in range(0, self.total_layers, self.num_layers_per_stage):
            # Work from the top of the encoder down: only this stage's layers are trainable.
            start_layer = max(self.total_layers - stage - self.num_layers_per_stage, 0)
            end_layer = self.total_layers - stage

            for i, layer in enumerate(self.model.bert.encoder.layer):
                trainable = start_layer <= i < end_layer
                for param in layer.parameters():
                    param.requires_grad = trainable

            # Only parameters that still require gradients (this stage's layers,
            # plus any non-encoder parameters such as the task head) are optimized.
            optimizer = AdamW(
                [p for p in self.model.parameters() if p.requires_grad],
                lr=learning_rate
            )

            for epoch in range(num_epochs_per_stage):
                total_loss = 0
                self.model.train()

                for batch in train_dataloader:
                    optimizer.zero_grad()
                    outputs = self.model(
                        input_ids=batch['input_ids'],
                        attention_mask=batch['attention_mask'],
                        labels=batch['labels']
                    )
                    loss = outputs.loss
                    total_loss += loss.item()
                    loss.backward()
                    optimizer.step()

                avg_loss = total_loss / len(train_dataloader)
                print(f"Stage {stage // self.num_layers_per_stage + 1}, "
                      f"Epoch {epoch+1}, Average Loss: {avg_loss:.4f}")

Choosing Between Standard Fine-Tuning and LIFT

 
When choosing between these approaches, several factors come into play. Dataset size significantly influences the choice – smaller datasets often benefit from LIFT’s more controlled approach, while larger datasets might achieve better results with standard fine-tuning. Computational resources also play a crucial role, as LIFT allows for more controlled resource usage despite taking longer overall. The specificity of the task and the size of the model should also inform this decision, with more complex adaptations potentially benefiting from LIFT’s progressive approach.
 
The choice between them ultimately depends on the specific requirements of your project, including available computational resources, dataset characteristics, and the desired balance between training time and adaptation quality.
 

Half Fine-Tuning (HFT)

 
Half Fine-Tuning represents an efficient compromise between full parameter fine-tuning and feature extraction approaches for adapting pre-trained language models. This methodology strategically updates only a portion of the model’s parameters while keeping others frozen, mitigating the issue of catastrophic forgetting. An important technique within HFT is Selective Layer Freezing.
 

Selective Layer Freezing

 
Selective Layer Freezing strategically freezes certain layers of the neural network while allowing others to be updated during the training process. This technique is founded on the observation that different layers in a neural network capture different levels of abstraction, with lower layers typically capturing more general features and higher layers handling task-specific features. By selectively freezing layers, we can maintain the model’s foundational knowledge while adapting specific components for new tasks.

Freezing Strategy

 
The implementation of Selective Layer Freezing requires careful consideration of which layers should be frozen. There are several strategies available, with the two most commonly used being alternate freezing and freezing the lower half of the layers. Let’s dive into how to implement these strategies.
 
 
  • Alternate Freezing: In the alternate freezing strategy, every even-numbered layer is frozen, as these layers tend to capture more general, lower-level abstractions. By freezing these layers and allowing the remaining layers to adapt, the model retains its foundational knowledge while focusing task-specific learning in the layers that stay trainable.
 
  • Freezing the Lower Half: In the lower-half freezing strategy, the first half of the model’s layers are frozen, preserving their general knowledge. Meanwhile, the higher layers, which are responsible for more task-specific features, remain trainable. This allows the model to leverage its pre-existing knowledge while tailoring the higher-level features to suit the new task.
 
 
Here’s a comprehensive implementation example that demonstrates the core concepts:
 
To begin, define the SelectiveLayerFineTuner class, which handles Selective Layer Freezing. The __init__ method receives two parameters: the model to fine-tune and a freeze_pattern dictating the freezing strategy. Upon creation, it invokes setup_layer_freezing to determine which model layers to freeze.

class SelectiveLayerFineTuner:
    def __init__(self, model, freeze_pattern='alternate'):
        self.model = model
        self.freeze_pattern = freeze_pattern
        self.setup_layer_freezing()

Next, we configure the layer freezing: the model’s transformer layers are accessed and frozen according to the chosen strategy. The method first retrieves all layers of the transformer model. For the alternate pattern, it freezes every even-numbered layer, as they tend to capture lower-level abstractions. For the lower_half pattern, the method freezes the lower half of the layers, preserving their foundational knowledge while keeping the higher layers trainable for task-specific adaptations. Each frozen layer’s parameters are set to requires_grad = False, ensuring they remain unchanged during training.

    def setup_layer_freezing(self):
        # Get all transformer layers (assuming a BERT-style model here;
        # adjust the attribute path for your architecture)
        layers = self.model.bert.encoder.layer

        if self.freeze_pattern == 'alternate':
            # Freeze alternate layers
            for i, layer in enumerate(layers):
                if i % 2 == 0:  # Freeze even-numbered layers
                    for param in layer.parameters():
                        param.requires_grad = False

        elif self.freeze_pattern == 'lower_half':
            # Freeze lower half of the layers
            num_layers = len(layers)
            for i, layer in enumerate(layers):
                if i < num_layers // 2:
                    for param in layer.parameters():
                        param.requires_grad = False

Fine-Tuning the Model

 
Once the selective layer freezing is set up, the next step is to define the training loop for fine-tuning the model. You’ll need to choose an appropriate loss function and optimization algorithm, configure the learning rate, and set the number of epochs. During the training process, only the layers that are not frozen will have their parameters updated, while the frozen layers will retain their original values.
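Here is a minimal sketch of such a loop, assuming the SelectiveLayerFineTuner defined above and a Hugging Face-style model whose forward pass returns a loss. The key point is that the optimizer is built only from parameters that still require gradients, so the frozen layers stay untouched:

from torch.optim import AdamW

finetuner = SelectiveLayerFineTuner(model, freeze_pattern='lower_half')

# Only parameters left unfrozen by setup_layer_freezing are handed to the optimizer.
optimizer = AdamW(
    [p for p in finetuner.model.parameters() if p.requires_grad],
    lr=2e-5
)

finetuner.model.train()
for epoch in range(num_epochs):
    for batch in train_dataloader:
        optimizer.zero_grad()
        outputs = finetuner.model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            labels=batch['labels']
        )
        outputs.loss.backward()
        optimizer.step()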

Parameter-Efficient Fine-Tuning (PEFT)

 
 
Parameter-efficient fine-tuning (PEFT) is a method designed to adapt large pre-trained models for specific tasks while minimizing the number of parameters that need to be updated. Unlike traditional fine-tuning approaches, such as full fine-tuning or “half fine-tuning,” where you freeze some layers and update the rest of the model, PEFT focuses on freezing most of the model’s parameters while only modifying a small subset of them. This could include the addition of task-specific adapters or updates to certain layers, significantly reducing the number of parameters that need to be trained.
 
The concept of Parameter-Efficient Fine-Tuning (PEFT) has significantly lowered the barriers to applying large language models (LLMs) in development. This has sparked a wide range of research into various methods for achieving PEFT. These methods can be classified into three main categories:
 
 
  • Selective Fine-Tuning: This approach focuses on updating a carefully chosen subset of a pre-trained model’s parameters, rather than fine-tuning the entire model. This method enables more efficient adaptation to specific tasks.
 
  • Additive Fine-Tuning: New modules are added to the pre-trained model for fine-tuning. These modules are then trained to incorporate domain-specific knowledge, allowing the model to adapt to new tasks while preserving the original model’s capabilities.
 
  • Reparameterization: A low-dimensional representation is created for specific model components, reducing the complexity of the fine-tuning process by working with a much smaller set of parameters; LoRA, sketched below, is a well-known example.
 
In the next blog, we will be exploring a few of the most effective techniques for applying parameter-efficient fine-tuning (PEFT).
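As a brief preview of the reparameterization category, here is a minimal sketch using Hugging Face’s peft library to apply LoRA to a pre-trained model. The base model and target_modules below are assumptions for illustration; the module names to target depend on the architecture:

from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")  # any causal LM works here

lora_config = LoraConfig(
    r=8,                        # rank of the low-dimensional update matrices
    lora_alpha=16,              # scaling factor for the LoRA updates
    target_modules=["c_attn"],  # GPT-2's attention projection; model-specific
    lora_dropout=0.05,
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights remain trainable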
 
 
 
