

Fine-Tuning LLaMA-3 for Psychology Question Answering Using LoRA and Unsloth

Dec 5th, 2024


This step-by-step guide walks you through the fine-tuning process of the LLaMA-3 model for psychology question answering using LoRA (Low-Rank Adaptation) and Unsloth. By following this notebook, you’ll learn how to efficiently adapt a large language model for a specialized task in psychology. Whether you’re a machine learning enthusiast or a psychology researcher, this guide provides a practical approach to creating a robust question-answering system.

 

You can find the Google Colab notebook at this link: https://colab.research.google.com/drive/1gqZ2PsjbjkNfDV3ixMiv7Ka7zglF6fQr?usp=sharing


Step 1: Install Necessary Libraries

The following cell installs essential libraries for fine-tuning.

  • Unsloth: A framework for efficient model management and fine-tuning.

  • Xformers: Optimizes attention mechanisms for handling large sequences.

  • TRL (Transformer Reinforcement Learning): Provides training utilities such as the SFTTrainer used below, along with reinforcement learning-based tuning tools.

  • PEFT (Parameter Efficient Fine-Tuning): Reduces memory requirements during fine-tuning by updating only a subset of model parameters.

  • BitsAndBytes: Enables efficient quantization techniques, such as 4-bit precision.

Using these libraries ensures our environment is equipped for both memory-efficient and effective LLM fine-tuning.

				
%%capture
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes
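Before loading anything, it's worth confirming that a GPU is actually visible to the runtime. This quick sanity check is not part of the original notebook, just a plain PyTorch call:

import torch

# Confirm a CUDA-capable GPU is available before fine-tuning
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "No GPU found")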
				
			

Step 2: Load Pretrained Model and Tokenizer

What Is Unsloth and What Is It Used For?

 

Unsloth is a streamlined framework designed to simplify and optimize the process of working with large language models (LLMs). Think of it as your ultimate toolkit for fine-tuning and deploying LLMs efficiently and easily.

				
from unsloth import FastLanguageModel
import torch

max_seq_length = 2048  # Maximum context length used for training and inference
dtype = None           # None lets Unsloth auto-detect (float16 on T4/V100, bfloat16 on Ampere+)
load_in_4bit = True    # Load the model in 4-bit precision to reduce memory usage
				
			

Why Did We Load the Model from Unsloth?

 

We loaded the model from Unsloth because it provides a pre-configured, optimized environment tailored for efficient LLM fine-tuning and deployment. Here’s why this choice makes sense:

  • Memory Efficiency: Unsloth’s models are pre-quantized, often using techniques like 4-bit precision, which significantly reduces memory requirements without compromising performance.
  • Long-Context Support: The framework incorporates advanced features like RoPE (Rotary Position Embedding) scaling, making it ideal for tasks requiring long input sequences.
  • Fine-Tuning Ready: Models from Unsloth are designed with parameter-efficient techniques in mind, ensuring smooth integration with LoRA and QLoRA.
  • Ease of Use: By handling complex setups internally, Unsloth eliminates the need for extensive manual configurations, saving time and reducing errors.

The model loaded here is “unsloth/llama-3-8b-bnb-4bit,” a lightweight, pre-quantized 4-bit variant of LLaMA-3 8B that remains powerful enough for demanding language tasks.

				
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/llama-3-8b-bnb-4bit",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
				
			

Step 3: Apply Parameter-Efficient Fine-Tuning (LoRA)

What is LoRA?

LoRA (Low-Rank Adaptation) is a method that fine-tunes two smaller matrices instead of the entire weight matrix of a pre-trained LLM. These smaller matrices form a LoRA adapter, which is then applied to the original LLM. The fine-tuned adapter is much smaller in size compared to the original model, often only a small percentage of its size. During inference, this LoRA adapter is combined with the original LLM.
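To make the idea concrete, here is a minimal sketch of the low-rank update, with illustrative shapes and values only (this is not the notebook's actual code):

import torch

d, k, r = 4096, 4096, 16      # hypothetical weight shape and LoRA rank
W = torch.randn(d, k)         # frozen pretrained weight: never updated
A = torch.randn(r, k) * 0.01  # trainable low-rank factor A
B = torch.zeros(d, r)         # trainable low-rank factor B, initialized to zero
alpha = 16                    # scaling factor (lora_alpha)

x = torch.randn(k)
# Forward pass: original projection plus the scaled low-rank correction
y = W @ x + (alpha / r) * (B @ (A @ x))

# The adapter stores only A and B: (r*k + d*r) numbers instead of d*k
print(f"full: {W.numel():,}  adapter: {A.numel() + B.numel():,}")

Only A and B receive gradients during fine-tuning; W stays frozen, which is where the memory savings come from.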

 

In our notebook, LoRA adapters are applied with the following settings:

  • Target Modules: Defines which parts of the model are fine-tuned, like query, key, and value projections.
  • LoRA Alpha & Dropout: Control the adaptation strength and regularization.
  • Gradient Checkpointing: Reduces memory usage during training by recomputing intermediate states.
  • Random State: Ensures reproducibility.

 

This step ensures that we are only modifying a small fraction of all parameters; a quick way to verify the exact fraction is shown right after the code below.

 

Parameter-efficient fine-tuning (PEFT) is a more efficient form of instruction-based fine-tuning. Full LLM fine-tuning is resource-intensive, demanding considerable computational power, memory, and storage. PEFT addresses this by updating only a select set of parameters while keeping the rest frozen. This reduces the memory load during training and prevents the model from forgetting previously learned information. PEFT is particularly useful when fine-tuning for multiple tasks. Among the common techniques to achieve PEFT, LoRA and QLoRA are widely recognized for their effectiveness.

				
model = FastLanguageModel.get_peft_model(
    model,
    r = 16,  # LoRA rank: dimensionality of the low-rank matrices
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,   # rank-stabilized LoRA, not used here
    loftq_config = None,  # LoftQ initialization, not used here
)

model  # Display the PEFT-wrapped model to confirm the adapters were attached
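To verify how small that fraction actually is, you can count the trainable parameters directly. This is a quick sanity check, not part of the original notebook:

# Count trainable vs. total parameters after attaching the LoRA adapters
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable: {trainable:,} / {total:,} ({100 * trainable / total:.2f}%)")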
				
			

Step 4: Load and Preprocess the Dataset for Fine-Tuning

What is an LLM Dataset?

 
An LLM dataset is a collection of text data used for training and fine-tuning language models. These datasets contain various types of text, such as questions, answers, documents, or dialogues, and are tailored for specific tasks or domains. The quality of the dataset significantly influences the model’s performance and accuracy.
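For a question-answering dataset like the one used later in this notebook, each record boils down to a pair of fields, along these lines (an illustrative example, not an actual record from the dataset):

# Hypothetical record from a psychology QA dataset
example = {
    "question": "Who is the founder of the psychoanalytic theory?",
    "answer": "Sigmund Freud is the founder of psychoanalytic theory.",
}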

Types of Datasets for Fine-Tuning LLMs

 
  • Text Classification Datasets: These datasets help train models to categorize text into predefined categories like sentiment analysis, topic classification, or spam detection.

  • Text Generation Datasets: These consist of prompts and corresponding responses, useful for training models to generate contextually appropriate and coherent text.

  • Summarization Datasets: These datasets contain long documents paired with summaries, designed to train models to generate or refine summaries.

  • Question-Answering Datasets: These datasets include questions and their correct answers, often derived from FAQs, support dialogues, or knowledge bases.

  • Masked Language Modeling Datasets: These are used to train models with masked language modeling (MLM), where parts of the text are hidden and the model predicts the missing words or tokens. This method is crucial in the pre-training phase for models like BERT.

  • Instruction Fine-Tuning Datasets: These datasets consist of instructions paired with expected responses, guiding the model to execute tasks based on user commands.

  • Conversational Datasets: These datasets are designed for training dialogue models, with conversations between users and systems or among multiple users.

  • Named Entity Recognition (NER) Datasets: These datasets teach models to identify and categorize entities like names, locations, dates, etc.

 

When we want to fine-tune a model for a specific use case, we can create a custom dataset that falls into any of the above categories. By curating a dataset tailored to the task at hand, we ensure the model is fine-tuned on the most relevant data for the target use case, leading to more accurate and effective results. Below, we define the prompt template used to format each training example:

 
				
chat_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""
				
			

 

The following function applies this template to every example, ensuring the input data aligns with the model’s requirements and reducing inconsistencies during training.

				
EOS_TOKEN = tokenizer.eos_token  # Must append the EOS token, otherwise generation will run on forever
def formatting_prompts_func(examples):
    instruction = ""
    questions = examples["question"]
    answers = examples["answer"]
    texts = []
    for question, answer in zip(questions, answers):
        text = chat_prompt.format(instruction, question, answer) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
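To see exactly what the model will train on, you can run the function on a single hand-written example (the question and answer here are hypothetical, purely for illustration):

# Preview one formatted training example
sample = formatting_prompts_func({
    "question": ["What is cognitive dissonance?"],
    "answer": ["Cognitive dissonance is the discomfort felt when holding conflicting beliefs."],
})
print(sample["text"][0])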
				
			

Hugging Face Datasets

 

Hugging Face offers a vast array of pre-existing datasets that are perfect for fine-tuning models across various domains. These datasets are well-curated, diverse, and readily available, making them an excellent resource for quickly getting started with model fine-tuning.

 

In this notebook, we’ve chosen a psychology-focused question-answering dataset from Hugging Face. This dataset includes pairs of psychological questions and their corresponding answers, making it ideal for fine-tuning a model on psychology-related queries.

				
from datasets import load_dataset

dataset = load_dataset("BoltMonkey/psychology-question-answer", split = "train")
dataset = dataset.map(formatting_prompts_func, batched = True,)

dataset

import pprint
# Here are a few examples of what the data looks like
pprint.pprint(dataset[250])
pprint.pprint(dataset[260])
pprint.pprint(dataset[270])
				
			

Step 5: Configure the Training Parameters

 

Now let’s configure the trainer, passing in the model, tokenizer, and dataset along with key training settings like batch size, gradient accumulation steps, learning rate, and maximum training steps. The TrainingArguments specify additional configurations, such as optimization with the AdamW optimizer, weight decay, and logging frequency.

 

  • Learning Rate: Controls the speed at which the model updates during training.

  • Batch Size: The number of samples processed in one iteration.

  • Epochs: The number of times the model passes through the entire training dataset.

  • Logging Directory: Specifies where to store training logs, useful for monitoring progress.

 

The Trainer simplifies the configuration and management of the fine-tuning process, ensuring a balanced and efficient setup. We train for only 60 steps to keep fine-tuning fast; for a full training run, set num_train_epochs = 1 and remove the max_steps limit.

				
from trl import SFTTrainer
from transformers import TrainingArguments

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = True,  # Pack short sequences together to speed up training
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,  # Effective batch size = 2 * 4 = 8
        warmup_steps = 5,
        max_steps = 60,
        learning_rate = 2e-4,
        fp16 = not torch.cuda.is_bf16_supported(),
        bf16 = torch.cuda.is_bf16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
				
			

Step 6: Start Training

 

The line trainer_stats = trainer.train() initiates the fine-tuning process using the SFTTrainer. It triggers the training loop, where the model learns from the provided dataset based on the configurations defined earlier. During training we can see the loss decreasing, which means the model is learning and improving its performance. In machine learning, loss represents how well the model’s predictions match the actual target values; when the loss decreases over time, it indicates that the model is gradually adjusting its parameters to make more accurate predictions.

				
trainer_stats = trainer.train()
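Once training finishes, trainer.train() returns a TrainOutput object whose metrics you can inspect, for example:

# Basic run statistics from the returned TrainOutput
print(trainer_stats.metrics)  # includes train_runtime, train_loss, etc.
print(f"final training loss: {trainer_stats.training_loss:.4f}")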
				
			

Step 7: Perform Inference with the Fine-Tuned Model to Evaluate Output

 

When we generate text with the fine-tuned model, the output typically includes the full structure of the input prompt along with the model’s response. For instance, the output for this example looks like this:

				
FastLanguageModel.for_inference(model)  # Enable Unsloth's faster inference mode

inputs = tokenizer(
    [
        chat_prompt.format(
            "",  # instruction - leave this blank!
            "Who is the founder of the psychoanalytic theory?",  # input
            "",  # output - leave this blank!
        )
    ],
    return_tensors = "pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens = 64, use_cache = True)
tokenizer.batch_decode(outputs)
				
			

To extract and print only the response part of the output, we can post-process the decoded output string. Here’s the updated code:

 
				
FastLanguageModel.for_inference(model)

inputs = tokenizer(
    [
        chat_prompt.format(
            "",  # instruction
            "Who is the founder of the psychoanalytic theory? ",  # input
            "",  # output
        )
    ],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
decoded_output = tokenizer.batch_decode(outputs)[0]  # Decode the output

# Extracting the response part
response = decoded_output.split("### Response:")[-1].strip()  # Get text after "### Response:"
response = response.split("<|end_of_text|>")[0].strip()  # Remove the end token if present

print(response)
				
			

Comparing response to dataset answer:

 

As you can see, when we asked the same question twice, the model gave the same answer but worded differently each time. This shows that the model is relying on the dataset while also being able to generalize.

				
FastLanguageModel.for_inference(model)

inputs = tokenizer(
    [
        chat_prompt.format(
            "",  # instruction
            "Who proposed the concept of self-efficacy?",  # input
            "",  # output
        )
    ],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
decoded_output = tokenizer.batch_decode(outputs)[0]

response = decoded_output.split("### Response:")[-1].strip()
response = response.split("<|end_of_text|>")[0].strip()

print(response)
				
			

Comparing response to dataset answer:

 

In this example, the model used knowledge from the dataset but with variation and extension. This behavior demonstrates that the model has genuinely learned patterns and concepts from the data rather than merely memorizing it.

  

				
FastLanguageModel.for_inference(model)

inputs = tokenizer(
    [
        chat_prompt.format(
            "",  # instruction
            "Who is known for their work on classical conditioning?",  # input
            "",  # output
        )
    ],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
decoded_output = tokenizer.batch_decode(outputs)[0]

response = decoded_output.split("### Response:")[-1].strip()
response = response.split("<|end_of_text|>")[0].strip()

print(response)
				
			

Comparing response to dataset answer:

The model is also answering in the same style by keeping the responses short and straight to the point. This consistency in style indicates that the model not only learned the content from the dataset but also the tone and structure of the responses, allowing it to generate answers that align with the desired format.
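We decode greedily above, so outputs are deterministic for a given prompt. If you want more varied phrasings during evaluation, you can enable sampling with the standard transformers generation flags (the values here are illustrative):

# Sampling-based generation for more varied, less deterministic answers
outputs = model.generate(
    **inputs,
    max_new_tokens = 64,
    use_cache = True,
    do_sample = True,   # sample instead of greedy decoding
    temperature = 0.7,  # lower = more focused, higher = more diverse
    top_p = 0.9,        # nucleus sampling cutoff
)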

Step 8: Saving and Loading the Fine-Tuned Model

				
model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
# model.push_to_hub("your_name/lora_model", token = "...")  # If you want to save online
# tokenizer.push_to_hub("your_name/lora_model", token = "...")  # If you want to save online


# Let's zip our model folder
import shutil

folder_path = "/content/lora_model"
zip_file_path = "/content/lora_model.zip"

shutil.make_archive(zip_file_path.replace(".zip", ""), "zip", folder_path)

# If we want to use our model again, we can just load it
# (flip this to True in a fresh session to reload the saved adapters):
if False:
    from unsloth import FastLanguageModel
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name = "lora_model",  # the folder we saved to above
        max_seq_length = max_seq_length,
        dtype = dtype,
        load_in_4bit = load_in_4bit,
    )
    FastLanguageModel.for_inference(model)

chat_prompt = """
### Instruction:
{}

### Input:
{}

### Response:
{}"""


inputs = tokenizer(
    [
        chat_prompt.format(
            "",  # instruction
            "What is heuristics?",  # input
            "",  # output
        )
    ],
    return_tensors="pt"
).to("cuda")

outputs = model.generate(**inputs, max_new_tokens=64, use_cache=True)
decoded_output = tokenizer.batch_decode(outputs)[0]

response = decoded_output.split("### Response:")[-1].strip()
response = response.split("<|end_of_text|>")[0].strip()

print(response)
				
			

Comparing response to dataset answer:
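Finally, if you plan to run the model outside Python, Unsloth can also export the fine-tuned model to GGUF for llama.cpp-style runtimes. This step is not in the original notebook; the API below reflects recent Unsloth versions, and the output directory name is up to you:

# Optional: export a quantized GGUF file for llama.cpp / local runtimes
model.save_pretrained_gguf("model", tokenizer, quantization_method = "q4_k_m")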

And that’s a wrap! Keep exploring, keep learning, and enjoy the process.
