
Fine-Tuning LLAMA 3 Model for Relation Extraction Using UBIAI Data

July 8th, 2024

In the rapidly evolving field of Natural Language Processing (NLP), the ability to extract meaningful relationships from unstructured text has become increasingly crucial. This article delves into the process of fine-tuning Large Language Models (LLMs), specifically LLAMA 3, for the task of Relation Extraction. We’ll explore how to leverage data annotated using UBIAI, a cutting-edge annotation platform, to enhance the model’s performance in identifying and classifying semantic relationships within text.

What is Relation Extraction?

Relation Extraction stands at the forefront of NLP tasks, serving as a bridge between unstructured text and structured knowledge. This process involves identifying entities within a text and determining the semantic relationships that connect them. For instance, in the sentence “Tesla, founded by Elon Musk, is revolutionizing the electric vehicle industry,” a relation extraction system would identify the entities “Tesla,” “Elon Musk,” and “electric vehicle industry,” and extract relationships such as “founded by” and “revolutionizing.”
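To make this concrete, here is a toy sketch of the kind of structured output a relation extraction system aims to produce for that sentence. The entity and relation labels below are illustrative assumptions, not a fixed schema:

# Illustrative only: entity types and relation names are assumptions, not a fixed schema.
sentence = "Tesla, founded by Elon Musk, is revolutionizing the electric vehicle industry."

entities = [
    ("Tesla", "ORG"),
    ("Elon Musk", "PERSON"),
    ("electric vehicle industry", "INDUSTRY"),
]

# Each relation links a head entity to a tail entity with a label.
relations = [
    ("Tesla", "founded by", "Elon Musk"),
    ("Tesla", "revolutionizing", "electric vehicle industry"),
]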

Large Language Models (LLMs)

The advent of Large Language Models has ushered in a new era of NLP capabilities. These sophisticated models, trained on vast corpora of text, possess an innate understanding of language structures and semantics. LLAMA 3, a state-of-the-art LLM, exemplifies this power with its ability to comprehend and generate human-like text across diverse domains. By fine-tuning LLAMA 3 for relation extraction, we can harness its deep language understanding to extract nuanced relationships from text with unprecedented accuracy.

Data Preparation for Fine-Tuning

Preparing Data: The UBIAI Advantage

At the heart of any successful machine learning project lies high-quality data. This is where UBIAI shines as a game-changing annotation platform. UBIAI goes beyond traditional annotation tools by offering:

  1. Advanced Document Processing: UBIAI excels in extracting text from various document formats, maintaining structural integrity and contextual information crucial for relation extraction tasks.
  2. Intuitive Annotation Interface: The platform provides a user-friendly environment for annotators to effortlessly identify entities and define relationships, ensuring consistent and accurate labeling.
  3. Quality Control Mechanisms: UBIAI incorporates built-in validation tools and inter-annotator agreement features, significantly enhancing the reliability of the annotated dataset.
  4. Customizable Annotation Schemas: Users can define custom entity types and relationship categories, tailoring the annotation process to specific domains or use cases.
  5. Collaborative Workflows: UBIAI supports team-based annotation projects, allowing for efficient distribution of tasks and seamless collaboration among annotators.

By leveraging UBIAI’s powerful features, researchers and data scientists can create high-fidelity datasets specifically designed for training relation extraction models. This meticulously annotated data serves as the foundation for fine-tuning LLAMA 3, enabling it to excel in extracting complex relationships from text across various domains.

 

Data Preprocessing

After exporting the annotated data from UBIAI in JSON format, we need to preprocess it to match the format required for fine-tuning the LLAMA 3 model. Here’s a Python script that demonstrates this process:

import json
import pandas as pd

def preprocess_json(data, possible_relationships):
    # Extract the relevant information
    document = data['document']
    tokens = data['tokens']
    relations = data['relations']

    # Create a mapping of token index to its text and entity label
    token_info = {i: {'text': t['text'], 'label': t['entityLabel']} for i, t in enumerate(tokens)}

    # Format the entities and relationships
    entities = [(t['text'], t['entityLabel']) for t in tokens]
    formatted_entities = ", ".join([f"{text} ({label})" for text, label in entities])

    formatted_relations = []
    for r in relations:
        child_index = r['child']
        head_index = r['head']

        if child_index < len(tokens) and head_index < len(tokens):
            child = token_info[child_index]['text']
            head = token_info[head_index]['text']
            relation_label = r['relationLabel']
            formatted_relations.append(f"{child} -> {head} ({relation_label})")

    formatted_relations = "; ".join(formatted_relations)

    # Create the formatted prompt and response; the role markers
    # (system/user/assistant) mirror the chat-style format used for fine-tuning
    prompt = f"systemExtract relationships between entities from the following text.user Text: \"{document}\" Entities: {formatted_entities}. Possible relationships: {', '.join(possible_relationships)}."
    response = f"assistantThe relations between the entities: {formatted_relations}"
    full_prompt = prompt + response
    return full_prompt

# List of possible relationships (customize as needed)
possible_relationships = ["MUST_HAVE", "REQUIRES", "NICE_TO_HAVE"]

input_path = "/content/UBIAI_REL_data.json"
# Read the input JSON file
with open(input_path, 'r') as file:
    data = json.load(file)

# Preprocess all annotated records
data = [preprocess_json(j, possible_relationships) for j in data]

# Convert to a DataFrame
df = pd.DataFrame(data, columns=["text"])

# Save to CSV
df.to_csv('fine_tuning_data.csv', index=False)

This script loads the JSON data, extracts the relevant text, entities, and relationships, and then formats this information into a structure suitable for our model. It creates a prompt-response pair for each data point, which will be used for fine-tuning.
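For reference, a single exported record is expected to look roughly like the sketch below. The shape is inferred from the fields the script reads (document, tokens, relations); the actual UBIAI export may carry additional metadata per token or relation:

# Hypothetical record shape, inferred from the preprocessing script above;
# the real export may include extra fields.
example_record = {
    "document": "1+ years development experience on Java stack ...",
    "tokens": [
        {"text": "1+ years", "entityLabel": "EXPERIENCE"},
        {"text": "development", "entityLabel": "SKILLS"},
    ],
    "relations": [
        {"child": 0, "head": 1, "relationLabel": "EXPERIENCE_IN"},
    ],
}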

Fine-Tuning LLAMA 3

 

Setting Up the Model

To fine-tune the LLAMA 3 model, we first need to import the necessary libraries and set up the model. Here’s the code to do that:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, BitsAndBytesConfig
from peft import prepare_model_for_kbit_training, LoraConfig
from trl import setup_chat_format

model_id = "meta-llama/Meta-Llama-3-8B"

# Tokenizer setup
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True, trust_remote_code=True)

tokenizer.pad_token = tokenizer.eos_token
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.padding_side = 'left'
tokenizer.model_max_length = 2048

# Quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16
)

# Model setup
device_map = {"": torch.cuda.current_device()} if torch.cuda.is_available() else None
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map=device_map,
    quantization_config=bnb_config
)

model, tokenizer = setup_chat_format(model, tokenizer)
model = prepare_model_for_kbit_training(model)

# LoRA configuration

peft_config = LoraConfig(
    lora_alpha=128,
    lora_dropout=0.05,
    r=256,
    bias="none",
    target_modules=["q_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "k_proj", "v_proj"],
    task_type="CAUSAL_LM",
)


This code sets up the LLAMA 3 model with 4-bit quantization and LoRA (Low-Rank Adaptation) for efficient fine-tuning.
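Because the training prompts embed system/user/assistant role markers, it can be useful to inspect the chat template that setup_chat_format attaches to the tokenizer. A quick, optional sanity check (the exact special tokens it prints depend on your trl version):

# Optional: print the formatted prompt for a dummy message to see the
# special tokens the chat template inserts (version-dependent).
example = [{"role": "user", "content": "Extract relationships between entities from the following text."}]
print(tokenizer.apply_chat_template(example, tokenize=False, add_generation_prompt=True))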

 

Training Configuration

Next, we set up the training arguments:

 

from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="sft_model_path",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=2,
    gradient_checkpointing=True,
    optim="adamw_8bit",
    logging_steps=10,
    save_strategy="epoch",
    learning_rate=2e-4,
    bf16=True,
    tf32=True,
    max_grad_norm=0.3,
    warmup_ratio=0.03,
    lr_scheduler_type="constant",
    report_to="tensorboard",
)


These arguments define various aspects of the training process, such as the number of epochs, batch size, learning rate, and optimization strategy.
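One step not shown explicitly is loading the preprocessed data: the trainer in the next section expects a dataset variable with a "text" column. A minimal sketch, assuming the fine_tuning_data.csv file produced by the preprocessing script:

from datasets import load_dataset

# Load the CSV written by the preprocessing step; SFTTrainer reads the
# "text" column specified via dataset_text_field.
dataset = load_dataset("csv", data_files="fine_tuning_data.csv", split="train")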

 

Training the Model

Now we can set up the trainer and start the fine-tuning process:

from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=args,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length=512,
    tokenizer=tokenizer,
    dataset_text_field="text",
    packing=False,
    dataset_kwargs={
        "add_special_tokens": False,
        "append_concat_token": False,
    }
)
trainer.train()
trainer.save_model()


This code sets up the SFTTrainer (Supervised Fine-Tuning Trainer) and starts the training process.
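Because the merge step in the next section reloads the base model in float16, it helps to free the quantized training model from GPU memory first. A minimal sketch, assuming a single-GPU environment such as Colab:

import gc

# Release the quantized model and trainer before reloading the base model for merging.
del model
del trainer
gc.collect()
torch.cuda.empty_cache()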

 

Merging the New Model Weights

After fine-tuning, we merge the new weights with the base model:

 

from peft import PeftModel

base_model = "meta-llama/Meta-Llama-3-8B"
new_model = "/content/REL_finetuned_llm"

base_model_reload = AutoModelForCausalLM.from_pretrained(
    base_model,
    return_dict=True,
    torch_dtype=torch.float16,
    trust_remote_code=True,
)

base_model_reload, tokenizer = setup_chat_format(base_model_reload, tokenizer)

model = PeftModel.from_pretrained(base_model_reload, new_model)
model = model.merge_and_unload()

model.save_pretrained("llama-3-8b-REL")
tokenizer.save_pretrained("llama-3-8b-REL")


This code loads the base model, applies the fine-tuned weights, and saves the merged model.

Inference

Finally, we can use the fine-tuned model for inference:

from transformers import pipeline

messages = [{"role": "user", "content": """Extract relationships between entities from the following text. Text: "1+ years development experience on Java stack AppConnect / API's experience is added advantage. Compute, Network and Storage Monitoring Tools (Ex: Netcool) Application Performance Tools (IBM APM) Cloud operations and Automation Tools (VmWare, ICAM, ...) Proven Record of developing enterprise class products and applications. Preferred Tech and Prof Experience None EO Statement IBM is committed to creating a diverse environment and is proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to race, color, religion, gender, gender identity or expression, sexual orientation, national origin, genetics, disability, age, or veteran status. IBM is also committed to compliance with all fair employment practices regarding citizenship and immigration status. ." Entities: 1+ years (EXPERIENCE), development (SKILLS). Possible relationships: EXPERIENCE_IN, LOCATED_IN, WORKS_FOR, PART_OF, CREATED_BY."""}]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.float16,
    device_map="auto",
)

outputs = pipe(prompt, max_new_tokens=120, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])


Output: 1+ years (EXPERIENCE) -> development (SKILLS) (EXPERIENCE_IN)

 

This code demonstrates how to use the fine-tuned model to extract relationships from a given text.
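Since the model returns relations as plain text in the "child -> head (LABEL)" format used during training, you may want to parse them back into structured triples. A small, hypothetical helper (the parse_relations name and its regular expression are assumptions, not part of the original pipeline):

import re

# Hypothetical post-processing helper: turn "child -> head (LABEL)" strings
# back into (child, head, label) triples.
def parse_relations(generated_text):
    triples = []
    for part in generated_text.split(";"):
        match = re.match(r"(.+?)\s*->\s*(.+)\s+\((\w+)\)\s*$", part.strip())
        if match:
            child, head, label = match.groups()
            triples.append((child.strip(), head.strip(), label))
    return triples

print(parse_relations("1+ years (EXPERIENCE) -> development (SKILLS) (EXPERIENCE_IN)"))
# [('1+ years (EXPERIENCE)', 'development (SKILLS)', 'EXPERIENCE_IN')]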

Conclusion

Fine-tuning LLAMA 3 for relation extraction involves several steps, from data annotation and preprocessing to model setup, fine-tuning, and inference. By following this guide, you can leverage the power of LLAMA 3 to extract meaningful relationships from your text data. The flexibility and robustness of LLAMA 3 make it an excellent choice for various NLP tasks, including relation extraction.

 

This tutorial has demonstrated the end-to-end process using data annotated in UBIAI, showcasing how you can transform raw text into structured information with the help of advanced language models. The combination of UBIAI’s annotation capabilities and LLAMA 3’s powerful language understanding creates a potent tool for extracting valuable insights from unstructured text data.
