
Fine-Tuning Mistral 7B for Named Entity Recognition

Mar 8th, 2024

In the realm of natural language processing (NLP), Named Entity Recognition (NER) stands as a fundamental task, pivotal for various applications ranging from information extraction to question answering systems. With the advent of Mistral 7B, a revolutionary open-source large language model developed by Mistral AI, the landscape of NLP has witnessed a transformative shift. In this article, we embark on a journey to explore the potential of Mistral 7B for NER tasks, shedding light on the intricacies of fine-tuning this state-of-the-art model to excel in entity recognition.

Introduction to Named Entity Recognition

Named Entity Recognition (NER) serves as a cornerstone task in NLP, aiming to identify and classify named entities within text into predefined categories such as person names, locations, organizations, dates, and more. The accurate extraction of named entities plays a pivotal role in various downstream applications, including information retrieval, sentiment analysis, and knowledge graph construction.
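To make the task concrete, here is a purely illustrative example (the sentence and entity lists are invented for illustration) of the input an NER system receives and the structured output it is expected to produce:

text = "Tim Cook visited Paris on June 3, 2023 to meet Apple's local team."

# Expected NER output: extracted entities grouped under their labels
expected_entities = {
    "PERSON": ["Tim Cook"],
    "LOCATION": ["Paris"],
    "DATE": ["June 3, 2023"],
    "ORGANIZATION": ["Apple"],
}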

Traditionally, NER systems relied on handcrafted rules and feature engineering techniques, which often lacked robustness and scalability across diverse domains. However, the advent of deep learning and pre-trained language models like Mistral 7B has revolutionized the field, offering a data-driven approach that leverages vast amounts of text data for enhanced performance.

Forging Ahead: Mistral 7B in Comparison to Other Open Source LLMs

Mistral 7B emerges as a standout player, offering a distinctive combination of performance, efficiency, and accessibility. Unlike its counterparts, Mistral 7B boasts innovative architectural features such as grouped-query attention and sliding window attention, enhancing both inference speed and sequence handling capabilities. These advancements not only set Mistral 7B apart but also underscore its superior adaptability and resilience across diverse natural language processing (NLP) tasks. Furthermore, Mistral 7B’s Apache 2.0 licensing and community-driven development approach foster an environment of collaboration and innovation, solidifying its position as a top choice among researchers and practitioners seeking cutting-edge language understanding models.

Unveiling Mistral 7B: A Game-Changer in NLP

Mistral 7B emerges as a beacon of innovation in the realm of language models, with 7 billion parameters designed to tackle the complexities of natural language. Engineered with features such as grouped-query attention (GQA) and sliding window attention (SWA), Mistral 7B delivers strong performance and efficiency, making it a formidable contender for NLP tasks, including Named Entity Recognition.

The utilization of GQA enables Mistral 7B to expedite inference, facilitating real-time processing of text data—an essential requirement for NER tasks operating in dynamic environments. Furthermore, SWA empowers Mistral 7B to handle sequences of varying lengths with remarkable efficiency, ensuring robust performance across diverse text inputs.
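To build intuition for SWA, the following minimal sketch (an illustration of the general idea, not Mistral's actual implementation) constructs a causal attention mask in which each token may attend only to itself and the window - 1 preceding positions:

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where query position i may attend to key position j:
    # j must be causal (j <= i) and within the last `window` positions
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(6, 3).int())

Each row of this mask has at most window nonzero entries, which is why attention cost grows linearly with sequence length rather than quadratically.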

Fine-Tuning Mistral 7B for NER: Unlocking Its Full Potential

While Mistral 7B showcases remarkable capabilities in its pre-trained state, its true potential shines through the process of fine-tuning, where the model’s parameters are adapted to the nuances of specific tasks and datasets. Fine-tuning Mistral 7B for NER tasks involves further training the model on annotated data, allowing it to learn domain-specific patterns and improve entity recognition accuracy.

Install the necessary Python packages and libraries required for fine-tuning Mistral:

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q trl xformers wandb datasets einops gradio sentencepiece

from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch, wandb, platform, gradio, warnings
from trl import SFTTrainer
from huggingface_hub import notebook_login

# Base checkpoint to fine-tune, and the name under which the fine-tuned adapter will be saved
base_model, new_model = "mistralai/Mistral-7B-v0.1", "ilyesbk/NER_mistral_7b"


Data annotation and structure using UBIAI tools

In the process of fine-tuning large language models like Mistral 7B, the quality and relevance of the data play a pivotal role in determining the model’s performance and efficacy. High-quality, well-annotated data ensures that the model learns from accurate and representative examples, ultimately enhancing its ability to generalize and produce reliable outputs. For this tutorial, the data utilized is sourced from UBIAI, a platform renowned for its robust annotation tools and comprehensive NLP capabilities. Leveraging UBIAI’s annotation capabilities ensures that the data used for fine-tuning is meticulously labeled and tailored to the specific task at hand, thereby maximizing the effectiveness of the training process.

[Screenshot: invoice data labeling in the UBIAI annotation tool]

This code reads JSON data exported from UBIAI, containing annotated invoice documents. It organizes each document into a structured format, using sets to avoid duplicate annotations, and stores the processed results for further analysis and for fine-tuning Mistral on Named Entity Recognition (NER) tasks.

import json

# Read JSON data from file
with open('/content/invoice_data_Ubiai.json', 'r') as file:
    data = json.load(file)

results = []
for document in data:
    document_name = document["documentName"]
    text = document["document"]
    annotations = document["annotation"]

    result = {}

    for annotation in annotations:
        label = annotation["label"]
        text_extracted = annotation["text"]

        if label not in result:
            result[label] = set()  # Use a set to avoid duplicates

        result[label].add(text_extracted)

    # Convert sets back to lists before creating the result entry
    result = {label: list(entities) for label, entities in result.items()}

    result_entry = {"document_name": document_name, "text": text, "result": result}
    results.append(result_entry)

print(len(results))

1371

import json

# Extracting unique labels from all objects
unique_labels = set()
for obj in results:
    unique_labels.update(obj["result"].keys())

print("Unique Labels:", unique_labels)

Unique Labels: {'PERSON', 'PRODUCT', 'FACILITY', 'MONEY', 'EVENT', 'DATE', 'ORGANIZATION', 'LOCATION'}

The data structure consists of documents with associated names, texts, and annotations. Annotations include entities such as persons, organizations, locations, facilities, and events, providing structured information for NER tasks.

print(json.dumps(results[0], indent=2))

{
  "document_name": "gupta_leaks_export_1379.json",
  "text": "98.9.32. 98.9.33. 98.9.34. 43 Mr Zwane is linked to the Guptas. He sent invoices to Sahara Computers that get paid by Linkway. The payments that went to Estina, ultimately, were channelled to the Gupta company; and Gateway and the beneficiaries intended to benefit from this company have not seen what this project was meant to do for them. So it seems, given the urgency at which it was appointed, the R 30 million advanced payment, all this was meant to benefit Gupta related entities and/or the Guptas themselves. Mr Zwane stated that in his view had the management of the project been done well, it would have been regarded as one of the best projects in agriculture to have happened for the people of the Free State. Influence of MEC Elizabeth Rockman .10. 98.11. Ms Rockman admits that her interaction with the Gupta family predated her discussions with them about the Vrede Dairy Project. She was introduced to them when the New Age made a presentation to the Provincial EXCO to get support for advertisements. Thereafter these meetings were fairly frequent and the Premier was aware of such meeting. The interactions continued after Ms Rockman was appointed as MEC Finance and even extended to interaction with the Gupta family about the Vrede Dairy Project outstanding payments related to Estina, which she often facilitated.",
  "result": {
    "PERSON": [
      "Rockman",
      "Gupta",
      "Ms Rockman",
      "Zwane"
    ],
    "ORGANIZATION": [
      "Provincial EXCO",
      "MEC Finance",
      "Estina",
      "people of the Free State",
      "Vrede Dairy Project",
      "Guptas",
      "Sahara Computers"
    ],
    "LOCATION": [
      "Linkway"
    ],
    "FACILITY": [
      "Gateway"
    ],
    "EVENT": [
      "New Age"
    ]
  }
}

The data is now accompanied by a detailed prompt outlining the instructions for entity extraction tasks. Each instruction specifies the exact requirements for extracting entities under different labels, such as DATE, PERSON, ORGANIZATION, LOCATION, FACILITY, EVENT, MONEY, and PRODUCT. The prompt emphasizes accuracy and relevance in responses and provides guidelines to ensure consistency and completeness in the extracted information. This structured prompt enables efficient processing of the data for entity recognition tasks, aligning with the objectives of the NER (Named Entity Recognition) process.

prompt = """
    Extract the entities for the following labels from the given text and provide the results in JSON format
    - Entities must be extracted exactly as mentioned in the text.
    - Return each entity under its label without creating new labels.
    - Provide a list of entities for each label, ensuring that if no entities are found for a label, an empty list is returned.
    - Accuracy and relevance in your responses are key.

    Labels and their Descriptions:
    - DATE : Extract only the date
    - FACILITY : Extract information related to the facility
    - PERSON : Extract names or personal entities
    - MONEY : Extract information related to money
    - ORGANIZATION : Extract names or entities related to organizations
    - LOCATION : Extract location information
    - PRODUCT : Extract information related to products
    - EVENT : Extract information related to events

    """

instruction_value = prompt.strip()

# Adding the "INSTRUCTION" field to each object in the JSON array
for obj in results:
    obj["Original_INSTRUCTION"] = instruction_value

# Printing the modified JSON
print(json.dumps(results[0], indent=2))

This process involves converting a subset of data into string representations, appending them to a list, and saving them to a CSV file. Subsequently, the CSV file is read into a Pandas DataFrame and transformed into a Hugging Face Datasets object for further processing.

import csv


# Creating a list to store the string representations
csv_data = []

# Creating a string representation for each object and appending to the list
for obj in results[0:200]:
    obj_str = f"""{obj['Original_INSTRUCTION']}\n\n### Instruction:\n{obj['text']}\n\n### Response:\n{json.dumps(obj['result'], indent=2)}\n"""
    csv_data.append(obj_str)

# Writing the list to a CSV file
csv_file_path = 'Ubia_mistral_data.csv'
with open(csv_file_path, 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)

    # Writing header
    csv_writer.writerow(['chat_sample', 'source'])

    # Writing data
    for data in csv_data:
        csv_writer.writerow([data, 'UBIAI_data'])


print(f'CSV file saved at: {csv_file_path}')

CSV file saved at: Ubia_mistral_data.csv

import pandas as pd
from datasets import Dataset

# Load the CSV back into a Pandas DataFrame, then convert it to a Hugging Face Dataset
dataset = pd.read_csv("/content/Ubia_mistral_data.csv")
dataset = Dataset.from_pandas(dataset)
dataset["chat_sample"][0]

The Mistral 7B base model is loaded using a BitsAndBytesConfig, which enables 4-bit quantization for efficient memory usage and faster inference. Additionally, the tokenizer associated with Mistral 7B is loaded, with proper padding and end-of-sequence token settings for text processing tasks.

# Load base model (Mistral 7B)
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
   base_model,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

(True, True)

# Adding the adapters in the layers
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
    )
model = get_peft_model(model, peft_config)


In this section, the code initializes monitoring for the fine-tuning process of Mistral 7B using the Weights & Biases (W&B) platform. By logging into W&B with the provided API key, the training progress and performance metrics are tracked in real-time. The wandb.init() function initiates a new run within the specified project (“Fine-tuning Mistral 7B UBIAI”), with the job type set as “training”. Enabling anonymous access ensures that the training data remains secure while allowing for collaborative monitoring and analysis.

# Monitoring the LLM
wandb.login(key="YOUR_WANDB_API_KEY")  # replace with your own W&B API key
run = wandb.init(project='Fine tuning mistral 7B UBIAI', job_type="training", anonymous="allow")

This code initializes the hyperparameters for the training process, defining settings such as the output directory, number of epochs, batch size, learning rate, and logging frequency. It also configures the SFTTrainer, specifying the model, dataset, tokenizer, and other parameters essential for fine-tuning Mistral 7B on this task.

# Hyperparameters
training_arguments = TrainingArguments(
    output_dir= "./results",
    num_train_epochs= 5,
    per_device_train_batch_size= 8,
    gradient_accumulation_steps= 2,
    optim = "paged_adamw_8bit",
    save_steps= 1000,
    logging_steps= 30,
    learning_rate= 2e-4,
    weight_decay= 0.001,
    fp16= False,
    bf16= False,
    max_grad_norm= 0.3,
    max_steps= -1,
    warmup_ratio= 0.3,
    group_by_length= True,
    lr_scheduler_type= "constant",
    report_to="wandb"
)
dataset_text_field = "chat_sample"


# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field=dataset_text_field,
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)

import torch
torch.cuda.empty_cache()
trainer.train()
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

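At this point only the LoRA adapter weights have been saved. For standalone deployment, a common follow-up step (a minimal sketch, not part of the original notebook, assuming the adapter directory new_model from above and enough memory to reload the base weights in bfloat16) is to merge the adapter back into the base model:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Reload the base model in bfloat16 so the adapter can be merged cleanly
base = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)

# Apply the trained adapter, then fold its weights into the base layers
merged = PeftModel.from_pretrained(base, new_model).merge_and_unload()

# The merged model no longer needs the peft runtime at inference time
merged.save_pretrained("NER_mistral_7b_merged")
tokenizer.save_pretrained("NER_mistral_7b_merged")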

Let's Try Our Model: Testing Mistral 7B with User Prompts

The provided function stream takes a user prompt as input and generates text based on that prompt using the fine-tuned Mistral 7B model. It utilizes the model's generate method with the user's prompt as context, and a TextStreamer prints the generated text as it is produced.

prompt_inst = """
    Extract the entities for the following labels from the given text and provide the results in JSON format
    - Entities must be extracted exactly as mentioned in the text.
    - Return each entity under its label without creating new labels.
    - Provide a list of entities for each label, ensuring that if no entities are found for a label, an empty list is returned.
    - Accuracy and relevance in your responses are key.

    Labels and their Descriptions:
    - DATE : Extract only the date
    - FACILITY : Extract information related to the facility
    - PERSON : Extract names or personal entities
    - MONEY : Extract information related to money
    - ORGANIZATION : Extract names or entities related to organizations
    - LOCATION : Extract location information
    - PRODUCT : Extract information related to products
    - EVENT : Extract information related to events
    \n"""

def stream(user_prompt):
    runtimeFlag = "cuda:0"
    system_prompt = prompt_inst
    B_INST, E_INST = "[INST]", "[/INST]"

    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
stream("""So what about you, you are employed fulltime, aren’t 1 you able to consult with him?
 2 Adv B Makhene : Yeah, but I’m booked for an operation. 3 [Discussions amongst each other] 4 Adv N Kanyane : Should I pause? 5 Adv T Madonsela : Yes, please. 6 [Go off record // Back on record] 7 Adv T Madonsela : ... I
 think it we have been fair, even though this whole procedural 8 issue has been an ambush question. Mr Hulley is saying, “We were 9 ambushed with the procedural issue, because the document was a 10 lengthy one”, but you
 have had then seen ... and everyone announced 11 to the Media that “We are here to answer questions today” and we 12 show up here, having cancelled everything for today, there are no 13 questions being answered. 14 So
 I’m just saying in all fairness we have tried to meet you 15 halfway, Mister President. After a lengthy ... after a whole day 16 squandered discussing procedure, we are then saying let’s meet each 17 other halfway.
 So what I’m putting then on the table ... I heard you 18 whisper that you have another 30 minutes, you need to go ... 20 19 minutes? 20 President Zuma : You mean now? Yeah. 21 Adv T Madonsela : Which obviously we
 have eaten 4 hours of your time as the President 22 of our country, taking care of all of our lives, we have eaten 4 hours 23 of your time discussing procedure. That 4 hours could well have 24 been used to discuss
 these issues and that 4 hours could well have 25
""")


Response:

{
  'ORGANIZATION': ['n s c', 'Media'],
  'PERSON': ['Hulley', 'Zuma', 'Makhene'],
  'DATE': ['today']
}
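Because the model emits free-form text, downstream code usually needs to turn the response back into a Python object. Here is a minimal, defensive sketch (assuming, as above, that the generation contains a single Python-style dictionary; parse_entities is a hypothetical helper, not part of the notebook):

import ast
import re

def parse_entities(generated: str) -> dict:
    # Grab the first {...} block and evaluate it as a Python literal,
    # which tolerates the single-quoted dict the model produced above
    match = re.search(r"\{.*\}", generated, re.DOTALL)
    if match is None:
        return {}
    try:
        return ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return {}

entities = parse_entities("{ 'PERSON': ['Zuma'], 'DATE': ['today'] }")
print(entities["PERSON"])  # ['Zuma']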

Conclusion

Fine-tuning Mistral 7B for Named Entity Recognition represents a transformative approach to entity recognition tasks, offering enhanced performance, adaptability, and efficiency. As organizations continue to leverage NER for information extraction, knowledge discovery, and beyond, the capabilities unlocked by Mistral 7B pave the way for groundbreaking advancements in NLP. By harnessing the power of Mistral 7B and fine-tuning it for NER, practitioners can unlock new possibilities in understanding and processing textual data, driving innovation across diverse domains.
