
Fine-Tuning Mistral 7B for Named Entity Recognition

Mar 8th, 2024

In the realm of natural language processing (NLP), Named Entity Recognition (NER) stands as a fundamental task, pivotal for various applications ranging from information extraction to question answering systems. With the advent of Mistral 7B, a revolutionary open-source large language model developed by Mistral AI, the landscape of NLP has witnessed a transformative shift. In this article, we embark on a journey to explore the potential of Mistral 7B for NER tasks, shedding light on the intricacies of fine-tuning this state-of-the-art model to excel in entity recognition.

Introduction to Named Entity Recognition

Named Entity Recognition (NER) serves as a cornerstone task in NLP, aiming to identify and classify named entities within text into predefined categories such as person names, locations, organizations, dates, and more. The accurate extraction of named entities plays a pivotal role in various downstream applications, including information retrieval, sentiment analysis, and knowledge graph construction.
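To make the task concrete, here is a purely illustrative example (the sentence and entity lists are invented for illustration) of the input an NER system receives and the structured output it is expected to produce:

text = "Tim Cook visited Paris on June 3, 2023 to meet Apple's local team."

# Expected NER output: extracted entities grouped under their labels
expected_entities = {
    "PERSON": ["Tim Cook"],
    "LOCATION": ["Paris"],
    "DATE": ["June 3, 2023"],
    "ORGANIZATION": ["Apple"],
}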

Traditionally, NER systems relied on handcrafted rules and feature engineering techniques, which often lacked robustness and scalability across diverse domains. However, the advent of deep learning and pre-trained language models like Mistral 7B has revolutionized the field, offering a data-driven approach that leverages vast amounts of text data for enhanced performance.

Forging Ahead: Mistral 7B in Comparison to Other Open Source LLMs

Mistral 7B emerges as a standout player, offering a distinctive combination of performance, efficiency, and accessibility. Unlike its counterparts, Mistral 7B boasts innovative architectural features such as grouped-query attention and sliding window attention, enhancing both inference speed and sequence handling capabilities. These advancements not only set Mistral 7B apart but also underscore its superior adaptability and resilience across diverse natural language processing (NLP) tasks. Furthermore, Mistral 7B’s Apache 2.0 licensing and community-driven development approach foster an environment of collaboration and innovation, solidifying its position as a top choice among researchers and practitioners seeking cutting-edge language understanding models.

Unveiling Mistral 7B: A Game-Changer in NLP

Mistral 7B emerges as a beacon of innovation in the realm of language models, with 7 billion parameters designed to tackle the complexities of natural language. Engineered with features such as grouped-query attention (GQA) and sliding window attention (SWA), Mistral 7B delivers strong performance and efficiency, making it a formidable contender for NLP tasks, including Named Entity Recognition.

The utilization of GQA enables Mistral 7B to expedite inference, facilitating real-time processing of text data—an essential requirement for NER tasks operating in dynamic environments. Furthermore, SWA empowers Mistral 7B to handle sequences of varying lengths with remarkable efficiency, ensuring robust performance across diverse text inputs.
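To build intuition for SWA, the following minimal sketch (an illustration of the general idea, not Mistral's actual implementation) constructs a causal attention mask in which each token may attend only to itself and the window - 1 preceding positions:

import torch

def sliding_window_causal_mask(seq_len: int, window: int) -> torch.Tensor:
    # True where query position i may attend to key position j:
    # j must be causal (j <= i) and within the last `window` positions
    i = torch.arange(seq_len).unsqueeze(1)  # query positions, shape (seq_len, 1)
    j = torch.arange(seq_len).unsqueeze(0)  # key positions, shape (1, seq_len)
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(6, 3).int())

Each row of this mask has at most window nonzero entries, which is why attention cost grows linearly with sequence length rather than quadratically.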

Fine-Tuning Mistral 7B for NER: Unlocking Its Full Potential

While Mistral 7B showcases remarkable capabilities in its pre-trained state, its true potential shines through the process of fine-tuning, where the model’s parameters are adapted to the nuances of specific tasks and datasets. Fine-tuning Mistral 7B for NER tasks involves further training the model on annotated data, allowing it to learn domain-specific patterns and improve entity recognition accuracy.

Install the necessary Python packages and libraries required for fine-tuning Mistral:

!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q trl xformers wandb datasets einops gradio sentencepiece

from transformers import (
    AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig,
    HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer
)
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch, wandb, platform, gradio, warnings
from trl import SFTTrainer
from huggingface_hub import notebook_login

# Base checkpoint to fine-tune, and the name under which the fine-tuned adapter will be saved
base_model, new_model = "mistralai/Mistral-7B-v0.1", "ilyesbk/NER_mistral_7b"


Data annotation and structure using UBIAI tools

In the process of fine-tuning large language models like Mistral 7B, the quality and relevance of the data play a pivotal role in determining the model’s performance and efficacy. High-quality, well-annotated data ensures that the model learns from accurate and representative examples, ultimately enhancing its ability to generalize and produce reliable outputs. For this tutorial, the data utilized is sourced from UBIAI, a platform renowned for its robust annotation tools and comprehensive NLP capabilities. Leveraging UBIAI’s annotation capabilities ensures that the data used for fine-tuning is meticulously labeled and tailored to the specific task at hand, thereby maximizing the effectiveness of the training process.

[Screenshot: invoice data labeling in the UBIAI annotation tool]

This code reads JSON data exported from UBIAI, containing annotated invoice documents. It organizes each document into a structured format, using sets to avoid duplicate annotations, and stores the processed results for further analysis and for fine-tuning Mistral on Named Entity Recognition (NER) tasks.

import json

# Read JSON data from file
with open('/content/invoice_data_Ubiai.json', 'r') as file:
    data = json.load(file)

results = []
for document in data:
    document_name = document["documentName"]
    text = document["document"]
    annotations = document["annotation"]

    result = {}

    for annotation in annotations:
        label = annotation["label"]
        text_extracted = annotation["text"]

        if label not in result:
            result[label] = set()  # Use a set to avoid duplicates

        result[label].add(text_extracted)

    # Convert sets back to lists before creating the result entry
    result = {label: list(entities) for label, entities in result.items()}

    result_entry = {"document_name": document_name, "text": text, "result": result}
    results.append(result_entry)

print(len(results))

1371

import json

# Extracting unique labels from all objects
unique_labels = set()
for obj in results:
    unique_labels.update(obj["result"].keys())

print("Unique Labels:", unique_labels)

Unique Labels: {'PERSON', 'PRODUCT', 'FACILITY', 'MONEY', 'EVENT', 'DATE', 'ORGANIZATION', 'LOCATION'}

The data structure consists of documents with associated names, texts, and annotations. Annotations include entities such as persons, organizations, locations, facilities, and events, providing structured information for NER tasks.

print(json.dumps(results[0], indent=2))

{
  "document_name": "gupta_leaks_export_1379.json",
  "text": "98.9.32. 98.9.33. 98.9.34. 43 Mr Zwane is linked to the Guptas. He sent invoices to Sahara Computers that get paid by Linkway. The payments that went to Estina, ultimately, were channelled to the Gupta company; and Gateway and the beneficiaries intended to benefit from this company have not seen what this project was meant to do for them. So it seems, given the urgency at which it was appointed, the R 30 million advanced payment, all this was meant to benefit Gupta related entities and/or the Guptas themselves. Mr Zwane stated that in his view had the management of the project been done well, it would have been regarded as one of the best projects in agriculture to have happened for the people of the Free State. Influence of MEC Elizabeth Rockman .10. 98.11. Ms Rockman admits that her interaction with the Gupta family predated her discussions with them about the Vrede Dairy Project. She was introduced to them when the New Age made a presentation to the Provincial EXCO to get support for advertisements. Thereafter these meetings were fairly frequent and the Premier was aware of such meeting. The interactions continued after Ms Rockman was appointed as MEC Finance and even extended to interaction with the Gupta family about the Vrede Dairy Project outstanding payments related to Estina, which she often facilitated.",
  "result": {
    "PERSON": [
      "Rockman",
      "Gupta",
      "Ms Rockman",
      "Zwane"
    ],
    "ORGANIZATION": [
      "Provincial EXCO",
      "MEC Finance",
      "Estina",
      "people of the Free State",
      "Vrede Dairy Project",
      "Guptas",
      "Sahara Computers"
    ],
    "LOCATION": [
      "Linkway"
    ],
    "FACILITY": [
      "Gateway"
    ],
    "EVENT": [
      "New Age"
    ]
  }
}

The data is now accompanied by a detailed prompt outlining the instructions for entity extraction tasks. Each instruction specifies the exact requirements for extracting entities under different labels, such as DATE, PERSON, ORGANIZATION, LOCATION, FACILITY, EVENT, MONEY, and PRODUCT. The prompt emphasizes accuracy and relevance in responses and provides guidelines to ensure consistency and completeness in the extracted information. This structured prompt enables efficient processing of the data for entity recognition tasks, aligning with the objectives of the NER (Named Entity Recognition) process.

prompt = """
    Extract the entities for the following labels from the given text and provide the results in JSON format
    - Entities must be extracted exactly as mentioned in the text.
    - Return each entity under its label without creating new labels.
    - Provide a list of entities for each label, ensuring that if no entities are found for a label, an empty list is returned.
    - Accuracy and relevance in your responses are key.

    Labels and their Descriptions:
    - DATE : Extract only the date
    - FACILITY : Extract information related to the facility
    - PERSON : Extract names or personal entities
    - MONEY : Extract information related to money
    - ORGANIZATION : Extract names or entities related to organizations
    - LOCATION : Extract location information
    - PRODUCT : Extract information related to products
    - EVENT : Extract information related to events

    """

instruction_value = prompt.strip()

# Adding the "INSTRUCTION" field to each object in the JSON array
for obj in results:
    obj["Original_INSTRUCTION"] = instruction_value

# Printing the modified JSON
print(json.dumps(results[0], indent=2))

This process involves converting a subset of data into string representations, appending them to a list, and saving them to a CSV file. Subsequently, the CSV file is read into a Pandas DataFrame and transformed into a Hugging Face Datasets object for further processing.

import csv


# Creating a list to store the string representations
csv_data = []

# Creating a string representation for each object and appending to the list
for obj in results[0:200]:
    obj_str = f"""{obj['Original_INSTRUCTION']}\n\n### Instruction:\n{obj['text']}\n\n### Response:\n{json.dumps(obj['result'], indent=2)}\n"""
    csv_data.append(obj_str)

# Writing the list to a CSV file
csv_file_path = 'Ubia_mistral_data.csv'
with open(csv_file_path, 'w', newline='', encoding='utf-8') as csv_file:
    csv_writer = csv.writer(csv_file)

    # Writing header
    csv_writer.writerow(['chat_sample', 'source'])

    # Writing data
    for data in csv_data:
        csv_writer.writerow([data, 'UBIAI_data'])


print(f'CSV file saved at: {csv_file_path}')

CSV file saved at: Ubia_mistral_data.csv

import pandas as pd
from datasets import Dataset

# Load the CSV back into a Pandas DataFrame, then convert it to a Hugging Face Dataset
dataset = pd.read_csv("/content/Ubia_mistral_data.csv")
dataset = Dataset.from_pandas(dataset)
dataset["chat_sample"][0]

The Mistral 7B base model is loaded using a BitsAndBytesConfig, which enables 4-bit quantization for efficient memory usage and faster inference. Additionally, the tokenizer associated with Mistral 7B is loaded, with proper padding and end-of-sequence token settings for text processing tasks.

# Load base model (Mistral 7B)
bnb_config = BitsAndBytesConfig(
    load_in_4bit= True,
    bnb_4bit_quant_type= "nf4",
    bnb_4bit_compute_dtype= torch.bfloat16,
    bnb_4bit_use_double_quant= False,
)
model = AutoModelForCausalLM.from_pretrained(
   base_model,
    quantization_config=bnb_config,
    device_map={"": 0}
)
model.config.use_cache = False # silence the warnings. Please re-enable for inference!
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
tokenizer.add_bos_token, tokenizer.add_eos_token

(True, True)

# Adding the adapters in the layers
model = prepare_model_for_kbit_training(model)
peft_config = LoraConfig(
        r=16,
        lora_alpha=16,
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj","gate_proj"]
    )
model = get_peft_model(model, peft_config)


In this section, the code initializes monitoring for the fine-tuning process of Mistral 7B using the Weights & Biases (W&B) platform. By logging into W&B with the provided API key, the training progress and performance metrics are tracked in real-time. The wandb.init() function initiates a new run within the specified project (“Fine-tuning Mistral 7B UBIAI”), with the job type set as “training”. Enabling anonymous access ensures that the training data remains secure while allowing for collaborative monitoring and analysis.

# Monitoring the LLM
wandb.login(key="YOUR_WANDB_API_KEY")  # replace with your own W&B API key
run = wandb.init(project='Fine tuning mistral 7B UBIAI', job_type="training", anonymous="allow")

This code initializes the hyperparameters for the training process, defining settings such as the output directory, number of epochs, batch size, learning rate, and logging frequency. It also configures the SFTTrainer, specifying the model, dataset, tokenizer, and other parameters essential for fine-tuning Mistral 7B on this task.

# Hyperparameters
training_arguments = TrainingArguments(
    output_dir= "./results",
    num_train_epochs= 5,
    per_device_train_batch_size= 8,
    gradient_accumulation_steps= 2,
    optim = "paged_adamw_8bit",
    save_steps= 1000,
    logging_steps= 30,
    learning_rate= 2e-4,
    weight_decay= 0.001,
    fp16= False,
    bf16= False,
    max_grad_norm= 0.3,
    max_steps= -1,
    warmup_ratio= 0.3,
    group_by_length= True,
    lr_scheduler_type= "constant",
    report_to="wandb"
)
dataset_text_field = "chat_sample"


# Setting sft parameters
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    max_seq_length= None,
    dataset_text_field=dataset_text_field,
    tokenizer=tokenizer,
    args=training_arguments,
    packing= False,
)

import torch
torch.cuda.empty_cache()
trainer.train()
# Save the fine-tuned model
trainer.model.save_pretrained(new_model)
wandb.finish()
model.config.use_cache = True
model.eval()

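At this point only the LoRA adapter weights have been saved. For standalone deployment, a common follow-up step (a minimal sketch, not part of the original notebook, assuming the adapter directory new_model from above and enough memory to reload the base weights in bfloat16) is to merge the adapter back into the base model:

from transformers import AutoModelForCausalLM
from peft import PeftModel
import torch

# Reload the base model in bfloat16 so the adapter can be merged cleanly
base = AutoModelForCausalLM.from_pretrained(
    base_model, torch_dtype=torch.bfloat16, device_map="auto"
)

# Apply the trained adapter, then fold its weights into the base layers
merged = PeftModel.from_pretrained(base, new_model).merge_and_unload()

# The merged model no longer needs the peft runtime at inference time
merged.save_pretrained("NER_mistral_7b_merged")
tokenizer.save_pretrained("NER_mistral_7b_merged")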

Let's Try Our Model: Testing Mistral 7B with User Prompts

The provided function stream takes a user prompt as input and generates text based on that prompt using the fine-tuned Mistral 7B model. It utilizes the model's generate method with the user's prompt as context, and a TextStreamer prints the generated text as it is produced.

prompt_inst = """
    Extract the entities for the following labels from the given text and provide the results in JSON format
    - Entities must be extracted exactly as mentioned in the text.
    - Return each entity under its label without creating new labels.
    - Provide a list of entities for each label, ensuring that if no entities are found for a label, an empty list is returned.
    - Accuracy and relevance in your responses are key.

    Labels and their Descriptions:
    - DATE : Extract only the date
    - FACILITY : Extract information related to the facility
    - PERSON : Extract names or personal entities
    - MONEY : Extract information related to money
    - ORGANIZATION : Extract names or entities related to organizations
    - LOCATION : Extract location information
    - PRODUCT : Extract information related to products
    - EVENT : Extract information related to events
    \n"""

def stream(user_prompt):
    runtimeFlag = "cuda:0"
    system_prompt = prompt_inst
    B_INST, E_INST = "[INST]", "[/INST]"

    prompt = f"{system_prompt}{B_INST}{user_prompt.strip()}\n{E_INST}"

    inputs = tokenizer([prompt], return_tensors="pt").to(runtimeFlag)

    streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

    _ = model.generate(**inputs, streamer=streamer, max_new_tokens=200)
stream("""So what about you, you are employed fulltime, aren’t 1 you able to consult with him?
 2 Adv B Makhene : Yeah, but I’m booked for an operation. 3 [Discussions amongst each other] 4 Adv N Kanyane : Should I pause? 5 Adv T Madonsela : Yes, please. 6 [Go off record // Back on record] 7 Adv T Madonsela : ... I
 think it we have been fair, even though this whole procedural 8 issue has been an ambush question. Mr Hulley is saying, “We were 9 ambushed with the procedural issue, because the document was a 10 lengthy one”, but you
 have had then seen ... and everyone announced 11 to the Media that “We are here to answer questions today” and we 12 show up here, having cancelled everything for today, there are no 13 questions being answered. 14 So
 I’m just saying in all fairness we have tried to meet you 15 halfway, Mister President. After a lengthy ... after a whole day 16 squandered discussing procedure, we are then saying let’s meet each 17 other halfway.
 So what I’m putting then on the table ... I heard you 18 whisper that you have another 30 minutes, you need to go ... 20 19 minutes? 20 President Zuma : You mean now? Yeah. 21 Adv T Madonsela : Which obviously we
 have eaten 4 hours of your time as the President 22 of our country, taking care of all of our lives, we have eaten 4 hours 23 of your time discussing procedure. That 4 hours could well have 24 been used to discuss
 these issues and that 4 hours could well have 25
""")


Response:

{
  'ORGANIZATION': ['n s c', 'Media'],
  'PERSON': ['Hulley', 'Zuma', 'Makhene'],
  'DATE': ['today']
}
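Because the model emits free-form text, downstream code usually needs to turn the response back into a Python object. Here is a minimal, defensive sketch (assuming, as above, that the generation contains a single Python-style dictionary; parse_entities is a hypothetical helper, not part of the notebook):

import ast
import re

def parse_entities(generated: str) -> dict:
    # Grab the first {...} block and evaluate it as a Python literal,
    # which tolerates the single-quoted dict the model produced above
    match = re.search(r"\{.*\}", generated, re.DOTALL)
    if match is None:
        return {}
    try:
        return ast.literal_eval(match.group(0))
    except (ValueError, SyntaxError):
        return {}

entities = parse_entities("{ 'PERSON': ['Zuma'], 'DATE': ['today'] }")
print(entities["PERSON"])  # ['Zuma']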

Conclusion

Fine-tuning Mistral 7B for Named Entity Recognition represents a transformative approach to entity recognition tasks, offering enhanced performance, adaptability, and efficiency. As organizations continue to leverage NER for information extraction, knowledge discovery, and beyond, the capabilities unlocked by Mistral 7B pave the way for groundbreaking advancements in NLP. By harnessing the power of Mistral 7B and fine-tuning it for NER, practitioners can unlock new possibilities in understanding and processing textual data, driving innovation across diverse domains.
