Fine-Tuning Mistral 7B for Named Entity Recognition
Mar 8th, 2024
In the realm of natural language processing (NLP), Named Entity Recognition (NER) stands as a fundamental task, pivotal for various applications ranging from information extraction to question answering systems. With the advent of Mistral 7B, a revolutionary open-source large language model developed by Mistral AI, the landscape of NLP has witnessed a transformative shift. In this article, we embark on a journey to explore the potential of Mistral 7B for NER tasks, shedding light on the intricacies of fine-tuning this state-of-the-art model to excel in entity recognition.
Introduction to Named Entity Recognition
Named Entity Recognition (NER) serves as a cornerstone task in NLP, aiming to identify and classify named entities within text into predefined categories such as person names, locations, organizations, dates, and more. The accurate extraction of named entities plays a pivotal role in various downstream applications, including information retrieval, sentiment analysis, and knowledge graph construction.
Traditionally, NER systems relied on handcrafted rules and feature engineering techniques, which often lacked robustness and scalability across diverse domains. However, the advent of deep learning and pre-trained language models like Mistral 7B has revolutionized the field, offering a data-driven approach that leverages vast amounts of text data for enhanced performance.
Forging Ahead: Mistral 7B in Comparison to Other Open Source LLMs
Mistral 7B emerges as a standout player, offering a distinctive combination of performance, efficiency, and accessibility. Unlike its counterparts, Mistral 7B boasts innovative architectural features such as grouped-query attention and sliding window attention, enhancing both inference speed and sequence handling capabilities. These advancements not only set Mistral 7B apart but also underscore its superior adaptability and resilience across diverse natural language processing (NLP) tasks. Furthermore, Mistral 7B’s Apache 2.0 licensing and community-driven development approach foster an environment of collaboration and innovation, solidifying its position as a top choice among researchers and practitioners seeking cutting-edge language understanding models.
Unveiling Mistral 7B: A Game-Changer in NLP
Mistral 7B emerges as a beacon of innovation in the realm of language models, boasting a staggering 7 billion parameters meticulously designed to tackle the complexities of natural language. Engineered with cutting-edge features such as grouped-query attention (GQA) and sliding window attention (SWA), Mistral 7B exhibits unparalleled performance and efficiency, making it a formidable contender for NLP tasks, including Named Entity Recognition.
The utilization of GQA enables Mistral 7B to expedite inference, facilitating real-time processing of text data—an essential requirement for NER tasks operating in dynamic environments. Furthermore, SWA empowers Mistral 7B to handle sequences of varying lengths with remarkable efficiency, ensuring robust performance across diverse text inputs.
Fine-Tuning Mistral 7B for NER: Unlocking Its Full Potential
While Mistral 7B showcases remarkable capabilities in its pre-trained state, its true potential shines through the process of fine-tuning, where the model’s parameters are adapted to the nuances of specific tasks and datasets. Fine-tuning Mistral 7B for NER tasks involves further training the model on annotated data, allowing it to learn domain-specific patterns and improve entity recognition accuracy.
Install the necessary Python packages and libraries required for fine-tuning Mistral 7B:
!pip install -q -U bitsandbytes
!pip install -q -U git+https://github.com/huggingface/transformers.git
!pip install -q -U git+https://github.com/huggingface/peft.git
!pip install -q -U git+https://github.com/huggingface/accelerate.git
!pip install -q trl xformers wandb datasets einops gradio sentencepiece
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging, TextStreamer
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
import os, torch, wandb, platform, gradio, warnings
from trl import SFTTrainer
from huggingface_hub import notebook_login
base_model, new_model = "mistralai/Mistral-7B-v0.1", "ilyesbk/NER_mistral_7b"
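The cells above only install packages and pull in imports. As a rough sketch of how these pieces typically fit together for QLoRA-style fine-tuning, the base model can be loaded in 4-bit and wrapped with a LoRA adapter. The hyperparameter values below are illustrative assumptions, not the exact configuration used in this tutorial:

# Minimal QLoRA setup sketch (values here are assumed, not the tutorial's exact config)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantize weights to 4-bit to fit on a single GPU
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # run compute in bfloat16 for stability
    bnb_4bit_use_double_quant=True,         # second quantization pass saves extra memory
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships without a pad token

peft_config = LoraConfig(
    r=16,                    # LoRA rank (assumed value)
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
model = get_peft_model(model, peft_config)

With this setup, only the small LoRA adapter weights are trained while the quantized base model stays frozen, which is what makes fine-tuning a 7B model feasible on a single GPU.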
Data annotation and structure using UBIAI tools
In the process of fine-tuning large language models like Mistral 7B, the quality and relevance of the data play a pivotal role in determining the model’s performance and efficacy. High-quality, well-annotated data ensures that the model learns from accurate and representative examples, ultimately enhancing its ability to generalize and produce reliable outputs. For this tutorial, the data utilized is sourced from UBIAI, a platform renowned for its robust annotation tools and comprehensive NLP capabilities. Leveraging UBIAI’s annotation capabilities ensures that the data used for fine-tuning is meticulously labeled and tailored to the specific task at hand, thereby maximizing the effectiveness of the training process.
This code reads JSON data from a file containing invoice information. It organizes the data into a structured format, ensuring there are no duplicate annotations. The processed results are stored for further analysis and fine-tuning Mistral for Named Entity Recognition (NER) tasks.
import json

# Read JSON data from file
with open('/content/invoice_data_Ubiai.json', 'r') as file:
    data = json.load(file)

results = []
for document in data:
    document_name = document["documentName"]
    text = document["document"]
    annotations = document["annotation"]

    result = {}
    for annotation in annotations:
        label = annotation["label"]
        text_extracted = annotation["text"]
        if label not in result:
            result[label] = set()  # Use a set to avoid duplicates
        result[label].add(text_extracted)

    # Convert sets back to lists before creating the result entry
    result = {label: list(entities) for label, entities in result.items()}
    result_entry = {"document_name": document_name, "text": text, "result": result}
    results.append(result_entry)
print(len(results))
1371
# Extracting unique labels from all processed documents
unique_labels = set()
for obj in results:
    unique_labels.update(obj["result"].keys())

print("Unique Labels:", unique_labels)
Unique Labels: {'PERSON', 'PRODUCT', 'FACILITY', 'MONEY', 'EVENT', 'DATE', 'ORGANIZATION', 'LOCATION'}

The data structure consists of documents with associated names, texts, and annotations. Annotations include entities such as persons, organizations, locations, facilities, and events, providing structured information for NER tasks.
print(json.dumps(results[0], indent=2))
{
  "document_name": "gupta_leaks_export_1379.json",
  "text": "98.9.32. 98.9.33. 98.9.34. 43 Mr Zwane is linked to the Guptas. He sent invoices to Sahara Computers that get paid by Linkway. The payments that went to Estina, ultimately, were channelled to the Gupta company; and Gateway and the beneficiaries intended to benefit from this company have not seen what this project was meant to do for them. So it seems, given the urgency at which it was appointed, the R 30 million advanced payment, all this was meant to benefit Gupta related entities and/or the Guptas themselves. Mr Zwane stated that in his view had the management of the project been done well, it would have been regarded as one of the best projects in agriculture to have happened for the people of the Free State. Influence of MEC Elizabeth Rockman .10. 98.11. Ms Rockman admits that her interaction with the Gupta family predated her discussions with them about the Vrede Dairy Project. She was introduced to them when the New Age made a presentation to the Provincial EXCO to get support for advertisements. Thereafter these meetings were fairly frequent and the Premier was aware of such meeting. The interactions continued after Ms Rockman was appointed as MEC Finance and even extended to interaction with the Gupta family about the Vrede Dairy Project outstanding payments related to Estina, which she often facilitated.",
  "result": {
    "PERSON": [
      "Rockman",
      "Gupta",
      "Ms Rockman",
      "Zwane"
    ],
    "ORGANIZATION": [
      "Provincial EXCO",
      "MEC Finance",
      "Estina",
      "people of the Free State",
      "Vrede Dairy Project",
      "Guptas",
      "Sahara Computers"
    ],
    "LOCATION": [
      "Linkway"
    ],
    "FACILITY": [
      "Gateway"
    ],
    "EVENT": [
      "New Age"
    ]
  }
}
The data is now accompanied by a detailed prompt outlining the instructions for entity extraction tasks. Each instruction specifies the exact requirements for extracting entities under different labels, such as DATE, PERSON, ORGANIZATION, LOCATION, FACILITY, EVENT, MONEY, and PRODUCT. The prompt emphasizes accuracy and relevance in responses and provides guidelines to ensure consistency and completeness in the extracted information. This structured prompt enables efficient processing of the data for entity recognition tasks, aligning with the objectives of the NER (Named Entity Recognition) process.
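As a concrete illustration, such a prompt can be assembled from the processed records along the following lines. The instruction wording and the template layout below are assumptions for illustration; the tutorial's own prompt text is not reproduced here:

from datasets import Dataset

LABELS = ["DATE", "PERSON", "ORGANIZATION", "LOCATION",
          "FACILITY", "EVENT", "MONEY", "PRODUCT"]

def build_prompt(entry):
    # Instruction block: states which labels to extract and how to answer
    instruction = (
        "Extract all named entities from the text below and return them as JSON, "
        f"with one key per label ({', '.join(LABELS)}). "
        "Only include entities that appear verbatim in the text."
    )
    # Pack instruction, document, and expected answer into one training sequence
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Input:\n{entry['text']}\n\n"
        f"### Response:\n{json.dumps(entry['result'])}"
    )

training_texts = [build_prompt(entry) for entry in results]
dataset = Dataset.from_dict({"text": training_texts})  # usable as SFTTrainer's text field
print(training_texts[0][:400])  # sanity-check the first formatted example

Formatted this way, each example combines the instruction, the raw document, and the expected JSON answer in a single sequence, which is the shape an SFTTrainer-based fine-tuning run expects.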