
Fine-Tuning SpaCy Models: Customizing Named Entity Recognition for Domain-Specific Data

Dec 14th, 2023

In the realm of Natural Language Processing (NLP), a foundational endeavor involves extracting meaningful insights from textual data. At the core of numerous NLP applications lies Named Entity Recognition (NER), a pivotal technique that plays a crucial role in recognizing and classifying entities such as names, dates, and locations embedded within textual content.

 

In the following blog post, I will guide you through fine-tuning a Named Entity Recognition (NER) model using spaCy, a powerful library for NLP tasks. Specifically, we will cover:

  1. Named Entity Recognition

  2. spaCy
  3. spaCy for Named Entity Recognition
  4. Importance of Customizing NER Models
  5. Fine-Tuning spaCy’s NER model: Tok2vec 
  6. Fine-Tuning spaCy’s Transformer NER model

Named Entity Recognition:

Named Entity Recognition (NER) constitutes a specialized facet of natural language processing dedicated to pinpointing and categorizing named entities within textual content. These named entities encompass distinct categories such as individual names, organizational titles, geographic locations, dates, numerical values, and beyond. 

 

The significance of NER extends across diverse applications, encompassing information extraction, question answering, chatbots, sentiment analysis, and recommendation systems, underscoring its pivotal role in advancing multiple areas of natural language understanding and utilization.

image_2023-12-14_160026839

spaCy:

SpaCy stands as a leading natural language processing (NLP) library, renowned for its efficiency and versatility in handling various linguistic tasks. Developed by Explosion AI, SpaCy is designed with a focus on production-ready applications, making it a go-to choice for researchers, developers, and businesses alike. 

 

This open-source library offers pre-trained models for tasks such as part-of-speech tagging, named entity recognition, and dependency parsing. What sets SpaCy apart is its speed and memory efficiency, making it particularly adept at processing large volumes of text in real time.

 

Moreover, its user-friendly interface, extensive language support, and integration with deep learning frameworks contribute to its popularity in the NLP community, empowering users to extract valuable insights from text data seamlessly. SpaCy also supports custom training and fine-tuning, as we do in this article, which is essential in many scenarios.


spaCy for Named Entity Recognition:

Setup:

We begin by importing the essential library, spacy, which is the core component for natural language processing tasks. 

				
import spacy

				
			

Following that, we download the large English language model, en_core_web_lg, which encompasses a comprehensive set of linguistic features. 

				
!python -m spacy download en_core_web_lg

				
			

Once the download is complete, we initialize and load the model using the spacy.load("en_core_web_lg") command.

				
nlp = spacy.load("en_core_web_lg")

				
			

Creating the NER object:

We then create a SpaCy Doc object by applying the loaded SpaCy model to a given text string. Specifically, the text “UBIAI is cool” undergoes processing through the pipeline, and the resulting document is stored in the variable doc.

				
doc = nlp("UBIAI is cool")
print(doc)
				
			

Displaying entities:

In spaCy, the displacy module is used to visualize linguistic annotations, including named entities, dependencies, and more. It provides a convenient way to visualize the structure and relationships within a text.

 

In the next example, we apply it to the doc object created.

				
from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

				
			

As output we get:

image_2023-12-14_160436030

Importance of Customizing NER models:

While pre-trained models offer a solid foundation for various NLP tasks, the need for customization arises from the unique requirements of specific applications and domains. This section explains why fine-tuning pre-trained NER models is important:

  • Domain Specificity: A fine-tuned model excels on domain-specific data by being trained on specialized content, leading to enhanced accuracy in recognizing entities within that domain.
  • Improved Precision: Tailoring the model to recognize the specific entity types crucial to your application yields higher precision and reduces the risk of false positives.
  • Data Control: You control the quality of training data and annotation, ensuring that the model is exposed to relevant and accurate examples.
  • Adaptability: The model can be continuously updated and fine-tuned as the data evolves, ensuring that it stays relevant and effective over time.

 

While pre-trained NER models are proficient at recognizing common entity types like persons, organizations, and locations, they might fall short when confronted with the need to identify unique entities specific to your domain. That is why fine-tuning NER models on domain-specific data is important.


Fine-Tuning spaCy’s NER Models: Tok2vec

In this section, we will guide you through fine-tuning the spaCy NER model en_core_web_lg on your own data.

Create a JSON file for your training data:

Our first task involves instructing spaCy on how to recognize words associated with specific tags. To achieve this, we must curate a JSON file comprising examples, each annotated with tags and their corresponding indices.

				
					import json


with open('UBIAI_TEST.json', 'r') as f:
    data = json.load(f)

				
			

The file UBIAI_TEST.json follows the format below: each example holds a sentence in “content” and a list of “annotations”, where “tag_name” is the label, “start” is the index of the entity’s first character, and “end” is the index of its last character (inclusive).

				
{
  "examples": [
    {
      "id": "ca-1",
      "content": "Schedule a calendar event in Teak oaks HOA about competitions happening tomorrow",
      "annotations": [
        {
          "start": 0,
          "end": 7,
          "tag_name": "action"
        },
        {
          "start": 11,
          "end": 24,
          "tag_name": "domain"
        },
        {
          "start": 29,
          "end": 41,
          "tag_name": "hoa"
        },
        {
          "start": 49,
          "end": 70,
          "tag_name": "event"
        },
        {
          "start": 72,
          "end": 79,
          "tag_name": "date"
        }
      ]
    }
  ]
}
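As a quick sanity check of this format, you can verify that each annotation’s character offsets actually cover the intended entity text. This is a minimal sketch using the example sentence above; note that “end” is inclusive, so the Python slice runs to end + 1:

```python
# Verify that each annotation's (start, end) span matches the entity text.
# "end" in this format is the index of the last character (inclusive),
# so Python slicing needs end + 1.
content = "Schedule a calendar event in Teak oaks HOA about competitions happening tomorrow"
annotations = [
    {"start": 0, "end": 7, "tag_name": "action"},
    {"start": 11, "end": 24, "tag_name": "domain"},
    {"start": 29, "end": 41, "tag_name": "hoa"},
]

for ann in annotations:
    span_text = content[ann["start"]:ann["end"] + 1]
    print(f"{ann['tag_name']}: {span_text!r}")

# Prints:
# action: 'Schedule'
# domain: 'calendar event'
# hoa: 'Teak oaks HOA'
```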

				
			

Data preparation:

This code snippet is instrumental in preparing the training data in the correct format for training a SpaCy Named Entity Recognition (NER) model.



				
					training_data = []
for example in data['examples']:
    temp_dict = {}
    temp_dict['text'] = example['content']
    temp_dict['entities'] = []
    for annotation in example['annotations']:
        start = annotation['start']
        end = annotation['end'] + 1
        label = annotation['tag_name'].upper()
        temp_dict['entities'].append((start, end, label))
    training_data.append(temp_dict)
print(training_data[0])

				
			

For the first example, the output would be:

{'text': 'Schedule a calendar event in Teak oaks HOA about competitions happening tomorrow', 'entities': [(0, 8, 'ACTION'), (11, 25, 'DOMAIN'), (29, 42, 'HOA'), (49, 71, 'EVENT'), (72, 80, 'DATE')]}

Converting training data to spaCy’s DocBin format:

In this segment, the training data is transformed into SpaCy’s efficient DocBin format, a binary structure designed for storing Doc objects. The process unfolds as follows:

				
					from spacy.tokens import DocBin
from tqdm import tqdm
from spacy.util import filter_spans


nlp = spacy.blank('en')
doc_bin = DocBin()
for training_example in tqdm(training_data):
    text = training_example['text']
    labels = training_example['entities']
    doc = nlp.make_doc(text)
    ents = []
    for start, end, label in labels:
        span = doc.char_span(start, end, label=label, alignment_mode="contract")
        if span is None:
            print("Skipping entity")
        else:
            ents.append(span)
    filtered_ents = filter_spans(ents)
    doc.ents = filtered_ents
    doc_bin.add(doc)


doc_bin.to_disk("train.spacy")

				
			

Example on a JSON file containing 7 training examples:

image_2023-12-14_160918872
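Before training, it can help to read train.spacy back and confirm exactly which entities were stored. Here is a minimal sketch; the helper name inspect_docbin is our own, not part of spaCy:

```python
import spacy
from spacy.tokens import DocBin


def inspect_docbin(path):
    """Load a serialized DocBin and return (text, entities) pairs."""
    nlp = spacy.blank("en")
    doc_bin = DocBin().from_disk(path)
    return [
        (doc.text, [(ent.text, ent.label_) for ent in doc.ents])
        for doc in doc_bin.get_docs(nlp.vocab)
    ]


# After the conversion step above:
# print(inspect_docbin("train.spacy"))
```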

Configuration:

Now it is time to create the Training Configuration: 

 

Start by creating these two configuration files (The configuration is from the official spaCy documentation).

image_2023-12-14_160956762
image_2023-12-14_161032681

Then, execute the subsequent command within the notebook code block to initialize spaCy, utilizing the specified configuration file. This configuration file is crucial for training the spaCy model with the custom features we have generated.

				
					!python -m spacy init fill-config base_config.cfg config.cfg

				
			

Training:

Now, train the spaCy model. (For simplicity, the command below reuses train.spacy as the dev set; ideally you would pass a separate held-out file.)

				
					!python -m spacy train config.cfg --output ./ --paths.train ./train.spacy --paths.dev ./train.spacy

				
			

Once training is done, two folders named model-best and model-last will be generated.
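If you also hold out a separate .spacy file for evaluation, you can score the trained pipeline with spaCy’s built-in evaluate command. A sketch, assuming a dev.spacy file prepared the same way as train.spacy:

```shell
# Score the trained pipeline on held-out data and save the metrics.
python -m spacy evaluate model-best ./dev.spacy --output metrics.json
```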

Loading & Testing:

Now all you have to do is load and test the fine-tuned model.

				
					# first we load the model
nlp_ner = spacy.load("model-best")

# we create a document object and we test the fine-tuned model
doc = nlp_ner("Could you please reserve a team brainstorming session on coming Wednesday at 11 AM?")


spacy.displacy.render(doc, style="ent")
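displacy renders the predictions visually; for downstream use you will usually want the entities as plain data. Below is a minimal sketch that works with any loaded pipeline, including model-best (the helper name extract_entities is our own):

```python
import spacy


def extract_entities(nlp, text):
    """Run a spaCy pipeline on text and return (text, label, start, end) tuples."""
    doc = nlp(text)
    return [(ent.text, ent.label_, ent.start_char, ent.end_char) for ent in doc.ents]


# With the fine-tuned pipeline loaded above:
# print(extract_entities(nlp_ner, "Could you please reserve a team brainstorming session?"))
```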

				
			

Fine-Tuning spaCy’s Transformer NER Model:

In this section, we’ll provide step-by-step guidance on fine-tuning a spaCy transformer NER model (en_core_web_trf) on your custom data. The initial three steps mirror those of the tok2vec pipeline, with the only distinction being the switch from en_core_web_lg to en_core_web_trf.

Configuration:

Now it is time to create the Training Configuration: 

 

Choose your desired language and set the component to ‘ner.’ Depending on your system specifications, opt for either CPU or GPU. Save this configuration as ‘base_config.cfg.’

image_2023-12-14_161403540

To complete the configuration with the default settings for the rest of the system parameters, execute the following command in the command line to generate the ‘config.cfg’ file.

				
					!python -m spacy init fill-config base_config.cfg config.cfg
				
			

Training:

Train the model using the command line by specifying the training and development data paths in the configuration file. 



				
					!python -m spacy train config.cfg --output ./output --paths.train ./train.spacy --paths.dev ./dev.spacy
				
			

Additionally, you can set parameters such as batch size, max steps, epochs, patience, and more directly in the configuration file.
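For illustration, the relevant entries in config.cfg look roughly like this (the values below are examples, not recommendations; spacy init fill-config generates the full set of keys):

```ini
[nlp]
batch_size = 128

[training]
max_steps = 2000
max_epochs = 0
patience = 1600
eval_frequency = 200
```

The same settings can also be overridden at the command line without editing the file, e.g. by appending --training.max_steps 2000 to the spacy train command.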

Loading & Testing:

Now all you have to do is load and test the fine-tuned model.

				
nlp = spacy.load("output/model-last/")  # load the model


sentence = """We are looking for a Backend Developer who has 4-6 years of experience in designing, developing and implementing backend services using Python and Django."""


doc = nlp(sentence)


from spacy import displacy
displacy.render(doc, style="ent", jupyter=True)

				
			

Conclusion

By utilizing natural language processing libraries like spaCy and combining domain expertise with the flexibility offered by custom NER models, you can pave the way for more accurate and context-aware natural language understanding in a variety of real-world scenarios.
