
Using ChatGPT to Pre-annotate Named Entity Recognition Labeling Tasks

Jan 19th 2024

In the dynamic realm of natural language processing, the role of Named Entity Recognition (NER) is indispensable, serving as the linchpin for extracting meaningful insights from unstructured text. This article embarks on a journey to explore an innovative solution at the convergence of advanced language models and NER tasks, particularly focusing on the utilization of ChatGPT.

The primary purpose of this article is to unravel the potential of using ChatGPT for pre-annotating Named Entity Recognition labeling tasks. We delve into the intricacies of how this powerful language model can enhance the efficiency of data labeling in the context of NER, offering a compelling alternative to traditional methods.

Imagine a world where extracting valuable information from vast amounts of text is not only accurate but also remarkably efficient. This vision is not too distant, thanks to the intersection of cutting-edge language models and the ever-evolving field of NER. In the following pages, we uncover a groundbreaking approach that could revolutionize how we annotate and label data for Named Entity Recognition.

In our exploration of ChatGPT’s role in NER pre-annotation, we naturally introduce primary keywords such as “named entity extraction python,” “gpt named entity recognition,” and “ml data labeling.” These keywords set the stage for a comprehensive discussion on the integration of advanced language models with practical applications in data labeling for NER tasks.


Background on Named Entity Extraction


Definition: Named Entity Recognition (NER), also known as named entity extraction, is a pivotal natural language processing (NLP) task involving the identification and classification of entities within text, including people, locations, organizations, dates, and domain-specific terms.

Applications:

 

  • Information Retrieval: Facilitates document categorization and indexing for efficient search.
  • Sentiment Analysis: Enhances contextual understanding for nuanced sentiment insights.
  • Question-Answering Systems: Aids in pinpointing relevant information for accurate responses.
  • Specialized Fields (Biomedical, Finance, Legal): Crucial for extracting key information in various domains.

Importance of Labeled Data:

  • Training Foundation: High-quality labeled datasets are essential for training effective NER models.
  • Generalization: Influences the model’s ability to generalize well to new, unseen data.

 

As we explore NER’s role in diverse applications, it becomes clear that meticulous data labeling is paramount for model accuracy. This sets the stage for innovative approaches, including the integration of advanced language models like ChatGPT, to optimize the data labeling process and elevate the efficacy of Named Entity Extraction.

Overview of ChatGPT


Meet ChatGPT, a groundbreaking language model from OpenAI that’s changing how computers understand and generate human-like text.

 

Understanding Text:

ChatGPT is really good at figuring out what text means. It can understand the meaning, context, and details in all sorts of written information.

 

Generating Text Like Humans:

Not only does ChatGPT understand text, but it can also create new text that sounds just like something a person might say. This makes it super versatile for many different language tasks.

 

Versatility in Language Tasks:

You can use ChatGPT for a lot of things! Whether it’s making chatbots, translating languages, summarizing content, or creating text, ChatGPT is like a Swiss Army knife for language-related tasks.

 

Innovation in Language:

ChatGPT is not just an ordinary language model; it’s a cutting-edge technology that’s changing how we communicate with computers. It’s making machines more human-like in understanding and responding to what we say.

In a nutshell, ChatGPT is not your average language model. It’s a powerful tool that’s making computers speak our language in a way that’s both smart and creative.

Using ChatGPT for Pre-annotation

Efficient Workflow:

  • Automation with ChatGPT: Leverage ChatGPT’s language understanding to automatically mark potential named entities in unlabelled text.
  • Streamlining Process: Pre-annotation accelerates NER data labeling, saving time and improving overall efficiency.
 

Enhanced Accuracy:

  • Human-Machine Collaboration: Combine ChatGPT’s suggestions with human expertise for accurate and high-quality labeled datasets.
  • Improved Validation: Human-labelers can focus on refining and validating pre-annotated entities, ensuring accuracy in the labeling workflow.
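As a rough sketch of this human-machine collaboration, the hypothetical helper below applies a reviewer's accept/reject decisions to the model's pre-annotated spans (the function name and data shapes are illustrative, not part of any library):

```python
def apply_review(suggestions, decisions):
    """Keep only the pre-annotated spans a human reviewer accepted.

    suggestions: list of dicts like {"start": 0, "end": 13, "label": "PERSON"}
    decisions:   dict mapping (start, end) -> True (accept) / False (reject)
    """
    reviewed = []
    for span in suggestions:
        key = (span["start"], span["end"])
        # Default to keeping any span the reviewer has not explicitly rejected
        if decisions.get(key, True):
            reviewed.append(span)
    return reviewed

# ChatGPT pre-annotations for "Tim Daugherty works at Millar Inc."
suggestions = [
    {"start": 0, "end": 13, "label": "PERSON"},
    {"start": 23, "end": 34, "label": "ORGANIZATION"},
]
# The reviewer rejects the second span
decisions = {(23, 34): False}
print(apply_review(suggestions, decisions))  # [{'start': 0, 'end': 13, 'label': 'PERSON'}]
```

The point is that the model does the tedious first pass, while the human only confirms or vetoes spans instead of drawing them from scratch.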

Comparisons with Traditional Methods

Traditional Methods:

Traditional NER data labeling methods rely on manual processes, introducing challenges related to both efficiency and accuracy. These methods are time-consuming, requiring substantial effort from human annotators to identify and annotate named entities. Moreover, the manual nature makes them prone to errors, especially with large datasets, leading to inconsistencies and inaccuracies in the labeled data.

 

ChatGPT-Based Pre-annotation:

ChatGPT’s pre-annotation approach brings a notable shift in addressing the limitations of traditional methods, offering clear benefits in efficiency and accuracy. By automating the initial stage, ChatGPT significantly speeds up the labeling process, saving time and effort compared to manual methods. The automated process enhances consistency and minimizes human errors in identifying entities throughout the labeling process.

 

Advantages of ChatGPT:

ChatGPT’s advantages extend to improving the accuracy of NER data labeling. Its advanced language understanding enhances accuracy by interpreting the contextual nuances in unlabelled text. In benchmark tests, especially with GPT-4, ChatGPT outperforms traditional methods in recognizing people, organizations, and locations. The model excels in diverse NER tasks, offering precise identification of named entities.

 

Benchmark Comparison:

  • People Names: SpaCy and GPT-3.5 show some misclassifications, while GPT-4, Bard, and Llama2 provide accurate results with additional information. Top performers: Bard, Llama2, and GPT-4.
  • Organization Names: GPT-4 and GPT-3.5 offer accurate results with additional context; Llama2 lacks deeper context, and Bard and SpaCy have more pronounced misses. Top performers: GPT-4 and GPT-3.5.
  • Locations: SpaCy’s results are basic, with limited context understanding; Llama2 and Bard focus on locations but lack extended context. GPT-4 achieves almost flawless recognition and contextual understanding. Top performer: GPT-4.

Integration with Python for NER Pre-annotation:

Integrating ChatGPT with Python for Named Entity Recognition (NER) pre-annotation involves leveraging the OpenAI API. Below are practical examples demonstrating the integration process:

  1. Install OpenAI Python Package:

Ensure you have the OpenAI Python package installed. You can do this using pip:

!pip install openai

  2. Obtain OpenAI API Key:

You need to sign up for OpenAI and obtain your API key, which you’ll use for making API requests.

				
import openai

# Set your OpenAI API key (replace with your actual key)
openai.api_key = "api_key"

Python code example, building the pre-annotation prompt:
base_prompt = """
In the sentence below, give me the list of:
- organization named entity
- location named entity
- person named entity
- miscellaneous named entity.
Format the output in json with the following keys:
- ORGANIZATION for organization named entity
- LOCATION for location named entity
- PERSON for person named entity
- MISCELLANEOUS for miscellaneous named entity.
Sentence below:
"""

text = """
Pearland Economic Development Corp. (PEDC) is planning an event at the beautiful Ivy Lofts in Houston’s East End. Tim Daugherty, President and CEO of Millar Inc., and Matt Buchanan, President of Pearland Economic
Development Corp., will be attending. The event will showcase the growth of Millar Inc., a leading organization in the biomedical field. Additionally, representatives from Mitsubishi Heavy Industries Compressor
International Corp. and Dover Energy will join, adding to the diversity of attendees. Make sure to visit the spectacular Lower Kirby District during your visit to Houston, located at 11950 N. Spectrum Blvd.
The discussion will cover innovative developments in various industries, including the latest technologies from Cardiovascular Systems Inc. and the visionary projects of Modern Land (China) Co. Ltd.
"""
response = openai.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    messages=[
        {"role": "system", "content": base_prompt},
        {"role": "user", "content": text}
    ]
)
generated_content = response.choices[0].message.content
print(generated_content)
```json
{
  "ORGANIZATION": ["Pearland Economic Development Corp.", "Millar Inc.", "Mitsubishi Heavy Industries Compressor International Corp.", "Dover Energy", "Cardiovascular Systems Inc.", "Modern Land (China) Co. Ltd."],
  "LOCATION": ["Houston’s East End", "Ivy Lofts", "Lower Kirby District", "11950 N. Spectrum Blvd."],
  "PERSON": ["Tim Daugherty", "Matt Buchanan"],
  "MISCELLANEOUS": ["East End"]
}
```

Streamlined Interaction with JSON Responses

Incorporating the response_format={"type": "json_object"} parameter in your OpenAI GPT API requests instructs the API to return its result as a structured JSON object. This simplifies extracting the generated content and associated metadata, facilitating seamless integration into downstream applications. The JSON format offers a standardized, versatile foundation, letting developers reliably pull out specific fields or adapt content dynamically. Note that JSON mode requires the word "JSON" to appear somewhere in the prompt, which the base prompt above already satisfies ("Format the output in json").

				
response = openai.chat.completions.create(
    model="gpt-3.5-turbo-1106",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": base_prompt},
        {"role": "user", "content": text}
    ]
)
generated_content = response.choices[0].message.content
print(generated_content)
{
  "ORGANIZATION": [
    "Pearland Economic Development Corp. (PEDC)",
    "Millar Inc.",
    "Mitsubishi Heavy Industries Compressor International Corp.",
    "Dover Energy",
    "Cardiovascular Systems Inc.",
    "Modern Land (China) Co. Ltd."
  ],
  "LOCATION": [
    "Ivy Lofts",
    "Houston",
    "East End",
    "Lower Kirby District",
    "11950 N. Spectrum Blvd."
  ],
  "PERSON": [
    "Tim Daugherty",
    "Matt Buchanan"
  ],
  "MISCELLANEOUS": [
    "Millar Inc., a leading organization in the biomedical field"
  ]
}
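With JSON mode enabled, the returned string can be parsed directly into a Python dictionary. A small standalone sketch, using a hard-coded string in place of the live API response:

```python
import json

# In practice this string would come from response.choices[0].message.content
generated_content = '{"PERSON": ["Tim Daugherty"], "LOCATION": ["Houston"]}'

# json.loads turns the JSON string into a regular Python dict
json_data = json.loads(generated_content)
print(json_data["PERSON"])  # ['Tim Daugherty']
```

This parsed dictionary is exactly the shape the conversion helper below expects as its json_data argument.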

The Python code defines convert_to_entity_format to structure data for spaCy visualization. Taking json_data with entity categories and text as inputs, the function iterates through categories, finding entities in the text. It records start and end indices, assigns labels, and returns a structured entities dictionary. The example showcases this process, emphasizing the code’s role in preparing data for spaCy’s displacy visualization, which enhances the interpretability of recognized entities in the given text.

				
def convert_to_entity_format(json_data, text):
    entities = {}

    for category, items in json_data.items():
        entities[category] = []
        for item in items:
            start_idx = text.find(item)
            if start_idx == -1:
                # Skip entities the model returned that don't appear verbatim in the text
                continue
            end_idx = start_idx + len(item)
            entities[category].append({"start": start_idx, "end": end_idx, "label": category})

    return entities

# Example JSON data
json_data = {
    "ORGANIZATION": [
        "Pearland Economic Development Corp.",
        "Millar Inc.",
        "Mitsubishi Heavy Industries Compressor International Corp.",
        "Dover Energy",
        "Cardiovascular Systems Inc.",
        "Modern Land (China) Co. Ltd."
    ],
    "LOCATION": [
        "Ivy Lofts",
        "Houston’s East End",
        "Lower Kirby District",
        "11950 N. Spectrum Blvd.",
        "Houston"
    ],
    "PERSON": [
        "Tim Daugherty",
        "Matt Buchanan"
    ],
    "MISCELLANEOUS": [
        "PEDC",
        "CEO"
    ]
}

# Example text
text = """
Pearland Economic Development Corp. (PEDC) is planning an event at the beautiful Ivy Lofts in Houston’s East End.
Tim Daugherty, President and CEO of Millar Inc., and Matt Buchanan, President of Pearland Economic Development Corp., will be attending.
The event will showcase the growth of Millar Inc., a leading organization in the biomedical field. Additionally, representatives
from Mitsubishi Heavy Industries Compressor International Corp. and Dover Energy will join, adding to the diversity of attendees.
Make sure to visit the spectacular Lower Kirby District during your visit to Houston, located at 11950 N. Spectrum Blvd.
The discussion will cover innovative developments in various industries, including the latest technologies from Cardiovascular Systems Inc.
and the visionary projects of Modern Land (China) Co. Ltd.
"""

# Convert JSON data to the desired format
entities = convert_to_entity_format(json_data, text)

# Print the result
print(entities)
{'ORGANIZATION': [{'start': 1, 'end': 36, 'label': 'ORGANIZATION'}, {'start': 151, 'end': 162, 'label': 'ORGANIZATION'}, {'start': 386, 'end': 444, 'label': 'ORGANIZATION'}, {'start': 449, 'end': 461, 'label': 'ORGANIZATION'}, {'start': 744, 'end': 771, 'label': 'ORGANIZATION'}, {'start': 802, 'end': 830, 'label': 'ORGANIZATION'}], 'LOCATION': [{'start': 82, 'end': 91, 'label': 'LOCATION'}, {'start': 95, 'end': 113, 'label': 'LOCATION'}, {'start': 546, 'end': 566, 'label': 'LOCATION'}, {'start': 608, 'end': 631, 'label': 'LOCATION'}, {'start': 95, 'end': 102, 'label': 'LOCATION'}], 'PERSON': [{'start': 115, 'end': 128, 'label': 'PERSON'}, {'start': 168, 'end': 181, 'label': 'PERSON'}], 'MISCELLANEOUS': [{'start': 38, 'end': 42, 'label': 'MISCELLANEOUS'}, {'start': 144, 'end': 147, 'label': 'MISCELLANEOUS'}]}


This code combines recognized entities from the provided JSON data with spaCy’s own entity annotations for visualization. It processes a sample text using spaCy, converts the entity information into a dictionary format, and then visualizes these entities, using spaCy’s displacy module. The resulting visualization enhances the comprehension of identified entities in the text.

				
from IPython.display import display, HTML
import spacy
from spacy import displacy

# Process the text with spaCy (reusing the `text` defined above)
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)

# Flatten the GPT-derived spans produced by convert_to_entity_format
gpt_spans = [span for spans in entities.values() for span in spans]

# Add spaCy's own entities, skipping label types already covered by the GPT output
spacy_spans = [
    {"start": ent.start_char, "end": ent.end_char, "label": ent.label_}
    for ent in doc.ents
    if ent.label_ not in entities
]

# Convert to the dictionary format expected by displacy's manual mode
dic_ents = {
    "text": text,
    "ents": sorted(gpt_spans + spacy_spans, key=lambda x: (x["start"], x["end"])),
    "title": None
}

# Visualize the entities directly in the Jupyter Notebook
html = displacy.render(dic_ents, manual=True, style="ent")
display(HTML(html))


Addressing Challenges:

 

  1. Ambiguity and Context Understanding:
  • Challenge: Ambiguous or context-dependent entities may lead to inaccurate pre-annotations.
  • Mitigation: Clearly define context, provide cues, and refine during human validation.
  2. Entity Recognition Variability:
  • Challenge: Recognition may vary based on training data, affecting generalization.
  • Mitigation: Fine-tune on domain-specific data and regularly update for better performance.
  3. Handling New Entities:
  • Challenge: Difficulty recognizing entities outside the model’s training data.
  • Mitigation: Maintain an updated entity dictionary, periodically retrain, and expand data sources.
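One way to implement the entity-dictionary mitigation is a simple check that flags pre-annotated entities not yet present in a maintained dictionary, routing them to human review. A hypothetical sketch (function and variable names are illustrative):

```python
def flag_unknown_entities(pre_annotations, known_entities):
    """Split pre-annotated entities into vetted ones and ones needing review.

    pre_annotations: dict mapping label -> list of entity strings (ChatGPT output)
    known_entities:  set of entity strings already vetted for the domain
    """
    known, needs_review = [], []
    for label, items in pre_annotations.items():
        for item in items:
            # Route anything outside the maintained dictionary to a reviewer
            (known if item in known_entities else needs_review).append((item, label))
    return known, needs_review

known_entities = {"Millar Inc.", "Houston"}
pre_annotations = {
    "ORGANIZATION": ["Millar Inc.", "Dover Energy"],
    "LOCATION": ["Houston"],
}
known, needs_review = flag_unknown_entities(pre_annotations, known_entities)
print(needs_review)  # [('Dover Energy', 'ORGANIZATION')]
```

Entities that pass the check can flow straight into the labeled dataset, while flagged ones feed both the review queue and, once confirmed, the dictionary itself.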

Conclusion:

In summary, our exploration into the realm of ChatGPT for named entity extraction brought forth several key insights. We started by understanding its potential impact on machine learning data labeling tasks, highlighting the convenience it introduces to the process.

Moving forward, the importance of labeled data in training effective Named Entity Recognition (NER) models was underscored. ChatGPT, as a language model, showcased its prowess in comprehending and generating human-like text, setting the stage for its application in pre-annotating NER tasks.

Comparing ChatGPT with traditional methods revealed its efficiency and accuracy advantages, especially when recognizing entities like people, organizations, and locations. Practical integration examples with Python were shared, empowering readers to implement and experiment with ChatGPT in their NER workflows.

Visualizing the extracted entities using spaCy’s displacy provided a tangible way to grasp the labeled information within the text, enhancing the overall understanding of the process.

However, challenges were acknowledged, and strategies for mitigating them were discussed. The dynamic landscape of natural language processing, coupled with ChatGPT’s evolving capabilities, hints at a promising future for improving and refining NER labeling tasks.

 

As we conclude, we invite you to delve further into our platform. Explore additional articles, experiment with the features discussed, and actively contribute by sharing your insights. Your engagement is pivotal in fostering a vibrant learning community. Thank you for being an integral part of our journey!
