
Advanced NER With GPT-4, LLaMA3, and Mixtral

July 25th, 2024

Generative deep learning models based on Transformers have significantly advanced NLP use cases in recent years. Among these, GPT-4, LLaMA3, and Mixtral stand out as powerful text generation models that have revolutionized tasks such as named entity recognition (NER).

In this article, we will explore how to leverage these models for advanced NER and compare their performance in a case study. We will use GPT-4o via the OpenAI API and run LLaMA3 and Mixtral using Ollama, a framework for executing LLMs locally. Let’s dive in!

The Traditional Approach to NER

Entity extraction, one of the oldest and most common NLP tasks, has traditionally relied on frameworks like spaCy and NLTK. spaCy is known for its production-readiness and speed, offering pre-trained pipelines with built-in entity types such as people, organizations, and dates. NLTK, on the other hand, is excellent for research but less suited for production.
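
For example, a pre-trained spaCy pipeline can extract its built-in entity types in a few lines (a minimal sketch; it assumes the en_core_web_sm model has already been downloaded):

import spacy

# Load a small pre-trained English pipeline
# (install with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

doc = nlp("Sarah joined Acme Corp in Boston in June 2021.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Sarah PERSON", "Acme Corp ORG", "June 2021 DATE"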

 

However, these pre-trained models only support the entities they were trained for, which often do not match real-world requirements. Custom entities such as job titles, product names, and company names necessitate creating extensive datasets through tedious annotation processes followed by model training. This repetitive and labor-intensive process has been a significant bottleneck.
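
To give a sense of the effort involved, custom NER training data is typically annotated with character offsets for every entity. The sketch below shows a single example in the spaCy-style offset format; the text and labels are made up for illustration:

# One hand-annotated training example in the spaCy-style offset format:
# (text, {"entities": [(start_char, end_char, label)]})
TRAIN_DATA = [
    (
        "Jane Doe worked at Acme Corp as a Data Scientist.",
        {"entities": [(0, 8, "NAME"), (19, 28, "COMPANY")]},
    ),
]
# A usable dataset requires hundreds or thousands of such examples.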

Enter Generative Models

The advent of generative large language models has transformed the landscape. OpenAI’s GPT-3, and later GPT-4o, introduced capabilities that extend beyond traditional NLP models, performing tasks like NER, summarization, translation, and classification without explicit training.

Despite GPT-4o’s strengths, it comes with drawbacks such as high cost, slower performance compared to frameworks like spaCy, and complexity of use. Fortunately, open-source alternatives like LLaMA3 by Meta and Mixtral by Mistral AI have emerged. These models can be deployed on any server, albeit requiring some MLOps knowledge and incurring infrastructure costs.

Integrating Knowledge Graphs into the RAG Stack

Integrating Knowledge Graphs into the RAG Stack significantly enhances its performance by leveraging structured data relationships. Here’s why this integration is important and the benefits it brings.

Why Is This Integration Important?

Integrating Knowledge Graphs into the RAG Stack enhances the system’s ability to provide accurate and contextually relevant information. Knowledge Graphs offer a structured representation of data, capturing complex relationships and entities. This structured data helps improve both the retrieval and generation processes in the RAG Stack. A toy illustration of this retrieval step follows the benefits below.

Benefits of Integration:

Enhanced Retrieval Accuracy: Knowledge Graphs enable the retrieval component to find more relevant and precise information by leveraging structured relationships between entities. This means the system can fetch more accurate data tailored to the user’s query.

Improved Generation Quality: By providing the generation model with well-organized and context-rich data, the quality and relevance of generated responses can be significantly improved. This ensures that the output is not only accurate but also contextually appropriate.

Contextual Understanding: Knowledge Graphs help the RAG Stack better understand the context of queries. This deeper understanding leads to more accurate and context-aware responses, enhancing the overall user experience.

Better Decision-Making: The structured data in Knowledge Graphs aids in better decision-making by providing a comprehensive view of the information. This holistic perspective supports more informed responses.
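
As a rough illustration of the retrieval side, the sketch below augments a prompt with facts looked up from a toy knowledge graph; the graph, entities, and helper names are invented for this example:

# Toy knowledge graph as (subject, relation, object) triples -- illustrative only.
KG = [
    ("Acme Corp", "headquartered_in", "Boston"),
    ("Acme Corp", "develops", "diagnostic AI software"),
    ("Jane Doe", "works_at", "Acme Corp"),
]

def retrieve_facts(entity):
    """Return the triples that mention the given entity."""
    return [t for t in KG if entity in (t[0], t[2])]

def build_augmented_prompt(question, entity):
    facts = "\n".join(f"- {s} {r} {o}" for s, r, o in retrieve_facts(entity))
    return f"Use these facts:\n{facts}\n\nQuestion: {question}"

print(build_augmented_prompt("Where is Jane Doe's employer based?", "Acme Corp"))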

Setting Up Ollama for LLaMA3 and Mixtral

To run LLaMA3 and Mixtral using Ollama, follow these steps:

  1. Install Ollama: Ensure you have Ollama installed and configured on your system.
  2. Set Up the Environment: Configure your environment to use GPUs for optimal performance.
  3. Load Models: Load the LLaMA3 and Mixtral models into Ollama (see the sketch below).
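
For step 3, a minimal sketch using the Ollama Python client might look like this (the model tags assume the default Ollama registry names):

import ollama

# Pull LLaMA3 and Mixtral so they are available locally before inference.
for model in ("llama3", "mixtral"):
    ollama.pull(model)  # no-op if the model is already downloaded

# Quick smoke test that the Ollama server responds.
reply = ollama.chat(model="llama3", messages=[{"role": "user", "content": "ping"}])
print(reply["message"]["content"])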

With the setup ready, we can now proceed to our case study.

Fine-tune and evaluate your model with UBIAI

  • Prepare your high-quality training data
  • Train best-in-class LLMs: build domain-specific models that truly understand your context and fine-tune effortlessly, no coding required
  • Deploy with just a few clicks: go from a fine-tuned model to a live API endpoint with a single click
  • Optimize with confidence: unlock instant, scalable ROI by monitoring and analyzing model performance to ensure peak accuracy and tailored outcomes

Case Study: Comparing GPT-4, LLaMA3, and Mixtral

We conducted a comprehensive comparison of GPT-4o, LLaMA3, and Mixtral for NER, using entity extraction from resumes as our test case. Below are the steps and findings from our study:

  1. Dataset: We used a test resume and the following labels: ['SKILLS', 'NAME', 'CERTIFICATE', 'HOBBIES', 'COMPANY', 'UNIVERSITY']
  2. Models and Framework:
    • GPT-4 via API
    • LLaMA3 and Mixtral via Ollama
  3. Evaluation Metrics: We converted each model’s output to JSON, then wrote a script that finds the entities unique to each model’s response. Each model’s score is its number of unique entities divided by the total unique entities across the pair, expressed as a percentage.
  4. Code Implementation:
    • GPT-4: Utilized the OpenAI API for inference.
    • LLaMA3 and Mixtral: Implemented using Ollama for model execution.

Here is a snippet of the code used for comparison:

import openai
import ollama
import json

# OpenAI client (assumes the OPENAI_API_KEY environment variable is set)
client = openai.OpenAI()

def generate_prompt(labels, text):
    prompt = """
    Extract the entities for the following labels from the given text and provide the results in JSON format.
    - Entities must be extracted exactly as mentioned in the text.
    - Return each entity under its label without creating new labels.
    - Provide a list of entities for each label, ensuring that if no entities are found for a label, an empty list is returned.
    - Accuracy and relevance in your responses are key.

    labels:"""

    for label in labels:
        prompt += f"\n- {label}"

    prompt += """
    JSON Structure:
    {
    """

    for label in labels:
        prompt += f'"{label}": [],\n'

    prompt += "}\n\n"
    prompt += "\n\nTEXT:"
    prompt += text
    return prompt

labels = ['SKILLS', 'NAME', 'CERTIFICATE', 'HOBBIES', 'COMPANY', 'UNIVERSITY']

# GPT-4o via the OpenAI API
def gpt_ner(prompt):
    MODEL = "gpt-4o"
    completion = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": "Supreme Entity Recognition Expert"},
            {"role": "user", "content": prompt},
        ],
    )
    return completion.choices[0].message.content

# Ollama for LLaMA3 and Mixtral
def ollama_ner(model, prompt):
    response = ollama.chat(model=model, messages=[
        {
            'role': 'user',
            'content': prompt,
        },
    ])
    return response['message']['content']

# Sample resume text (redacted)
text = "......"

# Build one shared prompt so all three models receive identical instructions
prompt = generate_prompt(labels, text)

# Results
gpt_result = gpt_ner(prompt)
llama_result = ollama_ner('llama3', prompt)
mixtral_result = ollama_ner('mixtral', prompt)

print("GPT-4o Result:", gpt_result)
print("LLaMA3 Result:", llama_result)
print("Mixtral Result:", mixtral_result)

Comparing Results

To evaluate the performance of each model, we compared the unique entities they extracted. Here’s how we did it:

  1. Extract Unique Entities: For each model, we extracted the entities and stored them.
  2. Count Unique Entities: We counted the unique entities extracted by each model.
  3. Comparison: We compared the counts and the types of unique entities extracted by each model.

Here’s the code snippet used for the comparison:

import json

def compare_jsons(json1, json2):
    # Convert JSON strings to dictionaries if they're not already
    if isinstance(json1, str):
        json1 = json.loads(json1)
    if isinstance(json2, str):
        json2 = json.loads(json2)

    differences = {"json1_unique": {}, "json2_unique": {}}

    all_keys = set(json1.keys()) | set(json2.keys())

    for key in all_keys:
        if key in json1 and key in json2:
            if isinstance(json1[key], list) and isinstance(json2[key], list):
                set1 = set(json1[key])
                set2 = set(json2[key])

                diff1 = set1 - set2
                diff2 = set2 - set1

                if diff1:
                    differences["json1_unique"][key] = list(diff1)
                if diff2:
                    differences["json2_unique"][key] = list(diff2)
            elif json1[key] != json2[key]:
                differences["json1_unique"][key] = json1[key]
                differences["json2_unique"][key] = json2[key]
        elif key in json1:
            differences["json1_unique"][key] = json1[key]
        else:
            differences["json2_unique"][key] = json2[key]

    return differences

def compare_json_results(json_diff):
    total_unique_entities = 0
    unique_counts = {}

    for json_key, categories in json_diff.items():
        unique_counts[json_key] = 0
        for category, entities in categories.items():
            unique_counts[json_key] += len(entities)
            total_unique_entities += len(entities)

    # Calculate scores
    if total_unique_entities == 0:
        return "Both results are identical. Score: 0"

    scores = {}
    for json_key, count in unique_counts.items():
        scores[json_key] = (count / total_unique_entities) * 100

    # Prepare the result string
    result = f"Total unique entities: {total_unique_entities}\n"
    for json_key, score in scores.items():
        result += f"{json_key} unique entity count: {unique_counts[json_key]}\n"
        result += f"{json_key} score: {score:.2f}%\n"

    return result

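A minimal usage sketch tying it together (assuming the three model outputs above are valid JSON strings):

# Pairwise comparison of the raw JSON outputs returned by the models
diff = compare_jsons(gpt_result, mixtral_result)
print(json.dumps(diff, indent=2))   # entities unique to each model
print(compare_json_results(diff))   # counts and percentage scores
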
Results and Analysis of Advanced NER Performance

The results of our study indicated notable differences in the models’ performance. We compared the unique entities extracted by each model (GPT-4o, LLaMA3, and Mixtral) and calculated their unique entity counts and respective scores.

Comparison Between GPT-4o and Mixtral

{
  "GPT-4o_unique": {
    "HOBBIES": [
      "scuba diver"
    ],
    "UNIVERSITY": [
      "Johns Hopkins University"
    ],
    "SKILLS": [
      "several programming languages"
    ]
  },
  "Mixtral_unique": {
    "HOBBIES": [
      "scuba diving"
    ],
    "COMPANY": [
      "MIT",
      "Stanford University"
    ],
    "SKILLS": [
      "machine learning",
      "programming languages",
      "predictive diagnostics in oncology"
    ]
  }
}

  • Total Unique Entities: 9
  • GPT-4o Unique Entity Count: 3
  • GPT-4o Score: 33.33%
  • Mixtral Unique Entity Count: 6
  • Mixtral Score: 66.67%

Comparison Between LLaMA3 and Mixtral

{
  "LLAMA3_unique": {
    "HOBBIES": [
      "scuba diver"
    ],
    "CERTIFICATE": [
      "Advanced AI in Healthcare certification",
      "Ethical AI Implementation certificate"
    ],
    "UNIVERSITY": [
      "Johns Hopkins University"
    ]
  },
  "Mixtral_unique": {
    "HOBBIES": [
      "scuba diving"
    ],
    "CERTIFICATE": [
      "Ethical AI Implementation certificate from the World Health Organization",
      "Advanced AI in Healthcare certification from Johns Hopkins University"
    ],
    "COMPANY": [
      "MIT",
      "Stanford University"
    ],
    "SKILLS": [
      "machine learning",
      "predictive diagnostics in oncology"
    ]
  }
}

  • Total Unique Entities: 11
  • LLaMA3 Unique Entity Count: 4
  • LLaMA3 Score: 36.36%
  • Mixtral Unique Entity Count: 7
  • Mixtral Score: 63.64%

Comparison Between GPT-4o and LLaMA3

{
  "GPT-4o_unique": {
    "CERTIFICATE": [
      "Ethical AI Implementation certificate from the World Health Organization",
      "Advanced AI in Healthcare certification from Johns Hopkins University"
    ],
    "SKILLS": [
      "several programming languages"
    ]
  },
  "LLAMA3_unique": {
    "CERTIFICATE": [
      "Advanced AI in Healthcare certification",
      "Ethical AI Implementation certificate"
    ],
    "SKILLS": [
      "programming languages"
    ]
  }
}

  • Total Unique Entities: 6
  • GPT-4o Unique Entity Count: 3
  • GPT-4o Score: 50.00%
  • LLaMA3 Unique Entity Count: 3
  • LLaMA3 Score: 50.00%

Summary Table of Results

Comparison            Model     Unique Entities   Score
GPT-4o vs Mixtral     GPT-4o    3                 33.33%
GPT-4o vs Mixtral     Mixtral   6                 66.67%
LLaMA3 vs Mixtral     LLaMA3    4                 36.36%
LLaMA3 vs Mixtral     Mixtral   7                 63.64%
GPT-4o vs LLaMA3      GPT-4o    3                 50.00%
GPT-4o vs LLaMA3      LLaMA3    3                 50.00%

Explanation of Results

The comparison results highlighted Mixtral as the best-performing model in terms of unique entity extraction. Mixtral consistently identified more unique entities across various categories compared to GPT-4o and LLaMA3. Here are the key points:

  • Mixtral vs GPT-4o: Mixtral extracted more unique entities (6 vs 3) and achieved a higher score (66.67% vs 33.33%).
  • Mixtral vs LLaMA3: Mixtral again outperformed LLaMA3 by identifying more unique entities (7 vs 4) with a higher score (63.64% vs 36.36%).
  • GPT-4o vs LLaMA3: Both models had an equal number of unique entities (3 each) and identical scores (50.00%).

Overall, Mixtral demonstrated superior performance in extracting unique entities, making it the most effective model among the three for this NER task.

Conclusion

Our comparison of GPT-4o, LLaMA3, and Mixtral for advanced Named Entity Recognition (NER) highlighted Mixtral as the top performer. Mixtral consistently extracted a greater number of unique entities, achieving a score of 66.67% against GPT-4o’s 33.33% and 63.64% against LLaMA3’s 36.36%.

 

These findings emphasize Mixtral’s superiority in identifying diverse entities, making it a powerful tool for applications requiring comprehensive entity extraction. By leveraging advanced models like Mixtral, we can significantly streamline and enhance the accuracy of NER tasks, reducing the need for extensive data annotation and training.
