LLM fine-tuning for NER using UBIAI annotation tool

Jan 22nd, 2024

Have you ever thought about improving a large language model’s capabilities on a specific task, only to run into trouble annotating your training data? If so, this tutorial is for you. It will show you how to use UBIAI’s annotation tool for LLM fine-tuning.

Specifically, we will be covering:

  • UBIAI
  • What Is Fine-Tuning
  • Why LLM Fine-Tuning
  • UBIAI’s Annotation Tool
  • Annotated Data Formatting
  • Fine-Tuning

 

1- UBIAI:

UBIAI is a startup headquartered in California that offers cloud-based solutions and services specializing in Natural Language Processing (NLP), particularly text annotation. We focus on helping users extract actionable insights from unstructured documents.


2- What Is Fine-Tuning:

LLM fine-tuning refers to the process of making small adjustments or refinements to a system or model to optimize its performance for a specific task or dataset.


In the context of machine learning, fine-tuning typically involves taking a pre-trained model and adjusting its parameters to better suit a particular use case. This approach is particularly valuable when working with limited amounts of labeled data for a specific task.

By leveraging the knowledge gained during pre-training on a larger and more diverse dataset, fine-tuning allows the model to adapt and specialize for the target task. The process requires striking a balance to avoid overfitting to the new data while still capturing the relevant patterns and nuances. Fine-tuning is widely employed across various domains, including natural language processing, computer vision, and audio analysis, enabling models to achieve higher accuracy and better performance in specific applications.

3- Why LLM Fine-Tuning:

LLM fine-tuning has emerged as a pivotal strategy for maximizing the potential of large language models such as GPT-3.5 Turbo. We’ve dedicated a separate guide to fine-tuning GPT-3 to illustrate its significance.

 

 

While pre-trained models showcase impressive human-like text generation, fine-tuning serves as the key to unlocking their true capabilities. This process empowers developers to tailor the model by training it on domain-specific data, allowing adaptation to specialized use cases beyond the scope of general-purpose training. Fine-tuning proves instrumental in enhancing the model’s relevance, accuracy, and performance, particularly for niche applications.

 

 

Fine-tuning facilitates the customization of models, enabling the creation of unique and distinctive experiences tailored to specific requirements and domains. Through training on domain-specific data, the model produces more pertinent and precise outputs within that niche. This level of customization provides businesses with the capability to develop AI applications that are precisely attuned to their needs.

 

 

The fine-tuning process enhances the model’s capacity to adhere to instructions and consistently produce reliable output formatting. Training on formatted data allows the model to grasp the desired structure and style, ultimately refining steerability. The outcome is more predictable and controllable outputs, contributing to increased reliability.

4- UBIAI’s Annotation Tool:


To perform fine-tuning on any model, it is essential to have annotated data for training. At UBIAI, we provide user-friendly annotation features that support over 20 languages and can handle various document types, including PDFs and images. We also offer diverse export formats for your annotated data, ensuring flexibility and compatibility with a wide range of Large Language Models (LLMs).

To do so, first log in to your UBIAI account, go to your profile, and set your OpenAI key. You can get this key from OpenAI’s official website.


Create a new project and start by importing the data you want to annotate for fine-tuning.


Then, navigate to the Models section by accessing the right sidebar on the projects page. This section will provide you with the tools and options necessary for configuring and managing models within your project. 


In this article we are fine-tuning an LLM, so go to the LLM section and add a model by clicking the blue button.




In UBIAI, we provide an intuitive graphical user interface that makes it easy to parameterize your model. The temperature parameter controls the model’s creativity, giving you control over how inventive its output is. The few-shot examples setting is the number of annotated instances supplied to the model in the prompt, a proven way to improve the quality and effectiveness of its responses. Finally, the context length is the maximum number of tokens the request should not exceed.
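To make these parameters concrete, the sketch below shows roughly where temperature, few-shot examples, and a token cap plug into a direct OpenAI Chat Completions call. It is purely illustrative rather than UBIAI’s internal implementation; the model name, system prompt, and example texts are placeholders.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Few-shot examples: annotated input/output pairs included directly in the prompt
few_shot_examples = [
    {"role": "user", "content": "John Doe is a Software Engineer."},
    {"role": "assistant", "content": 'NAME:["John Doe"]\nJOB_TITLE:["Software Engineer"]'},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # placeholder model name
    temperature=0.2,        # lower values = more deterministic, higher values = more creative
    max_tokens=512,         # caps the response length; the context length bounds prompt + response combined
    messages=(
        [{"role": "system", "content": "Extract the named entities from the text."}]
        + few_shot_examples
        + [{"role": "user", "content": "Jane Smith joined Acme Corp in 2002."}]
    ),
)
print(response.choices[0].message.content)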


Preparing the annotations for visualization can be handled by a small helper, convert_to_entity_format, which structures data for spaCy visualization. Taking json_data with entity categories and the document text as inputs, the function iterates through the categories, finds each entity in the text, records its start and end indices, assigns the label, and returns a structured entities dictionary ready for spaCy’s displacy visualization, which makes the recognized entities in the text easier to interpret.
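Here is a minimal sketch of what such a helper might look like, assuming json_data maps each entity label to a list of entity strings and that displacy’s manual mode is used for rendering (a hypothetical reconstruction, not the exact code used by UBIAI):

from spacy import displacy

def convert_to_entity_format(json_data, text):
    """Hypothetical reimplementation of the helper described above."""
    ents = []
    for label, values in json_data.items():
        for value in values:
            start = text.find(value)  # locate the entity span in the raw text
            if start == -1:
                continue              # skip entities not found verbatim
            ents.append({"start": start, "end": start + len(value), "label": label})
    # displacy's manual mode expects the raw text plus a list of entity spans
    return {"text": text, "ents": sorted(ents, key=lambda e: e["start"])}

# Example usage
doc = convert_to_entity_format(
    {"NAME": ["John Doe"], "JOB_TITLE": ["Software Engineer"]},
    "John Doe works as a Software Engineer.",
)
html = displacy.render(doc, style="ent", manual=True)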


After importing the data and configuring the annotation model, the next steps involve validating the process and proceeding with the inference.


After ensuring that everything is validated, proceed to the prediction phase by clicking on the “Predict” button.


Remember that before doing the prediction you can always specify the labels you want to extract. 


At this stage, your data is annotated and ready to be used for fine-tuning LLMs, so feel free to export it in a suitable format. For OpenAI’s models, export the processed data in JSON format, which is suitable for fine-tuning GPT models.


Select the appropriate format and click on the option that best aligns with your specific case.


At this stage, all the data processed by the UBIAI annotation tool is ready for use and has been downloaded into a local folder.

5- Annotated Data Formatting:

OpenAI requires data to be formatted in JSONL (JSON Lines) format. JSONL is a format where each line represents a valid JSON object, and these lines are separated by newlines.  
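For instance, a hypothetical training file with two examples is simply two lines, each a complete JSON object (the arrow separator and trailing newline are explained below; the prompts and completions are placeholders):

{"prompt": "first training prompt ->", "completion": " the expected answer. \n"}
{"prompt": "second training prompt ->", "completion": " another expected answer. \n"}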

Each training example must follow the format expected by the model you are fine-tuning. For OpenAI’s chat models, an example is a short conversation formatted for the Chat Completions API (a sequence of messages); for the prompt/completion format used in this tutorial, each example pairs an input prompt with its expected completion. It is also important that a subset of the training examples explicitly addresses scenarios where the model’s behavior tends to deviate from the desired outcome, with completions that demonstrate the ideal response in those situations.

{
  "prompt": "my prompt ->",
  "completion": "the answer of the prompt. \n"
}

 

  • The term “prompt” refers to the input text that the model reads and processes. The primary separator is the arrow sign (->), clearly marking the boundary between the prompt and the expected response.
  • “Completion”, on the other hand, is the expected response corresponding to the given prompt. To mark the end of the answer, a newline character (“\n”) is used as a stop sequence.
training_data = [
  {
      "prompt": "XXXXXXXXXXXXXXXXX ->",
      "completion": """YYYYYYYYYYYYYYYYYYY .\n"""
  },
  {
      "prompt": "XXXXXXXXXXXXXXXXX ->",
      "completion": """YYYYYYYYYYYYYYYYYYY \n"""
  }
]
validation_data = [
  {
      "prompt": "XXXXXXXXXXXXXXXXX ->",
      "completion": """YYYYYYYYYYYYYYYYYYY \n"""
  },
  {
      "prompt": "XXXXXXXXXXXXXXXXX ->",
      "completion": """YYYYYYYYYYYYYYYYYYY .\n"""
  }
]

Handling data in the list format, as demonstrated earlier, could be practical for small datasets. Nevertheless, there are numerous advantages to storing the data in JSONL (JSON Lines) format. These benefits encompass scalability, interoperability, simplicity, and compatibility with the OpenAI API. It’s noteworthy that the OpenAI API necessitates data in JSONL format for the creation of fine-tuning jobs.

The code below defines the prepare_finetuning_data function, which transforms data annotated by the UBIAI annotation tool into prompt/completion pairs, ready to be serialized to JSONL for fine-tuning:

import json


def prepare_finetuning_data(data):
    finetuning_data = []
    for record in data:
        document = record.get("document", "")  # Assuming "document" is your input data
        tokens = record.get("tokens", [])  # Assuming "tokens" is the key for the list of tokens


        # Extract entities and their values from tokens
        entities = {}
        for token in tokens:
            entity_label = token.get("entityLabel", "")
            if entity_label:
                text = token.get("text", "")
                entities.setdefault(entity_label, []).append(text)


        # Create completion string with entities and their values
        completion = "\n".join([f"{entity}:[{', '.join(map(json.dumps, values))}]" for entity, values in entities.items()])


        # Create a dictionary with "prompt" and "completion" keys
        example = {"prompt": document, "completion": completion}
        finetuning_data.append(example)


    return finetuning_data
				
			

Let’s execute this function on an example annotated with UBIAI’s annotation tool, specifically a resume sample.


UBIAI’s tool generates a file of JSON objects, one per annotated document, following this format:

{
    "documentName": "functionalsample.pdf",
    "document": "Functional Resume Sample ...",
    "tokens": [
        {"text": "Smith\n2002", "start": 33, "end": 43, "token_start": 5, "token_end": 6, "entityLabel": "YEAR"},
        {"text": "2002", "start": 39, "end": 43, "token_start": 6, "token_end": 6, "entityLabel": "YEAR"},
        {"text": "1999-2002", "start": 877, "end": 886, "token_start": 119, "token_end": 119, "entityLabel": "YEAR"},
        {"text": "John Doe", "start": 100, "end": 108, "token_start": 15, "token_end": 16, "entityLabel": "NAME"},
        {"text": "Software Engineer", "start": 150, "end": 167, "token_start": 25, "token_end": 27, "entityLabel": "JOB_TITLE"}
    ],
    "relations": [],
    "classifications": []
}

Let’s execute prepare_finetuning_data on this example:

import json


def prepare_finetuning_data(data):
    finetuning_data = []
    for record in data:
        document = record.get("document", "")  # Assuming "document" is your input data
        tokens = record.get("tokens", [])  # Assuming "tokens" is the key for the list of tokens


        # Extract entities and their values from tokens
        entities = {}
        for token in tokens:
            entity_label = token.get("entityLabel", "")
            if entity_label:
                text = token.get("text", "")
                entities.setdefault(entity_label, []).append(text)


        # Create completion string with entities and their values
        completion = "\n".join([f"{entity}:[{', '.join(map(json.dumps, values))}]" for entity, values in entities.items()])


        # Create a dictionary with "prompt" and "completion" keys
        example = {"prompt": document, "completion": completion}
        finetuning_data.append(example)


    return finetuning_data


# Example usage
json_data = [  # Your list of JSON objects here
    {
        "documentName": "functionalsample.pdf",
        "document": "Functional Resume Sample ...",
        "tokens": [
            {"text": "Smith\n2002", "start": 33, "end": 43, "token_start": 5, "token_end": 6, "entityLabel": "YEAR"},
            {"text": "2002", "start": 39, "end": 43, "token_start": 6, "token_end": 6, "entityLabel": "YEAR"},
            {"text": "1999-2002", "start": 877, "end": 886, "token_start": 119, "token_end": 119, "entityLabel": "YEAR"},
            {"text": "John Doe", "start": 100, "end": 108, "token_start": 15, "token_end": 16, "entityLabel": "NAME"},
            {"text": "Software Engineer", "start": 150, "end": 167, "token_start": 25, "token_end": 27, "entityLabel": "JOB_TITLE"},
        ],
        "relations": [],
        "classifications": []
    },
    # Add more records as needed
]


finetuning_data = prepare_finetuning_data(json_data)


# Now `finetuning_data` is a list of prompt/completion dictionaries ready to be serialized to JSONL

If we print finetuning_data, we get the following result:

				
					[{'prompt': 'Functional Resume Sample ...', 
'completion': 'YEAR:["Smith\\n2002", "2002", "1999-2002"]\nNAME:["John Doe"]\nJOB_TITLE:["Software Engineer"]'}]
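To produce the JSONL file that the OpenAI API expects, write each dictionary as one JSON object per line. A minimal sketch, assuming we also append the arrow separator and the trailing-newline stop sequence described earlier (the file name finetuning_data.jsonl is arbitrary):

import json

with open("finetuning_data.jsonl", "w", encoding="utf-8") as f:
    for example in finetuning_data:
        record = {
            "prompt": example["prompt"] + " ->",        # append the prompt separator
            "completion": example["completion"] + " \n",  # append the stop sequence described above
        }
        f.write(json.dumps(record) + "\n")  # one JSON object per line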


				
			

6- Fine-Tuning:

This section walks you through fine-tuning GPT models using OpenAI’s user interface.

Open the OpenAI platform and head to the fine-tuning section, where you can upload the JSONL file you just exported and launch a fine-tuning job on the base model of your choice.
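If you prefer to script this step rather than use the web interface, the same workflow can be run with the openai Python package (v1.x). The sketch below assumes the finetuning_data.jsonl file produced earlier and a base model that accepts the prompt/completion format, such as babbage-002; chat models like gpt-3.5-turbo expect chat-formatted examples instead.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# 1) Upload the training file
training_file = client.files.create(
    file=open("finetuning_data.jsonl", "rb"),
    purpose="fine-tune",
)

# 2) Create the fine-tuning job on a base model
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="babbage-002",  # assumption: a completion-style base model
)

# 3) Check the job status; once it succeeds, the fine-tuned model name appears in the job object
print(client.fine_tuning.jobs.retrieve(job.id).status)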


Conclusion:

In summary, the UBIAI tool emerges as a game-changer in the fine-tuning landscape, offering a robust solution that empowers researchers to unlock the full potential of their language models. As we continue to explore the vast capabilities of natural language processing, tools like UBIAI pave the way for more efficient, effective, and collaborative advancements in the development and optimization of language models.
