Have you ever thought about improving a large language model’s capabilities on a specific task, only to get stuck annotating your training data? If so, this tutorial is for you. This article shows you how to use UBIAI’s annotation tool for LLM fine-tuning.
Specifically, we will cover:
UBIAI is a startup headquartered in California that offers cloud-based solutions and services specializing in Natural Language Processing (NLP), with a focus on text annotation. We help users extract actionable insights from unstructured documents.
LLM fine-tuning refers to the process of making small adjustments or refinements to a system or model to optimize its performance for a specific task or dataset.
In the context of machine learning, fine-tuning typically involves taking a pre-trained model and adjusting its parameters to better suit a particular use case. This approach is particularly valuable when working with limited amounts of labeled data for a specific task.
By leveraging the knowledge gained during pre-training on a larger and more diverse dataset, fine-tuning allows the model to adapt and specialize for the target task. The process requires striking a balance to avoid overfitting to the new data while still capturing the relevant patterns and nuances. LLM fine-tuning is widely employed across various domains, including natural language processing, computer vision, and audio analysis, enabling models to achieve higher accuracy and performance in specific applications.
LLM fine-tuning has emerged as a pivotal strategy for maximizing the potential of expansive language models such as GPT-3.5 Turbo. We’ve dedicated a separate guide to fine-tuning GPT-3 to illustrate its significance.
While pre-trained models showcase impressive human-like text generation, fine-tuning serves as the key to unlocking their true capabilities. This process empowers developers to tailor the model by training it on domain-specific data, allowing adaptation to specialized use cases beyond the scope of general-purpose training. Fine-tuning proves instrumental in enhancing the model’s relevance, accuracy, and performance, particularly for niche applications.
Fine-tuning facilitates the customization of models, enabling the creation of unique and distinctive experiences tailored to specific requirements and domains. Through training on domain-specific data, the model produces more pertinent and precise outputs within that niche. This level of customization provides businesses with the capability to develop AI applications that are precisely attuned to their needs.
The fine-tuning process enhances the model’s capacity to adhere to instructions and consistently produce reliable output formatting. Training on formatted data allows the model to grasp the desired structure and style, ultimately refining steerability. The outcome is more predictable and controllable outputs, contributing to increased reliability.
To fine-tune any model, it is essential to have annotated data for training. At UBIAI, we provide user-friendly annotation features that support over 20 languages and can handle various document types, including PDFs and images. Additionally, we offer diverse export formats for your annotated data, ensuring flexibility and compatibility with a wide range of Large Language Models (LLMs).
To do so, first log in to your UBIAI account, go to your profile, and set your OpenAI key. You can get this key from OpenAI’s official website.
Create a new project and start by importing the data you want to annotate for fine-tuning.
Then, navigate to the Models section by accessing the right sidebar on the projects page. This section will provide you with the tools and options necessary for configuring and managing models within your project.
In this article, we are fine-tuning an LLM, so go to the LLM section and add a model by clicking the blue button.
UBIAI provides an intuitive graphical user interface for parameterizing your model. The temperature parameter controls the creativity of the model’s output: higher values produce more varied text, lower values more deterministic text. The few-shot examples setting indicates the number of annotated instances supplied to the model in the prompt, a proven way to improve the quality and effectiveness of the model’s responses. Finally, the context length sets the maximum number of tokens a request may use.
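These settings map onto familiar LLM API parameters. As a rough illustration (this is not UBIAI’s internal implementation, and the model name and prompts below are placeholders), here is how temperature, few-shot examples, and a token budget typically appear in an OpenAI chat completion call:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Few-shot examples: annotated demonstrations prepended to the prompt
few_shot = [
    {"role": "user", "content": "Extract the job title: 'Jane Roe, Data Scientist'"},
    {"role": "assistant", "content": "JOB_TITLE: Data Scientist"},
]

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=few_shot + [
        {"role": "user", "content": "Extract the job title: 'John Smith, Software Engineer'"}
    ],
    temperature=0.2,  # lower values give more deterministic output, higher more creative
    max_tokens=256,   # keeps the response within the configured token budget
)
print(response.choices[0].message.content)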
The Python function convert_to_entity_format structures data for spaCy visualization. Given json_data containing entity categories and the raw text, the function iterates through the categories, locates each entity in the text, records its start and end indices, assigns the label, and returns a structured entities dictionary. This prepares the data for spaCy’s displacy renderer, which makes the recognized entities easy to inspect in context.
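Since the article describes this function without listing it, here is a minimal sketch consistent with the description above; the exact input schema (a dictionary with "text" and "entities" keys) is an assumption:

from spacy import displacy

def convert_to_entity_format(json_data):
    # Assumed input: {"text": "...", "entities": {"LABEL": ["value", ...], ...}}
    text = json_data["text"]
    ents = []
    for label, values in json_data["entities"].items():
        for value in values:
            start = text.find(value)  # first occurrence of the entity in the text
            if start != -1:
                ents.append({"start": start, "end": start + len(value), "label": label})
    # displacy's manual mode expects the raw text plus position-sorted entity spans
    return {"text": text, "ents": sorted(ents, key=lambda e: e["start"])}

# Example usage with displacy's manual rendering mode
example = {
    "text": "John Doe worked as a Software Engineer from 1999-2002.",
    "entities": {"NAME": ["John Doe"], "JOB_TITLE": ["Software Engineer"]},
}
displacy.render(convert_to_entity_format(example), style="ent", manual=True)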
After importing the data and configuring the annotation model, the next steps involve validating the process and proceeding with the inference.
After ensuring that everything is validated, proceed to the prediction phase by clicking on the “Predict” button.
Remember that before running the prediction you can always specify the labels you want to extract.
At this stage your data is annotated and ready to be used for fine-tuning LLMs, so feel free to export it in a suitable format. For OpenAI’s models, export the processed data in JSON format, which is suitable for fine-tuning GPT models.
Select the appropriate format and click on the option that best aligns with your specific case.
At this stage, all the data processed by the UBIAI annotation tool is ready for use and has been downloaded into a local folder.
OpenAI requires data to be formatted in JSONL (JSON Lines) format. JSONL is a format where each line represents a valid JSON object, and these lines are separated by newlines.
Every instance in the dataset must represent a conversation formatted similarly to OpenAI’s Chat Completions API: the data is organized as a sequence of messages, each pairing a prompt with a completion. It is important that a subset of the training examples explicitly addresses scenarios where the model’s behavior deviates from the desired outcome; the assistant messages provided in the dataset should serve as exemplary responses, showcasing the ideal behavior expected from the model.
{
    "prompt": "my prompt ->",
    "completion": "the answer to the prompt. \n"
}
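In a JSONL file, several such records simply sit one per line (the contents here are illustrative):

{"prompt": "first prompt ->", "completion": " first answer\n"}
{"prompt": "second prompt ->", "completion": " second answer\n"}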
training_data = [
    {
        "prompt": "XXXXXXXXXXXXXXXXX ->",
        "completion": """YYYYYYYYYYYYYYYYYYY .\n"""
    },
    {
        "prompt": "XXXXXXXXXXXXXXXXX ->",
        "completion": """ YYYYYYYYYYYYYYYYYYY \n"""
    }
]

validation_data = [
    {
        "prompt": "XXXXXXXXXXXXXXXXX ->",
        "completion": """ YYYYYYYYYYYYYYYYYYY \n"""
    },
    {
        "prompt": "XXXXXXXXXXXXXXXXX ->",
        "completion": """ YYYYYYYYYYYYYYYYYYY .\n"""
    }
]
Handling data in the list format, as demonstrated earlier, could be practical for small datasets. Nevertheless, there are numerous advantages to storing the data in JSONL (JSON Lines) format. These benefits encompass scalability, interoperability, simplicity, and compatibility with the OpenAI API. It’s noteworthy that the OpenAI API necessitates data in JSONL format for the creation of fine-tuning jobs.
The code below uses the prepare_finetuning_data function to transform data annotated with the UBIAI annotation tool into prompt/completion records that can then be serialized to JSONL for fine-tuning:
import json

def prepare_finetuning_data(data):
    finetuning_data = []
    for record in data:
        document = record.get("document", "")  # "document" is the raw input text
        tokens = record.get("tokens", [])      # "tokens" is the list of annotated tokens
        # Extract entities and their values from the tokens
        entities = {}
        for token in tokens:
            entity_label = token.get("entityLabel", "")
            if entity_label:
                text = token.get("text", "")
                entities.setdefault(entity_label, []).append(text)
        # Create a completion string listing each entity and its values
        completion = "\n".join(
            f"{entity}:[{', '.join(map(json.dumps, values))}]"
            for entity, values in entities.items()
        )
        # Pair the document (prompt) with its completion
        example = {"prompt": document, "completion": completion}
        finetuning_data.append(example)
    return finetuning_data
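Note that this function returns a Python list of dictionaries. To produce actual JSONL, each record still needs to be written out as one JSON object per line, for example with a small helper like this (the file name is arbitrary):

def write_jsonl(records, path="finetuning_data.jsonl"):
    # Serialize each record as one JSON object per line, as OpenAI expects
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")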
Let’s execute this function on an example annotated with UBIAI’s annotation tool, specifically the resume example below.
For each annotated file, UBIAI’s tool generates a list of JSON objects in the following format; the prepare_finetuning_data function above is what translates these objects into fine-tuning records:
{
    "documentName": "functionalsample.pdf",
    "document": "Functional Resume Sample ...",
    "tokens": [
        {"text": "Smith\n2002", "start": 33, "end": 43, "token_start": 5, "token_end": 6, "entityLabel": "YEAR"},
        {"text": "2002", "start": 39, "end": 43, "token_start": 6, "token_end": 6, "entityLabel": "YEAR"},
        {"text": "1999-2002", "start": 877, "end": 886, "token_start": 119, "token_end": 119, "entityLabel": "YEAR"},
        {"text": "John Doe", "start": 100, "end": 108, "token_start": 15, "token_end": 16, "entityLabel": "NAME"},
        {"text": "Software Engineer", "start": 150, "end": 167, "token_start": 25, "token_end": 27, "entityLabel": "JOB_TITLE"}
    ],
    "relations": [],
    "classifications": []
}
Let’s run prepare_finetuning_data on this example:
# Example usage
json_data = [  # Your list of JSON objects here
    {
        "documentName": "functionalsample.pdf",
        "document": "Functional Resume Sample ...",
        "tokens": [
            {"text": "Smith\n2002", "start": 33, "end": 43, "token_start": 5, "token_end": 6, "entityLabel": "YEAR"},
            {"text": "2002", "start": 39, "end": 43, "token_start": 6, "token_end": 6, "entityLabel": "YEAR"},
            {"text": "1999-2002", "start": 877, "end": 886, "token_start": 119, "token_end": 119, "entityLabel": "YEAR"},
            {"text": "John Doe", "start": 100, "end": 108, "token_start": 15, "token_end": 16, "entityLabel": "NAME"},
            {"text": "Software Engineer", "start": 150, "end": 167, "token_start": 25, "token_end": 27, "entityLabel": "JOB_TITLE"},
        ],
        "relations": [],
        "classifications": []
    },
    # Add more records as needed
]
finetuning_data = prepare_finetuning_data(json_data)
# Now `finetuning_data` is a list of dictionaries suitable for OpenAI API fine-tuning
If we print finetuning_data, we get the following result:
[{'prompt': 'Functional Resume Sample ...',
'completion': 'YEAR:["Smith\\n2002", "2002", "1999-2002"]\nNAME:["John Doe"]\nJOB_TITLE:["Software Engineer"]'}]
This part of the tutorial walks you through fine-tuning GPT models using OpenAI’s user interface.
Open the OpenAI website and head to the fine-tuning section. It should look something like this.
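If you prefer to script this step, the same job can be created through the API. Here is a minimal sketch using the OpenAI Python client; note that the prompt/completion format used above is accepted by legacy completion models such as davinci-002, while chat models like gpt-3.5-turbo expect a messages-based format instead:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload the JSONL file exported earlier
training_file = client.files.create(
    file=open("finetuning_data.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job on a model that accepts prompt/completion data
job = client.fine_tuning.jobs.create(
    training_file=training_file.id,
    model="davinci-002",
)
print(job.id)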
In summary, the UBIAI tool emerges as a game-changer in the fine-tuning landscape, offering a robust solution that empowers researchers to unlock the full potential of their language models. As we continue to explore the vast capabilities of natural language processing, tools like UBIAI pave the way for more efficient, effective, and collaborative advancements in the development and optimization of language models.