Automating Entity Extraction from PDFs with LLM Fine-Tuning
Jun 15th, 2023
The need for high-quality labeled data cannot be overstated in modern machine learning applications. From improving our models’ performance to ensuring fairness, the power of labeled data is immense. Unfortunately, the time and effort required to create such datasets are equally significant. But what if we could reduce the time spent on this task from days to mere hours while maintaining or even enhancing the labeling quality? A utopian dream? Not anymore.
Emerging paradigms in machine learning — Zero-Shot Learning, Few-Shot Learning, and Model-Assisted Labeling — present a transformative approach to this crucial process. These techniques harness the power of advanced algorithms, reducing the need for extensive labeled datasets, and enabling faster, more efficient, and highly effective data annotation.
In this tutorial, we present a method to auto-label unstructured and semi-structured documents using the in-context learning capabilities of Large Language Models (LLMs).
Information extraction from SDS
Unlike traditional supervised models, which require extensive labeled data to be trained on a specific task, LLMs can generalize and extrapolate from a few examples by tapping into their large knowledge base. This emergent capability, known as in-context learning, makes LLMs a versatile choice for many tasks, including not only text generation but also data extraction such as named entity recognition.
For this tutorial, we are going to label Safety Data Sheets (SDS) from various companies using the zero-shot and few-shot labeling capabilities of GPT-3.5, also known as ChatGPT. SDSs provide comprehensive information about specific substances or mixtures and are designed to help workplaces manage chemicals effectively. These documents detail hazards, including environmental risks, and offer guidance on safety precautions, giving employees the knowledge they need to make informed decisions about the safe handling and use of chemicals. SDSs typically come as PDFs in varying layouts but contain largely the same information. In this tutorial, we are interested in training an AI model that automatically identifies the following entities (also expressed as a label schema in the sketch after the list):
- Product number
- CAS number
- Use cases
- Classification
- GHS label
- Formula
- Molecular weight
- Synonym
- Emergency phone number
- First aid measures
- Component
- Brand
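To make this concrete, the entity set can be represented as a simple label-to-description mapping that is later injected into the extraction prompt. The following is a minimal Python sketch; the label names and the wording of the descriptions are illustrative assumptions, not UBIAI's internal configuration.

```python
# Hypothetical mapping of SDS entity labels to the short descriptions
# the LLM is given when asked to label a document (illustrative only).
SDS_ENTITIES = {
    "PRODUCT_NUMBER": "Catalog or product number of the substance",
    "CAS_NUMBER": "Chemical Abstracts Service registry number",
    "USE_CASES": "Recommended or identified uses of the substance",
    "CLASSIFICATION": "Hazard classification of the substance or mixture",
    "GHS_LABEL": "GHS signal words, pictograms, and hazard statements",
    "FORMULA": "Chemical formula",
    "MOLECULAR_WEIGHT": "Molecular weight of the substance",
    "SYNONYM": "Alternative names for the substance",
    "EMERGENCY_PHONE_NUMBER": "Emergency contact telephone number",
    "FIRST_AID_MEASURES": "First aid instructions",
    "COMPONENT": "Components listed in the composition section",
    "BRAND": "Brand or manufacturer name",
}
```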
Extracting this information and storing it in a searchable database is valuable for many companies, since it allows hazardous components to be searched and retrieved quickly.
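Purely for illustration, here is one way the extracted entities could be stored and queried afterward; SQLite and the schema below are assumptions made for this sketch, not part of the tutorial's tooling.

```python
import sqlite3

# Minimal sketch: store extracted SDS entities in a queryable table
# (SQLite and this schema are illustrative choices).
conn = sqlite3.connect("sds_entities.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS entities (document TEXT, label TEXT, value TEXT)"
)
conn.execute(
    "INSERT INTO entities VALUES (?, ?, ?)",
    ("acetone_sds.pdf", "CAS_NUMBER", "67-64-1"),  # example row
)
conn.commit()

# Quickly look up every document that mentions a given component label.
rows = conn.execute(
    "SELECT document, value FROM entities WHERE label = 'COMPONENT'"
).fetchall()
print(rows)
```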
Here is an example of an SDS:
Publicly available SDS. Image by Author
Zero-shot Labeling
Automating entity extraction from documents with Large Language Models (LLMs) presents a unique set of challenges. Unlike text generation tasks, information extraction requires careful handling to overcome issues such as hallucination and the generation of extraneous comments by LLMs. In this section, we delve into the complexities of information extraction with LLMs and introduce the UBIAI annotation tool, which streamlines the process by abstracting away the intricacies of prompt engineering and result parsing.
Challenges in Information Extraction with LLMs:
LLMs, initially designed for text completion, face challenges in accurately extracting information. Their tendency to hallucinate or generate additional text is a hurdle when the goal is precise information retrieval. To address this, a consistent output format, such as JSON, is essential for parsing LLM results, which in turn requires careful prompt engineering with specific instructions to ensure accurate extraction and interpretation.
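As a rough sketch of what this looks like in practice, the prompt can pin the model to a strict JSON format and the response can be parsed defensively. The helper names and prompt wording below are assumptions, not UBIAI's actual implementation.

```python
import json
import re

def build_extraction_prompt(text: str, labels: dict) -> str:
    """Hypothetical prompt that pins the model to a strict JSON output."""
    entity_lines = "\n".join(f"- {name}: {desc}" for name, desc in labels.items())
    return (
        "Extract the following entities from the document.\n"
        f"{entity_lines}\n"
        "Respond with JSON only, mapping each label to a list of exact text spans. "
        "Do not add any commentary.\n\n"
        f"Document:\n{text}"
    )

def parse_llm_json(raw: str) -> dict:
    """Tolerate extra commentary by pulling out the first JSON object."""
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if match is None:
        return {}
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return {}
```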
The Role of Prompt Engineering and Result Parsing:
To navigate the complexities of LLM output, effective prompt engineering becomes crucial: well-chosen instructions and optimally structured prompts help produce coherent and relevant information. Result parsing is equally important, involving the mapping of extracted entities back to the original tokens in the input text. This ensures a seamless integration of LLM outputs into the document analysis pipeline and is typically refined iteratively.
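A hedged sketch of that mapping step: each extracted value is located in the source text and turned into a character-offset annotation. The span format below is an illustrative choice, not a prescribed one.

```python
def map_entities_to_spans(text: str, entities: dict) -> list:
    """Map extracted values back to character offsets in the original text.

    `entities` is the parsed JSON, e.g. {"CAS_NUMBER": ["67-64-1"]};
    the returned span format is an illustrative assumption.
    """
    spans = []
    for label, values in entities.items():
        for value in values:
            start = text.find(value)
            if start != -1:  # skip values the model hallucinated
                spans.append(
                    {"label": label, "start": start, "end": start + len(value)}
                )
    return spans
```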
UBIAI Annotation Tool: Simplifying the Process
The UBIAI annotation tool offers a streamlined solution to the challenges LLMs pose for information extraction. It handles prompt engineering, chunks data to comply with context length limits, and uses OpenAI's GPT-3.5 Turbo API for inference. The tool automates the parsing, processing, and application of LLM outputs, enabling efficient auto-labeling of documents.
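UBIAI handles all of this internally; for intuition only, a naive chunking-and-inference loop using the OpenAI Python client (v1-style interface) might look like the sketch below, reusing the `build_extraction_prompt` and `parse_llm_json` helpers from the earlier sketch. The chunk size and function names are assumptions.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def chunk_text(text: str, max_chars: int = 8000) -> list:
    """Naive character-based chunking to stay under the context limit
    (UBIAI's actual chunking strategy is not documented here)."""
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def label_document(text: str, labels: dict) -> list:
    """Run zero-shot extraction chunk by chunk and collect parsed results."""
    results = []
    for chunk in chunk_text(text):
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[
                {"role": "user", "content": build_extraction_prompt(chunk, labels)}
            ],
            temperature=0,
        )
        results.append(parse_llm_json(response.choices[0].message.content))
    return results
```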
Getting Started with UBIAI:
1. Upload your documents, whether in native PDF, image, or simple Docx formats.
2. Navigate to the annotation page and select the Few-shot tab in the annotation interface.
3. Benefit from UBIAI's automated prompt engineering, data chunking, and result parsing.
By combining the in-context learning capabilities of LLMs with tools like UBIAI, organizations can overcome the challenges associated with information extraction. Careful prompt engineering and result parsing improve the accuracy and efficiency of entity extraction from diverse document formats. UBIAI simplifies this workflow and helps unlock the potential of LLMs for automating document analysis.
For more details, check out the documentation here: https://ubiai.gitbook.io/ubiai-documentation/zero-shot-and-few-shot-labeling
UBIAI lets you configure the number of examples you would like the model to learn from when auto-labeling the next documents. The app automatically chooses the most informative documents from your already labeled dataset and concatenates them in the prompt. This approach is called few-shot labeling, where "few" ranges from 0 to n. To configure the number of examples, simply click on the configuration button and input the number, as shown below.
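For intuition, assembling a few-shot prompt roughly amounts to serializing previously labeled documents as input/output pairs and prepending them to the new document. The sketch below is an assumption of how such a prompt could be built; it says nothing about how UBIAI actually selects its most informative examples.

```python
import json

def build_few_shot_prompt(examples: list, new_text: str, labels: dict) -> str:
    """Concatenate up to n labeled examples ahead of the new document.

    `examples` is a list of (document_text, entities_dict) pairs; the
    example-selection heuristic is outside the scope of this sketch.
    """
    parts = ["Extract these entities: " + ", ".join(labels) + "."]
    for doc_text, entities in examples:
        parts.append(f"Document:\n{doc_text}\nEntities:\n{json.dumps(entities)}")
    parts.append(f"Document:\n{new_text}\nEntities:")
    return "\n\n".join(parts)
```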
For this tutorial, we are going to provide zero examples to the LLM to learn from and ask it to label the data based purely on the description of the entity itself. Surprisingly, the LLM is able to understand our document quite well and does most of the labeling correctly!
Below is the result of zero-shot labeling on the SDS PDF without any examples, quite impressive!
Conclusion
Automating entity extraction from PDFs using Large Language Models (LLMs) has become a reality thanks to their in-context learning capabilities, such as zero-shot and few-shot learning. These techniques harness the LLMs' latent knowledge to reduce the reliance on extensive labeled datasets and enable faster, more efficient, and highly effective data annotation.
This tutorial presented a method to auto-label semi-structured documents, focusing on Safety Data Sheets (SDS), that would also work for unstructured text. By leveraging the in-context learning capabilities of LLMs, particularly GPT-3.5 (ChatGPT), it demonstrated the ability to automatically identify important entities within SDSs, such as product number, CAS number, use cases, classification, GHS label, and more.
The extracted information, if stored in a searchable database, provides significant value to companies as it allows for quick search and retrieval of hazardous components. The tutorial highlighted the potential of zero-shot labeling, where the LLM can understand and extract information from SDSs without any explicit examples. This showcases the versatility and generalization abilities of LLMs, going beyond text generation tasks.
If you are interested in creating your own training dataset using LLMs' zero-shot capabilities, schedule a demo with us here.
Follow us on Twitter @UBIAI5 !