Automating Entity Extraction from Safety Data Sheets (SDS) using LLMs
Leveraging zero-shot labeling
July 18th, 2023
Automating entity extraction from PDF documents is useful in many applications, and Safety Data Sheets (SDS) for chemical management are a particularly good fit. SDS companies process large volumes of SDS documents, and automated entity extraction lets them handle and manage that information efficiently.
In this article, we will explore how to leverage the power of Large Language Models (LLMs) to automate entity extraction from SDS and improve the overall efficiency of SDS companies.
Importance of Entity Extraction in SDS Management
Safety Data Sheets (SDS) contain critical information about hazardous substances and mixtures. SDS companies play a vital role in assisting workplaces in effectively managing chemicals and ensuring employee safety.
However, the manual extraction of relevant entities can be time-consuming and prone to errors. By automating this process, SDS companies can streamline their operations and improve accuracy.
Utilizing Large Language Models (LLMs) for Entity Extraction
Large Language Models (LLMs) offer a powerful solution for automating entity extraction from SDS documents. Unlike traditional supervised models that require extensive labeled data, LLMs can tap into their vast knowledge base and generalize from a few examples. In the context of SDS, LLMs can leverage their in-context learning capabilities to recognize and extract key entities from unstructured and semi-structured SDS documents.
Information extraction from SDS
In this tutorial, we will explore the zero-shot and few-shot labeling capabilities of GPT-3.5 (the model behind ChatGPT) to label SDS documents from different companies.
SDSs are typically available in PDF format with varying layouts, but they generally contain the same essential information. Our goal in this tutorial is to use an LLM to automatically identify key entities within SDSs, including product number, CAS number, use cases, classification, GHS label, formula, molecular weight, synonym, emergency phone number, first aid measures, component, and brand.
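Some of these entities have a fixed format that can be checked programmatically once they are extracted. The CAS number is one example: it is written as three hyphen-separated groups of digits, and the last digit is a check digit computed from the others. The following is a minimal sketch of such a check, not part of UBIAI or of the workflow described in this article; the function name is hypothetical.

```python
import re

# Two to seven digits, two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(cas: str) -> bool:
    """Return True if `cas` is well-formed and its check digit matches."""
    match = CAS_PATTERN.match(cas.strip())
    if not match:
        return False
    body = match.group(1) + match.group(2)
    check_digit = int(match.group(3))
    # Weight the body digits 1, 2, 3, ... starting from the rightmost one,
    # sum the products, and compare the sum modulo 10 with the check digit.
    total = sum(
        weight * int(digit)
        for weight, digit in enumerate(reversed(body), start=1)
    )
    return total % 10 == check_digit


print(is_valid_cas("7732-18-5"))  # water
print(is_valid_cas("7732-18-4"))  # same digits with a wrong check digit
```

A check like this can flag extracted CAS numbers that were mistyped or misread before they reach the database.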
Extracting and storing this relevant information in a searchable database holds significant value for companies as it allows for quick search and retrieval of hazardous components. The example below illustrates the structure of an SDS:
Zero-Shot Labeling for SDS Entity Extraction
Zero-shot labeling asks the LLM to extract entities without any labeled examples in the prompt. This works well for SDS documents, where the LLM can identify each entity from its description alone. By using zero-shot labeling, SDS companies can significantly reduce the time and effort required for manual labeling and extraction.
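To make the idea concrete, here is a minimal sketch of how a zero-shot prompt could be assembled from entity descriptions alone. The label names, the descriptions, the prompt wording, and the function name are all assumptions made for illustration; the article does not show the prompt UBIAI actually sends.

```python
# Hypothetical labels and descriptions for a few of the SDS entities.
ENTITY_DESCRIPTIONS = {
    "PRODUCT_NUMBER": "Supplier's catalog or product number",
    "CAS_NUMBER": "Chemical Abstracts Service registry number, e.g. 7732-18-5",
    "EMERGENCY_PHONE": "Telephone number to call in an emergency",
    "BRAND": "Brand under which the product is sold",
}


def build_zero_shot_prompt(text: str, descriptions: dict[str, str]) -> str:
    """Build a prompt that describes each entity but contains no labeled examples."""
    label_lines = "\n".join(
        f"- {label}: {desc}" for label, desc in descriptions.items()
    )
    return (
        "Extract the following entities from the safety data sheet text.\n"
        f"{label_lines}\n"
        "Return a JSON object mapping each label to a list of exact text spans; "
        "use an empty list when a label is absent.\n\n"
        f"Text:\n{text}\n"
    )


print(build_zero_shot_prompt("CAS-No. 64-17-5  Brand: ExampleBrand", ENTITY_DESCRIPTIONS))
```

The resulting string would then be sent to the model as the user message. The quality of the descriptions matters here, because they are the only guidance the model receives.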
Implementing Automated Entity Extraction with UBIAI
To implement automated entity extraction from SDS using LLMs, companies can leverage annotation tools such as UBIAI. These tools simplify the process by handling prompting, data chunking, and inference using OpenAI’s GPT-3.5 Turbo API.
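Data chunking is needed because a full SDS can exceed the amount of text the model accepts in one request. The sketch below shows one simple way to do it, splitting on whitespace with a small overlap so that an entity falling on a chunk boundary still appears whole in one chunk. How UBIAI chunks documents is not described in this article; the function name, the word-based limit, and the default sizes are assumptions.

```python
def chunk_words(text: str, max_words: int = 300, overlap: int = 30) -> list[str]:
    """Split text into chunks of at most max_words words, sharing `overlap` words."""
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        # Stop once the current chunk has reached the end of the text.
        if start + max_words >= len(words):
            break
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_words(sample, max_words=4, overlap=1):
    print(chunk)
```

A production version would more likely count model tokens than words, and might split on SDS section boundaries instead of fixed sizes.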
With UBIAI, companies can upload their SDS documents, whether in PDF or image format, and configure the number of examples the LLM should learn from for auto-labeling. The tool automatically selects the most informative documents from the labeled dataset and incorporates them into the prompt for entity extraction.
To begin, upload your documents in any of the supported formats, such as native PDF, image files, or DOCX.
After uploading, navigate to the annotation page and select the Few-shot tab within the annotation interface.
For more details, see the documentation: https://ubiai.gitbook.io/ubiai-documentation/zero-shot-and-few-shot-labeling
The number of examples the model learns from when auto-labeling subsequent documents is configurable. As noted above, UBIAI automatically selects the most informative documents from your already labeled dataset and combines them in the prompt.
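The select-and-combine step described above can be sketched as follows. The article does not say how UBIAI decides which documents are "most informative", so this sketch substitutes a simple greedy heuristic that prefers documents covering labels not yet seen. The function names, the document structure, and the prompt layout are hypothetical.

```python
import json


def select_examples(labeled_docs: list[dict], k: int) -> list[dict]:
    """Greedily pick up to k documents that add the most not-yet-covered labels."""
    remaining = list(labeled_docs)
    chosen, covered = [], set()
    while remaining and len(chosen) < k:
        # Score each candidate by how many new labels it would contribute.
        best = max(remaining, key=lambda d: len(set(d["entities"]) - covered))
        chosen.append(best)
        covered |= set(best["entities"])
        remaining.remove(best)
    return chosen


def build_few_shot_prompt(instruction: str, examples: list[dict], text: str) -> str:
    """Place the labeled examples before the new text; zero examples is zero-shot."""
    parts = [instruction]
    for ex in examples:
        parts.append(f"Text:\n{ex['text']}\nEntities:\n{json.dumps(ex['entities'])}")
    parts.append(f"Text:\n{text}\nEntities:")
    return "\n\n".join(parts)


docs = [
    {"text": "Brand: A", "entities": {"BRAND": ["A"]}},
    {"text": "Brand: B, CAS 64-17-5", "entities": {"BRAND": ["B"], "CAS_NUMBER": ["64-17-5"]}},
    {"text": "Formula: H2O", "entities": {"FORMULA": ["H2O"]}},
]
examples = select_examples(docs, k=2)
print(build_few_shot_prompt("Label the SDS entities as JSON.", examples, "Brand: C"))
```

With `k=0` the examples list is empty and the prompt contains only the instruction and the new text, which is the zero-shot case.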
This approach is known as Few-shot labeling, where the “Few” can range from 0 to n. To configure the number of examples, simply click on the configuration button and enter the desired number of examples, as demonstrated below.
In this tutorial, we will utilize zero examples for the LLM to learn from and instruct it to label the data solely based on the entity’s description.
Remarkably, the LLM demonstrates a strong understanding of our document and accurately labels most of the entities!
The following shows the impressive result of zero-shot labeling on the SDS PDF without any examples.