Automating Entity Extraction from Safety Data Sheets (SDS) using LLMs

Automating Entity Extraction from Safety Data Sheets (SDS) using LLMs

Leveraging zero-shot labeling

July 18th, 2023

The process of automating entity extraction from PDF documents has proven to be highly beneficial in various applications. One specific area where this automation can be extremely valuable is in the context of Safety Data Sheets (SDS) for chemical management. SDS companies can greatly benefit from automated entity extraction to efficiently process and manage the vast amount of information present in SDS documents.

In this article, we will explore how to leverage the power of Large Language Models (LLMs) to automate entity extraction from SDS and improve the overall efficiency of SDS companies.

Importance of Entity Extraction in SDS Management

Safety Data Sheets (SDS) contain critical information about hazardous substances and mixtures. SDS companies play a vital role in assisting workplaces in effectively managing chemicals and ensuring employee safety.

However, the manual extraction of relevant entities can be time-consuming and prone to errors. By automating this process, SDS companies can streamline their operations and improve accuracy.

Utilizing Large Language Models (LLMs) for Entity Extraction

Automating Entity Extraction from Safety Data Sheets (SDS) using LLMs

Large Language Models (LLMs) offer a powerful solution for automating entity extraction from SDS documents. Unlike traditional supervised models that require extensive labeled data, LLMs can tap into their vast knowledge base and generalize from a few examples. In the context of SDS, LLMs can leverage their in-context learning capabilities to recognize and extract key entities from unstructured and semi-structured SDS documents.


Information extraction from SDS

Automating Entity Extraction from Safety Data Sheets (SDS) using LLMs

In this tutorial, we will explore the zero-shot and few-shot labeling capabilities of GPT 3.5 (ChatGPT) to label SDS documents from different companies.

SDSs are typically available in PDF format with varying layouts, but they generally contain the same essential information. Our goal in this tutorial is to train an AI model capable of automatically identifying key entities within SDSs, including product number, CAS number, use cases, classification, GHS label, formula, molecular weight, synonym, emergency phone number, first aid measures, component, and brand.

Extracting and storing this relevant information in a searchable database holds significant value for companies as it allows for quick search and retrieval of hazardous components. The example below illustrates the structure of an SDS:


Automating Entity Extraction from Safety Data Sheets (SDS) using LLMs

Zero-Shot Labeling for SDS Entity Extraction

Zero-shot labeling, a technique employed by LLMs, allows the extraction of entities without explicit examples. This capability proves to be highly effective in the context of SDS documents, where the LLM can understand the entities based purely on their descriptions. By utilizing zero-shot labeling, SDS companies can significantly reduce the time and effort required for manual labeling and extraction.


Implementing Automated Entity Extraction with UBIAI

To implement automated entity extraction from SDS using LLMs, companies can leverage annotation tools such as UBIAI. These tools simplify the process by handling prompting, data chunking, and inference using OpenAI’s GPT3.5 Turbo API.

With UBIAI, companies can upload their SDS documents, whether in PDF or image format, and configure the number of examples the LLM should learn from for auto-labeling. The tool automatically selects the most informative documents from the labeled dataset and incorporates them into the prompt for entity extraction.

Steps :

To begin, you can easily upload your documents in various formats such as native PDF, image files, or simple Docx.

After uploading, navigate to the annotation page and select the Few-shot tab within the annotation interface.

Automating Entity Extraction from Safety Data Sheets (SDS) using LLMs

For more comprehensive information, you can refer to the documentation available at this link: [


UBIAI provides the flexibility to configure the number of examples that the model learns from to auto-label subsequent documents. The application automatically selects the most informative documents from your already labeled dataset and combines them in the prompt.


This approach is known as Few-shot labeling, where the “Few” can range from 0 to n. To configure the number of examples, simply click on the configuration button and enter the desired number of examples, as demonstrated below.

Automating Entity Extraction from Safety Data Sheets (SDS) using LLMs

In this tutorial, we will utilize zero examples for the LLM to learn from and instruct it to label the data solely based on the entity’s description.

Remarkably, the LLM demonstrates a strong understanding of our document and accurately labels most of the entities!


The following shows the impressive result of zero-shot labeling on the SDS PDF without any examples.

Benefits of Automated Entity Extraction for SDS Companies

Automating Entity Extraction from Safety Data Sheets (SDS) using LLMs

Automated entity extraction offers numerous benefits for SDS companies. By implementing LLM-based automation, companies can:


1 . Save time and effort:
Automating entity extraction significantly reduces the manual labor required for labeling SDS documents, allowing employees to focus on higher-value tasks.


2. Improve accuracy:

LLMs have the ability to extract entities with high precision, reducing the risk of human errors associated with manual extraction.


3. Enhance compliance:

Automated entity extraction ensures consistency and compliance with regulatory requirements by accurately identifying and categorizing critical entities.


4. Enable quick retrieval of information:

Storing the extracted entities in a searchable database enables SDS companies to quickly search and retrieve relevant information, improving decision-making processes and overall efficiency.


Automating entity extraction from Safety Data Sheets (SDS) using Large Language Models (LLMs) has the potential to revolutionize the operations of SDS companies. By leveraging the in-context learning capabilities of LLMs, specifically zero-shot labeling, SDS companies can extract key entities from SDS documents with improved efficiency and accuracy.

Implementing automated entity extraction streamlines SDS management processes, enhances compliance, and enables faster retrieval of vital information. Embracing LLM-based automation can empower SDS companies to optimize their operations and ensure effective chemical management.

If you are keen on building your own training dataset using the zero-shot capabilities of LLMs, we invite you to schedule a demo with us. You can explore and experience firsthand how LLMs can be leveraged for your specific needs.

Stay updated by following us on Twitter @UBIAI5 for the latest news and updates.