How Few-Shot Learning is Automating Document Labeling
Apr 6, 2023
Manual document labeling is a time-consuming and tedious process that often requires significant resources and is prone to errors. However, recent advances in machine learning, particularly the technique known as few-shot learning, are making it easier to automate the labeling process. Large Language Models (LLMs) in particular are excellent few-shot learners thanks to their emergent in-context learning capability.
In this article, we take a closer look at how few-shot learning is transforming document labeling, specifically for Named Entity Recognition (NER), one of the most important tasks in document processing. We will show how the UBIAI platform makes it easier than ever to automate this critical task using few-shot labeling techniques.
What is Few-Shot Learning?
Few-shot learning is a machine learning technique that enables a model to learn a given task from only a few labeled examples. Without modifying its weights, the model can be steered to perform a specific task by concatenating training examples of that task into its input and asking it to predict the output for a target text. Here is an example of few-shot learning for the task of Named Entity Recognition (NER) using three examples:
###Prompt Extract entities from the following sentences without changing original words.

### Sentence: "and storage components. 5+ years of experience delivering scalable and resilient services at large enterprise scale, including experience in data platforms including large-scale analytics on relational, structured and unstructured data. 3+ years of experience as a SWE/Dev/Technical lead in an agile environment including 1+ years of experience operating in a DevOps model. 2+ years of experience designing secure, scalable and cost-efficient PaaS services on the Microsoft Azure (or similar) platform. Expert understanding of"
DIPLOMA: none
DIPLOMA_MAJOR: none
EXPERIENCE: 3+ years, 5+ years, 5+ years, 5+ years, 3+ years, 1+ years, 2+ years
SKILLS: designing, delivering scalable and resilient services, data platforms, large-scale analytics on relational, structured and unstructured data, SWE/Dev/Technical, DevOps, designing, PaaS services, Microsoft Azure

### Sentence: "8+ years demonstrated experience in designing and developing enterprise-level scale services/solutions. 3+ years of leadership and people management experience. 5+ years of Agile Experience. Bachelors degree in Computer Science or Engineering, or a related field, or equivalent alternative education, skills, and/or practical experience. Other 5+ years of full-stack software development experience to include C# (or similar) experience with the ability to contribute to technical architecture across web, mobile, middle tier, data pipeline"
DIPLOMA: Bachelors
DIPLOMA_MAJOR: Computer Science
EXPERIENCE: 8+ years, 3+ years, 5+ years, 5+ years, 5+ years, 3+ years
SKILLS: designing, developing enterprise-level scale services/solutions, leadership and people management experience, Agile Experience, full-stack software development, C#, designing

### Sentence: "5+ years of experience in software development. 3+ years of experience in designing and developing enterprise-level scale services/solutions. 3+ years of experience in leading and managing teams. 5+ years of experience in Agile Experience. Bachelors degree in Computer Science or Engineering, or a related field, or equivalent alternative education, skills, and/or practical experience."
The prompt typically begins by instructing the model to perform a specific task, such as "Extract entities from the following sentences without altering the original words". Notice that we've added the instruction "without changing the original words" to prevent the LLM from hallucinating random text, which these models are notorious for. This has proven critical for obtaining consistent responses from the model.
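To make the structure concrete, here is a minimal sketch of how such a few-shot prompt might be assembled programmatically. The function name and the example data are hypothetical, not part of UBIAI's implementation:

```python
def build_ner_prompt(examples, target, instruction):
    """Concatenate labeled examples and a target sentence into one few-shot prompt.

    examples: list of (sentence, {entity_type: [values]}) pairs
    target:   the unlabeled sentence whose entities we want predicted
    """
    parts = ["###Prompt " + instruction]
    for sentence, labels in examples:
        block = f'### Sentence: "{sentence}"'
        for entity_type, values in labels.items():
            # Mirror the format above: "none" when an entity type has no values.
            block += f" {entity_type}: {', '.join(values) if values else 'none'}"
        parts.append(block)
    # The target sentence goes last, with no labels, so the model completes them.
    parts.append(f'### Sentence: "{target}"')
    return " ".join(parts)
```

The returned string can then be sent as the prompt of a completion request; the model's continuation should contain the entity labels for the final sentence.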
The few-shot learning phenomenon has been extensively studied in this article, which I highly recommend. Essentially, the paper demonstrates that, under mild assumptions, the pretraining distribution of the model is a mixture of latent tasks that can be efficiently learned through in-context learning. In this case, in-context learning is more about identifying the task than about learning it by adjusting the model weights.
Few-shot learning has an excellent practical application in the data labeling space, often referred to as few-shot labeling. In this case, we provide the model with a few labeled examples and ask it to predict the labels of subsequent documents. However, integrating this capability into a functional data labeling platform is easier said than done. Here are a few of the challenges:
- LLMs are inherently text generators and tend to produce variable output. Prompt engineering is critical to make them return predictable output that can later be used to auto-label the data.
- Token limitation: LLMs such as OpenAI’s GPT-3 are limited to 4,000 tokens per request, which caps the length of documents that can be sent at once. Chunking and splitting the data before sending the request becomes essential.
- Span offset calculation: after receiving the model’s output, we need to find each predicted entity’s occurrence in the document and compute its character offsets so it can be labeled correctly.
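The chunking step can be sketched as follows. This is a simplification that uses whitespace-delimited words as a rough proxy for tokens; a production system would count tokens with the model's actual tokenizer (e.g. `tiktoken` for GPT-3):

```python
def chunk_text(text, max_tokens=1000):
    """Split text into chunks of at most max_tokens whitespace-delimited words.

    Word count is only an approximation of the model's token count; it keeps
    each request safely under the per-request limit when max_tokens is chosen
    conservatively.
    """
    words = text.split()
    return [" ".join(words[i:i + max_tokens])
            for i in range(0, len(words), max_tokens)]
```

Each chunk is then sent in its own request, and the predicted labels are merged back together afterwards.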
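The span offset calculation can be sketched like this. The function name is illustrative, and the left-to-right cursor is one simple policy for mapping repeated entity values (like the duplicated "5+ years" above) to successive occurrences in the text:

```python
def locate_spans(document, entities):
    """Map predicted (label, value) pairs to character offsets in the document.

    Scans left to right so that repeated entity values are matched against
    successive occurrences rather than all mapping to the first one.
    """
    spans = []
    cursor = 0
    for label, value in entities:
        start = document.find(value, cursor)
        if start == -1:
            # The model altered the original words; skip rather than mislabel.
            continue
        end = start + len(value)
        spans.append((label, start, end))
        cursor = end
    return spans
```

This is also where the "without changing original words" instruction pays off: if the model paraphrases an entity, `find` fails and the prediction cannot be anchored to a span.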
Few Shot Labeling with UBIAI
We’ve recently added a few-shot labeling capability by integrating OpenAI’s GPT-3 Davinci into the UBIAI annotation tool. The tool currently supports few-shot NER for unstructured and semi-structured documents such as PDFs and scanned images.
To get started:
- Simply label 1-5 examples
- Enable few-shot GPT model
- Run prediction on a new unlabeled document
Here is an example of few-shot NER on a job description, with 5 examples provided:
The GPT model accurately predicts most entities with just five in-context examples. Because LLMs are trained on vast amounts of data, this few-shot learning approach can be applied to various domains, such as legal, healthcare, HR, insurance documents, etc., making it an extremely powerful tool.
However, the most surprising aspect of few-shot learning is its adaptability to semi-structured documents with limited context. In the example below, I provided GPT with only one labeled OCR’d invoice example and asked it to label the next one. The model predicted many of the entities accurately. With even more examples, it does an exceptional job of generalizing to semi-structured documents as well.