How to Auto-Label Your Data Using Transformer Models

Jun 13, 2022

While many applications use off-the-shelf pre-trained models for tasks such as content generation, question answering, or generic named entity recognition, much less attention has gone into creating business-specific training datasets that enable fine-tuning large models to solve specific business problems. For AI to have a real and lasting impact, it has to be adopted by small and medium businesses with very distinct problems, and using a one-size-fits-all model has proven unworkable and unrealistic.

Creating a custom training dataset is easier said than done: it requires high-quality data labeling, which is usually expensive and time-consuming. Finding ways to automate the labeling process is therefore of the utmost importance in the field of AI, and it is currently a very active area of research. Although recent advances in programmatic labeling such as weak labeling have been proposed, their output quality remains questionable and requires strong human supervision. For more information, check out my previous article “Can Weak Labeling Replace Human Labeled Data”.

In this article, we will leverage a transformer model to auto-label our data, starting from a small seed of annotated documents. We will then review the model’s annotations and correct any incorrect labels.

Annotation Pipeline

We will use the UBIAI annotation tool to label the data, train the model, and auto-label the rest of our unlabeled data. The tool offers a comprehensive pipeline (shown below) to automate data labeling, including:

  1. Pre-annotation using dictionaries or regular expressions (see the sketch after this list)
  2. Manual annotation: supports NER, relation, and document classification
  3. Model training: supports spaCy and transformer training
  4. Model auto-labeling using the custom trained models
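
Step 1 is handled inside UBIAI’s UI, but the underlying idea is easy to sketch in code. Below is a minimal, hypothetical example of dictionary/regex pre-annotation; the patterns and label names are illustrative only, not UBIAI’s internal format.

```python
import re

# Hypothetical dictionary of patterns -> labels (illustrative only).
PATTERNS = {
    "MATERIAL": re.compile(r"\b(?:graphene|Cu|Py|silicon)\b"),
    "PROCESS": re.compile(r"\b(?:annealing|sputtering|spin flip)\b"),
}

def pre_annotate(text):
    """Return (start, end, label) spans for every pattern match."""
    spans = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            spans.append((match.start(), match.end(), label))
    return sorted(spans)

print(pre_annotate("Cu films were grown by sputtering."))
# [(0, 2, 'MATERIAL'), (23, 33, 'PROCESS')]
```
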
[Figure: UBIAI Workflow]

Data Labeling

For this tutorial, we will label scientific abstracts to extract all the materials, processes, and tasks they mention. You can, of course, create any labeled dataset relevant to the business problem you are trying to solve.

For our initial seed annotation, we start by annotating 50 abstracts with the labels MATERIAL, PROCESS, and TASK, as shown below.
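
To make the target concrete, here is what one labeled record could look like. The span-based JSON layout below is a generic illustration, not necessarily UBIAI’s export schema.

```python
# A hypothetical labeled record: character-offset spans over the abstract text.
record = {
    "text": "Spin transport in Cu was probed via a non-local voltage signal.",
    "entities": [
        {"start": 18, "end": 20, "label": "MATERIAL"},  # "Cu"
        {"start": 38, "end": 62, "label": "PROCESS"},   # "non-local voltage signal"
    ],
}

# The offsets can be checked directly against the text:
assert record["text"][18:20] == "Cu"
assert record["text"][38:62] == "non-local voltage signal"
```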

[Figure: UBIAI Annotation of Scientific Abstracts]

Because this annotation requires domain expertise, we must be very mindful of our subject matter expert’s (SME) time. That is why we decided to distill the SME’s knowledge into a transformer model to help speed up the annotation process.

Model Training

Training transformer models on an annotated dataset is straightforward thanks to the Hugging Face library; integrating continuous model training as you annotate is more technically challenging. Thanks to UBIAI’s accelerated GPU training, we can train a BERT model on the fly from our annotations without writing any code (for readers who want to see the moving parts, a minimal Hugging Face sketch follows the list below). This approach has two main advantages:

  1. You can measure the impact of your annotations on model performance continuously as you annotate. UBIAI reports precision, recall, and F1 score per entity for each annotation cycle, so you can focus on the lowest-performing entities and improve the overall annotation quality.
  2. By leveraging the power of transformers to auto-label your data, you can cut manual labeling time significantly.
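
As promised, here is a rough sketch of what such a fine-tuning step looks like in plain Hugging Face code. It assumes the annotations have been exported as word-level tokens with integer tags; the tiny in-memory dataset, label set, and hyperparameters are illustrative, not UBIAI’s internals.

```python
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer,
                          TrainingArguments)

labels = ["O", "B-MATERIAL", "I-MATERIAL", "B-PROCESS", "I-PROCESS",
          "B-TASK", "I-TASK"]
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased")
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-cased", num_labels=len(labels))

# Illustrative stand-in for the exported seed annotations.
raw = Dataset.from_dict({
    "tokens": [["Cu", "films", "were", "grown", "by", "sputtering"]],
    "ner_tags": [[1, 0, 0, 0, 0, 3]],  # Cu -> B-MATERIAL, sputtering -> B-PROCESS
})

def tokenize_and_align(batch):
    """Tokenize pre-split words; label the first sub-token of each word,
    masking special tokens and word continuations with -100."""
    enc = tokenizer(batch["tokens"], truncation=True, is_split_into_words=True)
    enc_labels = []
    for i, tags in enumerate(batch["ner_tags"]):
        previous, row = None, []
        for word_id in enc.word_ids(batch_index=i):
            row.append(-100 if word_id is None or word_id == previous
                       else tags[word_id])
            previous = word_id
        enc_labels.append(row)
    enc["labels"] = enc_labels
    return enc

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner-model", num_train_epochs=5,
                           per_device_train_batch_size=16),
    train_dataset=raw.map(tokenize_and_align, batched=True),
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```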

To train the model, go to the Models menu, press Add New Model, and give it a name:

[Figure: Model Creation in UBIAI]

Once the model is created, select the Named Entity Recognition tab and press the Train Model button. The model configuration window will open:

[Figure: Model Configuration]

UBIAI offers spaCy and transformer models that you can train for NER and relation extraction tasks. Select the BERT option and choose distilbert-base-cased. As of this writing, we are adding more models from the Hugging Face library, such as LayoutLM, that you can easily train.

Once you specify the number of iterations, the dropout, and the batch size, press Run and the application will launch accelerated training on a GPU server.
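
For reference, these three knobs have direct counterparts in the Hugging Face API; the values below are placeholders, not recommended settings.

```python
from transformers import AutoModelForTokenClassification, TrainingArguments

# Dropout is a model-config knob (DistilBERT exposes it as `dropout`);
# iterations and batch size are training arguments.
model = AutoModelForTokenClassification.from_pretrained(
    "distilbert-base-cased",
    num_labels=7,   # O + B-/I- tags for MATERIAL, PROCESS, TASK
    dropout=0.2,
)
args = TrainingArguments(
    output_dir="ner-model",
    num_train_epochs=20,             # "number of iterations"
    per_device_train_batch_size=16,  # "batch size"
)
```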

Once the model training is done, we can look at the model performance for each training iteration:

[Figure: Model Performance]

With 50 documents annotated, the model reaches an F1 score of 11.54%, a precision of 13.16%, and a recall of 10.27%. These numbers are low, but we expect better performance with more annotated data. Let’s test the model and see if it learned anything.
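
As a quick sanity check, the reported F1 score is simply the harmonic mean of precision and recall:

```python
precision, recall = 0.1316, 0.1027
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(f"{f1:.2%}")  # 11.54%
```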

Model Auto-Labeling

With model training done, we can now test the model by auto-labeling a few unseen abstracts with one click. Simply press the auto-label icon to launch the automatic annotation; it takes a few seconds to complete.
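
Under the hood, auto-labeling amounts to running the fine-tuned NER model over raw text. For intuition, here is a minimal Hugging Face equivalent, assuming the trained model has been downloaded to a local folder (the path is a placeholder):

```python
from transformers import pipeline

# "./ner-model" is a placeholder path to the downloaded fine-tuned model.
ner = pipeline("token-classification", model="./ner-model",
               aggregation_strategy="simple")  # merge sub-tokens into spans

for entity in ner("The non-local voltage signal was measured in Cu."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 2))
```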

[Figure: Auto-labeling Project Selection]

And here is the result:

[Figure: Labeled Text Using Trained Model]

Our model was able to predict a few entities correctly, such as “spin flip” and “non-local voltage signal” as PROCESS, but it missed materials such as Py and Cu.

Although the model performance was low, it was able to correctly label part of the abstract. The labeler will need to correct the mislabeled and missed entities and re-run the training.

After running a second training with another 250 documents labeled, we get an F1 score of 29.7%, a precision of 32.5%, and a recall of 27.44%. Let’s run the test again on the same testing dataset:

[Figure: Labeled Text Using Second Training Iteration]

The model already learned to identify materials and significantly improved the process extraction!

Another way to improve model performance is to start from a model pre-trained on scientific articles, such as SciBERT.
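
If you are training outside UBIAI with the sketch above, that is a one-line change; allenai/scibert_scivocab_cased is the public SciBERT checkpoint on the Hugging Face hub:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer

# Same fine-tuning setup as before, swapping in the SciBERT checkpoint.
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_cased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_cased", num_labels=7)
```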

Once your model reaches satisfactory performance, you can download it directly from UBIAI or query it through the API using your model API key.
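
This article does not document the API itself, so the snippet below is only a hypothetical sketch of what querying such an endpoint typically looks like; the URL, payload, and response shape are assumptions:

```python
import requests

API_KEY = "YOUR_UBIAI_API_KEY"            # placeholder
URL = "https://api.example.com/annotate"  # hypothetical endpoint

response = requests.post(
    URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"text": "Spin transport in Cu was probed via a non-local voltage signal."},
    timeout=30,
)
response.raise_for_status()
print(response.json())  # expected: a list of predicted entity spans
```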

Conclusion

In this tutorial, we have shown how to leverage transformers to auto-label your data. This can easily be done with the UBIAI annotation tool, where you can create your initial seed annotation and train a transformer model with the click of a button. You can also run inference with the trained model without writing any code.
