Build An NLP Project From Zero To Hero (4): Data Labeling
Jan 11, 2022
Data labeling is without a doubt a critical phase in the workflow of any Machine Learning project. Here, we are preparing the study material for our students or rather our Machine Learning models.
Data Labeling or Data Annotation is defined as the process of tagging data, be it images, text files, or videos, and adding meaningful labels to provide context so that a machine learning model can learn from it.
First, we will talk about Data Labeling in general, and then we will apply it to our project: analyzing Stock Market tweets using a NER model.
Introduction to Data Labeling
The majority of ML models nowadays are Supervised Learning models. They rely heavily on their training data to learn how to generalize on a given task. With every training iteration, the model adjusts its weights to better predict the labels provided by the human annotator.
Annotators are tasked to use their own judgment to annotate every training example: Is this email spam or not? Is this image a cat or a dog?
A more complicated example would be identifying entities such as email addresses, people, and company names in every email. The labeler here is required to provide the span (start index, end index) for every entity in the text.
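To make the idea of a span concrete, here is a minimal sketch in plain Python. The sentence, the `find_span` helper, and the dictionary keys (`start`, `end`, `label`) are illustrative assumptions, not the format of any particular tool. The end index is exclusive, as in Python slicing.

```python
# Hypothetical example sentence, not taken from the project's dataset.
text = "Contact Jane Doe at jane.doe@acme.com about the Acme Corp invoice."


def find_span(text, phrase, label):
    """Return a span annotation for the first occurrence of `phrase`."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return {"start": start, "end": start + len(phrase), "label": label}


annotations = [
    find_span(text, "Jane Doe", "PERSON"),
    find_span(text, "jane.doe@acme.com", "EMAIL"),
    find_span(text, "Acme Corp", "COMPANY"),
]

for ann in annotations:
    # Slicing the text with the span recovers the annotated entity.
    print(ann["label"], ann["start"], ann["end"], text[ann["start"]:ann["end"]])
```

A human annotator produces the same kind of record by highlighting text in a labeling interface, and the tool stores the indices for them.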
As you can see, the labeling or tagging process ranges from a very simple binary choice to a complex, fine-grained one. To make sure the annotation is successful, there are a few requirements:
- User-Friendly Labeling Interface: In cases of complex tagging, labelers can easily be overwhelmed. Time is crucial, and you want to finish the process quickly and correctly. Developing or using an easy-to-use and efficient Data Annotation Tool can certainly help you.
- Domain Knowledge Consensus: Every labeler needs to produce consistent and correct tags. If the data belongs to a specialized domain such as health care, finance, or scientific research, the labelers need subject matter expertise to perform the annotation. For this project, I had to read articles to understand the jargon of the Stock Market world. This knowledge also helps resolve conflicts when multiple labelers annotate the same corpus: after all, labelers can make mistakes out of bias or lack of knowledge.
- Assessing Data Quality: Always verify the accuracy of your labels throughout the entire phase. As I realized in this project, the dataset I collected worked well for certain labels but not for others. For example, I included the PERSON entity in the labels, but the dataset did not provide enough examples for that specific entity. On the other hand, there were plenty of examples for COMPANY names and their TICKERS.
This does not encompass all of the requirements for data labeling but understanding these points will give you a very good start and save you a lot of headaches.
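Two of the checks above can be sketched in a few lines of Python: counting how many examples each label has, and measuring how often two labelers agree on the same document. This is only an illustration under stated assumptions: the toy documents, the `min_examples` threshold, and the choice of exact-match Jaccard overlap as the agreement measure are mine, not part of the original project.

```python
from collections import Counter

# Toy annotated documents (hypothetical); spans are (start, end, label).
docs = [
    {"text": "Apple (AAPL) hits $150",
     "entities": [(0, 5, "COMPANY"), (7, 11, "TICKER"), (18, 22, "MONEY")]},
    {"text": "Tesla (TSLA) up 4%",
     "entities": [(0, 5, "COMPANY"), (7, 11, "TICKER"), (16, 18, "PERCENT")]},
    {"text": "Tim Cook praises Apple",
     "entities": [(0, 8, "PERSON"), (17, 22, "COMPANY")]},
]


def label_counts(docs):
    """Count how many annotated spans carry each label."""
    return Counter(label for doc in docs for _, _, label in doc["entities"])


def underrepresented(counts, labels, min_examples):
    """Return the labels that have fewer than `min_examples` spans."""
    return [label for label in labels if counts[label] < min_examples]


def span_agreement(spans_a, spans_b):
    """Jaccard overlap of two labelers' spans (exact match on start, end, label)."""
    a, b = set(spans_a), set(spans_b)
    if not a and not b:
        return 1.0  # both labelers found nothing: full agreement
    return len(a & b) / len(a | b)


counts = label_counts(docs)
print(counts.most_common())
print(underrepresented(counts, ["COMPANY", "TICKER", "PERSON"], min_examples=2))

# Two labelers disagree on the label of the second span.
labeler_a = [(0, 5, "COMPANY"), (7, 11, "TICKER")]
labeler_b = [(0, 5, "COMPANY"), (7, 11, "COMPANY")]
print(span_agreement(labeler_a, labeler_b))
```

A label that shows up in the underrepresented list, like PERSON in my case, is a signal to collect more data or drop the label. A low agreement score is a signal to revisit the labeling guidelines with the labelers.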
Here is a little story of mine: In a previous NLP project, I was developing a model to distinguish credible news sources from biased ones based on a dataset of articles. I labeled the data using a list of news sources and my cultural knowledge of them: there are well-established sources that most people consider credible, and the rest can be considered less credible. So, for lack of a better term, I labeled the articles based on social consensus.
The model got stuck at around 87% accuracy: I tried various architectures and many featurization techniques, but saw no improvement. Then I realized that among the articles from the less credible sources, there were good articles on par with those from the socially accepted sources. In other words, the model was being given wrong labels, which limited its performance.
I hope this section made you realize how crucial data labeling is. Now, let us do some practice!
The annotation tool used here, UBIAI, specializes in NER, Relation Extraction, and Text Classification. It has many useful features such as auto-labeling with rules, dictionaries, and model-assisted labeling.
Defining the Labels
But before starting the work, we need to know exactly which labels we want to extract. We talked about labels in the pre-annotation section of the previous episode. The generic spaCy model is not fine-tuned on stock market text, so we need to redefine the labels:
- COMPANY: company names.
- TICKER: A special symbol for every company in the stock market. For example, Apple’s ticker on the NASDAQ stock exchange is AAPL.
- TIME: We regrouped the TIME and DATE labels into one label.
- MONEY: self-explanatory
- MONEY_LABEL: This one is a little tricky, and it came out of a lot of research. MONEY labels on their own are not really helpful without mentioning what they refer to, so this label simply indicates what the MONEY is about. For example, is it a target price? Or a new rise in a company’s share value? This label captures the jargon of the Stock Market.
- PERCENT: A number indicating a percentage, a statistic.
- CARDINAL: A number on its own, not a TIME, a DATE, MONEY, or anything else.
- PRODUCT: any mention of a product.
- PERSON: A real-life person name like a CEO or a journalist.
- GPE: Geopolitical Entity.
- EVENT: like a financial summit.
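As a rough sketch, the label scheme above can be written down in code and used to convert the pre-annotations from the generic spaCy model. Merging DATE into TIME comes from the list above. Mapping spaCy's ORG to COMPANY and dropping every label outside the scheme are my assumptions for illustration, and the entity dictionaries use hypothetical keys.

```python
# The project's label scheme, as listed above.
PROJECT_LABELS = [
    "COMPANY", "TICKER", "TIME", "MONEY", "MONEY_LABEL", "PERCENT",
    "CARDINAL", "PRODUCT", "PERSON", "GPE", "EVENT",
]

# Mapping from generic spaCy labels to project labels.
SPACY_TO_PROJECT = {
    "DATE": "TIME",    # TIME and DATE are regrouped into one label
    "ORG": "COMPANY",  # assumption: organizations are treated as companies
}


def remap_entities(entities):
    """Rename generic labels and drop the ones outside the project scheme."""
    remapped = []
    for ent in entities:
        label = SPACY_TO_PROJECT.get(ent["label"], ent["label"])
        if label in PROJECT_LABELS:
            remapped.append({**ent, "label": label})
    return remapped


# Hypothetical pre-annotations for one tweet.
pre_annotated = [
    {"start": 0, "end": 5, "label": "ORG"},
    {"start": 6, "end": 11, "label": "DATE"},
    {"start": 12, "end": 15, "label": "NORP"},  # not in the scheme: dropped
]
print(remap_entities(pre_annotated))
```

Labels such as TICKER and MONEY_LABEL have no counterpart in the generic model, so they still have to be added by hand during annotation.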
First, we need to create a project that will host our data and our task. The steps to do so are well covered in the Documentation. Basically, you will define your project as a Span-based annotation project. You will configure its settings by adding the labels to use, and then you will need to import a dataset in a supported format: ours is a pre-annotated list of dictionaries from the spaCy default NER model. Check out the previous article for more information on how to preprocess your data.
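Before importing a pre-annotated list of dictionaries, it can be worth running a quick sanity check on it. The sketch below assumes each record has a `text` key and an `entities` list with `start`, `end`, and `label` keys. These names are hypothetical, so check the tool's documentation for the exact import format it supports.

```python
import json


def validate_record(record, allowed_labels):
    """Return a list of problems found in one pre-annotated record."""
    errors = []
    text = record["text"]
    previous_end = 0
    for ent in sorted(record["entities"], key=lambda e: e["start"]):
        if not (0 <= ent["start"] < ent["end"] <= len(text)):
            errors.append(f"span out of bounds: {ent}")
        if ent["label"] not in allowed_labels:
            errors.append(f"unknown label: {ent['label']}")
        if ent["start"] < previous_end:
            errors.append(f"overlapping span: {ent}")
        previous_end = max(previous_end, ent["end"])
    return errors


allowed = {"COMPANY", "TICKER", "TIME", "MONEY", "PERCENT"}

# Hypothetical records: the first is valid, the second is not.
good = {"text": "AAPL rose 3%",
        "entities": [{"start": 0, "end": 4, "label": "TICKER"},
                     {"start": 10, "end": 12, "label": "PERCENT"}]}
bad = {"text": "AAPL",
       "entities": [{"start": 0, "end": 9, "label": "STOCK"}]}

for record in (good, bad):
    print(record["text"], validate_record(record, allowed))

# Only the clean records are serialized for import.
clean = [r for r in (good, bad) if not validate_record(r, allowed)]
payload = json.dumps(clean, indent=2)
```

Catching out-of-range or overlapping spans at this stage is cheaper than discovering them after the import.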
UBIAI Annotation Interface
Most of your work will be in the Annotation Tab. You can add labels directly in the Entities, Relations, or Classification interfaces. The workflow is simple: first, select the label, and then highlight the words with your mouse in the document text interface. Once you are done, validate your document and move to the next one as shown below: