Build An NLP Project From Zero To Hero (5): Model Training
Jan 18, 2022
Training an ML model is without a doubt the most interesting part for every data scientist and for every machine learning enthusiast. Model training refers simply to the model learning from its input data to generalize over a given phenomenon. With every training iteration, the model adjusts its weights to be able to make correct predictions as much as possible using a training algorithm like gradient descent.
There are a lot of details that concern this phase: selecting the model, verifying the integrity of the input data, evaluating the model, training, and saving it. We will get to every detail and then we will show how we apply each one.
In general, since we have observed and have prepared our data through preprocessing and labeling, we should have a good idea of what model we will be using.
Usually, there will be a list of models to choose from, and this will make the project far more complicated than it should be. A good intuition is to try the simplest model for your task and then proceed to improve the architecture of the existing model or to choose a more complex model that is compatible with the same task. However, the simplest model can fail from the start.
So, you can identify certain characteristics and properties that will help you reduce the candidate model list.
A very good example of this is from the Google Developers Guide of Text Classification. Through a lot of experimentation and testing, they identified a metric S/W or the number of samples/number of words per sample ratio. This metric will indicate wether you should choose n-gram models like logistic regression and support vector machines or sequence models like CNN or RNN for the text classification task. In practice, this is difficult to achieve on your own as you need to do a lot of experimentation and testing. This is why you should research industry-standard models.
There are also the characteristics of the data itself that would help map to the correct model: If your data has a large number of features but a significantly lower number of observations, a support vector machine will perform better than logistic regression.
For Named Entity Recognition, do you want to train a new model from scratch? Or use a pre-trained model and probably build upon it? For example, you can use a Spacy pre-trained NER model but it might not suit your need as in the case of this project. Or, you can train a Spacy Model from scratch using your own dataset, giving new vocabulary and labels for the model.
This article%20API%20(e.g.%20GATE).) presents a great overview of pre-trained NER models, ranging from rule-based models like in NLTK to probabilistic models like Stanford Core NLP and Deep Learning Models like Flair.
In the Data Preprocessing Phase, we have used the Spacy pre-trained NER model during the pre-annotation of the dataset. The model uses a sophisticated word embedding strategy using subword features and “Bloom” embeddings, a deep convolutional neural network with residual connections.
Besides, we have trained a spacy-based model in the last article, using the Model Assisted Labeling feature within the UBIAI tool. The performance was not too bad as a start. We can download it and use it like any other spacy model by clicking the Download Button in the Action Column in the Models Tab of our current project:
UBIAI Model Training Dashboard
For this article, we will feature the workflow for training two models: a Probabilistic Model, CRF or Continuous Random Fields, and a Deep Learning Model, Spacy NER model with Transformers.
But before that, we need to talk about the Input Data Format and Model Evaluation.
Input Data Format
To assure that your model will actually work, you must identify clearly the format of its Input Data. This is necessary for both prediction and training. There exist many suitable formats for the NER task and among them:
- IOB format: short for inside, outside, the beginning is a common tagging format for tagging tokens in a chunking task in NLP. Every document is separated into tokens. Each token will take a row and in front of every token, you will find its label. The Null label or ‘O’ is necessary in this case to mark unlabeled tokens. Since the labeling is practically word by word, there is an additional technique to label multiple token terms, I-notation (Inside of a labeled term) and B-notation (Beginning of a labeled term). Documents are separated between each other by a special separator (in our case ‘-DOCSTART- -X- O O’).
- JSON format: In this format, your dataset is a list of JSON objects. Each object represents a document and a list of its annotations. An annotation is a labeled term represented by a dictionary containing its text, its label, its starting position, and its ending index in the text in the document string.