Build An NLP Project From Zero To Hero (5): Model Training

Jan 18, 2022

Training an ML model is without a doubt the most interesting part for every data scientist and every machine learning enthusiast. Model training simply refers to the model learning from its input data to generalize over a given phenomenon. With every training iteration, a training algorithm such as gradient descent adjusts the model's weights so that its predictions are correct as often as possible.

There are a lot of details that concern this phase: selecting the model, verifying the integrity of the input data, evaluating the model, training, and saving it. We will get to every detail and then we will show how we apply each one.


Model Selection


In general, since we have observed and have prepared our data through preprocessing and labeling, we should have a good idea of what model we will be using.

Usually, there will be a list of models to choose from, and this will make the project far more complicated than it should be. A good intuition is to try the simplest model for your task and then proceed to improve the architecture of the existing model or to choose a more complex model that is compatible with the same task. However, the simplest model can fail from the start.


So, you can identify certain characteristics and properties that will help you reduce the candidate model list.

A very good example of this is from the Google Developers Guide to Text Classification. Through a lot of experimentation and testing, they identified a metric, S/W, which is the ratio of the number of samples to the number of words per sample. This metric indicates whether you should choose n-gram models like logistic regression and support vector machines or sequence models like CNN or RNN for the text classification task. In practice, this is difficult to achieve on your own as you need to do a lot of experimentation and testing. This is why you should research industry-standard models.
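To make the ratio concrete, here is a minimal sketch of how it could be computed. The helper names and the toy tweets are hypothetical, whitespace splitting stands in for a real tokenizer, and both the use of the median and the threshold of 1500 are what I recall from the Google guide, so check them against the guide before relying on them.

```python
from statistics import median


def samples_to_words_ratio(texts):
    """Return (number of samples) / (median number of words per sample)."""
    words_per_sample = [len(text.split()) for text in texts]
    return len(texts) / median(words_per_sample)


def suggest_model_family(texts, threshold=1500):
    # threshold is an assumption recalled from the guide, not a universal constant
    ratio = samples_to_words_ratio(texts)
    return "n-gram model" if ratio < threshold else "sequence model"


# Hypothetical toy corpus
tweets = [
    "Facebook has a price target of $ 20 for this quarter",
    "$ AAPL is gaining a new momentum",
    "Analysts put it to Hold",
]

ratio = samples_to_words_ratio(tweets)
family = suggest_model_family(tweets)
print(ratio, family)
```

With a small dataset of short tweets the ratio stays tiny, which points toward the simpler n-gram family.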


The characteristics of the data itself also help map to the right model: if your data has a large number of features but a significantly lower number of observations, a support vector machine will often perform better than logistic regression.

For Named Entity Recognition, do you want to train a new model from scratch? Or use a pre-trained model and probably build upon it? For example, you can use a Spacy pre-trained NER model but it might not suit your need as in the case of this project. Or, you can train a Spacy Model from scratch using your own dataset, giving new vocabulary and labels for the model.

This article presents a great overview of pre-trained NER models, ranging from rule-based models like in NLTK to probabilistic models like Stanford Core NLP and Deep Learning Models like Flair.


In the Data Preprocessing Phase, we have used the Spacy pre-trained NER model during the pre-annotation of the dataset. The model uses a sophisticated word embedding strategy using subword features and “Bloom” embeddings, a deep convolutional neural network with residual connections.

Besides, we have trained a spacy-based model in the last article, using the Model Assisted Labeling feature within the UBIAI tool. The performance was not too bad as a start. We can download it and use it like any other spacy model by clicking the Download Button in the Action Column in the Models Tab of our current project:



UBIAI Model Training Dashboard



For this article, we will feature the workflow for training two models: a Probabilistic Model, CRF or Conditional Random Fields, and a Deep Learning Model, the Spacy NER model with Transformers.

But before that, we need to talk about the Input Data Format and Model Evaluation.

Input Data Format

To ensure that your model will actually work, you must clearly identify the format of its Input Data. This is necessary for both prediction and training. There are many suitable formats for the NER task, among them:


  1. IOB format: short for inside, outside, beginning, it is a common format for tagging tokens in a chunking task in NLP. Every document is split into tokens. Each token takes a row, and next to every token you will find its label. The null label ‘O’ marks unlabeled tokens. Since the labeling is done word by word, multi-token terms need two extra prefixes: B- (beginning of a labeled term) and I- (inside of a labeled term). Documents are separated from each other by a special separator (in our case ‘-DOCSTART- -X- O O’).
  2. JSON format: In this format, your dataset is a list of JSON objects. Each object represents a document and the list of its annotations. An annotation is a labeled term represented by a dictionary containing its text, its label, and its start and end positions in the document string.

IOB left, Spacy JSON right
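To see how the two formats relate, here is a minimal sketch that turns IOB-tagged tokens into a JSON-like document with character offsets. The function name and the dictionary keys (`document`, `annotation`, `start`, `end`) are hypothetical and are not the exact schema exported by UBIAI or expected by Spacy; the tokens are assumed to be joined by single spaces.

```python
def iob_to_annotations(tokens, tags):
    """Join tokens with single spaces and convert IOB tags to character spans."""
    text = " ".join(tokens)
    annotations = []
    current = None  # annotation currently being built
    offset = 0      # character offset of the current token in text

    for token, tag in zip(tokens, tags):
        start, end = offset, offset + len(token)
        if tag.startswith("B-"):
            # a new labeled term begins; close the previous one if any
            if current:
                annotations.append(current)
            current = {"label": tag[2:], "start": start, "end": end}
        elif tag.startswith("I-") and current and current["label"] == tag[2:]:
            # the current term continues
            current["end"] = end
        else:
            # 'O' (or a stray I- tag) closes any open term
            if current:
                annotations.append(current)
            current = None
        offset = end + 1  # skip the joining space

    if current:
        annotations.append(current)
    for ann in annotations:
        ann["text"] = text[ann["start"]:ann["end"]]
    return {"document": text, "annotation": annotations}


tokens = ["Facebook", "has", "a", "price", "target"]
tags = ["B-COMPANY", "O", "O", "B-MONEY_LABEL", "I-MONEY_LABEL"]
doc = iob_to_annotations(tokens, tags)
print(doc)
```

The I- tag is what lets ‘price target’ come out as a single annotation instead of two.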


If you recall, we have already used the JSON format with Spacy previously in the pre-annotation and the Data Labeling Process.

In the UBIAI tool, you just need to open the Project list menu and click on the Download Button in the Actions’ Column.

One important point remains: splitting your dataset into a training set, a development (or validation) set, and a test set. We can omit the dev set to simplify things, since this is a learning project. An 80/20 train/test ratio is good for our small dataset.
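As a rough illustration of such a split, here is a minimal sketch using only the standard library. The function name, the fixed seed, and the placeholder documents are hypothetical; in practice each document would be one annotated tweet.

```python
import random


def train_test_split_docs(documents, test_ratio=0.2, seed=42):
    """Shuffle the documents and return (train, test) lists."""
    docs = list(documents)
    random.Random(seed).shuffle(docs)  # fixed seed keeps the split reproducible
    n_test = int(round(len(docs) * test_ratio))
    return docs[n_test:], docs[:n_test]


# Hypothetical placeholder documents
documents = [f"tweet {i}" for i in range(100)]
train_docs, test_docs = train_test_split_docs(documents)
print(len(train_docs), len(test_docs))
```

Shuffling before splitting avoids a test set made only of the most recently collected tweets.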

Model Evaluation

Since it is a classification task, we might begin with Accuracy. Accuracy is good if your dataset is balanced (every label has the same number of instances as everyone else). This is not our case.
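A quick way to check the balance is to count how often each entity label starts in the IOB data. This is a minimal sketch with a hypothetical helper name and toy sentences; it assumes each sentence is a list of (token, label) tuples.

```python
from collections import Counter


def entity_label_counts(sents):
    """Count entities per label by counting their B- tags."""
    counts = Counter()
    for sent in sents:
        for _token, label in sent:
            if label.startswith("B-"):
                counts[label[2:]] += 1
    return counts


# Hypothetical toy data
toy_sents = [
    [("Facebook", "B-COMPANY"), ("has", "O"), ("a", "O"),
     ("price", "B-MONEY_LABEL"), ("target", "I-MONEY_LABEL")],
    [("Google", "B-COMPANY"), ("is", "O"), ("up", "O")],
]
counts = entity_label_counts(toy_sents)
print(counts)
```

If one label dominates the counts, accuracy alone will look better than the model really is.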


Traditionally, these three metrics are considered for the NER task:

  • Precision: Measures how many of the labels your model predicts are actually correct. A precision error is when your model predicts that ‘Google’ is a PERSON while it is not. The higher this metric is, the less often your model makes this mistake.

  • Recall: Measures how many of the real labels your model actually finds. A recall error is when your model does not predict ‘Google’ as a COMPANY even though it is one. The higher this metric is, the fewer correct instances your model misses.

  • F1-Score: an overall indicator of the performance of the classifier that takes into account both Precision and Recall.

You may have noticed that I explained these metrics intuitively. For a more theoretical treatment, start with this article by Harikrishnan.
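For readers who prefer code to definitions, here is a minimal sketch of the three metrics computed over sets of (text, label) entities. The function name and the gold and predicted entities are hypothetical.

```python
def precision_recall_f1(true_entities, predicted_entities):
    """Compute precision, recall and F1 over sets of (text, label) pairs."""
    true_set, pred_set = set(true_entities), set(predicted_entities)
    tp = len(true_set & pred_set)   # predicted and correct
    fp = len(pred_set - true_set)   # predicted but wrong
    fn = len(true_set - pred_set)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Hypothetical example: 'Google' gets the wrong label, 'Elon Musk' is missed
gold = [("Google", "COMPANY"), ("$ 20", "MONEY"), ("Elon Musk", "PERSON")]
pred = [("Google", "PERSON"), ("$ 20", "MONEY")]

precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, f1)
```

Here one of two predictions is right (precision 0.5) and one of three real entities is found (recall 1/3), and F1 is the harmonic mean of the two.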

In the next two parts, I assume that you have a basic understanding of how a Machine Learning model trains and what constitutes a Model architecture. If you want to delve deeper into this topic, I recommend the Coursera Deep Learning Specialization by Andrew Ng.

Training a CRF Model

We have a sequence of tokens in every training example. These tokens are usually words if you decided to tokenize at the word level. We have talked about tokenization extensively in this episode.

To predict the nature of a word (is it a PERSON, a COMPANY, etc), we cannot ignore the sequential nature of our data (tweets, sentences) as it is a significant loss of information. We have to select a model that can infer from previous positions for the prediction of the current position. Named Entity Recognition is of sequential nature after all.

For example, using the IOB format and given the tweet “Facebook has a target price of $10”, the labeling might be “Facebook (B-COMPANY) has (O) a (O) target (B-MONEY_LABEL) price (I-MONEY_LABEL) of (O) $10 (B-MONEY)”. So to predict the word ‘price’ as an Inside MONEY_LABEL, we need to know the features of the previous word ‘target’. Knowing that it has the label Beginning MONEY_LABEL, its features would serve very well in making a correct prediction for ‘price’.

Model Architecture

CRF, or Conditional Random Fields, builds upon this intuition by defining feature functions that take into account the whole sentence and the labels assigned throughout it. In the simplest variant, called linear-chain CRF, a feature function takes as input:

  • The sentence s

  • The position i of a word in the sentence

  • The label l(i) of the current word

  • The label l(i−1) of the previous word

Each feature function outputs a real-valued number, which is usually binary (0 or 1). We then assign each feature function a weight that is learned by the model. Lastly, we turn these weighted functions into probabilities: for a given labeling, we sum the weighted feature functions over every position in the sentence, exponentiate the result, and normalize over all possible labelings.

Generic CRF Model
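The scoring idea behind a linear-chain CRF can be sketched in a few lines of plain Python. This is only a toy illustration: the two feature functions, their hand-picked weights, and the two-label set are hypothetical, and the normalization is done by brute force over all labelings, which real implementations such as sklearn_crfsuite avoid with dynamic programming.

```python
import math
from itertools import product

LABELS = ["O", "B-COMPANY"]


def f_capitalized_company(sentence, i, label, prev_label):
    # fires when a capitalized word is labeled as a company
    return 1.0 if sentence[i][0].isupper() and label == "B-COMPANY" else 0.0


def f_o_follows_o(sentence, i, label, prev_label):
    # fires when an 'O' label follows another 'O' label
    return 1.0 if label == "O" and prev_label == "O" else 0.0


# (feature function, weight); the weights are hand-picked here, normally learned
FEATURES = [(f_capitalized_company, 2.0), (f_o_follows_o, 0.5)]


def score(sentence, labels):
    """Sum of weighted feature functions over every position."""
    total = 0.0
    for i in range(len(sentence)):
        prev_label = labels[i - 1] if i > 0 else "START"
        for feature, weight in FEATURES:
            total += weight * feature(sentence, i, labels[i], prev_label)
    return total


def probability(sentence, labels):
    """Exponentiate the score and normalize over all possible labelings."""
    z = sum(math.exp(score(sentence, candidate))
            for candidate in product(LABELS, repeat=len(sentence)))
    return math.exp(score(sentence, labels)) / z


sentence = ["Facebook", "has", "a", "target"]
candidates = list(product(LABELS, repeat=len(sentence)))
best = max(candidates, key=lambda c: score(sentence, c))
print(best, probability(sentence, best))
```

Training a CRF amounts to finding the weights that give the gold labelings of the training set the highest probability.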




I hope that you have an intuition of how CRF works. For more details, check out these two great articles by Analyticsvidhya and Edwin Chen.


I am using Google Colaboratory as the working environment and Google Drive to store our data and our models.

from google.colab import drive
drive.mount('/content/drive')

I copied my train and test datasets to the current directory, /content.

!cp /content/drive/MyDrive/Public/stock-market-analysis-split/stock_test_IOB.tsv ./stock_test_IOB.tsv
!cp /content/drive/MyDrive/Public/stock-market-analysis-split/stock_train_IOB.tsv ./stock_train_IOB.tsv

Install the necessary libraries. We will be using the sklearn_crfsuite library to build our model. Make sure your scikit-learn version is below 0.24; otherwise, it would introduce bugs later.

!pip install -U 'scikit-learn<0.24'
!pip install sklearn_crfsuite

# import libraries
import matplotlib.pyplot as plt
plt.style.use('ggplot')

from itertools import chain
import nltk
import sklearn
import scipy.stats
from sklearn.metrics import make_scorer
#from sklearn.cross_validation import cross_val_score
from sklearn.model_selection import RandomizedSearchCV
import sklearn_crfsuite
from sklearn_crfsuite import scorers
from sklearn_crfsuite import metrics
import random
from nltk.tokenize import word_tokenize
nltk.download('punkt')

Convert your datasets into lists of documents, where each document is a list of (word, label) tuples.

def import_documents_set_iob(train_file_path):
    with open(train_file_path, encoding="utf8") as f:
        tokens_in_file = f.readlines()

    # build a list of documents; each document is a list of (token, tag) tuples
    new_train_set = []
    document = []

    for line in tokens_in_file:
        line = line.rstrip("\n")
        # a separator line marks the start of a new document
        if line.startswith("-DOCSTART-"):
            # store the previous document, if any, in the train set
            if document:
                new_train_set.append(document)
            document = []
            continue
        # skip blank lines
        if not line.strip():
            continue
        # assumes tab-separated rows with the token first and its tag last
        split_token = line.split("\t")
        document.append((split_token[0], split_token[-1]))

    # end of file: store the last document
    if document:
        new_train_set.append(document)

    return new_train_set

train_file_path = r"/content/stock_train_IOB.tsv"
train_sents = import_documents_set_iob(train_file_path)

test_file_path = r"/content/stock_test_IOB.tsv"
test_sents = import_documents_set_iob(test_file_path)

Now, let us transform our data into useful features by collecting details about every token and its adjacent neighbors. Notice that I have commented out part-of-speech tagging. You can include it if you have a Part Of Speech model that can be part of the feature engineering pipeline.

# Utility functions to extract features
def word2features(sent, i):
    word = sent[i][0]
    #postag = sent[i][1]

    # features of the current token
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        # 'postag': postag,
        # 'postag[:2]': postag[:2],
    }

    if i > 0:
        # features of the previous token
        word1 = sent[i-1][0]
        #postag1 = sent[i-1][1]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
            # '-1:postag': postag1,
            # '-1:postag[:2]': postag1[:2],
        })
    else:
        # first token of the document
        features['BOS'] = True

    if i < len(sent)-1:
        # features of the next token
        word1 = sent[i+1][0]
        #postag1 = sent[i+1][1]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
            # '+1:postag': postag1,
            # '+1:postag[:2]': postag1[:2],
        })
    else:
        # last token of the document
        features['EOS'] = True

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    #return [label for token, postag, label in sent]
    return [label for token,  label in sent]

def sent2tokens(sent):
    #return [token for token, postag, label in sent]
    return [token for token, label in sent]

print ("example extracted features from single word :",sent2features(train_sents[0])[0])
We transform our data:

X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

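And then we proceed to create and train our model:

# assumed hyperparameters (the values used in the sklearn-crfsuite tutorial);
# tune them for your own data
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True
)
crf.fit(X_train, y_train)
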
Let us check the F1-score of our model. Do not forget to remove the ‘O’ label, because it would inflate the score.

# Evaluation of trained model
# Remove the 'O' label so that it does not inflate the score
labels = list(crf.classes_)
labels.remove('O')
print("trained labels :", labels)

# Predict on the test set and compute the weighted F1-score
y_pred = crf.predict(X_test)
print(metrics.flat_f1_score(y_test, y_pred,
                      average='weighted', labels=labels, zero_division=True))

# Inspect evaluation per class
# Sort labels so that the B- and I- tags of the same entity are grouped
sorted_labels = sorted(
    labels,
    key=lambda name: (name[1:], name[0])
)
print(metrics.flat_classification_report(
    y_test, y_pred, labels=sorted_labels, digits=3
))

F1-score for CRF model



That is pretty good! It performed better than the Spacy NER model (81%). Remember that our labels are not balanced. If we fix this problem, we can expect an even better model.


The CRF model was the de facto solution for various NLP tasks like Part of Speech Tagging and Named Entity Recognition before the Deep Learning era. It is still effective, as you have just seen!

Model Usage

Let us use it for unseen examples:

# convert raw sentences into lists of (token, empty label) tuples
def sents2tuples(sents):
    res = []
    for sent in sents:
        tokens = word_tokenize(sent)
        res.append([(token, '') for token in tokens])
    return res

# with sents2tuples, preprocessing works just fine with new text
def preprocess(texts):
    texts = sents2tuples(texts)
    X = [sent2features(s) for s in texts]
    return X

samples = ["Facebook has a price target of $ 20 for this quarter",
         "$ AAPL is gaining a new momentum"]

processed = preprocess(samples)

pred = crf.predict(processed)
# assumed output format: each token followed by its predicted label
for sample, predicted_labels in zip(samples, pred):
    for token, label in zip(word_tokenize(sample), predicted_labels):
        print(token, label)

The output of CRF




Not bad for our two examples. However, the model will struggle with variations of ‘$20’ for ‘$ 20’ and ‘$AAPL’ for ‘$ AAPL’. It will not label them correctly. This can be mitigated by more effective tokenization and feature engineering. We can generate variations of the same instances (new training examples by varying the spacing) and let the model learn them. This is called Data Augmentation.
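As a rough idea of what such an augmentation could look like, here is a minimal sketch that glues a standalone ‘$’ to the token that follows it. The function name and the labeling rule are assumptions: I keep the label of ‘$’ when it has one, and otherwise reuse the label of the next token, which may not match how your own dataset labels these tokens.

```python
def merge_dollar_tokens(sent):
    """Return a copy of a (token, label) sentence with '$' glued to the next token."""
    merged = []
    i = 0
    while i < len(sent):
        token, label = sent[i]
        if token == "$" and i + 1 < len(sent):
            next_token, next_label = sent[i + 1]
            if label != "O":
                new_label = label                      # keep the label of '$'
            elif next_label.startswith("I-"):
                new_label = "B-" + next_label[2:]      # merged token starts the term
            else:
                new_label = next_label
            merged.append((token + next_token, new_label))
            i += 2
        else:
            merged.append((token, label))
            i += 1
    return merged


# Hypothetical labeled examples
example = [("target", "O"), ("of", "O"), ("$", "B-MONEY"), ("20", "I-MONEY")]
augmented = merge_dollar_tokens(example)
print(augmented)

ticker = [("$", "O"), ("AAPL", "B-TICKER"), ("is", "O")]
print(merge_dollar_tokens(ticker))
```

The augmented sentences would be added to the original training set, not substituted for it, so that the model sees both spellings.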


Lastly, don’t forget to save your model and test it again!

import pickle

filename = 'crf_model.sav'

# save the trained model
with open(filename, 'wb') as f:
    pickle.dump(crf, f)

# load it back
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)



Training a Spacy NER Transformer-based Model

Transformers are considered state of the art for NLP tasks. As usual, we will try to understand the intuition behind them without going into too much detail.

You have probably heard buzzwords like Google BERT (Bidirectional Encoder Representations from Transformers) and OpenAI GPT-3 (Generative Pre-trained Transformer 3), highly sophisticated models that can process natural language and generate well-structured sentences. As you can see from their names, both go back to the Transformer.

The main objective is to understand how the tokens in a given document are interconnected with each other.

Facebook has a price target of $ 20 for this quarter. Analysts put it to ‘Hold’.

When we read this hypothetical tweet, our mind memorizes the word ‘Facebook’ and remembers that the terms ‘has’ and ‘price target’ are related to it. It can also deduce that ‘it’ refers to the initial term. A Transformer model uses the same idea: by looking, for every word, at all the words connected to it, it captures semantic relationships between words that other models struggle with.

Model Architecture

Let us see Transformers through word embeddings:

  • We have a document with 16 tokens (the tweet example)

  • Select the first Token ‘Facebook’ X

  • Convert the 16 tokens to Word Embeddings (each token encoding depends on the rest of the tokens and their weights.)

  • Reflect similarities between Encoded Tokens and the Selected Token by using Dot Product (then normalization to not inflate the weights)

  • The Dot product and the Normalization will produce new weights which will be used again with the 16 Encoded Tokens to compute the representation Y of our token X.

  • Repeat the process for the rest of the tokens. Note that the weights mentioned here are conceptually different from the weights of a neural network.

Keywords that relate to Transformers are Values (the Encoded Tokens at the phase of computing Y), the Query (The Selected Token), and the Keys (Encoded Tokens as an output of the Word Embeddings).


Values, Queries and Keys



This is a naive, simplified Transformer architecture. You can introduce a Feed-Forward Network for the Values.
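The steps listed above can be sketched in plain Python. This is a toy version under stated assumptions: every embedding is used directly as query, key, and value with no learned projections, softmax is my choice for the normalization step, and the 2-dimensional embeddings are made up for the example.

```python
import math


def softmax(values):
    """Turn raw similarity scores into weights that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def self_attention(embeddings):
    """Naive self-attention: each embedding serves as query, key and value."""
    outputs = []
    for query in embeddings:
        # similarity between the selected token and every encoded token
        scores = [dot(query, key) for key in embeddings]
        weights = softmax(scores)
        # representation Y of the selected token: weighted sum of the values
        y = [sum(w * value[d] for w, value in zip(weights, embeddings))
             for d in range(len(query))]
        outputs.append(y)
    return outputs


# Hypothetical 2-dimensional embeddings for three tokens
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(embeddings)
print(outputs)
```

Each output vector is a weighted average of all the embeddings, so every token's new representation carries information about the tokens most similar to it.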

It is a loose explanation but it is enough to get you started. I recommend reading this article by Oleg Borisov.


Note that training any Spacy-based model follows the same workflow.

As usual, I have used Google Colab and Google Drive.

Use GPU for your runtime and check it:

!nvcc --version
# nvcc: NVIDIA (R) Cuda compiler driver
# Copyright (c) 2005-2020 NVIDIA Corporation
# Built on Mon_Oct_12_20:09:46_PDT_2020
# Cuda compilation tools, release 11.1, V11.1.105
# Build cuda_11.1.TC455_06.29190527_0


Install these dependencies. We will need CUDA support and spacy transformers.

!pip install -U pip setuptools wheel
!pip install -U spacy
!pip install -U spacy[cuda111,transformers] #cuda version 111

Make sure that CUDA (the parallel computing platform by Nvidia) and cupy (a Python library like numpy, but for GPU-accelerated computing) have matching, or at least compatible, versions. At the time of writing, Colab ships cupy 9.4 and CUDA 11.1. Configuring a GPU for training your models locally can cause a lot of headaches, so be careful to match the versions of the library and the platform.

Check PyTorch and Cuda availability:

import torch

print(torch.__version__)
print(torch.cuda.is_available())  # should print True on a GPU runtime


Make a folder for your project and change the current working directory to that folder:

!mkdir trf_ner
%cd trf_ner

These commands will get the train and test datasets and convert them from IOB to JSON format and then spacy binary format:

!cp /content/drive/MyDrive/Public/stock-market-analysis-split/stock_test_IOB.tsv ./stock_test_IOB.tsv
!cp /content/drive/MyDrive/Public/stock-market-analysis-split/stock_train_IOB.tsv ./stock_train_IOB.tsv

!python -m spacy convert ./stock_train_IOB.tsv ./ -t json -n 1 -c iob
!python -m spacy convert ./stock_test_IOB.tsv ./ -t json -n 1 -c iob

!python -m spacy convert ./stock_train_IOB.json ./ -t spacy
!python -m spacy convert ./stock_test_IOB.json ./ -t spacy


Go to the Spacy Config File Widget and generate a config file for your setup. Make sure to select NER, GPU (Transformer), and efficiency. Upload the config file to Google Drive, and alter the parameters according to your needs. In my case, I had to lower ‘total_steps’ in [training.optimizer.learn_rate] from 20000 to 10000, because the model's performance suddenly collapsed during training: it went from a 74% F-score to 0 and stayed there for the rest of the epochs.

Initialize your project using the config file:

!python -m spacy init fill-config /content/drive/MyDrive/Public/stock-market-analysis-split/base_config.cfg ./config.cfg


We can debug our data:

!python -m spacy debug data ./config.cfg

If there are major issues that will prevent you from training your model, Spacy will inform you. Here, we are told that there is a low number of training examples (350) and that there are labels with very low cardinality like PERSON. They are warnings and not errors and so we can proceed.


Debugging your data with spacy



Train the model:

!python -m spacy train -g 0 ./config.cfg --output ./

If you see that the pipeline has initialized then everything is working correctly:


Training spacy NER transformer model




You certainly noticed that I aborted the training after just three epochs. There is a story behind this. I actually trained the model for 3 hours and it reached an F-score of around 87%. However, Google Colab cut off GPU access because I exceeded the allowed quota. So be careful with cloud resources, or a day's work will be lost. I had to redo the training the next day.

Our Best Ner Trf Model has an F-Score of 85.71%. Not bad.

One last thing: save your model after training!


!cp -r /content/trf_ner /content/drive/MyDrive/Public/stock-market-analysis-split/trf_ner



We can import the model like any other Spacy model:

import spacy

ner = spacy.load('/content/drive/MyDrive/Public/stock-market-analysis-split/trf_ner/model-best')


Let us try it on some examples:

samples = ["Facebook has a price target of $ 20 for this quarter",
         "$ AAPL is gaining a new momentum"]

for doc in ner.pipe(samples):
  for ent in doc.ents:
      print(ent.label_, ent.text)



Output of the transformer model

It works!


We were able to get decent starting models from a small dataset (350 training examples, 50 test examples), and we know exactly why they are not performing better (an unbalanced dataset, and tweets that follow their own specific style of writing). Our journey is far from over. There are a lot of things to consider, like Model Fine-Tuning, Data Augmentation, Model Monitoring, and Model Deployment. We are not done yet!

This article was longer than usual but I believe it can serve as a future guide for those who want to get quickly into training their own NLP models.

If you are curious, you can request a demo by contacting admin@ or Twitter.

Happy learning and see you in the next article!