
Categorizing Invoices: Multimodal Transformers for Structured and Unstructured Data

Aug 9, 2022

In this article, we will fine-tune a pre-trained BERT model on our “multimodal” data to perform a multiclass classification of invoices by category.

  1. Business understanding
  2. Work environment preparation
  3. Data understanding
  4. What are Multimodal Transformers?
  5. Data Preparation
  6. Modeling
  7. Evaluation results

1. Business understanding

First, let’s take a look at the business side of this article.

In most organizations, each invoice is assigned to a specific category. For example, if an employee incurs expenses to repair or maintain facilities or equipment in their office, those invoices will be classified under the category “Repair_and_Maintenance”.

The category is therefore an important piece of information for creating the expense reports used to reimburse employees for eligible business expenses, and for tracking expenses either for the organization as a whole or for a specific product, client, or project.

Although categorizing expenses is an important task, doing it manually can be a real burden and a waste of time and resources. The same problem arises when we need to extract data from invoices, such as the date, the tax-inclusive (TTC) amount, the taxes, and the seller. However, with recent advances in deep learning models such as Transformers, it is becoming easier than ever to fine-tune large language models to serve a specific business need. All you need is high-quality labeled data to train the model.

For invoice extraction, we need an annotation tool that offers OCR annotation to parse the text and bounding boxes from the invoices and that allows native labeling. Fortunately, I have found a tool named UBIAI that lets you directly label your invoices and also train deep learning models such as LayoutLM to automatically extract information from the invoice image (as shown in the illustration below).

Multimodal Transformers for structured & unstructured data.
UBIAI’s OCR Annotation (Image from UBIAI.tools)

The information mentioned above (Date, Amount_TTC, taxes…) is explicitly indicated on the invoice, and we use the UBIAI annotation tool to extract it. The category, however, is not directly mentioned on the invoice and therefore has to be inferred (deduced) from the data that we extracted using the UBIAI tool.

Now that we have presented the business context, let’s prepare the working environment to implement our invoice category classification model.


2. Work environment preparation 


For this article, we used a publicly available dataset, which we slightly adjusted and supplemented to suit our business needs, to train our multimodal transformer.

We used Google Colab as a web IDE for Python. It’s free, so go ahead and create a new notebook to follow along.

Next, we need to import the libraries that we will use:

# Install the package first (notebook shell command)
!pip install multimodal-transformers

import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import f1_score, matthews_corrcoef
from transformers import AutoTokenizer, AutoConfig, Trainer, EvalPrediction, set_seed
from transformers.training_args import TrainingArguments
from multimodal_transformers.data import load_data_from_folder
from multimodal_transformers.model import TabularConfig
from multimodal_transformers.model import AutoModelWithTabular

Don’t worry about the number of imports; we’ll see the use of each one when it becomes relevant. For now, let’s take a look at our data.

3. Data understanding


As mentioned above, the category is a piece of information that we will infer (deduce) from the invoice information we already have:


df = pd.read_csv('/content/categorized ocr_annotated_invoices.csv')
The head (first 5 rows) of the df data frame
Descriptive statistics of df
Summary info of df

As we can see, our data is a set of invoices similar to those processed by the UBIAI annotation tool (we have selected only a few columns to work with), presented in tabular format: each row represents an invoice (an observation), and each column represents a feature (a predictor) such as Amount_TTC, VAT_TVA, Seller_ID, and Invoice_Description. The data is labeled, and the target variable is Category_ID.

So we will predict the Category_ID based on the other features. Let’s calculate the number of different categories presented in our dataset:



number_categories = df.Category_ID.nunique()
print(f"We have {number_categories} classes, so it's a Multiclass Classification")

> We have 36 classes, so it’s a Multiclass Classification.


4. Multimodal Transformers


Transformer-based models that work on unstructured text data are powerful, widely discussed, and commonly used. In our case, however, we have not only textual data in the Invoice_Description column but also valuable structured data: the Seller_ID, VAT_TVA, and Amount_TTC features. Each of these features brings information that a single feature would not provide.

We call these different ways of perceiving data (unstructured text, structured numerical data…) modalities. Combining information from several modalities to make a prediction is called “Multimodal Fusion”.

To do so, we decided to use the multimodal-transformers package, which, in the simplest terms, combines the output of the transformer model on the textual data with the structured data in the categorical and numerical features.
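To make the idea of multimodal fusion concrete, here is a minimal sketch in plain Python. It is not the code of the multimodal-transformers package: the function name fuse_by_concatenation, the vector sizes, and the values are illustrative assumptions. It shows only the simplest fusion strategy, concatenating a text embedding with the tabular features so that a single classifier can see both.

```python
# Illustrative sketch only: names and values are assumptions, not the package's API.

def fuse_by_concatenation(text_embedding, categorical_one_hot, numerical_features):
    """Join the three modalities into one feature vector."""
    return list(text_embedding) + list(categorical_one_hot) + list(numerical_features)


# A toy 4-dimensional "text embedding" (a real BERT base embedding has 768 dimensions)
text_embedding = [0.12, -0.40, 0.33, 0.08]
# One-hot encoding of a seller among 3 hypothetical sellers
seller_one_hot = [0, 1, 0]
# Scaled VAT and total amount (hypothetical values)
numerical = [0.20, 1.35]

fused = fuse_by_concatenation(text_embedding, seller_one_hot, numerical)
print(len(fused))  # 9 = 4 + 3 + 2
```

The fused vector is what a final classification layer would receive; other fusion strategies replace the concatenation with a different combination rule.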


5. Data Preparation

Now that we are familiar with our business case, the properties of our data, and the appropriate tool, we can begin to work on preparing the dataset.

We define the dictionary column_info in which we will specify which columns have text data, numeric data, categorical data, and the target variable:

column_info = {
    'text_cols': ['Invoice_Description'],
    'num_cols': ['VAT_TVA', 'Amount_TTC'],
    'cat_cols': ['Seller_ID'],
    'label_col': 'Category_ID',
    # 'label_list': list(df.Category_ID.unique( ))
}

Then we encode the label column Category_ID, because transformers.Trainer does not work with a qualitative label column; it expects the label column to contain integers from 0 to the number of classes minus 1:

# Map each category to an integer between 0 and number_categories - 1
encoder = preprocessing.LabelEncoder( )
df["Category_ID"] = encoder.fit_transform(df['Category_ID']).astype(int)

Note: There are multiple ways to encode features; here we used sklearn.preprocessing.LabelEncoder. Be aware that if you encode a non-label feature with this method, the integer codes may suggest an ordering between categories that does not exist, so you may need to use sklearn.preprocessing.OneHotEncoder instead.
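The difference between the two encodings can be illustrated without scikit-learn. The sketch below is a plain-Python approximation, and the helper names and the sample categories other than Repair_and_Maintenance are hypothetical. Integer label encoding assigns one integer per class in sorted order, which is what a classifier target needs, while one-hot encoding gives every class its own column so that no artificial order is implied.

```python
# Plain-Python sketch of the two encodings; helper names are hypothetical.

def label_encode(values):
    """Map each distinct value to an integer, using the sorted order of the classes."""
    classes = sorted(set(values))
    index = {c: i for i, c in enumerate(classes)}
    return [index[v] for v in values], classes


def one_hot_encode(values):
    """Map each value to a 0/1 vector with a single 1 at its class position."""
    codes, classes = label_encode(values)
    return [[1 if i == code else 0 for i in range(len(classes))] for code in codes]


categories = ["Travel", "Repair_and_Maintenance", "Travel", "Office_Supplies"]
codes, classes = label_encode(categories)
print(classes)  # ['Office_Supplies', 'Repair_and_Maintenance', 'Travel']
print(codes)    # [2, 1, 2, 0]
print(one_hot_encode(categories)[0])  # [0, 0, 1]
```

With integer codes, a model could read "Travel" (2) as being larger than "Office_Supplies" (0); the one-hot vectors avoid that reading.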

In order to build a reliable model, we should not use the same dataset for both model training and evaluation, so we need to split our data into training, validation, and test sets:




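# Shuffle the rows, then cut at 80% and 90% to get the three sets
train_df, validation_df, test_df = np.split(df.sample(frac = 1), [int(.8 * len(df)), int(.9 * len(df))])
print('TRAIN DATA is ~80% :', len(train_df))
print('VALIDATION DATA is ~10% :', len(validation_df))
print('TEST DATA is ~10% :', len(test_df))
# Save the sets under the file names expected by load_data_from_folder
train_df.to_csv('train.csv')
validation_df.to_csv('val.csv')
test_df.to_csv('test.csv')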
Splitting the data

Note: We saved the 3 sets as CSV files. The files must be named train.csv, val.csv, and test.csv; otherwise, the multimodal_transformers.data.load_data_from_folder function will not be able to load them.
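The cut points used in the split can be checked with a small stand-alone sketch. It uses only the standard library, and split_indices and the seed value are illustrative assumptions rather than part of the original notebook. It shuffles row indices and cuts them at 80% and 90%, mirroring the np.split call above.

```python
import random


def split_indices(n_rows, seed=42):
    """Shuffle row indices and cut them into ~80% / ~10% / ~10% parts."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split reproducible
    first_cut, second_cut = int(0.8 * n_rows), int(0.9 * n_rows)
    return indices[:first_cut], indices[first_cut:second_cut], indices[second_cut:]


train_idx, val_idx, test_idx = split_indices(1000)
print(len(train_idx), len(val_idx), len(test_idx))  # 800 100 100
```

In the pandas version, passing a random_state to df.sample plays the same role as the seed here.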

We then load our datasets into a TorchTabularTextDataset, which will include the text inputs for the HuggingFace transformers, and our specified categorical and numerical feature columns. To do this, we must first load our HuggingFace tokenizer:

pretrained_model_name = 'bert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name)
# Reads train.csv, val.csv and test.csv from the current folder
train_dataset, validation_dataset, test_dataset = load_data_from_folder(folder_path = '.',
                       text_cols = column_info['text_cols'],
                       tokenizer = tokenizer,
                       label_col = column_info['label_col'],
                       # label_list = column_info['label_list'],
                       categorical_cols = column_info['cat_cols'],
                       numerical_cols = column_info['num_cols'],
                       sep_text_token_str = tokenizer.sep_token)
We have chosen the pre-trained BERT base model (uncased); depending on your requirements, you can choose another pre-trained model.

Note: Text preprocessing is generally not needed when using pre-trained models such as BERT, because the model uses all of the information in a sentence, including punctuation and stop-words.


6. Modeling


Now the next thing to do is to load our transformer with a tabular model.

First, we specify our tabular configuration in a TabularConfig object, in which we also specify how to combine the tabular features with the textual features; we will use a weighted sum method.

Second, we define this configuration as the tabular_config member variable of a BertConfig object of the HuggingFace transformer.

Once we have defined model_config, we can load the model using the multimodal_transformers.model.AutoModelWithTabular.from_pretrained method.
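As a rough mental model of the weighted-sum combination, the sketch below adds the tabular feature vectors to the text vector dimension by dimension, each scaled by its own weight. It is a simplified, plain-Python illustration under stated assumptions, not the package's code. In a real model the weights are learned and the tabular features are first projected to the same size as the text vector; the function name and the numbers here are hypothetical.

```python
def weighted_feature_sum(text_vec, cat_vec, num_vec, w_cat, w_num):
    """Return text + w_cat * cat + w_num * num, computed per dimension."""
    sizes = {len(text_vec), len(cat_vec), len(num_vec), len(w_cat), len(w_num)}
    if len(sizes) != 1:
        raise ValueError("all vectors must have the same length")
    return [
        t + wc * c + wn * n
        for t, c, n, wc, wn in zip(text_vec, cat_vec, num_vec, w_cat, w_num)
    ]


text_vec = [1.0, 2.0, 3.0]
cat_vec = [1.0, 0.0, 1.0]    # categorical features, assumed already projected to size 3
num_vec = [2.0, 2.0, 2.0]    # numerical features, assumed already projected to size 3
w_cat = [0.5, 0.5, 0.5]      # hypothetical weights; learned during training in practice
w_num = [0.25, 0.25, 0.25]

print(weighted_feature_sum(text_vec, cat_vec, num_vec, w_cat, w_num))  # [2.0, 2.5, 4.0]
```

Unlike concatenation, this keeps the combined vector the same size as the text vector, and the weights decide how much each tabular dimension contributes.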

tabular_config = TabularConfig(
        num_labels = number_categories,
        cat_feat_dim = train_dataset.cat_feats.shape[1],
        numerical_feat_dim = train_dataset.numerical_feats.shape[1],
        combine_feat_method = 'weighted_feature_sum_on_transformer_cat_and_numerical_feats')
# Attach the tabular configuration to the BERT configuration
model_config = AutoConfig.from_pretrained('bert-base-uncased')
model_config.tabular_config = tabular_config
model = AutoModelWithTabular.from_pretrained('bert-base-uncased', config = model_config)
After loading the model and before training it on our data, we need to define some evaluation metrics that give insight into the model’s performance. We therefore define a function that returns the accuracy, the F1 score, and the Matthews correlation coefficient:

def calculate_classification_metrics(p: EvalPrediction):
  # The predicted class is the index of the highest score
  predicted_labels = np.argmax(p.predictions, axis = 1)
  expected_labels = p.label_ids
  accuracy = (predicted_labels == expected_labels).mean( )
  f1 = f1_score(y_true = expected_labels, y_pred = predicted_labels, average = 'micro')
  eval_result = {
         "acc": accuracy,
         "f1": f1,
         "mcc": matthews_corrcoef(expected_labels, predicted_labels)
  }
  return eval_result
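One detail worth knowing about these metrics: when every sample has exactly one label, the micro-averaged F1 score equals the accuracy, so the "acc" and "f1" entries should report the same value. The stand-alone sketch below (plain Python with made-up labels; the function names are illustrative) computes both from scratch to show why.

```python
def accuracy(y_true, y_pred):
    """Share of samples whose predicted label matches the expected label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool true positives, false positives and false negatives over all classes."""
    tp = fp = fn = 0
    for cls in set(y_true) | set(y_pred):
        for t, p in zip(y_true, y_pred):
            if p == cls and t == cls:
                tp += 1
            elif p == cls and t != cls:
                fp += 1
            elif p != cls and t == cls:
                fn += 1
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [0, 1, 2, 2, 1, 0, 2, 1]
y_pred = [0, 2, 2, 2, 1, 0, 1, 1]
print(accuracy(y_true, y_pred))  # 0.75
print(micro_f1(y_true, y_pred))  # 0.75
```

Every wrong prediction counts once as a false positive (for the predicted class) and once as a false negative (for the true class), so pooled precision and recall both reduce to the share of correct predictions. A macro average would give a different, per-class view.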



At this point, we just need to train our model, so only three steps remain 🎉:

  1. Define the training hyperparameters in TrainingArguments.
  2. Pass our training arguments to Trainer along with our datasets, model, and evaluation function.
  3. Call train() to fine-tune our model.
training_args = TrainingArguments(output_dir = "./UBIAI/model_out",
                                  logging_dir = "./UBIAI/run_logs",
                                  overwrite_output_dir = True,
                                  do_train = True,
                                  do_eval = True,
                                  per_device_train_batch_size = 32,
                                  num_train_epochs = 1,
                                  # Flag from older transformers releases; if your version rejects it,
                                  # look for the evaluation-strategy argument instead
                                  evaluate_during_training = True,
                                  logging_steps = 5,
                                  eval_steps = 54)
# Fix the random seed so that runs are reproducible
set_seed(training_args.seed)
trainer = Trainer(model = model,
                 args = training_args,
                 train_dataset = train_dataset,
                 eval_dataset = validation_dataset,
                 compute_metrics = calculate_classification_metrics)
trainer.train( )
trainer.train( )
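To relate logging_steps and eval_steps to the size of the training set, it helps to count the optimization steps in one epoch. The sketch below assumes a single device and no gradient accumulation; the helper names and the row count of 1,700 are hypothetical, not the size of our dataset.

```python
import math


def steps_per_epoch(n_train_rows, batch_size):
    """Number of batches (optimizer steps) needed to see every row once."""
    return math.ceil(n_train_rows / batch_size)


def evaluation_steps(total_steps, eval_every):
    """Steps at which an evaluation runs when evaluating every `eval_every` steps."""
    return list(range(eval_every, total_steps + 1, eval_every))


total = steps_per_epoch(1700, 32)
print(total)                        # 54
print(evaluation_steps(total, 54))  # [54]
print(evaluation_steps(total, 20))  # [20, 40]
```

If eval_steps is larger than the total number of steps, no evaluation happens during training, so it is worth checking this number against your own dataset size.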


7. Evaluation


After training the model, let’s look at the validation metrics:

# Load the TensorBoard notebook extension
%load_ext tensorboard
%tensorboard --logdir ./UBIAI/run_logs --port=6006
Training validation metrics

As a demonstration of the model in action, we will now run predictions on new invoices that are not part of the training or validation sets used for fine-tuning the model:

# Save our model
trainer.save_model("./UBIAI/model_out")
# Load it
model = AutoModelWithTabular.from_pretrained("./UBIAI/model_out", local_files_only=True)
trainer = Trainer(model=model)
# Predict with it: the predicted category is the index of the highest score
y_predicted = [list(i).index(max(i)) for i in trainer.predict(test_dataset).predictions]
print("➡️Predicted categories for the test data set:", y_predicted)
# Evaluate the prediction: share of test invoices whose category is predicted correctly
accuracy = sum(1 for x,y in zip(y_predicted,test_dataset.labels) if x == y) / float(len(y_predicted))
print("🥳 Our model is able to predict the invoice category with an accuracy score of", round(accuracy*100), "%")
Post-training prediction test results
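The predictions are integer class indices, so a last step is usually to translate them back to the original category names. In the notebook this could be done with the fitted LabelEncoder's inverse_transform method; the plain-Python sketch below shows the same idea with made-up scores and hypothetical category names.

```python
def argmax(scores):
    """Index of the largest score in a list."""
    return max(range(len(scores)), key=lambda i: scores[i])


def decode_predictions(score_rows, class_names):
    """Pick the best-scoring class for each row and return its name."""
    return [class_names[argmax(row)] for row in score_rows]


# Hypothetical class names, listed in the order of their integer codes
class_names = ["Office_Supplies", "Repair_and_Maintenance", "Travel"]
# Made-up model outputs (one row of scores per invoice)
scores = [
    [0.1, 2.3, -0.5],
    [1.7, 0.2, 0.9],
    [-1.0, 0.0, 3.2],
]
print(decode_predictions(scores, class_names))
# ['Repair_and_Maintenance', 'Office_Supplies', 'Travel']
```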

Note: As a reminder, here are the purposes of our 3 datasets:

 train_dataset is used to train the model and have it learn the hidden patterns in the data.

 validation_dataset is used to validate the performance of our model during training.

 test_dataset is used to test the model after completing training.



Working with annotated invoices that contain both structured data (Amount_TTC…) and unstructured data (Invoice_Description), we loaded a transformer with a tabular model that takes advantage of both text and structured data, fine-tuned the pre-trained model on our data, and were able to predict the category of the invoices with an accuracy of ~95%.

What are you waiting for?

Automate your process!

© 2023 UBIAI Web Services — All rights reserved.