Fine-Tuning Transformer Model for Invoice Recognition
Jun 21, 2021
Building on my recent tutorial on how to annotate PDFs and scanned images for NLP applications, we will fine-tune Microsoft's recently released LayoutLM model on an annotated custom dataset of French and English invoices. While previous tutorials focused on fine-tuning the model on the publicly available FUNSD dataset, here we will walk through the entire process, from annotation and pre-processing to training and inference.
LayoutLM Model
The LayoutLM model is based on the BERT architecture but adds two types of input embeddings. The first is a 2-D position embedding that denotes the relative position of a token within a document, and the second is an image embedding for scanned token images within a document. This model achieved new state-of-the-art results on several downstream tasks, including form understanding (from 70.72 to 79.27), receipt understanding (from 94.02 to 95.24), and document image classification (from 93.07 to 94.42). For more information, refer to the original paper.
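Schematically, the 2-D position input can be pictured as coordinate lookup tables whose embeddings are summed and added to BERT's usual token and 1-D position embeddings. Here is a toy sketch of that idea (the table names and the tiny hidden size are mine, not the actual implementation):

```python
import torch
import torch.nn as nn

hidden = 8          # toy hidden size; LayoutLM-base uses 768
max_coord = 1000    # coordinates are normalized to a 0-1000 grid

# simplified stand-ins for LayoutLM's 2-D position embedding tables
x_emb = nn.Embedding(max_coord + 1, hidden)
y_emb = nn.Embedding(max_coord + 1, hidden)

def two_d_position_embedding(bbox):
    # bbox = (x0, y0, x1, y1) on the 0-1000 grid; the four coordinate
    # embeddings are summed into one vector per token
    x0, y0, x1, y1 = (torch.tensor([c]) for c in bbox)
    return x_emb(x0) + y_emb(y0) + x_emb(x1) + y_emb(y1)

print(two_d_position_embedding((302, 188, 318, 191)).shape)  # torch.Size([1, 8])
```

The resulting vector is added to the standard BERT input embeddings, which is how the model learns to use spatial layout alongside text.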
Thankfully, the model was open sourced and made available in the Hugging Face library. Thanks, Microsoft!
For this tutorial, we will clone the model directly from the Hugging Face library and fine-tune it on our own dataset; the link to the Google Colab notebook is below. But first, we need to create the training data.
https://colab.research.google.com/drive/1KnkVuYW6Ne25hOZ_IApiv_MIYb4lxCAq?usp=sharing
Invoice Annotation
Using the UBIAI text annotation tool, I annotated around 50 personal invoices. I am interested in extracting both the keys and the values of the entities; for example, in the text “Date: 06/12/2021” we would annotate “Date” as DATE_ID and “06/12/2021” as DATE. Extracting both keys and values helps correlate the numerical values with their attributes. Here are all the entities that have been annotated:
DATE_ID, DATE, INVOICE_ID, INVOICE_NUMBER, SELLER_ID, SELLER, MONTANT_HT_ID, MONTANT_HT, TVA_ID, TVA, TTC_ID, TTC
Here are a few entity definitions:
MONTANT_HT: Total price pre-tax
TTC: Total price with tax
TVA: Tax amount
Below is an example of an annotated invoice using UBIAI:

Image by author: Annotated invoice
After annotation, we export the train and test files from UBIAI directly in the correct format, without any pre-processing step. The export includes three files each for the training and test datasets, plus one text file, labels.txt, containing all the labels:
Train/Test.txt
2018 O
Sous-total O
en O
EUR O
3,20 O
€ O
TVA S-TVA_ID
(0%) O
0,00 € S-TVA
Total B-TTC_ID
en I-TTC_ID
EUR E-TTC_ID
3,20 S-TTC
€ O
Services O
soumis O
au O
mécanisme O
d'autoliquidation O
- O
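The labels follow the BIOES tagging scheme: S- marks a single-token entity, B-/I-/E- mark the beginning, inside, and end of a multi-token entity, and O marks tokens outside any entity. To make the scheme concrete, here is a small sketch (the helper name is mine) that groups tagged tokens back into entities:

```python
def bioes_to_entities(tokens, tags):
    """Group BIOES-tagged tokens into (entity_type, text) spans."""
    entities, current, ctype = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag == "O":
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "S":            # single-token entity
            entities.append((etype, tok))
        elif prefix == "B":          # begin a multi-token entity
            current, ctype = [tok], etype
        elif prefix == "I":          # continue it
            current.append(tok)
        elif prefix == "E":          # end it and emit the full span
            current.append(tok)
            entities.append((ctype, " ".join(current)))
            current, ctype = [], None
    return entities

tokens = ["TVA", "(0%)", "0,00 €", "Total", "en", "EUR", "3,20"]
tags = ["S-TVA_ID", "O", "S-TVA", "B-TTC_ID", "I-TTC_ID", "E-TTC_ID", "S-TTC"]
print(bioes_to_entities(tokens, tags))
# [('TVA_ID', 'TVA'), ('TVA', '0,00 €'), ('TTC_ID', 'Total en EUR'), ('TTC', '3,20')]
```

Note how “Total en EUR” comes back as one TTC_ID entity even though it spans three tokens.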
Train/Test_box.txt (contains the bounding box for each token):
€ 912 457 920 466
Services 80 486 133 495
soumis 136 487 182 495
au 185 488 200 495
mécanisme 204 486 276 495
d'autoliquidation 279 486 381 497
- 383 490 388 492
Train/Test_image.txt (contains the bounding box, document size, and file name for each token):
€ 912 425 920 434 1653 2339 image1.jpg
TVA 500 441 526 449 1653 2339 image1.jpg
(0%) 529 441 557 451 1653 2339 image1.jpg
0,00 € 882 441 920 451 1653 2339 image1.jpg
Total 500 457 531 466 1653 2339 image1.jpg
en 534 459 549 466 1653 2339 image1.jpg
EUR 553 457 578 466 1653 2339 image1.jpg
3,20 882 457 911 467 1653 2339 image1.jpg
€ 912 457 920 466 1653 2339 image1.jpg
Services 80 486 133 495 1653 2339 image1.jpg
soumis 136 487 182 495 1653 2339 image1.jpg
au 185 488 200 495 1653 2339 image1.jpg
mécanisme 204 486 276 495 1653 2339 image1.jpg
d'autoliquidation 279 486 381 497 1653 2339 image1.jpg
- 383 490 388 492 1653 2339 image1.jpg
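LayoutLM's 2-D position embeddings expect coordinates on a 0–1000 grid, so boxes in raw page pixels have to be scaled by the page size (here 1653 x 2339) before they reach the model. If your export is in pixels, the normalization looks roughly like this (the helper name is mine):

```python
def normalize_box(box, page_width, page_height):
    # scale pixel coordinates to LayoutLM's 0-1000 grid
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / page_width),
        int(1000 * y0 / page_height),
        int(1000 * x1 / page_width),
        int(1000 * y1 / page_height),
    ]

# token "TVA" at pixels (500, 441, 526, 449) on a 1653 x 2339 page
print(normalize_box((500, 441, 526, 449), 1653, 2339))  # [302, 188, 318, 191]
```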
labels.txt:
B-DATE_ID
B-INVOICE_ID
B-INVOICE_NUMBER
B-MONTANT_HT
B-MONTANT_HT_ID
B-SELLER
B-TTC
B-DATE
B-TTC_ID
B-TVA
B-TVA_ID
E-DATE_ID
E-DATE
E-INVOICE_ID
E-INVOICE_NUMBER
E-MONTANT_HT
E-MONTANT_HT_ID
E-SELLER
E-TTC
E-TTC_ID
E-TVA
E-TVA_ID
I-DATE_ID
I-DATE
I-SELLER
I-INVOICE_ID
I-MONTANT_HT_ID
I-TTC
I-TTC_ID
I-TVA_ID
O
S-DATE_ID
S-DATE
S-INVOICE_ID
S-INVOICE_NUMBER
S-MONTANT_HT_ID
S-MONTANT_HT
S-SELLER
S-TTC
S-TTC_ID
S-TVA
S-TVA_ID
Fine-Tuning LayoutLM Model:
Here, we use Google Colab with a GPU to fine-tune the model. The code below is based on the original LayoutLM paper and this tutorial.
First, install the layoutLM package…
! rm -r unilm
! git clone -b remove_torch_save https://github.com/NielsRogge/unilm.git
! cd unilm/layoutlm
! pip install unilm/layoutlm
…as well as the transformer package from where the model will be downloaded:
! rm -r transformers
! git clone https://github.com/huggingface/transformers.git
! cd transformers
! pip install ./transformers
Next, create a list containing the unique labels from labels.txt:
from torch.nn import CrossEntropyLoss

def get_labels(path):
    with open(path, "r") as f:
        labels = f.read().splitlines()
    if "O" not in labels:
        labels = ["O"] + labels
    return labels

labels = get_labels("./labels.txt")
num_labels = len(labels)
label_map = {i: label for i, label in enumerate(labels)}
pad_token_label_id = CrossEntropyLoss().ignore_index
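For reference, CrossEntropyLoss().ignore_index is -100 by default, so token positions labeled with pad_token_label_id simply do not contribute to the loss. A quick self-contained check:

```python
import torch
from torch.nn import CrossEntropyLoss

loss_fct = CrossEntropyLoss()
print(loss_fct.ignore_index)  # -100

# a position labeled -100 contributes nothing to the loss
logits = torch.tensor([[2.0, 0.1], [0.1, 2.0], [5.0, 5.0]])
targets = torch.tensor([0, 1, -100])
masked = loss_fct(logits, targets)            # third position ignored
unmasked = loss_fct(logits[:2], targets[:2])  # same two positions, no padding
print(torch.isclose(masked, unmasked))  # tensor(True)
```

This is why padding and sub-word continuation tokens can be labeled -100 without skewing training.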
Then, create a PyTorch dataset and dataloader:
from transformers import LayoutLMTokenizer
from layoutlm.data.funsd import FunsdDataset, InputFeatures
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler

args = {'local_rank': -1,
        'overwrite_cache': True,
        'data_dir': '/content/data',
        'model_name_or_path': 'microsoft/layoutlm-base-uncased',
        'max_seq_length': 512,
        'model_type': 'layoutlm'}

# class to turn the keys of a dict into attributes
class AttrDict(dict):
    def __init__(self, *args, **kwargs):
        super(AttrDict, self).__init__(*args, **kwargs)
        self.__dict__ = self

args = AttrDict(args)

tokenizer = LayoutLMTokenizer.from_pretrained("microsoft/layoutlm-base-uncased")

# the LayoutLM authors already defined a specific FunsdDataset, so we are going to use it here
train_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="train")
train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset,
                              sampler=train_sampler,
                              batch_size=2)

eval_dataset = FunsdDataset(args, tokenizer, labels, pad_token_label_id, mode="test")
eval_sampler = SequentialSampler(eval_dataset)
eval_dataloader = DataLoader(eval_dataset,
                             sampler=eval_sampler,
                             batch_size=2)

batch = next(iter(train_dataloader))
input_ids = batch[0][0]
tokenizer.decode(input_ids)
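For orientation, FunsdDataset yields each batch as a tuple in the order used in the training loop: input_ids, attention_mask, token_type_ids, labels, bounding boxes. A dummy batch with the expected shapes (the values are placeholders, not real data):

```python
import torch

batch_size, max_seq_length = 2, 512
dummy_batch = (
    torch.zeros(batch_size, max_seq_length, dtype=torch.long),     # [0] input_ids
    torch.ones(batch_size, max_seq_length, dtype=torch.long),      # [1] attention_mask
    torch.zeros(batch_size, max_seq_length, dtype=torch.long),     # [2] token_type_ids
    torch.full((batch_size, max_seq_length), -100),                # [3] labels (pad positions = -100)
    torch.zeros(batch_size, max_seq_length, 4, dtype=torch.long),  # [4] bbox: (x0, y0, x1, y1) per token
)
for name, t in zip(["input_ids", "attention_mask", "token_type_ids", "labels", "bbox"], dummy_batch):
    print(name, tuple(t.shape))
```

Keeping this ordering in mind makes the batch[0] … batch[4] indexing in the training loop easier to follow.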
Load the model from Hugging Face; this is the model that will be fine-tuned on the dataset.
from transformers import LayoutLMForTokenClassification
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = LayoutLMForTokenClassification.from_pretrained("microsoft/layoutlm-base-uncased", num_labels=num_labels)
model.to(device)
Finally, start the training:
from transformers import AdamW
from tqdm import tqdm

optimizer = AdamW(model.parameters(), lr=5e-5)

global_step = 0
num_train_epochs = 50
t_total = len(train_dataloader) * num_train_epochs  # total number of training steps

# put the model in training mode
model.train()
for epoch in range(num_train_epochs):
    for batch in tqdm(train_dataloader, desc="Training"):
        input_ids = batch[0].to(device)
        bbox = batch[4].to(device)
        attention_mask = batch[1].to(device)
        token_type_ids = batch[2].to(device)
        labels = batch[3].to(device)

        # forward pass
        outputs = model(input_ids=input_ids, bbox=bbox, attention_mask=attention_mask,
                        token_type_ids=token_type_ids, labels=labels)
        loss = outputs.loss

        # print loss every 100 steps
        if global_step % 100 == 0:
            print(f"Loss after {global_step} steps: {loss.item()}")

        # backward pass to get the gradients
        loss.backward()
        # print("Gradients on classification head:")
        # print(model.classifier.weight.grad[6,:].sum())

        # update
        optimizer.step()
        optimizer.zero_grad()
        global_step += 1
You should be able to see the training progress and the loss getting updated.
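After training, predictions are decoded by taking the argmax over the label dimension and mapping the predicted ids back through label_map, skipping any position labeled with pad_token_label_id. A self-contained sketch with dummy logits (the toy label subset is mine):

```python
import torch

label_map = {0: "O", 1: "S-TTC", 2: "S-TTC_ID"}  # toy subset of the real label map
pad_token_label_id = -100

# dummy logits for one sequence of 4 tokens over 3 labels
logits = torch.tensor([[[0.1, 2.0, 0.3],
                        [3.0, 0.2, 0.1],
                        [0.1, 0.2, 4.0],
                        [1.0, 1.0, 1.0]]])
labels = torch.tensor([[1, 0, 2, -100]])  # last position is padding

preds = logits.argmax(dim=2)  # highest-scoring label id per token
decoded = [label_map[p.item()]
           for p, l in zip(preds[0], labels[0])
           if l.item() != pad_token_label_id]
print(decoded)  # ['S-TTC', 'O', 'S-TTC_ID']
```

The same pattern, applied batch-wise to the real model outputs on eval_dataloader, yields the entity tags for an unseen invoice.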