Fine-tuning LayoutLM v3 for Invoice Processing and comparing its performance to LayoutLM v2

Jul 18, 2022

Document understanding is the first and most important step in document processing and extraction. It is the process of extracting information from an unstructured or semi-structured document to transform it into a structured form. This structured representation can then be used to support various downstream tasks such as information retrieval, summarization, classification, and so on. There are many different approaches to document understanding, but all of them share the same goal: to create a structured representation of the document content that can be used for further processing.


For semi-structured documents such as invoices, receipts or contracts, Microsoft’s LayoutLM model has shown great promise with the development of LayoutLM v1 and v2. For an in-depth tutorial, refer to my previous two articles, “Fine-Tuning Transformer Model for Invoice Recognition” and “Fine-Tuning LayoutLM v2 For Invoice Recognition”.


In this tutorial, we will fine-tune Microsoft’s latest LayoutLM v3 on invoices, as in my previous tutorials, and compare its performance to the LayoutLM v2 model.

LayoutLM v3

The main advantage of LayoutLM v3 over its predecessors is its multi-modal transformer architecture, which combines text and image embeddings in a unified way. Instead of relying on a CNN to embed the image, the document image is split into patches that are linearly projected and aligned with the text tokens, as shown below. This reduces the number of parameters and lowers the overall computation cost.

LayoutLM v3 architecture. Source
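
To make the patch idea concrete, here is a minimal NumPy sketch of splitting an image into non-overlapping patches and applying a single linear projection. The 224×224 input size, the 16×16 patch size and the 768-dimensional hidden size are assumptions chosen to mirror a typical base-size vision transformer setup, and the random projection matrix stands in for learned weights. It illustrates the technique only and is not the model's actual implementation.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    # One flat vector per patch
    return patches.reshape(rows * cols, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))          # assumed input resolution
patches = patchify(image, patch_size=16)   # assumed patch size

# A single matrix multiplication plays the role of the linear embedding
# (random weights here; a trained model would learn them).
projection = rng.normal(size=(patches.shape[1], 768)) * 0.02
patch_embeddings = patches @ projection

print(patches.shape, patch_embeddings.shape)
```

Each row of `patch_embeddings` can then be treated like a token embedding and placed in the same sequence as the text tokens, which is why no separate CNN backbone is needed.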

The authors show that “LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis”.

Fine-tuning LayoutLM v3

Similar to my previous article, we will use the same dataset of 220 annotated invoices to fine-tune the LayoutLM v3 model. To perform the annotations, we used the UBIAI Text Annotation tool, since it supports OCR parsing, native PDF/image annotation and export in a format that is compatible with the LayoutLM model without any post-processing.

After exporting the annotation file from UBIAI, we upload it to a Google Drive folder. We will use Google Colab for model training and inference.

The training and inference script can be accessed in the Google Colab below:

The first step is to open a Google Colab notebook, connect your Google Drive and install the transformers package from Hugging Face. Note that, unlike with LayoutLM v2, we are not using the Detectron2 package to fine-tune the model on entity extraction. However, for layout detection (outside the scope of this article), the Detectron2 package will be needed:

from google.colab import drive
drive.mount('/content/drive')
!pip install -q git+
! pip install -q git+ "dill<0.3.5" seqeval
Next, pull the script to process the ZIP file exported from UBIAI:
! rm -r layoutlmv3FineTuning
! git clone -b main
#!/bin/bash
IOB_DATA_PATH = "/content/drive/MyDrive/LayoutLM_data/"
! cd /content/
! rm -r data
! mkdir data
! cp "$IOB_DATA_PATH" data/
! cd data && unzip -q dataset && rm
! cd ..
Run the preprocess script:
#!/bin/bash
# preprocessing args
TEST_SIZE = 0.33
DATA_OUTPUT_PATH = "/content/"
! python3 layoutlmv3FineTuning/ --valid_size $TEST_SIZE --output_path $DATA_OUTPUT_PATH
Load the dataset after preprocessing:
from datasets import load_metric
from transformers import TrainingArguments, Trainer
from transformers import LayoutLMv3ForTokenClassification, AutoProcessor
from transformers.data.data_collator import default_data_collator
import torch
# load datasets
from datasets import load_from_disk
train_dataset = load_from_disk('/content/train_split')
eval_dataset = load_from_disk('/content/eval_split')
# build the label <-> id mappings from the dataset features
label_list = train_dataset.features["labels"].feature.names
num_labels = len(label_list)
label2id, id2label = dict(), dict()
for i, label in enumerate(label_list):
    label2id[label] = i
    id2label[i] = label
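
As a toy illustration of how `label_list` is used, the sketch below maps predicted class ids back to label names and skips positions whose reference label is -100, the value conventionally used to mark tokens that should be ignored (special tokens, padding). The label names and scores are hypothetical and are not taken from the invoice dataset.

```python
import numpy as np

# Hypothetical label set and toy scores: 1 document, 4 tokens, 3 labels.
toy_label_list = ["O", "B-TOTAL", "I-TOTAL"]
toy_logits = np.array([[
    [2.0, 0.1, 0.1],   # highest score at index 0 -> "O"
    [0.2, 3.0, 0.1],   # highest score at index 1 -> "B-TOTAL"
    [0.1, 0.2, 1.5],   # highest score at index 2 -> "I-TOTAL"
    [0.9, 0.8, 0.7],   # dropped below because its reference label is -100
]])
toy_labels = np.array([[0, 1, 1, -100]])

# Pick the highest-scoring class id for every token.
toy_pred_ids = np.argmax(toy_logits, axis=2)

# Keep only positions with a real reference label and map ids to names.
kept_predictions = [
    [toy_label_list[p] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(toy_pred_ids, toy_labels)
]
kept_labels = [
    [toy_label_list[l] for p, l in zip(pred_row, label_row) if l != -100]
    for pred_row, label_row in zip(toy_pred_ids, toy_labels)
]

print(kept_predictions)  # [['O', 'B-TOTAL', 'I-TOTAL']]
print(kept_labels)       # [['O', 'B-TOTAL', 'B-TOTAL']]
```

Lists of label names in this shape are what sequence-labeling metrics compare when computing precision, recall and F1.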
Define a few metrics for evaluation:
metric = load_metric("seqeval")
import numpy as np
return_entity_level_metrics = False

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # Remove ignored index (special tokens)
    true_predictions = [
        [label_list[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[l] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]

    # zero_division accepts 0, 1 or "warn"; the integer 0 scores undefined metrics as 0
    results = metric.compute(predictions=true_predictions, references=true_labels, zero_division=0)
    if return_entity_level_metrics:
        # Unpack nested dictionaries
        final_results = {}
        for key, value in results.items():
            if isinstance(value, dict):
                for n, v in value.items():
                    final_results[f"{key}_{n}"] = v
            else:
                final_results[key] = value
        return final_results
    else:
        return {
            "precision": results["overall_precision"],
            "recall": results["overall_recall"],
            "f1": results["overall_f1"],
            "accuracy": results["overall_accuracy"],
        }
  • Load, train and evaluate the model:
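model = LayoutLMv3ForTokenClassification.from_pretrained("microsoft/layoutlmv3-base",
                                                         id2label=id2label,
                                                         label2id=label2id)
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

NUM_TRAIN_EPOCHS = 50
PER_DEVICE_TRAIN_BATCH_SIZE = 1
PER_DEVICE_EVAL_BATCH_SIZE = 1
LEARNING_RATE = 4e-5

# The hyperparameter and Trainer arguments below are assumed from the
# constants and objects defined above.
training_args = TrainingArguments(output_dir="test",
                                  # max_steps=1500,
                                  num_train_epochs=NUM_TRAIN_EPOCHS,
                                  per_device_train_batch_size=PER_DEVICE_TRAIN_BATCH_SIZE,
                                  per_device_eval_batch_size=PER_DEVICE_EVAL_BATCH_SIZE,
                                  learning_rate=LEARNING_RATE,
                                  # eval_steps=100,
                                  metric_for_best_model="f1")

# Initialize our Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=default_data_collator,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()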
Once training is done, the model is evaluated on the test dataset. Below are the scores after evaluation:

{'epoch': 50.0,
 'eval_accuracy': 0.9521988527724665,
 'eval_f1': 0.6913439635535308,
 'eval_loss': 0.41490793228149414,
 'eval_precision': 0.6362683438155137,
 'eval_recall': 0.756857855361596,
 'eval_runtime': 9.7501,
 'eval_samples_per_second': 9.846,
 'eval_steps_per_second': 9.846}
The model achieved an F1-score of 0.69, with a recall of 0.76 and a precision of 0.64.
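
As a quick sanity check on these numbers, the F1-score is the harmonic mean of precision and recall, so it can be recomputed from the two values reported in the evaluation output:

```python
precision = 0.6362683438155137  # eval_precision from the evaluation output
recall = 0.756857855361596      # eval_recall from the evaluation output

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.4f}")  # 0.6913
```

The result matches the reported `eval_f1`. Because the harmonic mean is pulled toward the lower of the two values, the relatively low precision is what holds the F1-score back here.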

Let’s run the model on a new invoice that is not part of the training dataset.

Inference using LayoutLM v3

To run inference, we will OCR the invoice using Tesseract and feed the output to our trained model to run predictions. To simplify the process, we have created a custom script of a few lines of code that lets you ingest the OCR output and run predictions using the model.
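
One detail worth knowing when feeding OCR output to a LayoutLM-family model: the models expect each word's bounding box as `[x0, y0, x1, y1]` on a 0 to 1000 scale relative to the page, whereas Tesseract reports boxes in pixels as left, top, width and height. The sketch below shows one way to do that conversion. The function name and the page dimensions are hypothetical, and whether the inference script used here performs the conversion in exactly this way is an assumption.

```python
def normalize_box(left, top, width, height, page_width, page_height):
    """Convert a pixel box (left, top, width, height) to [x0, y0, x1, y1] on a 0-1000 scale."""
    x0 = int(1000 * left / page_width)
    y0 = int(1000 * top / page_height)
    x1 = int(1000 * (left + width) / page_width)
    y1 = int(1000 * (top + height) / page_height)
    return [x0, y0, x1, y1]

# Hypothetical OCR word box on a 1700 x 2200 pixel page.
box = normalize_box(left=170, top=220, width=340, height=44,
                    page_width=1700, page_height=2200)
print(box)  # [100, 100, 300, 120]
```

Normalizing this way makes the layout features independent of the scan resolution, so invoices scanned at different sizes produce comparable coordinates.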

  • First, let’s import a few important libraries and load the model:
# drive mount
from google.colab import drive
drive.mount('/content/drive')

## install Hugging face Transformers library to load Layoutlmv3 Preprocessor
!pip install -q git+
## install tesseract OCR Engine
! sudo apt install tesseract-ocr
! sudo apt install libtesseract-dev
## install pytesseract , please click restart runtime button in the cell output and move forward in the notebook
! pip install pytesseract

# ! rm -r layoutlmv3FineTuning
! git clone

import os
import torch
import warnings
from PIL import Image
warnings.filterwarnings('ignore')

# move all inference images from /content to 'images' folder
os.makedirs('images', exist_ok=True)
for image in os.listdir():
    try:
        # Image.open raises for anything that is not a readable image file
        img = Image.open(f'{os.curdir}/{image}')
        os.system(f'mv "{image}" "images/{image}"')
    except Exception:
        pass

# defining inference parameters
model_path = "/content/drive/MyDrive/LayoutLM_data/layoutlmv3.pth"  # path to Layoutlmv3 model
imag_path = "/content/images"  # images folder

# if inference model is pth then convert it to pre-trained format
if model_path.endswith('.pth'):
    layoutlmv3_model = torch.load(model_path)
    model_path = '/content/pre_trained_layoutlmv3'
    layoutlmv3_model.save_pretrained(model_path)
We are now ready to run predictions using the model:
# Call inference module
! python3 /content/layoutlmv3FineTuning/ --model_path "$model_path" --images_path $imag_path
Image by Author: Output LayoutLM v3

With 220 annotated invoices, the model was able to correctly predict the seller name, dates, invoice number and Total price (TTC)!

If we look closely, we notice that it made a mistake by labeling the laptop’s total price as the invoice total. Given the model’s score, this is not surprising and suggests that more training data is required.
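
A token classifier emits one label per word, so reading off fields such as the seller name or the invoice total means grouping consecutive tagged words into entities. Here is a minimal sketch of that grouping, assuming IOB-style tags (as the `IOB_DATA_PATH` variable used earlier suggests). The function name, words and label names are hypothetical, and the inference script may handle this step differently.

```python
def group_entities(words, tags):
    """Merge consecutive B-/I- tagged words into (label, text) pairs."""
    entities = []
    current_label, current_words = None, []
    for word, tag in zip(words, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != current_label):
            # A new entity starts: flush the previous one, if any.
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = label, [word]
        elif prefix == "I":
            # Continuation of the entity currently being built.
            current_words.append(word)
        else:
            # "O" (or any other tag) closes the current entity.
            if current_words:
                entities.append((current_label, " ".join(current_words)))
            current_label, current_words = None, []
    if current_words:
        entities.append((current_label, " ".join(current_words)))
    return entities

# Hypothetical words and tags, not actual model output.
words = ["Invoice", "No", "INV-001", "Total", "1,445.00", "USD"]
tags = ["O", "O", "B-INVOICE_NUMBER", "O", "B-TOTAL", "I-TOTAL"]
entities = group_entities(words, tags)
print(entities)
# [('INVOICE_NUMBER', 'INV-001'), ('TOTAL', '1,445.00 USD')]
```

With entities grouped like this, a mistake such as two different amounts both tagged as the total becomes easy to spot, because the same label shows up more than once for a field that should be unique.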

Comparing LayoutLM v2 vs LayoutLM v3

Apart from being less computationally intensive, does LayoutLM v3 provide a performance boost compared to its v2 counterpart? To answer this question, we compare the outputs of both models on the same invoice. Here is the LayoutLM v2 output, as shown in my previous article:

Image by Author: Output LayoutLM v2

We observe a few differences:

  • The v3 model was able to detect most of the keys correctly, whereas v2 failed to predict invoice_ID, Invoice number_ID and Total_ID.
  • The v2 model incorrectly labeled the total price, $1,445.00, as MONTANT_HT (French for total price before tax), whereas v3 predicted the total price correctly.
  • Both models made a mistake in labeling the laptop price as Total.

Based on this single example, LayoutLM v3 shows better overall performance, but we need to test on a larger dataset to confirm this observation.
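
One simple way to run such a comparison on a larger labeled set is to score each model per field with exact-match accuracy. The sketch below assumes each document's extracted fields are available as a dictionary. The function name, field names and values are placeholders invented purely to exercise the code, and they are not results from either model.

```python
from collections import defaultdict

def field_accuracy(gold_docs, predicted_docs):
    """Per-field exact-match accuracy over a list of documents."""
    correct, total = defaultdict(int), defaultdict(int)
    for gold, predicted in zip(gold_docs, predicted_docs):
        for field, value in gold.items():
            total[field] += 1
            # A missing or different value counts as an error.
            if predicted.get(field) == value:
                correct[field] += 1
    return {field: correct[field] / total[field] for field in total}

# Placeholder data, invented for illustration only.
gold = [
    {"TOTAL": "100.00", "INVOICE_NUMBER": "A-1"},
    {"TOTAL": "250.00", "INVOICE_NUMBER": "A-2"},
]
predictions_model_a = [
    {"TOTAL": "100.00", "INVOICE_NUMBER": "A-1"},
    {"TOTAL": "25.00", "INVOICE_NUMBER": "A-2"},
]
predictions_model_b = [
    {"TOTAL": "100.00"},
    {"TOTAL": "25.00", "INVOICE_NUMBER": "A-2"},
]

scores_a = field_accuracy(gold, predictions_model_a)
scores_b = field_accuracy(gold, predictions_model_b)
print(scores_a)  # {'TOTAL': 0.5, 'INVOICE_NUMBER': 1.0}
print(scores_b)  # {'TOTAL': 0.5, 'INVOICE_NUMBER': 0.5}
```

Scoring per field rather than per token shows directly which fields a model struggles with, which is the kind of difference noted in the single-invoice comparison above.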



By open-sourcing the LayoutLM models, Microsoft is leading the way in the digital transformation of many industries, including supply chain, healthcare, finance and banking.

In this step-by-step tutorial, we have shown how to fine-tune LayoutLM v3 on a specific use case: invoice data extraction. We then compared its performance to LayoutLM v2 and found a slight performance boost that still needs to be verified on a larger dataset.

Based on both the performance and computational gains, I would highly recommend using the new LayoutLM v3.

If you are interested in creating your own training dataset in the most efficient and streamlined way, don’t hesitate to try out the UBIAI OCR annotation feature here for free.
