Fine-Tuning LayoutLM v3 for Invoice Processing and Comparing Its Performance to LayoutLM v2
Jul 18, 2022
Document understanding is the first and most important step in document processing and extraction. It is the process of extracting information from an unstructured or semi-structured document to transform it into a structured form. This structured representation can then be used to support various downstream tasks such as information retrieval, summarization, classification, and so on. There are many different approaches to document understanding, but all of them share the same goal: to create a structured representation of the document content that can be used for further processing.
For semi-structured documents such as invoices, receipts, or contracts, Microsoft's LayoutLM model has shown great promise with the development of LayoutLM v1 and v2. For an in-depth tutorial, refer to my previous two articles, “Fine-Tuning Transformer Model for Invoice Recognition” and “Fine-Tuning LayoutLM v2 For Invoice Recognition”.
In this tutorial, we will fine-tune Microsoft's latest LayoutLM v3 on invoices, as in my previous tutorials, and compare its performance to the LayoutLM v2 model.
LayoutLM v3
The main advantage of LayoutLM v3 over its predecessors is its multi-modal transformer architecture, which combines text and image embeddings in a unified way. Instead of relying on a CNN for the image embedding, the document image is split into patches that are flattened, linearly projected, and aligned with the text tokens, as shown below. The benefit of this approach is a smaller number of parameters and lower overall computation.

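To make the patch-projection idea concrete, here is a minimal PyTorch sketch of a ViT-style patch embedding. The sizes (224×224 image, 16×16 patches, hidden size 768) are illustrative assumptions, not necessarily the exact LayoutLM v3 configuration:

import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Splits an image into fixed-size patches and linearly projects each patch."""
    def __init__(self, image_size=224, patch_size=16, channels=3, hidden_size=768):
        super().__init__()
        self.num_patches = (image_size // patch_size) ** 2
        # a strided convolution is equivalent to "flatten each patch + linear projection"
        self.proj = nn.Conv2d(channels, hidden_size, kernel_size=patch_size, stride=patch_size)

    def forward(self, pixel_values):      # (batch, 3, 224, 224)
        x = self.proj(pixel_values)       # (batch, hidden, 14, 14)
        x = x.flatten(2).transpose(1, 2)  # (batch, 196, hidden)
        return x                          # ready to be aligned with the text token embeddings

patches = PatchEmbedding()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 768])

These patch embeddings are what the transformer consumes in place of CNN feature maps, which is where the parameter and compute savings come from.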
The authors show that “LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis”.
Fine-Tuning LayoutLM v3
As in my previous article, we will use the same dataset of 220 annotated invoices to fine-tune the LayoutLM v3 model. For the annotations, we used the UBIAI Text Annotation tool, since it supports OCR parsing, native PDF/image annotation, and export in a format compatible with the LayoutLM model without any post-processing.
After exporting the annotation file from UBIAI, we upload it to a Google Drive folder. We will use Google Colab for model training and inference.
The training and inference script can be accessed in the Google Colab below:
The first step is to open a Google Colab notebook, connect your Google Drive, and install the transformers package from Hugging Face. Note that, unlike for LayoutLM v2, we do not need the detectron2 package to fine-tune the model for entity extraction. However, for layout detection (outside the scope of this article), detectron2 would be needed:
from google.colab import drive
drive.mount('/content/drive')

!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q git+https://github.com/huggingface/datasets.git "dill<0.3.5" seqeval
Next, pull the preprocess.py script to process the ZIP file exported from UBIAI:
! rm -r layoutlmv3FineTuning
! git clone -b main https://github.com/UBIAI/layoutlmv3FineTuning.git

# path to the ZIP file exported from UBIAI
IOB_DATA_PATH = "/content/drive/MyDrive/LayoutLM_data/Invoice_Project_mkWSi4Z.zip"

! cd /content/
! rm -r data
! mkdir data
! cp "$IOB_DATA_PATH" data/dataset.zip
! cd data && unzip -q dataset && rm dataset.zip
! cd ..
Run the preprocessing script:
# preprocessing args
TEST_SIZE = 0.33
DATA_OUTPUT_PATH = "/content/"

! python3 layoutlmv3FineTuning/preprocess.py --valid_size $TEST_SIZE --output_path $DATA_OUTPUT_PATH
Load the preprocessed datasets:
import torch
from datasets import load_metric, load_from_disk
from transformers import TrainingArguments, Trainer
from transformers import LayoutLMv3ForTokenClassification, AutoProcessor
from transformers.data.data_collator import default_data_collator

# load the train and evaluation splits produced by the preprocessing script
train_dataset = load_from_disk('/content/train_split')
eval_dataset = load_from_disk('/content/eval_split')

# build the label <-> id mappings
label_list = train_dataset.features["labels"].feature.names
num_labels = len(label_list)
label2id, id2label = dict(), dict()
for i, label in enumerate(label_list):
    label2id[label] = i
    id2label[i] = label
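These mappings let the model report predictions by entity name rather than by integer id. As a sketch (assuming the public microsoft/layoutlmv3-base checkpoint and the UBIAI export, which already contains the OCR words and bounding boxes), they are typically passed in when the processor and model are instantiated:

from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

# apply_ocr=False because the dataset already provides words and bounding boxes
processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base",
    id2label=id2label,
    label2id=label2id,
)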
Define a few metrics for evaluation:
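A standard choice for token classification is the seqeval metric installed earlier. The sketch below is representative of such a compute_metrics function rather than the exact notebook code:

import numpy as np

metric = load_metric("seqeval")

def compute_metrics(p):
    predictions, labels = p
    predictions = np.argmax(predictions, axis=2)

    # drop special tokens (label -100) before scoring
    true_predictions = [
        [label_list[pred] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]
    true_labels = [
        [label_list[lab] for (pred, lab) in zip(prediction, label) if lab != -100]
        for prediction, label in zip(predictions, labels)
    ]

    results = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": results["overall_precision"],
        "recall": results["overall_recall"],
        "f1": results["overall_f1"],
        "accuracy": results["overall_accuracy"],
    }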