Fine-Tuning LayoutLM v2 For Invoice Recognition
Jun 7, 2022
Since writing my last article, “Fine-Tuning Transformer Model for Invoice Recognition,” which leveraged the original LayoutLM transformer model for invoice recognition, Microsoft has released LayoutLM v2, a new model with significantly improved performance over the first LayoutLM. In this tutorial, I will demonstrate step by step how to fine-tune LayoutLM v2 on invoices, from data annotation to model training and inference.
Training and inference scripts are available on Google Colab.
Training Script:
Inference Script:
LayoutLM v2 Model
Unlike the first LayoutLM version, LayoutLM v2 integrates the visual features alongside the text and positional embeddings in the first input layer of the Transformer architecture, as shown below. This enables the model to learn cross-modality interactions among text, layout, and image within a single multi-modal framework. Here is a snippet from the abstract: “Experiment results show that LayoutLMv2 outperforms LayoutLM by a large margin and achieves new state-of-the-art results on a wide variety of downstream visually-rich document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.8340 → 0.8520), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672)”.
For more information, please refer to the original paper.
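To make the multi-modal input flow concrete, here is a minimal sketch using the Hugging Face transformers implementation of LayoutLM v2. The checkpoint name is the public base model; the label set and image path are purely illustrative, not the ones used later in this tutorial (note that this implementation also requires detectron2 for the visual backbone).

```python
from PIL import Image
from transformers import LayoutLMv2Processor, LayoutLMv2ForTokenClassification

# The processor couples an OCR engine (Tesseract by default) with the tokenizer,
# producing token ids, bounding boxes, and the resized page image in one call.
processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased")

labels = ["O", "B-INVOICE_NUMBER", "B-TOTAL"]  # illustrative label set
model = LayoutLMv2ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv2-base-uncased", num_labels=len(labels)
)

image = Image.open("invoice.png").convert("RGB")  # any sample invoice image
encoding = processor(image, return_tensors="pt", truncation=True)

# input_ids, bbox, and image embeddings all enter the same first Transformer layer.
outputs = model(**encoding)
print(outputs.logits.shape)  # (batch, sequence_length, num_labels)
```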

Annotation
For this tutorial, we have annotated a total of 220 invoices using the UBIAI Text Annotation Tool. UBIAI's OCR annotation allows labeling directly on native PDFs, scanned documents, and images (PNG and JPG), whether printed or handwritten. Support has recently been added for over 20 languages, including Arabic and Hebrew. A sketch of what one annotated invoice looks like once exported is shown below.
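For orientation, here is an illustrative sketch of a single annotated invoice in the word/box/label layout that LayoutLM-style models consume. The field names, labels, and values are hypothetical and do not reproduce the actual UBIAI export schema.

```python
# One annotated invoice, reduced to the three pieces LayoutLM v2 needs:
# the words, their bounding boxes, and a label per word.
example = {
    "id": "invoice_0001",  # hypothetical identifier
    "words": ["Invoice", "No:", "12345", "Total:", "$1,250.00"],
    # Bounding boxes are [x_min, y_min, x_max, y_max], normalized to a
    # 0-1000 scale relative to the page size, as LayoutLM v2 expects.
    "bboxes": [
        [72, 40, 160, 62],
        [165, 40, 205, 62],
        [210, 40, 280, 62],
        [72, 880, 140, 902],
        [150, 880, 260, 902],
    ],
    # One BIO-style tag per word; label names here are illustrative.
    "ner_tags": ["O", "O", "B-INVOICE_NUMBER", "O", "B-TOTAL"],
}
```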