Fine-Tuning the OCR-Free Donut Model for Invoice Recognition and Comparing Its Performance to LayoutLM
Oct 26, 2022
Introduction
Intelligent document processing (IDP) is the ability to automatically understand the content and structure of documents. It is a critical capability for any organization that needs to process large volumes of documents, for example for customer service, claims processing, or compliance. However, IDP is not a trivial task: even for the most common document types, such as invoices or resumes, the variety of formats and layouts makes it very difficult for IDP software to accurately interpret the content.
Current document understanding models, such as LayoutLM, often require OCR processing to extract the text from documents before they can be processed. While OCR can be an effective way to extract text, it is not without its challenges. OCR accuracy can be impacted by factors such as the quality of the original document, the font used, and the clarity of the text. Furthermore, OCR is slow and computationally intensive, which adds another layer of complexity. This can make it difficult to achieve the high level of accuracy needed for IDP. To overcome these challenges, new approaches are needed that can accurately interpret documents without an OCR step.
Enter Donut, short for Document Understanding Transformer: an OCR-free transformer model that, according to the original paper, achieves state-of-the-art performance, beating even LayoutLM in terms of accuracy.
In this tutorial, we are going to fine-tune the Donut model for invoice extraction and compare its performance to the latest LayoutLM v3. Let’s get started!
For reference, below is the Google Colab script to fine-tune the Donut model:
Donut Architecture
So how is the model able to extract text and understand images without any OCR processing? The Donut architecture consists of a visual encoder and a text decoder. The visual encoder maps an input image x ∈ ℝ^(H×W×C) into a set of embeddings {zᵢ | zᵢ ∈ ℝ^d, 1 ≤ i ≤ n}, where n is the feature map size (the number of image patches) and d is the dimension of the encoder’s latent vectors. The authors use a Swin Transformer as the encoder because it showed the best performance in their initial study. The text decoder is a BART transformer model that maps the input features into a sequence of subword tokens.
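To make this concrete, here is a minimal sketch, assuming the Hugging Face transformers implementation of Donut and the authors’ pretrained naver-clova-ix/donut-base checkpoint, that loads the model and confirms the encoder/decoder split:

```python
from transformers import DonutProcessor, VisionEncoderDecoderModel

# Load the pretrained Donut checkpoint released by the paper's authors
processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")

# Donut is packaged as a vision-encoder/text-decoder pair:
# a Swin Transformer encoder and a (multilingual) BART decoder
print(type(model.encoder).__name__)  # DonutSwinModel
print(type(model.decoder).__name__)  # MBartForCausalLM
```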
Donut is trained with a teacher-forcing strategy, in which the ground-truth token sequence is fed to the decoder as input instead of the model’s own previous outputs. At inference, the model generates a sequence of tokens conditioned on a prompt that specifies the task we would like to perform, such as classification, question answering, or parsing. For example, if we want to extract the class of a document, we feed the image embeddings to the decoder along with a classification task prompt, and the model outputs a text sequence corresponding to the document type. If we are interested in question answering, we can input the question “what is the price of choco mochi” and the model will output the answer. The output sequence is then converted to a JSON file. For more information, refer to the original article.
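As a sketch of how this prompting works in practice (following the inference pattern from the Hugging Face documentation; the receipt-parsing checkpoint naver-clova-ix/donut-base-finetuned-cord-v2 and the local file receipt.png are assumptions for illustration):

```python
import re
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base-finetuned-cord-v2")

# Encode the document image into pixel values for the visual encoder
image = Image.open("receipt.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values

# The task prompt tells the decoder what to do; <s_cord-v2> triggers receipt parsing
task_prompt = "<s_cord-v2>"
decoder_input_ids = processor.tokenizer(
    task_prompt, add_special_tokens=False, return_tensors="pt"
).input_ids

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)

# Strip special tokens and the task prompt, then convert the sequence to JSON
sequence = processor.batch_decode(outputs)[0]
sequence = sequence.replace(processor.tokenizer.eos_token, "").replace(
    processor.tokenizer.pad_token, ""
)
sequence = re.sub(r"<.*?>", "", sequence, count=1).strip()
print(processor.token2json(sequence))
```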

Donut architecture. Source
Invoice Labeling
In this tutorial, we are going to fine-tune the model on 220 invoices that were labeled using the UBIAI Text Annotation tool, as in my previous articles on fine-tuning the LayoutLM models. Here is an example that shows the format of the labeled dataset exported from UBIAI.
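For orientation, Donut fine-tuning pipelines typically pair each invoice image with a ground-truth JSON parse, often stored one example per line in a metadata.jsonl file. The snippet below is a hypothetical illustration of that target format; the field names are invented for this example and do not reflect the actual UBIAI export schema:

```python
import json

# Hypothetical labeled invoice in Donut's training format (field names are
# illustrative, not the real UBIAI export schema): "gt_parse" holds the
# annotated key/value pairs the decoder learns to generate.
sample = {
    "file_name": "invoice_001.jpg",
    "ground_truth": json.dumps({
        "gt_parse": {
            "invoice_number": "INV-0042",
            "invoice_date": "10/26/2022",
            "seller": "Acme Corp",
            "total": "1,250.00",
        }
    }),
}

# Append one JSON line per labeled invoice
with open("metadata.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```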