Fine-Tuning the OCR-Free Donut Model for Invoice Recognition and Comparing Its Performance to LayoutLM

Oct 26, 2022


Intelligent document processing (IDP) is the ability to automatically understand the content and structure of documents. This is a critical capability for any organization that needs to process a large number of documents, such as for customer service, claims processing, or compliance. However, IDP is not a trivial task. Even for the most common document types, such as invoices or resumes, the variety of formats and layouts that exist can make it very difficult for IDP software to accurately interpret the content.


Current document understanding models, such as LayoutLM, often require OCR processing to extract the text from documents before they can be processed. While OCR can be an effective way to extract text, it is not without its challenges. OCR accuracy can be affected by factors such as the quality of the original document, the font used, and the clarity of the text. Furthermore, OCR is slow and computationally intensive, which adds another layer of complexity. This can make it difficult to achieve the high level of accuracy needed for IDP. To overcome these challenges, new approaches to IDP are needed that can accurately interpret documents without the need for OCR.

Enter Donut, which stands for Document Understanding Transformer: an OCR-free transformer model that, according to the original paper, achieves state-of-the-art performance and beats even the LayoutLM model in accuracy.


In this tutorial, we are going to fine-tune the new Donut model for invoice extraction and compare its performance to the latest LayoutLMv3. Let’s get started!


For reference, below is the Google Colab script to fine-tune the Donut model:

Donut Architecture

So how is the model able to extract text and understand images without requiring any OCR processing? The Donut architecture is based on a visual encoder and a text decoder. The visual encoder converts the input document image x ∈ R^(H×W×C) into a set of embeddings {z_i | z_i ∈ R^d, 1 ≤ i ≤ n}, where n is the feature map size (the number of image patches) and d is the dimension of the encoder's latent vectors. The authors used a Swin Transformer as the encoder because it showed the best performance in their initial study. The text decoder is a BART transformer model that maps the encoder embeddings to a sequence of subword tokens.
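
To make the role of n concrete, here is a minimal sketch of how the number of encoder embeddings scales with the input resolution. The total downsampling factor of 32 is an assumption (typical of a four-stage Swin Transformer with a patch size of 4) and is not stated in this article; the function name is hypothetical.

```python
def feature_map_size(height, width, downsample=32):
    """Number of encoder embeddings n for an H x W input image.

    `downsample` is the ASSUMED total reduction factor of the encoder.
    """
    if height % downsample or width % downsample:
        raise ValueError("height and width must be multiples of the downsampling factor")
    return (height // downsample) * (width // downsample)


# Input size used later in this tutorial's training config.
n = feature_map_size(1280, 960)
print(n)  # 40 * 30 = 1200 embeddings under the assumed factor
```

Under this assumption, halving both image dimensions divides n by four, which is why the input resolution has such a strong effect on memory use and speed.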


Donut is trained with teacher forcing, a strategy that feeds the ground-truth tokens to the decoder as input instead of the tokens the model generated at previous steps. The model generates a sequence of tokens based on a prompt that depends on the task we would like to perform, such as classification, question answering or parsing. For example, to extract the class of a document, we feed the image embeddings to the decoder along with the task type, and the model outputs a text sequence corresponding to the document type. For question answering, we input a question such as “what is the price of choco mochi” and the model outputs the answer. The output sequence is then converted to JSON. For more information, refer to the original article.
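
As a rough illustration of teacher forcing, the sketch below builds the decoder input for one training sequence by shifting the ground-truth tokens one position to the right. The token IDs and the helper name are hypothetical, and this is a generic sketch of the technique rather than Donut's actual training code.

```python
def teacher_forcing_pair(target_ids, start_id):
    """Return (decoder_inputs, labels) for one ground-truth sequence.

    At step t the decoder sees the ground-truth token from step t-1,
    never its own earlier prediction.
    """
    decoder_inputs = [start_id] + target_ids[:-1]
    labels = list(target_ids)
    return decoder_inputs, labels


# Hypothetical IDs: 0 = task prompt / start token, 2 = end-of-sequence.
inputs, labels = teacher_forcing_pair([11, 12, 13, 2], start_id=0)
print(inputs)  # [0, 11, 12, 13]
print(labels)  # [11, 12, 13, 2]
```

At inference time no ground truth is available, so the decoder instead consumes its own previous output, starting from the task prompt.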

LayoutLM vs OCR-free Donut Model

Donut architecture. Source

Invoice Labeling

In this tutorial, we are going to fine-tune the model on 220 invoices that were labeled using the UBIAI Text Annotation tool, as in my previous articles on fine-tuning the LayoutLM models. Here is an example that shows the format of the labeled dataset exported from UBIAI.

Image by Author: UBIAI OCR annotation feature

UBIAI supports OCR parsing, native PDF/image annotation and export in the right format. You can fine-tune the LayoutLM model right in the UBIAI platform and auto-label your data with it, which can save a lot of manual annotation time.

Fine-tuning Donut

The first step is to import the needed packages and clone the Donut repo from GitHub.

from PIL import Image
import torch
!git clone
!cd donut && pip install .
from donut import DonutModel
import json
import shutil

Next, we need to extract the labels and parse the image names from the JSON file exported from UBIAI. We set the paths of the labeled dataset and the processed folder (replace them with your own paths).

ubiai_data_folder = "/content/drive/MyDrive/Colab Notebooks/UBIAI_dataset"
ubiai_ocr_results = "/content/drive/MyDrive/Colab Notebooks/UBIAI_dataset/ocr.json"
processed_dataset_folder = "/content/drive/MyDrive/Colab Notebooks/UBIAI_dataset/processed_dataset"

with open(ubiai_ocr_results) as f:
  data = json.load(f)

# Extract labels from the JSON file
all_labels = list()
for j in data:
  all_labels += list(j['annotation'][cc]['label'] for cc in range(len(j['annotation'])))
all_labels = set(all_labels)
all_labels

# Setup image path
images_metadata = list()
images_path = list()

for obs in data:
  # Collect the text of the five entities we want to extract
  ground_truth = dict()
  for ann in obs['annotation']:
    if ann['label'].strip() in ['SELLER', 'DATE', 'TTC', 'INVOICE_NUMBERS', 'TVA']:
      ground_truth[ann['label'].strip()] = ann['text'].strip()
  try:
    # Reorder the keys; this raises KeyError if any entity is missing
    ground_truth = {key : ground_truth[key] for key in ['SELLER', 'DATE', 'TTC', 'INVOICE_NUMBERS', 'TVA']}
  except KeyError:
    # Assumption: invoices that lack one of the five entities are skipped
    continue
  images_metadata.append({"gt_parse": ground_truth})
  images_path.append(obs['images'][0]['name'].replace(':',''))

dataset_len = len(images_metadata)
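
To show what this extraction step produces, here is a small self-contained sketch run on a made-up record. The record layout (an `annotation` list of `label`/`text` pairs) mirrors the fields accessed in the script above, but the sample values and the `build_ground_truth` helper are hypothetical.

```python
KEYS = ['SELLER', 'DATE', 'TTC', 'INVOICE_NUMBERS', 'TVA']


def build_ground_truth(record, keys=KEYS):
    """Return {"gt_parse": {...}} for one record, or None if an entity is missing."""
    found = {}
    for ann in record['annotation']:
        label = ann['label'].strip()
        if label in keys:
            found[label] = ann['text'].strip()
    if any(key not in found for key in keys):
        return None
    # Keep the entities in a fixed order so every target sequence looks alike.
    return {"gt_parse": {key: found[key] for key in keys}}


# Hypothetical record for illustration only.
sample = {"annotation": [
    {"label": "TTC ", "text": " 108.62"},
    {"label": "SELLER", "text": "SAS CALIFRAIS"},
    {"label": "DATE", "text": "20/05/2019"},
    {"label": "INVOICE_NUMBERS", "text": "7133"},
    {"label": "TVA", "text": "5.66"},
    {"label": "OTHER", "text": "ignored"},
]}
print(build_ground_truth(sample))
```

Returning None for incomplete records makes the filtering explicit, so it is easy to count how many invoices are dropped before training.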

We split the data into training, test and validation sets. To do so, simply create three folders named train, test and validation. Within each folder, create an empty metadata.jsonl file and run the script below:

for i, gt_parse in enumerate(images_metadata):
  # train (first 80%)
  if i < round(dataset_len*0.8) :
    with open(processed_dataset_folder+"/train/metadata.jsonl", 'a') as f:
      line = {"file_name": images_path[i], "ground_truth": json.dumps(gt_parse, ensure_ascii=False)}
      # One JSON object per line (JSON Lines)
      f.write(json.dumps(line, ensure_ascii=False) + "\n")
      shutil.copyfile(ubiai_data_folder + '/' + images_path[i], processed_dataset_folder + "/train/" + images_path[i])
      if images_path[i] == "050320sasdoodahfev20_2021-09-24_0722.txt_image_0.jpg":
        print('train')  # debug message, assumed by analogy with the test branch
  # test (next 10%)
  if round(dataset_len*0.8) <= i < round(dataset_len*0.8) + round(dataset_len*0.1):
    with open(processed_dataset_folder+"/test/metadata.jsonl", 'a') as f:
      line = {"file_name": images_path[i], "ground_truth": json.dumps(gt_parse, ensure_ascii=False)}
      f.write(json.dumps(line, ensure_ascii=False) + "\n")
      shutil.copyfile(ubiai_data_folder + '/' + images_path[i], processed_dataset_folder + "/test/" + images_path[i])
      if images_path[i] == "050320sasdoodahfev20_2021-09-24_0722.txt_image_0.jpg":
        print('test')
  # validation (remaining examples)
  if round(dataset_len*0.8) + round(dataset_len*0.1) <= i < dataset_len:
    with open(processed_dataset_folder+"/validation/metadata.jsonl", 'a') as f:
      line = {"file_name": images_path[i], "ground_truth": json.dumps(gt_parse, ensure_ascii=False)}
      f.write(json.dumps(line, ensure_ascii=False) + "\n")
      shutil.copyfile(ubiai_data_folder + '/' + images_path[i], processed_dataset_folder + "/validation/" + images_path[i])
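
The split boundaries used above can be checked in isolation. The sketch below reproduces the same rounding logic with hypothetical helper names, and uses 220 examples purely as an illustration, since the actual `dataset_len` depends on how many invoices contain all five entities.

```python
def split_bounds(dataset_len, train_frac=0.8, test_frac=0.1):
    """Return (train_end, test_end); validation takes whatever remains."""
    train_end = round(dataset_len * train_frac)
    test_end = train_end + round(dataset_len * test_frac)
    return train_end, test_end


def split_name(i, dataset_len):
    """Name of the split that example index i falls into."""
    train_end, test_end = split_bounds(dataset_len)
    if i < train_end:
        return "train"
    if i < test_end:
        return "test"
    return "validation"


# Illustrative size only.
sizes = {"train": 0, "test": 0, "validation": 0}
for i in range(220):
    sizes[split_name(i, 220)] += 1
print(sizes)  # {'train': 176, 'test': 22, 'validation': 22}
```

Note that the split follows the original order of the export, so shuffling the examples first may be worth considering if similar invoices are stored next to each other.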

The script will convert our original annotations into JSON format containing the image path and the ground truth:

{"file_name": "156260522812_2021-10-26_195802.2.txt_image_0.jpg", "ground_truth": "{\"gt_parse\": {\"SELLER\": \"TJF\", \"DATE\": \"création-09/05/2019\", \"TTC\": \"73,50 €\", \"INVOICE_NUMBERS\": \"N° 2019/068\", \"TVA\": \"12,25 €\"}}"}
{"file_name": "156275474651_2021-10-26_195807.3.txt_image_0.jpg", "ground_truth": "{\"gt_parse\": {\"SELLER\": \"SAS CALIFRAIS\", \"DATE\": \"20/05/2019\", \"TTC\": \"108.62\", \"INVOICE_NUMBERS\": \"7133\", \"TVA\": \"5.66\"}}"}
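
Because `ground_truth` is itself a JSON string stored inside each JSON line, reading a record back takes two decoding steps. The sketch below round-trips the first example record through `json.dumps` and `json.loads` to show this; the variable names are illustrative.

```python
import json

gt_parse = {"gt_parse": {"SELLER": "TJF", "DATE": "création-09/05/2019",
                         "TTC": "73,50 €", "INVOICE_NUMBERS": "N° 2019/068",
                         "TVA": "12,25 €"}}

# Same construction as in the split script: the ground truth is serialized
# first, then embedded as a string in the outer record.
line = json.dumps(
    {"file_name": "156260522812_2021-10-26_195802.2.txt_image_0.jpg",
     "ground_truth": json.dumps(gt_parse, ensure_ascii=False)},
    ensure_ascii=False,
)

record = json.loads(line)                             # first decode: outer record
fields = json.loads(record["ground_truth"])["gt_parse"]  # second decode: labels
print(fields["SELLER"], fields["TTC"])
```

The inner quotes appear escaped (`\"`) in the file precisely because of this double encoding.
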
Next, go to the “/content/donut/config” folder, create a new file called “train.yaml” and copy the following config content into it (make sure to replace the dataset path with your own path):

result_path: "/content/drive/MyDrive/Colab Notebooks/UBIAI_dataset/processed_dataset/result"
pretrained_model_name_or_path: "naver-clova-ix/donut-base" # loading a pre-trained model (from the model hub or a path)
dataset_name_or_paths: ["/content/drive/MyDrive/Colab Notebooks/UBIAI_dataset/processed_dataset"] # loading datasets (from the model hub or a path)
sort_json_key: False # cord dataset is preprocessed, and publicly available at
train_batch_sizes: [1]
val_batch_sizes: [1]
input_size: [1280, 960] # when the input resolution differs from the pre-training setting, some weights will be newly initialized (but the model training would be okay)
max_length: 768
align_long_axis: False
num_nodes: 1
seed: 2022
lr: 3e-5
warmup_steps: 300 # 800/8*30/10, 10%
num_training_samples_per_epoch: 800
max_epochs: 50
max_steps: -1
num_workers: 8
val_check_interval: 1.0
check_val_every_n_epoch: 3
gradient_clip_val: 1.0
verbose: True

Note that you can update the hyperparameters based on your own use case.
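
One hyperparameter worth double-checking is `warmup_steps`. The comment next to it (`800/8*30/10, 10%`) can be read as samples per epoch divided by batch size, times the number of epochs, divided by 10. That reading is an assumption, as is the idea that warmup should be about 10% of all training steps. Under it, a minimal sketch of the arithmetic with hypothetical helper names looks like this:

```python
def total_steps(samples_per_epoch, batch_size, max_epochs):
    """Total number of training steps on a single device (assumed formula)."""
    return samples_per_epoch // batch_size * max_epochs


def warmup_for(samples_per_epoch, batch_size, max_epochs, fraction=0.1):
    """Warmup steps as a fraction of all training steps."""
    return round(total_steps(samples_per_epoch, batch_size, max_epochs) * fraction)


# Values from the comment: 800 samples, batch size 8, 30 epochs.
print(warmup_for(800, 8, 30))   # 300, matching warmup_steps above
# Values actually set in this config: batch size 1, 50 epochs.
print(warmup_for(800, 1, 50))   # 4000
```

If this reading is right, a warmup of 300 steps is well below 10% of the schedule defined by this config, which may or may not matter in practice; treat it as one more value to experiment with.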

We are finally ready to train the model. Simply run the command below:

!cd donut && python --config config/train.yaml
Image by Author: Donut Model Training

The model training will take about 1.5 hours using Google Colab with a GPU enabled.

To get the model performance, we test the model on a test dataset and compare its prediction to the ground truth:

import glob

# Load the ground truth of the test split (one JSON object per line)
with open(processed_dataset_folder + '/test/metadata.jsonl') as f:
  result = [json.loads(jline) for jline in f]

test_images = glob.glob(processed_dataset_folder+'/test/*.jpg')
# Number of exact matches per entity
acc_dict = {'SELLER' : 0, 'DATE' : 0, 'TTC' : 0, 'INVOICE_NUMBERS' : 0, 'TVA' : 0}

# my_model is assumed to be the fine-tuned DonutModel, loaded beforehand
for path in test_images:
  image = Image.open(path).convert("RGB")
  donut_result = my_model.inference(image=image, prompt="<s_ubiai-donut>")
  returned_labels = donut_result['predictions'][0].keys()
  # Find the ground truth that belongs to this image
  for i in result:
    if i['file_name'] == path[path.index('/test/')+6:]:
      truth = json.loads(i['ground_truth'])['gt_parse']
      break
  # Count a hit only when the predicted text matches the label exactly
  for l in [x for x in returned_labels if x in ['SELLER', 'DATE', 'TTC', 'INVOICE_NUMBERS', 'TVA']]:
    if donut_result['predictions'][0][l] == truth[l]:
      acc_dict[l] +=1
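
The loop above only counts exact matches per entity. One way to turn those counts into percentages is sketched below. The helper name and the sample counts are hypothetical, and dividing by the number of test images (so that a missing prediction counts as an error) is an assumption about how the scores were computed.

```python
def per_entity_accuracy(correct_counts, n_documents):
    """Convert exact-match counts into percentages of the test documents."""
    if n_documents <= 0:
        raise ValueError("n_documents must be positive")
    return {label: 100.0 * count / n_documents
            for label, count in correct_counts.items()}


# Hypothetical counts for illustration, not the results of this tutorial.
example_counts = {'SELLER': 0, 'DATE': 1, 'TTC': 3}
print(per_entity_accuracy(example_counts, n_documents=4))
```

In the script above, this would be called as `per_entity_accuracy(acc_dict, len(test_images))`.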


Here is the score per entity:

SELLER: 0%, DATE: 47%, TTC: 74%, INVOICE_NUMBERS: 53%, TVA: 63%

Although there were enough examples (274), the SELLER entity had a score of 0%. The remaining entities scored higher, but still only between 47% and 74%. Now let’s try running the model on a new invoice that wasn’t part of the training dataset.

Image by Author: Test invoice

The model predictions are:

'DATE': '31/01/2017',
'TTC': '$1,455.00',
'TVA': '$35.00'

The model had trouble extracting the seller name and invoice number. It correctly recognized the total price (TTC) and the date, but mislabeled the tax (TVA). Although the model’s performance is relatively low, we can try some hyperparameter tuning and/or label more data to improve it.

Donut vs LayoutLM

The Donut model has several advantages over its counterpart LayoutLM, such as lower computational cost, shorter processing time, and fewer errors caused by OCR. But how does the performance compare? According to the original paper, the Donut model performs better than LayoutLM on the CORD dataset.

Model performance score comparison


However, we haven’t noticed a performance increase when using our own labeled dataset. If anything, LayoutLM has been able to capture more entities such as the seller name and invoice number. This discrepancy could be due to the fact that we haven’t done any hyperparameter tuning. Alternatively, it is possible that Donut requires more labelled data to achieve good performance.


In this tutorial, we have focused on data extraction, but the Donut model is also capable of document classification, document question answering and synthetic data generation, so we have only scratched the surface. The OCR-free model offers many advantages, such as faster processing, lower complexity, and less error propagation from low-quality OCR.

As a next step, we can improve the model performance by performing hyperparameter tuning and labeling more data.

If you are interested in labeling your own training dataset, don’t hesitate to try out the UBIAI OCR annotation feature here for free.