

Fine Tuning LayoutLMv3 for Document Classification

Dec 22nd 2023

This article is your go-to guide for learning how to fine-tune the LayoutLMv3 model on new, unseen data. It’s a hands-on project with step-by-step instructions. Specifically, we’ll cover:

 

  • LayoutLMv3
  • Fine Tuning LayoutLMv3
  • Set-Up
  • Financial Documents Clustering Dataset
  • Optical Character Recognition
  • Pre-processing for fine-tuning LayoutLMv3
  • Model
  • Training
  • Evaluation & Inference

LayoutLMv3

LayoutLMv3 is a state-of-the-art pre-trained language model from Microsoft Research Asia, built on the transformer architecture and tailored to document analysis tasks that require an understanding of both textual and layout information, such as document classification, information extraction, and question answering. It is trained on large amounts of annotated document images and text, which allows it to recognize and encode both the textual content and the visual layout of a document. This combination yields strong performance across a wide range of document analysis tasks.

The versatility of LayoutLMv3 extends to applications like document classification, named entity recognition, and question answering.
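To make the text-plus-layout input concrete, here is a minimal sketch of how the model is driven through the Hugging Face transformers API. It assumes transformers and pytesseract are installed and that an example image file exists; later in this tutorial we supply our own OCR results instead of relying on the processor's built-in OCR.

from PIL import Image
from transformers import LayoutLMv3Processor, LayoutLMv3ForSequenceClassification

# apply_ocr=True lets the processor run OCR itself (requires pytesseract + tesseract)
processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=True)
# The classification head is randomly initialized here; fine-tuning comes later
model = LayoutLMv3ForSequenceClassification.from_pretrained("microsoft/layoutlmv3-base", num_labels=2)

image = Image.open("example_document.jpg").convert("RGB")  # hypothetical example image

# The encoding bundles text tokens (input_ids), token bounding boxes (bbox)
# and the resized page image (pixel_values) -- the three inputs LayoutLMv3 consumes
encoding = processor(image, return_tensors="pt")
outputs = model(**encoding)
print(outputs.logits.shape)  # torch.Size([1, 2])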


Fine Tuning LayoutLMv3

In this section, we will fine-tune the LayoutLMv3 model on the Financial Documents Clustering dataset.

Step 1: Set-Up

Initially, we install wkhtmltopdf in Colab; it provides the rendering engine that imgkit uses to convert HTML into images:

				
# Download the package
!wget -q https://github.com/wkhtmltopdf/packaging/releases/download/0.12.6-1/wkhtmltox_0.12.6-1.bionic_amd64.deb


# Install the package
!dpkg -i wkhtmltox_0.12.6-1.bionic_amd64.deb


# Fix dependencies
!apt-get -f install -y

				
			

Next, we will proceed with the installation of the required libraries for this tutorial:

				
!pip install -qqq transformers==4.27.2 --progress-bar off
!pip install -qqq pytorch-lightning==1.9.4 --progress-bar off
!pip install -qqq torchmetrics==0.11.4 --progress-bar off
!pip install -qqq imgkit==1.2.3 --progress-bar off
!pip install -qqq easyocr==1.6.2 --progress-bar off
!pip install -qqq Pillow==9.4.0 --progress-bar off
!pip install -qqq tensorboardX==2.5.1 --progress-bar off
!pip install -qqq huggingface_hub==0.11.1 --progress-bar off
!pip install -qqq --upgrade --no-cache-dir gdown

				
			

And once they're installed, let's import them and set the seed to 42:

				
from transformers import LayoutLMv3FeatureExtractor, LayoutLMv3TokenizerFast, LayoutLMv3Processor, LayoutLMv3ForSequenceClassification
from tqdm import tqdm
import torch
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from PIL import Image, ImageDraw, ImageFont
import numpy as np
from sklearn.model_selection import train_test_split
import imgkit
import easyocr
import torchvision.transforms as T
from pathlib import Path
import matplotlib.pyplot as plt
import os
import cv2
from typing import List
import json
from torchmetrics import Accuracy
from huggingface_hub import notebook_login
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
 
%matplotlib inline
pl.seed_everything(42)

				
			

Step 2: Loading the Dataset

  • The Financial Documents Clustering dataset:

This dataset, sourced from Kaggle and named “Financial Documents Clustering,” consists of HTML documents, specifically tables extracted from publicly accessible financial annual reports of Hexaware Technologies.

  • Loading the Dataset:

Now let’s load the dataset into our working notebook. 

				
!gdown 1tMZXonmajLPK9zhZ2dt-CdzRTs5YfHy0
!unzip -q financial-documents.zip
!mv "TableClassifierQuaterlyWithNotes" "documents"
				
			

Since the original documents are HTML files, they need to be converted into images. But first, let's normalize the class directory names (lowercase, with underscores instead of spaces):

				
for dir in Path("documents").glob("*"):
  dir.rename(str(dir).lower().replace(" ", "_"))
 
list(Path("documents").glob("*"))

				
			

Then, let's create an output image directory for each document class:

				
for dir in Path("documents").glob("*"):
    image_dir = Path(f"images/{dir.name}")
    image_dir.mkdir(exist_ok=True, parents=True)

				
			

Once the directories are created, let's convert the HTML files to images and save them in the images directory.

				
def convert_html_to_image(file_path: Path, images_dir: Path, scale: float = 1.0) -> Path:
    # Render the HTML file to a JPG under the matching class sub-directory
    file_name = file_path.with_suffix(".jpg").name
    save_path = images_dir / file_path.parent.name / f"{file_name}"
    imgkit.from_file(str(file_path), str(save_path), options={'quiet': '', 'format': 'jpeg'})

    # Optionally downscale the rendered image before saving it back
    image = Image.open(save_path)
    width, height = image.size
    image = image.resize((int(width * scale), int(height * scale)))
    image.save(str(save_path))

    return save_path

				
			

If you're running the notebook on Google Colab and skipped the wkhtmltopdf installation above, install it first, then run the conversion over all documents:

				
!apt-get update
!apt-get install -y wkhtmltopdf
document_paths = list(Path("documents").glob("*/*"))
 
for doc_path in tqdm(document_paths):
    convert_html_to_image(doc_path, Path("images"), scale=0.8)

				
			

After conversion, each document is a JPG rendering of the original HTML table.


Step 3: Optical Character Recognition

Utilizing the easyocr library, which we installed earlier, we can perform OCR on each image in our dataset.
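The snippets that follow work with an image_paths list holding every converted JPG; based on the directory layout created above, it can be built like this:

image_paths = sorted(list(Path("images").glob("*/*.jpg")))

With that in place, the following example demonstrates how easyocr extracts text from a single image: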

				
reader = easyocr.Reader(['en'])
image_path = image_paths[4]
ocr_result = reader.readtext(str(image_path))
ocr_result

				
			

Each entry returned by reader.readtext() is a tuple containing the detected region's four corner points, the recognized text, and a confidence score.

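The loop below reduces each set of corner points to a single rectangle via a create_bounding_box() helper; here is a minimal sketch of it, assuming easyocr's default output of four (x, y) corner points per detection:

def create_bounding_box(bbox_data):
    # bbox_data holds four (x, y) corner points; reduce them to [x_min, y_min, x_max, y_max]
    xs = [int(point[0]) for point in bbox_data]
    ys = [int(point[1]) for point in bbox_data]
    return [min(xs), min(ys), max(xs), max(ys)]

Using this helper, we run OCR over every image and save each page's words and bounding boxes to a JSON file next to the image: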
				
for image_path in tqdm(image_paths):
    ocr_result = reader.readtext(str(image_path), batch_size=16)
 
    ocr_page = []
    for bbox, word, confidence in ocr_result:
        ocr_page.append({
            "word": word, "bounding_box": create_bounding_box(bbox)
        })
 
    with image_path.with_suffix(".json").open("w") as f:
        json.dump(ocr_page, f)

				
			

With the OCR results saved as JSON, we can now assemble the LayoutLMv3 processor. We pass apply_ocr=False to the feature extractor because we supply our own words and bounding boxes:

				
feature_extractor = LayoutLMv3FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv3TokenizerFast.from_pretrained(
    "microsoft/layoutlmv3-base"
)
processor = LayoutLMv3Processor(feature_extractor, tokenizer)

				
			

Now, let's apply the processor to a sample document. LayoutLMv3 requires bounding boxes to be normalized to a 0-1000 scale, so we first compute the image's width and height scale factors:

				
image_path = image_paths[0]
image = Image.open(image_path).convert("RGB")
width, height = image.size
 
width_scale = 1000 / width
height_scale = 1000 / height

				
			

Next, we'll load the saved OCR results and scale each word's bounding box:

				
def scale_bounding_box(box: List[int], width_scale: float = 1.0, height_scale: float = 1.0) -> List[int]:
    return [
        int(box[0] * width_scale),
        int(box[1] * height_scale),
        int(box[2] * width_scale),
        int(box[3] * height_scale)
    ]
 
json_path = image_path.with_suffix(".json")
with json_path.open("r") as f:
    ocr_result = json.load(f)
 
words = []
boxes = []
for row in ocr_result:
    boxes.append(scale_bounding_box(row["bounding_box"], width_scale, height_scale))
    words.append(row["word"])
 
len(words), len(boxes)
				
			

We’ve defined the scale_bounding_box() function to apply the image scale to each bounding box. Subsequently, we iterate over each row of the OCR results stored in ocr_result, extracting the bounding box coordinates and word text for each recognized text region, and scaling the bounding box coordinates using the scale_bounding_box() function.

				
encoding = processor(
    image,
    words,
    boxes=boxes,
    max_length=512,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
 
print(f"""
input_ids:  {list(encoding["input_ids"].squeeze().shape)}
word boxes: {list(encoding["bbox"].squeeze().shape)}
image data: {list(encoding["pixel_values"].squeeze().shape)}
image size: {image.size}
""")

				
			

For instance, the input_ids are of shape [512], word boxes are of shape [512, 4], and image data is of shape [3, 224, 224]. Lastly, we visualize the encoded image using a transformation from torchvision:

				
image_data = encoding["pixel_values"][0]
transform = T.ToPILImage()
transform(image_data)

				
			

This encoded image is represented as a 3-dimensional array of shape (channels, height, width). The tensor is then converted to a PIL image object using the provided transformation. 

Step 4: Model

Let’s start by creating an instance of the model

				
model = LayoutLMv3ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=2
)  # num_labels=2 means binary classification; the Lightning module below passes the actual number of document classes

				
			

Step 5: Training

We start by preparing our training data

				
train_images, test_images = train_test_split(image_paths, test_size=.2)
DOCUMENT_CLASSES = sorted(list(map(
    lambda p: p.name,
    Path("images").glob("*")
)))
DOCUMENT_CLASSES


				
			

We split the document images into training and testing subsets, then derive the document classes from the names of the directories that contain the images. This gives us a mapping between each document image and its class.

 

With these preparatory steps completed, we have everything we need to build a PyTorch Dataset that pairs each document image with its class and will serve as the foundation for training and evaluating our model.

				
class DocumentClassificationDataset(Dataset):
 
    def __init__(self, image_paths, processor):
        self.image_paths = image_paths
        self.processor = processor
 
    def __len__(self):
        return len(self.image_paths)
 
    def __getitem__(self, item):
 
        image_path = self.image_paths[item]
        json_path = image_path.with_suffix(".json")
        with json_path.open("r") as f:
            ocr_result = json.load(f)
 
            with Image.open(image_path).convert("RGB") as image:
 
                width, height = image.size
                width_scale = 1000 / width
                height_scale = 1000 / height
 
                words = []
                boxes = []
                for row in ocr_result:
                    boxes.append(scale_bounding_box(
                        row["bounding_box"],
                        width_scale,
                        height_scale
                    ))
                    words.append(row["word"])
 
                encoding = self.processor(
                    image,
                    words,
                    boxes=boxes,
                    max_length=512,
                    padding="max_length",
                    truncation=True,
                    return_tensors="pt"
                )
 
        label = DOCUMENT_CLASSES.index(image_path.parent.name)
 
        return dict(
            input_ids=encoding["input_ids"].flatten(),
            attention_mask=encoding["attention_mask"].flatten(),
            bbox=encoding["bbox"].flatten(end_dim=1),
            pixel_values=encoding["pixel_values"].flatten(end_dim=1),
            labels=torch.tensor(label, dtype=torch.long)
        )
				
			

We can now create datasets and data loaders for the train and test documents:

				
train_dataset = DocumentClassificationDataset(train_images, processor)
test_dataset = DocumentClassificationDataset(test_images, processor)
 
train_data_loader = DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2
)
 
test_data_loader = DataLoader(
    test_dataset,
    batch_size=8,
    shuffle=False,
    num_workers=2
)

				
			

Now let's wrap all the components in a PyTorch Lightning module.

				
class ModelModule(pl.LightningModule):
    def __init__(self, n_classes:int):
        super().__init__()
        self.model = LayoutLMv3ForSequenceClassification.from_pretrained(
            "microsoft/layoutlmv3-base",
            num_labels=n_classes
        )
        self.model.config.id2label = {k: v for k, v in enumerate(DOCUMENT_CLASSES)}
        self.model.config.label2id = {v: k for k, v in enumerate(DOCUMENT_CLASSES)}
        self.train_accuracy = Accuracy(task="multiclass", num_classes=n_classes)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=n_classes)
 
    def forward(self, input_ids, attention_mask, bbox, pixel_values, labels=None):
        return self.model(
            input_ids,
            attention_mask=attention_mask,
            bbox=bbox,
            pixel_values=pixel_values,
            labels=labels
        )
 
    def training_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        bbox = batch["bbox"]
        pixel_values = batch["pixel_values"]
        labels = batch["labels"]
        output = self(input_ids, attention_mask, bbox, pixel_values, labels)
        self.log("train_loss", output.loss)
        self.log(
            "train_acc",
            self.train_accuracy(output.logits, labels),
            on_step=True,
            on_epoch=True
        )
        return output.loss
 
    def validation_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        bbox = batch["bbox"]
        pixel_values = batch["pixel_values"]
        labels = batch["labels"]
        output = self(input_ids, attention_mask, bbox, pixel_values, labels)
        self.log("val_loss", output.loss)
        self.log(
            "val_acc",
            self.val_accuracy(output.logits, labels),
            on_step=False,
            on_epoch=True
        )
        return output.loss
 
    def configure_optimizers(self):
        optimizer = torch.optim.Adam(self.model.parameters(), lr=0.00001) #1e-5
        return optimizer

				
			

Now we create a training instance of ModelModule:

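Based on the ModelModule class defined above, the instantiation is a one-liner; the model_module name matches what we pass to trainer.fit() below:

model_module = ModelModule(n_classes=len(DOCUMENT_CLASSES))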

To monitor and visualize the training progress, we incorporate Tensorboard into our workflow. By loading the Tensorboard extension and specifying the log directory, we enable real-time tracking of metrics and visualizations:

				
%load_ext tensorboard
%tensorboard --logdir lightning_logs

				
			

Additionally, we configure the PyTorch Lightning Trainer to manage the training process. A ModelCheckpoint callback keeps the three best checkpoints by validation loss (plus the last one), using a filename that encodes the epoch number, training step, and validation loss. The Trainer is set up to use a single GPU, mixed-precision (16-bit) training, and a total of 5 epochs:



				
model_checkpoint = ModelCheckpoint(
    filename="{epoch}-{step}-{val_loss:.4f}", save_last=True, save_top_k=3, monitor="val_loss", mode="min"
)


trainer = pl.Trainer(
    accelerator="gpu",
    precision=16,
    devices=1,
    max_epochs=5,
    callbacks=[
        model_checkpoint
    ],
)



				
			

The subsequent training phase is initiated using the trainer.fit() method, specifying the model module, along with the training and testing data loaders:

				
trainer.fit(model_module, train_data_loader, test_data_loader)

				
			

Step 6: Evaluation & Inference

To assess the model's performance, we first load the best checkpoint and push the trained model to the Hugging Face Hub:

				
# Load the best checkpoint for evaluation
trained_model = ModelModule.load_from_checkpoint(
    model_checkpoint.best_model_path,
    n_classes=len(DOCUMENT_CLASSES)
)
 
notebook_login()
 
trained_model.model.push_to_hub(
    "layoutlmv3-financial-document-classification"
)

				
			

Once uploaded, we can download the model by its full Hub ID (username/model-name) and load it for inference; if you pushed the model to your own account, replace the namespace below with your username:

				
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
 
model = LayoutLMv3ForSequenceClassification.from_pretrained(
    "curiousily/layoutlmv3-financial-document-classification"
)
model = model.eval().to(DEVICE)

				
			

Next, we define a function for inference on a single document image:

				
def predict_document_image(
    image_path: Path,
    model: LayoutLMv3ForSequenceClassification,
    processor: LayoutLMv3Processor):
 
    json_path = image_path.with_suffix(".json")
    with json_path.open("r") as f:
        ocr_result = json.load(f)
 
        with Image.open(image_path).convert("RGB") as image:
 
            width, height = image.size
            width_scale = 1000 / width
            height_scale = 1000 / height
 
            words = []
            boxes = []
            for row in ocr_result:
                boxes.append(
                    scale_bounding_box(
                        row["bounding_box"],
                        width_scale,
                        height_scale
                    )
                )
                words.append(row["word"])
 
            encoding = processor(
                image,
                words,
                boxes=boxes,
                max_length=512,
                padding="max_length",
                truncation=True,
                return_tensors="pt"
            )
 
    with torch.inference_mode():
        output = model(
            input_ids=encoding["input_ids"].to(DEVICE),
            attention_mask=encoding["attention_mask"].to(DEVICE),
            bbox=encoding["bbox"].to(DEVICE),
            pixel_values=encoding["pixel_values"].to(DEVICE)
        )
 
    predicted_class = output.logits.argmax()
    return model.config.id2label[predicted_class.item()]

				
			

Finally, we execute the function on all test documents and use a confusion matrix for a more comprehensive evaluation:

				
labels = []
predictions = []
for image_path in tqdm(test_images):
    labels.append(image_path.parent.name)
    predictions.append(
        predict_document_image(image_path, model, processor)
    )

				
			

Now all we have to do is compare the documents' correct labels with the labels the fine-tuned model predicted.

				
print("Correct labels are ", labels)
print("Predicted labels are ", predictions)

				
			
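To get the more comprehensive view mentioned above, we can plot a confusion matrix with the confusion_matrix and ConfusionMatrixDisplay utilities imported earlier; a minimal sketch using the labels and predictions collected above:

cm = confusion_matrix(labels, predictions, labels=DOCUMENT_CLASSES)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=DOCUMENT_CLASSES)
disp.plot(xticks_rotation=45)
plt.show()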

Conclusion

In conclusion, the process of fine-tuning LayoutLMv3 emerges as a pivotal strategy in unlocking the full potential of this state-of-the-art language model. With its unique ability to seamlessly integrate text and layout information, LayoutLMv3 stands at the forefront of document analysis tasks, ranging from document classification to other downstream tasks.
