This article is your go-to guide for learning how to fine-tune the LayoutLMv3 model on new, unseen data. It’s a hands-on project with step-by-step instructions. Specifically, we’ll cover an overview of LayoutLMv3, preparing a financial document dataset (converting HTML documents to images and running OCR), fine-tuning the model with PyTorch Lightning, and evaluating the fine-tuned classifier.
LayoutLMv3 is a cutting-edge pre-trained language model developed by Microsoft Research Asia. Built on the transformer architecture, it is tailored to document analysis tasks that require an understanding of both textual and layout information, such as document classification, information extraction, and question answering. The model is trained on large amounts of annotated document images and text, which enables it to recognize and encode both textual content and visual document layout, resulting in strong performance across a wide range of document analysis tasks.
The versatility of LayoutLMv3 extends to applications such as document classification, named entity recognition, and question answering.
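As a quick illustration, the transformers library exposes task-specific heads on top of the same LayoutLMv3 backbone. The snippet below is a minimal sketch and not part of the fine-tuning pipeline that follows; the num_labels values are placeholders:
from transformers import (
    LayoutLMv3ForSequenceClassification,  # document classification
    LayoutLMv3ForTokenClassification,     # named entity recognition on document tokens
    LayoutLMv3ForQuestionAnswering,       # document question answering
)
# Each head reuses the same pre-trained backbone; num_labels here is a placeholder
doc_classifier = LayoutLMv3ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=5
)
ner_model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=7
)
qa_model = LayoutLMv3ForQuestionAnswering.from_pretrained("microsoft/layoutlmv3-base")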
In this section, we will fine-tune the LayoutLMv3 model on a financial document classification dataset.
First, we install wkhtmltopdf in Colab; we will need it later to render the HTML documents as images.
# Download the package
!wget -q https://github.com/wkhtmltopdf/packaging/releases/download/0.12.6-1/wkhtmltox_0.12.6-1.bionic_amd64.deb
# Install the package
!dpkg -i wkhtmltox_0.12.6-1.bionic_amd64.deb
# Fix dependencies
!apt-get -f install -y
Next, we will proceed with the installation of the required libraries for this tutorial:
!pip install -qqq transformers==4.27.2 --progress-bar off
!pip install -qqq pytorch-lightning==1.9.4 --progress-bar off
!pip install -qqq torchmetrics==0.11.4 --progress-bar off
!pip install -qqq imgkit==1.2.3 --progress-bar off
!pip install -qqq easyocr==1.6.2 --progress-bar off
!pip install -qqq Pillow==9.4.0 --progress-bar off
!pip install -qqq tensorboardX==2.5.1 --progress-bar off
!pip install -qqq huggingface_hub==0.11.1 --progress-bar off
!pip install -qqq --upgrade --no-cache-dir gdown
And once they’re installed let’s import them and set the seed to 42:
from transformers import LayoutLMv3FeatureExtractor, LayoutLMv3TokenizerFast, LayoutLMv3Processor, LayoutLMv3ForSequenceClassification
from tqdm import tqdm
import torch
from torch.utils.data import Dataset, DataLoader
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from PIL import Image, ImageDraw, ImageFont
import numpy as np
from sklearn.model_selection import train_test_split
import imgkit
import easyocr
import torchvision.transforms as T
from pathlib import Path
import matplotlib.pyplot as plt
import os
import cv2
from typing import List
import json
from torchmetrics import Accuracy
from huggingface_hub import notebook_login
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
%matplotlib inline
pl.seed_everything(42)
This dataset, sourced from Kaggle and named “Financial Documents Clustering,” consists of HTML documents, specifically tables extracted from publicly accessible financial annual reports of Hexaware Technologies.
Loading the Dataset:
Now let’s load the dataset into our working notebook.
!gdown 1tMZXonmajLPK9zhZ2dt-CdzRTs5YfHy0
!unzip -q financial-documents.zip
!mv "TableClassifierQuaterlyWithNotes" "documents"
Since the original documents in the dataset are HTML files, a conversion step is required: we will convert the HTML documents into images. But first, let’s normalize the names of the class subdirectories (lowercase, with underscores instead of spaces).
# Normalize class directory names: lowercase, underscores instead of spaces
for class_dir in Path("documents").glob("*"):
    class_dir.rename(str(class_dir).lower().replace(" ", "_"))

list(Path("documents").glob("*"))
Then, let’s create an output directory for the images of each class.
for class_dir in Path("documents").glob("*"):
    image_dir = Path(f"images/{class_dir.name}")
    image_dir.mkdir(exist_ok=True, parents=True)
Once the directory is created let’s convert the HTML files to images and save them in the images directory.
def convert_html_to_image(file_path: Path, images_dir: Path, scale: float = 1.0) -> Path:
    file_name = file_path.with_suffix(".jpg").name
    save_path = images_dir / file_path.parent.name / file_name
    # Render the HTML document to a JPEG image with wkhtmltoimage (via imgkit)
    imgkit.from_file(str(file_path), str(save_path), options={"quiet": "", "format": "jpeg"})
    # Optionally downscale the rendered image
    image = Image.open(save_path)
    width, height = image.size
    image = image.resize((int(width * scale), int(height * scale)))
    image.save(str(save_path))
    return save_path
If you are running the code on Google Colab and skipped the wkhtmltopdf installation at the beginning of the tutorial, you can also install it via apt:
!apt-get update
!apt-get install -y wkhtmltopdf
document_paths = list(Path("documents").glob("*/*"))
for doc_path in tqdm(document_paths):
    convert_html_to_image(doc_path, Path("images"), scale=0.8)
An example of an image after the conversion from HTML to JPEG is shown here:
Utilizing the easyocr library, which we installed previously, we can perform OCR on each image within our dataset.
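First, we collect the paths of the converted images. This is a minimal sketch; the glob pattern assumes the JPEG files were saved under images/<class_name>/ by the conversion step above:
# Gather all converted document images (one subdirectory per class)
image_paths = sorted(Path("images").glob("*/*.jpg"))
With the image paths in hand, the following example demonstrates how to use easyocr to extract text and bounding boxes from a single image: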
# Initialize an easyocr reader for English text
reader = easyocr.Reader(['en'])
image_path = image_paths[4]
ocr_result = reader.readtext(str(image_path))
ocr_result
Having seen the raw OCR output, we can run easyocr over every image in the dataset and save the detected words and bounding boxes to a JSON file next to each image. The loop below relies on a small create_bounding_box() helper that collapses the four corner points returned by easyocr into [x1, y1, x2, y2] coordinates.
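Here is a minimal sketch of such a helper, assuming each easyocr bounding box is a list of four (x, y) corner points:
def create_bounding_box(bbox_data: List[List[float]]) -> List[int]:
    # Reduce easyocr's four (x, y) corner points to [x_min, y_min, x_max, y_max]
    xs = [int(point[0]) for point in bbox_data]
    ys = [int(point[1]) for point in bbox_data]
    return [min(xs), min(ys), max(xs), max(ys)]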
for image_path in tqdm(image_paths):
    ocr_result = reader.readtext(str(image_path), batch_size=16)
    ocr_page = []
    for bbox, word, confidence in ocr_result:
        ocr_page.append({
            "word": word, "bounding_box": create_bounding_box(bbox)
        })
    # Save the OCR result next to the image, e.g. images/<class>/document.json
    with image_path.with_suffix(".json").open("w") as f:
        json.dump(ocr_page, f)
With the OCR results saved to disk, we can set up the LayoutLMv3 processor. It combines a feature extractor for the document images with a fast tokenizer for the words; since we supply our own OCR results, we disable the processor’s built-in OCR with apply_ocr=False:
feature_extractor = LayoutLMv3FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv3TokenizerFast.from_pretrained(
    "microsoft/layoutlmv3-base"
)
processor = LayoutLMv3Processor(feature_extractor, tokenizer)
Now, let’s apply the processor to a sample document. LayoutLMv3 requires bounding boxes to be normalized to a 0-1000 scale, so we first compute the width and height scaling factors for the image:
image_path = image_paths[0]
image = Image.open(image_path).convert("RGB")
width, height = image.size
width_scale = 1000 / width
height_scale = 1000 / height
Next, we load the saved OCR results and extract the words along with their scaled bounding boxes:
def scale_bounding_box(box: List[int], width_scale: float = 1.0, height_scale: float = 1.0) -> List[int]:
    return [
        int(box[0] * width_scale),
        int(box[1] * height_scale),
        int(box[2] * width_scale),
        int(box[3] * height_scale)
    ]
json_path = image_path.with_suffix(".json")
with json_path.open("r") as f:
    ocr_result = json.load(f)

words = []
boxes = []
for row in ocr_result:
    boxes.append(scale_bounding_box(row["bounding_box"], width_scale, height_scale))
    words.append(row["word"])

len(words), len(boxes)
We’ve defined the scale_bounding_box() function to apply the image scale to each bounding box. Subsequently, we iterate over each row of the OCR results stored in ocr_result, extracting the bounding box coordinates and word text for each recognized text region, and scaling the bounding box coordinates using the scale_bounding_box() function.
encoding = processor(
    image,
    words,
    boxes=boxes,
    max_length=512,
    padding="max_length",
    truncation=True,
    return_tensors="pt"
)
print(f"""
input_ids: {list(encoding["input_ids"].squeeze().shape)}
word boxes: {list(encoding["bbox"].squeeze().shape)}
image data: {list(encoding["pixel_values"].squeeze().shape)}
image size: {image.size}
""")
For instance, the input_ids are of shape [512], word boxes are of shape [512, 4], and image data is of shape [3, 224, 224]. Lastly, we visualize the encoded image using a transformation from torchvision:
image_data = encoding["pixel_values"][0]
transform = T.ToPILImage()
transform(image_data)
This encoded image is represented as a 3-dimensional array of shape (channels, height, width). The tensor is then converted to a PIL image object using the provided transformation.
Let’s start by creating an instance of the model:
model = LayoutLMv3ForSequenceClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=2
)  # num_labels=2 configures the head for binary classification; our training module below passes the actual number of document classes
We start by preparing our training data:
train_images, test_images = train_test_split(image_paths, test_size=0.2)

DOCUMENT_CLASSES = sorted(list(map(
    lambda p: p.name,
    Path("images").glob("*")
)))
DOCUMENT_CLASSES
We partition the document images into training and testing subsets. We then derive the document classes from the names of the directories containing the images, which gives us a mapping between each document image and its class.
With these preparatory steps completed, we have everything we need to construct a PyTorch Dataset. This dataset will serve as the foundation for training and evaluating our model, pairing each document image with its class.
class DocumentClassificationDataset(Dataset):
    def __init__(self, image_paths, processor):
        self.image_paths = image_paths
        self.processor = processor

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, item):
        image_path = self.image_paths[item]
        # Load the OCR result saved next to the image
        json_path = image_path.with_suffix(".json")
        with json_path.open("r") as f:
            ocr_result = json.load(f)

        with Image.open(image_path).convert("RGB") as image:
            width, height = image.size
            width_scale = 1000 / width
            height_scale = 1000 / height

            words = []
            boxes = []
            for row in ocr_result:
                boxes.append(scale_bounding_box(
                    row["bounding_box"],
                    width_scale,
                    height_scale
                ))
                words.append(row["word"])

            encoding = self.processor(
                image,
                words,
                boxes=boxes,
                max_length=512,
                padding="max_length",
                truncation=True,
                return_tensors="pt"
            )

        # The class label is derived from the parent directory name
        label = DOCUMENT_CLASSES.index(image_path.parent.name)

        return dict(
            input_ids=encoding["input_ids"].flatten(),
            attention_mask=encoding["attention_mask"].flatten(),
            bbox=encoding["bbox"].flatten(end_dim=1),
            pixel_values=encoding["pixel_values"].flatten(end_dim=1),
            labels=torch.tensor(label, dtype=torch.long)
        )
We can now create datasets and data loaders for the train and test documents:
train_dataset = DocumentClassificationDataset(train_images, processor)
test_dataset = DocumentClassificationDataset(test_images, processor)
train_data_loader = DataLoader(
    train_dataset,
    batch_size=8,
    shuffle=True,
    num_workers=2
)
test_data_loader = DataLoader(
    test_dataset,
    batch_size=8,
    shuffle=False,
    num_workers=2
)
Now let’s wrap all of the components up in a PyTorch Lightning module.
class ModelModule(pl.LightningModule):
    def __init__(self, n_classes: int):
        super().__init__()
        self.model = LayoutLMv3ForSequenceClassification.from_pretrained(
            "microsoft/layoutlmv3-base",
            num_labels=n_classes
        )
        self.model.config.id2label = {k: v for k, v in enumerate(DOCUMENT_CLASSES)}
        self.model.config.label2id = {v: k for k, v in enumerate(DOCUMENT_CLASSES)}
        self.train_accuracy = Accuracy(task="multiclass", num_classes=n_classes)
        self.val_accuracy = Accuracy(task="multiclass", num_classes=n_classes)

    def forward(self, input_ids, attention_mask, bbox, pixel_values, labels=None):
        return self.model(
            input_ids,
            attention_mask=attention_mask,
            bbox=bbox,
            pixel_values=pixel_values,
            labels=labels
        )

    def training_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        bbox = batch["bbox"]
        pixel_values = batch["pixel_values"]
        labels = batch["labels"]
        output = self(input_ids, attention_mask, bbox, pixel_values, labels)
        self.log("train_loss", output.loss)
        self.log(
            "train_acc",
            self.train_accuracy(output.logits, labels),
            on_step=True,
            on_epoch=True
        )
        return output.loss

    def validation_step(self, batch, batch_idx):
        input_ids = batch["input_ids"]
        attention_mask = batch["attention_mask"]
        bbox = batch["bbox"]
        pixel_values = batch["pixel_values"]
        labels = batch["labels"]
        output = self(input_ids, attention_mask, bbox, pixel_values, labels)
        self.log("val_loss", output.loss)
        self.log(
            "val_acc",
            self.val_accuracy(output.logits, labels),
            on_step=False,
            on_epoch=True
        )
        return output.loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.model.parameters(), lr=1e-5)
Now we create a training instance of ModelModule:
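# Instantiate the Lightning module with one output per document class
model_module = ModelModule(n_classes=len(DOCUMENT_CLASSES))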
To monitor and visualize the training progress, we incorporate Tensorboard into our workflow. By loading the Tensorboard extension and specifying the log directory, we enable real-time tracking of metrics and visualizations:
%load_ext tensorboard
%tensorboard --logdir lightning_logs
Additionally, we configure the PyTorch Lightning Trainer to manage the training process efficiently. A ModelCheckpoint callback is set up to save the model’s weights after each epoch, employing a specific naming convention that includes the epoch number, training step, and validation loss. The Trainer is configured to utilize a single GPU, implement mixed precision (16-bit) training, and run for a total of 5 epochs:
model_checkpoint = ModelCheckpoint(
    filename="{epoch}-{step}-{val_loss:.4f}",
    save_last=True,
    save_top_k=3,
    monitor="val_loss",
    mode="min"
)
trainer = pl.Trainer(
    accelerator="gpu",
    precision=16,
    devices=1,
    max_epochs=5,
    callbacks=[model_checkpoint]
)
The subsequent training phase is initiated using the trainer.fit() method, specifying the model module, along with the training and testing data loaders:
trainer.fit(model_module, train_data_loader, test_data_loader)
To assess the model’s performance, we first load the best checkpoint and push the trained model to the Hugging Face Hub:
# Load the best checkpoint for evaluation
trained_model = ModelModule.load_from_checkpoint(
    model_checkpoint.best_model_path,
    n_classes=len(DOCUMENT_CLASSES),
    local_files_only=True
)
# Log in to the Hugging Face Hub and upload the fine-tuned model
notebook_login()
trained_model.model.push_to_hub(
    "layoutlmv3-financial-document-classification"
)
Once uploaded, we can easily download the model using its name or ID and load it for inference:
DEVICE = "cuda:0" if torch.cuda.is_available() else "cpu"
model = LayoutLMv3ForSequenceClassification.from_pretrained(
    "curiousily/layoutlmv3-financial-document-classification"
)
model = model.eval().to(DEVICE)
Next, we define a function for inference on a single document image:
def predict_document_image(
    image_path: Path,
    model: LayoutLMv3ForSequenceClassification,
    processor: LayoutLMv3Processor
) -> str:
    # Load the saved OCR result for this image
    json_path = image_path.with_suffix(".json")
    with json_path.open("r") as f:
        ocr_result = json.load(f)

    with Image.open(image_path).convert("RGB") as image:
        width, height = image.size
        width_scale = 1000 / width
        height_scale = 1000 / height

        words = []
        boxes = []
        for row in ocr_result:
            boxes.append(
                scale_bounding_box(
                    row["bounding_box"],
                    width_scale,
                    height_scale
                )
            )
            words.append(row["word"])

        encoding = processor(
            image,
            words,
            boxes=boxes,
            max_length=512,
            padding="max_length",
            truncation=True,
            return_tensors="pt"
        )

    # Run the model without tracking gradients
    with torch.inference_mode():
        output = model(
            input_ids=encoding["input_ids"].to(DEVICE),
            attention_mask=encoding["attention_mask"].to(DEVICE),
            bbox=encoding["bbox"].to(DEVICE),
            pixel_values=encoding["pixel_values"].to(DEVICE)
        )

    predicted_class = output.logits.argmax()
    return model.config.id2label[predicted_class.item()]
Finally, we execute the function on all test documents and use a confusion matrix for a more comprehensive evaluation:
labels = []
predictions = []
for image_path in tqdm(test_images):
    labels.append(image_path.parent.name)
    predictions.append(
        predict_document_image(image_path, model, processor)
    )
Now all we have to do is compare the documents’ correct labels with the predictions of the fine-tuned model.
print("Correct labels:  ", labels)
print("Predicted labels:", predictions)
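Printing the raw lists is a quick sanity check, but the confusion matrix utilities imported earlier give a clearer picture. Here is a minimal sketch, assuming labels and predictions hold the class names collected above:
# Build and plot a confusion matrix over the document classes
cm = confusion_matrix(labels, predictions, labels=DOCUMENT_CLASSES)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=DOCUMENT_CLASSES)
disp.plot(xticks_rotation=45)
plt.show()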
In conclusion, fine-tuning LayoutLMv3 is a pivotal strategy for unlocking the full potential of this state-of-the-art model. With its ability to seamlessly integrate text and layout information, LayoutLMv3 stands at the forefront of document analysis tasks, from document classification to information extraction and question answering.