Welcome to our exploration of fine-tuning LayoutLMv2 for Document Question Answering. In this article, we’ll delve into the practical aspects of fine-tuning LayoutLMv2, a versatile model known for its ability to understand both the text and the visual layout of documents. Document Question Answering has many practical applications, and we’ll guide you through the full fine-tuning process, from data preparation to model configuration and training. Specifically, we will cover the task and the DocVQA dataset, data preprocessing, fine-tuning with the Trainer API, and inference.
Document question answering (DQA) is a cutting-edge natural language processing task that focuses on developing models capable of extracting information from large textual sources to answer user queries. Unlike traditional question answering, which relies on predefined knowledge bases or structured databases, DQA involves comprehending unstructured documents, such as articles, reports, or books, to generate accurate and contextually relevant responses.
The DocVQA dataset serves as a crucial resource in the realm of Document Visual Question Answering (DocVQA), offering a standardized benchmark for evaluating the capabilities of models designed to comprehend textual and visual information within documents. Comprising images of diverse documents, accompanied by corresponding questions and answers, this dataset challenges researchers and practitioners in the fields of computer vision and natural language processing to develop robust algorithms capable of simultaneously processing visual content and answering questions related to the document’s context.
The tasks involved in DocVQA demand a fusion of image processing techniques to interpret the visual aspects of documents and sophisticated natural language understanding to extract relevant information for accurate responses. As the dataset evolves and expands, it continues to play a pivotal role in advancing the state-of-the-art in document understanding, fostering innovation in AI systems tailored for complex document-based tasks. Researchers frequently leverage the DocVQA dataset to push the boundaries of what is achievable in terms of visual and textual comprehension, with the ultimate goal of enhancing the efficiency and accuracy of question answering systems in real-world document analysis scenarios.
Prior to starting, ensure that you have installed all the required libraries. LayoutLM v2 relies on detectron2, torchvision, and tesseract.
!pip install -q transformers datasets
!pip install 'git+https://github.com/facebookresearch/detectron2.git'
!pip install torchvision
!sudo apt install tesseract-ocr
!pip install -q pytesseract
Now let’s define some global variables:
model_checkpoint = "microsoft/layoutlmv2-base-uncased"
batch_size = 4
In this tutorial, we utilize a compact sample of preprocessed DocVQA. If you prefer working with the complete DocVQA dataset, you can register and obtain it from the DocVQA homepage.
from datasets import load_dataset
dataset = load_dataset("nielsr/docvqa_1200_examples")
dataset
The dataset has already been divided into training and testing sets.
dataset["train"].features
In the dataset, each example is characterized by several fields. The “id” serves as a unique identifier for the example, while the “image” field contains a PIL.Image.Image object representing the document image. The “query” field encapsulates the natural language question posed in various languages, and the corresponding correct answers provided by human annotators are stored in the “answers” field. The “words” and “bounding_boxes” fields hold the results of Optical Character Recognition (OCR), which we won’t be utilizing in this context. The “answer” field contains responses matched by a different model, and for our purposes, we will exclude this feature.
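To make these fields concrete, here is a quick way to peek at one raw training example (a small sketch; the field names are exactly those described above):
# Look at the fields of the first training example
example = dataset["train"][0]
print(example["id"])       # unique identifier of the example
print(example["query"])    # the question in several languages
print(example["answers"])  # the human-annotated answers
print(example["image"])    # PIL.Image.Image of the document page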
To streamline the dataset, we’ll keep only the English version of each question and discard the “answer” feature produced by the other model. We’ll also retain just the first answer from the set provided by the annotators (alternatively, you could randomly sample one).
updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"])
updated_dataset = updated_dataset.map(
lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"]
)
It’s important to note that the LayoutLM v2 checkpoint utilized in this guide has been trained with a maximum position embedding value of 512, as indicated in the checkpoint’s config.json file. While we can truncate examples, it’s crucial to avoid situations where the answer might be positioned at the end of a lengthy document and end up being truncated.
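If you want to confirm this limit yourself, a quick sketch using the standard AutoConfig API is to load the checkpoint’s configuration and print its maximum position embeddings:
from transformers import AutoConfig
# Load only the configuration of the checkpoint and check its position-embedding limit
config = AutoConfig.from_pretrained(model_checkpoint)
print(config.max_position_embeddings)  # expected to be 512 for this checkpoint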
To avoid this, we keep only the examples whose question and OCR words together are unlikely to exceed 512 tokens:
# Keep only the examples whose question plus OCR words are likely to fit in 512 positions
updated_dataset = updated_dataset.filter(
    lambda x: len(x["words"]) + len(x["question"].split()) < 512
)
At this stage, we will remove the Optical Character Recognition (OCR) features from the dataset. These features were generated during the OCR process for fine-tuning a different model. However, they require additional processing to align with the input requirements of the model used in this guide.
updated_dataset = updated_dataset.remove_columns("words")
updated_dataset = updated_dataset.remove_columns("bounding_boxes")
Here’s how the data looks now.
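You can print one of the cleaned-up training examples to check the remaining fields (a quick sketch; these are the columns left after the steps above):
# Inspect one cleaned-up training example
sample = updated_dataset["train"][0]
print(sample["question"])  # the English question
print(sample["answer"])    # the first annotated answer
print(sample["image"])     # the document image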
For the Document Question Answering task, which involves multiple modalities, it is crucial to preprocess inputs from each modality according to the model’s expectations. To initiate this process, we’ll begin by loading the LayoutLM v2 Processor. This processor internally integrates an image processor capable of handling image data and a tokenizer designed to encode text data. This combined functionality allows for comprehensive preprocessing of both image and text inputs to meet the model’s requirements.
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(model_checkpoint)
image_processor = processor.image_processor
def get_ocr_words_and_boxes(examples):
images = [image.convert("RGB") for image in examples["image"]]
encoded_inputs = image_processor(images)
examples["image"] = encoded_inputs.pixel_values
examples["words"] = encoded_inputs.words
examples["boxes"] = encoded_inputs.boxes
return examples
Now let’s apply this processing to the whole dataset:
dataset_with_ocr = updated_dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2)
tokenizer = processor.tokenizer
def subfinder(words_list, answer_list):
    matches = []
    start_indices = []
    end_indices = []
    for idx in range(len(words_list)):
        if words_list[idx] == answer_list[0] and words_list[idx : idx + len(answer_list)] == answer_list:
            matches.append(answer_list)
            start_indices.append(idx)
            end_indices.append(idx + len(answer_list) - 1)
    if matches:
        return matches[0], start_indices[0], end_indices[0]
    else:
        return None, 0, 0
This function takes two lists, words_list and answer_list. It iterates over words_list, checks whether the current word matches the first word of answer_list, and then verifies that the sublist of words_list starting at that position and of the same length as answer_list equals answer_list. If a match is found, the function records its start index and end index (the start index plus the answer length minus one). If multiple matches occur, only the first one is returned; if no match is found, the function returns (None, 0, 0).
To illustrate how this function finds the position of the answer, let’s use it on an example:
example = dataset_with_ocr["train"][1]
words = [word.lower() for word in example["words"]]
match, word_idx_start, word_idx_end = subfinder(words, example["answer"].lower().split())
print("Question: ", example["question"])
print("Words:", words)
print("Answer: ", example["answer"])
print("start_index", word_idx_start)
print("end_index", word_idx_end)
Once we encode the question together with the example’s words and boxes, the result looks like this:
encoding = tokenizer(example["question"], example["words"], example["boxes"])
tokenizer.decode(encoding["input_ids"])
Once the encoding is done, the next step is to pinpoint the position of the answer within the encoded input. Three pieces of information make this possible. First, the token_type_ids tell us which tokens belong to the question and which belong to the document’s words. Second, the special token at the very start of the input, identified by tokenizer.cls_token_id, serves as the fallback position when the answer cannot be located. Third, the word_ids link each token in the encoding back to the word it came from in the original words list, which is what lets us translate the answer’s word-level start and end indices into token-level start and end positions in the encoded input.
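To see these signals concretely, you can inspect them on the encoding we created above (a small sketch; it reuses the encoding and tokenizer variables from the previous snippets and relies on the fast tokenizer that AutoProcessor loads by default):
# The three pieces of information used to locate the answer in the encoding
print(encoding["token_type_ids"])  # 0 for question tokens, 1 for document tokens
print(tokenizer.cls_token_id)      # id of the special token at the start of the input
print(encoding.word_ids())         # maps each token back to the index of its source word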
def encode_dataset(examples, max_length=512):
questions = examples["question"]
words = examples["words"]
boxes = examples["boxes"]
answers = examples["answer"]
# encode the batch of examples and initialize the start_positions and end_positions
encoding = tokenizer(questions, words, boxes, max_length=max_length, padding="max_length", truncation=True)
start_positions = []
end_positions = []
# loop through the examples in the batch
for i in range(len(questions)):
cls_index = encoding["input_ids"][i].index(tokenizer.cls_token_id)
# find the position of the answer in example's words
words_example = [word.lower() for word in words[i]]
answer = answers[i]
match, word_idx_start, word_idx_end = subfinder(words_example, answer.lower().split())
if match:
# if match is found, use `token_type_ids` to find where words start in the encoding
token_type_ids = encoding["token_type_ids"][i]
token_start_index = 0
while token_type_ids[token_start_index] != 1:
token_start_index += 1
token_end_index = len(encoding["input_ids"][i]) - 1
while token_type_ids[token_end_index] != 1:
token_end_index -= 1
word_ids = encoding.word_ids(i)[token_start_index : token_end_index + 1]
start_position = cls_index
end_position = cls_index
# loop over word_ids and increase `token_start_index` until it matches the answer position in words
# once it matches, save the `token_start_index` as the `start_position` of the answer in the encoding
for id in word_ids:
if id == word_idx_start:
start_position = token_start_index
else:
token_start_index += 1
# similarly loop over `word_ids` starting from the end to find the `end_position` of the answer
for id in word_ids[::-1]:
if id == word_idx_end:
end_position = token_end_index
else:
token_end_index -= 1
start_positions.append(start_position)
end_positions.append(end_position)
else:
start_positions.append(cls_index)
end_positions.append(cls_index)
encoding["image"] = examples["image"]
encoding["start_positions"] = start_positions
encoding["end_positions"] = end_positions
return encoding
With the completion of this preprocessing function, we are now poised to encode the entire dataset.
encoded_train_dataset = dataset_with_ocr["train"].map(
    encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["train"].column_names
)
encoded_test_dataset = dataset_with_ocr["test"].map(
    encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["test"].column_names
)
The encoded dataset now contains the model-ready inputs together with the start and end positions of each answer.
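As a quick sanity check (a sketch; the exact column set follows from the preprocessing code above), you can list the columns of the encoded training set:
# The encoded dataset should expose the tokenizer outputs (input_ids, token_type_ids,
# attention_mask, bbox) along with image, start_positions, and end_positions
print(encoded_train_dataset.column_names)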
Another way to prepare the data is to use UBIAI’s annotation tool. Head to UBIAI’s website, create a new project, upload your data, and start annotating. The annotation steps are described in our previous article about fine-tuning GPT models.
The idea is to first create entities for the questions and their answers, and then create a relation between those entities so that each answer is linked to its question.
Once the entities are defined, you can annotate your documents.
When you finish annotating, export the annotated data as a JSON file, and it is ready for use.
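As a rough sketch of what using that export could look like, the JSON file can be loaded back with the datasets library (the file name here is hypothetical, and the exact schema depends on UBIAI’s export format):
from datasets import load_dataset
# Hypothetical file name; adjust it to match your UBIAI export
ubiai_dataset = load_dataset("json", data_files="ubiai_annotations.json")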
Training follows the standard Transformers workflow: load the model with AutoModelForDocumentQuestionAnswering, using the same checkpoint as in the preprocessing step; specify the training hyperparameters with TrainingArguments; create a data collator to batch examples, for which the DefaultDataCollator is suitable; pass the training arguments, model, datasets, and data collator to the Trainer; and finally call the train() method to start fine-tuning your model.
from transformers import AutoModelForDocumentQuestionAnswering
model = AutoModelForDocumentQuestionAnswering.from_pretrained(model_checkpoint)
In TrainingArguments, use the output_dir parameter to specify where to save your model, and set push_to_hub=True if you want to share the model with the Hugging Face community. The output_dir also serves as the name of the Hub repository to which your model checkpoint will be pushed if you choose to upload it.
from transformers import TrainingArguments
# REPLACE THIS WITH YOUR REPO ID
repo_id = "MariaK/layoutlmv2-base-uncased_finetuned_docvqa"
training_args = TrainingArguments(
output_dir=repo_id,
per_device_train_batch_size=4,
num_train_epochs=20,
save_steps=200,
logging_steps=50,
evaluation_strategy="steps",
learning_rate=5e-5,
save_total_limit=2,
remove_unused_columns=False,
push_to_hub=True,
)
Create a simple data collator to group examples into batches; the DefaultDataCollator from the Transformers library is all we need here. Once the data collator is instantiated, bring all the components together, including the model, training arguments, datasets, data collator, and processor, and call the train() function to start training.
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator()
from transformers import Trainer
from huggingface_hub import login
# Log in with your Hugging Face access token so the Trainer can push checkpoints to the Hub
login(token="YOUR_HF_TOKEN")  # replace with your own token
# Assuming 'model', 'training_args', 'data_collator', 'encoded_train_dataset', 'encoded_test_dataset', 'processor' are defined
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=encoded_train_dataset,
eval_dataset=encoded_test_dataset,
tokenizer=processor,
)
trainer.train()
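Once training completes, and because push_to_hub=True was set in the training arguments, you can share the final model on the Hub with the standard Trainer API:
# Upload the final model, processor, and training metadata to the Hub repository
trainer.push_to_hub()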
As an example, take this image from the test set together with the question “Who is ‘presiding’ TRRF GENERAL SESSION?”. Let’s first look at the question and the annotated answers:
example = dataset["test"][2]
question = example["query"]["en"]
image = example["image"]
print(question)
print(example["answers"])
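To actually get the model’s prediction rather than the annotated answers, here is a minimal inference sketch. It reuses the fine-tuned model and the processor already in memory (if you pushed your model to the Hub, you could instead reload both from your repo_id):
import torch
# Preprocess the image and question, run the fine-tuned model, and decode the tokens
# between the predicted start and end positions
model.eval()
encoding = processor(image.convert("RGB"), question, return_tensors="pt", truncation=True).to(model.device)
with torch.no_grad():
    outputs = model(**encoding)
start_idx = outputs.start_logits.argmax(-1).item()
end_idx = outputs.end_logits.argmax(-1).item()
predicted_answer = processor.tokenizer.decode(
    encoding["input_ids"].squeeze()[start_idx : end_idx + 1]
)
print(predicted_answer)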
In summary, fine-tuning LayoutLM v2 for Document Question Answering marks a significant step forward in advancing natural language processing for document analysis. This article has covered key aspects of the fine-tuning process, emphasizing the model’s dual proficiency in understanding text and layout. By integrating document layout insights, we enhance question answering systems, promising improved information retrieval across various applications.