Welcome to our exploration of fine-tuning LayoutLMv2 for Document Question Answering. In this article, we’ll delve into the practical aspects of fine-tuning LayoutLMv2, a versatile model known for its ability to understand both the text and the visual layout of documents. Document Question Answering has many practical applications, and we’ll guide you through the full fine-tuning process, from data preparation to model configuration and training. Specifically, we will cover the task and the DocVQA dataset, data preprocessing, fine-tuning with the Trainer API, and inference.
Document question answering (DQA) is a cutting-edge natural language processing task that focuses on developing models capable of extracting information from large textual sources to answer user queries. Unlike traditional question answering, which relies on predefined knowledge bases or structured databases, DQA involves comprehending unstructured documents, such as articles, reports, or books, to generate accurate and contextually relevant responses.
The DocVQA dataset serves as a crucial resource in the realm of Document Visual Question Answering (DocVQA), offering a standardized benchmark for evaluating the capabilities of models designed to comprehend textual and visual information within documents. Comprising images of diverse documents, accompanied by corresponding questions and answers, this dataset challenges researchers and practitioners in the fields of computer vision and natural language processing to develop robust algorithms capable of simultaneously processing visual content and answering questions related to the document’s context.
The tasks involved in DocVQA demand a fusion of image processing techniques to interpret the visual aspects of documents and sophisticated natural language understanding to extract relevant information for accurate responses. As the dataset evolves and expands, it continues to play a pivotal role in advancing the state-of-the-art in document understanding, fostering innovation in AI systems tailored for complex document-based tasks. Researchers frequently leverage the DocVQA dataset to push the boundaries of what is achievable in terms of visual and textual comprehension, with the ultimate goal of enhancing the efficiency and accuracy of question answering systems in real-world document analysis scenarios.
Prior to starting, ensure that you have installed all the required libraries. LayoutLM v2 relies on detectron2, torchvision, and tesseract.
!pip install -q transformers datasets
!pip install 'git+https://github.com/facebookresearch/detectron2.git'
!pip install torchvision
!sudo apt install tesseract-ocr
!pip install -q pytesseract
Now let’s define some global variables:
model_checkpoint = "microsoft/layoutlmv2-base-uncased"
batch_size = 4
In this tutorial, we utilize a compact sample of preprocessed DocVQA. If you prefer working with the complete DocVQA dataset, you can register and obtain it from the DocVQA homepage.
from datasets import load_dataset
dataset = load_dataset("nielsr/docvqa_1200_examples")
dataset
The dataset has already been divided into training and testing sets.
dataset["train"].features
In the dataset, each example is characterized by several fields. The “id” serves as a unique identifier for the example, while the “image” field contains a PIL.Image.Image object representing the document image. The “query” field encapsulates the natural language question posed in various languages, and the corresponding correct answers provided by human annotators are stored in the “answers” field. The “words” and “bounding_boxes” fields hold the results of Optical Character Recognition (OCR), which we won’t be utilizing in this context. The “answer” field contains responses matched by a different model, and for our purposes, we will exclude this feature.
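To make these fields concrete, here is a quick way to peek at one raw training example (a small sketch; the field names are exactly those described above):
# Look at the fields of the first training example
example = dataset["train"][0]
print(example["id"])       # unique identifier of the example
print(example["query"])    # the question in several languages
print(example["answers"])  # the human-annotated answers
print(example["image"])    # PIL.Image.Image of the document page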
To streamline the dataset, we’ll keep only the English version of each question and discard the “answer” feature produced by the other model. We’ll also retain just the first answer from the set provided by the annotators (alternatively, you could randomly sample one).
updated_dataset = dataset.map(lambda example: {"question": example["query"]["en"]}, remove_columns=["query"])
updated_dataset = updated_dataset.map(
lambda example: {"answer": example["answers"][0]}, remove_columns=["answer", "answers"]
)
It’s important to note that the LayoutLM v2 checkpoint utilized in this guide has been trained with a maximum position embedding value of 512, as indicated in the checkpoint’s config.json file. While we can truncate examples, it’s crucial to avoid situations where the answer might be positioned at the end of a lengthy document and end up being truncated.
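If you want to confirm this limit yourself, a quick sketch using the standard AutoConfig API is to load the checkpoint’s configuration and print its maximum position embeddings:
from transformers import AutoConfig
# Load only the configuration of the checkpoint and check its position-embedding limit
config = AutoConfig.from_pretrained(model_checkpoint)
print(config.max_position_embeddings)  # expected to be 512 for this checkpoint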
To avoid this, we keep only the examples whose question and OCR words together are unlikely to exceed 512 tokens:
# Keep only the examples whose question plus OCR words are likely to fit in 512 positions
updated_dataset = updated_dataset.filter(
    lambda x: len(x["words"]) + len(x["question"].split()) < 512
)
At this stage, we will remove the Optical Character Recognition (OCR) features from the dataset. These features were generated during the OCR process for fine-tuning a different model. However, they require additional processing to align with the input requirements of the model used in this guide.
updated_dataset = updated_dataset.remove_columns("words")
updated_dataset = updated_dataset.remove_columns("bounding_boxes")
Here’s how the data looks now.
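You can print one of the cleaned-up training examples to check the remaining fields (a quick sketch; these are the columns left after the steps above):
# Inspect one cleaned-up training example
sample = updated_dataset["train"][0]
print(sample["question"])  # the English question
print(sample["answer"])    # the first annotated answer
print(sample["image"])     # the document image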
For the Document Question Answering task, which involves multiple modalities, it is crucial to preprocess inputs from each modality according to the model’s expectations. To initiate this process, we’ll begin by loading the LayoutLM v2 Processor. This processor internally integrates an image processor capable of handling image data and a tokenizer designed to encode text data. This combined functionality allows for comprehensive preprocessing of both image and text inputs to meet the model’s requirements.
from transformers import AutoProcessor
processor = AutoProcessor.from_pretrained(model_checkpoint)
image_processor = processor.image_processor
def get_ocr_words_and_boxes(examples):
images = [image.convert("RGB") for image in examples["image"]]
encoded_inputs = image_processor(images)
examples["image"] = encoded_inputs.pixel_values
examples["words"] = encoded_inputs.words
examples["boxes"] = encoded_inputs.boxes
return examples
Now let’s apply this processing to the whole dataset:
dataset_with_ocr = updated_dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=2)
tokenizer = processor.tokenizer
def subfinder(words_list, answer_list):
    matches = []
    start_indices = []
    end_indices = []
    for idx in range(len(words_list)):
        if words_list[idx] == answer_list[0] and words_list[idx : idx + len(answer_list)] == answer_list:
            matches.append(answer_list)
            start_indices.append(idx)
            end_indices.append(idx + len(answer_list) - 1)
    if matches:
        return matches[0], start_indices[0], end_indices[0]
    else:
        return None, 0, 0
This function takes two lists, words_list and answer_list. It iterates over words_list, checks whether the current word matches the first word of answer_list, and then verifies that the sublist of words_list starting at that position and of the same length as answer_list equals answer_list. If a match is found, the function records its start index and end index (the start index plus the answer length minus one). If multiple matches occur, only the first one is returned; if no match is found, the function returns (None, 0, 0).
To illustrate how this function finds the position of the answer, let’s use it on an example:
example = dataset_with_ocr["train"][1]
words = [word.lower() for word in example["words"]]
match, word_idx_start, word_idx_end = subfinder(words, example["answer"].lower().split())
print("Question: ", example["question"])
print("Words:", words)
print("Answer: ", example["answer"])
print("start_index", word_idx_start)
print("end_index", word_idx_end)
Once we encode the question together with the example’s words and boxes, the result looks like this:
encoding = tokenizer(example["question"], example["words"], example["boxes"])
tokenizer.decode(encoding["input_ids"])
Once the encoding is done, the next step is to pinpoint the position of the answer within the encoded input. Three pieces of information make this possible. First, the token_type_ids tell us which tokens belong to the question and which belong to the document’s words. Second, the special token at the very start of the input, identified by tokenizer.cls_token_id, serves as the fallback position when the answer cannot be located. Third, the word_ids link each token in the encoding back to the word it came from in the original words list, which is what lets us translate the answer’s word-level start and end indices into token-level start and end positions in the encoded input.
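To see these signals concretely, you can inspect them on the encoding we created above (a small sketch; it reuses the encoding and tokenizer variables from the previous snippets and relies on the fast tokenizer that AutoProcessor loads by default):
# The three pieces of information used to locate the answer in the encoding
print(encoding["token_type_ids"])  # 0 for question tokens, 1 for document tokens
print(tokenizer.cls_token_id)      # id of the special token at the start of the input
print(encoding.word_ids())         # maps each token back to the index of its source word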
def encode_dataset(examples, max_length=512):
questions = examples["question"]
words = examples["words"]
boxes = examples["boxes"]
answers = examples["answer"]
# encode the batch of examples and initialize the start_positions and end_positions
encoding = tokenizer(questions, words, boxes, max_length=max_length, padding="max_length", truncation=True)
start_positions = []
end_positions = []
# loop through the examples in the batch
for i in range(len(questions)):
cls_index = encoding["input_ids"][i].index(tokenizer.cls_token_id)
# find the position of the answer in example's words
words_example = [word.lower() for word in words[i]]
answer = answers[i]
match, word_idx_start, word_idx_end = subfinder(words_example, answer.lower().split())
if match:
# if match is found, use `token_type_ids` to find where words start in the encoding
token_type_ids = encoding["token_type_ids"][i]
token_start_index = 0
while token_type_ids[token_start_index] != 1:
token_start_index += 1
token_end_index = len(encoding["input_ids"][i]) - 1
while token_type_ids[token_end_index] != 1:
token_end_index -= 1
word_ids = encoding.word_ids(i)[token_start_index : token_end_index + 1]
start_position = cls_index
end_position = cls_index
# loop over word_ids and increase `token_start_index` until it matches the answer position in words
# once it matches, save the `token_start_index` as the `start_position` of the answer in the encoding
for id in word_ids:
if id == word_idx_start:
start_position = token_start_index
else:
token_start_index += 1
# similarly loop over `word_ids` starting from the end to find the `end_position` of the answer
for id in word_ids[::-1]:
if id == word_idx_end:
end_position = token_end_index
else:
token_end_index -= 1
start_positions.append(start_position)
end_positions.append(end_position)
else:
start_positions.append(cls_index)
end_positions.append(cls_index)
encoding["image"] = examples["image"]
encoding["start_positions"] = start_positions
encoding["end_positions"] = end_positions
return encoding
With the completion of this preprocessing function, we are now poised to encode the entire dataset.
encoded_train_dataset = dataset_with_ocr["train"].map(
    encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["train"].column_names
)
encoded_test_dataset = dataset_with_ocr["test"].map(
    encode_dataset, batched=True, batch_size=2, remove_columns=dataset_with_ocr["test"].column_names
)
The encoded dataset now contains the model-ready inputs together with the start and end positions of each answer.
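As a quick sanity check (a sketch; the exact column set follows from the preprocessing code above), you can list the columns of the encoded training set:
# The encoded dataset should expose the tokenizer outputs (input_ids, token_type_ids,
# attention_mask, bbox) along with image, start_positions, and end_positions
print(encoded_train_dataset.column_names)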
Another way to prepare the data is to use UBIAI’s annotation tool. Head to UBIAI’s website, create a new project, upload your data, and start annotating. The annotation steps are described in our previous article about fine-tuning GPT models.
The idea is to first create entities for the questions and their answers, and then create a relation between those entities so that each answer is linked to its question.
Once the entities are defined, you can annotate your documents.
When you finish annotating, export the annotated data as a JSON file, and it is ready for use.
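As a rough sketch of what using that export could look like, the JSON file can be loaded back with the datasets library (the file name here is hypothetical, and the exact schema depends on UBIAI’s export format):
from datasets import load_dataset
# Hypothetical file name; adjust it to match your UBIAI export
ubiai_dataset = load_dataset("json", data_files="ubiai_annotations.json")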
Training follows the standard Transformers workflow: load the model with AutoModelForDocumentQuestionAnswering, using the same checkpoint as in the preprocessing step; specify the training hyperparameters with TrainingArguments; create a data collator to batch examples, for which the DefaultDataCollator is suitable; pass the training arguments, model, datasets, and data collator to the Trainer; and finally call the train() method to start fine-tuning your model.
from transformers import AutoModelForDocumentQuestionAnswering
model = AutoModelForDocumentQuestionAnswering.from_pretrained(model_checkpoint)
In TrainingArguments, use the output_dir parameter to specify where to save your model, and set push_to_hub=True if you want to share the model with the Hugging Face community. The output_dir also serves as the name of the Hub repository to which your model checkpoint will be pushed if you choose to upload it.
from transformers import TrainingArguments
# REPLACE THIS WITH YOUR REPO ID
repo_id = "MariaK/layoutlmv2-base-uncased_finetuned_docvqa"
training_args = TrainingArguments(
output_dir=repo_id,
per_device_train_batch_size=4,
num_train_epochs=20,
save_steps=200,
logging_steps=50,
evaluation_strategy="steps",
learning_rate=5e-5,
save_total_limit=2,
remove_unused_columns=False,
push_to_hub=True,
)
Create a simple data collator to group examples into batches; the DefaultDataCollator from the Transformers library is all we need here. Once the data collator is instantiated, bring all the components together, including the model, training arguments, datasets, data collator, and processor, and call the train() function to start training.
from transformers import DefaultDataCollator
data_collator = DefaultDataCollator()
from transformers import Trainer
from huggingface_hub import login
# Log in with your Hugging Face access token so the Trainer can push checkpoints to the Hub
login(token="YOUR_HF_TOKEN")  # replace with your own token
# Assuming 'model', 'training_args', 'data_collator', 'encoded_train_dataset', 'encoded_test_dataset', 'processor' are defined
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=encoded_train_dataset,
eval_dataset=encoded_test_dataset,
tokenizer=processor,
)
trainer.train()
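Once training completes, and because push_to_hub=True was set in the training arguments, you can share the final model on the Hub with the standard Trainer API:
# Upload the final model, processor, and training metadata to the Hub repository
trainer.push_to_hub()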
As an example, take this image from the test set together with the question “Who is ‘presiding’ TRRF GENERAL SESSION?”. Let’s first look at the question and the annotated answers:
example = dataset["test"][2]
question = example["query"]["en"]
image = example["image"]
print(question)
print(example["answers"])
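To actually get the model’s prediction rather than the annotated answers, here is a minimal inference sketch. It reuses the fine-tuned model and the processor already in memory (if you pushed your model to the Hub, you could instead reload both from your repo_id):
import torch
# Preprocess the image and question, run the fine-tuned model, and decode the tokens
# between the predicted start and end positions
model.eval()
encoding = processor(image.convert("RGB"), question, return_tensors="pt", truncation=True).to(model.device)
with torch.no_grad():
    outputs = model(**encoding)
start_idx = outputs.start_logits.argmax(-1).item()
end_idx = outputs.end_logits.argmax(-1).item()
predicted_answer = processor.tokenizer.decode(
    encoding["input_ids"].squeeze()[start_idx : end_idx + 1]
)
print(predicted_answer)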
In summary, fine-tuning LayoutLM v2 for Document Question Answering marks a significant step forward in advancing natural language processing for document analysis. This article has covered key aspects of the fine-tuning process, emphasizing the model’s dual proficiency in understanding text and layout. By integrating document layout insights, we enhance question answering systems, promising improved information retrieval across various applications.