In this blog article, we’ll explore the Language-Independent Layout Transformer (LiLT) and the advantages it brings to structured document analysis. Specifically, we will cover its architecture, its pre-training process, and its key advantages. So let’s dive in!
LiLT employs a parallel dual-stream Transformer architecture. Starting from an input document image, an off-the-shelf OCR engine is first used to extract text bounding boxes and their corresponding contents.
Subsequently, the extracted text and layout information undergo separate embedding processes, each directed into its respective Transformer-based architecture to yield enriched features. A bi-directional attention complementation mechanism (BiACM) is introduced to facilitate the effective cross-modality interaction between textual and layout clues.
In the final step, the encoded text and layout features are concatenated, providing a comprehensive and integrated representation for further processing.
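To make the dual-stream idea concrete, here is a minimal PyTorch sketch of one BiACM-style attention step. The dimensions (768 for text, 192 for layout), the single attention head, and all helper names are illustrative assumptions rather than the actual LiLT implementation; the point is simply that each stream computes its own attention scores and adds the other stream’s scores before the softmax, with the textual scores detached on the layout side.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedScoreAttention(nn.Module):
    """Single-head self-attention that can expose and reuse raw attention scores."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def scores(self, x):
        # raw (pre-softmax) attention scores, shape (batch, seq, seq)
        return self.q(x) @ self.k(x).transpose(-2, -1) * self.scale

    def mix(self, x, scores):
        # attend with externally supplied (possibly combined) scores
        return F.softmax(scores, dim=-1) @ self.v(x)

text_attn, layout_attn = SharedScoreAttention(768), SharedScoreAttention(192)
text_feats = torch.randn(1, 128, 768)    # embedded OCR tokens
layout_feats = torch.randn(1, 128, 192)  # embedded bounding boxes

# BiACM: each stream attends with its own scores plus the other stream's scores;
# the textual scores are detached before being shared with the layout stream.
t_scores = text_attn.scores(text_feats)
l_scores = layout_attn.scores(layout_feats)
text_out = text_attn.mix(text_feats, t_scores + l_scores)
layout_out = layout_attn.mix(layout_feats, l_scores + t_scores.detach())

fused = torch.cat([text_out, layout_out], dim=-1)  # concatenated text + layout features
print(fused.shape)  # torch.Size([1, 128, 960])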
Masked Visual-Language Modeling (MVLM) introduces a dynamic approach by randomly masking certain input tokens and challenging the model to reconstruct them across the entire vocabulary, leveraging the encoded features and guided by a cross-entropy loss. Remarkably, this process focuses on enhancing language-side learning while keeping non-textual information unchanged. MVLM’s innovation lies in leveraging cross-modality information, significantly improving the model’s ability to capture both inter- and intra-sentence relationships, thereby advancing its overall language understanding.
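As a rough sketch of how such an objective can be computed (the 15% masking ratio, the -100 ignore index, and the model interface are illustrative assumptions, not LiLT’s exact recipe):

import torch
import torch.nn.functional as F

def mvlm_loss(input_ids, boxes, model, mask_token_id, mask_prob=0.15):
    """Masked visual-language modeling: mask tokens, keep boxes, recover token ids."""
    labels = input_ids.clone()
    mask = torch.rand(input_ids.shape) < mask_prob            # positions to mask
    labels[~mask] = -100                                       # only score masked positions
    masked_ids = input_ids.masked_fill(mask, mask_token_id)    # text is corrupted...
    logits = model(masked_ids, boxes)                          # ...while boxes stay intact
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)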
Key Point Location (KPL) strategically divides the entire layout into regions (defaulting to 7×7=49 regions) and randomly masks specific input bounding boxes. The model is then tasked with predicting the regions to which the key points (top-left corner, bottom-right corner, and center point) of each box belong, employing separate heads for each prediction. This meticulous strategy employed by KPL empowers the model to not only comprehend the text content comprehensively but also discern optimal placements for specific words or sentences within the context of their surroundings.
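Here is a small sketch of the geometry behind KPL, assuming coordinates normalized to a 1000×1000 page (a common convention for layout models) and the default 7×7 grid:

def region_index(x, y, grid=7, page_size=1000):
    """Return the region id (0..grid*grid-1) containing point (x, y)."""
    col = min(x * grid // page_size, grid - 1)
    row = min(y * grid // page_size, grid - 1)
    return row * grid + col

# For a masked box (x0, y0, x1, y1), KPL predicts the regions of three key points:
box = (120, 340, 480, 420)
top_left = region_index(box[0], box[1])
bottom_right = region_index(box[2], box[3])
center = region_index((box[0] + box[2]) // 2, (box[1] + box[3]) // 2)
print(top_left, bottom_right, center)  # 14 17 16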
Cross-modal Alignment Identification (CMAI) plays a crucial role in the synergy of MVLM and KPL by collecting encoded features of token-box pairs that undergo masking in the previous stages. An additional head is established on these features to identify the alignment of each pair, imparting a cross-modal perception capacity to the model.
CMAI serves as a unifying element, ensuring the model learns to effectively align visual and textual elements, thereby enhancing its cross-modal comprehension capabilities.
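A minimal sketch of what such an alignment head could look like, reusing the 960-dimensional fused feature from the earlier sketch (the head’s shape is an assumption, not the published architecture):

import torch
import torch.nn as nn
import torch.nn.functional as F

# Binary classifier over the fused feature of each collected token-box pair:
# label 1 if the text and its box still correspond, 0 if the pair is misaligned.
cmai_head = nn.Sequential(nn.Linear(960, 256), nn.ReLU(), nn.Linear(256, 2))

pair_features = torch.randn(32, 960)   # fused features of collected token-box pairs
aligned = torch.randint(0, 2, (32,))   # ground-truth alignment labels
loss = F.cross_entropy(cmai_head(pair_features), aligned)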
In this section, we will cover the advantages of using a LiLT model.
LiLT’s true versatility is exemplified by its capacity to effortlessly accommodate any pre-trained RoBERTa text encoder, including non-English variants like bertin-project/bertin-roberta-base-spanish. This ensures that LiLT is not confined by language boundaries, making it an ideal solution for document understanding tasks across a wide spectrum of languages.
The idea is that substituting the language does not look obviously unnatural when the layout structure remains unchanged, as illustrated by the (a) form and (b) receipt examples. The layout signal can therefore be exploited regardless of the language.
LiLT showcases its effectiveness by consistently achieving competitive or even superior performance across a spectrum of widely-used downstream benchmarks in various languages and settings. For example, Figure 1 showcases LiLT’s performance on the FUNSD (Form Understanding in Noisy Scanned Documents) dataset, while Figure 2 shows its performance on the CORD dataset.
LiLT supports converting existing RoBERTa checkpoints into LiLT checkpoints, which makes it adaptable and useful across a broader linguistic spectrum.
For example, if you want to combine lilt-only-base with the English roberta-base, you can use this script:
mkdir roberta-en-base
wget https://huggingface.co/roberta-base/resolve/main/config.json -O roberta-en-base/config.json
wget https://huggingface.co/roberta-base/resolve/main/pytorch_model.bin -O roberta-en-base/pytorch_model.bin
python gen_weight_roberta_like.py \
--lilt lilt-only-base/pytorch_model.bin \
--text roberta-en-base/pytorch_model.bin \
--config roberta-en-base/config.json \
--out lilt-roberta-en-base
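The same recipe should work for any RoBERTa-style checkpoint: download its config.json and pytorch_model.bin instead and point --text, --config, and --out accordingly, for example at the bertin-project/bertin-roberta-base-spanish weights mentioned earlier.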
LiLT is offered under the MIT license, emphasizing an open-source ethos, and is conveniently accessible on the Hugging Face Hub or on GitHub (https://github.com/jpWang/LiLT) for both research and commercial purposes. This availability ensures that the broader community can explore, implement, and contribute to the development of this language-independent document understanding tool.
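If you go through the Hugging Face transformers integration, loading a converted checkpoint looks roughly like this (the LiltModel class and the SCUT-DLVCLab/lilt-roberta-en-base checkpoint are available on the Hub at the time of writing, but verify the exact names for your setup):

from transformers import AutoTokenizer, LiltModel

# Load the English LiLT checkpoint published on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("SCUT-DLVCLab/lilt-roberta-en-base")
model = LiltModel.from_pretrained("SCUT-DLVCLab/lilt-roberta-en-base")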
Alternatively, to work from the GitHub repository, the model can be installed for CUDA 11.0 using the following script:
conda create -n liltfinetune python=3.7
conda activate liltfinetune
conda install pytorch==1.7.1 torchvision==0.8.2 cudatoolkit=11.0 -c pytorch
python -m pip install detectron2==0.5 -f https://dl.fbaipublicfiles.com/detectron2/wheels/cu110/torch1.7/index.html
git clone https://github.com/jpWang/LiLT
cd LiLT
pip install -r requirements.txt
pip install -e .
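After installation, a quick optional sanity check that the environment is usable:

import torch, detectron2

# Confirm the pinned versions imported correctly and that CUDA is visible.
print(torch.__version__, detectron2.__version__, torch.cuda.is_available())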
In essence, LiLT stands at the forefront of a new era in document processing technologies, where adaptability, performance, and language independence converge. The journey through LiLT has revealed a transformative tool that holds immense potential for reshaping how we approach and understand structured documents in a world rich with linguistic diversity.