The integration of language and vision has become one of the most active fronts in artificial intelligence. Among the growing set of multimodal models, LLaVA stands out as a strong open-source contender, matching or surpassing many models in the domain on multimodal benchmarks.
Multimodal models represent the next frontier in AI, combining the power of language understanding with visual comprehension. Unlike traditional models that focus solely on text, multimodal models like LLaVA accept both text and images as inputs, allowing them to interpret and generate responses that seamlessly blend language and visual elements.
LLaVA illustrates the synergy achievable when language and vision converge. At its core, it pairs a vision encoder with a large language model (LLM): the vision encoder, CLIP ViT-L/14, extracts features from images, while the language model, Vicuna, is an instruction-tuned variant of the open-source LLaMA model, refined for precise instruction following.
The training process has two phases. First, visual features are aligned with the language model using image-text pairs. Second, the model is fine-tuned on visual instruction-following data, a more computationally demanding step. Despite this, LLaVA trains efficiently and performs well across diverse tasks.
The success of LLaVA lies not just in its architecture but in how it improves on other models. LLaVA 1.5, the latest iteration, replaces the original linear projection with an MLP (multi-layer perceptron) connector between the vision encoder and the language model, and adds academic task-oriented training data. Together, these changes deliver a clear performance gain over its predecessors and competitors.
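To make the connector concrete, here is a minimal sketch, not LLaVA's actual code, of how a two-layer MLP can project CLIP image features into the language model's embedding space before they are spliced into the text token sequence. The dimensions (1024-d CLIP ViT-L/14 features, 4096-d Vicuna-7B embeddings, 576 patches for a 336x336 image) are used purely for illustration.
import torch
import torch.nn as nn

# Illustrative dimensions: CLIP ViT-L/14 produces 1024-d patch features,
# while a 7B Vicuna model uses 4096-d token embeddings.
VISION_DIM, LLM_DIM = 1024, 4096

class MLPProjector(nn.Module):
    """Two linear layers with a GELU, mirroring the mlp2x_gelu connector of LLaVA 1.5."""
    def __init__(self, vision_dim=VISION_DIM, llm_dim=LLM_DIM):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, image_features):    # (batch, num_patches, vision_dim)
        return self.proj(image_features)  # (batch, num_patches, llm_dim)

# A 336x336 image yields 24x24 = 576 patch tokens with a 14-pixel patch size.
image_features = torch.randn(1, 576, VISION_DIM)
visual_tokens = MLPProjector()(image_features)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
# These visual tokens are spliced into the text embedding sequence at the <image> position.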
In comparison to proprietary models like GPT-4 Vision, LLaVA stands tall as an open-source alternative. This openness not only fosters innovation but also addresses potential restrictions imposed by proprietary models. While GPT-4 Vision may hold its ground, LLaVA’s cost-effectiveness, scalability, and noteworthy performance in multimodal benchmarks make it a compelling choice for those seeking open-source solutions.
In the sections that follow, you will fine-tune LLaVA on a custom dataset, tailoring it to your specific requirements.
To ensure an efficient fine-tuning process for LLaVA-v1.5 (the walkthrough below uses the 7B checkpoint; the 13B variant needs proportionally more memory), consider the following hardware requirements:
Recommended GPUs: High-end GPUs such as the NVIDIA A100 or V100 are suggested for faster training.
Cloud-based Services: If access to such GPUs is unavailable, explore cloud-based services that offer GPU capabilities.
Memory Requirement: GPUs with at least 40 GB of memory are recommended during the fine-tuning of LLaVA; 80 GB gives more headroom. The snippet after this list shows a quick way to check what is available.
Reducing Training Time: If feasible, employ parallelization across multiple GPUs. This strategy helps reduce the overall training time, enhancing efficiency.
Adequate Storage: Ensure you have sufficient storage space to accommodate the model, datasets, and checkpoints generated during the fine-tuning process.
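Before committing to a run, a quick check of the GPUs visible to PyTorch can save time. A minimal snippet, assuming PyTorch is already installed:
import torch

# Report the GPUs PyTorch can see and their memory, before committing to a run.
if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        print(f"GPU {i}: {props.name}, {props.total_memory / 1024**3:.1f} GB")
else:
    print("No CUDA GPU detected; fine-tuning LLaVA will not be practical on CPU.")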
Data labeling is a pivotal step in fine-tuning the LLaVA model. We chose to leverage UBIAI (https://ubiai.tools), a tool that extracts information from images using high-quality OCR models, to ensure the data was labeled correctly.
Our data labeling process goes beyond simple categorization. As an example, we annotated a dataset for form understanding with three tags: “QUESTION”, “ANSWER”, and a link tag that connects each question to its answer, or to multiple answers. For our use case, the “QUESTION” corresponds to the prompt sent to the multimodal model, while the “ANSWER” corresponds to the expected response, which fits the data structure required for fine-tuning. This detailed, context-rich labeling contributes significantly to the model’s understanding during training.
Recognizing the need for simplicity, we developed a script that converts UBIAI’s annotation results into the data structure LLaVA requires (a simplified sketch of such a conversion appears after the data-format example below). This script, bundled with the article, streamlines the process and helps preserve the quality of the labeled data. We believe the combination of UBIAI’s precision and this conversion script contributes significantly to the success of your fine-tuning journey.
The result is a precisely labeled dataset, extracted with UBIAI’s OCR models and ready to be converted into the structure LLaVA expects for fine-tuning:
data = [
    {
        "id": "1",
        "image": "path/to/image1.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nWhat can you tell me about this scene?"
            },
            {
                "from": "gpt",
                "value": "The target answer describing the scene (the label the model learns to produce)."
            }
        ]
    },
    {
        "id": "2",
        "image": "path/to/image2.jpg",
        "conversations": [
            {
                "from": "human",
                "value": "<image>\nDescribe the content of this image."
            },
            {
                "from": "gpt",
                "value": "The target description of the image content."
            }
        ]
    }
]
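As a rough illustration of what the conversion script mentioned earlier does, the sketch below maps a hypothetical UBIAI-style export (the field names are illustrative assumptions, not UBIAI’s actual schema) onto the conversations structure shown above and writes out the JSON file that the training script consumes.
import json

# Hypothetical UBIAI-style export: the field names below are illustrative assumptions,
# not UBIAI's actual schema. Each record pairs a QUESTION span with its linked ANSWER span.
ubiai_export = [
    {"image": "path/to/form1.jpg", "question": "Date of birth", "answer": "01/01/1990"},
    {"image": "path/to/form2.jpg", "question": "Marital status", "answer": "Single"},
]

def to_llava_format(records):
    """Map QUESTION/ANSWER annotation pairs onto LLaVA's conversations structure."""
    samples = []
    for idx, rec in enumerate(records, start=1):
        samples.append({
            "id": str(idx),
            "image": rec["image"],
            "conversations": [
                {"from": "human", "value": "<image>\n" + rec["question"]},
                {"from": "gpt", "value": rec["answer"]},
            ],
        })
    return samples

# Serialize to the JSON file that --data_path expects (the example `data` list above
# could be written out the same way with json.dump).
with open("data.json", "w") as f:
    json.dump(to_llava_format(ubiai_export), f, indent=2)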
To begin fine-tuning LLaVA, we first set up the environment and install the required packages. The following lines of code install the key dependencies, including PyTorch, Transformers, DeepSpeed, and WandB. Let’s dive into the initial setup:
import os
# Install necessary packages
!pip install torch==1.10.0+cu113 torchvision==0.11.1+cu113 torchaudio==0.10.0+cu113 -f https://download.pytorch.org/whl/cu113/torch_stable.html
!pip install transformers
!pip install deepspeed
!pip install wandb
# Clone the LLaVA repository
!git clone https://github.com/haotian-liu/LLaVA.git
os.chdir("LLaVA")
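Depending on the revision of the repository, you may also need to install the LLaVA package itself together with its training extras from the freshly cloned directory. This also pins the torch and transformers versions the project expects, superseding the generic installs above if they differ; check the repo’s README if these commands change.
# Install the LLaVA package plus its training dependencies from the cloned repo.
# The exact extras and the flash-attn requirement may vary by repo revision.
!pip install --upgrade pip
!pip install -e .
!pip install -e ".[train]"
!pip install flash-attn --no-build-isolation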
With the environment set up, we proceed to load the pre-trained LLaVA model for fine-tuning. The following lines of code utilize the LLaVA model builder and evaluation utilities. The specified model path, ‘liuhaotian/llava-v1.5-7b,’ is loaded along with associated components such as the tokenizer, the model itself, the image processor, and the context length. Let’s take a closer look at the fine-tuning initialization:
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
model_path = "liuhaotian/llava-v1.5-7b"
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=model_path,
    model_base=None,
    model_name=get_model_name_from_path(model_path),
    offload_folder="/content/llava_model"
)
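A quick, purely illustrative sanity check confirms that the weights and processor loaded as expected before any fine-tuning starts:
# Confirm the weights and processor loaded as expected.
print("Model class:", type(model).__name__)       # a LLaVA causal-LM class
print("Context length:", context_len)             # typically 2048 for llava-v1.5-7b
print("Parameters (B):", round(sum(p.numel() for p in model.parameters()) / 1e9, 2))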
To initiate the fine-tuning process, we configure the necessary paths and parameters for the LLaVA model. The script below uses DeepSpeed for efficient distributed training. The paths for the data, images, and output directory are defined, ensuring a seamless integration of the fine-tuning pipeline. The command includes LoRA (Low-Rank Adaptation) settings for parameter-efficient fine-tuning, the DeepSpeed configuration, and model-specific details such as the vision tower. Let’s look at the fine-tuning command script:
# Assign paths to variables
DEEPSPEED_SCRIPT = "deepspeed llava/train/train_mem.py"
DEEPSPEED_JSON = "./scripts/zero3.json"
MODEL_NAME = "liuhaotian/llava-v1.5-7b"
DATA_PATH = "/path/to/your/data.json" # Replace with your JSON data path
IMAGE_FOLDER = "/path/to/your/image_folder" # Replace with your image folder path
VISION_TOWER = "openai/clip-vit-large-patch14-336"
OUTPUT_DIR = "/path/to/your/output_directory" # Replace with your desired output directory path
# Command to run the script
finetune_script = f'''
{DEEPSPEED_SCRIPT} \
--lora_enable True --lora_r 128 --lora_alpha 256 --mm_projector_lr 2e-5 \
--deepspeed {DEEPSPEED_JSON} \
--model_name_or_path {MODEL_NAME} \
--version v1 \
--data_path {DATA_PATH} \
--image_folder {IMAGE_FOLDER} \
--vision_tower {VISION_TOWER} \
--mm_projector_type mlp2x_gelu \
--mm_vision_select_layer -2 \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--image_aspect_ratio pad \
--group_by_modality_length True \
--bf16 True \
--output_dir {OUTPUT_DIR} \
--num_train_epochs 5 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 1 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 50000 \
--save_total_limit 1 \
--learning_rate 2e-4 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 2048 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to wandb
'''
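The --lora_r and --lora_alpha flags control the low-rank update that LoRA learns: each adapted weight matrix W receives an additive update of rank r, scaled by alpha / r. A minimal numerical sketch of the idea (illustrative only, not the implementation the training script uses):
import torch

# Illustrative LoRA update: W' = W + (alpha / r) * (B @ A), with r much smaller than the weight dims.
d_out, d_in, r, alpha = 4096, 4096, 128, 256
W = torch.randn(d_out, d_in)           # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01        # trainable low-rank factor
B = torch.zeros(d_out, r)              # trainable factor, zero-initialized so training starts from W
W_adapted = W + (alpha / r) * (B @ A)  # conceptually what merge_lora_weights.py folds into the base model
print(W_adapted.shape)                 # torch.Size([4096, 4096])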
To commence the fine-tuning process, we clear the CUDA cache using torch.cuda.empty_cache() to ensure efficient memory utilization. The subsequent line executes the fine-tuning script, orchestrating the training of the LLaVA model with the configured settings.
import torch
torch.cuda.empty_cache()
# Execute the fine-tuning script
!{finetune_script}
After fine-tuning, we merge the LoRA (Low-Rank Adaptation) weights with the base model weights using the script merge_lora_weights.py. This folds the low-rank updates learned during fine-tuning back into the LLaVA model, producing a single merged checkpoint that is saved for subsequent use.
!python /content/LLaVA/scripts/merge_lora_weights.py --model-path /path/to/checkpoints/llava-v1.5-7b-task-lora --model-base liuhaotian/llava-v1.5-7b --save-model-path /output/merged_model
To validate the efficacy of our fine-tuned LLaVA model, we load the merged model obtained after fine-tuning. This involves setting up the evaluation parameters, including the prompt and image file, to assess the model’s performance on a specific task. The evaluation script is then executed, providing insight into the model’s ability to generate accurate, contextually relevant responses in a multimodal setting.
from llava.model.builder import load_pretrained_model
from llava.mm_utils import get_model_name_from_path
from llava.eval.run_llava import eval_model
# Path to your fine-tuned model
fine_tuned_model_path = "/path/to/your/llava_merged_model"
# Load the fine-tuned model
tokenizer, model, image_processor, context_len = load_pretrained_model(
    model_path=fine_tuned_model_path,
    model_base=None,  # adjust if necessary based on your training configuration
    model_name=get_model_name_from_path(fine_tuned_model_path)
)
# Evaluation setup
prompt = "Extract the checked answers from the checkboxes."
image_file = "/path/to/your/test_image.jpg"
# Set up evaluation arguments
args = type('Args', (), {
    "model_path": fine_tuned_model_path,
    "model_base": None,
    "model_name": get_model_name_from_path(fine_tuned_model_path),
    "query": prompt,
    "conv_mode": None,
    "image_file": image_file,
    "sep": ",",
    "temperature": 0,
    "top_p": None,
    "num_beams": 1,
    "max_new_tokens": 512
})()
# Perform evaluation with the fine-tuned model
eval_model(args)
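To spot-check several documents, the same arguments object can be built in a loop over a folder of test images. The folder path below is a placeholder, and the loop reuses the prompt and fine_tuned_model_path defined above; note that the reference eval_model reloads the model on each call, so this favors simplicity over speed.
import glob

# Run the same prompt against every image in a test folder (the folder path is a placeholder).
for image_path in sorted(glob.glob("/path/to/your/test_images/*.jpg")):
    args = type('Args', (), {
        "model_path": fine_tuned_model_path,
        "model_base": None,
        "model_name": get_model_name_from_path(fine_tuned_model_path),
        "query": prompt,
        "conv_mode": None,
        "image_file": image_path,
        "sep": ",",
        "temperature": 0,
        "top_p": None,
        "num_beams": 1,
        "max_new_tokens": 512
    })()
    print(f"--- {image_path} ---")
    eval_model(args)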
In summary, multimodal models bring language and vision together, and fine-tuning, as demonstrated here with LLaVA, adapts that capability to specific tasks. The tutorial is available as a Colab notebook, so you can experiment with parameter adjustments, tailor the model to your own data, and contribute to the continued advance of multimodal AI.