March 29th, 2025
In this blog we will break down why this happens, why it’s a bigger problem than it seems, and what can be done to fix it.
VLMs like QwenVL are impressive on paper. They can analyze images, process text, and even reason about what they “see.” But throw a structured document at them, and suddenly, their confidence outpaces their competence.
The issue? Most VLMs are trained primarily on natural scenes rather than highly structured layouts. Unlike an image of a cat sitting on a couch (where object relationships are fluid), structured documents rely on strict spatial arrangements that VLMs often fail to grasp: they may misplace bounding boxes, merge adjacent fields, or miss cells entirely.
For companies working with automated document understanding — whether in finance, healthcare, or logistics — bounding box failures aren’t just minor annoyances. They can mean misextracted fields, broken downstream automation, and costly manual review.
Let’s put this to the test. Below, we’ll use an open-source Vision-Language Model (Qwen2.5-VL) to detect text in a structured document.
Spoiler alert: it’s not pretty.
The first part of the code installs all the necessary dependencies to run the script.
!pip install git+https://github.com/huggingface/transformers
!pip install qwen-vl-utils
!pip install qwen_agent
!apt-get install fonts-noto-cjk

import os
from PIL import Image, ImageDraw, ImageFont
import requests
from io import BytesIO
from bs4 import BeautifulSoup, Tag
from pathlib import Path
import re
The draw_bbox function is designed to annotate an image with bounding boxes. The function takes the image path, resized dimensions, and prediction data. If the image is provided via a URL, it fetches and loads it using requests. Then, it draws rectangles around elements that are predicted to contain text using ImageDraw.Draw from the PIL library. It also uses BeautifulSoup to parse and extract the bounding box information and text from an HTML response (prediction data).
from IPython.display import display

def draw_bbox(image_path, resized_width, resized_height, full_predict):
    # Load the image from a URL or from a local path
    if image_path.startswith("http"):
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content))
    else:
        image = Image.open(image_path)
    original_width = image.width
    original_height = image.height

    # Parse the model's HTML output and collect every element carrying a data-bbox attribute
    soup = BeautifulSoup(full_predict, 'html.parser')
    elements_with_bbox = soup.find_all(attrs={'data-bbox': True})

    # Skip <ol> containers themselves; keep their <li> children and all other elements
    filtered_elements = []
    for el in elements_with_bbox:
        if el.name == 'ol':
            continue
        filtered_elements.append(el)

    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", 20)
    draw = ImageDraw.Draw(image)

    for element in filtered_elements:
        bbox_str = element['data-bbox']
        text = element.get_text(strip=True)
        x1, y1, x2, y2 = map(int, bbox_str.split())

        # Map coordinates from the model's resized input space back onto the original image
        scale_x = resized_width / original_width
        scale_y = resized_height / original_height
        x1_resized = int(x1 / scale_x)
        y1_resized = int(y1 / scale_y)
        x2_resized = int(x2 / scale_x)
        y2_resized = int(y2 / scale_y)
        if x1_resized > x2_resized:
            x1_resized, x2_resized = x2_resized, x1_resized
        if y1_resized > y2_resized:
            y1_resized, y2_resized = y2_resized, y1_resized

        # Draw the box and the recognized text on the image
        draw.rectangle([x1_resized, y1_resized, x2_resized, y2_resized], outline='red', width=2)
        draw.text((x1_resized, y2_resized), text, fill='black', font=font)

    display(image)
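To make the expected input concrete, here is a toy call to draw_bbox. The HTML snippet and the resized dimensions are invented for illustration; they only mimic the shape of the model's document-parsing output, where each element carries a data-bbox="x1 y1 x2 y2" attribute expressed in the model's resized input coordinates.

# Hypothetical example: two made-up fields drawn on a local receipt image.
sample_predict = """
<html><body>
<p data-bbox="40 60 420 110">Invoice #12345</p>
<p data-bbox="40 140 300 190">Total: $87.20</p>
</body></html>
"""
# 1024x1433 stands in for the resized dimensions the model actually saw.
draw_bbox("/content/receipt2_grid.jpg", 1024, 1433, sample_predict)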
The next function cleans the HTML output produced by the model. It removes unnecessary CSS styles (especially color-related ones) and strips attributes such as data-bbox and data-polygon that are not needed for further processing.
def clean_and_format_html(full_predict):
    soup = BeautifulSoup(full_predict, 'html.parser')

    # Remove color declarations from inline styles
    color_pattern = re.compile(r'\bcolor:[^;]+;?')
    for tag in soup.find_all(style=True):
        original_style = tag.get('style', '')
        new_style = color_pattern.sub('', original_style)
        if not new_style.strip():
            del tag['style']
        else:
            tag['style'] = new_style.rstrip(';')

    # Drop localization attributes that are no longer needed
    for attr in ["data-bbox", "data-polygon"]:
        for tag in soup.find_all(attrs={attr: True}):
            del tag[attr]

    # Collapse formula subclasses into a single 'formula' class
    classes_to_update = ['formula.machine_printed', 'formula.handwritten']
    for tag in soup.find_all(class_=True):
        if isinstance(tag, Tag) and 'class' in tag.attrs:
            new_classes = [cls if cls not in classes_to_update else 'formula' for cls in tag.get('class', [])]
            tag['class'] = list(dict.fromkeys(new_classes))

    # Strip captions from image divs
    for div in soup.find_all('div', class_='image caption'):
        div.clear()
        div['class'] = ['image']

    # Empty out content types we do not post-process
    classes_to_clean = ['music sheet', 'chemical formula', 'chart']
    for class_name in classes_to_clean:
        for tag in soup.find_all(class_=class_name):
            if isinstance(tag, Tag):
                tag.clear()
                if 'format' in tag.attrs:
                    del tag['format']

    # Re-assemble the cleaned body into a single HTML document
    output = []
    for child in soup.body.children:
        if isinstance(child, Tag):
            output.append(str(child))
            output.append('\n')
        elif isinstance(child, str) and not child.strip():
            continue
    complete_html = f"""```html\n<html><body>\n{" ".join(output)}</body></html>\n```"""
    return complete_html

In this section, we install flash-attn, a library designed to speed up the attention mechanism used in transformer models. Then the Qwen model is loaded using the Qwen2_5_VLForConditionalGeneration class from the Hugging Face Transformers library.
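Before installing flash-attn and loading the model, a quick toy call shows the kind of cleanup clean_and_format_html performs; the input HTML below is made up for illustration:

# Invented prediction: one paragraph with a color style and a data-bbox attribute.
raw_predict = '<html><body><p style="color:red; font-size:12px" data-bbox="10 10 200 40">Hello</p></body></html>'
print(clean_and_format_html(raw_predict))
# The printed HTML keeps the text and font-size but drops the color and data-bbox.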
!pip install flash-attn --no-build-isolation --no-cache-dir --verbose
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
model_path = "Qwen/Qwen2.5-VL-72B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

The next function is central to processing the image and generating predictions. It loads the image, formats the chat input, and sends both to the model along with the text prompt; the processor resizes the image to fit the model's input requirements (preserving the aspect ratio). The function returns the model output together with the resized input dimensions, which are needed later for bounding box adjustment.
def inference(img_url, prompt, system_prompt="You are a helpful assistant", max_new_tokens=32000):
    image = Image.open(img_url)
    messages = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt
                },
                {
                    "image": img_url
                }
            ]
        }
    ]

    # Build the chat-formatted prompt and run the processor on text + image
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print("input:\n", text)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to('cuda')

    # Generate and strip the prompt tokens from the output
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print("output:\n", output_text[0])

    # The processor reports the image grid in 14-pixel patches; multiply by 14
    # to recover the height and width the model actually saw
    input_height = inputs['image_grid_thw'][0][1].item() * 14
    input_width = inputs['image_grid_thw'][0][2].item() * 14

    return output_text[0], input_height, input_width
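As an aside, the * 14 above reflects Qwen2.5-VL's 14-pixel vision patches. The same resized dimensions can be computed directly with the smart_resize helper from the Qwen2-VL image processor in Transformers; the height and width below are arbitrary example values:

# Illustrative only: smart_resize rounds both sides to multiples of 28 while
# keeping the total pixel count between min_pixels and max_pixels.
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

resized_height, resized_width = smart_resize(
    height=1080,  # example value, not taken from the receipt used below
    width=764,
    min_pixels=512 * 28 * 28,
    max_pixels=2048 * 28 * 28,
)
print(resized_height, resized_width)  # both values are multiples of 28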
We also called a Vision language model from the UbiAI platform to test and compare results. The function has the same structure as the previous one.
import requests
import json
import mimetypes
import os
from PIL import Image

def inference_with_API(img_url, prompt, system_prompt="You are a helpful assistant", max_new_tokens=32000, target_size=(1024, 1024)):
    # Downscale the image to the target size and save a temporary copy to upload
    image = Image.open(img_url)
    image.thumbnail(target_size)
    temp_image_path = "/tmp/temp_image.png"
    image.save(temp_image_path)

    files = [('file', (os.path.basename(temp_image_path), open(temp_image_path, 'rb'), mimetypes.guess_type(temp_image_path)[0]))]
    data = {
        "input_text": "",
        "system_prompt": system_prompt,
        "user_prompt": prompt,
        "temperature": 0.7,
        "monitor_model": True
    }

    url = ""
    my_token = ""
    try:
        response = requests.post(url + my_token, files=files, data=data)
        print(f"Response Status Code: {response.status_code}")
        if response.status_code == 200:
            res = json.loads(response.content.decode("utf-8"))
            print("API Response:", res)
            output_text = res.get('output', 'No output text in response')

            # The API works on the resized image, so report its dimensions
            resized_width, resized_height = image.size
            print(f"Resized image dimensions: {resized_width}x{resized_height}")
            input_height = resized_height
            input_width = resized_width

            # If the API returns coordinates, rescale them to the resized image
            if 'coordinates' in res:
                coordinates = res['coordinates']
                original_width, original_height = Image.open(img_url).size
                width_ratio = resized_width / original_width
                height_ratio = resized_height / original_height
                scaled_coordinates = [(x * width_ratio, y * height_ratio, w * width_ratio, h * height_ratio)
                                      for (x, y, w, h) in coordinates]
                print(f"Scaled Coordinates: {scaled_coordinates}")

            return output_text, input_height, input_width
        else:
            print("Error:", response.text)
            return None, None, None
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None, None, None
What’s Happening?
In both cases, we load the same receipt image, ask the model how many cells the table contains and what text each cell holds, and then overlay the predicted bounding boxes on the image:
img_url = "/content/receipt2_grid.jpg"
image = Image.open(img_url)

system_prompt = "You are an AI document parser specialized in recognizing and extracting text from images. Your mission is to analyze the image document and generate the result in JSON format"
prompt = """
How many cells are shown in the image in total?
What is the text contained in each cell, output the text and the corresponding cell number.
"""

output, input_height, input_width = inference(img_url, prompt)

# Recompute the resized input dimensions with the same constraints the processor uses;
# smart_resize ships with the Qwen2-VL image processor in Transformers
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

min_pixels = 512 * 28 * 28
max_pixels = 2048 * 28 * 28
image = Image.open(img_url)
width, height = image.size
input_height, input_width = smart_resize(height, width, min_pixels=min_pixels, max_pixels=max_pixels)
print(input_height, input_width)
print(output)

# The extra margins are an ad-hoc adjustment to line the drawn boxes up with the image
draw_bbox(img_url, input_width + 235, input_height + 240, output)
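For the UbiAI comparison, the call mirrors the local one. Note that the url and my_token placeholders inside inference_with_API must be filled in with your own endpoint and token before this will return anything:

# Hypothetical run of the API-based variant on the same receipt and prompt.
api_output, api_height, api_width = inference_with_API(img_url, prompt, system_prompt=system_prompt)
if api_output is not None:
    print(api_output)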
Despite their failures, VLMs aren’t beyond saving. But companies need to stop treating them like one-size-fits-all solutions and start implementing real fixes: fine-tuning on structured, document-specific data, pairing VLMs with specialized OCR and layout tools, and continuously evaluating their outputs.
The Road Ahead
Structured document understanding is a hard problem, but it’s one worth solving. As businesses increasingly rely on AI for document automation, improving bounding box accuracy in VLMs is critical. The future might involve multi-modal fusion models that seamlessly combine image, text, and layout intelligence.

For AI engineers and companies working in this space, the takeaway is clear: don’t blindly trust VLMs with structured text detection. Instead, refine them, combine them with specialized tools, and continuously evaluate their outputs.
Bounding boxes may be failing today, but with the right strategies, they won’t be failing forever.