March 29th, 2025
In this blog we will break down why this happens, why it’s a bigger problem than it seems, and what can be done to fix it.
VLMs like QwenVL are impressive on paper. They can analyze images, process text, and even reason about what they “see.” But throw a structured document at them, and suddenly, their confidence outpaces their competence.
The issue? Most VLMs are trained primarily on natural scenes rather than highly structured layouts. Unlike an image of a cat sitting on a couch (where object relationships are fluid), structured documents rely on strict spatial arrangements that VLMs often fail to grasp: they may misplace bounding boxes, merge adjacent fields, or miss cells entirely.
For companies working with automated document understanding — whether in finance, healthcare, or logistics — bounding box failures aren’t just minor annoyances. They can mean misextracted fields, broken downstream automation, and costly manual review.
Let’s put this to the test. Below, we’ll use an open-source Vision-Language Model (Qwen2.5-VL) to detect text in a structured document.
Spoiler alert: it’s not pretty.
The first part of the code installs all the necessary dependencies to run the script.
!pip install git+https://github.com/huggingface/transformers
!pip install qwen-vl-utils
!pip install qwen_agent
!apt-get install fonts-noto-cjk

import os
from PIL import Image, ImageDraw, ImageFont
import requests
from io import BytesIO
from bs4 import BeautifulSoup, Tag
from pathlib import Path
import re
The draw_bbox function is designed to annotate an image with bounding boxes. The function takes the image path, resized dimensions, and prediction data. If the image is provided via a URL, it fetches and loads it using requests. Then, it draws rectangles around elements that are predicted to contain text using ImageDraw.Draw from the PIL library. It also uses BeautifulSoup to parse and extract the bounding box information and text from an HTML response (prediction data).
from IPython.display import display

def draw_bbox(image_path, resized_width, resized_height, full_predict):
    # Load the image from a URL or from a local path
    if image_path.startswith("http"):
        response = requests.get(image_path)
        image = Image.open(BytesIO(response.content))
    else:
        image = Image.open(image_path)
    original_width = image.width
    original_height = image.height

    # Parse the model's HTML output and collect every element carrying a data-bbox attribute
    soup = BeautifulSoup(full_predict, 'html.parser')
    elements_with_bbox = soup.find_all(attrs={'data-bbox': True})

    # Skip <ol> containers themselves; keep their <li> children and all other elements
    filtered_elements = []
    for el in elements_with_bbox:
        if el.name == 'ol':
            continue
        filtered_elements.append(el)

    font = ImageFont.truetype("NotoSansCJK-Regular.ttc", 20)
    draw = ImageDraw.Draw(image)

    for element in filtered_elements:
        bbox_str = element['data-bbox']
        text = element.get_text(strip=True)
        x1, y1, x2, y2 = map(int, bbox_str.split())

        # Map coordinates from the model's resized input space back onto the original image
        scale_x = resized_width / original_width
        scale_y = resized_height / original_height
        x1_resized = int(x1 / scale_x)
        y1_resized = int(y1 / scale_y)
        x2_resized = int(x2 / scale_x)
        y2_resized = int(y2 / scale_y)
        if x1_resized > x2_resized:
            x1_resized, x2_resized = x2_resized, x1_resized
        if y1_resized > y2_resized:
            y1_resized, y2_resized = y2_resized, y1_resized

        # Draw the box and the recognized text on the image
        draw.rectangle([x1_resized, y1_resized, x2_resized, y2_resized], outline='red', width=2)
        draw.text((x1_resized, y2_resized), text, fill='black', font=font)

    display(image)
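To make the expected input concrete, here is a toy call to draw_bbox. The HTML snippet and the resized dimensions are invented for illustration; they only mimic the shape of the model's document-parsing output, where each element carries a data-bbox="x1 y1 x2 y2" attribute expressed in the model's resized input coordinates.

# Hypothetical example: two made-up fields drawn on a local receipt image.
sample_predict = """
<html><body>
<p data-bbox="40 60 420 110">Invoice #12345</p>
<p data-bbox="40 140 300 190">Total: $87.20</p>
</body></html>
"""
# 1024x1433 stands in for the resized dimensions the model actually saw.
draw_bbox("/content/receipt2_grid.jpg", 1024, 1433, sample_predict)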
The next function cleans the HTML output produced by the model. It removes unnecessary CSS styles (especially color-related ones) and strips attributes such as data-bbox and data-polygon that are not needed for further processing.
def clean_and_format_html(full_predict):
    soup = BeautifulSoup(full_predict, 'html.parser')

    # Remove color declarations from inline styles
    color_pattern = re.compile(r'\bcolor:[^;]+;?')
    for tag in soup.find_all(style=True):
        original_style = tag.get('style', '')
        new_style = color_pattern.sub('', original_style)
        if not new_style.strip():
            del tag['style']
        else:
            tag['style'] = new_style.rstrip(';')

    # Drop localization attributes that are no longer needed
    for attr in ["data-bbox", "data-polygon"]:
        for tag in soup.find_all(attrs={attr: True}):
            del tag[attr]

    # Collapse formula subclasses into a single 'formula' class
    classes_to_update = ['formula.machine_printed', 'formula.handwritten']
    for tag in soup.find_all(class_=True):
        if isinstance(tag, Tag) and 'class' in tag.attrs:
            new_classes = [cls if cls not in classes_to_update else 'formula' for cls in tag.get('class', [])]
            tag['class'] = list(dict.fromkeys(new_classes))

    # Strip captions from image divs
    for div in soup.find_all('div', class_='image caption'):
        div.clear()
        div['class'] = ['image']

    # Empty out content types we do not post-process
    classes_to_clean = ['music sheet', 'chemical formula', 'chart']
    for class_name in classes_to_clean:
        for tag in soup.find_all(class_=class_name):
            if isinstance(tag, Tag):
                tag.clear()
                if 'format' in tag.attrs:
                    del tag['format']

    # Re-assemble the cleaned body into a single HTML document
    output = []
    for child in soup.body.children:
        if isinstance(child, Tag):
            output.append(str(child))
            output.append('\n')
        elif isinstance(child, str) and not child.strip():
            continue
    complete_html = f"""```html\n<html><body>\n{" ".join(output)}</body></html>\n```"""
    return complete_html

In this section, we install flash-attn, a library designed to speed up the attention mechanism used in transformer models. Then the Qwen model is loaded using the Qwen2_5_VLForConditionalGeneration class from the Hugging Face Transformers library.
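Before installing flash-attn and loading the model, a quick toy call shows the kind of cleanup clean_and_format_html performs; the input HTML below is made up for illustration:

# Invented prediction: one paragraph with a color style and a data-bbox attribute.
raw_predict = '<html><body><p style="color:red; font-size:12px" data-bbox="10 10 200 40">Hello</p></body></html>'
print(clean_and_format_html(raw_predict))
# The printed HTML keeps the text and font-size but drops the color and data-bbox.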
!pip install flash-attn --no-build-isolation --no-cache-dir --verbose
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
model_path = "Qwen/Qwen2.5-VL-72B-Instruct"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

The next function is central to processing the image and generating predictions. It loads the image, formats the chat input, and sends both to the model along with the text prompt; the processor resizes the image to fit the model's input requirements (preserving the aspect ratio). The function returns the model output together with the resized input dimensions, which are needed later for bounding box adjustment.
def inference(img_url, prompt, system_prompt="You are a helpful assistant", max_new_tokens=32000):
    image = Image.open(img_url)
    messages = [
        {
            "role": "system",
            "content": system_prompt
        },
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": prompt
                },
                {
                    "image": img_url
                }
            ]
        }
    ]

    # Build the chat-formatted prompt and run the processor on text + image
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    print("input:\n", text)
    inputs = processor(text=[text], images=[image], padding=True, return_tensors="pt").to('cuda')

    # Generate and strip the prompt tokens from the output
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    print("output:\n", output_text[0])

    # The processor reports the image grid in 14-pixel patches; multiply by 14
    # to recover the height and width the model actually saw
    input_height = inputs['image_grid_thw'][0][1].item() * 14
    input_width = inputs['image_grid_thw'][0][2].item() * 14

    return output_text[0], input_height, input_width
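As an aside, the * 14 above reflects Qwen2.5-VL's 14-pixel vision patches. The same resized dimensions can be computed directly with the smart_resize helper from the Qwen2-VL image processor in Transformers; the height and width below are arbitrary example values:

# Illustrative only: smart_resize rounds both sides to multiples of 28 while
# keeping the total pixel count between min_pixels and max_pixels.
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

resized_height, resized_width = smart_resize(
    height=1080,  # example value, not taken from the receipt used below
    width=764,
    min_pixels=512 * 28 * 28,
    max_pixels=2048 * 28 * 28,
)
print(resized_height, resized_width)  # both values are multiples of 28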
We also called a Vision language model from the UbiAI platform to test and compare results. The function has the same structure as the previous one.
import requests
import json
import mimetypes
import os
from PIL import Image

def inference_with_API(img_url, prompt, system_prompt="You are a helpful assistant", max_new_tokens=32000, target_size=(1024, 1024)):
    # Downscale the image to the target size and save a temporary copy to upload
    image = Image.open(img_url)
    image.thumbnail(target_size)
    temp_image_path = "/tmp/temp_image.png"
    image.save(temp_image_path)

    files = [('file', (os.path.basename(temp_image_path), open(temp_image_path, 'rb'), mimetypes.guess_type(temp_image_path)[0]))]
    data = {
        "input_text": "",
        "system_prompt": system_prompt,
        "user_prompt": prompt,
        "temperature": 0.7,
        "monitor_model": True
    }

    url = ""
    my_token = ""
    try:
        response = requests.post(url + my_token, files=files, data=data)
        print(f"Response Status Code: {response.status_code}")
        if response.status_code == 200:
            res = json.loads(response.content.decode("utf-8"))
            print("API Response:", res)
            output_text = res.get('output', 'No output text in response')

            # The API works on the resized image, so report its dimensions
            resized_width, resized_height = image.size
            print(f"Resized image dimensions: {resized_width}x{resized_height}")
            input_height = resized_height
            input_width = resized_width

            # If the API returns coordinates, rescale them to the resized image
            if 'coordinates' in res:
                coordinates = res['coordinates']
                original_width, original_height = Image.open(img_url).size
                width_ratio = resized_width / original_width
                height_ratio = resized_height / original_height
                scaled_coordinates = [(x * width_ratio, y * height_ratio, w * width_ratio, h * height_ratio)
                                      for (x, y, w, h) in coordinates]
                print(f"Scaled Coordinates: {scaled_coordinates}")

            return output_text, input_height, input_width
        else:
            print("Error:", response.text)
            return None, None, None
    except requests.exceptions.RequestException as e:
        print(f"Request failed: {e}")
        return None, None, None
What’s Happening?
In both cases, we load the same receipt image, ask the model how many cells the table contains and what text each cell holds, and then overlay the predicted bounding boxes on the image:
img_url = "/content/receipt2_grid.jpg"
image = Image.open(img_url)

system_prompt = "You are an AI document parser specialized in recognizing and extracting text from images. Your mission is to analyze the image document and generate the result in JSON format"
prompt = """
How many cells are shown in the image in total?
What is the text contained in each cell, output the text and the corresponding cell number.
"""

output, input_height, input_width = inference(img_url, prompt)

# Recompute the resized input dimensions with the same constraints the processor uses;
# smart_resize ships with the Qwen2-VL image processor in Transformers
from transformers.models.qwen2_vl.image_processing_qwen2_vl import smart_resize

min_pixels = 512 * 28 * 28
max_pixels = 2048 * 28 * 28
image = Image.open(img_url)
width, height = image.size
input_height, input_width = smart_resize(height, width, min_pixels=min_pixels, max_pixels=max_pixels)
print(input_height, input_width)
print(output)

# The extra margins are an ad-hoc adjustment to line the drawn boxes up with the image
draw_bbox(img_url, input_width + 235, input_height + 240, output)
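For the UbiAI comparison, the call mirrors the local one. Note that the url and my_token placeholders inside inference_with_API must be filled in with your own endpoint and token before this will return anything:

# Hypothetical run of the API-based variant on the same receipt and prompt.
api_output, api_height, api_width = inference_with_API(img_url, prompt, system_prompt=system_prompt)
if api_output is not None:
    print(api_output)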
Despite their failures, VLMs aren’t beyond saving. But companies need to stop treating them like one-size-fits-all solutions and start implementing real fixes: fine-tuning on structured, document-specific data, pairing VLMs with specialized OCR and layout tools, and continuously evaluating their outputs.
The Road Ahead
Structured document understanding is a hard problem, but it’s one worth solving. As businesses increasingly rely on AI for document automation, improving bounding box accuracy in VLMs is critical. The future might involve multi-modal fusion models that seamlessly combine image, text, and layout intelligence.

For AI engineers and companies working in this space, the takeaway is clear: don’t blindly trust VLMs with structured text detection. Instead, refine them, combine them with specialized tools, and continuously evaluate their outputs.
Bounding boxes may be failing today, but with the right strategies, they won’t be failing forever.