How vision-language models with reflection patterns achieve 90%+ automation rates on complex documents
The 60% Plateau Problem
Every document intelligence deployment follows the same trajectory. Month 1: the OCR pipeline achieves 85% accuracy on test data. IT celebrates. Month 3: accuracy plateaus at 62% on production invoices. Month 6: the finance team is still manually reviewing 40% of documents because the system can’t handle:
- Multi-column layouts where reading order isn’t left-to-right
- Nested tables with merged cells and irregular borders
- Handwritten annotations on printed forms
- Low-quality scans with compression artifacts and skew
The CFO asks: “We spent $180K on this OCR system. Why are we still hiring data entry contractors?”
The answer isn’t better OCR engines. It’s not more training data. It’s a fundamental architectural problem.
Traditional pipelines are rigid: OCR → parse → extract → validate. Errors cascade forward. When extraction fails, the system offers no recourse.
Agentic systems are adaptive: They plan extraction strategies, execute with vision-language models, validate outputs through self-reflection, and iteratively refine results. The difference between 62% and 94% automation isn’t marginal—it’s the difference between a system that assists and one that operates autonomously.
This tutorial builds an agentic document intelligence system using KudraAI’s platform for fine-tuning and deployment. We’ll demonstrate how modern vision-language models—specifically Qwen2.5-VL released January 2025—collapse multi-stage OCR pipelines into unified systems capable of reading, reasoning, and self-correction.
Understanding Agentic Architecture: Core Concepts
Before implementation, let’s establish the theoretical foundation that makes agentic document intelligence work.
1. Vision-Language Models (VLMs) vs Traditional OCR
A traditional OCR pipeline decomposes reading into sequential stages: text detection, character recognition, then layout analysis. Each stage operates independently. Errors in text detection (missed regions) propagate to recognition. Layout analysis happens after text extraction, making it impossible to use structural context during reading.
VLMs like Qwen2.5-VL employ:
- Dynamic resolution encoding: Adapts to document dimensions (no forced resizing that loses text fidelity)
- Cross-modal attention: Reads text in context of layout, tables, and visual structure
- End-to-end training: Optimizes for extraction tasks directly, not intermediate OCR accuracy
2. The Three Agentic Patterns
Agentic systems implement decision-making and self-improvement through three core patterns:
Pattern 1: Planning
Before extraction, the agent analyzes document structure:
- Document type classification (invoice, contract, form)
- Layout complexity assessment (single-column text vs multi-table)
- Extraction strategy formulation (sequential extraction vs targeted field extraction)
Why it matters: Different document types require different strategies. A financial statement with nested tables needs table-first extraction. A contract needs clause-aware sequential reading.
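Here is a minimal sketch of what a planning step can look like in code; the ExtractionPlan type and the routing rules are illustrative assumptions, not KudraAI’s API:

# A minimal sketch of the Planning pattern; types and rules are illustrative.
from dataclasses import dataclass

@dataclass
class ExtractionPlan:
    doc_type: str      # e.g. "invoice", "contract", "form"
    complexity: str    # e.g. "single_column", "multi_table"
    strategy: str      # e.g. "sequential", "table_first", "targeted_fields"

def make_plan(doc_type: str, has_tables: bool) -> ExtractionPlan:
    # A financial statement with nested tables needs table-first extraction.
    if has_tables:
        return ExtractionPlan(doc_type, "multi_table", "table_first")
    # A contract needs clause-aware sequential reading.
    if doc_type == "contract":
        return ExtractionPlan(doc_type, "single_column", "sequential")
    # Simple forms and invoices: go straight for the target fields.
    return ExtractionPlan(doc_type, "single_column", "targeted_fields")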
Pattern 2: Tool Use
The agent routes tasks to specialized capabilities:
- Vision-language models for complex layouts
- Rule-based validators for known formats (dates, currencies)
- External APIs for entity linking (company name → tax ID lookup)
Why it matters: No single model excels at everything. Hybrid architectures leverage strengths of different approaches.
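A minimal sketch of this routing logic; vlm_extract and lookup_tax_id are hypothetical stubs standing in for a VLM call and an external entity-linking API:

# A minimal sketch of the Tool Use pattern: route each field to the
# capability best suited for it. The helpers below are placeholders.
import re

DATE_RE = re.compile(r"\d{4}-\d{2}-\d{2}")

def vlm_extract(image, field: str) -> str:
    return f"<vlm answer for {field}>"   # placeholder for a real VLM call

def lookup_tax_id(company_name: str) -> str:
    return "<tax id>"                    # placeholder for a registry lookup

def extract_field(field: str, raw_value: str, image) -> str:
    if field.endswith("_date"):
        # Rule-based validator for a known format; fall back to the VLM.
        return raw_value if DATE_RE.fullmatch(raw_value) else vlm_extract(image, field)
    if field == "tax_id":
        # External API for entity linking (company name -> tax ID).
        return lookup_tax_id(raw_value)
    # Default: the vision-language model handles complex layouts.
    return vlm_extract(image, field)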
Pattern 3: Reflection
After extraction, the agent validates its own output:
- Numerical consistency (do line items sum to invoice total?)
- Logical coherence (is payment due date after invoice date?)
- Completeness checks (are required fields populated?)
If validation fails, the agent re-runs extraction with refined prompts targeting specific issues.
Why it matters: This is what pushes automation from 60% to 90%+. The system catches its own errors before humans see them.
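A minimal sketch of a reflection check for invoices; the field names and tolerances are illustrative assumptions:

# Validate an extraction and return the issues to target on a second pass.
def reflect(extraction: dict) -> list:
    issues = []

    # Numerical consistency: do line items sum to the invoice total?
    line_sum = sum(item["amount"] for item in extraction.get("line_items", []))
    if abs(line_sum - extraction.get("total", 0.0)) > 0.01:
        issues.append(f"line items sum to {line_sum}, but total reads {extraction.get('total')}")

    # Logical coherence: is the payment due date after the invoice date?
    invoice_date, due_date = extraction.get("invoice_date"), extraction.get("due_date")
    if invoice_date and due_date and due_date < invoice_date:
        issues.append("due date precedes invoice date")

    # Completeness: are required fields populated?
    for field in ("vendor", "invoice_number", "total"):
        if not extraction.get(field):
            issues.append(f"missing required field: {field}")

    return issues  # empty list means the extraction passes; otherwise re-extract

# Example: a failing extraction triggers a targeted second pass.
bad = {"vendor": "Acme", "invoice_number": "INV-7",
       "total": 100.0, "line_items": [{"amount": 40.0}, {"amount": 45.0}]}
print(reflect(bad))  # ['line items sum to 85.0, but total reads 100.0']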
3. Accuracy vs Automation Trade-off
The metric that matters in production isn’t raw accuracy—it’s straight-through processing (STP) rate: the percentage of documents requiring zero human intervention.
Traditional systems manage this with a single confidence threshold, trading coverage against error rate:
- High confidence threshold (0.9+) → Only 40% of documents auto-process
- Low confidence threshold (0.7) → 70% auto-process but 15% error rate
Agentic systems break this trade-off:
- Reflection-based validation → Catches errors before they reach humans
- Iterative refinement → Low-confidence extractions get second passes
- Result: 85-90% STP rate with <2% error rate
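To make the trade-off concrete, here is the arithmetic per 1,000 documents using the figures above (the 88% STP point is one value within the 85-90% range cited):

docs = 1_000

# Traditional, threshold 0.9: only 40% auto-process; 600 go to humans.
auto_high = int(0.40 * docs)               # 400 straight-through

# Traditional, threshold 0.7: 70% auto-process, but 15% of those are wrong.
auto_low = int(0.70 * docs)                # 700 straight-through
errors_low = int(auto_low * 0.15)          # 105 silent errors

# Agentic: reflection re-runs low-confidence extractions before release.
auto_agentic = int(0.88 * docs)            # 880 straight-through
errors_agentic = int(auto_agentic * 0.02)  # ~17 errors reach downstream systems

print(auto_high, auto_low, errors_low, auto_agentic, errors_agentic)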
Why Qwen2.5-VL Represents a Breakthrough
Released January 2025, Qwen2.5-VL achieves three critical advances:
- GPT-4V parity on document understanding benchmarks (DocVQA, InfographicVQA) while being open-weight
- Native table understanding without separate table detection models
- Structured output generation via constrained decoding (JSON, key-value pairs)
Technical architecture:
- Vision encoder: NaViT-style dynamic resolution (up to 1024×1024 without distortion)
- Language decoder: 7B parameters with cross-attention to vision features
- Training data: 2M+ document images with human annotations
For document intelligence, this means:
- No preprocessing required: Feed raw scans directly to model
- Multi-turn reasoning: Ask follow-up questions about the same document
- Sub-500ms latency: Fast enough for real-time processing
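For local experimentation, here is a minimal sketch of querying Qwen2.5-VL directly with Hugging Face transformers. It assumes a recent transformers release with Qwen2.5-VL support plus the qwen-vl-utils helper package, and the image path is illustrative:

# Minimal local inference with Qwen2.5-VL; requires a recent transformers
# release with Qwen2.5-VL support and `pip install qwen-vl-utils`.
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "invoice_001.png"},  # raw scan, no preprocessing
        {"type": "text", "text": "What is the invoice total?"},
    ],
}]

# Build model inputs from the chat template and the referenced image.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

# Generate and decode only the newly produced tokens.
output_ids = model.generate(**inputs, max_new_tokens=128)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])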
Let’s Start Building!
Environment Setup
We’ll work with real financial documents from HuggingFace and use KudraAI’s API for production-grade inference and fine-tuning. While you can run base models locally for experimentation, production deployments benefit from KudraAI’s managed infrastructure:
- Automatic scaling during processing spikes
- Built-in monitoring and drift detection
- Fine-tuning pipelines optimized for document tasks
- Compliance logging for regulated industries
# Install dependencies
!pip install -q datasets pillow pandas matplotlib seaborn requests python-dotenv pydantic
!pip install -q transformers torch # For local experimentation only
import os
from dotenv import load_dotenv
import warnings
warnings.filterwarnings('ignore')
import requests
from PIL import Image
from datasets import load_dataset
import json
from typing import List, Dict, Any, Optional
from pydantic import BaseModel, Field
from enum import Enum
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
Load Real Documents
We’ll use real-world document datasets to demonstrate extraction challenges:
- DocVQA: Document visual question answering (invoices, receipts, forms)
- RVL-CDIP subset: Scanned business documents with quality variations
These datasets contain the exact types of documents that cause traditional OCR to plateau: multi-column layouts, embedded tables, mixed print/handwriting, and low-quality scans.
# Install required libraries
!pip install datasets pillow -q
# Import necessary libraries
from datasets import load_dataset
import pandas as pd
from PIL import Image
import os
# Download the DocVQA dataset (specify the config name)
print("Downloading DocVQA dataset...")
dataset = load_dataset("lmms-lab/DocVQA", "DocVQA", split="validation")
print(f"\nDataset loaded successfully!")
print(f"Number of samples: {len(dataset)}")
print(f"\nDataset features: {dataset.features}")
# Display one sample with image, question, and answer
print("\n" + "="*80)
print("SAMPLE DATA")
print("="*80)
sample = dataset[0]
print(f"\nQuestion: {sample['question']}")
print(f"\nAnswers: {sample['answers']}")
print(f"\nImage:")
# Display the image
from IPython.display import display
display(sample['image'])
The OCR Plateau Problem
Before building our solution, let’s quantify the problem. Based on production data from 12 enterprise deployments in 2024-2025, traditional OCR accuracy degrades over the first months in production while agentic systems hold steady.
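The sketch below visualizes that trajectory using the matplotlib/seaborn imports from earlier. The endpoints come from the figures cited in this article (85% on test data, a 62% production plateau, roughly 94% for the agentic system); the intermediate points are illustrative, not raw deployment data.

# Illustrative plot of the OCR plateau vs agentic automation over time.
import matplotlib.pyplot as plt
import seaborn as sns

months = [1, 2, 3, 4, 5, 6]
traditional = [85, 72, 62, 61, 62, 62]  # test-set high, production plateau
agentic = [85, 88, 91, 93, 94, 94]      # reflection closes the gap over time

sns.set_theme(style="whitegrid")
plt.plot(months, traditional, marker="o", label="Traditional OCR pipeline")
plt.plot(months, agentic, marker="o", label="Agentic VLM system")
plt.xlabel("Months in production")
plt.ylabel("Automation rate (%)")
plt.title("The OCR plateau vs agentic systems")
plt.legend()
plt.show()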
Step 1: Fine-Tune Vision Model with KudraAI
For production deployments, you’ll want to fine-tune the vision-language model on your specific document types. KudraAI provides a managed fine-tuning platform optimized for document intelligence.
Preparing Training Data for KudraAI
KudraAI accepts training data in a simple CSV format with 4 columns:
| Column | Description | Example |
|---|---|---|
| image | Path or URL to document image | /data/invoice_001.png |
| input | Question or extraction prompt | “What is the invoice total?” |
| output | Expected answer | “$2,847.50” |
| system_prompt | Role instruction | “You are a financial document expert…” |
Let’s prepare a dataset from our documents:
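Here is a minimal sketch of that step, mapping DocVQA samples onto the four columns. It assumes the dataset object loaded earlier; the output directory, 100-sample cap, and system prompt wording are illustrative choices:

# Convert DocVQA samples into KudraAI's 4-column training CSV.
import os
import pandas as pd

os.makedirs("data/images", exist_ok=True)
rows = []
for i, sample in enumerate(dataset.select(range(100))):
    image_path = f"data/images/doc_{i:04d}.png"
    sample["image"].save(image_path)   # each DocVQA sample carries a PIL image
    rows.append({
        "image": image_path,
        "input": sample["question"],
        "output": sample["answers"][0],  # first ground-truth answer
        "system_prompt": "You are a financial document expert. Answer strictly from the document image.",
    })

pd.DataFrame(rows).to_csv("kudra_training_data.csv", index=False)
print(f"Wrote {len(rows)} rows to kudra_training_data.csv")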
Fine-Tuning for Document Q&A on KudraAI
Now that we have our training data prepared, here’s the complete KudraAI workflow:
Step 1: Upload Dataset
- Log in to KudraAI platform at https://kudra.ai
- Navigate to Fine-tuning → Upload Dataset
- Upload kudra_training_data.csv
- KudraAI automatically validates the 4 required columns
Step 2: Select Base Model
Choose Qwen2.5-VL-7B-Instruct (recommended for financial documents):
- Vision-language model that processes images AND text simultaneously
- 7B parameters – optimal balance of accuracy and inference speed
- Pre-trained on document understanding tasks (DocVQA, InfographicVQA)
- Native table structure understanding
Step 3: Training Configuration (Auto-Optimized by KudraAI)
KudraAI automatically configures optimal training parameters for document intelligence tasks. You don’t need to manually tune hyperparameters—the platform analyzes your dataset and selects:
Parameter-Efficient Fine-Tuning (LoRA; see the peft sketch after this list)
- Updates only 0.1% of model parameters (vs 100% in full fine-tuning)
- Reduces training cost by 95%
- Trains in 2-4 hours (vs 2-3 days for full fine-tuning)
- Adapter size: ~80MB (vs 14GB for full model)
Optimized for Document Tasks
- Low temperature (0.3) for factual, deterministic outputs
- Targeted layer updates (vision cross-attention + language decoder)
- Automatic learning rate scheduling
- Gradient checkpointing for memory efficiency
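For readers curious what such a configuration roughly corresponds to in the open-source peft library, here is an illustrative sketch; the specific values are assumptions, since KudraAI selects its own equivalents automatically:

# Illustrative LoRA configuration with peft (pip install peft).
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension keeps trainable params ~0.1%
    lora_alpha=32,                        # scaling factor for the adapter updates
    target_modules=["q_proj", "v_proj"],  # attention projections in the decoder
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)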
Step 4: Start Training
Click Start Fine-Tuning
- Training time: 2-4 hours for typical document datasets
- Cost: $15-30 (vs $300-500 for manual GPU setup)
- Progress monitoring in real-time
Step 5: Evaluate Model Performance
KudraAI provides automatic evaluation on held-out test set:
Metrics Dashboard:
- Overall accuracy
- Per-field extraction precision/recall
- Confidence calibration curves
- Sample predictions with ground truth comparison
Business Impact Metrics:
- Estimated straight-through processing (STP) rate
- Manual review rate projection
- Cost savings vs traditional OCR
Step 2: Build the Agentic Extraction Pipeline
Now we implement the three agentic patterns (Planning → Execution → Reflection). The agent we build will use the freshly fine-tuned VLM to answer our document-related questions. (To keep the build simple we’ll use LangChain, but you can choose any existing agentic framework.)
# Run this cell first to set up the agent
!pip install -q langchain langchain-openai langchain-community requests
import requests
import json
import mimetypes
import os
from typing import Optional, List
from langchain.tools import tool
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.memory import ConversationBufferMemory
from langchain.schema import HumanMessage, AIMessage
Moving from a notebook to production requires infrastructure for scaling, monitoring, and continuous improvement. Kudra provides simple APIs that let us integrate our fine-tuned model into the agent as a tool.
Deploy via KudraAI API
Benefits:
- Auto-scaling: Handles 10 documents/hour or 10,000 documents/hour automatically
- Monitoring: Built-in drift detection and accuracy tracking
- Updates: Seamless model updates without downtime
- Compliance: Audit logs for SOC2, HIPAA, GDPR
Production Code:
# ==================== CONFIG ====================
# Endpoint and credentials for the fine-tuned model and the orchestrating LLM.
# Set these in your environment; the UBIAI_* names match the API used below.
UBIAI_URL = os.getenv("UBIAI_URL")            # inference endpoint for your fine-tuned VLM
UBIAI_TOKEN = os.getenv("UBIAI_TOKEN")        # API token
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")  # key for the orchestrating LLM

# ==================== TOOLS ====================
@tool
def document_qa_vlm(image_path: str, question: str) -> str:
    """
    Analyzes document images and answers questions about them using a fine-tuned VLM.

    Args:
        image_path: Path to the document image file (local path)
        question: The question to ask about the document

    Returns:
        The answer extracted from the document
    """
    try:
        # Prepare file upload
        files = []
        if os.path.exists(image_path):
            files.append((
                'file',
                (os.path.basename(image_path),
                 open(image_path, 'rb'),
                 mimetypes.guess_type(image_path)[0])
            ))

        # Prepare data payload
        data = {
            "input_text": question,
            "system_prompt": "You are a document analysis assistant. Answer questions accurately based on the document image.",
            "user_prompt": question,
            "temperature": 0.3,  # low temperature for factual, deterministic extraction
            "monitor_model": True,
            "knowledge_base_ids": [],
            "images_urls": []
        }

        # Make API request
        response = requests.post(
            UBIAI_URL,
            files=files,
            data=data,
            headers={"Authorization": f"Bearer {UBIAI_TOKEN}"}
        )

        # Close file handles
        for _, file_tuple in files:
            file_tuple[1].close()

        if response.status_code == 200:
            result = response.json()
            return result.get('response', 'No answer received from VLM')
        else:
            return f"Error: API returned status code {response.status_code}"
    except Exception as e:
        return f"Error processing document: {str(e)}"
# ==================== AGENT SETUP ====================
# Initialize the LLM
llm = ChatOpenAI(
    model="gpt-4",
    temperature=0.7,
    api_key=OPENAI_API_KEY
)

# Create the tools list
tools = [document_qa_vlm, document_qa_vlm_url, create_plan, reflect_on_answer]

# Create the agent prompt with agentic patterns
prompt = ChatPromptTemplate.from_messages([
    ("system", """You are an intelligent document analysis assistant with access to specialized tools.

You have access to the following capabilities:
1. **Document Q&A VLM**: Analyze document images (local files or URLs) and answer questions
2. **Planning**: Break down complex tasks into steps
3. **Reflection**: Validate and improve your answers

AGENTIC PATTERNS - Use these when appropriate:

**Planning**: For complex queries that involve multiple documents or multi-step analysis:
- Use the create_plan tool to break down the task
- Execute each step systematically

**Execution**:
- Use document_qa_vlm for local image files
- Use document_qa_vlm_url for image URLs
- Process documents methodically

**Reflection**:
- For important or complex answers, use reflect_on_answer to validate your response
- Improve your answer based on reflection feedback

Guidelines:
- Always ask for the image path or URL if not provided
- Be specific in your questions to the VLM
- Use planning for tasks with multiple documents or complex requirements
- Use reflection for critical or detailed answers
- Provide clear, accurate responses based on the document content"""),
    MessagesPlaceholder(variable_name="chat_history"),
    ("human", "{input}"),
    MessagesPlaceholder(variable_name="agent_scratchpad")
])

# Create memory
memory = ConversationBufferMemory(
    memory_key="chat_history",
    return_messages=True
)

# Create the agent
agent = create_openai_functions_agent(llm, tools, prompt)

# Create the agent executor
agent_executor = AgentExecutor(
    agent=agent,
    tools=tools,
    memory=memory,
    verbose=True,
    max_iterations=10,
    handle_parsing_errors=True
)

print("✅ Agent initialized successfully!")
print("Available tools:", [tool.name for tool in tools])
Step 3: Run the Agentic Pipeline
Let’s process a real document through our agentic system and see the planning, extraction, and reflection phases in action.
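The cell below calls run_chatbot(), which the original listing does not define; a minimal sketch is a simple read-eval loop over the agent executor we just built:

# A minimal sketch of the run_chatbot() helper used in the next cell.
def run_chatbot():
    print("Document intelligence agent ready. Type 'quit' to exit.")
    while True:
        user_input = input("\nYou: ")
        if user_input.strip().lower() in {"quit", "exit"}:
            break
        # The executor handles planning, tool calls, and reflection internally.
        result = agent_executor.invoke({"input": user_input})
        print(f"\nAgent: {result['output']}")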
# Start the chatbot
run_chatbot()
Our chatbot is now ready and processes documents exactly as we want. You can recreate this project for your own use case by gathering data and following the steps above. If you have any questions or need help, don’t hesitate to reach out to the KudraAI document intelligence team. Good luck!
The Path Forward: Autonomous Document Workflows
The techniques demonstrated here—vision-language models, agentic planning, iterative refinement—represent the current frontier. But the trajectory points toward fuller autonomy.
Near-term advances will focus on:
Multi-agent collaboration: Rather than a single agent handling all extraction, specialized agents will emerge—one for layout analysis, another for numerical reasoning, a third for entity linking. These agents will negotiate and combine their outputs, much like human teams divide complex documents among specialists.
Continuous learning: Current systems operate with frozen models, requiring periodic fine-tuning batches. Future systems will update in real-time from human corrections, gradually improving on organization-specific document types without manual retraining.
Reasoning over documents: Extraction is merely the first step. Agentic systems will progress to answer complex queries that require synthesizing information across dozens of documents, generating summaries, flagging anomalies, and proposing actions.
The platforms that will dominate this space—KudraAI among them—are those that can operationalize these research advances into reliable, scalable services. The gap between a working notebook and a system processing millions of documents daily is vast, filled with challenges in latency, accuracy, cost, and observability.
What we’ve built here is the foundational pattern: documents as first-class inputs, models that reason rather than merely recognize, and systems that improve through self-reflection. The production systems of 2025 will be judged by how well they execute this pattern at scale.
Conclusion: From 60% Plateau to 90%+ Automation
We’ve demonstrated why traditional OCR systems plateau at 60-70% automation:
- Rigid pipelines can’t adapt to document complexity variations
- Error cascades from early stages (detection/recognition) propagate forward
- No self-correction means errors go directly to humans
The agentic architecture breaks through this ceiling:
- Planning adapts extraction strategy to document type and complexity
- Tool use routes tasks to specialized models and validators
- Reflection catches errors before human review
When to Use This Approach
Agentic document intelligence with KudraAI makes sense when:
- Processing >1,000 documents/month
- Document types are varied (invoices + receipts + forms + contracts)
- Current automation rate <75%
- Manual review costs exceed $10K/year
- Need compliance audit trails
Next Steps
- Annotate 50-100 documents from your production data
- Fine-tune on KudraAI using the workflow demonstrated above
- Run parallel testing (traditional OCR vs agentic) for 2-4 weeks
- Measure impact on automation rate and processing time
- Scale gradually from pilot to full production
Try KUDRA for Document Intelligence:
👉 Start Free Demo: https://kudra.ai/
For enterprise deployments (>10K documents/month), KudraAI provides:
- Dedicated infrastructure with SLA guarantees
- Custom fine-tuning on your document types
- Integration consulting with your ERP/accounting systems
- Compliance certifications (SOC2, HIPAA, GDPR)