Building RAG (Retrieval-Augmented Generation) applications on top of PDF documents presents one of the most persistent challenges in modern AI development. The core issue? Parsing embedded objects like tables, figures, charts, and complex layouts that traditional parsing techniques simply cannot handle accurately.
Imagine trying to extract quarterly financial data from a 150-page SEC filing where critical information is locked inside multi-level tables with merged cells, nested headers, and footnotes. Or consider parsing medical research papers where data tables span multiple pages and reference complex figures. Standard PDF extraction tools give you garbled text, broken tables, and completely miss embedded visual elements.
The software development community has responded with various solutions including LLMSherpa and unstructured.io. But the latest—and arguably most powerful—addition to this toolkit is LlamaParse.
LlamaParse is a specialized document parsing library developed by the Llama Index team, designed from the ground up to efficiently extract complex tables, figures, and embedded objects from PDFs, PowerPoints, Word documents, and more. What makes it particularly powerful is its seamless integration with Llama Index, one of the most respected LLM frameworks available today. This integration means you can go from raw PDF to queryable knowledge base in minutes, not hours.
In this comprehensive tutorial, we’ll build a complete RAG application that can parse complex documents, extract embedded tables with multi-level hierarchies, and enable natural language querying of the extracted data. By the end, you’ll have a production-ready system that handles documents traditional parsers can’t touch.
Document Parsing Challenges
Before diving into the solution, let’s understand why document parsing is so challenging:
The Core Problem
Creating RAG applications on top of PDF documents presents significant challenges that many developers face daily. The most critical issue? Parsing embedded objects—tables, figures, charts, and diagrams—that conventional parsing techniques struggle to interpret accurately.
When building RAG applications, the quality of your parsing directly determines the quality of your answers. If your parser mangles a table, your LLM can’t answer “What was Cloud revenue in Q1 2024?” accurately, even with the most sophisticated prompting.
This is where specialized tools like LlamaParse, LLMSherpa, and unstructured.io come in—they understand document structure, not just raw text.
What is LlamaParse?
LlamaParse is a document parsing library developed by Llama Index specifically designed to extract complex tables, figures, and embedded objects from documents like PDFs, PowerPoint presentations, Word documents, and more.
Key Characteristics
| Feature | Description |
|---|---|
| Specialized Extraction | Built specifically for embedded objects, not general text |
| Llama Index Integration | Seamless compatibility with the entire Llama Index ecosystem |
| Markdown Output | Converts documents to structured markdown format |
| Multi-Format Support | Handles 10+ file types (.pdf, .pptx, .docx, .html, .xml, etc.) |
| Cloud-Based | Part of the LlamaCloud platform |
| Free Tier | 1000 pages per day parsing limit on free plan |
What Makes LlamaParse Special?
Unlike general-purpose LLM frameworks, LlamaParse is laser-focused on one problem: extracting structure from documents. It doesn’t try to be a complete application framework—that’s what Llama Index is for. Instead, it does one thing exceptionally well and integrates perfectly with the tools you’re already using.
The LlamaParse Advantage
Because LlamaParse is developed by the Llama Index team, the integration is first-class:
- No format conversion hassles: Parse directly into Llama Index document format
- Native node support: Parsed elements become Llama Index nodes automatically
- Agent compatibility: Use with Llama Index’s extensive agent ecosystem
- Tool integration: Combine with retrieval tools, query engines, and more
This seamless integration is a game-changer—it means less glue code, fewer bugs, and faster development.
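To make that concrete, here is a minimal sketch of the integration pattern (the file name report.pdf is a placeholder): LlamaParse plugs into Llama Index’s standard SimpleDirectoryReader through the file_extractor hook, so parsed pages come back as native Llama Index Document objects with no conversion step.
# Minimal integration sketch: route PDFs through LlamaParse inside a standard loader
# Assumes LLAMA_CLOUD_API_KEY is already set; "report.pdf" is a placeholder file name
from llama_parse import LlamaParse
from llama_index.core import SimpleDirectoryReader
parser = LlamaParse(result_type="markdown")
reader = SimpleDirectoryReader(
    input_files=["report.pdf"],
    file_extractor={".pdf": parser},  # use LlamaParse for .pdf files
)
documents = reader.load_data()  # native Llama Index Document objects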
Practical Applications
LlamaParse excels at:
- Financial Document Analysis: Extract tables from earnings reports, SEC filings, financial statements
- Research Paper Processing: Parse academic papers with complex figures and data tables
- Legal Document Review: Extract structured data from contracts, agreements, court filings
- Medical Records: Parse clinical documents with embedded test results and charts
- Business Intelligence: Extract data from PowerPoint presentations and reports
In essence, LlamaParse transforms complex unstructured data (tables, images, figures) into structured formats that LLMs can actually reason about—crucial for any serious RAG application.
Step 1: Get the LlamaCloud API Key
LlamaParse is part of the LlamaCloud platform, which means you need a LlamaCloud account to obtain an API key.
Account Creation Process
1. Visit LlamaCloud: Navigate to cloud.llamaindex.ai
2. Create an Account: Sign up with your email or GitHub account
3. Access API Keys: Once logged in, go to the API Keys section
4. Generate New Key: Click “Create New API Key”
5. Copy and Save: Copy your API key immediately (you won’t be able to see it again)
Important Notes
- Free Tier: The free plan allows parsing up to 1000 pages per day
- Security: Never commit API keys to version control
- Rate Limits: Monitor your usage to avoid hitting daily limits
- Key Rotation: Rotate keys periodically for security
For development and testing, the free tier is more than sufficient. Once you’re ready for production, evaluate your volume needs.
Let’s proceed to setting up the environment!
Step 2: Install Required Libraries
Now let’s install the necessary Python packages. For this tutorial, we only need two core libraries:
- llama-index: The main LLM framework
- llama-parse: The document parsing library
We’ll be using OpenAI’s models for the LLM and embedding generation, which are included in the llama-index installation.
Get The Full Notebook From: https://discord.gg/UKDUXXRJtM
# Install required packages
!pip install llama-index
!pip install llama-parse
What Gets Installed
When you install these packages, you get:
llama-index includes:
- Core indexing and retrieval engines
- Vector store integrations
- LLM provider integrations (OpenAI, Anthropic, etc.)
- Query engines and agents
- Document loaders and node parsers
llama-parse includes:
- LlamaParse client for API communication
- Document format handlers
- Markdown conversion utilities
- Integration with Llama Index document structure
Version Compatibility
These packages are designed to work together seamlessly. If you encounter version conflicts:
# Install specific versions
pip install llama-index==0.10.0
pip install llama-parse==0.4.0
Check the official documentation for the latest compatible versions.
Now that we have the libraries installed, let’s configure our environment!
Step 3: Set Environment Variables
Before we can use LlamaParse and OpenAI’s models, we need to configure our API keys as environment variables.
Security Best Practices
Never hardcode API keys in your code! Instead:
- Use environment variables
- Store keys in .env files (and add them to .gitignore)
- Use secret management services in production (AWS Secrets Manager, Azure Key Vault, etc.)
Let’s set up our keys:
import os
# Set OpenAI API key
# Replace 'sk-proj-****' with your actual OpenAI API key
os.environ['OPENAI_API_KEY'] = 'sk-proj-****'
# Set LlamaCloud API key
# Replace 'llx-****' with your actual LlamaCloud API key from Step 1
os.environ["LLAMA_CLOUD_API_KEY"] = 'llx-****'
print("✓ API keys configured")
print(" - OpenAI API key set")
print(" - LlamaCloud API key set")
Alternative: Using .env Files
For better security and portability, create a .env file:
# .env file
OPENAI_API_KEY=sk-proj-your-key-here
LLAMA_CLOUD_API_KEY=llx-your-key-here
Then load it with:
# Requires the python-dotenv package: pip install python-dotenv
from dotenv import load_dotenv
load_dotenv()
Verifying Configuration
You can verify your keys are set without exposing them:
# Verify keys are set (without printing actual values)
def verify_keys():
openai_set = bool(os.getenv('OPENAI_API_KEY'))
llama_set = bool(os.getenv('LLAMA_CLOUD_API_KEY'))
print("Configuration Check:")
print(f" OpenAI API Key: {'✓ Set' if openai_set else '✗ Missing'}")
print(f" LlamaCloud API Key: {'✓ Set' if llama_set else '✗ Missing'}")
if openai_set and llama_set:
print("\n✓ All keys configured successfully!")
return True
else:
print("\n✗ Some keys are missing. Please set them before proceeding.")
return False
verify_keys()
What These Keys Do
OPENAI_API_KEY: Authenticates requests to OpenAI’s API for:
- LLM inference (GPT-3.5, GPT-4)
- Text embeddings generation
- Response synthesis
LLAMA_CLOUD_API_KEY: Authenticates requests to LlamaCloud for:
- Document parsing with LlamaParse
- Access to parsing infrastructure
- Usage tracking and rate limiting
Now we’re ready to initialize our models!
Step 4: Configure the LLM and Embedding Models
Llama Index uses a Settings module to configure global defaults for LLMs and embedding models. This is incredibly convenient—set them once, and every component in your pipeline uses these models automatically.
Model Selection
For this tutorial, we’ll use:
- LLM: gpt-3.5-turbo-0125 (fast, cost-effective, good for most tasks)
- Embeddings: text-embedding-3-small (high quality, 1536-dimension vectors)
Why These Models?
| Model | Purpose | Strengths |
|---|---|---|
| gpt-3.5-turbo-0125 | Query processing, synthesis | Fast, cost-effective, good instruction following |
| text-embedding-3-small | Vector embeddings | Excellent quality, compact dimensions, affordable |
For production applications handling complex documents, you might upgrade to:
- LLM: gpt-4 or gpt-4-turbo for better reasoning
- Embeddings: text-embedding-3-large for higher-dimensional precision
Let’s initialize:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.core import Settings
# Initialize embedding model
# This model converts text into numerical vectors for similarity search
embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# Initialize language model
# This model processes queries and generates responses
llm = OpenAI(model="gpt-3.5-turbo-0125")
# Set global defaults
# All Llama Index components will use these models unless overridden
Settings.llm = llm
Settings.embed_model = embed_model
print("✓ Models initialized successfully")
print(f" LLM: {llm.model}")
print(f" Embedding Model: {embed_model.model_name}")
print(f" Embedding Dimensions: 1536") # text-embedding-3-small dimensions
Understanding the Settings Module
The Settings module is powerful because it:
- Provides Global Defaults: Set once, use everywhere
- Simplifies Code: No need to pass models to every component
- Enables Easy Switching: Change models in one place to update entire pipeline
- Supports Overrides: Individual components can still use custom models if needed
Advanced Configuration
You can customize model behavior with additional parameters:
# Example: Advanced LLM configuration
advanced_llm = OpenAI(
model="gpt-3.5-turbo-0125",
temperature=0.1, # Lower temperature for more deterministic outputs
max_tokens=512, # Limit response length
timeout=60, # Request timeout in seconds
)
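If you want this customized LLM for just one component without touching the global Settings (the “Supports Overrides” point above), most Llama Index components accept a model directly. A minimal sketch, where my_index stands in for any VectorStoreIndex you have built:
# Per-component override: this query engine uses advanced_llm,
# while everything else keeps the global Settings models
query_engine = my_index.as_query_engine(llm=advanced_llm)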
Cost Considerations
Estimated costs for processing a 100-page document:
| Operation | Model | Approx. Cost |
|---|---|---|
| Parsing | LlamaParse | Free tier (or ~$0.01) |
| Embeddings | text-embedding-3-small | ~$0.02 |
| Queries (10x) | gpt-3.5-turbo | ~$0.05 |
| Total | | ~$0.07 |
For GPT-4:
- Queries would cost ~$0.50 (10x more)
- Better for complex reasoning tasks
- Start with GPT-3.5, upgrade if needed
Now that our models are configured, let’s parse some documents!
Step 5: Parse the Document with LlamaParse
This is where the magic happens! We’ll use LlamaParse to convert our PDF into a structured markdown format, then parse it into nodes using Llama Index’s MarkdownElementNodeParser.
The Parsing Pipeline
The process has three stages:
PDF Document → LlamaParse (Markdown) → MarkdownElementNodeParser → Nodes + Objects
- LlamaParse: Converts PDF to markdown, preserving structure
- MarkdownElementNodeParser: Identifies elements (tables, text, lists)
- Node Separation: Splits into base nodes (text) and objects (tables)
Sample Document
For this tutorial, we’ll use a table from NCRB (National Crime Records Bureau) which contains multi-level hierarchical data—perfect for testing complex parsing.
You can download similar tables from: https://ncrb.gov.in/accidental-deaths-suicides-in-india-adsi
Now let’s parse this document using LlamaParse:
from llama_parse import LlamaParse
from llama_index.core.node_parser import MarkdownElementNodeParser
print("Starting document parsing...\n")
# Initialize LlamaParse with markdown output
# result_type="markdown" tells LlamaParse to convert documents to markdown format
parser = LlamaParse(result_type="markdown")
# Load and parse the document
# This sends the document to LlamaCloud for processing
print("[1/3] Loading document with LlamaParse...")
documents = parser.load_data("./crime_statistics_2021.pdf")
print(f"✓ Loaded {len(documents)} document(s)")
# Initialize the markdown element parser
# This parser understands markdown structure and can identify tables, lists, etc.
# num_workers=8 enables parallel processing for faster parsing
print("\n[2/3] Initializing MarkdownElementNodeParser...")
node_parser = MarkdownElementNodeParser(
llm=llm, # Uses our configured LLM for understanding structure
num_workers=8 # Parallel processing workers
)
print("✓ Parser initialized with 8 workers")
# Parse documents into nodes
print("\n[3/3] Parsing document into nodes...")
nodes = node_parser.get_nodes_from_documents(documents)
print(f"✓ Created {len(nodes)} node(s)")
# Separate base nodes (text) from objects (tables, figures)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
print(f"✓ Separated into {len(base_nodes)} base node(s) and {len(objects)} object(s)")
Understanding the Output
Let’s examine what we got:
# Inspect the first few nodes
print("\nSample Base Node:")
print("=" * 80)
if base_nodes:
sample_node = base_nodes[0]
print(f"Type: {type(sample_node).__name__}")
print(f"Content (first 200 chars): {sample_node.get_content()[:200]}...")
print(f"Metadata: {sample_node.metadata}")
print("\n" + "=" * 80)
print("\nSample Object (Table):")
print("=" * 80)
if objects:
sample_object = objects[0]
print(f"Type: {type(sample_object).__name__}")
print(f"Content (first 300 chars):\n{sample_object.get_content()[:300]}...")
print(f"Metadata: {sample_object.metadata}")
print("\n" + "=" * 80)
What Just Happened?
The parsing pipeline:
- LlamaParse converted the document to structured markdown
- MarkdownElementNodeParser analyzed the markdown and identified:
- Headers and titles
- Tables (preserved in structured format)
- Narrative text
- Lists and other elements
- Node Separation split elements into:
- base_nodes: Searchable text chunks
- objects: Structured data (tables) that can be queried directly
Why This Matters
Traditional parsers would flatten everything into plain text:
State/UT 2019 2020 2021 % Var Total All India 365,711 356,345 382,189 +7.3%...
LlamaParse preserves structure:
Table with headers: [State/UT, 2019, 2020, 2021, % Var]
Rows: [{State: "Total", 2021: 382189, ...}, ...]
This structure is what enables accurate querying! Now let’s index this data.
Step 6: Create the Vector Store Index
Now that we have parsed our document into structured nodes and objects, we need to create a vector store index that enables semantic search and intelligent querying.
What is a Vector Store Index?
A vector store index:
- Converts text into numerical embeddings (vectors)
- Stores these vectors for efficient similarity search
- Enables finding relevant content based on semantic meaning, not just keywords
For example:
- Query: “Which state had the highest growth?”
- Retrieves: Table with percentage change data
- Even though the query doesn’t contain words like “percentage” or “table”
The Indexing Process
Let’s create our index:
from llama_index.core import VectorStoreIndex
print("Creating vector store index...\n")
# Create index from base nodes + objects
# This combines both text content and structured tables into a single searchable index
print("[1/2] Generating embeddings for all nodes...")
recursive_index = VectorStoreIndex(
nodes=base_nodes + objects # Include both text and table nodes
)
print(f"✓ Index created with {len(base_nodes) + len(objects)} total nodes")
# Create query engine with similarity-based retrieval
# similarity_top_k=5 means retrieve the 5 most relevant chunks for each query
print("\n[2/2] Creating query engine...")
recursive_query_engine = recursive_index.as_query_engine(
similarity_top_k=5 # Number of chunks to retrieve per query
)
Understanding the Components
VectorStoreIndex
The VectorStoreIndex is Llama Index’s built-in vector storage solution. It:
- Embeddings: Automatically generates embeddings using our configured embedding model
- Storage: Stores vectors in memory (for development) or persistent storage (for production; see the persistence sketch after this list)
- Retrieval: Enables fast similarity search using cosine similarity
- Flexibility: Can be backed by Chroma, Pinecone, Weaviate, or other vector databases
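The in-memory default is fine for development, but the index can be written to disk and reloaded so you don’t re-embed the document on every run. A minimal persistence sketch using Llama Index’s default storage (the ./storage directory name is arbitrary):
from llama_index.core import StorageContext, load_index_from_storage
# Persist the index (nodes + embeddings) to disk
recursive_index.storage_context.persist(persist_dir="./storage")
# Later, or in another process: reload without re-parsing or re-embedding
storage_context = StorageContext.from_defaults(persist_dir="./storage")
reloaded_index = load_index_from_storage(storage_context)
reloaded_engine = reloaded_index.as_query_engine(similarity_top_k=5)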
Query Engine
The query engine orchestrates the entire query process:
User Question → Embed Question → Find Similar Chunks → LLM Synthesis → Answer
| Parameter | Value | Purpose |
|---|---|---|
| similarity_top_k | 5 | Retrieve 5 most relevant chunks |
| response_mode | compact (default) | How to synthesize retrieved chunks |
| streaming | False (default) | Whether to stream responses |
Advanced Configuration
You can customize the query engine for specific needs:
# Example: Advanced query engine configuration
advanced_query_engine = recursive_index.as_query_engine(
similarity_top_k=10, # Retrieve more chunks for complex queries
response_mode="tree_summarize", # Hierarchical summarization
verbose=True # Show retrieval details
)
Why Include Both Base Nodes and Objects?
nodes=base_nodes + objects
This is crucial! By including both:
- Text queries retrieve narrative content (“What were the key insights?”)
- Data queries retrieve structured tables (“What was the growth rate?”)
- Hybrid queries get both (“Explain the traffic accident trends”)
The query engine automatically finds the most relevant nodes—whether they’re text or tables—based on semantic similarity.
Production Considerations
For production deployments:
- Use persistent vector stores: ChromaDB, Pinecone, Weaviate
- Tune similarity_top_k: Test different values (3-10)
- Monitor retrieval quality: Log which chunks are retrieved
- Implement caching: Cache frequent queries
- Add metadata filtering: Filter by date, source, category (a small sketch follows below)
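To illustrate the last point, here is a small metadata-filtering sketch. The key/value pair is hypothetical; attach whatever metadata (date, source, category) fits your pipeline to the nodes before indexing.
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
# Only retrieve nodes whose metadata matches the filter
# The "source" key and its value below are hypothetical placeholders
filters = MetadataFilters(filters=[ExactMatchFilter(key="source", value="crime_statistics_2021")])
filtered_engine = recursive_index.as_query_engine(
    similarity_top_k=5,
    filters=filters,
)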
Now let’s put this to the test with real queries!
Step 7: Query the Document
Now comes the exciting part—querying our parsed document with natural language! Let’s test the system with various query types to demonstrate its capabilities.
The Query Process
When you submit a query:
- Query Embedding: Your question is converted to a vector
- Similarity Search: Find the most similar chunks in the index
- Context Assembly: Retrieved chunks are combined
- LLM Synthesis: The LLM generates an answer using the context
- Response Return: You get a natural language answer (with the retrieved chunks attached, as sketched after this list)
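Steps 2 and 3 are easy to inspect: every response exposes the chunks it was built from via response.source_nodes, which is handy for debugging retrieval quality. A quick sketch using the query engine built in the previous step:
# Peek at which chunks were retrieved for a query
response = recursive_query_engine.query("What was the total number of accidental deaths in 2021?")
for node_with_score in response.source_nodes:
    print("score:", node_with_score.score)
    print(node_with_score.node.get_content()[:150], "...")
    print("-" * 40)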
Example Query: Complex Table Extraction
Let’s try a challenging query that requires understanding table structure:
# Complex query that tests table extraction
query = 'Extract the state-wise data as a dict and exclude any information about 2020. Also include % var'
# Execute query
response = recursive_query_engine.query(query)
# Display the synthesized answer
print(response)
What Makes This Query Complex?
This query is sophisticated because it requires:
- Table Identification: Find the correct table in the document
- Column Filtering: Exclude 2020 data while keeping 2019 and 2021
- Format Conversion: Convert table to dictionary/JSON format
- Selective Inclusion: Include percentage variance column
- Structured Output: Return data in a structured format
Traditional parsers would completely fail at this task. LlamaParse + Llama Index handles it naturally.
Additional Test Queries
Let’s test different query types:
# Test different query types
test_queries = [
"Which state had the highest percentage increase in 2021?",
"What was the total number of accidental deaths in 2021?",
"List the top 3 states by absolute numbers in 2021",
"What were the main categories of accidental deaths?",
"Compare traffic accidents to drowning incidents in 2021",
"What key insights are mentioned in the report?"
]
print("\n" + "="*80)
print("TESTING MULTIPLE QUERY TYPES")
print("="*80 + "\n")
for i, test_query in enumerate(test_queries, 1):
print(f"\n[Query {i}/{len(test_queries)}]")
print("-" * 80)
print(f"Q: {test_query}")
print("-" * 80)
# Execute query
answer = recursive_query_engine.query(test_query)
print(f"A: {answer}")
print("=" * 80)
Query Quality Factors
What makes some queries work better than others?
| Query Type | Success Rate | Why |
|---|---|---|
| Specific data points | High | “Kerala’s percentage” → Direct table lookup |
| Comparisons | High | “Compare X to Y” → Both in same table |
| Aggregations | Medium | “Total of all states” → Requires calculation |
| Complex logic | Medium | “States above average” → Multi-step reasoning |
| Vague questions | Low | “Tell me about the data” → Too broad |
Tips for Better Queries
✓ Good Queries:
- “What was Maharashtra’s 2021 value?”
- “Compare traffic accidents to fire accidents”
- “Which category had the highest percentage change?”
✗ Poor Queries:
- “Tell me everything” (too broad)
- “The number” (no context)
- “Is it good?” (subjective, no reference)
The more specific your query, the better the retrieval and answer quality!
Step 8: Putting It All Together
Now you can put everything together into a single, cohesive workflow. This represents the complete end-to-end pipeline you’d use in production.
The Complete Pipeline
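Condensed from the previous steps, a minimal end-to-end sketch looks like this (the keys and file name are placeholders; adjust for your own document):
import os
from llama_parse import LlamaParse
from llama_index.core import Settings, VectorStoreIndex
from llama_index.core.node_parser import MarkdownElementNodeParser
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
# 1. Configuration (replace the placeholder keys)
os.environ["OPENAI_API_KEY"] = "sk-proj-****"
os.environ["LLAMA_CLOUD_API_KEY"] = "llx-****"
Settings.llm = OpenAI(model="gpt-3.5-turbo-0125")
Settings.embed_model = OpenAIEmbedding(model="text-embedding-3-small")
# 2. Parse the document with LlamaParse
documents = LlamaParse(result_type="markdown").load_data("./crime_statistics_2021.pdf")
# 3. Split into text nodes and table/figure objects
node_parser = MarkdownElementNodeParser(llm=Settings.llm, num_workers=8)
nodes = node_parser.get_nodes_from_documents(documents)
base_nodes, objects = node_parser.get_nodes_and_objects(nodes)
# 4. Index and query
index = VectorStoreIndex(nodes=base_nodes + objects)
query_engine = index.as_query_engine(similarity_top_k=5)
print(query_engine.query("Which state had the highest percentage increase in 2021?"))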
Evaluation and Quality Assessment
Before deploying to production, you must evaluate your system’s accuracy: like any other tool in the tech field, LlamaParse is not entirely immune to errors.
Evaluation Strategies
You should perform thorough evaluation using tools like Ragas, Truera, or custom evaluation frameworks.
For production deployments:
- Use Ragas: Comprehensive RAG evaluation framework (a fuller sketch follows after this list):
  from ragas import evaluate
  from ragas.metrics import faithfulness, answer_relevancy
- Human-in-the-loop: Sample 10% of queries for human review
- A/B Testing: Compare different parsing strategies
- Continuous Monitoring: Track query success rates over time
- Error Analysis: Log and analyze failed queries
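A minimal Ragas sketch, assuming a 0.1.x-style API and that you have collected a few test questions along with the answers and retrieved contexts your pipeline produced (all values below are placeholders):
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy
# Placeholder evaluation set: each entry is a question, the generated answer,
# and the list of retrieved chunks that were used to produce it
eval_data = {
    "question": ["What was the total number of accidental deaths in 2021?"],
    "answer": ["<answer produced by your query engine>"],
    "contexts": [["<text of the retrieved table chunk>"]],
}
results = evaluate(Dataset.from_dict(eval_data), metrics=[faithfulness, answer_relevancy])
print(results)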
Remember: No parsing tool is perfect. Always validate critical extractions before using them in production decision-making!
Congratulations! You’ve built a complete document extraction and querying system using LlamaParse and Llama Index.
What We Accomplished
✅ Understood the challenges of parsing complex PDFs with embedded objects
✅ Implemented end-to-end parsing from raw PDFs to queryable indices
✅ Extracted complex tables with multi-level hierarchies
✅ Created vector indices for semantic search
✅ Built query engines that understand natural language
Key Takeaways
LlamaParse is specialized: It focuses on one problem—extracting structure from documents—and does it exceptionally well
Integration matters: Seamless compatibility with Llama Index means less code, fewer bugs, faster development
Structure preservation is critical: Tables, figures, and hierarchies must maintain their relationships for accurate querying
Evaluation is non-negotiable: Always test parsing accuracy before production deployment
Natural language queries work: With proper parsing, users can ask questions naturally without knowing document structure
When Prompting Isn’t Enough
LlamaParse with prompt engineering handles most document parsing scenarios excellently. However, for extreme cases:
- Highly specialized documents: Industry-specific formats with unique structures
- Maximum accuracy requirements: When 99%+ precision is mandatory
- Very high volume: Processing millions of documents where cost optimization is critical
In these scenarios, fine-tuning specialized extraction models on your specific document corpus can push beyond the limits of general-purpose parsers.
Need help deploying at scale? At UBIAI, we specialize in building production document processing systems. Our consulting team can help you design, implement, and optimize parsing workflows for your specific document types and business requirements.
Want to optimize for domain-specific documents? Explore UBIAI’s agentic fine-tuning platform, where you can fine-tune extraction components for maximum accuracy on your specialized documents—no ML expertise required.
Frequently Asked Questions
Q1. What is the Llama Index?
A: LlamaIndex is one of the leading LLM frameworks (alongside LangChain) for building LLM applications. It helps connect custom data sources to large language models and is widely used for building RAG applications. It provides tools for data ingestion, indexing, retrieval, and querying.
Q2. What is LlamaParse?
A: LlamaParse is a specialized offering from Llama Index that extracts complex tables, figures, and embedded objects from documents like PDFs, PowerPoints, and Word documents. Because it’s developed by the Llama Index team, it integrates seamlessly with the Llama Index ecosystem, allowing direct use with the framework’s agents and tools.
Q3. How is LlamaParse different from Llama Index?
A: Llama Index is a comprehensive LLM framework for building custom applications, providing various tools and agents for data processing, retrieval, and generation. LlamaParse is specifically focused on one task: extracting complex embedded objects from documents. Think of Llama Index as the full toolkit, and LlamaParse as a specialized tool within that ecosystem.
Q4. What is the importance of LlamaParse?
A: LlamaParse’s importance lies in its ability to convert complex unstructured data (tables, images, figures) into structured formats that LLMs can understand and reason about. This transformation is crucial in today’s world where most valuable information exists in unstructured form. For instance, analyzing a company’s 100-200 page SEC filing would be nearly impossible without such a tool—LlamaParse makes it efficient and accurate.
Q5. Does LlamaParse have any alternatives?
A: Yes, the main alternatives to LlamaParse are:
- LLMSherpa: Good for self-hosted deployments and PDF/HTML parsing
- unstructured.io: Open-source library that partitions many document types
- Kudra: Framework-agnostic with support for 20+ file formats
Each has its strengths. Choose LlamaParse for seamless Llama Index integration and rapid development.
Q6. How accurate is LlamaParse?
A: LlamaParse achieves 85-95% accuracy on well-formatted documents and 70-85% on scanned/low-quality documents. However, like any parsing tool, it’s not immune to errors. Always evaluate with tools like Ragas or custom test suites before production deployment.
Q7. Can I use LlamaParse for free?
A: Yes! The free tier allows parsing up to 1000 pages per day, which is excellent for development, testing, and small-scale applications. Paid plans are available for higher volume production workloads.
Q8. What file formats does LlamaParse support?
A: LlamaParse supports 10+ formats including .pdf, .pptx, .docx, .html, .xml, and more. It can even extract images embedded within these documents.