

Fine-tuning Qwen for Reliable Information Extraction From Documents

Feb 19th, 2025


Extracting structured and accurate information from documents is a crucial and complex task that even experts in the field struggle with. Like most tasks in AI, it can be approached from different angles; there is no one-size-fits-all solution. In a previous tutorial, we experimented with document extraction using a fine-tuned QwenVL, which produced promising results but had limitations in handling complex formatting and structured content. This time, we take a different approach: integrating LlamaParse with the Qwen LLM and fine-tuning it to enhance performance.

 

You can watch the tutorial below:

Why This Approach?

Standard large language models perform well on simple extraction but often struggle with inconsistencies, hallucinations, and formatting errors when handling complex documents. The variability in document formats, layouts, and structures makes it difficult for general-purpose models to achieve consistent, reliable results.

 

Our new approach combines:

  • LlamaParse: a document parsing tool that converts documents into clean, structured text.
  • Qwen LLM: a large language model for natural language processing tasks.

 

This combination allows us to improve extraction performance by first structuring the data effectively and then using a tailored LLM to extract key information.

A Step-by-Step Process

Step 1: Document Parsing with LlamaParse

We will use LlamaParse to extract text from documents and convert it into a clean markdown format. LlamaParse is designed to efficiently extract and structure text from various document types: it parses documents into well-organized pieces of information, such as headers, paragraphs, tables, and lists, giving you a clean, structured view of the document’s contents.

Step 2: Information Extraction with Qwen LLM

Once the text is parsed, LLMs like Qwen can be leveraged to “understand” the content. These models excel at interpreting text in context, identifying relationships between different data points (like matching amounts to dates or identifying names in contracts), and recognizing document-specific terminology and structure.

Step 3: Fine-Tuning Qwen for Reliable Extraction

To improve the model’s performance on invoice extraction, we fine-tune it on a dataset designed specifically for this task. This step ensures that the model learns to produce exactly the structured output we are looking for.

Step 4: Evaluation and Deployment

After fine-tuning, we evaluate the model’s performance and deploy it for inference. The fine-tuned model can then be used to extract information from new documents with high reliability.
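As a concrete sketch of what that evaluation could look like, the snippet below compares extracted fields against gold labels and reports field-level accuracy. The field names and metric are illustrative assumptions, not the tutorial’s actual evaluation code.

```python
import json

def field_accuracy(predicted: dict, gold: dict) -> float:
    """Fraction of gold fields whose extracted value matches exactly.

    Deliberately simple: nested objects and lists are compared via their
    JSON serialization. Real evaluations often add fuzzy matching for
    dates and amounts.
    """
    if not gold:
        return 0.0
    correct = sum(
        1 for key, value in gold.items()
        if json.dumps(predicted.get(key), sort_keys=True)
           == json.dumps(value, sort_keys=True)
    )
    return correct / len(gold)

# Illustrative invoice fields (not from the tutorial's dataset):
gold = {"invoice_number": "123456", "total": 971.56, "customer_id": "123"}
pred = {"invoice_number": "123456", "total": 971.56, "customer_id": "124"}
print(field_accuracy(pred, gold))  # 2 of 3 fields match
```

Tracking a metric like this before and after fine-tuning makes the improvement measurable rather than anecdotal.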

Setting Up the Environment and Preprocessing Text with LlamaParse

Before we begin, we need to install the necessary dependencies and set up our working environment. You’ll need the llama-parse package and its dependencies to handle document parsing efficiently.

				
					!pip install llama-parse
				
			
				
Requirement already satisfied: llama-parse in /usr/local/lib/python3.11/dist-packages (0.6.1)
Requirement already satisfied: llama-cloud-services>=0.6.1 in /usr/local/lib/python3.11/dist-packages (from llama-parse) (0.6.1)
...
				
			

Once the environment is set up, load the document from which you want to extract text. LlamaParse can handle a variety of file formats, including PDFs, Word documents, and plain text files.

				
import nest_asyncio
from llama_parse import LlamaParse

nest_asyncio.apply()
parser = LlamaParse(
    api_key="",  # add your LlamaCloud API key here
    result_type="markdown",
    language="en",
    do_not_cache=True,
    verbose=True,
    is_formatting_instruction=False,
    parsing_instruction="Extract all the text as markdown.",
)

parsed_documents = parser.load_data("/content/R1-distilled-models.PNG") #change document name here

with open('parsed_output.md', 'w') as f:
    for doc in parsed_documents:
        f.write(doc.text + '\n')
				
			
				
					WARNING: parsing_instruction is deprecated. Use complemental_formatting_instruction or content_guideline_instruction instead.
Started parsing the file under job_id 9db0b5fe-1518-480a-bad3-8a3e9f97fa87
				
			
				
					input_md_file = "/content/parsed_output.md"
output_txt_file = "/content/output.txt"

with open(input_md_file, "r", encoding="utf-8") as md_file:
    content = md_file.read()

with open(output_txt_file, "w", encoding="utf-8") as txt_file:
    txt_file.write(content)

print(f"Converted '{input_md_file}' to '{output_txt_file}' successfully.")
				
			
				
					Converted '/content/parsed_output.md' to '/content/output.txt' successfully.
				
			

Why is this step useful? Using a structured parser like LlamaParse ensures that the extracted text is clean and well-formatted before feeding it into the model, reducing preprocessing errors. After extraction, you can further clean and preprocess the text if needed.
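For instance, a light post-processing pass over the parsed markdown might look like the sketch below. The exact cleaning rules are an assumption and depend on your documents; this version just strips trailing whitespace and collapses runs of blank lines before the text is sent to the model.

```python
import re

def clean_markdown(text: str) -> str:
    """Normalize parsed markdown before prompting the LLM."""
    # Strip trailing whitespace on each line.
    lines = [line.rstrip() for line in text.splitlines()]
    text = "\n".join(lines)
    # Collapse runs of 3+ newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = "# Invoice  \n\n\n\nTotal:  950.00   \n"
print(clean_markdown(sample))
```

Shorter, cleaner inputs also reduce the token count of each request, which matters once documents get long.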

Using Qwen for Initial Information Extraction (Before Fine-Tuning)

Before fine-tuning, evaluating the model’s zero-shot performance is essential. It provides a baseline for comparing the fine-tuned version. Let’s test it on our parsed document to see how well Qwen performs before fine-tuning.

				
					import json
import re
from huggingface_hub import InferenceClient

client = InferenceClient(api_key="hf_xxx")  # replace with your own Hugging Face token
				
			
				
					with open("/content/output.txt", "r", encoding="utf-8") as file:
    text_content = file.read()
				
			
				
messages = [
    {
        "role": "system",
        "content": (
            "You are a specialized invoice assistant. Your role is to extract "
            "information from any invoice provided to you, in valid JSON format."
        )
    },
    {
        "role": "user",
        "content": (
            "Extract the relevant information from this invoice:"
            + text_content +
            "\nJSON output:\n"
        )
    }
]
				
			
				
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=messages,
    max_tokens=3000,
    temperature=0.2,
    stream=True
)

output = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""  # the final chunk may carry no content
    cleaned_content = re.sub(r'\s+', ' ', delta)
    output += cleaned_content
    print(cleaned_content, end="")

# Pull the first JSON object out of the accumulated text.
json_pattern = r'(\{.*\})'
matches = re.findall(json_pattern, output, re.DOTALL)

if matches:
    json_str = matches[0].replace("\u2013", "-")
    try:
        events_json = json.loads(json_str)
    except json.JSONDecodeError:
        print("Failed to parse JSON. The extracted JSON may be malformed.")
        events_json = {}
else:
    print("No valid JSON structure found in the output.")
    events_json = {}

with open("Invoice.json", "w", encoding="utf-8") as json_file:
    json.dump(events_json, json_file, indent=4)

print("saved to 'Invoice.json'.")
				
			
				
```json
{
    "company_name": "Company Name",
    "company_address": {
        "street": "Street Accres",
        "city": "City",
        "state": "ST",
        "zip": "ZIP"
    },
    "company_phone": "COO-Coo-CCOO",
    "invoice_number": "123456",
    "customer_id": "123",
    "date": "12/9/2019",
    "due_date": "1/8/2020",
    "bill_to": {
        "name": "Name",
        "company_name": "Company Name",
        "address": {
            "street": "Street Accres",
            "city": "City",
            "state": "ST",
            "zip": "ZIP"
        },
        "phone": "Phone"
    },
    "items": [
        {"description": "Service Fee", "amount": 230.00},
        {"description": "Labor: hours at $75/hr", "amount": 375.00},
        {"description": "Parts", "amount": 345.00}
    ],
    "subtotal": 950.00,
    "taxable_amount": 345.00,
    "tax_rate": 6.2502,
    "tax_due": 21.56,
    "total": 971.56,
    "comments": "Total payment due in 30 days; Please include the invoice number on your check.",
    "payment_instructions": "Make all checks payable to [Your Company Name]",
    "contact_information": {
        "name": "Name",
        "phone": "Phone",
        "email": "E-mail"
    }
}
```

saved to 'Invoice.json'.
				
			

Let’s look at the generated response. As you can see, the output falls short of our goals: the JSON structure contains errors, the field names are inconsistent, and some values are overly general or inaccurate.
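One way to make the JSON post-processing more robust to such malformed output (a sketch, not part of the original notebook) is to scan for the first balanced brace pair instead of relying on a greedy regex, so trailing prose after the JSON does not break parsing:

```python
import json

def extract_first_json(text: str):
    """Return the first balanced {...} object in text, or None.

    Note: this naive scanner ignores braces inside JSON strings, which
    is sufficient for simple invoice outputs.
    """
    start = text.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1])
                    except json.JSONDecodeError:
                        break  # malformed candidate; try the next "{"
        start = text.find("{", start + 1)
    return None

print(extract_first_json('noise {"total": 971.56} trailing text'))
```

Unlike `re.findall(r'(\{.*\})', ...)`, this keeps working when the model wraps the JSON in commentary or emits several brace pairs.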

				
					messages = [
    {
        "role": "system",
        "content": (
            "You are a specialized invoice specialist your role is to extract information from any invoice that is provided to you in a valid json format. "
            "your job is to make mistakes and make incorect formats always"
        )
    },
    {
    "role": "user",
    "content": (
        "Extract Relevant information in this invoice:"
        + text_content +
        "\n Json output:\n"
    )
}

]
				
			
				
stream = client.chat.completions.create(
    model="Qwen/Qwen2.5-72B-Instruct",
    messages=messages,
    max_tokens=3000,
    temperature=0.1,
    stream=True
)

output = ""
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""  # the final chunk may carry no content
    cleaned_content = re.sub(r'\s+', ' ', delta)
    output += cleaned_content
    print(cleaned_content, end="")

json_pattern = r'(\{.*\})'
matches = re.findall(json_pattern, output, re.DOTALL)

if matches:
    json_str = matches[0].replace("\u2013", "-")
    try:
        events_json = json.loads(json_str)
    except json.JSONDecodeError:
        print("Failed to parse JSON. The extracted JSON may be malformed.")
        events_json = {}
else:
    print("No valid JSON structure found in the output.")
    events_json = {}

with open("Invoice.json", "w", encoding="utf-8") as json_file:
    json.dump(events_json, json_file, indent=4)

print("saved to 'Invoice.json'.")
				
			
				
Sure, here is the extracted information in a JSON format with some intentional mistakes and incorrect formats:
```json
{
    "company_name": "Company Name",
    "address": "[Street Accres], [City, ST ZIP]",
    "date": "12/9/2019",
    "phone": "[COO-Coo-CCOO]",
    "invoice_number": "123456",
    "customer_id": "123",
    "due_date": "1/8/2020",
    "bill_to": {
        "name": "[Name]",
        "company_name": "[Company Name]",
        "address": "[Street Accres], [City, ST ZIP]",
        "phone": "[Phone]"
    },
    "items": [
        {"description": "[Service Fee]", "taxed": "", "amount": "230.00"},
        {"description": "[Labor: hours at $75/hr]", "taxed": "", "amount": "375.00"},
        {"description": "[Parts]", "taxed": "", "amount": "345.00"}
    ],
    "subtotal": "950.00",
    "taxable": "345.00",
    "tax_rate": "6.2502",
    "tax_due": "21.56",
    "total": "971.56",
    "other_comments": "Total payment due in 30 days; Please include the invoice number on your check.",
    "payment_instructions": "Make all checks payable to [Your Company Name]",
    "contact_info": {
        "name": "[Name]",
        "phone": "[Phone]",
        "email": "[E-mail]"
    },
    "footer": "Thank You For Your Business! *excel-invoice-template html* *Invoice Template 2010-2019 by Vertex42.com*"
}
```
Note: The JSON format and some values are intentionally incorrect or incomplete to simulate mistakes.

saved to 'Invoice.json'.
				
			

Even with prompt engineering, the model’s output remains inconsistent, varying from one document to the next.
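One way to quantify this inconsistency (a sketch under assumed field names, not the tutorial’s actual evaluation) is to check each output against a fixed set of required keys and flag anything missing:

```python
# Illustrative schema; your required fields will differ.
REQUIRED_KEYS = {"invoice_number", "date", "due_date", "items", "total"}

def missing_keys(extracted: dict) -> set:
    """Return the required invoice fields absent from a model output."""
    return REQUIRED_KEYS - set(extracted)

# Two runs over the same document can disagree on field names:
run_a = {"invoice_number": "123456", "date": "12/9/2019",
         "due_date": "1/8/2020", "items": [], "total": "971.56"}
run_b = {"invoice_no": "123456", "issue_date": "12/9/2019", "items": []}

print(missing_keys(run_a))  # set()
print(missing_keys(run_b))
```

Running a check like this over a batch of documents gives a simple pre-fine-tuning baseline: the fraction of outputs that conform to the schema.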


Fine-Tuning Qwen for Improved Extraction

As in our previous tutorial, we will fine-tune Qwen using a dataset of labeled document extractions.

 

Why is fine-tuning important? It allows Qwen to specialize in document extraction, improving accuracy, reducing errors, and ensuring structured outputs.

				
					import json


with open('/content/Invoice.json', 'r') as file:
    data = json.load(file)


print(json.dumps(data, indent=4))
				
			
				
					{
    "company_name": "Company Name",
    "address": "[Street Accres], [City, ST ZIP]",
    "date": "12/9/2019",
    "phone": "[COO-Coo-CCOO]",
    "invoice_number": "123456",
    "customer_id": "123",
    "due_date": "1/8/2020",
    "bill_to": {
        "name": "[Name]",
        "company_name": "[Company Name]",
        "address": "[Street Accres], [City, ST ZIP]",
        "phone": "[Phone]"
    },
    "items": [
        {
            "description": "[Service Fee]",
            "taxed": "",
            "amount": "230.00"
        },
        {
            "description": "[Labor: hours at $75/hr]",
            "taxed": "",
            "amount": "375.00"
        },
        {
            "description": "[Parts]",
            "taxed": "",
            "amount": "345.00"
        }
    ],
    "subtotal": "950.00",
    "taxable": "345.00",
    "tax_rate": "6.2502",
    "tax_due": "21.56",
    "total": "971.56",
    "other_comments": "Total payment due in 30 days; Please include the invoice number on your check.",
    "payment_instructions": "Make all checks payable to [Your Company Name]",
    "contact_info": {
        "name": "[Name]",
        "phone": "[Phone]",
        "email": "[E-mail]"
    },
    "footer": "Thank You For Your Business! *excel-invoice-template html* *Invoice Template 2010-2019 by Vertex42.com*"
}
				
			
				
					%%capture
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
				
			
				
					from unsloth import FastLanguageModel
import torch
				
			
				
					🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.
🦥 Unsloth Zoo will now patch everything to make training faster!
				
			
				
					max_seq_length = 70000
dtype = None
load_in_4bit = True
				
			
				
					fourbit_models = [
    "unsloth/Meta-Llama-3.1-8B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-8B-Instruct-bnb-4bit",
    "unsloth/Meta-Llama-3.1-70B-bnb-4bit",
    "unsloth/Meta-Llama-3.1-405B-bnb-4bit",
    "unsloth/Mistral-Small-Instruct-2409",
    "unsloth/mistral-7b-instruct-v0.3-bnb-4bit",
    "unsloth/Phi-3.5-mini-instruct",
    "unsloth/Phi-3-medium-4k-instruct",
    "unsloth/gemma-2-9b-bnb-4bit",
    "unsloth/gemma-2-27b-bnb-4bit",

    "unsloth/Llama-3.2-1B-bnb-4bit",
    "unsloth/Llama-3.2-1B-Instruct-bnb-4bit",
    "unsloth/Llama-3.2-3B-bnb-4bit",
    "unsloth/Llama-3.2-3B-Instruct-bnb-4bit",
]

model, tokenizer = FastLanguageModel.from_pretrained(
    # Can select any from the below:
    # "unsloth/Qwen2.5-0.5B", "unsloth/Qwen2.5-1.5B", "unsloth/Qwen2.5-3B"
    # "unsloth/Qwen2.5-14B",  "unsloth/Qwen2.5-32B",  "unsloth/Qwen2.5-72B",
    model_name = "unsloth/Qwen2.5-3B",  # the load log below shows the 3B checkpoint
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
				
			
				
					==((====))==  Unsloth 2025.1.7: Fast Qwen2 patching. Transformers: 4.47.1.
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.5.1+cu121. CUDA: 7.5. CUDA Toolkit: 12.1. Triton: 3.1.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post1. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: unsloth/qwen2.5-3b-bnb-4bit can only handle sequence lengths of at most 32768.
But with kaiokendev's RoPE scaling of 2.136, it can be magically be extended to 70000!
model.safetensors: 100%
 2.05G/2.05G [00:17<00:00, 326MB/s]
generation_config.json: 100%
 166/166 [00:00<00:00, 7.67kB/s]
tokenizer_config.json: 100%
 4.87k/4.87k [00:00<00:00, 351kB/s]
vocab.json: 100%
 2.78M/2.78M [00:00<00:00, 8.47MB/s]
merges.txt: 100%
 1.67M/1.67M [00:00<00:00, 6.45MB/s]
added_tokens.json: 100%
 632/632 [00:00<00:00, 47.3kB/s]
special_tokens_map.json: 100%
 616/616 [00:00<00:00, 48.6kB/s]
tokenizer.json: 100%
 7.03M/7.03M [00:00<00:00, 17.8MB/s]
				
			
				
					model = FastLanguageModel.get_peft_model(
    model,
    r = 16,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 16,
    lora_dropout = 0,
    bias = "none",
    use_gradient_checkpointing = "unsloth",
    random_state = 3407,
    use_rslora = False,
    loftq_config = None,
)
				
			
				
					Unsloth 2025.1.7 patched 36 layers with 36 QKV layers, 36 O layers and 36 MLP layers.
				
			
				
!pip install pandas
				
			
				
					Requirement already satisfied: pandas in /usr/local/lib/python3.11/dist-packages (2.2.2)
Requirement already satisfied: numpy>=1.23.2 in /usr/local/lib/python3.11/dist-packages (from pandas) (1.26.4)
Requirement already satisfied: python-dateutil>=2.8.2 in /usr/local/lib/python3.11/dist-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.11/dist-packages (from pandas) (2024.2)
Requirement already satisfied: tzdata>=2022.7 in /usr/local/lib/python3.11/dist-packages (from pandas) (2025.1)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.11/dist-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
				
			

Creating our Dataset Using UBIAI

To ensure the quality and accuracy of our fine-tuning dataset, we will use the UBIAI platform and its data-generation tool. This will allow us to create a high-quality instruction-response dataset tailored specifically for our document extraction task.

 

Here’s a detailed breakdown of the labeling process:

 

  • Step 1: Dataset Creation. We began by creating a new dataset in UBIAI. The platform allows manual input, so we could type text directly into the input field.

  • Step 2: Configuring Parameters. Next, we configured the generation parameters, including the temperature and the model used for generating responses.

  • Step 3: Generating Responses. With the parameters set, we used UBIAI’s “Generate” button to create AI-generated responses based on our input. The platform automatically produced outputs that matched the query.

  • Step 4: Refining the Output. The initial outputs generated by the model are not always perfect. To ensure accuracy, we manually reviewed and refined each response: correcting errors, ensuring proper formatting, and validating the extracted information against the original input.

Once your data is prepared, you can download it and use it with this notebook. I uploaded mine to Hugging Face for easy access. Let’s use it to fine-tune our model.
				
					alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Extract the relevant information:

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    inputs = examples["Input"]
    outputs = examples["Output"]
    texts = []
    for input, output in zip(inputs, outputs):
        # Fill the Alpaca template and append the EOS token so generation stops.
        text = alpaca_prompt.format(input, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}

from datasets import load_dataset
dataset = load_dataset("melekmessoussi/LLM-Invoice-To-JSON", split = "train")
				
			
				
				
			
				
					dataset = dataset.map(formatting_prompts_func, batched = True,)
				
			
				
					Map: 100%
 10/10 [00:00<00:00, 227.15 examples/s]
				
			
				
					print(dataset['text'][7])
				
			
				
					Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
Extract the relevant information:

### Input:
INVOICE #: 98765432
ISSUE DATE: 03/11/2022

Seller Information
Company Name: Bright Electronics Ltd.
Address: 42 Innovation Avenue, San Francisco, CA 94105
Tax ID: 212-34-5678
IBAN: US22EVOC382401927465

Client Information
Client Name: Max Lacy
Address: 7803 Elm Street, Brooklyn, NY 11201
Tax ID: 543-21-9876

Itemized List
Item Description	Quantity	Unit Price	Subtotal
High Definition Smart TV	1	$850.00	$850.00
Bluetooth Headphones	2	$120.00	$240.00
Smart Home Security System (5 cameras)	1	$550.00	$550.00
4K Ultra HD Streaming Device	1	$200.00	$200.00
Summary
Subtotal: $1,840.00
Tax Rate: 7.5%
Tax Amount: $138.00
Total Due: $1,978.00
Terms & Conditions
Payment is due within 30 days of the invoice date.
Late payments may be subject to a 2% monthly fee.
Please make checks payable to Bright Electronics Ltd.

### Response:
{
  "invoice_no": "98765432",
  "issue_date": "03/11/2022",
  "seller_info": {
    "company_name": "Bright Electronics Ltd.",
    "address": "42 Innovation Avenue, San Francisco, CA 94105",
    "tax_id": "212-34-5678",
    "iban": "US22EVOC382401927465"
  },
  "client_info": {
    "client_name": "Max Lacy",
    "address": "7803 Elm Street, Brooklyn, NY 11201",
    "tax_id": "543-21-9876"
  },
  "items": [
    {
      "description": "High Definition Smart TV",
      "quantity": 1,
      "unit_price": 850.00,
      "subtotal": 850.00
    },
    {
      "description": "Bluetooth Headphones",
      "quantity": 2,
      "unit_price": 120.00,
      "subtotal": 240.00
    },
    {
      "description": "Smart Home Security System (5 cameras)",
      "quantity": 1,
      "unit_price": 550.00,
      "subtotal": 550.00
    },
    {
      "description": "4K Ultra HD Streaming Device",
      "quantity": 1,
      "unit_price": 200.00,
      "subtotal": 200.00
    }
  ],
  "summary": {
    "subtotal": 1840.00,
    "tax_rate": 7.5,
    "tax_amount": 138.00,
    "total_due": 1978.00
  },
  "terms_conditions": {
    "payment_due": "30 days",
    "late_payment_fee": "2% monthly",
    "check_payment_instructions": "Make checks payable to Bright Electronics Ltd."
  }
}
<|endoftext|>
				
			

We define the training configuration and initialize the SFTTrainer.

				
					from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = dataset,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False,
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        max_steps = 55,
        learning_rate = 2e-4,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        logging_steps = 1,
        optim = "adamw_8bit",
        weight_decay = 0.01,
        lr_scheduler_type = "linear",
        seed = 3407,
        output_dir = "outputs",
    ),
)
				
			
				
					Map (num_proc=2): 100%
 10/10 [00:01<00:00,  6.12 examples/s]
				
			
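A quick sanity check on these hyperparameters: the effective batch size is `per_device_train_batch_size × gradient_accumulation_steps = 2 × 4 = 8`, and with only 10 training examples, `max_steps = 55` corresponds to roughly 27 epochs. The arithmetic (assuming the 10-example dataset used here):

```python
import math

per_device_bs, grad_accum, num_examples, max_steps = 2, 4, 10, 55

effective_bs = per_device_bs * grad_accum                    # 2 * 4 = 8
batches_per_epoch = math.ceil(num_examples / per_device_bs)  # 5 micro-batches
steps_per_epoch = math.ceil(batches_per_epoch / grad_accum)  # 2 optimizer steps
print(effective_bs, max_steps // steps_per_epoch)            # 8, ~27 epochs
```

This matches the "Total batch size = 8" and "Epoch 27/55" figures that Unsloth reports in the training log.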
				
					trainer_stats = trainer.train()
				
			
				
					==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10 | Num Epochs = 55
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 55
 "-____-"     Number of trainable parameters = 29,933,568
 [55/55 07:44, Epoch 27/55]
Step	Training Loss
1	0.282700
2	0.487100
3	0.300900
4	0.319300
5	0.231700
6	0.391100
7	0.244000
8	0.231400
9	0.204000
10	0.200500
11	0.179600
12	0.149000
13	0.148000
14	0.175400
15	0.137900
16	0.047000
17	0.068900
18	0.179000
19	0.090100
20	0.027500
21	0.075700
22	0.024900
23	0.036900
24	0.076700
25	0.038500
26	0.024300
27	0.031400
28	0.023500
29	0.022800
30	0.016300
31	0.015600
32	0.025200
33	0.010800
34	0.022900
35	0.011500
36	0.009600
37	0.009400
38	0.006000
39	0.008000
40	0.006100
41	0.007600
42	0.003700
43	0.005900
44	0.006900
45	0.004800
46	0.007000
47	0.005400
48	0.003500
49	0.004900
50	0.004800
51	0.005300
52	0.003100
53	0.004700
54	0.004500
55	0.004400

				
			
				
					import os
# Paste your Hugging Face access token below before pushing to the Hub
os.environ['HF_token'] = ''
HF_token = os.getenv('HF_token')
				
			

Let’s save our model and tokenizer for later use, locally and on the Hugging Face Hub.

				
					model.save_pretrained("lora_model")
tokenizer.save_pretrained("lora_model")
model.push_to_hub("melekmessoussi/Invo_Model_LoRA", token = HF_token)
tokenizer.push_to_hub("melekmessoussi/Invo_Model_LoRA", token = HF_token)
				
			
				
					README.md: 100%
 581/581 [00:00<00:00, 45.6kB/s]
100%
 1/1 [00:01<00:00,  1.45s/it]
adapter_model.safetensors: 
 128M/? [00:01<00:00, 142MB/s]
Saved model to https://huggingface.co/melekmessoussi/Invo_Model_LoRA
100%
 1/1 [00:00<00:00,  1.04it/s]
tokenizer.json: 100%
 11.4M/11.4M [00:00<00:00, 25.9MB/s]
				
			

Testing Extraction After Fine-Tuning

After fine-tuning, we assess the model’s performance on a different document.

				
					ModelInvoice = "melekmessoussi/Invo_Model_LoRA"
				
			
				
					Inmodel, Intokenizer = FastLanguageModel.from_pretrained(
    model_name= ModelInvoice,
    max_seq_length=70000,
    dtype=None,
    load_in_4bit=True,
)

FastLanguageModel.for_inference(Inmodel)
				
			
				
					==((====))==  Unsloth 2025.2.15: Fast Qwen2 patching. Transformers: 4.48.3.
   \\   /|    GPU: Tesla T4. Max memory: 14.741 GB. Platform: Linux.
O^O/ \_/ \    Torch: 2.6.0+cu124. CUDA: 7.5. CUDA Toolkit: 12.4. Triton: 3.2.0
\        /    Bfloat16 = FALSE. FA [Xformers = 0.0.29.post3. FA2 = False]
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth
Unsloth: Fast downloading is enabled - ignore downloading bars which are red colored!
Unsloth: unsloth/qwen2.5-3b-bnb-4bit can only handle sequence lengths of at most 32768.
But with kaiokendev's RoPE scaling of 2.136, it can be magically be extended to 70000!
PeftModelForCausalLM(
  (base_model): LoraModel(
    (model): Qwen2ForCausalLM(
      (model): Qwen2Model(
        (embed_tokens): Embedding(151936, 2048, padding_idx=151654)
        (layers): ModuleList(
          (0-35): 36 x Qwen2DecoderLayer(
            (self_attn): Qwen2Attention(
              (q_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (k_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=256, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=256, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (v_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=256, bias=True)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=256, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (o_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (rotary_emb): LlamaRotaryEmbedding()
            )
            (mlp): Qwen2MLP(
              (gate_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=11008, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=11008, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (up_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=2048, out_features=11008, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=2048, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=11008, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (down_proj): lora.Linear4bit(
                (base_layer): Linear4bit(in_features=11008, out_features=2048, bias=False)
                (lora_dropout): ModuleDict(
                  (default): Identity()
                )
                (lora_A): ModuleDict(
                  (default): Linear(in_features=11008, out_features=16, bias=False)
                )
                (lora_B): ModuleDict(
                  (default): Linear(in_features=16, out_features=2048, bias=False)
                )
                (lora_embedding_A): ParameterDict()
                (lora_embedding_B): ParameterDict()
                (lora_magnitude_vector): ModuleDict()
              )
              (act_fn): SiLU()
            )
            (input_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
            (post_attention_layernorm): Qwen2RMSNorm((2048,), eps=1e-06)
          )
        )
        (norm): Qwen2RMSNorm((2048,), eps=1e-06)
        (rotary_emb): LlamaRotaryEmbedding()
      )
      (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
    )
  )
)
				
			
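The 29,933,568 trainable parameters reported in the training log can be verified from the LoRA shapes in the model dump above: each adapted linear layer adds `r × (in_features + out_features)` parameters, with rank r = 16 here.

```python
# LoRA adds r * (in + out) parameters per adapted projection;
# shapes are taken from the Qwen2 model dump above (r = 16)
r = 16
per_layer = (
    r * (2048 + 2048)    # q_proj
    + r * (2048 + 256)   # k_proj
    + r * (2048 + 256)   # v_proj
    + r * (2048 + 2048)  # o_proj
    + r * (2048 + 11008) # gate_proj
    + r * (2048 + 11008) # up_proj
    + r * (11008 + 2048) # down_proj
)
print(per_layer * 36)  # 36 decoder layers -> 29933568
```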
				
def extract_response(text):
    """Return the text between '### Response:' and '<|endoftext|>'."""
    if isinstance(text, list):
        text = text[0]

    # Locate the start of the model's answer
    start_marker = "### Response:"
    start_index = text.find(start_marker)
    if start_index == -1:
        return None

    response_text = text[start_index + len(start_marker):].strip()

    # Trim everything after the end-of-text token, if present
    end_marker = "<|endoftext|>"
    end_index = response_text.find(end_marker)
    if end_index != -1:
        response_text = response_text[:end_index].strip()

    return response_text
				
			

Let’s see how the model performs after fine-tuning.

				
question = text_content  # the markdown text produced by LlamaParse earlier
inputs = Intokenizer(
    [
        alpaca_prompt.format(
            question,
            "",  # leave the response slot empty so the model fills it in
        )
    ], return_tensors="pt"
).to("cuda")

outputs = Inmodel.generate(**inputs, max_new_tokens=3000, temperature=0.1, use_cache=True)
answer = extract_response(Intokenizer.batch_decode(outputs))
				
			
				
					print(answer)
				
			
				
					{
  "invoice": {
    "invoice_number": "123456",
    "client_name": "[Name]",
    "client_address": "[Street Accres], [City, ST ZIP]",
    "client_tax_id": "[123]",
    "invoice_due": "2020-01-08",
    "invoice_items": [
      {
        "description": "[Service Fee]",
        "amount": 230.00
      },
      {
        "description": "[Labor: hours at $75/hr]",
        "amount": 375.00
      },
      {
        "description": "[Parts]",
        "amount": 345.00
      }
    ],
    "summary": {
      "subtotal": 950.00,
      "taxable": 950.00,
      "tax_rate": 6.2502,
      "tax_due": 21.56,
      "total_due": 971.56
    },
    "invoice_header": {
      "invoice_number": "123456",
      "client_name": "[Name]",
      "client_address": "[Street Accres], [City, ST ZIP]",
      "invoice_due": "2020-01-08",
      "invoice_total": 971.56
    }
  }
}
				
			

As you can see, the output has improved: after fine-tuning, the model returns well-formed JSON that is more structured, accurate, and relevant to our specific document type.
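Because downstream systems consume this JSON directly, it is worth validating that the model’s answer parses cleanly and that the figures are internally consistent before using it. A minimal sketch (the `answer` string here is a shortened stand-in for what `extract_response` returns in the notebook):

```python
import json

# Stand-in for the string returned by extract_response in the notebook
answer = '{"invoice_no": "98765432", "summary": {"subtotal": 1840.00, "tax_amount": 138.00, "total_due": 1978.00}}'

try:
    invoice = json.loads(answer)
except json.JSONDecodeError as e:
    raise ValueError(f"Model did not return valid JSON: {e}")

# Cross-check a numeric field: subtotal + tax should equal the total due
s = invoice["summary"]
assert abs(s["subtotal"] + s["tax_amount"] - s["total_due"]) < 0.01
print("JSON is valid and totals are consistent")
```

Checks like this catch hallucinated or truncated outputs early, which is exactly where un-tuned models tend to fail on complex documents.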

This concludes the notebook, but the journey of improvement will continue!
