Enhancing Synthetic Data Generation with RAG for HTML

March 25th, 2025

Generating high quality synthetic data is essential for training good AI models, especially when real world data is scarce, biased, or difficult to collect. In the context of structured data like HTML, traditional data generation approaches often rely on rule based templates or large language models trained on vast web data. While this works for simple tasks, it often leads to three major issues:

Lack of Contextual Accuracy: Models tend to generate HTML that doesn’t reflect modern web standards or best practices. For example, they might produce a < table > based layout when a < div > based grid system would be more appropriate.
Limited Adaptability: Without external guidance, models struggle to generate HTML tailored to specific frameworks like Bootstrap, Tailwind, or React.
Hallucinations and Inconsistencies: Generative models sometimes produce code that looks plausible but is structurally incorrect or logically flawed.

These challenges highlight the need for a more robust approach; one that integrates external knowledge into the generative process.

Retrieval-Augmented Generation (RAG) and Synthetic Data

To solve these problems, Retrieval-Augmented Generation (RAG) brings a smarter approach to generating HTML. Instead of making guesses based only on what the model already knows, RAG retrieves relevant information from trusted sources; such as HTML documentation, code repositories, and best practice guidelines before generating the final output.

This has several key advantages:

More accurate code: By pulling real-world examples, RAG ensures the generated HTML is up-to-date and follows best practices.
Context-aware generation: The system adapts to different web frameworks and styles, making the code more relevant.
Scalability: As web development evolves, the system can integrate new documentation without requiring a complete model retraining.

How RAG Improves HTML Generation

The RAG pipeline for synthetic HTML generation typically involves three key steps: First, we store HTML documentation, example code snippets, and best practices in a vector database. This allows for efficient retrieval based on semantic similarity. When a user requests an HTML snippet, the system retrieves the most relevant examples from the database. Finally, the generative model synthesizes an HTML response based on both the retrieved content and the user’s request. This method ensures that the generated HTML is not only syntactically correct but also contextually aligned with modern development practices.

What You’ll Learn in This Blog

In the following sections, we’ll explore the practical implementation of RAG for synthetic HTML generation. Here’s what we’ll cover:

Setting Up a Vector Database: We’ll learn how to store and retrieve HTML knowledge using a vector database, which allows for efficient and context aware searches.
Retrieving Relevant HTML Documentation: I’ll show you how to design a retrieval mechanism that fetches the most relevant examples based on user queries.
Generating Well Structured HTML Snippets: Finally, we’ll look at how to use generative models to create HTML snippets that are both accurate and contextually appropriate.

By the end of this article, you’ll have a solid understanding of how RAG can enhance synthetic data generation for structured formats like HTML.

Step 1: Setting Up the Environment

Before the actual implementation, we need to install the necessary dependencies:

!pip install huggingface_hub
!pip install llama-parse
!pip install langchain langchain_community langchain_core langchain_huggingfacep

%%writefile requirements.txt
langchain
langchain-community
fastembed
chromadb
python-dotenv
langchain-groq
chainlit
fastembed
unstructured[md]

!pip install -r requirements.txt


import shutil
import nltk
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')

Step 2: Parsing the HTML Tutorial Document

Using llama-parse, we extract structured text from an HTML tutorial PDF and convert it into Markdown format. This structured format is easier to process when building the dataset.

import nest_asyncio
from llama_parse import LlamaParse

nest_asyncio.apply()
parser = LlamaParse(
api_key=“”,
result_type=“markdown”,
language=“en”,
verbose=True,
is_formatting_instruction=False,
parsing_instruction=“””
create a markdown of the following document.
“””
)

parsed_documents = parser.load_data(“/content/html_tutorial.pdf”)
with open(‘parsed_output.md’, ‘w’) as f:
for doc in parsed_documents:
f.write(doc.text + ‘\n’)

This document serves as our knowledge base, providing relevant HTML information for later retrieval.

Step 3: Creating a Vector Database for Retrieval

We will use langchain and ChromaDB to store HTML-related information. This enables efficient similarity-based searches, retrieving relevant HTML knowledge during response generation.

import json
import re
from huggingface_hub import InferenceClient
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_and_populate_vector_database():
    """
    Creates and populates a vector database with document embeddings.
    This function only needs to be run once to set up the vector database.
    """

markdown_path = “/content/parsed_output.md”
loader = UnstructuredMarkdownLoader(markdown_path)
documents = loader.load()

text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
docs = text_splitter.split_documents(documents)

embed_model = HuggingFaceEmbeddings(model_name=“sentence-transformers/all-MiniLM-L6-v2”)

vectorstore = Chroma(
persist_directory=“./chroma_db”,
embedding_function=embed_model
)
vectorstore.add_documents(docs)
vectorstore.persist()

print(‘Vector database created and populated successfully!’)
return vectorstore

create_and_populate_vector_database()

from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

def query_vector_database(query):
embed_model = HuggingFaceEmbeddings(model_name=“sentence-transformers/all-MiniLM-L6-v2”)

vectorstore = Chroma(
persist_directory=“./chroma_db”,
embedding_function=embed_model
)

results = vectorstore.similarity_search(query, k=1)

retrieved_info = ” “.join(result.page_content for result in results)

return retrieved_info

Step 4: Generating Synthetic Data

Generating HTML Topics

We first generate a diverse set of HTML topics that will serve as categories for our dataset. This ensures our dataset covers various HTML components.

from huggingface_hub import InferenceClient

client = InferenceClient(api_key=“”)
MODEL = “Qwen/Qwen2.5-72B-Instruct”

n_subtopics = 10
TOPIC_GENERATION_PROMPT_TEMPLATE = """\
I want to create a synthetic dataset of natural language instructions and HTML code snippets. Based on this context, give me {n_subtopics} general subtopics to cover
that are different categories of HTML interface components.

The list must be without numbers, and without any description of the subtopics. The subtopics should be separated by a comma. There must be no other text than the list and no ().

def generate_subtopics(client, n_subtopics):
    prompt = TOPIC_GENERATION_PROMPT_TEMPLATE.format(n_subtopics=n_subtopics)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": prompt}
        ],
        temperature=0.2,
        top_p=0.7,
    )
    return response

responses = generate_subtopics(client, n_subtopics=n_subtopics)
print(responses.choices[0].message.content)

Generating Instructions for Each Topic

For each topic, we create user instructions that an AI assistant might receive:

n_instructions = 20
INSTRUCTION_PROMPT_TEMPLATE = """\
The objective is to create a dataset of user instructions in natural language that should be returned by HTML code snippets.
Given a topic in HTML, generate {n_instructions} possible concise instructions that could be given to an AI assistant about that topic.
Write some of these instructions as if given by someone with limited knowledge of HTML terminologies and knowledge,
like a beginner programmer. Your response should be in a list format.

The topic is: {sub_topic}
The list must be without numbers or any special character. The questions/instructions should be separated by a newline character. There must be no other text than the list.

subtopic_list = responses.choices[0].message.content.split(",")

def generate_instructions(client, sub_topic, n_instructions):
    print(f"Generating Instructions for {sub_topic}.")
    prompt = INSTRUCTION_PROMPT_TEMPLATE.format(sub_topic=sub_topic, n_instructions=n_instructions)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": prompt}
        ],
        temperature=0.2,
        top_p=0.7,
    )
    return response.choices[0].message.content

def instructions_generator(client, subtopic_list, n_instructions):
instruction_list = [generate_instructions(client, subtopic, n_instructions) for subtopic in subtopic_list]
return instruction_list

instruction_list = instructions_generator(client, subtopic_list, n_instructions)

instruction_list_formatted = []
for instruction_set in instruction_list:
instruction_list_formatted.extend([instruction.strip() for instruction in instruction_set.split(“\n”) if instruction])

Generating HTML Responses with RAG

Instead of relying solely on a model’s pre-trained knowledge, we retrieve relevant HTML information before generating responses. For each instruction, we retrieve contextual information and generate an HTML response.

RESPONSE_PROMPT_TEMPLATE = """\
Given a question/instruction related to HTML, generate only the HTML code snippet without any explanatory text or additional information.

The user prompt is: {instruction}

Get some help using the following if needed:

{context}

def generate_responses(client, instruction):
    context= query_vector_database(instruction)
    prompt = RESPONSE_PROMPT_TEMPLATE.format(instruction=instruction, context=context)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": prompt}
        ],
        temperature=0.2,
        top_p=0.7,
        max_tokens=60,
    )
    return response.choices[0].message.content

def response_generator(client, instruction_list):
response_list = [generate_responses(client, instruction) for instruction in instruction_list]
return response_list

instruction_response_list = response_generator(client, instruction_list_formatted)

instruction_response_pair_list = []
for instruction, response in zip(instruction_list_formatted, instruction_response_list):
instruction_response_pair_list.append(
{
“instruction”: instruction,
“responses”: response,
}
)

instruction_response_pair_list

Step 6: Saving the Synthetic Dataset

Finally, we structure the generated data and save it as a CSV file:

import pandas as pd

df = pd.DataFrame(instruction_response_pair_list)
df[‘system_prompt’] = “You are a helpful HTML coding assistant”
df.to_csv(“html_dataset.csv”, index=False)

Step 7: Fine-Tuning the Model on UBIAI

With our synthetic dataset ready, the next step is to bring it to life by fine-tuning a model using UBIAI, an intuitive model training platform. This creates a model that generates context aware HTML, ensuring it aligns with real world coding practices.

Uploading the Dataset

The first step is to upload our structured dataset to UBIAI. Since our dataset is stored as a CSV file, the process is straightforward:

Head over to UBIAI and log into your account.
Navigate to the Datasets section and click New Dataset.
Select the html_dataset.csv file we created earlier.
Ensure the columns are mapped correctly; instructions should be recognized as prompts and responses as expected outputs.

At this stage, UBIAI allows you to preview and validate the dataset to make sure everything is formatted correctly before moving forward.

Setting Up for Fine-Tuning

Once the dataset is uploaded, it’s time to configure the fine-tuning process. UBIAI offers various models, and selecting the right one depends on our requirements. For HTML generation, Mistral or Qwen are great choices due to their efficiency in handling structured data. With everything in place, we launch the fine-tuning process. The model starts learning from our synthetic dataset, improving its ability.

Evaluating the Model

Once training is complete, it’s time to test the model. We provide new HTM Lrelated prompts and compare the generated responses with expected results. If inconsistencies arise, further tuning or additional training data may be needed. Once you are satisfied with the results you can deploy your model using the UbiAI API.

Thank you for following along! I hope this tutorial has helped you gain new skills.

Enhancing Synthetic Data Generation with RAG for HTML

Retrieval-Augmented Generation (RAG) and Synthetic Data

How RAG Improves HTML Generation

What You’ll Learn in This Blog

Step 1: Setting Up the Environment

Step 2: Parsing the HTML Tutorial Document

Step 3: Creating a Vector Database for Retrieval

Step 4: Generating Synthetic Data

Generating HTML Topics

Generating Instructions for Each Topic

Generating HTML Responses with RAG

Step 6: Saving the Synthetic Dataset

Step 7: Fine-Tuning the Model on UBIAI

Uploading the Dataset

Setting Up for Fine-Tuning

Evaluating the Model

Unlocking the Power of SLM Distillation for Higher Accuracy and Lower Cost​

How to make smaller models as intelligent as larger ones

Recording Date : March 7th, 2025

Unlock the True Potential of LLMs !

Harnessing AI Agents for Advanced Fraud Detection

How AI Agents Are Revolutionizing Fraud Detection

Recording Date : February 13th, 2025

Unlock the True Potential of LLMs !

Thank you for registering!

Check your email for the live demo details

see you on February 19th

While you’re here, discover how you can use UbiAI to fine-tune highly accurate and reliable AI models!

Thank you for registering!

Check your email for webinar details

see you on March 5th

While you’re here, discover how you can use UbiAI to fine-tune highly accurate and reliable AI models!

Fine Tuning LLMs on Your Own Dataset ​

Fine-Tuning Strategies and Practical Applications

Recording Date : January 15th, 2025

Unlock the True Potential of LLMs !

Unlocking the Power of SLM Distillation for Higher Accuracy and Lower Cost

Fine Tuning LLMs on Your Own Dataset