March 25th, 2025
Generating high-quality synthetic data is essential for training good AI models, especially when real-world data is scarce, biased, or difficult to collect. In the context of structured data like HTML, traditional data generation approaches often rely on rule-based templates or large language models trained on vast web data. While this works for simple tasks, it often leads to three major issues: the model may hallucinate invalid or malformed markup, reproduce outdated patterns instead of current best practices, and produce output that is poorly aligned with the user's actual request.
These challenges highlight the need for a more robust approach: one that integrates external knowledge into the generative process.
To solve these problems, Retrieval-Augmented Generation (RAG) brings a smarter approach to generating HTML. Instead of making guesses based only on what the model already knows, RAG retrieves relevant information from trusted sources, such as HTML documentation, code repositories, and best-practice guidelines, before generating the final output.

This approach has several key advantages: outputs are grounded in trusted references rather than the model's memory alone, they reflect current best practices, and they stay aligned with the specific user request.
The RAG pipeline for synthetic HTML generation typically involves three key steps. First, we store HTML documentation, example code snippets, and best practices in a vector database, which allows efficient retrieval based on semantic similarity. Second, when a user requests an HTML snippet, the system retrieves the most relevant examples from the database. Finally, the generative model synthesizes an HTML response based on both the retrieved content and the user's request. This method ensures that the generated HTML is not only syntactically correct but also contextually aligned with modern development practices.
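In code terms, the pipeline reduces to three composable steps. The sketch below is purely illustrative (the function names and signatures are hypothetical); the concrete LangChain and ChromaDB implementation follows in the next sections.

# Illustrative sketch only: hypothetical names, real implementation below.

def build_knowledge_base(reference_docs):
    """Step 1: embed reference material and store it in a vector database."""
    ...

def retrieve_context(query, vectorstore, k=1):
    """Step 2: fetch the k chunks most semantically similar to the query."""
    ...

def generate_html(query, context, llm):
    """Step 3: condition the LLM on both the user request and the retrieved context."""
    ...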
In the following sections, we'll explore the practical implementation of RAG for synthetic HTML generation: installing the dependencies, parsing an HTML tutorial into a Markdown knowledge base, storing it in a vector database, generating topics, instructions, and retrieval-grounded responses, and finally fine-tuning a model on the resulting dataset with UBIAI.
By the end of this article, you’ll have a solid understanding of how RAG can enhance synthetic data generation for structured formats like HTML.
Before the actual implementation, we need to install the necessary dependencies:
!pip install huggingface_hub
!pip install llama-parse
!pip install langchain langchain_community langchain_core langchain_huggingface

%%writefile requirements.txt
langchain
langchain-community
fastembed
chromadb
python-dotenv
langchain-groq
chainlit
unstructured[md]

!pip install -r requirements.txt
import shutil
import nltk

# NLTK resources required by the unstructured Markdown loader
nltk.download('punkt_tab')
nltk.download('punkt')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('averaged_perceptron_tagger_eng')
Using llama-parse, we extract structured text from an HTML tutorial PDF and convert it into Markdown format. This structured format is easier to process when building the dataset.
import nest_asyncio
from llama_parse import LlamaParse

nest_asyncio.apply()

parser = LlamaParse(
    api_key="",  # your LlamaCloud API key
    result_type="markdown",
    language="en",
    verbose=True,
    is_formatting_instruction=False,
    parsing_instruction="""
    create a markdown of the following document.
    """
)

parsed_documents = parser.load_data("/content/html_tutorial.pdf")

with open('parsed_output.md', 'w') as f:
    for doc in parsed_documents:
        f.write(doc.text + '\n')
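A quick, optional sanity check confirms the Markdown was written correctly:

# Optional: preview the first 500 characters of the parsed Markdown
with open('parsed_output.md') as f:
    print(f.read(500))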
This document serves as our knowledge base, providing relevant HTML information for later retrieval.
We will use LangChain and ChromaDB to store HTML-related information. This enables efficient similarity-based searches, retrieving relevant HTML knowledge during response generation.
import json
import re
from huggingface_hub import InferenceClient
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

def create_and_populate_vector_database():
    """
    Creates and populates a vector database with document embeddings.
    This function only needs to be run once to set up the vector database.
    """
    markdown_path = "/content/parsed_output.md"
    loader = UnstructuredMarkdownLoader(markdown_path)
    documents = loader.load()

    # Split the document into overlapping chunks for embedding
    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
    docs = text_splitter.split_documents(documents)

    embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectorstore = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embed_model
    )
    vectorstore.add_documents(docs)
    vectorstore.persist()
    print('Vector database created and populated successfully!')
    return vectorstore

create_and_populate_vector_database()
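To confirm the store was populated, you can count the persisted chunks. Note that this reaches into a private attribute of the LangChain Chroma wrapper, so treat it as a debugging aid only:

# Debugging aid: count the chunks persisted in ./chroma_db
store = Chroma(
    persist_directory="./chroma_db",
    embedding_function=HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
)
print(store._collection.count())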
from langchain.vectorstores import Chroma
from langchain.embeddings import HuggingFaceEmbeddings

def query_vector_database(query):
    # Re-open the persisted vector store and return the most similar chunk
    embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
    vectorstore = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embed_model
    )
    results = vectorstore.similarity_search(query, k=1)
    retrieved_info = " ".join(result.page_content for result in results)
    return retrieved_info
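We can test the retriever with a sample query (the exact output depends on the parsed tutorial):

# Example: retrieve the most relevant chunk for a sample question
print(query_vector_database("How do I create an HTML table?"))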
We first generate a diverse set of HTML topics that will serve as categories for our dataset. This ensures our dataset covers various HTML components.
from huggingface_hub import InferenceClient

client = InferenceClient(api_key="")  # your Hugging Face API token
MODEL = "Qwen/Qwen2.5-72B-Instruct"
n_subtopics = 10

TOPIC_GENERATION_PROMPT_TEMPLATE = """\
I want to create a synthetic dataset of natural language instructions and HTML code snippets. Based on this context, give me {n_subtopics} general subtopics to cover
that are different categories of HTML interface components.

The list must be without numbers, and without any description of the subtopics. The subtopics should be separated by a comma. There must be no other text than the list and no ().
"""

def generate_subtopics(client, n_subtopics):
    prompt = TOPIC_GENERATION_PROMPT_TEMPLATE.format(n_subtopics=n_subtopics)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": prompt}
        ],
        temperature=0.2,
        top_p=0.7,
    )
    return response

responses = generate_subtopics(client, n_subtopics=n_subtopics)
print(responses.choices[0].message.content)
For each topic, we create user instructions that an AI assistant might receive:
n_instructions = 20

INSTRUCTION_PROMPT_TEMPLATE = """\
The objective is to create a dataset of user instructions in natural language that should be answered with HTML code snippets.
Given a topic in HTML, generate {n_instructions} possible concise instructions that could be given to an AI assistant about that topic.
Write some of these instructions as if given by someone with limited knowledge of HTML terminology,
like a beginner programmer. Your response should be in a list format.

The topic is: {sub_topic}

The list must be without numbers or any special character. The questions/instructions should be separated by a newline character. There must be no other text than the list.
"""

subtopic_list = responses.choices[0].message.content.split(",")

def generate_instructions(client, sub_topic, n_instructions):
    print(f"Generating Instructions for {sub_topic}.")
    prompt = INSTRUCTION_PROMPT_TEMPLATE.format(sub_topic=sub_topic, n_instructions=n_instructions)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": prompt}
        ],
        temperature=0.2,
        top_p=0.7,
    )
    return response.choices[0].message.content

def instructions_generator(client, subtopic_list, n_instructions):
    instruction_list = [generate_instructions(client, subtopic, n_instructions) for subtopic in subtopic_list]
    return instruction_list

instruction_list = instructions_generator(client, subtopic_list, n_instructions)

# Flatten the per-topic responses into a single list of instructions
instruction_list_formatted = []
for instruction_set in instruction_list:
    instruction_list_formatted.extend([instruction.strip() for instruction in instruction_set.split("\n") if instruction])
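A quick check of the flattened list confirms the expected volume (roughly n_subtopics x n_instructions entries):

# Sanity check: total count and a few sample instructions
print(len(instruction_list_formatted))
print(instruction_list_formatted[:3])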
Instead of relying solely on a model’s pre-trained knowledge, we retrieve relevant HTML information before generating responses. For each instruction, we retrieve contextual information and generate an HTML response.
RESPONSE_PROMPT_TEMPLATE = """\
Given a question/instruction related to HTML, generate only the HTML code snippet without any explanatory text or additional information.

The user prompt is: {instruction}

Get some help using the following if needed:
{context}
"""

def generate_responses(client, instruction):
    # Retrieve relevant context from the vector database before generating
    context = query_vector_database(instruction)
    prompt = RESPONSE_PROMPT_TEMPLATE.format(instruction=instruction, context=context)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": prompt}
        ],
        temperature=0.2,
        top_p=0.7,
        max_tokens=60,  # increase this if longer snippets get truncated
    )
    return response.choices[0].message.content

def response_generator(client, instruction_list):
    response_list = [generate_responses(client, instruction) for instruction in instruction_list]
    return response_list

instruction_response_list = response_generator(client, instruction_list_formatted)

# Pair each instruction with its generated HTML response
instruction_response_pair_list = []
for instruction, response in zip(instruction_list_formatted, instruction_response_list):
    instruction_response_pair_list.append(
        {
            "instruction": instruction,
            "responses": response,
        }
    )
instruction_response_pair_list
Finally, we structure the generated data and save it as a CSV file:
import pandas as pd

df = pd.DataFrame(instruction_response_pair_list)
df['system_prompt'] = "You are a helpful HTML coding assistant"
df.to_csv("html_dataset.csv", index=False)
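Before uploading, a quick local validation pass can catch obvious problems. This is a minimal sketch; the column names match the DataFrame built above:

# Minimal sanity checks on the exported dataset
check = pd.read_csv("html_dataset.csv")
print(check.shape)
print(check.isnull().sum())  # missing values per column
print(check['responses'].str.contains("<", na=False).mean())  # fraction of rows containing markup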
With our synthetic dataset ready, the next step is to bring it to life by fine-tuning a model using UBIAI, an intuitive model-training platform. The result is a model that generates context-aware HTML aligned with real-world coding practices.
The first step is to upload our structured dataset to UBIAI. Since our dataset is stored as a CSV file, the process is straightforward:
At this stage, UBIAI allows you to preview and validate the dataset to make sure everything is formatted correctly before moving forward.

Once the dataset is uploaded, it's time to configure the fine-tuning process. UBIAI offers various models, and selecting the right one depends on our requirements. For HTML generation, Mistral or Qwen are great choices due to their efficiency in handling structured data. With everything in place, we launch the fine-tuning process. The model starts learning from our synthetic dataset, improving its ability to generate accurate, well-structured HTML.

Once training is complete, it's time to test the model. We provide new HTML-related prompts and compare the generated responses with expected results. If inconsistencies arise, further tuning or additional training data may be needed. Once you are satisfied with the results, you can deploy your model using the UBIAI API.

Thank you for following along! I hope this tutorial has helped you gain new skills.