Enhancing Synthetic Data Generation with RAG for HTML

March 25th, 2025

Generating high-quality synthetic data is essential for training good AI models, especially when real-world data is scarce, biased, or difficult to collect. In the context of structured data like HTML, traditional data-generation approaches often rely on rule-based templates or large language models trained on vast web data. While this works for simple tasks, it often leads to three major issues:

  • Lack of Contextual Accuracy: Models tend to generate HTML that doesn’t reflect modern web standards or best practices. For example, they might produce a <table>-based layout when a <div>-based grid system would be more appropriate.
  • Limited Adaptability: Without external guidance, models struggle to generate HTML tailored to specific frameworks like Bootstrap, Tailwind, or React.
  • Hallucinations and Inconsistencies: Generative models sometimes produce code that looks plausible but is structurally incorrect or logically flawed.

These challenges highlight the need for a more robust approach: one that integrates external knowledge into the generative process.

 

To solve these problems, Retrieval-Augmented Generation (RAG) brings a smarter approach to generating HTML. Instead of making guesses based only on what the model already knows, RAG retrieves relevant information from trusted sources, such as HTML documentation, code repositories, and best-practice guidelines, before generating the final output.

This has several key advantages:

  • More accurate code: By pulling real-world examples, RAG ensures the generated HTML is up-to-date and follows best practices.
  • Context-aware generation: The system adapts to different web frameworks and styles, making the code more relevant.
  • Scalability: As web development evolves, the system can integrate new documentation without requiring a complete model retraining.

The RAG pipeline for synthetic HTML generation typically involves three key steps: First, we store HTML documentation, example code snippets, and best practices in a vector database. This allows for efficient retrieval based on semantic similarity. When a user requests an HTML snippet, the system retrieves the most relevant examples from the database. Finally, the generative model synthesizes an HTML response based on both the retrieved content and the user’s request. This method ensures that the generated HTML is not only syntactically correct but also contextually aligned with modern development practices.

 

In the following sections, we’ll explore the practical implementation of RAG for synthetic HTML generation. Here’s what we’ll cover:

  • Setting Up a Vector Database: We’ll learn how to store and retrieve HTML knowledge using a vector database, which allows for efficient and context-aware searches.
  • Retrieving Relevant HTML Documentation: We’ll design a retrieval mechanism that fetches the most relevant examples based on user queries.
  • Generating Well-Structured HTML Snippets: Finally, we’ll use generative models to create HTML snippets that are both accurate and contextually appropriate.

By the end of this article, you’ll have a solid understanding of how RAG can enhance synthetic data generation for structured formats like HTML.

 

Before the actual implementation, we need to install the necessary dependencies:
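The original post doesn’t show the install command itself; one possible version is sketched below, with the package names inferred from the imports used later in this article (exact names and versions are assumptions):

```shell
# Hypothetical install command; package list inferred from the code below.
pip install llama-parse nest-asyncio langchain langchain-community \
    chromadb sentence-transformers huggingface-hub unstructured pandas
```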


 

Using llama-parse, we extract structured text from an HTML tutorial PDF and convert it into Markdown format. This structured format is easier to process when building the dataset.

import nest_asyncio
from llama_parse import LlamaParse

nest_asyncio.apply()

parser = LlamaParse(
    api_key="",
    result_type="markdown",
    language="en",
    verbose=True,
    is_formatting_instruction=False,
    parsing_instruction="""
    create a markdown of the following document.
    """,
)

parsed_documents = parser.load_data("/content/html_tutorial.pdf")

with open("parsed_output.md", "w") as f:
    for doc in parsed_documents:
        f.write(doc.text + "\n")

This document serves as our knowledge base, providing relevant HTML information for later retrieval.

 

We will use langchain and ChromaDB to store HTML-related information. This enables efficient similarity-based searches, retrieving relevant HTML knowledge during response generation.

from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import Chroma

def create_and_populate_vector_database():
    markdown_path = "/content/parsed_output.md"
    loader = UnstructuredMarkdownLoader(markdown_path)
    documents = loader.load()

    text_splitter = RecursiveCharacterTextSplitter(chunk_size=2000, chunk_overlap=100)
    docs = text_splitter.split_documents(documents)

    embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    vectorstore = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embed_model,
    )
    vectorstore.add_documents(docs)
    vectorstore.persist()

    print("Vector database created and populated successfully!")
    return vectorstore

create_and_populate_vector_database()

def query_vector_database(query):
    embed_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

    vectorstore = Chroma(
        persist_directory="./chroma_db",
        embedding_function=embed_model,
    )

    results = vectorstore.similarity_search(query, k=1)

    retrieved_info = " ".join(result.page_content for result in results)

    return retrieved_info


We first generate a diverse set of HTML topics that will serve as categories for our dataset. This ensures our dataset covers various HTML components.

from huggingface_hub import InferenceClient

client = InferenceClient(api_key="")
MODEL = "Qwen/Qwen2.5-72B-Instruct"

The prompt that generates the topic list ends with strict formatting constraints: "The list must be without numbers, and without any description of the subtopics. The subtopics should be separated by a comma. There must be no other text than the list and no ()."
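The topic-generation call itself isn’t shown in the original; a minimal sketch follows. `generate_topics` and `parse_topic_list` are hypothetical helper names, and the `chat_completion` call assumes the `huggingface_hub` `InferenceClient` API with the model defined above:

```python
def parse_topic_list(raw):
    # Split the model's comma-separated reply and drop empty entries.
    return [topic.strip() for topic in raw.split(",") if topic.strip()]

def generate_topics(client, model, n_topics):
    # Ask the model for a bare, comma-separated list of HTML subtopics.
    prompt = (
        f"Generate a list of {n_topics} HTML subtopics. "
        "The list must be without numbers, and without any description of the "
        "subtopics. The subtopics should be separated by a comma. "
        "There must be no other text than the list and no ()."
    )
    completion = client.chat_completion(
        messages=[{"role": "user", "content": prompt}],
        model=model,
        max_tokens=500,
    )
    return parse_topic_list(completion.choices[0].message.content)
```

Calling something like `subtopic_list = generate_topics(client, MODEL, 20)` would then feed the instruction-generation step below.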


 

For each topic, we create user instructions that an AI assistant might receive:


Each prompt embeds the topic plus formatting constraints: "The topic is: {sub_topic}. The list must be without numbers or any special character. The questions/instructions should be separated by a newline character. There must be no other text than the list."

def instructions_generator(client, subtopic_list, n_instructions):
    instruction_list = [generate_instructions(client, subtopic, n_instructions) for subtopic in subtopic_list]
    return instruction_list
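The `generate_instructions` helper called above isn’t shown in the original; here is one possible sketch. The prompt wording follows the constraints quoted earlier, the `chat_completion` call assumes the `huggingface_hub` `InferenceClient` API, and the default model matches the `MODEL` defined above:

```python
def build_instruction_prompt(sub_topic, n_instructions):
    # Build the per-topic prompt with the formatting constraints quoted above.
    return (
        f"Generate {n_instructions} user questions or instructions about HTML.\n"
        f"The topic is: {sub_topic}\n"
        "The list must be without numbers or any special character. "
        "The questions/instructions should be separated by a newline character. "
        "There must be no other text than the list."
    )

def generate_instructions(client, sub_topic, n_instructions, model="Qwen/Qwen2.5-72B-Instruct"):
    # Return the raw newline-separated instruction list for one subtopic.
    completion = client.chat_completion(
        messages=[{"role": "user", "content": build_instruction_prompt(sub_topic, n_instructions)}],
        model=model,
        max_tokens=1000,
    )
    return completion.choices[0].message.content
```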

# Flatten the per-topic instruction sets into a single list.
instruction_list_formatted = []
for instruction_set in instruction_list:
    instruction_list_formatted.extend([instruction.strip() for instruction in instruction_set.split("\n") if instruction])

 

 

Instead of relying solely on a model’s pre-trained knowledge, we retrieve relevant HTML information before generating responses. For each instruction, we retrieve contextual information and generate an HTML response.

 

The response prompt embeds both the user instruction and the retrieved context:

The user prompt is: {instruction}

Get some help using the following if needed:

{context}

def response_generator(client, instruction_list):
    response_list = [generate_responses(client, instruction) for instruction in instruction_list]
    return response_list
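The `generate_responses` helper called above isn’t shown in the original; a minimal sketch follows. It assumes `query_vector_database` from the retrieval section is in scope, uses the prompt template above, and the `chat_completion` call assumes the `huggingface_hub` `InferenceClient` API with the default model matching `MODEL`:

```python
def build_rag_prompt(instruction, context):
    # Embed the retrieved documentation alongside the user instruction.
    return (
        f"The user prompt is: {instruction}\n\n"
        "Get some help using the following if needed:\n\n"
        f"{context}"
    )

def generate_responses(client, instruction, model="Qwen/Qwen2.5-72B-Instruct"):
    # Retrieve supporting context first, then generate the HTML answer.
    context = query_vector_database(instruction)
    completion = client.chat_completion(
        messages=[{"role": "user", "content": build_rag_prompt(instruction, context)}],
        model=model,
        max_tokens=2000,
    )
    return completion.choices[0].message.content
```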

instruction_response_list = response_generator(client, instruction_list_formatted)

instruction_response_pair_list = []
for instruction, response in zip(instruction_list_formatted, instruction_response_list):
    instruction_response_pair_list.append(
        {
            "instruction": instruction,
            "responses": response,
        }
    )

 

 

Finally, we structure the generated data and save it as a CSV file:

import pandas as pd

df = pd.DataFrame(instruction_response_pair_list)
df["system_prompt"] = "You are a helpful HTML coding assistant"
df.to_csv("html_dataset.csv", index=False)
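Before uploading the file anywhere, it can be worth sanity-checking it locally. A minimal sketch using only the standard library (`validate_dataset` is a hypothetical helper; the column names match those written above):

```python
import csv

REQUIRED_COLUMNS = ("instruction", "responses", "system_prompt")

def validate_dataset(path):
    # Every row must have a non-empty value for each required column.
    with open(path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    return bool(rows) and all(
        all(row.get(col) for col in REQUIRED_COLUMNS) for row in rows
    )
```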

 

 

With our synthetic dataset ready, the next step is to bring it to life by fine-tuning a model using UBIAI, an intuitive model-training platform. This produces a model that generates context-aware HTML aligned with real-world coding practices.

 

 

The first step is to upload our structured dataset to UBIAI. Since our dataset is stored as a CSV file, the process is straightforward:

  • Head over to UBIAI and log into your account.
  • Navigate to the Datasets section and click New Dataset.
  • Select the html_dataset.csv file we created earlier.
  • Ensure the columns are mapped correctly: instructions should be recognized as prompts and responses as expected outputs.

At this stage, UBIAI allows you to preview and validate the dataset to make sure everything is formatted correctly before moving forward.

 

 

Once the dataset is uploaded, it’s time to configure the fine-tuning process. UBIAI offers various models, and selecting the right one depends on our requirements. For HTML generation, Mistral or Qwen are great choices due to their efficiency in handling structured data. With everything in place, we launch the fine-tuning process. The model starts learning from our synthetic dataset, improving its ability to generate accurate, well-structured HTML.

 

 

Once training is complete, it’s time to test the model. We provide new HTML-related prompts and compare the generated responses with the expected results. If inconsistencies arise, further tuning or additional training data may be needed. Once you are satisfied with the results, you can deploy your model using the UBIAI API.

Thank you for following along! I hope this tutorial has helped you gain new skills.
