ubiai deep learning

Synthetic Data Generation for SQL Commands

Dec 23rd, 2024

Synthetic Data Generation for SQL Commands

This notebook demonstrates an automated approach to synthetic data generation for instruction-response datasets, where each instruction (or question) is paired with a corresponding SQL command as a response. The goal is to streamline the process of creating datasets that can be used for fine-tuning AI models capable of understanding and producing SQL commands from natural language instructions.

 

You can find the Google Colab notebook following this link: https://colab.research.google.com/drive/1RBLpXynNcpFw2PPpLrBE7fXgF4aU1pBn?usp=sharing

You can watch the step-by-step video tutorial below:

You can watch the step-by-step video tutorial below:

Why Do We Generate Data?

Data is all around us, it’s the foundation for making decisions, powering AI models, and shaping the future of businesses, organizations, and society. But sometimes, real-world data isn’t easy to access. It could be difficult to collect, too expensive to obtain, or too sensitive to use.

 

This is where synthetic data comes into play. It’s computer-generated data that mimics the structure and patterns of real-world data, offering a solution that lets us explore and analyze insights without the risks associated with using actual, sensitive information.

Here’s why generating synthetic data is so important:

  • Fine-Tuning AI Models: The synthetic datasets we create are perfect for fine-tuning AI models, helping them understand various tasks like natural language processing and SQL command generation, without the need for large volumes of real-world data.
  • Cost and Time Efficiency: Generating synthetic data saves significant time and resources compared to manually collecting and curating real-world datasets. This method ensures high-quality results without the hassle.
  • Broad AI Applications: Synthetic data isn’t just for training models; it’s also used for testing, data augmentation, and benchmarking. By providing a diverse set of examples, synthetic data helps AI models generalize better, ultimately leading to improved real-world performance.

 

Synthetic data is a game-changer, offering a practical, efficient, and secure way to enhance AI systems and unlock new possibilities across industries.

How It Works

Here’s the game plan:

  • Subtopic Generation: Prompt the model to generate SQL-related subtopics like Data Retrieval, Joins, Aggregations, etc.
  • Instruction Generation: For each subtopic, prompt the model to create natural language instructions or questions.
  • SQL Response Generation: Prompt the model to generate SQL commands corresponding to each instruction.
  • Dataset Filtering: Use a reward model to filter out low-quality instruction-SQL pairs, ensuring only relevant responses are retained.

Step 1: Setting Up the Environment

We prepare the environment by installing necessary libraries and establishing a connection to Hugging Face’s Inference API. This setup allows us to access the Qwen2.5-72B-Instruct model, which I chose for its ability to generate SQL commands directly without any additional text when prompted. By creating a dataset from its responses, we can fine-tune a model on its answering style, ensuring high-quality SQL command generation. (You can pick any other model for your used case)

				
					%%capture
!pip install openai==1.55.3 httpx==0.27.2 --force-reinstall --quiet
				
			
				
					import os
os.kill(os.getpid(), 9)
				
			
				
					from huggingface_hub import InferenceClient

client = InferenceClient(api_key="Token")
MODEL = "Qwen/Qwen2.5-72B-Instruct"
				
			

Step 2: Generating Subtopics

Here, we focus on generating a set of subtopics related to the broader topic of SQL commands. The goal is to create a list of distinct SQL categories that we can later use to generate natural language instructions and corresponding SQL queries. While I chose SQL commands as the focus for this project, you could easily adapt this approach to generate subtopics for any other area of interest.

  • Prompt Template: We define a prompt that asks the model to generate a list of subtopics related to SQL commands. The prompt specifies that the subtopics should be general categories (such as “Data Manipulation” or “Joins”) and must not include numbers, descriptions, or any extraneous text.
  • Generating Subtopics: Using the Hugging Face InferenceClient, we pass the prompt to the Qwen2.5-72B-Instruct model, which generates a list of subtopics. In our example, we are asking for 10 subtopics related to SQL commands, but you can adjust this number.

By generating these subtopics, we lay the foundation for the rest of the notebook.

				
					n_subtopics = 10
TOPIC_GENERATION_PROMPT_TEMPLATE = """\
I want to create a synthetic dataset of natural language and SQL commands. Based on this context, give me {n_subtopics} general subtopics to cover
that are diffrent sql command categories.

The list must be without numbers, and without any description of the subtopics. The subtopics should be separated by a comma. There must be no other text than the list and no ().
"""
				
			
				
					def generate_subtopics(client, n_subtopics):
    prompt = TOPIC_GENERATION_PROMPT_TEMPLATE.format(n_subtopics=n_subtopics)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": prompt}
        ],
        temperature=0.2,
        top_p=0.7,
    )
    return response
				
			
				
					responses = generate_subtopics(client, n_subtopics=n_subtopics)
print(responses.choices[0].message.content)
				
			

OUTPUT: Data Selection, Data Insertion, Data Update, Data Deletion, Data Aggregration, Data Join Operations, Data Filtering, Data Sorting, Data Grouping, Data Subqueries

Automated Metrics

Now that we have our subtopics, the next step is to generate natural language instructions related to each of those subtopics. These instructions are meant to be concise queries or requests that would correspond to SQL commands, serving as the basis for generating synthetic datasets.

  • Prompt Template for Instruction Generation: We define a template that asks the model to generate n_instructions related to a specific SQL subtopic. The instructions should be simple and clear, with some written as if they come from a beginner or someone with limited SQL knowledge. This helps ensure the instructions are realistic and can mimic real user interactions.
  • Generating Instructions: For each subtopic (like “Data Manipulation” or “Joins”), we pass the template to the Qwen2.5-72B-Instruct model. The model generates a list of instructions in natural language.
  • Processing Multiple Subtopics: We apply the instruction generation function to all the subtopics generated earlier. For each subtopic, we get a list of instructions that we then format into a single, unified list.

This step ensures that we have a variety of instructions for each SQL subtopic.

				
					n_instructions = 20
INSTRUCTION_PROMPT_TEMPLATE = """\
The objective is to create a dataset of user instructions in natural language that should be returned by SQL commands.
Given a topic in SQL, generate {n_instructions} possible concise instructions that could be given to an AI assistant about that topic.
Write some of these instructions as if given by someone with limited knowledge of SQL terminologies and knowledge,
like a beginner programmer. Your response should be in a list format.

The topic is: {sub_topic}
The list must be without numbers or any special character. The questions/instructions should be separated by a newline character. There must be no other text than the list.
"""
				
			
				
					subtopic_list = responses.choices[0].message.content.split(",")
				
			
				
					def generate_instructions(client, sub_topic, n_instructions):
    print(f"Generating Instructions for {sub_topic}.")
    prompt = INSTRUCTION_PROMPT_TEMPLATE.format(sub_topic=sub_topic, n_instructions=n_instructions)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": prompt}
        ],
        temperature=0.2,
        top_p=0.7,
    )
    return response.choices[0].message.content

def instructions_generator(client, subtopic_list, n_instructions):
    instruction_list = [generate_instructions(client, subtopic, n_instructions) for subtopic in subtopic_list]
    return instruction_list
				
			
				
					instruction_list = instructions_generator(client, subtopic_list, n_instructions)

instruction_list_formatted = []
for instruction_set in instruction_list:
    instruction_list_formatted.extend([instruction.strip() for instruction in instruction_set.split("\n") if instruction])
				
			

OUTPUT:

Generating Instructions for Data Selection.
Generating Instructions for Data Insertion.
Generating Instructions for Data Update.
Generating Instructions for Data Deletion.
Generating Instructions for Data Aggregration.
Generating Instructions for Data Join Operations.
Generating Instructions for Data Filtering.
Generating Instructions for Data Sorting.
Generating Instructions for Data Grouping.
Generating Instructions for Data Subqueries.

The final result is a list of instructions that are ready to be paired with corresponding SQL queries, forming the foundation of the dataset.

				
					instruction_list_formatted[:6]
['Show me how to select all columns from a table',
 'Can you give me the SQL command to select specific columns from a table',
 'How do I select data from a table where a column has a certain value',
 'Can you show me how to filter rows in a table',
 'I need to select datawhere multiple conditionsare met',
 'How do I select only distinct values from a column']
				
			

Step 4: Generating Responses

Once we have our instructions ready, the next step is to generate the SQL commands that correspond to each of those instructions. The idea is to take each natural language instruction and get a precise SQL query as the response.

  • Response Prompt Template: We define a prompt template that asks the model to generate the SQL command based on a given instruction. The model is specifically instructed to provide only the SQL query, with no additional explanation or text. This ensures that the response is concise and strictly focused on the SQL command.
  • Pairing Instructions with Responses: After generating the SQL responses, each instruction is paired with its corresponding SQL command. This pairing is essential for creating a comprehensive instruction-response dataset that can be used for training or testing AI models.

This step ensures that we have both the natural language instructions and their corresponding SQL responses, ready for dataset construction and further use.

				
					RESPONSE_PROMPT_TEMPLATE = """\
Given a question/instruction related to SQL, generate only the SQL command without any explanatory text or additional information.

The user prompt is: {instruction}
"""
				
			
				
					def generate_responses(client, instruction):
    prompt = RESPONSE_PROMPT_TEMPLATE.format(instruction=instruction)
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "user",
             "content": prompt}
        ],
        temperature=0.2,
        top_p=0.7,
        max_tokens=60,
    )
    return response.choices[0].message.content


def response_generator(client, instruction_list):
    response_list = [generate_responses(client, instruction) for instruction in instruction_list]
    return response_list
				
			
				
					instruction_response_list = response_generator(client, instruction_list_formatted)

instruction_response_pair_list = []
for instruction, response in zip(instruction_list_formatted, instruction_response_list):
    instruction_response_pair_list.append(
        {
            "instruction": instruction,
            "responses": response,
        }
    )
				
			

The outcome is a list of instruction-response pairs, where each entry contains a user instruction and the corresponding SQL command.

				
					instruction_response_pair_list
[{'instruction': 'Show me how to select all columns from a table',
  'responses': 'SELECT * FROM table_name;'},
 {'instruction': 'Can you give me the SQL command to select specific columns from a table',
  'responses': 'SELECT column1, column2, column3 FROM table_name;'},
 {'instruction': 'How do I select data from a table where a column has a certain value',
  'responses': "SELECT * FROM table_name WHERE column_name = 'certain_value';"},
 {'instruction': 'Can you show me how to filter rows in a table',
  'responses': 'SELECT * FROM table_name WHERE condition;'},
 {'instruction': 'I need to select datawhere multiple conditionsare met',
  'responses': 'SELECT * FROM table_name WHERE condition1 AND condition2 AND condition3;'},
 {'instruction': 'How do I select only distinct values from a column',
  'responses': 'SELECT DISTINCT column_name FROM table_name;'},
 {'instruction': 'Can you help me select data with a limit on the number of rows',
  'responses': 'SELECT * FROM table_name LIMIT 10;'},
 {'instruction': 'How do I select data and sort it by a column',
  'responses': 'SELECT * FROM table_name ORDER BY column_name;'},
 {'instruction': 'Show me how to selectdatawith an order by clause',
  'responses': 'SELECT * FROM table_name ORDER BY column_name;'},
 {'instruction': 'Can you give me the command to select data and group it by a column',
  'responses': 'SELECT column1, COUNT(*) FROM table_name GROUP BY column1;'},
 {'instruction': 'How do I use the where clause to select data',
  'responses': 'SELECT * FROM table_name WHERE condition;'},
 {'instruction': 'Can you show me how to select data using a subquery',
  'responses': '```sql\nSELECT column_name\nFROM table_name\nWHERE column_name IN (SELECT column_name FROM table_name WHERE condition);\n```'},
 {'instruction': 'I need to select data from multiple tables at once',
  'responses': 'SELECT * FROM table1 JOIN table2 ON table1.id = table2.id JOIN table3 ON table2.id = table3.id;'},
 {'instruction': 'How do I select data with a join between two tables',
  'responses': 'SELECT * FROM table1 JOIN table2 ON table1.id = table2.id;'},
 {'instruction': 'Can you show me how to select data with an inner join',
  'responses': 'SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id;'},
 {'instruction': 'How doIselectdatawithaleftjoin',
  'responses': 'SELECT * FROM table1 LEFT JOIN table2 ON table1.id = table2.id'},
 {'instruction': 'Can you help me select data with a right join',
  'responses': 'SELECT * FROM table1 RIGHT JOIN table2 ON table1.id = table2.id;'},
 {'instruction': 'Show me how to select data with a full outer join',
  'responses': 'SELECT * FROM table1 FULL OUTER JOIN table2 ON table1.id = table2.id;'},
 {'instruction': 'How do I select data and apply a function to a column',
  'responses': 'SELECT column_name, FUNCTION_NAME(column_name) FROM table_name;'},
 {'instruction': 'Can you show me howtouselectdatawith a union clause',
  'responses': '```sql\nSELECT column1, column2 FROM table1\nUNION\nSELECT column1, column2 FROM table2;\n```'},
 {'instruction': 'Add a new record to the customers table',
  'responses': 'INSERT INTO customers (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Insert a row into the products table',
  'responses': 'INSERT INTO products (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Put new dataintotheorders table',
  'responses': 'INSERT INTO orders (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Add a new entry to the employees table',
  'responses': 'INSERT INTO employees (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Insert a record into the inventory table',
  'responses': 'INSERT INTO inventory (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Add a new item to the sales table',
  'responses': "INSERT INTO sales (item_name, quantity, price) VALUES ('New Item', 1, 100.00);"},
 {'instruction': 'Put a new row in the suppliers table',
  'responses': 'INSERT INTO suppliers (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Insert a new record into the categories table',
  'responses': "INSERT INTO categories (category_name, description) VALUES ('New Category', 'Description for the new category');"},
 {'instruction': 'Add a new entry to the payments table',
  'responses': "INSERT INTO payments (customer_id, payment_date, amount) VALUES (123, '2023-10-01', 500.00);"},
 {'instruction': 'Put a new record in the shipments table',
  'responses': 'INSERT INTO shipments (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Add a new row to the returns table',
  'responses': 'INSERT INTO returns (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Insert a record into the discounts table',
  'responses': 'INSERT INTO discounts (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Add a newentrytotheusers table',
  'responses': 'INSERT INTO users (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Put a newrecordin the addresses table',
  'responses': 'INSERT INTO addresses (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Insert a new row into the transactions table',
  'responses': 'INSERT INTO transactions (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Add a new record to the reviews table',
  'responses': 'INSERT INTO reviews (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Put a new entry in the feedbacks table',
  'responses': 'INSERT INTO feedbacks (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Insert a record into the sessions table',
  'responses': "INSERT INTO sessions (session_id, user_id, start_time, end_time) VALUES ('12345', 'user1', '2023-10-01 08:00:00', '2023-10-01"},
 {'instruction': 'Add a new row to the logs table',
  'responses': 'INSERT INTO logs (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Put a new record in the settings table',
  'responses': 'INSERT INTO settings (column1, column2, column3) VALUES (value1, value2, value3);'},
 {'instruction': 'Add a new entry to the configurations table',
  'responses': "INSERT INTO configurations (setting_name, value) VALUES ('new_setting', 'new_value');"},
 {'instruction': 'Update the email for a specific user in the users table',
  'responses': "UPDATE users SET email = 'new_email@example.com' WHERE user_id = 123;"},
 {'instruction': 'Change the status of an order to completed',
  'responses': "UPDATE orders SET status = 'completed' WHERE order_id = [specific_order_id];"},
 {'instruction': 'Set the quantity of a product to zero if it is out of stock',
  'responses': 'UPDATE products SET quantity = 0 WHERE in_stock = false;'},
 {'instruction': 'Update the last login date for a user',
  'responses': 'UPDATE users SET last_login_date = CURRENT_DATE WHERE user_id = 1;'},
 {'instruction': 'Change the price of a product by a certain percentage',
  'responses': 'UPDATE products SET price = price * (1 + :percentage) WHERE product_id = :product_id;'},
 {'instruction': 'Update the address for a customer in the customers table',
  'responses': "UPDATE customers SET address = 'New Address' WHERE customer_id = 123;"},
 {'instruction': 'Mark a task as incomplete if it is overdue',
  'responses': "UPDATE tasks SET status = 'incomplete' WHERE due_date < CURRENT_DATE AND status = 'complete';"},
 {'instruction': 'Change the manager of a department',
  'responses': 'UPDATE departments SET manager_id = new_manager_id WHERE department_id = specific_department_id;'},
 {'instruction': 'Update the rating of a movie in the movies table',
  'responses': "UPDATE movies SET rating = 'new_rating' WHERE movie_id = 'specific_movie_id';"},
 {'instruction': 'Change the category of a product',
  'responses': "UPDATE products SET category = 'new_category' WHERE product_id = 'product_id';"},
 {'instruction': 'Update the phone number for a supplier',
  'responses': "UPDATE suppliers SET phone_number = 'new_phone_number' WHERE supplier_id = 'supplier_id';"},
 {'instruction': 'Set the active status of a user to false',
  'responses': 'UPDATE users SET active = false WHERE id = [user_id];'},
 {'instruction': 'Change the discount rate for a customer',
  'responses': 'UPDATE customers SET discount_rate = 0.15 WHERE customer_id = 123;'},
 {'instruction': 'Update the publication year of a book',
  'responses': 'UPDATE books SET publication_year = 2023 WHERE book_id = 1;'},
 {'instruction': 'Change the department of an employee',
  'responses': "UPDATE employees SET department = 'new_department' WHERE employee_id = 'employee_id';"},
 {'instruction': 'Update the balance of a bank account',
  'responses': "UPDATE bank_accounts SET balance = balance + 500 WHERE account_number = '123456789';"},
 {'instruction': 'Change the availability of a room in the hotel table',
  'responses': "UPDATE hotel SET availability = 'unavailable' WHERE room_number = '101';"},
 {'instruction': 'Update the expiry date for a subscription',
  'responses': "UPDATE subscriptions SET expiry_date = '2023-12-31' WHERE subscription_id = 1;"},
 {'instruction': 'Change the status of a ticket to used',
  'responses': "UPDATE tickets SET status = 'used' WHERE ticket_id = 12345;"},
 {'instruction': 'Update the version number of an app in the apps table',
  'responses': "UPDATE apps SET version_number = 'new_version_number' WHERE app_name = 'app_name';"},
 {'instruction': 'Delete all records from the users table where the email is null',
  'responses': 'DELETE FROM users WHERE email IS NULL;'},
 {'instruction': 'Can you remove the row where the user ID is 123',
  'responses': 'DELETE FROM users WHERE user_id = 123;'},
 {'instruction': 'How do I delete entries that have a specific date',
  'responses': "DELETE FROM table_name WHERE date_column = 'specific_date';"},
 {'instruction': 'I need to clear out all the data in the orders table',
  'responses': 'DELETE FROM orders;'},
 {'instruction': 'Can you please deleteTheRecordWhereTheCustomerNameIsJohnDoe',
  'responses': "DELETE FROM customers WHERE customer_name = 'John Doe';"},
 {'instruction': 'Remove all the items from the inventory where the quantity is zero',
  'responses': 'DELETE FROM inventory WHERE quantity = 0;'},
 {'instruction': 'Could you delete the record with the product ID 456',
  'responses': 'DELETE FROM products WHERE product_id = 456;'},
 {'instruction': 'How can I remove all the duplicate entries in this table',
  'responses': 'DELETE FROM table_name WHERE id NOT IN (SELECT MIN(id) FROM table_name GROUP BY column1, column2, column3);'},
 {'instruction': 'I want to delete all the records older than one year',
  'responses': 'DELETE FROM table_name WHERE date_column < DATE_SUB(CURDATE(), INTERVAL 1 YEAR);'},
 {'instruction': 'Can you help me delete a specific record from the database',
  'responses': 'DELETE FROM table_name WHERE condition;'},
 {'instruction': 'How do I delete a row in a table',
  'responses': 'DELETE FROM table_name WHERE condition;'},
 {'instruction': 'I needToDeleteAllTheInactiveUsersFromTheDatabase',
  'responses': 'DELETE FROM users WHERE active = 0;'},
 {'instruction': 'Can you show me how to delete a record where the status is inactive',
  'responses': "DELETE FROM table_name WHERE status = 'inactive';"},
 {'instruction': 'Delete all the records where the price is less than 10',
  'responses': 'DELETE FROM table_name WHERE price < 10;'},
 {'instruction': 'I want to remove all the entries that have an empty description',
  'responses': "DELETE FROM table_name WHERE description = '' OR description IS NULL;"},
 {'instruction': 'Could you delete the record where the order number is 789',
  'responses': 'DELETE FROM orders WHERE order_number = 789;'},
 {'instruction': 'How doIRemoveAllTheRecordsThatHaveANullValueInAField',
  'responses': 'DELETE FROM table_name WHERE field_name IS NULL;'},
 {'instruction': 'Can you delete all the records where the category is outdated',
  'responses': "DELETE FROM table_name WHERE category = 'outdated';"},
 {'instruction': 'I need to delete all the records that match a certain condition',
  'responses': 'DELETE FROM table_name WHERE condition;'},
 {'instruction': 'Find the total number of entries in a specific table',
  'responses': '```sql\nSELECT COUNT(*) FROM specific_table;\n```'},
 {'instruction': 'Combine data from multiple tables into one result',
  'responses': '```sql\nSELECT *\nFROM table1\nJOIN table2 ON table1.id = table2.id\nJOIN table3 ON table1.id = table3.id;\n```'},
 {'instruction': 'Sum up the values in a particular column',
  'responses': 'SELECT SUM(column_name) FROM table_name;'},
 {'instruction': 'Get the average value of a column',
  'responses': '```sql\nSELECT AVG(column_name) FROM table_name;\n```'},
 {'instruction': 'Count how many times a value appears in a column',
  'responses': 'SELECT value, COUNT(*) FROM table_name GROUP BY value;'},
 {'instruction': 'Group data by a specific column and sum up another column',
  'responses': '```sql\nSELECT column1, SUM(column2) \nFROM table_name \nGROUP BY column1;\n```'},
 {'instruction': 'Show me the maximum value in a column',
  'responses': 'SELECT MAX(column_name) FROM table_name;'},
 {'instruction': 'Show me the minimum value in a column',
  'responses': 'SELECT MIN(column_name) FROM table_name;'},
 {'instruction': 'Calculate the total sales for each product category',
  'responses': '```sql\nSELECT category, SUM(sales) AS total_sales\nFROM sales_table\nGROUP BY category;\n```'},
 {'instruction': 'Combine the results of two queries into one list',
  'responses': '```sql\nSELECT * FROM table1\nUNION ALL\nSELECT * FROM table2;\n```'},
 {'instruction': 'How can I add up all the numbers in a column',
  'responses': 'SELECT SUM(column_name) FROM table_name;'},
 {'instruction': 'Can you tell me how many users are in each country',
  'responses': '```sql\nSELECT country, COUNT(*) AS user_count\nFROM users\nGROUP BY country;\n```'},
 {'instruction': 'What is the total amount spent by each customer',
  'responses': 'SELECT customer_id, SUM(amount) AS total_spent FROM orders GROUP BY customer_id;'},
 {'instruction': 'How do I find the average salary in the employees table',
  'responses': 'SELECT AVG(salary) FROM employees;'},
 {'instruction': 'Can you show me the highest and lowest values in a column',
  'responses': '```sql\nSELECT MAX(column_name) AS highest_value, MIN(column_name) AS lowest_value FROM table_name;\n```'},
 {'instruction': 'How can I group data by date and sum up the sales',
  'responses': '```sql\nSELECT DATE(date_column), SUM(sales_column) \nFROM table_name \nGROUP BY DATE(date_column);\n```'},
 {'instruction': 'I want to know the total number of orders placed each month',
  'responses': "```sql\nSELECT DATE_TRUNC('month', order_date) AS month, COUNT(*) AS total_orders\nFROM orders\nGROUP BY DATE_TRUNC('month', order_date)\nORDER BY month;\n```"},
 {'instruction': 'Can you count the number of unique customers in the database',
  'responses': 'SELECT COUNT(DISTINCT customer_id) FROM customers;'},
 {'instruction': 'How do I get the total quantity sold for each product',
  'responses': '```sql\nSELECT product_id, SUM(quantity) AS total_quantity_sold\nFROM sales\nGROUP BY product_id;\n```'},
 {'instruction': 'Can you show me the average rating for each movie in the database',
  'responses': 'SELECT movie_id, AVG(rating) AS average_rating FROM ratings GROUP BY movie_id;'},
 {'instruction': 'Combine the customer and orders tables to see all customer names and their order details',
  'responses': '```sql\nSELECT customer.name, orders.order_id, orders.order_date, orders.amount\nFROM customer\nLEFT JOIN orders\nON customer.customer_id = orders.customer_id;\n```'},
 {'instruction': 'I want to see which customers have placed orders how can I do that',
  'responses': '```sql\nSELECT customer_id, COUNT(order_id) AS order_count\nFROM orders\nGROUP BY customer_id;\n```'},
 {'instruction': 'Can you show me how to get a list of all products and the orders they are part of',
  'responses': '```sql\nSELECT p.product_name, o.order_id\nFROM products p\nJOIN order_items oi ON p.product_id = oi.product_id\nJOIN orders o ON oi.order_id = o.order_id;\n```'},
 {'instruction': 'How do I merge two tables that have a common column',
  'responses': '```sql\nSELECT *\nFROM table1\nJOIN table2\nON table1.common_column = table2.common_column;\n```'},
 {'instruction': 'I need to see all the records from table A even if there is no match in table B',
  'responses': 'SELECT * FROM A LEFT JOIN B ON A.id = B.id'},
 {'instruction': 'Can you help me join the users and their purchase history',
  'responses': '```sql\nSELECT users.*, purchase_history.*\nFROM users\nJOIN purchase_history ON users.user_id = purchase_history.user_id;\n```'},
 {'instruction': 'I want to see the names of customers who havent placed any orders',
  'responses': '```sql\nSELECT name FROM customers WHERE customer_id NOT IN (SELECT customer_id FROM orders);\n```'},
 {'instruction': 'How can I get a list of all employees and the departments they work in',
  'responses': '```sql\nSELECT employee_name, department_name\nFROM employees\nJOIN departments ON employees.department_id = departments.department_id;\n```'},
 {'instruction': 'Can you show me how to join three tables together',
  'responses': '```sql\nSELECT a.*, b.*, c.*\nFROM table1 a\nJOIN table2 b ON a.common_field = b.common_field\nJOIN table3 c ON a.common_field = c.common_field;\n```'},
 {'instruction': 'I need to find out which products have never been ordered',
  'responses': 'SELECT product_id, product_name FROM products WHERE product_id NOT IN (SELECT product_id FROM orders)'},
 {'instruction': 'How do I join tables to see all the data even if there are no matches',
  'responses': '```sql\nSELECT *\nFROM table1\nFULL OUTER JOIN table2\nON table1.id = table2.id;\n```'},
 {'instruction': 'Can you explain how to do a left join in simple terms',
  'responses': '```sql\nSELECT a.*, b.*\nFROM table_a a\nLEFT JOIN table_b b\nON a.id = b.id;\n```'},
 {'instruction': 'I want to see all the data from both tables where there is a match',
  'responses': 'SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id;'},
 {'instruction': 'How can I combine data from multiple tables to get a complete view',
  'responses': 'SELECT * FROM table1 JOIN table2 ON table1.id = table2.id JOIN table3 ON table1.id = table3.id;'},
 {'instruction': 'Can you help me understand how to use inner join',
  'responses': 'SELECT a.column1, b.column2\nFROM tableA a\nINNER JOIN tableB b\nON a.common_column = b.common_column;'},
 {'instruction': 'I need to see all the records from table B even if there is no match in table A',
  'responses': 'SELECT * FROM B LEFT JOIN A ON B.id = A.id'},
 {'instruction': 'How do I join tables to see only the matching records',
  'responses': 'SELECT * FROM table1 INNER JOIN table2 ON table1.id = table2.id;'},
 {'instruction': 'Can you show me how to get a full list of customers and their orders including those who havent ordered anything',
  'responses': '```sql\nSELECT c.customer_id, c.customer_name, o.order_id, o.order_date\nFROM customers c\nLEFT JOIN orders o ON c.customer_id = o.customer_id;\n```'},
 {'instruction': 'I want to see all the products and their suppliers even if some products dont have a supplier',
  'responses': '```sql\nSELECT p.product_name, s.supplier_name\nFROM products p\nLEFT JOIN suppliers s ON p.supplier_id = s.supplier_id;\n```'},
 {'instruction': 'How can I join tables to see all the data from one table and only the matching data from another',
  'responses': '```sql\nSELECT * \nFROM table1 \nLEFT JOIN table2 \nON table1.id = table2.id;\n```'},
 {'instruction': 'Show me how to filter data in a table where a specific column has a certain value',
  'responses': "SELECT * FROM table_name WHERE column_name = 'specific_value';"},
 {'instruction': 'Can you explain how to get rows that match a condition',
  'responses': 'SELECT * FROM table_name WHERE condition;'},
 {'instruction': 'How do I select data where a column is not equal to a value',
  'responses': "SELECT * FROM table_name WHERE column_name != 'value';"},
 {'instruction': 'I need to know how to filter out rows that dont meet a criteria',
  'responses': "SELECT * FROM table_name WHERE column_name = 'criteria';"},
 {'instruction': 'Can you help me find records where a column is greater than a number',
  'responses': 'SELECT * FROM table_name WHERE column_name > number;'},
 {'instruction': 'How can I filter data for a specific date range',
  'responses': "SELECT * FROM table_name WHERE date_column BETWEEN 'start_date' AND 'end_date';"},
 {'instruction': 'Tell me how to exclude rows where a column is null',
  'responses': 'SELECT * FROM table_name WHERE column_name IS NOT NULL;'},
 {'instruction': 'How do I select entries where a column contains a specific word',
  'responses': "SELECT * FROM table_name WHERE column_name LIKE '%specific_word%'"},
 {'instruction': 'Can you show me how to filter data using multiple conditions',
  'responses': 'SELECT * FROM table_name WHERE condition1 AND condition2;'},
 {'instruction': 'I want to learn how to get data where a column is between two values',
  'responses': 'SELECT * FROM table_name WHERE column_name BETWEEN value1 AND value2;'},
 {'instruction': 'How can I filter data based on a list of values in a column',
  'responses': 'SELECT * FROM table_name WHERE column_name IN (value1, value2, value3);'},
 {'instruction': 'Can you guide me on filtering data where a column starts with a certain letter',
  'responses': "SELECT * FROM table_name WHERE column_name LIKE 'A%'"},
 {'instruction': 'How do I select data where a column ends with a specific character',
  'responses': "SELECT * FROM table_name WHERE column_name LIKE '%specific_character'"},
 {'instruction': 'I need to know how to filter data where a column is in a set of values',
  'responses': 'SELECT * FROM table_name WHERE column_name IN (value1, value2, value3);'},
 {'instruction': 'Can you help me filter out rows where a column is less than a value',
  'responses': 'SELECT * FROM table_name WHERE column_name < value;'},
 {'instruction': 'How can I filter data where a column is equal to one of several values',
  'responses': 'SELECT * FROM table_name WHERE column_name IN (value1, value2, value3);'},
 {'instruction': 'Show me how to filter data where a column does not contain a specific word',
  'responses': "SELECT * FROM table_name WHERE column_name NOT LIKE '%specific_word%'"},
 {'instruction': 'Can you explain how to filter data where a column is not in a list of values',
  'responses': 'SELECT * FROM table_name WHERE column_name NOT IN (value1, value2, value3);'},
 {'instruction': 'How do I select data where a column is like a pattern',
  'responses': "SELECT * FROM table_name WHERE column_name LIKE 'pattern%'"},
 {'instruction': 'I want to know how to filter data where a column is not like a pattern',
  'responses': "SELECT * FROM table_name WHERE column_name NOT LIKE 'pattern%'"},
 {'instruction': 'Show me how to sort data in ascending order',
  'responses': 'SELECT * FROM table_name ORDER BY column_name ASC;'},
 {'instruction': 'Can you sort this data from A to Z',
  'responses': 'SELECT * FROM table_name ORDER BY column_name ASC;'},
 {'instruction': 'How do I arrange the data from smallest to largest',
  'responses': 'SELECT * FROM table_name ORDER BY column_name ASC;'},
 {'instruction': 'Sort the records in descending order please',
  'responses': 'SELECT * FROM table_name ORDER BY column_name DESC;'},
 {'instruction': 'Can you put the data in order from highest to lowest',
  'responses': 'SELECT * FROM table_name ORDER BY column_name DESC;'},
 {'instruction': 'I need the data sorted alphabetically',
  'responses': 'SELECT * FROM table_name ORDER BY column_name ASC;'},
 {'instruction': 'How can I organize the data by date',
  'responses': 'SELECT * FROM table_name ORDER BY date_column_name;'},
 {'instruction': 'Sort the table by the second column',
  'responses': '```sql\nSELECT * FROM table_name ORDER BY column2;\n```'},
 {'instruction': 'Can you arrange the entries by the last name',
  'responses': 'ORDER BY last_name'},
 {'instruction': 'I want the results to be in reverse order',
  'responses': 'ORDER BY column_name DESC;'},
 {'instruction': 'How do I sort the data by multiple columns',
  'responses': 'SELECT * FROM table_name ORDER BY column1, column2;'},
 {'instruction': 'Can you show me the data sorted by price',
  'responses': 'SELECT * FROM table_name ORDER BY price;'},
 {'instruction': 'Sort the data by the first column in ascending order',
  'responses': 'SELECT * FROM table_name ORDER BY column1 ASC;'},
 {'instruction': 'Can you order the data by the creation date',
  'responses': 'ORDER BY creation_date'},
 {'instruction': 'I need the data sorted by the user ID',
  'responses': 'SELECT * FROM table_name ORDER BY user_id;'},
 {'instruction': 'How can I sort the data by the number of views',
  'responses': '```sql\nSELECT * FROM table_name ORDER BY views;\n```'},
 {'instruction': 'Can you arrange the data by the most recent entries',
  'responses': 'SELECT * FROM table_name ORDER BY date_column DESC;'},
 {'instruction': 'I want the data sorted by the product name',
  'responses': 'SELECT * FROM products ORDER BY product_name;'},
 {'instruction': 'Sort the data by the quantity in stock',
  'responses': 'SELECT * FROM inventory ORDER BY quantity_in_stock;'},
 {'instruction': 'Can you put the data in order by the rating',
  'responses': 'SELECT * FROM table_name ORDER BY rating;'},
 {'instruction': 'Find all the unique categories in a table',
  'responses': 'SELECT DISTINCT category FROM table_name;'},
 {'instruction': 'Combine similar data together in a table',
  'responses': '```sql\nSELECT column1, column2, SUM(column3) \nFROM table_name \nGROUP BY column1, column2;\n```'},
 {'instruction': 'Show me how to group data by a specific column',
  'responses': '```sql\nSELECT column1, column2, COUNT(*) \nFROM table_name \nGROUP BY column1;\n```'},
 {'instruction': 'Can you group the data by two columns',
  'responses': 'SELECT column1, column2, COUNT(*) \nFROM table_name \nGROUP BY column1, column2;'},
 {'instruction': 'How do I count items in each group',
  'responses': '```sql\nSELECT group_column, COUNT(*) \nFROM table_name \nGROUP BY group_column;\n```'},
 {'instruction': 'Group the data and calculate the average for eachGroup',
  'responses': '```sql\nSELECT eachGroup, AVG(data) \nFROM yourTable \nGROUP BY eachGroup;\n```'},
 {'instruction': 'Can you explain how to use group by with an example',
  'responses': 'SELECT department, COUNT(*) AS employee_count\nFROM employees\nGROUP BY department;'},
 {'instruction': 'I want to see the total sales for each product category',
  'responses': '```sql\nSELECT product_category, SUM(sales) AS total_sales\nFROM sales_table\nGROUP BY product_category;\n```'},
 {'instruction': 'How can I get the maximum value in each group',
  'responses': '```sql\nSELECT group_column, MAX(value_column) \nFROM table_name \nGROUP BY group_column;\n```'},
 {'instruction': 'Can you show me how to filter groups after grouping',
  'responses': '```sql\nSELECT column1, column2, COUNT(*) \nFROM table_name \nGROUP BY column1, column2 \nHAVING COUNT(*) > 10;\n```'},
 {'instruction': 'How do I group data and then sort the groups',
  'responses': '```sql\nSELECT column1, column2, COUNT(*)\nFROM table_name\nGROUP BY column1, column2\nORDER BY column1, column2;\n```'},
 {'instruction': 'I need to group data and then apply a function to each group',
  'responses': 'SELECT column1, column2, AGGREGATE_FUNCTION(column3) FROM table_name GROUP BY column1, column2;'},
 {'instruction': 'Can you help me understand the difference between group by and order by',
  'responses': 'SELECT column1, column2, COUNT(*) FROM table_name GROUP BY column1, column2 ORDER BY column1, column2;'},
 {'instruction': 'How do I group data and then find the minimum value in each group',
  'responses': '```sql\nSELECT group_column, MIN(value_column)\nFROM table_name\nGROUP BY group_column;\n```'},
 {'instruction': 'I want to group the data and then remove any duplicates',
  'responses': '```sql\nSELECT DISTINCT *\nFROM table_name\nGROUP BY column1, column2;\n```'},
 {'instruction': 'Can you show me how to group data and then apply multiple functions',
  'responses': '```sql\nSELECT department, COUNT(*) AS num_employees, AVG(salary) AS avg_salary\nFROM employees\nGROUP BY department;\n```'},
 {'instruction': 'How do I group data and then join it with another table',
  'responses': '```sql\nSELECT t1.*, t2.*\nFROM (SELECT column1, column2, SUM(column3) AS sum_column3\n      FROM table1\n      GROUP BY column1, column2) t1\nJOIN table2 t2 ON t1.column1 = t2.column1'},
 {'instruction': 'I need to group data and then pivot the results',
  'responses': "```sql\nSELECT *\nFROM (\n    SELECT column1, column2, column3\n    FROM table_name\n    GROUP BY column1, column2\n) AS subquery\nPIVOT (\n    SUM(column3)\n    FOR column2 IN ('value1', 'value2', '"},
 {'instruction': 'Can you show me how to group data and then export the results to a CSV file',
  'responses': "```sql\nCOPY (\n    SELECT * \n    FROM your_table \n    GROUP BY your_column\n) TO '/path/to/your/file.csv' WITH CSV HEADER;\n```"},
 {'instruction': 'How do I group data and then create a summary table',
  'responses': '```sql\nSELECT column1, column2, COUNT(*), SUM(column3)\nFROM table_name\nGROUP BY column1, column2;\n```'},
 {'instruction': 'How to use a subquery to filter data in SQL',
  'responses': '```sql\nSELECT column1, column2\nFROM table1\nWHERE column1 IN (SELECT column1 FROM table2 WHERE condition);\n```'},
 {'instruction': 'Can you explain how subqueries-work in simple terms',
  'responses': 'SELECT * FROM (SELECT column_name FROM table_name WHERE condition) AS subquery_name;'},
 {'instruction': 'Show me an example of a subquery that returns a single value',
  'responses': 'SELECT (SELECT MAX(salary) FROM employees) AS highest_salary;'},
 {'instruction': 'How do I use a subquery to get data from multiple tables',
  'responses': 'SELECT t1.column1, t2.column2 FROM table1 t1 JOIN (SELECT column2, column3 FROM table2 WHERE condition) t2 ON t1.column1 = t2.column3 WHERE t1.condition;'},
 {'instruction': 'Can you give meAnExample of a subquery that checks for the existence of certain records',
  'responses': "```sql\nSELECT *\nFROM employees\nWHERE EXISTS (SELECT 1 FROM departments WHERE departments.department_id = employees.department_id AND departments.location = 'New York');\n```"},
 {'instruction': 'How to write a subquery that returns multiple rows',
  'responses': '```sql\nSELECT column1, column2\nFROM table1\nWHERE column1 IN (SELECT column1 FROM table2 WHERE condition);\n```'},
 {'instruction': 'What is the difference between a subquery and a join',
  'responses': 'No SQL command can be generated for this question as it is asking for an explanation rather than a specific SQL query.'},
 {'instruction': 'Can you show me how to nest one subquery inside another',
  'responses': '```sql\nSELECT outer_column\nFROM (SELECT inner_column\n      FROM table_name\n      WHERE condition) AS subquery_alias\nWHERE outer_condition;\n```'},
 {'instruction': 'How do I use a subquery to find the maximum or minimum value in a column',
  'responses': 'SELECT (SELECT MAX(column_name) FROM table_name) AS MaxValue, (SELECT MIN(column_name) FROM table_name) AS MinValue FROM table_name;'},
 {'instruction': 'Can you explain how to use a subquery with comparison operators',
  'responses': '```sql\nSELECT column_name\nFROM table_name\nWHERE column_name operator (SELECT column_name FROM table_name WHERE condition);\n```'},
 {'instruction': 'How to use a subquery to filter results based on a condition',
  'responses': 'SELECT * FROM table_name WHERE column_name IN (SELECT column_name FROM table_name WHERE condition)'},
 {'instruction': 'Can you give me an example of a correlated subquery',
  'responses': '```sql\nSELECT e.employee_id, e.first_name, e.last_name\nFROM employees e\nWHERE e.salary > (SELECT AVG(salary) FROM employees WHERE department_id = e.department_id);\n```'},
 {'instruction': 'How do I optimize a query that uses subqueries',
  'responses': "```sql\nEXPLAIN PLAN FOR\nSELECT /*+ GATHER_PLAN_STATISTICS */ *\nFROM (\n    SELECT t1.*, t2.name\n    FROM table1 t1\n    JOIN table2 t2 ON t1.id = t2.id\n    WHERE t1.status = 'active'\n"},
 {'instruction': 'What are the common mistakes to avoid when writing subqueries',
  'responses': "SELECT * FROM (SELECT 'Common mistakes to avoid when writing subqueries are not directly queryable as data' AS Note) AS SubqueryMistakes"},
 {'instruction': 'How to use a subquery to update data in a table',
  'responses': '```sql\nUPDATE table_name\nSET column_name = (SELECT column_name FROM another_table WHERE condition)\nWHERE condition;\n```'},
 {'instruction': 'Can you show me how to use a subquery with aggregate functions',
  'responses': 'SELECT employee_id, (SELECT AVG(salary) FROM employees) AS average_salary FROM employees WHERE salary > (SELECT AVG(salary) FROM employees);'},
 {'instruction': 'How do I use a subquery to delete records from a table',
  'responses': 'DELETE FROM table_name WHERE id IN (SELECT id FROM another_table WHERE condition)'},
 {'instruction': 'Can you explain how to use a subquery with the IN operator',
  'responses': 'SELECT column_name\nFROM table_name\nWHERE column_name IN (SELECT column_name FROM table_name WHERE condition);'},
 {'instruction': 'How to use a subquery to return a list of values for a column',
  'responses': 'SELECT column_name FROM (SELECT column_name FROM table_name) AS subquery_alias'},
 {'instruction': 'Can you give me tips on writing efficient subqueries',
  'responses': '```sql\nSELECT *\nFROM (\n    SELECT column1, column2, column3\n    FROM table_name\n    WHERE condition\n    LIMIT 100\n) AS subquery\nWHERE subquery.condition;\n```'}]
				
			

Step 5: Filtering the Data

Now Let’s filter and refine the generated data. For this, we will use the NVIDIA Nemotron-4-340B-Reward model, which evaluates the quality of each response based on several criteria. This model provides scores that assess different aspects of the generated SQL commands. To use it we need to setup up the connection to the NVIDIA model using the OpenAI Python client. The base_url specifies the endpoint for the API, and the api_key is used for authentication. You can obtain your own API key by visiting this page: https://build.nvidia.com/nvidia/nemotron-4-340b-reward?snippet_tab=Python

				
					from openai import OpenAI

client = OpenAI(
  base_url = "https://integrate.api.nvidia.com/v1",
  api_key = "API"
)
				
			

We start by defining functions to evaluate the quality of responses using the NVIDIA Nemotron-4-340B-Reward API. The get_scores_from_response function extracts scores from the response. The get_response_and_scores function sends a pair of user instructions and model-generated responses to the model for evaluation, returning the score for each response.

				
					def get_scores_from_response(score_response_template):
    logprobs = score_response_template.choices[0].logprobs.content
    score_dict = {}
    for score in logprobs:
        score_dict[score.token] = score.logprob
    return score_dict
				
			
				
					def get_response_and_scores(client, model, question, response_content):
    messages = [
        {
            "role": "user",
            "content": question
        },
        {
            "role": "assistant",
            "content": response_content
        },
    ]

    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )

    scores = get_scores_from_response(response)
    return scores
				
			
				
					print(len(instruction_response_pair_list))
				
			

We then define the process_instruction_response_score function, which iterates through each instruction-response pair in instruction_response_pair_list. For each pair, it retrieves the instruction and response, checks if they are valid, and then gets the evaluation scores. The scores are collected in a list.

				
					def process_instruction_response_score(client, model, instruction_response_pair_list):
    score_list = []
    for instruction_response_pair in instruction_response_pair_list:
        instruction = instruction_response_pair.get("instruction")
        response = instruction_response_pair.get("responses")


        if instruction is not None and response is not None:
            score = get_response_and_scores(client, model, instruction, response)
            score_list.append(score)
        else:
            print(f"Skipping entry due to missing instruction or response: {instruction_response_pair}")

				
			
				
					score_list = process_instruction_response_score(client, "nvidia/nemotron-4-340b-reward", instruction_response_pair_list)
				
			
				
					score_list[:8]
[{'helpfulness': 3.78125,
  'correctness': 3.84375,
  'coherence': 3.84375,
  'complexity': 1.140625,
  'verbosity': 0.51953125},
 {'helpfulness': 3.734375,
  'correctness': 3.671875,
  'coherence': 3.96875,
  'complexity': 1.3203125,
  'verbosity': 0.66015625},
 {'helpfulness': 3.6875,
  'correctness': 3.5625,
  'coherence': 3.953125,
  'complexity': 1.2890625,
  'verbosity': 0.43359375},
 {'helpfulness': 2.140625,
  'correctness': 2.109375,
  'coherence': 3.625,
  'complexity': 0.96484375,
  'verbosity': 0.361328125},
 {'helpfulness': 2.9375,
  'correctness': 2.953125,
  'coherence': 3.921875,
  'complexity': 1.1171875,
  'verbosity': 0.640625},
 {'helpfulness': 3.875,
  'correctness': 3.609375,
  'coherence': 3.921875,
  'complexity': 1.328125,
  'verbosity': 0.57421875},
 {'helpfulness': 2.515625,
  'correctness': 2.265625,
  'coherence': 3.453125,
  'complexity': 1.046875,
  'verbosity': 0.515625},
 {'helpfulness': 2.71875,
  'correctness': 2.6875,
  'coherence': 3.671875,
  'complexity': 1.328125,
  'verbosity': 0.369140625}]
				
			

We finally filter the instruction_response_pair_list using two predefined thresholds:

  • The helpfulness_THRESHOLD is set to 3, meaning any response with a helpfulness score below 3 will be excluded.
  • The verbosity_THRESHOLD is set to 2.5, meaning any response with a verbosity score above 2.5 will be excluded.

The synthetic_data list is created by iterating through the instruction_response_pair_list. This process ensures that only the most relevant and concise responses remain for further use.

 
				
					helpfulness_THRESHOLD = 3
verbosity_THRESHOLD = 2.5
synthetic_data = [data for i, data in enumerate(instruction_response_pair_list)
                  if not (score_list[i]["helpfulness"] < helpfulness_THRESHOLD or
                          score_list[i]["verbosity"] > verbosity_THRESHOLD)]
				
			
				
					synthetic_data
				
			

Step 6: Saving and Deploying the Dataset

Now we can save the filtered synthetic data to a file and then deploy it as a dataset on Hugging Face. Here’s how the process works:

We save the filtered data to a file called synthetic_data_filtered.jsonl, where each item is stored as a separate JSON object for easy handling. After that, we log into our Hugging Face account using the huggingface_hub.login() function to upload the dataset to the platform. The dataset is then organized into a DatasetDict with a key called train, preparing it for use in machine learning tasks. Finally, we upload the dataset to Hugging Face using dataset_dict.push_to_hub(), making it publicly available for others to access and use

				
					import json
				
			
				
					with open('synthetic_data_filtered.jsonl', 'w') as f:
    for item in synthetic_data:
        f.write(json.dumps(item))
        f.write('\n')
				
			
				
					from huggingface_hub import login
login()
				
			
				
					!pip install datasets
				
			
				
					from datasets import Dataset, DatasetDict, load_dataset
				
			
				
					with open(f'synthetic_data_filtered.jsonl', 'r') as f:
    data = [json.loads(line) for line in f]
dataset = Dataset.from_list(data)
dataset_dict = DatasetDict({"train": dataset})
dataset_dict.push_to_hub("melekmessoussi/SQL-prompt-dataset")
				
			
				
					/usr/local/lib/python3.10/dist-packages/huggingface_hub/utils/_auth.py:94: UserWarning: 
The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
  warnings.warn(
Uploading the dataset shards: 100%
 1/1 [00:00<00:00,  1.13it/s]
Creating parquet from Arrow format: 100%
 1/1 [00:00<00:00, 36.76ba/s]
CommitInfo(commit_url='https://huggingface.co/datasets/melekmessoussi/SQL-prompt-dataset/commit/da6ddb583c2a0539b45b8973e38774effe8f7552', commit_message='Upload dataset', commit_description='', oid='da6ddb583c2a0539b45b8973e38774effe8f7552', pr_url=None, repo_url=RepoUrl('https://huggingface.co/datasets/melekmessoussi/SQL-prompt-dataset', endpoint='https://huggingface.co', repo_type='dataset', repo_id='melekmessoussi/SQL-prompt-dataset'), pr_revision=None, pr_num=None)
				
			

Voilà! Now you have a dataset ready to fine-tune any model.

Unlocking the Power of SLM Distillation for Higher Accuracy and Lower Cost​

How to make smaller models as intelligent as larger ones

Recording Date : March 7th, 2025

Unlock the True Potential of LLMs !

Harnessing AI Agents for Advanced Fraud Detection

How AI Agents Are Revolutionizing Fraud Detection

Recording Date : February 13th, 2025

Unlock the True Potential of LLMs !

Thank you for registering!

Check your email for the live demo details

see you on February 19th

While you’re here, discover how you can use UbiAI to fine-tune highly accurate and reliable AI models!

Thank you for registering!

Check your email for webinar details

see you on March 5th

While you’re here, discover how you can use UbiAI to fine-tune highly accurate and reliable AI models!

Fine Tuning LLMs on Your Own Dataset ​

Fine-Tuning Strategies and Practical Applications

Recording Date : January 15th, 2025

Unlock the True Potential of LLMs !