Entity-based Synthetic Data Generation with ChatGPT

Feb 27, 2023

Continuing from our previous article, “What is Synthetic Data Generation”, this article provides a step-by-step tutorial on generating synthetic text from real named entities using the Large Language Model (LLM) ChatGPT. We will focus on two domains: job description generation and medical abstract generation.


The development of text generation models has been greatly accelerated by the introduction of large pre-trained language models like GPT (Generative Pre-trained Transformer). These models are trained on massive amounts of text data using unsupervised learning techniques and can generate fluent text that is often hard to distinguish from human-written text.


Among the most popular and successful pre-trained language models is the GPT series developed by OpenAI, which includes GPT-1, GPT-2, and GPT-3. These models can generate coherent and fluent text for a wide range of natural language processing tasks, including text completion, question answering, and chatbot responses.

What is ChatGPT?

ChatGPT is a conversational AI model developed by OpenAI that can hold natural language conversations with humans. It is built on the GPT-3.5 (Generative Pre-trained Transformer) series of models and is fine-tuned with Reinforcement Learning from Human Feedback (RLHF) to further improve the quality of its responses. During fine-tuning, the model generates responses to prompts and receives a reward signal from a reward model that was trained on human rankings of candidate responses. The reward can reflect various factors such as relevance, coherence, and diversity (refer to the image below for a step-by-step explanation). Through reinforcement learning, the model learns from its own outputs and adjusts its parameters to maximize the reward. For more information, check out the OpenAI website.

[Image: step-by-step overview of ChatGPT training with RLHF]

Limitations of ChatGPT in Text Generation:

  • Limited Context: While ChatGPT is designed to consider the context of a question when generating an answer, it can still struggle with complex or ambiguous context. It may also lack the information needed to answer a question, especially one that requires specialized knowledge or expertise.
  • Accuracy: ChatGPT can generate inaccurate answers, particularly if it lacks the relevant information or if the question is poorly worded. It can also generate misleading or biased responses, particularly if the training data contains such biases.
  • Style and Consistency: While ChatGPT is trained to mimic human language, its responses are not always stylistically or tonally consistent, so an answer can feel “off”.

Limitations of ChatGPT in Synthetic Data Generation

  • Bias: ChatGPT, like any AI model, can generate synthetic data that contains biases or inaccuracies, particularly if the training data is biased. It is important to evaluate the generated data carefully to ensure it does not perpetuate harmful biases.
  • Quality: The quality of the synthetic data generated by ChatGPT varies and may fall short of the real data, particularly if the training data is of low quality.

Entity-Based Synthetic Data Generation

Named entities can play an important role in generating more accurate synthetic data. Entity extraction identifies and extracts specific entities or concepts, such as the name of a person, organization, or place, from unstructured text data.
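
As a rough illustration of what extracted entities look like once collected, the sketch below groups (text, label) pairs into one list per label. The `group_entities` helper, the label names, and the sample values are all hypothetical and do not come from a specific NER tool.

```python
from collections import defaultdict


def group_entities(spans):
    """Group (text, label) pairs from an NER model into {label: [unique texts]}."""
    grouped = defaultdict(list)
    for text, label in spans:
        cleaned = text.strip()
        # Skip empty strings and keep only the first occurrence of each entity.
        if cleaned and cleaned not in grouped[label]:
            grouped[label].append(cleaned)
    return dict(grouped)


# Hypothetical NER output (illustrative values only).
spans = [
    ("Marie Curie", "PERSON"),
    ("Sorbonne", "ORGANIZATION"),
    ("Paris", "PLACE"),
    ("Marie Curie", "PERSON"),
]
entities = group_entities(spans)
print(entities)
```

A dictionary of this shape, one list of values per entity type, is a convenient input for building generation prompts.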

The process of entity-based data generation involves feeding entities into a machine learning model and asking it to generate text based on those entities. For example, if you want to generate customer reviews for a particular product category, you can provide the category name as an entity to the model and ask it to create reviews based on that entity. This process can be enhanced with relevant tags or attributes, such as sentiment or product features. You can use these attributes to guide the generation process and ensure that the synthetic data covers entities or concepts similar to those in the real data, which makes the dataset more realistic.
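
The customer review example can be sketched as a small prompt generator that combines a category entity with sentiment and feature attributes. This is only a sketch under assumptions: the `review_prompts` function, the prompt wording, and the sample values are made up for illustration.

```python
from itertools import product


def review_prompts(category, sentiments, features):
    """Return one prompt per (sentiment, feature) combination for a category."""
    prompts = []
    for sentiment, feature in product(sentiments, features):
        prompts.append(
            f"Write a {sentiment} customer review for a product in the "
            f"category '{category}' that mentions its {feature}."
        )
    return prompts


# Hypothetical category and attributes (illustrative values only).
prompts = review_prompts(
    category="wireless headphones",
    sentiments=["positive", "negative"],
    features=["battery life", "sound quality"],
)
for p in prompts:
    print(p)
```

Crossing the attributes this way yields one prompt per combination, which is a simple way to spread the generated reviews over the sentiments and features you care about.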

Generating Synthetic Text Based on Named Entities

For this tutorial, we will generate synthetic data with ChatGPT for two different domains: job descriptions and medical abstracts.


Job Description Example

To generate synthetic text, we first need to feed the model relevant entities from a real data source. In this case, we have already trained a Named Entity Recognition (NER) model using the UBIAI Annotation Tool and extracted relevant entities from a small sample of job descriptions. The entities of interest are Experience, Skills, Diploma, and Diploma Major.


Once we have extracted the relevant entities from the data source, we can feed them to ChatGPT. We will guide ChatGPT to generate text that aligns with the type of data we are working with. Here is the prompt used:

Generate a short job description based on the following entities:
Experience : ["3+ years", "5+ years."]
Skills : ["-recruiting", "managing technical teams", "performance management"]
DIPLOMA : ["BS", "BA"]
DIPLOMA_MAJOR : ["Computer Science"]
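
If the entities are stored in a dictionary, a prompt in the format above can be assembled programmatically. The following is a minimal sketch; the `build_prompt` helper is an assumption for illustration and not part of any library.

```python
import json


def build_prompt(instruction, entities):
    """Join an instruction and one 'LABEL : [values]' line per entity type."""
    lines = [instruction]
    for label, values in entities.items():
        # json.dumps renders a list as ["a", "b"], matching the prompt format above.
        lines.append(f"{label} : {json.dumps(values, ensure_ascii=False)}")
    return "\n".join(lines)


# Entities copied from the prompt shown above.
job_entities = {
    "Experience": ["3+ years", "5+ years."],
    "Skills": ["-recruiting", "managing technical teams", "performance management"],
    "DIPLOMA": ["BS", "BA"],
    "DIPLOMA_MAJOR": ["Computer Science"],
}
prompt = build_prompt(
    "Generate a short job description based on the following entities:",
    job_entities,
)
print(prompt)
```

The printed text matches the prompt above line for line and can be pasted into ChatGPT as is.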

And here is ChatGPT's output:

[Image: job description generated by ChatGPT]

The results are pretty impressive! ChatGPT determined the domain of the job description we were seeking (recruiting and managing teams), used the entities we fed it, and created a realistic job description that can be used to train an NER model.

It’s important to note that the quality of the synthetic text depends on the accuracy and relevance of the entities provided, as well as on the quality of the Large Language Model (LLM) used.
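
Because the entities in the prompt are known in advance, one possible way to pre-label a generated description for NER training is to locate those entities in the text by string matching. This is an assumption for illustration, not the workflow used in this article; the `find_entity_spans` helper and the sample sentence are hypothetical, and the sentence is not actual ChatGPT output.

```python
def find_entity_spans(text, entities):
    """Return sorted (start, end, label) spans for case-insensitive exact matches."""
    spans = []
    lowered = text.lower()  # assumes lowercasing does not change string length
    for label, values in entities.items():
        for value in values:
            needle = value.lower()
            if not needle:
                continue  # an empty entity would match everywhere
            start = lowered.find(needle)
            while start != -1:
                spans.append((start, start + len(needle), label))
                start = lowered.find(needle, start + len(needle))
    return sorted(spans)


# Hypothetical generated sentence (illustrative only).
text = (
    "We require a BS in Computer Science and 3+ years of experience "
    "managing technical teams."
)
entities = {
    "Experience": ["3+ years"],
    "Skills": ["managing technical teams"],
    "DIPLOMA": ["BS"],
    "DIPLOMA_MAJOR": ["Computer Science"],
}
spans = find_entity_spans(text, entities)
for start, end, label in spans:
    print(label, repr(text[start:end]))
```

Plain substring matching also finds short entities inside longer words (for example "BS" inside "jobs"), and the model may paraphrase an entity so that it is missed, so spans produced this way would still need manual review.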

Medical Abstract Example

Another example of synthetic text generation is generating medical abstracts. Following the same approach, we feed ChatGPT entities related to COVID-19 that were extracted from a publicly available dataset on Kaggle. The extracted entities are Medicines and Medical Conditions. Here is the prompt:

Create a scientific abstracts based on the entities below:
Medicines : ["DMARDs", "Hydroxychloroquine", "monoclonal antibodies", "Rituximab", "etanercept", "certolizumab", "sulfasalazine", "TNF-alpha inhibitors", "leflunomide", "RA treatment", "methotrexate", "Sodium", "infliximab", "tocilizumab"]
Medical Condition : ["DMARDs"]

Here is the result:

[Image: medical abstract generated by ChatGPT]

The results are again pretty impressive, with ChatGPT producing a realistic-sounding medical text that is comparable to the original one:

[Image: original medical abstract]

However, a close look at the generated abstract reveals false information, such as the claim that DMARDs are up to 70% effective for patients with Rheumatoid Arthritis (RA). This statement is not true and is not supported by any scientific research.

This example illustrates one of the major pitfalls of using LLMs: hallucination. LLMs are not trained to be factual and can generate incorrect information. Their output should be carefully checked before it is used as synthetic data, especially in technical domains. For specific data augmentation tasks, such as training an NER model to extract specific entities, the impact of misinformation is perhaps less dangerous, but caution is still warranted.
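
Part of this check can be supported by simple heuristics. The sketch below flags numbers and percentages that appear in the generated text but not in the source text, so that a human can review them. It is a hedged illustration only: the `unsupported_numbers` helper and both sample sentences are hypothetical, and a heuristic like this does not replace expert review.

```python
import re

# Matches integers or decimals, optionally followed by a percent sign.
NUMBER = re.compile(r"\d+(?:\.\d+)?%?")


def unsupported_numbers(generated, source):
    """Return numeric tokens found in `generated` but absent from `source`."""
    source_numbers = {m.group() for m in NUMBER.finditer(source)}
    flagged = []
    for m in NUMBER.finditer(generated):
        token = m.group()
        if token not in source_numbers and token not in flagged:
            flagged.append(token)
    return flagged


# Hypothetical source and generated sentences (illustrative only).
source = "DMARDs such as methotrexate are widely used in RA treatment."
generated = "DMARDs are up to 70% effective for patients with RA."
flagged = unsupported_numbers(generated, source)
print(flagged)
```

A flagged token only means the number has no counterpart in the source text. Whether the claim is true still has to be verified against the scientific literature.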


In conclusion, entity-based synthetic data generation is a powerful technique for generating more accurate text that resembles the original text. Synthetic data generation enables the creation of large and diverse datasets for training and testing machine learning models. However, LLMs still suffer from hallucination, and their output needs to be carefully checked before it is used for data generation. As a next step, we can ask ChatGPT to expand the entity list we provided with similar entities and produce an even more diverse synthetic dataset. In addition, we can add a list of relations that dictates how the entities should relate to each other in the generated text for more accurate synthetic data generation.

Follow us on Twitter @UBIAI5 or subscribe here!