Entity-based Synthetic Data Generation with ChatGPT
Feb 27, 2023
Following up on our previous article, “What is Synthetic Data Generation”, this article provides a step-by-step tutorial on how to generate synthetic text from real named entities using the large language model (LLM) ChatGPT. We will focus on two domains: job description generation and medical abstract generation.
The development of text generation models has been greatly accelerated by the introduction of large pre-trained language models such as GPT (Generative Pre-trained Transformer). These models are trained on massive amounts of text using unsupervised learning techniques and can generate fluent text that is often hard to distinguish from human-written text.
One of the most popular and successful families of pre-trained language models is the GPT series developed by OpenAI, which includes GPT-1, GPT-2, and GPT-3. These models can generate coherent and fluent text for a wide range of natural language processing tasks, including text completion, question answering, and chatbot responses.
What is ChatGPT?
ChatGPT is a conversational AI model developed by OpenAI that can hold natural language conversations with humans. It is fine-tuned from a model in the GPT-3.5 series (GPT stands for Generative Pre-trained Transformer) using Reinforcement Learning from Human Feedback (RLHF) to further improve the quality of its responses. During fine-tuning, the model generates responses to user inputs, and human labelers rank alternative responses by quality. A reward model is trained on these rankings, and its score serves as the reward signal for each generated response. The reward can therefore reflect factors such as relevance, coherence, and diversity (refer to the image below for a step-by-step explanation). Through reinforcement learning, the model adjusts its parameters to maximize this reward. For more information, check out the OpenAI website.
Synthetic Data Generation
Entity-Based Synthetic Data Generation
Generating Synthetic Text Based on Named Entities
Generate a short job description based on the following entities:
Experience : ["3+ years", "5+ years."]
Skills : ["-recruiting", "managing technical teams", "performance management"]
DIPLOMA : ["BS", "BA"]
DIPLOMA_MAJOR : ["Computer Science"]
Create a scientific abstract based on the entities below:
Medicines : ["DMARDs", "Hydroxychloroquine", "monoclonal antibodies", "Rituximab", "etanercept", "certolizumab", "sulfasalazine", "TNF-alpha inhibitors", "leflunomide", "RA treatment", "methotrexate", "Sodium", "infliximab", "tocilizumab"]
Medical Condition : ["DMARDs"]
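Prompts like the two above can be assembled programmatically from a dictionary of entities, which helps when generating many synthetic samples. The following is a minimal sketch, not the exact code used for this article: the function name build_prompt and the variable job_entities are hypothetical, and it assumes the entities have already been extracted into lists grouped by label. It reproduces the same "Label : [values]" layout used in the prompts above.

```python
import json


def build_prompt(instruction, entities):
    """Return the instruction followed by one 'Label : [values]' line per entity type.

    `entities` maps an entity label to a list of strings (hypothetical structure).
    """
    lines = [instruction]
    for label, values in entities.items():
        # json.dumps renders the list with double-quoted strings,
        # the same layout as the hand-written prompts
        lines.append(f"{label} : {json.dumps(values, ensure_ascii=False)}")
    return "\n".join(lines)


# Entities taken from the job description example above
job_entities = {
    "Experience": ["3+ years", "5+ years."],
    "Skills": ["-recruiting", "managing technical teams", "performance management"],
    "DIPLOMA": ["BS", "BA"],
    "DIPLOMA_MAJOR": ["Computer Science"],
}

prompt = build_prompt(
    "Generate a short job description based on the following entities:",
    job_entities,
)
print(prompt)
```

The resulting string can be pasted into the ChatGPT interface as is. Swapping in a different entity dictionary and instruction (for example, the medical entities) yields a prompt for the second domain without changing the function.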