We developed a complete pipeline for fine-tuning customer support chatbots using reinforcement learning (specifically KTO) without requiring any labeled training data. By leveraging synthetic conversation generation and LLM-as-a-judge evaluation, we created a system that targets a 20% improvement in ticket resolution rate while maintaining high customer satisfaction. This is Part 1 of a 2-part series: here we focus on the data generation methodology that makes RLHF training possible without manual labeling. Part 2 covers the actual training process on UBIAI.

The Promise and Problem of RLHF for Customer Support
Reinforcement Learning from Human Feedback (RLHF) has revolutionized how we train language models. ChatGPT’s success? RLHF. Claude’s helpfulness? RLHF. The technique is simple in principle: show the model examples of good and bad responses, then use reinforcement learning to steer it toward the good ones.
For customer support chatbots, RLHF offers something tantalizing: the ability to train agents that don’t just sound professional, but actually resolve tickets efficiently while keeping customers happy. Instead of hoping your prompt engineering catches every edge case, you fine-tune the model’s behavior directly on the outcomes you care about.
But there’s a catch. An expensive one.
Traditional RLHF requires human feedback. Lots of it. Industry standard suggests 2,000-5,000 labeled examples minimum. At $5-25 per labeled conversation (factoring in quality control), you’re looking at $10,000 to $125,000 just to get started. And that’s before your business requirements change, forcing you to start over.
This economic reality has kept RLHF out of reach for most teams building customer support agents. Until now.
Understanding KTO: The RLHF Algorithm We’re Using
Before diving into implementation, let’s talk about the specific reinforcement learning approach we’re using: KTO (Kahneman-Tversky Optimization).
Traditional RLHF often uses PPO (Proximal Policy Optimization), which requires a carefully tuned reward model and complex training dynamics. KTO simplifies this by working directly with binary preference labels: this response was “Accurate” or “Inaccurate.”
The algorithm is inspired by Kahneman and Tversky’s prospect theory from behavioral economics. The key insight: humans (and models) are more motivated to avoid losses than to pursue equivalent gains. KTO exploits this by:
– Strongly penalizing responses labeled “Inaccurate”
– Modestly rewarding responses labeled “Accurate”
– Maintaining a reference point (the base model) to prevent drift
This asymmetry creates models that are conservative about bad behaviors while still learning good ones. For customer support, this is exactly what we want: an agent that rarely says the wrong thing, even if it means being slightly more cautious.
KTO also has a practical advantage: it only needs binary labels (good/bad), not ranked preferences (A is better than B) or numerical scores. This makes synthetic labeling much more reliable—LLMs are better at judging “is this response acceptable?” than “rate this response from 1-10.”
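To make that asymmetry concrete, here is a minimal, self-contained sketch of a per-example KTO-style loss, loosely following the published KTO formulation. The constants (beta, the lambda weights) and the zeroed reference point are purely illustrative, and UBIAI’s trainer handles all of this internally; the point is only to show how binary labels and the reference model enter the objective.

```python
import math

def kto_example_loss(logp_policy: float, logp_ref: float, desirable: bool,
                     beta: float = 0.1, lambda_d: float = 1.0,
                     lambda_u: float = 1.5, z_ref: float = 0.0) -> float:
    """Simplified, illustrative per-example KTO-style loss.

    logp_policy / logp_ref: log-probability of the response under the model
    being trained and under the frozen reference model.
    z_ref: reference point (an estimate of the policy-to-reference KL),
    fixed at 0 here purely for illustration.
    Setting lambda_u > lambda_d penalizes "Inaccurate" responses more heavily
    than "Accurate" ones are rewarded, mirroring the loss-aversion idea above.
    """
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    # Implicit reward: how much more the policy prefers this response
    # than the frozen reference model does.
    reward = beta * (logp_policy - logp_ref)
    if desirable:   # labeled "Accurate"
        return lambda_d * (1.0 - sigmoid(reward - z_ref))
    else:           # labeled "Inaccurate"
        return lambda_u * (1.0 - sigmoid(z_ref - reward))

# The same response is cheap to keep when labeled "Accurate" and costly
# to keep when labeled "Inaccurate".
print(kto_example_loss(-12.0, -15.0, desirable=True))   # ~0.43
print(kto_example_loss(-12.0, -15.0, desirable=False))  # ~0.86
```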
Our Six-Phase Training Pipeline
Here’s how we’re going to fine-tune a customer support agent using RLHF without any manual labeling:
Phase 1: Goal Definition
Define concrete, measurable objectives. Not “be better at support” but “increase ticket resolution rate by 20% while maintaining customer satisfaction above 4.5/5.” The specificity matters—vague goals produce vague training data.
Phase 2: Training Aspect Specification
Identify the specific capabilities to improve: refund handling, angry customer management, first-contact resolution, technical troubleshooting, etc. Each aspect becomes a dimension we’ll explicitly train.
Phase 3: Persona Simulation
Use UBIAI’s API to generate challenging customer personas that stress-test each training aspect. We’re not creating typical customers—we’re creating edge cases. The serial refund seeker. The technically confused. The justifiably angry. If the model can handle these, it can handle anything.
Phase 4: Conversation Generation
Generate 200-500 realistic, multi-turn conversations between our personas and a base support agent. Each conversation ends with an unanswered customer question, creating the perfect training format: context + required response.
Phase 5: LLM-as-a-Judge Evaluation
Deploy UBIAI’s API as an expert evaluator to rate each response as “Accurate” or “Inaccurate” based on resolution potential, professionalism, policy adherence, and customer satisfaction impact. This generates the preference labels KTO needs.
Phase 6: KTO Training
Feed the labeled data into KTO on UBIAI’s platform. The algorithm learns which response patterns lead to good outcomes and which don’t, adjusting the model’s behavior accordingly.
In this article, we’re implementing Phases 1-5: the synthetic data generation and evaluation pipeline. Part 2 will cover Phase 6: the actual RLHF training process and deployment.
Why This Approach Works for Customer Support
Customer support is uniquely well-suited to this methodology. Unlike creative writing or technical research where quality is subjective and domain-specific, customer support has clear success criteria:
– Did the response address the customer’s issue?
– Was the tone appropriate and empathetic?
– Did it move toward resolving the ticket?
– Does it follow company policies?
These are questions that advanced LLMs can answer reliably. The models have processed millions of customer service interactions during training and understand the patterns of effective support. We’re not asking them to have domain expertise in your specific product—we’re asking them to evaluate whether a response demonstrates good support practices.
Moreover, customer support conversations follow predictable structures. There’s context establishment, problem identification, solution proposal, and resolution or escalation. This structural regularity makes synthetic generation more reliable than in open-ended domains.
The real innovation here isn’t just avoiding labeling costs. It’s creating a training pipeline that’s as agile as your business. When your return policy changes, you don’t wait weeks for new training data—you regenerate it overnight. When you identify a new failure mode, you create personas targeting it specifically and retrain within days.
This is RLHF that keeps pace with your business.
Let’s build it.
Setup and Installation
First, let’s install the required libraries and set up our UBIAI API client.
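The companion notebook (linked at the end) contains the exact setup. As a stand-in, here is a minimal sketch using the `requests` library; the endpoint path, auth header, and response shape below are assumptions, so swap in UBIAI’s actual client code per its API documentation.

```python
# pip install requests
import os
import requests

UBIAI_API_KEY = os.environ["UBIAI_API_KEY"]        # set this in your environment
UBIAI_BASE_URL = "https://api.ubiai.tools/v1"      # hypothetical base URL

session = requests.Session()
session.headers.update({
    "Authorization": f"Bearer {UBIAI_API_KEY}",
    "Content-Type": "application/json",
})

def ubiai_chat(messages: list[dict], temperature: float = 0.7) -> str:
    """Thin wrapper around a (hypothetical) chat-completion endpoint.

    `messages` follows the usual [{"role": ..., "content": ...}] convention.
    The URL and response parsing are placeholders for the real UBIAI API.
    """
    resp = session.post(
        f"{UBIAI_BASE_URL}/chat/completions",
        json={"messages": messages, "temperature": temperature},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

The later snippets in this article reuse this `ubiai_chat` helper.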

Phase 1: Defining Success Metrics for RLHF
The most common mistake in RLHF is starting without clear success criteria. You can’t train a model to be “better” at customer support any more than you can optimize a function without defining what you’re optimizing for.
We need concrete, measurable objectives. Our primary goal:
increase ticket resolution rate by 20% while maintaining customer satisfaction above 4.5/5. This gives us two metrics to optimize simultaneously—efficiency and quality.
But that’s still too abstract for training. We need to decompose this into specific capabilities. What does “good support” actually look like behaviorally? We’ve identified eight training aspects that directly ladder up to our goal:
1. Refund handling (clarity about policies, efficient processing)
2. Managing difficult customers (professionalism under pressure)
3. First-contact resolution (solving issues without escalation)
4. Customer satisfaction optimization (empathy, validation, tone)
5. Product complaint handling (appropriate responses to defects)
6. Escalation efficiency (knowing when to involve humans)
7. Shipping communication (clarity about delays and logistics)
8. Proactive problem-solving (anticipating customer needs)
Each aspect represents a common failure mode in baseline models. By explicitly targeting these during data generation, we ensure our RLHF training addresses real weaknesses rather than optimizing in the dark.
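One way to keep Phase 1 honest is to write these decisions down as configuration that the rest of the pipeline consumes. The structure below is our own convention, not a UBIAI schema; a sketch:

```python
# Illustrative Phase 1 configuration: the goal plus the eight training aspects.
TRAINING_GOAL = {
    "primary_metric": "ticket_resolution_rate",
    "target_improvement": 0.20,                  # +20% resolution rate
    "guardrail_metric": "customer_satisfaction",
    "guardrail_threshold": 4.5,                  # keep CSAT above 4.5/5
}

TRAINING_ASPECTS = [
    "refund_handling",
    "difficult_customer_management",
    "first_contact_resolution",
    "customer_satisfaction_optimization",
    "product_complaint_handling",
    "escalation_efficiency",
    "shipping_communication",
    "proactive_problem_solving",
]
```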

Phase 2: Synthetic Persona Generation for RLHF Training Data
The quality of reinforcement learning is directly determined by the quality of the training signal. If we want our fine-tuned agent to handle edge cases, we need to train it on edge cases.
This is where synthetic persona generation becomes powerful. We’re not creating average customers—we’re creating stress tests. Each persona represents an extreme version of a common support challenge:
– The customer who demands refunds aggressively and won’t take no for an answer
– The technically confused person whose frustration grows with each clarification attempt
– The polite customer whose patience has finally run out after repeated failures
– The quality inspector who finds flaws in everything
These personas serve a specific purpose in RLHF: they generate the difficult examples where baseline models struggle. When the LLM judges these interactions, the model gets a strong training signal about what not to do. The errors are more informative than the successes.
We’re leveraging UBIAI’s API to generate these personas, tapping into extensive knowledge of customer behavior patterns, support psychology, and service scenarios. The model has effectively “seen” millions of customer interactions during training and can synthesize that knowledge into archetypal challenge scenarios that would take human designers weeks to develop.
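A hedged sketch of what persona generation can look like with the `ubiai_chat` helper from the setup section; the prompt wording and JSON fields are illustrative, and production code should validate or retry on malformed JSON.

```python
import json

PERSONA_PROMPT = """You are designing a stress-test customer persona for the
training aspect: "{aspect}".
Return a JSON object with keys: name, background, emotional_state,
communication_style, typical_demands. The persona should be an extreme but
realistic edge case for this aspect."""

def generate_persona(aspect: str) -> dict:
    """Ask the model for one challenging persona targeting a training aspect."""
    raw = ubiai_chat(
        [{"role": "user", "content": PERSONA_PROMPT.format(aspect=aspect)}],
        temperature=0.9,   # high temperature for diverse, non-generic personas
    )
    return json.loads(raw)  # sketch: assumes the model returns clean JSON

# One stress-test persona per training aspect, tagged with its aspect.
personas = [{"aspect": a, **generate_persona(a)} for a in TRAINING_ASPECTS]
```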

Phase 3: Generating Training Conversations for KTO
Now we generate the actual training data: realistic multi-turn conversations between our challenging personas and a support agent. These conversations form the foundation of our RLHF training.
The format is critical for KTO. Each conversation must:
1. Provide sufficient context (3-6 turns) so the model understands the customer’s situation
2. End with an unanswered question creating a clear prediction target
3. Exhibit realistic conversation dynamics including escalation, clarification, and emotional evolution
4. Include specific details (order numbers, dates, product names) that make the scenario concrete
We’re generating approximately 500 examples (62 per persona). This number is based on RLHF research suggesting that 500-1,000 high-quality preference pairs are sufficient for meaningful fine-tuning of an already-capable base model. More examples help, but with diminishing returns—the first 500 do most of the heavy lifting.
The temperature setting (0.9) is intentionally high. We want diversity in our training data. Two conversations with the same persona should explore different variations of that persona’s behavior. This prevents the model from memorizing specific phrasings and instead learning general patterns of good support.
Think of this as creating a gym for your model—a comprehensive set of exercises that will strengthen specific capabilities through repeated exposure to varied challenges.
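Continuing the same sketch, conversation generation might look like the following; the prompt, JSON transcript format, and the 62-per-persona loop are illustrative choices that mirror the requirements listed above.

```python
CONVERSATION_PROMPT = """Simulate a customer support conversation between this
customer persona and a support agent:

{persona}

Requirements:
- 3 to 6 turns of realistic back-and-forth
- Include concrete details (order numbers, dates, product names)
- Show realistic dynamics (escalation, clarification, emotional shifts)
- The conversation must END with an unanswered customer question
Return the transcript as a JSON list of {{"role": "customer" | "agent", "content": "..."}} turns."""

def generate_conversation(persona: dict) -> list[dict]:
    raw = ubiai_chat(
        [{"role": "user", "content": CONVERSATION_PROMPT.format(persona=json.dumps(persona))}],
        temperature=0.9,   # intentionally high: variety across samples
    )
    return json.loads(raw)

# ~62 conversations per persona x 8 personas ≈ 500 training examples
conversations = [
    {"persona": p, "turns": generate_conversation(p)}
    for p in personas
    for _ in range(62)
]
```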

Phase 4: Generating Baseline Responses for Comparison
Before we can train with reinforcement learning, we need a baseline: how does an un-fine-tuned model perform on these challenging scenarios?
We’re using UBIAI’s base model to generate responses for two strategic reasons:
First, it’s representative of what many companies actually deploy: Base models are fast, cost-effective, and handle most routine support queries adequately. But they struggle with edge cases—exactly what we’re targeting with RLHF.
Second, it provides room for improvement: The base model gives us headroom to demonstrate measurable gains from fine-tuning.
Each response is generated with temperature 0.7—not deterministic, but not wildly creative either. This represents realistic production behavior: some variation in responses, but generally following similar patterns.
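Using the same hypothetical helper, baseline generation is a straightforward replay of each transcript into the base model; the system prompt and role mapping below are illustrative.

```python
AGENT_SYSTEM_PROMPT = (
    "You are a customer support agent. Resolve the customer's issue efficiently "
    "while staying professional, empathetic, and policy-compliant."
)

def generate_baseline_response(turns: list[dict]) -> str:
    """Have the base model answer the final, unanswered customer question."""
    messages = [{"role": "system", "content": AGENT_SYSTEM_PROMPT}]
    for turn in turns:
        role = "assistant" if turn["role"] == "agent" else "user"
        messages.append({"role": role, "content": turn["content"]})
    return ubiai_chat(messages, temperature=0.7)  # realistic production variance

for convo in conversations:
    convo["baseline_response"] = generate_baseline_response(convo["turns"])
```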
The responses we generate here are crucial for RLHF. Some will be excellent—these become positive examples that KTO will reinforce. Many will be adequate but suboptimal—these show opportunities for improvement. Some will be outright poor—these become strong negative signals that teach the model what to avoid.
This distribution of quality is actually desirable. If everything were perfect, we’d have nothing to train on. If everything were terrible, the signal would be too noisy. We’re looking for that middle ground: enough good examples to anchor on success patterns, enough bad examples to learn boundaries.

Phase 5: LLM-as-a-Judge for Generating Preference Labels with RAG
This is where our RLHF pipeline diverges from tradition. Instead of human annotators, we’re using UBIAI’s API as an expert evaluator to generate the preference labels that KTO needs.
Let’s address the elephant in the room: is it valid to use an AI to judge an AI?
The LLM-as-a-judge approach is controversial, and the concerns are legitimate. How do we know the judgments are correct? What if it has systematic biases? Isn’t this circular—using one model to train another?
Here’s the nuanced answer:
LLM judges aren’t perfect, but they’re consistently imperfect in ways we can work with, especially if we give them additional context to ground their judgments.
Human annotators have the same problems—they’re inconsistent over time, biased by recent examples, and influenced by subjective factors. The difference is that LLM biases are reproducible and scalable.
For customer support evaluation, advanced LLMs are surprisingly reliable because the criteria are relatively objective:
– Accuracy: Does the response address the actual issue raised?
– Completeness: Is all necessary information provided?
– Professionalism: Is the tone appropriate and empathetic?
– Policy adherence: Does it follow reasonable support policies?
– Resolution potential: Does it move toward closing the ticket?
These aren’t questions requiring deep domain expertise or subjective aesthetic judgment. They’re pattern-matching tasks that LLMs excel at after processing millions of customer service examples during training.
We use temperature 0.3 for evaluation—low enough to ensure consistency, high enough to allow nuanced judgment. The output is binary (Accurate/Inaccurate) because KTO works with preference pairs, not scores. This simplicity actually improves reliability: LLMs are better at “is this acceptable?” than “rate this 1-10.”
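A sketch of the judging step, again reusing the `ubiai_chat` helper; the rubric wording and JSON output contract are illustrative, and you would extend the prompt with retrieved policy documents for the RAG variant mentioned in the heading.

```python
JUDGE_PROMPT = """You are an expert customer support quality reviewer.

Conversation so far:
{history}

Agent response to evaluate:
{response}

Judge the response on accuracy, completeness, professionalism, policy
adherence, and resolution potential. Answer with a JSON object:
{{"rating": "Accurate" or "Inaccurate", "reasoning": "<one sentence>"}}"""

def judge_response(turns: list[dict], response: str) -> dict:
    history = "\n".join(f'{t["role"]}: {t["content"]}' for t in turns)
    raw = ubiai_chat(
        [{"role": "user", "content": JUDGE_PROMPT.format(history=history, response=response)}],
        temperature=0.3,   # low temperature for consistent judgments
    )
    return json.loads(raw)

for convo in conversations:
    convo["judgment"] = judge_response(convo["turns"], convo["baseline_response"])
```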
For production deployment, you’d want to validate a sample of judgments against human evaluation. But for training purposes, the consistency of LLM judges provides sufficient signal for the model to learn meaningful patterns.

Packaging Data for KTO Training
We now have everything needed for reinforcement learning: conversations, agent responses, and binary preference labels. The final step is formatting this into the structure that KTO expects.
The format is elegantly simple:
– system_prompt: The agent’s goal (identical across all examples for consistency)
– input: Conversation history ending with the customer’s question
– output: The agent’s response to evaluate
– rating: Binary label (Accurate/Inaccurate)
This structure provides KTO with exactly what it needs: context, action, and evaluation. The algorithm will learn to predict which response patterns lead to “Accurate” labels and which lead to “Inaccurate” ones, adjusting the model’s behavior accordingly.
We’re preserving additional metadata (persona type, training aspect, judge reasoning) for analysis purposes. This helps us understand which scenarios our base model struggles with most, information that will be valuable for interpreting training results in Part 2.
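As a sketch, writing the records out as JSONL with exactly these four fields (plus optional metadata) might look like this; check UBIAI’s documentation for the exact upload format its KTO trainer expects.

```python
with open("kto_training_data.jsonl", "w") as f:
    for convo in conversations:
        record = {
            "system_prompt": AGENT_SYSTEM_PROMPT,
            "input": "\n".join(f'{t["role"]}: {t["content"]}' for t in convo["turns"]),
            "output": convo["baseline_response"],
            "rating": convo["judgment"]["rating"],   # "Accurate" | "Inaccurate"
            # Extra metadata kept for analysis; not required by KTO itself.
            "metadata": {
                "persona": convo["persona"].get("name"),
                "training_aspect": convo["persona"]["aspect"],
                "judge_reasoning": convo["judgment"]["reasoning"],
            },
        }
        f.write(json.dumps(record) + "\n")
```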
The Training Signal Quality
Looking at our evaluation results, we likely see accuracy rates varying by training aspect—perhaps 60-75% on challenging scenarios like handling aggressive refund seekers, higher on more straightforward aspects like shipping communication.
This variance is valuable data. It tells us exactly where KTO training will have the most impact. The aspects with lower accuracy represent the biggest opportunity for improvement through reinforcement learning.
The inaccurate responses are particularly crucial for RLHF. They’re not noise to be filtered out—they’re explicit negative examples that teach the model boundaries. KTO will learn: “When customers escalate like this, don’t respond like that.” This negative signal is often more informative than positive examples.
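To see exactly where that signal is concentrated, a quick per-aspect breakdown of judge ratings (continuing the illustrative data structures above) is enough:

```python
from collections import Counter, defaultdict

by_aspect: dict[str, Counter] = defaultdict(Counter)
for convo in conversations:
    by_aspect[convo["persona"]["aspect"]][convo["judgment"]["rating"]] += 1

for aspect, counts in sorted(by_aspect.items()):
    total = counts["Accurate"] + counts["Inaccurate"]
    print(f"{aspect:<35} accuracy = {counts['Accurate'] / total:.0%}  (n={total})")
```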
Why This Works: The Three Pillars
The success of this approach rests on three foundational insights:
Pillar 1: Specificity in Goal Definition
We didn’t try to make the agent “better at everything.” We identified eight precise capabilities with measurable outcomes. This specificity ensures our training data actually teaches something rather than providing generic examples.
Pillar 2: Edge Case Focus Through Personas
Our personas represent worst-case scenarios. RLHF training on edge cases naturally improves average-case performance. If the model learns to handle the most demanding customers, routine interactions become trivial.
Pillar 3: Consistent Evaluation Standards
LLM-as-a-judge isn’t perfect, but it’s consistently imperfect. The biases are reproducible, which provides the reliable training signal that KTO needs to learn patterns effectively.
Limitations to Consider
This methodology isn’t universally applicable. Some important constraints:
Domain Specificity: Our approach works well for customer support because LLMs have extensive knowledge of support best practices. For highly specialized domains (medical diagnosis, legal advice, advanced technical support), you may need to provide domain-specific context or examples to the generation and evaluation models.
Evaluation Validity: While LLM judges are reliable for customer support criteria, they can miss subtle issues that human evaluators would catch. For production deployment, validate a random sample of judgments against human evaluation to calibrate confidence.
Cost Considerations: Generating 500 examples with API calls for both generation and evaluation costs real money. This is dramatically cheaper than human labeling ($10,000+), but it’s not free. Budget accordingly for your specific use case.
Bias Inheritance: The training data inherits any biases present in the model’s outputs. Test your fine-tuned model for fairness issues across different customer demographics and communication styles.
The Path to Production RLHF
We’ve built the foundation. We have high-quality, labeled training data targeting specific support capabilities. We understand where the baseline model succeeds and where it needs improvement.
In Part 2, we’ll complete the RLHF training loop:
1. Upload our dataset to UBIAI for KTO training
2. Monitor the training process and interpret loss curves
3. Evaluate the fine-tuned model on held-out test scenarios
4. Measure improvement on our goal metrics (resolution rate, satisfaction)
5. Deploy the production agent with monitoring and fallback strategies
The promise of RLHF is simple: transform that 60-70% baseline accuracy on challenging scenarios into 85-90%+ performance. Create an agent that doesn’t just respond to customers—it resolves their issues efficiently while maintaining the empathy and professionalism that drives satisfaction.
And do it all without spending months collecting labeled data.
That’s the future of practical RLHF for production systems. And it starts with the data pipeline we’ve built here.
In Part 2, we’ll take this dataset and transform it into a production-ready customer support agent through KTO training on UBIAI. The data generation is complete—now comes the reinforcement learning.
Read the full blog: https://ubiai.tools/building-agentic-ai-systems-for-insurance-claims-processing/
Get the full notebook: https://discord.gg/UKDUXXRJtM
