Building Observable and Reliable AI Agents Using LangGraph, LangSmith, and UBIAI

December 8, 2025

Production AI agents fail silently, and most teams don’t discover the problem until customers complain.

The challenge with deploying multi-component AI agents isn’t getting them to work in development—it’s maintaining reliability once they’re handling real traffic. A recent analysis of production agent deployments revealed that 68% of failures occur in components that showed no issues during testing.

Why does this happen? Because most agent architectures are black boxes. You send a query in, get a response out, and have minimal visibility into what happened in between. When something goes wrong, you’re left guessing which component failed, why it failed, and how to fix it.

This post presents a technical framework for building observable AI agents with measurable reliability. We’ll cover instrumentation strategies for multi-component architectures, failure mode detection and classification, component-level performance metrics, and systematic approaches to improving reliability through targeted fine-tuning.

The implementation uses LangGraph for agent orchestration, LangSmith for observability, and UBIAI for component-level fine-tuning. The techniques apply to any multi-component agent architecture, regardless of the specific frameworks you’re using.

The Observability Problem in Multi-Component Agents

 

Traditional monitoring approaches fail for AI agents because they treat the system as a monolith.

A production-grade AI agent typically consists of multiple specialized components: a router that classifies intent and determines workflow, a retriever that fetches relevant context from knowledge bases, a reasoner that plans the response strategy, and a generator that produces the final output. Each component can fail independently, and each failure mode requires different remediation.

Standard logging and metrics don’t capture this component-level behavior effectively. You might log input and output, track latency and error rates, but these aggregate metrics obscure which specific component is degrading performance. A slow response could indicate retriever inefficiency, reasoner complexity, or generator verbosity. A wrong answer could stem from routing errors, retrieval failures, or generation hallucinations.

Effective observability for AI agents requires component-level instrumentation that tracks several key dimensions:

Component execution flow: Which components executed, in what order, with what inputs and outputs. This reveals routing decisions and workflow paths.

Component-specific latency: Breakdown of response time by component. Identifies performance bottlenecks accurately.

Intermediate state: Context retrieved, reasoning steps taken, prompts generated. Essential for debugging failures.

Component success metrics: Did each component accomplish its specific task correctly? Not just whether the final answer was right.

Failure attribution: When the system produces a wrong answer, which component caused it? Router misclassification, retrieval miss, reasoning error, or generation hallucination?

Without this granular observability, you’re optimizing blindly. You might fine-tune your generator when the real problem is your retriever, or invest in better prompts when the issue is actually routing errors sending queries to the wrong workflow.
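To make these dimensions concrete before introducing any specific frameworks, here is a minimal, framework-agnostic sketch of component-boundary instrumentation. The instrument decorator and the plain-dict state are illustrative assumptions rather than part of any library; they simply show the latency, error, and execution-path bookkeeping each component needs to emit.

import time
from functools import wraps

def instrument(name):
    """Wrap a component so each call records latency, errors, and execution order."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(state: dict) -> dict:
            start = time.time()
            state.setdefault("component_latencies", {})
            state.setdefault("component_errors", {})
            state.setdefault("execution_path", [])
            try:
                state = fn(state)
            except Exception as exc:
                state["component_errors"][name] = str(exc)  # failure attribution
            state["component_latencies"][name] = time.time() - start
            state["execution_path"].append(name)
            return state
        return wrapper
    return decorator

@instrument("router")
def route(state: dict) -> dict:
    # Toy stand-in for intent classification; the real router appears in the next section
    state["intent"] = "refund" if "refund" in state["query"].lower() else "general"
    return state

print(route({"query": "What's your refund policy?"}))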

Instrumenting a Multi-Component Agent with LangGraph and LangSmith

Let’s build a fully instrumented agent with component-level observability.

This implementation demonstrates a realistic multi-component architecture: a router for intent classification, a retriever for knowledge access, a reasoner for response planning, and a generator for output creation. We’ll instrument each component to capture the metrics needed for reliability analysis.

LangGraph provides the orchestration framework, allowing us to define explicit state transitions between components. LangSmith handles distributed tracing, giving us visibility into execution flow and component performance.

				
# Install required dependencies
!pip install langgraph langchain langchain-community langchain-openai langsmith chromadb datasets pandas -q
				
			
				
import os
from typing import TypedDict, Annotated, Sequence
import operator
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain.prompts import ChatPromptTemplate
from langchain_community.vectorstores import Chroma
from langchain.schema import Document
from langgraph.graph import StateGraph, END
from langsmith import Client
from datasets import load_dataset
import pandas as pd
import json
from datetime import datetime

# Initialize LangSmith client for observability
# Assumes OPENAI_API_KEY and LANGCHAIN_API_KEY are already set in the environment
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_PROJECT"] = "agent-observability-demo"
langsmith_client = Client()

# Initialize language model
llm = ChatOpenAI(model="gpt-4", temperature=0.3)

print("Observability infrastructure initialized")
print(f"LangSmith project: agent-observability-demo")
				
			
				
# Define agent state schema
# This captures all intermediate states as the query flows through components

class AgentState(TypedDict):
    # Input
    query: str
    
    # Routing
    intent: str
    confidence: float
    workflow: str
    
    # Retrieval
    retrieved_docs: list[str]
    retrieval_scores: list[float]
    retrieval_method: str
    
    # Reasoning
    reasoning_steps: list[str]
    response_strategy: str
    
    # Generation
    response: str
    generation_method: str
    
    # Observability
    component_latencies: dict
    component_errors: dict
    execution_path: Annotated[list[str], operator.add]

print("Agent state schema defined")
				
			
				
# Load customer support dataset and build knowledge base

print("Loading knowledge base data...")
dataset = load_dataset("bitext/Bitext-customer-support-llm-chatbot-training-dataset", split="train")
df = pd.DataFrame(dataset)

# Create vector store for retrieval
support_docs = []
for idx, row in df.head(200).iterrows():
    support_docs.append(
        Document(
            page_content=row['response'],
            metadata={
                "query": row['instruction'],
                "category": row.get('category', 'general'),
                "doc_id": idx
            }
        )
    )

vectorstore = Chroma.from_documents(
    documents=support_docs,
    embedding=OpenAIEmbeddings()
)

print(f"Knowledge base created: {len(support_docs)} documents")
				
			

Get The Full Notebook From: https://discord.gg/UKDUXXRJtM
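The four node functions wired into the graph below are defined in the full notebook. As a hedged reference for their shape, here is a minimal sketch of each one: the prompts, the relevance-score retrieval call, and the reasoner's fixed strategy are simplifying assumptions of ours, but the shared instrumentation pattern (record latency, append to the execution path, write intermediate state into AgentState) is the part the rest of the post depends on.

import time

def router_node(state: AgentState) -> AgentState:
    """Classify intent, choose a workflow, and record observability data."""
    start = time.time()
    intents = ["refund", "shipping", "account", "technical", "general"]
    reply = llm.invoke(
        f"Classify this customer query into one of {intents}. "
        f"Reply with the label only.\nQuery: {state['query']}"
    ).content.strip().lower()
    state['intent'] = reply if reply in intents else "general"
    state['confidence'] = 0.9 if reply in intents else 0.5  # illustrative heuristic
    state['workflow'] = f"{state['intent']}_workflow"
    state.setdefault('component_latencies', {})['router'] = time.time() - start
    state.setdefault('component_errors', {})
    state['execution_path'].append('router')
    return state

def retriever_node(state: AgentState) -> AgentState:
    """Fetch supporting documents and similarity scores from the knowledge base."""
    start = time.time()
    hits = vectorstore.similarity_search_with_relevance_scores(state['query'], k=3)
    state['retrieved_docs'] = [doc.page_content for doc, _ in hits]
    state['retrieval_scores'] = [score for _, score in hits]
    state['retrieval_method'] = 'vector_similarity'
    state['component_latencies']['retriever'] = time.time() - start
    state['execution_path'].append('retriever')
    return state

def reasoner_node(state: AgentState) -> AgentState:
    """Plan the response strategy from the intent and retrieved context."""
    start = time.time()
    state['reasoning_steps'] = [
        f"intent={state['intent']}",
        f"docs_retrieved={len(state['retrieved_docs'])}",
        "strategy=cite_specific_policy",
    ]
    state['response_strategy'] = 'cite_specific_policy'  # simplified: a real reasoner would choose
    state['component_latencies']['reasoner'] = time.time() - start
    state['execution_path'].append('reasoner')
    return state

def generator_node(state: AgentState) -> AgentState:
    """Produce the final answer from the query, retrieved context, and strategy."""
    start = time.time()
    context = "\n\n".join(state['retrieved_docs'])
    state['response'] = llm.invoke(
        f"Context:\n{context}\n\nCustomer question: {state['query']}\n"
        f"Strategy: {state['response_strategy']}\nWrite a specific, grounded reply."
    ).content
    state['generation_method'] = 'base_llm'
    state['component_latencies']['generator'] = time.time() - start
    state['execution_path'].append('generator')
    return state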

				
# Build the agent graph
# This defines the execution flow between components

workflow = StateGraph(AgentState)

# Add nodes
workflow.add_node("router", router_node)
workflow.add_node("retriever", retriever_node)
workflow.add_node("reasoner", reasoner_node)
workflow.add_node("generator", generator_node)

# Define edges (execution flow)
workflow.set_entry_point("router")
workflow.add_edge("router", "retriever")
workflow.add_edge("retriever", "reasoner")
workflow.add_edge("reasoner", "generator")
workflow.add_edge("generator", END)

# Compile the graph
agent = workflow.compile()

print("Agent graph compiled successfully")
print("Execution flow: router → retriever → reasoner → generator")
				
			
				
# Test the instrumented agent

test_query = "What's your refund policy for items purchased last week?"

print(f"\n{'='*70}")
print(f"Testing query: {test_query}")
print(f"{'='*70}\n")

# Run the agent
result = agent.invoke({
    "query": test_query,
    "execution_path": []
})

# Display results with observability data
print("\n📊 OBSERVABILITY METRICS\n")
print(f"Intent: {result['intent']}")
print(f"Confidence: {result['confidence']:.2f}")
print(f"Workflow: {result['workflow']}")
print(f"\nRetrieval scores: {[f'{s:.3f}' for s in result['retrieval_scores']]}")
print(f"Response strategy: {result['response_strategy']}")
print(f"\nReasoning steps:")
for step in result['reasoning_steps']:
    print(f"  - {step}")

print(f"\n⏱️ COMPONENT LATENCIES\n")
total_latency = 0
for component, latency in result['component_latencies'].items():
    print(f"{component:12s}: {latency*1000:6.2f}ms")
    total_latency += latency
print(f"{'Total':12s}: {total_latency*1000:6.2f}ms")

print(f"\n💬 FINAL RESPONSE\n")
print(result['response'])
print(f"\n{'='*70}")
				
			

This instrumentation provides the foundation for reliability analysis.

 

Notice how the state object captures every intermediate decision and computation. We can see exactly which intent was classified, what documents were retrieved with what similarity scores, what reasoning strategy was chosen, and how long each component took.

 

This level of observability is essential for identifying failure modes. If the agent gives a wrong answer, we can trace back through the execution path to see where it went wrong. Wrong intent classification? Router problem. Relevant documents not retrieved? Retriever problem. Good context ignored? Generator problem.

 

Now let’s build the analysis infrastructure to systematically detect and classify failures.

 

Failure Mode Detection and Classification

 

Not all failures are equal. Different failure modes require different solutions.

 

The key insight from analyzing production agent failures is that failure modes cluster into distinct categories, each attributable to specific components. Identifying which type of failure you’re experiencing tells you which component needs improvement.

 

We’ve identified six primary failure modes that account for approximately 90% of production issues:

Routing failures: the query is sent to the wrong workflow.

Retrieval failures: relevant information is not found.

Reasoning failures: an incorrect response strategy is selected.

Generation failures: poor output despite good inputs.

Latency failures: response time exceeds the SLA.

Degradation failures: quality decreases over time.

 

Let’s build a failure detection system that automatically classifies these modes based on the observability data we’re collecting.
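The detector class itself ships with the full notebook. The sketch below is one plausible shape for it, under assumed thresholds (a 0.3 relevance floor and a 5-second latency SLA) and with only some of the six checks spelled out; what the later cells rely on is the interface: analyze_execution() takes the agent state plus optional labels and returns an analysis with a failure mode, a confidence, the implicated component, supporting evidence, and a recommended action.

from dataclasses import dataclass
from enum import Enum

class FailureMode(Enum):
    NONE = "none"
    ROUTING = "routing_failure"
    RETRIEVAL = "retrieval_failure"
    REASONING = "reasoning_failure"      # checked in the full notebook
    GENERATION = "generation_failure"
    LATENCY = "latency_failure"
    DEGRADATION = "degradation_failure"  # tracked over time by the monitor

@dataclass
class FailureAnalysis:
    failure_mode: FailureMode
    confidence: float
    component: str
    evidence: list
    recommended_action: str

class FailureDetector:
    """Classify a single execution into a failure mode from its observability data."""

    def __init__(self, retrieval_threshold=0.3, latency_sla_s=5.0):
        self.retrieval_threshold = retrieval_threshold
        self.latency_sla_s = latency_sla_s

    def analyze_execution(self, agent_state, expected_intent=None, is_correct_response=None):
        evidence = []
        # Routing: did the classified intent match the expected label?
        if expected_intent and agent_state['intent'] != expected_intent:
            evidence.append(f"intent={agent_state['intent']}, expected={expected_intent}")
            return FailureAnalysis(FailureMode.ROUTING, 0.9, "router", evidence,
                                   "Add routing examples for this intent and fine-tune the router")
        # Retrieval: were the retrieved documents relevant enough?
        scores = agent_state.get('retrieval_scores', [])
        if scores and max(scores) < self.retrieval_threshold:
            evidence.append(f"max retrieval score {max(scores):.2f} < {self.retrieval_threshold}")
            return FailureAnalysis(FailureMode.RETRIEVAL, 0.8, "retriever", evidence,
                                   "Expand the knowledge base or improve embeddings/chunking")
        # Latency: did end-to-end time exceed the SLA?
        total = sum(agent_state.get('component_latencies', {}).values())
        if total > self.latency_sla_s:
            evidence.append(f"total latency {total:.1f}s exceeds SLA of {self.latency_sla_s}s")
            return FailureAnalysis(FailureMode.LATENCY, 0.7, "system", evidence,
                                   "Profile the slowest component and optimize or cache it")
        # Generation: inputs looked fine but the final answer was judged wrong
        if is_correct_response is False:
            evidence.append("routing and retrieval passed but the response was incorrect")
            return FailureAnalysis(FailureMode.GENERATION, 0.6, "generator", evidence,
                                   "Collect failure examples and fine-tune the generator")
        evidence.append("no failure signals detected")
        return FailureAnalysis(FailureMode.NONE, 0.9, "none", evidence, "No action required")

detector = FailureDetector()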

 

				
# Test failure detection on our previous execution

analysis = detector.analyze_execution(
    agent_state=result,
    expected_intent="refund",
    is_correct_response=True  # Assuming the response was correct
)

print("\n🔍 FAILURE ANALYSIS\n")
print(f"Failure mode: {analysis.failure_mode.value}")
print(f"Confidence: {analysis.confidence:.2f}")
print(f"Component: {analysis.component}")
print(f"\nEvidence:")
for e in analysis.evidence:
    print(f"  - {e}")
print(f"\nRecommended action: {analysis.recommended_action}")
				
			

Get The Full Notebook From: https://discord.gg/UKDUXXRJtM

Building a Component Performance Baseline

 

To improve reliability, you need quantitative baselines for each component.

 

Before fine-tuning anything, establish baseline performance metrics for each component. This lets you measure improvement accurately and identify which components contribute most to overall system failures.

 

For each component, we track accuracy (percentage of times the component performs its task correctly), latency distribution (p50, p95, p99 response times), and failure rate (percentage of executions that error or produce invalid output). At the system level, we track end-to-end accuracy, total latency, and component-attributed failure rates.

 

Let’s build a performance tracking system that runs the agent on a test set and computes these metrics.
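The tracker used below also comes from the notebook. A compact sketch, again an assumption rather than the exact implementation, keeps every execution together with its FailureAnalysis and prints per-component failure rates and p50/p95 latencies; record_execution(), print_report(), and the executions and failure_analyses attributes are the pieces the later cells depend on.

import statistics

class ComponentPerformanceTracker:
    """Accumulate per-component latency and failure statistics across runs."""

    def __init__(self):
        self.executions = []        # raw agent states
        self.failure_analyses = []  # FailureAnalysis objects, one per execution

    def record_execution(self, agent_state, analysis):
        self.executions.append(agent_state)
        self.failure_analyses.append(analysis)

    def print_report(self):
        n = len(self.executions)
        if n == 0:
            print("No executions recorded")
            return
        print(f"\n📈 BASELINE REPORT ({n} executions)\n")
        for component in ["router", "retriever", "reasoner", "generator"]:
            failures = sum(1 for a in self.failure_analyses if a.component == component)
            latencies = sorted(
                s['component_latencies'][component]
                for s in self.executions
                if component in s.get('component_latencies', {})
            )
            if latencies:
                p50 = statistics.median(latencies)
                p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
                print(f"{component:10s} failure rate {failures / n:5.1%}  "
                      f"p50 {p50 * 1000:6.1f}ms  p95 {p95 * 1000:6.1f}ms")
        clean = sum(1 for a in self.failure_analyses if a.component == "none")
        print(f"\nEnd-to-end success rate: {clean / n:.1%}")

tracker = ComponentPerformanceTracker()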

				
# Run agent on test set to establish baseline

test_queries = [
    {"query": "What's your refund policy?", "expected_intent": "refund"},
    {"query": "How long does shipping take?", "expected_intent": "shipping"},
    {"query": "I was charged twice for my order", "expected_intent": "account"},
    {"query": "The app crashes when I log in", "expected_intent": "technical"},
    {"query": "Can I change my delivery address?", "expected_intent": "shipping"},
    {"query": "How do I cancel my subscription?", "expected_intent": "account"},
    {"query": "What payment methods do you accept?", "expected_intent": "general"},
    {"query": "I need to return an item", "expected_intent": "refund"},
]

print("Running baseline performance evaluation...\n")

for i, test_case in enumerate(test_queries, 1):
    print(f"[{i}/{len(test_queries)}] Processing: {test_case['query'][:50]}...")
    
    # Run agent
    result = agent.invoke({
        "query": test_case['query'],
        "execution_path": []
    })
    
    # Analyze for failures
    analysis = detector.analyze_execution(
        agent_state=result,
        expected_intent=test_case['expected_intent']
    )
    
    # Record results
    tracker.record_execution(result, analysis)

# Print baseline report
tracker.print_report()
				
			

Get The Full Notebook From: https://discord.gg/UKDUXXRJtM

This baseline gives you concrete targets for improvement.

 

Notice how the metrics break down by component. If the generator has a 40% failure rate but the retriever only has 10%, you know where to focus your fine-tuning efforts. If the p95 latency is 2000ms and the generator accounts for 1500ms of that, you know which component is the performance bottleneck.

 

Now let’s demonstrate how to improve the worst-performing component using targeted fine-tuning.

 

Targeted Fine-Tuning for Component Reliability

 

Once you’ve identified the failing component, fine-tune only that component.

 

Based on our baseline metrics, let’s assume the generator is the primary source of failures. This is the most common scenario—retrieval works, routing works, but the generator produces vague answers, hallucinates, or ignores the retrieved context.

 

The fine-tuning approach is straightforward: collect examples where the generator failed, create training data showing what the correct output should have been, fine-tune only the generator component, and integrate the fine-tuned generator back into the agent graph.

 

Let’s prepare training data and integrate a fine-tuned generator using UBIAI.

 

				
# Prepare training data for generator fine-tuning
# Focus on cases where the generator produced suboptimal responses

training_data = []

# Use the support dataset to create training examples
# Each example shows: context (what was retrieved) + query → ideal response
for idx, row in df.head(300).iterrows():
    training_data.append({
        "system_prompt": "You are a customer support agent. Use the provided context to give accurate, specific responses. Cite details from the context. Format responses clearly with proper structure.",
        "input": f"Context: {row['response']}\n\nCustomer Question: {row['instruction']}\n\nStrategy: cite_specific_policy",
        "output": row['response']
    })

# Save training data
df_training = pd.DataFrame(training_data)
df_training.to_csv('generator_finetuning_training.csv', index=False)

print(f"Training data prepared: {len(training_data)} examples")
print(f"Saved to: generator_finetuning_training.csv")
print("\n📝 Next steps:")
print("1. Upload generator_finetuning_training.csv to UBIAI")
print("2. Select 'Generator/Reasoner' component type")
print("3. Choose fine-tuning method (start with prompt, upgrade to weights if needed)")
print("4. Train and get API endpoint")
print("5. Integrate fine-tuned generator (code below)")
				
			
				
# Integration: Fine-tuned generator using UBIAI

import time
import requests
import json

UBIAI_API_URL = "https://api.ubiai.tools:8443/api_v1/annotate"
UBIAI_API_KEY = os.environ.get("UBIAI_API_KEY")

def finetuned_generator_node(state: AgentState) -> AgentState:
    """Generate response using fine-tuned UBIAI model"""
    start_time = time.time()
    
    query = state['query']
    retrieved_docs = state['retrieved_docs']
    strategy = state['response_strategy']
    
    # Build context
    context = "\n\n".join(retrieved_docs)
    
    # Call UBIAI fine-tuned model
    url = f"{UBIAI_API_URL}{UBIAI_API_KEY}"
    
    input_text = f"Context: {context}\n\nCustomer Question: {query}\n\nStrategy: {strategy}"
    
    data = {
        "input_text": "",
        "system_prompt": "You are a customer support agent. Use the provided context to give accurate, specific responses. Cite details from the context. Format responses clearly with proper structure.",
        "user_prompt": input_text,
        "temperature": 0.5
    }
    
    try:
        response = requests.post(url, json=data, timeout=10)
        result = json.loads(response.content.decode("utf-8"))
        state['response'] = result.get('response', '').strip()
        state['generation_method'] = 'ubiai_finetuned'
    except Exception as e:
        # Fallback to generic on error
        state['response'] = "I apologize, but I'm having trouble generating a response right now."
        state['generation_method'] = 'fallback'
        if 'component_errors' not in state:
            state['component_errors'] = {}
        state['component_errors']['generator'] = str(e)
    
    # Record metrics
    state['component_latencies']['generator'] = time.time() - start_time
    state['execution_path'].append('generator_finetuned')
    
    return state

# Build new agent graph with fine-tuned generator
workflow_finetuned = StateGraph(AgentState)

workflow_finetuned.add_node("router", router_node)
workflow_finetuned.add_node("retriever", retriever_node)
workflow_finetuned.add_node("reasoner", reasoner_node)
workflow_finetuned.add_node("generator", finetuned_generator_node)  # Fine-tuned version

workflow_finetuned.set_entry_point("router")
workflow_finetuned.add_edge("router", "retriever")
workflow_finetuned.add_edge("retriever", "reasoner")
workflow_finetuned.add_edge("reasoner", "generator")
workflow_finetuned.add_edge("generator", END)

agent_finetuned = workflow_finetuned.compile()

print("Fine-tuned agent graph compiled")
print("Only generator component changed - all other components identical")
				
			

Get The Full Notebook From: https://discord.gg/UKDUXXRJtM

The improvement comes from surgical intervention, not wholesale replacement.

 

Notice that we kept the router, retriever, and reasoner identical. The only change was swapping the generic generator for a fine-tuned one trained on examples of high-quality responses in this domain.

 

This component-level approach has several advantages. First, it’s faster than fine-tuning an end-to-end model—you need fewer examples and training completes in minutes rather than hours. Second, it’s more debuggable—if something regresses, you know exactly which component changed. Third, it’s more maintainable—you can iterate on individual components independently.

 

The key is that observability told us which component to fine-tune. Without component-level metrics, we’d be guessing.

 

Continuous Monitoring and Improvement

 
Production reliability requires continuous measurement and iteration.
 

Fine-tuning is not a one-time operation. As your product evolves, new query patterns emerge, edge cases appear, and performance can degrade. The observability infrastructure we’ve built enables continuous improvement.

 

The recommended workflow is:

1. Deploy with full instrumentation enabled.
2. Collect component-level metrics on every request.
3. Run automated failure detection daily.
4. Identify components with degrading performance.
5. Collect failure examples for retraining.
6. Fine-tune the degraded component.
7. Deploy the updated component.
8. Measure improvement.

 

Let’s build a monitoring dashboard that makes this workflow concrete.

Get The Full Notebook From: https://discord.gg/UKDUXXRJtM
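As with the detector and tracker, the production monitor is defined in the notebook; a plausible sketch, with an assumed rolling window and alert threshold, is shown here. The cell below only needs record_interaction() and generate_report().

from collections import defaultdict, deque

class ProductionMonitor:
    """Rolling-window monitor that flags components whose failure rate crosses a threshold."""

    def __init__(self, window_size=100, alert_threshold=0.15):
        self.window = deque(maxlen=window_size)
        self.alert_threshold = alert_threshold

    def record_interaction(self, agent_state, analysis):
        self.window.append((agent_state, analysis))

    def generate_report(self):
        n = len(self.window)
        if n == 0:
            print("No interactions recorded")
            return
        print(f"\n🩺 PRODUCTION MONITOR ({n} recent interactions)\n")
        failures = defaultdict(int)
        total_latency = 0.0
        for agent_state, analysis in self.window:
            if analysis.component not in ("none", None):
                failures[analysis.component] += 1
            total_latency += sum(agent_state.get('component_latencies', {}).values())
        print(f"Average end-to-end latency: {total_latency / n * 1000:.1f}ms")
        if not failures:
            print("No component-attributed failures in the current window")
        for component, count in sorted(failures.items()):
            rate = count / n
            flag = "⚠️ ALERT" if rate > self.alert_threshold else "ok"
            print(f"{component:10s} failure rate {rate:5.1%}  [{flag}]")

monitor = ProductionMonitor()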

				
# Simulate production monitoring by loading tracker data

# Transfer baseline data to monitor (simulating production traffic)
for execution, analysis in zip(tracker.executions, tracker.failure_analyses):
    monitor.record_interaction(execution, analysis)

# Generate monitoring report
monitor.generate_report()
				
			

Technical Best Practices for Observable Agents

 

Based on production deployments, here are the technical practices that matter most.

 

Instrument every component boundary: Capture inputs, outputs, latency, and errors at every transition between components. This is your primary debugging tool when things go wrong.

 

Track intermediate state explicitly: Don’t just log final outputs. Capture what was retrieved, what was reasoned, what strategy was chosen. These intermediate states are essential for failure attribution.

 

Use structured state objects: Define explicit schemas for agent state using typed data structures. This makes instrumentation consistent and enables automated analysis.

 

Implement component-level metrics: Don’t just measure end-to-end success. Track each component’s success rate, latency distribution, and failure modes independently.

 

Build failure classifiers: Automate the detection and classification of failure modes based on observability data. This scales much better than manual log analysis. 

 

Establish quantitative baselines: Before fine-tuning anything, measure current performance rigorously. You need concrete numbers to evaluate improvement.

 

Fine-tune components, not systems: When you identify a failing component, fine-tune only that component. This is faster, more maintainable, and easier to debug than end-to-end fine-tuning.

 

Use prompt fine-tuning first: Start with prompt optimization before training weights. It’s faster to iterate and often provides 80-90% of the improvement with 10% of the effort.

 

Monitor continuously: Production performance degrades over time as query distributions shift. Continuous monitoring catches this before it becomes critical.

 

Automate the improvement loop: Build systems that automatically detect degradation, collect failure examples, trigger fine-tuning, and deploy updated components. Manual processes don’t scale.
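One small building block for that loop, sketched under the same assumptions as the tracker above (the function name and CSV columns are ours, not UBIAI's), pulls the failures attributed to a given component out of the tracker and exports them as fine-tuning candidates for review:

import pandas as pd

def collect_failure_examples(tracker, component="generator", path="generator_failures.csv"):
    """Export failed executions attributed to one component as fine-tuning candidates."""
    rows = []
    for state, analysis in zip(tracker.executions, tracker.failure_analyses):
        if analysis.component != component:
            continue
        rows.append({
            "query": state["query"],
            "context": "\n\n".join(state.get("retrieved_docs", [])),
            "strategy": state.get("response_strategy", ""),
            "bad_response": state.get("response", ""),
            "ideal_response": "",  # filled in during human review before upload
        })
    pd.DataFrame(rows).to_csv(path, index=False)
    print(f"Exported {len(rows)} {component} failures to {path}")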

 

The common thread across all these practices is measurement. You can’t improve what you can’t measure, and you can’t measure what you don’t instrument.

 

Key Takeaways

 

Observable, reliable AI agents require systematic instrumentation and targeted improvement.

 

The core insight from this technical deep dive is that agent reliability is fundamentally a measurement and optimization problem. You need observability to identify which components are failing, metrics to quantify the failures, and targeted fine-tuning to fix specific components without rebuilding the entire system.

 

What we covered:

 

We identified why traditional monitoring fails for multi-component agents—aggregate metrics obscure which specific component is causing failures.

 

We built a fully instrumented agent using LangGraph for orchestration and LangSmith for distributed tracing, capturing component-level execution flow, latency, intermediate state, and failure modes.

 

We implemented automated failure detection and classification that attributes failures to specific components based on observability data.

 

We established quantitative performance baselines across all components, providing concrete targets for improvement.

 

We demonstrated targeted fine-tuning of the generator component using UBIAI, showing how to improve reliability without changing the entire system.

 

We built production monitoring infrastructure that continuously tracks performance, detects degradation, and triggers alerts when components fall below thresholds.

 

The component-level approach enables:

 

Faster iteration—fine-tuning a single component takes minutes rather than hours for end-to-end training.

 

Better debuggability—when something regresses, you know exactly which component changed.

 

More maintainability—you can evolve individual components independently without risking the entire system.

 

Measurable improvement—component-level metrics let you quantify exactly how much each change improved reliability.

 

The techniques presented here apply to any multi-component agent architecture, regardless of specific frameworks. The key requirements are explicit state management, component boundary instrumentation, and systematic measurement.

 
 

Implementation Resources

 
 

Ready to implement observable, reliable agents in your system?

 

The complete code from this notebook provides a working reference implementation. To adapt it to your use case:

 

Define your agent state schema capturing all intermediate states relevant to your components. Use TypedDict or dataclasses for type safety.

 

Instrument each component to record inputs, outputs, latency, and errors. Ensure every component boundary is observable.

 

Implement failure detection tailored to your specific failure modes. The detector shown here is a starting point—extend it based on your domain.

 

Establish performance baselines by running your agent on representative test sets and computing component-level metrics.

 

Collect training data from production failures. Focus on examples where specific components produced incorrect or suboptimal outputs.

 

Fine-tune failing components using UBIAI. Upload your training data at app.ubiai.tools, select the component type (router, retriever, or generator), choose prompt or weight fine-tuning, and integrate the resulting API endpoint.

 

Deploy with monitoring that continuously tracks component performance and alerts on degradation.

 

For detailed UBIAI integration guides, see the technical documentation. For the complete fine-tuning workflow, watch the UBIAI tutorial.

 

The goal is to move from reactive debugging (“why did this query fail?”) to proactive optimization (“which components are degrading and how do we fix them systematically?”).

 

Component-level observability and targeted fine-tuning make that transition possible.
