Observability & Evaluation

AI Agents & Orchestration — Session 1

2026-04-26

A. The Black Box Problem

You cannot ship what you cannot trace

Agenda

Section Topic Time
A The Black Box Problem ~10 min
B Structured Tracing & Langfuse ~15 min
C Loop Detection ~15 min
D Cost Tracking & Guardrails ~10 min
E Production Logging & Metrics ~15 min
F DeepEval for Agents ~20 min
G Wrap-up & Lab Preview ~5 min

Traditional vs Agent Debugging

🖥️ Traditional Debugging

  • Same input → same output
  • Stack trace shows exact failure
  • Unit tests with assertEqual
  • Fixed cost per execution
  • Errors crash the program

🤖 Agent Debugging

  • Same input → different outputs
  • Failure emerges over many steps
  • Subjective quality evaluation
  • Variable cost (1–50 LLM calls)
  • Errors silently “reasoned away”

The Core Problem

An agent might technically succeed (no crash, produces an answer) while being completely wrong. Or it might spend $0.50 on a $0.02 question. You can’t fix what you can’t see.

Five Ways Agents Fail

  1. 🤔 Prompt ambiguity — the agent didn’t understand the task
  2. 🔧 Tool misuse — right tool, wrong arguments
  3. 📄 Formatting errors — tried to produce structured JSON and got the format wrong
  4. 🔁 Infinite loops — kept searching “Python” 50 times
  5. 💬 Hallucination — confidently lied about a search result it never saw

Production Tip: Never deploy an agent without tracing. Costs can explode if an agent enters an infinite loop.

B. Structured Tracing & Langfuse

Every step, captured and queryable

What a Good Trace Captures

For every step in the agent loop:

📍 Identity

  • Unique Trace ID per request
  • Step number
  • Agent name & model

📊 Metrics

  • Token usage (input / output)
  • Cost per step (USD)
  • Duration (milliseconds)

📝 Content

  • Agent’s reasoning (LLM response text)
  • Tool calls (name + arguments) + Tool results

A Trace Data Model

@dataclass
class ToolCallRecord:
    tool_name: str
    tool_input: dict
    tool_output: str
    duration_ms: float

@dataclass
class AgentStep:
    step_number: int
    reasoning: Optional[str]
    tool_calls: list[ToolCallRecord]
    input_tokens: int = 0
    output_tokens: int = 0
    cost_usd: float = 0.0

@dataclass
class Trace:
    trace_id: str
    agent_name: str
    steps: list[AgentStep]
    status: str  # "running", "completed", "failed", "loop_detected"
    total_cost_usd: float = 0.0

Langfuse Setup

To get full visibility, we use a two-layer strategy:

  1. Auto-Trace (LiteLLM): Captures every raw LLM input/output.
  2. Structured Trace (@observe): Captures your agent’s logical steps.

import litellm
from langfuse import observe

# 1. Zero-code: LiteLLM sends every completion to Langfuse
litellm.success_callback = ["langfuse"]

# 2. Structured: Wrap your agent methods to create a "Tree"
@observe(as_type="agent")
def run_agent(query):
    # This becomes the parent span
    return "Result"

Enriching Traces with Context

The propagate_attributes context manager allows you to inject rich metadata without polluting your function signatures.

from langfuse import observe, propagate_attributes
 
@observe(as_type="tool")
def call_tool(self, name, args):
    # 1. Propagate metadata and tags to this span and its children
    with propagate_attributes(
        metadata={"tool_version": "1.0.2", "user_id": "user_123"},
        tags=["production", "critical-path"]
    ):
        result = self.execute(name, args)
    
    return result

Note

In our labs, we use a SimpleObserver that mirrors this API exactly—allowing you to switch to production Langfuse with a single import change.
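
For reference, a minimal observe-compatible shim might look like the sketch below (the lab's actual SimpleObserver may record more detail):

import functools
import time
import uuid

def observe(name=None, as_type="span"):
    """Drop-in stand-in for langfuse.observe: logs the span name, type, and duration."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            span_id = uuid.uuid4().hex[:8]
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                duration_ms = (time.perf_counter() - start) * 1000
                print(f"[{as_type}:{span_id}] {name or fn.__name__}: {duration_ms:.0f}ms")
        return wrapper
    return decorator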

Trace Output Example

============================================================
TRACE SUMMARY: a1b2c3d4
============================================================
Agent: react_agent | Model: gpt-4o
Status: completed
Query: What is the population of the capital of France?

Steps (3 total):
------------------------------------------------------------
  Step 1: 1200ms, $0.0085 -> search
  Step 2:  980ms, $0.0062 -> search
  Step 3:  450ms, $0.0031 (no tools)
------------------------------------------------------------
Total Tokens: 2847 input + 312 output = 3159
Total Cost:   $0.0178
Total Time:   2630ms

Answer: The population of Paris is approximately 2.1 million.
============================================================

Reading a Trace

Look at step count (was 3 steps necessary?) and cost per step (which step was expensive?). These two signals reveal most agent problems.
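
As a sketch, both checks can be automated over the Trace dataclass from earlier (the step and per-step cost limits below are illustrative):

def flag_suspicious_steps(trace: Trace, max_steps: int = 5, max_step_cost_usd: float = 0.01) -> None:
    """Print warnings for traces with too many steps or unusually expensive steps."""
    if len(trace.steps) > max_steps:
        print(f"⚠️ {len(trace.steps)} steps: was every step necessary?")
    for step in trace.steps:
        if step.cost_usd > max_step_cost_usd:
            tools = ", ".join(tc.tool_name for tc in step.tool_calls) or "no tools"
            print(f"⚠️ Step {step.step_number} cost ${step.cost_usd:.4f} ({tools})")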

C. Loop Detection

Catching agents that spin in circles

The Infinite Loop Problem

An agent calls search("python tutorial"), gets a result, then calls search("python tutorial") again. And again. And again.

Why it happens:

  • Doesn’t “understand” the result satisfied the query
  • Tool returns an error → retries same args
  • Ambiguous prompt → keeps trying variations

💸 Cost Impact

A looping agent can burn through hundreds of API calls at $0.01–0.05 each.

A single bad query can cost $5+ before max_steps kicks in.

Three Detection Strategies

Loops appear in two places — we need a different check for each:

Tool-call checks (before each tool executes)

  1. Exact Match — same tool + identical args
  2. Fuzzy Match — same tool + suspiciously similar args

Reasoning check (after each LLM response)

  3. Output Stagnation — the agent keeps saying the same thing

Strategy 1: Exact Match

Same tool + identical arguments repeated N times.

# Before executing each tool call:
count = tool_history.count(
    (current_tool_name, current_args_string)
)

if count >= exact_threshold:
    return LoopDetected(confidence=1.0)

tool_history.append(
    (current_tool_name, current_args_string)
)

✅ Confidence: 100%

If the tool is idempotent (like a search), an identical repeat call can’t produce new information, so it’s always a loop.

Polling Exception

For tools like check_job_status, repeating calls are expected. Skip exact match detection for polling tools.
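
A sketch of how that exception could plug into the exact-match check (the tool names in POLLING_TOOLS are illustrative):

# Tools where identical repeat calls are legitimate (names are examples)
POLLING_TOOLS = {"check_job_status", "get_task_progress"}

def should_run_exact_match(tool_name: str) -> bool:
    """Skip exact-match loop detection for polling-style tools."""
    return tool_name not in POLLING_TOOLS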

Strategy 2: Fuzzy Match

Similar (but not identical) tool calls — catches minor rephrasing.

We compare argument strings only (not the tool name), tokenised into word sets:

Jaccard(A, B) = |A ∩ B| / |A ∪ B|

# Call 1: search("Paris population")
args_A = {Paris, population}

# Call 2: search("population of Paris")
args_B = {population, of, Paris}

Intersection = {Paris, population}        → 2
Union        = {Paris, population, of}    → 3
Jaccard = 2/3 = 0.67  ← Suspicious 🚩

# A genuinely different call for comparison:
args_A = {how, to, code, python}          → search("how to code python")
args_B = {python, programming, tutorial}  → search("python programming tutorial")
Jaccard = 1/6 = 0.17  ← Different query ✅

How Jaccard works

Count the shared words divided by the total unique words across both sets.

A score near 1.0 = almost identical. A score near 0.0 = completely different.

Threshold: 0.8 — catches rephrasings, ignores truly different queries.
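
A minimal sketch of the fuzzy check, tokenising argument strings on whitespace and flagging anything at or above the 0.8 threshold:

def jaccard_similarity(args_a: str, args_b: str) -> float:
    """Word-set overlap between two argument strings (0.0 = disjoint, 1.0 = identical)."""
    set_a, set_b = set(args_a.lower().split()), set(args_b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0

def is_fuzzy_repeat(current_args: str, previous_args: list[str], threshold: float = 0.8) -> bool:
    """Flag the call if it is suspiciously similar to any previous call of the same tool."""
    return any(jaccard_similarity(current_args, prev) >= threshold for prev in previous_args)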

Strategy 3: Output Stagnation

A different kind of loop

Strategies 1 & 2 detect repeated tool calls. Stagnation detects when the agent’s reasoning text is stuck — it runs after each LLM response, not before tool execution.

What it looks like:

Step 4: "I need to search for more info..."
Step 5: "I need to search for more info..."
Step 6: "I need to search for more info..."

The agent keeps producing nearly identical reasoning, even if it issues different tool calls.

How we detect it:

After each LLM response, we add the text to output_history.

We then compute the average word overlap across the last N entries — a high score means the agent is thinking in circles.

recent_outputs = output_history[-stagnation_window:]

if avg_pairwise_word_overlap(recent_outputs) >= threshold:
    return LoopDetected(strategy="stagnation")
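
A possible implementation of the avg_pairwise_word_overlap helper used above (word-level Jaccard overlap, averaged over every pair in the window):

from itertools import combinations

def avg_pairwise_word_overlap(outputs: list[str]) -> float:
    """Average word-set Jaccard overlap across all pairs of recent outputs."""
    if len(outputs) < 2:
        return 0.0
    scores = []
    for a, b in combinations(outputs, 2):
        words_a, words_b = set(a.lower().split()), set(b.lower().split())
        union = words_a | words_b
        scores.append(len(words_a & words_b) / len(union) if union else 1.0)
    return sum(scores) / len(scores)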

The Circuit Breaker Pattern

Combine loop detection with the agent loop to break infinite cycles:

for step in range(max_steps):
    # ... get LLM response, extract tool calls ...

    for tool_call in tool_calls:
        # Check BEFORE executing
        loop_check = loop_detector.check_tool_call(
            tool_call.name, str(tool_call.arguments)
        )

        if loop_check.is_looping:
            # Inject warning into conversation instead of executing
            messages.append({
                "role": "tool",
                "content": f"LOOP DETECTED: {loop_check.message}"
            })
            break  # Skip remaining tool calls this step

Tip

The agent receives the loop warning as if it were a tool result — it can then change strategy on the next step.

D. Cost Tracking

Monitoring spend per query

Why Track Cost?

😰 Without tracking:

  • “Our AI bill was $500 this month”
  • “Some queries are expensive but we don’t know which”
  • No budget enforcement
  • Runaway agents go undetected

✅ With tracking:

  • “Query X cost $2.30 (15 steps)”
  • “Average cost: $0.12/query”
  • Budget alerts per query
  • Loop detection saves money

Cost Tracking with LiteLLM

LiteLLM provides built-in cost calculation:

from litellm import completion, completion_cost

response = completion(
    model="gpt-4o", messages=messages
)

cost = completion_cost(
    completion_response=response
)
print(f"This step cost: ${cost:.4f}")

Integrate with the tracer:

tracer.log_cost(
    trace_id,
    step_number=step,
    cost_usd=cost
)

# Total Cost: $0.0178
# (sum of all steps)

Setting Budget Limits

class CostTracker:
    def __init__(self, budget_limit_usd: float = 1.0):
        self.budget_limit = budget_limit_usd
        self.total_spent = 0.0

    def add_cost(self, cost: float) -> bool:
        self.total_spent += cost
        if self.total_spent > self.budget_limit:
            raise TokenBudgetExceeded(
                f"Query cost ${self.total_spent:.2f} "
                f"exceeds budget of ${self.budget_limit:.2f}"
            )
        return True
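
For example, a per-query tracker can charge each step's cost as soon as the LLM responds (a sketch using the litellm helpers shown above):

tracker = CostTracker(budget_limit_usd=2.0)

for step in range(max_steps):
    response = completion(model="gpt-4o", messages=messages)
    # Raises TokenBudgetExceeded as soon as the running total passes the limit
    tracker.add_cost(completion_cost(completion_response=response))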

Production Tip: Set per-query budgets ($1–5) AND daily budgets ($50–500). A single runaway agent should never bankrupt your project.

Budget Guardrails Pattern

Feed the per-call costs from litellm’s completion_cost() into a guard that enforces hard spend limits:

from datetime import date
from threading import Lock

from src.exceptions import TokenBudgetExceeded

class BudgetGuard:
    """Enforce daily and monthly API spend limits."""

    def __init__(self, daily_limit_usd: float, monthly_limit_usd: float):
        self.daily_limit = daily_limit_usd
        self.monthly_limit = monthly_limit_usd
        self._daily_spend: dict[str, float] = {}
        self._monthly_spend: dict[str, float] = {}
        self._lock = Lock()

    def check_and_charge(self, cost_usd: float) -> bool:
        """Returns True if spend is allowed, raises TokenBudgetExceeded otherwise."""
        with self._lock:
            day_key = date.today().isoformat()    # e.g. "2026-04-26"
            month_key = day_key[:7]               # e.g. "2026-04"
            day_spend = self._daily_spend.get(day_key, 0.0)
            month_spend = self._monthly_spend.get(month_key, 0.0)

            if day_spend + cost_usd > self.daily_limit:
                raise TokenBudgetExceeded(f"Daily budget ${self.daily_limit} reached")
            if month_spend + cost_usd > self.monthly_limit:
                raise TokenBudgetExceeded(f"Monthly budget ${self.monthly_limit} reached")

            self._daily_spend[day_key] = day_spend + cost_usd
            self._monthly_spend[month_key] = month_spend + cost_usd
            return True

Cost Anomaly Detection

# Simple z-score anomaly detector for hourly cost
import statistics

class CostAnomalyDetector:
    def __init__(self, window: int = 24, z_threshold: float = 2.5):
        self.history: list[float] = []
        self.window = window
        self.z_threshold = z_threshold

    def check(self, hourly_cost: float) -> bool:
        if len(self.history) < self.window:
            self.history.append(hourly_cost)
            return False

        mean = statistics.mean(self.history[-self.window:])
        stdev = statistics.stdev(self.history[-self.window:])
        z_score = (hourly_cost - mean) / (stdev or 1)

        self.history.append(hourly_cost)
        if abs(z_score) > self.z_threshold:
            print(f"Anomaly detected! Z-Score: {z_score:.2f}")
            return True
        return False

E. Production Logging & Metrics

Going from traces to aggregations.

Prometheus Metrics — Design Table

Metric Name Type Labels Measures
agent_requests_total Counter model, cache_hit, status Every completed request
agent_request_latency_ms Histogram Latency distribution
agent_tokens_total Counter model, token_type Input/output tokens
agent_cost_usd_total Counter model Cumulative spend

API reminder: Counter → only up (inc). Gauge → up/down (inc/dec/set). Histogram → observe(value).
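
A minimal sketch of how the four metrics in the table could be declared and updated with prometheus_client (label values and bucket boundaries here are illustrative):

from prometheus_client import Counter, Histogram

AGENT_REQUESTS = Counter(
    "agent_requests_total", "Completed agent requests",
    ["model", "cache_hit", "status"],
)
AGENT_LATENCY = Histogram(
    "agent_request_latency_ms", "Agent request latency in milliseconds",
    buckets=(100, 250, 500, 1000, 2500, 5000, 10000),
)
AGENT_TOKENS = Counter(
    "agent_tokens_total", "Tokens consumed", ["model", "token_type"],
)
AGENT_COST = Counter(
    "agent_cost_usd_total", "Cumulative spend in USD", ["model"],
)

# Recorded after each completed request:
AGENT_REQUESTS.labels(model="gpt-4o", cache_hit="false", status="completed").inc()
AGENT_LATENCY.observe(2630)
AGENT_TOKENS.labels(model="gpt-4o", token_type="input").inc(2847)
AGENT_COST.labels(model="gpt-4o").inc(0.0178)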

Exposing the /metrics Endpoint

# FastAPI integration
from prometheus_client import make_asgi_app
from fastapi import FastAPI

app = FastAPI()

# Mount Prometheus metrics at /metrics
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)

# Now Prometheus can scrape:
# GET http://api:8000/metrics
# → # HELP agent_requests_total ...
# → agent_requests_total{model="gpt-4o",cache_hit="false"} 1423

F. DeepEval for Agents

Production-grade evaluation framework

The Evaluation Problem

How do you know if your agent is any good?

  • Manual review doesn’t scale
  • String matching can’t evaluate free-text
  • Unit tests check format, not quality
  • Custom LLM judges are hard to calibrate

✅ The Solution: DeepEval

An open-source framework with 50+ research-backed metrics for LLM and agent evaluation.

Integrates with pytest for CI/CD workflows.

Why DeepEval for Agents?

❌ Traditional Metrics Fail

Metric Problem
BLEU/ROUGE Word overlap, not meaning
Accuracy Binary, no nuance
Precision/Recall Not for open-ended output

✅ DeepEval Approach

  • LLM-as-Judge with calibrated prompts
  • Agentic metrics: task completion, tool correctness
  • Component-level evaluation via tracing

DeepEval Agent Metrics

Metric What It Measures Mode
TaskCompletionMetric Did the agent accomplish the task? Standalone
ToolCorrectnessMetric Were the right tools called? Standalone
StepEfficiencyMetric Was the execution path optimal? Traced
PlanQualityMetric Was the agent’s plan logical? Traced
PlanAdherenceMetric Did the agent follow its plan? Traced

Standalone = evaluate a fixed dataset. Traced = requires @observe to inspect internal reasoning.

Traces & Spans: The Structure

One user query = one Trace. Each step inside it = a Span.

graph LR
    T["📦 Trace (one user request)"] --> A["🤖 Agent Span (root)"]
    A --> B["💬 LLM Span — Step 1"]
    A --> C["🔧 Tool Span — search()"]
    A --> D["💬 LLM Span — Step 2"]
    D --> E["🔧 Tool Span — search()"]
    D --> F["💬 LLM Span — Final Answer"]
    style T fill:#1C355E,stroke:#00C9A7,color:white
    style A fill:#9B8EC0,stroke:#1C355E,color:white

Traces & Spans: The Mental Model

📦 Trace = the whole journey

A trace is the complete record of one agent run — from the moment the user sends a query to the moment the agent returns an answer.

Think of it as the receipts for an entire shopping trip.

  • Every trace has a unique ID
  • One trace = one user request
  • A trace can contain dozens of spans

🔬 Span = one step of the journey

A span is the record of a single unit of work inside that trace — one LLM call, one tool execution, one function.

Think of it as one item on the receipt.

  • Every span records: timing, inputs, outputs
  • Spans are nested — a parent span can contain children
  • The Agent span is the root; LLM and Tool spans are its children

Unified Tracing with @observe

@observe is the industry standard pattern for automatic telemetry. Use the exact same code for monitoring OR evaluation:

# The implementation is identical
from deepeval.tracing import observe
# OR: from langfuse import observe

@observe(name="Run Agent", type="agent") # type for DeepEval, as_type for Langfuse
async def run(self, query):
    # Logic...
    return answer

The Universal Pattern

Because they share the same API, you can switch from DeepEval (for CI/CD evals) to Langfuse (for Production monitoring) by just changing the import.
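
One way that switch might look in practice (the environment-variable toggle is an assumption; note the decorator keyword differs, type= for DeepEval vs as_type= for Langfuse):

import os

if os.getenv("EVAL_MODE") == "1":
    from deepeval.tracing import observe   # CI/CD evaluation runs (uses type=...)
else:
    from langfuse import observe           # production monitoring (uses as_type=...)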

Tracing Agent Components

The pattern scales across your entire agent class — every decorated method becomes a nested span in the trace:

from deepeval.tracing import observe

class ReactAgent:
    @observe(type="agent")      # ← Root span
    def run(self, query): ...

    @observe(type="llm")        # ← LLM call
    def call_llm(self, messages): ...

    @observe(type="retriever")  # ← Search/RAG tool
    def search_docs(self, query): ...

The trace captures the full execution graph — ensuring you can audit every decision, model choice, and tool result.

graph LR
    A["🤖 Agent Span (root)"] --> B["💬 LLM Span"]
    A --> C["🔧 Tool Span"]
    B --> D["💬 LLM Span"]
    D --> E["🔧 Tool Span"]
    style A fill:#9B8EC0,stroke:#1C355E,color:white

Evaluation with Test Cases

from deepeval import evaluate
from deepeval.test_case import LLMTestCase, ToolCall
from deepeval.metrics import TaskCompletionMetric, ToolCorrectnessMetric

test_case = LLMTestCase(
    input='What is the capital of France?', # The prompt/query being tested
    actual_output='Paris is the capital of France.', # The actual output from your agent
    expected_output='Paris', # The ideal "ground truth" answer
    # Tools actually called by the agent
    tools_called=[ToolCall(name='search', input_parameters={'query': 'capital of France'})],
    # Tools the agent WAS EXPECTED to call
    expected_tools=[ToolCall(name='search', input_parameters={'query': 'capital of France'})]
)

evaluate(
    test_cases=[test_case],
    metrics=[TaskCompletionMetric(), ToolCorrectnessMetric()]
)

Evaluation Results Dashboard

======================================================================
DEEPEVAL EVALUATION RESULTS (5 test cases)
======================================================================
Test Case                      Task Comp  Tool Correct  Step Eff  Overall
---------------------------------------------------------------------
What is the capital of France?    PASS        PASS         -      PASS
Compare Python and JavaScript     PASS        PASS       0.85     PASS
Latest AI research trends         FAIL        PASS       0.62     FAIL
Population of Tokyo metro         PASS        PASS       0.91     PASS
Explain quantum computing         PASS        FAIL       0.78     FAIL
---------------------------------------------------------------------
PASS RATE: 60% (3/5)
Average Step Efficiency: 0.79
======================================================================

Tip

Evaluation turns “it feels better” into “task completion improved from 72% to 89% after fixing the plan prompt.”

CI/CD Integration

# test_agent.py - runs in your CI pipeline
import pytest
from deepeval import assert_test
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase

def test_agent_factual_queries():
    test_case = LLMTestCase(
        input="What is the capital of France?",
        actual_output=agent.run("What is..."),
        expected_output="Paris"
    )
    assert_test(test_case,
                [TaskCompletionMetric(threshold=0.8)])

Best Practice

Maintain a living eval dataset (20–50 examples).

Run evals on every PR.

Gate merges on pass rate threshold.
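
A possible shape for that dataset-driven gate, assuming a test_cases.json of input/expected_output records and a run_agent callable (both names are hypothetical here):

import json

import pytest
from deepeval import assert_test
from deepeval.metrics import TaskCompletionMetric
from deepeval.test_case import LLMTestCase

def load_cases(path: str = "test_cases.json") -> list[LLMTestCase]:
    """Build test cases from the living eval dataset."""
    with open(path) as f:
        rows = json.load(f)
    return [
        LLMTestCase(
            input=row["input"],
            actual_output=run_agent(row["input"]),   # run_agent = your agent under test
            expected_output=row["expected_output"],
        )
        for row in rows
    ]

@pytest.mark.parametrize("case", load_cases())
def test_eval_dataset(case):
    # CI fails (and the merge is blocked) if any case scores below the threshold
    assert_test(case, [TaskCompletionMetric(threshold=0.8)])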

Embedding-Based Metrics

When LLM-as-Judge is too expensive or too slow for your pipeline:

Metric What It Measures Cost
RetrievalRelevanceMetric Query ↔︎ retrieved chunks $0
ContextCoverageMetric Coverage of expected context $0
AnswerSimilarityMetric Expected ↔︎ actual answer $0

Scale

Scores range 0.0 → 1.0.

Thresholds like 0.25 for retrieval relevance are typical — even moderate semantic overlap is useful signal. (More on this in Module 4.)

from evaluation.embedding_metrics import get_embedding_metrics

relevance, coverage, answer = get_embedding_metrics(
    relevance_threshold=0.25, coverage_threshold=0.4
)
result = relevance.evaluate(query_embedding, retrieved_embeddings)
# Offline — no API calls, instant feedback
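
Under the hood, metrics like these reduce to cosine similarity between precomputed embeddings. Roughly (a numpy-based sketch; the lab's evaluation.embedding_metrics may be implemented differently):

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def retrieval_relevance(query_emb, chunk_embs, threshold: float = 0.25) -> bool:
    """Average query-to-chunk similarity compared against the threshold."""
    scores = [cosine_similarity(query_emb, c) for c in chunk_embs]
    return sum(scores) / len(scores) >= threshold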

DeepEval vs Custom Evaluation

🛠️ Custom LLM-as-Judge

  • Quality depends on your prompt
  • You own and maintain the prompts
  • Must build agentic metrics from scratch
  • Manual CI/CD scripts

✅ DeepEval

  • Research-backed, calibrated metrics
  • Community + Confident AI maintenance
  • 6+ agent-specific metrics built-in
  • @observe decorator + pytest native

G. Wrap-up

Key Takeaways

🔍 Observability

  1. Tracing is non-negotiable — every step, tool call, and cost
  2. Loop detection uses three strategies: exact, fuzzy, stagnation
  3. Circuit breakers inject warnings into the conversation
  4. Cost tracking prevents runaway spending

📊 Evaluation

  1. DeepEval provides production-grade metrics
  2. Standalone vs Traced modes for different use cases
  3. Test case datasets (JSON) scale your eval effort
  4. CI/CD integration gates merges on quality thresholds

Lab Preview: Observability & Evaluation

Step 1: Instrumentation 🔬

  • Inject AgentTracer into ReactAgent
  • Run a query that triggers a loop

Step 2: Diagnosis 🔎

  • Read the trace JSON/logs
  • Identify the repeating tool calls

Step 3: The Fix 🔧

  • Implement a circuit breaker
  • Add loop detection to the agent loop

Step 4: Verify with DeepEval

  • Run run_eval.py CLI
  • Load test cases from test_cases.json
  • Generate evaluation report

⏱️ Time: 75 minutes

Questions?

Session 1 Complete