Agent Reliability, Evaluation & Operations

Agentic RAG with self-correction

Production RAG systems often suffer from "contextual blindness": the model retrieves irrelevant data but attempts to answer anyway. Agentic RAG adds a self-correction (Maker-Checker) loop that validates retrieval quality before generation.

Self-Correcting RAG Loop

Grounding Verification

In this pattern, the Checker agent doesn't just look for "an answer." It evaluates the candidate context against a specific rubric: "Does this context contain the facts required to satisfy the user query?" If not, it instructs the Maker to try a different search strategy, automating the retry that a human would otherwise perform by hand.
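The Maker-Checker loop described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `Verdict`, `self_correcting_retrieve`, the two search functions, and `toy_checker` are hypothetical names, and the keyword test stands in for what would normally be a model call applying the rubric.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    grounded: bool      # True if the context contains the facts the query needs
    reason: str = ""    # why the Checker rejected the context, if it did


def self_correcting_retrieve(
    query: str,
    strategies: List[Callable[[str], List[str]]],
    checker: Callable[[str, List[str]], Verdict],
) -> Tuple[Optional[List[str]], List[str]]:
    """Try each search strategy until the Checker accepts the candidate context.

    Returns (accepted_context, rejection_reasons). accepted_context is None
    when every strategy was rejected, so the caller can decline or escalate
    instead of generating from irrelevant data.
    """
    rejections: List[str] = []
    for retrieve in strategies:
        context = retrieve(query)            # Maker: produce candidate context
        verdict = checker(query, context)    # Checker: grade it against the rubric
        if verdict.grounded:
            return context, rejections
        rejections.append(verdict.reason)
    return None, rejections


# Toy stand-ins (hypothetical), for illustration only.
def keyword_search(query: str) -> List[str]:
    return ["The office is closed on public holidays."]


def semantic_search(query: str) -> List[str]:
    return ["Refunds are issued within 14 days of a return."]


def toy_checker(query: str, context: List[str]) -> Verdict:
    # A real Checker would be a model call applying the rubric;
    # a keyword test stands in for that judgment here.
    ok = any("refund" in passage.lower() for passage in context)
    return Verdict(ok, "" if ok else "context has no refund facts")


context, rejections = self_correcting_retrieve(
    "How long do refunds take?", [keyword_search, semantic_search], toy_checker
)
```

Returning `None` when every strategy fails keeps the "answer anyway" failure mode out of the generation step: the caller has to decide explicitly what to do with an ungrounded query.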

Evaluating agents is fundamentally different from evaluating deterministic software

Traditional software testing assumes deterministic behavior: the same input always produces the same output. Agent systems violate that assumption. Quality evaluation for agents requires statistical approaches, diverse test datasets, and continuous measurement—not single-pass unit tests.
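As a rough illustration of what "statistical approaches" can mean in practice, the sketch below runs the same prompt many times and reports a pass rate with a Wilson score interval instead of a single pass/fail. The agent and grader are hypothetical stand-ins (`flaky_agent` simply answers wrongly about 10% of the time), and the trial count is an arbitrary assumption.

```python
import math
import random
from typing import Callable, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def measure_pass_rate(
    agent: Callable[[str], str],
    grader: Callable[[str], bool],
    prompt: str,
    trials: int = 50,
) -> Tuple[float, Tuple[float, float]]:
    """Run the same prompt repeatedly and summarize how often the output passes."""
    successes = sum(1 for _ in range(trials) if grader(agent(prompt)))
    return successes / trials, wilson_interval(successes, trials)


# Hypothetical non-deterministic agent: wrong roughly one time in ten.
rng = random.Random(0)


def flaky_agent(prompt: str) -> str:
    return "4" if rng.random() < 0.9 else "5"


rate, (low, high) = measure_pass_rate(
    flaky_agent, lambda out: out == "4", "What is 2 + 2?"
)
```

A regression check can then compare intervals between two prompt or model versions rather than treating one failed run as a failed test.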

Agent evaluation approaches and when to use them

Approach	When to use	Strengths	Limitations
Offline benchmarks	Early development, model selection, regression testing	Fast, reproducible, good for catching major regressions	May not reflect real-world usage patterns, can be gamed
Online A/B testing	Production deployment, comparing model or prompt versions	Measures real-world performance, captures actual user impact	Requires significant traffic, slow to converge, ethical concerns for some domains
Human evaluation	Complex tasks, safety-critical decisions, quality assessment	Captures nuance and context that automated metrics miss	Expensive, slow, subjective, does not scale well
Automated regression	Continuous integration, prompt changes, model updates	Fast, repeatable, integrates into CI/CD pipelines	Requires maintaining evaluation datasets, may miss edge cases
Adversarial testing	Security validation, jailbreak resistance, safety testing	Finds vulnerabilities and failure modes that normal tests miss	Cannot cover all possible attacks, requires adversarial expertise

Effective agent evaluation requires a layered approach: automated regression tests for fast feedback during development, human evaluation for quality assurance, and adversarial testing for security validation. The evaluation dataset itself must evolve as the agent encounters new edge cases in production.

Agent Development Lifecycle (ADLC) — Testing and Evaluation

Salesforce Architect guide covering the ADLC Testing & Validation phase, including evaluation dataset management, regression suites, adversarial testing, and outer-loop continuous tuning for non-deterministic systems.

Production reliability requires accepting and bounding non-determinism

You cannot eliminate non-determinism in agent systems—the same input will sometimes produce different outputs. Production reliability comes from bounding the blast radius of bad outputs, detecting low-confidence decisions, and having clear escalation paths when the agent is uncertain.

Strategies for handling non-deterministic agent behavior

Strategy	How it works	Best for
Temperature control	Lower temperature reduces randomness, higher temperature increases creativity but also variance	Balancing consistency vs. creativity based on task requirements
Deterministic tool routing	Use rule-based or semantic routing to choose tools rather than letting the model decide	Reducing variance in tool selection for common, repetitive tasks
Validation layers	Post-processing checks that validate outputs against schemas, rules, or business logic before use	Catching hallucinations, format errors, and policy violations before they affect downstream systems
Confidence thresholds	Require the agent to estimate confidence and reject or escalate low-confidence results	Filtering uncertain decisions and surfacing cases that need human review
Retry with consensus	Run the same request multiple times and use voting or aggregation to produce a more stable result	Improving consistency for critical decisions where latency is acceptable
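The "retry with consensus" row can be illustrated with a simple majority vote. This is a sketch under assumptions: `consensus` is a hypothetical helper, the 60% agreement threshold is arbitrary, and the canned iterators stand in for repeated calls to a real agent.

```python
from collections import Counter
from typing import Callable, Hashable, Optional


def consensus(
    call: Callable[[], Hashable],
    runs: int = 5,
    min_agreement: float = 0.6,
) -> Optional[Hashable]:
    """Call the agent several times and return the majority answer.

    Returns None when no answer reaches the agreement threshold, which the
    caller should treat as a low-confidence result.
    """
    answers = [call() for _ in range(runs)]
    winner, votes = Counter(answers).most_common(1)[0]
    return winner if votes / runs >= min_agreement else None


# Canned responses standing in for five calls to a non-deterministic agent.
stable = iter(["approve", "approve", "deny", "approve", "approve"])
split = iter(["approve", "deny", "review", "deny", "approve"])

stable_result = consensus(lambda: next(stable))  # 4 of 5 agree
split_result = consensus(lambda: next(split))    # best answer has only 2 of 5
```

The cost is linear in the number of runs, which is why the table reserves this strategy for critical decisions where extra latency is acceptable.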

Confidence-based decision flow for agent outputs

Accept some non-determinism while bounding the blast radius

Perfection is not the goal—consistency within acceptable bounds is. Focus on detecting and handling the cases where non-determinism produces bad outcomes, rather than trying to eliminate variation entirely. Confidence thresholds, validation layers, and clear escalation patterns are more practical than attempting fully deterministic behavior.
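One way to combine a validation layer, confidence thresholds, and escalation is sketched below. The thresholds (0.85 and 0.5), the `AgentOutput` shape, and the `route` function are assumptions chosen for illustration; real values would be tuned against labeled outcomes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List


class Decision(Enum):
    ACCEPT = "accept"
    ESCALATE = "escalate"   # hand off to human review
    REJECT = "reject"


@dataclass
class AgentOutput:
    payload: Dict[str, object]
    confidence: float       # agent's self-estimated confidence in [0, 1]


def route(
    output: AgentOutput,
    required_fields: List[str],
    accept_at: float = 0.85,
    escalate_at: float = 0.5,
) -> Decision:
    """Validate first, then route on confidence."""
    # Validation layer: a malformed output is rejected no matter how
    # confident the agent claims to be.
    if any(name not in output.payload for name in required_fields):
        return Decision.REJECT
    if output.confidence >= accept_at:
        return Decision.ACCEPT
    if output.confidence >= escalate_at:
        return Decision.ESCALATE
    return Decision.REJECT


fields = ["amount", "currency"]
confident = route(AgentOutput({"amount": 10, "currency": "EUR"}, 0.93), fields)
unsure = route(AgentOutput({"amount": 10, "currency": "EUR"}, 0.60), fields)
malformed = route(AgentOutput({"amount": 10}, 0.99), fields)
```

Running validation before the confidence check means a confident but malformed output never reaches downstream systems.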

Observability, rollback, and cost management are production requirements, not afterthoughts

Running agent systems in production requires operational disciplines that go beyond prompt engineering. You need traces for debugging, metrics for performance, budgets for cost control, and rollback procedures for when agents misbehave. These must be designed before launch, not bolted on after incidents.

Operational concerns for production agent systems

Concern	Why it matters	Key practices
Agent tracing	Debugging multi-agent workflows requires full execution traces, not just final outputs	Log every tool call, intermediate output, and routing decision; correlate traces across agents
Token cost tracking	Multi-step workflows can have surprisingly high token costs that only appear in production	Track tokens per agent, per tool, and per user; set budgets and alerts
Latency monitoring	Agent response times vary based on model, tools, and workflow complexity	Measure end-to-end latency, break down by step, and track percentiles
Error classification	Agent errors come from models, tools, prompts, or data—root cause requires categorization	Classify errors by type, track error rates per category, and alert on anomalies
Rollback procedures	Agents may produce correct results today but fail after a model or prompt change	Version prompts and model configurations, maintain canary deployments, and have rollback triggers

Cost management deserves explicit attention because token usage scales with workflow complexity. Simple tasks should use cheap models, repeated queries should be served from a cache, and expensive models should be reserved for complex reasoning steps. Model routing, meaning choosing the right model for each subtask, can reduce costs substantially (60-80% in favorable workloads) with little loss of quality, though the actual saving depends on how much of the traffic the cheaper model can handle.
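A minimal sketch of model routing with a spending budget follows. The model names, per-token prices, and the two-level complexity label are all hypothetical; they are placeholders for whatever your provider charges and however your system classifies subtasks.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical prices in USD per 1,000 tokens; substitute real provider rates.
PRICE_PER_1K_TOKENS = {"small": 0.0005, "large": 0.01}


def choose_model(task_complexity: str) -> str:
    """Reserve the expensive model for complex reasoning steps."""
    return "large" if task_complexity == "complex" else "small"


@dataclass
class TokenBudget:
    limit_usd: float
    spent_usd: float = 0.0
    tokens_by_model: Dict[str, int] = field(default_factory=dict)

    def charge(self, model: str, tokens: int) -> float:
        """Record usage, refusing any call that would exceed the budget."""
        cost = tokens / 1000 * PRICE_PER_1K_TOKENS[model]
        if self.spent_usd + cost > self.limit_usd:
            raise RuntimeError(f"budget exceeded: {model} call would cost {cost:.4f} USD")
        self.spent_usd += cost
        self.tokens_by_model[model] = self.tokens_by_model.get(model, 0) + tokens
        return cost


budget = TokenBudget(limit_usd=0.02)
budget.charge(choose_model("simple"), 2000)    # routed to the small model
budget.charge(choose_model("complex"), 1000)   # routed to the large model
```

Tracking tokens per model (and, in a fuller version, per agent, tool, and user) is what makes the budget alerts in the table above possible.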

Observability is a pre-production requirement, not a post-launch add-on

Design tracing, metrics, and logging into your agent architecture from day one. Without observability, you cannot debug failures, measure performance, or prove compliance. Trying to add observability after a production incident is too late—you need the data before the problem occurs.

Cloud-native evaluation & state mapping

Concept / Tool	AWS	Azure	GCP
Statistical Evaluation	Bedrock Model Evaluation	Azure AI Studio Eval SDK	Vertex AI Model Evaluation
Task Ledger / State	Step Functions (Express)	Semantic Kernel Store	Reasoning Engine State
Durable Memory	Amazon OpenSearch Serverless	Azure AI Search (Vector)	Vertex AI Vector Search

Observability Primitives: Runs, Traces, and Threads

Agent observability differs fundamentally from traditional software observability. Where software observability tracks deterministic request/response cycles and known error states, agent observability must capture emergent behavior—reasoning chains, tool selections, and multi-turn context—that only materializes at runtime. LangChain's observability platform, LangSmith, defines three core primitives that map directly to evaluation granularity.

Observability Primitives Hierarchy

Observability Primitives

Primitive	What It Measures	Eval Granularity	Example
Run	Single execution step: one LLM call with its input, output, and metadata	Single-step evaluation	Did the agent select the correct tool and format arguments properly for this one call?
Trace	Complete agent execution showing all runs, their parent-child relationships, and the full execution tree	Full-turn evaluation	Did the agent complete the entire task correctly, including all tool calls, reasoning steps, and final output?
Thread	Multi-turn conversations grouping multiple traces over time, preserving context across interactions	Multi-turn evaluation	Did the agent maintain context, remember user preferences, and build on prior interactions across a conversation?

Each primitive captures behavior at a different granularity, and data volume grows with scope: traces for complex, long-running agents can reach hundreds of megabytes. The teams shipping reliable agents have embraced the shift from debugging code to debugging reasoning.
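The nesting of the three primitives can be pictured with a small data model. This is an illustrative sketch only, not LangSmith's actual schema: the class and field names are assumptions chosen to mirror the table above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List


@dataclass
class Run:
    """One execution step, such as an LLM call or a tool call."""
    name: str
    inputs: Dict[str, object]
    outputs: Dict[str, object]
    children: List["Run"] = field(default_factory=list)

    def walk(self) -> Iterator["Run"]:
        """Yield this run and all descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass
class Trace:
    """One complete agent execution: a tree of runs under a root."""
    root: Run

    def runs(self) -> List[Run]:
        return list(self.root.walk())


@dataclass
class Thread:
    """A multi-turn conversation: traces grouped over time."""
    thread_id: str
    traces: List[Trace] = field(default_factory=list)


llm_call = Run("llm", {"prompt": "Weather in Oslo?"}, {"tool": "get_weather"})
tool_call = Run("get_weather", {"city": "Oslo"}, {"temp_c": 4})
turn = Trace(Run("agent", {"input": "Weather in Oslo?"}, {"answer": "4 C"},
                 [llm_call, tool_call]))
thread = Thread("conversation-1", [turn])
```

A single-step evaluator would score `llm_call` alone, a full-turn evaluator would score `turn`, and a multi-turn evaluator would score `thread`.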

Agent observability powers agent evaluation

LangChain blog by Harrison Chase explaining how observability primitives—runs, traces, and threads—map directly to agent evaluation granularity.

Evaluation at Every Granularity

Evaluation granularity is determined by the observability primitive you are evaluating against. Each level answers a different question about your agent's reliability, requires different scoring approaches, and serves different phases of the development lifecycle.

Evaluation Granularity Taxonomy

Level	Observability Primitive	What It Validates	Scoring Difficulty	When to Use
Single-step	Run	Individual decision quality: tool selection correctness, argument formatting, output validity for one LLM call	Easiest to automate—binary or rubric-based scoring on isolated steps	During development and unit testing; validating tool call correctness before integrating into workflows
Full-turn	Trace	Complete task execution: end-to-end correctness, multi-step reasoning quality, and overall task success	Easiest to create inputs but hardest to score—requires LLM-as-judge or human review for complex tasks	Pre-deployment validation; regression testing after prompt or model changes; measuring overall agent quality
Multi-turn	Thread	Context maintenance across conversation: memory retention, preference tracking, and conversational coherence over time	Hardest to implement—requires conditional logic, state tracking, and long-running conversation simulations	Production monitoring and regression; validating that agents maintain context across user sessions without degradation

Choose evaluation granularity based on what you are testing. Single-step for tool call correctness. Full-turn for end-to-end task success. Multi-turn for conversation coherence over time. Teams typically start with single-step evals and add full-turn and multi-turn as their agent matures.
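Single-step evaluation is the easiest level to automate, as the sketch below suggests. The record shape (`tool` and `args` keys) and the helper names are assumptions for illustration; the point is that scoring one isolated tool call reduces to exact comparisons.

```python
from typing import Dict, List


def score_tool_call(actual: Dict, expected: Dict) -> Dict[str, bool]:
    """Binary scores for one run: right tool, and right arguments for that tool."""
    tool_ok = actual.get("tool") == expected["tool"]
    args_ok = tool_ok and actual.get("args") == expected["args"]
    return {"tool_correct": tool_ok, "args_correct": args_ok}


def accuracy(results: List[Dict[str, bool]], key: str) -> float:
    """Fraction of scored runs where the given check passed."""
    return sum(r[key] for r in results) / len(results)


expected = {"tool": "get_weather", "args": {"city": "Oslo"}}
observed = [
    {"tool": "get_weather", "args": {"city": "Oslo"}},     # fully correct
    {"tool": "get_weather", "args": {"city": "Bergen"}},   # right tool, wrong argument
    {"tool": "web_search", "args": {"query": "Oslo"}},     # wrong tool
]
results = [score_tool_call(run, expected) for run in observed]
tool_accuracy = accuracy(results, "tool_correct")
args_accuracy = accuracy(results, "args_correct")
```

Full-turn and multi-turn evaluation rarely reduce to equality checks like this, which is why they lean on LLM-as-judge or human review.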

From Production Traces to Evaluation Datasets

Production traces become evaluation datasets once you close the feedback loop. When a trace reveals a failure, it enters an annotation queue, is added to an evaluation dataset with ground truth, and then powers offline testing and regression suites. This merges two traditionally separate software concerns, tracing for debugging and testing for validation, into a unified, compounding pipeline.

Traces-to-Datasets Pipeline

Trace-to-Dataset Pipeline Steps

Step	Action	Outcome
Capture	Production traces are stored automatically for every agent execution, including all runs, tool calls, and intermediate reasoning	Complete execution history available for analysis without any additional instrumentation
Filter	Identify failures, edge cases, novel usage patterns, and high-value examples from the production trace stream	Curated subset of traces prioritized for human review, avoiding noise from routine successes
Annotate	Human reviewers add ratings, corrections, ground-truth labels, and structured feedback to selected traces	Ground truth data that captures expert judgment on what the agent should have done
Curate	Add annotated traces to golden evaluation datasets with expected outcomes, scoring rubrics, and reference answers	Growing, high-quality evaluation dataset that reflects real production scenarios
Evaluate	Run offline evaluations against curated datasets using automated graders, LLM-as-judge, and statistical scoring	Quantitative quality metrics that detect regressions before they reach production
Iterate	Fix issues uncovered by evaluation, update prompts or models, retest against the dataset, and redeploy	Continuous improvement loop where each production failure strengthens the evaluation suite

Production failures become regression tests. The improvement loop compounds because each cycle generates better data. Annotation is the bridge between production signals and better evaluations—without human-labeled ground truth, automated evaluation remains a noisy approximation.
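The Filter, Annotate, and Curate steps can be sketched with plain dictionaries. Everything here is hypothetical: the trace fields (`id`, `inputs`, `error`, `user_feedback`), the helper names, and the sample data are assumptions, and the human annotation is represented by a hand-written lookup table.

```python
from typing import Dict, List


def select_for_review(traces: List[Dict], limit: int = 50) -> List[Dict]:
    """Filter: keep failures and negatively rated traces, hard errors first."""
    flagged = [
        t for t in traces
        if t.get("error") or t.get("user_feedback") == "negative"
    ]
    flagged.sort(key=lambda t: 0 if t.get("error") else 1)  # stable sort
    return flagged[:limit]


def to_example(trace: Dict, reference_answer: str) -> Dict:
    """Curate: turn an annotated trace into a regression-test example."""
    return {
        "inputs": trace["inputs"],
        "reference": reference_answer,
        "source_trace_id": trace["id"],
    }


traces = [
    {"id": "t1", "inputs": {"q": "Reset my password"}, "error": None,
     "user_feedback": "positive"},
    {"id": "t2", "inputs": {"q": "Cancel my order"}, "error": None,
     "user_feedback": "negative"},
    {"id": "t3", "inputs": {"q": "Where is my refund?"}, "error": "ToolTimeout",
     "user_feedback": None},
]
queue = select_for_review(traces)

# Annotate: ground truth supplied by a human reviewer (only t3 reviewed so far).
annotations = {"t3": "Refund issued on the 3rd; expect it within 5 business days."}
dataset = [to_example(t, annotations[t["id"]]) for t in queue if t["id"] in annotations]
```

Keeping `source_trace_id` on each example preserves the link from a regression test back to the production incident that motivated it.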

The Agent Improvement Loop

In traditional software, the code documents the application. In AI systems, the traces do. The agent improvement loop formalizes this insight into a continuous cycle: observe traces at every stage, run evaluations against datasets, annotate failures, and feed learnings back into the build phase. Each cycle produces better data, sharper evaluations, and more reliable agents.

LangChain: The Agent Improvement Loop Starts with a Trace. Last verified: 2026-05-17

Agent Improvement Loop

Improvement Loop Phases

Phase	Purpose	Tools & Techniques	Exit Criteria
Build	Implement changes informed by trace evidence from prior cycles—prompt adjustments, tool refinements, model routing updates	Prompt engineering tools, model configuration, tool definitions, code changes	Changes pass local unit tests and basic smoke testing
Observe (Staging)	Capture pre-production traces to debug reasoning chains, tool selections, and multi-step behavior before live traffic	LangSmith tracing, structured logging, span-level inspection	Traces show correct reasoning paths; no unexpected tool selections or empty responses
Offline Evals	Run reproducible test cases from curated golden datasets using cheap, fast graders to catch regressions	Golden datasets, LLM-as-judge, rubric-based scoring, statistical comparison against baselines	All regression tests pass; quality metrics meet or exceed baseline thresholds
Deploy	Promote to production only if all quality gates pass, using canary deployments and gradual rollout	Canary deployments, feature flags, traffic splitting, automated rollback triggers	Canary metrics stable; no P0/P1 incidents within observation window
Observe (Production)	Every production run generates a trace; automated monitoring detects anomalies, cost spikes, and quality drift	Trace storage, automated monitoring, anomaly detection, cost dashboards	Traces flowing; monitoring dashboards show healthy baselines
Online Evals	Continuous evaluation on live traffic using LLM-as-judge graders to detect quality degradation in real time	LLM-as-judge scoring, automated quality sampling, statistical process control on quality metrics	Online quality metrics within acceptable bounds; no sustained downward trends
Annotations	Human reviewers add structured feedback on failures, calibrate LLM-as-judge graders against human judgment, and capture ground truth for dataset growth	Annotation queues, human review interfaces, inter-annotator agreement metrics, calibration datasets	Failed traces annotated; new examples added to golden datasets; grader calibration verified

The loop compounds: each cycle generates better data, sharper evaluations, and more reliable agents. An insights agent can automatically cluster production traces to surface usage patterns and failure modes, accelerating the annotation phase by prioritizing the most impactful traces for human review.
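The Annotations phase mentions calibrating LLM-as-judge graders against human judgment. One common agreement statistic for that is Cohen's kappa, sketched below in plain Python. The choice of kappa and the sample labels are assumptions for illustration; the source does not prescribe a specific metric.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Agreement between two raters, corrected for agreement expected by chance."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical calibration set: human verdicts vs. LLM-as-judge verdicts.
human = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
judge = ["pass", "pass", "fail", "pass", "pass", "fail", "pass", "fail"]
kappa = cohen_kappa(human, judge)
```

Here the raters agree on 6 of 8 items (75%), but kappa is only about 0.47 because much of that agreement would occur by chance. A team might require kappa above some agreed threshold before trusting the grader for online evals.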

The agent improvement loop starts with a trace

LangChain blog by Sam Crowder detailing how production traces feed the agent improvement loop, from observability through evaluation to annotation and back to building.

Simulation and test-bed agents give you a safer place to discover failures

AWS explicitly separates simulation and test-bed agents from production observers because they answer a different reliability question: “What happens if the agent explores this environment repeatedly before we trust it live?” That matters for reinforcement-style loops, CLI or browser sandboxes, workflow rehearsals, and multi-agent coordination experiments.

Figure: Official AWS simulation and test-bed agent diagram showing an agent operating in a controlled environment with feedback, learning, and replay. AWS simulation and test-bed agents are about structured rehearsal and policy refinement before deployment, not just offline benchmarking. Source: AWS Prescriptive Guidance.

AWS Prescriptive Guidance: Simulation and test-bed agents. Last verified: 2026-05-17

When simulation adds reliability value

Scenario	What simulation reveals	Why production tests are weaker
Long-running workflow rehearsal	State drift, dead ends, recovery behavior, and retry loops across many iterations	Live traffic rarely gives enough repetitions to isolate structural failure modes safely
CLI or browser automation	How agents handle UI changes, shell errors, or partial success conditions	Real systems carry side effects that make aggressive exploration unsafe
Swarm or multi-agent behavior	Emergent coordination failures, conflict over shared state, and unstable escalation patterns	Production incidents are expensive places to discover collective behavior problems
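A toy test bed shows why repetition matters. In the sketch below, a hypothetical sandbox requires three successful steps, each step fails transiently 30% of the time, and a retry policy decides whether to continue. The environment, probabilities, and policies are all invented for illustration; a real test bed would wrap a CLI, browser, or workflow sandbox.

```python
import random
from collections import Counter
from typing import Callable


def run_episode(
    policy: Callable[[int], str],
    rng: random.Random,
    steps_needed: int = 3,
    fail_prob: float = 0.3,
    max_steps: int = 20,
) -> str:
    """Run one sandboxed attempt and return how it ended."""
    completed = 0
    consecutive_failures = 0
    for _ in range(max_steps):
        if rng.random() < fail_prob:              # simulated transient fault
            consecutive_failures += 1
            if policy(consecutive_failures) == "abort":
                return "aborted"
        else:
            consecutive_failures = 0
            completed += 1
            if completed == steps_needed:
                return "success"
    return "step_limit"                           # stuck in a retry loop


def simulate(policy: Callable[[int], str], episodes: int = 1000, seed: int = 0) -> Counter:
    """Replay many seeded episodes and tally the outcomes."""
    rng = random.Random(seed)
    return Counter(run_episode(policy, rng) for _ in range(episodes))


# Two hypothetical retry policies, keyed on consecutive failures so far.
patient = simulate(lambda failures: "retry" if failures < 3 else "abort")
impatient = simulate(lambda failures: "abort")
```

Because the episodes are seeded, a failure found in simulation can be replayed exactly, which live traffic does not allow.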

Observer agents upgrade telemetry into agent-aware operations signals

The AWS observer and monitoring pattern closes an important reliability gap. Traditional dashboards show raw logs, metrics, and traces. Observer agents reason across those signals, classify anomalies, summarize trends, and escalate only when the telemetry pattern actually matters.

Figure: Official AWS observer and monitoring agent diagram showing telemetry ingestion, context parsing, reasoning, classification, and escalation. AWS observer agents turn noisy telemetry into interpreted operations signals and audit-ready summaries. Source: AWS Prescriptive Guidance.

AWS Prescriptive Guidance: Observer and monitoring agents. Last verified: 2026-05-17

Observer agents vs basic monitoring

Concern	Traditional monitoring	Observer agent behavior
Anomaly detection	Threshold and rule alerts	Interprets distributed signals, sequence changes, and contextual shifts before escalation
Compliance and audit	Raw event retention for later review	Produces structured summaries, policy classifications, and escalation records
Agent operations	Tracks latency and failures per step	Correlates tool misuse, policy drift, suspicious patterns, and out-of-band behavior across runs
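The difference between a per-event threshold and cross-run correlation can be shown with a deliberately simple, rule-based stand-in for an observer. A real observer agent would reason over telemetry with a model; the event shape, `correlate_errors`, and the three-run threshold below are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List


def correlate_errors(events: List[Dict], min_runs: int = 3) -> List[Dict]:
    """Group errors by (tool, error) and escalate only when the same
    signature appears in several distinct runs."""
    runs_by_signature = defaultdict(set)
    for event in events:
        if event["status"] == "error":
            runs_by_signature[(event["tool"], event["error"])].add(event["run_id"])
    findings = []
    for (tool, error), run_ids in sorted(runs_by_signature.items()):
        findings.append({
            "tool": tool,
            "error": error,
            "runs": sorted(run_ids),
            "escalate": len(run_ids) >= min_runs,
        })
    return findings


events = [
    {"run_id": "r1", "tool": "search", "status": "error", "error": "timeout"},
    {"run_id": "r1", "tool": "search", "status": "error", "error": "timeout"},  # retry in same run
    {"run_id": "r2", "tool": "crm", "status": "error", "error": "forbidden"},
    {"run_id": "r3", "tool": "crm", "status": "error", "error": "forbidden"},
    {"run_id": "r4", "tool": "crm", "status": "error", "error": "forbidden"},
    {"run_id": "r5", "tool": "search", "status": "ok", "error": None},
]
findings = correlate_errors(events)
```

A raw count alert set at two errors would fire on the search timeouts, which are one run retrying, while the correlated view escalates only the permission failure that recurs across three separate runs.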

Production Agent Runtime Infrastructure

Deploying long-horizon agents in production requires purpose-built infrastructure. A good harness gives your agent the right prompts, tools, and skills; a production runtime adds durable execution, memory, multi-tenancy, human-in-the-loop controls, and observability to keep the agent running reliably across crashes, deploys, and long-running tasks.

Production Agent Runtime Architecture

Core Capabilities of a Production Agent Runtime

Capability	Description	Why it is required in production
Durable Execution	Checkpoints state after every step and allows resumption from the exact point of interruption.	Agents run for minutes or hours. Without checkpointing, a crash, transient error, or deploy would erase all progress and force the agent to spend tokens redoing completed work.
Memory Management	Separates short-term memory (scoped to a single conversation thread) from long-term memory (across conversations via semantic stores).	Agents must remember preferences across sessions (long-term) while strictly isolating the context of the current task (short-term).
Human-in-the-loop (HITL)	Dynamic interruptions that freeze state, free compute resources, and wait indefinitely for a human to resume or approve.	High-stakes actions (sending emails, executing trades) require approval, and agents must pause efficiently without blocking threads.
Multi-tenancy & Auth	Isolates user data, enforces RBAC, and manages OAuth token refreshes for third-party tools.	An agent serving many users must never leak state across tenants and must securely act on behalf of users.
Middleware	Deterministic hooks running before/after LLM calls for redaction, rate limiting, and safety checks.	Policies like PII redaction or budget caps must run deterministically, not rely on the LLM's adherence to prompts.
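Durable execution, the first capability in the table, can be illustrated with a file-based checkpoint. This is a minimal sketch, not how any particular runtime implements it: `run_with_checkpoints`, the JSON file format, and the two toy steps are assumptions, and a production runtime would use a database-backed checkpointer.

```python
import json
import os
import tempfile
from typing import Callable, List


def run_with_checkpoints(steps: List[Callable[[list], str]], path: str) -> list:
    """Run steps in order, saving state after each so a restart resumes
    from the first unfinished step instead of from the beginning."""
    state = {"next_step": 0, "results": []}
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)                 # resume from the last checkpoint
    for index in range(state["next_step"], len(steps)):
        result = steps[index](state["results"])
        state["results"].append(result)
        state["next_step"] = index + 1
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, path)               # atomic swap: no half-written checkpoint
    return state["results"]


calls = {"fetch": 0, "summarize": 0}


def fetch(previous: list) -> str:
    calls["fetch"] += 1
    return "fetched"


def summarize(previous: list) -> str:
    calls["summarize"] += 1
    if calls["summarize"] == 1:
        raise RuntimeError("simulated crash")    # fails on the first attempt only
    return previous[-1] + "+summarized"


checkpoint = os.path.join(tempfile.mkdtemp(), "task.json")
try:
    run_with_checkpoints([fetch, summarize], checkpoint)
except RuntimeError:
    pass  # the first attempt dies after step one was checkpointed
results = run_with_checkpoints([fetch, summarize], checkpoint)
```

On the second call, `fetch` is not re-executed: its result is loaded from the checkpoint, so the tokens it would have cost are not spent twice.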

The Runtime Behind Production Deep Agents

LangChain guide detailing the infrastructure requirements for long-horizon agents: durable execution, streaming, memory, HITL, and observability.


