Agentic RAG with self-correction
Production RAG systems often suffer from "Contextual Blindness": the retriever returns irrelevant data, but the model attempts to answer anyway. Agentic RAG adds a Self-Correction (Maker-Checker) loop that validates retrieval quality before generation.
[Diagram: Self-Correcting RAG Loop]
Grounding Verification
In this pattern, the Checker agent doesn't just look for "an answer." It evaluates the Candidate Context against a specific rubric: "Does this context contain facts required to satisfy the User Query?" If not, it instructs the Maker to try a different search strategy, effectively automating the "retry" logic humans do manually.
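The loop described above can be sketched as follows. This is a minimal sketch rather than a reference implementation: `self_correcting_retrieve`, the two search functions, and `toy_checker` are hypothetical names, and the keyword-based checker merely stands in for an LLM that grades the candidate context against the rubric.

```python
from typing import Callable, List, Optional

SearchStrategy = Callable[[str], List[str]]
Checker = Callable[[str, List[str]], bool]


def self_correcting_retrieve(
    query: str,
    strategies: List[SearchStrategy],
    checker: Checker,
) -> Optional[List[str]]:
    """Try each search strategy in order until the checker accepts the context."""
    for retrieve in strategies:
        context = retrieve(query)    # Maker: produce a candidate context
        if checker(query, context):  # Checker: does it contain the needed facts?
            return context
    return None  # nothing grounded was found; the caller should abstain


# Toy stand-ins so the sketch runs without a vector store or an LLM.
def keyword_search(query: str) -> List[str]:
    return ["The office closes at 5pm."]


def expanded_search(query: str) -> List[str]:
    return ["Refunds are accepted within 30 days of purchase."]


def toy_checker(query: str, context: List[str]) -> bool:
    # Assumption: a real checker would apply the rubric with an LLM call.
    return any("refund" in passage.lower() for passage in context)


result = self_correcting_retrieve(
    "What is the refund window?", [keyword_search, expanded_search], toy_checker
)
print(result)
```

Returning `None` when every strategy fails gives the caller an explicit signal to abstain instead of generating from ungrounded context.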
Evaluating agents is fundamentally different from evaluating deterministic software
Traditional software testing assumes deterministic behavior: the same input always produces the same output. Agent systems violate that assumption. Quality evaluation for agents requires statistical approaches, diverse test datasets, and continuous measurement—not single-pass unit tests.
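One way to make "statistical approaches" concrete is to run the same test case many times and report a pass rate with a confidence interval instead of a single pass/fail. The sketch below uses the Wilson score interval; `pass_rate`, the flaky toy agent, and the grader are hypothetical stand-ins for a real agent and a real scoring function.

```python
import math
import random
from typing import Callable, Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    margin = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, center - margin), min(1.0, center + margin)


def pass_rate(
    agent: Callable[[str], str],
    grader: Callable[[str], bool],
    prompt: str,
    runs: int = 50,
) -> Tuple[float, float, float]:
    """Run the agent repeatedly on one prompt; return (rate, lower, upper)."""
    successes = sum(1 for _ in range(runs) if grader(agent(prompt)))
    low, high = wilson_interval(successes, runs)
    return successes / runs, low, high


# Toy agent that fails about 10% of the time (seeded for repeatability).
rng = random.Random(0)
flaky_agent = lambda prompt: "ok" if rng.random() < 0.9 else "error"

rate, low, high = pass_rate(flaky_agent, lambda out: out == "ok", "ping")
print(f"pass rate {rate:.2f}, 95% interval [{low:.2f}, {high:.2f}]")
```

Comparing intervals across prompt or model versions avoids declaring a regression on the basis of a single unlucky run.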
Agent evaluation approaches and when to use them
| Approach | When to use | Strengths | Limitations |
|---|---|---|---|
| Offline benchmarks | Early development, model selection, regression testing | Fast, reproducible, good for catching major regressions | May not reflect real-world usage patterns, can be gamed |
| Online A/B testing | Production deployment, comparing model or prompt versions | Measures real-world performance, captures actual user impact | Requires significant traffic, slow to converge, ethical concerns for some domains |
| Human evaluation | Complex tasks, safety-critical decisions, quality assessment | Captures nuance and context that automated metrics miss | Expensive, slow, subjective, does not scale well |
| Automated regression | Continuous integration, prompt changes, model updates | Fast, repeatable, integrates into CI/CD pipelines | Requires maintaining evaluation datasets, may miss edge cases |
| Adversarial testing | Security validation, jailbreak resistance, safety testing | Finds vulnerabilities and failure modes that normal tests miss | Cannot cover all possible attacks, requires adversarial expertise |
Effective agent evaluation requires a layered approach: automated regression tests for fast feedback during development, human evaluation for quality assurance, and adversarial testing for security validation. The evaluation dataset itself must evolve as the agent encounters new edge cases in production.
Agent Development Lifecycle (ADLC) — Testing and Evaluation
Salesforce Architect guide covering the ADLC Testing & Validation phase, including evaluation dataset management, regression suites, adversarial testing, and outer-loop continuous tuning for non-deterministic systems.
Read the ADLC Testing and Evaluation section
Production reliability requires accepting and bounding non-determinism
You cannot eliminate non-determinism in agent systems—the same input will sometimes produce different outputs. Production reliability comes from bounding the blast radius of bad outputs, detecting low-confidence decisions, and having clear escalation paths when the agent is uncertain.
Strategies for handling non-deterministic agent behavior
| Strategy | How it works | Best for |
|---|---|---|
| Temperature control | Lower temperature reduces randomness, higher temperature increases creativity but also variance | Balancing consistency vs. creativity based on task requirements |
| Deterministic tool routing | Use rule-based or semantic routing to choose tools rather than letting the model decide | Reducing variance in tool selection for common, repetitive tasks |
| Validation layers | Post-processing checks that validate outputs against schemas, rules, or business logic before use | Catching hallucinations, format errors, and policy violations before they affect downstream systems |
| Confidence thresholds | Require the agent to estimate confidence and reject or escalate low-confidence results | Filtering uncertain decisions and surfacing cases that need human review |
| Retry with consensus | Run the same request multiple times and use voting or aggregation to produce a more stable result | Improving consistency for critical decisions where latency is acceptable |
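The "retry with consensus" row can be illustrated with simple majority voting. This is a sketch under assumptions: `consensus` and `min_agreement` are hypothetical names, and answers are compared by exact string equality, which only works when outputs are normalized (labels, IDs, structured values).

```python
from collections import Counter
from typing import Callable, Optional, Tuple


def consensus(
    call: Callable[[], str],
    samples: int = 5,
    min_agreement: float = 0.6,
) -> Tuple[Optional[str], float]:
    """Call the agent several times and return the majority answer.

    Returns (None, agreement) when no answer reaches min_agreement,
    so the caller can escalate instead of guessing.
    """
    answers = [call() for _ in range(samples)]
    winner, votes = Counter(answers).most_common(1)[0]
    agreement = votes / samples
    if agreement < min_agreement:
        return None, agreement
    return winner, agreement


# Canned responses stand in for repeated calls to a non-deterministic agent.
responses = iter(["approve", "deny", "approve", "approve", "escalate"])
answer, agreement = consensus(lambda: next(responses))
print(answer, agreement)
```

The cost is `samples` times the latency and tokens of a single call, which is why the table reserves this strategy for critical decisions.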
[Diagram: Confidence-based decision flow for agent outputs]
Accept some non-determinism while bounding the blast radius
Perfection is not the goal—consistency within acceptable bounds is. Focus on detecting and handling the cases where non-determinism produces bad outcomes, rather than trying to eliminate variation entirely. Confidence thresholds, validation layers, and clear escalation patterns are more practical than attempting fully deterministic behavior.
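A validation layer combined with confidence thresholds might look like the sketch below. The threshold values, the `REQUIRED_FIELDS` schema, and the `route` function are illustrative assumptions, not values from the source; the confidence score is assumed to come from the agent itself or from a separate grader.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"
    ESCALATE = "escalate"  # hand off to a human reviewer
    REJECT = "reject"


@dataclass
class AgentOutput:
    payload: dict
    confidence: float  # assumed to lie in [0, 1]


REQUIRED_FIELDS = {"order_id", "action"}  # hypothetical output schema


def is_valid(output: AgentOutput) -> bool:
    """Validation layer: deterministic check that runs before any threshold."""
    return REQUIRED_FIELDS.issubset(output.payload)


def route(output: AgentOutput, accept_at: float = 0.85, escalate_at: float = 0.5) -> Decision:
    if not is_valid(output):
        return Decision.REJECT  # malformed output never reaches downstream systems
    if output.confidence >= accept_at:
        return Decision.ACCEPT
    if output.confidence >= escalate_at:
        return Decision.ESCALATE
    return Decision.REJECT


print(route(AgentOutput({"order_id": "A1", "action": "refund"}, 0.62)))
```

Running validation before the confidence check means a confident but malformed output is still rejected, which keeps the blast radius bounded even when the confidence estimate is wrong.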
Observability, rollback, and cost management are production requirements, not afterthoughts
Running agent systems in production requires operational disciplines that go beyond prompt engineering. You need traces for debugging, metrics for performance, budgets for cost control, and rollback procedures for when agents misbehave. These must be designed before launch, not bolted on after incidents.
Operational concerns for production agent systems
| Concern | Why it matters | Key practices |
|---|---|---|
| Agent tracing | Debugging multi-agent workflows requires full execution traces, not just final outputs | Log every tool call, intermediate output, and routing decision; correlate traces across agents |
| Token cost tracking | Multi-step workflows can have surprisingly high token costs that only appear in production | Track tokens per agent, per tool, and per user; set budgets and alerts |
| Latency monitoring | Agent response times vary based on model, tools, and workflow complexity | Measure end-to-end latency, break down by step, and track percentiles |
| Error classification | Agent errors come from models, tools, prompts, or data—root cause requires categorization | Classify errors by type, track error rates per category, and alert on anomalies |
| Rollback procedures | Agents may produce correct results today but fail after a model or prompt change | Version prompts and model configurations, maintain canary deployments, and have rollback triggers |
Cost management deserves explicit attention because token usage scales with workflow complexity. Simple tasks should use cheap models, repeated queries should use caching, and expensive models should be reserved for complex reasoning steps. Model routing, meaning choosing the right model for each subtask, can reduce costs substantially (by 60-80% in favorable workloads) with little loss of quality, though the actual savings depend on the task mix.
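Model routing and per-agent cost tracking can be sketched in a few lines. Everything here is an assumption for illustration: the model names, the prices in `PRICE_PER_1K`, and the two-way `pick_model` rule are made up and do not reflect any vendor's pricing.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens.
PRICE_PER_1K = {"small-model": 0.0005, "large-model": 0.01}


def pick_model(task_complexity: str) -> str:
    """Route only complex reasoning to the expensive model."""
    return "large-model" if task_complexity == "complex" else "small-model"


class CostTracker:
    """Accumulates spend per agent and flags budget overruns."""

    def __init__(self, budget_usd: float) -> None:
        self.budget_usd = budget_usd
        self.spend_by_agent = defaultdict(float)

    def record(self, agent: str, model: str, tokens: int) -> float:
        cost = tokens / 1000 * PRICE_PER_1K[model]
        self.spend_by_agent[agent] += cost
        return cost

    @property
    def total(self) -> float:
        return sum(self.spend_by_agent.values())

    def over_budget(self) -> bool:
        return self.total > self.budget_usd


tracker = CostTracker(budget_usd=0.05)
tracker.record("planner", pick_model("complex"), tokens=4000)
tracker.record("summarizer", pick_model("simple"), tokens=2000)
print(f"total spend: ${tracker.total:.4f}, over budget: {tracker.over_budget()}")
```

In a real system the `over_budget` check would feed an alert or a hard stop, and the tracker would also be keyed by tool and by user as the table above suggests.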
Observability is a pre-production requirement, not a post-launch add-on
Design tracing, metrics, and logging into your agent architecture from day one. Without observability, you cannot debug failures, measure performance, or prove compliance. Trying to add observability after a production incident is too late—you need the data before the problem occurs.
Cloud-native evaluation & state mapping
| Concept / Tool | AWS | Azure | GCP |
|---|---|---|---|
| Statistical Evaluation | Bedrock Model Evaluation | Azure AI Studio Eval SDK | Vertex AI Model Evaluation |
| Task Ledger / State | Step Functions (Express) | Semantic Kernel Store | Reasoning Engine State |
| Durable Memory | Amazon OpenSearch Serverless | Azure AI Search (Vector) | Vertex AI Vector Search |
Observability Primitives: Runs, Traces, and Threads
Agent observability differs fundamentally from traditional software observability. Where software observability tracks deterministic request/response cycles and known error states, agent observability must capture emergent behavior—reasoning chains, tool selections, and multi-turn context—that only materializes at runtime. LangChain's observability platform, LangSmith, defines three core primitives that map directly to evaluation granularity.
[Diagram: Observability Primitives Hierarchy]
Observability Primitives
| Primitive | What It Measures | Eval Granularity | Example |
|---|---|---|---|
| Run | Single execution step: one LLM call with its input, output, and metadata | Single-step evaluation | Did the agent select the correct tool and format arguments properly for this one call? |
| Trace | Complete agent execution showing all runs, their parent-child relationships, and the full execution tree | Full-turn evaluation | Did the agent complete the entire task correctly, including all tool calls, reasoning steps, and final output? |
| Thread | Multi-turn conversations grouping multiple traces over time, preserving context across interactions | Multi-turn evaluation | Did the agent maintain context, remember user preferences, and build on prior interactions across a conversation? |
Each primitive measures a different granularity, and the data volumes can be large: traces for complex, long-running agents can reach hundreds of megabytes. The teams shipping reliable agents have embraced the shift from debugging code to debugging reasoning.
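The containment relationship between the three primitives can be modeled with plain data classes. This is an illustrative data model only; the field names are assumptions and are not LangSmith's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Run:
    """One execution step, such as a single LLM call."""
    run_id: str
    name: str
    inputs: dict
    outputs: dict
    parent_id: Optional[str] = None  # None marks the root of the trace


@dataclass
class Trace:
    """All runs of one agent execution, linked into a tree by parent_id."""
    trace_id: str
    runs: List[Run] = field(default_factory=list)

    def root(self) -> Run:
        return next(r for r in self.runs if r.parent_id is None)

    def children_of(self, run_id: str) -> List[Run]:
        return [r for r in self.runs if r.parent_id == run_id]


@dataclass
class Thread:
    """A multi-turn conversation: an ordered list of traces."""
    thread_id: str
    traces: List[Trace] = field(default_factory=list)


trace = Trace("t1", [
    Run("r1", "agent", {"question": "Where is my order?"}, {"answer": "It shipped."}),
    Run("r2", "llm_call", {"prompt": "..."}, {"tool": "lookup_order"}, parent_id="r1"),
    Run("r3", "lookup_order", {"order_id": "A1"}, {"status": "shipped"}, parent_id="r1"),
])
thread = Thread("conversation-1", [trace])
print([r.name for r in trace.children_of(trace.root().run_id)])
```

A single-step evaluator would score one `Run`, a full-turn evaluator one `Trace`, and a multi-turn evaluator the whole `Thread`.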
Agent observability powers agent evaluation
LangChain blog by Harrison Chase explaining how observability primitives—runs, traces, and threads—map directly to agent evaluation granularity.
Read on the LangChain Blog
Evaluation at Every Granularity
Evaluation granularity is determined by the observability primitive you are evaluating against. Each level answers a different question about your agent's reliability, requires different scoring approaches, and serves different phases of the development lifecycle.
Evaluation Granularity Taxonomy
| Level | Observability Primitive | What It Validates | Scoring Difficulty | When to Use |
|---|---|---|---|---|
| Single-step | Run | Individual decision quality: tool selection correctness, argument formatting, output validity for one LLM call | Easiest to automate—binary or rubric-based scoring on isolated steps | During development and unit testing; validating tool call correctness before integrating into workflows |
| Full-turn | Trace | Complete task execution: end-to-end correctness, multi-step reasoning quality, and overall task success | Easiest to create inputs but hardest to score—requires LLM-as-judge or human review for complex tasks | Pre-deployment validation; regression testing after prompt or model changes; measuring overall agent quality |
| Multi-turn | Thread | Context maintenance across conversation: memory retention, preference tracking, and conversational coherence over time | Hardest to implement—requires conditional logic, state tracking, and long-running conversation simulations | Production monitoring and regression; validating that agents maintain context across user sessions without degradation |
Choose evaluation granularity based on what you are testing. Single-step for tool call correctness. Full-turn for end-to-end task success. Multi-turn for conversation coherence over time. Teams typically start with single-step evals and add full-turn and multi-turn as their agent matures.
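A single-step evaluation is usually the simplest to automate because it reduces to comparing one tool call against an expected one. The sketch below assumes a hypothetical record shape, `{"tool": ..., "args": {...}}`, for both the actual and the expected call.

```python
from typing import Dict, List, Tuple


def score_tool_call(actual: dict, expected: dict) -> Dict[str, bool]:
    """Binary scoring of one run: right tool, and right arguments for that tool."""
    tool_ok = actual.get("tool") == expected["tool"]
    actual_args = actual.get("args", {})
    # Arguments only count when the tool itself is correct.
    args_ok = tool_ok and all(
        actual_args.get(key) == value for key, value in expected["args"].items()
    )
    return {"tool_correct": tool_ok, "args_correct": args_ok}


def tool_accuracy(pairs: List[Tuple[dict, dict]]) -> float:
    """Fraction of (actual, expected) pairs where tool and arguments both match."""
    if not pairs:
        raise ValueError("need at least one example")
    scores = [score_tool_call(actual, expected) for actual, expected in pairs]
    return sum(s["tool_correct"] and s["args_correct"] for s in scores) / len(scores)


expected = {"tool": "lookup_order", "args": {"order_id": "A1"}}
cases = [
    ({"tool": "lookup_order", "args": {"order_id": "A1"}}, expected),  # correct
    ({"tool": "lookup_order", "args": {"order_id": "B2"}}, expected),  # wrong argument
    ({"tool": "cancel_order", "args": {"order_id": "A1"}}, expected),  # wrong tool
]
print(tool_accuracy(cases))
```

Full-turn and multi-turn evaluations generally cannot be reduced to equality checks like this, which is why the table points to LLM-as-judge or human review at those levels.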
From Production Traces to Evaluation Datasets
Production traces become evaluation datasets when you close the feedback loop. When a trace reveals a failure, it enters an annotation queue, is incorporated into evaluation datasets with ground truth, and powers offline testing and regression suites. This merges two traditionally separate software concerns, tracing for debugging and testing for validation, into a unified, compounding pipeline.
[Diagram: Traces-to-Datasets Pipeline]
Trace-to-Dataset Pipeline Steps
| Step | Action | Outcome |
|---|---|---|
| Capture | Production traces are stored automatically for every agent execution, including all runs, tool calls, and intermediate reasoning | Complete execution history available for analysis without any additional instrumentation |
| Filter | Identify failures, edge cases, novel usage patterns, and high-value examples from the production trace stream | Curated subset of traces prioritized for human review, avoiding noise from routine successes |
| Annotate | Human reviewers add ratings, corrections, ground-truth labels, and structured feedback to selected traces | Ground truth data that captures expert judgment on what the agent should have done |
| Curate | Add annotated traces to golden evaluation datasets with expected outcomes, scoring rubrics, and reference answers | Growing, high-quality evaluation dataset that reflects real production scenarios |
| Evaluate | Run offline evaluations against curated datasets using automated graders, LLM-as-judge, and statistical scoring | Quantitative quality metrics that detect regressions before they reach production |
| Iterate | Fix issues uncovered by evaluation, update prompts or models, retest against the dataset, and redeploy | Continuous improvement loop where each production failure strengthens the evaluation suite |
Production failures become regression tests. The improvement loop compounds because each cycle generates better data. Annotation is the bridge between production signals and better evaluations—without human-labeled ground truth, automated evaluation remains a noisy approximation.
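The filter, annotate, curate, and evaluate steps can be sketched end to end. This is a simplified illustration: the record types and function names are hypothetical, negative user feedback stands in for a real failure filter, a dictionary stands in for the annotation queue, and exact match stands in for an automated grader or LLM-as-judge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class TraceRecord:
    trace_id: str
    user_input: str
    agent_output: str
    user_feedback: Optional[int] = None  # assumption: -1 thumbs down, +1 thumbs up


@dataclass
class Example:
    input: str
    reference: str      # ground truth supplied by a human annotator
    source_trace: str


def filter_failures(traces: List[TraceRecord]) -> List[TraceRecord]:
    """Filter: keep traces that users flagged as bad."""
    return [t for t in traces if t.user_feedback is not None and t.user_feedback < 0]


def curate(flagged: List[TraceRecord], annotations: Dict[str, str]) -> List[Example]:
    """Curate: only traces with a human-provided reference enter the dataset."""
    return [
        Example(t.user_input, annotations[t.trace_id], t.trace_id)
        for t in flagged
        if t.trace_id in annotations
    ]


def evaluate(agent: Callable[[str], str], dataset: List[Example]) -> float:
    """Evaluate: fraction of examples where the agent matches the reference."""
    if not dataset:
        raise ValueError("dataset is empty")
    return sum(agent(ex.input).strip() == ex.reference for ex in dataset) / len(dataset)


traces = [
    TraceRecord("t1", "Refund window?", "14 days", user_feedback=-1),
    TraceRecord("t2", "Store hours?", "9 to 5", user_feedback=1),
    TraceRecord("t3", "Shipping cost?", "Free", user_feedback=-1),
]
annotations = {"t1": "30 days"}  # t3 has not been reviewed yet
dataset = curate(filter_failures(traces), annotations)
fixed_agent = lambda question: "30 days" if "Refund" in question else "unknown"
print(len(dataset), evaluate(fixed_agent, dataset))
```

Note that the unreviewed trace `t3` is filtered but never reaches the dataset, which mirrors the point that annotation is the bridge between production signals and evaluations.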
The Agent Improvement Loop
In traditional software, the code documents the application. In AI systems, the traces do. The agent improvement loop formalizes this insight into a continuous cycle: observe traces at every stage, run evaluations against datasets, annotate failures, and feed learnings back into the build phase. Each cycle produces better data, sharper evaluations, and more reliable agents.
[Diagram: Agent Improvement Loop]
Improvement Loop Phases
| Phase | Purpose | Tools & Techniques | Exit Criteria |
|---|---|---|---|
| Build | Implement changes informed by trace evidence from prior cycles—prompt adjustments, tool refinements, model routing updates | Prompt engineering tools, model configuration, tool definitions, code changes | Changes pass local unit tests and basic smoke testing |
| Observe (Staging) | Capture pre-production traces to debug reasoning chains, tool selections, and multi-step behavior before live traffic | LangSmith tracing, structured logging, span-level inspection | Traces show correct reasoning paths; no unexpected tool selections or empty responses |
| Offline Evals | Run reproducible test cases from curated golden datasets using cheap, fast graders to catch regressions | Golden datasets, LLM-as-judge, rubric-based scoring, statistical comparison against baselines | All regression tests pass; quality metrics meet or exceed baseline thresholds |
| Deploy | Promote to production only if all quality gates pass, using canary deployments and gradual rollout | Canary deployments, feature flags, traffic splitting, automated rollback triggers | Canary metrics stable; no P0/P1 incidents within observation window |
| Observe (Production) | Every production run generates a trace; automated monitoring detects anomalies, cost spikes, and quality drift | Trace storage, automated monitoring, anomaly detection, cost dashboards | Traces flowing; monitoring dashboards show healthy baselines |
| Online Evals | Continuous evaluation on live traffic using LLM-as-judge graders to detect quality degradation in real time | LLM-as-judge scoring, automated quality sampling, statistical process control on quality metrics | Online quality metrics within acceptable bounds; no sustained downward trends |
| Annotations | Human reviewers add structured feedback on failures, calibrate LLM-as-judge graders against human judgment, and capture ground truth for dataset growth | Annotation queues, human review interfaces, inter-annotator agreement metrics, calibration datasets | Failed traces annotated; new examples added to golden datasets; grader calibration verified |
The loop compounds: each cycle generates better data, sharper evaluations, and more reliable agents. An insights agent can automatically cluster production traces to surface usage patterns and failure modes, accelerating the annotation phase by prioritizing the most impactful traces for human review.
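The exit criterion for the Offline Evals phase, "quality metrics meet or exceed baseline thresholds", can be expressed as a small gate that blocks promotion. This is a sketch under assumptions: `quality_gate` and `max_drop` are hypothetical names, the metric names are made up, and every metric is assumed to be "higher is better".

```python
from typing import Dict, List


def quality_gate(
    baseline: Dict[str, float],
    candidate: Dict[str, float],
    max_drop: float = 0.02,
) -> List[str]:
    """Return a list of gate failures; an empty list means the candidate may deploy."""
    failures = []
    for metric, base_value in baseline.items():
        value = candidate.get(metric)
        if value is None:
            failures.append(f"{metric}: missing from candidate results")
        elif value < base_value - max_drop:
            failures.append(f"{metric}: {value:.2f} is below baseline {base_value:.2f}")
    return failures


baseline = {"task_success": 0.90, "groundedness": 0.95}
candidate = {"task_success": 0.91, "groundedness": 0.90}

problems = quality_gate(baseline, candidate)
print("deploy" if not problems else f"blocked: {problems}")
```

The `max_drop` tolerance exists because agent metrics are noisy; a stricter version would compare confidence intervals rather than point estimates.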
The agent improvement loop starts with a trace
LangChain blog by Sam Crowder detailing how production traces feed the agent improvement loop, from observability through evaluation to annotation and back to building.
Read on the LangChain Blog
Simulation and test-bed agents give you a safer place to discover failures
AWS explicitly separates simulation and test-bed agents from production observers because they answer a different reliability question: “What happens if the agent explores this environment repeatedly before we trust it live?” That matters for reinforcement-style loops, CLI or browser sandboxes, workflow rehearsals, and multi-agent coordination experiments.
When simulation adds reliability value
| Scenario | What simulation reveals | Why production tests are weaker |
|---|---|---|
| Long-running workflow rehearsal | State drift, dead ends, recovery behavior, and retry loops across many iterations | Live traffic rarely gives enough repetitions to isolate structural failure modes safely |
| CLI or browser automation | How agents handle UI changes, shell errors, or partial success conditions | Real systems carry side effects that make aggressive exploration unsafe |
| Swarm or multi-agent behavior | Emergent coordination failures, conflict over shared state, and unstable escalation patterns | Production incidents are expensive places to discover collective behavior problems |
Observer agents upgrade telemetry into agent-aware operations signals
The AWS observer and monitoring pattern closes an important reliability gap. Traditional dashboards show raw logs, metrics, and traces. Observer agents reason across those signals, classify anomalies, summarize trends, and escalate only when the telemetry pattern actually matters.
Observer agents vs basic monitoring
| Concern | Traditional monitoring | Observer agent behavior |
|---|---|---|
| Anomaly detection | Threshold and rule alerts | Interprets distributed signals, sequence changes, and contextual shifts before escalation |
| Compliance and audit | Raw event retention for later review | Produces structured summaries, policy classifications, and escalation records |
| Agent operations | Tracks latency and failures per step | Correlates tool misuse, policy drift, suspicious patterns, and out-of-band behavior across runs |
Production Agent Runtime Infrastructure
Deploying long-horizon agents in production requires purpose-built infrastructure. A good harness gives your agent the right prompts, tools, and skills. But a production runtime provides durable execution, memory, multi-tenancy, human-in-the-loop controls, and observability to keep the agent running reliably across crashes, deploys, and long-running tasks.
[Diagram: Production Agent Runtime Architecture]
Core Capabilities of a Production Agent Runtime
| Capability | Description | Why it is required in production |
|---|---|---|
| Durable Execution | Checkpoints state after every step and allows resumption from the exact point of interruption. | Agents run for minutes or hours. Without checkpointing, a crash, transient error, or deploy would erase all progress and force the agent to spend tokens redoing completed work. |
| Memory Management | Separates short-term thread memory (within a run) from long-term memory (across conversations via semantic stores). | Agents must remember preferences across sessions (long-term) while strictly isolating the context of the current task (short-term). |
| Human-in-the-loop (HITL) | Dynamic interruptions that freeze state, free compute resources, and wait indefinitely for a human to resume or approve. | High-stakes actions (sending emails, executing trades) require approval, and agents must pause efficiently without blocking threads. |
| Multi-tenancy & Auth | Isolates user data, enforces RBAC, and manages OAuth token refreshes for third-party tools. | An agent serving many users must never leak state across tenants and must securely act on behalf of users. |
| Middleware | Deterministic hooks running before/after LLM calls for redaction, rate limiting, and safety checks. | Policies like PII redaction or budget caps must run deterministically, not rely on the LLM's adherence to prompts. |
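Durable execution, the first capability in the table, comes down to persisting state after each step and resuming from the last completed one. The sketch below is a toy illustration using a JSON file; `run_with_checkpoints` and the step functions are hypothetical, and a production runtime would use a database-backed checkpointer rather than local files.

```python
import json
import os
from typing import Callable, List

Step = Callable[[dict], dict]


def run_with_checkpoints(steps: List[Step], path: str) -> dict:
    """Run steps in order, saving state after each so a rerun resumes mid-way."""
    if os.path.exists(path):
        with open(path) as f:
            saved = json.load(f)  # resume from the last completed step
    else:
        saved = {"next_step": 0, "state": {}}

    state = saved["state"]
    for index in range(saved["next_step"], len(steps)):
        state = steps[index](state)
        # Write to a temp file, then rename, so a crash never leaves a partial file.
        tmp_path = path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_step": index + 1, "state": state}, f)
        os.replace(tmp_path, path)
    return state


calls: List[str] = []
attempts = {"summarize": 0}


def fetch(state: dict) -> dict:
    calls.append("fetch")
    return {**state, "document": "quarterly report"}


def summarize(state: dict) -> dict:
    attempts["summarize"] += 1
    if attempts["summarize"] == 1:
        raise RuntimeError("transient model error")  # simulated crash
    calls.append("summarize")
    return {**state, "summary": "revenue up"}
```

Calling `run_with_checkpoints([fetch, summarize], path)` once fails inside `summarize`; calling it again with the same `path` skips `fetch` and resumes at `summarize`, so the work and tokens spent on completed steps are not repeated.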
The Runtime Behind Production Deep Agents
LangChain guide detailing the infrastructure requirements for long-horizon agents: durable execution, streaming, memory, HITL, and observability.
Read the LangChain Runtime Guide