Agent Development Lifecycle (ADLC)

ADLC vs SDLC: why traditional lifecycle models are not enough

Traditional SDLC assumes deterministic behavior: given the same input and code, a system produces the same output. AI agents break that assumption. Even with identical prompts, tools, and context, an agent can choose different reasoning paths, call tools in different orders, or produce varied text depending on model version, temperature settings, or hidden state. ADLC extends SDLC with explicit phases for non-deterministic testing, evaluation dataset management, and continuous tuning.

Key differences between SDLC and ADLC

Dimension	SDLC approach	ADLC approach
Testing	Unit and integration tests assert exact outputs for given inputs	Evaluation metrics, golden datasets, and statistical bounds over multiple runs
Versioning	Code commits and semantic versioning	Prompt versions, tool contract versions, model versions, and evaluation dataset versions
Deployment	Binaries or containers are immutable once deployed	Model endpoints can be updated upstream; prompts can change without code deployment
Monitoring	Error rates, latency, and resource metrics	Quality metrics, hallucination rates, approval rates, and evaluation scores over time
Rollback	Revert to previous binary or container image	Revert prompt, tool policy, model version, or routing rules independently
Regression	Test suite catches breaking changes	Evaluation suite detects quality drift; prompts may need tuning even without code changes

Prompts are code, but they are not the only code

Modern agent systems treat prompts as versioned artifacts alongside tool contracts, evaluation datasets, and orchestration logic. The "prompt as code" discipline is necessary but not sufficient: you must also version the evaluation criteria, golden examples, and policy rules that define acceptable behavior.
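One way to picture versioning everything together is a release manifest that pins each artifact. The following is a minimal sketch, not a prescribed schema: the `AgentRelease` class, its field names, and the version strings are all hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AgentRelease:
    """Pins every artifact that shapes agent behavior (hypothetical schema)."""
    prompt_version: str
    tool_contract_version: str
    model_version: str
    eval_dataset_version: str
    policy_rules_version: str

    def fingerprint(self) -> str:
        # Stable hash over all pinned versions; it changes if any one artifact changes.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


current = AgentRelease("prompt-v7", "tools-v3", "model-2025-01", "evals-v4", "policy-v2")
# Change only the prompt; every other artifact stays pinned.
candidate = replace(current, prompt_version="prompt-v8")
# Roll back the prompt independently of tools, model, and policy.
rolled_back = replace(candidate, prompt_version="prompt-v7")
```

Because the manifest is immutable and hashed, a rollback of a single artifact yields a release that is identical to the earlier one, which makes evaluation results comparable across releases.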

Agent Development Lifecycle (ADLC)

Salesforce Architect guide defining the ADLC phases (Ideation, Development, Testing, Deployment, Monitoring), inner- and outer-loop activities, and non-deterministic testing strategies for production agents.

Read the Salesforce ADLC guide

ADLC phases: from ideation to production

ADLC organizes agent development into five phases: ideation, development, testing and validation, deployment, and monitoring and tuning. The inner loop (ideation, development, testing) supports rapid iteration, while the outer loop (deployment, monitoring, and feeding insights back into development) handles continuous improvement in production.

[Diagram: Agent Development Lifecycle phases]

Building Effective Agents

LangChain's guide to building effective agents with LangGraph, covering the agent construction lifecycle from prototype to production.

Read on LangChain

ADLC phase details

Phase	Description	Key activities	Primary risk	Exit criteria
Ideation	Define the agent's scope, autonomy level, and success metrics	Identify user workflows, select tools, define policy boundaries, draft evaluation questions	Building an agent for a problem better solved by deterministic automation	Clear use case, bounded tool set, and measurable success criteria
Development (Inner Loop)	Rapid iteration on prompts, tool contracts, and orchestration logic	Draft prompts, implement tools, run local tests, adjust temperature and routing rules	Overfitting to a narrow test set or missing edge cases	Agent completes end-to-end workflows in a sandbox environment
Testing & Validation	Systematic evaluation against golden datasets and adversarial inputs	Run evaluation suites, test failure modes, validate tool contract compliance, review reasoning traces	Silent regression where quality drifts without obvious errors	Evaluation metrics meet baselines; failure modes are understood and mitigated
Deployment	Controlled release to production with gradual rollout and observability	Configure canary deployments, set up monitoring dashboards, document rollback procedures, train reviewers	Unexpected behavior in production due to scale, data shifts, or model drift	Successful pilot run with no critical incidents; rollback paths tested
Monitoring & Tuning (Outer Loop)	Continuous observation and improvement based on production signals	Track quality metrics, collect edge cases, re-run evaluations, tune prompts and policies	Drift where quality degrades gradually without clear triggers	Ongoing; tuning loops feed back into development when thresholds are breached
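The outer-loop exit criterion ("tuning loops feed back into development when thresholds are breached") could be sketched as a rolling-mean check. This is an illustrative assumption rather than a described implementation: `DriftMonitor`, the threshold of 0.8, and the window of 3 are hypothetical.

```python
from collections import deque


class DriftMonitor:
    """Rolling-mean check over recent quality scores (hypothetical helper)."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.scores = deque(maxlen=window)

    def record(self, score: float) -> bool:
        """Add a score; return True when a full window averages below the threshold."""
        self.scores.append(score)
        window_full = len(self.scores) == self.scores.maxlen
        mean = sum(self.scores) / len(self.scores)
        return window_full and mean < self.threshold


monitor = DriftMonitor(threshold=0.8, window=3)
# Scores degrade gradually; no single run fails outright.
alerts = [monitor.record(s) for s in [0.9, 0.85, 0.8, 0.7, 0.6]]
```

Waiting for a full window before alerting avoids reacting to a single noisy score, at the cost of detecting drift a few runs later.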

Example: Eval Dataset Versioning Through the ADLC

Consider a customer-support agent being developed:

1. Development: Initial eval dataset v1 with 50 hand-crafted scenarios (happy path only).
2. Testing: Dataset v2 adds 30 adversarial cases (prompt injection, edge cases). The regression suite catches 3 tool-selection errors, which are fixed before deployment.
3. Staging: Dataset v3 pulls 200 real production traces from similar deployed agents via LangSmith. Annotators label correct outcomes, creating a hybrid eval set.
4. Production: Online evaluation runs continuously. New traces that score poorly are flagged, annotated, and fed back into v4 of the eval dataset, closing the improvement loop.

Each phase enriches the eval dataset, making it progressively harder and more representative of real traffic.
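The progression in this example can be sketched as immutable dataset versions, where each extension creates a new version and leaves earlier ones available for re-runs. The `EvalCase` and `EvalDataset` types and the case IDs are hypothetical; a real setup would also store inputs and expected outcomes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    source: str  # "hand-crafted", "adversarial", or "production-trace" in this sketch


@dataclass(frozen=True)
class EvalDataset:
    version: int
    cases: tuple[EvalCase, ...]

    def extend(self, new_cases: list[EvalCase]) -> "EvalDataset":
        """Return the next version; earlier versions are never mutated."""
        return EvalDataset(self.version + 1, self.cases + tuple(new_cases))


v1 = EvalDataset(1, tuple(EvalCase(f"happy-{i}", "hand-crafted") for i in range(50)))
v2 = v1.extend([EvalCase(f"adv-{i}", "adversarial") for i in range(30)])
v3 = v2.extend([EvalCase(f"trace-{i}", "production-trace") for i in range(200)])
```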

Tradeoff: Iteration Speed vs Evaluation Thoroughness

Skipping the dedicated testing phase to ship faster is tempting but dangerous. Because agents are non-deterministic, bugs that pass a quick manual smoke test can still surface in production, and each skipped evaluation cycle compounds technical debt. Conversely, exhaustively testing every prompt variation before any real traffic wastes time on hypothetical edge cases. Balance: run at least one full offline eval cycle before every deployment, then let online evaluation catch the long tail.
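The "one full offline eval cycle before every deployment" rule can be expressed as a simple gate. This is a sketch under assumptions: the function name, metric names, and baseline values are hypothetical.

```python
def deployment_gate(metrics: dict[str, float], baselines: dict[str, float]) -> list[str]:
    """Return the names of metrics below baseline; an empty list means the gate passes."""
    failures = []
    for name, minimum in baselines.items():
        # A metric missing from the eval run counts as a failure, not a pass.
        if metrics.get(name, float("-inf")) < minimum:
            failures.append(name)
    return failures


baselines = {"task_success": 0.90, "tool_selection_accuracy": 0.95}
offline_run = {"task_success": 0.93, "tool_selection_accuracy": 0.91}
blocked_by = deployment_gate(offline_run, baselines)
```

Returning the failing metric names, rather than a bare boolean, tells the team which part of the agent needs another inner-loop iteration.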

Testing non-deterministic systems

Testing agents requires shifting from "does this exact output match?" to "is this behavior acceptable within bounds?" A well-designed test strategy combines golden datasets for happy-path validation, adversarial inputs for failure-mode testing, regression suites for drift detection, and human evaluation for nuanced quality judgments.
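A minimal sketch of an "acceptable within bounds" test: run the same input several times and assert on the pass rate instead of an exact string. The `fake_agent` below is a stand-in that simulates varied wording; the helper names and the 0.95 bound are assumptions for illustration.

```python
import random
from typing import Callable


def pass_rate(agent: Callable[[str], str], check: Callable[[str], bool],
              prompt: str, runs: int) -> float:
    """Run the agent repeatedly on one input; return the fraction of acceptable outputs."""
    passed = sum(1 for _ in range(runs) if check(agent(prompt)))
    return passed / runs


# Stand-in for a non-deterministic agent: wording varies, the key fact should not.
rng = random.Random(0)


def fake_agent(prompt: str) -> str:
    return rng.choice(["Your refund was issued.", "Refund issued today.",
                       "We issued the refund."])


def mentions_refund(output: str) -> bool:
    return "refund" in output.lower()


rate = pass_rate(fake_agent, mentions_refund, "Where is my refund?", runs=20)
assert rate >= 0.95  # a behavioral bound, not an exact-match assertion
```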

Figure: Evaluator-reflect-refine pattern, in which a generator produces output, an evaluator scores it, and a refiner improves it in a loop. The pattern embeds quality control inside the agent loop, mirroring the offline eval cycle at the agent architecture level. Source: AWS Prescriptive Guidance.

AWS Prescriptive Guidance: Evaluator, Reflect, and Refine Loop Patterns. Last verified: 2026-05-17
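The generator, evaluator, and refiner roles can be sketched as three callables driven by a bounded loop. This is an illustrative outline, not the AWS reference implementation; the toy `generate`, `evaluate`, and `refine` functions and the 0.8 target are hypothetical stand-ins for model calls.

```python
from typing import Callable


def refine_loop(generate: Callable[[str], str],
                evaluate: Callable[[str], float],
                refine: Callable[[str, float], str],
                task: str, target: float, max_rounds: int) -> tuple[str, float, int]:
    """Generate once, then evaluate and refine until the score meets target or rounds run out."""
    draft = generate(task)
    score = evaluate(draft)
    rounds = 0
    while score < target and rounds < max_rounds:
        draft = refine(draft, score)
        score = evaluate(draft)
        rounds += 1
    return draft, score, rounds


# Toy stand-ins: this "evaluator" rewards answers that cite a source.
def generate(task: str) -> str:
    return "Refunds take 5 days."


def evaluate(draft: str) -> float:
    return 1.0 if "[source]" in draft else 0.4


def refine(draft: str, score: float) -> str:
    return draft + " [source]"


draft, score, rounds = refine_loop(generate, evaluate, refine, "refund policy", 0.8, 3)
```

The `max_rounds` cap matters: without it, an evaluator that never returns a passing score would keep the loop running indefinitely.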

Testing strategies for AI agents

Strategy	Purpose	Key techniques	Limitations
Golden paths	Validate that the agent handles expected workflows correctly	Curated input-output pairs, reference reasoning chains, tool call sequences	Does not catch edge cases or novel situations
Adversarial inputs	Test failure modes, safety boundaries, and robustness	Malicious prompts, out-of-scope requests, ambiguous or contradictory inputs	Hard to exhaust; may miss subtle safety issues
Regression suites	Detect quality drift over time as prompts or models change	Periodic evaluation runs, metric baselines, threshold alerts	Requires stable evaluation datasets and clear success metrics
A/B evaluation	Compare candidate prompts, models, or configurations in production	Canary deployments, interleaved trials, blind human ratings	Requires traffic volume and careful experimental design
Human evaluation	Assess nuanced quality, safety, and appropriateness	Expert review, crowdsourced ratings, clinical or domain-specific rubrics	Expensive, slow, and subject to bias or inconsistency

Evaluation datasets must be versioned and treated as first-class artifacts. When you change a prompt, tool, or model, you should re-run the previous evaluation dataset to check for regression. When you discover new edge cases in production, add them to the dataset for future testing. This discipline creates a feedback loop where production experience continuously strengthens the test suite.
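Re-running the previous dataset after a change amounts to a per-case comparison against stored baseline scores. The sketch below assumes scores in the range 0 to 1 keyed by case ID; the function name, the 0.05 tolerance, and the case IDs are hypothetical.

```python
def find_regressions(baseline: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.05) -> dict[str, float]:
    """Return case_id -> score drop for cases that got worse by more than tolerance."""
    regressions = {}
    for case_id, old_score in baseline.items():
        new_score = candidate.get(case_id, 0.0)  # a case that no longer runs scores 0
        drop = old_score - new_score
        if drop > tolerance:
            regressions[case_id] = round(drop, 3)
    return regressions


before = {"refund-01": 0.90, "cancel-02": 0.80, "inject-03": 1.00}
after_prompt_change = {"refund-01": 0.92, "cancel-02": 0.50, "inject-03": 0.97}
regressed = find_regressions(before, after_prompt_change)
```

The tolerance absorbs run-to-run noise, so small dips such as `inject-03` do not block a change while a large drop such as `cancel-02` does.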

Test tool contracts separately from agent reasoning

Tool contracts (APIs, MCP servers, FHIR operations) should have their own unit tests independent of the agent. This separation isolates failures: if the tool contract tests pass but the agent fails, the problem is in planning or tool selection. If the tool contract tests fail, the problem is in the tool implementation.
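A contract test of this kind runs with no model in the loop. The sketch below uses a hypothetical `lookup_order` tool and a hypothetical contract (required keys and allowed status values) purely for illustration.

```python
def lookup_order(order_id: str) -> dict:
    """Hypothetical tool: returns order status for a well-formed order ID."""
    if not (order_id.startswith("ORD-") and order_id[4:].isdigit()):
        raise ValueError(f"malformed order_id: {order_id!r}")
    return {"order_id": order_id, "status": "shipped"}


def check_contract(result: dict) -> bool:
    """The contract the agent relies on: required keys and a known status value."""
    return (set(result) >= {"order_id", "status"}
            and result["status"] in {"pending", "shipped", "delivered"})


# Contract tests exercise the tool directly, without any agent reasoning.
assert check_contract(lookup_order("ORD-1001"))
try:
    lookup_order("1001")
    raise AssertionError("expected ValueError for malformed ID")
except ValueError:
    pass  # malformed input is rejected, as the contract requires
```

If these assertions pass and the agent still fails an end-to-end scenario, the investigation can start with planning and tool selection rather than the tool itself.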

Testing Granularity Levels

Level	Scope	What You Test	Ease of Automation
Single-step (run-level)	One tool selection	Did the agent pick the right tool?	Easiest — deterministic scoring
Full-turn (trace-level)	Complete trajectory	Tool call order, final response quality	Moderate — requires trajectory analysis
Multi-turn (thread-level)	Conversational state across turns	State retention, context management, multi-step goals	Hardest — needs full conversation simulation
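The first two levels in the table can be scored with small deterministic functions. This is one possible scoring scheme, assumed for illustration: the function names and tool names are hypothetical, and trajectory scoring here counts expected tool calls that appear in the right order.

```python
def score_step(chosen_tool: str, expected_tool: str) -> bool:
    """Run-level: did the agent pick the expected tool for this single step?"""
    return chosen_tool == expected_tool


def score_trajectory(actual: list[str], expected: list[str]) -> float:
    """Trace-level: fraction of expected tool calls found in order in the actual trace."""
    if not expected:
        return 1.0
    position = 0
    matched = 0
    for tool in expected:
        try:
            # Search only after the previous match so that order is enforced.
            position = actual.index(tool, position) + 1
            matched += 1
        except ValueError:
            continue  # missing call; keep looking for later expected calls
    return matched / len(expected)


expected = ["lookup_order", "check_policy", "issue_refund"]
skipped_policy = score_trajectory(["lookup_order", "issue_refund"], expected)
wrong_order = score_trajectory(["issue_refund", "lookup_order"], expected)
```

Thread-level evaluation is omitted because it needs a simulated conversation partner, which is why the table rates it hardest to automate.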

Agent Evaluation Readiness Checklist

Pre-deployment checklist covering dataset preparation, metric selection, and evaluation harness setup for reliable agent testing.

Read on LangChain

Observability & Tracing from Day One

A common mistake in traditional software development is treating observability as an operational concern bolted on during the Deployment phase. In the Agent Development Lifecycle, tracing is a design-time requirement.

Figure: Traditional feedback control loop (build, test, pass or fail, remediate, and repeat). The build-test-remediate loop is the foundation of agent observability: traces expose failures, evals classify them, and the loop feeds corrections back into the agent. In production, this loop runs continuously rather than at release cadence. Source: AWS Prescriptive Guidance.

AWS Prescriptive Guidance: Evaluator, Reflect, and Refine Loop Patterns. Last verified: 2026-05-17

Why Tracing belongs in the Inner Loop

ADLC Phase	Tracing Value
Ideation	Define trace metadata (user ID, session ID, task tags) so usage can be filtered and analyzed later.
Development	Trace step-by-step reasoning. You cannot debug a 200-step trajectory without detailed traces. Code documents what tools exist; traces document what the agent actually did.
Testing & Validation	Failed traces are curated into the golden evaluation dataset. Traces from manual testing become the automated regression tests.
Monitoring & Tuning	Production traces feed online evaluations, which trigger annotations, which loop back into the development phase for continuous improvement.

Traces replace the stack trace

When an agent makes a mistake, there is no code stack trace because no code failed. The failure was in the reasoning loop. Detailed traces—capturing prompts, tool inputs, results, and LLM outputs at every step—are the only way to debug agentic software.
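To make the idea concrete, here is a minimal sketch of a trace record that carries the metadata defined at ideation (user ID, session ID, task tags) plus one entry per step. The `Trace` class and its fields are hypothetical and much simpler than what tracing platforms actually capture.

```python
import json
import time
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    """One agent run: metadata defined up front plus an ordered list of steps."""
    user_id: str
    session_id: str
    task_tags: list[str]
    steps: list[dict] = field(default_factory=list)

    def record(self, kind: str, inputs: dict, output: str) -> None:
        # kind is "llm" or "tool" in this sketch.
        self.steps.append({"index": len(self.steps), "kind": kind, "inputs": inputs,
                           "output": output, "timestamp": time.time()})

    def to_json(self) -> str:
        return json.dumps(asdict(self))


trace = Trace(user_id="u-42", session_id="s-7", task_tags=["refund"])
trace.record("llm", {"prompt": "Where is my refund?"}, "call lookup_order")
trace.record("tool", {"name": "lookup_order", "order_id": "ORD-1001"}, "status=shipped")
```

Reading the steps in order shows what the agent decided and what each tool returned, which is the information a stack trace would provide for a conventional failure.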

Online vs Offline Evaluation

Online evaluation runs against live production data. It uses reference-free scoring methods (LLM-as-judge and code-based checks) to continuously monitor agent quality without ground truth labels. This catches real-world drift (prompt sensitivity, model behavior changes, edge cases) but cannot compare against known correct answers.

Offline evaluation runs pre-deployment against curated datasets with known ground truth. It supports both reference-based scoring (exact match, semantic similarity to golden answers) and reference-free methods. Offline eval provides controlled, reproducible validation but cannot catch surprises that only emerge in production traffic.

Tradeoff: Online catches drift but lacks ground truth; offline has ground truth but misses production edge cases. Mature teams run both: offline for pre-deployment gating, online for continuous quality monitoring.
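The difference between reference-based and reference-free scoring shows up in the function signatures: one needs a golden answer, the other inspects only the output. Both scorers below are simplified sketches; the specific checks and the 500-character budget are assumptions chosen for illustration.

```python
def exact_match(output: str, reference: str) -> float:
    """Reference-based (offline): needs a golden answer to compare against."""
    return 1.0 if output.strip().lower() == reference.strip().lower() else 0.0


def code_based_check(output: str, max_chars: int = 500) -> float:
    """Reference-free (online or offline): inspects only the output itself."""
    checks = [
        len(output) <= max_chars,          # response stays within a length budget
        "as an ai" not in output.lower(),  # no boilerplate refusal phrasing
        output.strip() != "",              # the agent actually answered
    ]
    return sum(checks) / len(checks)


offline_score = exact_match("Shipped ", "shipped")
online_score = code_based_check("Your order shipped on Monday.")
```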

Tradeoff: Tracing Overhead vs Debugging Capability

Full tracing captures every LLM call, tool invocation, and state transition. That is invaluable for debugging, but it adds latency and storage cost. Sampling strategies help: for example, trace 100% of errors and 10% of successful runs. For high-throughput agents, consider trace-level sampling with run-level retention, that is, decide per trace whether to keep it and retain every run (step) inside a kept trace so the trajectory stays complete. The key insight: traces you don't collect can't help you debug the outage you didn't predict.
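A sampling policy of this shape (all errors, a fraction of successes) can be sketched with a hash of the trace ID so the decision is stable for a given trace. The function name and ID format are hypothetical, and the 10% default simply mirrors the example rate above.

```python
import hashlib


def should_keep_trace(trace_id: str, had_error: bool, success_rate: float = 0.10) -> bool:
    """Keep every failed run; keep a stable fraction of successful ones."""
    if had_error:
        return True
    # Hash the ID so the decision is reproducible across services and restarts.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform value in [0, 1)
    return bucket < success_rate


kept = sum(should_keep_trace(f"trace-{i}", had_error=False) for i in range(10_000))
kept_fraction = kept / 10_000
```

Hashing rather than calling a random number generator means every component that sees the same trace ID makes the same keep-or-drop decision, so a trace is never half-recorded.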

Traces Start the Agent Improvement Loop

How LangSmith traces feed into evaluation datasets, enabling continuous agent improvement through annotated production data.

Read on LangChain Blog

Knowledge Check

Why does traditional SDLC testing assume determinism, and why does that assumption break for AI agents?
