ADLC vs SDLC: why traditional lifecycle models are not enough
Traditional SDLC assumes deterministic behavior: given the same input and code, a system produces the same output. AI agents break that assumption. Even with identical prompts, tools, and context, an agent can choose different reasoning paths, call tools in different orders, or produce varied text depending on model version, temperature settings, or hidden state. ADLC extends SDLC with explicit phases for non-deterministic testing, evaluation dataset management, and continuous tuning.
Key differences between SDLC and ADLC
| Dimension | SDLC approach | ADLC approach |
|---|---|---|
| Testing | Unit and integration tests assert exact outputs for given inputs | Evaluation metrics, golden datasets, and statistical bounds over multiple runs |
| Versioning | Code commits and semantic versioning | Prompt versions, tool contract versions, model versions, and evaluation dataset versions |
| Deployment | Binaries or containers are immutable once deployed | Model endpoints can be updated upstream; prompts can change without code deployment |
| Monitoring | Error rates, latency, and resource metrics | Quality metrics, hallucination rates, approval rates, and evaluation scores over time |
| Rollback | Revert to previous binary or container image | Revert prompt, tool policy, model version, or routing rules independently |
| Regression | Test suite catches breaking changes | Evaluation suite detects quality drift; prompts may need tuning even without code changes |
Prompts are code, but they are not the only code
Modern agent systems treat prompts as versioned artifacts alongside tool contracts, evaluation datasets, and orchestration logic. The "prompt as code" discipline is necessary but not sufficient: you must also version the evaluation criteria, golden examples, and policy rules that define acceptable behavior.
Agent Development Lifecycle (ADLC)
Salesforce Architect guide defining the ADLC phases (Ideation, Development, Testing, Deployment, Monitoring), inner- and outer-loop activities, and non-deterministic testing strategies for production agents.
Read the Salesforce ADLC guideADLC phases: from ideation to production
ADLC organizes agent development into five phases: ideation, development, testing and validation, deployment, and monitoring and tuning. The inner loop (ideation, development, testing) supports rapid iteration, while the outer loop (deployment, monitoring, and feeding insights back into development) handles continuous improvement in production.
Agent Development Lifecycle phases
Loading diagram...
Building Effective Agents
LangChain's guide to building effective agents with LangGraph, covering the agent construction lifecycle from prototype to production.
Read on LangChainADLC phase details
| Phase | Description | Key activities | Primary risk | Exit criteria |
|---|---|---|---|---|
| Ideation | Define the agent's scope, autonomy level, and success metrics | Identify user workflows, select tools, define policy boundaries, draft evaluation questions | Building an agent for a problem better solved by deterministic automation | Clear use case, bounded tool set, and measurable success criteria |
| Development (Inner Loop) | Rapid iteration on prompts, tool contracts, and orchestration logic | Draft prompts, implement tools, run local tests, adjust temperature and routing rules | Overfitting to a narrow test set or missing edge cases | Agent completes end-to-end workflows in a sandbox environment |
| Testing & Validation | Systematic evaluation against golden datasets and adversarial inputs | Run evaluation suites, test failure modes, validate tool contract compliance, review reasoning traces | Silent regression where quality drifts without obvious errors | Evaluation metrics meet baselines; failure modes are understood and mitigated |
| Deployment | Controlled release to production with gradual rollout and observability | Configure canary deployments, set up monitoring dashboards, document rollback procedures, train reviewers | Unexpected behavior in production due to scale, data shifts, or model drift | Successful pilot run with no critical incidents; rollback paths tested |
| Monitoring & Tuning (Outer Loop) | Continuous observation and improvement based on production signals | Track quality metrics, collect edge cases, re-run evaluations, tune prompts and policies | Drift where quality degrades gradually without clear triggers | Ongoing; tuning loops feed back into development when thresholds are breached |
Example: Eval Dataset Versioning Through the ADLC
Consider a customer-support agent being developed: 1. Development: Initial eval dataset v1 with 50 hand-crafted scenarios (happy path only) 2. Testing: Dataset v2 adds 30 adversarial cases (prompt injection, edge cases). Regression suite catches 3 tool-selection errors, fixed before deployment. 3. Staging: Dataset v3 pulls 200 real production traces from similar deployed agents via LangSmith. Annotators label correct outcomes, creating a hybrid eval set. 4. Production: Online evaluation runs continuously. New traces that score poorly are flagged, annotated, and fed back into v4 of the eval dataset — closing the improvement loop. Each phase enriches the eval dataset, making it progressively harder and more representative of real traffic.
Tradeoff: Iteration Speed vs Evaluation Thoroughness
Skipping the dedicated testing phase to ship faster is tempting but dangerous. Agents are non-deterministic — bugs that pass a quick manual smoke test WILL surface in production. Each skipped evaluation cycle compounds technical debt. Conversely, over-testing every prompt variation before any real traffic wastes time on hypothetical edge cases. Balance: run at least one full offline eval cycle before every deployment, then let online evaluation catch the long tail.
Testing non-deterministic systems
Testing agents requires shifting from "does this exact output match?" to "is this behavior acceptable within bounds?" A well-designed test strategy combines golden datasets for happy-path validation, adversarial inputs for failure-mode testing, regression suites for drift detection, and human evaluation for nuanced quality judgments.
Testing strategies for AI agents
| Strategy | Purpose | Key techniques | Limitations |
|---|---|---|---|
| Golden paths | Validate that the agent handles expected workflows correctly | Curated input-output pairs, reference reasoning chains, tool call sequences | Does not catch edge cases or novel situations |
| Adversarial inputs | Test failure modes, safety boundaries, and robustness | Malicious prompts, out-of-scope requests, ambiguous or contradictory inputs | Hard to exhaust; may miss subtle safety issues |
| Regression suites | Detect quality drift over time as prompts or models change | Periodic evaluation runs, metric baselines, threshold alerts | Requires stable evaluation datasets and clear success metrics |
| A/B evaluation | Compare candidate prompts, models, or configurations in production | Canary deployments, interleaved trials, blind human ratings | Requires traffic volume and careful experimental design |
| Human evaluation | Assess nuanced quality, safety, and appropriateness | Expert review, crowdsourced ratings, clinical or domain-specific rubrics | Expensive, slow, and subject to bias or inconsistency |
Evaluation datasets must be versioned and treated as first-class artifacts. When you change a prompt, tool, or model, you should re-run the previous evaluation dataset to check for regression. When you discover new edge cases in production, add them to the dataset for future testing. This discipline creates a feedback loop where production experience continuously strengthens the test suite.
Test tool contracts separately from agent reasoning
Tool contracts (APIs, MCP servers, FHIR operations) should have their own unit tests independent of the agent. This separation isolates failures: if the tool contract tests pass but the agent fails, the problem is in planning or tool selection. If the tool contract tests fail, the problem is in the tool implementation.
Testing Granularity Levels
| Level | Scope | What You Test | Ease of Automation |
|---|---|---|---|
| Single-step (run-level) | One tool selection | Did the agent pick the right tool? | Easiest — deterministic scoring |
| Full-turn (trace-level) | Complete trajectory | Tool call order, final response quality | Moderate — requires trajectory analysis |
| Multi-turn (thread-level) | Conversational state across turns | State retention, context management, multi-step goals | Hardest — needs full conversation simulation |
Agent Evaluation Readiness Checklist
Pre-deployment checklist covering dataset preparation, metric selection, and evaluation harness setup for reliable agent testing.
Read on LangChainObservability & Tracing from Day One
A common mistake in traditional software development is treating observability as an operational concern bolted on during the Deployment phase. In the Agent Development Lifecycle, tracing is a design-time requirement.
Why Tracing belongs in the Inner Loop
| ADLC Phase | Tracing Value |
|---|---|
| Ideation | Define trace metadata (user ID, session ID, task tags) so usage can be filtered and analyzed later. |
| Development | Trace step-by-step reasoning. You cannot debug a 200-step trajectory without detailed traces. Code documents what tools exist; traces document what the agent actually did. |
| Testing & Validation | Failed traces are curated into the golden evaluation dataset. Traces from manual testing become the automated regression tests. |
| Monitoring & Tuning | Production traces feed online evaluations, which trigger annotations, which loop back into the development phase for continuous improvement. |
Traces replace the stack trace
When an agent makes a mistake, there is no code stack trace because no code failed. The failure was in the reasoning loop. Detailed traces—capturing prompts, tool inputs, results, and LLM outputs at every step—are the only way to debug agentic software.
Online vs Offline Evaluation
Online evaluation runs against live production data. It uses reference-free scoring methods — LLM-as-judge and code-based checks — to continuously monitor agent quality without ground truth labels. This catches real-world drift (prompt sensitivity, model behavior changes, edge cases) but cannot compare against known correct answers. Offline evaluation runs pre-deployment against curated datasets with known ground truth. It supports both reference-based scoring (exact match, semantic similarity to golden answers) and reference-free methods. Offline eval provides controlled, reproducible validation but cannot catch surprises that only emerge in production traffic. Tradeoff: Online catches drift but lacks ground truth; offline has ground truth but misses production edge cases. Mature teams run both — offline for pre-deployment gating, online for continuous quality monitoring.
Tradeoff: Tracing Overhead vs Debugging Capability
Full tracing captures every LLM call, tool invocation, and state transition — invaluable for debugging but adds latency and storage cost. Sampling strategies help: trace 100% of errors and 10% of successful runs. For high-throughput agents, consider trace-level sampling with run-level retention. The key insight: traces you don't collect can't help you debug the outage you didn't predict.
Traces Start the Agent Improvement Loop
How LangSmith traces feed into evaluation datasets, enabling continuous agent improvement through annotated production data.
Read on LangChain BlogKnowledge Check
Test your understanding with this quiz. You need to answer all questions correctly to mark this section as complete.