The software development lifecycle has been refined over four decades. Requirements, design, implementation, testing, deployment, maintenance — the SDLC works because the systems it governs are deterministic. Given the same input, they produce the same output. When they fail, they throw exceptions. When they pass tests, they work. When they ship, they stay shipped until someone changes the code.

None of this holds for AI agents. Given the same input, an agent may produce different outputs depending on model temperature, context window contents, and the stochastic nature of token sampling. When agents fail, they do not throw exceptions; they produce confident, well-formatted, wrong answers. When they pass evaluation at deployment, they may still degrade within weeks as input distributions shift and upstream model providers push silent updates. And when they ship, the underlying model, the retrieval corpus, and the business context all continue to change without anyone touching the agent's code.

Enterprise teams have noticed. McKinsey's 2025 State of AI report found that only 23 per cent of enterprises are scaling AI agents successfully, while 39 per cent remain stuck in experimentation. The gap is not a technology problem. The models work. The frameworks work. The gap is methodological: organisations are applying deterministic engineering practices to probabilistic systems and wondering why the results are unpredictable.

The Agent Development Lifecycle — ADLC — is the emerging answer. Glean published its Enterprise ADLC framework in May 2026, defining seven stages from opportunity identification through continuous monitoring. IBM released a complementary framework organised around four phases — Plan, Code and Build, Test and Release, and Deploy — with the middle two phases running in an iterative loop. EPAM, Arthur AI, and Salesforce have each contributed variations. The frameworks differ in structure but converge on a shared insight: agents need their own development methodology because the assumptions baked into the SDLC — deterministic outputs, binary test outcomes, stable post-deployment behaviour — do not hold.

Stage 1: Opportunity — tie every agent to a business KPI

The first ADLC stage is not "pick a use case." It is "identify a business outcome that an agent can measurably improve." The distinction matters because the most common failure mode in enterprise agent development is building agents that demonstrate technical capability without connecting to a KPI that the organisation actually tracks.

A procurement agent that automates purchase order creation is a technology demonstration. A procurement agent that reduces purchase-to-pay cycle time from fourteen days to three and is measured against that target every month is a business initiative. The technology is identical. The framing determines whether the agent survives the first budget review.

The opportunity stage requires four outputs: the specific business KPI the agent will improve, the current baseline for that KPI, the target improvement, and the economic value of reaching the target. Without these, the agent is an experiment, and experiments, as the pilot-to-production research consistently documents, die in the transition from demo to deployment.

Adoption is already broad. Anthropic's 2026 State of AI Agents survey of over 500 technical leaders shows that 57 per cent of organisations already deploy multi-step agent workflows, while 16 per cent have progressed to cross-functional agents spanning multiple teams. Microsoft's 2026 Cyber Pulse report found that over 80 per cent of Fortune 500 companies deploy active AI agents, with 29 per cent of employees already using unsanctioned agents for work tasks. Without KPI alignment from day one, most of these agents operate as point solutions that never compound into organisational value. The opportunity stage is therefore where governance starts: not with security controls, but with the basic question of whether this agent should exist in the first place.
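The four opportunity-stage outputs can be captured as a small record that travels with the agent. The sketch below is illustrative only: the class name, field names, and the economic value are assumptions, while the fourteen-day baseline and three-day target come from the procurement example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OpportunityBrief:
    """Hypothetical record of the four opportunity-stage outputs."""
    kpi: str                 # the business KPI the agent will improve
    baseline: float          # current value of that KPI
    target: float            # value the agent is expected to reach
    annual_value_eur: float  # estimated economic value of reaching the target

    def progress(self, current: float) -> float:
        """Fraction of the baseline-to-target gap closed so far (0.0 to 1.0)."""
        gap = self.baseline - self.target
        if gap == 0:
            raise ValueError("baseline and target must differ")
        closed = (self.baseline - current) / gap
        return max(0.0, min(1.0, closed))


brief = OpportunityBrief(
    kpi="purchase-to-pay cycle time (days)",
    baseline=14.0,
    target=3.0,
    annual_value_eur=250_000.0,  # placeholder, not taken from any report
)
halfway = brief.progress(current=8.5)  # 5.5 of the 11-day gap closed
```

A monthly review can then report `progress` against the same brief that justified the agent, which keeps the KPI and the agent tied together.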

Stage 2: Design — scope, guardrails, and escalation paths

Agent design is fundamentally different from software design because the design must account for what the agent should not do, not just what it should do. A software function either handles an input or it does not — the type system and the compiler enforce this. An agent will cheerfully attempt to handle any input, regardless of whether it falls within its intended scope, unless explicit boundaries prevent it.

The design stage produces three artefacts that have no equivalent in traditional software development. The scope definition specifies exactly what tasks the agent is authorised to perform, what data it may access, and what decisions it can make autonomously. The delegation framework applies directly — every agent needs the same clarity of scope, authority, boundaries, and accountability that you would provide to a human employee handling the same function.

The guardrail specification defines hard constraints the agent cannot violate, regardless of its reasoning. A customer-facing agent cannot commit to pricing outside the approved range. A compliance agent cannot approve its own risk assessments. These are not soft guidelines — they are constraints enforced in code, tested before deployment, and monitored in production.
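A minimal sketch of what "enforced in code" can look like for the two examples above. The function and exception names are hypothetical; the point is that the check sits outside the model's reasoning and raises before the action is executed.

```python
class GuardrailViolation(Exception):
    """Raised when a proposed action would break a hard constraint."""


def check_price(proposed: float, approved_min: float, approved_max: float) -> float:
    """Return the price unchanged if it lies inside the approved range."""
    if not (approved_min <= proposed <= approved_max):
        raise GuardrailViolation(
            f"price {proposed} outside approved range "
            f"[{approved_min}, {approved_max}]"
        )
    return proposed


def check_not_self_approved(assessment_author: str, approver: str) -> None:
    """A compliance agent may not approve an assessment it wrote itself."""
    if assessment_author == approver:
        raise GuardrailViolation(f"{approver} cannot approve its own assessment")
```

Because these checks are plain deterministic functions, they can be covered by ordinary tests before deployment and counted as violations in production monitoring.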

The escalation design defines what happens when the agent encounters situations outside its operational boundaries. Traditional automation simply stops when it encounters something unexpected. A well-designed agent recognises that it has reached the edge of its competence and routes the situation to the appropriate human or agent, with full context rather than just an error code.
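One way to picture escalation with full context is a hand-off record that carries the original request and the steps already taken. Everything in this sketch is an assumption made for illustration: the intent names, the queue name, and the idea that the agent exposes a confidence score.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Escalation:
    """Hypothetical hand-off record passed to a human or another agent."""
    reason: str
    original_request: str
    steps_taken: list[str] = field(default_factory=list)
    route_to: str = "human-review-queue"  # placeholder destination


IN_SCOPE = {"create_purchase_order", "check_order_status"}  # assumed intents


def triage(intent: str, request: str, confidence: float,
           steps_taken: list[str], min_confidence: float = 0.8) -> Optional[Escalation]:
    """Return None when the agent may proceed, otherwise an Escalation."""
    if intent not in IN_SCOPE:
        reason = f"intent '{intent}' is outside the agent's scope"
    elif confidence < min_confidence:
        reason = f"confidence {confidence:.2f} below {min_confidence:.2f}"
    else:
        return None
    return Escalation(reason=reason, original_request=request,
                      steps_taken=list(steps_taken))
```

The receiver gets the reason, the request, and the work done so far, so the case does not have to be reconstructed from logs.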

The multi-agent architecture decisions intersect directly with the design stage. Whether your agent operates within a hub-and-spoke, peer-to-peer, or hierarchical orchestration determines the scope of what it can be designed to do. The ADLC does not replace architecture decisions — it provides the methodology for making them systematically.

Stage 3: Performance — acceptance criteria for probabilistic outputs

This is where the SDLC breaks most obviously. In traditional software, a test either passes or fails. The system either returns the correct value or it does not. There is no ambiguity about whether the output is acceptable.

Agents produce outputs sampled from a probability distribution. Two runs of the same agent with the same input may produce semantically equivalent but textually different outputs. A classification agent might assign borderline cases to different categories. A reasoning agent might follow different chains of logic to reach the same conclusion — or a different one.

Performance criteria must account for this variability. The evaluation framework covers the mechanics in depth, but the ADLC frames them differently: evaluation is not a test gate that the agent passes once. It is a continuous measurement that defines the operational envelope within which the agent is authorised to run.

The practical approach is to define acceptance criteria across five dimensions: task accuracy (precision and recall against a golden test set), consistency (semantically equivalent outputs across multiple runs), latency (response time within business process constraints), cost (inference cost per completed task within the economic model), and safety (guardrails respected, escalation triggered appropriately). These are not pass/fail thresholds — they are operating ranges. An agent that maintains 92 per cent accuracy, sub-two-second p95 latency, and under EUR 0.15 per task is operating within its envelope. When any dimension drifts outside the range, the monitoring system (Stage 7) flags it for intervention.
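The operating envelope can be expressed as a set of ranges that measured metrics are checked against. In this sketch the accuracy, latency, and cost bounds follow the example above; the consistency and safety bounds, and all metric names, are assumptions. Bounds are treated as inclusive for simplicity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Range:
    """Inclusive operating range; None means unbounded on that side."""
    low: Optional[float] = None
    high: Optional[float] = None

    def contains(self, value: float) -> bool:
        above_low = self.low is None or value >= self.low
        below_high = self.high is None or value <= self.high
        return above_low and below_high


ENVELOPE = {
    "accuracy": Range(low=0.92),
    "consistency": Range(low=0.90),               # assumed threshold
    "p95_latency_s": Range(high=2.0),
    "cost_eur_per_task": Range(high=0.15),
    "guardrail_violation_rate": Range(high=0.0),  # assumed: zero tolerance
}


def breaches(metrics: dict[str, float]) -> list[str]:
    """Names of dimensions that are missing or outside their range."""
    return [
        name for name, allowed in ENVELOPE.items()
        if name not in metrics or not allowed.contains(metrics[name])
    ]
```

An empty result means the agent is inside its envelope; any returned name is the dimension to flag for intervention.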

Stage 4: Context and input — data governance as a first-class concern

The single largest determinant of agent output quality is not the model, the prompt, or the framework. It is the data the agent has access to. An agent with access to accurate, current, well-structured data and a mediocre prompt will outperform an agent with a brilliant prompt and stale, incomplete, or poorly structured data.

The context stage defines what data sources the agent accesses, how that data is retrieved, how freshness is maintained, and how access is governed. For RAG-based agents, this means specifying the corpus, the chunking strategy, the embedding model, the retrieval and re-ranking approach, and the refresh cadence. For agents with tool access, it means defining which APIs the agent can call and what happens when an API is unavailable.
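The decisions listed above can be written down as an explicit context specification rather than left implicit in pipeline code. The following is a sketch under assumptions: every field name and value is a placeholder, and the staleness rule is simply "older than the refresh cadence".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class ContextSpec:
    """Hypothetical declaration of an agent's data and tool access."""
    corpus: str
    chunk_size_tokens: int
    embedding_model: str
    reranker: str
    refresh_cadence: timedelta
    allowed_apis: frozenset[str]

    def is_stale(self, last_refreshed: datetime, now: datetime) -> bool:
        """True when the corpus is older than the agreed refresh cadence."""
        return now - last_refreshed > self.refresh_cadence

    def may_call(self, api: str) -> bool:
        """True only for APIs explicitly granted to this agent."""
        return api in self.allowed_apis


spec = ContextSpec(
    corpus="procurement-policies",          # placeholder corpus name
    chunk_size_tokens=512,                  # placeholder chunking choice
    embedding_model="example-embedding-model",
    reranker="example-reranker",
    refresh_cadence=timedelta(days=7),
    allowed_apis=frozenset({"erp.create_po", "erp.get_po_status"}),
)
```

Keeping the specification in one reviewable object makes changes to the corpus, cadence, or API grants visible as governance changes rather than incidental code edits.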

Data governance is not an afterthought in the ADLC; it is a core stage. An agent that accesses customer data is subject to GDPR requirements regardless of what task it performs. An agent that retrieves financial data must respect access controls at the row and column level, not just the API level. The build decision between low-code and pro-code platforms affects this stage directly: platform-managed agents inherit the platform's data governance controls, while custom-built agents require purpose-built governance layers.

The context stage also addresses context poisoning — a failure mode unique to agents. If an agent's retrieval corpus contains outdated or incorrect information, the agent will surface it confidently. Unlike a database query, an agent weaves retrieved information into natural-language responses that obscure the provenance of each claim. Tracing a wrong answer back to the document chunk that caused it requires explicit provenance tracking — logging which documents were retrieved and how the agent incorporated them.
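As a rough illustration of provenance tracking, each answer can be logged together with the chunks that were retrieved for it, so that a bad document can later be traced to every answer it touched. The record shape and function names are assumptions, and the sketch covers only which documents were retrieved, not how the agent incorporated them.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RetrievedChunk:
    doc_id: str
    chunk_id: str
    score: float  # retrieval score as reported by the retriever


def provenance_record(request_id: str, query: str,
                      chunks: list[RetrievedChunk], answer: str) -> str:
    """Serialise one answer and the chunks behind it as a JSON log line."""
    return json.dumps({
        "request_id": request_id,
        "query": query,
        "chunks": [asdict(c) for c in chunks],
        "answer": answer,
    })


def answers_using(doc_id: str, log_lines: list[str]) -> list[str]:
    """Request ids of all logged answers that retrieved the given document."""
    hits = []
    for line in log_lines:
        record = json.loads(line)
        if any(c["doc_id"] == doc_id for c in record["chunks"]):
            hits.append(record["request_id"])
    return hits
```

When a document turns out to be outdated, `answers_using` gives the list of responses that need review.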

Stage 5: Develop — evaluation frameworks, not unit tests

The development stage is where the gap between SDLC and ADLC is widest. Traditional software relies on unit tests with deterministic expected outcomes: given input X, the system must return output Y. Agent development requires evaluation frameworks instead. A unit test asserts a specific output. An evaluation framework assesses whether an output falls within an acceptable range across multiple quality dimensions. A unit test runs in milliseconds and produces a binary result. An evaluation run may take minutes, consume API credits, and produce a multi-dimensional quality score.
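The difference can be sketched in a few lines: instead of asserting one exact output, an evaluation runs each golden case several times and reports scores that are compared against a range. The scorer here is a deliberately simple token-overlap measure chosen for illustration; a real evaluation would likely use richer scoring, and all names are hypothetical.

```python
from statistics import mean
from typing import Callable


def token_overlap(output: str, expected: str) -> float:
    """Toy scorer: share of expected tokens that appear in the output."""
    want = set(expected.lower().split())
    got = set(output.lower().split())
    return len(want & got) / len(want) if want else 1.0


def evaluate(agent: Callable[[str], str],
             scorer: Callable[[str, str], float],
             golden_set: list[tuple[str, str]],
             runs_per_case: int = 3) -> dict[str, float]:
    """Score every golden case several times and summarise the results."""
    scores = [
        scorer(agent(prompt), expected)
        for prompt, expected in golden_set
        for _ in range(runs_per_case)
    ]
    return {"mean_score": mean(scores), "worst_score": min(scores)}
```

Repeating each case matters because a single run of a probabilistic system says little about how it behaves on average or at its worst.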

The practical evaluation stack has three layers. The first is deterministic tests for deterministic components — does the tool-calling interface correctly parse function signatures, does the retrieval pipeline return relevant documents for known queries, does the guardrail system block known-harmful inputs. These belong in the CI/CD pipeline.

The second layer is LLM-as-judge evaluation for generative outputs. A separate model evaluates whether the agent's outputs are faithful to source material, factually consistent, and within scope. This pattern, standardised by Galileo AI and Arize among others, is imperfect — the judge model has its own biases — but it provides automated quality assessment at scale.

The third layer is human evaluation on a sampling basis. Domain experts review five to ten per cent of outputs and assess dimensions that automated evaluation cannot capture: Is the tone appropriate? Does the recommendation make business sense? Human evaluation is expensive and slow, but it grounds the automated layers and catches failure modes that the judge model misses.
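Selecting outputs for human review can be made reproducible by hashing the output identifier instead of drawing a random number. This is one possible approach, not something the frameworks prescribe; the function name and the default rate are assumptions.

```python
import hashlib


def selected_for_review(output_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of outputs for expert review."""
    digest = hashlib.sha256(output_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Because the decision depends only on the identifier, the same output is always either in or out of the sample, which makes audits of the review process repeatable.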

This three-layer approach connects directly to the model lifecycle management discipline. The evaluation framework established during development becomes the monitoring framework used in production. The golden test set curated during development becomes the regression test set used to validate model updates, prompt changes, and retrieval pipeline modifications.

Stage 6: Launch — staged rollout with human-in-the-loop

Launching an agent is not deploying code. Code deployment is binary — the new version replaces the old, and if it passes health checks, it is live. Agent deployment is gradual, because the only reliable test environment for a probabilistic system is production traffic.

The staged rollout pattern mirrors what Glean, IBM, and EPAM all recommend: start with shadow mode, where the agent processes real inputs but its outputs are reviewed by humans before being acted upon. When the acceptance rate exceeds the defined threshold (typically 90 to 95 per cent), move to supervised mode, where the agent acts autonomously but a human reviews a sample of outputs after the fact. When supervised mode demonstrates stable quality for a defined period, move to autonomous mode with monitoring.

Each stage transition is a governance decision, not a technical one. The decision to move from supervised to autonomous should involve the business owner, the compliance function, and the technical team — because the decision is about risk tolerance, not accuracy metrics alone. An agent with 95 per cent accuracy operating autonomously in a low-risk workflow is a straightforward approval. The same accuracy in a high-risk compliance workflow may not be sufficient regardless of the metrics.
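The rollout can be modelled as a small state machine in which a metric threshold is necessary but never sufficient: promotion also needs explicit governance sign-off. In this sketch the 0.90 threshold follows the range mentioned above, while the thirty-day stability period and all names are assumptions.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"
    SUPERVISED = "supervised"
    AUTONOMOUS = "autonomous"


def next_mode(mode: Mode, acceptance_rate: float, stable_days: int,
              governance_approved: bool, threshold: float = 0.90,
              required_stable_days: int = 30) -> Mode:
    """Return the mode the agent should run in after this review."""
    # No promotion without sign-off or with an acceptance rate at or below threshold.
    if not governance_approved or acceptance_rate <= threshold:
        return mode
    if mode is Mode.SHADOW:
        return Mode.SUPERVISED
    if mode is Mode.SUPERVISED and stable_days >= required_stable_days:
        return Mode.AUTONOMOUS
    return mode
```

The function only ever promotes by one step, so an agent cannot jump from shadow mode straight to autonomous operation.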

The launch stage is also where Copilot Studio agents and pro-code agents diverge operationally. Platform-managed agents benefit from built-in deployment infrastructure — versioning, rollback, traffic splitting — that pro-code agents must build or integrate. The ADLC applies to both, but the implementation effort for the launch stage is significantly lower on managed platforms.

Stage 7: Monitor and improve — the stage that never ends

The seventh stage is where most organisations fail, and it is where the ADLC earns its value. Traditional software, once deployed and stable, can run for months without active monitoring beyond infrastructure health checks. Agents cannot. They operate in environments that change continuously — input distributions shift, upstream models update, retrieval corpora evolve, and user behaviour adapts to the agent's presence.

The observability stack for AI in production covers the technical implementation. The ADLC frames monitoring as four concurrent streams: performance monitoring (do outputs still meet Stage 3 acceptance criteria), drift detection (have input distributions or output patterns shifted beyond thresholds), cost monitoring (is inference spend per task still within the economic model), and governance monitoring (does the agent still operate within its defined scope and escalate appropriately).
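For the drift-detection stream, one commonly used measure is the population stability index (PSI), which compares the share of each category in a baseline window with its share in the current window. PSI is a technique chosen here for illustration and is not named by the frameworks; the alert threshold in the usage note is an assumption that needs tuning per agent.

```python
import math
from collections import Counter


def psi(baseline: list[str], current: list[str], eps: float = 1e-6) -> float:
    """Population stability index over categorical values such as intents."""
    if not baseline or not current:
        raise ValueError("both windows must contain at least one value")
    base_counts, cur_counts = Counter(baseline), Counter(current)
    total = 0.0
    for category in set(baseline) | set(current):
        # eps avoids log(0) when a category is absent from one window.
        p = max(base_counts[category] / len(baseline), eps)
        q = max(cur_counts[category] / len(current), eps)
        total += (q - p) * math.log(q / p)
    return total
```

A monitoring job might compute `psi` daily over classified input intents and raise a drift flag when the value exceeds an agreed threshold, for example 0.2.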

The "improve" half of Stage 7 closes the loop. When monitoring detects degradation, the ADLC cycles back to the appropriate earlier stage. Performance degradation triggers a return to Stage 5 for evaluation and remediation. A change in business requirements triggers a return to Stage 2 for scope redefinition. A new data source triggers a return to Stage 4. The lifecycle is a loop, not a line — and the loop runs continuously for every agent in production.
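The loop-back rules described above amount to a lookup from monitoring finding to the stage that should be re-entered. A minimal sketch, with the finding labels invented for illustration:

```python
from enum import IntEnum


class Stage(IntEnum):
    DESIGN = 2
    CONTEXT = 4
    DEVELOP = 5


# Finding labels are hypothetical; the stage mapping follows the text.
RETURN_TO = {
    "performance_degradation": Stage.DEVELOP,
    "business_requirement_change": Stage.DESIGN,
    "new_data_source": Stage.CONTEXT,
}


def stage_for(finding: str) -> Stage:
    """Return the lifecycle stage to re-enter for a monitoring finding."""
    try:
        return RETURN_TO[finding]
    except KeyError:
        raise ValueError(f"no return stage defined for '{finding}'") from None
```

Making the mapping explicit forces a decision about every finding type, including the ones nobody has assigned yet.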

Agent sprawl: the problem the ADLC solves

Without a lifecycle methodology, enterprises hit agent sprawl. The innovation team builds three agents. IT builds two more. A business unit commissions one from a vendor. Within twelve months, the organisation has ten to fifteen agents scattered across teams, vendors, and platforms. Nobody knows the full inventory. Nobody owns the cross-cutting concerns — security, data governance, cost management, model updates. Nobody can answer the question that every CFO eventually asks: what is the total cost and the total value of our agent portfolio?

The ADLC prevents sprawl by establishing a consistent methodology from the first agent. Every agent has a business KPI owner, defined scope, guardrails, acceptance criteria, and a monitoring plan. When the tenth agent is proposed, the organisation can evaluate it against the same criteria as the first — and decide whether it should be built, consolidated with an existing agent, or deprioritised.
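A simple agent registry makes the inventory and the lifecycle gaps visible. The sketch below assumes a flat record per agent; the field names mirror the five items listed above, and the cost figures in any real registry would come from the organisation's own accounting.

```python
from dataclasses import dataclass


@dataclass
class AgentRecord:
    """Hypothetical registry entry for one agent in the portfolio."""
    name: str
    kpi_owner: str = ""
    scope: str = ""
    guardrails: str = ""
    acceptance_criteria: str = ""
    monitoring_plan: str = ""
    monthly_cost_eur: float = 0.0


REQUIRED = ("kpi_owner", "scope", "guardrails",
            "acceptance_criteria", "monitoring_plan")


def gaps(agent: AgentRecord) -> list[str]:
    """Lifecycle items that are still empty for this agent."""
    return [name for name in REQUIRED if not getattr(agent, name).strip()]


def portfolio_cost(agents: list[AgentRecord]) -> float:
    """Total monthly cost across all registered agents."""
    return sum(a.monthly_cost_eur for a in agents)
```

Running `gaps` across the registry gives a first answer to which agents lack an owner or a monitoring plan, and `portfolio_cost` gives the cost side of the question the CFO will ask.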

This is where the ADLC connects to the broader operating discipline. The AI Operating System provides the governance layer and the continuous improvement framework across all AI initiatives. The ADLC provides the development methodology within it — the specific practices for building, deploying, and managing individual agents. The Operating System answers "how do we govern AI at the enterprise level?" The ADLC answers "how do we build and operate each agent responsibly?"

The enterprises scaling successfully — the 23 per cent in McKinsey's data — are not using better models or better frameworks than the 39 per cent stuck in experimentation. They have a methodology. They treat agents as managed assets, not as experiments. The ADLC is the methodology that makes that possible.

A Fit Call maps your current agent portfolio against the ADLC stages — identifying which agents lack business KPI alignment, which are operating without defined guardrails, and where lifecycle gaps create the sprawl and governance risk that stall scaling.

References: Glean, "Enterprise ADLC Framework," May 2026; IBM, "Agent Development Lifecycle: Plan, Code & Build, Test & Release, Deploy," 2026; EPAM, "ADLC: A Structured Approach to Building AI Agents," 2026; McKinsey & Company, "The State of AI: How Organizations Are Rewiring to Capture Value," 2025 (23% scaling, 39% experimentation); Anthropic, "The 2026 State of AI Agents Report," 2026 (57% multi-step workflows, 16% cross-functional agents); Microsoft, "Cyber Pulse: An AI Security Report," February 2026 (80% Fortune 500 adoption); Arthur AI, "Agent Development Lifecycle Best Practices," 2026; Salesforce, "Building Enterprise AI Agents: Lifecycle Methodology," 2026; Galileo AI, "The MLOps Guide to Transform Model Failures Into Production Success," 2026.