The Agent Development Lifecycle: Why Your SDLC Does Not Work for AI Agents

The software development lifecycle has been refined over four decades. Requirements, design, implementation, testing, deployment, maintenance — the SDLC works because the systems it governs are deterministic. Given the same input, they produce the same output. When they fail, they throw exceptions. When they pass tests, they work. When they ship, they stay shipped until someone changes the code.

AI agents are none of these things. Given the same input, an agent may produce different outputs depending on temperature, the contents of its context window, and the stochastic nature of token sampling. When agents fail, they do not throw exceptions — they produce confident, well-formatted, wrong answers. When they pass evaluation at deployment, they can degrade within weeks as input distributions shift and upstream providers push silent model updates. And when they ship, the underlying model, the retrieval corpus, and the business context all keep changing without anyone touching the agent's code.

Enterprise teams have noticed the cost of pretending otherwise. McKinsey's State of AI 2025, drawn from nearly 2,000 respondents across 105 nations, found that 23 per cent of organisations are scaling an agentic AI system somewhere in the enterprise, while 62 per cent are at least experimenting with agents — yet within any single business function, no more than 10 per cent report scaling. The interest is broad; the production footprint is thin. That is not a technology problem. The models work. The frameworks work. The gap is methodological: organisations apply deterministic engineering practices to probabilistic systems, then wonder why the results are unpredictable.

The Agent Development Lifecycle — ADLC — is the emerging answer, and it is converging fast. Glean published its seven-stage Enterprise ADLC in May 2026, running from opportunity identification through continuous monitoring. IBM frames a five-phase lifecycle — Plan, Code & Build, Test & Release, Deploy, Operate — with the build-and-test phases running as an iterative loop and the deploy-and-operate phases as a second one. The structures differ; the insight is shared. Agents need their own methodology because the assumptions baked into the SDLC — deterministic outputs, binary test outcomes, stable post-deployment behaviour — simply do not hold. What follows is the seven-stage shape, read through the lens of what actually breaks in a DACH mid-market deployment.

Opportunity: tie every agent to a business KPI

The first stage is not "pick a use case." It is "identify a business outcome an agent can measurably improve." The distinction matters because the most common failure mode in enterprise agent work is building agents that demonstrate technical capability without connecting to a number the organisation already tracks.

A procurement agent that automates purchase-order creation is a technology demonstration. A procurement agent that cuts purchase-to-pay cycle time from fourteen days to three, and is measured against that target every month, is a business initiative. The technology is identical. The framing decides whether the agent survives the first budget review.

So the opportunity stage produces four outputs: the specific KPI the agent will improve, its current baseline, the target improvement, and the economic value of reaching it. Without these, the agent is an experiment — and experiments, as the pilot-to-production research consistently documents, die in the transition from demo to deployment. Anthropic's 2026 State of AI Agents report, based on a survey of more than 500 technical leaders, found that 57 per cent of organisations already deploy agents for multi-step workflows and 16 per cent run cross-functional agents spanning several teams. Adoption is no longer the constraint. KPI alignment is. Without it, those agents stay point solutions that never compound into organisational value. This is where governance actually starts — not with a security control, but with the unglamorous question of whether this agent should exist at all.
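The four outputs can be held in one small record that the monthly review reads from. This is a minimal sketch, not a prescribed schema: the field names are hypothetical, and the value of EUR 2,000 per day saved is an invented illustration. Only the fourteen-day baseline and three-day target come from the procurement example above. It assumes a KPI where lower is better.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opportunity:
    kpi: str               # a metric the organisation already tracks
    baseline: float        # current value of the KPI
    target: float          # value the agent is expected to reach
    value_per_unit: float  # ASSUMED economic value of one unit of improvement

    def target_improvement(self) -> float:
        # Lower-is-better KPI: improvement is baseline minus target.
        return self.baseline - self.target

    def economic_value(self) -> float:
        return self.target_improvement() * self.value_per_unit

    def progress(self, measured: float) -> float:
        """Share of the targeted improvement achieved so far."""
        return (self.baseline - measured) / self.target_improvement()


# Baseline and target from the procurement example; 2000.0 is hypothetical.
p2p = Opportunity("purchase-to-pay cycle time (days)", 14.0, 3.0, 2000.0)
```

Calling `p2p.progress(measured)` with each month's measured cycle time gives the budget review a single number to track against the target.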

Design: scope, guardrails, and escalation paths

Agent design differs from software design because it must account for what the agent should not do, not only what it should. A software function either handles an input or it does not; the type system and the compiler enforce the boundary. An agent will cheerfully attempt any input, in scope or not, unless something explicit stops it.

The design stage therefore produces three artefacts with no real equivalent in traditional development. The scope definition specifies exactly which tasks the agent is authorised to perform, which data it may touch, and which decisions it may make on its own. The delegation framework applies directly here — an agent needs the same clarity of scope, authority, boundaries, and accountability you would give a human employee handling the same function.

The guardrail specification defines the hard constraints the agent cannot violate regardless of its reasoning. A customer-facing agent cannot commit to pricing outside the approved range. A compliance agent cannot sign off its own risk assessments. These are not soft guidelines — they are constraints enforced in code, tested before deployment, and monitored in production. For systems that fall under the EU AI Act's high-risk category, this artefact is also where you build the human-oversight measures that Article 14 requires: a named person must be able to understand the system's limits, override its output, and resist the automation bias of over-trusting it.
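"Enforced in code" means the check sits outside the model, on the path every action must take. The sketch below shows the pricing example in that shape. The class and function names and the price range are all hypothetical; a real deployment would hook the check into whatever layer executes the agent's tool calls.

```python
from dataclasses import dataclass


class GuardrailViolation(Exception):
    """Raised when a proposed action breaks a hard constraint."""


@dataclass(frozen=True)
class PriceGuardrail:
    min_price: float
    max_price: float

    def check(self, proposed_price: float) -> float:
        if not (self.min_price <= proposed_price <= self.max_price):
            raise GuardrailViolation(
                f"price {proposed_price} outside approved range "
                f"[{self.min_price}, {self.max_price}]"
            )
        return proposed_price


def commit_quote(proposed_price: float, guardrail: PriceGuardrail) -> dict:
    # The check runs before the commit and cannot be skipped by the
    # agent's reasoning, because the agent never calls the commit directly.
    price = guardrail.check(proposed_price)
    return {"status": "committed", "price": price}
```

Because the constraint is ordinary code, it can be covered by deterministic tests before deployment and counted in production whenever it fires.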

The escalation design specifies what happens at the edge of competence. A well-built agent recognises when it has reached the limit of what it can safely handle and routes the situation onward — with full context, not a bare error code. The multi-agent architecture decisions intersect here: whether the agent sits in a hub-and-spoke, peer-to-peer, or hierarchical orchestration shapes what it can be designed to do. The ADLC does not replace the architecture choice; it gives you a systematic way to make it.
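Escalating "with full context" can be made concrete as a structured hand-off rather than an error code. The following is a sketch under stated assumptions: it presumes the surrounding system supplies an in-scope flag and some confidence estimate (how that estimate is produced is out of scope here), and every name and the 0.8 threshold are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Escalation:
    agent_id: str
    reason: str
    user_request: str
    steps_taken: list = field(default_factory=list)
    sources_consulted: list = field(default_factory=list)
    suggested_next_step: str = ""


def decide(agent_id, request, in_scope, confidence, steps, sources,
           min_confidence=0.8):
    """Return ('handle', None) or ('escalate', Escalation) for one request."""
    if not in_scope:
        reason = "request outside authorised scope"
    elif confidence < min_confidence:
        reason = f"confidence {confidence:.2f} below {min_confidence:.2f}"
    else:
        return "handle", None
    # The human receives what was asked, what was tried, and which
    # sources were consulted, so the case does not start from zero.
    return "escalate", Escalation(
        agent_id=agent_id,
        reason=reason,
        user_request=request,
        steps_taken=list(steps),
        sources_consulted=list(sources),
        suggested_next_step="route to the human owner of this workflow",
    )
```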

Performance: acceptance criteria for probabilistic outputs

This is where the SDLC breaks most visibly. In traditional software a test passes or fails; the system returns the correct value or it does not. There is no ambiguity about acceptability.

Agents produce outputs sampled from a distribution. Two runs on the same input may be semantically equivalent but textually different. A classification agent might place borderline cases in different buckets. A reasoning agent might take different paths to the same conclusion, or to a different one. Performance criteria have to absorb that variability. The evaluation framework covers the mechanics. The ADLC reframes their purpose: evaluation is not a gate the agent passes once, but a continuous measurement that defines the operating envelope within which the agent is allowed to run.

In practice that means acceptance criteria across five dimensions: task accuracy against a golden test set, consistency across repeated runs, latency within the business process's tolerance, inference cost per completed task, and safety — guardrails respected, escalation triggered when it should be. These are operating ranges, not pass/fail lines. An agent holding steady accuracy, sub-two-second p95 latency, and a per-task cost that fits the economic model is inside its envelope. When any dimension drifts out, monitoring flags it for intervention. Pick the targets to match the workflow, not a vendor benchmark: the right accuracy bar for a low-risk internal summariser is not the bar for an agent touching customer money.
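An operating envelope can be written down as one range per dimension, plus a check that reports which dimensions have left their range. In the sketch below only the two-second p95 latency bound comes from the text. Every other threshold is a hypothetical placeholder to be replaced with targets that fit the workflow. The consistency measure shown (share of repeated runs agreeing with the most common answer) is one simple option among several.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    low: float
    high: float

    def contains(self, value: float) -> bool:
        return self.low <= value <= self.high  # inclusive bounds


# Hypothetical targets, except the 2-second p95 latency bound.
ENVELOPE = {
    "accuracy": Range(0.92, 1.0),
    "consistency": Range(0.90, 1.0),
    "p95_latency_s": Range(0.0, 2.0),
    "cost_per_task_eur": Range(0.0, 0.15),
    "guardrail_pass_rate": Range(1.0, 1.0),
}


def consistency(outputs: list) -> float:
    """Share of repeated runs that agree with the most common output."""
    return Counter(outputs).most_common(1)[0][1] / len(outputs)


def out_of_envelope(measured: dict) -> list:
    """Names of the dimensions whose measured value left its range."""
    return [name for name, rng in ENVELOPE.items()
            if not rng.contains(measured[name])]
```

An empty list from `out_of_envelope` means the agent is inside its envelope; any returned name is what monitoring would flag for intervention.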

Context and input: data governance as a first-class concern

The single largest determinant of agent output quality is not the model, the prompt, or the framework. It is the data the agent can reach. An agent with accurate, current, well-structured data and a mediocre prompt will usually beat an agent with a brilliant prompt and stale, incomplete, or messy data.

The context stage defines which sources the agent uses, how data is retrieved, how freshness is maintained, and how access is governed. For RAG-based agents that means specifying the corpus, the chunking strategy, the embedding model, the retrieval and re-ranking approach, and the refresh cadence. For tool-using agents it means defining which APIs the agent may call and what happens when one is down. Governance is not an afterthought here; it is the stage. An agent that processes customers' personal data falls under the GDPR (DSGVO) regardless of its task, and one that retrieves financial records must respect access controls at row and column level, not merely at the API. The build decision between low-code and pro-code platforms lands directly on this stage: platform-managed agents inherit the platform's data-governance controls, while custom builds need a purpose-built governance layer.
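Row- and column-level control means filtering records before they ever enter the agent's context, not trusting the agent to ignore what it should not see. The sketch below illustrates the idea with an in-memory policy. The role name, columns, and entity code are hypothetical, and a production system would normally push these rules down into the database or data platform rather than apply them in application code.

```python
def filter_for_role(rows, role, policy):
    """Apply the row-level rule, then the column-level rule, for one role."""
    rule = policy[role]
    return [
        {col: row[col] for col in rule["columns"] if col in row}
        for row in rows
        if rule["row_allowed"](row)
    ]


# Hypothetical policy: the accounts-payable agent sees one legal entity
# and never sees bank details.
POLICY = {
    "ap_agent": {
        "columns": ["invoice_id", "amount", "entity"],
        "row_allowed": lambda row: row["entity"] == "DE01",
    }
}

ROWS = [
    {"invoice_id": "INV-1", "amount": 1200.0, "entity": "DE01",
     "bank_details": "example-only"},
    {"invoice_id": "INV-2", "amount": 560.0, "entity": "AT02",
     "bank_details": "example-only"},
]

visible = filter_for_role(ROWS, "ap_agent", POLICY)
```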

This stage also addresses context poisoning, a failure mode peculiar to agents. If the retrieval corpus holds something outdated or wrong, the agent will surface it with full confidence. Unlike a database query, the agent weaves retrieved material into natural language that hides the provenance of each claim. Tracing a wrong answer back to the chunk that caused it requires explicit provenance logging — recording which documents were retrieved and how the agent used them. Under the EU AI Act's Article 12, high-risk systems must keep automatic event logs anyway; building provenance in from the start serves both the auditor and the engineer.
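Provenance logging can start small: one append-only event per answer, recording which chunks were retrieved. The sketch below uses only the Python standard library; the event fields and function names are hypothetical. It covers the "which documents were retrieved" half of the requirement. Recording how the agent used each chunk needs additional instrumentation, and nothing here is a claim about what a compliant Article 12 log must contain.

```python
import json
import time
import uuid


def provenance_event(query, chunks, answer, now=time.time):
    """Build one log event linking an answer to the chunks behind it."""
    return {
        "event_id": str(uuid.uuid4()),
        "timestamp": now(),
        "query": query,
        "retrieved": [
            {"doc_id": c["doc_id"], "chunk_id": c["chunk_id"], "score": c["score"]}
            for c in chunks
        ],
        "answer": answer,
    }


def append_event(log_lines, event):
    # One JSON object per line keeps the log append-only and searchable.
    log_lines.append(json.dumps(event, sort_keys=True))


def events_using(log_lines, doc_id):
    """Find every logged answer that retrieved the given document."""
    hits = []
    for line in log_lines:
        event = json.loads(line)
        if any(r["doc_id"] == doc_id for r in event["retrieved"]):
            hits.append(event["event_id"])
    return hits
```

When a document turns out to be outdated, `events_using` lists the answers that may need correcting, which is the trace from wrong answer to source chunk run in reverse.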

Develop: evaluation frameworks, not unit tests

The development stage is where the SDLC and the ADLC diverge most widely. Traditional software leans on unit tests with deterministic expectations: given input X, return output Y. Agent development needs evaluation frameworks instead. A unit test asserts a specific output; an evaluation framework assesses whether an output sits inside an acceptable range across several quality dimensions. A unit test runs in milliseconds and returns a binary; an evaluation run may take minutes, burn API credits, and produce a multi-dimensional score.

A workable evaluation stack has three layers. The first is deterministic tests for the deterministic components — does the tool-calling interface parse function signatures correctly, does the retrieval pipeline return relevant documents for known queries, does the guardrail layer block known-harmful inputs. These belong in CI/CD. The second is model-graded evaluation for generative outputs, where a separate model judges whether the agent's output is faithful to its sources, factually consistent, and in scope. The technique is imperfect — the judge carries its own biases — but it gives you automated quality assessment at scale, and IBM's lifecycle folds exactly this kind of structured eval and red-teaming into its Test & Release loop. The third layer is human review on a sampling basis: a small share of outputs read by domain experts who catch what automation misses — tone, business sense, the subtly wrong recommendation that scores well on every metric. It is slow and expensive, and it is what keeps the automated layers honest.
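The three layers can share one harness. The sketch below is deliberately generic: the judge is any callable that returns a score between 0 and 1 (in practice a call to a separate model, here left as a parameter), and the 0.7 threshold and 5 per cent sampling share are hypothetical defaults, not recommendations.

```python
import random
from typing import Callable


def run_deterministic_checks(checks: dict) -> dict:
    """Layer 1: named pass/fail checks suitable for CI/CD."""
    return {name: bool(check()) for name, check in checks.items()}


def model_graded(outputs: list, judge: Callable[[dict], float],
                 threshold: float = 0.7) -> dict:
    """Layer 2: score each output with a judge and list the failures."""
    scores = [judge(o) for o in outputs]
    return {
        "mean_score": sum(scores) / len(scores),
        "failed": [o["id"] for o, s in zip(outputs, scores) if s < threshold],
    }


def sample_for_human_review(outputs: list, share: float = 0.05,
                            seed: int = 0) -> list:
    """Layer 3: a reproducible random sample for domain experts to read."""
    rng = random.Random(seed)
    k = max(1, round(len(outputs) * share))
    return rng.sample(outputs, k)
```

Seeding the sampler makes the human-review sample reproducible, so a disputed verdict can be re-examined on exactly the same outputs.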

This three-layer approach connects straight to model lifecycle management. The evaluation framework built during development becomes the monitoring framework used in production, and the golden test set curated here becomes the regression set that validates every later model update, prompt change, and retrieval tweak.
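Using the golden set as a regression set amounts to a release gate: run the current and the candidate configuration on the same cases and block the change if quality drops by more than an agreed margin. This is a sketch, assuming agents can be called as plain functions and that exact-match scoring is adequate. Generative outputs would need a graded comparison instead, and the two-point tolerance is hypothetical.

```python
def accuracy(agent, golden_set):
    """Exact-match accuracy of an agent callable on a golden set."""
    hits = sum(1 for case in golden_set
               if agent(case["input"]) == case["expected"])
    return hits / len(golden_set)


def regression_gate(current, candidate, golden_set, max_drop=0.02):
    """Compare a candidate (new model, prompt, or retrieval setup)
    against the current agent on the same golden set."""
    before = accuracy(current, golden_set)
    after = accuracy(candidate, golden_set)
    return {
        "before": before,
        "after": after,
        "release": after >= before - max_drop,
    }
```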

Launch: staged rollout with human-in-the-loop

Launching an agent is not deploying code. Code deployment is binary — the new version replaces the old and, if it clears health checks, it is live. Agent rollout is gradual, because the only reliable test environment for a probabilistic system is production traffic.

The pattern both Glean and IBM describe is progressive. Start in shadow mode, where the agent processes real inputs but a human reviews its outputs before anyone acts on them. When the acceptance rate clears the defined threshold, move to supervised mode, where the agent acts but a human samples its outputs after the fact. When supervised mode holds quality for a defined period, move to autonomous operation with monitoring — IBM frames this same progression as rolling out in stages behind a gateway that enforces policy. Each transition is a governance decision, not a technical one. Moving from supervised to autonomous should pull in the business owner, the compliance function, and the technical team, because the question is risk tolerance, not accuracy alone. The same accuracy that justifies autonomy in a low-risk internal workflow may be plainly insufficient in a regulated, customer-facing one.
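The progression reads naturally as a small state machine in which a quality threshold is necessary but not sufficient. The sketch below makes one simplifying assumption that is stricter than the text: it requires the same three sign-offs at every transition, not only at the step to autonomy. Mode names, role labels, and thresholds are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    SHADOW = "shadow"          # human reviews before anyone acts
    SUPERVISED = "supervised"  # agent acts, human samples afterwards
    AUTONOMOUS = "autonomous"  # agent acts, monitoring watches


NEXT_MODE = {Mode.SHADOW: Mode.SUPERVISED, Mode.SUPERVISED: Mode.AUTONOMOUS}
REQUIRED_APPROVERS = {"business_owner", "compliance", "technical_team"}


def promote(mode: Mode, acceptance_rate: float, threshold: float,
            approvals: set) -> Mode:
    """Return the next mode only if quality AND governance both allow it."""
    if mode not in NEXT_MODE:
        return mode  # already autonomous
    if acceptance_rate < threshold:
        return mode  # quality bar not met
    if not REQUIRED_APPROVERS <= approvals:
        return mode  # a required sign-off is missing
    return NEXT_MODE[mode]
```

Keeping the approvals check separate from the threshold check mirrors the point that each transition is a risk decision and accuracy alone does not settle it.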

Launch is also where platform-managed agents, such as those built in Copilot Studio, and pro-code agents diverge operationally. Platform-managed agents come with versioning, rollback, and traffic-splitting built in; pro-code agents have to build or integrate the same. The ADLC applies to both; the implementation effort is simply lower on the managed platform.

Monitor and improve: the stage that never ends

The final stage is where most organisations fail, and where the ADLC earns its keep. Traditional software, once stable, can run for months on infrastructure health checks alone. Agents cannot. They live in environments that change continuously — input distributions move, upstream models update, retrieval corpora evolve, and users adapt their behaviour to the agent's presence.

The observability stack for AI in production covers the implementation. The ADLC frames monitoring as four concurrent streams: performance (do outputs still meet the acceptance criteria), drift (have inputs or output patterns shifted past threshold), cost (is per-task spend still inside the economic model), and governance (does the agent still stay in scope and escalate when it should). The "improve" half closes the loop. When monitoring detects degradation, the lifecycle cycles back to the relevant earlier stage — performance decay returns to evaluation, a requirements change returns to design, a new data source returns to context. It is a loop, not a line, and it runs continuously for every agent in production.
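Of the four streams, drift is the one most easily reduced to arithmetic. One simple option, shown here as a sketch, compares the mix of input categories in a recent window against the mix seen at launch, using total variation distance (half the sum of absolute differences between the two distributions). The category labels and the 0.2 threshold are hypothetical, and other distance measures would serve equally well.

```python
from collections import Counter


def distribution(labels: list) -> dict:
    """Relative frequency of each category in a window of inputs."""
    counts = Counter(labels)
    total = len(labels)
    return {label: n / total for label, n in counts.items()}


def total_variation(p: dict, q: dict) -> float:
    """0.0 for identical distributions, 1.0 for fully disjoint ones."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted(baseline_labels: list, current_labels: list,
            threshold: float = 0.2) -> bool:
    distance = total_variation(distribution(baseline_labels),
                               distribution(current_labels))
    return distance > threshold
```

A `True` result does not say what went wrong; it is the trigger to cycle back to the relevant earlier stage and re-run the evaluation there.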

Agent sprawl: the problem the ADLC actually solves

Without a lifecycle, enterprises hit sprawl. The innovation team builds three agents, IT builds two, a business unit buys one from a vendor. Inside a year the organisation runs ten to fifteen agents scattered across teams, vendors, and platforms. Nobody holds the full inventory. Nobody owns the cross-cutting concerns — security, data governance, cost, model updates. Nobody can answer the question every CFO eventually asks: what is the total cost, and the total value, of our agent portfolio?

This is not hypothetical for the mid-market. Microsoft's Cyber Pulse report of February 2026 found that more than 80 per cent of Fortune 500 firms now run active AI agents, that 29 per cent of employees have already turned to unsanctioned ones, and that fewer than half of organisations have implemented dedicated security controls for generative AI. The capability is racing ahead of the governance — and a Mittelstand firm with a leaner security function feels that gap sooner, not later.

The ADLC prevents sprawl by imposing one methodology from the first agent. Every agent has a KPI owner, defined scope, guardrails, acceptance criteria, and a monitoring plan. When the tenth agent is proposed, it is assessed against the same criteria as the first — and the organisation can decide, deliberately, whether to build it, fold it into an existing agent, or drop it. This is where the ADLC meets the broader operating discipline. The AI Operating System provides the governance and continuous-improvement layer across all AI initiatives; the ADLC provides the development methodology inside it. The Operating System answers "how do we govern AI at the enterprise level?" The ADLC answers "how do we build and run each agent responsibly?"
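The inventory itself can be a plain registry in which the five required artefacts are fields, so that a missing one is visible rather than implied. The sketch below is illustrative only: field names, agent names, and all cost and value figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

REQUIRED = ("kpi_owner", "scope", "guardrails",
            "acceptance_criteria", "monitoring_plan")


@dataclass
class AgentRecord:
    name: str
    kpi_owner: Optional[str] = None
    scope: Optional[str] = None
    guardrails: Optional[str] = None
    acceptance_criteria: Optional[str] = None
    monitoring_plan: Optional[str] = None
    monthly_cost: float = 0.0
    monthly_value: float = 0.0


def gaps(agent: AgentRecord) -> list:
    """Lifecycle artefacts this agent is still missing."""
    return [name for name in REQUIRED if not getattr(agent, name)]


def portfolio_totals(agents: list) -> dict:
    """The CFO question: total cost and total value across all agents."""
    cost = sum(a.monthly_cost for a in agents)
    value = sum(a.monthly_value for a in agents)
    return {"cost": cost, "value": value, "net": value - cost}
```

Running `gaps` over every registered agent shows at a glance which ones were built outside the lifecycle, for example a vendor-bought agent with a cost but no KPI owner.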

The firms scaling successfully are not using better models or better frameworks than the ones stuck in experimentation. They have a methodology. They treat agents as managed assets, not experiments — and that, far more than any model choice, is what carries an agent from a convincing demo to a line item the CFO will keep funding.

A Fit Call maps your current agent portfolio against the ADLC stages — surfacing which agents lack KPI alignment, which run without defined guardrails, and where the lifecycle gaps create the sprawl and governance risk that stall scaling.


References: Glean, "Introducing the Agent Development Lifecycle (ADLC)," May 2026; IBM, "What Is the Agent Development Lifecycle?," 2026; McKinsey & Company, "The State of AI 2025," November 2025 (23% scaling agents, 62% at least experimenting, ≤10% scaling within any single function); Anthropic, "The 2026 State of AI Agents Report," 2026 (57% multi-step workflows, 16% cross-functional); Microsoft, "Cyber Pulse: An AI Security Report," February 2026 (80%+ Fortune 500 active agents, 29% unsanctioned use); European Union, "AI Act, Articles 12 and 14," 2024.

