Every AI workflow makes decisions. That is the point. The question is never whether AI should decide — it is which decisions, under what conditions, with what authority, and with what fallback when it gets one wrong. Most enterprises answer that question by accident. The model ships, the workflow goes live, and the allocation of authority is whatever the integration happened to default to. That is not architecture. That is drift.
The failure shows up in two directions. Some organisations over-automate — letting a system make calls that demand human judgment, which manufactures compliance exposure and quietly erodes the trust of the people meant to rely on it. Others under-delegate — routing every output through human approval, which dissolves the operating leverage that justified the investment in the first place. Both failures come from the same root cause: treating the human-or-machine boundary as a single switch rather than as something you design, decision by decision.
Decision architecture is the third component of the AI Operating System. It sits between the context layer, which defines what the system knows, and workflow design, which defines what the system does. Its sole job is to define who decides what — explicitly, before deployment, and in a form you can defend to a regulator.
The decision spectrum
The first mistake is binary thinking: either the human decides or the machine does. In practice there are five distinct configurations, and most real workflows use several of them at once.
Fully automated means the system decides and acts with no human in the path. The action is taken, the record is logged, and humans monitor aggregate performance rather than individual outputs. Classifying inbound support email by topic and routing it to the correct queue belongs here — the cost of any single error is a short delay, easily corrected.
"The system acts, a human is notified" (acts-and-notifies for short) keeps the action automatic but surfaces every decision to a person who can intervene after the fact. Auto-approving expense claims under a low threshold while copying the finance team fits this pattern: speed by default, with a human able to reverse anything that looks wrong.
"The system recommends, a human decides" (recommend-and-decide) is the configuration most people picture when they say "human in the loop." The system analyses the inputs and presents a recommendation with its supporting evidence; the human makes the call. Claims triage, where the model proposes approve-or-investigate and the handler confirms, lives here.
"The system prepares, a human decides" (prepare-and-decide) withholds the recommendation entirely. The machine structures and summarises, handing over an organised briefing instead of raw data, but offers no verdict. A due-diligence summary for an acquisition, compiled and structured by the system and judged by the investment committee, is the type case.
"Human only" keeps the system out of the decision altogether. Some calls should stay fully human not because the inputs are unprocessable but because the consequence of error, the need for genuine empathy, or a hard regulatory line makes machine involvement inappropriate. A termination decision is the obvious one.
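One way to make the spectrum concrete is to encode it as an ordered type, so that "more human involvement" becomes a comparison rather than a convention. The sketch below is illustrative only: the names, the numeric ordering, and the example workflow entries are assumptions layered on the five configurations described above.

```python
from enum import IntEnum


class Authority(IntEnum):
    """The five configurations, ordered from least to most human involvement."""
    FULLY_AUTOMATED = 1       # system decides and acts; humans watch aggregates
    ACT_AND_NOTIFY = 2        # system acts; a human is told and can reverse
    RECOMMEND_AND_DECIDE = 3  # system proposes a verdict; a human makes the call
    PREPARE_AND_DECIDE = 4    # system structures a brief, no verdict; human decides
    HUMAN_ONLY = 5            # system stays out of the decision


def more_human(a: Authority, b: Authority) -> Authority:
    """Return whichever level places more authority with a person."""
    return max(a, b)


# Several levels can coexist; the keys are hypothetical decision names
# based on the examples in the text.
workflow = {
    "route_support_email": Authority.FULLY_AUTOMATED,
    "approve_small_expense": Authority.ACT_AND_NOTIFY,
    "triage_claim": Authority.RECOMMEND_AND_DECIDE,
    "acquisition_due_diligence": Authority.PREPARE_AND_DECIDE,
    "terminate_employment": Authority.HUMAN_ONLY,
}
```

Treating the levels as ordered is a design choice: it lets a later rule say "never below this level" without enumerating cases.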
Mapping decisions to the right level
Three factors place a decision on that spectrum: how bad it is when the answer is wrong, how structured the decision actually is, and what the law requires.
Consequence severity asks what happens on a wrong answer. A misrouted support ticket costs minutes. An auto-approved fraudulent claim costs money and invites regulatory attention. Low-consequence decisions can be pushed toward full automation. High-consequence ones need human involvement — but high consequence does not mean human-only. More often it means the system recommends and a person decides, which preserves judgment without surrendering throughput.
Decision structure asks how rule-based the call really is. If it can be expressed exhaustively as a decision tree — if A and B and not C, then approve — it is a candidate for full automation almost regardless of consequence, because the logic can be validated end to end and audited. If it requires weighing ambiguous evidence, reading context that resists formalisation, or applying judgment that professionals build over years, it needs a human. The system still earns its place by structuring the evidence and surfacing the factors that matter, but the verdict stays human.
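The claim that a fully rule-based decision "can be validated end to end" can be illustrated with the example rule itself. The sketch below enumerates every input combination for "if A and B and not C, then approve" and records the outcome; the `refer` fallback for the non-approve branch is an assumption, since the text only names the approve case.

```python
from itertools import product


def rule(a: bool, b: bool, c: bool) -> str:
    """The example rule: if A and B and not C, then approve."""
    # "refer" is a hypothetical label for every case that is not approved.
    return "approve" if (a and b and not c) else "refer"


# Enumerate all 2**3 input combinations so the whole rule can be audited.
truth_table = {inputs: rule(*inputs) for inputs in product([False, True], repeat=3)}

# The only combination that should ever be approved.
approved = [inputs for inputs, outcome in truth_table.items() if outcome == "approve"]
```

With three boolean inputs the table has eight rows, which a reviewer can read in full. Judgment calls do not reduce to a finite table like this, and that difference is what the paragraph turns on.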
Regulatory requirement can override both of the above. The EU AI Act classifies a specific set of uses as high-risk in its Annex III — among them creditworthiness assessment, risk pricing of life and health insurance, and a broad band of employment decisions covering recruitment, promotion, task allocation and termination. For these systems, human oversight is not a design preference. Article 14 obliges the system to be built so that a competent person can understand its output, monitor its operation, stay alert to automation bias, and — decisively — disregard, override or reverse any individual result. Build that in from day one. Retrofitting oversight onto a system that was architected to run unattended is the most expensive way to discover what the law wanted. For how these categories are classified in practice, see the EU AI Act compliance guide.
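A rough sketch of how the three factors might combine into a starting position follows. The cell values are one possible reading of the paragraphs above, not the consequence-structure matrix from the book, and the regulatory floor is an illustrative policy choice rather than a statement of what Article 14 requires. All names are hypothetical.

```python
# Spectrum ordered from least to most human involvement.
SPECTRUM = [
    "fully_automated",
    "act_and_notify",
    "recommend_and_decide",
    "prepare_and_decide",
    "human_only",
]

# (severity, rule_based) -> starting level. Illustrative values only.
STARTING_LEVEL = {
    ("low", True): "fully_automated",
    ("high", True): "act_and_notify",
    ("low", False): "recommend_and_decide",
    ("high", False): "prepare_and_decide",
}

# Assumed policy: decisions in Annex III territory never start below this level.
REGULATORY_FLOOR = "recommend_and_decide"


def initial_level(severity: str, rule_based: bool, annex_iii: bool) -> str:
    """Pick a starting position; the regulatory floor can only raise it."""
    level = STARTING_LEVEL[(severity, rule_based)]
    if annex_iii and SPECTRUM.index(level) < SPECTRUM.index(REGULATORY_FLOOR):
        level = REGULATORY_FLOOR
    return level
```

The point of the shape is the last step: consequence and structure propose a level, and the regulatory check is applied afterwards so that it can override both.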
The confidence threshold model
The most effective pattern we deploy refuses to assign every instance of a decision type to the same position on the spectrum. Instead it routes each individual case by the system's own confidence in its output — and, crucially, by the stakes attached to that specific case.
Take claims triage as the worked example. A case the model handles with high confidence and a low monetary value can run fully automated: the system classifies, routes, and the handler simply sees a pre-sorted claim. A case in the middle band — moderate confidence, or a value high enough to matter — moves to recommend-and-decide: the handler sees the classification, the confidence score, and the evidence, then confirms or overrides. A low-confidence case, or one above a serious value ceiling, drops to prepare-and-decide: the system organises the file but offers no recommendation, and the handler works it from a clean structured brief. And any case that trips a fraud indicator is pulled out entirely for specialist human review, with no machine verdict attached.
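The claims-triage routing described above can be sketched as a single function. Every threshold here is a hypothetical placeholder, since the text gives no numbers, and the field and route names are assumptions chosen for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    confidence: float  # model confidence in its classification, 0.0 to 1.0
    value_eur: float   # claimed amount
    fraud_flag: bool   # True if any fraud indicator fired


# Hypothetical thresholds; real values would come out of calibration.
HIGH_CONFIDENCE = 0.90
LOW_CONFIDENCE = 0.60
AUTO_VALUE_CEILING = 1_000.0
SERIOUS_VALUE_CEILING = 25_000.0


def route(claim: Claim) -> str:
    # Fraud indicators pull the case out before any machine verdict is attached.
    if claim.fraud_flag:
        return "specialist_review"
    # Low confidence, or a value above the serious ceiling: brief only.
    if claim.confidence < LOW_CONFIDENCE or claim.value_eur > SERIOUS_VALUE_CEILING:
        return "prepare_and_decide"
    # High confidence and low value: classify and route with no human step.
    if claim.confidence >= HIGH_CONFIDENCE and claim.value_eur <= AUTO_VALUE_CEILING:
        return "fully_automated"
    # Middle band: handler sees classification, score, and evidence.
    return "recommend_and_decide"
```

The order of the checks matters. The fraud test runs first so that a confident, low-value case with a fraud indicator never reaches the automated branch.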
The point of this routing is that it does two things at once. It captures the efficiency of automation on the straightforward majority of cases, and it reserves human judgment for the minority that genuinely needs it — without forcing people to rubber-stamp every routine output, which is the surest way to breed the automation bias the AI Act explicitly warns against.
Thresholds are not set once and forgotten. They are calibrated during deployment and refined against outcome data. If the auto-handled band shows an acceptable error rate after the first quarter, the value ceiling can rise. If the middle band produces too many overrides, the lower confidence bound moves up until the recommendations are worth trusting. That calibration is part of the review cycle, not a one-off configuration step.
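A minimal sketch of one calibration step, assuming two measured inputs per review cycle: the error rate in the auto-handled band and the override rate in the middle band. The tolerances and step sizes are invented for illustration, and the function covers only the two adjustments the paragraph describes.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Thresholds:
    auto_value_ceiling: float        # claims at or below this may be auto-handled
    recommend_min_confidence: float  # lower confidence bound of the middle band


def calibrate(
    current: Thresholds,
    auto_error_rate: float,
    override_rate: float,
    *,
    max_error_rate: float = 0.01,     # assumed tolerance for the auto band
    max_override_rate: float = 0.20,  # assumed tolerance for the middle band
    ceiling_step: float = 1.25,       # assumed multiplicative raise
    confidence_step: float = 0.05,    # assumed additive raise
) -> Thresholds:
    ceiling = current.auto_value_ceiling
    lower = current.recommend_min_confidence
    # Acceptable error rate in the auto band: the value ceiling can rise.
    if auto_error_rate <= max_error_rate:
        ceiling *= ceiling_step
    # Too many overrides in the middle band: raise the lower confidence bound.
    if override_rate > max_override_rate:
        lower = min(lower + confidence_step, 1.0)
    return replace(current, auto_value_ceiling=ceiling, recommend_min_confidence=lower)
```

A production loop would presumably also need a path for tightening the auto band when its error rate is too high; that is left out here because the text does not specify it.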
What this looks like across a Mittelstand portfolio
The pattern generalises beyond insurance, and the shape is consistent across the kinds of mid-market operations we work with.
In claims and back-office processing, the lever is volume. An insurer reviewing every claim by hand — the €200 broken window getting the same attention as the €50,000 water-damage case — is spending its scarcest resource, experienced handler time, on cases that do not need it. Decision architecture pushes the small, clean, fraud-free claims to automation with notification, routes the mid-value standard patterns to recommend-and-decide, and reserves prepare-and-decide for the large or unusual ones. Handlers stop triaging and start adjudicating, and capacity grows without headcount because the routine no longer competes with the consequential.
In industrial order intake, the lever is latency. A manufacturer taking orders by email, web portal and, still, fax typically runs each one by hand through classification, stock check, delivery confirmation and routing to planning. Standard catalogue items with stock on hand and standard terms can self-confirm in minutes instead of days. Orders with a stock shortfall or a delivery conflict move to recommend-and-decide, with the system proposing a substitute or an adjusted date for sales to accept or amend. Custom specifications, large volumes, and new customers get prepare-and-decide, with the key account manager judging a brief that already carries the customer history and the margin maths.
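The order-intake rules can be sketched the same way. The field names, the large-volume cut-off, and the treatment of non-standard terms (which the text does not cover) are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    catalogue_item: bool     # False for custom specifications
    standard_terms: bool
    in_stock: bool
    delivery_conflict: bool
    new_customer: bool
    quantity: int


LARGE_VOLUME = 500  # hypothetical cut-off for "large volumes"


def route_order(order: Order) -> str:
    # Custom specs, large volumes, new customers: brief for the key account manager.
    if not order.catalogue_item or order.new_customer or order.quantity >= LARGE_VOLUME:
        return "prepare_and_decide"
    # Stock shortfall or delivery conflict: system proposes a substitute or new date.
    if not order.in_stock or order.delivery_conflict:
        return "recommend_and_decide"
    # Standard item, stock on hand, standard terms: confirm without a human step.
    if order.standard_terms:
        return "self_confirm"
    # Assumption: non-standard terms on an otherwise clean order go to sales.
    return "recommend_and_decide"
```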
In e-commerce pricing, the lever is breadth. A catalogue of thousands of SKUs cannot be priced by hand against shifting competitor data, but it also cannot be handed wholesale to an algorithm. Commodity lines with clear benchmarks and stable margins reprice automatically inside hard corridors — never below floor, never above ceiling — with a category manager reviewing aggregates rather than individual moves. Seasonal or volatile lines move to recommend-and-decide. And brand-defining products, deliberate loss leaders and new launches stay human-only, with the system supplying competitive intelligence but no pricing authority.
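The "hard corridor" for commodity lines amounts to a clamp, and the three line types differ only in what the system is allowed to do with its proposed price. The sketch below assumes three hypothetical line-type labels and returns both the price to publish and any recommendation left for a human.

```python
from typing import Optional, Tuple


def clamp_to_corridor(proposed: float, floor: float, ceiling: float) -> float:
    """Never below floor, never above ceiling."""
    if floor > ceiling:
        raise ValueError("floor must not exceed ceiling")
    return min(max(proposed, floor), ceiling)


def next_price(
    line_type: str,   # assumed labels: "commodity", "volatile", anything else
    current: float,
    proposed: float,
    floor: float,
    ceiling: float,
) -> Tuple[float, Optional[float]]:
    """Return (price to publish, recommendation for a human or None)."""
    if line_type == "commodity":
        # Reprice automatically, but only inside the corridor.
        return clamp_to_corridor(proposed, floor, ceiling), None
    if line_type == "volatile":
        # Recommend-and-decide: keep the current price, surface the proposal.
        return current, proposed
    # Brand-defining products, loss leaders, launches: no pricing authority.
    return current, None
```

Keeping the clamp as its own function means the corridor can be tested separately from the routing around it.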
Three different operations, one architecture: granular authority, assigned by consequence and structure, with the law setting hard floors where Annex III applies.
The mistakes that recur
Uniform authority is the most common. Applying the same level of human oversight to every decision in a workflow guarantees you get it wrong somewhere — either a human reviewing every classified email, which erases the saving, or no human anywhere near a consequential decision, which manufactures the risk. The fix is granular assignment: different decisions inside one workflow can and should sit at different points on the spectrum.
Static thresholds come next. Confidence bounds and authority levels set at deployment and never revisited will drift out of alignment with reality, because the right values are empirical — they depend on actual error rates, actual consequence patterns, and actual team capacity. Calibrate them against outcome data on a regular cycle, not on intuition.
Confusing transparency with authority is subtler. Making the system's reasoning visible to people is good practice, but it is not the same as giving them the decision. If a person is expected to review the reasoning and approve each output, that is recommend-and-decide. If they can see the reasoning but are not expected to act on it, that is acts-and-notifies. These are different configurations with different staffing costs, and conflating them produces a workflow nobody is actually accountable for.
Ignoring the cost of human review is the one that quietly bleeds the business case. Every review step carries a time cost, and at volume it compounds fast. A workflow processing several hundred items a day, each taking a few minutes to review, is at least one full-time reviewer before anything is automated at all: 300 items at two minutes each is already ten hours of review a day. Before mandating review, price it, and weigh it against the expected cost of the errors that review would actually catch. Sometimes the discipline of reviewing everything costs more than the mistakes it prevents.
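Pricing a review step is simple arithmetic, sketched below with hypothetical figures: 300 items a day at two minutes each. The hourly cost, error rate, catch rate, and cost per error are all assumptions a team would replace with its own measurements.

```python
def daily_review_hours(items_per_day: int, minutes_per_item: float) -> float:
    """Reviewer time consumed per day, in hours."""
    return items_per_day * minutes_per_item / 60


def review_pays_off(
    items_per_day: int,
    minutes_per_item: float,
    hourly_cost: float,    # loaded cost of one reviewer hour
    error_rate: float,     # share of items the system gets wrong
    catch_rate: float,     # share of those errors a reviewer actually catches
    cost_per_error: float, # cost of one uncaught error
) -> bool:
    """True if the errors prevented are worth more than the review itself."""
    review_cost = daily_review_hours(items_per_day, minutes_per_item) * hourly_cost
    value_of_caught_errors = items_per_day * error_rate * catch_rate * cost_per_error
    return value_of_caught_errors > review_cost


# 300 items at 2 minutes each is 10 reviewer hours a day.
hours = daily_review_hours(300, 2)
```

With these inputs, and an assumed reviewer cost of 50 an hour, a 1% error rate, and an 80% catch rate, review does not pay for itself at 100 per error but does at 1,000 per error. The same workflow can land on either side of the line depending on consequence.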
Decision architecture and the EU AI Act
For high-risk systems, the Act does not prescribe an implementation — it prescribes capabilities. Article 14(4) requires that a competent person can properly understand the system's capacities and limitations, can remain aware of the tendency to over-rely on its output — the automation bias the Act names explicitly — can correctly interpret that output, can decide in any given situation not to use the system or to disregard, override or reverse a result, and can intervene in or interrupt the system through a stop function that brings it to a halt in a safe state. The confidence threshold model satisfies all of this when it is built honestly: the handler sees the classification and its confidence score (understanding and interpretation), is shown enough to question a confident-but-wrong output rather than rubber-stamp it (automation-bias awareness), can override any individual case (disregard and reverse), works inside a workflow that can be bypassed or halted (intervention), and operates against thresholds that can be tuned to shift more decisions back to human review (control).
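As a purely structural illustration, the override and halt capabilities might appear in code as follows. This is a sketch of two mechanisms under invented names, not a compliance implementation: whether a real system satisfies Article 14 depends on far more than these hooks.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Decision:
    case_id: str
    system_output: str                   # what the model proposed
    confidence: float                    # shown to the handler with the output
    final_output: Optional[str] = None   # what actually took effect
    overridden_by: Optional[str] = None  # reviewer id, if a human changed it


@dataclass
class OversightGate:
    """Hypothetical wrapper that keeps every result overridable and haltable."""
    halted: bool = False
    log: List[Decision] = field(default_factory=list)

    def record(self, case_id: str, system_output: str, confidence: float) -> Decision:
        # Once halted, no new machine output is accepted.
        if self.halted:
            raise RuntimeError("system halted: route the case to manual handling")
        decision = Decision(case_id, system_output, confidence, final_output=system_output)
        self.log.append(decision)
        return decision

    def override(self, decision: Decision, new_output: str, reviewer: str) -> None:
        # The system's original output is kept for audit; only the final result changes.
        decision.final_output = new_output
        decision.overridden_by = reviewer

    def halt(self) -> None:
        self.halted = True
```

Keeping `system_output` separate from `final_output` preserves the original proposal alongside the override, which also produces the override data that threshold calibration depends on.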
A note on timing for DACH planning, because the deadline has been moving. The high-risk obligations for stand-alone Annex III systems were originally set to apply from 2 August 2026. Under the Commission's Digital Omnibus package, EU legislators have provisionally agreed to defer that date to 2 December 2027 — but as of writing that deferral is politically agreed, not yet formally adopted and published in the Official Journal, and therefore not yet law. Treat it as breathing room you do not yet have. The prudent posture is to architect for the obligations to land, not for the extension to save you, and to read any confirmed delay as time to do it properly rather than a reason to defer the work. The engineering conclusion does not move either way: design oversight in from the start. Architecture built with the regulatory requirement in view is both more compliant and more efficient than architecture that bolts compliance on after the fact.
Building your own
Start with a single workflow and list every decision point in it. For each one, assess consequence severity, decision structure, and regulatory exposure, then assign an initial position on the spectrum. Define the confidence thresholds where the pattern applies, and document the escalation path for cases that fall outside the defined parameters. Then deploy, measure, and calibrate. The initial architecture is a hypothesis; the production architecture is what survives the first quarter of outcome data.
The full framework — including the consequence-structure matrix and the threshold-calibration method — is in Chapter 05 of The AI Operating System. For the adjacent question of when to automate versus augment, see Automation vs. Augmentation.
A Fit Call maps the decision points in one of your live workflows to the right level of human authority — so you capture the automation upside without tripping the EU AI Act's oversight requirements.
References: EU Artificial Intelligence Act, Article 14 — Human Oversight; Annex III — High-Risk AI Systems; Implementation Timeline; Gibson Dunn, "EU AI Act Omnibus Agreement — Postponed High-Risk Deadlines and Other Key Changes," 2026.
