Your Datadog dashboard shows green. Latency is within SLA. Error rate is 0.1 percent. Uptime is 99.9 percent. And your AI system is producing wrong answers, confidently, in a fraction of every conversation — and nobody in the building knows.

That gap is the defining operational problem of enterprise AI. Traditional application performance monitoring measures whether the system is running. AI observability measures whether the system is working correctly. These are not the same question, and the distance between them is exactly where AI fails silently in production — without an alert, without a stack trace, without anyone noticing until a customer, an auditor, or a regulator does.

Why traditional monitoring goes blind

APM tools were built to measure infrastructure: CPU, memory, latency, throughput, status codes. For a conventional API that contract is sufficient. If the service responds within SLA and returns a 200, it is working. The output is deterministic; correct plumbing implies a correct answer.

AI breaks that assumption at the root. A model can respond in 200 milliseconds, return a clean 200, and hand back a fluent, well-formatted, completely fabricated answer. Every infrastructure signal stays green because nothing infrastructural failed. The failure lives one layer up — in the content of the response — and your existing monitoring stack has no sensor pointed there. You need a second observability layer that measures the quality of what the system produces, not merely whether it produced something.

This is not an optional refinement. Under the EU AI Act, providers and deployers of high-risk AI systems are now legally required to keep records of how those systems behave in operation — and qualitative "it seemed fine" is not a record.

The six things you actually have to watch

Most of the failure surface for production AI collapses into six monitoring categories. You do not need a new platform to cover them; you need the right signals wired into the platform you already run.

Output quality is the signal APM cannot give you. Sample a slice of production outputs — one to five percent is a sensible starting band — and evaluate them rather than just count them. For classification tasks, compare against known-correct labels. For generative tasks, use an LLM-as-judge pattern: a separate model scores whether each output is faithful to its source material, correctly structured, and internally consistent, with humans reviewing anything the judge flags. The discipline that matters here is frequency. Run quality scoring as a continuous pipeline, not a quarterly audit. A weekly batch review catches a regression seven days after it shipped; daily automated scoring with human triage of flagged outputs catches it before it compounds into a reputational or compliance event.
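The sampling-and-scoring loop described above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the record fields (`id`, `source`, `output`), the 2 percent default rate, the 0.7 flag threshold, and `toy_judge` are all assumptions made for the example. In practice `toy_judge` would be replaced by a call to a separate judge model.

```python
import hashlib

def in_sample(request_id: str, rate: float) -> bool:
    """Deterministically select roughly `rate` of requests by hashing their id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Compare the first 8 bytes, read as an integer, against rate * 2**64.
    return int.from_bytes(digest[:8], "big") < rate * 2**64

def toy_judge(source: str, output: str) -> float:
    """Stand-in for a judge model: share of output words that appear in the source."""
    words = output.lower().split()
    if not words:
        return 0.0
    source_words = set(source.lower().split())
    return sum(w in source_words for w in words) / len(words)

def score_sample(records, judge=toy_judge, rate=0.02, flag_below=0.7):
    """Score the sampled records; return (scores, ids flagged for human review)."""
    scores, flagged = [], []
    for rec in records:
        if not in_sample(rec["id"], rate):
            continue
        score = judge(rec["source"], rec["output"])
        scores.append(score)
        if score < flag_below:
            flagged.append(rec["id"])
    return scores, flagged
```

Hashing the request id rather than drawing a random number keeps the sample reproducible, so a flagged output can be re-scored later with the same selection.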

Cost has to be measured per task, not per token. Token dashboards flatter you. A single customer-service interaction might fan out into three model calls, two retrieval queries, and a verification step; its real cost is the sum of that workflow, not one prompt. Build dashboards around cost per business action: per ticket resolved, per document processed, per recommendation generated. Then alert on anomalies, because the dangerous cost regressions are silent. A prompt change that quietly adds verbosity can double consumption overnight. A retrieval tweak that over-fetches context can triple context-window usage. None of that registers on an infrastructure dashboard, and at Mittelstand scale, where an AI workflow might run on a four- or five-figure monthly API budget rather than a hyperscaler's nine-figure one, a doubling that runs unnoticed for a billing cycle is a real, board-visible number.
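As a rough sketch of per-task costing, the snippet below groups model calls by a task id and sums them, then compares today's mean against a baseline. The prices in `PRICE_PER_1K`, the field names, and the 1.5x alert factor are illustrative assumptions, not any provider's actual rates. Retrieval and verification steps would be added to the same total in a fuller version.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical prices in euros per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}

def cost_per_task(calls):
    """Sum the cost of every model call that belongs to the same business task."""
    totals = defaultdict(float)
    for call in calls:
        cost = (call["input_tokens"] / 1000) * PRICE_PER_1K["input"]
        cost += (call["output_tokens"] / 1000) * PRICE_PER_1K["output"]
        totals[call["task_id"]] += cost
    return dict(totals)

def cost_regressed(baseline_costs, todays_costs, factor=1.5):
    """True when today's mean cost per task exceeds the baseline mean by `factor`."""
    return mean(todays_costs) > factor * mean(baseline_costs)
```

Alerting on the ratio to a baseline, rather than on an absolute figure, is what catches the overnight doubling caused by a prompt change.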

Drift is the warning that arrives before accuracy drops. Monitor your input distributions for statistical shift using an established measure such as the Population Stability Index. PSI is well understood from decades of credit-risk monitoring and has a conventional reading: below 0.1 the distribution is stable, 0.1 to 0.25 warrants investigation, and above 0.25 signals a shift material enough to justify retraining or remediation. When inputs drift, outputs degrade, but the degradation is gradual and easy to miss, which is precisely why a leading indicator beats waiting for the accuracy number to fall. For LLM-based systems, extend the idea to the outputs: track semantic similarity over time, because a sudden change in vocabulary, structure, or confidence often signals not drift in your own inputs but a silent provider-side model update underneath you. Open-source tooling such as Evidently implements these statistical tests out of the box, so this is configuration, not a research project.
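For readers who want to see what the index actually computes, here is a minimal standard-library sketch of PSI for one numeric feature. It uses the usual formula, the sum over bins of (actual share minus expected share) times the natural log of their ratio. The equal-width binning, the ten bins, and the small `eps` floor for empty bins are assumptions of this sketch; production tooling may bin differently.

```python
import math

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index of `actual` against the `expected` baseline."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def psi_band(value):
    """Map a PSI value onto the conventional reading."""
    if value < 0.1:
        return "stable"
    if value <= 0.25:
        return "investigate"
    return "material shift"
```

Comparing a baseline with itself yields exactly 0.0, and the value grows as probability mass moves between bins.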

Latency has to be read at the tail, not the average. Mean latency lies. Monitor p50, p95, and p99 separately. A system with a 200-millisecond p50 and a five-second p99 has a tail problem affecting one percent of requests — and one percent of ten thousand daily requests is a hundred poor experiences every day, every one of them invisible in the average. For streaming LLM applications, separate time-to-first-token, which governs perceived responsiveness, from total generation time, which governs whatever pipeline consumes the output downstream.
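A small worked example makes the tail effect concrete. The sketch below uses the nearest-rank percentile definition (one of several in common use) and made-up latency figures chosen to mirror the numbers above.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def latency_summary(samples_ms):
    """Report p50, p95, and p99 separately instead of a single mean."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}

# Illustrative data: 985 fast requests and 15 slow ones, in milliseconds.
samples = [200] * 985 + [5000] * 15
print(round(sum(samples) / len(samples)))  # mean: 272
print(latency_summary(samples))            # {'p50': 200, 'p95': 200, 'p99': 5000}
```

The mean of 272 milliseconds looks healthy, and even p95 does, while p99 exposes the five-second experiences. For streaming applications the same summary would be computed twice, once over time-to-first-token and once over total generation time.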

Prompt-injection monitoring is a security control, not a curiosity. The OWASP Top 10 for Large Language Model Applications has ranked prompt injection as the number-one risk for the second consecutive edition, and its reasoning is uncomfortable: LLMs process instructions and data through the same channel, so an attacker can hide an instruction inside content (a support message, a PDF, a web page your retrieval system ingests) and the model cannot reliably tell command from data. There is no single patch. OWASP's own guidance is defence in depth: least-privilege tooling, input and output filtering, human approval for high-risk actions, and regular adversarial testing. For monitoring specifically, that means logging and alerting on inputs that carry instruction-like patterns, role-play framings, or attempts to extract a system prompt, and watching outputs for the anomalies that indicate an injection landed. For any customer-facing system, this is not optional hardening.
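As a starting point for the logging side, a pattern-based detector can tag suspicious inputs so they show up in alerts. The three regular expressions below are illustrative assumptions, easy to evade and prone to false positives. They are a monitoring signal to feed into review, not a defence on their own.

```python
import re

# Illustrative patterns only; a real deployment would tune and extend these.
SUSPICIOUS_PATTERNS = {
    "override_instructions": re.compile(
        r"\b(ignore|disregard|forget)\b.{0,40}\b(previous|prior|above|earlier)\b"
        r".{0,20}\b(instructions?|rules?|prompts?)\b",
        re.I | re.S,
    ),
    "role_play": re.compile(r"\b(you are now|pretend (to be|you are)|act as)\b", re.I),
    "system_prompt_extraction": re.compile(
        r"\b(reveal|show|print|repeat)\b.{0,40}\bsystem prompt\b", re.I | re.S
    ),
}

def injection_signals(text):
    """Return the sorted names of all patterns that match the input text."""
    return sorted(name for name, pat in SUSPICIOUS_PATTERNS.items() if pat.search(text))
```

Logging the returned names alongside the request id gives you a countable metric per pattern, which is what makes a sudden spike alertable.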

Resource utilisation is where waste and risk both hide. Track GPU utilisation, memory, and token throughput over time. Persistently low GPU utilisation signals oversized infrastructure you are paying for and not using; sustained high utilisation signals capacity risk, where a single traffic spike tips you into degradation. For API-based deployments — the common case in the Mittelstand — track consumption against budget and alert well before the ceiling, because uninstrumented API spend is one of the most reliable ways a promising pilot quietly turns uneconomic.
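For the API-budget case, alerting "well before the ceiling" can be as simple as a run-rate projection. The sketch below assumes spend accrues roughly linearly through the month and uses an arbitrary 80 percent warning threshold; both are assumptions to adjust for your own traffic pattern.

```python
def projected_month_spend(spend_to_date, day_of_month, days_in_month):
    """Linear run-rate projection of month-end spend."""
    return spend_to_date / day_of_month * days_in_month

def budget_status(spend_to_date, day_of_month, days_in_month, budget, warn_at=0.8):
    """Classify current spend against the monthly budget."""
    projected = projected_month_spend(spend_to_date, day_of_month, days_in_month)
    if spend_to_date >= budget:
        return "over budget"
    if projected >= budget:
        return "projected overrun"
    if projected >= warn_at * budget:
        return "warning"
    return "ok"
```

Spending 2,000 euros by day 10 of a 30-day month against a 5,000 euro budget returns "projected overrun", twenty days before the ceiling would actually be hit.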

What the regulation now obliges you to log

For DACH companies this layer stopped being a best-practice argument and became a legal one. The EU AI Act requires that high-risk AI systems technically allow for the automatic recording of events over the lifetime of the system (automatic logs, not manual notes) so that risky situations, post-market monitoring, and ongoing operation are all traceable (Article 12). Providers must retain those automatically generated logs for an appropriate period and at minimum six months, unless other law demands longer (Article 19). And the obligation does not stop at deployment: providers must run a documented, proportionate post-market monitoring system that systematically collects and analyses performance data across the system's life, with the Commission's implementing template due in February 2026 (Article 72). Buying the system from a vendor does not make the duties disappear: the deployer carries its own obligations, including keeping the automatically generated logs under its control for at least six months (Article 26).
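In engineering terms, "automatic logs with a minimum retention period" comes down to structured, timestamped event records and a deletion guard. The following is a sketch of that idea only. The record fields, the reading of "six months" as 183 days, and the function names are assumptions of this example; what must be logged and for how long is a legal determination for your own system.

```python
import json
from datetime import datetime, timedelta, timezone

MIN_RETENTION = timedelta(days=183)  # assumption: "six months" read as 183 days

def event_line(workflow, event_type, detail, now=None):
    """Serialise one event as a JSON line with a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    record = {
        "ts": now.isoformat(),
        "workflow": workflow,
        "event": event_type,
        "detail": detail,
    }
    return json.dumps(record, sort_keys=True)

def may_delete(record, now):
    """True only once the record is older than the minimum retention period."""
    return now - datetime.fromisoformat(record["ts"]) > MIN_RETENTION
```

Writing one JSON object per line keeps the log append-only and easy to ship into whatever log store you already operate.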

NIS2 reinforces the point from the security side. Germany's implementing law brought essential and important entities into scope in December 2025, with active detection, monitoring, and analysis obligations and a 24-hour initial reporting clock to the BSI for significant incidents. You cannot detect, analyse, or report on an incident in an AI system you are not observing. The observability stack is, increasingly, the evidence base for two regulatory regimes at once.

Building it without overbuilding

For a Mittelstand company running three to ten AI workflows, the right move is to layer onto existing monitoring, not to stand up a parallel platform. Extend the APM you already run: output-quality scores, cost-per-task, and drift indicators are custom metrics with standard alerting attached, and Datadog, Grafana, or Prometheus can hold all of them. Add a small scheduled job that samples one to five percent of outputs daily, runs an LLM-as-judge evaluation, and writes the scores back into your metrics system: a few euros a day in API calls, weighed against the cost of a quality regression your customers find before you do. Add a per-workflow cost view built from the billing APIs every provider exposes, and put a fifteen-minute weekly review on it. That is the whole programme. It is deliberately unglamorous, and it is the difference between an AI system you can defend to your board and your auditor and one you are merely hoping about.
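The "writes the scores back into your metrics system" step might look like the sketch below, which renders a day's judge scores as lines in the Prometheus text exposition format. The metric names, the `workflow` label, and the 0.7 flag threshold are hypothetical, and how the lines reach Prometheus (for example via a Pushgateway or a textfile collector) is left to your setup. A Datadog or Grafana pipeline would use its own client instead.

```python
from statistics import mean

def to_prometheus(metric, value, labels):
    """Render one sample in the Prometheus text exposition format."""
    label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{label_str}}} {value}"

def daily_metrics(workflow, scores, flag_below=0.7):
    """Summarise one day's judge scores as two gauge lines."""
    labels = {"workflow": workflow}
    flagged = sum(s < flag_below for s in scores)
    return [
        to_prometheus("ai_output_quality_score_mean", round(mean(scores), 4), labels),
        to_prometheus("ai_output_quality_flagged", flagged, labels),
    ]

for line in daily_metrics("support", [1.0, 0.5, 0.75, 0.75]):
    print(line)
# ai_output_quality_score_mean{workflow="support"} 0.75
# ai_output_quality_flagged{workflow="support"} 1
```

Once the scores are ordinary gauges, the alerting rules you already maintain for latency and error rate apply to output quality without any new platform.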

A Fit Call maps your live AI workflows against this six-category framework and the AI Act's logging duties — so you find the blind spots before an outage, an overspend, or an audit finds them for you.



References: European Parliament and Council, "Regulation (EU) 2024/1689 (AI Act), Articles 12, 19 and 72," artificialintelligenceact.eu; OWASP, "Top 10 for Large Language Model Applications, 2025," owasp.org; BSI / German NIS2 Implementation Act (in force December 2025); Evidently AI, open-source ML and LLM observability documentation, docs.evidentlyai.com; Arize AI, "Population Stability Index (PSI)," arize.com.