Model Lifecycle Management: Versioning, Monitoring, and Drift Detection

Deployment is not the finish line. It is where the real operational challenge begins.

The day a model goes live is the day it starts ageing. Input distributions shift, business processes change, upstream data sources get redefined, and the world the model was trained on quietly drifts away from the world it is now operating in. None of this announces itself. A model does not crash when it goes stale — it keeps returning confident answers that are gradually, then sharply, wrong. The first signal most Mittelstand teams get is not a monitoring alert but a complaint: a sales forecast that stopped landing, a classification queue that filled with escalations, a quality check that started waving through defects. By then the degradation is weeks old. Lifecycle management is the discipline that closes the gap between when a model starts failing and when you find out.

The four ways a model goes wrong in production

Drift is not one phenomenon. It arrives through four distinct doors, and the response differs for each.

Data drift is a change in the distribution of inputs. A customer-classification model trained on one purchasing pattern meets a different one after a market shock. A fraud model tuned on stable transaction behaviour meets a volatile season. The inputs no longer resemble what the model learned from, and its predictions lose calibration even though nothing in the code has changed. Evidently AI's production-monitoring guidance frames this precisely: data drift is a change in the statistical properties of the input data once a model is live, and you detect it by comparing the live distribution against a reference using summary statistics, hypothesis tests, or distance metrics.

Concept drift is subtler and more dangerous, because the inputs can look identical. Here the relationship between inputs and correct outputs has changed. A lead-scoring model built when deals closed in thirty days becomes systematically wrong when the cycle stretches to sixty. The features arrive looking normal; what they mean has moved. Concept drift is hard to catch without ground truth, which is exactly why it slips past teams that monitor only inputs.

Feature drift is an upstream-data failure dressed as a model problem. A sensor feed drops out of a production-quality model. A CRM field gets redefined by another team and silently changes meaning. The pipeline keeps delivering rows, so nothing errors — but the model is now reading different data than it was built to read. Most "the model broke" incidents are, on inspection, feature drift somewhere upstream.
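Upstream failures of this kind tend to show up as missing or out-of-range values before any distribution test picks them up. The following is a minimal sketch of a per-batch feature check, assuming batches arrive as lists of dictionaries. The function name `feature_health`, the field names, the expected ranges, and the 5 per cent missing-rate limit are all illustrative assumptions, not recommended values.

```python
from typing import Any


def feature_health(
    rows: list[dict[str, Any]],
    expected: dict[str, tuple[float, float]],
    max_missing_rate: float = 0.05,  # assumed limit, tune per feature
) -> dict[str, list[str]]:
    """Map each problematic feature to the list of issues found in this batch."""
    problems: dict[str, list[str]] = {}
    n = len(rows)
    for name, (low, high) in expected.items():
        # A key that is absent and a key set to None both count as missing.
        values = [row.get(name) for row in rows]
        missing = sum(v is None for v in values)
        issues = []
        if n and missing / n > max_missing_rate:
            issues.append(f"missing rate {missing / n:.0%}")
        out_of_range = sum(
            1 for v in values if v is not None and not (low <= v <= high)
        )
        if out_of_range:
            issues.append(f"{out_of_range} value(s) outside [{low}, {high}]")
        if issues:
            problems[name] = issues
    return problems
```

Run against every batch before it reaches the model, a check like this turns a silent sensor dropout or a redefined field into an explicit finding with a feature name attached.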

Provider drift is the form unique to API-based LLMs, and it is the one teams most often forget to monitor at all. When you build on a hosted model, the model underneath your prompt can change while the model name you call stays the same. This is not hypothetical: a Stanford and UC Berkeley study tracked GPT-4 and GPT-3.5 across two snapshots and found large swings in behaviour over just a few months. On one code-generation benchmark, the share of directly executable GPT-4 output fell from 52 per cent to 10 per cent between the March and June 2023 versions. The lesson is not that any one provider is unreliable; it is that a prompt is not a stable contract. The model behind it is a moving target, and your evaluation has to assume so.

Monitor at three levels, not one

Most teams monitor the easiest thing — system latency and error rates — and call it observability. That tells you the service is up. It tells you nothing about whether the answers are still right. Effective lifecycle management watches three layers at once.

Performance, measured in business terms. Track the metrics that actually matter for the use case against the baseline you recorded at deployment. Evidently's guidance is blunt on this point: the ultimate measure of a model's quality is its impact on the business — approvals, conversions, resolution time, cost saved — not the technical score in isolation. If the model triages support tickets, your headline metric is escalation rate and time-to-resolution, not classification accuracy alone. Accuracy is the diagnostic; the business metric is the symptom you are actually defending.

Data, measured statistically. Watch input distributions so you see drift before it reaches the business metric. The practical methods are well established: Kolmogorov–Smirnov tests for numerical features, Chi-square for categorical ones, and distance metrics such as Jensen–Shannon divergence or Wasserstein distance to quantify how far the live distribution has moved from the reference. Treat these as heuristics, not verdicts — the right threshold depends on data volume, how much change you can tolerate, and how costly the model's mistakes are. A noisy alert that everyone learns to ignore is worse than no alert.
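To make two of these measures concrete, here is a minimal standard-library sketch of the two-sample Kolmogorov–Smirnov statistic for a numerical feature and the Jensen–Shannon divergence for a categorical one. It computes only the raw statistics, not p-values. In practice a team would more likely call a library routine such as `scipy.stats.ks_2samp`; the function names below are illustrative, and any alert threshold applied to the results is an assumption to be tuned against your own data.

```python
import bisect
import math


def ks_statistic(reference: list[float], live: list[float]) -> float:
    """Two-sample KS statistic: the largest gap between the two empirical CDFs."""
    if not reference or not live:
        raise ValueError("both samples must be non-empty")
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    # The empirical CDFs only change at observed values, so check each one.
    for x in set(ref) | set(cur):
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def js_divergence(ref_counts: dict[str, int], live_counts: dict[str, int]) -> float:
    """Jensen-Shannon divergence in bits: 0 means identical, 1 means disjoint."""
    ref_total, live_total = sum(ref_counts.values()), sum(live_counts.values())
    if ref_total == 0 or live_total == 0:
        raise ValueError("both count tables must contain observations")
    keys = set(ref_counts) | set(live_counts)
    p = {k: ref_counts.get(k, 0) / ref_total for k in keys}
    q = {k: live_counts.get(k, 0) / live_total for k in keys}
    m = {k: (p[k] + q[k]) / 2 for k in keys}  # midpoint distribution

    def kl(a: dict[str, float], b: dict[str, float]) -> float:
        # Terms with a[k] == 0 contribute nothing to the sum.
        return sum(a[k] * math.log2(a[k] / b[k]) for k in keys if a[k] > 0)

    return (kl(p, m) + kl(q, m)) / 2
```

Both functions return a value between 0 and 1, which makes them easy to plot on one dashboard, but the two scales are not interchangeable: a KS statistic of 0.2 and a divergence of 0.2 do not describe the same amount of movement.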

Output, measured by sampling. For generative systems where there is no single correct answer, you cannot score accuracy directly. Sample outputs over time and check them — a small recurring human review, automated consistency checks against known-good answers, and judge-model scoring on a fixed rubric. Watching the distribution of outputs is also your early-warning system for the provider drift described above: when answers to a stable set of prompts start shifting shape, the model underneath has usually moved.
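The consistency check against known-good answers can be sketched as a small canary suite: a fixed set of prompts, each with an answer fragment that must still appear in the output. This is a sketch under stated assumptions. `generate` stands in for whatever function calls your hosted model and is hypothetical, and case-insensitive substring matching is a deliberately crude stand-in for a proper rubric or judge model.

```python
from typing import Callable


def canary_pass_rate(
    generate: Callable[[str], str],
    canaries: list[tuple[str, str]],
) -> float:
    """Share of canary prompts whose output still contains the expected answer."""
    if not canaries:
        raise ValueError("canary set is empty")
    passed = 0
    for prompt, expected in canaries:
        output = generate(prompt)
        # Case-insensitive containment: crude, but stable across rewording.
        if expected.strip().lower() in output.strip().lower():
            passed += 1
    return passed / len(canaries)
```

Logged on a fixed schedule, the pass rate becomes a time series. A step change on a day when neither your prompt nor your pipeline changed is a reason to look at the provider.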

Version everything, not just the model

When outputs change, your first question is what changed — and you can only answer it if you tracked every moving part. Versioning only the model weights is the most common gap.

The model version is the obvious one: tag every deployed version with its training-data snapshot, hyperparameters, and the evaluation numbers it shipped with, so you can compare a degrading version against its predecessor and tell whether the model or the data moved. But for LLM systems the prompt is code, and it deserves the same rigour — version it in Git, track which prompt is live in each environment, and keep the history so you can diff it. More than one multi-day debugging session has ended with the discovery that someone edited a prompt and never told anyone. The pipeline counts too: chunking strategy, retrieval parameters, pre- and post-processing logic shape outputs as much as the model does, and a change to retrieval configuration can shift behaviour more than a model swap. Finally, keep data snapshots — training, evaluation, and reference sets at each deployment — because the moment you suspect drift, the only way to confirm it is to compare today's inputs against the distribution the model was actually built on.
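A version log covering these four parts does not need a platform. The sketch below records one entry per deployment and reports which parts differ between two entries. The field names, the 12-character hash length, and the example values in the usage note are illustrative assumptions.

```python
import hashlib
import json


def short_hash(payload: str) -> str:
    """Short, stable fingerprint of a text payload."""
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


def make_record(
    deployed_on: str,
    model_version: str,
    prompt_text: str,
    pipeline_config: dict,
    data_snapshot_id: str,
) -> dict:
    """One version-log entry covering model, prompt, pipeline, and data."""
    return {
        "deployed_on": deployed_on,
        "model_version": model_version,
        "prompt_hash": short_hash(prompt_text),
        # sort_keys makes the hash independent of dictionary key order.
        "pipeline_hash": short_hash(json.dumps(pipeline_config, sort_keys=True)),
        "data_snapshot_id": data_snapshot_id,
    }


def changed_parts(old: dict, new: dict) -> list[str]:
    """Names of the versioned parts that differ between two log entries."""
    return sorted(
        key for key in old if key != "deployed_on" and old[key] != new.get(key)
    )
```

Comparing the entry for a degrading deployment with its predecessor answers the first question directly. If only the retrieval configuration changed, `changed_parts` returns `["pipeline_hash"]` and the model is no longer the prime suspect.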

Decide the retraining trigger before you need it

Detection is worthless if it does not lead to action, and "we'll know it when we see it" is not a plan. Define the triggers in advance, in writing, with an owner.

There are three honest triggers. Threshold-based: when the business or accuracy metric falls a defined amount below baseline, a retraining cycle starts, which presupposes you maintain a current golden test set to measure against. Schedule-based: where drift is predictable, such as models tied to quarterly financial cycles or seasonal demand, retrain on a known cadence regardless of measured drift. Event-based: major business changes (a product launch, a regulatory shift, an acquisition, a pricing change) can invalidate model assumptions overnight and should force a re-evaluation before the metrics have even had time to move. The sensible operating posture is tiered, not binary: minor drift raises monitoring intensity, moderate drift triggers evaluation against a fresh test set, and significant drift triggers retraining or model replacement. A single threshold fails in one of two ways: set tight, it fires constantly; set loose, it never fires. Neither protects you.
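The tiered posture can be written down as a small decision function, which also forces the thresholds into the open. This is a sketch only: it assumes a metric where higher is better, and the cut-offs of 2, 5, and 10 percentage points are placeholder assumptions, not recommendations.

```python
def drift_response(
    baseline: float,
    current: float,
    minor: float = 0.02,        # assumed cut-offs, expressed as an
    moderate: float = 0.05,     # absolute drop in a higher-is-better
    significant: float = 0.10,  # metric such as accuracy
) -> str:
    """Map the drop below baseline to one of the tiered responses."""
    drop = baseline - current
    if drop >= significant:
        return "retrain_or_replace"
    if drop >= moderate:
        return "evaluate_on_fresh_test_set"
    if drop >= minor:
        return "increase_monitoring"
    return "no_action"
```

Keeping the thresholds as named parameters means the written trigger document and the running check can be compared line by line.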

There is now also a regulatory floor under all of this. Under Article 72 of the EU AI Act, providers of high-risk AI systems must establish and document a post-market monitoring system that systematically collects and analyses performance data across the system's lifetime. That system is governed by a written post-market monitoring plan, which forms part of the technical documentation. Article 72(3) required the Commission to adopt, by implementing act, a common template for that plan by 2 February 2026, a deadline that has now passed. If your AI touches a high-risk use case, lifecycle monitoring is no longer just operational hygiene; it is a documented obligation you will be expected to evidence.

The minimum viable lifecycle for a Mittelstand stack

You do not need a hyperscaler MLOps platform to do this well at Mittelstand scale. For a company running a handful of AI workflows — three to ten in production — a credible lifecycle system is deliberately modest: weekly accuracy checks against a golden test set of fifty to a hundred examples, refreshed quarterly so the test set itself does not go stale; a monthly distribution check on your key input features using one of the statistical tests above; a version log recording model, prompt, and pipeline versions with their deployment dates; and written retraining triggers with a named escalation path. That is the whole system. It runs on monitoring infrastructure you almost certainly already own — Grafana or Datadog extended with a few custom metrics — and it can be stood up in weeks, not quarters. The hard part was never the tooling. It is the discipline of deciding, before anything breaks, what "broken" means and who acts when it does.
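The weekly golden-set check is small enough to sketch in full. The version below assumes a classification-style workflow where exact label match is a fair score. `predict` is a hypothetical stand-in for your model call, and the 92-day staleness limit is an assumed reading of "refreshed quarterly".

```python
from datetime import date
from typing import Callable


def golden_set_report(
    predict: Callable[[str], str],
    golden: list[tuple[str, str]],
    last_refreshed: date,
    today: date,
    max_age_days: int = 92,  # assumed stand-in for "quarterly"
) -> dict:
    """Accuracy on the golden set, plus a flag when the set itself is overdue."""
    if not golden:
        raise ValueError("golden set is empty")
    correct = sum(1 for text, label in golden if predict(text) == label)
    return {
        "accuracy": correct / len(golden),
        "size": len(golden),
        "stale": (today - last_refreshed).days > max_age_days,
    }
```

The three returned values map directly onto custom metrics in an existing dashboard, and the `stale` flag keeps the test set from ageing unnoticed alongside the model it is meant to check.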

A Fit Call pressure-tests your model lifecycle against the four drift types and shows where silent degradation may already be eroding your business outcomes, before a stakeholder finds it for you.


References: Evidently AI, "Model monitoring for ML in production: a comprehensive guide" (evidentlyai.com/ml-in-production/model-monitoring); Evidently AI, "What is data drift in ML, and how to detect and handle it" (evidentlyai.com/ml-in-production/data-drift); Lingjiao Chen, Matei Zaharia, James Zou, "How Is ChatGPT's Behavior Changing over Time?" 2023 (arxiv.org/abs/2307.09009); European Union, EU AI Act Article 72, "Post-market monitoring by providers and post-market monitoring plan for high-risk AI systems" (artificialintelligenceact.eu/article/72).
