The data scientist's Jupyter notebook works. The model produces correct outputs. The demo impresses the management board. Then someone asks the question that quietly kills most projects: "Can we put this in production?"

That is where the initiative stalls — not because the model is wrong, but because the distance between a working notebook and a dependable production API is far longer than anyone budgeted for. Gartner has predicted that at least 30 percent of generative AI projects would be abandoned after proof of concept by the end of 2025, citing poor data quality, inadequate risk controls, escalating costs, and unclear business value. In our experience with DACH mid-market teams, the unspoken fifth reason hides inside "escalating costs": nobody costed the serving layer. The model was the cheap part. Turning it into something an order-management system can call ten thousand times a day, at predictable latency, under audit, is the expensive part — and it is an engineering discipline, not a data-science one.

The production requirements gap

A notebook runs on one machine, handles one request at a time, has no error handling, no authentication, no monitoring, and no way to recover from a crash. Production demands all of those at once, and the absence of any single one is enough to take the service down.

The first gap is concurrency. A model fielding a hundred simultaneous users needs request queuing, batching, and resource management that a Python script in a notebook simply does not have — fire two requests at a naïve Flask wrapper around a GPU model and the second one waits for the first to finish, or the process falls over. The second gap is reliability: health checks, graceful degradation, automatic restart on failure, and defined behaviour when the model is overloaded. A notebook crashes silently and someone notices the next morning; a production endpoint that crashes silently takes a billing run or a customer-facing workflow down with it.
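To make "request queuing" and "defined behaviour when the model is overloaded" concrete, here is a minimal sketch using only Python's asyncio. The names InferenceGate, Overloaded, max_running and max_waiting are illustrative assumptions, not part of any serving framework, and a real engine handles this internally with far more care. The sketch only shows the decision a notebook never makes: how many requests run at once, how many may wait, and what happens to the rest.

```python
import asyncio


class Overloaded(Exception):
    """Raised when the waiting line is full and the request is shed."""


class InferenceGate:
    """Hypothetical admission gate: bounded concurrency plus a bounded queue.

    Create it inside a running event loop (for example in an async startup hook).
    """

    def __init__(self, max_running: int, max_waiting: int) -> None:
        self._slots = asyncio.Semaphore(max_running)
        self._max_waiting = max_waiting
        self._waiting = 0

    async def run(self, model_call, *args):
        # Shed load explicitly instead of letting the queue grow without bound.
        if self._waiting >= self._max_waiting:
            raise Overloaded("queue full, retry later")
        self._waiting += 1
        try:
            await self._slots.acquire()  # wait for a free execution slot
        finally:
            self._waiting -= 1
        try:
            return await model_call(*args)
        finally:
            self._slots.release()  # always free the slot, even on failure
```

In an HTTP layer, the Overloaded exception would typically be mapped to a 429 or 503 response so that callers can back off and retry, rather than timing out against a silent queue.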

The third gap is latency consistency. A notebook takes as long as it takes. Production lives and dies on the tail: not the average, but the p95 and p99 response times that downstream systems and impatient users actually feel. The fourth is security: authentication, rate limiting, input validation, output sanitisation. A model that accepts arbitrary input without validation is a prompt-injection and denial-of-service liability the moment it is exposed. The fifth is observability, which means knowing in real time how the service behaves: request volume, latency distribution, error rates, GPU utilisation, and drift in output quality. Without it you are flying a production system blind, which under NIS2 and the EU AI Act's logging expectations for high-risk systems is no longer merely careless; it is a documentation gap an auditor will find.
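Why the tail rather than the average? A small sketch makes the difference visible. It uses the nearest-rank definition of a percentile, which is one common convention among several; monitoring tools often interpolate instead. The sample figures in the usage note are invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]
```

Take 100 hypothetical requests, 94 of which answer in 120 ms and 6 in 2,000 ms. The mean is about 233 ms and the median is 120 ms, both of which look healthy. The p95 and p99 are 2,000 ms, which is what a downstream system with a one-second timeout actually experiences.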

The serving stack: three layers that close the gap

Containerisation comes first. Package the model, its dependencies, and the inference code into a Docker image, with the NVIDIA Container Toolkit exposing the GPU to the container. This gives you a portable, reproducible unit that runs identically on the data scientist's workstation, in staging, and in production, and it retires "it works on my machine" — still the single most common deployment failure mode. Pin every version: the CUDA runtime, the framework, the model weights by hash. An unpinned dependency is a future outage with a delayed fuse.
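Pinning weights "by hash" can be as simple as a startup check. The following is a sketch under assumptions: the function names are hypothetical, and the expected hash is assumed to be recorded wherever your team keeps release metadata. It uses only the Python standard library.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so multi-gigabyte weights never sit in RAM at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_weights(path: Path, expected_sha256: str) -> None:
    """Refuse to start if the file on disk is not the artefact that was pinned."""
    actual = sha256_of(path)
    if actual != expected_sha256.lower():
        raise RuntimeError(f"weights hash mismatch for {path}: got {actual}")
```

Calling verify_weights before the model is loaded turns a silently swapped or truncated weights file into an immediate, explainable startup failure instead of a quality regression discovered weeks later.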

The serving framework is the layer most teams underestimate. It owns the HTTP API, request batching, model loading, and GPU memory. For large language models, vLLM is the pragmatic default: its PagedAttention algorithm manages the key-value cache the way an operating system manages virtual memory, in non-contiguous blocks with near-zero waste, and its continuous batching schedules work at the level of individual decode steps rather than fixed batch windows, so a finishing request immediately frees its slot for the next one in the queue. Hugging Face's Text Generation Inference (TGI) trades some of that tunability for tighter integration with the Hugging Face hub, which shortens setup for standard open-weight models; like vLLM, it offers an OpenAI-compatible endpoint. NVIDIA's Triton Inference Server is the heavier, enterprise-grade choice: it serves models from many frameworks at once (TensorRT, PyTorch, ONNX, OpenVINO, Python) and supports dynamic batching, concurrent model execution, and config-defined ensembles that chain several models into one pipeline. Triton earns its configuration overhead only when you are genuinely running a diverse fleet; for a single LLM endpoint it is over-engineering.
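The difference between fixed batch windows and step-level scheduling is easier to see in a toy model than in prose. The simulation below is not vLLM's scheduler; it is a deliberately simplified sketch that assumes every request needs a known, positive number of decode steps and that the GPU has a fixed number of slots. It counts how many steps each policy needs to drain the same queue.

```python
from collections import deque


def static_batching_steps(lengths, slots):
    """Fixed batches: each batch occupies the GPU until its longest request finishes."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    total = 0
    for start in range(0, len(lengths), slots):
        total += max(lengths[start:start + slots])
    return total


def continuous_batching_steps(lengths, slots):
    """Step-level scheduling: a finished request frees its slot for the next one queued."""
    if slots < 1:
        raise ValueError("slots must be at least 1")
    queue = deque(lengths)
    running = []  # remaining decode steps per active request
    steps = 0
    while queue or running:
        while queue and len(running) < slots:
            running.append(queue.popleft())  # refill free slots before each step
        steps += 1
        running = [left - 1 for left in running if left > 1]
    return steps
```

With four requests of 1, 4, 1 and 4 decode steps on two slots, the static policy needs 8 steps because each short request waits out its long neighbour, while the continuous policy needs 6. The gap widens as output lengths become more uneven, which is the normal case for chat-style traffic.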

Orchestration is the third layer, and here Mittelstand teams should resist fashion. Kubernetes manages container lifecycle, scaling, and resource allocation, with load balancers spreading traffic across replicas and health checks replacing failed containers — and for a first deployment it is usually overkill. One or two models running behind Traefik or NGINX with Docker Compose, on a single GPU host, is a perfectly respectable production posture that a small team can actually operate and reason about. Reach for Kubernetes when you are running several models, need genuine auto-scaling against spiky demand, or have to spread inference across multiple GPU nodes — not because a conference talk said so. The right architecture is the one your team can keep alive at 3 a.m., not the one with the most impressive diagram.
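Whichever tool does the job, the core of "load balancers spreading traffic across replicas and health checks replacing failed containers" is a small loop. The sketch below shows it in plain Python; ReplicaPool and the replica names are hypothetical, and in practice this logic lives inside Traefik, NGINX or Kubernetes rather than in your own code.

```python
class ReplicaPool:
    """Hypothetical round-robin selector that skips replicas marked unhealthy."""

    def __init__(self, replicas):
        self._replicas = list(replicas)
        self._healthy = set(self._replicas)
        self._next = 0

    def mark(self, replica, healthy):
        """Record the result of the latest health check for one replica."""
        if replica not in self._replicas:
            raise KeyError(replica)
        if healthy:
            self._healthy.add(replica)
        else:
            self._healthy.discard(replica)

    def pick(self):
        """Return the next healthy replica in rotation, or fail loudly if none is left."""
        for _ in range(len(self._replicas)):
            candidate = self._replicas[self._next]
            self._next = (self._next + 1) % len(self._replicas)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy replica available")
```

Seeing it this small is a useful sanity check on the architecture decision: with two replicas on one host, a reverse proxy already covers this behaviour, and the case for a full orchestrator has to rest on something else.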

Deployment patterns that let you change the model without holding your breath

Once a model is live, every update is a risk, and the pattern you choose decides how that risk is contained. Blue-green deployment runs the new version alongside the old and switches traffic across only after validation; if something breaks, you switch back instantly. It costs double infrastructure during the cutover but buys zero-downtime releases and an immediate rollback path. Canary deployment is the safest default for model updates: route five to ten percent of traffic to the new version, watch its metrics against the incumbent, and widen the split only if quality and latency hold — otherwise roll it back having exposed almost no one. Shadow deployment goes further for high-stakes changes: send live traffic to both versions but serve only the old one's responses, comparing the new model's outputs offline. It surfaces regressions with zero user impact, at the price of paying for inference twice during the trial. For most Mittelstand cases, canary is the workhorse and shadow is reserved for the model that sits in a regulated decision path.
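A canary split needs two pieces: a stable way to decide which requests see the new version, and an explicit rule for widening the split. The sketch below shows one possible shape for both. The function names, the choice of SHA-256 for bucketing, and the 10 percent tolerance are assumptions made for illustration, not a recommendation for any specific workload.

```python
import hashlib


def bucket(request_key: str) -> int:
    """Map a key to a stable bucket in 0..99 so the same caller always lands on the same side."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(request_key: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent of keys and "stable" for the rest."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if bucket(request_key) < canary_percent else "stable"


def should_widen(stable_error_rate, canary_error_rate,
                 stable_p95_ms, canary_p95_ms, tolerance=1.10):
    """Widen only if the canary is at most `tolerance` times worse on both metrics."""
    return (canary_error_rate <= stable_error_rate * tolerance
            and canary_p95_ms <= stable_p95_ms * tolerance)
```

Hashing a customer or session identifier, rather than rolling a random number per request, keeps each caller on one version for the whole trial, which makes complaints and metrics far easier to attribute. Shadow deployment reuses the same idea with a different last step: both versions are called, only the stable answer is returned, and the pair is logged for offline comparison.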

The optimisation that decides whether the GPU pays for itself

A GPU is the most expensive line item in the stack, so utilisation is an economic question, not just a technical one. Continuous batching is the largest single lever: a naïve inference loop leaves much of the GPU idle between requests, and switching to a serving engine that batches continuously can lift throughput several-fold on the same hardware — vLLM's own benchmarks put it at multiples of a basic PyTorch serving loop. That is the difference between buying one GPU and buying three. Quantisation is the next lever — serving weights at INT8 or INT4 via formats such as GPTQ or AWQ, which vLLM and TGI support natively, shrinks the memory footprint and raises throughput. The discipline that separates professionals from optimists here is non-negotiable: always measure accuracy on your own task before and after quantising, because some tasks tolerate it cheerfully and others degrade in ways a generic benchmark never shows.
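The economics can be checked on the back of an envelope. The sketch below does the two calculations the paragraph relies on; all input figures in the usage note are hypothetical, and the memory estimate covers weights only, ignoring the KV cache, activations and the small overhead of quantisation scales.

```python
import math


def gpus_needed(peak_requests_per_s: float, per_gpu_requests_per_s: float) -> int:
    """GPUs required to absorb peak load at a given sustained per-GPU throughput."""
    return math.ceil(peak_requests_per_s / per_gpu_requests_per_s)


def weight_memory_gib(params_billion: float, bits_per_weight: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30
```

Suppose peak load is 30 requests per second. If a naïve loop sustains 10 per GPU, you need three GPUs; if continuous batching lifts that to 30 per GPU, you need one. Likewise, a 7-billion-parameter model needs roughly 13.0 GiB for 16-bit weights and roughly 3.3 GiB at 4 bits, which frees memory for more concurrent sequences on the same card. Your own measured throughput, not these placeholder numbers, belongs in the formula.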

Two smaller disciplines round it out. KV-cache management determines how many concurrent sequences a single GPU can hold; vLLM's PagedAttention is the reason it serves so many at once without fragmenting memory. And warm-up matters more than it sounds: load the model and run it on representative inputs before opening the endpoint, because the first request after a cold load is markedly slower than steady state and should never be the one a customer waits on. None of this is exotic. It is the ordinary plumbing that turns a clever notebook into infrastructure your business can lean on — and it is precisely the plumbing that nobody scoped when the demo got the applause.
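How much paging helps with concurrency can be illustrated with simple block accounting. This is a toy calculation, not vLLM's allocator: it assumes a cache budget measured in token slots, a block size of 16 tokens chosen for illustration, and sequences admitted in arrival order.

```python
import math


def contiguous_capacity(total_token_slots: int, max_seq_len: int) -> int:
    """Sequences that fit when each one reserves max_seq_len slots up front."""
    return total_token_slots // max_seq_len


def paged_capacity(total_token_slots: int, seq_lens, block_size: int = 16) -> int:
    """Sequences admitted when each takes only ceil(length / block_size) blocks."""
    free_blocks = total_token_slots // block_size
    admitted = 0
    for length in seq_lens:
        needed = math.ceil(length / block_size)
        if needed > free_blocks:
            break  # cache is full; later sequences wait in the queue
        free_blocks -= needed
        admitted += 1
    return admitted
```

With a budget of 16,384 token slots and a 2,048-token maximum length, up-front reservation holds 8 sequences regardless of how short they really are. If the actual sequences are around 300 tokens each, block-wise allocation admits 53 of them, wasting at most 15 slots per sequence in the final partially filled block.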

A Fit Call maps your serving framework, deployment pattern, and scaling path to the team you actually have — before the GPU bill, not after the outage. We have written about scaling from pilot to production and the self-hosting decision; this is where those choices become concrete infrastructure.



References: Gartner, "Gartner Predicts 30% of Generative AI Projects Will Be Abandoned After Proof of Concept By End of 2025," July 2024 (https://www.gartner.com/en/newsroom/press-releases/2024-07-29-gartner-predicts-30-percent-of-generative-ai-projects-will-be-abandoned-after-proof-of-concept-by-end-of-2025); vLLM Project, "PagedAttention" design documentation (https://docs.vllm.ai/en/latest/design/paged_attention/); vLLM Project repository and benchmarks (https://github.com/vllm-project/vllm); NVIDIA, "Triton Inference Server — Dynamic Batching and Concurrent Model Execution," official documentation (https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html).