Inference Economics: Self-Hosted vs. API — The Real Math

The self-hosting-versus-API decision is the most consequential infrastructure choice in enterprise AI — and the one most often made on instinct rather than arithmetic.

In DACH, the instinct runs one direction: keep the data on your own hardware, hold full control, avoid dependency on US cloud providers. GDPR (DSGVO), NIS2 and sector regulation all reinforce the pull. The instinct is legitimate. But when you actually price both paths, the data-sovereignty argument and the cost argument point in opposite directions far more often than most management boards (Geschäftsführungen) expect, and conflating the two leads companies to buy infrastructure they neither need nor can staff.

The API price floor keeps dropping

The single most important fact in this analysis is that inference is getting cheaper at a pace that breaks ordinary planning assumptions. Epoch AI's price-trend analysis found that the cost to reach a given benchmark score has been falling between 9x and 900x per year, with a median around 50x, and that the steepest declines are the most recent, beginning after January 2024. The price of GPT-4-class performance on PhD-level science questions fell roughly 40x per year. The Artificial Analysis pricing database tells the same story from the buy side: frontier-class capability that was priced like a luxury in early 2025 is commodity-priced a year later, and capable mid-range models cost only a fraction of that.

What this means commercially is uncomfortable for the self-hosting case. A capital outlay you amortise over three years is competing against a per-token price that may fall by an order of magnitude inside that same window. The hardware you buy today is fixed; the API price you would otherwise pay is a moving target heading down. Any break-even model that assumes today's API price will hold therefore overstates the case for self-hosting.
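The effect of a falling price on a three-year comparison can be sketched in a few lines. This is a minimal illustration, not a forecast: the function name, the smooth geometric decay, the volume of 1,000 million tokens a month and the starting price of 10 per million tokens are all hypothetical, and the tenfold decline is simply the "order of magnitude" case from the paragraph above.

```python
def cumulative_api_cost(monthly_tokens_m, price_per_m, total_decline_factor, months=36):
    """Sum monthly API spend while the per-token price decays smoothly.

    monthly_tokens_m      -- volume in millions of tokens per month (hypothetical)
    price_per_m           -- starting price per million tokens (hypothetical)
    total_decline_factor  -- 10.0 means the price would reach one tenth of its
                             starting value after `months` months; 1.0 means flat
    """
    total = 0.0
    for month in range(months):
        # Geometric decay: the price shrinks by the same ratio every month.
        price = price_per_m / total_decline_factor ** (month / months)
        total += monthly_tokens_m * price
    return total


# Same workload, priced two ways over the amortisation window.
flat = cumulative_api_cost(1_000, 10.0, total_decline_factor=1.0)
falling = cumulative_api_cost(1_000, 10.0, total_decline_factor=10.0)
print(f"flat price: {flat:,.0f}   falling price: {falling:,.0f}")
```

Under these assumptions the flat-price model puts the three-year API bill at well over twice what the falling-price model gives, which is the size of the error a static break-even spreadsheet carries.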

The second advantage of APIs is operational, not financial: radical simplicity. No GPU procurement, no cluster engineering, no on-call rotation, no model-update project every quarter. Changing model versions is a configuration change. Scaling from one million to a hundred million tokens a day requires nothing from your infrastructure team — because you do not have one for this.

What self-hosting actually costs

Self-hosting economics are front-loaded, non-linear, and consistently underestimated because most estimates stop at the GPU sticker price. That price is real enough. A single NVIDIA H100 runs roughly $25,000 to $33,000 for the PCIe variant and $35,000 to $40,000-plus for the higher-bandwidth SXM5, per IntuitionLabs' 2026 pricing guide. A production-grade eight-GPU server — the validated systems Dell, Lenovo and Supermicro ship — lands between $250,000 and $400,000 depending on configuration, with the DGX-class boxes at the top of that range.

Most DACH companies rent rather than buy, which moves the number but not the logic. H100 cloud rental in 2026 spans an enormous range: roughly $1.49 to $2.50 an hour on specialised GPU clouds, against $6 to $12 an hour on the hyperscalers. AWS, Azure and GCP charge a multiple of the specialist rate for the same silicon, per IntuitionLabs' and Spheron's cross-provider benchmarks. And rental pricing is no longer reliably falling: one-year contract rates rose nearly 40 per cent between late 2025 and early 2026 as demand outran supply. So the rental market you are counting on to stay cheap is, if anything, getting tighter.
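To make the rental range concrete, the hourly figures above can be turned into a monthly bill for an eight-GPU node. This is a rough sketch under stated assumptions: the quoted rates are treated as per-GPU-hour prices, the node is reserved around the clock for an average 730-hour month, and the helper names and the 50 per cent utilisation figure are hypothetical.

```python
HOURS_PER_MONTH = 730  # assumption: average month, node reserved 24/7


def monthly_rental_cost(rate_per_gpu_hour, gpus=8, hours=HOURS_PER_MONTH):
    """Monthly bill in USD for renting `gpus` GPUs around the clock."""
    return rate_per_gpu_hour * gpus * hours


def cost_per_useful_gpu_hour(rate_per_gpu_hour, utilisation):
    """Effective hourly price once idle time is counted (utilisation in (0, 1])."""
    if not 0 < utilisation <= 1:
        raise ValueError("utilisation must be in (0, 1]")
    return rate_per_gpu_hour / utilisation


# Hourly rates quoted in the text, assumed here to be per GPU.
quoted_rates = {
    "specialised cloud, low": 1.49,
    "specialised cloud, high": 2.50,
    "hyperscaler, low": 6.00,
    "hyperscaler, high": 12.00,
}

for label, rate in quoted_rates.items():
    print(f"{label:<24} ${monthly_rental_cost(rate):>10,.2f} per month")

# A node that is busy half the time costs twice the list rate per useful hour.
print(cost_per_useful_gpu_hour(2.50, utilisation=0.5))
```

The second helper is the point worth keeping: a rented GPU is billed whether or not it is serving tokens, so the list rate only applies at full utilisation.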

The sticker price is the smallest line. A GPU sitting idle generates no value. Around it you need networking, fast storage, redundancy, cooling, security, and — most expensively — people. A self-hosted inference stack is a system you now own end to end: monitoring, patching, model validation, incident response, and a re-qualification cycle every time a meaningfully better open model ships, which in 2025–26 has been roughly every couple of months. Each of those updates is a small project with testing and pipeline risk. The API equivalent of that entire workstream is a model-name string.

The talent line is the one that breaks the case in DACH. Running production GPU inference reliably needs ML-infrastructure engineers, and that skill set is scarce and expensive across the German-speaking market. If you already employ that capability and it is under-utilised, self-hosting can use it well. If you have to hire for it, you are not adding a line item — you are starting a months-long recruitment effort for a role that competes directly with every other AI-ambitious company in the region, before a single token is served.

Where the break-even really sits

There is no single token threshold at which self-hosting wins; anyone quoting a precise universal number is selling a conclusion. The honest model has three variables: sustained volume, model size, and whether you already hold the engineering capability.

The shape, though, is clear and decisive. Self-hosting rewards high, steady, predictable volume on a fixed model, where amortised hardware and a standing team are spread across enough tokens to beat the per-call API price — and where you can keep utilisation high enough that idle GPUs are not quietly burning capital. APIs win on everything else: low or spiky volume, experimentation, frequent model switching, and any workload where you would otherwise stand up a team purely to serve it. For the typical DACH Mittelstand company running development plus moderate production traffic, the per-token API bill is materially smaller than the fully loaded cost of doing it yourself — once labour, redundancy and the re-qualification treadmill are counted rather than waved away. Self-hosting becomes the right call at genuine scale, or when regulation leaves no choice — not as a default.
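The shape described above can be written down as a small model. It is a sketch, not a calculator: every number is hypothetical, the function names are invented for illustration, and the three variables enter in simplified form. Sustained volume is `monthly_tokens_m`, model size shows up through `capacity_tokens_m` and the API price, and engineering capability is folded into `monthly_fixed_cost` as fully loaded labour.

```python
def break_even_tokens_m(monthly_fixed_cost, api_price_per_m):
    """Monthly volume (millions of tokens) at which the API bill equals the
    fully loaded fixed cost of self-hosting."""
    return monthly_fixed_cost / api_price_per_m


def cheaper_option(monthly_tokens_m, monthly_fixed_cost, api_price_per_m,
                   capacity_tokens_m):
    """Return "self-host" or "api" for one workload at one point in time."""
    if monthly_tokens_m > capacity_tokens_m:
        # The fixed cost only buys a fixed amount of throughput.
        raise ValueError("volume exceeds what the self-hosted stack can serve")
    api_bill = monthly_tokens_m * api_price_per_m
    return "self-host" if monthly_fixed_cost < api_bill else "api"


# Hypothetical inputs: 50,000 per month for hardware, power and staff,
# an API price of 5 per million tokens, capacity of 20,000M tokens a month.
FIXED, API_PRICE, CAPACITY = 50_000, 5.0, 20_000

print(break_even_tokens_m(FIXED, API_PRICE))               # volume needed to break even
print(cheaper_option(2_000, FIXED, API_PRICE, CAPACITY))   # moderate traffic
print(cheaper_option(15_000, FIXED, API_PRICE, CAPACITY))  # sustained high volume
```

Because the break-even volume is the fixed cost divided by the API price, every fall in the API price pushes the threshold up in direct proportion, which is why the comparison has to be rerun rather than decided once.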

The DACH-specific factors

Three regional factors tilt the maths further against on-premise GPUs than a generic US analysis would suggest.

Energy. German industrial electricity sits around 18 to 20 cents per kWh — among the highest in Europe and roughly double UK or US industrial rates, before the relief schemes that energy-intensive firms may qualify for. An eight-GPU server draws several kilowatts continuously, all year. That is a structural monthly cost that simply does not exist in the API model, where the provider absorbs power and cooling inside the per-token price. The grid context makes it tangible: in Frankfurt, data centres already account for around 40 per cent of total electricity demand, and the local utility reports available power capacity is effectively booked out — so this is not a cost that competition is about to drive down.
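The electricity line is simple enough to estimate directly. The sketch below assumes a continuous IT load of 6 kW for "several kilowatts", which is a hypothetical figure, and lets you add a PUE multiplier for cooling and facility overhead; the function name is illustrative, and the result is in whatever currency the per-kWh price is quoted in.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days


def annual_energy_cost(it_load_kw, price_per_kwh, pue=1.0):
    """Yearly electricity cost for a server running continuously.

    it_load_kw     -- average draw of the server itself (hypothetical here)
    price_per_kwh  -- tariff, e.g. 0.18 to 0.20 for the range in the text
    pue            -- power usage effectiveness; 1.0 ignores cooling overhead
    """
    return it_load_kw * pue * HOURS_PER_YEAR * price_per_kwh


low = annual_energy_cost(6.0, 0.18)            # server only, lower tariff
high = annual_energy_cost(6.0, 0.20, pue=1.5)  # with facility overhead, upper tariff
print(f"{low:,.0f} to {high:,.0f} per year")
```

On these assumptions the bill lands in the low five figures per server per year: small next to the labour line, but recurring, and absent from the API comparison altogether.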

Regulation. Under Germany's Energy Efficiency Act (EnEfG), data-centre operators have had to cover 50 per cent of their electricity consumption from renewables since January 2024, rising to 100 per cent from January 2027, alongside tightening PUE (power usage effectiveness) targets (≤1.5 from mid-2027, ≤1.3 from 2030) and waste-heat-reuse obligations for newer facilities. If your "self-hosting" means a meaningful on-premise build rather than a few servers in a rack, these obligations add procurement complexity and cost that the API path externalises entirely.

Sovereignty is a governance question, not a hosting question. The argument for self-hosting is usually "we cannot send this data to US servers." But the EU AI Act does not mandate on-premise processing — it mandates documented data governance, and for high-risk systems, demonstrable data quality and oversight. EU-resident API endpoints — Azure's EU Data Boundary, AWS's European Sovereign Cloud, EU-region model providers — satisfy most data-residency requirements without the infrastructure burden. The one caveat worth taking seriously is the US CLOUD Act: a US-headquartered provider can in principle be compelled to produce data regardless of where the server physically sits, which is precisely why sovereign-cloud constructs (German-incorporated entities, EU-resident control) exist. That is a reason to choose your endpoint carefully — not, for most workloads, a reason to buy GPUs.

The hybrid default

For most DACH enterprises the right architecture is neither pure path but a deliberate split: API endpoints for development, experimentation and the bulk of moderate-volume production; self-hosting reserved for the specific workload that clears a genuine scale break-even, or where a real regulatory requirement — not a preference, and not a CLOUD-Act anxiety an EU sovereign endpoint already answers — mandates on-premise processing. The discipline is to decide each workload on its own arithmetic, and to revisit it as API prices keep falling and your volumes change.
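The per-workload discipline can be expressed as a short decision rule. This is one possible encoding of the split described above, not a prescribed method: the `Workload` fields, the example workloads and the function name are all hypothetical, and the break-even flag is assumed to come from a separate cost model.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    on_prem_legally_required: bool  # a real mandate, not a preference
    volume_is_steady: bool          # high, predictable traffic on a fixed model
    clears_break_even: bool         # result of a separate fully loaded cost model


def choose_path(workload):
    """Return "self-host" only where regulation or scale genuinely demands it."""
    if workload.on_prem_legally_required:
        return "self-host"
    if workload.volume_is_steady and workload.clears_break_even:
        return "self-host"
    return "api"  # the default for everything else


portfolio = [
    Workload("prototype chatbot", False, False, False),
    Workload("nightly document batch", False, True, True),
    Workload("restricted records pipeline", True, False, False),
]

for w in portfolio:
    print(f"{w.name}: {choose_path(w)}")
```

Keeping the rule this small is deliberate: the inputs change as API prices and volumes move, so the decision is cheap to rerun for each workload.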

The trap is letting the sovereignty instinct write a capital-expenditure cheque the economics do not support. Sovereignty you can usually buy as governance. Capacity you only need at scale. Confusing the two is how AI infrastructure budgets get spent on idle GPUs.

A Fit Call models your inference break-even against your real workloads, volumes and regulatory constraints — so you invest in on-premise capacity only where it actually pays, before the procurement decision is made for you.


References: Epoch AI, "LLM inference prices have fallen rapidly but unequally across tasks," 2025 (https://epoch.ai/data-insights/llm-inference-price-trends); IntuitionLabs, "NVIDIA AI GPU Prices: H100 & H200 Cost Guide," 2026 (https://intuitionlabs.ai/articles/nvidia-ai-gpu-pricing-guide); IntuitionLabs, "H100 Rental Prices Compared," 2026 (https://intuitionlabs.ai/articles/h100-rental-prices-cloud-comparison); Spheron, "GPU Cloud Pricing 2026," 2026 (https://www.spheron.network/blog/gpu-cloud-pricing-comparison-2026/); White & Case, "Data center requirements under the new German Energy Efficiency Act" (https://www.whitecase.com/insight-alert/data-center-requirements-under-new-german-energy-efficiency-act); AlgorithmWatch, "Germany's Data Center Boom is Pushing the Power Grid to its Limits" (https://algorithmwatch.org/en/germany-data-center-boom/); TechPolicy.Press, "Germany's Data Center Boom Is Pushing the Power Grid to Its Limits" (https://www.techpolicy.press/germanys-data-center-boom-is-pushing-the-power-grid-to-its-limits/).
