Most enterprises default to the largest available model for every task. This is the AI equivalent of flying first class to every meeting — impressive, expensive, and usually unnecessary.
The language model landscape now spans from sub-billion-parameter models that run on a laptop to frontier systems trained on more than 10²⁵ floating-point operations. Two things have changed the calculus. First, capability has commoditised faster than almost anyone forecast: Stanford's 2025 AI Index found that the price of querying a model at GPT-3.5 quality fell from roughly twenty dollars per million tokens in late 2022 to about seven cents by late 2024, a more than 280-fold drop in about two years. Second, on narrow, repetitive tasks the gap between the top tier and a well-chosen smaller model has all but closed. The question is no longer "which model is best?" It is "which is the cheapest model that handles this specific task reliably?"
The three weight classes
Enterprise workloads fall into three broad categories that map cleanly onto three model tiers.
Tier 1: small models (sub-1B to roughly 8B parameters). Classification, entity extraction, routing, structured-data parsing, deterministic summarisation. These run on commodity hardware and can be fine-tuned in hours rather than weeks. The case for them is no longer aspirational. NVIDIA's research team argues, in "Small Language Models are the Future of Agentic AI", that for the specialised, repetitive calls that dominate real agentic systems, small models are not merely cheaper but more suitable, and typically an order of magnitude cheaper to serve than a generalist frontier model. Peer-reviewed work points the same way: a fine-tuned Phi-3.5 Mini, for example, has matched GPT-4o on enterprise search-relevance labelling while running faster and far cheaper. For a manufacturer triaging support tickets or a logistics firm pulling shipment data off PDFs, a fine-tuned small model is usually the correct answer, not a compromise.
Tier 2: mid-range models (roughly 8B to 70B parameters). Document summarisation, multi-step reasoning over structured data, code generation, content drafting, conversational agents. This tier carries most enterprise knowledge work and tends to offer the best capability-per-euro. A quantised mid-range model running on a manageable GPU footprint handles the bulk of day-to-day drafting, analysis and question-answering without reaching for the frontier — and, critically, can be self-hosted inside your own perimeter.
Tier 3: frontier models (the largest, most capable systems). Genuine ambiguity: legal analysis with conflicting precedents, novel problem decomposition, cross-domain synthesis, long-horizon agentic workflows. These are the models you reach for when the task rewards real handling of nuance. Per token they remain materially more expensive than Tier 2, so every request routed here should be one that genuinely needs it. Note also that the most capable models, those trained with more than 10²⁵ FLOP of compute, are the ones the EU AI Act presumes to carry systemic risk, with provider obligations that have applied since 2 August 2025. The tier you depend on is therefore also a compliance posture, not just a cost line.
The routing architecture
The single idea that separates cost-efficient AI operations from expensive ones is unglamorous: route each request to the cheapest model that can handle it reliably.
In practice that means a routing layer. A small, cheap classifier — often a Tier 1 model itself — inspects each incoming request and sends it to the appropriate tier. Simple extraction goes to the small model. Document analysis goes to the mid-range model. Genuine reasoning goes to the frontier. The router itself adds negligible cost relative to the inference it governs, and the savings come from the shape of real traffic: in most enterprise workloads the overwhelming majority of calls are mundane, and paying frontier prices for them is pure waste. You do not need a precise percentage to see the logic — when the cheap tier is roughly an order of magnitude cheaper to serve and absorbs most of the volume, the blended cost collapses.
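The routing idea above can be sketched in a few lines of Python. This is a minimal illustration, not a production router: the keyword lists stand in for a trained Tier 1 classifier, and the names `Tier`, `classify`, `blended_cost`, `UNIT_COST` and `TRAFFIC_SHARE` are all hypothetical. The cost and traffic figures are placeholders chosen only to show the arithmetic (a frontier tier ten times the cost of the small tier, with most traffic landing on the small tier); they are not measurements.

```python
from enum import Enum


class Tier(Enum):
    SMALL = "tier-1-small"
    MID = "tier-2-mid"
    FRONTIER = "tier-3-frontier"


# Hypothetical keyword hints standing in for a trained Tier 1 classifier.
FRONTIER_HINTS = ("conflicting precedent", "cross-domain", "novel problem")
MID_HINTS = ("summarise", "draft", "analyse", "write code")


def classify(request: str) -> Tier:
    """Escalate only when a hint matches; everything else stays on the small tier."""
    text = request.lower()
    if any(hint in text for hint in FRONTIER_HINTS):
        return Tier.FRONTIER
    if any(hint in text for hint in MID_HINTS):
        return Tier.MID
    return Tier.SMALL


def blended_cost(traffic_share: dict, unit_cost: dict) -> float:
    """Average cost per request, weighting each tier's cost by its share of traffic."""
    if abs(sum(traffic_share.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    return sum(share * unit_cost[tier] for tier, share in traffic_share.items())


# Illustrative placeholders only: relative cost per request and assumed traffic mix.
UNIT_COST = {Tier.SMALL: 1.0, Tier.MID: 4.0, Tier.FRONTIER: 10.0}
TRAFFIC_SHARE = {Tier.SMALL: 0.80, Tier.MID: 0.15, Tier.FRONTIER: 0.05}

if __name__ == "__main__":
    for request in (
        "Extract the shipment number from this PDF",
        "Summarise this 40-page supplier contract",
        "Assess these conflicting precedents for our licensing dispute",
    ):
        print(classify(request).value, "<-", request)

    routed = blended_cost(TRAFFIC_SHARE, UNIT_COST)
    all_frontier = UNIT_COST[Tier.FRONTIER]
    print(f"blended: {routed:.2f} per request, all-frontier: {all_frontier:.2f}")
```

With these placeholder figures the blended cost comes to about 1.9 units per request against 10 when everything goes to the frontier tier. The exact ratio depends entirely on your own traffic mix and prices, which is why the classifier defaults to the small tier and escalates only on evidence.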
The decision matrix
When you select a model for a specific workflow, five factors decide it.
Accuracy threshold. Define the acceptable error rate before choosing the model, not after. A customer-facing assistant answering product questions demands a higher bar than an internal tool summarising meeting notes, and the bar — not the brand name of the model — should drive the choice.
Latency requirement. Real-time work, such as live customer interactions and in-line quality checks, needs sub-second responses. Batch work, such as overnight report generation and document classification, can tolerate minutes. On comparable hardware, smaller models typically reach first token sooner, which is often the deciding factor for anything a human waits on.
Data sensitivity. Regulated industries in the DACH region (Germany, Austria, Switzerland), such as financial services, healthcare and parts of manufacturing, frequently cannot send data to an external API at all. That pushes you towards self-hosted models, which in turn favours the smaller architectures that run on infrastructure you can actually staff and govern. Here the small-model choice and the compliance choice point in the same direction.
Volume. At a hundred queries a day, model cost is a rounding error and you should optimise purely for quality. At a hundred thousand queries a day, the gap between tiers becomes one of the larger lines in your operating budget — and the entire economic case for routing rests on getting the high-volume traffic onto the cheap tier.
Maintenance budget. Bigger self-hosted models demand more infrastructure, more monitoring and more scarce ML-engineering time. If your team is one data engineer, a fine-tuned small model is operationally realistic and a self-hosted 70B is not. Be honest about the team you have, not the one in the architecture diagram.
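One way to make the five factors operational is to treat four of them as hard filters and let volume decide the tie-break. The sketch below assumes you have already measured each candidate on your own evaluation set; the `Option` and `Workflow` types, the `choose` function, the 100-queries-a-day cut-off and every number in `OPTIONS` are hypothetical placeholders, not benchmarks.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed cut-off below which cost is treated as a rounding error.
LOW_VOLUME_QUERIES_PER_DAY = 100


@dataclass(frozen=True)
class Option:
    name: str
    relative_cost: float    # placeholder cost per request
    error_rate: float       # measured on your own evaluation set
    latency_s: float        # measured response time in seconds
    self_hosted: bool
    engineers_needed: int   # staff required to operate it (0 for a hosted API)


@dataclass(frozen=True)
class Workflow:
    max_error_rate: float           # accuracy threshold, fixed before choosing
    max_latency_s: float            # latency requirement
    data_may_leave_perimeter: bool  # data sensitivity
    queries_per_day: int            # volume
    engineers_available: int        # maintenance budget


def choose(workflow: Workflow, options: list) -> Optional[Option]:
    """Filter by the hard constraints, then pick by quality or cost depending on volume."""
    feasible = [
        o for o in options
        if o.error_rate <= workflow.max_error_rate
        and o.latency_s <= workflow.max_latency_s
        and (o.self_hosted or workflow.data_may_leave_perimeter)
        and o.engineers_needed <= workflow.engineers_available
    ]
    if not feasible:
        return None  # no candidate meets the bar; revisit the constraints
    if workflow.queries_per_day <= LOW_VOLUME_QUERIES_PER_DAY:
        return min(feasible, key=lambda o: o.error_rate)   # optimise for quality
    return min(feasible, key=lambda o: o.relative_cost)    # optimise for cost


# Placeholder candidates; replace every figure with your own measurements.
OPTIONS = [
    Option("fine-tuned small, self-hosted", 1.0, 0.04, 0.3, True, 1),
    Option("70B mid-range, self-hosted", 4.0, 0.03, 1.5, True, 3),
    Option("frontier, external API", 10.0, 0.02, 2.0, False, 0),
]

ticket_triage = Workflow(0.05, 1.0, False, 100_000, 1)
meeting_notes = Workflow(0.05, 60.0, True, 100, 1)

if __name__ == "__main__":
    for label, wf in (("ticket triage", ticket_triage), ("meeting notes", meeting_notes)):
        picked = choose(wf, OPTIONS)
        print(label, "->", picked.name if picked else "no feasible option")
```

Under these assumptions the high-volume, sensitive ticket-triage workflow lands on the self-hosted small model, while the low-volume internal tool is free to take the most accurate option. A result of `None` is informative too: it means the accuracy bar, the latency budget or the staffing assumption has to move before any model will fit.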
What this means for your organisation
The organisations getting the best return on AI are not the ones running the most powerful models. They are the ones matching model capability to task complexity — and building the modest routing infrastructure that lets them do it automatically. That requires understanding your own workloads in enough detail to classify them, which is itself the more valuable exercise. Most teams discover that the share of their traffic that genuinely needs a frontier model is far smaller than their bill implies.
This is an architecture decision, not a technology one. And like most architecture decisions it compounds: the firms that get it right early spend less, move faster and scale more predictably than those routing everything through a single frontier API and hoping the price keeps falling. The price is falling — but a disciplined routing layer captures that benefit on every tier at once, and it does not wait for the next model release to pay off.
A Fit Call maps your real workloads to the right model tiers — and sizes the routing layer — before you over-commit to a single frontier API.
References: Stanford HAI, "2025 AI Index Report — Research and Development," 2025 (hai.stanford.edu/ai-index/2025-ai-index-report/research-and-development); Epoch AI, "LLM inference prices have fallen rapidly but unequally across tasks," 2025 (epoch.ai/data-insights/llm-inference-price-trends); Belcak et al., "Small Language Models are the Future of Agentic AI," NVIDIA Research, 2025 (arxiv.org/abs/2506.02153); European Commission, "AI Act — regulatory framework" and GPAI provider obligations effective 2 August 2025 (digital-strategy.ec.europa.eu/en/policies/regulatory-framework-ai).
