Data Quality for AI: What the Research Shows About Garbage In, Garbage Out

Every AI vendor says "data is the new oil." None of them tell you that most enterprise data is closer to crude sludge than refined fuel — and that deploying AI on top of it does not produce visibly bad results. It produces confidently wrong results at scale. A model fed inconsistent, stale, half-complete data does not fail loudly. It answers fluently, plausibly, and incorrectly, and it does so faster than any human could catch.

That is the trap. Bad data does not break the demo. It breaks the third quarter of production use, when the model has quietly been wrong about a few percent of cases the whole time and nobody built the controls to notice. For a DACH Mittelstand business about to commit budget to AI, the data underneath the use case is the variable that decides whether the investment compounds or quietly bleeds.

Garbage in is not a metaphor

The label-noise problem is real and measured. In the most cited study on the subject, Curtis Northcutt and colleagues at MIT examined the test sets of ten of the most widely used machine-learning benchmarks, including ImageNet, CIFAR, MNIST and Amazon Reviews, and estimated an average of at least 3.3 percent label errors, with the ImageNet validation set alone carrying at least 6 percent. These are the curated, academically scrubbed datasets the entire field treats as ground truth. If the gold-standard public benchmarks are wrong 3 to 6 percent of the time, the spreadsheet your sales team has been hand-tagging for five years is unlikely to be in better shape.

Northcutt's team also found something that should reframe how Mittelstand leaders think about AI spend: on noisy real-world data, judged against corrected labels, smaller and simpler models sometimes outperformed larger ones, because the larger models had more capacity to reproduce the systematic errors in the original labels. In plain terms, past a certain point, throwing a bigger model at a messy dataset can make the output worse, not better. The lever is the data, not the model.
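
Label errors of this kind can be surfaced with very little machinery. The sketch below is a simplified heuristic, loosely in the spirit of the confident-learning idea behind that line of research, not a reproduction of the study's method. It assumes you already have per-class probabilities from some trained classifier, ideally produced out-of-sample. The function name and the toy data are illustrative assumptions.

```python
from collections import defaultdict

def flag_suspect_labels(labels, pred_probs):
    """Return indices of rows whose human-assigned label looks doubtful.

    labels: list of int class ids as annotated by people.
    pred_probs: list of per-class probability lists from any classifier
    (an assumption: ideally out-of-sample, e.g. from cross-validation).
    """
    # Per-class threshold: the model's average confidence in class k
    # over the rows that were annotated as class k.
    sums, counts = defaultdict(float), defaultdict(int)
    for y, probs in zip(labels, pred_probs):
        sums[y] += probs[y]
        counts[y] += 1
    thresholds = {k: sums[k] / counts[k] for k in counts}

    suspects = []
    for i, (y, probs) in enumerate(zip(labels, pred_probs)):
        # Classes the model is confident about for this row.
        confident = [k for k, p in enumerate(probs)
                     if k in thresholds and p >= thresholds[k]]
        # Suspect: the model is confident in another class
        # and not confident in the annotated one.
        if confident and y not in confident:
            suspects.append(i)
    return suspects

# Toy example: row 2 is annotated as class 0 but looks like class 1.
labels = [0, 0, 0, 1, 1, 1]
pred_probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9],
              [0.2, 0.8], [0.1, 0.9], [0.3, 0.7]]
print(flag_suspect_labels(labels, pred_probs))
```

The flagged rows are candidates for human review, not verdicts. The value lies in shrinking five years of hand-tagging to a short list someone can actually check.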

The regulator now agrees. This is no longer just an engineering opinion. The EU AI Act, whose obligations for high-risk systems apply from 2 August 2026, makes data quality a legal requirement. Article 10 states that training, validation and testing datasets must be "relevant, sufficiently representative, and to the best extent possible, free of errors and complete in view of the intended purpose," with appropriate statistical properties and documented governance over how the data was collected, labelled, cleaned and checked for bias. If your AI use case touches credit decisions, HR screening, critical infrastructure or any other Annex III category, "we'll fix the data later" stops being a delivery risk and becomes a compliance exposure with a date attached.

The five dimensions that actually decide outcomes

Not all data problems are equal. Five dimensions do most of the damage, and they map directly to the kinds of systems a Mittelstand company already runs.

Completeness. Missing fields, partial records, gaps in time series. A churn model trained on a base where a third of customers have no interaction history will still produce predictions — it simply learns to predict from whatever features remain, which may not be the ones that matter. The model does not warn you it is guessing. It is the silent failures, not the loud ones, that erode trust in AI inside an organisation.
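
Completeness is the easiest of the five dimensions to measure before a model ever sees the data. A minimal sketch, assuming records arrive as plain dictionaries; the field names and the 90 percent bar are hypothetical placeholders, not recommended values.

```python
def completeness(records, fields):
    """Share of records with a non-empty value, per field."""
    total = len(records)
    report = {}
    for field in fields:
        # Treat a missing key, None and the empty string as "not filled".
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        report[field] = filled / total if total else 0.0
    return report

def fields_below(report, minimum):
    """Names of fields whose completeness falls under the given bar."""
    return sorted(f for f, share in report.items() if share < minimum)

# Hypothetical customer base where one in three has no interaction history.
customers = [
    {"customer_id": 1, "interaction_count": 14},
    {"customer_id": 2, "interaction_count": 3},
    {"customer_id": 3, "interaction_count": None},
]
report = completeness(customers, ["customer_id", "interaction_count"])
print(fields_below(report, minimum=0.9))
```

Running a report like this per candidate feature shows, before training, which inputs a churn model would effectively be guessing around.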

Consistency. The same entity described differently across systems: "Siemens AG" in the CRM, "Siemens" in the ERP, "SIEMENS AKTIENGESELLSCHAFT" in the contract archive. Entity resolution — reconciling these to a single canonical record — is a precondition for any AI application that reads across systems. In DACH businesses that have grown through acquisition, this inconsistency is not an edge case; it is the steady state, and it is where most cross-system AI ambitions quietly die.
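
A first pass at entity resolution is often just disciplined normalisation. The sketch below reduces company names to a crude matching key by lower-casing and dropping legal-form tokens. The token list is a deliberately short, hypothetical example, and real entity resolution would add fuzzy matching, identifiers such as VAT numbers, and human review of ambiguous cases.

```python
import re
import unicodedata
from collections import defaultdict

# Hypothetical, deliberately short list of legal-form tokens to ignore.
LEGAL_FORM_TOKENS = {"ag", "aktiengesellschaft", "gmbh", "mbh", "se", "kg", "co"}

def canonical_key(name):
    """Reduce a company name to a crude matching key."""
    text = unicodedata.normalize("NFKC", name).casefold()
    tokens = re.findall(r"\w+", text)
    core = [t for t in tokens if t not in LEGAL_FORM_TOKENS]
    # Fall back to all tokens if the name consisted only of legal forms.
    return " ".join(core or tokens)

def group_variants(names):
    """Group raw names that share the same canonical key."""
    groups = defaultdict(list)
    for name in names:
        groups[canonical_key(name)].append(name)
    return dict(groups)

names = ["Siemens AG", "Siemens", "SIEMENS AKTIENGESELLSCHAFT", "Muster GmbH"]
for key, variants in group_variants(names).items():
    print(key, variants)
```

Token stripping this blunt will also merge names that merely look alike, so the groups are best treated as merge proposals for review rather than automatic merges.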

Currency. How old is the data, and does that matter for the decision? A recommendation engine trained on last year's purchases recommends last year's products. A compliance assistant trained on pre-2024 regulatory text will not know the AI Act obligations it is supposed to help you meet. The discipline is to define, per use case, the maximum acceptable data age and measure against it — monthly-fresh data is fine for a monthly forecast and useless for real-time pricing.
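
The per-use-case age limit can be written down as configuration and checked mechanically. A minimal sketch; the use-case names and the limits are illustrative assumptions, not recommended values.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical maximum acceptable data age per use case.
MAX_AGE = {
    "monthly_forecast": timedelta(days=31),
    "realtime_pricing": timedelta(minutes=5),
}

def is_fresh(last_updated, use_case, now=None):
    """True if the data is young enough for the given use case."""
    now = now or datetime.now(timezone.utc)
    return now - last_updated <= MAX_AGE[use_case]

# The same ten-day-old extract passes one check and fails the other.
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
extract_time = now - timedelta(days=10)
print(is_fresh(extract_time, "monthly_forecast", now=now))
print(is_fresh(extract_time, "realtime_pricing", now=now))
```

Keeping the limits in one table makes the currency requirement an explicit business decision per use case, instead of an accident of whenever the last export ran.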

Accuracy. Does the data reflect reality? Contact data is the canonical example: by widely cited industry estimates, roughly a fifth or more of a B2B contact database goes stale each year as people change roles and companies. Sensor data drifts as instruments fall out of calibration. Financial records carry reconciliation gaps. An AI system inherits every one of these inaccuracies and propagates it into decisions at machine speed. The error does not get diluted, it gets multiplied.
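
The contact-decay figure compounds, which is easy to underestimate. A back-of-the-envelope sketch, assuming a constant decay of 20 percent per year and no maintenance at all. Both are simplifying assumptions, and real databases are partly refreshed as they go.

```python
def share_still_accurate(annual_decay, years):
    """Fraction of records still correct after `years` with no upkeep."""
    return (1 - annual_decay) ** years

# Print the surviving share for the first five years at 20% annual decay.
for year in range(1, 6):
    print(year, round(share_still_accurate(0.20, year), 3))
```

Under these assumptions, roughly half of an untouched contact base is wrong after three years, which is why the age of a dataset belongs in any accuracy estimate.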

Structure. Free-text fields, scanned PDFs, email threads — unstructured data needs preprocessing before AI can use it well, and the quality of that preprocessing sets the ceiling on the quality of the output. Poorly chunked documents produce poor retrieval. Inconsistently parsed PDFs produce noisy context. This is the dimension where Mittelstand companies most consistently underinvest, because it is invisible work with no demo value — right up until it is the reason the system does not work.
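
What "properly chunked" means depends on the corpus, but one common baseline is to split on paragraph boundaries rather than at a fixed character count, so that a chunk never cuts a sentence in half. A minimal sketch, assuming paragraphs are separated by blank lines; the size limit is an arbitrary placeholder.

```python
def chunk_paragraphs(text, max_chars=800):
    """Pack whole paragraphs into chunks of at most max_chars.

    A single paragraph longer than max_chars is kept intact as its own
    oversized chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            # Adding this paragraph would overflow: close the chunk.
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks

sample = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
print(len(chunk_paragraphs(sample, max_chars=25)))
```

Tables, headings and numbered clauses usually need their own handling on top of this. The sketch only illustrates the principle that chunk boundaries should follow the document's structure.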

Why retrieval punishes bad documents harder than bad models

For retrieval-augmented generation — the architecture most Mittelstand companies should reach for before fine-tuning — the practitioner pattern is consistent and counterintuitive to anyone shopping by model leaderboard. Swapping in a more capable model usually moves RAG accuracy by a modest margin. Cleaning, deduplicating and properly structuring the source documents the system retrieves from usually moves it far more. The reason is mechanical: the model can only reason over what retrieval hands it. If retrieval surfaces a stale policy, a duplicate with a contradictory clause, or a mangled table from a bad PDF parse, the most capable model on the market will reason flawlessly over the wrong input and hand you a confident, wrong answer.
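
Deduplication is the most mechanical of these corpus fixes. The sketch below removes documents that are identical after whitespace and case are normalised. That is an assumption about what counts as a duplicate: it will not catch near-duplicates with a changed clause, which need a similarity measure and a human decision about which version is authoritative. The document ids are hypothetical.

```python
import hashlib
import re

def fingerprint(text):
    """Hash of the text with whitespace collapsed and case folded."""
    normalised = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """docs: list of (doc_id, text). Keeps the first doc per fingerprint.

    Returns (kept_ids, dropped), where dropped pairs each removed id
    with the id of the document it duplicates.
    """
    seen = {}
    kept, dropped = [], []
    for doc_id, text in docs:
        fp = fingerprint(text)
        if fp in seen:
            dropped.append((doc_id, seen[fp]))
        else:
            seen[fp] = doc_id
            kept.append(doc_id)
    return kept, dropped

docs = [
    ("policy_v1.pdf", "Returns are accepted within 30 days."),
    ("policy_copy.pdf", "Returns  are accepted\nwithin 30 days."),
    ("policy_v2.pdf", "Returns are accepted within 14 days."),
]
print(deduplicate(docs))
```

In the toy data the reformatted copy is dropped, while the 14-day version survives next to the 30-day one. That contradictory pair is precisely the case that a hash cannot resolve and a document owner has to.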

The strategic read for a budget-holder is straightforward. When a RAG pilot underperforms, the instinct is to upgrade the model. The higher-yield move is almost always to fix the corpus. This is also the cheaper move, and it compounds — a clean document base lifts every future use case that retrieves from it.

The readiness bar, in practice

You do not need perfect data. You need data good enough for the specific use case in front of you, and you need to know which use case that is before you commit. For retrieval, that means a source corpus that is current within the relevant business cycle, structurally consistent, deduplicated and cleanly parsed — and a realistic expectation that the majority of the project's effort goes into document preparation, not prompt-tuning. For fine-tuning, it means a meaningful volume of consistently labelled examples with the label noise driven down, plus the honesty to admit that most organisations need real preparation time before fine-tuning is even viable. For analytics and prediction, it means completeness high enough that the model is not mostly guessing, entity consistency across the systems it reads, and currency inside the decision cycle.

The companies that succeed do three things, and none of them is glamorous. First, they audit data quality before choosing use cases, not after. The use cases you can pursue depend on the data you have, not the data you wish you had — and the most expensive mistake in enterprise AI is committing to an initiative the underlying data was never going to support. Second, they treat data work as the AI investment, not a tax on it. Cleaning, entity resolution, document structuring: this is the foundation, and money spent here buys disproportionate performance downstream. Third, they build quality monitoring into operations from day one — drift detection, completeness tracking, freshness alerts — so degradation is caught by a control, not by an angry customer.
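
Drift detection does not have to start with a platform purchase. One widely used baseline is the population stability index (PSI), which compares how a numeric field is distributed today against a reference period. A minimal sketch with hand-picked bin edges; the edges, the sample values and any alert threshold are assumptions to tune per field. A PSI above roughly 0.25 is a commonly quoted rule of thumb for a significant shift, not a standard.

```python
import math

def population_stability_index(baseline, current, edges):
    """PSI between two samples of a numeric field.

    edges: sorted bin boundaries; values fall into len(edges) + 1 bins.
    """
    def shares(values):
        if not values:
            raise ValueError("sample must not be empty")
        counts = [0] * (len(edges) + 1)
        for v in values:
            # Bin index = number of edges the value has reached.
            counts[sum(1 for e in edges if v >= e)] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    b, c = shares(baseline), shares(current)
    return sum((ci - bi) * math.log(ci / bi) for bi, ci in zip(b, c))

# Hypothetical order values: last quarter versus this week.
baseline = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
shifted = [8, 9, 10, 8, 9, 10, 7, 7, 9, 1]
print(population_stability_index(baseline, baseline, edges=[4, 7]))
print(population_stability_index(baseline, shifted, edges=[4, 7]))
```

Scheduled alongside completeness and freshness checks, a score like this turns "the data changed" from an anecdote into an alert someone owns.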

The vendors selling you a model will not tell you any of this, because the data is your problem and the model is their product. The operating partner that has actually built these systems starts at the other end — with the sludge — because that is where the result is decided.

A Diagnostic maps your data across all five quality dimensions and tells you which AI use cases your data can support today — and exactly what preparation the rest need — before you commit budget to an initiative the data cannot carry.


References: Northcutt, Athalye & Mueller, "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks," NeurIPS 2021 (https://arxiv.org/abs/2103.14749); EU AI Act, Article 10 "Data and Data Governance" (https://artificialintelligenceact.eu/article/10/).
