
Gaming the Measure

Any single-point measurement that becomes known as a target will be gamed. Not through dishonesty — through the entirely rational human response to being judged by a number. Deming called this out in Point 11 of his 14 Points: eliminate numerical targets for management. The target changes what people do without changing the system. The measurement becomes the target. The target becomes the game. The game destroys the signal. This page explains why — and what to do instead.

StepChangeAnalysis.com · Deming, W.E. Out of the Crisis, Point 11 · Related: Types of Measures

▶ Key rule — gaming the measure

The defence against gaming is not to hide the measurement. The defence is a system of measures — primary outcome, process measure, and balancing measures — where improving one by gaming produces visible deterioration in another. Gaming the system of measures requires genuinely improving the system.

Bootstrap CUSUM on all measures simultaneously. A genuine structural improvement produces a change point in the primary outcome AND improvement in the process measure AND no deterioration in the balancing measures. Gaming produces a change point in the primary outcome but deterioration elsewhere — the signal that the metric was managed, not the system.

☰ Contents

Deming’s Point 11 — why targets fail
Goodhart’s Law — the same insight from economics
How gaming happens — the mechanisms
NHS examples — four-hour target and corridor care
The balancing measures defence
Designing a measurement system that is hard to game
Bootstrap CUSUM on the full measurement system

Deming’s Point 11 — why targets fail

Point 11 of Deming’s 14 Points: “Eliminate numerical quotas for the workforce and numerical goals for management. Substitute leadership.”

Deming was not saying that measurement is wrong. He was a statistician who spent his career arguing for more and better measurement. What he was saying is that a numerical target applied to a person or a team changes behaviour without changing the system — and the behaviour change almost always involves finding ways to hit the number that are easier than actually improving the system.

The mechanism is straightforward. A target creates pressure. Pressure creates ingenuity. Ingenuity finds the path of least resistance between the current state and the number. The path of least resistance is almost never the path through the structural constraint. It is the path around the measurement — redefine what counts, retime the measurement, reclassify the category, apply local effort at the measurement moment rather than systemic effort at the constraint.

The result: the number hits the target. The system does not change. The next period, the pressure returns. The ingenuity finds a slightly more creative path. The divergence between the number and the reality grows. The measurement becomes progressively less useful as a signal of system state and progressively more useful as a signal of how hard the team is working to avoid the consequences of missing the target.

Goodhart’s Law — the same insight from economics

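Charles Goodhart, the economist, stated the same principle in 1975. His original wording was that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. The version usually quoted, “When a measure becomes a target, it ceases to be a good measure,” is a later paraphrase by the anthropologist Marilyn Strathern. The principle became known as Goodhart’s Law, and it applies to every domain where performance is measured and judged by numbers.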

Deming arrived at the same insight from a different direction: not from economic theory but from observing what happened in factories and organisations when numerical targets were imposed. The ingenuity of people under pressure is not the problem. The design of the measurement system is. A well-designed measurement system makes gaming produce visible deterioration in another measure. A poorly designed one makes gaming invisible.

How gaming happens — the mechanisms

Gaming is not usually dishonest. It is usually the rational response of people doing their best under pressure. Understanding the mechanisms helps design measurement systems that are resistant to them.

Gaming mechanism	How it works	What it hides
Retiming	The measurement is taken at a specific time. Effort is concentrated immediately before that time. The number looks good at the measurement moment; the rest of the period is unchanged.	The actual system state for 23 hours and 50 minutes of every day.
Reclassification	The definition of what counts is gradually narrowed. Patients are reclassified from “corridor” to “assessment area.” Waits are reclassified from “delayed” to “planned.” The category changes; the experience doesn’t.	The real volume of the problem being measured.
Selection	The cases that would miss the target are managed differently — diverted, delayed, or handled outside the measured pathway — so the measured cases look better than the full picture.	The patients or cases who are experiencing the worst outcomes.
Recording adjustment	On bad days, the recording is delayed, estimated, or not completed. On good days, it is meticulous. The data series gradually drifts toward the good days.	The variance in system performance — the bad days are systematically underrepresented.
Threshold management	Performance is managed to just above the target rather than to the best achievable level. Once the target is hit, effort relaxes. The system settles at the minimum acceptable rather than the maximum possible.	The gap between what the system could achieve and what the target requires.
Cascade pressure	The target pressure is passed down the hierarchy. Each level manages to hit its number by passing the pressure to the level below. Front-line staff absorb the pressure as unsustainable workload. The metric improves; the staff experience deteriorates.	The human cost of hitting the number — visible only in staff experience scores and sickness rates.

NHS examples — four-hour target and corridor care

The NHS four-hour A&E target is the most extensively documented example of Goodhart’s Law in healthcare. Introduced in 2004, it produced exactly the gaming mechanisms described above:

Retiming: patients admitted or discharged at 3 hours 55 minutes to avoid the 4-hour breach, regardless of clinical need
Reclassification: patients moved to assessment units, corridors, or “decision to admit” status that paused the clock
Selection: lower-acuity patients streamed through faster to protect the target percentage; higher-acuity patients waited longer
Threshold management: trusts managed to 95% rather than striving for 100%; once at 95%, pressure relaxed

The result: the four-hour target measured how hard trusts were working to hit the four-hour target, not how quickly patients were being seen and treated. By 2019, the NHS itself acknowledged that the target was producing perverse behaviours and the metric was being reformed.
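One way to look for the retiming pattern in the data is to check whether departures bunch in the last few minutes before the four-hour mark. The following is a minimal sketch, not a validated test: the function name `bunching_ratio`, the 10-minute window, and the sample durations are all assumptions made for illustration.

```python
def bunching_ratio(durations_min, threshold=240, window=10):
    """Compare departures in the last `window` minutes before the threshold
    with the average count in the three preceding windows of equal width.

    Returns None when the preceding windows are empty.
    """
    def count(lo, hi):
        return sum(lo <= d < hi for d in durations_min)

    last = count(threshold - window, threshold)
    prior = [
        count(threshold - (k + 1) * window, threshold - k * window)
        for k in (1, 2, 3)
    ]
    baseline = sum(prior) / len(prior)
    return last / baseline if baseline else None


# Hypothetical time-in-department values, in minutes, for one shift.
durations = [205, 212, 218, 221, 226, 229, 231, 233, 234,
             235, 236, 237, 238, 239, 239, 250]

ratio = bunching_ratio(durations)
print(f"Departures in final 10 minutes vs earlier windows: {ratio:.1f}x")
```

A ratio well above 1 is a prompt to investigate rather than proof of gaming, since some clustering near a deadline can have legitimate clinical or operational causes.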

The corridor care measurement risk

The same gaming risk applies to any single-point corridor care measurement. If the metric is patients in the corridor at 1800 hours, staff will find ways to have fewer patients visible at 1800 — moving patients to side rooms at 1750, reclassifying corridor spaces as assessment areas, not recording patients who arrive after 1745. The Bootstrap CUSUM change point appears. The improvement does not.

The measurement is not wrong. The single-point snapshot is wrong. Total corridor patient-hours per 24 hours is much harder to game — it requires reducing the actual time patients spend in corridors throughout the day, not just at one moment.
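The difference between the two measures can be sketched with a few lines of arithmetic. This is an illustrative example only: the stay times are invented, and the helper names `snapshot_count` and `patient_hours` are assumptions, not part of any existing tool.

```python
from datetime import datetime, timedelta


def snapshot_count(stays, at):
    """Number of patients in the corridor at one instant."""
    return sum(start <= at < end for start, end in stays)


def patient_hours(stays, day_start):
    """Total corridor patient-hours within the 24 hours from day_start."""
    day_end = day_start + timedelta(hours=24)
    total = timedelta()
    for start, end in stays:
        overlap = min(end, day_end) - max(start, day_start)
        if overlap > timedelta():
            total += overlap
    return total.total_seconds() / 3600


day = datetime(2024, 1, 15)


def t(hour, minute=0):
    return day + timedelta(hours=hour, minutes=minute)


# Hypothetical corridor stays (start, end) for three patients.
honest = [(t(9), t(15)), (t(14), t(20)), (t(16), t(22))]
# Same patients, but two are moved to a side room from 17:50 to 18:10.
gamed = [(t(9), t(15)),
         (t(14), t(17, 50)), (t(18, 10), t(20)),
         (t(16), t(17, 50)), (t(18, 10), t(22))]

snap_honest = snapshot_count(honest, t(18))
snap_gamed = snapshot_count(gamed, t(18))
hours_honest = patient_hours(honest, day)
hours_gamed = patient_hours(gamed, day)

print(f"18:00 snapshot: {snap_honest} -> {snap_gamed}")
print(f"Patient-hours:  {hours_honest:.1f} -> {hours_gamed:.1f}")
```

In this invented example the 18:00 snapshot drops from 2 to 0, while total patient-hours barely moves (18.0 to about 17.3). The whole-period measure only falls by the time patients actually spent out of the corridor.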

The balancing measures defence

The defence against gaming is not to hide the measurement. Hidden measurements create a different problem — staff who don’t know what is being measured cannot improve it. The defence is to measure several things simultaneously, designed so that gaming one produces visible deterioration in another.

This is the balancing measures principle from improvement science: for every primary outcome measure, identify the balancing measures that would deteriorate if the primary measure were gamed rather than genuinely improved.

The measurement system for corridor care elimination

A measurement system designed so that gaming any single measure produces deterioration in at least one other:

Measure	Type	What gaming looks like	Balancing signal
Total corridor patient-hours per 24 hours	Primary outcome	Moving patients repeatedly; unsustainable for staff	Staff experience score deteriorates
Discharge before noon rate (%)	Process measure	Rushed discharges; patients sent home not fully ready	30-day readmission rate rises
30-day emergency readmission rate	Balancing measure	Cannot be easily gamed — readmissions independently recorded	Self-balancing
Patient safety incident rate	Balancing measure	Cannot be easily gamed — independently reported	Self-balancing
Staff experience score (quarterly)	Balancing measure	Anonymous survey — hard to game; reveals sustainability	Self-balancing

Gaming the primary outcome (corridor patient-hours) requires moving patients continuously throughout the day — which is unsustainable and shows up in staff experience scores. Gaming the process measure (discharge before noon) requires rushing discharges — which shows up in 30-day readmissions. Gaming the balancing measures is very difficult because they are independently recorded.

Designing a measurement system that is hard to game

Six principles for measurement system design that resist gaming:

1. Measure the whole period, not a moment
Total patient-hours rather than a snapshot count. Average throughput across the day rather than performance at a single reporting time. Gaming a whole-period measure by retiming requires effort throughout the period, not just at the measurement moment.

2. Measure outcomes, not outputs
30-day readmission rate rather than discharge count. Patient-reported experience rather than satisfaction survey score at the point of discharge. Outcomes are what the system is for. Outputs are what the system produces. Gaming an output measure is much easier than gaming an outcome measure, because an outcome only moves when something real changes in what happens to the patient after they leave.

3. Use independent data sources for balancing measures
Readmission data from other trusts. Patient safety incidents reported through the national reporting system. Staff survey data collected centrally. If the balancing measure data is collected by the same team that manages the primary outcome, gaming both simultaneously becomes possible. Independence makes the balancing measures self-protecting.

4. Pre-commit the measures before the intervention
The measurement system must be defined before the intervention begins: which metrics, which frequency, which time period, which balancing measures. A measurement system designed after results are known will unconsciously favour metrics that confirm the desired narrative. Pre-commitment prevents this.

5. Stratify rather than aggregate
Weekday primary series, weekend series, bank holiday series. Separate control charts for each patient group, pathway, or ward rather than a trust-wide aggregate. Aggregation hides gaming by averaging good performance on easy cases with poor performance on hard ones. Stratification makes selection gaming visible.

6. Publish the full series including bad periods
A measurement system that only reports during improvement programmes, or only reports when performance is good, is not a measurement system. It is a communications exercise. The full series, including the months before the intervention, the months after, and the months when performance deteriorated, is what Bootstrap CUSUM needs to detect genuine change points rather than cherry-picked improvements.
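Principle 5 can be illustrated with a small worked example. The counts below are invented, and the stream names and helper functions are assumptions made for the sketch. It shows how an aggregate rate can rise while the harder stratum gets worse.

```python
# Hypothetical counts per stream: (seen within target, total attendances).
before = {"lower acuity": (540, 600), "higher acuity": (320, 400)}
after = {"lower acuity": (594, 600), "higher acuity": (300, 400)}


def aggregate(period):
    """Single trust-wide rate across all streams."""
    met = sum(m for m, _ in period.values())
    total = sum(n for _, n in period.values())
    return met / total


def by_stratum(period):
    """Separate rate for each stream."""
    return {name: m / n for name, (m, n) in period.items()}


print(f"Aggregate: {aggregate(before):.1%} -> {aggregate(after):.1%}")
for name in before:
    b = by_stratum(before)[name]
    a = by_stratum(after)[name]
    print(f"{name}: {b:.1%} -> {a:.1%}")
```

Here the aggregate improves from 86.0% to 89.4%, yet the higher-acuity stream falls from 80.0% to 75.0%. Only the stratified view shows that the gain came from the easier cases.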

Bootstrap CUSUM on the full measurement system

Bootstrap CUSUM applied to a single metric tells you whether that metric has structurally changed. Applied to the full measurement system — primary outcome, process measure, and balancing measures — it tells you whether the system has genuinely improved or whether the metric was managed.
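For readers who want to see the mechanics, here is a minimal sketch of one common formulation of bootstrap CUSUM change-point analysis, in which the observed CUSUM range is compared with the ranges of randomly reordered copies of the series. The StepChange Analyzer's own implementation is not described on this page and may differ; the function names, the 1,000 reorderings, and the sample data are assumptions.

```python
import random


def cusum(series):
    """Cumulative sum of deviations from the series mean, starting at 0."""
    mean = sum(series) / len(series)
    sums = [0.0]
    for x in series:
        sums.append(sums[-1] + (x - mean))
    return sums


def cusum_range(series):
    sums = cusum(series)
    return max(sums) - min(sums)


def bootstrap_cusum(series, n_boot=1000, seed=0):
    """Return (confidence, change_index).

    confidence: share of random reorderings whose CUSUM range is smaller
    than the observed range.
    change_index: 0-based index of the first observation after the
    estimated change.
    """
    rng = random.Random(seed)
    observed = cusum_range(series)
    shuffled = list(series)
    smaller = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)
        if cusum_range(shuffled) < observed:
            smaller += 1
    sums = cusum(series)
    # The largest excursion of the CUSUM marks the last point before the change.
    change_index = max(range(len(sums)), key=lambda i: abs(sums[i]))
    return smaller / n_boot, change_index


# Hypothetical daily corridor patient-hours: 12 days, then a 12-hour drop.
series = [41, 39, 42, 40, 38, 41, 40, 43, 39, 40, 42, 41,
          29, 27, 30, 28, 26, 29, 28, 31, 27, 28, 30, 29]

confidence, change_index = bootstrap_cusum(series)
print(f"Confidence of a change: {confidence:.3f}, first changed day: {change_index}")
```

The same function would be run separately on each series in the measurement system, and the resulting change points compared across measures.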

The interpretation:

Primary outcome	Process measure	Balancing measures	Interpretation
▼ Change point	▼ Improving	→ Stable	Genuine structural improvement. Scale it.
▼ Change point	▼ Improving	▲ Deteriorating	Improvement with harm shift. The system improved one outcome by worsening another. Investigate which balancing measure is deteriorating and why before scaling.
▼ Change point	→ Unchanged	→ Stable	Possible gaming. The outcome improved without the process changing. How? Investigate retiming, reclassification, or selection before attributing to the intervention.
→ Flat line	→ Unchanged	→ Stable	Constraint not reached. Three possibilities: wrong constraint, insufficient conditions, too soon. Investigate before redesigning.
▲ Deterioration	▼ Improving	→ Stable	Process improved but outcome worsened. The process change addressed a non-constraint. Something else is now binding. Find it.
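The table above can be encoded as a simple lookup so that the same reading is applied consistently across projects. This is a sketch only: the label strings are hypothetical shorthand for the arrows in the table, and any combination the table does not list is deliberately returned as unclassified rather than guessed.

```python
# Keys are (primary outcome, process measure, balancing measures).
# The label strings are illustrative shorthand for the table's arrows.
INTERPRETATION = {
    ("improved", "improving", "stable"):
        "Genuine structural improvement. Scale it.",
    ("improved", "improving", "deteriorating"):
        "Improvement with harm shift. Investigate before scaling.",
    ("improved", "unchanged", "stable"):
        "Possible gaming. Investigate retiming, reclassification, or selection.",
    ("flat", "unchanged", "stable"):
        "Constraint not reached. Investigate before redesigning.",
    ("deteriorated", "improving", "stable"):
        "Process improved but outcome worsened. Find the binding constraint.",
}


def interpret(primary, process, balancing):
    """Look up the interpretation for one pattern of results."""
    return INTERPRETATION.get(
        (primary, process, balancing),
        "Pattern not covered by the table. Investigate.",
    )


print(interpret("improved", "unchanged", "stable"))
```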

Run Bootstrap CUSUM on your full measurement system

Upload each metric in your measurement system separately to the StepChange Analyzer. Compare the change points across primary outcome, process measure, and balancing measures. The pattern tells you what actually happened.
