Gaming the Measure
Any single-point measurement that becomes known as a target will be gamed. Not through dishonesty, but through the entirely rational human response to being judged by a number. Deming called this out in Point 11 of his 14 Points: eliminate numerical quotas for the workforce and numerical goals for management. The target changes what people do without changing the system. The measurement becomes the target. The target becomes the game. The game destroys the signal. This page explains why, and what to do instead.
The defence against gaming is not to hide the measurement. The defence is a system of measures — primary outcome, process measure, and balancing measures — where improving one by gaming produces visible deterioration in another. Gaming the system of measures requires genuinely improving the system.
Run Bootstrap CUSUM on all measures simultaneously. A genuine structural improvement produces a change point in the primary outcome AND improvement in the process measure AND no deterioration in the balancing measures. Gaming produces a change point in the primary outcome with an unchanged process measure or deterioration elsewhere. That pattern is the signal that the metric was managed, not the system.
Deming’s Point 11 — why targets fail
Point 11 of Deming’s 14 Points: “Eliminate numerical quotas for the workforce and numerical goals for management. Substitute leadership.”
Deming was not saying that measurement is wrong. He was a statistician who spent his career arguing for more and better measurement. What he was saying is that a numerical target applied to a person or a team changes behaviour without changing the system — and the behaviour change almost always involves finding ways to hit the number that are easier than actually improving the system.
The mechanism is straightforward. A target creates pressure. Pressure creates ingenuity. Ingenuity finds the path of least resistance between the current state and the number. The path of least resistance is almost never the path through the structural constraint. It is the path around the measurement — redefine what counts, retime the measurement, reclassify the category, apply local effort at the measurement moment rather than systemic effort at the constraint.
The result: the number hits the target. The system does not change. The next period, the pressure returns. The ingenuity finds a slightly more creative path. The divergence between the number and the reality grows. The measurement becomes progressively less useful as a signal of system state and progressively more useful as a signal of how hard the team is working to avoid the consequences of missing the target.
Goodhart’s Law — the same insight from economics
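Charles Goodhart, the economist, stated the same principle in 1975. His original wording was that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. The popular form, “When a measure becomes a target, it ceases to be a good measure,” is a later paraphrase usually credited to the anthropologist Marilyn Strathern. The principle became known as Goodhart’s Law, and it applies to every domain where performance is measured and judged by numbers.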
Deming arrived at the same insight from a different direction: not from economic theory but from observing what happened in factories and organisations when numerical targets were imposed. The ingenuity of people under pressure is not the problem. The design of the measurement system is. A well-designed measurement system makes gaming produce visible deterioration in another measure. A poorly designed one makes gaming invisible.
How gaming happens — the mechanisms
Gaming is not usually dishonest. It is usually the rational response of people doing their best under pressure. Understanding the mechanisms helps design measurement systems that are resistant to them.
| Gaming mechanism | How it works | What it hides |
|---|---|---|
| Retiming | The measurement is taken at a specific time. Effort is concentrated immediately before that time. The number looks good at the measurement moment; the rest of the period is unchanged. | The actual system state for 23 hours and 50 minutes of every day. |
| Reclassification | The definition of what counts is gradually narrowed. Patients are reclassified from “corridor” to “assessment area.” Waits are reclassified from “delayed” to “planned.” The category changes; the experience doesn’t. | The real volume of the problem being measured. |
| Selection | The cases that would miss the target are managed differently — diverted, delayed, or handled outside the measured pathway — so the measured cases look better than the full picture. | The patients or cases who are experiencing the worst outcomes. |
| Recording adjustment | On bad days, the recording is delayed, estimated, or not completed. On good days, it is meticulous. The data series gradually drifts toward the good days. | The variance in system performance — the bad days are systematically underrepresented. |
| Threshold management | Performance is managed to just above the target rather than to the best achievable level. Once the target is hit, effort relaxes. The system settles at the minimum acceptable rather than the maximum possible. | The gap between what the system could achieve and what the target requires. |
| Cascade pressure | The target pressure is passed down the hierarchy. Each level manages to hit its number by passing the pressure to the level below. Front-line staff absorb the pressure as unsustainable workload. The metric improves; the staff experience deteriorates. | The human cost of hitting the number — visible only in staff experience scores and sickness rates. |
NHS examples — four-hour target and corridor care
The NHS four-hour A&E target is the most extensively documented example of Goodhart’s Law in healthcare. Introduced in 2004, it produced exactly the gaming mechanisms described above:
- Retiming: patients admitted or discharged at 3 hours 55 minutes to avoid the 4-hour breach, regardless of clinical need
- Reclassification: patients moved to assessment units, corridors, or “decision to admit” status that paused the clock
- Selection: lower-acuity patients streamed through faster to protect the target percentage; higher-acuity patients waited longer
- Threshold management: trusts managed to 95% rather than striving for 100%; once at 95%, pressure relaxed
The result: the four-hour target measured how hard trusts were working to hit the four-hour target, not how quickly patients were being seen and treated. By 2019, the NHS itself had acknowledged that the target was producing perverse behaviours, and reform of the metric was under way.
The same gaming risk applies to any single-point corridor care measurement. If the metric is patients in the corridor at 18:00, staff will find ways to have fewer patients visible at 18:00: moving patients to side rooms at 17:50, reclassifying corridor spaces as assessment areas, not recording patients who arrive after 17:45. The Bootstrap CUSUM change point appears. The improvement does not.
The measurement is not wrong. The single-point snapshot is wrong. Total corridor patient-hours per 24 hours is much harder to game — it requires reducing the actual time patients spend in corridors throughout the day, not just at one moment.
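The difference between the two metrics can be sketched in a few lines of Python. This is an illustration only: the stay intervals are invented, and the function names `snapshot_count` and `corridor_patient_hours` are hypothetical, not part of any tool named on this page. It assumes each corridor stay is recorded as a start and end time.

```python
from datetime import datetime, timedelta

def snapshot_count(stays, at):
    """Number of patients in the corridor at one instant (the single-point metric)."""
    return sum(1 for start, end in stays if start <= at < end)

def corridor_patient_hours(stays, day_start):
    """Total hours patients spent in the corridor within one 24-hour window."""
    day_end = day_start + timedelta(hours=24)
    total = timedelta()
    for start, end in stays:
        # Count only the part of each stay that falls inside the window.
        overlap_start = max(start, day_start)
        overlap_end = min(end, day_end)
        if overlap_end > overlap_start:
            total += overlap_end - overlap_start
    return total.total_seconds() / 3600

day = datetime(2024, 1, 15)  # invented example date

def t(hour, minute=0):
    return day + timedelta(hours=hour, minutes=minute)

# Invented data: three corridor stays on an ordinary day.
honest = [(t(14), t(20)), (t(16), t(19)), (t(17, 30), t(22))]

# The same day, gamed: every patient is moved to a side room at 17:50
# and returned to the corridor at 18:10.
gamed = []
for start, end in honest:
    gamed.append((start, t(17, 50)))
    gamed.append((t(18, 10), end))

print(snapshot_count(honest, t(18)), corridor_patient_hours(honest, day))
print(snapshot_count(gamed, t(18)), corridor_patient_hours(gamed, day))
```

In this invented example the 18:00 snapshot falls from three patients to zero, while total corridor patient-hours fall only from 13.5 to 12.5. The snapshot reports a transformation; the patient-hours figure reports twenty minutes of relocation per patient.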
The balancing measures defence
The defence against gaming is not to hide the measurement. Hidden measurements create a different problem — staff who don’t know what is being measured cannot improve it. The defence is to measure several things simultaneously, designed so that gaming one produces visible deterioration in another.
This is the balancing measures principle from improvement science: for every primary outcome measure, identify the balancing measures that would deteriorate if the primary measure were gamed rather than genuinely improved.
The measurement system for corridor care elimination
A measurement system designed so that gaming any single measure produces deterioration in at least one other:
| Measure | Type | What gaming looks like | Balancing signal |
|---|---|---|---|
| Total corridor patient-hours per 24 hours | Primary outcome | Moving patients repeatedly; unsustainable for staff | Staff experience score deteriorates |
| Discharge before noon rate (%) | Process measure | Rushed discharges; patients sent home not fully ready | 30-day readmission rate rises |
| 30-day emergency readmission rate | Balancing measure | Cannot be easily gamed — readmissions independently recorded | Self-balancing |
| Patient safety incident rate | Balancing measure | Cannot be easily gamed — independently reported | Self-balancing |
| Staff experience score (quarterly) | Balancing measure | Anonymous survey — hard to game; reveals sustainability | Self-balancing |
Gaming the primary outcome (corridor patient-hours) requires moving patients continuously throughout the day — which is unsustainable and shows up in staff experience scores. Gaming the process measure (discharge before noon) requires rushing discharges — which shows up in 30-day readmissions. Gaming the balancing measures is very difficult because they are independently recorded.
Designing a measurement system that is hard to game
Six principles for measurement system design that resist gaming:
Bootstrap CUSUM on the full measurement system
Bootstrap CUSUM applied to a single metric tells you whether that metric has structurally changed. Applied to the full measurement system — primary outcome, process measure, and balancing measures — it tells you whether the system has genuinely improved or whether the metric was managed.
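For readers who want to see the mechanics on one metric, here is a minimal sketch of one common formulation of bootstrap CUSUM change-point detection. It is an assumption that this resembles what any particular tool does; the function names, the number of resamples, and the data are all illustrative. The idea: compute the cumulative sum of deviations from the mean, measure its range, then shuffle the series many times to see how often random ordering alone produces a range that large.

```python
import random

def cusum(values):
    """Cumulative sum of deviations from the series mean."""
    mean = sum(values) / len(values)
    running, path = 0.0, []
    for v in values:
        running += v - mean
        path.append(running)
    return path

def cusum_range(values):
    path = cusum(values)
    return max(path) - min(path)

def bootstrap_cusum(values, n_boot=1000, seed=0):
    """Return (confidence, change_index).

    confidence: share of shuffled series whose CUSUM range is smaller
    than the observed one. change_index: 0-based position of the first
    point after the estimated change.
    """
    rng = random.Random(seed)
    observed = cusum_range(values)
    shuffled = list(values)
    smaller = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)  # reorder the same values at random
        if cusum_range(shuffled) < observed:
            smaller += 1
    path = cusum(values)
    # The CUSUM is furthest from zero at the last point before the change.
    last_before = max(range(len(path)), key=lambda i: abs(path[i]))
    return smaller / n_boot, last_before + 1

# Invented daily corridor patient-hours: about 40 before, about 30 after.
series = [41, 39, 40, 42, 38, 40, 41, 39, 40, 40,
          30, 31, 29, 30, 32, 28, 30, 29, 31, 30]

confidence, change_index = bootstrap_cusum(series)
print(confidence, change_index)
```

A high confidence value says the ordering of the data matters, which is what a structural shift looks like. It says nothing about why the shift happened, which is why the same analysis is needed on the process and balancing measures.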
The interpretation:
| Primary outcome | Process measure | Balancing measures | Interpretation |
|---|---|---|---|
| ▼ Change point | ▼ Improving | → Stable | Genuine structural improvement. Scale it. |
| ▼ Change point | ▼ Improving | ▲ Deteriorating | Improvement with harm shift. The system improved one outcome by worsening another. Investigate which balancing measure is deteriorating and why before scaling. |
| ▼ Change point | → Unchanged | → Stable | Possible gaming. The outcome improved without the process changing. How? Investigate retiming, reclassification, or selection before attributing to the intervention. |
| → Flat line | → Unchanged | → Stable | Constraint not reached. Three possibilities: wrong constraint, insufficient conditions, too soon. Investigate before redesigning. |
| ▲ Deterioration | ▼ Improving | → Stable | Process improved but outcome worsened. The process change addressed a non-constraint. Something else is now binding. Find it. |
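The table above can be expressed as a small lookup, shown here as a sketch. The labels (`improved`, `unchanged`, `flat`, `stable`, `deteriorated`) and the function names are hypothetical, and the rule that balancing measures count as deteriorated if any one of them deteriorates is an assumption made for the example.

```python
# Keys: (primary outcome, process measure, balancing measures).
INTERPRETATIONS = {
    ("improved", "improved", "stable"): "Genuine structural improvement",
    ("improved", "improved", "deteriorated"): "Improvement with harm shift",
    ("improved", "unchanged", "stable"): "Possible gaming",
    ("flat", "unchanged", "stable"): "Constraint not reached",
    ("deteriorated", "improved", "stable"): "Process improved but outcome worsened",
}

def summarise_balancing(states):
    """Assumption: one deteriorating balancing measure is enough to flag harm."""
    return "deteriorated" if "deteriorated" in states else "stable"

def interpret(primary, process, balancing_states):
    key = (primary, process, summarise_balancing(balancing_states))
    # Combinations outside the table are flagged rather than guessed at.
    return INTERPRETATIONS.get(key, "Pattern not in the table: investigate")

# Outcome changed, process did not, nothing else moved.
print(interpret("improved", "unchanged", ["stable", "stable", "stable"]))

# Outcome and process improved, but one balancing measure worsened.
print(interpret("improved", "improved", ["stable", "deteriorated", "stable"]))
```

The fallback is deliberate: the table lists five patterns, and any other combination deserves investigation, not a forced label.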
Run Bootstrap CUSUM on your full measurement system
Upload each metric in your measurement system separately to the StepChange Analyzer. Compare the change points across primary outcome, process measure, and balancing measures. The pattern tells you what actually happened.