📊 Concept · Statistical Phenomenon

Regression to the Mean

An extreme result — good or bad — is likely to be followed by a more typical one, regardless of any intervention. This is not a theory. It is a mathematical inevitability whenever measurement contains any random variation. It is also the most commonly misattributed phenomenon in improvement work: the reason treatment gets credit for recovery, inspections get credit for school improvement, and winter programmes get credit for spring.

StepChangeAnalysis.com  ·  Open the StepChange Analyzer
▶ Key rule — regression to the mean

Never attribute improvement after an extreme period to an intervention without first asking: would the data have returned toward average anyway? Regression to the mean produces improvement without any intervention at all.

Outside a controlled trial, the practical defence is a pre-committed prediction made before the intervention: specify the direction, metric, timing, and confidence threshold. If Bootstrap CUSUM (StepChange Analyzer) then detects a sustained change point at the predicted time and in the predicted direction, the improvement is likely to be structural. If the metric simply drifts back toward average, regression to the mean was doing the work, not the intervention.


What regression to the mean is

When a random variable produces an extreme value — unusually high or unusually low — the next measurement of the same variable is likely to be closer to its long-run average. This happens not because anything changed, but because extreme values are, by definition, unlikely to be repeated. The more extreme the initial observation, the more strongly the next observation tends to regress toward the mean.

This applies in both directions. An exceptionally good month is likely to be followed by a less exceptional one. An exceptionally bad month is likely to be followed by a less bad one. Neither movement requires explanation. Neither reflects genuine structural change. Both are statistical inevitability.

The formal definition

In statistics, regression toward the mean is the phenomenon where if one sample of a random variable is extreme, the next sampling of the same random variable is likely to be closer to its mean. Furthermore, when many random variables are sampled and the most extreme results are intentionally selected, a second sampling of those selected variables will produce less extreme results — closer to the mean of all variables. (Galton, 1886; formalised by Pearson, 1896.)


Galton's discovery — heights and "reversion to mediocrity"

Francis Galton stumbled on the phenomenon in the 1880s while measuring the heights of parents and their adult children. He expected tall parents to produce tall children and short parents to produce short children — and they did. But not proportionally. Tall parents produced children who were tall, but not as tall as their parents. Short parents produced children who were short, but not as short as their parents. Every generation drifted back toward the population average.

Galton first called it "reversion" and later, in his 1886 paper on hereditary stature, "regression towards mediocrity", a wonderfully blunt description. He was puzzled by it, initially suspecting some biological force pulling generations toward average height. The truth, as Karl Pearson later formalised, was simpler and more fundamental: it was a mathematical property of any measurement that contains random variation. No biological mechanism was needed. The phenomenon emerged from the structure of the data itself.

The insight generalised far beyond heights. Anywhere a measurement has both a systematic component (the true underlying level) and a random component (noise, measurement error, natural fluctuation), an extreme observed value is likely to owe part of its extremeness to a random component that happened to push in the same direction. That random push is unlikely to repeat, so extreme values tend to be followed by less extreme ones.
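The decomposition into a systematic and a random component can be written down explicitly. The following is a standard textbook sketch rather than anything taken from Galton or Pearson directly. It assumes two measurements of the same unit that share a true level T (population mean μ, variance σ_T²) and have independent, zero-mean noise terms E_1 and E_2 with equal variance σ_E², and it assumes the two measurements are jointly normal so that the conditional expectation is linear. All symbols are introduced here for illustration.

```latex
\begin{align}
X_1 &= T + E_1, \qquad X_2 = T + E_2, \\
\rho &= \frac{\operatorname{Cov}(X_1, X_2)}{\operatorname{Var}(X_1)}
      = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}, \\
\mathbb{E}[X_2 \mid X_1 = x] &= \mu + \rho\,(x - \mu).
\end{align}
```

Whenever there is any noise (σ_E² > 0), ρ is below 1, so the expected second measurement sits closer to μ than the first did. The expected pull-back, (1 − ρ)(x − μ), grows with the distance of x from μ, which is why more extreme observations regress more. For a single stable process with no variation in T, ρ is 0 and the expected next value is simply μ.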


Why it happens — the mathematical inevitability

Consider a process with a true underlying mean of 100, subject to random monthly fluctuation of ±20. In any given month, the observed value might be 120 (extremely good) or 80 (extremely bad). Neither value reflects a change in the true underlying level — both reflect random variation around a stable mean.

If you observe a value of 120, what is the most likely next value? Not 120 again — extreme values are unlikely by definition. The most likely next value is something closer to 100. If you observe 80, the most likely next value is something closer to 100. The process has not changed. The true mean has not changed. Only the random component has varied.

The critical point: this happens regardless of what you do between the two measurements. If you intervene after the extreme value, the regression toward the mean will occur anyway — and your intervention will appear to have caused it.
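The argument above can be checked with a small simulation. This is a minimal sketch, assuming the "±20" is read as two standard deviations of normally distributed monthly noise (so a standard deviation of 10) and that months are independent; both are assumptions made here, not statements from the text. The function name and parameters are hypothetical.

```python
import random
import statistics


def simulate_followers(n_months=200_000, mean=100.0, sd=10.0,
                       threshold=120.0, seed=1):
    """Simulate a stable process and look at what follows extreme months.

    Returns (average of months at or above the threshold,
             average of the months immediately after them).
    Nothing is done between the two measurements.
    """
    rng = random.Random(seed)
    series = [rng.gauss(mean, sd) for _ in range(n_months)]

    extremes, followers = [], []
    for current, following in zip(series, series[1:]):
        if current >= threshold:
            extremes.append(current)
            followers.append(following)

    return statistics.fmean(extremes), statistics.fmean(followers)


extreme_avg, follower_avg = simulate_followers()
print(f"average extreme month:   {extreme_avg:.1f}")
print(f"average following month: {follower_avg:.1f}")
```

Under these assumptions the months selected for being extreme average above 120, while the months that follow them average close to 100. Any "intervention" scheduled between the two would appear to have moved the metric by roughly that gap.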

The clinical version — why treatments appear to work

A patient develops a symptom severe enough to seek treatment. Because people tend to seek help when symptoms are at their worst, the symptom level at the point of seeking treatment is usually near its peak. The doctor prescribes a treatment. The symptom improves. The treatment receives the credit.

But consider the counterfactual: what would have happened without treatment? For many conditions, symptoms fluctuate naturally. A patient presenting at their worst is statistically likely to improve next week regardless of treatment, because extreme symptom levels are unlikely to persist. Regression to the mean may be doing much of the work, yet the treatment gets the credit.

This is why randomised controlled trials require a control group. The control group experiences the same regression to the mean as the treatment group. Any additional improvement in the treatment group, beyond what the control group experienced, is attributable to the treatment. Without the control group, regression to the mean is invisible — and the treatment appears far more effective than it is.
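The role of the control group can be illustrated with a simulated trial. This is a sketch under invented assumptions: each patient has a usual symptom level around 50, weekly scores fluctuate around that level, patients only present when a score reaches 65, and the treatment truly lowers the follow-up score by 3 points. None of these numbers come from the text, and the function name is hypothetical.

```python
import random
import statistics


def run_trial(n_patients=20_000, true_effect=3.0, threshold=65.0, seed=7):
    """Return the mean improvement (presentation minus follow-up) per arm."""
    rng = random.Random(seed)
    improvement = {"treated": [], "control": []}

    while len(improvement["treated"]) + len(improvement["control"]) < n_patients:
        usual_level = rng.gauss(50.0, 5.0)            # patient's long-run level
        at_presentation = usual_level + rng.gauss(0.0, 10.0)
        if at_presentation < threshold:
            continue                                   # not bad enough to seek care

        # Random allocation: both arms are selected at the same extreme.
        arm = "treated" if rng.random() < 0.5 else "control"
        follow_up = usual_level + rng.gauss(0.0, 10.0)
        if arm == "treated":
            follow_up -= true_effect

        improvement[arm].append(at_presentation - follow_up)

    return {arm: statistics.fmean(values) for arm, values in improvement.items()}


result = run_trial()
print(f"control improvement: {result['control']:.1f}")
print(f"treated improvement: {result['treated']:.1f}")
print(f"difference:          {result['treated'] - result['control']:.1f}")
```

In this setup the untreated control arm improves substantially on its own, purely because patients were enrolled at a peak. Only the difference between the arms recovers something close to the assumed treatment effect. A before-and-after comparison in the treated arm alone would overstate it several times over.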


Where it appears in improvement work

📋 Common patterns in healthcare and public services

League tables: The worst-performing trust this year is unlikely to be the worst-performing trust next year. That is not necessarily because it improved, but because its position at the bottom of the table reflects both its true performance and an element of random variation. Next year's random variation will produce a different configuration. A trust that was second-worst may now appear at the bottom, and the previous worst performer will appear to have "improved." Intervention teams dispatched to the worst performers will appear to have succeeded regardless of their actual effect.

Winter programmes: An unusually bad winter followed by an average winter looks like improvement. NHS winter programmes are routinely evaluated by comparing the intervention winter to the previous one. If the previous winter was exceptionally bad, a genuine outlier, the following winter will tend toward average regardless of any programme. The programme receives the credit. Bootstrap CUSUM on a longer series reveals whether the improvement is structural or a return to the previous stable level.

School inspections: Schools are more likely to be inspected when performance is poor. Poor performance contains both a genuine component (the school is struggling) and a random component (an unusually bad year). After inspection, performance tends to improve, partly because of the inspection's effect and partly because the random component regresses toward the mean. Studies attempting to separate the two effects consistently find that regression to the mean accounts for a significant proportion of the apparent improvement.

Patient safety incidents: A spike in never events triggers an investigation and an action plan. The following quarter shows fewer never events. The action plan appears to have worked. But never events are rare by definition, so a quarterly spike is likely to contain a large random component. Regression toward the (low) mean was always likely regardless of any intervention. Without a pre-committed Bootstrap CUSUM prediction, the action plan gets the credit.

Individual performance: Praise appears to make performance worse and criticism appears to make it better, but neither effect is real. Galton's insight generalised to human performance: after an exceptional performance, the next is likely to be more ordinary regardless of any feedback. After a poor performance, the next is likely to be better regardless of any feedback. This creates the illusion that praise makes people complacent (their next performance is worse) and criticism makes people try harder (their next performance is better). In reality, both are regression to the mean. The feedback is coincidental.
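The league-table pattern in the list above can be sketched as a simulation. The numbers are assumptions chosen for illustration: 50 trusts whose true scores differ modestly (standard deviation 3) and whose yearly observed scores carry larger random variation (standard deviation 5). Nothing changes between the two years. The function name is hypothetical.

```python
import random
import statistics


def worst_trust_next_year(n_trusts=50, n_trials=2_000, seed=3):
    """Track the bottom-ranked trust from year 1 into year 2.

    Returns (share of trials in which it is still bottom in year 2,
             its average change in score between the two years).
    """
    rng = random.Random(seed)
    still_worst = 0
    changes = []

    for _ in range(n_trials):
        true_scores = [rng.gauss(70.0, 3.0) for _ in range(n_trusts)]
        year1 = [t + rng.gauss(0.0, 5.0) for t in true_scores]
        year2 = [t + rng.gauss(0.0, 5.0) for t in true_scores]

        worst = min(range(n_trusts), key=year1.__getitem__)
        changes.append(year2[worst] - year1[worst])
        if year2[worst] == min(year2):
            still_worst += 1

    return still_worst / n_trials, statistics.fmean(changes)


share_still_worst, average_change = worst_trust_next_year()
print(f"still bottom in year 2: {share_still_worst:.0%}")
print(f"average score change:   {average_change:+.1f}")
```

With these assumptions the bottom-ranked trust usually leaves the bottom spot and gains several points the following year, even though no trust's true performance moved. How large the effect is depends on the ratio of noise to true differences, which is an input here rather than a finding.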

The Deming connection — pre-committed predictions

Deming understood regression to the mean as one of the central obstacles to honest improvement evaluation. His insistence on pre-committed predictions — stating in writing, before an intervention, what change you expect to see and when — is the direct methodological response to the problem.

Without a pre-committed prediction, any improvement after an intervention can be attributed to the intervention, even if regression to the mean is the actual mechanism. The improvement feels real, the attribution feels logical, and the learning is false. The same intervention will be repeated regardless of whether it actually worked — because the evidence of its working was an artefact of measurement, not a signal of structural change.

With a pre-committed prediction, regression to the mean is visible. If the prediction specifies that Bootstrap CUSUM should detect a change point at a particular confidence level within a particular timeframe — and the data instead shows a return toward average without crossing the detection threshold — the prediction has failed. The intervention did not produce structural change. That is honest information, however uncomfortable.
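One way to make a pre-committed prediction hard to reinterpret afterwards is to record it as structured data before the intervention and check it mechanically later. The sketch below is a hypothetical illustration: the `Prediction` and `DetectedChange` types, their field names, and the `prediction_confirmed` function are invented here and are not the StepChange Analyzer's format or API.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    """Written down before the intervention starts."""
    metric: str
    direction: str         # "decrease" or "increase"
    earliest: date         # start of the window in which the change is expected
    latest: date           # end of that window
    min_confidence: float  # e.g. 0.95


@dataclass(frozen=True)
class DetectedChange:
    """Summary of a change point reported by whatever analysis is used."""
    when: date
    mean_before: float
    mean_after: float
    confidence: float


def prediction_confirmed(pred: Prediction,
                         change: Optional[DetectedChange]) -> bool:
    """True only if a change was found in the predicted direction,
    inside the predicted window, at or above the stated confidence."""
    if change is None or change.mean_after == change.mean_before:
        return False
    direction = "decrease" if change.mean_after < change.mean_before else "increase"
    return (
        direction == pred.direction
        and pred.earliest <= change.when <= pred.latest
        and change.confidence >= pred.min_confidence
    )


pred = Prediction("median wait (minutes)", "decrease",
                  date(2024, 4, 1), date(2024, 6, 30), 0.95)
found = DetectedChange(date(2024, 5, 1), 100.0, 85.0, 0.98)
print(prediction_confirmed(pred, found))  # True
print(prediction_confirmed(pred, None))   # False: no change point detected
```

Freezing the record (`frozen=True`) is a small design choice that mirrors the point of pre-commitment: the prediction cannot be adjusted once the data arrive.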

⚠️ The intervention that always works

Any intervention applied immediately after an extremely bad result will tend to appear to work, because regression to the mean produces improvement regardless. This is why the most confident improvement claims are often the least reliable: they were made after unusually bad periods, applied an intervention, and observed the expected return toward average. The intervention receives permanent credit for a temporary statistical phenomenon. An intervention with no real effect, applied to a system at its average level, would show no improvement at all, because there is no extreme value to regress from.


How Bootstrap CUSUM distinguishes it from genuine change

Regression to the mean produces a characteristic pattern in time-series data: a return from an extreme value toward the previous stable level. Bootstrap CUSUM distinguishes this from genuine structural change in a precise way.

Pattern: Regression to the mean
What it looks like in data: Extreme value followed by return toward previous stable level. The new level is approximately the same as the pre-intervention level.
Bootstrap CUSUM result: No change point. CUSUM shows a temporary excursion that returns to baseline. Or: a change point at the extreme value followed by a second change point as it regresses, ending at approximately the original mean.
What it means: The system has not structurally changed. The extreme value was a fluctuation. The intervention (if any) did not produce lasting structural improvement.

Pattern: Genuine structural improvement
What it looks like in data: Sustained shift to a new stable level that is different from the pre-intervention level. The improvement is maintained through subsequent periods including the next hard season.
Bootstrap CUSUM result: A single change point at the appropriate date, sustained. The new level does not return toward the previous mean. Confirmed across at least one full seasonal cycle.
What it means: The system has structurally changed. The improvement is not a regression artefact: it is maintained when the random component fluctuates in both directions around the new mean.

Pattern: Mixed (partial regression plus genuine improvement)
What it looks like in data: Improvement after an extreme value, stabilising at a level better than the pre-intervention baseline rather than merely returning to it.
Bootstrap CUSUM result: Change point detected, with the new mean settling on the improved side of the original baseline rather than at it. Partial regression plus genuine structural shift.
What it means: Both effects are present. The intervention produced real improvement, but some of the observed improvement is regression. Bootstrap CUSUM quantifies the genuine structural component.
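For readers who want to see the mechanics, the following is a minimal sketch of a generic bootstrap CUSUM change-point test: cumulative sums of deviations from the overall mean, compared against reshuffled copies of the same data. It is not the StepChange Analyzer's implementation, whose details are not given here. The function names, the number of bootstrap samples, and the example series are all assumptions.

```python
import random
import statistics


def cusum_range(values):
    """Return (max - min of the CUSUM, index of the first point after the peak).

    The CUSUM is the running sum of deviations from the overall mean.
    A sustained shift makes it climb (or fall) steadily and then reverse.
    """
    mean = statistics.fmean(values)
    running, cusum = 0.0, [0.0]
    for value in values:
        running += value - mean
        cusum.append(running)
    # cusum[i] covers the first i observations, so the peak position i is
    # also the 0-based index of the first observation at the new level.
    peak = max(range(1, len(cusum)), key=lambda i: abs(cusum[i]))
    return max(cusum) - min(cusum), peak


def bootstrap_cusum(values, n_boot=1000, seed=0):
    """Return (confidence that a change occurred, estimated change index).

    Confidence is the share of random reorderings whose CUSUM range is
    smaller than the observed one.
    """
    rng = random.Random(seed)
    observed_range, peak = cusum_range(values)
    shuffled = list(values)
    smaller = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)
        if cusum_range(shuffled)[0] < observed_range:
            smaller += 1
    return smaller / n_boot, peak


# Illustrative series: 20 periods around 100, then 20 periods around 80.
rng = random.Random(42)
shifted = ([rng.gauss(100, 4) for _ in range(20)]
           + [rng.gauss(80, 4) for _ in range(20)])
confidence, first_after = bootstrap_cusum(shifted)
print(f"confidence of a change: {confidence:.2f}")
print(f"new level starts near index {first_after}")
```

Because each reshuffle contains exactly the same values as the original series, a single extreme period appears in every bootstrap sample too and does not by itself make the observed ordering look unusual. A sustained shift does, since only the original ordering keeps the high values together and the low values together.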

The two mistakes it causes

Regression to the mean produces two distinct errors in improvement work — one more common, one more costly.

Mistake: Crediting the intervention (most common)
What happens: An intervention is applied after an extreme bad result. Performance improves (regresses toward mean). The intervention receives the credit. The same intervention is rolled out at scale and repeated in future.
Consequence: Resources and effort are invested in interventions that may have had no structural effect. The true cause of the extreme result remains unaddressed. The next extreme result produces the same pattern, and the same false attribution.

Mistake: Discrediting a genuine improvement (less common, more costly)
What happens: A genuine structural improvement is dismissed because it followed a bad period and therefore "must be regression to the mean." The intervention is not scaled or standardised.
Consequence: A real improvement is lost. The conditions that produced it are not preserved. Performance returns to the previous level, which then appears to confirm the original scepticism, completing a self-fulfilling cycle of under-investment in what actually works.

Bootstrap CUSUM resolves both mistakes with the same mechanism: the pre-committed prediction combined with the change point test. A genuine structural improvement produces a sustained change point at the predicted time. Regression to the mean produces a temporary excursion without a sustained change point. The distinction is visible in the data if you look for it correctly.

Test your data — genuine change or regression to the mean?

Upload your time-series data to the StepChange Analyzer. If the improvement is structural, Bootstrap CUSUM will detect a sustained change point. If it is regression to the mean, the CUSUM will show a temporary excursion with no confirmed change point.
