
The Hawthorne Effect

Behaviour changes when people know they are being measured — regardless of what is being measured. The Hawthorne Effect is the most important confounder in NHS improvement evaluation. Every GIRFT visit, every national programme, every CQC inspection creates one. The question is never whether metrics improve during observation. The question is whether they hold after. Most NHS improvement evaluation never answers this question because it stops when the programme ends.

StepChangeAnalysis.com · Western Electric Hawthorne Studies 1924–1932 · Related: Regression to the mean

▶ Key rule — the Hawthorne Effect

A metric improving during an improvement programme is not evidence that the programme worked. It may be evidence that people knew they were being measured. The only way to distinguish Hawthorne improvement from structural improvement is a pre-committed prediction tested by Bootstrap CUSUM on data collected after external observation has ended.

The NHS has significant statistical capability — SPC charts and run charts are widely used. But it does not yet routinely require pre-committed predictions before improvement interventions are implemented. Without that discipline, the Hawthorne Effect cannot be separated from genuine structural change.

☰ Contents

What the Hawthorne Effect actually is
The folk version vs the precise version
Why it matters for NHS improvement
The three confounders — Hawthorne, regression, tampering
What is a pre-committed prediction?
The Bootstrap CUSUM response
The irony — visibility as intervention
The pre-committed prediction checklist

What the Hawthorne Effect actually is

Between 1924 and 1932, researchers at Western Electric’s Hawthorne plant near Chicago conducted a series of studies into what affected worker productivity. They changed lighting levels, rest periods, working hours, and physical conditions — and found that almost every change, whether improvement or deterioration in conditions, was followed by improved productivity.

The conclusion that eventually emerged was that the improvement was not caused by the changes themselves. It was caused by the workers knowing they were being studied. Being observed changed their behaviour — not because they were being manipulated or performing dishonestly, but because attention itself is motivating, and being measured changes what people attend to.

The original studies were more complex and contested than the folk version suggests — subsequent analysis showed the effects were smaller and more varied than initially reported. But the core insight survived: the act of measurement changes what is measured.

The folk version vs the precise version

Version	What it says	Why it matters
Folk version	“People work harder when they are being watched.”	Implies the effect is about effort and motivation. Suggests removing observation would return performance to baseline. Misses the structural dimension.
Precise version	“Behaviour changes when people know they are being measured, regardless of what is being measured or whether conditions improve or deteriorate.”	Implies the effect is about visibility and attention, not effort. A metric that improves because it is being tracked may not improve because the underlying system has changed. The improvement is real but not structural.
NHS improvement version	“Trust metrics improve when a GIRFT team, CQC inspection, or national programme is present. They may or may not hold when the observation ends.”	The central question for every NHS improvement claim: is this structural change or Hawthorne? SPC charts during the observation period cannot answer this. Only sustained data after the observation ends can.

Why it matters for NHS improvement

Every major NHS improvement mechanism creates a Hawthorne Effect:

Mechanism	What happens	The Hawthorne question never asked
GIRFT visits	Teams compare trust performance against peers. Staff know they are being observed nationally. Metrics improve. GIRFT reports improvement.	Do the metrics hold 12 months after the GIRFT team leaves? GIRFT’s dataset measures performance during and shortly after engagement — not sustained improvement against the counterfactual.
CQC inspections	Trusts in special measures face intensive scrutiny and high-frequency reporting. Metrics almost always improve. The re-inspection shows improvement.	Does the improvement reflect structural change or sustained Hawthorne response to continued scrutiny? Trusts that exit special measures and reduce reporting frequency sometimes see metrics drift back.
National programmes	NHS England announces a programme with centrally monitored KPIs. Trusts know metrics are tracked nationally. Reporting improves. Metrics improve. Programme declared a success.	Is the improvement structural (produced by a redesigned system) or Hawthorne (produced by national monitoring)? Almost never tested, because monitoring continues indefinitely while the programme is live. The pre-committed prediction was never made before launch.
Improvement collaboratives	Groups of trusts share data and compare performance on a shared theme. Data sharing itself creates a Hawthorne Effect — trusts attend more carefully to what peers can see.	Does improvement persist after the collaborative ends and the shared measurement structure is removed?

The three confounders — Hawthorne, regression, tampering

The Hawthorne Effect is one of three systematic confounders that make improvement claims in healthcare unreliable without rigorous pre-committed evaluation. All three produce metric improvement that is real but not structural.

Hawthorne Effect

Behaviour changes because people know they are being measured. Metrics improve during observation, may revert when observation ends.

Bootstrap CUSUM test: Does the change point hold 12–24 months after external observation ends?

Regression to the Mean

After an extremely bad period, the next period will tend to look better regardless of any intervention. The improvement is a statistical expectation, not structural change.

Bootstrap CUSUM test: Did the change point occur before or after the extreme period? Was the improvement sustained beyond what regression alone would predict?
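Regression to the mean is easy to demonstrate with simulated data. The sketch below is illustrative only: the unit count, mean, and spread are made-up values, and no intervention is applied anywhere. It picks the units that looked worst in one period and looks at the same units in the next period.

```python
import random
import statistics

def regression_to_mean_demo(n_units=1000, true_mean=12.0, sd=3.0,
                            worst_fraction=0.1, seed=1):
    """Simulate units whose underlying performance never changes."""
    rng = random.Random(seed)
    # Two periods drawn from the SAME unchanged process: no intervention.
    period_1 = [rng.gauss(true_mean, sd) for _ in range(n_units)]
    period_2 = [rng.gauss(true_mean, sd) for _ in range(n_units)]
    # Select the units that looked worst (highest value) in period 1.
    cutoff = sorted(period_1)[int(n_units * (1 - worst_fraction))]
    worst = [i for i, x in enumerate(period_1) if x >= cutoff]
    before = statistics.mean(period_1[i] for i in worst)
    after = statistics.mean(period_2[i] for i in worst)
    return before, after

before, after = regression_to_mean_demo()
print(f"worst units, period 1: {before:.1f}; same units, period 2: {after:.1f}")
```

The selected units appear to improve sharply even though nothing about them changed. Any intervention launched on those units between the two periods would have looked successful.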

Tampering

Reacting to common cause variation as if it were special cause. Each intervention adds variation rather than reducing it. Metrics oscillate without structurally improving.

Bootstrap CUSUM test: Is there a sustained change point or a series of interventions producing oscillation around an unchanged mean?
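Tampering can also be simulated. The sketch below is a minimal illustration, assuming a process with a fixed target and only common cause variation (the target of 10 and spread of 2 are made-up values). The tampering rule, adjusting the setting to compensate for the last result, resembles rule 2 of Deming's funnel experiment.

```python
import random
import statistics

def run_process(n=5000, sd=2.0, tamper=False, seed=7):
    rng = random.Random(seed)
    target = 10.0
    aim = target  # where the process is currently "set"
    results = []
    for _ in range(n):
        x = aim + rng.gauss(0.0, sd)  # common cause variation only
        results.append(x)
        if tamper:
            # Treat every deviation as a signal: shift the setting
            # to compensate for the last result.
            aim -= (x - target)
    return results

stable = run_process(tamper=False)
tampered = run_process(tamper=True)
print(f"left alone: mean {statistics.mean(stable):.2f}, "
      f"sd {statistics.pstdev(stable):.2f}")
print(f"tampered:   mean {statistics.mean(tampered):.2f}, "
      f"sd {statistics.pstdev(tampered):.2f}")
```

Under this rule the mean stays where it was, while the spread grows by a factor of roughly 1.4 (the variance doubles). The metric oscillates more, and the system has not improved.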

All three confounders share the same structural feature: the metric moves without the system changing. Bootstrap CUSUM applied to the right outcome metric, with a pre-committed prediction made before the intervention, is the instrument that distinguishes all three from genuine structural improvement.

What is a pre-committed prediction — and why before the intervention?

A pre-committed prediction is a specific, written, publicly stated prediction about what will happen to a named metric, in a named direction, by a named amount, within a named timeframe — made before the improvement intervention begins and before any outcome data is collected.

The six components of a valid pre-committed prediction:

Which metric will change (e.g. 12-hour A&E waits, ambulance handover hours, discharge-ready patient count)
Which direction (downward, upward)
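By how much, where the theory supports an estimate (e.g. 30% reduction from the 12-month baseline mean); otherwise relative to what threshold (e.g. below the baseline common cause range)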
Within what timeframe (e.g. a Bootstrap CUSUM change point will appear within 6 months of the intervention)
At what confidence level (e.g. Bootstrap CUSUM p<0.05)
What balancing measures will be monitored to detect unintended consequences (e.g. 30-day readmission rates, patient safety incidents)
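One way to make the six components hard to fudge is to record them as a fixed, dated record. The sketch below is an illustration, not a prescribed format: the class name, field names, and dates are hypothetical, and the example values are taken from the list above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Tuple

@dataclass(frozen=True)  # frozen: the record cannot be edited after creation
class PreCommittedPrediction:
    metric: str
    direction: str                       # "down" or "up"
    amount: str
    timeframe_months: int
    confidence_threshold: float          # e.g. 0.05 for p<0.05
    balancing_measures: Tuple[str, ...]
    recorded_on: date
    intervention_start: date

    def __post_init__(self):
        if self.direction not in ("down", "up"):
            raise ValueError("direction must be 'down' or 'up'")
        if not self.balancing_measures:
            raise ValueError("name at least one balancing measure")
        if self.recorded_on >= self.intervention_start:
            raise ValueError("prediction must be recorded before the intervention starts")

prediction = PreCommittedPrediction(
    metric="12-hour A&E waits",
    direction="down",
    amount="30% reduction from the 12-month baseline mean",
    timeframe_months=6,
    confidence_threshold=0.05,
    balancing_measures=("30-day readmission rate", "patient safety incidents"),
    recorded_on=date(2025, 1, 6),          # hypothetical dates
    intervention_start=date(2025, 2, 3),
)
print(prediction.metric, prediction.direction, prediction.timeframe_months)
```

The date check is the important design choice here: a prediction dated on or after the start of the intervention is rejected outright rather than quietly accepted.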

Why before the intervention? Because the moment outcome data is visible, the prediction is no longer independent of the result. A prediction made after even one month of data is shaped — consciously or unconsciously — by what the data shows. Only a prediction made before any outcome data exists is genuinely independent of the result and therefore genuinely falsifiable.

This is the scientific method applied to improvement: hypothesis before experiment, not after. Without it, every intervention can be made to appear successful in retrospect — the Hawthorne Effect, regression to the mean, and seasonal variation all produce movements in the data that can be attributed to any recent intervention if the attribution is made after the data arrives.

The pre-committed prediction is what makes Bootstrap CUSUM honest. The algorithm detects change points in data regardless of when or why they occurred. The pre-committed prediction is what determines whether a detected change point is evidence of a successful intervention or coincidence.
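For readers who want to see the mechanics, here is a minimal sketch of the change-point procedure commonly described under this name: a cumulative sum of deviations from the overall mean, with confidence estimated by randomly reordering the data. It is an assumption that this matches what any particular tool, including the StepChange Analyzer, implements. The function names and the example counts are made up for illustration.

```python
import random

def cusum(series):
    """Cumulative sum of deviations from the overall mean."""
    mean = sum(series) / len(series)
    total, path = 0.0, []
    for x in series:
        total += x - mean
        path.append(total)
    return path

def cusum_range(series):
    path = [0.0] + cusum(series)  # the CUSUM starts at zero
    return max(path) - min(path)

def bootstrap_cusum(series, n_boot=1000, seed=0):
    """Return (index of first point at the new level, confidence)."""
    rng = random.Random(seed)
    observed = cusum_range(series)
    path = cusum(series)
    # The CUSUM peaks (in absolute value) at the last point before the change.
    last_before = max(range(len(path)), key=lambda i: abs(path[i]))
    shuffled = list(series)
    smaller = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)  # reordering destroys any real change point
        if cusum_range(shuffled) < observed:
            smaller += 1
    return last_before + 1, smaller / n_boot

# Made-up daily counts: 14 days around 12, then 14 days around 7.
series = [12, 11, 13, 12, 14, 10, 12, 13, 11, 12, 13, 12, 11, 12,
          7, 8, 6, 7, 9, 7, 6, 8, 7, 7, 8, 6, 7, 7]
first_new_level, confidence = bootstrap_cusum(series)
print(first_new_level, confidence)
```

Note that nothing in the code knows when the intervention started or what was predicted. It reports where the data shifted and how unlikely that shift is under random ordering, which is why the prediction has to be supplied separately and in advance.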

The Bootstrap CUSUM response

The only rigorous response to the Hawthorne Effect in improvement evaluation is a pre-committed prediction tested by Bootstrap CUSUM on sustained data after external observation has ended.

The sequence:

Before: State the prediction publicly before the intervention begins. Which metric will change? In which direction? By how much? Within what timeframe? At what confidence threshold? What balancing measures will be monitored? The prediction must be made before the data arrives, not after.

During: Run SPC charts and Bootstrap CUSUM during the intervention period. A change point during the observation period is promising but not conclusive. It may be Hawthorne. Continue monitoring.

After: Continue Bootstrap CUSUM for 12–24 months after external observation ends. If the change point holds without external monitoring, the system has structurally changed. If the metric drifts back toward baseline, the improvement was Hawthorne. Report the result honestly either way.

Report: Report the Bootstrap CUSUM result, not the narrative. A flat line is reported as a flat line. A reversed change point is reported as reversed. The data is allowed to say no. This is what honest evidence of structural improvement looks like.
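A simple companion number for the "After" step is the share of the during-observation improvement that is still present once observation has ended. The sketch below is one hypothetical way to express that. It is not a substitute for running Bootstrap CUSUM on the post-observation series, and the three short series are made-up values.

```python
from statistics import mean

def improvement_retained(baseline, during, after):
    """Fraction of the during-observation improvement still present afterwards."""
    gain_during = mean(baseline) - mean(during)
    if gain_during == 0:
        raise ValueError("no change during observation, nothing to retain")
    gain_after = mean(baseline) - mean(after)
    return gain_after / gain_during

baseline = [12, 12, 12, 12]   # before the programme
during = [8, 8, 8, 8]         # while external observation is in place
after = [11, 11, 11, 11]      # after observation has ended
retained = improvement_retained(baseline, during, after)
print(f"{retained:.0%} of the improvement remained after observation ended")
```

A value near 1 is consistent with structural change. A value near 0 is what a Hawthorne response would look like.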

The irony — visibility as intervention

There is a productive irony in the Hawthorne Effect that improvement practitioners should take seriously: being measured changes behaviour because being measured changes what is visible. Staff who know their discharge times are being tracked attend more carefully to discharge. The metric improves not because the system changed but because visibility changed what people attended to.

This is a genuine visibility intervention — and it is not worthless. Visual management, run charts on the wall, Red/Green Day tracking all work partly through this mechanism. Making performance visible changes performance. That is a real effect, even if it is not structural change.

The question is whether visibility alone is sufficient for structural improvement — or whether it requires the deeper Level 3 intervention of changing the system that produces the performance. Hawthorne improvement is Level 1 and Level 2: it changes outputs and processes through heightened attention. Structural improvement changes the system design so that good performance is produced automatically, without requiring sustained external observation.

The Hawthorne Effect tells you that visibility matters. Bootstrap CUSUM tells you whether visibility alone was sufficient — or whether the structural change that would sustain improvement without observation still needs to be made.

The honest synthesis — what Deming actually said about prediction

You might ask: how can anyone predict the outcome of an improvement intervention before it happens? And is any statistically proven gain not worth having, regardless of whether it was pre-committed?

Both are fair challenges. Deming’s answer to the first: in The New Economics he stated that the PDSA cycle begins with a prediction — not just a plan. “The Study step is the comparison of the results of the execution with the prediction.” Without a prediction, the Study phase has nothing to compare against. It can only ask “did something change?” — not “did the right thing change in the right direction for the right reason?”

The answer to the second: yes, any Bootstrap CUSUM change point in the intended direction is worth having. A structural improvement without a pre-committed prediction is still a structural improvement. The pre-committed prediction adds attribution confidence, not the existence of the improvement. The evidence spectrum:

Before/after comparison: something may have changed
SPC/run charts: a non-random shift occurred around this date
Bootstrap CUSUM: a structural step change occurred, precisely dated
Bootstrap CUSUM + pre-committed prediction: the change occurred as predicted, suggesting the intervention caused it and can be replicated

The pre-committed prediction is not a guarantee of accuracy. It is a statement of theory made at the only moment the theory is genuinely independent of the result — before the data arrives. A Bootstrap CUSUM change point that matches a pre-committed prediction tells you the theory was right and the intervention can be scaled with confidence. A change point that contradicts the prediction is equally valuable: it tells you the theory was wrong, which means scaling would not replicate the result.

The discipline of writing the prediction forces the improvement team to articulate what they actually believe will happen and why. Most improvement programmes cannot do this because their theory of change is implicit rather than explicit. The act of writing the prediction makes the theory visible — and visible theories can be questioned, challenged, and refined. Invisible theories cannot.

A practical example — the complete pre-committed prediction sequence

The pre-committed prediction does not require certainty about magnitude. It requires honesty about direction and a baseline that makes the question answerable. Here is the complete four-phase sequence using a real corridor care intervention.

Phase 0 — Measure the baseline variation first (four weeks)

Before any intervention, measure the number of patients in the corridor at 1800 hours every day for four weeks. Four weeks of daily counts gives 28 data points, above the 20 to 25 points conventionally treated as the minimum for a stable estimate of a process mean and its natural variation. It also captures four full weekly cycles: four of each day of the week, so the baseline mean is not distorted by having three Mondays and only two Fridays.

This is the Shewhart step. Without it, you cannot distinguish a genuine downward shift from normal day-to-day variation. If the 1800 corridor count varies between 2 and 18 patients through normal common cause variation, a reduction from an average of 12 to an average of 10 over a week or two cannot be distinguished from noise. You need to know the baseline range before you can know what “going down” actually means.
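One conventional way to put numbers on the baseline range is the individuals (XmR) chart calculation: the mean plus or minus 2.66 times the average moving range. The sketch below assumes that this is an acceptable reading of "common cause range" for a daily count. The function name and the ten baseline values are made up for illustration.

```python
def natural_process_limits(baseline):
    """Mean and XmR natural process limits for a series of daily counts."""
    mean = sum(baseline) / len(baseline)
    # Moving range: absolute difference between consecutive days.
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    mr_bar = sum(moving_ranges) / len(moving_ranges)
    lower = max(0.0, mean - 2.66 * mr_bar)  # a count cannot go below zero
    upper = mean + 2.66 * mr_bar
    return mean, lower, upper

baseline_counts = [12, 9, 14, 11, 15, 8, 13, 10, 12, 16]  # made-up values
mean, lower, upper = natural_process_limits(baseline_counts)
print(f"mean {mean:.1f}, common cause range {lower:.1f} to {upper:.1f}")
```

A prediction of "below the common cause range" then has a concrete meaning: sustained values under the lower limit, not just under the mean.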

The four-week baseline also gives you something even more valuable: a pre-observation measure of the system as it actually is, before anyone is being watched. This is the cleanest possible control for the Hawthorne Effect — measure before observation changes behaviour, then measure after.

⚠ The gaming problem — Deming’s Point 11

Any single-point measurement that becomes known can be gamed. If the metric is patients in the corridor at 1800 hours, the system will find ways to have fewer patients visible at 1800 — moving patients to side rooms at 1750 and back at 1810, narrowing the definition of “corridor,” not recording on bad days. The Bootstrap CUSUM change point appears. The improvement does not.

This is Deming’s Point 11: eliminate numerical targets for management. A target changes what people do without changing the system. The measurement becomes the target. The target becomes the game. The game destroys the signal.

The defence is not to hide the measurement time. The defence is a system of measures:

Measure	What it captures	Gaming resistance
Total corridor patient-hours per 24 hours (primary outcome)	Actual patient experience: a patient in the corridor for 18 hours has nine times the exposure of one there for two hours	High: requires moving patients continuously throughout the day, not just at one moment
Discharge before noon rate (process measure)	Output of morning coordination — directly what the senior manager on the floor is influencing	High — requires actual discharges, not patient movement
30-day emergency readmission rate (balancing measure)	Whether faster discharge is safe — detects premature discharge	Very high — readmissions are independently recorded
Patient safety incident rate (balancing measure)	Whether flow improvement is producing harm elsewhere	Very high — independently reported
Staff experience score (balancing measure, quarterly)	Whether the intervention is sustainable — staff doing impossible things to hit numbers burn out	High — anonymous survey

Bootstrap CUSUM on all five. A genuine structural improvement produces a change point in the primary outcome AND improvement in the process measure AND no deterioration in the balancing measures. Gaming produces a change point in the primary outcome but deterioration in one or more balancing measures — the signal that the metric was managed, not the system.
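The difference in gaming resistance between a single-time snapshot and total patient-hours can be shown with a few lines of arithmetic. The sketch below uses made-up corridor stays, expressed as (start hour, end hour) within one day, and a hypothetical `hide_at_snapshot` helper that mimics moving every patient to a side room from 17:50 to 18:10.

```python
def snapshot_count(stays, at_hour=18.0):
    """Patients in the corridor at one moment (the 1800 count)."""
    return sum(1 for start, end in stays if start <= at_hour < end)

def patient_hours(stays):
    """Total hours spent in the corridor across all patients."""
    return sum(end - start for start, end in stays)

def hide_at_snapshot(stays, out_from=17 + 50 / 60, out_to=18 + 10 / 60):
    """Split each stay so the patient is elsewhere during the snapshot window."""
    gamed = []
    for start, end in stays:
        if start < out_from and end > out_to:
            gamed += [(start, out_from), (out_to, end)]
        else:
            gamed.append((start, end))
    return gamed

honest = [(6.0, 24.0), (15.0, 20.0), (17.0, 19.0)]  # made-up stays
gamed = hide_at_snapshot(honest)

print(snapshot_count(honest), snapshot_count(gamed))
print(round(patient_hours(honest), 2), round(patient_hours(gamed), 2))
```

In this example the snapshot falls from three patients to none, while total patient-hours barely moves. That is the pattern to look for when a change point appears in the snapshot measure but not in the patient-hours measure.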

See Gaming the Measure — Deming’s Point 11 and the Balancing Measures Defence for the full treatment of this problem.

⚠ Practical caution — the baseline is not one stable process

The 1800 corridor patient count is not a single stable process with one mean and one variation range. It contains several systematically different processes embedded within it:

Weekdays (Mon–Fri) — the primary process: elective admissions, ward rounds, discharge rounds, full social care and community services operating
Weekends — genuinely different: reduced elective activity, different staffing levels, reduced discharge capacity, different demand pattern
Bank holidays — similar to weekends but with additional disruption: GP services reduced, social care emergency-only, community capacity further restricted
Summer (school holidays) — staff leave reduces coordination capacity; elective admissions fall; different demand pattern from winter
Winter — respiratory illness surge, higher acuity, social care more stretched, bank holidays clustered

Running Bootstrap CUSUM on a mixed daily series that combines all of these without accounting for the systematic patterns inflates the apparent variation: SPC control limits come out too wide and Bootstrap CUSUM confidence levels come out too low. The variation between a Monday in January and a Saturday in August is variation of the calendar, not of the system. The real system variation is hidden inside the calendar variation, and genuine change points become harder to detect.

The practical recommendation for the four-week baseline:

Use Monday to Friday as the primary series — comparing like with like
Note bank holidays and exclude them from the primary series or treat them as a separate series
Note weekends separately — useful as a balancing measure but not the primary signal
Specify the season in the pre-committed prediction — a summer baseline will look different from a winter one; the prediction should state which conditions it applies to

The pre-committed prediction should therefore say: “We believe a sustained downward shift in weekday (Monday–Friday, excluding bank holidays) 1800 corridor patient count — below the common cause range established in the four-week weekday baseline — will appear within three months of the intervention starting.” That specificity makes the prediction cleaner, the Bootstrap CUSUM more sensitive, and the learning more precise.
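Splitting the daily series by day type before analysis is straightforward. The sketch below is a minimal illustration. The start date, the placeholder counts, and the single bank holiday are hypothetical, and real bank holiday dates would need to be supplied for the period in question.

```python
from datetime import date, timedelta

def day_type(day, bank_holidays):
    if day in bank_holidays:
        return "bank_holiday"
    return "weekend" if day.weekday() >= 5 else "weekday"

def stratify(start, counts, bank_holidays=frozenset()):
    """Split a daily series into weekday, weekend and bank holiday series."""
    series = {"weekday": [], "weekend": [], "bank_holiday": []}
    for offset, count in enumerate(counts):
        day = start + timedelta(days=offset)
        series[day_type(day, bank_holidays)].append((day, count))
    return series

start = date(2025, 3, 3)                    # a Monday; hypothetical baseline start
counts = [10] * 28                          # placeholder values, four weeks of daily counts
holidays = frozenset({date(2025, 3, 17)})   # hypothetical, for illustration only

baseline = stratify(start, counts, holidays)
for name, points in baseline.items():
    print(name, len(points))
```

The weekday series is then the one passed to the baseline and change-point calculations. Be aware that a four-week baseline leaves at most 20 weekday points, fewer if bank holidays fall inside it.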

Phase 1 — State the pre-committed prediction

“We believe that putting a senior manager on the floor 24/7 with cross-departmental authority will produce a sustained downward shift in the 1800 corridor patient count — below the common cause range established in the four-week baseline — within three months of the intervention starting.”

Notice what this prediction does and does not claim. It does not claim a specific magnitude — 15% or 40%. It cannot, because the effect of a system-level intervention depends on what the manager sees, whether staff feel safe enough to show it, and whether the constraint is actually internal coordination. It does claim a direction (downward), a comparison point (below the baseline range, not just below the baseline mean), and a timeframe (three months). Those three elements are enough to make the prediction specific enough that the data can say no.

Phase 2 — Implement the intervention

Senior manager takes post. Continue measuring at 1800 every day. Do not change the measurement protocol — same time, same definition, same recorder if possible.

Phase 3 — Run Bootstrap CUSUM on the full series

Apply Bootstrap CUSUM to the complete series (four-week baseline plus intervention period). A downward change point appearing within three months, sustained below the baseline common cause range, supports the theory. The change point date tells you when the system shifted, which reveals whether the shift was immediate (the constraint was visible and the authority was sufficient from day one) or gradual (the departments needed time to reorganise).

Phase 4 — The Act step: change point or flat line?

If Bootstrap CUSUM shows a change point: What specifically changed? Is the change sustained week by week? Can it be maintained without this specific person, or does it depend on their continued presence? The change point is the beginning of the sustainability question, not the end of the improvement story.

If Bootstrap CUSUM shows a flat line at three months: This is not failure. It is the Study phase of PDSA working exactly as Deming intended — the result compared against the prediction, the theory found incomplete. Three possibilities, and only honest investigation distinguishes them:

The constraint is not where you thought. The manager coordinated every department perfectly and the 1800 count stayed flat — because the beds are full of patients who are medically fit to leave but have nowhere to go. The constraint is external discharge capacity, not internal coordination. The theory was wrong. Move to a different level of intervention.
The conditions were insufficient. The authority wasn’t enough to change consultant priorities. Or the psychological safety wasn’t there — staff showed the manager a performance rather than the real constraint. Diagnose which condition was missing and redesign the intervention.
Three months wasn’t long enough. The change is coming but hasn’t appeared yet. Make a revised prediction with an extended timeframe — but acknowledge the extension honestly rather than quietly moving the goalposts.

The critical rule: do not push harder without investigating which possibility explains the flat line. Reacting to a flat line by doing more of the same thing faster is Deming’s tampering applied to improvement programmes. The flat line is information. Treat it as information.

What the pre-committed prediction does for the trust:

Makes the theory of change visible before it’s too late to learn from it
Protects from motivated reasoning — the result cannot be reinterpreted to fit the theory after the fact
Gives permission to stop — defines what “not working” looks like before the intervention begins
Makes learning transferable — other trusts can adopt with confidence because the mechanism is articulated and tested
Controls for the Hawthorne Effect — the four-week pre-observation baseline is the cleanest possible separator between structural change and observation-induced behaviour change

The pre-committed prediction is the cheapest insurance the NHS has against wasting improvement investment on interventions that produce activity, Hawthorne response, or regression to the mean rather than structural change. It costs nothing to make. It saves everything if the intervention doesn’t work — because it turns a failure into learning rather than embarrassment.

The pre-committed prediction checklist

The pre-committed prediction is the missing step in most improvement evaluation — in healthcare, manufacturing, software, policy, and public services. It is not a criticism of any organisation’s statistical capability. SPC charts, run charts, and comparative analysis are all valuable. The gap is not what is measured after the fact. It is what is specified before the intervention begins.

Most improvement evaluation analyses data only after the intervention has been implemented and results observed. The pre-committed prediction, made before the data arrives and specifying what would count as success and what would count as failure, is what makes the analysis honest rather than confirmatory.

Without it:

Any improvement can be attributed to the intervention
Any flat line can be explained by confounding factors
The Hawthorne Effect cannot be separated from structural change
Regression to the mean cannot be separated from genuine improvement
The bundle effect makes it impossible to isolate which intervention worked
Improvement programmes accumulate false evidence of what works — and that false evidence gets scaled

This applies equally to a trust putting a senior manager on the floor, a software company tackling its Customer Acceptance Testing bottleneck, a council improving planning application turnaround times, or a manufacturer redesigning a diagnostic pathway. The framework is universal. The checklist is the same.

☑ The pre-committed prediction checklist

Before any improvement intervention, five things should be in place. Write them down. Make them visible. They are the standard against which the data will be compared.

The metric — which one, measured how, at what frequency, by whom. Not a vague outcome area but a specific, countable number. Example: total corridor patient-hours per 24-hour period, counted from ward records, daily.
The baseline — four weeks minimum before the intervention begins, stratified by day type (weekdays, weekends, bank holidays separately). This is the Shewhart step: establishing common cause variation before any change is made.
The direction — will the metric go up or down? This is always knowable even when the magnitude is not. Example: will go down.
The timeframe — within how many months should Bootstrap CUSUM detect a change point if the theory is right? Be honest. If you cannot state a timeframe, the theory is not specific enough to test.
The balancing measures — what would deteriorate if the primary metric were gamed rather than genuinely improved? Name them before the intervention. Monitor them throughout. A genuine structural improvement produces no deterioration in balancing measures. Gaming does.

That is the complete pre-committed prediction. It does not require certainty about magnitude. It requires honesty about direction, timeframe, and what failure looks like. The data is then allowed to say no.

Make your prediction before the data arrives

Upload your baseline data to the StepChange Analyzer before your intervention begins. Bootstrap CUSUM will establish the change point if and when it occurs — and will tell you honestly whether it holds after observation ends.
