📈 Improvement Concepts

Types of Measures: Outcome, Process, and Balancing

The single most common failure in any quality improvement programme is evaluating it against the wrong type of measure. Improving a process measure and calling it success. Ignoring a balancing measure until it becomes a crisis. The distinction between outcome, process, and balancing measures — and between lead and lag indicators — is not a technicality. It determines whether you can actually answer Deming’s question: did the change work?


StepChangeAnalysis.com  ·  Concepts series  ·  June 2026

The second question of the Model for Improvement

The Model for Improvement — from Langley, Nolan et al., The Improvement Guide — asks three questions before any PDSA cycle begins. The second question is the one most often answered inadequately:

Question 1

What are we trying to accomplish?

Question 2 — The critical one

How will we know that a change is an improvement?

Question 3

What changes can we make that will result in an improvement?

Question 2 requires three types of measure working together. Any programme that answers it with process measures alone — compliance rates, audit scores, training completions — has not answered the question. It has answered a different question: are we doing what we said we would do? That matters. But it is not the same as: is it working?


The three types of measure

● Outcome measure

Did it work for patients?

Measures what ultimately matters — the result the system exists to produce. The end-state for the patient or population.

Mortality rate · Readmission rate · Patient-reported outcome · Length of stay · Quality-adjusted life year

● Process measure

Are we doing what we said?

Measures whether the change is actually being implemented as designed. Compliance, delivery, adherence.

Bundle compliance rate · Time-to-antibiotics · Percentage screened · Training completion · Audit score

● Balancing measure

What might be getting worse?

Measures unintended consequences of the change — problems created elsewhere in the system while fixing the target area.

Antibiotic resistance rate · Staff burnout · Readmission from over-early discharge · Waiting times elsewhere

The relationship between the three

Process measures tell you whether the intervention is being delivered. Outcome measures tell you whether it is producing the result you wanted. Balancing measures tell you whether it is producing results you did not want. All three are needed simultaneously. A programme that shows excellent process compliance, unchanged outcomes, and rising antibiotic resistance has told you three separate things — each of which demands a different response.


Outcome measures — what ultimately matters

An outcome measure captures the result the system exists to produce. For a clinical intervention, that is typically what happens to the patient: do they survive, recover, stay well, return to function? For a service improvement, it is typically what the patient experiences: how long they wait, how safe they are, how well the system serves them.

Outcome measures have two critical properties that process measures do not. First, they are what patients actually care about. A 100% bundle compliance rate is meaningless if the bundle does not improve survival. Second, they are the only valid basis for claiming that a change was an improvement. Improved process measures alongside unchanged outcome measures are not success; they are evidence that the theory connecting the process to the outcome may be wrong.

Why outcome measures are harder to collect — and why the lag matters

Outcome measures are often the last thing to move. There is always a lag between implementing a change and seeing the outcome shift — and the length of that lag depends entirely on how the intervention works. A price signal acting through market economics (the carbon price floor) may produce a detectable Bootstrap CUSUM change point within two years. A workforce intervention acting through training pipelines may take a decade. A clinical pathway change acting through patient cohort turnover may take three to five years.

The lag creates two dangers. First, impatience: declaring failure before the outcome measure has had time to respond, and abandoning an intervention that was working. Second, tampering: adding new interventions on top of the original before it has had time to produce a measurable result, resetting the lag clock each time. Deming identified both as among the most destructive management behaviours in improvement programmes. The dementia diagnosis article on this site works through the lag analysis in detail: an intervention that requires 7–10 years to produce workforce change, and 3–5 years for structural process change, will show no Bootstrap CUSUM change point in a 4-year parliamentary cycle, and will routinely be declared a failure and replaced before the evidence could ever appear.

Outcome measures are also harder to attribute: many factors affect mortality besides the bundle introduced last year. And they are often the most politically sensitive: a programme that improved compliance but did not move outcomes is an uncomfortable finding. The temptation to use process measures as a proxy is understandable. The problem is that a proxy is not the thing itself.


Process measures — whether you are doing what you said

A process measure captures whether the change is being implemented as designed. It is a direct measure of fidelity — is the bundle being delivered to every eligible patient, within the specified time window, by every staff group, on every shift?

Process measures are essential and valuable. Without them you cannot diagnose why an intervention failed to produce the expected outcome: was the theory wrong, or was the intervention simply not delivered consistently enough to test the theory? A programme with good process measures and poor outcome measures has answered that question: the intervention was delivered, so the theory is in doubt. A programme with poor process measures and poor outcome measures has answered nothing, because the theory was never properly tested.
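The diagnostic logic above can be written down as a small decision function. This is a minimal sketch, not a validated tool: the function name and return strings are hypothetical, and the third branch (outcome improved without reliable delivery) is an assumption added for completeness, since the text does not discuss that case.

```python
def diagnose(process_delivered: bool, outcome_improved: bool) -> str:
    """Read a process measure and an outcome measure together."""
    if process_delivered and outcome_improved:
        return "theory supported: delivered as designed and the outcome moved"
    if process_delivered:
        # Delivered, but no outcome shift: the link from process to outcome is in doubt.
        return "theory in doubt: delivered as designed but the outcome did not move"
    if outcome_improved:
        # Assumption, not from the text: an improvement without delivery needs another explanation.
        return "look elsewhere: the outcome moved without reliable delivery"
    # Not delivered, no outcome shift: nothing has been learned about the theory.
    return "theory untested: the intervention was not delivered consistently"


print(diagnose(process_delivered=True, outcome_improved=False))
print(diagnose(process_delivered=False, outcome_improved=False))
```

The two booleans would in practice come from a structural analysis of each series, not from eyeballing before-and-after figures.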

The process measure trap

The trap is declaring success on the basis of process measures alone. The NHS regularly reports improving compliance rates as evidence that a programme is working. Compliance is evidence the programme is being delivered. It is not evidence the programme is producing the desired outcome. These are different claims and they require different evidence. The Sepsis Six article on this site is the clearest example: extensive process compliance data, and no reliable national outcome measure.


Balancing measures — what might be getting worse

A balancing measure captures the unintended consequences of a change: the things that might be getting worse in a different part of the system while the target measure improves. Deming called this suboptimisation: improving one metric at the expense of another without understanding the system as a whole.

Balancing measures are the most frequently neglected of the three types. They are rarely pre-specified before an intervention begins, and they are not typically the subject of audit or governance reporting. The result is that negative consequences accumulate unchecked until they become a crisis — at which point the original intervention is often blamed rather than the measurement gap that allowed the consequence to go undetected.

Examples of neglected balancing measures in NHS programmes

Sepsis Six: Antibiotic resistance from broad-spectrum empirical prescribing; fluid overload in patients with renal impairment from aggressive IV resuscitation; unnecessary treatment in patients who did not have sepsis. None systematically tracked alongside compliance data.

Discharge-to-assess: Readmission rates and community care capacity are the natural balancing measures for any programme that reduces length of stay. Rarely reported alongside the length-of-stay improvement figures.

Four-hour A&E target: The balancing measures were always ambulance handover time and corridor care. Optimising four-hour performance pushed waiting into other parts of the system. The metric improved briefly; the system did not.


Lead and lag measures

The lead/lag distinction cuts across all three measure types. It describes the relationship between a measure and the outcome in time.

● Lag measure

What has already happened

Measures the historical output of the system. Tells you what the system produced. High validity — it actually happened. Low actionability — by the time you read it, it is already in the past. Examples: annual mortality rate, quarterly readmission rate, Bootstrap CUSUM change point.

● Lead measure

What predicts the future

Measures a process or condition that predicts the future lag measure. Moves before the outcome does — so it allows early course correction. Lower validity than a lag measure (it is a prediction, not a result) but higher actionability. Examples: time-to-antibiotics, EV fleet share, charging point density.

The lead/lag relationship in practice

The Bootstrap CUSUM is a lag measure instrument. It tells you what has structurally happened in a time series. It cannot tell you what is about to happen. For that you need lead measures — the process conditions that predict the future lag measure outcome. The right measurement system combines both: lead measures for early warning and course correction, lag measures for verification and governance. A programme that tracks only lag measures is always learning from the past. A programme that tracks only lead measures is always predicting without verifying.
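One crude way to check whether a candidate lead measure really does move ahead of a lag measure is to correlate the two series at several time shifts. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the two series are invented for the example, and a lagged correlation is a swapped-in technique, not part of Bootstrap CUSUM. A strong correlation at some shift suggests a lead time; it does not prove causation.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def best_lead_time(lead, lag, max_shift):
    """Return (shift, r) where lead[t] correlates most strongly with lag[t + shift]."""
    results = []
    for shift in range(max_shift + 1):
        x = lead[: len(lead) - shift]  # lead values that have a matching later lag value
        y = lag[shift:]                # lag values 'shift' periods later
        results.append((shift, pearson(x, y)))
    return max(results, key=lambda pair: abs(pair[1]))


# Invented data: the lag series responds to the lead series three periods later.
lead = [5, 5, 6, 5, 9, 10, 9, 10, 9, 10, 10, 9]
lag = [90, 90, 90] + [100 - 2 * v for v in lead[:9]]

shift, r = best_lead_time(lead, lag, max_shift=5)
print(shift, round(r, 3))
```

Keep `max_shift` well below the series length, otherwise the longest shifts are judged on only a handful of points.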


Deming’s system inputs — prioritising what to measure

Deming’s approach to management begins with a simple but profound observation: a result is the output of a system, and the system has inputs. You cannot sustainably change the output without understanding and changing the inputs that produce it. His question — “by what method?” — is a demand that you identify the specific inputs, understand how they relate to the output, and change the right ones.

Applied to measurement, this means defining all the factors that impact your system, then prioritising which ones to track. Not every input needs a measure. What needs a measure is:

First, the inputs with the greatest leverage on the outcome — the ones where a change in the input produces a predictable change in the output. Second, the inputs that are actually changeable — measuring something you cannot influence produces frustration, not improvement. Third, the inputs where the current state is unknown — if you already know the value and it is not going to change, measuring it adds no information.

📊 Deming’s four-step measurement framework

“You cannot manage what you cannot measure — but measuring the wrong thing is worse than not measuring at all.”

Step 1 — Define all factors that impact the system. Map the inputs: staffing levels, equipment availability, patient acuity, process adherence, environmental factors. This is the system map. Without it you are measuring fragments of a whole you have not yet described.

Step 2 — Prioritise the inputs. Not all inputs have equal leverage. Use Pareto analysis, fishbone diagrams, or process mapping to identify the vital few — the inputs that account for most of the variation in the output. Measure these. Monitoring twenty inputs produces noise. Monitoring the three that matter produces signal.

Step 3 — Develop the theory of change. State explicitly: if input X changes by Y, we predict output Z will change by W, within T time periods. This is the prediction that makes the Study step of PDSA meaningful. Without a prior prediction, any result can be rationalised as success.

Step 4 — Evaluate against the prediction. Apply Bootstrap CUSUM to the lag measure. If a structural change point appears at the predicted time with the predicted direction, the theory is supported. If not, the theory needs revision — not the data. This is Deming’s System of Profound Knowledge applied to measurement: the result tells you about the theory, not just about the system.
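Step 2 above mentions Pareto analysis as one way to find the vital few inputs. A minimal sketch of that ranking follows. The function name, the 80% threshold, and the delay-reason counts are all assumptions made up for illustration; they are not data from any programme.

```python
from typing import Dict, List


def vital_few(contributions: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Return the largest contributors that together reach `threshold` of the total."""
    total = sum(contributions.values())
    ranked = sorted(contributions.items(), key=lambda item: item[1], reverse=True)
    chosen, running = [], 0.0
    for name, value in ranked:
        chosen.append(name)
        running += value
        if running / total >= threshold:
            break
    return chosen


# Invented counts of reasons for a delayed process step (illustration only).
delay_reasons = {
    "reason A": 42,
    "reason B": 31,
    "reason C": 12,
    "reason D": 8,
    "reason E": 7,
}

print(vital_few(delay_reasons))
```

The inputs this returns are candidates for measurement, not a finished measurement set; Step 3 still requires a stated prediction for each one.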

The table below maps the types of inputs Deming identified to the measure types from Langley et al. and to the Bootstrap CUSUM question each answers.

Input type | Measure type | Lead or lag | Bootstrap CUSUM question | NHS example
System output: the result the system produces | Outcome | Lag | Did the outcome structurally change, and if so when? | Sepsis mortality rate, A&E four-hour performance, dementia diagnosis rate
Process adherence: fidelity to the designed process | Process | Lead (typically) | Did compliance structurally change? Does it precede an outcome change point? | Sepsis bundle compliance rate, time-to-antibiotics, screening rate
Upstream predictor: condition that drives the output | Lead | Lead | Is the leading indicator moving before the outcome moves? | Discharge-ready patients in beds, GP contact rate, EV fleet share
Unintended consequence: what the system trades off | Balancing | Either | Is a structural deterioration occurring in the balancing measure? | Antibiotic resistance, corridor care hours, community care capacity
Throughput constraint: the binding limit on system output | Lead / Outcome | Lead | Has the constraint changed? (Often more important than the output measure) | Discharge-to-assess beds available, social care capacity, operating theatre utilisation

Where Bootstrap CUSUM fits

Bootstrap CUSUM can be applied to any time series — outcome measures, process measures, and balancing measures equally. The method does not know or care what type of measure it is analysing. It asks one question of a time series: did the underlying process mean permanently change, and if so when? Applied to an outcome measure it detects whether patients are genuinely better off. Applied to a process measure it detects whether compliance structurally shifted. Applied to a balancing measure it detects whether an unintended consequence silently took hold. The measure type determines what the result means — the method is the same.
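For readers who want to see the mechanics, here is a minimal sketch of one widely used form of the technique: take the cumulative sum of deviations from the series mean, measure its range, and compare that range with the ranges obtained after repeatedly shuffling the data. Whether the StepChange Analyzer follows exactly this procedure is an assumption; the function names, the number of resamples, and the example series are hypothetical.

```python
import random


def cusum_range(series):
    """Return (range of the CUSUM, CUSUM path) for deviations from the series mean."""
    mean = sum(series) / len(series)
    total, path = 0.0, [0.0]
    for value in series:
        total += value - mean
        path.append(total)
    return max(path) - min(path), path


def bootstrap_cusum(series, n_boot=1000, seed=0):
    """Return (confidence that a change occurred, index where the new level starts)."""
    rng = random.Random(seed)
    observed, path = cusum_range(series)
    shuffled = list(series)
    below = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)  # shuffling removes any time ordering
        if cusum_range(shuffled)[0] < observed:
            below += 1
    confidence = below / n_boot
    # path[i] sums the first i points, so the peak of |CUSUM| marks the last
    # point of the old level; index i is the first point of the new level.
    start_of_new_level = max(range(1, len(path)), key=lambda i: abs(path[i]))
    return confidence, start_of_new_level


# Invented series: eight points around 10, then eight points around 14.
series = [10, 11, 9, 10, 11, 10, 9, 10, 14, 15, 13, 14, 15, 14, 13, 14]
confidence, start = bootstrap_cusum(series)
print(f"confidence={confidence:.3f}, new level starts at index {start}")
```

Nothing in the function depends on whether `series` is an outcome, process, or balancing measure, which is the point made above: the interpretation lives in the measure type, not in the method.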

In the lead/lag framework, Bootstrap CUSUM is most naturally a lag verification tool — it tells you what has structurally happened in historical data. But used prospectively, pre-specifying a change point as the test of whether an intervention worked, it becomes the objective Study step of PDSA. The pre-specified prediction is: after we implement change X, we expect a Bootstrap CUSUM change point to appear in [outcome/process/balancing] measure Y within Z time periods, at W% confidence.

Bootstrap CUSUM as the Study step — for any measure type

The power of pre-specifying a Bootstrap CUSUM change point as the test of whether an intervention worked is that it removes retrospective rationalisation. If a change point appears at the predicted time in the predicted direction, the intervention is supported. If it appears at a different time, either something else caused it or the predicted lag was wrong, and the theory needs revisiting. If it does not appear, the intervention did not produce a structural change, regardless of what the process compliance data shows. The objectivity of the test depends entirely on the prediction being made before the data is collected. Once you know the result, any method can be made to confirm it.
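The pre-specified test can be made concrete by recording the prediction as an immutable object before any data is collected, then comparing it with whatever change point is later found. This is a sketch only: the class names, field names, and verdict strings are hypothetical, and the wrong-direction branch is an assumption that the text does not spell out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: the prediction cannot be edited after the fact
class Prediction:
    measure: str
    direction: str         # "up" or "down"
    earliest_period: int   # start of the window in which the change is expected
    latest_period: int     # end of that window
    min_confidence: float  # e.g. 0.95


@dataclass(frozen=True)
class ChangePoint:
    period: int
    direction: str
    confidence: float


def study(prediction: Prediction, found: Optional[ChangePoint]) -> str:
    """Compare a detected change point (or None) with the prior prediction."""
    if found is None or found.confidence < prediction.min_confidence:
        return "not supported: no structural change detected"
    in_window = prediction.earliest_period <= found.period <= prediction.latest_period
    if in_window and found.direction == prediction.direction:
        return "supported: change point at the predicted time and direction"
    if in_window:
        # Assumption: a change in the wrong direction counts against the theory.
        return "contradicted: change point in the window but in the wrong direction"
    return "unexplained: change point outside the predicted window"


prediction = Prediction("outcome measure Y", "down", 4, 8, 0.95)
print(study(prediction, ChangePoint(period=6, direction="down", confidence=0.98)))
print(study(prediction, None))
```

Freezing the dataclass is a small design choice that mirrors the argument: the prediction has to be fixed before the result is known.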

See PDSA cycle for the full prospective use framework, and Model for Improvement for the three questions that frame it. See Three Charts, Three Stories for how Bootstrap CUSUM compares to X-mR and run charts on the same data.

Applied retrospectively to balancing measures, Bootstrap CUSUM is particularly valuable for detecting unintended consequences that were never pre-specified — checking whether something deteriorated at the same time as the target measure improved. A structural change point in a balancing measure coinciding with the intervention change point is the signal that an unintended consequence occurred and needs investigation. This is the analysis almost never done in improvement programmes, and whose absence allows negative consequences to accumulate undetected.


How measurement programmes fail — four patterns

Pattern 1 — Process measures declared as outcome measures

The most common failure. “Compliance with the bundle improved from 40% to 85% — the programme worked.” No. Compliance improved. Whether patients benefited is a separate question that requires an outcome measure. The Sepsis Six national rollout is the clearest example in the NHS: excellent process measure data, no reliable national outcome measure, and the two conflated throughout the governance reporting.

Pattern 2 — Balancing measures not pre-specified

The intervention produces the target improvement but creates a new problem in a different part of the system. This is precisely what Joiner’s Levels of Fix warns about: fixing the output (Level 1) without understanding the system consequences. Because no balancing measure was specified before the intervention, the consequence is not detected until it reaches crisis level. The target measure improvement is reported as success. The balancing measure deterioration is reported as a separate, unrelated problem. The connection is never made and the theory of change is never revised.

Pattern 3 — Lead measures tracked without lag verification

The programme tracks upstream predictors — EV registrations, time-to-antibiotics, training completion rates — as evidence of progress. These move earlier and faster than outcome measures, which makes them attractive for governance reporting. But a lead measure is a prediction, not a result. Without periodic Bootstrap CUSUM verification of the lag outcome measure, you do not know whether the prediction is coming true.

Pattern 4 — No prediction made before the data is collected

The intervention is implemented, data is collected, and then a measurement method is chosen that confirms the desired conclusion. Bootstrap CUSUM is not immune to this if applied retrospectively without a prior prediction. The Study step of PDSA is only valid if the prediction precedes the result. Without a pre-specified prediction (what measure, what direction, what magnitude, within what timeframe) any result can be made to look like confirmation.


NHS worked examples

Sepsis Six — the measurement that got conflated

The Sepsis Six rollout from 2013 had strong process measures (compliance rates, time-to-antibiotics via CQUIN), no pre-specified national outcome measure, and no balancing measures (antibiotic resistance, fluid overload, post-sepsis syndrome). Bootstrap CUSUM on the ONS sepsis mortality series finds no structural change point in either available series across 22 years — but finds a structural change point in the ratio between the two series dated to 2013, the year the CQUIN coding incentive was introduced. The coding system changed. The clinical outcome measure did not. The two were never distinguished in governance reporting. See the full Sepsis Six analysis.

NHS A&E — the wrong lag measure

The four-hour target is itself a process measure: it measures whether the patient was admitted, transferred or discharged within four hours, not whether the patient’s health outcome was better as a result. The outcome measure would be: did patients admitted via A&E have lower mortality, lower readmission rates, or better recovery when the target was met than when it was not? That analysis was rarely done. The process measure (four-hour performance) became the de facto outcome measure, and every policy intervention was evaluated against it. The constraint (discharge-ready patients occupying beds) is a lead measure for A&E performance that was never formally tracked as the governing input. See the A&E analysis.

UK carbon emissions — the measurement that worked

The carbon price floor (2013) had an implicit pre-specified outcome measure: electricity supply emissions. The mechanism was clear (changing coal plant economics), the direction was predicted (downward), and the lag was short (economics respond faster than behaviour). Bootstrap CUSUM finds a structural change point at 99.8% confidence in 2013 — the strongest signal in 35 years of UK emissions data. This is what it looks like when the theory of change is correct, the intervention targets the right leverage point, and the outcome measure is stable and attributable. See the carbon emissions analysis.


📉 The measurement checklist — before any improvement programme begins

1. Outcome measure: What is the specific measure that will tell you whether patients are better off? Is it stable over time? Will it be affected by coding or recording changes during the programme? Is it attributable to this intervention rather than confounders?

2. Process measure: What will tell you whether the change is actually being delivered? Is it measuring the right process — the one theorised to produce the outcome change? Is it being collected consistently across all sites and shifts?

3. Balancing measures: What might get worse in another part of the system if this intervention succeeds? List at least two before the programme begins. Specify when and how they will be checked.

4. Lead measures: What upstream conditions predict the outcome measure? What is their current state and trajectory?

5. The Bootstrap CUSUM prediction: State in writing — we expect a structural change point in [outcome measure] within [X] time periods of implementation, at [Y]% confidence. This is the objective test. Everything else is context.
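The checklist can be treated as a completeness check on a written plan. The sketch below assumes a hypothetical `MeasurementPlan` structure; the field names and gap messages are illustrative, and the only rule taken directly from the checklist is the requirement for at least two balancing measures. It checks that each item has been filled in, not that the answers are good ones.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MeasurementPlan:
    outcome_measure: str = ""
    process_measure: str = ""
    balancing_measures: List[str] = field(default_factory=list)
    lead_measures: List[str] = field(default_factory=list)
    prediction: str = ""  # the written change-point prediction from item 5


def gaps(plan: MeasurementPlan) -> List[str]:
    """List the checklist items the plan has not yet answered."""
    missing = []
    if not plan.outcome_measure.strip():
        missing.append("1. outcome measure not named")
    if not plan.process_measure.strip():
        missing.append("2. process measure not named")
    if len(plan.balancing_measures) < 2:
        missing.append("3. fewer than two balancing measures listed")
    if not plan.lead_measures:
        missing.append("4. no lead measures listed")
    if not plan.prediction.strip():
        missing.append("5. no written change-point prediction")
    return missing


# A plan that names only a process measure, the failure pattern described above.
draft = MeasurementPlan(process_measure="bundle compliance rate")
for gap in gaps(draft):
    print(gap)
```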

