📊 Concept · Improvement Method · Joiner p.141

Stratify, Experiment, Disaggregate

When Bootstrap CUSUM shows a system is stable but performing at the wrong level, the instinct is to intervene harder. Joiner’s answer — drawn from page 141 of Fourth Generation Management — is to think harder first. Three strategies, applied in sequence, reveal what the aggregate data conceals: where the variation comes from, which interventions actually move it, and which part of the process is driving the result you are trying to change.

StepChangeAnalysis.com  ·  Source: Joiner, Fourth Generation Management, p.141  ·  Open the StepChange Analyzer
📋 The three strategies at a glance
1. Stratify
Break the aggregate into groups. Who is doing it better? By how much? Under what conditions? The answer tells you whether variation in the system is explained by differences between subgroups — and points to Bright Spots worth understanding.
2. Experiment
Test a change on a small scale with a pre-specified, measurable prediction. PDSA — Plan, Do, Study, Act. Bootstrap CUSUM is the Study step: it detects whether a genuine structural change point appeared, at the confidence level you committed to in advance.
3. Disaggregate
Divide the process into its component mechanisms. The aggregate metric combines multiple distinct pathways. Identifying which pathway is dominant — and treating it specifically — is more powerful than managing the total number.

The problem these strategies solve

A system is stable. Bootstrap CUSUM finds no change point. The process is producing consistent results — but at a level that is not good enough. Previous interventions have not moved it. The temptation is to try a bigger version of the same intervention, or to try several interventions simultaneously, or to commission a review.

Joiner’s diagnosis is precise: the problem is not insufficient effort. It is insufficient understanding. The aggregate metric obscures three things that are essential for improvement — who is doing it differently, whether any tested change actually works, and which specific mechanism within the process is responsible for most of the problem.

Until all three questions are answered, any intervention is a guess. It may be a well-intentioned, evidence-informed guess — but it is still a guess. Stratify, Experiment, and Disaggregate replace guessing with knowing.

Why aggregate data blocks improvement

An aggregate metric (total wrong-route events, average A&E waiting time, national adverse drug reaction rate) is a summary. Summaries are useful for tracking direction. They are useless for identifying causes. An aggregate that shows only common cause variation is reporting the average of a set of processes that may be very different from each other. Some of those processes may be excellent. Some may be failing. The average obscures both. Stratification is the act of looking inside the average.


Strategy 1 — Stratify

Stratification means breaking the aggregate data into meaningful subgroups and asking whether the variation between those subgroups explains the overall result. The subgroups might be sites, teams, patient groups, time periods, geographies, product types, or any other dimension that might plausibly drive different performance.

The stratification question

If the aggregate metric is stable at an unacceptable level, ask: is everyone performing at this level, or does the aggregate conceal some units performing well and others performing badly?

If performance is uniformly poor across all subgroups, the cause is in the shared system — the conditions, resources, or design that all subgroups have in common. The solution is a system redesign that reaches every subgroup simultaneously.

If performance varies significantly between subgroups, the cause is in what differentiates them. Some subgroups have found a way to perform better within the same overall system. Those subgroups are Bright Spots — and the question becomes: what are they doing differently, and can it be standardised?

🔎 How to stratify

Step 1: Choose the stratification dimensions. Start with the dimensions most likely to explain variation: geography (site, region, trust), patient group (age, diagnosis, complexity), time (season, shift, day of week), and process pathway (route, channel, team). Do not stratify by everything at once; choose the two or three dimensions most plausible given what you know about the system.
Step 2: Run Bootstrap CUSUM on each subgroup separately. A change point that is invisible in the aggregate may be clearly visible in a subgroup. Equally, a change point that appears in the aggregate may be driven by a single outlier subgroup, detectable only by looking at each subgroup independently.
Step 3: Identify the Bright Spots. Which subgroups are performing significantly better than the aggregate? Are they sustaining that performance over time (a genuine structural difference, detectable as a consistently lower level in their Bootstrap CUSUM output) or is it a temporary fluctuation? A Bright Spot that is genuinely structurally different is worth studying. See Bright Spots for the investigation framework.
Step 4: Ask what the Bright Spot is doing differently, and whether it can be standardised. Deming’s question applies: “by what method?” A Bright Spot that cannot describe its own mechanism cannot be replicated. A Bright Spot that can describe a specific, transferable practice is the seed of the next experiment.
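
The level comparison in Step 3 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the StepChange Analyzer's method: the function name `bright_spots`, the `(subgroup, period, value)` row layout, the 20% gap, and the "lower is better" convention are all hypothetical. It checks only whether a subgroup sits well below the aggregate level in both halves of its series; a change point test on each subgroup would still be needed to confirm a structural difference.

```python
from collections import defaultdict
from statistics import mean


def bright_spots(rows, min_gap=0.20):
    """Return subgroups that sit at least `min_gap` below the aggregate level.

    rows: iterable of (subgroup, period, value); a lower value is better
    (an assumption of this sketch). A subgroup qualifies only if the mean of
    both the first and the second half of its series is below the threshold,
    which is a crude stand-in for "sustained".
    """
    by_group = defaultdict(list)
    for group, _period, value in sorted(rows, key=lambda r: r[1]):
        by_group[group].append(value)

    overall = mean(v for values in by_group.values() for v in values)
    threshold = overall * (1 - min_gap)

    spots = []
    for group, values in by_group.items():
        half = len(values) // 2
        if half and mean(values[:half]) <= threshold and mean(values[half:]) <= threshold:
            spots.append(group)
    return sorted(spots)


# Hypothetical data: site C improves late, so it is not yet a sustained Bright Spot.
rows = [
    ("A", 1, 10), ("A", 2, 11), ("A", 3, 10), ("A", 4, 11),
    ("B", 1, 5), ("B", 2, 6), ("B", 3, 5), ("B", 4, 6),
    ("C", 1, 10), ("C", 2, 10), ("C", 3, 4), ("C", 4, 4),
]
print(bright_spots(rows))
```

Requiring both halves to clear the threshold is a deliberate simplification: it filters out a subgroup whose good average comes from one short run of low values.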
⚠️ The stratification trap — explaining away variation

Stratification can be used honestly or defensively. Honestly: to find subgroups that are performing better and understand why. Defensively: to explain away poor performance by finding a reason why “our subgroup is different.” The test is simple — does the stratification lead to a transferable insight that could improve the aggregate, or does it lead to a conclusion that nothing can be done? If the latter, the stratification is not analytical discipline. It is rationalisation.


Strategy 2 — Experiment

Experimentation is the act of testing a change on a small scale, with a pre-specified prediction, before implementing it at scale. It is the direct application of Deming’s PDSA cycle — and it is the strategy most consistently absent from public sector improvement programmes, which tend to implement at scale first and evaluate (or not) afterwards.

The pre-specification is not a bureaucratic formality. It is the mechanism that makes the experiment honest. Without a pre-specified prediction, any subsequent data movement — including random common cause fluctuation — can be interpreted as evidence the intervention worked. This is the most common form of false attribution in improvement work, and it is almost invisible because it feels like rigour.

📝 What a pre-specified experiment looks like

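Before implementing a change, state in writing: the metric, the predicted direction of change, the period in which the change point should appear, the confidence threshold, and the balancing measures that must not deteriorate.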

If the change point appears at the predicted time, in the predicted direction, in the predicted metric, at the predicted confidence level — and the balancing measures have not deteriorated — the experiment has produced genuine evidence. That is the Study step of PDSA. Act by standardising and scaling. Then Plan the next PDSA cycle.

If the change point does not appear: the intervention did not work at this level for this constraint. Return to Stratify or Disaggregate to refine the understanding before the next experiment. Do not scale the intervention.
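
One way to keep the prediction honest is to record it as data before the Do phase and compare the result against it mechanically. The sketch below assumes hypothetical names throughout (`Prediction`, `StudyResult`, `confirms`, and every field); it is an illustration of the rule described above, not a format the StepChange Analyzer is known to use.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    """Written down before implementation. All field names are hypothetical."""
    metric: str
    direction: str             # "down" or "up"
    intervention_period: int   # index of the period in which the change starts
    within_periods: int        # change point must appear within this many periods
    confidence: float          # threshold committed to in advance, e.g. 0.95


@dataclass(frozen=True)
class StudyResult:
    """What the Study step reports after the data is in."""
    metric: str
    change_period: Optional[int]   # None when no change point was detected
    direction: Optional[str]
    confidence: float
    balancing_measures_held: bool


def confirms(prediction, result):
    """True only if every pre-specified element of the prediction is met."""
    if result.change_period is None:
        return False
    latest = prediction.intervention_period + prediction.within_periods
    on_time = prediction.intervention_period <= result.change_period <= latest
    return (
        result.metric == prediction.metric
        and result.direction == prediction.direction
        and on_time
        and result.confidence >= prediction.confidence
        and result.balancing_measures_held
    )


prediction = Prediction("wait_minutes", "down", intervention_period=12,
                        within_periods=6, confidence=0.95)
result = StudyResult("wait_minutes", change_period=14, direction="down",
                     confidence=0.97, balancing_measures_held=True)
print(confirms(prediction, result))
```

Freezing the dataclass is a small design choice that mirrors the point of pre-specification: the prediction cannot be edited after the data arrives.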

Bootstrap CUSUM is the Study step of PDSA

The PDSA cycle is only as strong as its Study step. Without a rigorous method for detecting whether a change produced a genuine structural shift — rather than a temporary fluctuation or a seasonal effect — the Study step defaults to judgement: “it feels like it worked” or “the numbers are better this month.” Bootstrap CUSUM replaces that judgement with a pre-committed statistical test. It answers, at a specified confidence level, whether the data shows a structural change point — and it dates that change point so you can verify it coincides with the intervention. That is the Study step done properly.
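
For readers who want to see the mechanics, here is a minimal sketch of a bootstrap CUSUM test as it is commonly described: take the cumulative sum of deviations from the mean, measure its range, and compare that range with the ranges obtained from many random reorderings of the same data. The function names, the 1,000 resamples, and the fixed seed are assumptions of this sketch; the StepChange Analyzer's own implementation is not described in this article and may differ.

```python
import random
from itertools import accumulate


def cusum(values):
    """Cumulative sum of deviations from the series mean, starting at 0."""
    mean = sum(values) / len(values)
    return [0.0] + list(accumulate(v - mean for v in values))


def cusum_range(values):
    s = cusum(values)
    return max(s) - min(s)


def bootstrap_cusum(values, n_boot=1000, seed=0):
    """Return (confidence, change_index) for a single candidate change point.

    confidence: share of shuffled series whose CUSUM range is smaller than
    the observed range. change_index: 0-based index of the first observation
    at the new level, estimated where the CUSUM is furthest from zero.
    """
    rng = random.Random(seed)
    observed = cusum_range(values)
    shuffled = list(values)
    smaller = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)  # reordering destroys any real shift in level
        if cusum_range(shuffled) < observed:
            smaller += 1

    s = cusum(values)
    change_index = max(range(len(s)), key=lambda i: abs(s[i]))
    return smaller / n_boot, change_index


# Hypothetical series: level near 10 for 12 periods, then near 6.
series = [10, 11, 9] * 4 + [6, 7, 5] * 4
confidence, change_index = bootstrap_cusum(series)
print(confidence, change_index)
```

The confidence value is then compared with the threshold committed to in the Plan step, and the change index is converted to a date and compared with the intervention date.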

PDSA phase | What it requires | Bootstrap CUSUM role
Plan | A specific change, a specific prediction, a specific metric, a specific confidence threshold, a specific timeframe. Written down before implementation. | Defines the test: which metric, which threshold, which timeframe. Without this, the Study step has no benchmark.
Do | Implement the change at small scale. Collect data consistently. Do not make additional changes during the Do phase; they muddy the baseline. | Data is collected in the format required for Bootstrap CUSUM input: date and metric value, one row per period.
Study | Test the prediction against the data. Did a change point appear? When? In which metric? At what confidence level? Did balancing measures hold? | Bootstrap CUSUM is the Study step. Run the algorithm at the pre-specified confidence level. Compare the change point date to the intervention date. Check the direction. Check the balancing measures.
Act | If the change point confirmed the prediction: standardise and scale. If not: return to Plan with the new understanding. Do not scale an intervention that did not produce a confirmed change point. | The change point date and magnitude inform the standardisation: you know exactly when the shift occurred and how large it was. Future monitoring runs Bootstrap CUSUM to detect any subsequent reversal.
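
The input format named in the Do row (date and metric value, one row per period) can be checked before any analysis is run. The sketch below assumes a CSV with columns called `date` and `value` and ISO dates; those column names and the `load_series` helper are hypothetical, not a documented upload format.

```python
import csv
import io
from datetime import date


def load_series(text, date_col="date", value_col="value"):
    """Parse CSV text into a date-ordered list of (date, value) pairs.

    Raises ValueError if two rows share a date, since the expected input
    has exactly one row per period.
    """
    reader = csv.DictReader(io.StringIO(text))
    rows = [(date.fromisoformat(r[date_col]), float(r[value_col])) for r in reader]
    rows.sort(key=lambda row: row[0])
    dates = [d for d, _ in rows]
    if len(set(dates)) != len(dates):
        raise ValueError("more than one row for the same period")
    return rows


sample = "date,value\n2024-02-01,7\n2024-01-01,9\n"
print(load_series(sample))
```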

Strategy 3 — Disaggregate

Disaggregation means dividing the process into its component mechanisms and managing those mechanisms separately. An aggregate metric — total events, overall rate, national average — is a sum of several distinct pathways, each with its own root cause and its own solution. Managing the total without understanding the components is managing the wrong thing.

Joiner’s insight is that the dominant mechanism — the one pathway responsible for the majority of the total — is rarely obvious from the aggregate. It requires dividing the data by process type, route, mechanism, or cause until the dominant contributor is visible. Once visible, it can be addressed specifically. Fix the dominant mechanism and the aggregate will move. Fix a minor mechanism and the aggregate will barely shift, regardless of how successful the local intervention was.

⛭️ How to disaggregate

Step 1: Map the pathways that contribute to the aggregate metric. For a patient safety metric: list every distinct mechanism by which the adverse event can occur. For a waiting time metric: list every pathway through the system that contributes to total wait. For an error rate: list every step in the process where the error can originate. A fishbone diagram or process map is the right tool here, not statistical analysis. You need to understand the structure before you can measure the components.
Step 2: Measure each pathway separately. What proportion of the aggregate total does each pathway contribute? Run Bootstrap CUSUM on each pathway independently. A pathway that has already improved but is hidden in a stable aggregate is now visible. A pathway that is driving the aggregate upwards is now identifiable.
Step 3: Identify the dominant mechanism. Which pathway contributes the largest proportion of the total? This is the binding constraint within the process: the one that, if addressed, will move the aggregate most. In Goldratt’s terms: this is where the constraint sits within the process itself, once the system boundary question has been answered.
Step 4: Design the intervention for the dominant mechanism, not the aggregate. An intervention designed for the aggregate metric is usually too broad to address any specific mechanism effectively. An intervention designed for the dominant pathway can be precise, testable, and measurable. Run the PDSA experiment on that specific pathway, with Bootstrap CUSUM on the pathway metric (not the aggregate) as the Study step.
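
Steps 2 and 3 amount to a share calculation once the pathways have been mapped and counted. A minimal sketch follows, with hypothetical pathway names and counts and hypothetical helper names (`pathway_shares`, `dominant_pathway`); it assumes the aggregate is a simple sum of pathway counts.

```python
def pathway_shares(counts):
    """Share of the aggregate contributed by each pathway, largest first."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no events recorded")
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return {name: n / total for name, n in ordered}


def dominant_pathway(counts):
    """Name of the pathway contributing the most events."""
    return max(counts, key=counts.get)


# Hypothetical counts for three mapped pathways.
counts = {"pathway_a": 12, "pathway_b": 5, "pathway_c": 3}
print(pathway_shares(counts))
print(dominant_pathway(counts))
```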
⚠️ The dominant mechanism trap — fixing the wrong pathway

The most common disaggregation failure is addressing a visible or politically salient pathway rather than the dominant one. In NHS wrong-route medication errors, the NRFit connector mandate addressed neuraxial-to-IV misconnection — a real problem with a clear engineering solution, and one that generated significant advocacy. But 16 of 20 wrong-route events in 2023–24 were oral-to-IV: a different mechanism, a different root cause, a different solution. An intervention that successfully eliminates neuraxial-to-IV errors entirely would — if oral-to-IV is unchanged — reduce the aggregate by at most 20%. That is the dominant mechanism trap. Fix what is visible and politically tractable rather than what is driving the number.
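
The 20% ceiling follows directly from the figures quoted above. The short calculation below uses only those two numbers (20 events, 16 of them oral-to-IV); the variable names are illustrative, and the split of the remaining four events between other pathways is not given in the source.

```python
total_events = 20   # wrong-route events in 2023-24, as quoted above
oral_to_iv = 16     # events on the dominant pathway, as quoted above

# Everything that is not oral-to-IV, whatever its mechanism.
all_other_pathways = total_events - oral_to_iv

# Eliminating neuraxial-to-IV errors entirely cannot remove more than this share.
max_reduction = all_other_pathways / total_events
print(f"{max_reduction:.0%}")
```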


The sequence — how the three strategies work together

The three strategies are not independent options — they are a sequence. Each one sets up the next, and skipping one reduces the precision of those that follow.

Stage | Strategy | Question answered | Output
1 | Stratify | Is the variation between subgroups or within the system? Who is doing it better? | Bright Spots to study. Hypothesis about what differentiates better-performing subgroups. Candidate intervention derived from observed practice, not theory.
2 | Experiment | Does the candidate intervention actually produce a structural change point when tested at small scale? | A Bootstrap CUSUM-confirmed change point, or an honest null result that sends you back to the hypothesis. Either way, knowledge rather than assumption.
3 | Disaggregate | Which specific pathway or mechanism within the process is responsible for the dominant proportion of the aggregate result? | A precisely targeted intervention designed for the dominant mechanism, not the aggregate. A PDSA experiment that can detect success or failure in the pathway metric specifically.
Repeat | Iterate | Once a change point is confirmed in the dominant pathway, what is the next constraint? | Return to Stratify at the new baseline. Goldratt’s Step 5 applies: do not let inertia become the next constraint. The constraint will move.

The Deming connection — System of Profound Knowledge

Joiner’s three strategies are a direct application of Deming’s System of Profound Knowledge — specifically its four components, each of which maps precisely onto the Stratify/Experiment/Disaggregate framework.

Deming’s component | What it means | Where it appears in Joiner’s three strategies
Appreciation for a system | Understanding that outcomes are produced by the system, not by individuals. The system has structure, interdependencies, and boundaries. Improving it requires understanding those boundaries. | Disaggregate. Mapping the pathways within the process is the act of understanding the system’s internal structure. Identifying the dominant mechanism is identifying where the system most needs to change.
Knowledge of variation | Distinguishing common cause variation (the system) from special cause variation (a specific event). Not reacting to common cause variation as if it were a special cause. Using statistical methods to tell the difference. | Experiment. Bootstrap CUSUM is the application of knowledge of variation to the Study step of PDSA: detecting whether a genuine structural shift has occurred or whether the observed change is within the system’s normal common cause range.
Theory of knowledge | Knowledge requires a theory: a prediction that can be tested. Data alone is not knowledge. An observation that was not predicted in advance is not confirmed evidence. Improvement requires prediction, test, and update. | Experiment. The pre-specified prediction (direction, metric, timing, confidence threshold) is Deming’s theory of knowledge applied to improvement. Without the pre-specification, the Study step has no theory to test.
Psychology | Understanding how people respond to measurement, management, and change. Fear of data produces gaming. Blame produces concealment. Intrinsic motivation produces genuine improvement. | Stratify. Finding Bright Spots (units that are performing better within the same system) is an act of positive psychology. It reframes the question from “who is failing?” to “who has found a better way?” That is a fundamentally different relationship between measurement and the people being measured.

The PDSA connection — where Bootstrap CUSUM fits

The PDSA cycle and Joiner’s three strategies are not parallel frameworks — PDSA is the operational container inside which Stratify, Experiment, and Disaggregate are run. Each strategy generates a PDSA cycle, and the cycles run in sequence.

🔁 Three PDSA cycles — one for each strategy

PDSA 1. Stratify PDSA: find the Bright Spot.

Plan: Hypothesise which subgroup dimension is most likely to explain variation. Predict that at least one subgroup will show significantly better performance than the aggregate. Define “significantly better” in Bootstrap CUSUM terms: a sustained level at least X% below the aggregate, visible as a structurally different baseline.

Do: Collect and disaggregate the data by the chosen dimension.

Study: Run Bootstrap CUSUM on each subgroup. Is any subgroup genuinely structurally better? Is the difference sustained or a fluctuation?

Act: If a Bright Spot is confirmed: investigate what it is doing differently. Produce a transferable hypothesis — a candidate intervention. If no Bright Spot: the variation is in the shared system, not between subgroups. Proceed to disaggregation.

PDSA 2. Experiment PDSA: test the candidate intervention.

Plan: Define the intervention precisely. State the pre-specified prediction: which metric, which direction, within how many periods, at what confidence threshold. Define the balancing measures. Apply to a small number of sites or patients — enough to generate detectable data, small enough to limit harm if the intervention does not work.

Do: Implement. Collect data. Make no other changes during the Do phase.

Study: Run Bootstrap CUSUM at the pre-specified confidence threshold. Did the change point appear? When? Does the date match the intervention? Are balancing measures stable?

Act: If confirmed: standardise the conditions that produced the change point. Prepare to scale. If not confirmed: return to Plan. What was wrong with the hypothesis? Was the intervention at the right level? Was the constraint correctly identified?

PDSA 3. Disaggregate PDSA: scale to the dominant mechanism.

Plan: Having confirmed the intervention works in the experiment, design the scaled application targeted at the dominant mechanism. Define the aggregate change point you expect as a result — the timing, direction, magnitude, and confidence threshold.

Do: Implement at scale across the dominant pathway.

Study: Run Bootstrap CUSUM on both the pathway metric and the aggregate. The pathway change point should appear first. The aggregate shift is diluted by the other pathways, so the smaller the pathway’s share of the aggregate, the longer the aggregate change point may take to reach the confidence threshold.

Act: If both change points confirmed: standardise, monitor, and return to Stratify at the new baseline. Identify the next dominant mechanism. If aggregate does not follow pathway: the pathway share of the aggregate was smaller than estimated, or a compensating adverse change occurred in another pathway. Disaggregate further.
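
When the aggregate does not follow the pathway, a per-pathway breakdown of the change shows which of the two explanations applies. The sketch below assumes the aggregate is the sum of the pathway levels and uses hypothetical names (`decompose_change`, `offsetting_pathways`) and hypothetical before and after values.

```python
def decompose_change(before, after):
    """Return (per-pathway change, aggregate change).

    before/after map pathway name -> mean level per period. The aggregate is
    assumed to be the sum of the pathways, so its change is the sum of theirs.
    """
    pathways = sorted(set(before) | set(after))
    deltas = {p: after.get(p, 0.0) - before.get(p, 0.0) for p in pathways}
    return deltas, sum(deltas.values())


def offsetting_pathways(deltas, target):
    """Pathways other than the target that moved upwards (worse)."""
    return [p for p, d in deltas.items() if p != target and d > 0]


# Hypothetical levels: the targeted pathway halves, another pathway worsens.
before = {"pathway_a": 12.0, "pathway_b": 5.0}
after = {"pathway_a": 6.0, "pathway_b": 9.0}
deltas, aggregate_change = decompose_change(before, after)
print(deltas, aggregate_change)
print(offsetting_pathways(deltas, target="pathway_a"))
```

In this example the targeted pathway fell by 6 but the aggregate fell by only 2, because another pathway rose by 4: the compensating adverse change described in the Act step.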

Applied examples

📋 How the three strategies have been applied on this site

Wrong-route medication errors

Stratify: Trust-level analysis would reveal whether wrong-route events are distributed uniformly across NHS trusts or concentrated in a minority. Trusts with zero sustained events over multiple years are the Bright Spots: they have achieved something within the same national system that others have not.

Experiment: Any intervention — pharmacy unit-dose dispensing, separated storage, ENFit rollout — should be tested with a pre-specified Bootstrap CUSUM prediction on the pathway metric before national rollout. Previous mandates (NatSSIPs, MSO, colour coding) were implemented at scale without pre-specified predictions and are not detectable as change points in the aggregate data.

Disaggregate: 16 of 20 wrong-route events in 2023–24 were oral-to-IV. The NRFit mandate addresses neuraxial-to-IV — a real problem, but not the dominant mechanism. Managing the aggregate number without addressing the oral-to-IV pathway specifically is addressing the wrong constraint. See Never Events — Wrong Route.

Anticoagulation safety

Stratify: Regional variation in DOAC prescribing appropriateness ranges from 53% to 99% (Bassi 2020). The aggregate national figure conceals regions where almost all prescribing is appropriate alongside regions where half of it is not. The regions performing well are the Bright Spots: they have protocols, pharmacy review processes, or clinical leadership structures that others do not.

Experiment: The COBRRA trial (NEJM, March 2026) is the correct form of this strategy — a pre-specified, controlled experiment testing apixaban versus rivaroxaban with a defined outcome measure. The Bootstrap CUSUM change point in DOAC-specific adverse drug reaction rates at 2016 corresponds to the ROCKET-AF controversy and the subsequent rivaroxaban-to-apixaban switch — a natural experiment with a dateable cause.

Disaggregate: The aggregate adverse event rate combines events from inappropriate prescribing (wrong drug, wrong dose, wrong patient), monitoring failures (renal function not checked; 50% of DOACs not adjusted for renal function by 2019), and drug interactions. Each has a different cause and a different solution. See Anticoagulation Safety.

NHS A&E performance

Stratify: The national four-hour performance figure is the average of 136 NHS trusts, some of which consistently outperform the aggregate. Those trusts are the Bright Spots. The question is whether their performance is explained by case-mix (lower-acuity populations) or by genuinely transferable practice. Bootstrap CUSUM on individual trust data would separate structural outperformers from lucky averages.

Experiment: No A&E intervention in 15 years was implemented with a pre-specified Bootstrap CUSUM prediction. The consequence: not one improvement change point is detectable in 184 monthly observations. The absence of pre-specified predictions makes it impossible to know whether any intervention worked — or would have worked at a different scale or in a different context.

Disaggregate: “A&E performance” is an aggregate of at least four distinct bottlenecks: inflow (GP access failure, self-referral), triage and treatment capacity, acute bed availability, and discharge delay (DTOC/NCR). The dominant constraint — blocked discharge — sits outside the trust’s system boundary. Improving any of the in-trust pathways without addressing discharge produces a pathway-level change point that does not appear in the aggregate. See Why Nothing Has Worked.
📋 The three questions — applied to your system

Before designing the next intervention, answer these in order:

  1. Stratify: Is anyone achieving significantly better results within this same system? Under what conditions? What are they doing differently? Run Bootstrap CUSUM on subgroup data before designing the intervention.
  2. Experiment: What specific, small-scale, pre-specified test would confirm whether the candidate intervention actually produces a structural change point? Write the prediction — direction, metric, timing, confidence threshold, balancing measures — before implementation.
  3. Disaggregate: Which specific pathway or mechanism within this process is responsible for the dominant proportion of the aggregate result? Is the planned intervention aimed at that pathway — or at a more visible but smaller contributor?

If any of the three questions cannot be answered before implementation, the intervention is a guess. Stratify, Experiment, and Disaggregate replace guessing with knowing.

Run the Experiment step

Upload your subgroup or pathway data to the StepChange Analyzer. Run Bootstrap CUSUM at your pre-specified confidence threshold. The Study step of PDSA — done properly.
