Stratify, Experiment, Disaggregate
When Bootstrap CUSUM shows a system is stable but performing at the wrong level, the instinct is to intervene harder. Joiner’s answer — drawn from page 141 of Fourth Generation Management — is to think harder first. Three strategies, applied in sequence, reveal what the aggregate data conceals: where the variation comes from, which interventions actually move it, and which part of the process is driving the result you are trying to change.
The problem these strategies solve
A system is stable. Bootstrap CUSUM finds no change point. The process is producing consistent results — but at a level that is not good enough. Previous interventions have not moved it. The temptation is to try a bigger version of the same intervention, or to try several interventions simultaneously, or to commission a review.
Joiner’s diagnosis is precise: the problem is not insufficient effort. It is insufficient understanding. The aggregate metric obscures three things that are essential for improvement — who is doing it differently, whether any tested change actually works, and which specific mechanism within the process is responsible for most of the problem.
Until all three questions are answered, any intervention is a guess. It may be a well-intentioned, evidence-informed guess — but it is still a guess. Stratify, Experiment, and Disaggregate replace guessing with knowing.
An aggregate metric (total wrong-route events, average A&E waiting time, national adverse drug reaction rate) is a summary. Summaries are useful for tracking direction. They are useless for identifying causes. An aggregate that shows only common cause variation is reporting the average of a set of processes that may be very different from each other. Some of those processes may be excellent. Some may be failing. The average obscures both. Stratification is the act of looking inside the average.
Strategy 1 — Stratify
Stratification means breaking the aggregate data into meaningful subgroups and asking whether the variation between those subgroups explains the overall result. The subgroups might be sites, teams, patient groups, time periods, geographies, product types, or any other dimension that might plausibly drive different performance.
The stratification question
If the aggregate metric is stable at an unacceptable level, ask: is everyone performing at this level, or does the aggregate conceal some units performing well and others performing badly?
If performance is uniformly poor across all subgroups, the cause is in the shared system — the conditions, resources, or design that all subgroups have in common. The solution is a system redesign that reaches every subgroup simultaneously.
If performance varies significantly between subgroups, the cause is in what differentiates them. Some subgroups have found a way to perform better within the same overall system. Those subgroups are Bright Spots — and the question becomes: what are they doing differently, and can it be standardised?
🔎 How to stratify
Stratification can be used honestly or defensively. Honestly: to find subgroups that are performing better and understand why. Defensively: to explain away poor performance by finding a reason why “our subgroup is different.” The test is simple — does the stratification lead to a transferable insight that could improve the aggregate, or does it lead to a conclusion that nothing can be done? If the latter, the stratification is not analytical discipline. It is rationalisation.
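As a minimal sketch of the stratification question, the Python below splits an aggregate into subgroups and lists those performing well below the overall rate. The site names, counts, the `margin` parameter, and all function names are hypothetical illustrations, and the sketch assumes a metric where lower is better. A flagged subgroup is only a candidate Bright Spot: whether the difference is sustained still has to be tested on its time series.

```python
from collections import defaultdict


def stratify(records):
    """Return the event rate per subgroup from (subgroup, events, opportunities) rows."""
    totals = defaultdict(lambda: [0, 0])
    for subgroup, events, opportunities in records:
        totals[subgroup][0] += events
        totals[subgroup][1] += opportunities
    return {name: ev / opp for name, (ev, opp) in totals.items() if opp > 0}


def aggregate_rate(records):
    """Return the overall event rate across every subgroup."""
    events = sum(row[1] for row in records)
    opportunities = sum(row[2] for row in records)
    return events / opportunities


def candidate_bright_spots(records, margin=0.25):
    """List subgroups whose rate is at least `margin` (relative) below the aggregate."""
    overall = aggregate_rate(records)
    rates = stratify(records)
    return sorted(name for name, rate in rates.items() if rate <= overall * (1 - margin))


# Hypothetical data: (site, events, opportunities). Lower rates are better here.
records = [
    ("Site A", 12, 1000),
    ("Site B", 11, 1000),
    ("Site C", 3, 1000),
    ("Site D", 14, 1000),
]

print(stratify(records))
print(candidate_bright_spots(records))  # ['Site C']
```

If the list comes back empty, performance is roughly uniform and the cause is more likely to sit in the shared system.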
Strategy 2 — Experiment
Experimentation is the act of testing a change on a small scale, with a pre-specified prediction, before implementing it at scale. It is the direct application of Deming’s PDSA cycle — and it is the strategy most consistently absent from public sector improvement programmes, which tend to implement at scale first and evaluate (or not) afterwards.
The pre-specification is not a bureaucratic formality. It is the mechanism that makes the experiment honest. Without a pre-specified prediction, any subsequent data movement — including random common cause fluctuation — can be interpreted as evidence the intervention worked. This is the most common form of false attribution in improvement work, and it is almost invisible because it feels like rigour.
📝 What a pre-specified experiment looks like
Before implementing a change, state in writing:
- What will change: the specific intervention, applied to which subgroup, starting when.
- What metric will move: the primary outcome measure — not a process measure, not a proxy, but the metric that matters.
- Which metric will move first: the leading indicator, if one exists — and the expected lag between the leading indicator change point and the outcome change point.
- The direction: up or down.
- The timing: within how many periods of the intervention start do you expect a Bootstrap CUSUM change point to be detectable?
- The confidence threshold: 90%, 95%, or 99% — and why.
- The balancing measures: what could get worse if this intervention works, which you will monitor simultaneously.
If the change point appears at the predicted time, in the predicted direction, in the predicted metric, at the predicted confidence level — and the balancing measures have not deteriorated — the experiment has produced genuine evidence. That is the Study step of PDSA. Act by standardising and scaling. Then Plan the next PDSA cycle.
If the change point does not appear: the intervention did not work at this level for this constraint. Return to Stratify or Disaggregate to refine the understanding before the next experiment. Do not scale the intervention.
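One way to make the written prediction concrete is to record it as an immutable object before the Do phase and compare the result against it afterwards. This is a sketch only: the class names, field names, and the `study` function are hypothetical, and periods are represented as simple integer indexes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: the prediction cannot be edited after the fact
class Prediction:
    intervention: str
    metric: str
    direction: str              # "up" or "down"
    intervention_period: int    # period index at which the change starts
    max_lag_periods: int        # change point expected within this many periods
    confidence_threshold: float # e.g. 0.90, 0.95 or 0.99


@dataclass(frozen=True)
class Observed:
    metric: str
    change_period: Optional[int]  # None when no change point was detected
    direction: Optional[str]
    confidence: float
    balancing_measures_stable: bool


def study(pred: Prediction, obs: Observed) -> bool:
    """Return True only if every element of the pre-specified prediction held."""
    if obs.change_period is None:
        return False
    lag = obs.change_period - pred.intervention_period
    return (
        obs.metric == pred.metric
        and obs.direction == pred.direction
        and 0 <= lag <= pred.max_lag_periods
        and obs.confidence >= pred.confidence_threshold
        and obs.balancing_measures_stable
    )


# Hypothetical example values.
pred = Prediction("separated storage", "wrong-route events", "down", 10, 6, 0.95)
obs = Observed("wrong-route events", 13, "down", 0.97, True)
print(study(pred, obs))  # True
```

A `False` result is the honest null: it sends the work back to Stratify or Disaggregate rather than to scale-up.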
The PDSA cycle is only as strong as its Study step. Without a rigorous method for detecting whether a change produced a genuine structural shift — rather than a temporary fluctuation or a seasonal effect — the Study step defaults to judgement: “it feels like it worked” or “the numbers are better this month.” Bootstrap CUSUM replaces that judgement with a pre-committed statistical test. It answers, at a specified confidence level, whether the data shows a structural change point — and it dates that change point so you can verify it coincides with the intervention. That is the Study step done properly.
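The text does not spell out the algorithm, so the following is a sketch under an assumption: it uses the commonly described bootstrap change-point approach, in which the range of the cumulative sum of deviations from the mean is compared against the same statistic for reshuffled copies of the series. The function names and parameters are hypothetical, and this is not presented as the StepChange Analyzer's implementation.

```python
import random


def cusum(values):
    """Cumulative sum of deviations from the series mean, starting at 0."""
    mean = sum(values) / len(values)
    total, path = 0.0, [0.0]
    for v in values:
        total += v - mean
        path.append(total)
    return path


def bootstrap_cusum(values, n_boot=1000, seed=0):
    """Return (confidence, change_index) for a single candidate change point.

    confidence: share of reshuffled series whose CUSUM range is smaller than
    the observed range. change_index: 0-based index of the first period at
    the new level (only meaningful when confidence is high).
    """
    rng = random.Random(seed)
    path = cusum(values)
    observed_range = max(path) - min(path)

    shuffled = list(values)
    smaller = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)  # destroys any time ordering in the data
        p = cusum(shuffled)
        if max(p) - min(p) < observed_range:
            smaller += 1

    # The CUSUM is furthest from zero at the last period before the shift.
    change_index = max(range(1, len(path)), key=lambda i: abs(path[i]))
    return smaller / n_boot, change_index


# Hypothetical series: twelve periods at 10, then twelve periods at 6.
series = [10.0] * 12 + [6.0] * 12
confidence, change_index = bootstrap_cusum(series)
print(confidence, change_index)
```

Comparing `confidence` with the pre-specified threshold, and `change_index` with the intervention date, is the comparison the Study step calls for.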
| PDSA phase | What it requires | Bootstrap CUSUM role |
|---|---|---|
| Plan | A specific change, a specific prediction, a specific metric, a specific confidence threshold, a specific timeframe. Written down before implementation. | Defines the test: which metric, which threshold, which timeframe. Without this, the Study step has no benchmark. |
| Do | Implement the change at small scale. Collect data consistently. Do not make additional changes during the Do phase — they muddy the baseline. | Data is collected in the format required for Bootstrap CUSUM input: date and metric value, one row per period. |
| Study | Test the prediction against the data. Did a change point appear? When? In which metric? At what confidence level? Did balancing measures hold? | Bootstrap CUSUM is the Study step. Run the algorithm at the pre-specified confidence level. Compare the change point date to the intervention date. Check the direction. Check the balancing measures. |
| Act | If the change point confirmed the prediction: standardise and scale. If not: return to Plan with the new understanding. Do not scale an intervention that did not produce a confirmed change point. | The change point date and magnitude inform the standardisation — you know exactly when the shift occurred and how large it was. Future monitoring runs Bootstrap CUSUM to detect any subsequent reversal. |
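The Do row asks for data as a date and a metric value, one row per period. A small loader can enforce that shape before any analysis is run. This is a sketch: the column names `date` and `value`, the ISO date format, and the function name are assumptions, not a documented input specification.

```python
import csv
import io
from datetime import date


def load_series(text):
    """Parse CSV text with 'date' and 'value' columns into sorted (date, float) rows."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append((date.fromisoformat(row["date"]), float(row["value"])))
    rows.sort()

    # One row per period: reject repeated dates rather than silently averaging.
    dates = [d for d, _ in rows]
    if len(set(dates)) != len(dates):
        raise ValueError("more than one row for the same period")
    return rows


# Hypothetical monthly data, deliberately out of order.
sample = "date,value\n2024-02-01,7\n2024-01-01,5\n2024-03-01,6\n"
for period, value in load_series(sample):
    print(period, value)
```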
Strategy 3 — Disaggregate
Disaggregation means dividing the process into its component mechanisms and managing those mechanisms separately. An aggregate metric — total events, overall rate, national average — is a sum of several distinct pathways, each with its own root cause and its own solution. Managing the total without understanding the components is managing the wrong thing.
Joiner’s insight is that the dominant mechanism — the one pathway responsible for the majority of the total — is rarely obvious from the aggregate. It requires dividing the data by process type, route, mechanism, or cause until the dominant contributor is visible. Once visible, it can be addressed specifically. Fix the dominant mechanism and the aggregate will move. Fix a minor mechanism and the aggregate will barely shift, regardless of how successful the local intervention was.
⛭️ How to disaggregate
The most common disaggregation failure is addressing a visible or politically salient pathway rather than the dominant one. In NHS wrong-route medication errors, the NRFit connector mandate addressed neuraxial-to-IV misconnection: a real problem with a clear engineering solution, and one that generated significant advocacy. But 16 of 20 wrong-route events in 2023–24 were oral-to-IV: a different mechanism, a different root cause, a different solution. Even if an intervention eliminated neuraxial-to-IV errors entirely, with oral-to-IV unchanged the aggregate could fall by at most 4 of 20 events, or 20%. That is the dominant mechanism trap: fixing what is visible and politically tractable rather than what is driving the number.
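The arithmetic behind the dominant mechanism trap can be sketched in a few lines of Python. The source gives only the oral-to-IV count (16 of 20), so the remaining four events are lumped into a single hypothetical "all other wrong-route events" bucket; the function name is also illustrative.

```python
def max_aggregate_reduction(counts, pathway):
    """Largest possible fall in the aggregate if `pathway` were eliminated entirely."""
    total = sum(counts.values())
    return counts[pathway] / total


# 16 of 20 events were oral-to-IV; the split of the other 4 is not given in the text.
counts = {"oral-to-IV": 16, "all other wrong-route events": 4}

dominant = max(counts, key=counts.get)
print(dominant)  # oral-to-IV
print(max_aggregate_reduction(counts, "oral-to-IV"))                    # 0.8
print(max_aggregate_reduction(counts, "all other wrong-route events"))  # 0.2
```

Any single pathway inside the "other" bucket, neuraxial-to-IV included, has a ceiling of 20% or less.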
The sequence — how the three strategies work together
The three strategies are not independent options — they are a sequence. Each one sets up the next, and skipping one reduces the precision of those that follow.
| Stage | Strategy | Question answered | Output |
|---|---|---|---|
| 1 | Stratify | Is the variation between subgroups or within the system? Who is doing it better? | Bright Spots to study. Hypothesis about what differentiates better-performing subgroups. Candidate intervention derived from observed practice, not theory. |
| 2 | Experiment | Does the candidate intervention actually produce a structural change point when tested at small scale? | A Bootstrap CUSUM-confirmed change point — or an honest null result that sends you back to the hypothesis. Either way, knowledge rather than assumption. |
| 3 | Disaggregate | Which specific pathway or mechanism within the process is responsible for the dominant proportion of the aggregate result? | A precisely targeted intervention designed for the dominant mechanism — not the aggregate. A PDSA experiment that can detect success or failure in the pathway metric specifically. |
| Repeat | Iterate | Once a change point is confirmed in the dominant pathway, what is the next constraint? | Return to Stratify at the new baseline. Goldratt’s Step 5 applies: do not let inertia become the next constraint. The constraint will move. |
The Deming connection — System of Profound Knowledge
Joiner’s three strategies are a direct application of Deming’s System of Profound Knowledge. Each of its four components maps onto one of the three strategies in the Stratify/Experiment/Disaggregate framework.
| Deming’s component | What it means | Where it appears in Joiner’s three strategies |
|---|---|---|
| Appreciation for a system | Understanding that outcomes are produced by the system, not by individuals. The system has structure, interdependencies, and boundaries. Improving it requires understanding those boundaries. | Disaggregate. Mapping the pathways within the process is the act of understanding the system’s internal structure. Identifying the dominant mechanism is identifying where the system most needs to change. |
| Knowledge of variation | Distinguishing common cause variation (the system) from special cause variation (a specific event). Not reacting to common cause variation as if it were a special cause. Using statistical methods to tell the difference. | Experiment. Bootstrap CUSUM is the application of knowledge of variation to the Study step of PDSA — detecting whether a genuine structural shift has occurred or whether the observed change is within the system’s normal common cause range. |
| Theory of knowledge | Knowledge requires a theory — a prediction that can be tested. Data alone is not knowledge. An observation that was not predicted in advance is not confirmed evidence. Improvement requires prediction, test, and update. | Experiment. The pre-specified prediction — direction, metric, timing, confidence threshold — is Deming’s theory of knowledge applied to improvement. Without the pre-specification, the Study step has no theory to test. |
| Psychology | Understanding how people respond to measurement, management, and change. Fear of data produces gaming. Blame produces concealment. Intrinsic motivation produces genuine improvement. | Stratify. Finding Bright Spots — units that are performing better within the same system — is an act of positive psychology. It reframes the question from “who is failing?” to “who has found a better way?” That is a fundamentally different relationship between measurement and the people being measured. |
The PDSA connection — where Bootstrap CUSUM fits
The PDSA cycle and Joiner’s three strategies are not parallel frameworks — PDSA is the operational container inside which Stratify, Experiment, and Disaggregate are run. Each strategy generates a PDSA cycle, and the cycles run in sequence.
🔁 Three PDSA cycles — one for each strategy
Cycle 1: Stratify
Plan: Hypothesise which subgroup dimension is most likely to explain variation. Predict that at least one subgroup will show significantly better performance than the aggregate. Define “significantly better” in Bootstrap CUSUM terms: a sustained level at least X% below the aggregate, visible as a structurally different baseline.
Do: Collect the data and split it into subgroups by the chosen dimension.
Study: Run Bootstrap CUSUM on each subgroup. Is any subgroup genuinely structurally better? Is the difference sustained or a fluctuation?
Act: If a Bright Spot is confirmed: investigate what it is doing differently. Produce a transferable hypothesis — a candidate intervention. If no Bright Spot: the variation is in the shared system, not between subgroups. Proceed to disaggregation.
Cycle 2: Experiment
Plan: Define the intervention precisely. State the pre-specified prediction: which metric, which direction, within how many periods, at what confidence threshold. Define the balancing measures. Apply to a small number of sites or patients — enough to generate detectable data, small enough to limit harm if the intervention does not work.
Do: Implement. Collect data. Make no other changes during the Do phase.
Study: Run Bootstrap CUSUM at the pre-specified confidence threshold. Did the change point appear? When? Does the date match the intervention? Are balancing measures stable?
Act: If confirmed: standardise the conditions that produced the change point. Prepare to scale. If not confirmed: return to Plan. What was wrong with the hypothesis? Was the intervention at the right level? Was the constraint correctly identified?
Cycle 3: Disaggregate
Plan: Having confirmed the intervention works in the experiment, design the scaled application targeted at the dominant mechanism. Define the aggregate change point you expect as a result — the timing, direction, magnitude, and confidence threshold.
Do: Implement at scale across the dominant pathway.
Study: Run Bootstrap CUSUM on both the pathway metric and the aggregate. The pathway change point should appear first; the aggregate change point should follow, and the larger the pathway’s share of the aggregate, the sooner and more clearly it should be detectable.
Act: If both change points confirmed: standardise, monitor, and return to Stratify at the new baseline. Identify the next dominant mechanism. If aggregate does not follow pathway: the pathway share of the aggregate was smaller than estimated, or a compensating adverse change occurred in another pathway. Disaggregate further.
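The predicted magnitude of the aggregate change in this cycle follows from the pathway's share and the reduction confirmed in the experiment. The sketch below shows that calculation with hypothetical per-period levels and hypothetical function names. It assumes the other pathways stay unchanged, which is exactly the assumption that fails when a compensating adverse change occurs elsewhere.

```python
def expected_aggregate_level(pathway_levels, pathway, relative_reduction):
    """Aggregate level after cutting one pathway by `relative_reduction` (0 to 1)."""
    after = dict(pathway_levels)
    after[pathway] = pathway_levels[pathway] * (1.0 - relative_reduction)
    return sum(after.values())


def expected_relative_change(pathway_levels, pathway, relative_reduction):
    """Expected proportional fall in the aggregate, other pathways held constant."""
    before = sum(pathway_levels.values())
    after = expected_aggregate_level(pathway_levels, pathway, relative_reduction)
    return (before - after) / before


# Hypothetical levels per period: a dominant pathway and everything else.
levels = {"dominant pathway": 16.0, "other pathways": 4.0}

# A 50% reduction confirmed in the dominant pathway implies a 40% aggregate fall.
print(expected_aggregate_level(levels, "dominant pathway", 0.5))  # 12.0
print(expected_relative_change(levels, "dominant pathway", 0.5))  # 0.4
```

If the observed aggregate shift is much smaller than this figure, either the share estimate was too high or another pathway has worsened.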
Applied examples
📋 How the three strategies have been applied on this site
Experiment: Any intervention — pharmacy unit-dose dispensing, separated storage, ENFit rollout — should be tested with a pre-specified Bootstrap CUSUM prediction on the pathway metric before national rollout. Previous mandates (NatSSIPs, MSO, colour coding) were implemented at scale without pre-specified predictions and are not detectable as change points in the aggregate data.
Disaggregate: 16 of 20 wrong-route events in 2023–24 were oral-to-IV. The NRFit mandate addresses neuraxial-to-IV — a real problem, but not the dominant mechanism. Managing the aggregate number without addressing the oral-to-IV pathway specifically is addressing the wrong constraint. See Never Events — Wrong Route.
Experiment: The COBRRA trial (NEJM, March 2026) is the correct form of this strategy — a pre-specified, controlled experiment testing apixaban versus rivaroxaban with a defined outcome measure. The Bootstrap CUSUM change point in DOAC-specific adverse drug reaction rates at 2016 corresponds to the ROCKET-AF controversy and the subsequent rivaroxaban-to-apixaban switch — a natural experiment with a dateable cause.
Disaggregate: The aggregate adverse event rate combines events from inappropriate prescribing (wrong drug, wrong dose, wrong patient), monitoring failures (renal function not checked), and drug interactions (50% of DOACs not adjusted for renal function by 2019). Each has a different cause and a different solution. See Anticoagulation Safety.
Experiment: No A&E intervention in 15 years was implemented with a pre-specified Bootstrap CUSUM prediction. The consequence: not one improvement change point is detectable in 184 monthly observations. The absence of pre-specified predictions makes it impossible to know whether any intervention worked — or would have worked at a different scale or in a different context.
Disaggregate: “A&E performance” is an aggregate of at least four distinct bottlenecks: inflow (GP access failure, self-referral), triage and treatment capacity, acute bed availability, and discharge delay (DTOC/NCR). The dominant constraint — blocked discharge — sits outside the trust’s system boundary. Improving any of the in-trust pathways without addressing discharge produces a pathway-level change point that does not appear in the aggregate. See Why Nothing Has Worked.
Before designing the next intervention, answer these in order:
- Stratify: Is anyone achieving significantly better results within this same system? Under what conditions? What are they doing differently? Run Bootstrap CUSUM on subgroup data before designing the intervention.
- Experiment: What specific, small-scale, pre-specified test would confirm whether the candidate intervention actually produces a structural change point? Write the prediction — direction, metric, timing, confidence threshold, balancing measures — before implementation.
- Disaggregate: Which specific pathway or mechanism within this process is responsible for the dominant proportion of the aggregate result? Is the planned intervention aimed at that pathway — or at a more visible but smaller contributor?
If any of the three questions cannot be answered before implementation, the intervention is a guess. Stratify, Experiment, and Disaggregate replace guessing with knowing.
Run the Experiment step
Upload your subgroup or pathway data to the StepChange Analyzer. Run Bootstrap CUSUM at your pre-specified confidence threshold. The Study step of PDSA — done properly.