Healthcare Quality Improvement

Same Data, Three Charts, Three Very Different Stories

A practical guide for clinical pharmacists, medicines safety officers, and quality improvement leads who need to present defensible evidence of change to boards, regulators, and commissioners.

StepChangeAnalysis.com | Bootstrap CUSUM SPC Analysis | 488 weekly observations | 2017–2026


The X-mR verdict

The most common NHS SPC chart reports a flat mean of 22.57 across nine and a half years. It flags early high points in 2017–18 as the main story. The systematic 85% improvement is completely invisible. The flat mean is an average of a journey reported as a destination.

The run chart verdict

Correctly detects that something shifted — broadly around 2019–2020. But cannot say how many steps occurred, when each one happened, what each change was worth, or with what confidence. A staircase described as a slope.

The Bootstrap CUSUM verdict

Eleven distinct stages at 90% confidence. Stage means descend from a 2018 peak of 42 to 6 by 2026, an 85% reduction over eight years. Each change point is dated within a confidence window. Three stages detected at 100% confidence. One temporary worsening in 2018 correctly identified as a special cause and explained.

The governance implication

The three-sigma rule in Shewhart charts is fixed by convention. Bootstrap CUSUM confidence is earned from your own data by resampling it — not assumed from statistical theory. For board papers, CQC submissions, and commissioning cases, this is the difference between a chart that can be challenged and one that can be defended.

Data: 488 weekly observations · 2017–2026 · real anonymised data · Method: Bootstrap CUSUM, 90%/95%/99.7% confidence · Loops=1,000–10,000 · Data file: problem-cases.csv


Table of Contents

Introduction: the chart is not neutral
The dataset
Chart 1: The X-mR Shewhart Control Chart
Chart 2: The Run Chart
Chart 3: Bootstrap CUSUM Step-Change Analysis
How to read the green CUSUM line
Why Bootstrap — why not classical CUSUM?
Choosing your confidence level
The three-chart comparison
When each chart is the right choice
Retrospective vs Prospective Use — Two Different but Equally Powerful Applications
A note on Bootstrap convergence — verifying your results
A note on information governance
Summary
References

Introduction: the chart is not neutral

When a medicines safety team presents data to a clinical governance board, the choice of chart is rarely treated as a significant decision. You have the data. You plot it. You present it.

But the chart you choose is not a neutral vessel for your data. It is an analytical lens — and different lenses reveal different things. More troublingly, some lenses actively conceal things that are genuinely present in the data.

This article presents three charts of exactly the same dataset: 488 weekly observations of support cases submitted by clinicians using a high-risk medicines clinical decision support system, running from 2017 to 2026. The mean is 22.57 cases per week and the standard deviation is 11.05. The data is real and has been anonymised.

Each chart is legitimate. Each is widely used in healthcare quality improvement. Each tells a different story. Only one tells the whole truth.

The dataset

Over the nine and a half years covered by this dataset, a high-risk medicines clinical decision support system was progressively rolled out, refined, and adopted across a clinical user base. Support cases — queries submitted by clinicians asking how to use the system — were logged weekly throughout this period.

The weekly case count started high, at around 30 to 35 per week in 2017, and peaked at around 42 per week in 2018. Through successive system improvements, training initiatives, and deepening user proficiency, it then descended to approximately 6 per week by 2026. Measured from the 2018 peak, that is an 85% reduction over eight years, achieved not in a single step but through eleven distinct stages, most of them sequential improvements, each embedding before the next was introduced.

This is exactly the kind of story a quality improvement programme should be able to tell its board. The question is: which chart tells it?

Chart 1: The X-mR Shewhart Control Chart

“Nothing much to see here — except some worrying high points in 2017 and 2018.”

Figure: X-mR Shewhart control chart of 488 weekly observations, 2017–2026. Flat process mean 22.57, UNPL 39.23, LNPL 5.91; points above the upper limit in 2017–2018 are flagged as signals.

The X-mR (Individuals and Moving Range) chart is the most commonly used SPC chart in NHS quality improvement. It calculates a process mean (green line, at 22.57), adds upper and lower natural process limits at plus and minus 2.66 times the mean moving range (UNPL 39.23, LNPL 5.91), and flags individual points that breach those limits as potential signals of special cause variation.
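The X-mR calculation just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the tool's own code: `xmr_limits` and `flag_signals` are hypothetical names, the input is assumed to be a plain list of weekly counts in time order, and only the basic rule (a point outside the limits) is implemented, with no supplementary run rules.

```python
from statistics import mean


def xmr_limits(values):
    """Return (centre, mean moving range, UNPL, LNPL) for an X-mR chart.

    Assumes `values` is a list of numbers in time order.
    """
    if len(values) < 2:
        raise ValueError("need at least two observations")
    centre = mean(values)
    # Moving range: absolute difference between each pair of consecutive points
    moving_ranges = [abs(b - a) for a, b in zip(values, values[1:])]
    mr_bar = mean(moving_ranges)
    # 2.66 = 3 / 1.128, where 1.128 is the bias-correction constant (d2)
    # for ranges of two points; the limits approximate centre +/- 3 sigma
    unpl = centre + 2.66 * mr_bar
    lnpl = centre - 2.66 * mr_bar
    return centre, mr_bar, unpl, lnpl


def flag_signals(values):
    """Indices of points outside the natural process limits (basic rule only)."""
    _, _, unpl, lnpl = xmr_limits(values)
    return [i for i, v in enumerate(values) if v > unpl or v < lnpl]
```

Working backwards from the figures quoted above, a centre line of 22.57 and a UNPL of 39.23 imply a mean moving range of about (39.23 − 22.57) / 2.66 ≈ 6.26 cases per week.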

Applied to this dataset, the X-mR chart tells the following story: the process has been broadly stable across nine and a half years, with a flat mean of 22.57. A number of points breach the upper control limit in 2017 and 2018, and these would be flagged for investigation. The lower control limit of 5.91 is barely breached until the most recent weeks of the series.

The implicit governance message is: this is a stable process with some concerning elevated periods in its earlier history.

This conclusion is not just incomplete. It is actively misleading.

The flat mean of 22.57 is an average of data whose stage means ranged from around 42 down to 6. It is a number that accurately describes neither the start of the series nor the end: it is the arithmetic mean of a journey, reported as if it were a destination. The high points in 2017–2018 that the chart flags as “signals” mark the starting level of the improvement story (and, in 2018, a temporary worsening with an identifiable cause, discussed below), not evidence of an ongoing problem. And the most important feature of the dataset, a sustained, systematic, eight-year programme of measurable improvement, is completely invisible.

Why does this happen?

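The X-mR chart draws a single centre line at the mean of the entire series and places its limits 2.66 mean moving ranges either side of it. That design assumes one stable process. When a dataset contains genuine, sustained step-changes, the single centre line sits in the middle of the staircase: the early stages run above it, the later stages run below it, and each individual step (roughly 4 to 12 cases per week in this dataset) is much smaller than the 16.66 cases per week between the centre line and either limit. The mean moving range, averaged across the whole series, also picks up the jump at every stage boundary, which widens the limits slightly further. The chart is designed for a stable process and interprets everything through that assumption, even when the process is demonstrably not stable.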

Additionally, the standard deviation of 11.05 is 49% of the mean of 22.57. For a weekly count that cannot fall below zero, a mean barely two standard deviations above zero is a strong signal of non-normality, and the median of 21.00 sitting below the mean points the same way: the data is right-skewed, with most weekly values below the mean and occasional high values pulling the average upward. The control limits, derived from normal-distribution mathematics, are therefore placed in the wrong position, further reducing the chart’s sensitivity to genuine change.
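The skew reasoning above can be checked directly. The sketch below is illustrative only: `skew_check` is a hypothetical name, and the 40% threshold is the rule of thumb this article uses later, not a formal test. It reports the coefficient of variation (SD divided by mean), the gap between mean and median, and the adjusted Fisher-Pearson sample skewness. With the article's summary figures, the coefficient of variation is 11.05 / 22.57 ≈ 0.49. One caution: when a series contains genuine level shifts, its overall standard deviation also includes the differences between stages, so a high coefficient of variation reflects the steps as well as any skew within them.

```python
from statistics import mean, median, stdev


def skew_check(values, cv_threshold=0.40):
    """Quick screens for right skew in a list of weekly counts."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 denominator)
    # Adjusted Fisher-Pearson sample skewness: positive means a long right tail
    skewness = n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in values)
    cv = s / m
    return {
        "cv": cv,
        "cv_exceeds_threshold": cv > cv_threshold,
        "mean_minus_median": m - median(values),  # > 0 hints at right skew
        "skewness": skewness,
    }
```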

Chart 2: The Run Chart

“Something changed — roughly around 2020. But we couldn’t tell you much more than that.”

Figure: Run chart of 488 weekly observations with median 21.00. Red markers show runs of six or more consecutive points above the median; blue markers show runs below. A broad downward shift is visible from approximately 2020.

The run chart is simpler than the Shewhart chart and, in one important respect, more honest with this dataset. It plots each observation against the overall median (21.00) and uses run rules — specifically, runs of six or more consecutive points above or below the median — to flag potential shifts.
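The run rule described above can be sketched as follows. This is an illustrative Python sketch, not the tool's implementation: `shift_flags` is a hypothetical name, and it assumes the common convention (Perla et al., 2011) that points lying exactly on the median neither extend nor break a run.

```python
from statistics import median


def shift_flags(values, min_run=6):
    """Return (median, flags) where flags[i] is 'above' or 'below' if point i
    belongs to a run of at least `min_run` consecutive points on one side of
    the median, otherwise None. Points exactly on the median are skipped:
    they neither extend nor break a run."""
    centre = median(values)
    flags = [None] * len(values)
    runs = []  # (side, [indices]) in time order
    for i, v in enumerate(values):
        if v == centre:
            continue
        side = "above" if v > centre else "below"
        if runs and runs[-1][0] == side:
            runs[-1][1].append(i)  # same side: the current run continues
        else:
            runs.append((side, [i]))  # side changed: a new run starts
    for side, indices in runs:
        if len(indices) >= min_run:
            for i in indices:
                flags[i] = side
    return centre, flags
```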

On this dataset, the run chart identifies two broad zones. The red dots — points in a run of six or more above the median — dominate the period from 2017 to approximately 2020. The blue dots — runs of six or more below the median — begin appearing around 2019–2020 and become almost universal from 2023 onwards.

The run chart is correctly detecting that something real happened. A quality manager looking at this chart would rightly conclude that the process has shifted downward at some point — probably around 2019 or 2020. This is more than the X-mR chart managed. But it is still a profoundly incomplete picture.

What the run chart cannot tell you

That there were not one but eleven distinct stages of improvement
That the first step-change occurred in 2017, not 2019–2020
That each step had a specific magnitude — from 42 to 35, from 35 to 24, and so on down to 6
That each change point can be dated within a confidence window
That the overall reduction from the 2018 peak to the final stage was 85%
That each of these findings holds at 90% confidence

The run chart draws one flat median line through a process that visited eleven different levels. It is the equivalent of describing a staircase as a slope — directionally correct, but losing all the structure that makes it useful for governance, evaluation, or funding decisions.

Chart 3: Bootstrap CUSUM Step-Change Analysis

“Eleven stages. Ninety percent confidence. An 85% reduction — dated, quantified, and bounded step by step.”

Figure: Bootstrap CUSUM step-change analysis at 90% confidence, 11 distinct stages identified. The blue step-mean line descends from approximately 42 at its 2018 peak to about 6 by 2026; the green CUSUM line tracks the accumulated evidence of change.

The Bootstrap CUSUM step-change analysis of the same dataset produces something categorically different from the previous two charts.

Eleven distinct stages are identified at 90% confidence. The blue step-mean line descends in a clear staircase: approximately 30 at the opening of the series, rising briefly to around 42 in 2018, then stepping down through 35, 24, 22, 22, 17, 15, 11 and 10 to approximately 6 by 2026. Each step is accompanied by a dashed confidence box showing the window within which that step-change is estimated to have occurred.

The green CUSUM line tells the story of accumulated evidence: it rises while the metric runs above the overall series mean in the early years, peaks around 2019–2020, then descends continuously as sustained improvement compounds. By 2026 it returns to zero, as it must: deviations from the overall mean always sum to zero across the whole series, so the end point itself carries no information. The evidence is in the shape, and the long, unbroken descent from the 2019–2020 peak is the signature of a process that has run below its historical average week after week.

A governance report based on this analysis can state:

“Bootstrap CUSUM step-change analysis of 488 weekly observations identifies eleven distinct stages at 90% confidence. Weekly clinical decision support queries have reduced from a peak stage mean of approximately 42 in 2018 to approximately 6 in 2026, an 85% reduction. Each change point is dated within a confidence interval of four to eight weeks, consistent with the implementation timelines of successive interventions.”

That statement can go into a board paper. It can withstand scrutiny from a clinical governance committee or a CQC inspector. It supports a funding continuation decision. It quantifies the return on a quality improvement programme investment.

How to read the green CUSUM line

Before going further, it is worth pausing to explain what the green CUSUM line actually is and what it tells you — because it is the most important element of the chart, and the most commonly misread.

📊 How the CUSUM line is calculated and what it means

The CUSUM (Cumulative Sum) line is built by taking each observation, subtracting the overall mean of the entire series, and adding that deviation to a running total. The result is a line that tells you, at any point in time, whether the data has been running consistently above or below its long-run average.

Rising slope: data is running above the overall mean. In this dataset, weekly case counts are higher than the long-run average.
Falling slope: data is running below the overall mean. Case counts are lower, so improvement is accumulating.
Flat: data is hovering around the overall mean. No net drift in either direction.
A peak or turning point: the most important feature. This is the moment the process changed direction, switching from running above average to below, or vice versa. The Bootstrap algorithm tests whether that turning point reflects a genuine change or could plausibly have occurred by chance.
Steepness: how far above or below the mean the data is running. A steep slope means the data is strongly and consistently above or below average. A gentle slope means the deviation is small or inconsistent.

The blue stage mean lines are the Bootstrap algorithm’s verdict on where genuine structural changes occurred and with what confidence. The green line is the raw evidence. The blue lines are the statistical interpretation of it. Change the confidence level and the blue lines move — but the green line never changes, because it is derived entirely from the data.
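The calculation described in the box can be sketched as follows. This is an illustrative Python sketch with hypothetical function names, not the tool's implementation. Because deviations from the overall mean always sum to zero, the last value of the running total is zero (up to rounding) for any series; the information lies in the slopes and the turning point. `turning_point` locates the index where the line is furthest from zero, which is the change-point estimate Taylor (2000) describes.

```python
from itertools import accumulate


def cusum_line(values):
    """Running total of each observation's deviation from the overall mean."""
    overall_mean = sum(values) / len(values)
    return list(accumulate(v - overall_mean for v in values))


def turning_point(cusum):
    """Index where the CUSUM is furthest from zero: the last observation
    before the largest estimated change in level."""
    return max(range(len(cusum)), key=lambda i: abs(cusum[i]))
```

For the toy series [5, 5, 5, 1, 1, 1], the line rises to 6 after the third point and falls back to 0, so the turning point is index 2, the last week before the drop.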

With that in mind, look again at the Bootstrap CUSUM chart above. The green line rises from 2017 to around 2019–2020 — telling you that during that period, weekly case counts were consistently running above the overall series mean. It then turns and falls continuously to 2026 — telling you that from that turning point onwards, case counts have been consistently below the overall mean. The turning point is precisely the moment the improvement programme began to structurally outweigh the earlier elevated period. The Bootstrap algorithm’s job is to test whether each inflexion in that line represents a statistically significant structural change or is simply noise.

Why Bootstrap — why not classical CUSUM?

The CUSUM chart has been used in industrial quality control since E.S. Page developed it at Cambridge in 1954. Its logic — accumulating evidence of drift rather than evaluating individual observations in isolation — makes it fundamentally better suited than Shewhart charts to detecting sustained shifts.

But classical CUSUM still relies on distributional assumptions when setting its decision threshold. For non-normal data, those assumptions are as problematic as the ones underlying the Shewhart limits.

Bootstrap CUSUM solves this by deriving the decision threshold directly from the data itself. The actual dataset is randomly reordered one thousand times. For each reordering, the CUSUM statistic is calculated. A random reordering destroys any genuine temporal structure, so this generates an empirical picture of what the CUSUM looks like when there is definitively no change present — using your data’s actual distributional characteristics, not a theoretical normal distribution it doesn’t follow.

The decision threshold is then set at the chosen confidence level from this empirical distribution. The confidence level is calculated directly from your actual data by resampling it 1,000 times (or however many iterations you choose), rather than looked up from a statistical formula or textbook table that assumes your data follows a normal distribution. It is earned from the data, not assumed from theory.
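A minimal sketch of the reordering test described above, following the change-point approach in Taylor (2000) and assuming the CUSUM range (maximum minus minimum of the running total) as the test statistic. The function names are hypothetical and this is not the tool's code. Because each reordering samples without replacement, this is strictly a permutation test; Taylor describes it as a bootstrap, and the article follows that usage.

```python
import random
from itertools import accumulate


def cusum_range(values, centre):
    """Max minus min of the running total of deviations from `centre`,
    including the starting value of zero."""
    totals = [0.0] + list(accumulate(v - centre for v in values))
    return max(totals) - min(totals)


def change_confidence(values, loops=1000, seed=None):
    """Fraction of random reorderings whose CUSUM range is smaller than the
    range in the real time order: the confidence that a change occurred."""
    rng = random.Random(seed)
    centre = sum(values) / len(values)  # the mean is unchanged by reordering
    observed = cusum_range(values, centre)
    shuffled = list(values)
    smaller = 0
    for _ in range(loops):
        rng.shuffle(shuffled)  # destroys any time structure in the data
        if cusum_range(shuffled, centre) < observed:
            smaller += 1
    return smaller / loops
```

A result of 0.95 or above would be read as a change detected at 95% confidence; a higher `loops` value gives a more stable estimate at the cost of run time.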

Choosing your confidence level

One of the most practically useful features of Bootstrap CUSUM step-change analysis is that the confidence level is not fixed — it is a choice the analyst makes explicitly, based on the purpose of the analysis and the audience receiving it.

Here is what happens when the same 488-observation dataset is analysed at three different confidence levels:

Figure: Bootstrap CUSUM at 95% confidence, 10 stages.

Figure: Bootstrap CUSUM at 99.7% confidence, 6 stages.

Confidence level	Stages detected	Interpretation
90%	11 stages	Every change the data plausibly supports, including smaller steps
95%	10 stages	One marginal change point drops out; remaining steps are more certain
99.7%	6 stages	Only the largest, most unambiguous structural shifts remain
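The pattern in the table follows from how multiple stages are typically found: test the whole series, split it at the most likely change point if the confidence clears the chosen level, then repeat on each part. A stricter confidence level stops the splitting sooner, which is why 99.7% yields fewer stages than 90%. The sketch below assumes plain recursive binary segmentation with a reordering test in the spirit of Taylor (2000). It omits refinements such as re-testing each change point against its neighbours, the `min_size` guard is an arbitrary assumption, the function names are hypothetical, and the tool's actual procedure may differ.

```python
import random
from itertools import accumulate


def _test_segment(seg, loops, rng):
    """Return (confidence, split) for one segment: the reordering confidence
    that it contains a change, and the index of the first point after the
    most likely change (one past the CUSUM's largest excursion)."""
    centre = sum(seg) / len(seg)
    totals = list(accumulate(v - centre for v in seg))
    observed = max([0.0] + totals) - min([0.0] + totals)
    split = max(range(len(totals)), key=lambda i: abs(totals[i])) + 1
    work, smaller = list(seg), 0
    for _ in range(loops):
        rng.shuffle(work)
        t = [0.0] + list(accumulate(v - centre for v in work))
        if max(t) - min(t) < observed:
            smaller += 1
    return smaller / loops, split


def find_stages(values, confidence=0.90, loops=1000, min_size=4, seed=None):
    """Start index of every stage found by recursive binary segmentation."""
    rng = random.Random(seed)
    starts = []

    def split_segment(lo, hi):
        seg = values[lo:hi]
        if len(seg) < 2 * min_size:
            return  # too short to hold two stages
        conf, split = _test_segment(seg, loops, rng)
        if conf >= confidence and min_size <= split <= len(seg) - min_size:
            starts.append(lo + split)
            split_segment(lo, lo + split)
            split_segment(lo + split, hi)

    split_segment(0, len(values))
    return [0] + sorted(starts)
```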

The tool’s Statistical Evidence Table makes this concrete with real numbers. At 95% confidence, the ten stages are:

Stage	From	To	Mean	SD	Conf %	Change %
1	02/01/2017	19/06/2017	34.88	11.05	Baseline	Baseline
2	19/06/2017	26/03/2018	29.59	11.05	96.8%	−15.2%
3	26/03/2018	10/12/2018	41.84	11.05	96.0%	+41.4%
4	10/12/2018	29/07/2019	33.85	11.05	96.0%	−19.1%
5	29/07/2019	28/09/2020	23.23	11.05	100.0%	−31.4%
6	28/09/2020	19/04/2021	18.17	11.05	95.4%	−21.8%
7	19/04/2021	20/03/2023	22.82	11.05	95.4%	+25.6%
8	20/03/2023	06/01/2025	15.54	11.05	100.0%	−31.9%
9	06/01/2025	29/09/2025	11.46	11.05	99.3%	−26.3%
10	29/09/2025	04/05/2026	6.75	11.05	99.3%	−41.1%
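The Change % column is simple arithmetic: each stage mean compared with the stage before it. The short sketch below (hypothetical function names) reproduces the column from the stage means in the 95% table and applies the same calculation to the overall reduction. From the 2018 peak (41.84) to the final stage (6.75) the reduction is about 84%; from the first-stage baseline (34.88) it is about 81%.

```python
def change_percentages(stage_means):
    """Change % column: each stage mean relative to the stage before it,
    rounded to one decimal place. The first stage is the baseline (None)."""
    changes = [None]
    for previous, current in zip(stage_means, stage_means[1:]):
        changes.append(round(100 * (current - previous) / previous, 1))
    return changes


def overall_reduction(start_mean, end_mean):
    """Percentage reduction from one stage mean to another."""
    return round(100 * (start_mean - end_mean) / start_mean, 1)


means_95 = [34.88, 29.59, 41.84, 33.85, 23.23, 18.17, 22.82, 15.54, 11.46, 6.75]
print(change_percentages(means_95))  # matches the Change % column above
print(overall_reduction(41.84, 6.75), overall_reduction(34.88, 6.75))
```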

At 99.7% confidence, only the six largest structural shifts survive:

Stage	From	To	Mean	SD	Conf %	Change %
1	02/01/2017	29/07/2019	34.91	11.05	Baseline	Baseline
2	29/07/2019	28/09/2020	23.23	11.05	100.0%	−33.5%
3	28/09/2020	20/03/2023	21.81	11.05	100.0%	−6.1%
4	20/03/2023	06/01/2025	15.54	11.05	100.0%	−28.7%
5	06/01/2025	29/09/2025	11.46	11.05	99.7%	−26.3%
6	29/09/2025	04/05/2026	6.75	11.05	99.7%	−41.1%

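Notice that Stages 2, 3, and 4 at 99.7% confidence are each detected at 100% confidence, meaning that none of the 1,000 random reorderings of the data produced a CUSUM range as large as the one seen in the real time order. With 1,000 reorderings, that places the chance of such a change arising from random ordering alone below roughly 1 in 1,000, the finest resolution that number of reorderings can give.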

Also worth noting: the 95% table reveals a genuine upward step in Stage 3 (March–December 2018, mean 41.84, +41.4%). This is not noise — it is a real, statistically confirmed temporary worsening before the sustained improvement resumed. Stage 3 was subsequently identified as a period of elevated log-on and printing support queries coinciding with reduced nursing and pharmacy staffing levels during the July summer holiday period, when the clinical users most familiar with the system were on leave and knowledge transfer to cover staff was limited. Improvement resumed as staffing normalised.

In Deming/Shewhart terminology, Stage 3 represents a special cause: an assignable, external event temporarily disrupting an otherwise improving process. The Bootstrap CUSUM correctly distinguished it from common cause variation, flagging it as a discrete stage rather than absorbing it into background noise. Identifying and explaining such episodes is only possible when each stage is individually dated and quantified. A standard control chart flags some individual high weeks from this period, but without a dated start, end, or magnitude it presents them as scattered outliers rather than as a discrete episode to investigate.

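The staircase descends at all three levels. An overall reduction of more than 80% holds regardless of which confidence level is chosen: about 84% from the 2018 peak at 95%, and about 81% from the first-stage baseline at 99.7%, where the 2018 peak is merged into the opening stage. What changes is the granularity: how many of the individual steps are considered sufficiently well-evidenced to report.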

This has direct practical implications for governance reporting. A board paper going to a conservative clinical governance committee, or a submission to CQC, might use 95% or 99.7% — fewer steps, but each one defensible at a very high level of certainty. An internal improvement team evaluating programme performance in detail might use 90%, capturing every statistically detectable step to understand which interventions moved the needle.

Critically, the core finding is robust across all three confidence levels. The direction of change, the approximate magnitude of the overall reduction, and the timing of the major structural shifts are consistent whether you choose 90%, 95%, or 99.7%. This robustness — the fact that the story does not change depending on the threshold — is itself a form of evidence.

For a sceptical reviewer or governance committee, this is a powerful argument: “We tested the analysis at three different confidence levels. The number of steps detected varies, but the fundamental finding (a sustained, multi-stage reduction of more than 80% over eight years) is present at every level of stringency we applied. And at every level, the confidence is earned from the data itself, not assumed from a statistical formula.”

No classical SPC method offers this kind of explicit, transparent confidence calibration. The three-sigma rule used in Shewhart charts is fixed by convention, not chosen by the analyst based on the governance context. Bootstrap CUSUM puts that choice where it belongs: with the person responsible for the analysis and accountable for its conclusions.

The three-chart comparison

	Run Chart	X-mR Shewhart	Bootstrap CUSUM
Detects that something changed	Yes — broadly	No	Yes — precisely
Identifies how many changes	No	No	Yes — 11 stages
Dates each change point	No	No	Yes — with confidence interval
Quantifies each change	No	No	Yes — stage means, conf % & change %
Provides a confidence level	No	No	Yes — from your own data, not statistical theory
Handles non-normal data reliably	Partially	No	Yes — distribution-free
Defensible under governance challenge	Partially	Unlikely	Yes
Suitable for board papers and CQC	Limited	Limited	Yes

When each chart is the right choice

Use the Run Chart when:

You need a simple, accessible chart that any clinical staff member can interpret
You are doing initial exploratory analysis to check whether anything has changed
You are working with very small datasets where formal threshold methods are unreliable

Use the X-mR Shewhart chart when:

You are monitoring a process in real time and need staff to respond to signals at the point of care
Your data is approximately normally distributed
You need a chart that flags individual anomalous observations for immediate investigation

Use Bootstrap CUSUM step-change analysis when:

You are evaluating historical data to determine whether and when a sustained change occurred
Your data is non-normally distributed — counts, rates, costs, rare events, or any series where SD exceeds roughly 40% of the mean
You need a statistically defensible confidence level for governance, regulatory, or commissioning purposes
You want to identify multiple change points and date each one
You are making a retrospective case for the impact of a quality improvement programme

Retrospective vs Prospective Use — Two Different but Equally Powerful Applications

The case study in this article is retrospective — we are looking back at nine and a half years of data and asking: did the process change, when, and by how much? The Bootstrap CUSUM answers all three questions with precision.

In our dataset, the step-change boundaries correspond to periods of known system development activity — successive product upgrades and server migrations implemented over the nine-year period. The Bootstrap CUSUM correctly identified that genuine structural changes occurred and dated each one — providing the starting point for any retrospective investigation.

For the January 2025 stage boundary specifically, the cause is known: the support team introduced a new case category (‘Problems at Customer Site’) which reclassified a proportion of existing cases out of the main count. This is a textbook special cause in Deming/Shewhart terms — a recording change rather than a genuine process improvement — and should be interpreted accordingly. Crucially, the Bootstrap CUSUM detected it. A standard control chart would not have.

The prospective use case: where Bootstrap CUSUM is most powerful

When used prospectively, monitoring for the effect of a planned intervention, the attribution problem of retrospective analysis (working out after the event what caused each step) largely disappears.

The workflow is straightforward. You implement a change: a new clinical protocol, a prescribing policy, a system upgrade, a training programme. You continue logging your weekly or monthly data as normal. You run the Bootstrap CUSUM analysis periodically. When the analysis detects a new stage, you have statistical confirmation, dated to within weeks, that your intervention has produced a genuine structural change in the process.

You know what you did. You know when you did it. The Bootstrap CUSUM tells you whether it worked, with what confidence, and from what date the new level was established.

This is precisely the evidence that clinical governance committees, commissioners, and CQC inspectors are asking for: not a run chart showing a vague downward trend, but a dated, confidence-bounded, quantified stage change that coincides with your intervention and can be defended under challenge.

A note on Bootstrap convergence — verifying your results

Because the Bootstrap CUSUM derives its confidence thresholds by resampling the data randomly, there is a small degree of run-to-run variability when the number of iterations is low. At 1,000 loops, marginal change points — those sitting close to the confidence threshold — may appear in some runs and not others. This is not a flaw in the method. It is a property of any resampling procedure and can be managed straightforwardly: increase the number of bootstrap iterations until the result stabilises.

In our dataset at 99.7% confidence:

At 1,000 loops: 5 or 6 stages detected (marginal variability at the boundary)
At 5,000 loops: 5 stages consistently
At 10,000 loops: 5 stages consistently

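The consistency of results at 5,000 and 10,000 iterations confirms convergence. It also means the six-stage result in the 99.7% table above reflects a lower-iteration run; at 5,000 and 10,000 iterations the 99.7% analysis settles at five stages. For any formal governance submission, board paper, or publication, 10,000 bootstrap iterations are recommended to ensure results are fully converged and reproducible.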
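One way to see why marginal change points flicker at 1,000 loops is to treat the estimated confidence as a proportion of successes and compute its Monte Carlo standard error. This is a standard binomial approximation, not something the tool is known to report, and the function name is hypothetical. A change point whose true confidence sits near 99.7% has a standard error of about 0.17 percentage points at 1,000 loops, easily enough to land on either side of the threshold from run to run; at 10,000 loops it falls to about 0.05 points.

```python
import math


def monte_carlo_se(confidence, loops):
    """Approximate standard error of a confidence estimated from `loops`
    random reorderings, treating each reordering as an independent
    yes/no trial (a binomial proportion)."""
    if not 0 <= confidence <= 1 or loops < 1:
        raise ValueError("confidence must be in [0, 1] and loops >= 1")
    return math.sqrt(confidence * (1 - confidence) / loops)


for loops in (1_000, 5_000, 10_000):
    se = monte_carlo_se(0.997, loops)
    print(f"{loops:>6} loops: 99.7% +/- {100 * se:.2f} percentage points (1 SE)")
```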

A note on information governance

Uploading patient-adjacent data to cloud-based analytical platforms requires formal DPIA assessment and Information Governance approval in NHS settings — a process that can take months. Browser-native tools, where the analysis runs entirely within the local browser and no data is transmitted to any external server, eliminate this barrier entirely. The CSV file never leaves the analyst’s machine. There is no upload, no cloud storage, no third-party data processing — removing the IG approval requirement and making it straightforward to deploy even on air-gapped NHS machines.

Try it on your own data

If you have a CSV file of time-series medicines data — weekly or monthly incident counts, prescribing volumes, error rates, cost figures — you can generate all three charts in this article, plus a one-click PDF formatted for board or governance submission, directly in your browser.

No installation. No login. No data leaving your computer.


Summary

Three charts. Same data. Radically different conclusions.

The run chart detects a broad shift — useful, but unable to say when, how many steps, or with what confidence. The X-mR Shewhart chart misses the systematic improvement entirely, reporting a flat mean and flagging early high points as the main story. The Bootstrap CUSUM step-change analysis reveals eleven stages of progressive improvement, an 85% reduction over eight years, with each step dated and bounded at 90% confidence.

The difference is not in the data. It is in the method. And for medicines safety teams preparing governance reports, funding cases, or regulatory submissions, the method you choose determines the story your data is allowed to tell.

In this dataset, the full story is one of the most compelling a quality improvement programme could present: a sustained, multi-intervention, eight-year programme that reduced weekly clinical decision support queries by 85% through eleven measurable, dateable steps — demonstrating that the system embedded, users became proficient, and the support burden reduced in ways that are statistically defensible and chronologically precise.

That story was always in the data. It just needed the right chart to show it.

The dataset used in this article is real, anonymised support case data from a high-risk medicines clinical decision support system collected over nine and a half years. No patient-identifiable data was used. The analysis was performed using a browser-native SPC tool; no data was transmitted to any external server at any point.

References

Page, E.S. (1954). Continuous inspection schemes. Biometrika, 41(1–2), 100–115.

Hinkley, D.V. (1971). Inference about the change-point from cumulative sum tests. Biometrika, 58(3), 509–523.

Efron, B. & Tibshirani, R.J. (1993). An Introduction to the Bootstrap. Chapman & Hall.

Taylor, W.A. (2000). Change-point analysis: a powerful new tool for detecting changes. Taylor Enterprises.

Mohammed, M.A., Worthington, P. & Woodall, W.H. (2008). Plotting basic control charts: tutorial notes for healthcare practitioners. Quality and Safety in Health Care, 17(2), 137–145.

Perla, R.J., Provost, L.P. & Murray, S.K. (2011). The run chart: a simple analytical tool for learning from variation in healthcare processes. BMJ Quality & Safety, 20(1), 46–51.
