Root Cause Analysis
Root cause analysis (RCA) is the family of structured techniques used to identify why a problem occurred: not just what happened, but why the system produced that outcome. Finding the root cause is necessary but not sufficient. The fix must operate at the right level of the system, and a change-point test on the outcome data over time (Bootstrap CUSUM) must verify that it worked.
- Build a cause map that identifies system causes (not just “human error”).
- Choose the right RCA tool for the problem (5 Whys vs fishbone vs fault tree).
- Handle psychological “why” questions honestly (fear, hierarchy, incentives) without blame.
- Close the loop: verify whether the outcome actually changed using data over time.
What root cause analysis is
Root cause analysis is a structured investigation methodology applied after an adverse event, near-miss, or persistent problem. Its purpose is to identify the fundamental cause — the root cause — that, if addressed, would prevent recurrence. Not the proximate cause (what immediately triggered the event), not the contributing factors (what made it worse), but the underlying condition that made the event possible.
The distinction matters because most organisations respond to problems at the proximate cause level. A patient falls: the immediate response is to put up the bed rails. A medication error occurs: the immediate response is to retrain the nurse. These responses address the proximate cause — they may prevent this specific event in this specific way, but they leave the underlying condition unchanged. The next event of the same type will occur through a slightly different proximate cause, and the cycle repeats.
Proximate cause: The immediate trigger of the event. What happened just before the adverse outcome. A necessary but not sufficient explanation — it tells you the mechanism but not the cause.
Contributing factor: A condition that increased the likelihood or severity of the event but did not directly cause it. Fatigue, understaffing, poor lighting, time pressure. Important context but not the root cause.
Root cause: The fundamental system condition without which the event either could not have occurred or would have been far less likely. Addressing the root cause prevents recurrence of the entire class of events, not just this specific instance.
The RCA tool family
Several complementary tools exist for root cause analysis. Each is suited to different types of problems and different organisational contexts.
The 5 Whys
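Ask "why?" repeatedly to trace a linear causal chain from symptom to root cause. Five is a rule of thumb, not a fixed count: stop when the answer is a system condition that can be changed. Simple, fast, requires no special equipment. Originated at Toyota, where it is attributed to Sakichi Toyoda.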
Fishbone (Ishikawa) Diagram
Maps multiple categories of potential causes onto a diagram shaped like a fishbone, with the problem at the head. Developed by Kaoru Ishikawa at Kawasaki in the 1960s.
Fault Tree Analysis (FTA)
Top-down logical diagram that maps all possible combinations of failures that could produce an undesired top event. Uses Boolean logic gates (AND, OR). Standard in aerospace and nuclear industries.
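The gate logic of a fault tree can be sketched in a few lines. The example below is a minimal illustration, not a full FTA tool: the class names `BasicEvent` and `Gate`, the `evaluate` function, and the medication scenario are all hypothetical and chosen only to show how AND and OR gates combine failures into a top event.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class BasicEvent:
    """A leaf of the tree: a single failure that either occurred or did not."""
    name: str
    occurred: bool


@dataclass
class Gate:
    """An AND or OR gate combining child events or other gates."""
    kind: str  # "AND" or "OR"
    children: List[Union["Gate", BasicEvent]] = field(default_factory=list)


def evaluate(node: Union[Gate, BasicEvent]) -> bool:
    """Return True if the failure represented by this node occurs."""
    if isinstance(node, BasicEvent):
        return node.occurred
    results = [evaluate(child) for child in node.children]
    if node.kind == "AND":
        return all(results)   # every child failure is required
    if node.kind == "OR":
        return any(results)   # any one child failure is enough
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical top event: the wrong drug reaches the patient.
# It requires a selection error AND a failed barcode check,
# where the barcode check fails for either of two reasons.
top_event = Gate("AND", [
    BasicEvent("look-alike drug selected", occurred=True),
    Gate("OR", [
        BasicEvent("scanner unavailable", occurred=False),
        BasicEvent("scan skipped under time pressure", occurred=True),
    ]),
])

top_event_occurs = evaluate(top_event)
```

Here the top event occurs because both branches of the AND gate are satisfied. Setting either the selection error or both barcode failures to `False` blocks it, which is how the tree shows where a barrier would be effective.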
Significant Event Analysis (SEA)
A reflective, team-based review of significant events — including near-misses and good outcomes — to learn from what happened. Standard in UK primary care.
The fishbone (Ishikawa) diagram
Each bone represents a category of potential causes. Sub-branches identify specific causes within each category. The fishbone generates hypotheses — the 5 Whys then tests each one.
The fishbone diagram — also called the Ishikawa diagram or cause-and-effect diagram — was developed by Kaoru Ishikawa at Kawasaki Heavy Industries in the 1960s and is now used across healthcare, manufacturing, and service industries worldwide. It provides a structured way to brainstorm and organise potential causes across multiple categories simultaneously.
The 6M categories
The most widely used fishbone structure in manufacturing and healthcare uses six categories — the 6Ms. In healthcare the categories are sometimes adapted to the 4Ps (People, Process, Place, Policy) or to specific clinical frameworks.
| Category | Manufacturing original | Healthcare equivalent | Examples of causes |
|---|---|---|---|
| Man / People | Operator skills, training | Staff knowledge, fatigue, communication | Insufficient training, unclear roles, handover failures |
| Machine / Equipment | Tools, machinery | Medical devices, IT systems, connectors | Equipment not available, alert fatigue, incompatible connectors |
| Method / Process | Procedures, work instructions | Clinical protocols, care pathways | No standard procedure, procedure not followed, outdated guideline |
| Material | Raw materials, components | Medications, supplies, patient information | Look-alike/sound-alike drugs, missing information, supply chain failures |
| Measurement | Inspection methods | Monitoring, audit, reporting | No monitoring system, measurement error, metric not tracked |
| Mother Nature / Environment | Temperature, humidity | Ward culture, staffing levels, time pressure | Understaffing, interruptions, normalisation of deviance |
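One way to carry a fishbone out of the workshop and into the investigation is to record it as a simple mapping from category to candidate causes. The sketch below is illustrative only: the problem, the category labels, and the helper names `hypotheses` and `empty_categories` are assumptions, with example causes taken from the table above.

```python
from typing import Dict, List, Tuple

# Hypothetical fishbone for a medication error.
# Keys are the 6M categories; values are candidate causes from a team session.
fishbone: Dict[str, List[str]] = {
    "People": ["handover failures"],
    "Equipment": ["alert fatigue"],
    "Process": ["outdated guideline"],
    "Material": ["look-alike/sound-alike drugs"],
    "Measurement": [],
    "Environment": ["interruptions", "understaffing"],
}


def hypotheses(diagram: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Flatten the diagram into (category, cause) pairs to test one by one."""
    return [(category, cause)
            for category, causes in diagram.items()
            for cause in causes]


def empty_categories(diagram: Dict[str, List[str]]) -> List[str]:
    """Categories with no causes yet: a prompt to check nothing was missed."""
    return [category for category, causes in diagram.items() if not causes]


to_test = hypotheses(fishbone)
unexplored = empty_categories(fishbone)
```

Each pair in `to_test` is a hypothesis for a follow-up 5 Whys, and `unexplored` flags bones the team has not yet populated.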
When to use which tool
| Situation | Recommended tool | Why |
|---|---|---|
| Single adverse event with a clear sequence of events | 5 Whys | Fast, simple, follows the causal chain directly |
| Complex event with multiple contributing causes across different departments or systems | Fishbone diagram | Captures multiple categories simultaneously, good for team sessions |
| Recurring pattern of similar events across multiple sites or time periods | Bootstrap CUSUM + RCA | CUSUM identifies the pattern and dates it; RCA explains the cause |
| Safety-critical system where all failure pathways must be mapped | Fault tree analysis | Systematic, quantifiable, maps all pathways including combinations |
| Learning from near-misses in primary care or community settings | Significant Event Analysis | Reflective format, culturally accessible, covers positive events too |
Finding the root cause is necessary but not sufficient
Root cause analysis is widely assumed to lead naturally to prevention. Find the cause, fix the cause, prevent recurrence. In practice the chain frequently breaks at the third link. Two specific failures account for most of this.
The fix operates at the wrong level. The RCA correctly identifies the root cause at the system level. The fix is implemented at the process or output level because the system-level fix is too expensive, too slow, or outside the authority of the team conducting the analysis. The root cause remains unchanged. The event recurs through a different proximate cause. Another RCA is conducted. The pattern repeats. Joiner’s Levels of Fix is the diagnostic tool for this failure: if the fix is at Level 1 or Level 2 but the root cause is at Level 3, the fix will not prevent recurrence.
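The level check can be written down as a one-line comparison. This is a minimal sketch using the level names given later in this article (output, process, system); the `JoinerLevel` enum and the function name are hypothetical, and generalising the rule to "the fix must be at or above the level of the root cause" is an assumption that goes slightly beyond the Level 1/2 versus Level 3 case stated above.

```python
from enum import IntEnum


class JoinerLevel(IntEnum):
    OUTPUT = 1
    PROCESS = 2
    SYSTEM = 3


def fix_reaches_root_cause(fix_level: JoinerLevel,
                           root_cause_level: JoinerLevel) -> bool:
    """True if the fix operates at or above the level of the root cause.

    Assumption: a fix below the root cause level leaves the cause in place.
    """
    return fix_level >= root_cause_level


# Hypothetical case: a process-level fix applied to a system-level root cause.
will_prevent_recurrence = fix_reaches_root_cause(
    JoinerLevel.PROCESS, JoinerLevel.SYSTEM
)
```

In this case the check returns `False`: the action may be completed in full, but the condition that produced the event is untouched.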
The fix is never verified. The fix is implemented and assumed to work. No pre-specified outcome measure was defined before the fix. No Bootstrap CUSUM prediction was made. When the next review occurs, the team reports that the action was completed — not that the outcome changed. Completing an action and changing an outcome are not the same thing. Without a pre-specified test, the improvement is asserted rather than confirmed.
A Never Event occurs. An RCA is conducted. A corrective action plan is produced. The actions are completed and signed off. The event occurs again the following year. Another RCA is conducted. The same root causes are identified. A similar action plan is produced. This cycle, documented in multiple NHS investigations, is the direct consequence of RCA without Joiner-level awareness and without Bootstrap CUSUM verification. The root cause is found, a Level 1 or Level 2 fix is applied, the system remains unchanged, and the event recurs. Bootstrap CUSUM on NHS Never Events data shows the result: 17.5 events per year, unchanged for 15 years, across thousands of individual RCA investigations.
Psychological “Why” frameworks — why people ask why differently
The 5 Whys is a logical technique. But the question “why?” also has a psychological dimension that determines whether an RCA reveals the true root cause or a socially acceptable one.
In organisations where fear is present — where pointing out problems or naming system failures carries personal risk — the 5 Whys produces a sanitised causal chain that stops at the level where blame becomes uncomfortable. The questioning process appears rigorous. The conclusions are systematically incomplete. The real root cause — the management system, the accountability structure, the incentive that produced the behaviour — is never named because naming it carries too high a personal cost.
Psychological safety is not a pre-condition for asking why. It is a pre-condition for the answers being honest. Amy Edmondson’s research on psychological safety in healthcare teams showed precisely this: teams with low psychological safety reported fewer errors, not because they made fewer errors, but because they were less willing to report them. The same applies to RCA. An investigation conducted in conditions of low psychological safety finds fewer root causes, not because there are fewer to find, but because it stops before reaching the ones that are uncomfortable to name.
This is why Going to the Gemba is a pre-condition for honest RCA in complex organisations: the senior person who goes to where the problem manifests, in a culture of psychological safety, hears what the front line actually knows — not what the front line thinks it is safe to say. Deming’s Point 8 (drive out fear) is not a management philosophy. It is a prerequisite for root cause analysis to reach the root.
RCA in the NHS — why it frequently fails to prevent recurrence
The NHS conducts thousands of Root Cause Analyses every year through the Serious Incident framework, the Patient Safety Incident Response Framework (PSIRF), and clinical audit processes. The volume of RCA activity is not in question. The effectiveness is.
Three structural features of NHS RCA produce the recurring-event pattern:
1. Individual event focus without pattern analysis. Each RCA analyses one specific event. The systemic pattern — that the same type of event recurs at the same rate year after year — is visible only when multiple events are analysed as a series. Bootstrap CUSUM on the series answers the question that individual RCA cannot: has the rate of this type of event structurally changed? If it has not, the individual RCAs have not produced system change.
2. Action completion measured, not outcome change. NHS governance frameworks typically require trusts to report that RCA actions have been completed. They do not require trusts to demonstrate that completing those actions changed the outcome. The accountability framework measures activity, not effect. This is precisely the process measure vs outcome measure confusion: completing an action plan is a process measure. Reducing the event rate is an outcome measure. The NHS reports the former and calls it improvement.
3. PSIRF and the shift toward system learning. The Patient Safety Incident Response Framework (2022) is a genuine improvement on previous frameworks. It explicitly moves away from individual RCA of every serious incident toward a system-level analysis of themes and patterns. It is the right direction. The analytical tool that makes system-level pattern analysis rigorous — Bootstrap CUSUM applied to the event series — is not yet routinely used within it.
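The article does not spell out the Bootstrap CUSUM algorithm, so the following is a sketch under stated assumptions rather than the author's implementation. It assumes the common change-point formulation: take the cumulative sum of deviations from the series mean, measure its range, and compare that range with the ranges obtained from many random reorderings of the same data. The function names, the 2,000 reorderings, and the monthly counts are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def cusum(series: Sequence[float]) -> List[float]:
    """Cumulative sum of deviations from the series mean, starting at 0."""
    mean = sum(series) / len(series)
    sums = [0.0]
    for x in series:
        sums.append(sums[-1] + (x - mean))
    return sums


def cusum_range(series: Sequence[float]) -> float:
    """Spread of the CUSUM path; large when the mean shifts part-way through."""
    path = cusum(series)
    return max(path) - min(path)


def bootstrap_cusum(series: Sequence[float], n_boot: int = 2000,
                    seed: int = 0) -> Tuple[int, float]:
    """Return (change_index, confidence) for a single change point.

    confidence is the fraction of reshuffled series whose CUSUM range is
    smaller than the observed one. change_index is the 0-based position of
    the first observation after the estimated change.
    """
    rng = random.Random(seed)
    observed = cusum_range(series)
    shuffled = list(series)
    smaller = 0
    for _ in range(n_boot):
        rng.shuffle(shuffled)  # reorder without replacement
        if cusum_range(shuffled) < observed:
            smaller += 1
    path = cusum(series)
    # path[i] covers the first i observations, so the largest |path[i]|
    # marks observation i as the first one in the new regime.
    change_index = max(range(len(path)), key=lambda i: abs(path[i]))
    return change_index, smaller / n_boot


# Hypothetical monthly event counts with a drop after month 10.
monthly_events = [9, 11, 10, 12, 8, 10, 11, 9, 10, 10,
                  5, 4, 6, 3, 5, 4, 6, 5, 4, 5]
change_index, confidence = bootstrap_cusum(monthly_events)
```

If the confidence is low, the series is behaving like a stable process and the events are best treated as a pattern produced by the system. If it is high, `change_index` dates the shift and the means on either side show its direction.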
Closing the loop with Bootstrap CUSUM
📊 The complete RCA + Bootstrap CUSUM cycle
Step 1 — Identify the pattern with Bootstrap CUSUM. Apply Bootstrap CUSUM to the event rate series (incidents per month, adverse events per quarter, never events per year). If the process is stable with no change point, individual events are common cause variation — the system is producing them routinely. This tells you the problem is systemic, not episodic. If an upward change point appears, something specific changed and made things worse. Date the change point — that narrows the investigation window.
Step 2 — Identify the root cause with RCA. Use the 5 Whys or fishbone diagram to trace the causal chain. Apply the Joiner test at the end: is the proposed fix at Level 1 (output), Level 2 (process), or Level 3 (system)? If it is at Level 1 or 2, ask whether a Level 3 cause exists that the analysis has not yet reached.
Step 3 — Implement a Level 3 fix. The fix must address the root cause at the system level. Physical redesign where possible (making the wrong action impossible), structural change where physical redesign is not feasible, economic or accountability mechanism change where structural redesign is not feasible.
Step 4 — Pre-specify the Bootstrap CUSUM test. Although listed fourth, this step is completed before the Step 3 fix goes live. State in writing: we expect a downward Bootstrap CUSUM change point in [outcome measure — the event rate series] within [Z] time periods at [Y]% confidence. This is the commitment that makes the verification step meaningful.
Step 5 — Verify with Bootstrap CUSUM. Run Bootstrap CUSUM on the event rate series periodically after the fix. When a downward change point appears at the predicted confidence level, the fix is confirmed. When it does not appear within the expected lag window, the root cause analysis was incomplete — return to Step 2.
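Steps 4 and 5 can be made concrete by recording the prediction as data and checking the later result against it. The sketch below assumes the change-point index, its confidence, and the before and after means have already been obtained from a Bootstrap CUSUM run. The `Prediction` fields, the `verify` function, the status strings, and the numeric values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prediction:
    """Written down before the fix is implemented (Step 4)."""
    outcome_measure: str
    fix_period: int        # index of the period in which the fix goes live
    max_lag_periods: int   # the [Z] in the written statement
    min_confidence: float  # the [Y]% in the written statement, as a fraction


def verify(prediction: Prediction, change_index: Optional[int],
           confidence: float, mean_before: float, mean_after: float,
           periods_observed: int) -> str:
    """Classify the state of the verification (Step 5)."""
    window_end = prediction.fix_period + prediction.max_lag_periods
    confirmed = (
        change_index is not None
        and confidence >= prediction.min_confidence
        and mean_after < mean_before  # the change must be downward
        and prediction.fix_period <= change_index <= window_end
    )
    if confirmed:
        return "confirmed"
    if periods_observed > window_end:
        # The lag window has closed without the predicted change point.
        return "not confirmed: return to Step 2"
    return "pending"


# Hypothetical values, fixed in advance of the fix.
prediction = Prediction(
    outcome_measure="events per month",
    fix_period=10,
    max_lag_periods=4,
    min_confidence=0.95,
)
status = verify(prediction, change_index=10, confidence=0.99,
                mean_before=10.0, mean_after=4.7, periods_observed=20)
```

Because the prediction is frozen before the data arrive, the team cannot adjust the window or the confidence threshold afterwards to make the result look better.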
Related concepts
This concept sits within a broader framework for understanding why improvement programmes succeed or fail. Start with Why Nothing Changes for the full picture, or go to Start Here for a guided introduction to the method.