📁 Data preparation guide

Fix Your CSV for Bootstrap CUSUM

Bootstrap CUSUM needs two things from your CSV: a date column and a numeric value column. Most upload problems come from one of five causes — date format, missing values, merged cells, comma separators, or too few data points. This page fixes all five.

StepChangeAnalysis.com  ·  Data preparation guide  ·  June 2026
Method: Bootstrap CUSUM  ·  Open the StepChange Analyzer
☰  Contents

What is a CSV file?

CSV stands for Comma-Separated Values. It is the simplest possible format for tabular data — plain text, one row per line, columns separated by commas. Every spreadsheet application can save as CSV — Microsoft Excel, Google Sheets, and LibreOffice all support it natively. Bootstrap CUSUM requires CSV because it runs entirely in your browser with no server — CSV is the universal format that any browser can read without plugins or uploads.

Your data never leaves your computer

The StepChange Analyzer runs entirely in your browser. When you upload a CSV, it is read locally by JavaScript — it is never sent to any server. This means you can safely use patient-level NHS data, commercially sensitive operational data, or any other confidential dataset without any information governance concern.

Microsoft Excel: File → Save As → CSV (Comma delimited) (.csv)  ·  Google Sheets: File → Download → Comma-separated values (.csv)  ·  LibreOffice Calc: File → Save As → Text CSV (.csv) Open the saved file in a text editor (Notepad on Windows, TextEdit on Mac) and you will see the raw structure — this is useful for diagnosing problems.


What the Analyzer needs

The minimum requirement is two columns: a date column and a numeric value column. Column names go in the first row (the header). Everything else is optional.

📋 The minimum valid CSV structure

Date,Value
01/04/2020,142
01/05/2020,138
01/06/2020,151
01/07/2020,144
01/08/2020,139

The Analyzer will detect the date column and value column automatically. Column names can be anything — Month, Rate, Incidents, Admissions — you select them from dropdown menus after upload. Extra columns are ignored.


Date formats — the most common problem

The most common reason a CSV fails to analyse is an unrecognised date format. The Analyzer supports six formats. The safest is ISO format (YYYY-MM-DD) because it is unambiguous — but all six work if selected correctly in the Date Style dropdown.

Format Example When to use Common mistake
YYYY-MM-DD (ISO) 2024-03-01 Always safe. Unambiguous. Recommended for monthly data — use the first of each month. Using 2024-03 without a day — add -01 to make it a valid date.
DD/MM/YYYY (UK) 01/03/2024 NHS data exported from most UK systems. Select “DD/MM/YYYY (UK/EU)” in Date Style. Excel auto-converting to MM/DD/YYYY on a US-locale machine. Check the first row.
MM/DD/YYYY (US) 03/01/2024 US system exports. Select “MM/DD/YYYY (US)” in Date Style. Confusing UK and US format when both are plausible (e.g. 01/03/2024 vs 03/01/2024).
YYYY-MM (Year-Month ISO) 2024-03 Monthly data without a day. ONS and NHS England often publish in this format. Mixing YYYY-MM and YYYY-MM-DD in the same column.
MM/YYYY (Year-Month UK) 03/2024 Some NHS data exports use this format for monthly returns. None specific — but select this explicitly, do not rely on auto-detect.
YYYY (Year only) 2024 Annual data series. GDP, annual mortality rates, yearly audit data. Mixing four-digit years with other formats in the same column.
Financial year formats are not supported

Formats like 2004-05 or 2023/24 (NHS financial year notation) are not recognised. Convert each financial year to a standard date before uploading — the start of the financial year is usually clearest: 01/04/2004 for 2004-05, 01/04/2023 for 2023/24.

Excel formula to convert from financial year notation: If your financial year is in cell A2 as 2004-05 or 2023/24, use this formula to extract the start year and build a date:

=DATE(LEFT(A2,4), 4, 1)

This takes the first 4 characters (the start year) and builds 1 April of that year. Format the result column as DD/MM/YYYY before saving as CSV. For 2023/24 format, the formula is identical — LEFT(A2,4) extracts 2023 in both cases.

Mixed date formats in the same column

If some rows use 01/03/2024 and others use 2024-03, the Analyzer cannot parse all rows correctly. The rows in the minority format will return no timestamp — the out-of-order check will silently skip them, missing periods around them will go undetected, and the stage mean will be calculated on fewer points than it should be.

This is more common than it looks. NHS data exports sometimes switch format mid-series when a reporting system was upgraded, or when data from two different sources is merged into one spreadsheet.

The Data Validator detects mixed formats and flags the minority rows by row number. Once you know which rows are affected, standardise the entire column to one format before saving as CSV. The safest choice is always ISO (YYYY-MM-DD or YYYY-MM) because it is unambiguous regardless of locale.

Excel formula to convert DD/MM/YYYY to YYYY-MM-DD:

=TEXT(DATEVALUE(A2),"YYYY-MM-DD")

If the cell is already a real Excel date (not text), use: =TEXT(A2,"YYYY-MM-DD")


Missing values — zeros and blanks

Missing values are the second most common cause of incorrect results. There are two distinct cases and they require different fixes.

Case 1 — The row is omitted entirely from the CSV

The Analyzer treats each row as the next observation in sequence. If your data has monthly observations but March 2022 is missing from the CSV entirely, the Analyzer will treat February 2022 and April 2022 as consecutive months. This compresses the time axis and may cause change points to be dated incorrectly.

A missing row may mean genuinely no readings that period — and that matters analytically. A month with no incidents, no referrals, or no activity is a real data point of zero (or near-zero). Omitting it removes that information from the stage mean calculation and makes the average appear higher than it was. For NHS safety data especially, a period of no reported events should be included, not silently dropped.

Fix: Add the missing row. If no measurement was taken, use the average of the surrounding two values as an estimate and note the interpolation. If genuinely zero events occurred, use zero and make sure zero is a valid observation for this metric. Do not simply leave the row out — the gap will distort both the stage mean and the change point date.

Case 2 — A blank cell in the value column

A blank value cell may be read as zero. Zero is almost never a valid observation in NHS data — a month with zero incidents, zero admissions, or zero referrals is extremely rare and would be genuine data. A zero created by a blank cell will severely depress the stage mean and create false change points.

Fix: In Excel, filter the value column for blank or zero values before saving as CSV. Delete any row where zero is not a genuine observation. If the period genuinely had no events, use the Data Validator to confirm it before proceeding.


Merged cells in Excel

Excel worksheets often use merged cells for formatting — a date label spanning multiple rows, or a quarter label covering three months. When saved as CSV, merged cells leave blank rows beneath them. The blank rows appear as missing values in the Analyzer.

⚙️ How to unmerge cells in Excel before saving as CSV

Step 1 — Select the entire date column (click the column header letter).

Step 2 — Home tab → Alignment group → click the dropdown arrow on “Merge & Center” → select “Unmerge Cells”.

Step 3 — The unmerged cells below the original merged cell will now be blank. Select those blank cells and fill them with the correct date value (Ctrl+D fills down in Excel).

Step 4 — Save as CSV. Open in a text editor to verify every row has a date.


Numbers with comma separators

Large numbers formatted with comma thousands separators — 1,234,567 — break the CSV structure because the comma is also the column separator. The Analyzer will read 1,234,567 as three separate columns: 1, 234, and 567.

⚙️ How to remove comma separators in Excel before saving as CSV

Step 1 — Select the value column.

Step 2 — Right-click → Format Cells → Number → set Decimal Places to 0 → untick “Use 1000 Separator (,)”.

Step 3 — Save as CSV. Values should now appear as 1234567 without commas.

Alternatively: if your values are already quoted in the CSV (“1,234,567”), this is valid CSV and the Analyzer will parse it correctly — the quotes protect the comma within the value.


Date order — oldest to newest

Bootstrap CUSUM reads your CSV from top to bottom, treating each row as the next observation in time. The series must run from oldest date at the top to newest date at the bottom. If rows are in any other order — newest first, or randomly sorted — the CUSUM accumulation is meaningless and change points will be dated incorrectly or not found at all.

Why order matters — the CUSUM accumulation

The CUSUM statistic accumulates deviations from the mean in sequence. If row 1 is April 2026 and row 2 is January 2019, the algorithm is accumulating a sum across time that is running backwards. The change point it detects will correspond to a position in your CSV rows, not to a real point in time. A chart that looks plausible may be completely wrong if the underlying row order is wrong.

The Data Validator detects out-of-order dates and flags them by row number. But it cannot fix the order for you — that requires sorting the data before saving as CSV.

⚙️ How to sort oldest to newest in Excel and Google Sheets

Microsoft Excel:

Select the entire data range (including headers) → Data tab → Sort → Sort by your date column → Order: Oldest to Newest → OK. Then save as CSV.

Google Sheets:

Select the entire data range → Data → Sort range → Advanced range sorting options → tick “Data has header row” → Sort by your date column → A → Z → Sort. Then File → Download → CSV.

Quick check before saving: Look at the first and last rows. The first row should be your earliest date. The last row should be your most recent date. If not, sort again.

Data exported newest-first is a common trap

Many NHS reporting systems and dashboards export data with the most recent period at the top — this makes sense for reading a table, but is wrong for Bootstrap CUSUM. Always check the sort order of any exported file before uploading. The Data Validator will flag out-of-order rows but will not automatically reverse the series for you.


Minimum data requirements

Bootstrap CUSUM requires a minimum number of data points to produce reliable results. Too few points and the algorithm cannot distinguish genuine structural change from normal variation.

Duplicate dates

If your data has multiple readings per day — hourly observations, shift data, or intraday measurements — duplicate dates are expected and correct. Bootstrap CUSUM will treat each row as a separate observation in sequence.

If each row is meant to represent a unique period (one row per month, one row per week) and the same date appears twice, Bootstrap CUSUM will produce incorrect stage means. This often happens when data is exported from reporting tools that aggregate differently across date ranges.

Fix for unintended duplicates: In Excel, sort by date and scan for repeated values, or use Data → Remove Duplicates on the date column. If both rows contain valid data, merge them (sum or average the values) into a single row before saving as CSV.

Under 20
Not enough data

Fewer than 20 data points is insufficient for Bootstrap CUSUM. The confidence levels will be unreliable. Options: extend the historical series, switch to a lower frequency (quarterly instead of monthly), or use the X-mR chart tab instead.

20–30
Minimum workable

Borderline. Results are valid but confidence levels are sensitive to the exact number of points. Use 95% confidence rather than 99.7% at this length. Treat results as indicative rather than definitive.

30+
Good to go

30 or more data points gives Bootstrap CUSUM reliable power at 95% confidence. 40+ points at 99.7% confidence. Monthly data going back 3–5 years is ideal for most NHS improvement applications.

Too soon after an intervention?

If you implemented an intervention less than 6 months ago, there may not be enough post-intervention data for Bootstrap CUSUM to detect a change point even if one has occurred. This is not a data problem — it is a timing problem. Set a review date and re-run when more data has accumulated. See Decide to wait for the decision rules around review dates.


Run the Data Validator first

Before uploading to the Analyzer, run your CSV through the Data Validator. It checks for the most common problems automatically and gives you a specific diagnosis rather than a generic error message.

What the Validator checks What it finds What to do
Date format consistency Mixed formats in the date column, unrecognised formats, financial year notation Standardise all dates to a single format. Use ISO (YYYY-MM-DD) if in doubt.
Missing date periods Gaps in the date sequence — months or years missing from the series Add missing rows with interpolated or blank values.
Zero and blank values Rows where the value column is zero or blank Delete rows where zero is not a genuine observation. Fill blanks or interpolate.
Column structure Columns split by comma separators within values, extra columns, missing headers Remove comma separators from numeric values. Ensure first row is a header.
Duplicate dates Rows where the same date appears more than once in the date column Sort by date in Excel, merge or delete duplicate rows. Use Data → Remove Duplicates.
Row count Whether the series has at least 20 usable rows Extend the series or reduce frequency if under 20 rows.

Pre-upload checklist

✅ Before you upload to the StepChange Analyzer

☐ Two columns minimum: a date column and a numeric value column. Headers in row 1.

☐ Dates are consistent: all rows use the same format. No financial year notation (2004-05). No blank date cells.

☐ No missing rows: every period in your series has a row, even if the value is estimated or interpolated.

☐ All dates use the same format: no mixing of DD/MM/YYYY with YYYY-MM in the same column. Use the Data Validator to detect mixed formats by row number.

☐ No duplicate dates: each date appears only once. Use Excel Data → Remove Duplicates on the date column to check.

☐ No blank or spurious zero values: any row where the value is blank or zero because of a data export issue has been deleted or filled.

☐ No comma separators in numbers: 1234567 not 1,234,567. Check in a text editor.

☐ No merged cells: all cells in the date column contain a value. Unmerged and filled in Excel before saving as CSV.

☐ At least 20 rows: ideally 30+. If fewer, consider quarterly instead of monthly frequency.

☐ Data Validator passed: run through scanner.html with no errors before uploading to the Analyzer.

Ready? Open the StepChange Analyzer

Your CSV is prepared and the Data Validator has passed. Upload it to the Analyzer and run Bootstrap CUSUM.

▶ Open the StepChange Analyzer