Type I and Type II Errors

Every statistical test can fail in exactly two ways. Which failure you find more tolerable is not a technical question. It is a values question, and the answer determines how the system treats you.

Time: 12 minutes
Requires: Unit 2.7

Opening Hook

A radiologist reads a mammogram and calls it suspicious. The patient is recalled, referred for biopsy, subjected to weeks of anxiety. The biopsy comes back clear. No cancer. The screening test was wrong.

Now consider the other direction. A radiologist reads a mammogram, calls it clear, and sends the patient home. Eighteen months later, the patient is diagnosed with breast cancer that was, in retrospect, visible on that first scan. The test was wrong again, in a different way.

These are not both just “errors.” They are structurally different failures with structurally different consequences. In the first case, a healthy person is told they might be ill. In the second, a sick person is told they are fine. The first error causes fear, expense, and unnecessary medical procedures. The second error allows a disease to progress untreated.

What is the right balance between them? There is no universal answer. The answer depends on what it costs to be wrong in each direction, and that cost is different in every domain. Understanding this is not an abstract statistical exercise. It is the lens through which you should read every screening programme, every diagnostic test, and every criminal verdict.

The Concept

When a statistical test reaches a conclusion, it is comparing evidence against a default assumption. That default assumption is called the null hypothesis, a concept you met in Unit 2.7. Typically, the null hypothesis is the “nothing interesting is happening” position: the drug has no effect, the defendant is innocent, the patient being screened does not have the disease.

The test can be wrong in two ways, and statisticians have given them names.

A Type I error is a false positive. The test rejects the null hypothesis when the null hypothesis is actually true. The mammogram flags suspicious tissue when there is no cancer. The clinical trial declares the drug effective when it is not. The jury convicts when the defendant is innocent. The test has produced a positive result, but that result is wrong.

A Type II error is a false negative. The test fails to reject the null hypothesis when the null hypothesis is actually false. The mammogram misses real cancer. The clinical trial finds no effect when the drug genuinely works. The jury acquits when the defendant is guilty. The test has produced a negative result, but that result is wrong.

The symbol alpha (written α) represents the probability of making a Type I error. It is a threshold you set in advance. If you set alpha at 0.05, you are accepting that, in a world where the null hypothesis is true, you will incorrectly reject it 5 percent of the time. This is the same 0.05 threshold you encountered in the p-value discussion in Unit 2.7: declaring a result “statistically significant” at p < 0.05 means you are willing to tolerate a 5 percent false positive rate.

The symbol beta (written β) represents the probability of making a Type II error. It is the probability that the test misses a real effect when one genuinely exists. A test with a high beta is one that is poor at detecting the things it is supposed to detect.

Power is defined as 1 minus beta. It is the probability that the test correctly detects a real effect when one exists. A high-powered test is sensitive: it finds things. A low-powered test is blunt: it misses things. Well-designed studies aim for power of 80 percent or higher, meaning a beta of no more than 0.20.
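To make these definitions concrete, here is a minimal Monte Carlo sketch in Python. It is not drawn from any study discussed in this unit; the sample size, effect size, and number of simulated trials are illustrative assumptions chosen only to show how alpha, beta, and power behave.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=1)
n, alpha, trials = 50, 0.05, 10_000   # illustrative choices, not from the unit

def rejection_rate(true_effect):
    """Fraction of simulated studies in which the test rejects the null at `alpha`."""
    rejections = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n)          # null group: mean 0
        treated = rng.normal(true_effect, 1.0, n)  # comparison group
        _, p = ttest_ind(control, treated)
        if p < alpha:
            rejections += 1
    return rejections / trials

# Null hypothesis true (no real effect): every rejection is a false positive.
print(rejection_rate(0.0))   # ~0.05, i.e. alpha, the Type I error rate

# Null hypothesis false (a real effect of 0.5 standard deviations):
# rejections are correct detections.
print(rejection_rate(0.5))   # ~0.70, i.e. power, so beta is ~0.30 here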

Here is the trade-off, and this is the thing that matters. For a given amount of data, reducing one type of error increases the other. If you lower your alpha threshold from 0.05 to 0.01, you become stricter about what counts as evidence. Fewer false positives will get through. But the same strictness means you will also miss more real effects, because borderline signals will no longer clear the threshold. Your Type II error rate goes up. Your power goes down.

You cannot simply decide to have fewer of both types of error without doing something else, usually collecting more data. This is not a design choice that can be wished away. It is a mathematical consequence of the structure of the problem.
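The trade-off shows up directly in a standard power calculation. The sketch below uses statsmodels' analytic power formulas for a two-sample t-test; the effect size (Cohen's d of 0.5) and the sample sizes are illustrative assumptions, not figures from any study mentioned in this unit.

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# With the data held fixed (50 people per group), tightening alpha costs power.
for alpha in (0.05, 0.01):
    power = analysis.power(effect_size=0.5, nobs1=50, alpha=alpha)
    print(f"alpha = {alpha:.2f}, n = 50 per group -> power = {power:.2f}")
# alpha = 0.05 gives power of roughly 0.70; alpha = 0.01 drops it to roughly 0.45.

# The only way to hold power steady while tightening alpha is to collect more data.
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.01, power=0.70)
print(f"per-group sample size needed for power 0.70 at alpha = 0.01: {n_needed:.0f}")
```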

The key insight, and the reason this unit exists in a self-defence curriculum rather than an academic statistics course, is that the choice of where to set this balance is not a statistical question. It is a values question. It is a question about what kinds of mistakes you are willing to make, who bears the cost of those mistakes, and whether the people designing the system have the same interests as the people subjected to it.

Why It Matters

Breast cancer screening is the standard teaching example because the stakes are high and the trade-off is visible. Mammography screening programmes produce both error types at meaningful rates. False positives recall healthy women for further investigation. In the United States, a woman who undergoes annual mammography screening from age 40 to 50 has roughly a 60 percent cumulative probability of at least one false-positive recall over that decade. In the United Kingdom, the NHS Breast Screening Programme estimates that for every life saved by screening, approximately three women are diagnosed with and treated for a cancer that would never have caused them harm in their lifetime. That is a Type I error in a different sense: not “the test was wrong” but “the condition found would not have progressed.” The term for this is overdiagnosis, and it sits in the same conceptual family.
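The 60 percent cumulative figure above is a compounding effect, not a claim that any single mammogram is wrong most of the time. As a rough sanity check, assume a per-screen false-positive recall rate of about 9 percent, an illustrative figure broadly in line with reported US recall rates; the estimate quoted above comes from the screening literature (see References), not from this calculation.

```python
# Probability of at least one false-positive recall over repeated screening,
# assuming each screen is independent with the same recall rate (a simplification).
per_screen_fp = 0.09   # assumed per-mammogram false-positive recall rate
screens = 10           # annual screening from age 40 to 50

p_at_least_one = 1 - (1 - per_screen_fp) ** screens
print(f"{p_at_least_one:.0%}")   # ~61%: over the decade, a false alarm is more likely than not
```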

The programme designers have made a judgement call. They have decided that the costs of missing real cancers (deaths that could have been prevented) are worse than the costs of recalling healthy women (anxiety, unnecessary biopsies, occasional overtreatment). Reasonable people can disagree about whether that judgement is correct, and the disagreement is not about statistics; it is about values. But you can only participate in that debate if you understand the structure of the trade-off.

Criminal justice reveals a different set of priorities. The legal standard of proof in most common-law systems, “beyond reasonable doubt”, is a deliberate institutional preference for minimising Type I errors. Convicting an innocent person is treated as categorically worse than acquitting a guilty one. Blackstone’s formulation from 1769 puts it directly: “Better that ten guilty persons escape than that one innocent suffer.” The presumption of innocence is the statistical prior, and the high burden of proof is a low alpha threshold. The system tolerates a high Type II error rate (guilty people going free) to keep Type I errors (innocent people convicted) rare.

This is a coherent choice, but it has consequences. A criminal justice system that systematically prioritises Type I error reduction will let more guilty people walk. A prosecution service that knows this will face pressure to bring only the strongest cases, potentially declining to prosecute where evidence is genuine but incomplete. How you feel about that depends on what you think the system is for and whose protection matters most.

Policy decisions operate in the same space. A food safety regulator deciding whether to ban a pesticide faces a Type I error if it bans a safe chemical (unnecessary cost to farmers and consumers) and a Type II error if it allows a harmful one (public health damage). A financial regulator deciding whether to approve a new banking product faces the same structure. Different regulators in different political cultures have made systematically different choices about which error type worries them more, and those choices have observable consequences for which industries get blocked and which risks get through.

How to Spot It

In 2009, the United States Preventive Services Task Force changed its guidance on mammography, recommending that routine screening begin at 50 rather than 40. The reaction was immediate and fierce. Major medical organisations, patient advocates, and politicians condemned the guidance as setting women up to have cancers missed.

What was actually happening was a recalibration of the error trade-off. The task force had looked at the evidence and concluded that screening women in their 40s, an age group in which breast cancer rates are lower than among older women, produced a high ratio of false positives to true positives. The harms of those false positives (anxiety, unnecessary procedures, small but real complication rates from biopsies) were being weighed against the benefit of catching the relatively small number of genuine cancers in that age group. Moving the starting age to 50 was not a decision to ignore cancer. It was a decision to accept a higher Type II error rate in a low-prevalence population in exchange for substantially fewer Type I errors.

The tell is this: when you see an outcry framed as “the system is abandoning patients” or “the rules are too lax,” look for what has actually changed. Very often, a policy change that looks like negligence from one angle is a deliberate shift in the Type I/Type II balance from another. The question is not whether errors exist, because they always will. The question is who bears them, and whether the people who designed the balance are the same people who pay when it goes wrong.

In the 2009 mammography case, the conflict was visible in part because the patients most affected by false positives (younger women subject to unnecessary biopsies) were different from the loudest advocates for unchanged screening (oncologists and cancer charities whose professional and institutional interests were aligned with broader detection). The error-cost asymmetry, and the asymmetry of who suffers which error, was the real story. It was almost never reported as such.

Your Challenge

A government health agency proposes a new screening programme for a rare but treatable genetic condition. The condition affects approximately 1 in 5,000 people in the general population. The proposed test has a false positive rate of 2 percent and a false negative rate of 0.5 percent.

If the programme screens 1,000,000 people per year, roughly 200 people in that population have the condition. Work through the numbers: how many false positives will the programme generate annually? How many false negatives? Who bears the cost of each error type? The false positives will be healthy people subjected to follow-up tests, genetic counselling, and potential anxiety about their own health and the implications for their children. The false negatives will be people with the condition who are not identified and therefore not treated.
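If you want to check your own working, here is a minimal sketch of the arithmetic, using only the figures given above. The values question that follows is, deliberately, not answered by it.

```python
# Expected annual error counts for the proposed screening programme.
screened   = 1_000_000
prevalence = 1 / 5_000   # 1 in 5,000 people have the condition
fpr        = 0.02        # false positive rate: P(test positive | healthy)
fnr        = 0.005       # false negative rate: P(test negative | affected)

affected = screened * prevalence          # about 200 people with the condition
healthy  = screened - affected

false_positives = healthy * fpr           # healthy people flagged for follow-up
false_negatives = affected * fnr          # affected people sent home undetected
true_positives  = affected - false_negatives

print(f"false positives per year: {false_positives:,.0f}")   # ~20,000
print(f"false negatives per year: {false_negatives:,.1f}")   # ~1
print(f"false positives per true positive: {false_positives / true_positives:.0f}")  # ~100
```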

Now ask the question that the programme designers should have asked, and that the press release will not answer. Is the cost of a false positive the same as the cost of a false negative? If the condition is treatable and treatment is highly effective, a false negative represents a real failure with real consequences. If the follow-up for a positive result is expensive, invasive, or distressing, the false positives represent a real harm too. Where should the threshold be set, and who should decide?

There is no answer on this page. That is the point.

References

Cumulative false positive mammography rates: Elmore JG, Barton MB, Moceri VM, Polk S, Arena PJ, Fletcher SW. “Ten-year risk of false positive screening mammograms and clinical breast examinations.” New England Journal of Medicine 1998;338(16):1089-1096. The 60 percent cumulative estimate over a decade of annual screening is drawn from this and subsequent replication studies.

NHS overdiagnosis estimate: Independent UK Panel on Breast Cancer Screening. “The benefits and harms of breast cancer screening: an independent review.” The Lancet 2012;380(9855):1778-1786. The panel estimated approximately three overdiagnosed cases for every life saved.

US Preventive Services Task Force 2009 mammography guidance: US Preventive Services Task Force. “Screening for Breast Cancer: US Preventive Services Task Force Recommendation Statement.” Annals of Internal Medicine 2009;151(10):716-726.

Blackstone’s formulation: Blackstone W. Commentaries on the Laws of England, Volume 4. Oxford: Clarendon Press, 1769. “It is better that ten guilty persons escape, than that one innocent suffer.” Book IV, Chapter 27.

Alpha, beta, and statistical power: Cohen J. Statistical Power Analysis for the Behavioral Sciences. 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988. The standard reference for power analysis.