
P-Hacking and Data Dredging

Statistical significance can be manufactured without fraud, simply by making enough analysis choices until the right number appears. This is p-hacking, and it is built into how academic publishing works. Understanding it changes how you read every headline that says “scientists find”.

Time: 15 minutes

Opening Hook

A researcher runs an experiment on 200 people and records 50 different variables: age, weight, how much sleep they got, whether they ate breakfast, their stress levels, their income, their mood, and 43 other things. She tests each variable for a relationship with the health outcome she is studying. After running all 50 tests, one comes back with p = 0.04. She writes it up as a finding: “Low sleep duration associated with increased risk of outcome X.” The paper gets published.

Nothing in this story involves fraud. No data was invented. No results were hidden. The researcher did not set out to deceive anyone. She may have genuinely believed she had discovered something real.

She had not. She had done something that produces findings indistinguishable from fraud, using entirely legitimate statistical tools, in a process that is standard practice in dozens of scientific fields. The finding will probably not replicate if anyone tries. The published literature is full of findings like it.

The Concept

P-hacking is the practice of adjusting, selecting, or re-running analyses until a p-value below 0.05 is achieved. The term covers a range of behaviours, from the obviously suspect to the entirely unconscious, but the effect is the same: the p-value loses its meaning.

To see why, recall what a p-value of 0.05 actually means. It means: if there were truly no effect, you would observe data this extreme by chance 5 percent of the time. That 5 percent figure is the false positive rate for a single test on a single dataset. Run twenty tests, and the probability that at least one of them produces a false positive by chance alone is not 5 percent. It is roughly 64 percent. Run fifty tests and you will almost certainly get at least one false positive, even if none of the underlying relationships are real. The p-value was calibrated for one test. It is being used for fifty. The maths no longer holds.
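The 64 percent figure follows from simple arithmetic: if each test has a 5 percent false positive rate and the tests are independent, the chance that at least one fires is 1 − 0.95^k. A minimal Python sketch:

```python
# Chance of at least one false positive across k independent tests,
# each run at the 0.05 threshold, when every null hypothesis is true.
def familywise_error_rate(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 20, 50):
    print(f"{k:3d} tests -> {familywise_error_rate(k):.1%} chance of a false positive")
```

At fifty tests the chance exceeds 92 percent, which is why a lone p = 0.04 from a fifty-variable screen carries almost no evidential weight on its own.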

This is the false discovery rate problem. When you test many hypotheses simultaneously, the proportion of your “significant” findings that are actually noise can be very high, even when every individual test is conducted correctly. If you test 50 variables, each with a true null effect, and use a 0.05 threshold, you expect roughly 2 to 3 false positives just by chance. Find any one of them, write it up as a discovery, and you have published a false finding without doing anything technically wrong.
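The expected count of 2 to 3 is easy to verify by simulation. The sketch below is illustrative rather than drawn from any real study: it assumes two groups of 100 standard-normal observations per variable and a simple two-sided z-test, then counts how many of 50 truly null tests clear the 0.05 threshold.

```python
import math
import random

random.seed(0)

def p_value(a, b):
    # Two-sided z-test for a difference in means; both samples have variance 1.
    diff = sum(a) / len(a) - sum(b) / len(b)
    z = diff / math.sqrt(1 / len(a) + 1 / len(b))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 50 variables, none with any real effect: both groups come from the
# same distribution, so every "significant" result is pure noise.
n, trials = 100, 100
counts = []
for _ in range(trials):
    hits = sum(
        1
        for _ in range(50)
        if p_value([random.gauss(0, 1) for _ in range(n)],
                   [random.gauss(0, 1) for _ in range(n)]) < 0.05
    )
    counts.append(hits)

avg = sum(counts) / trials
print(f"average false positives per 50-test batch: {avg:.2f}")
# expected value is 50 x 0.05 = 2.5
```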

The statistical term for the choices a researcher makes in designing and running an analysis is researcher degrees of freedom: which participants to include or exclude, which outcome variable to prioritise, which covariates to adjust for, when to stop collecting data, how to handle missing values. Each of these choices is a fork in the road. Any single choice is defensible. But the accumulation of choices, especially when made after the data has been seen, inflates the false positive rate in ways that are invisible in the final published paper.

Andrew Gelman and Eric Loken gave this problem a name that has stuck: the garden of forking paths. Their 2013 paper argued that the problem does not require deliberate fishing. A researcher can reach a false positive through a sequence of perfectly reasonable, locally justified decisions, each of which seemed sensible given what the data looked like at the time. The path to significance is paved with good intentions.

The result is that the 0.05 threshold, which appears to mean “a 5 percent chance of a false positive,” can in practice mean a 50 percent or higher chance, depending on how many paths were available and how many were tried. The published paper shows only the path that worked. The reader has no way to know how many paths were attempted.
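The effect of forking paths can be shown with a toy simulation. The four “paths” below (analyse everyone, exclude outliers, analyse each recruitment wave separately) are invented for illustration; every dataset is pure noise, yet the chance that at least one path reaches p < 0.05 is well above the nominal 5 percent.

```python
import math
import random

random.seed(1)

def p_value(a, b):
    # Two-sided z-test for a difference in means; variance assumed to be 1.
    diff = sum(a) / len(a) - sum(b) / len(b)
    z = diff / math.sqrt(1 / len(a) + 1 / len(b))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def analysis_paths(treat, control):
    # Four locally defensible analyses of the same null data.
    yield treat, control                              # everyone
    yield ([x for x in treat if abs(x) < 2],
           [x for x in control if abs(x) < 2])        # "exclude outliers"
    yield treat[:50], control[:50]                    # "first wave only"
    yield treat[50:], control[50:]                    # "second wave only"

trials = 2000
hits = 0
for _ in range(trials):
    treat = [random.gauss(0, 1) for _ in range(100)]
    control = [random.gauss(0, 1) for _ in range(100)]
    if any(p_value(a, b) < 0.05 for a, b in analysis_paths(treat, control)):
        hits += 1

rate = hits / trials
print(f"at least one 'significant' path: {rate:.1%}")  # well above 5 percent
```

Only one of these analyses appears in the published paper; the reader sees a clean p < 0.05 and none of the forks.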

Data dredging is the more aggressive version of the same thing: systematically testing every available variable, every available subgroup, every available time period, until something significant emerges. Sometimes called “fishing” or “mining,” it is especially common in fields where large datasets exist and the theoretical basis for choosing which variables to test is weak. Nutritional epidemiology, psychology, and certain areas of genomics have been particularly affected.

Why It Matters

The practical consequence is a published scientific literature that contains a large proportion of false positive findings. John Ioannidis’s 2005 paper, “Why Most Published Research Findings Are False,” made this argument rigorously. His point was not that scientists are dishonest. His point was that the incentive structure of academic publishing, combined with small samples, flexible analysis, and the 0.05 threshold, produces a literature where the majority of reported findings in many fields will not hold up.

Dietary supplement research is one of the clearest examples. Studies regularly find associations between specific nutrients and health outcomes. Many of these are published, picked up by the press, and influence public behaviour and purchasing. When pre-registered trials are run to test the same hypotheses under controlled conditions, the effects largely disappear. The published literature on omega-3 fatty acids, vitamin E, beta-carotene, and many other supplements showed strong protective effects in observational studies. Large pre-registered trials found either no effect or, in some cases, harm. The observational findings were not fabricated. They were almost certainly the product of flexible analysis applied to observational data with many variables in play.

Underpowered studies make things worse. A study with a small sample has low statistical power, meaning it will often fail to detect real effects. But when a small study does produce a significant result, that result is disproportionately likely to be a false positive or an extreme overestimate of the true effect. Small underpowered studies that have been mined for positive results are where much of the junk in the scientific literature originates.
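This inflation of significant estimates from small studies is easy to reproduce in simulation. The numbers below (a true effect of 0.2 standard deviations, 20 participants per group) are illustrative assumptions, not from the source: the study detects the real effect only rarely, and when it does, the estimate is far larger than the truth.

```python
import math
import random

random.seed(2)

def p_value(a, b):
    # Two-sided z-test for a difference in means; variance assumed to be 1.
    diff = sum(a) / len(a) - sum(b) / len(b)
    z = diff / math.sqrt(1 / len(a) + 1 / len(b))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

true_effect, n, trials = 0.2, 20, 20000
significant = []
for _ in range(trials):
    treat = [random.gauss(true_effect, 1) for _ in range(n)]
    control = [random.gauss(0, 1) for _ in range(n)]
    if p_value(treat, control) < 0.05:
        significant.append(sum(treat) / n - sum(control) / n)

power = len(significant) / trials
avg = sum(significant) / len(significant)
print(f"power: {power:.0%}")                    # low: most real effects are missed
print(f"mean significant estimate: {avg:.2f}")  # well above the true 0.2
```

A literature built only from the significant results of studies like this systematically overstates every effect it reports.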

How to Spot It

The most thoroughly documented case of p-hacking in recent academic history is the Cornell Food and Brand Lab, run by Brian Wansink until his resignation, announced in 2018 and effective in 2019.

Wansink was a prolific and widely cited food marketing researcher who published dozens of studies on how portion size, plate size, packaging, and environmental cues influence eating behaviour. His findings were cheerful and intuitive: people eat more from bigger bowls; children eat more vegetables when they are given funny names; diners underestimate how much they eat at all-you-can-eat buffets. His work was extensively covered in the press and influenced public health policy and school nutrition programmes.

In 2016, Wansink published a blog post describing how he had assigned a visiting graduate student to re-analyse data from a failed experiment. The experiment, conducted at an Italian restaurant, had not produced the results he was looking for. He instructed the student to break the diners into every possible subgroup: males, females, lunch diners, dinner diners, people eating alone, people eating in groups, people who ordered alcohol, people who ordered soft drinks, people sitting close to the buffet, people sitting far away, and so on, and then to test everything against everything else until something emerged that “went virally big time.” The blog post was presented as an inspirational story about persistence and scientific creativity.

Statisticians reading it recognised it as a description of systematic data dredging. A subsequent investigation by journalists at BuzzFeed News obtained emails between Wansink and his co-authors that showed the pattern in detail: datasets mined repeatedly for publishable results, analyses adjusted until significant findings appeared, papers written around whatever the data happened to produce.

Cornell’s formal investigation found that Wansink had committed academic misconduct, including “misreporting of research data, problematic statistical techniques, failure to properly document and preserve research results, and inappropriate authorship.” Thirteen of his papers were retracted, one of them twice. At least fifteen more were corrected. The research base for several public health recommendations was invalidated.

The tell is the absence of a pre-registered analysis plan. In any study where the hypotheses and analysis methods were not locked down before data collection, you have no way of knowing how many analytical paths were tried before the published result was selected. When a paper reports a surprising or counterintuitive finding from an observational dataset with many variables, and does not cite a pre-registered protocol, apply significant scepticism. The finding may be real. But you have no mechanism for knowing whether it is.

A secondary tell is an unusually large number of findings from a single dataset. If a research group regularly publishes three or four separate papers from one study, with each paper reporting a different significant result, this is consistent with data dredging: the same dataset was tested until multiple publishable thresholds were crossed. Each paper will look like independent evidence. It is not.

Your Challenge

A nutrition researcher collects data on 1,000 adults, measuring 30 different dietary and lifestyle variables: consumption of red meat, processed food, vegetables, alcohol, coffee, sleep duration, exercise frequency, and so on. She tests each variable for association with a cardiovascular health outcome using a standard statistical test, with a 0.05 significance threshold.

One variable, daily coffee consumption, comes back at p = 0.038. She writes a paper concluding that coffee consumption is significantly associated with better cardiovascular health.

Before accepting this finding, consider: how many false positives would you expect from 30 independent tests at a 0.05 threshold, even if none of the variables had any real effect? If coffee is the one that cleared the threshold, what does the p-value of 0.038 actually tell you now? What would you need to see before treating this finding as meaningful evidence that coffee is genuinely protective?

There is no answer on this page. That is the point.

References

Wansink, B. (2016). “The grad student who never said no.” Healthier & Happier (Cornell Food and Brand Lab blog). The original blog post that prompted wider scrutiny. Archived copies are available via the Wayback Machine.

Lee, S.M. (2018). “Here’s how Cornell scientist Brian Wansink turned shoddy data into viral studies about how we eat.” BuzzFeed News. URL: https://www.buzzfeednews.com/article/stephaniemlee/brian-wansink-cornell-p-hacking The primary journalistic investigation, drawing on email correspondence between Wansink and collaborators.

Cornell University (2018). Statement on the investigation of Brian Wansink. Reported by Retraction Watch, 20 September 2018. URL: https://retractionwatch.com/2018/09/20/beleaguered-food-marketing-researcher-brian-wansink-announces-his-retirement-from-cornell/ Full text of Cornell’s misconduct finding and summary of the investigation outcome.

Gelman, A. and Loken, E. (2013). “The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ‘fishing expedition’ or ‘p-hacking’ and the research hypothesis was posited ahead of time.” Unpublished manuscript. URL: https://sites.stat.columbia.edu/gelman/research/unpublished/p_hacking.pdf Introduced the garden of forking paths framing and distinguished it from deliberate data fishing.

Simmons, J.P., Nelson, L.D. and Simonsohn, U. (2011). “False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant.” Psychological Science, 22(11), 1359–1366. URL: https://pubmed.ncbi.nlm.nih.gov/22006061/ Introduced the term “researcher degrees of freedom” and demonstrated through simulation that flexible analysis produces false positive rates far above the nominal 5 percent.

Ioannidis, J.P.A. (2005). “Why most published research findings are false.” PLoS Medicine, 2(8), e124. URL: https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.0020124 The paper that formally modelled the conditions under which a majority of published positive findings are false positives.

Nosek, B.A. et al. (2015). “Estimating the reproducibility of psychological science.” Science, 349(6251), aac4716. URL: https://www.science.org/doi/10.1126/science.aac4716 The large-scale replication project that found fewer than half of published psychological findings replicated.