Cherry-Picking — Time Periods, Subgroups, and Studies
Any dataset contains patterns. Choosing which of them to present, and which to leave out, is a powerful tool for misleading without lying. This unit covers cherry-picked time periods, subgroup analyses, and selective citation of studies.
A finance minister stands at a podium and shows a graph. GDP has been rising steadily, he says, and the line on the screen confirms it. What he does not mention is where the graph begins: not at the start of his government’s term, not at the last election, but at the precise trough of the recession that preceded him. The recovery that his predecessors were already engineering when they lost office does not appear. The entire upward slope belongs, visually and apparently, to him.
No number in that graph is invented. The line is accurate. The manipulation is entirely in the choice of where to start.
The concept
Cherry-picking is the selection of evidence to present a misleading picture. It does not require fabrication. It requires only that someone with access to a large body of data chooses which piece to show you.
It operates in three main forms, and you should be able to name all three.
Time period selection is the most common. Any trend looks different depending on where you start the clock. A company’s share price that has fallen 40% over five years might still show a 15% gain over the last six months. A government’s job creation record might look strong from the month they took office or weak from the month of the previous peak. There is almost always a start date that flatters the presenter and a start date that does not, and the one you get shown is rarely chosen at random. As the Unit 0.4 graph section established, the visual impression of a chart is set before the reader has processed a single number. A shrewdly chosen start date exploits exactly that.
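To see how mechanical the trick is, here is a minimal sketch in Python. The monthly prices are invented for illustration; what matters is that all three percentages describe the same series.

```python
# Cherry-picking a start date: one series, three different stories.
# The prices below are invented for illustration.
prices = {
    "2019-01": 100.0,   # five years ago
    "2022-06": 55.0,    # mid-slump
    "2023-07": 52.0,    # six months ago
    "2024-01": 60.0,    # today
}

def pct_change(start: str, end: str = "2024-01") -> float:
    """Percentage change from the chosen start date to today."""
    return 100 * (prices[end] - prices[start]) / prices[start]

for start in ("2019-01", "2022-06", "2023-07"):
    print(f"since {start}: {pct_change(start):+.0f}%")

# since 2019-01: -40%   <- the five-year story
# since 2022-06: +9%    <- the 'recovery began under us' story
# since 2023-07: +15%   <- the press-release story
```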
Subgroup analysis is the pharmaceutical industry's preferred version. A clinical trial typically tests whether a treatment works on a defined population. But the raw data contains many subgroups: men and women, different age bands, people with and without complicating conditions, different ethnic groups, patients from different centres. After the trial is done, a researcher or a sponsor with a commercial interest can divide the data however they like and look at each slice separately. If you test enough slices, one of them will show a positive result by chance alone, even if the treatment does nothing overall. This is not a theoretical risk; it is built into the arithmetic of significance testing. Run twenty subgroup analyses at the conventional 0.05 threshold and, on average, one will appear significant even if the drug is inert.
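The twenty-slices claim can be checked directly by simulation. A sketch, assuming numpy and scipy are available: both arms are drawn from the same distribution, so the "drug" is inert by construction, and every significant subgroup is a false positive.

```python
# An inert 'drug' tested across 20 post-hoc subgroups, many times over.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_trials = 1000        # simulated trials
n_subgroups = 20       # slices examined per trial
per_arm = 50           # patients per arm per subgroup

significant = 0
for _ in range(n_trials):
    for _ in range(n_subgroups):
        treated = rng.normal(0, 1, per_arm)   # no true effect
        placebo = rng.normal(0, 1, per_arm)
        _, p = stats.ttest_ind(treated, placebo)
        significant += p < 0.05

print(f"mean 'significant' subgroups per trial: {significant / n_trials:.2f}")
# Prints roughly 1.00: one positive slice per trial, from pure noise.
```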
The term for the broader problem is the garden of forking paths, a phrase the statisticians Andrew Gelman and Eric Loken borrowed from a Borges story. The garden refers to all the choices an analyst makes before arriving at a result: which subjects to include, which outcome to measure, which time point to report, which covariates to control for. Each choice is a fork. By the time a result emerges, an enormous number of path combinations have been implicitly tried, even if the analyst was not consciously trying to manufacture a result. The final path presented to you looks like a single, clean analysis. The garden behind it is invisible.
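The inflation does not require conscious fishing, and a small simulation makes the point. In the sketch below the null is true throughout; the analyst's only liberties, taken after seeing the data, are choosing which of two outcome measures to report and whether to trim outliers. The four-path setup is an invented example, not one from Gelman and Loken's paper.

```python
# Forking paths, simulated: four 'reasonable' analyses per dataset,
# report whichever gives the smallest p-value. The null is always true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_datasets = 2000
n = 40  # patients per arm

reported = 0
for _ in range(n_datasets):
    # Two outcome measures per patient, no true treatment effect.
    treated = rng.normal(0, 1, (n, 2))
    placebo = rng.normal(0, 1, (n, 2))
    ps = []
    for outcome in (0, 1):                    # fork 1: which outcome to report
        for trim in (False, True):            # fork 2: trim outliers or not
            a, b = treated[:, outcome], placebo[:, outcome]
            if trim:
                a, b = a[np.abs(a) < 2], b[np.abs(b) < 2]
            ps.append(stats.ttest_ind(a, b).pvalue)
    reported += min(ps) < 0.05                # the best path gets written up

print(f"'significant' findings under a true null: {reported / n_datasets:.1%}")
# Prints well above the nominal 5%: roughly triple it in this setup.
```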
Study selection is the same logic applied across an entire body of literature. A pharmaceutical company funding ten trials of a new drug and publishing only the seven that show positive results is not fabricating data. The seven published papers are internally valid. But the picture they create for doctors, regulators, and patients is systematically wrong. This is closely related to publication bias, which gets its full treatment in Unit 3.11. The relevant point here is that cherry-picking studies to cite in a report or presentation is the same manoeuvre as cherry-picking a start date, just with papers instead of months.
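The same simulation logic scales up to whole literatures. A sketch of the ten-trials example, again with invented numbers: the drug has no effect at all, and only trials whose point estimate favours it get written up.

```python
# Study selection: every published trial is internally valid,
# but the published literature is systematically biased.
import numpy as np

rng = np.random.default_rng(2)
n_batches = 1000       # simulated drug programmes
trials_per_batch = 10
n = 100                # patients per arm per trial

all_means, published_means = [], []
for _ in range(n_batches):
    effects = []
    for _ in range(trials_per_batch):
        treated = rng.normal(0, 1, n)   # true effect: zero
        placebo = rng.normal(0, 1, n)
        effects.append(treated.mean() - placebo.mean())
    all_means.append(np.mean(effects))
    favourable = [e for e in effects if e > 0]   # the ones that see print
    if favourable:                               # all-null batches stay in the drawer
        published_means.append(np.mean(favourable))

print(f"pooled estimate, all ten trials: {np.mean(all_means):+.2f}")
print(f"pooled estimate, published only: {np.mean(published_means):+.2f}")
# The first prints near 0.00; the second prints a solid positive effect.
```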
The structural solution to all three forms is pre-registration: committing to the analysis plan before seeing the data. A pre-registered trial specifies in advance which subgroups will be analysed, which outcome measure is primary, and how long the follow-up period will be. A pre-registered study cannot be accused of cherry-picking its subgroup after the fact, because the subgroups were declared before data collection began. Pre-registration does not forbid subgroup analyses; it forces researchers to distinguish between the analyses they planned and the ones they invented after looking at the data. The planned ones carry evidential weight. The post-hoc ones are, at best, hypotheses worth testing in a fresh study.
Why it matters
Political performance claims run almost entirely on cherry-picked time periods. The same underlying economic data supports whatever conclusion you set out to reach. Growth since the last recession trough looks like success; growth since the previous peak looks like failure. Unemployment since the government took office can look dramatically different from unemployment since the onset of the global downturn that preceded them. This is not a feature of dishonest governments; it is a feature of all political communication, because the incentive to choose favourable data is structural. No communications team presents the start date that makes their leader look worst.
In medicine, the costs are literal. The ENHANCE trial, completed in 2006, tested whether ezetimibe added to a statin reduced arterial plaque in patients with familial hypercholesterolaemia. The drug failed its primary endpoint: after two years, the combination did not reduce intima-media thickness of the carotid artery wall compared to the statin alone. Merck and Schering-Plough, the companies behind the drug, delayed releasing these results for approximately eighteen months. During that delay, the drug continued to be prescribed and reimbursed by government health programmes. The delay was not random.
In both cases, the mechanism is the same. The data exists. The evidence is real. The selection of what to show you is anything but neutral.
How to spot it
The best-documented illustration of subgroup cherry-picking in medical history was performed by scientists who were trying to demonstrate the absurdity of the practice.
In 1988, the ISIS-2 trial published results showing that aspirin given after a heart attack dramatically reduced the risk of dying within the next month, with a p-value so extreme it was effectively indisputable. During peer review, The Lancet asked the authors to break down the results by patient subgroup. The Oxford team considered the request scientifically unsound but decided to comply in an unusual way. They divided the patients by birth sign, using the categories from a newspaper horoscope column.
The result was exactly what statistics predicts when you run enough subgroup tests. Aspirin appeared to confer no benefit on patients born under Libra or Gemini. Judged by those two slices alone, a treatment whose overall benefit was established at p < 0.00001 looked like it did nothing. The authors published the zodiac analysis alongside the main result to demonstrate, with conspicuous clarity, what subgroup data-dredging looks like.
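The shape of the demonstration can be reproduced without the trial data. The sketch below simulates a treatment with a genuinely large benefit, using mortality rates that approximate ISIS-2's published 5-week figures (about 9.4% on aspirin versus 11.8% on placebo, an assumption made here for illustration), then slices the patients into twelve random "birth signs". The overall test is overwhelming; many individual signs fail to show the benefit, and occasionally one points the wrong way.

```python
# A simulated ISIS-2: a real, large benefit, sliced twelve ways at random.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_per_arm = 8500                            # roughly ISIS-2 scale
deaths_a = rng.random(n_per_arm) < 0.094    # aspirin arm, ~9.4% mortality
deaths_p = rng.random(n_per_arm) < 0.118    # placebo arm, ~11.8% mortality
signs_a = rng.integers(0, 12, n_per_arm)    # 'birth sign', assigned at random
signs_p = rng.integers(0, 12, n_per_arm)

def fisher_p(d_a, d_p):
    """Two-sided Fisher's exact test on a 2x2 deaths/survivors table."""
    table = [[d_a.sum(), len(d_a) - d_a.sum()],
             [d_p.sum(), len(d_p) - d_p.sum()]]
    return stats.fisher_exact(table)[1]

print(f"all patients: p = {fisher_p(deaths_a, deaths_p):.1e}")
for sign in range(12):
    a = deaths_a[signs_a == sign]
    p = deaths_p[signs_p == sign]
    pval = fisher_p(a, p)
    note = "   <- 'aspirin does not work here'" if pval > 0.05 else ""
    print(f"sign {sign + 1:2d}: {a.mean():.1%} vs {p.mean():.1%}, p = {pval:.2f}{note}")
```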
The tell is consistent across all three forms of cherry-picking. Ask one question: what were the other options?
For a time period graph, the question is: where else could this graph have started? If you extend the x-axis backwards by two years, does the story change? If the presenter cannot explain why this particular start point was chosen, the choice is not innocent.
For a subgroup analysis, the question is: was this subgroup specified before the trial ran? If a drug trial reports a benefit in women aged 45 to 55 and you search the original trial registration, that subgroup should appear in the protocol filed before data collection began. If it does not appear there, it was found after the fact, and finding it after the fact means nothing: the data contained many slices and this was the one that worked.
For study citations, the question is: how many studies exist on this topic, and have the negative ones been cited? A review that cites eight positive studies without mentioning the twelve null results is not a review. It is a selection.
The pre-registration status of a clinical trial is public information. Trials registered at ClinicalTrials.gov in the US or the ISRCTN registry in the UK show the protocol before data collection. The primary endpoint, the planned subgroups, and the follow-up period are all there. When a paper reports a result that differs from the registered protocol, that divergence is itself evidence of a forked path.
Your challenge
A pharmaceutical company publishes a press release announcing positive results for their new arthritis drug. The headline says the drug “significantly reduced joint inflammation in the most affected patients.” Reading further, you find that the overall trial population showed no statistically significant improvement compared to placebo. The company notes that the subgroup of patients with “severe baseline inflammation” did show a significant effect, with p = 0.03.
The trial enrolled 800 patients. You look up the trial registration and find that the primary endpoint was improvement across the full trial population. The severe baseline inflammation subgroup is not mentioned in the protocol.
What is the evidential value of the p = 0.03 result? What should you look for before accepting it as evidence that the drug works? And roughly how many of the 800 patients would you estimate fall into the "severe baseline inflammation" subgroup, given that slices small enough for chance to dominate are exactly where results like this tend to appear?
References
ISIS-2 zodiac subgroup analysis: ISIS-2 Collaborative Group. “Randomised trial of intravenous streptokinase, oral aspirin, both, or neither among 17,187 cases of suspected acute myocardial infarction.” The Lancet, 1988. The zodiac subgroup analysis is discussed in: Sleight P. “Debate: Subgroup analyses in clinical trials: fun to look at, but don’t believe them.” Current Controlled Trials in Cardiovascular Medicine, 2000; 1(1): 25–27. See also: PMC59592.
Garden of forking paths: Gelman A and Loken E. “The garden of forking paths: Why multiple comparisons can be a problem, even when there is no ‘fishing expedition’ or ‘p-hacking’ and the research hypothesis was posited ahead of time.” Columbia University Statistics Department working paper, 2013. Available at: sites.stat.columbia.edu/gelman/research/unpublished/p_hacking.pdf
ENHANCE trial and publication delay: Kastelein JJP et al. “Simvastatin with or without Ezetimibe in Familial Hypercholesterolemia.” New England Journal of Medicine, 2008; 358: 1431–1443. On the delay: Drazen JM et al. “Ezetimibe and the Regression of Atherosclerosis.” NEJM, 2008. The eighteen-month delay and its consequences are documented in congressional testimony and covered in: PMC4215424.
Time period cherry-picking in economics: Economics Help. “Cherry picking of data.” economicshelp.org. Discusses the UK government economic record example and the effect of start-date selection on apparent GDP performance.
Pre-registration: ClinicalTrials.gov (US registry): clinicaltrials.gov. ISRCTN Registry (UK/international): isrctn.com.
Cherry-picking in pharmaceutical subgroup reporting: Vedula SS et al. “Outcome Reporting in Industry-Sponsored Trials of Gabapentin for Off-Label Use.” New England Journal of Medicine, 2009; 361: 1963–1971. Covers systematic divergence between registered outcomes and published results.