Science aims for the truth. But could the way scientists approach empirical data be sabotaging their efforts? Leif Nelson at UC Berkeley’s Haas School of Business and colleagues have recently suggested that common practices in data analysis are too flexible to be reliable: scientists may be unwittingly tricking themselves into reporting false results.
The basic problem arises in any statistical hypothesis test, which calculates the probability, or p-value, of observing patterns at least as strong as those in your data purely by chance, rather than through the reliable relationship you anticipated. Finding this p-value to be sufficiently small (e.g. p < 0.05) generates that unique rush of elation. It’s significant! Except… many seemingly significant “findings” may actually be scientists searching for systematicity in random noise.
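To see why a 5% threshold guarantees some false alarms, here is a minimal sketch (my illustration, not from Nelson’s paper; the sample sizes are arbitrary): simulate many two-group experiments in which no real effect exists, and count how often a standard t-test nonetheless reports p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, n = 10_000, 20  # 10,000 simulated experiments, 20 subjects per group

false_positives = 0
for _ in range(n_sims):
    a = rng.normal(size=n)  # "treatment" group: pure noise, no real effect
    b = rng.normal(size=n)  # "control" group: same distribution
    _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

# The rate lands close to 0.05: even with no effect anywhere,
# roughly 1 in 20 experiments looks "significant".
print(false_positives / n_sims)
```

This is the baseline the field accepts: a 5% false-positive rate when a single, pre-specified test is run exactly once. The trouble Nelson describes begins when that assumption quietly breaks down.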
Nelson and colleagues suggest that the familiar joy of significant results is compromised by two features inherent to research: ambiguity in how to analyze data, and scientists’ hopes of finding a significant result. We are intimately familiar with the latter, but may be less cognizant of the ambiguities we regularly resolve in examining data. How many data points are collected before ending a study? Which dependent variables should be reported? Does one control for a particular factor, like time of day, ambient light, vibration, or inter-animal variation? Is there a good reason to drop a flawed condition or problematic experiment?
Nelson claims that these choices introduce “researcher degrees of freedom”—that is, flexibilities in making decisions that researchers may lean on in finding a significant result. Ambiguities inherent in analysis make many decisions look equally logical. This conceals how post-hoc a decision really was: many equally justifiable approaches were likely passed over when they failed to reveal significant results. It’s tempting to assume that no one you know would be so careless, but extensive psychological research reveals that people often fail to recognize their bias to interpret ambiguity in self-serving ways.
Think back to recent research findings in your own field—or if you’re feeling courageous, your own department. When might such ambiguities have been inadvertently exploited to find significant results? Many scientists may believe that although these mistakes are prevalent, their effect is tiny. In fact, Nelson admits, “when we first ran simulations to examine how these researcher degrees of freedom influenced false positives, I was expecting minimal effects.”
Instead, his statistical simulations show that shocking inflations of false positives are possible. Consider the consequences of some apparently innocuous sources of flexibility. When testing the effect of an independent variable, the freedom to choose one (or both) of two dependent variables roughly doubles the probability of a false finding. Conducting statistical tests repeatedly as data points accumulate can increase erroneous findings by 50%. Moreover, combining several such tactics can inflate the probability of accidentally finding a result from the nominal 5% to 30% or even 60%. “I was completely shocked when I saw these results,” Nelson recalls.
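The inflation is easy to reproduce in a toy simulation (an illustrative sketch, not the authors’ code; the sample sizes, the number of looks at the data, and the stopping rule below are my assumptions): let each simulated null experiment either choose between two dependent variables, or add more subjects and retest when the first look misses significance, and the false-positive rate climbs above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims = 5_000  # simulated experiments per scenario; no real effect in any of them

def two_dv_rate():
    # Freedom 1: measure two dependent variables and count the experiment
    # as "significant" if either one reaches p < 0.05.
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(size=(20, 2))  # group A: 20 subjects, 2 DVs each
        b = rng.normal(size=(20, 2))  # group B: same
        p1 = stats.ttest_ind(a[:, 0], b[:, 0]).pvalue
        p2 = stats.ttest_ind(a[:, 1], b[:, 1]).pvalue
        hits += (p1 < 0.05) or (p2 < 0.05)
    return hits / n_sims

def optional_stopping_rate():
    # Freedom 2: test after 20 subjects per group; if not significant,
    # collect 10 more per group and test again.
    hits = 0
    for _ in range(n_sims):
        a, b = rng.normal(size=30), rng.normal(size=30)
        if stats.ttest_ind(a[:20], b[:20]).pvalue < 0.05:
            hits += 1
        elif stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / n_sims

print(two_dv_rate())              # well above 0.05 (roughly doubled)
print(optional_stopping_rate())   # noticeably above 0.05
```

Neither shortcut looks sinister in isolation, which is exactly the point: each one only nudges the error rate, but stacking several of them compounds quickly.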
Scientists frequently gripe about the misuse of statistics, so what does this paper contribute beyond an obvious confirmation of those war stories? In addition to systematic and quantitative analysis, the paper reports an experiment that audaciously demonstrates researcher degrees of freedom run amok. Subjects invited into the lab reported their ages (along with other variables) before listening to a Beatles song or, as a control, suffering through a Microsoft Windows system tune.
In analyzing the data, Nelson and his colleagues allowed themselves to use the researcher degrees of freedom outlined earlier and simulated in their paper: collecting multiple conditions and not reporting one; continually testing for significance as data were collected; choosing which dependent variables to analyze and which covariates to use. Their groundbreaking and statistically significant finding was that listening to the Beatles song made people younger than listening to a less stimulating tune, when controlling for a different variable between the two groups (father’s age). This impossible result was obtained by methods that were chillingly indistinguishable from those reported in a typical article. Nelson recalls, “Initially we weren’t sure whether our experiment would actually find such a result. But after seeing the results of the simulations, I was absolutely confident.”
What can be done about this problem? The findings raise sensitive policy issues. Public opinion is in little danger of being excessively pro-science. Do the benefits of potentially reducing false findings outweigh the cost of giving everyday people and special-interest groups license to dismiss scientific research?
Moreover, these challenges are not unique to scientists but are shared more generally in interpreting any observations. Many opinions and beliefs we form from everyday experience carry the risk of researcher degrees of freedom writ large. While scientists’ use of statistics clarifies their underlying logic and makes it easier to detect errors, everyday reasoning provides the opportunity to cherry-pick a vast collection of anecdotes in support of one’s preferred conclusion.
Nelson et al.’s paper provides simple and specific recommendations for authors: list all independent and dependent variables that were collected, report at what points in data collection results were analyzed, and report results with and without any ambiguous decisions such as eliminating an outlier or controlling for a factor. But there is a tremendous disincentive for individual scientists to hurt their chances of publication by holding themselves to higher standards than the community. A list of recommendations for reviewers and editors is also provided. In addition to asking authors to follow the recommendations, reviewers should be more tolerant of imperfections in results, ask authors to report analyses with and without particular analysis decisions, and in extreme cases consider requesting exact replications.
When asked what readers should carry away from this work, Nelson said he hoped it would change both the culture around reporting analyses and the approach taken by individual scientists. “Analyze data knowing that those decisions are important, and you will be reporting every step in print. This work has completely changed how I look at my research. Now I’m constantly evaluating my decisions, trying to detect situations in which there would otherwise be temptations to game the data to find statistical significance.”
Read Nelson’s paper here.