The Reproducibility Crisis


Much of the literature in the softer sciences, including studies of public health issues such as nutrition, exercise, or even COVID-19, seems irreproducible. One group's work is not easily reproduced by another, giving rise to concerns about bias and veracity. As always, there is far more to the story.

There are many steps in research that, like sausage-making, are rarely viewed: collecting the data, cleaning up its inconsistencies, and standardizing it in some way, all to create a “dataset” that is, in turn, analyzed and then discussed. Creating the dataset involves any number of decisions that are not deemed necessary to report in the article’s methodology. Those decisions that shape the dataset also shape the analysis. A new study in economics looks at the reproducibility crisis from that viewpoint. The findings can be generalized to other softer sciences.

The researchers identified two published articles from reputable journals, each with a clear outcome and based on publicly available data that could be easily found. By various means, they recruited published academics to reproduce a portion of these studies. Each was given all the raw data along with instructions as to “what the data set and research question of interest are, as well as some identifying assumptions the variables of interest and sought outcome without restraining their choices too much.” The “replicators” would then treat this as their own research and clean, categorize, and analyze the data. The choices of how to carry out those tasks were left to the replicators – the replication’s “degrees of freedom.” Each article had seven replications.

Replication can take at least two forms. In pure replication studies, you accept the researcher’s data, methods, and analysis and basically check their math. The researchers note that this form of replication has become easier as more and more journals require submitting datasets and details of the analytic treatment for publication. The replications in this study provide the raw data and leave the rest in the replicator’s hands. It is more akin to a paper that results from a different group of researchers considering the same raw data – it is more real-world.

The researchers collected the replications and went through the coding and analysis to identify the frequently unspoken decisions that impacted the results.

In the case of one paper,

“different researchers answering the same question using the same data set may arrive at starkly different conclusions. Three (of seven) would likely conclude that compulsory schooling had a negative and statistically significant effect on teen pregnancy, two would find no significant effect, and one would find a positive and significant effect.”

In the review, the researchers found that replicators made different decisions about cleaning the data: what values to ignore, what values to impute (a way of estimating a missing value), and what periods to consider. As a result, the sample size used in the subsequent analysis varied eightfold. The subsequent analyses also varied, including which variables to take into account or control for.
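To make the kind of choice at stake concrete, here is a minimal, hypothetical sketch (the data and numbers are invented for illustration, not drawn from the study) of how two replicators might treat the same missing values, one dropping them and one imputing them, and end up analyzing samples of different sizes:

```python
# Hypothetical raw column with missing values (None), not from the study.
raw = [12, None, 15, 9, None, 14, 11, None, 10, 13]

# Replicator A drops any row with a missing value.
dropped = [x for x in raw if x is not None]

# Replicator B imputes each missing value with the mean of the observed ones.
mean_observed = sum(dropped) / len(dropped)
imputed = [x if x is not None else mean_observed for x in raw]

# The two "cleaned" datasets now differ in size: 7 observations vs 10.
print(len(dropped), len(imputed))

# Mean imputation happens to preserve the mean here, but the variance,
# standard errors, and any regression using this column will all differ.
print(sum(dropped) / len(dropped), sum(imputed) / len(imputed))
```

Neither choice is wrong; each is a defensible, and usually unreported, judgment call of exactly the sort the study highlights.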

There was more consistency in replicating the second paper, with the estimates, confidence intervals, and statistical conclusions essentially the same. But again, the researchers found wide variations in sample size due to those “data cleaning” and aggregating decisions, with variables “binned” in different ways.
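The binning point can be sketched the same way (the variable and cut points here are invented for illustration): two defensible ways of binning the same values produce different categories, and therefore different cell sizes feeding the analysis:

```python
# Hypothetical values binned under two defensible schemes.
ages = [16, 17, 18, 19, 20, 21, 22, 23]

# Replicator A: a two-way split at 18.
bins_a = ["teen" if a < 18 else "adult" for a in ages]

# Replicator B: a three-way split with a "young adult" category.
def bin_b(a):
    if a <= 17:
        return "minor"
    if a <= 20:
        return "young adult"
    return "adult"

bins_b = [bin_b(a) for a in ages]

# The "adult" cell shrinks from 6 observations under scheme A to 3 under B.
print(bins_a.count("adult"), bins_b.count("adult"))
```

Any estimate computed within a bin now rests on a different slice of the data, even though nothing was miscalculated in either scheme.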

“Given the same data and research question of interest, we find considerable variation across researchers in the way that they clean and prepare data and design their analysis.” 

The researchers point out that there was nothing inherently wrong with how the data was cleaned and prepared, no smoking-gun “bias.” Those decisions, when discussed with the replicators, reflected 

“familiarity with a given model, differing intuitive or technical ideas about which control variables are appropriate or whether linear probability models are appropriate, and differing preferences for parsimony.”

The real difficulty was that the decisions made were invisible to others, who assumed that scientists would approach a problem in a standard manner. This is part of the reproducibility crisis, and it can be made better by explicitly identifying the implicit decisions in data preparation. It is another reason why transparency in regulatory science is critical. Few studies come with miscalculated values, but many come with choices that, as this study suggests, can change the data and the subsequent outcomes. And those choices are more reflective of training than of p-hacking or chasing a desired result.

Source: The influence of hidden researcher decisions in applied microeconomics. Economic Inquiry, DOI: 10.1111/ecin.12992