Sharing data is a cornerstone in the push for scientific transparency. It is also a significant speed bump that impedes achieving that goal. Why is that?
It is hard to argue that sharing data to make your work more transparent and accessible to review is bad. Especially in science, where “show your work” has been in various forms a mantra for hundreds of years. When the Trump administration pushed for scientific transparency, a regulatory fight that continues, many academics spoke out against the proposed regulations because not all data could be shared – like patient information or the data underlying older science used in regulations. A study in Bioscience looks at some other, lesser-mentioned concerns.
The study involved a survey of academics at the 20 top Canadian universities in the fields of ecology and evolution; I would hasten to add that I feel we can generalize from this select group. The academics were principal investigators or PIs, “the individual responsible for the preparation, conduct, and administration of a research grant, cooperative agreement, training or public service project, contract, or other sponsored project” – the person in charge. Roughly half of the PIs responded. More than 80% believed open data was beneficial to society and had always or occasionally shared data in their published work. (As an aside, it would have been interesting to note what percentage was “always” versus “occasionally,” an argument, in itself, for more transparency). Nearly 80% supported mandatory open data policies.
Delving deeper into the data demonstrated some of the differences in these widely held beliefs. Roughly half had benefited from sharing, although mostly this was the “satisfaction from openly sharing one’s data.” The cynic might interpret this as more virtue signaling than satisfaction. One in five reported more negative outcomes. There were unanticipated costs, in terms of misuse or misinterpretation of data, and lost time or additional effort – data may want to be free, but it comes at a cost to its putative “owner.” These concerns, dare we call them fears, were more likely to be voiced by early career researchers, those without career security. It is an interesting paradox because those most fearful were also the most vocal about the need for sharing.
The concerns were also voiced more readily by men. The researchers suggested a possible explanation that “costs are more readily perceived by men than women.” But the same data can support the unmade claim that “women are more confident in the value of sharing, or that men, by nature, are less collaborative.”
In addition to the idea that a researcher’s data might be misinterpreted, there was the concern that the data, all nicely packaged, would be used by another to “scoop them” to publication. After all, this data sharing is going on in a highly competitive market for funding and professional recognition.
Garbage In, Garbage Out (GIGO)
Without quality data, you cannot hope for quality results. Truthfully, there is very little data available in the real world that doesn’t require significant work to avoid the garbage-in problem. That work goes under the less glamorous term of data cleaning. Cleaning up the data means removing missing values and ensuring that the values have been reliably recorded – a relatively straightforward process for numeric information.
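For numeric information, that cleaning step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the function name clean_numeric, the sample readings, and the plausible range of 0 to 50 are all hypothetical and do not come from the study.

```python
import math


def clean_numeric(values, low, high):
    """Split raw readings into usable numbers and rejected entries.

    An entry is rejected if it is missing, cannot be read as a number,
    is NaN, or falls outside the plausible range [low, high].
    """
    kept, rejected = [], []
    for raw in values:
        try:
            number = float(raw)
        except (TypeError, ValueError):
            # None, "n/a", and similar placeholders end up here.
            rejected.append(raw)
            continue
        if math.isnan(number) or not (low <= number <= high):
            rejected.append(raw)
        else:
            kept.append(number)
    return kept, rejected


# Hypothetical readings: a mix of good values, gaps, and a sentinel code.
raw_readings = [12.5, None, "13.1", "n/a", -999, 14.0, float("nan")]
kept, rejected = clean_numeric(raw_readings, low=0.0, high=50.0)
print(kept)
print(len(rejected), "entries set aside for review")
```

Keeping the rejected entries, rather than silently dropping them, leaves a record of what was removed, which matters when someone else later tries to reuse the shared data.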
When you begin to look at categorical information, say two of the most common, ethnicity and gender, you need a data dictionary – agreed-upon definitions. The UK has 13 ethnicities in its census; the US, five. Expanding the US ethnicities to include those of the UK, or compressing the UK’s to match the US, results in a lot of ambiguity; resolving that ambiguity is the value of a data dictionary. I will leave it up to the reader to sort out what categories and definitions should be made for gender – I will start you with three: male, female, and nonbinary.
There are few, if any, universally agreed-upon definitions. And even with agreed-upon definitions, you must make sure that the coder of information consistently applies those definitions – another thankless, custodial task. It should come, then, as no surprise that many researchers would claim that they are not trained in these functions and that they should be provided and financed by “someone else.” Some researchers are willing to undertake these functions provided they receive payment, in terms of funding or the other coin of the realm, points toward career advancement.
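To make that custodial task concrete, here is a minimal sketch of checking coders' entries against a data dictionary. It uses the US categories exactly as listed in the note at the end of this piece; the function names, the sample entries, and the rule of ignoring case and extra spaces are my own assumptions, not anything prescribed by the study or by a census agency.

```python
# The agreed-upon categories, taken from the US list in the note below.
US_CATEGORIES = [
    "American Indian or Alaska Native",
    "Asian",
    "Black or African American",
    "Hispanic or Latino",
    "Native Hawaiian or Other Pacific Islander",
    "White",
]


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations still match."""
    return " ".join(str(text).lower().split())


def apply_dictionary(entries, categories):
    """Return (coded, unresolved): entries matched to the dictionary,
    and entries a human still has to adjudicate."""
    lookup = {normalize(c): c for c in categories}
    coded, unresolved = [], []
    for entry in entries:
        key = normalize(entry)
        if key in lookup:
            coded.append(lookup[key])
        else:
            unresolved.append(entry)
    return coded, unresolved


# Hypothetical entries as different coders might have typed them.
entries = ["asian", " White ", "Hispanic  or Latino", "Caucasian", ""]
coded, unresolved = apply_dictionary(entries, US_CATEGORIES)
print(coded)
print(unresolved)
```

The sketch deliberately refuses to guess: "Caucasian" is not in the dictionary, so it is flagged rather than quietly mapped to "White." Deciding what to do with the unresolved pile is exactly the thankless work someone has to be trained, and paid, to do.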
If we are to find ways to truly let data be free, to honor that injunction, in deed, over words, then we need to find a way to collaborate in a very competitive field. We have not found a good way to do that in the marketplace, where money serves as the realm’s coin. I am unsure we will find a way in the marketplace of ideas – but we need to try harder.
Note: The UK census categories are Welsh/English/Scottish/Northern Irish/British, Irish, Gypsy or Irish Traveller, White and Black Caribbean, White and Black African, White and Asian, Indian, Pakistani, Bangladeshi, Chinese, African, Caribbean, Arab, and five different types of “other.” For the US, they are American Indian or Alaska Native, Asian, Black or African American, Hispanic or Latino, Native Hawaiian or Other Pacific Islander, and White.
Source: Reported Individual Costs and Benefits of Sharing Open Data among Canadian Academic Faculty in Ecology and Evolution. BioScience. DOI: 10.1093/biosci/biab024