As it happens, the last two scientific papers I’ve had accepted for publication are also the first two for which my first-authorial duties included some substantial journal-mandated archiving of supporting data (beyond uploading a handful of DNA sequences to GenBank). The journals publishing the two papers each require authors to upload the data supporting a published paper to a public repository, and both strongly suggested that the repository be Dryad.
And that’s about where the similarities end. The differences, I think, suggest that there’s still some work to be done before journals, authors, and public data archives settle on a set of standard procedures that will make the data collected in publicly funded scientific research easily available with a minimum of fuss.
The first paper went to Systematic Biology. It’s a phylogenetic analysis using data collected by the Medicago HapMap Project. (I described the results briefly over at Nothing in Biology Makes Sense! last week.) Like most genome projects, the MHP has its own infrastructure for making its data available, and I’d cited that website in the MS and figured we’d done our duty. But the afternoon after I submitted the manuscript, I received an e-mail from the editorial office reminding me that (1) Systematic Biology expects authors to upload supplementary figures (the MS had a couple) to Dryad and (2) we were also expected to make supporting data available to reviewers at the time of submission … so why not put the data in the same Dryad package?
So I did. But because Dryad freezes files once they’re uploaded to a package under review, every revised version of the supplementary figures I submitted simply accumulated alongside the originals. The folks at Dryad cleaned everything up once the manuscript was accepted, but this offended my sense of tidiness. Asking reviewers to wade through a pile of past versions seems not very helpful. What if I’d had to change the supporting data files as well? Speaking of those files: in the course of a review process marked by some really outstandingly thorough, helpful input from reviewers—dozens of pages of it—I’m pretty sure none of the reviewers did anything with the supporting data files. One of the key issues in the review was what specific kind of analysis was most appropriate for a genome-wide SNP data set. But actually replicating much of the analysis would’ve taken days, and I don’t think the reviewers needed to do that to evaluate the manuscript.
Contrast this with the other paper, which is in press at the Journal of Evolutionary Biology. It uses morphological and microsatellite data from populations of Joshua trees and their pollinators scattered across the Mojave Desert to determine whether the pollinators’ preferences shape gene flow between Joshua tree populations. While the review process for this manuscript also centered on which analysis would best apply our data to answer that question, the reviewers never asked to examine the data directly.
It wasn’t until the paper had been accepted for publication that JEB sent me an e-mail asking if I wanted to archive the supporting data at Dryad. I said I did, and the editorial office sent me a link to the upload page. It was pretty obvious what I needed to archive—after a little sorting, I uploaded files containing microsatellite genotypes for the trees and their pollinators, more files containing measurements of some key traits for each species, and one last file containing a table of latitude and longitude coordinates for all the collection sites. Everything reflects the data supporting the manuscript as it will be published, and it’s ready for other folks to dig in, if they’re interested, as soon as the paper is released online.
I’m in favor of public data archiving on principle—I’ve made use of open data, and of course there’s empirical support for its benefits—and I’m happy to put my work-time where my mouth is. But journal policies differ in how easy they make archiving, and in how well they maximize its benefits both before and after publication. So, as an author, which of these approaches did I prefer?
On a purely emotional level, I much preferred being asked to archive my data after the paper was accepted for publication rather than before review had even begun. At Systematic Biology, archiving is another step added to the already tedious and nerve-wracking process of formatting and submitting a manuscript that might or might not be accepted and might or might not undergo substantial changes in the course of peer review. (Please note that I find the submission process for every journal tedious and nerve-wracking!) It also strikes me as problematic to ask authors to post data that will be archived elsewhere, even if it’s in a somewhat different form—proliferating versions of the same dataset can only confuse people looking for data if they don’t start by following links from the associated journal articles.
And why on Earth is it necessary to freeze files once they’re uploaded to a data package that’s under review?
On the other hand, after review, I was in a good mood and ready to do whatever I was asked, if it’d bring me closer to a final, type-set publication. It didn’t matter if I couldn’t change files once I’d uploaded them, because I could upload exactly the versions associated with the final, accepted paper. It’s all much tidier.
I can see the argument for archiving before review: reviewers should have access to the data supporting the manuscripts they’re evaluating. But I wonder how often reviewers make use of such access. I try to be a careful and thorough reviewer myself, when called upon, but I doubt I’ll ever take the time to completely re-run an analysis just to check the results reported in a manuscript. If I did, it’d probably be because I already had serious questions about the manuscript. And in that case, how much difference could re-analysis make to my recommendation?
I think it’s probably safe to assume that most of our readers agree with me that data archiving is a good and worthy thing. But, given that we agree on this point—how should we then make it happen?