Last week, a friend sent me this policy forum article published in Science. Fifty co-authors, mostly tenured and from prestigious universities, some of them among my dearest idols, have written this piece to call for publicly available genome data. What struck me most is their request that data be released immediately after generation.
Most funding bodies already support, sometimes request, the release of data for the benefit of the scientific community and society. However, there seems to be a lack of rules for data usage. The landmark 2003 Fort Lauderdale Agreement promotes the free and unrestricted use of genome sequencing data by the scientific community after data have had ethical approval for release but before they are used for publication. A well-known role model is the U.S. Department of Energy. Here, for example at the Joint Genome Institute, data are immediately released after sequencing.
The authors of the Science article argue that (i) sequencing data should become publicly available for unrestricted use right after they are generated, (ii) science will advance through competition with transparent rules, and (iii) credit should be given to the resource producers.
While I fully agree that science as a whole will advance faster with immediate sharing of data, I also worry about the consequences for individual scientists. If you imagine scientific advancement as a large and beautiful Monet painting, each dot is an individual publication, and behind every publication there are human beings. For at least some of these people, publications are the stepping-stones in the bridge to survival as an academic. Our current system values individuals based on high-impact publications and rank (tenure) instead of supporting inclusive, community-driven research. Hence, there is a huge discrepancy between grad students or postdocs and tenured PIs. As a postdoc, I could easily feel mined by a powerful PI, without the ability to contribute, if I had to release my sequencing data immediately after generating them. With immediate release, data producers have no guarantee that they can publish prominent peer-reviewed reports if others use their data first. This could easily lead to data piracy, with a few groups who can afford a huge amount of computational power dominating the field, and the Monet painting would start looking more like Rothko’s late period. Is there academic fair play?
Generating new data typically involves years of preparation, including project design, setting up collaborations, hiring people, getting collection and experimentation permits, organizing field expeditions, storing samples appropriately, extracting the DNA, and eventually producing the sequences. I wonder whether there is a way we could reward data acquisition on its own. Databanks would have to make sure that they provide the source information for all contributions. In theory, people would then get cited for coming up with good sampling designs and providing useful data. Scientists should be given incentives and rewards for publishing reproducible protocols of data acquisition, making their data publicly available, and going the extra mile to provide the metadata with the raw material. Could we assign DOIs (digital object identifiers) to datasets, protocols, and scripts, and then cite each other for good work?
How does this affect the molecular ecologist? Many of us are producing sequencing data. We are running long-term studies, climbing up mountains, hiking through deserts, or diving into kelp forests to collect our data. Are we supposed to release all of our sequences once they are generated as data reports? What if they are part of a graduate student’s thesis? What if a postdoc spent half of his appointment applying to get collection permits for endangered species or remote areas? Isn’t the moment you start analyzing the data you collected the most rewarding time of being a scientist?
Can we agree on submitting all sequencing data, including metadata and protocols, on the day we submit our first manuscript and post the bioRxiv version? And to go a step further, can we reward each other and build up a reputation for creating good, reproducible data?
Beyond genomic data, I believe that science would also advance faster if we granted access to negative results and their accompanying datasets. Groups would not have to repeat the same mistakes over and over again simply because nobody published them, and we would open the way for new approaches. The same applies to preliminary experiments that could provide the basis for new hypotheses. If we published those, other researchers could adjust their experimental design or use our data for their power analysis.