Archiving genetic data matters for many reasons, including the reproducibility and transparency of results. Access to previously published data is also valuable because the same data set can often help answer a range of different questions in evolutionary biology. In the current issue of Molecular Ecology, Pope et al. analyzed 419 data sets from 289 articles published in the journal over the last 5 years, recording the extent to which the data sets could be recreated given the geographic and temporal information provided by the authors. For example, for sequences collected across a geographic range, could Pope et al. determine which sequences were collected in which areas? If the original authors uploaded only unique sequences to GenBank, did they also provide the information needed to work out how many individuals from a given location carried a particular sequence (i.e. sample sizes and haplotype/allele frequencies)? Did the authors report the timeframe in which they collected the samples?
Pope et al. found that since the 2011 implementation of the Joint Data Archiving Policy (JDAP), which requires that data supporting publications be made publicly available, the archiving of genetic data increased from 49% (pre-2011) to 98% (2011-today). To me, uploading genetic data to a curated database like GenBank or the European Nucleotide Archive feels as much a part of the process as writing the paper. Unfortunately, Pope et al. were unable to recreate 31% of the archived data sets they downloaded using the information provided in the paper or alongside the sequence data themselves. Over a third of articles provided geographic information as text only, without geographic coordinates, and 18% of those described sampling only at the broader regional scale. About 40% of the articles provided no temporal information, and 20% reported only a range of years.
While great progress has been made towards the public availability of genetic data, the lack of emphasis on provision of associated information, such as geographic location and time of sampling, may impede our ability to fully reproduce such studies or use their genetic data in new ways.
Pope et al. recommended that in order to make genetic data truly accessible and useful for future analyses, at a minimum, individual genotypes should be recoverable and linked to geographic and temporal information. The authors also suggested including a readme file with the archived data that provides relevant information, like the naming/coding system used to identify sequences generated in the study.
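To make that minimum concrete, here is a small Python sketch of what checking an archive against it could look like. It is only an illustration: the `SampleRecord` fields, the sample IDs and the placeholder accession labels are all made up for this post and are not a format proposed by Pope et al. or required by any repository.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class SampleRecord:
    """One sampled individual, linked to its sequence and its metadata.

    All field names here are hypothetical, chosen for illustration only.
    """
    sample_id: str                 # naming/coding system explained in the readme
    accession: str                 # placeholder label, not a real accession
    latitude: Optional[float]      # decimal degrees, None if not reported
    longitude: Optional[float]     # decimal degrees, None if not reported
    collected: Optional[date]      # sampling date, None if not reported


def missing_metadata(record: SampleRecord) -> List[str]:
    """List the spatio-temporal fields that are absent or implausible."""
    problems = []
    if record.latitude is None or record.longitude is None:
        problems.append("coordinates")
    elif not (-90 <= record.latitude <= 90 and -180 <= record.longitude <= 180):
        problems.append("coordinates out of range")
    if record.collected is None:
        problems.append("collection date")
    return problems


def audit(records: List[SampleRecord]) -> Dict[str, List[str]]:
    """Map each incomplete sample ID to the list of what it is missing."""
    report = {}
    for record in records:
        problems = missing_metadata(record)
        if problems:
            report[record.sample_id] = problems
    return report


# Made-up example records: one complete, two incomplete.
example_records = [
    SampleRecord("ind_01", "PLACEHOLDER_A", -27.47, 153.03, date(2012, 3, 14)),
    SampleRecord("ind_02", "PLACEHOLDER_B", None, None, date(2012, 3, 15)),
    SampleRecord("ind_03", "PLACEHOLDER_C", -27.50, 153.01, None),
]

if __name__ == "__main__":
    for sample_id, problems in audit(example_records).items():
        print(f"{sample_id}: missing {', '.join(problems)}")
```

Keeping one row per individual, rather than one row per unique sequence, is what lets sample sizes and haplotype frequencies per location be recovered later.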
To fully realize the future potential of this data legacy, there should now be a greater push to link spatio-temporal metadata to genetic data and to develop standards and repositories that facilitate data deposition, curation and searchability.
Reference:
Pope, L. C., Liggins, L., Keyse, J., Carvalho, S. B., & Riginos, C. (2015). Not the time or the place: the missing spatio‐temporal link in publicly available genetic data. Molecular Ecology, 24, 3802–3809. DOI: 10.1111/mec.13254