Identifying and correcting errors in draft genomes

Cumulative number of genomes sequenced over the past 3 decades (figure by Greg Zynda http://gregoryzynda.com/)


Over the past decade we have seen an exponential increase in the number of sequenced, assembled, and annotated genomes. These genomes are essential for pretty much any genomics research. If you want to sequence the genome, transcriptome, epigenome, or whatever-ome of your super-special study species and population, you’ll need (or at least want!) a pretty solid (read: well-annotated) reference genome to which to align your sequence data.
Fortunately for you, genomicists have been sequencing pretty much any genome that they can get their hands on. Unfortunately, these genomes are first published in “draft” form and come with a multitude of potential errors. These errors are highlighted in a recent paper by James Denton and colleagues. Here’s the one-sentence summary of their paper:

Low-quality assemblies result in low-quality annotations, and these annotation errors cause both the over- and under-estimation of gene numbers.

The good news is that:

many genome assemblies and annotations have improved over time due to further efforts aimed at both increasing sequence contiguity and adding functional data (e.g. RNA-seq) in order to correct gene models.

… but the bad news is that:

it is often the case that a great deal of research will be based upon the draft assembly before it has reached a finished state, and erroneous conclusions may result.

More specifically, in this paper the authors compared the most up-to-date genomes (from fruit flies to chickens to chimpanzees) to their draft-genome predecessors. What they found was that:

low-quality assemblies can result in huge numbers of both added and missing genes, and that most of the additional genes are due to genome fragmentation (“cleaved”* gene models)… Upwards of 40% of all gene families are inferred to have the wrong number of genes in draft assemblies, and that these incorrect assemblies both add and subtract genes.

(*“cleaved” gene models are those in which multiple genes are estimated from sequences that actually came from just one gene.)
Their findings make sense. When an assembly is fragmented, the exons of a single gene, which may be far apart in the genome, can end up on separate pieces of sequence, and gene-prediction algorithms are then more likely to assign them to different genes. These cleaved gene models lead to an overestimation of single-exon genes and a depletion of multi-exon genes.
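To make the cleaving idea concrete, here is a toy Python sketch. It is not the method used in the paper: the coordinates, the gene name, and the predict_gene_models helper are all made up for illustration, and the "predictor" simply assumes that exons can only be joined into one gene model if they sit on the same contiguous piece of assembly.

```python
from collections import defaultdict


def predict_gene_models(exons, contigs):
    """Toy predictor: exons are grouped into one model per (gene, contig).

    exons   -- list of (gene_id, start, end) on the true genome
    contigs -- list of (start, end) assembled pieces, non-overlapping
    Returns a sorted list of (gene_id, contig_index, exon_count).
    Exons that fall in a gap between contigs are simply lost.
    """
    exon_counts = defaultdict(int)
    for gene_id, start, end in exons:
        for index, (contig_start, contig_end) in enumerate(contigs):
            # Only count an exon if it lies entirely inside this contig.
            if contig_start <= start and end <= contig_end:
                exon_counts[(gene_id, index)] += 1
                break
    return [(gene, idx, n) for (gene, idx), n in sorted(exon_counts.items())]


# One hypothetical three-exon gene with long introns.
exons = [("geneA", 100, 200), ("geneA", 5000, 5100), ("geneA", 9000, 9200)]

finished_assembly = [(0, 10000)]                       # one contiguous piece
draft_assembly = [(0, 3000), (3500, 7000), (7500, 10000)]  # three pieces

finished_models = predict_gene_models(exons, finished_assembly)
draft_models = predict_gene_models(exons, draft_assembly)

print(len(finished_models), "model(s) in the finished assembly")
print(len(draft_models), "model(s) in the draft assembly")
```

In this toy case the finished assembly yields one three-exon model, while the draft yields three single-exon models for the same gene, which is the pattern of extra single-exon genes and missing multi-exon genes described above.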
But there is hope, and this hope comes in the form of RNA-sequencing. The authors found that paired-end RNA-sequencing improves the annotation of genomes by connecting the cleaved genes.
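Here is a rough Python sketch of how paired-end reads could connect cleaved models. This is my own illustration, not the authors' pipeline: the model names, the merge_cleaved_models function, and the min_pairs threshold are assumptions. The idea is that if the two mates of a read pair map to two different gene models, those models probably belong to the same transcript, so we merge any models linked by enough pairs.

```python
from collections import Counter, defaultdict


def merge_cleaved_models(models, read_pairs, min_pairs=2):
    """Merge gene models that are linked by paired-end reads.

    models     -- list of gene model names
    read_pairs -- list of (model_of_mate1, model_of_mate2)
    min_pairs  -- hypothetical threshold: links with fewer supporting
                  pairs are ignored as possible mapping noise
    Returns a sorted list of merged groups (each a sorted list of names).
    """
    parent = {model: model for model in models}

    def find(model):
        # Union-find lookup with path halving.
        while parent[model] != model:
            parent[model] = parent[parent[model]]
            model = parent[model]
        return model

    # Count read pairs whose mates land in two different models.
    support = Counter()
    for first, second in read_pairs:
        if first != second:
            support[frozenset((first, second))] += 1

    # Join models connected by a well-supported link.
    for link, count in support.items():
        if count >= min_pairs:
            first, second = tuple(link)
            parent[find(first)] = find(second)

    groups = defaultdict(list)
    for model in models:
        groups[find(model)].append(model)
    return sorted(sorted(group) for group in groups.values())


models = ["m1", "m2", "m3", "m4"]
read_pairs = [
    ("m1", "m2"), ("m1", "m2"),   # two pairs link m1 and m2
    ("m2", "m3"), ("m3", "m2"),   # two pairs link m2 and m3
    ("m4", "m4"),                 # both mates inside m4: no link
    ("m3", "m4"),                 # a single pair: below the threshold
]

merged = merge_cleaved_models(models, read_pairs)
print(merged)
```

With these made-up reads, m1, m2, and m3 collapse into a single gene while m4 stays on its own, because only one read pair ties it to the others.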
Overall, this suggests that draft genomes should be used and interpreted with caution. If you can, improve the annotation by sequencing your organism’s transcriptome.
Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, et al. (2014) Extensive Error in the Number of Genes Inferred from Draft Genome Assemblies. PLoS Comput Biol 10(12): e1003998. doi: 10.1371/journal.pcbi.1003998
