The Goldilocks zone of missing data

One of the more adorable members of the Iguania. Photo by Rob Denton


Reduced representation sequencing approaches, such as RADseq and UCEs, have provided some fascinating inferences in recent years, but something has always been missing in these analyses: data. As sampled taxa become more divergent, the price paid for more loci is more missing data. The extent to which this is a problem has been debated, and there is no best recommendation for balancing the choices of number of taxa, number of loci, and acceptable percentage of missing data.

More generally, it is not clear how sampling for targeted-sequence capture studies should be designed (given finite resources). Should studies try to obtain large numbers of loci for a more limited set of taxa? Or more taxa and fewer loci? Should taxa or loci with missing data be excluded? What amount of missing data should be allowed? Do the answers to these questions change when applying concatenated versus species tree approaches? These fundamental questions have barely been addressed.

Streicher, Schulte, and Wiens provide a new empirical example to further inform these decisions in an upcoming issue of Systematic Biology (Don’t have access? A version can be found here too). They take a dataset of UCEs from Iguanian lizards and create different variations by adjusting the number of taxa sampled (44, 29, or 16) and the percentage of missing data allowed per locus (20%, 30%, 40%, 50%, or 60%). The resulting 15 datasets were used to build phylogenies with both concatenated (RAxML) and species-tree (NJst) methods. The authors then focused on clades whose monophyly was well supported by previous studies, treated those as the “true” relationships, and compared how strongly each dataset and method supported them.

Figure 1 from Streicher et al. (2015) describing the relationship between the number of loci and percent missing taxa among their datasets


For both types of analysis, branch support was maximized when up to 50% of taxa were missing per gene. That isn’t to say that the missing values themselves improved inferences, but that threshold allowed for a greater number of taxa and genes to be used overall.
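To make that trade-off concrete, here is a minimal sketch of threshold filtering in Python. It is not the authors' pipeline: the toy presence/absence data, the locus names, and the `filter_loci` helper are all hypothetical, and it assumes the threshold means "the maximum percentage of sampled taxa allowed to lack a given locus."

```python
from itertools import product


def filter_loci(presence, taxa, max_missing_pct):
    """Return loci for which at most max_missing_pct percent of taxa lack data.

    presence: dict mapping locus name -> set of taxa that have sequence for it
    taxa: list of taxa in the dataset being built
    max_missing_pct: integer percentage (e.g. 50 for "up to 50% missing")
    """
    kept = []
    for locus, present_taxa in presence.items():
        n_missing = sum(1 for t in taxa if t not in present_taxa)
        # Integer comparison avoids floating-point edge cases at the threshold.
        if n_missing * 100 <= max_missing_pct * len(taxa):
            kept.append(locus)
    return kept


# Hypothetical toy data: five taxa, five loci with increasing amounts missing.
taxa = ["A", "B", "C", "D", "E"]
presence = {
    "uce-1": {"A", "B", "C", "D", "E"},  # 0% of taxa missing
    "uce-2": {"A", "B", "C", "D"},       # 20% missing
    "uce-3": {"A", "B", "C"},            # 40% missing
    "uce-4": {"A", "B"},                 # 60% missing
    "uce-5": {"A"},                      # 80% missing
}

# Thresholds and taxon-set sizes as reported in the post.
thresholds = [20, 30, 40, 50, 60]
taxon_set_sizes = [44, 29, 16]

# Number of loci retained at each threshold in the toy data.
loci_retained = {pct: len(filter_loci(presence, taxa, pct)) for pct in thresholds}

# Every combination of taxon-set size and threshold is one dataset.
design = list(product(taxon_set_sizes, thresholds))

if __name__ == "__main__":
    for pct, n in loci_retained.items():
        print(f"up to {pct}% missing taxa per locus: {n} loci retained")
    print(f"{len(design)} datasets in the full design")
```

Even in this toy case, a stricter threshold throws away loci that most taxa still have, which is the cost the authors weigh against the noise added by sparser loci.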

We show that allowing more missing data can increase the number of taxa and loci that are included, and increase support for estimated relationships (but that including the maximum amount of missing data does not necessarily maximize support).

That’s right, more missing data is helpful for capturing more loci/taxa, but too much missing data is a problem no matter how many loci/taxa get included. The most peculiar thing of all is that the two methods for building the phylogenies perform best under opposite sampling strategies: concatenated analyses are most accurate with maximum taxon sampling and moderate locus sampling, whereas species-tree analyses are most accurate with minimum taxon sampling and extensive locus sampling. Method matters!

The authors suggest that other researchers avoid removing loci out of fear of missing data, since the added breadth of genes and possibly taxa is likely more beneficial. However, more caution than ever is required, since different sampling strategies produce very different results within each method.

These considerations matter most when branches are short. To avoid these problems entirely, find some nice long branches to resolve. But where’s the fun in that?

Thus, we show that some sampling strategies must be yielding incorrect but strongly supported results. While this sensitivity may be largely confined to short branches in this ancient, rapid radiation, it is just such branches that phylogenomic data may be needed to resolve.

 
Streicher, J. W., Schulte, J. A., & Wiens, J. J. (2015). How Should Genes and Taxa be Sampled for Phylogenomic Analyses with Missing Data? An Empirical Study in Iguanian Lizards. Systematic Biology, doi: 10.1093/sysbio/syv058.
