I’m jumping on the bandwagon with a blog post about this new PLoS ONE paper (taking the lead from the man in charge in my lab) because the algorithms are just so exciting:
Matsen FA IV, Evans SN. (2013) Edge Principal Components and Squash Clustering: Using the Special Structure of Phylogenetic Placement Data for Sample Comparison. PLoS ONE. 8(3):e56859.
Our lab works closely with Erick Matsen and his group at the FHCRC – we’ve integrated their software for phylogenetic placement and community comparisons of short read data (pplacer and guppy) into our in-house pipeline for phylogenetic analysis of environmental metagenomes (PhyloSift). The two algorithms they discuss in this new PLoS ONE paper, Edge PCA and Squash Clustering, are implemented within the guppy software package. I can vouch for the usability of the Matsen group’s software – it is well documented and typically pretty easy to install, so I suggest you try it out if you’re as excited as I am by the work described in the above-mentioned paper.
Now for the good stuff – what do EdgePCA and Squash Clustering do? Conceptually, they represent alternatives to traditional PCoA/MDS analysis and UPGMA clustering, respectively. The UniFrac algorithm (as implemented in QIIME) currently represents the default approach for carrying out these traditional ecological analyses on high-throughput rRNA amplicon datasets. However, although UniFrac uses a phylogenetic tree as input, it is still fundamentally a distance-based metric:
Once distances have been computed between samples using UniFrac, these distances are typically fed into general-purpose ordination and clustering methods, such as principal coordinates analysis and UPGMA. Although it is appropriate to apply such techniques to distance matrices of this sort, the classical methods do not use the fact that the underlying distances were calculated in a specific manner, namely, on a phylogenetic tree. Consequently, in an application of principal components analysis, it is difficult to describe what the axes represent. Similarly, in hierarchical clustering, it is unclear what is driving a certain agglomeration step; although it can be explained in terms of an arithmetic operation, a certain amount of interpretability in the original phylogenetic setting is lost. [Matsen & Evans 2013]
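To make the contrast concrete, here is a minimal sketch of the traditional route the quote describes: classical principal coordinates analysis applied to a ready-made distance matrix. The distances below are invented for illustration (they are not UniFrac output) and the function name `pcoa` is my own. Notice that the tree never enters the calculation, which is exactly why the resulting axes are hard to interpret phylogenetically.

```python
import numpy as np

def pcoa(dist, n_axes=2):
    """Classical principal coordinates analysis (metric MDS) of a distance matrix."""
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    # Double-centre the squared distances (Gower's transformation).
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ (d ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(b)
    # Keep the axes with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:n_axes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Non-Euclidean distances can give small negative eigenvalues; clip them to zero.
    coords = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return coords, eigvals

# Hypothetical pairwise distances between four samples (illustrative values only).
dist = np.array([
    [0.0, 0.2, 0.7, 0.8],
    [0.2, 0.0, 0.6, 0.7],
    [0.7, 0.6, 0.0, 0.1],
    [0.8, 0.7, 0.1, 0.0],
])
coords, eigvals = pcoa(dist)
print(coords)  # one row of ordination coordinates per sample
```

The output is just a cloud of points: samples 1 and 2 land on one side of the first axis and samples 3 and 4 on the other, but nothing in the result says which lineages are responsible.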
Personally, I find EdgePCA and Squash Clustering to be more intuitive, because you can visualize and explore community patterns on the tree topology itself. In EdgePCA, lineages that drive community differences in each principal component are visualized as colored and fattened branches in the reference tree.
Squash Clustering, on the other hand, is a way of comparing “phylogenetic fingerprints” of microbial communities, to see how similar or different they may be. Matsen & Evans give a good analogy:
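Here is a rough sketch of the idea as I understand it, not the guppy implementation. It assumes a tiny made-up reference tree, treats every read as a point mass sitting on a leaf (real placements sit along edges and carry uncertainty), and uses one particular sign convention (leaf side minus root side). Each sample becomes a vector with one entry per edge, and ordinary PCA on those vectors yields eigenvectors whose entries are themselves indexed by edges. The names `PARENT`, `edge_differences` and `edge_pca` are my own.

```python
import numpy as np

# Hypothetical rooted reference tree, child -> parent.
# Every non-root node names the edge directly above it.
PARENT = {"A": "n1", "B": "n1", "C": "n2", "D": "n2", "n1": "root", "n2": "root"}
EDGES = sorted(PARENT)

def edge_differences(leaf_mass):
    """Per edge: mass on the leaf (distal) side minus mass on the root (proximal) side."""
    total = sum(leaf_mass.values())
    distal = dict.fromkeys(EDGES, 0.0)
    for leaf, mass in leaf_mass.items():
        node = leaf
        while node in PARENT:  # walk from the leaf up to the root
            distal[node] += mass / total
            node = PARENT[node]
    # distal - proximal = distal - (1 - distal)
    return np.array([2.0 * distal[e] - 1.0 for e in EDGES])

def edge_pca(samples, n_axes=2):
    """PCA on the samples-by-edges matrix; eigenvector entries are edge loadings."""
    x = np.array([edge_differences(s) for s in samples])
    x_centered = x - x.mean(axis=0)
    cov = np.cov(x_centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:n_axes]
    return eigvals[order], eigvecs[:, order], x_centered @ eigvecs[:, order]

# Hypothetical read counts per leaf: the samples differ only in the A/B balance.
samples = [{"A": 90, "B": 10}, {"A": 80, "B": 20}, {"A": 10, "B": 90}, {"A": 20, "B": 80}]
eigvals, loadings, scores = edge_pca(samples)
pc1 = dict(zip(EDGES, loadings[:, 0]))
print(pc1)  # non-zero loadings only on the edges leading to A and B
```

Because the loadings live on edges, they can be painted straight back onto the tree: in this toy example the first component puts weight, with opposite signs, only on the two branches leading to A and B, which is the information the colored, fattened branches convey.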
Imagine that the phylogenetic tree is a road network and that each sample is represented by the distribution of a unit of mass into piles of dirt along this road network. The distance between two samples is then defined to be the minimal amount of “work” required to move the dirt in the first configuration to that in the second configuration (in this context the amount of work needed to move an infinitesimal mass d a distance x is defined to be d·x). Thus, similar collections of phylogenetic placements result in similar dirt pile configurations that don’t require much mass movement to transform one into the other, while quite different collections of placements require that significant amounts of mass must move long distances across the tree. This distance is classical, having roots in 18th century mathematics, and is a generalization of the weighted UniFrac distance. [Matsen & Evans 2013]
In practice, Squash Clustering looks at the placement of reads across the reference tree for sample 1 vs sample 2 (and so on for however many samples are in your dataset).
There are currently two limitations of EdgePCA and Squash Clustering, both related to taxon sampling in your reference phylogeny.
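The dirt-moving distance and the clustering built on it can be sketched in a few lines. This is a simplified illustration under my own assumptions, not the guppy code: reads are point masses on the leaves of a made-up tree, the distance is the earth mover's distance computed edge by edge (branch length times the absolute difference in mass below that edge), and when two clusters merge they are "squashed" into a single averaged mass distribution, weighted so that each original sample counts equally. All names and numbers are hypothetical.

```python
import itertools
import numpy as np

# Hypothetical rooted reference tree: child -> (parent, length of the branch above the child).
TREE = {
    "A": ("n1", 0.1), "B": ("n1", 0.1),
    "C": ("n2", 0.2), "D": ("n2", 0.2),
    "n1": ("root", 0.5), "n2": ("root", 0.5),
}
EDGES = sorted(TREE)
LENGTHS = np.array([TREE[e][1] for e in EDGES])

def distal_mass(leaf_mass):
    """Fraction of a sample's mass lying below each edge (leaf masses normalised to 1)."""
    total = sum(leaf_mass.values())
    below = dict.fromkeys(EDGES, 0.0)
    for leaf, mass in leaf_mass.items():
        node = leaf
        while node in TREE:
            below[node] += mass / total
            node = TREE[node][0]
    return np.array([below[e] for e in EDGES])

def kr_distance(p, q):
    """Earth mover's distance on the tree: the net mass crossing each edge times its length."""
    return float(np.sum(LENGTHS * np.abs(p - q)))

def squash_cluster(samples):
    """Agglomerative clustering; each merged cluster is replaced by its averaged mass."""
    clusters = {name: (distal_mass(m), 1) for name, m in samples.items()}
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of current clusters.
        a, b = min(
            itertools.combinations(clusters, 2),
            key=lambda pair: kr_distance(clusters[pair[0]][0], clusters[pair[1]][0]),
        )
        (pa, na), (pb, nb) = clusters.pop(a), clusters.pop(b)
        merges.append((a, b, kr_distance(pa, pb)))
        # "Squash": size-weighted average, so every original sample has equal weight.
        clusters[f"({a},{b})"] = ((na * pa + nb * pb) / (na + nb), na + nb)
    return merges

# Hypothetical read counts per leaf for four samples.
samples = {
    "s1": {"A": 9, "B": 1},
    "s2": {"A": 8, "B": 2},
    "s3": {"C": 5, "D": 5},
    "s4": {"C": 6, "D": 4},
}
for a, b, d in squash_cluster(samples):
    print(f"merge {a} + {b} at distance {d:.2f}")
```

The point of the squashing step is that every internal node of the clustering tree corresponds to an actual mass distribution on the reference tree, so you can look at what a cluster "is" rather than only at the arithmetic that produced it.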
- First, you lose resolution if a clade (a taxon of interest) is not represented amongst the sequences you use to build your reference tree. In the vaginal dataset presented by Matsen & Evans (their Figure 5), the second principal component, accounting for 24% of variance between samples, was driven by two species of Lactobacillus (a highly sampled clade in 16S phylogenies, due to the importance of this genus in human health). You just wouldn’t see this fine scale variance in a dataset from a much less characterized environment (the deep sea, for example) because we just don’t have enough representative taxa in public rRNA databases like SILVA.
- Similarly, the distribution of taxon sampling across a tree will currently bias the computation of community comparisons, because, as Matsen & Evans state, “more highly sampled lineages will be assigned comparatively more weight in the PCA analysis than less sampled lineages.” A sparsely sampled clade might have one long, deep branch with one leaf, whereas a well-sampled clade will have many taxa and therefore many leaves and internal branches. Because EdgePCA works on edges (branches), you’ve just got more edges to work with in well-sampled clades.
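The second limitation is easy to see with a toy calculation. The sketch below is my own illustration, under the same simplifying assumptions as before (a made-up tree, reads as point masses on leaves, edge entries defined as leaf-side mass minus root-side mass). A lone leaf X sits opposite a clade of four leaves, and the samples differ only in how their reads are split between X and that clade. Exactly the same amount of mass moves on each side, yet the clade contributes seven edges to the variance and X only one.

```python
import numpy as np

# Hypothetical tree: a lone leaf X versus a well-sampled clade with leaves Y1..Y4.
PARENT = {
    "X": "root", "n": "root",
    "na": "n", "nb": "n",
    "Y1": "na", "Y2": "na", "Y3": "nb", "Y4": "nb",
}
EDGES = sorted(PARENT)
CLADE_EDGES = {"n", "na", "nb", "Y1", "Y2", "Y3", "Y4"}

def edge_differences(leaf_mass):
    """Per edge: mass on the leaf side minus mass on the root side."""
    total = sum(leaf_mass.values())
    distal = dict.fromkeys(EDGES, 0.0)
    for leaf, mass in leaf_mass.items():
        node = leaf
        while node in PARENT:
            distal[node] += mass / total
            node = PARENT[node]
    return np.array([2.0 * distal[e] - 1.0 for e in EDGES])

def variance_share(samples, clade_edges):
    """Fraction of the total per-edge variance that falls on the given edges."""
    x = np.array([edge_differences(s) for s in samples])
    per_edge = x.var(axis=0, ddof=1)
    in_clade = np.array([e in clade_edges for e in EDGES])
    return per_edge[in_clade].sum() / per_edge.sum()

# Each hypothetical sample puts a fraction f of its reads on X
# and spreads the remainder evenly over Y1..Y4.
samples = [
    {"X": f, "Y1": (1 - f) / 4, "Y2": (1 - f) / 4, "Y3": (1 - f) / 4, "Y4": (1 - f) / 4}
    for f in (0.2, 0.4, 0.6, 0.8)
]
share = variance_share(samples, CLADE_EDGES)
print(f"{share:.2f}")  # 0.64, i.e. 7/11 of the variance sits in the well-sampled clade
```

In this toy setup roughly 64% of the variance that PCA sees comes from the well-sampled clade, simply because it has more branches for the same signal to show up on.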
And finally, the paper hints at other exciting things on the horizon, including:
- A manuscript in preparation that looks at how reads from different regions of rRNA genes (16S in this case) affect phylogeny-based clustering and ordination methods.
- Further modifications to the edge PCA algorithm that reduce biases stemming from taxon sampling in your reference tree.
Holly Bik is a postdoctoral researcher in marine genomics, working in Jonathan Eisen’s lab at the University of California, Davis.