Update, 29 Jan 2015: This post has been edited to remove a video clip from the movie “Chinatown,” which was jarring and really just unnecessary, as pointed out in the comments.
At its most basic level, population genetics is about looking for patterns. Patterns of discontinuity that might indicate barriers to dispersal. Patterns of association that might indicate local adaptation. Or, better yet, individual loci that violate the patterns seen in the rest of the genome, because those loci are likely to be interesting—i.e., recent targets of natural selection.
So naturally, it should worry us if the patterns we see in our data don’t mean what we think they mean. That’s the problem posed by Patrick Meirmans in a recent perspective article for Molecular Ecology. Meirmans argues that many of the most common methods we use to look for population genetic patterns, and to detect loci that violate those patterns, can be confounded by one of the longest-established patterns of them all: isolation by distance.
Isolation by distance (IBD) is a simple consequence of limited dispersal across space, which Sewall Wright described almost seventy years ago: pairs of populations close to each other will be more genetically similar to each other than populations farther away from each other, not because of any selective need for those genetic similarities, but just because individual critters, or their seeds, or pollen, or larvae are less likely to travel longer distances.
IBD, or population structure? Or both?
There’s a simple, well-known test for IBD: a Mantel test for a significant relationship between the genetic distances between sampling sites and their geographic distances. But Meirmans points out that, generally speaking, once population geneticists perform a Mantel test and find evidence of IBD, we don’t do anything about it. Surveying 72 Molecular Ecology papers published in 2011 that include a Mantel test for IBD, Meirmans found that a large majority detected IBD, and that the majority of those papers then didn’t use that discovery to inform the rest of the analyses they performed.
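For readers who have not looked under the hood of a Mantel test, here is a minimal sketch in plain Python. It is not the code from any particular package: the function names (`pearson`, `flatten`, `mantel`), the eight-site transect, and the noise level are all made up for illustration. The key step is that site labels are permuted, so rows and columns of the genetic matrix are shuffled together rather than shuffling individual pairwise distances.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flatten(matrix, order):
    """Upper triangle of a symmetric matrix, with sites relabelled by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(gen, geo, n_perm=999, seed=1):
    """One-tailed Mantel test for a positive genetic/geographic association."""
    rng = random.Random(seed)
    identity = list(range(len(gen)))
    geo_vec = flatten(geo, identity)
    r_obs = pearson(flatten(gen, identity), geo_vec)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # shuffle site labels: rows and columns move together
        if pearson(flatten(gen, perm), geo_vec) >= r_obs:
            hits += 1
    # Count the observed arrangement as one of the permutations.
    return r_obs, (hits + 1) / (n_perm + 1)

# Hypothetical toy data: 8 sites on a transect, with genetic distance equal to
# a multiple of geographic distance plus a little random noise.
toy_rng = random.Random(42)
coords = [float(i) for i in range(8)]
geo = [[abs(a - b) for b in coords] for a in coords]
gen = [[0.0] * 8 for _ in range(8)]
for i in range(8):
    for j in range(i + 1, 8):
        gen[i][j] = gen[j][i] = 0.01 * geo[i][j] + toy_rng.uniform(0.0, 0.01)

r, p = mantel(gen, geo)
print(f"Mantel r = {r:.3f}, permutation p = {p:.3f}")
```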
That’s a problem because, as Meirmans demonstrates, IBD can be easy to conflate with one of the first patterns most population geneticists test for in a new data set: population structure. He simulates genetic data under two scenarios: one in which populations are clustered in two sharply divided groups, and one in which populations are continuously distributed across a landscape. In the first case, allele frequencies change rapidly at the point of division, as you’d get in real life if there’s some sort of environmental barrier to dispersal or change in the selective environment that means migrants across the line don’t do very well. In the second case, IBD is the only force acting on spatial variation in allele frequencies, and they change in a gentle slope from one end of the landscape to the other.
But these two different scenarios look very similar when viewed through the lens of certain standard population genetics analyses. Mantel tests showed a qualitatively similar profile of genetic correlation with distance in both the sharp-transition landscape and the IBD-only landscape. Conversely, when Meirmans divided the IBD-only landscape into eastern and western regions, then ran an analysis of molecular variance (AMOVA) test, he found a significant portion of genetic variation was attributable just to the east-west division. Meirmans doesn’t do a similar demonstration with the popular clustering analysis Structure, though he notes that Structure has been previously reported to be confused by IBD. (I would have liked to see an explicit test of Structure within the same framework as AMOVA, myself.)
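The east-west result is easy to reproduce in miniature. The sketch below is not an AMOVA and is not Meirmans's simulation. It assumes a crude stand-in for IBD (allele frequencies at each locus wander as a random walk along a transect of twelve sites, with no barrier anywhere), then splits the transect down the middle and asks, by permuting group labels, whether between-group genetic distances exceed within-group ones. All names and parameter values are hypothetical.

```python
import random

def between_minus_within(dist, groups):
    """Mean between-group distance minus mean within-group distance."""
    between, within = [], []
    n = len(groups)
    for i in range(n):
        for j in range(i + 1, n):
            bucket = within if groups[i] == groups[j] else between
            bucket.append(dist[i][j])
    return sum(between) / len(between) - sum(within) / len(within)

rng = random.Random(7)
n_sites, n_loci = 12, 50

# IBD-only toy landscape: each locus is a random walk in allele frequency
# from west to east, so neighbouring sites are similar and there is no barrier.
freqs = []
for _ in range(n_loci):
    p = rng.uniform(0.3, 0.7)
    locus = []
    for _ in range(n_sites):
        p = min(0.95, max(0.05, p + rng.gauss(0.0, 0.05)))
        locus.append(p)
    freqs.append(locus)

# Genetic distance between sites: mean absolute allele-frequency difference.
dist = [[sum(abs(locus[i] - locus[j]) for locus in freqs) / n_loci
         for j in range(n_sites)] for i in range(n_sites)]

# An arbitrary line down the middle of the transect.
groups = ["west"] * 6 + ["east"] * 6
observed = between_minus_within(dist, groups)

# Permutation test: reassign the same group labels to sites at random.
n_perm, hits = 999, 0
for _ in range(n_perm):
    shuffled = groups[:]
    rng.shuffle(shuffled)
    if between_minus_within(dist, shuffled) >= observed:
        hits += 1
p_value = (hits + 1) / (n_perm + 1)

print(f"between - within = {observed:.4f}, permutation p = {p_value:.3f}")
```

Any contiguous split of a landscape like this one puts the most distant pairs of sites into different groups, which is all the test needs in order to report "structure."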
So let’s say you’ve collected genetic data from sites on either side of a line you think might be biologically significant—a pretty standard-issue population genetics study. You run your data through Structure, and find two clusters of collection sites that line up pretty well with that Line of Hypothesized Biological Significance. As a followup, you conduct an AMOVA with the collection sites grouped according to their placement by Structure, and you find that the clusters explain a significant fraction of the total genetic variation in your data set. Therefore, you conclude that the LHBS is, in fact, a significant barrier to dispersal.
Except that as we’ve just discussed, everything you’ve just found could be a consequence of simple IBD plus the fact that you’ve structured your sampling so that your LHBS happens to bisect the landscape you’re studying. And just to add to the frustration, even if you’d started out by testing for IBD before you started with all of the tests for population structure, a significant result in a Mantel test for IBD wouldn’t necessarily mean that population structure wasn’t there.
Lying outliers
But wait, it gets worse.
Meirmans follows up with simulations of a kind of analysis that’s becoming very popular as big, next-generation sequencing datasets become more and more accessible for people studying non-model organisms (i.e., most of us): outlier analysis. This is the class of analyses where, given population data from hundreds or thousands of loci, we “scan” for loci with an unusually strong association between their alleles and some environmental variable of interest (Meirmans tests the approach implemented in SAM) or for loci that show unusually high or low differentiation (like FDIST, which uses FST as its measure of differentiation). Essentially, both of these approaches assume that if a locus falls outside of the 95% confidence interval (established by a big sample of other loci or coalescent simulations or what have you), then it’s probably in the tail of the distribution because natural selection has been acting on it.
Simulating populations evolving under IBD—without any selection at all—across a map of Scandinavia, Meirmans tested for associations with real climate data for the mapped region.
He not only found that loci ended up in the tail of the association distribution (in SAM) or the differentiation distribution (in FDIST)—he found that both analyses identified an excess of loci with p ≤ 0.05. In the case of one SAM analysis, upwards of 30% of the simulated loci had a p-value below the traditional threshold for “significance.” For SAM, this is because the spatial pattern arising from IBD—greater genetic differentiation from one end of the Scandinavian Peninsula to the other—lines up with the major north-south axis along which most major climate variables change across the region. For FDIST, Meirmans attributes the excess of significant loci to violations of the population genetic model underlying the coalescent simulations used to identify outliers.
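The SAM result comes down to spatial autocorrelation inflating a test that treats sampling sites as independent. Here is a toy version of that logic, with heavy caveats: it is not SAM (which fits logistic regressions), not Meirmans's Scandinavian simulation, and the random-walk "IBD" model, the made-up latitude values, and every name in the code are assumptions for illustration. It compares neutral loci whose frequencies drift smoothly along a north-south transect against loci with no spatial structure at all, and counts how many of each pass a naive correlation test at p ≤ 0.05.

```python
import math
import random

N_SITES = 20
N_LOCI = 500
# Two-tailed critical value of t at alpha = 0.05 with 18 degrees of freedom,
# converted to the matching critical |r| for a correlation across 20 sites.
T_CRIT = 2.101
R_CRIT = T_CRIT / math.sqrt(N_SITES - 2 + T_CRIT ** 2)

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ibd_locus(rng):
    """Neutral locus whose allele frequency random-walks from south to north."""
    p = rng.uniform(0.3, 0.7)
    freqs = []
    for _ in range(N_SITES):
        p = min(0.99, max(0.01, p + rng.gauss(0.0, 0.03)))
        freqs.append(p)
    return freqs

def unstructured_locus(rng):
    """Neutral locus with an independent allele frequency at every site."""
    return [rng.uniform(0.01, 0.99) for _ in range(N_SITES)]

def fraction_significant(loci, env):
    """Share of loci whose |r| with env passes the naive p <= 0.05 cutoff."""
    hits = sum(1 for freqs in loci if abs(pearson(freqs, env)) >= R_CRIT)
    return hits / len(loci)

rng = random.Random(2012)
# Hypothetical environmental variable that changes steadily with latitude.
latitude = [55.0 + 0.7 * i for i in range(N_SITES)]

frac_ibd = fraction_significant([ibd_locus(rng) for _ in range(N_LOCI)], latitude)
frac_null = fraction_significant(
    [unstructured_locus(rng) for _ in range(N_LOCI)], latitude)

print(f"IBD-only loci with p <= 0.05:     {frac_ibd:.1%}")
print(f"Unstructured loci with p <= 0.05: {frac_null:.1%}")
```

No locus in either set is under selection. The only difference is whether neighbouring sites resemble each other, and that alone is enough to push the "significant" fraction well past the nominal 5%.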
Living with IBD
So what then should we do, as we sit down to analyze a new table of microsatellite or SNP genotypes from our favorite critters? Meirmans’s advice comes down to “watch out for IBD.” To deal with the Chinatown dilemma, he proposes testing for IBD within population clusters identified by AMOVA or a clustering algorithm, with the caveat that subdividing your data will reduce statistical power. Meirmans also reports that a partial Mantel test can be used to test whether geography (and thereby IBD) contributes to apparent clustering—by testing to see whether an association between a matrix of cluster assignments and genetic distances disappears when controlled for geographic distance.
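A partial Mantel test of this kind can be sketched in a few lines. The version below uses the usual first-order partial correlation formula and permutes site labels in the genetic matrix. It is an illustrative sketch rather than the routine from any specific package, and the ten-site transect, the two clusters, and the function names are all hypothetical. The toy data contain IBD only, with clusters drawn by splitting the transect in half.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def flatten(matrix, order):
    """Upper triangle of a symmetric matrix, with sites relabelled by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def partial_r(a, b, c):
    """Correlation between a and b after controlling for c."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

def partial_mantel(gen, cluster, geo, n_perm=999, seed=1):
    """One-tailed partial Mantel test of gen vs cluster, controlling for geo."""
    rng = random.Random(seed)
    identity = list(range(len(gen)))
    b, c = flatten(cluster, identity), flatten(geo, identity)
    r_obs = partial_r(flatten(gen, identity), b, c)
    hits = 0
    for _ in range(n_perm):
        perm = identity[:]
        rng.shuffle(perm)  # permute site labels in the genetic matrix only
        if partial_r(flatten(gen, perm), b, c) >= r_obs:
            hits += 1
    return r_obs, (hits + 1) / (n_perm + 1)

# Hypothetical IBD-only data: 10 sites on a transect, split into two clusters.
toy_rng = random.Random(11)
n = 10
labels = [0] * 5 + [1] * 5
geo = [[float(abs(i - j)) for j in range(n)] for i in range(n)]
cluster = [[0.0 if labels[i] == labels[j] else 1.0 for j in range(n)]
           for i in range(n)]
gen = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        gen[i][j] = gen[j][i] = 0.01 * geo[i][j] + toy_rng.uniform(0.0, 0.01)

order = list(range(n))
r_simple = pearson(flatten(gen, order), flatten(cluster, order))
r_partial, p_partial = partial_mantel(gen, cluster, geo)
print(f"simple r (genetic vs cluster)  = {r_simple:.3f}")
print(f"partial r (controlling for geo) = {r_partial:.3f}, p = {p_partial:.3f}")
```

If the clusters are nothing more than IBD cut in two, the simple correlation with cluster membership should be sizeable while the partial correlation shrinks toward zero once geographic distance is accounted for.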
It might also be possible to test for a difference in the slope of the IBD relationship for pairs of collection sites from different clusters and pairs of sites from the same cluster—if the clusters are on either side of a true barrier to dispersal, you’d expect that genetic distance would increase more rapidly with geographical distance when making comparisons across your LHBS.
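As a rough sketch of what that comparison would involve, the code below fits separate least-squares slopes of genetic distance on geographic distance for within-cluster and across-cluster pairs of sites. The toy data build a steeper across-cluster slope in by construction (the rates 0.010 and 0.015 are invented), so the only point is the bookkeeping. A real analysis would also need a permutation scheme to judge whether the difference in slopes is significant, since pairwise distances are not independent observations.

```python
import random

def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(3)
n_sites = 10
cluster = [0] * 5 + [1] * 5  # hypothetical cluster assignments along a transect

within_geo, within_gen = [], []
across_geo, across_gen = [], []
for i in range(n_sites):
    for j in range(i + 1, n_sites):
        geo = float(j - i)
        same = cluster[i] == cluster[j]
        # Assumed toy model: distance accumulates faster across the barrier.
        rate = 0.010 if same else 0.015
        gen = rate * geo + rng.uniform(0.0, 0.01)
        if same:
            within_geo.append(geo)
            within_gen.append(gen)
        else:
            across_geo.append(geo)
            across_gen.append(gen)

slope_within = ols_slope(within_geo, within_gen)
slope_across = ols_slope(across_geo, across_gen)
print(f"IBD slope within clusters: {slope_within:.4f}")
print(f"IBD slope across clusters: {slope_across:.4f}")
```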
For the outlier analyses, Meirmans advocates approaches that explicitly take into account geographic location, rather than simply measuring association with environmental variables or differentiation in a vacuum. One such method, spatial ancestry analysis (SPA), was recently published in Nature Genetics, and although it uses a possibly over-simple model of allele frequency change across space, it looks like a promising start.
But at a much more fundamental level, Meirmans’s examination of outlier analyses is a reminder that there’s nothing magical about p ≤ 0.05, which I’d hope everyone reading this already knows. Whatever your criterion for detecting “unusual” loci in a genetic dataset, it’s important to make sure that what you choose to call outliers are actually outliers in the distribution of your data as a whole, and to understand that identifying outliers isn’t necessarily the same thing as positively identifying targets of selection. It’s really only a way to pick out a subset of loci for further, in-depth analysis: “candidate loci,” in the jargon of association genetics, which has learned some of these lessons already.
In the end, if all your data shows is that allele frequencies change with geographic location: Forget it Jake, it’s (probably) IBD.
References
Meirmans, P.G. 2012. The trouble with isolation by distance. Molecular Ecology 21: 2839-2846. DOI: 10.1111/j.1365-294X.2012.05578.x.
Wright, S. 1943. Isolation by distance. Genetics 28: 114-138. PMCID: PMC1209196.
Yang, W.-Y., Novembre, J., Eskin, E. & Halperin, E. 2012. A model-based approach for analysis of spatial structure in genetic data. Nat Genet 44: 725-731. DOI: 10.1038/ng.2285.