To RADseq or not to RADseq?

In the end, we all want to do the best science we can, on the budget we have.

It’s a cliche to say that we live in a moment of unprecedented possibility for molecular ecology, as high-throughput sequencing methods drive the cost of collecting DNA sequence data ever lower. But at the same time, it’s a tricky moment, because the future — in which population genomic data for any species is within, say, the scope of a standard NSF grant proposal — is still unevenly distributed. For study species with small genomes and established resources like high-quality reference assemblies and deep annotation databases, the future is now. For species with large and complex genomes, or without good “infrastructure” to build on, it can still be challenging to obtain useful population-scale data without spending hundreds of thousands of dollars.

For going on a decade, now, the go-to solution for this problem has been reduced-representation sequencing. Led by RADseq, or restriction site-associated DNA sequencing, these methods solve the problem of genomes that are too big to easily sequence by, as it says on the tin, reducing them. Reduced representation offers us an accessible means to identify parts of the genome that are involved in species’ adaptation to different environments and, ultimately, the formation of new species, which is one of the key questions of evolutionary ecology. So it’s no surprise that RADseq and its relatives have been hugely popular. The method was name-checked in the 2010 “Breakthrough of the Year” feature in Science, and the original RADseq papers, published in 2007 and 2008, have almost 2000 citations, as counted by Google Scholar.

So any paper that proposes there may be some problems with RADseq is bound to be controversial. An article published in Molecular Ecology Resources back in December leaned into that controversy right from its title: “Breaking RAD: An evaluation of the utility of restriction site associated DNA sequencing for genome scans of adaptation.” MER has now published the second of two response articles, and a response from the authors of “Breaking RAD” to those responses, so it seems like a good time to break down the reasoning for, and against, RADseq.

RADseq as it works

To understand the controversy, we should start with the details of what RADseq and related methods do: They reduce a sequencing job to a sample of short, unconnected fragments, scattered at random across the genome. This is achieved by digesting samples of genomic DNA with restriction enzymes that make a cut wherever they encounter a specific short sequence of nucleotides. By attaching customized adapters to the cut ends of the fragments, it’s possible to ensure that a high-throughput sequencing system will only sequence from a specific subset of the fragments. Tailoring the mix of enzymes and how the cut fragments are selected after digestion lets you vary the total number of fragments created and their average length, and thereby how much sequence data needs to be collected to read the digested, selected “RAD loci”.
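To make the digestion step concrete, here is a minimal in-silico sketch. It assumes a toy genome held as a plain string and a six-base recognition site (GAATTC, the EcoRI site); the function names are made up for illustration, and cutting at the first base of the site is a simplification, since real enzymes cut at a fixed offset within it. It ignores adapters and size selection entirely.

```python
def cut_positions(genome: str, site: str) -> list[int]:
    """Start index of every (possibly overlapping) occurrence of `site`."""
    positions = []
    start = genome.find(site)
    while start != -1:
        positions.append(start)
        start = genome.find(site, start + 1)
    return positions


def fragment_lengths(genome: str, site: str) -> list[int]:
    """Fragment lengths after cutting at the first base of each site.

    Simplification: real enzymes cut at a fixed offset inside the site.
    """
    bounds = [0] + cut_positions(genome, site) + [len(genome)]
    return [end - start for start, end in zip(bounds, bounds[1:]) if end > start]


def expected_sites(genome_length: int, site_length: int) -> float:
    """Expected site count, assuming four equally frequent, independent bases."""
    return (genome_length - site_length + 1) / 4 ** site_length


toy_genome = "AAGAATTCAAGAATTC"
print(cut_positions(toy_genome, "GAATTC"))     # [2, 10]
print(fragment_lengths(toy_genome, "GAATTC"))  # [2, 8, 6]
print(expected_sites(1_000_000_000, 6))
```

Under the equal-base-frequency assumption, a six-base site turns up about once every 4,096 bases, or roughly 244,000 times in a 1 Gb genome, while an eight-base site turns up only about 15,000 times. That difference is the basic lever for tuning how many RAD loci a protocol produces; real genomes depart from the assumption, so treat these as ballpark figures.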

The fragment sequences can be aligned to a reference genome, or simply assembled into anonymous chunks of sequence, to identify single-nucleotide polymorphisms. Barring some bias in the restriction enzymes, this will result in short stretches of sequence variation data scattered randomly across the genome. With just a few thousand such markers, or RAD loci, it’s possible to do more powerful population genetic analysis than anyone dreamed of at the turn of the century — reconstruction of past changes in population size, evaluation of population structure, estimation of migration rates across the landscape. With many more markers, it’s possible to do genome scans, testing population differentiation or association with environmental conditions at each marker to identify outliers, which might be targets of local adaptation, or closely linked to them.
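As a rough illustration of what a differentiation-based genome scan does at each marker, the sketch below computes a simple two-population FST from allele frequencies and ranks loci by it. The function names, locus labels, and the top-fraction cutoff are all assumptions for illustration; published scans use estimators and outlier tests that account for sample size and demographic history, which this does not.

```python
def fst_two_pops(p1: float, p2: float) -> float:
    """FST = (HT - HS) / HT for one biallelic locus in two equal-sized populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity, pooled
    if h_total == 0:
        return 0.0  # locus is fixed for the same allele everywhere
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    return (h_total - h_within) / h_total


def top_outliers(fst_by_locus: dict[str, float], fraction: float = 0.01) -> list[str]:
    """Names of the loci in the top `fraction` of the FST distribution."""
    n_keep = max(1, int(len(fst_by_locus) * fraction))
    ranked = sorted(fst_by_locus, key=fst_by_locus.get, reverse=True)
    return ranked[:n_keep]


# Hypothetical allele frequencies (population 1, population 2) at three RAD loci.
freqs = {"locus_a": (0.50, 0.52), "locus_b": (0.10, 0.90), "locus_c": (0.30, 0.35)}
fst = {name: fst_two_pops(p1, p2) for name, (p1, p2) in freqs.items()}
print(top_outliers(fst, fraction=0.34))  # ['locus_b']
```

A locus flagged this way is only a candidate: it may be the target of selection itself, or simply sit close enough to one to share its signal.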

RADseq at its limits

If you read “closely linked to them” and thought “wait, what?” you’ve run right into the core issue that David Lowry and his coauthors identify in the “Breaking RAD” paper. (Disclosure, at this point: I know authors on both sides of this debate, but Katie Lotterhos, one of the “Breaking RAD” coauthors, is a collaborator — and I was a postdoc with Peter Tiffin, who coauthored another recent review that makes a similar point.) Different bits of sequence within a chromosome don’t evolve independently — they are physically linked, and will be inherited together unless they’re separated by recombination. That linkage means that any locally adapted locus will be embedded in a region of sequence that isn’t responsible for local adaptation, but that does show many of the same population genetic signals we use to identify local adaptation, like higher-than-usual differentiation between populations. With enough randomly-distributed RAD loci, it should be possible to hit those linked target zones even if it’s highly unlikely a RAD locus will cover the specific sequence that is actually locally adapted.

And there’s the rub. The width of that linked target zone is unpredictable — the rate and scale of recombination varies both across individual genomes and among species, and it can depend on not just the basic biology of a study species, but its population structure and demographic history. Determining the structure of linkage within a population under study is part of the genomic “infrastructure”, like a reference genome assembly and annotation, that can only be established with a lot of time and sequence data. Lowry et al. surveyed published papers applying RADseq or related methods, and found that, of 27 studies, the median had just over 4 RAD loci per megabase of genome sequence. For context, they assembled published estimates of the extent of linkage in a wide array of species, and found that many had linkage extending only hundreds of base-pairs, or even less. They also simulated RADseq protocols to estimate the proportion of a genome “covered” by RAD loci given different linkage extents and genome sizes, and found that, for many genomes, hundreds of thousands of RAD loci would be necessary — a couple orders of magnitude more than were generated in many published studies. That suggests, in other words, that a lot of discoveries made using RADseq are based on incomplete samples of the genome.

(A) Figure 3 of Lowry et al. (2016) and (B) an extension of that figure by McKinney et al. (2017). Edited to add: Each panel plots the proportion of a simulated genome “covered” by a given number of randomly distributed RAD loci, given different average linkage distances. Different line styles indicate different sizes of simulated genomes.
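A rough way to see where coverage curves like these come from is to drop loci at random on a genome and ask how much of it falls within one linkage distance of any locus. The sketch below is not the authors' code; it is a simplified version that treats each RAD locus as a single point, assumes linkage extends a fixed distance `ld_bp` on either side, and uses hypothetical function names. It gives both a closed-form approximation and a small simulation to check it.

```python
import math
import random


def expected_coverage(n_loci: int, ld_bp: int, genome_bp: int) -> float:
    """Approximate fraction of the genome within `ld_bp` of at least one locus.

    Assumes uniformly random placement and ignores chromosome-end effects.
    """
    return 1 - math.exp(-2 * n_loci * ld_bp / genome_bp)


def simulated_coverage(n_loci: int, ld_bp: int, genome_bp: int,
                       rng: random.Random) -> float:
    """Place loci uniformly at random and measure the union of their windows."""
    if n_loci == 0:
        return 0.0
    windows = sorted(
        (max(0, pos - ld_bp), min(genome_bp, pos + ld_bp))
        for pos in (rng.randrange(genome_bp) for _ in range(n_loci))
    )
    covered = 0
    cur_start, cur_end = windows[0]
    for start, end in windows[1:]:
        if start > cur_end:          # gap: close out the current merged window
            covered += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                        # overlap: extend the current merged window
            cur_end = max(cur_end, end)
    covered += cur_end - cur_start
    return covered / genome_bp


# About 4 loci per megabase (the survey median) with linkage extending 1 kb.
print(expected_coverage(n_loci=4, ld_bp=1_000, genome_bp=1_000_000))
print(simulated_coverage(4_000, 1_000, 10_000_000, random.Random(1)))
```

At 4 loci per megabase and 1 kb of linkage, the approximation puts coverage below 1% of the genome. Raise the linkage distance to 100 kb and the same marker density covers more than half of it, which is essentially the disagreement between the two papers in miniature.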

But maybe linkage extends farther than we think?

The first response paper, lead-authored by Garret McKinney, wholeheartedly rejects that implication. They review the RADseq studies examined by Lowry et al., and argue that many of them did indeed find positive targets of selection, or other important insight into evolution and adaptation, and take a “glass is half-full” view of the range of linkage presented in the “Breaking RAD” paper:

Of course, it is unrealistic to expect that the 30 species in this table broadly represent all species and all populations of interest for scientific study, and Lowry et al. (2016) fail to point out that six of the species in their Table 1 (20%) have LD estimates either equal or much greater than 100 Kb, three of which even had LD estimates of 1 Mb or greater …

Using code from the Supporting Information of Lowry et al., McKinney et al. extend a key figure from the Breaking RAD paper to show that, if linkage really does extend farther than Lowry et al. assume, a typical RADseq protocol can account for 100% of multi-gigabase genomes. I’d characterize the thrust of their argument as, why assume the worst? And, when candidates from a RADseq-based genome scan show evidence of functional roles or a history of selection based on independent data, I’d say they’ve got a point. Even if RADseq can’t find every part of the genome driving adaptation, finding some parts is a good start, and can be a major breakthrough. Still, if we want to understand how adaptation works in general, we ultimately want to build up a comprehensive picture of even individual cases.

Are the alternatives any better?

The second response paper is by several of the original RADseq authors, led by Julian Catchen, and they argue that even if RADseq has its limitations, the alternatives are not necessarily better. Lowry et al. suggest, rather than the random genome reduction of RADseq, taking a targeted approach to reduce sequencing costs: either using RNAseq to sequence only expressed genes, or exome capture to sequence protein-coding genes identified from RNAseq or from the annotation of a reference genome. The logic here is that, if you can’t afford to sequence the whole genome, you might as well know what you’re missing. (In this case, targeting protein-coding regions with the understanding that non-coding regions, including potentially important regulatory elements, will be missed.) It’s also possible to estimate population allele frequencies with whole-genome coverage using pooled sequencing, in which individual genomic samples are mixed in equal proportions to produce a “pool” that is then sequenced. As long as enough samples are pooled and the pooling process is precise, this can give good estimates of allele frequencies with a lot less sequencing than would be needed to sequence every sample individually.
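To put a number on "enough samples" and "precise", here is a minimal sketch of the sampling error in a pooled allele-frequency estimate. It assumes an idealized two-stage process: chromosomes are sampled binomially from the population into the pool, every individual contributes equally, and reads are then sampled binomially from the pool with no sequencing error. The function name and parameters are hypothetical, and real pools will do somewhat worse than this.

```python
import math


def pooled_freq_se(p: float, n_diploid: int, depth: int) -> float:
    """Standard error of a pooled allele-frequency estimate.

    p          true population allele frequency
    n_diploid  number of diploid individuals in the pool
    depth      number of reads covering the site
    """
    chromosomes = 2 * n_diploid
    variance = p * (1 - p) * (1 / chromosomes + (1 - 1 / chromosomes) / depth)
    return math.sqrt(variance)


# 50 individuals pooled, sequenced to 100x at a site with true frequency 0.5.
print(round(pooled_freq_se(0.5, 50, 100), 4))  # 0.0705
# Best case for the same 50 individuals genotyped one by one, without error.
print(round(math.sqrt(0.5 * 0.5 / 100), 4))    # 0.05
```

The first term in the variance comes from which individuals ended up in the pool, and the second from which reads were sequenced. Adding depth only shrinks the second term, so past a certain point more sequencing cannot substitute for more samples.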

Catchen et al. argue that the limitations of RNAseq, exome capture, and pooled sequencing put them out of reach in many cases where RADseq can still work. RNA sequencing is potentially biased by variable gene expression, while exome capture and pooled sequencing are more reliant on a good reference genome, or at least a transcriptome, as a starting point for capture array design or to align pooled sequences for genotyping. They also point out that it’s perfectly possible to assess the extent of linkage using RADseq data — so RADseq users can evaluate the suitability of their data before running a genome scan. This isn’t possible with pooled sequencing, because the protocol elides individual genotypes.
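The linkage check is possible because RADseq yields a genotype for each individual at each SNP. One common summary is r², which can be estimated from unphased data as the squared correlation between genotype dosages (0, 1, or 2 copies of an allele) at two SNPs. The sketch below is a bare-bones version with a hypothetical function name; dedicated population genetics software handles missing data and phasing, which this ignores.

```python
def genotype_r2(dosages_a: list[int], dosages_b: list[int]) -> float:
    """Squared Pearson correlation between genotype dosages at two SNPs.

    Each list holds one value per individual: 0, 1, or 2 copies of an allele.
    Returns NaN if either SNP has no variation among the sampled individuals.
    """
    n = len(dosages_a)
    mean_a = sum(dosages_a) / n
    mean_b = sum(dosages_b) / n
    cov = sum((a - mean_a) * (b - mean_b) for a, b in zip(dosages_a, dosages_b))
    var_a = sum((a - mean_a) ** 2 for a in dosages_a)
    var_b = sum((b - mean_b) ** 2 for b in dosages_b)
    if var_a == 0 or var_b == 0:
        return float("nan")
    return cov ** 2 / (var_a * var_b)


# Four individuals genotyped at two pairs of SNPs (made-up data).
print(genotype_r2([0, 1, 2, 1], [0, 1, 2, 1]))  # 1.0
print(genotype_r2([0, 0, 2, 2], [0, 2, 0, 2]))  # 0.0
```

Plotting r² against the physical or map distance between SNP pairs shows how quickly linkage decays. Knowing that distance requires a reference genome or linkage map, or at least SNPs that share a RAD locus.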

It takes a village (of genomic resources)

In their response to these responses, Lowry et al. emphasize that their argument is not against RADseq as a method, but against its use without proper understanding of the biology and genome structure of the species to be studied. While it is clear that many individual studies using RADseq take this precaution, others have not. They recommend that molecular ecology studies of previously un-sequenced species, whether based on RADseq or otherwise, start by building the infrastructure of an annotated reference genome or a linkage map, or both; that study designs be informed by model-based power analysis; and that candidate loci found in genome-scan analyses be validated by data beyond population genomics. As they conclude:

We did not and do not advocate for any ascertainment method across all scenarios, only that investigators responsibly assess different ascertainment designs including RADseq, whole-genome pooled sequencing, RNA-seq, and sequence capture in the context of a study question, genome size, and expected patterns of LD. We do advocate for careful consideration of experimental designs and acknowledgement of errors when they occur.

References

Andrews KR, JM Good, MR Miller, G Luikart, and PA Hohenlohe. 2016. Harnessing the power of RADseq for ecological and evolutionary genomics. Nature Reviews Genetics 17:81–92. doi: 10.1038/nrg.2015.28

Baird NA, PD Etter, TS Atwood, MC Currey, AL Shiver, et al. 2008. Rapid SNP discovery and genetic mapping using sequenced RAD markers. PLOS ONE 3(10): e3376. doi: 10.1371/journal.pone.0003376

Catchen JM, PA Hohenlohe, L Bernatchez, WC Funk, KR Andrews, and FW Allendorf. 2017. Unbroken: RADseq remains a powerful tool for understanding the genetics of adaptation in natural populations. Mol. Ecol. Resources. doi: 10.1111/1755-0998.12669

Gaut BS, SI Wright, C Rizzon, J Dvorak, and LK Anderson. 2007. Recombination: an underappreciated factor in the evolution of plant genomes. Nature Reviews Genetics 8:77-84. doi: 10.1038/nrg1970

Lowry DB, S Hoban, JL Kelley, KE Lotterhos, LK Reed, MF Antolin, and A Storfer. 2016. Breaking RAD: An evaluation of the utility of restriction site associated DNA sequencing for genome scans of adaptation. Mol. Ecol. Resources. 17:142–52. doi: 10.1111/1755-0998.12635

Lowry DB, S Hoban, JL Kelley, KE Lotterhos, LK Reed, MF Antolin, and A Storfer. 2017. Responsible RAD: Striving for best practices in population genomic studies of adaptation. Mol. Ecol. Resources. doi: 10.1111/1755-0998.12677

McKinney GJ, WA Larson, LW Seeb, and JE Seeb. 2017. RADseq provides unprecedented insights into molecular ecology and evolutionary genetics: comment on Breaking RAD by Lowry et al. (2016). Mol. Ecol. Resources. doi: 10.1111/1755-0998.12649

Miller MR, JP Dunham, A Amores, WA Cresko, and EA Johnson. 2007. Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Research 17:240-248. doi: 10.1101/gr.5681207

Schlötterer C, R Tobler, R Kofler, and V Nolte. 2014. Sequencing pools of individuals — mining genome-wide polymorphism data without big funding. Nature Reviews Genetics 15:749-763. doi: 10.1038/nrg3803

About Jeremy Yoder

Jeremy B. Yoder is an Associate Professor of Biology at California State University Northridge, studying the evolution and coevolution of interacting species, especially mutualists. He is a collaborator with the Joshua Tree Genome Project and the Queer in STEM study of LGBTQ experiences in scientific careers. He has written for the website of Scientific American, the LA Review of Books, the Chronicle of Higher Education, The Awl, and Slate.