If you have recently collected or are in the process of collecting next-generation sequencing data, then you may be wondering what the next step to working with your data will entail. Hopefully, you have been working a little bit with a Linux and/or Mac command line: enough so that you can move files, create directories, install software etc. You are now ready to analyze your sequencing data and, in particular, you want to use your next-gen data to identify and genotype loci. Regardless of whether you are interested in linkage mapping, phylogenetics, or population genomics, the program STACKS may be just what you are looking for. STACKS is a program that specializes in identifying and genotyping loci from next-generation sequencing data, in particular RAD-seq data: that is large numbers of short reads collected throughout the genome using restriction enzyme cut sites. STACKS was created by Julian Catchen et al. at the University of Oregon (click here for the associated paper; here for the program website). The program was initially developed to handle data types associated with linkage mapping, but the program can be more generally applied to identify and call loci from barcoded individuals collected from natural populations. The program runs on both Linux and Mac machines. I’d imagine that the program could also be run with a Linux virtual machine on Windows, but it would probably be much less memory-intensive to set up your PC to dual boot Linux and Windows. STACKS can be run with the command line as well as with a web interface. The web interface is designed to interact with a mySQL database, and is required for some of the analyses. To address some questions associated with first using STACKS, some of the program developers (well, at least one) have agreed to a Q&A here on the Molecular Ecologist in the next few weeks – so stay tuned! Below, I explore the general idea and functionality of the program.
STACKS can be used with or without a reference genome. Thus, if there is a genome available for a not-too-distantly related species, it may be worth trying both options in STACKS. The program works by first identifying heterozygous loci within an individual (ustacks). Reads within an individual are aligned with respect to one another in order to create “stacks”. At this point each read in the stack has 100% identical sequence (Fig. 1A, B). It should now also be obvious as to why RAD-seq type data are required: the sequencing depth must be sufficient to identify stacks. The user sets a threshold (the stack-depth parameter) and stacks that have fewer reads than this parameter are set aside (secondary reads). One neat feature of STACKS is that it holds onto these secondary reads and later attempt to align them to identified stacks in order to increase the sequencing depth of each stack. Stacks that have greater than two standard deviations worth of reads (lumberjack stacks) are also excluded. These lumberjack stacks usually represent repetitive elements. The user can next set the number of nucleotide differences with which to align stacks together (bearing in mind this is still within a single, barcoded individual). The goal here is to create stacks that have a true polymorphism, while allowing for some differences due to mutation and sequencing error. It should be noted that STACKS requires high quality sequence data or else the reads will be discarded. Specifically, reads must have an average quality score within a sliding window of greater than 90% such that, “STACKS accepts reads with isolated errors but detects reads with prolonged drops in quality and discards them.” The algorithms that go into matching stacks are complex (see Appendix 1), but Stacks that match at all user-defined nucleotide distances, are next aggregated together (See Fig. 1C-D). If there are too many different stacks connected (gray circles in Fig. 1D), then they are discarded.
The next step in STACKS, is to employ a maximum likelihood procedure to detect polymorphisms and infer alleles (See Hohenloe et al. 2010 for details). Each nucleotide position within a combined stack (a putative locus) is examined within this framework. Specifically, “This approach considers the counts of each of the four possible nucleotides in a multinomial sampling and uses a likelihood ratio test to assess the significance of the most likely genotype. The sequencing error rate is estimated implicitly by maximum likelihood at each position, and a significance level of α = 0.05 is used (without correction for multiple testing) to assign a diploid genotype (homozygote or heterozygote) at each position for each individual (Hohenlohe et al. 2010).” In my opinion, this feature separates STACKS from scripts that you may be tempted to write for yourself. Accurately distinguishing heterozygotes from homozygotes, with varying sequence depths across loci, is not an easy to solve problem. The maximum-likelihood framework used here provides a quantitative framework for accomplishing this task. Note that some loci may have more than one position with a polymorphism, but because of the short sequence length associated with next-generation sequencing, it is unlikely that recombination would occur within a stack. Thus each locus represents a haplotype. The last step of the ustacks program is to define a consensus sequence for each locus (Fig. 1 F).
It is important to remember that all of the above has taken place with reads collected from a single individual. The next steps in STACKS (cstacks and sstacks) are to create a catalog of loci from each individual in the population (whether this represents the parents from a genetic cross, or individuals sampled from population(s) that are fixed for different alleles). When merging individuals, essentially the same process as described above is repeated. Again, the user can specify a distance parameter for merging loci between individuals. STACKS can then export the results either for further use in genetic mapping programs (JoinMap, RQTL) or as observed haplotypes for a general population (as VCF format). The user can also employ the web interface to look at all of the haplotypes in every individual and perform some additional filtering.
In conclusion, STACKS might be the program you are looking for to identify and genotype loci from your next-generation sequencing data. Have you used other programs or your own scripts to accomplish these tasks? Have you used STACKS extensively? If so, please post your experiences and thoughts to the comments below.
References:
Catchen JM, Amores A, Hohenlohe PA, Cresko WA, Postlethwait JH. 2011. Stacks: building and genotyping loci de novo from short-read sequences. G3 Genes Genomes Genetics 1: 171-182.
Hohenlohe PA, Bassham S, Etter PD, Stiffler N, Johnson EA, Cresko WA. 2010. Population genomics of parallel adaptation in threespine stickleback using sequenced RAD tags. PLoS Genetics 6(2): e1000862 (23 pp).