PacBio is emerging as the favoured sequencing approach for assembling high-quality reference genomes. But the big issue with PacBio sequencing is that to get long sequence reads you need to start with high molecular weight DNA. For my first foray into PacBio sequencing back in 2016 I sent a single DNA sample from the parasitic plant Euphrasia that I’d extracted from silica-dried tissue with a standard commercial column-based DNA extraction kit (Qiagen DNeasy Plant Mini Kit). I did all that I could to minimise shearing by using wide-bore tips and by not vortexing the sample. The DNA looked fine when run on an agarose gel, with a single band above 20 Kb and no smear that would indicate shearing.
The PacBio sequence data I got back from this DNA sample was disappointing. Most of the sequences were incredibly short: the size distribution showed a peak at less than 2 Kb, and few reads were over 20 Kb. It seemed that the initial gel picture wasn’t really capturing the integrity of the DNA, and that DNA damage such as breaks or nicks was present. This damage causes the polymerase to fall off during PacBio sequencing and results in short or failed reads.
For my second attempt, I sent my silica-dried tissue sample to a commercial company that offers a high molecular weight DNA extraction service (there are many companies to choose from). I paid a hefty $1500 for them to extract DNA from a sample using their own proprietary DNA isolation protocol (similar to this). While I’d normally extract DNA myself, in this case I was short of time before some grant money ran out. The DNA they extracted looked excellent when run on a gel, with a smear above 50 Kb.
This time the PacBio data I got back was much better. While there were plenty of short fragments that are of little use, a good proportion of reads were over 10 Kb and 20 Kb, and the tail of read lengths was really long. There’s even a single 140 Kb read! While the comparison between the read length distributions of the two libraries isn’t exactly like-for-like (the sequencing centres performed different size selections), I’ve now seen for myself the massive impact of DNA integrity on the quality of long-read sequence data.
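For anyone wanting to put numbers on a comparison like this, here is a minimal Python sketch for summarising a read length distribution. It is not the analysis behind the plots described above. The function names (`read_length_summary`, `fastq_read_lengths`) are hypothetical, the 10 Kb and 20 Kb thresholds simply mirror the cut-offs mentioned in the text, and the reader assumes reads are in an uncompressed, four-line-per-record FASTQ file (PacBio data often arrives as BAM, which would need converting first).

```python
from typing import Dict, Iterable, Iterator, Sequence


def read_length_summary(
    lengths: Iterable[int],
    thresholds: Sequence[int] = (10_000, 20_000),
) -> Dict[str, float]:
    """Summarise read lengths: count, total bases, longest read, N50,
    and the fraction of reads at or above each threshold (in bases)."""
    lens = sorted(lengths, reverse=True)
    if not lens:
        raise ValueError("no read lengths supplied")

    total = sum(lens)

    # N50: walking from the longest read down, the length of the read at
    # which the running total first reaches half of all sequenced bases.
    running = 0
    n50 = 0
    for length in lens:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    summary: Dict[str, float] = {
        "reads": len(lens),
        "total_bases": total,
        "longest": lens[0],
        "n50": n50,
    }
    for t in thresholds:
        summary[f"frac_reads_ge_{t}"] = sum(1 for x in lens if x >= t) / len(lens)
    return summary


def fastq_read_lengths(path: str) -> Iterator[int]:
    """Yield read lengths from an uncompressed FASTQ file.

    Assumes strict four-line records (header, sequence, '+', qualities).
    """
    with open(path) as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # the sequence line of each record
                yield len(line.strip())
```

Running `read_length_summary(fastq_read_lengths("reads.fastq"))` on each library (the file name is a placeholder) gives figures that can be compared side by side, though differences in size selection between sequencing centres will still affect them.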
What have I learnt from this experience?
- I think many of us need to reconsider our reliance on basic quality control (QC) checks for DNA samples. My QC checks usually involve measuring total yield using a fluorescent assay such as the Qubit, and checking the size distribution of the DNA on an agarose gel or a TapeStation/Bioanalyzer. I don’t think any of these clearly show DNA breaks or nicking, though damage may be indicated by a smear below a band on a gel. Perhaps we’ll have to accept that even what appears to be the ‘perfect’ DNA sample may perform poorly, and that we need to treat our DNA very carefully. Or perhaps we’ll have to adopt additional QC measures to look for DNA breaks or nicks.
- While there has been a massive and necessary shift from lab skills to bioinformatic skills, this has reminded me that lab skills are still important. There are a massive number of protocols for extracting high molecular weight DNA. Just about all of them forgo the easy-to-use extraction kits (putting DNA through a regular column is a bad idea if it is intended for long-read sequencing), with many protocols returning to old-fashioned DNA extractions used for BAC sequencing. These protocols are often technically challenging and involve many stages, as well as species-specific optimisation. Perhaps the move to high molecular weight DNA extractions and long-read sequencing will require us to spend more time in the lab.
- Recent years have seen greater use of museum specimens and dried specimen collections for genetic analysis. I can’t help but think that many of these collections will prove not to be useful for these new long-read sequencing approaches and whole genome assembly. This may not be absolute (my freshly collected silica-dried plant sample worked fine), but in some cases we may need to get back in the field and recollect samples for genomic analyses.
What is your experience with DNA extraction for PacBio sequencing? Let me know @alex_twyford.