Unique molecular identifiers for downstream data analysis: Seamless integration of UMIs in the protocol helps to mitigate PCR bias and accurately identify true variants and rare mutations, without any additional deduplication steps.
Superior sequencing performance: Reconfigured sequencing libraries perform well on all Illumina platforms (including NextSeq® and HiSeq® 3000/4000), without adding PhiX adapter-ligated library as a control.
Introduction
Massively parallel cDNA sequencing, or RNA-seq, has become the gold standard for whole-transcriptome gene expression analysis and is widely used to study cell and tissue transcriptomes. However, despite its many advantages, RNA-seq can be challenging in some situations, including cases where input amounts are low or the RNA is degraded.
Takara Bio was a pioneer in the development of a low-input solution: RiboGone technology for rRNA removal from total RNA, which enables library construction from inputs spanning 10 ng to 100 ng. We integrated this technology into our SMARTer stranded RNA-seq kits, reducing the representation of rRNA in the final libraries and leading to exceptional performance with inputs as low as 10 ng. With the release of the SMARTer Stranded Total RNA-Seq Kit - Pico Input Mammalian, we extended support to even lower inputs by incorporating a proprietary technology in which ribosomal cDNA is removed after the complete cDNA library has been created, thereby enriching for RNAs of interest, namely mRNA and non-polyadenylated RNA.
The latest update to the original kit, the SMARTer Stranded Total RNA-Seq Kit v3 - Pico Input Mammalian (referred to as “Pico v3”), now features seamless integration of unique molecular identifiers (UMIs), which helps to mitigate PCR bias and allows the user to identify true variants within the sample. Pico v3 provides a unique, sensitive, and ligation-free method to generate stranded, Illumina-ready cDNA libraries from an input range of 250 pg–10 ng of total mammalian RNA in about 7.5 hours.
Results
Unique molecular identifiers for downstream data analysis
Commonly used high-throughput sequencing platforms, including Illumina NextSeq and HiSeq 3000/4000, require PCR amplification during library construction to increase the number of cDNA molecules to an amount sufficient for sequencing and/or to enrich for fragments with successful adapter ligation (Cha and Thilly 1993; Dohm et al. 2008). However, PCR stochastically introduces errors that can propagate through later cycles and appear as sequencing artifacts or false mutations. Biases in the PCR amplification step also lead to particular sequences becoming overrepresented in the final library, resulting in inaccurate fold-change measurements (Aird et al. 2011). What, then, is the solution? This is where UMIs come in.
Unique molecular identifiers (UMIs) are molecular tags that are used to detect and quantify unique mRNA transcripts. The random sequence composition of UMIs makes it highly likely that every fragment-UMI combination is unique in the library, much like how a barcode identifies an item in a grocery store or the Dewey decimal system identifies a book in a library. When UMIs are attached to fragments of the input sample before amplification, PCR clones can be found by searching for non-unique fragment-UMI combinations, which helps to determine whether reads with the same sequence arise from truly distinct molecules or from PCR amplification of a single molecule (Fu et al. 2018). UMIs thereby support error correction and more accurate quantification, allowing for better representation of the transcripts.
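The idea of searching for non-unique fragment-UMI combinations can be illustrated with a minimal Python sketch. It is not the method used by any particular pipeline: the read records, field names, and the function `collapse_pcr_duplicates` are hypothetical, and the sketch assumes that a fragment is identified by its mapped chromosome, position, and strand, and that UMIs are compared by exact match only.

```python
from collections import defaultdict


def collapse_pcr_duplicates(reads):
    """Group reads by (chrom, pos, strand, umi).

    Each distinct key is treated as one original molecule; extra reads
    sharing a key are treated as PCR duplicates. Exact UMI matching only
    (an assumption of this sketch; no error-tolerant UMI clustering).
    """
    groups = defaultdict(list)
    for read in reads:
        key = (read["chrom"], read["pos"], read["strand"], read["umi"])
        groups[key].append(read["name"])
    return groups


# Hypothetical example reads (not real data).
reads = [
    {"name": "r1", "chrom": "chr1", "pos": 1000, "strand": "+", "umi": "ACGTACGT"},
    # Same fragment and same UMI as r1: counted as a PCR copy.
    {"name": "r2", "chrom": "chr1", "pos": 1000, "strand": "+", "umi": "ACGTACGT"},
    # Same fragment coordinates but a different UMI: a distinct molecule.
    {"name": "r3", "chrom": "chr1", "pos": 1000, "strand": "+", "umi": "TTGCAAGC"},
    # Same UMI as r1 but a different fragment: also a distinct molecule.
    {"name": "r4", "chrom": "chr2", "pos": 500, "strand": "-", "umi": "ACGTACGT"},
]

molecules = collapse_pcr_duplicates(reads)
n_molecules = len(molecules)
n_duplicates = len(reads) - n_molecules
print(n_molecules, n_duplicates)
```

In this toy input, four reads collapse to three molecules, so quantification based on molecule counts is not inflated by the amplified copy.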
UMIs are seamlessly integrated into libraries produced with Pico v3 such that they are an inherent part of the protocol. An 8-nt UMI is introduced into the same location in each fragment during library preparation through the reverse transcription step (prior to PCR amplification). Thus, it is possible to accurately identify PCR duplicates and correct for specific preferentially amplified sequences, thereby allowing for high-resolution reads and enabling accurate detection of true variants.
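Because the UMI has a fixed length and sits at the same location in every fragment, it can be separated from the insert sequence by simple slicing. The Python sketch below assumes, purely for illustration, that the 8-nt UMI occupies the first eight bases of the read that carries it; the actual read and offset are described in the user manual, not here. The function name `split_umi` and the example sequence are hypothetical.

```python
UMI_LENGTH = 8  # from the text: an 8-nt UMI


def split_umi(read_seq, umi_length=UMI_LENGTH):
    """Split a read into (umi, remainder).

    Assumes the UMI is the first `umi_length` bases of the read.
    That position is an assumption of this sketch, not a kit specification.
    """
    if len(read_seq) < umi_length:
        raise ValueError("read is shorter than the UMI length")
    return read_seq[:umi_length], read_seq[umi_length:]


# Hypothetical read sequence.
umi, insert = split_umi("ACGTACGTGGGCCCTTTAAA")

# Number of distinct sequences an 8-nt random tag can take: 4^8.
umi_space = 4 ** UMI_LENGTH
print(umi, insert, umi_space)
```

An 8-nt random tag can take 4^8 = 65,536 values, which is why the UMI is combined with fragment identity rather than used alone to distinguish molecules.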
Superior sequencing performance
To test the consistency of Pico v3 performance over the recommended input range, libraries were generated from FFPE human lung cancer normal adjacent tissue (NAT) from a single donor (500 pg–10 ng), with two technical replicates per input amount (Figure 1).
Sequencing metrics were consistent between technical replicates, as well as across the entire range of RNA input amounts. The average number of transcripts ≥0.1 TPM (transcripts per million) was 43,778 ± 1,352 (mean ± s.d., standard deviation), CV=3.1%. Similarly, the average number of genes ≥1 TPM was 18,690 ± 276, CV=1.5%. Comparison of transcript expression levels also indicated strong correlation across the range of input amounts. Proportions of reads mapped to various RNA species were comparable, regardless of RNA input amount. Of particular note were the relatively low proportions of reads mapping to nuclear and mitochondrial rRNA compared to those of the remaining RNA species reported.
Although we recommend using high-quality, gDNA-free RNA, we often notice residual levels of gDNA contamination in some challenging RNA samples (e.g., degraded RNA extracted from FFPE), resulting in high intergenic mapping rates and low strand specificities. Therefore, we validated an optional step (see the user manual, Appendix A) to remove potential gDNA contamination from RNA samples if such contamination is shown to be an issue. This optional step is seamlessly integrated into the Pico v3 workflow and significantly improves sequencing performance, as indicated by a reduced fraction of reads mapped to intergenic regions and a higher strand specificity versus no treatment (data not shown).
Figure 1. Performance metrics for Pico v3. Sequencing libraries were generated from total RNA extracted from human lung FFPE tissue using the Pico v3 kit and sequenced on a NextSeq 500 instrument. Sequencing metrics are shown for libraries generated from inputs of 0.5, 1, 5, and 10 ng, with two technical replicates per input amount. Sequences were analyzed as described in the Methods.
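The coefficients of variation quoted above follow directly from the reported means and standard deviations (CV = s.d. / mean × 100). A small Python check, using only the numbers given in the text (the function name is illustrative):

```python
def coefficient_of_variation(mean, sd):
    """Return the coefficient of variation as a percentage."""
    return 100.0 * sd / mean


# Values reported in the text.
cv_transcripts = coefficient_of_variation(43778, 1352)  # transcripts >= 0.1 TPM
cv_genes = coefficient_of_variation(18690, 276)         # genes >= 1 TPM

print(round(cv_transcripts, 1), round(cv_genes, 1))
```

Rounded to one decimal place, these come out to 3.1% and 1.5%, matching the reported values.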
Conclusions
The SMARTer Stranded Total RNA-Seq Kit v3 - Pico Input Mammalian is a complete solution to the challenge of creating stranded, indexed cDNA libraries for RNA-seq from picogram amounts of total mammalian RNA. The unique combination of SMART technology with a proprietary ribosomal cDNA depletion method enables unparalleled sensitivity, consistency, and reproducibility over the recommended range of 250 pg–10 ng, with demonstrated results from inputs as low as 100 pg. This kit excels with high-quality, partially degraded, and low-quality input RNA, enabling consistent, reproducible results from a broad range of sample types. Additionally, design updates made to this version of the kit allow for seamless integration of unique molecular identifiers (UMIs) into the protocol, helping to mitigate PCR bias and to accurately identify true variants and rare mutations within the sample. In around 7.5 hours, using very low total RNA input amounts from samples of varying types and qualities, this kit can generate Illumina-ready libraries that accurately represent coding and noncoding RNA—a major development in library prep for next-gen RNA-seq.
Methods
Library preparation and sequencing for FFPE samples
To evaluate the performance of the Pico v3 kit with FFPE samples, total RNA was extracted from FFPE human lung cancer normal adjacent tissue (BioOption) using a NucleoSpin total RNA FFPE kit (Cat. # 740982.10). Prior to library preparation, RNA integrity was evaluated on an Agilent Bioanalyzer using an Agilent RNA 6000 Pico Kit (Agilent, Cat. # 5067-1513), yielding a DV200 value of 66%. Libraries were generated from the extracted RNA using the Pico v3 kit without additional RNA fragmentation (Protocol Option 2 in the user manual). Libraries were sequenced on a MiniSeq™ using a MiniSeq High Output Reagent Kit (150 cycles) (Illumina, Cat. # FC-420-1002).
Sequencing data analysis
Reads from all libraries were trimmed and mapped to mammalian rRNA and the human mitochondrial genome using CLC Genomics Workbench. The remaining reads were subsequently mapped to the human genome using CLC with ENSEMBL-GRCh38.81 annotation. All percentages shown, including the number of reads that map to introns, exons, or intergenic regions, are percentages of the total reads in the library. The number of transcripts identified in each library was determined by the number of transcripts with a TPM greater than or equal to 0.1, as shown in Figure 1. Scatter plots were generated using UMI counts generated by the Cogent NGS Analysis Pipeline.
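For readers unfamiliar with the TPM threshold used above, the following Python sketch shows the standard TPM calculation (length-normalize counts, then scale so the values sum to one million) and how transcripts at or above 0.1 TPM would be counted. It is not the CLC Genomics Workbench implementation; the transcript names, counts, and lengths are made-up illustrative values.

```python
def tpm(counts, lengths_bp):
    """Compute transcripts per million from read counts and transcript lengths.

    counts and lengths_bp are dicts keyed by transcript ID.
    """
    # Reads per kilobase of transcript length.
    rates = {t: counts[t] / (lengths_bp[t] / 1000.0) for t in counts}
    total = sum(rates.values())
    if total == 0:
        return {t: 0.0 for t in counts}
    # Scale so that all values sum to one million.
    return {t: rate / total * 1e6 for t, rate in rates.items()}


# Hypothetical counts and lengths (not data from this study).
counts = {"tx_a": 100, "tx_b": 300, "tx_c": 0}
lengths = {"tx_a": 1000, "tx_b": 3000, "tx_c": 2000}

values = tpm(counts, lengths)

# Transcripts counted as identified: TPM >= 0.1, the threshold used in the text.
detected = [t for t, v in values.items() if v >= 0.1]
print(values, detected)
```

Here `tx_a` and `tx_b` receive equal TPM values despite different raw counts, because `tx_b` is three times longer; `tx_c` has no reads and falls below the threshold.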
References
Aird, D., Ross, M.G., Chen, W. et al. Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 12:R18 (2011).
Cha, R.S. & Thilly, W.G. Specificity, efficiency, and fidelity of PCR. PCR Methods Appl. 3:S18–S29 (1993).
Dohm, J.C., Lottaz, C., Borodina, T. & Himmelbauer, H. Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 36:e105 (2008).
Everaert, C. et al. Benchmarking of RNA-sequencing analysis workflows using whole-transcriptome RT-qPCR expression data. Sci. Rep. 7:1559 (2017).
Fu, Y., Wu, P., Beane, T. et al. Elimination of PCR duplicates in RNA-seq and small RNA-seq using unique molecular identifiers. BMC Genomics 19:531 (2018).
Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. & Wold, B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Methods 5:621–628 (2008).