Skip to main content

Thank you for visiting nature.com. You are using a browser version with limited support for CSS. To obtain the best experience, we recommend you use a more up to date browser (or turn off compatibility mode in Internet Explorer). In the meantime, to ensure continued support, we are displaying the site without styles and JavaScript.

  • Letter
  • Published:

StringTie enables improved reconstruction of a transcriptome from RNA-seq reads

Abstract

Methods used to sequence the transcriptome often produce more than 200 million short sequences. We introduce StringTie, a computational method that applies a network flow algorithm originally developed in optimization theory, together with optional de novo assembly, to assemble these complex data sets into transcripts. When used to analyze both simulated and real data sets, StringTie produces more complete and accurate reconstructions of genes and better estimates of expression levels, compared with other leading transcript assembly programs including Cufflinks, IsoLasso, Scripture and Traph. For example, on 90 million reads from human blood, StringTie correctly assembled 10,990 transcripts, whereas the next best assembly was of 7,187 transcripts by Cufflinks, which is a 53% increase in transcripts assembled. On a simulated data set, StringTie correctly assembled 7,559 transcripts, which is 20% more than the 6,310 assembled by Cufflinks. As well as producing a more complete transcriptome assembly, StringTie runs faster on all data sets tested to date compared with other assembly software, including Cufflinks.

This is a preview of subscription content, access via your institution

Access options

Rent or buy this article

Prices vary by article type

from$1.95

to$39.95

Prices may be subject to local taxes which are calculated during checkout

Figure 1: Transcript assembly pipelines for StringTie, Cufflinks and Traph.
Figure 2: Transcriptome assemblers' accuracies in detecting expressed transcripts from two simulated RNA-seq data sets.
Figure 3: Accuracy of transcript assemblers at assembling known genes, measured on real data sets from four different tissues.

Similar content being viewed by others

Accession codes

Primary accessions

Sequence Read Archive

References

  1. Blencowe, B.J. Alternative splicing: new insights from global analyses. Cell 126, 37–47 (2006).

    Article  CAS  Google Scholar 

  2. Ponting, C.P., Oliver, P.L. & Reik, W. Evolution and functions of long noncoding RNAs. Cell 136, 629–641 (2009).

    Article  CAS  Google Scholar 

  3. Wang, E.T. et al. Alternative isoform regulation in human tissue transcriptomes. Nature 456, 470–476 (2008).

    Article  CAS  Google Scholar 

  4. Cabili, M.N. et al. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev. 25, 1915–1927 (2011).

    Article  CAS  Google Scholar 

  5. Salzberg, S.L. Recent advances in RNA sequence analysis. F1000 Biol. Rep. 2, 64 (2010).

    PubMed  PubMed Central  Google Scholar 

  6. Garber, M., Grabherr, M.G., Guttman, M. & Trapnell, C. Computational methods for transcriptome annotation and quantification using RNA-seq. Nat. Methods 8, 469–477 (2011).

    Article  CAS  Google Scholar 

  7. Grabherr, M.G. et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat. Biotechnol. 29, 644–652 (2011).

    Article  CAS  Google Scholar 

  8. Schulz, M.H., Zerbino, D.R., Vingron, M. & Birney, E. Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics 28, 1086–1092 (2012).

    Article  CAS  Google Scholar 

  9. Li, B. & Dewey, C.N. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics 12, 323 (2011).

    Article  CAS  Google Scholar 

  10. Roberts, A. & Pachter, L. Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Methods 10, 71–73 (2013).

    Article  CAS  Google Scholar 

  11. Feng, J., Li, W. & Jiang, T. Inference of isoforms from short sequence reads. J. Comput. Biol. 18, 305–321 (2011).

    Article  CAS  Google Scholar 

  12. Guttman, M. et al. Ab initio reconstruction of cell type-specific transcriptomes in mouse reveals the conserved multi-exonic structure of lincRNAs. Nat. Biotechnol. 28, 503–510 (2010).

    Article  CAS  Google Scholar 

  13. Trapnell, C. et al. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010).

    Article  CAS  Google Scholar 

  14. Li, J.J., Jiang, C.R., Brown, J.B., Huang, H. & Bickel, P.J. Sparse linear modeling of next-generation mRNA sequencing (RNA-Seq) data for isoform discovery and abundance estimation. Proc. Natl. Acad. Sci. USA 108, 19867–19872 (2011).

    Article  CAS  Google Scholar 

  15. Li, W., Feng, J. & Jiang, T. IsoLasso: a LASSO regression approach to RNA-Seq based transcriptome assembly. J. Comput. Biol. 18, 1693–1707 (2011).

    Article  Google Scholar 

  16. Mezlini, A.M. et al. iReckon: simultaneous isoform discovery and abundance estimation from RNA-seq data. Genome Res. 23, 519–529 (2013).

    Article  CAS  Google Scholar 

  17. Tomescu, A.I., Kuosmanen, A., Rizzi, R. & Makinen, V. A novel min-cost flow method for estimating transcript expression with RNA-Seq. BMC Bioinformatics 14 (suppl. 5), S15 (2013).

    Article  Google Scholar 

  18. Steijger, T. et al. Assessment of transcript reconstruction methods for RNA-seq. Nat. Methods 10, 1177–1184 (2013).

    Article  CAS  Google Scholar 

  19. Kim, D. et al. TopHat2: accurate alignment of transcriptomes in the presence of insertions, deletions and gene fusions. Genome Biol. 14, R36 (2013).

    Article  Google Scholar 

  20. Wu, T.D. & Nacu, S. Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 26, 873–881 (2010).

    Article  CAS  Google Scholar 

  21. Zhao, Q.Y. et al. Optimizing de novo transcriptome assembly from short-read RNA-Seq data: a comparative study. BMC Bioinformatics 12 (suppl. 14), S2 (2011).

    Article  CAS  Google Scholar 

  22. Behr, J. et al. MITIE: simultaneous RNA-Seq-based transcript identification and quantification in multiple samples. Bioinformatics 29, 2529–2538 (2013).

    Article  CAS  Google Scholar 

  23. Griebel, T. et al. Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 40, 10073–10083 (2012).

    Article  CAS  Google Scholar 

  24. Karolchik, D. et al. The UCSC Genome Browser database: 2014 update. Nucleic Acids Res. 42, D764–D770 (2014).

    Article  CAS  Google Scholar 

  25. Hansen, K.D., Brenner, S.E. & Dudoit, S. Biases in Illumina transcriptome sequencing caused by random hexamer priming. Nucleic Acids Res. 38, e131 (2010).

    Article  Google Scholar 

  26. Zimin, A.V. et al. The MaSuRCA genome assembler. Bioinformatics 29, 2669–2677 (2013).

    Article  CAS  Google Scholar 

  27. Rehrauer, H., Opitz, L., Tan, G., Sieverling, L. & Schlapbach, R. Blind spots of quantitative RNA-seq: the limits for assessing abundance, differential expression, and isoform switching. BMC Bioinformatics 14, 370 (2013).

    Article  Google Scholar 

  28. Encode Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).

  29. Pruitt, K.D., Tatusova, T., Klimke, W. & Maglott, D.R. NCBI Reference Sequences: current status, policy and new initiatives. Nucleic Acids Res. 37, D32–D36 (2009).

    Article  CAS  Google Scholar 

  30. Flicek, P. et al. Ensembl 2014. Nucleic Acids Res. 42, D749–D755 (2014).

    Article  CAS  Google Scholar 

  31. Ford, L. & Fulkerson, D. Flows in Networks (Princeton University Press, Princeton, NJ, 1962).

  32. Goldberg, A. & Tarjan, R. A new approach to the maximum-flow problem. JACM 35, 921–940 (1988).

    Article  Google Scholar 

  33. Dantzig, G. Linear Programming and Extensions (Princeton University Press, Princeton, NJ, 1962).

  34. Goldberg, A., Plotkin, S. & Tardos, E. Combinatorial algorithms for the generalized circulation problem. Math. Oper. Res. 16, 351–381 (1991).

    Article  Google Scholar 

Download references

Acknowledgements

These studies were supported in part by US National Institutes of Health grants R01-HG006677 (S.L.S.), R01-HG006102 (S.L.S.), R01-GM105705 (G.M.P.), R01-CA120185 (J.T.M.), P01-CA134292 (J.T.M.), and the Cancer Prevention and Research Institute of Texas (J.T.M.).

Author information

Authors and Affiliations

Authors

Contributions

M.P. designed the StringTie method with input from S.L.S. M.P. and G.M.P. implemented the algorithms. C.M.A. ran all programs on the RNA-seq data and tuned their performance. J.T.M. and T.-C.C. produced the kidney cell line data and gave feedback on StringTie's performance. M.P. and S.L.S. wrote the paper. S.L.S. supervised the entire project. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Steven L Salzberg.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–13, Supplementary Tables 1–11 and Supplementary Discussion (PDF 1024 kb)

Supplementary Software 1

StringTie code (ZIP 351 kb)

Source data

Rights and permissions

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Pertea, M., Pertea, G., Antonescu, C. et al. StringTie enables improved reconstruction of a transcriptome from RNA-seq reads. Nat Biotechnol 33, 290–295 (2015). https://doi.org/10.1038/nbt.3122

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1038/nbt.3122

This article is cited by

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing