
TGS-GapCloser: A fast and accurate gap closer for large genomes with low coverage of error-prone long reads

  • Mengyang Xu
  • Lidong Guo
  • Shengqiang Gu
  • Ou Wang
  • Rui Zhang
  • Brock A. Peters
  • Guangyi Fan
  • Xin Liu
  • Xun Xu
  • Li Deng
  • Yongwei Zhang
  • BGI-Shenzhen
  • University of Chinese Academy of Sciences
  • Complete Genomics Inc.
  • China National GeneBank

Research output: Contribution to journal, Article, peer-review

365 Scopus citations

Abstract

Background: Analyses that use genome assemblies are critically affected by the contiguity, completeness, and accuracy of those assemblies. In recent years, single-molecule sequencing techniques that generate long reads have become available and have enabled substantial improvements in contig length and genome completeness, especially for large genomes (>100 Mb), although bioinformatic tools for these applications are still limited.

Findings: We developed TGS-GapCloser, a software tool that closes sequence gaps in genome assemblies using low-depth (∼10×) long single-molecule reads. The algorithm extracts reads that bridge gap regions between 2 contigs within a scaffold, error-corrects only the candidate reads, and assigns the best sequence data to each gap. As a demonstration, we used TGS-GapCloser to improve the scaftig NG50 value of 3 human genome assemblies by 24-fold on average with only ∼10× coverage of Oxford Nanopore or Pacific Biosciences reads, covering up to 94.8% of gaps with sequence data at 97.7% positive predictive value. These improved assemblies achieve 99.998% (Q46) single-base accuracy, with final inserted sequences having 99.97% (Q35) accuracy, despite the high raw error rate of single-molecule reads. This enables high-quality downstream analyses, including up to a 31-fold increase in the scaftig NGA50 and up to 13.1% more complete BUSCO genes. Additionally, we show that even in ultra-large genome assemblies, such as the ginkgo (∼12 Gb), TGS-GapCloser can cover 71.6% of gaps with sequence data.

Conclusions: TGS-GapCloser can close gaps in large genome assemblies using raw long reads quickly and cost-effectively. The final assemblies generated by TGS-GapCloser have improved contiguity and completeness while maintaining high accuracy. The software is available at https://github.com/BGI-Qingdao/TGS-GapCloser.
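To make the scaftig NG50 metric quoted in the Findings more concrete, the following minimal Python sketch splits scaffolds into scaftigs at runs of N (the usual gap placeholder) and computes NG50 against an assumed genome size. It is an illustration only and is not taken from TGS-GapCloser: the function names `split_scaftigs` and `ng50`, the `min_gap` parameter, and the toy sequences and genome size are all hypothetical.

```python
import re


def split_scaftigs(scaffold, min_gap=1):
    """Split a scaffold into scaftigs at runs of at least `min_gap` N characters.

    `min_gap` is an illustrative parameter, not a TGS-GapCloser option.
    """
    pattern = re.compile(r"[Nn]{%d,}" % min_gap)
    # re.split can yield empty strings at the ends; drop them.
    return [piece for piece in pattern.split(scaffold) if piece]


def ng50(lengths, genome_size):
    """Return the largest length L such that pieces of length >= L
    together cover at least half of `genome_size`; 0 if never reached."""
    target = genome_size / 2
    total = 0
    for length in sorted(lengths, reverse=True):
        total += length
        if total >= target:
            return length
    return 0


# Toy scaffolds with gaps represented as runs of N (made-up data).
scaffolds = ["ACGTACGTNNNNACGT", "ACGTACNNACGTACGTAC"]
scaftig_lengths = [
    len(piece) for scaffold in scaffolds for piece in split_scaftigs(scaffold)
]
print("scaftig lengths:", sorted(scaftig_lengths, reverse=True))
print("scaftig NG50:", ng50(scaftig_lengths, genome_size=40))
```

Because filling a gap merges two adjacent scaftigs into one longer piece, closing gaps raises this statistic even when the scaffold-level layout is unchanged, which is why it is a natural measure of gap-closing performance.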

Original language: English
Journal: GigaScience
Volume: 9
Issue number: 9
DOIs
State: Published - 1 Sep 2020
Externally published: Yes

Keywords

  • Gap closure
  • Genome assembly
  • Ginkgo
  • MHC
  • Third-generation sequencing
