Similar literature
 20 similar documents found; search took 15 ms
1.
Due to the advent of so-called Next-Generation Sequencing (NGS) technologies, the monetary and temporal resources required for whole-genome sequencing have been reduced by several orders of magnitude. Sequence reads can be assembled either by anchoring them directly onto an available reference genome (classical reference assembly) or by concatenating them by overlap (de novo assembly). The latter strategy is preferable because it tends to maintain the architecture of the genome sequence. However, depending on the NGS platform used, the shortness of the reads causes tremendous problems in the subsequent genome assembly phase, impeding closure of the entire genome sequence. To address the problem, we developed a multi-pronged hybrid de novo strategy combining De Bruijn graph and Overlap-Layout-Consensus methods, which was used to assemble from short reads the entire genome of Corynebacterium pseudotuberculosis strain I19, a bacterium of immense importance in veterinary medicine that causes Caseous Lymphadenitis in ruminants, principally ovines and caprines. Briefly, contigs were assembled de novo from the short reads and were only oriented using a reference genome by anchoring. Remaining gaps were closed using iterative anchoring of short reads by craning to gap flanks. Finally, we compare the genome sequence assembled using our hybrid strategy to a classical reference assembly using the same data as input and show that, when a reference genome is available, it pays off to use the hybrid de novo strategy rather than a classical reference assembly, because more genome sequences are preserved using the former.
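As a rough illustration of the De Bruijn graph idea mentioned in this abstract, the sketch below builds the graph from a handful of reads by linking each k-mer's (k-1)-base prefix to its (k-1)-base suffix. It is a minimal teaching example, not the authors' pipeline: the function name `de_bruijn_edges`, the toy reads, and the choice k = 4 are all assumptions made for illustration.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Return a dict mapping each (k-1)-mer to the (k-1)-mers that follow it.

    Every k-mer found in a read contributes one edge: prefix -> suffix.
    Repeated edges are kept, so edge multiplicity reflects k-mer coverage.
    """
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph


# Hypothetical overlapping reads sampled from the toy sequence "ACGTCAG".
reads = ["ACGTC", "CGTCA", "GTCAG"]
graph = de_bruijn_edges(reads, k=4)

for node, successors in graph.items():
    print(node, "->", successors)
```

Walking the edges from `ACG` spells out the toy sequence again; real assemblers additionally handle sequencing errors, reverse complements, and repeats, which this sketch ignores.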

2.

Background

Third generation sequencing methods, like SMRT (Single Molecule, Real-Time) sequencing developed by Pacific Biosciences, offer much longer read lengths than Next Generation Sequencing (NGS) methods. Hence, they are well suited for de novo or re-sequencing projects. Sequences generated for these purposes will not only contain reads originating from the nuclear genome, but also a significant number of reads originating from the organelles of the target organism. These reads are usually discarded, but they can also be used for an assembly of organellar replicons. The long read length supports resolution of repetitive regions and repeats within the organellar genome, which might be problematic when using only short read data. Additionally, SMRT sequencing is less influenced by GC-rich areas and by long stretches of the same base.

Results

We describe a workflow for a de novo assembly of the sugar beet (Beta vulgaris ssp. vulgaris) chloroplast genome sequence based only on data originating from a SMRT sequencing dataset targeted at its nuclear genome. We show that the data obtained from such an experiment are sufficient to create a high quality assembly with a higher reliability than assemblies derived from, e.g., Illumina reads only. The chloroplast genome is especially challenging for de novo assembly as it contains two large inverted repeat (IR) regions. We also describe some limitations that still apply even though long reads are used for the assembly.

Conclusions

SMRT sequencing reads extracted from a dataset created for nuclear genome (re)sequencing can be used to obtain a high quality de novo assembly of the chloroplast of the sequenced organism. Even with a relatively small overall coverage for the nuclear genome it is possible to collect more than enough reads to generate a high quality assembly that outperforms short read based assemblies. However, even with long reads it is not always possible to clarify the order of elements of a chloroplast genome sequence reliably, which we could demonstrate with Fosmid End Sequences (FES) generated with Sanger technology. Nevertheless, this limitation also applies to short read sequencing data but is reached in this case at a much earlier stage during finishing.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-015-0726-6) contains supplementary material, which is available to authorized users.

3.
4.
Next Generation Sequencing (NGS) is a disruptive technology that has found widespread acceptance in the life sciences research community. The high throughput and low cost of sequencing have encouraged researchers to undertake ambitious genomic projects, especially in de novo genome sequencing. Currently, NGS systems generate sequence data as short reads, and de novo genome assembly using these short reads is computationally very intensive. Due to the lower cost of sequencing and higher throughput, NGS systems now provide the ability to sequence genomes at high depth. However, currently no report is available highlighting the impact of high sequence depth on genome assembly using real data sets and multiple assembly algorithms. Recently, some studies have evaluated the impact of sequence coverage, error rate and average read length on genome assembly using multiple assembly algorithms; however, these evaluations were performed using simulated datasets. One limitation of using simulated datasets is that variables such as error rates, read length and coverage, which are known to impact genome assembly, are carefully controlled. Hence, this study was undertaken to identify the minimum depth of sequencing required for de novo assembly for different sized genomes using graph based assembly algorithms and real datasets. Illumina reads for E. coli (4.6 Mb), S. kudriavzevii (11.18 Mb) and C. elegans (100 Mb) were assembled using SOAPdenovo, Velvet, ABySS, Meraculous and IDBA-UD. Our analysis shows that 50X is the optimum read depth for assembling these genomes using all assemblers except Meraculous, which requires 100X read depth. Moreover, our analysis shows that de novo assembly from 50X read data requires only 6–40 GB RAM depending on the genome size and assembly algorithm used. We believe that this information can be extremely valuable for researchers in designing experiments and multiplexing, which will enable optimum utilization of sequencing as well as analysis resources.
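The read depths quoted in this abstract translate into read counts through the usual relation depth = (number of reads × read length) / genome size. The sketch below applies it to the three genome sizes given above. The 100 bp read length is an assumption chosen for illustration (the abstract does not state one), and `reads_needed` is a hypothetical helper name.

```python
def reads_needed(genome_size_bp, depth, read_length_bp):
    """Smallest number of reads whose total bases reach depth x genome size."""
    total_bases = depth * genome_size_bp
    # Integer ceiling division avoids floating-point rounding.
    return -(-total_bases // read_length_bp)


# Genome sizes taken from the abstract; read length is an assumed 100 bp.
genomes = {
    "E. coli": 4_600_000,
    "S. kudriavzevii": 11_180_000,
    "C. elegans": 100_000_000,
}

for name, size in genomes.items():
    print(name, reads_needed(size, depth=50, read_length_bp=100))
```

Under these assumptions, 50X of the 4.6 Mb genome corresponds to 2.3 million reads; with paired-end data the same figure would count individual reads, not pairs.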

5.
Despite the ever-increasing output of next-generation sequencing data along with developing assemblers, dozens to hundreds of gaps still exist in de novo microbial assemblies due to uneven coverage and large genomic repeats. Third-generation single-molecule, real-time (SMRT) sequencing technology avoids amplification artifacts and generates kilobase-long reads with the potential to complete microbial genome assembly. However, due to the low accuracy (~85%) of third-generation sequences, a considerable number of long reads (>50X) are required for self-correction and for subsequent de novo assembly. Recently developed hybrid approaches, using next-generation sequencing data and as few as 5X long reads, have been proposed to improve the completeness of microbial assembly. In this study we have evaluated the contemporary hybrid approaches and demonstrated that assembling corrected long reads (by runCA) produced the best assembly compared to long-read scaffolding (e.g., AHA, Cerulean and SSPACE-LongRead) and gap-filling (SPAdes). For generating corrected long reads, we further examined long-read correction tools, such as ECTools, LSC, LoRDEC, PBcR pipeline and proovread. We have demonstrated that three microbial genomes, including Escherichia coli K12 MG1655, Meiothermus ruber DSM1279 and Pedobacter heparinus DSM2366, were successfully hybrid assembled by runCA into near-perfect assemblies using ECTools-corrected long reads. In addition, we developed a tool, Patch, which takes corrected long reads and pre-assembled contigs as inputs, to enhance microbial genome assemblies. With the additional 20X long reads, short reads of S. cerevisiae W303 were hybrid assembled into 115 contigs using the verified strategy, ECTools + runCA. Patch was subsequently applied to upgrade the assembly to a 35-contig draft genome. Our evaluation of the hybrid approaches shows that assembling the ECTools-corrected long reads via runCA generates near complete microbial genomes, suggesting that genome assembly could benefit from re-analyzing the available hybrid datasets that were not assembled in an optimal fashion.

6.
7.
Third-generation sequencing (3GS) technologies generate ultra-long reads (up to 1 Mb), which makes it possible to eliminate gaps and effectively resolve repeats in genome assembly. However, 3GS technologies suffer from high base-level error rates (15%–40%) and high sequencing costs. To address these issues, the hybrid assembly strategy, which utilizes both 3GS reads and inexpensive NGS (next generation sequencing) short reads, was invented. Here, we use 10×-Genomics® technology, which integrates a novel bar-coding strategy with Illumina® NGS and has the advantage of revealing long-range sequence information, to replace common NGS short reads for hybrid assembly of long erroneous 3GS reads. We demonstrate the feasibility of integrating 3GS with 10×-Genomics technologies for a new strategy of hybrid de novo genome assembly by utilizing the DBG2OLC and Sparc software packages, previously developed by the authors for regular hybrid assembly. Using a human genome as an example, we show that with only 7× coverage of ultra-long Nanopore® reads, augmented with 10× reads, our approach achieved nearly the same level of quality as non-hybrid assembly with 35× coverage of Nanopore reads. Compared with the assembly with 10×-Genomics reads alone, our assembly is gapless at a slightly higher cost. These results suggest that our new hybrid assembly with ultra-long 3GS reads augmented with 10×-Genomics reads offers a low-cost (less than ¼ the cost of the non-hybrid assembly) and computationally lightweight (only 109 calendar hours with peak memory usage of 61 GB on a dual-CPU office workstation) solution for extending the wide applications of the 3GS technologies.

8.
9.
10.
In just the last decade, a multitude of bio-technologies and software pipelines have emerged to revolutionize genomics. To further their central goal, they aim to accelerate and improve the quality of de novo whole-genome assembly starting from short DNA sequences/reads. However, the performance of each of these tools is contingent on the length and quality of the sequencing data, the structure and complexity of the genome sequence, and the resolution and quality of long-range information. Furthermore, in the absence of any metric that captures the most fundamental “features” of a high-quality assembly, there is no obvious recipe for users to select the most desirable assembler/assembly. This situation has prompted the scientific community to rely on crowd-sourcing through international competitions, such as Assemblathons or GAGE, with the intention of identifying the best assembler(s) and their features. Somewhat circuitously, the only available approach to gauge de novo assemblies and assemblers relies solely on the availability of a high-quality fully assembled reference genome sequence. Still worse, reference-guided evaluations are often difficult to analyze, leading to conclusions that are difficult to interpret. In this paper, we circumvent many of these issues by relying upon a tool, dubbed FRCurve, which is capable of evaluating de novo assemblies from the read layouts even when no reference exists. We extend the FRCurve approach to cases where layout information may have been obscured, as is true in many de Bruijn graph-based algorithms. As a by-product, FRCurve now expands its applicability to a much wider class of assemblers, thus identifying higher-quality members of this group, their inter-relations as well as sensitivity to carefully selected features, with or without the support of a reference sequence or layout for the reads. The paper concludes by reevaluating several recently conducted assembly competitions and the datasets that have resulted from them.

11.
Methods for the analysis of chromatin immunoprecipitation sequencing (ChIP-seq) data start by aligning the short reads to a reference genome. While often successful, they are not appropriate for cases where a reference genome is not available. Here we develop methods for de novo analysis of ChIP-seq data. Our methods combine de novo assembly with statistical tests enabling motif discovery without the use of a reference genome. We validate the performance of our method using human and mouse data. Analysis of fly data indicates that our method outperforms alignment based methods that utilize closely related species.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-015-0756-4) contains supplementary material, which is available to authorized users.

12.
An important step in ‘metagenomics’ analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines use a single-genome assembler with carefully optimized parameters. A limitation of a single-genome assembler for de novo metagenome assembly is that sequences of highly abundant species are likely to be misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. We extended a single-genome assembler for short reads, known as ‘Velvet’, to metagenome assembly, which we called ‘MetaVelvet’, for mixed short reads of multiple species. Our fundamental concept was to first decompose a de Bruijn graph constructed from mixed short reads into individual sub-graphs, and second, to build scaffolds based on each decomposed de Bruijn sub-graph as an isolate species genome. We made use of two features, the coverage (abundance) difference and graph connectivity, for the decomposition of the de Bruijn graph. For simulated datasets, MetaVelvet succeeded in generating significantly higher N50 scores than any single-genome assembler. MetaVelvet also reconstructed relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial read data, MetaVelvet produced longer scaffolds and increased the number of predicted genes.
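The N50 score used in this abstract to compare assemblers has a simple standard definition: sort the scaffold (or contig) lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. A minimal sketch follows; the function name `n50` and the example lengths are illustrative, not taken from the paper.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the total bases."""
    if not lengths:
        raise ValueError("n50 is undefined for an empty assembly")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Compare 2 * running with total to stay in integer arithmetic.
        if 2 * running >= total:
            return length


# Hypothetical scaffold lengths (in bp) for two toy assemblies of equal size.
fragmented = [2, 3, 4, 5, 6]
contiguous = [14, 6]
print(n50(fragmented), n50(contiguous))
```

Both toy assemblies contain 20 bases, yet the second has the higher N50, which is why the metric is read as a measure of contiguity and says nothing about correctness on its own.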

13.
Genomics, 2021, 113(3): 1366-1377
Oxford Nanopore sequencing has been widely used to achieve complete genomes of bacterial pathogens. However, the error rates of Oxford Nanopore long reads are high. Various polishing algorithms using Illumina short reads to correct the errors in Oxford Nanopore long-read assemblies have been developed. The impact of polishing the Oxford Nanopore long-read assemblies of bacterial pathogens with Illumina short reads on improving genomic analyses was evaluated using both simulated and real reads. Ten species (10 strains) were selected for simulated reads, while real reads were tested on 11 species (11 strains). Oxford Nanopore long reads were assembled with Unicycler to produce a draft assembly, followed by three rounds of polishing with Illumina short reads using two polishing tools, Pilon and NextPolish. One round of NextPolish polishing generated genome completeness and accuracy parameters similar to the reference genomes, whereas two or three rounds of Pilon polishing were needed, though contiguity remained unchanged after polishing. The polished assemblies of Escherichia coli O157:H7, Salmonella Typhimurium, and Cronobacter sakazakii with simulated reads did not provide accurate plasmid identifications. One round of NextPolish polishing was needed for accurately identifying plasmids in Staphylococcus aureus and E. coli O26:H11 with real reads, whereas one and two rounds of Pilon polishing were necessary for these two strains, respectively. Polishing failed to provide an accurate antimicrobial resistance (AMR) genotype for S. aureus with real reads. One round of polishing recovered an accurate AMR genotype for Klebsiella pneumoniae with real reads. The reference genome and draft assembly of Citrobacter braakii with real reads differed, which carried blaCMY-83 and fosA6, respectively, while both genes were present after one round of polishing. However, polishing did not improve the assembly of E. coli O26:H11 with real reads to achieve numbers of virulence genes similar to the reference genome. The draft and polished assemblies showed a phylogenetic tree topology comparable with the reference genomes. For multilocus sequence typing and pan-genome analyses, one round of NextPolish polishing was sufficient to obtain accurate results, while two or three rounds of Pilon polishing were needed. Overall, NextPolish outperformed Pilon for polishing the Oxford Nanopore long-read assemblies of bacterial pathogens, though both polishing strategies improved genomic analyses compared to the draft assemblies.

14.

Background

Whole genome sequence construction is becoming increasingly feasible because of advances in next generation sequencing (NGS), including increasing throughput and read length. By simply overlapping paired-end reads, we can obtain longer reads with higher accuracy, which can facilitate the assembly process. However, the influences of different library sizes and assembly methods on paired-end sequencing-based de novo assembly remain poorly understood.

Results

We used 250 bp Illumina MiSeq paired-end reads of different library sizes generated from genomic DNA from Escherichia coli DH1 and Streptococcus parasanguinis FW213 to compare the assembly results of different library sizes and assembly approaches. Our data indicate that overlapping paired-end reads can increase read accuracy but sometimes causes insertions or deletions. Regarding genome assembly, merged reads only outcompete original paired-end reads when coverage depth is low, and larger libraries tend to yield better assembly results. These results imply that distance information is the most critical factor during assembly. Our results also indicate that when depth is sufficiently high, assembly from subsets can sometimes produce better results.
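To make the idea of "overlapping paired-end reads" concrete, the sketch below merges a read pair by reverse-complementing the second read and joining the two at their longest exact suffix/prefix overlap. This is a simplified illustration under stated assumptions, not the merging tool used in the study: it requires exact matches, ignores base qualities, and the names `merge_pair` and `min_overlap` are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=10):
    """Merge a pair into one fragment, or return None if no overlap is found.

    read2 is reverse-complemented so both reads lie on the same strand, then
    the longest exact overlap between the end of read1 and the start of the
    flipped mate is used as the join point.
    """
    mate = reverse_complement(read2)
    for size in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-size:] == mate[:size]:
            return read1 + mate[size:]
    return None


# Hypothetical 24 bp fragment sequenced as two 16 bp reads from opposite ends.
fragment = "ACGTTGCAAGGCTTAACCGGATCG"
read1 = fragment[:16]
read2 = reverse_complement(fragment[8:])
merged = merge_pair(read1, read2, min_overlap=5)
print(merged == fragment)
```

Because the join point is chosen purely by sequence identity, a repeat near the read ends can produce a wrong overlap and hence a merged read that is too short or too long, which is one plausible way the insertions and deletions reported above could arise.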

Conclusions

In summary, this study provides systematic evaluations of de novo assembly from paired-end sequencing data. Among the assembly strategies, we find that overlapping paired-end reads is not always beneficial for bacterial genome assembly and should be avoided or used with caution, especially for genomes containing a high fraction of repetitive sequences. Because increasing numbers of projects aim at bacterial genome sequencing, our study provides valuable suggestions for the field of genomic sequence construction.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1859-8) contains supplementary material, which is available to authorized users.

15.

Background

Problems associated with using draft genome assemblies are well documented and have become more pronounced with the use of short read data for de novo genome assembly. We set out to improve the draft genome assembly of the African cichlid fish, Metriaclima zebra, using a set of Pacific Biosciences SMRT sequencing reads corresponding to 16.5× coverage of the genome. Here we characterize the improvements that these long reads allowed us to make to the state-of-the-art draft genome previously assembled from short read data.

Results

Our new assembly closed 68 % of the existing gaps and added 90.6 Mbp of new non-gap sequence to the existing draft assembly of M. zebra. Comparison of the new assembly to the sequence of several bacterial artificial chromosome clones confirmed the accuracy of the new assembly. The closure of sequence gaps revealed thousands of new exons, allowing significant improvement in gene models. We corrected one known misassembly, and identified and fixed other likely misassemblies. 63.5 Mbp (70 %) of the new sequence was classified as repetitive and the new sequence allowed for the assembly of many more transposable elements.

Conclusions

Our improvements to the M. zebra draft genome suggest that a reasonable investment in long reads could greatly improve many comparable vertebrate draft genome assemblies.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1930-5) contains supplementary material, which is available to authorized users.

16.
Next-generation sequencing (NGS) approaches rapidly produce millions to billions of short reads, which allow pathogen detection and discovery in human clinical, animal and environmental samples. A major limitation of sequence homology-based identification for highly divergent microorganisms is the short length of reads generated by most highly parallel sequencing technologies. Short reads require a high level of sequence similarity to annotated genes to confidently predict gene function or homology. Such recognition of highly divergent homologues can be improved by reference-free (de novo) assembly of short overlapping sequence reads into larger contigs. We describe an ensemble strategy that integrates the sequential use of various de Bruijn graph and overlap-layout-consensus assemblers with a novel partitioned sub-assembly approach. We also propose new quality metrics that are suitable for evaluating metagenome de novo assembly. We demonstrate that this new ensemble strategy, tested using in silico spike-in, clinical and environmental NGS datasets, produced significantly better contigs than current approaches.

17.
18.
19.
As a greater number and diversity of high-quality vertebrate reference genomes become available, it is increasingly feasible to use these references to guide new draft assemblies for related species. Reference-guided assembly approaches may substantially increase the contiguity and completeness of a new genome using only low levels of genome coverage that might otherwise be insufficient for de novo genome assembly. We used low-coverage (∼3.5–5.5x) Illumina paired-end sequencing to assemble draft genomes of two bird species (the Gunnison Sage-Grouse, Centrocercus minimus, and the Clark's Nutcracker, Nucifraga columbiana). We used these data to estimate de novo genome assemblies and reference-guided assemblies, and compared the information content and completeness of these assemblies by comparing CEGMA gene set representation, repeat element content, simple sequence repeat content, and GC isochore structure among assemblies. Our results demonstrate that even lower-coverage genome sequencing projects are capable of producing informative and useful genomic resources, particularly through the use of reference-guided assemblies.
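For a sense of what the low coverage quoted in this abstract means, the idealized Lander-Waterman (Poisson) model gives the expected fraction of a genome touched by at least one read as 1 - e^(-c) at mean depth c. The sketch below evaluates it at the two ends of the stated range. The model is an assumption brought in for illustration, not an analysis from the paper: it presumes uniformly random read placement and ignores repeats and GC bias, so real data will typically fare worse.

```python
import math


def expected_covered_fraction(depth):
    """Expected fraction of bases covered at least once under a Poisson model."""
    return 1.0 - math.exp(-depth)


# Depths taken from the coverage range quoted in the abstract.
for depth in (3.5, 5.5):
    print(f"{depth}x -> {expected_covered_fraction(depth):.3f}")
```

Even at 3.5x roughly 97% of bases are expected to be sampled at least once, but many of them only once or twice, which helps explain why a reference can add contiguity that de novo assembly cannot recover at this depth.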

20.
Next-generation sequencing (NGS) has revolutionized the field of genome assembly because of its much higher data throughput and much lower cost compared with traditional Sanger sequencing. However, NGS poses new computational challenges to de novo genome assembly. Among the challenges, GC bias in NGS data is known to complicate genome assembly. However, it is not clear to what extent GC bias affects genome assembly in general. In this work, we conduct a systematic analysis of the effects of GC bias on genome assembly. Our analyses reveal that GC bias only lowers assembly completeness when the degree of GC bias is above a threshold. At a strong GC bias, the assembly fragmentation due to GC bias can be explained by the low coverage of reads in the GC-poor or GC-rich regions of a genome. This effect is observed for all the assemblers under study. Increasing the total amount of NGS data thus rescues the assembly fragmentation caused by GC bias. However, the amount of data needed for a full rescue depends on the distribution of GC contents. Both low and high coverage depths due to GC bias lower the accuracy of assembly. These pieces of information provide guidance toward a better de novo genome assembly in the presence of GC bias.
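The "distribution of GC contents" referred to in this abstract is usually obtained by computing the GC fraction in windows along the sequence. The sketch below shows one minimal way to do that; the function names, the non-overlapping window scheme, and the toy sequence are assumptions for illustration rather than the authors' procedure.

```python
def gc_content(seq):
    """Fraction of G and C bases in a DNA string (0.0 for an empty string)."""
    if not seq:
        return 0.0
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window):
    """GC fraction of each consecutive, non-overlapping window of fixed size."""
    return [
        gc_content(seq[start:start + window])
        for start in range(0, len(seq) - window + 1, window)
    ]


# Hypothetical sequence with a GC-rich half followed by a GC-poor half.
toy = "GGCCGCGCATATTATA"
print(windowed_gc(toy, window=8))
```

Pairing such per-window GC values with per-window read depth is a common way to see whether GC-poor or GC-rich regions are under-covered, which is the mechanism the abstract proposes for fragmentation.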
