Similar Articles
20 similar articles found
1.
Despite the power of massively parallel sequencing platforms, a drawback is the short length of the sequence reads produced. We demonstrate that short reads can be locally assembled into longer contigs using paired-end sequencing of restriction-site associated DNA (RAD-PE) fragments. We use this RAD-PE contig approach to identify single nucleotide polymorphisms (SNPs) and determine haplotype structure in threespine stickleback and to sequence E. coli and stickleback genomic DNA with overlapping contigs of several hundred nucleotides. We also demonstrate that adding a circularization step allows the local assembly of contigs up to 5 kilobases (kb) in length. The ease of assembly and accuracy of the individual contigs produced from each RAD site sequence suggest that RAD-PE sequencing is a useful way to convert genome-wide short reads into individually assembled sequences hundreds or thousands of nucleotides long.

2.
We present Quip, a lossless compression algorithm for next-generation sequencing data in the FASTQ and SAM/BAM formats. In addition to implementing reference-based compression, we have developed, to our knowledge, the first assembly-based compressor, using a novel de novo assembly algorithm. A probabilistic data structure is used to dramatically reduce the memory required by traditional de Bruijn graph assemblers, allowing millions of reads to be assembled very efficiently. Read sequences are then stored as positions within the assembled contigs. This is combined with statistical compression of read identifiers, quality scores, alignment information and sequences, effectively collapsing very large data sets to <15% of their original size with no loss of information. Availability: Quip is freely available under the 3-clause BSD license from http://cs.washington.edu/homes/dcjones/quip.
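A key implementation detail above is the probabilistic data structure that shrinks the memory footprint of the de Bruijn graph. Quip's own structure is a counting one and far more engineered; the Python sketch below only illustrates the underlying trick with a plain Bloom filter for k-mer membership, which admits a tunable rate of false positives but never a false negative.

```python
import hashlib

class BloomFilter:
    """Probabilistic k-mer set: no false negatives, tunable false-positive rate."""
    def __init__(self, n_bits=1 << 20, n_hashes=4):
        self.n_bits, self.n_hashes = n_bits, n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        for i in range(self.n_hashes):  # derive independent hashes via salting
            h = hashlib.blake2b(item.encode(), salt=bytes([i])).digest()
            yield int.from_bytes(h[:8], "big") % self.n_bits

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item):
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def kmers(read, k=21):
    return (read[i:i + k] for i in range(len(read) - k + 1))

seen = BloomFilter()
for read in ["ACGTACGTACGTACGTACGTACGT"]:
    for km in kmers(read):
        seen.add(km)
print("ACGTACGTACGTACGTACGTA" in seen)  # True: the first 21-mer of the read
```

Storing a few bits per k-mer instead of the k-mer itself is what lets millions of k-mers fit in a fraction of the memory an exact hash table would need.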

3.

Background

Third-generation sequencing platforms produce longer reads with higher error rates than second-generation technologies. While the improved read length can provide useful information for downstream analysis, the underlying algorithms are challenged by the high error rate. Error-correction methods in which accurate short reads are used to correct noisy long reads are therefore an attractive way to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second-generation data, and suffer from large runtimes. Recently, a new hybrid error-correction method has been proposed, in which the second-generation data are first assembled into a de Bruijn graph, onto which the long reads are then aligned.

Results

In this context we present Jabba, a hybrid method to correct long third-generation reads by mapping them onto a corrected de Bruijn graph constructed from second-generation data. Unique to our method is the use of a pseudo-alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds; a toy MEM-finding sketch follows this item. In addition to benchmark results, theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third-generation reads are presented.

Conclusion

Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using very little CPU time. From this we conclude that pseudo-alignment with MEMs is a fast and reliable method to map long, highly erroneous sequences onto a de Bruijn graph.
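The seeding step referenced above can be made concrete with a small sketch. The following Python snippet finds maximal exact matches between a noisy read and a reference string by extending short exact seeds in both directions. Jabba itself matches reads against a de Bruijn graph using far more efficient index structures, so this is purely illustrative.

```python
def maximal_exact_matches(read, ref, min_len=8):
    """Toy MEM finder: seed with short exact k-mers, extend each hit in both
    directions as far as possible, and keep the maximal matches >= min_len."""
    k = 4
    index = {}
    for i in range(len(ref) - k + 1):
        index.setdefault(ref[i:i + k], []).append(i)
    mems = set()
    for j in range(len(read) - k + 1):
        for i in index.get(read[j:j + k], []):
            b_i, b_j = i, j
            while b_i > 0 and b_j > 0 and ref[b_i - 1] == read[b_j - 1]:
                b_i, b_j = b_i - 1, b_j - 1
            e_i, e_j = i + k, j + k
            while e_i < len(ref) and e_j < len(read) and ref[e_i] == read[e_j]:
                e_i, e_j = e_i + 1, e_j + 1
            if e_j - b_j >= min_len:
                mems.add((b_i, b_j, e_j - b_j))  # (ref position, read position, length)
    return sorted(mems)

print(maximal_exact_matches("GGATTACATTT", "CCCGATTACATG", min_len=6))  # [(3, 1, 8)]
```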

4.
We describe a new algorithm, meraculous, for whole genome assembly of deep paired-end short reads, and apply it to the assembly of a dataset of paired 75-bp Illumina reads derived from the 15.4 megabase genome of the haploid yeast Pichia stipitis. More than 95% of the genome is recovered, with no errors; half the assembled sequence is in contigs longer than 101 kilobases and in scaffolds longer than 269 kilobases. Incorporating fosmid ends recovers entire chromosomes. Meraculous relies on an efficient and conservative traversal of the subgraph of the k-mer (de Bruijn) graph of oligonucleotides with unique high-quality extensions in the dataset, avoiding an explicit error correction step as used in other short-read assemblers. A novel memory-efficient hashing scheme is introduced. The resulting contigs are ordered and oriented using paired reads separated by ~280 bp or ~3.2 kbp, and many gaps between contigs can be closed using paired-end placements. Practical issues with the dataset are described, and prospects for assembling larger genomes are discussed.
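The conservative traversal at the heart of meraculous is easy to sketch: extend a contig only while the current k-mer has exactly one successor in the dataset, and stop at any branch, dead end or cycle. The Python toy below shows that logic only; the published implementation adds quality filtering and a memory-efficient hash.

```python
from collections import Counter

def unique_extension_contig(seed, kmer_set):
    """Extend a seed k-mer rightwards while exactly one successor k-mer exists."""
    k = len(seed)
    contig, seen = seed, {seed}
    while True:
        succ = [contig[-(k - 1):] + b for b in "ACGT"
                if contig[-(k - 1):] + b in kmer_set]
        if len(succ) != 1 or succ[0] in seen:  # branch, dead end, or cycle: stop
            return contig
        seen.add(succ[0])
        contig += succ[0][-1]

reads = ["ACGTG", "CGTGA", "GTGAC"]
k = 4
kmer_set = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
print(unique_extension_contig("ACGT", kmer_set))  # ACGTGAC
```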

5.
Protein-mediated error correction for de novo DNA synthesis
The availability of inexpensive, on-demand synthetic DNA has enabled numerous powerful applications in biotechnology, in turn driving considerable present interest in the de novo synthesis of increasingly longer DNA constructs. The synthesis of DNA from oligonucleotides into products even as large as small viral genomes has been accomplished. Despite such achievements, the costs and time required to generate such long constructs have, to date, precluded gene-length (and longer) DNA synthesis from being an everyday research tool in the same manner as PCR and DNA sequencing. A critical barrier to low-cost, high-throughput de novo DNA synthesis is the frequency at which errors pervade the final product. Here, we employ a DNA mismatch-binding protein, MutS (from Thermus aquaticus), to remove failure products from synthetic genes. This method reduced errors by >15-fold relative to conventional gene synthesis techniques, yielding DNA with one error per 10,000 base pairs. The approach is general, scalable and can be iterated multiple times for greater fidelity. Reductions in both the costs and time required are demonstrated for the synthesis of a 2.5 kb gene.

6.
Current challenges in de novo plant genome sequencing and assembly
Genome sequencing is now affordable, but assembling plant genomes de novo remains challenging. We assess the state of the art of assembly and review the best practices for the community.

7.
The genome of the parasitic platyhelminth Schistosoma mansoni is composed of approximately 40% repetitive sequences, of which roughly 20% correspond to transposable elements. When the genome sequence became available, conventional repeat prediction programs were used to find these repeats, but only a fraction could be identified. To characterize the repeats exhaustively, we applied a new strategy based on massive sequencing: we re-sequenced the genome by next-generation sequencing, aligned the sequencing reads to the genome and assembled all multiple-hit reads into contigs corresponding to the repetitive part of the genome. We present here, for the first time, this de novo repeat assembly strategy and we confirm that such assembly is feasible. We identified and annotated 4,143 new repeats in the S. mansoni genome. At least one third of the repeats are transcribed. This strategy also allowed us to identify 14 new microsatellite markers, which can be used for pedigree studies. Annotations and the combined (previously known and new) 5,420 repeat sequences (corresponding to 47% of the genome) are available for download (http://methdb.univ-perp.fr/downloads/).

8.
Recent developments in high-throughput sequencing technology have made low-cost sequencing an attractive approach for many genome analysis tasks. Increasing read lengths, improving quality and the production of increasingly larger numbers of usable sequences per instrument-run continue to make whole-genome assembly an appealing target application. In this paper we evaluate the feasibility of de novo genome assembly from short reads (≤100 nucleotides) through a detailed study involving genomic sequences of various lengths and origin, in conjunction with several of the currently popular assembly programs. Our extensive analysis demonstrates that, in addition to sequencing coverage, attributes such as the architecture of the target genome, the identity of the used assembly program, the average read length and the observed sequencing error rates are powerful variables that affect the best achievable assembly of the target sequence in terms of size and correctness.

9.
10.
Background

Short-read data from next-generation sequencing technologies are now being generated across a range of research projects. The fidelity of this data can be affected by several factors and it is important to have simple and reliable approaches for monitoring it at the level of individual experiments.

Results

We developed a fast, scalable and accurate approach to estimating error rates in short reads, which has the added advantage of not requiring a reference genome. We build on the fundamental observation that there is a linear relationship between the copy number for a given read and the number of erroneous reads that differ from the read of interest by one or two bases. The slope of this relationship can be transformed to give an estimate of the error rate, both by read and by position. We present simulation studies as well as analyses of real data sets illustrating the precision and accuracy of this method, and we show that it is more accurate than alternatives that count the differences between the sample of interest and the reference genome. We show how this methodology led to the detection of mutations in the genome of the PhiX strain used for calibration of Illumina data. The proposed method is implemented in an R package, which can be downloaded from http://bcb.dfci.harvard.edu/~vwang/shadowRegression.html, and will be submitted to Bioconductor upon publication of this article.

Conclusions

The proposed method can be used to monitor the quality of sequencing pipelines at the level of individual experiments without the use of reference genomes. Furthermore, having an estimate of the error rates gives one the opportunity to improve analyses and inferences in many applications of next-generation sequencing data.
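The linear relationship at the core of this method can be shown in a few lines of Python. The numbers below are invented; the actual package fits the regression per read and per position and applies a more elaborate slope-to-error-rate transformation, which is omitted here.

```python
# Shadow-regression idea in miniature: for each distinct abundant read, count
# its copies and the "shadow" reads within one mismatch of it, then fit
# shadows ~ slope * copies by least squares through the origin.
copies  = [50, 120, 200, 310, 400]  # copy numbers of five abundant reads (toy data)
shadows = [2, 5, 9, 13, 17]         # one-mismatch neighbours observed for each

slope = sum(c * s for c, s in zip(copies, shadows)) / sum(c * c for c in copies)
print(f"fitted slope = {slope:.4f}")  # roughly the fraction of erroneous copies
```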

11.
Annotated genomes can provide new perspectives on the biology of species. We present the first de novo whole-genome sequencing of the pink-footed goose. In order to obtain a high-quality de novo assembly, the strategy used was to combine one short-insert paired-end library with two mate-pair libraries. The pink-footed goose genome was assembled de novo using three different assemblers, and an assembly evaluation was subsequently performed in order to choose the best assembler. For our data, ALLPATHS-LG performed best, since the assembly it produced covers most of the genome while introducing the fewest errors. A total of 26,134 genes were annotated, with bird species accounting for virtually all BLAST hits. We also estimated the substitution rate in the pink-footed goose, which can be of use in future demographic studies, by using a comparative approach with the genomes of the chicken, the mallard and the swan goose. A substitution rate of 1.38 × 10⁻⁷ per nucleotide per generation was obtained when comparing the genomes of the two closely related goose species (the pink-footed and the swan goose). Altogether, we provide a valuable tool for future genomic studies aiming at particular genes and regions of the pink-footed goose genome as well as other bird species.

12.
MOTIVATION: Several new de novo assembly tools have been developed recently to assemble short sequencing reads generated by next-generation sequencing platforms. However, the performance of these tools under various conditions has not been fully investigated, and sufficient information is not currently available for informed decisions to be made regarding the tool most likely to produce the best performance under a specific set of conditions. RESULTS: We studied and compared the performance of commonly used de novo assembly tools specifically designed for next-generation sequencing data, including SSAKE, VCAKE, Euler-sr, Edena, Velvet, ABySS and SOAPdenovo. Tools were compared using several performance criteria, including N50 length, sequence coverage and assembly accuracy. Various properties of read data, including single-end/paired-end, sequence GC content, depth of coverage and base-calling error rates, were investigated for their effects on the performance of different assembly tools. We also compared the computation time and memory usage of these seven tools. Based on the results of our comparison, the relative performance of the individual tools is summarized, and tentative guidelines for the optimal selection of different assembly tools under different conditions are provided.
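Of the criteria just mentioned, N50 has the crispest definition: the contig length at which the cumulative length of all contigs at least that long first reaches half of the total assembly size. A minimal Python sketch:

```python
def n50(contig_lengths):
    """Smallest contig length L such that contigs of length >= L
    together span at least half of the total assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length

print(n50([100, 80, 70, 50, 30, 20]))  # 80: 100 + 80 = 180 >= 350 / 2
```

As item 19 below argues, N50 rewards size alone, which is exactly why it needs to be read alongside correctness-oriented criteria.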

13.
An important step in ‘metagenomics’ analysis is the assembly of multiple genomes from mixed sequence reads of multiple species in a microbial community. Most conventional pipelines use a single-genome assembler with carefully optimized parameters. A limitation of a single-genome assembler for de novo metagenome assembly is that sequences of highly abundant species are likely misidentified as repeats in a single genome, resulting in a number of small fragmented scaffolds. We extended a single-genome assembler for short reads, known as ‘Velvet’, to metagenome assembly, which we called ‘MetaVelvet’, for mixed short reads of multiple species. Our fundamental concept was to first decompose a de Bruijn graph constructed from mixed short reads into individual sub-graphs, and second, to build scaffolds based on each decomposed de Bruijn sub-graph as an isolate species genome. We made use of two features, the coverage (abundance) difference and graph connectivity, for the decomposition of the de Bruijn graph. For simulated datasets, MetaVelvet succeeded in generating significantly higher N50 scores than any single-genome assemblers. MetaVelvet also reconstructed relatively low-coverage genome sequences as scaffolds. On real datasets of human gut microbial read data, MetaVelvet produced longer scaffolds and increased the number of predicted genes.
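The decomposition step can be illustrated with a toy graph: nodes carry k-mer coverage, and connected nodes are grouped into a subgraph only while their coverage stays close to that of the subgraph's seed. This sketch is an invented simplification of MetaVelvet's actual criteria, which combine coverage peaks and graph connectivity in a more refined way.

```python
from collections import deque

def decompose_by_coverage(coverage, edges, max_ratio=2.0):
    """Toy decomposition: grow connected subgraphs whose node coverages stay
    within max_ratio of the seed node's coverage."""
    adj = {n: set() for n in coverage}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    unassigned, subgraphs = set(coverage), []
    while unassigned:
        seed = max(unassigned, key=coverage.get)  # start from highest coverage
        group, queue = {seed}, deque([seed])
        while queue:
            for nb in (adj[queue.popleft()] & unassigned) - group:
                ratio = max(coverage[seed], coverage[nb]) / min(coverage[seed], coverage[nb])
                if ratio <= max_ratio:
                    group.add(nb)
                    queue.append(nb)
        unassigned -= group
        subgraphs.append(group)
    return subgraphs

cov = {"u1": 100, "u2": 95, "u3": 12, "u4": 10}  # two species: ~100x and ~10x
links = [("u1", "u2"), ("u2", "u3"), ("u3", "u4")]
print(decompose_by_coverage(cov, links))  # [{'u1', 'u2'}, {'u3', 'u4'}]
```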

14.
Restriction site-associated DNA sequencing (RADseq) provides researchers with the ability to record genetic polymorphism across thousands of loci for nonmodel organisms, potentially revolutionizing the field of molecular ecology. However, as with other genotyping methods, RADseq is prone to a number of sources of error that may have consequential effects for population genetic inferences, and these have received only limited attention in terms of the estimation and reporting of genotyping error rates. Here we use individual sample replicates, under the expectation of identical genotypes, to quantify genotyping error in the absence of a reference genome. We then use sample replicates to (i) optimize de novo assembly parameters within the program Stacks, by minimizing error and maximizing the retrieval of informative loci; and (ii) quantify error rates for loci, alleles and single-nucleotide polymorphisms. As an empirical example, we use a double-digest RAD data set of a nonmodel plant species, Berberis alpina, collected from high-altitude mountains in Mexico.
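The replicate-based error estimate described above reduces to a simple computation once genotypes are called. The sketch below is a hypothetical simplification, not the Stacks workflow: the locus-level error rate is the fraction of comparable loci at which two replicates of the same sample disagree (the study also derives allele- and SNP-level rates).

```python
def replicate_error_rate(rep1, rep2):
    """Fraction of loci at which two replicate genotype calls disagree;
    loci with a missing call in either replicate are skipped."""
    compared = [(a, b) for a, b in zip(rep1, rep2) if a is not None and b is not None]
    mismatches = sum(a != b for a, b in compared)
    return mismatches / len(compared)

rep_a = ["AA", "AG", "GG", None, "CT"]  # genotype calls for one sample (toy data)
rep_b = ["AA", "AA", "GG", "CC", "CT"]  # same sample, independent replicate
print(replicate_error_rate(rep_a, rep_b))  # 0.25: one mismatch among four comparable loci
```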

15.
Gene synthesis attempts to assemble user-defined DNA sequences with base-level precision. Verifying the sequences of construction intermediates and the final product of a gene synthesis project is a critical part of the workflow, yet one that has received the least attention. Sequence validation is equally important for other kinds of curated clone collections. Ensuring that the physical sequence of a clone matches its published sequence is a common quality control step performed at least once over the course of a research project. GenoREAD is a web-based application that breaks the sequence verification process into two steps: the assembly of sequencing reads and the alignment of the resulting contig with a reference sequence. GenoREAD can determine whether a clone matches its reference sequence. Its sophisticated reporting features help identify and troubleshoot problems that arise during the sequence verification process. GenoREAD has been experimentally validated on thousands of gene-sized constructs from an ORFeome project, and on longer sequences including whole plasmids and synthetic chromosomes. Comparing GenoREAD results with those from manual analysis of the sequencing data demonstrates that GenoREAD tends to be conservative in its diagnoses. GenoREAD is available at www.genoread.org.
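The second of the two steps, comparing an assembled contig against its reference, can be mocked up in a few lines with Python's standard library. This is not GenoREAD's implementation, just an illustration of the kind of report the alignment step produces.

```python
import difflib

def verify_clone(contig, reference):
    """Compare an assembled contig to its reference and report differences."""
    matcher = difflib.SequenceMatcher(a=reference, b=contig, autojunk=False)
    diffs = [(op, i1, reference[i1:i2], contig[j1:j2])
             for op, i1, i2, j1, j2 in matcher.get_opcodes() if op != "equal"]
    return "MATCH" if not diffs else diffs

print(verify_clone("ACGTTGCA", "ACGTTGCA"))  # MATCH
print(verify_clone("ACGATGCA", "ACGTTGCA"))  # [('replace', 3, 'T', 'A')]
```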

16.
17.
The conventional approach in modern proteomics to identify proteins from limited information provided by molecular and fragment masses of their enzymatic degradation products carries an inherent risk of both false positive and false negative identifications. For reliable identification of even known proteins, complete de novo sequencing of their peptides is desired. The main problems of conventional sequencing based on tandem mass spectrometry are incomplete backbone fragmentation and the frequent overlap of fragment masses. In this work, the first proteomics-grade de novo approach is presented, in which the above problems are alleviated by the use of the complementary fragmentation techniques CAD and ECD. Implementation of a high-current, large-area dispenser cathode as a source of low-energy electrons provided efficient ECD of doubly charged peptides, the most abundant species (65-80%) in a typical trypsin-based proteomics experiment. A new linear de novo algorithm is developed that combines efficiency and speed, processing 1000 MS/MS data sets in 60 s on a conventional 3 GHz PC. More than 6% of all MS/MS data for doubly charged peptides yielded complete sequences, and another 13% gave nearly complete sequences with a maximum gap of two amino acid residues. These figures are comparable with the typical success rates (5-15%) of database identification. For peptides reliably found in the database (Mowse score ≥ 34), the agreement with de novo-derived full sequences was >95%. Full sequences were derived in 67% of the cases when full sequence information was present in MS/MS spectra. Thus the new de novo sequencing approach reached the same level of efficiency and reliability as conventional database-identification strategies.

18.

Background

The short reads output by first- and second-generation DNA sequencing instruments cannot completely reconstruct microbial chromosomes. Therefore, most genomes have been left unfinished due to the significant resources required to manually close gaps in draft assemblies. Third-generation, single-molecule sequencing addresses this problem by greatly increasing sequencing read length, which simplifies the assembly problem.

Results

To measure the benefit of single-molecule sequencing for microbial genome assembly, we sequenced and assembled the genomes of six bacteria and analyzed the repeat complexity of 2,267 complete bacterial and archaeal genomes. Our results indicate that the majority of known bacterial and archaeal genomes can be assembled without gaps, at finished-grade quality, using a single PacBio RS sequencing library. These single-library assemblies are also more accurate than typical short-read assemblies and hybrid assemblies of short and long reads.

Conclusions

Automated assembly of long, single-molecule sequencing data reduces the cost of microbial finishing to $1,000 for most genomes, and future advances in this technology are expected to drive the cost lower. This is expected to increase the number of completed genomes, improve the quality of microbial genome databases, and enable high-fidelity, population-scale studies of pan-genomes and chromosomal organization.

19.
Vezzi F, Narzisi G, Mishra B. PLoS ONE 2012, 7(2): e31002.
The whole-genome sequence assembly (WGSA) problem is among the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: on the one hand, metrics like N50 and number of contigs focus only on size without proportionately emphasizing information about the correctness of the assembly; on the other hand, comparisons performed on simulated datasets can be highly biased by the unrealistic assumptions of the underlying read generator. Recently, the Feature Response Curve (FRC) method was proposed to assess overall assembly quality and correctness: FRC transparently captures the trade-offs between contigs' quality and their sizes. Nevertheless, the relationships among the different features and their relative importance remain unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques such as principal and independent component analysis, we were able to estimate the "excess dimensionality" of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes assembly quality. Applying independent component analysis, we identified a subset of features that better describes assembler performance. We demonstrate that by focusing on a reduced set of highly informative features, the FRC curve can better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, even when obtained with state-of-the-art simulators, leads to unrealistic results.
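The "excess dimensionality" analysis boils down to examining the eigenvalue spectrum of the feature correlation matrix. The numpy sketch below uses an invented assemblies-by-features matrix; if the first component or two absorb almost all the variance, the features are largely redundant, which is the paper's point.

```python
import numpy as np

# Rows: assemblies; columns: toy features (N50, number of contigs,
# mis-assemblies, coverage gaps). All values are invented.
X = np.array([[120.0,  90, 11, 30],
              [ 95.0, 130, 14, 42],
              [140.0,  70,  9, 25],
              [ 60.0, 210, 25, 70],
              [110.0, 100, 12, 33]])

Z = (X - X.mean(axis=0)) / X.std(axis=0)  # standardize each feature
eigvals = np.linalg.eigvalsh(np.cov(Z, rowvar=False))[::-1]  # descending order
print("variance explained per PC:", np.round(eigvals / eigvals.sum(), 3))
# A steep drop after the first component signals that the features are
# highly correlated -- the "excess dimensionality" of the feature space.
```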

20.
De novo peptide sequencing by mass spectrometry (MS) can determine the amino acid sequence of an unknown peptide without reference to a protein database. MS-based de novo sequencing assumes special importance in focused studies of families of biologically active peptides and proteins, such as hormones, toxins, and antibodies, for which amino acid sequences may be difficult to obtain through genomic methods. These protein families often exhibit sequence homology or characteristic amino acid content; yet, current de novo sequencing approaches do not take advantage of this prior knowledge and, hence, search an unnecessarily large space of possible sequences. Here, we describe an algorithm for de novo sequencing that incorporates sequence constraints into the core graph algorithm and thereby reduces the search space by many orders of magnitude. We demonstrate our algorithm in a study of cysteine-rich toxins from two cone snail species (Conus textile and Conus stercusmuscarum) and report 13 de novo and about 60 total toxins.
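The constraint-aware search can be caricatured with a tiny spectrum-graph sketch: peaks are prefix masses, an edge connects two peaks whose difference matches an allowed residue mass, and candidate peptides are paths from mass zero to the precursor mass. Restricting the residue alphabet here stands in for the paper's sequence constraints; the masses are rounded and the algorithm is far simpler than the one the authors describe.

```python
# Toy constrained spectrum graph. RESIDUES is the restricted alphabet that
# encodes the prior knowledge (hypothetical three-letter alphabet for brevity).
RESIDUES = {"G": 57.02, "A": 71.04, "C": 103.01}  # rounded monoisotopic masses
TOL = 0.02  # mass tolerance in Da

def sequences(peaks, total, prefix="", mass=0.0):
    """Enumerate residue strings whose prefix masses all hit observed peaks."""
    if abs(mass - total) <= TOL:
        yield prefix
        return
    for res, m in RESIDUES.items():
        if any(abs(p - (mass + m)) <= TOL for p in peaks):
            yield from sequences(peaks, total, prefix + res, mass + m)

peaks = [0.0, 57.02, 160.03, 231.07]  # toy prefix masses of the peptide G-C-A
print(list(sequences(peaks, total=231.07)))  # ['GCA']
```

Because every extension must land on an observed peak and draw from the constrained alphabet, the search space collapses by orders of magnitude compared with an unconstrained enumeration over all twenty residues.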
