首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology.  相似文献   

2.
As the more recent next-generation sequencing (NGS) technologies provide longer read sequences, the use of sequencing datasets for complete haplotype phasing is fast becoming a reality, allowing haplotype reconstruction of a single sequenced genome. Nearly all previous haplotype reconstruction studies have focused on diploid genomes and are rarely scalable to genomes with higher ploidy. Yet computational investigations into polyploid genomes carry great importance, impacting plant, yeast and fish genomics, as well as the studies of the evolution of modern-day eukaryotes and (epi)genetic interactions between copies of genes. In this paper, we describe a novel maximum-likelihood estimation framework, HapTree, for polyploid haplotype assembly of an individual genome using NGS read datasets. We evaluate the performance of HapTree on simulated polyploid sequencing read data modeled after Illumina sequencing technologies. For triploid and higher ploidy genomes, we demonstrate that HapTree substantially improves haplotype assembly accuracy and efficiency over the state-of-the-art; moreover, HapTree is the first scalable polyplotyping method for higher ploidy. As a proof of concept, we also test our method on real sequencing data from NA12878 (1000 Genomes Project) and evaluate the quality of assembled haplotypes with respect to trio-based diplotype annotation as the ground truth. The results indicate that HapTree significantly improves the switch accuracy within phased haplotype blocks as compared to existing haplotype assembly methods, while producing comparable minimum error correction (MEC) values. A summary of this paper appears in the proceedings of the RECOMB 2014 conference, April 2–5.  相似文献   

3.

Background  

Next-generation sequencing (NGS) offers a unique opportunity for high-throughput genomics and has potential to replace Sanger sequencing in many fields, including de-novo sequencing, re-sequencing, meta-genomics, and characterisation of infectious pathogens, such as viral quasispecies. Although methodologies and software for whole genome assembly and genome variation analysis have been developed and refined for NGS data, reconstructing a viral quasispecies using NGS data remains a challenge. This application would be useful for analysing intra-host evolutionary pathways in relation to immune responses and antiretroviral therapy exposures. Here we introduce a set of formulae for the combinatorial analysis of a quasispecies, given a NGS re-sequencing experiment and an algorithm for quasispecies reconstruction. We require that sequenced fragments are aligned against a reference genome, and that the reference genome is partitioned into a set of sliding windows (amplicons). The reconstruction algorithm is based on combinations of multinomial distributions and is designed to minimise the reconstruction of false variants, called in-silico recombinants.  相似文献   

4.
Genome assemblies are currently being produced at an impressive rate by consortia and individual laboratories. The low costs and increasing efficiency of sequencing technologies now enable assembling genomes at unprecedented quality and contiguity. However, the difficulty in assembling repeat‐rich and GC‐rich regions (genomic “dark matter”) limits insights into the evolution of genome structure and regulatory networks. Here, we compare the efficiency of currently available sequencing technologies (short/linked/long reads and proximity ligation maps) and combinations thereof in assembling genomic dark matter. By adopting different de novo assembly strategies, we compare individual draft assemblies to a curated multiplatform reference assembly and identify the genomic features that cause gaps within each assembly. We show that a multiplatform assembly implementing long‐read, linked‐read and proximity sequencing technologies performs best at recovering transposable elements, multicopy MHC genes, GC‐rich microchromosomes and the repeat‐rich W chromosome. Telomere‐to‐telomere assemblies are not a reality yet for most organisms, but by leveraging technology choice it is now possible to minimize genome assembly gaps for downstream analysis. We provide a roadmap to tailor sequencing projects for optimized completeness of both the coding and noncoding parts of nonmodel genomes.  相似文献   

5.
Recent improvements in technology have made DNA sequencing dramatically faster and more efficient than ever before. The new technologies produce highly accurate sequences, but one drawback is that the most efficient technology produces the shortest read lengths. Short-read sequencing has been applied successfully to resequence the human genome and those of other species but not to whole-genome sequencing of novel organisms. Here we describe the sequencing and assembly of a novel clinical isolate of Pseudomonas aeruginosa, strain PAb1, using very short read technology. From 8,627,900 reads, each 33 nucleotides in length, we assembled the genome into one scaffold of 76 ordered contiguous sequences containing 6,290,005 nucleotides, including one contig spanning 512,638 nucleotides, plus an additional 436 unordered contigs containing 416,897 nucleotides. Our method includes a novel gene-boosting algorithm that uses amino acid sequences from predicted proteins to build a better assembly. This study demonstrates the feasibility of very short read sequencing for the sequencing of bacterial genomes, particularly those for which a related species has been sequenced previously, and expands the potential application of this new technology to most known prokaryotic species.  相似文献   

6.
The Genome Sequencer FLX System (GS FLX), powered by 454 Sequencing, is a next-generation DNA sequencing technology featuring a unique mix of long reads, exceptional accuracy, and ultra-high throughput. It has been proven to be the most versatile of all currently available next-generation sequencing technologies, supporting many high-profile studies in over seven applications categories. GS FLX users have pursued innovative research in de novo sequencing, re-sequencing of whole genomes and target DNA regions, metagenomics, and RNA analysis. 454 Sequencing is a powerful tool for human genetics research, having recently re-sequenced the genome of an individual human, currently re-sequencing the complete human exome and targeted genomic regions using the NimbleGen sequence capture process, and detected low-frequency somatic mutations linked to cancer.  相似文献   

7.
High-throughput DNA sequencing technologies have revolutionized genomic analysis, including the de novo assembly of whole genomes. Nevertheless, assembly of complex genomes remains challenging, in part due to the presence of dispersed repeats which introduce ambiguity during genome reconstruction. Transposable elements (TEs) can be particularly problematic, especially for TE families exhibiting high sequence identity, high copy number, or complex genomic arrangements. While TEs strongly affect genome function and evolution, most current de novo assembly approaches cannot resolve long, identical, and abundant families of TEs. Here, we applied a novel Illumina technology called TruSeq synthetic long-reads, which are generated through highly-parallel library preparation and local assembly of short read data and which achieve lengths of 1.5–18.5 Kbp with an extremely low error rate (0.03% per base). To test the utility of this technology, we sequenced and assembled the genome of the model organism Drosophila melanogaster (reference genome strain y; cn, bw, sp) achieving an N50 contig size of 69.7 Kbp and covering 96.9% of the euchromatic chromosome arms of the current reference genome. TruSeq synthetic long-read technology enables placement of individual TE copies in their proper genomic locations as well as accurate reconstruction of TE sequences. We entirely recovered and accurately placed 4,229 (77.8%) of the 5,434 annotated transposable elements with perfect identity to the current reference genome. As TEs are ubiquitous features of genomes of many species, TruSeq synthetic long-reads, and likely other methods that generate long-reads, offer a powerful approach to improve de novo assemblies of whole genomes.  相似文献   

8.
Here we use whole-genome de novo assembly of second-generation sequencing reads to map structural variation (SV) in an Asian genome and an African genome. Our approach identifies small- and intermediate-size homozygous variants (1-50 kb) including insertions, deletions, inversions and their precise breakpoints, and in contrast to other methods, can resolve complex rearrangements. In total, we identified 277,243 SVs ranging in length from 1-23 kb. Validation using computational and experimental methods suggests that we achieve overall <6% false-positive rate and <10% false-negative rate in genomic regions that can be assembled, which outperforms other methods. Analysis of the SVs in the genomes of 106 individuals sequenced as part of the 1000 Genomes Project suggests that SVs account for a greater fraction of the diversity between individuals than do single-nucleotide polymorphisms (SNPs). These findings demonstrate that whole-genome de novo assembly is a feasible approach to deriving more comprehensive maps of genetic variation.  相似文献   

9.
Since the turn of the century the complete genome sequence of just one mouse strain, C57BL/6J, has been available. Knowing the sequence of this strain has enabled large-scale forward genetic screens to be performed, the creation of an almost complete set of embryonic stem (ES) cell lines with targeted alleles for protein-coding genes, and the generation of a rich catalog of mouse genomic variation. However, many experiments that use other common laboratory mouse strains have been hindered by a lack of whole-genome sequence data for these strains. The last 5?years has witnessed a revolution in DNA sequencing technologies. Recently, these technologies have been used to expand the repertoire of fully sequenced mouse genomes. In this article we review the main findings of these studies and discuss how the sequence of mouse genomes is helping pave the way from sequence to phenotype. Finally, we discuss the prospects for using de novo assembly techniques to obtain high-quality assembled genome sequences of these laboratory mouse strains, and what advances in sequencing technologies may be required to achieve this goal.  相似文献   

10.

Background  

The decrease in cost for sequencing and improvement in technologies has made it easier and more common for the re-sequencing of large genomes as well as parallel sequencing of small genomes. It is possible to completely sequence a small genome within days and this increases the number of publicly available genomes. Among the types of genomes being rapidly sequenced are those of microbial and viral genomes responsible for infectious diseases. However, accurate gene prediction is a challenge that persists for decoding a newly sequenced genome. Therefore, accurate and efficient gene prediction programs are highly desired for rapid and cost effective surveillance of RNA viruses through full genome sequencing.  相似文献   

11.
Advances in DNA sequencing technology have improved our ability to characterize most genomic diversity. However, accurate resolution of large structural events is challenging because of the short read lengths of second-generation technologies. Third-generation sequencing technologies, which can yield longer multikilobase reads, have the potential to address limitations associated with genome assembly. Here we combine sequencing data from second- and third-generation DNA sequencing technologies to assemble the two-chromosome genome of a recent Haitian cholera outbreak strain into two nearly finished contigs at >99.9% accuracy. Complex regions with clinically relevant structure were completely resolved. In separate control assemblies on experimental and simulated data for the canonical N16961 cholera reference strain, we obtained 14 scaffolds of greater than 1 kb for the experimental data and 8 scaffolds of greater than 1 kb for the simulated data, which allowed us to correct several errors in contigs assembled from the short-read data alone. This work provides a blueprint for the next generation of rapid microbial identification and full-genome assembly.  相似文献   

12.
Heterozyosity is an important feature of many plant genomes, and is related to heterosis. Sweet orange, a highly heterozygous species, is thought to have originated from an inter‐species hybrid between pummelo and mandarin. To investigate the heterozygosity of the sweet orange genome and examine how this heterozygosity affects gene expression, we characterized the genome of Valencia orange for single nucleotide variations (SNVs), small insertions and deletions (InDels) and structural variations (SVs), and determined their functional effects on protein‐coding genes and non‐coding sequences. Almost half of the genes containing large‐effect SNVs and InDels were expressed in a tissue‐specific manner. We identified 3542 large SVs (>50 bp), including deletions, insertions and inversions. Most of the 296 genes located in large‐deletion regions showed low expression levels. RNA‐Seq reads and DNA sequencing reads revealed that the alleles of 1062 genes were differentially expressed. In addition, we detected approximately 42 Mb of contigs that were not found in the reference genome of a haploid sweet orange by de novo assembly of unmapped reads, and annotated 134 protein‐coding genes within these contigs. We discuss how this heterozygosity affects the quality of genome assembly. This study advances our understanding of the genome architecture of sweet orange, and provides a global view of gene expression at heterozygous loci.  相似文献   

13.
An analysis of the feasibility of short read sequencing   总被引:11,自引:1,他引:10  
Several methods for ultra high-throughput DNA sequencing are currently under investigation. Many of these methods yield very short blocks of sequence information (reads). Here we report on an analysis showing the level of genome sequencing possible as a function of read length. It is shown that re-sequencing and de novo sequencing of the majority of a bacterial genome is possible with read lengths of 20–30 nt, and that reads of 50 nt can provide reconstructed contigs (a contiguous fragment of sequence data) of 1000 nt and greater that cover 80% of human chromosome 1.  相似文献   

14.
Next-generation sequencing technologies have fostered an unprecedented proliferation of high-throughput sequencing projects and a concomitant development of novel algorithms for the assembly of short reads. In this context, an important issue is the need of a careful assessment of the accuracy of the assembly process. Here, we review the efficiency of a panel of assemblers, specifically designed to handle data from GS FLX 454 platform, on three bacterial data sets with different characteristics in terms of reads coverage and repeats content. Our aim is to investigate their strengths and weaknesses in the reconstruction of the reference genomes. In our benchmarking, we assess assemblers' performance, quantifying and characterizing assembly gaps and errors, and evaluating their ability to solve complex genomic regions containing repeats. The final goal of this analysis is to highlight pros and cons of each method, in order to provide the final user with general criteria for the right choice of the appropriate assembly strategy, depending on the specific needs. A further aspect we have explored is the relationship between coverage of a sequencing project and quality of the obtained results. The final outcome suggests that, for a good tradeoff between costs and results, the planned genome coverage of an experiment should not exceed 20-30 ×.  相似文献   

15.
For many researchers, next generation sequencing data holds the key to answering a category of questions previously unassailable. One of the important and challenging steps in achieving these goals is accurately assembling the massive quantity of short sequencing reads into full nucleic acid sequences. For research groups working with non-model or wild systems, short read assembly can pose a significant challenge due to the lack of pre-existing EST or genome reference libraries. While many publications describe the overall process of sequencing and assembly, few address the topic of how many and what types of reads are best for assembly. The goal of this project was use real world data to explore the effects of read quantity and short read quality scores on the resulting de novo assemblies. Using several samples of short reads of various sizes and qualities we produced many assemblies in an automated manner. We observe how the properties of read length, read quality, and read quantity affect the resulting assemblies and provide some general recommendations based on our real-world data set.  相似文献   

16.
Fundamental improvement was made for genome sequencing since the next-generation sequencing (NGS) came out in the 2000s. The newer technologies make use of the power of massively-parallel short-read DNA sequencing, genome alignment and assembly methods to digitally and rapidly search the genomes on a revolutionary scale, which enable large-scale whole genome sequencing (WGS) accessible and practical for researchers. Nowadays, whole genome sequencing is more and more prevalent in detecting the genetics of diseases, studying causative relations with cancers, making genome-level comparative analysis, reconstruction of human population history, and giving clinical implications and instructions. In this review, we first give a typical pipeline of whole genome sequencing, including the lab template preparation, sequencing, genome assembling and quality control, variants calling and annotations. We compare the difference between whole genome and whole exome sequencing (WES), and explore a wide range of applications of whole genome sequencing for both mendelian diseases and complex diseases in medical genetics. We highlight the impact of whole genome sequencing in cancer studies, regulatory variant analysis, predictive medicine and precision medicine, as well as discuss the challenges of the whole genome sequencing.   相似文献   

17.
Following the discovery of Acanthamoeba polyphaga mimivirus, diverse giant viruses have been isolated. However, only a small fraction of these isolates have been completely sequenced, limiting our understanding of the genomic diversity of giant viruses. MinION is a portable and low-cost long-read sequencer that can be readily used in a laboratory. Although MinION provides highly error-prone reads that require correction through additional short-read sequencing, recent studies assembled high-quality microbial genomes only using MinION sequencing. Here, we evaluated the accuracy of MinION-only genome assemblies for giant viruses by re-sequencing a prototype marseillevirus. Assembled genomes presented over 99.98% identity to the reference genome with a few gaps, demonstrating a high accuracy of the MinION-only assembly. As a proof of concept, we de novo assembled five newly isolated viruses. Average nucleotide identities to their closest known relatives suggest that the isolates represent new species of marseillevirus, pithovirus and mimivirus. The assembly of subsampled reads demonstrated that their taxonomy and genomic composition could be analysed at the 50× sequencing coverage. We also identified a pithovirus gene whose homologues were detected only in metagenome-derived relatives. Collectively, we propose that MinION-only assembly is an effective approach to rapidly perform a genome-wide analysis of isolated giant viruses.  相似文献   

18.

Background

There are a growing number of next-generation sequencing technologies. At present, the most cost-effective options also produce the shortest reads. However, even for prokaryotes, there is uncertainty concerning the utility of these technologies for the de novo assembly of complete genomes. This reflects an expectation that short reads will be unable to resolve small, but presumably abundant, repeats.

Methodology/Principal Findings

Using a simple model of repeat assembly, we develop and test a technique that, for any read length, can estimate the occurrence of unresolvable repeats in a genome, and thus predict the number of gaps that would need to be closed to produce a complete sequence. We apply this technique to 818 prokaryote genome sequences. This provides a quantitative assessment of the relative performance of various lengths. Notably, unpaired reads of only 150nt can reconstruct approximately 50% of the analysed genomes with fewer than 96 repeat-induced gaps. Nonetheless, there is considerable variation amongst prokaryotes. Some genomes can be assembled to near contiguity using very short reads while others require much longer reads.

Conclusions

Given the diversity of prokaryote genomes, a sequencing strategy should be tailored to the organism under study. Our results will provide researchers with a practical resource to guide the selection of the appropriate read length.  相似文献   

19.
Short‐read sequencing technologies have in principle made it feasible to draw detailed inferences about the recent history of any organism. In practice, however, this remains challenging due to the difficulty of genome assembly in most organisms and the lack of statistical methods powerful enough to discriminate between recent, nonequilibrium histories. We address both the assembly and inference challenges. We develop a bioinformatic pipeline for generating outgroup‐rooted alignments of orthologous sequence blocks from de novo low‐coverage short‐read data for a small number of genomes, and show how such sequence blocks can be used to fit explicit models of population divergence and admixture in a likelihood framework. To illustrate our approach, we reconstruct the Pleistocene history of an oak‐feeding insect (the oak gallwasp Biorhiza pallida), which, in common with many other taxa, was restricted during Pleistocene ice ages to a longitudinal series of southern refugia spanning the Western Palaearctic. Our analysis of sequence blocks sampled from a single genome from each of three major glacial refugia reveals support for an unexpected history dominated by recent admixture. Despite the fact that 80% of the genome is affected by admixture during the last glacial cycle, we are able to infer the deeper divergence history of these populations. These inferences are robust to variation in block length, mutation model and the sampling location of individual genomes within refugia. This combination of de novo assembly and numerical likelihood calculation provides a powerful framework for estimating recent population history that can be applied to any organism without the need for prior genetic resources.  相似文献   

20.
State-of-the-art DNA sequencing technologies are transforming the life sciences due to their ability to generate nucleotide sequence information with a speed and quantity that is unapproachable with traditional Sanger sequencing. Genome sequencing is a principal application of this technology, where the ultimate goal is the full and complete sequence of the organism of interest. Due to the nature of the raw data produced by these technologies, a full genomic sequence attained without the aid of Sanger sequencing has yet to be demonstrated.We have successfully developed a four-phase strategy for using only next-generation sequencing technologies (Illumina and 454) to assemble a complete microbial genome de novo. We applied this approach to completely assemble the 3.7 Mb genome of a rare Geobacter variant (KN400) that is capable of unprecedented current production at an electrode. Two key components of our strategy enabled us to achieve this result. First, we integrated the two data types early in the process to maximally leverage their complementary characteristics. And second, we used the output of different short read assembly programs in such a way so as to leverage the complementary nature of their different underlying algorithms or of their different implementations of the same underlying algorithm.The significance of our result is that it demonstrates a general approach for maximizing the efficiency and success of genome assembly projects as new sequencing technologies and new assembly algorithms are introduced. The general approach is a meta strategy, wherein sequencing data are integrated as early as possible and in particular ways and wherein multiple assembly algorithms are judiciously applied such that the deficiencies in one are complemented by another.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号