Similar literature
 Found 20 similar documents (search time: 31 ms)
1.
Advances in high-throughput DNA sequencing technologies have supported a rapid proliferation of microbial genome sequencing projects, providing the genetic blueprint for in-depth studies. Oftentimes, difficult-to-sequence regions in microbial genomes are ruled "intractable", resulting in a growing number of genomes with sequence gaps deposited in databases. A procedure was developed to sequence such problematic regions in the "non-contiguous finished" Desulfovibrio desulfuricans ND132 genome (6 intractable gaps) and the Desulfovibrio africanus genome (1 intractable gap). The polynucleotides surrounding each gap formed GC-rich secondary structures, making the regions refractory to amplification and sequencing. Strand-displacing DNA polymerases used in concert with a novel ramped PCR extension cycle supported amplification and closure of all gap regions in both genomes. The developed procedures support accurate gene annotation and provide a step-wise method that reduces the effort required for genome finishing.

2.
Assembling individual genomes from complex community metagenomic data remains a challenging issue for environmental studies. We evaluated the quality of genome assemblies from community short read data (Illumina 100 bp paired-end sequences) using datasets recovered from freshwater and soil microbial communities as well as in silico simulations. Our analyses revealed that the genome of a single genotype (or species) can be accurately assembled from a complex metagenome when it shows at least about 20× coverage. At lower coverage, however, the derived assemblies contained a substantial fraction of non-target sequences (chimeras), which explains, at least in part, the higher number of hypothetical genes recovered in metagenomic relative to genomic projects. We also provide examples of how to detect intrapopulation structure in metagenomic datasets and estimate the type and frequency of errors in assembled genes and contigs from datasets of varied species complexity.

3.
With the ever-increasing amount of genomic data available, interest in generating biochemical pathways has grown tremendously. So far, mainly complete genomes have been used to reconstruct the biochemical pathways and their associated interactions. However, a large number of low-coverage genomes, as well as other sources of partial genomic data, are currently available for many organisms. In order to be able to use incomplete data for metabolic reconstruction, the inherent properties of this procedure need to be investigated. In this short note, we describe the robustness and predictive power of metabolic reconstructions using partial information from Schizosaccharomyces pombe. We also discuss the implications of the results for reference genome projects as well as other large-scale sequencing data.

4.
Expected-value models have long provided a rudimentary theoretical foundation for random DNA sequencing. Here, we are interested in improving characterization of genome coverage in terms of its underlying probability distributions. We find that the mathematical notion of occupancy serves as a good model for evolution of the coverage distribution function and reveals new insights related to sequence redundancy. Established concepts, such as “full shotgun depth,” have been assumed invariant, but actually depend on project size and decrease over time. For most microbial projects, the full shotgun milestone should be revised downward by about 30%. Accordingly, many already-completed genomes appear to have been over-sequenced. Results also suggest that read lengths for emerging high-throughput sequencing methods must be increased substantially before they can be considered as possible successors to the standard Sanger method. In particular, gains in throughput and sequence depth cannot be made to compensate for diminished read length. Limits are well approximated by a simple logarithmic equation, which should be useful in estimating maximum coverage-based redundancy for future projects.
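For orientation, the classical expected-value model that this abstract sets out to refine can be sketched in a few lines. Under the idealized assumption that reads land uniformly at random, a project with N reads of length L on a genome of size G has redundancy c = NL/G, and the expected fraction of bases covered at least once is 1 − e^(−c). This is the textbook approximation only, not the occupancy-based distribution or the logarithmic limit the authors derive; the function names and example numbers below are illustrative assumptions.

```python
import math


def redundancy(num_reads: int, read_length: int, genome_size: int) -> float:
    """Sequence redundancy (fold coverage) c = N * L / G."""
    return num_reads * read_length / genome_size


def expected_covered_fraction(c: float) -> float:
    """Expected fraction of the genome covered at least once.

    Assumes uniformly random read placement and ignores edge effects,
    so this is an idealized expected value, not a distribution.
    """
    return 1.0 - math.exp(-c)


def redundancy_for_fraction(target: float) -> float:
    """Redundancy needed to reach a target expected covered fraction."""
    if not 0.0 <= target < 1.0:
        raise ValueError("target must be in [0, 1)")
    return -math.log(1.0 - target)


if __name__ == "__main__":
    # Hypothetical project: 10,000 reads of 500 bp on a 1 Mb genome.
    c = redundancy(10_000, 500, 1_000_000)
    print(f"redundancy: {c:.1f}x")
    print(f"expected covered fraction: {expected_covered_fraction(c):.4f}")
```

Because the model only yields an expected value, it says nothing about how coverage varies between projects of different size, which is the gap the abstract addresses.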

5.
Genome-scale metabolic models provide a powerful means to harness information from genomes to deepen biological insights. With exponentially increasing sequencing capacity, there is an enormous need for automated reconstruction techniques that can provide more accurate models in a short time frame. Current methods for automated metabolic network reconstruction rely on gene and reaction annotations to build draft metabolic networks and algorithms to fill gaps in these networks. However, automated reconstruction is hampered by database inconsistencies, incorrect annotations, and gap filling largely without considering genomic information. Here we develop an approach for applying genomic information to predict alternative functions for genes and estimate their likelihoods from sequence homology. We show that computed likelihood values were significantly higher for annotations found in manually curated metabolic networks than those that were not. We then apply these alternative functional predictions to estimate reaction likelihoods, which are used in a new gap filling approach called likelihood-based gap filling to predict more genomically consistent solutions. To validate the likelihood-based gap filling approach, we applied it to models where essential pathways were removed, finding that likelihood-based gap filling identified more biologically relevant solutions than parsimony-based gap filling approaches. We also demonstrate that models gap filled using likelihood-based gap filling provide greater coverage and genomic consistency with metabolic gene functions compared to parsimony-based approaches. Interestingly, despite these findings, we found that likelihoods did not significantly affect consistency of gap filled models with Biolog and knockout lethality data. This indicates that the phenotype data alone cannot necessarily be used to discriminate between alternative solutions for gap filling and, therefore, that the use of other information is necessary to obtain a more accurate network. All described workflows are implemented as part of the DOE Systems Biology Knowledgebase (KBase) and are publicly available via API or command-line web interface.

6.
Many genomes have been sequenced to high-quality draft status using Sanger capillary electrophoresis and/or newer short-read sequence data and whole genome assembly techniques. However, even the best draft genomes contain gaps and other imperfections due to limitations in the input data and the techniques used to build draft assemblies. Sequencing biases, repetitive genomic features, genomic polymorphism, and other complicating factors all come together to make some regions difficult or impossible to assemble. Traditionally, draft genomes were upgraded to “phase 3 finished” status using time-consuming and expensive Sanger-based manual finishing processes. For more facile assembly and automated finishing of draft genomes, we present here an automated approach to finishing using long-reads from the Pacific Biosciences RS (PacBio) platform. Our algorithm and associated software tool, PBJelly (publicly available at https://sourceforge.net/projects/pb-jelly/), automates the finishing process using long sequence reads in a reference-guided assembly process. PBJelly also provides “lift-over” co-ordinate tables to easily port existing annotations to the upgraded assembly. Using PBJelly and long PacBio reads, we upgraded the draft genome sequences of a simulated Drosophila melanogaster, the version 2 draft Drosophila pseudoobscura, an assembly of the Assemblathon 2.0 budgerigar dataset, and a preliminary assembly of the Sooty mangabey. With 24× mapped coverage of PacBio long-reads, we addressed 99% of gaps and were able to close 69% and improve 12% of all gaps in D. pseudoobscura. With 4× mapped coverage of PacBio long-reads we saw reads address 63% of gaps in our budgerigar assembly, of which 32% were closed and 63% improved. With 6.8× mapped coverage of mangabey PacBio long-reads we addressed 97% of gaps and closed 66% of addressed gaps and improved 19%. The accuracy of gap closure was validated by comparison to Sanger sequencing on gaps from the original D. pseudoobscura draft assembly and shown to be dependent on initial reference quality.

7.
SUMMARY: GenColors is a new web-based software/database system aimed at an improved and accelerated annotation of prokaryotic genomes, considering information on related genomes and making extensive use of genome comparison. It offers a seamless integration of data from ongoing sequencing projects and annotated genomic sequences obtained from GenBank. The genome comparison tools determine, for example, best-bidirectional hits, gene conservation, syntenies and gene core sets. Swiss-Prot/TrEMBL hits allow annotations in an effective manner. To further support the annotation, base-specific quality data can also be displayed if available. With GenColors, dedicated genome browsers containing a group of related genomes can be easily set up and maintained. It has been efficiently used for Borrelia garinii and is currently applied to various ongoing genome projects. AVAILABILITY: Detailed information on GenColors is available at http://gencolors.imb-jena.de. Online usage of GenColors-based genome browsers is the preferred application mode. The system is also available upon request for local installation.

8.

Background

Next-generation sequencing technologies are rapidly generating whole-genome datasets for an increasing number of organisms. However, phylogenetic reconstruction of genomic data remains difficult because de novo assembly for non-model genomes and multi-genome alignment are challenging.

Results

To greatly simplify the analysis, we present an Assembly and Alignment-Free (AAF) method (https://sourceforge.net/projects/aaf-phylogeny) that constructs phylogenies directly from unassembled genome sequence data, bypassing both genome assembly and alignment. Using mathematical calculations, models of sequence evolution, and simulated sequencing of published genomes, we address both evolutionary and sampling issues caused by direct reconstruction, including homoplasy, sequencing errors, and incomplete sequencing coverage. From these results, we calculate the statistical properties of the pairwise distances between genomes, allowing us to optimize parameter selection and perform bootstrapping. As a test case with real data, we successfully reconstructed the phylogeny of 12 mammals using raw sequencing reads. We also applied AAF to 21 tropical tree genome datasets with low coverage to demonstrate its effectiveness on non-model organisms.

Conclusion

Our AAF method opens up phylogenomics for species without an appropriate reference genome or high sequence coverage, and rapidly creates a phylogenetic framework for further analysis of genome structure and diversity among non-model organisms.

Electronic supplementary material

The online version of this article (doi:10.1186/s12864-015-1647-5) contains supplementary material, which is available to authorized users.
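The general idea of comparing genomes directly from unassembled reads can be illustrated with k-mer sets. The sketch below is a generic, simplified illustration and is not the AAF implementation or its distance formula: it collects the k-mers seen in each sample's reads and reports the fraction shared relative to the smaller set. The function names, the choice of k, and the toy reads are all assumptions; reverse complements, sequencing errors, and low-coverage filtering, which the abstract explicitly addresses, are ignored here.

```python
from typing import Iterable, Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in one read."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_set(reads: Iterable[str], k: int) -> Set[str]:
    """Union of k-mers over every read in a sample; no assembly needed."""
    collected: Set[str] = set()
    for read in reads:
        collected |= kmers(read, k)
    return collected


def shared_fraction(a: Set[str], b: Set[str]) -> float:
    """Fraction of k-mers shared, relative to the smaller of the two sets."""
    smaller = min(len(a), len(b))
    if smaller == 0:
        raise ValueError("both samples must contain at least one k-mer")
    return len(a & b) / smaller


if __name__ == "__main__":
    # Toy reads for two hypothetical samples.
    sample_a = kmer_set(["ACGTACGTAC", "GTACGTTTGA"], k=5)
    sample_b = kmer_set(["ACGTACGTAC", "CCCCGGGGAA"], k=5)
    print(f"shared k-mer fraction: {shared_fraction(sample_a, sample_b):.2f}")
```

A pairwise matrix of such shared fractions, converted to distances under a chosen evolutionary model, is the kind of input a distance-based tree-building method would take.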

9.
Accurately identifying DNA polymorphisms can bridge the gap between phenotypes and genotypes and is essential for molecular marker-assisted genetic studies. Genome complexities, including large-scale structural variations, bring great challenges to bioinformatic analysis for obtaining high-confidence genomic variants, as sequence differences between non-allelic loci of two or more genomes can be misinterpreted as polymorphisms. It is important to correctly filter out artificial variants to avoid ...

10.
Genome sequencing projects have been initiated for a wide range of eukaryotes. A few projects have reached completion, but most exist as draft assemblies. As one of the main reasons to sequence a genome is to obtain its catalog of genes, an important question is how complete or completable the catalog is in unfinished genomes. To answer this question, we have identified a set of core eukaryotic genes (CEGs) that are extremely highly conserved and that we believe are present in low copy numbers in higher eukaryotes. From an analysis of a phylogenetically diverse set of eukaryotic genome assemblies, we found that the proportion of CEGs mapped in draft genomes provides a useful metric for describing the gene space, and complements the commonly used N50 length and x-fold coverage values.
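The two kinds of metric contrasted in this abstract are simple to compute. The sketch below uses the common definition of N50 (the contig length at which contigs of that length or longer contain at least half of the total assembly length) and a plain proportion for gene-space completeness. The function names and example numbers are illustrative assumptions, and the code does not perform the CEG mapping step itself, which the abstract does not detail.

```python
from typing import Sequence


def n50(contig_lengths: Sequence[int]) -> int:
    """N50: walk contigs from longest to shortest and return the length
    at which the running total first reaches half of the assembly size."""
    if not contig_lengths:
        raise ValueError("at least one contig length is required")
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


def ceg_proportion(mapped: int, total: int) -> float:
    """Proportion of a core gene set found in a draft assembly."""
    if total <= 0 or not 0 <= mapped <= total:
        raise ValueError("require 0 <= mapped <= total and total > 0")
    return mapped / total


if __name__ == "__main__":
    # Hypothetical draft assembly.
    lengths = [2, 2, 2, 3, 3, 4, 8, 8]
    print("N50:", n50(lengths))
    print("core genes mapped:", f"{ceg_proportion(180, 200):.0%}")
```

N50 describes contiguity only, so an assembly can score well on it while still missing genes, which is why a gene-based proportion is a useful complement.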

11.
The classical theory of shotgun DNA sequencing accounts for neither the placement dependencies that are a fundamental consequence of the forward-reverse sequencing strategy, nor the edge effect that arises for small to moderate-sized genomic targets. These phenomena are relevant to a number of sequencing scenarios, including large-insert BAC and fosmid clones, filtered genomic libraries, and macro-nuclear chromosomes. Here, we report a model that considers these two effects and provides both the expected value of coverage and its variance. Comparison to methyl-filtered maize data shows significant improvement over classical theory. The model is used to analyze coverage performance over a range of small to moderately-sized genomic targets. We find that the read pairing effect and the edge effect interact in a non-trivial fashion. Shorter reads give superior coverage per unit sequence depth relative to longer ones. In principle, end-sequences can be optimized with respect to template insert length; however, optimal performance is unlikely to be realized in most cases because of inherent size variation in any set of targets. Conversely, single-stranded reads exhibit roughly the same coverage attributes as optimized end-reads. Although linking information is lost, single-stranded data should not pose a significant assembly liability if the target represents predominantly low-copy sequence. We also find that random sequencing should be halted at substantially lower redundancies than those now associated with larger projects. Given the enormous amount of data generated per cycle on pyrosequencing instruments, this observation suggests devising schemes to split each run cycle between two or more projects. This would prevent over-sequencing and would further leverage the pyrosequencing method.

12.
Construction of DNA fragment libraries for next-generation sequencing can prove challenging, especially for samples with low DNA yield. Protocols devised to circumvent the problems associated with low starting quantities of DNA can result in amplification biases that skew the distribution of genomes in metagenomic data. Moreover, sample throughput can be slow, as current library construction techniques are time-consuming. This study evaluated Nextera, a new transposon-based method that is designed for quick production of DNA fragment libraries from a small quantity of DNA. The sequence read distribution across nine phage genomes in a mock viral assemblage met predictions for six of the least-abundant phages; however, the rank order of the most abundant phages differed slightly from predictions. De novo genome assemblies from Nextera libraries provided long contigs spanning over half of the phage genome; in four cases where full-length genome sequences were available for comparison, consensus sequences were found to match over 99% of the genome with near-perfect identity. Analysis of areas of low and high sequence coverage within phage genomes indicated that GC content may influence coverage of sequences from Nextera libraries. Comparisons of phage genomes prepared using both Nextera and a standard 454 FLX Titanium library preparation protocol suggested that the coverage biases according to GC content observed within the Nextera libraries were largely attributable to bias in the Nextera protocol rather than to the 454 sequencing technology. Nevertheless, given suitable sequence coverage, the Nextera protocol produced high-quality data for genomic studies. For metagenomics analyses, effects of GC amplification bias would need to be considered; however, the library preparation standardization that Nextera provides should benefit comparative metagenomic analyses.

13.
Gap closure is a challenging phase in microbial random shotgun genome sequencing projects, particularly since genome assemblies are often complicated by the presence of repeat elements, insertion sequences and other similar factors that contribute to sequence misassemblies. While it is well recognized that the conservation of genetic information between microbial genomes, combined with the exponential increase in available microbial sequences, can be exploited to increase the efficiency of gap closure, we lack the computational tools to aid in this process. We describe here a new tool, MGView, which was developed to create a graphical depiction of the alignment of a set of microbial contigs against a completed microbial genome. The results of our assembly of the Staphylococcus aureus RF122 genome show that MGView enables a considerable reduction in time and economic cost associated with closure. Together, the results also show that the application of MGView not only enables a reduction in fold-coverage requirements of the random shotgun sequence phase, but also provides interesting insights into differences in gene content and organization between finished and unfinished microbial genomes.

14.
ABSTRACT: BACKGROUND: The availability of a large number of recently sequenced vertebrate genomes opens new avenues to integrate cytogenetics and genomics in comparative and evolutionary studies. Cytogenetic mapping can offer alternative means to identify conserved synteny shared by distinct genomes and also to define genome regions that are still not finely characterized even after wide-ranging nucleotide sequencing efforts. An efficient way to perform comparative cytogenetic mapping is based on mapping BAC clones by fluorescence in situ hybridization. In this report, to address the knowledge gap on genome evolution in cichlid fishes, BAC clones of an Oreochromis niloticus library covering the linkage groups (LG) 1, 3, 5, and 7 were mapped onto the chromosomes of 9 African cichlid species. The cytogenetic mapping data were also integrated with BAC-end sequence information of O. niloticus and comparatively analyzed against the genomes of other fish species and vertebrates. RESULTS: The location of BACs from LG1, 3, 5, and 7 revealed a strong chromosomal conservation among the analyzed cichlid species genomes, which evidenced a synteny of the markers of each LG. Comparative in silico analysis also identified large genomic blocks that were conserved in distantly related fish groups and also in other vertebrates. CONCLUSIONS: Although it has been suggested that fishes contain plastic genomes with high rates of chromosomal rearrangements and probably low rates of synteny conservation, our results provide evidence that large syntenic chromosome segments have been conserved during evolution, at least for the considered markers. Additionally, our current cytogenetic mapping efforts integrated with genomic approaches lead to a new perspective for addressing important questions involving chromosome evolution in fishes.

15.
16.
There are ∼1.4 million organisms on this planet that have been described morphologically but there is no comparable coverage of biodiversity at the molecular level. Little more than 1% of the known species have been subject to any molecular scrutiny and eukaryotic genome projects have focused on a group of closely related model organisms. The past year, however, has seen an ∼80% increase in the number of species represented in sequence databases and the completion of the sequencing of three prokaryotic genomes. Large-scale sequencing projects seem set to begin coverage of a wider range of the eukaryotic diversity, including green plants, microsporidians and diplomonads.

17.
Accurate and comprehensive sequence coverage for large genomes has been restricted to only a few species of specific interest. Lower sequence coverage (survey sequencing) of related species can yield a wealth of information about gene content and putative regulatory elements. But survey sequences lack long-range continuity and provide only a fragmented view of a genome. Here we show the usefulness of combining survey sequencing with dense radiation-hybrid (RH) maps for extracting maximum comparative genome information from model organisms. Based on results from the canine system, we propose that from now on all low-pass sequencing projects should be accompanied by a dense, gene-based RH map-construction effort to extract maximum information from the genome with a marginal extra cost.

18.
The gap between the number of known protein sequences and structures continues to widen, particularly as a result of sequencing projects for entire genomes. Recently there have been many attempts to generate structural assignments to all genes on sets of completed genomes using fold-recognition methods. We developed a method that detects false positives made by these genome-wide structural assignment experiments by identifying isolated occurrences. The method was tested using two sets of assignments, generated by SUPERFAMILY and PSI-BLAST, on 150 completed genomes. A phylogeny of these genomes was built and a parsimony algorithm was used to identify isolated occurrences by detecting occurrences that cause a gain at leaf level. Isolated occurrences tend to have high e-values, and in both sets of assignments, a sudden increase in isolated occurrences is observed for e-values >10⁻⁸ for SUPERFAMILY and >10⁻⁴ for PSI-BLAST. Conditions to predict false positives are based on these results. Independent tests confirm that the predicted false positives are indeed more likely to be incorrectly assigned. Evaluation of the predicted false positives also showed that the accuracy of profile-based fold-recognition methods might depend on secondary structure content and sequence length. We show that false positives generated by fold-recognition methods can be identified by considering structural occurrence patterns on completed genomes; occurrences that are isolated within the phylogeny tend to be less reliable. The method provides a new independent way to examine the quality of fold assignments and may be used to improve the output of any genome-wide fold assignment method.

19.

Background  

Whole genome shotgun sequencing produces increasingly higher coverage of a genome with random sequence reads. Progressive whole genome assembly and eventual finishing sequencing is a process that typically takes several years for large eukaryotic genomes. In the interim, all sequence reads of public sequencing projects are made available in repositories such as the NCBI Trace Archive. For a particular locus, sequencing coverage may be high enough early on to produce a reliable local genome assembly. We have developed software, Tracembler, that facilitates in silico chromosome walking by recursively assembling reads of a selected species from the NCBI Trace Archive starting with reads that significantly match sequence seeds supplied by the user.

20.
The genomic peculiarities among microbial eukaryotes challenge the conventional wisdom of genome evolution. Currently, many studies and textbooks explore principles of genome evolution from a limited number of eukaryotic lineages, focusing often on only a few representative species of plants, animals and fungi. Increasing emphasis on studies of genomes in microbial eukaryotes has and will continue to uncover features that are either not present in the representative species (e.g. hypervariable karyotypes or highly fragmented mitochondrial genomes) or are exaggerated in microbial groups (e.g. chromosomal processing between germline and somatic nuclei). Data for microbial eukaryotes have emerged from recent genome sequencing projects, enabling comparisons of the genomes from diverse lineages across the eukaryotic phylogenetic tree. Some of these features, including amplified rDNAs, subtelomeric rDNAs and reduced genomes, appear to have evolved multiple times within eukaryotes, whereas other features, such as absolute strand polarity, are found only within single lineages.
