Similar Articles
1.
BACKGROUND: Gene prediction algorithms (or gene callers) are an essential tool for analyzing shotgun nucleic acid sequence data. Gene prediction is a ubiquitous step in sequence analysis pipelines: it reduces the volume of data by identifying the most likely reading frame for a fragment, permitting the out-of-frame translations to be ignored. In this study we evaluate five widely used ab initio gene-calling algorithms (FragGeneScan, MetaGeneAnnotator, MetaGeneMark, Orphelia, and Prodigal) for accuracy on short (75-1000 bp) fragments containing sequence error, drawn from previously published artificial data and "real" metagenomic datasets. RESULTS: While the gene prediction tools have similar accuracies on error-free fragments, considerable differences between tools become evident in the presence of sequencing errors. For error-containing short reads, FragGeneScan finds more prokaryotic coding regions than MetaGeneAnnotator, MetaGeneMark, Orphelia, or Prodigal does. This improved detection of genes in error-containing fragments, however, comes at the cost of much lower (50%) specificity and overprediction of genes in noncoding regions. CONCLUSIONS: Ab initio gene callers offer a significant reduction in the computational burden of annotating individual nucleic acid reads and are used in many metagenomic annotation systems. For predicting reading frames on raw reads, we find the hidden Markov model approach in FragGeneScan more sensitive than the other gene prediction tools, while Prodigal, MetaGeneAnnotator, and MetaGeneMark are better suited to higher-quality sequences such as assembled contigs.
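A minimal sketch of the core idea behind reading-frame selection that the abstract describes: score each of a fragment's three forward frames under a codon-usage model and keep the most likely one. This is not FragGeneScan's HMM or any of the benchmarked tools; the function names and the add-one smoothing are illustrative assumptions.

```python
import math
from collections import Counter

def train_codon_model(coding_seqs):
    # Log codon frequencies from known coding sequence, add-one smoothed
    # (an assumed choice) so unseen codons keep a small nonzero probability.
    counts = Counter()
    for seq in coding_seqs:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts.values()) + 64
    bases = "ACGT"
    return {a + b + c: math.log((counts[a + b + c] + 1) / total)
            for a in bases for b in bases for c in bases}

def best_frame(read, model):
    # Score each of the three forward frames; the most likely frame is the
    # one whose codons best match the coding model.
    def score(offset):
        return sum(model.get(read[i:i + 3], math.log(1 / 64))
                   for i in range(offset, len(read) - 2, 3))
    return max(range(3), key=score)
```

Once the best frame is chosen, the two out-of-frame translations of the fragment can be discarded, which is the data reduction the abstract refers to.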

2.
Environmental shotgun sequencing (or metagenomics) is widely used to survey the communities of microbial organisms that live in many diverse ecosystems, such as the human body. Finding the protein-coding genes within the sequences is an important step for assessing the functional capacity of a metagenome. In this work, we developed a metagenomic gene prediction system, Glimmer-MG, that achieves significantly greater accuracy than previous systems via novel approaches to a number of important prediction subtasks. First, we introduce the use of phylogenetic classifications of the sequences for model parameterization. We also cluster the sequences, grouping together those that likely originated from the same organism. Analogous to iterative schemes that are useful for whole genomes, we retrain our models within each cluster on the initial gene predictions before making final predictions. Finally, we model both insertion/deletion and substitution sequencing errors using a different approach than previous software, allowing Glimmer-MG to change coding frame or pass through stop codons by predicting an error. In a comparison among multiple gene-finding methods, Glimmer-MG makes the most sensitive and precise predictions on simulated and real metagenomes for all read lengths and error rates tested.
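The clustering step (grouping reads that likely came from the same organism) can be illustrated with a composition-based sketch: represent each read by its tetranucleotide profile and greedily merge reads with similar profiles. This is only the general idea, not Glimmer-MG's actual clustering; the 0.9 cosine threshold and the fixed first-member centroid are assumptions for illustration.

```python
import math
from collections import Counter

def tetranucleotide_profile(seq):
    # Normalized 4-mer frequency vector, a common composition signature.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts.values()) or 1
    return {k: v / total for k, v in counts.items()}

def cosine(p, q):
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

def greedy_cluster(reads, threshold=0.9):
    # Assign each read to the first cluster whose centroid it resembles;
    # the centroid stays fixed at the first member's profile for simplicity.
    clusters = []  # list of (centroid_profile, member_reads)
    for read in reads:
        prof = tetranucleotide_profile(read)
        for centroid, members in clusters:
            if cosine(prof, centroid) >= threshold:
                members.append(read)
                break
        else:
            clusters.append((prof, [read]))
    return [members for _, members in clusters]
```

Per-cluster models can then be retrained on the initial predictions within each cluster, as the iterative scheme in the abstract describes.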

3.
With the decreasing cost of next-generation sequencing, deep sequencing of clinical samples provides unique opportunities to understand host-associated microbial communities. Among the primary challenges of clinical metagenomic sequencing is the rapid filtering of human reads to survey for pathogens with high specificity and sensitivity. Metagenomes are inherently variable due to the different microbes in the samples and their relative abundance, the size and architecture of genomes, and factors such as target DNA amounts in tissue samples (i.e. human DNA versus pathogen DNA concentration). This variation in metagenomes typically manifests in sequencing datasets as low pathogen abundance, a high number of host reads, and the presence of close relatives and complex microbial communities. In addition to these challenges posed by the composition of metagenomes, the high number of reads generated by high-throughput deep sequencing poses immense computational challenges. Accurate identification of pathogens is confounded by individual reads mapping to multiple different reference genomes, due to gene similarity among taxa present in the community or close relatives in the reference database. Available global and local sequence aligners also vary in sensitivity, specificity, and speed of detection. The efficiency of detecting pathogens in clinical samples is largely dependent on the desired taxonomic resolution of the organisms. We have developed an efficient strategy that identifies "all against all" relationships between sequencing reads and reference genomes. Our approach allows for scaling to large reference databases and then genome reconstruction by aggregating global and local alignments, thus allowing genetic characterization of pathogens at higher taxonomic resolution. These results were consistent with strain-level SNP genotyping and bacterial identification from laboratory culture.
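One simple way to aggregate "all against all" hits when reads map to multiple references, sketched below, is to share each ambiguous read fractionally among the references it hits. This fractional weighting is an assumed illustration of the aggregation idea, not the authors' actual algorithm.

```python
from collections import defaultdict

def aggregate_hits(read_hits):
    # read_hits: read id -> list of reference genomes the read aligns to.
    # A read hitting k distinct references contributes 1/k to each, a
    # simple way to share ambiguous reads among close relatives.
    scores = defaultdict(float)
    for refs in read_hits.values():
        uniq = set(refs)
        if not uniq:
            continue
        weight = 1.0 / len(uniq)
        for ref in uniq:
            scores[ref] += weight
    return dict(scores)
```

References with the highest aggregate scores are then the best candidates for genome reconstruction at finer taxonomic resolution.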

4.
Microbial communities carry out the majority of the biochemical activity on the planet, and they play integral roles in processes including metabolism and immune homeostasis in the human microbiome. Shotgun sequencing of such communities' metagenomes provides information complementary to organismal abundances from taxonomic markers, but the resulting data typically comprise short reads from hundreds of different organisms and are at best challenging to assemble comparably to single-organism genomes. Here, we describe an alternative approach to infer the functional and metabolic potential of a microbial community metagenome. We determined the gene families and pathways present or absent within a community, as well as their relative abundances, directly from short sequence reads. We validated this methodology using a collection of synthetic metagenomes, recovering the presence and abundance both of large pathways and of small functional modules with high accuracy. We subsequently applied this method, HUMAnN, to the microbial communities of 649 metagenomes drawn from seven primary body sites on 102 individuals as part of the Human Microbiome Project (HMP). This provided a means to compare functional diversity and organismal ecology in the human microbiome, and we determined a core of 24 ubiquitously present modules. Core pathways were often implemented by different enzyme families within different body sites, and 168 functional modules and 196 metabolic pathways varied in metagenomic abundance specifically to one or more niches within the microbiome. These included glycosaminoglycan degradation in the gut, as well as phosphate and amino acid transport linked to host phenotype (vaginal pH) in the posterior fornix. An implementation of our methodology is available at http://huttenhower.sph.harvard.edu/humann. 
This provides a means to accurately and efficiently characterize microbial metabolic pathways and functional modules directly from high-throughput sequencing reads, enabling the determination of community roles in the HMP cohort and in future metagenomic studies.
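The mapping from gene-family abundances to pathway presence and abundance can be sketched as follows. This is a deliberate simplification (mean abundance over members, with a coverage cutoff); HUMAnN's real procedure uses more sophisticated coverage and smoothing logic, and the threshold here is an assumption.

```python
def pathway_abundance(gene_abundance, pathways, coverage_threshold=0.5):
    # gene_abundance: gene family -> abundance (e.g. mapped-read counts).
    # pathways: pathway -> set of member gene families.
    # A pathway is called present when at least coverage_threshold of its
    # members are observed; its abundance is the mean over all members
    # (absent members count as zero).
    result = {}
    for pathway, members in pathways.items():
        observed = [g for g in members if gene_abundance.get(g, 0) > 0]
        if len(observed) / len(members) >= coverage_threshold:
            result[pathway] = (sum(gene_abundance.get(g, 0) for g in members)
                               / len(members))
    return result
```

Comparing such per-pathway abundances across samples is what enables the body-site-specific enrichment analysis the abstract reports.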

5.
Exhaustive gene identification is a fundamental goal in all metagenomics projects. However, most metagenomic sequences are unassembled anonymous fragments, and conventional gene-finding methods cannot be applied. We have developed a prokaryotic gene-finding program, MetaGene, which utilizes di-codon frequencies estimated from the GC content of a given sequence, along with various other measures. MetaGene can predict a whole range of prokaryotic genes from anonymous genomic sequences of a few hundred bases, with a sensitivity of 95% and a specificity of 90% on artificial shotgun sequences (700 bp fragments from 12 species). MetaGene has two sets of codon frequency interpolations, one for bacteria and one for archaea, and automatically selects the proper set for a given sequence using the domain classification method we propose. The domain classification works properly, correctly assigning domain information to more than 90% of the artificial shotgun sequences. Applied to the Sargasso Sea dataset, MetaGene predicted almost all of the annotated genes and a notable number of novel genes. MetaGene can be applied to a wide variety of metagenomic projects and expands the utility of metagenomics.
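The GC-dependent parameterization can be illustrated with a toy version: measure a fragment's GC content, then linearly interpolate codon frequencies between models trained on low- and high-GC genomes. MetaGene's actual interpolation scheme is more elaborate; the anchor GC values and function names here are assumptions.

```python
def gc_content(seq):
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def interpolated_freq(codon, gc, low_gc_model, high_gc_model,
                      low_gc=0.3, high_gc=0.7):
    # Linearly interpolate a codon's frequency between models estimated
    # from low- and high-GC training genomes, clamped to the model range.
    t = max(0.0, min(1.0, (gc - low_gc) / (high_gc - low_gc)))
    return (1 - t) * low_gc_model[codon] + t * high_gc_model[codon]
```

This lets a single program score fragments from organisms it has never seen, using GC content as a proxy for the unknown source genome's codon usage.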

6.
A new system, ZCURVE 1.0, for finding protein-coding genes in bacterial and archaeal genomes is presented. The algorithm, which is based on the Z-curve representation of DNA sequences, emphasizes the global statistical features of protein-coding genes by taking the frequencies of bases at the three codon positions into account. Since ZCURVE 1.0 uses only 33 parameters to characterize coding sequences, it gives better consideration to both typical and atypical cases, whereas Markov-model-based methods, e.g. Glimmer 2.02, train thousands of parameters, which may result in less adaptability. To compare the performance of the new system with that of Glimmer 2.02, both systems were run on 18 genomes not annotated by the Glimmer system. Comparisons were also performed for predicting some genes of known function. The average accuracy of the two systems is well matched; however, ZCURVE 1.0 gives more accurate gene start prediction, a lower rate of additional (false-positive) predictions, and higher accuracy in predicting horizontally transferred genes. Joint application of both systems greatly improves gene-finding results. For a typical genome, e.g. Escherichia coli, ZCURVE 1.0 takes approximately 2 min on a Pentium III 866 PC without any human intervention. ZCURVE 1.0 is freely available at http://tubic.tju.edu.cn/Zcurve_B/.
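The Z-curve itself is a standard, well-defined transform: a DNA sequence maps to a 3D curve whose cumulative coordinates track purine vs. pyrimidine, amino vs. keto, and weak vs. strong hydrogen-bonding content. A minimal sketch of the transform (not ZCURVE 1.0's full 33-parameter feature set), assuming an ACGT-only input:

```python
def z_curve(seq):
    # Cumulative Z-curve components after each base:
    #   x_n = (A+G) - (C+T)   purine vs. pyrimidine
    #   y_n = (A+C) - (G+T)   amino vs. keto
    #   z_n = (A+T) - (G+C)   weak vs. strong H-bonding
    x = y = z = 0
    coords = []
    for base in seq.upper():
        x += 1 if base in "AG" else -1
        y += 1 if base in "AC" else -1
        z += 1 if base in "AT" else -1
        coords.append((x, y, z))
    return coords
```

Computing such coordinates separately at the three codon positions yields the kind of compact, position-dependent base-frequency features the system builds its classifier on.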

7.
Accurate identification of DNA polymorphisms using next-generation sequencing technology is challenging because of a high rate of sequencing error and incorrect mapping of reads to reference genomes. Currently available short-read aligners and DNA variant callers suffer from these problems. We developed the Coval software to improve the quality of short-read alignments. Coval is designed to minimize the incidence of spurious alignment of short reads by filtering mismatched reads that remain after local realignment, and by correcting errors in mismatched reads. The error correction is executed based on the base quality and allele frequency at non-reference positions for an individual or pooled sample. We demonstrated the utility of Coval by applying it to simulated genomes and experimentally obtained short-read data from rice, nematode, and mouse. Moreover, we found an unexpectedly large number of incorrectly mapped reads in ‘targeted’ alignments, where whole-genome sequencing reads had been aligned to a local genomic segment, and showed that Coval effectively eliminated such spurious alignments. We conclude that Coval significantly improves the quality of short-read sequence alignments, thereby increasing the calling accuracy of currently available tools for SNP and indel identification. Coval is available at http://sourceforge.net/projects/coval105/.
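The quality-and-frequency error-correction rule can be sketched at a single pileup position: a non-reference base that is both low quality and rare is more likely a sequencing error than a true variant, so revert it to the reference. This is a hypothetical simplification of Coval's logic; the thresholds are assumptions.

```python
from collections import Counter

def correct_pileup(reads_at_pos, ref_base, min_qual=20, min_freq=0.2):
    # reads_at_pos: list of (base, phred_quality) observed at one position.
    # Revert non-reference bases that are both low-quality and rare to the
    # reference base; keep high-quality or frequent alternates as candidate
    # variants.
    n = len(reads_at_pos)
    allele_counts = Counter(base for base, _ in reads_at_pos)
    corrected = []
    for base, qual in reads_at_pos:
        if (base != ref_base and qual < min_qual
                and allele_counts[base] / n < min_freq):
            corrected.append((ref_base, qual))
        else:
            corrected.append((base, qual))
    return corrected
```

Downstream SNP callers then see cleaner pileups, which is how alignment filtering translates into higher calling accuracy.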

8.
Due to the complexity of the protocols and limited knowledge of the nature of microbial communities, simulated metagenomic sequences play an important role in testing the performance of existing tools and data analysis methods for metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models and simulated metagenomes of differing community complexity. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of the resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes), all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes), Illumina produced the best assemblies and most accurately reflected the expected functional composition. For the most complex community (400 genomes), there was very little assembly of reads from any sequencing technology. However, due to their longer read length, the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding contigs using paired-end Illumina reads. Scaffolding dramatically increased contig lengths for the simple community and yielded minor improvements for the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.
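The essence of read simulation with a base-error model can be sketched as uniform sampling of read positions plus per-base substitution errors at a platform-specific rate. Real simulators (including the authors') also model indels, quality profiles, and position-dependent error rates; this substitution-only version and its parameter names are illustrative assumptions.

```python
import random

def simulate_reads(genome, n_reads, read_len, error_rate, seed=0):
    # Sample n_reads fragments uniformly from the genome and introduce
    # independent per-base substitution errors at the given rate.
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        read = list(genome[start:start + read_len])
        for i, base in enumerate(read):
            if rng.random() < error_rate:
                read[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append("".join(read))
    return reads
```

Because the source genomes and abundances are known, assemblies of such simulated reads can be scored exactly, which is what makes the accuracy and chimericity comparisons in the abstract possible.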

9.
To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research.

10.
While recently developed short-read sequencing technologies may dramatically reduce the sequencing cost and eventually achieve the $1000 goal for re-sequencing, their limitations prevent the de novo sequencing of eukaryotic genomes with the standard shotgun sequencing protocol. We present SHRAP (SHort Read Assembly Protocol), a sequencing protocol and assembly methodology that utilizes high-throughput short-read technologies. We describe a variation on hierarchical sequencing with two crucial differences: (1) we select a clone library from the genome randomly rather than as a tiling path and (2) we sample clones from the genome at high coverage and reads from the clones at low coverage. We assume that 200 bp read lengths with a 1% error rate and inexpensive random fragment cloning on whole mammalian genomes is feasible. Our assembly methodology is based on first ordering the clones and subsequently performing read assembly in three stages: (1) local assemblies of regions significantly smaller than a clone size, (2) clone-sized assemblies of the results of stage 1, and (3) chromosome-sized assemblies. By aggressively localizing the assembly problem during the first stage, our method succeeds in assembling short, unpaired reads sampled from repetitive genomes. We tested our assembler using simulated reads from D. melanogaster and human chromosomes 1, 11, and 21, and produced assemblies with large sets of contiguous sequence and a misassembly rate comparable to other draft assemblies. Tested on D. melanogaster and the entire human genome, our clone-ordering method produces accurate maps, thereby localizing fragment assembly and enabling the parallelization of the subsequent steps of our pipeline. Thus, we have demonstrated that truly inexpensive de novo sequencing of mammalian genomes will soon be possible with high-throughput, short-read technologies using our methodology.

11.
Assembling microbial and viral genomes from metagenomes is a powerful and appealing method to understand structure–function relationships in complex environments. To compare the recovery of genomes from microorganisms and their viruses from groundwater, we generated shotgun metagenomes with Illumina sequencing accompanied by long reads derived from the Oxford Nanopore Technologies (ONT) sequencing platform. Assembly and metagenome-assembled genome (MAG) metrics for both microbes and viruses were determined from an Illumina-only assembly, an ONT-only assembly, and a hybrid assembly approach. The hybrid approach recovered 2× more mid- to high-quality MAGs than the Illumina-only approach and 4× more than the ONT-only approach. A similar number of viral genomes were reconstructed using the hybrid and ONT methods, and both recovered nearly fourfold more viral genomes than the Illumina-only approach. While yielding fewer MAGs, the ONT-only approach generated MAGs with a high probability of containing rRNA genes, 3× higher than either of the other methods. Of the shared MAGs recovered by each method, the ONT-only approach generated the longest and least fragmented MAGs, while the hybrid approach yielded the most complete. This work provides quantitative data to inform a cost–benefit analysis of the decision to supplement shotgun metagenomic projects with long reads towards the goal of recovering genomes from environmentally abundant groups.

12.
The challenge presented by high-throughput sequencing necessitates the development of novel tools for accurate alignment of reads to reference sequences. Current approaches focus on using heuristics to map reads quickly to large genomes, rather than generating highly accurate alignments in coding regions. Such approaches are, thus, unsuited for applications such as amplicon-based analysis and the realignment phase of exome sequencing and RNA-seq, where accurate and biologically relevant alignment of coding regions is critical. To facilitate such analyses, we have developed a novel tool, RAMICS, that is tailored to mapping large numbers of sequence reads to short lengths (<10 000 bp) of coding DNA. RAMICS utilizes profile hidden Markov models to discover the open reading frame of each sequence and aligns to the reference sequence in a biologically relevant manner, distinguishing between genuine codon-sized indels and frameshift mutations. This approach facilitates the generation of highly accurate alignments, accounting for the error biases of the sequencing machine used to generate reads, particularly at homopolymer regions. Performance improvements are gained through the use of graphics processing units, which increase the speed of mapping through parallelization. RAMICS substantially outperforms all other mapping approaches tested in terms of alignment quality while maintaining highly competitive speed performance.

13.
Although whole human genome sequencing can be done with readily available technical and financial resources, the need for detailed analyses of genomes of certain populations still exists. Here we present, for the first time, sequencing and analysis of a Turkish human genome. We sequenced to 35x coverage using paired-end sequencing; over 95% of reads mapped to the reference genome, covering more than 99% of its bases. Assembly of the unmapped reads yielded 11,654 contigs, 2,168 of which revealed no homology to known sequences, resulting in ∼1 Mbp of unmapped sequence. Single nucleotide polymorphism (SNP) discovery resulted in 3,537,794 SNP calls, with 29,184 SNPs identified in coding regions, of which 106 were nonsense and 259 were categorized as having a high-impact effect. The homozygous/heterozygous ratio (1,415,123:2,122,671, or 1:1.5) and transition/transversion ratio (2,383,204:1,154,590, or 2.06:1) were within expected limits. Of the identified SNPs, 480,396 were potentially novel, with 2,925 in coding regions, including 48 nonsense and 95 high-impact SNPs. Functional analysis of the novel high-impact SNPs revealed various interaction networks, notably involving hereditary and neurological disorders or diseases. Assembly results indicated 713,640 indels (1:1.09 insertion/deletion ratio), ranging from -52 bp to 34 bp in length and causing about 180 codon insertions/deletions and 246 frame shifts. Using paired-end- and read-depth-based methods, we discovered 9,109 structural variants and compared our variant findings with those of other populations. Our results suggest that whole genome sequencing is a valuable tool for understanding variation in the human genome across different populations. Detailed analyses of genomes of diverse origins greatly benefit research in genetics and medicine and should be conducted on a larger scale.
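The transition/transversion (Ts/Tv) ratio used as a sanity check above is straightforward to compute from a SNP list: A<->G and C<->T changes are transitions, everything else a transversion. A minimal sketch:

```python
TRANSITIONS = {frozenset("AG"), frozenset("CT")}

def ts_tv_ratio(snps):
    # snps: iterable of (ref, alt) base pairs. Purine<->purine and
    # pyrimidine<->pyrimidine changes are transitions; all other
    # substitutions are transversions.
    snps = list(snps)
    ts = sum(1 for ref, alt in snps if frozenset((ref, alt)) in TRANSITIONS)
    tv = len(snps) - ts
    return ts / tv if tv else float("inf")
```

Genome-wide human call sets typically land near 2:1, so a value like the 2.06:1 reported above is a quick indicator that the calls are not dominated by random sequencing error (which would push the ratio toward 0.5).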

14.
The human body consists of innumerable multifaceted environments that predispose colonization by a number of distinct microbial communities, which play fundamental roles in human health and disease. In addition to community surveys and shotgun metagenomes that seek to explore the composition and diversity of these microbiomes, there are significant efforts to sequence reference microbial genomes from many body sites of healthy adults. To illustrate the utility of reference genomes when studying more complex metagenomes, we present a reference-based analysis of sequence reads generated from 55 shotgun metagenomes, selected from 5 major body sites, including 16 sub-sites. Interestingly, between 13% and 92% (62.3% average) of these shotgun reads were aligned to a then-complete list of 2780 reference genomes, including 1583 references for the human microbiome. However, no reference genome was universally found in all body sites. For any given metagenome, the body site-specific reference genomes, derived from the same body site as the sample, accounted for an average of 58.8% of the mapped reads. While different body sites did differ in abundant genera, proximal or symmetrical body sites were found to be most similar to one another. The extent of variation observed, both between individuals sampled within the same microenvironment, or at the same site within the same individual over time, calls into question comparative studies across individuals even if sampled at the same body site. This study illustrates the high utility of reference genomes and the need for further site-specific reference microbial genome sequencing, even within the already well-sampled human microbiome.

15.
Lander-Waterman’s coverage bound establishes the total number of reads required to cover a whole genome of size G bases. In fact, their bound is a direct consequence of the well-known solution to the coupon collector’s problem, which proves that for such a genome the total number of bases to be sequenced should be O(G ln G). Although the result leads to a tight bound, it is based on a tacit assumption that the set of reads is first collected through a sequencing process and then handled by a separate computation process, i.e., there are two different machines: one for sequencing and one for processing. In this paper, we present a significant improvement over Lander-Waterman’s result and prove that by combining the sequencing and computing processes, one can re-sequence the whole genome with as few as O(G) sequenced bases in total. Our approach also dramatically reduces the computational power required for the combined process. Simulations were performed on real genomes with different sequencing error rates. The results support our theory, showing the predicted log G improvement on the coverage bound and the corresponding reduction in the total number of bases required to be sequenced.
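The coupon-collector connection is easy to see empirically: drawing uniform random single-base "reads" until every position of a size-G genome is covered takes about G·H_G ≈ G ln G draws in expectation. A minimal simulation of that baseline (the paper's two-machine setting, not its combined-process improvement):

```python
import random

def bases_to_cover(G, seed=0):
    # Draw uniform random single-base "reads" until every position of a
    # genome of size G has been seen at least once (coupon collector).
    rng = random.Random(seed)
    covered = set()
    draws = 0
    while len(covered) < G:
        covered.add(rng.randrange(G))
        draws += 1
    return draws

# E[draws] = G * H_G ~ G ln G, the Lander-Waterman-style baseline that the
# paper's combined sequencing-and-computing scheme improves to O(G).
```

Averaging `bases_to_cover(G, seed)` over many seeds and dividing by G recovers a value close to ln G, the factor the paper's approach eliminates.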

16.
We aim to compare the performance of Bowtie2, bwa-mem, blastn, and blastx when aligning bacterial metagenomes against the Comprehensive Antibiotic Resistance Database (CARD). Simulated reads were used to evaluate each aligner under four performance criteria: correctly mapped, false positives, multi-reads, and partials. The optimal alignment approach was then applied to samples from two wastewater treatment plants to detect antibiotic resistance genes using next-generation sequencing. blastn mapped with the greatest accuracy among the four sequence alignment approaches considered, followed by Bowtie2. blastx generated the greatest number of false positives and multi-reads when aligned against the CARD. The performance of each alignment tool was also investigated using error-free reads. Although each aligner mapped a greater number of error-free reads than Illumina-error reads, in general the introduction of sequencing errors had little effect on alignment results against the CARD. Given these performance criteria, blastn was found to be the most favourable alignment tool and was therefore used to assess resistance genes in sewage samples. Beta-lactam and aminoglycoside resistance genes were found to be the most abundant classes in each sample.

Significance and Impact of the Study

Antibiotic resistance genes (ARGs) are pollutants known to persist in wastewater treatment plants, among other environments, so methods for detecting these genes have become increasingly relevant. Next-generation sequencing has brought about a host of sequence alignment tools that provide a comprehensive look into antimicrobial resistance in environmental samples. However, standardizing practices in ARG metagenomic studies is challenging, since results produced by different alignment tools can vary significantly. Our study provides sequence alignment results for synthetic and authentic bacterial metagenomes mapped against an ARG database using multiple alignment tools, and identifies best practices for detecting ARGs in environmental samples.
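Since simulated reads have a known gene of origin, each alignment can be binned into the four performance criteria named above. A hypothetical sketch of that scoring (the 0.9 aligned-fraction cutoff for "partial" and the gene names in the example are assumptions, not the study's exact definitions):

```python
def classify_alignment(true_gene, hits, min_fraction=0.9):
    # hits: list of (gene, aligned_fraction) reported for one read;
    # true_gene is the ARG the simulated read was drawn from. Returns one
    # of the four criteria (plus "unmapped" for reads with no hit).
    if not hits:
        return "unmapped"
    if len(hits) > 1:
        return "multi-read"
    gene, fraction = hits[0]
    if gene != true_gene:
        return "false positive"
    return "correctly mapped" if fraction >= min_fraction else "partial"
```

Tallying these categories per aligner over a simulated read set yields exactly the kind of comparison table used to pick blastn for the sewage samples.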

19.
Single Molecule, Real-Time (SMRT®) Sequencing (Pacific Biosciences, Menlo Park, CA, USA) provides the longest continuous DNA sequencing reads currently available. However, the relatively high error rate in the raw read data requires novel analysis methods to deconvolute sequences derived from complex samples. Here, we present a workflow of novel computer algorithms able to reconstruct viral variant genomes present in mixtures with an accuracy of >QV50. This approach relies exclusively on Continuous Long Reads (CLR), which are the raw reads generated during SMRT Sequencing. We successfully implement this workflow for simultaneous sequencing of mixtures containing up to forty different >9 kb HIV-1 full genomes. This was achieved using a single SMRT Cell for each mixture and desktop computing power. This novel approach opens the possibility of solving complex sequencing tasks that currently lack a solution.  相似文献
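The ">QV50" accuracy figure uses the standard Phred convention, QV = -10·log10(error probability), so QV50 corresponds to fewer than one error per 100,000 bases. The conversion in both directions:

```python
import math

def quality_value(error_rate):
    # Phred-style quality value: QV = -10 * log10(per-base error probability).
    return -10 * math.log10(error_rate)

def max_error_rate(qv):
    # Inverse: the largest per-base error probability consistent with a QV.
    return 10 ** (-qv / 10)
```

So a QV50 consensus from ~15%-error raw CLR data reflects how heavily the workflow leverages redundancy across reads to cancel independent errors.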
