首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research.  相似文献   

2.
MALINA is a web service for bioinformatic analysis of whole-genome metagenomic data obtained from human gut microbiota sequencing. As input data, it accepts metagenomic reads of various sequencing technologies, including long reads (such as Sanger and 454 sequencing) and next-generation (including SOLiD and Illumina). It is the first metagenomic web service that is capable of processing SOLiD color-space reads, to authors’ knowledge. The web service allows phylogenetic and functional profiling of metagenomic samples using coverage depth resulting from the alignment of the reads to the catalogue of reference sequences which are built into the pipeline and contain prevalent microbial genomes and genes of human gut microbiota. The obtained metagenomic composition vectors are processed by the statistical analysis and visualization module containing methods for clustering, dimension reduction and group comparison. Additionally, the MALINA database includes vectors of bacterial and functional composition for human gut microbiota samples from a large number of existing studies allowing their comparative analysis together with user samples, namely datasets from Russian Metagenome project, MetaHIT and Human Microbiome Project (downloaded fromhttp://hmpdacc.org). MALINA is made freely available on the web athttp://malina.metagenome.ru. The website is implemented in JavaScript (using Ext JS), Microsoft .NET Framework, MS SQL, Python, with all major browsers supported.  相似文献   

3.
Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.  相似文献   

4.
Metagenomics: Read Length Matters   总被引:7,自引:0,他引:7       下载免费PDF全文
Obtaining an unbiased view of the phylogenetic composition and functional diversity within a microbial community is one central objective of metagenomic analysis. New technologies, such as 454 pyrosequencing, have dramatically reduced sequencing costs, to a level where metagenomic analysis may become a viable alternative to more-focused assessments of the phylogenetic (e.g., 16S rRNA genes) and functional diversity of microbial communities. To determine whether the short (~100 to 200 bp) sequence reads obtained from pyrosequencing are appropriate for the phylogenetic and functional characterization of microbial communities, the results of BLAST and COG analyses were compared for long (~750 bp) and randomly derived short reads from each of two microbial and one virioplankton metagenome libraries. Overall, BLASTX searches against the GenBank nr database found far fewer homologs within the short-sequence libraries. This was especially pronounced for a Chesapeake Bay virioplankton metagenome library. Increasing the short-read sampling depth or the length of derived short reads (up to 400 bp) did not completely resolve the discrepancy in BLASTX homolog detection. Only in cases where the long-read sequence had a close homolog (low BLAST E-score) did the derived short-read sequence also find a significant homolog. Thus, more-distant homologs of microbial and viral genes are not detected by short-read sequences. Among COG hits, derived short reads sampled at a depth of two short reads per long read missed up to 72% of the COG hits found using long reads. Noting the current limitation in computational approaches for the analysis of short sequences, the use of short-read-length libraries does not appear to be an appropriate tool for the metagenomic characterization of microbial communities.  相似文献   

5.
A major challenge in the field of shotgun metagenomics is the accurate identification of organisms present within a microbial community, based on classification of short sequence reads. Though existing microbial community profiling methods have attempted to rapidly classify the millions of reads output from modern sequencers, the combination of incomplete databases, similarity among otherwise divergent genomes, errors and biases in sequencing technologies, and the large volumes of sequencing data required for metagenome sequencing has led to unacceptably high false discovery rates (FDR). Here, we present the application of a novel, gene-independent and signature-based metagenomic taxonomic profiling method with significantly and consistently smaller FDR than any other available method. Our algorithm circumvents false positives using a series of non-redundant signature databases and examines Genomic Origins Through Taxonomic CHAllenge (GOTTCHA). GOTTCHA was tested and validated on 20 synthetic and mock datasets ranging in community composition and complexity, was applied successfully to data generated from spiked environmental and clinical samples, and robustly demonstrates superior performance compared with other available tools.  相似文献   

6.
Metagenomics is a rapidly emerging field of research for studying microbial communities. To evaluate methods presently used to process metagenomic sequences, we constructed three simulated data sets of varying complexity by combining sequencing reads randomly selected from 113 isolate genomes. These data sets were designed to model real metagenomes in terms of complexity and phylogenetic composition. We assembled sampled reads using three commonly used genome assemblers (Phrap, Arachne and JAZZ), and predicted genes using two popular gene-finding pipelines (fgenesb and CRITICA/GLIMMER). The phylogenetic origins of the assembled contigs were predicted using one sequence similarity-based (blast hit distribution) and two sequence composition-based (PhyloPythia, oligonucleotide frequencies) binning methods. We explored the effects of the simulated community structure and method combinations on the fidelity of each processing step by comparison to the corresponding isolate genomes. The simulated data sets are available online to facilitate standardized benchmarking of tools for metagenomic analysis.  相似文献   

7.
Profiling microbial community function from metagenomic sequencing data remains a computationally challenging problem. Mapping millions of DNA reads from such samples to reference protein databases requires long run-times, and short read lengths can result in spurious hits to unrelated proteins (loss of specificity). We developed ShortBRED (Short, Better Representative Extract Dataset) to address these challenges, facilitating fast, accurate functional profiling of metagenomic samples. ShortBRED consists of two components: (i) a method that reduces reference proteins of interest to short, highly representative amino acid sequences (“markers”) and (ii) a search step that maps reads to these markers to quantify the relative abundance of their associated proteins. After evaluating ShortBRED on synthetic data, we applied it to profile antibiotic resistance protein families in the gut microbiomes of individuals from the United States, China, Malawi, and Venezuela. Our results support antibiotic resistance as a core function in the human gut microbiome, with tetracycline-resistant ribosomal protection proteins and Class A beta-lactamases being the most widely distributed resistance mechanisms worldwide. ShortBRED markers are applicable to other homology-based search tasks, which we demonstrate here by identifying phylogenetic signatures of antibiotic resistance across more than 3,000 microbial isolate genomes. ShortBRED can be applied to profile a wide variety of protein families of interest; the software, source code, and documentation are available for download at http://huttenhower.sph.harvard.edu/shortbred  相似文献   

8.
Metagenomic sequencing projects from environments dominated by a small number of species produce genome-wide population samples. We present a two-site composite likelihood estimator of the scaled recombination rate, ρ = 2Nec, that operates on metagenomic assemblies in which each sequenced fragment derives from a different individual. This new estimator properly accounts for sequencing error, as quantified by per-base quality scores, and missing data, as inferred from the placement of reads in a metagenomic assembly. We apply our estimator to data from a sludge metagenome project to demonstrate how this method will elucidate the rates of exchange of genetic material in natural microbial populations. Surprisingly, for a fixed amount of sequencing, this estimator has lower variance than similar methods that operate on more traditional population genetic samples of comparable size. In addition, we can infer variation in recombination rate across the genome because metagenomic projects sample genetic diversity genome-wide, not just at particular loci. The method itself makes no assumption specific to microbial populations, opening the door for application to any mixed population sample where the number of individuals sampled is much greater than the number of fragments sequenced.  相似文献   

9.
Pignatelli M  Moya A 《PloS one》2011,6(5):e19984
A frequent step in metagenomic data analysis comprises the assembly of the sequenced reads. Many assembly tools have been published in the last years targeting data coming from next-generation sequencing (NGS) technologies but these assemblers have not been designed for or tested in multi-genome scenarios that characterize metagenomic studies. Here we provide a critical assessment of current de novo short reads assembly tools in multi-genome scenarios using complex simulated metagenomic data. With this approach we tested the fidelity of different assemblers in metagenomic studies demonstrating that even under the simplest compositions the number of chimeric contigs involving different species is noticeable. We further showed that the assembly process reduces the accuracy of the functional classification of the metagenomic data and that these errors can be overcome raising the coverage of the studied metagenome. The results presented here highlight the particular difficulties that de novo genome assemblers face in multi-genome scenarios demonstrating that these difficulties, that often compromise the functional classification of the analyzed data, can be overcome with a high sequencing effort.  相似文献   

10.
Niu B  Zhu Z  Fu L  Wu S  Li W 《Bioinformatics (Oxford, England)》2011,27(12):1704-1705
SUMMARY: Fragment recruitment, a process of aligning sequencing reads to reference genomes, is a crucial step in metagenomic data analysis. The available sequence alignment programs are either slow or insufficient for recruiting metagenomic reads. We implemented an efficient algorithm, FR-HIT, for fragment recruitment. We applied FR-HIT and several other tools including BLASTN, MegaBLAST, BLAT, LAST, SSAHA2, SOAP2, BWA and BWA-SW to recruit four metagenomic datasets from different type of sequencers. On average, FR-HIT and BLASTN recruited significantly more reads than other programs, while FR-HIT is about two orders of magnitude faster than BLASTN. FR-HIT is slower than the fastest SOAP2, BWA and BWA-SW, but it recruited 1-5 times more reads. AVAILABILITY: http://weizhongli-lab.org/frhit.  相似文献   

11.
This is the first report on the phylogenetic analysis of the community diversity of a single spontaneous cocoa bean box fermentation sample through a metagenomic approach involving 454 pyrosequencing. Several sequence-based and composition-based taxonomic profiling tools were used and evaluated to avoid software-dependent results and their outcome was validated by comparison with previously obtained culture-dependent and culture-independent data. Overall, this approach revealed a wider bacterial (mainly γ-Proteobacteria) and fungal diversity than previously found. Further, the use of a combination of different classification methods, in a software-independent way, helped to understand the actual composition of the microbial ecosystem under study. In addition, bacteriophage-related sequences were found. The bacterial diversity depended partially on the methods used, as composition-based methods predicted a wider diversity than sequence-based methods, and as classification methods based solely on phylogenetic marker genes predicted a more restricted diversity compared with methods that took all reads into account. The metagenomic sequencing analysis identified Hanseniaspora uvarum, Hanseniaspora opuntiae, Saccharomyces cerevisiae, Lactobacillus fermentum, and Acetobacter pasteurianus as the prevailing species. Also, the presence of occasional members of the cocoa bean fermentation process was revealed (such as Erwinia tasmaniensis, Lactobacillus brevis, Lactobacillus casei, Lactobacillus rhamnosus, Lactococcus lactis, Leuconostoc mesenteroides, and Oenococcus oeni). Furthermore, the sequence reads associated with viral communities were of a restricted diversity, dominated by Myoviridae and Siphoviridae, and reflecting Lactobacillus as the dominant host. To conclude, an accurate overview of all members of a cocoa bean fermentation process sample was revealed, indicating the superiority of metagenomic sequencing over previously used techniques.  相似文献   

12.
The assembly of multiple genomes from mixed sequence reads is a bottleneck in metagenomic analysis. A single-genome assembly program (assembler) is not capable of resolving metagenome sequences, so assemblers designed specifically for metagenomics have been developed. MetaVelvet is an extension of the single-genome assembler Velvet. It has been proved to generate assemblies with higher N50 scores and higher quality than single-genome assemblers such as Velvet and SOAPdenovo when applied to metagenomic sequence reads and is frequently used in this research community. One important open problem for MetaVelvet is its low accuracy and sensitivity in detecting chimeric nodes in the assembly (de Bruijn) graph, which prevents the generation of longer contigs and scaffolds. We have tackled this problem of classifying chimeric nodes using supervised machine learning to significantly improve the performance of MetaVelvet and developed a new tool, called MetaVelvet-SL. A Support Vector Machine is used for learning the classification model based on 94 features extracted from candidate nodes. In extensive experiments, MetaVelvet-SL outperformed the original MetaVelvet and other state-of-the-art metagenomic assemblers, IDBA-UD, Ray Meta and Omega, to reconstruct accurate longer assemblies with higher N50 scores for both simulated data sets and real data sets of human gut microbial sequences.  相似文献   

13.
Here we demonstrate a cost effective and scalable microbial ecology sequencing platform using the Ion Torrent Personal Genome Machine (PGM). We assessed both PCR amplified 16S rRNA and shotgun metagenomic approaches and generated 100,000+ to 1,000,000+ reads using 'post-light' based sequencing technology within different sized semi-conductor chips. Further development of Golay barcoded Ion Tags allowed multiplex analyses of microbial communities with substantially reduced costs compared with platforms such as 454/GS-FLX. Using these protocols we assessed the bacterial and archaeal dynamics within covered anaerobic digesters used to treat piggery wastes. Analysis of these sequence data showed that these novel methanogenic waste treatment systems are dominated by bacterial taxa, in particular Clostridium, Synergistia and Bacteroides that were maintained as a stable community over extended time periods. Archaeal community dynamics were more stochastic with the key methanogenic taxa more difficult to resolve, principally due to the poor congruence seen between community structures generated either by nested PCR or metagenomic approaches for archaeal analyses. Our results show that for microbial community structure and function analyses, the PGM platform provides a low cost, scalable and high throughput solution for both Tag sequencing and metagenomic analyses.  相似文献   

14.

Background

Metagenomics is a relatively new but fast growing field within environmental biology and medical sciences. It enables researchers to understand the diversity of microbes, their functions, cooperation, and evolution in a particular ecosystem. Traditional methods in genomics and microbiology are not efficient in capturing the structure of the microbial community in an environment. Nowadays, high-throughput next-generation sequencing technologies are powerfully driving the metagenomic studies. However, there is an urgent need to develop efficient statistical methods and computational algorithms to rapidly analyze the massive metagenomic short sequencing data and to accurately detect the features/functions present in the microbial community. Although several issues about functions of metagenomes at pathways or subsystems level have been investigated, there is a lack of studies focusing on functional analysis at a low level of a hierarchical functional tree, such as SEED subsystem tree.

Results

A two-step statistical procedure (metaFunction) is proposed to detect all possible functional roles at the low level from a metagenomic sample/community. In the first step a statistical mixture model is proposed at the base of gene codons to estimate the abundances for the candidate functional roles, with sequencing error being considered. As a gene could be involved in multiple biological processes the functional assignment is therefore adjusted by utilizing an error distribution in the second step. The performance of the proposed procedure is evaluated through comprehensive simulation studies. Compared with other existing methods in metagenomic functional analysis the new approach is more accurate in assigning reads to functional roles, and therefore at more general levels. The method is also employed to analyze two real data sets.

Conclusions

metaFunction is a powerful tool in accurate profiling functions in a metagenomic sample.  相似文献   

15.
Microbial communities carry out the majority of the biochemical activity on the planet, and they play integral roles in processes including metabolism and immune homeostasis in the human microbiome. Shotgun sequencing of such communities' metagenomes provides information complementary to organismal abundances from taxonomic markers, but the resulting data typically comprise short reads from hundreds of different organisms and are at best challenging to assemble comparably to single-organism genomes. Here, we describe an alternative approach to infer the functional and metabolic potential of a microbial community metagenome. We determined the gene families and pathways present or absent within a community, as well as their relative abundances, directly from short sequence reads. We validated this methodology using a collection of synthetic metagenomes, recovering the presence and abundance both of large pathways and of small functional modules with high accuracy. We subsequently applied this method, HUMAnN, to the microbial communities of 649 metagenomes drawn from seven primary body sites on 102 individuals as part of the Human Microbiome Project (HMP). This provided a means to compare functional diversity and organismal ecology in the human microbiome, and we determined a core of 24 ubiquitously present modules. Core pathways were often implemented by different enzyme families within different body sites, and 168 functional modules and 196 metabolic pathways varied in metagenomic abundance specifically to one or more niches within the microbiome. These included glycosaminoglycan degradation in the gut, as well as phosphate and amino acid transport linked to host phenotype (vaginal pH) in the posterior fornix. An implementation of our methodology is available at http://huttenhower.sph.harvard.edu/humann. This provides a means to accurately and efficiently characterize microbial metabolic pathways and functional modules directly from high-throughput sequencing reads, enabling the determination of community roles in the HMP cohort and in future metagenomic studies.  相似文献   

16.
Next-generation sequencing(NGS) technology has revolutionized and significantly impacted metagenomic research.However,the NGS data usually contains sequencing artifacts such as low-quality reads and contaminating reads,which will significantly compromise downstream analysis.Many quality control(QC) tools have been proposed,however,few of them have been verified to be suitable or efficient for metagenomic data,which are composed of multiple genomes and are more complex than other kinds of NGS data.Here we present a metagenomic data QC method named Meta-QC-Chain.Meta-QC-Chain combines multiple QC functions:technical tests describe input data status and identify potential errors,quality trimming filters poor sequencing-quality bases and reads,and contamination screening identifies higher eukaryotic species,which are considered as contamination for metagenomic data.Most computing processes are optimized based on parallel programming.Testing on an 8-GB real dataset showed that Meta-QC-Chain trimmed low sequencing-quality reads and contaminating reads,and the whole quality control procedure was completed within 20 min.Therefore,Meta-QC-Chain provides a comprehensive,useful and high-performance QC tool for metagenomic data.Meta-QC-Chain is publicly available for free at:http://computationalbioenergy.org/meta-qc-chain.html.  相似文献   

17.
Microbial community profiling identifies and quantifies organisms in metagenomic sequencing data using either reference based or unsupervised approaches. However, current reference based profiling methods only report the presence and abundance of single reference genomes that are available in databases. Since only a small fraction of environmental genomes is represented in genomic databases, these approaches entail the risk of false identifications and often suggest a higher precision than justified by the data. Therefore, we developed MicrobeGPS, a novel metagenomic profiling approach that overcomes these limitations. MicrobeGPS is the first method that identifies microbiota in the sample and estimates their genomic distances to known reference genomes. With this strategy, MicrobeGPS identifies organisms down to the strain level and highlights possibly inaccurate identifications when the correct reference genome is missing. We demonstrate on three metagenomic datasets with different origin that our approach successfully avoids misleading interpretation of results and additionally provides more accurate results than current profiling methods. Our results indicate that MicrobeGPS can enable reference based taxonomic profiling of complex and less characterized microbial communities. MicrobeGPS is open source and available from https://sourceforge.net/projects/microbegps/ as source code and binary distribution for Windows and Linux operating systems.  相似文献   

18.

Background

Microbial life dominates the earth, but many species are difficult or even impossible to study under laboratory conditions. Sequencing DNA directly from the environment, a technique commonly referred to as metagenomics, is an important tool for cataloging microbial life. This culture-independent approach involves collecting samples that include microbes in them, extracting DNA from the samples, and sequencing the DNA. A sample may contain many different microorganisms, macroorganisms, and even free-floating environmental DNA. A fundamental challenge in metagenomics has been estimating the abundance of organisms in a sample based on the frequency with which the organism''s DNA was observed in reads generated via DNA sequencing.

Methodology/Principal Findings

We created mixtures of ten microbial species for which genome sequences are known. Each mixture contained an equal number of cells of each species. We then extracted DNA from the mixtures, sequenced the DNA, and measured the frequency with which genomic regions from each organism was observed in the sequenced DNA. We found that the observed frequency of reads mapping to each organism did not reflect the equal numbers of cells that were known to be included in each mixture. The relative organism abundances varied significantly depending on the DNA extraction and sequencing protocol utilized.

Conclusions/Significance

We describe a new data resource for measuring the accuracy of metagenomic binning methods, created by in vitro-simulation of a metagenomic community. Our in vitro simulation can be used to complement previous in silico benchmark studies. In constructing a synthetic community and sequencing its metagenome, we encountered several sources of observation bias that likely affect most metagenomic experiments to date and present challenges for comparative metagenomic studies. DNA preparation methods have a particularly profound effect in our study, implying that samples prepared with different protocols are not suitable for comparative metagenomics.  相似文献   

19.
Compared with traditional algorithms for long metagenomic sequence classification, characterizing microorganisms’ taxonomic and functional abundance based on tens of millions of very short reads are much more challenging. We describe an efficient composition and phylogeny-based algorithm [Metagenome Composition Vector (MetaCV)] to classify very short metagenomic reads (75–100 bp) into specific taxonomic and functional groups. We applied MetaCV to the Meta-HIT data (371-Gb 75-bp reads of 109 human gut metagenomes), and this single-read-based, instead of assembly-based, classification has a high resolution to characterize the composition and structure of human gut microbiota, especially for low abundance species. Most strikingly, it only took MetaCV 10 days to do all the computation work on a server with five 24-core nodes. To our knowledge, MetaCV, benefited from the strategy of composition comparison, is the first algorithm that can classify millions of very short reads within affordable time.  相似文献   

20.
The advent of next-generation sequencing technologies has greatly promoted the field of metagenomics which studies genetic material recovered directly from an environment. Characterization of genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. Multiple genomes contained in a metagenomic sample can be identified and quantitated through homology searches of sequence reads with known sequences catalogued in reference databases. Traditionally, reads with multiple genomic hits are assigned to non-specific or high ranks of the taxonomy tree, thereby impacting on accurate estimates of relative abundance of multiple genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree as many existing methods do, we propose a statistical framework to model the identified candidate genomes to which sequence reads have hits. After obtaining the estimated proportion of reads generated by each genome, sequence reads are assigned to the candidate genomes and the taxonomy tree based on the estimated probability by taking into account both sequence alignment scores and estimated genome abundance. The proposed method is comprehensively tested on both simulated datasets and two real datasets. It assigns reads to the low taxonomic ranks very accurately. Our statistical approach of taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号