Similar literature (20 articles found)
1.
Metagenomic Characterization of Chesapeake Bay Virioplankton   (total citations: 7; self-citations: 1; citations by others: 6)
Viruses are ubiquitous and abundant throughout the biosphere. In marine systems, virus-mediated processes can have significant impacts on microbial diversity and on global biogeochemical cycling. However, viral genetic diversity remains poorly characterized. To address this shortcoming, a metagenomic library was constructed from Chesapeake Bay virioplankton. The resulting sequences constitute the largest collection of long-read double-stranded DNA (dsDNA) viral metagenome data reported to date. BLAST homology comparisons showed that Chesapeake Bay virioplankton contained a high proportion of unknown (homologous only to environmental sequences) and novel (no significant homolog) sequences. This analysis suggests that dsDNA viruses are likely one of the largest reservoirs of unknown genetic diversity in the biosphere. The taxonomic origin of BLAST homologs to viral library sequences agreed well with reported abundances of co-occurring bacterial subphyla within the estuary and indicated that cyanophages were abundant. However, the low proportion of Siphophage homologs contradicts a previous assertion that this family comprises most bacteriophage diversity. Identification and analyses of cyanobacterial homologs of the psbA gene illustrated the value of metagenomic studies of virioplankton. The phylogeny of inferred PsbA protein sequences suggested that Chesapeake Bay cyanophage strains are endemic in that environment. The ratio of psbA homologous sequences to total cyanophage sequences in the metagenome indicated that the psbA gene may be nearly universal in Chesapeake Bay cyanophage genomes. Furthermore, the low frequency of psbD homologs in the library supports the prediction that Chesapeake Bay cyanophage populations are dominated by Podoviridae.

2.
MOTIVATION: A typical metagenome dataset generated using a 454 pyrosequencing platform consists of short reads sampled from the collective genome of a microbial community. The amount of sequence in such datasets is usually insufficient for assembly, and traditional gene prediction cannot be applied to unassembled short reads. As a result, analysis of such datasets usually involves comparisons in terms of relative abundances of various protein families. The latter requires assignment of individual reads to protein families, which is hindered by the fact that short reads contain only a fragment, usually small, of a protein. RESULTS: We have considered the assignment of pyrosequencing reads to protein families directly using RPS-BLAST against COG and Pfam databases and indirectly via proxygenes that are identified using BLASTx searches against protein sequence databases. Using simulated metagenome datasets as benchmarks, we show that the proxygene method is more accurate than the direct assignment. We introduce a clustering method which significantly reduces the size of a metagenome dataset while maintaining a faithful representation of its functional and taxonomic content.

3.
Metagenomics is providing striking insights into the ecology of microbial communities. The recently developed massively parallel 454 pyrosequencing technique gives the opportunity to rapidly obtain metagenomic sequences at a low cost and without cloning bias. However, the phylogenetic analysis of the short reads produced represents a significant computational challenge. The phylogenetic algorithm CARMA for predicting the source organisms of environmental 454 reads is described. The algorithm searches for conserved Pfam domain and protein families in the unassembled reads of a sample. These gene fragments (environmental gene tags, EGTs) are classified into a higher-order taxonomy based on the reconstruction of a phylogenetic tree of each matching Pfam family. The method exhibits high accuracy for a wide range of taxonomic groups, and EGTs as short as 27 amino acids can be phylogenetically classified up to the rank of genus. The algorithm was applied in a comparative study of three aquatic microbial samples obtained by 454 pyrosequencing. Profound differences in the taxonomic composition of these samples could be clearly revealed.

4.
Next-generation sequencing technologies have allowed researchers to determine the collective genomes of microbial communities co-existing within diverse ecological environments. Varying species abundance, length and complexities within different communities, coupled with discovery of new species, make the problem of taxonomic assignment to short DNA sequence reads extremely challenging. We have developed a new sequence composition-based taxonomic classifier using extreme learning machines, referred to as TAC-ELM, for metagenomic analysis. TAC-ELM uses the framework of extreme learning machines to quickly and accurately learn the weights for a neural network model. The input features consist of GC content and oligonucleotides. TAC-ELM is evaluated on two metagenomic benchmarks with sequence read lengths reflecting the traditional and current sequencing technologies. Our empirical results indicate the strength of the developed approach, which outperforms state-of-the-art taxonomic classifiers in terms of accuracy and implementation complexity. We also perform experiments that evaluate the pervasive case within metagenome analysis, where a species may not have been previously sequenced or discovered and will not exist in the reference genome databases. TAC-ELM was also combined with BLAST to show improved classification results. Code and Supplementary Results: http://www.cs.gmu.edu/~mlbio/TAC-ELM (BSD License).
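The abstract names GC content and oligonucleotides as the classifier's input features. As a rough illustration only, the Python sketch below turns one read into such a composition vector. The function name `composition_features`, the default k-mer size, and the handling of ambiguous bases are assumptions made for this example; none of it is taken from the TAC-ELM code.

```python
from collections import Counter
from itertools import product


def composition_features(read: str, k: int = 2) -> list[float]:
    """Return [GC fraction] followed by the frequencies of all 4**k k-mers.

    Hypothetical helper for illustration; not part of TAC-ELM itself.
    """
    read = read.upper()
    valid = [base for base in read if base in "ACGT"]
    gc = sum(base in "GC" for base in valid) / len(valid) if valid else 0.0

    # Count every overlapping k-mer made up of unambiguous bases only.
    counts = Counter(
        read[i:i + k]
        for i in range(len(read) - k + 1)
        if set(read[i:i + k]) <= set("ACGT")
    )
    total = sum(counts.values())

    # Emit frequencies in a fixed (lexicographic) k-mer order so that
    # vectors from different reads are comparable.
    kmers = ("".join(p) for p in product("ACGT", repeat=k))
    freqs = [counts[kmer] / total if total else 0.0 for kmer in kmers]
    return [gc] + freqs
```

A vector like this could be fed to any classifier. The extreme learning machine described in the abstract is one choice, and this sketch does not implement it.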

5.
Rumen microorganisms play an important role in ruminant digestion and absorption of nutrients and have great potential for applications in rumen adjustment, food fermentation and biomass utilization. To investigate the composition of microorganisms in the rumen of the camel (Camelus dromedarius), this study examines the microbial diversity using a culture-independent approach. It includes a comparison of the rumen samples investigated in the present study with other currently available metagenomes to reveal potential differences in rumen microbial systems. Pyrosequencing-based metagenomics was applied to analyze phylogenetic and metabolic profiles with MG-RAST, a web-based tool. Pyrosequencing of the camel rumen sample yielded 8,979,755 nucleotides assembled into 41,905 sequence reads with an average read length of 214 nucleotides. Taxonomic analysis of metagenomic reads indicated the phyla Bacteroidetes (55.5 %), Firmicutes (22.7 %) and Proteobacteria (9.2 %) as the predominant camel rumen taxa. At a finer phylogenetic resolution, Bacteroides species dominated the camel rumen metagenome. Functional analysis revealed that clustering-based subsystems and carbohydrate metabolism were the most abundant SEED subsystems, representing 17 and 13 % of the camel metagenome, respectively. A high taxonomic and functional similarity was found between the camel rumen and the cow metagenome, which is not surprising given that both are mammalian herbivores with similar digestive tract structures and functions. Combining the pyrosequencing approach with the subsystems-based annotations available in the SEED database allowed us to understand the metabolic potential of these microbiomes. Altogether, these data suggest that agricultural and animal husbandry practices can impose significant selective pressures on the rumen microbiota regardless of rumen type.
The present study provides a baseline for understanding the complexity of camel rumen microbial ecology while also highlighting striking similarities and differences when compared to other animal gastrointestinal environments.

6.
Microbial communities carry out the majority of the biochemical activity on the planet, and they play integral roles in processes including metabolism and immune homeostasis in the human microbiome. Shotgun sequencing of such communities' metagenomes provides information complementary to organismal abundances from taxonomic markers, but the resulting data typically comprise short reads from hundreds of different organisms and are at best challenging to assemble comparably to single-organism genomes. Here, we describe an alternative approach to infer the functional and metabolic potential of a microbial community metagenome. We determined the gene families and pathways present or absent within a community, as well as their relative abundances, directly from short sequence reads. We validated this methodology using a collection of synthetic metagenomes, recovering the presence and abundance both of large pathways and of small functional modules with high accuracy. We subsequently applied this method, HUMAnN, to the microbial communities of 649 metagenomes drawn from seven primary body sites on 102 individuals as part of the Human Microbiome Project (HMP). This provided a means to compare functional diversity and organismal ecology in the human microbiome, and we determined a core of 24 ubiquitously present modules. Core pathways were often implemented by different enzyme families within different body sites, and 168 functional modules and 196 metabolic pathways varied in metagenomic abundance specifically to one or more niches within the microbiome. These included glycosaminoglycan degradation in the gut, as well as phosphate and amino acid transport linked to host phenotype (vaginal pH) in the posterior fornix. An implementation of our methodology is available at http://huttenhower.sph.harvard.edu/humann. 
This provides a means to accurately and efficiently characterize microbial metabolic pathways and functional modules directly from high-throughput sequencing reads, enabling the determination of community roles in the HMP cohort and in future metagenomic studies.

7.
One consistent finding among studies using shotgun metagenomics to analyze whole viral communities is that most viral sequences show no significant homology to known sequences. Thus, bioinformatic analyses based on sequence collections such as GenBank nr, which are largely comprised of sequences from known organisms, tend to ignore a majority of sequences within most shotgun viral metagenome libraries. Here we describe a bioinformatic pipeline, the Viral Informatics Resource for Metagenome Exploration (VIROME), that emphasizes the classification of viral metagenome sequences (predicted open-reading frames) based on homology search results against both known and environmental sequences. Functional and taxonomic information is derived from five annotated sequence databases which are linked to the UniRef 100 database. Environmental classifications are obtained from hits against a custom database, MetaGenomes On-Line, which contains 49 million predicted environmental peptides. Each predicted viral metagenomic ORF run through the VIROME pipeline is placed into one of seven ORF classes; thus, every sequence receives a meaningful annotation. Additionally, the pipeline includes quality control measures to remove contaminating and poor quality sequence and assesses the potential amount of cellular DNA contamination in a viral metagenome library by screening for rRNA genes. Access to the VIROME pipeline and analysis results is provided through a web-application interface that is dynamically linked to a relational back-end database. The VIROME web-application interface is designed to allow users flexibility in retrieving sequences (reads, ORFs, predicted peptides) and search results for focused secondary analyses.

8.
9.
In functional metagenomics, BLAST homology search is a common method to classify metagenomic reads into protein/domain sequence families such as Clusters of Orthologous Groups of proteins (COGs) in order to quantify the abundance of each COG in the community. The resulting functional profile of the community is then used in downstream analysis to correlate the change in abundance to environmental perturbation, clinical variation, and so on. However, the short read length coupled with next-generation sequencing technologies poses a barrier in this approach, essentially because similarity significance cannot be discerned by searching with short reads. Consequently, artificial functional families are produced, and those with a large number of assigned reads dramatically decrease the accuracy of the functional profile. There is no method available to address this problem. We aim to fill this gap in this paper. We revealed that BLAST similarity scores of homologues for short reads from the coding sequences of COG protein members are distributed differently from the scores of those derived elsewhere. We showed that, by choosing an appropriate score cut-off, we are able to filter out most artificial families and simultaneously to preserve sufficient information in order to build the functional profile. We also showed that, by the combined application of BLAST and RPS-BLAST, some artificial families with large read counts can be further identified after the score cut-off filtration. Evaluated on three experimental metagenomic datasets with different coverages, we found that the proposed method is robust against read coverage and consistently outperforms the other E-value cut-off methods currently used in the literature.
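The score cut-off filtering described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' procedure: the `Hit` record, the use of a bit score, the rule of keeping one best hit per read, and any particular cut-off value are all hypothetical choices made for the example.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Hit(NamedTuple):
    """One homology-search hit (hypothetical record layout)."""
    read_id: str
    family: str       # e.g. a COG identifier
    bit_score: float


def functional_profile(hits: Iterable[Hit], min_score: float) -> Counter:
    """Count reads per family, keeping each read's best hit above the cut-off."""
    best: dict[str, Hit] = {}
    for hit in hits:
        if hit.bit_score < min_score:
            continue  # below the cut-off: treated as a likely spurious match
        current = best.get(hit.read_id)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.read_id] = hit
    return Counter(hit.family for hit in best.values())
```

Raising `min_score` removes families supported only by weak matches, at the price of discarding some genuine reads. The abstract reports that a suitable cut-off can balance the two, but does not give a value here.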

10.
Due to the complexity of the protocols and a limited knowledge of the nature of microbial communities, simulating metagenomic sequences plays an important role in testing the performance of existing tools and data analysis methods with metagenomic data. We developed metagenomic read simulators with platform-specific (Sanger, pyrosequencing, Illumina) base-error models, and simulated metagenomes of differing community complexities. We first evaluated the effect of rigorous quality control on Illumina data. Although quality filtering removed a large proportion of the data, it greatly improved the accuracy and contig lengths of resulting assemblies. We then compared the quality-trimmed Illumina assemblies to those from Sanger and pyrosequencing. For the simple community (10 genomes) all sequencing technologies assembled a similar amount and accurately represented the expected functional composition. For the more complex community (100 genomes) Illumina produced the best assemblies and more correctly resembled the expected functional composition. For the most complex community (400 genomes) there was very little assembly of reads from any sequencing technology. However, due to the longer read length the Sanger reads still represented the overall functional composition reasonably well. We further examined the effect of scaffolding of contigs using paired-end Illumina reads. It dramatically increased contig lengths of the simple community and yielded minor improvements to the more complex communities. Although the increase in contig length was accompanied by increased chimericity, it resulted in more complete genes and a better characterization of the functional repertoire. The metagenomic simulators developed for this research are freely available.

11.
The animal gastrointestinal tract contains a complex community of microbes, whose composition ultimately reflects the co-evolution of microorganisms with their animal host. An analysis of 78,619 pyrosequencing reads generated from pygmy loris fecal DNA extracts was performed to help better understand the microbial diversity and functional capacity of the pygmy loris gut microbiome. The taxonomic analysis of the metagenomic reads indicated that pygmy loris fecal microbiomes were dominated by Bacteroidetes and Proteobacteria phyla. The hierarchical clustering of several gastrointestinal metagenomes demonstrated the similarities of the microbial community structures of pygmy loris and mouse gut systems despite their differences in functional capacity. The comparative analysis of function classification revealed that the metagenome of the pygmy loris was characterized by an overrepresentation of those sequences involved in aromatic compound metabolism compared with humans and other animals. The key enzymes related to the benzoate degradation pathway were identified based on the Kyoto Encyclopedia of Genes and Genomes pathway assignment. These results would contribute to the limited body of primate metagenome studies and provide a framework for comparative metagenomic analysis between human and non-human primates, as well as a comparative understanding of the evolution of humans and their microbiome. However, future studies on the metagenome sequencing of pygmy loris and other prosimians regarding the effects of age, genetics, and environment on the composition and activity of the metagenomes are required.

12.
Virioplankton have a significant role in marine ecosystems, yet we know little of the predominant biological characteristics of aquatic viruses that influence the flow of nutrients and energy through microbial communities. Family A DNA polymerases, critical to DNA replication and repair in prokaryotes, are found in many tailed bacteriophages. The essential role of DNA polymerase in viral replication makes it a useful target for connecting viral diversity with an important biological feature of viruses. Capturing the full diversity of this polymorphic gene by targeted approaches has been difficult; thus, full-length DNA polymerase genes were assembled out of virioplankton shotgun metagenomic sequence libraries (viromes). Within the viromes novel DNA polymerases were common and found in both double-stranded (ds) DNA and single-stranded (ss) DNA libraries. Finding DNA polymerase genes in ssDNA viral libraries was unexpected, as no such genes have been previously reported from ssDNA phage. Surprisingly, the most common virioplankton DNA polymerases were related to a siphovirus infecting an α-proteobacterial symbiont of a marine sponge and not the podoviral T7-like polymerases seen in many other studies. Amino acids predictive of catalytic efficiency and fidelity linked perfectly to the environmental clades, indicating that most DNA polymerase-carrying virioplankton utilize a lower efficiency, higher fidelity enzyme. Comparisons with previously reported, PCR-amplified DNA polymerase sequences indicated that the most common virioplankton metagenomic DNA polymerases formed a new group that included siphoviruses. These data indicate that slower-replicating, lytic or lysogenic phage populations rather than fast-replicating, highly lytic phages may predominate within the virioplankton.

13.
MALINA is a web service for bioinformatic analysis of whole-genome metagenomic data obtained from human gut microbiota sequencing. As input data, it accepts metagenomic reads of various sequencing technologies, including long reads (such as Sanger and 454 sequencing) and next-generation (including SOLiD and Illumina). To the authors’ knowledge, it is the first metagenomic web service capable of processing SOLiD color-space reads. The web service allows phylogenetic and functional profiling of metagenomic samples using coverage depth resulting from the alignment of the reads to the catalogue of reference sequences which are built into the pipeline and contain prevalent microbial genomes and genes of human gut microbiota. The obtained metagenomic composition vectors are processed by the statistical analysis and visualization module containing methods for clustering, dimension reduction and group comparison. Additionally, the MALINA database includes vectors of bacterial and functional composition for human gut microbiota samples from a large number of existing studies, allowing their comparative analysis together with user samples, namely datasets from the Russian Metagenome project, MetaHIT and the Human Microbiome Project (downloaded from http://hmpdacc.org). MALINA is made freely available on the web at http://malina.metagenome.ru. The website is implemented in JavaScript (using Ext JS), Microsoft .NET Framework, MS SQL, Python, with all major browsers supported.

14.
Wastewater treatment plants (WWTPs) are a reservoir for bacteria harbouring antibiotic resistance plasmids. To get a comprehensive overview of the plasmid metagenome of WWTP bacteria showing reduced susceptibility to certain antimicrobial drugs, an ultrafast sequencing approach applying the 454-technology was carried out. One run on the GS 20 System yielded 346,427 reads with an average read length of 104 bases, resulting in a total of 36,071,493 bases of sequence data. The obtained plasmid metagenome was analysed and functionally annotated by means of the Sequence Analysis and Management System (SAMS) software package. Known plasmid genes could be identified within the WWTP plasmid metagenome data set by BLAST searches using the NCBI Plasmid Database. Most abundant hits represent genes involved in plasmid replication, stability, mobility and transposition. Mapping of plasmid metagenome reads to completely sequenced plasmids revealed that many sequences could be assigned to the cryptic pAsa plasmids previously identified in Aeromonas salmonicida subsp. salmonicida and to the accessory modules of the conjugative IncU resistance plasmid pFBAOT6 of Aeromonas punctata. Matches of sequence reads to antibiotic resistance genes indicate that plasmids from WWTP bacteria encode resistances to all major classes of antimicrobial drugs. Plasmid metagenome sequence reads could be assembled into 605 contigs with a minimum length of 500 bases. Contigs predominantly encode plasmid survival functions and transposition enzymes.

15.
Du Nan, Chen Jiao, Sun Yanni. BMC Genomics, 2019, 20(2): 49-62
Background

Single-molecule, real-time (SMRT) sequencing developed by Pacific Biosciences produces longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize intra-species variations. It also holds the promise to decipher the community structure in complex microbial communities because long reads help metagenomic assembly. One key step in genome assembly using long reads is to quickly identify reads forming overlaps. Because PacBio data has a higher sequencing error rate and lower coverage than popular short read sequencing technologies (such as Illumina), efficient detection of true overlaps requires specially designed algorithms. In particular, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates in both reads. Addressing this need will enable better assembly for metagenomic data produced by third-generation sequencing technologies.

Results

In this work, we designed and implemented an overlap detection program named GroupK, for third-generation sequencing reads based on grouped k-mer hits. While using k-mer hits for detecting reads’ overlaps has been adopted by several existing programs, our method uses a group of short k-mer hits satisfying statistically derived distance constraints to increase the sensitivity of small overlap detection. Grouped k-mer hit was originally designed for homology search. We are the first to apply group hit for long read overlap detection. The experimental results of applying our pipeline to both simulated and real third-generation sequencing data showed that GroupK enables more sensitive overlap detection, especially for datasets of low sequencing coverage.

Conclusions

GroupK is best used for detecting small overlaps for third-generation sequencing data. It provides a useful supplementary tool to existing ones for more sensitive and accurate overlap detection. The source code is freely available at https://github.com/Strideradu/GroupK.
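The idea of grouping short k-mer hits under distance constraints, as outlined in the Results above, can be illustrated with a deliberately simplified Python sketch. It is not the GroupK implementation. The function names, the greedy chaining rule, and the fixed `max_gap` and `max_diag_shift` thresholds are assumptions made for this example, whereas the abstract says the real constraints are statistically derived.

```python
from collections import defaultdict


def kmer_hits(a: str, b: str, k: int) -> list[tuple[int, int]]:
    """Return sorted (pos_in_a, pos_in_b) pairs where the reads share a k-mer."""
    index = defaultdict(list)
    for j in range(len(b) - k + 1):
        index[b[j:j + k]].append(j)
    return sorted(
        (i, j)
        for i in range(len(a) - k + 1)
        for j in index.get(a[i:i + k], ())
    )


def largest_hit_group(hits, max_gap: int, max_diag_shift: int) -> int:
    """Size of the largest run of consecutive hits that stay close together.

    A hit joins the current group when it lies within max_gap positions of
    the previous hit in read a and its diagonal (i - j) moved by at most
    max_diag_shift, which tolerates small insertions and deletions.
    """
    best = current = 0
    previous = None
    for i, j in hits:
        if (previous is not None
                and i - previous[0] <= max_gap
                and abs((i - j) - (previous[0] - previous[1])) <= max_diag_shift):
            current += 1
        else:
            current = 1
        best = max(best, current)
        previous = (i, j)
    return best
```

Two reads would be flagged as a candidate overlap when the largest group reaches some minimum size. A greedy scan like this is easily broken by a single noisy hit, so a practical tool would need a more robust chaining step.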


16.
Compared with traditional algorithms for long metagenomic sequence classification, characterizing microorganisms’ taxonomic and functional abundance based on tens of millions of very short reads is much more challenging. We describe an efficient composition and phylogeny-based algorithm [Metagenome Composition Vector (MetaCV)] to classify very short metagenomic reads (75–100 bp) into specific taxonomic and functional groups. We applied MetaCV to the Meta-HIT data (371-Gb 75-bp reads of 109 human gut metagenomes), and this single-read-based, instead of assembly-based, classification has a high resolution to characterize the composition and structure of human gut microbiota, especially for low abundance species. Most strikingly, it only took MetaCV 10 days to do all the computation work on a server with five 24-core nodes. To our knowledge, MetaCV, which benefits from the strategy of composition comparison, is the first algorithm that can classify millions of very short reads within affordable time.

17.
To assess the functional capacities of microbial communities, including those inhabiting the human body, shotgun metagenomic reads are often aligned to a database of known genes. Such homology-based annotation practices critically rely on the assumption that short reads can map to orthologous genes of similar function. This assumption, however, and the various factors that impact short read annotation, have not been systematically evaluated. To address this challenge, we generated an extremely large database of simulated reads (totaling 15.9 Gb), spanning over 500,000 microbial genes and 170 curated genomes and including, for many genomes, every possible read of a given length. We annotated each read using common metagenomic protocols, fully characterizing the effect of read length, sequencing error, phylogeny, database coverage, and mapping parameters. We additionally rigorously quantified gene-, genome-, and protocol-specific annotation biases. Overall, our findings provide a first comprehensive evaluation of the capabilities and limitations of functional metagenomic annotation, providing crucial goal-specific best-practice guidelines to inform future metagenomic research.

18.
Deep-sea hydrothermal vent chimneys harbor a high diversity of largely unknown microorganisms. Although the phylogenetic diversity of these microorganisms has been described previously, the adaptation and metabolic potential of the microbial communities is only beginning to be revealed. A pyrosequencing approach was used to directly obtain sequences from a fosmid library constructed from a black smoker chimney 4143-1 in the Mothra hydrothermal vent field at the Juan de Fuca Ridge. A total of 308 034 reads with an average sequence length of 227 bp were generated. Comparative genomic analyses of metagenomes from a variety of environments by two-way clustering of samples and functional gene categories demonstrated that the 4143-1 metagenome clustered most closely with that from a carbonate chimney from Lost City. Both are highly enriched in genes for mismatch repair and homologous recombination, suggesting that the microbial communities have evolved extensive DNA repair systems to cope with the extreme conditions that have potential deleterious effects on the genomes. As previously reported for the Lost City microbiome, the metagenome of chimney 4143-1 exhibited a high proportion of transposases, implying that horizontal gene transfer may be a common occurrence in the deep-sea vent chimney biosphere. In addition, genes for chemotaxis and flagellar assembly were highly enriched in the chimney metagenomes, reflecting the adaptation of the organisms to the highly dynamic conditions present within the chimney walls. Reconstruction of the metabolic pathways revealed that the microbial community in the wall of chimney 4143-1 was mainly fueled by sulfur oxidation, putatively coupled to nitrate reduction to perform inorganic carbon fixation through the Calvin–Benson–Bassham cycle. 
On the basis of the genomic organization of the key genes of the carbon fixation and sulfur oxidation pathways contained in the large genomic fragments, both obligate and facultative autotrophs appear to be present and contribute to biomass production.

19.
Industrial units, manufacturing dyes, chemicals, solvents, and xenobiotic compounds, produce liquid and solid wastes, which upon conventional treatment are released in the nearby environment and thus are the major cause of pollution. Soil collected from contaminated Kharicut Canal bank (N 22°57.878′; E 072°38.478′), Ahmedabad, Gujarat, India was used for metagenomic DNA preparation to study the capabilities of intrinsic microbial community in dealing with xenobiotics. Sequencing of metagenomic DNA on the Genome Sequencer FLX System using titanium chemistry resulted in 409,782 reads accounting for 133,529,997 bases of sequence information. Taxonomic analyses and gene annotations were carried out using the bioinformatics platform Sequence Analysis and Management System for Metagenomic Datasets. Taxonomic profiling was carried out by three different complementary approaches: (a) 16S rDNA, (b) environmental gene tags, and (c) lowest common ancestor. The most abundant phylum and genus were found to be “Proteobacteria” and “Pseudomonas,” respectively. Metagenome reads were mapped on sequenced microbial genomes and the highest numbers of reads were allocated to Pseudomonas stutzeri A1501. Assignment of obtained metagenome reads to Gene Ontology terms, Clusters of Orthologous Groups of protein categories, protein family numbers, and Kyoto Encyclopedia of Genes and Genomes hits revealed genomic potential of indigenous microbial community. In total, 157,024 reads corresponded to 37,028 different KEGG hits, and amongst them, 11,574 reads corresponded to 131 different enzymes potentially involved in xenobiotic biodegradation. These enzymes were mapped on biodegradation pathways of xenobiotics to elucidate their roles in possible catalytic reactions. 
Consequently, information obtained from the present study will act as a baseline which, subsequently along with other “-omic” studies, will help in designing future bioremediation strategies in effluent treatment plants and environmental clean-up projects.

20.
Assembling individual genomes from complex community metagenomic data remains a challenging issue for environmental studies. We evaluated the quality of genome assemblies from community short read data (Illumina 100 bp paired-end sequences) using datasets recovered from freshwater and soil microbial communities as well as in silico simulations. Our analyses revealed that the genome of a single genotype (or species) can be accurately assembled from a complex metagenome when it shows at least about 20× coverage. At lower coverage, however, the derived assemblies contained a substantial fraction of non-target sequences (chimeras), which explains, at least in part, the higher number of hypothetical genes recovered in metagenomic relative to genomic projects. We also provide examples of how to detect intrapopulation structure in metagenomic datasets and estimate the type and frequency of errors in assembled genes and contigs from datasets of varied species complexity.
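To make the roughly 20× coverage threshold concrete, the Python sketch below applies the standard average-coverage relation, coverage = (number of reads × read length) / genome size. The function names and the 5 Mb genome size in the usage note are illustrative assumptions, and the read count refers only to reads that originate from the target genome, not to the whole metagenome.

```python
import math


def expected_coverage(num_reads: int, read_length: int, genome_size: int) -> float:
    """Average depth of coverage contributed by reads from one genome."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Smallest number of reads from the genome that reaches the target depth."""
    return math.ceil(target_coverage * genome_size / read_length)
```

For a hypothetical 5 Mb genome sequenced with 100 bp reads, `reads_needed(20, 100, 5_000_000)` gives one million reads from that genome. In a community where the organism makes up only a small fraction of the DNA, the total sequencing effort would have to be correspondingly larger.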
