Similar literature (20 results)
1.
Recent metagenomics studies of environmental samples suggested that microbial communities are much more diverse than previously reported, and deep sequencing will significantly increase the estimate of total species diversity. Massively parallel pyrosequencing technology enables ultra-deep sequencing of complex microbial populations rapidly and inexpensively. However, computational methods for analyzing large collections of 16S ribosomal sequences are limited. We proposed a new algorithm, referred to as ESPRIT, which addresses several computational issues with prior methods. We developed two versions of ESPRIT, one for personal computers (PCs) and one for computer clusters (CCs). The PC version is used for small- and medium-scale data sets and can process several tens of thousands of sequences within a few minutes, while the CC version is for large-scale problems and is able to analyze several hundreds of thousands of reads within one day. Large-scale experiments are presented that clearly demonstrate the effectiveness of the newly proposed algorithm. The source code and user guide are freely available at http://www.biotech.ufl.edu/people/sun/esprit.html.

2.
3.
Kraken is an ultrafast and highly accurate program for assigning taxonomic labels to metagenomic DNA sequences. Previous programs designed for this task have been relatively slow and computationally expensive, forcing researchers to use faster abundance estimation programs, which only classify small subsets of metagenomic data. Using exact alignment of k-mers, Kraken achieves classification accuracy comparable to the fastest BLAST program. In its fastest mode, Kraken classifies 100 base pair reads at a rate of over 4.1 million reads per minute, 909 times faster than Megablast and 11 times faster than the abundance estimation program MetaPhlAn. Kraken is available at http://ccb.jhu.edu/software/kraken/.
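The idea of classifying reads by exact k-mer matches can be illustrated with a small sketch. This is a simplified majority vote over a plain dictionary, not Kraken's actual algorithm or data structures; the function names, the toy reference sequences, and the choice of k = 5 are all assumptions made for illustration.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every length-k substring of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(references, k):
    """Map each k-mer to the set of taxon labels whose reference contains it."""
    index = {}
    for taxon, seq in references.items():
        for kmer in kmers(seq, k):
            index.setdefault(kmer, set()).add(taxon)
    return index


def classify(read, index, k):
    """Return the taxon hit by the most k-mers of the read, or None if nothing matches."""
    votes = Counter()
    for kmer in kmers(read, k):
        for taxon in index.get(kmer, ()):
            votes[taxon] += 1
    if not votes:
        return None
    return votes.most_common(1)[0][0]


# Toy references (hypothetical sequences, not real genomes).
REFERENCES = {
    "taxon_A": "ACGTACGTGGCATTCAGG",
    "taxon_B": "TTGACCAGTAGGCTAACC",
}
INDEX = build_index(REFERENCES, k=5)
```

Calling `classify("GGCATTCA", INDEX, 5)` looks up each 5-mer of the read in the index and returns the label with the most hits. Because every lookup is an exact dictionary match, no alignment is computed, which is the property that makes k-mer approaches fast.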

4.
The advent of next-generation sequencing technologies has greatly promoted the field of metagenomics, which studies genetic material recovered directly from an environment. Characterization of the genomic composition of a metagenomic sample is essential for understanding the structure of the microbial community. Multiple genomes contained in a metagenomic sample can be identified and quantitated through homology searches of sequence reads against known sequences catalogued in reference databases. Traditionally, reads with multiple genomic hits are assigned to non-specific or high ranks of the taxonomy tree, which undermines accurate estimation of the relative abundance of the genomes present in a sample. Instead of assigning reads one by one to the taxonomy tree as many existing methods do, we propose a statistical framework to model the identified candidate genomes to which sequence reads have hits. After obtaining the estimated proportion of reads generated by each genome, sequence reads are assigned to the candidate genomes and the taxonomy tree based on the estimated probability, taking into account both sequence alignment scores and estimated genome abundance. The proposed method is comprehensively tested on both simulated datasets and two real datasets. It assigns reads to the low taxonomic ranks very accurately. Our statistical approach of taxonomic assignment of metagenomic reads, TAMER, is implemented in R and available at http://faculty.wcas.northwestern.edu/hji403/MetaR.htm.
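Estimating genome proportions first and then assigning reads by combining alignment evidence with abundance can be sketched as a mixture model fitted by expectation-maximization. The abstract does not state which estimation procedure TAMER uses, so the EM scheme, the function names, and the toy likelihood table below are assumptions for illustration only (and the sketch is in Python rather than R).

```python
def estimate_abundance(likelihoods, iterations=100):
    """Estimate genome proportions with EM.

    likelihoods[r][g] is an alignment-derived likelihood of read r under
    genome g (0 when the read has no hit to that genome).
    """
    n_genomes = len(likelihoods[0])
    pi = [1.0 / n_genomes] * n_genomes  # start from uniform proportions
    for _ in range(iterations):
        totals = [0.0] * n_genomes
        for row in likelihoods:
            weights = [p * l for p, l in zip(pi, row)]
            norm = sum(weights)
            if norm == 0:
                continue  # read with no hits contributes nothing
            for g, w in enumerate(weights):
                totals[g] += w / norm  # E-step: fractional assignment
        total = sum(totals)
        if total == 0:
            break
        pi = [t / total for t in totals]  # M-step: renormalize
    return pi


def assign_reads(likelihoods, pi):
    """Assign each read to the genome maximizing likelihood times abundance."""
    assignments = []
    for row in likelihoods:
        weights = [p * l for p, l in zip(pi, row)]
        assignments.append(max(range(len(weights)), key=weights.__getitem__))
    return assignments


# Toy data: three reads hit only genome 0, one hits only genome 1,
# and the last read hits both equally well.
LIKELIHOODS = [[1, 0], [1, 0], [1, 0], [0, 1], [1, 1]]
```

With this toy table the ambiguous read is resolved toward genome 0, because the unambiguous reads already indicate that genome 0 is the more abundant one. That is the intuition behind weighting alignment scores by estimated abundance instead of pushing multi-hit reads up to a higher taxonomic rank.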

5.
6.

Motivation

Paired-end sequencing protocols, offered by next generation sequencing (NGS) platforms like Illumina, generate a pair of reads for every DNA fragment in a sample. Although this protocol has been utilized for several metagenomics studies, most taxonomic binning approaches classify the two reads of a pair independently. The present work explores some simple but effective strategies of utilizing the pairing information of Illumina short reads to improve the accuracy of taxonomic binning of metagenomic datasets. The strategies proposed can be used in conjunction with all genres of existing binning methods.

Results

Validation results suggest that employment of these “Binpairs” strategies can provide significant improvements in the binning outcome. The quality of the taxonomic assignments thus obtained is often comparable to that which can otherwise be achieved only with the relatively longer reads obtained using other NGS platforms (such as Roche).

Availability

An implementation of the proposed strategies of utilizing pairing information is freely available for academic users at https://metagenomics.atc.tcs.com/binning/binpairs.

7.

Background

Computing the long term behavior of regulatory and signaling networks is critical in understanding how biological functions take place in organisms. Steady states of these networks determine the activity levels of individual entities in the long run. Identifying all the steady states of these networks is difficult due to the state space explosion problem.

Methodology

In this paper, we propose a method for identifying all the steady states of Boolean regulatory and signaling networks accurately and efficiently. We build a mathematical model that allows pruning a large portion of the state space quickly without causing any false dismissals. For the remaining state space, which is typically very small compared to the whole state space, we develop a randomized traversal method that extracts the steady states. We estimate the number of steady states, and the expected behavior of individual genes and gene pairs in steady states, in an online fashion. We also formulate a stopping criterion that terminates the traversal as soon as a user-supplied percentage of the results has been returned with high confidence.
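To make the notion of a steady state concrete, the sketch below enumerates every state of a tiny Boolean network and keeps those that the update rules map to themselves. This brute-force enumeration is exactly what the pruning and randomized traversal described above are meant to avoid on realistic networks, so it is not the authors' method; the three-gene network and the function name are hypothetical.

```python
from itertools import product


def steady_states(update_rules):
    """Return all states that every update rule maps back to the same state."""
    genes = sorted(update_rules)
    fixed = []
    for values in product([False, True], repeat=len(genes)):
        state = dict(zip(genes, values))
        next_state = {g: update_rules[g](state) for g in genes}
        if next_state == state:
            fixed.append(state)
    return fixed


# Hypothetical three-gene network used only for illustration.
RULES = {
    "A": lambda s: s["A"],                  # A sustains itself
    "B": lambda s: s["A"] and not s["C"],   # B needs A and is repressed by C
    "C": lambda s: not s["A"],              # C is repressed by A
}
```

The loop visits all 2^n states, which is feasible for three genes but grows exponentially with network size. That growth is the state space explosion problem mentioned in the Background.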

Conclusions

This method identifies the observed steady states of Boolean biological networks computationally. Our algorithm successfully reported the G1 phases of both budding and fission yeast cell cycles. The experiments also suggest that this method is useful in identifying co-expressed genes. By analyzing the steady state profile of the Hedgehog network, we were able to find the highly co-expressed gene pair GL1-SMO together with other such pairs.

Availability

Source code of this work is available at http://bioinformatics.cise.ufl.edu/palSteady.html

8.
A genomic island (GI) is a chunk of DNA sequence in a genome whose origin can be traced back to other organisms or viruses. The detection of GIs plays an indispensable role in biomedical research because GIs are closely associated with special functionalities; for example, disease-causing GIs are known as pathogenicity islands. It is also very important to visualize genomic islands, as well as the supporting features corresponding to the genomic islands in the genome. We have developed a program, Genomic Island Visualization (GIV), which displays the locations of genomic islands in a genome, as well as the corresponding supportive feature information for GIs. GIV was implemented in C++, and was compiled and executed on Linux/Unix operating systems.

Availability

GIV is freely available for non-commercial use at http://www5.esu.edu/cpsc/bioinfo/software/GIV

9.
10.
Corynebacteria are used for a wide variety of industrial purposes, but some species are associated with human diseases. With an increasing number of corynebacterial genomes sequenced, comparative analysis of these strains may provide a better understanding of their biology, phylogeny, virulence and taxonomy, which may lead to the discovery of beneficial industrial strains or contribute to better management of diseases. To facilitate the ongoing research on corynebacteria, a specialized central repository and analysis platform for the corynebacterial research community is needed to host the fast-growing amount of genomic data and facilitate the analysis of these data. Here we present CoryneBase, a genomic database for Corynebacterium with diverse functionality for the analysis of genomes, which aims to provide: (1) annotated genome sequences of Corynebacterium where 165,918 coding sequences and 4,180 RNAs can be found in 27 species; (2) access to comprehensive Corynebacterium data through the use of advanced web technologies for interactive web interfaces; and (3) advanced bioinformatic analysis tools consisting of standard BLAST for homology search, VFDB BLAST for sequence homology search against the Virulence Factor Database (VFDB), a Pairwise Genome Comparison (PGC) tool for comparative genomic analysis, and a newly designed Pathogenomics Profiling Tool (PathoProT) for comparative pathogenomic analysis. CoryneBase offers access to a range of Corynebacterium genomic resources as well as analysis tools for comparative genomics and pathogenomics. It is publicly available at http://corynebacterium.um.edu.my/.

11.

Background

Personal genome assembly is a critical process when studying tumor genomes and other highly divergent sequences. The accuracy of downstream analyses, such as RNA-seq and ChIP-seq, can be greatly enhanced by using personal genomic sequences rather than standard references. Unfortunately, reads sequenced from these types of samples often have a heterogeneous mix of various subpopulations with different variants, making assembly extremely difficult using existing assembly tools. To address these challenges, we developed SHEAR (Sample Heterogeneity Estimation and Assembly by Reference; http://vk.cs.umn.edu/SHEAR), a tool that predicts SVs, accounts for heterogeneous variants by estimating their representative percentages, and generates personal genomic sequences to be used for downstream analysis.

Results

By making use of structural variant detection algorithms, SHEAR offers improved performance in the form of a stronger ability to handle difficult structural variant types and better computational efficiency. We compare against the leading competing approach using a variety of simulated scenarios as well as real tumor cell line data with known heterogeneous variants. SHEAR is shown to successfully estimate heterogeneity percentages in both cases, and demonstrates improved efficiency and a better ability to handle tandem duplications.

Conclusion

SHEAR allows for accurate and efficient SV detection and personal genomic sequence generation. It is also able to account for heterogeneous sequencing samples, such as from tumor tissue, by estimating the subpopulation percentage for each heterogeneous variant.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-84) contains supplementary material, which is available to authorized users.

12.
We describe a new program for the alignment of multiple biological sequences that is both statistically motivated and fast enough for problem sizes that arise in practice. Our Fast Statistical Alignment (FSA) program is based on pair hidden Markov models which approximate an insertion/deletion process on a tree and uses a sequence annealing algorithm to combine the posterior probabilities estimated from these models into a multiple alignment. FSA uses its explicit statistical model to produce multiple alignments which are accompanied by estimates of the alignment accuracy and uncertainty for every column and character of the alignment (previously available only with alignment programs which use computationally expensive Markov chain Monte Carlo approaches), yet can align thousands of long sequences. Moreover, FSA utilizes an unsupervised query-specific learning procedure for parameter estimation which leads to improved accuracy on benchmark reference alignments in comparison to existing programs. The centroid alignment approach taken by FSA, in combination with its learning procedure, drastically reduces the amount of false-positive alignment on biological data in comparison to that given by other methods. The FSA program and a companion visualization tool for exploring uncertainty in alignments can be used via a web interface at http://orangutan.math.berkeley.edu/fsa/, and the source code is available at http://fsa.sourceforge.net/.

13.
We have previously developed a computational method for representing a genome as a barcode image, which makes various genomic features visually apparent. We have demonstrated that this visual capability has made some challenging genome analysis problems relatively easy to solve. We have applied this capability to a number of challenging problems, including (a) identification of horizontally transferred genes, (b) identification of genomic islands with special properties and (c) binning of metagenomic sequences, and achieved highly encouraging results. These application results inspired us to develop this barcode-based genome analysis server for public service, which supports the following capabilities: (a) calculation of the k-mer based barcode image for a provided DNA sequence; (b) detection of sequence fragments in a given genome with distinct barcodes from those of the majority of the genome; (c) clustering of provided DNA sequences into groups having similar barcodes; and (d) homology-based search using Blast against a genome database for any selected genomic regions deemed to have interesting barcodes. The barcode server provides a job management capability, allowing processing of a large number of analysis jobs for barcode-based comparative genome analyses. The barcode server is accessible at http://csbl1.bmb.uga.edu/Barcode.
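A k-mer based barcode can be thought of as a matrix with one row per genome fragment and one column per k-mer, where each cell holds a k-mer frequency that is later rendered as a gray level. The sketch below builds such a matrix; the window size, the value of k, the non-overlapping windows, and the function name are assumptions for illustration and do not reproduce the server's actual parameters.

```python
from itertools import product


def barcode_matrix(sequence, k=2, window=8):
    """Return (labels, rows): one row of k-mer frequencies per non-overlapping window."""
    labels = ["".join(p) for p in product("ACGT", repeat=k)]
    column = {kmer: i for i, kmer in enumerate(labels)}
    rows = []
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window]
        counts = [0] * len(labels)
        total = 0
        for i in range(len(chunk) - k + 1):
            kmer = chunk[i:i + k]
            if kmer in column:  # skip k-mers containing N or other symbols
                counts[column[kmer]] += 1
                total += 1
        rows.append([c / total if total else 0.0 for c in counts])
    return labels, rows
```

For example, `barcode_matrix("AAAAAAAACGCGCGCG")` yields two rows with very different profiles: the first is concentrated entirely on `AA`, the second is split between `CG` and `GC`. Fragments whose rows differ sharply from the rest of a genome are the kind of signal that capability (b) above looks for.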

14.
The taxonomic composition of a microbial community can be deduced by analyzing its rRNA gene content by, e.g., high-throughput DNA sequencing or DNA chips. Such methods typically are based on PCR amplification of rRNA gene sequences using broad-taxonomic-range PCR primers. In these analyses, the use of optimal primers is crucial for achieving an unbiased representation of community composition. Here, we present the computer program DegePrime that, for each position of a multiple sequence alignment, finds a degenerate oligomer of as high coverage as possible and outputs its coverage among taxonomic divisions. We show that our novel heuristic, which we call weighted randomized combination, performs better than previously described algorithms for solving the maximum coverage degenerate primer design problem. We previously used DegePrime to design a broad-taxonomic-range primer pair that targets the bacterial V3-V4 region (341F-805R) (D. P. Herlemann, M. Labrenz, K. Jurgens, S. Bertilsson, J. J. Waniek, and A. F. Andersson, ISME J. 5:1571–1579, 2011, http://dx.doi.org/10.1038/ismej.2011.41), and here we use the program to significantly increase the coverage of a primer pair (515F-806R) widely used for Illumina-based surveys of bacterial and archaeal diversity. By comparison with shotgun metagenomics, we show that the primers give an accurate representation of microbial diversity in natural samples.
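The two quantities traded off in degenerate primer design, degeneracy and coverage, can be computed directly from the standard IUPAC nucleotide ambiguity codes. The sketch below only evaluates a given primer against an alignment window; it does not implement the weighted randomized combination heuristic, and the function names and toy alignment are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(primer, target):
    """True if every base of target is allowed by the primer code at that position."""
    return len(primer) == len(target) and all(
        base in IUPAC[code] for code, base in zip(primer, target)
    )


def degeneracy(primer):
    """Number of distinct non-degenerate oligomers the primer represents."""
    count = 1
    for code in primer:
        count *= len(IUPAC[code])
    return count


def coverage(primer, alignment, start):
    """Fraction of aligned sequences matched by the primer at column start."""
    windows = [seq[start:start + len(primer)] for seq in alignment]
    return sum(matches(primer, w) for w in windows) / len(windows)


# Toy alignment (hypothetical sequences of equal length).
ALIGNMENT = ["ACGTAC", "ACATAC", "TCGTAC", "ACGTTC"]
```

On this toy alignment, `ACRTAC` has degeneracy 2 and matches half of the sequences, while `WCRTWC` has degeneracy 8 and matches all of them. A design heuristic has to search for the oligomer with the highest coverage under a cap on degeneracy, which is the maximum coverage degenerate primer design problem named above.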

15.
Bats account for ~20% of mammalian species, and are the only mammals with true powered flight. Because of their specialized phenotypic traits, much research has been devoted to examining the evolution of bats. Although some whole genome sequences of bats have been assembled and annotated, a uniform resource for the annotated bat genomes is still unavailable. To make the extensive data associated with the bat genomes accessible to the general biological community, we established a Bat Genome Database (BGD). BGD is an open-access, web-available portal that integrates available data of bat genomes and genes. It hosts data from six bat species, including two megabats and four microbats. Users can query the gene annotations using an efficient search engine, and BGD offers browsable tracks of bat genomes. Furthermore, an easy-to-use phylogenetic analysis tool is provided to facilitate online phylogenetic study of genes. To the best of our knowledge, BGD is the first database of bat genomes. It will extend our understanding of bat evolution and benefit the analysis of bat sequences. BGD is freely available at: http://donglab.ecnu.edu.cn/databases/BatGenome/.

16.
17.

Background

In environmental sequencing studies, fungi can be identified based on nucleic acid sequences, using either highly variable sequences as species barcodes or conserved sequences containing a high-quality phylogenetic signal. For the latter, identification relies on phylogenetic analyses and the adoption of the phylogenetic species concept. Such analysis requires that the reference sequences are well identified and deposited in public-access databases. However, many entries in the public sequence databases are problematic in terms of quality and reliability and these data require screening to ensure correct phylogenetic interpretation.

Methods and Principal Findings

To facilitate phylogenetic inferences and phylogenetic assignment, we introduce a fungal sequence database. The database PHYMYCO-DB comprises fungal sequences from GenBank that have been filtered to satisfy stringent sequence quality criteria. For the first release, two widely used molecular taxonomic markers were chosen: the nuclear SSU rRNA and EF1-α gene sequences. Following the automatic extraction and filtration, a manual curation is performed to remove problematic sequences while preserving relevant sequences useful for phylogenetic studies. As a result of curation, ∼20% of the automatically filtered sequences have been removed from the database. To demonstrate how PHYMYCO-DB can be employed, we test a set of environmental Chytridiomycota sequences obtained from deep sea samples.

Conclusion

PHYMYCO-DB offers the tools necessary to: (i) extract high quality fungal sequences for each of the 5 fungal phyla, at all taxonomic levels, (ii) extract already performed alignments, to act as ‘reference alignments’, (iii) launch alignments of personal sequences along with stored data. A total of 9120 SSU rRNA and 672 EF1-α high-quality fungal sequences are now available. The PHYMYCO-DB is accessible through the URL http://phymycodb.genouest.org/.

18.
Aggregatibacter actinomycetemcomitans is a major etiological agent of periodontitis. Here we report the complete genome sequence of serotype c strain D11S-1, which was recovered from the subgingival plaque of a patient diagnosed with generalized aggressive periodontitis.

Aggregatibacter actinomycetemcomitans is a major etiologic agent of human periodontal disease, in particular aggressive periodontitis (12). The natural population of A. actinomycetemcomitans is clonal (7). Six A. actinomycetemcomitans serotypes are distinguished based on the structural and serological characteristics of the O antigen of LPS (6, 7). Three of the serotypes (a, b, and c) comprise >80% of all strains, and each serotype represents a distinct clonal lineage (1, 6, 7). Serotype c strain D11S-1 was cultured from a subgingival plaque sample of a patient diagnosed with generalized aggressive periodontitis. The complete genome sequence of the strain was determined by 454 pyrosequencing (10), which achieved 25× coverage. Assembly was performed using the Newbler assembler (454, Branford, CT) and generated 199 large contigs, with 99.3% of the bases having a quality score of 40 and above. The contigs were aligned with the genome of the sequenced serotype b strain HK1651 (http://www.genome.ou.edu/act.html) using software written in house. The putative contig gaps were then closed by primer walking and sequencing of PCR products over the gaps. The final genome assembly was further confirmed by comparison of an in silico NcoI restriction map to the experimental map generated by optical mapping (8). The genome structure of the D11S-1 strain was compared to that of the sequenced strain HK1651 using the program MAUVE (2, 3). The automated annotation was done using a protocol similar to the annotation engine service at The Institute for Genomic Research/J. Craig Venter Institute with some local modifications. Briefly, protein-coding genes were identified using Glimmer3 (4).
Each protein sequence was then annotated by comparing to the GenBank nonredundant protein database. BLAST-Extend-Repraze was applied to the predicted genes to identify genes that might have been truncated due to a frameshift mutation or premature stop codon. tRNA and rRNA genes were identified by using tRNAScan-SE (9) and a similarity search to our in-house RNA database, respectively.

The D11S-1 circular genome contains 2,105,764 nucleotides, a GC content of 44.55%, 2,134 predicted coding sequences, and 54 tRNA and 19 rRNA genes (see additional data at http://expression.washington.edu/bumgarnerlab/publications.php). The distribution of predicted genes based on functional categories was similar between D11S-1 and HK1651 (http://expression.washington.edu/bumgarnerlab/publications.php). One hundred six and 86 coding sequences were unique to strain D11S-1 and HK1651, respectively (http://expression.washington.edu/bumgarnerlab/publications.php). Genomic islands were identified based on annotations for strain HK1651 and based on manual inspection of contiguous D11S-1 specific DNA regions with G+C bias (http://expression.washington.edu/bumgarnerlab/publications.php). Among 12 identified genomic islands, 5 (B, C, D, E and G; cytolethal distending toxin gene cluster, tight adherence gene cluster, O-antigen biosynthesis and transport gene cluster, leukotoxin gene cluster, and lipooligosaccharide biosynthesis enzyme gene, respectively) correspond to islands 2 to 5 and 8 of strain HK1651 (http://www.oralgen.lanl.gov/) (5). Island F (∼5 kb) is homologous to a portion of the 12.5-kb island 7 in HK1651. Five genomic islands (H to L) were unique to strain D11S-1. The remaining island (A) is a fusion of genomic islands 1 and 6 in strain HK1651.
The genome of D11S-1 is largely in synteny with the genome of the sequenced serotype b strain HK1651 but contains several large-scale genomic rearrangements.

Strain D11S-1 harbors a 43-kb bacteriophage and two plasmids of 31 and 23 kb (http://expression.washington.edu/bumgarnerlab/publications.php). Excluding an ∼9-kb region of low homology, the phage showed >90% nucleotide sequence identity with AaΦ23 (11). A 49-bp attB site (11) was identified at coordinates 2,024,825 to 2,024,873. The location of the inserted phage was identified in the optical map of strain D11S-1 and further confirmed by PCR amplification and sequencing of the regions flanking the insertion site. A closed circular form of the phage was also detected in strain D11S-1 by PCR analysis of the phage ends. The 23-kb plasmid is homologous to pVT745 (92% nucleotide identity). The 31-kb plasmid is a novel plasmid. It has significant homologies in short regions (<2 kb) to Haemophilus influenzae biotype aegyptius plasmid pF1947 and other plasmids.

19.
In plants and animals, chromosomal breakage and fusion events based on conserved syntenic genomic blocks lead to conserved patterns of karyotype evolution among species of the same family. However, karyotype information has not been well utilized in genomic comparison studies. We present CrusView, a Java-based bioinformatic application utilizing Standard Widget Toolkit/Swing graphics libraries and a SQLite database for performing visualized analyses of comparative genomics data in Brassicaceae (crucifer) plants. Compared with similar software and databases, one of the unique features of CrusView is its integration of karyotype information when comparing two genomes. This feature allows users to perform karyotype-based genome assembly and karyotype-assisted genome synteny analyses with preset karyotype patterns of the Brassicaceae genomes. Additionally, CrusView is a local program, which gives its users high flexibility when analyzing unpublished genomes and allows users to upload self-defined genomic information so that they can visually study the associations between genome structural variations and genetic elements, including chromosomal rearrangements, genomic macrosynteny, gene families, high-frequency recombination sites, and tandem and segmental duplications between related species. This tool will greatly facilitate karyotype, chromosome, and genome evolution studies using visualized comparative genomics approaches in Brassicaceae species. CrusView is freely available at http://www.cmbb.arizona.edu/CrusView/.

The Brassicaceae (crucifer) plant family contains more than 3,700 species, including the model plant organism Arabidopsis (Arabidopsis thaliana); economically important crop species, such as Brassica rapa and Brassica napus; and close relatives of Arabidopsis used in abiotic stress research, such as Eutrema salsugineum and Schrenkiella parvula.
Because Brassicaceae plants have high scientific and economic importance, several whole-genome sequencing projects of the species in this family have been recently launched (http://www.brassica.info). Moreover, Brassicaceae is also a good system for population genomics. The 1001 Arabidopsis Genomes Project (http://www.1001genomes.org/) plans to generate complete genome sequences for 1,001 Arabidopsis strains to study the associations between genetic variation and phenotypic diversity. The Value-directed Evolutionary Genomics Initiative project aims to understand the genome evolution of Brassicaceae species by sequencing several close relatives of Arabidopsis, such as Arabidopsis lyrata and Capsella rubella. Recent advances in high-throughput sequencing technology have greatly expedited these whole-genome sequencing projects of versatile nonmodel organisms. Although increasingly longer reads can now be produced from high-throughput sequencing experiments, de novo assembler tools can only generate contig and/or scaffold sequences from high-throughput sequencing reads. These tools cannot generate complete chromosome sequences without genetic and/or physical maps that typically require years to create. This limitation makes chromosome-scale structural variation (i.e. translocation, inversion, deletion and insertion, and segmental and tandem duplication) and genomic macrosynteny analyses difficult to perform.

In both plants and animals, genomes of species within the same family have evolved with conserved karyotype patterns due to the rearrangements of large chromosomal segments. Chromosomal karyotypes can be obtained from comparative chromosomal painting (CCP) experiments by performing in situ hybridization experiments on bacterial artificial chromosome sequences between related species.
The genome of each Brassicaceae member is composed of 24 conserved genomic blocks that have been considered as the basic units of chromosomal rearrangement during genome evolution (Lysak et al., 2006). The sizes of these conserved blocks range from several to dozens of megabases. Currently, karyotypes profiled by CCP experiments in approximately 20 Brassicaceae species are available; such karyotypes include those from Arabidopsis (n = 5), Hornungia alpina (n = 6), Eutrema spp. (n = 7), A. lyrata (n = 8), B. rapa (n = 10), and Polyctenium fremontii (n = 14). By utilizing the karyotype information in Brassicaceae, we have developed a tool, KGBassembler (for Karyotype-based Genome assembler for Brassicaceae), to finalize the assembly of chromosomes from scaffolds/contigs without relying on a genetic/physical map (Ma et al., 2012).

Over the past 2 years, complete whole-genome sequences of several Brassicaceae species have been released, including the aforementioned A. lyrata, S. parvula, B. rapa, and E. salsugineum (Dassanayake et al., 2011; Hu et al., 2011; Wang et al., 2011; Wright and Agren, 2011; Wu et al., 2012; Yang et al., 2013). These genomic resources have opened a new era of comparative genomics in Brassicaceae to better understand the genomic evolution (Cheng et al., 2012). Numerous tools and databases are available for performing comparative genomics analysis in plants. CoGe is a comparative genomics analysis platform that is now a part of the iPlant Collaborative Project (Goff et al., 2011). The CoGe database currently includes nearly 2,000 genome sequences of approximately 1,500 organisms, allowing users to perform online visual analyses of genome synteny and duplication events (Tang and Lyons, 2012). PLAZA and Vista are also Web-based databases that provide comparative analysis services on the genomic data deposited in the databases (Frazer et al., 2004; Van Bel et al., 2012).
Other stand-alone bioinformatic applications for comparative genomic analysis, such as Easyfig and genoPlotR, are commonly used to generate synteny plots of given genome segments at a scale ranging from a single gene to one chromosome (Guy et al., 2010; Sullivan et al., 2011).

In this work, we present a Java-based bioinformatic application, CrusView, for performing visualized analyses of genome synteny and karyotype evolution in Brassicaceae species. CrusView features a user-friendly graphical user interface (GUI) implemented with Standard Widget Toolkit (SWT)/Swing graphics libraries and a SQLite database used to manage local genomic data. Compared with the most commonly used tools in comparative genomics, one of the unique features of CrusView is that available karyotype data of a Brassicaceae species are incorporated to facilitate karyotype-based chromosome assembly and analyses of chromosomal structural evolution. Compared with Web-based tools, the stand-alone CrusView tool was also designed to give users higher flexibility in analyzing currently unpublished genome data and integrating self-defined genomic information based on the users’ interests, such as gene families, gene duplications, chromosomal break points, Gene Ontology terms, and groups of orthologs/paralogs, with the genomic synteny maps. In addition, CrusView can generate images representing genomic synteny between two compared genomes in PNG/SVG/PDF high-resolution formats that are suitable for publication.

20.