Similar articles
Found 20 similar articles (search time: 15 ms)
1.
Mao X, Zhang Y, Xu Y. PLoS ONE, 2011, 6(7): e22556
Pathway enrichment analysis represents a key technique for analyzing high-throughput omic data, and it can help to link individual genes or proteins found to be differentially expressed under specific conditions to well-understood biological pathways. We present here a computational tool, SEAS, for pathway enrichment analysis over a given set of genes in a specified organism against the pathways (or subsystems) in the SEED database, a popular pathway database for bacteria. SEAS maps a given set of genes of a bacterium to pathway genes covered by SEED through gene ID and/or orthology mapping, and then calculates the statistical significance of the enrichment of each relevant SEED pathway by the mapped genes. Our evaluation of SEAS indicates that the program provides highly reliable pathway mapping results and identifies more organism-specific pathways than similar existing programs. SEAS is publicly released under the GPL license agreement and freely available at http://csbl.bmb.uga.edu/~xizeng/research/seas/.
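
Illustrative sketch (not taken from the paper): the abstract does not say which statistic SEAS uses to score enrichment. A common choice for this kind of analysis is the one-sided hypergeometric test, and the Python below assumes that choice. The function name and all numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_pvalue(genome_size, pathway_size, query_size, overlap):
    """Return P(X >= overlap) for a hypergeometric draw.

    query_size genes are drawn without replacement from genome_size genes,
    of which pathway_size belong to the pathway of interest.
    """
    upper = min(pathway_size, query_size)
    total = comb(genome_size, query_size)
    tail = sum(
        comb(pathway_size, k) * comb(genome_size - pathway_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical numbers: 4000 genes, a 40-gene pathway, 100 query genes, 8 hits.
p = hypergeom_enrichment_pvalue(4000, 40, 100, 8)
print(f"p = {p:.3g}")
```

In practice the p-values for all tested pathways would also need a multiple-testing correction, which this sketch leaves out.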

2.
BEST: binding-site estimation suite of tools (cited 4 times: 0 self-citations, 4 by others)

3.
Predicting protein-coding genes remains a significant challenge. Although a variety of computational programs that commonly use machine learning methods have emerged, prediction accuracy remains low when they are applied to large genomic sequences. Moreover, computational gene finding in newly sequenced genomes is especially difficult because of the absence of a training set of abundant validated genes. Here we present a new gene-finding program, SCGPred, to improve the accuracy of prediction by combining multiple sources of evidence. SCGPred can run in a supervised mode on previously well-studied genomes and in an unsupervised mode on novel genomes. In tests on datasets composed of large DNA sequences from human and from the novel genome of Ustilago maydis, SCGPred achieves a significant improvement over popular ab initio gene predictors. We also demonstrate that SCGPred can significantly improve prediction in novel genomes by combining several foreign gene finders with similarity alignments, an approach that is superior to other unsupervised methods. Therefore, SCGPred can serve as an alternative gene-finding tool for newly sequenced eukaryotic genomes. The program is freely available at http://bio.scu.edu.cn/SCGPred/.

4.
In this paper, a self-training method is proposed to recognize translation start sites in bacterial genomes without prior knowledge of rRNA in the genomes concerned. Many biologically meaningful features are incorporated, including mononucleotide distribution patterns near the start codon, the start codon itself, the coding potential and the distance from the leftmost start codon to the start codon. The proposed method correctly predicts 92% of the translation start sites of 195 experimentally confirmed Escherichia coli CDSs, 96% of 58 reliable Bacillus subtilis CDSs and 82% of 140 reliable Synechocystis CDSs. Moreover, the self-training method presented might also be used to relocate the translation start sites of putative CDSs predicted by gene-finding programs. After post-processing by the method presented, the improvement in gene start prediction for some gene-finding programs is remarkable; for example, the accuracy of gene start prediction of Glimmer 2.02 increases from 63% to 91% for 832 reliable E. coli CDSs. An open source computer program that implements the method, GS-Finder, is freely available for academic purposes from http://tubic.tju.edu.cn/GS-Finder/.

5.
MOTIVATION: Tightly packed prokaryotic genes frequently overlap with each other. This feature, rarely seen in eukaryotic DNA, makes detection of translation initiation sites and, therefore, exact predictions of prokaryotic genes notoriously difficult. Improving the accuracy of precise gene prediction in prokaryotic genomic DNA remains an important open problem. RESULTS: A software program implementing a new algorithm utilizing a uniform Hidden Markov Model for prokaryotic gene prediction was developed. The algorithm analyzes a given DNA sequence in each of six possible global reading frames independently. Twelve complete prokaryotic genomes were analyzed using the new tool. The accuracy of gene finding (predicting locations of protein-coding ORFs) and the accuracy of precise gene prediction (detecting the whole gene, including the translation initiation codon) were assessed by comparison with existing annotation. It was shown that in terms of gene finding, the program performs at least as well as previously developed tools such as GeneMark and GLIMMER. In terms of precise gene prediction, the new program was shown to be more accurate, by several percentage points, than earlier tools such as GeneMark.hmm, ECOPARSE and ORPHEUS. The results of testing the program indicated the possibility of systematic bias in start codon annotation in several early sequenced prokaryotic genomes. AVAILABILITY: The new gene-finding program can be accessed through the Web site: http://dixie.biology.gatech.edu/GeneMark/fbf.cgi CONTACT: mark@amber.gatech.edu.
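
Illustrative sketch (not taken from the paper): the abstract states that the sequence is analyzed in each of the six reading frames independently. The Python below shows only that frame decomposition, with a naive stop-codon scan standing in for the Hidden Markov Model, which is not reproduced here. All function names are hypothetical, and the standard stop codons TAA, TAG and TGA are assumed.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def six_frames(seq):
    """Yield (strand, offset, codons) for each of the six reading frames."""
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            # Only complete codons are kept; a trailing partial codon is dropped.
            codons = [s[i:i + 3] for i in range(offset, len(s) - 2, 3)]
            yield strand, offset, codons


def longest_open_stretch(codons):
    """Length, in codons, of the longest run that contains no stop codon."""
    best = run = 0
    for codon in codons:
        run = 0 if codon in STOP_CODONS else run + 1
        best = max(best, run)
    return best


for strand, offset, codons in six_frames("ATGAAATAGCCC"):
    print(strand, offset, longest_open_stretch(codons))
```

Treating each frame separately is what lets overlapping genes in different frames be reported without competing for the same positions.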

6.
Gene recognition by combination of several gene-finding programs (cited 8 times: 1 self-citation, 7 by others)
MOTIVATION: A number of programs have been developed to predict eukaryotic gene structures in DNA sequences. However, gene finding is still a challenging problem. RESULTS: We have explored the effectiveness of reanalyzing and combining the results of several gene-finding programs. We studied several methods with four programs (FEXH, GeneParser3, GENSCAN and GRAIL2). With the HIGHEST-policy combination method or the BOUNDARY method, approximate correlation (AC) improved by 3-5% in comparison with the best single gene-finding program. From another viewpoint, OR-based combination of the four programs is the most reliable way to determine whether a candidate exon overlaps with the real exon, although it is less sensitive than GENSCAN for exon-intron boundaries. Our methods can easily be extended to combine other programs. AVAILABILITY: We have developed a server program (Shirokane System) and a client program (GeneScope) to use the methods. GeneScope is available through a WWW site (http://gf.genome.ad.jp/). CONTACT: katsu,takagi@ims.u-tokyo.ac.jp
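
Illustrative sketch (not taken from the paper): the Python below shows an OR-based and an AND-based combination of predictions, together with the approximate correlation (AC) measure as it is usually defined in gene-finder benchmarking. It assumes a simplified per-nucleotide representation (one boolean per base), whereas the paper works with candidate exons. The HIGHEST-policy and BOUNDARY methods are not reproduced, and the function names and toy data are hypothetical.

```python
def combine(masks, policy):
    """Combine per-base coding calls (lists of bool) from several programs.

    policy "OR" marks a base as coding if any program does; any other value
    is treated as "AND", which requires every program to agree.
    """
    op = any if policy == "OR" else all
    return [op(calls) for calls in zip(*masks)]


def approximate_correlation(predicted, actual):
    """Approximate correlation (AC) at the nucleotide level, in [-1, 1]."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    fp = sum(p and not a for p, a in pairs)
    fn = sum(a and not p for p, a in pairs)
    # Average only the conditional probabilities whose denominator is non-zero.
    ratios = [
        num / den
        for num, den in ((tp, tp + fn), (tp, tp + fp), (tn, tn + fp), (tn, tn + fn))
        if den > 0
    ]
    acp = sum(ratios) / len(ratios)
    return (acp - 0.5) * 2


T, F = True, False
actual = [F, F, T, T, T, T, F, F]
program_a = [F, T, T, T, F, F, F, F]
program_b = [F, F, F, T, T, T, T, F]

for policy in ("OR", "AND"):
    merged = combine([program_a, program_b], policy)
    print(policy, round(approximate_correlation(merged, actual), 3))
```

The OR policy trades specificity for sensitivity and the AND policy does the reverse, which is why a combination can beat the best single program on one measure while losing on another.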

7.
Li G, Ma Q, Mao X, Yin Y, Zhu X, Xu Y. Nucleic Acids Research, 2011, 39(22): e150
Existing methods for orthologous gene mapping suffer from two general problems: (i) they are computationally too slow and their results are difficult to interpret for automated large-scale applications when based on phylogenetic analyses; or (ii) they are too prone to making mistakes in dealing with complex situations involving horizontal gene transfers and gene fusion due to the lack of a sound basis when based on sequence similarity information. We present a novel algorithm, Global Optimization Strategy (GOST), for orthologous gene mapping through combining sequence similarity and contextual (working partners) information, using a combinatorial optimization framework. Genome-scale applications of GOST show substantial improvements over the predictions by three popular sequence similarity-based orthology mapping programs. Our analysis indicates that our algorithm overcomes the intrinsic issues faced by sequence similarity-based methods, when orthology mapping involves gene fusions and horizontal gene transfers. Our program runs as efficiently as the most efficient sequence similarity-based algorithm in the public domain. GOST is freely downloadable at http://csbl.bmb.uga.edu/~maqin/GOST.

8.
9.
10.
We present a new computational method for solving a classical problem, the identification of cis-regulatory motifs in a given set of promoter sequences, based on one key new idea. Instead of scoring candidate motifs individually, as in all existing motif-finding programs, our method scores groups of candidate motifs with similar sequences, called motif closures, using a P-value, which has substantially improved the prediction reliability over the existing methods. Our new P-value scoring scheme is independent of sequence length, and hence allows direct comparisons among predicted motifs of different lengths on the same footing. We have implemented this method as a Motif Recognition Computer (MREC) program, and have extensively tested MREC on both simulated and biological data from prokaryotic genomes. Our test results indicate that MREC can accurately pick out the actual motif with the correct length as the best scoring candidate for the vast majority of the cases in our test set. We compared our prediction results with those of two motif-finding programs, Cosmo and MEME, and found that MREC outperforms both programs across all the test cases by a large margin. The MREC program is available at http://csbl.bmb.uga.edu/~bingqiang/MREC1/.

11.
Exon discovery by genomic sequence alignment (cited 5 times: 0 self-citations, 5 by others)
MOTIVATION: During evolution, functional regions in genomic sequences tend to be more highly conserved than randomly mutating 'junk DNA', so local sequence similarity often indicates biological functionality. This fact can be used to identify functional elements in large eukaryotic DNA sequences by cross-species sequence comparison. In recent years, several gene-prediction methods have been proposed that work by comparing anonymous genomic sequences, for example from human and mouse. The main advantage of these methods is that they are based on simple and generally applicable measures of (local) sequence similarity; unlike standard gene-finding approaches, they do not depend on species-specific training data or on the presence of cognate genes in databases. Like all comparative sequence-analysis methods, the new comparative gene-finding approaches critically rely on the quality of the underlying sequence alignments. RESULTS: Herein, we describe a new implementation of the sequence-alignment program DIALIGN that has been developed for alignment of large genomic sequences. We compare our method to the alignment programs PipMaker, WABA and BLAST, and we show that local similarities identified by these programs are highly correlated with protein-coding regions. In our test runs, PipMaker was the most sensitive method while DIALIGN was the most specific. AVAILABILITY: The program is downloadable from the DIALIGN home page at http://bibiserv.techfak.uni-bielefeld.de/dialign/.

12.
CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes (cited 1 time: 0 self-citations, 1 by others)
MOTIVATION: The numbers of finished and ongoing genome projects are increasing at a rapid rate, and providing the catalog of genes for these new genomes is a key challenge. Obtaining a set of well-characterized genes is a basic requirement in the initial steps of any genome annotation process. An accurate set of genes is needed in order to learn about species-specific properties, to train gene-finding programs, and to validate automatic predictions. Unfortunately, many new genome projects lack comprehensive experimental data to derive a reliable initial set of genes. RESULTS: In this study, we report a computational method, CEGMA (Core Eukaryotic Genes Mapping Approach), for building a highly reliable set of gene annotations in the absence of experimental data. We define a set of conserved protein families that occur in a wide range of eukaryotes, and present a mapping procedure that accurately identifies their exon-intron structures in a novel genomic sequence. CEGMA includes the use of profile-hidden Markov models to ensure the reliability of the gene structures. Our procedure allows one to build an initial set of reliable gene annotations in potentially any eukaryotic genome, even those in draft stages. AVAILABILITY: Software and data sets are available online at http://korflab.ucdavis.edu/Datasets.

13.
MOTIVATION: In gene discovery projects based on EST sequencing, effective post-sequencing identification methods are important in determining tissue sources of ESTs within pooled cDNA libraries. In the past, such identification efforts have been characterized by higher than necessary failure rates due to the presence of errors within the subsequence containing the oligo tag intended to define the tissue source for each EST. RESULTS: A large-scale EST-based gene discovery program at The University of Iowa has led to the creation of a unique software method named UITagCreator, usable in the creation of large sets of synthetic tissue identification tags. The identification tags provide error detection and correction capability and, in conjunction with automated annotation software, result in a substantial improvement in the accurate identification of the tissue source in the presence of sequencing and base-calling errors. These identification rates are favorable relative to past paradigms. AVAILABILITY: The UITagCreator source code and installation instructions, along with detection software usable in concert with created tag sets, are freely available at http://genome.uiowa.edu/pubsoft/software.html CONTACT: tomc@eng.uiowa.edu

14.
We present a web-based network-construction system, CINPER (CSBL INteractive Pathway BuildER), to assist a user in building a user-specified gene network for a prokaryotic organism in an intuitive manner. CINPER builds a network model based on different types of information provided by the user and stored in the system. CINPER’s prediction process has four steps: (i) collection of template networks based on (partially) known pathways of related organism(s) from the SEED or BioCyc database and the published literature; (ii) construction of an initial network model based on the template networks using the P-Map program; (iii) expansion of the initial model, based on the association information derived from operons, protein-protein interactions, co-expression modules and phylogenetic profiles; and (iv) computational validation of the predicted models based on gene expression data. To make the system easy to use, CINPER provides an interactive visualization environment for a user to enter, search and edit relevant data and for the system to display (partial) results and prompt for additional data. Evaluation of CINPER on 17 well-studied pathways in the MetaCyc database shows that the program achieves an average recall rate of 76% and an average precision rate of 90% on the initial models, and a higher average recall rate of 87% with an average precision rate of 28% on the final models. The reduced precision rate in the final models versus the initial models reflects the fact that the final models contain large numbers of novel genes that have no experimental evidence and hence are not yet collected in the MetaCyc database. To demonstrate the usefulness of this server, we have used it to predict an iron homeostasis gene network of Synechocystis sp. PCC6803. The predicted models along with the server can be accessed at http://csbl.bmb.uga.edu/cinper/.

15.
Tn-seq is a high throughput technique for analysis of transposon mutant libraries. Tn-seq Explorer was developed as a convenient and easy-to-use package of tools for exploration of the Tn-seq data. In a typical application, the user will have obtained a collection of sequence reads adjacent to transposon insertions in a reference genome. The reads are first aligned to the reference genome using one of the tools available for this task. Tn-seq Explorer reads the alignment and the gene annotation, and provides the user with a set of tools to investigate the data and identify possibly essential or advantageous genes as those that contain significantly low counts of transposon insertions. Emphasis is placed on providing flexibility in selecting parameters and methodology most appropriate for each particular dataset. Tn-seq Explorer is written in Java as a menu-driven, stand-alone application. It was tested on Windows, Mac OS, and Linux operating systems. The source code is distributed under the terms of GNU General Public License. The program and the source code are available for download at http://www.cmbl.uga.edu/downloads/programs/Tn_seq_Explorer/ and https://github.com/sina-cb/Tn-seqExplorer.
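
Illustrative sketch (not taken from the paper, and not Tn-seq Explorer's own Java code): the abstract describes flagging genes with unusually few transposon insertions. The Python below shows one simple way to count unique insertion sites per gene and to list genes below a density threshold. The function names, gene coordinates and threshold are hypothetical, and the tool's actual statistical procedure is not described in the abstract.

```python
from bisect import bisect_left, bisect_right


def insertions_per_gene(insertion_sites, genes):
    """Count unique insertion positions that fall inside each gene.

    insertion_sites: iterable of 1-based genome coordinates.
    genes: dict mapping gene name to (start, end), both inclusive.
    """
    sites = sorted(set(insertion_sites))
    return {
        name: bisect_right(sites, end) - bisect_left(sites, start)
        for name, (start, end) in genes.items()
    }


def low_insertion_genes(counts, genes, max_density):
    """Names of genes whose insertions per base pair are at most max_density."""
    return sorted(
        name
        for name, n in counts.items()
        if n / (genes[name][1] - genes[name][0] + 1) <= max_density
    )


# Hypothetical annotation and insertion sites.
genes = {"geneA": (100, 199), "geneB": (300, 399)}
sites = [105, 120, 150, 150, 180, 250, 310]

counts = insertions_per_gene(sites, genes)
print(counts)
print(low_insertion_genes(counts, genes, max_density=0.02))
```

Sorting the sites once and using binary search keeps the per-gene lookup logarithmic, which matters when the library contains many insertion sites.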

16.
MOTIVATION: Ion-type identification is a fundamental problem in computational proteomics. Methods for accurate identification of ion types provide the basis for many mass spectrometry data interpretation problems, including (a) de novo sequencing, (b) identification of post-translational modifications and mutations and (c) validation of database search results. RESULTS: Here, we present a novel graph-theoretic approach for solving the problem of separating b ions from y ions in a set of tandem mass spectra. We represent each spectral peak as a node and consider two types of edges: a type-1 edge connects two peaks that are probably of the same ion type, and a type-2 edge connects two peaks that are probably of different ion types. The problem of ion separation is formulated and solved as a graph partition problem, which is to partition the graph into three subgraphs, representing b, y and other ions, respectively, by maximizing the total weight of type-1 edges while minimizing the total weight of type-2 edges within each partitioned subgraph. We have developed a dynamic programming algorithm for rigorously solving this graph partition problem and implemented it as a computer program, PRIME (PaRtition of Ion types in tandem Mass spEctra). Tests on a large number of simulated mass spectra and 19 sets of high-quality experimental Fourier transform ion cyclotron resonance tandem mass spectra indicate that an accuracy level of approximately 90% for the separation of b and y ions was achieved. AVAILABILITY: The executable code of PRIME is available upon request. CONTACT: xyn@bmb.uga.edu.

17.
Gene-finding program evaluation (GFPE) is a set of Java classes for evaluating gene-finding programs. A command-line interface is also provided. Inputs to the program include the sequence data (in FASTA format), annotations of "actual" sequence features, and annotations of "predicted" sequence features. Annotation files are in the General Feature Format promoted by the Sanger center. GFPE calculates a number of metrics of accuracy of predictions at three levels: the coding level, the exon level, and the protein level.
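
Illustrative sketch (not GFPE's own Java code): the abstract does not list the individual metrics GFPE reports. The Python below assumes the exon-level sensitivity and specificity conventionally used in gene-finder evaluation, where a predicted exon counts as correct only if both boundaries match an annotated exon exactly. Note that "specificity" in this convention is the fraction of predictions that are correct. The function name and coordinates are hypothetical.

```python
def exon_level_accuracy(actual_exons, predicted_exons):
    """Return (sensitivity, specificity) with exact-boundary exon matching.

    Each exon is a (start, end) tuple. Sensitivity is the fraction of actual
    exons that were predicted exactly; specificity is the fraction of
    predicted exons that are exactly right.
    """
    actual, predicted = set(actual_exons), set(predicted_exons)
    correct = len(actual & predicted)
    sensitivity = correct / len(actual) if actual else 0.0
    specificity = correct / len(predicted) if predicted else 0.0
    return sensitivity, specificity


# Hypothetical annotation versus prediction.
actual = [(10, 50), (100, 150), (200, 260)]
predicted = [(10, 50), (100, 140), (200, 260), (300, 320)]

sn, sp = exon_level_accuracy(actual, predicted)
print(f"exon sensitivity = {sn:.2f}, exon specificity = {sp:.2f}")
```

The same two ratios can be computed at the coding level by comparing per-base labels instead of whole exons.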

18.
MCScan is an algorithm able to scan multiple genomes or subgenomes in order to identify putative homologous chromosomal regions, and align these regions using genes as anchors. The MCScanX toolkit implements an adjusted MCScan algorithm for detection of synteny and collinearity that extends the original software by incorporating 14 utility programs for visualization of results and additional downstream analyses. Applications of MCScanX to several sequenced plant genomes and gene families are shown as examples. MCScanX can be used to effectively analyze chromosome structural changes, and reveal the history of gene family expansions that might contribute to the adaptation of lineages and taxa. An integrated view of various modes of gene duplication can supplement the traditional gene tree analysis in specific families. The source code and documentation of MCScanX are freely available at http://chibba.pgml.uga.edu/mcscan2/.

19.
Biclustering extends the traditional clustering techniques by attempting to find (all) subgroups of genes with similar expression patterns under to-be-identified subsets of experimental conditions when applied to gene expression data. Still, the real power of this clustering strategy is yet to be fully realized due to the lack of effective and efficient algorithms for reliably solving the general biclustering problem. We report a QUalitative BIClustering algorithm (QUBIC) that can solve the biclustering problem in a more general form, compared to existing algorithms, by employing a combination of qualitative (or semi-quantitative) measures of gene expression data and a combinatorial optimization technique. One key unique feature of the QUBIC algorithm is that it can identify all statistically significant biclusters including biclusters with the so-called ‘scaling patterns’, a problem considered to be rather challenging; another key unique feature is that the algorithm solves such general biclustering problems very efficiently, and is capable of solving biclustering problems with tens of thousands of genes under up to thousands of conditions in a few minutes of CPU time on a desktop computer. We have demonstrated a considerably improved biclustering performance by our algorithm compared to the existing algorithms on various benchmark sets and data sets of our own. QUBIC was written in ANSI C and tested using GCC (version 4.1.2) on Linux. Its source code is available at: http://csbl.bmb.uga.edu/~maqin/bicluster. A server version of QUBIC is also available upon request.

20.
MOTIVATION: Whole genome duplications have played a major role in determining the structure of eukaryotic genomes. Current evidence revealing large blocks of duplicated chromatin yields new insights into the evolutionary history of species, but also presents a major challenge for researchers attempting to utilize comparative genomics techniques. Understanding the timing of duplication events relative to divergence among taxa is critical to accurate and comprehensive cross-species comparisons. RESULTS: We describe a large-scale approach to estimate the timing of duplication events in a phylogenetic context. The methodology has been previously utilized for analysis of Arabidopsis and Saccharomyces duplication events. This new implementation provides a more flexible and reusable framework for these analyses. Scripts written in the Python programming language drive a number of freely available bioinformatics programs, creating a no-cost tool for researchers. The usefulness of the approach is demonstrated through genome-scale analysis of Arabidopsis and Oryza (rice) duplications. AVAILABILITY: Software and documentation are freely available from http://plantgenome.agtec.uga.edu/bioinformatics/dating/
