Similar literature
20 similar documents found.
1.
Gene recognition from questionable ORFs in bacterial and archaeal genomes   Total citations: 1 (self-citations: 0, citations by others: 1)
The ORFs of microbial genomes in annotation files are usually classified into two groups: the first corresponds to known genes, whereas the second includes 'putative', 'probable', 'conserved hypothetical', 'hypothetical', 'unknown' and 'predicted' ORFs. Since the annotation is not 100% accurate, it is essential to confirm which ORFs of the latter group are coding and which are not. Starting from known genes in the former group, this paper describes an improved Z curve method to recognize genes in the latter. Ten-fold cross-validation tests show that the average accuracy of the algorithm is greater than 99% for recognizing the known genes in 57 bacterial and archaeal genomes. The method is then applied to recognize genes of the latter group. The likely non-coding ORFs in each of the 57 bacterial or archaeal genomes studied here are recognized and listed at the website http://tubic.tju.edu.cn/ZCURVE_C_html/noncoding.html. The working mechanism of the algorithm is discussed in detail. A computer program, called ZCURVE_C, was written to calculate a coding score, called the Z-curve score, for ORFs in the above 57 bacterial and archaeal genomes. An ORF is classified as coding if its Z-curve score > 0 and as non-coding if its Z-curve score < 0. A website has been set up to provide a service for calculating the Z-curve score. A user may submit the DNA sequence of an ORF to the server at http://tubic.tju.edu.cn/ZCURVE_C/Default.cgi, and the Z-curve score of the ORF is calculated and returned to the user immediately.

2.
In this paper, a self-training method is proposed to recognize translation start sites in bacterial genomes without prior knowledge of rRNA in the genomes concerned. Many biologically meaningful features are incorporated, including mononucleotide distribution patterns near the start codon, the start codon itself, the coding potential and the distance from the leftmost start codon to the start codon. The proposed method correctly predicts 92% of the translation start sites of 195 experimentally confirmed Escherichia coli CDSs, 96% of 58 reliable Bacillus subtilis CDSs and 82% of 140 reliable Synechocystis CDSs. Moreover, the self-training method might also be used to relocate the translation start sites of putative CDSs predicted by gene-finding programs. After post-processing by this method, the gene start prediction of some gene-finding programs improves remarkably; e.g., the accuracy of gene start prediction of Glimmer 2.02 increases from 63% to 91% for 832 reliable E. coli CDSs. An open source computer program implementing the method, GS-Finder, is freely available for academic purposes from http://tubic.tju.edu.cn/GS-Finder/.

3.
Gao F  Ou HY  Chen LL  Zheng WX  Zhang CT 《FEBS letters》2003,553(3):451-456
Recently, we developed a coronavirus-specific gene-finding system, ZCURVE_CoV 1.0. In this paper, the system is further improved by taking into account the prediction of cleavage sites of viral proteinases in polyproteins. The cleavage sites of the 3C-like proteinase and papain-like proteinase are highly conserved. In addition to a traditional positional weight matrix trained on the peptides around cleavage sites, the present method also considers the length conservation of non-structural proteins cleaved by the 3C-like proteinase and papain-like proteinase, to reduce the false positive prediction rate. The improved system, ZCURVE_CoV 2.0, has been run on each of the 24 completely sequenced coronavirus genomes in GenBank. Consequently, all the non-structural proteins in the 24 genomes are accurately predicted. Compared with known annotations, the performance of the present method is satisfactory. The software ZCURVE_CoV 2.0 is freely available at http://tubic.tju.edu.cn/sars/.

4.
Gao F  Zhang CT 《FEBS letters》2008,582(16):2441-2444
The human genome is structured at multiple levels: it is organized into a series of replication time zones, and meanwhile it is composed of isochores. Accumulating evidence suggests a match between these two genome features. Based on newly developed software GC-Profile, we obtained a complete coverage of the human genome by 3198 isochores with boundaries at single nucleotide resolution. Interestingly, the experimentally confirmed replication timing sites in the regions of 1p36.1, 6p21.32, 17q11.2 and 22q12.1 nearly all coincide with the determined isochore boundaries. The precise boundaries of the 3198 isochores are available via the website: http://tubic.tju.edu.cn/isomap/.

5.
A new system, ZCURVE 1.0, for finding protein-coding genes in bacterial and archaeal genomes has been proposed. The current algorithm, which is based on the Z curve representation of DNA sequences, emphasizes the global statistical features of protein-coding genes by taking the frequencies of bases at the three codon positions into account. Since ZCURVE 1.0 uses only 33 parameters to characterize coding sequences, it gives better consideration to both typical and atypical cases, whereas Markov-model-based methods, e.g. Glimmer 2.02, train thousands of parameters, which may result in less adaptability. To compare the performance of the new system with that of Glimmer 2.02, both systems were run on 18 genomes not annotated by the Glimmer system. Comparisons were also performed for predicting some function-known genes by both systems. The average accuracy of the two systems is well matched; however, ZCURVE 1.0 has more accurate gene start prediction, a lower additional prediction rate and higher accuracy for the prediction of horizontally transferred genes. It is shown that the joint application of both systems greatly improves gene-finding results. For a typical genome, e.g. Escherichia coli, ZCURVE 1.0 takes approximately 2 min on a Pentium III 866 PC without any human intervention. The system ZCURVE 1.0 is freely available at: http://tubic.tju.edu.cn/Zcurve_B/.

6.
A new system to recognize protein-coding genes in coronavirus genomes, especially suitable for SARS-CoV genomes, is proposed in this paper. Compared with some existing systems, the new program package has the merits of simplicity, high accuracy, reliability and speed. The system ZCURVE_CoV has been run on each of the 11 newly sequenced SARS-CoV genomes. Consequently, six genomes not annotated previously have been annotated, and some problems of previous annotations in the remaining five genomes have been pointed out and discussed. In addition to the polyprotein chain ORFs 1a and 1b and the four genes coding for the major structural proteins, spike (S), small envelope (E), membrane (M) and nucleocapsid (N), ZCURVE_CoV also predicts 5-6 putative proteins of between 39 and 274 amino acids in length with unknown functions. Some single nucleotide mutations within these putative coding sequences have been detected and their biological implications are discussed. A web service is provided, by which a user can obtain the annotated result immediately by pasting SARS-CoV genome sequences into the input window on the web site (http://tubic.tju.edu.cn/sars/). The software ZCURVE_CoV can also be downloaded freely from the web address mentioned above and run on computers under Windows or Linux.

7.
Ou HY  Guo FB  Zhang CT 《FEBS letters》2003,540(1-3):188-194
The nucleotide distribution of all 33 527 open reading frames (ORFs) (≥300 bp) in the genome of Streptomyces coelicolor A3(2) has been analyzed using the Z curve method. Each ORF is mapped onto a point in a 9-dimensional space. To visualize the distribution of mapping points, the points are projected onto the principal plane based on principal component analysis. Consequently, the distribution pattern of the 33 527 points in the principal plane shows a flower-like shape, in which there are seven distinct regions. In addition to the central region, there are six petal-like regions around the center, one of which corresponds to 7172 coding sequences. The central region and the remaining five petal-like regions correspond to the intergenic sequences and out-of-frame non-coding ORFs, respectively. It is shown that selective pressure produces a remarkable bias of the G+C content among three codon positions, resulting in the interesting phenomenon observed. A similar phenomenon is also observed for other bacterial genomes with high genomic G+C content, such as Pseudomonas aeruginosa PA01 (G+C=66.6%). However, for the genomes of Bacillus subtilis (G+C=43.5%) and Clostridium perfringens (G+C=28.6%), no similar phenomenon was observed. The finding presented here may be useful to improve the gene-finding algorithms for genomes with high G+C content. A set of supplementary materials including the plots displaying the base distribution patterns of ORFs in 12 prokaryotes is provided on the website http://tubic.tju.edu.cn/highGC/.
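
The mapping of an ORF onto a point in 9-dimensional space can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes the standard Z curve transform applied to the base frequencies at each of the three codon positions, and the function name `z_curve_9` is hypothetical.

```python
from collections import Counter


def z_curve_9(orf: str) -> list[float]:
    """Map an ORF to 9 Z curve variables (3 per codon position).

    For base frequencies a, c, g, t at one codon position:
      x = (a + g) - (c + t)   purine vs. pyrimidine
      y = (a + c) - (g + t)   amino vs. keto
      z = (a + t) - (g + c)   weak vs. strong hydrogen bonding
    """
    orf = orf.upper()
    features: list[float] = []
    for phase in range(3):
        # Every third base, starting at codon position 1, 2 or 3.
        counts = Counter(orf[phase::3])
        n = sum(counts[b] for b in "ACGT")
        if n == 0:
            raise ValueError("ORF has no A/C/G/T bases at codon position %d" % (phase + 1))
        a, c, g, t = (counts[b] / n for b in "ACGT")
        features += [(a + g) - (c + t), (a + c) - (g + t), (a + t) - (g + c)]
    return features
```

Projecting many such 9-dimensional points onto their first two principal components would then give the kind of planar plot the abstract describes.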

8.
The 2694 ORFs originally annotated as potential genes in the genome of Aeropyrum pernix can be categorized into three clusters (A, B, C), according to their nucleotide composition at three codon positions. Coding potential was found to be responsible for the phenomenon of three clusters in a 9-dimensional space derived from the nucleotide composition of ORFs: ORFs assigned to cluster A are coding ones, while those assigned to clusters B and C are non-coding ORFs. A "codingness" index called the AZ score is defined based on a clustering method used to recognize protein-coding genes in the A. pernix genome. The criterion for a coding or non-coding ORF is based on the AZ score. ORFs with AZ > 0 or AZ < 0 are coding or non-coding, respectively. Consequently, 620 out of 632 ORFs with putative functions based on the original annotation are contained in cluster A, which have positive AZ scores. In addition, all 29 ORFs encoding putative or conserved proteins newly added in RefSeq annotation also have positive AZ scores. Accordingly, the number of re-recognized protein-coding genes in the A. pernix genome is 1610, which is significantly less than 2694 in the original annotation and also much less than 1841 in the RefSeq annotation curated by NCBI staff. Annotation information of re-recognized genes and their AZ scores are available at: http://tubic.tju.edu.cn/Aper/.

9.
Bats account for ~20% of mammalian species, and are the only mammals with true powered flight. Because of their specialized phenotypic traits, much research has been devoted to examining the evolution of bats. Although some whole genome sequences of bats have been assembled and annotated, a uniform resource for the annotated bat genomes is still unavailable. To make the extensive data associated with the bat genomes accessible to the general biological community, we established a Bat Genome Database (BGD). BGD is an open-access, web-available portal that integrates available data on bat genomes and genes. It hosts data from six bat species, including two megabats and four microbats. Users can query the gene annotations using an efficient search engine, and BGD offers browsable tracks of bat genomes. Furthermore, an easy-to-use phylogenetic analysis tool is provided to facilitate online phylogenetic study of genes. To the best of our knowledge, BGD is the first database of bat genomes. It will extend our understanding of bat evolution and benefit the analysis of bat sequences. BGD is freely available at: http://donglab.ecnu.edu.cn/databases/BatGenome/.

10.
Liu  Wei 《Molecular biology reports》2019,46(2):1551-1553

Gene co-expression network analysis has been widely performed in systems biology. Here, I use a chromosome-based strategy to find potential chromosome regions associated with disease, and show an example of cancer. All results are available at http://bioinformatics.fafu.edu.cn/chrom-WGCNA/.


11.

In this paper, we re-annotated the genome of Pyrobaculum aerophilum str. IM2, particularly for hypothetical ORFs. The annotation process includes three parts. Firstly and most importantly, 23 new genes, which were missed in the original annotation, are found by combining similarity search and ab initio gene-finding approaches. Among these new genes, five have significant similarities with function-known genes and the rest have significant similarities with hypothetical ORFs contained in other genomes. Secondly, the coding potentials of the 1645 hypothetical ORFs are re-predicted by using 33 Z curve variables combined with the Fisher linear discrimination method. With an accuracy of 99.68%, 25 originally annotated hypothetical ORFs are recognized as non-coding by our method. Thirdly, 80 hypothetical ORFs are assigned potential functions by similarity search with the BLAST program. Re-annotation of the genome will benefit related research on this hyperthermophilic crenarchaeon. The re-annotation procedure could also be taken as a reference for other archaeal genomes. Details of the revised annotation are freely available at http://cobi.uestc.edu.cn/resource/paero/

12.
Jia P  Xuan L  Liu L  Wei C 《PloS one》2011,6(11):e25353
Metagenomic sequence classification is a procedure to assign sequences to their source genomes. It is one of the important steps in metagenomic sequence data analysis. Although many methods exist, classification of high-throughput metagenomic sequence data in a limited time is still a challenge. We present here an ultra-fast metagenomic sequence classification system (MetaBinG) using graphics processing units (GPUs). The accuracy of MetaBinG is comparable to the best existing systems, and it can classify a million 454 reads within five minutes, which is more than 2 orders of magnitude faster than existing systems. MetaBinG is publicly available at http://cbb.sjtu.edu.cn/~ccwei/pub/software/MetaBinG/MetaBinG.php.

13.
MOTIVATION: Subcellular localization is a key functional characteristic of proteins. A fully automatic and reliable prediction system for protein subcellular localization is needed, especially for the analysis of large-scale genome sequences. RESULTS: In this paper, Support Vector Machine has been introduced to predict the subcellular localization of proteins from their amino acid compositions. The total prediction accuracies reach 91.4% for three subcellular locations in prokaryotic organisms and 79.4% for four locations in eukaryotic organisms. Predictions by our approach are robust to errors in the protein N-terminal sequences. This new approach provides superior prediction performance compared with existing algorithms based on amino acid composition and can be a complementary method to other existing methods based on sorting signals. AVAILABILITY: A web server implementing the prediction method is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/. SUPPLEMENTARY INFORMATION: Supplementary material is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/.
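
The amino acid composition used as classifier input above can be illustrated with a short sketch. It assumes the common definition (the fraction of each of the 20 standard residues in the sequence); the function name is hypothetical, and the sketch covers only the feature vector, not the Support Vector Machine itself.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(protein: str) -> list[float]:
    """Return the 20-dimensional amino acid composition of a protein.

    Non-standard symbols (e.g. X, B, Z, gaps) are ignored, so the
    returned fractions always sum to 1.
    """
    residues = [r for r in protein.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    total = len(residues)
    return [residues.count(aa) / total for aa in AMINO_ACIDS]
```

Because the vector depends only on overall residue fractions, a few wrong residues at the N-terminus change it only slightly, which is consistent with the robustness the abstract reports.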

14.
SUMMARY: The Genome Organization Analysis Tool (GOAT) is a program that performs comparative sequence analysis on ordered gene lists from annotated genomes, provides visual and tabular output, and provides means of accessing and analyzing related gene sequence data, for the purpose of comparing genome organization at the gene-order level. GOAT can be used to compare any two or more genomes or chromosomes on demand, or configured to provide access to precomputed comparisons of a specific group of genome sequences. AVAILABILITY: Demonstration web server and software download, subject to the Virginia Tech Noncommercial License, are available at http://gaia.biotech.vt.edu/goat/. SUPPLEMENTARY INFORMATION: Updates, installation and configuration information are available at http://gaia.biotech.vt.edu/goat.

15.
Feng Gao 《Current Genomics》2014,15(2):104-112
Precise DNA replication is critical for the maintenance of genetic integrity in all organisms. In all three domains of life, DNA replication starts at a specialized locus, termed the replication origin, oriC or ORI, and its identification is vital to understanding the complex replication process. In bacteria and eukaryotes, replication initiates from single and multiple origins, respectively, while archaea can adopt either of the two modes. The Z-curve method has been successfully used to identify replication origins in genomes of various species, including multiple oriCs in some archaea. Based on the Z-curve method and comparative genomics analysis, we have developed a web-based system, Ori-Finder, for finding oriCs in bacterial genomes with high accuracy. Predicted oriC regions in bacterial genomes are organized into an online database, DoriC. Recently, archaeal oriC regions identified by both in vivo and in silico methods have also been included in the database. Here, we summarize the recent advances in in silico prediction of oriCs in bacterial and archaeal genomes using the Z-curve-based method.

16.
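At present, genome sequencing of a large number of horticultural plants has been completed or is nearing completion, and their genome sequences and annotation data have greatly promoted functional genomics research. To provide researchers with a platform for batch downloading of specific genomic region sequences and annotations, the authors developed a bioinformatics tool called OBRRP. OBRRP can extract genome sequences and annotation data for 12 plant species: grape (Vitis vinifera), peach (Prunus persica), strawberry (Fragaria vesca), cucumber (Cucumis sativus), watermelon (Citrullus lanatus), tomato (Solanum lycopersicum), sweet orange (Citrus sinensis), apple (Malus x domestica), kiwifruit (Actinidia chinensis), potato (Solanum tuberosum), banana (Musa acuminata) and Arabidopsis (Arabidopsis thaliana). It can also be extended to other databases built on the Gbrowser architecture. Test results show that OBRRP is a fast and simple tool for online, batch and real-time extraction. It can be accessed at http://bioinfo.jit.edu.cn/OBRRP/.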

17.
MicroRNA identification based on sequence and structure alignment   Total citations: 20 (self-citations: 0, citations by others: 20)
MOTIVATION: MicroRNAs (miRNA) are approximately 22 nt long non-coding RNAs that are derived from larger hairpin RNA precursors and play important regulatory roles in both animals and plants. The short length of the miRNA sequences and relatively low conservation of pre-miRNA sequences restrict the conventional sequence-alignment-based methods to finding only relatively close homologs. On the other hand, it has been reported that miRNA genes are more conserved in secondary structure than in primary sequence. Therefore, secondary structural features should be more fully exploited in the homologue search for new miRNA genes. RESULTS: In this paper, we present a novel genome-wide computational approach to detect miRNAs in animals based on both sequence and structure alignment. Experiments show that this approach has higher sensitivity than, and comparable specificity to, other reported homologue searching methods. We applied this method to Anopheles gambiae and detected 59 new miRNA genes. AVAILABILITY: This program is available at http://bioinfo.au.tsinghua.edu.cn/miralign. SUPPLEMENTARY INFORMATION: Supplementary information is available at http://bioinfo.au.tsinghua.edu.cn/miralign/supplementary.htm.

18.
Integrative genomics predictors, which score highly in predicting bacterial essential genes, are unfeasible for most species because the required data sources are limited. We developed a universal approach and tool designated Geptop, based on orthology and phylogeny, to offer gene essentiality annotations. In a series of tests, our Geptop method yielded higher area under curve (AUC) scores for the receiver operating characteristic curves than the integrative approaches. In ten-fold cross-validations among randomly shuffled samples, Geptop yielded an AUC of 0.918, and in cross-organism predictions for 19 organisms Geptop yielded AUC scores between 0.569 and 0.959. A test applied to the very recently determined essential gene dataset from Porphyromonas gingivalis, which belongs to a phylum different from those of all of the above 19 bacterial genomes, gave an AUC of 0.77. Therefore, Geptop can be applied to any bacterial species whose genome has been sequenced. Compared with the essential genes uniquely identified by lethal screening, the essential genes predicted only by Geptop are associated with more protein-protein interactions, especially in the three bacteria with lower AUC scores (<0.7). This may further illustrate the reliability and feasibility of our method. The web server and standalone version of Geptop are available at http://cefg.uestc.edu.cn/geptop/ free of charge. The tool has been run on 968 bacterial genomes and the results are accessible at the website.
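
For readers unfamiliar with the AUC figures quoted above, the following is a minimal sketch of how an AUC can be computed from prediction scores and true labels. It uses the pairwise-comparison definition (the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half). It is a generic illustration with hypothetical names, not the evaluation code used for Geptop.

```python
def auc(scores: list[float], labels: list[int]) -> float:
    """Area under the ROC curve via pairwise comparison.

    labels: 1 for positive (e.g. essential gene), 0 for negative.
    Runs in O(P * N) time, which is fine for small illustrative inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count positive/negative pairs ranked correctly; ties count as half.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported range of 0.569 to 0.959 in context.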

19.
The mitochondrion is a key organelle of the eukaryotic cell that provides the energy for cellular activities. Correctly identifying the submitochondrial locations of proteins can provide plentiful information for understanding their functions. However, using wet-lab experimental methods to determine the submitochondrial locations of proteins is time-consuming and costly. Thus, it is highly desirable to develop a bioinformatics method to predict the submitochondrial locations of mitochondrial proteins. In this work, a novel method based on a support vector machine was developed to predict the submitochondrial locations of mitochondrial proteins by using over-represented tetrapeptides selected with the binomial distribution. A reliable and rigorous benchmark dataset including 495 mitochondrial proteins with sequence identity ≤25% was constructed for testing and evaluating the proposed model. Jackknife cross-validation showed that 91.1% of the 495 mitochondrial proteins can be correctly predicted. Subsequently, our model was evaluated on three existing benchmark datasets. The overall accuracies are 94.0%, 94.7% and 93.4%, respectively, suggesting that the proposed model is potentially useful in mitochondrial proteome research. Based on this model, we built a predictor called TetraMito, which is freely available at http://lin.uestc.edu.cn/server/TetraMito.

20.