Similar articles (20 found)
1.
MOTIVATION: Targeting peptides direct nascent proteins to their specific subcellular compartment. Knowledge of targeting signals enables informed drug design and reliable annotation of gene products. However, due to the low similarity of such sequences and the dynamic nature of the sorting process, the computational prediction of subcellular localization of proteins is challenging. RESULTS: We contrast the use of feedforward models, as employed by the popular TargetP/SignalP predictors, with a sequence-biased recurrent network model. The models are evaluated in terms of performance at the residue level and at the sequence level, and the evaluation demonstrates that recurrent networks improve the overall prediction performance. Compared to the original results reported for TargetP, an ensemble of the tested models increases the accuracy by 6 and 5% on non-plant and plant data, respectively. AVAILABILITY: The Protein Prowler incorporating the recurrent network predictor described in this paper is available online at http://pprowler.imb.uq.edu.au/

2.
A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed. Using N-terminal sequence information only, it discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and "other" localizations with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets. From a TargetP analysis of the recently sequenced Arabidopsis thaliana chromosomes 2 and 4 and the Ensembl Homo sapiens protein set, we estimate that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%. TargetP also predicts cleavage sites with levels of correctly predicted sites ranging from approximately 40% to 50% (chloroplastic and mitochondrial presequences) to above 70% (secretory signal peptides). TargetP is available as a web-server at http://www.cbs.dtu.dk/services/TargetP/.

3.
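Combining computational techniques with bioinformatics methods, we used a suite of signal peptide analysis programs (SignalP v3.0, TargetP v1.1, big-PI Predictor, TMHMM v2.0 and SecretomeP) to analyze the N-terminal amino acid sequences of 1,486 published small-protein genes of the rice blast fungus (Magnaporthe grisea) for signal peptides, and systematically analyzed the types and structures of those signal peptides. The results show that, of the 1,486 small proteins, 119 are typical secreted proteins with an N-terminal signal peptide: 116 carry a secretory-type signal peptide, 1 carries an RR-motif signal peptide, and 2 carry a signal peptidase II-type signal peptide. In the M. grisea genome, the sequences of secreted small proteins are highly divergent, and only a few signal peptides have identical amino acid composition. To determine whether secreted proteins sharing the same signal peptide are homologous, we aligned them with BLAST 2 Sequences. The results show that secreted proteins with the same signal peptide are highly homologous. We also predicted the subcellular location of the 1,486 small proteins with SubLoc v1.0. The predicted sites of function include the cytoplasm, the extracellular space, mitochondria and the nucleus, with the nucleus being the most common.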

4.
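Using 500 leaf proteins of tea (Camellia sinensis (L.) O. Ktze.) as the dataset, we compared the reliability and sensitivity of four programs for predicting subcellular localization: TargetP, WoLF PSORT, LocTree and Plant-mPLoc. The prediction reliability of all four programs was above 80%, ranked TargetP > LocTree > WoLF PSORT > Plant-mPLoc. LocTree was the most sensitive for cytoplasmic and secreted proteins but the least sensitive for chloroplast proteins. Plant-mPLoc was the most sensitive for nuclear proteins but the least sensitive for cytoplasmic proteins. TargetP was the most sensitive for chloroplast proteins but can distinguish only three subcellular compartments. WoLF PSORT was the least sensitive for secreted proteins but fairly sensitive for all other proteins. Based on these results, the study offers practical recommendations for using the four programs.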

5.
Predicting subcellular localization with AdaBoost Learner
Protein subcellular localization, which tells where a protein resides in a cell, is an important characteristic of a protein and relates closely to its function. The prediction of subcellular localization plays an important role in the prediction of protein function, genome annotation and drug design. Predicting subcellular localization with a bioinformatics approach is therefore an important and challenging task. In this paper, a robust predictor, AdaBoost Learner, is introduced to predict protein subcellular localization based on amino acid composition. Jackknife cross-validation and an independent dataset test were used to demonstrate that AdaBoost is a robust and efficient model for predicting protein subcellular localization. The correct prediction rates were 74.98% and 80.12% for the jackknife test and the independent dataset test, respectively, which are higher than those of other existing predictors. An online server for predicting subcellular localization of proteins based on the AdaBoost classifier is available at http://chemdata.shu.edu.cn/sl12.
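The amino acid composition feature mentioned in this abstract can be illustrated with a minimal sketch. The function name and the toy sequence below are hypothetical and are not taken from the paper; the sketch only shows the usual 20-dimensional fraction-of-residues encoding that a classifier such as AdaBoost could take as input.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, one-letter codes


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard residue.

    Non-standard symbols (e.g. X, B, Z) are ignored; this is an assumption of
    the sketch, not something the abstract specifies.
    """
    counts = Counter(ch for ch in sequence.upper() if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical six-residue peptide used only to exercise the function.
composition = amino_acid_composition("MKKLLA")
```

Because the vector discards residue order, two proteins with the same residue counts get identical features, which is one reason several other entries in this list add sequence-order information.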

6.
The ability to predict the subcellular localization of a protein from its sequence is of great importance, as it provides information about the protein's function. We present a computational tool, PredSL, which utilizes neural networks, Markov chains, profile hidden Markov models, and scoring matrices for the prediction of the subcellular localization of proteins in eukaryotic cells from the N-terminal amino acid sequence. It aims to classify proteins into five groups: chloroplast, thylakoid, mitochondrion, secretory pathway, and "other". When tested in a fivefold cross-validation procedure, PredSL demonstrates 86.7% and 87.1% overall accuracy for the plant and non-plant datasets, respectively. Compared with TargetP, which is the most widely used method to date, and LumenP, the results of PredSL are comparable in most cases. When tested on the experimentally verified proteins of the Saccharomyces cerevisiae genome, PredSL performs comparably to, if not better than, any available algorithm for the same task. Furthermore, PredSL is the only method capable of predicting these subcellular localizations that is available as a stand-alone application, at the URL: http://bioinformatics.biol.uoa.gr/PredSL/.

7.
We report the development of LumenP, a new neural network-based predictor for the identification of proteins targeted to the thylakoid lumen of plant chloroplasts and prediction of their cleavage sites. When used together with the previously developed TargetP predictor, LumenP reaches a significantly better performance than what has been recorded for previous attempts at predicting thylakoid lumen location, mostly due to a lower false positive rate. The combination of TargetP and LumenP predicts around 1.5%-3% of all proteins encoded in the genomes of Arabidopsis thaliana and Oryza sativa to be located in the lumen of the thylakoid.

8.
Many proteins bear multi-locational characteristics, and this phenomenon is closely related to biological function. However, most of the existing methods can only deal with single-location proteins. Therefore, an automatic and reliable ensemble classifier for protein subcellular multi-localization is needed. We propose a new ensemble classifier combining the KNN (K-nearest neighbour) and SVM (support vector machine) algorithms to predict the subcellular localization of eukaryotic, Gram-negative bacterial and viral proteins based on the general form of Chou's pseudo amino acid composition, i.e., GO (gene ontology) annotations, dipeptide composition and AmPseAAC (Amphiphilic pseudo amino acid composition). This ensemble classifier was developed by fusing many basic individual classifiers through a voting system. The overall prediction accuracies obtained by the KNN-SVM ensemble classifier are 95.22, 93.47 and 80.72% for the eukaryotic, Gram-negative bacterial and viral proteins, respectively. Our prediction accuracies are significantly higher than those of previous methods and reveal that our strategy better predicts subcellular locations of multi-location proteins.
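The idea of fusing individual classifiers through a voting system can be sketched as follows. This is a simplified single-label majority vote with a hypothetical function name; the abstract does not describe the actual voting rule, and the multi-location case the paper targets would need a rule that can return more than one label.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse the labels predicted by several classifiers for one protein.

    Ties are broken in favour of the tied label that appears first in the
    input list (an arbitrary choice made for this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:  # first label that reaches the top count wins
        if counts[label] == best:
            return label


# Hypothetical outputs of three base classifiers for the same protein.
votes = ["nucleus", "cytoplasm", "nucleus"]
fused = majority_vote(votes)
```

With an odd number of base classifiers and two candidate labels a tie cannot occur; with more labels the tie-breaking rule starts to matter and would need to be chosen deliberately.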

9.
Locating proteins in the cell using TargetP, SignalP and related tools
Determining the subcellular localization of a protein is an important first step toward understanding its function. Here, we describe the properties of three well-known N-terminal sequence motifs directing proteins to the secretory pathway, mitochondria and chloroplasts, and sketch a brief history of methods to predict subcellular localization based on these sorting signals and other sequence properties. We then outline how to use a number of internet-accessible tools to arrive at a reliable subcellular localization prediction for eukaryotic and prokaryotic proteins. In particular, we provide detailed step-by-step instructions for the coupled use of the amino-acid sequence-based predictors TargetP, SignalP, ChloroP and TMHMM, which are all hosted at the Center for Biological Sequence Analysis, Technical University of Denmark. In addition, we describe and provide web references to other useful subcellular localization predictors. Finally, we discuss predictive performance measures in general and the performance of TargetP and SignalP in particular.

10.
• Apart from their antifungal role, plant defensins have recently been shown to be involved in abiotic stress tolerance or in inhibition of root growth when added in plant culture medium. We studied the subcellular localization of these proteins, which may account for these different roles.
• Stable and transient expression of AhPDF1.1::GFP (green fluorescent protein) fusion proteins were analysed in yeast and plants. Functional tests established that the GFP tag did not alter the action of the defensin. Subcellular localization of AhPDF1.1 was characterized: by imaging AhPDF1.1::GFP together with organelle markers; and by immunolabelling AhPDF1.1 in Arabidopsis halleri and Arabidopsis thaliana leaves using a polyclonal serum.
• All our independent approaches demonstrated that AhPDF1.1 is retained in intracellular compartments on the way to the lytic vacuole, instead of being addressed to the apoplasm.
• These findings challenge the commonly accepted idea of secretion of defensins. The subcellular localization highlighted in this study could partly explain the dual role of plant defensins on plant cells and is of major importance to unravel the mechanisms of action of these proteins at the cellular level.

11.
A complete map of the Arabidopsis (Arabidopsis thaliana) proteome is clearly a major goal for the plant research community in terms of determining the function and regulation of each encoded protein. Developing genome-wide prediction tools such as for localizing gene products at the subcellular level will substantially advance Arabidopsis gene annotation. To this end, we performed a comprehensive study in Arabidopsis and created an integrative support vector machine-based localization predictor called AtSubP (for Arabidopsis subcellular localization predictor) that is based on the combinatorial presence of diverse protein features, such as its amino acid composition, sequence-order effects, terminal information, Position-Specific Scoring Matrix, and similarity search-based Position-Specific Iterated-Basic Local Alignment Search Tool information. When used to predict seven subcellular compartments through a 5-fold cross-validation test, our hybrid-based best classifier achieved an overall sensitivity of 91% with high-confidence precision and Matthews correlation coefficient values of 90.9% and 0.89, respectively. Benchmarking AtSubP on two independent data sets, one from Swiss-Prot and another containing green fluorescent protein- and mass spectrometry-determined proteins, showed a significant improvement in the prediction accuracy of species-specific AtSubP over some widely used “general” tools such as TargetP, LOCtree, PA-SUB, MultiLoc, WoLF PSORT, Plant-PLoc, and our newly created All-Plant method. Cross-comparison of AtSubP on six nontrained eukaryotic organisms (rice [Oryza sativa], soybean [Glycine max], human [Homo sapiens], yeast [Saccharomyces cerevisiae], fruit fly [Drosophila melanogaster], and worm [Caenorhabditis elegans]) revealed inferior predictions. AtSubP significantly outperformed all the prediction tools being currently used for Arabidopsis proteome annotation and, therefore, may serve as a better complement for the plant research community. 
A supplemental Web site that hosts all the training/testing data sets and whole proteome predictions is available at http://bioinfo3.noble.org/AtSubP/.

Subcellular proteomics has gained tremendous attention of late, owing to the role played by organelles in carrying out defined cellular processes. Several experimental efforts have been made to catalog the complete subcellular proteomes of various organisms (Michaud and Snyder, 2002; Huh et al., 2003; Taylor et al., 2003; Andersen and Mann, 2006), with the aim being to improve our understanding of defined cellular processes at the organellar and cellular levels. Although such efforts have generated valuable information, cataloging all subcellular proteomes is far from complete, as experimental methods are expensive and more time consuming. Alternatively, computational prediction systems provide fast, economic (mostly free), automatic, and reasonably accurate assignment of subcellular location to a protein, especially for high-throughput analysis of large-scale genome sequences, ultimately giving the right direction to design cost-effective wet-lab experiments.

The existing bioinformatics localization predictors in the literature can be broadly grouped into three categories: (1) amino acid composition based; (2) N-terminal sorting signals based; and (3) homology based (e.g. those based on domain or motif co-occurrence). These methods have previously been reviewed in detail (Mott et al., 2002; Scott et al., 2004). However, in bioinformatics in general, and in subcellular localization prediction in particular, it is often debated whether predictions should be done over broad systematic groups such as all eukaryotes or all plants, or over narrower groups such as dicots, or even at the single-species level.
On the one hand, species-specific features of sorting signals and amino acid composition could make the prediction better if trained on the particular species where it is going to be used; on the other hand, the smaller data set available for a single species could make the single-species predictor less accurate. How to strike the balance between these two concerns is an important question, which has received far too little attention until now. In this study, we have investigated this important question by conducting a systematic species-specific case study on predicting subcellular localization in Arabidopsis (Arabidopsis thaliana). Although some recent reviews/advances in the prediction of protein-targeting signals have stressed the need for “species-specific” prediction tools (Schneider and Fechner, 2004; Chou and Shen, 2007a), very few have been developed/reported in the literature. The PSLT method (Scott et al., 2004), a Bayesian framework that uses a combination of InterPro motifs, signaling peptides, and transmembrane domains, was developed for predicting genome-wide subcellular localization of human proteins. Two others, HSLpred (Garg et al., 2005) and Hum-PLoc (Chou and Shen, 2006), were also developed specifically for human proteins; another species-specific system, TBpred, was developed for Mycobacterium tuberculosis (Rashid et al., 2007). However, none of these methods have rigorously tested whether their species-specific methods were actually better than the “general” ones.

In plants, some widely used prediction tools are TargetP (Emanuelsson et al., 2000), LOCtree (Nair and Rost, 2005), PA-SUB (Lu et al., 2004), MultiLoc (Höglund et al., 2006), WoLF PSORT (updated version of PSORT II; Horton et al., 2007), and Plant-PLoc (Chou and Shen, 2007b), all having good accuracy (greater than 70%).
A recent computational effort was made in developing a plant species-specific prediction system, RSLpred, for genome-wide subcellular localization annotations of rice (Oryza sativa) proteins (Kaundal and Raghava, 2009). However, although Arabidopsis was the first model plant that was completely sequenced back in the year 2000, there is still no efficient prediction method available for accurately annotating its proteome at the subcellular level. To date, we only know the subcellular localization of about 6,000 proteins that are experimentally proven (e.g. using GFP fusions, mass spectrometry [MS], or other approaches) out of the total 27,379 protein-coding genes as predicted by The Arabidopsis Information Resource (TAIR) release 9 (www.arabidopsis.org). To narrow this huge gap between the large number of predicted genes in the Arabidopsis genome and the limited experimental characterization of their corresponding proteins, a fully automatic and reliable prediction system for complete subcellular annotation of the Arabidopsis proteome would be very useful.

This article presents AtSubP (for Arabidopsis subcellular localization predictor), an integrative system that addresses the aforementioned issues and problems. In this study, we develop this species-specific predictor and rigorously compare its performance with some of the widely used general tools, including the one being currently used by TAIR (Rhee et al., 2003), and discuss if species-specific predictors are more suitable for individual proteome-wide annotations. AtSubP uses the combinatorial presence of diverse features of a protein sequence, such as its amino acid composition, residue order-based dipeptide composition, N- and C-terminal composition, similarity search-based Position-Specific Iterated (PSI)-BLAST information, and the Position-Specific Scoring Matrix (PSSM), as its evolutionary information in a statistically coherent manner.
Under five major classification approaches, we devised 15 different possible techniques to develop 105 different classifiers for each of the seven subcellular compartments under study (chloroplast, cytoplasm, Golgi apparatus, mitochondrion, extracellular, nucleus, and plasma membrane). The performance of these models was systematically evaluated based on a 5-fold cross-validation test and two diverse independent tests: one from Swiss-Prot and the other containing MS/GFP-proven sequences as an experimental test data set from the SUBcellular location database for Arabidopsis (SUBA; http://suba.plantenergy.uwa.edu.au/) and the eukaryotic Subcellular Localization DataBase (eSLDB; http://gpcr.biocomp.unibo.it/esldb/). Our novel approach of combining some diverse protein features into a smart hybrid technique led to the best classifier that achieved an outstanding accuracy level of 91%, with a high-confidence precision and Matthews correlation coefficient (MCC) of 90.9% and 0.89, respectively. The similarity search-based PSI-BLAST module alone performed moderately, achieving an overall accuracy of 78%, suggesting the advantages of machine learning-based classifiers.

To expand on the application and data-mining aspects of the method, we cross-matched the AtSubP’s predictions with the available Swiss-Prot and TAIR annotations as well as compared its performance with some of the widely used general tools on both independent test sets. To explore the species-specific effects, a new All-Plant classifier was developed from a mixture of plant proteins using the same location definitions and encoding schemes as in AtSubP, and their performances were compared in an independent testing.
As another benchmark, the performance of an Arabidopsis-specific classifier was cross-checked on six other eukaryotic organisms (rice [Oryza sativa], soybean [Glycine max], human [Homo sapiens], yeast [Saccharomyces cerevisiae], fruit fly [Drosophila melanogaster], and worm [Caenorhabditis elegans]). The basic purpose of all these diverse tests was to explore the advantages of developing a species-specific predictor(s), if any. To further test this hypothesis, we also analyzed the variation in amino acid composition across various eukaryotic organisms and compared with Arabidopsis, both at the sequence level and in the signal peptide-containing regions.

Finally, AtSubP was used to annotate all 27,379 Arabidopsis proteins contained in TAIR release 9; among them, 21,649 (79.1%) proteins were predicted with their localization information, 7,982 (29.2%) sequences being predicted with high confidence. A user-friendly Web server, available at http://bioinfo3.noble.org/AtSubP/, was also developed to host all the training/testing data sets, whole proteome annotations, and options for annotating the query sequences using five diverse prediction modules based on user selection of protein feature(s).
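The sensitivity, precision and Matthews correlation coefficient (MCC) reported for AtSubP are standard confusion-matrix measures. The sketch below shows how they are computed for one compartment treated as a binary problem. The function names and the counts are hypothetical and are not results from the paper.

```python
import math


def sensitivity(tp, fn):
    """Fraction of true members of a compartment that were recovered."""
    return tp / (tp + fn)


def precision(tp, fp):
    """Fraction of predictions for a compartment that were correct."""
    return tp / (tp + fp)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when a marginal total is zero, a common convention that this
    sketch adopts as an assumption.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts for one compartment versus all the others.
tp, tn, fp, fn = 90, 85, 15, 10
example_sensitivity = sensitivity(tp, fn)
example_precision = precision(tp, fp)
example_mcc = mcc(tp, tn, fp, fn)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, so it stays informative when one compartment is much rarer than the others.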

12.

Background

Subcellular localization of a new protein sequence is very important and fruitful for understanding its function. As the number of new genomes has dramatically increased over recent years, a reliable and efficient system to predict protein subcellular location is urgently needed.

Results

Esub8 was developed to predict protein subcellular localizations for eukaryotic proteins based on amino acid composition. In this research, the proteins are classified into the following eight groups: chloroplast, cytoplasm, extracellular, Golgi apparatus, lysosome, mitochondria, nucleus and peroxisome. We know subcellular localization is a typical classification problem; consequently, a one-against-one (1-v-1) multi-class support vector machine was introduced to construct the classifier. Unlike previous methods, ours considers the order information of protein sequences by a different method. Our method is tested in three subcellular localization predictions for prokaryotic proteins and four subcellular localization predictions for eukaryotic proteins on Reinhardt's dataset. The results are then compared to several other methods. The total prediction accuracies of two tests are both 100% by a self-consistency test, and are 92.9% and 84.14% by the jackknife test, respectively. Esub8 also provides excellent results: the total prediction accuracies are 100% by a self-consistency test and 87% by the jackknife test.

Conclusions

Our method represents a different approach for predicting protein subcellular localization and achieved a satisfactory result; furthermore, we believe Esub8 will be a useful tool for predicting protein subcellular localizations in eukaryotic organisms.
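The jackknife test cited in this abstract (and in several other entries) is leave-one-out validation: each protein is held out in turn and predicted from all the others. The sketch below illustrates the procedure. It swaps in a 1-nearest-neighbour rule for the paper's multi-class SVM so that it runs without external libraries, and the feature vectors are hypothetical toy data.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbor_label(query, training):
    """Label of the closest training sample; training is [(vector, label)]."""
    closest = min(training, key=lambda item: squared_distance(query, item[0]))
    return closest[1]


def jackknife_accuracy(samples):
    """Hold out each sample once and predict it from the remaining ones."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        if nearest_neighbor_label(features, rest) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical two-feature toy data set with two compartments.
toy = [
    ([0.0, 0.1], "cytoplasm"),
    ([0.1, 0.0], "cytoplasm"),
    ([1.0, 0.9], "nucleus"),
    ([0.9, 1.0], "nucleus"),
]
accuracy = jackknife_accuracy(toy)
```

A self-consistency test, by contrast, predicts each sample with a model that has already seen it, which helps explain why the self-consistency figures in the abstract reach 100% while the jackknife figures are lower.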

13.
Knowledge of protein subcellular localization is vitally important for both basic research and drug development. With the avalanche of protein sequences emerging in the post-genomic age, it is highly desired to develop computational tools for timely and effectively identifying their subcellular localization purely based on the sequence information alone. Recently, a predictor called “pLoc-mGpos” was developed for identifying the subcellular localization of Gram-positive bacterial proteins. Its performance is overwhelmingly better than that of the other predictors for the same purpose, particularly in dealing with multi-label systems in which some proteins, called “multiplex proteins”, may simultaneously occur in two or more subcellular locations. Although it is indeed a very powerful predictor, more efforts are definitely needed to further improve it. This is because pLoc-mGpos was trained by an extremely skewed dataset in which some subset (subcellular location) was over 11 times the size of the other subsets. Accordingly, it cannot avoid the bias consequence caused by such an uneven training dataset. To alleviate such bias consequence, we have developed a new and bias-reducing predictor called pLoc_bal-mGpos by quasi-balancing the training dataset. Rigorous target jackknife tests on exactly the same experiment-confirmed dataset have indicated that the proposed new predictor is remarkably superior to pLoc-mGpos, the existing state-of-the-art predictor in identifying the subcellular localization of Gram-positive bacterial proteins. To maximize the convenience for most experimental scientists, a user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/pLoc_bal-mGpos/, by which users can easily get their desired results without the need to go through the detailed mathematics.
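The abstract does not say how the training set was quasi-balanced. As one possible illustration only, the sketch below uses plain random oversampling with replacement, which may differ from the authors' procedure. The function name, the labels and the 11:1 toy data are hypothetical.

```python
import random
from collections import defaultdict


def oversample_to_balance(samples, seed=0):
    """Return a new list in which every label is as frequent as the largest.

    samples is a list of (features, label) pairs. Minority classes are padded
    by drawing their own members again at random (with replacement).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_label = defaultdict(list)
    for item in samples:
        by_label[item[1]].append(item)
    target = max(len(group) for group in by_label.values())
    result = []
    for group in by_label.values():
        result.extend(group)
        result.extend(rng.choice(group) for _ in range(target - len(group)))
    return result


# Hypothetical skewed training set: one location 11 times larger than the other.
skewed = [([0.1], "cell membrane")] * 11 + [([0.9], "cell wall")]
balanced = oversample_to_balance(skewed)
```

Oversampling only duplicates existing minority samples, so it reduces the classifier's bias toward the large class without adding new information. Any evaluation should therefore still be run on the original, unduplicated data.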

14.
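Preliminary analysis of secreted proteins in the Neurospora crassa genome
This paper reports an analysis of the 10,082 amino acid sequences published in the Neurospora crassa whole-genome database, using the signal peptide predictors SignalP v3.0 and PSORT, the transmembrane helix predictors TMHMM v2.0 and THUMBUP, the GPI-anchor site predictor big-PI Predictor, and the subcellular protein localization predictor TargetP v1.01. The results indicate that 437 proteins in N. crassa are secreted proteins. The open reading frames (ORFs) encoding them range from 252 bp to 6,604 bp, with a mean of 1,433 bp, and their signal peptides are 15 to 59 amino acids long. Of the 437 secreted proteins, 205 have a functional description, covering mainly various enzymes, cellular energy production and transport, self-repair and defense. The biochemical processes in which these proteins take part probably occur in the periplasmic space outside the membrane or outside the cell, serving the organism's nutrient uptake and its responses to the environment.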

15.
One of the fundamental goals in proteomics and cell biology is to identify the functions of proteins in various cellular organelles and pathways. Information on the subcellular locations of proteins can provide useful insights for revealing their functions and understanding how they interact with each other in cellular network systems. Most of the existing methods for predicting plant protein subcellular localization can only cover three or four location sites, and none of them can be used to deal with multiplex plant proteins that can simultaneously exist at, or move between, two or more different location sites. Actually, such multiplex proteins might have special biological functions worthy of particular notice. The present study was devoted to improving the existing plant protein subcellular location predictors in these two respects. A new predictor called “Plant-mPLoc” is developed by integrating the gene ontology information, functional domain information, and sequential evolutionary information through three different modes of pseudo amino acid composition. It can be used to identify plant proteins among the following 12 location sites: (1) cell membrane, (2) cell wall, (3) chloroplast, (4) cytoplasm, (5) endoplasmic reticulum, (6) extracellular, (7) Golgi apparatus, (8) mitochondrion, (9) nucleus, (10) peroxisome, (11) plastid, and (12) vacuole. Compared with the existing methods for predicting plant protein subcellular localization, the new predictor is much more powerful and flexible. Particularly, it also has the capacity to deal with multiple-location proteins, which is beyond the reach of any existing predictors specialized for identifying plant protein subcellular localization. As a user-friendly web-server, Plant-mPLoc is freely accessible at http://www.csbio.sjtu.edu.cn/bioinf/plant-multi/.
Moreover, for the convenience of the vast majority of experimental scientists, a step-by-step guide is provided on how to use the web-server to get the desired results. It is anticipated that the Plant-mPLoc predictor as presented in this paper will become a very useful tool in plant science as well as all the relevant areas.

16.
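Existing protein subcellular localization methods were designed for water-soluble proteins and are not suited to transmembrane proteins, while dedicated transmembrane topology predictors were not designed for subcellular localization. This paper improves the model structure of the transmembrane topology predictor TMPHMMLoc and designs a new second-order hidden Markov model. Model parameters are estimated with the Baum-Welch algorithm extended to second-order models, and the models built for the individual subcellular locations are integrated into a single predictor. Test results on the dataset show that the method performs significantly better than the support vector machine and fuzzy k-nearest-neighbor methods designed for soluble proteins, and also better than the hidden Markov model method proposed in TMPHMMLoc, making it an effective method for predicting the subcellular localization of transmembrane proteins.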

17.

Background

The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means.

Results

This paper proposes mitigating the computation burden of alignment-based approaches to subcellular localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal sorting signals. Then, the sequences are truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localizations are subsequently predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on the perturbational discriminant analysis (PDA).

Conclusions

Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can make use of the best properties of signal- and homology-based approaches and can attain an accuracy comparable to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggests that both profile creation time and profile alignment time can be reduced without significant reduction in subcellular localization accuracy. It was found that PDA enjoys a short training time as compared to the conventional SVM. We expect that the method will be useful for biologists conducting large-scale protein annotation and for bioinformaticians performing preliminary investigations on new algorithms that involve pairwise alignments.

18.
Knowledge of membrane protein type often provides crucial hints toward determining the function of an uncharacterized membrane protein. With the avalanche of new protein sequences emerging during the post-genomic era, it is highly desirable to develop an automated method that can serve as a high throughput tool in identifying the types of newly found membrane proteins according to their primary sequences, so as to timely make the relevant annotations on them for the reference usage in both basic research and drug discovery. Based on the concept of pseudo-amino acid composition [K.C. Chou, Proteins: Struct. Funct. Genet. 43 (2001) 246-255; Erratum: Proteins: Struct. Funct. Genet. 44 (2001) 60] that has made it possible to incorporate a considerable amount of sequence-order effects by representing a protein sample in terms of a set of discrete numbers, a novel predictor, the so-called "optimized evidence-theoretic K-nearest neighbor" or "OET-KNN" classifier, was proposed. It was demonstrated via the self-consistency test, jackknife test, and independent dataset test that the new predictor, compared with many previous ones, yielded higher success rates in most cases. The new predictor can also be used to improve the prediction quality for, among many other protein attributes, structural class, subcellular localization, enzyme family class, and G-protein coupled receptor type. The OET-KNN classifier will be available as a web-server at http://www.pami.sjtu.edu.cn/kcchou.
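For readers unfamiliar with pseudo-amino acid composition, the block below gives the form in which it is commonly stated. It is a sketch from the general literature rather than a quotation from this abstract. The symbols are assumptions of this note: $L$ is the sequence length, $R_i$ the residue at position $i$, $f_u$ the normalized occurrence frequency of amino acid $u$, $\lambda$ the number of sequence-order tiers, $w$ a weight factor, and $\Theta$ a coupling function built from physicochemical property differences.

```latex
% Sequence-order correlation factor of tier j (1 <= j <= lambda < L)
\theta_j = \frac{1}{L-j} \sum_{i=1}^{L-j} \Theta\!\left(R_i, R_{i+j}\right)

% The (20 + lambda)-dimensional pseudo-amino acid composition vector
x_u =
\begin{cases}
\dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 1 \le u \le 20, \\[2ex]
\dfrac{w\,\theta_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{j=1}^{\lambda} \theta_j}, & 21 \le u \le 20+\lambda .
\end{cases}
```

The first 20 components reduce to the ordinary amino acid composition when $w = 0$. The remaining $\lambda$ components are what carry the sequence-order effects the abstract refers to.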

19.
Lee K, Kim DW, Na D, Lee KH, Lee D. Nucleic Acids Research, 2006, 34(17):4655-4666
Subcellular localization is one of the key functional characteristics of proteins. An automatic and efficient prediction method for protein subcellular localization is highly required owing to the need for large-scale genome analysis. From a machine learning point of view, a dataset of protein localization has several characteristics: the dataset has too many classes (there are more than 10 localizations in a cell), it is a multi-label dataset (a protein may occur in several different subcellular locations), and it is too imbalanced (the number of proteins in each localization is remarkably different). Even though many previous works have addressed the prediction of protein subcellular localization, none of them effectively tackles these characteristics at the same time. Thus, a new computational method for protein localization is needed for more reliable outcomes. To address the issue, we present a protein localization predictor based on D-SVDD (PLPD) for the prediction of protein localization, which can find the likelihood of a specific localization of a protein more easily and more correctly. Moreover, we introduce three measurements for the more precise evaluation of a protein localization predictor. In results on various datasets derived from the experiments of Huh et al. (2003), the proposed PLPD method represents a different approach that might play a complementary role to the existing methods, such as Nearest Neighbor method and discriminate covariant method. Finally, after finding a good boundary for each localization using the 5184 classified proteins as training data, we predicted 138 proteins whose subcellular localizations could not be clearly observed by the experiments of Huh et al. (2003).

20.
Shen HB, Chou KC. Biopolymers, 2007, 85(3):233-240
Viruses can reproduce their progenies only within a host cell, and their actions depend both on their destructive tendencies toward a specific host cell and on environmental conditions. Therefore, knowledge of the subcellular localization of viral proteins in a host cell or virus-infected cell is very useful for in-depth study of their functions and mechanisms as well as for designing antiviral drugs. An analysis of the Swiss-Prot database (version 50.0, released on May 30, 2006) indicates that only 23.5% of viral protein entries are annotated for their subcellular locations in this regard. As for the gene ontology database, the corresponding percentage is 23.8%. Such a gap calls for the development of high throughput tools for timely annotating the localization of viral proteins within host and virus-infected cells. In this article, a predictor called "Virus-PLoc" has been developed by fusing many basic classifiers, each engineered according to the K-nearest neighbor rule. The overall jackknife success rate obtained by Virus-PLoc in identifying the subcellular compartments of viral proteins was 80% for a benchmark dataset in which none of the proteins has more than 25% sequence identity to any other in the same location site. Virus-PLoc will be freely available as a web-server at http://202.120.37.186/bioinf/virus for public use. Furthermore, Virus-PLoc has been used to provide large-scale predictions of all viral protein entries in the Swiss-Prot database that do not have subcellular location annotations or are annotated as being uncertain. The results thus obtained have been deposited in a downloadable file prepared with Microsoft Excel and named "Tab_Virus-PLoc.xls." This file is available at the same website and will be updated twice a year to include the new entries of viral proteins and reflect the continuous development of Virus-PLoc.
