Similar articles
20 similar articles found (search time: 31 ms)
1.
Li L  Zhang Y  Zou L  Li C  Yu B  Zheng X  Zhou Y 《PloS one》2012,7(1):e31057
With the rapid increase of protein sequences in the post-genomic age, it is challenging to develop accurate and automated methods for reliably and quickly predicting their subcellular localizations. Many approaches have been tried so far, but most of them used only a single algorithm. In this paper, we propose an ensemble classifier that combines the KNN (k-nearest neighbor) and SVM (support vector machine) algorithms through a voting system to predict the subcellular localization of eukaryotic proteins. The overall prediction accuracies obtained with the one-versus-one strategy are 78.17%, 89.94% and 75.55% for three benchmark datasets of eukaryotic proteins. The improved prediction accuracies reveal that GO annotations and hydrophobicity of amino acids help to predict subcellular locations of eukaryotic proteins.

2.
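Lin Hao 《生物信息学》 [Chinese Journal of Bioinformatics] 2009,7(4):252-254
Because the subcellular location of a protein is strongly correlated with its primary sequence, the increment of diversity was used to describe the similarity between proteins in amino acid composition and dipeptide composition, and a modified Mahalanobis discriminant (here called the IDQD method) was used to predict the subcellular locations of mycobacterial proteins. Protein datasets with different levels of sequence similarity were studied with the jackknife test. The results show that when the sequence similarity of the dataset is at most 70%, the prediction accuracy of the algorithm is stable at about 75%. On the full set of 852 proteins the prediction success rate reaches 87.7%, which is better than the accuracy of existing algorithms and indicates that IDQD is an effective method for predicting the subcellular location of mycobacterial proteins.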
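
The increment of diversity mentioned in this abstract can be illustrated with a minimal sketch. It assumes the commonly used definition, in which the diversity of a count vector is D(X) = N log N − Σ n_i log n_i and the increment of diversity of two sources is ID(X, Y) = D(X + Y) − D(X) − D(Y); the abstract itself does not state the formula. The function names are hypothetical, only amino acid composition is shown (not dipeptide composition), and the modified Mahalanobis discriminant step is not covered.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def composition_counts(sequence):
    """Count each standard amino acid in a sequence (hypothetical helper)."""
    counts = Counter(sequence.upper())
    return [counts.get(residue, 0) for residue in AMINO_ACIDS]


def diversity(counts):
    """D(X) = N log N - sum(n_i log n_i), with 0 log 0 taken as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return total * math.log(total) - sum(n * math.log(n) for n in counts if n > 0)


def increment_of_diversity(x, y):
    """ID(X, Y) = D(X + Y) - D(X) - D(Y); smaller values mean more similar sources."""
    merged = [a + b for a, b in zip(x, y)]
    return diversity(merged) - diversity(x) - diversity(y)
```

Under this definition, two sources with proportional compositions give a value of zero, and the value grows as the compositions diverge, so a query would be assigned to the class whose pooled composition gives the smallest increment.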

3.
The overall function of a multi-domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence alignment-based methods commonly utilize domain-level information and provide classification only at the level of domains. Such methods cannot take into account the contributions of the other domains in a protein or of the domain-linker regions, and so cannot classify multi-domain proteins as a whole. An alignment-free protein sequence comparison tool, CLAP (CLAssification of Proteins), was previously developed in our laboratory specifically to handle multi-domain protein sequences without requiring domain boundaries or the sequential order of domains to be defined. Through this method we aim to achieve a biologically meaningful classification scheme for multi-domain protein sequences. In this article, CLAP-based classification is explored on 5 datasets of multi-domain proteins, and we present a detailed analysis for proteins containing (1) tyrosine phosphatase and (2) SH3 domains. At the domain level, the CLAP-based classification scheme resulted in a clustering similar to that obtained from an alignment-based method. CLAP-based clusters obtained for full-length datasets were shown to consist of proteins with similar functions and domain architectures. Our study demonstrates that multi-domain proteins can be classified effectively by considering full-length sequences, without requiring identification of the domains in the sequence.

4.
Protein trafficking, or protein sorting, in eukaryotes is a complicated process that is carried out based on the information contained in the protein. Many methods have been reported for predicting the subcellular location of proteins from sequence information. However, most of these methods use a flat structure or parallel architecture to perform the prediction. In this work, we introduce ensemble classifiers with features extracted directly from full-length protein sequences to predict locations in the protein-sorting pathway hierarchically. Sequence-driven features, sequence-mapped features and sequence autocorrelation features were tested with ensemble learners, and their performances were compared. When evaluated by independent data testing, ensemble-based bagging algorithms with the sequence feature composition, transition and distribution (CTD) successfully classified two datasets with accuracies greater than 90%. We compared our results with similar published methods, and our method performed comparably to the others at two levels in the secretory pathway. This study shows that the CTD feature extracted from protein sequences is effective in capturing biological features among compartments in secretory pathways.

5.
Subcellular location is an important functional annotation of proteins. An automatic, reliable and efficient prediction system for protein subcellular localization is necessary for large-scale genome analysis. This paper describes a protein subcellular localization method that extracts features from protein profiles rather than from amino acid sequences. A protein profile represents a protein family and discards the part of the sequence information that is not conserved throughout the family; it is therefore more sensitive than the amino acid sequence. The amino acid compositions of the whole profile and of the N-terminus of the profile are extracted to train and test probabilistic neural network classifiers. On two benchmark datasets, the overall accuracies of the proposed method reach 89.1% and 68.9%, respectively. The prediction results show that the proposed method performs better than methods based on amino acid sequences. The prediction results of the proposed method are also compared with those of Subloc on two redundancy-reduced datasets.

6.
Prediction of protein subcellular location is a meaningful task that has attracted much attention in recent years. Many protein subcellular location predictors have been developed that can deal only with single-location proteins. However, some proteins may belong to two or even more subcellular locations. It is important to develop predictors that can deal with such multiplex proteins, because these proteins have extremely useful implications for both basic biological research and drug discovery. Since the number of methods dealing with multiplex proteins is limited, it is worthwhile to explore new methods that can predict the subcellular location of proteins with both single and multiple sites. Applying different feature extraction methods and different prediction algorithms to different benchmark datasets may yield some general results. In this paper, two feature extraction methods and two neural network models were applied to three benchmark datasets of different kinds of proteins, i.e. datasets constructed specifically for Gram-positive bacterial proteins, plant proteins and virus proteins. These benchmark datasets have different numbers of location sites. The results show that the RBF neural network is clearly superior to the BP neural network on these datasets, regardless of which type of feature extraction is chosen.

7.
The function of a protein is closely correlated with its subcellular location. Prediction of the subcellular location of apoptosis proteins is an important research area in the post-genetic era, because knowledge of apoptosis proteins is useful for understanding the mechanism of programmed cell death. Compared with the conventional amino acid composition (AAC), the pseudo amino acid composition (PseAA), as originally introduced by Chou, can incorporate much more information about a protein sequence and so remarkably enhance the power of a discrete model to predict various attributes of a protein. In this study, a novel approach is presented to predict apoptosis protein subcellular location solely from sequence, based on the concept of Chou's PseAA composition. The concept of approximate entropy (ApEn), a parameter denoting the complexity of a time series, is used to construct the PseAA composition as additional features. The fuzzy K-nearest neighbor (FKNN) classifier is selected as the prediction engine. The particle swarm optimization (PSO) algorithm is adopted to optimize the weight factors, which are important in the PseAA composition. Two datasets, which cover six and four subcellular locations, respectively, are used to validate the performance of the proposed approach. The results obtained by the jackknife test are quite encouraging. They indicate that the ApEn of a protein sequence can effectively represent information about the subcellular locations of apoptosis proteins. The approach can at least play a complementary role to many of the existing methods, and might become a useful tool for protein function prediction. The software, written in Matlab, is freely available by contacting the corresponding author.

8.
BACKGROUND: Understanding protein subcellular localization is a necessary component toward understanding the overall function of a protein. Numerous computational methods have been published over the past decade, with varying degrees of success. Despite the large number of published methods in this area, only a small fraction of them are available for researchers to use in their own studies. Of those that are available, many are limited by predicting only a small number of major organelles in the cell. Additionally, the majority of methods predict only a single location, even though it is known that a large fraction of the proteins in eukaryotic species shuttle between locations to carry out their function. FINDINGS: We present a software package and a web server for predicting subcellular localization of protein sequences based on the ngLOC method. ngLOC is an n-gram-based Bayesian classifier that predicts subcellular localization of proteins in both prokaryotes and eukaryotes. The overall prediction accuracy varies from 89.8% to 91.4% across species. This program can predict 11 distinct locations each in plant and animal species. ngLOC also predicts 4 and 5 distinct locations on Gram-positive and Gram-negative bacterial datasets, respectively. CONCLUSIONS: ngLOC is a generic method that can be trained by data from a variety of species or classes for predicting protein subcellular localization. The standalone software is freely available for academic use under GNU GPL, and the ngLOC web server is also accessible at http://ngloc.unmc.edu.
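
To make the idea of an n-gram-based Bayesian classifier concrete, here is a minimal sketch of a multinomial naive Bayes model over overlapping sequence n-grams with additive smoothing. It is not the ngLOC implementation: the class name, the smoothing scheme, the default n-gram length and the toy sequences and labels in the usage note are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(sequence, n):
    """Return all overlapping n-grams of a sequence."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


class NGramNaiveBayes:
    """Toy multinomial naive Bayes over sequence n-grams (illustrative only)."""

    def __init__(self, n=3, alpha=1.0):
        self.n = n                      # n-gram length (assumed value)
        self.alpha = alpha              # additive smoothing constant (assumed)
        self.gram_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()

    def fit(self, sequences, labels):
        for seq, label in zip(sequences, labels):
            grams = ngrams(seq, self.n)
            self.gram_counts[label].update(grams)
            self.class_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_scores(self, sequence):
        """Log prior plus summed log likelihood of each n-gram, per class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, n_seqs in self.class_counts.items():
            counts = self.gram_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_seqs / total)
            for gram in ngrams(sequence, self.n):
                score += math.log((counts[gram] + self.alpha) / denom)
            scores[label] = score
        return scores

    def predict(self, sequence):
        scores = self.log_scores(sequence)
        return max(scores, key=scores.get)
```

A hypothetical usage, with made-up sequences and labels, would be `NGramNaiveBayes(n=2).fit(["AAAAAA", "LLLLLL"], ["nuclear", "membrane"]).predict("AAAA")`. Because `log_scores` returns a score for every class, ranking those scores is one simple way such a model could report more than one candidate location.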

9.
When aligning biological sequences, the choice of parameter values for the alignment scoring function is critical. Small changes in gap penalties, for example, can yield radically different alignments. A rigorous way to compute parameter values that are appropriate for aligning biological sequences is through inverse parametric sequence alignment. Given a collection of examples of biologically correct alignments, this is the problem of finding parameter values that make the scores of the example alignments close to those of optimal alignments for their sequences. We extend prior work on inverse parametric alignment to partial examples, which contain regions where the alignment is left unspecified, and to an improved formulation based on minimizing the average error between the score of an example and the score of an optimal alignment. Experiments on benchmark biological alignments show we can find parameters that generalize across protein families and that boost the accuracy of multiple sequence alignment by as much as 25%.

10.
Revealing the subcellular location of newly discovered protein sequences can bring insight to their function and guide research at the cellular level. The rapidly increasing number of sequences entering the genome databanks has called for the development of automated analysis methods. Currently, most existing methods used to predict protein subcellular locations cover only one, or a very limited number of species. Therefore, it is necessary to develop reliable and effective computational approaches to further improve the performance of protein subcellular prediction and, at the same time, cover more species. The current study reports the development of a novel predictor called MSLoc-DT to predict the protein subcellular locations of human, animal, plant, bacteria, virus, fungi, and archaea by introducing a novel feature extraction approach termed Amino Acid Index Distribution (AAID) and then fusing gene ontology information, sequential evolutionary information, and sequence statistical information through four different modes of pseudo amino acid composition (PseAAC) with a decision template rule. Using the jackknife test, MSLoc-DT can achieve 86.5, 98.3, 90.3, 98.5, 95.9, 98.1, and 99.3% overall accuracy for human, animal, plant, bacteria, virus, fungi, and archaea, respectively, on seven stringent benchmark datasets. Compared with other predictors (e.g., Gpos-PLoc, Gneg-PLoc, Virus-PLoc, Plant-PLoc, Plant-mPLoc, ProLoc-Go, Hum-PLoc, GOASVM) on the gram-positive, gram-negative, virus, plant, eukaryotic, and human datasets, the new MSLoc-DT predictor is much more effective and robust. Although the MSLoc-DT predictor is designed to predict the single location of proteins, our method can be extended to multiple locations of proteins by introducing multilabel machine learning approaches, such as the support vector machine and deep learning, as substitutes for the K-nearest neighbor (KNN) method. As a user-friendly web server, MSLoc-DT is freely accessible at http://bioinfo.ibp.ac.cn/MSLOC_DT/index.html.

11.
One of the fundamental tasks in biology is to identify the functions of all proteins to reveal the primary machinery of a cell. Knowledge of the subcellular locations of proteins provides key hints for revealing their functions and for understanding the intricate pathways that regulate biological processes at the cellular level. Protein subcellular location prediction has been extensively studied in the past two decades. Many methods have been developed based on protein primary sequences as well as on the protein-protein interaction network. In this paper, we propose to use the protein-protein interaction network as an infrastructure to integrate existing sequence-based predictors. When predicting the subcellular locations of a given protein, not only the protein itself but also all of its interacting partners are considered. Unlike existing methods, our method requires neither comprehensive knowledge of the protein-protein interaction network nor experimentally annotated subcellular locations for most proteins in the network. In addition, our method can be used as a framework to integrate multiple predictors. Our method achieved an absolute-true rate of 56% on the human proteome, which is higher than that of the state-of-the-art methods.

12.

Background

The functions of proteins are closely related to their subcellular locations. In the post-genomics era, the amount of gene and protein data grows exponentially, which necessitates the prediction of subcellular localization by computational means.

Results

This paper proposes mitigating the computational burden of alignment-based approaches to subcellular localization prediction by a cascaded fusion of cleavage site prediction and profile alignment. Specifically, the informative segments of protein sequences are identified by a cleavage site predictor using the information in their N-terminal sorting signals. Then, the sequences are truncated at the cleavage site positions, and the shortened sequences are passed to PSI-BLAST for computing their profiles. Subcellular localizations are subsequently predicted by a profile-to-profile alignment support-vector-machine (SVM) classifier. To further reduce the training and recognition time of the classifier, the SVM classifier is replaced by a new kernel method based on perturbational discriminant analysis (PDA).

Conclusions

Experimental results on a new dataset based on Swiss-Prot Release 57.5 show that the method can make use of the best properties of signal-based and homology-based approaches and can attain an accuracy comparable to that achieved by using full-length sequences. Analysis of profile-alignment score matrices suggests that both profile creation time and profile alignment time can be reduced without a significant reduction in subcellular localization accuracy. PDA was found to have a short training time compared with the conventional SVM. We suggest that the method will be important for biologists conducting large-scale protein annotation and for bioinformaticians performing preliminary investigations of new algorithms that involve pairwise alignments.

13.
MOTIVATION: The maximum expected accuracy optimization criterion for multiple sequence alignment uses pairwise posterior probabilities of residues to align sequences. The partition function methodology is one way of estimating these probabilities. Here, we combine these two ideas for the first time to construct maximal expected accuracy sequence alignments. RESULTS: We bridge the two techniques within the program Probalign. Our results indicate that Probalign alignments are generally more accurate than other leading multiple sequence alignment methods (i.e. Probcons, MAFFT and MUSCLE) on the BAliBASE 3.0 protein alignment benchmark. Similarly, Probalign also outperforms these methods on the HOMSTRAD and OXBENCH benchmarks. Probalign ranks statistically highest (P-value < 0.005) on all three benchmarks. Deeper scrutiny of the technique indicates that the improvements are largest on datasets containing N/C-terminal extensions and on datasets containing long and heterogeneous length proteins. These points are demonstrated on both real and simulated data. Finally, our method also produces accurate alignments on long and heterogeneous length datasets containing protein repeats. Here, alignment accuracy scores are at least 10% and 15% higher than the other three methods when standard deviation of length is >300 and 400, respectively. AVAILABILITY: Open source code implementing Probalign as well as for producing the simulated data, and all real and simulated data are freely available from http://www.cs.njit.edu/usman/probalign

14.
The subcellular localization of a protein is important for understanding its functions and interactions. There are many computational techniques for predicting protein subcellular locations, but it has been shown that many prediction tasks suffer from a shortage of training data. This paper introduces a new method to mine proteins with non-experimental annotations, i.e. proteins labeled on the basis of non-experimental evidence in protein databases, to overcome the training data shortage. A novel active sample selection strategy, taking advantage of active learning, is designed to find useful samples in the entire data pool of candidate proteins with non-experimental annotations. This approach can adequately estimate the "value" of each sample, automatically select the most valuable samples and add them to the original training set to help retrain the classifiers. Numerical experiments with four popular multi-label classifiers on three benchmark datasets show that the proposed method can effectively select valuable samples to supplement the original training set and significantly improve the performance of the classifiers.

15.
Facing the explosion of newly generated protein sequences in the post-genomic era, we are challenged to develop an automated method for quickly and reliably annotating their subcellular locations. Knowledge of the subcellular locations of proteins can provide useful hints for revealing their functions and understanding how they interact with each other in cellular networking. Unfortunately, it is both expensive and time-consuming to determine the localization of an uncharacterized protein in a living cell purely by experiment. To tackle the challenge, a novel hybridization classifier was developed by fusing many basic individual classifiers through a voting system. The "engine" of these basic classifiers was operated by the OET-KNN (Optimized Evidence-Theoretic K-Nearest Neighbor) rule. As a demonstration, predictions were performed with the fusion classifier for proteins among the following 16 localizations: (1) cell wall, (2) centriole, (3) chloroplast, (4) cyanelle, (5) cytoplasm, (6) cytoskeleton, (7) endoplasmic reticulum, (8) extracell, (9) Golgi apparatus, (10) lysosome, (11) mitochondria, (12) nucleus, (13) peroxisome, (14) plasma membrane, (15) plastid, and (16) vacuole. To avoid redundancy and homology bias, none of the proteins investigated here had ≥25% sequence identity to any other protein in the same subcellular location. The overall success rates obtained via the jackknife cross-validation test and the independent dataset test were 81.6% and 83.7%, respectively, which were 46% to 63% higher than those achieved by the other existing methods on the same benchmark datasets. It is also clarified that the very high success rates obtained by the fusion classifier are by no means a trivial use of the GO annotations, as might easily be misinterpreted: there is a huge number of proteins with given accession numbers and corresponding GO numbers whose subcellular locations are still unknown, and the percentage of proteins with GO annotations indicating their subcellular components is even lower than the percentage of proteins with known subcellular location annotation in the Swiss-Prot database. It is anticipated that the fusion classifier may also become a very useful high-throughput tool for characterizing other attributes of proteins according to their sequences, such as enzyme class, membrane protein type, and nuclear receptor subfamily, among many others. A web server called "Euk-OET-PLoc" has been set up at http://202.120.37.186/bioinf/euk-oet for the public to predict subcellular locations of eukaryotic proteins with the fusion OET-KNN classifier.

16.
Given a raw protein sequence, knowing its subcellular location is an important step toward understanding its function and designing further experiments. A novel method is proposed for the prediction of protein subcellular locations from sequences. For four categories of eukaryotic proteins the overall predictive accuracy is 82.0%, which is 2.6% higher than that obtained with the SVM approach. For three subcellular locations of prokaryotic proteins, an overall accuracy of 89.9% is obtained. In accordance with the architecture of cells, a hierarchical prediction approach is designed. Based on amino acid composition, extracellular proteins and intracellular proteins can be distinguished with an accuracy of 97%.

17.
Feng ZP 《In silico biology》2002,2(3):291-303
The present paper gives an overview of the problem of predicting the subcellular location of a protein. Five ways of extracting information from the global sequence, each used with the Bayes discriminant algorithm, are reviewed: 1) the auto-correlation functions of amino acid indices along the sequence; 2) the quasi-sequence-order approach; 3) the pseudo-amino acid composition; 4) the unified attribute vector in Hilbert space; and 5) Zp parameters extracted from the Zp curve. The actual predictive accuracy is closely related to the degree of similarity between the training and testing sets, or to the average degree of pairwise similarity within the dataset in a cross-validated study. Many scholars hold that, because of the high pairwise sequence identity in the datasets, the currently high predictive accuracies still cannot ensure that the available algorithms are effective in practical prediction. Others argue that the construction of a dataset for developing software should reflect the reality, determined by Mother Nature, that some subcellular locations contain only a small number of proteins, some of which even have a high percentage of sequence similarity. Owing to the complexity of the problem itself, very sophisticated and specialized programs are needed both for constructing datasets and for improving the prediction. In any case, finding the target information in the mature protein sequence and properly combining it with sorting signals in prediction may further improve the overall predictive accuracy and bring the prediction into practice.

18.
MOTIVATION: Subcellular localization is a key functional characteristic of proteins. A fully automatic and reliable prediction system for protein subcellular localization is needed, especially for the analysis of large-scale genome sequences. RESULTS: In this paper, Support Vector Machine has been introduced to predict the subcellular localization of proteins from their amino acid compositions. The total prediction accuracies reach 91.4% for three subcellular locations in prokaryotic organisms and 79.4% for four locations in eukaryotic organisms. Predictions by our approach are robust to errors in the protein N-terminal sequences. This new approach provides superior prediction performance compared with existing algorithms based on amino acid composition and can be a complementary method to other existing methods based on sorting signals. AVAILABILITY: A web server implementing the prediction method is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/. SUPPLEMENTARY INFORMATION: Supplementary material is available at http://www.bioinfo.tsinghua.edu.cn/SubLoc/.
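
The amino acid composition that serves as classifier input in this and several other abstracts in the list is simply a 20-dimensional vector of residue frequencies. A minimal sketch follows; the function name and the decision to ignore non-standard residue codes are assumptions, and the classifier itself is not shown.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # fixed order of the 20 standard residues


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid in the sequence.

    Non-standard symbols (for example X or B) are ignored here, which is
    an assumption of this sketch rather than a rule taken from the abstract.
    """
    counts = Counter(r for r in sequence.upper() if r in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[residue] / total for residue in AMINO_ACIDS]
```

Because the vector depends only on residue frequencies and not on residue order, it is largely unaffected by a few missing or wrong residues at the N-terminus, which is consistent with the robustness claim in the abstract.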

19.
Apoptosis proteins have a central role in the development and homeostasis of an organism. These proteins are very important for understanding the mechanism of programmed cell death. The function of an apoptosis protein is closely related to its subcellular location. It is crucial to develop powerful tools to predict apoptosis protein locations, given the rapidly increasing gap between the number of known structural proteins and the number of known sequences in the protein databank. In this study, amino acid pair compositions with different spacings are used to construct feature sets for representing protein samples. A feature selection approach based on binary particle swarm optimization is applied to extract effective features. An ensemble classifier is used as the prediction engine, in which the basic classifier is the fuzzy K-nearest neighbor. Each basic classifier is trained with a different feature set. Two datasets often used in prior work are selected to validate the performance of the proposed approach. The results obtained by the jackknife test are quite encouraging, indicating that the proposed method might become a useful tool for predicting the subcellular location of apoptosis proteins, or at least can play a complementary role to the existing methods in the relevant areas. The supplementary information and the software, written in Matlab, are available by contacting the corresponding author.

20.
MOTIVATION: Likelihood ratio approximants (LRA) have been widely used for model comparison in statistics. The present study was undertaken in order to explore their utility as a scoring (ranking) function in the classification of protein sequences. RESULTS: We used a simple LRA based on the maximal similarity (or minimal distance) scores of the two top-ranking sequence classes. The scoring methods (Smith-Waterman, BLAST, local alignment kernel and compression-based distances) were compared on datasets designed to test sequence similarities between proteins distantly related in terms of structure or evolution. It was found that LRA-based scoring can significantly outperform simple scoring methods.
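
One possible reading of a score built from the two top-ranking classes is the ratio between the best similarity in the top class and the best similarity in the runner-up class. The sketch below illustrates that reading only; the abstract does not give the exact form of the approximant, so the ratio, the function name and the input layout (a mapping from class label to a list of similarity scores against that class's members) are all assumptions.

```python
def top_two_ratio(similarities_by_class):
    """Return (top class, ratio of best score in top class to best in runner-up).

    similarities_by_class maps a class label to the similarity scores between
    the query and each member of that class (hypothetical input layout).
    """
    class_best = {
        label: max(scores)
        for label, scores in similarities_by_class.items()
        if scores
    }
    if len(class_best) < 2:
        raise ValueError("need scores for at least two classes")
    ranked = sorted(class_best.items(), key=lambda item: item[1], reverse=True)
    (top_label, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    if runner_up_score <= 0:
        raise ValueError("runner-up score must be positive to form a ratio")
    return top_label, top_score / runner_up_score
```

A ratio close to 1 means the two best classes are nearly tied and the assignment is weakly supported, whereas a large ratio means the top class stands well clear of the alternative. For distance-based scores the same idea would use the minimal distance per class and invert the ratio.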
