首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
One of the fundamental goals in cell biology and proteomics is to identify the functions of proteins in the context of compartments that organize them in the cellular environment. Knowledge of subcellular locations of proteins can provide key hints for revealing their functions and understanding how they interact with each other in cellular networking. Unfortunately, it is both time-consuming and expensive to determine the localization of an uncharacterized protein in a living cell purely based on experiments. With the avalanche of newly found protein sequences emerging in the post genomic era, we are facing a critical challenge, that is, how to develop an automated method to fast and reliably identify their subcellular locations so as to be able to timely use them for basic research and drug discovery. In view of this, an ensemble classifier was developed by the approach of fusing many basic individual classifiers through a voting system. Each of these basic classifiers was trained in a different dimension of the amphiphilic pseudo amino acid composition (Chou [2005] Bioinformatics 21: 10-19). As a demonstration, predictions were performed with the fusion classifier for proteins among the following 14 localizations: (1) cell wall, (2) centriole, (3) chloroplast, (4) cytoplasm, (5) cytoskeleton, (6) endoplasmic reticulum, (7) extracellular, (8) Golgi apparatus, (9) lysosome, (10) mitochondria, (11) nucleus, (12) peroxisome, (13) plasma membrane, and (14) vacuole. The overall success rates thus obtained via the resubstitution test, jackknife test, and independent dataset test were all significantly higher than those by the existing classifiers. It is anticipated that the novel ensemble classifier may also become a very useful vehicle in classifying other attributes of proteins according to their sequences, such as membrane protein type, enzyme family/sub-family, G-protein coupled receptor (GPCR) type, and structural class, among many others. The fusion ensemble classifier will be available at www.pami.sjtu.edu.cn/people/hbshen.  相似文献   

2.
The nucleus is the brain of eukaryotic cells that guides the life processes of the cell by issuing key instructions. For in-depth understanding of the biochemical process of the nucleus, the knowledge of localization of nuclear proteins is very important. With the avalanche of protein sequences generated in the post-genomic era, it is highly desired to develop an automated method for fast annotating the subnuclear locations for numerous newly found nuclear protein sequences so as to be able to timely utilize them for basic research and drug discovery. In view of this, a novel approach is developed for predicting the protein subnuclear location. It is featured by introducing a powerful classifier, the optimized evidence-theoretic K-nearest classifier, and using the pseudo amino acid composition [K.C. Chou, PROTEINS: Structure, Function, and Genetics, 43 (2001) 246], which can incorporate a considerable amount of sequence-order effects, to represent protein samples. As a demonstration, identifications were performed for 370 nuclear proteins among the following 9 subnuclear locations: (1) Cajal body, (2) chromatin, (3) heterochromatin, (4) nuclear diffuse, (5) nuclear pore, (6) nuclear speckle, (7) nucleolus, (8) PcG body, and (9) PML body. The overall success rates thus obtained by both the re-substitution test and jackknife cross-validation test are significantly higher than those by existing classifiers on the same working dataset. It is anticipated that the powerful approach may also become a useful high throughput vehicle to bridge the huge gap occurring in the post-genomic era between the number of gene sequences in databases and the number of gene products that have been functionally characterized. The OET-KNN classifier will be available at www.pami.sjtu.edu.cn/people/hbshen.  相似文献   

3.
The biological functions of a protein are closely related to its attributes in a cell. With the rapid accumulation of newly found protein sequence data in databanks, it is highly desirable to develop an automated method for predicting the subcellular location of proteins. The establishment of such a predictor will expedite the functional determination of newly found proteins and the process of prioritizing genes and proteins identified by genomic efforts as potential molecular targets for drug design. The traditional algorithms for predicting these attributes were based solely on amino acid composition in which no sequence order effect was taken into account. To improve the prediction quality, it is necessary to incorporate such an effect. However, the number of possible patterns in protein sequences is extremely large, posing a formidable difficulty for realizing this goal. To deal with such difficulty, a well-developed tool in digital signal processing named digital Fourier transform (DFT) [1] was introduced. After being translated to a digital signal according to the hydrophobicity of each amino acid, a protein was analyzed by DFT within the frequency domain. A set of frequency spectrum parameters, thus obtained, were regarded as the factors to represent the sequence order effect. A significant improvement in prediction quality was observed by incorporating the frequency spectrum parameters with the conventional amino acid composition. One of the crucial merits of this approach is that many existing tools in mathematics and engineering can be easily applied in the predicting process. It is anticipated that digital signal processing may serve as a useful vehicle for many other protein science areas.  相似文献   

4.
A novel approach CE-Ploc is proposed for predicting protein subcellular locations by exploiting diversity both in feature and decision spaces. The diversity in a sequence of feature spaces is exploited using hydrophobicity and hydrophilicity of amphiphilic pseudo amino acid composition and a specific learning mechanism. Diversity in learning mechanisms is exploited by fusion of classifiers that are based on different learning mechanisms. Significant improvement in prediction performance is observed using jackknife and independent dataset tests.  相似文献   

5.
The function of the protein is closely correlated with its subcellular localization. Probing into the mechanism of protein sorting and predicting protein subcellular location can provide important clues or insights for understanding the function of proteins. In this paper, we introduce a new PseAAC approach to encode the protein sequence based on the physicochemical properties of amino acid residues. Each of the protein samples was defined as a 146D (dimensional) vector including the 20 amino acid composition components and 126 adjacent triune residues contents. To evaluate the effectiveness of this encoding scheme, we did jackknife tests on three datasets using the support vector machine algorithm. The total prediction accuracies are 84.9%, 91.2%, and 92.6%, respectively. The satisfactory results indicate that our method could be a useful tool in the area of bioinformatics and proteomics.  相似文献   

6.
MOTIVATION: Identifying the destination or localization of proteins is key to understanding their function and facilitating their purification. A number of existing computational prediction methods are based on sequence analysis. However, these methods are limited in scope, accuracy and most particularly breadth of coverage. Rather than using sequence information alone, we have explored the use of database text annotations from homologs and machine learning to substantially improve the prediction of subcellular location. RESULTS: We have constructed five machine-learning classifiers for predicting subcellular localization of proteins from animals, plants, fungi, Gram-negative bacteria and Gram-positive bacteria, which are 81% accurate for fungi and 92-94% accurate for the other four categories. These are the most accurate subcellular predictors across the widest set of organisms ever published. Our predictors are part of the Proteome Analyst web-service.  相似文献   

7.
The subcellular location of a protein is highly related to its function. Identifying the location of a given protein is an essential step for investigating its related problems. Traditional experimental methods can produce solid determination. However, their limitations, such as high cost and low efficiency, are evident. Computational methods provide an alternative means to address these problems. Most previous methods constantly extract features from protein sequences or structures for building prediction models. In this study, we use two types of features and combine them to construct the model. The first feature type is extracted from a protein–protein interaction network to abstract the relationship between the encoded protein and other proteins. The second type is obtained from gene ontology and biological pathways to indicate the existing functions of the encoded protein. These features are analyzed using some feature selection methods. The final optimum features are adopted to build the model with recurrent neural network as the classification algorithm. Such model yields good performance with Matthews correlation coefficient of 0.844. A decision tree is used as a rule learning classifier to extract decision rules. Although the performance of decision rules is poor, they are valuable in revealing the molecular mechanism of proteins with different subcellular locations. The final analysis confirms the reliability of the extracted rules. The source code of the propose method is freely available at https://github.com/xypan1232/rnnloc  相似文献   

8.
Shen HB  Yang J  Chou KC 《Amino acids》2007,33(1):57-67
With the avalanche of newly-found protein sequences emerging in the post genomic era, it is highly desirable to develop an automated method for fast and reliably identifying their subcellular locations because knowledge thus obtained can provide key clues for revealing their functions and understanding how they interact with each other in cellular networking. However, predicting subcellular location of eukaryotic proteins is a challenging problem, particularly when unknown query proteins do not have significant homology to proteins of known subcellular locations and when more locations need to be covered. To cope with the challenge, protein samples are formulated by hybridizing the information derived from the gene ontology database and amphiphilic pseudo amino acid composition. Based on such a representation, a novel ensemble hybridization classifier was developed by fusing many basic individual classifiers through a voting system. Each of these basic classifiers was engineered by the KNN (K-Nearest Neighbor) principle. As a demonstration, a new benchmark dataset was constructed that covers the following 18 localizations: (1) cell wall, (2) centriole, (3) chloroplast, (4) cyanelle, (5) cytoplasm, (6) cytoskeleton, (7) endoplasmic reticulum, (8) extracell, (9) Golgi apparatus, (10) hydrogenosome, (11) lysosome, (12) mitochondria, (13) nucleus, (14) peroxisome, (15) plasma membrane, (16) plastid, (17) spindle pole body, and (18) vacuole. To avoid the homology bias, none of the proteins included has > or =25% sequence identity to any other in a same subcellular location. The overall success rates thus obtained via the 5-fold and jackknife cross-validation tests were 81.6 and 80.3%, respectively, which were 40-50% higher than those performed by the other existing methods on the same strict dataset. The powerful predictor, named "Euk-PLoc", is available as a web-server at http://202.120.37.186/bioinf/euk . Furthermore, to support the need of people working in the relevant areas, a downloadable file will be provided at the same website to list the results predicted by Euk-PLoc for all eukaryotic protein entries (excluding fragments) in Swiss-Prot database that do not have subcellular location annotations or are annotated as being uncertain. The large-scale results will be updated twice a year to include the new entries of eukaryotic proteins and reflect the continuous development of Euk-PLoc.  相似文献   

9.

Background  

The subcellular location of a protein is closely related to its function. It would be worthwhile to develop a method to predict the subcellular location for a given protein when only the amino acid sequence of the protein is known. Although many efforts have been made to predict subcellular location from sequence information only, there is the need for further research to improve the accuracy of prediction.  相似文献   

10.
One of the critical challenges in predicting protein subcellular localization is how to deal with the case of multiple location sites. Unfortunately, so far, no efforts have been made in this regard except for the one focused on the proteins in budding yeast only. For most existing predictors, the multiple-site proteins are either excluded from consideration or assumed even not existing. Actually, proteins may simultaneously exist at, or move between, two or more different subcellular locations. For instance, according to the Swiss-Prot database (version 50.7, released 19-Sept-2006), among the 33,925 eukaryotic protein entries that have experimentally observed subcellular location annotations, 2715 have multiple location sites, meaning about 8% bearing the multiplex feature. Proteins with multiple locations or dynamic feature of this kind are particularly interesting because they may have some very special biological functions intriguing to investigators in both basic research and drug discovery. Meanwhile, according to the same Swiss-Prot database, the number of total eukaryotic protein entries (except those annotated with "fragment" or those with less than 50 amino acids) is 90,909, meaning a gap of (90,909-33,925) = 56,984 entries for which no knowledge is available about their subcellular locations. Although one can use the computational approach to predict the desired information for the blank, so far, all the existing methods for predicting eukaryotic protein subcellular localization are limited in the case of single location site only. To overcome such a barrier, a new ensemble classifier, named Euk-mPLoc, was developed that can be used to deal with the case of multiple location sites as well. Euk-mPLoc is freely accessible to the public as a Web server at http://202.120.37.186/bioinf/euk-multi. Meanwhile, to support the people working in the relevant areas, Euk-mPLoc has been used to identify all eukaryotic protein entries in the Swiss-Prot database that do not have subcellular location annotations or are annotated as being uncertain. The large-scale results thus obtained have been deposited at the same Web site via a downloadable file prepared with Microsoft Excel and named "Tab_Euk-mPLoc.xls". Furthermore, to include new entries of eukaryotic proteins and reflect the continuous development of Euk-mPLoc in both the coverage scope and prediction accuracy, we will timely update the downloadable file as well as the predictor, and keep users informed by publishing a short note in the Journal and making an announcement in the Web Page.  相似文献   

11.
The location of a protein in a cell is closely correlated with its biological function. Based on the concept that the protein subcellular location is mainly determined by its amino acid and pseudo amino acid composition (PseAA), a new algorithm of increment of diversity combined with support vector machine is proposed to predict the protein subcellular location. The subcellular locations of plant and non-plant proteins are investigated by our method. The overall prediction accuracies in jackknife test are 88.3% for the eukaryotic plant proteins and 92.4% for the eukaryotic non-plant proteins, respectively. In order to estimate the effect of the sequence identity on predictive result, the proteins with sequence identity 相似文献   

12.
Large-scale plant protein subcellular location prediction   总被引:1,自引:0,他引:1  
Current plant genome sequencing projects have called for development of novel and powerful high throughput tools for timely annotating the subcellular location of uncharacterized plant proteins. In view of this, an ensemble classifier, Plant-PLoc, formed by fusing many basic individual classifiers, has been developed for large-scale subcellular location prediction for plant proteins. Each of the basic classifiers was engineered by the K-Nearest Neighbor (KNN) rule. Plant-PLoc discriminates plant proteins among the following 11 subcellular locations: (1) cell wall, (2) chloroplast, (3) cytoplasm, (4) endoplasmic reticulum, (5) extracell, (6) mitochondrion, (7) nucleus, (8) peroxisome, (9) plasma membrane, (10) plastid, and (11) vacuole. As a demonstration, predictions were performed on a stringent benchmark dataset in which none of the proteins included has > or =25% sequence identity to any other in a same subcellular location to avoid the homology bias. The overall success rate thus obtained was 32-51% higher than the rates obtained by the previous methods on the same benchmark dataset. The essence of Plant-PLoc in enhancing the prediction quality and its significance in biological applications are discussed. Plant-PLoc is accessible to public as a free web-server at: (http://202.120.37.186/bioinf/plant). Furthermore, for public convenience, results predicted by Plant-PLoc have been provided in a downloadable file at the same website for all plant protein entries in the Swiss-Prot database that do not have subcellular location annotations, or are annotated as being uncertain. The large-scale results will be updated twice a year to include new entries of plant proteins and reflect the continuous development of Plant-PLoc.  相似文献   

13.
Knowledge of membrane protein type often provides crucial hints toward determining the function of an uncharacterized membrane protein. With the avalanche of new protein sequences emerging during the post-genomic era, it is highly desirable to develop an automated method that can serve as a high throughput tool in identifying the types of newly found membrane proteins according to their primary sequences, so as to timely make the relevant annotations on them for the reference usage in both basic research and drug discovery. Based on the concept of pseudo-amino acid composition [K.C. Chou, Proteins: Struct. Funct. Genet. 43 (2001) 246-255; Erratum: Proteins: Struct. Funct. Genet. 44 (2001) 60] that has made it possible to incorporate a considerable amount of sequence-order effects by representing a protein sample in terms of a set of discrete numbers, a novel predictor, the so-called "optimized evidence-theoretic K-nearest neighbor" or "OET-KNN" classifier, was proposed. It was demonstrated via the self-consistency test, jackknife test, and independent dataset test that the new predictor, compared with many previous ones, yielded higher success rates in most cases. The new predictor can also be used to improve the prediction quality for, among many other protein attributes, structural class, subcellular localization, enzyme family class, and G-protein coupled receptor type. The OET-KNN classifier will be available as a web-server at http://www.pami.sjtu.edu.cn/kcchou.  相似文献   

14.
Predicting unconventional protein secretion is a much harder problem than predicting signal peptide-based protein secretion, both due to the small number of examples and due to the heterogeneity and the limited knowledge of the pathways involved, especially in eukaryotes. However, the idea that secreted proteins share certain properties regardless of the secretion pathway used made it possible to construct the prediction method SecretomeP in 2004. Here, we take a critical look at SecretomeP and its successors, and we also discuss whether multi-category subcellular location predictors can be used to predict unconventional protein secretion in eukaryotes. A new benchmark shows SecretomeP to perform much worse than initially estimated, casting doubt on the underlying hypothesis. On a more positive note, recent developments in machine learning may have the potential to construct new methods which can not only predict unconventional protein secretion but also point out which parts of a sequence are important for secretion.  相似文献   

15.
Machine learning is a kind of reliable technology for automated subcellular localization of viral proteins within a host cell or virus-infected cell. One challenge is that the viral protein samples are not only with multiple location sites, but also class-imbalanced. The imbalanced dataset often decreases the prediction performance. In order to accomplish this challenge, this paper proposes a novel approach named imbalance-weighted multi-label K-nearest neighbor to predict viral protein subcellular location with multiple sites. The experimental results by jackknife test indicate that the presented algorithm achieves a better performance than the existing methods and has great potentials in protein science.  相似文献   

16.
This paper introduces a new subcellular localization system (TSSub) for eukaryotic proteins. This system extracts features from both profiles and amino acid sequences. Four different features are extracted from profiles by four probabilistic neural network (PNN) classifiers, respectively (the amino acid composition from whole profiles; the amino acid composition from the N-terminus of profiles; the dipeptide composition from whole profiles and the amino acid composition from fragments of profiles). In addition, a support vector machine (SVM) classifier is added to implement the residue-couple feature extracted from amino acid sequences. The results from the five classifiers are fused by an additional SVM classifier. The overall accuracies of this TSSub reach 93.0 and 77.4% on Reinhardt and Hubbard's eukaryotic protein dataset and Huang and Li's eukaryotic protein dataset, respectively. The comparison with existing methods results shows TSSub provides better prediction performance than existing methods. AVAILABILITY: The web server is available from http://166.111.24.5/webtools/TSSub/index.html.  相似文献   

17.
Membrane proteins are vital type of proteins that serve as channels, receptors, and energy transducers in a cell. Prediction of membrane protein types is an important research area in bioinformatics. Knowledge of membrane protein types provides some valuable information for predicting novel example of the membrane protein types. However, classification of membrane protein types can be both time consuming and susceptible to errors due to the inherent similarity of membrane protein types. In this paper, neural networks based membrane protein type prediction system is proposed. Composite protein sequence representation (CPSR) is used to extract the features of a protein sequence, which includes seven feature sets; amino acid composition, sequence length, 2 gram exchange group frequency, hydrophobic group, electronic group, sum of hydrophobicity, and R-group. Principal component analysis is then employed to reduce the dimensionality of the feature vector. The probabilistic neural network (PNN), generalized regression neural network, and support vector machine (SVM) are used as classifiers. A high success rate of 86.01% is obtained using SVM for the jackknife test. In case of independent dataset test, PNN yields the highest accuracy of 95.73%. These classifiers exhibit improved performance using other performance measures such as sensitivity, specificity, Mathew's correlation coefficient, and F-measure. The experimental results show that the prediction performance of the proposed scheme for classifying membrane protein types is the best reported, so far. This performance improvement may largely be credited to the learning capabilities of neural networks and the composite feature extraction strategy, which exploits seven different properties of protein sequences. The proposed Mem-Predictor can be accessed at http://111.68.99.218/Mem-Predictor.  相似文献   

18.
Li L  Zhang Y  Zou L  Li C  Yu B  Zheng X  Zhou Y 《PloS one》2012,7(1):e31057
With the rapid increase of protein sequences in the post-genomic age, it is challenging to develop accurate and automated methods for reliably and quickly predicting their subcellular localizations. Till now, many efforts have been tried, but most of which used only a single algorithm. In this paper, we proposed an ensemble classifier of KNN (k-nearest neighbor) and SVM (support vector machine) algorithms to predict the subcellular localization of eukaryotic proteins based on a voting system. The overall prediction accuracies by the one-versus-one strategy are 78.17%, 89.94% and 75.55% for three benchmark datasets of eukaryotic proteins. The improved prediction accuracies reveal that GO annotations and hydrophobicity of amino acids help to predict subcellular locations of eukaryotic proteins.  相似文献   

19.
Zheng X  Liu T  Wang J 《Amino acids》2009,37(2):427-433
A complexity-based approach is proposed to predict subcellular location of proteins. Instead of extracting features from protein sequences as done previously, our approach is based on a complexity decomposition of symbol sequences. In the first step, distance between each pair of protein sequences is evaluated by the conditional complexity of one sequence given the other. Subcellular location of a protein is then determined using the k-nearest neighbor algorithm. Using three widely used data sets created by Reinhardt and Hubbard, Park and Kanehisa, and Gardy et al., our approach shows an improvement in prediction accuracy over those based on the amino acid composition and Markov model of protein sequences.  相似文献   

20.
Liu DQ  Liu H  Shen HB  Yang J  Chou KC 《Amino acids》2007,32(4):493-496
Summary. A newly synthesized secretory protein in cells bears a special sequence, called signal peptide or sequence, which plays the role of “address tag” in guiding the protein to wherever it is needed. Such a unique function of signal sequences has stimulated novel strategies for drug design or reprogramming cells for gene therapy. To realize these new ideas and plans, however, it is important to develop an automated method for fast and accurately identifying the signal sequences or their cleavage sites. In this paper, a new method is developed for predicting the signal sequence of a query secretory protein by fusing the results from a series of global alignments through a voting system. The very high success rates thus obtained suggest that the novel approach is very promising, and that the new method may become a useful vehicle in identifying signal sequence, or at least serve as a complementary tool to the existing algorithms of this field.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号