共查询到20条相似文献,搜索用时 15 毫秒
1.
Predicting protein structural class with AdaBoost Learner 总被引:1,自引:0,他引:1
The structural class is an important feature in characterizing the overall topological folding type of a protein or the domains therein. Prediction of protein structural classification has attracted the attention and efforts from many investigators. In this paper a novel predictor, the AdaBoost Learner, was introduced to deal with this problem. The essence of the AdaBoost Learner is that a combination of many 'weak' learning algorithms, each performing just slightly better than a random guessing algorithm, will generate a 'strong' learning algorithm. Demonstration thru jackknife cross-validation on two working datasets constructed by previous investigators indicated that AdaBoost outperformed other predictors such as SVM (support vector machine), a powerful algorithm widely used in biological literatures. It has not escaped our notice that AdaBoost may hold a high potential for improving the quality in predicting the other protein features as well, such as subcellular location and receptor type, among many others. Or at the very least, it will play a complementary role to many of the existing algorithms in this regard. 相似文献
2.
Lu Z Szafron D Greiner R Lu P Wishart DS Poulin B Anvik J Macdonell C Eisner R 《Bioinformatics (Oxford, England)》2004,20(4):547-556
MOTIVATION: Identifying the destination or localization of proteins is key to understanding their function and facilitating their purification. A number of existing computational prediction methods are based on sequence analysis. However, these methods are limited in scope, accuracy and most particularly breadth of coverage. Rather than using sequence information alone, we have explored the use of database text annotations from homologs and machine learning to substantially improve the prediction of subcellular location. RESULTS: We have constructed five machine-learning classifiers for predicting subcellular localization of proteins from animals, plants, fungi, Gram-negative bacteria and Gram-positive bacteria, which are 81% accurate for fungi and 92-94% accurate for the other four categories. These are the most accurate subcellular predictors across the widest set of organisms ever published. Our predictors are part of the Proteome Analyst web-service. 相似文献
3.
4.
MOTIVATION: The localization of a protein in a cell is closely correlated with its biological function. With the number of sequences entering into databanks rapidly increasing, the importance of developing a powerful high-throughput tool to determine protein subcellular location has become self-evident. In view of this, the Nearest Neighbour Algorithm was developed for predicting the protein subcellular location using the strategy of hybridizing the information derived from the recent development in gene ontology with that from the functional domain composition as well as the pseudo amino acid composition. RESULTS: As a showcase, the same plant and non-plant protein datasets as investigated by the previous investigators were used for demonstration. The overall success rate of the jackknife test for the plant protein dataset was 86%, and that for the non-plant protein dataset 91.2%. These are the highest success rates achieved so far for the two datasets by following a rigorous cross-validation test procedure, suggesting that such a hybrid approach (particularly by incorporating the knowledge of gene ontology) may become a very useful high-throughput tool in the area of bioinformatics, proteomics, as well as molecular cell biology. AVAILABILITY: The software would be made available on sending a request to the authors. 相似文献
5.
Predicting Protein Subcellular Localization: Past, Present, and Future 总被引:10,自引:0,他引:10
6.
For a protein, an important characteristic is its location or compartment in a cell. This is because a protein has to be located in its proper position in a cell to perform its biological functions. Therefore, predicting protein subcellular location is an important and challenging task in current molecular and cellular biology. In this paper, based on AdaBoost.ME algorithm and Chou's PseAAC (pseudo amino acid composition), a new computational method was developed to identify protein subcellular location. AdaBoost.ME is an improved version of AdaBoost algorithm that can directly extend the original AdaBoost algorithm to deal with multi-class cases without the need to reduce it to multiple two-class problems. In some previous studies the conventional amino acid composition was applied to represent protein samples. In order to take into account the sequence order effects, in this study we use Chou's PseAAC to represent protein samples. To demonstrate that AdaBoost.ME is a robust and efficient model in predicting protein subcellular locations, the same protein dataset used by Cedano et al. (Journal of Molecular Biology, 1997, 266: 594-600) is adopted in this paper. It can be seen from the computed results that the accuracy achieved by our method is better than those by the methods developed by the previous investigators. 相似文献
7.
Information of protein subcellular location plays an important role in molecular cell biology. Prediction of the subcellular location of proteins will help to understand their functions and interactions. In this paper, a different mode of pseudo amino acid composition was proposed to represent protein samples for predicting their subcellular localization via the following procedures: based on the optimal splice site of each protein sequence, we divided a sequence into sorting signal part and mature protein part, and extracted sequence features from each part separately. Then, the combined features were fed into the SVM classifier to perform the prediction. By the jackknife test on a benchmark dataset in which none of proteins included has more than 90% pairwise sequence identity to any other, the overall accuracies achieved by the method are 94.5% and 90.3% for prokaryotic and eukaryotic proteins, respectively. The results indicate that the prediction quality by our method is quite satisfactory. It is anticipated that the current method may serve as an alternative approach to the existing prediction methods. 相似文献
8.
Predicting subcellular localization of proteins based on their N-terminal amino acid sequence 总被引:96,自引:0,他引:96
A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed. Using N-terminal sequence information only, it discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and "other" localizations with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets. From a TargetP analysis of the recently sequenced Arabidopsis thaliana chromosomes 2 and 4 and the Ensembl Homo sapiens protein set, we estimate that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%. TargetP also predicts cleavage sites with levels of correctly predicted sites ranging from approximately 40% to 50% (chloroplastic and mitochondrial presequences) to above 70% (secretory signal peptides). TargetP is available as a web-server at http://www.cbs.dtu.dk/services/TargetP/. 相似文献
9.
Background
Essential proteins play an indispensable role in the cellular survival and development. There have been a series of biological experimental methods for finding essential proteins; however they are time-consuming, expensive and inefficient. In order to overcome the shortcomings of biological experimental methods, many computational methods have been proposed to predict essential proteins. The computational methods can be roughly divided into two categories, the topology-based methods and the sequence-based ones. The former use the topological features of protein-protein interaction (PPI) networks while the latter use the sequence features of proteins to predict essential proteins. Nevertheless, it is still challenging to improve the prediction accuracy of the computational methods.Results
Comparing with nonessential proteins, essential proteins appear more frequently in certain subcellular locations and their evolution more conservative. By integrating the information of subcellular localization, orthologous proteins and PPI networks, we propose a novel essential protein prediction method, named SON, in this study. The experimental results on S.cerevisiae data show that the prediction accuracy of SON clearly exceeds that of nine competing methods: DC, BC, IC, CC, SC, EC, NC, PeC and ION.Conclusions
We demonstrate that, by integrating the information of subcellular localization, orthologous proteins with PPI networks, the accuracy of predicting essential proteins can be improved. Our proposed method SON is effective for predicting essential proteins.10.
Predicting subcellular localization of mycobacterial proteins by using Chou's pseudo amino acid composition 总被引:1,自引:0,他引:1
The successful prediction of protein subcellular localization directly from protein primary sequence is useful to protein function prediction and drug discovery. In this paper, by using the concept of pseudo amino acid composition (PseAAC), the mycobacterial proteins are studied and predicted by support vector machine (SVM) and increment of diversity combined with modified Mahalanobis Discriminant (IDQD). The results of jackknife cross-validation for 450 non-redundant proteins show that the overall predicted successful rates of SVM and IDQD are 82.2% and 79.1%, respectively. Compared with other existing methods, SVM combined with PseAAC display higher accuracies. 相似文献
11.
Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 protein sequences representing human proteins was extracted from the SWISS-PROT database. Feature vectors were created to capture specific amino acid sequence characteristics. Relative to a Support Vector Machine, a Multi-layer Perceptron, and a Naive Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in reliably predicting the subcellular localization of proteins based on their amino acid sequences (average Precision=0.88; average Sensitivity=0.86). Furthermore, EDA graphics characterized essential features of proteins in each compartment. As examples, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These data showed that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences. 相似文献
12.
Predicting subcellular localization of proteins by hybridizing functional domain composition and pseudo-amino acid composition 总被引:3,自引:0,他引:3
Recent advances in large-scale genome sequencing have led to the rapid accumulation of amino acid sequences of proteins whose functions are unknown. Since the functions of these proteins are closely correlated with their subcellular localizations, many efforts have been made to develop a variety of methods for predicting protein subcellular location. In this study, based on the strategy by hybridizing the functional domain composition and the pseudo-amino acid composition (Cai and Chou [2003]: Biochem. Biophys. Res. Commun. 305:407-411), the Intimate Sorting Algorithm (ISort predictor) was developed for predicting the protein subcellular location. As a showcase, the same plant and non-plant protein datasets as investigated by the previous investigators were used for demonstration. The overall success rate by the jackknife test for the plant protein dataset was 85.4%, and that for the non-plant protein dataset 91.9%. These are so far the highest success rates achieved for the two datasets by following a rigorous cross validation test procedure, further confirming that such a hybrid approach may become a very useful high-throughput tool in the area of bioinformatics, proteomics, as well as molecular cell biology. 相似文献
13.
The study of protein subcellular localization is important to elucidate protein function. Even in well-studied organisms such as yeast, experimental methods have not been able to provide a full coverage of localization. The development of bioinformatic predictors of localization can bridge this gap. We have created a Bayesian network predictor called PSLT2 that considers diverse protein characteristics, including the combinatorial presence of InterPro motifs and protein interaction data. We compared the localization predictions of PSLT2 to high-throughput experimental localization datasets. Disagreements between these methods generally involve proteins that transit through or reside in the secretory pathway. We used our multi-compartmental predictions to refine the localization annotations of yeast proteins primarily by distinguishing between soluble lumenal proteins and soluble proteins peripherally associated with organelles. To our knowledge, this is the first tool to provide this functionality. We used these sub-compartmental predictions to characterize cellular processes on an organellar scale. The integration of diverse protein characteristics and protein interaction data in an appropriate setting can lead to high-quality detailed localization annotations for whole proteomes. This type of resource is instrumental in developing models of whole organelles that provide insight into the extent of interaction and communication between organelles and help define organellar functionality. 相似文献
14.
Daniel Restrepo-Montoya Carolina Vizcaíno Luis F Niño Marisol Ocampo Manuel E Patarroyo Manuel A Patarroyo 《BMC bioinformatics》2009,10(1):134-9
Background
The computational prediction of mycobacterial proteins' subcellular localization is of key importance for proteome annotation and for the identification of new drug targets and vaccine candidates. Several subcellular localization classifiers have been developed over the past few years, which have comprised both general localization and feature-based classifiers. Here, we have validated the ability of different bioinformatics approaches, through the use of SignalP 2.0, TatP 1.0, LipoP 1.0, Phobius, PA-SUB 2.5, PSORTb v.2.0.4 and Gpos-PLoc, to predict secreted bacterial proteins. These computational tools were compared in terms of sensitivity, specificity and Matthew's correlation coefficient (MCC) using a set of mycobacterial proteins having less than 40% identity, none of which are included in the training data sets of the validated tools and whose subcellular localization have been experimentally confirmed. These proteins belong to the TBpred training data set, a computational tool specifically designed to predict mycobacterial proteins. 相似文献15.
Meta-prediction seeks to harness the combined strengths of multiple predicting programs with the hope of achieving predicting performance surpassing that of all existing predictors in a defined problem domain. We investigated meta-prediction for the four-compartment eukaryotic subcellular localization problem. We compiled an unbiased subcellular localization dataset of 1693 nuclear, cytoplasmic, mitochondrial and extracellular animal proteins from Swiss-Prot 50.2. Using this dataset, we assessed the predicting performance of 12 predictors from eight independent subcellular localization predicting programs: ELSPred, LOCtree, PLOC, Proteome Analyst, PSORT, PSORT II, SubLoc and WoLF PSORT. Gorodkin correlation coefficient (GCC) was one of the performance measures. Proteome Analyst is the best individual subcellular localization predictor tested in this four-compartment prediction problem, with GCC = 0.811. A reduced voting strategy eliminating six of the 12 predictors yields a meta-predictor (RAW-RAG-6) with GCC = 0.856, substantially better than all tested individual subcellular localization predictors (P = 8.2 × 10−6, Fisher's Z-transformation test). The improvement in performance persists when the meta-predictor is tested with data not used in its development. This and similar voting strategies, when properly applied, are expected to produce meta-predictors with outstanding performance in other life sciences problem domains. 相似文献
16.
Mechanisms of subcellular mRNA localization 总被引:17,自引:0,他引:17
17.
Nuclear localization of proteins is a crucial element in the dynamic life of the cell. It is complicated by the massive diversity of targeting signals and the existence of proteins that shuttle between the nucleus and cytoplasm. Nevertheless, a majority of subcellular localization tools that predict nuclear proteins have been developed without involving dual localized proteins in the data sets. Hence, in general, the existing models are focused on predicting statically nuclear proteins, rather than nuclear localization itself. We present an independent analysis of existing nuclear localization predictors, using a nonredundant data set extracted from Swiss-Prot R50.0. We demonstrate that accuracy on truly novel proteins is lower than that of previous estimations, and that existing models generalize poorly to dual localized proteins. We have developed a model trained to identify nuclear proteins including dual localized proteins. The results suggest that using more recent data and including dual localized proteins improves the overall prediction. The final predictor NUCLEO operates with a realistic success rate of 0.70 and a correlation coefficient of 0.38, as established on the independent test set. (NUCLEO is available at: http://pprowler.itee.uq.edu.au.). 相似文献
18.
Variable subcellular localization of glycosphingolipids 总被引:5,自引:1,他引:5
Although most glycosphingolipids (GSLs) are thought to be locatedin the outer leaflet of the plasma membrane, recent evidenceindicates that GSLs are also associated with intracellular organelles.We now report that the subcellular localization of GSLs variesdepending on the GSL structure and cell type. GSL localizationwas determined by indirect immunofluorescence microscopy offixed permeabilized cells. A single GSL exhibited variable subcellularlocalization in different cells. For example, antibody to GalCeris localized primarily to the plasma membrane of HaCaT II-3keratinocytes, but to intracellular organelies in other epithelialcells. GalCer is localized to small vesicles and tubulovesicularstructures in MDCK cells, and to the surface of phase-denselipid droplets in HepG2 hepatoma cells. Furthermore, withina single cell type, individual GSLs were found to exhibit differentpatterns of subcellular localization. In HepG2 cells, LacCerwas associated with small vesicles, which differed from thephase-dense vesicles stained by anti-GalCer, and Gb4Cer wasassociated with the intermediate filaments of the cytoskeleton.Both anti-GalCer and monoclonal antibody A2B5, which binds polysialogangliosides,localized to mitochondria. The distinct subcellular localizationpatterns of GSLs raise interesting questions about their functionsin different organelles. Together with published data on theenrichment of GSLs in specific organelles and in apical plasmamembrane, these findings indicate the existence of specificsorting mechanisms that regulate the intracellular transportand localization of GSLs. cytoskeleton glycosphingolipid intracellular organelles mitochondria subcellular localization 相似文献
19.
Predicting subcellular localization of proteins for Gram-negative bacteria by support vector machines based on n-peptide compositions 总被引:16,自引:0,他引:16
Gram-negative bacteria have five major subcellular localization sites: the cytoplasm, the periplasm, the inner membrane, the outer membrane, and the extracellular space. The subcellular location of a protein can provide valuable information about its function. With the rapid increase of sequenced genomic data, the need for an automated and accurate tool to predict subcellular localization becomes increasingly important. We present an approach to predict subcellular localization for Gram-negative bacteria. This method uses the support vector machines trained by multiple feature vectors based on n-peptide compositions. For a standard data set comprising 1443 proteins, the overall prediction accuracy reaches 89%, which, to the best of our knowledge, is the highest prediction rate ever reported. Our prediction is 14% higher than that of the recently developed multimodular PSORT-B. Because of its simplicity, this approach can be easily extended to other organisms and should be a useful tool for the high-throughput and large-scale analysis of proteomic and genomic data. 相似文献
20.
The more proteins diverged in sequence, the more difficult it becomes for bioinformatics to infer similarities of protein function and structure from sequence. The precise thresholds used in automated genome annotations depend on the particular aspect of protein function transferred by homology. Here, we presented the first large-scale analysis of the relation between sequence similarity and identity in subcellular localization. Three results stood out: (1) The subcellular compartment is generally more conserved than what might have been expected given that short sequence motifs like nuclear localization signals can alter the native compartment; (2) the sequence conservation of localization is similar between different compartments; and (3) it is similar to the conservation of structure and enzymatic activity. In particular, we found the transition between the regions of conserved and nonconserved localization to be very sharp, although the thresholds for conservation were less well defined than for structure and enzymatic activity. We found that a simple measure for sequence similarity accounting for pairwise sequence identity and alignment length, the HSSP distance, distinguished accurately between protein pairs of identical and different localizations. In fact, BLAST expectation values outperformed the HSSP distance only for alignments in the subtwilight zone. We succeeded in slightly improving the accuracy of inferring localization through homology by fine tuning the thresholds. Finally, we applied our results to the entire SWISS-PROT database and five entirely sequenced eukaryotes. 相似文献