Similar literature
1.
2.
A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed. Using N-terminal sequence information only, it discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and other localizations with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets. From a TargetP analysis of the recently sequenced Arabidopsis thaliana chromosomes 2 and 4 and the Ensembl Homo sapiens protein set, we estimate that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%. TargetP also predicts cleavage sites with levels of correctly predicted sites ranging from approximately 40% to 50% (chloroplastic and mitochondrial presequences) to above 70% (secretory signal peptides). TargetP is available as a web-server at http://www.cbs.dtu.dk/services/TargetP/.

3.
The successful prediction of protein subcellular localization directly from protein primary sequence is useful for protein function prediction and drug discovery. In this paper, using the concept of pseudo amino acid composition (PseAAC), mycobacterial proteins are studied and predicted by support vector machine (SVM) and by increment of diversity combined with modified Mahalanobis discriminant (IDQD). The results of jackknife cross-validation for 450 non-redundant proteins show that the overall prediction success rates of SVM and IDQD are 82.2% and 79.1%, respectively. Compared with other existing methods, SVM combined with PseAAC displays higher accuracy.
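
The jackknife (leave-one-out) test mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the 1-nearest-neighbour classifier is a hypothetical stand-in for SVM or IDQD, and the two-dimensional feature vectors are invented toy data rather than PseAAC features.

```python
def jackknife_accuracy(samples, labels, train, predict):
    """Hold out each sample once, train on the rest, and
    return the fraction of held-out samples predicted correctly."""
    correct = 0
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / len(samples)


# Hypothetical stand-in classifier: 1-nearest neighbour (not SVM or IDQD).
def train_1nn(xs, ys):
    return list(zip(xs, ys))


def predict_1nn(model, x):
    def sq_dist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return min(model, key=lambda pair: sq_dist(pair[0], x))[1]


# Invented toy feature vectors with two well-separated classes.
toy_samples = [(0.0, 0.1), (0.1, 0.0), (0.9, 1.0), (1.0, 0.9)]
toy_labels = ["A", "A", "B", "B"]
toy_accuracy = jackknife_accuracy(toy_samples, toy_labels, train_1nn, predict_1nn)
```

Because every sample is held out exactly once, the estimate uses all of the data for testing without ever testing on a sample that was seen during training.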

4.

Background

Annotations that describe the function of sequences are enormously important to researchers during laboratory investigations and when making computational inferences. However, there has been little investigation into the data quality of sequence function annotations. Here we have developed a new method of estimating the error rate of curated sequence annotations, and applied this to the Gene Ontology (GO) sequence database (GOSeqLite). This method involved artificially adding errors to sequence annotations at known rates, and used regression to model the impact on the precision of annotations based on BLAST matched sequences.

Results

We estimated the error rate of curated GO sequence annotations in the GOSeqLite database (March 2006) at between 28% and 30%. Annotations made without use of sequence similarity based methods (non-ISS) had an estimated error rate of between 13% and 18%. Annotations made with the use of sequence similarity methodology (ISS) had an estimated error rate of 49%.

Conclusion

While the overall error rate is reasonably low, it would be prudent to treat all ISS annotations with caution. Electronic annotators that use ISS annotations as the basis of predictions are likely to have higher false prediction rates, and for this reason designers of these systems should consider avoiding ISS annotations where possible. Electronic annotators that use ISS annotations to make predictions should be viewed sceptically. We recommend that curators thoroughly review ISS annotations before accepting them as valid. Overall, users of curated sequence annotations from the GO database should feel assured that they are using a comparatively high quality source of information.

5.
Most prediction methods for secretory proteins require the correct N-terminal end of the preprotein for correct classification. Because large-scale genome sequencing projects sometimes assign the 5'-end of genes incorrectly, many proteins are encoded without the correct N-terminus, leading to incorrect predictions. In this study, a systematic attempt has been made to predict secretory proteins irrespective of the presence or absence of N-terminal signal peptides (also known as classical and non-classical secreted proteins, respectively), using two machine-learning techniques: artificial neural network (ANN) and support vector machine (SVM). We trained and tested our methods on a dataset of 3321 secretory and 3654 non-secretory mammalian proteins using five-fold cross-validation. First, ANN-based modules were developed for predicting secretory proteins using 33 physico-chemical properties, amino acid composition and dipeptide composition, achieving accuracies of 73.1%, 76.1% and 77.1%, respectively. Similarly, SVM-based modules using 33 physico-chemical properties, amino acid composition and dipeptide composition achieved accuracies of 77.4%, 79.4% and 79.9%, respectively. In addition, BLAST and PSI-BLAST modules designed for predicting secretory proteins based on similarity search achieved 23.4% and 26.9% accuracy, respectively. Finally, we developed a hybrid approach by integrating the amino acid and dipeptide composition based SVM modules with the PSI-BLAST module, which increased the accuracy to 83.2%, significantly better than the individual modules. Using the hybrid module, we also achieved a sensitivity of 60.4% with only 5% false positive predictions. A web server, SRTpred, has been developed based on the above study for predicting classical and non-classical secreted proteins from the whole sequence of mammalian proteins; it is available from http://www.imtech.res.in/raghava/srtpred/.
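
The amino acid and dipeptide composition features named in this abstract can be computed roughly as below. This is a sketch under stated assumptions: the 20-letter alphabet ordering, the function names, and the choice to skip non-standard residues are illustrative and are not taken from SRTpred.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues


def amino_acid_composition(seq):
    """Return a 20-dimensional vector of residue fractions."""
    residues = [c for c in seq.upper() if c in AMINO_ACIDS]
    n = len(residues)
    return [residues.count(a) / n if n else 0.0 for a in AMINO_ACIDS]


def dipeptide_composition(seq):
    """Return a 400-dimensional vector of adjacent-pair fractions.
    Pairs containing a non-standard character are skipped (an assumption)."""
    seq = seq.upper()
    counts = {}
    total = 0
    for k in range(len(seq) - 1):
        pair = seq[k:k + 2]
        if pair[0] in AMINO_ACIDS and pair[1] in AMINO_ACIDS:
            counts[pair] = counts.get(pair, 0) + 1
            total += 1
    return [counts.get(a + b, 0) / total if total else 0.0
            for a, b in product(AMINO_ACIDS, repeat=2)]


# Arbitrary example sequence, used only to show the vector shapes.
example_aac = amino_acid_composition("MKTAYIAKQR")
example_dpc = dipeptide_composition("MKTAYIAKQR")
```

Both vectors have a fixed length regardless of sequence length, which is what lets them serve as input to a classifier such as an ANN or SVM.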

6.

Background  

Predicting the subcellular localization of proteins is important for determining the function of proteins. Previous work on predicting protein localization in Gram-negative bacteria has obtained good results. However, these methods had relatively low accuracy for the localization of extracellular proteins. This paper studies ways to improve the accuracy of predicting extracellular localization in Gram-negative bacteria.

7.
Classification of gene function remains one of the most important and demanding tasks in the post-genome era. Most of the current predictive computer methods rely on comparing features that are essentially linear to the protein sequence. However, features of a protein nonlinear to the sequence may also be predictive of its function. Machine learning methods, for instance Support Vector Machines (SVMs), are particularly suitable for exploiting such features. In this work we introduce SVM and the pseudo-amino acid composition, a collection of nonlinear features extractable from protein sequence, to the field of protein function prediction. We have developed prototype SVMs for binary classification of rRNA-, RNA-, and DNA-binding proteins. Using a protein's amino acid composition and limited range correlation of hydrophobicity and solvent accessible surface area as input, each of the SVMs predicts whether the protein belongs to one of the three classes. In self-consistency and cross-validation tests, which measure the success of learning and prediction, respectively, the rRNA-binding SVM has consistently achieved >95% accuracy. The RNA- and DNA-binding SVMs demonstrate more diverse accuracy, ranging from approximately 76% to approximately 97%. Analysis of the test results suggests directions for improving the SVMs.

8.
Ma J  Gu H 《BMB reports》2010,43(10):670-676
In this paper, a novel approach, ELM-PCA, is introduced for the first time to predict protein subcellular localization. Firstly, protein samples are represented by the pseudo amino acid composition (PseAAC). Secondly, principal component analysis (PCA) is employed to extract essential features. Finally, the Elman Recurrent Neural Network (RNN) is used as a classifier to identify the protein sequences. The results demonstrate that the proposed approach is effective and practical.

9.
The identification of the thermostability from the amino acid sequence information would be helpful in computational screening for thermostable proteins. We have developed a method to discriminate thermophilic and mesophilic proteins based on support vector machines. Using self-consistency validation, 5-fold cross-validation and independent testing procedure with other datasets, this module achieved overall accuracy of 94.2%, 90.5% and 92.4%, respectively. The performance of this SVM-based module was better than the classifiers built using alternative machine learning and statistical algorithms including artificial neural networks, Bayesian statistics, and decision trees, when evaluated using these three validation methods. The influence of protein size on prediction accuracy was also addressed.

10.
Tantoso E  Li KB 《Amino acids》2008,35(2):345-353
Identifying a protein's subcellular localization is an important step in understanding its function. However, the experimental work involved is usually laborious, time-consuming and costly. Computational prediction is therefore valuable for reducing this cost. Here we provide a method to predict protein subcellular localization by using amino acid composition and physicochemical properties. The method concatenates the information extracted from a protein's N-terminal, middle and full sequence. Each part is represented by amino acid composition, weighted amino acid composition, five-level grouping composition and five-level dipeptide composition. We divided our dataset into a training set and a testing set. The training set is used to determine the best performing amino acid index by five-fold cross-validation, whereas the testing set acts as the independent dataset to evaluate the performance of our model. With the novel representation method, we achieve an accuracy of approximately 75% on the independent dataset. We conclude that this new representation performs well and is able to extract the protein sequence information. We have developed a web server for predicting protein subcellular localization. The web server is available at http://aaindexloc.bii.a-star.edu.sg.

11.
Natural peptides and small proteins in general have amino acid compositions that diverge much more from the average composition of all proteins than do those of larger proteins. The effect is large and consistent enough to provide a rough check on the measured molecular mass of a protein and to indicate whether it is likely to have a significantly repetitive structure. For example, the alpha-chain of tropomyosin, a highly repetitive protein, has an amino acid composition that would be characteristic of a much smaller protein. The observation provides support for the suggestion [Taylor, Britton & van Heyningen (1983) Biochem. J. 209, 897-899] that tetanus toxin resembles a trimer of the light chain produced by proteolysis.

12.
In silico prediction of protein subcellular localization based on amino acid sequence can reveal valuable information about the protein's innate roles in the cell. Unfortunately, such prediction is made difficult by complex protein sorting signals. Some prediction methods are based on searching for similar proteins with known localization, assuming that known homologs exist. However, they may not perform well on proteins with no known homolog. In contrast, machine learning-based approaches attempt to infer a predictive model that describes the protein sorting signals. In doing so, however, they do not take advantage of known homologs (if they exist) by doing a simple "table lookup". Here, we capture the best of both worlds by combining both approaches. On a dataset with 12 locations, similarity-based and machine learning methods independently achieve an accuracy of 83.8% and 72.6%, respectively. Our hybrid approach yields an improved accuracy of 85.9%. We compared our method with three other methods' published results. For two of the methods, we used their published datasets for comparison. For the third we used the 12 location dataset. The Error Correcting Output Code algorithm was used to construct our predictive model. This algorithm gives attention to all the classes regardless of the number of instances and led to high accuracy within each class and a high prediction rate overall. We also illustrate how the machine learning classifier we use, built over a meaningful set of features, can produce interpretable rules that may provide valuable insights into complex protein sorting mechanisms.
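
One way to read the hybrid idea in this abstract is "use the table lookup when a close homolog exists, otherwise fall back to the learned model". The sketch below illustrates only that reading. The abstract does not state the actual combination rule, so the threshold, the function names, and the use of `difflib.SequenceMatcher` as a crude stand-in for a real similarity search are all assumptions.

```python
from difflib import SequenceMatcher


def hybrid_predict(query, known, fallback, min_similarity=0.8):
    """Return (location, source). `known` maps sequences to locations;
    `fallback` is any callable that predicts a location from a sequence.
    The 0.8 threshold is an arbitrary illustrative value."""
    best_location, best_score = None, 0.0
    for seq, location in known.items():
        # Crude string similarity, standing in for a homology search.
        score = SequenceMatcher(None, query, seq).ratio()
        if score > best_score:
            best_location, best_score = location, score
    if best_score >= min_similarity:
        return best_location, "lookup"
    return fallback(query), "model"


# Invented example data; the fallback is a placeholder for a trained classifier.
known_proteins = {"MKTAYIAKQRQISFVK": "cytoplasm"}
placeholder_model = lambda seq: "nucleus"

hit = hybrid_predict("MKTAYIAKQRQISFVK", known_proteins, placeholder_model)
miss = hybrid_predict("GGGGGGGG", known_proteins, placeholder_model)
```

Returning the source alongside the location makes it easy to check how often each branch is used on a given dataset.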

13.
MOTIVATION: Functional annotation of unknown proteins is a major goal in proteomics. A key annotation is the prediction of a protein's subcellular localization. Numerous prediction techniques have been developed, typically focusing on a single underlying biological aspect or predicting a subset of all possible localizations. An important step is taken towards emulating the protein sorting process by capturing and bringing together biologically relevant information, and addressing the clear need to improve prediction accuracy and localization coverage. RESULTS: Here we present a novel SVM-based approach for predicting subcellular localization, which integrates N-terminal targeting sequences, amino acid composition and protein sequence motifs. We show how this approach improves the prediction based on N-terminal targeting sequences, by comparing our method TargetLoc against existing methods. Furthermore, MultiLoc performs considerably better than comparable methods predicting all major eukaryotic subcellular localizations, and shows better or comparable results to methods that are specialized on fewer localizations or for one organism. AVAILABILITY: http://www-bs.informatik.uni-tuebingen.de/Services/MultiLoc/

14.
Identifying the subcellular localization of proteins is particularly helpful in the functional annotation of gene products. In this study, we use Machine Learning and Exploratory Data Analysis (EDA) techniques to examine and characterize amino acid sequences of human proteins localized in nine cellular compartments. A dataset of 3,749 protein sequences representing human proteins was extracted from the SWISS-PROT database. Feature vectors were created to capture specific amino acid sequence characteristics. Relative to a Support Vector Machine, a Multi-layer Perceptron, and a Naive Bayes classifier, the C4.5 Decision Tree algorithm was the most consistent performer across all nine compartments in reliably predicting the subcellular localization of proteins based on their amino acid sequences (average Precision=0.88; average Sensitivity=0.86). Furthermore, EDA graphics characterized essential features of proteins in each compartment. As examples, proteins localized to the plasma membrane had higher proportions of hydrophobic amino acids; cytoplasmic proteins had higher proportions of neutral amino acids; and mitochondrial proteins had higher proportions of neutral amino acids and lower proportions of polar amino acids. These data showed that the C4.5 classifier and EDA tools can be effective for characterizing and predicting the subcellular localization of human proteins based on their amino acid sequences.
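
The per-compartment precision and sensitivity reported in this abstract follow the standard definitions, precision = TP / (TP + FP) and sensitivity = TP / (TP + FN). A small sketch with invented labels (not data from the study) shows the calculation for one class at a time:

```python
def precision_and_sensitivity(true_labels, predicted_labels, cls):
    """Compute precision and sensitivity (recall) for a single class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity


# Invented toy labels for two compartments.
toy_true = ["nuc", "nuc", "cyt", "cyt", "cyt"]
toy_pred = ["nuc", "cyt", "cyt", "cyt", "nuc"]
cyt_scores = precision_and_sensitivity(toy_true, toy_pred, "cyt")
nuc_scores = precision_and_sensitivity(toy_true, toy_pred, "nuc")
```

Averaging these per-class values over all compartments gives figures of the same kind as the averages quoted above.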

15.
The proteins in the hinge ligaments of molluscan bivalves were subjected to chemotaxonomic studies according to their amino acid compositions. The hinge-ligament protein is a new class of structure proteins, and this is the first attempt to introduce chemical taxonomy into the systematics of bivalves. The hinge-ligament proteins from morphologically close species, namely mactra (superfamily Mactracea) or scallop (family Pectinidae) species, showed high intraspecific homology in their compositions. On the other hand, inconsistent results were obtained with two types of ligament proteins in pearl oyster species (genus Pinctada). The results of our chemotaxonomic analyses were sometimes in good agreement with the morphological classifications and sometimes inconsistent, implying a complicated phylogenetic relationship among the species.

16.
In this study we classified regions of random coil into four types: coil between alpha helix and beta strand, coil between beta strand and alpha helix, coil between two alpha helices and coil between two beta strands. This classification may be considered as natural. We used 610 3D structures of proteins collected from the Protein Data Bank from bacteria with low, average and high genomic GC-content. Relatively short regions of coil are not random: certain amino acid residues are more or less frequent in each of the types of coil. Namely, hydrophobic amino acids with branched side chains (Ile, Val and Leu) are rare in coil between two beta strands, unlike some acrophilic amino acids (Asp, Asn and Gly). In contrast, coil between two alpha helices is enriched by Leu. Regions of coil between alpha helix and beta strand are enriched by positively charged amino acids (Arg and Lys), while the usage of residues with side chains possessing hydroxyl group (Ser and Thr) is low in them, in contrast to the regions of coil between beta strand and alpha helix. Regions of coil between beta strand and alpha helix are significantly enriched by Cys residues. The response to the symmetric mutational pressure (AT-pressure or GC-pressure) is also quite different for four types of coil. The most conserved regions of coil are “connecting bridges” between beta strand and alpha helix, since their amino acid content shows less strong dependence on GC-content of genes than amino acid contents of other three types of coil. Possible causes and consequences of the described differences in amino acid content distribution between different types of random coil have been discussed.

17.
The subunit stoichiometry of a large, multisubunit protein can be determined from the molar amino acid compositions (i amino acids) of the protein and its subunits. The number of copies of the subunits (1, 2, ... j) is calculated by solving all possible combinations of simultaneous equations in j unknowns (i!/j!(i - j)!). Calculations carried out using the published amino acid compositions determined by analysis and the compositions calculated from the sequences for two proteins of known stoichiometry provided the following results: Escherichia coli aspartate transcarbamoylase (R6C6, Mr = 307.5 kDa), R = 5.6 to 6.6 and C = 5.8 to 6.3, and spinach ribulose-bisphosphate carboxylase (L8S8, Mr = 535 kDa), L = 7.3 to 9.1 and S = 5.6 to 10.6. Calculations were also carried out with the amino acid compositions of two much larger proteins, the E. coli pyruvate dehydrogenase complex, Mr = 5280 kDa, subunits E1 (99.5 kDa), E2 (66 kDa), and E3 (50.6 kDa), and the extracellular hemoglobin of Lumbricus terrestris, Mr = 3760 kDa, subunits M (17 kDa), D1 (31 kDa), D2 (37 kDa), and T (51 kDa); the results for PDHase were E1 = 20 to 24, E2 = 18 to 31, E3 = 21 to 33 and those for Lumbricus hemoglobin were M = 34 to 46, D1 = 13 to 19, D2 = 13 to 18, and T = 34 to 36. Although the sample standard deviations of the mean values are generally high, the proposed method works surprisingly well for the two smaller proteins and provides physically reasonable results for the two larger proteins.
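
The calculation described in this abstract can be sketched as follows: for each choice of j amino acids out of i, the residue counts of the whole protein are written as a sum of subunit residue counts weighted by the unknown copy numbers, and the resulting j-by-j linear system is solved. There are i!/(j!(i - j)!) such systems. The compositions below are invented so that the true answer is exactly six copies of each subunit; they are not the published aspartate transcarbamoylase data, and the function names are hypothetical.

```python
from itertools import combinations
from math import comb


def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting.
    Returns None when the system is (numerically) singular."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return None
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= factor * m[col][c]
    return [m[k][n] / m[k][k] for k in range(n)]


def copy_number_estimates(protein, subunits):
    """protein: residues of each amino acid per mole of complex (length i).
    subunits: j vectors of residues per mole of each subunit (each length i).
    Returns one copy-number estimate per solvable combination of j amino acids."""
    i, j = len(protein), len(subunits)
    estimates = []
    for chosen in combinations(range(i), j):
        a = [[sub[k] for sub in subunits] for k in chosen]
        b = [protein[k] for k in chosen]
        solution = solve_linear(a, b)
        if solution is not None:
            estimates.append(solution)
    return estimates


# Invented compositions for 4 amino acids and 2 subunits (true answer: 6 and 6).
subunit_r = [10, 20, 5, 8]
subunit_c = [30, 12, 9, 4]
complex_composition = [6 * r + 6 * c for r, c in zip(subunit_r, subunit_c)]
estimates = copy_number_estimates(complex_composition, [subunit_r, subunit_c])
expected_systems = comb(len(complex_composition), 2)
```

With error-free input every combination returns the same answer; with measured compositions the spread of the estimates gives ranges like those quoted in the abstract.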

18.
A maize root fraction which inactivates nitrate reductase has been shown to have protease activity which can be measured by the hydrolysis of azocasein. This inactivating enzyme was also found to inactivate yeast tryptophan synthase. Yeast proteases A and B, which inactivate this latter enzyme, also gave a specific inactivation of the maize nitrate reductase. The maize root inactivating enzyme, like yeast protease B, degraded casein, and was inhibited by phenylmethylsulphonyl fluoride. A partially-purified yeast inhibitor prevented catalysis by the yeast proteases and maize root inactivating enzyme, but purified yeast inhibitors were without effect on the latter protein. The level of nitrate reductase-inactivating activity, and associated azocasein-degrading activity, increased with age of the maize root. Evidence was obtained for a heat stable inhibitor which maintained them in an inactive state, especially in the young root tip cells.

19.
Prediction of protein subcellular localization using increment of diversity combined with support vector machine
Zhao Yu  Zhao Judong  Yao Long 《生物信息学》2010,8(3):237-239,244
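Functional annotation of unknown proteins is a major goal of proteomics. A key annotation is the prediction of protein subcellular localization. In this paper, increment of diversity combined with support vector machine (ID_SVM) is applied to predict five classes of subcellular locations of Gram-positive bacterial proteins. In an independent test, the overall prediction success rate is 89.66%. The results show that the ID_SVM algorithm greatly improves the prediction success rate.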

20.
Spred-1 and Spred-2 (Sprouty-related protein with an EVH1 domain) are recently described members of the EVH1 (Ena/VASP-homology domain 1) family. Both Spred-1 and Spred-2 are membrane-associated substrates of receptor tyrosine kinases and they act as negative regulators of the Ras pathway upon growth factor stimulation. Since the Spred family members seem to exert overlapping molecular functions, the isotype-specific function of each member remains enigmatic. To date, no comprehensive expression profiling of Spred proteins has been shown. Therefore, we compared mRNA and protein expression patterns of Spred-1 and Spred-2 systematically in mouse organs. Furthermore, we focused on the tissue-specific expression of Spred-2 in adult human tissues, the subcellular localization, and the potential role of Spred-2 in the organism. Our studies show that expression patterns of Spred-1 and Spred-2 differ markedly among various tissues and cell types. In mouse, Spred-1 and Spred-2 were found to be expressed predominantly in brain, whereas Spred-2 was found to be more widely expressed in various adult tissues than Spred-1. In humans, Spred-2 was found to be strongly expressed in glandular epithelia and, at the subcellular level, its immunoreactivity was associated with secretory vesicles. Using confocal microscopy we found Spred-2 to be strongly colocalized with Rab11 and, to a lesser extent, with Rab5a GTPase, an observation that was not made for Spred-1. We conclude that the two members of the recently discovered Spred protein family, Spred-1 and Spred-2, show a highly specific expression pattern in various tissues reflecting a specific physiological role for the individual Spred isoforms in these tissues. Furthermore, it becomes most likely that Spred-2 is involved in the regulation of secretory pathways.
