Similar articles
Found 20 similar articles (search time: 78 ms)
1.
Given the sequence of a protein, how can we predict whether it is a membrane protein or a non-membrane protein? If it is a membrane protein, which type does it belong to? Since these questions are closely relevant to the function of an uncharacterized protein, their importance is self-evident. In particular, with the explosion of protein sequences entering databanks and the much slower progress in determining their functions by biochemical experiments, an automated method that gives fast answers to these questions is highly desirable. By hybridizing the functional domain (FunD) and pseudo-amino acid composition (PseAA), a new strategy called the FunD-PseAA predictor was introduced. To test the power of the predictor, a highly non-homologous data set was constructed in which none of the proteins has ≥25% sequence identity to any other. The overall success rates obtained with the FunD-PseAA predictor on this data set by the jackknife cross-validation test were 85% in distinguishing membrane proteins from non-membrane proteins, and 91% in identifying the membrane protein type among the following 5 categories: (1) type-1 membrane protein, (2) type-2 membrane protein, (3) multipass transmembrane protein, (4) lipid chain-anchored membrane protein, and (5) GPI-anchored membrane protein. These rates are much higher than those obtained by other methods on the same stringent data set, indicating that the FunD-PseAA predictor may become a useful high-throughput tool in bioinformatics and proteomics.
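
The jackknife (leave-one-out) cross-validation test cited throughout these abstracts can be sketched as follows. This is only a minimal illustration: the nearest-neighbour rule, the two-dimensional feature vectors, and the labels are assumptions chosen for the example, not the FunD-PseAA predictor itself.

```python
def nearest_neighbor_predict(train, query):
    """Return the label of the training vector closest to `query`.

    `train` is a list of (vector, label) pairs; distance is squared Euclidean.
    This simple rule is a stand-in for whatever predictor is being tested.
    """
    best = min(train, key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)))
    return best[1]


def jackknife_success_rate(samples, predict=nearest_neighbor_predict):
    """Leave each sample out in turn, predict it from the rest, and
    return the fraction of samples predicted correctly."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]  # every sample except the i-th
        if predict(rest, vector) == label:
            correct += 1
    return correct / len(samples)


# Hypothetical toy data: two well-separated classes.
toy_samples = [
    ([0.0, 0.1], "membrane"),
    ([0.1, 0.0], "membrane"),
    ([1.0, 0.9], "non-membrane"),
    ([0.9, 1.0], "non-membrane"),
]
rate = jackknife_success_rate(toy_samples)
```

Because every sample is tested exactly once against a model built from all the others, the result does not depend on how the data are split, which is presumably why the abstracts treat it as a rigorous test.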

2.
According to their main EC (Enzyme Commission) numbers, enzymes are classified into the following 6 main classes: oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases. A new method has been developed to predict the enzymatic attribute of proteins by using the functional domain composition to represent a given protein sequence. The advantage of doing so is that both the sequence-order-related features and the function-related features are naturally incorporated in the predictor. As a demonstration, the jackknife cross-validation test was performed on a dataset consisting of proteins with less than 20% sequence identity to each other, in order to remove any homology bias. The overall success rate thus obtained was 85% in identifying the enzyme family classes (including the identification of non-enzyme protein sequences). This success rate is significantly higher than those obtained by other methods on such a stringent dataset. This indicates that using the functional domain composition to represent protein samples for statistical prediction is indeed very promising, and will become a powerful tool in bioinformatics and proteomics.

3.
Prediction of protease types in a hybridization space   Total citations: 2, self-citations: 0, citations by others: 2
By controlling the activation, synthesis, and turnover of proteins, proteases regulate most physiological processes and play pivotal roles in conception, birth, digestion, growth, maturation, ageing, and death of all organisms. Different types of proteases have different functions and are involved in different biological processes. Therefore, it is important for both basic research and drug discovery to consider the following two problems. (1) Given the sequence of a protein, can we identify whether it is a protease or a non-protease? (2) If it is a protease, what type does it belong to? Although the two problems can be solved by various experimental means, doing so is both time-consuming and costly. The avalanche of protein sequences generated in the post-genomic era has challenged us to develop an automated method for making a fast and reliable identification. By hybridizing the functional domain composition and pseudo-amino acid composition, we have introduced a new method called the "FunD-PseAA predictor" that operates in a hybridization space. To avoid redundancy and bias, demonstrations were performed on a dataset in which none of the proteins has ≥25% sequence identity to any other. The overall success rate thus obtained by the jackknife cross-validation test in identifying protease and non-protease was 92.95%, and that in identifying the protease type was 94.75% among the following six types: (1) aspartic, (2) cysteine, (3) glutamic, (4) metallo, (5) serine, and (6) threonine. A demonstration was also made on an independent dataset, and the corresponding overall success rates were 98.36% and 97.11%, respectively, suggesting that the FunD-PseAA predictor is very powerful and may become a useful tool in bioinformatics and proteomics.

4.
A new method has been developed to predict the enzymatic attribute of proteins by hybridizing the gene product composition and pseudo amino acid composition. As a demonstration, a working dataset was generated with a cutoff of 60% sequence identity to avoid redundancy and bias in statistical prediction. The dataset thus constructed contains 39989 protein sequences, of which 27469 are non-enzymes and 12520 are enzymes; the enzymes were further classified into 6 enzyme family classes according to their main EC (Enzyme Commission) numbers (2314 oxidoreductases, 3653 transferases, 3246 hydrolases, 1307 lyases, 676 isomerases, and 1324 ligases). The overall success rate by the jackknife test in distinguishing enzymes from non-enzymes was 94%, and that in identifying the 6 enzyme family classes was 98%. It is anticipated that, with the rapid increase of protein sequences entering databanks, the current method will become a useful automated tool for identifying the enzymatic attribute of a newly found protein sequence.

5.
Given the sequence of a protein, how can we predict whether it is an enzyme or a non-enzyme? If it is an enzyme, which enzyme family class does it belong to? Because these questions are closely relevant to the biological function of a protein and the object it acts on, their importance is self-evident. In particular, with the explosion of protein sequences entering data banks and the much slower progress in determining their functions by biochemical experiments, an automated method that gives fast answers to these questions is highly desirable. By hybridizing the gene ontology and pseudo-amino-acid composition, we have introduced a new method, called the GO-PseAA predictor, that operates in a hybridization space. To avoid redundancy and bias, demonstrations were performed on a data set in which none of the proteins in an individual class has ≥40% sequence identity to any other. The overall success rate thus obtained by the jackknife cross-validation test in identifying enzyme and non-enzyme was 93%, and that in identifying the enzyme family was 94% for the following six main Enzyme Commission (EC) classes: (1) oxidoreductase, (2) transferase, (3) hydrolase, (4) lyase, (5) isomerase, and (6) ligase. The corresponding rates by the independent data set test were 98% and 97%, respectively.

6.
Enzyme function is much less conserved than anticipated, i.e., the level of sequence similarity needed to imply similarity in enzymatic function is much higher than the level needed to imply similarity in protein structure. This is because the function of an enzyme is an extremely complicated problem that may involve very subtle structural details as well as many other physicochemical factors. Accordingly, an approach based simply on sequence similarity would hardly achieve a decent success rate in predicting the enzyme sub-class, even for a dataset consisting of samples with 50% sequence identity. To cope with this situation, the GO-PseAA predictor was adopted to identify the sub-class for each of the six main enzyme families. It has been observed that, even for much more stringent datasets in which none of the enzymes has ≥25% sequence identity to any other, the overall success rates are 73-95%, suggesting that the GO-PseAA predictor can catch the core features of the statistical samples concerned and may become a useful high-throughput tool in proteomics and bioinformatics.

7.
Zhou GP, Cai YD. Proteins, 2006, 63(3): 681-684
Proteases play a vitally important role in regulating most physiological processes. Different types of proteases perform different functions in different biological processes. Therefore, a fast and reliable means of identifying the types of proteases from their sequences, or even just of identifying whether they are proteases or nonproteases, is highly desirable. The avalanche of protein sequences generated in the postgenomic era has made this challenge even more critical and urgent. By hybridizing the gene ontology approach and the pseudo amino acid composition approach, a powerful predictor called the GO-PseAA predictor was introduced to address these problems. To avoid redundancy and bias, demonstrations were performed on a dataset in which none of the proteins has ≥25% sequence identity to any other. The overall success rate thus obtained by the jackknife cross-validation test in identifying protease and nonprotease was 91.82%, and that in identifying the protease type was 85.49% among the following five types: (1) aspartic, (2) cysteine, (3) metallo, (4) serine, and (5) threonine. The high jackknife success rates obtained for such a stringent dataset indicate that the GO-PseAA predictor is very powerful and might become a useful tool in bioinformatics and proteomics.

8.
MOTIVATION: With protein sequences entering databanks at an explosive pace, the early determination of the family or subfamily class of a newly found enzyme molecule becomes important because it is directly related to which specific target the enzyme acts on, as well as to its catalytic process and biological function. Unfortunately, determining this by experiments alone is both time-consuming and costly. In a previous study, the covariant-discriminant algorithm was introduced to identify the 16 subfamily classes of oxidoreductases. Although the results were quite encouraging, the entire prediction process was based on the amino acid composition alone, without any sequence-order information. Therefore, the problem is worthy of further investigation. RESULTS: To incorporate sequence-order effects into the predictor, the 'amphiphilic pseudo amino acid composition' is introduced to represent the statistical sample of a protein. The novel representation contains 20 + 2λ discrete numbers: the first 20 numbers are the components of the conventional amino acid composition; the next 2λ numbers are a set of correlation factors that reflect different hydrophobicity and hydrophilicity distribution patterns along a protein chain. Based on this concept and formulation scheme, a new predictor is developed. The self-consistency test, the jackknife test, and independent dataset tests show that the success rates obtained by the new predictor are all significantly higher than those of the previous predictors. The significant enhancement in success rates also implies that the distribution of hydrophobicity and hydrophilicity of the amino acid residues along a protein chain plays a very important role in its structure and function.
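
The 20 + 2λ representation described above can be sketched roughly as follows. Several things here are assumptions, since the abstract does not specify them: the two scales (Kyte-Doolittle hydropathy and Hopp-Woods hydrophilicity are used as stand-ins), the standardization of the scales, the weight `weight = 0.5`, and the exact normalization of the final vector.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumed stand-in scales; the abstract does not name the scales actually used.
HYDROPHOBICITY = {  # Kyte-Doolittle hydropathy
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8, "G": -0.4, "H": -3.2,
    "I": 4.5, "K": -3.9, "L": 3.8, "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5,
    "R": -4.5, "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}
HYDROPHILICITY = {  # Hopp-Woods hydrophilicity
    "A": -0.5, "C": -1.0, "D": 3.0, "E": 3.0, "F": -2.5, "G": 0.0, "H": -0.5,
    "I": -1.8, "K": 3.0, "L": -1.8, "M": -1.3, "N": 0.2, "P": 0.0, "Q": 0.2,
    "R": 3.0, "S": 0.3, "T": -0.4, "V": -1.5, "W": -3.4, "Y": -2.3,
}


def standardize(scale):
    """Rescale a 20-residue property scale to zero mean and unit variance."""
    mean = sum(scale[a] for a in AMINO_ACIDS) / 20
    sd = (sum((scale[a] - mean) ** 2 for a in AMINO_ACIDS) / 20) ** 0.5
    return {a: (scale[a] - mean) / sd for a in AMINO_ACIDS}


def amphiphilic_pseaac(seq, lam=2, weight=0.5):
    """Return a vector of 20 + 2*lam numbers for a protein sequence.

    The first 20 entries come from the amino acid composition; the next
    2*lam entries are correlation factors of the two property scales
    between residues separated by 1..lam positions.
    """
    n = len(seq)
    if lam >= n:
        raise ValueError("lam must be smaller than the sequence length")
    h1, h2 = standardize(HYDROPHOBICITY), standardize(HYDROPHILICITY)
    counts = Counter(seq)
    freqs = [counts[a] / n for a in AMINO_ACIDS]
    taus = []
    for k in range(1, lam + 1):
        for h in (h1, h2):  # one factor per scale for each separation k
            taus.append(sum(h[seq[i]] * h[seq[i + k]] for i in range(n - k)) / (n - k))
    denom = sum(freqs) + weight * sum(taus)
    return [f / denom for f in freqs] + [weight * t / denom for t in taus]


vector = amphiphilic_pseaac("AAAAAA", lam=2)
```

The first 20 entries ignore residue order entirely, so any sequence-order information the vector carries comes from the 2λ correlation factors; a larger λ captures longer-range patterns at the cost of a higher-dimensional representation.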

9.
The fold pattern of a protein is one level deeper than its structural classification, and hence is more challenging and complicated to predict. Many efforts have been made in this regard, but so far all the reported success rates are still under 70%, indicating that it is extremely difficult to enhance the success rate even by 1% or 2%. To address this problem, a novel approach is proposed here that combines functional domain information and sequential evolution information through a fusion ensemble classifier. The predictor thus developed is called PFP-FunDSeqE. Tests were performed for identifying proteins among their 27 fold patterns. Compared with the existing predictors tested on the same stringent benchmark dataset, the new predictor can, for the first time, achieve a success rate of over 70%. The PFP-FunDSeqE predictor is freely available to the public as a web server at http://www.csbio.sjtu.edu.cn/bioinf/PFP-FunDSeqE/.

10.
Cell membranes are crucial to the life of a cell. Although the basic structure of a biological membrane is provided by the lipid bilayer, most of the specific functions are carried out by membrane proteins. Knowledge of the membrane protein type often offers important clues toward determining the function of an uncharacterized protein. Therefore, predicting the type of a membrane protein from its primary sequence, or even just identifying whether an uncharacterized protein is a membrane protein or not, is an important and challenging problem in bioinformatics and proteomics. To deal with these problems, the GO-PseAA predictor is introduced, which operates in a hybridization space by combining the gene ontology and pseudo amino acid composition. To test the prediction quality, a dataset was constructed that contains 6476 non-membrane proteins and 5122 membrane proteins classified into five different types. To avoid redundancy and bias, none of the proteins included has ≥40% sequence identity to any other. It has been observed that the overall success rate by the jackknife cross-validation test in identifying non-membrane proteins and membrane proteins was 94.76%, and that in identifying the five membrane protein types was 95.84%. The high success rates suggest that the GO-PseAA predictor can catch the core features of the statistical samples concerned and may become an automated high-throughput tool in molecular and cell biology.

11.
Predominantly occurring on cytosine, DNA methylation is a process by which cells can modify their DNA to change the expression of gene products. It plays very important roles in development but also in the formation of nearly all types of cancer. Therefore, knowledge of DNA methylation sites is significant for both basic research and drug development. Given an uncharacterized DNA sequence containing many cytosine residues, which ones can be methylated and which ones cannot? With the avalanche of DNA sequences generated during the postgenomic age, computational methods for accurately identifying the methylation sites in DNA are highly desirable. Using the trinucleotide composition, pseudo amino acid components, and a dataset-optimizing technique, we have developed a new predictor called "iDNA-Methyl" that has achieved remarkably higher success rates in identifying DNA methylation sites than the existing predictors. A user-friendly web-server for the new predictor has been established at http://www.jci-bioinfo.cn/iDNA-Methyl, where users can easily get their desired results. We anticipate that the web-server predictor will become a very useful high-throughput tool for basic research and drug development and that the novel approach and technique can also be used to investigate many other DNA-related problems and genome analysis.
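
As an illustration of the first ingredient named above, a trinucleotide composition can be computed as a 64-dimensional frequency vector. This is only a sketch under assumptions: overlapping windows of step 1 are used and windows containing non-ACGT characters are skipped; the abstract does not say how iDNA-Methyl handles either point, and the pseudo amino acid components and dataset-optimizing step are not shown.

```python
from itertools import product

# All 64 trinucleotides in a fixed order: AAA, AAC, ..., TTT.
TRINUCLEOTIDES = ["".join(p) for p in product("ACGT", repeat=3)]


def trinucleotide_composition(dna):
    """Return the frequencies of the 64 trinucleotides in `dna`.

    Windows overlap (step 1); windows with characters other than
    A, C, G, T are skipped. Both choices are assumptions of this sketch.
    """
    dna = dna.upper()
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    total = 0
    for i in range(len(dna) - 2):
        tri = dna[i:i + 3]
        if tri in counts:
            counts[tri] += 1
            total += 1
    if total == 0:
        return [0.0] * 64
    return [counts[t] / total for t in TRINUCLEOTIDES]


composition = trinucleotide_composition("ACGACG")
acg_frequency = composition[TRINUCLEOTIDES.index("ACG")]
```

For the sequence "ACGACG" the windows are ACG, CGA, GAC, and ACG, so ACG accounts for half of the composition.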

12.
The functional domain composition is introduced to predict the structural class of a protein or domain according to the following classification: all-α, all-β, α/β, α+β, μ (multi-domain), σ (small protein), and ρ (peptide). The advantage of doing so is that both the sequence-order-related features and the function-related features are naturally incorporated in the predictor. As a demonstration, the jackknife cross-validation test was performed on a dataset consisting of proteins and domains with less than 20% sequence identity to each other, in order to remove any homology bias. The overall success rate thus obtained was 98%. In contrast, the corresponding rates obtained by simple geometry approaches based on the amino acid composition were only 36-39%. This indicates that using the functional domain composition to represent the sample of a protein for statistical prediction is very promising, and that the functional type of a domain is closely correlated with its structural class.

13.
With the rapid increase in protein sequence data, it is indispensable to develop automated and reliable predictive methods for protein function annotation. One approach to facilitating protein function prediction is to classify proteins into functional families from their primary sequences. Enzymes being the most important group of all proteins, the accurate prediction of enzyme family classes and subfamily classes is closely related to their biological functions. In this paper, for the prediction of enzyme subfamily classes, Chou's amphiphilic pseudo-amino acid composition [Chou, K.C., 2005. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10-19] has been adopted to represent the protein samples for training the 'one-versus-rest' support vector machine. As a demonstration, the jackknife test was performed on the dataset that contains 2640 oxidoreductase sequences classified into 16 subfamily classes [Chou, K.C., Elrod, D.W., 2003. Prediction of enzyme family classes. J. Proteome Res. 2, 183-190]. The overall accuracy thus obtained was 80.87%. The significant enhancement in accuracy indicates that the current method might play a complementary role to the existing methods.

14.
Apoptosis proteins are very important for understanding the mechanism of programmed cell death. The localization of an apoptosis protein can provide valuable information about its molecular function. Predicting the localization of an apoptosis protein is a challenging task. In our previous work we proposed an increment of diversity (ID) method using protein sequence information for this prediction task. In this work, based on the concept of Chou's pseudo-amino acid composition [Chou, K.C., 2001. Prediction of protein cellular attributes using pseudo-amino acid composition. Proteins: Struct. Funct. Genet. (Erratum: Chou, K.C., 2001, vol. 44, 60) 43, 246-255; Chou, K.C., 2005. Using amphiphilic pseudo-amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10-19], a different pseudo-amino acid composition that uses the hydropathy distribution information is introduced. A novel ID_SVM algorithm that combines ID with a support vector machine (SVM) is proposed. This method is applied to three data sets (317 apoptosis proteins, 225 apoptosis proteins and 98 apoptosis proteins). Higher predictive success rates than those of the previous algorithms are obtained by the jackknife tests.

15.
16.
Zhang SW, Zhang YL, Yang HF, Zhao CH, Pan Q. Amino Acids, 2008, 34(4): 565-572
The rapidly increasing number of sequences entering the genome databank has created a need for automated methods to analyze them. Information on the subcellular localization of newly found protein sequences is important for revealing their functions in a timely manner and for conducting systems biology studies at the cellular level. Based on the concept of Chou's pseudo-amino acid composition, a series of useful information and techniques, such as residue conservation scores, von Neumann entropies, multi-scale energy, and weighted auto-correlation functions, were utilized to generate the pseudo-amino acid components for representing the protein samples. Based on this infrastructure, a hybridization predictor was developed for assigning uncharacterized proteins to the following 12 subcellular localizations: chloroplast, cytoplasm, cytoskeleton, endoplasmic reticulum, extracell, Golgi apparatus, lysosome, mitochondria, nucleus, peroxisome, plasma membrane, and vacuole. Compared with the results reported by previous investigators, higher success rates were obtained, suggesting that the current approach is quite promising and may become a useful high-throughput tool in the relevant areas.

17.
MOTIVATION: A key goal of genomics is to assign function to genes, especially for orphan sequences. RESULTS: We compared the clustered functional domains in the SBASE database to each protein sequence using BLASTP. The resulting representation of a protein is a vector, where each non-zero entry indicates a significant match between the sequence of interest and an SBASE domain. The machine learning methods nearest neighbour algorithm (NNA) and support vector machines are used for predicting protein functional classes from this information. We find that the best results are obtained using the SBASE-A database and the NNA, namely 72% accuracy for 79% coverage. We also tested assigning function by searching for InterPro sequence motifs and by taking the most significant BLAST match within the dataset. We applied the functional domain composition method to predict the functional class of 2018 currently unclassified yeast open reading frames. AVAILABILITY: A program for the prediction method that uses the NNA, called Functional Class Prediction based on Functional Domains (FCPFD), is available and can be obtained by contacting Y.D.Cai at y.cai@umist.ac.uk
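
The vector representation and nearest neighbour step described above might look roughly like the sketch below. The domain identifiers, class labels, and the use of cosine similarity are hypothetical choices for illustration; the abstract does not state the similarity measure used by FCPFD, and the BLASTP search that decides which domains match is assumed to have been run already.

```python
def domain_vector(matched_domains, domain_ids):
    """Encode a protein as a 0/1 vector over a fixed list of domain IDs.

    `matched_domains` is the set of domains with a significant match to
    the protein (assumed to come from a prior BLASTP search).
    """
    return [1 if d in matched_domains else 0 for d in domain_ids]


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def nna_predict(train, query):
    """Return the class of the training protein most similar to `query`."""
    best = max(train, key=lambda item: cosine_similarity(item[0], query))
    return best[1]


# Hypothetical domain IDs and functional classes.
domain_ids = ["D1", "D2", "D3", "D4"]
train = [
    (domain_vector({"D1", "D2"}, domain_ids), "class_A"),
    (domain_vector({"D3", "D4"}, domain_ids), "class_B"),
]
query = domain_vector({"D2"}, domain_ids)
predicted = nna_predict(train, query)
```

A protein with no significant domain match gets an all-zero vector and cannot be meaningfully compared, which is one plausible reading of why the abstract reports coverage (79%) alongside accuracy.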

18.
Ensemble classifier for protein fold pattern recognition   Total citations: 4, self-citations: 0, citations by others: 4
MOTIVATION: Prediction of protein folding patterns is one level deeper than that of protein structural classes, and hence is much more complicated and difficult. To deal with this challenging problem, the ensemble classifier was introduced. It was formed by a set of basic classifiers, each trained in a different parameter system, such as predicted secondary structure, hydrophobicity, van der Waals volume, polarity, polarizability, as well as different dimensions of pseudo-amino acid composition, which were extracted from a training dataset. The operation engine for the constituent individual classifiers was the OET-KNN (optimized evidence-theoretic k-nearest neighbors) rule. Their outcomes were combined through weighted voting to give a final determination for classifying a query protein. The recognition task was to find the true fold among the 27 possible patterns. RESULTS: The overall success rate thus obtained was 62% for a testing dataset where most of the proteins have <25% sequence identity with the proteins used in training the classifier. This rate is 6-21% higher than the corresponding rates obtained by various existing NN (neural network) and SVM (support vector machine) approaches, implying that the ensemble classifier is very promising and might become a useful vehicle in protein science, as well as proteomics and bioinformatics. AVAILABILITY: The ensemble classifier, called PFP-Pred, is available as a web-server at http://202.120.37.186/bioinf/fold/PFP-Pred.htm for public usage.
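
The weighted voting step that combines the basic classifiers can be sketched as below. The fold labels and weights are hypothetical, and how PFP-Pred actually chooses its weights is not stated in the abstract; the individual OET-KNN classifiers are assumed to have already produced one prediction each.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-classifier predictions into one label.

    Each classifier adds its weight to the label it predicted; the label
    with the largest total wins.
    """
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)


# Hypothetical outputs of three basic classifiers for one query protein.
strong_single = weighted_vote(["fold_1", "fold_2", "fold_2"], [0.6, 0.25, 0.25])
majority_wins = weighted_vote(["fold_1", "fold_2", "fold_2"], [0.4, 0.3, 0.3])
```

With weights 0.6, 0.25, and 0.25 the single strongly weighted classifier outvotes the other two; with 0.4, 0.3, and 0.3 the two agreeing classifiers win.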

19.
G protein-coupled receptors (GPCRs) are among the most frequent targets of therapeutic drugs. With the avalanche of newly generated protein sequences in the post-genomic age, an automated method to rapidly identify GPCRs and their types is highly desirable for expediting drug discovery. A new predictor was developed by hybridizing two different modes of pseudo-amino acid composition (PseAAC): the functional domain PseAAC and the low-frequency Fourier spectrum PseAAC. The new predictor is called GPCR-2L, where "2L" means that it is a two-layer predictor: the first-layer prediction engine identifies a query protein as a GPCR or not; if it is a GPCR, the prediction automatically continues to identify it as belonging to one of the following six types: (1) rhodopsin-like (Class A), (2) secretin-like (Class B), (3) metabotropic glutamate/pheromone (Class C), (4) fungal pheromone (Class D), (5) cAMP receptor (Class E), or (6) frizzled/smoothened family (Class F). The overall success rate of GPCR-2L in identifying proteins as GPCRs or non-GPCRs is over 97.2%, while that in identifying GPCRs among their six types is over 97.8%. These high success rates were obtained by the rigorous jackknife cross-validation on a stringent benchmark dataset, in which none of the included proteins had ≥40% pairwise sequence identity to any other protein in the same subset. As a user-friendly web-server, GPCR-2L is freely accessible to the public at http://icpr.jci.edu.cn/, where one can obtain the 2-level results in about 20 s for a query protein sequence of 500 amino acids. Longer sequences usually take more time. The high success rates reported here indicate that combining functional domain information with low-frequency Fourier spectrum analysis is an effective approach to identifying GPCRs and their types. It is anticipated that GPCR-2L may become a useful tool for both basic research and drug development in the areas related to GPCRs.
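
The two-layer control flow behind the "2L" name can be sketched as follows. The two classifier functions and the feature dictionary are hypothetical placeholders; the real GPCR-2L engines, built on functional domain and Fourier spectrum PseAAC, are not reproduced here.

```python
def two_layer_predict(features, is_gpcr, predict_gpcr_type):
    """Run the first layer, and only if it says GPCR, run the second.

    `is_gpcr` and `predict_gpcr_type` are caller-supplied classifiers;
    both are placeholders in this sketch.
    """
    if not is_gpcr(features):
        return ("non-GPCR", None)
    return ("GPCR", predict_gpcr_type(features))


# Hypothetical stand-in classifiers operating on a toy feature dictionary.
def toy_is_gpcr(features):
    return features["tm_helices"] == 7


def toy_gpcr_type(features):
    return "Class A" if features["rhodopsin_like"] else "Class B"


first = two_layer_predict({"tm_helices": 7, "rhodopsin_like": True}, toy_is_gpcr, toy_gpcr_type)
second = two_layer_predict({"tm_helices": 1, "rhodopsin_like": False}, toy_is_gpcr, toy_gpcr_type)
```

Running the type classifier only on proteins accepted by the first layer means it never has to handle non-GPCR input, at the cost that a first-layer error cannot be corrected by the second layer.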

20.