Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
Functional annotation of protein sequences with low similarity to well-characterized protein sequences is a major challenge of computational biology in the post-genomic era. The cyclin protein family is one such important family of proteins; it consists of sequences with low sequence similarity, which makes discovering novel cyclins and establishing orthologous relationships among the cyclins a difficult task. The currently identified cyclin motifs and cyclin-associated domains do not represent all of the identified and characterized cyclin sequences. We describe a Support Vector Machine (SVM) based classifier, CyclinPred, which can predict cyclin sequences with high efficiency. The SVM classifier was trained with features of selected cyclin and non-cyclin protein sequences. The training features of the protein sequences include amino acid composition, dipeptide composition, secondary structure composition and PSI-BLAST generated Position Specific Scoring Matrix (PSSM) profiles. Results obtained from leave-one-out cross-validation (jackknife test), self-consistency and holdout tests show that the SVM classifier trained with PSSM profile features was more accurate than the classifiers based on any of the other features alone or on hybrids of these features. A cyclin prediction server, CyclinPred, has been set up based on the SVM model trained with PSSM profiles. CyclinPred prediction results indicate that the method may be used as a cyclin prediction tool, complementing conventional cyclin prediction methods.
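
Two of the sequence features named in this abstract, amino acid composition and dipeptide composition, can be illustrated with a minimal sketch. The function names and the handling of non-standard residues (they are simply skipped) are assumptions made for illustration; the abstract does not describe how CyclinPred computes or normalizes these features.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(seq):
    """Return 20 fractions, one per standard residue, in AMINO_ACIDS order."""
    residues = [r for r in seq.upper() if r in AMINO_ACIDS]
    if not residues:
        raise ValueError("sequence contains no standard residues")
    return [residues.count(a) / len(residues) for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Return 400 fractions, one per ordered residue pair (AA, AC, ..., YY)."""
    seq = seq.upper()
    # Overlapping adjacent pairs; pairs with a non-standard residue are skipped
    # (an assumption, not a documented CyclinPred rule).
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in AMINO_ACIDS and p[1] in AMINO_ACIDS]
    if not pairs:
        raise ValueError("sequence contains no standard dipeptides")
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(keys, 0)
    for p in pairs:
        counts[p] += 1
    return [counts[k] / len(pairs) for k in keys]
```

Concatenating the two lists gives a 420-dimensional vector of the kind that could be passed to any SVM implementation; the PSSM and secondary structure features require external tools and are not sketched here.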

2.
Predicting the interactions between all the possible pairs of proteins in a given organism (making a protein-protein interaction map) is a crucial subject in bioinformatics. Most of the previous methods based on supervised machine learning use datasets containing approximately the same number of interacting pairs of proteins (positives) and non-interacting pairs of proteins (negatives) for training a classifier and are estimated to yield a large number of false positives. Thinking that the negatives used in previous studies cannot adequately represent all the negatives that need to be taken into account, we have developed a method based on multiple Support Vector Machines (SVMs) that uses more negatives than positives for predicting interactions between pairs of yeast proteins and pairs of human proteins. We show that the performance of a single SVM improved as we increased the number of negatives used for training and that, if more than one CPU is available, an approach using multiple SVMs is useful not only for improving the performance of classifiers but also for reducing the time required for training them. Our approach can also be applied to assessing the reliability of high-throughput interactions.

3.
4.
5.
6.
7.
K Nakai, M Kanehisa. Proteins, 1991, 11(2): 95-110
We have developed an expert system that makes use of various kinds of knowledge organized as "if-then" rules for predicting protein localization sites in Gram-negative bacteria, given the amino acid sequence information alone. We considered four localization sites: the cytoplasm, the inner (cytoplasmic) membrane, the periplasm, and the outer membrane. Most rules were derived from experimental observations. For example, the rule to recognize an inner membrane protein is the presence of either a hydrophobic stretch in the predicted mature protein or an uncleavable N-terminal signal sequence. Lipoproteins are first recognized by a consensus pattern and then assumed present at either the inner or outer membrane. These two possibilities are further discriminated by examining an acidic residue in the mature N-terminal portion. Furthermore, we found an empirical rule that periplasmic and outer membrane proteins were successfully discriminated by their different amino acid composition. Overall, our system could predict 83% of the localization sites of proteins in our database.

8.
9.
The nucleus guides life processes of cells. Many of the nuclear proteins participating in the life processes tend to concentrate on subnuclear compartments. The subnuclear localization of nuclear proteins is hence important for deeply understanding the construction and functions of the nucleus. Recently, Gene Ontology (GO) annotation has been used for prediction of subnuclear localization. However, the effective use of GO terms in solving sequence-based prediction problems remains challenging, especially when query protein sequences have no accession number or annotated GO term. This study obtains homologies of query proteins with known accession numbers using BLAST to retrieve GO terms for sequence-based subnuclear localization prediction. A prediction method PGAC, which involves mining informative GO terms associated with amino acid composition features, is proposed to design a support vector machine-based classifier. PGAC yields 55 informative GO terms with training and test accuracies of 85.7% and 76.3%, respectively, using a data set SNL_35 (561 proteins in 9 localizations) with 35% sequence identity. Upon comparison with Nuc-PLoc, which combines amphiphilic pseudo amino acid composition of a protein with its position-specific scoring matrix, PGAC using the data set SNL_80 yields a leave-one-out cross-validation accuracy of 81.1%, which is better than that of Nuc-PLoc, 67.4%. Experimental results show that the set of informative GO terms are effective features for protein subnuclear localization. The prediction server based on PGAC has been implemented at http://iclab.life.nctu.edu.tw/prolocgac.

10.
An eigenvalue-eigenvector approach to predicting protein folding types   Total citations: 1 (self-citations: 0, citations by others: 1)
The accuracy of predicting protein folding types can be significantly enhanced by a recently developed algorithm in which the coupling effect among different amino acid components is taken into account [Chou and Zhang (1994) J. Biol. Chem. 269, 22014-22020]. However, in practical calculations using this powerful algorithm, one may sometimes face ill-conditioned matrices. To overcome this difficulty, an effective eigenvalue-eigenvector approach is proposed. Furthermore, the new approach has been used to predict a recently constructed set of 76 proteins not included in the training set, and the accuracy of prediction is also much higher than those of other methods.

11.
According to recent experiments, proteins in budding yeast can be distinctly classified into 22 subcellular locations. Of these proteins, some bear the multi-locational feature, i.e., occur in more than one location. However, so far all the existing methods for predicting protein subcellular location were developed to deal with only the mono-locational case, where a query protein is assumed to belong to one, and only one, subcellular location. To stimulate the development of subcellular location prediction, an augmentation procedure is formulated that will enable the existing methods to tackle the multi-locational problem as well. It has been observed through a jackknife cross-validation test that the success rate obtained by the augmented GO-FunD-PseAA algorithm [BBRC 320 (2004) 1236] is overwhelmingly higher than those of the other augmented methods. It is anticipated that the augmented GO-FunD-PseAA predictor will become a very useful tool in predicting protein subcellular localization for both basic research and practical application.

12.
SLLE for predicting membrane protein types   Total citations: 2 (self-citations: 0, citations by others: 2)
Introduction of the concept of pseudo amino acid composition (PROTEINS: Structure, Function, and Genetics 43 (2001) 246; Erratum: ibid. 44 (2001) 60) has made it possible to incorporate a considerable amount of sequence-order effects by representing a protein sample in terms of a set of discrete numbers, and hence can significantly enhance the prediction quality of membrane protein type. As a continuing effort along this line, the Supervised Locally Linear Embedding (SLLE) technique for nonlinear dimensionality reduction is introduced (Science 22 (2000) 2323). The advantage of using SLLE is that it can reduce the operational space by extracting the essential features from the high-dimensional pseudo amino acid composition space, and that the cluster-tolerant capacity can be increased accordingly. As a consequence, by combining these two approaches, high success rates have been observed in the self-consistency, jackknife and independent data set tests, using the simplest nearest neighbour classifier. The current approach represents a new strategy to deal with the problems of protein attribute prediction, and hence may become a useful vehicle in the area of bioinformatics and proteomics.

13.
The transmembrane (TM) domains of many integral membrane proteins are composed of alpha-helix bundles. Structure determination at high resolution (<4 Å) of TM domains is still exceedingly difficult experimentally. Hence, some TM-protein structures have only been solved at intermediate (5-10 Å) or low (>10 Å) resolutions using, for example, cryo-electron microscopy (cryo-EM). These structures reveal the packing arrangement of the TM domain, but cannot be used to determine the positions of individual amino acids. The observation that typically, the lipid-exposed faces of TM proteins are evolutionarily more variable and less charged than their core provides a simple rule for orienting their constituent helices. Based on this rule, we developed score functions and automated methods for orienting TM helices, for which locations and tilt angles have been determined using, e.g., cryo-EM data. The method was parameterized with the aim of retrieving the native structure of bacteriorhodopsin among near- and far-from-native templates. It was then tested on proteins that differ from bacteriorhodopsin in their sequences, architectures, and functions, such as the acetylcholine receptor and rhodopsin. The predicted structures were within 1.5-3.5 Å from the native state in all cases. We conclude that the computational method can be used in conjunction with cryo-EM data to obtain approximate model structures of TM domains of proteins for which a sufficiently heterogeneous set of homologs is available. We also show that in those proteins in which relatively short loops connect neighboring helices, the scoring functions can discriminate between near- and far-from-native conformations even without the constraints imposed on helix locations and tilt angles that are derived from cryo-EM.

14.
Feng ZP. In silico biology, 2002, 2(3): 291-303
This paper reviews the problem of predicting the subcellular location of a protein. Five measures for extracting information from the global sequence, based on the Bayes discriminant algorithm, are reviewed: 1) the auto-correlation functions of amino acid indices along the sequence; 2) the quasi-sequence-order approach; 3) the pseudo-amino acid composition; 4) the unified attribute vector in Hilbert space; and 5) Zp parameters extracted from the Zp curve. The actual predictive accuracy is closely related to the degree of similarity between the training and testing sets or, in a cross-validated study, to the average degree of pairwise similarity in the dataset. Many scholars consider that the currently high predictive accuracy still cannot ensure that the available algorithms are effective in practical prediction, because of the high pairwise sequence identity of the datasets; others argue that the dataset used for developing software should be based on the reality, determined by Mother Nature, that some subcellular locations contain only a small number of proteins, some of which even have a high percentage of sequence similarity. Owing to the complexity of the problem itself, very sophisticated and specialized programs are needed both for constructing datasets and for improving the prediction. Finding the target information in the mature protein sequence and properly combining it with sorting signals may further improve the overall predictive accuracy and bring the prediction into practice.

15.
Nucleoids, a subnuclear system capable of chain elongation   Total citations: 1 (self-citations: 0, citations by others: 1)
Nucleoids, prepared by salt extraction of non-DNase-digested nuclei, have properties similar, but not identical, to those of nuclear matrices, which are prepared by salt extraction of DNase-digested nuclei. Nuclear matrices retained less pulse-labelled DNA, slightly less bound DNA polymerase alpha and DNA primase, but had greater in vitro DNA synthesis and in vitro priming. Nucleoids contained larger (110 S) DNA chains than nuclear matrices (30 S). Each type of residual nuclear structure could synthesize 4.5 S Okazaki fragments. When extracted with increasing concentrations of salt, DNase-digested nuclei lost the ability for further elongation of the 4.5 S DNA intermediate after 0.1-0.2 M NaCl, whereas undigested nuclei retained this ability up to 0.9 M NaCl. Chain elongation to 28 S DNA chains could be restored to nucleoids, but not to nuclear matrices, by the addition of nuclear extracts.

16.
In the last few years there have been many developments in computational biology, particularly with regard to novel, imaginative exploitation of genomic data. Disappointingly, there has been a lack of progress in the methodology for prediction of protein structures. In the last several years, however, promising new methods have finally begun to emerge. These methods are increasing the power and scope of the methodology, but, most importantly, they are generating new areas of investigation that we believe will accelerate progress in the field. In this review we describe recent developments and highlight the implications of their success as well as areas where efforts should be focused.

17.
An efficient system for small protein expression and refolding   Total citations: 1 (self-citations: 0, citations by others: 1)
The low expression yield and poor refolding efficiency of small recombinant proteins expressed in Escherichia coli have continued to hinder the large-scale purification of such proteins for structural and biological investigations. A system based on a small fusion partner, the B1 domain of Streptococcal protein G (GB1), was utilized to overcome this problem. We have tested this system on a small cysteine-rich toxin, mutant myotoxin alpha (MyoP20G). The highly expressed fusion protein was refolded using an unfolding/refolding protocol. Due to the small size of GB1, we were able to monitor the unfolding/refolding status by heteronuclear single quantum coherence (HSQC) NMR spectroscopy. The final product yielded well-resolved NMR spectra, with a topology corresponding to the natural product. We conclude that GB1 not only increases the expression level but also enhances the refolding of small proteins.

18.
ING4 (inhibitor of growth 4) is a candidate tumor suppressor gene that is implicated as a repressor of cell growth, angiogenesis, cell spreading and cell migration and can suppress loss of contact inhibition in vitro. We and another group identified four wobble-splicing isoforms of ING4 generated by alternative splicing at two tandem splice sites, GC(N)7GT and NAGNAG, which caused canonical (GT-AG) and non-canonical (GC-AG) splice site wobbling selection. Expression of the four ING4 wobble-splicing isoforms did not vary significantly in any of the cell lines examined. Here we show that ING4_v1 is translocated to the nucleolus, indicating that ING4 contains an intrinsic nucleolar localization signal. We further demonstrate that the subcellular localization of ING4 is modulated by two wobble-splicing events at the exon 4-5 boundary, causing displacement from the nucleolus to the nucleus. We also observed that ING4 is degraded through the ubiquitin-proteasome pathway and that it is subjected to N-terminal ubiquitination. We demonstrate that nucleolar accumulation of ING4 prolongs its half-life, but lack of nucleolar targeting potentially increases ING4 degradation. Taken together, our data suggest that the two wobble-splicing events at the exon 4-5 boundary influence subnuclear localization and degradation of ING4.

19.
Cell membranes are vitally important to living cells. Although the infrastructure of a biological membrane is provided by the lipid bilayer, membrane proteins perform most of the specific functions. Knowledge of membrane protein types often provides crucial hints toward determining the function of an uncharacterized membrane protein. With the avalanche of new protein sequences generated in the post-genomic era, there is high demand for a high-throughput tool that identifies the type of newly found membrane proteins from their primary sequences, so that they can be annotated in a timely manner for reference in both basic research and drug discovery. The key to realizing this is to establish a powerful identifier that can capture the characteristic sequence patterns of different membrane protein types. However, this is not easy because the patterns are buried in a pile of long and complicated sequences. In this paper, based on the concept of the pseudo-amino acid composition [K.C. Chou, PROTEINS: Struct., Funct., Genet. 43 (2001) 246-255], the low-frequency Fourier spectrum analysis is introduced. The merits of doing so are that the sequence pattern information can be more effectively incorporated into a set of discrete components, and that all the existing prediction algorithms can be straightforwardly applied to such a formulation of protein samples. High success rates were observed in the re-substitution test, jackknife test, and independent dataset test, indicating that the low-frequency Fourier spectrum approach may become a very useful tool for membrane protein type prediction. The novel approach also holds high potential for predicting many other attributes of proteins.
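
The general idea of turning a sequence into a small set of discrete low-frequency Fourier components can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which numerical signal is transformed, so the sketch substitutes the Kyte-Doolittle hydropathy scale as the per-residue encoding, and the function name and the default of five components are hypothetical.

```python
import cmath

# Kyte-Doolittle hydropathy values, used here only as an assumed numeric
# encoding of residues; the paper's actual encoding is not given in the abstract.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def low_frequency_spectrum(seq, n_components=5):
    """Return the magnitudes of DFT coefficients k = 1..n_components.

    Magnitudes are divided by the sequence length so that proteins of
    different lengths give comparable values. Choosing n_components above
    half the sequence length would only repeat mirrored frequencies.
    """
    values = [KD[r] for r in seq.upper() if r in KD]
    n = len(values)
    if n == 0:
        raise ValueError("sequence contains no standard residues")
    spectrum = []
    for k in range(1, n_components + 1):
        coeff = sum(v * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, v in enumerate(values))
        spectrum.append(abs(coeff) / n)
    return spectrum
```

The resulting fixed-length list can be appended to an amino acid composition vector, which is what makes such a representation usable by existing classifiers that expect discrete components.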

20.
Zheng X, Liu T, Wang J. Amino acids, 2009, 37(2): 427-433
A complexity-based approach is proposed to predict subcellular location of proteins. Instead of extracting features from protein sequences as done previously, our approach is based on a complexity decomposition of symbol sequences. In the first step, distance between each pair of protein sequences is evaluated by the conditional complexity of one sequence given the other. Subcellular location of a protein is then determined using the k-nearest neighbor algorithm. Using three widely used data sets created by Reinhardt and Hubbard, Park and Kanehisa, and Gardy et al., our approach shows an improvement in prediction accuracy over those based on the amino acid composition and Markov model of protein sequences.
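
The two steps described in this abstract, a pairwise distance built from conditional complexity followed by k-nearest-neighbor voting, can be sketched as below. The abstract does not specify the complexity measure, so this sketch swaps in a common stand-in: compressed length from `zlib`, with the conditional complexity of x given y approximated as C(yx) - C(y). The normalization and all function names are assumptions for illustration, not the authors' method.

```python
import zlib
from collections import Counter

def compressed_size(s):
    """Compressed length in bytes, used as a rough proxy for complexity."""
    return len(zlib.compress(s.encode("ascii"), 9))

def conditional_complexity(x, y):
    """Approximate C(x|y): extra bytes needed for x once y is known."""
    return compressed_size(y + x) - compressed_size(y)

def distance(x, y):
    """Symmetric, normalized distance (an assumed normalization)."""
    numerator = max(conditional_complexity(x, y), conditional_complexity(y, x))
    return numerator / max(compressed_size(x), compressed_size(y))

def knn_predict(query, training, k=1):
    """training is a list of (sequence, location) pairs; returns a location."""
    ranked = sorted(training, key=lambda item: distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```

A general-purpose compressor is a coarse estimate on short protein sequences, so the sketch shows the structure of the approach rather than its reported accuracy.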
