首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
Microarrays are a new technology that allows biologists to better understand the interactions between diverse pathologic state at the gene level. However, the amount of data generated by these tools becomes problematic, even though data are supposed to be automatically analyzed (e.g., for diagnostic purposes). The issue becomes more complex when the expression data involve multiple states. We present a novel approach to the gene selection problem in multi-class gene expression-based cancer classification, which combines support vector machines and genetic algorithms. This new method is able to select small subsets and still improve the classification accuracy.  相似文献   

2.
Yu R  Shete S 《BMC genetics》2005,6(Z1):S136
A supervised learning method, support vector machine, was used to analyze the microsatellite marker dataset of the Collaborative Study on the Genetics of Alcoholism Problem 1 for the Genetic Analysis Workshop 14. Twelve binary-valued phenotype variables were chosen for analyses using the markers from all autosomal chromosomes. Using various polynomial kernel functions of the support vector machine and randomly divided genome regions, we were able to observe the association of some marker sets with the chosen phenotypes and thus reduce the size of the dataset. The successful classifications established with the chosen support vector machine kernel function had high levels of correctness for each prediction, e.g., 96% in the fourfold cross-validations. However, owing to the limited sample data, we were not able to test the predictions of the classifiers in the new sample data.  相似文献   

3.

Background  

Alpha-helical transmembrane (TM) proteins are involved in a wide range of important biological processes such as cell signaling, transport of membrane-impermeable molecules, cell-cell communication, cell recognition and cell adhesion. Many are also prime drug targets, and it has been estimated that more than half of all drugs currently on the market target membrane proteins. However, due to the experimental difficulties involved in obtaining high quality crystals, this class of protein is severely under-represented in structural databases. In the absence of structural data, sequence-based prediction methods allow TM protein topology to be investigated.  相似文献   

4.

Background

β-turns are secondary structure type that have essential role in molecular recognition, protein folding, and stability. They are found to be the most common type of non-repetitive structures since 25% of amino acids in protein structures are situated on them. Their prediction is considered to be one of the crucial problems in bioinformatics and molecular biology, which can provide valuable insights and inputs for the fold recognition and drug design.

Results

We propose an approach that combines support vector machines (SVMs) and logistic regression (LR) in a hybrid prediction method, which we call (H-SVM-LR) to predict β-turns in proteins. Fractional polynomials are used for LR modeling. We utilize position specific scoring matrices (PSSMs) and predicted secondary structure (PSS) as features. Our simulation studies show that H-SVM-LR achieves Qtotal of 82.87%, 82.84%, and 82.32% on the BT426, BT547, and BT823 datasets respectively. These values are the highest among other β-turns prediction methods that are based on PSSMs and secondary structure information. H-SVM-LR also achieves favorable performance in predicting β-turns as measured by the Matthew's correlation coefficient (MCC) on these datasets. Furthermore, H-SVM-LR shows good performance when considering shape strings as additional features.

Conclusions

In this paper, we present a comprehensive approach for β-turns prediction. Experiments show that our proposed approach achieves better performance compared to other competing prediction methods.
  相似文献   

5.
MOTIVATION: The standard L(2)-norm support vector machine (SVM) is a widely used tool for microarray classification. Previous studies have demonstrated its superior performance in terms of classification accuracy. However, a major limitation of the SVM is that it cannot automatically select relevant genes for the classification. The L(1)-norm SVM is a variant of the standard L(2)-norm SVM, that constrains the L(1)-norm of the fitted coefficients. Due to the singularity of the L(1)-norm, the L(1)-norm SVM has the property of automatically selecting relevant genes. On the other hand, the L(1)-norm SVM has two drawbacks: (1) the number of selected genes is upper bounded by the size of the training data; (2) when there are several highly correlated genes, the L(1)-norm SVM tends to pick only a few of them, and remove the rest. RESULTS: We propose a hybrid huberized support vector machine (HHSVM). The HHSVM combines the huberized hinge loss function and the elastic-net penalty. By doing so, the HHSVM performs automatic gene selection in a way similar to the L(1)-norm SVM. In addition, the HHSVM encourages highly correlated genes to be selected (or removed) together. We also develop an efficient algorithm to compute the entire solution path of the HHSVM. Numerical results indicate that the HHSVM tends to provide better variable selection results than the L(1)-norm SVM, especially when variables are highly correlated. AVAILABILITY: R code are available at http://www.stat.lsa.umich.edu/~jizhu/code/hhsvm/.  相似文献   

6.
Yuan Z  Burrage K  Mattick JS 《Proteins》2002,48(3):566-570
A Support Vector Machine learning system has been trained to predict protein solvent accessibility from the primary structure. Different kernel functions and sliding window sizes have been explored to find how they affect the prediction performance. Using a cut-off threshold of 15% that splits the dataset evenly (an equal number of exposed and buried residues), this method was able to achieve a prediction accuracy of 70.1% for single sequence input and 73.9% for multiple alignment sequence input, respectively. The prediction of three and more states of solvent accessibility was also studied and compared with other methods. The prediction accuracies are better than, or comparable to, those obtained by other methods such as neural networks, Bayesian classification, multiple linear regression, and information theory. In addition, our results further suggest that this system may be combined with other prediction methods to achieve more reliable results, and that the Support Vector Machine method is a very useful tool for biological sequence analysis.  相似文献   

7.
Discrimination of outer membrane proteins using support vector machines   总被引:3,自引:0,他引:3  
MOTIVATION: Discriminating outer membrane proteins from other folding types of globular and membrane proteins is an important task both for dissecting outer membrane proteins (OMPs) from genomic sequences and for the successful prediction of their secondary and tertiary structures. RESULTS: We have developed a method based on support vector machines using amino acid composition and residue pair information. Our approach with amino acid composition has correctly predicted the OMPs with a cross-validated accuracy of 94% in a set of 208 proteins. Further, this method has successfully excluded 633 of 673 globular proteins and 191 of 206 alpha-helical membrane proteins. We obtained an overall accuracy of 92% for correctly picking up the OMPs from a dataset of 1087 proteins belonging to all different types of globular and membrane proteins. Furthermore, residue pair information improved the accuracy from 92 to 94%. This accuracy of discriminating OMPs is higher than that of other methods in the literature, which could be used for dissecting OMPs from genomic sequences. AVAILABILITY: Discrimination results are available at http://tmbeta-svm.cbrc.jp.  相似文献   

8.
9.
Sun XD  Huang RB 《Amino acids》2006,30(4):469-475
Summary. The support vector machine, a machine-learning method, is used to predict the four structural classes, i.e. mainly α, mainly β, α–β and fss, from the topology-level of CATH protein structure database. For the binary classification, any two structural classes which do not share any secondary structure such as α and β elements could be classified with as high as 90% accuracy. The accuracy, however, will decrease to less than 70% if the structural classes to be classified contain structure elements in common. Our study also shows that the dimensions of feature space 202 = 400 (for dipeptide) and 203 = 8 000 (for tripeptide) give nearly the same prediction accuracy. Among these 4 structural classes, multi-class classification gives an overall accuracy of about 52%, indicating that the multi-class classification technique in support of vector machines may still need to be further improved in future investigation.  相似文献   

10.
Secondary structure prediction with support vector machines   总被引:8,自引:0,他引:8  
MOTIVATION: A new method that uses support vector machines (SVMs) to predict protein secondary structure is described and evaluated. The study is designed to develop a reliable prediction method using an alternative technique and to investigate the applicability of SVMs to this type of bioinformatics problem. METHODS: Binary SVMs are trained to discriminate between two structural classes. The binary classifiers are combined in several ways to predict multi-class secondary structure. RESULTS: The average three-state prediction accuracy per protein (Q(3)) is estimated by cross-validation to be 77.07 +/- 0.26% with a segment overlap (Sov) score of 73.32 +/- 0.39%. The SVM performs similarly to the 'state-of-the-art' PSIPRED prediction method on a non-homologous test set of 121 proteins despite being trained on substantially fewer examples. A simple consensus of the SVM, PSIPRED and PROFsec achieves significantly higher prediction accuracy than the individual methods.  相似文献   

11.

Background  

Protein-protein interaction (PPI) plays essential roles in cellular functions. The cost, time and other limitations associated with the current experimental methods have motivated the development of computational methods for predicting PPIs. As protein interactions generally occur via domains instead of the whole molecules, predicting domain-domain interaction (DDI) is an important step toward PPI prediction. Computational methods developed so far have utilized information from various sources at different levels, from primary sequences, to molecular structures, to evolutionary profiles.  相似文献   

12.
13.
14.
15.
Recently, two different models have been developed for predicting gamma-turns in proteins by Kaur and Raghava [2002. An evaluation of beta-turn prediction methods. Bioinformatics 18, 1508-1514; 2003. A neural-network based method for prediction of gamma-turns in proteins from multiple sequence alignment. Protein Sci. 12, 923-929]. However, the major limitation of previous methods is inability in predicting gamma-turns types. Thus, there is a need to predict gamma-turn types using an approach which will be useful in overall tertiary structure prediction. In this work, support vector machines (SVMs), a powerful model is proposed for predicting gamma-turn types in proteins. The high rates of prediction accuracy showed that the formation of gamma-turn types is evidently correlated with the sequence of tripeptides, and hence can be approximately predicted based on the sequence information of the tripeptides alone.  相似文献   

16.
Discriminating thermophilic lipases from their similar thermostable counterparts is a challenging task and it would help to design stable proteins. In this study, the distributions of N (N=2, 3) neighboring amino acids and the non-adjacent di-residue coupling patterns in the sequences of 65 thermostable and 77 thermophilic lipases had been systematically analyzed. It was found that the hydrophobic residues Leu, Pro, Met, Phe, Trp, as well as the polar residue Tyr had higher occurrence in thermophilic lipases than thermostable ones. The occurrence frequencies of KC EE KE RE, VE, YI, EK, VK, EV, YV, EY, KY, VY and YY in thermophilic proteins were significantly higher, while the occurrence frequencies of QC, QH, QN, HQ, MQ, NQ, QQ, TQ, QS and QT were significantly lower. CXP or CPX showed significantly positive to lipase thermostability, while XXQ or QXX showed significantly negative to lipase thermostability. Non-adjacent di-residue coupling patterns of PR14, RY32, YR47, LE53, LE64, PP64, RP70 and PP101 were significantly different in thermophilic lipases and their thermostable counterparts. The composition of dipeptide, tripeptide and non-adjacent di-residue patterns contained more information than amino acid composition. A statistical method based on support vector machines (SVMs) was developed for discriminating thermophilic and thermostable lipases. The accuracy of this method for the training dataset was 97.17?. Furthermore, the highest accuracy of the method for testing datasets was 98.41?. The influence of some specific patterns on lipase thermostability was also discussed.  相似文献   

17.
Classifying G-protein coupled receptors with support vector machines   总被引:7,自引:0,他引:7  
MOTIVATION: The enormous amount of protein sequence data uncovered by genome research has increased the demand for computer software that can automate the recognition of new proteins. We discuss the relative merits of various automated methods for recognizing G-Protein Coupled Receptors (GPCRs), a superfamily of cell membrane proteins. GPCRs are found in a wide range of organisms and are central to a cellular signalling network that regulates many basic physiological processes. They are the focus of a significant amount of current pharmaceutical research because they play a key role in many diseases. However, their tertiary structures remain largely unsolved. The methods described in this paper use only primary sequence information to make their predictions. We compare a simple nearest neighbor approach (BLAST), methods based on multiple alignments generated by a statistical profile Hidden Markov Model (HMM), and methods, including Support Vector Machines (SVMs), that transform protein sequences into fixed-length feature vectors. RESULTS: The last is the most computationally expensive method, but our experiments show that, for those interested in annotation-quality classification, the results are worth the effort. In two-fold cross-validation experiments testing recognition of GPCR subfamilies that bind a specific ligand (such as a histamine molecule), the errors per sequence at the Minimum Error Point (MEP) were 13.7% for multi-class SVMs, 17.1% for our SVMtree method of hierarchical multi-class SVM classification, 25.5% for BLAST, 30% for profile HMMs, and 49% for classification based on nearest neighbor feature vector Kernel Nearest Neighbor (kernNN). The percentage of true positives recognized before the first false positive was 65% for both SVM methods, 13% for BLAST, 5% for profile HMMs and 4% for kernNN.  相似文献   

18.
Hyperspectral reflectance (350–2500 nm) measurements were made over two experimental rice fields containing two cultivars treated with three levels of nitrogen application. Four different transformations of the reflectance data were analyzed for their capability to predict rice biophysical parameters, comprising leaf area index (LAI; m2 green leaf area m−2 soil) and green leaf chlorophyll density (GLCD; mg chlorophyll m−2 soil), using stepwise multiple regression (SMR) models and support vector machines (SVMs). Four transformations of the rice canopy data were made, comprising reflectances (R), first-order derivative reflectances (D1), second-order derivative reflectances (D2), and logarithm transformation of reflectances (LOG). The polynomial kernel (POLY) of the SVM using R was the best model to predict rice LAI, with a root mean square error (RMSE) of 1.0496 LAI units. The analysis of variance kernel of SVM using LOG was the best model to predict rice GLCD, with an RMSE of 523.0741 mg m−2. The SVM approach was not only superior to SMR models for predicting the rice biophysical parameters, but also provided a useful exploratory and predictive tool for analyzing different transformations of reflectance data.  相似文献   

19.
20.

Background  

Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation, as these proteins play a crucial role in gene-regulation. In this paper, we developed various SVM modules for predicting DNA-binding domains and proteins. All models were trained and tested on multiple datasets of non-redundant proteins.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号