Found 20 similar articles (search time: 93 ms)
1.
Microarrays are a new technology that allows biologists to better understand the interactions between diverse pathological states at the gene level. However, the amount of data generated by these tools becomes problematic, even though the data are supposed to be analyzed automatically (e.g., for diagnostic purposes). The issue becomes more complex when the expression data involve multiple states. We present a novel approach to the gene selection problem in multi-class gene expression-based cancer classification, which combines support vector machines and genetic algorithms. This new method is able to select small gene subsets and still improve the classification accuracy.
3.
MOTIVATION: With the development of DNA microarray technology, scientists can now measure the expression levels of thousands of genes simultaneously in a single experiment. One current difficulty in interpreting microarray data comes from their innate 'high-dimensional, low sample size' nature. Therefore, robust and accurate gene selection methods are required to identify groups of genes that are differentially expressed across different samples, e.g. between cancerous and normal cells. Successful gene selection will help to classify different cancer types, lead to a better understanding of genetic signatures in cancers and improve treatment strategies. Although gene selection and cancer classification are two closely related problems, most existing approaches handle them separately by selecting genes prior to classification. We provide a unified procedure for simultaneous gene selection and cancer classification, achieving high accuracy in both aspects. RESULTS: In this paper we develop a novel type of regularization in support vector machines (SVMs) to identify important genes for cancer classification. A special nonconvex penalty, called the smoothly clipped absolute deviation penalty, is imposed on the hinge loss function in the SVM. By systematically thresholding small estimates to zero, the new procedure eliminates redundant genes automatically and yields a compact and accurate classifier. A successive quadratic algorithm is proposed to convert the non-differentiable and non-convex optimization problem into easily solved systems of linear equations. The method is applied to two real datasets and has produced very promising results. AVAILABILITY: MATLAB code is available upon request from the authors.
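For readers unfamiliar with the smoothly clipped absolute deviation (SCAD) penalty named in the abstract above, the following is a minimal sketch of its commonly cited piecewise form for a single coefficient. The function name `scad_penalty` and the default `a = 3.7` are illustrative assumptions; the abstract does not state which parameterization or tuning values the authors used.

```python
def scad_penalty(theta: float, lam: float, a: float = 3.7) -> float:
    """SCAD penalty for one coefficient (assumes lam > 0 and a > 2).

    Piecewise form: linear near zero (like the L1 penalty), quadratic in a
    transition zone, and constant for large coefficients, so large
    coefficients are not shrunk further.
    """
    t = abs(theta)
    if t <= lam:
        # L1-like region: small coefficients are pushed towards zero.
        return lam * t
    if t <= a * lam:
        # Quadratic transition between the linear and constant regions.
        return -(t * t - 2 * a * lam * t + lam * lam) / (2 * (a - 1))
    # Constant region: the penalty no longer grows with |theta|.
    return (a + 1) * lam * lam / 2
```

The three pieces join continuously at |theta| = lam and |theta| = a * lam, which is what makes the penalty "smoothly clipped".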
4.
This paper presents a framework for annotating protein domains with predicted domain-domain interaction networks. Specifically, domain annotation is formalized as a multi-class classification problem in this work. Numerical experiments on InterPro domains show promising results, which demonstrate the efficiency of our proposed methods.
5.
Background: Human genetic variations primarily result from single nucleotide polymorphisms (SNPs) that occur approximately every 1000 bases in the overall human population. The non-synonymous SNPs (nsSNPs) that lead to amino acid changes in the protein product may account for nearly half of the known genetic variations linked to inherited human diseases. One of the key problems of medical genetics today is to identify nsSNPs that underlie disease-related phenotypes in humans. As such, the development of computational tools that can identify such nsSNPs would enhance our understanding of genetic diseases and help predict the disease.
6.
Background: β-turns are a secondary structure type that plays an essential role in molecular recognition, protein folding, and stability. They are found to be the most common type of non-repetitive structure, since 25% of amino acids in protein structures are situated on them. Their prediction is considered to be one of the crucial problems in bioinformatics and molecular biology, and it can provide valuable insights and inputs for fold recognition and drug design. Results: We propose an approach that combines support vector machines (SVMs) and logistic regression (LR) in a hybrid prediction method, which we call H-SVM-LR, to predict β-turns in proteins. Fractional polynomials are used for LR modeling. We utilize position specific scoring matrices (PSSMs) and predicted secondary structure (PSS) as features. Our simulation studies show that H-SVM-LR achieves Qtotal of 82.87%, 82.84%, and 82.32% on the BT426, BT547, and BT823 datasets, respectively. These values are the highest among β-turn prediction methods that are based on PSSMs and secondary structure information. H-SVM-LR also achieves favorable performance in predicting β-turns as measured by the Matthews correlation coefficient (MCC) on these datasets. Furthermore, H-SVM-LR shows good performance when considering shape strings as additional features. Conclusions: In this paper, we present a comprehensive approach for β-turn prediction. Experiments show that our proposed approach achieves better performance compared to other competing prediction methods.
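The Matthews correlation coefficient mentioned in the abstract above can be computed directly from a binary confusion matrix. The sketch below uses the standard definition; the function name `matthews_cc` and the convention of returning 0.0 when the denominator vanishes are assumptions made for illustration, not details taken from the paper.

```python
import math


def matthews_cc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than chance)
    to +1 (perfect prediction).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # Assumed convention: report 0.0 when a row or column of the
        # confusion matrix is empty and the coefficient is undefined.
        return 0.0
    return (tp * tn - fp * fn) / denom
```

Unlike raw accuracy, this measure stays informative when turn and non-turn residues are heavily imbalanced, which is presumably why it is reported alongside Qtotal.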
8.
Methionine aminopeptidase and N-terminal acetyltransferase are the two enzymes that contribute most to N-terminal acetylation, which has long been recognized as a frequent and important kind of co-translational modification [R.A. Bradshaw, W.W. Brickey, K.W. Walker, N-terminal processing: the methionine aminopeptidase and N alpha-acetyl transferase families, Trends Biochem. Sci. 23 (1998) 263-267]. The combined action of these two enzymes leads to two types of N-terminal acetylated proteins: those with and those without the initiator methionine after N-terminal acetylation. To accurately predict these two types of N-terminal acetylation, a new method based on feature selection has been developed. A total of 1047 N-terminal acetylated and non-acetylated decapeptides retrieved from the Swiss-Prot database (http://cn.expasy.org) are encoded into feature vectors by amino acid properties collected in the Amino Acid Index database (http://www.genome.jp/aaindex). The Maximum Relevance Minimum Redundancy method (mRMR), combined with Incremental Feature Selection (IFS) and Feature Forward Selection (FFS), is then applied to extract informative features. The Nearest Neighbor Algorithm (NNA) is used to build prediction models. Tested by jackknife cross-validation, the correct rates of the predictors reach 91.34% and 75.49% for the two types, both better than the 84.41% and 62.99% obtained with motif methods [S. Huang, R.C. Elliott, P.S. Liu, R.K. Koduri, J.L. Weickmann, J.H. Lee, L.C. Blair, P. Ghosh-Dastidar, R.A. Bradshaw, K.M. Bryan, et al., Specificity of cotranslational amino-terminal processing of proteins in yeast, Biochemistry 26 (1987) 8242-8246; R. Yamada, R.A. Bradshaw, Rat liver polysome N alpha-acetyltransferase: substrate specificity, Biochemistry 30 (1991) 1017-1021]. Furthermore, the analysis of the informative features indicates that, besides the penultimate residue, at least six downstream residues might affect the rules that guide N-terminal acetylation. The software is available upon request.
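The evaluation protocol in the abstract above (a nearest-neighbor predictor tested by jackknife, i.e. leave-one-out, cross-validation) can be sketched as follows. This is an illustration only: the function name `jackknife_nn_accuracy` is hypothetical, and the squared Euclidean distance is an assumption, since the abstract does not say which distance the authors' Nearest Neighbor Algorithm uses.

```python
from typing import Sequence


def jackknife_nn_accuracy(vectors: Sequence[Sequence[float]],
                          labels: Sequence[int]) -> float:
    """Leave-one-out accuracy of a 1-nearest-neighbor classifier.

    Each sample is held out in turn and labeled with the class of its
    closest remaining sample. Requires at least two samples.
    """
    if len(vectors) != len(labels) or len(vectors) < 2:
        raise ValueError("need at least two samples with one label each")

    def sq_dist(x: Sequence[float], y: Sequence[float]) -> float:
        # Squared Euclidean distance (assumed metric).
        return sum((a - b) ** 2 for a, b in zip(x, y))

    correct = 0
    for i, x in enumerate(vectors):
        # Nearest neighbor among all samples except the held-out one.
        nearest = min((j for j in range(len(vectors)) if j != i),
                      key=lambda j: sq_dist(x, vectors[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(vectors)
```

In a feature-selection loop such as IFS, a function like this would be re-evaluated on each candidate feature subset, and the subset with the highest jackknife accuracy would be kept.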
9.
Background: Identification of DNA-binding proteins is one of the major challenges in the field of genome annotation, as these proteins play a crucial role in gene regulation. In this paper, we developed various SVM modules for predicting DNA-binding domains and proteins. All models were trained and tested on multiple datasets of non-redundant proteins.
10.
Gram-negative bacteria have five major subcellular localization sites: the cytoplasm, the periplasm, the inner membrane, the outer membrane, and the extracellular space. The subcellular location of a protein can provide valuable information about its function. With the rapid increase in sequenced genomic data, the need for an automated and accurate tool to predict subcellular localization becomes increasingly important. We present an approach to predict subcellular localization for Gram-negative bacteria. This method uses support vector machines trained on multiple feature vectors based on n-peptide compositions. For a standard data set comprising 1443 proteins, the overall prediction accuracy reaches 89%, which, to the best of our knowledge, is the highest prediction rate ever reported. Our prediction accuracy is 14% higher than that of the recently developed multimodular PSORT-B. Because of its simplicity, this approach can be easily extended to other organisms and should be a useful tool for the high-throughput and large-scale analysis of proteomic and genomic data.
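An n-peptide composition feature vector, as used in the abstract above, records how often each possible run of n consecutive amino acids occurs in a sequence. The sketch below is one plausible encoding; the function name `n_peptide_composition`, the normalization to relative frequencies, and the choice to skip windows containing non-standard residues are all assumptions, as the abstract does not give these details.

```python
from itertools import product
from typing import List

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def n_peptide_composition(sequence: str, n: int = 2) -> List[float]:
    """Relative frequency of every possible n-peptide in `sequence`.

    Returns a vector of length 20**n, ordered lexicographically over
    AMINO_ACIDS. Windows containing other characters are ignored.
    """
    keys = ["".join(p) for p in product(AMINO_ACIDS, repeat=n)]
    counts = dict.fromkeys(keys, 0)
    total = 0
    # Slide a window of width n along the sequence.
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        if window in counts:
            counts[window] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in keys]
```

With n = 1 this reduces to plain amino acid composition (20 features); n = 2 gives 400 dipeptide features. Vectors for several values of n could then be fed to separate SVMs or concatenated.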
11.
MOTIVATION: The standard L2-norm support vector machine (SVM) is a widely used tool for microarray classification. Previous studies have demonstrated its superior performance in terms of classification accuracy. However, a major limitation of the SVM is that it cannot automatically select relevant genes for the classification. The L1-norm SVM is a variant of the standard L2-norm SVM that constrains the L1-norm of the fitted coefficients. Due to the singularity of the L1-norm, the L1-norm SVM has the property of automatically selecting relevant genes. On the other hand, the L1-norm SVM has two drawbacks: (1) the number of selected genes is upper bounded by the size of the training data; (2) when there are several highly correlated genes, the L1-norm SVM tends to pick only a few of them and remove the rest. RESULTS: We propose a hybrid huberized support vector machine (HHSVM). The HHSVM combines the huberized hinge loss function and the elastic-net penalty. By doing so, the HHSVM performs automatic gene selection in a way similar to the L1-norm SVM. In addition, the HHSVM encourages highly correlated genes to be selected (or removed) together. We also develop an efficient algorithm to compute the entire solution path of the HHSVM. Numerical results indicate that the HHSVM tends to provide better variable selection results than the L1-norm SVM, especially when variables are highly correlated. AVAILABILITY: R code is available at http://www.stat.lsa.umich.edu/~jizhu/code/hhsvm/.
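The two ingredients the abstract above combines can be written down compactly. The sketch below uses a commonly cited form of the huberized hinge loss (quadratic near the margin, linear far from it) and of the elastic-net penalty; the function names, the default `delta = 2.0`, and the factor 1/2 on the squared term are assumptions for illustration and may differ from the authors' parameterization.

```python
from typing import Sequence


def huberized_hinge(t: float, delta: float = 2.0) -> float:
    """Huberized hinge loss at margin t = y * f(x), with delta > 0.

    Zero for t > 1, quadratic on (1 - delta, 1], linear below that,
    which makes the loss differentiable everywhere.
    """
    if t > 1:
        return 0.0
    if t > 1 - delta:
        return (1 - t) ** 2 / (2 * delta)
    return 1 - t - delta / 2


def elastic_net_penalty(beta: Sequence[float], lam1: float, lam2: float) -> float:
    """Elastic-net penalty: lam1 * ||beta||_1 + (lam2 / 2) * ||beta||_2^2."""
    l1 = sum(abs(b) for b in beta)
    l2_sq = sum(b * b for b in beta)
    return lam1 * l1 + 0.5 * lam2 * l2_sq
```

The L1 term is what drives coefficients exactly to zero (gene selection), while the squared term is what encourages correlated genes to enter or leave the model together.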
12.
With the development of high-throughput methods for identifying protein-protein interactions, large-scale interaction networks are available. Computational methods that analyze these networks to detect functional modules such as protein complexes are becoming more important. However, most of the existing methods only make use of the protein-protein interaction networks without considering the structural limitations on how proteins bind together. In this paper, we design a new protein complex prediction method by extending the idea of using domain-domain interaction information. Here we formulate the problem as a maximum matching problem (which can be solved in polynomial time) instead of using the binary integer linear programming approach (which can be NP-hard in the worst case). We also add a step to predict domain-domain interactions (DDIs), which first searches the Pfam database using the hidden Markov model and then predicts the DDIs based on the DOMINE and InterDom databases, which contain confirmed DDIs. By adding the DDI prediction step, we obtain more edges in the DDI graph, and the recall is increased significantly (at least doubled) compared with the method of Ozawa et al. (2010) [1], while the average precision is slightly better. We also combine our method with three other existing methods: COACH, MCL and MCODE. Experiments show that the precision of the combined method is improved. This article is part of a Special Issue entitled: Computational Methods for Protein Interaction and Structural Prediction.
13.
In the post-genome era, the prediction of protein function is one of the most demanding tasks in the study of bioinformatics. Machine learning methods, such as support vector machines (SVMs), greatly help to improve the classification of protein function. In this work, we integrated SVMs, protein sequence amino acid composition, and associated physicochemical properties into the study of nucleic-acid-binding protein prediction. We developed binary classifiers for rRNA-, RNA-, and DNA-binding proteins, which play an important role in the control of many cell processes. Each SVM predicts whether a protein belongs to the rRNA-, RNA-, or DNA-binding protein class. Self-consistency and jackknife tests were performed on protein data sets in which the sequence identity was < 25%. Test results show that the accuracies of the rRNA-, RNA-, and DNA-binding SVM predictions are approximately 84%, 78%, and 72%, respectively. Predictions were also performed on the ambiguous and negative data sets. The results demonstrate that the scores predicted for proteins in the ambiguous data set by the RNA- and DNA-binding SVM models were distributed around zero, while most proteins in the negative data set received negative scores from all three SVMs. The score distributions agree well with prior knowledge of those proteins and show the effectiveness of sequence-associated physicochemical properties in protein function prediction. The software is available from the author upon request.
14.
Background: PDZ domains mediate protein-protein interactions involved in important biological processes through the recognition of short linear motifs in their target proteins. Two recent independent studies have used protein microarray or phage display technology to detect PDZ domain interactions with peptide ligands on a large scale. Several computational predictors of PDZ domain interactions have been developed; however, they are trained using only protein microarray data and focus on limited subsets of PDZ domains. An accurate predictor of genomic PDZ domain interactions would allow the proteomes of organisms to be scanned for potential binders. Such an application would require an accurate and precise predictor to avoid generating too many false positive hits, given the large number of possible interactors in a given proteome. Once validated, these predictions will help to increase the coverage of current PDZ domain interaction networks and further our understanding of the roles that PDZ domains play in a variety of biological processes.
15.
Background: Predicting protein residue-residue contacts is an important 2D prediction task. It is useful for ab initio structure prediction and for understanding protein folding. In spite of steady progress over the past decade, contact prediction remains largely unsolved.
16.
Prognostic prediction is important in the medical domain, because it can be used to select an appropriate treatment for a patient by predicting the patient's clinical outcomes. For high-dimensional data, a typical prognostic method involves two steps: feature selection and prognosis analysis. Recently, the L?-L?-norm Support Vector Machine (L?-L? SVM) has been developed as an effective classification technique and has shown good classification performance with automatic feature selection. In this paper, we extend the L?-L? SVM to regression analysis with automatic feature selection. We further improve the L?-L? SVM for prognostic prediction by utilizing the information in censored data as constraints. We design an efficient solution to the new optimization problem. The proposed method is compared with seven other prognostic prediction methods on three real-world data sets. The experimental results show that the proposed method performs consistently better than the medium performance. It is more efficient than other algorithms with similar performance.
17.
G-protein coupled receptors (GPCRs) represent one of the most important classes of drug targets for the pharmaceutical industry and play important roles in cellular signal transduction. Predicting the coupling specificity of GPCRs to G-proteins is vital for further understanding the mechanism of signal transduction and the function of the receptors within a cell, which can provide new clues for pharmaceutical research and development. In this study, the features of amino acid compositions and physicochemical properties of the full-length GPCR sequences have been analyzed and extracted. Based on these features, classifiers have been developed to predict the coupling specificity of GPCRs to G-proteins using support vector machines. The testing results show that this method could obtain better prediction accuracy.
18.
Background: Lately, biomarker discovery has become one of the most significant research issues in the biomedical field. Owing to the presence of high-throughput technologies, genomic data, such as microarray data and RNA-seq, have become widely available. Many kinds of feature selection techniques have been applied to retrieve significant biomarkers from these kinds of data. However, these data tend to be noisy, with high-dimensional features and a small number of samples; thus, conventional feature selection approaches might be problematic in terms of reproducibility. Results: In this article, we propose a stable feature selection method for high-dimensional datasets. We apply an ensemble L1-norm support vector machine to efficiently reduce irrelevant features, considering the stability of features. We define the stability score for each feature by aggregating the ensemble results, and utilize backward feature elimination on a purified feature set based on this score; therefore, it is possible to acquire an optimal set of features for performance without the need to set a specific threshold. The proposed methodology is evaluated by classifying the binary stage of renal clear cell carcinoma with RNA-seq data. Conclusion: A comparison with established algorithms, i.e., a fast correlation-based filter, random forest, and an ensemble version of an L2-norm support vector machine-based recursive feature elimination, enabled us to demonstrate the superior performance of our method in terms of classification as well as stability in general. It is also shown that the proposed approach performs moderately on high-dimensional datasets consisting of a very large number of features and a smaller number of samples. The proposed approach is expected to be applicable to many other studies aimed at biomarker discovery.
20.
MOTIVATION: A new method that uses support vector machines (SVMs) to predict protein secondary structure is described and evaluated. The study is designed to develop a reliable prediction method using an alternative technique and to investigate the applicability of SVMs to this type of bioinformatics problem. METHODS: Binary SVMs are trained to discriminate between two structural classes. The binary classifiers are combined in several ways to predict multi-class secondary structure. RESULTS: The average three-state prediction accuracy per protein (Q3) is estimated by cross-validation to be 77.07 ± 0.26% with a segment overlap (Sov) score of 73.32 ± 0.39%. The SVM performs similarly to the 'state-of-the-art' PSIPRED prediction method on a non-homologous test set of 121 proteins despite being trained on substantially fewer examples. A simple consensus of the SVM, PSIPRED and PROFsec achieves significantly higher prediction accuracy than the individual methods.
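The per-protein three-state accuracy reported in the abstract above is the percentage of residues whose predicted state (helix, strand, or coil) matches the observed one, averaged over proteins. A minimal sketch follows; the function names and the one-letter state strings (`H`, `E`, `C`) are illustrative assumptions rather than the authors' code.

```python
from typing import Iterable, Tuple


def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues with a correctly predicted three-state label."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("state strings must be non-empty and of equal length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def mean_q3(pairs: Iterable[Tuple[str, str]]) -> float:
    """Average per-protein Q3 over (predicted, observed) string pairs."""
    scores = [q3_accuracy(p, o) for p, o in pairs]
    return sum(scores) / len(scores)
```

Averaging per protein weights every protein equally regardless of length, which differs slightly from pooling all residues before computing the percentage.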