1.
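The protein folding problem is regarded as a key topic of "21st-century biophysics" and remains a major unsolved problem in the central dogma of molecular biology, so predicting protein folding patterns is complex, difficult, and challenging work. To address it, we introduce classifier ensembles: this paper combines three classifiers (LMT, RandomForest, SMO) with 188-dimensional combined physicochemical features to predict protein classes. Experiments show that the method effectively characterizes protein folding patterns and classifies protein sequence data accurately; both cross-validation and independent testing show a prediction accuracy above 70%, nearly 10 percentage points higher than previous work.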
2.
The focus of this work is the use of ensembles of classifiers for predicting HIV protease cleavage sites in proteins. Due
to the complex relationships in the biological data, several recent works show that often ensembles of learning algorithms
outperform stand-alone methods. We show that the fusion of approaches based on different encoding models can be useful for
improving the performance of this classification problem. In particular, in this work four different feature encodings for
peptides are described and tested. An extensive evaluation on a large dataset according to a blind testing protocol is reported
which demonstrates how different feature extraction methods and classifiers can be combined for obtaining a robust and reliable
system. The comparison with other stand-alone approaches allows quantifying the performance improvement obtained by the ensembles
proposed in this work.
3.
Improved method for predicting linear B-cell epitopes (total citations: 2; self-citations: 0; citations by others: 2)
Background
B-cell epitopes are the sites of molecules that are recognized by antibodies of the immune system. Knowledge of B-cell epitopes may be used in the design of vaccines and diagnostic tests. It is therefore of interest to develop improved methods for predicting B-cell epitopes. In this paper, we describe an improved method for predicting linear B-cell epitopes.
Results
In order to do this, three data sets of linear B-cell epitope annotated proteins were constructed. One data set was collected from the literature, another was extracted from the AntiJen database, and a data set of epitopes in the proteins of HIV was collected from the Los Alamos HIV database. An unbiased validation of the methods was made by testing on data sets on which they were neither trained nor optimized. We have measured the performance in a non-parametric way by constructing ROC curves.
Conclusion
The best single method for predicting linear B-cell epitopes is the hidden Markov model. Combining the hidden Markov model with one of the best propensity scale methods, we obtained the BepiPred method. When tested on the validation data set, this method performs significantly better than any of the other methods tested. The server and data sets are publicly available at http://www.cbs.dtu.dk/services/BepiPred.
4.
A method (SPREK) was developed to evaluate the register of a sequence on a structure based on the matching of structural patterns against a library derived from the protein structure databank. The scores obtained were normalized against random background distributions derived from sequence shuffling and permutation methods. 'Random' structures were also used to evaluate the effectiveness of the method. These were generated by a simple random-walk and a more sophisticated structure prediction method that produced protein-like folds. For comparison with other methods, the performance of the method was assessed using collections of models including decoys and models from the CASP-5 exercise. The performance of SPREK on the decoy models was equivalent to (and sometimes better than) those obtained with more complex approaches. An exception was the two smallest proteins, for which SPREK did not perform well due to a lack of patterns. Using the best parameter combination from trials on decoy models, the CASP models of intermediate difficulty were evaluated by SPREK and the quality of the top scoring model was evaluated by its CASP ranking. Of the 14 targets in this class, half lie in the top 10% (out of around 140 models for each target). The two worst rankings resulted from the selection by our method of a well-packed model that was based on the wrong fold. Of the other poor rankings, one was the smallest protein and the others were the four largest (all over 250 residues).
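The normalization against a shuffled-sequence background described in this abstract can be sketched as a z-score. This is only an illustration and not the SPREK implementation: `toy_score` is a hypothetical stand-in for the structural pattern score, and the number of shuffles and the seed are arbitrary assumptions.

```python
import random
import statistics


def toy_score(sequence: str, pattern: str = "HP") -> float:
    """Hypothetical stand-in score: number of occurrences of a two-letter pattern."""
    return float(sum(1 for i in range(len(sequence) - 1) if sequence[i:i + 2] == pattern))


def background_z_score(sequence, score_fn, n_shuffles=200, seed=0):
    """Normalize a raw score against scores of shuffled copies of the same sequence."""
    rng = random.Random(seed)
    raw = score_fn(sequence)
    residues = list(sequence)
    background = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # same composition, random order
        background.append(score_fn("".join(residues)))
    mean = statistics.fmean(background)
    spread = statistics.pstdev(background)
    if spread == 0.0:
        # Every shuffle scored the same, so the background carries no signal.
        return 0.0
    return (raw - mean) / spread


z = background_z_score("HPHPHPHPHPHP", toy_score)
```

A large positive value means the native ordering scores well above what composition alone would give.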
5.
B Busetta. Biochimica et biophysica acta. 1986, 870(2):327-338
6.
Amino Acids - Protein hot spot residues are functional sites in protein–protein interactions. Biological experimental methods are traditionally used to identify hot spot residues, which is...
7.
This paper proposes an ensemble of classifiers for biomedical name recognition in which three classifiers, one Support Vector Machine and two discriminative Hidden Markov Models, are combined effectively using a simple majority voting strategy. In addition, we incorporate three post-processing modules, including an abbreviation resolution module, a protein/gene name refinement module and a simple dictionary matching module, into the system to further improve the performance. Evaluation shows that our system achieves the best performance from among 10 systems with a balanced F-measure of 82.58 on the closed evaluation of the BioCreative protein/gene name recognition task (Task 1A).
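A simple majority vote over per-token labels from three classifiers might look like the sketch below. The tag names and the tie-breaking rule (prefer the earliest classifier among tied labels) are assumptions for illustration; the abstract does not say how ties were resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine label sequences (one per classifier, equal length) position by position."""
    voted = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        top = max(counts.values())
        # Assumed tie-break: take the label of the earliest classifier among the tied ones.
        winner = next(label for label in labels if counts[label] == top)
        voted.append(winner)
    return voted


# Hypothetical outputs: "B" marks a token inside a name, "O" a token outside.
svm_tags = ["B", "O", "B", "O"]
hmm1_tags = ["B", "B", "O", "O"]
hmm2_tags = ["O", "B", "B", "O"]
combined = majority_vote([svm_tags, hmm1_tags, hmm2_tags])
```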
8.
Background
Prediction of long-range inter-residue contacts is an important topic in bioinformatics research. It is helpful for determining protein structures, understanding protein folding, and therefore advancing the annotation of protein functions.
Results
In this paper, we propose a novel ensemble of genetic algorithm classifiers (GaCs) to address the long-range contact prediction problem. Our method is based on a key idea called sequence profile centers (SPCs). Each SPC is the average of the sequence profiles of residue pairs belonging to the same contact class or non-contact class. GaCs train on multiple but different pairs of long-range contact data (positive data) and long-range non-contact data (negative data). The negative data sets, having roughly the same sizes as the positive ones, are constructed by random sampling over the original imbalanced negative data. As a result, about 21.5% of long-range contacts are correctly predicted. We also found that the ensemble of GaCs improves accuracy by around 5.6% over the single GaC.
Conclusions
Classifiers that use sequence profile centers may advance long-range contact prediction. In line with this approach, key structural features in proteins could be determined with high efficiency and accuracy.
9.
Background
Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems.
10.
It is well known in the literature that an ensemble of classifiers generally outperforms a stand-alone method. Hence, it is very important to develop ensemble methods well suited for bioinformatics data. In this work, we propose to combine the feature extraction method based on grouped weight with a set of amino-acid alphabets obtained by a Genetic Algorithm. The proposed method is applied for predicting DNA-binding proteins. As classifiers, the linear support vector machine and the radial basis function support vector machine are tested. As performance indicators, the accuracy and Matthews's correlation coefficient are reported. Matthews's correlation coefficient obtained by our ensemble method is approximately 0.97 when the jackknife cross-validation is used. This result outperforms the performance obtained in the literature using the same dataset where the features are extracted directly from the amino-acid sequence.
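For reference, the two performance indicators named in this abstract can be computed from a binary confusion matrix as in the sketch below. The counts are made-up illustration values, not results from the paper, and returning 0 when the denominator vanishes is a common convention assumed here.

```python
import math


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary confusion matrix."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        # Assumed convention: MCC is taken as 0 when a row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: a perfect, an inverted and a chance-level classifier.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)
inverted = matthews_corrcoef(tp=0, tn=0, fp=50, fn=50)
chance = matthews_corrcoef(tp=25, tn=25, fp=25, fn=25)
```

Unlike accuracy, the coefficient stays near 0 for a classifier that simply predicts the majority class on imbalanced data.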
11.
MOTIVATION: Numerous methods for predicting beta-turns in proteins have been developed based on various computational schemes. Here, we introduce a new method of beta-turn prediction that uses the support vector machine (SVM) algorithm together with predicted secondary structure information. Various parameters from the SVM have been adjusted to achieve optimal prediction performance. RESULTS: The SVM method achieved excellent performance as measured by the Matthews correlation coefficient (MCC = 0.45) using a 7-fold cross validation on a database of 426 non-homologous protein chains. To the best of our knowledge, this MCC value is the highest achieved so far for predicting beta-turns. The overall prediction accuracy Qtotal was 77.3%, which is the best among the existing prediction methods. Among its unique attractive features, the present SVM method avoids overtraining, compresses information, and provides a predicted reliability index.
12.
Background
The subcellular location of a protein is closely related to its function. It would be worthwhile to develop a method to predict the subcellular location for a given protein when only the amino acid sequence of the protein is known. Although many efforts have been made to predict subcellular location from sequence information only, there is the need for further research to improve the accuracy of prediction.
13.
BACKGROUND: The ability to predict the native conformation of a globular protein from its amino-acid sequence is an important unsolved problem of molecular biology. We have previously reported a method in which reduced representations of proteins are folded on a lattice by Monte Carlo simulation, using statistically-derived potentials. When applied to sequences designed to fold into four-helix bundles, this method generated predicted conformations closely resembling the real ones. RESULTS: We now report a hierarchical approach to protein-structure prediction, in which two cycles of the above-mentioned lattice method (the second on a finer lattice) are followed by a full-atom molecular dynamics simulation. The end product of the simulations is thus a full-atom representation of the predicted structure. The application of this procedure to the 60 residue, B domain of staphylococcal protein A predicts a three-helix bundle with a backbone root mean square (rms) deviation of 2.25-3 Å from the experimentally determined structure. Further application to a designed, 120 residue monomeric protein, mROP, based on the dimeric ROP protein of Escherichia coli, predicts a left turning, four-helix bundle native state. Although the ultimate assessment of the quality of this prediction awaits the experimental determination of the mROP structure, a comparison of this structure with the set of equivalent residues in the ROP dimer crystal structure indicates that they have a rms deviation of approximately 3.6-4.2 Å. CONCLUSION: Thus, for a set of helical proteins that have simple native topologies, the native folds of the proteins can be predicted with reasonable accuracy from their sequences alone. Our approach suggests a direction for future work addressing the protein-folding problem.
14.
A complexity-based approach is proposed to predict subcellular location of proteins. Instead of extracting features from protein
sequences as done previously, our approach is based on a complexity decomposition of symbol sequences. In the first step,
distance between each pair of protein sequences is evaluated by the conditional complexity of one sequence given the other.
Subcellular location of a protein is then determined using the k-nearest neighbor algorithm. Using three widely used data sets created by Reinhardt and Hubbard, Park and Kanehisa, and Gardy
et al., our approach shows an improvement in prediction accuracy over those based on the amino acid composition and Markov
model of protein sequences.
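The two steps of this abstract (a complexity-based distance, then k-nearest-neighbor classification) can be sketched as follows. This is not the authors' complexity decomposition: as an assumption, conditional complexity is approximated here by compressed length with `zlib`, and the sequences and location labels are invented for illustration.

```python
import zlib
from collections import Counter


def compressed_size(text: str) -> int:
    """Compressed length in bytes, used as a rough proxy for sequence complexity."""
    return len(zlib.compress(text.encode("ascii"), 9))


def conditional_complexity(x: str, y: str) -> int:
    """Assumed proxy for the complexity of x given y: extra bytes needed once y is known."""
    return compressed_size(y + x) - compressed_size(y)


def distance(x: str, y: str) -> float:
    """Symmetrized, size-normalized conditional complexity."""
    worst = max(conditional_complexity(x, y), conditional_complexity(y, x))
    return worst / max(compressed_size(x), compressed_size(y))


def knn_predict(query: str, training, k: int = 1) -> str:
    """training: list of (sequence, label) pairs; returns the majority label of the k nearest."""
    ranked = sorted(training, key=lambda item: distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy sequences and labels.
seq_a = "MKTAYIAKQR" * 20
seq_b = "GGSPLLVVWW" * 20
training_set = [(seq_a, "cytoplasmic"), (seq_b, "extracellular")]
predicted = knn_predict(seq_a, training_set, k=1)
```

A general-purpose compressor is a crude stand-in, and on short protein sequences its header overhead can dominate the distance.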
15.
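Based on 16 classification models of amino acids, derived sequences of a protein sequence are given, and weighted pseudo-entropy and LZ complexity are then combined to construct a 34-dimensional feature vector representing the protein sequence. Using a Bayesian classifier, protein structural class prediction on the 640 dataset, whose sequence homology does not exceed 25%, reaches an accuracy of 71.28%.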
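As a rough illustration of two ingredients named here, the sketch below maps a protein to a derived sequence over a reduced alphabet and counts its Lempel-Ziv (1976) phrases. The two-class hydrophobic/polar grouping is a hypothetical example, since the abstract does not list the 16 classification models, and the weighted pseudo-entropy is not shown.

```python
# Hypothetical two-class grouping of amino acids, for illustration only.
HYDROPHOBIC = set("AVLIMFWC")


def derived_sequence(protein: str) -> str:
    """Rewrite a protein over a reduced alphabet: H (hydrophobic) or P (other)."""
    return "".join("H" if residue in HYDROPHOBIC else "P" for residue in protein)


def lz_complexity(sequence: str) -> int:
    """Number of phrases in the Lempel-Ziv (1976) exhaustive parsing of the sequence."""
    position, phrases, length = 0, 0, len(sequence)
    while position < length:
        size = 1
        # Grow the phrase while it can still be copied from the text preceding its last symbol.
        while (position + size <= length
               and sequence[position:position + size] in sequence[:position + size - 1]):
            size += 1
        phrases += 1
        position += size
    return phrases


feature = lz_complexity(derived_sequence("MKVLAAGIVKRW"))
```

On the classic test string `0001101001000101` this parsing yields the six phrases 0, 001, 10, 100, 1000, 101.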
16.
Background:
The wide availability of genome-scale data for several organisms has stimulated interest in computational approaches to gene function prediction. Diverse machine learning methods have been applied to unicellular organisms with some success, but few have been extensively tested on higher level, multicellular organisms. A recent mouse function prediction project (MouseFunc) brought together nine bioinformatics teams applying a diverse array of methodologies to mount the first large-scale effort to predict gene function in the laboratory mouse.
Results:
In this paper, we describe our contribution to this project, an ensemble framework based on the support vector machine that integrates diverse datasets in the context of the Gene Ontology hierarchy. We carry out a detailed analysis of the performance of our ensemble and provide insights into which methods work best under a variety of prediction scenarios. In addition, we applied our method to Saccharomyces cerevisiae and have experimentally confirmed functions for a novel mitochondrial protein.
Conclusion:
Our method consistently performs among the top methods in the MouseFunc evaluation. Furthermore, it exhibits good classification performance across a variety of cellular processes and functions in both a multicellular organism and a unicellular organism, indicating its ability to discover novel biology in diverse settings.
17.
Bartolome Bejarano Mariangela Bianco Dolores Gonzalez-Moron Jorge Sepulcre Joaquin Goñi Juan Arcocha Oscar Soto Del Ubaldo Carro Giancarlo Comi Letizia Leocani Pablo Villoslada. BMC neurology. 2011, 11(1):1-9
Background
Conventional magnetic resonance imaging (MRI) has improved the diagnosis and monitoring of multiple sclerosis (MS). In clinical trials, MRI has been found to detect treatment effects with greater sensitivity than clinical measures; however, clinical and MRI outcomes tend to correlate poorly.
Methods
In this observational study, patients (n = 550; 18-50 years; relapsing-remitting MS [Expanded Disability Status Scale score ≤4.0]) receiving interferon (IFN) β-1a therapy (44 or 22 µg subcutaneously [sc] three times weekly [tiw]) underwent standardized MRI, neuropsychological and quality-of-life (QoL) assessments over 3 years. In this post hoc analysis, MRI outcomes and correlations between MRI parameters and clinical and functional outcomes were analysed.
Results
MRI data over 3 years were available for 164 patients. T2 lesion and T1 gadolinium-enhancing (Gd+) lesion volumes, but not black hole (BH) volumes, decreased significantly from baseline to Year 3 (P < 0.0001). Percentage decreases (baseline to Year 3) were greater with the 44 µg dose than with the 22 µg dose for T2 lesion volume (-10.2% vs -4.5%, P = 0.025) and T1 BH volumes (-7.8% vs +10.3%, P = 0.002). A decrease in T2 lesion volume over 3 years predicted stable QoL over the same time period. Treatment with IFN β-1a, 44 µg sc tiw, predicted an absence of cognitive impairment at Year 3.
Conclusion
Subcutaneous IFN β-1a significantly decreased MRI measures of disease, with a significant benefit shown for the 44 µg over the 22 µg dose; higher-dose treatment also predicted better cognitive outcomes over 3 years.
18.
Summary
We describe a simple method for determining the overall fold of a polypeptide chain from NOE-derived distance restraints. The method uses a reduced representation consisting of two particles per residue, and a force field containing pseudo-bond and pseudo-angle terms, an electrostatic term, but no van der Waals or hard shell repulsive terms. The method is fast and robust, requiring relatively few distance restraints to approximate the correct fold, and the correct mirror image is readily determined. The method is easily implemented using commercially available molecular modeling software.
19.
Prediction of protein-protein interaction is a difficult and important problem in biology. In this paper, we propose a new method based on an ensemble of K-local hyperplane distance nearest neighbor (HKNN) classifiers, where each HKNN is trained using a different physicochemical property of the amino acids. Moreover, we propose a new encoding technique that combines the amino acid indices together with the 2-Grams amino acid composition. A fusion of HKNN classifiers combined with the 'Sum rule' enables us to obtain an improvement over other state-of-the-art methods. The approach is demonstrated by building a learning system based on experimentally validated protein-protein interactions in the human gastric bacterium Helicobacter pylori and in a human dataset.
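The 'Sum rule' mentioned here adds the per-class scores of all classifiers and picks the class with the largest total. The sketch below shows only that fusion step; the HKNN classifiers themselves are not implemented, and the scores and class names are invented for illustration.

```python
def sum_rule(score_lists):
    """Fuse classifiers by summing their per-class scores; return the winning label and totals."""
    totals = {}
    for scores in score_lists:
        for label, score in scores.items():
            totals[label] = totals.get(label, 0.0) + score
    return max(totals, key=totals.get), totals


# Invented scores from three classifiers, each trained on a different property.
classifier_scores = [
    {"interacting": 0.45, "non-interacting": 0.55},
    {"interacting": 0.45, "non-interacting": 0.55},
    {"interacting": 0.95, "non-interacting": 0.05},
]
fused_label, fused_totals = sum_rule(classifier_scores)
```

In this example two of the three classifiers lean towards "non-interacting", yet the sum rule selects "interacting" because the third classifier is far more confident, which is how it differs from a plain majority vote.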
20.
Koziol JA Feng AC Jia Z Wang Y Goodison S McClelland M Mercola D. Bioinformatics (Oxford, England). 2009, 25(1):54-60
MOTIVATION: Classification and regression trees have long been used for cancer diagnosis and prognosis. Nevertheless, instability and variable selection bias, as well as overfitting, are well-known problems of tree-based methods. In this article, we investigate whether ensemble tree classifiers can ameliorate these difficulties, using data from two recent studies of radical prostatectomy in prostate cancer. RESULTS: Using time to progression following prostatectomy as the relevant clinical endpoint, we found that ensemble tree classifiers robustly and reproducibly identified three subgroups of patients in the two clinical datasets: non-progressors, early progressors and late progressors. Moreover, the consensus classifications were independent predictors of time to progression compared to known clinical prognostic factors.