Similar literature
 Found 20 similar articles (search time: 31 ms)
1.
Genome-wide association studies (GWAS) have been fruitful in identifying disease susceptibility loci for common and complex diseases. A remaining question is whether we can quantify individual disease risk based on genotype data, in order to facilitate personalized prevention and treatment for complex diseases. Previous studies have typically failed to achieve satisfactory performance, primarily due to the use of only a limited number of confirmed susceptibility loci. Here we propose that sophisticated machine-learning approaches with a large ensemble of markers may improve the performance of disease risk assessment. We applied a Support Vector Machine (SVM) algorithm on a GWAS dataset generated on the Affymetrix genotyping platform for type 1 diabetes (T1D) and optimized a risk assessment model with hundreds of markers. We subsequently tested this model on an independent Illumina-genotyped dataset with imputed genotypes (1,008 cases and 1,000 controls), as well as a separate Affymetrix-genotyped dataset (1,529 cases and 1,458 controls), resulting in area under ROC curve (AUC) of ∼0.84 in both datasets. In contrast, poor performance was achieved when limited to dozens of known susceptibility loci in the SVM model or logistic regression model. Our study suggests that improved disease risk assessment can be achieved by using algorithms that take into account interactions between a large ensemble of markers. We are optimistic that genotype-based disease risk assessment may be feasible for diseases where a notable proportion of the risk has already been captured by SNP arrays.
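
Most of the abstracts in this list report performance as an area under the ROC curve (AUC). As a reminder of what that number measures, here is a minimal sketch that computes AUC as the fraction of case/control pairs ranked correctly (the Mann-Whitney formulation). The function name `roc_auc` and the toy data are illustrative assumptions and are not taken from any of the studies listed here.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank (Mann-Whitney U) formulation.

    labels: iterable of 0/1 (1 = case); scores: higher means more case-like.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one control")
    # Count case/control pairs ranked correctly; ties count one half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical, not from the study).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(roc_auc(labels, scores))
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which values such as ∼0.84 should be read.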

2.
3.
We developed a method called residue contact frequency (RCF), which uses the complex structures generated by the protein–protein docking algorithm ZDOCK to predict interface residues. Unlike interface prediction algorithms that are based on monomers alone, RCF is binding partner specific. We evaluated the performance of RCF using the area under the precision‐recall (PR) curve (AUC) on a large protein docking Benchmark. RCF (AUC = 0.44) performed as well as meta‐PPISP (AUC = 0.43), which is one of the best monomer‐based interface prediction methods. In addition, we tested a support vector machine (SVM) that combines RCF with meta‐PPISP and another monomer‐based interface prediction algorithm, Evolutionary Trace, to further improve the performance. We found that the SVM that combined RCF and meta‐PPISP achieved the best performance (AUC = 0.47). We used RCF to predict the binding interfaces of proteins that can bind to multiple partners, and RCF was able to correctly predict interface residues that are unique for the respective binding partners. Furthermore, we found that residues that contributed greatly to binding affinity (hotspot residues) had significantly higher RCF than other residues. Proteins 2014; 82:57–66. © 2013 Wiley Periodicals, Inc.

4.
Deriving predictive models in medicine typically relies on a population approach where a single model is developed from a dataset of individuals. In this paper we describe and evaluate a personalized approach in which we construct a new type of decision tree model, called a decision-path model, that takes advantage of the particular features of a given person of interest. We introduce three personalized methods that derive personalized decision-path models. We compared the performance of these methods to that of Classification And Regression Tree (CART), a population decision tree, in predicting seven different outcomes in five medical datasets. Two of the three personalized methods performed statistically significantly better on area under the ROC curve (AUC) and Brier skill score compared to CART. The personalized approach of learning decision-path models is a new approach for predictive modeling that can perform better than a population approach.

5.
Background

T-cell epitopes play an important role in the T-cell immune response, and they are critical components in epitope-based vaccine design. Immunogenicity is the ability to trigger an immune response. The accurate prediction of immunogenic T-cell epitopes is significant for designing useful vaccines and understanding the immune system.

Methods

In this paper, we attempt to differentiate immunogenic epitopes from non-immunogenic epitopes based on their primary structures. First, we explore a variety of sequence-derived features and analyze their relationship with epitope immunogenicity. To utilize the various features effectively, a genetic algorithm (GA)-based ensemble method is proposed to determine the optimal feature subset and develop a high-accuracy ensemble model. In the GA optimization, a chromosome represents a feature subset in the search space. For each feature subset, the selected features are used to construct the base predictors, and an ensemble model is developed by taking the average of the outputs of the base predictors. The objective of the GA is to search for the optimal feature subset, which leads to the ensemble model with the best cross-validation AUC (area under ROC curve) on the training set.

Results

Two datasets named ‘IMMA2’ and ‘PAAQD’ are adopted as the benchmark datasets. Compared with the state-of-the-art methods POPI, POPISK, PAAQD and our previous method, the GA-based ensemble method performs much better, achieving an AUC score of 0.846 on the IMMA2 dataset and an AUC score of 0.829 on the PAAQD dataset. Statistical analysis demonstrates that the performance improvements of the GA-based ensemble method are statistically significant.

Conclusions

The proposed method is a promising tool for predicting immunogenic epitopes. The source codes and datasets are available in S1 File.

6.

Background

Highly parallel analysis of gene expression has recently been used to identify gene sets or ‘signatures’ to improve patient diagnosis and risk stratification. Once a signature is generated, traditional statistical testing is used to evaluate its prognostic performance. However, due to the dimensionality of microarrays, this can lead to false interpretation of these signatures.

Principal Findings

A method was developed to test batches of a user-specified number of randomly chosen signatures in patient microarray datasets. The percentage of randomly generated signatures yielding prognostic value was assessed using ROC analysis by calculating the area under the curve (AUC) in six publicly available cancer patient microarray datasets. We found that a signature consisting of randomly selected genes has an average 10% chance of reaching significance when assessed in a single dataset, but can range from 1% to ∼40% depending on the dataset in question. Increasing the number of validation datasets markedly reduces this number.

Conclusions

We have shown that the use of an arbitrary cut-off value for evaluation of signature significance is not suitable for this type of research; instead, the cut-off should be defined for each dataset separately. Our method can be used to establish and evaluate the performance of any derived gene signature in a dataset by comparing its performance to thousands of randomly generated signatures. It will be of most interest for cases where few data are available and testing in multiple datasets is limited.
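
The comparison against randomly generated signatures described above can be sketched as an empirical p-value. This is only an illustration under stated assumptions: patients are scored by the mean expression of the signature genes, performance is measured by AUC, and the names `signature_auc` and `empirical_p` as well as the simulated data are hypothetical and do not come from the paper.

```python
import random


def roc_auc(labels, scores):
    """AUC as the fraction of case/control pairs ranked correctly (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))


def signature_auc(expr, genes, labels):
    """Score each patient by the mean expression of `genes`, then compute AUC."""
    scores = [sum(expr[g][i] for g in genes) / len(genes)
              for i in range(len(labels))]
    return roc_auc(labels, scores)


def empirical_p(expr, signature, labels, n_random=1000, seed=0):
    """Fraction of random same-size signatures with AUC >= the observed one."""
    rng = random.Random(seed)
    observed = signature_auc(expr, signature, labels)
    all_genes = sorted(expr)
    hits = 0
    for _ in range(n_random):
        rand_sig = rng.sample(all_genes, len(signature))
        if signature_auc(expr, rand_sig, labels) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_random + 1)


# Simulated data (hypothetical): 50 genes, 10 cases and 10 controls.
data_rng = random.Random(1)
labels = [1] * 10 + [0] * 10
expr = {f"g{j}": [data_rng.gauss(0, 1) for _ in labels] for j in range(50)}
# Make two genes informative by shifting them upward in cases.
for g in ("g0", "g1"):
    expr[g] = [v + 3.0 * y for v, y in zip(expr[g], labels)]

p_value = empirical_p(expr, ["g0", "g1"], labels, n_random=200)
print(p_value)
```

Because the null distribution is built from the dataset itself, the resulting threshold is dataset-specific, which is the point the conclusions make about arbitrary cut-offs.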

7.
Aim To investigate the impact of geographical bias on the performance of ecological niche models for invasive plant species.
Location South Africa and Australia.
Methods We selected 10 Australian plants invasive in South Africa and nine South African plants invasive in Australia. Geographical bias was simulated in occurrence records obtained from the native range of a species to represent two scenarios. For the first scenario (A, worst‐case) a proportion of records were excluded from a specific region of a species’ range and for the second scenario (B, less extreme) only some records were excluded from that specific region of the range. Introduced range predictions were produced with the Maxent modelling algorithm where models were calibrated with datasets from these biased occurrence records and 19 bioclimatic variables. Models were evaluated with independent test data obtained from the introduced range of the species. Geographical bias was quantified as the proportional difference between the occurrence records from a control and a biased dataset, and environmental bias was expressed as either the difference in marginality or tolerance between these datasets. Model performance [assessed using the conventional and modified AUC (area under the curve of receiver‐operating characteristic plots) and the maximum true skill statistic] was compared between models calibrated with occurrence records from a biased dataset and a control dataset.
Results We found considerable variation in the relationship between geographical and environmental bias. Environmental bias, expressed as the difference in marginality, differed significantly across treatments. Model performance did not differ significantly among treatments. Regions predicted as suitable for most of the species were very similar when compared between a biased and control dataset, with only a few exceptions.
Main conclusions The geographical bias simulated in this study was sufficient to result in significant environmental bias across treatments, but despite this we did not find a significant effect on model performance. Differences in the environmental spaces occupied by the species in their native and invaded ranges may explain why we did not find a significant effect on model performance.

8.
L Han  YJ Zhang  J Song  MS Liu  Z Zhang 《PloS one》2012,7(7):e41370
Enzymes play a fundamental role in almost all biological processes, and identification of catalytic residues is a crucial step for deciphering the biological functions and understanding the underlying catalytic mechanisms. In this work, we developed a novel structural feature called MEDscore to identify catalytic residues, which integrates the microenvironment (ME) and geometrical properties of amino acid residues. First, we converted a residue's ME into a series of spatially neighboring residue pairs, whose likelihood of being located in a catalytic ME was deduced from a benchmark enzyme dataset. We then calculated an ME-based score, termed MEscore, by summing the likelihoods of all residue pairs. Second, we defined a parameter called Dscore to measure the relative distance of a residue to the center of the protein, given that catalytic residues are typically located in the center of the protein structure. Finally, we defined the MEDscore feature based on an effective nonlinear integration of MEscore and Dscore. When evaluated on a well-prepared benchmark dataset using five-fold cross-validation tests, MEDscore achieved a robust performance in identifying catalytic residues with an AUC1.0 of 0.889. At a ≤ 10% false positive rate control, MEDscore correctly identified approximately 70% of the catalytic residues. Remarkably, MEDscore achieved a competitive performance compared with the residue conservation score (e.g. CONscore), the most informative singular feature predominantly employed to identify catalytic residues. To the best of our knowledge, MEDscore is the first singular structural feature exhibiting such an advantage. More importantly, we found that MEDscore is complementary to CONscore and that a significantly improved performance can be achieved by combining CONscore with MEDscore in a linear manner. As an implementation of this work, MEDscore has been made freely accessible at http://protein.cau.edu.cn/mepi/.

9.
As one of the most common post-translational modifications, ubiquitination regulates the quantity and function of a variety of proteins. Experimental and clinical investigations have also suggested crucial roles for ubiquitination in several human diseases. The complicated sequence context of human ubiquitination sites revealed by proteomic studies highlights the need to develop effective computational strategies to predict human ubiquitination sites. Here we report the establishment of a novel human-specific ubiquitination site predictor through the integration of multiple complementary classifiers. First, a Support Vector Machine (SVM) classifier was constructed based on the composition of k-spaced amino acid pairs (CKSAAP) encoding, which was utilized in our previous yeast ubiquitination site predictor. To further exploit the pattern and properties of the ubiquitination sites and their flanking residues, three additional SVM classifiers were constructed using the binary amino acid encoding, the AAindex physicochemical property encoding and the protein aggregation propensity encoding, respectively. Through an integration that relied on logistic regression, the resulting predictor, termed hCKSAAP_UbSite, achieved an area under ROC curve (AUC) of 0.770 in a 5-fold cross-validation test on a class-balanced training dataset. When tested on a class-balanced independent testing dataset that contains 3419 ubiquitination sites, hCKSAAP_UbSite also achieved a robust performance with an AUC of 0.757. Specifically, it consistently performed better than the predictor using the CKSAAP encoding alone and two other publicly available predictors which are not human-specific. Given its promising performance on our large-scale datasets, hCKSAAP_UbSite has been made publicly available at our server (http://protein.cau.edu.cn/cksaap_ubsite/).

10.
Some global models to predict the risk of diabetes may not be applicable to local populations. We aimed to develop and validate a score to predict type 2 diabetes mellitus (T2DM) in a rural adult Chinese population. Data for a cohort of 12,849 participants were randomly divided into derivation (n = 11,564) and validation (n = 1285) datasets. A questionnaire interview and physical and blood biochemical examinations were performed at baseline (July to August 2007 and July to August 2008) and follow-up (July to August 2013 and July to October 2014). A Cox regression model was used to weight each variable in the derivation dataset. For each significant variable, a score was calculated by multiplying β by 100 and rounding to the nearest integer. Age, body mass index, triglycerides and fasting plasma glucose (scores 3, 12, 24 and 76, respectively) were predictors of incident T2DM. The model accuracy was assessed by the area under the receiver operating characteristic curve (AUC), with optimal cut-off value 936. With the derivation dataset, sensitivity, specificity and AUC of the model were 66.7%, 74.0% and 0.768 (95% CI 0.760–0.776), respectively. With the validation dataset, the performance of the model was superior to the Chinese (simple), FINDRISC, Oman and IDRS models of T2DM risk but equivalent to the Framingham model, which is widely applicable in a variety of populations. Our model for predicting 6-year risk of T2DM could be used in a rural adult Chinese population.
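
The scoring rule stated in the abstract (multiply each Cox coefficient β by 100 and round to the nearest integer) can be illustrated with a short sketch. The β values below are hypothetical, chosen only so that the resulting points match the scores 3, 12, 24 and 76 given in the abstract. The way a participant's total is formed (points multiplied by the measured value, then summed) is also an assumption, since the abstract does not spell it out, and the participant values are made up.

```python
def points_from_beta(beta):
    """Convert a Cox regression coefficient to integer points: round(beta * 100)."""
    return round(beta * 100)


# Hypothetical coefficients chosen only so the points match those in the abstract.
betas = {"age": 0.03, "bmi": 0.12, "triglycerides": 0.24, "fpg": 0.76}
points = {name: points_from_beta(b) for name, b in betas.items()}


def total_score(person, points):
    """Assumed scoring rule: sum of points times the measured value."""
    return sum(points[name] * person[name] for name in points)


CUTOFF = 936  # optimal cut-off value reported in the abstract

# Made-up participant: age in years, BMI in kg/m^2, lipids and glucose in mmol/L.
person = {"age": 50, "bmi": 24.0, "triglycerides": 1.5, "fpg": 5.5}
score = total_score(person, points)
print(points, score, score >= CUTOFF)
```

Integer points of this kind trade a little precision for a score that can be tallied by hand in a clinic.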

11.
In this study, we aimed to investigate a novel five-gene signature for predicting the prognosis of patients with laryngeal cancer. The laryngeal cancer datasets were obtained from The Cancer Genome Atlas (TCGA). Both univariate and multivariate Cox regression analyses were applied to screen for prognostic differentially expressed genes (DEGs), and a novel gene signature was obtained. The performance of this Cox regression model was tested by receiver operating characteristic (ROC) curves and area under the curve (AUC). Further survival analysis for each of the five genes was carried out using Kaplan-Meier curves and the log-rank test. In total, 622 DEGs were screened from the TCGA datasets in this study. We constructed a five-gene signature through Cox survival analysis. Patients were divided into low- and high-risk groups depending on the median risk score, and a significant difference in 5-year overall survival was found between these two groups (P < .05). ROC curves verified that this five-gene signature performed well in predicting the prognosis of laryngeal cancer (AUC = 0.862, P < .05). In conclusion, the five-gene signature consisting of EMP1, HOXB9, DPY19L2P1, MMP1, and KLHDC7B might be applied as an independent prognosis predictor of laryngeal cancer.
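
Dividing patients into low- and high-risk groups by the median risk score can be sketched as follows. The abstract does not report the Cox coefficients, so the coefficients, expression values and patient IDs below are hypothetical. The linear form of the risk score (coefficient times expression, summed over the five genes) is the usual convention for such signatures and is assumed here.

```python
from statistics import median

# Hypothetical Cox coefficients for the five genes named in the abstract.
coefficients = {"EMP1": 0.5, "HOXB9": 0.3, "DPY19L2P1": -0.2,
                "MMP1": 0.4, "KLHDC7B": -0.1}


def risk_score(expression, coefficients):
    """Linear risk score: sum of coefficient times expression over the genes."""
    return sum(coefficients[g] * expression[g] for g in coefficients)


def split_by_median(scores):
    """Label each patient 'high' if the score exceeds the median, else 'low'."""
    cut = median(scores.values())
    return {pid: ("high" if s > cut else "low") for pid, s in scores.items()}


# Made-up expression profiles: every gene set to the same level per patient.
patients = {
    "P1": {g: 1.0 for g in coefficients},
    "P2": {g: 2.0 for g in coefficients},
    "P3": {g: 3.0 for g in coefficients},
    "P4": {g: 0.0 for g in coefficients},
}
scores = {pid: risk_score(expr, coefficients) for pid, expr in patients.items()}
groups = split_by_median(scores)
print(groups)
```

The two groups would then be compared with Kaplan-Meier curves and a log-rank test, which is outside the scope of this sketch.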

12.
Resistance to oxaliplatin (L-OHP)-based regimens remains a major obstacle to its efficient use in treating metastatic colorectal cancer (mCRC). In this study, we performed weighted gene coexpression network analysis (WGCNA) to systematically screen the relevant hub genes for L-OHP resistance using the raw microarray data of 30 consecutive mCRC samples from our earlier study (GSE69657). The results were further confirmed through datasets from Gene Expression Omnibus (GEO). From the L-OHP resistance module, nine genes present in both the coexpression and protein–protein interaction networks were chosen as hub genes. Among these genes, Meis Homeobox 2 (MEIS2) had the highest correlation with L-OHP resistance (r = −0.443) and was deregulated in L-OHP resistant tissues compared with L-OHP sensitive tissues in both our own dataset and the GSE104645 testing dataset. The receiver operating characteristic curve validated that MEIS2 had a good ability to predict L-OHP response in both our own dataset (area under the curve [AUC] = 0.802) and the GSE104645 dataset (AUC = 0.746). Downregulation of MEIS2 was then observed in CRC tissue compared with normal tissue in 12 GEO-sourced datasets and The Cancer Genome Atlas (TCGA) and was correlated with poor event-free survival. Furthermore, analysis of methylation data from TCGA showed that the MEIS2 promoter was hypermethylated. In addition, MEIS2 expression was significantly decreased in CRC stem cells compared with nonstem cells in two GEO datasets (GSE14773 and GSE24747). Further methylation analysis of GSE104271 demonstrated that CRC stem cells had higher MEIS2 promoter methylation levels at the cg00366722 and cg00610348 sites. Gene set enrichment analysis showed that MEIS2 might be involved in the Wnt/β-catenin pathway. Overall, the MEIS2 promoter was hypermethylated and MEIS2 was downregulated in mCRC tissues with poor L-OHP response. MEIS2 might be involved in the Wnt/β-catenin pathway to maintain CRC stemness, which leads to L-OHP resistance.

13.
14.
Because of the vital roles of DNA-binding proteins in gene regulation, an efficient method for identifying them is highly desirable and would be invaluable for advancing our understanding of protein functions. In this study, we propose a new method for the prediction of DNA-binding proteins that ranks features using random forest and applies wrapper-based feature selection with a forward best-first search strategy. The features comprise information from primary sequence, predicted secondary structure, predicted relative solvent accessibility, and position specific scoring matrix. The proposed method, called DBPPred, uses Gaussian naïve Bayes as the underlying classifier since it outperformed five other classifiers: decision tree, logistic regression, k-nearest neighbor, support vector machine with polynomial kernel, and support vector machine with radial basis function. The proposed DBPPred yields the highest average accuracy of 0.791 and average MCC of 0.583 in five-fold cross validation with ten runs on the training benchmark dataset PDB594. Subsequently, blind tests on the independent dataset PDB186 were performed with the proposed model trained on the entire PDB594 dataset and with five existing methods (iDNA-Prot, DNA-Prot, DNAbinder, DNABIND and DBD-Threader); DBPPred yielded the highest accuracy of 0.769, MCC of 0.538, and AUC of 0.790. Independent tests of DBPPred on a large dataset consisting entirely of non-DNA-binding proteins and on two RNA-binding protein datasets also showed improved or comparable quality when compared with the relevant prediction methods. Moreover, we observed that for the majority of the features selected by the proposed method, the mean feature values differ statistically significantly between DNA-binding and non-DNA-binding proteins. All of the experimental results indicate that the proposed DBPPred can be an alternative predictor for large-scale determination of DNA-binding proteins.
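
Several abstracts in this list report accuracy and the Matthews correlation coefficient (MCC) alongside AUC. As a brief illustration of how the two are obtained from a binary confusion matrix, here is a minimal sketch. The counts are made up for illustration and are not results from any study listed here.

```python
import math


def accuracy_and_mcc(tp, tn, fp, fn):
    """Return (accuracy, MCC) for a binary confusion matrix."""
    total = tp + tn + fp + fn
    acc = (tp + tn) / total
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention MCC is taken as 0 when any marginal total is zero.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, mcc


# Made-up counts: 50 positives (40 found) and 50 negatives (45 rejected).
acc, mcc = accuracy_and_mcc(tp=40, tn=45, fp=5, fn=10)
print(round(acc, 3), round(mcc, 3))
```

Unlike accuracy, MCC stays near zero for a classifier that simply predicts the majority class, which is why it is often reported for imbalanced benchmarks.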

15.
Hugo Schweke  Qifang Xu  Gerardo Tauriello  Lorenzo Pantolini  Torsten Schwede  Frédéric Cazals  Alix Lhéritier  Juan Fernandez-Recio  Luis Angel Rodríguez-Lumbreras  Ora Schueler-Furman  Julia K. Varga  Brian Jiménez-García  Manon F. Réau  Alexandre M. J. J. Bonvin  Castrense Savojardo  Pier-Luigi Martelli  Rita Casadio  Jérôme Tubiana  Haim J. Wolfson  Romina Oliva  Didier Barradas-Bautista  Tiziana Ricciardelli  Luigi Cavallo  Česlovas Venclovas  Kliment Olechnovič  Raphael Guerois  Jessica Andreani  Juliette Martin  Xiao Wang  Genki Terashi  Daipayan Sarkar  Charles Christoffer  Tunde Aderinwale  Jacob Verburgt  Daisuke Kihara  Anthony Marchand  Bruno E. Correia  Rui Duan  Liming Qiu  Xianjin Xu  Shuang Zhang  Xiaoqin Zou  Sucharita Dey  Roland L. Dunbrack  Emmanuel D. Levy  Shoshana J. Wodak 《Proteomics》2023,23(17):2200323
Reliably scoring and ranking candidate models of protein complexes and assigning their oligomeric state from the structure of the crystal lattice represent outstanding challenges. A community-wide effort was launched to tackle these challenges. The latest resources on protein complexes and interfaces were exploited to derive a benchmark dataset consisting of 1677 homodimer protein crystal structures, including a balanced mix of physiological and non-physiological complexes. The non-physiological complexes in the benchmark were selected to bury a similar or larger interface area than their physiological counterparts, making it more difficult for scoring functions to differentiate between them. Next, 252 functions for scoring protein-protein interfaces previously developed by 13 groups were collected and evaluated for their ability to discriminate between physiological and non-physiological complexes. A simple consensus score generated using the best performing score of each of the 13 groups, and a cross-validated Random Forest (RF) classifier were created. Both approaches showed excellent performance, with an area under the Receiver Operating Characteristic (ROC) curve of 0.93 and 0.94, respectively, outperforming individual scores developed by different groups. Additionally, AlphaFold2 engines recalled the physiological dimers with significantly higher accuracy than the non-physiological set, lending support to the reliability of our benchmark dataset annotations. Optimizing the combined power of interface scoring functions and evaluating it on challenging benchmark datasets appears to be a promising strategy.

16.
17.
Barenboim M  Masso M  Vaisman II  Jamison DC 《Proteins》2008,71(4):1930-1939
There is substantial interest in methods designed to predict the effect of nonsynonymous single nucleotide polymorphisms (nsSNPs) on protein function, given their potential relationship to heritable diseases. Current state-of-the-art supervised machine learning algorithms, such as random forest (RF), train models that classify single amino acid mutations in proteins as either neutral or deleterious to function. However, it is frequently the case that the functional effect of a polymorphism on a protein resides between these two extremes. The utilization of classifiers that incorporate fuzzy logic provides a natural extension in order to account for the spectrum of possible functional consequences. We generated a dataset of single amino acid substitutions in human proteins having known three-dimensional structures. Each variant was uniquely represented as a feature vector that included computational geometry and knowledge-based statistical potential predictors obtained through application of Delaunay tessellation of protein structures. Additional attributes consisted of physicochemical properties of the native and replacement amino acids as well as topological location of the mutated residue position in the solved structure. Classification performance of the RF algorithm was evaluated on a training set consisting of the disease-associated and neutral nsSNPs taken from our dataset, and attributes were ranked according to their relative importance. Similarly, we evaluated the performance of adaptive neuro-fuzzy inference system (ANFIS). The utility of statistical geometry predictors was compared with that of traditional structural and evolutionary attributes employed by other researchers, revealing an equally effective yet complementary methodology. Among all attributes in our feature set, the statistical geometry predictors were found to be the most highly ranked. On the basis of the AUC (area under the ROC curve) measure of performance, the ANFIS and RF models were equally effective when only statistical geometry features were utilized. Tenfold cross-validation studies evaluating AUC, balanced error rate (BER), and Matthews correlation coefficient (MCC) showed that our RF model was at least comparable with the well-established methods of SIFT and PolyPhen. The trained RF and ANFIS models were each subsequently used to predict the disease potential of human nsSNPs in our dataset that are currently unclassified (http://rna.gmu.edu/FuzzySnps/).

18.
Cysteine S-sulfenylation is an important post-translational modification (PTM) in proteins, and provides redox regulation of protein functions. Bioinformatics and structural analyses have indicated that S-sulfenylation could impact many biological and functional categories and has distinct structural features. However, experimental identification of cysteine S-sulfenylation is expensive and low-throughput. A useful computational method and an efficient predictor are therefore highly desirable. In this study, a predictor called iSulf-Cys, which incorporates 14 kinds of physicochemical properties of amino acids, was proposed. With 10-fold cross-validation repeated 20 times on the training dataset, the area under the curve (AUC) was 0.7155 ± 0.0085 and the MCC was 0.3122 ± 0.0144. iSulf-Cys also showed satisfactory performance on the independent testing dataset, with AUC 0.7343 and MCC 0.3315. Features constructed from physicochemical properties and position were carefully analyzed. A user-friendly web-server for iSulf-Cys is accessible at http://app.aporc.org/iSulf-Cys/.

19.
Drug-target interactions provide insight into drug side effects and drug repositioning. However, wet-lab biochemical experiments are time-consuming and labor-intensive, and are insufficient to meet the pressing demand for drug research and development. With the rapid advancement of deep learning, computational methods are increasingly applied to screen drug-target interactions. Many methods treat this problem as a binary classification task (binding or not), but ignore the quantitative binding affinity. In this paper, we propose a new end-to-end deep learning method called DeepMHADTA, which uses the multi-head self-attention mechanism in a deep residual network to predict drug-target binding affinity. On two benchmark datasets, our method outperformed several current state-of-the-art methods in terms of multiple performance measures, including mean square error (MSE), concordance index (CI), r_m^2, and area under the precision-recall curve (AUPR). The results demonstrated that our method achieved better performance in predicting drug–target binding affinity.

20.
The process of deducing the catalytic mechanism of an enzyme from its structure is highly complex and requires extensive experimental work to validate a proposed mechanism. As one step towards improving the reliability of this process, we have gathered statistics describing the typical geometry of catalytic residues with regard to the substrate and one another. In order to analyse residue-substrate interactions, we have assembled a dataset of structures of enzymes of known mechanism bound to substrate, product, or a substrate analogue. Despite the challenges presented in obtaining such experimental data, we were able to include 42 enzyme structures. We have also assembled a separate dataset of catalytic residues which act upon other catalytic residues, using a set of 60 enzyme structures. For both datasets, we have extracted the distances between residues with a given catalytic function and their target moieties. The geometry of residues whose function involves the transfer or sharing of hydrogens (either with substrate or another residue) was analysed more closely. The results showed that the geometry for such productive interactions (prior to the transition state) closely resembles that seen in non-catalytic hydrogen bonds, with distances and angles in the normal expected range. Such statistics provide limits on "expected geometries" for catalytic residues, which will help to identify these residues and elucidate enzyme mechanisms.
