Found 20 similar articles; search took 0 ms
1.
Recently, several classifiers that combine primary tumor data, such as gene expression data, with secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single-gene classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as a benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single-gene classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single-gene classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single-gene sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single-gene classifiers for predicting outcome in breast cancer.
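As a rough illustration of what "aggregating the expression levels of several genes" can look like, the following minimal sketch averages per-sample expression over the members of each gene set. The averaging rule, the function name `composite_features`, the gene names and the gene sets are all hypothetical and chosen for illustration; the abstract does not specify which aggregation the evaluated methods use.

```python
from statistics import mean


def composite_features(expression, gene_sets):
    """Average per-gene expression within each gene set, sample by sample.

    expression: dict mapping gene name -> list of expression values (one per sample)
    gene_sets:  dict mapping composite-feature name -> list of member gene names
    """
    n_samples = len(next(iter(expression.values())))
    features = {}
    for name, genes in gene_sets.items():
        # Ignore set members that were not measured in this data set.
        present = [g for g in genes if g in expression]
        if not present:
            continue
        features[name] = [
            mean(expression[g][i] for g in present) for i in range(n_samples)
        ]
    return features


# Toy data (hypothetical): three genes measured in three samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0],
    "GENE_B": [3.0, 2.0, 1.0],
    "GENE_C": [0.0, 4.0, 8.0],
}
# Hypothetical gene sets, e.g. derived from a protein-protein interaction network.
gene_sets = {
    "module_1": ["GENE_A", "GENE_B"],
    "module_2": ["GENE_B", "GENE_C", "GENE_X"],  # GENE_X is not measured
}
features = composite_features(expression, gene_sets)
```

Randomizing the secondary data source, as described in the abstract, would correspond here to shuffling which genes belong to which entry of `gene_sets` before calling the function.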
2.
This paper applies and studies the behavior of three learning algorithms, i.e. the support vector machine (SVM), the radial basis function network (RBF network), and k-nearest neighbor (k-NN), for predicting HIV-1 drug resistance from genotype data. In addition, a new algorithm for classifier combination is proposed. A comparison of the predictive performance of the three learning algorithms shows that SVM yields the highest average accuracy, the RBF network gives the highest sensitivity, and k-NN yields the best specificity. Finally, a comparison of the predictive performance of the composite classifier with that of the three learning algorithms demonstrates that the proposed composite classifier provides the highest average accuracy.
3.
Laxman Yetukuri Jarkko Tikka Jaakko Hollmén Matej Orešič 《Metabolomics : Official journal of the Metabolomic Society》2010,6(1):18-26
Mass spectrometry (MS)-based metabolomics studies often require handling of both identified and unidentified metabolite data. In order to avoid bias in data interpretation, it would be advantageous for the data analysis to include all available data. A practical challenge in exploratory metabolomics analysis is therefore how to interpret the changes related to unidentified peaks. In this paper, we address the challenge by predicting the class membership of unknown peaks by applying and comparing multiple supervised classifiers on selected lipidomics datasets. The employed classifiers include k-nearest neighbours (k-NN), support vector machines (SVM), partial least squares discriminant analysis (PLS-DA) and Naive Bayes methods, which are known to be effective and efficient in predicting the labels for unseen data. Here, the class label predictions are sought for unidentified lipid profiles coming from high-throughput global screening in an Ultra Performance Liquid Chromatography Mass Spectrometry (UPLC™/MS) experimental setup. Our investigation reveals that k-NN and SVM classifiers outperform both PLS-DA and Naive Bayes classifiers. The Naive Bayes classifier performs poorly compared with the other models, and this observation seems logical, as lipids are highly co-regulated and do not respect the Naive Bayes assumption that features are conditionally independent given the class. Common label predictions from k-NN and SVM can serve as a good starting point to explore the full data and thereby facilitate exploratory studies where label information is critical for data interpretation.
4.
Cornelia Caragea Jivko Sinapov Adrian Silvescu Drena Dobbs Vasant Honavar 《BMC bioinformatics》2007,8(1):438
Background
Glycosylation is one of the most complex post-translational modifications (PTMs) of proteins in eukaryotic cells. Glycosylation plays an important role in biological processes ranging from protein folding and subcellular localization, to ligand recognition and cell-cell interactions. Experimental identification of glycosylation sites is expensive and laborious. Hence, there is significant interest in the development of computational methods for reliable prediction of glycosylation sites from amino acid sequences.
5.
Hagen Klett Yesilda Balavarca Reka Toth Biljana Gigic Nina Habermann Dominique Scherer 《Epigenetics》2018,13(4):386-397
DNA methylation is recognized as one of several epigenetic regulators of gene expression and as a potential driver of carcinogenesis through gene-silencing of tumor suppressors and activation of oncogenes. However, abnormal methylation, even of promoter regions, does not necessarily alter gene expression levels, especially if the gene is already silenced, leaving the exact mechanisms of methylation unresolved. Using a large cohort of matching DNA methylation and gene expression samples of colorectal cancer (CRC; n = 77) and normal adjacent mucosa tissues (n = 108), we investigated the regulatory role of methylation on gene expression. We show that on a subset of genes enriched in common cancer pathways, methylation is significantly associated with gene regulation through gene-specific mechanisms. We built two classification models to infer gene regulation in CRC from methylation differences of tumor and normal tissues, taking into account both gene-silencing and gene-activation effects through hyper- and hypo-methylation of CpGs. The classification models result in high prediction performances in both training and independent CRC testing cohorts (0.92<AUC<0.97) as well as in individual patient data (average AUC = 0.82), suggesting a robust interplay between methylation and gene regulation. Validation analysis in other cancerous tissues resulted in lower prediction performances (0.69<AUC<0.90); however, it identified genes that share robust dependencies across cancerous tissues. In conclusion, we present a robust classification approach that predicts the gene-specific regulation through DNA methylation in CRC tissues with possible transition to different cancer entities. Furthermore, we present HMGA1 as consistently associated with methylation across cancers, suggesting a potential candidate for DNA methylation-targeting cancer therapy.
6.
7.
Lysine acetylation sites prediction using an ensemble of support vector machine classifiers (total citations: 1; self-citations: 0; citations by others: 1)
Lysine acetylation is an essentially reversible and highly regulated post-translational modification which regulates diverse protein properties. Experimental identification of acetylation sites is laborious and expensive. Hence, there is significant interest in the development of computational methods for reliable prediction of acetylation sites from amino acid sequences. In this paper we use an ensemble of support vector machine classifiers to perform this task. The experimentally determined lysine acetylation sites are extracted from the Swiss-Prot database and the scientific literature. Experimental results show that an ensemble of support vector machine classifiers outperforms a single support vector machine classifier and other computational methods such as PAIL and LysAcet on the problem of predicting lysine acetylation sites. The resulting method has been implemented in EnsemblePail, a web server for lysine acetylation site prediction available at http://www.aporc.org/EnsemblePail/.
8.
Koziol JA Feng AC Jia Z Wang Y Goodison S McClelland M Mercola D 《Bioinformatics (Oxford, England)》2009,25(1):54-60
MOTIVATION: Classification and regression trees have long been used for cancer diagnosis and prognosis. Nevertheless, instability and variable selection bias, as well as overfitting, are well-known problems of tree-based methods. In this article, we investigate whether ensemble tree classifiers can ameliorate these difficulties, using data from two recent studies of radical prostatectomy in prostate cancer. RESULTS: Using time to progression following prostatectomy as the relevant clinical endpoint, we found that ensemble tree classifiers robustly and reproducibly identified three subgroups of patients in the two clinical datasets: non-progressors, early progressors and late progressors. Moreover, the consensus classifications were independent predictors of time to progression compared to known clinical prognostic factors.
9.
Control strategies for gene regulatory networks have begun to be explored, both experimentally and theoretically, with implications
for control of disease as well as for synthetic biology. Recent work has focussed on controls designed to achieve desired
stationary states. Another useful objective, however, is the initiation of sustained oscillations in systems where oscillations
are normally damped, or even not present. Alternatively, it may be desired to suppress (by damping) oscillations that naturally
occur in an uncontrolled network. Here we address these questions in the context of piecewise-affine models of gene regulatory
networks with affine controls that match the qualitative nature of the model. In the case of two genes with a single relevant
protein concentration threshold per gene, we find that control of production terms (constant control) is effective in generating
or suppressing sustained oscillations, while control of decay terms (linear control) is not effective. We derive an easily
calculated condition to determine an effective constant control. As an example, we apply our analysis to a model of the carbon
response network in Escherichia coli, reduced to the two genes that are essential in understanding its behavior.
10.
RNA secondary structure prediction from sequence alignments using a network of k-nearest neighbor classifiers (total citations: 3; self-citations: 0; citations by others: 3)
We present a machine learning method (a hierarchical network of k-nearest neighbor classifiers) that uses an RNA sequence alignment in order to predict a consensus RNA secondary structure. The input to the network is the mutual information, the fraction of complementary nucleotides, and a novel consensus RNAfold secondary structure prediction of a pair of alignment columns and its nearest neighbors. Given this input, the network computes a prediction as to whether a particular pair of alignment columns corresponds to a base pair. By using a comprehensive test set of 49 RFAM alignments, the program KNetFold achieves an average Matthews correlation coefficient of 0.81. This is a significant improvement compared with the secondary structure prediction methods PFOLD and RNAalifold. By using the example of archaeal RNase P, we show that the program can also predict pseudoknot interactions.
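Two quantities named in this abstract have standard textbook definitions that can be sketched briefly: the mutual information between two alignment columns and the Matthews correlation coefficient computed from a confusion matrix. The sketch below uses those general definitions; the function names and toy columns are hypothetical, and KNetFold's own handling of gaps, normalization, or base-pair-specific scoring is not described in the abstract and may differ.

```python
from collections import Counter
from math import log2, sqrt


def mutual_information(col_i, col_j):
    """Mutual information (in bits) between two equally long alignment columns."""
    n = len(col_i)
    count_i = Counter(col_i)
    count_j = Counter(col_j)
    count_ij = Counter(zip(col_i, col_j))
    mi = 0.0
    for (a, b), c in count_ij.items():
        # p(a,b) * log2( p(a,b) / (p(a) * p(b)) ), written in terms of counts
        mi += (c / n) * log2(c * n / (count_i[a] * count_j[b]))
    return mi


def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy columns (hypothetical): perfectly covarying Watson-Crick pairs.
covarying = mutual_information("GCAU", "CGUA")
# Fully conserved columns carry no covariation signal.
conserved = mutual_information("AAAA", "UUUU")
mcc = matthews_cc(tp=8, tn=8, fp=2, fn=2)
```

Covarying columns score high while fully conserved columns score zero, which is presumably why the method supplements mutual information with other inputs such as the fraction of complementary nucleotides.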
11.
Andries E Hagstrom T Atlas SR Willman C 《Journal of bioinformatics and computational biology》2007,5(1):79-104
Linear discrimination, from the point of view of numerical linear algebra, can be treated as solving an ill-posed system of linear equations. In order to generate a solution that is robust in the presence of noise, these problems require regularization. Here, we examine the ill-posedness involved in the linear discrimination of cancer gene expression data with respect to outcome and tumor subclasses. We show that a filter factor representation, based upon Singular Value Decomposition, yields insight into the numerical ill-posedness of the hyperplane-based separation when applied to gene expression data. We also show that this representation yields useful diagnostic tools for guiding the selection of classifier parameters, thus leading to improved performance.
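For orientation, the general filter-factor form of a regularized least-squares solution can be written as follows. This is the standard textbook formulation, assuming a linear system X w ≈ y with data matrix X, label vector y, and weight vector w; the symbols and the two example filters (Tikhonov and truncated SVD) are illustrative assumptions, and the abstract does not state which filters or notation the paper itself uses.

```latex
% SVD of the data matrix, with singular values \sigma_1 \ge \dots \ge \sigma_r > 0
X = U \Sigma V^{\top} = \sum_{i=1}^{r} \sigma_i \, u_i v_i^{\top}

% Regularized solution expressed through filter factors f_i \in [0, 1]
w_{\mathrm{reg}} = \sum_{i=1}^{r} f_i \, \frac{u_i^{\top} y}{\sigma_i} \, v_i

% Tikhonov (ridge) regularization with parameter \lambda
f_i = \frac{\sigma_i^{2}}{\sigma_i^{2} + \lambda^{2}}

% Truncated SVD keeping the k largest singular values
f_i = \begin{cases} 1, & i \le k \\ 0, & i > k \end{cases}
```

Setting every f_i = 1 recovers the unregularized minimum-norm solution, in which components with small singular values amplify noise in y by a factor of 1/σ_i; the filter factors damp exactly those components.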
12.
Insulin genes in mouse and rat compose a two-gene system in which Ins1 was retroposed from the partially processed mRNA of Ins2. When Ins1 originated and how it was retained in genomes remain open questions. In this study, we used genomic approaches to detect insulin gene copy number variation in rodent species and investigated evolutionary forces acting on both Ins1 and Ins2. We characterized the phylogenetic distribution of the new insulin gene (Ins1) by Southern analyses and confirmed it by sequencing insulin genes in the rodent genomes. The results demonstrate that Ins1 originated right before the mouse-rat split (approximately 20 MYA), and both Ins1 and Ins2 are under strong functional constraints in these murine species. Interestingly, by examining a range of nucleotide polymorphisms, we detected positive selection acting on both Ins2 and Ins1 gene regions in the Mus musculus domesticus populations. Furthermore, three amino acid sites were also identified as having evolved under positive selection in two insulin peptides: two are in the signal peptide and one is in the C-peptide. Our data suggest an adaptive divergence in the mouse insulin two-gene system, which may result from the response to environmental change caused by the rise of agricultural civilization, as proposed by the thrifty-genotype hypothesis.
13.
Koenig T Menze BH Kirchner M Monigatti F Parker KC Patterson T Steen JJ Hamprecht FA Steen H 《Journal of proteome research》2008,7(9):3708-3717
Protein identification by tandem mass spectrometry is based on the reliable processing of the acquired data. Unfortunately, the generation of a large number of poor quality spectra is commonly observed in LC-MS/MS, and the processing of these mostly noninformative spectra with its associated costs should be avoided. We present a continuous quality score that can be computed very quickly and that can be considered an approximation of the MASCOT score in case of a correct identification. This score can be used to reject low quality spectra prior to database identification, or to draw attention to those spectra that exhibit a (supposedly) high information content, but could not be identified. The proposed quality score can be calibrated automatically on site without the need for a manually generated training set. When this score is turned into a classifier and when features are used that are independent of the instrument, the proposed approach performs on par with previously published classifiers and feature sets and also gives insights into the behavior of the MASCOT score.
14.
In microarray-based case–control studies of a disease, people often attempt to identify a few diagnostic or prognostic markers amongst the most significant differentially expressed (DE) genes. However, the reproducibility of DE genes identified in different studies for a disease is typically very low. To tackle the problem, we could evaluate the reproducibility of DE genes across studies and define robust markers for disease diagnosis using a disease-associated protein–protein interaction (PPI) subnetwork. Using datasets for four cancer types, we found that the most significant DE genes in cancer exhibit consistent up- or down-regulation in different datasets. For each cancer type, the 5 (or 10) most significant DE genes separately extracted from different datasets tend to be significantly coexpressed and closely connected in the PPI subnetwork, thereby indicating that they are highly reproducible at the PPI level. Consequently, we were able to build robust subnetwork-based classifiers for cancer diagnosis.
15.
Adam J. Golman Kerry A. Danelson 《Computer methods in biomechanics and biomedical engineering》2016,19(7):717-732
This study developed a parametric methodology to robustly predict occupant injuries sustained in real-world crashes using a finite element (FE) human body model (HBM). One hundred and twenty near-side impact motor vehicle crashes were simulated over a range of parameters using Toyota RAV4 (bullet vehicle) and Ford Taurus (struck vehicle) FE models and a validated HBM, the Total HUman Model for Safety (THUMS). Three bullet vehicle crash parameters (speed, location and angle) and two occupant parameters (seat position and age) were varied using a Latin hypercube design of experiments. Four injury metrics (head injury criterion, half deflection, thoracic trauma index and pelvic force) were used to calculate injury risk. Rib fracture prediction and lung strain metrics were also analysed. As hypothesized, bullet speed had the greatest effect on each injury measure. Injury risk was reduced when the bullet location was further from the B-pillar or when the bullet angle was more oblique. Age was strongly correlated with rib fracture frequency and lung strain severity. The injuries from a real-world crash were predicted using two different methods: (1) subsampling the injury predictors from the 12 best crush profile matching simulations and (2) using regression models. Both injury prediction methods successfully predicted the case occupant's low risk for pelvic injury, high risk for thoracic injury, rib fractures and high lung strains with tight confidence intervals. This parametric methodology was successfully used to explore crash parameter interactions and to robustly predict real-world injuries.
16.
17.
Background
Predicting the response to a drug for cancer patients based on genomic information is an important problem in modern clinical oncology. This problem occurs in part because many available drug sensitivity prediction algorithms do not consider better quality cancer cell lines and the adoption of new feature representations, both of which lead to the accurate prediction of drug responses. By predicting accurate drug responses to cancer, oncologists gain a more complete understanding of the effective treatments for each patient, which is a core goal in precision medicine.
Results
In this paper, we model cancer drug sensitivity as a link prediction problem, which is shown to be an effective technique. We evaluate our proposed link prediction algorithms and compare them with an existing drug sensitivity prediction approach based on clinical trial data. The experimental results based on the clinical trial data show the stability of our link prediction algorithms, which yield the highest area under the ROC curve (AUC) and are statistically significant.
Conclusions
We propose a link prediction approach to obtain a new feature representation. Compared with an existing approach, the results show that incorporating the new feature representation into the link prediction algorithms significantly improves the performance.
18.
In recent years, increasing evidence has appeared that breast cancer may not constitute a single disease at the molecular level, but comprises a heterogeneous set of subtypes. This suggests that instead of building a single monolithic predictor, better predictors might be constructed that solely target samples of a designated subtype, which are believed to represent more homogeneous sets of samples. An unavoidable drawback of developing subtype-specific predictors, however, is that a stratification by subtype drastically reduces the number of samples available for their construction. As numerous studies have indicated sample size to be an important factor in predictor construction, it is therefore questionable whether the potential benefit of subtyping can outweigh the drawback of a severe loss in sample size. Factors like unequal class distributions and differences in the number of samples per subtype further complicate comparisons. We present a novel experimental protocol that facilitates a comprehensive comparison between subtype-specific predictors and predictors that do not take subtype information into account. Emphasis lies on careful control of sample size as well as class and subtype distributions. The methodology is applied to a large breast cancer compendium involving over 1500 arrays, using a state-of-the-art subtyping scheme. We show that the resulting subtype-specific predictors outperform those that do not take subtype information into account, especially when taking sample size considerations into account.
19.
Jianhua Ruan Md. Jamiul Jahid Fei Gu Chengwei Lei Yi-Wen Huang Ya-Ting Hsu David G. Mutch Chun-Liang Chen Nameer B. Kirma Tim H.-M. Huang 《Genomics》2019,111(1):17-23
Developing accurate prognostic models is one of the biggest challenges in “omics”-based cancer research. Here, we propose a novel computational method for identifying dysregulated gene subnetworks as biomarkers to predict cancer recurrence. Applying our method to the DNA methylome of endometrial cancer patients, we identified a subnetwork consisting of differentially methylated (DM) genes, and non-differentially methylated genes, termed Epigenetic Connectors (EC), that are topologically important for connecting the DM genes in a protein-protein interaction network. The ECs are statistically significantly enriched in well-known tumorigenesis and metastasis pathways, and include known epigenetic regulators. Importantly, combining the DMs and ECs as features using a novel random walk procedure, we constructed a support vector machine classifier that significantly improved the prediction accuracy of cancer recurrence and outperformed several alternative methods, demonstrating the effectiveness of our network-based approach.
20.
Ray O. Bahado-Singh Amit Lugade Jayson Field Zaid Al-Wahab BeomSoo Han Rupasri Mandal Trent C. Bjorndahl Onur Turkoglu Stewart F. Graham David Wishart Kunle Odunsi 《Metabolomics : Official journal of the Metabolomic Society》2018,14(1):6