首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
Recently, several classifiers that combine primary tumor data, like gene expression data, and secondary data sources, such as protein-protein interaction networks, have been proposed for predicting outcome in breast cancer. In these approaches, new composite features are typically constructed by aggregating the expression levels of several genes. The secondary data sources are employed to guide this aggregation. Although many studies claim that these approaches improve classification performance over single genes classifiers, the gain in performance is difficult to assess. This stems mainly from the fact that different breast cancer data sets and validation procedures are employed to assess the performance. Here we address these issues by employing a large cohort of six breast cancer data sets as benchmark set and by performing an unbiased evaluation of the classification accuracies of the different approaches. Contrary to previous claims, we find that composite feature classifiers do not outperform simple single genes classifiers. We investigate the effect of (1) the number of selected features; (2) the specific gene set from which features are selected; (3) the size of the training set and (4) the heterogeneity of the data set on the performance of composite feature and single genes classifiers. Strikingly, we find that randomization of secondary data sources, which destroys all biological information in these sources, does not result in a deterioration in performance of composite feature classifiers. Finally, we show that when a proper correction for gene set size is performed, the stability of single genes sets is similar to the stability of composite feature sets. Based on these results there is currently no reason to prefer prognostic classifiers based on composite features over single genes classifiers for predicting outcome in breast cancer.  相似文献   

2.

Background

Clinical data, such as patient history, laboratory analysis, ultrasound parameters-which are the basis of day-to-day clinical decision support-are often used to guide the clinical management of cancer in the presence of microarray data. Several data fusion techniques are available to integrate genomics or proteomics data, but only a few studies have created a single prediction model using both gene expression and clinical data. These studies often remain inconclusive regarding an obtained improvement in prediction performance. To improve clinical management, these data should be fully exploited. This requires efficient algorithms to integrate these data sets and design a final classifier.LS-SVM classifiers and generalized eigenvalue/singular value decompositions are successfully used in many bioinformatics applications for prediction tasks. While bringing up the benefits of these two techniques, we propose a machine learning approach, a weighted LS-SVM classifier to integrate two data sources: microarray and clinical parameters.

Results

We compared and evaluated the proposed methods on five breast cancer case studies. Compared to LS-SVM classifier on individual data sets, generalized eigenvalue decomposition (GEVD) and kernel GEVD, the proposed weighted LS-SVM classifier offers good prediction performance, in terms of test area under ROC Curve (AUC), on all breast cancer case studies.

Conclusions

Thus a clinical classifier weighted with microarray data set results in significantly improved diagnosis, prognosis and prediction responses to therapy. The proposed model has been shown as a promising mathematical framework in both data fusion and non-linear classification problems.  相似文献   

3.
This paper addresses the question of maximizing classifier accuracy for classifying task-related mental activity from Magnetoencelophalography (MEG) data. We propose the use of different sources of information and introduce an automatic channel selection procedure. To determine an informative set of channels, our approach combines a variety of machine learning algorithms: feature subset selection methods, classifiers based on regularized logistic regression, information fusion, and multiobjective optimization based on probabilistic modeling of the search space. The experimental results show that our proposal is able to improve classification accuracy compared to approaches whose classifiers use only one type of MEG information or for which the set of channels is fixed a priori.  相似文献   

4.
Breast cancer outcome can be predicted using models derived from gene expression data or clinical data. Only a few studies have created a single prediction model using both gene expression and clinical data. These studies often remain inconclusive regarding an obtained improvement in prediction performance. We rigorously compare three different integration strategies (early, intermediate, and late integration) as well as classifiers employing no integration (only one data type) using five classifiers of varying complexity. We perform our analysis on a set of 295 breast cancer samples, for which gene expression data and an extensive set of clinical parameters are available as well as four breast cancer datasets containing 521 samples that we used as independent validation.mOn the 295 samples, a nearest mean classifier employing a logical OR operation (late integration) on clinical and expression classifiers significantly outperforms all other classifiers. Moreover, regardless of the integration strategy, the nearest mean classifier achieves the best performance. All five classifiers achieve their best performance when integrating clinical and expression data. Repeating the experiments using the 521 samples from the four independent validation datasets also indicated a significant performance improvement when integrating clinical and gene expression data. Whether integration also improves performances on other datasets (e.g. other tumor types) has not been investigated, but seems worthwhile pursuing. Our work suggests that future models for predicting breast cancer outcome should exploit both data types by employing a late OR or intermediate integration strategy based on nearest mean classifiers.  相似文献   

5.
A range of single classifiers have been proposed to classify crop types using time series vegetation indices, and hybrid classifiers are used to improve discriminatory power. Traditional fusion rules use the product of multi-single classifiers, but that strategy cannot integrate the classification output of machine learning classifiers. In this research, the performance of two hybrid strategies, multiple voting (M-voting) and probabilistic fusion (P-fusion), for crop classification using NDVI time series were tested with different training sample sizes at both pixel and object levels, and two representative counties in north Xinjiang were selected as study area. The single classifiers employed in this research included Random Forest (RF), Support Vector Machine (SVM), and See 5 (C 5.0). The results indicated that classification performance improved (increased the mean overall accuracy by 5%~10%, and reduced standard deviation of overall accuracy by around 1%) substantially with the training sample number, and when the training sample size was small (50 or 100 training samples), hybrid classifiers substantially outperformed single classifiers with higher mean overall accuracy (1%~2%). However, when abundant training samples (4,000) were employed, single classifiers could achieve good classification accuracy, and all classifiers obtained similar performances. Additionally, although object-based classification did not improve accuracy, it resulted in greater visual appeal, especially in study areas with a heterogeneous cropping pattern.  相似文献   

6.
In the past, microarray studies have been criticized due to noise and the limited overlap between gene signatures. Prior biological knowledge should therefore be incorporated as side information in models based on gene expression data to improve the accuracy of diagnosis and prognosis in cancer. As prior knowledge, we investigated interaction and pathway information from the human interactome on different aspects of biological systems. By exploiting the properties of kernel methods, relations between genes with similar functions but active in alternative pathways could be incorporated in a support vector machine classifier based on spectral graph theory. Using 10 microarray data sets, we first reduced the number of data sources relevant for multiple cancer types and outcomes. Three sources on metabolic pathway information (KEGG), protein-protein interactions (OPHID) and miRNA-gene targeting (microRNA.org) outperformed the other sources with regard to the considered class of models. Both fixed and adaptive approaches were subsequently considered to combine the three corresponding classifiers. Averaging the predictions of these classifiers performed best and was significantly better than the model based on microarray data only. These results were confirmed on 6 validation microarray sets, with a significantly improved performance in 4 of them. Integrating interactome data thus improves classification of cancer outcome for the investigated microarray technologies and cancer types. Moreover, this strategy can be incorporated in any kernel method or non-linear version of a non-kernel method.  相似文献   

7.
The development of techniques for oncogenomic analyses such as array comparative genomic hybridization, messenger RNA expression arrays and mutational screens have come to the fore in modern cancer research. Studies utilizing these techniques are able to highlight panels of genes that are altered in cancer. However, these candidate cancer genes must then be scrutinized to reveal whether they contribute to oncogenesis or are coincidental and non-causative. We present a computational method for the prioritization of candidate (i) proto-oncogenes and (ii) tumour suppressor genes from oncogenomic experiments. We constructed computational classifiers using different combinations of sequence and functional data including sequence conservation, protein domains and interactions, and regulatory data. We found that these classifiers are able to distinguish between known cancer genes and other human genes. Furthermore, the classifiers also discriminate candidate cancer genes from a recent mutational screen from other human genes. We provide a web-based facility through which cancer biologists may access our results and we propose computational cancer gene classification as a useful method of prioritizing candidate cancer genes identified in oncogenomic studies.  相似文献   

8.
The safety of human–machine systems can be indirectly evaluated based on operator’s cognitive load levels at each temporal instant. However, relevant features of cognitive states are hidden behind in multiple sources of cortical neural responses. In this study, we developed a novel neural network ensemble, SE-SDAE, based on stacked denoising autoencoders (SDAEs) which identify different levels of cognitive load by electroencephalography (EEG) signals. To improve the generalization capability of the ensemble framework, a stacking-based approach is adopted to fuse the abstracted EEG features from activations of deep-structured hidden layers. In particular, we also combine multiple K-nearest neighbor and naive Bayesian classifiers with SDAEs to generate a heterogeneous classification committee to enhance ensemble’s diversity. Finally, we validate the proposed SE-SDAE by comparing its performance with mainstream pattern classifiers for cognitive load evaluation to show its effectiveness.  相似文献   

9.

Background

One of the major goals in gene and protein expression profiling of cancer is to identify biomarkers and build classification models for prediction of disease prognosis or treatment response. Many traditional statistical methods, based on microarray gene expression data alone and individual genes' discriminatory power, often fail to identify biologically meaningful biomarkers thus resulting in poor prediction performance across data sets. Nonetheless, the variables in multivariable classifiers should synergistically interact to produce more effective classifiers than individual biomarkers.

Results

We developed an integrated approach, namely network-constrained support vector machine (netSVM), for cancer biomarker identification with an improved prediction performance. The netSVM approach is specifically designed for network biomarker identification by integrating gene expression data and protein-protein interaction data. We first evaluated the effectiveness of netSVM using simulation studies, demonstrating its improved performance over state-of-the-art network-based methods and gene-based methods for network biomarker identification. We then applied the netSVM approach to two breast cancer data sets to identify prognostic signatures for prediction of breast cancer metastasis. The experimental results show that: (1) network biomarkers identified by netSVM are highly enriched in biological pathways associated with cancer progression; (2) prediction performance is much improved when tested across different data sets. Specifically, many genes related to apoptosis, cell cycle, and cell proliferation, which are hallmark signatures of breast cancer metastasis, were identified by the netSVM approach. More importantly, several novel hub genes, biologically important with many interactions in PPI network but often showing little change in expression as compared with their downstream genes, were also identified as network biomarkers; the genes were enriched in signaling pathways such as TGF-beta signaling pathway, MAPK signaling pathway, and JAK-STAT signaling pathway. These signaling pathways may provide new insight to the underlying mechanism of breast cancer metastasis.

Conclusions

We have developed a network-based approach for cancer biomarker identification, netSVM, resulting in an improved prediction performance with network biomarkers. We have applied the netSVM approach to breast cancer gene expression data to predict metastasis in patients. Network biomarkers identified by netSVM reveal potential signaling pathways associated with breast cancer metastasis, and help improve the prediction performance across independent data sets.  相似文献   

10.
For small samples, classifier design algorithms typically suffer from overfitting. Given a set of features, a classifier must be designed and its error estimated. For small samples, an error estimator may be unbiased but, owing to a large variance, often give very optimistic estimates. This paper proposes mitigating the small-sample problem by designing classifiers from a probability distribution resulting from spreading the mass of the sample points to make classification more difficult, while maintaining sample geometry. The algorithm is parameterized by the variance of the spreading distribution. By increasing the spread, the algorithm finds gene sets whose classification accuracy remains strong relative to greater spreading of the sample. The error gives a measure of the strength of the feature set as a function of the spread. The algorithm yields feature sets that can distinguish the two classes, not only for the sample data, but for distributions spread beyond the sample data. For linear classifiers, the topic of the present paper, the classifiers are derived analytically from the model, thereby providing an enormous savings in computation time. The algorithm is applied to cancer classification via cDNA microarrays. In particular, the genes BRCA1 and BRCA2 are associated with a hereditary disposition to breast cancer, and the algorithm is used to find gene sets whose expressions can be used to classify BRCA1 and BRCA2 tumors.  相似文献   

11.
《IRBM》2022,43(1):62-74
BackgroundThe prediction of breast cancer subtypes plays a key role in the diagnosis and prognosis of breast cancer. In recent years, deep learning (DL) has shown good performance in the intelligent prediction of breast cancer subtypes. However, most of the traditional DL models use single modality data, which can just extract a few features, so it cannot establish a stable relationship between patient characteristics and breast cancer subtypes.DatasetWe used the TCGA-BRCA dataset as a sample set for molecular subtype prediction of breast cancer. It is a public dataset that can be obtained through the following link: https://portal.gdc.cancer.gov/projects/TCGA-BRCAMethodsIn this paper, a Hybrid DL model based on the multimodal data is proposed. We combine the patient's gene modality data with image modality data to construct a multimodal fusion framework. According to the different forms and states, we set up feature extraction networks respectively, and then we fuse the output of the two feature networks based on the idea of weighted linear aggregation. Finally, the fused features are used to predict breast cancer subtypes. In particular, we use the principal component analysis to reduce the dimensionality of high-dimensional data of gene modality and filter the data of image modality. Besides, we also improve the traditional feature extraction network to make it show better performance.ResultsThe results show that compared with the traditional DL model, the Hybrid DL model proposed in this paper is more accurate and efficient in predicting breast cancer subtypes. Our model achieved a prediction accuracy of 88.07% in 10 times of 10-fold cross-validation. We did a separate AUC test for each subtype, and the average AUC value obtained was 0.9427. In terms of subtype prediction accuracy, our model is about 7.45% higher than the previous average.  相似文献   

12.
13.
We investigate the multiclass classification of cancer microarray samples. In contrast to classification of two cancer types from gene expression data, multiclass classification of more than two cancer types are relatively hard and less studied problem. We used class-wise optimized genes with corresponding one-versus-all support vector machine (OVA-SVM) classifier to maximize the utilization of selected genes. Final prediction was made by using probability scores from all classifiers. We used three different methods of estimating probability from decision value. Among the three probability methods, Platt's approach was more consistent, whereas, isotonic approach performed better for datasets with unequal proportion of samples in different classes. Probability based decision does not only gives true and fair comparison between different one-versus-all (OVA) classifiers but also gives the possibility of using them for any post analysis. Several ensemble experiments, an example of post analysis, of the three probability methods were implemented to study their effect in improving the classification accuracy. We observe that ensemble did help in improving the predictive accuracy of cancer data sets especially involving unbalanced samples. Four-fold external stratified cross-validation experiment was performed on the six multiclass cancer datasets to obtain unbiased estimates of prediction accuracies. Analysis of class-wise frequently selected genes on two cancer datasets demonstrated that the approach was able to select important and relevant genes consistent to literature. This study demonstrates successful implementation of the framework of class-wise feature selection and multiclass classification for prediction of cancer subtypes on six datasets.  相似文献   

14.
OBJECTIVE: To analyze the presence of malignancy associated changes (MACs) in normal buccal mucosa cells of lung and breast cancer patients and their relationship to tumor subtype, stage and size. STUDY DESIGN: Buccal mucosa smears of 107 lung cancer and 100 breast cancer patients and corresponding healthy subjects were collected, stained by the DNA-specific Feulgen-thionin method and scanned using an automated high-resolution cytometer. Nuclear texture features of a minimum of 500 nuclei per slide were calculated, and statistical classifiers using Gaussian models of class-probability distribution were designed, trained and tested in 3 parts: (1) ability to separate cancer patient samples from controls, (2) cross-validation of classifiers for different cancer types, and (3) correlation of MAC expression with tumor subtype, stage and size. RESULTS: Lung and breast cancer induce MACs in normal buccal mucosa cells. The classifiers based on the selected nuclear features correctly recognized >80% of lung and breast cancer cases. The results indicate that MAC detection is not dependent on the tumor subtype, stage or size. CONCLUSION: The presence of MACs in buccal mucosa cells offers the potential for developing a new noninvasive cancer screening test.  相似文献   

15.
《Genomics》2022,114(2):110264
Cancer is one of the major causes of human death per year. In recent years, cancer identification and classification using machine learning have gained momentum due to the availability of high throughput sequencing data. Using RNA-seq, cancer research is blooming day by day and new insights of cancer and related treatments are coming into light. In this paper, we propose PanClassif, a method that requires a very few and effective genes to detect cancer from RNA-seq data and is able to provide performance gain in several wide range machine learning classifiers. We have taken 22 types of cancer samples from The Cancer Genome Atlas (TCGA) having 8287 cancer samples and 680 normal samples. Firstly, PanClassif uses k-Nearest Neighbour (k-NN) smoothing to smooth the samples to handle noise in the data. Then effective genes are selected by Anova based test. For balancing the train data, PanClassif applies an oversampling method, SMOTE. We have performed comprehensive experiments on the datasets using several classification algorithms. Experimental results shows that PanClassif outperform existing state-of-the-art methods available and shows consistent performance for two single cell RNA-seq datasets taken from Gene Expression Omnibus (GEO). PanClassif improves performances of a wide variety of classifiers for both binary cancer prediction and multi-class cancer classification. PanClassif is available as a python package (https://pypi.org/project/panclassif/). All the source code and materials of PanClassif are available at https://github.com/Zwei-inc/panclassif.  相似文献   

16.
Accurate molecular classification of cancer using simple rules   总被引:1,自引:0,他引:1  

Background

One intractable problem with using microarray data analysis for cancer classification is how to reduce the extremely high-dimensionality gene feature data to remove the effects of noise. Feature selection is often used to address this problem by selecting informative genes from among thousands or tens of thousands of genes. However, most of the existing methods of microarray-based cancer classification utilize too many genes to achieve accurate classification, which often hampers the interpretability of the models. For a better understanding of the classification results, it is desirable to develop simpler rule-based models with as few marker genes as possible.

Methods

We screened a small number of informative single genes and gene pairs on the basis of their depended degrees proposed in rough sets. Applying the decision rules induced by the selected genes or gene pairs, we constructed cancer classifiers. We tested the efficacy of the classifiers by leave-one-out cross-validation (LOOCV) of training sets and classification of independent test sets.

Results

We applied our methods to five cancerous gene expression datasets: leukemia (acute lymphoblastic leukemia [ALL] vs. acute myeloid leukemia [AML]), lung cancer, prostate cancer, breast cancer, and leukemia (ALL vs. mixed-lineage leukemia [MLL] vs. AML). Accurate classification outcomes were obtained by utilizing just one or two genes. Some genes that correlated closely with the pathogenesis of relevant cancers were identified. In terms of both classification performance and algorithm simplicity, our approach outperformed or at least matched existing methods.

Conclusion

In cancerous gene expression datasets, a small number of genes, even one or two if selected correctly, is capable of achieving an ideal cancer classification effect. This finding also means that very simple rules may perform well for cancerous class prediction.  相似文献   

17.
DNA microarray gene expression and microarray-based comparative genomic hybridization (aCGH) have been widely used for biomedical discovery. Because of the large number of genes and the complex nature of biological networks, various analysis methods have been proposed. One such method is "gene shaving," a procedure which identifies subsets of the genes with coherent expression patterns and large variation across samples. Since combining genomic information from multiple sources can improve classification and prediction of diseases, in this paper we proposed a new method, "ICA gene shaving" (ICA, independent component analysis), for jointly analyzing gene expression and copy number data. First we used ICA to analyze joint measurements, gene expression and copy number, of a biological system and project the data onto statistically independent biological processes. Next, we used these results to identify patterns of variation in the data and then applied an iterative shaving method. We investigated the properties of our proposed method by analyzing both simulated and real data. We demonstrated that the robustness of our method to noise using simulated data. Using breast cancer data, we showed that our method is superior to the Generalized Singular Value Decomposition (GSVD) gene shaving method for identifying genes associated with breast cancer.  相似文献   

18.
Object categorization using single-trial electroencephalography (EEG) data measured while participants view images has been studied intensively. In previous studies, multiple event-related potential (ERP) components (e.g., P1, N1, P2, and P3) were used to improve the performance of object categorization of visual stimuli. In this study, we introduce a novel method that uses multiple-kernel support vector machine to fuse multiple ERP component features. We investigate whether fusing the potential complementary information of different ERP components (e.g., P1, N1, P2a, and P2b) can improve the performance of four-category visual object classification in single-trial EEGs. We also compare the classification accuracy of different ERP component fusion methods. Our experimental results indicate that the classification accuracy increases through multiple ERP fusion. Additional comparative analyses indicate that the multiple-kernel fusion method can achieve a mean classification accuracy higher than 72 %, which is substantially better than that achieved with any single ERP component feature (55.07 % for the best single ERP component, N1). We compare the classification results with those of other fusion methods and determine that the accuracy of the multiple-kernel fusion method is 5.47, 4.06, and 16.90 % higher than those of feature concatenation, feature extraction, and decision fusion, respectively. Our study shows that our multiple-kernel fusion method outperforms other fusion methods and thus provides a means to improve the classification performance of single-trial ERPs in brain–computer interface research.  相似文献   

19.
20.
With the advent of high-throughput technologies for measuring genome-wide expression profiles, a large number of methods have been proposed for discovering diagnostic markers that can accurately discriminate between different classes of a disease. However, factors such as the small sample size of typical clinical data, the inherent noise in high-throughput measurements, and the heterogeneity across different samples, often make it difficult to find reliable gene markers. To overcome this problem, several studies have proposed the use of pathway-based markers, instead of individual gene markers, for building the classifier. Given a set of known pathways, these methods estimate the activity level of each pathway by summarizing the expression values of its member genes, and use the pathway activities for classification. It has been shown that pathway-based classifiers typically yield more reliable results compared to traditional gene-based classifiers. In this paper, we propose a new classification method based on probabilistic inference of pathway activities. For a given sample, we compute the log-likelihood ratio between different disease phenotypes based on the expression level of each gene. The activity of a given pathway is then inferred by combining the log-likelihood ratios of the constituent genes. We apply the proposed method to the classification of breast cancer metastasis, and show that it achieves higher accuracy and identifies more reproducible pathway markers compared to several existing pathway activity inference methods.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号