Similar Literature

20 similar records retrieved.
1.

Background

A tremendous amount of effort has been devoted to identifying genes for the diagnosis and prognosis of diseases using microarray gene expression data. It has been demonstrated that gene expression data have a cluster structure, where the clusters consist of co-regulated genes that tend to have coordinated functions. However, most available statistical methods for gene selection do not take this cluster structure into consideration.

Results

We propose a supervised group Lasso approach that takes into account the cluster structure in gene expression data for gene selection and predictive model building. For gene expression data without biological cluster information, we first divide genes into clusters using the K-means approach and determine the optimal number of clusters using the Gap method. The supervised group Lasso consists of two steps. In the first step, we identify important genes within each cluster using the Lasso method. In the second step, we select important clusters using the group Lasso. Tuning parameters are determined using V-fold cross validation at both steps to allow for further flexibility. Prediction performance is evaluated using leave-one-out cross validation. We apply the proposed method to disease classification and survival analysis with microarray data.

Conclusion

We analyze four microarray data sets using the proposed approach: two cancer data sets with binary cancer occurrence as outcomes and two lymphoma data sets with survival outcomes. The results show that the proposed approach is capable of identifying a small number of influential gene clusters and important genes within those clusters, and has better prediction performance than existing methods.
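The two-step workflow described above can be sketched compactly. The following is a minimal illustration under simplifying assumptions, not the authors' implementation: a fixed number of K-means clusters replaces the Gap statistic, the group-lasso step is a plain inline proximal-gradient solver, and all data and tuning values are synthetic placeholders.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 500))                     # 60 samples x 500 genes (toy data)
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=60)

# Step 0: cluster the genes (columns); the paper chooses k with the Gap statistic.
k = 20
clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X.T)

# Step 1: a Lasso within each cluster keeps that cluster's strongest genes.
kept = []
for c in range(k):
    idx = np.where(clusters == c)[0]
    lasso = LassoCV(cv=5).fit(X[:, idx], y)
    kept.extend(idx[lasso.coef_ != 0])
kept = np.array(sorted(kept))

# Step 2: a group Lasso over the surviving genes, grouped by their cluster.
# Plain proximal-gradient solver for 0.5*||y - Xb||^2 + lam * sum_g ||b_g||_2.
Xs, groups = X[:, kept], clusters[kept]
lam, step = 5.0, 1.0 / np.linalg.norm(Xs, 2) ** 2
beta = np.zeros(Xs.shape[1])
for _ in range(500):
    b = beta - step * Xs.T @ (Xs @ beta - y)       # gradient step
    for g in np.unique(groups):
        m = groups == g
        norm = np.linalg.norm(b[m])
        b[m] = 0.0 if norm == 0 else max(0.0, 1 - step * lam / norm) * b[m]
    beta = b                                       # block soft-thresholding applied

print("clusters kept:", np.unique(groups[beta != 0]))
print("genes kept   :", kept[beta != 0])
```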

2.
This paper introduces a novel approach to gene selection based on a substantial modification of the analytic hierarchy process (AHP). The modified AHP systematically integrates the outcomes of individual filter methods to select the most informative genes for microarray classification. Five individual ranking methods, including the t-test, entropy, the receiver operating characteristic (ROC) curve, the Wilcoxon test and the signal-to-noise ratio, are employed to rank genes. These ranked genes are then used as inputs for the modified AHP. Additionally, a method that uses a fuzzy standard additive model (FSAM) for cancer classification based on the genes selected by AHP is also proposed in this paper. Traditional FSAM learning is a hybrid process comprising unsupervised structure learning and supervised parameter tuning. A genetic algorithm (GA) is incorporated between the unsupervised and supervised training to optimize the number of fuzzy rules. The integration of the GA enables FSAM to deal with the high-dimension, low-sample-size nature of microarray data and thus enhances the efficiency of the classification. Experiments are carried out on numerous microarray datasets. The results demonstrate the dominance of the AHP-based gene selection over the single ranking methods. Furthermore, the AHP-FSAM combination shows high accuracy in microarray data classification compared with various competing classifiers. The proposed approach is therefore useful to medical practitioners and clinicians as a decision support system that can be implemented in real medical practice.
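The filter stage can be illustrated in a few lines. The snippet below is a hedged sketch that computes the five per-gene scores named above (t-statistic, an entropy-style score, ROC separation, Wilcoxon rank-sum and signal-to-noise ratio) on synthetic data and returns one ranking per criterion; the modified-AHP aggregation and the FSAM/GA classifier are not reproduced here.

```python
import numpy as np
from scipy import stats
from sklearn.metrics import roc_auc_score

def filter_ranks(X, y):
    """Rank genes (columns of X) under five filter criteria for a binary outcome y."""
    a, b = X[y == 0], X[y == 1]
    t = np.abs(stats.ttest_ind(a, b, equal_var=False).statistic)
    wilcox = np.abs([stats.ranksums(a[:, j], b[:, j]).statistic for j in range(X.shape[1])])
    auc = np.array([abs(roc_auc_score(y, X[:, j]) - 0.5) for j in range(X.shape[1])])
    s2n = np.abs(a.mean(0) - b.mean(0)) / (a.std(0) + b.std(0) + 1e-12)
    p = np.abs(X) / (np.abs(X).sum(0) + 1e-12)        # crude entropy-style spread score
    ent = -(p * np.log(p + 1e-12)).sum(0)
    scores = np.vstack([t, wilcox, auc, s2n, ent])
    return np.argsort(-scores, axis=1)                # one ranking (row) per criterion

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 200))
y = rng.integers(0, 2, 40)
X[:, :3] += 2.0 * y[:, None]                          # three informative genes
print(filter_ranks(X, y)[:, :5])                      # top-5 genes per filter method
```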

3.
MOTIVATION: Many methods have been developed for selecting small informative feature subsets in large noisy data. However, unsupervised methods are scarce. Examples are using the variance of the data collected for each feature, or the projection of the feature onto the first principal component. We propose a novel unsupervised criterion, based on SVD-entropy, that selects a feature according to its contribution to the entropy (CE) calculated on a leave-one-out basis. This can be implemented in four ways: simple ranking according to CE values (SR); forward selection by accumulating features according to which set produces the highest entropy (FS1); forward selection by accumulating features through the choice of the best CE out of the remaining ones (FS2); and backward elimination (BE) of the features with the lowest CE. RESULTS: We apply our methods to different benchmarks. In each case we evaluate the success of clustering the data in the selected feature spaces by measuring Jaccard scores with respect to known classifications. We demonstrate that feature filtering according to CE outperforms the variance method and gene shaving. There are cases where the analysis, based on a small set of selected features, outperforms the best score reported when all information was used. Our method also suggests an optimal size for the relevant feature set; this turns out to be just a few percent of the number of genes in the two leukemia datasets that we have analyzed. Moreover, the most favored selected genes turn out to have significant GO enrichment in relevant cellular processes.
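A compact sketch of the CE score and the simple-ranking (SR) variant, following the SVD-entropy definition given in the abstract; the data and the cut-off of ten features are illustrative, and the forward-selection and backward-elimination variants are not shown.

```python
import numpy as np

def svd_entropy(X):
    """Normalised entropy of the squared singular values of X."""
    s2 = np.linalg.svd(X, compute_uv=False) ** 2
    v = s2 / s2.sum()
    v = v[v > 0]
    return -(v * np.log(v)).sum() / np.log(len(v))

def ce_scores(X):
    """Leave-one-feature-out contribution of each column to the SVD-entropy (CE)."""
    e_all = svd_entropy(X)
    return np.array([e_all - svd_entropy(np.delete(X, j, axis=1))
                     for j in range(X.shape[1])])

rng = np.random.default_rng(2)
X = rng.normal(size=(30, 120))                 # toy expression matrix
ce = ce_scores(X)
sr_selected = np.argsort(-ce)[:10]             # SR: simple ranking, keep top-CE features
print("SR-selected features:", sorted(sr_selected))
```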

4.
Paul TK, Iba H. BioSystems, 2005, 82(3): 208-225
Recently, DNA microarray-based gene expression profiles have been used to correlate the clinical behavior of cancers with the differential gene expression levels in cancerous and normal tissues. To this end, after selection of some predictive genes based on the signal-to-noise (S2N) ratio, unsupervised learning such as clustering and supervised learning such as the k-nearest neighbor (kNN) classifier are widely used. Instead of the S2N ratio, adaptive searches such as the Probabilistic Model Building Genetic Algorithm (PMBGA) can be applied to select a smaller gene subset that classifies patient samples more accurately. In this paper, we propose a new PMBGA-based method for the identification of informative genes from microarray data. By applying our proposed method to the classification of three microarray data sets of binary and multi-type tumors, we demonstrate that the gene subsets selected with our technique yield better classification accuracy.
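As a hedged illustration of the idea behind probabilistic model-building search, the toy loop below evolves per-gene inclusion probabilities (a UMDA-style univariate model) and scores candidate subsets with k-NN cross-validation; it is a simplified stand-in, not the authors' PMBGA, and the data and settings are synthetic.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def pmbga_like_select(X, y, pop=40, elite=10, gens=20, seed=0):
    """UMDA-style search: evolve per-gene inclusion probabilities toward good subsets."""
    rng = np.random.default_rng(seed)
    p = np.full(X.shape[1], 0.05)                     # start from sparse random subsets
    for _ in range(gens):
        masks = rng.random((pop, X.shape[1])) < p
        masks[~masks.any(axis=1), 0] = True           # never evaluate an empty subset
        fitness = [cross_val_score(KNeighborsClassifier(3), X[:, m], y, cv=3).mean()
                   - 0.001 * m.sum() for m in masks]  # accuracy with a small size penalty
        top = masks[np.argsort(fitness)[-elite:]]
        p = 0.5 * p + 0.5 * top.mean(axis=0)          # re-estimate the probability model
    return np.where(p > 0.5)[0]

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 100))
y = rng.integers(0, 2, 50)
X[:, :4] += 1.5 * y[:, None]                          # four genuinely informative genes
print("selected genes:", pmbga_like_select(X, y))
```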

5.
Current feature selection methods for supervised classification of tissue samples from microarray data generally fail to exploit the complementary discriminatory power that can be found in sets of features. We use a feature selection method with the computational architecture of the cross-entropy method, including an additional preliminary step that ensures a lower bound on the number of times any feature is considered. Testing on a human lymph node data set, we show that there is a significant number of genes that perform well when their complementary power is assessed, but “pass under the radar” of popular feature selection methods that only assess genes individually on a given classification tool. We also show that this phenomenon becomes more apparent as the diagnostic specificity of the tissue samples analysed increases.
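The central claim, that a gene can look uninformative on its own yet be strongly discriminative together with a partner, is easy to reproduce on synthetic data. The toy check below is only an illustration of that phenomenon; it is not the paper's cross-entropy search.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(4)
n = 200
y = rng.integers(0, 2, n)
g1 = rng.normal(size=n)                                   # useless on its own
g2 = 3.0 * g1 - 2.0 * y + rng.normal(scale=0.3, size=n)   # informative only via g2 - 3*g1
X = np.column_stack([g1, g2, rng.normal(size=(n, 48))])   # plus 48 pure-noise genes

def cv_acc(cols):
    return cross_val_score(LogisticRegression(max_iter=1000), X[:, cols], y, cv=5).mean()

print("gene 1 alone:", round(cv_acc([0]), 2))             # near chance level
print("gene 2 alone:", round(cv_acc([1]), 2))             # modest at best
print("genes 1 + 2 :", round(cv_acc([0, 1]), 2))          # strong complementary pair
```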

6.
Feature selection is widely established as one of the fundamental computational techniques in mining microarray data. Due to the lack of labeled information in practice, unsupervised feature selection is more practically important but correspondingly more difficult. Motivated by cluster ensemble techniques, which combine multiple clustering solutions into a consensus solution of higher accuracy and stability, recent efforts in unsupervised feature selection have proposed to use these consensus solutions as oracles. However, these methods depend both on the particular cluster ensemble algorithm used and on knowledge of the true cluster number, and they become unsuitable when the true cluster number is not available, which is common in practice. In view of these problems, a new unsupervised feature ranking method is proposed that evaluates the importance of the features based on consensus affinity. Different from previous works, our method compares the corresponding affinity of each feature between a pair of instances based on the consensus matrix of clustering solutions. As a result, our method alleviates the need to know the true number of clusters and the dependence on particular cluster ensemble approaches found in previous works. Experiments on real gene expression data sets demonstrate a significant improvement in the feature ranking results when compared to several state-of-the-art techniques.
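A hedged sketch of the consensus-affinity idea: build a co-association (consensus) matrix from several clustering runs, then score each feature by how well its own pairwise affinities agree with that consensus. The affinity definition, the ensemble generator and the data below are simplified stand-ins, not the paper's exact formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def consensus_matrix(X, ks=(2, 3, 4, 5), runs=5, seed=0):
    """Co-association matrix: fraction of clusterings placing samples i and j together."""
    C, total = np.zeros((X.shape[0], X.shape[0])), 0
    for k in ks:
        for r in range(runs):
            lab = KMeans(n_clusters=k, n_init=5, random_state=seed + r).fit_predict(X)
            C += lab[:, None] == lab[None, :]
            total += 1
    return C / total

def consensus_affinity_ranking(X):
    """Rank features by agreement between their pairwise affinities and the consensus."""
    C = consensus_matrix(X)
    iu = np.triu_indices(X.shape[0], k=1)
    scores = []
    for j in range(X.shape[1]):
        aff = 1.0 / (1.0 + np.abs(X[:, j, None] - X[None, :, j]))  # simple affinity
        scores.append(np.corrcoef(aff[iu], C[iu])[0, 1])
    return np.argsort(scores)[::-1]

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (25, 50)), rng.normal(3, 1, (25, 50))])
X[:, 10:] = rng.normal(size=(50, 40))        # only the first 10 features carry the clusters
print("top-ranked features:", consensus_affinity_ranking(X)[:10])
```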

7.
Selecting relevant features is a common task in most OMICs data analyses, where the aim is to identify a small set of key features to be used as biomarkers. To this end, two alternative but equally valid methods are mainly available, namely the univariate (filter) and the multivariate (wrapper) approaches. The stability of the selected lists of features is an often neglected but very important requirement: if the same features are selected in multiple independent iterations, they are more likely to be reliable biomarkers. In this study, we developed and evaluated the performance of a novel method for feature selection and prioritization, aiming at generating robust and stable sets of features with high predictive power. The proposed method uses fuzzy logic for a first unbiased feature selection and a Random Forest built from conditional inference trees to prioritize the candidate discriminant features. Analyzing several multi-class gene expression microarray data sets, we demonstrate that our technique provides equal or better classification performance and greater stability compared to other Random Forest-based feature selection methods.
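The stability requirement can be made concrete with a small check: select the top-k features on repeated bootstrap resamples and measure the Jaccard overlap between the selected lists. The importance measure below is an ordinary scikit-learn random forest on synthetic data, standing in for the paper's fuzzy-logic plus conditional-inference-forest pipeline.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def topk_jaccard(X, y, k=20, repeats=10, seed=0):
    """Mean Jaccard overlap of the top-k important features across bootstrap resamples."""
    rng = np.random.default_rng(seed)
    tops = []
    for r in range(repeats):
        idx = rng.choice(len(y), size=len(y), replace=True)
        rf = RandomForestClassifier(n_estimators=200, random_state=r).fit(X[idx], y[idx])
        tops.append(set(np.argsort(rf.feature_importances_)[-k:]))
    pairs = [len(a & b) / len(a | b) for i, a in enumerate(tops) for b in tops[i + 1:]]
    return float(np.mean(pairs))

rng = np.random.default_rng(6)
X = rng.normal(size=(80, 300))
y = rng.integers(0, 3, 80)                        # three classes
for c in range(3):
    X[y == c, c * 5:(c + 1) * 5] += 2.0           # 15 genuinely informative genes
print("mean Jaccard stability of the top-20 list:", round(topk_jaccard(X, y), 2))
```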

8.
Basic microarray analysis: grouping and feature reduction
DNA microarray technologies are useful for addressing a broad range of biological problems, including the measurement of mRNA expression levels in target cells. These studies typically produce large data sets that contain measurements on thousands of genes under hundreds of conditions. There is a critical need to summarize these data and to pick out the important details. The most common activities, therefore, are to group together microarray data and to reduce the number of features. Both of these activities can be done using only the raw microarray data (unsupervised methods) or using external information that provides labels for the microarray data (supervised methods). We briefly review supervised and unsupervised methods for grouping and reducing data in the context of a publicly available suite of tools called CLEAVER, and illustrate their application on a representative data set collected to study lymphoma.

9.

Motivation

DNA microarray analysis is characterized by obtaining a large number of gene variables from a small number of observations. Cluster analysis is widely used to analyze DNA microarray data for disease classification and diagnosis. Because there are so many irrelevant and insignificant genes in a dataset, a feature selection approach must be employed in the analysis. The performance of cluster analysis of these high-throughput data depends on whether the feature selection approach chooses the most relevant genes associated with the disease classes.

Results

Here we propose a new method using multiple Orthogonal Partial Least Squares-Discriminant Analysis (mOPLS-DA) models and S-plots to select the most relevant genes for three-class disease classification and prediction. We tested our method using Golub’s leukemia microarray data. For three classes with subtypes, we proposed hierarchical orthogonal partial least squares-discriminant analysis (OPLS-DA) models and S-plots to select features for the two main classes and their subtypes. For three classes in parallel, we employed three OPLS-DA models and S-plots to choose marker genes for each class. The power of the selected features to classify and predict the three-class disease was evaluated using cluster analysis. Further, the general performance of our method was tested on four public datasets and compared with that of four other feature selection methods. The results revealed that our method effectively selected the most relevant features for disease classification and prediction, and that its performance was better than that of the other methods.
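A hedged sketch of the S-plot construction: project samples onto one predictive latent score and, for every gene, read off its covariance and correlation with that score; genes at the extremes of both axes are the marker candidates. Ordinary PLS regression from scikit-learn stands in for the orthogonalised OPLS-DA models used by the authors, and the data are synthetic.

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def s_plot(X, y):
    """Covariance and correlation of every gene with the first PLS score (S-plot axes)."""
    Xc = X - X.mean(axis=0)
    pls = PLSRegression(n_components=1).fit(Xc, y.astype(float))
    t = pls.transform(Xc)[:, 0]                        # first predictive score
    cov = Xc.T @ t / (len(t) - 1)
    corr = cov / (Xc.std(axis=0, ddof=1) * t.std(ddof=1))
    return cov, corr

rng = np.random.default_rng(8)
X = rng.normal(size=(40, 300))
y = np.repeat([0, 1], 20)
X[y == 1, :8] += 2.0                                   # eight true marker genes

cov, corr = s_plot(X, y)
markers = np.argsort(-np.abs(cov) * np.abs(corr))[:8]  # genes at the S-plot extremes
print("candidate marker genes:", sorted(markers))
```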

10.
MOTIVATION: Grouping genes with similar expression patterns is called gene clustering, which has proved to be a useful tool for extracting the underlying biological information of gene expression data. Many clustering procedures have shown success in microarray gene clustering; most of them belong to the family of heuristic clustering algorithms. Model-based algorithms are alternative clustering algorithms, based on the assumption that the whole set of microarray data is a finite mixture of a certain type of distributions with different parameters. Application of model-based algorithms to unsupervised clustering has been reported. Here, for the first time, we demonstrate the use of a model-based algorithm in supervised clustering of microarray data. RESULTS: We applied the proposed methods to real gene expression data and simulated data. We show that the supervised model-based algorithm is superior to the unsupervised method and the support vector machine (SVM) method. AVAILABILITY: A program written in the SAS language implementing methods I-III in this report is available upon request. The SVM software is available at http://svm.sdsc.edu/cgi-bin/nph-SVMsubmit.cgi

11.
IRBM, 2014, 35(5): 244-254
Objective: The overall goal of the study is to detect coronary artery lesions regardless of their nature, calcified or hypo-dense. To avoid explicit modelling of heterogeneous lesions, we adopted an approach based on machine learning and using unsupervised or semi-supervised classifiers. The success of classifiers based on machine learning strongly depends on the appropriate choice of features differentiating between lesions and regular appearance. The specific goal of this article is to propose a novel strategy devised to select the best feature set for the classifiers used, out of a given set of candidate features.
Materials and methods: The features are calculated in image planes orthogonal to the artery centerline, and the classifier assigns to each of these cross-sections a label “healthy” or “diseased”. The contribution of this article is a feature-selection strategy based on the empirical risk function, which is used as a criterion both in the initial feature ranking and in the selection process itself. We assessed this strategy in association with two classifiers based on the density-level detection approach, which seeks outliers from the distribution corresponding to the regular appearance. The method was evaluated using a total of 13,687 cross-sections extracted from 53 coronary arteries in 15 patients.
Results: Using the feature subset selected by the risk-based strategy, the balanced error rates achieved by the unsupervised and semi-supervised classifiers were 13.5% and 15.4%, respectively. These results were substantially better than the rates achieved using feature subsets selected by supervised strategies. The unsupervised and semi-supervised methods also outperformed supervised classifiers using feature subsets selected by the corresponding supervised strategies.
Discussion: Supervised methods require large data sets annotated by experts, both to select the features and to train the classifiers, and collecting these annotations is time-consuming. With these methods, lesions whose appearance differs from the training data may remain undetected. The lesion-detection problem is highly imbalanced, since healthy cross-sections are usually much more numerous than diseased ones. Training classifiers based on the density-level detection approach needs a small number of annotations or no annotations at all, and the same annotations are sufficient to compute the empirical risk and to perform the selection. Therefore, our strategy associated with an unsupervised or semi-supervised classifier requires considerably fewer annotations than conventional supervised selection strategies. The proposed approach is also better suited to highly imbalanced problems and can detect lesions that differ from the training set.
Conclusion: The risk-based selection strategy, associated with classifiers using the density-level detection approach, outperformed other strategies and classifiers when used to detect coronary artery lesions. It is well suited to highly imbalanced problems, where the lesions are represented as low-density regions of the feature space, and it can be used in other anomaly-detection problems interpretable as a binary classification problem where the empirical risk can be calculated.

12.
MOTIVATION: Currently the most popular approach to analyzing genome-wide expression data is clustering. One of the major drawbacks of most existing clustering methods is that the number of clusters has to be specified a priori. Furthermore, by using purely unsupervised algorithms, prior biological knowledge is totally ignored. Moreover, most current tools lack an effective framework for tight integration of unsupervised and supervised learning for the analysis of high-dimensional expression data, and only very few multi-class supervised approaches are designed with the provision for effectively utilizing multiple functional class labelings. RESULTS: The paper adapts a novel self-organizing map, called the supervised Network Self-Organized Map (sNet-SOM), to the peculiarities of multi-labeled gene expression data. The sNet-SOM determines the number of clusters adaptively with a dynamic extension process. This process is driven by an inhomogeneous measure that tries to balance unsupervised, supervised and model-complexity criteria. Nodes within a rectangular grid are grown at the boundary nodes, weights are rippled from the internal nodes towards the outer nodes of the grid, and whole columns are inserted within the map. The appropriate level of expansion is determined automatically. Multiple sNet-SOM models are constructed dynamically, each for a different unsupervised/supervised balance, and model selection criteria are used to select the optimum one. The results indicate that sNet-SOM yields performance competitive with other recently proposed approaches for supervised classification at a significantly reduced computational cost, and that it provides extensive exploratory analysis potential within the analysis framework. Furthermore, it explores simple design decisions that are easier to comprehend and computationally efficient.

13.
The paper presents MRNET, an original method for inferring genetic networks from microarray data. The method is based on maximum relevance/minimum redundancy (MRMR), an effective information-theoretic technique for feature selection in supervised learning. The MRMR principle consists in selecting, among the least redundant variables, the ones that have the highest mutual information with the target. MRNET extends this feature selection principle to networks in order to infer gene-dependence relationships from microarray data. The paper assesses MRNET by benchmarking it against RELNET, CLR and ARACNE, three state-of-the-art information-theoretic methods for large (up to several thousand genes) network inference. Experimental results on thirty synthetically generated microarray datasets show that MRNET is competitive with these methods.
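The MRMR step at the heart of MRNET can be sketched for a single target gene: greedily pick candidate regulators that maximise mutual information with the target minus their average redundancy with genes already picked. The sketch below uses scikit-learn's nearest-neighbour mutual-information estimator on synthetic data and illustrates the principle only; it is not the MRNET implementation.

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

def mrmr_neighbours(X, target, n_select=3, seed=0):
    """Greedy MRMR ranking of candidate regulators for one target gene (a column index)."""
    others = [j for j in range(X.shape[1]) if j != target]
    relevance = mutual_info_regression(X[:, others], X[:, target], random_state=seed)
    selected, remaining = [], list(range(len(others)))
    while remaining and len(selected) < n_select:
        def score(i):
            if not selected:
                return relevance[i]
            red = np.mean([mutual_info_regression(X[:, [others[i]]], X[:, others[s]],
                                                  random_state=seed)[0] for s in selected])
            return relevance[i] - red                 # max relevance, min redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [others[i] for i in selected]

rng = np.random.default_rng(9)
n = 100
g0 = rng.normal(size=n)                               # the target gene
X = np.column_stack([g0,
                     g0 + 0.1 * rng.normal(size=n),   # a tightly linked gene
                     0.5 * g0 + rng.normal(size=n),   # a weakly linked gene
                     rng.normal(size=(n, 7))])        # unrelated genes
print("MRMR neighbours of gene 0:", mrmr_neighbours(X, target=0))
```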

14.
Prognostic prediction is important in the medical domain, because it can be used to select an appropriate treatment for a patient by predicting the patient's clinical outcomes. For high-dimensional data, a standard prognostic method involves two steps: feature selection and prognosis analysis. Recently, the L1-L2-norm Support Vector Machine (L1-L2 SVM) has been developed as an effective classification technique and has shown good classification performance with automatic feature selection. In this paper, we extend the L1-L2 SVM to regression analysis with automatic feature selection. We further improve the L1-L2 SVM for prognostic prediction by utilizing the information in censored data as constraints, and we design an efficient solution to the new optimization problem. The proposed method is compared with seven other prognostic prediction methods on three real-world data sets. The experimental results show that the proposed method consistently performs better than the median of the compared methods, and it is more efficient than the other algorithms with similar performance.

15.
MOTIVATION: Most supervised classification methods are limited by the requirement for more cases than variables. In microarray data the number of variables (genes) far exceeds the number of cases (arrays), and thus filtering and pre-selection of genes is required. We describe the application of Between Group Analysis (BGA) to the analysis of microarray data. A feature of BGA is that it can be used when the number of variables (genes) exceeds the number of cases (arrays). BGA is based on carrying out an ordination of groups of samples, using a standard method such as Correspondence Analysis (COA), rather than an ordination of the individual microarray samples. As such, it can be viewed as a method of carrying out COA with grouped data. RESULTS: We illustrate the power of the method using two cancer data sets. In both cases, we can quickly and accurately classify test samples from any number of specified a priori groups and identify the genes which characterize these groups. We obtained very high rates of correct classification, as determined by jack-knife or validation experiments with training and test sets. The results are comparable to those from other methods in terms of accuracy but the power and flexibility of BGA make it an especially attractive method for the analysis of microarray cancer data.
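A hedged sketch of the between-group idea: ordinate the class centroids rather than the individual arrays, then project samples (including new ones) onto those axes and read off the genes that load most strongly. PCA on the centroids stands in for the correspondence analysis used in the paper, and the data are synthetic.

```python
import numpy as np
from sklearn.decomposition import PCA

def bga_fit(X, y, n_axes=2):
    """Ordinate the class centroids; return (centring mean, axes) for projecting samples."""
    classes = np.unique(y)
    centroids = np.vstack([X[y == c].mean(axis=0) for c in classes])
    mean = X.mean(axis=0)
    pca = PCA(n_components=min(n_axes, len(classes) - 1)).fit(centroids - mean)
    return mean, pca.components_

rng = np.random.default_rng(10)
X = rng.normal(size=(30, 1000))                   # 30 arrays, 1000 genes
y = np.repeat([0, 1, 2], 10)
for c in range(3):
    X[y == c, c * 20:(c + 1) * 20] += 2.5         # class-specific genes

mean, axes = bga_fit(X, y)
new_sample = rng.normal(size=1000)
new_sample[:20] += 2.5                            # resembles class 0
print("new-sample projection:", (new_sample - mean) @ axes.T)
print("genes driving axis 1 :", np.argsort(-np.abs(axes[0]))[:5])
```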

16.
Microarray data have a high dimension of variables, but available datasets usually contain only a small number of samples, which makes the study of such datasets interesting and challenging. In the task of analyzing microarray data for the purpose of, e.g., predicting gene-disease associations, feature selection is very important because it provides a way to handle the high dimensionality by exploiting the information redundancy induced by associations among genetic markers. Judicious feature selection in microarray data analysis can result in a significant reduction of cost while maintaining or improving the classification or prediction accuracy of the learning machines employed to sort out the datasets. In this paper, we propose a gene selection method called Recursive Feature Addition (RFA), which combines supervised learning and statistical similarity measures. We compare our method with the following gene selection methods:
  • Support Vector Machine Recursive Feature Elimination (SVMRFE)
  • Leave-One-Out Calculation Sequential Forward Selection (LOOCSFS)
  • Gradient based Leave-one-out Gene Selection (GLGS)
To evaluate the performance of these gene selection methods, we employ several popular learning classifiers on the MicroArray Quality Control phase II on predictive modeling (MAQC-II) breast cancer dataset and the MAQC-II multiple myeloma dataset. Experimental results show that the performance of gene selection is tightly coupled to the learning classifier it is paired with. Overall, our approach outperforms the other compared methods. The biological functional analysis based on the MAQC-II breast cancer dataset convinced us to apply our method to phenotype prediction. Additionally, learning classifiers also play important roles in the classification of microarray data, and our experimental results indicate that the Nearest Mean Scale Classifier (NMSC) is a good choice due to its prediction reliability and its stability across the three performance measurements: testing accuracy, MCC values and AUC errors.

17.
Predicting survival from microarray data--a comparative study
MOTIVATION: Survival prediction from gene expression data and other high-dimensional genomic data has been the subject of much research in recent years. These kinds of data are associated with the methodological problem of having many more gene expression values than individuals. In addition, the responses are censored survival times. Most of the proposed methods handle this by using Cox's proportional hazards model and obtain parameter estimates by some dimension reduction or parameter shrinkage estimation technique. Using three well-known microarray gene expression data sets, we compare the prediction performance of seven such methods: univariate selection, forward stepwise selection, principal components regression (PCR), supervised principal components regression, partial least squares regression (PLS), ridge regression and the lasso. RESULTS: Statistical learning from subsets should be repeated several times in order to get a fair comparison between methods. Methods using coefficient shrinkage or linear combinations of the gene expression values have much better performance than the simple variable selection methods. For our data sets, ridge regression has the best overall performance. AVAILABILITY: Matlab and R code for the prediction methods are available at http://www.med.uio.no/imb/stat/bmms/software/microsurv/.
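Two of the compared approaches, ridge- and lasso-penalised Cox regression, can be sketched with the lifelines package on synthetic censored data. This is an illustrative sketch under the assumption that lifelines' elastic-net penaliser is an acceptable stand-in; the remaining five methods from the comparison are not reproduced.

```python
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(11)
n, p = 120, 20
X = rng.normal(size=(n, p))
risk = X[:, 0] - 0.8 * X[:, 1]                        # two genes drive the hazard
event_time = rng.exponential(scale=np.exp(-risk))
censor_time = rng.exponential(scale=2.0, size=n)

df = pd.DataFrame(X, columns=[f"g{j}" for j in range(p)])
df["T"] = np.minimum(event_time, censor_time)         # observed (possibly censored) time
df["E"] = (event_time <= censor_time).astype(int)     # 1 = event observed, 0 = censored

ridge = CoxPHFitter(penalizer=0.5, l1_ratio=0.0)      # ridge: shrink all coefficients
lasso = CoxPHFitter(penalizer=0.5, l1_ratio=1.0)      # lasso-style: push most toward zero
ridge.fit(df, duration_col="T", event_col="E")
lasso.fit(df, duration_col="T", event_col="E")

print("ridge concordance index:", round(ridge.concordance_index_, 2))
print("largest lasso coefficients:")
print(lasso.params_.abs().sort_values(ascending=False).head(3))
```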

18.
We present a new computational technique (a software implementation, data sets, and supplementary information are available at http://www.enm.bris.ac.uk/lpd/) that enables the probabilistic analysis of cDNA microarray data, and we demonstrate its effectiveness in identifying features of biomedical importance. A hierarchical Bayesian model, called Latent Process Decomposition (LPD), is introduced in which each sample in the data set is represented as a combinatorial mixture over a finite set of latent processes, which are expected to correspond to biological processes. Parameters in the model are estimated using efficient variational methods. This type of probabilistic model is most appropriate for the interpretation of measurement data generated by cDNA microarray technology. For determining informative substructure in such data sets, the proposed model has several important advantages over the standard use of dendrograms. First, the ability to objectively assess the optimal number of sample clusters. Second, the ability to represent samples and gene expression levels using a common set of latent variables (dendrograms cluster samples and gene expression values separately, which amounts to two distinct reduced-space representations). Third, in contrast to standard cluster models, observations are not assigned to a single cluster and, thus, for example, gene expression levels are modeled via combinations of the latent processes identified by the algorithm. We show that this new method compares favorably with alternative cluster analysis methods. To illustrate its potential, we apply the proposed technique to several microarray data sets for cancer. For these data sets it successfully decomposes the data into known subtypes and indicates possible further taxonomic subdivision, in addition to highlighting, in a wholly unsupervised manner, the importance of certain genes which are known to be medically significant. To illustrate its wider applicability, we also demonstrate its performance on a microarray data set for yeast.

19.
MOTIVATION: Methods for analyzing cancer microarray data often face two distinct challenges: the models they infer need to perform well when classifying new tissue samples while at the same time providing insight into the patterns and gene interactions hidden in the data. State-of-the-art supervised data mining methods often cover only one of these aspects well, motivating the development of methods where predictive models with solid classification performance can be easily communicated to the domain expert. RESULTS: Data visualization may provide an excellent approach to knowledge discovery and analysis of class-labeled data. We have previously developed an approach called VizRank that can score and rank point-based visualizations according to the degree of separation of data instances of different classes. We here extend VizRank with techniques to uncover outliers, score features (genes) and perform classification, and we demonstrate that the proposed approach is well suited for cancer microarray analysis. Using VizRank and radviz visualization on a set of previously published cancer microarray data sets, we were able to find simple, interpretable data projections that include only a small subset of genes yet clearly differentiate among different cancer types. We also report that our approach to classification through visualization achieves performance comparable to state-of-the-art supervised data mining techniques. AVAILABILITY: VizRank and radviz are implemented as part of the Orange data mining suite (http://www.ailab.si/orange). SUPPLEMENTARY INFORMATION: Supplementary data are available from http://www.ailab.si/supp/bi-cancer.
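Radviz projections of small gene subsets are also available outside Orange; the hedged sketch below uses pandas' radviz plot on synthetic data to show the kind of compact, class-separating projection that VizRank searches for and scores.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas.plotting import radviz

rng = np.random.default_rng(12)
X = rng.normal(size=(90, 200))
y = np.repeat(["typeA", "typeB", "typeC"], 30)
for c, cls in enumerate(["typeA", "typeB", "typeC"]):
    X[y == cls, c * 3:(c + 1) * 3] += 2.0             # a few class-specific genes

subset = [0, 3, 6]                                     # one candidate small projection
df = pd.DataFrame(X[:, subset], columns=[f"g{j}" for j in subset])
df["class"] = y
radviz(df, "class")                                    # samples separate by cancer type
plt.title("radviz projection of a 3-gene subset")
plt.savefig("radviz_subset.png")
```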

20.
Particle tracking in living systems requires low light exposure and short exposure times to avoid phototoxicity and photobleaching and to fully capture particle motion with high-speed imaging. Low-excitation light comes at the expense of tracking accuracy. Image restoration methods based on deep learning dramatically improve the signal-to-noise ratio in low-exposure data sets, qualitatively improving the images. However, it is not clear whether images generated by these methods yield accurate quantitative measurements such as diffusion parameters in (single) particle tracking experiments. Here, we evaluate the performance of two popular deep learning denoising software packages for particle tracking, using synthetic data sets and movies of diffusing chromatin as biological examples. With synthetic data, both supervised and unsupervised deep learning restored particle motions with high accuracy in two-dimensional data sets, whereas artifacts were introduced by the denoisers in three-dimensional data sets. Experimentally, we found that, while both supervised and unsupervised approaches improved tracking results compared with the original noisy images, supervised learning generally outperformed the unsupervised approach. We find that nicer-looking image sequences are not synonymous with more precise tracking results and highlight that deep learning algorithms can produce deceiving artifacts with extremely noisy images. Finally, we address the challenge of selecting parameters to train convolutional neural networks by implementing a frugal Bayesian optimizer that rapidly explores multidimensional parameter spaces, identifying networks yielding optimal particle tracking accuracy. Our study provides quantitative outcome measures of image restoration using deep learning. We anticipate broad application of this approach to critically evaluate artificial intelligence solutions for quantitative microscopy.
