首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 62 毫秒
1.
2.
Bagging to improve the accuracy of a clustering procedure   总被引:5,自引:0,他引:5  
MOTIVATION: The microarray technology is increasingly being applied in biological and medical research to address a wide range of problems such as the classification of tumors. An important statistical question associated with tumor classification is the identification of new tumor classes using gene expression profiles. Essential aspects of this clustering problem include identifying accurate partitions of the tumor samples into clusters and assessing the confidence of cluster assignments for individual samples. RESULTS: Two new resampling methods, inspired from bagging in prediction, are proposed to improve and assess the accuracy of a given clustering procedure. In these ensemble methods, a partitioning clustering procedure is applied to bootstrap learning sets and the resulting multiple partitions are combined by voting or the creation of a new dissimilarity matrix. As in prediction, the motivation behind bagging is to reduce variability in the partitioning results via averaging. The performances of the new and existing methods were compared using simulated data and gene expression data from two recently published cancer microarray studies. The bagged clustering procedures were in general at least as accurate and often substantially more accurate than a single application of the partitioning clustering procedure. A valuable by-product of bagged clustering are the cluster votes which can be used to assess the confidence of cluster assignments for individual observations. SUPPLEMENTARY INFORMATION: For supplementary information on datasets, analyses, and software, consult http://www.stat.berkeley.edu/~sandrine and http://www.bioconductor.org.  相似文献   

3.
MOTIVATION: A number of algorithms and analytical models have been employed to reduce the multidimensional complexity of DNA array data and attempt to extract some meaningful interpretation of the results. These include clustering, principal components analysis, self-organizing maps, and support vector machine analysis. Each method assumes an implicit model for the data, many of which separate genes into distinct clusters defined by similar expression profiles in the samples tested. A point of concern is that many genes may be involved in a number of distinct behaviours, and should therefore be modelled to fit into as many separate clusters as detected in the multidimensional gene expression space. The analysis of gene expression data using a decomposition model that is independent of the observer involved would be highly beneficial to improve standard and reproducible classification of clinical and research samples. RESULTS: We present a variational independent component analysis (ICA) method for reducing high dimensional DNA array data to a smaller set of latent variables, each associated with a gene signature. We present the results of applying the method to data from an ovarian cancer study, revealing a number of tissue type-specific and tissue type-independent gene signatures present in varying amounts among the samples surveyed. The observer independent results of such molecular analysis of biological samples could help identify patients who would benefit from different treatment strategies. We further explore the application of the model to similar high-throughput studies.  相似文献   

4.
MOTIVATION: Consensus clustering, also known as cluster ensemble, is one of the important techniques for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches have the ability to integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. By consensus clustering, one can discover the underlying classes of the samples in gene expression data. RESULTS: In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we also design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time in which GCC is applied to class discovery for microarray data. Given a pre specified maximum number of classes (denoted as K(max) in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning. AVAILABILITY: Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu.  相似文献   

5.
6.
Mixture modelling of gene expression data from microarray experiments   总被引:5,自引:0,他引:5  
MOTIVATION: Hierarchical clustering is one of the major analytical tools for gene expression data from microarray experiments. A major problem in the interpretation of the output from these procedures is assessing the reliability of the clustering results. We address this issue by developing a mixture model-based approach for the analysis of microarray data. Within this framework, we present novel algorithms for clustering genes and samples. One of the byproducts of our method is a probabilistic measure for the number of true clusters in the data. RESULTS: The proposed methods are illustrated by application to microarray datasets from two cancer studies; one in which malignant melanoma is profiled (Bittner et al., Nature, 406, 536-540, 2000), and the other in which prostate cancer is profiled (Dhanasekaran et al., 2001, submitted).  相似文献   

7.
8.
Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. The DNA microarray technology makes it possible to simultaneously analyze large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns from large quantities of expression data that in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k- ¬means clustering algorithm is widely used for many practical applications. But the original k-¬means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-¬means algorithm. A meta-heuristic optimization algorithm named harmony search helps find out near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering the gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms.  相似文献   

9.
Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.  相似文献   

10.
MOTIVATION: One particular application of microarray data, is to uncover the molecular variation among cancers. One feature of microarray studies is the fact that the number n of samples collected is relatively small compared to the number p of genes per sample which are usually in the thousands. In statistical terms this very large number of predictors compared to a small number of samples or observations makes the classification problem difficult. An efficient way to solve this problem is by using dimension reduction statistical techniques in conjunction with nonparametric discriminant procedures. RESULTS: We view the classification problem as a regression problem with few observations and many predictor variables. We use an adaptive dimension reduction method for generalized semi-parametric regression models that allows us to solve the 'curse of dimensionality problem' arising in the context of expression data. The predictive performance of the resulting classification rule is illustrated on two well know data sets in the microarray literature: the leukemia data that is known to contain classes that are easy 'separable' and the colon data set.  相似文献   

11.
MOTIVATION: Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. RESULTS: We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm guarantees to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search of the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. The employment of this statistically rigorous test has shown that our algorithm places greater than 90% genes into correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.  相似文献   

12.
Standard and Consensus Clustering Analysis Tool for Microarray Data (SC2ATmd) is a MATLAB-implemented application specifically designed for the exploration of microarray gene expression data via clustering. Implementation of two versions of the clustering validation method figure of merit allows for performance comparisons between different clustering algorithms, and tailors the cluster analysis process to the varying characteristics of each dataset. Along with standard clustering algorithms this application also offers a consensus clustering method that can generate reproducible clusters across replicate experiments or different clustering algorithms. This application was designed specifically for the analysis of gene expression data, but may be used with any numerical data as long as it is in the right format. AVAILABILITY: SC2ATmd may be freely downloaded from http://www.compbiosci.wfu.edu/tools.htm.  相似文献   

13.
MOTIVATION: DNA microarray technologies make it possible to simultaneously monitor thousands of genes' expression levels. A topic of great interest is to study the different expression profiles between microarray samples from cancer patients and normal subjects, by classifying them at gene expression levels. Currently, various clustering methods have been proposed in the literature to classify cancer and normal samples based on microarray data, and they are predominantly data-driven approaches. In this paper, we propose an alternative approach, a model-driven approach, which can reveal the relationship between the global gene expression profile and the subject's health status, and thus is promising in predicting the early development of cancer. RESULTS: In this work, we propose an ensemble dependence model, aimed at exploring the group dependence relationship of gene clusters. Under the framework of hypothesis-testing, we employ genes' dependence relationship as a feature to model and classify cancer and normal samples. The proposed classification scheme is applied to several real cancer datasets, including cDNA, Affymetrix microarray and proteomic data. It is noted that the proposed method yields very promising performance. We further investigate the eigenvalue pattern of the proposed method, and we discover different patterns between cancer and normal samples. Moreover, the transition between cancer and normal patterns suggests that the eigenvalue pattern of the proposed models may have potential to predict the early stage of cancer development. In addition, we examine the effects of possible model mismatch on the proposed scheme.  相似文献   

14.
MOTIVATION: With the advent of microarray chip technology, large data sets are emerging containing the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While the use of hierarchical clustering (UPGMA) with correlation 'distance' has been the most common in the microarray studies, there are many more choices of clustering algorithms in pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm to be used for grouping genes based on their expression profiles. RESULTS: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performances on a well-known publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of these six clustering methods with these validation measures. While the 'best' method is dependent on the exact validation strategy and the number of clusters to be used, overall Diana appears to be a solid performer. Interestingly, the performance of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be on opposite extremes, depending on what validation measure one employs. Next it is shown that the group means produced by Diana are the closest and those produced by UPGMA are the farthest from a model profile based on a set of hand-picked genes. Availability: S+ codes for the partial least squares based clustering are available from the authors upon request. All other clustering methods considered have S+ implementation in the library MASS. S+ codes for calculating the validation measures are available from the authors upon request. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation  相似文献   

15.

Background  

Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters independent of the biological context of the analysis. This is often inadequate to explore data from different biological perspectives and gain new insights. We propose a new clustering model that can generate multiple versions of different clusters from a single dataset, each of which highlights a different aspect of the given dataset.  相似文献   

16.
Microarray technology facilitates the monitoring of the expression levels of thousands of genes over different experimental conditions simultaneously. Clustering is a popular data mining tool which can be applied to microarray gene expression data to identify co-expressed genes. Most of the traditional clustering methods optimize a single clustering goodness criterion and thus may not be capable of performing well on all kinds of datasets. Motivated by this, in this article, a multiobjective clustering technique that optimizes cluster compactness and separation simultaneously, has been improved through a novel support vector machine classification based cluster ensemble method. The superiority of MOCSVMEN (MultiObjective Clustering with Support Vector Machine based ENsemble) has been established by comparing its performance with that of several well known existing microarray data clustering algorithms. Two real-life benchmark gene expression datasets have been used for testing the comparative performances of different algorithms. A recently developed metric, called Biological Homogeneity Index (BHI), which computes the clustering goodness with respect to functional annotation, has been used for the comparison purpose.  相似文献   

17.
When applying hierarchical clustering algorithms to cluster patient samples from microarray data, the clustering patterns generated by most algorithms tend to be dominated by groups of highly differentially expressed genes that have closely related expression patterns. Sometimes, these genes may not be relevant to the biological process under study or their functions may already be known. The problem is that these genes can potentially drown out the effects of other genes that are relevant or have novel functions. We propose a procedure called complementary hierarchical clustering that is designed to uncover the structures arising from these novel genes that are not as highly expressed. Simulation studies show that the procedure is effective when applied to a variety of examples. We also define a concept called relative gene importance that can be used to identify the influential genes in a given clustering. Finally, we analyze a microarray data set from 295 breast cancer patients, using clustering with the correlation-based distance measure. The complementary clustering reveals a grouping of the patients which is uncorrelated with a number of known prognostic signatures and significantly differing distant metastasis-free probabilities.  相似文献   

18.
Microarray data analysis has been shown to provide an effective tool for studying cancer and genetic diseases. Although classical machine learning techniques have successfully been applied to find informative genes and to predict class labels for new samples, common restrictions of microarray analysis such as small sample sizes, a large attribute space and high noise levels still limit its scientific and clinical applications. Increasing the interpretability of prediction models while retaining a high accuracy would help to exploit the information content in microarray data more effectively. For this purpose, we evaluate our rule-based evolutionary machine learning systems, BioHEL and GAssist, on three public microarray cancer datasets, obtaining simple rule-based models for sample classification. A comparison with other benchmark microarray sample classifiers based on three diverse feature selection algorithms suggests that these evolutionary learning techniques can compete with state-of-the-art methods like support vector machines. The obtained models reach accuracies above 90% in two-level external cross-validation, with the added value of facilitating interpretation by using only combinations of simple if-then-else rules. As a further benefit, a literature mining analysis reveals that prioritizations of informative genes extracted from BioHEL's classification rule sets can outperform gene rankings obtained from a conventional ensemble feature selection in terms of the pointwise mutual information between relevant disease terms and the standardized names of top-ranked genes.  相似文献   

19.
BACKGROUND: Complex diseases are commonly caused by multiple genes and their interactions with each other. Genome-wide association (GWA) studies provide us the opportunity to capture those disease associated genes and gene-gene interactions through panels of SNP markers. However, a proper filtering procedure is critical to reduce the search space prior to the computationally intensive gene-gene interaction identification step. In this study, we show that two commonly used SNP-SNP interaction filtering algorithms, ReliefF and tuned ReliefF (TuRF), are sensitive to the order of the samples in the dataset, giving rise to unstable and suboptimal results. However, we observe that the 'unstable' results from multiple runs of these algorithms can provide valuable information about the dataset. We therefore hypothesize that aggregating results from multiple runs of the algorithm may improve the filtering performance. RESULTS: We propose a simple and effective ensemble approach in which the results from multiple runs of an unstable filter are aggregated based on the general theory of ensemble learning. The ensemble versions of the ReliefF and TuRF algorithms, referred to as ReliefF-E and TuRF-E, are robust to sample order dependency and enable a more informative investigation of data characteristics. Using simulated and real datasets, we demonstrate that both the ensemble of ReliefF and the ensemble of TuRF can generate a much more stable SNP ranking than the original algorithms. Furthermore, the ensemble of TuRF achieved the highest success rate in comparison to many state-of-the-art algorithms as well as traditional χ2-test and odds ratio methods in terms of retaining gene-gene interactions.  相似文献   

20.
MOTIVATION: Bioinformatics clustering tools are useful at all levels of proteomic data analysis. Proteomics studies can provide a wealth of information and rapidly generate large quantities of data from the analysis of biological specimens. The high dimensionality of data generated from these studies requires the development of improved bioinformatics tools for efficient and accurate data analyses. For proteome profiling of a particular system or organism, a number of specialized software tools are needed. Indeed, significant advances in the informatics and software tools necessary to support the analysis and management of these massive amounts of data are needed. Clustering algorithms based on probabilistic and Bayesian models provide an alternative to heuristic algorithms. The number of clusters (diseased and non-diseased groups) is reduced to the choice of the number of components of a mixture of underlying probability. The Bayesian approach is a tool for including information from the data to the analysis. It offers an estimation of the uncertainties of the data and the parameters involved. RESULTS: We present novel algorithms that can organize, cluster and derive meaningful patterns of expression from large-scaled proteomics experiments. We processed raw data using a graphical-based algorithm by transforming it from a real space data-expression to a complex space data-expression using discrete Fourier transformation; then we used a thresholding approach to denoise and reduce the length of each spectrum. Bayesian clustering was applied to the reconstructed data. In comparison with several other algorithms used in this study including K-means, (Kohonen self-organizing map (SOM), and linear discriminant analysis, the Bayesian-Fourier model-based approach displayed superior performances consistently, in selecting the correct model and the number of clusters, thus providing a novel approach for accurate diagnosis of the disease. Using this approach, we were able to successfully denoise proteomic spectra and reach up to a 99% total reduction of the number of peaks compared to the original data. In addition, the Bayesian-based approach generated a better classification rate in comparison with other classification algorithms. This new finding will allow us to apply the Fourier transformation for the selection of the protein profile for each sample, and to develop a novel bioinformatic strategy based on Bayesian clustering for biomarker discovery and optimal diagnosis.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号