Similar literature
1.
Feature selection is widely established as one of the fundamental computational techniques in mining microarray data. Because class labels are often unavailable in practice, unsupervised feature selection is more practically important but correspondingly more difficult. Motivated by cluster ensemble techniques, which combine multiple clustering solutions into a consensus solution of higher accuracy and stability, recent efforts in unsupervised feature selection have proposed using these consensus solutions as oracles. However, these methods depend on both the particular cluster ensemble algorithm used and knowledge of the true cluster number, so they are unsuitable when the true cluster number is not available, which is common in practice. To address these problems, a new unsupervised feature ranking method is proposed that evaluates the importance of features based on consensus affinity. Unlike previous work, our method compares the corresponding affinity of each feature between a pair of instances based on the consensus matrix of clustering solutions. As a result, our method alleviates the need to know the true number of clusters and the dependence on particular cluster ensemble approaches. Experiments on real gene expression data sets demonstrate significant improvement of the feature ranking results when compared to several state-of-the-art techniques.
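
The consensus matrix mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common co-association definition (the fraction of clustering runs in which two instances share a cluster); the abstract does not spell out its exact construction, and the function name and toy label vectors below are hypothetical.

```python
def consensus_matrix(labelings):
    """Return M where M[i][j] is the fraction of clusterings
    that place instances i and j in the same cluster."""
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(labelings)
    return [[c / runs for c in row] for row in counts]


# Three hypothetical clustering solutions over five instances.
# Label values are arbitrary per run; only co-membership matters.
runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 1, 1],
]
M = consensus_matrix(runs)
```

Because only co-membership is counted, the runs may use different numbers of clusters and different label names, which is why such a matrix does not require fixing a single true cluster number.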

2.
Finding subtypes of heterogeneous diseases is a major challenge in biology. Often, clustering is used to provide a hypothesis for the subtypes of a heterogeneous disease. However, there are usually discrepancies between the clusterings produced by different algorithms. This work introduces a simple method which provides the most consistent clusters across three different clustering algorithms for a melanoma and a breast cancer data set. The method is validated by showing that the Silhouette, Dunn's and Davies-Bouldin's cluster validation indices are better for the proposed algorithm than those obtained by k-means and another consensus clustering algorithm. The hypotheses of the consensus clusters on both data sets are corroborated by clear genetic markers and 100 percent classification accuracy. In Bittner et al.'s melanoma data set, a previously hypothesized primary cluster is recognized as the largest consensus cluster and a new partition of this cluster into two subclusters is proposed. In van't Veer et al.'s breast cancer data set, the previously proposed "basal" and "luminal A" subtypes are clearly recognized as the two predominant clusters. Furthermore, a new hypothesis is provided about the existence of two subgroups within the "basal" subtype in this data set. The clusters of van't Veer's data set are also validated by the high classification accuracy obtained in the data set of van de Vijver et al.

3.
Y Peng  Y Zhang  G Kou  Y Shi 《PloS one》2012,7(7):e41713
Determining the number of clusters in a data set is an essential yet difficult step in cluster analysis. Since this task involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem. This paper proposes an MCDM-based approach to estimate the number of clusters for a given data set. In this approach, MCDM methods treat different numbers of clusters as alternatives and the outputs of any clustering algorithm on validity measures as criteria. The proposed method is examined in an experimental study using three MCDM methods, the well-known clustering algorithm k-means, ten relative measures, and fifteen public-domain UCI machine learning data sets. The results show that MCDM methods work fairly well in estimating the number of clusters in the data and outperform the ten relative measures considered in the study.

4.
Gangnon RE  Clayton MK 《Biometrics》2000,56(3):922-935
Many current statistical methods for disease clustering studies are based on a hypothesis testing paradigm. These methods typically do not produce useful estimates of disease rates or cluster risks. In this paper, we develop a Bayesian procedure for drawing inferences about specific models for spatial clustering. The proposed methodology incorporates ideas from image analysis, from Bayesian model averaging, and from model selection. With our approach, we obtain estimates for disease rates and allow for greater flexibility in both the type of clusters and the number of clusters that may be considered. We illustrate the proposed procedure through simulation studies and an analysis of the well-known New York leukemia data.

5.
Summary: This article develops a latent model and likelihood-based inference to detect temporal clustering of events. The model mimics typical processes generating the observed data. We apply model selection techniques to determine the number of clusters, and develop likelihood inference and a Monte Carlo expectation–maximization algorithm to estimate model parameters, detect clusters, and identify cluster locations. Our method differs from the classical scan statistic in that we can simultaneously detect multiple clusters of varying sizes. We illustrate the methodology with two real data applications and evaluate its efficiency through simulation studies. For the typical data-generating process, our methodology is more efficient than a competing procedure that relies on least squares.

6.
7.
We wished to quantify the state-of-the-art of our understanding of clusters in microarray data. To do this we systematically compared the clusters produced on sets of microarray data using a representative set of clustering algorithms (hierarchical, k-means, and a modified version of QT_CLUST) with the annotation schemes MIPS, GeneOntology and GenProtEC. We assumed that if a cluster reflected known biology its members would share related ontological annotations. This assumption is the basis of "guilt-by-association" and is commonly used to assign the putative function of proteins. To statistically measure the relationship between cluster and annotation we developed a new predictive discriminatory measure. We found that the clusters found in microarray data do not in general agree with functional annotation classes. Although many statistically significant relationships can be found, the majority of clusters are not related to known biology (as described in annotation ontologies). This implies that use of guilt-by-association is not supported by annotation ontologies. Depending on the estimate of the amount of noise in the data, our results suggest that bioinformatics has only codified a small proportion of the biological knowledge required to understand microarray data.

8.
MOTIVATION: Consensus clustering, also known as cluster ensemble, is one of the important techniques for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches can integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. By consensus clustering, one can discover the underlying classes of the samples in gene expression data. RESULTS: In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we also design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time that GCC has been applied to class discovery for microarray data. Given a pre-specified maximum number of classes (denoted as K(max) in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning. AVAILABILITY: Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu.
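
For orientation, the plain Rand index, on which indices of this family build, can be sketched as follows. This is not the Modified Rand Index of the abstract, whose definition is not given here; it is only the standard pair-counting agreement measure, and the function name is hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of instance pairs on which two partitions agree,
    i.e. both group the pair together or both separate it."""
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        if same_a == same_b:
            agree += 1
        pairs += 1
    return agree / pairs


# Identical partitions with renamed labels agree on every pair.
perfect = rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Two crossed partitions agree on only two of the six pairs.
crossed = rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

An index of this kind could, in principle, be evaluated for each candidate number of classes up to K(max), although how the abstract's index is actually used for that selection is not described in the text.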

9.
Chang  Luyao  Li  Fan  Niu  Xinzheng  Zhu  Jiahui 《Cluster computing》2022,25(4):3005-3017

To collect data efficiently while balancing energy consumption, wireless sensor networks (WSNs) need to be divided into clusters. Clustering gives the network a hierarchical organizational structure, which balances the network load and prolongs the life cycle of the system. In a clustering routing algorithm, the quality of the clustering algorithm directly affects the result of cluster division. This paper proposes an algorithm that selects cluster heads based on node distribution density and then allocates the remaining nodes, addressing the random cluster head election and uneven clustering of the traditional LEACH protocol in WSNs. Experiments show that the algorithm achieves rapid selection of cluster heads and division of clusters, is effective for node clustering, and helps equalize energy consumption.

10.
Determining the number of clusters using the weighted gap statistic (cited 3 times: 0 self-citations, 3 by others)
Yan M  Ye K 《Biometrics》2007,63(4):1031-1037
Estimating the number of clusters in a data set is a crucial step in cluster analysis. In this article, motivated by the gap method (Tibshirani, Walther, and Hastie, 2001, Journal of the Royal Statistical Society B 63, 411-423), we propose the weighted gap and the difference of difference-weighted (DD-weighted) gap methods for estimating the number of clusters in data using the weighted within-clusters sum of errors: a measure of the within-clusters homogeneity. In addition, we propose a "multilayer" clustering approach, which is shown to be more accurate than the original gap method, particularly in detecting the nested cluster structure of the data. The methods are applicable when the input data contain continuous measurements and can be used with any clustering method. The proposed methods are compared with each other and with the original gap method in simulation studies and on real data.

11.
Model-based clustering and data transformations for gene expression data (cited 20 times: 0 self-citations, 20 by others)
MOTIVATION: Clustering is a useful exploratory technique for the analysis of gene expression data. Many different heuristic clustering algorithms have been proposed in this context. Clustering algorithms based on probability models offer a principled alternative to heuristic algorithms. In particular, model-based clustering assumes that the data is generated by a finite mixture of underlying probability distributions such as multivariate normal distributions. The issues of selecting a 'good' clustering method and determining the 'correct' number of clusters are reduced to model selection problems in the probability framework. Gaussian mixture models have been shown to be a powerful tool for clustering in many applications. RESULTS: We benchmarked the performance of model-based clustering on several synthetic and real gene expression data sets for which external evaluation criteria were available. The model-based approach has superior performance on our synthetic data sets, consistently selecting the correct model and the number of clusters. On real expression data, the model-based approach produced clusters of quality comparable to a leading heuristic clustering algorithm, but with the key advantage of suggesting the number of clusters and an appropriate model. We also explored the validity of the Gaussian mixture assumption on different transformations of real data. We also assessed the degree to which these real gene expression data sets fit multivariate Gaussian distributions both before and after subjecting them to commonly used data transformations. Suitably chosen transformations seem to result in reasonable fits. AVAILABILITY: MCLUST is available at http://www.stat.washington.edu/fraley/mclust. The software for the diagonal model is under development. CONTACT: kayee@cs.washington.edu. SUPPLEMENTARY INFORMATION: http://www.cs.washington.edu/homes/kayee/model.

12.
Model-based clustering is a popular tool for summarizing high-dimensional data. With the number of high-throughput large-scale gene expression studies still on the rise, the need for effective data-summarizing tools has never been greater. By grouping genes according to a common experimental expression profile, we may gain new insight into the biological pathways that steer biological processes of interest. Clustering of gene profiles can also assist in assigning functions to genes that have not yet been functionally annotated. In this paper, we propose two model selection procedures for model-based clustering. Model selection in model-based clustering has to date focused on the identification of data dimensions that are relevant for clustering. However, in more complex data structures, with multiple experimental factors, such an approach does not provide easily interpreted clustering outcomes. We propose a mixture model with multiple levels that provides sparse representations both "within" and "between" cluster profiles. We explore various flexible "within-cluster" parameterizations and discuss how efficient parameterizations can greatly enhance the objective interpretability of the generated clusters. Moreover, we allow for a sparse "between-cluster" representation with a different number of clusters at different levels of an experimental factor of interest. This enhances interpretability of clusters generated in multiple-factor contexts. Interpretable cluster profiles can assist in detecting biologically relevant groups of genes that may be missed with less efficient parameterizations. We use our multilevel mixture model to mine a proliferating cell line expression data set for annotational context and regulatory motifs. We also investigate the performance of the multilevel clustering approach on several simulated data sets.

13.
MOTIVATION: A measurement of cluster quality is needed to choose potential clusters of genes that contain biologically relevant patterns of gene expression. This is strongly desirable when a large number of gene expression profiles have to be analyzed and proper clusters of genes need to be identified for further analysis, such as the search for meaningful patterns, identification of gene functions or gene response analysis. RESULTS: We propose a new cluster quality method, called stability, by which unsupervised learning of gene expression data can be performed efficiently. The method takes into account a cluster's stability on partition. We evaluate this method and demonstrate its performance using four independent, real gene expression and three simulated datasets. We demonstrate that our method outperforms other techniques listed in the literature. The method has applications in evaluating clustering validity as well as identifying stable clusters. AVAILABILITY: Please contact the first author.

14.

Background

The goal of the study was to demonstrate a hierarchical structure of resting state activity in the healthy brain using a data-driven clustering algorithm.

Methodology/Principal Findings

The fuzzy-c-means clustering algorithm was applied to resting state fMRI data in cortical and subcortical gray matter from two groups acquired separately, one of 17 healthy individuals and the second of 21 healthy individuals. Different numbers of clusters and different starting conditions were used. A cluster dispersion measure determined the optimal numbers of clusters. An inner product metric provided a measure of similarity between different clusters. The two cluster result found the task-negative and task-positive systems. The cluster dispersion measure was minimized with seven and eleven clusters. Each of the clusters in the seven and eleven cluster result was associated with either the task-negative or task-positive system. Applying the algorithm to find seven clusters recovered previously described resting state networks, including the default mode network, frontoparietal control network, ventral and dorsal attention networks, somatomotor, visual, and language networks. The language and ventral attention networks had significant subcortical involvement. This parcellation was consistently found in a large majority of algorithm runs under different conditions and was robust to different methods of initialization.

Conclusions/Significance

The clustering of resting state activity using different optimal numbers of clusters identified resting state networks comparable to previously obtained results. This work reinforces the observation that resting state networks are hierarchically organized.
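
The membership update at the core of fuzzy c-means, the algorithm applied in this study, can be sketched as below. This uses the standard textbook update with fuzzifier m; the study's actual settings (fuzzifier, distance metric, initialization) are not stated in the abstract, so the default m = 2 and Euclidean distance here are assumptions, and the function name is hypothetical.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Standard fuzzy c-means membership of one point in each cluster:
    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1))."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0.0 for d in dists):
        # The point coincides with one or more centers:
        # split full membership evenly among those centers.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        total = sum(hits)
        return [h / total for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** power for d_k in dists) for d_i in dists]


# A point three times closer to the first center than to the second.
u = fcm_memberships((0.0, 0.0), [(1.0, 0.0), (3.0, 0.0)])
```

Unlike hard k-means assignment, every point receives a graded membership in every cluster, and the memberships sum to one.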

15.
16.

Purpose

To describe a methodology, based on cluster analysis, to partition multi-parametric functional imaging data into groups (or clusters) of similar functional characteristics, with the aim of characterizing functional heterogeneity within head and neck tumour volumes. To evaluate the performance of the proposed approach on a set of longitudinal MRI data, analysing the evolution of the obtained sub-sets with treatment.

Material and Methods

The cluster analysis workflow was applied to a combination of dynamic contrast-enhanced and diffusion-weighted imaging MRI data from a cohort of squamous cell carcinoma of the head and neck patients. Cumulative distributions of voxels, containing pre and post-treatment data and including both primary tumours and lymph nodes, were partitioned into k clusters (k = 2, 3 or 4). Principal component analysis and cluster validation were employed to investigate data composition and to independently determine the optimal number of clusters. The evolution of the resulting sub-regions with induction chemotherapy treatment was assessed relative to the number of clusters.

Results

The clustering algorithm was able to separate clusters which significantly reduced in voxel number following induction chemotherapy from clusters with a non-significant reduction. Partitioning with the optimal number of clusters (k = 4), determined with cluster validation, produced the best separation between reducing and non-reducing clusters.

Conclusion

The proposed methodology was able to identify tumour sub-regions with distinct functional properties, independently separating clusters which were affected differently by treatment. This work demonstrates that unsupervised cluster analysis, with no prior knowledge of the data, can be employed to provide a multi-parametric characterization of functional heterogeneity within tumour volumes.

17.
Previous gene expression profiling studies have sought to identify groups of genes that characterize colorectal carcinoma. Despite their success in identifying groups of genes involved in the progression of the disease, these methods either require subjective interpretation of the number of clusters or lack stability across different runs of the algorithms, which limits their usefulness. In this study, we propose an enhanced algorithm that provides stability and robustness in identifying differentially expressed genes in an expression profile analysis. Our proposed algorithm uses multiple clustering algorithms under the consensus clustering framework. The experimental results show that our method robustly yields a consistent cluster structure, similar to the structure found in the previous study. Furthermore, our algorithm outperforms any single clustering algorithm in terms of the cluster quality score.

18.
Clustering analysis has a growing role in the study of co-expressed genes for gene discovery. Conventional binary and fuzzy clustering do not embrace the biological reality that some genes may be irrelevant for a problem and not be assigned to a cluster, while other genes may participate in several biological functions and should simultaneously belong to multiple clusters. Also, these algorithms cannot generate tight clusters that focus on their cores or wide clusters that overlap and contain all possibly relevant genes. In this paper, a new clustering paradigm is proposed. In this paradigm, all three eventualities of a gene being exclusively assigned to a single cluster, being assigned to multiple clusters, and being not assigned to any cluster are possible. These possibilities are realised through the primary novelty of the introduction of tunable binarization techniques. Results from multiple clustering experiments are aggregated to generate one fuzzy consensus partition matrix (CoPaM), which is then binarized to obtain the final binary partitions. This is referred to as Binarization of Consensus Partition Matrices (Bi-CoPaM). The method has been tested with a set of synthetic datasets and a set of five real yeast cell-cycle datasets. The results demonstrate its validity in generating relevant tight, wide, and complementary clusters that can meet requirements of different gene discovery studies.
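
How binarizing a fuzzy consensus partition matrix can produce all three eventualities may be seen in a minimal sketch. It assumes a simple fixed-threshold rule chosen purely for illustration; the abstract does not specify which binarization techniques the method uses, and the function name, threshold value, and toy matrix are hypothetical.

```python
def binarize_by_threshold(copam, threshold):
    """Assign a gene to every cluster in which its fuzzy
    membership reaches the threshold (rows: genes, columns: clusters)."""
    return [[1 if u >= threshold else 0 for u in row] for row in copam]


# Hypothetical fuzzy consensus memberships of three genes in three clusters.
copam = [
    [0.90, 0.10, 0.00],  # clearly belongs to the first cluster
    [0.50, 0.50, 0.00],  # shared between the first two clusters
    [0.34, 0.33, 0.33],  # no clear membership anywhere
]
binary = binarize_by_threshold(copam, threshold=0.4)
# binary == [[1, 0, 0], [1, 1, 0], [0, 0, 0]]
```

With this rule, raising the threshold tends to give tighter clusters with more unassigned genes, while lowering it gives wider, overlapping clusters, which matches the tunable behaviour the abstract describes in general terms.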

19.
Spatial scan statistics are widely used in various fields. The performance of these statistics is influenced by parameters, such as maximum spatial cluster size, and can be improved by parameter selection using performance measures. Current performance measures are based on the presence of clusters and are thus inapplicable to data sets without known clusters. In this work, we propose a novel overall performance measure called maximum clustering set–proportion (MCS-P), which is based on the likelihood of the union of detected clusters and the applied dataset. MCS-P was compared with existing performance measures in a simulation study to select the maximum spatial cluster size. Results of other performance measures, such as sensitivity and misclassification, suggest that the spatial scan statistic achieves accurate results in most scenarios with the maximum spatial cluster sizes selected using MCS-P. Given that previously known clusters are not required in the proposed strategy, selection of the optimal maximum cluster size with MCS-P can improve the performance of the scan statistic in applications without identified clusters.

20.