Similar literature
20 similar articles found (search time: 15 ms)
1.
Clustering is an important research area that has practical applications in many fields. Fuzzy clustering has shown advantages over crisp and probabilistic clustering, especially when there are significant overlaps between clusters. Most analytic fuzzy clustering approaches are derived from Bezdek's fuzzy c-means algorithm. One major factor that influences the determination of appropriate clusters in these approaches is an exponent parameter, called the fuzzifier. To our knowledge, no theoretical reason leading to an optimal setting of this parameter is available. This paper presents the development of a heuristic scheme for determining the fuzzifier. This scheme creates close interactions between the fuzzifier and the data set to be clustered. Experimental results in clustering IRIS data and in code book design required for image compression show that the proposed scheme performs well.
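
To make the role of the fuzzifier concrete, the following minimal sketch computes standard fuzzy c-means memberships for a single point. It illustrates the textbook membership formula only, not the heuristic scheme proposed in the paper; the function name `fcm_memberships` and the parameter name `m` for the fuzzifier are assumptions chosen for this example.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster.

    u_i = 1 / sum_k (d_i / d_k) ** (2 / (m - 1)), where d_i is the Euclidean
    distance from the point to center i and m > 1 is the fuzzifier.
    """
    if m <= 1.0:
        raise ValueError("the fuzzifier m must be greater than 1")
    dists = [math.dist(point, c) for c in centers]
    # A point that coincides with one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(dists) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    exponent = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_k) ** exponent for d_k in dists)
            for d_i in dists]


if __name__ == "__main__":
    centers = [(0.0, 0.0), (3.0, 0.0)]
    for m in (1.1, 2.0, 10.0):
        print(m, fcm_memberships((1.0, 0.0), centers, m))
```

As `m` approaches 1 the memberships become nearly crisp, and as `m` grows they flatten towards equal values, which is why the choice of fuzzifier matters when clusters overlap.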

2.
Successful clustering algorithms are highly dependent on parameter settings. The clustering performance degrades significantly unless parameters are properly set, and yet it is difficult to set these parameters a priori. To address this issue, in this paper we propose a unique splitting-while-merging clustering framework, named “splitting merging awareness tactics” (SMART), which does not require any a priori knowledge of either the number of clusters or even the possible range of this number. Unlike existing self-splitting algorithms, which over-cluster the dataset into a large number of clusters and then merge some similar clusters, our framework has the ability to split and merge clusters automatically during the process and produces the most reliable clustering results by intrinsically integrating many clustering techniques and tasks. The SMART framework is implemented with two distinct clustering paradigms in two algorithms: competitive learning and finite mixture model. Nevertheless, within the proposed SMART framework, many other algorithms can be derived for different clustering paradigms. The minimum message length algorithm is integrated into the framework as the clustering selection criterion. The usefulness of the SMART framework and its algorithms is tested on demonstration datasets and simulated gene expression datasets. Moreover, two real microarray gene expression datasets are studied using this approach. Based on the performance of many metrics, all numerical results show that SMART is superior to the existing self-splitting algorithms and traditional algorithms it was compared with. Three main properties of the proposed SMART framework are summarized as: (1) needing no parameters dependent on the respective dataset or a priori knowledge about the datasets, (2) being extendible to many different applications, and (3) offering superior performance compared with counterpart algorithms.

3.

Background

While there are a large number of bioinformatics datasets for clustering, many of them are incomplete, i.e., some data samples are missing attribute values needed by clustering algorithms. A variety of clustering algorithms have been proposed in the past years, but they are usually limited to clustering complete datasets. Besides, conventional clustering algorithms cannot obtain a trade-off between accuracy and efficiency of the clustering process since many essential parameters are determined by the human user’s experience.

Results

The paper proposes a Multiple Kernel Density Clustering algorithm for Incomplete datasets called MKDCI. The MKDCI algorithm consists of recovering missing attribute values of input data samples, learning an optimally combined kernel for clustering the input dataset, reducing dimensionality with the optimal kernel based on multiple basis kernels, detecting cluster centroids with the Isolation Forests method, assigning clusters with arbitrary shape and visualizing the results.

Conclusions

Extensive experiments on several well-known clustering datasets in the bioinformatics field demonstrate the effectiveness of the proposed MKDCI algorithm. Compared with existing density clustering algorithms and parameter-free clustering algorithms, the proposed MKDCI algorithm tends to automatically produce clusters of better quality on incomplete bioinformatics datasets.

4.
Clustering algorithms divide a set of observations into groups so that members of the same group share common features. In most of the algorithms, tunable parameters are set arbitrarily or by trial and error, resulting in less than optimal clustering. This paper presents a global optimization strategy for the systematic and optimal selection of parameter values associated with a clustering method. In the process, a performance criterion for the optimization model is proposed and benchmarked against popular performance criteria from the literature (namely, the Silhouette coefficient, Dunn's index, and the Davies-Bouldin index). The tuning strategy is illustrated using the support vector clustering (SVC) algorithm and simulated annealing. In order to reduce the computational burden, the paper also proposes an alternative to the adjacency matrix method (used for the assignment of cluster labels), namely the contour plotting approach. Datasets tested include the iris and the thyroid datasets from the UCI repository, as well as lymphoma and breast cancer data. The optimal tuning parameters are determined efficiently, while the contour plotting approach leads to significant reductions in computational effort (CPU time), especially for large datasets. The performance criteria comparisons indicate mixed results. Specifically, the Silhouette coefficient and the Davies-Bouldin index perform better, while Dunn's index is worse on average than the proposed performance index.
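
As an illustration of one of the benchmark criteria named above, here is a minimal pure-Python sketch of the mean Silhouette coefficient. It follows the usual textbook definition (singleton clusters contribute 0) and is not the performance criterion proposed in the paper; the function name `silhouette_score` is an assumption for this example.

```python
import math


def silhouette_score(points, labels):
    """Mean Silhouette coefficient of a labelled set of points.

    For each point, a is the mean distance to the other members of its own
    cluster and b is the smallest mean distance to the members of any other
    cluster; the point's score is (b - a) / max(a, b).
    """
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the Silhouette coefficient needs at least two clusters")
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # The distance from p to itself is 0, so divide by len(own) - 1.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        scale = max(a, b)
        if scale > 0.0:
            total += (b - a) / scale
    return total / len(points)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
    print(silhouette_score(pts, [0, 0, 1, 1]))  # well-separated labelling
    print(silhouette_score(pts, [0, 1, 0, 1]))  # poor labelling
```

A parameter search such as the one described in the abstract could, under these assumptions, score each candidate parameter setting by a criterion of this kind and keep the best-scoring clustering.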

5.
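The number of clusters is a key parameter affecting clustering quality and usually has to be set manually. For complex biological datasets, where this prior knowledge is hard to obtain, cluster analysis is limited as a result. To address this problem, the article proposes a method for automatically determining the optimal number of clusters. The method performs its main computation with an optimized clustering algorithm that embodies the idea of "compact within clusters, separated between clusters", combines it with a decision criterion based on the second-order difference of the objective function, and determines the optimal number of clusters through self-learning of the clustering algorithm. Experimental results show that the method automatically obtains a reasonable number of clusters on complex datasets.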
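
A second-order-difference criterion of the kind mentioned above can be sketched as follows. This is an assumed, generic "elbow" rule, not the article's exact criterion: given objective values J(k) for consecutive candidate cluster numbers, it picks the k that maximizes J(k-1) - 2J(k) + J(k+1). The function name and the example values are hypothetical.

```python
def best_k_by_second_difference(objective):
    """Pick the cluster number with the largest second-order difference.

    objective maps each candidate k to the clustering objective J(k),
    for example the within-cluster sum of squares (lower is better).
    Only values of k whose neighbours k - 1 and k + 1 are present are scored.
    """
    best_k = None
    best_curvature = float("-inf")
    for k in sorted(objective):
        if k - 1 not in objective or k + 1 not in objective:
            continue
        curvature = objective[k - 1] - 2.0 * objective[k] + objective[k + 1]
        if curvature > best_curvature:
            best_k, best_curvature = k, curvature
    if best_k is None:
        raise ValueError("need J(k) for at least three consecutive values of k")
    return best_k


if __name__ == "__main__":
    # Hypothetical objective values: a sharp drop up to k = 3, then a plateau.
    j_values = {1: 100.0, 2: 60.0, 3: 20.0, 4: 18.0, 5: 17.0}
    print(best_k_by_second_difference(j_values))
```

In practice the objective values would come from running the clustering algorithm once per candidate k; the rule then selects the point where further splitting stops paying off.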

6.
7.
Single-molecule localization microscopy (SMLM) is a powerful tool for studying intracellular structure and macromolecular organization at the nanoscale. The increasingly massive pointillistic data sets generated by SMLM require the development of new and highly efficient quantification tools. Here we present FOCAL3D, an accurate, flexible and exceedingly fast (scaling linearly with the number of localizations) density-based algorithm for quantifying spatial clustering in large 3D SMLM data sets. Unlike DBSCAN, which is perhaps the most commonly employed density-based clustering algorithm, an optimum set of parameters for FOCAL3D may be objectively determined. We initially validate the performance of FOCAL3D on simulated datasets at varying noise levels and for a range of cluster sizes. These simulated datasets are used to illustrate the parametric insensitivity of the algorithm, in contrast to DBSCAN, and clustering metrics such as the F1 and Silhouette scores indicate that FOCAL3D, once optimized, is highly accurate, even in the presence of significant background noise and mixed populations of variably sized clusters. We then apply FOCAL3D to 3D astigmatic dSTORM images of the nuclear pore complex (NPC) in human osteosarcoma cells, illustrating both the validity of the parameter optimization and the ability of the algorithm to accurately cluster complex, heterogeneous 3D clusters in a biological dataset. FOCAL3D is provided as an open source software package written in Python.

8.
Inferring the structure of populations has many applications for genetic research. In addition to providing information for evolutionary studies, it can be used to account for the bias induced by population stratification in association studies. To this end, many algorithms have been proposed to cluster individuals into genetically homogeneous sub-populations. The parametric algorithms, such as Structure, are very popular, but their underlying complexity and their high computational cost led to the development of faster parametric alternatives such as Admixture. Alternatives to these methods are the non-parametric approaches. In this category, AWclust has proven efficient but fails to properly identify population structure for complex datasets. We present in this article a new clustering algorithm called Spectral Hierarchical clustering for the Inference of Population Structure (SHIPS), based on a divisive hierarchical clustering strategy, allowing a progressive investigation of population structure. This method takes genetic data as input to cluster individuals into homogeneous sub-populations and uses the gap statistic to estimate the optimal number of such sub-populations. SHIPS was applied to a set of simulated discrete and admixed datasets and to real SNP datasets, namely data from the HapMap and Pan-Asian SNP consortium. The programs Structure, Admixture, AWclust and PCAclust were also investigated in a comparison study. SHIPS and the parametric approach Structure were the most accurate when applied to simulated datasets, both in terms of individual assignments and estimation of the correct number of clusters. The analysis of the results on the real datasets highlighted that the clusterings of SHIPS were the most consistent with the population labels or those produced by the Admixture program. The performance of SHIPS when applied to SNP data, along with its relatively low computational cost and its ease of use, makes this method a promising solution to infer fine-scale genetic patterns.

9.
Quantification of vegetation cover from pollen analysis has been a goal of palynologists since the advent of the method in 1916 by the great Lennart von Post. Pollen-based research projects are becoming increasingly ambitious in scale, and the emergence of spatially extensive open-access datasets, advanced methods and computer power has facilitated sub-continental analysis of Holocene pollen data. This paper presents results of one such study, focussing on the Mediterranean basin. Pollen data from 105 fossil sequences have been extracted from the European Pollen database, harmonised by both taxonomy and chronologies, and subjected to a hierarchical agglomerative clustering method to synthesise the dataset into 16 main groupings. A particular focus of analysis was to describe the common transitions from one group to another to understand pathways of Holocene vegetation change in the Mediterranean. Two pollen-based indices of human impact (OJC: Oleaceae, Juglans, Castanea; API: anthropogenic pollen indicators) have been used to infer the degree of human modification of vegetation within each pollen grouping. Pollen-inferred cluster groups that are interpreted as representing more natural vegetation states show a restricted number of pathways of change. A set of cluster groups were identified that closely resemble anthropogenically-disturbed vegetation, and might be considered anthromes (anthropogenic biomes). These clusters show a very wide set of potential pathways, implying that all potential vegetation communities identified through this analysis have been altered in response to land exploitation and transformation by human societies in combination with other factors, such as climatic change. Future work to explain these ecosystem pathways will require developing complementary datasets from the social sciences and humanities (archaeology and historical sources), along with synthesis of the climatic records from the region.

10.
The analysis of gene expression temporal profiles is a topic of increasing interest in functional genomics. Model-based clustering methods are particularly interesting because they are able to capture the dynamic nature of these data and to identify the optimal number of clusters. We have defined a new Bayesian method that allows us to cope with some important issues that remain unsolved in the currently available approaches: the presence of time dislocations in gene expression, the non-stationarity of the processes generating the data, and the presence of data collected on an irregular temporal grid. Our method, which is based on random walk models, requires only mild a priori assumptions about the nature of the processes generating the data and explicitly models inter-gene variability within each cluster. It has first been validated on simulated datasets and then employed for the analysis of a dataset of serum-stimulated fibroblasts. In all cases, the results have been promising, showing that the method can be helpful in functional genomics research.

11.
A hybrid GA (genetic algorithm)-based clustering (HGACLUS) scheme, combining the merits of simulated annealing, is described for finding an optimal or near-optimal set of medoids. The scheme maximizes clustering success by achieving internal cluster cohesion and external cluster isolation. The performance

12.
MOTIVATION: The increasing use of microarray technologies is generating large amounts of data that must be processed in order to extract useful and rational fundamental patterns of gene expression. Hierarchical clustering technology is one method used to analyze gene expression data, but traditional hierarchical clustering algorithms suffer from several drawbacks (e.g. fixed topology structure; mis-clustered data which cannot be reevaluated). In this paper, we introduce a new hierarchical clustering algorithm that overcomes some of these drawbacks. RESULT: We propose a new tree-structure self-organizing neural network, called dynamically growing self-organizing tree (DGSOT) algorithm for hierarchical clustering. The DGSOT constructs a hierarchy from top to bottom by division. At each hierarchical level, the DGSOT optimizes the number of clusters, from which the proper hierarchical structure of the underlying dataset can be found. In addition, we propose a new cluster validation criterion based on the geometric property of the Voronoi partition of the dataset in order to find the proper number of clusters at each hierarchical level. This criterion uses the Minimum Spanning Tree (MST) concept of graph theory and is computationally inexpensive for large datasets. A K-level up distribution (KLD) mechanism, which increases the scope of data distribution in the hierarchy construction, was used to improve the clustering accuracy. The KLD mechanism allows the data misclustered in the early stages to be reevaluated at a later stage and increases the accuracy of the final clustering result. The clustering result of the DGSOT is easily displayed as a dendrogram for visualization. Based on a yeast cell cycle microarray expression dataset, we found that our algorithm extracts gene expression patterns at different levels. Furthermore, the biological functionality enrichment in the clusters is considerably high and the hierarchical structure of the clusters is more reasonable. 
AVAILABILITY: DGSOT is available upon request from the authors.

13.
Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning of a specific data set is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well established hierarchical clustering algorithms: Ward’s minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward’s and Complete-linkage. We show on both simulated and real datasets that CC performs better than the other methods in terms of the identification of the correct number of clusters, the identification of outliers, and the determination of real cluster memberships. We used CC to cluster samples, in order to identify disease subtypes, and gene profiles, in order to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be successfully used in such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.

14.
MOTIVATION: Clustering microarray gene expression data is a powerful tool for elucidating co-regulatory relationships among genes. Many different clustering techniques have been successfully applied and the results are promising. However, substantial fluctuation contained in microarray data, lack of knowledge on the number of clusters and complex regulatory mechanisms underlying biological systems make the clustering problems tremendously challenging. RESULTS: We devised an improved model-based Bayesian approach to cluster microarray gene expression data. Cluster assignment is carried out by an iterative weighted Chinese restaurant seating scheme such that the optimal number of clusters can be determined simultaneously with cluster assignment. The predictive updating technique was applied to improve the efficiency of the Gibbs sampler. An additional step is added during reassignment to allow genes that display complex correlation relationships, such as time-shifted and/or inverted, to be clustered together. Analysis done on a real dataset showed that as much as 30% of significant genes clustered in the same group display complex relationships with the consensus pattern of the cluster. Other notable features include automatic handling of missing data, quantitative measures of cluster strength and assignment confidence. Synthetic and real microarray gene expression datasets were analyzed to demonstrate its performance. AVAILABILITY: A computer program named Chinese restaurant cluster (CRC) has been developed based on this algorithm. The program can be downloaded at http://www.sph.umich.edu/csg/qin/CRC/.
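
For orientation, the plain (unweighted) Chinese restaurant process prior that underlies seating schemes of this kind can be sketched in a few lines. This is only the standard prior: the weighted scheme in the abstract also weights each option by how well the gene's data fit the cluster, which is not shown here. The function name and the concentration parameter name `alpha` are assumptions.

```python
def crp_seating_probabilities(table_sizes, alpha=1.0):
    """Prior seating probabilities under a Chinese restaurant process.

    A new customer joins existing table k with probability
    n_k / (n + alpha) and opens a new table with probability
    alpha / (n + alpha), where n is the number of customers already seated.
    The last entry of the returned list is the new-table probability.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if any(size <= 0 for size in table_sizes):
        raise ValueError("every existing table must hold at least one customer")
    denom = sum(table_sizes) + alpha
    probs = [size / denom for size in table_sizes]
    probs.append(alpha / denom)
    return probs


if __name__ == "__main__":
    # Two existing clusters holding 3 and 1 genes.
    print(crp_seating_probabilities([3, 1], alpha=1.0))
```

Because there is always a non-zero chance of opening a new table, the number of clusters does not have to be fixed in advance, which is the property the abstract relies on.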

15.
Liu Wenzhong. 《遗传》 (Hereditas (Beijing)), 2004, 26(4): 532-536
Abstract: The theory, method and application of Method R for estimating (co)variance components are reviewed so that the method can be used appropriately. Method R estimates variance components by computing R values, which are regressions of random effects predicted from the complete dataset on random effects predicted from random subsets of the same data. Using a multivariate iteration algorithm based on a transformation matrix, combined with the preconditioned conjugate gradient method for solving the mixed model equations, greatly improves the computational efficiency of Method R. Method R is computationally inexpensive, and the sampling errors and approximate confidence intervals of the estimates can be obtained. Its disadvantages include a larger sampling variance than other methods for the same data and biased estimates in small datasets. As an alternative method, Method R can be used for variance component estimation in large datasets. Its theoretical properties should be studied further and its range of application broadened.

16.
Dynamic model-based clustering for time-course gene expression data   Cited by: 1 (self-citations: 0; citations by others: 1)
Microarray technology has produced a huge body of time-course gene expression data. Such gene expression data has proved useful in genomic disease diagnosis and genomic drug design. The challenge is how to uncover useful information in such data. Cluster analysis has played an important role in analyzing gene expression data. Many distance/correlation- and static model-based clustering techniques have been applied to time-course expression data. However, these techniques are unable to account for the dynamics of such data. It is the dynamics that characterize the data and that should be considered in cluster analysis so as to obtain high quality clustering. This paper proposes a dynamic model-based clustering method for time-course gene expression data. The proposed method regards a time-course gene expression dataset as a set of time series, generated by a number of stochastic processes. Each stochastic process defines a cluster and is described by an autoregressive model. A relocation-iteration algorithm is proposed to identify the model parameters, and posterior probabilities are employed to assign each gene to an appropriate cluster. A bootstrapping method and an average adjusted Rand index (AARI) are employed to measure the quality of clustering. Computational experiments are performed on one synthetic and three real time-course gene expression datasets to investigate the proposed method. The results show that our method yields better-quality clustering than other clustering methods (e.g. k-means) for time-course gene expression data, and thus it is a useful and powerful tool for analyzing time-course gene expression data.
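
The adjusted Rand index behind the AARI measure can be computed as in the sketch below, which uses the standard pair-counting formula. The averaging helper reflects an assumption about how an average over bootstrap replicates might be formed; the paper's exact procedure is not given in the abstract, and both function names are hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    total_pairs = comb(len(labels_a), 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: the partitions trivially agree
    # Pairs of items placed together in both, in the first, and in the second.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / total_pairs
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # both partitions are trivial and identical
    return (sum_cells - expected) / (max_index - expected)


def average_ari(reference, replicate_labelings):
    """Mean ARI of several labelings against one reference (assumed AARI form)."""
    scores = [adjusted_rand_index(reference, rep) for rep in replicate_labelings]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # same partition
    print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # crossed partition
```

The index is invariant to how clusters are named, so it compares partitions rather than label values, which makes it suitable for comparing clusterings obtained on resampled data.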

17.

Background  

There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user-defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming, and different criteria frequently result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of 10^6 points, which are often generated by high-throughput experiments.

18.
Classification is a data mining task the goal of which is to learn a model, from a training dataset, that can predict the class of a new data instance, while clustering aims to discover natural instance-groupings within a given dataset. Learning cluster-based classification systems involves partitioning a training set into data subsets (clusters) and building a local classification model for each data cluster. The class of a new instance is predicted by first assigning the instance to its nearest cluster and then using that cluster’s local classification model to predict the instance’s class. In this paper, we present an ant colony optimization (ACO) approach to building cluster-based classification systems. Our ACO approach optimizes the number of clusters, the positioning of the clusters, and the choice of classification algorithm to use as the local classifier for each cluster. We also present an ensemble approach that allows the system to decide on the class of a given instance by considering the predictions of all local classifiers, employing a weighted voting mechanism based on the fuzzy degree of membership in each cluster. Our experimental evaluation employs five widely used classification algorithms: naïve Bayes, nearest neighbour, Ripper, C4.5, and support vector machines, and results are reported on a suite of 54 popular UCI benchmark datasets.

19.
Clustering is a prevalent means of analyzing single cell RNA sequencing (scRNA-seq) data, but the rapidly expanding data volume can make this process computationally challenging. New methods for both accurate and efficient clustering are urgently needed. Here we propose Spearman subsampling-clustering-classification (SSCC), a new clustering framework based on random projection and feature construction, for large-scale scRNA-seq data. SSCC greatly improves clustering accuracy, robustness, and computational efficiency for various state-of-the-art algorithms benchmarked on multiple real datasets. On a dataset with 68,578 human blood cells, SSCC achieved a 20% improvement in clustering accuracy and a 50-fold acceleration while consuming only 66% of the memory, compared to the widely used software package SC3. Compared to k-means, the accuracy improvement of SSCC can reach 3-fold. An R implementation of SSCC is available at https://github.com/Japrin/sscClust.

20.
MOTIVATION: Principal Component Analysis (PCA) is one of the most popular dimensionality reduction techniques for the analysis of high-dimensional datasets. However, in its standard form, it does not take into account any error measures associated with the data points beyond a standard spherical noise. This indiscriminate nature is one of its main weaknesses when applied to biological data with inherently large variability, such as expression levels measured with microarrays. Methods now exist for extracting credibility intervals from the probe-level analysis of cDNA and oligonucleotide microarray experiments. These credibility intervals are gene and experiment specific, and can be propagated through an appropriate probabilistic downstream analysis. RESULTS: We propose a new model-based approach to PCA that takes into account the variances associated with each gene in each experiment. We develop an efficient EM-algorithm to estimate the parameters of our new model. The model provides significantly better results than standard PCA, while remaining computationally reasonable. We show how the model can be used to 'denoise' a microarray dataset, leading to improved expression profiles and tighter clustering across profiles. The probabilistic nature of the model means that the correct number of principal components is automatically obtained.
