首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 734 毫秒
1.
Microarray technology plays an important role in drawing useful biological conclusions by analyzing thousands of gene expressions simultaneously. Especially, image analysis is a key step in microarray analysis and its accuracy strongly depends on segmentation. The pioneering works of clustering based segmentation have shown that k-means clustering algorithm and moving k-means clustering algorithm are two commonly used methods in microarray image processing. However, they usually face unsatisfactory results because the real microarray image contains noise, artifacts and spots that vary in size, shape and contrast. To improve the segmentation accuracy, in this article we present a combination clustering based segmentation approach that may be more reliable and able to segment spots automatically. First, this new method starts with a very simple but effective contrast enhancement operation to improve the image quality. Then, an automatic gridding based on the maximum between-class variance is applied to separate the spots into independent areas. Next, among each spot region, the moving k-means clustering is first conducted to separate the spot from background and then the k-means clustering algorithms are combined for those spots failing to obtain the entire boundary. Finally, a refinement step is used to replace the false segmentation and the inseparable ones of missing spots. In addition, quantitative comparisons between the improved method and the other four segmentation algorithms--edge detection, thresholding, k-means clustering and moving k-means clustering--are carried out on cDNA microarray images from six different data sets. Experiments on six different data sets, 1) Stanford Microarray Database (SMD), 2) Gene Expression Omnibus (GEO), 3) Baylor College of Medicine (BCM), 4) Swiss Institute of Bioinformatics (SIB), 5) Joe DeRisi’s individual tiff files (DeRisi), and 6) University of California, San Francisco (UCSF), indicate that the improved approach is more robust and sensitive to weak spots. More importantly, it can obtain higher segmentation accuracy in the presence of noise, artifacts and weakly expressed spots compared with the other four methods.  相似文献   

2.
Traditional k-means and most k-means variants are still computationally expensive for large datasets, such as microarray data, which have large datasets with large dimension size d. In k-means clustering, we are given a set of n data points in d-dimensional space Rd and an integer k. The problem is to determine a set of k points in Rd, called centers, so as to minimize the mean squared distance from each data point to its nearest center. In this work, we develop a novel k-means algorithm, which is simple but more efficient than the traditional k-means and the recent enhanced k-means. Our new algorithm is based on the recently established relationship between principal component analysis and the k-means clustering. We provided the correctness proof for this algorithm. Results obtained from testing the algorithm on three biological data and six non-biological data (three of these data are real, while the other three are simulated) also indicate that our algorithm is empirically faster than other known k-means algorithms. We assessed the quality of our algorithm clusters against the clusters of a known structure using the Hubert-Arabie Adjusted Rand index (ARIHA). We found that when k is close to d, the quality is good (ARIHA>0.8) and when k is not close to d, the quality of our new k-means algorithm is excellent (ARIHA>0.9). In this paper, emphases are on the reduction of the time requirement of the k-means algorithm and its application to microarray data due to the desire to create a tool for clustering and malaria research. However, the new clustering algorithm can be used for other clustering needs as long as an appropriate measure of distance between the centroids and the members is used. This has been demonstrated in this work on six non-biological data.  相似文献   

3.
Metabolomics and other omics tools are generally characterized by large data sets with many variables obtained under different environmental conditions. Clustering methods and more specifically two-mode clustering methods are excellent tools for analyzing this type of data. Two-mode clustering methods allow for analysis of the behavior of subsets of metabolites under different experimental conditions. In addition, the results are easily visualized. In this paper we introduce a two-mode clustering method based on a genetic algorithm that uses a criterion that searches for homogeneous clusters. Furthermore we introduce a cluster stability criterion to validate the clusters and we provide an extended knee plot to select the optimal number of clusters in both experimental and metabolite modes. The genetic algorithm-based two-mode clustering gave biological relevant results when it was applied to two real life metabolomics data sets. It was, for instance, able to identify a catabolic pathway for growth on several of the carbon sources. Electronic supplementary material The online version of this article (doi:) contains supplementary material, which is available to authorized users. J. A. Hageman and R. A. van den Berg contributed equally to this paper.  相似文献   

4.
BACKGROUND: Artificial neural networks (ANNs) have been shown to be valuable in the analysis of analytical flow cytometric (AFC) data in aquatic ecology. Automated extraction of clusters is an important first stage in deriving ANN training data from field samples, but AFC data pose a number of challenges for many types of clustering algorithm. The fuzzy k-means algorithm recently has been extended to address nonspherical clusters with the use of scatter matrices. Four variants were proposed, each optimizing a different measure of clustering "goodness." METHODS: With AFC data obtained from marine phytoplankton species in culture, the four fuzzy k-means algorithm variants were compared with each other and with another multivariate clustering algorithm based on critical distances currently used in flow cytometry. RESULTS: One of the algorithm variants (adaptive distances, also known as the Gustafson--Kessel algorithm) was found to be robust and reliable, whereas the others showed various problems. CONCLUSIONS: The adaptive distances algorithm was superior in use to the clustering algorithms against which it was tested, but the problem of automatic determination of the number of clusters remains to be addressed.  相似文献   

5.
K-ary clustering with optimal leaf ordering for gene expression data   总被引:2,自引:0,他引:2  
MOTIVATION: A major challenge in gene expression analysis is effective data organization and visualization. One of the most popular tools for this task is hierarchical clustering. Hierarchical clustering allows a user to view relationships in scales ranging from single genes to large sets of genes, while at the same time providing a global view of the expression data. However, hierarchical clustering is very sensitive to noise, it usually lacks of a method to actually identify distinct clusters, and produces a large number of possible leaf orderings of the hierarchical clustering tree. In this paper we propose a new hierarchical clustering algorithm which reduces susceptibility to noise, permits up to k siblings to be directly related, and provides a single optimal order for the resulting tree. RESULTS: We present an algorithm that efficiently constructs a k-ary tree, where each node can have up to k children, and then optimally orders the leaves of that tree. By combining k clusters at each step our algorithm becomes more robust against noise and missing values. By optimally ordering the leaves of the resulting tree we maintain the pairwise relationships that appear in the original method, without sacrificing the robustness. Our k-ary construction algorithm runs in O(n(3)) regardless of k and our ordering algorithm runs in O(4(k)n(3)). We present several examples that show that our k-ary clustering algorithm achieves results that are superior to the binary tree results in both global presentation and cluster identification. AVAILABILITY: We have implemented the above algorithms in C++ on the Linux operating system.  相似文献   

6.
In this paper, three different clustering algorithms were applied to assemble infrared (IR) spectral maps from IR microspectra of tissues. Using spectra from a colorectal adenocarcinoma section, we show how IR images can be assembled by agglomerative hierarchical (AH) clustering (Ward's technique), fuzzy C-means (FCM) clustering, and k-means (KM) clustering. We discuss practical problems of IR imaging on tissues such as the influence of spectral quality and data pretreatment on image quality. Furthermore, the applicability of cluster algorithms to the spatially resolved microspectroscopic data and the degree of correlation between distinct cluster images and histopathology are compared. The use of any of the clustering algorithms dramatically increased the information content of the IR images, as compared to univariate methods of IR imaging (functional group mapping). Among the cluster imaging methods, AH clustering (Ward's algorithm) proved to be the best method in terms of tissue structure differentiation.  相似文献   

7.
MOTIVATION: With the increasing number of gene expression databases, the need for more powerful analysis and visualization tools is growing. Many techniques have successfully been applied to unravel latent similarities among genes and/or experiments. Most of the current systems for microarray data analysis use statistical methods, hierarchical clustering, self-organizing maps, support vector machines, or k-means clustering to organize genes or experiments into 'meaningful' groups. Without prior explicit bias almost all of these clustering methods applied to gene expression data not only produce different results, but may also produce clusters with little or no biological relevance. Of these methods, agglomerative hierarchical clustering has been the most widely applied, although many limitations have been identified. RESULTS: Starting with a systematic comparison of the underlying theories behind clustering approaches, we have devised a technique that combines tree-structured vector quantization and partitive k-means clustering (BTSVQ). This hybrid technique has revealed clinically relevant clusters in three large publicly available data sets. In contrast to existing systems, our approach is less sensitive to data preprocessing and data normalization. In addition, the clustering results produced by the technique have strong similarities to those of self-organizing maps (SOMs). We discuss the advantages and the mathematical reasoning behind our approach.  相似文献   

8.
Kim HY  Kim MJ  Han JI  Kim BK  Lee YS  Lee YS  Kim JH 《Bio Systems》2009,95(1):17-25
A time-series microarray experiment is useful to study the changes in the expression of a large number of genes over time. Many methods for clustering genes using gene expression profiles have been suggested, but it is not easy to interpret the biological significance of the results or utilize these methods for understanding the dynamics of gene regulatory systems. In this study, we introduce an algorithm for readjusting the boundaries of clusters by adopting the advantages of both k-means and singular value decomposition (SVD). In addition, we suggest a methodology for searching the principal genes that can be the most crucial genes in regulation of clusters. We found 34 principal genes from 171 clusters having strong concentratedness in their expression patterns and distinct ranges of oscillatory phases, by using a time-series microarray dataset of mouse embryonic stem (ES) cells after induction of dopaminergic neural differentiation. The biological significance of the principal genes examined in the literature supports the feasibility of our algorithms in that the hierarchy of clusters may lead the manifestation of the phenotypes, e.g., the development of the nervous system.  相似文献   

9.
Recent progress in bioinformatics research has led to the accumulation of huge quantities of biological data at various data sources. The DNA microarray technology makes it possible to simultaneously analyze large number of genes across different samples. Clustering of microarray data can reveal the hidden gene expression patterns from large quantities of expression data that in turn offers tremendous possibilities in functional genomics, comparative genomics, disease diagnosis and drug development. The k- ¬means clustering algorithm is widely used for many practical applications. But the original k-¬means algorithm has several drawbacks. It is computationally expensive and generates locally optimal solutions based on the random choice of the initial centroids. Several methods have been proposed in the literature for improving the performance of the k-¬means algorithm. A meta-heuristic optimization algorithm named harmony search helps find out near-global optimal solutions by searching the entire solution space. Low clustering accuracy of the existing algorithms limits their use in many crucial applications of life sciences. In this paper we propose a novel Harmony Search-K means Hybrid (HSKH) algorithm for clustering the gene expression data. Experimental results show that the proposed algorithm produces clusters with better accuracy in comparison with the existing algorithms.  相似文献   

10.
MOTIVATION: Clustering is one of the most widely used methods in unsupervised gene expression data analysis. The use of different clustering algorithms or different parameters often produces rather different results on the same data. Biological interpretation of multiple clustering results requires understanding how different clusters relate to each other. It is particularly non-trivial to compare the results of a hierarchical and a flat, e.g. k-means, clustering. RESULTS: We present a new method for comparing and visualizing relationships between different clustering results, either flat versus flat, or flat versus hierarchical. When comparing a flat clustering to a hierarchical clustering, the algorithm cuts different branches in the hierarchical tree at different levels to optimize the correspondence between the clusters. The optimization function is based on graph layout aesthetics or on mutual information. The clusters are displayed using a bipartite graph where the edges are weighted proportionally to the number of common elements in the respective clusters and the weighted number of crossings is minimized. The performance of the algorithm is tested using simulated and real gene expression data. The algorithm is implemented in the online gene expression data analysis tool Expression Profiler. AVAILABILITY: http://www.ebi.ac.uk/expressionprofiler  相似文献   

11.
MOTIVATION: Biologists often employ clustering techniques in the explorative phase of microarray data analysis to discover relevant biological groupings. Given the availability of numerous clustering algorithms in the machine-learning literature, an user might want to select one that performs the best for his/her data set or application. While various validation measures have been proposed over the years to judge the quality of clusters produced by a given clustering algorithm including their biological relevance, unfortunately, a given clustering algorithm can perform poorly under one validation measure while outperforming many other algorithms under another validation measure. A manual synthesis of results from multiple validation measures is nearly impossible in practice, especially, when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed. RESULTS: Using a Monte Carlo cross-entropy algorithm, we successfully combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows for a far more objective and automated assessment of clustering results than a simple visual inspection. We illustrate our procedure using one simulated as well as three real gene expression data sets from various platforms where we rank a total of eleven clustering algorithms using a combined examination of 10 different validation measures. The aggregate rankings were found for a given number of clusters k and also for an entire range of k. AVAILABILITY: R code for all validation measures and rank aggregation is available from the authors upon request. SUPPLEMENTARY INFORMATION: Supplementary information are available at http://www.somnathdatta.org/Supp/RankCluster/supp.htm.  相似文献   

12.
Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.  相似文献   

13.
Spot Detection and Image Segmentation in DNA Microarray Data   总被引:3,自引:0,他引:3  
Following the invention of microarrays in 1994, the development and applications of this technology have grown exponentially. The numerous applications of microarray technology include clinical diagnosis and treatment, drug design and discovery, tumour detection, and environmental health research. One of the key issues in the experimental approaches utilising microarrays is to extract quantitative information from the spots, which represent genes in a given experiment. For this process, the initial stages are important and they influence future steps in the analysis. Identifying the spots and separating the background from the foreground is a fundamental problem in DNA microarray data analysis. In this review, we present an overview of state-of-the-art methods for microarray image segmentation. We discuss the foundations of the circle-shaped approach, adaptive shape segmentation, histogram-based methods and the recently introduced clustering-based techniques. We analytically show that clustering-based techniques are equivalent to the one-dimensional, standard k-means clustering algorithm that utilises the Euclidean distance.  相似文献   

14.
15.
MOTIVATION: The increasing use of microarray technologies is generating large amounts of data that must be processed in order to extract useful and rational fundamental patterns of gene expression. Hierarchical clustering technology is one method used to analyze gene expression data, but traditional hierarchical clustering algorithms suffer from several drawbacks (e.g. fixed topology structure; mis-clustered data which cannot be reevaluated). In this paper, we introduce a new hierarchical clustering algorithm that overcomes some of these drawbacks. RESULT: We propose a new tree-structure self-organizing neural network, called dynamically growing self-organizing tree (DGSOT) algorithm for hierarchical clustering. The DGSOT constructs a hierarchy from top to bottom by division. At each hierarchical level, the DGSOT optimizes the number of clusters, from which the proper hierarchical structure of the underlying dataset can be found. In addition, we propose a new cluster validation criterion based on the geometric property of the Voronoi partition of the dataset in order to find the proper number of clusters at each hierarchical level. This criterion uses the Minimum Spanning Tree (MST) concept of graph theory and is computationally inexpensive for large datasets. A K-level up distribution (KLD) mechanism, which increases the scope of data distribution in the hierarchy construction, was used to improve the clustering accuracy. The KLD mechanism allows the data misclustered in the early stages to be reevaluated at a later stage and increases the accuracy of the final clustering result. The clustering result of the DGSOT is easily displayed as a dendrogram for visualization. Based on a yeast cell cycle microarray expression dataset, we found that our algorithm extracts gene expression patterns at different levels. Furthermore, the biological functionality enrichment in the clusters is considerably high and the hierarchical structure of the clusters is more reasonable. AVAILABILITY: DGSOT is available upon request from the authors.  相似文献   

16.
Microarray-CGH (comparative genomic hybridization) experiments are used to detect and map chromosomal imbalances. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose representative sequences share the same relative copy number on average. Segmentation methods constitute a natural framework for the analysis, but they do not provide a biological status for the detected segments. We propose a new model for this segmentation/clustering problem, combining a segmentation model with a mixture model. We present a new hybrid algorithm called dynamic programming-expectation maximization (DP-EM) to estimate the parameters of the model by maximum likelihood. This algorithm combines DP and the EM algorithm. We also propose a model selection heuristic to select the number of clusters and the number of segments. An example of our procedure is presented, based on publicly available data sets. We compare our method to segmentation methods and to hidden Markov models, and we show that the new segmentation/clustering model is a promising alternative that can be applied in the more general context of signal processing.  相似文献   

17.
18.
MOTIVATION: Single nucleotide polymorphisms have been investigated as biological markers and the representative high-throughput genotyping method is a combination of the Invader assay and a statistical clustering method. A typical statistical clustering method is the k-means method, but it often fails because of the lack of flexibility. An alternative fast and reliable method is therefore desirable. RESULTS: This paper proposes a model-based clustering method using a normal mixture model and a well-conceived penalized likelihood. The proposed method can judge unclear genotypings to be re-examined and also work well even when the number of clusters is unknown. Some results are illustrated and then satisfactory genotypings are shown. Even when the conventional maximum likelihood method and the typical k-means clustering method failed, the proposed method succeeded.  相似文献   

19.
MOTIVATION: Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. RESULTS: We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm guarantees to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search of the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. The employment of this statistically rigorous test has shown that our algorithm places greater than 90% genes into correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.  相似文献   

20.
Y Peng  Y Zhang  G Kou  Y Shi 《PloS one》2012,7(7):e41713
Determining the number of clusters in a data set is an essential yet difficult step in cluster analysis. Since this task involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem. This paper proposes a multiple criteria decision making (MCDM)-based approach to estimate the number of clusters for a given data set. In this approach, MCDM methods consider different numbers of clusters as alternatives and the outputs of any clustering algorithm on validity measures as criteria. The proposed method is examined by an experimental study using three MCDM methods, the well-known clustering algorithm-k-means, ten relative measures, and fifteen public-domain UCI machine learning data sets. The results show that MCDM methods work fairly well in estimating the number of clusters in the data and outperform the ten relative measures considered in the study.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号