Similar literature
 20 similar documents found (search time: 15 ms)
1.
Evaluation and comparison of gene clustering methods in microarray analysis   Cited by: 4 (self-citations: 0, citations by others: 4)
MOTIVATION: Microarray technology has been widely applied in biological and clinical studies for simultaneous monitoring of the expression of thousands of genes. Gene clustering analysis is useful for discovering groups of correlated genes that are potentially co-regulated or associated with the disease or conditions under investigation. Many clustering methods, including hierarchical clustering, K-means, PAM, SOM, mixture model-based clustering and tight clustering, have been widely used in the literature. Yet no comprehensive comparative study has been performed to evaluate the effectiveness of these methods. RESULTS: In this paper, six gene clustering methods are evaluated on simulated data from a hierarchical log-normal model with various degrees of perturbation, as well as on four real datasets. A weighted Rand index is proposed for measuring the similarity of two clustering results with possible scattered genes (i.e. a set of noise genes not being clustered). Performance of the methods on the real data is assessed by a predictive accuracy analysis through verified gene annotations. Our results show that tight clustering and model-based clustering consistently outperform other clustering methods on both simulated and real data, while hierarchical clustering and SOM perform among the worst. Our analysis provides deep insight into the complicated problem of clustering gene expression profiles and serves as a practical guideline for routine microarray cluster analysis.
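
The abstract does not define its weighted Rand index, so the sketch below shows only the plain (unweighted) Rand index that it builds on: the fraction of item pairs on which two clusterings agree. The function name `rand_index` is illustrative and does not come from the paper, and the treatment of scattered genes is deliberately left out.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Plain Rand index: share of item pairs treated the same way by both clusterings.

    A pair counts as an agreement when both clusterings put the two items
    together, or both put them apart. This is NOT the weighted variant
    proposed in the paper, whose definition the abstract does not give.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = 0
    for i, j in pairs:
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        if together_a == together_b:
            agreements += 1
    return agreements / len(pairs)
```

Because only co-membership is compared, the index is unaffected by how the clusters are numbered: `rand_index([0, 0, 1, 1], [1, 1, 0, 0])` treats the two labelings as identical.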

2.
MOTIVATION: Clustering is a popular technique for finding groups of genes that show similar expression patterns under multiple experimental conditions. Many clustering methods have been proposed for gene-expression data, including hierarchical clustering, k-means clustering and the self-organizing map (SOM). However, the conventional methods are limited in their ability to identify clusters of different shapes because they use a fixed distance norm when calculating the distance between genes. The fixed distance norm imposes a fixed geometrical shape on the clusters regardless of the actual data distribution. Thus, different distance norms are required for handling the different shapes of clusters. RESULTS: We present the Gustafson-Kessel (GK) clustering method for microarray gene-expression data. To detect clusters of different shapes in a dataset, we use an adaptive distance norm that is calculated from a fuzzy covariance matrix (F) of each cluster, in which the eigenstructure of F is used as an indicator of the shape of the cluster. Moreover, the GK method is less prone to falling into local minima than k-means and SOM because it makes decisions through the use of membership degrees of a gene to clusters. The algorithmic procedure is accomplished by the alternating optimization technique, which iteratively improves a sequence of sets of clusters until no further improvement is possible. To test the performance of the GK method, we applied the GK method and well-known conventional methods to three recently published yeast datasets, and compared the performance of each method using the Saccharomyces Genome Database annotations. The clustering results of the GK method are significantly more relevant to the biological annotations than those of the other methods, demonstrating its effectiveness and potential for clustering gene-expression data.
AVAILABILITY: The software was developed in Java and can be executed on any platform that runs a JVM (Java Virtual Machine). It is available from the authors upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at http://dragon.kaist.ac.kr/gk.
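
The adaptive distance norm described above can be sketched with the textbook Gustafson-Kessel formulas: a fuzzy covariance matrix F per cluster, and a norm-inducing matrix A = (rho * det F)^(1/n) * F^(-1). This is a minimal NumPy illustration, not the authors' Java software; the function names, the fuzzifier default `m=2.0` and the volume constant `rho=1.0` are assumptions.

```python
import numpy as np


def fuzzy_covariance(X, center, u, m=2.0):
    """Fuzzy covariance matrix F of one cluster.

    X: (n_samples, n_features) data; center: (n_features,) cluster center;
    u: (n_samples,) membership degrees of each sample to this cluster;
    m: fuzzifier (assumed default of 2).
    """
    w = u ** m
    d = X - center
    return (w[:, None] * d).T @ d / w.sum()


def gk_squared_distance(X, center, F, rho=1.0):
    """Squared GK distance of every row of X to the cluster center.

    The norm-inducing matrix A is scaled so that det(A) = rho, which fixes
    the cluster volume while letting its shape follow F.
    """
    n = X.shape[1]
    A = (rho * np.linalg.det(F)) ** (1.0 / n) * np.linalg.inv(F)
    d = X - center
    return np.einsum("ij,jk,ik->i", d, A, d)
```

With F equal to the identity the result reduces to the squared Euclidean distance; with F = diag(4, 1) the cluster is stretched along the first axis, so the points (2, 0) and (0, 1) end up equally far from the center.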

3.
We report on the application of the Self-Organizing Map (SOM) classification method to the task of categorizing texts according to their register and the style of their author. The SOM was selected because its performance in various data-mining applications has been found to be highly successful. Here, the method is evaluated on the task of clustering textual data consisting of corpora of texts written in Greek; the parameters used capture linguistically important structural properties of the texts. The experiments reported indicate that the SOM results are equivalent to those generated by statistical methods.

4.

Background

Clustering is a widely used technique for the analysis of gene expression data. Most clustering methods group genes based on distances, while few methods group genes according to the similarity of the distributions of their expression levels. Furthermore, as biological annotation resources have accumulated, an increasing number of genes have been annotated into functional categories. As a result, evaluating the performance of clustering methods in terms of the functional consistency of the resulting clusters is of great interest.

Results

In this paper, we propose the WDCM (Weibull Distribution-based Clustering Method), a robust approach for clustering gene expression data, in which the expression values of each gene are treated as a random variable following its own Weibull distribution. The WDCM is based on the concept that genes with similar expression profiles have similar distribution parameters, and thus the genes are clustered via their Weibull distribution parameters. We used the WDCM to cluster three cancer gene expression data sets (lung cancer, B-cell follicular lymphoma and bladder carcinoma) and obtained well-clustered results. We compared the performance of the WDCM with k-means and the Self Organizing Map (SOM) using functional annotation information given by the Gene Ontology (GO). The results showed that the functional annotation ratios of the WDCM are higher than those of the other methods. We also used the external measure Adjusted Rand Index to validate the performance of the WDCM. The comparative results demonstrate that the WDCM provides better clustering performance than the k-means and SOM algorithms. A merit of the proposed WDCM is that it can be applied to cluster incomplete gene expression data without imputing the missing values. Moreover, the robustness of the WDCM is also evaluated on the incomplete data sets.

Conclusions

The results demonstrate that our WDCM produces clusters with more consistent functional annotations than the other methods. The WDCM is also verified to be robust and is capable of clustering gene expression data containing a small number of missing values.
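
The abstract does not say how the per-gene Weibull parameters are estimated, so the following is only one plausible sketch: a maximum-likelihood fit of the shape and scale by bisection, with missing values simply skipped rather than imputed. The names `fit_weibull` and `profile_parameters`, the use of `None` for missing values, and the requirement that expression values be strictly positive are all assumptions made for illustration.

```python
import math


def fit_weibull(values, tol=1e-9):
    """Maximum-likelihood Weibull (shape, scale) for strictly positive values."""
    if len(values) < 2 or min(values) <= 0:
        raise ValueError("need at least two strictly positive values")
    top = max(values)
    ratios = [v / top for v in values]        # rescale so powers cannot overflow
    logs = [math.log(v) for v in values]
    mean_log = sum(logs) / len(logs)

    def score(k):
        # Derivative of the profile log-likelihood in the shape k (up to a factor);
        # it increases monotonically in k, so bisection finds its single root.
        powered = [r ** k for r in ratios]
        weighted = sum(p * l for p, l in zip(powered, logs))
        return weighted / sum(powered) - 1.0 / k - mean_log

    lo, hi = 1e-3, 1.0
    while score(hi) < 0:
        hi *= 2.0
        if hi > 1e3:
            raise ValueError("values are too concentrated to fit")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid) < 0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    scale = top * (sum(r ** shape for r in ratios) / len(ratios)) ** (1.0 / shape)
    return shape, scale


def profile_parameters(profile):
    """Fit one gene's profile, ignoring missing entries (None) instead of imputing."""
    observed = [v for v in profile if v is not None]
    return fit_weibull(observed)
```

Each gene is thereby reduced to a (shape, scale) pair, and those pairs can be passed to any clustering routine; which one the WDCM itself uses is not stated in the abstract.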

5.
An improved algorithm for clustering gene expression data   Cited by: 1 (self-citations: 0, citations by others: 1)
MOTIVATION: Recent advancements in microarray technology allow simultaneous monitoring of the expression levels of a large number of genes over different time points. Clustering is an important tool for analyzing such microarray data, typical properties of which are inherent uncertainty, noise and imprecision. In this article, a two-stage clustering algorithm, which employs a recently proposed variable string length genetic scheme and a multiobjective genetic clustering algorithm, is proposed. It is based on the novel concept of points having significant membership to multiple classes. An iterated version of the well-known Fuzzy C-Means is also utilized for clustering. RESULTS: The significant superiority of the proposed two-stage clustering algorithm over the average linkage method, the Self Organizing Map (SOM) and a recently developed weighted Chinese restaurant-based clustering method (CRC), all widely used methods for clustering gene expression data, is established on a variety of artificial and publicly available real-life data sets. The biological relevance of the clustering solutions is also analyzed.

6.

Background  

In recent years, clustering algorithms have been effectively applied in molecular biology for gene expression data analysis. With the help of clustering algorithms such as K-means, hierarchical clustering and SOM, genes are partitioned into groups based on the similarity between their expression profiles. In this way, functionally related genes are identified. As the amount of laboratory data in molecular biology grows exponentially each year due to advanced technologies such as microarrays, new efficient and effective clustering methods must be developed to process this growing amount of biological data.

7.
A reliable and precise identification of the type of tumor is crucial to the effective treatment of cancer. With the rapid development of microarray technologies, tumor clustering based on gene expression data is becoming a powerful approach to cancer class discovery. In this paper, we apply the penalized matrix decomposition (PMD) to gene expression data to extract metasamples for clustering. The extracted metasamples capture the inherent structures of samples belonging to the same class. At the same time, the PMD factors of a sample over the metasamples can in turn be used as its class indicator. Compared with conventional methods such as hierarchical clustering (HC), self-organizing maps (SOM), affinity propagation (AP) and nonnegative matrix factorization (NMF), the proposed method can identify samples with complex classes. Moreover, the PMD factor can be used as an index to determine the number of clusters. The proposed method provides a reasonable explanation of the inconsistent classifications made by the conventional methods. In addition, it is able to discover the modules in gene expression data of conterminous developmental stages. Experiments on two representative problems show that the proposed PMD-based method is very promising for discovering biological phenotypes.

8.
Self-Organized Maps (SOMs) are a popular approach for analyzing genome-wide expression data. However, most SOM-based approaches ignore prior knowledge about functional gene categories. In addition, SOM-based approaches usually develop topographic maps with disjoint and uniform activation regions that correspond to a hard clustering of the patterns at their nodes. We present a novel self-organizing map, the Kernel Supervised Dynamic Grid Self-Organized Map (KSDG-SOM). This model adapts its parameters in a kernel space. Gaussian kernels are used, and their mean and variance components are adapted in order to optimize the fit to the input density. The KSDG-SOM also grows dynamically up to a size defined by statistical criteria. It is capable of incorporating a priori information about the known functional characteristics of genes. This information forms a supervised bias during cluster formation, and the model has the potential to revise incorrect functional labels. The new method overcomes the main drawback of most existing clustering methods, which lack a mechanism for dynamic extension based on a balance between unsupervised and supervised drives.

9.

Background  

A method to evaluate and analyze the massive data generated by series of microarray experiments is of utmost importance for revealing the hidden patterns of gene expression. Because of the complexity and high dimensionality of microarray gene expression profiles, the dimensional reduction of raw expression data and the feature selection necessary for tasks such as classification of disease samples remain a challenge. To solve the problem we propose a two-level analysis. First, a self-organizing map (SOM) is used. The SOM is a vector quantization method that simplifies and reduces the dimensionality of the original measurements and visualizes each individual tumor sample in a SOM component plane. Next, hierarchical clustering and K-means clustering are used to identify patterns of gene expression useful for classification of samples.

10.
The identification and visualization of clusters formed by motor unit action potentials (MUAPs) is an essential step in investigations seeking to explain the control of the neuromuscular system. This work introduces the generative topographic mapping (GTM), a novel machine learning tool, for clustering of MUAPs, and also extends the GTM technique to provide a way of visualizing MUAPs. The performance of GTM was compared to that of three other clustering methods: the self-organizing map (SOM), a Gaussian mixture model (GMM), and the neural-gas network (NGN). The results, based on the study of experimental MUAPs, showed that the success rates of both GTM and SOM exceeded those of GMM and NGN, and also that GTM may in practice be used as a principled alternative to the SOM in the study of MUAPs. A visualization tool, which we called the GTM grid, was devised for visualization of MUAPs lying in a high-dimensional space. The visualization provided by the GTM grid was compared to that obtained from principal component analysis (PCA).

11.
Self-Organising Map (SOM) clustering methods applied to the monthly and seasonal averaged flowering intensity records of eight Eucalypt species are shown to successfully quantify, visualise and model synchronisation of multivariate time series. The SOM algorithm converts complex, nonlinear relationships between high-dimensional data into simple networks and a map based on the most likely patterns in the multiplicity of time series on which it is trained. Monthly- and seasonal-based SOMs identified three synchronous species groups (clusters): E. camaldulensis, E. melliodora, E. polyanthemos; E. goniocalyx, E. microcarpa, E. macrorhyncha; and E. leucoxylon, E. tricarpa. The main factor in synchronisation (clustering) appears to be the season in which flowering commences. SOMs also identified the asynchronous relationships among the eight species. Hence, the likelihood of the production, or not, of hybrids between sympatric species is also identified. The SOM pattern-based correlation values mirror earlier synchrony statistics gleaned from Moran correlations obtained from the raw flowering records. Synchronisation of flowering is shown to be a complex mechanism that incorporates all the flowering characteristics: flowering duration, timing of peak flowering, timing of the start and finish of flowering, as well as possibly specific climate drivers for flowering. SOMs can accommodate all this complexity, and we advocate their use by phenologists and ecologists as a powerful, accessible and interpretable tool for visualisation and clustering of multivariate time series and for synchrony studies.

12.
Analysis of gene expression data using self-organizing maps.   Cited by: 29 (self-citations: 0, citations by others: 29)
DNA microarray technologies, together with rapidly increasing genomic sequence information, are leading to an explosion in available gene expression data. Currently there is a great need for efficient methods to analyze and visualize these massive data sets. A self-organizing map (SOM) is an unsupervised neural network learning algorithm which has been successfully used for the analysis and organization of large data files. Here we have applied the SOM algorithm to analyze published yeast gene expression data and show that the SOM is an excellent tool for the analysis and visualization of gene expression profiles.
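
For readers unfamiliar with the SOM algorithm that recurs throughout these abstracts, here is a minimal NumPy sketch of online SOM training on a rectangular grid. It is a generic textbook version, not the implementation used in any of the listed papers; the function names, grid size, linear decay schedules and default hyperparameters are all illustrative assumptions.

```python
import numpy as np


def train_som(data, rows=3, cols=3, epochs=50, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small SOM and return its weights with shape (rows, cols, dim)."""
    rng = np.random.default_rng(seed)
    n, dim = data.shape
    # Initialise each unit with a randomly chosen data point.
    weights = data[rng.integers(0, n, size=rows * cols)].astype(float)
    # Grid coordinates of every unit, used by the neighbourhood function.
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    total, step = epochs * n, 0
    for _ in range(epochs):
        for idx in rng.permutation(n):
            frac = step / total
            lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
            sigma = sigma0 * (1.0 - frac) + 1e-3     # neighbourhood radius shrinks
            x = data[idx]
            # Best-matching unit: the unit whose weight vector is closest to x.
            bmu = np.argmin(((weights - x) ** 2).sum(axis=1))
            grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            # Pull the BMU and its grid neighbours toward the sample.
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights.reshape(rows, cols, dim)


def best_matching_unit(weights, x):
    """Grid position (row, col) of the unit closest to x."""
    cols = weights.shape[1]
    flat = weights.reshape(-1, weights.shape[-1])
    k = int(np.argmin(((flat - x) ** 2).sum(axis=1)))
    return k // cols, k % cols
```

In a gene expression setting each row of `data` would be one gene's expression profile, and genes mapped to the same or neighbouring grid units form the clusters that are then visualized.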

13.

Background

Molecular dynamics (MD) simulations are powerful tools for investigating the conformational dynamics of proteins, which is often a critical element of their function. Identification of functionally relevant conformations is generally done by clustering the large ensemble of structures that are generated. Recently, Self-Organising Maps (SOMs) were reported to perform more accurately and provide more consistent results than traditional clustering algorithms in various data mining problems. We present a novel strategy to analyse and compare conformational ensembles of protein domains using a two-level approach that combines SOMs and hierarchical clustering.

Results

The conformational dynamics of the α-spectrin SH3 protein domain and six single mutants were analysed by MD simulations. The Cα's Cartesian coordinates of conformations sampled in the essential space were used as input data vectors for SOM training, then complete linkage clustering was performed on the SOM prototype vectors. A specific protocol to optimize a SOM for structural ensembles was proposed: the optimal SOM was selected by means of a Taguchi experimental design plan applied to different data sets, and the optimal sampling rate of the MD trajectory was selected. The proposed two-level approach was applied to single trajectories of the SH3 domain independently as well as to groups of them at the same time. The results demonstrated the potential of this approach in the analysis of large ensembles of molecular structures: the possibility of producing a topological mapping of the conformational space in a simple 2D visualisation, as well as of effectively highlighting differences in the conformational dynamics directly related to biological functions.

Conclusions

A two-level approach combining SOMs and hierarchical clustering was proposed for the conformational analysis of structural ensembles of proteins. It can easily be extended to other study cases and to conformational ensembles from other sources.

14.
1. Two types of artificial neural network procedures were used to define and predict diatom assemblage structures in Luxembourg streams using environmental data. 2. Self‐organising maps (SOM) were used to classify samples according to their diatom composition, and a multilayer perceptron with a backpropagation learning algorithm (BPN) was used to predict these assemblages, using the environmental characteristics of each sample as input and the spatial coordinates (X and Y) of the cell centres of the SOM map identified as diatom assemblages as output. Classical methods (correspondence analysis and clustering analysis) were then used to identify the relations between diatom assemblages and the SOM cell number. A canonical correspondence analysis was also used to define the relationship between these assemblages and the environmental conditions. 3. The diatom‐SOM training set resulted in 12 representative assemblages (12 clusters) having different species compositions. Comparison of observed and estimated sample positions on the SOM map was used to evaluate the performance of the BPN (correlation coefficients were 0.93 for X and 0.94 for Y). Mean square errors of the 12 cells varied from 0.47 to 1.77, and the proportion of well-predicted samples ranged from 37.5 to 92.9%. This study showed the high predictability of diatom assemblages using physical and chemical parameters for a small number of river types within a restricted geographical area.

15.
In this study, we propose an unsupervised recognition system for single-trial classification of motor imagery (MI) electroencephalogram (EEG) data. Competitive Hopfield neural network (CHNN) clustering is used to discriminate left and right MI EEG data after the active segment has been selected and multi-scale fractal features have been extracted. First, we use the continuous wavelet transform (CWT) and Student's two-sample t-statistics to select the active segment in the time-frequency domain. The multiresolution fractal features are then extracted from the wavelet data by means of a modified fractal dimension. Finally, CHNN clustering is adopted to recognize the extracted features. Because it is unsupervised, CHNN is well suited to classifying non-stationary EEG signals. The results indicate that CHNN achieves an average classification accuracy of 81.9% in comparison with the self-organizing map (SOM) and several popular supervised classifiers on six subjects from two data sets.

16.
The aim of this research is to use three clustering techniques to build a molecular data model for large sample sets by comparing low energy samples (LES) with local molecular samples (LMS). Hierarchical clustering based on multi-level tree distance relations, a competitive learning network in which similar inputs fall into the same cluster, and a topological SOM are used to analyze 6,242 LES and 5,000 LMS. Our experiments show that in the SOM, the Davies-Bouldin clustering index and the color map indicate 24 to 25 cluster units in the LES, more than the 10 to 12 in the LMS, which is roughly consistent with the results of hierarchical clustering and the competitive learning network. Hierarchical clustering shows that the largest inter-cluster distance for the LES, about 30, is far larger than that for the LMS, about 10. The intra-cluster distance of the LES, about 15, is also far larger than that of the LMS, about 3. In the SOM, the D-matrix and U-matrix of the LES show more high-valued (black) cluster borders, reflecting large distances, and more clusters than those of the LMS, because the largest standard deviation range of the sample features of the LES, from -8 to 10, is wider than that of the LMS, from -2.5 to 2.5.

17.
Work-related casualties cause serious damage to regional social and economic development. China's rapid development is raising a series of concerns about work-related casualties. The self-organizing maps (SOM) approach is applied in this study to detect the impacts of socioeconomic factors on the severity of work-related casualties in 31 regions of mainland China. The results show that: (1) the regional severity of work-related casualties and socioeconomic development seem to follow an inverted U-shaped pattern (i.e., the number of work-related fatalities increases to a peak at a certain stage and then declines along with socioeconomic development); (2) the industrial and employment structure is negatively correlated with the regional severity of work-related casualties; specifically, a higher percentage of tertiary industry in gross regional product (GRP) and a higher percentage of persons employed in tertiary industry may lead to fewer work-related fatalities in a region; (3) some socioeconomic factors, such as education level, medical conditions and insurance coverage, have negative impacts on the regional severity of work-related casualties. Furthermore, the study also shows that the SOM approach is capable of improving clustering quality and visualization when applied to multidimensional datasets, compared with traditional clustering approaches such as K-Means and hierarchical clustering methods.

18.
The self-organizing map (SOM), a kind of unsupervised neural network, has been used for both static data management and dynamic data analysis. To further exploit its search abilities, in this paper we propose an SOM-based algorithm (SOMS) for optimization problems involving both static and dynamic functions. Furthermore, a new SOM weight updating rule is proposed to enhance the learning efficiency; it can dynamically adjust the neighborhood function of the SOM while learning system parameters. As a demonstration, the proposed SOMS is applied to function optimization and to dynamic trajectory prediction, and its performance is compared with that of the genetic algorithm (GA), since both methods conduct searches in similar ways.

19.
For classification of action potential shapes in multineuron recordings, we present a spike sorting system employing independent component analysis (ICA) and an unsupervised artificial neural network (Kohonen's self-organizing map, SOM). We focus on how ICA in the first stage of the spike sorting system can be used to address specific problems arising in recordings using multielectrode arrays in the CNS. Using real data recorded from the pontine nuclei in rats and simulated data, we evaluate the performance of several ICA algorithms in removing cross-talk between electrodes using data from continuous recording (or simulation). When using cut-out data, the standard format of extracellular spike recordings, new problems emerge and robust algorithms are needed. We demonstrate that several ICA algorithms perform well on cut-out data from multielectrode array recordings (simulated and real data). In tetrode recordings the same neuron is purposely recorded by several electrodes simultaneously, and we show how independent component analysis can be used in this case to identify redundant information and hence to compress relevant information, improving subsequent clustering by a SOM.

20.
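Qi Jianchao, Liu Huiping, Yi Yaoguo. Acta Ecologica Sinica (《生态学报》), 2017, 37(19): 6346-6354
Analysis of the spatiotemporal evolution of land use from time series is a current research hotspot. This study applies the self-organizing map (SOM) neural network method to an integrated spatiotemporal representation and evolution analysis of multi-temporal land use change, in order to explore regional land use change patterns. Based on remote-sensing land use classification data for Beijing from five periods (2005, 2007, 2009, 2011 and 2013), a self-organizing map neural network was constructed, and its clustering and dimensionality-reduction visualization capabilities were used to train on and output the land use data of all five years simultaneously, revealing the aggregation patterns of built-up land, cultivated land, forest land, pasture land and garden land. Through secondary clustering of the output neurons and land use change trajectory analysis, the spatiotemporal evolution characteristics of land use change in the Beijing suburbs across the five monitoring periods were obtained. The results reveal that land use change in the Beijing suburbs from 2005 to 2013 shows a clear plain-area evolution pattern, from cultivated-land types toward built-up-land types, and a mountain-area evolution pattern toward forest-land types, and that the development of each district follows a temporal order; overall, six types of land use evolution trajectory were formed.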
