Similar literature
 20 similar documents found (search time: 15 ms)
1.
Clustering is a major tool for microarray gene expression data analysis. The existing clustering methods fall mainly into two categories: parametric and nonparametric. The parametric methods generally assume a mixture of parametric subdistributions. When the mixture distribution approximately fits the true data generating mechanism, the parametric methods perform well, but not so when there is nonnegligible deviation between them. On the other hand, the nonparametric methods, which usually do not make distributional assumptions, are robust but at the cost of efficiency. In an attempt to utilize the known mixture form to increase efficiency, and to drop assumptions about the unknown subdistributions to enhance robustness, we propose a semiparametric method for clustering. The proposed approach has the form of a parametric mixture, with no assumptions about the subdistributions. The subdistributions are estimated nonparametrically, with constraints imposed only on the modes. An expectation-maximization (EM) algorithm along with a classification step is invoked to cluster the data, and a modified Bayesian information criterion (BIC) is employed to guide the determination of the optimal number of clusters. Simulation studies are conducted to assess the performance and the robustness of the proposed method. The results show that the proposed method yields a reasonable partition of the data. As an illustration, the proposed method is applied to a real microarray data set to cluster genes.

2.
Evaluation and comparison of gene clustering methods in microarray analysis   Total citations: 4, self-citations: 0, citations by others: 4
MOTIVATION: Microarray technology has been widely applied in biological and clinical studies for simultaneous monitoring of gene expression in thousands of genes. Gene clustering analysis is found useful for discovering groups of correlated genes potentially co-regulated or associated with the disease or conditions under investigation. Many clustering methods including hierarchical clustering, K-means, PAM, SOM, mixture model-based clustering and tight clustering have been widely used in the literature. Yet no comprehensive comparative study has been performed to evaluate the effectiveness of these methods. RESULTS: In this paper, six gene clustering methods are evaluated by simulated data from a hierarchical log-normal model with various degrees of perturbation as well as four real datasets. A weighted Rand index is proposed for measuring similarity of two clustering results with possible scattered genes (i.e. a set of noise genes not being clustered). Performance of the methods in the real data is assessed by a predictive accuracy analysis through verified gene annotations. Our results show that tight clustering and model-based clustering consistently outperform other clustering methods both in simulated and real data while hierarchical clustering and SOM perform among the worst. Our analysis provides deep insight into the complicated gene clustering problem of expression profiles and serves as a practical guideline for routine microarray cluster analysis.
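The weighted Rand index proposed in this abstract is not specified in enough detail here to reproduce. As a minimal sketch of the underlying idea only, the following computes the classical, unweighted Rand index: the fraction of item pairs on which two clusterings agree (both place the pair together, or both place it apart). The function name `rand_index` is a hypothetical helper, and the handling of scattered (unclustered) genes is deliberately left out.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Classical (unweighted) Rand index between two clusterings.

    labels_a and labels_b give the cluster label of each item under the
    two clusterings. This is NOT the weighted variant from the abstract.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    pairs = list(combinations(range(len(labels_a)), 2))
    if not pairs:
        raise ValueError("need at least two items")
    agreements = 0
    for i, j in pairs:
        together_a = labels_a[i] == labels_a[j]
        together_b = labels_b[i] == labels_b[j]
        # A pair counts as an agreement when both clusterings treat it alike.
        if together_a == together_b:
            agreements += 1
    return agreements / len(pairs)
```

Because only pair co-membership is compared, relabeling the clusters of one result does not change the index.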

3.
Multi-class clustering and prediction in the analysis of microarray data   Total citations: 1, self-citations: 0, citations by others: 1
DNA microarray technology provides tools for studying the expression profiles of a large number of distinct genes simultaneously. This technology has been applied to sample clustering and sample prediction. Because of the large number of genes measured, many of the genes in the original data set are irrelevant to the analysis. Selection of discriminatory genes is critical to the accuracy of clustering and prediction. This paper considers a statistical significance testing approach to selecting discriminatory gene sets for multi-class clustering and prediction of experimental samples. A toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV with a total of 55 samples) is used to illustrate a general framework of the approach. Among four selected gene sets, a gene set omega(I) formed by the intersection of the F-test and the set of the union of one-versus-all t-tests performs the best in terms of clustering as well as prediction. Hierarchical and two modified partition (k-means) methods all show that the set omega(I) is able to group the 55 samples into seven clusters reasonably well, in which the As and AsV samples are considered as one cluster (the same group) as are the Cd and Cu samples. With respect to prediction, the overall accuracy for the gene set omega(I) using the nearest neighbors algorithm to predict 55 samples into one of the nine treatments is 85%.

4.

Background  

Image analysis is the first crucial step to obtain reliable results from microarray experiments. First, areas in the image belonging to single spots have to be identified. Then, those target areas have to be partitioned into foreground and background. Finally, two scalar values for the intensities have to be extracted. These goals have been tackled either by spot shape methods or intensity histogram methods, but it would be desirable to have hybrid algorithms which combine the advantages of both approaches.

5.
MOTIVATION: Because co-expressed genes are likely to share the same biological function, cluster analysis of gene expression profiles has been applied for gene function discovery. Most existing clustering methods ignore known gene functions in the process of clustering. RESULTS: To take advantage of accumulating gene functional annotations, we propose incorporating known gene functions into a new distance metric, which shrinks a gene expression-based distance towards 0 if and only if the two genes share a common gene function. A two-step procedure is used. First, the shrinkage distance metric is used in any distance-based clustering method, e.g. K-medoids or hierarchical clustering, to cluster the genes with known functions. Second, while keeping the clustering results from the first step for the genes with known functions, the expression-based distance metric is used to cluster the remaining genes of unknown function, assigning each of them to either one of the clusters obtained in the first step or some new clusters. A simulation study and an application to gene function prediction for the yeast demonstrate the advantage of our proposal over the standard method.
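The shrinkage idea in this abstract can be illustrated with a small sketch. The abstract does not give the exact form of the metric, so the version below is an assumption: a Euclidean expression distance multiplied by a constant factor `(1 - shrink)` whenever the two genes share at least one annotated function, and left unchanged otherwise. The names `shrunken_distance` and `shrink` are hypothetical.

```python
import math


def expression_distance(profile_a, profile_b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


def shrunken_distance(profile_a, profile_b, functions_a, functions_b, shrink=0.5):
    """Expression distance, shrunk towards 0 only for genes sharing a function.

    functions_a / functions_b are sets of known function labels.
    shrink is an assumed tuning parameter in [0, 1]; the abstract does not
    state how the shrinkage amount is chosen.
    """
    if not 0.0 <= shrink <= 1.0:
        raise ValueError("shrink must lie in [0, 1]")
    distance = expression_distance(profile_a, profile_b)
    if set(functions_a) & set(functions_b):
        return (1.0 - shrink) * distance
    return distance
```

Any distance-based method (for example K-medoids or hierarchical clustering, as named in the abstract) could then consume this function in place of the plain expression distance for the genes with known functions.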

6.
Expert systems are currently being developed to supervise, stabilize and optimize antibiotic fermentations at industrial scale. For knowledge acquisition, various classifiers are tested using a set of 27 nourseothricin fermentation runs. Two methods are applied: optimal clustering using the minimum variance criterion and hierarchical clustering using dendrograms. The fermentations are classified with respect to the specific material costs as well as the product formation kinetics.
List of Symbols
a (kg/m3): initial value of linearized product kinetics
b (kg/(m3 · h)): slope of linearized product kinetics
B: binary variable (value 0 or 1)
C (DM/kg): specific costs
d: distance
m: number of samples
p (kg/m3): product concentration
pO2 (%): dissolved oxygen concentration
t (h): fermentation time
T (h): initial time of linearized product kinetics
n: number of fermentation runs

7.
Many bioinformatics problems can be tackled from a fresh angle offered by the network perspective. Directly inspired by metabolic network structural studies, we propose an improved gene clustering approach for inferring gene signaling pathways from gene microarray data. Based on the construction of co-expression networks that consist of both significantly linear and non-linear gene associations together with controlled biological and statistical significance, our approach tends to group functionally related genes into tight clusters despite their expression dissimilarities. We illustrate our approach and compare it to the traditional clustering approaches on a yeast galactose metabolism dataset and a retinal gene expression dataset. Our approach greatly outperforms the traditional approach in rediscovering the relatively well known galactose metabolism pathway in yeast and in clustering genes of the photoreceptor differentiation pathway. AVAILABILITY: The clustering method has been implemented in an R package "GeneNT" that is freely available from: http://www.cran.org.

8.
Fuzzy C-means method for clustering microarray data   Total citations: 9, self-citations: 0, citations by others: 9
MOTIVATION: Clustering analysis of data from DNA microarray hybridization studies is essential for identifying biologically relevant groups of genes. Partitional clustering methods such as K-means or self-organizing maps assign each gene to a single cluster. However, these methods do not provide information about the influence of a given gene on the overall shape of clusters. Here we apply a fuzzy partitioning method, Fuzzy C-means (FCM), to attribute cluster membership values to genes. RESULTS: A major problem in applying the FCM method for clustering microarray data is the choice of the fuzziness parameter m. We show that the commonly used value m = 2 is not appropriate for some data sets, and that optimal values for m vary widely from one data set to another. We propose an empirical method, based on the distribution of distances between genes in a given data set, to determine an adequate value for m. By setting threshold levels for the membership values, genes which are tightly associated with a given cluster can be selected. Using a yeast cell cycle data set as an example, we show that this selection increases the overall biological significance of the genes within the cluster. AVAILABILITY: Supplementary text and Matlab functions are available at http://www-igbmc.u-strasbg.fr/fcm/
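To make the role of the fuzziness parameter m concrete, here is a minimal sketch of the standard FCM membership update for a single gene, given fixed cluster centroids: u_i = 1 / sum_j (d_i / d_j)^(2/(m-1)), where d_i is the distance to centroid i. It assumes Euclidean distance, and the function name `fcm_memberships` is a hypothetical helper. It does not implement the authors' empirical method for choosing m.

```python
import math


def fcm_memberships(point, centroids, m=2.0):
    """Fuzzy C-means membership values of one point for fixed centroids.

    m is the fuzziness parameter (must exceed 1). Values of m close to 1
    give nearly hard assignments; larger m gives flatter memberships.
    """
    if m <= 1.0:
        raise ValueError("fuzziness parameter m must be greater than 1")
    distances = [
        math.sqrt(sum((p - c) ** 2 for p, c in zip(point, centroid)))
        for centroid in centroids
    ]
    # A point lying exactly on one or more centroids belongs fully to them.
    on_centroid = [i for i, d in enumerate(distances) if d == 0.0]
    if on_centroid:
        share = 1.0 / len(on_centroid)
        return [share if i in on_centroid else 0.0 for i in range(len(distances))]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_j) ** exponent for d_j in distances)
        for d_i in distances
    ]
```

Thresholding the returned values (for example keeping genes whose largest membership exceeds a chosen cutoff) corresponds to the selection step described in the abstract; the cutoff itself is not given there.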

9.
Summary: The resolution of spectral frequencies in NMR data obtained from discrete Fourier transformation (DFT) along D constant-time dimensions can be improved significantly through extrapolation of the D-dimensional free induction decay (FID) by multidimensional Bayesian analysis. Starting from Bayesian probability theory for parameter estimation and model detection of one-dimensional time-domain data [Bretthorst, (1990) J. Magn. Reson., 88, 533–551; 552–570; 571–595], a theory for the D-dimensional case has been developed and implemented in an algorithm called BAMBAM (BAyesian Model Building Algorithm in Multidimensions). BAMBAM finds the most probable sinusoidal model to account for the systematic portion of any D-dimensional stationary FID. According to the parameters estimated by the algorithm, the FID is extrapolated in D dimensions prior to apodization and Fourier transformation. Multidimensional Bayesian analysis allows for the detection of signals not resolved by the DFT alone or even by sequential one-dimensional extrapolation from mirror-image linear prediction prior to the DFT. The procedure has been tested with a theoretical two-dimensional dataset and with four-dimensional HN(CO)CAHA (Kay et al. (1992) J. Magn. Reson., 98, 443–450) data from a small protein (8 kDa) where BAMBAM was applied to the 13C and H constant-time dimensions.

10.

Background  

Data clustering analysis has been extensively applied to extract information from gene expression profiles obtained with DNA microarrays. To this aim, existing clustering approaches, mainly developed in computer science, have been adapted to microarray data analysis. However, previous studies revealed that microarray datasets have very diverse structures, some of which may not be correctly captured by current clustering methods. We therefore approached the problem from a new starting point, and developed a clustering algorithm designed to capture dataset-specific structures at the beginning of the process.

11.
MOTIVATION: Due to the existence of the loss of synchrony in cell-cycle data sets, standard clustering methods (e.g. k-means), which group open reading frames (ORFs) based on similar expression levels, are deficient unless the temporal pattern of the expression levels of the ORFs is taken into account. METHODS: We propose to improve the performance of the k-means method by assigning a decreasing weight on its variable level and evaluating the 'weighted k-means' on a yeast cell-cycle data set. Protein complexes from a public website are used as biological benchmarks. To compare the k-means clusters with the structures of the protein complexes, we measure the agreement between these two ways of clustering via the adjusted Rand index. RESULTS: Our results show that the time-decreasing weight function, $\exp[-\frac{1}{2}(t^2/C^2)]$, which we assign to the variable level of k-means, generally increases the agreement between protein complexes and k-means clusters when C is near the length of two cell cycles.
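As a rough illustration of the weight function exp[-(1/2)(t^2/C^2)] quoted in this abstract, the sketch below evaluates the weight at each time point and uses it in a weighted squared Euclidean distance between two expression profiles. How exactly the authors apply the weight inside k-means is not stated in the abstract, so multiplying each squared difference by the weight is an assumption, and the names `time_weight` and `weighted_sq_distance` are hypothetical.

```python
import math


def time_weight(t, c):
    """Time-decreasing weight exp(-(1/2) * t**2 / c**2) from the abstract."""
    if c <= 0:
        raise ValueError("c must be positive")
    return math.exp(-0.5 * (t * t) / (c * c))


def weighted_sq_distance(profile_a, profile_b, times, c):
    """Squared distance in which later time points count less.

    Assumption: each squared difference is multiplied by time_weight(t, c).
    """
    if not (len(profile_a) == len(profile_b) == len(times)):
        raise ValueError("profiles and times must have equal length")
    return sum(
        time_weight(t, c) * (a - b) ** 2
        for a, b, t in zip(profile_a, profile_b, times)
    )
```

With this form the weight is 1 at t = 0 and falls to exp(-1/2), about 0.61, at t = C, so a larger C keeps later time points influential for longer.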

12.
In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with that of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.
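The mutual-cluster idea can be checked directly on a small point set. The sketch below uses one reading of the sentence above, which is an assumption rather than the paper's formal definition: a group is a mutual cluster when its largest within-group distance is strictly smaller than the smallest distance from any group member to any point outside the group. Euclidean distance and the name `is_mutual_cluster` are also assumptions.

```python
import math
from itertools import combinations


def _distance(a, b):
    """Euclidean distance between two points given as sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def is_mutual_cluster(points, members):
    """Check whether the indices in `members` form a mutual cluster.

    points is a list of coordinate sequences; members is a collection of
    indices into points. The group must be a non-empty proper subset.
    """
    members = set(members)
    outside = [i for i in range(len(points)) if i not in members]
    if not members or not outside:
        raise ValueError("members must be a non-empty proper subset of the points")
    # Largest distance between any two members (0 for a single point).
    diameter = max(
        (_distance(points[i], points[j]) for i, j in combinations(sorted(members), 2)),
        default=0.0,
    )
    # Smallest distance from a member to a non-member.
    nearest_outside = min(
        _distance(points[i], points[j]) for i in members for j in outside
    )
    return diameter < nearest_outside
```

This brute-force check is quadratic in the number of points and is meant only to illustrate the definition, not the identification algorithm mentioned in the abstract.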

13.
Gaussian mixture clustering and imputation of microarray data   Total citations: 3, self-citations: 0, citations by others: 3
MOTIVATION: In microarray experiments, missing entries arise from blemishes on the chips. In large-scale studies, virtually every chip contains some missing entries and more than 90% of the genes are affected. Many analysis methods require a full set of data. Either those genes with missing entries are excluded, or the missing entries are filled with estimates prior to the analyses. This study compares methods of missing value estimation. RESULTS: Two evaluation metrics of imputation accuracy are employed. First, the root mean squared error measures the difference between the true values and the imputed values. Second, the number of mis-clustered genes measures the difference between clustering with true values and that with imputed values; it examines the bias introduced by imputation to clustering. The Gaussian mixture clustering with model averaging imputation is superior to all other imputation methods, according to both evaluation metrics, on both time-series (correlated) and non-time series (uncorrelated) data sets.
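The first evaluation metric named above, the root mean squared error between true and imputed values, can be sketched as follows. The sketch assumes the caller passes only the entries that were artificially removed and then imputed; the function name `imputation_rmse` is hypothetical, and the second metric (mis-clustered genes) is not covered.

```python
import math


def imputation_rmse(true_values, imputed_values):
    """Root mean squared error between held-out true values and imputations."""
    if len(true_values) != len(imputed_values):
        raise ValueError("both sequences must have the same length")
    if not true_values:
        raise ValueError("need at least one imputed entry")
    squared_errors = [(t - i) ** 2 for t, i in zip(true_values, imputed_values)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))
```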

14.
MOTIVATION: Unsupervised analysis of microarray gene expression data attempts to find biologically significant patterns within a given collection of expression measurements. For example, hierarchical clustering can be applied to expression profiles of genes across multiple experiments, identifying groups of genes that share similar expression profiles. Previous work using the support vector machine supervised learning algorithm with microarray data suggests that higher-order features, such as pairwise and tertiary correlations across multiple experiments, may provide significant benefit in learning to recognize classes of co-expressed genes. RESULTS: We describe a generalization of the hierarchical clustering algorithm that efficiently incorporates these higher-order features by using a kernel function to map the data into a high-dimensional feature space. We then evaluate the utility of the kernel hierarchical clustering algorithm using both internal and external validation. The experiments demonstrate that the kernel representation itself is insufficient to provide improved clustering performance. We conclude that mapping gene expression data into a high-dimensional feature space is only a good idea when combined with a learning algorithm, such as the support vector machine that does not suffer from the curse of dimensionality. AVAILABILITY: Supplementary data at www.cs.columbia.edu/compbio/hiclust. Software source code available by request.
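A distance-based method such as hierarchical clustering can work in a kernel-induced feature space without ever constructing that space, because the squared feature-space distance is k(x, x) - 2 k(x, y) + k(y, y). The sketch below shows this identity with a polynomial kernel. The choice of kernel, its `degree` and `offset` parameters, and the function names are assumptions for illustration; the abstract does not say which kernels the authors used.

```python
import math


def poly_kernel(x, y, degree=2, offset=1.0):
    """Polynomial kernel (x . y + offset) ** degree (assumed example kernel)."""
    return (sum(a * b for a, b in zip(x, y)) + offset) ** degree


def kernel_distance(x, y, kernel=poly_kernel):
    """Distance between x and y in the feature space induced by `kernel`."""
    squared = kernel(x, x) - 2.0 * kernel(x, y) + kernel(y, y)
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(squared, 0.0))
```

The resulting pairwise distances could be handed to any standard agglomerative procedure; as the abstract reports, doing so did not by itself improve clustering performance.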

15.
A central step in the analysis of gene expression data is the identification of groups of genes that exhibit similar expression patterns. Clustering and ordering the genes using gene expression data into homogeneous groups was shown to be useful in functional annotation, tissue classification, regulatory motif identification, and other applications. Although there is a rich literature on gene ordering in the hierarchical clustering framework for gene expression analysis, to the best of the authors' knowledge there is no work addressing and evaluating the importance of gene ordering in the partitive clustering framework. Outside the framework of hierarchical clustering, different gene ordering algorithms are applied to the whole data set, and the domain of partitive clustering is still unexplored with gene ordering approaches. A new hybrid method is proposed for ordering genes in each of the clusters obtained from a partitive clustering solution, using microarray gene expressions. Two existing algorithms for optimally ordering cities in the travelling salesman problem (TSP), namely FRAG_GALK and Concorde, are hybridized individually with the self-organizing map to show the importance of gene ordering in the partitive clustering framework. We validated our hybrid approach using yeast and fibroblast data and showed that our approach improves the result quality of the partitive clustering solution by identifying subclusters within big clusters, grouping functionally correlated genes within clusters, minimizing the summation of gene expression distances, and maximizing biological gene ordering using MIPS categorization. Moreover, the new hybrid approach finds comparable or sometimes superior biological gene order in less computation time than those obtained by optimal leaf ordering in the hierarchical clustering solution.

16.

Background  

There are some limitations associated with conventional clustering methods for short time-course gene expression data. The current algorithms require prior domain knowledge and do not incorporate information from replicates. Moreover, the results are not always easy to interpret biologically.

17.
Bayesian mixture model based clustering of replicated microarray data   Total citations: 3, self-citations: 0, citations by others: 3
MOTIVATION: Identifying patterns of co-expression in microarray data by cluster analysis has been a productive approach to uncovering molecular mechanisms underlying biological processes under investigation. Using experimental replicates can generally improve the precision of the cluster analysis by reducing the experimental variability of measurements. In such situations, Bayesian mixtures allow for an efficient use of information by precisely modeling between-replicates variability. RESULTS: We developed different variants of Bayesian mixture based clustering procedures for clustering gene expression data with experimental replicates. In this approach, the statistical distribution of microarray data is described by a Bayesian mixture model. Clusters of co-expressed genes are created from the posterior distribution of clusterings, which is estimated by a Gibbs sampler. We define infinite and finite Bayesian mixture models with different between-replicates variance structures and investigate their utility by analyzing synthetic and the real-world datasets. Results of our analyses demonstrate that (1) improvements in precision achieved by performing only two experimental replicates can be dramatic when the between-replicates variability is high, (2) precise modeling of intra-gene variability is important for accurate identification of co-expressed genes and (3) the infinite mixture model with the 'elliptical' between-replicates variance structure performed overall better than any other method tested. We also introduce a heuristic modification to the Gibbs sampler based on the 'reverse annealing' principle. This modification effectively overcomes the tendency of the Gibbs sampler to converge to different modes of the posterior distribution when started from different initial positions. 
Finally, we demonstrate that the Bayesian infinite mixture model with 'elliptical' variance structure is capable of identifying the underlying structure of the data without knowing the 'correct' number of clusters. AVAILABILITY: The MS Windows based program named Gaussian Infinite Mixture Modeling (GIMM) implementing the Gibbs sampler and corresponding C++ code are available at http://homepages.uc.edu/~medvedm/GIMM.htm SUPPLEMENTAL INFORMATION: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html

18.
Multidimensional genome-wide data (e.g., gene expression microarray data) provide rich information and widespread applications in integrative biology. However, little attention has been paid to the inherent relationships within these natural data. By simply viewing multidimensional microarray data scattered over hyperspace, the spatial properties (topological structure) of the data clouds may reveal the underlying relationships. Based on this idea, we herein make analytical improvements by introducing a topology-preserving selection and clustering (TPSC) approach to complex large-scale microarray data. Specifically, the integration of self-organizing map (SOM) and singular value decomposition allows genome-wide selection on sound foundations of statistical inference. Moreover, this approach is complemented with an SOM-based two-phase gene clustering procedure, allowing the topology-preserving identification of gene clusters. These gene clusters with highly similar expression patterns can facilitate many aspects of biological interpretations in terms of functional and regulatory relevance. As demonstrated by processing large and complex datasets of the human cell cycle, stress responses, and host cell responses to pathogen infection, our proposed method can yield better characteristic features from the whole datasets compared to conventional routines. We hence conclude that the topology-preserving selection and clustering without a priori assumption on data structure allow the in-depth mining of biological information in a more accurate and unbiased manner. A Web server ( http://www.cs.bris.ac.uk/~hfang/TPSC ) hosting a MATLAB package that implements the methodology is freely available to both academic and nonacademic users. These advances will expand the scope of omics applications.

19.
MOTIVATION: Microarray images challenge existing analytical methods in many ways given that gene spots often exhibit characteristic imperfections. Irregular contours, donut shapes, artifacts, and low or heterogeneous expression impair corresponding values for red and green intensities as well as their ratio R/G. New approaches are needed to ensure accurate data extraction from these images. RESULTS: Herein we introduce a novel method for intensity assessment of gene spots. The technique is based on clustering pixels of a target area into foreground and background. For this purpose we implemented two clustering algorithms derived from k-means and Partitioning Around Medoids (PAM), respectively. Results from the analysis of real gene spots indicate that our approach outperforms other existing analytical methods. This is particularly true for spots generally considered as problematic due to imperfections or almost absent expression. Both PX(PAM) and PX(KMEANS) prove to be highly robust against various types of artifacts through adaptive partitioning, which more correctly assesses expression intensity values. AVAILABILITY: The implementation of this method is a combination of two complementary tools Extractiff (Java) and Pixclust (free statistical language R), which are available upon request from the authors.
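The core step described here, clustering the pixels of a target area into foreground and background, can be sketched with a plain one-dimensional two-means on pixel intensities. This is not the authors' PX(KMEANS) or PX(PAM) algorithm, whose details are not given in the abstract; the initialization at the minimum and maximum intensity and the name `split_foreground_background` are assumptions.

```python
def split_foreground_background(intensities, max_iter=100):
    """Two-means clustering of pixel intensities.

    Returns (background_mean, foreground_mean), where the foreground is the
    brighter cluster. Centers start at the minimum and maximum intensity.
    """
    low, high = float(min(intensities)), float(max(intensities))
    if low == high:
        raise ValueError("all pixels have the same intensity")
    for _ in range(max_iter):
        # Assign each pixel to the nearer center (ties go to background).
        background = [v for v in intensities if abs(v - low) <= abs(v - high)]
        foreground = [v for v in intensities if abs(v - low) > abs(v - high)]
        new_low = sum(background) / len(background)
        new_high = sum(foreground) / len(foreground)
        if new_low == low and new_high == high:
            break  # assignments are stable
        low, high = new_low, new_high
    return low, high
```

The two returned means play the role of the two scalar intensity values to be extracted per spot and channel; a medoid-based variant would report an actual pixel value for each cluster instead of a mean.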

20.
Focused microarray analysis   Total citations: 9, self-citations: 0, citations by others: 9
