首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
MOTIVATION: Clustering technique is used to find groups of genes that show similar expression patterns under multiple experimental conditions. Nonetheless, the results obtained by cluster analysis are influenced by the existence of missing values that commonly arise in microarray experiments. Because a clustering method requires a complete data matrix as an input, previous studies have estimated the missing values using an imputation method in the preprocessing step of clustering. However, a common limitation of these conventional approaches is that once the estimates of missing values are fixed in the preprocessing step, they are not changed during subsequent processes of clustering; badly estimated missing values obtained in data preprocessing are likely to deteriorate the quality and reliability of clustering results. Thus, a new clustering method is required for improving missing values during iterative clustering process. RESULTS: We present a method for Clustering Incomplete data using Alternating Optimization (CIAO) in which a prior imputation method is not required. To reduce the influence of imputation in preprocessing, we take an alternative optimization approach to find better estimates during iterative clustering process. This method improves the estimates of missing values by exploiting the cluster information such as cluster centroids and all available non-missing values in each iteration. To test the performance of the CIAO, we applied the CIAO and conventional imputation-based clustering methods, e.g. k-means based on KNNimpute, for clustering two yeast incomplete data sets, and compared the clustering result of each method using the Saccharomyces Genome Database annotations. The clustering results of the CIAO method are more significantly relevant to the biological gene annotations than those of other methods, indicating its effectiveness and potential for clustering incomplete gene expression data. AVAILABILITY: The software was developed using Java language, and can be executed on the platforms that JVM (Java Virtual Machine) is running. It is available from the authors upon request.  相似文献   

2.

Background  

Missing values frequently pose problems in gene expression microarray experiments as they can hinder downstream analysis of the datasets. While several missing value imputation approaches are available to the microarray users and new ones are constantly being developed, there is no general consensus on how to choose between the different methods since their performance seems to vary drastically depending on the dataset being used.  相似文献   

3.
Bayesian mixture model based clustering of replicated microarray data   总被引:3,自引:0,他引:3  
MOTIVATION: Identifying patterns of co-expression in microarray data by cluster analysis has been a productive approach to uncovering molecular mechanisms underlying biological processes under investigation. Using experimental replicates can generally improve the precision of the cluster analysis by reducing the experimental variability of measurements. In such situations, Bayesian mixtures allow for an efficient use of information by precisely modeling between-replicates variability. RESULTS: We developed different variants of Bayesian mixture based clustering procedures for clustering gene expression data with experimental replicates. In this approach, the statistical distribution of microarray data is described by a Bayesian mixture model. Clusters of co-expressed genes are created from the posterior distribution of clusterings, which is estimated by a Gibbs sampler. We define infinite and finite Bayesian mixture models with different between-replicates variance structures and investigate their utility by analyzing synthetic and the real-world datasets. Results of our analyses demonstrate that (1) improvements in precision achieved by performing only two experimental replicates can be dramatic when the between-replicates variability is high, (2) precise modeling of intra-gene variability is important for accurate identification of co-expressed genes and (3) the infinite mixture model with the 'elliptical' between-replicates variance structure performed overall better than any other method tested. We also introduce a heuristic modification to the Gibbs sampler based on the 'reverse annealing' principle. This modification effectively overcomes the tendency of the Gibbs sampler to converge to different modes of the posterior distribution when started from different initial positions. Finally, we demonstrate that the Bayesian infinite mixture model with 'elliptical' variance structure is capable of identifying the underlying structure of the data without knowing the 'correct' number of clusters. AVAILABILITY: The MS Windows based program named Gaussian Infinite Mixture Modeling (GIMM) implementing the Gibbs sampler and corresponding C++ code are available at http://homepages.uc.edu/~medvedm/GIMM.htm SUPPLEMENTAL INFORMATION: http://expression.microslu.washington.edu/expression/kayee/medvedovic2003/medvedovic_bioinf2003.html  相似文献   

4.

Background  

Quality assessment of microarray data is an important and often challenging aspect of gene expression analysis. This task frequently involves the examination of a variety of summary statistics and diagnostic plots. The interpretation of these diagnostics is often subjective, and generally requires careful expert scrutiny.  相似文献   

5.
A mixture model-based approach to the clustering of microarray expression data   总被引:13,自引:0,他引:13  
MOTIVATION: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. RESULTS: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes are able to be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. AVAILABILITY: EMMIX-GENE is available at http://www.maths.uq.edu.au/~gjm/emmix-gene/  相似文献   

6.
MOTIVATION: Grouping genes having similar expression patterns is called gene clustering, which has been proved to be a useful tool for extracting underlying biological information of gene expression data. Many clustering procedures have shown success in microarray gene clustering; most of them belong to the family of heuristic clustering algorithms. Model-based algorithms are alternative clustering algorithms, which are based on the assumption that the whole set of microarray data is a finite mixture of a certain type of distributions with different parameters. Application of the model-based algorithms to unsupervised clustering has been reported. Here, for the first time, we demonstrated the use of the model-based algorithm in supervised clustering of microarray data. RESULTS: We applied the proposed methods to real gene expression data and simulated data. We showed that the supervised model-based algorithm is superior over the unsupervised method and the support vector machines (SVM) method. AVAILABILITY: The program written in the SAS language implementing methods I-III in this report is available upon request. The software of SVMs is available in the website http://svm.sdsc.edu/cgi-bin/nph-SVMsubmit.cgi  相似文献   

7.
MOTIVATION: Significance analysis of differential expression in DNA microarray data is an important task. Much of the current research is focused on developing improved tests and software tools. The task is difficult not only owing to the high dimensionality of the data (number of genes), but also because of the often non-negligible presence of missing values. There is thus a great need to reliably impute these missing values prior to the statistical analyses. Many imputation methods have been developed for DNA microarray data, but their impact on statistical analyses has not been well studied. In this work we examine how missing values and their imputation affect significance analysis of differential expression. RESULTS: We develop a new imputation method (LinCmb) that is superior to the widely used methods in terms of normalized root mean squared error. Its estimates are the convex combinations of the estimates of existing methods. We find that LinCmb adapts to the structure of the data: If the data are heterogeneous or if there are few missing values, LinCmb puts more weight on local imputation methods; if the data are homogeneous or if there are many missing values, LinCmb puts more weight on global imputation methods. Thus, LinCmb is a useful tool to understand the merits of different imputation methods. We also demonstrate that missing values affect significance analysis. Two datasets, different amounts of missing values, different imputation methods, the standard t-test and the regularized t-test and ANOVA are employed in the simulations. We conclude that good imputation alleviates the impact of missing values and should be an integral part of microarray data analysis. The most competitive methods are LinCmb, GMC and BPCA. Popular imputation schemes such as SVD, row mean, and KNN all exhibit high variance and poor performance. The regularized t-test is less affected by missing values than the standard t-test. AVAILABILITY: Matlab code is available on request from the authors.  相似文献   

8.
9.

Background  

Cluster analysis has become a standard computational method for gene function discovery as well as for more general explanatory data analysis. A number of different approaches have been proposed for that purpose, out of which different mixture models provide a principled probabilistic framework. Cluster analysis is increasingly often supplemented with multiple data sources nowadays, and these heterogeneous information sources should be made as efficient use of as possible.  相似文献   

10.

Background  

The imputation of missing values is necessary for the efficient use of DNA microarray data, because many clustering algorithms and some statistical analysis require a complete data set. A few imputation methods for DNA microarray data have been introduced, but the efficiency of the methods was low and the validity of imputed values in these methods had not been fully checked.  相似文献   

11.
Many bioinformatics problems can be tackled from a fresh angle offered by the network perspective. Directly inspired by metabolic network structural studies, we propose an improved gene clustering approach for inferring gene signaling pathways from gene microarray data. Based on the construction of co-expression networks that consists of both significantly linear and non-linear gene associations together with controlled biological and statistical significance, our approach tends to group functionally related genes into tight clusters despite their expression dissimilarities. We illustrate our approach and compare it to the traditional clustering approaches on a yeast galactose metabolism dataset and a retinal gene expression dataset. Our approach greatly outperforms the traditional approach in rediscovering the relatively well known galactose metabolism pathway in yeast and in clustering genes of the photoreceptor differentiation pathway. AVAILABILITY: The clustering method has been implemented in an R package "GeneNT" that is freely available from: http://www.cran.org.  相似文献   

12.
Fuzzy C-means method for clustering microarray data   总被引:9,自引:0,他引:9  
MOTIVATION: Clustering analysis of data from DNA microarray hybridization studies is essential for identifying biologically relevant groups of genes. Partitional clustering methods such as K-means or self-organizing maps assign each gene to a single cluster. However, these methods do not provide information about the influence of a given gene for the overall shape of clusters. Here we apply a fuzzy partitioning method, Fuzzy C-means (FCM), to attribute cluster membership values to genes. RESULTS: A major problem in applying the FCM method for clustering microarray data is the choice of the fuzziness parameter m. We show that the commonly used value m = 2 is not appropriate for some data sets, and that optimal values for m vary widely from one data set to another. We propose an empirical method, based on the distribution of distances between genes in a given data set, to determine an adequate value for m. By setting threshold levels for the membership values, genes which are tigthly associated to a given cluster can be selected. Using a yeast cell cycle data set as an example, we show that this selection increases the overall biological significance of the genes within the cluster. AVAILABILITY: Supplementary text and Matlab functions are available at http://www-igbmc.u-strasbg.fr/fcm/  相似文献   

13.
Clustering is a major tool for microarray gene expression data analysis. The existing clustering methods fall mainly into two categories: parametric and nonparametric. The parametric methods generally assume a mixture of parametric subdistributions. When the mixture distribution approximately fits the true data generating mechanism, the parametric methods perform well, but not so when there is nonnegligible deviation between them. On the other hand, the nonparametric methods, which usually do not make distributional assumptions, are robust but pay the price for efficiency loss. In an attempt to utilize the known mixture form to increase efficiency, and to free assumptions about the unknown subdistributions to enhance robustness, we propose a semiparametric method for clustering. The proposed approach possesses the form of parametric mixture, with no assumptions to the subdistributions. The subdistributions are estimated nonparametrically, with constraints just being imposed on the modes. An expectation-maximization (EM) algorithm along with a classification step is invoked to cluster the data, and a modified Bayesian information criterion (BIC) is employed to guide the determination of the optimal number of clusters. Simulation studies are conducted to assess the performance and the robustness of the proposed method. The results show that the proposed method yields reasonable partition of the data. As an illustration, the proposed method is applied to a real microarray data set to cluster genes.  相似文献   

14.
15.

Background  

Microarray technology has become popular for gene expression profiling, and many analysis tools have been developed for data interpretation. Most of these tools require complete data, but measurement values are often missing A way to overcome the problem of incomplete data is to impute the missing data before analysis. Many imputation methods have been suggested, some na?ve and other more sophisticated taking into account correlation in data. However, these methods are binary in the sense that each spot is considered either missing or present. Hence, they are depending on a cutoff separating poor spots from good spots. We suggest a different approach in which a continuous spot quality weight is built into the imputation methods, allowing for smooth imputations of all spots to larger or lesser degree.  相似文献   

16.
In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with that of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.  相似文献   

17.
MOTIVATION: Unsupervised analysis of microarray gene expression data attempts to find biologically significant patterns within a given collection of expression measurements. For example, hierarchical clustering can be applied to expression profiles of genes across multiple experiments, identifying groups of genes that share similar expression profiles. Previous work using the support vector machine supervised learning algorithm with microarray data suggests that higher-order features, such as pairwise and tertiary correlations across multiple experiments, may provide significant benefit in learning to recognize classes of co-expressed genes. RESULTS: We describe a generalization of the hierarchical clustering algorithm that efficiently incorporates these higher-order features by using a kernel function to map the data into a high-dimensional feature space. We then evaluate the utility of the kernel hierarchical clustering algorithm using both internal and external validation. The experiments demonstrate that the kernel representation itself is insufficient to provide improved clustering performance. We conclude that mapping gene expression data into a high-dimensional feature space is only a good idea when combined with a learning algorithm, such as the support vector machine that does not suffer from the curse of dimensionality. AVAILABILITY: Supplementary data at www.cs.columbia.edu/compbio/hiclust. Software source code available by request.  相似文献   

18.

Background  

Microarray technology has made it possible to simultaneously measure the expression levels of large numbers of genes in a short time. Gene expression data is information rich; however, extensive data mining is required to identify the patterns that characterize the underlying mechanisms of action. Clustering is an important tool for finding groups of genes with similar expression patterns in microarray data analysis. However, hard clustering methods, which assign each gene exactly to one cluster, are poorly suited to the analysis of microarray datasets because in such datasets the clusters of genes frequently overlap.  相似文献   

19.
Kauermann G  Eilers P 《Biometrics》2004,60(2):376-387
An important goal of microarray studies is the detection of genes that show significant changes in expression when two classes of biological samples are being compared. We present an ANOVA-style mixed model with parameters for array normalization, overall level of gene expression, and change of expression between the classes. For the latter we assume a mixing distribution with a probability mass concentrated at zero, representing genes with no changes, and a normal distribution representing the level of change for the other genes. We estimate the parameters by optimizing the marginal likelihood. To make this practical, Laplace approximations and a backfitting algorithm are used. The performance of the model is studied by simulation and by application to publicly available data sets.  相似文献   

20.
Multi-class clustering and prediction in the analysis of microarray data   总被引:1,自引:0,他引:1  
DNA microarray technology provides tools for studying the expression profiles of a large number of distinct genes simultaneously. This technology has been applied to sample clustering and sample prediction. Because of a large number of genes measured, many of the genes in the original data set are irrelevant to the analysis. Selection of discriminatory genes is critical to the accuracy of clustering and prediction. This paper considers statistical significance testing approach to selecting discriminatory gene sets for multi-class clustering and prediction of experimental samples. A toxicogenomic data set with nine treatments (a control and eight metals, As, Cd, Ni, Cr, Sb, Pb, Cu, and AsV with a total of 55 samples) is used to illustrate a general framework of the approach. Among four selected gene sets, a gene set omega(I) formed by the intersection of the F-test and the set of the union of one-versus-all t-tests performs the best in terms of clustering as well as prediction. Hierarchical and two modified partition (k-means) methods all show that the set omega(I) is able to group the 55 samples into seven clusters reasonably well, in which the As and AsV samples are considered as one cluster (the same group) as are the Cd and Cu samples. With respect to prediction, the overall accuracy for the gene set omega(I) using the nearest neighbors algorithm to predict 55 samples into one of the nine treatments is 85%.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号