Similar articles
 20 similar articles found
1.
MOTIVATION: The biological significance of results obtained through cluster analyses of gene expression data generated in microarray experiments has been demonstrated in many studies. In this article we focus on the development of a clustering procedure based on the concept of Bayesian model-averaging and a precise statistical model of expression data. RESULTS: We developed a clustering procedure based on the Bayesian infinite mixture model and applied it to clustering gene expression profiles. Clusters of genes with similar expression patterns are identified from the posterior distribution of clusterings defined implicitly by the stochastic data-generation model. The posterior distribution of clusterings is estimated by a Gibbs sampler. We summarized the posterior distribution of clusterings by calculating posterior pairwise probabilities of co-expression and used the complete linkage principle to create clusters. This approach has several advantages over usual clustering procedures. The analysis allows for incorporation of a reasonable probabilistic model for generating data. The method does not require specifying the number of clusters, and the resulting optimal clustering is obtained by averaging over models with all possible numbers of clusters. Expression profiles that are not similar to any other profile are automatically detected, the method incorporates experimental replicates, and it can be extended to accommodate missing data. This approach represents a qualitative shift in the model-based cluster analysis of expression data because it allows for incorporation of uncertainties involved in the model selection in the final assessment of confidence in similarities of expression profiles. We also demonstrated the importance of incorporating the information on experimental variability into the clustering model. 
AVAILABILITY: The MS Windows(TM) based program implementing the Gibbs sampler and supplemental material is available at http://homepages.uc.edu/~medvedm/BioinformaticsSupplement.htm CONTACT: medvedm@email.uc.edu
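The summarization step described in this abstract (posterior pairwise probabilities of co-expression) can be illustrated with a minimal sketch. It assumes the Gibbs sampler output is available as a list of cluster-label vectors, one per sampled clustering; the function name `coclustering_probabilities` and the toy `samples` are hypothetical and are not taken from the authors' program.

```python
from itertools import combinations


def coclustering_probabilities(samples):
    """Estimate, for every gene pair, the fraction of sampled clusterings
    in which the two genes carry the same cluster label."""
    n_genes = len(samples[0])
    counts = {pair: 0 for pair in combinations(range(n_genes), 2)}
    for labels in samples:
        for i, j in counts:
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return {pair: c / len(samples) for pair, c in counts.items()}


# Hypothetical sampler output: each row is one sampled clustering of 4 genes.
# Label values are arbitrary; only equality within a row matters.
samples = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 2, 2],
    [0, 0, 1, 2],
]
probs = coclustering_probabilities(samples)
```

One plausible next step, stated here as an assumption, is to treat `1 - probs[(i, j)]` as a distance and pass it to a complete-linkage hierarchical clustering routine.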

2.
MOTIVATION: Microarray experiments have revolutionized the study of gene expression with their ability to generate large amounts of data. This article describes an alternative to existing approaches to clustering of gene expression profiles; the key idea is to cluster in stages using a hierarchy of distance measures. This method is motivated by the way in which the human mind sorts, and so groups, many items. The distance measures arise from the orthogonal breakup of Euclidean distance, giving us a set of independent measures of different attributes of the gene expression profile. Interpretation of these distances is closely related to the statistical design of the microarray experiment. This clustering method not only accommodates missing data but also leads to an associated imputation method. RESULTS: The performance of the clustering and imputation methods was tested on a simulated dataset, a yeast cell cycle dataset and a central nervous system development dataset. Based on the Rand and adjusted Rand indices, the clustering method is more consistent with the biological classification of the data than commonly used clustering methods. The imputation method, at varying levels of missingness, outperforms most imputation methods, based on root mean squared error (RMSE). AVAILABILITY: Code in R is available on request from the authors.
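The adjusted Rand index used for evaluation in this abstract compares a clustering against a reference classification while correcting for chance agreement. The sketch below implements the standard pair-counting formula; the function name and the toy label vectors are illustrative assumptions and are not the authors' R code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    n = len(labels_a)
    # Pairs of items placed together in both labelings.
    together = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together in each labeling separately.
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    maximum = (pairs_a + pairs_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put everything in one cluster.
        return 1.0
    return (together - expected) / (maximum - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed_partition = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Identical partitions score 1 regardless of how the clusters are named, and values near or below 0 indicate agreement no better than chance.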

3.
MOTIVATION: Identifying groups of co-regulated genes by monitoring their expression over various experimental conditions is complicated by the fact that such co-regulation is condition-specific. Ignoring the context-specific nature of co-regulation significantly reduces the ability of clustering procedures to detect co-expressed genes due to additional 'noise' introduced by non-informative measurements. RESULTS: We have developed a novel Bayesian hierarchical model and corresponding computational algorithms for clustering gene expression profiles across diverse experimental conditions and studies that accounts for context-specificity of gene expression patterns. The model is based on the Bayesian infinite mixtures framework and does not require a priori specification of the number of clusters. We demonstrate that explicit modeling of context-specificity results in increased accuracy of the cluster analysis by examining the specificity and sensitivity of clusters in microarray data. We also demonstrate that probabilities of co-expression derived from the posterior distribution of clusterings are valid estimates of statistical significance of created clusters. AVAILABILITY: The open-source package gimm is available at http://eh3.uc.edu/gimm.

4.
We consider model-based clustering of data that lie on a unit sphere. Such data arise in the analysis of microarray experiments when the gene expressions are standardized so that they have mean 0 and variance 1 across the arrays. We propose to model the clusters on the sphere with inverse stereographic projections of multivariate normal distributions. The corresponding model-based clustering algorithm is described. This algorithm is applied first to simulated data sets to assess the performance of several criteria for determining the number of clusters and to compare its performance with existing methods and second to a real reference data set of standardized gene expression profiles.
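Why standardized profiles lie on a sphere can be seen in a small sketch. A profile of length n that is standardized to mean 0 and sample variance 1 has squared norm n - 1, so all such profiles sit at the same distance from the origin; rescaling by that constant gives a unit vector. The code below centers and normalizes directly, which yields the same direction. The function name and the example profile are hypothetical.

```python
from math import sqrt


def project_to_unit_sphere(profile):
    """Center an expression profile and scale it to unit Euclidean norm."""
    n = len(profile)
    mean = sum(profile) / n
    centered = [x - mean for x in profile]
    norm = sqrt(sum(c * c for c in centered))
    if norm == 0:
        raise ValueError("a constant profile has no direction on the sphere")
    return [c / norm for c in centered]


# Hypothetical expression values for one gene across four arrays.
z = project_to_unit_sphere([2.0, 4.0, 6.0, 8.0])
```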

5.
MOTIVATION: The increasing use of microarray technologies is generating large amounts of data that must be processed in order to extract useful and rational fundamental patterns of gene expression. Hierarchical clustering is one method used to analyze gene expression data, but traditional hierarchical clustering algorithms suffer from several drawbacks (e.g. fixed topology structure; mis-clustered data which cannot be reevaluated). In this paper, we introduce a new hierarchical clustering algorithm that overcomes some of these drawbacks. RESULT: We propose a new tree-structure self-organizing neural network, called the dynamically growing self-organizing tree (DGSOT) algorithm, for hierarchical clustering. The DGSOT constructs a hierarchy from top to bottom by division. At each hierarchical level, the DGSOT optimizes the number of clusters, from which the proper hierarchical structure of the underlying dataset can be found. In addition, we propose a new cluster validation criterion based on the geometric property of the Voronoi partition of the dataset in order to find the proper number of clusters at each hierarchical level. This criterion uses the Minimum Spanning Tree (MST) concept of graph theory and is computationally inexpensive for large datasets. A K-level up distribution (KLD) mechanism, which increases the scope of data distribution in the hierarchy construction, was used to improve the clustering accuracy. The KLD mechanism allows the data misclustered in the early stages to be reevaluated at a later stage and increases the accuracy of the final clustering result. The clustering result of the DGSOT is easily displayed as a dendrogram for visualization. Based on a yeast cell cycle microarray expression dataset, we found that our algorithm extracts gene expression patterns at different levels. Furthermore, the enrichment of biological function in the clusters is high and the hierarchical structure of the clusters is more reasonable. 
AVAILABILITY: DGSOT is available upon request from the authors.

6.
Tumor-specific gene expression patterns with gene expression profiles   Cited by: 1 (self-citations: 0, citations by others: 1)
Gene expression profiles of 14 common tumors and their counterpart normal tissues were analyzed with machine learning methods to address the problem of selection of tumor-specific genes and analysis of their differential expressions in tumor tissues. First, a variation of the Relief algorithm, the "RFE_Relief algorithm", was proposed to learn the relations between genes and tissue types. Then, a support vector machine was employed to find the gene subset with the best classification performance for distinguishing cancerous tissues and their counterparts. After tissue-specific genes were removed, cross validation experiments were employed to demonstrate the common deregulated expressions of the selected genes in tumor tissues. The results indicate the existence of a specific expression fingerprint of these genes that is shared in different tumor tissues, and the hallmarks of the expression patterns of these genes in cancerous tissues are summarized at the end of this paper.

7.
Gene expression profiles of 14 common tumors and their counterpart normal tissues were analyzed with machine learning methods to address the problem of selection of tumor-specific genes and analysis of their differential expressions in tumor tissues. First, a variation of the Relief algorithm, the “RFE_Relief algorithm”, was proposed to learn the relations between genes and tissue types. Then, a support vector machine was employed to find the gene subset with the best classification performance for distinguishing cancerous tissues and their counterparts. After tissue-specific genes were removed, cross validation experiments were employed to demonstrate the common deregulated expressions of the selected genes in tumor tissues. The results indicate the existence of a specific expression fingerprint of these genes that is shared in different tumor tissues, and the hallmarks of the expression patterns of these genes in cancerous tissues are summarized at the end of this paper.

8.
Gene-Ontology-based clustering of gene expression data   Cited by: 2 (self-citations: 0, citations by others: 2)
The expected correlation between genetic co-regulation and affiliation to a common biological process does not necessarily hold when numerical cluster algorithms are applied to gene expression data. GO-Cluster uses the tree structure of the Gene Ontology database as a framework for numerical clustering, thus allowing a simple visualization of gene expression data at various levels of the ontology tree. AVAILABILITY: The 32-bit Windows application is freely available at http://www.mpibpc.mpg.de/go-cluster/

9.
Current clustering methods are routinely applied to gene expression time course data to find genes with similar activation patterns and ultimately to understand the dynamics of biological processes. As the dynamic unfolding of a biological process often involves the activation of genes at different rates, successful clustering in this context requires dealing with varying time and shape patterns simultaneously. This motivates the combination of a novel pairwise warping with a suitable clustering method to discover expression shape clusters. We develop a novel clustering method that combines an initial pairwise curve alignment to adjust for time variation within likely clusters. The cluster-specific time synchronization method shows excellent performance over standard clustering methods in terms of cluster quality measures in simulations and for yeast and human fibroblast data sets. In the yeast example, the discovered clusters have high concordance with the known biological processes.

10.
It has been well established that gene expression data contain large amounts of random variation that affects both the analysis and the results of microarray experiments. Typically, microarray data are either tested for differential expression between conditions or grouped on the basis of profiles that are assessed temporally or across genetic or environmental conditions. While testing differential expression relies on levels of certainty to evaluate the relative worth of various analyses, cluster analysis is exploratory in nature and has not had the benefit of any judgment of statistical inference. By using a novel dissimilarity function to ascertain gene expression clusters and conditional randomization of the data space to illuminate distinctions between statistically significant clusters of gene expression patterns, we aim to provide a level of confidence to inferred clusters of gene expression data. We apply both permutation and convex hull approaches for randomization of the data space and show that both methods can provide an effective assessment of gene expression profiles whose coregulation is statistically different from that expected by random chance alone.

11.
12.
13.
Validating clustering for gene expression data   Cited by: 24 (self-citations: 0, citations by others: 24)
MOTIVATION: Many clustering algorithms have been proposed for the analysis of gene expression data, but little guidance is available to help choose among them. We provide a systematic framework for assessing the results of clustering algorithms. Clustering algorithms attempt to partition the genes into groups exhibiting similar patterns of variation in expression level. Our methodology is to apply a clustering algorithm to the data from all but one experimental condition. The remaining condition is used to assess the predictive power of the resulting clusters: meaningful clusters should exhibit less variation in the remaining condition than clusters formed by chance. RESULTS: We successfully applied our methodology to compare six clustering algorithms on four gene expression data sets. We found our quantitative measures of cluster quality to be positively correlated with external standards of cluster quality.
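The leave-one-condition-out idea in this abstract can be sketched as follows, assuming a clustering has already been computed from all other conditions. The score is the root mean squared deviation of the held-out expression values from their cluster means; the function name `held_out_rmsd`, the toy values, and the choice of this exact score are illustrative assumptions, not necessarily the measure defined in the paper.

```python
from math import sqrt


def held_out_rmsd(held_out_values, labels):
    """Root mean squared deviation of held-out expression values
    from the mean of the cluster each gene was assigned to."""
    clusters = {}
    for value, label in zip(held_out_values, labels):
        clusters.setdefault(label, []).append(value)
    squared_error = 0.0
    for values in clusters.values():
        mean = sum(values) / len(values)
        squared_error += sum((v - mean) ** 2 for v in values)
    return sqrt(squared_error / len(held_out_values))


# Hypothetical expression of four genes in the condition left out of clustering.
held_out = [1.0, 1.2, 3.0, 3.2]
informative = held_out_rmsd(held_out, [0, 0, 1, 1])  # clusters track the data
arbitrary = held_out_rmsd(held_out, [0, 1, 0, 1])    # clusters formed by chance
```

A lower score for the informative labeling than for the arbitrary one matches the abstract's criterion that meaningful clusters vary less in the remaining condition.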

14.
The Purkinje cell degeneration (PCD) mutant mouse is characterized by a degeneration of cerebellar Purkinje cells and progressive ataxia. To identify the molecular mechanisms that lead to the death of Purkinje neurons in PCD mice, we used Affymetrix microarray technology to compare cerebellar gene expression profiles in pcd3J mutant mice 14 days of age (prior to Purkinje cell loss) to unaffected littermates. Microarray analysis, Ingenuity Pathway Analysis (IPA) and expression analysis systematic explorer (EASE) software were used to identify biological and molecular pathways implicated in the progression of Purkinje cell degeneration. IPA analysis indicated that mutant pcd3J mice showed dysregulation of specific processes that may lead to Purkinje cell death, including several molecules known to control neuronal apoptosis such as Bad, CDK5 and PTEN. These findings demonstrate the usefulness of these powerful microarray analysis tools and have important implications for understanding the mechanisms of selective neuronal death and for developing therapeutic strategies to treat neurodegenerative disorders.

15.
The accumulation of DNA microarray data has now made it possible to use gene expression profiles to analyse expression data. A gene expression profile contains the expression data for a given gene over various samples, and can be contrasted with an expression signature, which contains the expression data for a single sample. Gene expression profiles are most revealing when samples are grouped appropriately, either by standard clinical or pathological categories or by categories discovered through cluster analysis techniques. Expression profiles can exist at various levels of abstraction, yielding information across various tissues or across diseases within a particular tissue. Hypothesis tests may be applied to expression profiles on a large scale to identify candidate genes of interest.

16.
MOTIVATION: Cluster analysis of genome-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and samples. In the present paper, we focus on several important issues related to clustering algorithms that have not yet been fully studied. RESULTS: We describe a simple and robust algorithm for the clustering of temporal gene expression profiles that is based on the simulated annealing procedure. In general, this algorithm is guaranteed to eventually find the globally optimal distribution of genes over clusters. We introduce an iterative scheme that serves to evaluate quantitatively the optimal number of clusters for each specific data set. The scheme is based on standard approaches used in regular statistical tests. The basic idea is to organize the search for the optimal number of clusters simultaneously with the optimization of the distribution of genes over clusters. The efficiency of the proposed algorithm has been evaluated by means of a reverse engineering experiment, that is, a situation in which the correct distribution of genes over clusters is known a priori. The employment of this statistically rigorous test has shown that our algorithm places more than 90% of genes into correct clusters. Finally, the algorithm has been tested on real gene expression data (expression changes during yeast cell cycle) for which the fundamental patterns of gene expression and the assignment of genes to clusters are well understood from numerous previous studies.

17.
18.
Recent advances in high throughput technologies have generated an abundance of biological information, such as gene expression, protein-protein interaction, and metabolic data. These various types of data capture different aspects of the cellular response to environmental factors. Integrating data from different measurements enhances the ability of modeling frameworks to predict cellular function more accurately and can lead to a more coherent reconstruction of the underlying regulatory network structure. Different techniques, newly developed and borrowed, have been applied for the purpose of extracting this information from experimental data. In this study, we developed a framework to integrate metabolic and gene expression profiles for a hepatocellular system. Specifically, we applied a genetic algorithm and partial least squares analysis to identify important genes relevant to a specific cellular function. We identified genes 1) whose expression levels quantitatively predict a metabolic function and 2) that play a part in regulating a hepatocellular function and reconstructed their role in the metabolic network. The framework 1) preprocesses the gene expression data using statistical techniques, 2) selects genes using a genetic algorithm and couples them to a partial least squares analysis to predict cellular function, and 3) reconstructs, with the assistance of a literature search, the pathways that regulate cellular function, namely intracellular triglyceride and urea synthesis. This provides a framework for identifying cellular pathways that are active as a function of the environment and in turn helps to uncover the interplay between gene and metabolic networks.

19.
Clustering is an important tool in microarray data analysis. This unsupervised learning technique is commonly used to reveal structures hidden in large gene expression data sets. The vast majority of clustering algorithms applied so far produce hard partitions of the data, i.e. each gene is assigned to exactly one cluster. Hard clustering is favourable if clusters are well separated. However, this is generally not the case for microarray time-course data, where gene clusters frequently overlap. Additionally, hard clustering algorithms are often highly sensitive to noise. To overcome the limitations of hard clustering, we applied soft clustering, which offers several advantages for researchers. First, it generates accessible internal cluster structures, i.e. it indicates how well corresponding clusters represent genes. This can be used for a more targeted search for regulatory elements. Second, the overall relation between clusters, and thus a global clustering structure, can be defined. Additionally, soft clustering is more robust to noise and a priori pre-filtering of genes can be avoided. This prevents the exclusion of biologically relevant genes from the data analysis. Soft clustering was implemented here using the fuzzy c-means algorithm. Procedures to find optimal clustering parameters were developed. A software package for soft clustering has been developed based on the open-source statistical language R. The package called Mfuzz is freely available.
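The difference between hard and soft assignment can be made concrete with the standard fuzzy c-means membership formula, in which a gene's membership in cluster i is 1 / sum over k of (d_i / d_k)^(2 / (m - 1)), where d_k is its distance to center k and m > 1 is the fuzzifier. The sketch below is a plain-Python illustration under those assumptions; it is not the Mfuzz API, and the function name and toy coordinates are hypothetical.

```python
from math import dist


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy c-means membership of one point in each cluster (m > 1)."""
    distances = [dist(point, c) for c in centers]
    if 0.0 in distances:
        # A point lying exactly on one or more centers belongs only to those.
        hits = [1.0 if d == 0.0 else 0.0 for d in distances]
        return [h / sum(hits) for h in hits]
    exponent = 2.0 / (m - 1.0)
    return [
        1.0 / sum((d_i / d_k) ** exponent for d_k in distances)
        for d_i in distances
    ]


# Hypothetical 2-D profile and two cluster centers.
memberships = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
```

Memberships sum to 1 across clusters, so a gene lying between two overlapping clusters receives graded values instead of a forced single assignment.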

20.