Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
Assessing reliability of gene clusters from gene expression data   Total citations: 5, self-citations: 0, citations by others: 5
The rapid development of microarray technologies has raised many challenging problems in experiment design and data analysis. Although many numerical algorithms have been successfully applied to analyze gene expression data, the effects of variations and uncertainties in measured gene expression levels across samples and experiments have been largely ignored in the literature. In this article, in the context of hierarchical clustering algorithms, we introduce a statistical resampling method to assess the reliability of gene clusters identified from any hierarchical clustering method. Using the clustering trees constructed from the resampled data, we can evaluate the confidence value for each node in the observed clustering tree. A majority-rule consensus tree can be obtained, showing only those clusters that occur in a majority of the resampled trees. We illustrate our proposed methods with applications to two published data sets. Although the methods are discussed in the context of hierarchical clustering methods, they can be applied with other cluster-identification methods for gene expression data to assess the reliability of any gene cluster of interest. Electronic Publication
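The resampling idea in this abstract can be sketched as follows. This is a minimal illustration, not the authors' procedure: the abstract does not say how the data are resampled, so the sketch assumes a bootstrap over samples (columns), Euclidean distance and average linkage. The names `tree_clusters` and `node_confidence` and the toy matrix are hypothetical.

```python
import random


def euclid(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def tree_clusters(rows):
    """Average-linkage agglomerative clustering.

    Returns every internal node of the tree as a frozenset of row indices.
    """
    n = len(rows)
    d = {(i, j): euclid(rows[i], rows[j]) for i in range(n) for j in range(i + 1, n)}

    def linkage(a, b):
        return sum(d[min(i, j), max(i, j)] for i in a for j in b) / (len(a) * len(b))

    active = [frozenset([i]) for i in range(n)]
    nodes = set()
    while len(active) > 1:
        # Merge the closest pair of currently active clusters.
        a, b = min(
            ((p, q) for k, p in enumerate(active) for q in active[k + 1:]),
            key=lambda pq: linkage(*pq),
        )
        active.remove(a)
        active.remove(b)
        active.append(a | b)
        nodes.add(a | b)
    return nodes


def node_confidence(rows, n_boot=200, seed=0):
    """Fraction of bootstrap trees that contain each node of the observed tree."""
    rng = random.Random(seed)
    observed = tree_clusters(rows)
    n_samples = len(rows[0])
    counts = dict.fromkeys(observed, 0)
    for _ in range(n_boot):
        # Assumption: resample the samples (columns) with replacement.
        cols = [rng.randrange(n_samples) for _ in range(n_samples)]
        boot = [[r[c] for c in cols] for r in rows]
        for node in tree_clusters(boot) & observed:
            counts[node] += 1
    return {node: c / n_boot for node, c in counts.items()}


# Toy data: rows are genes, columns are samples (made-up values).
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [9.0, 7.0, 8.0, 6.0],
    [9.2, 7.1, 8.1, 5.9],
]
confidence = node_confidence(genes, n_boot=50, seed=1)
# Majority-rule consensus: keep nodes found in more than half of the resampled trees.
consensus = [sorted(node) for node, c in confidence.items() if c > 0.5]
```

Nodes with confidence above 0.5 form the majority-rule consensus set. With another cluster-identification method, only `tree_clusters` would need to be swapped out.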

2.
Although many numerical clustering algorithms have been applied to gene expression data analysis, the essential step is still biological interpretation by manual inspection. The correlation between genetic co-regulation and affiliation to a common biological process is what biologists expect. Here, we introduce some clustering algorithms that are based on a graph structure constituted by biological knowledge. After applying them to a widely used dataset, we compared the resulting clusters of two of these algorithms in terms of the homogeneity of clusters, coherence of annotation, and matching ratio. The results show that the clusters of knowledge-guided analysis are the kernel parts of the clusters of Gene Ontology (GO)-Cluster software, which contain the genes that are most correlated in expression and most consistent with biological functions. Moreover, knowledge-guided analysis seems much more applicable than GO-Cluster to a larger dataset.

3.
Current clustering methods are routinely applied to gene expression time course data to find genes with similar activation patterns and ultimately to understand the dynamics of biological processes. As the dynamic unfolding of a biological process often involves the activation of genes at different rates, successful clustering in this context requires dealing with varying time and shape patterns simultaneously. This motivates the combination of a novel pairwise warping with a suitable clustering method to discover expression shape clusters. We develop a novel clustering method that combines an initial pairwise curve alignment to adjust for time variation within likely clusters. The cluster-specific time synchronization method shows excellent performance over standard clustering methods in terms of cluster quality measures in simulations and for yeast and human fibroblast data sets. In the yeast example, the discovered clusters have high concordance with the known biological processes.

4.
With the advent of the microarray technology, the field of life science has been greatly revolutionized, since this technique allows the simultaneous monitoring of the expression levels of thousands of genes in a particular organism. However, the statistical analysis of expression data has its own challenges, primarily because of the huge amount of data that is to be dealt with, and also because of the presence of noise, which is almost an inherent characteristic of microarray data. Clustering is one tool used to mine meaningful patterns from microarray data. In this paper, we present a novel method of clustering yeast microarray data, which is robust and yet simple to implement. It identifies the best clusters from a given dataset on the basis of the population of the clusters as well as the variance of the feature values of the members from the cluster-center. It has been found to yield satisfactory results even in the presence of noisy data.

5.
A hybrid GA (genetic algorithm)-based clustering (HGACLUS) schema, combining merits of the Simulated Annealing, was described for finding an optimal or near-optimal set of medoids. This schema maximized the clustering success by achieving internal cluster cohesion and external cluster isolation. The performance

6.
Statistical inference for simultaneous clustering of gene expression data   Total citations: 1, self-citations: 0, citations by others: 1
Current methods for analysis of gene expression data are mostly based on clustering and classification of either genes or samples. We offer support for the idea that more complex patterns can be identified in the data if genes and samples are considered simultaneously. We formalize the approach and propose a statistical framework for two-way clustering. A simultaneous clustering parameter is defined as a function theta=Phi(P) of the true data generating distribution P, and an estimate is obtained by applying this function to the empirical distribution P(n). We illustrate that a wide range of clustering procedures, including generalized hierarchical methods, can be defined as parameters which are compositions of individual mappings for clustering patients and genes. This framework allows one to assess classical properties of clustering methods, such as consistency, and to formally study statistical inference regarding the clustering parameter. We present results of simulations designed to assess the asymptotic validity of different bootstrap methods for estimating the distribution of Phi(P(n)). The method is illustrated on a publicly available data set.

7.
Dynamic models of gene expression and classification   Total citations: 3, self-citations: 0, citations by others: 3
Powerful new methods, like expression profiles using cDNA arrays, have been used to monitor changes in gene expression levels as a result of a variety of metabolic, xenobiotic or pathogenic challenges. This potentially vast quantity of data enables, in principle, the dissection of the complex genetic networks that control the patterns and rhythms of gene expression in the cell. Here we present a general approach to developing dynamic models for analyzing time series of whole genome expression. In this approach, a self-consistent calculation is performed that involves both linear and non-linear response terms for interrelating gene expression levels. This calculation uses singular value decomposition (SVD) not as a statistical tool but as a means of inverting noisy and near-singular matrices. The linear transition matrix that is determined from this calculation can be used to calculate the underlying network reflected in the data. This suggests a direct method of classifying genes according to their place in the resulting network. In addition to providing a means to model such a large multivariate system, this approach can be used to reduce the dimensionality of the problem in a rational and consistent way, and suppress the strong noise amplification effects often encountered with expression profile data. Non-linear and higher-order Markov behavior of the network are also determined in this self-consistent method. In data sets from yeast, we calculate the Markov matrix and the gene classes based on the linear-Markov network. These results compare favorably with previously used methods like cluster analysis. Our dynamic method appears to give a broad and general framework for data analysis and modeling of gene expression arrays. Electronic Publication
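The abstract does not give its equations, so the following is only a generic sketch of how a truncated SVD can stand in for the inverse of a noisy, near-singular matrix when estimating a linear transition matrix. The symbols x(t), X_0, X_1, M and the cutoff k are assumptions introduced here, not notation taken from the paper, and the non-linear terms the abstract mentions are left out.

```latex
% Linear (first-order Markov) model for the expression vector x(t)
x(t+1) \approx M\,x(t), \qquad
X_0 = [\,x(1)\ \cdots\ x(T-1)\,], \quad
X_1 = [\,x(2)\ \cdots\ x(T)\,]

% Invert X_0 through its SVD, discarding the smallest singular values
X_0 = U \Sigma V^{\top}, \qquad
M \approx X_1\, V\, \Sigma_k^{+}\, U^{\top}, \qquad
\Sigma_k^{+} = \operatorname{diag}\!\left(\sigma_1^{-1}, \dots, \sigma_k^{-1}, 0, \dots, 0\right)
```

Setting the reciprocals of the small singular values to zero keeps measurement noise from being amplified by factors of 1/σ, which is one way to read the "noise amplification" remark in the abstract.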

8.
Gene expression studies generate large quantities of data with the defining characteristic that the number of genes (whose expression profiles are to be determined) exceeds the number of available replicates by several orders of magnitude. Standard spot-by-spot analysis still seeks to extract useful information for each gene on the basis of the number of available replicates, and thus plays to the weakness of microarrays. On the other hand, because of the data volume, treating the entire data set as an ensemble, and developing theoretical distributions for these ensembles, provides a framework that plays instead to the strength of microarrays. We present theoretical results that under reasonable assumptions, the distribution of microarray intensities follows the Gamma model, with the biological interpretations of the model parameters emerging naturally. We subsequently establish that for each microarray data set, the fractional intensities can be represented as a mixture of Beta densities, and develop a procedure for using these results to draw statistical inference regarding differential gene expression. We illustrate the results with experimental data from gene expression studies on Deinococcus radiodurans following DNA damage using cDNA microarrays.
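The link between the Gamma and Beta models can be checked numerically. The sketch below relies on a standard fact: if X and Y are independent Gamma variables with shapes a and b and a common scale, then X / (X + Y) follows a Beta(a, b) distribution. It assumes that "fractional intensity" means one channel's share of the two-channel total, which the abstract does not state; the function name and parameter values are hypothetical.

```python
import random


def fractional_intensities(shape_a, shape_b, scale, n, seed=0):
    """Simulate n fractional intensities x / (x + y) from two Gamma channels."""
    rng = random.Random(seed)
    fractions = []
    for _ in range(n):
        x = rng.gammavariate(shape_a, scale)  # channel 1 intensity
        y = rng.gammavariate(shape_b, scale)  # channel 2 intensity
        fractions.append(x / (x + y))
    return fractions


a, b = 2.0, 3.0
sample = fractional_intensities(a, b, scale=100.0, n=20000)
empirical_mean = sum(sample) / len(sample)
beta_mean = a / (a + b)  # mean of a Beta(a, b) distribution
print(f"empirical mean {empirical_mean:.3f} vs Beta mean {beta_mean:.3f}")
```

A single Beta component corresponds to one pair of shape parameters; the mixture described in the abstract would arise when genes fall into groups with different parameters.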

9.
10.
Computational analysis methods for gene expression data gathered in microarray experiments can be used to identify the functions of previously unstudied genes. While obtaining the expression data is not a difficult task, interpreting and extracting the information from the datasets is challenging. In this study, a knowledge-based approach which identifies and saves important functional genes before filtering based on variability and fold change differences was utilized to study light regulation. Two clustering methods were used to cluster the filtered datasets, and clusters containing a key light regulatory gene were located. The genes common to both of these clusters were identified, and the genes in the common cluster were ranked based on their coexpression with the key gene. This process was repeated for 11 key genes in 3 treatment combinations. The initial filtering method reduced the dataset size from 22,814 probes to an average of 1134 genes, and the resulting common cluster lists contained an average of only 14 genes. These common cluster lists scored higher gene enrichment scores than the two individual clustering methods. In addition, the filtering method increased the proportion of light responsive genes in the dataset from 1.8% to 15.2%, and the cluster lists increased this proportion to 18.4%. The relatively short length of these common cluster lists compared to gene groups generated through typical clustering methods or coexpression networks narrows the search for novel functional genes while increasing the likelihood that they are biologically relevant.
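The "common cluster" step described above can be sketched as a set intersection followed by a ranking. This is an illustration under assumptions: the abstract does not name its coexpression measure, so Pearson correlation is used here, and the gene names, expression values and cluster assignments are made up.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def common_cluster_ranking(expr, key_gene, clusters_a, clusters_b):
    """Genes sharing a cluster with key_gene under both methods, best coexpressed first."""
    in_a = next(c for c in clusters_a if key_gene in c)
    in_b = next(c for c in clusters_b if key_gene in c)
    common = (set(in_a) & set(in_b)) - {key_gene}
    return sorted(common, key=lambda g: pearson(expr[g], expr[key_gene]), reverse=True)


# Hypothetical filtered expression profiles (gene -> values across treatments).
expr = {
    "KEY": [1.0, 2.0, 3.0, 4.0],
    "g1": [2.0, 4.0, 6.0, 8.0],
    "g2": [1.0, 3.0, 2.0, 5.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 2.0, 2.0, 4.0],
}
# Hypothetical outputs of two different clustering methods.
method_a = [{"KEY", "g1", "g2", "g4"}, {"g3"}]
method_b = [{"KEY", "g1", "g2", "g3"}, {"g4"}]
ranked = common_cluster_ranking(expr, "KEY", method_a, method_b)
```

Here `ranked` contains only `g1` and `g2`, the genes that both methods place with the key gene, which mirrors how the intersection shortens the candidate list.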

11.
In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with that of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.
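One way to make the mutual-cluster idea concrete is a direct check of the definition. The sketch below reads "closer to each other than to any other points" as: the largest distance between two members is smaller than the smallest distance from any member to any non-member. That reading, the Euclidean metric and the function name `is_mutual_cluster` are assumptions for illustration.

```python
import math
from itertools import combinations


def is_mutual_cluster(points, members):
    """True if the indexed group (two or more points) is a mutual cluster."""
    members = set(members)
    inside = [points[i] for i in members]
    outside = [p for i, p in enumerate(points) if i not in members]
    # Largest pairwise distance within the group.
    diameter = max(math.dist(p, q) for p, q in combinations(inside, 2))
    if not outside:
        return True  # the whole data set trivially qualifies
    # Smallest distance from any member to any non-member.
    nearest_outside = min(math.dist(p, q) for p in inside for q in outside)
    return diameter < nearest_outside


points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
print(is_mutual_cluster(points, {0, 1, 2}))  # tight group, far from the rest
print(is_mutual_cluster(points, {0, 3}))     # members are far apart
```

This brute-force check is only meant to pin down the definition; the abstract indicates that the paper derives an identification algorithm from the link to bottom-up clustering, which is not reproduced here.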

12.
High-throughput genomic data provide an opportunity for identifying pathways and genes that are related to various clinical phenotypes. Besides these genomic data, another valuable source of data is the biological knowledge about genes and pathways that might be related to the phenotypes of many complex diseases. Databases of such knowledge are often called the metadata. In microarray data analysis, such metadata are currently explored in post hoc ways by gene set enrichment analysis but have hardly been utilized in the modeling step. We propose to develop and evaluate a pathway-based gradient descent boosting procedure for nonparametric pathways-based regression (NPR) analysis to efficiently integrate genomic data and metadata. Such NPR models consider multiple pathways simultaneously and allow complex interactions among genes within the pathways and can be applied to identify pathways and genes that are related to variations of the phenotypes. These methods also provide an alternative way of mitigating the problem of a large number of potential interactions by limiting analysis to biologically plausible interactions between genes in related pathways. Our simulation studies indicate that the proposed boosting procedure can indeed identify relevant pathways. Application to a gene expression data set on breast cancer distant metastasis identified that Wnt, apoptosis, and cell cycle-regulated pathways are more likely related to the risk of distant metastasis among lymph-node-negative breast cancer patients. Results from analysis of two other breast cancer gene expression data sets indicate that the pathways of Metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth, and maintenance are important to breast cancer relapse and survival. We also observed that by incorporating the pathway information, we achieved better prediction for cancer recurrence.

13.
Yuan M, Kendziorski C. Biometrics, 2006, 62(4): 1089-1098
Although both clustering and identification of differentially expressed genes are equally essential in most microarray studies, the two tasks are often conducted without regard to each other. This is clearly not the most efficient way of extracting information. The main aim of this article is to develop a coherent statistical method that can simultaneously cluster and detect differentially expressed genes. Through information sharing between the two tasks, the proposed approach gives more sensible clustering among genes and is more sensitive in identifying differentially expressed genes. The improvement over existing methods is illustrated in both our simulation results and a case study.

14.
It is shown here how gene knock-out experiments can be simulated in Random Boolean Networks (RBN), which are well-known simplified models of genetic networks. The results of the simulations are presented and compared with those of actual experiments in S. cerevisiae. RBN with two incoming links per node have been considered, and the Boolean functions have been chosen at random among the set of so-called canalizing functions. Genes are knocked out (i.e. silenced) one at a time, and the variations in the expression levels of the other genes, with respect to the unperturbed case, are considered. Two important variables are defined: (i) avalanches, which measure the size of the perturbation generated by knocking out a single gene, and (ii) susceptibilities, which measure how often the expression of a given gene is modified in these experiments. A remarkable observation is that the distributions of avalanches and susceptibilities are very robust, i.e. they are very similar in different random networks; this should be contrasted with the distribution of other variables that show a high variance in RBN. Moreover, the distributions of avalanches and susceptibilities of the RBN models are close to those observed in actual experiments performed with S. cerevisiae, where the changes in gene expression levels have been recorded with DNA microarrays. These findings suggest that these distributions might be "generic" properties, common to a wide range of genetic models and real genetic networks. The importance of such generic properties is discussed.
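A knock-out experiment of this kind can be sketched in a few lines. The sketch is a simplification under stated assumptions, not the authors' protocol: updates are synchronous, a knocked-out gene is clamped to 0, a gene counts as affected if its state differs from the unperturbed run at any step, and the canalizing set is taken to be all 16 two-input Boolean functions except XOR and XNOR. All function names are hypothetical.

```python
import random
from itertools import product

# Truth tables indexed by 2*a + b; drop XOR and XNOR, the only two-input
# Boolean functions that are not canalizing.
CANALIZING = [f for f in product((0, 1), repeat=4)
              if f not in {(0, 1, 1, 0), (1, 0, 0, 1)}]


def random_network(n, rng):
    """Each gene gets two distinct random inputs and a random canalizing rule."""
    inputs = [tuple(rng.sample(range(n), 2)) for _ in range(n)]
    rules = [rng.choice(CANALIZING) for _ in range(n)]
    return inputs, rules


def step(state, inputs, rules, knocked_out=None):
    """One synchronous update; a knocked-out gene stays at 0."""
    nxt = [rules[i][2 * state[a] + state[b]] for i, (a, b) in enumerate(inputs)]
    if knocked_out is not None:
        nxt[knocked_out] = 0
    return nxt


def avalanche(inputs, rules, start, gene, steps=50):
    """Number of other genes whose state ever differs after silencing `gene`."""
    wild = list(start)
    mutant = list(start)
    mutant[gene] = 0
    affected = set()
    for _ in range(steps):
        wild = step(wild, inputs, rules)
        mutant = step(mutant, inputs, rules, knocked_out=gene)
        affected |= {i for i, (w, m) in enumerate(zip(wild, mutant)) if w != m}
    affected.discard(gene)
    return len(affected)


rng = random.Random(0)
inputs, rules = random_network(20, rng)
start = [rng.randint(0, 1) for _ in range(20)]
sizes = [avalanche(inputs, rules, start, g) for g in range(20)]
```

The list `sizes` is one network's avalanche distribution. Counting how often each gene appears in the affected sets across knock-outs would give a susceptibility in the sense described in the abstract.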

15.

Background

MicroRNAs (miRNAs) are a class of endogenous small regulatory RNAs. Identifications of the dys-regulated or perturbed miRNAs and their key target genes are important for understanding the regulatory networks associated with the studied cellular processes. Several computational methods have been developed to infer the perturbed miRNA regulatory networks by integrating genome-wide gene expression data and sequence-based miRNA-target predictions. However, most of them only use the expression information of the miRNA direct targets, rarely considering the secondary effects of miRNA perturbation on the global gene regulatory networks.

Results

We proposed a network propagation based method to infer the perturbed miRNAs and their key target genes by integrating gene expressions and global gene regulatory network information. The method used random walk with restart in gene regulatory networks to model the network effects of the miRNA perturbation. Then, it evaluated the significance of the correlation between the network effects of the miRNA perturbation and the gene differential expression levels with a forward searching strategy. Results show that our method outperformed several compared methods in rediscovering the experimentally perturbed miRNAs in cancer cell lines. Then, we applied it on a gene expression dataset of colorectal cancer clinical patient samples and inferred the perturbed miRNA regulatory networks of colorectal cancer, including several known oncogenic or tumor-suppressive miRNAs, such as miR-17, miR-26 and miR-145.

Conclusions

Our network propagation based method takes advantage of the network effect of the miRNA perturbation on its target genes. It is a useful approach to infer the perturbed miRNAs and their key target genes associated with the studied biological processes using gene expression data.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-255) contains supplementary material, which is available to authorized users.
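The random walk with restart used in the Results of this entry can be sketched generically. This is not the authors' implementation: the restart probability, the uniform split over outgoing edges, the handling of nodes without outgoing edges and the toy network are all assumptions made for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Stationary visiting probabilities of a walk that restarts at the seed nodes.

    adj maps every node to the list of nodes it points to.
    """
    nodes = list(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {v: restart * p0[v] for v in nodes}
        for u in nodes:
            targets = adj[u]
            if not targets:
                # Assumption: a node with no outgoing edge sends the walker
                # back to the seeds, so total probability stays at 1.
                for v in nodes:
                    nxt[v] += (1 - restart) * p[u] * p0[v]
                continue
            share = (1 - restart) * p[u] / len(targets)
            for v in targets:
                nxt[v] += share
        if sum(abs(nxt[v] - p[v]) for v in nodes) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical regulatory network: a perturbed miRNA targets genes A and B,
# which both regulate gene C.
network = {"miR": ["A", "B"], "A": ["C"], "B": ["C"], "C": []}
scores = random_walk_with_restart(network, seeds={"miR"})
```

In this toy network gene C receives a nonzero score although it is not a direct target of the miRNA, which illustrates the secondary effects the abstract refers to. The scores could then be compared with differential expression levels.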

16.
We propose a model-based approach to unify clustering and network modeling using time-course gene expression data. Specifically, our approach uses a mixture model to cluster genes. Genes within the same cluster share a similar expression profile. The network is built over cluster-specific expression profiles using state-space models. We discuss the application of our model to simulated data as well as to time-course gene expression data arising from animal models on prostate cancer progression. The latter application shows that with combined statistical/bioinformatics analyses, we are able to extract gene-to-gene relationships supported by the literature as well as new plausible relationships.

17.
Ruan L, Yuan M. Biometrics, 2011, 67(4): 1617-1626
With the prevalence of gene expression studies and the relatively low reproducibility caused by insufficient sample sizes, it is natural to consider joint analysis that could combine data from different experiments effectively to achieve improved accuracy. We present in this article a model-based approach for better identification of differentially expressed genes by incorporating data from different studies. The model can accommodate in a seamless fashion a wide range of studies, including those performed on different platforms, by fitting each data set with a different set of parameters, and/or under different but overlapping biological conditions. Model-based inferences can be done in an empirical Bayes fashion. Because of the information sharing among studies, the joint analysis dramatically improves inferences based on individual analysis. Simulation studies and real data examples are presented to demonstrate the effectiveness of the proposed approach under a variety of complications that often arise in practice.

18.
Gene set analysis methods are popular tools for identifying differentially expressed gene sets in microarray data. Most existing methods use a permutation test to assess significance for each gene set. The permutation test's assumption of exchangeable samples is often not satisfied for time-series data and complex experimental designs, and in addition it requires a certain number of samples to compute p-values accurately. The method presented here uses a rotation test rather than a permutation test to assess significance. The rotation test can compute accurate p-values also for very small sample sizes. The method can handle complex designs and is particularly suited for longitudinal microarray data where the samples may have complex correlation structures. Dependencies between genes, modeled with the use of gene networks, are incorporated in the estimation of correlations between samples. In addition, the method can test for both gene sets that are differentially expressed and gene sets that show strong time trends. We show on simulated longitudinal data that the ability to identify important gene sets may be improved by taking the correlation structure between samples into account. Applied to real data, the method identifies both gene sets with constant expression and gene sets with strong time trends.

19.
20.