Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
In this paper, we propose a hybrid clustering method that combines the strengths of bottom-up hierarchical clustering with those of top-down clustering. The first method is good at identifying small clusters but not large ones; the strengths are reversed for the second method. The hybrid method is built on the new idea of a mutual cluster: a group of points closer to each other than to any other points. Theoretical connections between mutual clusters and bottom-up clustering methods are established, aiding in their interpretation and providing an algorithm for identification of mutual clusters. We illustrate the technique on simulated and real microarray datasets.
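The mutual-cluster idea in the abstract above can be illustrated with a minimal sketch. It assumes one specific reading of "closer to each other than to any other points": the largest distance between two members must be smaller than the smallest distance from any member to any non-member. The function name `is_mutual_cluster` and the sample points are hypothetical, and this brute-force check is not the identification algorithm the paper describes.

```python
from itertools import combinations
from math import dist


def is_mutual_cluster(points, subset):
    """Return True if the points indexed by `subset` form a mutual cluster.

    Assumed reading: the largest within-group distance must be smaller than
    the smallest distance from any member to any non-member.
    """
    inside = set(subset)
    outside = [i for i in range(len(points)) if i not in inside]
    if len(inside) < 2 or not outside:
        return False  # the comparison is undefined for these cases
    # Largest distance between two members of the candidate group.
    diameter = max(dist(points[i], points[j]) for i, j in combinations(inside, 2))
    # Smallest distance from a member to any point outside the group.
    gap = min(dist(points[i], points[k]) for i in inside for k in outside)
    return diameter < gap


# Hypothetical data: three points near the origin, two far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0)]
print(is_mutual_cluster(points, [0, 1, 2]))
print(is_mutual_cluster(points, [0, 1, 3]))
```

With these sample points the first group qualifies and the second does not, because it mixes a distant point with two nearby ones.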

2.
1. Assuming that recruitment variation is one of the main sources of fish population and assemblage changes, it is necessary to understand how natural variations in the environment influence 0+ fish abundance. Temperature regimes play an important role in enhancing both spawning activity and survival during early larval fish development. Flow regime variation, which is a powerful source of stream disturbance, is another factor to be taken into account. 2. Responses to these variables need to be assessed using long‐term datasets, since standard statistical approaches fail to provide a causal structure or to quantify the different effects. We therefore used a 26‐year dataset to evaluate the respective effects of seven derived independent variables describing the effects of temperature and flow regimes on the 0+ juvenile abundance of eight fish species in the River Rhone. 3. A clustering procedure using the Kendall tau rank correlation coefficient was implemented and identified three groups of fish according to their synchronic variations in juvenile abundance; i.e. varying with decreasing juvenile abundance, slightly increasing juvenile abundance and increasing juvenile abundance. These clusters provided the basis for building hierarchical log‐Poisson generalized linear models. The Bayesian paradigm gives a reliable framework for model selection, and the best model was determined using the Bayes Factor. 4. The posterior distribution of the regression parameters was coherent with what was expected based on knowledge of the biology of the different species. It indicates that temperature regime drives 0+ juvenile abundance but that flow regime also plays an important regulating role. The models thus detected evidence of the consequences of specific flow events such as larval drift and an increase in available habitat during higher flow discharges. 5. Our study illustrates the advantages of using a hierarchical modelling approach to quantify ecological effects by improving discrimination between the different sources of uncertainty, leading to better precision when estimating regression parameters.

3.
MOTIVATION: Clustering is one of the most widely used methods in unsupervised gene expression data analysis. The use of different clustering algorithms or different parameters often produces rather different results on the same data. Biological interpretation of multiple clustering results requires understanding how different clusters relate to each other. It is particularly non-trivial to compare the results of a hierarchical and a flat, e.g. k-means, clustering. RESULTS: We present a new method for comparing and visualizing relationships between different clustering results, either flat versus flat, or flat versus hierarchical. When comparing a flat clustering to a hierarchical clustering, the algorithm cuts different branches in the hierarchical tree at different levels to optimize the correspondence between the clusters. The optimization function is based on graph layout aesthetics or on mutual information. The clusters are displayed using a bipartite graph where the edges are weighted proportionally to the number of common elements in the respective clusters and the weighted number of crossings is minimized. The performance of the algorithm is tested using simulated and real gene expression data. The algorithm is implemented in the online gene expression data analysis tool Expression Profiler. AVAILABILITY: http://www.ebi.ac.uk/expressionprofiler
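The edge weights of the bipartite graph described above (the number of elements two clusters have in common) can be sketched for the flat-versus-flat case. This is only an illustration under assumptions: the function name `overlap_weights` and the labels are hypothetical, and neither the tree-cutting step nor the crossing minimization from the abstract is shown.

```python
from collections import Counter


def overlap_weights(labels_a, labels_b):
    """Count shared elements for every pair (cluster in A, cluster in B).

    Element i belongs to cluster labels_a[i] in the first clustering and to
    cluster labels_b[i] in the second. Pairs with no shared element are absent.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same elements")
    return Counter(zip(labels_a, labels_b))


# Hypothetical clusterings of six genes.
labels_a = ["a1", "a1", "a1", "a2", "a2", "a2"]
labels_b = ["b1", "b1", "b2", "b2", "b2", "b2"]

weights = overlap_weights(labels_a, labels_b)
for (cluster_a, cluster_b), shared in sorted(weights.items()):
    print(cluster_a, cluster_b, shared)
```

Each key of `weights` corresponds to one edge of the bipartite graph, and the count is the quantity the abstract says the edge weight is proportional to.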

4.
Finite mixture models can provide insights into behavioral patterns as a source of heterogeneity of the various dynamics of time course gene expression data by reducing the high dimensionality and making clear the major components of the underlying structure of the data in terms of the unobservable latent variables. The latent structure of the dynamic transition process of gene expression changes over time can be represented by Markov processes. This paper addresses key problems in the analysis of large gene expression data sets that describe systemic temporal response cascades and dynamic changes to therapeutic doses in multiple tissues, such as liver, skeletal muscle, and kidney from the same animals. A Bayesian Finite Markov Mixture Model with a Dirichlet Prior is developed for the identification of differentially expressed time-related genes and dynamic clusters. The deviance information criterion is applied to determine the number of components for model comparison and selection. The proposed Bayesian models are applied to multiple tissue polygenetic temporal gene expression data and compared to a Bayesian model‐based clustering method, named CAGED. Results show that our proposed Bayesian Finite Markov Mixture model captures the dynamic changes and patterns of irregular complex temporal data well. (© 2009 WILEY‐VCH Verlag GmbH & Co. KGaA, Weinheim)

5.
ABSTRACT: BACKGROUND: An important question in the analysis of biochemical data is that of identifying subsets of molecular variables that may jointly influence a biological response. Statistical variable selection methods have been widely used for this purpose. In many settings, it may be important to incorporate ancillary biological information concerning the variables of interest. Pathway and network maps are one example of a source of such information. However, although ancillary information is increasingly available, it is not always clear how it should be used nor how it should be weighted in relation to primary data. RESULTS: We put forward an approach in which biological knowledge is incorporated using informative prior distributions over variable subsets, with prior information selected and weighted in an automated, objective manner using an empirical Bayes formulation. We employ continuous, linear models with interaction terms and exploit biochemically-motivated sparsity constraints to permit exact inference. We show an example of priors for pathway- and network-based information and illustrate our proposed method on both synthetic response data and by an application to cancer drug response data. Comparisons are also made to alternative Bayesian and frequentist penalised-likelihood methods for incorporating network-based information. CONCLUSIONS: The empirical Bayes method proposed here can aid prior elicitation for Bayesian variable selection studies and help to guard against mis-specification of priors. Empirical Bayes, together with the proposed pathway-based priors, results in an approach with a competitive variable selection performance. In addition, the overall procedure is fast, deterministic, and has very few user-set parameters, yet is capable of capturing interplay between molecular players. The approach presented is general and readily applicable in any setting with multiple sources of biological prior knowledge.

6.
Cluster analysis of gene-wide expression data from DNA microarray hybridization studies has proved to be a useful tool for identifying biologically relevant groupings of genes and constructing gene regulatory networks. The motivation for considering mutual information is its capacity to measure a general dependence among gene random variables. We propose a novel clustering strategy based on minimizing mutual information among gene clusters. Simulated annealing is employed to solve the optimization problem. Bootstrap techniques are employed to get more accurate estimates of mutual information when the data sample size is small. Moreover, we propose to combine the mutual information criterion and traditional distance criteria such as the Euclidean distance and the fuzzy membership metric in designing the clustering algorithm. The performances of the new clustering methods are compared with those of some existing methods, using both synthesized data and experimental data. It is seen that the clustering algorithm based on a combined metric of mutual information and fuzzy membership achieves the best performance. The supplemental material is available at www.gspsnap.tamu.edu/gspweb/zxb/glioma_zxb.

7.
This paper is concerned with the estimation of the number of species in a population through a fully hierarchical Bayesian model using the Metropolis algorithm. The proposed Bayesian estimator is based on Poisson random variables with means that are distributed according to some prior distributions with unknown hyperparameters. An empirical Bayes approach is considered and compared with the fully Bayesian approach based on biological data.

8.
Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and the hierarchical clustering algorithm outperformed k-means clustering and the self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and the self-organizing map. Whereas BEA-PARTITION and hierarchical clustering produced clusters of similar quality, BEA-PARTITION provides clearer cluster boundaries than hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.
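Two of the cluster-quality measures named above, purity and entropy, can be sketched as follows. The abstract does not give formulas, so the definitions used here are assumptions based on common usage: purity is the fraction of items that belong to the majority class of their cluster, and entropy is the size-weighted average of the Shannon entropy (base 2) of the class distribution inside each cluster. The function name and labels are hypothetical.

```python
from collections import Counter, defaultdict
from math import log2


def purity_and_entropy(cluster_labels, true_labels):
    """Return (purity, entropy) of a clustering against known classes.

    Assumed definitions: purity sums the majority-class count of each
    cluster and divides by the number of items; entropy is the
    size-weighted mean of the per-cluster class entropy in bits.
    """
    members = defaultdict(list)
    for cluster, true_class in zip(cluster_labels, true_labels):
        members[cluster].append(true_class)

    n = len(true_labels)
    purity = 0.0
    entropy = 0.0
    for classes in members.values():
        counts = Counter(classes)
        size = len(classes)
        purity += max(counts.values()) / n
        cluster_entropy = -sum((m / size) * log2(m / size) for m in counts.values())
        entropy += (size / n) * cluster_entropy
    return purity, entropy


# Hypothetical example: six genes, two clusters, two known groups.
clusters = [0, 0, 0, 1, 1, 1]
groups = ["x", "x", "y", "y", "y", "y"]
purity, entropy = purity_and_entropy(clusters, groups)
print(round(purity, 3), round(entropy, 3))
```

Higher purity and lower entropy both indicate clusters that line up better with the known groups, which matches the direction of the comparison reported in the abstract.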

9.
In this work, we introduce an entirely data-driven and automated approach to reveal disease-associated biomarker and risk factor networks from heterogeneous and high-dimensional healthcare data. Our workflow is based on Bayesian networks, which are a popular tool for analyzing the interplay of biomarkers. Usually, data require extensive manual preprocessing and dimension reduction to allow for effective learning of Bayesian networks. For heterogeneous data, this preprocessing is hard to automate and typically requires domain-specific prior knowledge. Here we combine Bayesian network learning with hierarchical variable clustering in order to detect groups of similar features and learn interactions between them in an entirely automated way. We present an optimization algorithm for the adaptive refinement of such group Bayesian networks to account for a specific target variable, like a disease. The combination of Bayesian networks, clustering, and refinement yields low-dimensional but disease-specific interaction networks. These networks provide easily interpretable, yet accurate models of biomarker interdependencies. We test our method extensively on simulated data, as well as on data from the Study of Health in Pomerania (SHIP-TREND), and demonstrate its effectiveness using non-alcoholic fatty liver disease and hypertension as examples. We show that the group network models outperform available biomarker scores, while at the same time, they provide an easily interpretable interaction network.

10.
We present CLIFF, an algorithm for clustering biological samples using gene expression microarray data. This clustering problem is difficult for several reasons, in particular the sparsity of the data, the high dimensionality of the feature (gene) space, and the fact that many features are irrelevant or redundant. Our algorithm iterates between two computational processes, feature filtering and clustering. Given a reference partition that approximates the correct clustering of the samples, our feature filtering procedure ranks the features according to their intrinsic discriminability, relevance to the reference partition, and irredundancy to other relevant features, and uses this ranking to select the features to be used in the following round of clustering. Our clustering algorithm, which is based on the concept of a normalized cut, clusters the samples into a new reference partition on the basis of the selected features. On a well-studied problem involving 72 leukemia samples and 7130 genes, we demonstrate that CLIFF outperforms standard clustering approaches that do not consider the feature selection issue, and produces a result that is very close to the original expert labeling of the sample set.

11.
Jung S, Lee KH, Lee D. Bio Systems, 2007, 90(1): 197-210
The Bayesian network is a popular tool for describing relationships between data entities by representing probabilistic (in)dependencies with a directed acyclic graph (DAG) structure. Relationships have been inferred between biological entities using the Bayesian network model with high-throughput data from biological systems in diverse fields. However, the scalability of those approaches is seriously restricted because of the huge search space for finding an optimal DAG structure in the process of Bayesian network learning. For this reason, most previous approaches limit the number of target entities or use additional knowledge to restrict the search space. In this paper, we use the hierarchical clustering and order restriction (H-CORE) method for the learning of large Bayesian networks by clustering entities and restricting edge directions between those clusters, with the aim of overcoming the scalability problem and thus making it possible to perform genome-scale Bayesian network analysis without additional biological knowledge. We use simulations to show that H-CORE is much faster than the widely used sparse candidate method, whilst being of comparable quality. We have also applied H-CORE to retrieving gene-to-gene relationships in a biological system (the 'Rosetta compendium'). By evaluating learned information through literature mining, we demonstrate that H-CORE enables the genome-scale Bayesian analysis of biological systems without any prior knowledge.

12.
Bayesian hierarchical models have been applied in clinical trials to allow for information sharing across subgroups. Traditional Bayesian hierarchical models do not have subgroup classifications; thus, information is shared across all subgroups. When the difference between subgroups is large, it suggests that the subgroups belong to different clusters. In that case, placing all subgroups in one pool and borrowing information across all subgroups can result in substantial bias for the subgroups with strong borrowing, or a lack of efficiency gain with weak borrowing. To resolve this difficulty, we propose a hierarchical Bayesian classification and information sharing (BaCIS) model for the design of multigroup phase II clinical trials with binary outcomes. We introduce subgroup classification into the hierarchical model. Subgroups are classified into two clusters on the basis of their outcomes mimicking the hypothesis testing framework. Subsequently, information sharing takes place within subgroups in the same cluster, rather than across all subgroups. This method can be applied to the design and analysis of multigroup clinical trials with binary outcomes. Compared to the traditional hierarchical models, better operating characteristics are obtained with the BaCIS model under various scenarios.

13.
MOTIVATION: Bioinformatics clustering tools are useful at all levels of proteomic data analysis. Proteomics studies can provide a wealth of information and rapidly generate large quantities of data from the analysis of biological specimens. The high dimensionality of data generated from these studies requires the development of improved bioinformatics tools for efficient and accurate data analyses. For proteome profiling of a particular system or organism, a number of specialized software tools are needed. Indeed, significant advances in the informatics and software tools necessary to support the analysis and management of these massive amounts of data are needed. Clustering algorithms based on probabilistic and Bayesian models provide an alternative to heuristic algorithms. Determining the number of clusters (diseased and non-diseased groups) is reduced to choosing the number of components of a mixture of underlying probability distributions. The Bayesian approach is a tool for including information from the data in the analysis. It offers an estimation of the uncertainties of the data and the parameters involved. RESULTS: We present novel algorithms that can organize, cluster and derive meaningful patterns of expression from large-scale proteomics experiments. We processed raw data using a graphical-based algorithm by transforming it from a real space data-expression to a complex space data-expression using discrete Fourier transformation; then we used a thresholding approach to denoise and reduce the length of each spectrum. Bayesian clustering was applied to the reconstructed data. In comparison with several other algorithms used in this study, including K-means, the Kohonen self-organizing map (SOM), and linear discriminant analysis, the Bayesian-Fourier model-based approach consistently displayed superior performance in selecting the correct model and the number of clusters, thus providing a novel approach for accurate diagnosis of the disease. Using this approach, we were able to successfully denoise proteomic spectra and reach up to a 99% total reduction of the number of peaks compared to the original data. In addition, the Bayesian-based approach generated a better classification rate in comparison with other classification algorithms. This new finding will allow us to apply the Fourier transformation for the selection of the protein profile for each sample, and to develop a novel bioinformatic strategy based on Bayesian clustering for biomarker discovery and optimal diagnosis.
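The transform-then-threshold step mentioned above can be illustrated with a small sketch. The abstract does not say how the threshold is chosen or how the spectrum is shortened, so everything specific here is an assumption: a naive discrete Fourier transform written with the standard library, a hard magnitude threshold, and a hypothetical eight-point signal. Production code would normally use an FFT routine instead of this O(n^2) transform.

```python
import cmath
import math


def dft(signal):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def inverse_dft(coeffs):
    """Inverse of `dft`, returning complex values."""
    n = len(coeffs)
    return [
        sum(coeffs[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def threshold_denoise(signal, threshold):
    """Zero every Fourier coefficient whose magnitude is below `threshold`,
    then transform back and keep the real part."""
    kept = [c if abs(c) >= threshold else 0 for c in dft(signal)]
    return [value.real for value in inverse_dft(kept)]


# Hypothetical spectrum: one cosine cycle plus a small alternating disturbance.
clean = [math.cos(2 * math.pi * t / 8) for t in range(8)]
noisy = [c + 0.05 * (-1) ** t for t, c in enumerate(clean)]
denoised = threshold_denoise(noisy, threshold=1.0)
print([round(v, 3) for v in denoised])
```

In this toy case the disturbance occupies a single weak frequency component, so the hard threshold removes it and the reconstruction matches the clean cosine up to rounding error.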

14.
Carlson SM, Najmi A, Cohen HJ. Proteomics, 2007, 7(7): 1037-1046
Correlated variables have been shown to confound statistical analyses in microarray experiments. The same effect applies to an even greater degree in proteomics, especially with the use of MS for parallel measurements. Biological effects such as PTM, fragmentation, and multimer formation can produce strongly correlated variables. The problem is compounded in some types of MS by technical effects such as incomplete chromatographic separation, binding to multiple surfaces, or multiple ionizations. Existing methods for dimension reduction, notably principal components analysis and related techniques, are not always satisfactory because they produce data that often lack clear biological interpretation. We propose a preprocessing algorithm that clusters highly correlated features, using the Bayes information criterion to select an optimal number of clusters. Statistical analysis of clusters, instead of individual features, benefits from lower noise, and reduces the difficulties associated with strongly correlated data. This preprocessing increases the statistical power of analyses using false discovery rate on simulated data. Strong correlations are often present in real data, and we find that clustering improves biomarker discovery in clinical SELDI-TOF-MS datasets of plasma from patients with Kawasaki disease, and bone-marrow cell extracts from patients with acute myeloid or acute lymphoblastic leukemia.

15.
Clustering analysis is an important tool in studying gene expression data. The Bayesian hierarchical clustering (BHC) algorithm can automatically infer the number of clusters and uses Bayesian model selection to improve clustering quality. In this paper, we present an extension of the BHC algorithm. Our Gaussian BHC (GBHC) algorithm represents data as a mixture of Gaussian distributions. It uses a normal-gamma distribution as a conjugate prior on the mean and precision of each of the Gaussian components. We tested GBHC over 11 cancer and 3 synthetic datasets. The results on cancer datasets show that in sample clustering, GBHC on average produces a clustering partition that is more concordant with the ground truth than those obtained from other commonly used algorithms. Furthermore, GBHC frequently infers a number of clusters that is close to the ground truth. In gene clustering, GBHC also produces a clustering partition that is more biologically plausible than several other state-of-the-art methods. This suggests GBHC as an alternative tool for studying gene expression data. The implementation of GBHC is available at https://sites.google.com/site/gaussianbhc/
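The role of the normal-gamma conjugate prior mentioned above can be illustrated with the textbook posterior update for a univariate Gaussian with unknown mean and precision. This is a sketch of the standard conjugate result, not GBHC's implementation: the function name, the hyperparameter names (`mu0`, `kappa0`, `alpha0`, `beta0`) and the sample values are assumptions, and the marginal-likelihood computation a hierarchical clustering algorithm would need for its merge decisions is not shown.

```python
def normal_gamma_update(data, mu0, kappa0, alpha0, beta0):
    """Posterior hyperparameters of a normal-gamma prior after observing `data`.

    Model assumed: x_i ~ Normal(mu, 1 / tau), with
    (mu, tau) ~ NormalGamma(mu0, kappa0, alpha0, beta0).
    """
    n = len(data)
    mean = sum(data) / n
    # Sum of squared deviations from the sample mean.
    ss = sum((x - mean) ** 2 for x in data)

    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * mean) / kappa_n
    alpha_n = alpha0 + n / 2
    beta_n = beta0 + 0.5 * ss + kappa0 * n * (mean - mu0) ** 2 / (2 * kappa_n)
    return mu_n, kappa_n, alpha_n, beta_n


# Hypothetical expression values for one gene in one candidate cluster.
posterior = normal_gamma_update([1.0, 2.0, 3.0], mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0)
print(posterior)
```

Because the prior is conjugate, the posterior has the same normal-gamma form and is obtained in closed form, which is what makes a conjugate prior convenient when many candidate clusters have to be scored.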

16.
MOTIVATION: Identifying groups of co-regulated genes by monitoring their expression over various experimental conditions is complicated by the fact that such co-regulation is condition-specific. Ignoring the context-specific nature of co-regulation significantly reduces the ability of clustering procedures to detect co-expressed genes due to additional 'noise' introduced by non-informative measurements. RESULTS: We have developed a novel Bayesian hierarchical model and corresponding computational algorithms for clustering gene expression profiles across diverse experimental conditions and studies that accounts for context-specificity of gene expression patterns. The model is based on the Bayesian infinite mixtures framework and does not require a priori specification of the number of clusters. We demonstrate that explicit modeling of context-specificity results in increased accuracy of the cluster analysis by examining the specificity and sensitivity of clusters in microarray data. We also demonstrate that probabilities of co-expression derived from the posterior distribution of clusterings are valid estimates of statistical significance of created clusters. AVAILABILITY: The open-source package gimm is available at http://eh3.uc.edu/gimm.

17.
MOTIVATION: With the advancements of next-generation sequencing technology, it is now possible to study samples directly obtained from the environment. Particularly, 16S rRNA gene sequences have been frequently used to profile the diversity of organisms in a sample. However, such studies are still taxed to determine both the number of operational taxonomic units (OTUs) and their relative abundance in a sample. RESULTS: To address these challenges, we propose an unsupervised Bayesian clustering method termed Clustering 16S rRNA for OTU Prediction (CROP). CROP can find clusters based on the natural organization of data without setting a hard cut-off threshold (3%/5%) as required by hierarchical clustering methods. By applying our method to several datasets, we demonstrate that CROP is robust against sequencing errors and that it produces more accurate results than conventional hierarchical clustering methods. Availability and Implementation: Source code freely available at the following URL: http://code.google.com/p/crop-tingchenlab/, implemented in C++ and supported on Linux and MS Windows.

18.
We quantify the amount of information filtered by different hierarchical clustering methods on correlations between stock returns comparing the clustering structure with the underlying industrial activity classification. We apply, for the first time to financial data, a novel hierarchical clustering approach, the Directed Bubble Hierarchical Tree and we compare it with other methods including the Linkage and k-medoids. By taking the industrial sector classification of stocks as a benchmark partition, we evaluate how the different methods retrieve this classification. The results show that the Directed Bubble Hierarchical Tree can outperform other methods, being able to retrieve more information with fewer clusters. Moreover, we show that the economic information is hidden at different levels of the hierarchical structures depending on the clustering method. The dynamical analysis on a rolling window also reveals that the different methods show different degrees of sensitivity to events affecting financial markets, like crises. These results can be of interest for all the applications of clustering methods to portfolio optimization and risk hedging.

19.
Greenland S. Biometrics, 2001, 57(3): 663-670
In Bayesian and empirical Bayes analyses of epidemiologic data, the most easily implemented prior specifications use a multivariate normal distribution for the log relative risks or a conjugate distribution for the discrete response vector. This article describes problems in translating background information about relative risks into conjugate priors and a solution. Traditionally, conjugate priors have been specified through flattening constants, an approach that leads to conflicts with the true prior covariance structure for the log relative risks. One can, however, derive a conjugate prior consistent with that structure by using a data-augmentation approximation to the true log relative-risk prior, although a rescaling step is needed to ensure the accuracy of the approximation. These points are illustrated with a logistic regression analysis of neonatal-death risk.

20.
Ando T. Biometrika, 2007, 94(2): 443-458
The problem of evaluating the goodness of the predictive distributions of hierarchical Bayesian and empirical Bayes models is investigated. A Bayesian predictive information criterion is proposed as an estimator of the posterior mean of the expected loglikelihood of the predictive distribution when the specified family of probability distributions does not contain the true distribution. The proposed criterion is developed by correcting the asymptotic bias of the posterior mean of the loglikelihood as an estimator of its expected loglikelihood. In the evaluation of hierarchical Bayesian models with random effects, regardless of our parametric focus, the proposed criterion considers the bias correction of the posterior mean of the marginal loglikelihood because it requires a consistent parameter estimator. The use of the bootstrap in model evaluation is also discussed.
