Similar literature
20 similar articles found (search time: 15 ms)
1.
2.
The use of mutual information as a similarity measure in agglomerative hierarchical clustering (AHC) raises an important issue: some correction needs to be applied for the dimensionality of variables. In this work, we formulate the decision of merging dependent multivariate normal variables in an AHC procedure as a Bayesian model comparison. We found that the Bayesian formulation naturally shrinks the empirical covariance matrix towards a matrix set a priori (e.g., the identity), provides an automated stopping rule, and corrects for dimensionality using a term that scales up the measure as a function of the dimensionality of the variables. Also, the resulting log Bayes factor is asymptotically proportional to the plug-in estimate of mutual information, with an additive correction for dimensionality in agreement with the Bayesian information criterion. We investigated the behavior of these Bayesian alternatives (in exact and asymptotic forms) to mutual information on simulated and real data. An encouraging result was first derived on simulations: the hierarchical clustering based on the log Bayes factor outperformed off-the-shelf clustering techniques as well as raw and normalized mutual information in terms of classification accuracy. On a toy example, we found that the Bayesian approaches led to results that were similar to those of mutual information clustering techniques, with the advantage of an automated thresholding. On real functional magnetic resonance imaging (fMRI) datasets measuring brain activity, it identified clusters consistent with the established outcome of standard procedures. On this application, normalized mutual information had a highly atypical behavior, in the sense that it systematically favored very large clusters. These initial experiments suggest that the proposed Bayesian alternatives to mutual information are a useful new tool for hierarchical clustering.
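
The asymptotic form mentioned in this abstract (plug-in mutual information plus a BIC-style dimensionality correction) can be illustrated in the simplest case of two univariate normal variables. The sketch below is only an illustration under stated assumptions, not the authors' exact log Bayes factor: the function names are hypothetical, the plug-in estimate is I = -0.5 log(1 - r^2), and the penalty of 0.5 log n corresponds to the single extra cross-covariance parameter of the dependent model.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def plugin_mutual_information(x, y):
    """Plug-in mutual information (in nats) for two univariate normals."""
    r = pearson_r(x, y)
    return -0.5 * math.log(1.0 - r * r)


def bic_corrected_score(x, y):
    """n * MI minus a BIC-style penalty (assumption: 1 extra parameter).

    A positive score favours treating x and y as dependent (merging);
    a negative score favours independence. Using zero as the cut-off is
    an illustrative stopping rule, not a result quoted from the paper.
    """
    n = len(x)
    return n * plugin_mutual_information(x, y) - 0.5 * math.log(n)
```

With strongly correlated inputs the score is positive, while a weak sample correlation on a short series is outweighed by the penalty and the score turns negative.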

3.
MOTIVATION: With the advent of microarray chip technology, large data sets are emerging containing the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While the use of hierarchical clustering (UPGMA) with correlation 'distance' has been the most common in the microarray studies, there are many more choices of clustering algorithms in pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm to be used for grouping genes based on their expression profiles. RESULTS: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performances on a well-known publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of these six clustering methods with these validation measures. While the 'best' method is dependent on the exact validation strategy and the number of clusters to be used, overall Diana appears to be a solid performer. Interestingly, the performance of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be on opposite extremes, depending on what validation measure one employs. Next it is shown that the group means produced by Diana are the closest and those produced by UPGMA are the farthest from a model profile based on a set of hand-picked genes. Availability: S+ codes for the partial least squares based clustering are available from the authors upon request. All other clustering methods considered have S+ implementation in the library MASS. 
S+ codes for calculating the validation measures are available from the authors upon request. The sporulation data set is publicly available at http://cmgm.stanford.edu/pbrown/sporulation

4.
Ortholog identification is a crucial first step in comparative genomics. Here, we present a rapid method of ortholog grouping which is effective enough to allow the comparison of many genomes simultaneously. The method takes as input all-against-all similarity data and classifies genes based on the traditional hierarchical clustering algorithm UPGMA. In the course of clustering, the method detects domain fusion or fission events, and splits clusters into domains if required. The subsequent procedure splits the resulting trees such that intra-species paralogous genes are divided into different groups so as to create plausible orthologous groups. As a result, the procedure can split genes into the domains minimally required for ortholog grouping. The procedure, named DomClust, was tested using the COG database as a reference. When comparing several clustering algorithms combined with the conventional bidirectional best-hit (BBH) criterion, we found that our method generally showed better agreement with the COG classification. By comparing the clustering results generated from datasets of different releases, we also found that our method showed relatively good stability in comparison to the BBH-based methods.

5.
Ecological Informatics, 2008, 3(4-5): 286-294
Simulated ecological datasets have been widely used to assess the ability of ordination techniques to portray patterns in ecological assemblage data. Such datasets typically contain a single assemblage sampled over an environmental gradient or set of gradients. Little has been done on the generation of artificial datasets that contain a number of different species assemblages, to aid in the evaluation of multivariate techniques that test for differences between assemblages of species. This paper describes and compares two simulation methods that generate ecologically realistic artificial multi-assemblage datasets. Both methods provide multivariate data (e.g. species abundances) for replicate sites within discretely different assemblages. The first technique is a coenocline model based on species' responses to variation modeled by a five-parameter β-function, where variation in species abundances both within and between assemblages is governed by differences in the positions of sites and assemblages along environmental gradients. The second technique, the resampling method, involves bootstrap resampling of real assemblage datasets, with the addition of selected types of controlled differences between assemblages. Here we use it to generate turnover in species composition. We calibrate both simulation methods based on a field assemblage of bird species. The two different simulation methods portray different levels and types of between-assemblage variation. The resampling method allows greater control over some aspects of assemblage difference (e.g. independently varying differences in species richness and compositional turnover) than the coenocline method. Both can generate usable replicated simulated datasets for assessing the ability of multivariate tests to detect ecological variation among assemblages.

6.
Cao Yong, Bark Anthony W., Williams W. Peter. Hydrobiologia, 1997, 347(1-3): 24-40
Four commonly used clustering methods (UPGMA, Ward Linkage, Complete Linkage and TWINSPAN) were compared in their ability to recognise the structure of three river macroinvertebrate datasets which were pre-determined based on habitat and biological characteristics or chemical water quality of sampling sites. DCA, NMDS and ANOSIM were applied to the same datasets to provide further information about data structure, and nonparametric tests were also undertaken on major chemical variables to justify the predeterminations. The modified Rand Index was used to measure the agreement between a particular solution and the pre-determined classification. The results showed that Ward Linkage performed best when its use was broadened and used with the CY Dissimilarity Measure, followed by TWINSPAN and Complete Linkage with UPGMA being least successful. There was evidence to suggest that the effectiveness of some clustering methods (e.g. UPGMA) may vary at different clustering levels, and simulation techniques which have been used to assess clustering methods could leave some properties of clustering methods unexamined.
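
The agreement measure named in this abstract can be sketched in a few lines. The code below assumes that the "modified Rand Index" refers to the chance-corrected (adjusted) Rand index of Hubert and Arabie; the abstract does not spell out the formula, so this is an assumption, and the function name is hypothetical.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Chance-corrected agreement between two partitions of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both labelings must cover the same items")
    n = len(labels_a)
    if n < 2:
        return 1.0  # convention chosen here for trivial input

    # Pairs of items placed together in both partitions (contingency cells).
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs placed together within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions have one cluster
    return (sum_cells - expected) / (max_index - expected)
```

A value of 1 means the clustering solution reproduces the pre-determined classification exactly (cluster names do not matter), values near 0 mean agreement no better than chance, and negative values mean agreement worse than chance.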

7.
The insular limestone karsts of northern Vietnam harbor a very rich biodiversity. Many taxa are strongly associated with these environments, and individual species communities can differ considerably among karst areas. The exact processes that have shaped the biotic composition of these habitats, however, remain largely unknown. In this study, the role of two major processes for the assembly of snail communities on limestone karsts was investigated, interspecific competition and filtering of taxa due to geographical factors. Communities of operculate land snails of the genus Cyclophorus were studied using the dry and fluid‐preserved specimen collections of the Natural History Museum, London. Phylogenetic distances (based on a Bayesian analysis using DNA sequence data) and shell characters (based on 200 semilandmarks) were used as proxies for ecological similarity and were analyzed to reveal patterns of overdispersion (indicating competition) or clustering (indicating filtering) in observed communities compared to random communities. Among the seven studied karst areas, a total of 15 Cyclophorus lineages were found. Unique communities were present in each area. The analyses revealed phylogenetic overdispersion in six and morphological overdispersion in four of seven karst areas. The pattern of frequent phylogenetic overdispersion indicated that competition among lineages is the major process shaping the Cyclophorus communities studied. The Coastal Area, which was phylogenetically overdispersed, showed a clear morphological clustering, which could have been caused by similar ecological adaptations among taxa in this environment. Only the community in the Cuc Phuong Area showed a pattern of phylogenetic clustering, which was partly caused by an absence of a certain, phylogenetically very distinct group in this region. Filtering due to geographical factors could have been involved here. 
This study shows how museum collections can be used to examine community assembly and contributes to the understanding of the processes that have shaped karst communities in Vietnam.

8.
Latitudinal patterns of diversity are one of the most striking large-scale biological phenomena, and several hypotheses have been proposed to explain them. Using data from literature surveys, we investigated how phylogenetic patterns in microorganism, plant, and metazoan communities differ between the tropical and temperate regions and then explored possible ecological and evolutionary processes that could shape such patterns. Using the Net Relatedness Index, we analyzed data from 1486 biological communities, collected in 32 articles that considered the phylogenetic structure of biological communities. We found a pattern of phylogenetic clustering in both regions for microorganisms, while for plants we found phylogenetic clustering in temperate regions and phylogenetic overdispersion in the tropics. We did not detect a clear pattern of clustering or overdispersion in tropical or temperate regions in metazoans. From these patterns we explore different ecological and evolutionary processes that have shaped these communities over space and time.
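
The Net Relatedness Index used in this abstract is the standardized effect size of the mean pairwise phylogenetic distance (MPD), with the sign flipped so that positive values indicate clustering and negative values indicate overdispersion. The sketch below is a minimal illustration: the function names are hypothetical, and the null model (drawing the same number of taxa at random from the whole pool) is an assumption, since the abstract does not say which null model each surveyed study used.

```python
import random
import statistics
from itertools import combinations


def mean_pairwise_distance(taxa, dist):
    """Mean phylogenetic distance over all pairs of taxa in a community."""
    pairs = list(combinations(taxa, 2))
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def net_relatedness_index(community, dist, n_random=999, seed=0):
    """NRI = -(MPD_obs - mean(MPD_null)) / sd(MPD_null).

    `community` is a list of taxon indices into the square matrix `dist`.
    The null communities are random draws of the same size from the pool
    of all taxa in `dist` (an assumed null model).
    """
    rng = random.Random(seed)
    pool = list(range(len(dist)))
    observed = mean_pairwise_distance(community, dist)
    null = [
        mean_pairwise_distance(rng.sample(pool, len(community)), dist)
        for _ in range(n_random)
    ]
    return -(observed - statistics.mean(null)) / statistics.stdev(null)
```

A community made up of close relatives has a smaller MPD than most random draws and therefore a positive index; a community that spans distant clades gives a negative index.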

9.
10.
MOTIVATION: Accurate subcategorization of tumour types through gene-expression profiling requires analytical techniques that estimate the number of categories or clusters rigorously and reliably. Parametric mixture modelling provides a natural setting to address this problem. RESULTS: We compare a criterion for model selection that is derived from a variational Bayesian framework with a popular alternative based on the Bayesian information criterion. Using simulated data, we show that the variational Bayesian method is more accurate in finding the true number of clusters in situations that are relevant to current and future microarray studies. We also compare the two criteria using freely available tumour microarray datasets and show that the variational Bayesian method is more sensitive to capturing biologically relevant structure.

11.
In high‐dimensional omics studies where multiple molecular profiles are obtained for each set of patients, there is often interest in identifying complex multivariate associations, for example, copy number regulated expression levels in a certain pathway or in a genomic region. To detect such associations, we present a novel approach to test for association between two sets of variables. Our approach generalizes the global test, which tests for association between a group of covariates and a single univariate response, to allow high‐dimensional multivariate response. We apply the method to several simulated datasets as well as two publicly available datasets, where we compare the performance of multivariate global test (G2) with univariate global test. The method is implemented in R and will be available as a part of the globaltest package in R.

12.
Numerical classification of species of Vibrio and related genera (cited 10 times in total: 0 self-citations, 10 citations by others)
Data from 1091 strains of the family Vibrionaceae collected in five different studies have been merged into a single data matrix and analysed in a taxonomic study. A set of 142 characters was selected to compare these data. Seventy-nine characters were common to all studies, but data for the other 63 characters were incomplete. Cultures of 90 strains, examined in more than one of the original studies, were used to estimate test error and inter-study variability. The data from these replicate strains also allowed the problem of merging data from different studies to be assessed. Taxonomic resemblance was estimated on the basis of 111 characters using the SSM coefficient and UPGMA clustering. A taxonomic analysis based on 999 strains, which included most of the major species of the family Vibrionaceae, gave 59 clusters and 44 unclustered strains. A table of properties of these phenons was produced. The results showed that data obtained from studies carried out at different times and in different locations, but using standard techniques, could be combined and used to provide useful taxonomic information.

13.
L. Belbin. Austral Ecology, 1992, 17(3): 255-262
Comparing a new set of samples to what may be considered a reference set is a common problem in ecology. The investigator may be interested in the degree of correspondence or any anomalies. For example, does a set of existing reserves adequately cover the range of communities sampled in a region? A technique for such comparisons is proposed. Being dependent solely on estimates of ecological resemblance, it is simple, efficient and robust. Significant difference is defined by means of a resemblance coefficient. A threshold value denoting significant difference can be defined either by species overlap or other attributes of the data. For presence/absence data the Czekanowski coefficient provides a suitable measure of ecological resemblance. Traditional discriminant analysis does not provide a viable alternative due to its limitations in accommodating ecological data.
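
The kind of comparison this abstract describes can be sketched for presence/absence data. For such data the Czekanowski coefficient equals 2a / (2a + b + c), where a is the number of shared species and b and c are the species unique to each sample. The procedure below (flag a new sample when its best match in the reference set falls under a threshold) is an assumed reading of the abstract rather than the author's published algorithm, and the function names and threshold are hypothetical.

```python
def czekanowski(sample_a, sample_b):
    """Czekanowski similarity for two presence/absence species lists."""
    a, b = set(sample_a), set(sample_b)
    if not a and not b:
        return 0.0  # convention chosen here for two empty samples
    return 2 * len(a & b) / (len(a) + len(b))


def flag_anomalies(new_samples, reference_samples, threshold):
    """Return indices of new samples with no sufficiently similar reference.

    A new sample is flagged when its highest similarity to any reference
    sample is below `threshold` (an assumed decision rule).
    """
    flagged = []
    for index, sample in enumerate(new_samples):
        best = max(czekanowski(sample, ref) for ref in reference_samples)
        if best < threshold:
            flagged.append(index)
    return flagged
```

In the reserve example from the abstract, the reference set would hold the communities found in existing reserves, and flagged samples would be communities that no reserve adequately represents.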

14.
15.
Single-cell RNA sequencing enables us to characterize cellular heterogeneity at single-cell resolution with the help of cell type identification algorithms. However, the noise inherent in single-cell RNA-sequencing data severely degrades the accuracy of cell clustering, marker identification and visualization. We propose that clustering based on feature density profiles can distinguish informative features from noise. We call this strategy ‘entropy subspace’ separation and designed a cell clustering algorithm called ENtropy subspace separation-based Clustering for nOise REduction (ENCORE) by integrating the ‘entropy subspace’ separation strategy with a consensus clustering method. We demonstrate that ENCORE achieves superior cell clustering and generates high-resolution visualization across 12 standard datasets. More importantly, ENCORE enables identification of group markers with biological significance from a hard-to-separate dataset. With the advantages of effective feature selection, improved clustering, accurate marker identification and high-resolution visualization, we present ENCORE to the community as an important tool for scRNA-seq data analysis to study cellular heterogeneity and discover group markers.

16.
Community ecologists commonly perform multivariate techniques (e.g., ordination, cluster analysis) to assess patterns and gradients of taxonomic variation. A critical requirement for a meaningful statistical analysis is accurate information on the taxa found within an ecological sample. However, oversampling (too many individuals counted per sample) also comes at a cost, particularly for ecological systems in which identification and quantification is substantially more resource consuming than the field expedition itself. In such systems, an increasingly larger sample size will eventually result in diminishing returns in improving any pattern or gradient revealed by the data, but will also lead to continually increasing costs. Here, we examine 396 datasets: 44 previously published and 352 created datasets. Using meta-analytic and simulation-based approaches, the research within the present paper seeks (1) to determine minimal sample sizes required to produce robust multivariate statistical results when conducting abundance-based, community ecology research. Furthermore, we seek (2) to determine the dataset parameters (i.e., evenness, number of taxa, number of samples) that require larger sample sizes, regardless of resource availability. We found that in the 44 previously published and the 220 created datasets with randomly chosen abundances, a conservative estimate of a sample size of 58 produced the same multivariate results as all larger sample sizes. However, this minimal number varies as a function of evenness, where increased evenness resulted in increased minimal sample sizes. Sample sizes as small as 58 individuals are sufficient for a broad range of multivariate abundance-based research. In cases when resource availability is the limiting factor for conducting a project (e.g., small university, time to conduct the research project), statistically viable results can still be obtained with less of an investment.

17.
18.
Terminal restriction fragment length polymorphism (T-RFLP) is a culture-independent method of obtaining a genetic fingerprint of the composition of a microbial community. Comparisons of the utility of different methods of (i) including peaks, (ii) computing the difference (or distance) between profiles, and (iii) performing statistical analysis were made by using replicated profiles of eubacterial communities. These samples included soil collected from three regions of the United States, soil fractions derived from three agronomic field treatments, soil samples taken from within one meter of each other in an alfalfa field, and replicate laboratory bioreactors. Cluster analysis by Ward's method and by the unweighted-pair group method using arithmetic averages (UPGMA) were compared. Ward's method was more effective at differentiating major groups within sets of profiles; UPGMA had a slightly reduced error rate in clustering of replicate profiles and was more sensitive to outliers. Most replicate profiles were clustered together when relative peak height or Hellinger-transformed peak height was used, in contrast to raw peak height. Redundancy analysis was more effective than cluster analysis at detecting differences between similar samples. Redundancy analysis using Hellinger distance was more sensitive than that using Euclidean distance between relative peak height profiles. Analysis of Jaccard distance between profiles, which considers only the presence or absence of a terminal restriction fragment, was the most sensitive in redundancy analysis, and was equally sensitive in cluster analysis, if all profiles had cumulative peak heights greater than 10,000 fluorescence units. It is concluded that T-RFLP is a sensitive method of differentiating between microbial communities when the optimal statistical method is used for the situation at hand. 
It is recommended that hypothesis testing be performed by redundancy analysis of Hellinger-transformed data and that exploratory data analysis be performed by cluster analysis using Ward's method to find natural groups or by UPGMA to identify potential outliers. Analyses can also be based on Jaccard distance if all profiles have cumulative peak heights greater than 10,000 fluorescence units.
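
The profile transformations and distances compared in this abstract can be written compactly. The sketch below assumes each T-RFLP profile is a list of peak heights aligned on the same fragment positions; the function names are hypothetical. The Hellinger distance is computed here as the Euclidean distance between square-rooted relative peak heights, without the 1/sqrt(2) scaling that some definitions include.

```python
import math


def relative_heights(profile):
    """Scale peak heights so that they sum to 1."""
    total = sum(profile)
    if total <= 0:
        raise ValueError("profile has no signal")
    return [height / total for height in profile]


def hellinger_transform(profile):
    """Square root of relative peak height for every fragment."""
    return [math.sqrt(height) for height in relative_heights(profile)]


def hellinger_distance(profile_a, profile_b):
    """Euclidean distance between Hellinger-transformed profiles."""
    ha = hellinger_transform(profile_a)
    hb = hellinger_transform(profile_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ha, hb)))


def jaccard_distance(profile_a, profile_b):
    """1 - shared fragments / all fragments, using presence or absence only."""
    present_a = {i for i, height in enumerate(profile_a) if height > 0}
    present_b = {i for i, height in enumerate(profile_b) if height > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # convention chosen here for two empty profiles
    return 1.0 - len(present_a & present_b) / len(union)
```

Because the Hellinger transform works on relative heights, two profiles that differ only in overall signal intensity are at distance zero, which is consistent with the abstract's finding that replicates cluster better with relative or Hellinger-transformed heights than with raw heights.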

19.
Aim: To analyse the structure of pteridophyte assemblages, based on phylogenetic relatedness and trait properties, along an elevational gradient. Ecological theory predicts that co‐occurring species may be: randomly selected from a regional pool; ecologically sorted so that they are functionally different hence resulting in reduced competition (overdispersion); or functionally similar as an adaptation to specific ecological conditions (clustering). Location: Braulio Carrillo National Park and Cerro de la Muerte, Costa Rica, Central America. Methods: We used an empirical dataset of the quantitative pattern of species occurrences and individual numbers of ferns within 156 plots along a tropical elevational gradient to test whether directed ecological sorting might cause deviations in patterns of trait and phylogenetic diversity. Mean pairwise distances of species based on phylogenetic and trait properties were compared with two different sets of null assemblages, one maintaining species frequency distributions (constrained) and one not (unconstrained). Results: Applying different null models resulted in varying degrees of overdispersion and clustering, but overall patterns of deviation from random expectations remained the same. Contrary to theoretical predictions, phylogenetic and trait diversity were relatively independent from one another. Phylogenetic diversity showed no patterns along the elevational gradient, whereas trait diversity showed significant trends for epiphytes. Main conclusions: Under stressful environmental conditions (drought at low elevations and frost at high elevations), epiphytic fern assemblages tended to be clustered with respect to trait characteristics, which suggests environmental filtering. Conversely, under less extreme environmental conditions (middle of the transect), the sorting was biased towards high differentiation (overdispersion), presumably because of interspecific competition and trait shifts among closely related species (character displacement).

20.
Comparisons are made of the accuracy of the restricted maximum-likelihood, Wagner parsimony, and UPGMA (unweighted pair-group method using arithmetic averages) clustering methods to estimate phylogenetic trees. Data matrices were generated by constructing simulated stochastic evolution in a multidimensional gene-frequency space using a simple genetic-drift model (Brownian-motion, random-walk) with constant rates of divergence in all lineages. Ten different phylogenetic tree topologies of 20 operational taxonomic units (OTUs), representing a range of tree shapes, were used. Felsenstein's restricted maximum-likelihood method, Wagner parsimony, and UPGMA clustering were used to construct trees from the resulting data matrices. The computations for the restricted maximum-likelihood method were performed on a Cray-1 supercomputer since the required calculations (especially when optimized for the vector hardware) are performed substantially faster than on more conventional computing systems. The overall level of accuracy of tree reconstruction depends on the topology of the true phylogenetic tree. The UPGMA clustering method, especially when genetic-distance coefficients are used, gives the most accurate estimates of the true phylogeny (for our model with constant evolutionary rates). For large numbers of loci, all methods give similar results, but trends in the results imply that the restricted maximum-likelihood method would produce the most accurate trees if sample sizes were large enough.
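
UPGMA, which appears in this and several of the abstracts above, repeatedly merges the two closest clusters and sets the distance from the merged cluster to every other cluster to the size-weighted average of the two old distances. The sketch below is a minimal pure-Python version for small distance matrices; the function name and the nested-tuple tree format are choices made for illustration and are not taken from any of the cited papers.

```python
def upgma(dist, labels):
    """Build a UPGMA tree from a symmetric distance matrix.

    Returns nested tuples (left, right, height), where height is half the
    distance at which the two subtrees were merged. Leaves are labels.
    """
    n = len(labels)
    # Active clusters: id -> (subtree, number of leaves).
    clusters = {i: (labels[i], 1) for i in range(n)}
    # Pairwise distances between active clusters, keyed by (small id, large id).
    d = {(i, j): dist[i][j] for i in range(n) for j in range(i + 1, n)}
    next_id = n

    while len(clusters) > 1:
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        for c in clusters:
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            # Arithmetic average over all leaf pairs, hence the size weights.
            d[(c, next_id)] = (size_a * d_ac + size_b * d_bc) / (size_a + size_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b, d_ab / 2), size_a + size_b)
        next_id += 1

    return next(iter(clusters.values()))[0]
```

Because every merge height is half the average distance between the merged groups, the resulting tree is ultrametric. This matches the constant-rate model in the abstract, under which UPGMA was found to be most accurate.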
