首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
分析了中国分布单叶蔓荆的种内变异.每个居群随机采取5个以上的完整植株,以16个居群单叶蔓荆作为分类运算单位(operational taxonomic unti,OTU),选取11个有变异的形态学性状和3个品质性状.形态性状以各植株的平均值作为OTUs的原始数据,形成16×14的原始数据矩阵,对矩阵进行聚类分析.根据Q...  相似文献   

2.
Many conservation genetics studies in fishes define populations based on capture location. In salmonid fishes, this traditional a priori designation is made by spawning stream, with subsequent post hoc approaches used to define units of conservation. In this study of bull trout from southwestern Alberta, we provide evidence that a model-based Bayesian genetic clustering method may provide a more parsimonious alternative to designating population structure and units of conservation in comparison to traditional methods. The clustering method captured a hierarchical model of population structure, in which seven local populations were nested within three genetic archipelagos. This was in contrast to using simple F ST based approaches between thirteen a priori designated populations, which found significant differences for nearly every pairwise comparison. In addition, assignment tests results from Bayesian clustering revealed that movement may be common between sampling locations. These clustering methods are easy to use, intuitive and provide substantial information on populations of fish; this study provides an example of their utility for local fisheries management and conservation.  相似文献   

3.
Statistical inference for simultaneous clustering of gene expression data   总被引:1,自引:0,他引:1  
Current methods for analysis of gene expression data are mostly based on clustering and classification of either genes or samples. We offer support for the idea that more complex patterns can be identified in the data if genes and samples are considered simultaneously. We formalize the approach and propose a statistical framework for two-way clustering. A simultaneous clustering parameter is defined as a function theta=Phi(P) of the true data generating distribution P, and an estimate is obtained by applying this function to the empirical distribution P(n). We illustrate that a wide range of clustering procedures, including generalized hierarchical methods, can be defined as parameters which are compositions of individual mappings for clustering patients and genes. This framework allows one to assess classical properties of clustering methods, such as consistency, and to formally study statistical inference regarding the clustering parameter. We present results of simulations designed to assess the asymptotic validity of different bootstrap methods for estimating the distribution of Phi(P(n)). The method is illustrated on a publicly available data set.  相似文献   

4.
MOTIVATION: With the increasing number of gene expression databases, the need for more powerful analysis and visualization tools is growing. Many techniques have successfully been applied to unravel latent similarities among genes and/or experiments. Most of the current systems for microarray data analysis use statistical methods, hierarchical clustering, self-organizing maps, support vector machines, or k-means clustering to organize genes or experiments into 'meaningful' groups. Without prior explicit bias almost all of these clustering methods applied to gene expression data not only produce different results, but may also produce clusters with little or no biological relevance. Of these methods, agglomerative hierarchical clustering has been the most widely applied, although many limitations have been identified. RESULTS: Starting with a systematic comparison of the underlying theories behind clustering approaches, we have devised a technique that combines tree-structured vector quantization and partitive k-means clustering (BTSVQ). This hybrid technique has revealed clinically relevant clusters in three large publicly available data sets. In contrast to existing systems, our approach is less sensitive to data preprocessing and data normalization. In addition, the clustering results produced by the technique have strong similarities to those of self-organizing maps (SOMs). We discuss the advantages and the mathematical reasoning behind our approach.  相似文献   

5.
We quantify the amount of information filtered by different hierarchical clustering methods on correlations between stock returns comparing the clustering structure with the underlying industrial activity classification. We apply, for the first time to financial data, a novel hierarchical clustering approach, the Directed Bubble Hierarchical Tree and we compare it with other methods including the Linkage and k-medoids. By taking the industrial sector classification of stocks as a benchmark partition, we evaluate how the different methods retrieve this classification. The results show that the Directed Bubble Hierarchical Tree can outperform other methods, being able to retrieve more information with fewer clusters. Moreover, we show that the economic information is hidden at different levels of the hierarchical structures depending on the clustering method. The dynamical analysis on a rolling window also reveals that the different methods show different degrees of sensitivity to events affecting financial markets, like crises. These results can be of interest for all the applications of clustering methods to portfolio optimization and risk hedging.  相似文献   

6.
It is a common occurrence in plant breeding programs to observe missing values in three-way three-mode multi-environment trial (MET) data. We proposed modifications of models for estimating missing observations for these data arrays, and developed a novel approach in terms of hierarchical clustering. Multiple imputation (MI) was used in four ways, multiple agglomerative hierarchical clustering, normal distribution model, normal regression model, and predictive mean match. The later three models used both Bayesian analysis and non-Bayesian analysis, while the first approach used a clustering procedure with randomly selected attributes and assigned real values from the nearest neighbour to the one with missing observations. Different proportions of data entries in six complete datasets were randomly selected to be missing and the MI methods were compared based on the efficiency and accuracy of estimating those values. The results indicated that the models using Bayesian analysis had slightly higher accuracy of estimation performance than those using non-Bayesian analysis but they were more time-consuming. However, the novel approach of multiple agglomerative hierarchical clustering demonstrated the overall best performances.  相似文献   

7.
Evaluation and comparison of gene clustering methods in microarray analysis   总被引:4,自引:0,他引:4  
MOTIVATION: Microarray technology has been widely applied in biological and clinical studies for simultaneous monitoring of gene expression in thousands of genes. Gene clustering analysis is found useful for discovering groups of correlated genes potentially co-regulated or associated to the disease or conditions under investigation. Many clustering methods including hierarchical clustering, K-means, PAM, SOM, mixture model-based clustering and tight clustering have been widely used in the literature. Yet no comprehensive comparative study has been performed to evaluate the effectiveness of these methods. RESULTS: In this paper, six gene clustering methods are evaluated by simulated data from a hierarchical log-normal model with various degrees of perturbation as well as four real datasets. A weighted Rand index is proposed for measuring similarity of two clustering results with possible scattered genes (i.e. a set of noise genes not being clustered). Performance of the methods in the real data is assessed by a predictive accuracy analysis through verified gene annotations. Our results show that tight clustering and model-based clustering consistently outperform other clustering methods both in simulated and real data while hierarchical clustering and SOM perform among the worst. Our analysis provides deep insight to the complicated gene clustering problem of expression profile and serves as a practical guideline for routine microarray cluster analysis.  相似文献   

8.
MOTIVATION: It is well understood that the successful clustering of expression profiles give beneficial ideas to understand the functions of uncharacterized genes. In order to realize such a successful clustering, we investigate a clustering method based on adaptive resonance theory (ART) in this report. RESULTS: We apply Fuzzy ART as a clustering method for analyzing the time series expression data during sporulation of Saccharomyces cerevisiae. The clustering result by Fuzzy ART was compared with those by other clustering methods such as hierarchical clustering, k-means algorithm and self-organizing maps (SOMs). In terms of the mathematical validations, Fuzzy ART achieved the most reasonable clustering. We also verified the robustness of Fuzzy ART using noised data. Furthermore, we defined the correctness ratio of clustering, which is based on genes whose temporal expressions are characterized biologically. Using this definition, it was proved that the clustering ability of Fuzzy ART was superior to other clustering methods such as hierarchical clustering, k-means algorithm and SOMs. Finally, we validate the clustering results by Fuzzy ART in terms of biological functions and evidence. AVAILABILITY: The software is available at http//www.nubio.nagoya-u.ac.jp/proc/index.html  相似文献   

9.
BACKGROUND: This study examined whether hierarchical clustering could be used to detect cell states induced by treatment combinations that were generated through automation and high-throughput (HT) technology. Data-mining techniques were used to analyze the large experimental data sets to determine whether nonlinear, non-obvious responses could be extracted from the data. METHODS: Unary, binary, and ternary combinations of pharmacological factors (examples of stimuli) were used to induce differentiation of HL-60 cells using a HT automated approach. Cell profiles were analyzed by incorporating hierarchical clustering methods on data collected by flow cytometry. Data-mining techniques were used to explore the combinatorial space for nonlinear, unexpected events. Additional small-scale, follow-up experiments were performed on cellular profiles of interest. RESULTS: Multiple, distinct cellular profiles were detected using hierarchical clustering of expressed cell-surface antigens. Data-mining of this large, complex data set retrieved cases of both factor dominance and cooperativity, as well as atypical cellular profiles. Follow-up experiments found that treatment combinations producing "atypical cell types" made those cells more susceptible to apoptosis. CONCLUSIONS Hierarchical clustering and other data-mining techniques were applied to analyze large data sets from HT flow cytometry. From each sample, the data set was filtered and used to define discrete, usable states that were then related back to their original formulations. Analysis of resultant cell populations induced by a multitude of treatments identified unexpected phenotypes and nonlinear response profiles.  相似文献   

10.
Kernel density smoothing techniques have been used in classification or supervised learning of gene expression profile (GEP) data, but their applications to clustering or unsupervised learning of those data have not been explored and assessed. Here we report a kernel density clustering method for analysing GEP data and compare its performance with the three most widely-used clustering methods: hierarchical clustering, K-means clustering, and multivariate mixture model-based clustering. Using several methods to measure agreement, between-cluster isolation, and withincluster coherence, such as the Adjusted Rand Index, the Pseudo F test, the r(2) test, and the profile plot, we have assessed the effectiveness of kernel density clustering for recovering clusters, and its robustness against noise on clustering both simulated and real GEP data. Our results show that the kernel density clustering method has excellent performance in recovering clusters from simulated data and in grouping large real expression profile data sets into compact and well-isolated clusters, and that it is the most robust clustering method for analysing noisy expression profile data compared to the other three methods assessed.  相似文献   

11.
The selection of an appropriate sampling strategy and a clustering method is important in the construction of core collections based on predicted genotypic values in order to retain the greatest degree of genetic diversity of the initial collection. In this study, methods of developing rice core collections were evaluated based on the predicted genotypic values for 992 rice varieties with 13 quantitative traits. The genotypic values of the traits were predicted by the adjusted unbiased prediction (AUP) method. Based on the predicted genotypic values, Mahalanobis distances were calculated and employed to measure the genetic similarities among the rice varieties. Six hierarchical clustering methods, including the single linkage, median linkage, centroid, unweighted pair-group average, weighted pair-group average and flexible-beta methods, were combined with random, preferred and deviation sampling to develop 18 core collections of rice germplasm. The results show that the deviation sampling strategy in combination with the unweighted pair-group average method of hierarchical clustering retains the greatest degree of genetic diversities of the initial collection. The core collections sampled using predicted genotypic values had more genetic diversity than those based on phenotypic values.Communicated by D.J. Mackill  相似文献   

12.
Data from different agencies share data of the same individuals. Linking these datasets to identify all the records belonging to the same individuals is a crucial and challenging problem, especially given the large volumes of data. A large number of available algorithms for record linkage are prone to either time inefficiency or low-accuracy in finding matches and non-matches among the records. In this paper we propose efficient as well as reliable sequential and parallel algorithms for the record linkage problem employing hierarchical clustering methods. We employ complete linkage hierarchical clustering algorithms to address this problem. In addition to hierarchical clustering, we also use two other techniques: elimination of duplicate records and blocking. Our algorithms use sorting as a sub-routine to identify identical copies of records. We have tested our algorithms on datasets with millions of synthetic records. Experimental results show that our algorithms achieve nearly 100% accuracy. Parallel implementations achieve almost linear speedups. Time complexities of these algorithms do not exceed those of previous best-known algorithms. Our proposed algorithms outperform previous best-known algorithms in terms of accuracy consuming reasonable run times.  相似文献   

13.
Clustering cells and depicting the lineage relationship among cell subpopulations are fundamental tasks in single-cell omics studies. However, existing analytical methods face challenges in stratifying cells, tracking cellular trajectories, and identifying critical points of cell transitions. To overcome these, we proposed a novel Markov hierarchical clustering algorithm (MarkovHC), a topological clustering method that leverages the metastability of exponentially perturbed Markov chains for systematically reconstructing the cellular landscape. Briefly, MarkovHC starts with local connectivity and density derived from the input and outputs a hierarchical structure for the data. We firstly benchmarked MarkovHC on five simulated datasets and ten public single-cell datasets with known labels. Then, we used MarkovHC to investigate the multi-level architectures and transition processes during human embryo preimplantation development and gastric cancer procession. MarkovHC found heterogeneous cell states and sub-cell types in lineage-specific progenitor cells and revealed the most possible transition paths and critical points in the cellular processes. These results demonstrated MarkovHC’s effectiveness in facilitating the stratification of cells, identification of cell populations, and characterization of cellular trajectories and critical points.  相似文献   

14.
MOTIVATION: Clustering is one of the most widely used methods in unsupervised gene expression data analysis. The use of different clustering algorithms or different parameters often produces rather different results on the same data. Biological interpretation of multiple clustering results requires understanding how different clusters relate to each other. It is particularly non-trivial to compare the results of a hierarchical and a flat, e.g. k-means, clustering. RESULTS: We present a new method for comparing and visualizing relationships between different clustering results, either flat versus flat, or flat versus hierarchical. When comparing a flat clustering to a hierarchical clustering, the algorithm cuts different branches in the hierarchical tree at different levels to optimize the correspondence between the clusters. The optimization function is based on graph layout aesthetics or on mutual information. The clusters are displayed using a bipartite graph where the edges are weighted proportionally to the number of common elements in the respective clusters and the weighted number of crossings is minimized. The performance of the algorithm is tested using simulated and real gene expression data. The algorithm is implemented in the online gene expression data analysis tool Expression Profiler. AVAILABILITY: http://www.ebi.ac.uk/expressionprofiler  相似文献   

15.
Assessing reliability of gene clusters from gene expression data   总被引:5,自引:0,他引:5  
The rapid development of microarray technologies has raised many challenging problems in experiment design and data analysis. Although many numerical algorithms have been successfully applied to analyze gene expression data, the effects of variations and uncertainties in measured gene expression levels across samples and experiments have been largely ignored in the literature. In this article, in the context of hierarchical clustering algorithms, we introduce a statistical resampling method to assess the reliability of gene clusters identified from any hierarchical clustering method. Using the clustering trees constructed from the resampled data, we can evaluate the confidence value for each node in the observed clustering tree. A majority-rule consensus tree can be obtained, showing clusters that only occur in a majority of the resampled trees. We illustrate our proposed methods with applications to two published data sets. Although the methods are discussed in the context of hierarchical clustering methods, they can be applied with other cluster-identification methods for gene expression data to assess the reliability of any gene cluster of interest. Electronic Publication  相似文献   

16.
Genetic data are increasingly used to describe the structure of wildlife populations and to infer landscape influences on functional connectivity. To accomplish this, genetic structure can be described with a multitude of methods that vary in their assumptions, advantages and limitations. While some methods discriminate distinct subpopulations separated by sharp genetic boundaries (i.e. barrier detection or clustering methods), other methods estimate gradient genetic structures using individual‐based genetic distances. We present an analytical framework based on individual ancestry values that combines these different approaches and can be used to a) test for local barriers to gene flow and b) evaluate effects of landscape gradients through individual‐based genetic distances that account for hierarchical genetic structure. We illustrate the approach with a data set of 371 cougars Puma concolor from a 217 000 km2 study area in Idaho and western Montana (USA) that were genotyped at 12 microsatellite loci. Results suggest that cougars in the region show a complex, hierarchical genetic structure that is influenced by a local barrier to gene flow (an urban population cluster connected by high traffic volumes), different landscape features (the Snake River Plain, forested habitat), and geographic distance. Our novel approach helped to elucidate the relative influence of these factors on different hierarchical levels of population structure, which was not possible when using either clustering methods or standard genetic distances. Results obtained with our analytical framework highlight the need for multi‐scale management of cougars in the region and show that landscape heterogeneity can create complex genetic structures, even in generalist species with high dispersal capabilities.  相似文献   

17.
Krishnamoorthy K  Lu Y 《Biometrics》2003,59(2):237-247
This article presents procedures for hypothesis testing and interval estimation of the common mean of several normal populations. The methods are based on the concepts of generalized p-value and generalized confidence limit. The merits of the proposed methods are evaluated numerically and compared with those of the existing methods. Numerical studies show that the new procedures are accurate and perform better than the existing methods when the sample sizes are moderate and the number of populations is four or less. If the number of populations is five or more, then the generalized variable method performs much better than the existing methods regardless of the sample sizes. The generalized variable method and other existing methods are illustrated using two examples.  相似文献   

18.
19.
PermutMatrix is a work space designed to graphically explore gene expression data. It relies on the graphical approach introduced by Eisen and also offers several methods for the optimal reorganization of rows and columns of a numerical dataset. For example, several methods are proposed for optimal reorganization of the leaves of a hierarchical clustering tree, along with several seriation or unidimensional scaling methods that do not require any preliminary hierarchical clustering. This program, developed for MS Windows, with MS-Visual C++, has a clear and efficient graphical interface. Large datasets can be thoroughly and quickly analyzed.  相似文献   

20.
Partitioning closely related genes into clusters has become an important element of practically all statistical analyses of microarray data. A number of computer algorithms have been developed for this task. Although these algorithms have demonstrated their usefulness for gene clustering, some basic problems remain. This paper describes our work on extracting functional keywords from MEDLINE for a set of genes that are isolated for further study from microarray experiments based on their differential expression patterns. The sharing of functional keywords among genes is used as a basis for clustering in a new approach called BEA-PARTITION in this paper. Functional keywords associated with genes were extracted from MEDLINE abstracts. We modified the Bond Energy Algorithm (BEA), which is widely accepted in psychology and database design but is virtually unknown in bioinformatics, to cluster genes by functional keyword associations. The results showed that BEA-PARTITION and hierarchical clustering algorithm outperformed k-means clustering and self-organizing map by correctly assigning 25 of 26 genes in a test set of four known gene groups. To evaluate the effectiveness of BEA-PARTITION for clustering genes identified by microarray profiles, 44 yeast genes that are differentially expressed during the cell cycle and have been widely studied in the literature were used as a second test set. Using established measures of cluster quality, the results produced by BEA-PARTITION had higher purity, lower entropy, and higher mutual information than those produced by k-means and self-organizing map. Whereas BEA-PARTITION and the hierarchical clustering produced similar quality of clusters, BEA-PARTITION provides clear cluster boundaries compared to the hierarchical clustering. BEA-PARTITION is simple to implement and provides a powerful approach to clustering genes or to any clustering problem where starting matrices are available from experimental observations.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号