Similar articles
 20 similar articles found; search time 31 ms
1.
Are clusters found in one dataset present in another dataset?   Total citations: 4, self-citations: 0, citations by others: 4
In many microarray studies, a cluster defined on one dataset is sought in an independent dataset. If the cluster is found in the new dataset, the cluster is said to be "reproducible" and may be biologically significant. Classifying a new datum to a previously defined cluster can be seen as predicting which of the previously defined clusters is most similar to the new datum. If the new data classified to a cluster are similar, molecularly or clinically, to the data already present in the cluster, then the cluster is reproducible and the corresponding prediction accuracy is high. Here, we take advantage of the connection between reproducibility and prediction accuracy to develop a validation procedure for clusters found in datasets independent of the one in which they were characterized. We define a cluster quality measure called the "in-group proportion" (IGP) and introduce a general procedure for individually validating clusters. Using simulations and real breast cancer datasets, the IGP is compared to four other popular cluster quality measures (homogeneity score, separation score, silhouette width, and weighted average discrepant pairs score). Moreover, simulations and the real breast cancer datasets are used to compare the four versions of the validation procedure which all use the IGP, but differ in the way in which the null distributions are generated. We find that the IGP is the best measure of prediction accuracy, and one version of the validation procedure is more widely applicable than the other three. An implementation of this algorithm is in a package called "clusterRepro" available through The Comprehensive R Archive Network (http://cran.r-project.org).
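
As a rough illustration of the in-group proportion idea, the sketch below computes, for each cluster, the fraction of its members whose nearest neighbour carries the same cluster label. This is a minimal sketch under stated assumptions: the abstract does not give the exact definition, so the nearest-neighbour reading, the Euclidean distance, the function name `in_group_proportion` and the toy data are all illustrative choices, and none of this reproduces the clusterRepro package.

```python
import math

def in_group_proportion(points, labels):
    """Return {label: IGP}, where IGP is the fraction of points in a cluster
    whose nearest neighbour (Euclidean, excluding the point itself) has the
    same label. Assumed reading of the measure, for illustration only."""
    hits = {}
    sizes = {}
    for i, (p, lab) in enumerate(zip(points, labels)):
        # Index of the closest other point.
        nearest = min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(p, points[j]),
        )
        sizes[lab] = sizes.get(lab, 0) + 1
        hits[lab] = hits.get(lab, 0) + (labels[nearest] == lab)
    return {lab: hits[lab] / sizes[lab] for lab in sizes}

# Toy data (hypothetical): the last point is labelled "B" but sits among "A".
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (5.0, 5.0), (5.1, 5.0), (0.3, 0.1)]
labels = ["A", "A", "A", "B", "B", "B"]
igp = in_group_proportion(points, labels)
print(igp)
```

In this toy example cluster "A" scores 1.0, while cluster "B" scores 2/3 because one of its members lies closest to a point from "A".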

2.
Many external and internal validity measures have been proposed in order to estimate the number of clusters in gene expression data, but as a rule they do not consider the stability of the groupings produced by a clustering algorithm. Based on the approach of assessing the predictive power or stability of a partitioning, we propose a new measure of cluster validation and a selection procedure to determine a suitable number of clusters. The validity measure is based on an estimate of the "clearness" of the consensus matrix, which is the result of a resampling clustering scheme or consensus clustering. According to the proposed selection procedure, the stable clustering result is determined with reference to the validity measure for the null hypothesis encoding the absence of clusters. The final number of clusters is selected by analyzing the distance between the validity plots for the initial and permuted data sets. We applied the selection procedure to estimate the clustering results on several datasets. As a result, the proposed procedure produced an accurate and robust estimate of the number of clusters, which is in agreement with the biological knowledge and gold standards of cluster quality.
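
To make the notion of a consensus matrix concrete, the sketch below repeatedly subsamples the data, clusters each subsample, and records how often each pair of items lands in the same cluster when both are drawn. It is only a sketch under assumptions: the base clusterer is a tiny one-dimensional 2-means written for this example, and the names `two_means_1d` and `consensus_matrix`, the subsampling fraction and the data are hypothetical. The "clearness" measure itself is not defined in the abstract and is not computed here.

```python
import random

def two_means_1d(values, iters=20):
    """Tiny 1-D 2-means; centres start at the min and max of the values."""
    lo, hi = min(values), max(values)
    assign = [0] * len(values)
    for _ in range(iters):
        assign = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        g0 = [v for v, a in zip(values, assign) if a == 0]
        g1 = [v for v, a in zip(values, assign) if a == 1]
        if g0:
            lo = sum(g0) / len(g0)
        if g1:
            hi = sum(g1) / len(g1)
    return assign

def consensus_matrix(data, n_resamples=50, frac=0.75, seed=0):
    """Entry (i, j) is the share of resamples containing both i and j
    in which the two items were assigned to the same cluster."""
    rng = random.Random(seed)
    n = len(data)
    together = [[0] * n for _ in range(n)]
    sampled = [[0] * n for _ in range(n)]
    size = max(2, int(frac * n))
    for _ in range(n_resamples):
        idx = rng.sample(range(n), size)  # subsample without replacement
        assign = two_means_1d([data[i] for i in idx])
        for a, i in enumerate(idx):
            for b, j in enumerate(idx):
                sampled[i][j] += 1
                together[i][j] += assign[a] == assign[b]
    return [
        [together[i][j] / sampled[i][j] if sampled[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

data = [0.0, 0.1, 0.2, 0.3, 9.0, 9.1, 9.2, 9.3]  # two well-separated groups
cm = consensus_matrix(data)
```

For clearly separated groups the entries sit at 0 or 1; entries that drift toward intermediate values indicate pairs whose co-clustering is unstable under resampling.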

3.

Purpose

To describe a methodology, based on cluster analysis, to partition multi-parametric functional imaging data into groups (or clusters) of similar functional characteristics, with the aim of characterizing functional heterogeneity within head and neck tumour volumes. To evaluate the performance of the proposed approach on a set of longitudinal MRI data, analysing the evolution of the obtained sub-sets with treatment.

Material and Methods

The cluster analysis workflow was applied to a combination of dynamic contrast-enhanced and diffusion-weighted imaging MRI data from a cohort of squamous cell carcinoma of the head and neck patients. Cumulative distributions of voxels, containing pre and post-treatment data and including both primary tumours and lymph nodes, were partitioned into k clusters (k = 2, 3 or 4). Principal component analysis and cluster validation were employed to investigate data composition and to independently determine the optimal number of clusters. The evolution of the resulting sub-regions with induction chemotherapy treatment was assessed relative to the number of clusters.

Results

The clustering algorithm was able to separate clusters which significantly reduced in voxel number following induction chemotherapy from clusters with a non-significant reduction. Partitioning with the optimal number of clusters (k = 4), determined with cluster validation, produced the best separation between reducing and non-reducing clusters.

Conclusion

The proposed methodology was able to identify tumour sub-regions with distinct functional properties, independently separating clusters which were affected differently by treatment. This work demonstrates that unsupervised cluster analysis, with no prior knowledge of the data, can be employed to provide a multi-parametric characterization of functional heterogeneity within tumour volumes.

4.
It has been well established that gene expression data contain large amounts of random variation that affects both the analysis and the results of microarray experiments. Typically, microarray data are either tested for differential expression between conditions or grouped on the basis of profiles that are assessed temporally or across genetic or environmental conditions. While testing differential expression relies on levels of certainty to evaluate the relative worth of various analyses, cluster analysis is exploratory in nature and has not had the benefit of any judgment of statistical inference. By using a novel dissimilarity function to ascertain gene expression clusters and conditional randomization of the data space to illuminate distinctions between statistically significant clusters of gene expression patterns, we aim to provide a level of confidence to inferred clusters of gene expression data. We apply both permutation and convex hull approaches for randomization of the data space and show that both methods can provide an effective assessment of gene expression profiles whose coregulation is statistically different from that expected by random chance alone.

5.
The morphology of 175 specimens of Ornithogalum belonging to 12 species was analyzed on the basis of 31 characters. The methods used were cluster analysis (agglomerative clustering using Ward's average), principal components analysis (using correlation coefficients) and oligothetic characterization of clusters and species. The splitting level at which seven clusters were separated is the best possible one according to the criteria used. The separation of nine clusters yields more information on the distribution of the species. Both the clusters and the species are separated from each other in the attribute space formed by the first, second and third principal components. The species of the umbellatum-angustifolium complex, i.e. O. umbellatum, O. angustifolium, O. refractum, O. monticolum, O. baeticum and O. algeriense, are clearly separated. A parallel evolution of the pollen-sterile strains of the bulbilliferous species O. umbellatum and O. angustifolium could be traced. A dichotomous key for all the species involved is proposed.

6.
Standard clustering algorithms, when applied to DNA microarray data, often tend to produce erroneous clusters. A major contributor to this divergence is the feature characteristic of microarray data sets that the number of predictors (genes) in such data far exceeds the number of samples by many orders of magnitude, with only a small percentage of predictors being truly informative with regard to the clustering while the rest merely add noise. An additional complication is that the predictors exhibit an unknown complex correlational configuration embedded in a small subspace of the entire predictor space. Under these conditions, standard clustering algorithms fail to find the true clusters even when applied in tandem with some sort of gene filtering or dimension reduction to reduce the number of predictors. We propose, as an alternative, a novel method for unsupervised classification of DNA microarray data. The method, which is based on the idea of aggregating results obtained from an ensemble of randomly resampled data (where both samples and genes are resampled), introduces a way of tilting the procedure so that the ensemble includes minimal representation from less important areas of the gene predictor space. The method produces a measure of dissimilarity between each pair of samples that can be used in conjunction with (a) a method like Ward's procedure to generate a cluster analysis and (b) multidimensional scaling to generate useful visualizations of the data. We call the dissimilarity measures ABC dissimilarities since they are obtained by aggregating bundles of clusters. An extensive comparison of several clustering methods using actual DNA microarray data convincingly demonstrates that classification using ABC dissimilarities offers significantly superior performance.

7.
Pigmentation of hair in humans has been investigated by medical scientists, anthropologists and, more recently, by forensic scientists. In every investigation, hair color must first be defined by the researchers. Subjective color assessment inhibits the reproducibility of experiments and the direct comparison of results. The aim of this study was to objectively measure human hair color and examine the variation found in a population with European ancestry, using the CIE L*a*b* color space. Observer-perceived hair colors were compared with self-reported hair colors and the color as measured by reflective spectrophotometry of 132 subjects of European ancestry. The presented data show that self-reported hair colors and observer-reported colors are similar; however, these categories are not necessarily the best way to categorize hair color for quantitative research. Using a two-step cluster analysis, hair color can be divided into categories or clusters based on spectrophotometric measurements in the CIE L*a*b* color space, and these clusters can be well discriminated from each other. This separation is primarily based on the b* (yellow) color component, and the clusters show agreement with observer-reported colors. This study illustrates the possibilities for and necessity of objectively defining the hair color phenotype for various downstream applications.

8.

Background  

Interpretation of comprehensive DNA microarray data sets is a challenging task for biologists and process engineers, where scientific assistance from statistics and bioinformatics is essential. Interdisciplinary cooperation and concerted development of software tools for simplified and accelerated data analysis and interpretation are the key to overcoming the bottleneck in data-analysis workflows. This approach is exemplified by gcExplorer, an interactive visualization toolbox based on cluster analysis. Clustering is an important tool in gene expression data analysis to find groups of co-expressed genes, which can finally suggest functional pathways and interactions between genes. The visualization of gene clusters gives practitioners an understanding of the cluster structure of their data and makes it easier to interpret the cluster results.

9.
In cluster randomized trials, intact social units such as schools, worksites or medical practices - rather than individuals themselves - are randomly allocated to intervention and control conditions, while the outcomes of interest are then observed on individuals within each cluster. Such trials are becoming increasingly common in the fields of health promotion and health services research. Attrition is a common occurrence in randomized trials, and a standard approach for dealing with the resulting missing values is imputation. We consider imputation strategies for missing continuous outcomes, focusing on trials with a completely randomized design in which fixed cohorts from each cluster are enrolled prior to random assignment. We compare five different imputation strategies with respect to Type I and Type II error rates of the adjusted two-sample t-test for the intervention effect. Cluster mean imputation is compared with multiple imputation, using either within-cluster data or data pooled across clusters in each intervention group. In the case of pooling across clusters, we distinguish between standard multiple imputation procedures which do not account for intracluster correlation and a specialized procedure which does account for intracluster correlation but is not yet available in standard statistical software packages. A simulation study is used to evaluate the influence of cluster size, number of clusters, degree of intracluster correlation, and variability among cluster follow-up rates. We show that cluster mean imputation yields valid inferences and, given its simplicity, may be an attractive option in some large community intervention trials which are subject to individual-level attrition only; however, it may yield less powerful inferences than alternative procedures which pool across clusters, especially when the cluster sizes are small and cluster follow-up rates are highly variable.
When pooling across clusters, the imputation procedure should generally take intracluster correlation into account to obtain valid inferences; however, as long as the intracluster correlation coefficient is small, we show that standard multiple imputation procedures may yield acceptable type I error rates; moreover, these procedures may yield more powerful inferences than a specialized procedure, especially when the number of available clusters is small. Within-cluster multiple imputation is shown to be the least powerful among the procedures considered.
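
Cluster mean imputation, the simplest of the strategies compared above, can be sketched in a few lines: each missing outcome is replaced by the mean of the observed outcomes in the same cluster. The function name `cluster_mean_impute`, the use of `None` for a missing value and the example data are assumptions made for illustration; the adjusted t-test and the multiple imputation procedures from the abstract are not shown.

```python
def cluster_mean_impute(outcomes_by_cluster):
    """Replace each None with the mean of the observed values in its cluster."""
    imputed = {}
    for cluster, values in outcomes_by_cluster.items():
        observed = [v for v in values if v is not None]
        if not observed:
            # A cluster with no follow-up data cannot be imputed this way.
            raise ValueError(f"cluster {cluster!r} has no observed outcomes")
        mean = sum(observed) / len(observed)
        imputed[cluster] = [mean if v is None else v for v in values]
    return imputed

# Hypothetical trial data: None marks a participant lost to follow-up.
trial = {"school_1": [4.0, None, 6.0], "school_2": [1.0, 3.0, None, None]}
filled = cluster_mean_impute(trial)
print(filled)
```

Because every imputed value equals its cluster mean, the imputation leaves each cluster mean unchanged while shrinking within-cluster variability, which is one reason the analysis in the abstract works with an adjusted test.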

10.
MOTIVATION: Hierarchical clustering is a common approach to study protein and gene expression data. This unsupervised technique is used to find clusters of genes or proteins which are expressed in a coordinated manner across a set of conditions. Because of both biological and technical variability, experimental repetitions are generally performed. In this work, we propose an approach to evaluate the stability of clusters derived from hierarchical clustering by taking repeated measurements into account. RESULTS: The method is based on the bootstrap technique, which is used to obtain pseudo-hierarchies of genes from resampled datasets. Based on a fast dynamic programming algorithm, we compare the original hierarchy to the pseudo-hierarchies and assess the stability of the original gene clusters. Then a shuffling procedure can be used to assess the significance of the cluster stabilities. Our approach is illustrated on simulated data and on two microarray datasets. Compared to the standard hierarchical clustering methodology, it makes it possible to distinguish dubious clusters from stable ones, and thus avoids misleading interpretations. AVAILABILITY: The programs were developed in the C and R languages.

11.
New probability matrices for identification of Streptomyces   Total citations: 3, self-citations: 0, citations by others: 3
The character state data obtained for clusters defined in a previous phenetic classification were used to construct two probabilistic matrices for Streptomyces species. These superseded an original published identification matrix by exclusion of other genera and the inclusion of more Streptomyces species. Separate matrices were constructed for major and minor clusters. The minimum number of diagnostic characters for each matrix was selected by computer programs for determination of character separation indices (CHARSEP) and a selection of group diagnostic properties (DIACHAR). The resulting matrices consisted of 26 phena x 50 characters (major clusters) and 28 phena x 39 characters (minor clusters). Cluster overlap (OVERMAT program) was small in both matrices. Identification scores were used to evaluate both matrices. The theoretically best scores for the most typical example of each cluster (MOSTTYP program) were all satisfactory. Input of test data for randomly selected cluster representatives resulted in correct identification with high scores. The major cluster matrix was shown to be practically sound by its application to 35 unknown soil isolates, 77% of which were clearly identified. The minor cluster matrix provides tentative probabilistic identifications as the small number of strains in each cluster reduces its ability to withstand test variation. A diagnostic table for single-membered clusters, constructed using the CHARSEP and DIACHAR programs, was also produced.

12.
Gangnon RE, Clayton MK. Biometrics, 2000, 56(3): 922-935
Many current statistical methods for disease clustering studies are based on a hypothesis testing paradigm. These methods typically do not produce useful estimates of disease rates or cluster risks. In this paper, we develop a Bayesian procedure for drawing inferences about specific models for spatial clustering. The proposed methodology incorporates ideas from image analysis, from Bayesian model averaging, and from model selection. With our approach, we obtain estimates for disease rates and allow for greater flexibility in both the type of clusters and the number of clusters that may be considered. We illustrate the proposed procedure through simulation studies and an analysis of the well-known New York leukemia data.

13.
14.
Multiple test procedures are usually compared on various aspects of error control and power. Power is measured as some function of the number of false hypotheses correctly identified as false. However, given equal numbers of rejected false hypotheses, the pattern of rejections, i.e. the particular set of false hypotheses identified, may be crucial in interpreting the results for potential application. In an important area of application, comparisons among a set of treatments based on random samples from populations, two different approaches, cluster analysis and model selection, deal implicitly with such patterns, while traditional multiple testing procedures generally focus on the outcomes of subset and pairwise equality hypothesis tests, without considering the overall pattern of results in comparing methods. An important feature involving the pattern of rejections is their relevance for dividing the treatments into distinct subsets based on some parameter of interest, for example their means. This paper introduces some new measures relating to the potential of methods for achieving such divisions. Following Hartley (1955), sets of treatments with equal parameter values will be called clusters. Because it is necessary to distinguish between clusters in the populations and clustering in sample outcomes, the population clusters will be referred to as P-clusters; any related concepts defined in terms of the sample outcome will be referred to with the prefix outcome. Outcomes of multiple comparison procedures will be studied in terms of their probabilities of leading to separation of treatments into outcome clusters, with various measures relating to the number of such outcome clusters and the proportion of true vs. false outcome clusters.
The definitions of true and false outcome clusters and related concepts, and the approach taken here, are in the tradition of hypothesis testing with attention to overall error control and power, but with added consideration of cluster separation potential. The pattern approach will be illustrated by comparing two methods with apparent FDR control but with different ways of ordering outcomes for potential significance: the original Benjamini-Hochberg (1995) procedure (BH), and the Newman-Keuls (Newman, 1939; Keuls, 1952) procedure (NK).
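
For readers unfamiliar with the BH procedure named above, the following minimal sketch shows the standard step-up rule: sort the p-values, find the largest rank k with p_(k) <= k*q/m, and reject the k smallest. The function name `benjamini_hochberg` and the example p-values are illustrative assumptions; the cluster-separation measures introduced in the paper and the Newman-Keuls procedure are not implemented.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans, True where the hypothesis is rejected
    by the Benjamini-Hochberg step-up rule at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that the k-th smallest p-value <= k*q/m.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            cutoff = rank
    rejected = [False] * m
    for i in order[:cutoff]:
        rejected[i] = True
    return rejected

# Hypothetical p-values from eight pairwise comparisons.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
decisions = benjamini_hochberg(p, q=0.05)
print(decisions)
```

With these values only the two smallest p-values fall below their thresholds (0.00625 and 0.0125), so only the first two hypotheses are rejected.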

15.
We have mined the evolutionary record for the large family of intracellular lipid-binding proteins (iLBPs) by calculating the statistical coupling of residue variations in a multiple sequence alignment using methods developed by Ranganathan and coworkers (Lockless and Ranganathan, Science 1999;286:295-299). The 213 sequences analyzed have a wide range of ligand-binding functions as well as highly divergent phylogenetic origins, assuring broad sampling of sequence space. Emerging from this analysis were two major clusters of coupled residues, which when mapped onto the structure of a representative iLBP under study in our laboratory, cellular retinoic-acid binding protein I, are largely contiguous and provide useful points of comparison to available data for the folding of this protein. One cluster comprises a predominantly hydrophobic core away from the ligand-binding site and likely represents key structural information for the iLBP fold. The other cluster includes the portal region where ligand enters its binding site, regions of the ligand-binding cavity, and the region where the 10-stranded beta-barrel characteristic of this family closes (between strands 1' and 10). Linkages between these two clusters suggest that evolutionary pressures on this family constrain structural and functional sequence information in an interdependent fashion. The necessity of the structure to wrap around a hydrophobic ligand confounds the typical sequestration of hydrophobic side chains. Additionally, ligand entry and exit require these structures to have a capacity for specific conformational change during binding and release. We conclude that an essential and structurally apparent separation of local and global sequence information is conserved throughout the iLBP family.

16.
An approach for deciding final clusters of a dendrogram is provided. Whether developed by agglomerative or divisive cluster analysis, decisions start at the 2-cluster level of the dendrogram. Cluster density is viewed as compactness and, therefore, is related to interindividual distances. The two clusters and the intercluster space are seen as treatments in analysis of variance with intercluster space as the control treatment. Distances in an interindividual matrix are considered measures of response to the three treatments. The F-ratio indicates if the treatment means differ; the least significant difference indicates which ones are different. The example provided explains our approach in searching for optimal clustering levels.

17.
Rosner B, Glynn RJ, Lee ML. Biometrics, 2006, 62(1): 185-192
The Wilcoxon signed rank test is a frequently used nonparametric test for paired data (e.g., consisting of pre- and posttreatment measurements) based on independent units of analysis. This test cannot be used for paired comparisons arising from clustered data (e.g., if paired comparisons are available for each of two eyes of an individual). To incorporate clustering, a generalization of the randomization test formulation for the signed rank test is proposed, where the unit of randomization is at the cluster level (e.g., person), while the individual paired units of analysis are at the subunit within cluster level (e.g., eye within person). An adjusted variance estimate of the signed rank test statistic is then derived, which can be used for either balanced (same number of subunits per cluster) or unbalanced (different number of subunits per cluster) data, with an exchangeable correlation structure, with or without tied values. The resulting test statistic is shown to be asymptotically normal as the number of clusters becomes large, if the cluster size is bounded. Simulation studies are performed based on simulating correlated ranked data from a signed log-normal distribution. These studies indicate appropriate type I error for data sets with ≥20 clusters and a superior power profile compared with either the ordinary signed rank test based on the average cluster difference score or the multivariate signed rank test of Puri and Sen. Finally, the methods are illustrated with two data sets, (i) an ophthalmologic data set involving a comparison of electroretinogram (ERG) data in retinitis pigmentosa (RP) patients before and after undergoing an experimental surgical procedure, and (ii) a nutritional data set based on a randomized prospective study of nutritional supplements in RP patients where vitamin E intake outside of study capsules is compared before and after randomization to monitor compliance with nutritional protocols.
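
The idea of randomizing at the cluster level can be sketched as an exact sign-flip test: rank the absolute paired differences over all subunits, sum the signed ranks within each cluster, and then flip the sign of each cluster's sum as a whole. This is a hedged illustration of the randomization formulation only. It does not implement the adjusted variance estimate or the asymptotic normal test described in the abstract, and the function names, the handling of zero differences (dropped) and the data are assumptions.

```python
from itertools import product

def signed_ranks(diffs):
    """Signed ranks of nonzero differences, using average ranks for ties."""
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the block over differences tied in absolute value.
        while end + 1 < len(order) and abs(diffs[order[end + 1]]) == abs(diffs[order[pos]]):
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            i = order[k]
            ranks[i] = avg if diffs[i] > 0 else -avg
        pos = end + 1
    return ranks

def clustered_sign_flip_test(clusters):
    """Return (statistic, exact two-sided p-value), flipping signs per cluster.
    Enumerates 2**len(clusters) sign patterns, so use only for few clusters."""
    pairs = [(c, d) for c, diffs in enumerate(clusters) for d in diffs if d != 0]
    ranks = signed_ranks([d for _, d in pairs])
    sums = [0.0] * len(clusters)
    for (c, _), r in zip(pairs, ranks):
        sums[c] += r
    observed = sum(sums)
    count = total = 0
    for signs in product((1, -1), repeat=len(sums)):
        stat = sum(s * t for s, t in zip(signs, sums))
        total += 1
        count += abs(stat) >= abs(observed) - 1e-12
    return observed, count / total

# Hypothetical post-minus-pre differences for two eyes in each of five people.
eyes = [[1.2, 0.8], [0.5, 0.9], [1.1, 0.7], [0.3, 0.6], [-0.2, 0.4]]
stat, p_value = clustered_sign_flip_test(eyes)
print(stat, p_value)
```

Flipping whole clusters rather than individual eyes keeps the within-person dependence intact under the null, which is the point of moving the unit of randomization to the cluster level.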

18.
19.
Discriminant analysis to evaluate clustering of gene expression data   Total citations: 1, self-citations: 0, citations by others: 1
In this work we present a procedure that combines classical statistical methods to assess the confidence of gene clusters identified by hierarchical clustering of expression data. This approach was applied to a publicly released Drosophila metamorphosis data set [White et al., Science 286 (1999) 2179-2184]. We have been able to produce reliable classifications of gene groups and genes within the groups by applying unsupervised (cluster analysis), dimension reduction (principal component analysis) and supervised methods (linear discriminant analysis) in a sequential form. This procedure provides a means to select relevant information from microarray data, reducing the number of genes and clusters that require further biological analysis.

20.
Massively parallel DNA sequencing is capable of sequencing tens of millions of DNA fragments at the same time. However, sequence bias in the initial cycles, which are used to determine the coordinates of individual clusters, causes a loss of fidelity in cluster identification on Illumina Genome Analysers. This can result in a significant reduction in the numbers of clusters that can be analysed. Such low sample diversity is an intrinsic problem of sequencing libraries that are generated by restriction enzyme digestion, such as e4C-seq or reduced-representation libraries. Similarly, this problem can also arise through the combined sequencing of barcoded, multiplexed libraries. We describe a procedure to defer the mapping of cluster coordinates until low-diversity sequences have been passed. This simple procedure can recover substantial amounts of next generation sequencing data that would otherwise be lost.
