Similar literature
Found 20 similar documents (search time: 31 ms)
1.
Book reviews     
Editing of community data matrices is complementary to analyzing data by multivariate techniques of classification and ordination in the overall task of data analysis. A computer program, DATAEDIT, is described that can perform numerous editing functions, including data transformation, deletion of certain species or samples, deletion of rare species, deletion of outliers, separation of disjunct sample groups, reordering of the species or samples of a data matrix, and the formation of composite samples or of sample subsets. DATAEDIT can use the information in a nonhierarchical or hierarchical classification, and includes its own internal routine for reciprocal averaging ordination. We appreciate valuable suggestions from the late Robert H. Whittaker, and from Philip Dixon, David Hicks, Laura Huenneke, Linda Olsvig-Whittaker, and Mark Wilson. Mark O. Hill kindly supplied a fast subroutine for reciprocal averaging.

2.
MOTIVATION: Clustering algorithms are widely used in the analysis of microarray data. In clinical studies, they are often applied to find groups of co-regulated genes. Clustering, however, can also stratify patients by similarity of their gene expression profiles, thereby defining novel disease entities based on molecular characteristics. Several distance-based cluster algorithms have been suggested, but little attention has been given to the distance measure between patients. Even with the Euclidean metric, including and excluding genes from the analysis leads to different distances between the same objects, and consequently different clustering results. RESULTS: We describe a new clustering algorithm, in which gene selection is used to derive biologically meaningful clusterings of samples by combining expression profiles and functional annotation data. According to gene annotations, candidate gene sets with specific functional characterizations are generated. Each set defines a different distance measure between patients, leading to different clusterings. These clusterings are filtered using a resampling-based significance measure. Significant clusterings are reported together with the underlying gene sets and their functional definition. CONCLUSIONS: Our method reports clusterings defined by biologically focused sets of genes. In annotation-driven clusterings, we have recovered clinically relevant patient subgroups through biologically plausible sets of genes as well as new subgroupings. We conjecture that our method has the potential to reveal so far unknown, clinically relevant classes of patients in an unsupervised manner. AVAILABILITY: We provide the R package adSplit as part of Bioconductor release 1.9 and on http://compdiag.molgen.mpg.de/software.

3.
Fast sequence clustering using a suffix array algorithm   Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: Efficient clustering is important for handling the large amount of available EST sequences. Most contemporary methods are based on some kind of all-against-all comparison, resulting in a quadratic time complexity. A different approach is needed to keep up with the rapid growth of EST data. RESULTS: A new, fast EST clustering algorithm is presented. Sub-quadratic time complexity is achieved by using an algorithm based on suffix arrays. A prototype implementation has been developed and run on a benchmark data set. The produced clusterings are validated by comparing them to clusterings produced by other methods, and the results are quite promising. AVAILABILITY: The source code for the prototype implementation is available under a GPL license from http://www.ii.uib.no/~ketil/bio/.
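
To illustrate why a suffix array helps here, the following minimal sketch builds a suffix array by naive sorting and then scans adjacent entries to find the longest substring shared by two sequences. It is not the algorithm from the abstract: the function names are hypothetical, the naive construction is itself slow, and a real EST clusterer would use a faster build and an overlap threshold. The point is only that shared substrings end up next to each other in sorted suffix order, so no all-against-all comparison is needed to find them.

```python
def build_suffix_array(text):
    """Start positions of all suffixes of text, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(a, b):
    """Length of the longest common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def longest_shared_substring(seq_a, seq_b, sep="#"):
    """Longest substring present in both sequences.

    Assumes sep occurs in neither sequence. The suffixes of the
    concatenation are sorted once; the best match is then found among
    neighbouring suffixes that start in different sequences.
    """
    text = seq_a + sep + seq_b
    boundary = len(seq_a)
    suffix_array = build_suffix_array(text)
    best = ""
    for i, j in zip(suffix_array, suffix_array[1:]):
        if (i < boundary) == (j < boundary):
            continue  # both suffixes start in the same sequence
        shared = common_prefix_len(text[i:], text[j:])
        if shared > len(best):
            best = text[i:i + shared]
    return best


print(build_suffix_array("banana"))  # [5, 3, 1, 0, 4, 2]
print(longest_shared_substring("ACGTACGT", "TTACGTAA"))
```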

4.
The phenetic analysis of non-nodulating Acacia species by Harrier et al. (1997) was repeated to illustrate how different computer programs may generate alternative UPGMA trees for the very same data, even in the absence of data input order effects (ties). For example, all Harrier et al.'s UPGMA dendrograms produced by software from the Scottish Agricultural Statistics Service differed from those obtained by the packages NTSYS and MVSP87. Particularly, the positions of A. albida, A. rovumae, and A. pentagona, as well as the relationships between Diacanthae and Triacanthae were affected by this phenomenon. Hence, whenever clustering techniques are used, care should be taken to consider possible software-dependent caveats and artefacts. Nevertheless, all programs provided clusterings that largely coincided with the subgeneric and sectional groupings proposed by Vassal (1972) although the positions of some species varied depending on whether morphological or molecular data were considered (e.g. A. albida and A. rovumae).

5.
MOTIVATION: The biologic significance of results obtained through cluster analyses of gene expression data generated in microarray experiments has been demonstrated in many studies. In this article we focus on the development of a clustering procedure based on the concept of Bayesian model-averaging and a precise statistical model of expression data. RESULTS: We developed a clustering procedure based on the Bayesian infinite mixture model and applied it to clustering gene expression profiles. Clusters of genes with similar expression patterns are identified from the posterior distribution of clusterings defined implicitly by the stochastic data-generation model. The posterior distribution of clusterings is estimated by a Gibbs sampler. We summarized the posterior distribution of clusterings by calculating posterior pairwise probabilities of co-expression and used the complete linkage principle to create clusters. This approach has several advantages over usual clustering procedures. The analysis allows for incorporation of a reasonable probabilistic model for generating data. The method does not require specifying the number of clusters, and the resulting optimal clustering is obtained by averaging over models with all possible numbers of clusters. Expression profiles that are not similar to any other profile are automatically detected, the method incorporates experimental replicates, and it can be extended to accommodate missing data. This approach represents a qualitative shift in the model-based cluster analysis of expression data because it allows for incorporation of uncertainties involved in the model selection in the final assessment of confidence in similarities of expression profiles. We also demonstrated the importance of incorporating the information on experimental variability into the clustering model.
AVAILABILITY: The MS Windows(TM) based program implementing the Gibbs sampler and supplemental material are available at http://homepages.uc.edu/~medvedm/BioinformaticsSupplement.htm CONTACT: medvedm@email.uc.edu

6.
7.
When a data set is repeatedly clustered using unsupervised techniques, the resulting clusterings, even if highly similar, may list their clusters in different orders. This so-called ‘label-switching’ phenomenon obscures meaningful differences between clusterings, complicating their comparison and summary. The problem often arises in the context of population structure analysis based on multilocus genotype data. In this field, a variety of popular tools apply model-based clustering, assigning individuals to a prespecified number of ancestral populations. Since such methods often involve stochastic components, it is common practice to perform multiple replicate analyses based on the same input data and parameter settings. Available postprocessing tools can mitigate label switching, but leave room for improvement, in particular regarding large input data sets. In this work, I present Crimp, a lightweight command-line tool, which offers a relatively fast and scalable heuristic to align clusters across replicate clusterings consisting of the same number of clusters. For small problem sizes, an exact algorithm can be used as an alternative. Additional features include row-specific weights, input and output files similar to those of CLUMPP (Jakobsson & Rosenberg, 2007) and the evaluation of a given solution in terms of CLUMPP as well as its own objective functions. Benchmark analyses show that Crimp, especially when applied to larger data sets, tends to outperform alternative tools considering runtime requirements and various quality measures. While primarily targeting population structure analysis, Crimp can be used as a generic tool to correct multiple clusterings for label switching. This facilitates their comparison and allows an averaged clustering to be generated. Crimp's computational efficiency makes it applicable even to relatively large data sets while offering competitive solution quality.
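
The label-switching problem described above can be sketched for a small number of clusters by trying every column permutation of one replicate against a reference. This is only an illustration under stated assumptions: the function name `align_clusters` is hypothetical, the squared-difference criterion is a simple stand-in rather than the objective function of Crimp or CLUMPP, and exhaustive search is feasible only for small cluster counts because the number of permutations grows factorially.

```python
from itertools import permutations


def align_clusters(reference, replicate):
    """Reorder the columns of replicate to best match reference.

    Both arguments are membership matrices (rows: individuals,
    columns: clusters). Returns (perm, aligned), where perm[c] is the
    replicate column moved to position c. Exhaustive search over all
    column permutations, so only suitable for small cluster counts.
    """
    k = len(reference[0])

    def mismatch(perm):
        # Sum of squared differences after applying the permutation.
        return sum(
            (ref_row[c] - rep_row[perm[c]]) ** 2
            for ref_row, rep_row in zip(reference, replicate)
            for c in range(k)
        )

    best = min(permutations(range(k)), key=mismatch)
    aligned = [[row[p] for p in best] for row in replicate]
    return best, aligned


reference = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]]
# Same memberships as reference, but with the cluster columns shuffled.
replicate = [[0.0, 0.9, 0.1], [0.1, 0.1, 0.8], [0.8, 0.0, 0.2]]
perm, aligned = align_clusters(reference, replicate)
print(perm)     # (1, 2, 0)
print(aligned)  # identical to reference
```

Once all replicates are aligned to a common reference, averaging them column by column gives the kind of averaged clustering the abstract mentions.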

8.
A theoretical framework based on Hill numbers has recently been advocated to measure and partition diversity sensu stricto. Hill numbers can be interpreted intuitively as effective number of species (ENS). They conform to the so‐called replication principle allowing a mathematically coherent multiplicative partitioning of diversity. They form a family of ENS defined by the parameter q which controls the weight attributed to rare species. Despite its advantages, this framework was developed without considering its robustness when treating community samples. In this study, we first show that Hurlbert diversity indices (expected number of species among k individuals) can be transformed into ENS that conform asymptotically to the replication principle while controlling the weight given to rare species through parameter k. We investigate the statistical properties of Hill and Hurlbert ENS using simulated communities with contrasted diversity. The properties of multiplicative beta diversity estimators based on ENS are also characterized by simulating communities with different levels of differentiation. We show that Hurlbert ENS provides a better statistical performance than Hill numbers when dealing with small sample sizes. By contrast, Hill numbers and their estimators suffer from substantial bias except when rare species have a low weight (q = 2). An estimator of ENS estimating both Hill numbers for q = 2 and Hurlbert ENS for k = 2 is shown to give the best performance and is recommended for processing real datasets when rare species receive low weight. In order to better take account of rare species, current estimators of Hill numbers are not recommended when sample size is too low while Hurlbert’s ENS performs reliably.
In conclusion, while Hill numbers possess some interesting mathematical properties that are not shared by Hurlbert’s ENS, the latter outperforms Hill numbers in terms of statistical properties and is well suited to processing community samples, as illustrated on a real dataset.
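
For readers unfamiliar with Hill numbers, the sketch below computes the standard plug-in value, (sum of p_i^q)^(1/(1-q)), with the q = 1 case taken as the exponential of Shannon entropy. The function name `hill_number` is hypothetical, and this is the naive estimate from observed abundances, not one of the bias-corrected estimators or the Hurlbert-based ENS studied in the abstract.

```python
import math


def hill_number(abundances, q):
    """Plug-in Hill number (effective number of species) of order q.

    abundances: non-negative counts per species, with a positive total.
    q = 0 gives species richness, q = 1 the exponential of Shannon
    entropy, q = 2 the inverse Simpson concentration.
    """
    total = sum(abundances)
    p = [a / total for a in abundances if a > 0]
    if q == 1:
        # Limit of the general formula as q approaches 1.
        return math.exp(-sum(pi * math.log(pi) for pi in p))
    return sum(pi ** q for pi in p) ** (1.0 / (1.0 - q))


even = [10, 10, 10, 10]
uneven = [90, 5, 5]
for q in (0, 1, 2):
    print(q, hill_number(even, q), hill_number(uneven, q))
```

For the perfectly even community every order gives 4 effective species, while for the uneven one the value drops as q grows, which is the sense in which q controls the weight given to rare species.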

9.
A mixture model-based approach to the clustering of microarray expression data   Total citations: 13, self-citations: 0, citations by others: 13
MOTIVATION: This paper introduces the software EMMIX-GENE that has been developed for the specific purpose of a model-based approach to the clustering of microarray expression data, in particular, of tissue samples on a very large number of genes. The latter is a nonstandard problem in parametric cluster analysis because the dimension of the feature space (the number of genes) is typically much greater than the number of tissues. A feasible approach is provided by first selecting a subset of the genes relevant for the clustering of the tissue samples by fitting mixtures of t distributions to rank the genes in order of increasing size of the likelihood ratio statistic for the test of one versus two components in the mixture model. The imposition of a threshold on the likelihood ratio statistic used in conjunction with a threshold on the size of a cluster allows the selection of a relevant set of genes. However, even this reduced set of genes will usually be too large for a normal mixture model to be fitted directly to the tissues, and so the use of mixtures of factor analyzers is exploited to reduce effectively the dimension of the feature space of genes. RESULTS: The usefulness of the EMMIX-GENE approach for the clustering of tissue samples is demonstrated on two well-known data sets on colon and leukaemia tissues. For both data sets, relevant subsets of the genes can be selected that reveal interesting clusterings of the tissues that are either consistent with the external classification of the tissues or with background and biological knowledge of these sets. AVAILABILITY: EMMIX-GENE is available at http://www.maths.uq.edu.au/~gjm/emmix-gene/

10.
Detrended correspondence analysis: An improved ordination technique   Total citations: 61, self-citations: 0, citations by others: 61
Summary: Detrended correspondence analysis (DCA) is an improvement upon the reciprocal averaging (RA) ordination technique. RA has two main faults: the second axis is often an arch or horseshoe distortion of the first axis, and distances in the ordination space do not have a consistent meaning in terms of compositional change (in particular, distances at the ends of the first RA axis are compressed relative to the middle). DCA corrects these two faults. Tests with simulated and field data show DCA superior to RA and to nonmetric multidimensional scaling in giving clear, interpretable results. DCA has several advantages. (a) Its performance is the best of the ordination techniques tested, and both species and sample ordinations are produced simultaneously. (b) The axes are scaled in standard deviation units with a definite meaning. (c) As implemented in a FORTRAN program called DECORANA, computing time rises only linearly with the amount of data analyzed, and only positive entries in the data matrix are stored in memory, so very large data sets present no difficulty. However, DCA has limitations, making it best to remove extreme outliers and discontinuities prior to analysis. DCA consistently gives the most interpretable ordination results, but as always the interpretation of results remains a matter of ecological insight and is improved by field experience and by integration of supplementary environmental data for the vegetation sample sites. This research was supported by the Institute of Terrestrial Ecology, Bangor, Wales, and by a grant from the National Science Foundation to R.H. Whittaker. We thank R.H. Whittaker for encouragement and comments, S.B. Singer for assistance with the Cornell computer, and H.J.B. Birks, S.R. Sabo, T.C.E. Wells, and R.H. Whittaker for data sets used for ordination tests.

11.
MOTIVATION: Consensus clustering, also known as cluster ensemble, is one of the important techniques for microarray data analysis, and is particularly useful for class discovery from microarray data. Compared with traditional clustering algorithms, consensus clustering approaches have the ability to integrate multiple partitions from different cluster solutions to improve the robustness, stability, scalability and parallelization of the clustering algorithms. By consensus clustering, one can discover the underlying classes of the samples in gene expression data. RESULTS: In addition to exploring a graph-based consensus clustering (GCC) algorithm to estimate the underlying classes of the samples in microarray data, we also design a new validation index to determine the number of classes in microarray data. To our knowledge, this is the first time that GCC has been applied to class discovery for microarray data. Given a prespecified maximum number of classes (denoted as K(max) in this article), our algorithm can discover the true number of classes for the samples in microarray data according to a new cluster validation index called the Modified Rand Index. Experiments on gene expression data indicate that our new algorithm can (i) outperform most of the existing algorithms, (ii) identify the number of classes correctly in real cancer datasets, and (iii) discover the classes of samples with biological meaning. AVAILABILITY: Matlab source code for the GCC algorithm is available upon request from Zhiwen Yu.

12.
The large variety of clustering algorithms and their variants can be daunting to researchers wishing to explore patterns within their microarray datasets. Furthermore, each clustering method has distinct biases in finding patterns within the data, and clusterings may not be reproducible across different algorithms. A consensus approach utilizing multiple algorithms can show where the various methods agree and expose robust patterns within the data. In this paper, we present a software package - Consense, written for R/Bioconductor - that utilizes such an approach to explore microarray datasets. Consense produces clustering results for each of the clustering methods and produces a report of metrics comparing the individual clusterings. A feature of Consense is identification of genes that cluster consistently with an index gene across methods. Utilizing simulated microarray data, sensitivity of the metrics to the biases of the different clustering algorithms is explored. The framework is easily extensible, allowing this tool to be used by other functional genomic data types, as well as other high-throughput OMICS data types generated from metabolomic and proteomic experiments. It also provides a flexible environment to benchmark new clustering algorithms. Consense is currently available as an installable R/Bioconductor package (http://www.ohsucancer.com/isrdev/consense/).

13.
MOTIVATION: Over the last decade, a large variety of clustering algorithms have been developed to detect coregulatory relationships among genes from microarray gene expression data. Model-based clustering approaches have emerged as statistically well-grounded methods, but the properties of these algorithms when applied to large-scale data sets are not always well understood. An in-depth analysis can reveal important insights about the performance of the algorithm, the expected quality of the output clusters, and the possibilities for extracting more relevant information out of a particular data set. RESULTS: We have extended an existing algorithm for model-based clustering of genes to simultaneously cluster genes and conditions, and used three large compendia of gene expression data for Saccharomyces cerevisiae to analyze its properties. The algorithm uses a Bayesian approach and a Gibbs sampling procedure to iteratively update the cluster assignment of each gene and condition. For large-scale data sets, the posterior distribution is strongly peaked on a limited number of equiprobable clusterings. A GO annotation analysis shows that these local maxima are all biologically equally significant, and that simultaneously clustering genes and conditions performs better than only clustering genes and assuming independent conditions. A collection of distinct equivalent clusterings can be summarized as a weighted graph on the set of genes, from which we extract fuzzy, overlapping clusters using a graph spectral method. The cores of these fuzzy clusters contain tight sets of strongly coexpressed genes, while the overlaps exhibit relations between genes showing only partial coexpression. AVAILABILITY: GaneSh, a Java package for coclustering, is available under the terms of the GNU General Public License from our website at http://bioinformatics.psb.ugent.be/software

14.
Terminal restriction fragment length polymorphism (T-RFLP) is a culture-independent method of obtaining a genetic fingerprint of the composition of a microbial community. Comparisons of the utility of different methods of (i) including peaks, (ii) computing the difference (or distance) between profiles, and (iii) performing statistical analysis were made by using replicated profiles of eubacterial communities. These samples included soil collected from three regions of the United States, soil fractions derived from three agronomic field treatments, soil samples taken from within one meter of each other in an alfalfa field, and replicate laboratory bioreactors. Cluster analysis by Ward's method and by the unweighted-pair group method using arithmetic averages (UPGMA) were compared. Ward's method was more effective at differentiating major groups within sets of profiles; UPGMA had a slightly reduced error rate in clustering of replicate profiles and was more sensitive to outliers. Most replicate profiles were clustered together when relative peak height or Hellinger-transformed peak height was used, in contrast to raw peak height. Redundancy analysis was more effective than cluster analysis at detecting differences between similar samples. Redundancy analysis using Hellinger distance was more sensitive than that using Euclidean distance between relative peak height profiles. Analysis of Jaccard distance between profiles, which considers only the presence or absence of a terminal restriction fragment, was the most sensitive in redundancy analysis, and was equally sensitive in cluster analysis, if all profiles had cumulative peak heights greater than 10,000 fluorescence units. It is concluded that T-RFLP is a sensitive method of differentiating between microbial communities when the optimal statistical method is used for the situation at hand. 
It is recommended that hypothesis testing be performed by redundancy analysis of Hellinger-transformed data and that exploratory data analysis be performed by cluster analysis using Ward's method to find natural groups or by UPGMA to identify potential outliers. Analyses can also be based on Jaccard distance if all profiles have cumulative peak heights greater than 10,000 fluorescence units.
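
The two profile distances recommended above can be sketched as follows, assuming each T-RFLP profile is a list of peak heights over the same ordered set of fragment sizes with a nonzero total. The function names are hypothetical. Here the Hellinger distance is taken as the Euclidean distance between square-rooted relative peak heights, and the Jaccard distance uses only presence or absence of each fragment.

```python
import math


def hellinger_transform(profile):
    """Square root of each peak's share of the total peak height."""
    total = sum(profile)
    return [math.sqrt(height / total) for height in profile]


def hellinger_distance(a, b):
    """Euclidean distance between Hellinger-transformed profiles."""
    ta, tb = hellinger_transform(a), hellinger_transform(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ta, tb)))


def jaccard_distance(a, b):
    """1 - (shared fragments / fragments present in either profile)."""
    present_a = {i for i, height in enumerate(a) if height > 0}
    present_b = {i for i, height in enumerate(b) if height > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1.0 - len(present_a & present_b) / len(union)


sample_1 = [100, 0, 300]
sample_2 = [200, 0, 600]   # same composition, twice the signal
sample_3 = [0, 500, 0]     # shares no fragment with sample_1
print(hellinger_distance(sample_1, sample_2), jaccard_distance(sample_1, sample_2))
print(hellinger_distance(sample_1, sample_3), jaccard_distance(sample_1, sample_3))
```

Because both measures ignore total signal, the first pair comes out as identical, which is consistent with the abstract's finding that relative or Hellinger-transformed heights group replicates better than raw peak heights.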

16.
Ant-based and swarm-based clustering   Total citations: 3, self-citations: 0, citations by others: 3
Clustering with swarm-based algorithms is emerging as an alternative to more conventional clustering methods, such as hierarchical clustering and k-means. Ant-based clustering stands out as the most widely used group of swarm-based clustering algorithms. Broadly speaking, there are two main types of ant-based clustering: the first group of methods directly mimics the clustering behavior observed in real ant colonies. The second group is less directly inspired by nature: the clustering task is reformulated as an optimization task and general purpose ant-based optimization heuristics are utilized to find good or near-optimal clusterings. This paper reviews both approaches and places these methods in the wider context of general swarm-based clustering approaches.

17.
Synopsis: Data matrices of fish stomach contents frequently contain many zeros, and nonzero values often do not follow usually encountered statistical distributions. Therefore, many common methods of statistical analysis are inappropriate for such data. A method of repeated k-means cluster analysis is proposed for exploratory analysis of data sets on fish stomach contents. Objective rules are proposed for setting the clustering parameters, so the arbitrariness and subjectivity common in interpreting hierarchical clustering methods are avoided. Because the clusters are nonhierarchical, the analysis method also requires much less computer time and memory. Application of the method is illustrated with a data set of 1771 stomachs of cod (Gadus morhua), feeding on 38 different prey types. The results of the clusterings reveal that nine types of prey may account for the systematic information about the diet of cod in this sample from the northern Grand Bank in spring of 1979. The results are also used to test specific hypotheses about size selectivity of the predator, spatial variation of feeding, environmental influences on diet, and relative preferences among prey taxa.

18.
Consensus clustering involves combining multiple clusterings of the same set of objects to achieve a single clustering that will, hopefully, provide a better picture of the groupings that are present in a dataset. This Letter reports the use of consensus clustering methods on sets of chemical compounds represented by 2D fingerprints. Experiments with DUD, IDAlert, MDDR and MUV data suggest that consensus methods are unlikely to result in significant improvements in clustering effectiveness as compared to the use of a single clustering method.
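
One common way to combine several clusterings, sketched below, is a co-association matrix: for each pair of objects, record the fraction of clusterings that place them together, then group objects that co-occur more often than a threshold. This is an illustrative assumption rather than the specific consensus methods evaluated in the Letter; the function names and the 0.5 threshold are hypothetical choices.

```python
def co_association(clusterings):
    """Fraction of clusterings in which each pair of objects shares a cluster.

    clusterings: list of label lists, one label per object, all of equal
    length. Label values need not agree between clusterings.
    """
    n = len(clusterings[0])
    m = len(clusterings)
    return [
        [sum(labels[i] == labels[j] for labels in clusterings) / m
         for j in range(n)]
        for i in range(n)
    ]


def consensus_clusters(clusterings, threshold=0.5):
    """Connected components of the graph linking pairs above threshold."""
    matrix = co_association(clusterings)
    n = len(matrix)
    labels = [None] * n
    next_label = 0
    for start in range(n):
        if labels[start] is not None:
            continue
        labels[start] = next_label
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] is None and matrix[i][j] > threshold:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


runs = [
    [0, 0, 1, 1, 1],
    [1, 1, 0, 0, 0],  # same partition as the first run, labels swapped
    [0, 0, 0, 1, 1],  # disagrees about the third object
]
print(consensus_clusters(runs))  # [0, 0, 1, 1, 1]
```

Working on pairwise co-occurrence sidesteps label switching, since only whether two objects share a cluster matters, not what the cluster is called.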

19.
Four of the most common limitations of the many available clustering methods are: i) the lack of a proper strategy to deal with outliers; ii) the need for a good a priori estimate of the number of clusters to obtain reasonable results; iii) the lack of a method able to detect when partitioning of a specific data set is not appropriate; and iv) the dependence of the result on the initialization. Here we propose Cross-clustering (CC), a partial clustering algorithm that overcomes these four limitations by combining the principles of two well established hierarchical clustering algorithms: Ward’s minimum variance and Complete-linkage. We validated CC by comparing it with a number of existing clustering methods, including Ward’s and Complete-linkage. We show, on both simulated and real datasets, that CC performs better than the other methods in terms of: the identification of the correct number of clusters, the identification of outliers, and the determination of real cluster memberships. We used CC to cluster samples in order to identify disease subtypes, and on gene profiles, in order to determine groups of genes with the same behavior. Results obtained on a non-biological dataset show that the method is general enough to be successfully used in such diverse applications. The algorithm has been implemented in the statistical language R and is freely available from the CRAN contributed packages repository.

20.
With the advancement of microarray technology, it is now possible to study the expression profiles of thousands of genes across different experimental conditions or tissue samples simultaneously. Microarray cancer datasets, organized in samples-versus-genes fashion, are being used for classification of tissue samples into benign and malignant or their subtypes. They are also useful for identifying potential gene markers for each cancer subtype, which helps in successful diagnosis of particular cancer types. In this article, we have presented an unsupervised cancer classification technique based on multiobjective genetic clustering of the tissue samples. In this regard, a real-coded encoding of the cluster centers is used and cluster compactness and separation are simultaneously optimized. The resultant set of near-Pareto-optimal solutions contains a number of non-dominated solutions. A novel approach to combine the clustering information possessed by the non-dominated solutions through a Support Vector Machine (SVM) classifier has been proposed. Final clustering is obtained by consensus among the clusterings yielded by different kernel functions. The performance of the proposed multiobjective clustering method has been compared with that of several other microarray clustering algorithms for three publicly available benchmark cancer datasets. Moreover, statistical significance tests have been conducted to establish the statistical superiority of the proposed clustering method. Furthermore, relevant gene markers have been identified using the clustering result produced by the proposed clustering method and demonstrated visually. Biological relationships among the gene markers are also studied based on gene ontology. The results obtained are found to be promising and can possibly have an important impact in the area of unsupervised cancer classification as well as gene marker identification for multiple cancer subtypes.
