Similar literature
1.
MOTIVATION: Clustering is a popular technique for finding groups of genes that show similar expression patterns under multiple experimental conditions. Many clustering methods have been proposed for gene-expression data, including hierarchical clustering, k-means clustering and the self-organizing map (SOM). However, these conventional methods are limited in their ability to identify clusters of different shapes because they use a fixed distance norm when calculating the distance between genes. A fixed distance norm imposes a fixed geometrical shape on the clusters regardless of the actual data distribution, so different distance norms are required to handle differently shaped clusters. RESULTS: We present the Gustafson-Kessel (GK) clustering method for microarray gene-expression data. To detect clusters of different shapes in a dataset, we use an adaptive distance norm calculated from the fuzzy covariance matrix (F) of each cluster, in which the eigenstructure of F serves as an indicator of the cluster's shape. Moreover, the GK method is less prone to falling into local minima than k-means and SOM because it makes decisions through the membership degrees of a gene to clusters. The algorithm uses alternating optimization, which iteratively improves a sequence of sets of clusters until no further improvement is possible. To test its performance, we applied the GK method and well-known conventional methods to three recently published yeast datasets and compared the results using the Saccharomyces Genome Database annotations. The clustering results of the GK method are significantly more relevant to the biological annotations than those of the other methods, demonstrating its effectiveness and potential for clustering gene-expression data.
AVAILABILITY: The software was developed in Java and runs on any platform with a Java Virtual Machine (JVM). It is available from the authors upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at http://dragon.kaist.ac.kr/gk.
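The adaptive distance norm described in this abstract can be illustrated with a minimal sketch. It assumes the textbook Gustafson-Kessel formulation, in which the squared distance is (x - v)^T A (x - v) with A = (rho * det F)^(1/n) * F^(-1), restricted here to 2-D points so that the determinant and inverse can be written out by hand. The function names, the fuzzifier m = 2 and the volume parameter rho = 1 are illustrative assumptions, not details taken from the authors' Java software.

```python
def fuzzy_covariance_2d(points, memberships, center, m=2.0):
    """Fuzzy covariance matrix F of one cluster for 2-D points.

    memberships[k] is the membership degree of points[k] in this cluster;
    m is the fuzzifier (assumed value, not from the source).
    """
    weights = [u ** m for u in memberships]
    total = sum(weights)
    sxx = sxy = syy = 0.0
    for (x, y), w in zip(points, weights):
        dx, dy = x - center[0], y - center[1]
        sxx += w * dx * dx
        sxy += w * dx * dy
        syy += w * dy * dy
    return [[sxx / total, sxy / total], [sxy / total, syy / total]]


def gk_squared_distance_2d(point, center, F, rho=1.0):
    """Squared GK distance (x - v)^T A (x - v) in 2-D.

    A = (rho * det F)^(1/2) * F^-1, so the norm adapts to the cluster shape
    while the cluster volume stays fixed by rho (assumed to be 1).
    """
    (a, b), (c, d) = F
    det = a * d - b * c
    scale = (rho * det) ** 0.5
    inv = [[d / det, -b / det], [-c / det, a / det]]
    dx, dy = point[0] - center[0], point[1] - center[1]
    quad = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return scale * quad


# An elongated cluster: points spread further along x than along y.
pts = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
F = fuzzy_covariance_2d(pts, [1.0, 1.0, 1.0, 1.0], (0.0, 0.0))
d_long = gk_squared_distance_2d((2.0, 0.0), (0.0, 0.0), F)
d_short = gk_squared_distance_2d((0.0, 1.0), (0.0, 0.0), F)
```

In this example the two points lie at different Euclidean distances from the center but at the same GK distance, which is the sense in which the norm follows the elongated shape of the cluster.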

2.
The movement of frugivores between remnant forests and successional areas is vital for tropical forest tree species to colonize successional habitats. The response of these species to the spatial structure of pasture tree cover is largely unknown. We studied avian frugivores found in primary forest edges and large pastures in eastern Amazonia, Brazil. We determined how the small-scale spatial structure of pasture trees at forest edges affects five response variables: bird presence, visitation rate, duration of visit, species richness, and an index accounting for species' level of frugivory and abundance in forests. We used hierarchical linear models to estimate the effect of four predictor variables on these response variables: (1) clustering of pasture trees; (2) percent canopy cover of pasture trees; (3) distance of pasture tree to forest edge; and (4) tree crown area. The study species, many of which are widely distributed in the Neotropics, were generally insensitive to percent cover and clustering of trees. Frugivore visitation to individual trees remained constant as cover increased. Visitation was positively correlated with focal tree distance to forest edge and with crown area. The positive relationship between distance and visitation rate may be due to the increased abundance of some resource further from forests. If pastures were abandoned, distance from forest edges would be unlikely to limit frugivore visitation and seed deposition under large pasture trees within the range covered by our study (i.e., up to 200 m from the edge).

3.
Bio3D is a family of R packages for the analysis of biomolecular sequence, structure, and dynamics. Major functionality includes biomolecular database searching and retrieval, sequence and structure conservation analysis, ensemble normal mode analysis, protein structure and correlation network analysis, principal component analysis, and related multivariate analysis methods. Here, we review recent package developments, including a new underlying segregation into separate packages for distinct analyses, and introduce a new method for structure analysis named ensemble difference distance matrix analysis (eDDM). The eDDM approach calculates and compares atomic distance matrices across large sets of homologous atomic structures to help identify the residue-wise determinants underlying specific functional processes. An eDDM workflow is detailed along with an example application to a large protein family. As a new member of the Bio3D family, the Bio3D-eddm package supports both experimental and simulation-generated structures, is integrated with other methods for dissecting sequence-structure-function relationships, and can be used in a highly automated and reproducible manner. Bio3D is distributed as an integrated set of platform-independent open-source R packages available from: http://thegrantlab.org/bio3d/ .

4.
MOTIVATION: Large-scale gene expression data are often analysed by clustering genes on the basis of expression data alone, even though a priori knowledge in the form of biological networks is available. Using this additional information promises to improve exploratory analysis considerably. RESULTS: We propose constructing a distance function that combines information from expression data and biological networks. Based on this function, we compute a joint clustering of genes and vertices of the network. This general approach is elaborated for metabolic networks. We define a graph distance function on such networks and combine it with a correlation-based distance function for gene expression measurements. A hierarchical clustering and an associated statistical measure are computed to arrive at a reasonable number of clusters. Our method is validated using expression data of the yeast diauxic shift. The resulting clusters are easily interpretable in terms of the biochemical network and the gene expression data, and suggest that our method is able to automatically identify processes that are relevant under the measured conditions.
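As a rough illustration of combining a correlation-based expression distance with a graph distance, the sketch below uses a weighted sum of the two. The abstract does not give the actual combination rule, so the weighted sum, the hop-count graph distance, the `weight` and `max_hops` parameters and all function names are assumptions made only for this example.

```python
from collections import deque
from math import sqrt


def correlation_distance(x, y):
    """1 - Pearson correlation between two expression profiles (range 0..2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def graph_distance(adjacency, source, target):
    """Shortest-path hop count by breadth-first search; None if unreachable."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for neighbor in adjacency.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None


def combined_distance(x, y, adjacency, source, target, weight=0.5, max_hops=10):
    """Weighted sum of both distances, each scaled to [0, 1] (assumed rule)."""
    hops = graph_distance(adjacency, source, target)
    g = 1.0 if hops is None else min(hops, max_hops) / max_hops
    e = correlation_distance(x, y) / 2.0
    return weight * e + (1.0 - weight) * g


# Hypothetical three-node network and two perfectly correlated profiles.
network = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
d = combined_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0], network, "a", "c")
```

A matrix of such pairwise values could then be passed to any hierarchical clustering routine that accepts a precomputed distance matrix.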

5.
Previously, we observed that, without using prior information about individual sampling locations, a clustering algorithm applied to multilocus genotypes from worldwide human populations produced genetic clusters largely coincident with major geographic regions. It has been argued, however, that the degree of clustering is diminished by use of samples with greater uniformity in geographic distribution, and that the clusters we identified were a consequence of uneven sampling along genetic clines. Expanding our earlier dataset from 377 to 993 markers, we systematically examine the influence of several study design variables (sample size, number of loci, number of clusters, assumptions about correlations in allele frequencies across populations, and the geographic dispersion of the sample) on the "clusteredness" of individuals. With all other variables held constant, geographic dispersion is seen to have comparatively little effect on the degree of clustering. Examination of the relationship between genetic and geographic distance supports a view in which the clusters arise not as an artifact of the sampling scheme, but from small discontinuous jumps in genetic distance for most population pairs on opposite sides of geographic barriers, in comparison with genetic distance for pairs on the same side. Thus, analysis of the 993-locus dataset corroborates our earlier results: if enough markers are used with a sufficiently large worldwide sample, individuals can be partitioned into genetic clusters that match major geographic subdivisions of the globe, with some individuals from intermediate geographic locations having mixed membership in the clusters that correspond to neighboring regions.

6.
Mayday is a workbench for visualization, analysis and storage of microarray data. It features a graphical user interface and supports the development and integration of existing and new analysis methods. Besides the infrastructural core functionality, Mayday offers a variety of plug-ins, such as interactive viewers, a connection to the R statistical environment, a connection to SQL-based databases, and different data mining methods, including WEKA-library-based methods for classification and various clustering methods. In addition, so-called meta information objects are provided for annotating the microarray data, which allows data from different sources to be integrated; this feature is employed, for instance, in the enhanced heatmap visualization. Supplementary information: The software and more detailed information, including screenshots, a user guide and test data, can be found on the Mayday home page http://www.zbit.uni-tuebingen.de/pas/mayday. The core is published under the GPL (GNU General Public License) and the associated plug-ins under the LGPL (GNU Lesser General Public License).

7.
Distance-based clustering of CGH data   Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples. The goal is to develop a systematic way of placing patients with similar CGH imbalance profiles into the same cluster. Our expectation is that patients with the same cancer type will generally belong to the same cluster, as their underlying CGH profiles will be similar. RESULTS: We focus on distance-based clustering strategies, which proceed in two steps: (1) distances between all pairs of CGH samples are computed; (2) CGH samples are clustered based on these distances. We develop three pairwise distance/similarity measures, namely raw, cosine and sim. The raw measure disregards correlation between contiguous genomic intervals and compares the aberrations in each genomic interval separately. The remaining measures assume that consecutive genomic intervals may be correlated. Cosine maps pairs of CGH samples into vectors in a high-dimensional space and measures the angle between them. Sim measures the number of independent common aberrations. We test our distance/similarity measures on three well-known clustering algorithms: bottom-up, top-down, and k-means with and without centroid shrinking. Our results show that sim consistently performs better than the remaining measures. This indicates that the correlation of neighboring genomic intervals should be considered in the structural analysis of CGH datasets. The combination of sim with top-down clustering emerged as the best approach. AVAILABILITY: All software developed in this article and all the datasets are available from the authors upon request. CONTACT: juliu@cise.ufl.edu.
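To make the raw and cosine measures more concrete, here is a small sketch. It assumes each CGH sample is encoded as a vector with one entry per genomic interval (+1 for gain, -1 for loss, 0 for no aberration); this encoding and both function names are assumptions for illustration, and the paper's own definitions (in particular of sim, which is not attempted here) may differ.

```python
from math import sqrt


def raw_similarity(a, b):
    """Count intervals where both samples carry the same aberration.

    Each interval is compared separately, ignoring its neighbors.
    """
    return sum(1 for x, y in zip(a, b) if x != 0 and x == y)


def cosine_similarity(a, b):
    """Cosine of the angle between two samples viewed as vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Two hypothetical samples over four genomic intervals.
sample_1 = [1, 1, 0, -1]
sample_2 = [1, 0, 0, -1]
shared = raw_similarity(sample_1, sample_2)
angle_cos = cosine_similarity(sample_1, sample_2)
```

Either value can be turned into a distance (for example, 1 minus the cosine) before being handed to a bottom-up or top-down clustering routine.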

8.

Background

Many clustering procedures only allow the user to input a pairwise dissimilarity or distance measure between objects. We propose a clustering method that accepts a multi-point dissimilarity measure d(i1, i2, ..., iP), where the number of points P can be larger than 2. The work is motivated by gene network analysis, where clusters correspond to modules of highly interconnected nodes. Here, we define modules as clusters of network nodes with high multi-node topological overlap. The topological overlap measure is a robust measure of interconnectedness based on shared network neighbors. In previous work, we have shown that the multi-node topological overlap measure yields biologically meaningful results when used as input to network neighborhood analysis.

Findings

We adapt network neighborhood analysis for module detection. We propose the Module Affinity Search Technique (MAST), a generalized version of the Cluster Affinity Search Technique (CAST). MAST can accommodate a multi-node dissimilarity measure. Clusters grow around user-defined or automatically chosen seeds (e.g. hub nodes). We propose both local and global stopping rules for cluster growth. We use several simulations and a gene co-expression network application to argue that the MAST approach leads to biologically meaningful results. We compare MAST with hierarchical clustering and partitioning around medoids clustering.

Conclusion

Our flexible module detection method is implemented in the MTOM software, which can be downloaded from the following webpage: http://www.genetics.ucla.edu/labs/horvath/MTOM/

9.
MOTIVATION: Biologists often employ clustering techniques in the exploratory phase of microarray data analysis to discover relevant biological groupings. Given the availability of numerous clustering algorithms in the machine-learning literature, a user might want to select the one that performs best for their data set or application. Various validation measures have been proposed over the years to judge the quality of the clusters produced by a given algorithm, including their biological relevance. Unfortunately, a given clustering algorithm can perform poorly under one validation measure while outperforming many other algorithms under another. A manual synthesis of results from multiple validation measures is nearly impossible in practice, especially when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed. RESULTS: Using a Monte Carlo cross-entropy algorithm, we successfully combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows for a far more objective and automated assessment of clustering results than a simple visual inspection. We illustrate our procedure using one simulated and three real gene expression data sets from various platforms, ranking a total of eleven clustering algorithms by a combined examination of 10 different validation measures. The aggregate rankings were found for a given number of clusters k and also for an entire range of k. AVAILABILITY: R code for all validation measures and rank aggregation is available from the authors upon request. SUPPLEMENTARY INFORMATION: Supplementary information is available at http://www.somnathdatta.org/Supp/RankCluster/supp.htm.
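The idea of a weighted rank aggregation that optimizes a distance criterion can be sketched on a toy scale. The example below swaps the Monte Carlo cross-entropy search for an exhaustive search over all permutations, which is only feasible for a handful of items, and it assumes the Spearman footrule as the distance; both choices, and the function names, are illustrative assumptions rather than the authors' R implementation.

```python
from itertools import permutations


def footrule(candidate, ranking):
    """Spearman footrule: total displacement of items between two rankings."""
    position = {item: i for i, item in enumerate(ranking)}
    return sum(abs(i - position[item]) for i, item in enumerate(candidate))


def aggregate(rankings, weights):
    """Return the ordering that minimizes the weighted sum of footrule distances.

    Exhaustive search stands in for the cross-entropy Monte Carlo optimizer.
    """
    items = rankings[0]

    def cost(candidate):
        return sum(w * footrule(candidate, r) for r, w in zip(rankings, weights))

    return list(min(permutations(items), key=cost))


# Three hypothetical validation measures ranking algorithms A, B and C.
rankings = [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]]
consensus = aggregate(rankings, [1.0, 1.0, 1.0])
```

With eleven algorithms there are about 40 million permutations, which is why a stochastic search such as the cross-entropy method is the practical choice at the scale described in the abstract.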

10.
11.
12.
Yang K, Zhang L. Planta, 2008, 228(3): 439-447
With the exponential growth of genomics data, the demand for reliable clustering methods is increasing every day. Despite the wide usage of many clustering algorithms, their accuracy has been evaluated mostly on simulated data sets and seldom on real biological data for which a "correct answer" is available. To address this issue, we use the manually curated, high-quality Arabidopsis thaliana gene family database as a "gold standard" to conduct a comprehensive comparison of the accuracies of four widely used clustering methods: K-means, TribeMCL, single-linkage clustering and complete-linkage clustering. We compare the results of running the different clustering methods on two matrices: the E-value matrix and the k-tuple distance matrix. The E-value matrix is computed from BLAST E-values. The k-tuple distance matrix is computed from the difference in tuple frequencies. TribeMCL with the E-value matrix performed best, with the Inflation parameter (=1.15) tuned considerably lower than what has been suggested previously (=2). Single-linkage clustering with the E-value matrix was second best. Single-linkage clustering, K-means clustering, complete-linkage clustering, and TribeMCL with a k-tuple distance matrix performed reasonably well. Complete-linkage clustering with the k-tuple distance matrix performed the worst.

13.
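AFLP was used to analyze diversity among 26 bamboo taxa in order to explore how the number of primer combinations affects the accuracy of inferred systematic relationships among bamboo groups. Ten AFLP primer pairs were selected at random, and the resulting 10 sets of AFLP marker data were randomly combined and subjected to Nei's genetic distance/UPGMA cluster analysis. The amplification data from each primer pair formed one set. As the number of randomly combined AFLP marker data sets used for clustering increased, the clustering relationships among the 26 bamboo taxa tended to converge. This suggests that, in systematic studies, a sufficient number of primer combinations is the basis for obtaining accurate clustering relationships among the materials tested. Cluster analysis should be carried out on randomly accumulated data from the AFLP primer combinations, with the clustering relationships used as the criterion for determining the minimum number of primer combinations needed to analyze the tested varieties.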

14.
An automated procedure for the analysis of homologous protein structures has been developed. The method facilitates the characterization of internal conformational differences and inter-conformer relationships and provides a framework for the analysis of protein structural evolution. The method is implemented in bio3d, an R package for the exploratory analysis of structure and sequence data. AVAILABILITY: The bio3d package is distributed with full source code as a platform-independent R package under a GPL2 license from: http://mccammon.ucsd.edu/~bgrant/bio3d/

15.
16.
Multi-scale analysis of population distribution patterns   Total citations: 40, self-citations: 1, citations by others: 40
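Analysis of population distribution patterns is important for understanding the spatial distribution of populations and intra- and interspecific relationships. Nearest neighbor analysis (NNA), an important method for analyzing the spatial distribution patterns of populations, is restricted to single-scale analysis. Improving NNA so that it can be applied to multi-scale analysis would help address the scale dependence of population patterns. Building on earlier work, this paper proposes extended nearest neighbor analysis (ENNA), which adds a distance scale parameter d (m) to the traditional Clark-Evans index formula. The corresponding Clark-Evans index CE(d) and its significance test statistic u(d) are defined as $CE(d) = \bar{r}_{dA} / \bar{r}_{dE} = \left( \frac{1}{N_d} \sum_{i=1}^{N_d} r_{di} \right) / \left( 0.5 \sqrt{A_d / N_d} + 0.0514 \, P_d / N_d + 0.041 \, P_d / N_d^{3/2} \right)$ and $u(d) = (\bar{r}_{dA} - \bar{r}_{dE}) / \sigma_d$. Within distance scale d (m), $\bar{r}_{dA}$ is the mean distance (m) between each individual in the plot and its nearest neighbor, $\bar{r}_{dE}$ is the mean nearest-neighbor distance (m) when individuals in the same environment are randomly distributed, $N_d$ is the total number of individuals in the plot, $r_{di}$ is the distance (m) between the i-th individual and its nearest neighbor, $A_d$ is the plot area (m²), $P_d$ is the plot perimeter (m), and $\sigma_d$ is the standard deviation. Scale transformation in ENNA uses a procedure similar to that for computing the sandbox dimension in fractal theory, while the criteria for judging pattern type are the same as in traditional nearest neighbor analysis. The traditional nearest neighbor analysis result is the special case of ENNA in which the distance scale d takes its maximum value $d_{max}$. A case study was carried out on the ArcView GIS platform using five representative populations in a mixed coniferous and broad-leaved forest in the Heishiding Nature Reserve, Guangdong Province: Pinus massoniana, Symplocos laurina, Castanopsis nigrescens, Itea chinensis and Rhodomyrtus tomentosa. All five populations showed scale dependence to varying degrees. This indicates that ENNA can detect the scale dependence of population spatial distribution patterns, yields multi-scale information about those patterns, and is an effective method for multi-scale analysis of population spatial patterns.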
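The Clark-Evans index at a single scale can be computed directly from point coordinates, as in the following sketch. It assumes the expected nearest-neighbor distance takes the edge-corrected form 0.5 * sqrt(A/N) + 0.0514 * P/N + 0.041 * P/N^(3/2), and it leaves the choice of observation window at scale d to the caller, since the abstract does not spell out the sandbox procedure; the function name and the example plot are hypothetical.

```python
from math import hypot, sqrt


def clark_evans(points, area, perimeter):
    """Clark-Evans index for the points inside one observation window.

    points: list of (x, y) coordinates in metres
    area, perimeter: size of the window at the chosen scale d
    Values above 1 suggest a regular pattern, below 1 an aggregated one.
    """
    n = len(points)
    nearest = []
    for i, (xi, yi) in enumerate(points):
        nearest.append(
            min(hypot(xi - xj, yi - yj) for j, (xj, yj) in enumerate(points) if j != i)
        )
    r_observed = sum(nearest) / n
    r_expected = (
        0.5 * sqrt(area / n)
        + 0.0514 * perimeter / n
        + 0.041 * perimeter / n ** 1.5
    )
    return r_observed / r_expected


# Four individuals on a regular grid inside a hypothetical 2 m x 2 m window.
plot = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
ce = clark_evans(plot, area=4.0, perimeter=8.0)
```

Repeating the call for a series of nested windows of increasing size would give CE as a function of d, which is the multi-scale curve the abstract describes.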

17.
Comparing two or more phylogenetic trees is a fundamental task in computational biology. The simplest outcome of such a comparison is a pairwise measure of similarity, dissimilarity, or distance. A large number of such measures have been proposed, but so far all suffer from problems ranging from computational cost to lack of robustness; many can be shown to behave unexpectedly under certain plausible inputs. For instance, the widely used Robinson-Foulds distance is poorly distributed and thus affords little discrimination, while also lacking robustness in the face of very small changes: reattaching a single leaf elsewhere in a tree of any size can instantly maximize the distance. In this paper, we introduce a new pairwise distance measure, based on matching, for phylogenetic trees. We prove that our measure induces a metric on the space of trees, show how to compute it in low polynomial time, verify through statistical testing that it is robust, and finally note that it does not exhibit unexpected behavior under the same inputs that cause problems with other measures. We also illustrate its usefulness in clustering trees, demonstrating significant improvements in the quality of hierarchical clustering compared with the same collections of trees clustered using the Robinson-Foulds distance.

18.
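This paper applies fuzzy cluster analysis to the classification of Vitex negundo var. heterophylla scrub. The clustering procedure has three steps: 1. Compute the similarity matrix R: this step is the same as in other clustering methods, and various similarity coefficients may be chosen. 2. Find a fuzzy equivalence relation: take the successive powers R^2, R^4, R^8, …; if at some step […], then R* is a fuzzy equivalence relation. 3. Cluster: choose an appropriate confidence level λ and cluster. The similarity coefficient used in this paper is […], where r_jk denotes the similarity coefficient between quadrats j and k, and M is a suitable constant chosen so that 0 […]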
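The three steps can be sketched as follows, starting from an already computed similarity matrix. Because the stopping condition is missing from the text above, the sketch assumes the usual rule for max-min transitive closure: keep squaring R under max-min composition until the matrix stops changing, and take that matrix as R*. The function names and the example matrix are hypothetical.

```python
def max_min_compose(a, b):
    """Max-min composition of two square fuzzy relation matrices."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def transitive_closure(r):
    """Square R repeatedly (R^2, R^4, ...) until it no longer changes."""
    while True:
        squared = max_min_compose(r, r)
        if squared == r:
            return r
        r = squared


def lambda_cut_clusters(r_star, lam):
    """Group items whose closure similarity is at least the level lam."""
    n = len(r_star)
    assigned, clusters = set(), []
    for i in range(n):
        if i in assigned:
            continue
        group = [j for j in range(n) if r_star[i][j] >= lam]
        assigned.update(group)
        clusters.append(group)
    return clusters


# Hypothetical similarity matrix for three quadrats.
R = [[1.0, 0.8, 0.2], [0.8, 1.0, 0.5], [0.2, 0.5, 1.0]]
R_star = transitive_closure(R)
tight = lambda_cut_clusters(R_star, 0.6)
loose = lambda_cut_clusters(R_star, 0.5)
```

Lowering λ merges clusters, so scanning λ from 1 down to 0 produces a nested hierarchy much like a dendrogram.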

19.
A statistical approach that sequentially applies principal component analysis (PCA), clustering and discriminant analysis was developed to disclose morphometric sperm subpopulations. In addition, we used a similar approach to disclose subpopulations of spermatozoa with different degrees of DNA fragmentation. It is widely accepted that sperm morphology is a strong indicator of semen quality, and since the sperm head mainly comprises the sperm DNA, it has been proposed that subtle changes in sperm head morphology may be related to abnormal DNA content. Semen from four mongrel dogs (five replicates per dog) was used to investigate DNA quality by means of the sperm chromatin structure assay (SCSA) and for computerized sperm morphometry (ASMA). Each sperm head was measured for nine primary parameters: head area (A), head perimeter (P), head length (L), head width (W), acrosome area (%), midpiece width (w), midpiece area (a), distance (d) between the major axes of the head and midpiece, and angle (θ) of divergence of the midpiece from the head axis; and for four parameters of head shape: FUN1 (L/W), FUN2 (4πA/P²), FUN3 ((L - W)/(L + W)) and FUN4 (πLW/4A). The data matrix consisted of 2361 observations for the morphometric analysis of individual spermatozoa and 63,815 observations for DNA integrity. PCA revealed five variables with eigenvalues over 1, representing more than 79% of the cumulative variance. The morphometric data revealed five sperm subpopulations, while the DNA data gave six subpopulations of spermatozoa with different DNA integrity. Significant differences among dogs were found in the percentage of spermatozoa falling in each cluster (p < 0.05). Linear regression models including sperm head shape factors 2, 3 and 4 predicted the amount of denatured DNA within each individual spermatozoon (p < 0.001). We conclude that ASMA can be considered a powerful tool to improve the spermiogram.
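The four head shape factors follow directly from the primary measurements, as the short sketch below shows. The formulas are the ones listed in the abstract, reading P2 as P squared and 4A as the full denominator of FUN4; the function name and the test values are hypothetical.

```python
from math import pi


def head_shape_factors(length, width, area, perimeter):
    """Compute FUN1-FUN4 from head length L, width W, area A and perimeter P."""
    return {
        "FUN1": length / width,                         # L / W
        "FUN2": 4.0 * pi * area / perimeter ** 2,       # 4*pi*A / P^2
        "FUN3": (length - width) / (length + width),    # (L - W) / (L + W)
        "FUN4": pi * length * width / (4.0 * area),     # pi*L*W / (4*A)
    }


# Sanity check on a perfect circle of radius 1: L = W = 2, A = pi, P = 2*pi.
circle = head_shape_factors(2.0, 2.0, pi, 2.0 * pi)
```

For a circle FUN1, FUN2 and FUN4 equal 1 and FUN3 equals 0, so departures from those values measure how far a head outline is from circular.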

20.
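Composition of the Odorrana schmackeri species complex in Fujian Province and a new distribution record of Odorrana tianmuii   Total citations: 1, self-citations: 0, citations by others: 1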
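Because there is no obvious geographic barrier, the species composition, distribution boundaries and distribution pattern of the Odorrana schmackeri species complex in the hilly region of Fujian and Zhejiang are disputed. From September to October 2016, we carried out field surveys and sample collection in Pingnan County (Ningde City) and Pucheng County (Nanping City), Fujian Province, in the Fujian-Zhejiang border region. We amplified the mitochondrial 12S rRNA and 16S rRNA genes of the samples, aligned them with sequences of O. huanggangensis, O. tianmuii and O. schmackeri, reconstructed phylogenetic relationships and calculated genetic distances. Combining these results with morphological identification and morphometric analysis, we studied the composition of the O. schmackeri species complex in Fujian Province. The results show that the complex in Fujian comprises O. huanggangensis and O. tianmuii. O. huanggangensis occurs in the Wuyi Mountains and in the Min River and Jiulong River basins, whereas O. tianmuii occurs only in northeastern Guancuo Township, Pucheng County, a secondary source area of the Qu River (a main tributary of the Qiantang River), and is a new provincial record for odorous frogs in Fujian. This study adds to the known amphibian diversity of Fujian Province and refines knowledge of the ranges of O. huanggangensis and O. tianmuii in the province, but the species composition of the complex in the Yangtze River basin at the northwestern tip of Pucheng County still needs to be confirmed. Whether river systems and drainage basins affect species differentiation and distribution patterns in the complex, and what speciation mechanisms operate among these closely related species, require further study.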
