Similar literature
 Found 20 similar articles (search time: 562 ms)
1.
MOTIVATION: Clustering algorithms are widely used in the analysis of microarray data. In clinical studies, they are often applied to find groups of co-regulated genes. Clustering, however, can also stratify patients by similarity of their gene expression profiles, thereby defining novel disease entities based on molecular characteristics. Several distance-based cluster algorithms have been suggested, but little attention has been given to the distance measure between patients. Even with the Euclidean metric, including and excluding genes from the analysis leads to different distances between the same objects, and consequently different clustering results. RESULTS: We describe a new clustering algorithm, in which gene selection is used to derive biologically meaningful clusterings of samples by combining expression profiles and functional annotation data. According to gene annotations, candidate gene sets with specific functional characterizations are generated. Each set defines a different distance measure between patients, leading to different clusterings. These clusterings are filtered using a resampling-based significance measure. Significant clusterings are reported together with the underlying gene sets and their functional definition. CONCLUSIONS: Our method reports clusterings defined by biologically focused sets of genes. In annotation-driven clusterings, we have recovered clinically relevant patient subgroups through biologically plausible sets of genes as well as new subgroupings. We conjecture that our method has the potential to reveal so far unknown, clinically relevant classes of patients in an unsupervised manner. AVAILABILITY: We provide the R package adSplit as part of Bioconductor release 1.9 and on http://compdiag.molgen.mpg.de/software.

2.
The use of analytical techniques to delineate biogeographical regions is becoming increasingly popular. One recent example, Heikinheimo et al. (Journal of Biogeography, 2007, 34, 1053–1064), applied the k-means clustering algorithm to define the biogeography of the European land mammal fauna. However, they used the Euclidean distance measure to cluster grid cells described by species-occurrence data, which is inappropriate. The Euclidean distance yields misleading results when applied to species-occurrence data because of the double-zero problem and the species-abundance paradox. We repeat their analysis using the Hellinger distance, a measure that is appropriate for species-occurrence data and has been shown to outperform other such measures. Our results differ substantially from those presented by Heikinheimo et al. We argue that the rigorous application of appropriate statistical techniques is of crucial concern within conservation biogeography.
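The contrast between the two measures can be sketched in a few lines of Python. This is only an illustration, not the analysis from the paper: the three grid cells are hypothetical, and the Hellinger distance is written here without a 1/√2 scaling factor, which is an assumed convention that the abstract does not specify.

```python
import math

def euclidean(x, y):
    """Plain Euclidean distance between two occurrence vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def hellinger(x, y):
    """Hellinger distance: Euclidean distance between square-rooted
    relative occurrences (unscaled convention, range 0..sqrt(2))."""
    sx, sy = sum(x), sum(y)
    if sx == 0 or sy == 0:
        raise ValueError("each site needs at least one recorded species")
    return math.sqrt(sum((math.sqrt(a / sx) - math.sqrt(b / sy)) ** 2
                         for a, b in zip(x, y)))

# Hypothetical presence/absence data for three grid cells over six species.
site_a = [1, 1, 0, 0, 0, 0]
site_b = [0, 0, 1, 1, 0, 0]  # shares no species with site_a
site_c = [1, 1, 1, 1, 1, 1]  # contains every species found in site_a

print(euclidean(site_a, site_b), euclidean(site_a, site_c))
print(hellinger(site_a, site_b), hellinger(site_a, site_c))
```

In this toy case the Euclidean distance rates the cell with no shared species as exactly as far from site_a as the cell that contains all of its species, whereas the Hellinger distance separates the two.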

3.
Classification of individuals' genotype data is important in various kinds of biomedical research. There are many sophisticated clustering algorithms, but most of them require an appropriate similarity measure between the objects to be clustered. Hence, accurate inter-diplotype similarity measures are required for the classification of diplotypes. In this article, we propose a new, accurate inter-diplotype similarity measure that we call the population model-based distance (PMD), so that we can cluster individuals with diplotype SNP data (i.e., unphased diplotypes) with higher accuracy. For unphased diplotypes, the allele sharing distance (ASD) has been the standard measure of the genetic distance between the diplotypes of individuals. To achieve higher clustering accuracy, our new measure PMD makes use of a given appropriate population model, which has never been utilized in the ASD. As the population model, we propose to use a hidden Markov model (HMM)-based model. We call the PMD based on this model the HHD (HIT HMM-based Distance). We demonstrate the impact of the HHD on diplotype classification through comprehensive large-scale experiments over 8930 genome-wide data sets derived from the HapMap SNPs database. The experiments revealed that the HHD enables significantly more accurate clustering than the ASD.

4.
5.
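In studies of DNA sequence similarity, the commonly used dynamic programming algorithms rely on gap penalty functions that lack a theoretical basis and are therefore subjective, which leads to differing results. This paper proposes a DNA sequence similarity measure based on the DTW (Dynamic Time Warping) distance that addresses this problem. DNA sequences are converted into time series through a graphical representation of the sequence, and the DTW distance is then computed to measure sequence similarity and characterize the properties of the sequences, yielding a method for comparing DNA sequence similarity. The method was used to compare the similarity of seven neurotoxin gene sequences of the East Asian scorpion (Buthus martensi Karsch neurotoxin), which verified the effectiveness and accuracy of the measure.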
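The general idea can be sketched as follows. The DTW recurrence is the standard textbook one; the mapping from bases to a numeric series (`STEP` and `dna_to_series`) is a purely hypothetical stand-in, since the abstract does not say which graphical representation the authors use.

```python
def dtw_distance(s, t):
    """Standard dynamic time warping distance with |a - b| as local cost."""
    n, m = len(s), len(t)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of the first i items of s with the first j of t
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(s[i - 1] - t[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch t
                                     cost[i][j - 1],      # stretch s
                                     cost[i - 1][j - 1])  # advance both
    return cost[n][m]

# Hypothetical encoding: each base moves a cumulative walk up or down.
STEP = {"A": 1, "G": 2, "C": -1, "T": -2}

def dna_to_series(seq):
    """Turn a DNA string into a numeric series (assumed encoding, see above)."""
    total, series = 0, []
    for base in seq.upper():
        total += STEP[base]
        series.append(total)
    return series

print(dtw_distance(dna_to_series("ACGTAC"), dna_to_series("ACGGTAC")))
```

Because the warping path may repeat elements, sequences of different lengths can be compared without an explicit gap penalty, which is the property the abstract relies on.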

6.
Environment, dispersal and patterns of species similarity (total citations: 2; self-citations: 0; citations by others: 2)
Aim The aim of this paper is to evaluate the combined effects of geographical distance and environmental distance on patterns of species similarity (similarity in species composition between sites), and to identify factors affecting the rate of decay in species similarity with each type of distance. Location Israel. Methods Data on species composition of land snails and land birds were recorded in 27 sites of 1 × 1 km scattered across a rainfall gradient in Israel. Matrices of similarity in species composition between all pairs of sites were computed and analysed with respect to corresponding matrices of geographical distance and rainfall distance (defined as the difference in mean annual rainfall between sites, and used as a measure of environmental distance). Mantel tests were applied to determine the correlation between species similarity and each type of distance. Factors affecting the decay in species similarity were investigated by comparing different subsets of the data using randomization tests. Results Both rainfall distance and geographical distance had negative effects on species similarity. The effect of rainfall distance was statistically significant even after controlling for differences in geographical distance, and vice versa. The per‐unit effect of rainfall distance on species similarity decreased with increasing geographical distance, indicating that the two types of distances interacted in determining the similarity in species composition. Snails showed a higher rate of decay in species similarity with geographical distance than birds, and large snails showed a higher rate of decay than small snails, which are better passive dispersers. The per‐unit effects of both rainfall distance and geographical distance on species similarity were higher in the desert region than in the Mediterranean region. Analyses focusing on a grain size of 10 × 10 m showed a lower similarity in species composition and a lower rate of decay in species similarity with rainfall distance than analyses carried out at a grain size of 1 × 1 km. Main conclusions Patterns of similarity in species composition are influenced by the combined effects of environmental variation, the position of the area along environmental gradients, the dispersal properties of the component species, and the scale (both spatial extent and grain size) at which the patterns are examined.
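A simple Mantel test of the kind mentioned in the Methods can be sketched as below. This is a minimal permutation version under stated assumptions, not the authors' procedure: the five site positions and the exponential decay of similarity are hypothetical, and partial Mantel tests (controlling for a third matrix) are not covered.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def upper(matrix, order):
    """Upper-triangle entries after relabelling the sites by `order`."""
    n = len(order)
    return [matrix[order[i]][order[j]] for i in range(n) for j in range(i + 1, n)]

def mantel(a, b, permutations=999, seed=0):
    """Return (r, p): matrix correlation and a two-sided permutation p-value."""
    rng = random.Random(seed)
    ids = list(range(len(a)))
    b_vec = upper(b, ids)
    observed = pearson(upper(a, ids), b_vec)
    hits = 1  # the observed arrangement counts as one permutation
    for _ in range(permutations):
        rng.shuffle(ids)  # permute rows and columns of `a` together
        if abs(pearson(upper(a, ids), b_vec)) >= abs(observed):
            hits += 1
    return observed, hits / (permutations + 1)

# Hypothetical sites along a transect (km) with similarity decaying with distance.
positions = [0, 1, 3, 6, 10]
geo = [[abs(p - q) for q in positions] for p in positions]
similarity = [[math.exp(-0.2 * d) for d in row] for row in geo]

r, p = mantel(similarity, geo)
print(r, p)
```

Permuting whole sites, rather than individual matrix cells, is what keeps the test valid despite the non-independence of pairwise distances.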

7.
王丹, 王孝安, 郭华, 王世雄, 郑维娜, 刘史力. Acta Ecologica Sinica (《生态学报》), 2013, 33(14): 4409-4415
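The mechanism of plant community assembly is one of the central topics in ecological research. This problem has long lacked a satisfactory explanation and remains contested: niche theory, neutral theory, or a combination of the two has each found support in different studies. Taking the grassland communities of the Ziwuling area on the Loess Plateau as an example, field community surveys were carried out in three types of grassland community (farmland abandoned for 5 years, and grasslands on shady and sunny slopes). The Mantel test and principal coordinates of neighbour matrices (PCNM) were used to study the effects of spatial geographical distance and of differences in environmental resources on the distribution of herbaceous plant communities. The results show that geographical distance and environmental differences together explained 79.3% of the similarity in community composition. After removing the effect of environmental factors, geographical distance explained 33.8% of the similarity in community composition; after removing the effect of geographical distance, environmental factors explained 14.2%. Both niche theory and neutral theory play a part in the assembly of herbaceous communities on the Loess Plateau, but neutral theory plays the more important role.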

8.
The similarity in species composition between two communities generally decays as a function of increasing distance between them. Parasite communities in vertebrate definitive hosts follow this pattern but the respective relationship in intermediate invertebrate hosts of parasites with complex life cycles is unknown. In intermediate hosts, parasite communities are affected not only by the varying vagility of their definitive hosts (dispersing infective propagules) but also by the necessary coincidence of all their hosts in environmentally suitable localities. As intermediate hosts often hardly move they do not contribute to parasite dispersal. Hence, their parasite assemblages may decrease faster in similarity with increasing distance than those in highly mobile vertebrate definitive hosts. We use published field survey data to investigate distance decay of similarity in trematode communities from three prominent coastal molluscs of the Eastern North-Atlantic: the gastropods Littorina littorea and Hydrobia ulvae, and the bivalve Cerastoderma edule. We found that the similarity of trematode communities in all three hosts decayed with distance, independently of local sampling effort, and whether or not the parasites used the mollusc as first or second intermediate host in their life cycle. In H. ulvae, the halving distance (i.e. the distance that halves the similarity from its initial similarity at 1 km distance) for the trematode species using birds as definitive hosts was approximately two to three times larger than for species using fish. The initial similarities (estimated at 1 km distance) among trematode communities were relatively higher, whereas mean halving distances were lower, compared to published values for parasite communities in vertebrate hosts. We conclude that the vagility of definitive hosts accounts for a high similarity at the local scale, while the strong decay of similarity across regions is a consequence of the low probability that all necessary hosts and suitable environmental conditions coincide on a large scale.

9.
Genetic distance analysis based on 16 biochemical markers, following Nei's distance measure, has been performed on nine endogamous groups of Maharashtra: Nava Budha, Maratha, Deshastha Rigvedi Brahmin, Chitpavan Brahmin, Chandrasenya Kayastha Prabhu, Parsis, Bhil, Pawara and Katkari. The distances between these groups are small compared with the within-group heterogeneity. The average heterozygosity per gene per locus is high for all the populations (in the range of 20–22%). The observed clusterings among these nine groups are, in general, compatible with the known ethnic history of Maharashtra.

10.
Similarity or distance measures are core components used by distance-based clustering algorithms to cluster similar data points into the same clusters, while dissimilar or distant data points are placed into different clusters. The performance of similarity measures is mostly addressed in two- or three-dimensional spaces; beyond these, to the best of our knowledge, no empirical study has revealed the behavior of similarity measures when dealing with high-dimensional datasets. To fill this gap, a technical framework is proposed in this study to analyze, compare and benchmark the influence of different similarity measures on the results of distance-based clustering algorithms. For reproducibility purposes, fifteen publicly available datasets were used for this study, so that future distance measures can be evaluated and compared with the results of the measures discussed in this work. These datasets were classified into low- and high-dimensional categories to study the performance of each measure against each category. This research should help the research community to identify suitable distance measures for datasets and also facilitate the comparison and evaluation of newly proposed similarity or distance measures with traditional ones.

11.
Abstract

Structures and functions of proteins play various essential roles in biological processes. The functions of newly discovered proteins can be predicted by comparing their structures with those of proteins of known function. Many approaches have been proposed for measuring protein structure similarity, such as the template-modeling (TM)-score method, the GRaphlet (GR)-Align method, and the commonly used root-mean-square deviation (RMSD) measures. However, alignment-based comparison of protein structures is time-consuming on large datasets, and its accuracy still has room to improve. In this study, we introduce a new three-dimensional (3D) Yau–Hausdorff distance between any two 3D objects. The (3D) Yau–Hausdorff distance can be used in particular to measure the similarity/dissimilarity of two proteins of any size and does not require aligning and superimposing the two structures. We apply structural similarity to study function similarity and perform phylogenetic analysis on several datasets. The results show that the (3D) Yau–Hausdorff distance could serve as a more precise and effective method for discovering biological relationships between proteins than other structure comparison methods.

Communicated by Ramaswamy H. Sarma

12.

Background  

The definition of a distance measure plays a key role in the evaluation of different clustering solutions of gene expression profiles. In this empirical study we compare different clustering solutions when using the Mutual Information (MI) measure versus the use of the well known Euclidean distance and Pearson correlation coefficient.
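For readers unfamiliar with MI as a similarity measure, the following sketch estimates it from two expression profiles by equal-width binning. The binning scheme, the number of bins and the toy profiles are all assumptions made for illustration; the study may well use a different estimator.

```python
import math
from collections import Counter

def discretize(values, bins=3):
    """Map values to equal-width bin indices 0..bins-1 (assumed scheme)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # constant profiles fall into a single bin
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=3):
    """Plug-in estimate of MI (in bits) between two discretized profiles."""
    dx, dy = discretize(x, bins), discretize(y, bins)
    n = len(dx)
    joint = Counter(zip(dx, dy))
    px, py = Counter(dx), Counter(dy)
    # sum over observed cells of p(a,b) * log2( p(a,b) / (p(a) * p(b)) )
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())

# Hypothetical profiles: gene_y depends on gene_x, but not linearly.
gene_x = [-3, -2, -1, 0, 1, 2, 3]
gene_y = [v * v for v in gene_x]

print(mutual_information(gene_x, gene_y))
```

A symmetric dependency like this one has zero linear correlation, yet the MI estimate is positive, which is the usual argument for considering MI alongside Euclidean distance and Pearson correlation.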

13.
In this paper we present a study of the classification of the 20 amino acids via a fuzzy clustering technique. To calculate distances among the various elements we employ two different distance functions: the Minkowski distance function and the NTV metric. In the clustering procedure we take into account several physical properties of the amino acids. We examine the effect of the number and nature of the properties taken into account on the clustering procedure as a function of the degree of similarity and the distance function used. It turns out that one should use the properties that most strongly determine the behavior of the amino acids, and that the use of an appropriate metric can help in defining the separation into groups.

14.
Mutual information (MI)-based registration, which uses MI as the similarity measure, is a representative method in medical image registration. It has excellent robustness and accuracy, but has the disadvantages of a large amount of calculation and a long processing time. In this paper, the centroid is obtained by computing the medical image moments. By applying fuzzy c-means clustering, the coordinates of the medical image are divided into two clusters to fit a straight line, and the rotation angles of the reference and floating images are computed, respectively. The initial values for registering the images are thereby determined. When searching for the optimal geometric transformation parameters, we put forward two new concepts, fuzzy distance and fuzzy signal-to-noise ratio (FSNR), and we select FSNR as the similarity measure between the reference and floating images. In the experiments, the Simplex method is chosen for multi-parameter optimisation. The experimental results show that the proposed method has a simple implementation, a low computational cost, fast registration and good registration accuracy. Moreover, it can effectively avoid becoming trapped in local optima. It is applicable to both mono-modality and multi-modality image registration.

15.
Partial gyrB sequences (>1 kb) were obtained from 34 type strains of the genus Amycolatopsis. Phylogenetic trees were constructed to determine the effectiveness of using this gene to predict taxonomic relationships within the genus. The use of gyrB sequence analysis as an alternative to DNA–DNA hybridization was also assessed for distinguishing closely related species. The gyrB based phylogeny mostly confirmed the conventional 16S rRNA gene-based phylogeny and thus provides additional support for certain of these 16S rRNA gene-based phylogenetic groupings. Although pairwise gyrB sequence similarity cannot be used to predict the DNA relatedness between type strains, the gyrB genetic distance can be used as a means to assess quickly whether an isolate is likely to represent a new species in the genus Amycolatopsis. In particular a genetic distance of >0.02 between two Amycolatopsis strains (based on a 315 bp variable region of the gyrB gene) is proposed to provide a good indication that they belong to different species (and that polyphasic taxonomic characterization of the unknown strain is worth undertaking). Electronic supplementary material  The online version of this article (doi:) contains supplementary material, which is available to authorized users. The GenBank accession numbers for the gyrB gene sequences obtained in this study are shown in Table 1.

16.
Comparing two or more phylogenetic trees is a fundamental task in computational biology. The simplest outcome of such a comparison is a pairwise measure of similarity, dissimilarity, or distance. A large number of such measures have been proposed, but so far all suffer from problems varying from computational cost to lack of robustness; many can be shown to behave unexpectedly under certain plausible inputs. For instance, the widely used Robinson-Foulds distance is poorly distributed and thus affords little discrimination, while also lacking robustness in the face of very small changes: reattaching a single leaf elsewhere in a tree of any size can instantly maximize the distance. In this paper, we introduce a new pairwise distance measure, based on matching, for phylogenetic trees. We prove that our measure induces a metric on the space of trees, show how to compute it in low polynomial time, verify through statistical testing that it is robust, and finally note that it does not exhibit unexpected behavior under the same inputs that cause problems with other measures. We also illustrate its usefulness in clustering trees, demonstrating significant improvements in the quality of hierarchical clustering as compared to the same collections of trees clustered using the Robinson-Foulds distance.
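The robustness problem described for the Robinson-Foulds distance can be reproduced with a small sketch. Two simplifications are assumed here: trees are rooted and written as nested tuples, and the distance counts clades found in only one tree (the rooted analogue of the usual split-based definition on unrooted trees). The new matching-based measure from the paper is not implemented.

```python
def clades(tree):
    """Non-trivial clades (frozensets of leaf names) of a rooted tree
    given as nested tuples, e.g. (("a", "b"), "c")."""
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    all_leaves = leaves(tree)
    found.discard(all_leaves)  # the root clade is shared by all trees on these leaves
    return found

def rf_distance(t1, t2):
    """Number of clades present in exactly one of the two trees."""
    return len(clades(t1) ^ clades(t2))

# A caterpillar tree on six leaves, and the same tree with leaf "a"
# pruned and reattached at the root.
original = ((((("a", "b"), "c"), "d"), "e"), "f")
moved_a = ((((("b", "c"), "d"), "e"), "f"), "a")

print(rf_distance(original, moved_a))
```

Every internal clade of the original tree contains "a", so moving that one leaf leaves no clade in common and the distance reaches its maximum of 2(n − 2) for rooted binary trees on n leaves.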

17.

Background  

The underlying goal of microarray experiments is to identify gene expression patterns across different experimental conditions. Genes that are contained in a particular pathway or that respond similarly to experimental conditions could be co-expressed and show similar patterns of expression on a microarray. Using any of a variety of clustering methods or gene network analyses, we can partition genes of interest into groups, clusters, or modules based on measures of similarity. Typically, Pearson correlation is used to measure distance (or similarity) before implementing a clustering algorithm. Pearson correlation is, however, quite susceptible to outliers, an unfortunate characteristic when dealing with microarray data, which are well known to be typically quite noisy.
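The sensitivity to outliers is easy to demonstrate. In the sketch below, the two expression profiles are hypothetical, and the single extreme measurement in the last array stands in for the kind of noise the abstract mentions.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical expression levels of two genes over six arrays.
# On the first five arrays the genes are perfectly anti-correlated;
# the sixth array carries one extreme value for both genes.
gene_1 = [1, 2, 3, 4, 5, 50]
gene_2 = [5, 4, 3, 2, 1, 50]

without_outlier = pearson(gene_1[:5], gene_2[:5])
with_outlier = pearson(gene_1, gene_2)
print(without_outlier, with_outlier)
```

A single array is enough to turn a perfect negative correlation into a strongly positive one, so two genes with opposite behavior could end up in the same cluster.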

18.

Background  

The rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to the building of such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for the alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is especially designed for application to non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions".

19.
Natal sex‐biased dispersal has long been thought to reduce the risk of inbreeding by spatially separating opposite‐sexed kin. Yet, comprehensive and quantitative evaluations of this hypothesis are lacking. In this study, we quantified the effectiveness of sex‐biased dispersal as an inbreeding avoidance strategy by combining spatially explicit simulations and empirical data. We quantified the extent of kin clustering by measuring the degree of spatial autocorrelation among opposite‐sexed individuals (FM structure). This allowed us to systematically explore how the extent of sex‐biased dispersal, generational overlap, and mate searching distance, influenced both kin clustering, and the resulting inbreeding in the absence of complementary inbreeding avoidance strategies. Simulations revealed that when sex‐biased dispersal was limited, positive FM genetic structure developed quickly and increased as the mate searching distance decreased or as generational overlap increased. Interestingly, complete long‐range sex‐biased dispersal did not prevent the development of FM genetic structure when generations overlapped. We found a very strong correlation between FM genetic structure and both FIS under random mating, and pedigree‐based measures of inbreeding. Thus, we show that the detection of FM genetic structure can be a strong indicator of inbreeding risk. Empirical data for two species with different life history strategies yielded patterns congruent with our simulations. Our study illustrates a new application of spatial genetic autocorrelation analysis that offers a framework for quantifying the risk of inbreeding that is easily extendable to other species. Furthermore, our findings provide other researchers with a context for interpreting observed patterns of opposite‐sexed spatial genetic structure.

20.
Landscape similarity search involves finding landscapes from among a large collection that are similar to a query landscape. An example of such a collection is a large land cover map subdivided into a grid of smaller local landscapes; a query is a local landscape of interest, and the task is to find other local landscapes within the map that are perceptually similar to the query. Landscape search, and the related task of pattern-based regionalization, require a measure of similarity: a function that quantifies the level of likeness between two landscapes. The standard approach is to use the Euclidean distance between vectors of landscape metrics derived from the two landscapes, but no in-depth analysis of this approach has been conducted. In this paper we investigate the performance of different implementations of the standard similarity measure. Five different implementations are tested against each other and against a control similarity measure based on histograms of class co-occurrence features and the Jensen–Shannon divergence. Testing consists of a series of numerical experiments combined with visual assessments on a set of 400 3 km-scale landscapes. Based on the cases where visual assessment provides a definitive answer, we have determined that the standard similarity measure is sensitive to the way landscape metrics are normalized and, additionally, to whether weights aimed at controlling the relative contribution of landscape composition vs. configuration are used. The standard measure achieves the best performance when metrics are normalized using their extreme values extracted from all possible landscapes, not just the landscapes in the given collection, and when weights are assigned so that the combined influence of composition metrics on the similarity value equals the combined influence of configuration metrics. We have also determined that the control similarity measure outperforms all implementations of the standard measure.
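The control measure rests on the Jensen–Shannon divergence between two histograms, which can be sketched as follows. The histograms here are hypothetical counts of class co-occurrence features; how the paper builds those histograms and converts the divergence into a similarity score is not stated in the abstract and is not modelled.

```python
import math

def jensen_shannon(p, q):
    """Jensen–Shannon divergence between two histograms, using base-2
    logarithms so that the result lies in [0, 1]."""
    sp, sq = sum(p), sum(q)
    p = [v / sp for v in p]  # normalize counts to probabilities
    q = [v / sq for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]  # mixture distribution

    def kl(a, b):
        # Kullback-Leibler divergence; empty bins of `a` contribute nothing
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical co-occurrence histograms for three local landscapes.
forest_a = [80, 15, 5, 0]
forest_b = [75, 18, 6, 1]
farmland = [5, 10, 25, 60]

print(jensen_shannon(forest_a, forest_b), jensen_shannon(forest_a, farmland))
```

Unlike a Euclidean distance over landscape metrics, this divergence needs no per-metric normalization or weighting, because it operates directly on normalized histograms.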
