Similar articles
20 similar articles found
1.
Dividing an image into superpixels facilitates further processing of the image. The simple linear iterative clustering (SLIC) algorithm achieves good segmentation by clustering pixels on their color and distance features. However, a limited number of superpixels easily causes under-segmentation. This work therefore corrects the SLIC segmentation result with a k-means clustering method that computes similarity from a weighted Euclidean distance. The under-segmented superpixel blocks are then processed by k-means clustering based on binary classification. Results show that the corrected SLIC segmentation gives a better visual effect and better evaluation indices.

2.
Distance-based clustering of CGH data   Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: We consider the problem of clustering a population of Comparative Genomic Hybridization (CGH) data samples. The goal is to develop a systematic way of placing patients with similar CGH imbalance profiles into the same cluster. Our expectation is that patients with the same cancer types will generally belong to the same cluster as their underlying CGH profiles will be similar. RESULTS: We focus on distance-based clustering strategies. We do this in two steps. (1) Distances of all pairs of CGH samples are computed. (2) CGH samples are clustered based on this distance. We develop three pairwise distance/similarity measures, namely raw, cosine and sim. Raw measure disregards correlation between contiguous genomic intervals. It compares the aberrations in each genomic interval separately. The remaining measures assume that consecutive genomic intervals may be correlated. Cosine maps pairs of CGH samples into vectors in a high-dimensional space and measures the angle between them. Sim measures the number of independent common aberrations. We test our distance/similarity measures on three well known clustering algorithms, bottom-up, top-down and k-means with and without centroid shrinking. Our results show that sim consistently performs better than the remaining measures. This indicates that the correlation of neighboring genomic intervals should be considered in the structural analysis of CGH datasets. The combination of sim with top-down clustering emerged as the best approach. AVAILABILITY: All software developed in this article and all the datasets are available from the authors upon request. CONTACT: juliu@cise.ufl.edu.

3.
Many similarity-based algorithms have been designed over the past decade to address link prediction. To improve prediction accuracy, this paper proposes a novel cosine similarity index, CD, based on the distance between nodes and the cosine value between vectors. First, a node coordinate matrix, which differs from the distance matrix, is obtained from node distances, and the row vectors of this matrix are regarded as the coordinates of the nodes. Then the cosine value between node coordinates is used as their similarity index. A local community density index, LD, is also proposed. A series of CD-based indices, including CD-LD-k, CD*LD-k, CD-k and CDI, are then presented and applied to ten real networks. Experimental results demonstrate the effectiveness of the CD-based indices. The effects of the network clustering coefficient and assortativity coefficient on the prediction accuracy of the indices are analyzed. CD-LD-k and CD*LD-k improve prediction accuracy regardless of whether the assortativity coefficient of the network is negative or positive. An analysis of the relative precision of each method on each network shows that the CD-LD-k and CD*LD-k indices have excellent average performance and robustness. The CD and CD-k indices perform better on positively assortative networks than on negatively assortative networks. For negatively assortative networks, the CD index is refined into the CDI index, which combines the advantages of the CD index with the evolutionary mechanism of the BA network model. Experimental results show that the CDI index increases the prediction accuracy of CD on negatively assortative networks.

4.
Aita T  Husimi Y  Nishigaki K 《Bio Systems》2011,106(2-3):67-75
To measure the similarity or dissimilarity between two given biological sequences, several papers proposed metrics based on the "word-composition vector". The essence of these metrics is as follows. First, we count the appearance frequencies of all the K-tuple words throughout each of two given sequences. Then, the two given sequences are transformed into their respective word-composition vectors. Next, the distance metrics, for example the angle between the two vectors, are calculated. A significant issue is to determine the optimal word size K. With a mathematical model of mutational events (including substitutions, insertions, deletions and duplications) that occur in sequences, we analyzed how the angle between the composition vectors depends on the mutational events. We also considered the optimal word size (=resolution) from our original approach. Our results were verified by computational experiments using artificially generated sequences, amino acid sequences of hemoglobin and nucleotide sequences of 16S ribosomal RNA.
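The word-composition procedure described in this abstract (count K-tuple words, build one vector per sequence, take the angle between the vectors) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `word_composition` and `composition_angle` are hypothetical, overlapping K-tuples and raw counts (rather than normalized frequencies) are assumptions, and the abstract's mutational model is not reproduced.

```python
import math
from collections import Counter


def word_composition(seq: str, k: int) -> Counter:
    """Count every overlapping K-tuple word in the sequence (assumed convention)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def composition_angle(seq_a: str, seq_b: str, k: int) -> float:
    """Angle in radians between the word-composition vectors of two sequences."""
    va, vb = word_composition(seq_a, k), word_composition(seq_b, k)
    # Only words present in both sequences contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("each sequence must contain at least k characters")
    # Clamp to [-1, 1] to guard against floating-point rounding before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cosine)
```

An angle of 0 means identical composition and an angle of π/2 means the two sequences share no K-tuple word. Larger K gives finer resolution but sparser vectors, which is the trade-off behind the optimal word size discussed in the abstract.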

5.

Background

We investigate the accuracy of different similarity approaches for clustering over two million biomedical documents. Clustering large sets of text documents is important for a variety of information needs and applications such as collection management and navigation, summary and analysis. The few comparisons of clustering results from different similarity approaches have focused on small literature sets and have given conflicting results. Our study was designed to seek a robust answer to the question of which similarity approach would generate the most coherent clusters of a biomedical literature set of over two million documents.

Methodology

We used a corpus of 2.15 million recent (2004-2008) records from MEDLINE, and generated nine different document-document similarity matrices from information extracted from their bibliographic records, including titles, abstracts and subject headings. The nine approaches were comprised of five different analytical techniques with two data sources. The five analytical techniques are cosine similarity using term frequency-inverse document frequency vectors (tf-idf cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based language models – BM25 and PMRA (PubMed Related Articles). The two data sources were a) MeSH subject headings, and b) words from titles and abstracts. Each similarity matrix was filtered to keep the top-n highest similarities per document and then clustered using a combination of graph layout and average-link clustering. Cluster results from the nine similarity approaches were compared using (1) within-cluster textual coherence based on the Jensen-Shannon divergence, and (2) two concentration measures based on grant-to-article linkages indexed in MEDLINE.

Conclusions

PubMed's own related article approach (PMRA) generated the most coherent and most concentrated cluster solution of the nine text-based similarity approaches tested, followed closely by the BM25 approach using titles and abstracts. Approaches using only MeSH subject headings were not competitive with those based on titles and abstracts.
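The textual coherence measure in this study is based on the Jensen-Shannon divergence. A minimal sketch of that divergence between two word probability distributions is given below. It assumes each distribution is a dict mapping words to probabilities that sum to 1, and it uses base-2 logarithms so the value lies between 0 and 1. The function names are hypothetical, and how the study aggregates the divergence into a per-cluster coherence score is not shown.

```python
import math


def kl_divergence(p: dict, q: dict) -> float:
    """Kullback-Leibler divergence KL(p || q) in bits; q must cover p's support."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0)


def jensen_shannon_divergence(p: dict, q: dict) -> float:
    """Symmetric divergence between two word distributions, in [0, 1]."""
    words = set(p) | set(q)
    # The mixture m is positive wherever p or q is positive, so KL is finite.
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in words}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Two documents with identical word distributions give 0, and two documents with no words in common give 1.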

6.
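This paper applies fuzzy cluster analysis to the classification of Vitex negundo var. heterophylla (荆条) shrub communities. The clustering procedure has three steps: (1) Compute the similarity matrix R. This step is the same as in other clustering methods, and various similarity coefficients may be chosen. (2) Find the fuzzy equivalence relation by taking the successive powers of R, namely R^2, R^4, R^8, …; once the stopping condition is met at some step, the resulting R* is a fuzzy equivalence relation. (3) Cluster by choosing an appropriate confidence level λ. The similarity coefficient used in this paper is defined in terms of r_jk, the similarity coefficient between quadrats j and k, and M, a suitable constant chosen so that 0 …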
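The three steps above can be sketched as follows. The listing does not preserve the stopping condition or the similarity formula, so this sketch makes two labeled assumptions: powers of R are formed by max-min composition, and squaring stops when R∘R = R, which is the usual criterion for a fuzzy equivalence relation. The function names are hypothetical and the similarity matrix is taken as given.

```python
def max_min_compose(a, b):
    """Max-min composition of two square fuzzy relation matrices."""
    n = len(a)
    return [
        [max(min(a[i][k], b[k][j]) for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def fuzzy_equivalence(r):
    """Square R repeatedly (R^2, R^4, ...) until it stops changing (assumed rule)."""
    while True:
        squared = max_min_compose(r, r)
        if squared == r:  # max/min only select existing entries, so == is exact
            return r
        r = squared


def lambda_cut_clusters(r_star, lam):
    """Group items whose equivalence-relation entry is at least lam."""
    clusters, assigned = [], set()
    for i in range(len(r_star)):
        if i in assigned:
            continue
        group = [j for j in range(len(r_star)) if r_star[i][j] >= lam]
        assigned.update(group)
        clusters.append(group)
    return clusters
```

With R = [[1, 0.8, 0.2], [0.8, 1, 0.4], [0.2, 0.4, 1]], one squaring already yields the equivalence relation; a level of λ = 0.5 then separates the third quadrat, while λ = 0.4 merges all three. Lowering λ therefore traces out the hierarchy from fine to coarse.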

7.
Most molecular analyses, including phylogenetic inference, are based on sequence alignments. We present an algorithm that estimates relatedness between biomolecules without the requirement of sequence alignment by using a protein frequency matrix that is reduced by singular value decomposition (SVD), in a latent semantic index information retrieval system. Two databases were used: one with 832 proteins from 13 mitochondrial gene families and another composed of 1000 sequences from nine types of proteins retrieved from GenBank. Firstly, 208 sequences from the first database and 200 from the second were randomly selected and compared using edit distance between each pair of sequences and respective cosines and Euclidean distances from SVD. Correlation between cosine and edit distance was -0.32 (P < 0.01) and between Euclidean distance and edit distance was +0.70 (P < 0.01). In order to check the ability of SVD in classifying sequences according to their categories, we used a sample of 202 sequences from the 13 gene families as queries (test set), and the other proteins (630) were used to generate the frequency matrix (training set). The classification algorithm applies a voting scheme based on the five most similar sequences with each query. With a 3-peptide frequency matrix, all 202 queries were correctly classified (accuracy = 100%). This algorithm is very attractive, because sequence alignments are neither generated nor required. In order to achieve results similar to those obtained with edit distance analysis, we recommend that Euclidean distance be used as a similarity measure for protein sequences in latent semantic indexing methods.

8.
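A preliminary study of the relationships among families of the Eriophyoidea   Total citations: 2, self-citations: 1, citations by others: 1
Esterase isozymes of five eriophyoid mite species belonging to three families of the Eriophyoidea were determined by polyacrylamide gel electrophoresis. From the electrophoretic bands and the morphological characters of the five species, Euclidean distances and similarity coefficients were calculated and a dendrogram was drawn, showing that their order of relatedness is: 桧三毛瘿螨, 金钱松博氏瘿螨, 雪柳顶冠瘿螨, 女贞刺瘿螨 and 金银木大嘴瘿螨. Of these, 雪柳顶冠瘿螨 and 女贞刺瘿螨 are the most closely related, 桧三毛瘿螨 and 金钱松博氏瘿螨 are also relatively closely related, and 金银木大嘴瘿螨 stands apart. This pattern of relatedness constitutes an inter-family evolutionary relationship of Nalepellidae → Eriophyidae → Rhyncaphytoptidae.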

9.
Summary: Simulated coenoclines were used to test performance of several techniques for ordinating samples by species composition: Wisconsin polar or Bray-Curtis ordination with Euclidean distance (ED) and the complements of percentage similarity (PD) and coefficient of community (CD) as distance measures, Principal components analysis, and polar and non-polar or indirect use of Discriminant function analysis. In general the Bray-Curtis technique gave the best ordinations, and PD was the best distance measure. Euclidean distance gave greater distortion than PD in all tests; CD may be better than PD only for some sample sets of high alpha and beta diversity and high levels of noise or sample error. Principal components ordinations are increasingly distorted as beta diversity increases, and are highly vulnerable to effects of both noise and sample clustering. Discriminant function analysis was found generally unsuitable for ordination of samples by species composition, but likely to be useful for sample classification.
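The three distance measures compared in this summary can be sketched as below. The abstract gives no formulas, so the definitions are assumptions based on common usage: percentage similarity as twice the shared abundance divided by the total abundance of both samples, and coefficient of community as twice the number of shared species divided by the sum of the two species counts. Both complements are expressed as fractions in [0, 1] rather than percentages. Function names are hypothetical, and each sample is a list of abundances in the same species order.

```python
import math


def euclidean_distance(a, b):
    """ED: straight-line distance between two abundance vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def percentage_difference(a, b):
    """PD: complement of percentage similarity (assumed definition)."""
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("both samples are empty")
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - 2.0 * shared / total


def community_difference(a, b):
    """CD: complement of coefficient of community, presence/absence only."""
    species_a = sum(1 for x in a if x > 0)
    species_b = sum(1 for y in b if y > 0)
    if species_a + species_b == 0:
        raise ValueError("both samples are empty")
    shared = sum(1 for x, y in zip(a, b) if x > 0 and y > 0)
    return 1.0 - 2.0 * shared / (species_a + species_b)
```

Unlike ED, the PD and CD values are bounded, so two samples with no species in common always sit at the maximum distance of 1 regardless of how abundant their species are.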

10.
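A study of suitable clustering methods for the ecological classification of drought resistance in wheat   Total citations: 5, self-citations: 2, citations by others: 3
This study explored clustering methods suited to the ecological classification of drought resistance in wheat cultivars. Using 21 agronomic traits and 15 winter wheat cultivars (lines), the classification results obtained with different strategies at each stage of cluster analysis were compared on a large scale. In terms of closeness to the experts' empirical classification, the results were as follows. Among data transformation methods, raw data ranked above, in order, varimax orthogonal rotation, Promax oblique rotation, and principal components, all based on the ordinary correlation matrix. Among similarity measures, Euclidean distance ranked above Mahalanobis distance. Among clustering approaches, correspondence analysis and fuzzy clustering ranked above the shortest-distance (single-linkage) method, the longest-distance …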

11.
The Mahalanobis generalized distance can advantageously be used to achieve the hierarchical clustering of groups of individuals. With a set of nearly 12,000 biometrical data, comprising populations of 14 different species of clover, we tried four methods to cluster those populations, in order to compare their results and to see whether the numerical classification obtained agrees with the botanical taxonomy. One of those methods is a conventional hierarchical clustering technique, based upon the Euclidean distances between the means of the populations, while the three other methods make use, with an increasing degree of complexity, of the generalized distances. The three generalized-distance methods gave clearly better results.
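The difference between the Euclidean distance and the generalized distance between two population means can be illustrated with a two-trait sketch. It assumes a shared 2×2 covariance matrix for the two populations. The function name is hypothetical, the restriction to two traits is only to keep the matrix inverse explicit, and the three clustering procedures of the abstract are not reproduced.

```python
def mahalanobis_squared_2d(mean_a, mean_b, cov):
    """Squared generalized distance d^T S^-1 d for two traits.

    mean_a, mean_b: (trait1, trait2) population means.
    cov: assumed shared 2x2 covariance matrix [[s11, s12], [s21, s22]].
    """
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx = mean_a[0] - mean_b[0]
    dy = mean_a[1] - mean_b[1]
    # Inverse of a 2x2 matrix is [[s22, -s12], [-s21, s11]] / det.
    return (s22 * dx * dx - (s12 + s21) * dx * dy + s11 * dy * dy) / det
```

With an identity covariance matrix the result equals the squared Euclidean distance. A trait with large variance contributes less, and positively correlated traits that differ in the same direction are discounted, which is how the generalized distance accounts for within-population variation.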

12.
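Chen Ying  Yao Fangjie  Zhang Youmin  Fang Ming 《Mycosystema》2014,33(5):984-996
Based on a survey of the agronomic traits of cultivated Auricularia (wood ear) germplasm resources, 20 Auricularia strains were classified by Q-type cluster analysis from numerical taxonomy, and 14 agronomic traits were subjected to R-type cluster analysis and principal component analysis. The results showed that Q-type clustering divided the 20 strains, at a Euclidean distance of 6.29, into two major groups according to fruiting body shape: cespitose-type strains and chrysanthemum-type strains. At a Euclidean distance of 4.79, the chrysanthemum-type group was divided according to primordium formation type, a growth-period trait, into two subgroups, a dispersed type and a concentrated type. R-type clustering showed strong correlations among 11 agronomic traits, comprising mycelial traits (1), growth-period traits (2) and fruiting body traits (8). In the principal component analysis, five traits (wrinkling of the abaxial surface of the fruiting body, number of lobes, time of primordium formation, fruiting body shape, and abaxial color of the dried fruiting body) formed the first principal component of the 14 agronomic traits, with a contribution rate as high as 62.26%. The first principal component was named the shape and growth-period factor and was used as an index for germplasm evaluation.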

13.
The metric of functional evenness FEve is an example of how approaches to conceptualizing and measuring functional variability may go astray. This index has several critical conceptual and practical drawbacks:
  1. Different values of the FEve index for the same community can be obtained if the species have unequal species abundances; this result is highly likely if most of the traits are categorical.
  2. Very minor differences in even one pairwise distance can result in very different values of FEve.
  3. FEve uses only a fraction of the information contained in the matrix of species distances. Counterintuitively, this can cause very similar FEve scores for communities with substantially different patterns of species dispersal in trait space.
  4. FEve is a valid metric only if all species have exactly the same abundances. However, the meaning of FEve in such an instance is unclear as the purpose of the metric is to measure the variability of abundances in trait space.
We recommend not using the FEve metric in studies of functional variability. Given the wide usage of the FEve index over the last decade, the validity of the conclusions based on those estimates is in question. Instead, we suggest three alternative metrics that combine variability in species distances in trait space with abundance in various ways. More broadly, we recommend that researchers think about which community properties (e.g., trait distances of a focus species to the nearest neighbor or all other species, variability of pairwise interactions between species) they want to measure and pick from among the appropriate metrics.

14.
MOTIVATION: Alignment-free metrics were recently reviewed by the authors, but have not until now been the object of a comparative study. This paper compares the classification accuracy of the word composition metrics reviewed there. It also presents a new distance definition between protein sequences, the W-metric, which bridges between alignment metrics, such as scores produced by the Smith-Waterman algorithm, and methods based solely on L-tuple composition, such as Euclidean distance and Information content. RESULTS: The comparative study reported here used the SCOP/ASTRAL protein structure hierarchical database and assessed the discriminant value of alternative sequence dissimilarity measures by calculating areas under the Receiver Operating Characteristic curves. Although alignment methods resulted in very good classification accuracy at family and superfamily levels, alignment-free distances, in particular Standard Euclidean Distance, are as good as alignment algorithms when sequence similarity is smaller, such as for recognition of fold or class relationships. This observation justifies their use to pre-filter homologous proteins, since word statistics techniques are computed much faster than the alignment methods. AVAILABILITY: All MATLAB code used to generate the data is available upon request to the authors. Additional material available at http://bioinformatics.musc.edu/wmetric

15.
MOTIVATION: Microarray experiments have revolutionized the study of gene expression with their ability to generate large amounts of data. This article describes an alternative to existing approaches to clustering of gene expression profiles; the key idea is to cluster in stages using a hierarchy of distance measures. This method is motivated by the way in which the human mind sorts and so groups many items. The distance measures arise from the orthogonal breakup of Euclidean distance, giving us a set of independent measures of different attributes of the gene expression profile. Interpretation of these distances is closely related to the statistical design of the microarray experiment. This clustering method not only accommodates missing data but also leads to an associated imputation method. RESULTS: The performance of the clustering and imputation methods was tested on a simulated dataset, a yeast cell cycle dataset and a central nervous system development dataset. Based on the Rand and adjusted Rand indices, the clustering method is more consistent with the biological classification of the data than commonly used clustering methods. The imputation method, at varying levels of missingness, outperforms most imputation methods, based on root mean squared error (RMSE). AVAILABILITY: Code in R is available on request from the authors.

16.
Landscape similarity search involves finding landscapes from among a large collection that are similar to a query landscape. An example of such a collection is a large land cover map subdivided into a grid of smaller local landscapes; a query is a local landscape of interest, and the task is to find other local landscapes within the map that are perceptually similar to the query. Landscape search and the related task of pattern-based regionalization require a measure of similarity, that is, a function that quantifies the level of likeness between two landscapes. The standard approach is to use the Euclidean distance between vectors of landscape metrics derived from the two landscapes, but no in-depth analysis of this approach has been conducted. In this paper we investigate the performance of different implementations of the standard similarity measure. Five different implementations are tested against each other and against a control similarity measure based on histograms of class co-occurrence features and the Jensen-Shannon divergence. Testing consists of a series of numerical experiments combined with visual assessments on a set of 400 3 km-scale landscapes. Based on the cases where visual assessment provides a definitive answer, we have determined that the standard similarity measure is sensitive to the way landscape metrics are normalized and, additionally, to whether weights aimed at controlling the relative contribution of landscape composition vs. configuration are used. The standard measure achieves the best performance when metrics are normalized using their extreme values extracted from all possible landscapes, not just the landscapes in the given collection, and when weights are assigned so the combined influence of composition metrics on the similarity value equals the combined influence of configuration metrics. We have also determined that the control similarity measure outperforms all implementations of the standard measure.

17.
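Cluster analysis of evergreen broad-leaved forests in Guangxi   Total citations: 9, self-citations: 1, citations by others: 8
In this paper, cluster analysis was used to classify sample-plot data from 30 evergreen broad-leaved forest plots in different regions of Guangxi. Similarity between plots was computed with the Bray-Curtis and Euclidean distance formulas. Eight agglomeration strategies were used: nearest neighbor, furthest neighbor, median, centroid, group average, flexible group average, flexible, and incremental sum of squares. The results show that the types delimited by clustering have both similarities to and differences from the types delimited by dominant species. Cluster analysis, however, can separate plots more finely: it can group plots from different regions that share the same dominant species into one class, and sometimes it can also separate them; it can likewise merge plots from the same region that have different dominant species. This reflects changes in floristic composition associated with habitat conditions such as the latitude and altitude of the region. Based on the clustering results, 25 plots were divided into 7 types, and indicator species were identified for 5 of them. The 8 clustering methods and the thresholds for hierarchical division are also discussed.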

18.
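Effects of regional farmland landscape pattern on wheat aphid population abundance   Total citations: 2, self-citations: 0, citations by others: 2
Zhang Yongsheng  Ouyang Fang  Men Xingyuan  Ge Feng  Yuan Zheming 《Acta Ecologica Sinica》2018,38(23):8652-8659
Clarifying how farmland landscape pattern affects aphid populations in wheat fields is one of the important theoretical bases for regional ecological regulation of pests. Taking a regional wheat-growing area as the study object, landscape pattern indices were calculated from remote sensing imagery, land cover classification data, and field-survey data on aphid populations, and a generalized linear model with a negative binomial distribution was used to analyze the effects of regional farmland landscape pattern on wheat aphid populations from three aspects: the farmland landscape, the non-crop habitat landscape, and the regional landscape. The results showed that aphid abundance was significantly positively correlated with the mean patch area and largest patch index of grassland, significantly negatively correlated with the county-level mean Euclidean nearest-neighbor distance and area-weighted mean patch area, significantly negatively correlated with the area-weighted mean patch area of cropland, and significantly positively correlated with the patch density of cropland. Larger grassland patches, fragmentation of the regional landscape and of cropland, and aggregation of the regional landscape promote increases in aphid abundance. The patch area and largest patch index of grassland and the mean Euclidean nearest-neighbor distance of the regional landscape can be used to predict aphid occurrence. The patch area of grassland as a non-crop habitat, the fragmentation of cropland, and the spatial distribution and fragmentation of the regional landscape are important landscape factors affecting the occurrence of wheat aphid populations.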

19.
Benthic invertebrate data from thirty-nine lakes in south-central Ontario were analyzed to determine the effect of choosing particular data standardizations, resemblance measures, and ordination methods on the resultant multivariate summaries. Logarithmic-transformed, 0–1 scaled, and ranked data were used as standardized variables with resemblance measures of Bray-Curtis, Euclidean distance, cosine distance, correlation, covariance and chi-squared distance. Combinations of these measures and standardizations were used in principal components analysis, principal coordinates analysis, non-metric multidimensional scaling, correspondence analysis, and detrended correspondence analysis. Correspondence analysis and principal components analysis using a correlation coefficient provided the most consistent results irrespective of the choice in data standardization. Other approaches using detrended correspondence analysis, principal components analysis, principal coordinates analysis, and non-metric multidimensional scaling provided less consistent results. These latter three methods produced similar results when the abundance data were replaced with ranks or standardized to a 0–1 range. The log-transformed data produced the least consistent results, whereas ranked data were most consistent. Resemblance measures such as the Bray-Curtis and correlation coefficient provided more consistent solutions than measures such as Euclidean distance or the covariance matrix when different data standardizations were used. The cosine distance based on standardized data provided results comparable to the CA and DCA solutions. Overall, CA proved most robust as it demonstrated high consistency irrespective of the data standardizations. The strong influence of data standardization on the other ordination methods emphasizes the importance of this frequently neglected stage of data analysis.

20.
Rapid developments in the biomedical sciences have increased the demand for automatic clustering of biomedical publications. In contrast to current approaches to text clustering, which focus exclusively on the contents of abstracts, a novel method is proposed for clustering and analysis of complete biomedical article texts. To reduce dimensionality, the cosine coefficient is used on a sub-space of only two vectors, instead of computing the Euclidean distance within the space of all vectors. Then a strategy and an algorithm are introduced for Semi-supervised Affinity Propagation (SSAP) to improve analysis efficiency, using biomedical journal names as an evaluation background. Experimental results show that by avoiding high-dimensional sparse matrix computations, SSAP outperforms conventional k-means methods and improves upon the standard Affinity Propagation algorithm. From the directed relationship network and distribution matrix constructed for the clustering results, overlaps in scope and interests among BioMed publications can be easily identified, providing a valuable analytical tool for editors, authors and readers.
