Similar literature
 20 similar documents found (search time: 31 ms)
1.
MOTIVATION: Biological objects tend to cluster into discrete groups. Objects within a group typically possess similar properties. It is important to have fast and efficient tools for grouping objects that result in biologically meaningful clusters. Protein sequences reflect biological diversity and offer an extraordinary variety of objects for polishing clustering strategies. Grouping of sequences should reflect their evolutionary history and their functional properties. Visualization of relationships between sequences is of no less importance. Tree-building methods are typically used for such visualization. An alternative approach to visualization is a multidimensional sequence space. In this space, proteins are defined as points and distances between the points reflect the relationships between the proteins. Such a space can also be a basis for model-based clustering strategies that typically produce results correlating better with biological properties of proteins. RESULTS: We developed an approach to classification of biological objects that combines evolutionary measures of their similarity with a model-based clustering procedure. We apply the methodology to amino acid sequences. In the first step, given a multiple sequence alignment, we estimate evolutionary distances between proteins measured in expected numbers of amino acid substitutions per site. These distances are additive and are suitable for evolutionary tree reconstruction. In the second step, we find the best-fit approximation of the evolutionary distances by Euclidean distances and thus represent each protein by a point in a multidimensional space. The Euclidean space may be projected in two or three dimensions and the projections can be used to visualize relationships between proteins. In the third step, we find a non-parametric estimate of the probability density of the points and group the points that belong to the same local maximum of this density. The number of groups is controlled by a sigma-parameter that determines the shape of the density estimate and the number of maxima in it. The grouping procedure outperforms commonly used methods such as UPGMA and single linkage clustering.
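The third step of abstract 1 can be illustrated with a small sketch. The abstract does not say which density estimator or hill-climbing rule the authors use, so the code below swaps in a Gaussian kernel density estimate with mean-shift updates as a stand-in. The function names, the `merge_radius` tolerance and the toy coordinates are all assumptions made for illustration.

```python
import math

def climb_to_mode(start, points, sigma, iters=200, tol=1e-9):
    """Follow mean-shift steps from `start` to a local maximum of a Gaussian KDE."""
    x = list(start)
    for _ in range(iters):
        # Gaussian kernel weight of every data point as seen from x.
        weights = [math.exp(-math.dist(x, q) ** 2 / (2 * sigma ** 2)) for q in points]
        total = sum(weights)
        new_x = [sum(w * q[d] for w, q in zip(weights, points)) / total
                 for d in range(len(x))]
        moved = math.dist(x, new_x)
        x = new_x
        if moved < tol:
            break
    return x

def density_groups(points, sigma):
    """Label each point by the density maximum it climbs to."""
    merge_radius = sigma / 2  # assumed tolerance for treating two modes as one
    modes, labels = [], []
    for p in points:
        m = climb_to_mode(p, points, sigma)
        for i, known in enumerate(modes):
            if math.dist(m, known) < merge_radius:
                labels.append(i)
                break
        else:
            modes.append(m)
            labels.append(len(modes) - 1)
    return labels

# Toy "protein" coordinates (hypothetical): two tight groups in a 2-D space.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
print(density_groups(points, sigma=0.5))             # narrow kernel: two groups
print(len(set(density_groups(points, sigma=10.0))))  # wide kernel: one group
```

A narrow kernel keeps both density maxima, while a wide kernel smooths them into one. This is the sense in which a sigma-type parameter controls the number of groups.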

2.
Fuzzy C-means method for clustering microarray data   (cited 9 times in total: 0 self-citations, 9 citations by others)
MOTIVATION: Clustering analysis of data from DNA microarray hybridization studies is essential for identifying biologically relevant groups of genes. Partitional clustering methods such as K-means or self-organizing maps assign each gene to a single cluster. However, these methods do not provide information about the influence of a given gene on the overall shape of clusters. Here we apply a fuzzy partitioning method, Fuzzy C-means (FCM), to attribute cluster membership values to genes. RESULTS: A major problem in applying the FCM method for clustering microarray data is the choice of the fuzziness parameter m. We show that the commonly used value m = 2 is not appropriate for some data sets, and that optimal values for m vary widely from one data set to another. We propose an empirical method, based on the distribution of distances between genes in a given data set, to determine an adequate value for m. By setting threshold levels for the membership values, genes which are tightly associated with a given cluster can be selected. Using a yeast cell cycle data set as an example, we show that this selection increases the overall biological significance of the genes within the cluster. AVAILABILITY: Supplementary text and Matlab functions are available at http://www-igbmc.u-strasbg.fr/fcm/
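To make the role of the fuzziness parameter m concrete, here is a minimal sketch of the textbook FCM membership formula, u_i = 1 / Σ_j (d_i / d_j)^(2/(m−1)), followed by a membership threshold. It is not the authors' Matlab code: the cluster centres, the one-dimensional "expression" values and the helper name `core_genes` are hypothetical.

```python
import math

def fcm_memberships(point, centers, m=2.0):
    """Textbook fuzzy C-means membership of one point in each cluster."""
    if m <= 1.0:
        raise ValueError("fuzziness m must be greater than 1")
    dists = [math.dist(point, c) for c in centers]
    # A point sitting exactly on a centre belongs fully to that centre.
    if any(d == 0.0 for d in dists):
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]

def core_genes(memberships, threshold):
    """Indices of items whose highest membership reaches the threshold."""
    return [i for i, u in enumerate(memberships) if max(u) >= threshold]

centers = [(0.0,), (4.0,)]          # hypothetical cluster centres
genes = [(1.0,), (2.0,), (3.5,)]    # hypothetical expression values
for m in (1.5, 2.0, 3.0):
    u = [fcm_memberships(g, centers, m) for g in genes]
    print(m, [[round(x, 3) for x in row] for row in u], core_genes(u, 0.8))
```

Larger m flattens the memberships towards equal shares, so fewer genes pass a fixed threshold. That is one way to see why a single default such as m = 2 may not suit every data set.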

3.
Taxonomy-independent analysis plays an essential role in microbial community analysis. Hierarchical clustering is one of the most widely employed approaches to finding operational taxonomic units, the basis for many downstream analyses. Most existing algorithms have quadratic space and computational complexities, and thus can be used only for small or medium-scale problems. We propose a new online learning-based algorithm that simultaneously addresses the space and computational issues of prior work. The basic idea is to partition a sequence space into a set of subspaces using a partition tree constructed using a pseudometric, then recursively refine a clustering structure in these subspaces. The technique relies on new methods for fast closest-pair searching and efficient dynamic insertion and deletion of tree nodes. To avoid exhaustive computation of pairwise distances between clusters, we represent each cluster of sequences as a probabilistic sequence, and define a set of operations to align these probabilistic sequences and compute genetic distances between them. We present analyses of space and computational complexity, and demonstrate the effectiveness of our new algorithm using a human gut microbiota data set with over one million sequences. The new algorithm exhibits a quasilinear time and space complexity comparable to greedy heuristic clustering algorithms, while achieving a similar accuracy to the standard hierarchical clustering algorithm.

4.
We present a new, practical algorithm to resolve the experimental data in restriction site analysis, which is a common technique for mapping DNA. Specifically, we assert that multiple digestions with a single restriction enzyme can provide sufficient information to identify the positions of the restriction sites with high probability. The motivation for the new approach comes from combinatorial results on the number of mutually homeometric sets in one dimension, where two sets of n points are homeometric if the multisets of n(n−1)/2 distances they determine are the same. Since experimental data contain errors, we propose algorithms for reconstructing sets from noisy interpoint distances, including the possibility of missing fragments. We analyse the performance of these algorithms under a reasonable probability distribution, establishing a relative error limit of r = Θ(1/n²) beyond which our technique becomes infeasible. Through simulations, we establish that our technique is robust enough to reconstruct data with relative errors of up to 7.0% in the measured fragment lengths for typical problems, which appears sufficient for certain biological applications.
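The definition of homeometric sets in abstract 4 is easy to check numerically. The sketch below compares the multisets of pairwise distances for two point sets on a line. The pair of sets used is a well-known textbook example rather than data from the paper, and the helper names are assumptions.

```python
from collections import Counter
from itertools import combinations

def distance_multiset(points):
    """Multiset of the n(n-1)/2 pairwise distances of points on a line."""
    return Counter(abs(a - b) for a, b in combinations(points, 2))

def are_homeometric(p, q):
    """True if both sets have the same size and the same distance multiset."""
    return len(p) == len(q) and distance_multiset(p) == distance_multiset(q)

def is_mirror_or_shift(p, q):
    """True if q is just p translated and/or reflected (the trivial case)."""
    def norm(s):
        return sorted(x - min(s) for x in s)
    mirrored = [max(p) - x for x in p]
    return norm(q) in (norm(p), norm(mirrored))

# Two different site layouts that yield identical interpoint distances.
a = [0, 1, 4, 10, 12, 17]
b = [0, 1, 8, 11, 13, 17]
print(are_homeometric(a, b), is_mirror_or_shift(a, b))  # True False
```

Because `a` and `b` are homeometric without being mirror images, error-free distance data alone cannot tell them apart. This is the kind of ambiguity a reconstruction method has to account for.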

5.
MOTIVATION: Biologists often employ clustering techniques in the explorative phase of microarray data analysis to discover relevant biological groupings. Given the availability of numerous clustering algorithms in the machine-learning literature, a user might want to select the one that performs best for his/her data set or application. While various validation measures have been proposed over the years to judge the quality of clusters produced by a given clustering algorithm, including their biological relevance, a given clustering algorithm can unfortunately perform poorly under one validation measure while outperforming many other algorithms under another. A manual synthesis of results from multiple validation measures is nearly impossible in practice, especially when a large number of clustering algorithms are to be compared using several measures. An automated and objective way of reconciling the rankings is needed. RESULTS: Using a Monte Carlo cross-entropy algorithm, we successfully combine the ranks of a set of clustering algorithms under consideration via a weighted aggregation that optimizes a distance criterion. The proposed weighted rank aggregation allows for a far more objective and automated assessment of clustering results than a simple visual inspection. We illustrate our procedure using one simulated and three real gene expression data sets from various platforms, where we rank a total of eleven clustering algorithms using a combined examination of 10 different validation measures. The aggregate rankings were found for a given number of clusters k and also for an entire range of k. AVAILABILITY: R code for all validation measures and rank aggregation is available from the authors upon request. SUPPLEMENTARY INFORMATION: Supplementary information is available at http://www.somnathdatta.org/Supp/RankCluster/supp.htm.

6.
We examined the efficiencies of ordination methods in the treatment of gene frequency data at the intraspecific level, using metric and nonmetric distance measures (Nei's and Rogers' genetic distances, χ² distance). We assessed initial processes responsible for the geographical distribution of the Mediterranean land snail Helix aspersa. Seventeen enzyme loci from 30 North African snail populations were considered in the present analysis. Five combinations of distance/multivariate analysis were compared: correspondence analysis (CA), nonmetric multidimensional scaling (NMDS) on Nei's, Rogers', and χ² distances, and principal coordinates analysis on Rogers' distances. The configuration of the objects resulting from ordination was projected onto three-dimensional graphics with the minimum spanning tree or the relative neighborhood graph superimposed. Pre- and postordination or clustering distance matrices were compared by means of correlation methods. As expected, all combinations led to a clear west versus east pattern of variation. However, the intraregional relationships and degree of connectivity between pairs of operational taxonomic units were not necessarily constant from one method to another. Ordination methods applied with Nei's and Rogers' distances provided the best fit with the original distances (r = 0.98), compared with UPGMA clustering (r ≈ 0.75). The Nei/NMDS combination seems to be a good compromise (distortion index dt = 10%) between Rogers/NMDS, which produces a more confusing pattern of differentiation (dt = 24%), and χ²/CA, which tends to distort large distances (dt = 31%). NMDS provides a powerful method to summarize relationships between populations when neither hierarchical structure nor phylogenetic inference is required. These findings lead to a discussion of the good performance of NMDS, the appropriate distances to be used, and the potential application of this method to other types of allelic data (such as microsatellite loci) or data on nucleotide sequences of genes.

7.
The ultrastructural localization of various antigens in a cell using antibodies conjugated to gold particles is a powerful instrument in biological research. However, statistical or stereological tools for testing the observed patterns for significant clustering or colocalization are missing. The paper presents a method for the quantitative analysis of single or multiple immunogold labeling patterns using interpoint distances and tests the method using experimental data. The clustering or colocalization of gold particles was detected using various characteristics of the distribution of distances between them. Pair correlation and cross-correlation functions were used for exploratory analysis; second order reduced K (or cross-K) functions were used for testing the statistical significance of observed events. Confidence intervals of function values were estimated by Monte Carlo simulations of the Poisson process for independent particles, and results were visualized in histograms. Furthermore, the suitability of K functions modified by censoring or weighting was tested. The reliability of the method was assessed by evaluating the labeling patterns of nascent DNA and several nuclear proteins with known functions in replication foci of HeLa cells. The results demonstrate that the method is a powerful tool in biological investigations for testing the statistical significance of observed clustering or colocalization patterns in immunogold labeling experiments.
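The K-function test in abstract 7 can be sketched as follows. This is a deliberately naive version under stated assumptions: no edge correction, a rectangular observation window, and a simulation that places a fixed number of particles uniformly at random as a stand-in for the Poisson null model. The particle coordinates and function names are hypothetical, and the paper's censored or weighted variants are not reproduced.

```python
import math
import random

def ripley_k(points, r, area):
    """Naive K(r): area / (n(n-1)) times the number of ordered pairs within r."""
    n = len(points)
    close = sum(1 for i, p in enumerate(points) for j, q in enumerate(points)
                if i != j and math.dist(p, q) <= r)
    return area * close / (n * (n - 1))

def csr_envelope(n, width, height, r, sims=99, seed=0):
    """Min and max of K(r) over simulated patterns of n independent uniform points."""
    rng = random.Random(seed)
    values = []
    for _ in range(sims):
        pts = [(rng.uniform(0, width), rng.uniform(0, height)) for _ in range(n)]
        values.append(ripley_k(pts, r, width * height))
    return min(values), max(values)

# Hypothetical gold-particle positions: one tight cluster in a 10 x 10 window.
gold = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)]
k_obs = ripley_k(gold, r=1.0, area=100.0)
low, high = csr_envelope(len(gold), 10.0, 10.0, r=1.0)
print(k_obs, math.pi * 1.0 ** 2, (low, high))
```

Under complete spatial randomness K(r) is close to πr², so an observed value far above the simulated envelope points to clustering at that distance.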

8.
Tao Hua (陶华), Tang Xuqing (唐旭清). 《生物信息学》 (Chinese Journal of Bioinformatics), 2012, 10(4): 269-273, 279
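A clustering-structure analysis of protein sequences is carried out based on the granular space of fuzzy proximity relations. MEGA software is used to compute alignment distances between the selected xylanase sequences, and an inner product is introduced to convert these distances into a fuzzy proximity relation (or matrix). An algorithm is then applied to solve for the corresponding granular space, which is used to analyse the clustering structure of the sequences and to determine the optimal clustering. These studies provide a quantitative analysis tool for protein sequences.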

9.
The clustering propensity of microRNA genes is a common biological phenomenon in various animal and plant species. To gain novel insight into the genomic organization and potential functional heterogeneities of miRNA clusters in vertebrates at the genome scale, we used large-scale data and a combination of comparative genomics and bioinformatics approaches to examine various features of the genomic organization of miRNA clusters across seven vertebrates. Pair-wise distance analysis of same-strand consecutive miRNAs suggested that the fraction of miRNA gene pairs at relatively short pair-wise distances is higher than that of protein-coding genes and other non-coding RNA genes. In particular, a relatively small number of miRNAs are more clustered at very short pair-wise distances than expected at random. We further observed significant differences between real miRNA clusters and randomly organized clusters in several respects, including higher overlap of target genes, fewer seed types and significant enrichment in diseases. However, the extent of these features varies among clustered miRNAs and depends largely on inter-miRNA distances because of the diverse clustering propensity of miRNAs in vertebrates, suggesting that the cooperative function or cooperative effects of miRNAs in clusters may be affected by inter-miRNA distances.

10.
Spectral clustering of protein sequences   (cited 1 time in total: 0 self-citations, 1 citation by others)
An important problem in genomics is automatically clustering homologous proteins when only sequence information is available. Most methods for clustering proteins are local, and are based on simply thresholding a measure related to sequence distance. We first show how locality limits the performance of such methods by analysing the distribution of distances between protein sequences. We then present a global method based on spectral clustering and provide theoretical justification of why it will have a remarkable improvement over local methods. We extensively tested our method and compared its performance with other local methods on several subsets of the SCOP (Structural Classification of Proteins) database, a gold standard for protein structure classification. We consistently observed that the number of clusters that we obtain for a given set of proteins is close to the number of superfamilies in that set, that there are fewer singletons, and that the method correctly groups most remote homologs. In our experiments, the quality of the clusters as quantified by a measure that combines sensitivity and specificity was consistently better [on average, improvements were 84% over hierarchical clustering, 34% over Connected Component Analysis (CCA) (similar to GeneRAGE) and 72% over another global method, TribeMCL].
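As a rough illustration of what a "global" spectral method does, the sketch below splits items into two groups using the second eigenvector of the normalised similarity matrix D^(-1/2) W D^(-1/2), found by power iteration. It is a generic two-way spectral partition under simplifying assumptions, not the algorithm of the paper. The similarity matrix `W` stands in for pairwise sequence-similarity scores and is made up for the example.

```python
import math

def spectral_bipartition(W, iters=500):
    """Two-way split from the second eigenvector of D^-1/2 W D^-1/2."""
    n = len(W)
    deg = [sum(row) for row in W]
    M = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # The top eigenvector of M is known in closed form: sqrt(degree), normalised.
    norm = math.sqrt(sum(deg))
    top = [math.sqrt(d) / norm for d in deg]
    v = [float(i + 1) for i in range(n)]  # deterministic start vector
    for _ in range(iters):
        # Remove the component along the top eigenvector.
        dot = sum(a * b for a, b in zip(v, top))
        v = [a - dot * b for a, b in zip(v, top)]
        # Multiply by (M + I) / 2 so that all eigenvalues are non-negative.
        v = [(sum(M[i][j] * v[j] for j in range(n)) + v[i]) / 2 for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    # Undo the D^1/2 scaling and split on the sign.
    return [0 if v[i] / math.sqrt(deg[i]) >= 0 else 1 for i in range(n)]

# Hypothetical similarities: items 0-1 and items 2-3 form two families.
W = [
    [1.00, 0.90, 0.05, 0.05],
    [0.90, 1.00, 0.05, 0.05],
    [0.05, 0.05, 1.00, 0.90],
    [0.05, 0.05, 0.90, 1.00],
]
labels = spectral_bipartition(W)
print(labels)
```

Unlike thresholding individual pairwise scores, the eigenvector is computed from the whole matrix at once, which is what makes the approach global. Which group receives label 0 depends only on the sign of the eigenvector and carries no meaning.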

11.
A central step in the analysis of gene expression data is the identification of groups of genes that exhibit similar expression patterns. Clustering and ordering genes into homogeneous groups using gene expression data has been shown to be useful in functional annotation, tissue classification, regulatory motif identification, and other applications. Although there is a rich literature on gene ordering in the hierarchical clustering framework for gene expression analysis, to the best of the authors' knowledge there is no work addressing and evaluating the importance of gene ordering in the partitive clustering framework. Outside the framework of hierarchical clustering, different gene ordering algorithms are applied to the whole data set, and the domain of partitive clustering is still unexplored with gene ordering approaches. A new hybrid method is proposed for ordering genes in each of the clusters obtained from a partitive clustering solution, using microarray gene expressions. Two existing algorithms for optimally ordering cities in the travelling salesman problem (TSP), namely FRAG_GALK and Concorde, are hybridized individually with the self-organizing map to show the importance of gene ordering in the partitive clustering framework. We validated our hybrid approach using yeast and fibroblast data and showed that it improves the result quality of the partitive clustering solution by identifying subclusters within big clusters, grouping functionally correlated genes within clusters, minimizing the summation of gene expression distances, and maximizing the biological gene ordering using MIPS categorization. Moreover, the new hybrid approach finds a comparable or sometimes superior biological gene order in less computation time than that obtained by optimal leaf ordering in a hierarchical clustering solution.

12.
MOTIVATION: Current Self-Organizing Map (SOM) approaches to gene expression pattern clustering require the user to predefine the number of clusters likely to be expected. Hierarchical clustering methods used in this area do not provide unique partitioning of data. We describe an unsupervised dynamic hierarchical self-organizing approach, which suggests an appropriate number of clusters, to perform class discovery and marker gene identification in microarray data. In the process of class discovery, the proposed algorithm identifies corresponding sets of predictor genes that best distinguish one class from other classes. The approach integrates merits of hierarchical clustering with robustness against noise known from self-organizing approaches. RESULTS: The proposed algorithm applied to DNA microarray data sets of two types of cancers has demonstrated its ability to produce the most suitable number of clusters. Further, the corresponding marker genes identified through the unsupervised algorithm also have a strong biological relationship to the specific cancer class. The algorithm tested on leukemia microarray data, which contains three leukemia types, was able to determine three major clusters and one minor cluster. Prediction models built for the four clusters indicate that the prediction strength for the smaller cluster is generally low, and it is therefore labelled as an uncertain cluster. Further analysis shows that the uncertain cluster can be subdivided further, and the subdivisions are related to two of the original clusters. Another test performed using colon cancer microarray data has automatically derived two clusters, which is consistent with the number of classes in the data (cancerous and normal). AVAILABILITY: JAVA software of dynamic SOM tree algorithm is available upon request for academic use. SUPPLEMENTARY INFORMATION: A comparison of rectangular and hexagonal topologies for GSOM is available from http://www.mame.mu.oz.au/mechatronics/journalinfo/Hsu2003supp.pdf

13.
The updating and rethinking of vegetation classifications is important for ecosystem monitoring in a rapidly changing world, where the distribution of vegetation is changing. The general assumption that discrete and persistent plant communities exist that can be monitored efficiently is rarely tested before undertaking a classification. Marion Island (MI) is comprised of species-poor vegetation undergoing rapid environmental change. It presents a unique opportunity to test the ability to discretely classify species-poor vegetation with recently developed objective classification techniques and relate it to previous classifications. We classified vascular species data of 476 plots sampled across MI, using Ward hierarchical clustering, divisive analysis clustering, non-hierarchical k-means and partitioning around medoids. Internal cluster validation was performed using silhouette widths, Dunn index, connectivity of clusters and gap statistic. Indicator species analyses were also conducted on the best performing clustering methods. We evaluated the outputs against previously classified units. Ward clustering performed the best, with the highest average silhouette width and Dunn index, as well as the lowest connectivity. The number of clusters differed amongst the clustering methods, but most validation measures, including for Ward clustering, indicated that two and three clusters are the best fit for the data. However, all classification methods produced weakly separated, highly connected clusters with low compactness and low fidelity and specificity to clusters. There was no particularly robust and effective classification outcome that could group plots into previously suggested vegetation units based on species composition alone. The relatively recent age (c. 450,000 years B.P.), glaciation history (last glacial maximum 34,500 years B.P.) and isolation of the sub-Antarctic islands may have hindered the development of strong vascular plant species assemblages with discrete boundaries. Discrete classification at the community level using species composition may not be suitable in such species-poor environments. Species-level, rather than community-level, monitoring may thus be more appropriate in species-poor environments, aligning with continuum theory rather than community theory.
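Of the internal validation measures listed in abstract 13, the silhouette width is the simplest to show. The sketch below uses the usual definition s = (b − a) / max(a, b), where a is the mean distance to the point's own cluster and b is the smallest mean distance to any other cluster. The one-dimensional "plots" and both labelings are invented for illustration and Euclidean distance is assumed; the study's actual dissimilarity measure is not stated in the abstract.

```python
import math

def silhouette_widths(points, labels):
    """Silhouette width of every point for a given hard clustering."""
    widths = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            # Singleton cluster or only one cluster: width is 0 by convention.
            widths.append(0.0)
            continue
        a = sum(own) / len(own)                                  # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())    # separation
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return widths

plots = [(0.0,), (1.0,), (10.0,), (11.0,)]
good = silhouette_widths(plots, [0, 0, 1, 1])
bad = silhouette_widths(plots, [0, 1, 0, 1])
print(sum(good) / len(good), sum(bad) / len(bad))
```

An average width near 1 indicates compact, well-separated clusters, while values near or below 0 indicate the kind of weak separation the abstract reports.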

14.

Background

Clustering is a widely used technique for analysis of gene expression data. Most clustering methods group genes based on distances, while few methods group genes according to the similarities of the distributions of the gene expression levels. Furthermore, as biological annotation resources have accumulated, an increasing number of genes have been annotated into functional categories. As a result, evaluating the performance of clustering methods in terms of the functional consistency of the resulting clusters is of great interest.

Results

In this paper, we propose the WDCM (Weibull Distribution-based Clustering Method), a robust approach for clustering gene expression data, in which the expression levels of each gene are treated as random variables following a gene-specific Weibull distribution. Our WDCM is based on the concept that genes with similar expression profiles have similar distribution parameters, and thus the genes are clustered via the Weibull distribution parameters. We used the WDCM to cluster three cancer gene expression data sets from lung cancer, B-cell follicular lymphoma and bladder carcinoma and obtained well-clustered results. We compared the performance of WDCM with k-means and Self Organizing Map (SOM) using functional annotation information given by the Gene Ontology (GO). The results showed that the functional annotation ratios of WDCM are higher than those of the other methods. We also utilized the external measure Adjusted Rand Index to validate the performance of the WDCM. The comparative results demonstrate that the WDCM provides better clustering performance than the k-means and SOM algorithms. The merit of the proposed WDCM is that it can be applied to cluster incomplete gene expression data without imputing the missing values. Moreover, the robustness of WDCM is also evaluated on the incomplete data sets.

Conclusions

The results demonstrate that our WDCM produces clusters with more consistent functional annotations than the other methods. The WDCM is also verified to be robust and is capable of clustering gene expression data containing a small quantity of missing values.

15.
Y Peng, Y Zhang, G Kou, Y Shi. PLoS ONE, 2012, 7(7): e41713
Determining the number of clusters in a data set is an essential yet difficult step in cluster analysis. Since this task involves more than one criterion, it can be modeled as a multiple criteria decision making (MCDM) problem. This paper proposes an MCDM-based approach to estimate the number of clusters for a given data set. In this approach, MCDM methods consider different numbers of clusters as alternatives and the outputs of any clustering algorithm on validity measures as criteria. The proposed method is examined by an experimental study using three MCDM methods, the well-known clustering algorithm k-means, ten relative measures, and fifteen public-domain UCI machine learning data sets. The results show that MCDM methods work fairly well in estimating the number of clusters in the data and outperform the ten relative measures considered in the study.

16.
The spatial distribution of ion channels over the surface of a neuron is an important determinant of its excitable properties. We introduce two measures of channel clustering for use in patch-clamp experiments: a normalized chi-squared statistic (eta) and the number of zero-channel patches in a data set (Z). These statistics were calculated for data sets describing the distribution of A-type potassium channels on neurons of the nudibranch Doriopsilla and measurements of Ca-dependent outward current channels on bullfrog hair cells, as well as simulated channel distributions. When channels are clustered, eta is approximately equal to the amount of current in a cluster. The analysis shows that somatic A-channels in the nudibranch are distributed in clusters of approximately 50 channels each. The clusters are < 2 microns wide and are separated, on average, by 3.2 microns. Outward current channels on hair cells occur in clusters of approximately 27 channels each, in agreement with the original analysis. Channel clustering may reflect properties of the insertion or regulation of channels in the membrane.

17.
18.
An approach for deciding final clusters of a dendrogram is provided. Whether developed by agglomerative or divisive cluster analysis, decisions start at the 2-cluster level of the dendrogram. Cluster density is viewed as compactness and, therefore, is related to interindividual distances. The two clusters and the intercluster space are seen as treatments in analysis of variance with intercluster space as the control treatment. Distances in an interindividual matrix are considered measures of response to the three treatments. The F-ratio indicates if the treatment means differ; the least significant difference indicates which ones are different. The example provided explains our approach in searching for optimal clustering levels.
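The three-treatment idea in abstract 18 can be sketched with a one-way ANOVA on pairwise distances. The code below is a minimal reading of the abstract, not the authors' procedure: it only computes the F-ratio (the least significant difference step is omitted) and assumes cluster labels 0 and 1 and Euclidean distances. The function names and sample points are hypothetical.

```python
import math
from itertools import combinations

def distance_treatments(points, labels):
    """Sort pairwise distances into within-cluster and between-cluster groups."""
    groups = {"within_0": [], "within_1": [], "between": []}
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        key = f"within_{lp}" if lp == lq else "between"
        groups[key].append(math.dist(p, q))
    return groups

def f_ratio(groups):
    """One-way ANOVA F-ratio across the non-empty distance groups."""
    samples = [g for g in groups.values() if g]
    n_total = sum(len(g) for g in samples)
    grand = sum(sum(g) for g in samples) / n_total
    means = [sum(g) / len(g) for g in samples]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(samples, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(samples, means) for x in g)
    df_between = len(samples) - 1
    df_within = n_total - len(samples)
    return (ss_between / df_between) / (ss_within / df_within)

individuals = [(0.0,), (1.0,), (10.0,), (11.0,)]
groups = distance_treatments(individuals, [0, 0, 1, 1])
print(groups, f_ratio(groups))
```

A large F-ratio means that the within-cluster distances are much smaller than the distances spanning the intercluster space, which supports keeping the two clusters separate. Pairwise distances are not independent observations, so the value is better read as a descriptive index than as an exact test.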

19.
Key synaptic proteins from the soluble SNARE (N-ethylmaleimide-sensitive factor attachment protein receptor) family, among many others, are organized at the plasma membrane of cells as clusters containing dozens to hundreds of protein copies. However, the exact membranal distribution of proteins into clusters or as single molecules, the organization of molecules inside the clusters, and the clustering mechanisms are unclear due to limitations of the imaging and analytical tools. Focusing on syntaxin 1 and SNAP-25, we implemented direct stochastic optical reconstruction microscopy together with quantitative clustering algorithms to demonstrate a novel approach to explore the distribution of clustered and nonclustered molecules at the membrane of PC12 cells with single-molecule precision. Direct stochastic optical reconstruction microscopy images reveal, for the first time, solitary syntaxin/SNAP-25 molecules and small clusters as well as larger clusters. The nonclustered syntaxin or SNAP-25 molecules are mostly concentrated in areas adjacent to their own clusters. In the clusters, the density of the molecules gradually decreases from the dense cluster core to the periphery. We further detected large clusters that contain several density gradients. This suggests that some of the clusters are formed by unification of several clusters that preserve their original organization or reorganize into a single unit. Although syntaxin and SNAP-25 share some common distributional features, their clusters differ markedly from each other. SNAP-25 clusters are significantly larger, more elliptical, and less dense. Finally, this study establishes methodological tools for the analysis of single-molecule-based super-resolution imaging data and paves the way for revealing new levels of membranal protein organization.

20.
Yang Dewu (杨德武), Li Xia (李霞), Xiao Xue (肖雪), Yang Yueying (杨月莹), Wang Jing (王靖). 《遗传》 (Hereditas (Beijing)), 2008, 30(9): 1157-1162
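The association between ion channel subtypes and the co-expression of their genes is important for studying ion channel function. This article analyses the data using principal component analysis and the fuzzy C-means clustering algorithm, and applies the method to two sets of expression profile data, from human and mouse. The results show that, among ion channel subtypes, the expression-profile clustering of potassium channels, calcium channels, chloride channels and receptor-activated ion channels agrees well with the biological classification. This reflects co-expression of ion channel subtypes at the mRNA level and confirms that ion channel expression profiles can be used to classify the functional subtypes of ion channels well.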
