Similar articles
1.
Clustering is a popular technique for explorative analysis of data, as it can reveal subgroupings and similarities between data in an unsupervised manner. While clustering is routinely applied to gene expression data, there is a lack of appropriate general methodology for clustering of sequence-level genomic and epigenomic data, e.g. ChIP-based data. We here introduce a general methodology for clustering data sets of coordinates relative to a genome assembly, i.e. genomic tracks. By defining appropriate feature extraction approaches and similarity measures, we allow biologically meaningful clustering to be performed for genomic tracks using standard clustering algorithms. An implementation of the methodology is provided through a tool, ClusTrack, which allows fine-tuned clustering analyses to be specified through a web-based interface. We apply our methods to the clustering of occupancy of the H3K4me1 histone modification in samples from a range of different cell types. The majority of samples form meaningful subclusters, confirming that the definitions of features and similarity capture biological, rather than technical, variation between the genomic tracks. Input data and results are available, and can be reproduced, through a Galaxy Pages document at http://hyperbrowser.uio.no/hb/u/hb-superuser/p/clustrack. The clustering functionality is available as a Galaxy tool, under the menu option "Specialized analyzis of tracks", and the submenu option "Cluster tracks based on genome level similarity", at the Genomic HyperBrowser server: http://hyperbrowser.uio.no/hb/.
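The general idea of this abstract (extract a feature from each genomic track, define a similarity, then run a standard clustering algorithm) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not ClusTrack's implementation: the feature is the set of fixed-size bins a track touches, the similarity is the Jaccard index, and the clustering is a naive average-linkage agglomeration. The function names and the `bin_size` parameter are hypothetical.

```python
from itertools import combinations


def occupied_bins(track, bin_size=1000):
    """Feature extraction (assumed): the set of (chrom, bin) pairs a track touches.

    `track` is a list of (chrom, start, end) tuples with half-open coordinates.
    """
    bins = set()
    for chrom, start, end in track:
        for b in range(start // bin_size, (end - 1) // bin_size + 1):
            bins.add((chrom, b))
    return bins


def jaccard(a, b):
    """Jaccard index of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_tracks(tracks, n_clusters, bin_size=1000):
    """Naive average-linkage agglomerative clustering on 1 - Jaccard distance.

    `tracks` maps a track name to its list of intervals.
    """
    features = {name: occupied_bins(t, bin_size) for name, t in tracks.items()}
    clusters = [[name] for name in sorted(features)]

    def linkage(c1, c2):
        dists = [1 - jaccard(features[x], features[y]) for x in c1 for y in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Merge the pair of clusters with the smallest average distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: linkage(clusters[p[0]], clusters[p[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters
```

Swapping `occupied_bins` or `jaccard` for another feature or similarity leaves the clustering step unchanged, which is the separation of concerns the abstract describes.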

2.
3.
We examined the effects of spatial frequency similarity and dissimilarity on human contour integration under various conditions of uncertainty. Participants performed a temporal 2AFC contour detection task. Spatial frequency jitter of up to 3.0 octaves was applied to background elements only, to both contour and background elements, or to neither. Results converge on four major findings. (1) Contours defined by spatial frequency similarity alone are barely visible, suggesting the absence of specialized cortical routines for shape detection based on spatial frequency similarity. (2) When orientation collinearity and spatial frequency similarity are combined along a contour, performance improves far beyond probability summation when compared to the fully heterogeneous condition, but only by a margin compatible with probability summation when compared to the fully homogeneous case. (3) Psychometric functions are steeper but not shifted for homogeneous contours in heterogeneous backgrounds, indicating an advantageous signal-to-noise ratio. The additional similarity cue therefore does not so much improve contour detection performance as reduce observer uncertainty about whether a potential candidate is a contour or just a false positive. (4) Contour integration is a broadband mechanism which is only moderately impaired by spatial frequency dissimilarity.

4.
In cluster analysis, how to evaluate the similarity between two partitions is of much importance. With a proper similarity measure, we may hope to compare two [illegible] of the same set of objects to study the effect of using different clustering algorithms (Gordon, 1981). We are able to compare the cluster results with the natural configurations [illegible] in the data (Rand, 1971). [Illegible] of similar data can identify the influential factors or objects (Jolliffe, 1988, 1992). The existing measures can be easily divided into two types. One is the correlation between two merging orders of hierarchical…

5.
Analysis of genetic interaction networks often involves identifying genes with similar profiles, which is typically indicative of a common function. While several profile similarity measures have been applied in this context, they have never been systematically benchmarked. We compared a diverse set of correlation measures, including measures commonly used by the genetic interaction community as well as several other candidate measures, by assessing their utility in extracting functional information from genetic interaction data. We find that the dot product, one of the simplest vector operations, outperforms most other measures over a large range of gene pairs. More generally, linear similarity measures such as the dot product, Pearson correlation or cosine similarity perform better than set overlap measures such as Jaccard coefficient. Similarity measures that involve L2-normalization of the profiles tend to perform better for the top-most similar pairs but perform less favorably when a larger set of gene pairs is considered or when the genetic interaction data is thresholded. Such measures are also less robust to the presence of noise and batch effects in the genetic interaction data. Overall, the dot product measure performs consistently among the best measures under a variety of different conditions and genetic interaction datasets.
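The four families of measures named in this abstract can be written down in a few lines. This is a sketch using the textbook definitions, not the benchmark code of the study. The way profiles are turned into sets for the Jaccard coefficient (keeping positions whose absolute score reaches a `threshold`) is an assumption made for illustration.

```python
import math


def dot(x, y):
    """Plain dot product of two equal-length profiles."""
    return sum(a * b for a, b in zip(x, y))


def cosine(x, y):
    """Dot product of the L2-normalized profiles (undefined for zero vectors)."""
    return dot(x, y) / (math.sqrt(dot(x, x)) * math.sqrt(dot(y, y)))


def pearson(x, y):
    """Cosine similarity of the mean-centered profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return cosine([a - mx for a in x], [b - my for b in y])


def jaccard(x, y, threshold):
    """Set overlap of positions with |score| >= threshold (assumed thresholding)."""
    sx = {i for i, a in enumerate(x) if abs(a) >= threshold}
    sy = {i for i, b in enumerate(y) if abs(b) >= threshold}
    if not sx and not sy:
        return 0.0
    return len(sx & sy) / len(sx | sy)
```

Written this way, the relationship the abstract relies on is visible: cosine and Pearson both divide by the L2 norms of the profiles, whereas the raw dot product keeps the magnitude of the interaction scores.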

6.
IRBM, 2020, 41(5): 267-275
Background and objective: Clustering has been a widely used method for data analysis for years, and many clustering algorithms exist. Today it is used in many prediction, collaborative filtering and automatic segmentation systems in different domains. To be broadly used in practice, clustering algorithms need to give both better performance and better robustness than the ones currently used. In recent years, evolutionary algorithms have been used in many domains since they are robust and easy to implement, and many clustering problems can be easily solved with such algorithms if the problem is modeled as an optimization problem. In this paper, we present an optimization approach for clustering using four well-known evolutionary algorithms: Biogeography-Based Optimization (BBO), Grey Wolf Optimization (GWO), Genetic Algorithm (GA) and Particle Swarm Optimization (PSO). Method: The objective function has been specified to minimize the total distance from cluster centers to the data points. Euclidean distance is used for distance calculation. We have applied this objective function to the given algorithms both to find the most efficient clustering algorithm and to compare the clustering performance of the algorithms across different data sizes. To benchmark the algorithms in the experiments, we have used a number of datasets of different sizes: small-scale, medium-scale and big data. The clustering performance has been compared to K-means, as it has been a widely used clustering algorithm in the literature for years. Rand Index, Adjusted Rand Index, Mirkin's Index and Hubert's Index have been considered as criteria for evaluating the clustering performance. Result: In the clustering experiments over datasets of varying size, judged by the specified performance criteria, the GA and GWO algorithms show better clustering performance than the others. Conclusions: The results of the study showed that although the algorithms gave satisfactory clustering results on small and medium scale datasets, the clustering performance on big data needs to be improved.
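The objective function and one of the evaluation criteria mentioned in this abstract can be sketched as follows. The abstract only says that the total Euclidean distance from cluster centers to data points is minimized. Assigning each point to its nearest center is an assumption made here, and the function names are hypothetical. The Rand index follows its standard pair-counting definition.

```python
import math
from itertools import combinations


def objective(points, centers):
    """Total Euclidean distance from each point to its nearest center.

    A candidate solution of an evolutionary algorithm would be `centers`;
    lower values are better. Nearest-center assignment is an assumption.
    """
    return sum(min(math.dist(p, c) for c in centers) for p in points)


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two labelings agree.

    A pair agrees if both labelings put it in the same cluster, or both
    put it in different clusters.
    """
    agree = 0
    pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_a = labels_a[i] == labels_a[j]
        same_b = labels_b[i] == labels_b[j]
        agree += same_a == same_b
        pairs += 1
    return agree / pairs
```

Because `objective` takes only the centers as the free variable, any of the four optimizers named in the abstract could in principle treat it as a black-box fitness function.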

7.
Link Clustering (LC) is a relatively new method for detecting overlapping communities in networks. The basic principle of LC is to derive a transform matrix whose elements are composed of the link similarity of neighbor links based on the Jaccard distance calculation; then it applies hierarchical clustering to the transform matrix and uses a measure of partition density on the resulting dendrogram to determine the cut level for best community detection. However, the original link clustering method does not consider the link similarity of non-neighbor links, and the partition density tends to divide the communities into many small communities. In this paper, an Extended Link Clustering method (ELC) for overlapping community detection is proposed. The improved method employs a new link similarity, Extended Link Similarity (ELS), to produce a denser transform matrix, and uses the maximum value of EQ (an extended measure of quality of modularity) as a means to optimally cut the dendrogram for better partitioning of the original network space. Since ELS uses more link information, the resulting transform matrix provides a superior basis for clustering and analysis. Further, using the EQ value to find the best level for the hierarchical clustering dendrogram division, we obtain communities that are more sensible and reasonable than the ones obtained by the partition density evaluation. Experimentation on five real-world networks and artificially generated networks shows that the ELC method achieves higher EQ and In-group Proportion (IGP) values. Additionally, the communities are more realistic than those generated by either the original LC method or the classical CPM method.
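For orientation, the neighbor-link similarity used by the original LC method can be sketched as below. It assumes the usual formulation in which two links sharing one node are compared through the Jaccard index of the inclusive neighborhoods (a node plus its neighbors) of their unshared endpoints. The extended similarity (ELS) is not specified in the abstract and is not reproduced here. The adjacency-dict representation and function names are illustrative choices.

```python
def inclusive_neighbors(adj, node):
    """Neighbors of `node` plus the node itself."""
    return adj[node] | {node}


def link_similarity(adj, link_a, link_b):
    """Jaccard similarity of two links in an undirected graph without self-loops.

    `adj` maps each node to the set of its neighbors; a link is a pair (u, v).
    Links that share no node get 0.0, which is the gap the abstract says
    the extended method addresses.
    """
    a, b = set(link_a), set(link_b)
    if a == b:
        return 1.0
    shared = a & b
    if len(shared) != 1:
        return 0.0
    (i,) = a - shared  # unshared endpoint of the first link
    (j,) = b - shared  # unshared endpoint of the second link
    ni, nj = inclusive_neighbors(adj, i), inclusive_neighbors(adj, j)
    return len(ni & nj) / len(ni | nj)
```

Evaluating this function for all link pairs yields a matrix that is mostly zero in sparse networks, which is consistent with the abstract's motivation for a denser transform matrix.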

8.
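Supervised clustering analysis of gene chip data   Cited by: 1 (self-citations: 0, citations by others: 1)
With the arrival of the post-genomic era, gene chip technology is increasingly applied in functional genomics research. How to analyze quickly and effectively the large volume of biological data produced by gene chip experiments has become a research task of considerable importance. Supervised clustering analysis is a type of cluster analysis that determines the classification of samples from prior information or hypotheses about them, builds a discriminant model on that basis, and then uses the model to classify unknown objects. The method has been applied successfully in many areas of biomedical research and has become an important means of analyzing gene chip data.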

9.
Detection of yet unknown subgroups showing differential gene or protein expression is a frequent goal in the analysis of modern molecular data. Applications range from cancer biology through developmental biology to toxicology. Often a control and an experimental group are compared, and subgroups can be characterized by differential expression for only a subgroup-specific set of genes or proteins. Finding such genes and corresponding patient subgroups can help in understanding pathological pathways, diagnosis and defining drug targets. The size of the subgroup and the type of differential expression determine the optimal strategy for subgroup identification. To date, commonly used software packages hardly provide statistical tests and methods for the detection of such subgroups. Different univariate methods for subgroup detection are characterized and compared, both on simulated and on real data. We present an advanced design for simulation studies: data are simulated under different distributional assumptions for the expression of the subgroup, and performance results are compared against theoretical upper bounds. For each distribution, different degrees of deviation from the majority of observations are considered for the subgroup. We evaluate classical approaches as well as various new suggestions in the context of omics data, including outlier sum, PADGE, and kurtosis. We also propose the new FisherSum score. ROC curve analysis and AUC values are used to quantify the ability of the methods to distinguish between genes or proteins with and without certain subgroup patterns. In general, FisherSum for small subgroups and the t-test for large subgroups achieve the best results. We apply each method to a case-control study on Parkinson's disease and underline the biological benefit of the new method.

10.
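Objective: To compare the efficacy of continuous blood purification (CBP) and haemodialysis (HD) in the treatment of uremic encephalopathy. Methods: Seventy patients with uremic encephalopathy (UE) were randomly divided into a CBP group (n=35) and an HD group (n=35), and each group received the corresponding treatment. Remission rate, mean arterial pressure, renal function, electrolytes and blood gas indices were observed. Results: The 1-week remission rates were 100% (CBP group) and 65.7% (HD group) (P<0.05). BUN, Scr, K+, Na+, Cl- and β2-MG decreased significantly in both groups (P<0.05), but after treatment BUN, Scr and β2-MG were significantly lower in the CBP group than in the HD group (P<0.05). In the CBP group, HCO3- and pH were significantly lower than before treatment (P<0.05), whereas in the HD group HCO3- and pH did not change significantly (P>0.05). Mean arterial pressure in the CBP group did not differ significantly before and after treatment (P>0.05); in the HD group the incidence of hypotension was 13.7% and that of hypertension was 15.4%. Conclusion: CBP clears BUN, Scr and β2-MG well, corrects electrolyte and acid-base disturbances, maintains haemodynamic stability, and shows good clinical efficacy.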

11.
Multivariate statistical techniques such as principal components analysis (PCA) and multidimensional scaling (MDS) have been widely used to summarize the structure of human genetic variation, often in easily visualized two-dimensional maps. Many recent studies have reported similarity between geographic maps of population locations and MDS or PCA maps of genetic variation inferred from single-nucleotide polymorphisms (SNPs). However, this similarity has been evident primarily in a qualitative sense; and, because different multivariate techniques and marker sets have been used in different studies, it has not been possible to formally compare genetic variation datasets in terms of their levels of similarity with geography. In this study, using genome-wide SNP data from 128 populations worldwide, we perform a systematic analysis to quantitatively evaluate the similarity of genes and geography in different geographic regions. For each of a series of regions, we apply a Procrustes analysis approach to find an optimal transformation that maximizes the similarity between PCA maps of genetic variation and geographic maps of population locations. We consider examples in Europe, Sub-Saharan Africa, Asia, East Asia, and Central/South Asia, as well as in a worldwide sample, finding that significant similarity between genes and geography exists in general at different geographic levels. The similarity is highest in our examples for Asia and, once highly distinctive populations have been removed, Sub-Saharan Africa. Our results provide a quantitative assessment of the geographic structure of human genetic variation worldwide, supporting the view that geography plays a strong role in giving rise to human population structure.

12.
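Gene chips make it possible to measure the expression of different genes in different life processes, so they have attracted attention in medical diagnosis and the analysis of pathological change and are beginning to be widely applied. Measurements show that different genes are expressed differently at different stages of a disease process, from which the expression characteristics of different genes during that process can be obtained. In this paper, we present a cluster analysis of gene expression characteristics during breast cancer metastasis, and improve the k-means clustering algorithm so that it searches for the number of clusters automatically and is less prone to having its clustering result trapped in a local minimum. Comparison of the average clustering error index shows that kr-means outperforms the k-means algorithm. The results can serve as a reference for breast cancer diagnosis and the analysis of pathological change; they can also be applied to the preparation of small gene detection chips and used to construct gene regulatory network maps.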

13.
The accelerating growth of public microbial genomic data imposes a substantial burden on the research community that uses such resources. Building databases of non-redundant reference sequences from massive microbial genomic data based on clustering analysis is essential. However, existing clustering algorithms perform poorly on long genomic sequences. In this article, we present Gclust, a parallel program for clustering complete or draft genomic sequences, where clustering is accelerated with a novel parallelization strategy and a fast sequence comparison algorithm using sparse suffix arrays (SSAs). Moreover, genome identity measures between two sequences are calculated based on their maximal exact matches (MEMs). In this paper, we demonstrate the high speed and clustering quality of Gclust by examining four genome sequence datasets. Gclust is freely available for non-commercial use at https://github.com/niu-lab/gclust. We also introduce a web server for clustering user-uploaded genomes at http://niulab.scgrid.cn/gclust.

14.
15.
The effect of laminin on the distribution of dystroglycan (DG) and other surface proteins was examined by fluorescent staining in cultures of muscle cells derived from Xenopus embryos. Western blotting confirmed that previously characterized antibodies are reactive in Xenopus. In control cultures, αDG, βDG, and laminin binding sites were distributed as microclusters (<1 μm² in area) over the entire dorsal surface of the muscle cells. Treatment with laminin induced the formation of macroclusters (1–20 μm²), accompanied by a corresponding decline in the density of the microclusters. With 6 nM laminin, clustering was apparent within 150 min and near maximal within 1 d. Laminin was effective at 30 pM, the lowest concentration tested. The laminin fragment E3, which competes with laminin for binding to αDG, inhibited laminin-induced clustering but did not itself cluster DG, thereby indicating that other portions of the laminin molecule in addition to its αDG binding domain are required for its clustering activity. Laminin-induced clusters also contained dystrophin, but unlike agrin-induced clusters, they did not contain acetylcholine receptors, utrophin, or phosphotyrosine, and their formation was not inhibited by a tyrosine kinase inhibitor. The results reinforce the notion that unclustered DG is mobile on the surface of embryonic muscle cells and suggest that this mobile DG can be trapped by at least two different sets of molecular interactions. Laminin self binding may be the basis for the laminin-induced clustering.

16.
A Comparison of MCC and CEN Error Measures in Multi-Class Prediction   Cited by: 1 (self-citations: 0, citations by others: 1)
We show that the Confusion Entropy, a measure of performance in multiclass problems, has a strong (monotone) relation with the multiclass generalization of a classical metric, the Matthews Correlation Coefficient. Analytical results are provided for the limit cases of general no-information (n-face dice rolling) of the binary classification. Computational evidence supports the claim in the general case.
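As a point of reference for the metric discussed here, the multiclass generalization of the Matthews Correlation Coefficient can be computed directly from a confusion matrix. The sketch below uses the commonly cited closed form based on row sums, column sums and the trace. It is not taken from the paper, and the Confusion Entropy itself is not reproduced. Returning 0.0 when the denominator vanishes is a convention assumed here.

```python
import math


def multiclass_mcc(confusion):
    """Multiclass MCC from a square confusion matrix.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)           # all samples
    correct = sum(confusion[k][k] for k in range(n))     # trace
    true_counts = [sum(confusion[k]) for k in range(n)]  # row sums
    pred_counts = [sum(confusion[i][k] for i in range(n)) for k in range(n)]

    numerator = correct * total - sum(
        p * t for p, t in zip(pred_counts, true_counts)
    )
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    if denominator == 0:
        return 0.0  # assumed convention for degenerate matrices
    return numerator / denominator
```

For a 2x2 matrix this reduces to the familiar binary MCC, which makes the binary case a convenient sanity check.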

17.
The first recorded North American epidemic of West Nile virus was detected in New York state in 1999, and since then the virus has spread and become established in much of North America. Mathematical models for this vector-transmitted disease with cross-infection between mosquitoes and birds have recently been formulated with the aim of predicting disease dynamics and evaluating possible control methods. We consider discrete and continuous time versions of the West Nile virus models proposed by Wonham et al. [Proc. R. Soc. Lond. B 271:501–507, 2004] and by Thomas and Urena [Math. Comput. Modell. 34:771–781, 2001], and evaluate the basic reproduction number as the spectral radius of the next-generation matrix in each case. The assumptions on mosquito-feeding efficiency are crucial for the basic reproduction number calculation. Differing assumptions lead to the conclusion from one model [Wonham, M.J. et al., Proc. R. Soc. Lond. B 271:501–507, 2004] that a reduction in bird density would exacerbate the epidemic, while the other model [Thomas, D.M., Urena, B., Math. Comput. Modell. 34:771–781, 2001] predicts the opposite: a reduction in bird density would help control the epidemic.
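The step "basic reproduction number as the spectral radius of the next-generation matrix" can be illustrated for a two-compartment (bird and mosquito) setting. The matrix entries in the usage note are placeholders chosen for illustration and are not parameters from either cited model. The function only shows the linear algebra for a generic 2x2 matrix.

```python
import cmath


def spectral_radius_2x2(matrix):
    """Largest absolute eigenvalue of a 2x2 matrix [[a, b], [c, d]].

    Eigenvalues are the roots of x^2 - trace*x + det = 0; cmath handles
    the case of a negative discriminant.
    """
    (a, b), (c, d) = matrix
    trace = a + d
    det = a * d - b * c
    disc = cmath.sqrt(trace * trace - 4 * det)
    eigenvalues = ((trace + disc) / 2, (trace - disc) / 2)
    return max(abs(ev) for ev in eigenvalues)
```

With a purely cross-infection matrix such as `[[0, 2], [8, 0]]` (hypothetical values: new mosquito infections per infected bird and new bird infections per infected mosquito), the result is the geometric mean of the two off-diagonal entries, here 4. This shows why an assumption that changes how either entry scales with bird density can change the direction of the prediction.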

18.
Pulmonary rehabilitation (PR) is an important component in the management of respiratory diseases. The effectiveness of PR is dependent upon adherence to exercise training recommendations. The study of exercise adherence is thus a key step towards the optimization of PR programs. To date, mostly indirect measures, such as rates of participation, completion, and attendance, have been used to determine adherence to PR. The purpose of the present protocol is to describe how continuous data tracking technology can be used to measure adherence to a prescribed aerobic training intensity on a second-by-second basis. In our investigations, adherence has been defined as the percent time spent within a specified target heart rate range. As such, using a combination of hardware and software, heart rate is measured, tracked, and recorded second by second during cycling for each participant and each exercise session. Using statistical software, the data are subsequently extracted and analyzed. The same protocol can be applied to determine adherence to other measures of exercise intensity, such as time spent at a specified wattage, level, or speed on the cycle ergometer. Furthermore, hardware and software are also available to measure adherence to other modes of training, such as the treadmill, elliptical, stepper, and arm ergometer. The present protocol, therefore, has wide applicability for directly measuring adherence to aerobic exercise.
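The adherence definition given in this abstract (percent of time spent within a target heart rate range) reduces to a short calculation once second-by-second samples are available. The sketch below assumes one heart rate sample per second and inclusive range bounds. The function name and the sample values are illustrative and do not come from the protocol.

```python
def adherence_percent(heart_rates, low, high):
    """Percent of one-per-second samples with low <= heart rate <= high."""
    if not heart_rates:
        raise ValueError("at least one heart rate sample is required")
    in_range = sum(1 for hr in heart_rates if low <= hr <= high)
    return 100.0 * in_range / len(heart_rates)


# Hypothetical four-second excerpt with a target range of 110 to 140 bpm.
session = [100, 120, 130, 150]
print(adherence_percent(session, 110, 140))  # 50.0
```

The same function applies unchanged to wattage or speed samples, matching the abstract's remark that the protocol generalizes to other intensity measures.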

19.
Summary: Identifying homogeneous groups of individuals is an important problem in population genetics. Recently, several methods have been proposed that exploit spatial information to improve clustering algorithms. In this article, we develop a Bayesian clustering algorithm based on the Dirichlet process prior that uses both genetic and spatial information to classify individuals into homogeneous clusters for further study. We study the performance of our method using a simulation study and use our model to cluster wolverines in Western Montana using microsatellite data.

20.
While clustering genes remains one of the most popular exploratory tools for expression data, it often results in highly variable and biologically uninformative clusters. This paper explores a data fusion approach to clustering microarray data. Our method, which combines expression data and Gene Ontology (GO)-derived information, is applied to a real data set to perform genome-wide clustering. A set of novel tools is proposed to validate the clustering results and pick a fair value of the infusion coefficient. These tools measure stability, biological relevance, and distance from the expression-only clustering solution. Our results indicate that data-fusion clustering leads to more stable, biologically relevant clusters that are still representative of the experimental data.
