Similar articles
 20 similar articles found (search time: 500 ms)
1.

Background

In genomics, hierarchical clustering (HC) is a popular method for grouping similar samples based on a distance measure. HC algorithms do not actually create clusters, but compute a hierarchical representation of the data set. Usually, a fixed height on the HC tree is used, and each contiguous branch of samples below that height is considered a separate cluster. Due to the fixed-height cutting, those clusters may not unravel significant functional coherence hidden deeper in the tree. Besides that, most existing approaches do not make use of available clinical information to guide cluster extraction from the HC. Thus, the identified subgroups may be difficult to interpret in relation to that information.
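The fixed-height cut described above can be illustrated with a small sketch. It assumes single linkage and one-dimensional toy values, neither of which comes from the abstract. For single linkage, cutting the tree at a given height yields the same groups as connecting every pair of samples whose distance does not exceed that height, so a union-find is enough to reproduce the cut.

```python
from itertools import combinations


def fixed_height_clusters(points, height):
    """Single-linkage clusters obtained by cutting the HC tree at `height`.

    Two samples fall in the same cluster below the cut exactly when a chain
    of pairwise distances <= height connects them.
    """
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(points)), 2):
        if abs(points[i] - points[j]) <= height:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(points[i])
    return sorted(sorted(g) for g in groups.values())


# Toy 1-D values, made up for illustration.
samples = [1.0, 1.2, 1.5, 4.0, 4.1, 9.0]
coarse = fixed_height_clusters(samples, height=1.0)
fine = fixed_height_clusters(samples, height=0.25)
print(coarse)
print(fine)
```

A single global height cannot separate 1.5 from its neighbours while also keeping 4.0 and 4.1 apart from 9.0 unless it is tuned by hand, which is the limitation the abstract attributes to fixed-height cutting.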

Results

We develop a novel framework for decomposing the HC tree into clusters by semi-supervised piecewise snipping. The framework, called guided piecewise snipping, utilizes both molecular data and clinical information to decompose the HC tree into clusters. It cuts the given HC tree at variable heights to find a partition (a set of non-overlapping clusters) that not only represents a structure deemed to underlie the data from which the HC tree is derived, but is also maximally consistent with the supplied clinical data. Moreover, the approach does not require the user to specify the number of clusters prior to the analysis. Extensive results on simulated and multiple medical data sets show that our approach consistently produces more meaningful clusters than the standard fixed-height cut and/or non-guided approaches.

Conclusions

The guided piecewise snipping approach features several novelties and advantages over existing approaches. The proposed algorithm is generic, and can be combined with other algorithms that operate on detected clusters. This approach represents an advancement in several regards: (1) a piecewise tree snipping framework that efficiently extracts clusters by snipping the HC tree, possibly at variable heights, while preserving the HC tree structure; (2) a flexible implementation allowing a variety of data types for both building and snipping the HC tree, including patient follow-up data like survival as auxiliary information. The data sets and R code are provided as supplementary files. The proposed method is available from Bioconductor as the R-package HCsnip.

Electronic supplementary material

The online version of this article (doi:10.1186/s12859-014-0448-1) contains supplementary material, which is available to authorized users.

2.

Background

Serial Analysis of Gene Expression (SAGE) is a DNA sequencing-based method for large-scale gene expression profiling that provides an alternative to microarray analysis. Most analyses of SAGE data aimed at identifying co-expressed genes have been accomplished using various versions of clustering approaches that often result in a number of false positives.

Principal Findings

Here we explore the use of seriation, a statistical approach for ordering sets of objects based on their similarity, for large-scale expression pattern discovery in SAGE data. For this specific task we implement a seriation heuristic we term ‘progressive construction of contigs’ that constructs local chains of related elements by sequentially rearranging margins of the correlation matrix. We apply the heuristic to the analysis of simulated and experimental SAGE data and compare our results to those obtained with a clustering algorithm developed specifically for SAGE data. We show using simulations that the performance of seriation compares favorably to that of the clustering algorithm on noisy SAGE data.
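To make the idea of seriation concrete, the following is a minimal greedy sketch that orders items so that strongly correlated ones sit next to each other. It is a simple stand-in chosen for illustration and is not the authors' 'progressive construction of contigs' heuristic; the function name and the toy correlation matrix are assumptions.

```python
def greedy_seriation(corr):
    """Order items so that highly correlated items end up adjacent.

    Starts from the most correlated pair and repeatedly attaches the
    unplaced item that correlates best with either end of the chain.
    """
    n = len(corr)
    # Seed the chain with the most strongly correlated pair.
    i, j = max(((a, b) for a in range(n) for b in range(a + 1, n)),
               key=lambda p: corr[p[0]][p[1]])
    chain = [i, j]
    remaining = set(range(n)) - {i, j}
    while remaining:
        # Best candidate for each end of the chain.
        left = max(remaining, key=lambda k: corr[chain[0]][k])
        right = max(remaining, key=lambda k: corr[chain[-1]][k])
        if corr[chain[0]][left] > corr[chain[-1]][right]:
            chain.insert(0, left)
            remaining.remove(left)
        else:
            chain.append(right)
            remaining.remove(right)
    return chain


# Toy symmetric correlation matrix for four tags (made-up values).
corr = [
    [1.0, 0.1, 0.9, 0.2],
    [0.1, 1.0, 0.3, 0.8],
    [0.9, 0.3, 1.0, 0.4],
    [0.2, 0.8, 0.4, 1.0],
]
order = greedy_seriation(corr)
print(order)
```

Reordering the rows and columns of the matrix by `order` places the two correlated pairs (0, 2) and (3, 1) side by side, which is what makes blocks of co-expressed items visible in a heat map.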

Conclusions

We explore the use of a seriation approach for visualization-based pattern discovery in SAGE data. Using both simulations and experimental data, we demonstrate that seriation is able to identify groups of co-expressed genes more accurately than a clustering algorithm developed specifically for SAGE data. Our results suggest that seriation is a useful method for the analysis of gene expression data whose applicability should be further pursued.

3.
4.
5.

Background

The recent explosion of biological data poses a great challenge for traditional clustering algorithms. As data sets grow, cluster identification requires much more memory and longer runtimes. The affinity propagation algorithm outperforms many classical clustering algorithms and is widely applied in biological research. However, its time and space complexity become a serious bottleneck on large-scale data sets. Moreover, because the algorithm clusters data based on the similarities between pairs of data points, the similarity matrix must be constructed before affinity propagation can run, and constructing it is itself time-consuming.
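The cost of the similarity matrix is easy to see in a small sketch. It assumes the negative squared Euclidean distance as the similarity and the median similarity as the diagonal preference, both common conventions for affinity propagation; the abstract does not say which similarity the authors use.

```python
from statistics import median


def similarity_matrix(points):
    """Negative squared Euclidean distance between every pair of points."""
    n = len(points)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            sim[i][j] = sim[j][i] = -d2
    # Diagonal "preference" (assumed): median of the off-diagonal similarities.
    off_diag = [sim[i][j] for i in range(n) for j in range(n) if i != j]
    pref = median(off_diag)
    for i in range(n):
        sim[i][i] = pref
    return sim


# Toy 2-D points, made up for illustration.
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
sim = similarity_matrix(points)
for row in sim:
    print(row)
```

Both the double loop and the stored matrix grow with the square of the number of data points, which is the time and memory bottleneck the abstract refers to.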

Methods

Two types of parallel architectures are proposed in this paper to accelerate the construction of the similarity matrix and the affinity propagation algorithm. A shared-memory architecture is used to construct the similarity matrix, and a distributed system is used for the affinity propagation algorithm because of its large memory size and great computing capacity. An appropriate scheme for data partitioning and reduction is designed to minimize the global communication cost among processes.

Results

A speedup of 100 is gained with 128 cores. The runtime is reduced from several hours to a few seconds, which indicates that the parallel algorithm is capable of handling large-scale data sets effectively. The parallel affinity propagation also performs well when clustering large-scale gene (microarray) data and detecting families in large protein superfamilies.
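The reported figures can be put in context with a short calculation of parallel efficiency. The second function additionally assumes that Amdahl's law describes the workload, which the abstract does not claim; it is included only to show what serial fraction such a speedup would imply under that model.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / cores


def amdahl_serial_fraction(speedup, cores):
    """Serial fraction f implied by Amdahl's law, S = 1 / (f + (1 - f) / p)."""
    return (cores / speedup - 1.0) / (cores - 1.0)


def amdahl_speedup(serial_fraction, cores):
    """Speedup predicted by Amdahl's law for a given serial fraction."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


efficiency = parallel_efficiency(100, 128)
serial = amdahl_serial_fraction(100, 128)
print(f"efficiency: {efficiency:.3f}")
print(f"implied serial fraction: {serial:.4f}")
```

A speedup of 100 on 128 cores corresponds to an efficiency of about 78%, and under the Amdahl assumption to a serial fraction of roughly 0.2%.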

6.
7.
Valor LM, Grant SG. PLoS ONE 2007, 2(12): e1303

Background

Gene expression profiling using microarrays is a powerful technology widely used to study regulatory networks. Profiling of mRNA levels in mutant organisms has the potential to identify genes regulated by the mutated protein.

Methodology/Principal Findings

Using tissues from multiple lines of knockout mice we have examined genome-wide changes in gene expression. We report that a significant proportion of changed genes were found near the targeted gene.

Conclusions/Significance

The apparent clustering of these genes was explained by the presence of flanking DNA from the parental ES cell. We provide recommendations for the analysis and reporting of microarray data from knockout mice.

8.

Motivation

It has been proposed that clustering clinical markers, such as blood test results, can be used to stratify patients. However, the robustness of clusters formed with this approach to data pre-processing and clustering algorithm choices has not been evaluated, nor has clustering reproducibility. Here, we made use of the NHANES survey to compare clusters generated with various combinations of pre-processing and clustering algorithms, and tested their reproducibility in two separate samples.

Method

Values of 44 biomarkers and 19 health/life style traits were extracted from the National Health and Nutrition Examination Survey (NHANES). The 1999–2002 survey was used for training, while data from the 2003–2006 survey was tested as a validation set. Twelve combinations of pre-processing and clustering algorithms were applied to the training set. The quality of the resulting clusters was evaluated both by considering their properties and by comparative enrichment analysis. Cluster assignments were projected to the validation set (using an artificial neural network) and enrichment in health/life style traits in the resulting clusters was compared to the clusters generated from the original training set.

Results

The clusters obtained with different pre-processing and clustering combinations differed both in terms of cluster quality measures and in terms of reproducibility of enrichment with health/life style properties. Z-score normalization, for example, dramatically improved cluster quality and enrichments, as compared to unprocessed data, regardless of the clustering algorithm used. Clustering diabetes patients revealed a group of patients enriched with retinopathies. This could indicate that routine laboratory tests can be used to detect patients suffering from complications of diabetes, although other explanations for this observation should also be considered.
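Z-score normalization, singled out above, rescales each biomarker to zero mean and unit standard deviation so that markers measured in different units contribute comparably to a distance. The sketch below assumes the population standard deviation and uses made-up values; the abstract does not specify either.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant marker carries no information; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Two hypothetical markers on very different scales.
glucose = [90.0, 100.0, 110.0]
creatinine = [0.8, 1.0, 1.2]

z_glucose = z_scores(glucose)
z_creatinine = z_scores(creatinine)
print(z_glucose)
print(z_creatinine)
```

Before normalization, glucose differences of 10 units would swamp creatinine differences of 0.2 in any Euclidean distance; afterwards the two markers have the same spread.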

Conclusions

Clustering according to classical clinical biomarkers is a robust process, which may help in patient stratification. However, optimization of the pre-processing and clustering process may still be required.

9.

Background

The goal of the study was to demonstrate a hierarchical structure of resting state activity in the healthy brain using a data-driven clustering algorithm.

Methodology/Principal Findings

The fuzzy-c-means clustering algorithm was applied to resting state fMRI data in cortical and subcortical gray matter from two groups acquired separately, one of 17 healthy individuals and the second of 21 healthy individuals. Different numbers of clusters and different starting conditions were used. A cluster dispersion measure determined the optimal numbers of clusters. An inner product metric provided a measure of similarity between different clusters. The two cluster result found the task-negative and task-positive systems. The cluster dispersion measure was minimized with seven and eleven clusters. Each of the clusters in the seven and eleven cluster result was associated with either the task-negative or task-positive system. Applying the algorithm to find seven clusters recovered previously described resting state networks, including the default mode network, frontoparietal control network, ventral and dorsal attention networks, somatomotor, visual, and language networks. The language and ventral attention networks had significant subcortical involvement. This parcellation was consistently found in a large majority of algorithm runs under different conditions and was robust to different methods of initialization.
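For readers unfamiliar with fuzzy c-means, the following is a minimal one-dimensional sketch of the standard alternating updates of memberships and centers. The fuzzifier `m = 2`, the starting centers and the toy data are assumptions; the study clusters fMRI time courses and uses its own dispersion and inner-product measures, which are not reproduced here.

```python
def fuzzy_c_means(xs, centers, m=2.0, iters=50):
    """1-D fuzzy c-means; returns (centers, memberships[point][cluster])."""
    centers = list(centers)
    u = []
    for _ in range(iters):
        u = []
        for x in xs:
            d = [abs(x - c) for c in centers]
            if 0.0 in d:
                # Point coincides with a center: share it among those centers.
                row = [1.0 if di == 0.0 else 0.0 for di in d]
            else:
                # Membership is proportional to distance ** (-2 / (m - 1)).
                row = [di ** (-2.0 / (m - 1.0)) for di in d]
            total = sum(row)
            u.append([r / total for r in row])
        # Each center is the mean of the points weighted by membership ** m.
        centers = [
            sum(u[j][i] ** m * xs[j] for j in range(len(xs)))
            / sum(u[j][i] ** m for j in range(len(xs)))
            for i in range(len(centers))
        ]
    return centers, u


# Toy data with two obvious groups (made-up values).
xs = [0.9, 1.0, 1.1, 4.8, 5.0, 5.2]
centers, u = fuzzy_c_means(xs, centers=[0.0, 6.0])
print(centers)
```

Unlike a hard partition, every point keeps a graded membership in every cluster, and different starting centers can lead to different solutions, which is why the study repeats the algorithm under different starting conditions.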

Conclusions/Significance

The clustering of resting state activity using different optimal numbers of clusters identified resting state networks comparable to previously obtained results. This work reinforces the observation that resting state networks are hierarchically organized.

10.

Objectives

To investigate the order in which 85 year olds develop difficulty in performing a wide range of daily activities covering basic personal care, household care and mobility.

Design

Cross-sectional analysis of baseline data from a cohort study.

Setting

Newcastle upon Tyne and North Tyneside, UK.

Participants

Individuals born in 1921, registered with participating general practices.

Measurements

Detailed health assessment including 17 activities of daily living related to basic personal care, household care and mobility. Questions were of the form ‘Can you …’ rather than ‘Do you …’. Principal Component Analysis (PCA) was used to confirm a single underlying dimension for the items and Mokken Scaling was used to determine a subsequent hierarchy. Validity of the hierarchical scale was assessed by its associations with known predictors of disability.

Results

The analysis included 839 people within the Newcastle 85+ study for whom complete information was available on self-reported Activities of Daily Living (ADL). PCA confirmed a single underlying dimension; Mokken scaling confirmed a hierarchic scale where ‘Cutting toenails’ was the first item with which participants had difficulty and ‘feeding’ the last. The ordering of loss differed between men and women. Difficulty with ‘shopping’ and ‘heavy housework’ were reported earlier by women whilst men reported ‘walking 400 yards’ earlier. Items formed clusters corresponding to strength, balance, lower and upper body involvement and domains specifically required for balance and upper/lower limb functional integrity.

Conclusion

This comprehensive investigation of ordering of ability in activities in 85 year olds will inform researchers and practitioners assessing older people for onset of disability and subsequent care needs.

11.

Introduction

The aim of this study was to identify subsets of patients with fibromyalgia with similar symptom profiles using the Outcome Measures in Rheumatology (OMERACT) core symptom domains.

Methods

Female patients with a diagnosis of fibromyalgia and currently meeting fibromyalgia research survey criteria completed the Brief Pain Inventory, the 30-item Profile of Mood States, the Medical Outcomes Sleep Scale, the Multidimensional Fatigue Inventory, the Multiple Ability Self-Report Questionnaire, the Fibromyalgia Impact Questionnaire–Revised (FIQ-R) and the Short Form-36 between 1 June 2011 and 31 October 2011. Hierarchical agglomerative clustering was used to identify subgroups of patients with similar symptom profiles. To validate the results from this sample, hierarchical agglomerative clustering was repeated in an external sample of female patients with fibromyalgia with similar inclusion criteria.

Results

A total of 581 females with a mean age of 55.1 (range, 20.1 to 90.2) years were included. A four-cluster solution best fit the data, and each clustering variable differed significantly (P <0.0001) among the four clusters. The four clusters divided the sample into severity levels: Cluster 1 reflects the lowest average levels across all symptoms, and Cluster 4 reflects the highest average levels. Clusters 2 and 3 capture moderate symptom levels. Clusters 2 and 3 differed mainly in profiles of anxiety and depression, with Cluster 2 having lower levels of depression and anxiety than Cluster 3, despite higher levels of pain. The results of the cluster analysis of the external sample (n = 478) looked very similar to those found in the original cluster analysis, except for a slight difference in sleep problems. This was despite having patients in the validation sample who were significantly younger (P <0.0001) and had more severe symptoms (higher FIQ-R total scores (P = 0.0004)).

Conclusions

In our study, we incorporated core OMERACT symptom domains, which allowed for clustering based on a comprehensive symptom profile. Although our exploratory cluster solution needs confirmation in a longitudinal study, this approach could provide a rationale to support the study of individualized clinical evaluation and intervention.

12.

Background

Previous studies using hierarchical clustering approach to analyze resting-state fMRI data were limited to a few slices or regions-of-interest (ROIs) after substantial data reduction.

Purpose

To develop a framework that can perform voxel-wise hierarchical clustering of whole-brain resting-state fMRI data from a group of subjects.

Materials and Methods

Resting-state fMRI measurements were conducted for 86 adult subjects using a single-shot echo-planar imaging (EPI) technique. After pre-processing and co-registration to a standard template, pair-wise cross-correlation coefficients (CC) were calculated for all voxels inside the brain and translated into absolute Pearson's distances after imposing a threshold CC≥0.3. The group averages of the Pearson's distances were then used to perform hierarchical clustering with the developed framework, which entails gray matter masking and an iterative scheme to analyze the dendrogram.
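The conversion from correlation coefficients to distances can be sketched as follows. The sketch assumes the distance is 1 - |CC| and that pairs below the threshold are treated as maximally distant (distance 1); the abstract states the threshold CC≥0.3 but not how sub-threshold pairs are handled, and the time series are made up.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (must not be constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def thresholded_distance(cc, threshold=0.3):
    """Distance 1 - |CC| for CC >= threshold, else the maximal distance 1.0."""
    # Assumption: sub-threshold pairs are treated as maximally distant.
    if cc < threshold:
        return 1.0
    return 1.0 - abs(cc)


# Toy voxel time series.
v1 = [1.0, 2.0, 3.0, 4.0]
v2 = [2.0, 4.0, 6.0, 8.0]   # perfectly correlated with v1
v3 = [4.0, 3.0, 2.0, 1.0]   # perfectly anti-correlated with v1

d12 = thresholded_distance(pearson(v1, v2))
d13 = thresholded_distance(pearson(v1, v3))
print(d12, d13)
```

With the threshold in place only positive correlations survive, so anti-correlated voxels such as `v1` and `v3` are never merged early in the dendrogram.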

Results

With the hierarchical clustering framework, we identified most of the functional connectivity networks reported previously in the literature, such as the motor, sensory, visual, memory, and the default-mode functional networks (DMN). Furthermore, the DMN and visual system were split into their corresponding hierarchical sub-networks.

Conclusion

It is feasible to use the proposed hierarchical clustering scheme for voxel-wise analysis of whole-brain resting-state fMRI data. The hierarchical clustering result not only generally confirmed the functional connectivity networks identified previously using other data processing techniques, such as ICA, but also directly revealed the hierarchical structure within the functional connectivity networks.

13.

Background

One of the most common goals of hierarchical clustering is finding those branches of a tree that form quantifiably distinct data subtypes. Achieving this goal in a statistically meaningful way requires (a) a measure of distinctness of a branch and (b) a test to determine the significance of the observed measure, applicable to all branches and across multiple scales of dissimilarity.

Results

We formulate a method termed Tree Branches Evaluated Statistically for Tightness (TBEST) for identifying significantly distinct tree branches in hierarchical clusters. For each branch of the tree a measure of distinctness, or tightness, is defined as a rational function of heights, both of the branch and of its parent. A statistical procedure is then developed to determine the significance of the observed values of tightness. We test TBEST as a tool for tree-based data partitioning by applying it to five benchmark datasets, one of them synthetic and the other four each from a different area of biology. For each dataset there is a well-defined partition of the data into classes. In all test cases TBEST performs on par with or better than the existing techniques.
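The abstract says only that tightness is a rational function of the heights of a branch and of its parent. One simple function of that kind, assumed here for illustration and not confirmed by the abstract as the TBEST definition, is the relative gap between the two heights.

```python
def tightness(branch_height, parent_height):
    """Relative gap between a branch and its parent in the tree.

    Assumed form: (parent - branch) / parent. Values near 1 mean the branch
    merges internally far below the height at which it joins the rest of
    the tree; values near 0 mean it is barely distinct from its parent.
    """
    if parent_height <= 0:
        raise ValueError("parent height must be positive")
    return (parent_height - branch_height) / parent_height


distinct = tightness(branch_height=0.2, parent_height=1.0)
diffuse = tightness(branch_height=0.9, parent_height=1.0)
print(distinct, diffuse)
```

Because the measure is a ratio of heights, it is unchanged when all dissimilarities are multiplied by a constant, which is one way a measure can apply across multiple scales of dissimilarity. The statistical test for significance described in the abstract is not sketched here.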

Conclusions

Based on our benchmark analysis, TBEST is a tool of choice for detection of significantly distinct branches in hierarchical trees grown from biological data. An R language implementation of the method is available from the Comprehensive R Archive Network: http://www.cran.r-project.org/web/packages/TBEST/index.html.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2164-15-1000) contains supplementary material, which is available to authorized users.

14.
15.

Background and Aims

This study aimed to identify and characterize the ontogenetic, environmental and individual components of forest tree growth. In the proposed approach, the tree growth data typically correspond to the retrospective measurement of annual shoot characteristics (e.g. length) along the trunk.

Methods

Dedicated statistical models (semi-Markov switching linear mixed models) were applied to data sets of Corsican pine and sessile oak. In the semi-Markov switching linear mixed models estimated from these data sets, the underlying semi-Markov chain represents both the succession of growth phases and their lengths, while the linear mixed models represent both the influence of climatic factors and the inter-individual heterogeneity within each growth phase.

Key Results

On the basis of these integrative statistical models, it is shown that growth phases are not only defined by average growth level but also by growth fluctuation amplitudes in response to climatic factors and inter-individual heterogeneity and that the individual tree status within the population may change between phases. Species plasticity affected the response to climatic factors while tree origin, sampling strategy and silvicultural interventions impacted inter-individual heterogeneity.

Conclusions

The transposition of the proposed integrative statistical modelling approach to cambial growth in relation to climatic factors and the study of the relationship between apical growth and cambial growth constitute the next steps in this research.

16.
17.
18.

Background

To investigate the occupational risk of tuberculosis (TB) infection in a low-incidence setting, data from a prospective study of patients with culture-confirmed TB conducted in Hamburg, Germany, from 1997 to 2002 were evaluated.

Methods

M. tuberculosis isolates were genotyped by IS6110 RFLP analysis. Results of contact tracing and additional patient interviews were used for further epidemiological analyses.

Results

Out of 848 cases included in the cluster analysis, 286 (33.7%) were classified into 76 clusters comprising 2 to 39 patients. In total, two patients in the non-cluster and eight patients in the cluster group were health-care workers. Logistic regression analysis confirmed work in the health-care sector as the strongest predictor for clustering (OR 17.9). However, only two of the eight transmission links among the eight clusters involving health-care workers had been detected previously. Overall, conventional contact tracing performed before genotyping had identified only 26 (25.2%) of the 103 contact persons with the disease among the clustered cases whose transmission links were epidemiologically verified.
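The percentages quoted above follow directly from the reported counts, as the short check below shows. The helper name is hypothetical; the counts are those given in the abstract.

```python
def percent(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100.0 * part / whole, 1)


clustered = percent(286, 848)   # share of cases assigned to a cluster
traced = percent(26, 103)       # share of linked contacts found by tracing
non_clustered = 848 - 286       # cases outside any cluster

print(clustered, traced, non_clustered)
```

The odds ratio of 17.9 comes from the authors' logistic regression and cannot be recomputed from the counts in the abstract alone.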

Conclusion

Recent transmission was found to be strongly associated with health-care work in a setting with low incidence of TB. Conventional contact tracing alone was shown to be insufficient to discover recent transmission chains. The data presented also indicate the need for establishing improved TB control strategies in health-care settings.

19.
Li W, Wooley JC, Godzik A. PLoS ONE 2008, 3(10): e3375

Background

The scale and diversity of metagenomic sequencing projects challenge both our technical and conceptual approaches in gene and genome annotations. The recent Sorcerer II Global Ocean Sampling (GOS) expedition yielded millions of predicted protein sequences, which significantly altered the landscape of known protein space by more than doubling its size and adding thousands of new families (Yooseph et al., 2007 PLoS Biol 5, e16). Such datasets, not only by their sheer size, but also by many other features, defy conventional analysis and annotation methods.

Methodology/Principal Findings

In this study, we describe an approach for rapid analysis of the sequence diversity and the internal structure of such very large datasets by advanced clustering strategies using the newly modified CD-HIT algorithm. We performed a hierarchical clustering analysis on the 17.4 million Open Reading Frames (ORFs) identified from the GOS study and found over 33 thousand large predicted protein clusters comprising nearly 6 million sequences. Twenty percent of these clusters did not match known protein families by sequence similarity search and might represent novel protein families. Distributions of the large clusters were illustrated on organism composition, functional class, and sample locations.
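The core of CD-HIT is a greedy incremental scheme: sequences are processed from longest to shortest, and each one either joins the first existing cluster whose representative it resembles closely enough or founds a new cluster. The sketch below illustrates that scheme with a toy position-wise identity measure, which is a hypothetical stand-in for CD-HIT's word filtering and alignment; the sequences and threshold are made up.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions over the shorter length.

    Hypothetical stand-in; CD-HIT itself uses short-word filtering and
    sequence alignment rather than position-wise comparison.
    """
    n = min(len(a), len(b))
    return sum(1 for x, y in zip(a, b) if x == y) / n


def greedy_incremental_cluster(seqs, threshold=0.9):
    """Return a list of (representative, members) pairs."""
    clusters = []
    # Longest sequences first, so each representative is the longest member.
    for s in sorted(seqs, key=len, reverse=True):
        for rep, members in clusters:
            if identity(rep, s) >= threshold:
                members.append(s)
                break
        else:
            # No representative is similar enough: start a new cluster.
            clusters.append((s, [s]))
    return clusters


seqs = ["MKTAYIAKQR", "MKTAYIAKQ", "GGSLLPWWAV", "MKTAYIAKQRQ"]
clusters = greedy_incremental_cluster(seqs)
for rep, members in clusters:
    print(rep, members)
```

Each sequence is compared only with cluster representatives rather than with every other sequence, which keeps the number of comparisons far below all-against-all. A hierarchy can then be built by re-clustering the representatives at a lower threshold; the thresholds used in the study are not given in the abstract.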

Conclusion/Significance

Our clustering took about two orders of magnitude less computational effort than the similar protein family analysis of the original GOS study. This approach will help to analyze other large metagenomic datasets in the future. A Web server with our clustering results and annotations of predicted protein clusters is available online at http://tools.camera.calit2.net/gos under the CAMERA project.

20.
Hypersensitivity in Borderline Personality Disorder during Mindreading   Total citations: 1, self-citations: 0, citations by others: 1

Background

One of the core symptoms of borderline personality disorder (BPD) is the instability in interpersonal relationships. This might be related to existent differences in mindreading between BPD patients and healthy individuals.

Methods

We examined the behavioural and neurophysiological (fMRI) responses of BPD patients and healthy controls (HC) during performance of the ‘Reading the Mind in the Eyes’ test (RMET).

Results

Mental state discrimination was significantly better and faster for affective eye gazes in BPD patients than in HC. At the neurophysiological level, this was manifested in a stronger activation of the amygdala and greater activity of the medial frontal gyrus, the left temporal pole and the middle temporal gyrus during affective eye gazes. In contrast, HC subjects showed a greater activation in the insula and the superior temporal gyri.

Conclusion

These findings indicate that BPD patients are highly vigilant to social stimuli, maybe because they resonate intuitively with mental states of others.
