Similar literature
1.
J M Neuhaus, N P Jewell. Biometrics, 1990, 46(4): 977-990
Recently a great deal of attention has been given to binary regression models for clustered or correlated observations. The data of interest are of the form of a binary dependent or response variable, together with independent variables X1, ..., Xk, where sets of observations are grouped together into clusters. A number of models and methods of analysis have been suggested to study such data. Many of these are extensions in some way of the familiar logistic regression model for binary data that are not grouped (i.e., each cluster is of size 1). In general, the analyses of these clustered data models proceed by assuming that the observed clusters are a simple random sample of clusters selected from a population of clusters. In this paper, we consider the application of these procedures to the case where the clusters are selected randomly in a manner that depends on the pattern of responses in the cluster. For example, we show that ignoring the retrospective nature of the sample design, by fitting standard logistic regression models for clustered binary data, may result in misleading estimates of the effects of covariates and the precision of estimated regression coefficients.

2.
Cell biologists have developed methods to label membrane proteins with gold nanoparticles and then extract spatial point patterns of the gold particles from transmission electron microscopy images using image processing software. Previously, the resulting patterns were analyzed using the Hopkins statistic, which distinguishes nonclustered from modestly and highly clustered distributions, but is not designed to quantify the number or sizes of the clusters. Clusters were defined by the partitional clustering approach, which required the choice of a distance: two points from a pattern were put in the same cluster if they were closer than this distance. In this study, we present a new methodology based on hierarchical clustering to quantify clustering. An intrinsic distance is computed, which is the distance that produces the maximum number of clusters in the biological data, eliminating the need to choose a distance. To quantify the extent of clustering, we compare the clustering distance of the experimental data being analyzed with that of simulated random data. Results are then expressed as a dimensionless number, the clustering ratio, which facilitates the comparison of clustering between experiments. Replacing the chosen cluster distance by the intrinsic clustering distance emphasizes densely packed clusters that are likely more important to downstream signaling events.
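The idea of an intrinsic distance can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-linkage merging (the abstract does not name a linkage), assumes that only groups of at least two points count as clusters, and the names count_clusters and intrinsic_distance are hypothetical.

```python
import math


def count_clusters(points, cutoff, min_size=2):
    """Count single-linkage clusters with at least min_size points.

    Two points are linked when their distance is <= cutoff
    (assumption: single linkage; singletons are not counted).
    """
    parent = list(range(len(points)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= cutoff:
                parent[find(i)] = find(j)

    sizes = {}
    for i in range(len(points)):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for size in sizes.values() if size >= min_size)


def intrinsic_distance(points, candidate_cutoffs):
    """Return the smallest candidate cutoff that yields the most clusters."""
    return max(sorted(candidate_cutoffs),
               key=lambda cutoff: count_clusters(points, cutoff))


# Toy pattern: three tight pairs of particles spaced far apart.
pattern = [(0, 0), (1, 0), (10, 0), (11, 0), (20, 0), (21, 0)]
print(intrinsic_distance(pattern, [0.5, 1, 2, 9, 12]))
```

Scanning cutoffs this way is quadratic in the number of points per cutoff, which is acceptable for a toy pattern; the clustering ratio described in the abstract would additionally require the same computation on simulated random patterns.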

3.
In statistical mechanics, the equilibrium properties of a physical system of particles can be calculated as the statistical average over accessible microstates of the system. In general, these calculations are computationally intractable since they involve summations over an exponentially large number of microstates. Clustering algorithms are one of the methods used to numerically approximate these sums. The most basic clustering algorithms first sub-divide the system into a set of smaller subsets (clusters). Then, interactions between particles within each cluster are treated exactly, while all interactions between different clusters are ignored. These smaller clusters have far fewer microstates, making the summation over these microstates tractable. These algorithms have been previously used for biomolecular computations, but remain relatively unexplored in this context. Presented here is a theoretical analysis of the error and computational complexity for the two most basic clustering algorithms that were previously applied in the context of biomolecular electrostatics. We derive a tight, computationally inexpensive error bound for the equilibrium state of a particle computed via these clustering algorithms. For some practical applications, it is the root mean square error, which can be significantly lower than the error bound, that may be more important. We show that there is a strong empirical relationship between error bound and root mean square error, suggesting that the error bound could be used as a computationally inexpensive metric for predicting the accuracy of clustering algorithms for practical applications. An example of error analysis for such an application (computation of the average charge of ionizable amino acids in proteins) is given, demonstrating that the clustering algorithm can be accurate enough for practical purposes.
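The basic clustering approximation described above can be sketched on a toy system. The model below is an assumption made for illustration, not the paper's: each site is binary (0 or 1), the energy is a sum of per-site terms and pairwise couplings, and kT is set to 1. The names mean_occupancy and clustered_occupancy are hypothetical.

```python
import itertools
import math


def mean_occupancy(sites, field, coupling):
    """Exact Boltzmann average of each binary site in `sites` (kT = 1).

    Enumerates all 2**len(sites) microstates. Only couplings whose two
    endpoints both belong to `sites` contribute to the energy.
    """
    totals = {s: 0.0 for s in sites}
    partition = 0.0
    for state in itertools.product((0, 1), repeat=len(sites)):
        occ = dict(zip(sites, state))
        energy = sum(field[s] * occ[s] for s in sites)
        energy += sum(w * occ[a] * occ[b]
                      for (a, b), w in coupling.items()
                      if a in occ and b in occ)
        weight = math.exp(-energy)
        partition += weight
        for s in sites:
            totals[s] += weight * occ[s]
    return {s: totals[s] / partition for s in sites}


def clustered_occupancy(clusters, field, coupling):
    """Treat each cluster exactly and ignore all inter-cluster couplings."""
    result = {}
    for cluster in clusters:
        result.update(mean_occupancy(cluster, field, coupling))
    return result


field = {0: 0.0, 1: 0.0, 2: 0.0}
coupling = {(0, 1): 2.0}  # sites 0 and 1 repel; site 2 is independent
exact = mean_occupancy([0, 1, 2], field, coupling)
approx = clustered_occupancy([[0, 1], [2]], field, coupling)
print(exact, approx)
```

With this grouping the approximation is exact, because no coupling crosses a cluster boundary; splitting sites 0 and 1 into separate clusters would discard their interaction and introduce an error. The cost falls from 2^N microstates to a sum of 2^(cluster size) terms over the clusters.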

4.
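Objective: This paper discusses changes in human brain state caused by alcohol. Hierarchical cluster analysis was applied to objectively recorded EEG data from subjects' alcohol-intake events in order to analyze the relationship between alcohol intake and the grouping of the 21 lead electrodes, and thereby provide an experimental and theoretical basis for other research on the human brain. Methods: Four healthy, right-handed subjects were selected. EEG data were acquired with the standard 21-electrode 10-20 lead system during two events: resting quietly with eyes closed, and after consuming a set amount of beer. The data were then analyzed by hierarchical cluster analysis, implemented with an independently designed EEG analysis toolbox and cluster analysis program. Results: Cluster analysis of the EEG data showed that, before drinking, EEG activity fell into roughly three clusters: the prefrontal and central regions, the posterior region, and the two lateral regions. After 200 ml of beer, in subjects P1 and P2 most frontal, central, and posterior electrodes formed a single cluster, while a few temporal and posterior electrodes formed another cluster, or single electrodes formed clusters of their own as isolated points. After 400 ml of beer, in subjects P3 and P4 most frontopolar, frontal, central, and posterior electrodes formed a single cluster, while individual frontal, central, and temporal electrodes each formed their own cluster as isolated points. Conclusion: EEG activity responds markedly to alcohol intake. Because the α wave recorded over the posterior region is prominent in the resting, eyes-closed state, the correlation between anterior and posterior EEG signals is weak before drinking, and the subjects' anterior and posterior electrodes are largely not in the same cluster. After alcohol intake, most frontal, central, and posterior electrodes form one cluster; that is, the correlation between anterior and posterior EEG signals increases. This indicates that under the influence of alcohol the α wave increases in the anterior region and tends to spread and strengthen.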

5.
Diffusion Tensor Imaging (DTI) and fiber tractography are important tools to map the cerebral white matter microstructure in vivo and to model the underlying axonal pathways in the brain with three-dimensional fiber tracts. As the fast and consistent extraction of anatomically correct fiber bundles for multiple datasets is still challenging, we present a novel atlas-guided clustering framework for exploratory data analysis of large tractography datasets. The framework uses a hierarchical cluster analysis approach that exploits the inherent redundancy in large datasets to time-efficiently group fiber tracts. Structural information of a white matter atlas can be incorporated into the clustering to achieve an anatomically correct and reproducible grouping of fiber tracts. This approach facilitates not only the identification of the bundles corresponding to the classes of the atlas; it also enables the extraction of bundles that are not present in the atlas. The new technique was applied to cluster datasets of 46 healthy subjects. Prospects of automatic and anatomically correct as well as reproducible clustering are explored. Reconstructed clusters were well separated and showed good correspondence to anatomical bundles. Using the atlas-guided cluster approach, we observed consistent results across subjects with high reproducibility. In order to investigate the outlier elimination performance of the clustering algorithm, scenarios with varying amounts of noise were simulated and clustered with three different outlier elimination strategies. By exploiting the multithreading capabilities of modern multiprocessor systems in combination with novel algorithms, our toolkit clusters large datasets in a couple of minutes. Experiments were conducted to investigate the achievable speedup and to demonstrate the high performance of the clustering framework in a multiprocessing environment.

6.
The aim of this paper is to present a new clustering algorithm for short time-series gene expression data that is able to characterise temporal relations in the clustering environment (i.e., the data space), which is not achieved by conventional clustering algorithms such as k-means or hierarchical clustering. The algorithm, called fuzzy c-varieties clustering with transitional state discrimination preclustering (FCV-TSD), is a two-step approach which identifies groups of points ordered in a line configuration in particular locations and orientations of the data space that correspond to similar expressions in the time domain. We present the validation of the algorithm with both artificial and real experimental datasets, where k-means and random clustering are used for comparison. The performance was evaluated with a measure for internal cluster correlation and the geometrical properties of the clusters, showing that the FCV-TSD algorithm had better performance than the k-means algorithm on both datasets.

7.
Plant pathologists need to manage plant diseases at low incidence levels. This needs to be performed efficiently in terms of precision, cost and time because most plant infections spread rapidly to other plants. Adaptive cluster sampling with a data-driven stopping rule (ACS*) was proposed to control the final sample size and improve the efficiency of ordinary adaptive cluster sampling (ACS) when prior knowledge of the population structure is not available. This study seeks to apply the ACS* design to plant diseases at various levels of clustering and incidence. Results from a simulation study show that ACS* is as efficient as the ordinary ACS design at low levels of disease incidence with highly clustered diseased plants, and is an efficient design compared with simple random sampling (SRS) and ordinary ACS for some highly to less clustered diseased plants with moderate to higher levels of disease incidence.

8.
In the last decade, numerous efforts have been devoted to designing efficient algorithms for clustering wireless mobile ad-hoc networks (MANETs) that take the network mobility characteristics into account. However, existing algorithms assume that the mobility parameters of the network are fixed, whereas they are in fact stochastic and vary with time. Therefore, the proposed clustering algorithms do not scale well in realistic MANETs, where the mobility parameters of the hosts freely and randomly change at any time. Finding the optimal solution to the cluster formation problem is incredibly difficult if we assume that the movement direction and mobility speed of the hosts are random variables. This becomes harder when the probability distribution function of these random variables is assumed to be unknown. In this paper, we propose a learning automata-based weighted cluster formation algorithm called MCFA in which the mobility parameters of the hosts are assumed to be random variables with unknown distributions. In the proposed clustering algorithm, the expected relative mobility of each host with respect to all its neighbors is estimated by sampling its mobility parameters in various epochs. MCFA is a fully distributed algorithm in which each mobile host independently chooses the neighboring host with the minimum expected relative mobility as its cluster-head. This is done based solely on the local information each host receives from its neighbors, and the hosts need not be synchronized. The experimental results show the superiority of MCFA over the best existing mobility-based clustering algorithms in terms of the number of clusters, cluster lifetime, reaffiliation rate, and control message overhead.
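The cluster-head choice described above can be sketched in a few lines. This is a deliberate simplification, not MCFA itself: the learning automata are replaced by a plain sample mean over epochs, and the function name and input layout are hypothetical.

```python
from statistics import mean


def choose_cluster_head(relative_mobility_samples):
    """Pick the neighbor with the lowest estimated expected relative mobility.

    relative_mobility_samples maps a neighbor id to the relative-mobility
    values sampled for that neighbor over several epochs (assumed input
    format). The expectation is estimated by the sample mean, which stands
    in for the learning-automata estimate used by the real algorithm.
    """
    estimates = {neighbor: mean(samples)
                 for neighbor, samples in relative_mobility_samples.items()}
    return min(estimates, key=estimates.get)


samples = {
    "a": [3.0, 2.0, 4.0],
    "b": [1.0, 1.5, 0.5],
    "c": [2.0, 2.0, 2.0],
}
print(choose_cluster_head(samples))
```

Each host runs this choice on purely local information, which is what makes the scheme fully distributed.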

9.
Key synaptic proteins from the soluble SNARE (N-ethylmaleimide-sensitive factor attachment protein receptor) family, among many others, are organized at the plasma membrane of cells as clusters containing dozens to hundreds of protein copies. However, the exact membranal distribution of proteins into clusters or as single molecules, the organization of molecules inside the clusters, and the clustering mechanisms are unclear due to limitations of the imaging and analytical tools. Focusing on syntaxin 1 and SNAP-25, we implemented direct stochastic optical reconstruction microscopy together with quantitative clustering algorithms to demonstrate a novel approach to explore the distribution of clustered and nonclustered molecules at the membrane of PC12 cells with single-molecule precision. Direct stochastic optical reconstruction microscopy images reveal, for the first time, solitary syntaxin/SNAP-25 molecules and small clusters as well as larger clusters. The nonclustered syntaxin or SNAP-25 molecules are mostly concentrated in areas adjacent to their own clusters. In the clusters, the density of the molecules gradually decreases from the dense cluster core to the periphery. We further detected large clusters that contain several density gradients. This suggests that some of the clusters are formed by unification of several clusters that preserve their original organization or reorganize into a single unit. Although syntaxin and SNAP-25 share some common distributional features, their clusters differ markedly from each other. SNAP-25 clusters are significantly larger, more elliptical, and less dense. Finally, this study establishes methodological tools for the analysis of single-molecule-based super-resolution imaging data and paves the way for revealing new levels of membranal protein organization.

10.
MOTIVATION: Feature (gene) selection can dramatically improve the accuracy of gene expression profile based sample class prediction. Many statistical methods for feature (gene) selection such as stepwise optimization and Monte Carlo simulation have been developed for tissue sample classification. In contrast to class prediction, few statistical and computational methods for feature selection have been applied to clustering algorithms for pattern discovery. RESULTS: An integrated scheme and corresponding program, SamCluster, for automatic discovery of sample classes based on gene expression profiles is presented in this report. The scheme incorporates feature selection algorithms based on the calculation of the CV (coefficient of variation) and the t-test into hierarchical clustering and proceeds as follows. First, the genes with a CV greater than the pre-specified threshold are selected for cluster analysis, which results in two putative sample classes. Then, genes that are significantly differentially expressed between the two putative sample classes, with t-test p-values ≤ 0.01, 0.05, or 0.1, are selected for further cluster analysis. These steps are iterated until two stable sample classes are found. Finally, the consensus sample classes are constructed from the putative classes derived from the different CV thresholds, and the best putative sample classes, those with the minimum distance between the consensus classes and the putative classes, are identified. To evaluate the performance of the feature selection for cluster analysis, the proposed scheme was applied to four expression datasets: COLON, LEUKEMIA72, LEUKEMIA38, and OVARIAN. The results show that only 5, 1, 0, and 0 samples were misclassified, respectively. We conclude that the proposed scheme, SamCluster, is an efficient method for discovery of sample classes using gene expression profiles.
AVAILABILITY: The related program SamCluster is available upon request or from the web page http://www.sph.uth.tmc.edu:8052/hgc/Downloads.asp.
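The first step of the scheme, keeping genes whose coefficient of variation exceeds a threshold, can be sketched as follows. This is not SamCluster code: the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the abstract does not say which estimator is used.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the absolute mean.

    Assumes a nonzero mean; a zero mean raises ZeroDivisionError.
    """
    return stdev(values) / abs(mean(values))


def select_by_cv(expression, threshold):
    """Return the genes whose CV across samples is greater than threshold.

    expression maps a gene name to its expression values across samples.
    """
    return [gene for gene, values in expression.items()
            if coefficient_of_variation(values) > threshold]


expression = {
    "g1": [10, 10, 10, 10],  # constant, CV = 0
    "g2": [1, 5, 9, 13],     # highly variable
    "g3": [8, 10, 12, 10],   # mildly variable
}
print(select_by_cv(expression, 0.5))
```

Raising the threshold keeps fewer, more variable genes; the scheme described above repeats the analysis at several thresholds and then builds consensus classes from the results.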

11.
MOTIVATION: In haploinsufficiency profiling data, pleiotropic genes are often misclassified by clustering algorithms that impose the constraint that a gene or experiment belong to only one cluster. We have developed a general probabilistic model that clusters genes and experiments without requiring that a given gene or drug only appear in one cluster. The model also incorporates the functional annotation of known genes to guide the clustering procedure. RESULTS: We applied our model to the clustering of 79 chemogenomic experiments in yeast. Known pleiotropic genes PDR5 and MAL11 are more accurately represented by the model than by a clustering procedure that requires genes to belong to a single cluster. Drugs such as miconazole and fenpropimorph that have different targets but similar off-target genes are clustered more accurately by the model-based framework. We show that this model is useful for summarizing the relationship among treatments and genes affected by those treatments in a compendium of microarray profiles. AVAILABILITY: Supplementary information and computer code at http://genomics.lbl.gov/llda.

12.

Motivation

It has been proposed that clustering clinical markers, such as blood test results, can be used to stratify patients. However, the robustness of clusters formed with this approach to data pre-processing and clustering algorithm choices has not been evaluated, nor has clustering reproducibility. Here, we made use of the NHANES survey to compare clusters generated with various combinations of pre-processing and clustering algorithms, and tested their reproducibility in two separate samples.

Method

Values of 44 biomarkers and 19 health/life style traits were extracted from the National Health and Nutrition Examination Survey (NHANES). The 1999–2002 survey was used for training, while data from the 2003–2006 survey was tested as a validation set. Twelve combinations of pre-processing and clustering algorithms were applied to the training set. The quality of the resulting clusters was evaluated both by considering their properties and by comparative enrichment analysis. Cluster assignments were projected to the validation set (using an artificial neural network) and enrichment in health/life style traits in the resulting clusters was compared to the clusters generated from the original training set.

Results

The clusters obtained with different pre-processing and clustering combinations differed both in terms of cluster quality measures and in terms of reproducibility of enrichment with health/life style properties. Z-score normalization, for example, dramatically improved cluster quality and enrichments, as compared to unprocessed data, regardless of the clustering algorithm used. Clustering diabetes patients revealed a group of patients enriched with retinopathies. This could indicate that routine laboratory tests can be used to detect patients suffering from complications of diabetes, although other explanations for this observation should also be considered.

Conclusions

Clustering according to classical clinical biomarkers is a robust process, which may help in patient stratification. However, optimization of the pre-processing and clustering process may still be required.
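Z-score normalization, the pre-processing step highlighted in the results above, can be sketched for a single biomarker column. The function name is hypothetical, and the population standard deviation is an assumption; the abstract does not specify the estimator.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Rescale one biomarker column to zero mean and unit variance.

    Uses the population standard deviation (assumption). A constant
    column is mapped to zeros to avoid dividing by zero.
    """
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


glucose = [2.0, 4.0, 6.0]  # made-up values, for illustration only
print(z_scores(glucose))
```

Applying this to every column before clustering prevents biomarkers measured on large numeric scales from dominating distance calculations, which is a plausible reason for the improvement the abstract reports.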

13.
HIV molecular epidemiology estimates transmission patterns by clustering genetically similar viruses. The process involves connecting genetically similar genotyped viral sequences in a network, implying epidemiological transmissions. This technique relies on genotype data, which are collected only from HIV-diagnosed and in-care populations, and leaves many persons with HIV (PWH) who have no access to consistent care out of the tracking process. We use machine learning algorithms to learn the non-linear correlation patterns between patient metadata and transmissions between HIV-positive cases. This enables us to expand the transmission network reconstruction beyond the molecular network. We employed multiple commonly used supervised classification algorithms to analyze the San Diego Primary Infection Resource Consortium (PIRC) cohort dataset, consisting of genotypes and nearly 80 additional non-genetic features. First, we trained classification models to distinguish genetically unrelated individuals from related ones. Our results show that random forest and decision tree achieved over 80% in accuracy, precision, recall, and F1-score using only a subset of meta-features, including age, birth sex, sexual orientation, race, transmission category, estimated date of infection, and first viral load date, besides genetic data. Additionally, both algorithms achieved approximately 80% sensitivity and specificity. The Area Under Curve (AUC) was 97% and 94% for the random forest and decision tree classifiers, respectively. Next, we extended the models to identify clusters of similar viral sequences. Support vector machine demonstrated an order-of-magnitude improvement in the accuracy of assigning sequences to the correct cluster compared with a dummy uniform random classifier. These results confirm that metadata carry important information about the dynamics of HIV transmission as embedded in transmission clusters.
Hence, novel computational approaches are needed to apply the non-trivial knowledge collected from inter-individual genetic information to metadata from PWH in order to expand the estimated transmissions. We note that feature extraction alone will not be effective in identifying patterns of transmission and will result in random clustering of the data, but its utilization in conjunction with genetic data and the right algorithm can contribute to the expansion of the reconstructed network beyond individuals with genetic data.
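The evaluation measures quoted above (accuracy, precision, recall or sensitivity, specificity, and F1-score) all derive from the four cells of a binary confusion matrix. The sketch below shows the standard definitions on made-up labels; it does not reproduce the study's data or code, and the function name is hypothetical.

```python
def binary_metrics(y_true, y_pred):
    """Compute standard binary classification metrics from 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(numerator, denominator):
        # Return 0.0 rather than failing when a class is absent.
        return numerator / denominator if denominator else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)  # also called sensitivity
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "precision": precision,
        "recall": recall,
        "specificity": ratio(tn, tn + fp),
        "f1": ratio(2 * precision * recall, precision + recall),
    }


# Made-up labels: 1 = genetically related pair, 0 = unrelated pair.
truth = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 1, 0, 0, 0, 1, 1, 0]
print(binary_metrics(truth, predicted))
```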

14.
We report a study of the efficiency of 4 classifiers (the K-nearest-neighbor and single-nearest-prototype algorithms, each as parametrized by both Fuzzy C-Means and Fuzzy Covariance clustering) in the detection of ventricular arrhythmias in ECG traces characterized by 4 features derived from 7 spectral parameters. Principal components analysis was used in conjunction with a cardiologist's deterministic classification of 90 ECG traces to fix the number of trace classes to 5 (ventricular fibrillation/flutter, sinus rhythm, ventricular rhythms with aberrant complexes and 2 classes of artefact). Forty of the 90 traces were then defined as a test set; 5 different learning sets (numbering 25, 30, 35, 40 and 45 traces) were randomly selected from the remaining 50 traces; each learning set was used to parametrize both the classification algorithms using both fuzzy clustering algorithms and the parametrized classification algorithms were then applied to the test set. Optimal K for K-nearest-neighbor algorithms and optimal cluster volumes for Fuzzy Covariance algorithms were sought by trial and error to minimize classification differences with respect to the cardiologist's classification. Fuzzy Covariance clustering afforded significantly better perception of cluster structure than the Fuzzy C-Means algorithm, and the classifiers performed correspondingly with an overall empirical error ratio of just 0.10 for the K-nearest-neighbor algorithm parametrized by Fuzzy Covariance.

15.
Strict assignment of genes to one class, dimensionality reduction, a priori specification of the number of classes, the need for a training set, nonunique solutions, and complex learning mechanisms are some of the inadequacies of current clustering algorithms. Existing algorithms cluster genes on the basis of high positive correlations between their expression patterns. However, genes with strong negative correlations can also have similar functions and are most likely to have a role in the same pathways. To address some of these issues, we propose the adaptive centroid algorithm (ACA), which employs an analysis of variance (ANOVA)-based performance criterion. The ACA also uses Euclidean distances, the center-of-mass principle for heterogeneously distributed mass elements, and the given data set to give unique solutions. The proposed approach involves three stages. In the first stage a two-way ANOVA of the gene expression matrix is performed. The two factors in the ANOVA are gene expression and experimental condition. The residual mean squared error (MSE) from the ANOVA is used as a performance criterion in the ACA. Finally, correlated clusters are found based on the Pearson correlation coefficients. To validate the proposed approach, a two-way ANOVA is again performed on the discovered clusters. The results from this last step indicate that the MSEs of the clusters are significantly lower than that of the fibroblast-serum gene expression matrix. The ACA is employed in this study for single- as well as multi-cluster gene assignments.

16.
Bootstrap confidence intervals for adaptive cluster sampling
Consider a collection of spatially clustered objects where the clusters are geographically rare. Of interest is estimation of the total number of objects on the site from a sample of plots of equal size. Under these spatial conditions, adaptive cluster sampling of plots is generally useful in improving efficiency in estimation over simple random sampling without replacement (SRSWOR). In adaptive cluster sampling, when a sampled plot meets some predefined condition, neighboring plots are added to the sample. When populations are rare and clustered, the usual unbiased estimators based on small samples are often highly skewed and discrete in distribution. Thus, confidence intervals based on asymptotic normal theory may not be appropriate. We investigated several nonparametric bootstrap methods for constructing confidence intervals under adaptive cluster sampling. To perform bootstrapping, we transformed the initial sample in order to include the information from the adaptive portion of the sample yet maintain a fixed sample size. In general, coverages of bootstrap percentile methods were closer to nominal coverage than the normal approximation.
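The expansion rule of adaptive cluster sampling can be sketched on a small grid of plots. The sketch assumes a four-neighbor definition of "neighboring plots" and the condition "plot count greater than zero"; both are illustrative choices, and the function name is hypothetical. It covers only how the sample grows, not the unbiased estimators or the bootstrap intervals studied in the paper.

```python
def adaptive_cluster_sample(grid, initial_plots, condition=lambda count: count > 0):
    """Expand an initial sample of plots adaptively.

    grid[r][c] holds the object count of plot (r, c). Whenever a sampled
    plot satisfies `condition`, its four edge-sharing neighbors are added
    to the sample (assumed neighborhood). Returns the set of sampled plots.
    """
    rows, cols = len(grid), len(grid[0])
    sampled = set()
    pending = list(initial_plots)
    while pending:
        r, c = pending.pop()
        if (r, c) in sampled:
            continue
        sampled.add((r, c))
        if condition(grid[r][c]):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and (nr, nc) not in sampled:
                    pending.append((nr, nc))
    return sampled


grid = [
    [0, 0, 0, 0],
    [0, 3, 2, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
sample = adaptive_cluster_sample(grid, [(1, 1), (3, 0)])
print(sorted(sample))
```

Starting from two initial plots, the sample grows to cover the whole occupied patch plus its empty edge plots, while the empty initial plot (3, 0) triggers no expansion. Because the final sample size depends on what is found, plain sample means are biased, which is why dedicated estimators and interval methods are needed.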

17.
Tseng GC, Wong WH. Biometrics, 2005, 61(1): 10-16
In this article, we propose a method for clustering that produces tight and stable clusters without forcing all points into clusters. The methodology is general but was initially motivated by cluster analysis of microarray experiments. Most current algorithms aim to assign all genes into clusters. For many biological studies, however, we are mainly interested in identifying the most informative, tight, and stable clusters of sizes, say, 20-60 genes for further investigation. We want to avoid the contamination of tightly regulated expression patterns of biologically relevant genes due to other genes whose expressions are only loosely compatible with these patterns. "Tight clustering" has been developed specifically to address this problem. It applies K-means clustering as an intermediate clustering engine. Early truncation of a hierarchical clustering tree is used to overcome the local minimum problem in K-means clustering. The tightest and most stable clusters are identified in a sequential manner through an analysis of the tendency of genes to be grouped together under repeated resampling. We validated this method in a simulated example and applied it to analyze a set of expression profiles in the study of embryonic stem cells.

18.
Positive feedback plays a key role in the ability of signaling molecules to form highly localized clusters in the membrane or cytosol of cells. Such clustering can occur in the absence of localizing mechanisms such as pre-existing spatial cues, diffusional barriers, or molecular cross-linking. What prevents positive feedback from amplifying inevitable biological noise when an un-clustered “off” state is desired? And, what limits the spread of clusters when an “on” state is desired? Here, we show that a minimal positive feedback circuit provides the general principle for both suppressing and amplifying noise: below a critical density of signaling molecules, clustering switches off; above this threshold, highly localized clusters are recurrently generated. Clustering occurs only in the stochastic regime, suggesting that finite sizes of molecular populations cannot be ignored in signal transduction networks. The emergence of a dominant cluster for finite numbers of molecules is partly a phenomenon of random sampling, analogous to the fixation or loss of neutral mutations in finite populations. We refer to our model as the “neutral drift polarity model.” Regulating the density of signaling molecules provides a simple mechanism for a positive feedback circuit to robustly switch between clustered and un-clustered states. The intrinsic ability of positive feedback both to create and suppress clustering is a general mechanism that could operate within diverse biological networks to create dynamic spatial organization.

19.
Validating clustering for gene expression data
MOTIVATION: Many clustering algorithms have been proposed for the analysis of gene expression data, but little guidance is available to help choose among them. We provide a systematic framework for assessing the results of clustering algorithms. Clustering algorithms attempt to partition the genes into groups exhibiting similar patterns of variation in expression level. Our methodology is to apply a clustering algorithm to the data from all but one experimental condition. The remaining condition is used to assess the predictive power of the resulting clusters: meaningful clusters should exhibit less variation in the remaining condition than clusters formed by chance. RESULTS: We successfully applied our methodology to compare six clustering algorithms on four gene expression data sets. We found our quantitative measures of cluster quality to be positively correlated with external standards of cluster quality.
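The leave-one-condition-out idea can be sketched as a scoring function: given cluster labels obtained from all other conditions, measure how tightly each cluster's genes agree in the held-out condition. The root-mean-square deviation used here is one simple choice made for illustration; the abstract does not give the paper's exact measure, and the function name is hypothetical.

```python
import math


def held_out_spread(labels, held_out):
    """RMS deviation of held-out expression from each cluster's mean.

    labels maps gene -> cluster label (computed without the held-out
    condition); held_out maps gene -> expression in the held-out
    condition. Lower values indicate more predictive clusters.
    """
    clusters = {}
    for gene, label in labels.items():
        clusters.setdefault(label, []).append(held_out[gene])

    squared_error = 0.0
    for values in clusters.values():
        center = sum(values) / len(values)
        squared_error += sum((v - center) ** 2 for v in values)
    return math.sqrt(squared_error / len(labels))


# Made-up held-out expression values for four genes.
held_out = {"g1": 1.0, "g2": 1.2, "g3": 5.0, "g4": 5.2}
meaningful = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
arbitrary = {"g1": 0, "g2": 1, "g3": 0, "g4": 1}
print(held_out_spread(meaningful, held_out), held_out_spread(arbitrary, held_out))
```

Repeating this with each condition held out in turn and summing the scores gives one number per clustering algorithm, which allows algorithms to be compared at the same number of clusters.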

20.

Background  

There are many important clustering questions in computational biology for which no satisfactory method exists. Automated clustering algorithms, when applied to large, multidimensional datasets such as flow cytometry data, prove unsatisfactory in terms of speed, problems with local minima, or cluster shape bias. Model-based approaches are restricted by the assumptions of the fitting functions. Furthermore, model-based clustering requires serial clustering for all cluster numbers within a user-defined interval. The final cluster number is then selected by various criteria. These supervised serial clustering methods are time consuming, and different criteria frequently result in different optimal cluster numbers. Various unsupervised heuristic approaches that have been developed, such as affinity propagation, are too expensive to be applied to datasets on the order of 10^6 points that are often generated by high-throughput experiments.
