Similar articles
 Found 20 similar articles; search time 35 ms
1.
Consider the scenario of common gene clusters of closely related species where the cluster sizes could be as large as 400 from an alphabet of 25,000 genes. This paper addresses the problem of computing the statistical significance of such large clusters, whose individual elements occur with very low frequency (of the order of the number of species in this case) and the alphabet set of the elements is relatively large. We present a model where we study the structure of the clusters in terms of smaller nested (or otherwise) sub-clusters contained within the cluster. We give a probability estimation based on the expected cluster structure for such clusters (rather than some form of the product of individual probabilities of the elements). We also give an exact probability computation based on a dynamic programming algorithm, which runs in polynomial time.  相似文献   

2.
Evolutionarily conserved non-coding genomic sequences represent a potentially rich source for the discovery of gene regulatory regions. Since these elements are subject to stabilizing selection they evolve much more slowly than adjacent non-functional DNA. These so-called phylogenetic footprints can be detected by comparison of the sequences surrounding orthologous genes in different species. Therefore the loss of phylogenetic footprints as well as the acquisition of conserved non-coding sequences in some lineages, but not in others, can provide evidence for the evolutionary modification of cis-regulatory elements. We introduce here a statistical model of footprint evolution that allows us to estimate the loss of sequence conservation that can be attributed to gene loss and other structural reasons. This approach to studying the pattern of cis-regulatory element evolution, however, requires the comparison of relatively long sequences from many species. We have therefore developed an efficient software tool for the identification of corresponding footprints in long sequences from multiple species. We apply this novel method to the published sequences of HoxA clusters of shark, human, and the duplicated zebrafish and Takifugu clusters as well as the published HoxB cluster sequences. We find that there is a massive loss of sequence conservation in the intergenic region of the HoxA clusters, consistent with the finding in [Chiu et al., PNAS 99 (2002) 5492]. The loss of conservation after cluster duplication is more extensive than expected from structural reasons. This suggests that binding site turnover and/or adaptive modification may also contribute to the loss of sequence conservation.  相似文献   

3.
4.
In observational studies, subjects are often nested within clusters. In medical studies, patients are often treated by doctors and therefore patients are regarded as nested or clustered within doctors. A concern that arises with clustered data is that cluster-level characteristics (e.g., characteristics of the doctor) are associated with both treatment selection and patient outcomes, resulting in cluster-level confounding. Measuring and modeling cluster attributes can be difficult, and statistical methods exist to control for all unmeasured cluster characteristics. An assumption of these methods, however, is that characteristics of the cluster and the effects of those characteristics on the outcome (as well as probability of treatment assignment when using covariate balancing methods) are constant over time. In this paper, we consider methods that relax this assumption and allow for estimation of treatment effects in the presence of unmeasured time-dependent cluster confounding. The methods are based on matching with the propensity score and incorporate unmeasured time-specific cluster effects by performing matching within clusters or using fixed- or random-cluster effects in the propensity score model. The methods are illustrated using data to compare the effectiveness of two total hip devices with respect to survival of the device, and a simulation study is performed that compares the proposed methods. One method that was found to perform well is matching within surgeon clusters partitioned by time. Considerations in implementing the proposed methods are discussed.

5.
Accurately localizing molecules within the cell is one of the main tasks of modern biology, and colocalization analysis is one of its principal and most often used tools. Despite this popularity, interpretation is often uncertain because colocalization between two or more images is rarely analyzed to determine whether the observed values could have occurred by chance. To address this, we have developed a robust methodology, based on Monte Carlo randomization, to measure the statistical significance of a colocalization. The method works with voxel-based, intensity-based, object-based, and nearest-neighbor metrics. We extend all of these to measure colocalization in images with three colors. We also introduce three new metrics: blob colocalization, where the blob consists of a local maximum surrounded by a three-dimensional group of voxels; cluster diameter, to measure the clustering of fluorophores in three or more images; and the intercluster distance, to measure the distance between these clusters. The robustness of these metrics was tested by varying the image thresholds over a broad range, which produced no change in the statistical significance of the colocalizations. A comparison of blob colocalization with voxel and Manders colocalization metrics shows that the different measures produce consistent results with similar values for significance and nonsignificance. Using our methodology, we are able to determine not only whether the labeled molecules colocalize with a probability greater than chance, but also whether they are sequestered into different compartments. The program, written in C++, is freely available as source, as well as in a Linux version.
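
The idea of a Monte Carlo randomization test can be illustrated with a minimal sketch. This is not the authors' C++ program: it assumes two channels flattened into equal-length intensity lists, uses the Pearson correlation as the colocalization metric, and shuffles individual voxels, which ignores spatial autocorrelation that a real analysis would need to preserve (for example by shuffling blocks). The names `pearson` and `randomization_p_value` are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length intensity lists.

    Assumes both lists have non-zero variance.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def randomization_p_value(ch1, ch2, n_shuffles=999, seed=0):
    """Return (observed metric, one-sided p-value) from voxel shuffling.

    The p-value is the fraction of shuffled images whose metric is at
    least as large as the observed one, counting the observed image
    itself so that the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = pearson(ch1, ch2)
    shuffled = list(ch2)
    hits = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # destroy any spatial relation to ch1
        if pearson(ch1, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_shuffles + 1)
```

A small p-value here only says that the observed correlation is unlikely under voxel-wise shuffling; the choice of null model is the main design decision in any such test.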

6.
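Reconstructing the tree of life has long been a dream of evolutionary biologists. The sequencing of whole genomes from a large number of species makes it possible to build phylogenetic trees at the whole-genome level and to study the evolutionary relationships among species. In this paper, we use two statistical methods and three distance measures to build phylogenetic trees based on protein structure at the whole-genome level. The whole genomes of 93 species were selected for analysis, covering three superkingdoms: eukaryotes, bacteria, and archaea. The results correctly divide these species into three major groups, and the clustering of species within each major branch is largely consistent with their morphological classification. We also compare the clustering results of these methods with the taxonomic classification of the species, and conclude that the combination of the abundance-based statistical method and the distance measure based on the angle between two vectors performs better than the other combinations for building phylogenetic trees.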
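
A distance based on the angle between two vectors can be sketched as follows. The abstract does not give the exact formula, so this is an assumption: each genome is represented by a hypothetical vector of abundances (for example, counts per protein structural class), and the distance is the angle in radians between two such vectors. The name `angle_distance` is illustrative only.

```python
import math


def angle_distance(u, v):
    """Angle in radians between two non-zero abundance vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)
```

Because the angle depends only on direction, two genomes with proportional abundance profiles have a distance near zero regardless of genome size, which may be one reason such a measure suits genomes of very different sizes.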

7.
P Smolen, J Rinzel, A Sherman. Biophysical Journal 1993, 64(6):1668-1680
Previous mathematical modeling of beta cell electrical activity has involved single cells or, recently, clusters of identical cells. Here we model clusters of heterogeneous cells that differ in size, channel density, and other parameters. We use gap-junctional electrical coupling, with conductances determined by an experimental histogram. We find that, for reasonable parameter distributions, only a small proportion of isolated beta cells will burst when uncoupled, at any given value of a glucose-sensing parameter. However, a coupled, heterogeneous cluster of such cells, if sufficiently large (approximately 125 cells), will burst synchronously. Small clusters of such cells will burst only with low probability. In large clusters, the dynamics of intracellular calcium compare well with experiments. Also, these clusters possess a dose-response curve of increasing average electrical activity with respect to a glucose-sensing parameter that is sharp when the cluster is coupled, but shallow when the cluster is decoupled into individual cells. This is in agreement with comparative experiments on cells in suspension and islets.  相似文献   

8.
We develop a statistical tool, SNVer, for calling common and rare variants in the analysis of pooled or individual next-generation sequencing (NGS) data. We formulate variant calling as a hypothesis testing problem and employ a binomial-binomial model to test the significance of observed allele frequency against sequencing error. SNVer reports one single overall P-value for evaluating the significance of a candidate locus being a variant, based on which multiplicity control can be obtained. This is particularly desirable because tens of thousands of loci are simultaneously examined in typical NGS experiments. Each user can choose the false-positive error rate threshold he or she considers appropriate, instead of just the dichotomous decisions of whether to 'accept or reject the candidates' provided by most existing methods. We use both simulated data and real data to demonstrate the superior performance of our program in comparison with existing methods. SNVer runs very fast and can complete testing 300 K loci within an hour. This excellent scalability makes it feasible for the analysis of whole-exome sequencing data, or even whole-genome sequencing data using a high-performance computing cluster. SNVer is freely available at http://snver.sourceforge.net/.
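
Treating variant calling as a hypothesis test can be illustrated with a deliberately simplified sketch. This is not SNVer's binomial-binomial model: it assumes a single binomial level in which each of `n` reads at a locus shows the alternate allele by error with a fixed probability `p`, and asks how surprising `k` or more alternate reads would be. The function name and the error rate in the usage note are assumptions.

```python
import math


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p).

    Here n is the read depth at a locus, k the number of reads that
    carry the alternate allele, and p the assumed per-read error rate.
    """
    return sum(
        math.comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(k, n + 1)
    )
```

For example, with an assumed error rate of 0.01, `binomial_tail(5, 100, 0.01)` gives the probability of seeing five or more erroneous alternate reads out of 100. A per-locus P-value of this kind can then be fed into a multiplicity correction across all tested loci.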

9.
Sickle cell hemoglobin (HbS) is a mutant, whose polymerization while in deoxy state in the venous circulation underlies the debilitating sickle cell anemia. It has been suggested that the nucleation of the HbS polymers occurs within clusters of dense liquid, existing in HbS solutions. We use dynamic light scattering with solutions of deoxy-HbS, and, for comparison, of oxy-HbS and oxy-normal adult hemoglobin, HbA. We show that solutions of all three Hb variants contain clusters of dense liquid, several hundred nanometers in size, which are metastable with respect to the Hb solutions. The clusters form within a few seconds after solution preparation and their sizes and numbers remain relatively steady for up to 3 h. The lower bound of the cluster lifetime is 15 ms. The clusters exist in broad temperature and Hb concentration ranges, and occupy 10^{-5} to 10^{-2} of the solution volume. The results on the cluster properties can serve as test data for a potential future microscopic theory of cluster stability and kinetics. More importantly, if the clusters are a part of the nucleation mechanism of HbS polymers, the rate of HbS polymerization can be controlled by varying the cluster properties.

10.
Gangnon RE. Biometrics 2012, 68(1):174-182
The spatial scan statistic is an important and widely used tool for cluster detection. It is based on the simultaneous evaluation of the statistical significance of the maximum likelihood ratio test statistic over a large collection of potential clusters. In most cluster detection problems, there is variation in the extent of local multiplicity across the study region. For example, using a fixed maximum geographic radius for clusters, urban areas typically have many overlapping potential clusters, whereas rural areas have relatively few. The spatial scan statistic does not account for local multiplicity variation. We describe a previously proposed local multiplicity adjustment based on a nested Bonferroni correction and propose a novel adjustment based on a Gumbel distribution approximation to the distribution of a local scan statistic. We compare the performance of all three statistics in terms of power and a novel unbiased cluster detection criterion. These methods are then applied to the well-known New York leukemia dataset and a Wisconsin breast cancer incidence dataset.  相似文献   

11.
In cluster randomized trials, intact social units such as schools, worksites or medical practices (rather than individuals themselves) are randomly allocated to intervention and control conditions, while the outcomes of interest are then observed on individuals within each cluster. Such trials are becoming increasingly common in the fields of health promotion and health services research. Attrition is a common occurrence in randomized trials, and a standard approach for dealing with the resulting missing values is imputation. We consider imputation strategies for missing continuous outcomes, focusing on trials with a completely randomized design in which fixed cohorts from each cluster are enrolled prior to random assignment. We compare five different imputation strategies with respect to Type I and Type II error rates of the adjusted two-sample t-test for the intervention effect. Cluster mean imputation is compared with multiple imputation, using either within-cluster data or data pooled across clusters in each intervention group. In the case of pooling across clusters, we distinguish between standard multiple imputation procedures which do not account for intracluster correlation and a specialized procedure which does account for intracluster correlation but is not yet available in standard statistical software packages. A simulation study is used to evaluate the influence of cluster size, number of clusters, degree of intracluster correlation, and variability among cluster follow-up rates. We show that cluster mean imputation yields valid inferences and, given its simplicity, may be an attractive option in some large community intervention trials which are subject to individual-level attrition only; however, it may yield less powerful inferences than alternative procedures which pool across clusters, especially when the cluster sizes are small and cluster follow-up rates are highly variable. 
When pooling across clusters, the imputation procedure should generally take intracluster correlation into account to obtain valid inferences; however, as long as the intracluster correlation coefficient is small, we show that standard multiple imputation procedures may yield acceptable type I error rates; moreover, these procedures may yield more powerful inferences than a specialized procedure, especially when the number of available clusters is small. Within-cluster multiple imputation is shown to be the least powerful among the procedures considered.  相似文献   
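
Cluster mean imputation, the simplest of the strategies compared above, can be sketched in a few lines. The data layout is an assumption: a dictionary mapping a cluster label to a list of continuous outcomes, with `None` marking a participant lost to follow-up. The function name is hypothetical, and the sketch does not include the adjusted t-test that the abstract pairs with it.

```python
def cluster_mean_impute(outcomes_by_cluster):
    """Replace each missing outcome with the mean of its own cluster.

    outcomes_by_cluster maps a cluster label to a list of outcomes,
    where None marks a missing value. Returns a new dictionary and
    leaves the input unchanged.
    """
    imputed = {}
    for cluster, values in outcomes_by_cluster.items():
        observed = [v for v in values if v is not None]
        if not observed:
            # Whole-cluster attrition cannot be handled by this strategy.
            raise ValueError(f"cluster {cluster!r} has no observed outcomes")
        mean = sum(observed) / len(observed)
        imputed[cluster] = [mean if v is None else v for v in values]
    return imputed
```

Because every imputed value equals the cluster mean, the cluster means are unchanged while within-cluster variability is understated, which is why the analysis of such data is usually carried out at the cluster level.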

12.
Dormant bacterial spores are extraordinarily resistant to environmental insults and are vectors of various illnesses. However, spores cannot cause disease unless they germinate and become vegetative cells. The molecular details of initiation of germination are not understood, but proteins essential in early stages of germination, such as nutrient germinant receptors (GRs) and GerD, are located in the spore inner membrane. In this study, we examine how these germination proteins are organized in dormant Bacillus subtilis spores by expressing fluorescent protein fusions that were at least partially functional and observing spores by fluorescence microscopy. We show that GRs and GerD colocalize primarily to a single cluster in dormant spores, reminiscent of the organization of chemoreceptor signalling complexes in Escherichia coli. GRs require all their subunits as well as GerD for clustering, and also require diacylglycerol addition to GerD and GRs' C protein subunits. However, different GRs cluster independently of each other, and GerD forms clusters in the absence of all the GRs. We predict that the clusters represent a functional germination unit or 'germinosome' in the spore inner membrane that is necessary for rapid and cooperative response to nutrients, as conditions known to block nutrient germination also disrupt the protein clusters.  相似文献   

13.
Arabidopsis thaliana is believed to have experienced at least two and possibly three whole-genome duplication events in its evolutionary history. In order to investigate the evolutionary relationships between these duplication events and diversification of disease resistance (R) genes, segmental-duplication events containing R genes belonging to the nucleotide binding-leucine rich repeat (NB-LRR) class were identified. Of 153 segmental-duplication events containing NB-LRR genes, only 22 contained NB-LRR genes in both members of the duplication pair, indicating a high frequency of NB-LRR gene loss after whole-genome duplication. The relative age of the duplication events was estimated based on the average synonymous substitution rate of the duplicated gene pairs in the segments. These data were combined with phylogenetic analyses. NB-LRR genes present in segment pairs derived from the most recent whole-genome duplication event, estimated to have occurred only 20 to 40 million years ago, occupy very distant branches of the NB-LRR phylogenetic tree. These data suggest that when NB-LRR clusters are duplicated as part of a whole-genome duplication, homoeologous NB-LRR genes are preferentially lost, either by eliminating one copy of the cluster or by eliminating individual genes such that only paralogous NB-LRR genes are maintained.  相似文献   

14.
Genomic deletions have long been known to play a causative role in microdeletion syndromes. Recent whole-genome genetic studies have shown that deletions can increase the risk for several psychiatric disorders, suggesting that genomic deletions play an important role in the genetic basis of complex traits. However, the association between genomic deletions and common, complex diseases has not yet been systematically investigated in gene mapping studies. Likelihood-based statistical methods for identifying disease-associated deletions have recently been developed for familial studies of parent-offspring trios. The purpose of this study is to develop statistical approaches for detecting genomic deletions associated with complex disease in case–control studies. Our methods are designed to be used with dense single nucleotide polymorphism (SNP) genotypes to detect deletions in large-scale or whole-genome genetic studies. As more and more SNP genotype data for genome-wide association studies become available, development of sophisticated statistical approaches will be needed that use these data. Our proposed statistical methods are designed to be used in SNP-by-SNP analyses and in cluster analyses based on combined evidence from multiple SNPs. We found that these methods are useful for detecting disease-associated deletions and are robust in the presence of linkage disequilibrium using simulated SNP data sets. Furthermore, we applied the proposed statistical methods to SNP genotype data of chromosome 6p for 868 rheumatoid arthritis patients and 1,197 controls from the North American Rheumatoid Arthritis Consortium. We detected disease-associated deletions within the region of human leukocyte antigen in which genomic deletions were previously discovered in rheumatoid arthritis patients.  相似文献   

15.
We consider the dynamics of a piecewise affine system of degrade-and-fire oscillators with global repressive interaction, inspired by experiments on synchronization in colonies of bacteria-embedded genetic circuits. Due to global coupling, if any two oscillators happen to be in the same state at some time, they remain in sync at all subsequent times; thus clusters of synchronized oscillators cannot shrink as a result of the dynamics. Assuming that the system is initiated from random initial configurations of fully dispersed populations (no clusters), we estimate asymptotic cluster sizes as a function of the coupling strength. A sharp transition is proved to exist that separates a weak coupling regime of unclustered populations from a strong coupling phase where clusters of extensive size are formed. Each phenomenon occurs with full probability in the thermodynamic limit. Moreover, the maximum number of asymptotic clusters is known to diverge linearly in this limit. In contrast, we show that with positive probability, the number of asymptotic clusters remains bounded, provided that the coupling strength is sufficiently large.

16.
BACKGROUND: Human diversity, namely single nucleotide polymorphisms (SNPs), is becoming a focus of biomedical research. Despite the binary nature of SNP determination, the majority of genotyping assay data need a critical evaluation for genotype calling. We applied statistical models to improve the automated analysis of 2-dimensional SNP data. METHODS: We derived several quantities in the framework of Gaussian mixture models that provide figures of merit to objectively measure the data quality. The accuracy of individual observations is scored as the probability of belonging to a certain genotype cluster, while the assay quality is measured by the overlap between the genotype clusters. RESULTS: The approach was extensively tested with a dataset of 438 nonredundant SNP assays comprising >150,000 datapoints. The performance of our automatic scoring method was compared with manual assignments. The agreement for the overall assay quality is remarkably good, and individual observations were scored differently by man and machine in 2.6% of cases, when applying stringent probability threshold values. CONCLUSION: Our definition of bounds for the accuracy for complete assays in terms of misclassification probabilities goes beyond other proposed analysis methods. We expect the scoring method to minimise human intervention and provide a more objective error estimate in genotype calling.  相似文献   
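
Scoring an observation by its probability of belonging to a genotype cluster can be sketched with a small mixture-model calculation. This is an illustration under strong assumptions, not the authors' method: the three genotype clusters are taken as already fitted, each is modeled as a spherical 2-dimensional Gaussian with its own weight, mean, and standard deviation, and all parameter values in the usage note are made up.

```python
import math


def gaussian_pdf_2d(point, mean, sigma):
    """Density of a spherical 2-D Gaussian with standard deviation sigma."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    exponent = -(dx * dx + dy * dy) / (2 * sigma * sigma)
    return math.exp(exponent) / (2 * math.pi * sigma * sigma)


def membership_probabilities(point, clusters):
    """Posterior probability of each genotype cluster for one data point.

    clusters maps a genotype label to (weight, mean, sigma). Points far
    from every cluster can underflow to a zero total; a real
    implementation would work in log space.
    """
    weighted = {
        label: weight * gaussian_pdf_2d(point, mean, sigma)
        for label, (weight, mean, sigma) in clusters.items()
    }
    total = sum(weighted.values())
    return {label: value / total for label, value in weighted.items()}
```

With hypothetical clusters `{"AA": (1/3, (1.0, 0.0), 0.1), "AB": (1/3, (0.7, 0.7), 0.1), "BB": (1/3, (0.0, 1.0), 0.1)}`, a point at `(1.0, 0.0)` is assigned to AA with probability close to 1, whereas a point midway between two cluster centers receives a split score that a stringent threshold would leave uncalled.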

17.
MOTIVATION: An important goal of microarray studies is to discover genes that are associated with clinical outcomes, such as disease status and patient survival. While a typical experiment surveys gene expressions on a global scale, there may be only a small number of genes that have significant influence on a clinical outcome. Moreover, expression data have cluster structures and the genes within a cluster have correlated expressions and coordinated functions, but the effects of individual genes in the same cluster may be different. Accordingly, we seek to build statistical models with the following properties. First, the model is sparse in the sense that only a subset of the parameter vector is non-zero. Second, the cluster structures of gene expressions are properly accounted for. RESULTS: For gene expression data without pathway information, we divide genes into clusters using commonly used methods, such as K-means or hierarchical approaches. The optimal number of clusters is determined using the Gap statistic. We propose a clustering threshold gradient descent regularization (CTGDR) method, for simultaneous cluster selection and within cluster gene selection. We apply this method to binary classification and censored survival analysis. Compared to the standard TGDR and other regularization methods, the CTGDR takes into account the cluster structure and carries out feature selection at both the cluster level and within-cluster gene level. We demonstrate the CTGDR on two studies of cancer classification and two studies correlating survival of lymphoma patients with microarray expressions. AVAILABILITY: R code is available upon request. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

18.
We present a new method which allows a swarm of robots to sort arbitrarily arranged objects into homogeneous clusters. In the ideal case, a distributed robotic sorting method should establish a single homogeneous cluster for each object type. This can be achieved with existing methods, but the rate of convergence is considered too slow for real-world application. Previous research on distributed robotic sorting is typified by randomised movement with a pick-up/deposit behaviour that is a probabilistic function of local object density. We investigate whether the ability of each robot to localise and return to remembered places can improve distributed sorting performance. In our method, each robot maintains a cache point for each object type. Upon collecting an object, it returns to add this object to the cluster surrounding the cache point. Similar to previous biologically inspired work on distributed sorting, no explicit communication between robots is implemented. However, the robots can still come to a consensus on the best cache for each object type by observing clusters and comparing their sizes with remembered cache sizes. We refer to this method as cache consensus. Our results indicate that incorporating this localisation capability enables a significant improvement in the rate of convergence. We present experimental results using a realistic simulation of our targeted robotic platform. A subset of these experiments is also validated on physical robots.  相似文献   

19.
Fast Fourier transform (FFT) correlation methods of protein-protein docking, combined with the clustering of low energy conformations, can find a number of local minima on the energy surface. For most complexes, the locations of the near-native structures can be constrained to the 30 largest clusters, each surrounding a local minimum. However, no reliable further discrimination can be obtained by energy measures because the differences in the energy levels between the minima are comparable with the errors in the energy evaluation. In fact, no current scoring function accounts for the entropic contributions that relate to the width rather than the depth of the minima. Since structures at narrow minima lose more entropy, some of the nonnative states can be detected by determining whether or not a local minimum is surrounded by a broad region of attraction on the energy surface. The analysis is based on starting Monte Carlo Minimization (MCM) runs from random points around each minimum, and observing whether a certain fraction of trajectories converge to a small region within the cluster. The cluster is considered stable if such a strong attractor exists, has at least 10 convergent trajectories, is relatively close to the original cluster center, and contains a low energy structure. We studied the stability of clusters for enzyme-inhibitor and antibody-antigen complexes in the Protein Docking Benchmark. The analysis yields three main results. First, all clusters that are close to the native structure are stable. Second, restricting considerations to stable clusters eliminates around half of the false positives, that is, solutions that are low in energy but far from the native structure of the complex. Third, dividing the conformational space into clusters and determining the stability of each cluster, the combined approach is less dependent on a priori information than exploring the potential conformational space by Monte Carlo minimizations.

20.
MOTIVATION: We present statistical methods for determining the number of per-gene replicate spots required in microarray experiments. The purpose of these methods is to obtain an estimate of the sampling variability present in microarray data, and to determine the number of replicate spots required to achieve a high probability of detecting a significant fold change in gene expression, while maintaining a low error rate. Our approach is based on data from control microarrays, and involves the use of standard statistical estimation techniques. RESULTS: After analyzing two experimental data sets containing control array data, we were able to determine the statistical power available for the detection of significant differential expression given differing levels of replication. The inclusion of replicate spots on microarrays not only allows more accurate estimation of the variability present in an experiment, but more importantly increases the probability of detecting genes undergoing significant fold changes in expression, while substantially decreasing the probability of observing fold changes due to chance rather than true differential expression.
