Similar literature
 20 similar documents found (search time: 15 ms)
1.
Consider the scenario of common gene clusters of closely related species where the cluster sizes could be as large as 400 from an alphabet of 25,000 genes. This paper addresses the problem of computing the statistical significance of such large clusters, whose individual elements occur with very low frequency (of the order of the number of species in this case) and the alphabet set of the elements is relatively large. We present a model where we study the structure of the clusters in terms of smaller nested (or otherwise) sub-clusters contained within the cluster. We give a probability estimation based on the expected cluster structure for such clusters (rather than some form of the product of individual probabilities of the elements). We also give an exact probability computation based on a dynamic programming algorithm, which runs in polynomial time.

2.
Coordinate based meta-analysis (CBMA) is widely used to find regions of consistent activation across fMRI studies that have been selected for their functional relevance to a given hypothesis. Only reported coordinates (foci), and a model of their spatial uncertainty, are used in the analysis. Results are clusters of foci where multiple studies have reported in the same spatial region, indicating functional relevance. There are several published methods that perform the analysis in a voxel-wise manner, resulting in around 10^5 statistical tests, and considerable emphasis placed on controlling the risk of type 1 statistical error. Here we address this issue by dramatically reducing the number of tests, and by introducing a new false discovery rate control: the false cluster discovery rate (FCDR). FCDR is particularly interpretable and relevant to the results of CBMA, controlling the type 1 error by limiting the proportion of clusters that are expected under the null hypothesis. We also introduce a data diagnostic scheme to help ensure quality of the analysis, and demonstrate its use in the example studies. We show that we control the false clusters better than the widely used ALE method by performing numerical experiments, and that our clustering scheme results in more complete reporting of structures relevant to the functional task.

3.
The stepped wedge design (SWD) is a form of cluster randomized trial, usually comparing two treatments, which is divided into time periods and sequences, with clusters allocated to sequences. Typically all sequences start with the standard treatment and end with the new treatment, with the change happening at different times in the different sequences. The clusters will usually differ in size but this is overlooked in much of the existing literature. This paper considers the case when clusters have different sizes and determines how efficient designs can be found. The approach uses an approximation to the variance of the treatment effect, which is expressed in terms of the proportions of clusters and of individuals allocated to each sequence of the design. The roles of these sets of proportions in determining an efficient design are discussed and illustrated using two SWDs, one in the treatment of sexually transmitted diseases and one in renal replacement therapy. Cluster-balanced designs, which allocate equal numbers of clusters to each sequence, are shown to have excellent statistical and practical properties; suggestions are made about the practical application of the results for these designs. The paper concentrates on the cross-sectional case, where subjects are measured once, but it is briefly indicated how the methods can be extended to the closed-cohort design.
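
As an illustration of the design described in this abstract, the following minimal Python sketch builds the treatment-indicator layout of a standard stepped wedge (one more period than sequences, one sequence switching per period) and computes the two sets of proportions the abstract refers to. The function names and the cluster sizes are hypothetical, and the sketch does not reproduce the paper's variance approximation.

```python
def stepped_wedge_design(num_sequences):
    """Rows are sequences, columns are periods.

    0 = standard treatment, 1 = new treatment. Assumes the common layout
    with num_sequences + 1 periods, where sequence s switches after period s.
    """
    if num_sequences < 1:
        raise ValueError("num_sequences must be at least 1")
    num_periods = num_sequences + 1
    return [
        [1 if period > sequence else 0 for period in range(num_periods)]
        for sequence in range(num_sequences)
    ]


def allocation_proportions(cluster_sizes_by_sequence):
    """Return (cluster proportions, individual proportions) per sequence.

    cluster_sizes_by_sequence[s] lists the sizes of the clusters
    allocated to sequence s.
    """
    total_clusters = sum(len(sizes) for sizes in cluster_sizes_by_sequence)
    total_individuals = sum(sum(sizes) for sizes in cluster_sizes_by_sequence)
    cluster_props = [len(sizes) / total_clusters
                     for sizes in cluster_sizes_by_sequence]
    individual_props = [sum(sizes) / total_individuals
                        for sizes in cluster_sizes_by_sequence]
    return cluster_props, individual_props


# Hypothetical cluster-balanced allocation: two clusters per sequence,
# but unequal numbers of individuals per sequence.
design = stepped_wedge_design(3)
cluster_props, individual_props = allocation_proportions(
    [[10, 30], [20, 20], [5, 15]]
)
```

In this made-up allocation the design is cluster-balanced (each sequence receives one third of the clusters) even though the share of individuals differs across sequences, which is the distinction the abstract draws.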

4.
5.
Biomarkers 2013, 18(4): 240-252
Abstract

The Net Reclassification Improvement (NRI) and the Integrated Discrimination Improvement (IDI) are used to evaluate the diagnostic accuracy improvement for biomarkers in a wide range of applications. Most applications for these reclassification metrics are confined to nested model comparison. We emphasize the important extensions of these metrics to the non-nested comparison. Non-nested models are important in practice, in particular, in high-dimensional data analysis and in sophisticated semiparametric modeling. We demonstrate that the assessment of accuracy improvement may follow the familiar NRI and IDI evaluation. While the statistical properties of the estimators for NRI and IDI have been well studied in the nested setting, one cannot always rely on these asymptotic results to implement the inference procedure for practical data, especially for testing the null hypothesis of no improvement, and these properties have not been established for the non-nested setting. We propose a generic bootstrap re-sampling procedure for the construction of confidence intervals and hypothesis tests. Extensive simulations and real biomedical data examples illustrate the applicability of the proposed inference methods for both nested and non-nested models.
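
To make the two metrics in this abstract concrete, here is a minimal Python sketch of the category-free NRI, the IDI, and a plain percentile bootstrap interval. It assumes predicted risks from an old and a new model plus a binary event indicator. The function names, the toy data, and the choice of a percentile bootstrap are assumptions for illustration and may differ from the procedure the authors propose.

```python
import random


def nri(old_risk, new_risk, event):
    """Category-free NRI: net upward risk moves among events plus
    net downward risk moves among non-events."""
    up_e = down_e = up_n = down_n = n_e = n_n = 0
    for old, new, y in zip(old_risk, new_risk, event):
        if y:
            n_e += 1
            up_e += new > old
            down_e += new < old
        else:
            n_n += 1
            up_n += new > old
            down_n += new < old
    if n_e == 0 or n_n == 0:
        raise ValueError("need at least one event and one non-event")
    return (up_e - down_e) / n_e + (down_n - up_n) / n_n


def idi(old_risk, new_risk, event):
    """IDI: mean risk change among events minus mean risk change
    among non-events."""
    ev = [n - o for o, n, y in zip(old_risk, new_risk, event) if y]
    ne = [n - o for o, n, y in zip(old_risk, new_risk, event) if not y]
    if not ev or not ne:
        raise ValueError("need at least one event and one non-event")
    return sum(ev) / len(ev) - sum(ne) / len(ne)


def bootstrap_ci(stat_fn, old_risk, new_risk, event,
                 n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling subjects with replacement."""
    rng = random.Random(seed)
    rows = list(zip(old_risk, new_risk, event))
    stats = []
    for _ in range(n_boot):
        sample = [rng.choice(rows) for _ in rows]
        try:
            stats.append(stat_fn(*zip(*sample)))
        except ValueError:
            continue  # resample lacked events or non-events; skip it
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper


# Toy data (hypothetical): two events followed by two non-events.
old = [0.2, 0.4, 0.6, 0.3]
new = [0.3, 0.5, 0.5, 0.1]
event = [1, 1, 0, 0]
nri_interval = bootstrap_ci(nri, old, new, event, n_boot=200)
```

Because the functions take only the two risk vectors, they apply unchanged whether or not the old model is nested in the new one. This sketch refits nothing, so a fuller version would refit both models inside each bootstrap resample.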

6.
In observational studies, subjects are often nested within clusters. In medical studies, patients are often treated by doctors and therefore patients are regarded as nested or clustered within doctors. A concern that arises with clustered data is that cluster-level characteristics (e.g., characteristics of the doctor) are associated with both treatment selection and patient outcomes, resulting in cluster-level confounding. Measuring and modeling cluster attributes can be difficult and statistical methods exist to control for all unmeasured cluster characteristics. An assumption of these methods however is that characteristics of the cluster and the effects of those characteristics on the outcome (as well as probability of treatment assignment when using covariate balancing methods) are constant over time. In this paper, we consider methods that relax this assumption and allow for estimation of treatment effects in the presence of unmeasured time-dependent cluster confounding. The methods are based on matching with the propensity score and incorporate unmeasured time-specific cluster effects by performing matching within clusters or using fixed- or random-cluster effects in the propensity score model. The methods are illustrated using data to compare the effectiveness of two total hip devices with respect to survival of the device and a simulation study is performed that compares the proposed methods. One method that was found to perform well is matching within surgeon clusters partitioned by time. Considerations in implementing the proposed methods are discussed.

7.
J M Neuhaus, N P Jewell. Biometrics 1990, 46(4): 977-990
Recently a great deal of attention has been given to binary regression models for clustered or correlated observations. The data of interest are of the form of a binary dependent or response variable, together with independent variables X1, ..., Xk, where sets of observations are grouped together into clusters. A number of models and methods of analysis have been suggested to study such data. Many of these are extensions in some way of the familiar logistic regression model for binary data that are not grouped (i.e., each cluster is of size 1). In general, the analyses of these clustered data models proceed by assuming that the observed clusters are a simple random sample of clusters selected from a population of clusters. In this paper, we consider the application of these procedures to the case where the clusters are selected randomly in a manner that depends on the pattern of responses in the cluster. For example, we show that ignoring the retrospective nature of the sample design, by fitting standard logistic regression models for clustered binary data, may result in misleading estimates of the effects of covariates and the precision of estimated regression coefficients.

8.
9.
Determining the number of clusters using the weighted gap statistic   (cited 3 times in total: 0 self-citations, 3 citations by others)
Yan M, Ye K. Biometrics 2007, 63(4): 1031-1037
Estimating the number of clusters in a data set is a crucial step in cluster analysis. In this article, motivated by the gap method (Tibshirani, Walther, and Hastie, 2001, Journal of the Royal Statistical Society B 63, 411-423), we propose the weighted gap and the difference of difference-weighted (DD-weighted) gap methods for estimating the number of clusters in data using the weighted within-clusters sum of errors: a measure of the within-clusters homogeneity. In addition, we propose a "multilayer" clustering approach, which is shown to be more accurate than the original gap method, particularly in detecting the nested cluster structure of the data. The methods are applicable when the input data contain continuous measurements and can be used with any clustering method. Simulation studies and real data are investigated and compared among these proposed methods as well as with the original gap method.

10.
This article develops hypothesis testing procedures for the stratified mark‐specific proportional hazards model with missing covariates where the baseline functions may vary with strata. The mark‐specific proportional hazards model has been studied to evaluate mark‐specific relative risks where the mark is the genetic distance of an infecting HIV sequence to an HIV sequence represented inside the vaccine. This research is motivated by analyzing the RV144 phase 3 HIV vaccine efficacy trial, to understand associations of immune response biomarkers on the mark‐specific hazard of HIV infection, where the biomarkers are sampled via a two‐phase sampling nested case‐control design. We test whether the mark‐specific relative risks are unity and how they change with the mark. The developed procedures enable assessment of whether the risk of HIV infection with HIV variants close to or far from the vaccine sequence is modified by immune responses induced by the HIV vaccine; this question is interesting because vaccine protection occurs through immune responses directed at specific HIV sequences. The test statistics are constructed based on augmented inverse probability weighted complete‐case estimators. The asymptotic properties and finite‐sample performances of the testing procedures are investigated, demonstrating double‐robustness and effectiveness of the predictive auxiliaries to recover efficiency. The finite‐sample performance of the proposed tests is examined through a comprehensive simulation study. The methods are applied to the RV144 trial.

11.
To explore the molecular mechanisms behind the diversification of colicin gene clusters, we examined DNA sequence polymorphism for the colicin gene clusters of 14 colicin E2 (ColE2) plasmids obtained from natural isolates of Escherichia coli. Two types of ColE2 plasmids are revealed, with type II gene clusters generated by recombination between type I ColE2 and ColE7 gene clusters. The levels and patterns of DNA polymorphism are different between the two types. Type I polymorphism is distributed evenly along the gene cluster, while type II accumulates polymorphism at an elevated rate in the 5' end of the colicin gene. These differences may be explained by recombinational origins of type II gene clusters. The pattern of divergence between the ColE2 gene cluster and its close relative ColE9 is not correlated with the pattern of polymorphism within ColE2, suggesting that this gene cluster is not evolving in a neutral fashion. A statistical test confirms significant departures from the predictions of neutrality. These data lend further support to the hypothesis that colicin gene clusters may evolve under the influence of nonneutral forces.

12.
Li Y, Lin X. Biometrics 2003, 59(1): 25-35
In the analysis of clustered categorical data, it is of common interest to test for the correlation within clusters, and the heterogeneity across different clusters. We address this problem by proposing a class of score tests for the null hypothesis that the variance components are zero in random effects models, for clustered nominal and ordinal categorical responses. We extend the results to accommodate clustered censored discrete time-to-event data. We next consider such tests in the situation where covariates are measured with errors. We propose using the SIMEX method to construct the score tests for the null hypothesis that the variance components are zero. Key advantages of the proposed score tests are that they can be easily implemented by fitting standard polytomous regression models and discrete failure time models, and that they are robust in the sense that no assumptions need to be made regarding the distributions of the random effects and the unobserved covariates. The asymptotic properties of the proposed tests are studied. We illustrate these tests by analyzing two data sets and evaluate their performance with simulations.

13.

Background  

Genes responsible for biosynthesis of fungal secondary metabolites are usually tightly clustered in the genome and co-regulated with metabolite production. Epipolythiodioxopiperazines (ETPs) are a class of secondary metabolite toxins produced by disparate ascomycete fungi and implicated in several animal and plant diseases. Gene clusters responsible for their production have previously been defined in only two fungi. Fungal genome sequence data have been surveyed for the presence of putative ETP clusters and cluster data have been generated from several fungal taxa where genome sequences are not available. Phylogenetic analysis of cluster genes has been used to investigate the assembly and heredity of these gene clusters.

14.
A totally data-based approach to the evaluation of short-term tests is proposed. The performances of 22 tests over a range of 42 chemicals (data from literature) were studied by cluster analysis. The comparison between them was performed only on the basis of their responses to the chemicals. Two different clustering methods produced a coincident classification, pointing to a clear resolution of all tests into 3 groups with common characteristics. With respect to carcinogen discrimination, cluster 1 showed the highest sensitivity and the lowest specificity. Cluster 3 had opposite characteristics. The tests of cluster 2 showed intermediate features. As far as the membership to clusters is concerned, the literature data about the responses to chemicals indicated a strong test system specificity. This apparently overcame both phylogeny and end-point community. A major characteristic of the present approach is the ability to elicit underlying patterns, the knowledge of which can contribute both to hypothesis formulation and be useful for practical purposes.

15.
Finding subtypes of heterogeneous diseases is the biggest challenge in the area of biology. Often, clustering is used to provide a hypothesis for the subtypes of a heterogeneous disease. However, there are usually discrepancies between the clusterings produced by different algorithms. This work introduces a simple method which provides the most consistent clusters across three different clustering algorithms for a melanoma and a breast cancer data set. The method is validated by showing that the Silhouette, Dunn's and Davies-Bouldin's cluster validation indices are better for the proposed algorithm than those obtained by k-means and another consensus clustering algorithm. The hypotheses of the consensus clusters on both the data sets are corroborated by clear genetic markers and 100 percent classification accuracy. In Bittner et al.'s melanoma data set, a previously hypothesized primary cluster is recognized as the largest consensus cluster and a new partition of this cluster into two subclusters is proposed. In van't Veer et al.'s breast cancer data set, previously proposed "basal" and "luminal A" subtypes are clearly recognized as the two predominant clusters. Furthermore, a new hypothesis is provided about the existence of two subgroups within the "basal" subtype in this data set. The clusters of van't Veer's data set are also validated by high classification accuracy obtained in the data set of van de Vijver et al.

16.
17.
We have mined the evolutionary record for the large family of intracellular lipid-binding proteins (iLBPs) by calculating the statistical coupling of residue variations in a multiple sequence alignment using methods developed by Ranganathan and coworkers (Lockless and Ranganathan, Science 1999; 286: 295-299). The 213 sequences analyzed have a wide range of ligand-binding functions as well as highly divergent phylogenetic origins, assuring broad sampling of sequence space. Emerging from this analysis were two major clusters of coupled residues, which when mapped onto the structure of a representative iLBP under study in our laboratory, cellular retinoic-acid binding protein I, are largely contiguous and provide useful points of comparison to available data for the folding of this protein. One cluster comprises a predominantly hydrophobic core away from the ligand-binding site and likely represents key structural information for the iLBP fold. The other cluster includes the portal region where ligand enters its binding site, regions of the ligand-binding cavity, and the region where the 10-stranded beta-barrel characteristic of this family closes (between strands 1' and 10). Linkages between these two clusters suggest that evolutionary pressures on this family constrain structural and functional sequence information in an interdependent fashion. The necessity of the structure to wrap around a hydrophobic ligand confounds the typical sequestration of hydrophobic side chains. Additionally, ligand entry and exit require these structures to have a capacity for specific conformational change during binding and release. We conclude that an essential and structurally apparent separation of local and global sequence information is conserved throughout the iLBP family.

18.
The O antigens of Salmonella serogroups A, B, and D differ structurally in their side chain sugar residues. The genes encoding O-antigen biosynthesis are clustered in the rfb operon. The gene rfbJ in strain LT2 (serovar typhimurium, group B) and the genes rfbS and rfbE in strain Ty2 (serovar typhi, group D) account for the known differences in the rfb gene clusters used for determination of group specificity. In this paper, we report the nucleotide sequence of 2.9 kb of DNA from the rfb gene cluster of strain Ty2 and the finding of two open reading frames which have limited similarity with the corresponding open reading frames of strain LT2. These two genes complete the sequence of the rfb region of group D strain Ty2 if we use strain LT2 sequence where restriction site data show it to be extremely similar to the strain Ty2 sequence. The restriction map of the rfb gene cluster in group A strain IMVS1316 (serovar paratyphi) is identical to that of the cluster in strain Ty2 except for a frameshift mutation in rfbE and a triplicated region. The rfb gene clusters of these three strains are compared, and the evolutionary origin of these genes is discussed.

19.
McNemar's test is used to assess the difference between two different procedures (treatments) using independent matched-pair data. For matched-pair data collected in clusters, the tests proposed by Durkalski et al. and Obuchowski are popular and commonly used in practice since these tests do not require distributional assumptions or assumptions on the structure of the within-cluster correlation of the data. Motivated by these tests, this note proposes a modified Obuchowski test and illustrates comparisons of the proposed test with the extant methods. An extensive Monte Carlo simulation study suggests that the proposed test performs well with respect to the nominal size, and has higher power; Obuchowski's test is the most conservative, and the performance of Durkalski's test varies between that of the modified Obuchowski test and that of the original Obuchowski test. These results form the basis for our recommendation that (i) for equal cluster size, the modified Obuchowski test is always preferred; (ii) for varying cluster size, Durkalski's test can be used for a small number of clusters (e.g. K < 50), whereas for a large number of clusters (e.g. K ≥ 50) the modified Obuchowski test is preferred. Finally, to illustrate practical application of the competing tests, two real collections of clustered matched-pair data are analyzed.
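
For reference, the classical McNemar test that this abstract starts from can be sketched in a few lines of Python. This is the textbook version for independent matched pairs, without continuity correction. It is not the Durkalski, Obuchowski, or modified Obuchowski test for clustered data, whose formulas are not given here. The function name and the counts are hypothetical.

```python
import math


def mcnemar_test(b, c):
    """Classical McNemar test for independent matched pairs.

    b: pairs positive under procedure 1 only
    c: pairs positive under procedure 2 only
    Returns the chi-square statistic (1 degree of freedom, no continuity
    correction) and its p-value.
    """
    if b < 0 or c < 0 or b + c == 0:
        raise ValueError("b and c must be non-negative with b + c > 0")
    statistic = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


# Hypothetical discordant-pair counts.
statistic, p_value = mcnemar_test(15, 5)
```

Only the discordant pairs enter the statistic. When pairs are collected in clusters, this p-value ignores within-cluster correlation, which is the problem the clustered tests compared in the abstract are designed to address.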

20.
The endosymbiont theory proposes that chloroplasts have originated from ancestral cyanobacteria through a process of engulfment and subsequent symbiotic adaptation. The molecular data for testing this theory have mainly been the nucleotide sequence of rRNAs and of photosystem component genes. In order to provide additional data in this area, we have isolated genomic clones of Synechocystis DNA containing the ribosomal protein gene cluster rplJL. The nucleotide sequence of this cluster and flanking regions was determined and the derived amino acid sequences were compared to the available homologous sequences from other eubacteria and chloroplasts. In Escherichia coli these two genes are part of a larger cluster, i.e., rplKAJL-rpoBC. In Synechocystis, the genes for the RNA polymerase subunit (rpoBC) are shown to be widely separated from the r-protein genes. The Synechocystis gene arrangement is similar to that in the chloroplast system, where the rpoBC1C2 and rplKAJL clusters are separated and located in two cell compartments, the chloroplast and the nucleus, respectively.
