Similar Articles
Found 20 similar articles.
1.
Roy J, Lin X. Biometrics 2000, 56(4):1047-1054.
Multiple outcomes are often used to properly characterize an effect of interest. This paper proposes a latent variable model for the situation where repeated measures over time are obtained on each outcome. These outcomes are assumed to measure an underlying quantity of main interest from different perspectives. We relate the observed outcomes using regression models to a latent variable, which is then modeled as a function of covariates by a separate regression model. Random effects are used to model the correlation due to repeated measures of the observed outcomes and the latent variable. An EM algorithm is developed to obtain maximum likelihood estimates of model parameters. Unit-specific predictions of the latent variables are also calculated. This method is illustrated using data from a national panel study on changes in methadone treatment practices.
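
Below is a minimal simulation sketch of the structure this abstract describes: a latent trait driven by a covariate and a subject-level random effect, measured by several longitudinal outcomes through outcome-specific loadings. All dimensions, loadings, and variances are illustrative assumptions, not values from the paper.

```python
# Hedged sketch of the latent-variable structure: a latent trait, modeled
# from covariates with a subject random effect, drives several observed
# longitudinal outcomes. All names and values are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_time, n_out = 200, 4, 3
beta = 0.8                       # covariate effect on the latent trait
lam = np.array([1.0, 0.7, 1.3])  # loadings linking outcomes to the trait

x = rng.normal(size=(n_subj, n_time))        # time-varying covariate
b = rng.normal(scale=0.5, size=n_subj)       # subject-level random effect
u = beta * x + b[:, None] + rng.normal(scale=0.3, size=(n_subj, n_time))

# each observed outcome measures the latent trait with its own loading/noise
y = lam[None, None, :] * u[:, :, None] + rng.normal(
    scale=0.4, size=(n_subj, n_time, n_out))
print(y.shape)  # (200, 4, 3): subjects x occasions x outcomes
```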

2.
Encouragement design studies are particularly useful for estimating the effect of an intervention that cannot itself be randomly administered to some and not to others. They require that a randomly selected group receive extra encouragement to undertake the treatment of interest, where the encouragement typically takes the form of additional information or incentives. We consider a "clustered encouragement design" (CED), where the randomization is at the level of the clusters (e.g., physicians), but the compliance with assignment is at the level of the units (e.g., patients) within clusters. Noncompliance and missing data are particular problems in encouragement design studies, where encouragement to take the treatment, rather than the treatment itself, is randomized. The motivating study looks at whether computer-based care suggestions can improve patient outcomes in veterans with chronic heart failure. Since physician adherence has been inadequate, the original study focused on methods to improve physician adherence, although an equally important question is whether physician adherence improves patient outcomes. Here, we reanalyze the data to determine the effect of physician adherence on patient outcomes. We propose causal inference methodology for the effect of a treatment versus a control in a randomized CED study with all-or-none compliance at the unit level. These methods extend the current approaches to account for nonignorable missing data and use an alternative approach to inference via multiple imputation methods, which have been successfully applied to a wide variety of missing data problems and have recently been applied to the potential outcomes framework of causal inference (Taylor and Zhou, 2009b).
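
For the complete-data case, the unit-level effect among compliers in such a design is the classical instrumental-variable (CACE) ratio of intention-to-treat effects. The sketch below computes it with a cluster bootstrap over physicians; it is a hedged simplification that does not reproduce the paper's multiple-imputation handling of nonignorable missingness.

```python
# Hedged sketch: the IV/CACE estimand for all-or-none compliance,
# (ITT effect on Y) / (ITT effect on uptake), with a cluster bootstrap.
import numpy as np

def cace(z, d, y):
    """z: cluster-level assignment, d: treatment uptake, y: outcome."""
    itt_y = y[z == 1].mean() - y[z == 0].mean()
    itt_d = d[z == 1].mean() - d[z == 0].mean()
    return itt_y / itt_d

def cluster_bootstrap_ci(z, d, y, cluster, B=1000, seed=0):
    """Percentile CI resampling whole clusters (e.g., physicians)."""
    rng = np.random.default_rng(seed)
    ids = np.unique(cluster)
    est = []
    for _ in range(B):
        take = rng.choice(ids, size=len(ids), replace=True)
        idx = np.concatenate([np.where(cluster == c)[0] for c in take])
        est.append(cace(z[idx], d[idx], y[idx]))
    return np.percentile(est, [2.5, 97.5])
```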

3.
MOTIVATION: Discriminant analysis is an effective tool for the classification of experimental units into groups. Here, we consider the typical problem of classifying subjects according to phenotypes via gene expression data and propose a method that incorporates variable selection into the inferential procedure, for the identification of the important biomarkers. To achieve this goal, we build upon a conjugate normal discriminant model, both linear and quadratic, and include a stochastic search variable selection procedure via an MCMC algorithm. Furthermore, we incorporate into the model prior information on the relationships among the genes as described by a gene-gene network. We use a Markov random field (MRF) prior to map the network connections among genes. Our prior model assumes that neighboring genes in the network are more likely to have a joint effect on the relevant biological processes. RESULTS: We use simulated data to assess the performance of our method. In particular, we compare the MRF prior to a situation where independent Bernoulli priors are chosen for the individual predictors. We also illustrate the method on benchmark datasets for gene expression. Our simulation studies show that employing the MRF prior improves selection accuracy. In real data applications, in addition to identifying markers and improving prediction accuracy, we show how the integration of existing biological knowledge into the prior model results in an increased ability to identify genes with strong discriminatory power and also aids the interpretation of the results.

4.
Shin Y, Raudenbush SW. Biometrics 2007, 63(4):1262-1268.
The development of model-based methods for incomplete data has been a seminal contribution to statistical practice. Under the assumption of ignorable missingness, one estimates the joint distribution of the complete data for $\theta \in \Theta$ from the incomplete or observed data $y_{\mathrm{obs}}$. Many interesting models involve one-to-one transformations of $\theta$. For example, with $y_i \sim N(\mu, \Sigma)$ for $i = 1, \ldots, n$ and $\theta = (\mu, \Sigma)$, an ordinary least squares (OLS) regression model is a one-to-one transformation of $\theta$. Inferences based on such a transformation are equivalent to inferences based on OLS using data multiply imputed from $f(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, \theta)$ for missing $y_{\mathrm{mis}}$. Thus, identification of $\theta$ from $y_{\mathrm{obs}}$ is equivalent to identification of the regression model. In this article, we consider a model for two-level data with continuous outcomes where the observations within each cluster are dependent. The parameters of the hierarchical linear model (HLM) of interest, however, lie in a subspace of $\Theta$ in general. This identification of the joint distribution overidentifies the HLM. We show how to characterize the joint distribution so that its parameters are a one-to-one transformation of the parameters of the HLM. This leads to efficient estimation of the HLM from incomplete data using either the transformation method or the method of multiple imputation. The approach allows outcomes and covariates to be missing at either of the two levels, and the HLM of interest can involve the regression of any subset of variables on a disjoint subset of variables conceived as covariates.
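
A hedged sketch of the multiple-imputation route described above, using scikit-learn's IterativeImputer as a generic stand-in for draws from $f(y_{\mathrm{mis}} \mid y_{\mathrm{obs}}, \theta)$ (not the authors' HLM-based imputation), with OLS fits pooled by Rubin's rules:

```python
# Hedged sketch: impute, fit OLS on each completed data set, pool with
# Rubin's rules. The imputation engine is a generic stand-in.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
import statsmodels.api as sm

def mi_ols(data, y_col, m=20, seed=0):
    """data: 2-d float array with NaNs; y_col: index of the response."""
    betas, variances = [], []
    for i in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
        full = imp.fit_transform(data)
        y = full[:, y_col]
        X = sm.add_constant(np.delete(full, y_col, axis=1))
        fit = sm.OLS(y, X).fit()
        betas.append(fit.params)
        variances.append(np.diag(fit.cov_params()))
    betas, variances = np.array(betas), np.array(variances)
    qbar = betas.mean(axis=0)             # pooled point estimate
    ubar = variances.mean(axis=0)         # within-imputation variance
    bvar = betas.var(axis=0, ddof=1)      # between-imputation variance
    total = ubar + (1 + 1 / m) * bvar     # Rubin's total variance
    return qbar, np.sqrt(total)
```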

5.
In the tradition of European phytosociology, delimitations of vegetation units such as associations are mostly based on data from small areas where more detailed vegetation sampling has been carried out. Such locally delimited vegetation units are often accepted in large-scale synthetic classifications, e.g. national vegetation monographs, and tentatively assigned to a small geographical range, forming groups of similar (vicarious) vegetation units in different small areas. These vicarious units, however, often overlap in species composition and are difficult to distinguish from one another. We demonstrate this issue using an example of the classification of dry grasslands (Festuco-Brometea) in the Czech Republic. The standard vegetation classification of the Czech Republic supposes that the majority of accepted associations (66 out of 68) have a restricted distribution in one of the two major regions, Bohemia or Moravia. We compared the classification into traditional associations with the numerical classification of 1440 phytosociological relevés from the Czech Republic, in order to test whether the traditionally recognized associations with small geographical ranges are reflected in numerical classification. In various comparisons, the groups of relevés identified by numerical analysis occupied larger areas than the traditional associations. This suggests that with consistent use of total species composition as the vegetation classification criterion, the resulting classification will usually include more vegetation units with larger geographical ranges, while many of the traditional local associations will disappear.
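
As a hedged illustration of the numerical-classification side of this comparison, the sketch below clusters a site-by-species table with Bray-Curtis dissimilarity and UPGMA linkage; the dissimilarity measure, linkage, and cluster count are common phytosociological choices, not necessarily those used in the study.

```python
# Hedged sketch: numerical classification of releves by total species
# composition. Data, metric, and cluster count are illustrative.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(60, 40)).astype(float)  # 60 releves x 40 species

d = pdist(X, metric="braycurtis")      # compositional dissimilarity
tree = linkage(d, method="average")    # UPGMA, common in phytosociology
groups = fcluster(tree, t=8, criterion="maxclust")
print(np.bincount(groups)[1:])         # sizes of the 8 resulting clusters
```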

6.
Batch marking is common and useful for many capture–recapture studies where individual marks cannot be applied due to various constraints such as timing, cost, or marking difficulty. When batch marks are used, observed data are not individual capture histories but a set of counts including the numbers of individuals first marked, marked individuals that are recaptured, and individuals captured but released without being marked (applicable to some studies) on each capture occasion. Fitting traditional capture–recapture models to such data requires one to identify all possible sets of capture–recapture histories that may lead to the observed data, which is computationally infeasible even for a small number of capture occasions. In this paper, we propose a latent multinomial model to deal with such data, where the observed vector of counts is a non-invertible linear transformation of a latent vector that follows a multinomial distribution depending on model parameters. The latent multinomial model can be fitted efficiently through a saddlepoint approximation based maximum likelihood approach. The model framework is very flexible and can be applied to data collected with different study designs. Simulation studies indicate that reliable estimation results are obtained for all parameters of the proposed model. We apply the model to analysis of golden mantella data collected using batch marks in Central Madagascar.

7.
Selection on phenotypes may cause genetic change. To understand the relationship between phenotype and gene expression from an evolutionary viewpoint, it is important to study the concordance between gene expression and profiles of phenotypes. In this study, we use a novel method of clustering to identify genes whose expression profiles are related to a quantitative phenotype. Cluster analysis of gene expression data aims at classifying genes into several different groups based on the similarity of their expression profiles across multiple conditions. The hope is that genes that are classified into the same clusters may share underlying regulatory elements or may be part of the same metabolic pathways. Current methods for examining the association between phenotype and gene expression are limited to linear association measured by the correlation between individual gene expression values and phenotype. Genes may be associated with the phenotype in a nonlinear fashion. In addition, groups of genes that share a particular pattern in their relationship to phenotype may be of evolutionary interest. In this study, we develop a method to group genes based on orthogonal polynomials under a multivariate Gaussian mixture model. The effect of each expressed gene on the phenotype is partitioned into a cluster mean and a random deviation from the mean. Genes can also be clustered based on a time series. Parameters are estimated using the expectation-maximization algorithm and implemented in SAS. The method is verified with simulated data and demonstrated with experimental data from two studies: one clusters genes with respect to disease severity in Alzheimer's patients, and the other clusters a rat fracture-healing study over time. We find significant evidence of nonlinear associations in both studies and successfully describe these patterns with our method. We give detailed instructions and provide a working program that allows others to directly implement this method in their own analyses.
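
A minimal sketch of the idea, assuming Legendre polynomials and scikit-learn's Gaussian mixture EM in place of the authors' SAS implementation; the polynomial degree and cluster count are illustrative:

```python
# Hedged sketch: summarize each gene's expression-vs-phenotype relationship
# by orthogonal (Legendre) polynomial coefficients, then cluster the genes
# on those coefficients with a Gaussian mixture fit by EM.
import numpy as np
from numpy.polynomial import legendre
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
n_genes, n_samples = 500, 30
pheno = np.sort(rng.uniform(-1, 1, n_samples))      # phenotype, scaled to [-1, 1]
expr = np.sin(3 * pheno) * rng.normal(1, 0.2, (n_genes, 1)) \
       + rng.normal(0, 0.3, (n_genes, n_samples))   # nonlinear signal + noise

deg = 3
coef = np.array([legendre.legfit(pheno, g, deg) for g in expr])
gm = GaussianMixture(n_components=4, random_state=0).fit(coef)
labels = gm.predict(coef)
print(np.bincount(labels))  # gene counts per cluster
```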

8.
Ma S, Huang J. Biometrics 2007, 63(3):751-757.
In biomedical studies, it is of great interest to develop methodologies for combining multiple markers for the purpose of disease classification. The receiver operating characteristic (ROC) technique has been widely used, where classification performance can be measured with the area under the ROC curve (AUC). In this article, we study an ROC-based method for effectively combining multiple markers for disease classification. We propose a sigmoid AUC (SAUC) estimator that maximizes the sigmoid approximation of the empirical AUC. The SAUC estimator is computationally affordable, $n^{1/2}$-consistent and achieves the same asymptotic efficiency as the AUC estimator. Inference based on the weighted bootstrap is investigated. We also propose Monte Carlo methods to assess the overall prediction performance and the relative importance of individual markers. Finite sample performance is evaluated using simulation studies and two public data sets.
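
A hedged sketch of an SAUC-type estimator: the pairwise indicator in the empirical AUC of a linear marker score is replaced by a sigmoid and maximized over the weights, normalized to unit length for identifiability. The smoothing constant and optimizer are assumptions, not the paper's choices.

```python
# Hedged sketch of sigmoid-AUC maximization for a linear marker combination.
import numpy as np
from scipy.optimize import minimize

def neg_sauc(beta, X1, X0, h=0.1):
    """Negative smoothed AUC; X1: cases, X0: controls, h: smoothing."""
    beta = beta / np.linalg.norm(beta)
    diff = X1 @ beta - (X0 @ beta)[:, None]   # all case-minus-control scores
    return -np.mean(1.0 / (1.0 + np.exp(-diff / h)))

def fit_sauc(X, y):
    """Maximize the sigmoid AUC over marker weights, ||beta|| = 1."""
    X1, X0 = X[y == 1], X[y == 0]
    res = minimize(neg_sauc, np.ones(X.shape[1]), args=(X1, X0),
                   method="Nelder-Mead")
    return res.x / np.linalg.norm(res.x)
```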

9.
One goal of cluster analysis is to sort characteristics into groups (clusters) so that those in the same group are more highly correlated to each other than they are to those in other groups. An example is the search for groups of genes whose expression of RNA is correlated in a population of patients. These genes would be of greater interest if their common level of RNA expression were additionally predictive of the clinical outcome. This issue arose in the context of a study of trauma patients on whom RNA samples were available. The question of interest was whether there were groups of genes that were behaving similarly, and whether each gene in the cluster would have a similar effect on who would recover. For this, we develop an algorithm to simultaneously assign characteristics (genes) into groups of highly correlated genes that have the same effect on the outcome (recovery). We propose a random effects model where the genes within each group (cluster) equal the sum of a random effect, specific to the observation and cluster, and an independent error term. The outcome variable is a linear combination of the random effects of each cluster. To fit the model, we implement a Markov chain Monte Carlo algorithm based on the likelihood of the observed data. We evaluate the effect of including outcome in the model through simulation studies and describe a strategy for prediction. These methods are applied to trauma data from the Inflammation and Host Response to Injury research program, revealing a clustering of the genes that are informed by the recovery outcome.

10.
Recent interest in cancer research focuses on predicting patients' survival by investigating gene expression profiles based on microarray analysis. We propose a doubly penalized Buckley-James method for the semiparametric accelerated failure time model to relate high-dimensional genomic data to censored survival outcomes, which uses the elastic-net penalty, a mixture of $L_1$- and $L_2$-norm penalties. Similar to the elastic-net method for a linear regression model with uncensored data, the proposed method performs automatic gene selection and parameter estimation, where highly correlated genes can be selected (or removed) together. The two-dimensional tuning parameter is determined by generalized cross-validation. The proposed method is evaluated by simulations and applied to the Michigan squamous cell lung carcinoma study.
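
A hedged sketch of a Buckley-James iteration with an elastic-net working regression: censored responses are replaced by conditional expectations computed from a Kaplan-Meier estimate of the residual distribution, and the penalized fit is repeated. Tuning constants here are illustrative, not GCV-selected as in the paper.

```python
# Hedged sketch of Buckley-James imputation + elastic-net refitting.
import numpy as np
from sklearn.linear_model import ElasticNet

def bj_impute(y, delta, eta):
    """Replace censored (log-)times by E[Y | Y > c, X] via a Kaplan-Meier
    estimate of the residual distribution. y: float times, delta: 1=event."""
    res = y - eta
    order = np.argsort(res)
    r, d = res[order], delta[order]
    at_risk = len(r) - np.arange(len(r))
    surv = np.cumprod(1.0 - d / at_risk)              # KM survival of residuals
    jump = np.concatenate(([1.0], surv[:-1])) - surv  # KM mass at event points
    ystar = y.copy()
    for i in np.where(delta == 0)[0]:
        mask = r > res[i]
        w = jump[mask]
        if w.sum() > 0:                               # leftover tail mass: keep y
            ystar[i] = eta[i] + (w * r[mask]).sum() / w.sum()
    return ystar

def bj_elastic_net(X, y, delta, n_iter=20, alpha=0.1, l1_ratio=0.5):
    """Iterate: impute censored responses, refit the elastic net."""
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
    eta = np.full_like(y, y[delta == 1].mean())
    for _ in range(n_iter):
        model.fit(X, bj_impute(y, delta, eta))
        eta = model.predict(X)
    return model
```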

11.
In the analysis of binary response data from many types of large studies, the data are likely to have arisen from multiple centers, resulting in a within-center correlation for the response. Such correlation, or clustering, occurs when outcomes within centers tend to be more similar to each other than to outcomes in other centers. In studies where there is also variability among centers with respect to the exposure of interest, analysis of the exposure-outcome association may be confounded, even after accounting for within-center correlations. We apply several analytic methods to compare the risk of major complications associated with two strategies, staged and combined procedures, for performing percutaneous transluminal coronary angioplasty (PTCA), a mechanical means of relieving blockage of blood vessels due to atherosclerosis. Combined procedures are used in some centers as a cost-cutting strategy. We performed a number of population-averaged and cluster-specific (conditional) analyses, which (a) make no adjustments for center effects of any kind; (b) make adjustments for the effect of center on only the response; or (c) make adjustments for both the effect of center on the response and the relationship between center and exposure. The method used for this third approach decomposes the procedure-type variable into within-center and among-center components, resulting in two odds ratio estimates. The naive analysis, ignoring clusters, gave a highly significant effect of procedure type (OR = 1.6). Population-averaged models, adjusting only for the effect of center on the response, gave marginally to highly nonsignificant estimates of the OR for procedure type, ranging from 1.6 to 1.2; these results depended on the assumed correlation structure. Conditional (cluster-specific) models and other methods that decomposed the procedure-type variable into among- and within-center components all found no within-center effect of procedure type (OR = 1.02, consistently) and a considerable among-center effect. This among-center variability in outcomes was related to the proportion of patients who received combined procedures and was found even when conditioned on procedure type (within-center) and other patient- and center-level covariates. This example illustrates the importance of addressing the potential for center effects to confound an outcome-exposure association when average exposure varies across clusters. While conditional approaches provide estimates of the within-cluster effect, they do not provide information about among-center effects. We recommend using the decomposition approach, as it provides both types of estimates.
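
A minimal sketch of the recommended decomposition: the exposure is split into its center mean (among-center component) and the within-center deviation, and both enter one logistic model, yielding the two odds ratio estimates. Column names are hypothetical.

```python
# Hedged sketch: decompose exposure into among- and within-center parts.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def decomposed_logit(df, outcome, exposure, center):
    """df: patient-level DataFrame; returns among- and within-center ORs."""
    df = df.copy()
    df["xbar"] = df.groupby(center)[exposure].transform("mean")  # among-center
    df["xdev"] = df[exposure] - df["xbar"]                       # within-center
    X = sm.add_constant(df[["xbar", "xdev"]])
    fit = sm.GLM(df[outcome], X, family=sm.families.Binomial()).fit()
    return np.exp(fit.params[["xbar", "xdev"]])  # the two odds ratios
```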

12.

Background  

The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate whether high dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance.
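
As a hedged illustration of two standard remedies examined in this literature, the sketch below compares an unadjusted logistic classifier with class-weighted fitting and with random undersampling of the majority class, scored by balanced accuracy so the minority class is not ignored:

```python
# Hedged sketch: class-weighting and undersampling for imbalanced data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score

def undersample(X, y, seed=0):
    """Randomly downsample the majority class to the minority size."""
    rng = np.random.default_rng(seed)
    minority = y == (1 if (y == 1).sum() < (y == 0).sum() else 0)
    keep = rng.choice(np.where(~minority)[0], size=minority.sum(),
                      replace=False)
    idx = np.concatenate([np.where(minority)[0], keep])
    return X[idx], y[idx]

def compare(Xtr, ytr, Xte, yte):
    Xu, yu = undersample(Xtr, ytr)
    fits = {
        "plain": (Xtr, ytr, LogisticRegression(max_iter=1000)),
        "weighted": (Xtr, ytr, LogisticRegression(class_weight="balanced",
                                                  max_iter=1000)),
        "undersampled": (Xu, yu, LogisticRegression(max_iter=1000)),
    }
    for name, (X, y, m) in fits.items():
        m.fit(X, y)
        print(name, balanced_accuracy_score(yte, m.predict(Xte)))
```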

13.
We study the problem of selecting control clones in DNA array hybridization experiments. The problem arises in the OFRG method for analyzing microbial communities. The OFRG method performs classification of rRNA gene clones using binary fingerprints created from a series of hybridization experiments, where each experiment consists of hybridizing a collection of arrayed clones with a single oligonucleotide probe. This experiment produces analog signals, one for each clone, which then need to be classified, that is, converted into binary values 1 and 0 that represent hybridization and non-hybridization events. In addition to the sample rRNA gene clones, the array contains a number of control clones needed to calibrate the classification procedure of the hybridization signals. These control clones must be selected with care to optimize the classification process. We formulate this as a combinatorial optimization problem called Balanced Covering. We prove that the problem is NP-hard, and we show some results on hardness of approximation. We propose approximation algorithms based on randomized rounding, and we show that, with high probability, our algorithms approximate well the optimum solution. The experimental results confirm that the algorithms find high quality control clones. The algorithms have been implemented and are publicly available as part of the software package called CloneTools.
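
A hedged sketch of the randomized-rounding idea for a balanced-covering-style selection, assuming the goal is to pick about k control clones so that, for each probe, the selected hybridizing and non-hybridizing clones are roughly balanced; the paper's exact objective may differ.

```python
# Hedged sketch: LP relaxation + randomized rounding for balanced selection.
import numpy as np
from scipy.optimize import linprog

def balanced_cover(H, k, seed=0):
    """H: clones x probes 0/1 matrix. Solve the LP relaxation minimizing the
    worst probe imbalance |H^T x - k/2| subject to sum(x) = k, 0 <= x <= 1,
    then round clone i to selected with probability x_i."""
    n, m = H.shape
    c = np.concatenate([np.zeros(n), [1.0]])          # variables: x, t
    A_ub = np.vstack([np.hstack([H.T, -np.ones((m, 1))]),    #  H^T x - t <= k/2
                      np.hstack([-H.T, -np.ones((m, 1))])])  # -H^T x - t <= -k/2
    b_ub = np.concatenate([np.full(m, k / 2), np.full(m, -k / 2)])
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[k],
                  bounds=[(0, 1)] * n + [(0, None)])
    rng = np.random.default_rng(seed)
    return rng.random(n) < res.x[:n]                  # boolean selection vector
```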

14.
Dropouts are common in longitudinal studies. If the dropout probability depends on the missing observations at or after dropout, this type of dropout is called informative (or nonignorable) dropout (ID). Failure to accommodate such a dropout mechanism in the model will bias the parameter estimates. We propose a conditional autoregressive model for longitudinal binary data with an ID model such that the probabilities of positive outcomes, as well as the dropout indicator, at each occasion are logit-linear in some covariates and outcomes. This model, which adopts a marginal model for outcomes and a conditional model for dropout, is called a selection model. To allow for heterogeneity and clustering effects, the outcome model is extended to incorporate mixture and random effects. Lastly, the model is further extended to one that models the outcome and dropout jointly, such that their dependency is formulated through an odds ratio function. Parameters are estimated by a Bayesian approach implemented using the user-friendly Bayesian software WinBUGS. A methadone clinic dataset is analyzed to illustrate the proposed models. Results show that the treatment-time effect is still significant but weaker after allowing for an ID process in the data. Finally, the effect of dropout on parameter estimates is evaluated through simulation studies.

15.
Current advances in next-generation sequencing techniques have allowed researchers to conduct comprehensive research on the microbiome and human diseases, with recent studies identifying associations between the human microbiome and health outcomes for a number of chronic conditions. However, microbiome data structure, characterized by sparsity and skewness, presents challenges to building effective classifiers. To address this, we present an innovative approach for distance-based classification using mixture distributions (DCMD). The method aims to improve classification performance using microbiome community data, where the predictors are composed of sparse and heterogeneous count data. This approach models the inherent uncertainty in sparse counts by estimating a mixture distribution for the sample data and representing each observation as a distribution, conditional on observed counts and the estimated mixture, which are then used as inputs for distance-based classification. The method is implemented into a k-means classification and k-nearest neighbours framework. We develop two distance metrics that produce optimal results. The performance of the model is assessed using simulated and human microbiome study data, with results compared against a number of existing machine learning and distance-based classification approaches. The proposed method is competitive when compared to the other machine learning approaches, and shows a clear improvement over commonly used distance-based classifiers, underscoring the importance of modelling sparsity for achieving optimal results. The range of applicability and robustness make the proposed method a viable alternative for classification using sparse microbiome count data. The source code is available at https://github.com/kshestop/DCMD for academic use.
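
A simplified, hedged sketch in the same distance-based spirit (not the full DCMD mixture modelling): sparse counts are smoothed into probability vectors and classified by k-nearest neighbours under the Jensen-Shannon distance.

```python
# Hedged sketch: distance-based classification of sparse count profiles.
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.neighbors import KNeighborsClassifier

def smooth_proportions(counts, pseudo=0.5):
    """Pseudocount-smoothed taxon proportions; shrinks the many zeros."""
    p = counts + pseudo
    return p / p.sum(axis=1, keepdims=True)

def js_knn(train_counts, labels, test_counts, k=5):
    """k-NN with a distributional distance; brute force allows a callable."""
    knn = KNeighborsClassifier(n_neighbors=k, metric=jensenshannon,
                               algorithm="brute")
    knn.fit(smooth_proportions(train_counts), labels)
    return knn.predict(smooth_proportions(test_counts))
```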

16.
On Monitoring Outcomes of Medical Providers
An issue of substantial importance is the monitoring and improvement of health care facilities such as hospitals, nursing homes, dialysis units or surgical wards. In addressing this, there is a need for appropriate methods for monitoring health outcomes. On the one hand, statistical tools are needed to aid centers in instituting and evaluating quality improvement programs and, on the other hand, to aid overseers and payers in identifying and addressing sub-standard performance. In the latter case, the aim is to identify situations where there is evidence that the facility’s outcomes are outside of normal expectations; such facilities would be flagged and perhaps audited for potential difficulties or censured in some way. Methods in use are based on models where the center effects are taken as fixed or random. We take a systematic approach to assessing the merits of these methods when the patient outcome of interest arises from a linear model. We argue that methods based on fixed effects are more appropriate for the task of identifying extreme outcomes by providing better accuracy when the true facility effect is far from that of the average facility and avoiding confounding issues that arise in the random effects models when the patient risks are correlated with facility effects. Finally, we consider approaches to flagging that are based on the Z-statistics arising from the fixed effects model, but which account in a robust way for the intrinsic variation between facilities as contemplated in the standard random effects model. We provide an illustration in monitoring survival outcomes of dialysis facilities in the US.
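
A minimal sketch of a flagging rule in this spirit: fixed-effect Z-statistics per facility, rescaled by a robust (MAD-based) estimate of between-facility spread so that ordinary facility-to-facility variation does not trigger flags. The threshold and the robust scaling are illustrative assumptions.

```python
# Hedged sketch: robust flagging of extreme facilities from fixed effects.
import numpy as np

def flag_facilities(effect, se, z_cut=3.0):
    """effect, se: per-facility fixed-effect estimate and standard error
    (e.g., risk-adjusted outcome differences from a linear model)."""
    z = effect / se
    mad = np.median(np.abs(z - np.median(z))) * 1.4826  # robust SD of the Zs
    scale = max(mad, 1.0)   # never shrink below the N(0,1) sampling scale
    z_robust = (z - np.median(z)) / scale
    return np.where(np.abs(z_robust) > z_cut)[0]        # indices to audit
```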

17.
Lin X, Ryan L, Sammel M, Zhang D, Padungtod C, Xu X. Biometrics 2000, 56(2):593-601.
We propose a scaled linear mixed model to assess the effects of exposure and other covariates on multiple continuous outcomes. The most general form of the model allows a different exposure effect for each outcome. An important special case is a model that represents the exposure effects using a common global measure that can be characterized in terms of effect sizes. Correlations among different outcomes within the same subject are accommodated using random effects. We develop two approaches to model fitting, including the maximum likelihood method and the working parameter method. A key feature of both methods is that they can be easily implemented by repeatedly calling software for fitting standard linear mixed models, e.g., SAS PROC MIXED. Compared to the maximum likelihood method, the working parameter method is easier to implement and yields fully efficient estimators of the parameters of interest. We illustrate the proposed methods by analyzing data from a study of the effects of occupational pesticide exposure on semen quality in a cohort of Chinese men.
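
A hedged sketch of the repeated-standard-software idea, using statsmodels MixedLM in place of SAS PROC MIXED: outcomes are standardized, stacked in long format, and a common exposure effect size is estimated with a subject random intercept. The working-parameter iteration itself is not reproduced.

```python
# Hedged sketch: a common exposure effect size across scaled outcomes,
# fit with a standard linear-mixed-model routine.
import pandas as pd
import statsmodels.formula.api as smf

def global_exposure_effect(df_wide, outcomes, exposure, subject):
    """df_wide: one row per subject; outcomes: list of outcome columns."""
    long = df_wide.melt(id_vars=[subject, exposure], value_vars=outcomes,
                        var_name="outcome", value_name="y")
    # standardize each outcome so the shared slope reads as an effect size
    long["y"] = long.groupby("outcome")["y"].transform(
        lambda s: (s - s.mean()) / s.std())
    model = smf.mixedlm(f"y ~ {exposure} + C(outcome)", long, groups=subject)
    return model.fit().params[exposure]
```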

18.
We present a novel application of methods for analysis of high-dimensional longitudinal data to a comparison of facial shape over time between babies with cleft lip and palate and similarly aged controls. A pairwise methodology is used that was introduced in Fieuws and Verbeke (2006) in order to apply a linear mixed-effects model to data of high dimensions, such as those describing facial shape. The approach involves fitting bivariate linear mixed-effects models to all the pairwise combinations of responses, where the latter result from the individual coordinate positions, and aggregating the results across repeated parameter estimates (such as the random-effects variance for a particular coordinate). We describe one example using landmarks and another using facial curves from the cleft lip study, the latter using B-splines to provide an efficient parameterization. The results are presented in two dimensions, both in the profile and in the frontal views, with bivariate confidence intervals for the mean position of each landmark or curve, allowing objective assessment of significant differences in particular areas of the face between the two groups. Model comparison is performed using Wald and pseudolikelihood ratio tests.
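
A hedged sketch of the pairwise strategy: each pair of coordinate responses is stacked in long format and fit as a bivariate linear mixed model with a 2×2 random-effect covariance, and each covariate slope is averaged over the pairs in which it appears. Column handling is illustrative, not the authors' implementation.

```python
# Hedged sketch of pairwise bivariate linear mixed models.
import itertools
import numpy as np
import pandas as pd
import statsmodels.api as sm

def pairwise_lmm(df, responses, subject, covariate):
    """Average the per-response covariate slope over all pairwise fits."""
    slope = {r: [] for r in responses}
    for a, b in itertools.combinations(responses, 2):
        long = df.melt(id_vars=[subject, covariate], value_vars=[a, b],
                       var_name="resp", value_name="y")
        dummies = pd.get_dummies(long["resp"]).astype(float)[[a, b]]
        # fixed effects: outcome-specific intercepts and covariate slopes
        X = np.column_stack([dummies, dummies.mul(long[covariate], axis=0)])
        fit = sm.MixedLM(long["y"].to_numpy(), X, groups=long[subject],
                         exog_re=dummies.to_numpy()).fit()
        for j, r in enumerate((a, b)):
            slope[r].append(np.asarray(fit.fe_params)[2 + j])
    return {r: float(np.mean(v)) for r, v in slope.items()}
```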

19.
Classifying monoclonal antibodies, based on the similarity of their binding to the proteins (antigens) on the surface of blood cells, is essential for progress in immunology, hematology and clinical medicine. The collaborative efforts of researchers from many countries have led to the classification of thousands of antibodies into 247 clusters of differentiation (CD). Classification is based on flow cytometry and biochemical data. In preliminary classifications of antibodies based on flow cytometry data, the object requiring classification (an antibody) is described by a set of random samples from unknown densities of fluorescence intensity. An individual sample is collected in the experiment, where a population of cells of a certain type is stained by the identical fluorescently marked replicates of the antibody of interest. Samples are collected for multiple cell types. The classification problems of interest include identifying new CDs (class discovery or unsupervised learning) and assigning new antibodies to the known CD clusters (class prediction or supervised learning). These problems have attracted limited attention from statisticians. We recommend a novel approach to the classification process in which a computer algorithm suggests to the analyst the subset of the "most appropriate" classifications of an antibody in class prediction problems or the "most similar" pairs/groups of antibodies in class discovery problems. The suggested algorithm speeds up the analysis of flow cytometry data by a factor of 10–20. This allows the analyst to focus on the interpretation of the automatically suggested preliminary classification solutions and on planning the subsequent biochemical experiments.
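
A hedged sketch of the suggestion step, assuming each antibody is summarized by fluorescence samples per cell type and compared by the mean two-sample Kolmogorov-Smirnov statistic; the abstract does not specify the actual similarity measure, so this is one plausible choice.

```python
# Hedged sketch: suggest the most similar known antibodies to a query.
import numpy as np
from scipy.stats import ks_2samp

def antibody_distance(samples_a, samples_b):
    """samples_*: dict mapping cell type -> 1-d array of intensities;
    distance = mean KS statistic over the shared cell types."""
    shared = samples_a.keys() & samples_b.keys()
    return np.mean([ks_2samp(samples_a[t], samples_b[t]).statistic
                    for t in shared])

def suggest(query, catalogue, k=5):
    """catalogue: dict of antibody name -> samples dict; returns the k
    nearest candidates for the analyst to review."""
    d = {name: antibody_distance(query, s) for name, s in catalogue.items()}
    return sorted(d, key=d.get)[:k]
```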

20.