Similar References
20 similar references found (search time: 46 ms)
1.
Pan Z  Lin DY 《Biometrics》2005,61(4):1000-1009
We develop graphical and numerical methods for checking the adequacy of generalized linear mixed models (GLMMs). These methods are based on the cumulative sums of residuals over covariates or predicted values of the response variable. Under the assumed model, the asymptotic distributions of these stochastic processes can be approximated by certain zero-mean Gaussian processes, whose realizations can be generated through Monte Carlo simulation. Each observed process can then be compared, both visually and analytically, to a number of realizations simulated from the null distribution. These comparisons enable one to assess objectively whether the observed residual patterns reflect model misspecification or random variation. The proposed methods are particularly useful for checking the functional form of a covariate or the link function. Extensive simulation studies show that the proposed goodness-of-fit tests have proper sizes and are sensitive to model misspecification. Applications to two medical studies lead to improved models.

2.
Lin DY  Wei LJ  Ying Z 《Biometrics》2002,58(1):1-12
Residuals have long been used for graphical and numerical examinations of the adequacy of regression models. Conventional residual analysis based on the plots of raw residuals or their smoothed curves is highly subjective, whereas most numerical goodness-of-fit tests provide little information about the nature of model misspecification. In this paper, we develop objective and informative model-checking techniques by taking the cumulative sums of residuals over certain coordinates (e.g., covariates or fitted values) or by considering some related aggregates of residuals, such as moving sums and moving averages. For a variety of statistical models and data structures, including generalized linear models with independent or dependent observations, the distributions of these stochastic processes under the assumed model can be approximated by the distributions of certain zero-mean Gaussian processes whose realizations can be easily generated by computer simulation. Each observed process can then be compared, both graphically and numerically, with a number of realizations from the Gaussian process. Such comparisons enable one to assess objectively whether a trend seen in a residual plot reflects model misspecification or natural variation. The proposed techniques are particularly useful in checking the functional form of a covariate and the link function. Illustrations with several medical studies are provided.
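The cumulative-residual checks described in entries 1 and 2 can be illustrated with a toy example. The sketch below is a deliberately simplified version: it fits an ordinary linear model, cumulates the raw residuals over the ordered covariate, and compares the supremum of the observed process with realizations obtained by multiplying the residuals by independent standard-normal variables. It omits the correction for parameter estimation that the published procedures include, and all data and parameter values are invented, so treat it as a conceptual demonstration rather than the authors' method.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
x = rng.uniform(0.0, 2.0, n)
y = 1.0 + 0.5 * x**2 + rng.normal(0.0, 0.3, n)   # true relationship is quadratic

# Fit a (deliberately misspecified) linear model y ~ 1 + x by least squares.
X = np.column_stack([np.ones(n), x])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta

# Observed cumulative-residual process over the ordered covariate.
order = np.argsort(x)
W_obs = np.cumsum(resid[order]) / np.sqrt(n)
sup_obs = np.max(np.abs(W_obs))

# Null realizations: perturb residuals with independent N(0,1) multipliers.
n_sim = 1000
sup_null = np.empty(n_sim)
for b in range(n_sim):
    g = rng.normal(size=n)
    sup_null[b] = np.max(np.abs(np.cumsum((resid * g)[order]) / np.sqrt(n)))

p_value = np.mean(sup_null >= sup_obs)
print(f"sup|W| = {sup_obs:.3f}, Monte Carlo p-value = {p_value:.3f}")
```

A small p-value here signals a systematic trend in the residuals over the covariate, which is the kind of evidence of a misspecified functional form that the papers formalize.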

3.
In the development of structural equation models (SEMs), observed variables are usually assumed to be normally distributed. However, this assumption is likely to be violated in many practical research settings. As the non‐normality of observed variables in an SEM can arise from non‐normal latent variables, non‐normal residuals, or both, semiparametric modeling with an unknown distribution of the latent variables or an unknown distribution of the residuals is needed. In this article, we find that an SEM becomes nonidentifiable when both the latent variable distribution and the residual distribution are unknown. Hence, it is impossible to estimate reliably both the latent variable distribution and the residual distribution without parametric assumptions on one or the other. We also find that the residuals in the measurement equation are more sensitive to the normality assumption than the latent variables, and the negative impact on the estimation of parameters and distributions due to the non‐normality of residuals is more serious. Therefore, when there is no prior knowledge about parametric distributions for either the latent variables or the residuals, we recommend making parametric assumptions on the latent variables and modeling the residuals nonparametrically. We propose a semiparametric Bayesian approach using the truncated Dirichlet process with a stick-breaking prior to tackle the non‐normality of residuals in the measurement equation. Simulation studies and a real data analysis demonstrate our findings, and reveal the empirical performance of the proposed methodology. A free WinBUGS code to perform the analysis is available in Supporting Information.

4.
Dong B  Matthews DE 《Biometrics》2012,68(2):408-418
In medical studies, it is often of scientific interest to evaluate the treatment effect via the ratio of cumulative hazards, especially when those hazards may be nonproportional. To deal with nonproportionality in the Cox regression model, investigators usually assume that the treatment effect has some functional form. However, to do so may create a model misspecification problem because it is generally difficult to justify the specific parametric form chosen for the treatment effect. In this article, we employ empirical likelihood (EL) to develop a nonparametric estimator of the cumulative hazard ratio with covariate adjustment under two nonproportional hazard models, one that is stratified, as well as a less restrictive framework involving group-specific treatment adjustment. The asymptotic properties of the EL ratio statistic are derived in each situation and the finite-sample properties of EL-based estimators are assessed via simulation studies. Simultaneous confidence bands for all values of the adjusted cumulative hazard ratio in a fixed interval of interest are also developed. The proposed methods are illustrated using two different datasets concerning the survival experience of patients with non-Hodgkin's lymphoma or ovarian cancer.

5.
6.
There have been numerous claims in the ecological literature that spatial autocorrelation in the residuals of ordinary least squares (OLS) regression models results in shifts in the partial coefficients, which bias the interpretation of factors influencing geographical patterns. We evaluate the validity of these claims using gridded species richness data for the birds of North America, South America, Europe, Africa, the ex‐USSR, and Australia. We used richness in 110×110 km cells and environmental predictor variables to generate OLS and simultaneous autoregressive (SAR) multiple regression models for each region. Spatial correlograms of the residuals from each OLS model were then used to identify the minimum distance between cells necessary to avoid short‐distance residual spatial autocorrelation in each data set. This distance was used to subsample cells to generate spatially independent data. The partial OLS coefficients estimated with the full dataset were then compared to the distributions of coefficients created with the subsamples. For all 22 coefficients tested, we found that OLS coefficients generated from data containing residual spatial autocorrelation were statistically indistinguishable from coefficients generated from the same data sets in which short‐distance spatial autocorrelation was not present. Consistent with the statistical literature on this subject, we conclude that coefficients estimated from OLS regression are not seriously affected by the presence of spatial autocorrelation in gridded geographical data. Further, shifts in coefficients that occurred when using SAR tended to be correlated with levels of uncertainty in the OLS coefficients. Thus, shifts in the relative importance of the predictors between OLS and SAR models are expected when small‐scale patterns for these predictors create weaker and more unstable broad‐scale coefficients. Our results indicate both that OLS regression is unbiased and that differences between spatial and nonspatial regression models should be interpreted with an explicit awareness of spatial scale.
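As a rough, self-contained illustration of the subsampling comparison described above, the sketch below simulates a gridded response with spatially autocorrelated residuals, fits OLS on the full grid, and refits the model on spatially thinned subsamples. The grid size, covariance range, and thinning distance are arbitrary choices for the example, not values taken from the study.

```python
import numpy as np

rng = np.random.default_rng(1)
side = 25
gx, gy = np.meshgrid(np.arange(side), np.arange(side))
coords = np.column_stack([gx.ravel(), gy.ravel()]).astype(float)
n = coords.shape[0]   # 625 grid cells

# Environmental predictor: broad spatial gradient plus local noise.
env = 0.05 * coords[:, 0] + rng.normal(0.0, 0.2, n)

# Response with spatially autocorrelated residuals (exponential covariance).
d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
chol = np.linalg.cholesky(np.exp(-d / 5.0) + 1e-8 * np.eye(n))
richness = 2.0 + 1.5 * env + chol @ rng.normal(size=n)

def ols_slope(xv, yv):
    X = np.column_stack([np.ones(len(xv)), xv])
    return np.linalg.lstsq(X, yv, rcond=None)[0][1]

full_slope = ols_slope(env, richness)

# Spatial thinning: greedily keep cells at least `min_dist` apart, many times.
min_dist = 6.0
sub_slopes = []
for _ in range(100):
    keep = []
    for i in rng.permutation(n):
        if all(d[i, j] >= min_dist for j in keep):
            keep.append(i)
    sub_slopes.append(ols_slope(env[keep], richness[keep]))

lo, hi = np.percentile(sub_slopes, [2.5, 97.5])
print(f"full-grid OLS slope: {full_slope:.3f}")
print(f"95% range of slopes from thinned subsamples: ({lo:.3f}, {hi:.3f})")
```

If the full-grid slope falls inside the range produced by the spatially independent subsamples, the coefficient is behaving as the study reports: residual autocorrelation inflates apparent sample size but does not, by itself, shift the estimate.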

7.
Model checking for ROC regression analysis
Cai T  Zheng Y 《Biometrics》2007,63(1):152-163
Summary. The receiver operating characteristic (ROC) curve is a prominent tool for characterizing the accuracy of a continuous diagnostic test. To account for factors that might influence the test accuracy, various ROC regression methods have been proposed. However, as in any regression analysis, when the assumed models do not fit the data well, these methods may render invalid and misleading results. To date, practical model-checking techniques suitable for validating existing ROC regression models are not yet available. In this article, we develop cumulative residual-based procedures to graphically and numerically assess the goodness of fit for some commonly used ROC regression models, and show how specific components of these models can be examined within this framework. We derive asymptotic null distributions for the residual processes and discuss resampling procedures to approximate these distributions in practice. We illustrate our methods with a dataset from the cystic fibrosis registry.

8.
There has been a proliferation of studies aimed at predicting the distributions of species from environmental variables despite evidence that spatial interpolation or spatially‐constrained mechanistic models have comparable explanatory power. Moreover, the processes behind environmental and spatial correlations – and their interactions – remain elusive. Here, we examined geographic patterns in the amount of variation explained by environmental correlation and exogenous or endogenous spatial autocorrelation for 4423 terrestrial vertebrate species in Africa using variation partitioning analysis. We also tested the effects of range size and taxonomic class on the relative importance of environmental and spatial correlations, and contrasted empirical patterns to two environmentally‐neutral models to identify potential underlying environmental and spatial mechanisms. Results showed that geographic range size was associated with environmental and spatial variation components in ways that were qualitatively indistinguishable from environmentally‐neutral species with constrained dispersal, suggesting that proportions of variation are due to range cohesiveness rather than other ecological processes. As a consequence, large‐scale patterns of biodiversity should be studied cautiously due to the difficulty of obtaining evidence of causal mechanistic links between species distributions and spatio‐environmental gradients. However, we also uncovered ecologically‐meaningful patterns in the residuals of the relationship between range size and the respective variation components, which differed among vertebrate classes. Moreover, these patterns coincided with contemporary biogeographical regions. This study, therefore, demonstrates that it is possible to extract meaningful environmental and spatial associations that potentially link ecological and biogeographical processes.

9.
Multistate models can be successfully used for describing complex event history data, for example, describing stages in the disease progression of a patient. The so‐called "illness‐death" model plays a central role in the theory and practice of these models. Many time‐to‐event datasets from medical studies with multiple end points can be reduced to this generic structure. In these models, one important goal is the modeling of transition rates, but biomedical researchers are also interested in reporting interpretable results in a simple and summarized manner. These include estimates of predictive probabilities, such as the transition probabilities, occupation probabilities, cumulative incidence functions, and the sojourn time distributions. We will give a review of some of the available methods for estimating such quantities in the progressive illness‐death model conditionally (or not) on covariate measures. For some of these quantities estimators based on subsampling are employed. Subsampling, also referred to as landmarking, leads to small sample sizes and usually to heavily censored data, leading to estimators with higher variability. To overcome this issue, estimators based on a preliminary estimation (presmoothing) of the probability of censoring may be used. Among these, the presmoothed estimators for the cumulative incidences are new. We also introduce feasible estimation methods for the cumulative incidence function conditionally on covariate measures. The proposed methods are illustrated using real data. A comparative simulation study of several estimation approaches is performed and existing software in the form of R packages is discussed.

10.
Ecological theory suggests that spatial distribution of biodiversity is strongly driven by community assembly processes. Thus the study of diversity patterns combined with null model testing has become increasingly common to infer assembly processes from observed distributions of diversity indices. However, results in both empirical and simulation studies are inconsistent. The aim of our study is to determine with simulated data which facets of biodiversity, if any, may unravel the processes driving its spatial patterns, and to provide practical considerations about the combination of diversity indices that would produce significant and congruent signals when using null models. The study is based on simulated species' assemblages that emerge under various landscape structures in a spatially explicit individual‐based model with contrasting, predefined assembly processes. We focus on four assembly processes (species‐sorting, mass effect, neutral dynamics and competition colonization trade‐off) and investigate the emerging species' distributions with varied diversity indices (alpha, beta and gamma) measured at different spatial scales and for different diversity facets (taxonomic, functional and phylogenetic). We find that 1) the four assembly processes result in distinct spatial distributions of species under any landscape structure, 2) a broad range of diversity indices allows distinguishing between communities driven by different assembly processes, 3) null models provide congruent results only for a small fraction of diversity indices and 4) only a combination of these diversity indices allows identifying the correct assembly processes. Our study supports the inference of assembly processes from patterns of diversity only when different types of indices are combined. It highlights the need to combine phylogenetic, functional and taxonomic diversity indices at multiple spatial scales to effectively infer underlying assembly processes from diversity patterns by illustrating how the combination of different indices might help disentangle the complex question of coexistence.
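The null-model logic referred to above can be sketched in a few lines: an observed site-by-species matrix is repeatedly randomized and an observed diversity index is compared with its null distribution. The community matrix, the index (mean pairwise Jaccard dissimilarity as a beta-diversity measure), and the randomization scheme (shuffling each species' occurrences across sites) below are all illustrative stand-ins, not the simulation model or index set used in the study.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(2)
comm = (rng.random((10, 40)) < 0.3).astype(int)   # 10 sites x 40 species, presence/absence

def mean_jaccard_beta(m):
    """Mean pairwise Jaccard dissimilarity across sites."""
    vals = []
    for i, j in combinations(range(m.shape[0]), 2):
        a, b = m[i].astype(bool), m[j].astype(bool)
        union = np.sum(a | b)
        if union > 0:
            vals.append(1.0 - np.sum(a & b) / union)
    return float(np.mean(vals))

obs = mean_jaccard_beta(comm)

# Null model: permute each species' occurrences independently across sites.
null_vals = []
for _ in range(999):
    shuffled = comm.copy()
    for s in range(shuffled.shape[1]):
        shuffled[:, s] = rng.permutation(shuffled[:, s])
    null_vals.append(mean_jaccard_beta(shuffled))

null_vals = np.array(null_vals)
ses = (obs - null_vals.mean()) / null_vals.std()
p = (np.sum(null_vals >= obs) + 1) / (len(null_vals) + 1)
print(f"observed beta = {obs:.3f}, SES = {ses:.2f}, one-sided p = {p:.3f}")
```

The standardized effect size (SES) and p-value computed this way are the kind of single-index signals the study evaluates; its conclusion is that such signals are only reliable when several indices and facets are combined.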

11.
Yang Y  Degruttola V 《Biometrics》2008,64(2):329-336
Summary. Identifying genetic mutations that cause clinical resistance to antiretroviral drugs requires adjustment for potential confounders, such as the number of active drugs in an HIV-infected patient's regimen other than the one of interest. Motivated by this problem, we investigated resampling-based methods to test equal mean response across multiple groups defined by HIV genotype, after adjustment for covariates. We consider construction of test statistics and their null distributions under two types of model: parametric and semiparametric. The covariate function is explicitly specified in the parametric but not in the semiparametric approach. The parametric approach is more precise when models are correctly specified, but suffers from bias when they are not; the semiparametric approach is more robust to model misspecification, but may be less efficient. To help preserve type I error while also improving power in both approaches, we propose resampling approaches based on matching of observations with similar covariate values. Matching reduces the impact of model misspecification as well as imprecision in estimation. These methods are evaluated via simulation studies and applied to a data set that combines results from a variety of clinical studies of salvage regimens. Our focus is on relating HIV genotype to viral susceptibility to abacavir after adjustment for the number of active antiretroviral drugs (excluding abacavir) in the patient's regimen.

12.
We study the effect of delaying treatment in the presence of (unobserved) heterogeneity. In a homogeneous population and assuming a proportional treatment effect, a treatment delay period will result in notably lower cumulative recovery percentages. We show in theoretical scenarios using frailty models that if the population is heterogeneous, the effect of a delay period is much smaller. This can be explained by the selection process that is induced by the frailty. Patient groups that start treatment later have already undergone more selection. The marginal hazard ratio for the treatment will act differently in such a more homogeneous patient group. We further discuss modeling approaches for estimating the effect of treatment delay in the presence of heterogeneity, and compare their performance in a simulation study. The conventional Cox model that fails to account for heterogeneity overestimates the effect of treatment delay. Including interaction terms between treatment and starting time of treatment or between treatment and follow-up time gave no improvement. Estimating a frailty term can improve the estimation, but is sensitive to misspecification of the frailty distribution. Therefore, multiple frailty distributions should be used and the results should be compared using the Akaike Information Criterion. Non-parametric estimation of the cumulative recovery percentages can be considered if the dataset contains sufficient long-term follow-up for each of the delay strategies. The methods are demonstrated on a motivating application evaluating the effect of delaying the start of treatment with assisted reproductive techniques on time-to-pregnancy in couples with unexplained subfertility.
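A small simulation makes the selection mechanism described above concrete: with gamma-distributed frailty, individuals who have not yet recovered by a later treatment start time have, on average, lower frailty, so delayed-start groups are already selected and more homogeneous. The distributions and parameter values below are arbitrary illustrations, not those of the paper.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
frailty = rng.gamma(shape=2.0, scale=0.5, size=n)      # mean 1, variance 0.5
base_rate = 0.1                                        # baseline recovery hazard per month

# Time to (untreated) recovery: exponential with individual rate frailty * base_rate.
t_event = rng.exponential(1.0 / (frailty * base_rate))

for delay in (0, 6, 12):
    at_risk = t_event > delay                          # not yet recovered when treatment would start
    print(f"delay = {delay:2d} months: "
          f"still at risk = {at_risk.mean():.2f}, "
          f"mean frailty among those at risk = {frailty[at_risk].mean():.2f}")
```

The declining mean frailty among those still at risk is exactly why a conventional Cox model that ignores heterogeneity overstates the cost of a treatment delay.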

13.
Joint modeling of longitudinal data and survival data has been used widely for analyzing AIDS clinical trials, where a biological marker such as CD4 count measurement can be an important predictor of survival. In most of these studies, a normal distribution is used for modeling longitudinal responses, which leads to vulnerable inference in the presence of outliers in longitudinal measurements. Powerful distributions for robust analysis are normal/independent distributions, which include univariate and multivariate versions of the Student's t, the slash and the contaminated normal distributions in addition to the normal. In this paper, a linear mixed-effects model with normal/independent distributions for both random effects and residuals and Cox's model for survival time are used. For estimation, a Bayesian approach using Markov Chain Monte Carlo is adopted. Some simulation studies are performed for illustration of the proposed method. Also, the method is illustrated on a real AIDS data set and the best model is selected using some criteria.

14.
Xia  Yingcun 《Biometrika》2009,96(1):133-148
Lack-of-fit checking for parametric and semiparametric models is essential in reducing misspecification. The efficiency of most existing model-checking methods drops rapidly as the dimension of the covariates increases. We propose to check a model by projecting the fitted residuals along a direction that adapts to the systematic departure of the residuals from the desired pattern. Consistency of the method is proved for parametric and semiparametric regression models. A bootstrap implementation is also discussed. Simulation comparisons with several existing methods are made, suggesting that the proposed methods are more efficient than the existing methods when the dimension increases. Air pollution data from Chicago are used to illustrate the procedure.

15.
Aim
This paper reviews possible candidate models that may be used in theoretical modelling and empirical studies of species–area relationships (SARs). The SAR is an important and well‐proven tool in ecology. The power and the exponential functions are by far the models that are best known and most frequently applied to species–area data, but they might not be the most appropriate. Recent work indicates that the shape of species–area curves in arithmetic space is often not convex but sigmoid and also has an upper asymptote.
Methods
Characteristics of six convex and eight sigmoid models are discussed and interpretations of different parameters summarized. The convex models include the power, exponential, Monod, negative exponential, asymptotic regression and rational functions, and the sigmoid models include the logistic, Gompertz, extreme value, Morgan–Mercer–Flodin, Hill, Michaelis–Menten, Lomolino and Chapman–Richards functions plus the cumulative Weibull and beta‐P distributions.
Conclusions
There are two main types of species–area curves: sample curves that are inherently convex and isolate curves, which are sigmoid. Both types may have an upper asymptote. A few have attempted to fit convex asymptotic and/or sigmoid models to species–area data instead of the power or exponential models. Some of these or other models reviewed in this paper should be useful, especially if species–area models are to be based more on biological processes and patterns in nature than mere curve fitting. The negative exponential function is an example of a convex model and the cumulative Weibull distribution an example of a sigmoid model that should prove useful. A location parameter may be added to these two and some of the other models to simulate absolute minimum area requirements.
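As a brief illustration of fitting candidate species–area functions, the sketch below fits one convex model (the power function) and one sigmoid model (the cumulative Weibull) to fabricated species–area data with scipy.optimize.curve_fit. The data, starting values, and the simple residual-sum-of-squares comparison are illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_sar(area, c, z):
    """Convex model: S = c * A**z."""
    return c * area**z

def weibull_sar(area, d, c, z):
    """Sigmoid model (cumulative Weibull): S = d * (1 - exp(-c * A**z))."""
    return d * (1.0 - np.exp(-c * area**z))

rng = np.random.default_rng(4)
area = np.logspace(0, 4, 25)                           # 1 to 10,000 area units
richness = rng.poisson(120.0 * (1.0 - np.exp(-0.02 * area**0.6)))

p_pow, _ = curve_fit(power_sar, area, richness, p0=[5.0, 0.3], maxfev=10000)
p_wei, _ = curve_fit(weibull_sar, area, richness, p0=[100.0, 0.05, 0.5], maxfev=10000)

for name, f, p in [("power", power_sar, p_pow), ("Weibull", weibull_sar, p_wei)]:
    rss = np.sum((richness - f(area, *p))**2)
    print(f"{name:8s} parameters = {np.round(p, 3)}, RSS = {rss:.1f}")
```

Because the fabricated data saturate at large areas, the asymptotic Weibull form fits them better than the unbounded power law, mirroring the review's point that sigmoid, asymptotic models can be more appropriate than the traditional convex ones.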

16.
On reduced amino acid alphabets for phylogenetic inference
We investigate the use of Markov models of evolution for reduced amino acid alphabets or bins of amino acids. The use of reduced amino acid alphabets can ameliorate effects of model misspecification and saturation. We present algorithms for two different ways of automating the construction of bins: minimizing criteria based on properties of rate matrices and minimizing criteria based on properties of alignments. By simulation, we show that in the absence of model misspecification, the loss of information due to binning is insubstantial, and the use of Markov models at the binned level is almost as effective as the more appropriate missing-data approach. By applying these approaches to real data sets where compositional heterogeneity and/or saturation appear to be causing biased tree estimation, we find that binning can improve topological estimation in practice.
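A minimal sketch of the recoding step, assuming a commonly used six-class Dayhoff-style grouping of amino acids: the paper constructs its bins algorithmically, so the mapping and toy alignment below are purely illustrative.

```python
# Six-class Dayhoff-style grouping (one common choice; not the bins derived in the paper).
DAYHOFF6 = {
    "AGPST": "0",   # small / neutral
    "C":     "1",   # cysteine
    "DENQ":  "2",   # acid / amide
    "FWY":   "3",   # aromatic
    "HKR":   "4",   # basic
    "ILMV":  "5",   # hydrophobic
}
AA_TO_BIN = {aa: symbol for group, symbol in DAYHOFF6.items() for aa in group}

def recode(seq: str) -> str:
    """Map each residue to its bin symbol; leave gaps and unknown characters unchanged."""
    return "".join(AA_TO_BIN.get(aa, aa) for aa in seq.upper())

# Toy alignment, purely for illustration.
alignment = {"taxonA": "MKV-LDE", "taxonB": "MRV-IDE"}
for name, seq in alignment.items():
    print(name, recode(seq))
```

After recoding, a Markov model is fitted over the reduced six-state alphabet instead of the twenty amino acids, which is where the trade-off between information loss and robustness to saturation arises.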

17.
Longitudinal data can always be represented by a time series with a deterministic trend and randomly correlated residuals, the latter of which do not usually form a stationary process. The class of linear spectral models is a basis for the exploratory analysis of these data. The theory and techniques of factor analysis provide a means by which one component of the residual series can be separated from an error series, and then partitioned into a sum of randomly scaled metameters that characterize the sample paths of the residuals. These metameters, together with linear modelling techniques, are then used to partition the nonrandom trend into a determined component, which is associated with the sample paths of the residuals, and an independent inherent component. Linear spectral models are assumption-free and represent both random and nonrandom trends with fewer terms than any other mixed-effects linear model. Data on body-weight growth of juvenile mice are used in this paper to illustrate the application of linear spectral models, through a relatively sophisticated exploratory analysis.

18.

Background

The distribution of residual effects in linear mixed models in animal breeding applications is typically assumed normal, which makes inferences vulnerable to outlier observations. In order to mute the impact of outliers, one option is to fit models with residuals having a heavy-tailed distribution. Here, a Student's-t model was considered for the distribution of the residuals, with the degrees of freedom treated as unknown. Bayesian inference using Markov chain Monte Carlo methods was applied to a bivariate Student's-t (BSt) model in a simulation study, and analysis of field data on gestation length and birth weight permitted study of the practical implications of fitting heavy-tailed residual distributions in linear mixed models.

Methods

In the simulation study, bivariate residuals were generated using a Student's-t distribution with 4 or 12 degrees of freedom, or a normal distribution. Sire models with bivariate Student's-t or normal residuals were fitted to each simulated dataset using a hierarchical Bayesian approach. For the field data, consisting of gestation length and birth weight records on 7,883 Italian Piemontese cattle, a sire-maternal grandsire model including fixed effects of sex-age of dam and uncorrelated random herd-year-season effects was fitted using a hierarchical Bayesian approach. Residuals were defined to follow bivariate normal or Student's-t distributions with unknown degrees of freedom.

Results

Posterior mean estimates of degrees of freedom parameters seemed to be accurate and unbiased in the simulation study. Estimates of sire and herd variances were similar, if not identical, across fitted models. In the field data, there was strong support based on predictive log-likelihood values for the Student's-t error model. Most of the posterior density for degrees of freedom was below 4. Posterior means of direct and maternal heritabilities for birth weight were smaller in the Student's-t model than those in the normal model. Re-rankings of sires were observed between heavy-tailed and normal models.

Conclusions

Reliable estimates of degrees of freedom were obtained in all simulated heavy-tailed and normal datasets. The predictive log-likelihood was able to distinguish the correct model among the models fitted to heavy-tailed datasets. There was no disadvantage of fitting a heavy-tailed model when the true model was normal. Predictive log-likelihood values indicated that heavy-tailed models with low degrees of freedom values fitted gestation length and birth weight data better than a model with normally distributed residuals. Heavy-tailed and normal models resulted in different estimates of direct and maternal heritabilities, and different sire rankings. Heavy-tailed models may be more appropriate for reliable estimation of genetic parameters from field data.
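The Student's-t residual model above can be sketched through its scale-mixture representation: a t-distributed residual is a normal residual divided by the square root of a Gamma(nu/2, rate nu/2) mixing variable, which is also how such residuals are typically simulated or augmented in MCMC. The parameter values below are illustrative; this is not the paper's bivariate sire model.

```python
import numpy as np

rng = np.random.default_rng(5)
n, nu, sigma = 100_000, 4.0, 1.0

# Mixing variable w ~ Gamma(nu/2, rate nu/2), so E[w] = 1; then z / sqrt(w) is t with nu df.
w = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)
resid_t = rng.normal(0.0, sigma, size=n) / np.sqrt(w)
resid_norm = rng.normal(0.0, sigma, size=n)

# Heavier tails than the normal with the same scale parameter:
for label, r in [("normal", resid_norm), ("Student-t(4)", resid_t)]:
    print(f"{label:13s} P(|residual| > 3*sigma) = {np.mean(np.abs(r) > 3.0 * sigma):.4f}")
```

The much larger tail probability under the t(4) residuals illustrates why observations that look like outliers under a normal model are downweighted rather than allowed to distort variance components and sire rankings.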

19.
Huang X  Tebbs JM 《Biometrics》2009,65(3):710-718
Summary. We consider structural measurement error models for a binary response. We show that likelihood-based estimators obtained from fitting structural measurement error models with pooled binary responses can be far more robust to covariate measurement error in the presence of latent-variable model misspecification than the corresponding estimators from individual responses. Furthermore, despite the loss in information, pooling can provide improved parameter estimators in terms of mean-squared error. Based on these and other findings, we create a new diagnostic method to detect latent-variable model misspecification in structural measurement error models with individual binary response. We use simulation and data from the Framingham Heart Study to illustrate our methods.

20.
Although two-dimensional gel electrophoresis (2-DE) has long been a favorite experimental method to screen proteomes, its reproducibility is seldom analyzed with the assistance of quantitative error models. The lack of models of residual distributions that can be used to assign likelihood to differential expression reflects the difficulty in tackling the combined effect of variability in spot intensity and uncertain recognition of the same spot in different gels. In this report we have analyzed a series of four triplicate two-dimensional gels of chicken embryo heart samples at two distinct development stages to produce such a model of residual distribution. In order to achieve this reference error model, a nonparametric procedure for consistent spot intensity normalization had to be established, and is also reported here. In addition to variability in normalized intensity due to various sources, the residual variation between replicates was observed to be compounded by failure to identify the spot itself (gel alignment). The mixed effect is reflected by variably skewed bimodal density distributions of residuals. The extraction of a global error model that accommodated such distribution was achieved empirically by machine learning, specifically by bootstrapped artificial neural networks. The model described is being used to assign confidence values to observed variations in arbitrary 2-DE gels in order to quantify the degree of over-expression and under-expression of protein spots.
