Similar Documents
20 similar documents found (search time: 15 ms)
1.
Motivated by a clinical prediction problem, a simulation study was performed to compare different approaches for building risk prediction models. Robust prediction models for hospital survival in patients with acute heart failure were to be derived from three highly correlated blood parameters measured up to four times, with predictive ability having explicit priority over interpretability. Methods that relied only on the original predictors were compared with methods using an expanded predictor space including transformations and interactions. Predictors were simulated as transformations and combinations of multivariate normal variables which were fitted to the partly skewed and bimodally distributed original data in such a way that the simulated data mimicked the original covariate structure. Different penalized versions of logistic regression as well as random forests and generalized additive models were investigated using classical logistic regression as a benchmark. Their performance was assessed based on measures of predictive accuracy, model discrimination, and model calibration. Three different scenarios using different subsets of the original data with different numbers of observations and events per variable were investigated. In the investigated setting, where a risk prediction model should be based on a small set of highly correlated and interconnected predictors, Elastic Net and also Ridge logistic regression showed good performance compared to their competitors, while other methods did not lead to substantial improvements or even performed worse than standard logistic regression. Our work demonstrates how simulation studies that mimic relevant features of a specific data set can support the choice of a good modeling strategy.
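As a sketch of the comparison described above — not the study's actual implementation — the penalized logistic variants can be tuned jointly in scikit-learn, where l1_ratio = 0 recovers ridge, 1 recovers the lasso, and intermediate values give the elastic net; variable names are placeholders.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Penalized logistic regression over correlated predictors; the 'saga'
# solver supports the elastic-net penalty.
pipe = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="elasticnet", solver="saga", max_iter=5000),
)
grid = GridSearchCV(
    pipe,
    {"logisticregression__C": [0.01, 0.1, 1.0, 10.0],
     "logisticregression__l1_ratio": [0.0, 0.25, 0.5, 0.75, 1.0]},
    scoring="neg_brier_score",  # calibration-sensitive accuracy measure
    cv=5,
)
# grid.fit(X, y)  # X: (expanded) blood-parameter matrix, y: hospital survival
```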

2.
Aim: To investigate the impact of positional uncertainty in species occurrences on the predictions of seven commonly used species distribution models (SDMs), and explore its interaction with spatial autocorrelation in predictors. Methods: A series of artificial datasets covering 155 scenarios, comprising different combinations of five positional uncertainty scenarios and 31 spatial autocorrelation scenarios, were simulated. The level of positional uncertainty was defined by the standard deviation of a normally distributed zero-mean random variable. Each dataset included two environmental gradients (predictor variables) and one set of species occurrence sample points (response variable). Seven commonly used models were selected to develop SDMs: generalized linear models, generalized additive models, boosted regression trees, multivariate adaptive regression splines, random forests, genetic algorithm for rule-set production, and maximum entropy. A probabilistic approach was employed to model and simulate five levels of error in the species locations. To analyse the propagation of positional uncertainty, Monte Carlo simulation was applied to each scenario for each SDM. The models were evaluated for performance using simulated independent test data with Cohen's Kappa and the area under the receiver operating characteristic curve. Results: Positional uncertainty in species location led to a reduction in prediction accuracy for all SDMs, although the magnitude of the reduction varied between SDMs. In all cases the magnitude of this impact varied according to the degree of spatial autocorrelation in predictors and the levels of positional uncertainty. When the range of spatial autocorrelation in the predictors was larger than about three times the standard deviation of the positional error, the models were less affected by error and, consequently, had smaller decreases in prediction accuracy; when the autocorrelation range was smaller than this, prediction accuracy was low for all scenarios. Main conclusions: The potential impact of positional uncertainty in species occurrences on the predictions of SDMs can be understood by comparing it with the spatial autocorrelation range in predictor variables.
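A minimal sketch of the Monte Carlo error-propagation step, assuming a hypothetical env_lookup() that samples the predictor surfaces at given coordinates, with a random forest standing in for the seven SDMs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

def positional_error_mc(coords, y, env_lookup, X_test, y_test,
                        sigma, n_sim=100, seed=0):
    """Propagate normally distributed positional error (sd = sigma)
    through one SDM; returns mean and sd of test AUC over simulations."""
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_sim):
        jittered = coords + rng.normal(0.0, sigma, size=coords.shape)
        X = env_lookup(jittered)  # predictor values at the shifted points
        sdm = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
        aucs.append(roc_auc_score(y_test, sdm.predict_proba(X_test)[:, 1]))
    return np.mean(aucs), np.std(aucs)
```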

3.
Application of random effects to the study of resource selection by animals
1. Resource selection estimated by logistic regression is used increasingly in studies to identify critical resources for animal populations and to predict species occurrence. 2. Most frequently, individual animals are monitored and pooled to estimate population-level effects without regard to group or individual-level variation. Pooling assumes that both observations and their errors are independent, and resource selection is constant given individual variation in resource availability. 3. Although researchers have identified ways to minimize autocorrelation, variation between individuals caused by differences in selection or available resources, including functional responses in resource selection, has not been well addressed. 4. Here we review random-effects models and their application to resource selection modelling to overcome these common limitations. We present a simple case study of an analysis of resource selection by grizzly bears in the foothills of the Canadian Rocky Mountains with and without random effects. 5. Both categorical and continuous variables in the grizzly bear model differed in interpretation, both in statistical significance and coefficient sign, depending on how a random effect was included. We used a simulation approach to clarify the application of random effects under three common situations for telemetry studies: (a) discrepancies in sample sizes among individuals; (b) differences among individuals in selection where availability is constant; and (c) differences in availability with and without a functional response in resource selection. 6. We found that random intercepts accounted for unbalanced sample designs, and models with random intercepts and coefficients improved model fit given the variation in selection among individuals and functional responses in selection. Our empirical example and simulations demonstrate how including random effects in resource selection models can aid interpretation and address difficult assumptions limiting their generality. This approach will allow researchers to appropriately estimate marginal (population) and conditional (individual) responses, and account for complex grouping, unbalanced sample designs and autocorrelation.
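One accessible way to add a per-animal random intercept to a use-availability logistic model is statsmodels' Bayesian mixed GLM; the formula below is a hypothetical grizzly-bear example, not the paper's actual specification.

```python
from statsmodels.genmod.bayes_mixed_glm import BinomialBayesMixedGLM

# used: 1 for used locations, 0 for available; bear_id identifies animals.
# Random intercepts per bear; a pooled logistic fit ignores this grouping.
model = BinomialBayesMixedGLM.from_formula(
    "used ~ elevation + crown_closure + C(landcover)",
    vc_formulas={"bear": "0 + C(bear_id)"},  # per-individual intercepts
    data=df,
)
result = model.fit_vb()  # variational Bayes fit
print(result.summary())
```

Random coefficients (e.g., an individual-specific elevation slope) can be added with a further variance-component formula in the same style.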

4.
In biomedical research, the logistic regression model is the most commonly used method for predicting the probability of a binary outcome. While many clinical researchers have expressed an enthusiasm for regression trees, this method may have limited accuracy for predicting health outcomes. We aimed to evaluate the improvement that is achieved by using ensemble-based methods, including bootstrap aggregation (bagging) of regression trees, random forests, and boosted regression trees. We analyzed 30-day mortality in two large cohorts of patients hospitalized with either acute myocardial infarction (N = 16,230) or congestive heart failure (N = 15,848) in two distinct eras (1999–2001 and 2004–2005). We found that both the in-sample and out-of-sample prediction of ensemble methods offered substantial improvement in predicting cardiovascular mortality compared to conventional regression trees. However, conventional logistic regression models that incorporated restricted cubic smoothing splines had even better performance. We conclude that ensemble methods from the data mining and machine learning literature increase the predictive performance of regression trees, but may not lead to clear advantages over conventional logistic regression models for predicting short-term mortality in population-based samples of subjects with cardiovascular disease.
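The winning strategy — logistic regression with restricted cubic splines — can be sketched with patsy's natural cubic spline basis cr() inside a statsmodels formula; predictor names are placeholders for the AMI/CHF cohort variables.

```python
import statsmodels.formula.api as smf

# Restricted (natural) cubic splines for continuous predictors via cr();
# death30, age, sbp, creatinine, and sex are hypothetical column names.
fit = smf.logit(
    "death30 ~ cr(age, df=4) + cr(sbp, df=4) + cr(creatinine, df=4) + sex",
    data=df,
).fit()
print(fit.summary())
```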

5.
In data collection for predictive modeling, underrepresentation of certain groups, based on gender, race/ethnicity, or age, may yield less accurate predictions for these groups. Recently, this issue of fairness in predictions has attracted significant attention, as data-driven models are increasingly utilized to perform crucial decision-making tasks. Existing methods to achieve fairness in the machine learning literature typically build a single prediction model in a manner that encourages fair prediction performance for all groups. These approaches have two major limitations: (i) fairness is often achieved by compromising accuracy for some groups; (ii) the underlying relationship between dependent and independent variables may not be the same across groups. We propose a joint fairness model (JFM) approach for logistic regression models for binary outcomes that estimates group-specific classifiers using a joint modeling objective function that incorporates fairness criteria for prediction. We introduce an accelerated smoothing proximal gradient algorithm to solve the convex objective function, and present the key asymptotic properties of the JFM estimates. Through simulations, we demonstrate the efficacy of the JFM in achieving good prediction performance and across-group parity, in comparison with the single fairness model, group-separate model, and group-ignorant model, especially when the minority group's sample size is small. Finally, we demonstrate the utility of the JFM method in a real-world example to obtain fair risk predictions for underrepresented older patients diagnosed with coronavirus disease 2019 (COVID-19).
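A deliberately simplified sketch of the joint objective: group-specific logistic losses plus a fusion penalty that pulls coefficients across groups together. The published JFM solves a nonsmooth convex criterion with an accelerated smoothing proximal gradient algorithm; here a smooth squared-difference penalty is substituted so a generic BFGS optimizer suffices.

```python
import numpy as np
from scipy.optimize import minimize

def joint_fairness_sketch(X_groups, y_groups, lam=1.0):
    """Group-specific logistic coefficients tied together by a fusion
    penalty (smooth surrogate for the JFM's nonsmooth criterion)."""
    k, p = len(X_groups), X_groups[0].shape[1]

    def objective(beta_flat):
        B = beta_flat.reshape(k, p)
        nll = 0.0
        for Xg, yg, bg in zip(X_groups, y_groups, B):
            eta = Xg @ bg
            nll += np.sum(np.logaddexp(0.0, eta) - yg * eta)  # logistic loss
        fuse = sum(np.sum((B[i] - B[j]) ** 2)
                   for i in range(k) for j in range(i + 1, k))
        return nll + lam * fuse

    res = minimize(objective, np.zeros(k * p), method="BFGS")
    return res.x.reshape(k, p)  # one coefficient vector per group
```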

6.
Two random regression models, where the effect of a putative QTL was regressed on an environmental gradient, are described. The first model estimates the correlation between intercept and slope of the random regression, while the other model restricts this correlation to 1 or -1, which is expected under a bi-allelic QTL model. The random regression models were compared to a model assuming no gene-by-environment interactions. The comparison was done with regard to the models' ability to detect QTL, to position them accurately, and to detect possible QTL-by-environment interactions. A simulation study based on a granddaughter design was conducted, and QTL were simulated either by assigning an effect independent of the environment or as a linear function of a simulated environmental gradient. It was concluded that the random regression models were suitable for detection of QTL effects, in the presence and absence of interactions with environmental gradients. Fixing the correlation between intercept and slope of the random regression had a positive effect on power when the QTL effects re-ranked between environments.

7.
The term “effect” in additive genetic effect suggests a causal meaning. However, inferences of such quantities for selection purposes are typically viewed and conducted as a prediction task. Predictive ability as tested by cross-validation is currently the most acceptable criterion for comparing models and evaluating new methodologies. Nevertheless, it does not directly indicate if predictors reflect causal effects. Such evaluations would require causal inference methods that are not typical in genomic prediction for selection. This suggests that the usual approach to infer genetic effects contradicts the label of the quantity inferred. Here we investigate if genomic predictors for selection should be treated as standard predictors or if they must reflect a causal effect to be useful, requiring causal inference methods. Conducting the analysis as a prediction or as a causal inference task affects, for example, how covariates of the regression model are chosen, which may heavily affect the magnitude of genomic predictors and therefore selection decisions. We demonstrate that selection requires learning causal genetic effects. However, genomic predictors from some models might capture noncausal signal, providing good predictive ability but poorly representing true genetic effects. Simulated examples are used to show that aiming for predictive ability may lead to poor modeling decisions, while causal inference approaches may guide the construction of regression models that better infer the target genetic effect even when they underperform in cross-validation tests. In conclusion, genomic selection models should be constructed to aim primarily for identifiability of causal genetic effects, not for predictive ability.

8.
Several penalization approaches have been developed to identify homogeneous subgroups based on a regression model with subject-specific intercepts in subgroup analysis. These methods often apply concave penalty functions to pairwise comparisons of the intercepts, such that subjects with similar intercept values are assigned to the same group, which is very similar to the procedure of the penalization approaches for variable selection. Since Bayesian methods are commonly used in variable selection, it is worth considering the corresponding approaches to subgroup analysis in the Bayesian framework. In this paper, a Bayesian hierarchical model with appropriate prior structures is developed for the pairwise differences of intercepts based on a regression model with subject-specific intercepts, which can automatically detect and identify homogeneous subgroups. A Gibbs sampling algorithm is also provided to select the hyperparameter and estimate the intercepts and coefficients of the covariates simultaneously; this is computationally efficient for pairwise comparisons compared to the time-consuming parameter-estimation procedures of the penalization methods (e.g., the alternating direction method of multipliers) in the case of large sample sizes. The effectiveness and usefulness of the proposed Bayesian method are evaluated through simulation studies and analysis of the Cleveland Heart Disease dataset.

9.
Extrapolating landscape regression models for use in assessing vector-borne disease risk and other applications requires thoughtful evaluation of fundamental model choice issues. To examine implications of such choices, an analysis was conducted to explore the extent to which disparate landscape models agree in their epidemiological and entomological risk predictions when extrapolated to new regions. Agreement between six literature-drawn landscape models was examined by comparing predicted county-level distributions of either Lyme disease or Ixodes scapularis vector using Spearman ranked correlation. AUC analyses and multinomial logistic regression were used to assess the ability of these extrapolated landscape models to predict observed national data. Three models based on measures of vegetation, habitat patch characteristics, and herbaceous landcover emerged as effective predictors of observed disease and vector distribution. An ensemble model containing these three models improved precision and predictive ability over individual models. A priori assessment of qualitative model characteristics effectively identified models that subsequently emerged as better predictors in quantitative analysis. Both a methodology for quantitative model comparison and a checklist for qualitative assessment of candidate models for extrapolation are provided; both tools aim to improve collaboration between those producing models and those interested in applying them to new areas and research questions.
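The agreement and ensembling steps reduce to a few lines; preds and obs below are hypothetical arrays of model predictions and observed county-level status.

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.metrics import roc_auc_score

# preds: (n_models, n_counties) predicted risks from the six landscape
# models; obs: observed county-level disease/vector presence (0/1).
rho, _ = spearmanr(preds.T)                  # pairwise rank agreement matrix
aucs = np.array([roc_auc_score(obs, p) for p in preds])
top3 = np.argsort(aucs)[-3:]                 # three best-supported models
ensemble = preds[top3].mean(axis=0)          # simple unweighted ensemble
print(roc_auc_score(obs, ensemble))          # ensemble predictive ability
```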

10.
A working guide to boosted regression trees
1. Ecologists use statistical models for both explanation and prediction, and need techniques that are flexible enough to express typical features of their data, such as nonlinearities and interactions. 2. This study provides a working guide to boosted regression trees (BRT), an ensemble method for fitting statistical models that differs fundamentally from conventional techniques that aim to fit a single parsimonious model. Boosted regression trees combine the strengths of two algorithms: regression trees (models that relate a response to their predictors by recursive binary splits) and boosting (an adaptive method for combining many simple models to give improved predictive performance). The final BRT model can be understood as an additive regression model in which individual terms are simple trees, fitted in a forward, stagewise fashion. 3. Boosted regression trees incorporate important advantages of tree-based methods, handling different types of predictor variables and accommodating missing data. They have no need for prior data transformation or elimination of outliers, can fit complex nonlinear relationships, and automatically handle interaction effects between predictors. Fitting multiple trees in BRT overcomes the biggest drawback of single tree models: their relatively poor predictive performance. Although BRT models are complex, they can be summarized in ways that give powerful ecological insight, and their predictive performance is superior to most traditional modelling methods. 4. The unique features of BRT raise a number of practical issues in model fitting. We demonstrate the practicalities and advantages of using BRT through a distributional analysis of the short-finned eel (Anguilla australis Richardson), a native freshwater fish of New Zealand. We use a data set of over 13 000 sites to illustrate effects of several settings, and then fit and interpret a model using a subset of the data. We provide code and a tutorial to enable the wider use of BRT by ecologists.
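A rough scikit-learn translation of the guide's main tuning knobs (the tutorial itself provides R code built on the gbm package):

```python
from sklearn.ensemble import GradientBoostingClassifier

# Mapping of BRT settings: max_depth ~ tree complexity (interaction order),
# learning_rate ~ shrinkage, subsample ~ bag fraction (stochastic boosting).
brt = GradientBoostingClassifier(
    n_estimators=2000,    # many small steps...
    learning_rate=0.005,  # ...each strongly shrunk
    max_depth=3,          # allows up to three-way interactions
    subsample=0.5,        # bag fraction for stochastic boosting
)
# brt.fit(X, eel_presence)
# brt.staged_predict_proba(X_holdout) lets you pick the optimal number of
# trees on held-out data, as the tutorial recommends.
```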

11.
Sightability models are binary logistic-regression models used to estimate and adjust for visibility bias in wildlife-population surveys. Like many models in wildlife and ecology, sightability models are typically developed from small observational datasets with many candidate predictors. Aggressive model-selection methods are often employed to choose a best model for prediction and effect estimation, despite evidence that such methods can lead to overfitting (i.e., selected models may describe random error or noise rather than true predictor–response curves) and poor predictive ability. We used moose (Alces alces) sightability data from northeastern Minnesota (2005–2007) as a case study to illustrate an alternative approach, which we refer to as degrees-of-freedom (df) spending: sample-size guidelines are used to determine an acceptable level of model complexity and then a pre-specified model is fit to the data and used for inference. For comparison, we also constructed sightability models using Akaike's Information Criterion (AIC) step-down procedures and model averaging (based on a small set of models developed using df-spending guidelines). We used bootstrap procedures to mimic the process of model fitting and prediction, and to compute an index of overfitting, expected predictive accuracy, and model-selection uncertainty. The index of overfitting increased 13% when the number of candidate predictors was increased from three to eight and a best model was selected using step-down procedures. Likewise, model-selection uncertainty increased when the number of candidate predictors increased. Model averaging (based on R = 30 models with 1–3 predictors) effectively shrunk regression coefficients toward zero and produced similar estimates of precision to our 3-df pre-specified model. As such, model averaging may help to guard against overfitting when too many predictors are considered (relative to available sample size). The set of candidate models will influence the extent to which coefficients are shrunk toward zero, which has implications for how one might apply model averaging to problems traditionally approached using variable-selection methods. We often recommend the df-spending approach in our consulting work because it is easy to implement and it naturally forces investigators to think carefully about their models and predictors. Nonetheless, similar concepts should apply whether one is fitting 1 model or using multi-model inference. For example, model-building decisions should consider the effective sample size, and potential predictors should be screened (without looking at their relationship to the response) for missing data, narrow distributions, collinearity, potentially overly influential observations, and measurement errors (e.g., via logical error checks). © 2011 The Wildlife Society.
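A sketch of a bootstrap optimism index for a pre-specified (df-spending) model. The paper's full procedure also replays any step-down selection inside each bootstrap sample, which is omitted here; the sketch assumes both outcome classes are common enough that every resample contains both.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, n_boot=200, seed=1):
    """Harrell-style bootstrap optimism correction for a pre-specified
    logistic model (no variable selection inside the loop)."""
    rng = np.random.default_rng(seed)
    fit = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, fit.predict_proba(X)[:, 1])
    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))  # resample with replacement
        m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        auc_boot = roc_auc_score(y[idx], m.predict_proba(X[idx])[:, 1])
        auc_orig = roc_auc_score(y, m.predict_proba(X)[:, 1])
        optimism.append(auc_boot - auc_orig)   # overfitting index per resample
    return apparent - np.mean(optimism)
```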

12.
When using species distribution models to predict distributions of invasive species, we are faced with a trade-off between model realism, generality, and precision. Models are most applicable to the specific conditions under which they were developed and are typically not readily transferable to other situations. To better assist management of biological invasions, it is critical to know how to validate and improve model generality while maintaining good model precision and realism. We examined this issue with Bythotrephes longimanus to determine the importance of different models and datasets in understanding and predicting invasions. We developed models (linear discriminant analysis, multiple logistic regression, random forests, and artificial neural networks) on datasets with different sample sizes (315 or 179 lakes) and predictor information (environmental data with or without fish data), and evaluated them by cross-validation and several independent datasets. In cross-validation, models developed on the 315-lake environmental dataset performed better than those developed on the 179-lake environmental-and-fish dataset. The advantage of the larger dataset disappeared when models were tested on independent datasets. Predictions of the models were more diverse when developed on environmental conditions alone, and more consistent when fish data (especially diversity) were included. Random forests had relatively good and more stable performance than the other approaches when tested on independent datasets. Given the improvement in model transferability achieved in this study by including species occurrence or diversity information, incorporating biotic information in addition to environmental predictors may help develop more reliable models with better realism, generality, and precision.
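The evaluation contrast at the heart of the study — internal cross-validation versus truly independent data — looks like this in outline; X_train/y_train and the independent lakes (X_indep/y_indep) are placeholders.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score

# Internal cross-validation can flatter a model that transfers poorly;
# always confirm on independent lakes. X_train may hold environmental
# predictors alone or be augmented with fish occurrence/diversity.
rf = RandomForestClassifier(n_estimators=1000, random_state=0)
cv_auc = cross_val_score(rf, X_train, y_train, cv=10, scoring="roc_auc").mean()
indep_auc = roc_auc_score(
    y_indep, rf.fit(X_train, y_train).predict_proba(X_indep)[:, 1]
)
print(f"cross-validated AUC {cv_auc:.3f} vs independent AUC {indep_auc:.3f}")
```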

13.
Böhning D, Sarol J. Biometrics 2000, 56(1): 304-308.
In this paper, we consider the case of efficient estimation of the risk difference in a multicenter study allowing for baseline heterogeneity. We consider the optimally weighted estimator for the common risk difference and show that this estimator has considerable bias when the true weights (which are inversely proportional to the variances of the center-specific risk difference estimates) are replaced by their sample estimates. We also propose a new estimator of the Mantel-Haenszel type for this situation that is unbiased and has a smaller variance for small sample sizes within the study centers. Simulations illustrate these findings.
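For reference, the classical Mantel-Haenszel-type common risk difference (Greenland-Robins weights) that this family of estimators builds on can be computed as below; the paper's proposed estimator is of this type but not necessarily identical to the textbook version.

```python
import numpy as np

def mh_risk_difference(a, n1, c, n0):
    """Mantel-Haenszel-type common risk difference across strata (centers).
    a, c: event counts in treated/control; n1, n0: group sizes per stratum."""
    a, n1, c, n0 = map(np.asarray, (a, n1, c, n0))
    N = n1 + n0
    w = n1 * n0 / N                 # MH weights, stable in small strata
    rd = a / n1 - c / n0            # center-specific risk differences
    return np.sum(w * rd) / np.sum(w)
```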

14.
The application of species distribution models (SDMs) to areas outside of where a model was created allows informed decisions across large spatial scales, yet transferability remains a challenge in ecological modeling. We examined how regional variation in animal-environment relationships influenced model transferability for Canada lynx (Lynx canadensis), with an additional conservation aim of modeling lynx habitat across the northwestern United States. Simultaneously, we explored the effect of sample size from GPS data on SDM model performance and transferability. We used data from three geographically distinct Canada lynx populations in Washington (n = 17 individuals), Montana (n = 66), and Wyoming (n = 10) from 1996 to 2015. We assessed regional variation in lynx-environment relationships between these three populations using principal components analysis (PCA). We used ensemble modeling to develop SDMs for each population and all populations combined and assessed model prediction and transferability for each model scenario using withheld data and an extensive independent dataset (n = 650). Finally, we examined GPS data efficiency by testing models created with sample sizes of 5%–100% of the original datasets. PCA results indicated some differences in environmental characteristics between populations; models created from individual populations showed differential transferability based on the populations' similarity in PCA space. Despite population differences, a single model created from all populations performed as well as, or better than, each individual population. Model performance was mostly insensitive to GPS sample size, with a plateau in predictive ability reached at ~30% of the total GPS dataset when initial sample size was large. Based on these results, we generated well-validated spatial predictions of Canada lynx distribution across a large portion of the species' southern range, with precipitation and temperature the primary environmental predictors in the model. We also demonstrated substantial redundancy in our large GPS dataset, with predictive performance insensitive to sample sizes above 30% of the original.
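The GPS-efficiency analysis amounts to a learning curve over nested random subsets; a single random forest stands in here for the paper's multi-algorithm ensemble, and all names are placeholders.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Refit on random subsets of increasing size and look for the AUC plateau
# (~30% of locations in the lynx study when the full dataset was large).
rng = np.random.default_rng(0)
aucs = {}
for frac in np.arange(0.05, 1.01, 0.05):
    idx = rng.choice(len(y_train), size=int(frac * len(y_train)), replace=False)
    m = RandomForestClassifier(n_estimators=500, random_state=0)
    m.fit(X_train[idx], y_train[idx])
    aucs[round(frac, 2)] = roc_auc_score(y_test, m.predict_proba(X_test)[:, 1])
```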

15.
Multiple imputation (MI) is increasingly popular for handling multivariate missing data. Two general approaches are available in standard computer packages: MI based on the posterior distribution of incomplete variables under a multivariate (joint) model, and fully conditional specification (FCS), which imputes missing values using univariate conditional distributions for each incomplete variable given all the others, cycling iteratively through the univariate imputation models. In the context of longitudinal or clustered data, it is not clear whether these approaches result in consistent estimates of regression coefficient and variance component parameters when the analysis model of interest is a linear mixed effects model (LMM) that includes both random intercepts and slopes, and when either the covariates alone or both the covariates and the outcome contain missing information. In the current paper, we compared the performance of seven different MI methods for handling missing values in longitudinal and clustered data in the context of fitting LMMs with both random intercepts and slopes. We study the theoretical compatibility between specific imputation models fitted under each of these approaches and the LMM, and also conduct simulation studies in both the longitudinal and clustered data settings. Simulations were motivated by analyses of the association between body mass index (BMI) and quality of life (QoL) in the Longitudinal Study of Australian Children (LSAC). Our findings showed that the relative performance of MI methods varies according to whether the incomplete covariate has fixed or random effects and whether there is missingness in the outcome variable. We showed via simulation that compatible imputation and analysis models resulted in consistent estimation of both regression parameters and variance components. We illustrate our findings with the analysis of LSAC data.
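A sketch of one FCS-style MI pipeline followed by a random-intercept-and-slope LMM with Rubin's-rules pooling, using hypothetical LSAC-like column names. Note that imputing long-format rows this way ignores the clustering — precisely the kind of imputation/analysis incompatibility the paper investigates.

```python
import numpy as np
import statsmodels.formula.api as smf
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def fcs_mi_lmm(df, m=20):
    """FCS-like multiple imputation, then an LMM with random intercept
    and slope; pools the BMI coefficient with Rubin's rules."""
    ests, variances = [], []
    cols = ["qol", "bmi", "age"]            # hypothetical analysis variables
    for i in range(m):
        imp = IterativeImputer(sample_posterior=True, random_state=i)
        completed = df.copy()
        completed[cols] = imp.fit_transform(df[cols])
        fit = smf.mixedlm("qol ~ bmi + age", completed,
                          groups=completed["child_id"],
                          re_formula="~age").fit()  # random intercept + slope
        ests.append(fit.params["bmi"])
        variances.append(fit.bse["bmi"] ** 2)
    qbar = np.mean(ests)                     # pooled estimate
    total_var = np.mean(variances) + (1 + 1 / m) * np.var(ests, ddof=1)
    return qbar, np.sqrt(total_var)          # estimate and pooled SE
```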

16.
Ensembling combines the predictions made by individual component base models with the goal of achieving a predictive accuracy that is better than that of any one of the constituent member models. Diversity among the base models in terms of predictions is a crucial criterion in ensembling. However, there are practical instances when the available base models produce highly correlated predictions, because they may have been developed within the same research group or may have been built from the same underlying algorithm. We investigated, via a case study on Fusarium head blight (FHB) on wheat in the U.S., whether ensembles of simple yet highly correlated models for predicting the risk of FHB epidemics, all generated from logistic regression, provided any benefit to predictive performance, despite relatively low levels of base model diversity. Three ensembling methods were explored: soft voting, weighted averaging of smaller subsets of the base models, and penalized regression as a stacking algorithm. Soft voting and weighted model averages were generally better at classification than the base models, though not universally so. The performances of stacked regressions were superior to those of the other two ensembling methods we analyzed in this study. Ensembling simple yet correlated models is computationally feasible and is therefore worth pursuing for models of epidemic risk.
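Two of the three ensembling schemes map directly onto scikit-learn; base_models is a hypothetical list of (name, estimator) logistic models built from correlated weather predictors, and the L2-penalized stacker is one choice of penalized regression (the study's exact penalty may differ).

```python
from sklearn.ensemble import StackingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegressionCV

# base_models: [(name, LogisticRegression-like estimator), ...] differing
# only in their predictor sets, hence highly correlated predictions.
soft_vote = VotingClassifier(estimators=base_models, voting="soft")
stacked = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegressionCV(penalty="l2", cv=5),  # penalized stacker
    stack_method="predict_proba",
)
# soft_vote.fit(X, y); stacked.fit(X, y)
```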

17.
Roy A, Bhaumik DK, Aryal S, Gibbons RD. Biometrics 2007, 63(3): 699-707.
We consider the problem of sample size determination for three-level mixed-effects linear regression models for the analysis of clustered longitudinal data. Three-level designs are used in many areas, but in particular, multicenter randomized longitudinal clinical trials in medical or health-related research. In this case, level 1 represents measurement occasion, level 2 represents subject, and level 3 represents center. The model we consider involves random effects of the time trends at both the subject level and the center level. In the most common case, we have two random effects (constant and a single trend) at both subject and center levels. The approach presented here is general with respect to sampling proportions, number of groups, and attrition rates over time. In addition, we also develop a cost model, as an aid in selecting the most parsimonious of several possible competing models (i.e., different combinations of centers, subjects within centers, and measurement occasions). We derive sample size requirements (i.e., power characteristics) for a test of treatment-by-time interaction(s) for designs based on either subject-level or cluster-level randomization. The general methodology is illustrated using two characteristic examples.

18.
Wang Z, Louis TA. Biometrics 2004, 60(4): 884-891.
Marginal models and conditional mixed-effects models are commonly used for clustered binary data. However, regression parameters and predictions in nonlinear mixed-effects models usually do not have a direct marginal interpretation, because the conditional functional form does not carry over to the margin. Because both marginal and conditional inferences are of interest, a unified approach is attractive. To this end, we investigate a parameterization of generalized linear mixed models with a structured random-intercept distribution that matches the conditional and marginal shapes. We model the marginal mean of response distribution and select the distribution of the random intercept to produce the match and also to model covariate-dependent random effects. We discuss the relation between this approach and some existing models and compare the approaches on two datasets.
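A well-known instance of the conditional/marginal mismatch: with a normal random intercept b_i ~ N(0, σ_b²) in a logistic model, the implied marginal slope is attenuated (Zeger et al., 1988 approximation), whereas the structured (bridge-type) random-intercept distribution considered here is chosen so the conditional logit shape carries over to the margin exactly.

```latex
\beta_{\mathrm{marginal}} \;\approx\;
\frac{\beta_{\mathrm{conditional}}}{\sqrt{1 + c^{2}\sigma_{b}^{2}}},
\qquad c = \frac{16\sqrt{3}}{15\pi} \approx 0.588 .
```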

19.
Habitats in the Wadden Sea, a world heritage area, are affected by land subsidence resulting from natural gas extraction and by sea level rise. Here we describe a method for monitoring changes in habitat types by producing sequential maps from point information, using a multinomial logit regression model with abiotic variables for which maps are available as predictors. In a 70 ha study area a total of 904 vegetation samples was collected in seven sampling rounds with an interval of 2-3 years. Half of the vegetation plots were permanent, violating the assumption of independent data in multinomial logistic regression. This paper shows how this dependency can be accounted for by adding a random effect to the multinomial logit (MNL) model, making it a mixed multinomial logit (MMNL) model. In principle all regression coefficients can be taken as random, but in this study only the intercepts are treated as location-specific random variables (random intercepts model). With six habitat types we have five intercepts, so that the number of extra model parameters becomes 15: 5 variances and 10 covariances. The likelihood ratio test showed that the MMNL model fitted significantly better than the MNL model with the same fixed effects. McFadden R2 for the MMNL model was 0.467, versus 0.395 for the MNL model. The estimated coefficients of the MMNL and MNL models were comparable; those of altitude, the most important predictor, differed most. The MMNL model accounts for pseudo-replication at the permanent plots, which explains the larger standard errors of the MMNL coefficients. The habitat type at a given location-year combination was predicted as the type with the largest predicted probability. The series of maps shows local trends in habitat types most likely driven by sea-level rise, soil subsidence, and a restoration project. We conclude that in environmental modeling of categorical variables using panel data, the dependency of repeated observations at permanent plots should be accounted for. This will affect the estimated probabilities of the categories, and even more strongly the standard errors of the regression coefficients.
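The model comparison reported above is a standard likelihood-ratio test; in the sketch below, the log-likelihoods llf_mmnl and llf_mnl are placeholders for fitted values. Because variance parameters lie on the boundary of their space under the null, the naive chi-square reference is conservative.

```python
from scipy.stats import chi2

def lr_test(llf_full, llf_restricted, df_extra):
    """Likelihood-ratio test: here, mixed MNL vs fixed-effects MNL."""
    lr_stat = 2.0 * (llf_full - llf_restricted)
    return lr_stat, chi2.sf(lr_stat, df=df_extra)

# stat, p = lr_test(llf_mmnl, llf_mnl, df_extra=15)  # 5 variances + 10 covariances
```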

20.
Ecologists are increasingly using statistical models to predict animal abundance and occurrence in unsampled locations. The reliability of such predictions depends on a number of factors, including sample size, how far prediction locations are from the observed data, and similarity of predictive covariates in locations where data are gathered to locations where predictions are desired. In this paper, we propose extending Cook’s notion of an independent variable hull (IVH), developed originally for application with linear regression models, to generalized regression models as a way to help assess the potential reliability of predictions in unsampled areas. Predictions occurring inside the generalized independent variable hull (gIVH) can be regarded as interpolations, while predictions occurring outside the gIVH can be regarded as extrapolations worthy of additional investigation or skepticism. We conduct a simulation study to demonstrate the usefulness of this metric for limiting the scope of spatial inference when conducting model-based abundance estimation from survey counts. In this case, limiting inference to the gIVH substantially reduces bias, especially when survey designs are spatially imbalanced. We also demonstrate the utility of the gIVH in diagnosing problematic extrapolations when estimating the relative abundance of ribbon seals in the Bering Sea as a function of predictive covariates. We suggest that ecologists routinely use diagnostics such as the gIVH to help gauge the reliability of predictions from statistical models (such as generalized linear, generalized additive, and spatio-temporal regression models).
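For an ordinary linear model, Cook's IVH reduces to a leverage comparison, sketched below; the gIVH of the paper generalizes this by using the prediction variance of the (possibly spatial) generalized model on the link scale in place of linear leverage.

```python
import numpy as np

def inside_givh(X_obs, X_pred):
    """Leverage-based check of Cook's independent variable hull: a
    prediction point counts as an interpolation if its leverage under
    the observed design does not exceed the maximum observed leverage."""
    XtX_inv = np.linalg.inv(X_obs.T @ X_obs)
    h_obs = np.einsum("ij,jk,ik->i", X_obs, XtX_inv, X_obs)    # hat diagonals
    h_pred = np.einsum("ij,jk,ik->i", X_pred, XtX_inv, X_pred)
    return h_pred <= h_obs.max()   # False -> extrapolation, treat with skepticism
```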

