Similar Literature
20 similar documents retrieved.
1.
An important assumption in observational studies is that sampled individuals are representative of some larger study population. Yet, this assumption is often unrealistic. Notable examples include online public‐opinion polls, publication biases associated with statistically significant results, and in ecology, telemetry studies with significant habitat‐induced probabilities of missed locations. This problem can be overcome by modeling selection probabilities simultaneously with other predictor–response relationships or by weighting observations by inverse selection probabilities. We illustrate the problem and a solution when modeling mixed migration strategies of northern white‐tailed deer (Odocoileus virginianus). Captures occur on winter yards where deer migrate in response to changing environmental conditions. Yet, not all deer migrate in all years, and captures during mild years are more likely to target deer that migrate every year (i.e., obligate migrators). Characterizing deer as conditional or obligate migrators is also challenging unless deer are observed for many years and under a variety of winter conditions. We developed a hidden Markov model where the probability of capture depends on each individual's migration strategy (conditional versus obligate migrator), a partially latent variable that depends on winter severity in the year of capture. In a 15‐year study, involving 168 white‐tailed deer, the estimated probability of migrating for conditional migrators increased nonlinearly with an index of winter severity. We estimated a higher proportion of obligates in the study cohort than in the population, except during a span of 3 years surrounding back‐to‐back severe winters. These results support the hypothesis that selection biases occur as a result of capturing deer on winter yards, with the magnitude of bias depending on the severity of winter weather. 
Hidden Markov models offer an attractive framework for addressing selection biases due to their ability to incorporate latent variables and model direct and indirect links between state variables and capture probabilities.
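The weighting fix mentioned in the abstract can be illustrated in a few lines. All numbers below are invented for illustration (a hypothetical 30% obligate fraction, and capture probabilities of 0.10 for obligates versus 0.05 for conditional migrators); the point is only that weighting by inverse selection probabilities removes the bias a naive sample mean carries:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population: 30% obligate migrators (z = 1), 70% conditional (z = 0).
n = 100_000
z = (rng.random(n) < 0.30).astype(float)

# Assumed capture probabilities: obligates twice as likely to be captured on yards.
p_capture = np.where(z == 1.0, 0.10, 0.05)
captured = rng.random(n) < p_capture

# Naive sample mean of the obligate indicator is biased upward.
naive = z[captured].mean()

# Weighting each captured animal by its inverse selection probability
# (Horvitz-Thompson style) recovers the population proportion.
w = 1.0 / p_capture[captured]
weighted = np.average(z[captured], weights=w)

print(round(naive, 3), round(weighted, 3))
```

The naive estimate lands near 0.46 rather than 0.30, mirroring the over-representation of obligates the study reports for mild winters.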

2.
This paper treats the topic of representing supplementary variables in biplots obtained by principal component analysis (PCA) and correspondence analysis (CA). We follow a geometrical approach where we minimize errors that are obtained when the scores of the PCA or CA solution are projected onto a vector that represents a supplementary variable. This paper shows that optimal directions for supplementary variables can be found by solving a regression problem, and justifies that earlier formulae from Gabriel are optimal in the least squares sense. We derive new results regarding the geometrical properties, goodness of fit statistics and the interpretation of supplementary variables. It is shown that supplementary variables can be represented by plotting their correlation coefficients with the axes of the biplot only when the proper type of scaling is used. We discuss supplementary variables in an ecological context and give illustrations with data from an environmental monitoring survey.

3.
Summary: The genetic variability of 18 sire families of the Athens-Canadian randombred population infected with coccidiosis was assessed by examining the response variables of weight gain, packed red blood cell volume (PCV), mortality and coccidial lesions. A significant depression in gain and PCV and high lesion scores for Eimeria tenella and E. acervulina were produced in the infected group compared to the noninfected group. Significant variation among the sire families was observed for all of the response variables except E. acervulina lesions, and a significant sex × sire interaction was observed for weight gain. The heritability (h²) estimates for the response variables revealed that resistance to coccidiosis in chickens is moderately heritable. The h² estimates for gain and PCV increased with the coccidial infections, indicating that maximum progress in selecting for resistance should be made when the population is exposed to coccidial infection. Gain was positively correlated with the other measures of resistance, and thus selecting for coccidial resistance should not reduce growth rates. PCV was similarly correlated but had a higher positive correlation with E. tenella lesions. Percent mortality, which is the selection parameter in most coccidial selection programs, was correlated with resistance to coccidiosis. The phenotypic and genotypic correlations demonstrated that chickens susceptible to E. tenella were also susceptible to E. acervulina. Total lesion scores were moderately to highly correlated with the other variables and would be a suitable variable to use in coccidiosis experimentation, including a genetic selection program for resistance. This study shows that progress could be made in selecting for resistance to coccidiosis in chickens using one or a combination of these response variables.

4.
Multiple imputation (MI) is used to handle missing at random (MAR) data. Despite warnings from statisticians, continuous variables are often recoded into binary variables. With MI it is important that the imputation and analysis models are compatible; variables should be imputed in the same form they appear in the analysis model. For a recoded binary variable, more accurate imputations may be obtained by imputing the underlying continuous variable. We conducted a simulation study to explore how best to impute a binary variable that was created from an underlying continuous variable. We generated a completely observed continuous outcome associated with an incomplete binary covariate that is a categorized version of an underlying continuous covariate, and an auxiliary variable associated with the underlying continuous covariate. We simulated data with several sample sizes, and set 25% and 50% of data in the covariate to MAR dependent on the outcome and the auxiliary variable. We compared the performance of five different imputation methods: (a) imputation of the binary variable using logistic regression; (b) imputation of the continuous variable using linear regression, then categorizing into the binary variable; (c, d) imputation of both the continuous and binary variables using fully conditional specification (FCS) and multivariate normal imputation; (e) substantive-model compatible (SMC) FCS. Bias and standard errors were large when only the continuous variable was imputed. The other methods performed adequately. Imputation of both the binary and continuous variables using FCS often encountered mathematical difficulties. We recommend the SMC-FCS method, as it performed best in our simulation studies.
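As a rough illustration of why strategy (b) can go wrong, the sketch below (all parameter values invented) imputes the underlying continuous covariate by a conditional-mean regression on the outcome and the auxiliary variable, then categorizes. A single deterministic imputation is shown for brevity; proper MI would add residual draws, and the variance shrinkage visible here is one source of the bias the abstract reports:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4000
x = rng.normal(size=n)                       # underlying continuous covariate
b = (x > 0).astype(float)                    # binary version used in the analysis
aux = x + rng.normal(scale=0.5, size=n)      # auxiliary variable
y = 1.0 + 1.5 * b + rng.normal(size=n)       # completely observed outcome

# MAR: missingness in the covariate depends on the outcome and the auxiliary.
miss = rng.random(n) < 1 / (1 + np.exp(-(y + aux - 2)))
obs = ~miss

# Strategy (b): regress the continuous covariate on (y, aux) among the
# observed, fill in conditional means, then categorize.
Z = np.column_stack([np.ones(obs.sum()), y[obs], aux[obs]])
beta, *_ = np.linalg.lstsq(Z, x[obs], rcond=None)
x_imp = x.copy()
x_imp[miss] = np.column_stack([np.ones(miss.sum()), y[miss], aux[miss]]) @ beta
b_imp = (x_imp > 0).astype(float)

# Conditional-mean fills understate the covariate's spread; thresholding a
# too-smooth imputation is one route to bias in the analysis model.
print(round(x[obs].var(), 3), round(x_imp[miss].var(), 3))
```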

5.
Wermuth, N.; Cox, D. R. (2008). Biometrika 95(1), 17–33.
Undetected confounding may severely distort the effect of an explanatory variable on a response variable, as defined by a stepwise data-generating process. The best known type of distortion, which we call direct confounding, arises from an unobserved explanatory variable common to a response and its main explanatory variable of interest. It is relevant mainly for observational studies, since it is avoided by successful randomization. By contrast, indirect confounding, which we identify in this paper, is an issue also for intervention studies. For general stepwise-generating processes, we provide matrix and graphical criteria to decide which types of distortion may be present, when they are absent and how they are avoided. We then turn to linear systems without other types of distortion, but with indirect confounding. For such systems, the magnitude of distortion in a least-squares regression coefficient is derived and shown to be estimable, so that it becomes possible to recover the effect of the generating process from the distorted coefficient.
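A minimal numerical sketch of the best-known case, direct confounding (all coefficients invented): an unobserved variable u drives both x and y, so the marginal least-squares slope of y on x is distorted away from the generating effect of 0.5, while adjusting for u recovers it. The indirect confounding the paper identifies is subtler, but the mechanics of a distorted regression coefficient look the same:

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100_000
u = rng.normal(size=n)                         # unobserved common explanatory variable
x = 0.8 * u + rng.normal(size=n)               # main explanatory variable of interest
y = 0.5 * x + 1.0 * u + rng.normal(size=n)     # generating effect of x is 0.5

# Marginal least-squares slope of y on x alone is distorted by u.
slope = np.cov(x, y)[0, 1] / x.var()

# Including u in the regression recovers the generating effect.
Z = np.column_stack([np.ones(n), x, u])
beta, *_ = np.linalg.lstsq(Z, y, rcond=None)

print(round(slope, 3), round(beta[1], 3))
```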

6.
In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often suboptimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. This measure is a property of the true data‐generating mechanism. Specifically, we discuss a generalization of the analysis of variance variable importance measure and discuss how it facilitates the use of machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. The importance of each feature or group of features in the data can then be described individually, using this measure. We describe how to construct an efficient estimator of this measure as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of risk factors for cardiovascular disease in South Africa.
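The ANOVA-style, technique-agnostic idea can be sketched as follows (data-generating coefficients invented): the importance of a feature is the drop in variance explained when that feature is removed from the conditional mean. Plain least squares stands in here for the flexible machine-learning estimator the paper actually uses:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
# Invented data-generating mechanism: x1 matters much more than x2.
y = 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

def r2(X, y):
    """Proportion of variance explained by a least-squares fit."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

full = r2(np.column_stack([x1, x2]), y)
imp_x1 = full - r2(x2.reshape(-1, 1), y)   # drop in R^2 when x1 is removed
imp_x2 = full - r2(x1.reshape(-1, 1), y)   # drop in R^2 when x2 is removed
print(round(imp_x1, 3), round(imp_x2, 3))
```

Because the measure is defined on predictions rather than on a model's internals, the least-squares fit could be swapped for any regression technique without changing its interpretation.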

7.
Biomedical researchers are often interested in estimating the effect of an environmental exposure in relation to a chronic disease endpoint. However, the exposure variable of interest may be measured with errors. In a subset of the whole cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies an additive measurement error model, but it may not have repeated measurements. The subset in which the surrogate variables are available is called a calibration sample. In addition to the surrogate variables that are available among the subjects in the calibration sample, we consider the situation when there is an instrumental variable available for all study subjects. An instrumental variable is correlated with the unobserved true exposure variable, and hence can be useful in the estimation of the regression coefficients. In this paper, we propose a nonparametric method for Cox regression using the observed data from the whole cohort. The nonparametric estimator is the best linear combination of a nonparametric correction estimator from the calibration sample and the difference of the naive estimators from the calibration sample and the whole cohort. The asymptotic distribution is derived, and the finite sample performance of the proposed estimator is examined via intensive simulation studies. The methods are applied to the Nutritional Biomarkers Study of the Women's Health Initiative.

8.
In many clinical settings, a commonly encountered problem is to assess accuracy of a screening test for early detection of a disease. In these applications, predictive performance of the test is of interest. Variable selection may be useful in designing a medical test. An example is a research study conducted to design a new screening test by selecting variables from an existing screener with a hierarchical structure among variables: there are several root questions followed by their stem questions. The stem questions will only be asked after a subject has answered the root question. It is therefore unreasonable to select a model that only contains stem variables but not its root variable. In this work, we propose methods to perform variable selection with structured variables when predictive accuracy of a diagnostic test is the main concern of the analysis. We take a linear combination of individual variables to form a combined test. We then maximize a direct summary measure of the predictive performance of the test, the area under a receiver operating characteristic curve (AUC of an ROC), subject to a penalty function to control for overfitting. Since maximizing empirical AUC of the ROC of a combined test is a complicated nonconvex problem (Pepe, Cai, and Longton, 2006, Biometrics 62, 221–229), we explore the connection between the empirical AUC and a support vector machine (SVM). We cast the problem of maximizing predictive performance of a combined test as a penalized SVM problem and apply a reparametrization to impose the hierarchical structure among variables. We also describe a penalized logistic regression variable selection procedure for structured variables and compare it with the ROC-based approaches. We use simulation studies based on real data to examine performance of the proposed methods.
Finally, we apply the developed methods to design a structured screener to be used in primary care clinics to refer potentially psychotic patients for further specialty diagnostics and treatment.
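The empirical AUC objective at the heart of this approach is just the Mann-Whitney ranking probability of the combined test score. A toy computation (the two score distributions are invented):

```python
import numpy as np

rng = np.random.default_rng(3)
# Invented combined-test scores: diseased subjects score higher on average.
cases = rng.normal(1.0, 1.0, size=500)
controls = rng.normal(0.0, 1.0, size=500)

# Empirical AUC = probability a random case outscores a random control
# (Mann-Whitney U statistic divided by n1 * n2), ties counted as one half.
diff = cases[:, None] - controls[None, :]
auc = np.mean(diff > 0) + 0.5 * np.mean(diff == 0)
print(round(auc, 3))
```

The indicator inside this pairwise sum is what makes direct maximization nonconvex, which motivates the paper's SVM surrogate.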

9.
Models of species’ distributions and niches are frequently used to infer the importance of range- and niche-defining variables. However, the degree to which these models can reliably identify important variables and quantify their influence remains unknown. Here we use a series of simulations to explore how well models can 1) discriminate between variables with different influence and 2) calibrate the magnitude of influence relative to an ‘omniscient’ model. To quantify variable importance, we trained generalized additive models (GAMs), Maxent and boosted regression trees (BRTs) on simulated data and tested their sensitivity to permutations in each predictor. Importance was inferred by calculating the correlation between permuted and unpermuted predictions, and by comparing predictive accuracy of permuted and unpermuted predictions using AUC and the continuous Boyce index. In scenarios with one influential and one uninfluential variable, models failed to discriminate reliably between variables when training occurrences were < 8–64, prevalence was > 0.5, spatial extent was small, environmental data had coarse resolution and spatial autocorrelation was low, or when pairwise correlation between environmental variables was |r| > 0.7. When two variables influenced the distribution equally, importance was underestimated when species had narrow or intermediate niche breadth. Interactions between variables in how they shaped the niche did not affect inferences about their importance. When variables acted unequally, the effect of the stronger variable was overestimated. GAMs and Maxent discriminated between variables more reliably than BRTs, but no algorithm was consistently well-calibrated vis-à-vis the omniscient model. Algorithm-specific measures of importance like Maxent's change-in-gain metric were less robust than the permutation test. Overall, high predictive accuracy did not connote robust inferential capacity. 
As a result, requirements for reliably measuring variable importance are likely more stringent than for creating models with high predictive accuracy.
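The permutation test used in these simulations is easy to reproduce. Below, a stand-in "fitted model" with fixed coefficients (assumed, not estimated from data) predicts suitability from an influential and an uninfluential variable; importance is one minus the correlation between unpermuted and permuted predictions:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
temp = rng.normal(size=n)        # influential predictor
clay = rng.normal(size=n)        # uninfluential predictor
X = np.column_stack([np.ones(n), temp, clay])

# Stand-in for a trained SDM: fixed logistic weights (assumed, not fitted);
# only `temp` actually drives the predicted suitability.
beta = np.array([0.0, 2.0, 0.0])

def predict(M):
    return 1 / (1 + np.exp(-M @ beta))

base = predict(X)

def perm_importance(col):
    Xp = X.copy()
    Xp[:, col] = rng.permutation(Xp[:, col])
    # importance = 1 - correlation of permuted vs. unpermuted predictions
    return 1 - np.corrcoef(base, predict(Xp))[0, 1]

imp_temp, imp_clay = perm_importance(1), perm_importance(2)
print(round(imp_temp, 3), round(imp_clay, 3))
```

With a real trained model the contrast is rarely this clean, which is exactly the calibration problem the study quantifies.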

10.
Multivariable model building for propensity score modeling approaches is challenging. A common propensity score approach is exposure-driven propensity score matching, where the best model selection strategy is still unclear. In particular, the situation may require variable selection, while it is still unclear if variables included in the propensity score should be associated with the exposure and the outcome, with either the exposure or the outcome, with at least the exposure or with at least the outcome. Unmeasured confounders, complex correlation structures, and non-normal covariate distributions further complicate matters. We consider the performance of different modeling strategies in a simulation design with a complex but realistic structure and effects on a binary outcome. We compare the strategies in terms of bias and variance in estimated marginal exposure effects. Considering the bias in estimated marginal exposure effects, the most reliable results for estimating the propensity score are obtained by selecting variables related to the exposure. On average this results in the least bias and does not greatly increase variances. Although our results cannot be generalized, this provides a counterexample to existing recommendations in the literature based on simple simulation settings. This highlights that recommendations obtained in simple simulation settings cannot always be generalized to more complex, but realistic settings and that more complex simulation studies are needed.

11.

Background  

Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories.
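The number-of-categories problem has a simple root: a predictor with more categories can fit noise better, so split-based importance inflates even when the predictor is pure noise. The sketch below (sample size and category counts invented) computes the training-set impurity reduction that per-category means achieve on an outcome unrelated to either predictor:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
y = (rng.random(n) < 0.5).astype(float)   # outcome unrelated to either predictor

x2 = rng.integers(0, 2, n)     # uninformative predictor with 2 categories
x32 = rng.integers(0, 32, n)   # uninformative predictor with 32 categories

def impurity_drop(x, y):
    """Training-set variance explained by per-category means -- the quantity a
    tree split optimizes, inflated by chance as categories multiply."""
    within = sum(y[x == c].var() * np.mean(x == c) for c in np.unique(x))
    return 1.0 - within / y.var()

drop2, drop32 = impurity_drop(x2, y), impurity_drop(x32, y)
print(round(drop2, 3), round(drop32, 3))
```

The 32-category noise variable appears far "more important" on the training data, which is the bias the abstract warns about for Gini-based importance.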

12.
In this paper, we discuss the identifiability and estimation of causal effects of a continuous treatment on a binary response when the treatment is measured with errors and there exists a latent categorical confounder associated with both treatment and response. Under some widely used parametric models, we first discuss the identifiability of the causal effects and then propose an approach for estimation and inference. Our approach can eliminate the biases induced by latent confounding and measurement errors by using only a single instrumental variable. Based on the identification results, we give guidelines for determining the existence of a latent categorical confounder and for selecting the number of levels of the latent confounder. We apply the proposed approach to a data set from the Framingham Heart Study to evaluate the effect of systolic blood pressure on coronary heart disease.

13.
Ecologists are typically interested in explaining ecological relationships, describing patterns and processes, and making spatial or temporal predictions. These tasks are accomplished by modeling an output value (the response) as a function of a set of features (explanatory variables). Modeling ecological data is challenging, however, because response and predictor variables may be continuous or discrete, the relationships to be explained are often nonlinear, and the explanatory variables can interact in complex ways. Missing values in both response and explanatory variables are not uncommon, and outliers frequently appear in ecological data. In addition, ecologists usually want models that are both easy to build and easy to interpret. A variety of statistical methods are typically used to handle the distinct ecological problems arising in these diverse scenarios, including (multiple) logistic regression, linear models, survival models, analysis of variance, and so on. Random forests are an effective method that can handle all of these problems: they can be used for classification, clustering, regression and survival analysis, for assessing variable importance, for detecting outliers, and for imputing missing data. Given the algorithmic advantages of random forests, this paper reviews their applications in ecology, outlines the modeling process, and demonstrates their main features with a case study modeling the distribution of Yunnan pine (Pinus yunnanensis). By introducing the general terminology, concepts, and modeling ideas of random forests, we aim to help readers grasp the essentials of applying the method; random forests can be expected to see wider application and development in ecological research.

14.
Question: Predictive vegetation modelling relies on the use of environmental variables, which are usually derived from a base data set with some level of error, and this error is propagated to any subsequently derived environmental variables. The question for this study is: what is the level of error and uncertainty in environmental variables based on the error propagated from a Digital Elevation Model (DEM), and how does it vary for both direct and indirect variables? Location: Kioloa region, New South Wales, Australia. Methods: The level of error in a DEM is assessed and used to develop an error model for analysing error propagation to derived environmental variables. We tested both indirect (elevation, slope, aspect, topographic position) and direct (average air temperature, net solar radiation, and topographic wetness index) variables for their robustness to propagated error from the DEM. Results: It is shown that the direct environmental variable net solar radiation is less affected by error in the DEM than the indirect variables aspect and slope, but regional conditions such as slope steepness and cloudiness can influence this outcome. However, the indirect environmental variable topographic position was less affected by error in the DEM than topographic wetness index. Interestingly, the results disagreed with the current assumption that indirect variables are necessarily less sensitive to propagated error because they are less derived. Conclusions: The results indicate that variables exhibit both systematic bias and instability under uncertainty. There is a clear need to consider the sensitivity of variables to error in their base data sets in addition to the question of whether to use direct or indirect variables.

15.
16.
Understanding the mechanisms of habitat selection is fundamental to the construction of proper conservation and management plans for many avian species. Habitat changes caused by humans increase landscape complexity and thus the complexity of the data available for explaining species distributions. New techniques that assume no linearity and are capable of extrapolating the response variables across landscapes are needed for dealing with difficult relationships between habitat variables and distribution data. We used a random forest algorithm to study breeding-site selection of herons and egrets in a human-influenced landscape by analyzing land use around their colonies. We analyzed the importance of each land-use variable at different scales and its relationship to the probability of colony presence. We found that there are two main spatial scales at which herons and egrets select their colony sites: a medium scale (4 km) and a large scale (10–15 km). Colonies were attracted to areas with large amounts of evergreen forest at the medium scale, whereas avoidance of high-density urban areas was important at the large scale. Previous studies used attractive factors, mainly foraging areas, to explain bird-colony distributions, but our study is the first to show the major importance of repellent factors at large scales. We believe that the newest non-linear methods, such as random forests, are needed when modelling complex variable interactions for organisms distributed in complex landscapes. These methods could help to improve conservation plans for species threatened by the advance of highly human-influenced landscapes.

17.
The development of clinical prediction models requires the selection of suitable predictor variables. Techniques to perform objective Bayesian variable selection in the linear model are well developed and have been extended to the generalized linear model setting as well as to the Cox proportional hazards model. Here, we consider discrete time‐to‐event data with competing risks and propose methodology to develop a clinical prediction model for the daily risk of acquiring a ventilator‐associated pneumonia (VAP) attributed to P. aeruginosa (PA) in intensive care units. The competing events for a PA VAP are extubation, death, and VAP due to other bacteria. Baseline variables are potentially important to predict the outcome at the start of ventilation, but may lose some of their predictive power after a certain time. Therefore, we use a landmark approach for dynamic Bayesian variable selection where the set of relevant predictors depends on the time already spent at risk. We finally determine the direct impact of a variable on each competing event through cause‐specific variable selection.

18.
The paper deals with the optimal Bayes discriminant rule for qualitative variables. The performance of variable selection is investigated under strong assumptions like the restriction to dichotomous variables, which are assumed to be independent or dependent with fixed dependence structure, and all parameters known. Differences in comparison with normal variables in linear discriminant analysis can be shown. This is a further reason for applying special methods of discriminant analysis in the case of qualitative variables.

19.
This paper focuses on the problems of estimation and variable selection in the functional linear regression model (FLM) with functional response and scalar covariates. To this end, two different types of regularization (L1 and L2) are considered in this paper. On the one hand, a sample approach for functional LASSO in terms of basis representation of the sample values of the response variable is proposed. On the other hand, we propose a penalized version of the FLM by introducing a P-spline penalty in the least squares fitting criterion. But our aim is to propose P-splines as a powerful tool simultaneously for variable selection and functional parameters estimation. In that sense, the importance of smoothing the response variable before fitting the model is also studied. In summary, penalized (L1 and L2) and nonpenalized regression are combined with a presmoothing of the response variable sample curves, based on regression splines or P-splines, providing a total of six approaches to be compared in two simulation schemes. Finally, the most competitive approach is applied to a real data set based on the graft-versus-host disease, which is one of the most frequent complications (30%–50%) in allogeneic hematopoietic stem-cell transplantation.

20.
A two-variable model with delay in both variables is proposed for the circadian oscillations of protein concentrations in the fungal species Neurospora crassa. The dynamical variables chosen are the concentrations of the FRQ and WC-1 proteins. Our model is a two-variable simplification of the detailed 23-variable model of Smolen et al. (J. Neurosci. 21 (2001) 6644), which describes circadian oscillations with interlocking positive and negative feedback loops. In our model, as in Smolen's, a sustained limit-cycle oscillation takes place in both FRQ and WC-1 protein in continuous darkness, with WC-1 anti-phase to FRQ, as observed in experiments. The model accounts for various characteristic features of circadian rhythms such as entrainment to light-dark cycles, phase response curves, and robustness to parameter variation and molecular fluctuations. Simulations are carried out to study the effect of periodic forcing of circadian oscillations by light-dark cycles. The periodic forcing results in a rich bifurcation diagram that includes quasiperiodicity and chaotic oscillations, depending on the magnitude of the periodic changes in the light-controlled parameter. When positive feedback is eliminated, our model reduces to the generic one-dimensional delay model of Lema et al. (J. Theor. Biol. 204 (2000) 565), in which the FRQ protein represses its own production. This one-dimensional model also exhibits all the characteristic features of circadian oscillations and gives rise to oscillations that are reasonably robust to parameter variations and molecular noise.
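A one-variable delayed negative-feedback model of this kind can be simulated directly. The sketch below (all parameter values illustrative, not those of the paper) integrates dF/dt = v_s·K^n/(K^n + F(t−τ)^n) − v_d·F(t) with forward Euler and a constant pre-history; with a steep Hill function and a long enough delay it settles into sustained oscillations rather than a fixed point:

```python
import numpy as np

# Illustrative parameters (invented): synthesis v_s, degradation v_d,
# repression threshold K, Hill coefficient n_hill, delay tau.
v_s, v_d, K, n_hill = 1.0, 0.4, 1.0, 4
tau, dt, T = 6.0, 0.01, 300.0

lag = int(tau / dt)
steps = int(T / dt)
F = np.empty(lag + steps)
F[: lag + 1] = 0.5                 # constant history F(t) = 0.5 for t <= 0

for i in range(lag, lag + steps - 1):
    # Delayed self-repression: production falls as F(t - tau) rises.
    repression = K**n_hill / (K**n_hill + F[i - lag] ** n_hill)
    F[i + 1] = F[i] + dt * (v_s * repression - v_d * F[i])

traj = F[lag:]                     # trajectory on [0, T]
late = traj[-int(100 / dt):]       # last 100 time units
print(round(late.min(), 3), round(late.max(), 3))
```

The late-time swing between low and high FRQ levels is the limit-cycle behavior; shrinking tau toward zero collapses the dynamics onto the stable steady state.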
