Similar articles
20 similar articles found.
1.

Background

Variable selection is an important step in building a multivariate regression model for which several methods and statistical packages are available. A comprehensive approach for variable selection in complex multivariate regression analyses within HIV cohorts is explored by utilizing both epidemiological and biostatistical procedures.

Methods

Three different methods for variable selection were illustrated in a study comparing survival time between subjects in the Department of Defense’s National History Study and the Atlanta Veterans Affairs Medical Center’s HIV Atlanta VA Cohort Study. The first two methods were stepwise selection procedures, based either on significance tests (Score test), or on information theory (Akaike Information Criterion), while the third method employed a Bayesian argument (Bayesian Model Averaging).

Results

All three methods resulted in a similar parsimonious survival model. Three of the covariates previously used in the multivariate model were not included in the final model suggested by the three approaches. When comparing the parsimonious model to the previously published model, there was evidence of less variance in the main survival estimates.

Conclusions

The variable selection approaches considered in this study made it possible to build a model based on significance tests, on an information criterion, and on averaging models using their posterior probabilities. A parsimonious model that balanced these three approaches was found to provide a better fit than the previously reported model.

2.

Background

Commonly when designing studies, researchers propose to measure several independent variables in a regression model, a subset of which are identified as the main variables of interest while the rest are retained in a model as covariates or confounders. Power for linear regression in this setting can be calculated using SAS PROC POWER. There exists a void in estimating power for the logistic regression models in the same setting.

Methods

Currently, an approach that calculates power for only one variable of interest in the presence of other covariates for logistic regression is in common use and works well for this special case. In this paper we propose three related algorithms along with corresponding SAS macros that extend power estimation for one or more primary variables of interest in the presence of some confounders.

Results

The three proposed empirical algorithms employ the likelihood ratio test to provide a user with a power estimate for a given sample size, a quick sample size estimate for a given power, or an approximate power curve for a range of sample sizes. A user can specify odds ratios for a combination of binary, uniform and standard normal independent variables of interest and/or the remaining covariates/confounders in the model, along with a correlation between variables.
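
The first of these ideas, an empirical power estimate for a given sample size, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' SAS macros: the function names, default odds ratios, intercept, and correlation are all hypothetical. It simulates one binary and one standard normal variable of interest plus one correlated confounder, fits the full and reduced logistic models by Newton-Raphson, and counts how often the 2-degree-of-freedom likelihood ratio test rejects.

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def linpred(beta, row):
    """Linear predictor, clamped so that exp() cannot overflow."""
    eta = sum(b * v for b, v in zip(beta, row))
    return max(-30.0, min(30.0, eta))

def max_loglik(X, y, iters=25):
    """Fit a logistic regression by Newton-Raphson; return the maximized log-likelihood."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for row, yi in zip(X, y):
            mu = 1.0 / (1.0 + math.exp(-linpred(beta, row)))
            w = mu * (1.0 - mu)
            for j in range(p):
                grad[j] += (yi - mu) * row[j]
                for k in range(p):
                    hess[j][k] += w * row[j] * row[k]
        step = solve(hess, grad)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < 1e-8:
            break
    loglik = 0.0
    for row, yi in zip(X, y):
        eta = linpred(beta, row)
        loglik += yi * eta - math.log1p(math.exp(eta))
    return loglik

def estimate_power(n, or_binary, or_normal, or_confounder=1.5, rho=0.3,
                   intercept=-0.5, alpha=0.05, n_sims=100, seed=1):
    """Empirical power of the 2-df likelihood ratio test for two variables of interest.

    All defaults are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    b1, b2, b3 = math.log(or_binary), math.log(or_normal), math.log(or_confounder)
    rejections = 0
    for _ in range(n_sims):
        full, reduced, y = [], [], []
        for _ in range(n):
            x1 = 1.0 if rng.random() < 0.5 else 0.0   # binary variable of interest
            x2 = rng.gauss(0.0, 1.0)                  # standard normal variable of interest
            # Confounder with unit variance and correlation rho with x2.
            z = rho * x2 + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
            prob = 1.0 / (1.0 + math.exp(-(intercept + b1 * x1 + b2 * x2 + b3 * z)))
            y.append(1.0 if rng.random() < prob else 0.0)
            full.append([1.0, x1, x2, z])
            reduced.append([1.0, z])
        lr = 2.0 * (max_loglik(full, y) - max_loglik(reduced, y))
        # With 2 degrees of freedom the chi-square survival function is exp(-x / 2).
        if math.exp(-max(lr, 0.0) / 2.0) < alpha:
            rejections += 1
    return rejections / n_sims

if __name__ == "__main__":
    print(estimate_power(n=150, or_binary=3.0, or_normal=2.0))
```

Testing exactly two variables of interest keeps the sketch free of external libraries, because the chi-square p-value then has a closed form. A sample size search or a power curve could presumably be built by calling `estimate_power` over a grid of `n` values.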

Conclusions

These user-friendly algorithms and macro tools are a promising solution that can fill the void in power estimation for logistic regression when multiple independent variables are of interest, in the presence of additional covariates in the model.

3.

Background

Over time, methods for the development of clinical decision support (CDS) systems have evolved from interpretable and easy-to-use scoring systems to very complex and non-interpretable mathematical models. In order to accomplish effective decision support, CDS systems should provide information on how the model arrives at a certain decision. To address the issue of incompatibility between performance, interpretability and applicability of CDS systems, this paper proposes an innovative model structure, automatically leading to interpretable and easily applicable models. The resulting models can be used to guide clinicians when deciding upon the appropriate treatment, to estimate patient-specific risks, and to improve communication with patients.

Methods and Findings

We propose the interval coded scoring (ICS) system, which imposes that the effect of each variable on the estimated risk is constant within consecutive intervals. The number and position of the intervals are automatically obtained by solving an optimization problem, which additionally performs variable selection. The resulting model can be visualised by means of appealing scoring tables and color bars. ICS models can be used within software packages, in smartphone applications, or on paper, which is particularly useful for bedside medicine and home-monitoring. The ICS approach is illustrated on two gynecological problems: diagnosis of malignancy of ovarian tumors using a dataset containing 3,511 patients, and prediction of first trimester viability of pregnancies using a dataset of 1,435 women. Comparison of the performance of the ICS approach with a range of prediction models proposed in the literature illustrates the ability of ICS to combine optimal performance with the interpretability of simple scoring systems.
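
To make the interval-coded idea concrete, here is a minimal sketch of how such a scoring table might be applied once it has been learned. Everything in it is hypothetical: the variable names, thresholds, points, and the logistic mapping from total score to risk are assumptions for illustration, and the optimization that would select the intervals is not shown.

```python
import bisect
import math

# Hypothetical scoring table: for each variable, sorted interval thresholds and
# the points assigned to each interval (one more entry than thresholds).
SCORE_TABLE = {
    "age": {"thresholds": [40, 60], "points": [0, 1, 3]},
    "marker": {"thresholds": [35.0], "points": [0, 2]},
}
INTERCEPT = -3.0  # assumed baseline on the log-odds scale
SCALE = 0.8       # assumed log-odds contributed by each point

def score(patient):
    """Sum the points of the interval that each variable falls into."""
    total = 0
    for name, spec in SCORE_TABLE.items():
        # bisect_right returns the index of the interval containing the value,
        # so the effect is constant within each interval.
        idx = bisect.bisect_right(spec["thresholds"], patient[name])
        total += spec["points"][idx]
    return total

def risk(patient):
    """Map the total score to a risk estimate (logistic link is an assumption)."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + SCALE * score(patient))))
```

Because the lookup is a table plus a sum, the same model can be evaluated by hand on paper, which is the applicability argument the abstract makes.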

Conclusions

The ICS approach can improve patient-clinician communication and will provide additional insights into the importance and influence of available variables. Future challenges include extensions of the proposed methodology towards automated detection of interaction effects, multi-class decision support systems, prognosis and high-dimensional data.

4.

Background  

When predictive survival models are built from high-dimensional data, there are often additional covariates, such as clinical scores, that must be included in the final model. While there are several techniques for fitting sparse high-dimensional survival models by penalized parameter estimation, none allows for explicit consideration of such mandatory covariates.

5.

Background  

Inferring gene networks from time-course microarray experiments with a vector autoregressive (VAR) model is the process of identifying functional associations between genes through multivariate time series. This problem can be cast as a variable selection problem in statistics. One of the promising methods for variable selection is the elastic net proposed by Zou and Hastie (2005). However, while VAR modeling with the elastic net succeeds in increasing the number of true positives, it also increases the number of false positives.

6.

Objectives

Little is known about influences of sample selection on estimation in propensity score matching. The purpose of the study was to assess potential selection bias using one-to-one greedy matching versus optimal full matching as part of an evaluation of supportive housing in New York City (NYC).

Study Design and Settings

Data came from administrative data for 2 groups of applicants who were eligible for an NYC supportive housing program in 2007–09, including chronically homeless adults with a substance use disorder and young adults aging out of foster care. We evaluated the 2 matching methods in their ability to balance covariates and represent the original population, and in how those methods affected outcomes related to Medicaid expenditures.
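
One-to-one greedy matching can be sketched as follows, assuming propensity scores have already been estimated. The function name, the caliper, and the data layout are hypothetical and are not taken from the study. Each treated unit takes the closest control still available, and unmatched units are dropped.

```python
def greedy_match(treated, controls, caliper=0.1):
    """Greedy one-to-one nearest-neighbor matching on the propensity score.

    treated, controls: dicts mapping a unit id to its propensity score.
    caliper: maximum allowed score distance (an illustrative assumption).
    Returns a list of (treated_id, control_id) pairs.
    """
    available = dict(controls)  # controls not yet used
    pairs = []
    for t_id, t_ps in treated.items():
        if not available:
            break
        # Pick the closest remaining control; earlier choices are never revisited.
        c_id = min(available, key=lambda c: abs(available[c] - t_ps))
        if abs(available[c_id] - t_ps) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]
    return pairs
```

Treated units without a control inside the caliper are discarded, and the result depends on processing order. Both properties are plausible routes by which a greedy matched sample could drift from the original population, whereas optimal full matching keeps all units and minimizes total distance.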

Results

In the population with a substance use disorder, only optimal full matching performed well in balancing covariates, whereas both methods created representative populations. In the young adult population, both methods balanced covariates effectively, but only optimal full matching created representative populations. In the young adult population, the impact of the program on Medicaid expenditures was attenuated when one-to-one greedy matching was used, compared with optimal full matching.

Conclusion

Given covariate balancing with both methods, attenuated program impacts in the young adult population indicated that one-to-one greedy matching introduced selection bias.

7.

Key message

Development of models to predict genotype by environment interactions, in unobserved environments, using environmental covariates, a crop model and genomic selection. Application to a large winter wheat dataset.

Abstract

Genotype by environment interaction (G*E) is one of the key issues when analyzing phenotypes. The use of environment data to model G*E has long been a subject of interest but is limited by the same problems as those addressed by genomic selection methods: a large number of correlated predictors each explaining a small amount of the total variance. In addition, non-linear responses of genotypes to stresses are expected to further complicate the analysis. Using a crop model to derive stress covariates from daily weather data for predicted crop development stages, we propose an extension of the factorial regression model to genomic selection. This model is further extended to the marker level, enabling the modeling of quantitative trait loci (QTL) by environment interaction (Q*E), on a genome-wide scale. A newly developed ensemble method, soft rule fit, was used to improve this model and capture non-linear responses of QTL to stresses. The method is tested using a large winter wheat dataset, representative of the type of data available in a large-scale commercial breeding program. Accuracy in predicting genotype performance in unobserved environments for which weather data were available increased by 11.1 % on average and the variability in prediction accuracy decreased by 10.8 %. By leveraging agronomic knowledge and the large historical datasets generated by breeding programs, this new model provides insight into the genetic architecture of genotype by environment interactions and could predict genotype performance based on past and future weather scenarios.

8.

Aims

To develop a risk assessment model for Chinese persons at risk of type 2 diabetes.

Materials and Methods

The model was generated from cross-sectional data on 16,246 persons aged 20 years and over. The C4.5 algorithm and multivariate logistic regression were used for variable selection. Relative risk values combined with expert decisions were used to construct a comprehensive risk assessment for evaluating the individual risk category. The validity of the model was tested by cross-validation and by a survey performed six years later with some participants.

Results

Nine variables were selected as risk variables. A mathematical model was established to calculate the average probability of diabetes in each cluster's group, divided by sex and age. A series of criteria combined with the relative risk (RR) value (2.2) and the level of risk variables stratified individuals into four risk groups (non, low, medium and high risk). The overall accuracy reached 90.99%, evaluated by cross-validation inside the model population. The incidence of diabetes for each risk group increased from 1.5 (non-risk group) to 28.2 (high-risk group) per one thousand persons per year over six years of follow-up.

Discussion

The model could classify an individual's risk of type 2 diabetes into four risk levels. This model could be used as a technical tool not only to support screening of persons at different risk levels, but also to evaluate the results of interventions.

9.

Background

Hantavirus pulmonary syndrome (HPS) is a life-threatening disease transmitted by the rodent Oligoryzomys longicaudatus in Chile. Hantavirus outbreaks are typically small and geographically confined. Several studies have estimated risk based on spatial and temporal distribution of cases in relation to climate and environmental variables, but few have considered climatological modeling of HPS incidence for monitoring and forecasting purposes.

Methodology

Monthly counts of confirmed HPS cases were obtained from the Chilean Ministry of Health for 2001–2012. There were an estimated 667 confirmed HPS cases. The data suggested a seasonal trend, which appeared to correlate with changes in climatological variables such as temperature, precipitation, and humidity. We considered several Auto Regressive Integrated Moving Average (ARIMA) time-series models and regression models with ARIMA errors with one or a combination of these climate variables as covariates. We adopted an information-theoretic approach to model ranking and selection. Data from 2001–2009 were used in fitting and data from January 2010 to December 2012 were used for one-step-ahead predictions.

Results

We focused on six models. In a baseline model, future HPS cases were forecasted from previous incidence; the other models included climate variables as covariates. The baseline model had a Corrected Akaike Information Criterion (AICc) of 444.98, and the top ranked model, which included precipitation, had an AICc of 437.62. Although the AICc of the top ranked model only provided a 1.65% improvement to the baseline AICc, the empirical support was 39 times stronger relative to the baseline model.
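
The two figures quoted above can be reproduced from the reported AICc values. The sketch below assumes that "empirical support" refers to the usual information-theoretic evidence ratio, exp(Δ/2), where Δ is the AICc difference between the two models. The abstract does not name the formula, but the numbers are consistent with it.

```python
import math

aicc_baseline = 444.98  # baseline model, from the text
aicc_top = 437.62       # top-ranked model (with precipitation), from the text

delta = aicc_baseline - aicc_top                         # AICc difference
relative_improvement = 100.0 * delta / aicc_baseline     # percent change in AICc
evidence_ratio = math.exp(delta / 2.0)                   # support for top model vs. baseline

print(f"delta AICc = {delta:.2f}")
print(f"relative improvement = {relative_improvement:.2f}%")
print(f"evidence ratio = {evidence_ratio:.1f}")
```

The difference of 7.36 AICc units gives a relative improvement of about 1.65% and an evidence ratio of about 39.6, in line with the "39 times stronger" support reported. A small percentage change in AICc can therefore still correspond to a large difference in relative support.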

Conclusions

Instead of choosing a single model, we present a set of candidate models that can be used in modeling and forecasting confirmed HPS cases in Chile. The models can be improved by using data at the regional level and easily extended to other countries with seasonal incidence of HPS.

10.
Dynamic treatment regimes (DTRs) consist of a sequence of decision rules, one per stage of intervention, that aim to recommend effective treatments for individual patients according to patient information history. DTRs can be estimated from models which include interactions between treatment and a (typically small) number of covariates which are often chosen a priori. However, with increasingly large and complex data being collected, it can be difficult to know which prognostic factors might be relevant in the treatment rule. Therefore, a more data-driven approach to select these covariates might improve the estimated decision rules and simplify models to make them easier to interpret. We propose a variable selection method for DTR estimation using penalized dynamic weighted least squares. Our method has the strong heredity property, that is, an interaction term can be included in the model only if the corresponding main terms have also been selected. We show our method has both the double robustness property and the oracle property theoretically; and the newly proposed method compares favorably with other variable selection approaches in numerical studies. We further illustrate the proposed method on data from the Sequenced Treatment Alternatives to Relieve Depression study.

11.

Background

Neonatal mortality contributes a large proportion towards early childhood mortality in developing countries, with considerable geographical variation at small areas within countries.

Methods

A geo-additive logistic regression model is proposed for quantifying small-scale geographical variation in neonatal mortality, and to estimate risk factors of neonatal mortality. Random effects are introduced to capture spatial correlation and heterogeneity. The spatial correlation can be modelled using the Markov random fields (MRF) when data is aggregated, while the two dimensional P-splines apply when exact locations are available, whereas the unstructured spatial effects are assigned an independent Gaussian prior. Socio-economic and bio-demographic factors which may affect the risk of neonatal mortality are simultaneously estimated as fixed effects and as nonlinear effects for continuous covariates. The smooth effects of continuous covariates are modelled by second-order random walk priors. Modelling and inference use the empirical Bayesian approach via penalized likelihood technique. The methodology is applied to analyse the likelihood of neonatal deaths, using data from the 2000 Malawi demographic and health survey. The spatial effects are quantified through MRF and two dimensional P-splines priors.

Results

Findings indicate that both fixed and spatial effects are associated with neonatal mortality.

Conclusions

Our study therefore suggests that the challenge of reducing neonatal mortality goes beyond addressing individual factors and also requires understanding unmeasured covariates in order to design potentially effective interventions.

12.

Background

Why do some groups of physically linked genes stay linked over long evolutionary periods? Although several factors are associated with the formation of gene clusters in eukaryotic genomes, the particular contribution of each feature to clustering maintenance remains unclear.

Results

We quantify the strength of the proposed factors in a yeast lineage. First we identify the magnitude of each variable to determine linkage conservation by using several comparator species at different distances to Saccharomyces cerevisiae. For adjacent gene pairs, in line with null simulations, intergenic distance acts as the strongest covariate. Which of the other covariates appear important depends on the comparator, although high co-expression is related to synteny conservation commonly, especially in the more distant comparisons, these being expected to reveal strong but relatively rare selection. We also analyze those pairs that are immediate neighbors through all the lineages considered. Current intergene distance is again the best predictor, followed by the local density of essential genes and co-regulation, with co-expression and recombination rate being the weakest predictors. The genome duplication seen in yeast leaves some mark on linkage conservation, as adjacent pairs resolved as single copy in all post-whole genome duplication species are more often found as adjacent in pre-duplication species.

Conclusion

Current intergene distance is consistently the strongest predictor of synteny conservation as expected under a simple null model. Other variables are of lesser importance and their relevance depends both on the species comparison in question and the fate of the duplicates following genome duplication.

13.

Background

Cell fate regulation directly affects tissue homeostasis and human health. Research on cell fate decision sheds light on key regulators, facilitates understanding the mechanisms, and suggests novel strategies to treat human diseases that are related to abnormal cell development.

Results

In this study, we proposed a polynomial-based model to predict cell fate. This model was derived from the Taylor series. As a case study, gene expression data of pancreatic cells were adopted to test and verify the model. As numerous features (genes) are available, we employed two kinds of feature selection methods, i.e. correlation based and apoptosis pathway based. Then polynomials of different degrees were used to refine the cell fate prediction function. 10-fold cross-validation was carried out to evaluate the performance of our model. In addition, we analyzed the stability of the resultant cell fate prediction model by evaluating the ranges of the parameters, as well as assessing the variances of the predicted values at randomly selected points. Results show that, for both gene selection methods, the prediction accuracies of polynomials of different degrees differ little. Interestingly, the linear polynomial (degree 1 polynomial) is more stable than the others. A comparison of the linear polynomials based on the two gene selection methods shows that, although the accuracy of the linear polynomial that uses correlation analysis outcomes is a little higher (86.62%), the one restricted to genes of the apoptosis pathway is much more stable.

Conclusions

Considering both the prediction accuracy and the stability of polynomial models of different degrees, the linear model is a preferred choice for cell fate prediction with gene expression data of pancreatic cells. The presented cell fate prediction model can be extended to other cells, which may be important for basic research as well as clinical study of cell development related diseases.

14.

Background

Generally, utility based decision making models focus on experimental outcomes. In this paper we propose a utility model based on molecular diffusion to simulate the choice behavior of Drosophila larvae exposed to different light conditions.

Methods

In this paper, light/dark choice-based Drosophila larval phototaxis is analyzed with our molecular diffusion based model. An ISCEM algorithm is developed to estimate the model parameters.

Results

By applying this behavioral utility model to light intensity and phototaxis data, we show that this model fits the experimental data very well.

Conclusions

Our model provides new insights into decision making mechanisms in general. From an engineering viewpoint, we propose that the model could be applied to a wider range of decision making practices.

15.
16.
17.

Background

Regional disparity in suicide rates is a serious problem worldwide. One possible cause is unequal distribution of the health workforce, especially psychiatrists. Research about the association between regional physician numbers and suicide rates is therefore important but studies are rare. The objective of this study was to evaluate the association between physician numbers and suicide rates in Japan, by municipality.

Methods

The study included all the municipalities in Japan (n = 1,896). We estimated smoothed standardized mortality ratios of suicide rates for each municipality and evaluated the association between health workforce and suicide rates using a hierarchical Bayesian model accounting for spatially correlated random effects, a conditional autoregressive model. We assumed a Poisson distribution for the observed number of suicides and set the expected number of suicides as the offset variable. The explanatory variables were numbers of physicians, a binary variable for the presence of psychiatrists, and social covariates.
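
One common way to write a model of this kind is given below, as a sketch only. The abstract does not state the exact form of the conditional autoregressive prior or whether an additional unstructured random effect was included, so the intrinsic CAR form and all symbols here are assumptions.

```latex
\begin{align}
O_i \mid \theta_i &\sim \mathrm{Poisson}(E_i \theta_i), \\
\log \theta_i &= \beta_0 + \mathbf{x}_i^{\top} \boldsymbol{\beta} + u_i, \\
u_i \mid u_{j \neq i} &\sim \mathcal{N}\!\left( \frac{1}{n_i} \sum_{j \in \partial i} u_j,\; \frac{\sigma_u^2}{n_i} \right).
\end{align}
```

Here $O_i$ and $E_i$ are the observed and expected numbers of suicides in municipality $i$ (so $\log E_i$ enters as the offset), $\theta_i$ is the smoothed standardized mortality ratio, $\mathbf{x}_i$ holds the explanatory variables, $\partial i$ is the set of neighbouring municipalities, and $n_i$ is their number.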

Results

After adjustment for socioeconomic factors, suicide rates in municipalities that had at least one psychiatrist were lower than those in the other municipalities. There was, however, a positive and statistically significant association between the number of physicians and suicide rates.

Conclusions

Suicide rates in municipalities that had at least one psychiatrist were lower than those in other municipalities, but the number of physicians was positively and significantly related to suicide rates. To reduce the regional disparity in suicide rates, the government should encourage psychiatrists to participate in community-based suicide prevention programs and to settle in municipalities that currently have no psychiatrists. The government and other stakeholders should also construct better networks between psychiatrists and non-psychiatrists to support sharing of information for suicide prevention.

18.

Background  

In a spatially and temporally variable adaptive landscape, mutations operating in opposite directions and mutations of large effect should be commonly fixed due to the shifting locations of phenotypic optima. Similarly, an adaptive landscape with multiple phenotypic optima and deep valleys of low fitness between peaks will favor mutations of large effect. Traits under biotic selection should experience a more spatially and temporally variable adaptive landscape with more phenotypic optima than that experienced by traits under abiotic selection. To test this hypothesis, we assemble information from QTL mapping studies conducted in plants, comparing effect directions and effect sizes of detected QTL controlling traits putatively under abiotic selection to those controlling traits putatively under biotic selection.

19.

Background  

Commonly used phylogenetic models assume a homogeneous evolutionary process throughout the tree. It is known that these homogeneous models are often too simplistic, and that with time some properties of the evolutionary process can change (due to selection or drift). In particular, as constraints on sequences evolve, the proportion of variable sites can vary between lineages. This affects the ability of phylogenetic methods to correctly estimate phylogenetic trees, especially for long timescales. To date there is no phylogenetic model that allows for change in the proportion of variable sites, and the degree to which this affects phylogenetic reconstruction is unknown.

20.

Background  

A recent publication described a supervised classification method for microarray data: Between Group Analysis (BGA). This method, which is based on performing multivariate ordination of groups, proved to be very efficient for both classification of samples into pre-defined groups and disease class prediction of new unknown samples. Classification and prediction with BGA are classically performed using the whole set of genes, and no variable selection is required. We hypothesize that an optimized selection of highly discriminating genes might improve the prediction power of BGA.
