Similar articles
20 similar articles found (search time: 15 ms)
1.
Variable selection and model choice in geoadditive regression models (cited 3 times: 0 self-citations, 3 by others)
Kneib T, Hothorn T, Tutz G. Biometrics 2009, 65(2):626-634
Summary. Model choice and variable selection are issues of major concern in practical regression analyses, arising in many biometric applications such as habitat suitability analyses, where the aim is to identify the influence of potentially many environmental conditions on certain species. We describe regression models for breeding bird communities that facilitate both model choice and variable selection, by a boosting algorithm that works within a class of geoadditive regression models comprising spatial effects, nonparametric effects of continuous covariates, interaction surfaces, and varying coefficients. The major modeling components are penalized splines and their bivariate tensor product extensions. All smooth model terms are represented as the sum of a parametric component and a smooth component with one degree of freedom to obtain a fair comparison between the model terms. A generic representation of the geoadditive model allows us to devise a general boosting algorithm that automatically performs model choice and variable selection.
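The selection mechanism behind such a boosting algorithm can be sketched as componentwise L2-boosting: each iteration fits every candidate covariate to the current residuals and updates the fit only along the best-fitting one, so covariates that are never selected keep a zero coefficient. This is a minimal sketch with univariate linear base learners; the penalized-spline, spatial, and tensor-product base learners of the paper are not reproduced.

```python
def componentwise_boost(X, y, nu=0.1, steps=200):
    """Componentwise L2-boosting with univariate linear base learners.

    Covariates never chosen as the best residual fit retain a zero
    coefficient, which performs variable selection implicitly.
    """
    n, p = len(X), len(X[0])
    fit = [0.0] * n
    coef = [0.0] * p
    for _ in range(steps):
        resid = [y[i] - fit[i] for i in range(n)]
        best_j, best_b, best_sse = 0, 0.0, float("inf")
        for j in range(p):
            xj = [row[j] for row in X]
            sxx = sum(v * v for v in xj) or 1e-12
            b = sum(xj[i] * resid[i] for i in range(n)) / sxx
            sse = sum((resid[i] - b * xj[i]) ** 2 for i in range(n))
            if sse < best_sse:
                best_j, best_b, best_sse = j, b, sse
        coef[best_j] += nu * best_b  # small step toward the best base learner
        for i in range(n):
            fit[i] += nu * best_b * X[i][best_j]
    return coef
```

With an informative first covariate and an orthogonal noise covariate, only the first coefficient is ever updated.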

2.
Neuroimaging data often take the form of high-dimensional arrays, also known as tensors. Addressing scientific questions arising from such data demands new regression models that take multidimensional arrays as covariates. Simply turning an image array into a vector would both cause extremely high dimensionality and destroy the inherent spatial structure of the array. In a recent work, Zhou et al. (J Am Stat Assoc, 108(502):540–552, 2013) proposed a family of generalized linear tensor regression models based upon the CP (CANDECOMP/PARAFAC) decomposition of the regression coefficient array. Low-rank approximation brings the ultrahigh dimensionality to a manageable level and leads to efficient estimation. In this article, we propose a tensor regression model based on the more flexible Tucker decomposition. Compared to the CP model, the Tucker regression model allows a different number of factors along each mode. Such flexibility leads to several advantages that are particularly suited to neuroimaging analysis, including further reduction of the number of free parameters, accommodation of images with skewed dimensions, explicit modeling of interactions, and a principled way of image downsizing. We also compare the Tucker model with CP numerically on both simulated data and real magnetic resonance imaging data, and demonstrate its effectiveness in finite sample performance.
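The parameter-count argument can be made concrete with a back-of-the-envelope calculation (an illustrative count that ignores scaling and identifiability constraints in both decompositions):

```python
from math import prod

def cp_params(dims, rank):
    # Rank-R CP: one p_d x R factor matrix per mode.
    return rank * sum(dims)

def tucker_params(dims, ranks):
    # Tucker: an R1 x ... x RD core tensor plus one p_d x R_d factor per mode.
    return prod(ranks) + sum(p * r for p, r in zip(dims, ranks))
```

For an image with skewed dimensions such as 256 × 256 × 16, a rank-3 CP coefficient array uses 1584 free parameters, while a Tucker model with mode ranks (3, 3, 1) uses 1561 and can devote fewer factors to the thin mode.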

3.
For independent data, the non-parametric bootstrap is realised by resampling the data with replacement. This approach fails for dependent data such as time series. If the data-generating process is at least stationary and mixing, the blockwise bootstrap, which draws subsamples or blocks of consecutive observations, rescues the approach. For the blockwise bootstrap a block length has to be selected; we propose a method for selecting the optimal block length. To improve the finite-sample properties of the blockwise bootstrap, studentised statistics are considered. If the statistic can be represented as a smooth function model, this studentisation can be approximated efficiently. The studentised blockwise bootstrap method is applied to testing hypotheses on medical time series.
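The block-resampling step can be sketched as a moving-block bootstrap (block-length selection and studentisation, the paper's actual contributions, are not shown):

```python
import random

def block_bootstrap(series, block_len, rng=None):
    """Moving-block bootstrap: draw overlapping blocks of consecutive
    observations with replacement and concatenate them, preserving the
    short-range dependence that plain i.i.d. resampling would destroy."""
    rng = rng or random.Random()
    n = len(series)
    out = []
    while len(out) < n:
        start = rng.randrange(n - block_len + 1)  # random block start
        out.extend(series[start:start + block_len])
    return out[:n]  # trim to the original length
```

Each resampled series has the original length and consists of contiguous runs of the original data.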

4.
The application of Gibbs sampling is considered for inference in a mixed inheritance model in animal populations. Implementation of the Gibbs sampler on scalar components, as used for human populations, proved not to be efficient, and an approach with blockwise sampling of genotypes was proposed for use in animal populations. The blockwise sampling, by which genotypes of a sire and its final progeny were sampled jointly, was effective in improving mixing, although further improvements could be sought. Posterior densities of parameters were visualised from the Gibbs samples; from these, highly marginalised Bayesian point and interval estimates can be obtained.

5.
Median regression with censored cost data (cited 2 times: 0 self-citations, 2 by others)
Bang H, Tsiatis AA. Biometrics 2002, 58(3):643-649
Because of the skewness of the distribution of medical costs, we consider modeling the median as well as other quantiles when establishing regression relationships to covariates. In many applications, the medical cost data are also right censored. In this article, we propose semiparametric procedures for estimating the parameters in median regression models based on weighted estimating equations when censoring is present. Numerical studies are conducted to show that our estimators perform well with small samples and the resulting inference is reliable in circumstances of practical importance. The methods are applied to a dataset for medical costs of patients with colorectal cancer.
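The weighting idea behind such estimating equations can be sketched as follows: uncensored observations are up-weighted by the inverse of a censoring survival probability (assumed known here for illustration), and a weighted median solves the resulting equation. This is only a sketch; the paper's estimators estimate the weights from the data (e.g. via Kaplan–Meier) and handle general quantiles and covariates.

```python
def weighted_median(values, weights):
    # Smallest v whose cumulative weight reaches half the total weight,
    # i.e. a root of sum_i w_i * (1{v_i <= m} - 1/2).
    pairs = sorted(zip(values, weights))
    half = sum(weights) / 2.0
    acc = 0.0
    for v, w in pairs:
        acc += w
        if acc >= half:
            return v

def ipcw_median(costs, uncensored, cens_surv):
    # Inverse-probability-of-censoring weighting: only complete cases
    # enter, each weighted by 1 / P(still uncensored at its cost level).
    # cens_surv holds assumed/known censoring survival probabilities.
    vals = [c for c, d in zip(costs, uncensored) if d]
    wts = [1.0 / s for d, s in zip(uncensored, cens_surv) if d]
    return weighted_median(vals, wts)
```

With no censoring (all weights 1) this reduces to the ordinary sample median.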

6.
Generalized estimating equations (GEE) are widely adopted for regression modeling of longitudinal data, taking account of potential correlations within the same subjects. Although the standard GEE assumes common regression coefficients among all subjects, such an assumption may not be realistic when there is potential heterogeneity in regression coefficients among subjects. In this paper, we develop a flexible and interpretable approach, called grouped GEE analysis, for modeling longitudinal data while allowing heterogeneity in regression coefficients. The proposed method assumes that the subjects are divided into a finite number of groups and that subjects within the same group share the same regression coefficients. We provide a simple algorithm for grouping subjects and estimating the regression coefficients simultaneously, and show the asymptotic properties of the proposed estimator. The number of groups can be determined by cross-validation with averaging. We demonstrate the proposed method through simulation studies and an application to a real data set.
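A toy version of the grouping idea: fit a per-subject slope, then split subjects into two groups by minimising the within-group sum of squares of the slopes. The actual method groups subjects and solves the estimating equations jointly; this sketch only conveys the "subjects in a group share a coefficient" structure.

```python
def subject_slope(points):
    # Per-subject least-squares slope of y on t, points = [(t, y), ...].
    n = len(points)
    mt = sum(t for t, _ in points) / n
    my = sum(y for _, y in points) / n
    num = sum((t - mt) * (y - my) for t, y in points)
    den = sum((t - mt) ** 2 for t, _ in points)
    return num / den

def split_two_groups(slopes):
    # Best split of sorted slopes into two groups by within-group SS;
    # returns a 0/1 group label per subject.
    order = sorted(range(len(slopes)), key=lambda i: slopes[i])
    best = None
    for cut in range(1, len(slopes)):
        g1 = [slopes[i] for i in order[:cut]]
        g2 = [slopes[i] for i in order[cut:]]
        ss = sum((v - sum(g1) / len(g1)) ** 2 for v in g1) \
           + sum((v - sum(g2) / len(g2)) ** 2 for v in g2)
        if best is None or ss < best[0]:
            best = (ss, set(order[:cut]))
    return [0 if i in best[1] else 1 for i in range(len(slopes))]
```

Subjects whose trajectories rise at clearly different rates end up in different groups.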

7.
Studies of the relationship between trace elements and diseases often need to build a classification or regression model, and the accuracy of such a model is of particular importance and directly determines its applicability. The goal of this study is to explore the feasibility of applying boosting, a strategy from machine learning, to model the relationship between trace elements and diseases. Two examples are employed to illustrate the technique in classification and regression applications, respectively. The first example involves the diagnosis of anorexia according to the concentrations of six elements (a classification task); a decision stump and a support vector machine are used as the weak/base algorithm and the reference algorithm, respectively. The second example involves the prediction of breast cancer mortality based on the intake of trace elements (a regression task); here, partial least squares serves as both the weak/base algorithm and the reference algorithm. The results from both examples confirm the potential of boosting for modeling the relationship between trace elements and diseases.
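A minimal AdaBoost with decision stumps, the weak learner named in the first example, can be sketched as follows (the SVM and partial least squares reference models are not shown; labels are coded ±1):

```python
import math

def stump_fit(X, y, w):
    # Best one-feature threshold classifier under sample weights w.
    best = None
    n = len(y)
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] <= t else -sign for row in X]
                err = sum(w[i] for i in range(n) if pred[i] != y[i])
                if best is None or err < best[0]:
                    best = (err, j, t, sign)
    return best

def adaboost(X, y, rounds=10):
    n = len(y)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, t, sign = stump_fit(X, y, w)
        err = max(err, 1e-12)          # avoid log(0) on separable data
        if err >= 0.5:
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, j, t, sign))
        for i in range(n):             # up-weight misclassified samples
            pred = sign if X[i][j] <= t else -sign
            w[i] *= math.exp(-alpha * y[i] * pred)
        s = sum(w)
        w = [v / s for v in w]
    return ensemble

def predict(ensemble, x):
    score = sum(a * (s if x[j] <= t else -s) for a, j, t, s in ensemble)
    return 1 if score >= 0 else -1
```

On a small separable set the ensemble reproduces the labels exactly.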

8.
Antigenic characterization based on serological data, such as the Hemagglutination Inhibition (HI) assay, is one of the routine procedures for influenza vaccine strain selection. In many cases, it would be impossible to measure all pairwise antigenic correlations between testing antigens and reference antisera in each individual experiment. Thus, we have to combine and integrate the HI tables from a number of individual experiments. Measurements from different experiments may be inconsistent due to different experimental conditions. Consequently we will observe a matrix with missing data and possibly inconsistent measurements. In this paper, we develop a new mathematical model, which we refer to as Joint Matrix Completion and Filtering, for HI data integration. In this approach, we simultaneously handle the incompleteness and uncertainty of observations by assuming that the underlying merged HI data matrix has low rank, as well as carefully modeling different levels of noises in each individual table. An efficient blockwise coordinate descent procedure is developed for optimization. The performance of our approach is validated on synthetic and real influenza datasets. The proposed joint matrix completion and filtering model can be adapted as a general model for biological data integration, targeting data noises and missing values within and across experiments.
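The low-rank completion idea can be sketched in its simplest rank-1 form with alternating least squares on the observed cells (the paper's model additionally handles table-specific noise levels and filtering, which this sketch omits):

```python
def als_rank1(M, iters=100):
    """Fit M[i][j] ~ u[i] * v[j] by alternating least squares,
    using only the observed entries (None marks a missing cell),
    then return the completed matrix u v'."""
    m, n = len(M), len(M[0])
    u, v = [1.0] * m, [1.0] * n
    for _ in range(iters):
        for i in range(m):  # update row factors given v
            num = sum(M[i][j] * v[j] for j in range(n) if M[i][j] is not None)
            den = sum(v[j] ** 2 for j in range(n) if M[i][j] is not None)
            u[i] = num / den
        for j in range(n):  # update column factors given u
            num = sum(M[i][j] * u[i] for i in range(m) if M[i][j] is not None)
            den = sum(u[i] ** 2 for i in range(m) if M[i][j] is not None)
            v[j] = num / den
    return [[u[i] * v[j] for j in range(n)] for i in range(m)]
```

On a matrix whose observed entries are exactly rank 1, the hidden entry is recovered exactly.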

9.
Staniswalis JG. Biometrics 2008, 64(4):1054-1061
Summary: Nonparametric regression models are proposed in the framework of ecological inference for exploratory modeling of disease prevalence rates adjusted for variables, such as age, ethnicity/race, and socio-economic status. Ecological inference is needed when a response variable and covariate are not available at the subject level because only summary statistics are available for the reporting unit, for example, in the form of R × C tables. In this article, only the marginal counts are assumed available in the sample of R × C contingency tables for modeling the joint distribution of counts. A general form for the ecological regression model is proposed, whereby certain covariates are included as a varying coefficient regression model, whereas others are included as a functional linear model. The nonparametric regression curves are modeled as splines fit by penalized weighted least squares. A data-driven selection of the smoothing parameter is proposed using the pointwise maximum squared bias computed from averaging kernels (explained by O'Sullivan, 1986, Statistical Science 1, 502-517). Analytic expressions for bias and variance are provided that could be used to study the rates of convergence of the estimators. Instead, this article focuses on demonstrating the utility of the estimators in a study of disparity in health outcomes by ethnicity/race.

10.

Background

Period 10 dinucleotides are structurally and functionally validated factors that influence the ability of DNA to form nucleosomes, histone core octamers. Robust identification of periodic signals in DNA sequences is therefore required to understand nucleosome organisation in genomes. While various techniques for identifying periodic components in genomic sequences have been proposed or adopted, the requirements for such techniques have not been considered in detail and confirmatory testing for a priori specified periods has not been developed.

Results

We compared the estimation accuracy and suitability for confirmatory testing of autocorrelation, discrete Fourier transform (DFT), integer period discrete Fourier transform (IPDFT) and a previously proposed Hybrid measure. A number of different statistical significance procedures were evaluated but a blockwise bootstrap proved superior. When applied to synthetic data whose period-10 signal had been eroded, or for which the signal was approximately period-10, the Hybrid technique exhibited superior properties during exploratory period estimation. In contrast, confirmatory testing using the blockwise bootstrap procedure identified IPDFT as having the greatest statistical power. These properties were validated on yeast sequences defined from a ChIP-chip study where the Hybrid metric confirmed the expected dominance of period-10 in nucleosome associated DNA but IPDFT identified more significant occurrences of period-10. Application to the whole genomes of yeast and mouse identified ~ 21% and ~ 19% respectively of these genomes as spanned by period-10 nucleosome positioning sequences (NPS).
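An IPDFT-style statistic evaluates the spectral power of a sequence at a single candidate period, which makes confirmatory testing for an a priori period (here, 10) straightforward. The sketch below is a simplification; the paper's actual IPDFT, Hybrid, and autocorrelation measures differ in detail.

```python
import cmath

def period_power(signal, period):
    # Squared magnitude of the DFT component at frequency 1/period,
    # normalised by the sequence length.
    n = len(signal)
    coeff = sum(x * cmath.exp(-2j * cmath.pi * k / period)
                for k, x in enumerate(signal))
    return abs(coeff) ** 2 / n

def dinucleotide_indicator(seq, pairs=("AA", "TT", "TA")):
    # Binary indicator of NPS dinucleotides along a DNA sequence.
    return [1 if seq[i:i + 2] in pairs else 0 for i in range(len(seq) - 1)]
```

A perfectly period-10 indicator sequence concentrates its power at period 10, and the block bootstrap described above can then supply a null distribution for this statistic.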

Conclusions

For estimating the dominant period, we find the Hybrid period estimation method empirically to be the most effective for both eroded and approximate periodicity. The blockwise bootstrap was found to be effective as a significance measure, performing particularly well in the problem of period detection in the presence of eroded periodicity. The autocorrelation method was identified as poorly suited for use with the blockwise bootstrap. Application of our methods to the genomes of two model organisms revealed a striking proportion of the yeast and mouse genomes are spanned by NPS. Despite their markedly different sizes, roughly equivalent proportions (19-21%) of the genomes lie within period-10 spans of the NPS dinucleotides {AA, TT, TA}. The biological significance of these regions remains to be demonstrated. To facilitate this, the genomic coordinates are available as Additional files 1, 2, and 3 in a format suitable for visualisation as tracks on popular genome browsers.

Reviewers

This article was reviewed by Prof Tomas Radivoyevitch, Dr Vsevolod Makeev (nominated by Dr Mikhail Gelfand), and Dr Rob D Knight.

11.
The intrinsic pK values, as well as the free fractions of sodium and calcium counterions, were determined for salt-free solutions of amidated pectinates and amidated pectates. The apparent pK values did not depend on the degree of amidation but only on the effective charge density of the pectic polymers, and a unique value of 2.9 ± 0.1 was found for the intrinsic pK. The results obtained by conductimetry and with (sodium and calcium) specific electrodes showed a blockwise distribution of amide and acid groups in amidated pectates, and a blockwise distribution of amide groups with a rather statistical distribution of acid groups in amidated pectinates.

12.
A simple method was developed that enabled the enzymatic determination of the galactose distribution in galactomannans. endo-Mannanase of Aspergillus niger was used to degrade the galactomannan polymers, and the degradation products were determined with high-performance anion-exchange chromatography. A whole range of commercial high-to-low substituted galactomannans was analyzed in this way. It was found that differences in the anion-exchange chromatograms reflected dissimilarities in the distribution of galactose and could be used directly to discern these dissimilarities. The differences among the various elution profiles were used to construct a similarity distance tree. In addition to this approach, the absolute amount of non-substituted mannose released by the enzyme was found to be a good discriminating factor. In this way, galactomannans with regular, blockwise, and randomly distributed galactose could be discerned. All guars and the highly substituted gum of Prosopis juliflora were found to have a blockwise distribution of galactose. For different batches of tara gum both random and blockwise distributions were found. Among batches of locust bean gum the greatest variation was observed: random, blockwise, and ordered galactose distributions were all present. Cassia gum was found to have a highly regular distribution of galactose.

13.
A strain energy function for finite deformations is developed that has the capability to describe the nonlinear, anisotropic, and asymmetric mechanical response that is typical of articular cartilage. In particular, the bimodular feature is employed by including strain energy terms that are only mechanically active when the corresponding fiber directions are in tension. Furthermore, the strain energy function is a polyconvex function of the deformation gradient tensor so that it meets material stability criteria. A novel feature of the model is the use of bimodular and polyconvex "strong interaction terms" for the strain invariants of orthotropic materials. Several regression analyses are performed using a hypothetical experimental dataset that captures the anisotropic and asymmetric behavior of articular cartilage. The results suggest that the main advantage of a model employing the strong interaction terms is to provide the capability for modeling anisotropic and asymmetric Poisson's ratios, as well as axial stress-axial strain responses, in tension and compression for finite deformations.

14.
Statistical models are simple mathematical rules derived from empirical data describing the association between an outcome and several explanatory variables. In a typical modeling situation statistical analysis often involves a large number of potential explanatory variables and frequently only partial subject-matter knowledge is available. Therefore, selecting the most suitable variables for a model in an objective and practical manner is usually a non-trivial task. We briefly revisit the purposeful variable selection procedure suggested by Hosmer and Lemeshow, which combines significance and change-in-estimate criteria for variable selection, and critically discuss the change-in-estimate criterion. We show that using a significance-based threshold for the change-in-estimate criterion reduces to a simple significance-based selection of variables, as if the change-in-estimate criterion is not considered at all. Various extensions to the purposeful variable selection procedure are suggested. We propose to use backward elimination augmented with a standardized change-in-estimate criterion on the quantity of interest usually reported and interpreted in a model for variable selection. Augmented backward elimination has been implemented in a SAS macro for linear, logistic and Cox proportional hazards regression. The algorithm and its implementation were evaluated by means of a simulation study. Augmented backward elimination tends to select larger models than backward elimination and approximates the unselected model up to negligible differences in point estimates of the regression coefficients. On average, regression coefficients obtained after applying augmented backward elimination were less biased relative to the coefficients of correctly specified models than after backward elimination. In summary, we propose augmented backward elimination as a reproducible variable selection algorithm that gives the analyst more flexibility in adapting model selection to a specific statistical modeling situation.
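The change-in-estimate idea can be sketched as follows: a candidate variable is removed only if dropping it changes the coefficient of the variable of interest by less than a relative tolerance. This is a crude stand-in, not the authors' procedure: the significance screen, the standardization of the criterion, and the SAS macro are not reproduced, and the OLS solver below omits an intercept for brevity.

```python
def ols(X, y):
    # Solve the normal equations (X'X) b = X'y by Gaussian elimination.
    p = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    for c in range(p):
        piv = max(range(c, p), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for r in range(c + 1, p):
            f = A[r][c] / A[c][c]
            for k in range(c, p):
                A[r][k] -= f * A[c][k]
            b[r] -= f * b[c]
    coef = [0.0] * p
    for c in reversed(range(p)):
        coef[c] = (b[c] - sum(A[c][k] * coef[k] for k in range(c + 1, p))) / A[c][c]
    return coef

def change_in_estimate_elimination(X, y, keep, tau=0.05):
    # Drop a column only if removing it moves the coefficient of the
    # column of interest (`keep`) by less than a relative tolerance tau.
    cols = list(range(len(X[0])))
    changed = True
    while changed:
        changed = False
        full = ols([[r[c] for c in cols] for r in X], y)
        ref = full[cols.index(keep)]
        for c in cols:
            if c == keep:
                continue
            sub = [cc for cc in cols if cc != c]
            reduced = ols([[r[cc] for cc in sub] for r in X], y)
            if abs(reduced[sub.index(keep)] - ref) <= tau * max(abs(ref), 1e-12):
                cols = sub
                changed = True
                break
    return cols
```

A column orthogonal to everything else leaves the coefficient of interest untouched and is dropped; a confounder that shifts it is retained.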

15.
16.
We calculate the many-body, nonpairwise interaction between N rigid, anisotropic membrane inclusions by modeling them as point-like constraints on the membrane's curvature tensor and by minimizing the membrane's curvature energy. Because higher-order multipolar distortions decay over very short distances, our calculation gives the correct elastic interaction energy for inclusions separated by distances of the order of several times their size. As an application, we show, by thermally equilibrating the many-body elastic energy using a Monte Carlo algorithm, that inclusions shaped as "saddles" attract each other and build an "egg-carton" structure. The latter is reminiscent of some patterns observed in membranes obtained from biological extracts, the origin of which is still mysterious.

17.
G C Wei, M A Tanner. Biometrics 1991, 47(4):1297-1309
The first part of the article reviews the Data Augmentation algorithm and presents two approximations to the Data Augmentation algorithm for the analysis of missing-data problems: the Poor Man's Data Augmentation algorithm and the Asymptotic Data Augmentation algorithm. These two algorithms are then implemented in the context of censored regression data to obtain semiparametric methodology. The performances of the censored regression algorithms are examined in a simulation study. It is found, up to the precision of the study, that the bias of both the Poor Man's and Asymptotic Data Augmentation estimators, as well as the Buckley-James estimator, does not appear to differ from zero. However, with regard to mean squared error, over a wide range of settings examined in this simulation study, the two Data Augmentation estimators have a smaller mean squared error than does the Buckley-James estimator. In addition, associated with the two Data Augmentation estimators is a natural device for estimating the standard error of the estimated regression parameters. It is shown how this device can be used to estimate the standard error of either Data Augmentation estimate of any parameter (e.g., the correlation coefficient) associated with the model. In the simulation study, the estimated standard error of the Asymptotic Data Augmentation estimate of the regression parameter is found to be congruent with the Monte Carlo standard deviation of the corresponding parameter estimate. The algorithms are illustrated using the updated Stanford heart transplant data set.

18.
Alternative search strategies for the directed evolution of proteins are presented and compared with each other. In particular, two different machine learning strategies based on partial least-squares regression are developed: the first contains only linear terms that represent a given residue's independent contribution to fitness, the second contains additional nonlinear terms to account for potential epistatic coupling between residues. The nonlinear modeling strategy is further divided into two types, one that contains all possible nonlinear terms and another that makes use of a genetic algorithm to select a subset of important interaction terms. The performance of each modeling type as a function of training set size is analysed. Simulated molecular evolution on a synthetic protein landscape shows that the use of machine learning techniques to guide library design can be a powerful addition to library generation methods such as DNA shuffling.
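The distinction between linear (independent-contribution) and nonlinear (epistatic) terms can be illustrated on a two-site toy landscape, where the interaction coefficient is identified by a simple contrast. This is only a sketch of the term structure, not the partial least-squares machinery of the paper.

```python
def epistasis_decomposition(f):
    """Decompose a fully measured two-site fitness landscape f[(a, b)],
    with residue states a, b in {0, 1}, into a mean, two main effects,
    and a pairwise interaction (epistatic) term.

    In +/-1 coding s = 2x - 1 the model is
        f(a, b) = mean + ea * s_a + eb * s_b + eab * s_a * s_b,
    so a purely additive landscape has eab == 0.
    """
    mean = sum(f.values()) / 4
    ea = (f[1, 0] + f[1, 1] - f[0, 0] - f[0, 1]) / 4
    eb = (f[0, 1] + f[1, 1] - f[0, 0] - f[1, 0]) / 4
    eab = (f[0, 0] + f[1, 1] - f[0, 1] - f[1, 0]) / 4
    return mean, ea, eb, eab
```

Perturbing the double mutant away from additivity shows up entirely in the interaction term.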

19.
The article presents modeling of daily average ozone level prediction by means of neural networks, support vector regression, and methods based on uncertainty. Based on data measured by a monitoring station in the Pardubice micro-region, Czech Republic, and on optimization of the number of parameters by a defined objective function and a genetic algorithm, a model for predicting the daily average ozone level at a given time has been designed and optimized with respect to its input parameters. The goal of prediction by the various methods was to compare their results, with the aim of making recommendations to micro-regional public administration management. The modeling uses feed-forward perceptron-type neural networks, time-delay neural networks, radial basis function neural networks, ε-support vector regression, fuzzy inference systems, and Takagi–Sugeno intuitionistic fuzzy inference systems. Special attention is paid to the adaptation of the Takagi–Sugeno intuitionistic fuzzy inference system and the adaptation of fuzzy logic-based systems using evolutionary algorithms. Based on the data obtained, the daily average ozone level prediction is characterized by its root mean squared error. The best results were obtained with ε-support vector regression using polynomial kernel functions and with Takagi–Sugeno intuitionistic fuzzy inference systems adapted by means of a Kalman filter.

20.
Wood SN. Biometrics 2006, 62(4):1025-1036
A general method for constructing low-rank tensor product smooths for use as components of generalized additive models or generalized additive mixed models is presented. A penalized regression approach is adopted in which tensor product smooths of several variables are constructed from smooths of each variable separately, these "marginal" smooths being represented using a low-rank basis with an associated quadratic wiggliness penalty. The smooths offer several advantages: (i) they have one wiggliness penalty per covariate and are hence invariant to linear rescaling of covariates, making them useful when there is no "natural" way to scale covariates relative to each other; (ii) they have a useful tuneable range of smoothness, unlike single-penalty tensor product smooths that are scale invariant; (iii) the relatively low rank of the smooths means that they are computationally efficient; (iv) the penalties on the smooths are easily interpretable in terms of function shape; (v) the smooths can be generated completely automatically from any marginal smoothing bases and associated quadratic penalties, giving the modeler considerable flexibility to choose the basis penalty combination most appropriate to each modeling task; and (vi) the smooths can easily be written as components of a standard linear or generalized linear mixed model, allowing them to be used as components of the rich family of such models implemented in standard software, and to take advantage of the efficient and stable computational methods that have been developed for such models. A small simulation study shows that the methods can compare favorably with recently developed smoothing spline ANOVA methods.
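The construction of a tensor product basis from marginal bases can be sketched as a row-wise Kronecker product: each observation's tensor-product basis row is the flattened outer product of its marginal basis rows. Polynomial marginals stand in for the low-rank spline bases here, and the penalty construction, which is the heart of the method, is omitted.

```python
def poly_basis(x, degree):
    # Toy marginal basis: 1, x, ..., x^degree evaluated at each point.
    return [[v ** d for d in range(degree + 1)] for v in x]

def tensor_product_rows(B1, B2):
    # Row-wise Kronecker product of two marginal design matrices:
    # row i of the result spans all products of row i's basis functions.
    return [[a * b for a in r1 for b in r2] for r1, r2 in zip(B1, B2)]
```

With degree-1 marginals in x and z, the resulting columns are the bivariate basis functions 1, z, x, and xz evaluated at each observation.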


Copyright © Beijing Qinyun Technology Development Co., Ltd. (北京勤云科技发展有限公司), 京ICP备09084417号