Similar Literature
20 similar documents found.
1.
Sightability models are binary logistic-regression models used to estimate and adjust for visibility bias in wildlife-population surveys. Like many models in wildlife and ecology, sightability models are typically developed from small observational datasets with many candidate predictors. Aggressive model-selection methods are often employed to choose a best model for prediction and effect estimation, despite evidence that such methods can lead to overfitting (i.e., selected models may describe random error or noise rather than true predictor–response curves) and poor predictive ability. We used moose (Alces alces) sightability data from northeastern Minnesota (2005–2007) as a case study to illustrate an alternative approach, which we refer to as degrees-of-freedom (df) spending: sample-size guidelines are used to determine an acceptable level of model complexity and then a pre-specified model is fit to the data and used for inference. For comparison, we also constructed sightability models using Akaike's Information Criterion (AIC) step-down procedures and model averaging (based on a small set of models developed using df-spending guidelines). We used bootstrap procedures to mimic the process of model fitting and prediction, and to compute an index of overfitting, expected predictive accuracy, and model-selection uncertainty. The index of overfitting increased 13% when the number of candidate predictors was increased from three to eight and a best model was selected using step-down procedures. Likewise, model-selection uncertainty increased when the number of candidate predictors increased. Model averaging (based on R = 30 models with 1–3 predictors) effectively shrank regression coefficients toward zero and produced estimates of precision similar to those from our 3-df pre-specified model. As such, model averaging may help to guard against overfitting when too many predictors are considered (relative to available sample size). The set of candidate models will influence the extent to which coefficients are shrunk toward zero, which has implications for how one might apply model averaging to problems traditionally approached using variable-selection methods. We often recommend the df-spending approach in our consulting work because it is easy to implement and it naturally forces investigators to think carefully about their models and predictors. Nonetheless, similar concepts should apply whether one is fitting 1 model or using multi-model inference. For example, model-building decisions should consider the effective sample size, and potential predictors should be screened (without looking at their relationship to the response) for missing data, narrow distributions, collinearity, potentially overly influential observations, and measurement errors (e.g., via logical error checks). © 2011 The Wildlife Society.
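
The df-spending versus step-down comparison lends itself to a small simulation. The sketch below is an illustration, not the authors' code or their exact overfitting index: it bootstraps an AIC step-down selection for a logistic model and measures optimism as the gap between apparent and test log-likelihood. All data, sample sizes, and predictor counts are invented.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n, p = 124, 8                               # small survey, many candidate predictors
X = rng.normal(size=(n, p))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.8 * X[:, 0] - 0.5 * X[:, 1]))))

def aic_step_down(X, y):
    """Drop the predictor whose removal most improves AIC until none helps."""
    cols = list(range(X.shape[1]))
    best = sm.Logit(y, sm.add_constant(X[:, cols])).fit(disp=0).aic
    improved = True
    while improved and len(cols) > 1:
        improved = False
        for c in list(cols):
            trial = [k for k in cols if k != c]
            aic = sm.Logit(y, sm.add_constant(X[:, trial])).fit(disp=0).aic
            if aic < best:
                best, cols, improved = aic, trial, True
                break
    return cols

# Refit on bootstrap resamples and evaluate on the original data; the gap
# between apparent and test log-likelihood (optimism) indexes overfitting.
gaps = []
for _ in range(50):
    idx = rng.integers(0, n, n)
    cols = aic_step_down(X[idx], y[idx])
    fit = sm.Logit(y[idx], sm.add_constant(X[idx][:, cols])).fit(disp=0)
    test_ll = sm.Logit(y, sm.add_constant(X[:, cols])).loglike(fit.params)
    gaps.append(fit.llf / n - test_ll / n)
print("mean optimism (per-observation log-likelihood):", np.mean(gaps))
```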

2.
MOTIVATION: Although population-based association mapping may be subject to the bias caused by population stratification, alternative methods that are robust to population stratification such as family-based linkage analysis have lower mapping resolution. Recently, various statistical methods robust to population stratification were proposed for association studies, using unrelated individuals to identify associations between candidate genes and traits of interest. The association between a candidate gene and a quantitative trait is often evaluated via a regression model with inferred population structure variables as covariates, where the residual distribution is customarily assumed to be from a symmetric and unimodal parametric family, such as a Gaussian, although this may be inappropriate for the analysis of many real-life datasets. RESULTS: In this article, we propose a new structured association (SA) test. Our method corrects for continuous population stratification by first deriving population structure and kinship matrices through a set of random genetic markers and then modeling the relationship between trait values, genotypic scores at a candidate marker and genetic background variables through a semiparametric model, where the error distribution is modeled as a mixture of Polya trees centered around a normal family of distributions. We compared our model to the existing SA tests in terms of model fit, type I error rate, power, precision and accuracy by application to a real dataset as well as simulated datasets.

3.
4.
A low-resolution scoring function for the selection of native and near-native structures from a set of predicted structures for a given protein sequence has been developed. The scoring function, ProVal (Protein Validate), uses several variables, each describing an aspect of protein structure for which the proximity to the native structure can be assessed quantitatively. Among the parameters included are a packing estimate, surface areas, and the contact order. A partial least squares (PLS) latent-variable model was built for each of the 28 decoy sets of structures generated for 22 different proteins, using the described parameters as independent variables. The Cα RMS of the candidate structures versus the experimental structure was used as the dependent variable. The final generalized scoring function was an average of all models derived, ensuring that the function was not optimized for specific fold classes or for the method used to generate the candidate folds. The results show that the crystal structure was scored best in 64% of the 28 test sets and was clearly separated from the decoys in many examples. In all of the cases in which the crystal structure did not rank first, it ranked within the top 10%. Thus, although ProVal could not distinguish between predicted structures that were similar in overall fold quality, due to its inherently low resolution, it can clearly be used as a primary filter to eliminate approximately 90% of fold candidates generated by current prediction methods from all-atom modeling and further evaluation. The correlation between the predicted and actual Cα RMS values varies considerably between the candidate fold sets.
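
As an illustration of the scoring scheme's structure — a sketch under assumed descriptors, not ProVal itself — one can fit one PLS model per decoy set, with packing-, surface-area- and contact-order-style descriptors as predictors and Cα RMS as the response, and average the models into a single score:

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
decoy_sets = []
for _ in range(5):                      # stand-in for the 28 decoy sets
    X = rng.normal(size=(60, 3))        # packing, surface area, contact order
    y = 2.0 + X @ np.array([1.5, -0.8, 0.6]) + rng.normal(0, 0.3, size=60)
    decoy_sets.append((X, y))

# one PLS model per decoy set; the final score averages all of them
models = [PLSRegression(n_components=2).fit(X, y) for X, y in decoy_sets]

def proval_like_score(descriptors):
    """Averaged PLS prediction of Cα RMS; lower = more native-like."""
    x = np.asarray(descriptors, dtype=float).reshape(1, -1)
    return float(np.mean([m.predict(x) for m in models]))

print(proval_like_score([0.1, -0.2, 0.4]))
```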

5.
Fitting parameter sets of non-linear equations in cardiac single-cell ionic models to reproduce experimental behavior is a time-consuming process. The standard procedure is to adjust maximum channel conductances in ionic models to reproduce action potentials (APs) recorded in isolated cells. However, vastly different sets of parameters can produce similar APs. Furthermore, even with an excellent AP match in the single-cell case, tissue behavior may be very different. We hypothesize that this uncertainty can be reduced by additionally fitting membrane resistance (Rm). To investigate the importance of Rm, we developed a genetic algorithm approach which incorporated Rm data calculated at a few points in the cycle, in addition to AP morphology. Performance was compared to a genetic algorithm using only AP morphology data. The optimal parameter sets and goodness of fit as computed by the different methods were compared. First, we fit an ionic model to itself, starting from a random parameter set. Next, we fit the AP of one ionic model to that of another. Finally, we fit an ionic model to experimentally recorded rabbit action potentials. Adding the extra objective (Rm at a few voltages) to the AP fit led to much better convergence. Typically, a smaller MSE (mean square error, defined as the average of the squared error between the target AP and the fitted AP) was achieved in one fifth of the number of generations compared to using only AP data. Importantly, the variability in fitted parameters was also greatly reduced, with many parameters showing an order-of-magnitude decrease in variability. Adding Rm to the objective function thus improves the robustness of fitting and better preserves tissue-level behavior, and should be incorporated into fitting procedures.
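
The essential change the authors describe is the shape of the objective function. A minimal sketch, assuming hypothetical stand-ins simulate_ap and estimate_rm for an ionic-model simulator (only the structure of the combined objective is the point here):

```python
import numpy as np

def combined_cost(params, target_ap, target_rm, simulate_ap, estimate_rm,
                  rm_weight=1.0):
    """MSE on the AP trace plus weighted MSE on Rm at a few cycle points."""
    ap_mse = np.mean((simulate_ap(params) - target_ap) ** 2)
    rm_mse = np.mean((estimate_rm(params) - target_rm) ** 2)
    return ap_mse + rm_weight * rm_mse

# toy usage with dummy stand-ins for the simulator
dummy_ap = lambda p: np.full(100, p[0])          # flat "AP" at level p[0]
dummy_rm = lambda p: np.array([p[1], 2 * p[1]])  # "Rm" at two voltages
print(combined_cost([1.0, 0.5], np.zeros(100), np.array([0.4, 0.9]),
                    dummy_ap, dummy_rm))
```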

6.
Summary The maternal age dependence of Down's syndrome rates was analyzed by two mathematical models, a discontinuous-slope (DS) model, which fits different exponential equations to different parts of the 20–49 age interval, and a CPE model, which fits a function that is the sum of a constant and an exponential term over this whole 20–49 range. The CPE model had been considered but rejected by Penrose, who preferred models postulating changes with age assuming either a power function X^10, where X is age, or a Poisson model in which the accumulation of 17 events was the assumed threshold for the occurrence of Down's syndrome. However, subsequent analyses indicated that the two models preferred by Penrose did not fit recent data sets as well as the DS or CPE model. Here we report analyses of broadened power and Poisson models in which n (the postulated number of independent events) can vary. Five data sets are analyzed. For the power models the range of the optimal n is 11 to 13; for the Poisson models it is 17 to 25. The DS, Poisson, and power models each give the best fit to one data set; the CPE, to two sets. No particular model is clearly preferable. It appears unlikely that, with a data set from any single available source, a specific etiologic hypothesis for the maternal age dependence of Down's syndrome can be clearly inferred by the use of these or similar regression models.
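
The CPE form is simple enough to state concretely. A hedged sketch of fitting rate = c + a·exp(b·(age − 20)) with scipy; the rates below are synthetic placeholders, not the paper's five data sets, and ages are offset by 20 for numerical stability:

```python
import numpy as np
from scipy.optimize import curve_fit

age = np.arange(20, 50)                                  # maternal ages 20-49
rate = 0.0006 + 1e-6 * np.exp(0.3 * (age - 20))          # synthetic rates

def cpe(x, c, a, b):
    """Constant-plus-exponential: rate = c + a * exp(b * (age - 20))."""
    return c + a * np.exp(b * (x - 20))

popt, _ = curve_fit(cpe, age, rate, p0=[1e-4, 1e-6, 0.2], maxfev=10000)
print(dict(zip(["c", "a", "b"], popt)))
```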

7.
Random Forest is a prediction technique based on growing trees on bootstrap samples of data, in conjunction with a random selection of explanatory variables to define the best split at each node. In the case of a quantitative outcome, the tree predictor takes on a numerical value. We applied Random Forest to the first replicate of the Genetic Analysis Workshop 13 simulated data set, with the sibling pairs as our units of analysis and identity by descent (IBD) at selected loci as our explanatory variables. With the knowledge of the true model, we performed two sets of analyses on three phenotypes: HDL, triglycerides, and glucose. The goal was to approach the mapping of complex traits from a multivariate perspective. The first set of analyses mimics a candidate gene approach with a high proportion of true genes among the predictors while the second set represents a genome scan analysis using microsatellite markers. Random Forest was able to identify a few of the major genes influencing the phenotypes, such as baseline HDL and triglycerides, but failed to identify the major genes regulating baseline glucose levels.
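
A minimal sketch of this kind of analysis — an assumed setup, not the workshop code: regress a quantitative sib-pair phenotype on IBD sharing across loci and read off variable importances:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(7)
n_pairs, n_loci = 400, 50
ibd = rng.integers(0, 3, size=(n_pairs, n_loci))   # 0/1/2 alleles shared IBD
trait = 0.8 * ibd[:, 3] - 0.5 * ibd[:, 17] + rng.normal(size=n_pairs)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(ibd, trait)
top = np.argsort(rf.feature_importances_)[::-1][:5]
print("top loci by importance:", top)              # should recover loci 3, 17
```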

8.
The information content of previously published peptide sets was compared with that of smaller sets of peptides selected according to statistical designs. It was found that minimum analogue peptide sets (MAPS) constructed by factorial or fractional factorial designs in physicochemical properties contained substantial structure-activity information. Although five to six times smaller than the originally published peptide sets, the MAPS resulted in QSAR models able to predict biological activity. The QSARs derived from a MAPS of nine dipeptides and from a set of 58 dipeptides inhibiting angiotensin converting enzyme were compared and found to be of equal strength. Furthermore, for a set of bitter-tasting dipeptides it was found that an incomplete MAPS of 10 dipeptides gave just as good a model as the one based on a set of 48 dipeptides. By comparison, other non-designed sets of peptides gave QSARs with poor predictive power. It was also demonstrated how MAPS centered on a lead peptide can be constructed so as to specifically explore the physicochemical and biological properties in the vicinity of the lead. It was concluded that small, information-rich peptide sets (MAPS) can be constructed on the basis of statistical designs with principal properties of amino acids as design variables.
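
To make the design idea concrete, here is a small illustration — not the paper's code — of a two-level factorial design in three assumed amino-acid property scales (e.g. hydrophobicity, size, electronic character), plus a half-fraction defined by I = ABC:

```python
import itertools
import numpy as np

full = np.array(list(itertools.product([-1, 1], repeat=3)))  # 2^3 full factorial
# half-fraction 2^(3-1): keep runs satisfying the defining relation I = ABC,
# i.e. those whose factor levels multiply to +1
half = full[full.prod(axis=1) == 1]
print(full.shape, half.shape)                                # (8, 3) (4, 3)
```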

9.
Summary Several statistical methods for detecting associations between quantitative traits and candidate genes in structured populations have been developed for fully observed phenotypes. However, many experiments are concerned with failure-time phenotypes, which are usually subject to censoring. In this article, we propose statistical methods for detecting associations between a censored quantitative trait and candidate genes in structured populations with complex multiple levels of genetic relatedness among sampled individuals. The proposed methods correct for continuous population stratification using both population structure variables as covariates and the frailty terms attributable to kinship. The relationship between the time-at-onset data and genotypic scores at a candidate marker is modeled via a parametric Weibull frailty accelerated failure time (AFT) model as well as a semiparametric frailty AFT model, where the baseline survival function is flexibly modeled as a mixture of Polya trees centered around a family of Weibull distributions. For both parametric and semiparametric models, the frailties are modeled via an intrinsic Gaussian conditional autoregressive prior distribution with the kinship matrix being the adjacency matrix connecting subjects. Simulation studies and applications to the Arabidopsis thaliana line flowering time data sets demonstrated the advantage of the new proposals over existing approaches.

10.

Background  

DNA microarray technology has emerged as a major tool for exploring cancer biology and solving clinical issues. Predicting a patient's response to chemotherapy is one such issue; successful prediction would make it possible to give patients the most appropriate chemotherapy regimen. Patient response can be classified as either a pathologic complete response (PCR) or residual disease (NoPCR), and these strongly correlate with patient outcome. Microarrays can be used as multigenic predictors of patient response, but probe selection remains problematic. In this study, each probe set was considered as an elementary predictor of the response and was ranked on its ability to predict a high number of PCR and NoPCR cases in a ratio similar to that seen in the learning set. We defined a valuation function that assigned high values to probe sets according to how differently the genes were expressed and how closely the relative proportions of PCR and NoPCR predictions matched the proportions observed in the learning set. Multigenic predictors were designed by selecting probe sets highly ranked in their predictions and were tested using several validation sets.

11.
Statistical assessment of candidate gene effects can be viewed as a problem of variable selection and model comparison. Given a certain number of genes to be considered, many possible models may fit to the data well, each including a specific set of gene effects and possibly their interactions. The question arises as to which of these models is most plausible. Inference about candidate gene effects based on a specific model ignores uncertainty about model choice. Here, a Bayesian model averaging approach is proposed for evaluation of candidate gene effects. The method is implemented through simultaneous sampling of multiple models. By averaging over a set of competing models, the Bayesian model averaging approach incorporates model uncertainty into inferences about candidate gene effects. Features of the method are demonstrated using a simulated data set with ten candidate genes under consideration.
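
The flavor of model averaging can be shown with a toy example. The paper samples models simultaneously by MCMC; the sketch below instead uses BIC weights as a common approximation to posterior model probabilities, with simulated data and fewer genes than the paper's ten:

```python
import itertools
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
n, g = 200, 4                               # toy setup: 4 candidate genes
X = rng.normal(size=(n, g))
y = 1.0 + 0.6 * X[:, 0] + rng.normal(size=n)

bics, betas = [], []
for r in range(g + 1):                      # all 2^g candidate models
    for subset in itertools.combinations(range(g), r):
        design = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
        fit = sm.OLS(y, design).fit()
        beta = np.zeros(g)
        beta[list(subset)] = fit.params[1:]
        bics.append(fit.bic)
        betas.append(beta)

w = np.exp(-(np.array(bics) - min(bics)) / 2)
w /= w.sum()                                # approximate posterior model probs
print("model-averaged gene effects:", np.round(w @ np.array(betas), 3))
```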

12.
A statistical approach was applied to select those models that best fit each individual mitochondrial (mt) protein at different taxonomic levels of metazoans. The existing mitochondrial replacement matrices, MtREV and MtMam, were found to be the best-fit models for the mt-proteins of vertebrates, with the exception of Nd6, at different taxonomic levels. Remarkably, existing mitochondrial matrices generally failed to best-fit invertebrate mt-proteins. In an attempt to better model the evolution of invertebrate mt-proteins, a new replacement matrix, named MtArt, was constructed based on arthropod mt-proteomes. The new model was found to best fit almost all analyzed invertebrate mt-protein data sets. The observed pattern of model fit across the different data sets indicates that no single replacement matrix is able to describe the general evolutionary properties of mt-proteins but rather that taxonomic biases and/or the existence of different mt-genetic codes have a great influence on which model is selected.

13.
A viscoelastic model of the K-BKZ (Kaye, Technical Report 134, College of Aeronautics, Cranfield 1962; Bernstein et al., Trans Soc Rheol 7: 391–410, 1963) type is developed for isotropic biological tissues and applied to the fat pad of the human heel. To facilitate this pursuit, a class of elastic solids is introduced through a novel strain-energy function whose elements possess strong ellipticity, and therefore lead to stable material models. This elastic potential – via the K-BKZ hypothesis – also produces the tensorial structure of the viscoelastic model. Candidate sets of functions are proposed for the elastic and viscoelastic material functions present in the model, including two functions whose origins lie in the fractional calculus. The Akaike information criterion is used to perform multi-model inference, enabling an objective selection to be made as to the best material function from within a candidate set. Dedicated to Prof. Ronald L. Bagley.
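
The multi-model-inference step reduces to Akaike weights over the candidate set of material functions. A short sketch with placeholder AIC values:

```python
import numpy as np

def akaike_weights(aic):
    """w_i = exp(-d_i / 2) / sum_j exp(-d_j / 2), with d_i = AIC_i - min(AIC)."""
    aic = np.asarray(aic, dtype=float)
    w = np.exp(-(aic - aic.min()) / 2)
    return w / w.sum()

print(akaike_weights([212.4, 214.1, 220.9]))  # best model gets the most weight
```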

14.
We developed an algorithm, Lever, that systematically maps metazoan DNA regulatory motifs or motif combinations to sets of genes. Lever assesses whether the motifs are enriched in cis-regulatory modules (CRMs), predicted by our PhylCRM algorithm, in the noncoding sequences surrounding the genes. Lever analysis allows unbiased inference of functional annotations to regulatory motifs and candidate CRMs. We used human myogenic differentiation as a model system to statistically assess more than 25,000 pairings of gene sets and motifs or motif combinations. We assigned functional annotations to candidate regulatory motifs predicted previously and identified gene sets that are likely to be co-regulated via shared regulatory motifs. Lever allows moving beyond the identification of putative regulatory motifs in mammalian genomes, toward understanding their biological roles. This approach is general and can be applied readily to any cell type, gene expression pattern or organism of interest.

15.
In order to evaluate the feasibility of a combined evolutionary algorithm-information theoretic approach to select the best model from a set of candidate invasive species models in ecology, and/or to evolve the most parsimonious model from a suite of competing models by comparing their relative performance, it is prudent to use a unified model that covers a myriad of situations. Using Schnute's postulates as a starting point [Schnute, J., 1981. A versatile growth model with statistically stable parameters, Can. J. Fish Aquat. Sci. 38, 1128-1140], we present a single, unified model for growth that can be successfully utilized for model selection in evolutionary computations. Depending on the parameter settings, the unified equation can describe several growth mechanisms. Such a generalized model mechanism, which encompasses a suite of competing models, can be successfully implemented in evolutionary computational algorithms to evolve the most parsimonious model that best fits ground truth data. We have done exactly this by testing the effectiveness of our reaction-diffusion-advection (RDA) model in an evolutionary computation model selection algorithm. The algorithm was validated (with success) against field data sets of the Zebra mussel invasion of Lake Champlain in the United States.
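
For reference, a sketch of Schnute's general growth equation (the a ≠ 0, b ≠ 0 case, written from the standard form in Schnute 1981; the parameter values below are invented):

```python
import numpy as np

def schnute(t, y1, y2, a, b, tau1, tau2):
    """Size at age t, given sizes y1, y2 at reference ages tau1 < tau2."""
    frac = (1 - np.exp(-a * (t - tau1))) / (1 - np.exp(-a * (tau2 - tau1)))
    return (y1**b + (y2**b - y1**b) * frac) ** (1 / b)

t = np.linspace(1, 15, 5)
# b = 1 with a > 0 reduces to a von Bertalanffy-type curve
print(schnute(t, y1=5.0, y2=60.0, a=0.3, b=1.0, tau1=1.0, tau2=15.0))
```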

16.
An improved method is described for the analysis of data obtained by the technique of labelled mitoses. It is a development of the method described by Barrett (1966) in which theoretical curves are computed on the basis of a model which assumes that the phases G1, S and G2 are described by independent log-normal distributions; the analysis consists of finding the form of this model that gives a labelled mitoses curve best fitting the available data. This fitting procedure has now been made automatic. No comprehensive indication of the goodness of fit can be given, although in the analysis of over fifty sets of data the method appears to have worked well.
A supplementary computer program is described which, on the basis of three separate assumed modes of cell loss, calculates the form of the age distributions and theoretical continuous labelling curves. This allows growth fraction to be calculated in a way which takes account of the distribution of phase durations and the non-rectangular age distributions of expanding cell populations. It also gives an opportunity to study the implications of continuous labelling data as regards the mode of cell loss.
A comparison is made between the present method of labelled mitoses curve analysis and the empirical rules which have often been used.

17.
Abiotic factors such as climate and soil determine the species fundamental niche, which is further constrained by biotic interactions such as interspecific competition. To parameterize this realized niche, species distribution models (SDMs) most often relate species occurrence data to abiotic variables, but few SDM studies include biotic predictors to help explain species distributions. Therefore, most predictions of species distributions under future climates assume implicitly that biotic interactions remain constant or exert only minor influence on large‐scale spatial distributions, which is also largely expected for species with high competitive ability. We examined the extent to which variance explained by SDMs can be attributed to abiotic or biotic predictors and how this depends on species traits. We fit generalized linear models for 11 common tree species in Switzerland using three different sets of predictor variables: biotic, abiotic, and the combination of both sets. We used variance partitioning to estimate the proportion of the variance explained by biotic and abiotic predictors, jointly and independently. Inclusion of biotic predictors improved the SDMs substantially. The joint contribution of biotic and abiotic predictors to explained deviance was relatively small (~9%) compared to the contribution of each predictor set individually (~20% each), indicating that the additional information on the realized niche brought by adding other species as predictors was largely independent of the abiotic (topo‐climatic) predictors. The influence of biotic predictors was relatively high for species preferably growing under low disturbance and low abiotic stress, species with long seed dispersal distances, species with high shade tolerance as juveniles and adults, and species that occur frequently and are dominant across the landscape. The influence of biotic variables on SDM performance indicates that community composition and other local biotic factors or abiotic processes not included in the abiotic predictors strongly influence prediction of species distributions. Improved prediction of species' potential distributions in future climates and communities may assist strategies for sustainable forest management.
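
The variance-partitioning logic can be sketched directly (with synthetic data, not the study's species or predictors): fit GLMs on the abiotic set, the biotic set, and their union, then split explained deviance into independent and joint fractions:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 500
abiotic = rng.normal(size=(n, 2))            # e.g. temperature, slope
biotic = rng.normal(size=(n, 2))             # e.g. co-occurring species
eta = 0.9 * abiotic[:, 0] + 0.7 * biotic[:, 0]
presence = rng.binomial(1, 1 / (1 + np.exp(-eta)))

def dev_explained(X):
    fit = sm.GLM(presence, sm.add_constant(X),
                 family=sm.families.Binomial()).fit()
    return 1 - fit.deviance / fit.null_deviance

d_a, d_b = dev_explained(abiotic), dev_explained(biotic)
d_ab = dev_explained(np.hstack([abiotic, biotic]))
print("independent abiotic fraction:", round(d_ab - d_b, 3))
print("independent biotic fraction :", round(d_ab - d_a, 3))
print("joint fraction              :", round(d_a + d_b - d_ab, 3))
```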

18.
It has been claimed that blending processes such as trade and exchange have always been more important in the evolution of cultural similarities and differences among human populations than the branching process of population fissioning. In this paper, we report the results of a novel comparative study designed to shed light on this claim. We fitted the bifurcating tree model that biologists use to represent the relationships of species to 21 biological data sets that have been used to reconstruct the relationships of species and/or higher level taxa and to 21 cultural data sets. We then compared the average fit between the biological data sets and the model with the average fit between the cultural data sets and the model. Given that the biological data sets can be confidently assumed to have been structured by speciation, which is a branching process, our assumption was that, if cultural evolution is dominated by blending processes, the fit between the bifurcating tree model and the cultural data sets should be significantly worse than the fit between the bifurcating tree model and the biological data sets. Conversely, if cultural evolution is dominated by branching processes, the fit between the bifurcating tree model and the cultural data sets should be no worse than the fit between the bifurcating tree model and the biological data sets. We found that the average fit between the cultural data sets and the bifurcating tree model was not significantly different from the fit between the biological data sets and the bifurcating tree model. This indicates that the cultural data sets are not less tree-like than are the biological data sets. As such, our analysis does not support the suggestion that blending processes have always been more important than branching processes in cultural evolution. We conclude from this that, rather than deciding how cultural evolution has proceeded a priori, researchers need to ascertain which model or combination of models is relevant in a particular case and why.

19.
1. The authors define a function with value 1 for the positive examples and 0 for the negative ones. They fit a continuous function but do not deal at all with the error margin of the fit, which is almost as large as the function values they compute. 2. The term "quality" for the value of the fitted function gives the impression that some biological significance is associated with values of the fitted function strictly between 0 and 1, but there is no justification for this kind of interpretation, and finding the point where the fit achieves its maximum does not make sense. 3. By neglecting the error margin, the authors try to optimize the fitted function using differences in the second, third, fourth, and even fifth decimal place, which have no statistical significance. 4. Even if such a fit could profit from more data points, the authors should first prove that the region of interest has some kind of smoothness, that is, that a continuous fit makes any sense at all. 5. "Simulated molecular evolution" is a misnomer. We are dealing here with random search. Since the margin of error is so large, the fitted function does not provide statistically significant information about the points in search space where strings with cleavage sites could be found. This implies that the method is a highly unreliable stochastic search in the space of strings, even if the neural network is capable of learning some simple correlations. 6. For this kind of problem, with so few data points, classical statistical methods are clearly superior to the neural networks used as a "black box" by the authors, which, in the way they are structured, provide a model with an error margin as large as the numbers being computed. 7. And finally, even if someone provided us with a function that perfectly separates strings with cleavage sites from strings without them, so-called simulated molecular evolution would still be no better than random selection. Since a perfect fit would produce only exact ones or zeros, starting a search in a region of space where all strings in the neighborhood get the value zero would provide no directional information for new iterations. We would just skip from one point to another in typical random-walk fashion.

20.
Richard F. Green. Oikos 2006, 112(2): 274–284.
Oaten's (1977) stochastic model for optimal foraging in patches has been solved for a number of particular cases. A few cases, such as Poisson prey distribution and either systematic or random search, are easy to solve. In other cases, such as binomial prey distribution and random search, the form of the optimal strategy may be found using a theorem of McNamara, although more work is required to find which particular rule of the proper form is actually best. More generally (but not completely generally), optimal strategies may be found using dynamic programming. This requires that the number of prey found up to a particular time is a sufficient statistic for the number of prey remaining in a patch. This requirement cannot be dispensed with, but other simplifying assumptions that were used in the past are not necessary. In particular, it is not necessary, even for the sake of convenience, to assume that prey distribution has a form convenient for Bayesian analysis, such as a beta mixture of binomials or a gamma mixture of Poissons. Any prey distribution may be used if whatever prey are in a patch are located at random, and if search either is systematic for discrete time or for continuous time, or is random for continuous time. In earlier work, some pains had to be taken to find the rate of finding prey achieved by a given candidate strategy, but this is not necessary if expected gains and expected times are calculated routinely for each potential stopping point during dynamic programming. A new, simple method of finding optimal strategies is illustrated for discrete time and systematic search. This paper is based on a talk given at the Fifth Hans Kristiansson Symposium held in Lund, Sweden in August, 2003. The subject of the symposium was Bayesian foraging.
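
A compact sketch of the kind of dynamic program described here — my own construction from the abstract, with invented parameters: discrete time, systematic search, an arbitrary prior prey distribution, the number of prey found so far as the sufficient statistic, and bisection on the long-run rate via the renewal condition V(0,0) = r·τ:

```python
import numpy as np
from scipy.stats import hypergeom

M = 20                                  # cells per patch, one searched per step
prior = {0: 0.5, 4: 0.5}                # any prey distribution is allowed
TRAVEL = 5.0                            # travel time between patches
K = max(prior)                          # largest possible prey count

def patch_value(r):
    """Value of entering a fresh patch when each time step costs r."""
    V = np.zeros((M + 1, K + 1))        # V[t, k]; forced to leave at t = M
    for t in range(M - 1, -1, -1):
        for k in range(min(t, K) + 1):
            # posterior over N given k prey found in the first t cells
            post = np.array([p * hypergeom.pmf(k, M, N, t)
                             for N, p in prior.items()])
            if post.sum() == 0:         # unreachable state
                continue
            post /= post.sum()
            # probability that the next cell searched holds a prey item
            p_find = sum(w * (N - k) / (M - t) for w, N in zip(post, prior))
            stay = (-r + p_find * (1 + V[t + 1, min(k + 1, K)])
                    + (1 - p_find) * V[t + 1, k])
            V[t, k] = max(0.0, stay)    # leave now (value 0) or keep searching
    return V[0, 0]

lo, hi = 0.0, 1.0                       # bisection on the long-run rate r
for _ in range(40):
    mid = (lo + hi) / 2
    lo, hi = (mid, hi) if patch_value(mid) > mid * TRAVEL else (lo, mid)
print("optimal long-run rate:", round(lo, 4))
```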
