1.

Opening the black box: an open‐source release of Maxent

Steven J. Phillips Robert P. Anderson Miroslav Dudík Robert E. Schapire Mary E. Blair 《Ecography》2017,40(7):887-893

This software note announces a new open‐source release of the Maxent software for modeling species distributions from occurrence records and environmental data, and describes a new R package for fitting such models. The new release (ver. 3.4.0) will be hosted online by the American Museum of Natural History, along with future versions. It contains small functional changes, most notably use of a complementary log‐log (cloglog) transform to produce an estimate of occurrence probability. The cloglog transform derives from the recently‐published interpretation of Maxent as an inhomogeneous Poisson process (IPP), giving it a stronger theoretical justification than the logistic transform which it replaces by default. In addition, the new R package, maxnet, fits Maxent models using the glmnet package for regularized generalized linear models. We discuss the implications of the IPP formulation in terms of model inputs and outputs, treating occurrence records as points rather than grid cells and interpreting the exponential Maxent model (raw output) as an estimate of relative abundance. With these two open‐source developments, we invite others to freely use and contribute to the software.
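As a rough illustration of the two output transforms the abstract contrasts, the sketch below applies the generic complementary log‐log and logistic inverse links to a linear predictor. The name `eta` is hypothetical, and the constant that Maxent adds to the linear predictor before transforming is not reproduced here, so this shows only the shape of the transforms, not the software's exact output.

```python
import math


def cloglog_output(eta: float) -> float:
    """Complementary log-log inverse link: 1 - exp(-exp(eta)), always in (0, 1)."""
    return 1.0 - math.exp(-math.exp(eta))


def logistic_output(eta: float) -> float:
    """Logistic inverse link: 1 / (1 + exp(-eta)), always in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))


# Compare the two transforms on a few hypothetical linear-predictor values.
for eta in (-2.0, 0.0, 2.0):
    print(f"eta={eta:+.1f}  cloglog={cloglog_output(eta):.4f}  "
          f"logistic={logistic_output(eta):.4f}")
```

For the same linear predictor, the cloglog value is always the larger of the two, which is one reason maps produced with the two transforms can look different.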

2.

Clustering in linear‐mixed models with a group fused lasso penalty

Felix Heinzl Gerhard Tutz 《Biometrical journal. Biometrische Zeitschrift》2014,56(1):44-68

A method is proposed that aims at identifying clusters of individuals that show similar patterns when observed repeatedly. We consider linear‐mixed models that are widely used for the modeling of longitudinal data. In contrast to the classical assumption of a normal distribution for the random effects, a finite mixture of normal distributions is assumed. Typically, the number of mixture components is unknown and has to be chosen, ideally by data‐driven tools. For this purpose, an EM algorithm‐based approach is considered that uses a penalized normal mixture as random effects distribution. The penalty term shrinks the pairwise distances of cluster centers based on the group lasso and the fused lasso method. The effect is that individuals with similar time trends are merged into the same cluster. The strength of regularization is determined by one penalization parameter. For finding the optimal penalization parameter a new model choice criterion is proposed.

3.

Variable selection in Bayesian generalized linear‐mixed models: An illustration using candidate gene case‐control association studies

Miao‐Yu Tsai 《Biometrical journal. Biometrische Zeitschrift》2015,57(2):234-253

The problem of variable selection in generalized linear‐mixed models (GLMMs) is pervasive in statistical practice. For the purpose of variable selection, many methodologies for determining the best subset of explanatory variables currently exist according to the model complexity and differences between applications. In this paper, we develop a “higher posterior probability model with bootstrap” (HPMB) approach to select explanatory variables without fitting all possible GLMMs involving a small or moderate number of explanatory variables. Furthermore, to save computational load, we propose an efficient approximation approach with Laplace's method and Taylor's expansion to approximate intractable integrals in GLMMs. Simulation studies and an application to HapMap data provide evidence that this selection approach is computationally feasible and reliable for exploring true candidate genes and gene–gene associations, after adjusting for complex structures among clusters.

4.

Using species distribution models to identify suitable areas for biofuel feedstock production

JASON M. EVANS ROBERT J. FLETCHER JR. JANAKI ALAVALAPATI 《Global Change Biology Bioenergy》2010,2(2):63-78

The 2007 Energy Independence and Security Act mandates a five‐fold increase in US biofuel production by 2022. Given this ambitious policy target, there is a need for spatially explicit estimates of landscape suitability for growing biofuel feedstocks. We developed a suitability modeling approach for two major US biofuel crops, corn (Zea mays) and switchgrass (Panicum virgatum), based upon the use of two presence‐only species distribution models (SDMs): maximum entropy (Maxent) and support vector machines (SVM). SDMs are commonly used for modeling animal and plant distributions in natural environments, but have rarely been used to develop landscape models for cultivated crops. AUC, Kappa, and correlation measures derived from test data indicate that SVM slightly outperformed Maxent in modeling US corn production, although both models produced significantly accurate results. When compared with results from a mechanistic switchgrass model recently developed by Oak Ridge National Laboratory (ORNL), SVM results showed higher correlation than Maxent results with models fit using county‐scale point inputs of switchgrass production derived from expert opinion estimates. However, Maxent results for an alternative switchgrass model developed with point inputs from research trial sites showed higher correlation to the ORNL model than the corresponding results obtained from SVM. Further analysis indicates that both modeling approaches were effective in predicting county‐scale increases in corn production from 2006 to 2007, a time period in which US corn production increased by 24%. We conclude that presence‐only methods are a powerful first‐cut tool for estimating relative land suitability across geographic regions in which candidate biofuel feedstocks can be grown, and may also provide important insight into potential land‐use change patterns likely to be associated with increased biofuel demand.

5.

Modeling spatiotemporal abundance of mobile wildlife in highly variable environments using boosted GAMLSS hurdle models

Adam Smith Benjamin Hofner Juliet S. Lamb Jason Osenkowski Taber Allison Giancarlo Sadoti Scott R. McWilliams Peter Paton 《Ecology and evolution》2019,9(5):2346-2364

Modeling organism distributions from survey data involves numerous statistical challenges, including accounting for zero‐inflation, overdispersion, and selection and incorporation of environmental covariates. In environments with high spatial and temporal variability, addressing these challenges often requires numerous assumptions regarding organism distributions and their relationships to biophysical features. These assumptions may limit the resolution or accuracy of predictions resulting from survey‐based distribution models. We propose an iterative modeling approach that incorporates a negative binomial hurdle, followed by modeling of the relationship of organism distribution and abundance to environmental covariates using generalized additive models (GAM) and generalized additive models for location, scale, and shape (GAMLSS). Our approach accounts for key features of survey data by separating binary (presence‐absence) from count (abundance) data, separately modeling the mean and dispersion of count data, and incorporating selection of appropriate covariates and response functions from a suite of potential covariates while avoiding overfitting. We apply our modeling approach to surveys of sea duck abundance and distribution in Nantucket Sound (Massachusetts, USA), which has been proposed as a location for offshore wind energy development. Our model results highlight the importance of spatiotemporal variation in this system, as well as identifying key habitat features including distance to shore, sediment grain size, and seafloor topographic variation. Our work provides a powerful, flexible, and highly repeatable modeling framework with minimal assumptions that can be broadly applied to the modeling of survey data with high spatiotemporal variability. Applying GAMLSS models to the count portion of survey data allows us to incorporate potential overdispersion, which can dramatically affect model results in highly dynamic systems. 
Our approach is particularly relevant to systems in which little a priori knowledge is available regarding relationships between organism distributions and biophysical features, since it incorporates simultaneous selection of covariates and their functional relationships with organism responses.
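The hurdle structure described in this abstract (a binary presence component followed by a count component for non‐zero observations) can be sketched as a probability mass function. This is a minimal sketch only: the parameter names `p_present`, `mu`, and `size` are hypothetical, the values are fixed numbers rather than functions of covariates, and the boosted GAMLSS fitting used by the authors is not reproduced.

```python
import math


def nb_pmf(k: int, mu: float, size: float) -> float:
    """Negative binomial pmf with mean mu and dispersion parameter size."""
    log_p = (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
             + size * math.log(size / (size + mu))
             + k * math.log(mu / (size + mu)))
    return math.exp(log_p)


def hurdle_nb_pmf(k: int, p_present: float, mu: float, size: float) -> float:
    """Hurdle model: zeros come only from the binary part,
    positive counts from a zero-truncated negative binomial."""
    if k == 0:
        return 1.0 - p_present
    truncated = nb_pmf(k, mu, size) / (1.0 - nb_pmf(0, mu, size))
    return p_present * truncated


# Hypothetical values: 30% chance of any birds, mean 5 and dispersion 2 for the count part.
probs = [hurdle_nb_pmf(k, p_present=0.3, mu=5.0, size=2.0) for k in range(500)]
print(f"P(0)={probs[0]:.3f}  P(1)={probs[1]:.4f}  total={sum(probs):.6f}")
```

Separating the two parts means the probability of a zero count and the spread of the positive counts can each respond to different covariates, which is the property the abstract relies on.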

6.

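Effects of MaxEnt model parameter settings on the simulated geographic distribution and ecological niche of a species: the case of the brown marmorated stink bug (Halyomorpha halys) Total citations: 2, self‐citations: 0, citations by others: 2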

Gengping Zhu Xuejiao Yuan Jingyu Fan Menglin Wang 《Journal of Biosafety (生物安全学报)》2018,27(2):118-123

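[Aim] Ecological niche models are widely used in invasion biology and conservation biology, and among existing modeling tools MaxEnt is the most popular and most widely applied. However, recent studies show that models built with MaxEnt's default parameters tend to overfit and are not necessarily optimal, especially for species with few occurrence points. [Method] Using the brown marmorated stink bug (Halyomorpha halys) as an example, we built native‐range models under different settings of feature classes, regularization multiplier, and number of background (pseudo‐absence) points, and then transferred them to the invaded range to validate and compare them. By examining the predicted response curves of the species to environmental factors, the niche mapping of the potential distribution in environmental space, and the spatial differences among potential distributions, we explored how these three parameter settings affect the species distribution and niche simulated by MaxEnt. [Result] In this case study, the choice of feature classes had the greatest influence on the potential distribution and niche simulated by MaxEnt, followed by the regularization multiplier; the number of background points had the least influence. Compared with other features, models based on features H and T (hinge and threshold) produced more jagged response curves; as the regularization multiplier increased, the response curves became smoother. [Conclusion] When building MaxEnt models, the ecological requirements of the species should be considered in environmental space, and the possible effects of model parameters on the predicted distribution and niche should be analyzed.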

7.

Modeling of species distributions with Maxent: new extensions and a comprehensive evaluation Total citations: 43, self‐citations: 1, citations by others: 42

Steven J. Phillips Miroslav Dudík 《Ecography》2008,31(2):161-175

Accurate modeling of geographic distributions of species is crucial to various applications in ecology and conservation. The best performing techniques often require some parameter tuning, which may be prohibitively time‐consuming to do separately for each species, or unreliable for small or biased datasets. Additionally, even with the abundance of good quality data, users interested in the application of species models need not have the statistical knowledge required for detailed tuning. In such cases, it is desirable to use “default settings”, tuned and validated on diverse datasets. Maxent is a recently introduced modeling technique, achieving high predictive accuracy and enjoying several additional attractive properties. The performance of Maxent is influenced by a moderate number of parameters. The first contribution of this paper is the empirical tuning of these parameters. Since many datasets lack information about species absence, we present a tuning method that uses presence‐only data. We evaluate our method on independently collected high‐quality presence‐absence data. In addition to tuning, we introduce several concepts that improve the predictive accuracy and running time of Maxent. We introduce “hinge features” that model more complex relationships in the training data; we describe a new logistic output format that gives an estimate of probability of presence; finally we explore “background sampling” strategies that cope with sample selection bias and decrease model‐building time. 
Our evaluation, based on a diverse dataset of 226 species from 6 regions, shows: 1) default settings tuned on presence‐only data achieve performance which is almost as good as if they had been tuned on the evaluation data itself; 2) hinge features substantially improve model performance; 3) logistic output improves model calibration, so that large differences in output values correspond better to large differences in suitability; 4) “target‐group” background sampling can give much better predictive performance than random background sampling; 5) random background sampling results in a dramatic decrease in running time, with no decrease in model performance.
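To make the idea of a “hinge feature” concrete, the sketch below shows the general piecewise‐linear form such a feature takes: flat at zero below a knot, then rising linearly. The function name, the scaling to 1 at the variable's maximum, and the example values are assumptions for illustration and are not taken from the Maxent source code.

```python
def forward_hinge(x: float, knot: float, x_max: float) -> float:
    """Piecewise-linear feature: 0 below the knot, rising linearly to 1 at x_max."""
    if x_max <= knot:
        raise ValueError("x_max must be greater than knot")
    return max(0.0, (x - knot) / (x_max - knot))


# Hypothetical temperature variable ranging up to 20, with a knot at 10.
for temperature in (5.0, 10.0, 15.0, 20.0):
    print(temperature, forward_hinge(temperature, knot=10.0, x_max=20.0))
```

A weighted sum of several such features with different knots gives a piecewise‐linear response curve, which is how hinge features can represent more complex relationships than a single linear term.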

8.

Opportunities for improved distribution modelling practice via a strict maximum likelihood interpretation of MaxEnt

Rune Halvorsen Sabrina Mazzoni Anders Bryn Vegar Bakkestuen 《Ecography》2015,38(2):172-183

Maximum entropy (MaxEnt) modelling, as implemented in the Maxent software, has rapidly become one of the most popular methods for distribution modelling. Originally, MaxEnt was described as a machine‐learning method. More recently, it has been explained from principles of Bayesian estimation. MaxEnt offers numerous options (variants of the method) and settings (tuning of parameters) to the users. A widespread practice of accepting the Maxent software's default options and settings has been established, most likely because of ecologists’ lack of familiarity with machine‐learning and Bayesian statistical concepts and the ease by which the default models are obtained in Maxent. However, these defaults have been shown, in many cases, to be suboptimal and exploration of alternatives has repeatedly been called for. In this paper, we derive MaxEnt from strict maximum likelihood principles, and point out parallels between MaxEnt and standard modelling tools like generalised linear models (GLM). Furthermore, we describe several new options opened by this new derivation of MaxEnt, which may improve MaxEnt practice. The most important of these is the option for selecting variables by subset selection methods instead of the ℓ₁‐regularisation method, which currently is the Maxent software default. Other new options include: incorporation of new transformations of explanatory variables and user control of the transformation process; improved variable contribution measures and options for variation partitioning; and improved output prediction formats. The new options are exemplified for a data set for the plant species Scorzonera humilis in SE Norway, which was analysed by the standard MaxEnt procedure in a previously published paper. We recommend that thorough comparisons between the proposed alternative options and default procedures and variants thereof be carried out.

9.

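Effects of Maxent model complexity on the prediction of species potential distributions Total citations: 4, self‐citations: 0, citations by others: 4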

Gengping Zhu Huijie Qiao 《Biodiversity Science (生物多样性)》2016,24(10):1189-267

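Ecological niche models are widely applied in invasion biology and conservation biology; among them, Maxent is the most popular and is increasingly used to predict the realized and potential distributions of species. Most researchers build Maxent models with the default parameters, which derive from early tests on 266 species aimed at predicting realized distributions. Recent studies have found that Maxent uses a complex machine‐learning algorithm, is sensitive to sampling bias, is prone to overfitting, and transfers well only at low thresholds. Maxent models based on default parameters not only give unreliable predictions but are sometimes hard to interpret. In this study, using the invasive pest brown marmorated stink bug (Halyomorpha halys) as an example, we adopted the classical modeling scheme (building a native‐range model and then transferring it to the invaded range for evaluation) and used the ENMeval package to tune the regularization multiplier and feature combination of the native‐range Maxent model. We analyzed model complexity under each parameter setting, selected the parameters giving the lowest complexity (taken as the optimal model), and compared the response curves and predictions of Maxent models built with default and tuned parameters. We discuss the effect of Maxent model complexity on predictions and the points that need attention when building Maxent models, with the aim of predicting species potential distributions reasonably and promoting the sound use and development of Maxent in China. We argue that the choice of environmental variables is crucial and requires a joint analysis of their limiting effect on the distribution of the modeled species and of the spatial correlation among variables. Before building a Maxent model, sampling bias in the occurrence data and the model calibration area should be assessed carefully; during model building, predictions and response curves under different parameters should be compared, and parameters giving lower complexity should be selected for the final model. In the stink bug analysis, the default and optimal Maxent parameters differed; compared with the default parameters, the model built with tuned parameters predicted better, had smoother response curves, and transferred better, reflecting the species' response to environmental factors more reasonably and simulating its potential distribution more accurately.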

10.

Spatial sampling bias and model complexity in stream‐based species distribution models: A case study of Paddlefish (Polyodon spathula) in the Arkansas River basin,USA

Andrew T. Taylor Thomas Hafen Colt T. Holley Alin González James M. Long 《Ecology and evolution》2020,10(2):705-717

Leveraging existing presence records and geospatial datasets, species distribution modeling has been widely applied to informing species conservation and restoration efforts. Maxent is one of the most popular modeling algorithms, yet recent research has demonstrated Maxent models are vulnerable to prediction errors related to spatial sampling bias and model complexity. Despite elevated rates of biodiversity imperilment in stream ecosystems, the application of Maxent models to stream networks has lagged, as has the availability of tools to address potential sources of error and calculate model evaluation metrics when modeling in nonraster environments (such as stream networks). Herein, we use Maxent and customized R code to estimate the potential distribution of paddlefish (Polyodon spathula) at a stream‐segment level within the Arkansas River basin, USA, while accounting for potential spatial sampling bias and model complexity. Filtering the presence data appeared to adequately remove an eastward, large‐river sampling bias that was evident within the unfiltered presence dataset. In particular, our novel riverscape filter provided a repeatable means of obtaining a relatively even coverage of presence data among watersheds and streams of varying sizes. The greatest differences in estimated distributions were observed among models constructed with default versus AICc‐selected parameterization. Although all models had similarly high performance and evaluation metrics, the AICc‐selected models were more inclusive of westward‐situated and smaller, headwater streams. Overall, our results solidified the importance of accounting for model complexity and spatial sampling bias in SDMs constructed within stream networks and provided a roadmap for future paddlefish restoration efforts in the study area.

11.

A comparison of model selection methods for prediction in the presence of multiply imputed data

Le Thi Phuong Thao Ronald Geskus 《Biometrical journal. Biometrische Zeitschrift》2019,61(2):343-356

Many approaches for variable selection with multiply imputed data in the development of a prognostic model have been proposed. However, no method prevails as uniformly best. We conducted a simulation study with a binary outcome and a logistic regression model to compare two classes of variable selection methods in the presence of MI data: (I) Model selection on bootstrap data, using backward elimination based on AIC or lasso, and fit the final model based on the most frequently (e.g. ) selected variables over all MI and bootstrap data sets; (II) Model selection on original MI data, using lasso. The final model is obtained by (i) averaging estimates of variables that were selected in any MI data set or (ii) in 50% of the MI data; (iii) performing lasso on the stacked MI data, and (iv) as in (iii) but using individual weights as determined by the fraction of missingness. In all lasso models, we used both the optimal penalty and the 1‐se rule. We considered recalibrating models to correct for overshrinkage due to the suboptimal penalty by refitting the linear predictor or all individual variables. We applied the methods on a real dataset of 951 adult patients with tuberculous meningitis to predict mortality within nine months. Overall, applying lasso selection with the 1‐se penalty shows the best performance, both in approach I and II. Stacking MI data is an attractive approach because it does not require choosing a selection threshold when combining results from separate MI data sets.
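The “1‐se rule” mentioned in this abstract is commonly understood as choosing the strongest penalty whose cross‐validated error is within one standard error of the minimum. The sketch below illustrates that reading on made‐up numbers; the function name and the cross‐validation values are hypothetical, and the imputation and bootstrap steps of the study are not reproduced.

```python
def one_se_lambda(lambdas, cv_mean, cv_se):
    """Return (lambda with minimum CV error, largest lambda within one SE of it)."""
    best = min(range(len(lambdas)), key=lambda i: cv_mean[i])
    cutoff = cv_mean[best] + cv_se[best]
    eligible = [lam for lam, err in zip(lambdas, cv_mean) if err <= cutoff]
    return lambdas[best], max(eligible)


# Hypothetical cross-validation results for four penalty values.
lambdas = [0.001, 0.01, 0.1, 1.0]
cv_mean = [0.52, 0.50, 0.53, 0.70]   # mean CV error per penalty
cv_se = [0.04, 0.04, 0.05, 0.06]     # standard error of each mean

lam_min, lam_1se = one_se_lambda(lambdas, cv_mean, cv_se)
print(lam_min, lam_1se)
```

Here the minimum error occurs at 0.01, but 0.1 is still within one standard error of it, so the rule picks the larger penalty and therefore a sparser model.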

12.

The challenge of modeling niches and distributions for data‐poor species: a comprehensive approach to model complexity

Peter J. Galante Babatunde Alade Robert Muscarella Sharon A. Jansa Steven M. Goodman Robert P. Anderson 《Ecography》2018,41(5):726-736

Models of species ecological niches and geographic distributions now represent a widely used tool in ecology, evolution, and biogeography. However, the very common situation of species with few available occurrence localities presents major challenges for such modeling techniques, in particular regarding model complexity and evaluation. Here, we summarize the state of the field regarding these issues and provide a worked example using the technique Maxent for a small mammal endemic to Madagascar (the nesomyine rodent Eliurus majori). Two relevant model‐selection approaches exist in the literature (information criteria, specifically AICc; and performance predicting withheld data, via a jackknife), but AICc is not strictly applicable to machine‐learning algorithms like Maxent. We compare models chosen under each selection approach with those corresponding to Maxent default settings, both with and without spatial filtering of occurrence records to reduce the effects of sampling bias. Both selection approaches chose simpler models than those made using default settings. Furthermore, the approaches converged on a similar answer when sampling bias was taken into account, but differed markedly with the unfiltered occurrence data. Specifically, for that dataset, the models selected by AICc had substantially fewer parameters than those identified by performance on withheld data. Based on our knowledge of the study species, models chosen under both AICc and withheld‐data‐selection showed higher ecological plausibility when combined with spatial filtering. The results for this species intimate that AICc may consistently select models with fewer parameters and be more robust to sampling bias. To test these hypotheses and reach general conclusions, comprehensive research should be undertaken with a wide variety of real and simulated species. 
Meanwhile, we recommend that researchers assess the critical yet underappreciated issue of model complexity both via information criteria and performance on withheld data, comparing the results between the two approaches and taking into account ecological plausibility.

13.

ModEco: an integrated software package for ecological niche modeling Total citations: 2, self‐citations: 0, citations by others: 2

Qinghua Guo Yu Liu 《Ecography》2010,33(4):637-642

ModEco is a software package for ecological niche modeling. It integrates a range of niche modeling methods within a geographical information system. ModEco provides a user‐friendly platform that enables users to explore, analyze, and model species distribution data with relative ease. ModEco has several unique features: 1) it deals with different types of ecological observation data, such as presence and absence data, presence‐only data, and abundance data; 2) it provides a range of models when dealing with presence‐only data, such as presence‐only models, pseudo‐absence models, background vs presence data models, and ensemble models; and 3) it includes relatively comprehensive tools for data visualization, feature selection, and accuracy assessment.

14.

High‐Dimensional Cox Models: The Choice of Penalty as Part of the Model Building Process

Axel Benner Manuela Zucknick Thomas Hielscher Carina Ittrich Ulrich Mansmann 《Biometrical journal. Biometrische Zeitschrift》2010,52(1):50-69

The Cox proportional hazards regression model is the most popular approach to model covariate information for survival times. In this context, the development of high‐dimensional models where the number of covariates is much larger than the number of observations ($p \gg n$) is an ongoing challenge. A practicable approach is to use ridge penalized Cox regression in such situations. Besides focusing on finding the best prediction rule, one is often interested in determining a subset of covariates that are the most important ones for prognosis. This could be a gene set in the biostatistical analysis of microarray data. Covariate selection can then, for example, be done by L₁‐penalized Cox regression using the lasso (Tibshirani (1997). Statistics in Medicine 16, 385–395). Several approaches beyond the lasso that incorporate covariate selection have been developed in recent years. These include modifications of the lasso as well as nonconvex variants such as smoothly clipped absolute deviation (SCAD) (Fan and Li (2001). Journal of the American Statistical Association 96, 1348–1360; Fan and Li (2002). The Annals of Statistics 30, 74–99). The purpose of this article is to implement them practically into the model building process when analyzing high‐dimensional data with the Cox proportional hazards model. To evaluate penalized regression models beyond the lasso, we included SCAD variants and the adaptive lasso (Zou (2006). Journal of the American Statistical Association 101, 1418–1429). We compare them with “standard” applications such as ridge regression, the lasso, and the elastic net. Predictive accuracy, features of variable selection, and estimation bias will be studied to assess the practical use of these methods. We observed that the performance of SCAD and adaptive lasso is highly dependent on nontrivial preselection procedures. A practical solution to this problem does not yet exist. 
Since there is a high risk of missing relevant covariates when SCAD or the adaptive lasso is applied after an inappropriate initial selection step, we recommend staying with the lasso or the elastic net in actual data applications. But with respect to the promising results for truly sparse models, we see some advantage of SCAD and adaptive lasso, if better preselection procedures were available. This requires further methodological research.

15.

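Potential habitat prediction models for the invasive alien species zebra mussel (Dreissena polymorpha) in the continental United States Total citations: 8, self‐citations: 1, citations by others: 7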

Mingyang Li Yunwei Ju Sunil Kumar Thomas J. Stohlgren 《Acta Ecologica Sinica (生态学报)》2008,28(9):4253-4258

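An important way to prevent harm from biological invasions is to keep potentially invasive species out of areas suitable for their survival. Using 1864 point occurrence records of the invasive alien zebra mussel in the United States and 34 environmental variables from the open geographic database Daymet as the main information sources, this paper applies four approaches, logistic regression (LR), classification and regression trees (CART), a rule‐based genetic algorithm (GARP), and maximum entropy (Maxent), to build potential habitat prediction models for the continental United States. Prediction accuracy is tested with three measures: the area under the receiver operating characteristic curve (AUC), the Pearson correlation coefficient, and Kappa. On this basis, the spatial distribution pattern of the zebra mussel and the environmental factors that influence it are analyzed. The results show that, on all three evaluation measures, the four niche models reached good to excellent prediction accuracy; Maxent showed superior performance in simulating the realized habitat of the species, screening the main environmental factors, and quantitatively describing the influence of environmental factors on the species' habitat. Distance to water, elevation, precipitation frequency, and solar radiation are the main environmental factors affecting the spatial distribution of the species. The methods proposed here offer a useful reference for habitat prediction of invasive alien species in China, and the results provide some guidance for the prediction and control of Mytilopsis sallei (沙筛贝), a marine invasive alien species in China.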

16.

Validating distribution models for twelve endemic bird species of tropical dry forest in western Mexico

Miguel A. Ortega‐Huerta Jorge H. Vega‐Rivera 《Ecology and evolution》2017,7(19):7672-7686

Considering the high biodiversity and conservation concerns of the tropical dry forest, this study aims to predict and evaluate the potential and current distributions of twelve species of endemic birds distributed along the western slope of Mexico. The main goal is to jointly evaluate different methods for predicting actual species distribution models (ADMs) of the twelve species, including the identification of key environmental potential limiting factors. ADMs for twelve endemic Mexican birds were generated and validated by applying: (1) three widely used species niche modeling approaches (ENFA, Garp, and Maxent); (2) two thresholding methods, based on ROC curves and Kappa Index, for transforming continuous models to presence/absence (binary) models; (3) documented habitat–species associations for reducing species potential distribution models (PDMs); and (4) field occurrence data for validating final ADMs. Binary PDMs' predicted areas seemed overestimated, while ADMs looked drastically reduced and fragmented because of the approach taken for eliminating those predicted areas which were documented as unsuitable habitat types for individual species. Results indicated that both thresholding methods generated similar threshold values for species modeled by each of the three species distribution modeling algorithms (SDMAs). A Wilcoxon signed‐rank test, however, showed that Kappa values were generally higher than ROC curve values for species modeled by ENFA and Maxent, while for Garp models there were no significant differences. Prediction success (e.g., true presences percentage) obtained from field occurrence data revealed a range of 50%–82% among the 12 species. The three modeling approaches applied made it possible to test the application of two thresholding methods for transforming continuous to binary (presence/absence) models. The use of documented habitat preferences resulted in drastic reductions and fragmentation of PDMs. 
However, the predictive success rate of ADMs, tested using field species occurrence data, varied between 50 and 82%.

17.

L1 Penalized Estimation in the Cox Proportional Hazards Model

Jelle J. Goeman 《Biometrical journal. Biometrische Zeitschrift》2010,52(1):70-84

This article presents a novel algorithm that efficiently computes L₁ penalized (lasso) estimates of parameters in high‐dimensional models. The lasso has the property that it simultaneously performs variable selection and shrinkage, which makes it very useful for finding interpretable prediction rules in high‐dimensional data. The new algorithm is based on a combination of gradient ascent optimization with the Newton–Raphson algorithm. It is described for a general likelihood function and can be applied in generalized linear models and other models with an L₁ penalty. The algorithm is demonstrated in the Cox proportional hazards model, predicting survival of breast cancer patients using gene expression data, and its performance is compared with competing approaches. An R package, penalized, that implements the method, is available on CRAN.

18.

The effects of small sample size and sample bias on threshold selection and accuracy assessment of species distribution models

William T. Bean Robert Stafford Justin S. Brashares 《Ecography》2012,35(3):250-258

Species distribution models are used for a range of ecological and evolutionary questions, but often are constructed from few and/or biased species occurrence records. Recent work has shown that the presence‐only model Maxent performs well with small sample sizes. While the apparent accuracy of such models with small samples has been studied, less emphasis has been placed on the effect of small or biased species records on the secondary modeling steps, specifically accuracy assessment and threshold selection, particularly with profile (presence‐only) modeling techniques. When testing the effects of small sample sizes on distribution models, accuracy assessment has generally been conducted with complete species occurrence data, rather than similarly limited (e.g. few or biased) test data. Likewise, selection of a probability threshold – a selection of probability that classifies a model into discrete areas of presences and absences – has also generally been conducted with complete data. In this study we subsampled distribution data for an endangered rodent across multiple years to assess the effects of different sample sizes and types of bias on threshold selection, and examine the differences between apparent and actual accuracy of the models. Although some previously recommended threshold selection techniques showed little difference in threshold selection, the most commonly used methods performed poorly. Apparent model accuracy calculated from limited data was much higher than true model accuracy, but the true model accuracy was lower than it could have been with a more optimal threshold. That is, models with thresholds and accuracy calculated from biased and limited data had inflated reported accuracy, but were less accurate than they could have been if better data on species distribution were available and an optimal threshold were used.
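Threshold selection, as described in this abstract, turns continuous model output into predicted presences and absences. The sketch below shows one common criterion, choosing the threshold that maximizes sensitivity plus specificity; it is not necessarily one of the techniques the authors compared, and the function name, scores, and labels are made up for illustration.

```python
def max_sss_threshold(scores, labels):
    """Return (threshold, TSS) maximizing sensitivity + specificity.

    scores: continuous model outputs; labels: 1 for presence, 0 for absence.
    A site is predicted present when its score is >= the threshold.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one presence and one absence")
    best_t, best_sum = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        total = tp / n_pos + tn / n_neg
        if total > best_sum:
            best_t, best_sum = t, total
    return best_t, best_sum - 1.0  # TSS = sensitivity + specificity - 1


# Hypothetical test data: model scores and observed presence/absence.
scores = [0.1, 0.2, 0.35, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1, 1]
print(max_sss_threshold(scores, labels))
```

Because the chosen threshold depends entirely on the test records supplied, a small or biased test set can shift it, which is the sensitivity the study examines.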

19.

Akaike information criterion should not be a “test” of geographical prediction accuracy in ecological niche modelling

《Ecological Informatics》2019

Model complexity in ecological niche modelling has been recently considered as an important issue that might affect model performance. New methodological developments have implemented the Akaike information criterion (AIC) to capture model complexity in the Maxent algorithm model. AIC is calculated based on the number of parameters and likelihoods of continuous raw outputs. ENMeval R package allows users to perform a species-specific tuning of Maxent settings running models with different combinations of regularization multiplier and feature classes and finally, all these models are compared using AIC corrected for small sample size. This approach is focused to find the “best” model parametrization and it is thought to maximize the model complexity and therefore, its predictability. We found that most niche modelling studies examined by us (68%) tend to consider AIC as a criterion of predictive accuracy in geographical distribution. In other words, AIC is used as a criterion to choose those models with the highest capacity to discriminate between presences and absences. However, the link between AIC and geographical predictive accuracy has not been tested so far. Here, we evaluated this relationship using a set of simulated (virtual) species. We created a set of nine virtual species with different ecological and geographical traits (e.g., niche position, niche breadth, range size) and generated different sets of true presences and absences data across geography. We built a set of models using Maxent algorithm with different regularization values and features schemes and calculated AIC values for each model. For each model, we obtained binary predictions using different threshold criteria and validated using independent presence and absences data. We correlated AIC values against standard validation metrics (e.g., Kappa, TSS) and the number of pixels correctly predicted as presences and absences. 
We did not find a correlation between AIC values and predictive accuracy from validation metrics. In general, those models with the lowest AIC values tend to generate geographical predictions with high commission and omission errors. The results were consistent across all species simulated. Finally, we suggest that AIC should not be used if users are interested in prediction more than explanation in ecological niche modelling.
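For reference, the small‐sample correction of AIC that this abstract refers to has a standard closed form, sketched below. The function and argument names are hypothetical, and how the log‐likelihood and parameter count are obtained from a Maxent model (the abstract mentions raw outputs and the number of parameters) is not reproduced here.

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """AIC corrected for small sample size.

    log_likelihood: maximized log-likelihood of the model
    k: number of estimated parameters
    n: number of observations (e.g. occurrence records)
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


# Hypothetical comparison: a simpler and a more complex model fitted to 50 records.
print(aicc(log_likelihood=-100.0, k=5, n=50))
print(aicc(log_likelihood=-95.0, k=20, n=50))
```

The correction term grows quickly as the parameter count approaches the sample size, so the criterion measures penalized fit to the training data and does not by itself measure discrimination between presences and absences.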

20.

Testing the ability of species distribution models to infer variable importance

Adam B. Smith Maria J. Santos 《Ecography》2020,43(12):1801-1813

Models of species’ distributions and niches are frequently used to infer the importance of range- and niche-defining variables. However, the degree to which these models can reliably identify important variables and quantify their influence remains unknown. Here we use a series of simulations to explore how well models can 1) discriminate between variables with different influence and 2) calibrate the magnitude of influence relative to an ‘omniscient’ model. To quantify variable importance, we trained generalized additive models (GAMs), Maxent and boosted regression trees (BRTs) on simulated data and tested their sensitivity to permutations in each predictor. Importance was inferred by calculating the correlation between permuted and unpermuted predictions, and by comparing predictive accuracy of permuted and unpermuted predictions using AUC and the continuous Boyce index. In scenarios with one influential and one uninfluential variable, models failed to discriminate reliably between variables when training occurrences were < 8–64, prevalence was > 0.5, spatial extent was small, environmental data had coarse resolution and spatial autocorrelation was low, or when pairwise correlation between environmental variables was |r| > 0.7. When two variables influenced the distribution equally, importance was underestimated when species had narrow or intermediate niche breadth. Interactions between variables in how they shaped the niche did not affect inferences about their importance. When variables acted unequally, the effect of the stronger variable was overestimated. GAMs and Maxent discriminated between variables more reliably than BRTs, but no algorithm was consistently well-calibrated vis-à-vis the omniscient model. Algorithm-specific measures of importance like Maxent's change-in-gain metric were less robust than the permutation test. Overall, high predictive accuracy did not connote robust inferential capacity. 
As a result, requirements for reliably measuring variable importance are likely more stringent than for creating models with high predictive accuracy.
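The permutation test described in this abstract can be sketched as follows: shuffle one predictor, predict again, and compare with the unpermuted predictions by correlation. This is a minimal sketch assuming importance is reported as one minus the Pearson correlation; the toy model, variable layout, and function names are hypothetical and stand in for a fitted GAM, Maxent, or BRT model.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    sab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    saa = sum((x - mean_a) ** 2 for x in a)
    sbb = sum((y - mean_b) ** 2 for y in b)
    if saa == 0.0 or sbb == 0.0:
        raise ValueError("predictions have zero variance")
    return sab / math.sqrt(saa * sbb)


def permutation_importance(predict, rows, col, seed=0):
    """1 - correlation between predictions before and after shuffling column col."""
    rng = random.Random(seed)
    baseline = [predict(r) for r in rows]
    shuffled = [r[col] for r in rows]
    rng.shuffle(shuffled)
    permuted_rows = [r[:col] + [v] + r[col + 1:] for r, v in zip(rows, shuffled)]
    permuted = [predict(r) for r in permuted_rows]
    return 1.0 - pearson(baseline, permuted)


def toy_predict(row):
    """Toy model: suitability depends on variable 0 only; variable 1 is ignored."""
    return 1.0 / (1.0 + math.exp(-2.0 * row[0]))


data_rng = random.Random(42)
rows = [[data_rng.gauss(0.0, 1.0), data_rng.gauss(0.0, 1.0)] for _ in range(200)]
print(permutation_importance(toy_predict, rows, col=0))  # influential variable
print(permutation_importance(toy_predict, rows, col=1))  # uninfluential variable
```

Shuffling the ignored variable leaves the predictions unchanged, so its importance is zero, while shuffling the influential one breaks the correlation with the original predictions.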