Similar literature
1.
Species distribution modeling (SDM) is an essential method in ecology and conservation. SDMs are often calibrated within one country's borders, typically along a limited environmental gradient with biased and incomplete data, making the quality of these models questionable. In this study, we evaluated how adequate national presence‐only data are for calibrating regional SDMs. We trained SDMs for Egyptian bat species at two different scales: only within Egypt and at a species‐specific global extent. We used two modeling algorithms: Maxent and elastic net, both under the point‐process modeling framework. For each modeling algorithm, we measured the congruence of the predictions of global and regional models for Egypt, assuming that the lower the congruence, the lower the appropriateness of the Egyptian dataset to describe the species' niche. We inspected the effect of incorporating predictions from global models as an additional predictor (“prior”) to regional models, and quantified the improvement in terms of AUC and the congruence between regional models run with and without priors. Moreover, we analyzed predictive performance improvements after correction for sampling bias at both scales. On average, predictions from global and regional models in Egypt only weakly concur. Collectively, the use of priors did not lead to much improvement: similar AUC and high congruence between regional models calibrated with and without priors. Correction for sampling bias led to higher model performance, whatever prior was used, making the effect of priors less pronounced. Under biased and incomplete sampling, the use of global bat data did not improve regional model performance. Without enough bias‐free regional data, we cannot objectively identify the actual improvement of regional models after incorporating information from the global niche. 
However, we still believe in great potential for global model predictions to guide future surveys and improve regional sampling in data‐poor regions.

2.
Leveraging existing presence records and geospatial datasets, species distribution modeling has been widely applied to inform species conservation and restoration efforts. Maxent is one of the most popular modeling algorithms, yet recent research has demonstrated Maxent models are vulnerable to prediction errors related to spatial sampling bias and model complexity. Despite elevated rates of biodiversity imperilment in stream ecosystems, the application of Maxent models to stream networks has lagged, as has the availability of tools to address potential sources of error and calculate model evaluation metrics when modeling in nonraster environments (such as stream networks). Herein, we use Maxent and customized R code to estimate the potential distribution of paddlefish (Polyodon spathula) at a stream‐segment level within the Arkansas River basin, USA, while accounting for potential spatial sampling bias and model complexity. Filtering the presence data appeared to adequately remove an eastward, large‐river sampling bias that was evident within the unfiltered presence dataset. In particular, our novel riverscape filter provided a repeatable means of obtaining a relatively even coverage of presence data among watersheds and streams of varying sizes. The greatest differences in estimated distributions were observed among models constructed with default versus AICc‐selected parameterization. Although all models had similarly high performance and evaluation metrics, the AICc‐selected models were more inclusive of westward‐situated and smaller, headwater streams. Overall, our results solidified the importance of accounting for model complexity and spatial sampling bias in SDMs constructed within stream networks and provided a roadmap for future paddlefish restoration efforts in the study area.

3.
Models of species ecological niches and geographic distributions now represent a widely used tool in ecology, evolution, and biogeography. However, the very common situation of species with few available occurrence localities presents major challenges for such modeling techniques, in particular regarding model complexity and evaluation. Here, we summarize the state of the field regarding these issues and provide a worked example using the technique Maxent for a small mammal endemic to Madagascar (the nesomyine rodent Eliurus majori). Two relevant model‐selection approaches exist in the literature (information criteria, specifically AICc; and performance predicting withheld data, via a jackknife), but AICc is not strictly applicable to machine‐learning algorithms like Maxent. We compare models chosen under each selection approach with those corresponding to Maxent default settings, both with and without spatial filtering of occurrence records to reduce the effects of sampling bias. Both selection approaches chose simpler models than those made using default settings. Furthermore, the approaches converged on a similar answer when sampling bias was taken into account, but differed markedly with the unfiltered occurrence data. Specifically, for that dataset, the models selected by AICc had substantially fewer parameters than those identified by performance on withheld data. Based on our knowledge of the study species, models chosen under both AICc and withheld‐data‐selection showed higher ecological plausibility when combined with spatial filtering. The results for this species intimate that AICc may consistently select models with fewer parameters and be more robust to sampling bias. To test these hypotheses and reach general conclusions, comprehensive research should be undertaken with a wide variety of real and simulated species. 
Meanwhile, we recommend that researchers assess the critical yet underappreciated issue of model complexity both via information criteria and performance on withheld data, comparing the results between the two approaches and taking into account ecological plausibility.
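The information-criterion route described in this abstract can be illustrated with a short sketch. It uses the standard small-sample correction, AICc = 2k - 2 ln L + 2k(k + 1)/(n - k - 1). The function name `aicc`, the candidate labels, and the log-likelihood values are illustrative assumptions, not figures from the study.

```python
def aicc(log_likelihood: float, k: int, n: int) -> float:
    """Small-sample corrected AIC for a model with k parameters fit to n records."""
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined when n <= k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)


# Hypothetical candidates: (label, log-likelihood, number of parameters)
candidates = [("simple", -120.0, 4), ("default", -112.0, 15)]
n_records = 40  # assumed number of occurrence localities

scores = {name: aicc(ll, k, n_records) for name, ll, k in candidates}
best = min(scores, key=scores.get)  # lowest AICc is preferred
```

With few records the correction term grows quickly with k, which is one reason an information criterion may favour simpler models than default settings would produce.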

4.
Modeling plant growth using functional traits is important for understanding the mechanisms that underpin growth and for predicting new situations. We use three data sets on plant height over time and two validation methods (in‐sample model fit and leave‐one‐species‐out cross‐validation) to evaluate non‐linear growth model predictive performance based on functional traits. In‐sample measures of model fit differed substantially from out‐of‐sample model predictive performance; the best fitting models were rarely the best predictive models. Careful selection of predictor variables reduced the bias in parameter estimates, and there was no single best model across our three data sets. Testing and comparing multiple model forms is important. We developed an R package with a formula interface for straightforward fitting and validation of hierarchical, non‐linear growth models. Our intent is to encourage thorough testing of multiple growth model forms and an increased emphasis on assessing model fit relative to a model's purpose.

5.
Species distribution models (SDMs) are often calibrated using presence‐only datasets plagued with environmental sampling bias, which leads to a decrease of model accuracy. In order to compensate for this bias, it has been suggested that background data (or pseudoabsences) should represent the area that has been sampled. However, spatially‐explicit knowledge of sampling effort is rarely available. In multi‐species studies, sampling effort has been inferred following the target‐group (TG) approach, where aggregated occurrence of TG species informs the selection of background data. However, little is known about the species‐specific response to this type of bias correction. The present study aims at evaluating the impacts of sampling bias and bias correction on SDM performance. To this end, we designed a realistic system of sampling bias and virtual species based on 92 terrestrial mammal species occurring in the Mediterranean basin. We manipulated presence and background data selection to calibrate four SDM types. Unbiased (unbiased presence data) and biased (biased presence data) SDMs were calibrated using randomly distributed background data. We used real and TG‐estimated sampling efforts in background selection to correct for sampling bias in presence data. Overall, environmental sampling bias had a deleterious effect on SDM performance. In addition, bias correction improved model accuracy, especially when based on spatially‐explicit knowledge of sampling effort. However, our results highlight important species‐specific variations in susceptibility to sampling bias, which were largely explained by range size: widely‐distributed species were most vulnerable to sampling bias and bias correction was even detrimental for narrow‐ranging species. Furthermore, spatial discrepancies in SDM predictions suggest that bias correction effectively replaces an underestimation bias with an overestimation bias, particularly in areas of low sampling intensity. 
Thus, our results call for a better estimation of sampling effort in multispecies systems, and caution against the uninformed and automatic application of TG bias correction.

6.
Accurate modeling of geographic distributions of species is crucial to various applications in ecology and conservation. The best performing techniques often require some parameter tuning, which may be prohibitively time‐consuming to do separately for each species, or unreliable for small or biased datasets. Additionally, even with the abundance of good quality data, users interested in the application of species models need not have the statistical knowledge required for detailed tuning. In such cases, it is desirable to use “default settings”, tuned and validated on diverse datasets. Maxent is a recently introduced modeling technique, achieving high predictive accuracy and enjoying several additional attractive properties. The performance of Maxent is influenced by a moderate number of parameters. The first contribution of this paper is the empirical tuning of these parameters. Since many datasets lack information about species absence, we present a tuning method that uses presence‐only data. We evaluate our method on independently collected high‐quality presence‐absence data. In addition to tuning, we introduce several concepts that improve the predictive accuracy and running time of Maxent. We introduce “hinge features” that model more complex relationships in the training data; we describe a new logistic output format that gives an estimate of probability of presence; finally we explore “background sampling” strategies that cope with sample selection bias and decrease model‐building time. 
Our evaluation, based on a diverse dataset of 226 species from 6 regions, shows: 1) default settings tuned on presence‐only data achieve performance which is almost as good as if they had been tuned on the evaluation data itself; 2) hinge features substantially improve model performance; 3) logistic output improves model calibration, so that large differences in output values correspond better to large differences in suitability; 4) “target‐group” background sampling can give much better predictive performance than random background sampling; 5) random background sampling results in a dramatic decrease in running time, with no decrease in model performance.
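The idea behind “target‐group” background sampling in point 4 can be sketched as follows: background cells are drawn in proportion to how often any species of the target group was recorded there, so the background carries roughly the same sampling bias as the presences. This is a simplified illustration, not Maxent's implementation; the function name, the cell identifiers, and sampling with replacement are all assumptions.

```python
import random
from collections import Counter


def target_group_background(tg_records, n_background, seed=0):
    """Draw background cells with probability proportional to target-group record counts."""
    effort = Counter(tg_records)  # cell id -> number of target-group records (effort proxy)
    cells = list(effort)
    weights = [effort[c] for c in cells]
    rng = random.Random(seed)
    # Sampling with replacement is a simplification made for this sketch.
    return rng.choices(cells, weights=weights, k=n_background)


# Hypothetical target-group records: cell "A" was visited far more often than cell "B".
tg_records = ["A"] * 90 + ["B"] * 10
background = target_group_background(tg_records, n_background=1000)
```

Random background sampling would instead give every cell the same weight regardless of survey effort.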

7.
Aim Globally, species distribution patterns in the deep sea are poorly resolved, with spatial coverage being sparse for most taxa and true absence data missing. Increasing human impacts on deep‐sea ecosystems mean that reaching a better understanding of such patterns is becoming more urgent. Cold‐water stony corals (Order Scleractinia) form structurally complex habitats (dense thickets or reefs) that can support a diversity of other associated fauna. Despite their widely accepted ecological importance, records of scleractinian corals on seamounts are patchy and simply not available for most of the global ocean. The objective of this paper is to model the global distribution of suitable habitat for stony corals on seamounts. Location Seamounts worldwide. Methods We compiled a database containing all accessible records of scleractinian corals on seamounts. Two modelling approaches developed for presence‐only data were used to predict global habitat suitability for seamount scleractinians: maximum entropy modelling (Maxent) and environmental niche factor analysis (ENFA). We generated habitat‐suitability maps and used a cross‐validation process with a threshold‐independent metric to evaluate the performance of the models. Results Both models performed well in cross‐validation, although the Maxent method consistently outperformed ENFA. Highly suitable habitat for seamount stony corals was predicted to occur at most modelled depths in the North Atlantic, and in a circumglobal strip in the Southern Hemisphere between 20° and 50° S and shallower than around 1500 m. Seamount summits in most other regions appeared much less likely to provide suitable habitat, except for small near‐surface patches. The patterns of habitat suitability largely reflect current biogeographical knowledge. Environmental variables positively associated with high predicted habitat suitability included the aragonite saturation state, and oxygen saturation and concentration. 
By contrast, low levels of dissolved inorganic carbon, nitrate, phosphate and silicate were associated with high predicted suitability. High correlation among variables made assessing individual drivers difficult. Main conclusions Our models predict environmental conditions likely to play a role in determining large‐scale scleractinian coral distributions on seamounts, and provide a baseline scenario on a global scale. These results present a first‐order hypothesis that can be tested by further sampling. Given the high vulnerability of cold‐water corals to human impacts, such predictions are crucial tools in developing worldwide conservation and management strategies for seamount ecosystems.

8.
Modelling and predicting fungal distribution patterns using herbarium data
Aim The main aims of this study are: (1) to test if temperature and related parameters are the primary determinants of the regional distribution of macrofungi (as is commonly recognized for plants); (2) to test if the success of modelling fungal distribution patterns depends on species and distribution characteristics; and (3) to explore the potential of using herbarium data for modelling and predicting fungal species’ distributions. Location The study area, Norway, spans 58–71° N latitude and 4–32° E longitude, and embraces extensive ecological gradients in a small area. Methods The study is based on 1020 herbarium collections of nine selected species of macrofungi and a set of 75 environmental predictor variables, all recorded in a 5 × 5‐km grid covering Norway. Primarily, generalized linear model (GLM; logistic regression) analyses were used to identify the environmental variables that best accounted for the species’ recorded distributions in Norway. Second, Maxent analyses (using variables identified by GLM) were used to produce predictive potential distribution maps for these species. Results Variables relating to temperature and radiation were most frequently included in the GLMs, and between 24.8% and 59.8% of the variation in single‐species occurrence was accounted for. The fraction of variation explained by the GLMs ranged from 41.6% to 59.8% for species with restricted distributions, and from 24.8% to 39.3% for species with widespread/scattered and intermediate distributions. The two‐step procedure of GLM followed by Maxent gave predictions with very high values for the area under the curve (0.927–0.997), and maps of potential distribution were generally credible. Main conclusions We show that temperature is a key factor governing the distribution of macrofungi in Norway, indicating that fungi may respond strongly to global warming. 
We confirm that modelling success depends partly on species and distribution characteristics, notably on how the distribution relates to the extent of the study area. Our study demonstrates that the combination of GLM and Maxent may be a fruitful approach for biogeography. We conclude that herbarium data improve insight into factors that control the distributions of fungi, of particular value for research on fleshy fungi (mushrooms), which have largely cryptic life cycles.

9.
Systematic species surveys over large areas are mostly not affordable, constraining conservation planners to make best use of incomplete data. Spatially explicit species distribution models (SDM) may be useful to detect and compensate for incomplete information. SDMs can either be based on standardized, systematic sampling in a restricted subarea, or – as a cost‐effective alternative – on data haphazardly collated by “volunteer‐based monitoring schemes” (VMS), area‐wide but inherently biased and of heterogeneous spatial precision. Using data on capercaillie Tetrao urogallus, we evaluated the capacity of SDMs generated from incomplete survey data to localise unknown areas inhabited by the species and to predict relative local observation density. Addressing the trade‐off between data precision, sample size and spatial extent of the sampling area, we compared three different sampling strategies: VMS‐data collected throughout the whole study area (7000 km2) using either 1) exact locations or 2) locations aggregated to grid cells of the size of an average individual home range, and 3) systematic transect counts conducted within a small subarea (23.8 km2). For each strategy, we compared two sample sizes and two modelling methods (ENFA and Maxent), which were evaluated using cross‐validation and independent data. Models based on VMS‐data (strategies 1 and 2) performed equally well in predicting relative observation density and in localizing “unknown” occurrences. They always outperformed strategy 3‐models, irrespective of sample size and modelling method, partly because the VMS‐data provided the more comprehensive clues for setting the discrimination‐threshold for predicting presence or absence. Accounting for potential errors due to extrapolation (e.g. projections outside the environmental domain or potentially biasing variables) reduced, but did not fully compensate for the observed discrepancies. 
As they cover a broader range of species‐habitat relations, the area‐wide data achieved a better model quality with less a‐priori knowledge. Furthermore, in a highly mobile species like capercaillie, a sampling resolution corresponding to an individual's home range can lead to predictions as good as those from exact locations. Consequently, when a trade‐off between the sampling effort and the spatial extent of the sampling area is necessary, less precise data unsystematically collected over a large representative region are preferable to systematically sampled data from a restricted region.

10.
Hepatocellular carcinoma (HCC) is closely associated with abnormal DNA methylation. In this study, we analyzed 450K methylation chip data from 377 HCC samples and 50 adjacent normal samples in the TCGA database. We screened 47,099 differentially methylated sites using Cox regression as well as SVM‐RFE and FW‐SVM algorithms, and constructed a model using three risk categories to predict the overall survival based on 134 methylation sites. The model showed a 10‐fold cross‐validation score of 0.95 and satisfactory predictive power, and correctly classified 26 of 33 samples in a testing set obtained by stratified sampling from high, intermediate and low risk groups.

11.
Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches are increasingly accounting for such dependencies. However, when performing cross‐validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause for the poor performance of uncorrected (random) cross‐validation, noted often by modellers, is dependence structures in the data that persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because often overlooked, is that structured data also provide ample opportunity for overfitting with non‐causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross‐validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolations by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non‐random and blocked cross‐validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross‐validation is nearly universally more appropriate than random cross‐validation if the goal is predicting to new data or predictor space, or for selecting causal predictors. 
We recommend that block cross‐validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
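A minimal sketch of spatial block cross‐validation, as one way to split data "strategically rather than randomly": each point is assigned to a square grid block, and whole blocks (not individual points) are assigned to folds, so nearby points never straddle the training/test divide. The function name, the square-grid blocking, and the round-robin fold assignment are assumptions made for illustration; the abstract does not prescribe a particular scheme.

```python
def spatial_block_folds(points, block_size, n_folds):
    """Assign each (x, y) point to a fold according to the grid block it falls in."""
    def block_of(x, y):
        return (int(x // block_size), int(y // block_size))

    block_ids = sorted({block_of(x, y) for x, y in points})
    # Round-robin assignment of whole blocks to folds.
    fold_of_block = {b: i % n_folds for i, b in enumerate(block_ids)}
    return [fold_of_block[block_of(x, y)] for x, y in points]


# Hypothetical coordinates in arbitrary planar units.
points = [(0.5, 0.5), (0.7, 0.2), (5.1, 0.3), (5.4, 0.9), (0.2, 5.5), (5.8, 5.6)]
folds = spatial_block_folds(points, block_size=5.0, n_folds=2)
```

As the abstract cautions, the block size matters: blocks that are too large may force the model to extrapolate into predictor ranges it never saw during training.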

12.
Aim Elucidating the environmental limits of coral reefs is central to projecting future impacts of climate change on these ecosystems and their global distribution. Recent developments in species distribution modelling (SDM) and the availability of comprehensive global environmental datasets have provided an opportunity to reassess the environmental factors that control the distribution of coral reefs at the global scale as well as to compare the performance of different SDM techniques. Location Shallow waters world‐wide. Methods The SDM methods used were maximum entropy (Maxent) and two presence/absence methods: classification and regression trees (CART) and boosted regression trees (BRT). The predictive variables considered included sea surface temperature (SST), salinity, aragonite saturation state (ΩArag), nutrients, irradiance, water transparency, dust, current speed and intensity of cyclone activity. For many variables both mean and SD were considered, and at weekly, monthly and annually averaged time‐scales. All were transformed to a global 1° × 1° grid to generate coral reef probability maps for comparison with known locations. Model performance was compared in terms of receiver operating characteristic (ROC) curves and area under the curve (AUC) scores. Potential geographical bias was explored via misclassification maps of false positive and negative errors on test data. Results Boosted regression trees consistently outperformed other methods, although Maxent also performed acceptably. The dominant environmental predictors were the temperature variables (annual mean SST, and monthly and weekly minimum SST), followed by, and with their relative importance differing between regions, nutrients, light availability and ΩArag. No systematic bias in SDM performance was found between major coral provinces, but false negatives were more likely for cells containing ‘marginal’ non‐reef‐forming coral communities, e.g. Bermuda. 
Main conclusions Agreement between BRT and Maxent models gives predictive confidence for exploring the environmental limits of coral reef ecosystems at a spatial scale relevant to global climate models (c. 1° × 1°). Although SST‐related variables dominate the coral reef distribution models, contributions from nutrients, ΩArag and light availability were critical in developing models of reef presence in regions such as the Bahamas, South Pacific and Coral Triangle. The steep response in SST‐driven probabilities at low temperatures indicates that latitudinal expansion of coral reef habitat is very sensitive to global warming.

13.
A topic of particular current interest is community‐level approaches to species distribution modelling (SDM), i.e. approaches that simultaneously analyse distributional data for multiple species. Previous studies have looked at the advantages of community‐level approaches for parameter estimation, but not for model selection – the process of choosing which model (and in particular, which subset of environmental variables) to fit to data. We compared the predictive performance of models using the same modelling method (generalised linear models) but choosing the subset of variables to include in the model either simultaneously across all species (community‐level model selection) or separately for each species (species‐specific model selection). Our results across two large presence/absence tree community datasets were inconclusive as to whether there was an overall difference in predictive performance between models fitted via species‐specific vs community‐level model selection. However, we found some evidence that a community approach was best suited to modelling rare species, and its performance decayed with increasing prevalence. That is, when data were sparse there was more opportunity for gains from “borrowing strength” across species via a community‐level approach. Interestingly, we also found that the community‐level approach tended to work better when the model selection problem was more difficult, and more reliably detected “noise” variables that should be excluded from the model.

14.
Logistic Multiple Regression, Principal Component Regression and Classification and Regression Tree Analysis (CART), commonly used in ecological modelling using GIS, are compared with a relatively new statistical technique, Multivariate Adaptive Regression Splines (MARS), to test their accuracy, reliability, implementation within GIS and ease of use. All were applied to the same two data sets, covering a wide range of conditions common in predictive modelling, namely geographical range, scale, nature of the predictors and sampling method. We ran two series of analyses to verify whether model validation by an independent data set was required or cross‐validation on a learning data set sufficed. Results show that validation by independent data sets is needed. Model accuracy was evaluated using the area under the receiver operating characteristic curve (AUC). This measure was used because it summarizes performance across all possible thresholds, and is independent of balance between classes. MARS and Regression Tree Analysis achieved the best prediction success, although the CART model was difficult to use for cartographic purposes due to the high model complexity.
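The two properties of AUC cited in this abstract (no threshold, no dependence on class balance) follow from its pairwise interpretation: AUC equals the probability that a randomly chosen presence receives a higher score than a randomly chosen absence. A minimal sketch under that definition follows; the function name and the example scores are hypothetical.

```python
def auc(scores_pos, scores_neg):
    """Probability that a random presence outscores a random absence (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for a in scores_neg:
            if p > a:
                wins += 1.0
            elif p == a:
                wins += 0.5
    # Dividing by the number of pairs removes any dependence on class balance.
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical model scores at presence and absence sites.
presence = [0.9, 0.8, 0.4]
absence = [0.7, 0.3, 0.2]
value = auc(presence, absence)  # 8 of 9 pairs are ranked correctly
```

The quadratic loop is fine for small test sets; a rank-based formulation would be preferable for large ones.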

15.
The distribution of harbour porpoises in EU waters is poorly understood, and modelled predictions of their distributions could inform the strategic spatial planning of future exploitation of the marine environment to avoid potential conflicts. We analysed satellite telemetry data from 39 harbour porpoises Phocoena phocoena in inner Danish waters using a modelling tool rooted in maximum entropy: Maxent. Maxent does not require absence data and has been shown to be effective for data characterised by small sample size, sampling bias and locational errors. For each season we used an iterative bootstrapping procedure to randomly select among the most precise records from each of the 39 tagged individuals, and ran Maxent on pooled records based on explanatory environmental variables hypothesised to serve as good proxies for harbour porpoise prey abundance. Among our environmental variables, distance to coast and bottom salinity had the most explanatory power, and their response shapes were relatively consistent across most seasons. The predictive power of the models (assessed by ROC‐AUC) ranged from 0.70 to 0.86 within seasons. The southern Kattegat, the Belt Seas, the westernmost part of the Baltic Sea and the Sound were predicted to have relatively high probabilities of occurrence across seasons. In contrast, the central part of Kattegat and the Baltic Sea south and east of Limhamn and Darss Ridge consistently showed low probabilities of occurrence. Areas with the lowest probabilities of occurrence were generally characterised by high predictive uncertainty. Our methods have implications for the analyses of satellite tagged animals in terrestrial and marine environments. By coupling a bootstrapping procedure with Maxent we circumvented some of the statistical challenges presented by satellite telemetry data to generate spatial predictions within the inner Danish waters.
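The bootstrapping step can be sketched roughly as below. The abstract does not say how many records were drawn per individual in each iteration, so this sketch assumes one record per animal per iteration; the function name, animal labels, and coordinates are likewise hypothetical.

```python
import random


def bootstrap_pools(records_by_animal, n_iter, seed=0):
    """Build n_iter pooled samples, each holding one randomly chosen record per animal."""
    rng = random.Random(seed)
    pools = []
    for _ in range(n_iter):
        # Assumption: exactly one record per tagged individual in every pooled sample.
        pools.append([rng.choice(recs) for recs in records_by_animal.values()])
    return pools


# Hypothetical (longitude, latitude) fixes, already restricted to the most precise records.
records_by_animal = {
    "animal_01": [(10.1, 55.2), (10.3, 55.4), (10.2, 55.1)],
    "animal_02": [(11.0, 56.0), (11.2, 56.1)],
}
pools = bootstrap_pools(records_by_animal, n_iter=100)
```

Each pooled sample would then be passed to the distribution model, and the spread of predictions across iterations gives a sense of predictive uncertainty while keeping heavily tracked individuals from dominating the fit.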

16.
Aim Environmental niche models that utilize presence‐only data have been increasingly employed to model species distributions and test ecological and evolutionary predictions. The ideal method for evaluating the accuracy of a niche model is to train a model with one dataset and then test model predictions against an independent dataset. However, a truly independent dataset is often not available, and instead random subsets of the total data are used for ‘training’ and ‘testing’ purposes. The goal of this study was to determine how spatially autocorrelated sampling affects measures of niche model accuracy when using subsets of a larger dataset for accuracy evaluation. Location The distribution of Centaurea maculosa (spotted knapweed; Asteraceae) was modelled in six states in the western United States: California, Oregon, Washington, Idaho, Wyoming and Montana. Methods Two types of niche modelling algorithms – the genetic algorithm for rule‐set prediction (GARP) and maximum entropy modelling (as implemented with Maxent) – were used to model the potential distribution of C. maculosa across the region. The effect of spatially autocorrelated sampling was examined by applying a spatial filter to the presence‐only data (to reduce autocorrelation) and then comparing predictions made using the spatial filter with those using a random subset of the data, equal in sample size to the filtered data. Results The accuracy of predictions from both algorithms was sensitive to the spatial autocorrelation of sampling effort in the occurrence data. Spatial filtering led to lower values of the area under the receiver operating characteristic curve plot but higher similarity statistic (I) values when compared with predictions from models built with random subsets of the total data, meaning that spatial autocorrelation of sampling effort between training and test data led to inflated measures of accuracy. 
Main conclusions The findings indicate that care should be taken when interpreting the results from presence‐only niche models when training and test data have been randomly partitioned but occurrence data were non‐randomly sampled (in a spatially autocorrelated manner). The higher accuracies obtained without the spatial filter are a result of spatial autocorrelation of sampling effort between training and test data inflating measures of prediction accuracy. If independently surveyed data for testing predictions are unavailable, then it may be necessary to explicitly account for the spatial autocorrelation of sampling effort between randomly partitioned training and test subsets when evaluating niche model predictions.
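A spatial filter of the kind described here can be sketched as a greedy thinning pass that keeps a record only if it lies at least a minimum distance from every record already kept. The abstract does not specify the filter used in the study, so the greedy rule, the planar distance, the function name, and the coordinates are all assumptions.

```python
import math


def spatial_filter(records, min_dist):
    """Greedy thinning: keep a record only if it is at least min_dist from every kept record."""
    kept = []
    for x, y in records:
        if all(math.hypot(x - kx, y - ky) >= min_dist for kx, ky in kept):
            kept.append((x, y))
    return kept


# Hypothetical occurrence records in arbitrary planar units, with two tight clusters.
records = [(0.0, 0.0), (0.1, 0.0), (1.0, 0.0), (1.05, 0.0), (3.0, 0.0)]
filtered = spatial_filter(records, min_dist=0.5)
```

The result of a greedy pass depends on record order, so shuffling the input and repeating the filter is a common way to check that conclusions do not hinge on one particular thinned subset.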

17.
Ecological niche modeling (ENM) has become an important tool in conservation biology. Despite its recent success, several basic issues related to algorithm performance are still being debated. We assess the ability of two of the most popular algorithms, GARP and Maxent, to predict distributions when sampling is geographically biased. We use an extensive data set collected in the Brazilian Cerrado, a biodiversity hotspot in South America. We found that both algorithms give richness predictions that are very similar to other traditionally used richness estimators. Also, both algorithms correctly predicted the presence of most species collected during fieldwork, and failed to predict species collected only in very few cases (usually species with very few known localities, i.e., <5). We also found that Maxent tends to be more sensitive to sampling bias than GARP. However, Maxent performs better when sampling is poor (e.g., low number of data points). Our results indicate that ENM, even when provided with limited and geographically biased localities, is a very useful technique to estimate richness and composition of unsampled areas. We conclude that data generated by ENM maximize the utility of existing biodiversity data, providing a very useful first evaluation. However, for reliable conservation decisions ENM data must be followed by well-designed field inventories, especially for the detection of restricted‐range, rare species.

18.
Recognition of the importance of cross‐validation (‘any technique or instance of assessing how the results of a statistical analysis will generalize to an independent dataset’; Wiktionary, en.wiktionary.org) is one reason that the U.S. Securities and Exchange Commission requires all investment products to carry some variation of the disclaimer, ‘Past performance is no guarantee of future results.’ Even a cursory examination of financial behaviour, however, demonstrates that this warning is regularly ignored, even by those who understand what an independent dataset is. In the natural sciences, an analogue to predicting future returns for an investment strategy is predicting power of a particular algorithm to perform with new data. Once again, the key to developing an unbiased assessment of future performance is through testing with independent data—that is, data that were in no way involved in developing the method in the first place. A ‘gold‐standard’ approach to cross‐validation is to divide the data into two parts, one used to develop the algorithm, the other used to test its performance. Because this approach substantially reduces the sample size that can be used in constructing the algorithm, researchers often try other variations of cross‐validation to accomplish the same ends. As illustrated by Anderson in this issue of Molecular Ecology Resources, however, not all attempts at cross‐validation produce the desired result. Anderson used simulated data to evaluate performance of several software programs designed to identify subsets of loci that can be effective for assigning individuals to population of origin based on multilocus genetic data. Such programs are likely to become increasingly popular as researchers seek ways to streamline routine analyses by focusing on small sets of loci that contain most of the desired signal. 
Anderson found that although some of the programs made an attempt at cross‐validation, all failed to meet the ‘gold standard’ of using truly independent data and therefore produced overly optimistic assessments of power of the selected set of loci, a phenomenon known as ‘high grading bias.’

19.
In model building and model evaluation, cross‐validation is a frequently used resampling method. Unfortunately, this method can be quite time consuming. In this article, we discuss an approximation method that is much faster and can be used in generalized linear models and Cox’ proportional hazards model with a ridge penalty term. Our approximation method is based on a Taylor expansion around the estimate of the full model. In this way, all cross‐validated estimates are approximated without refitting the model. The tuning parameter can now be chosen based on these approximations and can be optimized in less time. The method is most accurate when approximating leave‐one‐out cross‐validation results for large data sets, which is originally the most computationally demanding situation. To demonstrate the method's performance, it is applied to several microarray data sets. The R package penalized, which implements the method, is available on CRAN.
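To illustrate the idea of obtaining cross‐validated results from the full fit alone, the sketch below covers only the Gaussian special case: ridge‐penalized linear regression without an intercept. In that case the objective is quadratic, so a single expansion step around the full estimate is exact and reduces to the familiar leverage shortcut, where the leave‐one‐out residual equals the full‐fit residual divided by one minus the leverage. The function names are hypothetical, and this is not the implementation in the penalized package; for logistic or Cox models the analogous step is only an approximation.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge estimate (no intercept): solve (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def loo_residuals_shortcut(X, y, lam):
    """Leave-one-out residuals from the full fit only, with no refitting."""
    p = X.shape[1]
    A_inv = np.linalg.inv(X.T @ X + lam * np.eye(p))
    beta = A_inv @ X.T @ y
    # Leverage of each observation: h_i = x_i' (X'X + lam*I)^{-1} x_i.
    h = np.einsum("ij,jk,ik->i", X, A_inv, X)
    return (y - X @ beta) / (1.0 - h)

def loo_residuals_refit(X, y, lam):
    """Brute-force leave-one-out residuals: refit the model n times."""
    n = X.shape[0]
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        beta_i = ridge_fit(X[keep], y[keep], lam)
        out[i] = y[i] - X[i] @ beta_i
    return out
```

Averaging the squared shortcut residuals over a grid of `lam` values gives a leave‐one‐out criterion for choosing the tuning parameter at the cost of one fit per candidate value instead of n fits.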

20.
When modelling the distribution of a species, it is often not possible to comprehensively sample the whole distribution of the species and managers may have habitat models based on data from one area that they want to apply in other areas. Hence, an important question is: how accurate are models of the distributions of species when applied beyond the areas where they were developed? A first step in measuring model transferability could be testing models in adjacent areas. We predicted the habitat associations of the brush‐tailed rock‐wallaby (Petrogale penicillata) across two spatial scales in two neighbouring study areas in eastern Australia, south‐east Queensland and north‐east New South Wales. We used classification trees for exploratory data analysis of habitat relationships and then applied logistic regression models to predict species occurrence. We assessed the within‐area discriminative ability of the habitat models using cross‐validation and threshold plots, and tested the predictive ability of the models for adjacent areas using the receiver operating characteristic statistic to determine the area under the curve. We found that models performed well within an area and extrapolating them to adjacent areas resulted in good predictive performance at the site scale but substantially poorer predictive performance at the landscape scale. We conclude that distribution models for wildlife species should only be extrapolated to neighbouring areas with caution when using landscape‐scale environmental variables. Alternatively, only key habitat associations predicted by the models at this scale should be transferred across adjacent areas once verified against local knowledge of the ecology of the study species.
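The area under the ROC curve used above to test predictions in the adjacent area can be computed directly from predicted scores and observed presence/absence. The sketch below uses the rank‐based definition: the proportion of presence/absence pairs in which the presence site receives the higher score, with ties counted as one half. The function name `auc` and the toy numbers in the usage note are illustrative assumptions, not data from the study.

```python
import numpy as np

def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: predicted probabilities of occurrence.
    labels: 1 for observed presence, 0 for observed absence.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    pos = scores[labels]
    neg = scores[~labels]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("need at least one presence and one absence")
    # Compare every presence score with every absence score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / diff.size)
```

For example, `auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])` compares four presence/absence pairs, three of which are ranked correctly. Applying the same function to predictions for the area where a model was developed and for the neighbouring area gives a simple side‐by‐side measure of how well the model transfers.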

