Similar Articles
20 similar articles found (search time: 31 ms)
1.
Recognition of the importance of cross‐validation ('any technique or instance of assessing how the results of a statistical analysis will generalize to an independent dataset'; Wiktionary, en.wiktionary.org) is one reason that the U.S. Securities and Exchange Commission requires all investment products to carry some variation of the disclaimer, 'Past performance is no guarantee of future results.' Even a cursory examination of financial behaviour, however, demonstrates that this warning is regularly ignored, even by those who understand what an independent dataset is. In the natural sciences, an analogue to predicting future returns for an investment strategy is predicting the power of a particular algorithm to perform with new data. Once again, the key to developing an unbiased assessment of future performance is testing with independent data—that is, data that were in no way involved in developing the method in the first place. A 'gold‐standard' approach to cross‐validation is to divide the data into two parts, one used to develop the algorithm, the other used to test its performance. Because this approach substantially reduces the sample size that can be used in constructing the algorithm, researchers often try other variations of cross‐validation to accomplish the same ends. As illustrated by Anderson in this issue of Molecular Ecology Resources, however, not all attempts at cross‐validation produce the desired result. Anderson used simulated data to evaluate the performance of several software programs designed to identify subsets of loci that can be effective for assigning individuals to populations of origin based on multilocus genetic data. Such programs are likely to become increasingly popular as researchers seek ways to streamline routine analyses by focusing on small sets of loci that contain most of the desired signal.
Anderson found that although some of the programs made an attempt at cross‐validation, all failed to meet the 'gold standard' of using truly independent data and therefore produced overly optimistic assessments of the power of the selected set of loci—a phenomenon known as 'high‐grading bias.'
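The 'high‐grading bias' described above is easy to reproduce. In this toy sketch (not the actual software Anderson evaluated), the simulated loci carry no signal at all, yet selecting the most label‐correlated loci and scoring them on the same data yields accuracy well above chance, while the gold‐standard split does not:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 500, 10
X = rng.normal(size=(n, p))      # simulated loci carrying no real signal
y = rng.integers(0, 2, size=n)   # population-of-origin labels

def top_k_loci(X, y, k):
    # rank loci by absolute correlation with the label
    corr = np.abs(np.corrcoef(X.T, y)[-1, :-1])
    return np.argsort(corr)[-k:]

def accuracy(X, y, loci):
    # nearest-centroid assignment using only the selected loci
    c0 = X[y == 0][:, loci].mean(axis=0)
    c1 = X[y == 1][:, loci].mean(axis=0)
    d0 = np.linalg.norm(X[:, loci] - c0, axis=1)
    d1 = np.linalg.norm(X[:, loci] - c1, axis=1)
    return ((d1 < d0).astype(int) == y).mean()

# biased protocol: loci selected AND evaluated on the same data
biased = accuracy(X, y, top_k_loci(X, y, k))

# gold standard: select on one half, evaluate on the untouched half
half = n // 2
honest = accuracy(X[half:], y[half:], top_k_loci(X[:half], y[:half], k))
```

Because the same data are used for both locus selection and evaluation, `biased` should be clearly inflated above 0.5, while `honest` should hover near chance.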

2.
Modeling plant growth using functional traits is important for understanding the mechanisms that underpin growth and for predicting new situations. We use three data sets on plant height over time and two validation methods—in‐sample model fit and leave‐one‐species‐out cross‐validation—to evaluate non‐linear growth model predictive performance based on functional traits. In‐sample measures of model fit differed substantially from out‐of‐sample model predictive performance; the best fitting models were rarely the best predictive models. Careful selection of predictor variables reduced the bias in parameter estimates, and there was no single best model across our three data sets. Testing and comparing multiple model forms is important. We developed an R package with a formula interface for straightforward fitting and validation of hierarchical, non‐linear growth models. Our intent is to encourage thorough testing of multiple growth model forms and an increased emphasis on assessing model fit relative to a model's purpose.
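The leave‐one‐species‐out scheme used above generalizes to any grouped data: each fold holds out all rows belonging to one group. A minimal sketch (the group labels are hypothetical):

```python
import numpy as np

def leave_one_group_out(groups):
    """Yield (group, train_idx, test_idx), holding out all rows of one
    group (here: one species) at a time."""
    groups = np.asarray(groups)
    for g in np.unique(groups):
        yield g, np.where(groups != g)[0], np.where(groups == g)[0]

species = ["oak", "oak", "pine", "fir", "pine"]
splits = {g: (tr.tolist(), te.tolist()) for g, tr, te in leave_one_group_out(species)}
```

Each species appears in exactly one test set, so predictive performance is always measured on a species the model never saw during fitting.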

3.
Ecological data often show temporal, spatial, hierarchical (random effects), or phylogenetic structure. Modern statistical approaches increasingly account for such dependencies. However, when performing cross‐validation, these structures are regularly ignored, resulting in serious underestimation of predictive error. One cause of the poor performance of uncorrected (random) cross‐validation, often noted by modellers, is that dependence structures in the data persist as dependence structures in model residuals, violating the assumption of independence. Even more concerning, because often overlooked, is that structured data also provide ample opportunity for overfitting with non‐causal predictors. This problem can persist even if remedies such as autoregressive models, generalized least squares, or mixed models are used. Block cross‐validation, where data are split strategically rather than randomly, can address these issues. However, the blocking strategy must be carefully considered. Blocking in space, time, random effects or phylogenetic distance, while accounting for dependencies in the data, may also unwittingly induce extrapolation by restricting the ranges or combinations of predictor variables available for model training, thus overestimating interpolation errors. On the other hand, deliberate blocking in predictor space may also improve error estimates when extrapolation is the modelling goal. Here, we review the ecological literature on non‐random and blocked cross‐validation approaches. We also provide a series of simulations and case studies, in which we show that, for all instances tested, block cross‐validation is nearly universally more appropriate than random cross‐validation if the goal is predicting to new data or predictor space, or selecting causal predictors.
We recommend that block cross‐validation be used wherever dependence structures exist in a dataset, even if no correlation structure is visible in the fitted model residuals, or if the fitted models account for such correlations.
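Blocking along one dimension (time, a spatial coordinate, or a predictor) can be sketched by cutting the axis at quantiles, so that blocks are contiguous and roughly balanced. A minimal 1‐D illustration (not the authors' implementation):

```python
import numpy as np

def block_folds(coords, n_blocks):
    """Assign samples to contiguous blocks along one axis by cutting
    at quantiles, so fold sizes are roughly equal."""
    coords = np.asarray(coords)
    edges = np.quantile(coords, np.linspace(0, 1, n_blocks + 1))
    # interior edges only: digitize returns block ids 0..n_blocks-1
    return np.digitize(coords, edges[1:-1])

x = np.array([0.1, 0.2, 0.35, 0.5, 0.55, 0.8, 0.9, 0.95])
folds = block_folds(x, 4)
```

Holding out one block at a time then forces the model to predict into a contiguous region it has not seen, rather than interpolating between immediate neighbours.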

4.
Acute myeloid leukaemia (AML) is the most common type of adult acute leukaemia and has a poor prognosis. Thus, optimal risk stratification is of the greatest importance for a reasonable choice of treatment and for prognostic evaluation. For our study, a total of 1707 samples of AML patients from three public databases were divided into meta‐training, meta‐testing and validation sets. The meta‐training set was used to build the risk prediction model, and the other four data sets were employed for validation. By log‐rank test and univariate Cox regression analysis as well as LASSO‐Cox, AML patients were divided into high‐risk and low‐risk groups based on an AML risk score (AMLRS) constituted by 10 survival‐related genes. In the meta‐training, meta‐testing and validation sets, the patients in the low‐risk group all had significantly longer overall survival (OS) than those in the high‐risk group (P < .001), and the area under the ROC curve (AUC) from time‐dependent ROC analysis was 0.5854-0.7905 for 1 year, 0.6652-0.8066 for 3 years and 0.6622-0.8034 for 5 years. Multivariate Cox regression analysis indicated that the AMLRS was an independent prognostic factor in all four data sets. A nomogram combining the AMLRS and two clinical parameters performed well in predicting 1‐year, 3‐year and 5‐year OS. Finally, we created a web‐based prognostic model to predict the prognosis of AML patients ( https://tcgi.shinyapps.io/amlrs_nomogram/ ).
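The high‐risk/low‐risk split described above is, at its core, a thresholded linear score. A minimal sketch with hypothetical gene weights (real weights would come from the LASSO‐Cox fit):

```python
import numpy as np

def risk_groups(expr, coefs):
    """Risk score = weighted sum of gene expression values (weights as
    they would come from a LASSO-Cox fit); patients whose score exceeds
    the cohort median are labelled 'high' risk."""
    score = np.asarray(expr) @ np.asarray(coefs)
    label = np.where(score > np.median(score), "high", "low")
    return score, label

# hypothetical 2-gene panel, 4 patients
expr = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [0.0, 0.0]]
score, label = risk_groups(expr, [1.0, 2.0])
```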

5.
Species distribution modelling (SDM) has become an essential method in ecology and conservation. In the absence of survey data, the majority of SDMs are calibrated with opportunistic presence‐only data, incurring substantial sampling bias. We address the challenge of correcting for sampling bias in data‐sparse situations. We modelled the relative intensity of bat records across their entire range using three modelling algorithms under the point‐process modelling framework (GLMs with subset selection, GLMs fitted with an elastic‐net penalty, and Maxent). To correct for sampling bias, we applied model‐based bias correction by incorporating spatial information on site accessibility or sampling effort. We evaluated the effect of bias correction on the models' predictive performance (AUC and TSS), calculated with spatial‐block cross‐validation and a holdout data set. When evaluated with independent, but also sampling‐biased, test data, correction for sampling bias led to improved predictions. The predictive performance of the three modelling algorithms was very similar; elastic‐net models had intermediate performance, with a slight advantage for GLMs on cross‐validation and for Maxent on holdout evaluation. Model‐based bias correction is very useful in data‐sparse situations, where detailed data are not available to apply other bias correction methods. However, the success of bias correction depends on how well the selected bias variables describe the sources of bias. In this study, accessibility covariates described bias in our data better than the effort covariate, and their use led to larger changes in predictive performance. Objectively evaluating bias correction requires bias‐free presence–absence test data; without them, the real improvement in describing a species' environmental niche cannot be assessed.
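TSS, one of the two performance measures used above, is simply sensitivity plus specificity minus one. A minimal implementation:

```python
def tss(y_true, y_pred):
    """True Skill Statistic = sensitivity + specificity - 1, computed
    from binary presence/absence observations and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn) + tn / (tn + fp) - 1
```

TSS ranges from -1 to 1, with 0 meaning no better than chance, which is why it is often preferred over raw accuracy for imbalanced presence/absence data.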

6.
Establishing the sex of individuals in wild systems can be challenging and often requires genetic testing. Genotyping‐by‐sequencing (GBS) and other reduced‐representation DNA sequencing (RRS) protocols (e.g., RADseq, ddRAD) have enabled the analysis of genetic data on an unprecedented scale. Here, we present a novel approach for the discovery and statistical validation of sex‐specific loci in GBS data sets. We used GBS to genotype 166 New Zealand fur seals (NZFS, Arctocephalus forsteri) of known sex. We retained monomorphic loci as potential sex‐specific markers in the locus discovery phase. We then used (i) a sex‐specific locus threshold (SSLT) to identify significantly male‐specific loci within our data set; and (ii) a significant sex‐assignment threshold (SSAT) to confidently assign sex in silico, based on the presence or absence of significantly male‐specific loci, to individuals in our data set treated as unknowns (98.9% accuracy for females; 95.8% for males, estimated via cross‐validation). Furthermore, we assigned sex to 86 individuals of truly unknown sex using our SSAT and assessed the effect of SSLT adjustments on these assignments. From 90 verified sex‐specific loci, we developed a panel of three sex‐specific PCR primers that we used to ascertain sex independently of our GBS data and that, as we show, amplify reliably in at least two other pinniped species. Using monomorphic loci normally discarded from large SNP data sets is an effective way to identify robust sex‐linked markers for nonmodel species. Our novel pipeline can be used to identify and statistically validate monomorphic and polymorphic sex‐specific markers across a range of species and RRS data sets.
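The SSLT/SSAT idea can be sketched as two steps: flag loci present in most males and no females, then assign sex by the presence of any flagged locus. The threshold and data below are illustrative, not the paper's values:

```python
import numpy as np

def male_specific_loci(presence, is_male, min_prop=0.9):
    """Flag loci observed in >= min_prop of males and in no females --
    a simplified stand-in for the sex-specific locus threshold (SSLT)."""
    presence, is_male = np.asarray(presence), np.asarray(is_male)
    male_rate = presence[is_male].mean(axis=0)
    female_rate = presence[~is_male].mean(axis=0)
    return np.where((female_rate == 0) & (male_rate >= min_prop))[0]

def assign_sex(sample, loci):
    """In-silico assignment: any male-specific locus present => male."""
    return "male" if np.asarray(sample)[loci].any() else "female"

# rows: individuals, columns: locus presence/absence (toy data)
presence = [[1, 1, 0], [1, 0, 1], [0, 1, 0], [0, 1, 1]]
is_male = [True, True, False, False]
loci = male_specific_loci(presence, is_male)
```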

7.
Man Jin & Yixin Fang, Biometrics (2011), 67(1):124–132
In family studies, canonical discriminant analysis can be used to find linear combinations of phenotypes that exhibit high ratios of between‐family to within‐family variability. But with large numbers of phenotypes, canonical discriminant analysis may overfit. To estimate the predicted ratios associated with the coefficients obtained from canonical discriminant analysis, two methods are developed: one based on bias correction and the other on cross‐validation. Because the cross‐validation is computationally intensive, an approximation to it is also developed. Furthermore, these methods can be applied to perform variable selection in canonical discriminant analysis. The proposed methods are illustrated with simulation studies and applications to two real examples.
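The quantity canonical discriminant analysis maximizes is the ratio of between‐family to within‐family variability. For a single phenotype (or a fixed linear combination of phenotypes), it can be computed directly:

```python
import numpy as np

def variance_ratio(x, family):
    """Between-family to within-family variability ratio for one
    phenotype (or a fixed linear combination of phenotypes)."""
    x, family = np.asarray(x, float), np.asarray(family)
    grand = x.mean()
    between = within = 0.0
    for f in np.unique(family):
        xf = x[family == f]
        between += len(xf) * (xf.mean() - grand) ** 2
        within += ((xf - xf.mean()) ** 2).sum()
    return between / within
```

Canonical discriminant analysis searches over coefficient vectors for the linear combination that maximizes exactly this ratio, which is why, with many phenotypes, the in‐sample ratio is an optimistically biased estimate of the predicted ratio.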

8.
9.
MicroRNAs (miRNAs) have recently been confirmed to be important molecules within many crucial biological processes and are therefore related to various complex human diseases. However, previous methods of predicting miRNA–disease associations have their own deficiencies. Under this circumstance, we developed a prediction method called deep representations‐based miRNA–disease association (DRMDA) prediction. The original miRNA–disease association data were extracted from the HMDD database. A stacked auto‐encoder, a greedy layer‐wise unsupervised pre‐training algorithm and a support vector machine were implemented to predict potential associations. We compared DRMDA with five previous classical prediction models (HGIMDA, RLSMDA, HDMP, WBSMDA and RWRMDA) in global leave‐one‐out cross‐validation (LOOCV), local LOOCV and fivefold cross‐validation, respectively. The AUCs achieved by DRMDA were 0.9177, 0.8339 and 0.9156 ± 0.0006 in the three tests, respectively. In further case studies, we predicted the top 50 potential miRNAs for colon neoplasms, lymphoma and prostate neoplasms, and 88%, 90% and 86% of the predicted miRNAs, respectively, could be verified by experimental evidence. In conclusion, DRMDA is a promising prediction method that could identify potential and novel miRNA–disease associations.
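The AUC reported for each cross‐validation scheme has a simple rank interpretation: the probability that a random known association outscores a random non‐association. A minimal implementation:

```python
def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random
    negative; score ties count one half."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))
```

This O(n·m) pairwise form is equivalent to integrating the ROC curve and is convenient for checking the output of faster rank‐based implementations on small examples.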

10.
Time‐varying individual covariates are problematic in experiments with marked animals because the covariate can typically only be observed when each animal is captured. We examine three methods to incorporate time‐varying individual covariates of the survival probabilities into the analysis of data from mark‐recapture‐recovery experiments: deterministic imputation, a Bayesian imputation approach based on modeling the joint distribution of the covariate and the capture history, and a conditional approach considering only the events for which the associated covariate data are completely observed (the trinomial model). After describing the three methods, we compare results from their application to the analysis of the effect of body mass on the survival of Soay sheep (Ovis aries) on the Isle of Hirta, Scotland. Simulations based on these results are then used to make further comparisons. We conclude that the trinomial model and the Bayesian imputation method each perform best in different situations. If the capture and recovery probabilities are all high, then the trinomial model produces precise, unbiased estimators that do not depend on any assumptions regarding the distribution of the covariate. In contrast, the Bayesian imputation method performs substantially better when capture and recovery probabilities are low, provided that the specified model of the covariate is a good approximation to the true data‐generating mechanism.
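Deterministic imputation, the first of the three methods, can be as simple as carrying the last observed covariate value forward between captures (a sketch of the idea, not the authors' exact rule):

```python
def impute_last_observed(values):
    """Deterministic imputation for a time-varying covariate: fill
    unobserved occasions (None) with the last observed value; leading
    gaps take the first observed value. Assumes at least one observation."""
    out, last = [], None
    for v in values:
        if v is not None:
            last = v
        out.append(last)
    first = next(v for v in out if v is not None)
    return [first if v is None else v for v in out]
```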

11.
There is an increasing need for life cycle data for bio‐based products, which becomes particularly evident with the recent drive for greenhouse gas reporting and carbon footprinting studies. Meeting this need is challenging given that many bio‐products have not yet been studied by life cycle assessment (LCA), and those that have are specific and limited to certain geographic regions. In an attempt to bridge data gaps for bio‐based products, LCA practitioners can use either proxy data sets (e.g., use existing environmental data for apples to represent pears) or extrapolated data (e.g., derive new data for pears by modifying data for apples considering pear‐specific production characteristics). This article explores the challenges and consequences of using these two approaches. Several case studies are used to illustrate the trade‐offs between uncertainty and the ease of application, with carbon footprinting as an example. As shown, the use of proxy data sets is the quickest and easiest solution for bridging data gaps but also has the highest uncertainty. In contrast, data extrapolation methods may require extensive expert knowledge and are thus harder to use but give more robust results in bridging data gaps. They can also provide a sound basis for understanding variability in bio‐based product data. If resources (time, budget, and expertise) are limited, the use of averaged proxy data may be an acceptable compromise for initial or screening assessments. Overall, the article highlights the need for further research on the development and validation of different approaches to bridging data gaps for bio‐based products.

12.
In this work, a methodology for model‐based identifiable parameter determination (MBIPD) is presented. This systematic approach is proposed for the structure and parameter identification of nonlinear models of biological reaction networks. Such problems are usually over‐parameterized, with large correlations between parameters, so the related inverse problems for parameter determination and analysis are mathematically ill‐posed and numerically difficult to solve. The proposed MBIPD methodology comprises several tasks: (i) model selection, (ii) tracking of an adequate initial guess, and (iii) an iterative parameter estimation step which includes an identifiable parameter subset selection (SsS) algorithm and accuracy analysis of the estimated parameters. The SsS algorithm is based on analysis of the sensitivity matrix by rank‐revealing factorization methods. Using this, the parameter search space is reduced to a reasonable subset that can be reliably and efficiently estimated from available measurements. The simultaneous saccharification and fermentation (SSF) process for bio‐ethanol production from cellulosic material is used as a case study for testing the methodology. The successful application of MBIPD to the SSF process demonstrates a relatively large reduction in the identified parameter space. Cross‐validation shows that, despite the reduction of the search space, the model with the identified parameters is still able to predict the experimental data properly. Moreover, the model is easily and efficiently adapted to new process conditions by solving reduced and well‐conditioned problems. © 2013 American Institute of Chemical Engineers Biotechnol. Prog., 29:1064–1082, 2013

13.
LncRNAs and miRNAs are key molecules in the competing endogenous RNA (ceRNA) mechanism, and their interactions have been discovered to play important roles in gene regulation. As a supplement to the identification of lncRNA–miRNA interactions from CLIP‐seq experiments, in silico prediction can select the most promising candidates for experimental validation. Although developing computational tools for predicting lncRNA–miRNA interactions is of great importance for deciphering the ceRNA mechanism, little effort has been made in this direction. In this paper, we propose an approach based on linear neighbour representation to predict lncRNA–miRNA interactions (LNRLMI). Specifically, we first constructed a bipartite network by combining the known interaction network with similarities based on the expression profiles of lncRNAs and miRNAs. Based on this data integration, the linear neighbour representation method was used to construct a prediction model. To evaluate the prediction performance of the proposed model, k‐fold cross‐validations were implemented. LNRLMI yielded average AUCs of 0.8475 ± 0.0032, 0.8960 ± 0.0015 and 0.9069 ± 0.0014 on 2‐fold, 5‐fold and 10‐fold cross‐validation, respectively. A series of comparison experiments with other methods showed that our method is feasible and effective for predicting lncRNA–miRNA interactions via a combination of different types of useful side information. It is anticipated that LNRLMI could be a useful tool for predicting the non‐coding RNA regulatory networks in which lncRNAs and miRNAs are involved.
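The k‐fold cross‐validation used to evaluate LNRLMI partitions the known interactions into k disjoint test sets. A minimal shuffled splitter:

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffled k-fold split: k disjoint, sorted test-index lists that
    together cover range(n)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [sorted(idx[i::k]) for i in range(k)]

folds = k_fold_indices(10, 5)
```

Each fold serves once as the held‐out test set while the remaining k-1 folds are used for training; the reported AUC is the average over the k test folds.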

14.
An input‐output‐based life cycle inventory (IO‐based LCI) is grounded in economic environmental input‐output analysis (IO analysis). It is a fast and low‐budget method for generating LCI data sets, and is used to close data gaps in life cycle assessment (LCA). Because its methodological basis differs from that of process‐based inventory, its application in LCA is a matter of controversy. We developed a German IO‐based approach to derive IO‐based LCI data sets that is based on the German IO accounts and on the German environmental accounts, which provide data on the sector‐specific direct emissions of seven airborne compounds. The method for calculating German IO‐based LCI data sets for building products is explained in detail. The appropriateness of employing IO‐based LCI for German buildings is analyzed by using process‐based LCI data from the Swiss Ecoinvent database to validate the calculated IO‐based LCI data. The extent of the deviations between process‐based LCI and IO‐based LCI varies considerably across the airborne emissions we investigated. We carried out a systematic evaluation of the possible reasons for this deviation. This analysis shows that sector‐specific effects (aggregation of sectors) and the quality of primary data for emissions from national inventory reporting (NIR) are the main reasons for the deviations. As a rule, IO‐based LCI data sets seem to underestimate specific emissions while overestimating sector‐specific aspects.
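The core of IO analysis is the Leontief inverse: total emission intensities are the direct sector emissions propagated through the inter‐sector requirements matrix. A two‐sector toy example with hypothetical coefficients:

```python
import numpy as np

def io_intensities(A, direct_emissions):
    """IO-based LCI core: total (direct + upstream) emission intensities
    e (I - A)^-1, with A the inter-sector requirements matrix and e the
    direct emissions per unit of sector output."""
    n = len(A)
    leontief = np.linalg.inv(np.eye(n) - np.asarray(A))
    return np.asarray(direct_emissions) @ leontief

# toy two-sector economy (hypothetical technical coefficients)
A = [[0.0, 0.2], [0.1, 0.0]]
total = io_intensities(A, [1.0, 2.0])
```

The total intensities exceed the direct emissions because each sector's output embodies the emissions of its suppliers, the suppliers' suppliers, and so on.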

15.
Agro‐Land Surface Models (agro‐LSM) combine detailed crop models and large‐scale vegetation models (DGVMs) to model the spatial and temporal distribution of energy, water, and carbon fluxes within the soil–vegetation–atmosphere continuum worldwide. In this study, we identify and optimize parameters controlling leaf area index (LAI) in the agro‐LSM ORCHIDEE‐STICS developed for sugarcane. Using the Morris method to identify the key parameters impacting LAI at eight sugarcane field trial sites in Australia and on La Réunion island, we determined that the three most important parameters for simulating LAI are (i) the maximum predefined rate of LAI increase during the early crop development phase, (ii) a parameter that defines a plant density threshold below which individual plants do not compete for growing their LAI, and (iii) a parameter defining a threshold for nitrogen stress on LAI. A multisite calibration of these three parameters is performed using three different scoring functions. The impact of the choice of a particular scoring function on the optimized parameter values is investigated by testing scoring functions defined from the model–data RMSE, the figure of merit and a Bayesian quadratic model–data misfit function. The robustness of the calibration is evaluated for each of the three scoring functions with a systematic cross‐validation method to find the most satisfactory one. Our results show that the figure‐of‐merit scoring function is the most robust metric for establishing the best parameter values controlling the LAI. The multisite average figure‐of‐merit score improved from 67% agreement to 79%. The residual error in LAI simulation after the calibration is discussed.
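The Morris method screens parameters via elementary effects: one‐at‐a‐time finite‐difference sensitivities of the model output, sampled at many points and summarized by their mean and standard deviation. A minimal sketch of a single elementary effect, using a toy model rather than ORCHIDEE‐STICS:

```python
def elementary_effect(f, x, i, delta):
    """One Morris elementary effect: the finite-difference sensitivity
    of model output f to a perturbation delta in parameter i."""
    x_pert = list(x)
    x_pert[i] += delta
    return (f(x_pert) - f(x)) / delta

# toy model: output is linear in p[0] and quadratic in p[1]
model = lambda p: 3.0 * p[0] + p[1] ** 2
```

In a full Morris screening, effects are computed along many random trajectories through parameter space; a large mean absolute effect flags an influential parameter, and a large standard deviation flags non‐linearity or interactions.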

16.
DNA methylation is an important biological regulatory mechanism that changes gene expression without altering the DNA sequence. Increasing studies have revealed that DNA methylation data play a vital role in the field of oncology. However, the methylation site signature in triple‐negative breast cancer (TNBC) remains unknown. In our research, we analysed 158 TNBC samples and 98 noncancerous samples from The Cancer Genome Atlas (TCGA) in three phases. In the discovery phase, 86 CpGs were identified by univariate Cox proportional hazards regression (CPHR) analyses to be significantly correlated with overall survival (P < 0.01). In the training phase, these candidate CpGs were further narrowed down to a 15‐CpG‐based signature by conducting least absolute shrinkage and selector operator (LASSO) Cox regression in the training set. In the validation phase, the 15‐CpG‐based signature was verified using two different internal sets and one external validation set. Furthermore, a nomogram comprising the CpG‐based signature and TNM stage was generated to predict the 1‐, 3‐ and 5‐year overall survival in the primary set, and it showed excellent performance in the three validation sets (concordance indexes: 0.924, 0.974 and 0.637). This study showed that our nomogram has a precise predictive effect on the prognosis of TNBC and can potentially be implemented for clinical treatment and diagnosis.
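The concordance index used to evaluate the nomogram measures how often higher risk scores go with earlier events. A minimal version of Harrell's C (handling censoring only by skipping pairs whose earlier time is censored):

```python
def concordance_index(times, events, scores):
    """Harrell's C: among usable pairs (the earlier time is an observed
    event), the fraction where the higher risk score belongs to the
    earlier failure; score ties count one half."""
    num = den = 0.0
    for i in range(len(times)):
        for j in range(len(times)):
            if events[i] and times[i] < times[j]:
                den += 1.0
                if scores[i] > scores[j]:
                    num += 1.0
                elif scores[i] == scores[j]:
                    num += 0.5
    return num / den
```

A C‐index of 0.5 is chance‐level ranking and 1.0 is perfect, which puts the reported 0.924, 0.974 and 0.637 in context.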

17.
A Markov chain Monte Carlo (MCMC) method was applied to model the kinetics of a fed‐batch Chinese hamster ovary cell culture process in 5,000‐L bioreactors. The kinetic model consists of six differential equations, which describe the dynamics of viable cell density and the concentrations of glucose, glutamine, ammonia, lactate, and the antibody fusion protein B1 (B1). The kinetic model has 18 parameters, six of which were calculated from the cell culture data, whereas the other 12 were estimated with an MCMC method from a training data set that comprised seven cell culture runs. The model was confirmed on two validation data sets that represented a perturbation of the cell culture condition. The agreement between the predicted and measured values in both validation data sets may indicate high reliability of the model estimates. The kinetic model uniquely incorporated ammonia removal and an exponential function of the B1 protein concentration. The model indicated that ammonia and lactate play critical roles in cell growth and that low concentrations of glucose (0.17 mM) and glutamine (0.09 mM) in the cell culture medium may help reduce ammonia and lactate production. The model demonstrated that 83% of the glucose consumed was used for cell maintenance during the late phase of the cell cultures, whereas the maintenance coefficient for glutamine was negligible. Finally, the kinetic model suggests that sustaining a high number of viable cells is critical for B1 production. The MCMC methodology may be a useful tool for modeling the kinetics of a fed‐batch mammalian cell culture process. © 2009 American Institute of Chemical Engineers Biotechnol. Prog., 2010
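A kinetic model like this is a system of ODEs, and the simplest way to simulate one is fixed‐step Euler integration. The two‐state Monod‐type model below is a toy, with made‐up constants rather than the paper's fitted parameters:

```python
def euler(f, y0, t0, t1, steps):
    """Fixed-step Euler integration of dy/dt = f(t, y) -- the simplest
    way to simulate a kinetic ODE model."""
    y, t = list(y0), t0
    h = (t1 - t0) / steps
    for _ in range(steps):
        dy = f(t, y)
        y = [yi + h * di for yi, di in zip(y, dy)]
        t += h
    return y

# toy two-state culture: viable cells X grow on glucose S
# (hypothetical Monod constants, NOT the paper's fitted parameters)
mu_max, K_s, yield_xs = 0.05, 1.0, 0.5
def rhs(t, y):
    X, S = y
    growth = mu_max * X * S / (K_s + S)
    return [growth, -growth / yield_xs]

X_end, S_end = euler(rhs, [1.0, 10.0], 0.0, 10.0, 100)
```

In practice a stiff‐aware solver with adaptive step size would be used, but Euler makes the mechanics of simulating the six‐equation system transparent.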

18.
MicroRNAs (miRNAs) have been confirmed by many experimental studies to be closely related to various complex human diseases. It is therefore necessary and valuable to develop powerful and effective computational models to predict potential associations between miRNAs and diseases. In this work, we present a prediction model of Graphlet Interaction for MiRNA‐Disease Association prediction (GIMDA) that integrates disease semantic similarity, miRNA functional similarity, Gaussian interaction profile kernel similarity and the experimentally confirmed miRNA‐disease associations. The relatedness score of a miRNA to a disease is calculated by measuring the graphlet interactions between two miRNAs or two diseases. The novelty of GIMDA lies in using graphlet interactions to analyse the complex relationships between two nodes in a graph. The AUCs of GIMDA in global and local leave‐one‐out cross‐validation (LOOCV) turned out to be 0.9006 and 0.8455, respectively. The average result of five‐fold cross‐validation reached 0.8927 ± 0.0012. In case studies of colon neoplasms, kidney neoplasms and prostate neoplasms based on the HMDD V2.0 database, 45, 45 and 41 of the top 50 potential miRNAs predicted by GIMDA were validated by dbDEMC and miR2Disease. Additionally, in the case study of new diseases without any known associated miRNAs and the case study of predicting potential miRNA‐disease associations using HMDD V1.0, high percentages of the top 50 miRNAs were also verified by the experimental literature.

19.
Disparity‐through‐time analyses can be used to determine how morphological diversity changes in response to mass extinctions, or to investigate the drivers of morphological change. These analyses are routinely applied to palaeobiological datasets, yet, although there is much discussion about how best to calculate disparity, there has been little consideration of how taxa should be sub‐sampled through time. Standard practice is to group taxa into discrete time bins, often based on stratigraphic periods. However, this can introduce biases when bins are of unequal size, and it implicitly assumes a punctuated model of evolution. In addition, many time bins may have few or no taxa, meaning that disparity cannot be calculated for the bin, which makes it harder to complete downstream analyses. Here we describe a different method to complement the disparity‐through‐time tool‐kit: time‐slicing. This method uses a time‐calibrated phylogenetic tree to sample disparity‐through‐time at any fixed point in time rather than binning taxa. It uses all available data (tips, nodes and branches) to increase the power of the analyses, specifies the implied model of evolution (punctuated or gradual), and is implemented in R. We test the time‐slicing method on four example datasets and compare its performance in common disparity‐through‐time analyses. We find that the way we sub‐sample taxa through time can change our interpretations of the results of disparity‐through‐time analyses. We advise using multiple methods for sub‐sampling taxa through time, rather than just time binning, to gain a better understanding of disparity through time.
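At its core, time‐slicing asks which branches of the time‐calibrated tree are alive at a chosen time t. A minimal sketch with hypothetical branch intervals (the authors' implementation is in R and also handles how to assign trait values along branches):

```python
def lineages_at(branches, t):
    """Time-slice sketch: the branches of a time-calibrated tree alive
    at time t, given each branch's (start, end) time interval."""
    return [name for name, (start, end) in branches.items() if start <= t <= end]

# hypothetical branch intervals on a common time axis
branches = {"A": (0.0, 3.0), "B": (1.0, 4.0), "C": (3.5, 5.0)}
```

Sampling disparity at arbitrary fixed times avoids the unequal‐bin and empty‐bin problems of stratigraphic binning.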

20.
Age structure is a fundamental aspect of animal population biology. Age is strongly related to individual physiological condition, reproductive potential and mortality rate. Currently, there are no robust molecular methods for age estimation in birds. Instead, individuals must be ringed as chicks to establish known‐age populations, which is a labour‐intensive and expensive process. The estimation of chronological age using DNA methylation (DNAm) is emerging as a robust approach in mammals, including humans, mice and some non‐model species. Here, we quantified DNAm in whole blood samples from a total of 71 known‐age Short‐tailed shearwaters (Ardenna tenuirostris) using digital restriction enzyme analysis of methylation (DREAM). The DREAM method measures DNAm levels at thousands of CpG dinucleotides throughout the genome. We identified seven CpG sites whose DNAm levels correlated with age. A model based on these relationships estimated age with a mean difference of 2.8 years from known age, based on validation estimates from models created by repeated sampling of training and validation data subsets. Longitudinal observation of individuals re‐sampled over 1 or 2 years generally showed an increase in estimated age (6/7 cases). For the first time, we have shown that epigenetic changes with age can be detected in a wild bird. This approach should be of broad interest to researchers studying age biomarkers in non‐model species and will allow identification of markers that can be assessed using targeted techniques for accurate age estimation in large population studies.
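An epigenetic clock of this kind is, at its simplest, a regression of age on CpG methylation levels. A least‐squares sketch with a made‐up single‐CpG data set (real clocks use several sites and are validated on held‐out individuals, as in the abstract):

```python
import numpy as np

def fit_age_model(meth, ages):
    """Least-squares age predictor from CpG methylation levels: an
    intercept plus one weight per CpG site."""
    X = np.column_stack([np.ones(len(ages)), meth])
    coef, *_ = np.linalg.lstsq(X, ages, rcond=None)
    return coef

def predict_age(coef, meth):
    return np.column_stack([np.ones(len(meth)), meth]) @ coef

# toy single-CpG clock: methylation rises linearly with age
coef = fit_age_model([[0.1], [0.2], [0.3]], [1.0, 2.0, 3.0])
pred = predict_age(coef, [[0.4]])
```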
