

High‐grading bias: subtle problems with assessing power of selected subsets of loci for population assignment
Authors: ROBIN S. WAPLES
Institution: NOAA Fisheries, Northwest Fisheries Science Center, 2725 Montlake Blvd. East, Seattle, WA 98112, USA
Abstract: Recognition of the importance of cross‐validation (‘any technique or instance of assessing how the results of a statistical analysis will generalize to an independent dataset’; Wiktionary, en.wiktionary.org) is one reason that the U.S. Securities and Exchange Commission requires all investment products to carry some variation of the disclaimer, ‘Past performance is no guarantee of future results.’ Even a cursory examination of financial behaviour, however, demonstrates that this warning is regularly ignored, even by those who understand what an independent dataset is. In the natural sciences, an analogue to predicting future returns for an investment strategy is predicting the power of a particular algorithm when applied to new data. Once again, the key to developing an unbiased assessment of future performance is testing with independent data—that is, data that were in no way involved in developing the method in the first place. A ‘gold‐standard’ approach to cross‐validation is to divide the data into two parts, one used to develop the algorithm and the other used to test its performance. Because this approach substantially reduces the sample size that can be used in constructing the algorithm, researchers often try other variations of cross‐validation to accomplish the same ends. As illustrated by Anderson in this issue of Molecular Ecology Resources, however, not all attempts at cross‐validation produce the desired result. Anderson used simulated data to evaluate the performance of several software programs designed to identify subsets of loci that can be effective for assigning individuals to their populations of origin based on multilocus genetic data. Such programs are likely to become increasingly popular as researchers seek ways to streamline routine analyses by focusing on small sets of loci that contain most of the desired signal.
Anderson found that although some of the programs made an attempt at cross‐validation, all failed to meet the ‘gold standard’ of using truly independent data and therefore produced overly optimistic assessments of the power of the selected set of loci—a phenomenon known as ‘high‐grading bias’.
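The mechanism behind high‐grading bias can be illustrated with a minimal Python sketch (a hypothetical simulation, not Anderson's code or any of the programs he evaluated). Two populations are simulated with identical allele frequencies at every locus, so the true power to assign individuals is nil. Selecting the most divergent loci on the full dataset and then assessing assignment accuracy on those same individuals nonetheless yields apparent accuracy well above chance, whereas the split‐sample ‘gold standard’—selecting loci on one half of the data and testing on the held‐out half—does not.

```python
import random

random.seed(42)
N, L, K = 40, 200, 10  # individuals per population, loci, loci to select

def simulate(n, n_loci):
    # Biallelic genotypes coded 0/1/2; allele frequency 0.5 in BOTH
    # populations, so there is no real signal to find.
    return [[sum(random.random() < 0.5 for _ in range(2)) for _ in range(n_loci)]
            for _ in range(n)]

def freqs(pop):
    # Per-locus allele frequency across a sample of diploid genotypes.
    n = len(pop)
    return [sum(g[i] for g in pop) / (2 * n) for i in range(len(pop[0]))]

def top_loci(a, b, k):
    # 'High-grade' the panel: keep the k loci with the largest observed
    # allele-frequency difference between the two samples.
    fa, fb = freqs(a), freqs(b)
    ranked = sorted(range(len(fa)), key=lambda i: abs(fa[i] - fb[i]), reverse=True)
    return ranked[:k]

def assign_accuracy(test1, test2, ref1, ref2, loci):
    # Crude assignment: put each test individual in the population whose
    # reference allele frequencies its genotype is closer to.
    f1, f2 = freqs(ref1), freqs(ref2)
    correct, total = 0, len(test1) + len(test2)
    for true_pop, test in enumerate((test1, test2)):
        for g in test:
            d1 = sum(abs(g[i] / 2 - f1[i]) for i in loci)
            d2 = sum(abs(g[i] / 2 - f2[i]) for i in loci)
            correct += (0 if d1 < d2 else 1) == true_pop
    return correct / total

pop1, pop2 = simulate(N, L), simulate(N, L)

# Biased assessment: loci selected AND tested on the same individuals.
biased = assign_accuracy(pop1, pop2, pop1, pop2, top_loci(pop1, pop2, K))

# Split-sample assessment: select on the training halves, test on held-out halves.
tr1, te1 = pop1[:N // 2], pop1[N // 2:]
tr2, te2 = pop2[:N // 2], pop2[N // 2:]
honest = assign_accuracy(te1, te2, tr1, tr2, top_loci(tr1, tr2, K))

print(f"high-graded estimate: {biased:.2f}  split-sample estimate: {honest:.2f}")
```

Because every locus is pure noise, the split‐sample estimate hovers near the 50% chance level, while the high‐graded estimate is inflated: the selected loci differ between the samples only by sampling error, and testing on the same individuals recycles exactly that error as apparent signal.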
Keywords: assignment tests, cross‐validation, discriminant function analysis, independence, jackknife, sample size, split‐sample