Computing Exact p-values for a Cross-correlation Shotgun Proteomics Score Function
Authors: J. Jeffry Howbert, William Stafford Noble
Affiliation: Department of Genome Sciences, University of Washington, Seattle, Washington; Department of Computer Science and Engineering, University of Washington, Seattle, Washington
Abstract: The core of every protein mass spectrometry analysis pipeline is a function that assesses the quality of a match between an observed spectrum and a candidate peptide. We describe a procedure for computing exact p-values for the oldest and still widely used score function, SEQUEST XCorr. The procedure uses dynamic programming to efficiently enumerate the full distribution of scores for all possible peptides whose masses are close to the spectrum's precursor mass. Ranking identified spectra by p-value rather than XCorr significantly reduces the variance caused by spectrum-specific effects on the score. In combination with the Percolator postprocessor, the XCorr p-value yields more spectrum and peptide identifications at a fixed false discovery rate than Mascot, X!Tandem, Comet, and MS-GF+ across a variety of data sets.

A high-throughput proteomics experiment generates many thousands of candidate hypotheses, only a fraction of which are true and an even smaller fraction of which are of significant biological interest. Consequently, the accurate and efficient assignment of statistical confidence estimates to identified fragmentation spectra is critical to making efficient use of shotgun proteomics data sets.

Historically, the field has shifted from focusing on discrimination (the ability of a search engine to distinguish between correct and incorrect spectrum identifications) to calibration. If a score function is well-calibrated, then a score of x assigned to one spectrum is directly comparable to a score of x assigned to a different spectrum. A well-known example of poor calibration is the distribution of SEQUEST XCorr scores produced on spectra of varying charges (Fig. 1A): a score of 1.8 for a doubly charged (2+) spectrum indicates a good-quality identification, whereas the same score assigned to a 3+ spectrum corresponds to a much poorer match.
Improving the calibration of a given score function across spectra can lead to large improvements in the number of identified spectra at a fixed statistical confidence threshold. Score calibration can be carried out using empirical curve fitting procedures to estimate p-values (1, 2, 3) or, more recently, using dynamic programming to calculate an exact p-value for each observed score (4). Machine learning postprocessors such as PeptideProphet (5) and Percolator (6) simultaneously calibrate scores and incorporate additional information, leading to even larger improvements in identification rates.

Fig. 1. Distribution of scores for charge 2+ and charge 3+ spectra from yeast data set. A, Standard XCorr scores. B, XCorr Šidák-corrected p-values.

In this work, we describe a dynamic programming method for computing exact p-values for the oldest and one of the most widely used score functions, SEQUEST XCorr (7, 8). We demonstrate analytically and empirically that the resulting p-values are valid relative to a widely accepted null model. Furthermore, we show that, across a variety of data sets, our XCorr p-value yields significantly improved statistical power relative to competing, state-of-the-art methods, including SEQUEST, Mascot (9), X!Tandem (10), and Comet (3), and is competitive with other dynamic programming-based calibration methods like MS-GF+ (11). Strikingly, the improved calibration from our scoring scheme is complementary to that provided by Percolator, so that the combination of the two methods yields even better results, evaluated both at the spectrum and peptide levels.
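
The idea of enumerating a full score distribution by dynamic programming, then reading off an exact p-value and applying a Šidák correction, can be illustrated with a toy sketch. This is not the authors' implementation. It assumes integer residue masses, an integer-valued per-mass-bin evidence vector standing in for a discretized XCorr, a score equal to the sum of evidence at internal prefix masses, and a null model in which every residue sequence of the target mass is equally likely. All names (`score_distribution`, `exact_p_value`, `sidak_correct`, `evidence`) are hypothetical.

```python
from collections import Counter


def score_distribution(evidence, residue_masses, target_mass):
    """Count residue sequences of total mass `target_mass`, grouped by score.

    Assumptions (illustrative only): masses are positive integers, and the
    score of a sequence is the sum of evidence[m] over its internal prefix
    masses m (0 < m < target_mass).
    """
    # dist[m][s] = number of sequences with total mass m and score s
    dist = [Counter() for _ in range(target_mass + 1)]
    dist[0][0] = 1  # the empty sequence
    for m in range(target_mass):
        if not dist[m]:
            continue  # no sequence reaches this mass
        # Extending a sequence of mass m turns m into an internal prefix mass.
        gain = evidence[m] if m > 0 else 0
        for r in residue_masses:
            nxt = m + r
            if nxt > target_mass:
                continue
            for s, count in dist[m].items():
                dist[nxt][s + gain] += count
    return dist[target_mass]


def exact_p_value(dist, observed_score):
    """Fraction of enumerated sequences scoring at least `observed_score`."""
    total = sum(dist.values())
    tail = sum(c for s, c in dist.items() if s >= observed_score)
    return tail / total


def sidak_correct(p_value, n_candidates):
    """Sidak correction for taking the best of `n_candidates` independent tests."""
    return 1.0 - (1.0 - p_value) ** n_candidates


# Toy usage: two residue types with masses 2 and 3, precursor mass 7.
evidence = [0, 1, -1, 2, 0, 3, 1, 0]
dist = score_distribution(evidence, (2, 3), 7)
p = exact_p_value(dist, observed_score=5)
p_corrected = sidak_correct(p, n_candidates=2)
```

In this toy case only three sequences reach mass 7, with scores -1, 2, and 5, so an observed score of 5 has p-value 1/3. The table has one row per mass bin and one entry per reachable score, which is why the cost stays small even though the number of sequences grows exponentially with peptide length.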
This article is indexed in ScienceDirect and other databases.