Similar literature
1.
Reliable statistical validation of peptide and protein identifications is a top priority in large-scale mass spectrometry-based proteomics. PeptideProphet is one of the computational tools commonly used for assessing the statistical confidence in peptide assignments to tandem mass spectra obtained using database search programs such as SEQUEST, MASCOT, or X! TANDEM. We present two flexible methods, the variable component mixture model and the semiparametric mixture model, that remove the restrictive parametric assumptions in the mixture modeling approach of PeptideProphet. Using a control protein mixture data set generated on a linear ion trap Fourier transform (LTQ-FT) mass spectrometer, we demonstrate that both methods improve on parametric models in terms of the accuracy of probability estimates and the power to detect correct identifications while controlling the false discovery rate to the same degree. The statistical approaches presented here require that the data set contain a sufficient number of decoy (known to be incorrect) peptide identifications, which can be obtained using the target-decoy database search strategy.

2.
Tandem mass spectrometry is commonly used to identify peptides, typically by comparing their product ion spectra with those predicted from a protein sequence database and scoring these matches. The most commonly reported quality metric for a set of peptide identifications is the false discovery rate (FDR), the fraction of expected false identifications in the set. This metric has so far only been used for completely sequenced organisms or known protein mixtures. We have investigated whether FDR estimations are also applicable in the case of partially sequenced organisms, where many high-quality spectra fail to identify the correct peptides because the latter are not present in the searched sequence database. Using real data from human plasma and simulated partial sequence databases derived from two complete human sequence databases with different levels of redundancy, we demonstrate that the mixture model approach in PeptideProphet is robust for partial databases, particularly if used in combination with decoy sequences. We therefore recommend using this method when estimating the FDR and reporting peptide identifications from incompletely sequenced organisms.
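As a rough illustration of the FDR defined in this abstract, the sketch below estimates it from a concatenated target-decoy search by counting decoy hits above a score threshold. The function name, the (score, is_decoy) input layout, and the simple decoys/targets estimator are assumptions made for illustration; this is not the PeptideProphet mixture model the authors evaluate.

```python
def estimate_fdr(psms, threshold):
    """Estimate the FDR among target matches scoring at or above `threshold`.

    psms: list of (score, is_decoy) pairs from a combined target-decoy search
          (hypothetical input layout, chosen for this sketch).
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 0.0  # nothing accepted, so no false discoveries to report
    # Each accepted decoy is taken to stand in for one false target match.
    return min(1.0, decoys / targets)


example_psms = [(5.0, False), (4.0, False), (3.5, True),
                (3.0, False), (2.0, True), (1.0, False)]
fdr_at_3 = estimate_fdr(example_psms, 3.0)
```

Lowering the threshold accepts more target matches but also more decoys, so the estimate generally rises as the threshold falls.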

3.
Development of robust statistical methods for validation of peptide assignments to tandem mass (MS/MS) spectra obtained using database searching remains an important problem. PeptideProphet is one of the commonly used computational tools available for that purpose. An alternative simple approach for validation of peptide assignments is based on addition of decoy (reversed, randomized, or shuffled) sequences to the searched protein sequence database. The probabilistic modeling approach of PeptideProphet and the decoy strategy can be combined within a single semisupervised framework, leading to improved robustness and higher accuracy of computed probabilities even in the case of the most challenging data sets. We present a semisupervised expectation-maximization (EM) algorithm for constructing a Bayes classifier for peptide identification using the probability mixture model, extending PeptideProphet to incorporate decoy peptide matches. Using several data sets of varying complexity, from control protein mixtures to a human plasma sample, and using three commonly used database search programs, SEQUEST, MASCOT, and TANDEM/k-score, we illustrate that more accurate mixture estimation leads to an improved control of the false discovery rate in the classification of peptide assignments.
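The semisupervised idea in this abstract can be sketched, under strong simplifying assumptions, as a two-component mixture fitted by EM in which decoy matches are pinned to the "incorrect" component. The Gaussian components, the single score per match, the initialization, and every name below are assumptions for illustration; the abstract does not specify the distributions or features PeptideProphet actually uses.

```python
import math


def normal_pdf(x, mu, sd):
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def weighted_mean_sd(xs, ws, floor=1e-3):
    # Weighted mean and standard deviation; the floor avoids a zero-width component.
    total = sum(ws)
    mu = sum(w * x for x, w in zip(xs, ws)) / total
    var = sum(w * (x - mu) ** 2 for x, w in zip(xs, ws)) / total
    return mu, max(math.sqrt(var), floor)


def semisupervised_em(scores, is_decoy, n_iter=50):
    """Return P(correct) for each match; decoys are fixed at probability 0."""
    n_targets = sum(1 for d in is_decoy if not d)
    lo, hi = min(scores), max(scores)
    mu0, mu1 = lo, hi                      # crude start: incorrect low, correct high
    sd0 = sd1 = max((hi - lo) / 4.0, 1e-3)
    pi1 = 0.5                              # prior that a target match is correct
    post = [0.0] * len(scores)
    for _ in range(n_iter):
        # E-step: posterior of the "correct" component, with decoys labeled incorrect.
        for i, (x, decoy) in enumerate(zip(scores, is_decoy)):
            if decoy:
                post[i] = 0.0
                continue
            a = pi1 * normal_pdf(x, mu1, sd1)
            b = (1.0 - pi1) * normal_pdf(x, mu0, sd0)
            post[i] = a / (a + b) if a + b > 0.0 else 0.0
        # M-step: re-estimate both components and the mixing proportion.
        w1 = sum(post)
        w0 = len(scores) - w1
        if w1 < 1e-9 or w0 < 1e-9:
            break
        mu1, sd1 = weighted_mean_sd(scores, post)
        mu0, sd0 = weighted_mean_sd(scores, [1.0 - p for p in post])
        pi1 = w1 / n_targets
    return post


scores = [0.1, 0.2, 0.3, 0.2, 0.15, 0.25, 3.0, 3.2, 2.9, 3.1, 0.1, 0.3, 0.2, 0.25]
is_decoy = [False] * 10 + [True] * 4
posteriors = semisupervised_em(scores, is_decoy)
```

Because decoys contribute only to the incorrect component, they help anchor the negative distribution even when few low-scoring target matches are available.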

4.
Sequence tag-based peptide identification methods are a promising alternative to the traditional database search approach. However, a more comprehensive analysis, optimization, and comparison with established methods are necessary before these methods can gain widespread use in the proteomics community. Using the InsPecT open source code base (Tanner et al., Anal. Chem. 2005, 77, 4626-39), we present an improved sequence tag generation method that directly incorporates multicharged fragment ion peaks present in many tandem mass spectra of higher charge states. We also investigate the performance of sequence tagging under different settings using control data sets generated on five different types of mass spectrometers, as well as using a complex phosphopeptide-enriched sample. We also demonstrate that additional modeling of InsPecT search scores using a semiparametric approach incorporating the accuracy of the precursor ion mass measurement provides additional improvement in the ability to discriminate between correct and incorrect peptide identifications. The overall superior performance of the sequence tag-based peptide identification method is demonstrated by comparison with a commonly used SEQUEST/PeptideProphet approach.

5.
Alves G, Ogurtsov AY, Yu YK. PLoS ONE 2010, 5(11): e15438
Statistically meaningful comparison/combination of peptide identification results from various search methods is impeded by the lack of a universal statistical standard. Providing an E-value calibration protocol, we demonstrated earlier the feasibility of translating either the score or heuristic E-value reported by any method into the textbook-defined E-value, which may serve as the universal statistical standard. This protocol, although robust, may lose spectrum-specific statistics and might require a new calibration when changes in experimental setup occur. To mitigate these issues, we developed a new MS/MS search tool, RAId_aPS, that is able to provide spectrum-specific E-values for additive scoring functions. Given a selection of scoring functions out of RAId score, K-score, Hyperscore and XCorr, RAId_aPS generates the corresponding score histograms of all possible peptides using dynamic programming. Using these score histograms to assign E-values enables a calibration-free protocol for accurate significance assignment for each scoring function. RAId_aPS features four different modes: (i) compute the total number of possible peptides for a given molecular mass range, (ii) generate the score histogram given an MS/MS spectrum and a scoring function, (iii) reassign E-values for a list of candidate peptides given an MS/MS spectrum and the scoring functions chosen, and (iv) perform database searches using selected scoring functions. In modes (iii) and (iv), RAId_aPS is also capable of combining results from different scoring functions using spectrum-specific statistics. The web link is http://www.ncbi.nlm.nih.gov/CBBresearch/Yu/raid_aps/index.html. Relevant binaries for Linux, Windows, and Mac OS X are available from the same page.
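Mode (i) above, counting all possible peptides in a mass range, lends itself to a small dynamic-programming sketch. The version below is a simplified illustration only: it uses nominal integer residue masses for the 20 standard amino acids, ignores the terminal water and any modifications, and does not reproduce how RAId_aPS itself is implemented.

```python
# Nominal (integer) residue masses in Da for the 20 standard amino acids:
# G, A, S, P, V, T, C, L, I, N, D, Q, K, E, M, H, F, R, Y, W
RESIDUE_MASSES = [57, 71, 87, 97, 99, 101, 103, 113, 113, 114,
                  115, 128, 128, 129, 131, 137, 147, 156, 163, 186]


def count_peptides_by_mass(max_mass, residue_masses=RESIDUE_MASSES):
    """counts[m] = number of ordered residue sequences whose masses sum to m."""
    counts = [0] * (max_mass + 1)
    counts[0] = 1  # the empty sequence
    for m in range(1, max_mass + 1):
        # A sequence of mass m ends in some residue r, preceded by a sequence of mass m - r.
        counts[m] = sum(counts[m - r] for r in residue_masses if r <= m)
    return counts


def count_in_range(counts, low, high):
    """Total number of sequences with residue mass in [low, high]."""
    return sum(counts[low:high + 1])


counts = count_peptides_by_mass(200)
```

The same recurrence can carry a score histogram instead of a plain count when the scoring function is additive over residues, which is the property the abstract relies on.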

6.
Spectral library searching is an emerging approach to peptide identification from tandem mass spectra, a critical step in proteomic data analysis. In spectral library searching, a spectral library is first meticulously compiled from a large collection of previously observed peptide MS/MS spectra that are conclusively assigned to their corresponding amino acid sequences. An unknown spectrum is then identified by comparing it to all the candidates in the spectral library for the most similar match. This review discusses the basic principles of spectral library building and searching, describes its advantages and limitations, and provides a primer for researchers interested in adopting this new approach in their data analysis. It will also discuss the future outlook on the evolution and utility of spectral libraries in the field of proteomics.

7.
A very popular approach in proteomics is the so-called "shotgun LC-MS/MS" strategy. In its most widely used form, a total protein digest is separated by ion exchange fractionation in the first dimension followed by off- or on-line RP LC-MS/MS. We replaced the first dimension by isoelectric focusing in the liquid phase using the Off-Gel device producing 15 fractions. As peptides are separated by their isoelectric point in the first dimension and hydrophobicity in the second, those experimentally derived parameters (pI and R_T) can be used for the validation of potentially identified peptides. We applied this strategy to a cellular extract of Drosophila Kc167 cells and identified peptides with two different database search engines, namely PHENYX and SEQUEST, with PeptideProphet validation of the SEQUEST results. PHENYX returned 7582 potential peptide identifications and SEQUEST 7629. The SEQUEST results were reduced to 2006 identifications by validation with PeptideProphet. Validation of the PeptideProphet, SEQUEST and PHENYX results by pI and R_T parameters confirmed 1837 PeptideProphet identifications, while in the remainder of the SEQUEST results another 1130 peptides were found to be likely hits. The validation on PHENYX established a firm p-value threshold of <1 × 10^-4 that sets by itself the correct identification confidence to >95%, and a final count of 2034 highly confident peptide identifications was achieved after pI and R_T validation. Although the PeptideProphet and PHENYX datasets have a very high confidence, the overlap of common identifications was only 79.4%, which can be explained by the fact that data interpretation was done searching different protein databases with two search engines of different algorithms. The approach used in this study allowed for an automated and improved data validation process for shotgun proteomics projects producing MS/MS peptide identification results of very high confidence.

8.
Shotgun proteomics experiments are dependent upon database search engines to identify peptides from tandem mass spectra. Many of these algorithms score potential identifications by evaluating the number of fragment ions matched between each peptide sequence and an observed spectrum. These systems, however, generally do not distinguish between matching an intense peak and matching a minor peak. We have developed a statistical model to score peptide matches that is based upon the multivariate hypergeometric distribution. This scorer, part of the "MyriMatch" database search engine, places greater emphasis on matching intense peaks. The probability that the best match for each spectrum has occurred by random chance can be employed to separate correct matches from random ones. We evaluated this software on data sets from three different laboratories employing three different ion trap instruments. Employing a novel system for testing discrimination, we demonstrate that stratifying peaks into multiple intensity classes improves the discrimination of scoring. We compare MyriMatch results to those of Sequest and X!Tandem, revealing that it is capable of higher discrimination than either of these algorithms. When minimal peak filtering is employed, performance plummets for a scoring model that does not stratify matched peaks by intensity. On the other hand, we find that MyriMatch discrimination improves as more peaks are retained in each spectrum. MyriMatch also scales well to tandem mass spectra from high-resolution mass analyzers. These findings may indicate limitations for existing database search scorers that count matched peaks without differentiating them by intensity. The software and source code are available under the Mozilla Public License at this URL: http://www.mc.vanderbilt.edu/msrc/bioinformatics/.
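To make the multivariate hypergeometric idea concrete, the sketch below computes the chance probability of one particular pattern of matches across intensity classes. The class layout (including a class for m/z bins with no peak), the function name, and the use of the point probability alone are assumptions for illustration; the abstract does not give MyriMatch's exact scoring details.

```python
from math import comb, prod


def mvh_probability(class_sizes, matches):
    """Probability of drawing exactly matches[i] items from each class.

    class_sizes: number of m/z bins in each intensity class, e.g.
                 [intense, medium, weak, empty] (hypothetical layout).
    matches:     number of predicted fragment ions landing in each class.
    """
    if len(class_sizes) != len(matches):
        raise ValueError("class_sizes and matches must have the same length")
    if any(k < 0 or k > size for size, k in zip(class_sizes, matches)):
        raise ValueError("each match count must lie between 0 and its class size")
    total_bins = sum(class_sizes)
    total_draws = sum(matches)
    # Ways to choose the observed matches per class, over all ways to place the draws.
    ways = prod(comb(size, k) for size, k in zip(class_sizes, matches))
    return ways / comb(total_bins, total_draws)


p_one_each = mvh_probability([2, 3, 5], [1, 1, 1])
```

Because intense classes are small, matching several intense peaks by chance is far less likely than matching the same number of weak peaks, so such outcomes receive a smaller random-match probability.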

9.
We report on the effectiveness of CID, HCD, and ETD for LC-FT MS/MS analysis of peptides using a tandem linear ion trap-Orbitrap mass spectrometer. A range of software tools and analysis parameters were employed to explore the use of CID, HCD, and ETD to identify peptides (isolated from human blood plasma) without the use of specific "enzyme rules". In the evaluation of an FDR-controlled SEQUEST scoring method, the use of accurate masses for fragments increased the number of identified peptides (by ~50%) compared to the use of conventional low accuracy fragment mass information, and CID provided the largest contribution to the identified peptide data sets compared to HCD and ETD. The FDR-controlled Mascot scoring method provided significantly fewer peptide identifications than SEQUEST (by 1.3-2.3 fold) and CID, HCD, and ETD provided similar contributions to identified peptides. Evaluation of de novo sequencing and the UStags method for more intense fragment ions revealed that HCD afforded more contiguous residues (e.g., ≥ 7 amino acids) than either CID or ETD. Both the FDR-controlled SEQUEST and Mascot scoring methods provided peptide data sets that were affected by the decoy database used and mass tolerances applied (e.g., identical peptides between data sets could be limited to ~70%), while the UStags method provided the most consistent peptide data sets (>90% overlap). The m/z ranges in which CID, HCD, and ETD contributed the largest number of peptide identifications were substantially overlapping. This work suggests that the three peptide ion fragmentation methods are complementary and that maximizing the number of peptide identifications benefits significantly from a careful match with the informatics tools and methods applied. These results also suggest that the decoy strategy may inaccurately estimate identification FDRs.

10.
MOTIVATION: Statistical evaluation of the confidence of peptide and protein identifications made by tandem mass spectrometry is a critical component for appropriately interpreting the experimental data and conducting downstream analysis. Although many approaches have been developed to assign confidence measures from different perspectives, a unified statistical framework that integrates the uncertainty of peptides and proteins is still missing. RESULTS: We developed a hierarchical statistical model (HSM) that jointly models the uncertainty of the identified peptides and proteins and can be applied to any scoring system. With data sets of a standard mixture and the yeast proteome, we demonstrate that the HSM offers a reliable or at least conservative false discovery rate (FDR) estimate for peptide and protein identifications. The probability measure of HSM also offers a powerful discriminating score for peptide identification. AVAILABILITY: The algorithm is available upon request from the authors.

11.
Development of statistical methods for assessing the significance of peptide assignments to tandem mass spectra obtained using database searching remains an important problem. Over the past several years, a number of different approaches have emerged, including the concept of expectation values, the target-decoy strategy, and the probability mixture modeling approach of PeptideProphet. In this work, we provide a background on statistical significance analysis in the field of mass spectrometry-based proteomics, and present our perspective on the current and future developments in this area.

12.
The development of liquid chromatography coupled with tandem mass spectrometry (LC-MS/MS) has made it possible to characterize phosphopeptides in an increasingly large-scale and high-throughput fashion. However, extracting confident phosphopeptide identifications from the resulting large data sets in a similar high-throughput fashion remains difficult, as does rigorously estimating the false discovery rate (FDR) of a set of phosphopeptide identifications. This article describes a data analysis pipeline designed to address these issues. The first step is to reanalyze phosphopeptide identifications that contain ambiguous assignments for the incorporated phosphate(s) to determine the most likely arrangement of the phosphate(s). The next step is to employ an expectation maximization algorithm to estimate the joint distribution of the peptide scores. A linear discriminant analysis is then performed to determine how to optimally combine peptide scores (in this case, from SEQUEST) into a discriminant score that possesses the maximum discriminating power. Based on this discriminant score, the p- and q-values for each phosphopeptide identification are calculated, and the phosphopeptide identification FDR is then estimated. This data analysis approach was applied to data from a study of irradiated human skin fibroblasts to provide a robust estimate of FDR for phosphopeptides. The Phosphopeptide FDR Estimator software is freely available for download at http://ncrr.pnl.gov/software/.

13.
We describe the application of a previously reported peptide retention time prediction model for reversed phase liquid chromatography (RPLC) (Petritis et al. Anal. Chem. 2003, 75, 1039) for improved peptide identification. The model uses peptide sequence information to generate a theoretical (predicted) elution time that can be compared with the observed elution time. Using data from a set of known proteins, the retention time parameter was incorporated into a discriminant function for use with tandem mass spectrometry (MS/MS) data analyzed with the peptide/protein identification program SEQUEST. For singly charged ions, the number of confident identifications increased by 12% when the elution time metric was included compared to when mass spectral data were the sole source of information in the context of a Drosophila melanogaster database. A 3-4% improvement was obtained for doubly and triply charged ions for the same biological system. Application to the larger Rattus norvegicus (rat) and human proteome databases resulted in an 8-9% overall increase in the number of confident identifications, when both the discriminant function and elution time are used. The effect of adding "runner-up" hits (peptide matches that are not the highest scoring for a spectrum) from SEQUEST is also explored, and we find that the number of confident identifications is further increased by 1% when these hits are also considered. Finally, application of the discriminant functions derived in this work with approximately 2.2 million spectra from over three hundred LC-MS/MS analyses of peptides from human plasma protein resulted in a 16% increase in confident peptide identifications (9022 vs 7779) using elution time information. Further improvements from the use of elution time information can be expected as both the experimental control of elution time reproducibility and the predictive capability are improved.

14.
The identification of proteins from spectra derived from a tandem mass spectrometry experiment involves several challenges: matching each observed spectrum to a peptide sequence, ranking the resulting collection of peptide-spectrum matches, assigning statistical confidence estimates to the matches, and identifying the proteins. The present work addresses algorithms to rank peptide-spectrum matches. Many of these algorithms, such as PeptideProphet, IDPicker, or Q-ranker, follow a similar methodology that includes representing peptide-spectrum matches as feature vectors and using optimization techniques to rank them. We propose a richer and more flexible feature set representation that is based on the parametrization of the SEQUEST XCorr score and that can be used by all of these algorithms. This extended feature set allows a more effective ranking of the peptide-spectrum matches based on the target-decoy strategy, in comparison to a baseline feature set devoid of these XCorr-based features. Ranking using the extended feature set gives 10-40% improvement in the number of distinct peptide identifications relative to a range of q-value thresholds. While this work is inspired by the model of the theoretical spectrum and the similarity measure between spectra used specifically by SEQUEST, the method itself can be applied to the output of any database search. Further, our approach can be trivially extended beyond XCorr to any linear operator that can serve as similarity score between experimental spectra and peptide sequences.

15.
MS/MS and associated database search algorithms are essential proteomic tools for identifying peptides. Due to their widespread use, it is now time to perform a systematic analysis of the various algorithms currently in use. Using blood specimens from the HUPO Plasma Proteome Project, we have evaluated five search algorithms with respect to their sensitivity and specificity, and have also accurately benchmarked them based on specified false-positive (FP) rates. Spectrum Mill and SEQUEST performed well in terms of sensitivity, but were inferior to MASCOT, X!Tandem, and Sonar in terms of specificity. Overall, MASCOT, a probabilistic search algorithm, correctly identified most peptides based on a specified FP rate. The rescoring algorithm, PeptideProphet, enhanced the overall performance of the SEQUEST algorithm and provided predictable FP error rates. Ideally, score thresholds should be calculated for each peptide spectrum or, minimally, derived from a reversed-sequence search as demonstrated in this study based on a validated data set. The availability of open-source search algorithms, such as X!Tandem, makes it feasible to further improve the validation process (manual or automatic) on the basis of "consensus scoring", i.e., the use of multiple (at least two) search algorithms to reduce the number of FPs.

16.
Large proteomic data sets identifying hundreds or thousands of modified peptides are becoming increasingly common in the literature. Several methods for assessing the reliability of peptide identifications at either the individual peptide or the data set level have become established. However, tools for measuring the confidence of modification site assignments are sparse and are not often employed. A few tools for estimating phosphorylation site assignment reliabilities have been developed, but these are not integral to a search engine, and so require a particular search engine output for a second step of processing. They may also require use of a particular fragmentation method and are mostly only applicable for phosphorylation analysis, rather than post-translational modification analysis in general. In this study, we present the performance of site assignment scoring that is directly integrated into the search engine Protein Prospector, which allows site assignment reliability to be automatically reported for all modifications present in an identified peptide. It clearly indicates when a site assignment is ambiguous (and if so, between which residues), and reports an assignment score that can be translated into a reliability measure for individual site assignments.

17.
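A scoring method for peptide identification from tandem mass spectra   Cited by: 1 (self-citations: 0, citations by others: 1)
A number of high-throughput scoring algorithms that evaluate the agreement between a tandem mass spectrum and the theoretical spectra of peptides in a database are currently used in shotgun proteomics research. In practice, however, these methods produce a large number of incorrect peptide identifications. Here we propose a new scoring algorithm for identifying peptide sequences from tandem mass spectra. The algorithm jointly considers the probabilities with which different ions appear in tandem mass spectra, the number of enzymatic cleavage sites in the peptide, the degree of matching between theoretical and experimental ions, and the matching pattern. Tests on large tandem mass spectrometry data sets show that PepSearch, the software developed from this algorithm, identifies peptides more accurately than SEQUEST, currently the most commonly used software. PepSearch can be downloaded from http://compbio.sibsnet.org/projects/pepsearch.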

18.
The combination of tandem mass spectrometry and sequence database searching is the method of choice for the identification of peptides and the mapping of proteomes. Over the last several years, the volume of data generated in proteomic studies has increased dramatically, which challenges the computational approaches previously developed for these data. Furthermore, a multitude of search engines have been developed that identify different, overlapping subsets of the sample peptides from a particular set of tandem mass spectrometry spectra. We present iProphet, the new addition to the widely used open-source suite of proteomic data analysis tools, the Trans-Proteomics Pipeline. Applied in tandem with PeptideProphet, it provides a more accurate representation of the multilevel nature of shotgun proteomic data. iProphet combines the evidence from multiple identifications of the same peptide sequences across different spectra, experiments, precursor ion charge states, and modified states. It also allows accurate and effective integration of the results from multiple database search engines applied to the same data. The use of iProphet in the Trans-Proteomics Pipeline increases the number of correctly identified peptides at a constant false discovery rate as compared with both PeptideProphet and another state-of-the-art tool, Percolator. As the main outcome, iProphet permits the calculation of accurate posterior probabilities and false discovery rate estimates at the level of sequence-identical peptide identifications, which in turn leads to more accurate probability estimates at the protein level. Fully integrated with the Trans-Proteomics Pipeline, it supports all commonly used MS instruments, search engines, and computer platforms. The performance of iProphet is demonstrated on two publicly available data sets: data from a human whole cell lysate proteome profiling experiment representative of typical proteomic data sets, and from a set of Streptococcus pyogenes experiments more representative of organism-specific composite data sets.

19.
Spectral libraries have emerged as a viable alternative to protein sequence databases for peptide identification. These libraries contain previously detected peptide sequences and their corresponding tandem mass spectra (MS/MS). Search engines can then identify peptides by comparing experimental MS/MS scans to those in the library. Many of these algorithms employ the dot product score for measuring the quality of a spectrum-spectrum match (SSM). This scoring system does not offer a clear statistical interpretation and ignores fragment ion m/z discrepancies in the scoring. We developed a new spectral library search engine, Pepitome, which employs statistical systems for scoring SSMs. Pepitome outperformed the leading library search tool, SpectraST, when analyzing data sets acquired on three different mass spectrometry platforms. We characterized the reliability of spectral library searches by confirming shotgun proteomics identifications through RNA-Seq data. Applying spectral library and database searches on the same sample revealed their complementary nature. Pepitome identifications enabled the automation of quality analysis and quality control (QA/QC) for shotgun proteomics data acquisition pipelines.
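For readers unfamiliar with the dot product score mentioned in this abstract, a minimal sketch follows. It bins peaks by m/z and takes the normalized dot product of the binned intensities; the 1.0 m/z bin width, the (m/z, intensity) peak layout, and the function names are assumptions for illustration and do not describe SpectraST or Pepitome.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum intensities of (mz, intensity) peaks into fixed-width m/z bins."""
    binned = {}
    for mz, intensity in peaks:
        index = int(mz // bin_width)
        binned[index] = binned.get(index, 0.0) + intensity
    return binned


def normalized_dot_product(peaks_a, peaks_b, bin_width=1.0):
    """Cosine-style similarity between two spectra after binning."""
    a = bin_spectrum(peaks_a, bin_width)
    b = bin_spectrum(peaks_b, bin_width)
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm > 0.0 else 0.0


query = [(100.2, 1.0), (200.5, 2.0)]
library = [(100.7, 1.0), (300.1, 2.0)]
similarity = normalized_dot_product(query, library)
```

In this example the peaks at m/z 100.2 and 100.7 fall in the same bin and count as a full match, which illustrates the point made in the abstract: a binned dot product ignores fragment ion m/z discrepancies within a bin.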
