
1.

Pepitome: evaluating improved spectral library search for identification complementarity and quality assessment

Dasari S Chambers MC Martinez MA Carpenter KL Ham AJ Vega-Montoto LJ Tabb DL 《Journal of proteome research》2012,11(3):1686-1695

Spectral libraries have emerged as a viable alternative to protein sequence databases for peptide identification. These libraries contain previously detected peptide sequences and their corresponding tandem mass spectra (MS/MS). Search engines can then identify peptides by comparing experimental MS/MS scans to those in the library. Many of these algorithms employ the dot product score for measuring the quality of a spectrum-spectrum match (SSM). This scoring system does not offer a clear statistical interpretation and ignores fragment ion m/z discrepancies in the scoring. We developed a new spectral library search engine, Pepitome, which employs statistical systems for scoring SSMs. Pepitome outperformed the leading library search tool, SpectraST, when analyzing data sets acquired on three different mass spectrometry platforms. We characterized the reliability of spectral library searches by confirming shotgun proteomics identifications through RNA-Seq data. Applying spectral library and database searches on the same sample revealed their complementary nature. Pepitome identifications enabled the automation of quality analysis and quality control (QA/QC) for shotgun proteomics data acquisition pipelines.
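The dot product score mentioned in this abstract can be illustrated with a minimal sketch. It assumes spectra are given as lists of (m/z, intensity) pairs and uses unit-width m/z bins. The function names and the `bin_width` parameter are hypothetical and are not taken from Pepitome or SpectraST.

```python
import math


def bin_spectrum(peaks, bin_width=1.0):
    """Sum peak intensities into fixed-width m/z bins (bin_width is an assumed setting)."""
    binned = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        binned[key] = binned.get(key, 0.0) + intensity
    return binned


def normalized_dot_product(query_peaks, library_peaks, bin_width=1.0):
    """Cosine-style similarity between two binned spectra, in the range 0..1."""
    query = bin_spectrum(query_peaks, bin_width)
    library = bin_spectrum(library_peaks, bin_width)
    dot = sum(value * library.get(key, 0.0) for key, value in query.items())
    norm = math.sqrt(sum(v * v for v in query.values())) * math.sqrt(
        sum(v * v for v in library.values())
    )
    return dot / norm if norm > 0 else 0.0
```

In this sketch, peaks at m/z 100.1 and 100.9 fall into the same bin and contribute exactly as if they matched perfectly, which is one way to see why a binned dot product is insensitive to fragment ion m/z discrepancies.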

2.

An improved method for the construction of decoy peptide MS/MS spectra suitable for the accurate estimation of false discovery rates

Ahrné E Ohta Y Nikitin F Scherl A Lisacek F Müller M 《Proteomics》2011,11(20):4085-4095

The relevance of libraries of annotated MS/MS spectra is growing with the amount of proteomic data generated in high-throughput experiments. These reference libraries provide a fast and accurate way to identify newly acquired MS/MS spectra. In the context of multiple hypothesis testing, the number of false-positive identifications expected in the final result list is controlled by calculating the false discovery rate (FDR). In a classical sequence search where experimental MS/MS spectra are compared with the theoretical peptide spectra calculated from a sequence database, the FDR is estimated by searching randomized or decoy sequence databases. Despite ongoing discussion on how exactly the FDR has to be calculated, this method is widely accepted in the proteomic community. Recently, similar approaches to control the FDR of spectrum library searches were discussed. We present in this paper a detailed analysis of the similarity between spectra of distinct peptides to set the basis of our own solution for decoy library creation (DeLiberator). It differs from the previously published results in some key points, mainly in implementing new methods that prevent decoy spectra from being too similar to the original library spectra while keeping important features of real MS/MS spectra. Using different proteomic data sets and library creation methods, we evaluate our approach and compare it with alternative methods.

3.

Learning from Decoys to Improve the Sensitivity and Specificity of Proteomics Database Search Results

Amit Kumar Yadav Dhirendra Kumar Debasis Dash 《PloS one》2012,7(11)

The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since the complete separation between the true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. Modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2% respectively. The approach is applicable to different search methodologies: separate as well as concatenated database search, high mass accuracy, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed better performance than before. We have shown that an appropriate threshold learnt from decoys can be very effective in improving the database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.

4.

Improving sensitivity in proteome studies by analysis of false discovery rates for multiple search engines

Andrew R. Jones Jennifer A. Siepen Simon J. Hubbard Norman W. Paton 《Proteomics》2009,9(5):1220-1229

LC‐MS experiments can generate large quantities of data, for which a variety of database search engines are available to make peptide and protein identifications. Decoy databases are becoming widely used to place statistical confidence in result sets, allowing the false discovery rate (FDR) to be estimated. Different search engines produce different identification sets so employing more than one search engine could result in an increased number of peptides (and proteins) being identified, if an appropriate mechanism for combining data can be defined. We have developed a search engine independent score, based on FDR, which allows peptide identifications from different search engines to be combined, called the FDR Score. The results demonstrate that the observed FDR is significantly different when analysing the set of identifications made by all three search engines, by each pair of search engines or by a single search engine. Our algorithm assigns identifications to groups according to the set of search engines that have made the identification, and re‐assigns the score (combined FDR Score). The combined FDR Score can differentiate between correct and incorrect peptide identifications with high accuracy, allowing on average 35% more peptide identifications to be made at a fixed FDR than using a single search engine.

5.

Spectral probabilities and generating functions of tandem mass spectra: a strike against decoy databases

Kim S Gupta N Pevzner PA 《Journal of proteome research》2008,7(8):3354-3363

A key problem in computational proteomics is distinguishing between correct and false peptide identifications. We argue that evaluating the error rates of peptide identifications is not unlike computing generating functions in combinatorics. We show that the generating functions and their derivatives (spectral energy and spectral probability) represent new features of tandem mass spectra that, similarly to Delta-scores, significantly improve peptide identifications. Furthermore, the spectral probability provides a rigorous solution to the problem of computing statistical significance of spectral identifications. The spectral energy/probability approach improves the sensitivity-specificity tradeoff of existing MS/MS search tools, addresses the notoriously difficult problem of "one-hit-wonders" in mass spectrometry, and often eliminates the need for decoy database searches. We therefore argue that the generating function approach has the potential to increase the number of peptide identifications in MS/MS searches.

6.

Clustering millions of tandem mass spectra (total citations: 1; self-citations: 0; citations by others: 1)

Frank AM Bandeira N Shen Z Tanner S Briggs SP Smith RD Pevzner PA 《Journal of proteome research》2008,7(1):113-122

Tandem mass spectrometry (MS/MS) experiments often generate redundant data sets containing multiple spectra of the same peptides. Clustering of MS/MS spectra takes advantage of this redundancy by identifying multiple spectra of the same peptide and replacing them with a single representative spectrum. Analyzing only representative spectra results in significant speed-up of MS/MS database searches. We present an efficient clustering approach for analyzing large MS/MS data sets (over 10 million spectra) with a capability to reduce the number of spectra submitted to further analysis by an order of magnitude. The MS/MS database search of clustered spectra results in fewer spurious hits to the database and increases the number of peptide identifications as compared to regular nonclustered searches. Our open source software MS-Clustering is available for download at http://peptide.ucsd.edu or can be run online at http://proteomics.bioprojects.org/MassSpec.

7.

Transferred Subgroup False Discovery Rate for Rare Post-translational Modifications Detected by Mass Spectrometry

Yan Fu Xiaohong Qian 《Molecular & cellular proteomics : MCP》2014,13(5):1359-1368

In shotgun proteomics, high-throughput mass spectrometry experiments and the subsequent data analysis produce thousands to millions of hypothetical peptide identifications. The common way to estimate the false discovery rate (FDR) of peptide identifications is the target-decoy database search strategy, which is efficient and accurate for large datasets. However, the legitimacy of the target-decoy strategy for protein-modification-centric studies has rarely been rigorously validated. It is often the case that a global FDR is estimated for all peptide identifications including both modified and unmodified peptides, but that only a subgroup of identifications with a certain type of modification is focused on. As revealed recently, the subgroup FDR of modified peptide identifications can differ dramatically from the global FDR at the same score threshold, and thus the former, when it is of interest, should be separately estimated. However, rare modifications often result in a very small number of modified peptide identifications, which makes the direct separate FDR estimation inaccurate because of the inadequate sample size. This paper presents a method called the transferred FDR for accurately estimating the FDR of an arbitrary number of modified peptide identifications. Through flexible use of the empirical data from a target-decoy database search, a theoretical relationship between the subgroup FDR and the global FDR is made computable. Through this relationship, the subgroup FDR can be predicted from the global FDR, allowing one to avoid an inaccurate direct estimation from a limited amount of data. The effectiveness of the method is demonstrated with both simulated and real mass spectra.

Post-translational modifications of proteins often play an essential role in the functions of proteins in cells (1). Abnormal modifications can change the properties of proteins, causing serious diseases (2). Because protein modifications are not directly encoded in the nucleotide sequences of organisms, they must be investigated at the protein level. In recent years, mass spectrometry technology has developed rapidly and has become the standard method for identifying proteins and their modifications in biological and clinical samples.

In shotgun proteomics experiments, proteins are digested into peptide mixtures that are then analyzed via high-throughput liquid chromatography–tandem mass spectrometry, resulting in thousands to millions of tandem mass spectra. To identify the peptide sequences and the modifications on them, the spectra are commonly searched against a protein sequence database. During the database search, according to the variable modification types specified by the user, all forms of modified candidate peptides are enumerated. For each spectrum, candidate peptides (with possible modifications) from the database are scored according to the quality of their match to the input spectrum. However, for many reasons, the top-scored matches are not always correct peptide identifications, and therefore they must be filtered according to their identification scores. Finding an appropriate score threshold that gives the desired false discovery rate (FDR) is a multiple hypothesis testing problem (10–).

At present, the common way to control the FDR of peptide identifications is an empirical approach called the target-decoy search strategy. In this strategy, in addition to the target protein sequences, the mass spectra are also searched against the same number of decoy protein sequences (e.g. reverse sequences of the target proteins). Because an incorrect identification has an equal chance of being a match to the target sequences or to the decoy sequences, the number of decoy matches above a score threshold can be used as an estimate of the number of random target matches, and the FDR (of the target matches) can be simply estimated as the number of decoy matches divided by the number of target matches. The target-decoy method, although simple and effective, is applicable to large datasets only. When the number of matches being evaluated is very small, this method becomes inaccurate because of the inadequate sample size. Fortunately, for high-throughput proteomic mass spectrometry experiments, the number of mass spectra is always sufficiently large. Current efforts are mostly devoted to increasing the sensitivity of peptide identification at a given FDR by using various techniques such as machine learning.

When the purpose of an experiment is to search for protein modifications, the problem of FDR estimation becomes somewhat complex. In fact, the legitimacy of the target-decoy method for modification-centric studies was not rigorously discussed until very recently (16). At present, for multiple reasons, the identifications of modified and unmodified peptides are usually combined in the search result, and a global FDR is estimated for them in combination, with only a subgroup of identifications with specific modifications being focused on. However, the FDR of modified peptides can be significantly or even extremely different from that of unmodified peptides at the same score threshold. There are three reasons for this fact. First, because the spectra of modified peptides can have their own features (e.g. insufficient fragmentation or neutral losses), they can have different score distributions from those of unmodified peptides. Second, because the proportions of modified and unmodified peptides in the protein sample are different, the prior probabilities of obtaining a correct identification are different for modified and unmodified peptides. Third, because the proportions of modified and unmodified candidate peptides in the search space are different, the prior probabilities of obtaining an incorrect identification are also different for modified and unmodified peptides. Therefore, the modified peptide identifications of interest should be extracted from the identification result and subjected to a separate FDR estimation, as pointed out recently (16–).

The difficulty of separate FDR estimations is highlighted when there are too few modified peptide identifications to allow an accurate estimation. Many protein modifications are present in low abundance in cells but perform important biological functions. These rare modifications have very low chances of being detected by mass spectrometry. A crucial question is, if very few modifications are identified from a very large dataset of mass spectra, can they be regarded as correct identifications? There was no answer to this question in the past in terms of FDR control. The target-decoy strategy loses its efficacy in such cases. For example, imagine that we have 10 modified peptide identifications above a score threshold after a search and that all of them are matches to target protein sequences. Can we say that the FDR of these identifications is zero (0/10)? If we decrease the score threshold slightly in such a way that one more modified peptide identification is included but find that that peptide is unfortunately a match to the decoy sequence, then can we say that the FDR of the top 10 target identifications is 10% (1/10)? It is clear here that the inclusion or exclusion of the 11th decoy identification has a great influence on the FDR estimated via the common target-decoy strategy. In fact, according to a binomial model, the probability that there are one or more false identifications among the top 10 target matches is as high as 0.5, which means that the real proportion of false discoveries has a 50% chance of being no less than 10% (1/10). The appropriate way to estimate the FDR of the 10 target identifications is to give an appropriate estimate of the expected number of false identifications among them, and, most important, this estimate must not be an integer (e.g. 0 or 1) but can be a real number between 0 and 1. Note that single-spectrum significance measures (e.g. p values) are not appropriate for multiple hypothesis testing, not to mention that they can hardly be accurately computed in mass spectrometry.

Separate FDR estimation for grouped multiple hypothesis testing is not new in statistics and bioinformatics. A typical example is the microarray data of mRNAs from different locations in an organism or from genes that are involved in different biological processes (19, ). Efron (21) recently proposed a method for robust separate FDR estimation for small subgroups in the empirical Bayes framework. The underlying principle of this method is that if we can find the quantitative relationship between the subgroup FDR and the global FDR, the former can be indirectly inferred from the latter instead of being estimated from a limited amount of data. The relationship given by Efron is quite general and makes no use of domain-specific information. Furthermore, it requires known conditional probabilities of null and non-null cases given the score threshold. These probabilities are, however, unavailable in the modified peptide identification problem.

This paper presents a dedicated method for accurate FDR estimation for rare protein modifications detected from large-scale mass spectral data. This method is based on a theoretical relationship between the subgroup FDR of modified peptide identifications and the global FDR of all peptide identifications. To make the relationship computable, the component factors in it are replaced by or fitted from the empirical data of the target-decoy database search results. Most important, the probability that an incorrect identification is an assignment of a modified peptide is approximated by a linear function of the score threshold. By extrapolation, this probability can be reliably obtained for high-tail scores that are suitable as thresholds. The proposed method was validated on both simulated and real mass spectra. To the best of our knowledge, this study is the first effort toward reliable FDR control of rare protein modifications identified from mass spectra. (Note that the error rate control for modification site location is another complex problem and is not the aim of this paper.)
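The plain target-decoy estimate described in this entry (decoy matches divided by target matches above a score threshold) can be sketched as follows. This is only the common baseline estimate, not the transferred FDR method itself. The function name and the (score, is_decoy) input layout are assumptions made for illustration.

```python
def target_decoy_fdr(psms, threshold):
    """Estimate FDR as decoy matches / target matches at score >= threshold.

    psms is an iterable of (score, is_decoy) pairs (an assumed layout).
    Returns 0.0 when no target match passes the threshold.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0


# The 10-target example from the text: ten target matches, then one decoy just below them.
example = [(20.0 - i, False) for i in range(10)] + [(10.5, True)]
fdr_strict = target_decoy_fdr(example, 11.0)  # 0 decoys / 10 targets
fdr_loose = target_decoy_fdr(example, 10.5)   # 1 decoy / 10 targets
```

With so few matches, moving the threshold past a single decoy shifts the estimate from 0 to 0.1, which is the instability the text describes for small subgroups.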

8.

Integrated approach for manual evaluation of peptides identified by searching protein sequence databases with tandem mass spectra

Chen Y Kwon SW Kim SC Zhao Y 《Journal of proteome research》2005,4(3):998-1005

Quantitative proteomics relies on accurate protein identification, which often is carried out by automated searching of a sequence database with tandem mass spectra of peptides. When these spectra contain limited information, automated searches may lead to incorrect peptide identifications. It is therefore necessary to validate the identifications by careful manual inspection of the mass spectra. Not only is this task time-consuming, but the reliability of the validation varies with the experience of the analyst. Here, we report a systematic approach to evaluating peptide identifications made by automated search algorithms. The method is based on the principle that the candidate peptide sequence should adequately explain the observed fragment ions. Also, the mass errors of neighboring fragments should be similar. To evaluate our method, we studied tandem mass spectra obtained from tryptic digests of E. coli and HeLa cells. Candidate peptides were identified with the automated search engine Mascot and subjected to the manual validation method. The method found correct peptide identifications that were given low Mascot scores (e.g., 20-25) and incorrect peptide identifications that were given high Mascot scores (e.g., 40-50). The method comprehensively detected false results from searches designed to produce incorrect identifications. Comparison of the tandem mass spectra of synthetic candidate peptides to the spectra obtained from the complex peptide mixtures confirmed the accuracy of the evaluation method. Thus, the evaluation approach described here could help boost the accuracy of protein identification, increase the number of peptides identified, and provide a step toward developing a more accurate next-generation algorithm for protein identification.

9.

Verification of automated peptide identifications from proteomic tandem mass spectra

Tabb DL Friedman DB Ham AJ 《Nature protocols》2006,1(5):2213-2222

Shotgun proteomics yields tandem mass spectra of peptides that can be identified by database search algorithms. When only a few observed peptides suggest the presence of a protein, establishing the accuracy of the peptide identifications is necessary for accepting or rejecting the protein identification. In this protocol, we describe the properties of peptide identifications that can differentiate legitimately identified peptides from spurious ones. The chemistry of fragmentation, as embodied in the 'mobile proton' and 'pathways in competition' models, informs the process of confirming or rejecting each spectral match. Examples of ion-trap and tandem time-of-flight (TOF/TOF) mass spectra illustrate these principles of fragmentation.

10.

Estimating false discovery rates for peptide and protein identification using randomized databases

Gregory Hather Roger Higdon Andrew Bauman Priska D. von Haller Eugene Kolker 《Proteomics》2010,10(12):2369-2376

MS‐based proteomics characterizes protein contents of biological samples. The most common approach is to first match observed MS/MS peptide spectra against theoretical spectra from a protein sequence database and then to score these matches. The false discovery rate (FDR) can be estimated as a function of the score by searching together the protein sequence database and its randomized version and comparing the score distributions of the randomized versus nonrandomized matches. This work introduces a straightforward isotonic regression‐based method to estimate the cumulative FDRs and local FDRs (LFDRs) of peptide identification. Our isotonic method not only performed as well as other methods used for comparison, but also has the advantages of being: (i) monotonic in the score, (ii) computationally simple, and (iii) not dependent on assumptions about score distributions. We demonstrate the flexibility of our approach by using it to estimate FDRs and LFDRs for protein identification using summaries of the peptide spectra scores. We reconfirmed that several of these methods were superior to a two‐peptide rule. Finally, by estimating both the FDRs and LFDRs, we showed for both peptide and protein identification, moderate FDR values (5%) corresponded to large LFDR values (53 and 60%).
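To make "monotonic in the score" concrete, here is a minimal sketch of isotonic regression using the pool-adjacent-violators algorithm. It assumes the input is a list of decoy indicators (1 for a randomized-database match, 0 otherwise) ordered by increasing score, so the fitted decoy fraction should not increase with score. The function name is hypothetical, and the abstract does not say how the authors convert such a fit into FDR or LFDR values, so that step is left out.

```python
def pava_nonincreasing(values):
    """Least-squares non-increasing fit via pool-adjacent-violators.

    values: numbers ordered by increasing score (for example decoy indicators).
    Returns a list of fitted values of the same length.
    """
    blocks = []  # each block holds [sum of values, count]
    for value in values:
        blocks.append([float(value), 1])
        # Merge while the previous block's mean is below the current block's mean,
        # because that would violate the non-increasing constraint.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


# Decoy indicators sorted by increasing score; the lone decoy at rank 3 is pooled with rank 2.
decoy_fraction = pava_nonincreasing([1, 0, 1, 0, 0])
```

The fit needs no assumption about the shape of the score distributions, which matches the third advantage listed in the abstract.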

11.

Andromeda: a peptide search engine integrated into the MaxQuant environment (total citations: 3; self-citations: 0; citations by others: 3)

Cox J Neuhauser N Michalski A Scheltema RA Olsen JV Mann M 《Journal of proteome research》2011,10(4):1794-1805

A key step in mass spectrometry (MS)-based proteomics is the identification of peptides in sequence databases by their fragmentation spectra. Here we describe Andromeda, a novel peptide search engine using a probabilistic scoring model. On proteome data, Andromeda performs as well as Mascot, a widely used commercial search engine, as judged by sensitivity and specificity analysis based on target decoy searches. Furthermore, it can handle data with arbitrarily high fragment mass accuracy, is able to assign and score complex patterns of post-translational modifications, such as highly phosphorylated peptides, and accommodates extremely large databases. The algorithms of Andromeda are provided. Andromeda can function independently or as an integrated search engine of the widely used MaxQuant computational proteomics platform and both are freely available at www.maxquant.org. The combination enables analysis of large data sets in a simple analysis workflow on a desktop computer. For searching individual spectra Andromeda is also accessible via a web server. We demonstrate the flexibility of the system by implementing the capability to identify cofragmented peptides, significantly improving the total number of identified peptides.

12.

RT-SVR+q: a strategy for post-Mascot analysis using retention time and q value metric to improve peptide and protein identifications

Cao W Ma D Kapur A Patankar MS Ma Y Li L 《Journal of Proteomics》2011,75(2):480-490

Shotgun proteomics commonly utilizes database search engines such as Mascot to identify proteins from tandem mass (MS/MS) spectra. False discovery rate (FDR) is often used to assess the confidence of peptide identifications. However, a widely accepted FDR of 1% sacrifices the sensitivity of peptide identification while improving the accuracy. This article details a machine learning approach combining retention time based support vector regressor (RT-SVR) with q value based statistical analysis to improve peptide and protein identifications with high sensitivity and accuracy. The use of confident peptide identifications as training examples and careful feature selection ensures high R values (>0.900) for all models. The application of the RT-SVR model on Mascot results (p=0.10) increases the sensitivity of peptide identifications. The q value, as a function of deviation between predicted and experimental RTs (ΔRT), is used to assess the significance of peptide identifications. We demonstrate that the peptide and protein identifications increase by up to 89.4% and 83.5%, respectively, for a specified q value of 0.01 when applying the method to proteomic analysis of the natural killer leukemia cell line (NKL). This study establishes an effective methodology and provides a platform for profiling confident proteomes in more relevant species as well as a future investigation of accurate protein quantification.
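In its common target-decoy form, the q value of an identification is the smallest FDR over all thresholds at which that identification would still be accepted. The sketch below computes it from a ranked list of (score, is_decoy) pairs. This layout, the function name, and the ranking by search score are assumptions made for illustration: the abstract ranks by ΔRT instead, and that procedure is not reproduced here. Tied scores are not treated specially.

```python
def q_values(psms):
    """Return (score, is_decoy, q_value) tuples sorted by descending score.

    psms is a list of (score, is_decoy) pairs (an assumed layout).
    The FDR at each rank is decoys / targets among all matches ranked at or above it.
    """
    ranked = sorted(psms, key=lambda psm: psm[0], reverse=True)
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))
    # q value: the minimum FDR at this rank or at any lower-scoring rank.
    result = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        result.append(running_min)
    result.reverse()
    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ranked, result)]


scored = q_values([(10, False), (9, False), (8, True), (7, False), (6, False)])
```

Unlike the raw running FDR, the q value never decreases as the score drops, so filtering at "q value of 0.01 or less" always selects a contiguous top portion of the ranked list.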

13.

ProbIDtree: an automated software program capable of identifying multiple peptides from a single collision-induced dissociation spectrum collected by a tandem mass spectrometer

Zhang N Li XJ Ye M Pan S Schwikowski B Aebersold R 《Proteomics》2005,5(16):4096-4106

In MS/MS experiments with automated precursor ion selection, only a fraction of sequencing attempts leads to the successful identification of a peptide. A number of reasons may contribute to this situation. They include poor fragmentation of the selected precursor ion, the presence of modified residues in the peptide, mismatches with sequence databases, and frequently, the concurrent fragmentation of multiple precursors in the same CID attempt. Current database search engines are incapable of correctly assigning the sequences of multiple precursors to such spectra. We have developed a search engine, ProbIDtree, which can identify multiple peptides from a CID spectrum generated by the concurrent fragmentation of multiple precursor ions. This is achieved by iterative database searching in which the submitted spectra are generated by subtracting the fragment ions assigned to a tentatively matched peptide from the acquired spectrum and in which each match is assigned a tentative probability score. Tentatively matched peptides are organized in a tree structure from which their adjusted probability scores are calculated and used to determine the correct identifications. The results using MALDI-TOF-TOF MS/MS data demonstrate that multiple peptides can be effectively identified simultaneously with high confidence using ProbIDtree.

14.

Partially sequenced organisms, decoy searches and false discovery rates

Victor B Gabriël S Kanobana K Mostovenko E Polman K Dorny P Deelder AM Palmblad M 《Journal of proteome research》2012,11(3):1991-1995

Tandem mass spectrometry is commonly used to identify peptides, typically by comparing their product ion spectra with those predicted from a protein sequence database and scoring these matches. The most reported quality metric for a set of peptide identifications is the false discovery rate (FDR), the fraction of expected false identifications in the set. This metric has so far only been used for completely sequenced organisms or known protein mixtures. We have investigated whether FDR estimations are also applicable in the case of partially sequenced organisms, where many high-quality spectra fail to identify the correct peptides because the latter are not present in the searched sequence database. Using real data from human plasma and simulated partial sequence databases derived from two complete human sequence databases with different levels of redundancy, we could demonstrate that the mixture model approach in PeptideProphet is robust for partial databases, particularly if used in combination with decoy sequences. We therefore recommend using this method when estimating the FDR and reporting peptide identifications from incompletely sequenced organisms.

15.

YPED: An Integrated Bioinformatics Suite and Database for Mass Spectrometry-based Proteomics Research

Christopher M. Colangelo Mark Shifman Kei-Hoi Cheung Kathryn L. Stone Nicholas J. Carriero Erol E. Gulcicek TuKiet T. Lam Terence Wu Robert D. Bjornson Can Bruce Angus C. Nairn Jesse Rinehart Perry L. Miller Kenneth R. Williams 《Genomics, Proteomics & Bioinformatics》2015,13(1):25-35

We report a significantly enhanced bioinformatics suite and database for proteomics research called Yale Protein Expression Database (YPED) that is used by investigators at more than 300 institutions worldwide. YPED meets the data management, archival, and analysis needs of high-throughput mass spectrometry-based proteomics research, ranging from a single laboratory, to a group of laboratories within and beyond an institution, to the entire proteomics community. The current version is a significant improvement over the first version in that it contains new modules for liquid chromatography–tandem mass spectrometry (LC–MS/MS) database search results, label and label-free quantitative proteomic analysis, and several scoring outputs for phosphopeptide site localization. In addition, we have added both peptide and protein comparative analysis tools to enable pairwise analysis of distinct peptides/proteins in each sample and of overlapping peptides/proteins between all samples in multiple datasets. We have also implemented a targeted proteomics module for automated multiple reaction monitoring (MRM)/selective reaction monitoring (SRM) assay development. We have linked YPED's database search results and both label-based and label-free fold-change analysis to the Skyline Panorama repository for online spectra visualization. In addition, we have built enhanced functionality to curate peptide identifications into an MS/MS peptide spectral library for all of our protein database search identification results.

16.

Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry (total citations: 1; self-citations: 0; citations by others: 1)

Elias JE Gygi SP 《Nature methods》2007,4(3):207-214

Liquid chromatography and tandem mass spectrometry (LC-MS/MS) has become the preferred method for conducting large-scale surveys of proteomes. Automated interpretation of tandem mass spectrometry (MS/MS) spectra can be problematic, however, for a variety of reasons. As most sequence search engines return results even for 'unmatchable' spectra, proteome researchers must devise ways to distinguish correct from incorrect peptide identifications. The target-decoy search strategy represents a straightforward and effective way to manage this effort. Despite the apparent simplicity of this method, some controversy surrounds its successful application. Here we clarify our preferred methodology by addressing four issues based on observed decoy hit frequencies: (i) the major assumptions made with this database search strategy are reasonable; (ii) concatenated target-decoy database searches are preferable to separate target and decoy database searches; (iii) the theoretical error associated with target-decoy false positive (FP) rate measurements can be estimated; and (iv) alternate methods for constructing decoy databases are similarly effective once certain considerations are taken into account.
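
As an illustration of the concatenated target-decoy idea described in this abstract, the following is a minimal sketch, not the authors' implementation. It assumes that incorrect matches are equally likely to land on target or decoy sequences, so the number of decoy hits above a score threshold estimates the number of false target hits. The names `estimate_fdr` and `threshold_for_fdr`, the `(score, is_decoy)` tuple layout, and the sample scores are hypothetical.

```python
from typing import List, Optional, Tuple

# A peptide-spectrum match: (score, is_decoy). Higher scores are better.
# This layout is an assumption made for the example.
PSM = Tuple[float, bool]


def estimate_fdr(psms: List[PSM], threshold: float) -> float:
    """Estimate the false discovery rate among target hits at or above threshold.

    Assumes a concatenated target-decoy search in which a wrong match is
    equally likely to hit a target or a decoy sequence.
    """
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        # No accepted targets: report 0 if nothing passed, otherwise 1.
        return 0.0 if decoys == 0 else 1.0
    return min(1.0, decoys / targets)


def threshold_for_fdr(psms: List[PSM], max_fdr: float) -> Optional[float]:
    """Return the lowest observed score whose estimated FDR is <= max_fdr."""
    best = None
    for score in sorted({s for s, _ in psms}, reverse=True):
        if estimate_fdr(psms, score) <= max_fdr:
            best = score  # keep lowering the cutoff while the estimate passes
    return best


# Hypothetical scores from one concatenated search.
example_psms = [
    (9.0, False), (8.0, False), (7.5, False), (7.0, True),
    (6.0, False), (5.0, False), (4.0, True), (3.0, False),
]
print(estimate_fdr(example_psms, 5.0))
print(threshold_for_fdr(example_psms, 0.2))
```

A related convention counts false positives among all accepted hits, targets and decoys together, as twice the decoy count divided by the total. Which estimator to use depends on whether decoy hits are kept in the reported list.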

17.

Randomized sequence databases for tandem mass spectrometry peptide and protein identification (Total citations: 4; self-citations: 0; citations by others: 4)

Higdon R Hogan JM Van Belle G Kolker E 《Omics : a journal of integrative biology》2005,9(4):364-379

Tandem mass spectrometry (MS/MS) combined with database searching is currently the most widely used method for high-throughput peptide and protein identification. Many different algorithms, scoring criteria, and statistical models have been used to identify peptides and proteins in complex biological samples, and many studies, including our own, describe the accuracy of these identifications, using at best generic terms such as "high confidence." False positive identification rates for these criteria can vary substantially with changing organisms under study, growth conditions, sequence databases, experimental protocols, and instrumentation; therefore, study-specific methods are needed to estimate the accuracy (false positive rates) of these peptide and protein identifications. We present and evaluate methods for estimating false positive identification rates based on searches of randomized databases (reversed and reshuffled). We examine the use of separate searches of a forward then a randomized database and combined searches of a randomized database appended to a forward sequence database. Estimated error rates from randomized database searches are first compared against actual error rates from MS/MS runs of known protein standards. These methods are then applied to biological samples of the model microorganism Shewanella oneidensis strain MR-1. Based on the results obtained in this study, we recommend the use of combined searches of a reshuffled database appended to a forward sequence database as a means of providing quantitative estimates of false positive identification rates of peptides and proteins. This will allow researchers to set criteria and thresholds to achieve a desired error rate and provide the scientific community with direct and quantifiable measures of peptide and protein identification accuracy as opposed to vague assessments such as "high confidence."
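
The reversed and reshuffled databases mentioned in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' procedure: whole protein sequences are reversed or shuffled, the randomized entries are appended to the forward database, and the `DECOY_` prefix, the function names, and the sample sequences are all hypothetical.

```python
import random
from typing import Dict


def reverse_decoy(sequence: str) -> str:
    """Return the protein sequence read backwards."""
    return sequence[::-1]


def shuffle_decoy(sequence: str, rng: random.Random) -> str:
    """Return a random permutation of the residues (same composition and length)."""
    residues = list(sequence)
    rng.shuffle(residues)
    return "".join(residues)


def append_decoys(
    proteins: Dict[str, str],
    method: str = "shuffle",
    seed: int = 0,
    prefix: str = "DECOY_",  # hypothetical naming convention
) -> Dict[str, str]:
    """Build a combined database: forward entries plus one randomized entry each."""
    rng = random.Random(seed)  # seeded so the decoy database is reproducible
    combined = dict(proteins)
    for name, sequence in proteins.items():
        if method == "reverse":
            decoy = reverse_decoy(sequence)
        elif method == "shuffle":
            decoy = shuffle_decoy(sequence, rng)
        else:
            raise ValueError(f"unknown decoy method: {method}")
        combined[prefix + name] = decoy
    return combined


# Hypothetical two-protein forward database.
forward_db = {"prot1": "MKTAYIAKQR", "prot2": "PEPTIDEK"}
combined_db = append_decoys(forward_db, method="shuffle", seed=42)
for name, sequence in combined_db.items():
    print(name, sequence)
```

Searching the combined database in one pass corresponds to the "appended" strategy the abstract recommends; searching `forward_db` and the decoy entries separately would correspond to the separate-search strategy it compares against.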

18.

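A Scoring Method for Identifying Peptides by Tandem Mass Spectrometry (Total citations: 1; self-citations: 0; citations by others: 1)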

Sheng QH Tang HX Xie T Wang LS Ding DF 《Acta biochimica et biophysica Sinica》2003,35(8):734-740

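Several high-throughput scoring algorithms that assess the agreement between a tandem mass spectrum and the theoretical spectra of peptides in a database are already used in shotgun proteomics research. In practice, however, these methods produce a large number of incorrect peptide identifications. Here we propose a new scoring algorithm for identifying peptide sequences from tandem mass spectra. The algorithm jointly considers the probability of occurrence of different ions in tandem mass spectra, the number of enzymatic cleavage sites in the peptide, and the degree and pattern of matching between theoretical and experimental ions. Tests on a large tandem mass spectrometry data set show that PepSearch, the software developed from this algorithm, identifies peptides more accurately than SEQUEST, currently the most widely used software. PepSearch can be downloaded from http://compbio.sibsnet.org/projects/pepsearch.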

19.

PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification

Zhang J Xin L Shan B Chen W Xie M Yuen D Zhang W Zhang Z Lajoie GA Ma B 《Molecular & cellular proteomics : MCP》2012,11(4):M111.010587

Many software tools have been developed for the automated identification of peptides from tandem mass spectra. The accuracy and sensitivity of the identification software via database search are critical for successful proteomics experiments. A new database search tool, PEAKS DB, has been developed by incorporating the de novo sequencing results into the database search. PEAKS DB achieves significantly improved accuracy and sensitivity over two other commonly used software packages. Additionally, a new result validation method, decoy fusion, has been introduced to solve the issue of overconfidence that exists in the conventional target decoy method for certain types of peptide identification software.

20.

Experiment-specific estimation of peptide identification probabilities using a randomized database

Higdon R Hogan JM Kolker N van Belle G Kolker E 《Omics : a journal of integrative biology》2007,11(4):351-365

Determining the error rate for peptide and protein identification accurately and reliably is necessary to enable evaluation and cross-comparisons of high-throughput proteomics experiments. Currently, peptide identification is based either on preset scoring thresholds or on probabilistic models trained on datasets that are often dissimilar to experimental results. The false discovery rates (FDR) and peptide identification probabilities for these preset thresholds or models often vary greatly across different experimental treatments, organisms, or instruments used in specific experiments. To overcome these difficulties, randomized databases have been used to estimate the FDR. However, the cumulative FDR may include low-probability identifications when there are a large number of peptide identifications and exclude high-probability identifications when there are few. To overcome this logical inconsistency, this study expands the use of randomized databases to generate experiment-specific estimates of peptide identification probabilities. These experiment-specific probabilities are generated by logistic and Loess regression models of the peptide scores obtained from original and reshuffled database matches. These experiment-specific probabilities are shown to approximate "true" probabilities very well, based on known standard protein mixtures across different experiments. Probabilities generated by the earlier Peptide_Prophet and more recent LIPS models are shown to differ significantly from this study's experiment-specific probabilities, especially for unknown samples. The experiment-specific probabilities reliably estimate the accuracy of peptide identifications and overcome potential logical inconsistencies of the cumulative FDR. This estimation method is demonstrated using a Sequest database search, LIPS model, and a reshuffled database. However, this approach is generally applicable to any search algorithm, peptide scoring, and statistical model when using a randomized database.
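
One way to see how a regression on original versus reshuffled matches could yield per-identification probabilities is the short derivation below. It is a sketch under assumptions the abstract does not state, and not necessarily the authors' exact model: the search is run against a forward database combined with an equal-sized reshuffled one, and an incorrect match is equally likely to fall in either half. The symbols $t(s)$, $d(s)$, $p(s)$, $a$, and $b$ are introduced here for illustration only.

```latex
% t(s), d(s): densities of forward and reshuffled matches at score s (assumed notation)
\begin{align}
p(s) &= \Pr(\text{forward} \mid s) = \frac{t(s)}{t(s) + d(s)}, \\
\Pr(\text{correct} \mid \text{forward},\, s)
     &\approx \frac{t(s) - d(s)}{t(s)} = 1 - \frac{1 - p(s)}{p(s)}, \\
\text{and, for a logistic fit } p(s) &= \frac{1}{1 + e^{-(a + b s)}}:\quad
\Pr(\text{correct} \mid \text{forward},\, s) \approx \max\!\bigl(0,\; 1 - e^{-(a + b s)}\bigr).
\end{align}
```

Under these assumptions the reshuffled matches at a given score stand in for the incorrect forward matches at that score, which is why the resulting probability is local to the score rather than cumulative like the FDR.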