首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Development of robust statistical methods for validation of peptide assignments to tandem mass (MS/MS) spectra obtained using database searching remains an important problem. PeptideProphet is one of the commonly used computational tools available for that purpose. An alternative simple approach for validation of peptide assignments is based on addition of decoy (reversed, randomized, or shuffled) sequences to the searched protein sequence database. The probabilistic modeling approach of PeptideProphet and the decoy strategy can be combined within a single semisupervised framework, leading to improved robustness and higher accuracy of computed probabilities even in the case of most challenging data sets. We present a semisupervised expectation-maximization (EM) algorithm for constructing a Bayes classifier for peptide identification using the probability mixture model, extending PeptideProphet to incorporate decoy peptide matches. Using several data sets of varying complexity, from control protein mixtures to a human plasma sample, and using three commonly used database search programs, SEQUEST, MASCOT, and TANDEM/k-score, we illustrate that more accurate mixture estimation leads to an improved control of the false discovery rate in the classification of peptide assignments.  相似文献   

2.
We present a wrapper-based approach to estimate and control the false discovery rate for peptide identifications using the outputs from multiple commercially available MS/MS search engines. Features of the approach include the flexibility to combine output from multiple search engines with sequence and spectral derived features in a flexible classification model to produce a score associated with correct peptide identifications. This classification model score from a reversed database search is taken as the null distribution for estimating p-values and false discovery rates using a simple and established statistical procedure. Results from 10 analyses of rat sera on an LTQ-FT mass spectrometer indicate that the method is well calibrated for controlling the proportion of false positives in a set of reported peptide identifications while correctly identifying more peptides than rule-based methods using one search engine alone.  相似文献   

3.
The high selectivity and throughput of tandem mass spectrometry allow for rapid identification and localization of various posttranslational protein modifications from complex mixtures by shotgun approaches. Although sequence database search algorithms provide necessary support to process the potentially enormous quantity of MS/MS spectra generated from large scale tandem mass spectrometry experiments, false positive identifications of peptide modifications may exist even after implementation of stringent identification criteria. In this report, we describe factors that lead to misinterpretation of MS/MS spectra as well as common chemical and experimental artifacts that generate false positives using the proteomics-based identification of tyrosine nitration as an example. In addition to the proposed manual validation criteria, the importance of peptide synthesis and subsequent MS/MS characterization for validation of peptide nitration demonstrated by several examples from earlier publications is also presented.  相似文献   

4.
The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since the complete separation between the true and false hits is practically never achieved, there is need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. Modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2% respectively. The approach is applicable to different search methodologies- separate as well as concatenated database search, high mass accuracy, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed better performance than before. We have shown that appropriate threshold learnt from decoys, can be very effective in improving the database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.  相似文献   

5.
Tandem mass spectrometry-based proteomics is currently in great demand of computational methods that facilitate the elimination of likely false positives in peptide and protein identification. In the last few years, a number of new peptide identification programs have been described, but scores or other significance measures reported by these programs cannot always be directly translated into an easy to interpret error rate measurement such as the false discovery rate. In this work we used generalized lambda distributions to model frequency distributions of database search scores computed by MASCOT, X!TANDEM with k-score plug-in, OMSSA, and InsPecT. From these distributions, we could successfully estimate p values and false discovery rates with high accuracy. From the set of peptide assignments reported by any of these engines, we also defined a generic protein scoring scheme that enabled accurate estimation of protein-level p values by simulation of random score distributions that was also found to yield good estimates of protein-level false discovery rate. The performance of these methods was evaluated by searching four freely available data sets ranging from 40,000 to 285,000 MS/MS spectra.  相似文献   

6.

Background

The sequence database searching has been the dominant method for peptide identification, in which a large number of peptide spectra generated from LC/MS/MS experiments are searched using a search engine against theoretical fragmentation spectra derived from a protein sequences database or a spectral library. Selecting trustworthy peptide spectrum matches (PSMs) remains a challenge.

Results

A novel scoring method named FC-Ranker is developed to assign a nonnegative weight to each target PSM based on the possibility of its being correct. Particularly, the scores of PSMs are updated by using a fuzzy SVM classification model and a fuzzy silhouette index iteratively. Trustworthy PSMs will be assigned high scores when the algorithm stops.

Conclusions

Our experimental studies show that FC-Ranker outperforms other post-database search algorithms over a variety of datasets, and it can be extended to solve a general classification problem with uncertain labels.
  相似文献   

7.
MS/MS is a widely used method for proteome‐wide analysis of protein expression and PTMs. The thousands of MS/MS spectra produced from a single experiment pose a major challenge for downstream analysis. Standard programs, such as MASCOT, provide peptide assignments for many of the spectra, including identification of PTM sites, but these results are plagued by false‐positive identifications. In phosphoproteomic experiments, only a single peptide assignment is typically available to support identification of each phosphorylation site, and hence minimizing false positives is critical. Thus, tedious manual validation is often required to increase confidence in the spectral assignments. We have developed phoMSVal, an open‐source platform for managing MS/MS data and automatically validating identified phosphopeptides. We tested five classification algorithms with 17 extracted features to separate correct peptide assignments from incorrect ones using over 2600 manually curated spectra. The naïve Bayes algorithm was among the best classifiers with an AUC value of 97% and PPV of 97% for phosphotyrosine data. This classifier required only three features to achieve a 76% decrease in false positives as compared with MASCOT while retaining 97% of true positives. This algorithm was able to classify an independent phosphoserine/threonine data set with AUC value of 93% and PPV of 91%, demonstrating the applicability of this method for all types of phospho‐MS/MS data. PhoMSVal is available at http://csbi.ltdk.helsinki.fi/phomsval .  相似文献   

8.
LC MS/MS has become an established technology in proteomic studies, and with the maturation of the technology the bottleneck has shifted from data generation to data validation and mining. To address this bottleneck we developed Experimental Peptide Identification Repository (EPIR), which is an integrated software platform for storage, validation, and mining of LC MS/MS-derived peptide evidence. EPIR is a cumulative data repository where precursor ions are linked to peptide assignments and protein associations returned by a search engine (e.g. Mascot, Sequest, or PepSea). Any number of datasets can be parsed into EPIR and subsequently validated and mined using a set of software modules that overlay the database. These include a peptide validation module, a protein grouping module, a generic module for extracting quantitative data, a comparative module, and additional modules for extracting statistical information. In the present study, the utility of EPIR and associated software tools is demonstrated on LC MS/MS data derived from a set of model proteins and complex protein mixtures derived from MCF-7 breast cancer cells. Emphasis is placed on the key strengths of EPIR, including the ability to validate and mine multiple combined datasets, and presentation of protein-level evidence in concise, nonredundant protein groups that are based on shared peptide evidence.  相似文献   

9.
Searching tandem mass spectra against a protein database has been a mainstream method for peptide identification. Improving peptide identification results by ranking true Peptide-Spectrum Matches (PSMs) over their false counterparts leads to the development of various reranking algorithms. In peptide reranking, discriminative information is essential to distinguish true PSMs from false PSMs. Generally, most peptide reranking methods obtain discriminative information directly from database search scores or by training machine learning models. Information in the protein database and MS1 spectra (i.e., single stage MS spectra) is ignored. In this paper, we propose to use information in the protein database and MS1 spectra to rerank peptide identification results. To quantitatively analyze their effects to peptide reranking results, three peptide reranking methods are proposed: PPMRanker, PPIRanker, and MIRanker. PPMRanker only uses Protein-Peptide Map (PPM) information from the protein database, PPIRanker only uses Precursor Peak Intensity (PPI) information, and MIRanker employs both PPM information and PPI information. According to our experiments on a standard protein mixture data set, a human data set and a mouse data set, PPMRanker and MIRanker achieve better peptide reranking results than PetideProphet, PeptideProphet+NSP (number of sibling peptides) and a score regularization method SRPI. The source codes of PPMRanker, PPIRanker, and MIRanker, and all supplementary documents are available at our website: http://bioinformatics.ust.hk/pepreranking/. Alternatively, these documents can also be downloaded from: http://sourceforge.net/projects/pepreranking/.  相似文献   

10.
MOTIVATION: Tandem mass spectrometry (MS/MS) identifies protein sequences using database search engines, at the core of which is a score that measures the similarity between peptide MS/MS spectra and a protein sequence database. The TANDEM application was developed as a freely available database search engine for the proteomics research community. To extend TANDEM as a platform for further research on developing improved database scoring methods, we modified the software to allow users to redefine the scoring function and replace the native TANDEM scoring function while leaving the remaining core application intact. Redefinition is performed at run time so multiple scoring functions are available to be selected and applied from a single search engine binary. We introduce the implementation of the pluggable scoring algorithm and also provide implementations of two TANDEM compatible scoring functions, one previously described scoring function compatible with PeptideProphet and one very simple scoring function that quantitative researchers may use to begin their development. This extension builds on the open-source TANDEM project and will facilitate research into and dissemination of novel algorithms for matching MS/MS spectra to peptide sequences. The pluggable scoring schema is also compatible with related search applications P3 and Hunter, which are part of the X! suite of database matching algorithms. The pluggable scores and the X! suite of applications are all written in C++. AVAILABILITY: Source code for the scoring functions is available from http://proteomics.fhcrc.org  相似文献   

11.
Orthogonal analysis of amino acid substitutions as a result of SNPs in existing proteomic datasets provides a critical foundation for the emerging field of population-based proteomics. Large-scale proteomics datasets, derived from shotgun tandem MS analysis of complex cellular protein mixtures, contain many unassigned spectra that may correspond to alternate alleles coded by SNPs. The purpose of this work was to identify tandem MS spectra in LC-MS/MS shotgun proteomics datasets that may represent coding nonsynonymous SNPs (nsSNP). To this end, we generated a tryptic peptide database created from allelic information found in NCBI's dbSNP. We searched this database with tandem MS spectra of tryptic peptides from DU4475 breast tumor cells that had been fractioned by pI in the first-dimension and reverse-phase LC in the second dimension. In all we identified 629 nsSNPs, of which 36 were of alternate SNP alleles not found in the reference NCBI or IPI protein databases. Searches for SNP-peptides carry a high risk of false positives due both to mass shifts caused by modifications and because of multiple representations of the same peptide within the genome. In this work, false positives were filtered using a novel peptide pI prediction algorithm and characterized using a decoy database developed by random substitution of similarly sized reference peptides. Secondary validation by sequencing of corresponding genomic DNA confirmed the presence of the predicted SNP in 8 of 10 SNP-peptides. This work highlights that the usefulness of interpreting unassigned spectra as polymorphisms is highly reliant on the ability to detect and filter false positives.  相似文献   

12.
Quantitative proteomics relies on accurate protein identification, which often is carried out by automated searching of a sequence database with tandem mass spectra of peptides. When these spectra contain limited information, automated searches may lead to incorrect peptide identifications. It is therefore necessary to validate the identifications by careful manual inspection of the mass spectra. Not only is this task time-consuming, but the reliability of the validation varies with the experience of the analyst. Here, we report a systematic approach to evaluating peptide identifications made by automated search algorithms. The method is based on the principle that the candidate peptide sequence should adequately explain the observed fragment ions. Also, the mass errors of neighboring fragments should be similar. To evaluate our method, we studied tandem mass spectra obtained from tryptic digests of E. coli and HeLa cells. Candidate peptides were identified with the automated search engine Mascot and subjected to the manual validation method. The method found correct peptide identifications that were given low Mascot scores (e.g., 20-25) and incorrect peptide identifications that were given high Mascot scores (e.g., 40-50). The method comprehensively detected false results from searches designed to produce incorrect identifications. Comparison of the tandem mass spectra of synthetic candidate peptides to the spectra obtained from the complex peptide mixtures confirmed the accuracy of the evaluation method. Thus, the evaluation approach described here could help boost the accuracy of protein identification, increase number of peptides identified, and provide a step toward developing a more accurate next-generation algorithm for protein identification.  相似文献   

13.
LC–MS/MS has become the standard platform for the characterization of immunopeptidomes, the collection of peptides naturally presented by major histocompatibility complex molecules to the cell surface. The protocols and algorithms used for immunopeptidomics data analysis are based on tools developed for traditional bottom‐up proteomics that address the identification of peptides generated by tryptic digestion. Such algorithms are generally not tailored to the specific requirements of MHC ligand identification and, as a consequence, immunopeptidomics datasets suffer from dismissal of informative spectral information and high false discovery rates. Here, a new pipeline for the refinement of peptide‐spectrum matches (PSM) is proposed, based on the assumption that immunopeptidomes contain a limited number of recurring peptide motifs, corresponding to MHC specificities. Sequence motifs are learned directly from the individual peptidome by training a prediction model on high‐confidence PSMs. The model is then applied to PSM candidates with lower confidence, and sequences that score significantly higher than random peptides are rescued as likely true ligands. The pipeline is applied to MHC class I immunopeptidomes from three different species, and it is shown that it can increase the number of identified ligands by up to 20–30%, while effectively removing false positives and products of co‐precipitation. Spectral validation using synthetic peptides confirms the identity of a large proportion of rescued ligands in the experimental peptidome.  相似文献   

14.
Reliable statistical validation of peptide and protein identifications is a top priority in large-scale mass spectrometry based proteomics. PeptideProphet is one of the computational tools commonly used for assessing the statistical confidence in peptide assignments to tandem mass spectra obtained using database search programs such as SEQUEST, MASCOT, or X! TANDEM. We present two flexible methods, the variable component mixture model and the semiparametric mixture model, that remove the restrictive parametric assumptions in the mixture modeling approach of PeptideProphet. Using a control protein mixture data set generated on an linear ion trap Fourier transform (LTQ-FT) mass spectrometer, we demonstrate that both methods improve parametric models in terms of the accuracy of probability estimates and the power to detect correct identifications controlling the false discovery rate to the same degree. The statistical approaches presented here require that the data set contain a sufficient number of decoy (known to be incorrect) peptide identifications, which can be obtained using the target-decoy database search strategy.  相似文献   

15.
Proteomics research routinely involves identifying peptides and proteins via MS/MS sequence database search. Thus the database search engine is an integral tool in many proteomics research groups. Here, we introduce the Comet search engine to the existing landscape of commercial and open‐source database search tools. Comet is open source, freely available, and based on one of the original sequence database search tools that has been widely used for many years.  相似文献   

16.

Background  

Rejection of false positive peptide matches in database searches of shotgun proteomic experimental data is highly desirable. Several methods have been developed to use the peptide retention time as to refine and improve peptide identifications from database search algorithms. This report describes the implementation of an automated approach to reduce false positives and validate peptide matches.  相似文献   

17.
MassMatrix is a program that matches tandem mass spectra with theoretical peptide sequences derived from a protein database. The program uses a mass accuracy sensitive probabilistic score model to rank peptide matches. The MS/MS search software was evaluated by use of a high mass accuracy dataset and its results compared with those from MASCOT, SEQUEST, X!Tandem, and OMSSA. For the high mass accuracy data, MassMatrix provided better sensitivity than MASCOT, SEQUEST, X!Tandem, and OMSSA for a given specificity and the percentage of false positives was 2%. More importantly all manually validated true positives corresponded to a unique peptide/spectrum match. The presence of decoy sequence and additional variable PTMs did not significantly affect the results from the high mass accuracy search. MassMatrix performs well when compared with MASCOT, SEQUEST, X!Tandem, and OMSSA with regard to search time. MassMatrix was also run on a distributed memory clusters and achieved search speeds of ~100 000 spectra per hour when searching against a complete human database with eight variable modifications. The algorithm is available for public searches at http://www.massmatrix.net.  相似文献   

18.
Current efforts aimed at developing high-throughput proteomics focus on increasing the speed of protein identification. Although improvements in sample separation, enrichment, automated handling, mass spectrometric analysis, as well as data reduction and database interrogation strategies have done much to increase the quality, quantity and efficiency of data collection, significant bottlenecks still exist. Various separation techniques have been coupled with tandem mass spectrometric (MS/MS) approaches to allow a quicker analysis of complex mixtures of proteins, especially where a high number of unambiguous protein identifications are the exception, rather than the rule. MS/MS is required to provide structural / amino acid sequence information on a peptide and thus allow protein identity to be inferred from individual peptides. Currently these spectra need to be manually validated because: (a) the potential of false positive matches i.e., protein not in database, and (b) observed fragmentation trends may not be incorporated into current MS/MS search algorithms. This validation represents a significant bottleneck associated with high-throughput proteomic strategies. We have developed CHOMPER, a software program which reduces the time required to both visualize and confirm MS/MS search results and generate post-analysis reports and protein summary tables. CHOMPER extracts the identification information from SEQUEST MS/MS search result files, reproduces both the peptide and protein identification summaries, provides a more interactive visualization of the MS/MS spectra and facilitates the direct submission of manually validated identifications to a database.  相似文献   

19.
Eriksson J  Fenyö D 《Proteomics》2002,2(3):262-270
A rapid and accurate method for testing the significance of protein identities determined by mass spectrometric analysis of protein digests and genome database searching is presented. The method is based on direct computation using a statistical model of the random matching of measured and theoretical proteolytic peptide masses. Protein identification algorithms typically rank the proteins of a genome database according to a score based on the number of matches between the masses obtained by mass spectrometry analysis and the theoretical proteolytic peptide masses of a database protein. The random matching of experimental and theoretical masses can cause false results. A result is significant only if the score characterizing the result deviates significantly from the score expected from a false result. A distribution of the score (number of matches) for random (false) results is computed directly from our model of the random matching, which allows significance testing under any experimental and database search constraints. In order to mimic protein identification data quality in large-scale proteome projects, low-to-high quality proteolytic peptide mass data were generated in silico and subsequently submitted to a database search program designed to include significance testing based on direct computation. This simulation procedure demonstrates the usefulness of direct significance testing for automatically screening for samples that must be subjected to peptide sequence analysis by e.g. tandem mass spectrometry in order to determine the protein identity.  相似文献   

20.
We present MassSieve, a Java‐based platform for visualization and parsimony analysis of single and comparative LC‐MS/MS database search engine results. The success of mass spectrometric peptide sequence assignment algorithms has led to the need for a tool to merge and evaluate the increasing data set sizes that result from LC‐MS/MS‐based shotgun proteomic experiments. MassSieve supports reports from multiple search engines with differing search characteristics, which can increase peptide sequence coverage and/or identify conflicting or ambiguous spectral assignments.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号