首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 660 毫秒
1.
Peptide mass fingerprinting, regardless of becoming complementary to tandem mass spectrometry for protein identification, is still the subject of in-depth study because of its higher sample throughput, higher level of specificity for single peptides and lower level of sensitivity to unexpected post-translational modifications compared with tandem mass spectrometry. In this study, we propose, implement and evaluate a uniform approach using support vector machines to incorporate individual concepts and conclusions for accurate PMF. We focus on the inherent attributes and critical issues of the theoretical spectrum (peptides), the experimental spectrum (peaks) and spectrum (masses) alignment. Eighty-one feature-matching patterns derived from cleavage type, uniqueness and variable masses of theoretical peptides together with the intensity rank of experimental peaks were proposed to characterize the matching profile of the peptide mass fingerprinting procedure. We developed a new strategy including the participation of matched peak intensity redistribution to handle shared peak intensities and 440 parameters were generated to digitalize each feature-matching pattern. A high performance for an evaluation data set of 137 items was finally achieved by the optimal multi-criteria support vector machines approach, with 491 final features out of a feature vector of 35,640 normalized features through cross training and validating a publicly available "gold standard" peptide mass fingerprinting data set of 1733 items. Compared with the Mascot, MS-Fit, ProFound and Aldente algorithms commonly used for MS-based protein identification, the feature-matching patterns algorithm has a greater ability to clearly separate correct identifications and random matches with the highest values for sensitivity (82%), precision (97%) and F1-measure (89%) of protein identification. Several conclusions reached via this research make general contributions to MS-based protein identification. Firstly, inherent attributes showed comparable or even greater robustness than other explicit. As an inherent attribute of an experimental spectrum, peak intensity should receive considerable attention during protein identification. Secondly, alignment between intense experimental peaks and properly digested, unique or non-modified theoretical peptides is very likely to occur in positive peptide mass fingerprinting. Finally, normalization by several types of harmonic factors, including missed cleavages and mass modification, can make important contributions to the performance of the procedure.  相似文献   

2.
用于串联质谱鉴定多肽的计量方法   总被引:1,自引:0,他引:1  
目前已有多种对串联质谱与数据库中多肽的理论质谱的一致性进行评估的高通量计量算法用于鸟枪法蛋白质组学 (shotgunproteomics)研究。然而这些方法操作时存在大量错误的多肽鉴定。这里提出一种新的串联质谱识别多肽序列的计量算法。该算法综合考虑了串联质谱中不同离子出现的概率、多肽的酶切位点数、理论离子与实验离子的匹配程度和匹配模式。对大容量的串联质谱数据集的测试表明 ,根据算法开发的软件PepSearch比目前最常用的软件SEQUEST有更好的鉴定准确性。PepSearch可从http : compbio.sibsnet.org projects pepsearch下载。  相似文献   

3.
Wenguang Shao  Kan Zhu  Henry Lam 《Proteomics》2013,13(22):3273-3283
Spectral library searching is a maturing approach for peptide identification from MS/MS, offering an alternative to traditional sequence database searching. Spectral library searching relies on direct spectrum‐to‐spectrum matching between the query data and the spectral library, which affords better discrimination of true and false matches, leading to improved sensitivity. However, due to the inherent diversity of the peak location and intensity profiles of real spectra, the resulting similarity score distributions often take on unpredictable shapes. This makes it difficult to model the scores of the false matches accurately, necessitating the use of decoy searching to sample the score distribution of the false matches. Here, we refined the similarity scoring in spectral library searching to enable the validation of spectral search results without the use of decoys. We rank‐transformed the peak intensities to standardize all spectra, making it possible to fit a parametric distribution to the scores of the nontop‐scoring spectral matches. The statistical significance of the top‐scoring match can then be estimated in a rigorous manner according to Extreme Value Theory. The overall result is a more robust and interpretable measure of the quality of the spectral match, which can be obtained without decoys. We tested this refined similarity scoring function on real datasets and demonstrated its effectiveness. This approach reduces search time, increases sensitivity, and extends spectral library searching to situations where decoy spectra cannot be readily generated, such as in searching unidentified and nonpeptide spectral libraries.  相似文献   

4.
The identification of proteins from spectra derived from a tandem mass spectrometry experiment involves several challenges: matching each observed spectrum to a peptide sequence, ranking the resulting collection of peptide-spectrum matches, assigning statistical confidence estimates to the matches, and identifying the proteins. The present work addresses algorithms to rank peptide-spectrum matches. Many of these algorithms, such as PeptideProphet, IDPicker, or Q-ranker, follow a similar methodology that includes representing peptide-spectrum matches as feature vectors and using optimization techniques to rank them. We propose a richer and more flexible feature set representation that is based on the parametrization of the SEQUEST XCorr score and that can be used by all of these algorithms. This extended feature set allows a more effective ranking of the peptide-spectrum matches based on the target-decoy strategy, in comparison to a baseline feature set devoid of these XCorr-based features. Ranking using the extended feature set gives 10-40% improvement in the number of distinct peptide identifications relative to a range of q-value thresholds. While this work is inspired by the model of the theoretical spectrum and the similarity measure between spectra used specifically by SEQUEST, the method itself can be applied to the output of any database search. Further, our approach can be trivially extended beyond XCorr to any linear operator that can serve as similarity score between experimental spectra and peptide sequences.  相似文献   

5.
Peptide identification by tandem mass spectrometry is the dominant proteomics workflow for protein characterization in complex samples. The peptide fragmentation spectra generated by these workflows exhibit characteristic fragmentation patterns that can be used to identify the peptide. In other fields, where the compounds of interest do not have the convenient linear structure of peptides, fragmentation spectra are identified by comparing new spectra with libraries of identified spectra, an approach called spectral matching. In contrast to sequence-based tandem mass spectrometry search engines used for peptides, spectral matching can make use of the intensities of fragment peaks in library spectra to assess the quality of a match. We evaluate a hidden Markov model approach (HMMatch) to spectral matching, in which many examples of a peptide's fragmentation spectrum are summarized in a generative probabilistic model that captures the consensus and variation of each peak's intensity. We demonstrate that HMMatch has good specificity and superior sensitivity, compared to sequence database search engines such as X!Tandem. HMMatch achieves good results from relatively few training spectra, is fast to train, and can evaluate many spectra per second. A statistical significance model permits HMMatch scores to be compared with each other, and with other peptide identification tools, on a unified scale. HMMatch shows a similar degree of concordance with X!Tandem, Mascot, and NIST's MS Search, as they do with each other, suggesting that each tool can assign peptides to spectra that the others miss. Finally, we show that it is possible to extrapolate HMMatch models beyond a single peptide's training spectra to the spectra of related peptides, expanding the application of spectral matching techniques beyond the set of peptides previously observed.  相似文献   

6.
High‐resolution MS/MS spectra of peptides can be deisotoped to identify monoisotopic masses of peptide fragments. The use of such masses should improve protein identification rates. However, deisotoping is not universally used and its benefits have not been fully explored. Here, MS2‐Deisotoper, a tool for use prior to database search, is used to identify monoisotopic peaks in centroided MS/MS spectra. MS2‐Deisotoper works by comparing the mass and relative intensity of each peptide fragment peak to every other peak of greater mass, and by applying a set of rules concerning mass and intensity differences. After comprehensive parameter optimization, it is shown that MS2‐Deisotoper can improve the number of peptide spectrum matches (PSMs) identified by up to 8.2% and proteins by up to 2.8%. It is effective with SILAC and non‐SILAC MS/MS data. The identification of unique peptide sequences is also improved, increasing the number of human proteoforms by 3.7%. Detailed investigation of results shows that deisotoping increases Mascot ion scores, improves FDR estimation for PSMs, and leads to greater protein sequence coverage. At a peptide level, it is found that the efficacy of deisotoping is affected by peptide mass and charge. MS2‐Deisotoper can be used via a user interface or as a command‐line tool.  相似文献   

7.
Eriksson J  Fenyö D 《Proteomics》2002,2(3):262-270
A rapid and accurate method for testing the significance of protein identities determined by mass spectrometric analysis of protein digests and genome database searching is presented. The method is based on direct computation using a statistical model of the random matching of measured and theoretical proteolytic peptide masses. Protein identification algorithms typically rank the proteins of a genome database according to a score based on the number of matches between the masses obtained by mass spectrometry analysis and the theoretical proteolytic peptide masses of a database protein. The random matching of experimental and theoretical masses can cause false results. A result is significant only if the score characterizing the result deviates significantly from the score expected from a false result. A distribution of the score (number of matches) for random (false) results is computed directly from our model of the random matching, which allows significance testing under any experimental and database search constraints. In order to mimic protein identification data quality in large-scale proteome projects, low-to-high quality proteolytic peptide mass data were generated in silico and subsequently submitted to a database search program designed to include significance testing based on direct computation. This simulation procedure demonstrates the usefulness of direct significance testing for automatically screening for samples that must be subjected to peptide sequence analysis by e.g. tandem mass spectrometry in order to determine the protein identity.  相似文献   

8.
De novo peptide sequencing via tandem mass spectrometry.   总被引:10,自引:0,他引:10  
Peptide sequencing via tandem mass spectrometry (MS/MS) is one of the most powerful tools in proteomics for identifying proteins. Because complete genome sequences are accumulating rapidly, the recent trend in interpretation of MS/MS spectra has been database search. However, de novo MS/MS spectral interpretation remains an open problem typically involving manual interpretation by expert mass spectrometrists. We have developed a new algorithm, SHERENGA, for de novo interpretation that automatically learns fragment ion types and intensity thresholds from a collection of test spectra generated from any type of mass spectrometer. The test data are used to construct optimal path scoring in the graph representations of MS/MS spectra. A ranked list of high scoring paths corresponds to potential peptide sequences. SHERENGA is most useful for interpreting sequences of peptides resulting from unknown proteins and for validating the results of database search algorithms in fully automated, high-throughput peptide sequencing.  相似文献   

9.
Tandem mass spectrometry (MS/MS) has emerged as a cornerstone of proteomics owing in part to robust spectral interpretation algorithms. Widely used algorithms do not fully exploit the intensity patterns present in mass spectra. Here, we demonstrate that intensity pattern modeling improves peptide and protein identification from MS/MS spectra. We modeled fragment ion intensities using a machine-learning approach that estimates the likelihood of observed intensities given peptide and fragment attributes. From 1,000,000 spectra, we chose 27,000 with high-quality, nonredundant matches as training data. Using the same 27,000 spectra, intensity was similarly modeled with mismatched peptides. We used these two probabilistic models to compute the relative likelihood of an observed spectrum given that a candidate peptide is matched or mismatched. We used a 'decoy' proteome approach to estimate incorrect match frequency, and demonstrated that an intensity-based method reduces peptide identification error by 50-96% without any loss in sensitivity.  相似文献   

10.
MS/MS and database searching has emerged as a valuable technology for rapidly analyzing protein expression, localization, and post-translational modifications. The probability-based search engine Mascot has found widespread use as a tool to correlate tandem mass spectra with peptides in a sequence database. Although the Mascot scoring algorithm provides a probability-based model for peptide identification, the independent peptide scores do not correlate with the significance of the proteins to which they match. Herein, we describe a heuristic method for organizing proteins identified at a specified false-discovery rate using Mascot-matched peptides. We call this method PROVALT, and it uses peptide matches from a random database to calculate false-discovery rates for protein identifications and reduces a complex list of peptide matches to a nonredundant list of homologous protein groups. This method was evaluated using Mascot-identified peptides from a Trypanosoma cruzi epimastigote whole-cell lysate, which was separated by multidimensional LC and analyzed by MS/MS. PROVALT was then compared with the two traditional methods of protein identification when using Mascot, the single peptide score and cumulative protein score methods, and was shown to be superior to both in regards to the number of proteins identified and the inclusion of lower scoring nonrandom peptide matches.  相似文献   

11.
De novo interpretation of tandem mass spectrometry (MS/MS) spectra provides sequences for searching protein databases when limited sequence information is present in the database. Our objective was to define a strategy for this type of homology-tolerant database search. Homology searches, using MS-Homology software, were conducted with 20, 10, or 5 of the most abundant peptides from 9 proteins, based either on precursor trigger intensity or on total ion current, and allowing for 50%, 30%, or 10% mismatch in the search. Protein scores were corrected by subtracting a threshold score that was calculated from random peptides. The highest (p < .01) corrected protein scores (i.e., above the threshold) were obtained by submitting 20 peptides and allowing 30% mismatch. Using these criteria, protein identification based on ion mass searching using MS/MS data (i.e., Mascot) was compared with that obtained using homology search. The highest-ranking protein was the same using Mascot, homology search using the 20 most intense peptides, or homology search using all peptides, for 63.4% of 112 spots from two-dimensional polyacrylamide gel electrophoresis gels. For these proteins, the percent coverage was greatest using Mascot compared with the use of all or just the 20 most intense peptides in a homology search (25.1%, 18.3%, and 10.6%, respectively). Finally, 35% of de novo sequences completely matched the corresponding known amino acid sequence of the matching peptide. This percentage increased when the search was limited to the 20 most intense peptides (44.0%). After identifying the protein using MS-Homology, a peptide mass search may increase the percent coverage of the protein identified.  相似文献   

12.
In shotgun proteomics, tandem mass spectra of peptides are typically identified through database search algorithms such as Sequest. We have developed DirecTag, an open-source algorithm to infer partial sequence tags directly from observed fragment ions. This algorithm is unique in its implementation of three separate scoring systems to evaluate each tag on the basis of peak intensity, m/ z fidelity, and complementarity. In data sets from several types of mass spectrometers, DirecTag reproducibly exceeded the accuracy and speed of InsPecT and GutenTag, two previously published algorithms for this purpose. The source code and binaries for DirecTag are available from http://fenchurch.mc.vanderbilt.edu.  相似文献   

13.
Zhang J  McQuillan I  Wu FX 《Proteomics》2011,11(19):3779-3785
Peptide-spectrum matching is one of the most time-consuming portion of the database search method for assignment of tandem mass spectra to peptides. In this study, we develop a parallel algorithm for peptide-spectrum matching using Single-Instruction Multiple Data (SIMD) instructions. Unlike other parallel algorithms in peptide-spectrum matching, our algorithm parallelizes the computation of matches between a single spectrum and a given peptide sequence from the database. It also significantly reduces the number of comparison operations. Extra improvements are obtained by using SIMD instructions to avoid conditional branches and unnecessary memory access within the algorithm. The implementation of the developed algorithm is based on the Streaming SIMD Extensions technology that is embedded in most Intel microprocessors. Similar technology also exists in other modern microprocessors. A simulation shows that the developed algorithm achieves an 18-fold speedup over the previous version of Real-Time Peptide-Spectrum Matching algorithm [F. X. Wu et al., Rapid Commun. Mass Sepctrom. 2006, 20, 1199-1208]. Therefore, the developed algorithm can be employed to develop real-time control methods for MS/MS.  相似文献   

14.
Independent of the approach used, the ability to correctly interpret tandem MS data depends on the quality of the original spectra. Even in the case of the highest quality spectra, the majority of spectral peaks can not be reliably interpreted. The accuracy of sequencing algorithms can be improved by filtering out such 'noise' peaks. Preprocessing MS/MS spectra to select informative ion peaks increases accuracy and reduces the processing time. Intuitively, the mix of informative versus non-informative peaks has a direct effect on the quality and size of the resulting candidate peptide search space. As the number of selected peaks increases, the corresponding search space increases exponentially. If we select too few peaks then the ion-ladder interpretation of the spectrum will contain gaps that can only be explained by permutations of combinations of amino acids. This will result in a larger candidate peptide search space and poorer quality candidates. The dependency that peptide sequencing accuracy has on an initial peak selection regime makes this preprocessing step a crucial facet of any approach, whether de novo or not, to MS/MS spectra interpretation.We have developed a novel approach to address this problem. Our approach uses a staged neural network to model ion fragmentation patterns and estimate the posterior probability of each ion type. Our method improves upon other preprocessing techniques and shows a significant reduction in the search space for candidate peptides without sacrificing candidate peptide quality.  相似文献   

15.
MOTIVATION: Tandem mass spectrometry combined with sequence database searching is one of the most powerful tools for protein identification. As thousands of spectra are generated by a mass spectrometer in one hour, the speed of database searching is critical, especially when searching against a large sequence database, or when the peptide is generated by some unknown or non-specific enzyme, even or when the target peptides have post-translational modifications (PTM). In practice, about 70-90% of the spectra have no match in the database. Many believe that a significant portion of them are due to peptides of non-specific digestions by unknown enzymes or amino acid modifications. In another case, scientists may choose to use some non-specific enzymes such as pepsin or thermolysin for proteolysis in proteomic study, in that not all proteins are amenable to be digested by some site-specific enzymes, and furthermore many digested peptides may not fall within the rang of molecular weight suitable for mass spectrometry analysis. Interpreting mass spectra of these kinds will cost a lot of computational time of database search engines. OVERVIEW: The present study was designed to speed up the database searching process for both cases. More specifically speaking, we employed an approach combining suffix tree data structure and spectrum graph. The suffix tree is used to preprocess the protein sequence database, while the spectrum graph is used to preprocess the tandem mass spectrum. We then search the suffix tree against the spectrum graph for candidate peptides. We design an efficient algorithm to compute a matching threshold with some statistical significance level, e.g. p = 0.01, for each spectrum, and use it to select candidate peptides. Then we rank these peptides using a SEQUEST-like scoring function. The algorithms were implemented and tested on experimental data. For post-translational modifications, we allow arbitrary number of any modification to a protein. AVAILABILITY: The executable program and other supplementary materials are available online at: http://hto-c.usc.edu:8000/msms/suffix/.  相似文献   

16.
Creasy DM  Cottrell JS 《Proteomics》2002,2(10):1426-1434
An error tolerant mode for database matching of uninterpreted tandem mass spectrometry data is described. Selected database entries are searched without enzyme specificity, using a comprehensive list of chemical and post-translational modifications, together with a residue substitution matrix. The modifications are tested serially, to avoid the catastrophic loss of discrimination that would occur if all the permutations of large numbers of modifications in combination were possible. The new mode has been coded as an extension to the Mascot search engine, and tested against a number of Liquid chromatography-tandem mass spectrometry datasets. The results show a number of additional peptide matches, but require careful interpretation. The most significant limitation of this approach is that it can only reveal new matches to proteins that already have at least one significant peptide match.  相似文献   

17.
In high-throughput proteomics the development of computational methods and novel experimental strategies often rely on each other. In certain areas, mass spectrometry methods for data acquisition are ahead of computational methods to interpret the resulting tandem mass spectra. Particularly, although there are numerous situations in which a mixture tandem mass spectrum can contain fragment ions from two or more peptides, nearly all database search tools still make the assumption that each tandem mass spectrum comes from one peptide. Common examples include mixture spectra from co-eluting peptides in complex samples, spectra generated from data-independent acquisition methods, and spectra from peptides with complex post-translational modifications. We propose a new database search tool (MixDB) that is able to identify mixture tandem mass spectra from more than one peptide. We show that peptides can be reliably identified with up to 95% accuracy from mixture spectra while considering only a 0.01% of all possible peptide pairs (four orders of magnitude speedup). Comparison with current database search methods indicates that our approach has better or comparable sensitivity and precision at identifying single-peptide spectra while simultaneously being able to identify 38% more peptides from mixture spectra at significantly higher precision.  相似文献   

18.
Two new statistical models based on Monte Carlo Simulation (MCS) have been developed to score peptide matches in shotgun proteomic data and incorporated in a database search program, MassMatrix (www.massmatrix.net). The first model evaluates peptide matches based on the total abundance of matched peaks in the experimental spectra. The second model evaluates amino acid residue tags within MS/MS spectra. The two models provide complementary scores for peptide matches that result in higher confidence in peptide identification when significant scores are returned from both models. The MCS-based models use a variance reduction technique that improves estimation precision. Due to the high computational expense of MCS-based models, peptide matches were prefiltered by other statistical models before further evaluation by the MCS-based models. Receiver operating characteristic analysis of the data sets confirmed that MCS-based models improved the overall performance of the MassMatrix search software, especially for low-mass accuracy data sets.  相似文献   

19.
MOTIVATION: Due to the recent advances in technology of mass spectrometry, there has been an exponential increase in the amount of data being generated in the past few years. Database searches have not been able to keep with this data explosion. Thus, speeding up the data searches becomes increasingly important in mass-spectrometry-based applications. Traditional database search methods use one-against-all comparisons of a query spectrum against a very large number of peptides generated from in silico digestion of protein sequences in a database, to filter potential candidates from this database followed by a detailed scoring and ranking of those filtered candidates. RESULTS: In this article, we show that we can avoid the one-against-all comparisons. The basic idea is to design a set of hash functions to pre-process peptides in the database such that for each query spectrum we can use the hash functions to find only a small subset of peptide sequences that are most likely to match the spectrum. The construction of each hash function is based on a random spectrum and the hash value of a peptide is the normalized shared peak counts score (cosine) between the random spectrum and the hypothetical spectrum of the peptide. To implement this idea, we first embed each peptide into a unit vector in a high-dimensional metric space. The random spectrum is represented by a random vector, and we use random vectors to construct a set of hash functions called locality sensitive hashing (LSH) for preprocessing. We demonstrate that our mapping is accurate. We show that our method can filter out >95.65% of the spectra without missing any correct sequences, or gain 111 times speedup by filtering out 99.64% of spectra while missing at most 0.19% (2 out of 1014) of the correct sequences. In addition, we show that our method can be effectively used for other mass spectra mining applications such as finding clusters of spectra efficiently and accurately. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.  相似文献   

20.
Proteomic techniques are fast becoming the main method for qualitative and quantitative determination of the protein content in biological systems. Despite notable advances, efficient and accurate analysis of high throughput proteomic data generated by mass spectrometers remains one of the major stumbling blocks in the protein identification problem. We present a model for the number of random matches between an experimental MS-MS spectrum and a theoretical spectrum of a peptide. The shape of the probability distribution is a function of the experimental accuracy, the number of peaks in the experimental spectrum, the length of the interval over which the peaks are distributed, and the number of theoretical spectral peaks in this interval. Based on this probability distribution, a goodness-of-fit tool can be used to yield fast and accurate scoring schemes for peptide identification through database search. In this paper, we describe one possible implementation of such a method and compare the performance of the resulting scoring function with that of SEQUEST. In terms of speed, our algorithm is roughly two orders of magnitude faster than the SEQUEST program, and its accuracy of peptide identification compares favorably to that of SEQUEST. Moreover, our algorithm does not use information related to the intensities of the peaks.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号