Similar articles
1.

Background  

Tandem mass spectrometry (MS/MS) is a powerful tool for protein identification. Although great efforts have been made in scoring the correlation between tandem mass spectra and an amino acid sequence database, improvements could be made in three aspects: characterization of peaks in spectra, adoption of effective scoring functions, and assessment of the reliability of matches between peptides and spectra.

2.
Creasy DM  Cottrell JS 《Proteomics》2002,2(10):1426-1434
An error-tolerant mode for database matching of uninterpreted tandem mass spectrometry data is described. Selected database entries are searched without enzyme specificity, using a comprehensive list of chemical and post-translational modifications, together with a residue substitution matrix. The modifications are tested serially, to avoid the catastrophic loss of discrimination that would occur if all the permutations of large numbers of modifications in combination were possible. The new mode has been coded as an extension to the Mascot search engine, and tested against a number of liquid chromatography-tandem mass spectrometry datasets. The results show a number of additional peptide matches, but require careful interpretation. The most significant limitation of this approach is that it can only reveal new matches to proteins that already have at least one significant peptide match.

3.
Tandem mass spectrometry-based proteomics is currently in great need of computational methods that facilitate the elimination of likely false positives in peptide and protein identification. In the last few years, a number of new peptide identification programs have been described, but scores or other significance measures reported by these programs cannot always be directly translated into an easy-to-interpret error rate measurement such as the false discovery rate. In this work we used generalized lambda distributions to model frequency distributions of database search scores computed by MASCOT, X!TANDEM with k-score plug-in, OMSSA, and InsPecT. From these distributions, we could successfully estimate p values and false discovery rates with high accuracy. From the set of peptide assignments reported by any of these engines, we also defined a generic protein scoring scheme that enabled accurate estimation of protein-level p values by simulation of random score distributions and that was also found to yield good estimates of the protein-level false discovery rate. The performance of these methods was evaluated by searching four freely available data sets ranging from 40,000 to 285,000 MS/MS spectra.
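
The step from p values to a false discovery rate can be illustrated with a small sketch. It does not reproduce the generalized lambda distribution fit described in the abstract; it assumes that per-match p values are already available and applies the standard Benjamini-Hochberg adjustment as a stand-in. The function name and the example p values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p values (q values).

    Illustrative stand-in only: the abstract derives p values from
    fitted score distributions, which is not reproduced here.
    """
    m = len(p_values)
    # Indices of the p values in ascending order of p.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q_values[i] = running_min
    return q_values


if __name__ == "__main__":
    # Hypothetical p values for four peptide-spectrum matches.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

Matches whose q value falls below a chosen threshold (for example 0.05) would then be reported at that estimated false discovery rate.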

4.
Mass spectrometry has made rapid advances in the recent past and has become the preferred method for proteomics. Although many open-source algorithms for peptide identification exist, such as X!Tandem and OMSSA, the field has largely been the domain of proprietary software. There is a need for better, freely available, and configurable algorithms that can help in identifying the correct peptides while keeping false positives to a minimum. We have developed MassWiz, a novel empirical scoring function that gives appropriate weights to major ions, continuity of b-y ions, intensities, and the supporting neutral losses based on the instrument type. We tested MassWiz accuracy on 486,882 spectra from a standard mixture of 18 proteins generated on 6 different instruments downloaded from the Seattle Proteome Center public repository. We compared the MassWiz algorithm with Mascot, Sequest, OMSSA, and X!Tandem at 1% FDR. MassWiz outperformed all in the largest data set (AGILENT XCT) and was second only to Mascot in the other data sets. MassWiz showed good performance in the analysis of high-confidence peptides, i.e., those identified by at least three algorithms. We also analyzed a yeast data set containing 106,133 spectra downloaded from the NCBI Peptidome repository and obtained similar results. The results demonstrate that MassWiz is an effective algorithm for high-confidence peptide identification without compromising on the number of assignments. MassWiz is open-source, versatile, and easily configurable.

5.
With the recent rapid expansion of DNA and protein sequence databases, intensive efforts are underway to interpret the linear genetic information of DNA in terms of function, structure, and control of biological processes. The systematic identification and quantification of expressed proteins has proven particularly powerful in this regard. Large-scale protein identification is usually achieved by automated liquid chromatography-tandem mass spectrometry of complex peptide mixtures and sequence database searching of the resulting spectra [Aebersold and Goodlett, Chem. Rev. 2001, 101, 269-295]. As generating large numbers of sequence-specific mass spectra (collision-induced dissociation, or CID, spectra) has become a routine operation, research has shifted from the generation of sequence database search results to their validation. Here we describe in detail a novel probabilistic model and score function that ranks the quality of the match between tandem mass spectral data and a peptide sequence in a database. We document the performance of the algorithm on a reference data set and in comparison with another sequence database search tool. The software is publicly available for use and evaluation at http://www.systemsbiology.org/research/software/proteomics/ProbID.

6.
When performing bioinformatics analysis on tandem mass spectrometry data, there is a computational need to efficiently store and sort these semi-ordered datasets. To solve this problem, a new data structure based on dynamic arrays was designed and implemented in an algorithm that parses semi-ordered data produced by Mascot, a separate software program that matches peptide tandem mass spectra to protein sequences in a database. By accommodating the special features of these large datasets, the combined dynamic array (CDA) provides efficient searching and insertion operations. Operations on real datasets using this new data structure are hundreds of times faster than operations using binary tree and red-black tree structures. The difference becomes more significant as the dataset size grows. This data structure may be useful for improving the speed of other related types of protein assembly software or other types of software that operate on datasets with similar semi-ordered features.

7.
We evaluate statistical models used in two-hypothesis tests for identifying peptides from tandem mass spectrometry data. The null hypothesis H(0), that a peptide matches a spectrum by chance, requires information on the probability of by-chance matches between peptide fragments and peaks in the spectrum. Likewise, the alternate hypothesis H(A), that the spectrum is due to a particular peptide, requires probabilities that the peptide fragments would indeed be observed if it were the causative agent. We compare models for these probabilities by determining the identification rates produced by the models using an independent data set. The initial models use different probabilities depending on fragment ion type, but uniform probabilities for each ion type across all of the labile bonds along the backbone. More sophisticated models for probabilities under both H(A) and H(0) are introduced that do not assume uniform probabilities for each ion type. In addition, the performance of these models using a standard likelihood model is compared to an information theory approach derived from the likelihood model. Also, a simple but effective model for incorporating peak intensities is described. Finally, a support-vector machine is used to discriminate between correct and incorrect identifications based on multiple characteristics of the scoring functions. The results are shown to reduce the misidentification rate significantly when compared to a benchmark cross-correlation based approach.
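
A two-hypothesis score of this kind can be sketched as a log-likelihood ratio. The sketch below corresponds roughly to the simplest case in the abstract (one probability per ion type, uniform along the backbone). It is an assumption-laden illustration, not the authors' model: the function name, the independence of fragments, and all probability values are hypothetical.

```python
import math


def log_likelihood_ratio(fragments, p_observed, p_chance):
    """Sum log(P(evidence | H_A) / P(evidence | H_0)) over predicted fragments.

    fragments  : list of (ion_type, matched) pairs, matched is a bool
    p_observed : ion_type -> probability the fragment is seen under H_A
    p_chance   : ion_type -> probability of a by-chance match under H_0
    Fragments are treated as independent, which is a simplifying assumption.
    """
    score = 0.0
    for ion_type, matched in fragments:
        p_a = p_observed[ion_type]
        p_0 = p_chance[ion_type]
        if matched:
            score += math.log(p_a / p_0)
        else:
            score += math.log((1.0 - p_a) / (1.0 - p_0))
    return score


if __name__ == "__main__":
    # Hypothetical probabilities, chosen only for illustration.
    p_obs = {"b": 0.6, "y": 0.8}
    p_rand = {"b": 0.1, "y": 0.1}
    print(log_likelihood_ratio([("b", True), ("y", True)], p_obs, p_rand))
    print(log_likelihood_ratio([("b", False), ("y", False)], p_obs, p_rand))
```

A positive score favours H(A) and a negative score favours H(0). The more sophisticated models mentioned in the abstract would replace the per-ion-type constants with position-dependent probabilities.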

8.
Database search is a standard technique for identifying peptides from their tandem mass spectra. To increase the number of correctly identified peptides, we suggest a probabilistic framework that allows the combination of scores from different search engines into a joint consensus score. Central to the approach is a novel method to estimate scores for peptides not found by an individual search engine. This approach allows the estimation of p-values for each candidate peptide and their combination across all search engines. The consensus approach works better than any single search engine across all instrument types considered in this study. Improvements vary strongly from platform to platform and from search engine to search engine. Compared to the industry standard MASCOT, our approach can identify up to 60% more peptides. The software for consensus predictions is implemented in C++ as part of OpenMS, a software framework for mass spectrometry. The source code is available in the current development version of OpenMS and can easily be used as a command-line application or via the graphical pipeline designer TOPPAS.
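
To make the idea of combining per-engine p-values concrete, here is a minimal sketch using Fisher's method. This is a swapped-in textbook technique, not the consensus score of the abstract or the OpenMS implementation, and it assumes the p-values are independent, which is unlikely to hold exactly when several engines score the same spectrum. The function name is hypothetical.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
        exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!
    All p-values must be in (0, 1].
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2
    term = 1.0
    total = 1.0
    for j in range(1, k):
        term *= half_stat / j
        total += term
    return math.exp(-half_stat) * total


if __name__ == "__main__":
    # Hypothetical p-values for one peptide from three search engines.
    print(fisher_combined_p([0.04, 0.10, 0.02]))
```

With a single p-value the function returns that value unchanged, which is a quick sanity check on the closed form.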

9.
MassMatrix is a program that matches tandem mass spectra with theoretical peptide sequences derived from a protein database. The program uses a mass accuracy sensitive probabilistic score model to rank peptide matches. The MS/MS search software was evaluated by use of a high mass accuracy dataset and its results compared with those from MASCOT, SEQUEST, X!Tandem, and OMSSA. For the high mass accuracy data, MassMatrix provided better sensitivity than MASCOT, SEQUEST, X!Tandem, and OMSSA for a given specificity, and the percentage of false positives was 2%. More importantly, all manually validated true positives corresponded to a unique peptide/spectrum match. The presence of decoy sequences and additional variable PTMs did not significantly affect the results from the high mass accuracy search. MassMatrix performs well when compared with MASCOT, SEQUEST, X!Tandem, and OMSSA with regard to search time. MassMatrix was also run on a distributed-memory cluster and achieved search speeds of ~100 000 spectra per hour when searching against a complete human database with eight variable modifications. The algorithm is available for public searches at http://www.massmatrix.net.

10.

Background  

Tandem mass spectrometry-based database searching has become an important technology for peptide and protein identification. One of the key challenges in database searching is the remarkable increase in computational demand, brought about by the expansion of protein databases, semi- or non-specific enzymatic digestion, post-translational modifications and other factors. Some software tools use peptide indexing to accelerate processing. However, building a peptide index requires a large amount of time and space, especially for non-specific digestion. Additionally, it is inflexible to use.

11.
Peptide identification by tandem mass spectrometry is an important tool in proteomic research. Powerful identification programs exist, such as SEQUEST, ProICAT and Mascot, which can relate experimental spectra to the theoretical ones derived from protein databases, thus removing much of the manual input needed in the identification process. However, the time-consuming validation of the peptide identifications is still the bottleneck of many proteomic studies. One way to further streamline this process is to remove those spectra that are unlikely to provide a confident or valid peptide identification, and in this way to reduce the labour of the validation phase. RESULTS: We propose a prefiltering scheme for evaluating the quality of spectra before the database search. The spectra are classified into two classes: spectra that contain valuable information for peptide identification and spectra that are not derived from peptides or contain insufficient information for interpretation. The different spectral features developed for the classification are tested on real-life material originating from human lymphoblast samples and on a standard mixture of 9 proteins, both labelled with the ICAT reagent. The results show that the prefiltering scheme efficiently separates the two spectra classes.

12.
Proteomics, or the direct analysis of the expressed protein components of a cell, is critical to our understanding of cellular biological processes in normal and diseased tissue. A key requirement for its success is the ability to identify proteins in complex mixtures. Recent technological advances in tandem mass spectrometry have made it the method of choice for high-throughput identification of proteins. Unfortunately, the software for unambiguously identifying peptide sequences has not kept pace with the recent hardware improvements in mass spectrometry instruments. Critical for reliable high-throughput protein identification, scoring functions evaluate the quality of a match between experimental spectra and a database peptide. Current scoring function technology relies heavily on ad hoc parameterization and manual curation by experienced mass spectrometrists. In this work, we propose a two-stage stochastic model for the observed MS/MS spectrum, given a peptide. Our model explicitly incorporates fragment ion probabilities, noisy spectra, and instrument measurement error. We describe how to compute this probability-based score efficiently, using a dynamic programming technique. A prototype implementation demonstrates the effectiveness of the model.

13.
Tandem mass spectrometry has emerged as one of the most powerful high-throughput techniques for protein identification. Tandem mass spectrometry selects and fragments peptides of interest into N-terminal ions and C-terminal ions, and it measures the mass/charge ratios of these ions. The de novo peptide sequencing problem is to derive the peptide sequences from given tandem mass spectral data of k ion peaks without searching against protein databases. By transforming the spectral data into a matrix spectrum graph G = (V, E), where |V| = O(k^2) and |E| = O(k^3), we give the first polynomial-time suboptimal algorithm that finds all the suboptimal solutions (peptides) in O(p|E|) time, where p is the number of solutions. The algorithm has been implemented and tested on experimental data. The program is available at http://hto-c.usc.edu:8000/msms/menu/denovo.htm.

14.
MS/MS is a widely used method for proteome‐wide analysis of protein expression and PTMs. The thousands of MS/MS spectra produced from a single experiment pose a major challenge for downstream analysis. Standard programs, such as MASCOT, provide peptide assignments for many of the spectra, including identification of PTM sites, but these results are plagued by false‐positive identifications. In phosphoproteomic experiments, only a single peptide assignment is typically available to support identification of each phosphorylation site, and hence minimizing false positives is critical. Thus, tedious manual validation is often required to increase confidence in the spectral assignments. We have developed phoMSVal, an open‐source platform for managing MS/MS data and automatically validating identified phosphopeptides. We tested five classification algorithms with 17 extracted features to separate correct peptide assignments from incorrect ones using over 2600 manually curated spectra. The naïve Bayes algorithm was among the best classifiers with an AUC value of 97% and PPV of 97% for phosphotyrosine data. This classifier required only three features to achieve a 76% decrease in false positives as compared with MASCOT while retaining 97% of true positives. This algorithm was able to classify an independent phosphoserine/threonine data set with AUC value of 93% and PPV of 91%, demonstrating the applicability of this method for all types of phospho‐MS/MS data. PhoMSVal is available at http://csbi.ltdk.helsinki.fi/phomsval.

15.
We describe a probabilistic peptide fragmentation model for use in protein databank searching and de novo sequencing of electrospray tandem mass spectrometry data. A probabilistic framework for tuning the model using a range of well-characterized samples is introduced. We present preliminary results of our tuning efforts.

16.
To interpret LC-MS/MS data in proteomics, most popular protein identification algorithms primarily use predicted fragment m/z values to assign peptide sequences to fragmentation spectra. The intensity information is often undervalued, because it is not as easy to predict and incorporate into algorithms. Nevertheless, the use of intensity to assist peptide identification is an attractive prospect and can potentially improve the confidence of matches and generate more identifications. On the basis of our previously reported study of fragmentation intensity patterns, we developed a protein identification algorithm, SeQuence IDentification (SQID), that makes use of the coarse intensity from a statistical analysis. The scoring scheme was validated by comparison with Sequest and X!Tandem using three data sets, and the results indicate an improvement in the number of identified peptides, including unique peptides that are not identified by Sequest or X!Tandem. The software and source code are available under the GNU GPL license at http://quiz2.chem.arizona.edu/wysocki/bioinformatics.htm.

17.
The MultiTag method (Sunyaev et al., Anal. Chem. 2003 15, 1307-1315) employs multiple error-tolerant searches with peptide sequence tags (Mann and Wilm, Anal. Chem. 1994, 66, 4390-4399) for the identification of proteins from organisms with unsequenced genomes. Here we demonstrate that the error-tolerant capabilities of MultiTag increased the number of peptide alignments and improved the confidence of identifications in an EST database. MultiTag outperformed conventional database searching software that only utilizes stringent matching of tandem mass spectra to nucleotide sequences of ESTs.

18.
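Protein mass spectrometry is an important research tool in proteomics and has been applied with great success in areas such as early cancer diagnosis. However, the curse of dimensionality that comes with protein mass spectrometry data makes dimensionality reduction a necessary step in mass spectral analysis. In this paper, we first preprocess the high-resolution SELDI-TOF ovarian mass spectrometry data provided by the U.S. National Cancer Institute. We then cast feature selection for the mass spectrometry data as a combinatorial optimization model based on the simulated annealing algorithm: the objective function is constructed from the classification error rate and the sample posterior probabilities of linear discriminant analysis, the generator of new solutions is constructed from a uniform distribution and control parameters, and a memory function is added to the annealing process. Next, 10-fold cross-validation is used to select training and test samples, and a linear discriminant analysis classifier is used to evaluate the dimension-reduced mass spectrometry data. Experiments show that when simulated annealing selects six or more features, all of the high-resolution SELDI-TOF ovarian mass spectrometry data are classified correctly, indicating that simulated annealing is well suited to feature selection for protein mass spectrometry data.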

19.
Next to the identification of proteins and the determination of their expression levels, the analysis of post-translational modifications (PTMs) is becoming an increasingly important aspect of proteomics. Here, we review mass spectrometric (MS) techniques for the study of protein glycosylation at the glycopeptide level. Enrichment and separation techniques for glycoproteins and glycopeptides from complex (glyco-)protein mixtures and digests are summarized. Various tandem MS (MS/MS) techniques for the analysis of glycopeptides are described and compared with respect to the information they provide on peptide sequence, glycan attachment site and glycan structure. Approaches using electrospray ionization and matrix-assisted laser desorption/ionization (MALDI) of glycopeptides are presented and the following fragmentation techniques in glycopeptide analysis are compared: collision-induced fragmentation on different types of instruments, metastable fragmentation after MALDI ionization, infrared multi-photon dissociation, electron-capture dissociation and electron-transfer dissociation. This review discusses the potential and limitations of tandem mass spectrometry of glycopeptides as a tool in structural glycoproteomics.

20.
Whereas the bearing of mass measurement error on protein identification is sometimes underestimated, uncertainty in observed peptide masses unavoidably translates to ambiguity in subsequent protein identifications. Although ongoing instrumental advances continue to make high accuracy mass spectrometry (MS) increasingly accessible, many proteomics experiments are still conducted with rather large mass error tolerances. In addition, the ranking schemes of most protein identification algorithms do not include a meaningful incorporation of mass measurement error. This article provides a critical evaluation of mass error tolerance as it pertains to false positive peptide and protein associations resulting from peptide mass fingerprint (PMF) database searching. High accuracy, high resolution PMFs of several model proteins were obtained using matrix-assisted laser desorption/ionization Fourier transform ion cyclotron resonance mass spectrometry (MALDI-FTICR-MS). Varying levels of mass accuracy were simulated by systematically modulating the mass error tolerance of the PMF query and monitoring the effect on figures of merit indicating the PMF quality. Importantly, the benefits of decreased mass error tolerance are not manifest in Mowse scores when operating at tolerances in the low parts-per-million range but become apparent with the consideration of additional metrics that are often overlooked. Furthermore, the outcomes of these experiments support the concept that false discovery is closely tied to mass measurement error in PMF analysis. Clear establishment of this relation demonstrates the need for mass error-aware protein identification routines and argues for a more prominent contribution of high accuracy mass measurement to proteomic science.
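
The effect of tightening the mass error tolerance can be shown with a short sketch. It computes the error in parts per million between observed and theoretical masses and counts how many theoretical peptide masses are matched at a given tolerance. The function names and the example masses are hypothetical, and the sketch ignores scoring entirely (it is not a Mowse implementation).

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def count_matches(observed_masses, theoretical_masses, tol_ppm):
    """Count theoretical masses matched by at least one observed peak
    within +/- tol_ppm."""
    return sum(
        any(abs(ppm_error(obs, theo)) <= tol_ppm for obs in observed_masses)
        for theo in theoretical_masses
    )


if __name__ == "__main__":
    # Hypothetical peptide masses, chosen only for illustration.
    theoretical = [1000.0, 1500.0, 2000.0]
    observed = [1000.004, 1500.09, 2000.0]
    for tol in (5, 100):
        print(tol, count_matches(observed, theoretical, tol))
```

In this toy input the peak at 1500.09 is 60 ppm from its theoretical mass, so it counts as a match at a 100 ppm tolerance but not at 5 ppm. With a real database, a wide tolerance also admits more by-chance matches, which is the link to false discovery that the abstract describes.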
