Recent emergence of new mass spectrometry techniques (e.g., electron transfer dissociation, ETD) and improved availability of additional proteases (e.g., Lys-N) for protein digestion in high-throughput experiments have raised the challenge of designing new algorithms for interpreting the resulting new types of tandem mass (MS/MS) spectra. Traditional MS/MS database search algorithms such as SEQUEST and Mascot were originally designed for collision-induced dissociation (CID) of tryptic peptides and rely largely on expert knowledge about fragmentation of tryptic peptides (rather than machine learning techniques) to design CID-specific scoring functions. As a result, the performance of these algorithms is suboptimal for new mass spectrometry technologies or nontryptic peptides. We recently proposed the generating function approach (MS-GF) for CID spectra of tryptic peptides. In this study, we extend MS-GF to automatically derive scoring parameters from a set of annotated MS/MS spectra of any type (e.g., CID, ETD, etc.), and present a new database search tool, MS-GFDB, based on MS-GF. We show that MS-GFDB outperforms Mascot for ETD spectra or peptides digested with Lys-N. For example, in the case of ETD spectra, the numbers of tryptic and Lys-N peptides identified by MS-GFDB increased by factors of 2.7 and 2.6, respectively, as compared with Mascot. Moreover, even following a decade of Mascot developments for analyzing CID spectra of tryptic peptides, MS-GFDB (which is not particularly tailored for CID spectra or tryptic peptides) resulted in a 28% increase over Mascot in the number of peptide identifications. Finally, we propose a statistical framework for analyzing multiple spectra from the same precursor (e.g., CID/ETD spectral pairs) and assigning p values to peptide-spectrum-spectrum matches.

Since the introduction of electron capture dissociation (ECD)
in 1998 (1), electron-based peptide dissociation technologies have played an important role in analyzing intact proteins and post-translational modifications (2). However, until recently, this research-grade technology was available only to a small number of laboratories because it was commercially unavailable, required experience for operation, and could be implemented only with expensive FT-ICR instruments. The discovery of electron-transfer dissociation (ETD) (3) enabled an ECD-like technology to be implemented in (relatively cheap) ion-trap instruments. Nowadays, many researchers employ ETD technology for tandem mass spectra generation (4–9).

Although the hardware technologies to generate ETD spectra are maturing rapidly, software technologies to analyze ETD spectra are still in their infancy. There are two major approaches to analyzing tandem mass spectra:
de novo sequencing and database search. Both approaches find the best-scoring peptide, either among all possible peptides (de novo sequencing) or among all peptides in a protein database (database search). Although de novo sequencing is emerging as an alternative to database search, database search remains a more accurate (and thus preferred) method of spectral interpretation, so here we focus on the database search approach.

Numerous database search engines are currently available, including SEQUEST (10), Mascot (11), OMSSA (12), X!Tandem (13), and InsPecT (14). However, most of them are inadequate for the analysis of ETD spectra because they are optimized for collision-induced dissociation (CID) spectra, which show different fragmentation propensities than ETD spectra. Additionally, the existing tandem mass spectrometry (MS/MS) tools are biased toward the analysis of tryptic peptides because trypsin is usually used for CID, and are thus not suitable for the analysis of nontryptic peptides that are common for ETD. Therefore, even though some database search engines support the analysis of ETD spectra (e.g., SEQUEST, Mascot, and OMSSA), their performance remains suboptimal when it comes to analyzing ETD spectra. Recently, an ETD-specific database search tool (Z-Core) was developed; however, it does not significantly improve over OMSSA (
15).

We present a new database search tool (MS-GFDB) that significantly outperforms existing database search engines in the analysis of ETD spectra, and performs equally well on nontryptic peptides. MS-GFDB employs the generating function approach (MS-GF) that computes rigorous p values of peptide-spectrum matches (PSMs) based on the spectrum-specific score histogram of all peptides (16).
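To make the idea of a spectrum-specific score histogram concrete, the following is a minimal toy sketch rather than the MS-GF implementation. It assumes integer residue masses, a precomputed additive score for each prefix mass (`node_score`), and independent residue probabilities; all function and parameter names are hypothetical. A dynamic program accumulates, for every prefix mass, the total probability of residue sequences reaching each score, and the p value of a PSM is the probability mass of peptides scoring at least as high as the match.

```python
from collections import defaultdict


def score_distribution(parent_mass, residue_masses, residue_probs, node_score):
    """Return {score: total probability} over all peptides of the given mass.

    Toy assumptions (not taken from the paper): masses are positive integers,
    a peptide's score is the sum of node_score over its prefix masses
    (excluding 0 and the parent mass), and residues are drawn independently.
    """
    # dist[m] maps a score to the probability of all prefixes with mass m.
    dist = [defaultdict(float) for _ in range(parent_mass + 1)]
    dist[0][0] = 1.0
    for m in range(1, parent_mass + 1):
        # The full parent mass is not a fragmentation site, so it adds no score.
        gain = node_score.get(m, 0) if m < parent_mass else 0
        for residue, mass in residue_masses.items():
            if mass > m:
                continue
            for score, prob in dist[m - mass].items():
                dist[m][score + gain] += prob * residue_probs[residue]
    return dict(dist[parent_mass])


def spectral_p_value(distribution, threshold):
    """Probability of a random peptide scoring at least `threshold`."""
    return sum(p for s, p in distribution.items() if s >= threshold)


# Two-letter toy alphabet: peptides of mass 3 are AAA, AB, and BA.
toy_masses = {"A": 1, "B": 2}
toy_probs = {"A": 0.5, "B": 0.5}
toy_scores = {1: 2, 2: -1}  # hypothetical per-mass scores from one spectrum
toy_dist = score_distribution(3, toy_masses, toy_probs, toy_scores)
print(spectral_p_value(toy_dist, 2))
```

In this toy setting the distribution depends only on the spectrum (through `node_score`) and the parent mass, not on any protein database.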
MS-GF p values are dependent only on the PSM (and not on the database), and thus can be used as an alternative scoring function for the database search.

Computing p values requires a scoring model evaluating qualities of PSMs. MS-GF adopts a probabilistic scoring model (MS-Dictionary scoring model) described in Kim et al., 2009 (17), considering multiple features including product ion types, peak intensities, and mass errors. To define the parameters of this scoring model, MS-GF only needs a set of training PSMs.
This set of PSMs can be obtained in a variety of ways: for example, one can generate CID/ETD pairs and use peptides identified by CID to form PSMs for ETD. Alternatively, one can generate spectra from a purified protein (when PSMs can be inferred from the accurate parent mass alone) or use a previously developed (not necessarily optimal) tool to generate training PSMs. From these training PSMs, MS-GF automatically derives scoring parameters without assuming any prior knowledge about the specifics of a particular peptide fragmentation method (e.g., ETD, CID, etc.) and/or proteolytic origin of the peptides. MS-GF was originally designed for the analysis of CID spectra, but now it has been extended to other types of spectra generated by various fragmentation techniques and/or various enzymes. We show that MS-GF can be successfully applied to novel types of spectra (e.g., ETD of Lys-N peptides (18, 19)) by simply retraining scoring parameters without any modification. Note that although the same scoring model is used for different types of spectra, the parameters derived to score different types of spectra are dissimilar.

We compared the performance of MS-GFDB with Mascot on a large ETD data set and found that MS-GFDB generated many more peptide identifications at the same false discovery rate (FDR). For example, at 1% peptide level FDR, MS-GFDB identified 9450 unique peptides from 81,864 ETD spectra of Lys-N peptides whereas Mascot only identified 3672 unique peptides, a ≈160% increase in the number of peptide identifications (a similar improvement is observed for ETD spectra of tryptic peptides).
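The relative gains quoted in this paragraph follow directly from the reported peptide counts. As a small worked check (the helper names below are illustrative only), the Lys-N ETD counts give a fold change of about 2.6 and an increase of about 157%, which the text rounds to ≈160%; the CID counts for tryptic peptides give an increase of about 28%.

```python
def fold_change(new_count, baseline_count):
    """Ratio of identifications relative to the baseline tool."""
    return new_count / baseline_count


def percent_increase(new_count, baseline_count):
    """Relative gain over the baseline, expressed in percent."""
    return 100.0 * (new_count - baseline_count) / baseline_count


# Unique peptides at 1% peptide-level FDR, as reported in the text.
lysn_etd = (9450, 3672)        # MS-GFDB vs. Mascot, ETD spectra of Lys-N peptides
tryptic_cid = (16203, 12658)   # MS-GFDB vs. Mascot, CID spectra of tryptic peptides

print(round(fold_change(*lysn_etd), 1))       # 2.6
print(round(percent_increase(*lysn_etd)))     # 157
print(round(percent_increase(*tryptic_cid)))  # 28
```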
MS-GFDB also showed a significant 28% improvement in the number of identified peptides from CID spectra of tryptic peptides (16,203 peptides as compared with 12,658 peptides identified by Mascot).

The ETD technology complements rather than replaces CID because each technology has its own advantages: CID for smaller peptides with small charges, ETD for larger and multiply charged peptides (20, 21). An alternative way to utilize ETD is to use it in conjunction with CID because CID and ETD generate complementary sequence information (20, 22, 23). ETD-enabled instruments often support generating both CID and ETD spectra (CID/ETD pairs) for the same peptide. Although the CID/ETD pairs promise a great improvement in peptide identification, the full potential of such pairs has not yet been realized. In the case of de novo sequencing, tools utilizing CID/ETD pairs indeed result in more accurate peptide sequencing than traditional CID-based algorithms (23, 24, 25). However, in the case of database search, the argument that the use of CID/ETD pairs improves peptide identifications remains poorly substantiated. A few tools have been developed to use CID/ETD (or CID/ECD) pairs for the database search but they are limited to preprocessing/postprocessing of the spectral data before or after running a traditional database search tool (
26, 27). Nielsen et al., 2005 (22) pioneered the combined use of CID and ECD for the database search. Given a CID/ECD pair, they generated a combined spectrum composed only of complementary pairs of peaks, and searched it with Mascot. However, this approach is hard to generalize to less accurate CID/ETD pairs generated by ion-trap instruments because there is a higher chance that the identified complementary pairs of peaks are spurious. More importantly, using traditional MS/MS tools (such as Mascot) for the database search of the combined spectrum is inappropriate, because they are not optimized for analyzing such combined spectra; a better approach would be to develop a new database search tool tailored for the combined spectrum. Recently, Molina et al., 2008 (26) studied database search of CID/ETD pairs using Spectrum Mill (Agilent Technologies, Santa Clara, CA) and came to the counterintuitive conclusion that using only CID spectra identifies 12% more unique peptides than using CID/ETD pairs. We believe that this is an acknowledgment of the limitations of traditional MS/MS database search tools for the analysis of multiple spectra generated from a single peptide.

In this paper, we modify the generating function approach for interpreting CID/ETD pairs and further apply it to improve the database search with CID/ETD pairs. In contrast to previous approaches, our scoring is specially designed to interpret CID/ETD pairs and can be generalized to analyzing any type of multiple spectra generated from a single peptide. When CID/ETD pairs from trypsin digests were used, MS-GFDB identified 13% and 27% more peptides compared with the case when only CID spectra and only ETD spectra were used, respectively. The difference was even more prominent when CID/ETD pairs from Lys-N digests were used, with 41% and 33% improvement over CID only and ETD only, respectively.

Assigning a
p value to a PSM has greatly helped researchers evaluate the quality of peptide identifications. We now turn to the problem of assigning a p value to a peptide-spectrum-spectrum match (PS²M) when the two spectra in the PS²M are generated by different fragmentation technologies (e.g., ETD and CID). We argue that assigning statistical significance to a PS²M (or even a PSⁿM) is a prerequisite for rigorous CID/ETD analyses. To our knowledge, MS-GFDB is the first tool to generate statistically rigorous p values of PSⁿMs.

The MS-GFDB executable and source code are available at the website of the Center for Computational Mass Spectrometry at UCSD (http://proteomics.ucsd.edu). It takes a set of spectra (CID, ETD, or CID/ETD pairs) and a protein database as input and outputs peptide matches. If the input is a set of CID/ETD pairs, it outputs the best-scoring peptide matches and their p values (1) using only CID spectra, (2) using only ETD spectra, and (3) using combined spectra of CID/ETD pairs.