Deconvolution and Database Search of Complex Tandem Mass Spectra of Intact Proteins: A Combinatorial Approach

Authors:	Xiaowen Liu, Yuval Inbar, Pieter C. Dorrestein, Colin Wynne, Nathan Edwards, Puneet Souda, Julian P. Whitelegge, Vineet Bafna, Pavel A. Pevzner

Institution:	Departments of (3) Computer Science and Engineering and (4) Pharmacology, Chemistry and Biochemistry, University of California, San Diego, California 92093; (5) Department of Chemistry and Biochemistry, University of Maryland, College Park, Maryland 20742; (6) Department of Biochemistry and Molecular and Cellular Biology, Georgetown University Medical Center, Washington, D.C. 20007

Abstract:	Top-down proteomics studies intact proteins, enabling new opportunities for analyzing post-translational modifications. Because tandem mass spectra of intact proteins are very complex, spectral deconvolution (grouping peaks into isotopomer envelopes) is a key initial stage for their interpretation. In such spectra, isotopomer envelopes of different protein fragments span overlapping regions on the m/z axis and even share spectral peaks. This raises both pattern recognition and combinatorial challenges for spectral deconvolution. We present MS-Deconv, a combinatorial algorithm for spectral deconvolution. The algorithm first generates a large set of candidate isotopomer envelopes for a spectrum, then represents the spectrum as a graph, and finally selects its highest scoring subset of envelopes as a heaviest path in the graph. In contrast with other approaches, the algorithm scores sets of envelopes rather than individual envelopes. We demonstrate that MS-Deconv improves on Thrash and Xtract in the number of correctly recovered monoisotopic masses and speed. We applied MS-Deconv to a large set of top-down spectra from Yersinia rohdei (with a still unsequenced genome) and further matched them against the protein database of related and sequenced bacterium Yersinia enterocolitica. MS-Deconv is available at http://proteomics.ucsd.edu/Software.html.

Top-down proteomics is a mass spectrometry-based approach for identification of proteins and their post-translational modifications (PTMs) (1–14). Unlike the “bottom-up” approach where proteins are first digested into peptides and then a peptide mixture is analyzed by mass spectrometry, the top-down approach analyzes intact proteins. Thus, it has advantages in detecting and localizing PTMs as well as identifying multiple protein species (e.g. proteolytically processed protein species). Despite its advantages, top-down proteomics presents many challenges. These include requirement of high sample quantity, sophisticated instrumentation, protein separation, and robust computational analysis tools. For this reason, top-down proteomics has rarely been used for analyzing complex mixtures (12–18), and it is typically used to study single purified proteins. However, this situation is quickly changing with recent top-down studies of complex protein mixtures (14, 19).

Because of the existence of natural isotopes, fragment ions of the same chemical formula and charge state are usually represented by a collection of spectral peaks in tandem mass spectra called an isotopomer envelope. The monoisotopic mass of a chemical formula is the sum of the masses of the atoms using the principal (most abundant) isotope for each element. Spectral deconvolution focuses on grouping spectral peaks into isotopomer envelopes. By doing so, the charge state and monoisotopic mass of each envelope are effectively determined. A complex multi-isotopic peak list in the m/z space is translated into a simple monoisotopic mass list that is easier to analyze.

Given the monoisotopic mass and charge state of a fragment ion, its theoretical isotopic distribution can be predicted by assuming the fragment ion has an average elemental composition with respect to its mass (20) or using its precise elemental composition if the protein is known. Exploiting this, many deconvolution methods use theoretical isotopic distributions to detect and evaluate candidate isotopomer envelopes, which is the envelope detection problem (Fig. 1). To evaluate the fit of a candidate envelope to its theoretical isotopic distribution, many metrics have been proposed (20–32).

Fig. 1. Envelope detection. a, a theoretical isotopic distribution is predicted with the monoisotopic mass and charge state of a fragment ion. b, an observed envelope is detected by mapping peaks in the theoretical distribution to the spectrum. c, match between the theoretical isotopic distribution and the observed envelope. d, the theoretical isotopic distribution is scaled (the intensities of the peaks are multiplied by a constant) to have the best fit with the intensities of peaks in the observed envelope. Finally, a score for the observed envelope can be computed by comparing it with the intensity-scaled theoretical isotopic distribution.

The candidate envelopes often overlap and share peaks, leading to a combinatorial problem of selecting the list of envelopes that best explains the spectrum (Fig. 2). In contrast to the well studied envelope detection problem, the envelope selection problem remains poorly explored. Most deconvolution algorithms follow a simple greedy approach to selecting the set of envelopes where the highest scoring envelopes are iteratively selected and removed from the spectrum. Although this approach often generates reasonable sets of envelopes for simple spectra, its performance deteriorates in cases of complex spectra.

Fig. 2. Envelope selection problem. Overlapping envelopes lead to a difficult combinatorial problem of selecting an optimal set of envelopes. We illustrate two cases where a deconvolution method that follows a greedy envelope selection outputs the envelope E₂, whereas the optimal solution consists of the envelopes E₁ and E₃. Example a illustrates the case where envelopes do not share peaks, and example b illustrates the case where envelopes share a spectral peak (E₁ and E₃).

In particular, the greedy approach performs well when the envelopes are distributed sparsely along the m/z axis. Large proteins have many fragments that appear in multiple charge states. The high number of envelopes/peaks and the small m/z spread of the fragments with high charge states result in narrow m/z regions with high peak density. In these peak-dense regions, envelopes may overlap and share peaks, and the greedy approach and even manual interpretation often fail to find the optimal combination of envelopes (supplemental Fig. 1).

Several methods have been proposed to explore the envelope selection problem. McIlwain et al. (33) presented a dynamic programming algorithm for selecting a set of envelopes such that the m/z ranges of the envelopes do not overlap. This non-overlapping condition becomes too restrictive for complex spectra of intact proteins. Samuelsson et al. (34) proposed a method that follows a non-negative sparse regression scheme. Du and Angeletti (35) and Renard et al. (36) addressed the envelope selection problem as a statistical problem of variable selection and used LASSO to solve it.

Here, we present MS-Deconv, a combinatorial algorithm for spectral deconvolution. MS-Deconv (i) generates a large set of candidate envelopes, (ii) constructs an envelope graph encoding all envelopes and relationships between them, and (iii) finds a heaviest path in the envelope graph. Although the envelope graph of a complex spectrum is large (exceeding a million nodes in some cases), the heaviest path algorithm can efficiently find an optimal set of envelopes. MS-Deconv explicitly scores combinations of candidate envelopes rather than individual envelopes as in previous approaches.

We tested MS-Deconv on a data set of top-down spectra from known proteins and evaluated the monoisotopic masses recovered by MS-Deconv. A mass was classified as a true positive if it was matched to the monoisotopic mass of a theoretical fragment ion of the protein within a specific parts per million (ppm) tolerance. We compared the performance of MS-Deconv with the widely used Thrash (20) and Xtract (37) and demonstrated that, with a few exceptions, MS-Deconv recovers more true positive masses. For example, for the collisionally activated dissociation (CAD) spectrum of bacteriorhodopsin (BR) with charge 10, the percentage of true positive masses among the top 150 masses is above 70% for MS-Deconv and less than 50% for Thrash. Additionally, MS-Deconv is ∼33 times faster than Thrash and 4 times faster than Xtract. Furthermore, MS-Deconv implements some user-friendly features: (i) outputs the set of peptide sequence tags, (ii) provides protein and spectral annotations, and (iii) allows one to inspect the recovered envelopes.

We also tested MS-Deconv on a large LC-MS/MS data set from Yersinia rohdei (with a still unsequenced genome) (19). Y. rohdei is a non-pathogenic bacterium that is often used as a simulant for the potential bioterrorism agent Yersinia pestis, the causative agent of plague. We applied MS-Deconv to extract monoisotopic mass lists from top-down spectra and compared the mass lists with those reported by Thrash. We used ProSightPC (38) and the spectral alignment algorithm (39) to identify related proteins from a protein database of Yersinia enterocolitica (with a closely related and sequenced genome). The results demonstrated that MS-Deconv reported more matched fragments than Thrash for most proteins. Additionally, using spectral alignment, we identified eight proteins in Y. rohdei that were not reported in the ProSightPC-based searches (19) of the Y. enterocolitica protein database.
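
The contrast between greedy and optimal envelope selection (Fig. 2, example a) can be illustrated with a small Python sketch. This is not MS-Deconv's envelope graph or its heaviest path algorithm. It assumes a simplified setting in which each candidate envelope is an m/z interval with a single precomputed score and selected envelopes may not overlap, which is closer to the non-overlapping dynamic programming formulation attributed above to McIlwain et al. (33). The names `Envelope`, `greedy_select`, and `optimal_select`, and all m/z values and scores, are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left
from typing import List, NamedTuple


class Envelope(NamedTuple):
    name: str
    mz_start: float  # lowest m/z covered by the envelope
    mz_end: float    # highest m/z covered by the envelope
    score: float     # hypothetical fit score, higher is better


def overlaps(a: Envelope, b: Envelope) -> bool:
    """True if the closed m/z ranges of the two envelopes intersect."""
    return a.mz_start <= b.mz_end and b.mz_start <= a.mz_end


def greedy_select(envelopes: List[Envelope]) -> List[Envelope]:
    """Repeatedly take the highest scoring envelope that overlaps nothing chosen so far."""
    chosen: List[Envelope] = []
    for env in sorted(envelopes, key=lambda e: e.score, reverse=True):
        if not any(overlaps(env, c) for c in chosen):
            chosen.append(env)
    return sorted(chosen, key=lambda e: e.mz_start)


def optimal_select(envelopes: List[Envelope]) -> List[Envelope]:
    """Weighted interval scheduling: maximize the total score of non-overlapping envelopes."""
    envs = sorted(envelopes, key=lambda e: e.mz_end)
    ends = [e.mz_end for e in envs]
    # best[i] = (total score, chosen envelopes) using only the first i envelopes
    best = [(0.0, [])]
    for i, env in enumerate(envs):
        # k = number of earlier envelopes that end strictly before env starts
        k = bisect_left(ends, env.mz_start, 0, i)
        with_env = (best[k][0] + env.score, best[k][1] + [env])
        best.append(with_env if with_env[0] > best[i][0] else best[i])
    return best[-1][1]


# Hypothetical candidates mirroring Fig. 2a: E2 scores highest alone,
# but E1 and E3 together explain more of the spectrum.
candidates = [
    Envelope("E1", 100.0, 100.6, 7.0),
    Envelope("E2", 100.4, 101.0, 10.0),
    Envelope("E3", 100.8, 101.4, 8.0),
]

print([e.name for e in greedy_select(candidates)])   # ['E2']
print([e.name for e in optimal_select(candidates)])  # ['E1', 'E3']
```

In this toy case the greedy pass commits to E₂ and then has to reject both neighbors, while scoring the set as a whole recovers E₁ and E₃. Handling envelopes that share peaks (Fig. 2, example b) needs a richer model than this interval sketch, which is the case the envelope graph is described as addressing.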
