Similar Articles
20 similar articles found.
1.
2.
Current analytical strategies for collecting proteomic data using data-dependent acquisition (DDA) are limited by the low analytical reproducibility of the method. Proteomic discovery efforts that exploit the benefits of DDA, such as providing peptide sequence information, but that enable improved analytical reproducibility, represent an ideal scenario for maximizing measurable peptide identifications in “shotgun”-type proteomic studies. Therefore, we propose an analytical workflow combining DDA with retention time aligned extracted ion chromatogram (XIC) areas obtained from high mass accuracy MS1 data acquired in parallel. We applied this workflow to the analyses of sample matrixes prepared from mouse blood plasma and brain tissues and observed increases in peptide detection of up to 30.5% due to the comparison of peptide MS1 XIC areas following retention time alignment of co-identified peptides. Furthermore, we show that the approach is quantitative using peptide standards diluted into a complex matrix. These data revealed that peptide MS1 XIC areas provide a linear response over three orders of magnitude down to low femtomole (fmol) levels. These findings argue that augmenting “shotgun” proteomic workflows with retention time alignment of peptide identifications and comparative analyses of corresponding peptide MS1 XIC areas improves the analytical performance of global proteomic discovery methods using DDA. Label-free methods in mass spectrometry-based proteomics, such as those used in common “shotgun” proteomic studies, provide peptide sequence information as well as relative measurements of peptide abundance (1–3). A common data acquisition strategy is based on data-dependent acquisition (DDA), where the most abundant precursor ions are selected for tandem mass spectrometry (MS/MS) analysis (1, 2). DDA attempts to minimize redundant peptide precursor selection and maximize the depth of proteome coverage (2). However, the analytical reproducibility of peptide identifications obtained using DDA-based methods results in <75% overlap between technical replicates (3, 4). Comparisons of peptide identifications between replicate analyses have shown that the rate of new peptide identifications increases sharply following two replicate sample injections and gradually tapers off after approximately five replicate injections (4). This phenomenon is due, in part, to the semirandom sampling of peptides in a DDA experiment (5). Alternate label-free methods focused on measuring the abundance of intact peptide ions, such as the accurate mass and time tag (AMT) approach (6–8, 42), are aimed at differential analyses of extracted ion chromatogram (XIC) areas integrated from high mass accuracy peptide precursor mass spectra (MS1 spectra) exhibiting discrete chromatographic elution times. This method is particularly powerful for the analysis of post-translationally modified (PTM) peptides, as pairing the low abundance of PTM candidates with the variable nature of DDA complicates comparisons between samples (9, 43). However, label-free strategies focused on the analysis of peptide MS1 XIC areas are dependent on a priori knowledge of peptide ions and retention times (2, 10). Thus, prospective analyses of samples are needed to assess peptides and their respective retention times. This prospective analysis may not be possible for reagent-limited samples. 
Further, the usage of previously established peptide features in the analysis of different sample types can be confounded by unknown matrix effects that can produce variable retention time characteristics and peptide ion suppression (2). Therefore, proteomic strategies that make use of DDA, to provide peptide sequence information and identify features within the sample, but that also use MS1 data for comparisons between samples, represent an ideal combination for maximizing measurable peptide identification events in “shotgun” proteomic discovery analyses. Here we describe an analytical workflow that combines traditional DDA methods with the analysis of retention time aligned XIC areas extracted from high mass accuracy peptide precursor MS1 spectra. This method resulted in a 25.1% (±6.6%) increase in measurable peptide identification events across samples of diverse composition because of the inferential extraction of peptide MS1 XIC areas in sample sets lacking corresponding MS/MS events. These findings were observed in measurements of peptide MS1 XIC abundances using sample types ranging from tryptic digests of olfactory bulb tissues dissected from Homer2 knockout and wild-type mice to mouse blood plasma exhibiting differential levels of hemolysis. We further establish that this method is quantitative using a dilution series of known bovine standard peptide concentrations spiked into mouse blood plasma. These data show that comparative analysis between samples should be performed using peptide MS1 data as opposed to semirandomly sampled peptide MS/MS data. This approach maximizes the number of peptides that can be compared between samples.
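The core of the workflow described above is to transfer peptide identifications across runs by aligning retention times and then integrating MS1 XIC areas in runs that lack a corresponding MS/MS event. A minimal Python sketch of that idea follows; the function names, the linear alignment model, and the tolerances are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_rt_map(rt_source, rt_target):
    """Linear retention-time mapping from one run's RT scale to another's,
    fit on peptides identified in both runs (LOESS is common in practice;
    a linear fit keeps this sketch short)."""
    slope, intercept = np.polyfit(rt_source, rt_target, 1)
    return lambda rt: slope * rt + intercept

def xic_area(ms1_scans, mz, rt_center, mz_tol_ppm=10.0, rt_win_min=0.5):
    """Integrate the MS1 extracted ion chromatogram around an expected m/z
    and retention time. ms1_scans: list of (rt, mz_array, intensity_array)."""
    tol = mz * mz_tol_ppm * 1e-6
    area = 0.0
    for rt, mzs, intens in ms1_scans:
        if abs(rt - rt_center) <= rt_win_min:
            area += intens[np.abs(mzs - mz) <= tol].sum()
    return area

def transfer_quantify(peptides_a_only, shared_rt_a, shared_rt_b, ms1_scans_b):
    """Peptides identified by MS/MS only in run A are quantified in run B
    by transferring their run-A retention time onto run B's time scale."""
    a_to_b = fit_rt_map(np.asarray(shared_rt_a), np.asarray(shared_rt_b))
    return {pep: xic_area(ms1_scans_b, mz, a_to_b(rt_a))
            for pep, mz, rt_a in peptides_a_only}
```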

3.
Comprehensive proteomic profiling of biological specimens usually requires multidimensional chromatographic peptide fractionation prior to mass spectrometry. However, this approach can suffer from poor reproducibility because of the lack of standardization and automation of the entire workflow, thus compromising performance of quantitative proteomic investigations. To address these variables we developed an online peptide fractionation system comprising a multiphasic liquid chromatography (LC) chip that integrates reversed phase and strong cation exchange chromatography upstream of the mass spectrometer (MS). We showed superiority of this system for standardizing discovery and targeted proteomic workflows using cancer cell lysates and nondepleted human plasma. Five-step multiphase chip LC MS/MS acquisition showed clear advantages over analyses of unfractionated samples by identifying more peptides, consuming less sample and often improving the lower limits of quantitation, all in highly reproducible, automated, online configuration. We further showed that multiphase chip LC fractionation provided a facile means to detect many N- and C-terminal peptides (including acetylated N terminus) that are challenging to identify in complex tryptic peptide matrices because of less favorable ionization characteristics. Given as much as 95% of peptides were detected in only a single salt fraction from cell lysates we exploited this high reproducibility and coupled it with multiple reaction monitoring on a high-resolution MS instrument (MRM-HR). This approach increased target analyte peak area and improved lower limits of quantitation without negatively influencing variance or bias. Further, we showed a strategy to use multiphase LC chip fractionation LC-MS/MS for ion library generation to integrate with SWATHTM data-independent acquisition quantitative workflows. All MS data are available via ProteomeXchange with identifier PXD001464.Mass spectrometry based proteomic quantitation is an essential technique used for contemporary, integrative biological studies. Whether used in discovery experiments or for targeted biomarker applications, quantitative proteomic studies require high reproducibility at many levels. It requires reproducible run-to-run peptide detection, reproducible peptide quantitation, reproducible depth of proteome coverage, and ideally, a high degree of cross-laboratory analytical reproducibility. Mass spectrometry centered proteomics has evolved steadily over the past decade, now mature enough to derive extensive draft maps of the human proteome (1, 2). Nonetheless, a key requirement yet to be realized is to ensure that quantitative proteomics can be carried out in a timely manner while satisfying the aforementioned challenges associated with reproducibility. This is especially important for recent developments using data independent MS quantitation and multiple reaction monitoring on high-resolution MS (MRM-HR)1 as they are both highly dependent on LC peptide retention time reproducibility and precursor detectability, while attempting to maximize proteome coverage (3). Strategies usually employed to increase the depth of proteome coverage utilize various sample fractionation methods including gel-based separation, affinity enrichment or depletion, protein or peptide chemical modification-based enrichment, and various peptide chromatography methods, particularly ion exchange chromatography (410). 
In comparison to an unfractionated “naive” sample, the trade-offs in using these enrichment/fractionation approaches are a higher risk of sample losses, the introduction of undesired chemical modifications (e.g. oxidation, deamidation, N-terminal lactam formation), the potential for result skewing and bias, and the considerable time and human resources required to perform the sample preparation tasks. Online-coupled approaches aim to minimize those risks and address resource constraints. A widely practiced example of the benefits of online sample fractionation has been the decade-long use of combining strong cation exchange chromatography (SCX) with C18 reversed-phase (RP) for peptide fractionation (known as MudPIT – multidimensional protein identification technology), where SCX and RP are performed under the same buffer conditions and the SCX elution is performed with volatile organic cations compatible with reversed phase separation (11). This approach greatly increases analyte detection while avoiding sample handling losses. The MudPIT approach has been widely used for discovery proteomics (12–14), and we have previously shown that multiphasic separations also have utility for targeted proteomics when configured for selected reaction monitoring MS (SRM-MS). We showed substantial advantages of MudPIT-SRM-MS with reduced ion suppression, increased peak areas and lower limits of detection (LLOD) compared with conventional RP-SRM-MS (15). To improve the reproducibility of proteomic workflows, increase throughput and minimize sample loss, numerous microfluidic devices have been developed and integrated for proteomic applications (16, 17). These devices can broadly be classified into two groups: (1) microfluidic chips for peptide separation (18–25); and (2) proteome reactors that combine enzymatic processing with peptide based fractionation (26–30). Because of the small dimensions of these devices, they are readily able to integrate into nanoLC workflows. Various applications have been described, including increasing proteome coverage (22, 27, 28) and targeting of phosphopeptides (24, 31, 32), glycopeptides and released glycans (29, 33, 34). In this work, we set out to take advantage of the benefits of multiphasic peptide separations and address the reproducibility needs required for high-throughput comparative proteomics using a variety of workflows. We integrated a multiphasic SCX and RP column in a “plug-and-play” microfluidic chip format for online fractionation, eliminating the need for users to make minimal dead volume connections between traps and columns. We show the flexibility of this format to provide robust peptide separation and reproducibility using conventional and topical mass spectrometry workflows. This was undertaken by coupling the multiphase liquid chromatography (LC) chip to a fast scanning Q-ToF mass spectrometer for data dependent MS/MS, data independent MS (SWATH) and for targeted proteomics using MRM-HR, showing clear advantages for repeatable analyses compared with conventional proteomic workflows.
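Since the central claim of this work is run-to-run reproducibility of peptide detection and quantitation, a simple way to put a number on it is the per-peptide coefficient of variation across replicate injections. The short sketch below uses a hypothetical data layout and is only one of many ways to summarize reproducibility.

```python
import numpy as np

def replicate_cvs(areas_by_peptide):
    """Per-peptide coefficient of variation (CV%) across replicate runs.
    areas_by_peptide: dict mapping peptide -> list of peak areas, one per run;
    runs in which the peptide was not quantified are given as None."""
    cvs = {}
    for pep, areas in areas_by_peptide.items():
        a = np.array([x for x in areas if x is not None and x > 0], dtype=float)
        if len(a) >= 2:
            cvs[pep] = 100.0 * a.std(ddof=1) / a.mean()
    return cvs

# Toy example: three replicate injections of one peptide.
print(replicate_cvs({"ELVISLIVESK": [1.02e6, 0.97e6, 1.05e6]}))
```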

4.
Isobaric labeling techniques coupled with high-resolution mass spectrometry have been widely employed in proteomic workflows requiring relative quantification. For each high-resolution tandem mass spectrum (MS/MS), isobaric labeling techniques can be used not only to quantify the peptide from different samples by reporter ions, but also to identify the peptide it is derived from. Because the ions related to isobaric labeling may act as noise in database searching, the MS/MS spectrum should be preprocessed before peptide or protein identification. In this article, we demonstrate that there are a lot of high-frequency, high-abundance isobaric related ions in the MS/MS spectrum, and removing isobaric related ions combined with deisotoping and deconvolution in MS/MS preprocessing procedures significantly improves the peptide/protein identification sensitivity. The user-friendly software package TurboRaw2MGF (v2.0) has been implemented for converting raw TIC data files to mascot generic format files and can be downloaded for free from https://github.com/shengqh/RCPA.Tools/releases as part of the software suite ProteomicsTools. The data have been deposited to the ProteomeXchange with identifier PXD000994.Mass spectrometry-based proteomics has been widely applied to investigate protein mixtures derived from tissue, cell lysates, or from body fluids (1, 2). Liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS)1 is the most popular strategy for protein/peptide mixtures analysis in shotgun proteomics (3). Large-scale protein/peptide mixtures are separated by liquid chromatography followed by online detection by tandem mass spectrometry. The capabilities of proteomics rely greatly on the performance of the mass spectrometer. With the improvement of MS technology, proteomics has benefited significantly from the high-resolution and excellent mass accuracy (4). In recent years, based on the higher efficiency of higher energy collision dissociation (HCD), a new “high–high” strategy (high-resolution MS as well as MS/MS(tandem MS)) has been applied instead of the “high–low” strategy (high-resolution MS, i.e. in Orbitrap, and low-resolution MS/MS, i.e. in ion trap) to obtain high quality tandem MS/MS data as well as full MS in shotgun proteomics. Both full MS scans and MS/MS scans can be performed, and the whole cycle time of MS detection is very compatible with the chromatographic time scale (5).High-resolution measurement is one of the most important features in mass spectrometric application. In this high–high strategy, high-resolution and accurate spectra will be achieved in tandem MS/MS scans as well as full MS scans, which makes isotopic peaks distinguishable from one another, thus enabling the easy calculation of precise charge states and monoisotopic mass. During an LC-MS/MS experiment, a multiply charged precursor ion (peptide) is usually isolated and fragmented, and then the multiple charge states of the fragment ions are generated and collected. After full extraction of peak lists from original tandem mass spectra, the commonly used search engines (i.e. Mascot (6), Sequest (7)) have no capability to distinguish isotopic peaks and recognize charge states, so all of the product ions are considered as all charge state hypotheses during the database search for protein identification. These multiple charge states of fragment ions and their isotopic cluster peaks can be incorrectly assigned by the search engine, which can cause false peptide identification. 
To overcome this issue, data preprocessing of the high-resolution MS/MS spectra is required before submitting them for identification. There are usually two major preprocessing steps used for high-resolution MS/MS data: deisotoping and deconvolution (8, 9). Deisotoping removes all peaks of an isotopic cluster except the monoisotopic peak. Deconvolution translates multiply charged ions to singly charged ions and also accumulates the intensity of fragment ions by summing up the intensities from their multiply charged states. After performing these two data-preprocessing steps, the resulting spectra are simpler and cleaner and allow more precise database searching and accurate bioinformatics analysis. With the capacity to analyze multiple samples simultaneously, stable isotope labeling approaches have been widely used in quantitative proteomics. Stable isotope labeling approaches are categorized as metabolic labeling (SILAC, stable isotope labeling by amino acids in cell culture) and chemical labeling (10, 11). The peptides labeled by the SILAC approach are quantified by precursor ions in full MS spectra, whereas peptides that have been isobarically labeled using chemical means are quantified by reporter ions in MS/MS spectra. There are two similar isobaric chemical labeling methods: (1) isobaric tag for relative and absolute quantification (iTRAQ), and (2) tandem mass tag (TMT) (12, 13). These reagents contain an amino-reactive group that specifically reacts with N-terminal amino groups and epsilon-amino groups of lysine residues to label digested peptides in a typical shotgun proteomics experiment. There are four different multiplexing formats of isobaric tags: TMT two-plex, iTRAQ four-plex, TMT six-plex, and iTRAQ eight-plex (12–16). The number before “plex” denotes the number of samples that can be analyzed in the same mass spectrum simultaneously. Peptides labeled with different isotopic variants of the tag show identical or similar mass and appear as a single peak in full scans. This single peak may be selected for subsequent MS/MS analysis. In an MS/MS scan, the masses of the reporter ions (m/z 114 to 117 for iTRAQ four-plex, 113 to 121 for iTRAQ eight-plex, and 126 to 131 for TMT six-plex upon CID or HCD activation) are associated with the corresponding samples, and their intensities represent the relative abundances of the labeled peptides. Meanwhile, the other ions in the MS/MS spectra can be used for peptide identification. Because of the multiplexing capability, isobaric labeling methods combined with bottom-up proteomics have been widely applied for accurate quantification of proteins on a global scale (14, 17–19). Although mostly associated with peptide labeling, these isobaric labeling methods have also been applied at the protein level (20–23). 
Several combinations of data-preprocessing steps were applied for high-throughput data analysis, including deisotoping to retain only monoisotopic peaks, deconvolution of ions with multiple charge states, and preservation of the top 10 peaks in every 100 Da mass window. After systematic analysis of high-resolution isobarically labeled spectra, we further processed the spectra and removed interfering ions that were not related to the peptide. Our results suggested that the preprocessing of isobarically labeled high-resolution tandem mass spectra significantly improved the peptide/protein identification sensitivity.
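The preprocessing described above combines charge deconvolution, removal of label-related ions, and retention of the top 10 peaks per 100 Da window. The sketch below illustrates these three operations in Python; the function names, tolerances, and peak representation are illustrative and do not reproduce the TurboRaw2MGF implementation.

```python
PROTON = 1.007276  # proton mass in Da, used for charge deconvolution

def deconvolute(mz, intensity, charge):
    """Convert a multiply charged fragment to its singly charged m/z.
    Summing intensities across charge states of the same fragment is
    bookkeeping omitted here."""
    neutral_mass = (mz - PROTON) * charge
    return neutral_mass + PROTON, intensity

def remove_label_ions(peaks, label_mz, tol=0.01):
    """Drop peaks within `tol` of any label-related m/z (reporter ions,
    the intact tag, tag fragments). `label_mz` is supplied by the user
    for the reagent in use."""
    return [(mz, it) for mz, it in peaks
            if all(abs(mz - L) > tol for L in label_mz)]

def top_n_per_window(peaks, n=10, window=100.0):
    """Keep only the n most intense peaks in every `window` Da bin."""
    bins = {}
    for mz, it in peaks:
        bins.setdefault(int(mz // window), []).append((mz, it))
    kept = []
    for _, group in sorted(bins.items()):
        group.sort(key=lambda p: p[1], reverse=True)
        kept.extend(sorted(group[:n]))
    return kept
```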

5.
Quantifying the similarity of spectra is an important task in various areas of spectroscopy, for example, to identify a compound by comparing sample spectra to those of reference standards. In mass spectrometry based discovery proteomics, spectral comparisons are used to infer the amino acid sequence of peptides. In targeted proteomics by selected reaction monitoring (SRM) or SWATH MS, predetermined sets of fragment ion signals integrated over chromatographic time are used to identify target peptides in complex samples. In both cases, confidence in peptide identification is directly related to the quality of spectral matches. In this study, we used sets of simulated spectra of well-controlled dissimilarity to benchmark different spectral comparison measures and to develop a robust scoring scheme that quantifies the similarity of fragment ion spectra. We applied the normalized spectral contrast angle score to quantify the similarity of spectra to objectively assess fragment ion variability of tandem mass spectrometric datasets, to evaluate portability of peptide fragment ion spectra for targeted mass spectrometry across different types of mass spectrometers and to discriminate target assays from decoys in targeted proteomics. Altogether, this study validates the use of the normalized spectral contrast angle as a sensitive spectral similarity measure for targeted proteomics, and more generally provides a methodology to assess the performance of spectral comparisons and to support the rational selection of the most appropriate similarity measure. The algorithms used in this study are made publicly available as an open source toolset with a graphical user interface.In “bottom-up” proteomics, peptide sequences are identified by the information contained in their fragment ion spectra (1). Various methods have been developed to generate peptide fragment ion spectra and to match them to their corresponding peptide sequences. They can be broadly grouped into discovery and targeted methods. In the widely used discovery (also referred to as shotgun) proteomic approach, peptides are identified by establishing peptide to spectrum matches via a method referred to as database searching. Each acquired fragment ion spectrum is searched against theoretical peptide fragment ion spectra computed from the entries of a specified sequence database, whereby the database search space is constrained to a user defined precursor mass tolerance (2, 3). The quality of the match between experimental and theoretical spectra is typically expressed with multiple scores. These include the number of matching or nonmatching fragments, the number of consecutive fragment ion matches among others. With few exceptions (47) commonly used search engines do not use the relative intensities of the acquired fragment ion signals even though this information could be expected to strengthen the confidence of peptide identification because the relative fragment ion intensity pattern acquired under controlled fragmentation conditions can be considered as a unique “fingerprint” for a given precursor. Thanks to community efforts in acquiring and sharing large number of datasets, the proteomes of some species are now essentially mapped out and experimental fragment ion spectra covering entire proteomes are increasingly becoming accessible through spectral databases (816). This has catalyzed the emergence of new proteomics strategies that differ from classical database searching in that they use prior spectral information to identify peptides. 
Those comprise inclusion list sequencing (directed sequencing), spectral library matching, and targeted proteomics (17). These methods explicitly use the information contained in empirical fragment ion spectra, including the fragment ion signal intensities, to identify the target peptide. For these methods, it is therefore of highest importance to accurately control and quantify the degree of reproducibility of the fragment ion spectra across experiments, instruments, labs, and methods, and to quantitatively assess the similarity of spectra. To date, the dot product (18–24), its corresponding arccosine spectral contrast angle (25–27), (Pearson-like) spectral correlation (28–31), and other geometrical distance measures (18, 32) have been used in the literature for assessing spectral similarity. These measures have been used in different contexts, including shotgun spectra clustering (19, 26), spectral library searching (18, 20, 21, 24, 25, 27–29), cross-instrument fragmentation comparisons (22, 30), and scoring of transitions in targeted proteomics analyses such as selected reaction monitoring (SRM) (23, 31). However, to our knowledge, those scores have never been objectively benchmarked for their performance in discriminating well-defined levels of dissimilarity between spectra. In particular, similarity scores obtained by different methods have not yet been compared for targeted proteomics applications, where the sensitive discrimination of highly similar spectra is critical for the confident identification of targeted peptides. In this study, we have developed a method to objectively assess the similarity of fragment ion spectra. We provide an open-source toolset that supports these analyses. Using a computationally generated benchmark spectral library with increasing levels of well-controlled spectral dissimilarity, we performed a comprehensive and unbiased comparison of the performance of the main scores used to assess spectral similarity in mass spectrometry. We then exemplify how this method, in conjunction with its corresponding benchmarked perturbation spectra set, can be applied to answer several relevant questions for MS-based proteomics. As a first application, we show that it can efficiently assess the absolute levels of peptide fragmentation variability inherent to any given mass spectrometer. By comparing the instrument's intrinsic fragmentation conservation distribution to that of the benchmarked perturbation spectra set, nominal values of spectral similarity scores can indeed be translated into a more directly understandable percentage of variability inherent to the instrument fragmentation. As a second application, we show that the method can be used to derive an absolute measure to estimate the conservation of peptide fragmentation between instruments or across proteomics methods. This allowed us to quantitatively evaluate, for example, the transferability of fragment ion spectra acquired by data dependent analysis on a first instrument into a fragment/transition assay list used for targeted proteomics applications (e.g. SRM or targeted extraction of data independent acquisition SWATH MS (33)) on another instrument. Third, we used the method to probe the fragmentation patterns of peptides carrying a post-translational modification (e.g. phosphorylation) by comparing the spectra of modified peptides with those of their unmodified counterparts. 
Finally, we used the method to determine the overall level of fragmentation conservation that is required to support target-decoy discrimination and peptide identification in targeted proteomics approaches such as SRM and SWATH MS.
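The normalized spectral contrast angle at the heart of this study can be computed from two intensity vectors defined over the same set of fragment ions. The sketch below follows the standard definition (arccosine of the normalized dot product, rescaled so that 1 means identical and 0 means orthogonal); the exact normalization used in the authors' toolset may differ.

```python
import numpy as np

def normalized_contrast_angle(int_a, int_b):
    """Normalized spectral contrast angle between two spectra represented as
    intensity vectors over the same ordered list of fragment ions (zero where
    a fragment is absent)."""
    a = np.asarray(int_a, dtype=float)
    b = np.asarray(int_b, dtype=float)
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 0.0
    cos_sim = float(np.clip(np.dot(a / na, b / nb), -1.0, 1.0))
    theta = np.arccos(cos_sim)          # spectral contrast angle in radians
    return 1.0 - 2.0 * theta / np.pi    # 1 = identical pattern, 0 = orthogonal

# Example: two nearly identical fragment intensity patterns score close to 1.
print(normalized_contrast_angle([100, 40, 10, 5], [95, 42, 12, 4]))
```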

6.
Database search programs are essential tools for identifying peptides via mass spectrometry (MS) in shotgun proteomics. Simultaneously achieving high sensitivity and high specificity during a database search is crucial for improving proteome coverage. Here we present JUMP, a new hybrid database search program that generates amino acid tags and ranks peptide spectrum matches (PSMs) by an integrated score from the tags and pattern matching. In a typical run of liquid chromatography coupled with high-resolution tandem MS, more than 95% of MS/MS spectra can generate at least one tag, whereas the remaining spectra are usually too poor to derive genuine PSMs. To enhance search sensitivity, the JUMP program enables the use of tags as short as one amino acid. Using a target-decoy strategy, we compared JUMP with other programs (e.g. SEQUEST, Mascot, PEAKS DB, and InsPecT) in the analysis of multiple datasets and found that JUMP outperformed these preexisting programs. JUMP also permitted the analysis of multiple co-fragmented peptides from “mixture spectra” to further increase PSMs. In addition, JUMP-derived tags allowed partial de novo sequencing and facilitated the unambiguous assignment of modified residues. In summary, JUMP is an effective database search algorithm complementary to current search programs.Peptide identification by tandem mass spectra is a critical step in mass spectrometry (MS)-based1 proteomics (1). Numerous computational algorithms and software tools have been developed for this purpose (26). These algorithms can be classified into three categories: (i) pattern-based database search, (ii) de novo sequencing, and (iii) hybrid search that combines database search and de novo sequencing. With the continuous development of high-performance liquid chromatography and high-resolution mass spectrometers, it is now possible to analyze almost all protein components in mammalian cells (7). In contrast to rapid data collection, it remains a challenge to extract accurate information from the raw data to identify peptides with low false positive rates (specificity) and minimal false negatives (sensitivity) (8).Database search methods usually assign peptide sequences by comparing MS/MS spectra to theoretical peptide spectra predicted from a protein database, as exemplified in SEQUEST (9), Mascot (10), OMSSA (11), X!Tandem (12), Spectrum Mill (13), ProteinProspector (14), MyriMatch (15), Crux (16), MS-GFDB (17), Andromeda (18), BaMS2 (19), and Morpheus (20). Some other programs, such as SpectraST (21) and Pepitome (22), utilize a spectral library composed of experimentally identified and validated MS/MS spectra. These methods use a variety of scoring algorithms to rank potential peptide spectrum matches (PSMs) and select the top hit as a putative PSM. However, not all PSMs are correctly assigned. For example, false peptides may be assigned to MS/MS spectra with numerous noisy peaks and poor fragmentation patterns. If the samples contain unknown protein modifications, mutations, and contaminants, the related MS/MS spectra also result in false positives, as their corresponding peptides are not in the database. Other false positives may be generated simply by random matches. Therefore, it is of importance to remove these false PSMs to improve dataset quality. 
One common approach is to filter putative PSMs to achieve a final list with a predefined false discovery rate (FDR) via a target-decoy strategy, in which decoy proteins are merged with target proteins in the same database for estimating false PSMs (23–26). However, true and false PSMs are not always distinguishable based on matching scores, and it is difficult to set an appropriate score threshold that achieves maximal sensitivity and high specificity (13, 27, 28). De novo methods, including Lutefisk (29), PEAKS (30), NovoHMM (31), PepNovo (32), pNovo (33), Vonovo (34), and UniNovo (35), identify peptide sequences directly from MS/MS spectra. These methods can be used to derive novel peptides and post-translational modifications without a database, which is useful, especially when the related genome is not sequenced. High-resolution MS/MS spectra greatly facilitate the generation of peptide sequences in these de novo methods. However, because MS/MS fragmentation cannot always produce all predicted product ions, only a portion of collected MS/MS spectra have sufficient quality to extract partial or full peptide sequences, leading to lower sensitivity than achieved with the database search methods. To improve the sensitivity of the de novo methods, a hybrid approach has been proposed to integrate peptide sequence tags into PSM scoring during database searches (36). Numerous software packages have been developed, such as GutenTag (37), InsPecT (38), Byonic (39), DirecTag (40), and PEAKS DB (41). These methods use peptide tag sequences to filter a protein database, followed by error-tolerant database searching. One restriction in most of these algorithms is the requirement of a minimum tag length of three amino acids for matching protein sequences in the database. This restriction reduces the sensitivity of the database search, because it filters out some high-quality spectra in which consecutive tags cannot be generated. In this paper, we describe JUMP, a novel tag-based hybrid algorithm for peptide identification. The program is optimized to balance sensitivity and specificity during tag derivation and MS/MS pattern matching. JUMP can use all potential sequence tags, including tags consisting of only one amino acid. When we compared its performance to that of two widely used search algorithms, SEQUEST and Mascot, JUMP identified ∼30% more PSMs at the same FDR threshold. In addition, the program provides two additional features: (i) using tag sequences to improve modification site assignment, and (ii) analyzing co-fragmented peptides from mixture MS/MS spectra.
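JUMP, like the other search engines discussed, filters putative PSMs against a target-decoy database to reach a predefined FDR. A minimal sketch of that filtering step (the standard decoys/targets estimate, not JUMP's scoring) is shown below.

```python
def filter_psms_at_fdr(psms, fdr=0.01):
    """psms: list of (score, is_decoy) tuples, higher score = better match.
    Returns target PSMs above the deepest score cutoff at which the
    decoy-estimated FDR (decoys / targets among accepted PSMs) stays <= fdr."""
    psms = sorted(psms, key=lambda p: p[0], reverse=True)
    best_cutoff = None
    targets = decoys = 0
    for score, is_decoy in psms:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= fdr:
            best_cutoff = score  # deepest score still within the FDR budget
    if best_cutoff is None:
        return []
    return [p for p in psms if not p[1] and p[0] >= best_cutoff]

# Toy example with a deliberately loose FDR so the effect of decoys is visible.
accepted = filter_psms_at_fdr(
    [(95, False), (90, False), (88, True), (85, False), (80, False), (60, True)],
    fdr=0.34)
print([score for score, _ in accepted])
```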

7.
Based on conventional data-dependent acquisition strategy of shotgun proteomics, we present a new workflow DeMix, which significantly increases the efficiency of peptide identification for in-depth shotgun analysis of complex proteomes. Capitalizing on the high resolution and mass accuracy of Orbitrap-based tandem mass spectrometry, we developed a simple deconvolution method of “cloning” chimeric tandem spectra for cofragmented peptides. Additional to a database search, a simple rescoring scheme utilizes mass accuracy and converts the unwanted cofragmenting events into a surprising advantage of multiplexing. With the combination of cloning and rescoring, we obtained on average nine peptide-spectrum matches per second on a Q-Exactive workbench, whereas the actual MS/MS acquisition rate was close to seven spectra per second. This efficiency boost to 1.24 identified peptides per MS/MS spectrum enabled analysis of over 5000 human proteins in single-dimensional LC-MS/MS shotgun experiments with an only two-hour gradient. These findings suggest a change in the dominant “one MS/MS spectrum - one peptide” paradigm for data acquisition and analysis in shotgun data-dependent proteomics. DeMix also demonstrated higher robustness than conventional approaches in terms of lower variation among the results of consecutive LC-MS/MS runs.Shotgun proteomics analysis based on a combination of high performance liquid chromatography and tandem mass spectrometry (MS/MS) (1) has achieved remarkable speed and efficiency (27). In a single four-hour long high performance liquid chromatography-MS/MS run, over 40,000 peptides and 5000 proteins can be identified using a high-resolution Orbitrap mass spectrometer with data-dependent acquisition (DDA)1 (2, 3). However, in a typical LC-MS analysis of unfractionated human cell lysate, over 100,000 individual peptide isotopic patterns can be detected (4), which corresponds to simultaneous elution of hundreds of peptides. With this complexity, a mass spectrometer needs to achieve ≥25 Hz MS/MS acquisition rate to fully sample all the detectable peptides, and ≥17 Hz to cover reasonably abundant ones (4). Although this acquisition rate is reachable by modern time-of-flight (TOF) instruments, the reported DDA identification results do not encompass all expected peptides. Recently, the next-generation Orbitrap instrument, working at 20 Hz MS/MS acquisition rate, demonstrated nearly full profiling of yeast proteome using an 80 min gradient, which opened the way for comprehensive analysis of human proteome in a time efficient manner (5).During the high performance liquid chromatography-MS/MS DDA analysis of complex samples, high density of co-eluting peptides results in a high probability for two or more peptides to overlap within an MS/MS isolation window. With the commonly used ±1.0–2.0 Th isolation windows, most MS/MS spectra are chimeric (4, 810), with cofragmenting precursors being naturally multiplexed. However, as has been discussed previously (9, 10), the cofragmentation events are currently ignored in most of the conventional analysis workflows. According to the prevailing assumption of “one MS/MS spectrum–one peptide,” chimeric MS/MS spectra are generally unwelcome in DDA, because the product ions from different precursors may interfere with the assignment of MS/MS fragment identities, increasing the rate of false discoveries in database search (8, 9). 
In some studies, the precursor isolation width was set as narrow as ±0.35 Th to prevent unwanted ions from being coselected, fragmented or detected (4, 5).On the contrary, multiplexing by cofragmentation is considered to be one of the solid advantages in data-independent acquisition (DIA) (1013). In several commonly used DIA methods, the precursor ion selection windows are set much wider than in DDA: from 25 Th as in SWATH (12), to extremely broad range as in AIF (13). In order to use the benefit of MS/MS multiplexing in DDA, several approaches have been proposed to deconvolute chimeric MS/MS spectra. In “alternative peptide identification” method implemented in Percolator (14), a machine learning algorithm reranks and rescores peptide-spectrum matches (PSMs) obtained from one or more MS/MS search engines. But the deconvolution in Percolator is limited to cofragmented peptides with masses differing from the target peptide by the tolerance of the database search, which can be as narrow as a few ppm. The “active demultiplexing” method proposed by Ledvina et al. (15) actively separates MS/MS data from several precursors using masses of complementary fragments. However, higher-energy collisional dissociation often produces MS/MS spectra with too few complementary pairs for reliable peptide identification. The “MixDB” method introduces a sophisticated new search engine, also with a machine learning algorithm (9). And the “second peptide identification” method implemented in Andromeda/MaxQuant workflow (16) submits the same dataset to the search engine several times based on the list of chromatographic peptide features, subtracting assigned MS/MS peaks after each identification round. This approach is similar to the ProbIDTree search engine that also performed iterative identification while removing assigned peaks after each round of identification (17).One important factor for spectral deconvolution that has not been fully utilized in most conventional workflows is the excellent mass accuracy achievable with modern high-resolution mass spectrometry (18). An Orbitrap Fourier-transform mass spectrometer can provide mass accuracy in the range of hundreds of ppb (parts per billion) for mass peaks with high signal-to-noise (S/N) ratio (19). However, the mass error of peaks with lower S/N ratios can be significantly higher and exceed 1 ppm. Despite this dependence of the mass accuracy from the S/N level, most MS and MS/MS search engines only allow users to set hard cut-off values for the mass error tolerances. Moreover, some search engines do not provide the option of choosing a relative error tolerance for MS/MS fragments. Such negligent treatment of mass accuracy reduces the analytical power of high accuracy experiments (18).Identification results coming from different MS/MS search engines are sometimes not consistent because of different statistical assumptions used in scoring PSMs. Introduction of tools integrating the results of different search engines (14, 20, 21) makes the data interpretation even more complex and opaque for the user. The opposite trend—simplification of MS/MS data interpretation—is therefore a welcome development. For example, an extremely straightforward algorithm recently proposed by Wenger et al. 
(22) demonstrated a surprisingly high performance in peptide identification, even though it is only marginally more complex than simply counting the number of matches of theoretical fragment peaks in high resolution MS/MS, without any a priori statistical assumption. In order to take advantage of the natural multiplexing of MS/MS spectra in DDA, as well as properly utilize the high accuracy of Orbitrap-based mass spectrometry, we developed a simple and robust data analysis workflow, DeMix. It is presented in Fig. 1 as an expansion of the conventional workflow. Principles of some of the processes used by the workflow are borrowed from other approaches, including the custom-made mass peak centroiding (20), chromatographic feature detection (19, 20), and a two-pass database search with the first limited pass providing a “software lock mass” for mass scale recalibration (23). [Fig. 1: An overview of the DeMix workflow, which expands the conventional workflow (shown by the dashed line); processes are colored purple for TOPP, red for the search engine (Morpheus/Mascot/MS-GF+), and blue for in-house programs.] In the DeMix workflow, the deconvolution of chimeric MS/MS spectra consists of simply “cloning” an MS/MS spectrum if a potential cofragmented peptide is detected. The list of candidate peptide precursors is generated from chromatographic feature detection, as in the MaxQuant/Andromeda workflow (16, 19), but using The OpenMS Proteomics Pipeline (TOPP) (20, 24). During the cloning, the precursor is replaced by the new candidate, but no changes in the MS/MS fragment list are made, and therefore the cloned MS/MS spectra remain chimeric. Processing such spectra requires a search engine tolerant to the presence of unassigned peaks, as such peaks are always expected when multiple precursors cofragment. Thus, we chose Morpheus (22) as a search engine. Based on the original search algorithm, we implement a reformed scoring scheme: Morpheus-AS (advanced scoring). It inherits all the basic principles from Morpheus but makes deeper use of the high mass accuracy of the data. This kind of database search removes the necessity of spectral processing for physical separation of MS/MS data into multiple subspectra (15), or consecutive subtraction of peaks (16, 17). Despite the fact that the DeMix workflow is largely a combination of known approaches, it provides a remarkable improvement over the state of the art. On our Orbitrap Q-Exactive workbench, testing on a benchmark dataset of two-hour single-dimension LC-MS/MS experiments from HeLa cell lysate, we identified on average 1.24 peptides per MS/MS spectrum, breaking the “one MS/MS spectrum–one peptide” paradigm at the level of the whole data set. At 1% false discovery rate (FDR), we obtained on average nine PSMs per second (at the actual acquisition rate of ca. seven MS/MS spectra per second), and detected 40 human proteins per minute.
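The "cloning" step that DeMix uses to deconvolute chimeric spectra can be sketched in a few lines: for every additional precursor feature detected inside the isolation window, emit a copy of the spectrum with the precursor reassigned and the fragment list untouched. The data structures and the isolation-window width below are illustrative assumptions, not the TOPP-based implementation.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass(frozen=True)
class MS2Spectrum:
    precursor_mz: float
    precursor_charge: int
    peaks: Tuple[Tuple[float, float], ...]  # (m/z, intensity), untouched by cloning

def clone_chimeric(spectrum: MS2Spectrum,
                   features: List[Tuple[float, int]],
                   isolation_half_width: float = 1.0) -> List[MS2Spectrum]:
    """Return the original spectrum plus one clone per detected chromatographic
    feature (m/z, charge) that falls inside the isolation window and differs
    from the selected precursor. Each clone keeps the same chimeric fragment
    peaks; only the assumed precursor changes, so each copy is searched
    independently downstream."""
    clones = [spectrum]
    for mz, z in features:
        in_window = abs(mz - spectrum.precursor_mz) <= isolation_half_width
        is_same = (abs(mz - spectrum.precursor_mz) < 1e-4
                   and z == spectrum.precursor_charge)
        if in_window and not is_same:
            clones.append(replace(spectrum, precursor_mz=mz, precursor_charge=z))
    return clones
```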

8.
The success of high-throughput proteomics hinges on the ability of computational methods to identify peptides from tandem mass spectra (MS/MS). However, a common limitation of most peptide identification approaches is the nearly ubiquitous assumption that each MS/MS spectrum is generated from a single peptide. We propose a new computational approach for the identification of mixture spectra generated from more than one peptide. Capitalizing on the growing availability of large libraries of single-peptide spectra (spectral libraries), our quantitative approach is able to identify up to 98% of all mixture spectra from equally abundant peptides and automatically adjust to varying abundance ratios of up to 10:1. Furthermore, we show how theoretical bounds on spectral similarity avoid the need to compare each experimental spectrum against all possible combinations of candidate peptides (achieving speedups of over five orders of magnitude) and demonstrate that mixture-spectra can be identified in a matter of seconds against proteome-scale spectral libraries. Although our approach was developed for and is demonstrated on peptide spectra, we argue that the generality of the methods allows for their direct application to other types of spectral libraries and mixture spectra.The success of tandem MS (MS/MS1) approaches to peptide identification is partly due to advances in computational techniques allowing for the reliable interpretation of MS/MS spectra. Mainstream computational techniques mainly fall into two categories: database search approaches that score each spectrum against peptides in a sequence database (14) or de novo techniques that directly reconstruct the peptide sequence from each spectrum (58). The combination of these methods with advances in high-throughput MS/MS have promoted the accelerated growth of spectral libraries, collections of peptide MS/MS spectra the identification of which were validated by accepted statistical methods (9, 10) and often also manually confirmed by mass spectrometry experts. The similar concept of spectral archives was also recently proposed to denote spectral libraries including “interesting” nonidentified spectra (11) (i.e. recurring spectra with good de novo reconstructions but no database match). The growing availability of these large collections of MS/MS spectra has reignited the development of alternative peptide identification approaches based on spectral matching (1214) and alignment (1517) algorithms.However, mainstream approaches were developed under the (often unstated) assumption that each MS/MS spectrum is generated from a single peptide. Although chromatographic procedures greatly contribute to making this a reasonable assumption, there are several situations where it is difficult or even impossible to separate pairs of peptides. Examples include certain permutations of the peptide sequence or post-translational modifications (see (18) for examples of co-eluting histone modification variants). 
In addition, innovative experimental setups have demonstrated the potential for increased throughput in peptide identification using mixture spectra; examples include data-independent acquisition (19), ion-mobility MS (20), and MSE strategies (21). To alleviate the algorithmic bottleneck in such scenarios, we describe a computational approach, M-SPLIT (mixture-spectrum partitioning using library of identified tandem mass spectra), that is able to reliably and efficiently identify peptides from mixture spectra generated from a pair of peptides. In brief, a mixture spectrum is modeled as a linear combination of two single-peptide spectra, and peptide identification is done by searching against a spectral library. We show that efficient filtration and accurate branch-and-bound strategies can be used to avoid the huge computational cost of searching all possible pairs. Thus equipped, our approach is able to identify the correct matches by considering only a minuscule fraction of all possible matches. Beyond potentially enhancing the identification capabilities of current MS/MS acquisition setups, we argue that the availability of methods to reliably identify MS/MS spectra from mixtures of peptides could enable the collection of MS/MS data using accelerated chromatography setups to obtain the same or better peptide identification results in a fraction of the experimental time currently required for exhaustive peptide separation.
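M-SPLIT models a mixture spectrum as a linear combination of two single-peptide library spectra. The sketch below shows the core fit for one candidate pair, assuming all spectra have been binned onto a common m/z axis; the filtration and branch-and-bound pruning that make the pair search tractable are omitted.

```python
import numpy as np

def fit_two_spectrum_mixture(mixture, lib_a, lib_b):
    """All three spectra are intensity vectors on a shared, binned m/z axis.
    Solve mixture ~= alpha * lib_a + beta * lib_b (alpha, beta >= 0) by least
    squares and score the pair by the cosine between the mixture and its
    reconstruction."""
    m = np.asarray(mixture, dtype=float)
    A = np.column_stack([lib_a, lib_b]).astype(float)
    coef, *_ = np.linalg.lstsq(A, m, rcond=None)
    coef = np.clip(coef, 0.0, None)          # enforce nonnegative abundances
    recon = A @ coef
    denom = np.linalg.norm(m) * np.linalg.norm(recon)
    score = float(recon @ m / denom) if denom else 0.0
    return coef, score

# A library search would evaluate candidate pairs (pruned by filtration and
# branch-and-bound in M-SPLIT) and keep the pair with the highest score.
```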

9.
The orbitrap mass analyzer combines high sensitivity, high resolution, and high mass accuracy in a compact format. In proteomics applications, it is used in a hybrid configuration with a linear ion trap (LTQ-Orbitrap) where the linear trap quadrupole (LTQ) accumulates, isolates, and fragments peptide ions. Alternatively, isolated ions can be fragmented by higher energy collisional dissociation. A recently introduced stand-alone orbitrap analyzer (Exactive) also features a higher energy collisional dissociation cell but cannot isolate ions. Here we report that this instrument can efficiently characterize protein mixtures by alternating MS and “all-ion fragmentation” (AIF) MS/MS scans in a manner similar to that previously described for quadrupole time-of-flight instruments. We applied the peak recognition algorithms of the MaxQuant software at both the precursor and product ion levels. Assignment of fragment ions to co-eluting precursor ions was facilitated by high resolution (100,000 at m/z 200) and high mass accuracy. For efficient fragmentation of different mass precursors, we implemented a stepped collision energy procedure with cumulative MS readout. AIF on the Exactive identified 45 of 48 proteins in an equimolar protein standard mixture and all of them when using a small database. The technique also identified proteins with more than 100-fold abundance differences in a high dynamic range standard. When applied to protein identification in gel slices, AIF unambiguously characterized an immunoprecipitated protein that was barely visible by Coomassie staining and quantified it relative to contaminating proteins. AIF on a benchtop orbitrap instrument is therefore an attractive technology for a wide range of proteomics analyses.Mass spectrometry (MS)-based proteomics is commonly performed in a “shotgun” format where proteins are digested to peptides, which are separated and analyzed by liquid chromatography-tandem mass spectrometry (LC-MS/MS) (1, 2). Many peptides typically co-elute from the column and are selected for fragmentation on the basis of their abundance (“data dependent acquisition”). The precursor mass, which can be determined with high mass accuracy in most current instruments, together with a list of fragment ions, which are often determined at lower mass accuracy, are together used to identify the peptide in a sequence database. This scheme is the basis of most of current proteomics research from the identification of single protein bands to the comprehensive characterization of entire proteomes. To minimize stochastic effects from the selection of peptides for fragmentation and to maximize coverage in complex mixtures, very high sequencing speed is desirable. Although this is achievable, it requires complex instrumentation, and there is still no guarantee that all peptides in a mixture are fragmented and identified. Illustrating this challenge, when the Association of Biomolecular Resource Facilities (ABRF)1 and the Human Proteome Organisation (HUPO) conducted studies of protein identification success in different laboratories, results were varying (4, 5).2 Despite using state of the art proteomics workflows, often with extensive fractionation, only a few laboratories correctly identified all of the proteins in an equimolar 49-protein mixture (ABRF) or a 20-protein mixture (HUPO).As an alternative to data-dependent shotgun proteomics, the mass spectrometer can be operated to fragment the entire mass range of co-eluting analytes. 
This approach has its roots in precursor ion scanning techniques in which all precursors were fragmented simultaneously either in the source region or in the collision cell, and the appearance of specific “reporter ions” for a modification of interest was recorded (6–8). Several groups reported the identification of peptides from MS scans in conjunction with MS/MS scans without precursor ion selection (9–12). Yates and co-workers (13) pursued an intermediate strategy by cycling through the mass range in 10 m/z fragmentation windows. The major challenge of data-independent acquisition is that the direct relationship between precursor and fragments is lost. In most of the above studies, this problem was alleviated by making use of the fact that precursors and fragments have to “co-elute.” In recent years, data-independent proteomics has mainly been pursued on the quadrupole TOF platform where it has been termed MSE in analogy to MS2, MS3, and MSn techniques used for fragmenting one peptide at a time. Geromanos and co-workers (14–16) applied MSE to absolute quantification of proteins in mixtures. Another study showed excellent protein coverage of yeast enolase with data-independent peptide fragmentation where enolase peptide intensities varied over 2 orders of magnitude (17). In a recent comparison of data-dependent and -independent peptide fragmentation, the authors concluded that fragmentation information was highly comparable (18, 19). Recently, the orbitrap mass analyzer (20–23) has been introduced in a benchtop format without the linear ion trap that normally performs ion accumulation, fragmentation, and analysis of the fragments. This instrument, termed Exactive, was developed for small molecule applications such as metabolite analysis. It can be obtained with a higher energy collisional dissociation (HCD) cell (24), enabling efficient fragmentation but no precursor ion selection. This option is called “all-ion fragmentation” (AIF) by the manufacturer, and this is the term that we use below. We reasoned that the high resolution (100,000 compared with 10,000 in quadrupole TOF) and mass accuracy of this device in both the MS and MS/MS modes might facilitate the analysis of the complex fragmentation spectra generated by dissociating several precursors simultaneously. The simplicity and compactness of this instrumentation platform would then make it interesting for diverse proteomics applications.
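Because AIF discards precursor selection, fragments must be tied back to precursors through co-elution. A minimal sketch of that association, correlating a fragment's chromatographic profile with candidate precursor profiles, is given below; it illustrates the principle rather than the MaxQuant peak-recognition algorithms used in the study.

```python
import numpy as np

def elution_correlation(profile_a, profile_b):
    """Pearson correlation of two chromatographic intensity profiles sampled
    at the same scan times (zero intensity where a species is not detected)."""
    a, b = np.asarray(profile_a, float), np.asarray(profile_b, float)
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def assign_fragment(fragment_profile, precursor_profiles, min_corr=0.8):
    """Assign an AIF fragment to the co-eluting precursor whose MS1 profile
    correlates best with the fragment's profile, if above a threshold."""
    best_id, best_r = None, min_corr
    for precursor_id, profile in precursor_profiles.items():
        r = elution_correlation(fragment_profile, profile)
        if r > best_r:
            best_id, best_r = precursor_id, r
    return best_id, best_r
```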

10.
11.
In large-scale proteomic experiments, multiple peptide precursors are often cofragmented simultaneously in the same mixture tandem mass (MS/MS) spectrum. These spectra tend to elude current computational tools because of the ubiquitous assumption that each spectrum is generated from only one peptide. Therefore, tools that consider multiple peptide matches to each MS/MS spectrum can potentially improve the relatively low spectrum identification rate often observed in proteomics experiments. More importantly, data independent acquisition protocols promoting the cofragmentation of multiple precursors are emerging as alternative methods that can greatly improve the throughput of peptide identifications but their success also depends on the availability of algorithms to identify multiple peptides from each MS/MS spectrum. Here we address a fundamental question in the identification of mixture MS/MS spectra: determining the statistical significance of multiple peptides matched to a given MS/MS spectrum. We propose the MixGF generating function model to rigorously compute the statistical significance of peptide identifications for mixture spectra and show that this approach improves the sensitivity of current mixture spectra database search tools by a ≈30–390%. Analysis of multiple data sets with MixGF reveals that in complex biological samples the number of identified mixture spectra can be as high as 20% of all the identified spectra and the number of unique peptides identified only in mixture spectra can be up to 35.4% of those identified in single-peptide spectra.The advancement of technology and instrumentation has made tandem mass (MS/MS)1 spectrometry the leading high-throughput method to analyze proteins (1, 2, 3). In typical experiments, tens of thousands to millions of MS/MS spectra are generated and enable researchers to probe various aspects of the proteome on a large scale. Part of this success hinges on the availability of computational methods that can analyze the large amount of data generated from these experiments. The classical question in computational proteomics asks: given an MS/MS spectrum, what is the peptide that generated the spectrum? However, it is increasingly being recognized that this assumption that each MS/MS spectrum comes from only one peptide is often not valid. Several recent analyses show that as many as 50% of the MS/MS spectra collected in typical proteomics experiments come from more than one peptide precursor (4, 5). The presence of multiple peptides in mixture spectra can decrease their identification rate to as low as one half of that for MS/MS spectra generated from only one peptide (6, 7, 8). In addition, there have been numerous developments in data independent acquisition (DIA) technologies where multiple peptide precursors are intentionally selected to cofragment in each MS/MS spectrum (9, 10, 11, 12, 13, 14, 15). These emerging technologies can address some of the enduring disadvantages of traditional data-dependent acquisition (DDA) methods (e.g. low reproducibility (16)) and potentially increase the throughput of peptide identification 5–10 fold (4, 17). However, despite the growing importance of mixture spectra in various contexts, there are still only a few computational tools that can analyze mixture spectra from more than one peptide (18, 19, 20, 21, 8, 22). 
Our recent analysis indicated that current database search methods for mixture spectra still have relatively low sensitivity compared with their single-peptide counterpart and the main bottleneck is their limited ability to separate true matches from false positive matches (8). Traditionally problem of peptide identification from MS/MS spectra involves two sub-problems: 1) define a Peptide-Spectrum-Match (PSM) scoring function that assigns each MS/MS spectrum to the peptide sequence that most likely generated the spectrum; and 2) given a set of top-scoring PSMs, select a subset that corresponds to statistical significance PSMs. Here we focus on the second problem, which is still an ongoing research question even for the case of single-peptide spectra (23, 24, 25, 26). Intuitively the second problem is difficult because one needs to consider spectra across the whole data set (instead of comparing different peptide candidates against one spectrum as in the first problem) and PSM scoring functions are often not well-calibrated across different spectra (i.e. a PSM score of 50 may be good for one spectrum but poor for a different spectrum). Ideally, a scoring function will give high scores to all true PSMs and low scores to false PSMs regardless of the peptide or spectrum being considered. However, in practice, some spectra may receive higher scores than others simply because they have more peaks or their precursor mass results in more peptide candidates being considered from the sequence database (27, 28). Therefore, a scoring function that accounts for spectrum or peptide-specific effects can make the scores more comparable and thus help assess the confidence of identifications across different spectra. The MS-GF solution to this problem is to compute the per-spectrum statistical significance of each top-scoring PSM, which can be defined as the probability that a random peptide (out of all possible peptide within parent mass tolerance) will match to the spectrum with a score at least as high as that of the top-scoring PSM. This measures how good the current best match is in relation to all possible peptides matching to the same spectrum, normalizing any spectrum effect from the scoring function. Intuitively, our proposed MixGF approach extends the MS-GF approach to now calculate the statistical significance of the top pair of peptides matched from the database to a given mixture spectrum M (i.e. the significance of the top peptide–peptide spectrum match (PPSM)). As such, MixGF determines the probability that a random pair of peptides (out of all possible peptides within parent mass tolerance) will match a given mixture spectrum with a score at least as high as that of the top-scoring PPSM.Despite the theoretical attractiveness of computing statistical significance, it is generally prohibitive for any database search methods to score all possible peptides against a spectrum. Therefore, earlier works in this direction focus on approximating this probability by assuming the score distribution of all PSMs follows certain analytical form such as the normal, Poisson or hypergeometric distributions (29, 30, 31). In practice, because score distributions are highly data-dependent and spectrum-specific, these model assumptions do not always hold. Other approaches tried to learn the score distribution empirically from the data (29, 27). However, one is most interested in the region of the score distribution where only a small fraction of false positives are allowed (typically at 1% FDR). 
This usually corresponds to the extreme tail of the distribution, where p values are on the order of 10^−9 or lower, and thus there is typically a lack of sufficient data points to accurately model the tail of the score distribution (32). More recently, Kim et al. (24) and Alves et al. (33), in parallel, proposed a generating function approach to compute the exact score distribution of random peptide matches for any spectrum without explicitly matching all peptides to the spectrum. Because it is an exact computation, no assumption is made about the form of the score distribution, and the tail of the distribution can be computed very accurately. As a result, this approach substantially improved the ability to separate true matches from false positive ones and led to a significant increase in the sensitivity of peptide identification over state-of-the-art database search tools for single-peptide spectra (24). For mixture spectra, it is expected that the scores for the top-scoring match will be even less comparable across different spectra because now more than one peptide and different numbers of peptides can be matched to each spectrum at the same time. We extend the generating function approach (24) to rigorously compute the statistical significance of multiple-Peptide-Spectrum Matches (mPSMs) and demonstrate its utility toward addressing the peptide identification problem in mixture spectra. In particular, we show how to extend the generating function approach to mixtures of two peptides. We focus on this relatively simple case of mixture spectra because it accounts for a large fraction of the mixture spectra present in traditional DDA workflows (5). This allows us to test and develop algorithmic concepts using readily available DDA data, because data with more complex mixture spectra, such as those from DIA workflows (11), are still not widely available in public repositories.  相似文献   
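As a rough illustration of the generating function idea described above, the sketch below counts, by dynamic programming, how many peptide sequences of a given (integer) precursor mass reach each possible score, and converts that exact distribution into a per-spectrum probability. The reduced amino acid alphabet, integer masses, and shared-peak-count score are simplifying assumptions for illustration, not the published MS-GF/MixGF scoring model.

```python
# Toy generating-function calculation: for every (mass, score) pair, count how
# many peptide sequences of that total integer mass reach that score, where the
# score is the number of prefix masses that coincide with an observed peak.

from collections import defaultdict

# Integer-rounded monoisotopic residue masses for a reduced alphabet (assumption).
RESIDUE_MASSES = {"G": 57, "A": 71, "S": 87, "P": 97, "V": 99,
                  "T": 101, "L": 113, "N": 114, "D": 115, "K": 128}

def score_distribution(peak_masses, precursor_mass):
    """Return {score: number_of_peptides} over all sequences whose residue
    masses sum exactly to precursor_mass."""
    peaks = set(peak_masses)
    # table[m][s] = number of sequences with prefix mass m and score s
    table = defaultdict(lambda: defaultdict(int))
    table[0][0] = 1
    for m in range(1, precursor_mass + 1):
        gained = 1 if m in peaks else 0        # this prefix mass matches a peak
        for residue_mass in RESIDUE_MASSES.values():
            prev = m - residue_mass
            if prev < 0:
                continue
            for s, count in table[prev].items():
                table[m][s + gained] += count
    return dict(table[precursor_mass])

def spectral_probability(peak_masses, precursor_mass, observed_score):
    """Probability that a random peptide of matching mass scores at least
    observed_score (uniform over sequences, as a simplification)."""
    dist = score_distribution(peak_masses, precursor_mass)
    total = sum(dist.values())
    at_least = sum(c for s, c in dist.items() if s >= observed_score)
    return at_least / total if total else 0.0

if __name__ == "__main__":
    peaks = [57, 128, 185, 242, 313]           # hypothetical fragment masses
    print(spectral_probability(peaks, precursor_mass=341, observed_score=3))
```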

12.
13.
14.
Top-down mass spectrometry (MS)-based proteomics is arguably a disruptive technology for the comprehensive analysis of all proteoforms arising from genetic variation, alternative splicing, and posttranslational modifications (PTMs). However, the complexity of top-down high-resolution mass spectra presents a significant challenge for data analysis. In contrast to the well-developed software packages available for data analysis in bottom-up proteomics, the data analysis tools in top-down proteomics remain underdeveloped. Moreover, despite recent efforts to develop algorithms and tools for the deconvolution of top-down high-resolution mass spectra and the identification of proteins from complex mixtures, a multifunctional software platform, which allows for the identification, quantitation, and characterization of proteoforms with visual validation, is still lacking. Herein, we have developed MASH Suite Pro, a comprehensive software tool for top-down proteomics with multifaceted functionality. MASH Suite Pro is capable of processing high-resolution MS and tandem MS (MS/MS) data using two deconvolution algorithms to optimize protein identification results. In addition, MASH Suite Pro allows for the characterization of PTMs and sequence variations, as well as the relative quantitation of multiple proteoforms in different experimental conditions. The program also provides visualization components for validation and correction of the computational outputs. Furthermore, MASH Suite Pro facilitates data reporting and presentation via direct output of the graphics. Thus, MASH Suite Pro significantly simplifies and speeds up the interpretation of high-resolution top-down proteomics data by integrating tools for protein identification, quantitation, characterization, and visual validation into a customizable and user-friendly interface. We envision that MASH Suite Pro will play an integral role in advancing the burgeoning field of top-down proteomics. With well-developed algorithms and computational tools for mass spectrometry (MS)1 data analysis, peptide-based bottom-up proteomics has gained considerable popularity in the field of systems biology (19). Nevertheless, the bottom-up approach is suboptimal for the analysis of protein posttranslational modifications (PTMs) and sequence variants as a result of protein digestion (10). Alternatively, the protein-based top-down proteomics approach analyzes intact proteins, which provides a “bird's eye” view of all proteoforms (11), including those arising from sequence variations, alternative splicing, and diverse PTMs, making it a disruptive technology for the comprehensive analysis of proteoforms (1224). However, the complexity of top-down high-resolution mass spectra presents a significant challenge for data analysis. In contrast to the well-developed software packages available for processing data from bottom-up proteomics experiments, the data analysis tools in top-down proteomics remain underdeveloped. The initial step in the analysis of top-down proteomics data is deconvolution of high-resolution mass and tandem mass spectra. Thorough high-resolution analysis of spectra by Horn (THRASH), which was the first algorithm developed for the deconvolution of high-resolution mass spectra (25), is still widely used. THRASH automatically detects and evaluates individual isotopomer envelopes by comparing the experimental isotopomer envelope with a theoretical envelope and reporting those that score higher than a user-defined threshold. 
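A minimal sketch of this kind of envelope scoring, assuming a Poisson approximation of an averagine-like theoretical envelope and a normalized dot product as the fit score (both are illustrative choices, not the actual THRASH implementation):

```python
# Compare an observed isotope cluster against a theoretical isotopomer envelope
# and keep the assignment only if the fit exceeds a user-defined threshold.

import math

def theoretical_envelope(monoisotopic_mass, n_peaks=6):
    """Approximate relative isotope abundances with a Poisson model whose mean
    grows with mass (~1 extra neutron per ~1800 Da, an assumption)."""
    lam = monoisotopic_mass / 1800.0
    env = [math.exp(-lam) * lam ** k / math.factorial(k) for k in range(n_peaks)]
    total = sum(env)
    return [x / total for x in env]

def cosine_fit(observed, theoretical):
    """Normalized dot product between observed and theoretical envelopes."""
    n = min(len(observed), len(theoretical))
    obs, theo = observed[:n], theoretical[:n]
    dot = sum(o * t for o, t in zip(obs, theo))
    norm = math.sqrt(sum(o * o for o in obs)) * math.sqrt(sum(t * t for t in theo))
    return dot / norm if norm else 0.0

def accept_envelope(observed_intensities, monoisotopic_mass, threshold=0.9):
    score = cosine_fit(observed_intensities, theoretical_envelope(monoisotopic_mass))
    return score >= threshold, score

if __name__ == "__main__":
    # Hypothetical intensities of one charge-deconvoluted isotope cluster.
    cluster = [0.22, 0.31, 0.24, 0.13, 0.07, 0.03]
    print(accept_envelope(cluster, monoisotopic_mass=9000.0))
```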
Another commonly used algorithm, MS-Deconv, utilizes a combinatorial approach to address the difficulty of grouping MS peaks from overlapping isotopomer envelopes (26). Recently, UniDec, which employs a Bayesian approach to separate mass and charge dimensions (27), can also be applied to the deconvolution of high-resolution spectra. Although these algorithms assist in data processing, unfortunately, the deconvolution results often contain a considerable amount of misassigned peaks as a consequence of the complexity of the high-resolution MS and MS/MS data generated in top-down proteomics experiments. Errors such as these can undermine the accuracy of protein identification and PTM localization and, thus, necessitate the implementation of visual components that allow for the validation and manual correction of the computational outputs.Following spectral deconvolution, a typical top-down proteomics workflow incorporates identification, quantitation, and characterization of proteoforms; however, most of the recently developed data analysis tools for top-down proteomics, including ProSightPC (28, 29), Mascot Top Down (also known as Big-Mascot) (30), MS-TopDown (31), and MS-Align+ (32), focus almost exclusively on protein identification. ProSightPC was the first software tool specifically developed for top-down protein identification. This software utilizes “shotgun annotated” databases (33) that include all possible proteoforms containing user-defined modifications. Consequently, ProSightPC is not optimized for identifying PTMs that are not defined by the user(s). Additionally, the inclusion of all possible modified forms within the database dramatically increases the size of the database and, thus, limits the search speed (32). Mascot Top Down (30) is based on standard Mascot but enables database searching using a higher mass limit for the precursor ions (up to 110 kDa), which allows for the identification of intact proteins. Protein identification using Mascot Top Down is fundamentally similar to that used in bottom-up proteomics (34), and, therefore, it is somewhat limited in terms of identifying unexpected PTMs. MS-TopDown (31) employs the spectral alignment algorithm (35), which matches the top-down tandem mass spectra to proteins in the database without prior knowledge of the PTMs. Nevertheless, MS-TopDown lacks statistical evaluation of the search results and performs slowly when searching against large databases. MS-Align+ also utilizes spectral alignment for top-down protein identification (32). It is capable of identifying unexpected PTMs and allows for efficient filtering of candidate proteins when the top-down spectra are searched against a large protein database. MS-Align+ also provides statistical evaluation for the selection of proteoform spectrum match (PrSM) with high confidence. More recently, Top-Down Mass Spectrometry Based Proteoform Identification and Characterization (TopPIC) was developed (http://proteomics.informatics.iupui.edu/software/toppic/index.html). TopPIC is an updated version of MS-Align+ with increased spectral alignment speed and reduced computing requirements. In addition, MSPathFinder, developed by Kim et al., also allows for the rapid identification of proteins from top-down tandem mass spectra (http://omics.pnl.gov/software/mspathfinder) using spectral alignment. 
Although software tools employing spectral alignment, such as MS-Align+ and MSPathFinder, are particularly useful for top-down protein identification, these programs operate via the command line, making them difficult to use for those with limited knowledge of command syntax. Recently, new software tools have been developed for proteoform characterization (36, 37). Our group previously developed MASH Suite, a user-friendly interface for the processing, visualization, and validation of high-resolution MS and MS/MS data (36). Another software tool, ProSight Lite, developed recently by the Kelleher group (37), also allows characterization of protein PTMs. However, both of these software tools require prior knowledge of the protein sequence for the effective localization of PTMs. In addition, neither software tool can process data from liquid chromatography (LC)-MS and LC-MS/MS experiments, which limits their usefulness in large-scale top-down proteomics. Thus, despite these recent efforts, a multifunctional software platform enabling identification, quantitation, and characterization of proteins from top-down spectra, as well as visual validation and data correction, is still lacking. Herein, we report the development of MASH Suite Pro, an integrated software platform designed to incorporate tools for protein identification, quantitation, and characterization into a single comprehensive package for the analysis of top-down proteomics data. This program contains a user-friendly customizable interface similar to the previously developed MASH Suite (36) but also has a number of new capabilities, including the ability to handle complex proteomics data sets from LC-MS and LC-MS/MS experiments, as well as the ability to identify unknown proteins and PTMs using MS-Align+ (32). Importantly, MASH Suite Pro also provides visualization components for the validation and correction of the computational outputs, which ensures accurate and reliable deconvolution of the spectra and localization of PTMs and sequence variations.  相似文献   

15.
Calculating the number of confidently identified proteins and estimating the false discovery rate (FDR) are challenging when analyzing very large proteomic data sets such as entire human proteomes. Biological and technical heterogeneity in proteomic experiments further adds to the challenge, and there are strong differences in opinion regarding the conceptual validity of a protein FDR and no consensus regarding the methodology for protein FDR determination. There are also limitations inherent to the widely used classic target–decoy strategy that become particularly apparent when analyzing very large data sets and that lead to a strong over-representation of decoy identifications. In this study, we investigated the merits of the classic, as well as a novel, target–decoy-based protein FDR estimation approach, taking advantage of a heterogeneous data collection comprising ∼19,000 LC-MS/MS runs deposited in ProteomicsDB (https://www.proteomicsdb.org). The “picked” protein FDR approach treats target and decoy sequences of the same protein as a pair rather than as individual entities and chooses either the target or the decoy sequence depending on which receives the highest score. We investigated the performance of this approach in combination with q-value-based peptide scoring to normalize sample-, instrument-, and search-engine-specific differences. The “picked” target–decoy strategy performed best when protein scoring was based on the best peptide q-value for each protein, yielding a stable number of true positive protein identifications over a wide range of q-value thresholds. We show that this simple and unbiased strategy eliminates a conceptual issue in the commonly used “classic” protein FDR approach that causes overprediction of false-positive protein identification in large data sets. The approach scales from small to very large data sets without losing performance, consistently increases the number of true-positive protein identifications, and is readily implemented in proteomics analysis software. Shotgun proteomics is the most popular approach for large-scale identification and quantification of proteins. The rapid evolution of high-end mass spectrometers in recent years (15) has made feasible proteomic studies that identify and quantify as many as 10,000 proteins in a sample (68) and enables many new lines of scientific research including, for example, the analysis of many human proteomes and proteome-wide protein–drug interaction studies (911). One fundamental step in most proteomic experiments is the identification of proteins in the biological system under investigation. To achieve this, proteins are digested into peptides, analyzed by LC-MS/MS, and the tandem mass spectra are used to interrogate protein sequence databases using search engines that match experimental data to data generated in silico (12, 13). Peptide spectrum matches (PSMs)1 are commonly assigned by a search engine using either a heuristic or a probabilistic scoring scheme (1418). Proteins are then inferred from the identified peptides, and a protein score or probability is derived as a measure of the confidence in the identification (13, 19). Estimating the proportion of false matches (false discovery rate; FDR) in an experiment is important to assess and maintain the quality of protein identifications. Owing to its conceptual and practical simplicity, the most widely used strategy to estimate the FDR in proteomics is the target–decoy database search strategy (target–decoy strategy; TDS) (20). 
The main assumption underlying this idea is that random matches (false positives) should occur with similar likelihood in the target database and the decoy (reversed, shuffled, or otherwise randomized) version of the same database (21, 22). The number of matches to the decoy database, therefore, provides an estimate of the number of random matches one should expect to obtain in the target database. The number of target and decoy hits can then be used to calculate either a local or a global FDR for a given data set (2126). This general idea can be applied to control the FDR at the level of PSMs, peptides, and proteins, typically by counting the number of target and decoy observations above a specified score.Despite the significant practical impact of the TDS, it has been observed that a peptide FDR that results in an acceptable protein FDR (of say 1%) for a small or medium sized data set, turns into an unacceptably high protein FDR when the data set grows larger (22, 27). This is because the basic assumption of the classical TDS is compromised when a large proportion of the true positive proteins have already been identified. In small data sets, containing say only a few hundred to a few thousand proteins, random peptide matches will be distributed roughly equally over all decoy and “leftover” target proteins, allowing for a reasonably accurate estimation of false positive target identifications by using the number of decoy identifications. However, in large experiments comprising hundreds to thousands of LC-MS/MS runs, 10,000 or more target proteins may be genuinely and repeatedly identified, leaving an ever smaller number of (target) proteins to be hit by new false positive peptide matches. In contrast, decoy proteins are only hit by the occasional random peptide match but fully count toward the number of false positive protein identifications estimated from the decoy hits. The higher the number of genuinely identified target proteins gets, the larger this imbalance becomes. If this is not corrected for in the decoy space, an overestimation of false positives will occur.This problem has been recognized and e.g. Reiter and colleagues suggested a way for correcting for the overestimation of false positive protein hits termed MAYU (27). Following the main assumption that protein identifications containing false positive PSMs are uniformly distributed over the target database, MAYU models the number of false positive protein identifications using a hypergeometric distribution. Its parameters are estimated from the number of protein database entries and the total number of target and decoy protein identifications. The protein FDR is then estimated by dividing the number of expected false positive identifications (expectation value of the hypergeometric distribution) by the total number of target identifications. Although this approach was specifically designed for large data sets (tested on ∼1300 LC-MS/MS runs from digests of C. elegans proteins), it is not clear how far the approach actually scales. Another correction strategy for overestimation of false positive rates, the R factor, was suggested initially for peptides (28) and more recently for proteins (29). A ratio, R, of forward and decoy hits in the low probability range is calculated, where the number of true peptide or protein identifications is expected to be close to zero, and hence, R should approximate one. The number of decoy hits is then multiplied (corrected) by the R factor when performing FDR calculations. 
The approach is conceptually simpler than the MAYU strategy and easy to implement, but is also based on the assumption that the inflation of the decoy hits intrinsic in the classic target–decoy strategy occurs to the same extent in all probability ranges.In the context of the above, it is interesting to note that there is currently no consensus in the community regarding if and how protein FDRs should be calculated for data of any size. One perhaps extreme view is that, owing to issues and assumptions related to the peptide to protein inference step and ways of constructing decoy protein sequences, protein level FDRs cannot be meaningfully estimated at all (30). This is somewhat unsatisfactory as an estimate of protein level error in proteomic experiments is highly desirable. Others have argued that target–decoy searches are not even needed when accurate p values of individual PSMs are available (31) whereas others choose to tighten the PSM or peptide FDRs obtained from TDS analysis to whatever threshold necessary to obtain a desired protein FDR (32). This is likely too conservative.We have recently proposed an alternative protein FDR approach termed “picked” target–decoy strategy (picked TDS) that indicated improved performance over the classical TDS in a very large proteomic data set (9) but a systematic investigation of the idea had not been performed at the time. In this study, we further characterized the picked TDS for protein FDR estimation and investigated its scalability compared with that of the classic TDS FDR method in data sets of increasing size up to ∼19,000 LC-MS/MS runs. The results show that the picked TDS is effective in preventing decoy protein over-representation, identifies more true positive hits, and works equally well for small and large proteomic data sets.  相似文献   
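A compact sketch contrasting the classic and the “picked” protein-level target–decoy counting described above; the accession prefix, scores, and threshold are hypothetical, and real implementations work with q-values rather than raw scores:

```python
# Classic vs. "picked" protein-level target-decoy FDR, given the best peptide
# score per protein accession. Decoy accessions carry a "REV_" prefix (assumed).

def classic_protein_fdr(best_scores, threshold):
    """Count every target and every decoy protein above the score threshold."""
    targets = sum(1 for acc, s in best_scores.items()
                  if s >= threshold and not acc.startswith("REV_"))
    decoys = sum(1 for acc, s in best_scores.items()
                 if s >= threshold and acc.startswith("REV_"))
    return decoys / targets if targets else 0.0

def picked_protein_fdr(best_scores, threshold):
    """Treat target/decoy sequences of the same protein as a pair and keep only
    the better-scoring member of each pair before counting."""
    winners = {}
    for acc, score in best_scores.items():
        base = acc[len("REV_"):] if acc.startswith("REV_") else acc
        if base not in winners or score > winners[base][1]:
            winners[base] = (acc, score)
    targets = sum(1 for acc, s in winners.values()
                  if s >= threshold and not acc.startswith("REV_"))
    decoys = sum(1 for acc, s in winners.values()
                 if s >= threshold and acc.startswith("REV_"))
    return decoys / targets if targets else 0.0

if __name__ == "__main__":
    scores = {"P1": 95.0, "REV_P1": 60.0,   # genuine hit; its decoy still scores above threshold
              "P2": 15.0, "REV_P2": 60.0,   # random-only protein; the decoy wins the pair
              "P3": 88.0, "REV_P3": 55.0}
    print(classic_protein_fdr(scores, threshold=50.0))   # 3 decoys / 2 targets = 1.5
    print(picked_protein_fdr(scores, threshold=50.0))    # 1 decoy  / 2 targets = 0.5
```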

16.
Despite the increasing importance of protein glycosylation, most large-scale glycoproteomics studies have been limited to profiling the sites of N-glycosylation. However, in-depth knowledge of protein glycosylation to uncover functions and their clinical applications requires quantitative glycoproteomics eliciting both peptide and glycan sequences concurrently. Here we describe a novel strategy for multiplexed quantitative mouse serum glycoproteomics based on a specific chemical ligation, namely the reverse glycoblotting technique focusing on sialic acids, and multiple reaction monitoring (MRM). LC-MS/MS analysis of de-glycosylated peptides identified 270 mouse serum peptides (95 glycoproteins) as sialylated glycopeptides, of which 67 glycopeptides were fully characterized by MS/MS analyses in a straightforward manner. We revealed the importance of a fragment ion containing the innermost N-acetylglucosamine (GlcNAc) residue as an MRM transition regardless of the sequence of the peptides. The versatility of the reverse glycoblotting-assisted MRM assays was demonstrated by quantitative comparison of 25 targeted glycopeptides from 16 proteins between mice carrying homozygous and heterozygous forms of a diabetes disease model. Clinical proteomics focusing on the identification and validation of biomarkers and the discovery of proteins as therapeutic targets is an emerging and highly important area of proteomics. Biomarkers are measurable indicators of a specific biological state (particularly one relevant to the risk of contracting a disease) and of the presence or the stage of disease, and are thus expected to be useful for the prediction, detection, and diagnosis of disease as well as to follow the efficacy, toxicology, and side effects of drug treatment, and to provide new functional insights into biological processes. At present, proteomics methods based on mass spectrometry (MS) have emerged as the preferred strategy for discovery of diagnostic, prognostic, and therapeutic protein biomarkers. Most biomarker discovery studies use unbiased, “identification-based” approaches that rely on high performance mass spectrometers and extensive sample processing. Semiquantitative comparisons of protein relative abundance between disease and control patient samples are used to identify proteins that are differentially expressed and, thus, to populate lists of potential biomarkers. De novo proteomics discovery experiments often result in tens to hundreds of candidate biomarkers that must be subsequently verified in serum. However, despite the large numbers of putative biomarkers, only a small number of them pass through the development and validation process into clinical practice, and their rate of introduction is declining. Targeted proteomics using multiple reaction monitoring (MRM)1 is emerging as a technology that complements the discovery capabilities of shotgun strategies and as a powerful alternative MS-based approach to measure a series of candidate biomarkers (17). Therefore, MRM is expected to provide a powerful high-throughput platform for biomarker validation, although clinical validation of novel biomarkers has traditionally relied on immunoassays (8, 9). MRM exploits the unique capabilities of triple quadrupole (QQQ) MS for quantitative analysis. 
In MRM, the first and the third quadrupoles act as filters to specifically select predefined m/z values corresponding to the peptide precursor ion and a specific fragment ion of the peptide, whereas the second quadrupole serves as the collision cell. Several such transitions (precursor/fragment ion pairs) are monitored over time, yielding a set of chromatographic traces with the retention time and signal intensity for a specific transition as coordinates. These measurements have been multiplexed to provide 30 or more specific assays in one run. Such methods are slowly gaining acceptance in the clinical laboratory for the routine measurement of endogenous metabolites (10) (e.g. in screening newborns for a panel of inborn errors of metabolism), some drugs (11) (e.g. immunosuppressants), and the component analysis of sugars (12). One of the profound challenges in clinical proteomics is the need to handle highly complex biological mixtures. This complexity presents unique analytical challenges that are further magnified with the use of clinical serum/plasma samples to search for novel biomarkers of human disease. The serum proteome is composed of tens of thousands of unique proteins whose concentrations span a range that may exceed 10 orders of magnitude. Protein glycosylation, one of the most common post-translational modifications, generates tremendous diversity, complexity, and heterogeneity of gene products. It changes the biological and physical properties of proteins, including their functions as signals or ligands that control their distribution, antigenicity, metabolic fate, stability, and solubility. Protein glycosylation, in particular by N-linked glycans, is prevalent in proteins destined for extracellular environments. These include proteins on the extracellular side of the plasma membrane, secreted proteins, and proteins contained in body fluids (such as blood serum, cerebrospinal fluid, urine, breast milk, saliva, lung lavage fluid, or pancreatic juice). Considering that such body fluids are most easily accessible for diagnostic and therapeutic purposes, it is not surprising that many clinical biomarkers and therapeutic targets are glycoproteins. These include, for example, cancer antigen 125 (CA125) in ovarian cancer, human epidermal growth factor receptor 2 (Her2/neu) in breast cancer, and prostate-specific antigen (PSA) in prostate cancer. In addition, changes in the extent of glycosylation and the structure of N-glycans or O-glycans attached to proteins on the cell surface and in body fluids have been shown to correlate with cancer and other disease states, highlighting the clinical importance of this modification as an indicator or effector of pathologic mechanisms (1316). Thus, clinical proteomic platforms should have the capability to provide protein glycosylation information as well as sufficient analytical depth to reliably detect and quantify specific proteins with adequate accuracy and throughput. To improve the detection limits to the required sensitivities, one needs to dramatically reduce the complexity of the serum samples. For focused glycoproteomics, several techniques using lectins or antibodies enabling the large-scale identification of glycoproteins have recently been developed (1719). Notably, Zhang et al. reported a method for the selective isolation of peptides based on chemical oxidation of the carbohydrate moiety and subsequent conjugation to a solid support using hydrazide chemistry (2026). 
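To make the Q1/Q3 idea concrete for the glycopeptide assays discussed here, the sketch below computes a hypothetical transition pair in which the precursor is the intact sialylated glycopeptide and the monitored product is the peptide+GlcNAc (Y1) fragment highlighted in the abstract. The peptide sequence, charge states, and glycan composition are assumptions; only the residue masses are standard monoisotopic values.

```python
# Define a Q1 (precursor) / Q3 (product) transition pair for a glycopeptide,
# using the peptide+GlcNAc "Y1" ion as the monitored fragment.

PROTON = 1.007276
WATER = 18.010565
GLCNAC = 203.079373   # HexNAc residue mass
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
            "V": 99.06841, "T": 101.04768, "L": 113.08406, "N": 114.04293,
            "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
            "F": 147.06841, "R": 156.10111, "Y": 163.06333}

def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUES[aa] for aa in sequence) + WATER

def mz(neutral_mass, charge):
    return (neutral_mass + charge * PROTON) / charge

def glycopeptide_transition(sequence, glycan_mass, precursor_z=3, fragment_z=2):
    """Return (Q1, Q3): precursor m/z of the intact glycopeptide and product m/z
    of the peptide+GlcNAc Y1 ion."""
    precursor = mz(peptide_mass(sequence) + glycan_mass, precursor_z)
    y1 = mz(peptide_mass(sequence) + GLCNAC, fragment_z)
    return round(precursor, 4), round(y1, 4)

if __name__ == "__main__":
    # Hypothetical tryptic peptide carrying a biantennary, disialylated N-glycan
    # (added residue mass of Hex5HexNAc4NeuAc2, about 2204.77 Da).
    print(glycopeptide_transition("LNDSTR", glycan_mass=2204.772))
```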
However, it is not possible to provide any structural information about the N-glycans, because the MS analysis is performed on peptides from which the N-glycans are preferentially removed by treatment with peptide N-glycanase (PNGase). In 2007, we developed a method for rapid enrichment analysis of peptides bearing sialylated N-glycans on the MALDI-TOF-MS platform (27). The method involves highly selective oxidation of the sialic acid residues of glycopeptides to elaborate a terminal aldehyde group and subsequent enrichment by chemical ligation with a polymer reagent, namely the reverse glycoblotting technique, inspired by the original concept of the glycoblotting method (28). This method, in principle, is capable of identifying both glycan and peptide sequences concurrently. Recently, Nilsson et al. reported that glycopeptides from human cerebrospinal fluid can be enriched on the basis of the same principle as the reverse glycoblotting protocol, and the captured glycopeptides were analyzed with ESI FT-ICR MS (29). Because it is well known that sialic acids play important roles in various biological processes including cell differentiation, immune response, and oncogenesis (3034), our attention has been directed toward the feasibility of the reverse glycoblotting technique for the quantitative analysis of specific glycopeptides carrying sialic acid(s) in combination with multiplexed MRM-based MS.  相似文献   

17.
Antibodies are of importance for the field of proteomics, both as reagents for imaging cells, tissues, and organs and as capturing agents for affinity enrichment in mass-spectrometry-based techniques. It is important to gain basic insights regarding the binding sites (epitopes) of antibodies and potential cross-reactivity to nontarget proteins. Knowledge about an antibody''s linear epitopes is also useful in, for instance, developing assays involving the capture of peptides obtained from trypsin cleavage of samples prior to mass spectrometry analysis. Here, we describe, for the first time, the design and use of peptide arrays covering all human proteins for the analysis of antibody specificity, based on parallel in situ photolithic synthesis of a total of 2.1 million overlapping peptides. This has allowed analysis of on- and off-target binding of both monoclonal and polyclonal antibodies, complemented with precise mapping of epitopes based on full amino acid substitution scans. The analysis suggests that linear epitopes are relatively short, confined to five to seven residues, resulting in apparent off-target binding to peptides corresponding to a large number of unrelated human proteins. However, subsequent analysis using recombinant proteins suggests that these linear epitopes have a strict conformational component, thus giving us new insights regarding how antibodies bind to their antigens.Antibodies are used in proteomics both as imaging reagents for the analysis of tissue specificity (1) and subcellular localization (2) and as capturing agents for targeted proteomics (3), in particular for the enrichment of peptides for immunoaffinity methods such as Stable Isotope Standards and Capture by Anti-peptide Antibodies (4). In fact, the Human Proteome Project (5) has announced that one of the three pillars of the project will be antibody-based, with one of the aims being to generate antibodies to at least one representative protein from all protein-coding genes. Knowledge about the binding site (epitope) of an antibody toward a target protein is thus important for gaining basic insights into antibody specificity and sensitivity and facilitating the identification and design of antigens to be used for reagents in proteomics, as well as for the generation of therapeutic antibodies and vaccines (1, 6). With over 20 monoclonal-antibody-based drugs now on the market and over 100 in clinical trials, the field of antibody therapeutics has become a central component of the pharmaceutical industry (7). One of the key parameters for antibodies includes the nature of the binding recognition toward the target, involving either linear epitopes formed by consecutive amino acid residues or conformational epitopes consisting of amino acids brought together by the fold of the target protein (8).A large number of methods have therefore been developed to determine the epitopes of antibodies, including mass spectrometry (9), solid phase libraries (10, 11), and different display systems (1214) such as bacterial display (15) and phage display (16). The most common method for epitope mapping involves the use of soluble and immobilized (tethered) peptide libraries, often in an array format, exemplified by the “Geysen Pepscan” method (11) in which overlapping “tiled” peptides are synthesized and used for binding analysis. 
The tiled peptide approach can also be combined with alanine scans (17), in which alanine substitutions are introduced into the synthetic peptides so that the direct contribution of each amino acid can be investigated. Maier et al. (18) described a high-throughput epitope-mapping screen of a recombinant peptide library consisting of a total of 2304 overlapping peptides of the vitamin D receptor, and recently Buus et al. (19) used in situ synthesis on microarrays to design and generate 70,000 peptides for epitope mapping of antibodies, using a range of peptides with sizes from 4-mer to 20-mer. So far it has not been possible to investigate on- and off-target binding in a proteome-wide manner, but the emergence of new methods for in situ synthesis of peptides on ultra-dense arrays has made this achievable. Here, we describe the design and use of peptide arrays generated with parallel in situ photolithic synthesis (20) of a total of 2.1 million overlapping peptides covering all human proteins. Miniaturization of the peptide arrays (21) has led to improved density of the synthesized peptides and consequently has improved the resolution and coverage of the epitope mapping. This has allowed us to study the specificity and cross-reactivity of both monoclonal and polyclonal antibodies across the whole “epitome” with the use of both proteome-wide arrays and focused-content peptide arrays covering selected antigen sequences to precisely map the contribution of each amino acid of the target protein to the binding recognition of the corresponding antibodies. The results demonstrate the usefulness of proteome-wide epitope mapping and point to a path forward for high-throughput analysis of antibody interactions.  相似文献   
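A small sketch of how such a tiled peptide library and a full alanine substitution scan can be generated computationally; the tile length, step size, and antigen sequence are arbitrary illustrative choices, not the array design used in the study.

```python
# Generate overlapping "tiled" peptides across a protein sequence and the
# single-position alanine substitution variants used in an alanine scan.

def tile_peptides(protein_sequence, length=12, step=3):
    """Overlapping peptides of fixed length covering the whole sequence."""
    tiles = []
    for start in range(0, max(len(protein_sequence) - length, 0) + 1, step):
        tiles.append((start + 1, protein_sequence[start:start + length]))
    return tiles

def alanine_scan(peptide):
    """All single-position alanine substitutions (positions already A are skipped)."""
    variants = []
    for i, residue in enumerate(peptide):
        if residue != "A":
            variants.append((i + 1, residue, peptide[:i] + "A" + peptide[i + 1:]))
    return variants

if __name__ == "__main__":
    antigen = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"   # hypothetical antigen fragment
    for start, pep in tile_peptides(antigen)[:3]:
        print(start, pep)
    for pos, orig, variant in alanine_scan("KQRQISFVKSHF")[:4]:
        print(pos, orig, variant)
```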

18.
The past 15 years have seen significant progress in LC-MS/MS peptide sequencing, including the advent of successful de novo and database search methods; however, analysis of glycopeptide and, more generally, glycoconjugate spectra remains a much more open problem, and much annotation is still performed manually. This is partly because glycans, unlike peptides, need not be linear chains and are instead described by trees. In this study, we introduce SweetSEQer, an extremely simple open source tool for identifying potential glycopeptide MS/MS spectra. We evaluate SweetSEQer on manually curated glycoconjugate spectra and on negative controls, and we demonstrate high quality filtering that can be easily improved for specific applications. We also demonstrate a high overlap between peaks annotated by experts and peaks annotated by SweetSEQer, as well as inferred glycan graphs consistent with canonical glycan tree motifs. This study presents a novel tool for annotating spectra and producing glycan graphs from LC-MS/MS spectra. The tool is evaluated and shown to perform similarly to an expert on manually curated data. Protein glycosylation is a common modification, affecting ∼50% of all expressed proteins (1). Glycosylation affects critical biological functions, including cell-cell recognition, circulating half-life, substrate binding, immunogenicity, and others (2). Regrettably, determining the exact role glycosylation plays in different biological contexts is slowed by a dearth of analytical methods and of appropriate software. Such software is crucial for performing, and for aiding experts in, the data analysis of complex glycosylation. Glycopeptides are highly heterogeneous in regard to glycan composition, glycan structure, and linkage stereochemistry, in addition to the tens of thousands of possible peptides. The analysis of protein glycosylation is often segmented into three distinct types of mass spectrometry experiments, which together help to resolve this complexity. The first analyzes enzymatically or chemically released glycans (which may or may not be chemically modified), the second determines glycosylation sites after release of glycans from peptides, and the third determines the glycosylation sites and the glycans on those sites simultaneously, by MS of intact glycopeptides. Frequently, researchers will perform all three types of analysis, with the first two types providing information about possible combinations of glycan structures and peptides that could be found in the third experiment. Using this MS1 information, the problem is reduced to matching the masses observed with a combinatorial pool of all possible glycans and all possible glycosylated peptides within a sample; however, this combinatorial approach alone is insufficient (3), and tandem mass spectrometry can provide copious additional information to help resolve the glycopeptide content from complex samples. The similar problem of inferring peptide sequences from MS/MS spectra has received considerably more attention. Peptide inference is more constrained than glycan inference, because the chain of MS/MS peaks corresponds to a linear peptide sequence; given an MS/MS spectrum, the linear peptide sequence can be inferred through brute force or dynamic programming via de novo methods (46) as described in Ref. 7. 
Additionally, the possible search space of peptides can be dramatically lowered by using database searching (821) as described in Ref. 7, which compares the MS/MS spectrum to the predicted spectra from only those peptides resulting from a protein database or translated open reading frames (ORFs) of a genomic database.The possible search space of glycans is larger than the search space of peptides because, in contrast to linear peptide chains, glycans may form branching trees. Identifying glycans using database search methodologies is impractical, as it is impractical to define the database when the detailed activities of the set of glycosyltransferases are not defined. Generating an overly large database would artificially inflate the set of incompletely characterized spectra, and too small of a search space would lead to inaccurate results. Furthermore, as glycosylation is not a template-driven process, no clear choice for a database matching approach is available, and de novo sequencing is therefore a more appropriate approach.As a result, few desirable software options are available for the high throughput analysis of tandem mass spectrometry data from intact glycopeptides (as noted in a recent review (22)). In fact, manual annotation of spectra is still commonplace, despite being slow and despite the potential for disagreement between different experts. Some available software requires user-defined lists of glycan and/or peptide masses as input, which is suboptimal from a sample consumption and throughput perspective (23, 24). These lists must typically be generated by parallel experiments or simply hypothesized a priori, meaning omissions in either list may affect the results. Furthermore, some software does not work on batched input files, meaning each spectrum must be analyzed separately (23, 2528). Moreover, there is an even greater lack of open source software for glycoproteomics, so modifying the existing software for the researchers individual applications is not easily achieved. The one open source tool that we know of (GlypID) is applicable only to the analysis of glycopeptide spectra acquired from a very specialized workflow, which requires MS1, CID, and higher-energy C-trap type dissociation (HCD) spectra (29). With that approach, oxonium ions from HCD spectra are necessary to predict the glycan class; potential peptide lists are queried by precursor m/z values (requiring accurate a priori knowledge of all modifications), and possible theoretical “N-linked” precursor m/z values are used to select candidate spectra (using templates, unlike de novo characterization). As a result, the tool is specialized and limited to analysis of “N-linked” glycopeptide spectra from very specific experimental setups.Free, open-source glycoproteomic software capable of batch analysis of general tandem mass spectrometry spectra of glycoconjugates is sorely needed. In this work, we present SweetSEQer, a tool for de novo analysis of tandem mass spectra of glycoconjugates (the most general class of spectra containing fragmentation involving sugars). Furthermore, because SweetSEQer is so general and simple, and because it does not require specific experimental setup, it is widely applicable to the analysis of general glycoconjugate spectra (e.g. it is already applicable to “O-linked” glycopeptide and glycoconjugate spectra). 
Moreover, because it is open source and does not rely on external software, it not only sidesteps problems like MS1 deisotoping but can also be easily customized and even used to augment and complement existing tools like GlypID (and, because we do not use a “copyleft” software license, our algorithm and code can even be added to non-open source and proprietary variants). SweetSEQer's performance was tested on a validated, manually annotated set of glycoconjugate identifications from a urinary glycoproteomics study. Specificity was demonstrated by showing a low identification rate on negative control spectra from Escherichia coli. Annotated structures are shown to be consistent with those of a human expert, as demonstrated by a high overlap in identified glycan fragment ions and by the agreement between SweetSEQer's predicted glycan graphs and the glycan chains produced by an expert. Our simple object-oriented Python implementation is freely available online (Apache 2.0 license).  相似文献   
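One simple way the de novo glycan inference described above can be sketched is to connect fragment peaks whose spacings match monosaccharide residue masses and treat the resulting edges as a candidate glycan graph. The sugar set, mass tolerance, and assumption of singly charged fragments below are illustrative simplifications, not SweetSEQer's actual implementation.

```python
# Link MS/MS peaks whose m/z differences match monosaccharide residue masses
# and return the resulting edges as a candidate glycan graph.

MONOSACCHARIDES = {"Hex": 162.0528, "HexNAc": 203.0794,
                   "dHex": 146.0579, "NeuAc": 291.0954}

def glycan_graph(peaks, tolerance=0.02):
    """Return edges (lower_mz, higher_mz, sugar) for every peak pair whose
    spacing matches a monosaccharide residue mass within the tolerance."""
    peaks = sorted(peaks)
    edges = []
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            delta = high - low
            for sugar, mass in MONOSACCHARIDES.items():
                if abs(delta - mass) <= tolerance:
                    edges.append((low, high, sugar))
    return edges

if __name__ == "__main__":
    # Hypothetical singly charged glycopeptide fragment peaks.
    spectrum = [1157.55, 1319.60, 1360.63, 1522.68, 1563.71, 1725.76]
    for edge in glycan_graph(spectrum):
        print(edge)
```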

19.
Liquid chromatography-tandem mass spectrometry (LC-MS/MS)-based proteomics provides a wealth of information about proteins present in biological samples. In bottom-up LC-MS/MS-based proteomics, proteins are enzymatically digested into peptides prior to query by LC-MS/MS. Thus, the information directly available from the LC-MS/MS data is at the peptide level. If a protein-level analysis is desired, the peptide-level information must be rolled up into protein-level information. We propose a principal component analysis-based statistical method, ProPCA, for efficiently estimating relative protein abundance from bottom-up label-free LC-MS/MS data that incorporates both spectral count information and LC-MS peptide ion peak attributes, such as peak area, volume, or height. ProPCA may be used effectively with a variety of quantification platforms and is easily implemented. We show that ProPCA outperformed existing quantitative methods for peptide-protein roll-up, including spectral counting methods and other methods for combining LC-MS peptide peak attributes. The performance of ProPCA was validated using a data set derived from the LC-MS/MS analysis of a mixture of protein standards (the UPS2 proteomic dynamic range standard introduced by The Association of Biomolecular Resource Facilities Proteomics Standards Research Group in 2006). Finally, we applied ProPCA to a comparative LC-MS/MS analysis of digested total cell lysates prepared for LC-MS/MS analysis by alternative lysis methods and show that ProPCA identified more differentially abundant proteins than competing methods.One of the fundamental goals of proteomics methods for the biological sciences is to identify and quantify all proteins present in a sample. LC-MS/MS-based proteomics methodologies offer a promising approach to this problem (13). These methodologies allow for the acquisition of a vast amount of information about the proteins present in a sample. However, extracting reliable protein abundance information from LC-MS/MS data remains challenging. In this work, we were primarily concerned with the analysis of data acquired using bottom-up label-free LC-MS/MS-based proteomics techniques where “bottom-up” refers to the fact that proteins are enzymatically digested into peptides prior to query by the LC-MS/MS instrument platform (4), and “label-free” indicates that analyses are performed without the aid of stable isotope labels. One challenge inherent in the bottom-up approach to proteomics is that information directly available from the LC-MS/MS data is at the peptide level. When a protein-level analysis is desired, as is often the case with discovery-driven LC-MS research, peptide-level information must be rolled up into protein-level information.Spectral counting (510) is a straightforward and widely used example of peptide-protein roll-up for LC-MS/MS data. Information experimentally acquired in single stage (MS) and tandem (MS/MS) spectra may lead to the assignment of MS/MS spectra to peptide sequences in a database-driven or database-free manner using various peptide identification software platforms (SEQUEST (11) and Mascot (12), for instance); the identified peptide sequences correspond, in turn, to proteins. In principle, the number of tandem spectra matched to peptides corresponding to a certain protein, the spectral count (SC),1 is positively associated with the abundance of a protein (5). In spectral counting techniques, raw or normalized SCs are used as a surrogate for protein abundance. 
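A minimal sketch of the spectral-counting roll-up just described; the length normalization shown is one common variant and is assumed here purely for illustration.

```python
# Count identified MS/MS spectra per protein and apply a simple length
# normalization rescaled across the run.

from collections import Counter

def spectral_counts(psms):
    """psms: iterable of (peptide, protein) assignments from a search engine."""
    return Counter(protein for _, protein in psms)

def normalized_counts(counts, protein_lengths):
    ratios = {p: c / protein_lengths[p] for p, c in counts.items()}
    total = sum(ratios.values())
    return {p: r / total for p, r in ratios.items()}

if __name__ == "__main__":
    psms = [("PEPTIDEA", "P1"), ("PEPTIDEB", "P1"), ("PEPTIDEB", "P1"),
            ("PEPTIDEC", "P2"), ("PEPTIDED", "P3")]
    counts = spectral_counts(psms)                       # {'P1': 3, 'P2': 1, 'P3': 1}
    print(normalized_counts(counts, {"P1": 450, "P2": 120, "P3": 300}))
```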
Spectral counting methods have been moderately successful in quantifying protein abundance and identifying significant proteins in various settings. However, SC-based methods do not make full use of information available from peaks in the LC-MS domain, and this surely leads to loss of efficiency.Peaks in the LC-MS domain corresponding to peptide ion species are highly sensitive to differences in protein abundance (13, 14). Identifying LC-MS peaks that correspond to detected peptides and measuring quantitative attributes of these peaks (such as height, area, or volume) offers a promising alternative to spectral counting methods. These methods have become especially popular in applications using stable isotope labeling (15). However, challenges remain, especially in the label-free analysis of complex proteomics samples where complications in peak detection, alignment, and integration are a significant obstacle. In practice, alignment, identification, and quantification of LC-MS peptide peak attributes (PPAs) may be accomplished using recently developed peak matching platforms (1618). A highly sensitive indicator of protein abundance may be obtained by rolling up PPA measurements into protein-level information (16, 19, 20). Existing peptide-protein roll-up procedures based on PPAs typically involve taking the mean of (possibly normalized) PPA measurements over all peptides corresponding to a protein to obtain a protein-level estimate of abundance. Despite the promise of PPA-based procedures for protein quantification, the performance of PPA-based methods may vary widely depending on the particular roll-up procedure used; furthermore, PPA-based procedures are limited by difficulties in accurately identifying and measuring peptide peak attributes. These two issues are related as the latter issue affects the robustness of PPA-based roll-up methods. Indeed, existing peak matching and quantification platforms tend to result in PPA measurement data sets with substantial missingness (16, 19, 21), especially when working with very complex samples where substantial dynamic ranges and ion suppression are difficulties that must be overcome. Missingness may, in turn, lead to instability in protein-level abundance estimates. A good peptide-protein roll-up procedure that utilizes PPAs should account for this missingness and the resulting instability in a principled way. However, even in the absence of missingness, there is no consensus in the existing literature on peptide-protein roll-up for PPA measurements.In this work, we propose ProPCA, a peptide-protein roll-up method for efficiently extracting protein abundance information from bottom-up label-free LC-MS/MS data. ProPCA is an easily implemented, unsupervised method that is related to principle component analysis (PCA) (22). ProPCA optimally combines SC and PPA data to obtain estimates of relative protein abundance. ProPCA addresses missingness in PPA measurement data in a unified way while capitalizing on strengths of both SCs and PPA-based roll-up methods. In particular, ProPCA adapts to the quality of the available PPA measurement data. If the PPA measurement data are poor and, in the extreme case, no PPA measurements are available, then ProPCA is equivalent to spectral counting. 
On the other hand, if there is no missingness in the PPA measurement data set, then the ProPCA estimate is a weighted mean of PPA measurements and spectral counts where the weights are chosen to reflect the ability of spectral counts and each peptide to predict protein abundance.Below, we assess the performance of ProPCA using a data set obtained from the LC-MS/MS analysis of protein standards (UPS2 proteomic dynamic range standard set2 manufactured by Sigma-Aldrich) and show that ProPCA outperformed other existing roll-up methods by multiple metrics. The applicability of ProPCA is not limited by the quantification platform used to obtain SCs and PPA measurements. To demonstrate this, we show that ProPCA continued to perform well when used with an alternative quantification platform. Finally, we applied ProPCA to a comparative LC-MS/MS analysis of digested total human hepatocellular carcinoma (HepG2) cell lysates prepared for LC-MS/MS analysis by alternative lysis methods. We show that ProPCA identified more differentially abundant proteins than competing methods.  相似文献   
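The sketch below illustrates the general flavor of a PCA-style roll-up that combines spectral counts with peptide peak areas for a single protein, using the first principal component as a relative abundance estimate. The preprocessing, mean imputation of missing peak areas, and sign convention are simplifying assumptions and not the published ProPCA procedure.

```python
# Combine a protein's spectral counts and its peptides' log peak areas across
# samples and use the first principal component as a relative abundance estimate.

import numpy as np

def protein_abundance(spectral_counts, peptide_areas):
    """spectral_counts: (n_samples,) array; peptide_areas: (n_samples, n_peptides)
    array with np.nan for peptides not quantified in a sample."""
    areas = np.log2(peptide_areas)
    # impute missing peptide measurements with the peptide's mean across samples
    col_means = np.nanmean(areas, axis=0)
    missing = np.isnan(areas)
    areas[missing] = np.take(col_means, np.where(missing)[1])
    features = np.column_stack([spectral_counts.astype(float), areas])
    # center and scale each feature, then take the first principal component
    features = (features - features.mean(axis=0)) / (features.std(axis=0) + 1e-9)
    _, _, vt = np.linalg.svd(features, full_matrices=False)
    component = vt[0]
    if component.sum() < 0:            # fix the sign so higher = more abundant
        component = -component
    return features @ component

if __name__ == "__main__":
    counts = np.array([4, 6, 10, 12])                    # spectra per sample
    areas = np.array([[1.0e6, 2.1e6, np.nan],
                      [1.6e6, 2.8e6, 3.0e5],
                      [2.9e6, np.nan, 6.1e5],
                      [3.5e6, 6.0e6, 7.4e5]])
    print(protein_abundance(counts, areas))
```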

20.
The data-independent acquisition (DIA) approach has recently been introduced as a novel mass spectrometric method that promises to combine the high content aspect of shotgun proteomics with the reproducibility and precision of selected reaction monitoring. Here, we evaluate, whether SWATH-MS type DIA effectively translates into a better protein profiling as compared with the established shotgun proteomics.We implemented a novel DIA method on the widely used Orbitrap platform and used retention-time-normalized (iRT) spectral libraries for targeted data extraction using Spectronaut. We call this combination hyper reaction monitoring (HRM). Using a controlled sample set, we show that HRM outperformed shotgun proteomics both in the number of consistently identified peptides across multiple measurements and quantification of differentially abundant proteins. The reproducibility of HRM in peptide detection was above 98%, resulting in quasi complete data sets compared with 49% of shotgun proteomics.Utilizing HRM, we profiled acetaminophen (APAP)1-treated three-dimensional human liver microtissues. An early onset of relevant proteome changes was revealed at subtoxic doses of APAP. Further, we detected and quantified for the first time human NAPQI-protein adducts that might be relevant for the toxicity of APAP. The adducts were identified on four mitochondrial oxidative stress related proteins (GATM, PARK7, PRDX6, and VDAC2) and two other proteins (ANXA2 and FTCD).Our findings imply that DIA should be the preferred method for quantitative protein profiling.Quantitative mass spectrometry is a powerful and widely used approach to identify differentially abundant proteins, e.g. for proteome profiling and biomarker discovery (1). Several tens of thousands of peptides and thousands of proteins can be routinely identified from a single sample injection in shotgun proteomics (2). Shotgun proteomics, however, is limited by low analytical reproducibility. This is due to the complexity of the samples that results in under sampling (supplemental Fig. 1) and to the fact that the acquisition of MS2 spectra is often triggered outside of the elution peak apex. As a result, only 17% of the detectable peptides are typically fragmented, and less than 60% of those are identified. This translates in reliable identification of only 10% of the detectable peptides (3). The overlap of peptide identification across technical replicates is typically 35–60% (4), which results in inconsistent peptide quantification. Alternatively to shotgun proteomics, selected reaction monitoring (SRM) enables quantification of up to 200–300 peptides at very high reproducibility, accuracy, and precision (58).Data-independent acquisition (DIA), a novel acquisition type, overcomes the semistochastic nature of shotgun proteomics (918). Spectra are acquired according to a predefined schema instead of dependent on the data. Targeted analysis of DIA data was introduced with SWATH-MS (19). For the originally published SWATH-MS, the mass spectrometer cycles through 32 predefined, contiguous, 25 Thomson wide precursor windows, and records high-resolution fragment ion spectra (19). This results in a comprehensive measurement of all detectable precursors of the selected mass range. The main novelty of SWATH-MS was in the analysis of the collected DIA data. Predefined fragment ions are extracted using precompiled spectrum libraries, which results in SRM-like data. 
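A simplified sketch of such targeted extraction: for every DIA MS/MS scan whose isolation window covers a peptide's precursor m/z, the intensities of the library fragment ions are summed within a mass tolerance to build SRM-like chromatographic traces. The data layout, tolerance, and example values are assumptions for illustration; production tools such as Spectronaut, Skyline, and OpenSWATH add retention time calibration, peak-group scoring, and error-rate control.

```python
# Extract fragment-ion chromatograms from DIA scans for one library peptide.

def fragment_xics(dia_scans, precursor_mz, fragment_mzs, ppm_tol=20.0):
    """dia_scans: list of dicts with keys 'rt', 'window' = (low, high), and
    'peaks' = list of (mz, intensity). Returns {fragment_mz: [(rt, intensity)]}."""
    traces = {f: [] for f in fragment_mzs}
    for scan in dia_scans:
        low, high = scan["window"]
        if not (low <= precursor_mz < high):
            continue                      # precursor not isolated in this window
        for frag in fragment_mzs:
            tol = frag * ppm_tol / 1e6
            intensity = sum(i for mz, i in scan["peaks"] if abs(mz - frag) <= tol)
            traces[frag].append((scan["rt"], intensity))
    return traces

if __name__ == "__main__":
    scans = [
        {"rt": 30.1, "window": (500.0, 525.0),
         "peaks": [(684.35, 1.2e4), (799.38, 8.0e3)]},
        {"rt": 30.4, "window": (500.0, 525.0),
         "peaks": [(684.351, 2.5e4), (799.381, 1.9e4)]},
        {"rt": 30.4, "window": (525.0, 550.0),
         "peaks": [(684.35, 3.0e3)]},                     # different window, ignored
    ]
    # Hypothetical library entry: precursor 512.27 m/z with two fragment ions.
    print(fragment_xics(scans, precursor_mz=512.27, fragment_mzs=[684.35, 799.38]))
```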
Such targeted analyses are now enabled by several publicly available computational tools, in particular Spectronaut2, Skyline (20), and OpenSWATH (21). The accuracy of peptide identification is evaluated based on the mProphet method (22). We introduce a novel SWATH-MS-type DIA workflow termed hyper reaction monitoring (HRM) (reviewed in (23)) implemented on a Thermo Scientific Q Exactive platform. It consists of comprehensive DIA acquisition and targeted data analysis with retention-time-normalized spectral libraries (24). Its high accuracy of peptide identification and quantification is due to three aspects. First, we developed a novel, improved DIA method. Second, we reimplemented the mProphet (22) approach in the software Spectronaut (www.spectronaut.org). Third, we developed large, optimized, and retention-time-normalized (iRT) spectral libraries. We compared HRM and state-of-the-art shotgun proteomics in terms of their ability to discover differentially abundant proteins. For this purpose, we used a “profiling standard sample set” with 12 non-human proteins spiked at known absolute concentrations into a stable human cell line protein extract. This resulted in quasi-complete data sets for HRM and the detection of a larger number of differentially abundant proteins as compared with shotgun proteomics. We utilized HRM to identify changes in the proteome of primary three-dimensional human liver microtissues after APAP exposure (2527). These primary hepatocytes exhibit active drug metabolism. With a starting material of only 12,000 cells per sample, the abundance of 2,830 proteins was quantified over an APAP concentration range. Six novel NAPQI-cysteine protein adducts that might be relevant for the toxicity of APAP were found and quantified, mainly on mitochondrion-related proteins.  相似文献   
