Similar Literature
 20 similar documents retrieved (search time: 453 ms)
1.
Comprehensive proteomic profiling of biological specimens usually requires multidimensional chromatographic peptide fractionation prior to mass spectrometry. However, this approach can suffer from poor reproducibility because of the lack of standardization and automation of the entire workflow, thus compromising the performance of quantitative proteomic investigations. To address these variables we developed an online peptide fractionation system comprising a multiphasic liquid chromatography (LC) chip that integrates reversed-phase and strong cation exchange chromatography upstream of the mass spectrometer (MS). We showed the superiority of this system for standardizing discovery and targeted proteomic workflows using cancer cell lysates and nondepleted human plasma. Five-step multiphase chip LC-MS/MS acquisition showed clear advantages over analyses of unfractionated samples by identifying more peptides, consuming less sample, and often improving the lower limits of quantitation, all in a highly reproducible, automated, online configuration. We further showed that multiphase chip LC fractionation provided a facile means to detect many N- and C-terminal peptides (including acetylated N termini) that are challenging to identify in complex tryptic peptide matrices because of less favorable ionization characteristics. Given that as many as 95% of peptides were detected in only a single salt fraction from cell lysates, we exploited this high reproducibility and coupled it with multiple reaction monitoring on a high-resolution MS instrument (MRM-HR). This approach increased target analyte peak area and improved lower limits of quantitation without negatively influencing variance or bias. Further, we showed a strategy to use multiphase chip LC fractionation LC-MS/MS for ion library generation to integrate with SWATH data-independent acquisition quantitative workflows. 
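The lower limits of quantitation (LLOQ) discussed above are typically estimated from a calibration curve. As an illustrative sketch only (the calibration figures and the 10·σ/slope rule are generic assumptions, not the authors' exact procedure), one common approximation is LLOQ ≈ 10 × SD(blank) / calibration slope:

```python
# Illustrative sketch: estimating a lower limit of quantitation (LLOQ)
# from a calibration curve as 10 * sd(blank) / slope. This is a generic
# approximation, not the procedure used in the study described above.

def fit_line(x, y):
    """Ordinary least-squares fit returning (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

def lloq(blank_areas, concentrations, peak_areas):
    """LLOQ ~ 10 * standard deviation of blank signal / calibration slope."""
    n = len(blank_areas)
    mean_b = sum(blank_areas) / n
    sd_b = (sum((b - mean_b) ** 2 for b in blank_areas) / (n - 1)) ** 0.5
    slope, _ = fit_line(concentrations, peak_areas)
    return 10 * sd_b / slope

# Hypothetical calibration data: peak area vs. concentration (fmol on column)
conc = [1, 5, 10, 50, 100]
area = [120, 600, 1200, 6000, 12000]   # slope = 120 area units per fmol
blanks = [8, 12, 10, 9, 11]            # replicate blank injections
print(round(lloq(blanks, conc, area), 3))  # → 0.132 (fmol)
```

Improvements in LLOQ, as reported for MRM-HR above, show up in this picture either as a steeper calibration slope (larger peak areas) or as a quieter blank.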
All MS data are available via ProteomeXchange with identifier PXD001464.

Mass spectrometry-based proteomic quantitation is an essential technique for contemporary, integrative biological studies. Whether used in discovery experiments or for targeted biomarker applications, quantitative proteomic studies require high reproducibility at many levels: reproducible run-to-run peptide detection, reproducible peptide quantitation, reproducible depth of proteome coverage, and, ideally, a high degree of cross-laboratory analytical reproducibility. Mass spectrometry-centered proteomics has evolved steadily over the past decade and is now mature enough to derive extensive draft maps of the human proteome (1, 2). Nonetheless, a key requirement yet to be realized is to ensure that quantitative proteomics can be carried out in a timely manner while satisfying the aforementioned challenges associated with reproducibility. This is especially important for recent developments using data-independent MS quantitation and multiple reaction monitoring on high-resolution MS (MRM-HR)1, as both are highly dependent on LC peptide retention time reproducibility and precursor detectability while attempting to maximize proteome coverage (3). Strategies employed to increase the depth of proteome coverage utilize various sample fractionation methods, including gel-based separation, affinity enrichment or depletion, protein or peptide chemical modification-based enrichment, and various peptide chromatography methods, particularly ion exchange chromatography (4–10). In comparison to an unfractionated "naive" sample, the trade-off in using these enrichment/fractionation approaches is a higher risk of sample losses, introduction of undesired chemical modifications (e.g. oxidation, deamidation, N-terminal lactam formation), and the potential for result skewing and bias, as well as the considerable time and human resources required to perform the sample preparation tasks. 
Online-coupled approaches aim to minimize those risks and address resource constraints. A widely practiced example of the benefits of online sample fractionation is the decade-long use of strong cation exchange chromatography (SCX) combined with C18 reversed-phase (RP) chromatography for peptide fractionation (known as MudPIT, multidimensional protein identification technology), in which SCX and RP are performed under the same buffer conditions and SCX elution is performed with volatile organic cations compatible with reversed-phase separation (11). This approach greatly increases analyte detection while avoiding sample handling losses. The MudPIT approach has been widely used for discovery proteomics (12–14), and we have previously shown that multiphasic separations also have utility for targeted proteomics when configured for selected reaction monitoring MS (SRM-MS). We showed substantial advantages of MudPIT-SRM-MS, with reduced ion suppression, increased peak areas, and lower limits of detection (LLOD) compared with conventional RP-SRM-MS (15).

To improve the reproducibility of proteomic workflows, increase throughput, and minimize sample loss, numerous microfluidic devices have been developed and integrated into proteomic applications (16, 17). These devices can broadly be classified into two groups: (1) microfluidic chips for peptide separation (18–25); and (2) proteome reactors that combine enzymatic processing with peptide-based fractionation (26–30). Because of their small dimensions, these devices are readily integrated into nanoLC workflows. Various applications have been described, including increasing proteome coverage (22, 27, 28) and targeting of phosphopeptides (24, 31, 32), glycopeptides, and released glycans (29, 33, 34).

In this work, we set out to take advantage of the benefits of multiphasic peptide separations and address the reproducibility needs of high-throughput comparative proteomics using a variety of workflows. 
We integrated a multiphasic SCX and RP column in a "plug-and-play" microfluidic chip format for online fractionation, eliminating the need for users to make minimal-dead-volume connections between traps and columns. We show the flexibility of this format in providing robust peptide separation and reproducibility across conventional and contemporary mass spectrometry workflows. This was undertaken by coupling the multiphase liquid chromatography (LC) chip to a fast-scanning Q-TOF mass spectrometer for data-dependent MS/MS, data-independent MS (SWATH), and targeted proteomics using MRM-HR, showing clear advantages for repeatable analyses compared with conventional proteomic workflows.

3.
A complete understanding of the biological functions of large signaling peptides (>4 kDa) requires comprehensive characterization of their amino acid sequences and post-translational modifications, which presents significant analytical challenges. In the past decade, there has been great success with mass spectrometry-based de novo sequencing of small neuropeptides. However, these approaches are less applicable to larger neuropeptides because of the inefficient fragmentation of peptides larger than 4 kDa and their lower endogenous abundance. The conventional proteomics approach focuses on large-scale determination of protein identities via database searching and lacks the ability for in-depth elucidation of individual amino acid residues. Here, we present a multifaceted MS approach for the identification and characterization of large crustacean hyperglycemic hormone (CHH)-family neuropeptides, a class of peptide hormones that play central roles in the regulation of many important physiological processes of crustaceans. Six crustacean CHH-family neuropeptides (8–9.5 kDa), including two novel peptides with extensive disulfide linkages and PTMs, were fully sequenced without reference to genomic databases. High-definition de novo sequencing was achieved by a combination of bottom-up, off-line top-down, and on-line top-down tandem MS methods. Statistical evaluation indicated that these methods provided complementary information for sequence interpretation and increased the local identification confidence of each amino acid. Further investigation by MALDI imaging MS mapped the spatial distribution and colocalization patterns of various CHH-family neuropeptides in the neuroendocrine organs, revealing that the two CHH subfamilies are involved in distinct signaling pathways.

Neuropeptides and hormones comprise a diverse class of signaling molecules involved in numerous essential physiological processes, including analgesia, reward, food intake, learning, and memory (1). 
Disorders of the neurosecretory and neuroendocrine systems influence many pathological processes. For example, obesity results from failure of energy homeostasis in association with endocrine alterations (2, 3). Previous work from our lab, using crustaceans as model organisms, found that multiple neuropeptides were implicated in the control of food intake, including RFamides, tachykinin-related peptides, RYamides, and pyrokinins (4–6).

Crustacean hyperglycemic hormone (CHH)1 family neuropeptides play a central role in energy homeostasis of crustaceans (7–17). The hyperglycemic response of the CHHs was first reported after injection of crude eyestalk extract in crustaceans. Based on their preprohormone organization, the CHH family can be grouped into two subfamilies: subfamily I, containing CHH, and subfamily II, containing molt-inhibiting hormone (MIH) and mandibular organ-inhibiting hormone (MOIH). The preprohormones of subfamily I have a CHH precursor-related peptide (CPRP) that is cleaved off during processing, whereas the preprohormones of subfamily II lack the CPRP (9). Uncovering their physiological functions will provide new insights into neuroendocrine regulation of energy homeostasis.

Characterization of CHH-family neuropeptides is challenging. They comprise more than 70 amino acids and often contain multiple post-translational modifications (PTMs) and complex disulfide bridge connections (7). In addition, physiological concentrations of these peptide hormones are typically below the picomolar level, and most crustacean species do not have genome and proteome databases available to assist MS-based sequencing.

MS-based neuropeptidomics provides a powerful tool for rapid discovery and analysis of a large number of endogenous peptides from the brain and the central nervous system. Our group and others have greatly expanded the peptidomes of many model organisms (3, 18–33). 
For example, we have discovered more than 200 neuropeptides, with several neuropeptide families consisting of as many as 20–40 members, in a simple crustacean model system (5, 6, 25–31, 34). However, the majority of these neuropeptides are small peptides only 5–15 amino acid residues long, leaving a gap in the identification of larger signaling peptides from organisms without sequenced genomes. The observed lack of larger peptide hormones can be attributed to the lack of effective de novo sequencing strategies for neuropeptides larger than 4 kDa, which are inherently more difficult to fragment using conventional techniques (34–37). Although classical proteomics studies examine larger proteins, these tools are limited to identification based on database searching with one or more matching peptides, without complete amino acid sequence coverage (36, 38).

Large populations of neuropeptides of 4–10 kDa exist in the nervous systems of both vertebrates and invertebrates (9, 39, 40). Understanding their functional roles requires sufficient molecular knowledge and a unique analytical approach. Therefore, developing effective and reliable methods for de novo sequencing of large neuropeptides at the individual amino acid residue level is an urgent gap to fill in neurobiology. In this study, we present a multifaceted MS strategy aimed at high-definition de novo sequencing and comprehensive characterization of the CHH-family neuropeptides in the crustacean central nervous system. The high-definition de novo sequencing was achieved by a combination of three methods: (1) enzymatic digestion and LC-tandem mass spectrometry (MS/MS) bottom-up analysis to generate detailed sequences of proteolytic peptides; (2) off-line LC fractionation and subsequent top-down MS/MS to obtain high-quality fragmentation maps of intact peptides; and (3) on-line LC coupled to top-down MS/MS to allow rapid sequence analysis of low-abundance peptides. 
Combining the three methods overcomes the limitations of each and thus offers complementary, high-confidence determination of amino acid residues. We report the complete sequence analysis of six CHH-family neuropeptides, including the discovery of two novel peptides. With this accurate molecular information, MALDI imaging and ion mobility MS were conducted for the first time to explore their anatomical distribution and biochemical properties.
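The claim that combining complementary methods raises per-residue identification confidence can be illustrated with a toy support count (the coverage sets below are hypothetical, not data from the study): each method contributes the residue positions it covers with confident fragment ions, and residues seen by more methods earn higher local confidence.

```python
# Toy sketch: per-residue support from complementary sequencing methods.
# Each method contributes a set of residue positions covered by confident
# fragment ions; residues seen by more methods get higher local confidence.
# The coverage sets below are hypothetical, not data from the study.

def residue_support(length, coverages):
    """Return, for residues 1..length, how many methods covered each."""
    return [sum(1 for cov in coverages if i in cov) for i in range(1, length + 1)]

bottom_up  = {1, 2, 3, 4, 5, 6}       # proteolytic peptide coverage
offline_td = {4, 5, 6, 7, 8, 9}       # off-line top-down fragment coverage
online_td  = {2, 3, 4, 10}            # on-line top-down fragment coverage

support = residue_support(10, [bottom_up, offline_td, online_td])
print(support)  # residue 4 is covered by all three methods
print(min(support) >= 1)  # every residue covered by at least one method
```

No single method covers all ten residues here, but their union does, which is the qualitative point the abstract makes.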

4.
Top-down mass spectrometry (MS)-based proteomics is arguably a disruptive technology for the comprehensive analysis of all proteoforms arising from genetic variation, alternative splicing, and posttranslational modifications (PTMs). However, the complexity of top-down high-resolution mass spectra presents a significant challenge for data analysis. In contrast to the well-developed software packages available for data analysis in bottom-up proteomics, the data analysis tools in top-down proteomics remain underdeveloped. Moreover, despite recent efforts to develop algorithms and tools for the deconvolution of top-down high-resolution mass spectra and the identification of proteins from complex mixtures, a multifunctional software platform, which allows for the identification, quantitation, and characterization of proteoforms with visual validation, is still lacking. Herein, we have developed MASH Suite Pro, a comprehensive software tool for top-down proteomics with multifaceted functionality. MASH Suite Pro is capable of processing high-resolution MS and tandem MS (MS/MS) data using two deconvolution algorithms to optimize protein identification results. In addition, MASH Suite Pro allows for the characterization of PTMs and sequence variations, as well as the relative quantitation of multiple proteoforms in different experimental conditions. The program also provides visualization components for validation and correction of the computational outputs. Furthermore, MASH Suite Pro facilitates data reporting and presentation via direct output of the graphics. Thus, MASH Suite Pro significantly simplifies and speeds up the interpretation of high-resolution top-down proteomics data by integrating tools for protein identification, quantitation, characterization, and visual validation into a customizable and user-friendly interface. 
We envision that MASH Suite Pro will play an integral role in advancing the burgeoning field of top-down proteomics.

With well-developed algorithms and computational tools for mass spectrometry (MS)1 data analysis, peptide-based bottom-up proteomics has gained considerable popularity in the field of systems biology (1–9). Nevertheless, the bottom-up approach is suboptimal for the analysis of protein posttranslational modifications (PTMs) and sequence variants as a result of protein digestion (10). Alternatively, the protein-based top-down proteomics approach analyzes intact proteins, which provides a "bird's-eye" view of all proteoforms (11), including those arising from sequence variations, alternative splicing, and diverse PTMs, making it a disruptive technology for the comprehensive analysis of proteoforms (12–24). However, the complexity of top-down high-resolution mass spectra presents a significant challenge for data analysis, and in contrast to the well-developed software packages available for processing data from bottom-up proteomics experiments, the data analysis tools in top-down proteomics remain underdeveloped.

The initial step in the analysis of top-down proteomics data is deconvolution of high-resolution mass and tandem mass spectra. Thorough high-resolution analysis of spectra by Horn (THRASH), the first algorithm developed for the deconvolution of high-resolution mass spectra (25), is still widely used. THRASH automatically detects and evaluates individual isotopomer envelopes by comparing the experimental isotopomer envelope with a theoretical envelope and reporting those that score higher than a user-defined threshold. Another commonly used algorithm, MS-Deconv, utilizes a combinatorial approach to address the difficulty of grouping MS peaks from overlapping isotopomer envelopes (26). 
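The THRASH-style envelope comparison described above can be sketched in a few lines: score an experimental isotopomer envelope against a theoretical prediction and accept it only above a user-defined threshold. The intensity vectors and threshold here are illustrative assumptions, not values from any of the cited tools.

```python
# Minimal sketch of the envelope-comparison idea behind THRASH-style
# deconvolution: score an experimental isotopomer envelope against a
# theoretical one and accept it above a user-defined threshold.
# The intensity vectors below are illustrative, not real spectra.

def cosine_score(exp, theo):
    """Normalized dot product between two intensity envelopes."""
    dot = sum(e * t for e, t in zip(exp, theo))
    norm = (sum(e * e for e in exp) ** 0.5) * (sum(t * t for t in theo) ** 0.5)
    return dot / norm

theoretical  = [0.30, 1.00, 0.95, 0.55, 0.25]  # e.g. an averagine-based prediction
experimental = [0.28, 1.00, 0.90, 0.60, 0.20]  # well-matched envelope
noise        = [1.00, 0.20, 0.90, 0.10, 0.80]  # random peaks, wrong shape

threshold = 0.9
print(cosine_score(experimental, theoretical) > threshold)  # True: accept
print(cosine_score(noise, theoretical) > threshold)         # False: reject
```

Real implementations score many candidate charge states and envelope positions, but the accept/reject decision per envelope reduces to a comparison like this one.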
More recently, UniDec, which employs a Bayesian approach to separate the mass and charge dimensions (27), has also been applied to the deconvolution of high-resolution spectra. Although these algorithms assist in data processing, the deconvolution results often contain a considerable number of misassigned peaks as a consequence of the complexity of the high-resolution MS and MS/MS data generated in top-down proteomics experiments. Such errors can undermine the accuracy of protein identification and PTM localization and thus necessitate visual components that allow for validation and manual correction of the computational outputs.

Following spectral deconvolution, a typical top-down proteomics workflow incorporates identification, quantitation, and characterization of proteoforms; however, most of the recently developed data analysis tools for top-down proteomics, including ProSightPC (28, 29), Mascot Top Down (also known as Big-Mascot) (30), MS-TopDown (31), and MS-Align+ (32), focus almost exclusively on protein identification. ProSightPC was the first software tool specifically developed for top-down protein identification. It utilizes "shotgun annotated" databases (33) that include all possible proteoforms containing user-defined modifications. Consequently, ProSightPC is not optimized for identifying PTMs that are not defined by the user. Additionally, the inclusion of all possible modified forms dramatically increases the size of the database and thus limits the search speed (32). Mascot Top Down (30) is based on standard Mascot but enables database searching with a higher mass limit for the precursor ions (up to 110 kDa), which allows for the identification of intact proteins. Protein identification using Mascot Top Down is fundamentally similar to that used in bottom-up proteomics (34) and is therefore somewhat limited in its ability to identify unexpected PTMs. 
MS-TopDown (31) employs the spectral alignment algorithm (35), which matches top-down tandem mass spectra to proteins in the database without prior knowledge of the PTMs. Nevertheless, MS-TopDown lacks statistical evaluation of the search results and performs slowly when searching large databases. MS-Align+ also utilizes spectral alignment for top-down protein identification (32). It is capable of identifying unexpected PTMs and allows for efficient filtering of candidate proteins when top-down spectra are searched against a large protein database. MS-Align+ also provides statistical evaluation for the selection of proteoform spectrum matches (PrSMs) with high confidence. More recently, Top-Down Mass Spectrometry Based Proteoform Identification and Characterization (TopPIC) was developed (http://proteomics.informatics.iupui.edu/software/toppic/index.html). TopPIC is an updated version of MS-Align+ with increased spectral alignment speed and reduced computing requirements. In addition, MSPathFinder, developed by Kim et al., also allows for rapid identification of proteins from top-down tandem mass spectra using spectral alignment (http://omics.pnl.gov/software/mspathfinder). Although software tools employing spectral alignment, such as MS-Align+ and MSPathFinder, are particularly useful for top-down protein identification, these programs operate via the command line, making them difficult to use for those with limited knowledge of command syntax.

Recently, new software tools have been developed for proteoform characterization (36, 37). Our group previously developed MASH Suite, a user-friendly interface for the processing, visualization, and validation of high-resolution MS and MS/MS data (36). Another software tool, ProSight Lite, developed recently by the Kelleher group (37), also allows characterization of protein PTMs. However, both of these tools require prior knowledge of the protein sequence for effective localization of PTMs. 
In addition, neither tool can process data from liquid chromatography (LC)-MS and LC-MS/MS experiments, which limits their usefulness in large-scale top-down proteomics. Thus, despite these recent efforts, a multifunctional software platform enabling identification, quantitation, and characterization of proteins from top-down spectra, as well as visual validation and data correction, is still lacking.

Herein, we report the development of MASH Suite Pro, an integrated software platform designed to incorporate tools for protein identification, quantitation, and characterization into a single comprehensive package for the analysis of top-down proteomics data. This program has a user-friendly, customizable interface similar to the previously developed MASH Suite (36) but adds a number of new capabilities, including the ability to handle complex proteomics datasets from LC-MS and LC-MS/MS experiments and the ability to identify unknown proteins and PTMs using MS-Align+ (32). Importantly, MASH Suite Pro also provides visualization components for the validation and correction of the computational outputs, which ensures accurate and reliable deconvolution of the spectra and localization of PTMs and sequence variations.

6.
The field of proteomics has evolved hand in hand with technological advances in LC-MS/MS systems, now enabling the analysis of very deep proteomes in a reasonable time. However, most applications do not deal with full cell or tissue proteomes but rather with restricted subproteomes relevant to the research context at hand or resulting from extensive fractionation. At the same time, investigation of many conditions or perturbations puts a strain on measurement capacity. Here, we develop a high-throughput workflow capable of dealing with large numbers of low or medium complexity samples, specifically aiming at the analysis of 96-well plates in a single day (15 min per sample). We combine parallel sample processing with a modified liquid chromatography platform driving two analytical columns in tandem, which are coupled to a quadrupole Orbitrap mass spectrometer (Q Exactive HF). The modified LC platform eliminates idle time between measurements, and the high sequencing speed of the Q Exactive HF reduces the required measurement time. We apply the pipeline to the yeast chromatin remodeling landscape and demonstrate quantification of 96 pull-downs of chromatin complexes in about one day. This is achieved with only 500 μg input material, enabling yeast cultivation in a 96-well format. Our system retrieved known complex members, and the high throughput allowed probing with many bait proteins. Even alternative complex compositions were detectable in these very short gradients. Thus, sample throughput, sensitivity, and LC-MS/MS duty cycle are improved severalfold compared with established workflows. The pipeline can be extended to different types of interaction studies and to other medium complexity proteomes.

Shotgun proteomics is concerned with the identification and quantification of proteins (1–3). Prior to analysis, the proteins are digested into peptides, resulting in highly complex mixtures. 
To deal with this complexity, the peptides are separated by liquid chromatography followed by online analysis with mass spectrometry (MS), today facilitating the characterization of almost complete cell line proteomes in a short time (3–5). In addition to the characterization of entire proteomes, there is also great demand for analyzing low or medium complexity samples. Given the trend toward a systems biology view, relatively large sets of samples often have to be measured. One such category of lower complexity protein mixtures occurs in the determination of physical interaction partners of a protein of interest, which requires the identification and quantification of the proteins "pulled down" or immunoprecipitated via a bait protein. Protein interactions are essential for almost all biological processes and orchestrate a cell's behavior by regulating enzymes, forming macromolecular assemblies, and functionalizing multiprotein complexes that are capable of more complex behavior than the sum of their parts. The human genome has almost 20,000 protein-coding genes, and it has been estimated that 80% of proteins engage in complex interactions and that 130,000 to 650,000 protein interactions can take place in a human cell (6, 7). These numbers demonstrate a clear need for systematic, high-throughput mapping of protein–protein interactions (PPIs) to understand these complexes.

The introduction of generic methods to detect PPIs, such as the yeast two-hybrid screen (Y2H) (8) and affinity purification combined with mass spectrometry (AP-MS)1 (9), has revolutionized the protein interactomics field. AP-MS in particular has emerged as an important tool for cataloguing interactions with the aim of better understanding basic biochemical mechanisms in many different organisms (10–17). It can be performed under near-physiological conditions and is capable of identifying functional protein complexes (18). 
In addition, the combination of affinity purification with quantitative mass spectrometry has greatly improved the discrimination of true interactors from unspecific background binders, a long-standing challenge in the AP-MS field (19–21). Quantitative AP-MS is now employed to address many different biological questions, such as the detection of dynamic changes in PPIs upon perturbation (22–25) or the impact of posttranslational signaling on PPIs (26, 27). Recent developments even make it possible to provide abundance and stoichiometry information for the bait and prey proteins under study, combined with quantitative data from very deep cellular proteomes. Furthermore, sample preparation in AP-MS can now be performed in high-throughput formats capable of producing hundreds of samples per day. With such throughput in sample generation, the LC-MS/MS part of the AP-MS pipeline has become a major bottleneck for large studies, limiting throughput to a small fraction of the available samples. In principle, this limitation could be circumvented by multiplexing analysis via isotope-labeling strategies (28, 29) or by drastically reducing the measurement time per sample (30–32). The former strategy requires exquisite control of the processing steps and has not yet been widely implemented. The latter strategy depends on mass spectrometers with sufficiently high sequencing speed to deal with the pull-down in a very short time. Since its introduction about 10 years ago (33), the Orbitrap mass spectrometer has featured ever-faster sequencing capabilities, with the Q Exactive HF now reaching a peptide sequencing speed of up to 17 Hz (34). This should now make it feasible to substantially lower the amount of time spent per measurement.

Although very short LC-MS/MS runs can in principle be used for high-throughput analyses, they usually lead to a drop in LC-MS duty cycle. 
This is because each sample needs initial washing, loading, and equilibration steps, independent of gradient time, which for most LC setups take a substantial fraction of the run, typically at least 15–20 min. To achieve a more efficient LC-MS duty cycle while maintaining high sensitivity, a second analytical column can be introduced. This enables the parallelization of several steps related to sample loading and LC operation, including valve switching. Such dual analytical column or "double-barrel" setups have been described for various applications and platforms (30, 35–39).

Starting from the reported performance and throughput of today's standard workflows (16, 21, 40–42), we asked whether it would be possible to obtain a severalfold increase in both sample throughput and sensitivity, as well as a considerable reduction in overall wet-lab costs and working time. Specifically, our goal was to quantify 96 medium complexity samples in a single day. Such a number of samples can be processed with a 96-well plate, currently the format of choice for highly parallelized sample preparation workflows, often with a high degree of automation. We investigated which advances were needed in sample preparation, liquid chromatography, and mass spectrometry. Based on our findings, we developed a parallelized platform for high-throughput sample preparation and LC-MS/MS analysis, which we applied to pull-down samples from the yeast chromatin remodeling landscape. The extent of retrieval of known complex members served as a quality control for the developed pipeline.
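The duty-cycle argument above can be made concrete with back-of-the-envelope arithmetic. The numbers below are illustrative assumptions taken from the figures quoted in the text (15 min gradients, ~15 min of per-sample overhead), not measured values:

```python
# Back-of-the-envelope duty-cycle comparison for single- vs. dual-column
# ("double-barrel") LC, using overhead figures quoted in the text.
# All numbers are illustrative assumptions, not measured values.

gradient = 15.0   # min of actual separation per sample
overhead = 15.0   # min of washing, loading, and equilibration per sample

# Single column: overhead is serial, so the MS sits idle while the
# column washes and re-equilibrates.
single_duty = gradient / (gradient + overhead)

# Dual column: one column separates while the other washes/equilibrates,
# hiding the overhead entirely (assuming overhead <= gradient time).
dual_duty = 1.0

samples_per_day_single = 24 * 60 / (gradient + overhead)
samples_per_day_dual = 24 * 60 / gradient

print(f"duty cycle: {single_duty:.0%} single vs {dual_duty:.0%} dual")
print(f"samples/day: {samples_per_day_single:.0f} vs {samples_per_day_dual:.0f}")
```

Under these assumptions the dual-column setup doubles throughput, from 48 to 96 samples per day, which matches the 96-well-plate-per-day goal stated in the abstract.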

7.
Database search programs are essential tools for identifying peptides via mass spectrometry (MS) in shotgun proteomics. Simultaneously achieving high sensitivity and high specificity during a database search is crucial for improving proteome coverage. Here we present JUMP, a new hybrid database search program that generates amino acid tags and ranks peptide spectrum matches (PSMs) by a score integrating the tags and pattern matching. In a typical run of liquid chromatography coupled with high-resolution tandem MS, more than 95% of MS/MS spectra can generate at least one tag, whereas the remaining spectra are usually too poor to yield genuine PSMs. To enhance search sensitivity, the JUMP program enables the use of tags as short as one amino acid. Using a target-decoy strategy, we compared JUMP with other programs (e.g. SEQUEST, Mascot, PEAKS DB, and InsPecT) in the analysis of multiple datasets and found that JUMP outperformed these preexisting programs. JUMP also permitted the analysis of multiple co-fragmented peptides from "mixture spectra" to further increase PSMs. In addition, JUMP-derived tags allowed partial de novo sequencing and facilitated the unambiguous assignment of modified residues. In summary, JUMP is an effective database search algorithm complementary to current search programs.

Peptide identification from tandem mass spectra is a critical step in mass spectrometry (MS)-based1 proteomics (1). Numerous computational algorithms and software tools have been developed for this purpose (2–6). These algorithms can be classified into three categories: (i) pattern-based database search, (ii) de novo sequencing, and (iii) hybrid search that combines database search and de novo sequencing. With the continuous development of high-performance liquid chromatography and high-resolution mass spectrometers, it is now possible to analyze almost all protein components in mammalian cells (7). 
In contrast to rapid data collection, it remains a challenge to extract accurate information from the raw data to identify peptides with low false positive rates (specificity) and minimal false negatives (sensitivity) (8).

Database search methods usually assign peptide sequences by comparing MS/MS spectra to theoretical peptide spectra predicted from a protein database, as exemplified by SEQUEST (9), Mascot (10), OMSSA (11), X!Tandem (12), Spectrum Mill (13), ProteinProspector (14), MyriMatch (15), Crux (16), MS-GFDB (17), Andromeda (18), BaMS2 (19), and Morpheus (20). Other programs, such as SpectraST (21) and Pepitome (22), utilize a spectral library composed of experimentally identified and validated MS/MS spectra. These methods use a variety of scoring algorithms to rank potential peptide spectrum matches (PSMs) and select the top hit as a putative PSM. However, not all PSMs are correctly assigned. For example, false peptides may be assigned to MS/MS spectra with numerous noisy peaks and poor fragmentation patterns. If the samples contain unknown protein modifications, mutations, or contaminants, the related MS/MS spectra also produce false positives, as their corresponding peptides are not in the database. Other false positives may arise simply from random matches. Therefore, it is important to remove these false PSMs to improve dataset quality. One common approach is to filter putative PSMs to achieve a final list with a predefined false discovery rate (FDR) via a target-decoy strategy, in which decoy proteins are merged with target proteins in the same database for estimating false PSMs (23–26). However, true and false PSMs are not always distinguishable based on matching scores. 
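The target-decoy filtering described above amounts to finding the score cutoff at which the ratio of decoy to target matches drops below the desired FDR. A minimal sketch of that idea follows; the scores are hypothetical, and real pipelines separate targets from decoys by database origin rather than by explicit lists:

```python
# Sketch of target-decoy FDR filtering: estimate FDR at each candidate
# score threshold as (#decoy PSMs >= threshold) / (#target PSMs >= threshold)
# and keep the lowest threshold that still satisfies the requested FDR.
# The score lists below are hypothetical.

def fdr_threshold(target_scores, decoy_scores, fdr=0.01):
    """Return the lowest score threshold achieving the requested FDR."""
    best = None
    for thr in sorted(set(target_scores), reverse=True):
        t = sum(1 for s in target_scores if s >= thr)
        d = sum(1 for s in decoy_scores if s >= thr)
        if t and d / t <= fdr:
            best = thr          # keep lowering while the FDR is satisfied
        else:
            break
    return best

targets = [50, 48, 45, 44, 40, 38, 30, 28, 25, 20]  # PSM scores vs. targets
decoys = [32, 27, 22, 18, 15]                       # PSM scores vs. decoys

print(fdr_threshold(targets, decoys, fdr=0.20))  # accept PSMs scoring >= 28
```

This also illustrates the closing point of the paragraph: a high-scoring decoy (32) sits above several true-looking target scores, so no threshold separates true from false matches perfectly.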
It is difficult to set an appropriate score threshold that achieves maximal sensitivity and high specificity (13, 27, 28). De novo methods, including Lutefisk (29), PEAKS (30), NovoHMM (31), PepNovo (32), pNovo (33), Vonovo (34), and UniNovo (35), identify peptide sequences directly from MS/MS spectra. These methods can be used to derive novel peptides and post-translational modifications without a database, which is especially useful when the related genome is not sequenced. High-resolution MS/MS spectra greatly facilitate the generation of peptide sequences in these de novo methods. However, because MS/MS fragmentation cannot always produce all predicted product ions, only a portion of the collected MS/MS spectra have sufficient quality to extract partial or full peptide sequences, leading to lower sensitivity than achieved with the database search methods. To improve the sensitivity of the de novo methods, a hybrid approach has been proposed to integrate peptide sequence tags into PSM scoring during database searches (36). Numerous software packages have been developed, such as GutenTag (37), InsPecT (38), Byonic (39), DirecTag (40), and PEAKS DB (41). These methods use peptide tag sequences to filter a protein database, followed by error-tolerant database searching. One restriction in most of these algorithms is the requirement of a minimum tag length of three amino acids for matching protein sequences in the database. This restriction reduces the sensitivity of the database search, because it filters out some high-quality spectra in which consecutive tags cannot be generated. In this paper, we describe JUMP, a novel tag-based hybrid algorithm for peptide identification. The program is optimized to balance sensitivity and specificity during tag derivation and MS/MS pattern matching. JUMP can use all potential sequence tags, including tags consisting of only one amino acid. 
When we compared its performance to that of two widely used search algorithms, SEQUEST and Mascot, JUMP identified ∼30% more PSMs at the same FDR threshold. In addition, the program provides two additional features: (i) using tag sequences to improve modification site assignment, and (ii) analyzing co-fragmented peptides from mixture MS/MS spectra.
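The tag derivation at the core of such hybrid searches infers short amino acid sequences from mass differences between fragment peaks. A toy sketch of single-residue tag generation (the residue table covers only a few amino acids, and the tolerance and peak values are invented for illustration; chaining consecutive pairs yields longer tags):

```python
# Monoisotopic residue masses (Da) for a few amino acids; a real tool covers all 20.
RESIDUES = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841}

def infer_tags(peaks, tol=0.01):
    """Return 1-residue 'tags' as (lighter peak, heavier peak, amino acid) triples
    whose m/z gap matches a residue mass within tol."""
    tags = []
    peaks = sorted(peaks)
    for i, lo in enumerate(peaks):
        for hi in peaks[i + 1:]:
            for aa, mass in RESIDUES.items():
                if abs((hi - lo) - mass) <= tol:
                    tags.append((lo, hi, aa))
    return tags
```

Allowing tags as short as one residue, as JUMP does, means even sparse spectra in which no consecutive ladder exists still contribute search constraints.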

8.
9.
Quantifying the similarity of spectra is an important task in various areas of spectroscopy, for example, to identify a compound by comparing sample spectra to those of reference standards. In mass spectrometry-based discovery proteomics, spectral comparisons are used to infer the amino acid sequence of peptides. In targeted proteomics by selected reaction monitoring (SRM) or SWATH MS, predetermined sets of fragment ion signals integrated over chromatographic time are used to identify target peptides in complex samples. In both cases, confidence in peptide identification is directly related to the quality of spectral matches. In this study, we used sets of simulated spectra of well-controlled dissimilarity to benchmark different spectral comparison measures and to develop a robust scoring scheme that quantifies the similarity of fragment ion spectra. We applied the normalized spectral contrast angle score to quantify the similarity of spectra to objectively assess fragment ion variability of tandem mass spectrometric datasets, to evaluate portability of peptide fragment ion spectra for targeted mass spectrometry across different types of mass spectrometers, and to discriminate target assays from decoys in targeted proteomics. Altogether, this study validates the use of the normalized spectral contrast angle as a sensitive spectral similarity measure for targeted proteomics and, more generally, provides a methodology to assess the performance of spectral comparisons and to support the rational selection of the most appropriate similarity measure. The algorithms used in this study are made publicly available as an open source toolset with a graphical user interface. In “bottom-up” proteomics, peptide sequences are identified by the information contained in their fragment ion spectra (1). Various methods have been developed to generate peptide fragment ion spectra and to match them to their corresponding peptide sequences. 
They can be broadly grouped into discovery and targeted methods. In the widely used discovery (also referred to as shotgun) proteomic approach, peptides are identified by establishing peptide-to-spectrum matches via a method referred to as database searching. Each acquired fragment ion spectrum is searched against theoretical peptide fragment ion spectra computed from the entries of a specified sequence database, whereby the database search space is constrained to a user-defined precursor mass tolerance (2, 3). The quality of the match between experimental and theoretical spectra is typically expressed with multiple scores. These include the number of matching or nonmatching fragments and the number of consecutive fragment ion matches, among others. With few exceptions (4–7), commonly used search engines do not use the relative intensities of the acquired fragment ion signals, even though this information could be expected to strengthen the confidence of peptide identification, because the relative fragment ion intensity pattern acquired under controlled fragmentation conditions can be considered a unique “fingerprint” for a given precursor. Thanks to community efforts in acquiring and sharing large numbers of datasets, the proteomes of some species are now essentially mapped out, and experimental fragment ion spectra covering entire proteomes are increasingly becoming accessible through spectral databases (8–16). This has catalyzed the emergence of new proteomics strategies that differ from classical database searching in that they use prior spectral information to identify peptides. Those comprise inclusion list sequencing (directed sequencing), spectral library matching, and targeted proteomics (17). These methods explicitly use the information contained in empirical fragment ion spectra, including the fragment ion signal intensity, to identify the target peptide. 
For these methods, it is therefore of highest importance to accurately control and quantify the degree of reproducibility of the fragment ion spectra across experiments, instruments, laboratories, and methods, and to quantitatively assess the similarity of spectra. To date, the dot product (18–24), its corresponding arccosine spectral contrast angle (25–27), (Pearson-like) spectral correlation (28–31), and other geometric distance measures (18, 32) have been used in the literature for assessing spectral similarity. These measures have been used in different contexts, including shotgun spectra clustering (19, 26), spectral library searching (18, 20, 21, 24, 25, 27–29), cross-instrument fragmentation comparisons (22, 30), and scoring transitions in targeted proteomics analyses such as selected reaction monitoring (SRM)1 (23, 31). However, to our knowledge, those scores have never been objectively benchmarked for their performance in discriminating well-defined levels of dissimilarity between spectra. In particular, similarity scores obtained by different methods have not yet been compared for targeted proteomics applications, where the sensitive discrimination of highly similar spectra is critical for the confident identification of targeted peptides. In this study, we have developed a method to objectively assess the similarity of fragment ion spectra. We provide an open-source toolset that supports these analyses. Using a computationally generated benchmark spectral library with increasing levels of well-controlled spectral dissimilarity, we performed a comprehensive and unbiased comparison of the performance of the main scores used to assess spectral similarity in mass spectrometry. We then exemplify how this method, in conjunction with its corresponding benchmarked perturbation spectra set, can be applied to answer several relevant questions for MS-based proteomics. 
As a first application, we show that it can efficiently assess the absolute levels of peptide fragmentation variability inherent to any given mass spectrometer. By comparing the instrument's intrinsic fragmentation conservation distribution to that of the benchmarked perturbation spectra set, nominal values of spectral similarity scores can indeed be translated into a more directly understandable percentage of variability inherent to the instrument fragmentation. As a second application, we show that the method can be used to derive an absolute measure to estimate the conservation of peptide fragmentation between instruments or across proteomics methods. This allowed us to quantitatively evaluate, for example, the transferability of fragment ion spectra acquired by data-dependent analysis on a first instrument into a fragment/transition assay list used for targeted proteomics applications (e.g. SRM or targeted extraction of data-independent acquisition SWATH MS (33)) on another instrument. Third, we used the method to probe the fragmentation patterns of peptides carrying a post-translational modification (e.g. phosphorylation) by comparing the spectra of modified peptides with those of their unmodified counterparts. Finally, we used the method to determine the overall level of fragmentation conservation that is required to support target-decoy discrimination and peptide identification in targeted proteomics approaches such as SRM and SWATH MS.
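The spectral contrast angle at the center of this study is computed from two fragment intensity vectors aligned on the same fragment ions. A minimal sketch follows; the mapping of the arccosine onto [0, 1] shown here is one common normalization and is our assumption, so the published toolset should be consulted for the exact definition used in the paper:

```python
import math

def spectral_contrast_angle(a, b):
    """Normalized spectral contrast angle between two intensity vectors:
    1.0 for proportional (identical) spectra, 0.0 for orthogonal spectra."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    cos = max(-1.0, min(1.0, dot / (na * nb)))  # clamp against rounding error
    return 1.0 - 2.0 * math.acos(cos) / math.pi
```

Because the score depends only on relative intensities, a spectrum and a uniformly scaled copy of it score exactly 1.0, which is the property that makes the measure suitable for cross-instrument comparisons.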

10.
Based on the conventional data-dependent acquisition strategy of shotgun proteomics, we present a new workflow, DeMix, which significantly increases the efficiency of peptide identification for in-depth shotgun analysis of complex proteomes. Capitalizing on the high resolution and mass accuracy of Orbitrap-based tandem mass spectrometry, we developed a simple deconvolution method of “cloning” chimeric tandem spectra for cofragmented peptides. In addition to a database search, a simple rescoring scheme utilizes mass accuracy and converts the unwanted cofragmenting events into a surprising advantage of multiplexing. With the combination of cloning and rescoring, we obtained on average nine peptide-spectrum matches per second on a Q-Exactive workbench, whereas the actual MS/MS acquisition rate was close to seven spectra per second. This efficiency boost to 1.24 identified peptides per MS/MS spectrum enabled analysis of over 5000 human proteins in single-dimensional LC-MS/MS shotgun experiments with only a two-hour gradient. These findings suggest a change in the dominant “one MS/MS spectrum–one peptide” paradigm for data acquisition and analysis in shotgun data-dependent proteomics. DeMix also demonstrated higher robustness than conventional approaches in terms of lower variation among the results of consecutive LC-MS/MS runs. Shotgun proteomics analysis based on a combination of high performance liquid chromatography and tandem mass spectrometry (MS/MS) (1) has achieved remarkable speed and efficiency (2–7). In a single four-hour high performance liquid chromatography-MS/MS run, over 40,000 peptides and 5000 proteins can be identified using a high-resolution Orbitrap mass spectrometer with data-dependent acquisition (DDA)1 (2, 3). However, in a typical LC-MS analysis of unfractionated human cell lysate, over 100,000 individual peptide isotopic patterns can be detected (4), which corresponds to the simultaneous elution of hundreds of peptides. 
With this complexity, a mass spectrometer needs to achieve a ≥25 Hz MS/MS acquisition rate to fully sample all the detectable peptides, and ≥17 Hz to cover reasonably abundant ones (4). Although this acquisition rate is reachable by modern time-of-flight (TOF) instruments, the reported DDA identification results do not encompass all expected peptides. Recently, the next-generation Orbitrap instrument, working at a 20 Hz MS/MS acquisition rate, demonstrated nearly full profiling of the yeast proteome using an 80 min gradient, which opened the way for comprehensive analysis of the human proteome in a time-efficient manner (5). During the high performance liquid chromatography-MS/MS DDA analysis of complex samples, the high density of co-eluting peptides results in a high probability for two or more peptides to overlap within an MS/MS isolation window. With the commonly used ±1.0–2.0 Th isolation windows, most MS/MS spectra are chimeric (4, 8–10), with cofragmenting precursors being naturally multiplexed. However, as has been discussed previously (9, 10), the cofragmentation events are currently ignored in most of the conventional analysis workflows. According to the prevailing assumption of “one MS/MS spectrum–one peptide,” chimeric MS/MS spectra are generally unwelcome in DDA, because the product ions from different precursors may interfere with the assignment of MS/MS fragment identities, increasing the rate of false discoveries in database search (8, 9). In some studies, the precursor isolation width was set as narrow as ±0.35 Th to prevent unwanted ions from being coselected, fragmented, or detected (4, 5). On the contrary, multiplexing by cofragmentation is considered to be one of the solid advantages of data-independent acquisition (DIA) (10–13). In several commonly used DIA methods, the precursor ion selection windows are set much wider than in DDA: from 25 Th in SWATH (12) to an extremely broad range in AIF (13). 
In order to use the benefit of MS/MS multiplexing in DDA, several approaches have been proposed to deconvolute chimeric MS/MS spectra. In the “alternative peptide identification” method implemented in Percolator (14), a machine learning algorithm reranks and rescores peptide-spectrum matches (PSMs) obtained from one or more MS/MS search engines. However, the deconvolution in Percolator is limited to cofragmented peptides with masses differing from the target peptide by the tolerance of the database search, which can be as narrow as a few ppm. The “active demultiplexing” method proposed by Ledvina et al. (15) actively separates MS/MS data from several precursors using masses of complementary fragments. However, higher-energy collisional dissociation often produces MS/MS spectra with too few complementary pairs for reliable peptide identification. The “MixDB” method introduces a sophisticated new search engine, also with a machine learning algorithm (9), and the “second peptide identification” method implemented in the Andromeda/MaxQuant workflow (16) submits the same dataset to the search engine several times based on the list of chromatographic peptide features, subtracting assigned MS/MS peaks after each identification round. This approach is similar to the ProbIDTree search engine, which also performed iterative identification while removing assigned peaks after each round of identification (17). One important factor for spectral deconvolution that has not been fully utilized in most conventional workflows is the excellent mass accuracy achievable with modern high-resolution mass spectrometry (18). An Orbitrap Fourier-transform mass spectrometer can provide mass accuracy in the range of hundreds of ppb (parts per billion) for mass peaks with a high signal-to-noise (S/N) ratio (19). However, the mass error of peaks with lower S/N ratios can be significantly higher and exceed 1 ppm. 
Despite this dependence of mass accuracy on the S/N level, most MS and MS/MS search engines only allow users to set hard cut-off values for the mass error tolerances. Moreover, some search engines do not provide the option of choosing a relative error tolerance for MS/MS fragments. Such negligent treatment of mass accuracy reduces the analytical power of high accuracy experiments (18). Identification results coming from different MS/MS search engines are sometimes inconsistent because of the different statistical assumptions used in scoring PSMs. The introduction of tools integrating the results of different search engines (14, 20, 21) makes data interpretation even more complex and opaque for the user. The opposite trend—simplification of MS/MS data interpretation—is therefore a welcome development. For example, an extremely straightforward algorithm recently proposed by Wenger et al. (22) demonstrated surprisingly high performance in peptide identification, even though it is only marginally more complex than simply counting the number of matches of theoretical fragment peaks in high resolution MS/MS, without any a priori statistical assumption. In order to take advantage of the natural multiplexing of MS/MS spectra in DDA, as well as properly utilize the high accuracy of Orbitrap-based mass spectrometry, we developed a simple and robust data analysis workflow, DeMix. It is presented in Fig. 1 as an expansion of the conventional workflow. Principles of some of the processes used by the workflow are borrowed from other approaches, including the custom-made mass peak centroiding (20), chromatographic feature detection (19, 20), and a two-pass database search with the first limited pass providing a “software lock mass” for mass scale recalibration (23). Fig. 1. An overview of the DeMix workflow that expands the conventional workflow, shown by the dashed line. 
Processes are colored in purple for TOPP, red for the search engine (Morpheus/Mascot/MS-GF+), and blue for in-house programs. In the DeMix workflow, the deconvolution of chimeric MS/MS spectra consists of simply “cloning” an MS/MS spectrum if a potential cofragmented peptide is detected. The list of candidate peptide precursors is generated from chromatographic feature detection, as in the MaxQuant/Andromeda workflow (16, 19), but using The OpenMS Proteomics Pipeline (TOPP) (20, 24). During the cloning, the precursor is replaced by the new candidate, but no changes in the MS/MS fragment list are made, and therefore the cloned MS/MS spectra remain chimeric. Processing such spectra requires a search engine tolerant to the presence of unassigned peaks, as such peaks are always expected when multiple precursors cofragment. Thus, we chose Morpheus (22) as the search engine. Based on the original search algorithm, we implemented a reformed scoring scheme, Morpheus-AS (advanced scoring), which inherits all the basic principles of Morpheus but makes deeper use of the high mass accuracy of the data. This kind of database search removes the necessity of spectral processing for physical separation of MS/MS data into multiple subspectra (15) or consecutive subtraction of peaks (16, 17). Despite the fact that the DeMix workflow is largely a combination of known approaches, it provides a remarkable improvement over the state of the art. On our Orbitrap Q-Exactive workbench, testing on a benchmark dataset of two-hour single-dimension LC-MS/MS experiments from HeLa cell lysate, we identified on average 1.24 peptides per MS/MS spectrum, breaking the “one MS/MS spectrum–one peptide” paradigm at the level of the whole dataset. At 1% false discovery rate (FDR), we obtained on average nine PSMs per second (at the actual acquisition rate of ca. seven MS/MS spectra per second) and detected 40 human proteins per minute.
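The cloning step lends itself to a compact sketch: for every detected chromatographic feature whose precursor m/z falls inside the MS/MS isolation window, emit a copy of the spectrum with that feature as precursor, leaving the fragment list untouched. The data structures and window width below are invented for illustration, not taken from the DeMix implementation:

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass(frozen=True)
class Spectrum:
    precursor_mz: float
    fragments: Tuple[float, ...]  # fragment m/z values; cloning never touches these

def clone_chimeric(spectrum: Spectrum, feature_mzs: List[float],
                   half_window: float = 1.5) -> List[Spectrum]:
    """Emit one spectrum per candidate precursor (detected feature m/z) inside the
    isolation window; the shared fragment list stays chimeric in every clone."""
    candidates = [mz for mz in feature_mzs
                  if abs(mz - spectrum.precursor_mz) <= half_window]
    return [replace(spectrum, precursor_mz=mz) for mz in candidates] or [spectrum]
```

Each clone is then searched independently, which is why the downstream search engine must tolerate the unassigned peaks belonging to the other cofragmented precursors.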

11.
Isobaric labeling techniques coupled with high-resolution mass spectrometry have been widely employed in proteomic workflows requiring relative quantification. For each high-resolution tandem mass spectrum (MS/MS), isobaric labeling techniques can be used not only to quantify the peptide from different samples by reporter ions, but also to identify the peptide it is derived from. Because the ions related to isobaric labeling may act as noise in database searching, the MS/MS spectrum should be preprocessed before peptide or protein identification. In this article, we demonstrate that there are many high-frequency, high-abundance isobaric-related ions in the MS/MS spectrum, and that removing isobaric-related ions combined with deisotoping and deconvolution in MS/MS preprocessing procedures significantly improves the peptide/protein identification sensitivity. The user-friendly software package TurboRaw2MGF (v2.0) has been implemented for converting raw TIC data files to Mascot generic format files and can be downloaded for free from https://github.com/shengqh/RCPA.Tools/releases as part of the software suite ProteomicsTools. The data have been deposited to the ProteomeXchange with identifier PXD000994. Mass spectrometry-based proteomics has been widely applied to investigate protein mixtures derived from tissue, cell lysates, or body fluids (1, 2). Liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS)1 is the most popular strategy for the analysis of protein/peptide mixtures in shotgun proteomics (3). Large-scale protein/peptide mixtures are separated by liquid chromatography followed by online detection by tandem mass spectrometry. The capabilities of proteomics rely greatly on the performance of the mass spectrometer. With the improvement of MS technology, proteomics has benefited significantly from high resolution and excellent mass accuracy (4). 
In recent years, based on the higher efficiency of higher-energy collisional dissociation (HCD), a new “high–high” strategy (high-resolution MS as well as high-resolution MS/MS (tandem MS)) has been applied instead of the “high–low” strategy (high-resolution MS, i.e. in the Orbitrap, and low-resolution MS/MS, i.e. in the ion trap) to obtain high quality tandem MS/MS data as well as full MS in shotgun proteomics. Both full MS scans and MS/MS scans can be performed, and the whole cycle time of MS detection is very compatible with the chromatographic time scale (5). High-resolution measurement is one of the most important features in mass spectrometric application. In this high–high strategy, high-resolution and accurate spectra are achieved in tandem MS/MS scans as well as full MS scans, which makes isotopic peaks distinguishable from one another, thus enabling the easy calculation of precise charge states and monoisotopic masses. During an LC-MS/MS experiment, a multiply charged precursor ion (peptide) is usually isolated and fragmented, and then the multiple charge states of the fragment ions are generated and collected. After full extraction of peak lists from the original tandem mass spectra, the commonly used search engines (i.e. Mascot (6), Sequest (7)) have no capability to distinguish isotopic peaks and recognize charge states, so all of the product ions are considered under all charge state hypotheses during the database search for protein identification. These multiple charge states of fragment ions and their isotopic cluster peaks can be incorrectly assigned by the search engine, which can cause false peptide identification. To overcome this issue, data preprocessing of the high-resolution MS/MS spectra is required before submitting them for identification. There are usually two major preprocessing steps used for high-resolution MS/MS data: deisotoping and deconvolution (8, 9). Deisotoping of spectra removes all isotopic peaks in each isotopic cluster except the monoisotopic peak. 
Deconvolution of spectra translates multiply charged ions to singly charged ions and also accumulates the intensity of fragment ions by summing up the intensities from all of their multiply charged states. After performing these two data-preprocessing steps, the resulting spectra are simpler and cleaner and allow more precise database searching and accurate bioinformatics analysis. With the capacity to analyze multiple samples simultaneously, stable isotope labeling approaches have been widely used in quantitative proteomics. Stable isotope labeling approaches are categorized as metabolic labeling (SILAC, stable isotope labeling by amino acids in cell culture) and chemical labeling (10, 11). The peptides labeled by the SILAC approach are quantified by precursor ions in full MS spectra, whereas peptides that have been isobarically labeled using chemical means are quantified by reporter ions in MS/MS spectra. There are two similar isobaric chemical labeling methods: (1) isobaric tag for relative and absolute quantification (iTRAQ), and (2) tandem mass tag (TMT) (12, 13). These reagents contain an amino-reactive group that specifically reacts with N-terminal amino groups and epsilon-amino groups of lysine residues to label digested peptides in a typical shotgun proteomics experiment. There are four different formats of isobaric tags: TMT two-plex, iTRAQ four-plex, TMT six-plex, and iTRAQ eight-plex (12–16). The number before “plex” denotes the number of samples that can be analyzed by the mass spectrometer simultaneously. Peptides labeled with different isotopic variants of the tag show identical or similar mass and appear as a single peak in full scans. This single peak may be selected for subsequent MS/MS analysis. 
In an MS/MS scan, the masses of the reporter ions (114 to 117 for iTRAQ four-plex, 113 to 121 for iTRAQ eight-plex, and 126 to 131 for TMT six-plex upon CID or HCD activation) are associated with the corresponding samples, and their intensities represent the relative abundances of the labeled peptides. Meanwhile, the other ions in the MS/MS spectra can be used for peptide identification. Because of the multiplexing capability, isobaric labeling methods combined with bottom-up proteomics have been widely applied for accurate quantification of proteins on a global scale (14, 17–19). Although mostly associated with peptide labeling, these isobaric labeling methods have also been applied at the protein level (20–23). For the proteomic analysis of isobarically labeled peptides/proteins in the “high–high” MS strategy, the common consensus is that accurate reporter ions contribute to more accurate quantification. However, there is no evidence to show how the ions related to isobaric labeling affect peptide/protein identification and what preprocessing steps should be taken for high-resolution isobarically labeled MS/MS. To demonstrate the effectiveness and importance of preprocessing, we examined how combinations of preprocessing steps improved peptide/protein sensitivity in database searching. Several combinations of data-preprocessing steps were applied for high-throughput data analysis, including deisotoping to keep simple monoisotopic mass peaks, deconvolution of ions with multiple charge states, and preservation of the top 10 peaks in every 100-Dalton mass range. After systematic analysis of high-resolution isobarically labeled spectra, we further processed the spectra and removed interfering ions that were not related to the peptide. Our results suggested that the preprocessing of isobarically labeled high-resolution tandem mass spectra significantly improved the peptide/protein identification sensitivity.
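The deconvolution step described in this abstract reduces each multiply charged fragment to its singly charged equivalent: a fragment observed at m/z with charge z corresponds to a singly charged ion at mz*z - (z-1)*1.00728, where 1.00728 Da is the proton mass. A minimal sketch of charge-state collapsing with intensity accumulation (the rounding-based merge key is a simplification of real peak matching):

```python
PROTON = 1.00728  # mass of a proton in Da

def to_singly_charged(mz, z):
    """Convert a fragment observed at m/z with charge z to its singly charged m/z."""
    return mz * z - (z - 1) * PROTON

def deconvolute(peaks):
    """peaks: list of (mz, z, intensity). Collapse charge states onto singly
    charged m/z, summing intensities that land on the same (rounded) mass."""
    merged = {}
    for mz, z, inten in peaks:
        key = round(to_singly_charged(mz, z), 3)
        merged[key] = merged.get(key, 0.0) + inten
    return sorted(merged.items())
```

After this step a doubly charged fragment and its singly charged counterpart contribute a single, stronger peak, which is exactly the simplification that helps the downstream database search.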

12.
Optimal performance of LC-MS/MS platforms is critical to generating high quality proteomics data. Although individual laboratories have developed quality control samples, there is no widely available performance standard of biological complexity (and associated reference data sets) for benchmarking of platform performance for analysis of complex biological proteomes across different laboratories in the community. Individual preparations of the yeast Saccharomyces cerevisiae proteome have been used extensively by laboratories in the proteomics community to characterize LC-MS platform performance. The yeast proteome is uniquely attractive as a performance standard because it is the most extensively characterized complex biological proteome and the only one associated with several large scale studies estimating the abundance of all detectable proteins. In this study, we describe a standard operating protocol for large scale production of the yeast performance standard and offer aliquots to the community through the National Institute of Standards and Technology where the yeast proteome is under development as a certified reference material to meet the long term needs of the community. Using a series of metrics that characterize LC-MS performance, we provide a reference data set demonstrating typical performance of commonly used ion trap instrument platforms in expert laboratories; the results provide a basis for laboratories to benchmark their own performance, to improve upon current methods, and to evaluate new technologies. 
Additionally, we demonstrate how the yeast reference, spiked with human proteins, can be used to benchmark the power of proteomics platforms for detection of differentially expressed proteins at different levels of concentration in a complex matrix, thereby providing a metric to evaluate and minimize preanalytical and analytical variation in comparative proteomics experiments. Access to proteomics performance standards is essential for several reasons. First, to generate the highest quality data possible, proteomics laboratories routinely benchmark and perform quality control (QC)1 monitoring of the performance of their instrumentation using standards. Second, appropriate standards greatly facilitate the development of improvements in technologies by providing a timeless standard with which to evaluate new protocols or instruments that claim to improve performance. For example, it is common practice for an individual laboratory considering purchase of a new instrument to require the vendor to run “demo” samples so that data from the new instrument can be compared head to head with existing instruments in the laboratory. Third, large scale proteomics studies designed to aggregate data across laboratories can be facilitated by the use of a performance standard to measure reproducibility across sites or to compare the performance of different LC-MS configurations or sample processing protocols used between laboratories to facilitate development of optimized standard operating procedures (SOPs). Most individual laboratories have adopted their own QC standards, which range from mixtures of known synthetic peptides to digests of bovine serum albumin or more complex mixtures of several recombinant proteins (1). However, because each laboratory performs QC monitoring in isolation, it is difficult to compare the performance of LC-MS platforms throughout the community. Several standards for proteomics are available for request or purchase (2, 3). 
RM8327 is a mixture of three peptides developed as a reference material in collaboration between the National Institute of Standards and Technology (NIST) and the Association of Biomolecular Resource Facilities. Mixtures of 15–48 purified human proteins are also available, such as the HUPO (Human Proteome Organisation) Gold MS Protein Standard (Invitrogen), the Universal Proteomics Standard (UPS1; Sigma), and CRM470 from the European Union Institute for Reference Materials and Measurements. Although defined mixtures of peptides or proteins can address some benchmarking and QC needs, there is an additional need for more complex reference materials to fully represent the challenges of LC-MS data acquisition in the complex matrices encountered in biological samples (2, 3). Although it has not been widely distributed as a reference material, the yeast Saccharomyces cerevisiae proteome has been extensively used by the proteomics community to characterize the capabilities of a variety of LC-MS-based approaches (4–15). Yeast provides a uniquely attractive complex performance standard for several reasons. Yeast encodes a complex proteome consisting of ∼4,500 proteins expressed during normal growth conditions (7, 16–18). The concentration range of yeast proteins is sufficient to challenge the dynamic range of conventional mass spectrometers; the abundance of proteins ranges from fewer than 50 to more than 10⁶ molecules per cell (4, 15, 16). Additionally, it is the most extensively characterized complex biological proteome and the only one associated with several large scale studies estimating the abundance of all detectable proteins (5, 9, 16, 17, 19, 20) as well as LC-MS/MS data sets showing good correlation between LC-MS/MS detection efficiency and the protein abundance estimates (4, 11, 12, 15). Finally, it is inexpensive and easy to produce large quantities of yeast protein extract for distribution. In this study, we describe large scale production of a yeast S. 
cerevisiae performance standard, which we offer to the community through NIST. Through a series of interlaboratory studies, we created a reference data set characterizing the yeast performance standard and defining reasonable performance of ion trap-based LC-MS platforms in expert laboratories using a series of performance metrics. This publicly available data set provides a basis for additional laboratories using the yeast standard to benchmark their own performance as well as to improve upon the current status by evolving protocols, improving instrumentation, or developing new technologies. Finally, we demonstrate how the yeast performance standard, spiked with human proteins, can be used to benchmark the power of proteomics platforms for detection of differentially expressed proteins at different levels of concentration in a complex matrix.  相似文献   
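The interlaboratory benchmarking described above reduces to computing performance metrics over replicate runs of the standard. As an illustration only (protein names and intensity values are invented, not from the study), the median coefficient of variation across proteins is one such reproducibility metric:

```python
import statistics

def protein_cv(intensities):
    """Coefficient of variation (%) of one protein's intensity across replicate runs."""
    mean = statistics.mean(intensities)
    if mean == 0:
        return float("nan")
    return 100.0 * statistics.stdev(intensities) / mean

def median_cv(runs):
    """Median CV across all proteins; runs maps protein -> intensities, one per run."""
    return statistics.median(protein_cv(v) for v in runs.values())

# Toy data: three replicate runs of a yeast background with one spiked human protein.
runs = {
    "YEAST_ENO1": [1.00e6, 1.05e6, 0.95e6],
    "YEAST_PGK1": [2.0e5, 2.2e5, 1.9e5],
    "HUMAN_ALB_spike": [5.0e4, 5.5e4, 4.8e4],
}
print(round(median_cv(runs), 1))  # → 7.1
```

A laboratory could track this number over time against the reference data set to flag drift in its LC-MS platform.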

13.
14.
The increasing scale and complexity of quantitative proteomics studies complicate subsequent analysis of the acquired data. Untargeted label-free quantification, based either on feature intensities or on spectral counting, is a method that scales particularly well with respect to the number of samples. It is thus an excellent alternative to labeling techniques. In order to profit from this scalability, however, data analysis has to cope with large amounts of data, process them automatically, and do a thorough statistical analysis in order to achieve reliable results. We review the state of the art with respect to computational tools for label-free quantification in untargeted proteomics. The two fundamental approaches are feature-based quantification, relying on the summed-up mass spectrometric intensity of peptides, and spectral counting, which relies on the number of MS/MS spectra acquired for a certain protein. We review the current algorithmic approaches underlying some widely used software packages and briefly discuss the statistical strategies for analyzing the data.

Over recent decades, mass spectrometry has become the analytical method of choice in most proteomics studies (e.g. Refs. 1–4). A standard mass spectrometric workflow allows for both protein identification and protein quantification (5) in some form. For a long time, the technology has been used mainly for qualitative assessments of protein mixtures, namely, to assess whether a specific protein is in the sample or not. However, for the majority of interesting research questions, especially in the field of systems biology, this binary information (present or not) is not sufficient (6). The necessity of more detailed information on protein expression levels drives the field of quantitative proteomics (7, 8), which enables the integration of proteomics data with other data sources and allows network-centered studies, as reviewed in Ref. 9.
Recent studies show that mass-spectrometry-based quantitative proteomics experiments can provide quantitative information (relative or absolute) for large parts, if not the entire set, of expressed proteins (10–12). Since the isotope-coded affinity tag protocol was first published in 1999 (13), numerous labeling strategies have found their way into the field of quantitative proteomics (14). These include isotope-coded protein labeling (15), metabolic labeling (16, 17), and isobaric tags (18, 19). Comprehensive overviews of different quantification strategies can be found in Refs. 20 and 21. Because of the shortcomings of labeling strategies, label-free methods are increasingly gaining the interest of proteomics researchers (22, 23). In label-free quantification, no label is introduced to either of the samples. All samples are analyzed in separate LC/MS experiments, and the individual peptide properties of the individual measurements are then compared. Regardless of the quantification strategy, computational approaches for data analyses have become the critical final step of the proteomics workflow. Overviews of existing computational approaches in proteomics are provided in Refs. 24 and 25. The computational label-free quantification workflow is visualized in Fig. 1. Comparing peptide quantities using mass spectrometry remains a difficult task, because mass spectrometers have different response values for different chemical entities, and thus a direct comparison of different peptides is not possible. The computational analysis of a label-free quantitative data set consists of several steps that are mainly split into raw data signal processing and quantification. Signal processing steps comprise data reduction procedures such as baseline removal, denoising, and centroiding.

Fig. 1. The sample cohort that can be analyzed via label-free proteomics is not limited in size. Each sample is processed separately through the sample preparation and data acquisition pipeline. For data analysis, the data from the different LC/MS runs are combined.

These steps can be accomplished in modular building blocks, or the entire analysis can be performed using monolithic analysis software. Recently, it has been shown that it is beneficial to combine modular blocks from different software tools into a consensus pipeline (26). The same study also illustrates the diversity of methods that are modularized by different software tools. In another recent publication, monolithic software packages are compared (27). In that study, the authors identify a set of seven metrics: detection sensitivity, detection consistency, intensity consistency, intensity accuracy, detection accuracy, statistical capability, and quantification accuracy. Although these metrics are not fully independent and software parameter settings are often loosely reported, such comparative studies are of great interest to the field of quantitative proteomics. A general conclusion from these studies is that the choice of software might, to a certain degree, affect the final results of the study.

Absolute quantification of peptides and proteins using intensity-based label-free methods is possible and can be done with excellent accuracy, if standard addition is used. With the help of known concentrations, calibration lines can be drawn, and absolute protein quantities can be directly inferred from these calibration measurements (28). Furthermore, it has been suggested that peptide peak intensities can be predicted and absolute quantities can be derived from these predictions (29).
However, the limited accuracy of predictions or the need for peptides of known concentrations limits these approaches to selected proteins/peptides only and prevents their use on a proteome-wide scale.

Spectral counting methods have also been used for the estimation of absolute concentrations on a global scale (30), albeit at drastically reduced accuracy relative to intensity-based methods. In one study, the authors used a mixture of 48 proteins with known concentrations and predicted the absolute copy number amounts of thousands of proteins based on that mixture. Despite the fact that large, proteome-wide data sets will dilute the effects of different peptide detectabilities on the individual protein level, such methods will always be limited in their accuracy of quantification.

The generic nature of label-free quantification means it is not restricted to any model system and can also be employed with tissue or body fluids (31, 32). However, the label-free approach is more sensitive to technical deviations between LC/MS runs, as information is compared between different measurements. Therefore, the reproducibility of the analytical platform is crucial for successful label-free quantification. The recent success of label-free quantification could only be accomplished through significant improvements of algorithms (33–36). An increasingly large collection of software tools for label-free proteomics has been published as open source applications or has entered the market as commercially available packages. This review aims at outlining the computational methods that are generally implemented by these software tools. Furthermore, we illustrate strengths and weaknesses of different tools. The review provides an information resource for the broad proteomics audience and does not illustrate all algorithmic details of the individual tools.
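Spectral counting as described above is often normalized for protein length before proteins are compared; the NSAF (normalized spectral abundance factor) scheme is one widely used variant. A minimal sketch with invented protein names and counts (not from any cited study):

```python
def nsaf(spectral_counts, lengths):
    """Normalized spectral abundance factor: (SpC/L), rescaled to sum to 1."""
    saf = {p: spectral_counts[p] / lengths[p] for p in spectral_counts}
    total = sum(saf.values())
    return {p: v / total for p, v in saf.items()}

counts = {"P1": 40, "P2": 10}      # MS/MS spectra matched to each protein
lengths = {"P1": 400, "P2": 200}   # protein lengths in residues
abundance = nsaf(counts, lengths)
print(abundance["P1"])  # 0.1 / (0.1 + 0.05) = 2/3
```

The length normalization corrects for larger proteins yielding more observable tryptic peptides, which is one of the peptide-detectability effects the text mentions.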

15.
16.
In large-scale proteomic experiments, multiple peptide precursors are often cofragmented simultaneously in the same mixture tandem mass (MS/MS) spectrum. These spectra tend to elude current computational tools because of the ubiquitous assumption that each spectrum is generated from only one peptide. Therefore, tools that consider multiple peptide matches to each MS/MS spectrum can potentially improve the relatively low spectrum identification rate often observed in proteomics experiments. More importantly, data independent acquisition protocols promoting the cofragmentation of multiple precursors are emerging as alternative methods that can greatly improve the throughput of peptide identifications, but their success also depends on the availability of algorithms to identify multiple peptides from each MS/MS spectrum. Here we address a fundamental question in the identification of mixture MS/MS spectra: determining the statistical significance of multiple peptides matched to a given MS/MS spectrum. We propose the MixGF generating function model to rigorously compute the statistical significance of peptide identifications for mixture spectra and show that this approach improves the sensitivity of current mixture spectra database search tools by ≈30–390%. Analysis of multiple data sets with MixGF reveals that in complex biological samples the number of identified mixture spectra can be as high as 20% of all the identified spectra and the number of unique peptides identified only in mixture spectra can be up to 35.4% of those identified in single-peptide spectra.

The advancement of technology and instrumentation has made tandem mass (MS/MS) spectrometry the leading high-throughput method to analyze proteins (1, 2, 3). In typical experiments, tens of thousands to millions of MS/MS spectra are generated and enable researchers to probe various aspects of the proteome on a large scale.
Part of this success hinges on the availability of computational methods that can analyze the large amount of data generated from these experiments. The classical question in computational proteomics asks: given an MS/MS spectrum, what is the peptide that generated the spectrum? However, it is increasingly being recognized that this assumption that each MS/MS spectrum comes from only one peptide is often not valid. Several recent analyses show that as many as 50% of the MS/MS spectra collected in typical proteomics experiments come from more than one peptide precursor (4, 5). The presence of multiple peptides in mixture spectra can decrease their identification rate to as low as one half of that for MS/MS spectra generated from only one peptide (6, 7, 8). In addition, there have been numerous developments in data independent acquisition (DIA) technologies where multiple peptide precursors are intentionally selected to cofragment in each MS/MS spectrum (9, 10, 11, 12, 13, 14, 15). These emerging technologies can address some of the enduring disadvantages of traditional data-dependent acquisition (DDA) methods (e.g. low reproducibility (16)) and potentially increase the throughput of peptide identification 5–10 fold (4, 17). However, despite the growing importance of mixture spectra in various contexts, there are still only a few computational tools that can analyze mixture spectra from more than one peptide (18, 19, 20, 21, 8, 22). Our recent analysis indicated that current database search methods for mixture spectra still have relatively low sensitivity compared with their single-peptide counterpart and the main bottleneck is their limited ability to separate true matches from false positive matches (8). 
Traditionally, the problem of peptide identification from MS/MS spectra involves two subproblems: 1) define a Peptide-Spectrum-Match (PSM) scoring function that assigns each MS/MS spectrum to the peptide sequence that most likely generated the spectrum; and 2) given a set of top-scoring PSMs, select a subset that corresponds to statistically significant PSMs. Here we focus on the second problem, which is still an ongoing research question even for the case of single-peptide spectra (23, 24, 25, 26). Intuitively, the second problem is difficult because one needs to consider spectra across the whole data set (instead of comparing different peptide candidates against one spectrum as in the first problem) and PSM scoring functions are often not well-calibrated across different spectra (i.e. a PSM score of 50 may be good for one spectrum but poor for a different spectrum). Ideally, a scoring function will give high scores to all true PSMs and low scores to false PSMs regardless of the peptide or spectrum being considered. However, in practice, some spectra may receive higher scores than others simply because they have more peaks or their precursor mass results in more peptide candidates being considered from the sequence database (27, 28). Therefore, a scoring function that accounts for spectrum- or peptide-specific effects can make the scores more comparable and thus help assess the confidence of identifications across different spectra. The MS-GF solution to this problem is to compute the per-spectrum statistical significance of each top-scoring PSM, which can be defined as the probability that a random peptide (out of all possible peptides within parent mass tolerance) will match the spectrum with a score at least as high as that of the top-scoring PSM. This measures how good the current best match is in relation to all possible peptides matching to the same spectrum, normalizing any spectrum effect from the scoring function.
Intuitively, our proposed MixGF approach extends the MS-GF approach to now calculate the statistical significance of the top pair of peptides matched from the database to a given mixture spectrum M (i.e. the significance of the top peptide–peptide spectrum match (PPSM)). As such, MixGF determines the probability that a random pair of peptides (out of all possible peptides within parent mass tolerance) will match a given mixture spectrum with a score at least as high as that of the top-scoring PPSM.

Despite the theoretical attractiveness of computing statistical significance, it is generally prohibitive for any database search method to score all possible peptides against a spectrum. Therefore, earlier works in this direction focused on approximating this probability by assuming that the score distribution of all PSMs follows a certain analytical form, such as the normal, Poisson, or hypergeometric distribution (29, 30, 31). In practice, because score distributions are highly data-dependent and spectrum-specific, these model assumptions do not always hold. Other approaches tried to learn the score distribution empirically from the data (29, 27). However, one is most interested in the region of the score distribution where only a small fraction of false positives are allowed (typically at 1% FDR). This usually corresponds to the extreme tail of the distribution, where p values are on the order of 10⁻⁹ or lower, and thus there is typically a lack of sufficient data points to accurately model the tail of the score distribution (32). More recently, Kim et al. (24) and Alves et al. (33), in parallel, proposed a generating function approach to compute the exact score distribution of random peptide matches for any spectrum without explicitly matching all peptides to a spectrum. Because it is an exact computation, no assumption is made about the form of the score distribution, and the tail of the distribution can be computed very accurately.
As a result, this approach substantially improved the ability to separate true matches from false positive ones and led to a significant increase in sensitivity of peptide identification over state-of-the-art database search tools for single-peptide spectra (24).

For mixture spectra, it is expected that the scores for the top-scoring match will be even less comparable across different spectra because now more than one peptide, and different numbers of peptides, can be matched to each spectrum at the same time. We extend the generating function approach (24) to rigorously compute the statistical significance of multiple-Peptide-Spectrum Matches (mPSMs) and demonstrate its utility toward addressing the peptide identification problem in mixture spectra. In particular, we show how to extend the generating function approach to mixtures of two peptides. We focus on this relatively simple case of mixture spectra because it accounts for a large fraction of the mixture spectra present in traditional DDA workflows (5). This allows us to test and develop algorithmic concepts using readily available DDA data, because data with more complex mixture spectra, such as those from DIA workflows (11), are still not widely available in public repositories.
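The generating function idea can be illustrated on a toy model: integer amino-acid masses, a spectrum reduced to a set of prefix-mass peaks, and a score that counts matched peaks. The dynamic program below (alphabet, peaks, and scoring all hypothetical, and far simpler than the real MS-GF/MixGF scoring) computes the exact score distribution over every peptide of a given parent mass, from which a per-spectrum p value follows without enumerating peptides explicitly:

```python
from collections import defaultdict

AA_MASSES = [2, 3]   # hypothetical integer amino-acid masses
PEAKS = {2, 5, 7}    # prefix masses at which the toy spectrum has peaks

def score_distribution(parent_mass):
    """ways[m][s] = number of peptide sequences of total mass m whose
    prefix masses hit exactly s peaks (a generating-function style DP)."""
    ways = [defaultdict(int) for _ in range(parent_mass + 1)]
    ways[0][0] = 1
    for m in range(1, parent_mass + 1):
        hit = 1 if m in PEAKS else 0
        for aa in AA_MASSES:
            if m - aa >= 0:
                for s, n in ways[m - aa].items():
                    ways[m][s + hit] += n
    return ways[parent_mass]

def p_value(parent_mass, top_score):
    """Probability that a random peptide of this parent mass scores >= top_score."""
    dist = score_distribution(parent_mass)
    total = sum(dist.values())
    return sum(n for s, n in dist.items() if s >= top_score) / total

# Of the three mass-7 peptides over this alphabet, exactly one matches all three peaks:
print(p_value(7, 3))  # 1/3 of random peptides score this high
```

MixGF extends this kind of recursion to pairs of peptides jointly explaining one mixture spectrum; the toy above only sketches the single-peptide case.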

17.
18.
Quantitative proteome analyses suggest that the well-established stain colloidal Coomassie Blue, when used as an infrared dye, may provide sensitive, post-electrophoretic in-gel protein detection that can rival even Sypro Ruby (SR). Considering the central role of two-dimensional gel electrophoresis in top-down proteomic analyses, a more cost effective alternative such as Coomassie Blue could prove an important tool in ongoing refinements of this important analytical technique. To date, no systematic characterization of Coomassie Blue infrared fluorescence detection relative to detection with SR has been reported. Here, seven commercial Coomassie stain reagents and seven stain formulations described in the literature were systematically compared. The selectivity, threshold sensitivity, inter-protein variability, and linear dynamic range of Coomassie Blue infrared fluorescence detection were assessed in parallel with SR. Notably, several of the Coomassie stain formulations provided infrared fluorescence detection sensitivity to <1 ng of protein in-gel, slightly exceeding the performance of SR. The linear dynamic range of Coomassie Blue infrared fluorescence detection was found to significantly exceed that of SR. However, in two-dimensional gel analyses, because of a blunted fluorescence response, SR was able to detect a few additional protein spots, amounting to 0.6% of the detected proteome. Thus, although both detection methods have their advantages and disadvantages, differences between the two appear to be small. Coomassie Blue infrared fluorescence detection is thus a viable alternative for gel-based proteomics, offering detection comparable to SR, and more reliable quantitative assessments, but at a fraction of the cost.

Gel electrophoresis is an accessible, widely applicable and mature protein resolving technology.
As the original top-down approach to proteomic analyses, among its many attributes the high resolution achievable by two-dimensional gel electrophoresis (2DE) ensures that it remains an effective analytical technology despite the appearance of alternatives. However, in-gel detection remains a limiting factor for gel-based analyses; available technology generally permits the detection and quantification of only relatively abundant proteins (35). Many critical components in normal physiology and also disease may be several orders of magnitude less abundant and thus below the detection threshold of in-gel stains, or indeed most techniques. Pre- and post-fractionation technologies have been developed to address this central issue in proteomics, but these are not without limitations (15). Thus improved detection methods for gel-based proteomics continue to be a high priority, and the literature is rich with different in-gel detection methods and innovative improvements (6–34). This history of iterative refinement presents a wealth of choices when selecting a detection strategy for a gel-based proteomic analysis (35).

Perhaps the best known in-gel detection method is the ubiquitous Coomassie Blue (CB) stain; CB has served as a gel stain and protein quantification reagent for over 40 years. Though affordable, robust, easy to use, and compatible with mass spectrometry (MS), CB staining is relatively insensitive. In traditional organic solvent formulations, CB detects ∼10 ng of protein in-gel, and some reports suggest poorer sensitivity (27, 29, 36, 37). Sensitivity is hampered by relatively high background staining because of nonspecific retention of dye within the gel matrix (32, 36, 38, 39). The development of colloidal CB (CCB) formulations largely addressed these limitations (12); the concentration of soluble CB was carefully controlled by sequestering the majority of the dye into colloidal particles, mediated by pH, solvent, and the ionic strength of the solution.
Minimizing soluble dye concentration and penetration of the gel matrix mitigated background staining, and the introduction of phosphoric acid into the staining reagent enhanced dye-protein interactions (8, 12, 40), contributing to an in-gel staining sensitivity of 5–10 ng protein, with some formulations reportedly yielding sensitivities of 0.1–1 ng (8, 12, 22, 39, 41, 42). Thus CCB achieved higher sensitivity than traditional CB staining, yet maintained all the advantages of the latter, including low cost and compatibility with existing densitometric detection instruments and MS. Although surpassed by newer methods, the practical advantages of CCB ensure that it remains one of the most common gel stains in use.

Fluorescent stains have become the routine and sensitive alternative to visible dyes. Among these, the ruthenium-organometallic family of dyes has been widely applied, and the most commercially well-known is Sypro Ruby (SR), which is purported to interact noncovalently with primary amines in proteins (15, 18, 19, 43). Chief among the attributes of these dyes is their high sensitivity. In-gel detection limits of <1 ng for some proteins have been reported for SR (6, 9, 14, 44, 45). Moreover, SR staining has been reported to yield a greater linear dynamic range (LDR) and reduced interprotein variability (IPV) compared with CCB and silver stains (15, 19, 46–49). SR is easy to use, fully MS compatible, and relatively forgiving of variations in initial conditions (6, 15). The chief drawback of these dyes remains high cost; SR and related stains are notoriously expensive, and beyond the budget of many laboratories.
Furthermore, despite some small cost advantage relative to SR, none of the available alternatives has been consistently and quantitatively demonstrated to substantially improve on the performance of SR under practical conditions (9, 50).

Notably, there is evidence to suggest that CCB staining is not fundamentally insensitive, but rather that its sensitivity has been limited by traditional densitometric detection (50, 51). When excited in the near IR at ∼650 nm, protein-bound CB in-gel emits light in the range of 700–800 nm. Until recently, the lack of low-cost, widely available and sufficiently sensitive infrared (IR)-capable imaging instruments prevented mainstream adoption of in-gel CB infrared fluorescence detection (IRFD); advances in imaging technology are now making such instruments far more accessible. Initial reports suggested that IRFD of CB-stained gels provided greater sensitivity than traditional densitometric detection (50, 51). Using CB R250, in-gel IRFD was reported to detect as little as 2 ng of protein in-gel, with an LDR of about an order of magnitude (2 to 20 ng, or 10 to 100 ng in separate gels), beyond which the fluorescent response saturated into the μg range (51). Using the G250 dye variant, it was determined that CB-IRFD of 2D gels detected ∼3 times as many proteins as densitometric imaging, and a comparable number of proteins as seen by SR (50). This study also concluded that CB-IRFD yielded a significantly higher signal to background ratio (S/BG) than SR, providing initial evidence that CB-IRFD may be superior to SR in some aspects of stain performance (50).

Despite this initial evidence of the viability of CB-IRFD as an in-gel protein detection method, a detailed characterization of this technology has not yet been reported. Here a more thorough, quantitative characterization of CB-IRFD is described, establishing its lowest limit of detection (LLD), IPV, and LDR in comparison to SR.
Finally, a wealth of modifications and enhancements of CCB formulations have been reported (8, 12, 21, 24, 26, 29, 40, 41, 52–54), and likewise there are many commercially available CCB stain formulations. To date, none of these formulations has been compared quantitatively in terms of relative performance when detected using IRF. As a general detection method for gel-based proteomics, CB-IRFD was found to provide comparable or even slightly superior performance to SR according to most criteria, including sensitivity and selectivity (50). Furthermore, in terms of LDR, CB-IRFD showed distinct advantages over SR. However, assessing proteomes resolved by 2DE revealed critical distinctions between CB-IRFD and SR in terms of protein quantification versus threshold detection: neither stain could be considered unequivocally superior to the other by all criteria. Nonetheless, IRFD proved the most sensitive method of detecting CB-stained protein in-gel, enabling high sensitivity detection without the need for expensive reagents or even commercial formulations. Overall, CB-IRFD is a viable alternative to SR and other mainstream fluorescent stains, mitigating the high cost of large-scale gel-based proteomic analyses and making high sensitivity gel-based proteomics accessible to all labs. With improvements to CB formulations and/or image acquisition instruments, the performance of this detection technology may be further enhanced.
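The lowest limit of detection compared above is commonly defined against the blank: the smallest loaded amount whose signal exceeds the blank mean by three standard deviations. A sketch of that 3-sigma criterion with invented fluorescence readings (not the study's data):

```python
import statistics

def limit_of_detection(blank_signals, calibration):
    """Lowest loaded amount whose signal exceeds blank mean + 3*SD(blank),
    a common 3-sigma detection criterion; None if nothing is detected."""
    cutoff = statistics.mean(blank_signals) + 3 * statistics.stdev(blank_signals)
    for amount, signal in sorted(calibration):
        if signal > cutoff:
            return amount
    return None

blanks = [10.0, 12.0, 11.0]                              # empty-lane background
cal = [(0.5, 13.0), (1, 20.0), (5, 90.0), (10, 180.0)]   # (ng loaded, signal)
print(limit_of_detection(blanks, cal))  # → 1 (the 0.5 ng lane falls below the cutoff)
```

The linear dynamic range can then be read off the same dilution series as the span of loads over which signal stays proportional to amount before the response saturates.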

19.
Top-down proteomics is emerging as a viable method for the routine identification of hundreds to thousands of proteins. In this work we report the largest top-down study to date, with the identification of 1,220 proteins from the transformed human cell line H1299 at a false discovery rate of 1%. Multiple separation strategies were utilized, including the focused isolation of mitochondria, resulting in significantly improved proteome coverage relative to previous work. In all, 347 mitochondrial proteins were identified, including ∼50% of the mitochondrial proteome below 30 kDa and over 75% of the subunits constituting the large complexes of oxidative phosphorylation. Three hundred of the identified proteins were found to be integral membrane proteins containing between 1 and 12 transmembrane helices, requiring no specific enrichment or modified LC-MS parameters. Over 5,000 proteoforms were observed, many harboring post-translational modifications, including over a dozen proteins containing lipid anchors (some previously unknown) and many others with phosphorylation and methylation modifications. Comparison between untreated and senescent H1299 cells revealed several changes to the proteome, including the hyperphosphorylation of HMGA2. This work illustrates the burgeoning ability of top-down proteomics to characterize large numbers of intact proteoforms in a high-throughput fashion.

Although traditional bottom-up approaches to mass-spectrometry-based proteomics are capable of identifying thousands of protein groups from a complex mixture, proteolytic digestion can result in the loss of information pertaining to post-translational modifications and sequence variants (1, 2).
The recent implementation of top-down proteomics in a high-throughput format using either Fourier transform ion cyclotron resonance (3–5) or Orbitrap instruments (6, 7) has shown an increasing scale of applicability while preserving information on combinatorial modifications and highly related sequence variants. For example, the identification of over 500 bacterial proteins helped researchers find covalent switches on cysteines (7), and over 1,000 proteins were identified from human cells (3). Such advances have driven the detection of whole protein forms, now simply called proteoforms (8), with several laboratories now seeking to tie these to specific functions in cell and disease biology (9–11).

The term “proteoform” denotes a specific primary structure of an intact protein molecule that arises from a specific gene and refers to a precise combination of genetic variation, splice variants, and post-translational modifications. Whereas special attention is required in order to accomplish gene- and variant-specific identifications via the bottom-up approach, top-down proteomics routinely links proteins to specific genes without the problem of protein inference. However, the fully automated characterization of whole proteoforms still represents a significant challenge in the field. Another major challenge is to extend the top-down approach to the study of whole integral membrane proteins, whose hydrophobicity can often limit their analysis via LC-MS (5, 12–16). Though integral membrane proteins are often difficult to solubilize, the long stretches of sequence information provided from fragmentation of their transmembrane domains in the gas phase can actually aid in their identification (5, 13).

In parallel to the early days of bottom-up proteomics a decade ago (17–21), in this work we brought the latest methods for top-down proteomics into combination with subcellular fractionation and cellular treatments to expand coverage of the human proteome.
We utilized multiple dimensions of separation and an Orbitrap Elite mass spectrometer to achieve large-scale interrogation of intact proteins derived from H1299 cells. For this focus issue on post-translational modifications, we report this summary of findings from the largest implementation of top-down proteomics to date, which resulted in the identification of 1,220 proteins and thousands more proteoforms. We also applied the platform to H1299 cells induced into senescence by treatment with the DNA-damaging agent camptothecin.
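Matching an observed intact mass to a proteoform (a gene product plus a precise set of modifications) reduces to mass arithmetic over the candidate sequence. A toy sketch using a hypothetical four-residue "proteoform"; the residue, water, and phosphorylation masses are standard monoisotopic values:

```python
# Monoisotopic residue masses (Da) for a few amino acids.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276}
WATER = 18.010565      # added once per intact chain (terminal H and OH)
PHOSPHO = 79.96633     # HPO3 delta added by each phosphorylation

def proteoform_mass(sequence, n_phospho=0):
    """Monoisotopic mass of an intact proteoform: residues + water + PTM deltas."""
    return sum(RESIDUE[aa] for aa in sequence) + WATER + n_phospho * PHOSPHO

print(round(proteoform_mass("GASP", n_phospho=1), 3))  # ≈410.120 Da
```

Detecting a shift of one or more phosphorylation deltas between untreated and treated cells is, at this level, how a change such as hyperphosphorylation shows up in intact-mass data.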

20.

Copyright©北京勤云科技发展有限公司  京ICP备09084417号