Similar Articles
20 similar articles found (search time: 343 ms)
1.
Using enrichment strategies, many research groups routinely produce large data sets of post-translationally modified peptides for proteomic analysis by tandem mass spectrometry. Although search engines are relatively effective at identifying these peptides with a defined measure of reliability, their localization of the site(s) of modification is often arbitrary and unreliable. The field still needs a widely accepted metric for the false localization rate that accurately describes the certainty of site localization in published data sets and allows consistent measurement of differences in performance between emerging scoring algorithms. This article discusses the main strategies currently used by software for modification site localization and ways of assessing the performance of these tools. Methods for representing ambiguity are reviewed, and we discuss how the approaches transfer to different data types and modifications.

2.
LC-MS experiments can generate large quantities of data, for which a variety of database search engines are available to make peptide and protein identifications. Decoy databases are becoming widely used to place statistical confidence in result sets, allowing the false discovery rate (FDR) to be estimated. Different search engines produce different identification sets, so employing more than one search engine could increase the number of peptides (and proteins) identified, if an appropriate mechanism for combining data can be defined. We have developed a search-engine-independent score based on FDR, called the FDR Score, which allows peptide identifications from different search engines to be combined. The results demonstrate that the observed FDR differs significantly depending on whether identifications were made by all three search engines, by each pair of search engines, or by a single search engine. Our algorithm assigns identifications to groups according to the set of search engines that made the identification and re-assigns the score (the combined FDR Score). The combined FDR Score can differentiate between correct and incorrect peptide identifications with high accuracy, allowing on average 35% more peptide identifications to be made at a fixed FDR than a single search engine.
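The target-decoy FDR calculation underlying this kind of score can be sketched as follows. This is a minimal illustration, not the authors' implementation; in particular, the grouping and re-scoring step of the combined FDR Score is simplified here to a plain average:

```python
def fdr_at_threshold(psms, threshold):
    """Target-decoy FDR estimate: decoy hits / target hits at or above a score
    threshold, for psms given as (score, is_decoy) pairs."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    return decoys / targets if targets else 0.0

def fdr_score(psms):
    """Assign each target PSM the FDR obtained when the threshold is set at its
    own score (a q-value-like quantity, comparable across search engines)."""
    return {score: fdr_at_threshold(psms, score)
            for score, is_decoy in psms if not is_decoy}

def combined_fdr_score(per_engine_scores):
    """Toy combination step: average the engine-specific FDR scores for a
    peptide found by several engines (the published method instead re-estimates
    the FDR within each group of agreeing engines)."""
    return sum(per_engine_scores) / len(per_engine_scores)
```

Because each engine's raw score is replaced by an FDR-derived quantity, scores from engines with very different scoring schemes become directly comparable before combination.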

3.
The combination of tandem mass spectrometry and sequence database searching is the method of choice for the identification of peptides and the mapping of proteomes. Over the last several years, the volume of data generated in proteomic studies has increased dramatically, challenging the computational approaches previously developed for these data. Furthermore, a multitude of search engines have been developed that identify different, overlapping subsets of the sample peptides from a particular set of tandem mass spectra. We present iProphet, a new addition to the widely used open-source suite of proteomic data analysis tools, the Trans-Proteomic Pipeline. Applied in tandem with PeptideProphet, it provides a more accurate representation of the multilevel nature of shotgun proteomic data. iProphet combines the evidence from multiple identifications of the same peptide sequences across different spectra, experiments, precursor ion charge states, and modified states. It also allows accurate and effective integration of the results from multiple database search engines applied to the same data. The use of iProphet in the Trans-Proteomic Pipeline increases the number of correctly identified peptides at a constant false discovery rate compared with both PeptideProphet and another state-of-the-art tool, Percolator. As the main outcome, iProphet permits the calculation of accurate posterior probabilities and false discovery rate estimates at the level of sequence-identical peptide identifications, which in turn leads to more accurate probability estimates at the protein level. Fully integrated with the Trans-Proteomic Pipeline, it supports all commonly used MS instruments, search engines, and computer platforms.
The performance of iProphet is demonstrated on two publicly available data sets: data from a human whole-cell lysate proteome profiling experiment representative of typical proteomic data sets, and a set of Streptococcus pyogenes experiments more representative of organism-specific composite data sets.

4.
MOTIVATION: Tandem mass spectrometry of trypsin digests, followed by database searching, is one of the most popular approaches in high-throughput proteomics studies. Peptides are considered identified if they pass certain scoring thresholds. To avoid false-positive protein identification, ≥2 unique peptides identified within a single protein are generally recommended. Still, in a typical high-throughput experiment, hundreds of proteins are identified by only a single peptide. We introduce here a method for distinguishing between true and false identifications among single-hit proteins. The approach is based on randomized database searching and the use of logistic regression models with cross-validation. Applied to three bacterial samples, it recovers 68-98% of the correct single-hit proteins with an error rate of <2%, resulting in a 22-65% increase in the number of identified proteins. Identifying true single-hit proteins can lead to the discovery of many crucial regulators, biomarkers and other low-abundance proteins. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

5.
MOTIVATION: Statistical evaluation of the confidence of peptide and protein identifications made by tandem mass spectrometry is a critical component of appropriately interpreting the experimental data and conducting downstream analysis. Although many approaches have been developed to assign confidence measures from different perspectives, a unified statistical framework that integrates the uncertainty of peptides and proteins is still missing. RESULTS: We developed a hierarchical statistical model (HSM) that jointly models the uncertainty of the identified peptides and proteins and can be applied to any scoring system. With data sets of a standard mixture and the yeast proteome, we demonstrate that the HSM offers a reliable, or at least conservative, false discovery rate (FDR) estimate for peptide and protein identifications. The probability measure of the HSM also offers a powerful discriminating score for peptide identification. AVAILABILITY: The algorithm is available upon request from the authors.

6.
Discovery, or shotgun, proteomics has emerged as the most powerful technique to comprehensively map out a proteome. Reconstruction of protein identities from the raw mass spectrometric data constitutes a cornerstone of any shotgun proteomics workflow. The inherent uncertainty of mass spectrometric data and the complexity of a proteome render protein inference and the statistical validation of protein identifications a non-trivial task that remains a subject of ongoing research. This review surveys the different conceptual approaches to inferring and statistically validating protein identifications and discusses their implications for the scope of proteome exploration.

7.
In recent years, a variety of approaches have been developed using decoy databases to empirically assess the error associated with peptide identifications from large-scale proteomics experiments. We have developed an approach for calculating the expected uncertainty associated with false-positive rate determination using concatenated reverse and forward protein sequence databases. After explaining the theoretical basis of our model, we compare predicted error with the results of experiments characterizing a series of mixtures containing known proteins. In general, results from characterization of known proteins show good agreement with our predictions. Finally, we consider how these approaches may be applied to more complicated data sets, as when peptides are separated by charge state prior to false-positive determination.
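The core estimate from a concatenated forward/reverse search, together with a simple binomial model of its uncertainty, might be sketched like this. This is an illustrative approximation in the spirit of the abstract, not the authors' exact model:

```python
import math

def false_positive_rate(n_total, n_decoy):
    """Concatenated forward+reverse search: an incorrect match is assumed to
    land in either half of the database with equal probability, so the number
    of false positives among all accepted hits is estimated as 2 * decoy hits."""
    return 2.0 * n_decoy / n_total

def fp_rate_uncertainty(n_total, n_decoy):
    """Rough standard error on that estimate: the decoy count among the
    ~2*n_decoy false matches is modelled as a Binomial(n_false, 0.5) draw."""
    n_false = 2.0 * n_decoy
    se_decoy = math.sqrt(n_false * 0.5 * 0.5)  # SD of the decoy count
    return 2.0 * se_decoy / n_total            # propagate the factor of 2
```

For example, 25 decoy hits among 1000 accepted identifications give an estimated false-positive rate of 5%, with a standard error well under 1%; the relative uncertainty grows as decoy counts shrink, which is why small accepted sets carry poorly determined error rates.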

8.
We present a mitochondrial (mt) genome phylogeny inferring relationships within Neuropterida (lacewings, alderflies and snakeflies) and between Neuropterida and other holometabolous insect orders. Whole mt genomes were sequenced for Sialis hamata (Megaloptera: Sialidae), Ditaxis latistyla (Neuroptera: Mantispidae), Mongoloraphidia harmandi (Raphidioptera: Raphidiidae), Macrogyrus oblongus (Coleoptera: Gyrinidae), Rhopaea magnicornis (Coleoptera: Scarabaeidae), and Mordella atrata (Coleoptera: Mordellidae) and compared with representatives of other holometabolous orders in phylogenetic analyses. Additionally, we tested the sensitivity of phylogenetic inferences to four analytical choices: inclusion vs. exclusion of RNA genes, manual vs. algorithmic alignments, arbitrary vs. algorithmic approaches to excluding variable gene regions, and how each choice interacts with the phylogenetic inference method (parsimony vs. Bayesian inference). Of these factors, the inference method had the most influence on interordinal relationships. Bayesian analyses inferred topologies largely congruent with morphology-based hypotheses of neuropterid relationships: a monophyletic Neuropterida whose sister group is Coleoptera. In contrast, parsimony analyses failed to support a monophyletic Neuropterida, as Raphidioptera was recovered as the sister group of the entire Holometabola excluding Hymenoptera, and Neuroptera + Megaloptera as the sister group of Diptera, a relationship that has not previously been proposed on the basis of either molecular or morphological data. These differences between analytical methods are due to the high among-site rate heterogeneity found in insect mt genomes, which is properly modelled by Bayesian methods but results in artifactual relationships under parsimony.
Properly analysed, the mt genomic data set presented here is among the first molecular data to support traditional, morphology-based interpretations of the relationships between the three neuropterid orders and their grouping with Coleoptera.

9.
A very popular approach in proteomics is the so-called "shotgun LC-MS/MS" strategy. In its most widely used form, a total protein digest is separated by ion-exchange fractionation in the first dimension, followed by off- or on-line RP LC-MS/MS. We replaced the first dimension with isoelectric focusing in the liquid phase using the Off-Gel device, producing 15 fractions. As peptides are separated by their isoelectric point in the first dimension and by hydrophobicity in the second, these experimentally derived parameters (pI and retention time, RT) can be used to validate potentially identified peptides. We applied this strategy to a cellular extract of Drosophila Kc167 cells and identified peptides with two different database search engines, PHENYX and SEQUEST, with PeptideProphet validation of the SEQUEST results. PHENYX returned 7582 potential peptide identifications and SEQUEST 7629. The SEQUEST results were reduced to 2006 identifications by validation with PeptideProphet. Validation of the PeptideProphet, SEQUEST and PHENYX results by the pI and RT parameters confirmed 1837 PeptideProphet identifications, while among the remaining SEQUEST results another 1130 peptides were found to be likely hits. For PHENYX, the validation allowed a solid p-value threshold of <1 × 10^-4 to be fixed, which by itself sets the correct-identification confidence to >95%; a final count of 2034 highly confident peptide identifications was achieved after pI and RT validation. Although the PeptideProphet and PHENYX data sets each have very high confidence, the overlap of common identifications was only 79.4%, explained by the fact that data interpretation was performed by searching different protein databases with two search engines based on different algorithms. The approach used in this study allowed an automated and improved data validation process for shotgun proteomics projects, producing MS/MS peptide identification results of very high confidence.
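A peptide's theoretical pI, one of the two orthogonal validation parameters used above, can be computed by bisecting on the net-charge equation. The sketch below uses approximate EMBOSS-style pKa values; this is an assumption (published pKa scales differ) and is not the pipeline used in the study:

```python
# Approximate EMBOSS-style pKa values (an assumed scale; others exist).
PKA_POS = {'Nterm': 8.6, 'K': 10.8, 'R': 12.5, 'H': 6.5}
PKA_NEG = {'Cterm': 3.6, 'D': 3.9, 'E': 4.1, 'C': 8.5, 'Y': 10.1}

def net_charge(peptide, ph):
    """Henderson-Hasselbalch net charge of a peptide at a given pH."""
    charge = 1.0 / (1.0 + 10 ** (ph - PKA_POS['Nterm']))   # free N-terminus
    charge -= 1.0 / (1.0 + 10 ** (PKA_NEG['Cterm'] - ph))  # free C-terminus
    for aa in peptide:
        if aa in PKA_POS:
            charge += 1.0 / (1.0 + 10 ** (ph - PKA_POS[aa]))
        elif aa in PKA_NEG:
            charge -= 1.0 / (1.0 + 10 ** (PKA_NEG[aa] - ph))
    return charge

def isoelectric_point(peptide, tol=1e-4):
    """Bisection for the pH at which the net charge crosses zero: the charge
    is monotonically decreasing in pH, so a simple bracket on [0, 14] works."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if net_charge(peptide, mid) > 0:
            lo = mid  # still positively charged: pI lies at higher pH
        else:
            hi = mid
    return (lo + hi) / 2.0
```

A predicted pI far from the Off-Gel fraction's pH range flags the identification as suspect, which is the validation logic the abstract exploits.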

10.
Serfling-type periodic regression models have been widely used to identify and analyse influenza epidemics. In these approaches, the baseline is traditionally determined using cleaned historical non-epidemic data. However, we found that the conventional exclusion of epidemic seasons was empirical, since year-to-year variations in the seasonal pattern of activity had been ignored; excluding fixed 'epidemic' months therefore does not seem reasonable. We adjusted the rule of epidemic-period removal to avoid a potentially subjective definition of the start and end of epidemic periods, fitting the baseline iteratively. First, we established a Serfling regression model based on the actual observations without any removals. Then, instead of manually excluding a predefined 'epidemic' period (the traditional method), we excluded observations which exceeded a calculated boundary. We re-established the Serfling regression on the cleaned data, again excluded observations exceeding the calculated boundary, and repeated this process until the R² value stopped increasing. In addition, because definitions of the onset of an influenza epidemic are heterogeneous, which can make it impossible to evaluate alternative approaches accurately, we used the modified model to detect the peak timing of influenza rather than the onset of the epidemic. We compared this model with traditional Serfling models using observed weekly case counts of influenza-like illness (ILI), in terms of sensitivity, specificity and lead time, and observed better performance. In summary, we provide an adjusted Serfling model which may improve on traditional models in providing early warning of the arrival of the influenza peak.
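The iterative baseline-fitting idea might be sketched as below, assuming weekly counts and a 52-week period. The boundary here is the fitted value plus 1.96 residual standard deviations, and the loop stops when the exclusion set stabilises rather than on the R² criterion used in the study:

```python
import math

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small linear system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def serfling_fit(ts, ys, period=52.0):
    """Least-squares fit of y ~ a + b*sin(2*pi*t/period) + c*cos(2*pi*t/period)
    via the normal equations; returns the fitted baseline as a function of t."""
    X = [[1.0, math.sin(2 * math.pi * t / period), math.cos(2 * math.pi * t / period)]
         for t in ts]
    A = [[sum(x[p] * x[q] for x in X) for q in range(3)] for p in range(3)]
    b = [sum(X[i][p] * ys[i] for i in range(len(ys))) for p in range(3)]
    a0, a1, a2 = solve(A, b)
    return lambda t: (a0 + a1 * math.sin(2 * math.pi * t / period)
                      + a2 * math.cos(2 * math.pi * t / period))

def iterative_baseline(ts, ys, period=52.0, z=1.96, max_iter=10):
    """Fit on all observations, exclude those above baseline + z * residual SD,
    refit on the retained points, and repeat until the exclusion set is stable."""
    keep = list(range(len(ts)))
    for _ in range(max_iter):
        predict = serfling_fit([ts[i] for i in keep], [ys[i] for i in keep], period)
        resid = [ys[i] - predict(ts[i]) for i in keep]
        sd = math.sqrt(sum(r * r for r in resid) / max(len(resid) - 3, 1))
        new_keep = [i for i in range(len(ts)) if ys[i] <= predict(ts[i]) + z * sd]
        if new_keep == keep or len(new_keep) < 4:
            break
        keep = new_keep
    return predict
```

Because the boundary is recomputed on each pass, epidemic weeks are removed wherever they actually fall in a given season, which is the point of the adjustment over fixed-month exclusion.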

11.
Traditionally, single-copy orthologs have been the gold standard in phylogenomics. Most phylogenomic studies identify putative single-copy orthologs using clustering approaches and retain families with a single sequence per species. This limits the amount of data available by excluding larger families. Recent advances have suggested several ways to include data from larger families. For instance, tree-based decomposition methods facilitate the extraction of orthologs from large families. Additionally, several methods for species tree inference are robust to the inclusion of paralogs and could use all of the data from larger families. Here, we explore the effects of using all families for phylogenetic inference by examining relationships among 26 primate species in detail and by analyzing five additional data sets. We compare single-copy families, orthologs extracted using tree-based decomposition approaches, and all families with all data. We explore several species tree inference methods, finding that identical trees are returned across nearly all subsets of the data and methods for primates. The relationships among Platyrrhini remain contentious; however, the species tree inference method matters more than the subset of data used. Using data from larger gene families drastically increases the number of genes available and leads to consistent estimates of branch lengths, nodal certainty and concordance, and inferences of introgression in primates. For the other data sets, topological inferences are consistent whether single-copy families or orthologs extracted using decomposition approaches are analyzed. Using larger gene families is a promising approach to include more data in phylogenomics without sacrificing accuracy, at least when high-quality genomes are available.

12.
The present study reports an in-depth proteome analysis of two Lactobacillus rhamnosus strains, the well-known probiotic strain GG and the dairy strain Lc705. We used GeLC-MS/MS, in which proteins are separated using 1-DE and identified using nanoLC-MS/MS, to generate high-quality protein catalogs. To maximize the number of identifications, all data sets were searched against the target databases using two search engines, Mascot and Paragon. As a result, over 1600 high-confidence protein identifications, covering nearly 60% of the predicted proteomes, were obtained from each strain. This approach enabled identification of more than 40% of all predicted surfome proteins, including a high number of lipoproteins, integral membrane proteins, peptidoglycan associated proteins, and proteins predicted to be released into the extracellular environment. A comparison of both data sets revealed the expression of more than 90 proteins in GG and 150 in Lc705, which lack evolutionary counterparts in the other strain. Differences were noted in proteins with a likely role in biofilm formation, phage-related functions, reshaping the bacterial cell wall, and immunomodulation. The present study provides the most comprehensive catalog of the Lactobacillus proteins to date and holds great promise for the discovery of novel probiotic effector molecules.

13.
Shotgun proteomics using mass spectrometry is a powerful method for protein identification but suffers from limited sensitivity in complex samples. Integrating peptide identifications from multiple database search engines is a promising strategy to increase the number of peptide identifications and reduce the volume of unassigned tandem mass spectra. Existing methods pool statistical significance scores, such as p-values or posterior probabilities of peptide-spectrum matches (PSMs), from multiple search engines after high-scoring peptides have been assigned to spectra, but these methods lack reliable control of identification error rates as data are integrated from different search engines. We developed a statistically coherent method for integrative analysis, termed MSblender. MSblender converts raw search scores from search engines into a probability score for every possible PSM and properly accounts for the correlation between search scores. The method reliably estimates false discovery rates and identifies more PSMs than any single search engine at the same false discovery rate. The additional identifications increase spectral counts for most proteins and allow quantification of proteins that would not have been quantified by individual search engines. We also demonstrate that the enhanced quantification improves sensitivity in differential expression analyses.

14.
Confident identification of peptides via tandem mass spectrometry underpins modern high-throughput proteomics. This has motivated considerable recent interest in the post-processing of search engine results to increase confidence and calculate robust statistical measures, for example through the use of decoy databases to calculate false discovery rates (FDR). FDR-based analyses account for multiple testing and can assign a confidence value both to sets of peptide-spectrum matches (PSMs) and to individual PSMs. We recently developed an algorithm for combining the results from multiple search engines, integrating FDRs for sets of PSMs made by different search engine combinations. Here we describe a web server and a downloadable application that make this routinely available to the proteomics community. The web server offers a range of outputs, including informative graphics for assessing the confidence of the PSMs and any potential biases. The underlying pipeline also provides a basic protein inference step, integrating PSMs into protein ambiguity groups where peptides can be matched to more than one protein. Importantly, we have also implemented full support for the mzIdentML data standard, recently released by the Proteomics Standards Initiative, providing users with the ability to convert native formats to mzIdentML files, which are available to download.

15.
16.
Spectral libraries have emerged as a viable alternative to protein sequence databases for peptide identification. These libraries contain previously detected peptide sequences and their corresponding tandem mass spectra (MS/MS). Search engines can then identify peptides by comparing experimental MS/MS scans to those in the library. Many of these algorithms employ the dot product score to measure the quality of a spectrum-spectrum match (SSM). This scoring system does not offer a clear statistical interpretation and ignores fragment ion m/z discrepancies. We developed a new spectral library search engine, Pepitome, which employs statistical systems for scoring SSMs. Pepitome outperformed the leading library search tool, SpectraST, when analyzing data sets acquired on three different mass spectrometry platforms. We characterized the reliability of spectral library searches by confirming shotgun proteomics identifications through RNA-Seq data. Applying spectral library and database searches to the same sample revealed their complementary nature. Pepitome identifications enabled the automation of quality analysis and quality control (QA/QC) for shotgun proteomics data acquisition pipelines.
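The conventional dot product score that Pepitome moves away from can be sketched as follows. This is a minimal illustration using unit-width m/z bins; real implementations typically also apply intensity transforms such as square-rooting, omitted here:

```python
import math
from collections import defaultdict

def binned(spectrum, bin_width=1.0):
    """Collapse (m/z, intensity) peaks into fixed-width m/z bins."""
    bins = defaultdict(float)
    for mz, intensity in spectrum:
        bins[int(mz / bin_width)] += intensity
    return bins

def dot_product_score(query, library, bin_width=1.0):
    """Normalised dot product between two binned spectra: 1.0 means an
    identical binned peak pattern, 0.0 means no shared peaks."""
    q, l = binned(query, bin_width), binned(library, bin_width)
    num = sum(q[b] * l[b] for b in q if b in l)
    norm = math.sqrt(sum(v * v for v in q.values()) * sum(v * v for v in l.values()))
    return num / norm if norm else 0.0
```

Note how the binning step discards sub-bin m/z discrepancies entirely: two peaks anywhere inside the same bin contribute as if they matched exactly, which is precisely the information loss the abstract criticises.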

17.
Manual analysis of mass spectrometry data is a current bottleneck in high-throughput proteomics. In particular, the need to manually validate the results of mass spectrometry database-searching algorithms can be prohibitively time-consuming. The development of software tools that quantify the confidence in the assignment of a protein or peptide identity to a mass spectrum is therefore an area of active interest. We sought to extend work in this area by investigating the potential of recent machine learning algorithms to improve the accuracy of these approaches and to serve as a flexible framework for accommodating new data features. Specifically, we demonstrated the ability of boosting and random forest approaches to improve the discrimination of true hits from false-positive identifications in the results of mass spectrometry database search engines, compared with thresholding and other machine learning approaches. We accommodated additional attributes obtainable from database search results, including a factor addressing proton mobility. Performance was evaluated using publicly available electrospray data and a new collection of MALDI data generated from purified human reference proteins.

18.
19.
Database-searching programs generally identify only a fraction of the spectra acquired in a standard LC/MS/MS study of digested proteins. Subtle variations in database-searching algorithms for assigning peptides to MS/MS spectra are known to produce different identification results. To leverage this variation, a probabilistic framework is developed for combining the results of multiple search engines. The scores from each search engine are first independently converted into peptide probabilities. These probabilities can then be readily combined across search engines using Bayesian rules and the expectation-maximization learning algorithm. A significant gain in the number of peptides identified with high confidence with each additional search engine is demonstrated using several data sets of increasing complexity, from a control protein mixture to a human plasma sample, searched using the SEQUEST, Mascot, and X! Tandem database-searching programs. The increased rate of peptide assignment also translates into a substantially larger number of protein identifications in LC/MS/MS studies compared with a typical analysis using a single database-search tool.
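Once each engine's score has been converted into a peptide probability, the combination step can be sketched as a naive-Bayes update. This assumes the engines err independently given the true assignment; the published framework additionally fits the score distributions with expectation-maximization, which is not shown here:

```python
def combine_probabilities(probs, prior=0.5):
    """Naive-Bayes combination of per-engine peptide probabilities into a
    single posterior, assuming conditional independence between engines.
    'prior' is the assumed prior probability that the peptide is correct."""
    like_true, like_false = 1.0, 1.0
    for p in probs:
        like_true *= p / prior            # likelihood ratio if correct
        like_false *= (1.0 - p) / (1.0 - prior)  # ... if incorrect
    num = prior * like_true
    return num / (num + (1.0 - prior) * like_false)
```

Two engines that each give a peptide 0.8 and 0.7 probability yield a combined probability above 0.9 under this rule, which is the mechanism behind the reported gain from each additional search engine.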

20.
A major unmet need in LC-MS/MS-based proteomics analyses is a set of tools for quantitative assessment of system performance and evaluation of technical variability. Here we describe 46 system performance metrics for monitoring chromatographic performance, electrospray source stability, MS1 and MS2 signals, dynamic sampling of ions for MS/MS, and peptide identification. Applied to data sets from replicate LC-MS/MS analyses, these metrics displayed consistent, reasonable responses to controlled perturbations. The metrics typically displayed variations less than 10% and thus can reveal even subtle differences in performance of system components. Analyses of data from interlaboratory studies conducted under a common standard operating procedure identified outlier data and provided clues to specific causes. Moreover, interlaboratory variation reflected by the metrics indicates which system components vary the most between laboratories. Application of these metrics enables rational, quantitative quality assessment for proteomics and other LC-MS/MS analytical applications.

LC-MS/MS provides the most widely used technology platform for proteomics analyses of purified proteins, simple mixtures, and complex proteomes. In a typical analysis, protein mixtures are proteolytically digested, the peptide digest is fractionated, and the resulting peptide fractions then are analyzed by LC-MS/MS (1, 2). Database searches of the MS/MS spectra yield peptide identifications and, by inference and assembly, protein identifications. Depending on protein sample load and the extent of peptide fractionation used, LC-MS/MS analytical systems can generate from hundreds to thousands of peptide and protein identifications (3). Many variations of LC-MS/MS analytical platforms have been described, and the performance of these systems is influenced by a number of experimental design factors (4).

Comparison of data sets obtained by LC-MS/MS analyses provides a means to evaluate the proteomic basis for biologically significant states or phenotypes. For example, data-dependent LC-MS/MS analyses of tumor and normal tissues enabled unbiased discovery of proteins whose expression is enhanced in cancer (5-7). Comparison of data-dependent LC-MS/MS data sets from phosphotyrosine peptides in drug-responsive and -resistant cell lines identified differentially regulated phosphoprotein signaling networks (8, 9). Similarly, activity-based probes and data-dependent LC-MS/MS analysis were used to identify differentially regulated enzymes in normal and tumor tissues (10). All of these approaches assume that the observed differences reflect differences in the proteomic composition of the samples analyzed rather than analytical system variability. The validity of this assumption is difficult to assess because of a lack of objective criteria to assess analytical system performance.

The problem of variability poses three practical questions for analysts using LC-MS/MS proteomics platforms. First, is the analytical system performing optimally for the reproducible analysis of complex proteomes? Second, can the sources of suboptimal performance and variability be identified, and can the impact of changes or improvements be evaluated? Third, can system performance metrics provide documentation to support the assessment of proteomic differences between biologically interesting samples?

Currently, the most commonly used measure of variability in LC-MS/MS proteomics analyses is the number of confident peptide identifications (11-13). Although consistency in numbers of identifications may indicate repeatability, the numbers do not indicate whether system performance is optimal or which components require optimization. One well characterized source of variability in peptide identifications is the automated sampling of peptide ion signals for acquisition of MS/MS spectra by instrument control software, which results in stochastic sampling of lower abundance peptides (14). Variability certainly also arises from sample preparation methods (e.g. protein extraction and digestion). A largely unexplored source of variability is the performance of the core LC-MS/MS analytical system, which includes the LC system, the MS instrument, and system software. The configuration, tuning, and operation of these system components govern sample injection, chromatography, electrospray ionization, MS signal detection, and sampling for MS/MS analysis. These characteristics all are subject to manipulation by the operator and thus provide means to optimize system performance.

Here we describe the development of 46 metrics for evaluating the performance of LC-MS/MS system components. We have implemented a freely available software pipeline that generates these metrics directly from LC-MS/MS data files. We demonstrate their use in characterizing sources of variability in proteomics platforms, both for replicate analyses on a single instrument and in the context of large interlaboratory studies conducted by the National Cancer Institute-supported Clinical Proteomic Technology Assessment for Cancer (CPTAC) Network.
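The kind of replicate-to-replicate comparison these metrics support can be sketched with two generic calculations: a coefficient of variation across replicate runs, and a z-score flag for outlier runs. These are illustrative statistics, not any of the study's 46 specific metrics:

```python
import math

def coefficient_of_variation(values):
    """CV (%) of a performance metric across replicate LC-MS/MS runs; values
    under ~10% would indicate a stable system in the terms used above."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (len(values) - 1))
    return 100.0 * sd / mean

def flag_outlier_runs(metric_by_run, z=3.0):
    """Flag runs whose metric deviates more than z sample SDs from the mean
    across replicates (a crude outlier screen for interlaboratory data)."""
    vals = list(metric_by_run.values())
    mean = sum(vals) / len(vals)
    sd = math.sqrt(sum((v - mean) ** 2 for v in vals) / (len(vals) - 1))
    return [run for run, v in metric_by_run.items() if sd and abs(v - mean) > z * sd]
```

Applied per metric, such summaries separate "the system drifted" from "the samples differ", which is the distinction the paper argues comparative proteomics experiments silently depend on.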
