Similar literature
20 similar articles found (search time: 31 ms)
1.
In high-throughput mass spectrometry proteomics, peptides and proteins are not simply identified as present or not present in a sample; rather, the identifications are associated with differing levels of confidence. The false discovery rate (FDR) has emerged as an accepted means for measuring the confidence associated with identifications. We have developed the Systematic Protein Investigative Research Environment (SPIRE) for the purpose of integrating the best available proteomics methods. Two successful approaches to estimating the FDR for MS protein identifications are the MAYU and our current SPIRE methods. We present here a method to combine these two approaches to estimating the FDR for MS protein identifications into an integrated protein model (IPM). We illustrate the high-quality performance of this IPM approach through testing on two large publicly available proteomics datasets. MAYU and SPIRE show remarkable consistency in identifying proteins in these datasets. Still, IPM results in a more robust FDR estimation approach and additional identifications, particularly among low abundance proteins. IPM is now implemented as a part of the SPIRE system.

2.
Shotgun proteomics using mass spectrometry is a powerful method for protein identification but suffers from limited sensitivity in complex samples. Integrating peptide identifications from multiple database search engines is a promising strategy to increase the number of peptide identifications and reduce the volume of unassigned tandem mass spectra. Existing methods pool statistical significance scores such as p-values or posterior probabilities of peptide-spectrum matches (PSMs) from multiple search engines after high scoring peptides have been assigned to spectra, but these methods lack reliable control of identification error rates as data are integrated from different search engines. We developed a statistically coherent method for integrative analysis, termed MSblender. MSblender converts raw search scores from search engines into a probability score for every possible PSM and properly accounts for the correlation between search scores. The method reliably estimates false discovery rates and identifies more PSMs than any single search engine at the same false discovery rate. Increased identifications increment spectral counts for most proteins and allow quantification of proteins that would not have been quantified by individual search engines. We also demonstrate that enhanced quantification contributes to improved sensitivity in differential expression analyses.

3.
Estimating false discovery rates (FDRs) of protein identification continues to be an important topic in mass spectrometry-based proteomics, particularly when analyzing very large datasets. One performant method for this purpose is the Picked Protein FDR approach, which is based on a target-decoy competition strategy on the protein level that ensures that FDRs scale to large datasets. Here, we present an extension to this method that can also deal with protein groups, that is, proteins that share common peptides such as protein isoforms of the same gene. To obtain well-calibrated FDR estimates that preserve protein identification sensitivity, we introduce two novel ideas: the picked group target-decoy strategy and the rescued subset grouping strategy. Using entrapment searches and simulated data for validation, we demonstrate that the new Picked Protein Group FDR method produces accurate protein group-level FDR estimates regardless of the size of the data set. The validation analysis also uncovered that applying the commonly used Occam’s razor principle leads to anticonservative FDR estimates for large datasets. This is not the case for the Picked Protein Group FDR method. Reanalysis of deep proteomes of 29 human tissues showed that the new method identified up to 4% more protein groups than MaxQuant. Applying the method to the reanalysis of the entire human section of ProteomicsDB led to the identification of 18,000 protein groups at 1% protein group-level FDR. The analysis also showed that about 1250 genes were represented by ≥2 identified protein groups. To make the method accessible to the proteomics community, we provide a software tool including a graphical user interface that enables merging results from multiple MaxQuant searches into a single list of identified and quantified protein groups.
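
The target-decoy competition on the protein level mentioned in this abstract can be illustrated with a minimal sketch. It covers only the basic picked idea (each target protein competes with its own decoy, and only the winner is counted), not the protein-group extension the abstract introduces. The function name, the input layout, and the tie rule are assumptions made for illustration and are not taken from the published tool.

```python
def picked_protein_fdr(scores):
    """Estimate protein-level FDR with a picked target-decoy competition.

    `scores` maps a protein id to (target_score, decoy_score); higher is better.
    This input layout is hypothetical and chosen only for the sketch.
    Returns a list of (protein, is_decoy, fdr) sorted by descending score.
    """
    picked = []
    for protein, (target_score, decoy_score) in scores.items():
        # Competition step: keep only the better-scoring member of each pair.
        # Assumption: ties go to the decoy, which is the conservative choice.
        if target_score > decoy_score:
            picked.append((target_score, False, protein))
        else:
            picked.append((decoy_score, True, protein))

    picked.sort(key=lambda item: item[0], reverse=True)

    results = []
    targets = decoys = 0
    for _score, is_decoy, protein in picked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        # FDR estimate at this score threshold: decoy wins / target wins.
        results.append((protein, is_decoy, decoys / max(targets, 1)))
    return results


example = {"A": (10, 2), "B": (8, 3), "C": (1, 5), "D": (4, 0.5)}
for row in picked_protein_fdr(example):
    print(row)
```

Because each pair contributes at most one entry, the number of decoys that survive does not grow with the number of low-scoring target proteins, which is the intuition behind the claim that the estimate scales to large datasets.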

4.
High resolution proteomics approaches have been successfully utilized for the comprehensive characterization of the cell proteome. However, in the case of quantitative proteomics an open question still remains: which quantification strategy is best suited for identification of biologically relevant changes, especially in clinical specimens. In this study, a thorough comparison of a label-free approach (intensity-based) and 8-plex iTRAQ was conducted as applied to the analysis of tumor tissue samples from non-muscle invasive and muscle-invasive bladder cancer. For the latter, two acquisition strategies were tested including analysis of unfractionated and fractionated iTRAQ-labeled peptides. To reduce variability, aliquots of the same protein extract were used as starting material, whereas to obtain representative results per method further sample processing and MS analysis were conducted according to routinely applied protocols. Considering only multiple-peptide identifications, LC-MS/MS analysis resulted in the identification of 910, 1092 and 332 proteins by label-free, fractionated and unfractionated iTRAQ, respectively. The label-free strategy provided higher protein sequence coverage compared to both iTRAQ experiments. Even though pre-fractionation of the iTRAQ-labeled peptides allowed for a higher number of identifications, this was not accompanied by a corresponding increase in the number of differential expression changes detected. Validity of the proteomics output related to protein identification and differential expression was determined by comparison to existing data in the field (Protein Atlas and published data on the disease). All methods predicted changes which to a large extent agreed with published data, with label-free providing a higher number of significant changes than iTRAQ.
In conclusion, both label-free and iTRAQ (when combined with peptide fractionation) provide high proteome coverage and apparently valid predictions in terms of differential expression; nevertheless, label-free provides higher sequence coverage and ultimately detects a higher number of differentially expressed proteins. The risk of false associations still exists, particularly when analyzing highly heterogeneous biological samples, raising the need for the analysis of higher sample numbers and/or application of adjustment for multiple testing.
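
The abstract closes by recommending adjustment for multiple testing without naming a procedure. As an illustration only, the sketch below applies the Benjamini-Hochberg adjustment, which is one common choice; nothing in the abstract says this is the method the authors had in mind, and the function name and example p-values are invented for the sketch.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each adjusted value is the
    # minimum of p * m / rank over this rank and all higher ranks.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-protein p-values from a differential expression test.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.20]))
```

A protein would then be called differentially expressed when its adjusted value falls below the chosen FDR level, for example 0.05.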

5.
MS/MS combined with database search methods can identify the proteins present in complex mixtures. High throughput methods that infer probable peptide sequences from enzymatically digested protein samples create a challenge in how best to aggregate the evidence for candidate proteins. Typically the results of multiple technical and/or biological replicate experiments must be combined to maximize sensitivity. We present a statistical method for estimating probabilities of protein expression that integrates peptide sequence identifications from multiple search algorithms and replicate experimental runs. The method was applied to create a repository of 797 non-homologous zebrafish (Danio rerio) proteins, at an empirically validated false identification rate under 1%, as a resource for the development of targeted quantitative proteomics assays. We have implemented this statistical method as an analytic module that can be integrated with an existing suite of open-source proteomics software.

6.
Confident identification of peptides via tandem mass spectrometry underpins modern high-throughput proteomics. This has motivated considerable recent interest in the postprocessing of search engine results to increase confidence and calculate robust statistical measures, for example through the use of decoy databases to calculate false discovery rates (FDR). FDR-based analyses allow for multiple testing and can assign a single confidence value for both sets and individual peptide spectrum matches (PSMs). We recently developed an algorithm for combining the results from multiple search engines, integrating FDRs for sets of PSMs made by different search engine combinations. Here we describe a web server and a downloadable application that make this routinely available to the proteomics community. The web server offers a range of outputs including informative graphics to assess the confidence of the PSMs and any potential biases. The underlying pipeline also provides a basic protein inference step, integrating PSMs into protein ambiguity groups where peptides can be matched to more than one protein. Importantly, we have also implemented full support for the mzIdentML data standard, recently released by the Proteomics Standards Initiative, providing users with the ability to convert native formats to mzIdentML files, which are available to download.
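
The statement that decoy-based FDR analysis can give a confidence value both to a set of PSMs and to each individual PSM can be sketched with q-values. The code below is a generic illustration, not the algorithm of the web server described above; the `(score, is_decoy)` input layout is an assumption.

```python
def q_values(psms):
    """Compute q-values for PSMs from a target-decoy search.

    `psms` is a list of (score, is_decoy) tuples; higher score is better.
    Returns (score, is_decoy, q_value) tuples sorted by descending score.
    """
    ordered = sorted(psms, key=lambda psm: psm[0], reverse=True)

    # Set-level FDR at each score threshold: decoys / targets accepted so far.
    fdrs = []
    targets = decoys = 0
    for _score, is_decoy in ordered:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(decoys / max(targets, 1))

    # PSM-level q-value: the smallest FDR of any threshold that still
    # accepts this PSM, so q-values never decrease as scores get worse.
    qs = []
    running_min = float("inf")
    for fdr in reversed(fdrs):
        running_min = min(running_min, fdr)
        qs.append(running_min)
    qs.reverse()

    return [(score, is_decoy, q) for (score, is_decoy), q in zip(ordered, qs)]


example = [(9.0, False), (8.0, False), (7.0, True), (6.0, False), (5.0, False)]
for row in q_values(example):
    print(row)
```

Filtering the returned list at a q-value cutoff gives the set-level view, while the q-value attached to each row gives the per-PSM view.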

7.
Labeling‐based proteomics is a powerful method for detection of differentially expressed proteins (DEPs). The current data analysis platform typically relies on protein‐level ratios, which are obtained by summarizing peptide‐level ratios for each protein. In shotgun proteomics, however, some proteins are quantified with more peptides than others, and this reproducibility information is not incorporated into the differential expression (DE) analysis. Here, we propose a novel probabilistic framework, EBprot, that directly models the peptide‐protein hierarchy and rewards the proteins with reproducible evidence of DE over multiple peptides. To evaluate its performance with known DE states, we conducted a simulation study to show that the peptide‐level analysis of EBprot provides better receiver‐operating characteristics and more accurate estimation of the false discovery rates than the methods based on protein‐level ratios. We also demonstrate superior classification performance of peptide‐level EBprot analysis in a spike‐in dataset. To illustrate the wide applicability of EBprot in different experimental designs, we applied EBprot to a dataset for lung cancer subtype analysis with biological replicates and another dataset for time course phosphoproteome analysis of EGF‐stimulated HeLa cells with multiplexed labeling. Through these examples, we show that the peptide‐level analysis of EBprot is a robust alternative to the existing statistical methods for the DE analysis of labeling‐based quantitative datasets. The software suite is freely available on the Sourceforge website http://ebprot.sourceforge.net/ . All MS data have been deposited in the ProteomeXchange with identifier PXD001426 ( http://proteomecentral.proteomexchange.org/dataset/PXD001426/ ).

8.
We report a significantly enhanced bioinformatics suite and database for proteomics research called Yale Protein Expression Database (YPED) that is used by investigators at more than 300 institutions worldwide. YPED meets the data management, archival, and analysis needs of high-throughput mass spectrometry-based proteomics research, ranging from a single laboratory, to a group of laboratories within and beyond an institution, to the entire proteomics community. The current version is a significant improvement over the first version in that it contains new modules for liquid chromatography–tandem mass spectrometry (LC–MS/MS) database search results, label and label-free quantitative proteomic analysis, and several scoring outputs for phosphopeptide site localization. In addition, we have added both peptide and protein comparative analysis tools to enable pairwise analysis of distinct peptides/proteins in each sample and of overlapping peptides/proteins between all samples in multiple datasets. We have also implemented a targeted proteomics module for automated multiple reaction monitoring (MRM)/selective reaction monitoring (SRM) assay development. We have linked YPED's database search results and both label-based and label-free fold-change analysis to the Skyline Panorama repository for online spectra visualization. In addition, we have built enhanced functionality to curate peptide identifications into an MS/MS peptide spectral library for all of our protein database search identification results.

9.
We report a hybrid search method combining database and spectral library searches that allows for a straightforward approach to characterizing the error rates from the combined data. Using these methods, we demonstrate significantly increased sensitivity and specificity in matching peptides to tandem mass spectra. The hybrid search method increased the number of spectra that can be assigned to a peptide in a global proteomics study by 57-147% at an estimated false discovery rate of 5%, with clear room for even greater improvements. The approach combines the general utility of using consensus model spectra typical of database search methods with the accuracy of the intensity information contained in spectral libraries. A common scoring metric based on recent developments linking data analysis and statistical thermodynamics is used, which allows the use of a conservative estimate of error rates for the combined data. We applied this approach to proteomics analysis of Synechococcus sp. PCC 7002, a cyanobacterium that is a model organism for studies of photosynthetic carbon fixation and biofuels development. The increased specificity and sensitivity of this approach allowed us to identify many more peptides involved in the processes important for photoautotrophic growth.

10.
Lo SL, You T, Lin Q, Joshi SB, Chung MC, Hew CL. Proteomics 2006, 6(6): 1758-1769
In the field of proteomics, the increasing difficulty of unifying data formats, due to the different platforms/instrumentation and laboratory documentation systems, greatly hinders experimental data verification, exchange, and comparison. Therefore, it is essential to establish standard formats for every necessary aspect of proteomics data. One of the recently published data models is the proteomics experiment data repository (PEDRo) [Taylor, C. F., Paton, N. W., Garwood, K. L., Kirby, P. D. et al., Nat. Biotechnol. 2003, 21, 247-254]. Compliant with this format, we developed the systematic proteomics laboratory analysis and storage hub (SPLASH) database system as an informatics infrastructure to support proteomics studies. It consists of three modules and provides proteomics researchers a common platform to store, manage, search, analyze, and exchange their data. (i) The data maintenance module includes experimental data entry and update, uploading of experimental results in batch mode, and data exchange in the original PEDRo format. (ii) The data search module provides several means to search the database, to view either the protein information or the differential expression display by clicking on a gel image. (iii) The data mining module contains tools that perform biochemical pathway, statistics-associated gene ontology, and other comparative analyses for all the sample sets to interpret their biological meaning. These features make SPLASH a practical and powerful tool for the proteomics community.

11.
In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time. To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data we analyzed the changes of reported protein identifiers from all public experiments in the Proteomics Identifications (PRIDE) database by November 2010. To map the submitted protein identifier to a currently active entry, two distinct approaches were used. The first approach used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second one (called logical mapping algorithm) accessed the source databases and retrieved the current status of the reported identifier. Our analysis showed the differences between the main protein databases (International Protein Index (IPI), UniProt Knowledgebase (UniProtKB), National Center for Biotechnological Information nr database (NCBI nr), and Ensembl) with respect to identifier stability. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more IPI entries deleted compared with the logical mapping algorithm. We found several cases where experiments contained more than 10% deleted identifiers already at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB: two releases per year were used, from 2005.
This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.

12.
Zhao Song, Luonan Chen, Dong Xu. Proteomics 2009, 9(11): 3090-3099
Protein identification using Peptide Mass Fingerprinting (PMF) data remains an important yet only partially solved problem. Current computational methods may lead to false positive identification since the top hit from a database search may not be the target protein. In addition, the identification scores assigned singly by a scoring function (raw scores) are not normalized. Therefore, the ranking based on raw scores may be biased. To address these issues, we have developed a statistical model to evaluate the confidence of the raw score and to improve the ranking of proteins for identification. The results show that the statistical model ranks the correct protein better than the raw scores do. Our study provides a new method to enhance the accuracy of protein identification by using PMF data. We incorporated the method into our software package “Protein‐Decision” together with a user‐friendly graphical interface. A standalone version of Protein‐Decision is freely available at http://digbio.missouri.edu/ProteinDecision/ .

13.
In this article, we provide a comprehensive study of the content of the Universal Protein Resource (UniProt) protein data sets for human and mouse. The tryptic search spaces of the UniProtKB (UniProt knowledgebase) complete proteome sets were compared with other data sets from UniProtKB and with the corresponding International Protein Index, reference sequence, Ensembl, and UniRef100 (where UniRef is UniProt reference clusters) organism‐specific data sets. All protein forms annotated in UniProtKB (both the canonical sequences and isoforms) were evaluated in this study. In addition, natural and disease‐associated amino acid variants annotated in UniProtKB were included in the evaluation. The peptide unicity was also evaluated for each data set. Furthermore, the peptide information in the UniProtKB data sets was also compared against the available peptide‐level identifications in the main MS‐based proteomics repositories. The peptides observed in these repositories are an important source of information for protein databases, as they provide supporting evidence for the existence of otherwise predicted proteins. Likewise, the repositories could use the information available in UniProtKB to direct reprocessing efforts on specific sets of peptides/proteins of interest. In summary, we provide comprehensive information about the different organism‐specific sequence data sets available from UniProt, together with the pros and cons for each, in terms of search space for MS‐based bottom‐up proteomics workflows. The aim of the analysis is to provide a clear view of the tryptic search space of UniProt and other protein databases to enable scientists to select those most appropriate for their purposes.
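
The notions of a tryptic search space and peptide unicity used in this abstract can be made concrete with a small sketch. It assumes the usual simplified trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, and a minimum peptide length; the function names, the length cutoff, and the toy sequences are invented for illustration and are not taken from the study.

```python
import re
from collections import defaultdict


def tryptic_peptides(sequence, min_len=6):
    """In-silico trypsin digest with no missed cleavages.

    Splits after K or R unless followed by P. Splitting on a zero-width
    match requires Python 3.7 or later.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", sequence)
    return [p for p in pieces if len(p) >= min_len]


def unique_peptides(proteins, min_len=6):
    """Map each peptide found in exactly one protein to that protein.

    `proteins` is a dict of protein id -> amino acid sequence.
    """
    owners = defaultdict(set)
    for name, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, min_len):
            owners[peptide].add(name)
    return {pep: next(iter(names)) for pep, names in owners.items() if len(names) == 1}


# Toy sequences, not real proteins; min_len is lowered so they yield peptides.
toy = {"P1": "AAAKCCCRDDDK", "P2": "AAAKEEERPFFK"}
print(tryptic_peptides(toy["P2"], min_len=3))
print(unique_peptides(toy, min_len=3))
```

In this toy case the peptide shared by both sequences is excluded from the unique set, which is the sense in which a larger or more redundant database reduces the number of peptides that point to a single protein.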

14.
Most proteomics experiments make use of 'high throughput' technologies such as 2-DE, MS or protein arrays to measure simultaneously the expression levels of thousands of proteins. Such experiments yield large, high-dimensional data sets which usually reflect not only the biological but also technical and experimental factors. Statistical tools are essential for evaluating these data and preventing false conclusions. Here, an overview is given of some typical statistical tools for proteomics experiments. In particular, we present methods for data preprocessing (e.g. calibration, missing values estimation and outlier detection), comparison of protein expression in different groups (e.g. detection of differentially expressed proteins or classification of new observations) as well as the detection of dependencies between proteins (e.g. protein clusters or networks). We also discuss questions of sample size planning for some of these methods.

15.
Tandem mass spectrometry-based proteomics is currently in great need of computational methods that facilitate the elimination of likely false positives in peptide and protein identification. In the last few years, a number of new peptide identification programs have been described, but scores or other significance measures reported by these programs cannot always be directly translated into an easy-to-interpret error rate measurement such as the false discovery rate. In this work we used generalized lambda distributions to model frequency distributions of database search scores computed by MASCOT, X!TANDEM with k-score plug-in, OMSSA, and InsPecT. From these distributions, we could successfully estimate p values and false discovery rates with high accuracy. From the set of peptide assignments reported by any of these engines, we also defined a generic protein scoring scheme that enabled accurate estimation of protein-level p values by simulation of random score distributions and that was also found to yield good estimates of protein-level false discovery rate. The performance of these methods was evaluated by searching four freely available data sets ranging from 40,000 to 285,000 MS/MS spectra.

16.
Confident peptide identification is one of the most important components in mass-spectrometry-based proteomics. We propose a method to properly combine the results from different database search methods to enhance the accuracy of peptide identifications. The database search methods included in our analysis are SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X! Tandem (v2007.07.01.2), OMSSA (v2.0) and RAId_DbS. Using two data sets, one collected in profile mode and one collected in centroid mode, we tested the search performance of all 21 combinations of two search methods as well as all 35 possible combinations of three search methods. The results obtained from our study suggest that properly combining search methods does improve retrieval accuracy. In addition to performance results, we also describe the theoretical framework which in principle allows one to combine many independent scoring methods including de novo sequencing and spectral library searches. The correlations among different methods are also investigated in terms of common true positives, common false positives, and a global analysis. We find that the average correlation strength, between any pairwise combination of the seven methods studied, is usually smaller than the associated standard error. This indicates only weak correlation may be present among different methods and validates our approach in combining the search results. The usefulness of our approach is further confirmed by showing that the average cumulative number of false positive peptides agrees reasonably well with the combined E-value. The data related to this study are freely available upon request.
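
The counts of 21 two-method and 35 three-method combinations quoted in this abstract follow directly from choosing 2 or 3 of the seven listed search methods. The short check below enumerates them; it only confirms the arithmetic and says nothing about how the study combines the scores.

```python
from itertools import combinations

# The seven search methods named in the abstract.
ENGINES = ["SEQUEST", "ProbID", "InsPecT", "Mascot", "X! Tandem", "OMSSA", "RAId_DbS"]

pairs = list(combinations(ENGINES, 2))    # C(7, 2) unordered pairs
triples = list(combinations(ENGINES, 3))  # C(7, 3) unordered triples

print(len(pairs), len(triples))
```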

17.
The use of multidimensional capillary HPLC combined with MS/MS has allowed high qualitative and quantitative proteome coverage of prokaryotic organisms. The determination of protein abundance change between two or more conditions has matured to the point that false discovery rates can be very low and, for smaller proteomes, coverage is sufficiently high to explicitly consider false negative error. Selected aspects of using these methods for global protein abundance assessments are reviewed. These include instrumental issues that influence the reliability of abundance ratios; a comparison of sources of nonlinearity, errors, and data compression in proteomics and spotted cDNA arrays; strengths and weaknesses of spectral counting versus stable isotope metabolic labeling; and a survey of microbiological applications of global abundance analysis at the protein level. Proteomic results for two organisms that have been studied extensively using these methods are reviewed in greater detail. Spectral counting and metabolic labeling data are compared and the utility of proteomics for global gene regulation studies is discussed for the methanogenic archaeon Methanococcus maripaludis. The oral pathogen Porphyromonas gingivalis is discussed as an example of an organism where a large percentage of the proteome differs in relative abundance between the intracellular and extracellular phenotype.

18.
Protein identification using MS is an important technique in proteomics as well as a major generator of proteomics data. We have designed the protein identification data object model (PDOM) and developed a parser based on this model to facilitate the analysis and storage of these data. The parser works with HTML or XML files saved or exported from MASCOT MS/MS ions search in peptide summary report or MASCOT PMF search in protein summary report. The program creates PDOM objects, eliminates redundancy in the input file, and has the capability to output any PDOM object to a relational database. This program facilitates additional analysis of MASCOT search results and aids the storage of protein identification information. The implementation is extensible and can serve as a template to develop parsers for other search engines. The parser can be used as a stand-alone application or can be driven by other Java programs. It is currently being used as the front end for a system that loads HTML and XML result files of MASCOT searches into a relational database. The source code is freely available at http://www.ccbm.jhu.edu and the program uses only free and open-source Java libraries.

19.
Proteomics aims to study the whole protein content of a biological sample in one set of experiments. Such an approach has the potential value to acquire an understanding of the complex responses of an organism to a stimulus. The large vascular and air space surface area of the lung exposes it to a multitude of stimuli that can trigger a variety of responses by many different cell types. This complexity makes the lung a promising, but also challenging, target for proteomics. Important steps made in the last decade have increased the potential value of the results of proteomics studies for the clinical scientist. Advances in protein separation and staining techniques have improved protein identification to include the least abundant proteins. The evolution in mass spectrometry has led to the identification of a large part of the proteins of interest rather than just describing changes in patterns of protein spots. Protein profiling techniques allow the rapid comparison of complex samples and the direct investigation of tissue specimens. In addition, proteomics has been complemented by the analysis of posttranslational modifications and techniques for the quantitative comparison of different proteomes. These methodologies have made the application of proteomics to the study of specific diseases or biological processes under clinically relevant conditions possible. The quantity of data that is acquired with these new techniques places new challenges on data processing and analysis. This article provides a brief review of the most promising proteomics methods and some of their applications to pulmonary research.

20.
An important aim of proteogenomics, which combines data of high throughput nucleic acid and protein analysis, is to reliably identify single amino acid substitutions representing a main type of coding genome variants. Exact knowledge of deviations from the consensus genome can be utilized in several biomedical fields, such as studies of expression of mutated proteins in cancer, deciphering heterozygosity mechanisms, identification of neoantigens in anticancer vaccine production, search for RNA editing sites at the level of the proteome, etc. Generation of this new knowledge requires processing of large data arrays from high-resolution mass spectrometry, where information on single-point protein variation is often difficult to extract. Accordingly, a significant problem in proteogenomic analysis is the presence of high levels of false positive results for variant-containing peptides. Here we review recently suggested approaches of high quality proteomics data processing that may provide more reliable identification of single amino acid substitutions, especially in contrast to residue modifications occurring in vitro and in vivo. Optimized methods for assessment of false discovery rate save instrumental and computational time spent on validation of interesting findings of amino acid polymorphism by orthogonal methods.
