Similar articles
1.
Protein identification via peptide mass fingerprinting (PMF) remains a key component of high-throughput proteomics experiments in post-genomic science. Candidate protein identifications are made using bioinformatic tools from peptide peak lists obtained via mass spectrometry (MS). These algorithms rely on several search parameters, including the number of potential uncut peptide bonds matching the primary specificity of the hydrolytic enzyme used in the experiment. Typically, up to one of these "missed cleavages" is considered by the bioinformatics search tools, usually after digestion of the in silico proteome by trypsin. Using two distinct, nonredundant datasets of peptides identified via PMF and tandem MS, a simple predictive method based on information theory is presented which is able to identify experimentally defined missed cleavages with up to 90% accuracy from amino acid sequence alone. Using this simple protocol, we are able to "mask" candidate protein databases so that confident missed cleavage sites need not be considered for in silico digestion. We show that this leads to an improvement in database searching, with two different search engines, using the PMF dataset as a test set. In addition, the improved approach is also demonstrated on an independent PMF data set of known proteins that also has corresponding high-quality tandem MS data, validating the protein identifications. This approach has wider applicability for proteomics database searching, and the program for predicting missed cleavages and masking Fasta-formatted protein sequence databases has been made available via http://ispider.smith.man.ac.uk/MissedCleave.
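
To make the role of the missed-cleavage parameter concrete, the following is a minimal sketch of in silico tryptic digestion in Python. It is not the MissedCleave program. The cleavage rule (cut after K or R unless the next residue is P) is the commonly used simplification for trypsin, and the function names and the `masked_sites` parameter are hypothetical. Treating a "masked" site as a bond that is never cut is one possible reading of the masking idea described above, not a description of the authors' implementation.

```python
import re


def tryptic_sites(sequence):
    """Positions after which trypsin is assumed to cut: after K or R, not before P."""
    return [m.end() for m in re.finditer(r"[KR](?!P)", sequence)]


def digest(sequence, max_missed=1, masked_sites=frozenset()):
    """Return peptides with up to `max_missed` missed cleavages.

    `masked_sites` (hypothetical parameter) holds cut positions that are
    assumed never to be cleaved, so they are removed before digestion.
    """
    sites = [
        s for s in tryptic_sites(sequence)
        if s not in masked_sites and s < len(sequence)
    ]
    bounds = [0] + sites + [len(sequence)]
    peptides = []
    for i in range(len(bounds) - 1):
        # Joining k+1 consecutive fragments gives a peptide with k missed cleavages.
        last = min(i + 2 + max_missed, len(bounds))
        for j in range(i + 1, last):
            peptides.append(sequence[bounds[i]:bounds[j]])
    return peptides


if __name__ == "__main__":
    for peptide in digest("MAKGSRPLKE", max_missed=1):
        print(peptide)
```

Under these assumptions, masking a site shrinks the set of candidate peptides a search engine has to consider, because the fully cleaved fragments on either side of that site are no longer generated.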

2.
Mass spectrometry‐based proteomics is a popular and powerful method for precise and highly multiplexed protein identification. The most common method of analyzing untargeted proteomics data is called database searching, where the database is simply a collection of protein sequences from the target organism, derived from genome sequencing. Experimental peptide tandem mass spectra are compared to simplified models of theoretical spectra calculated from the translated genomic sequences. However, in several interesting application areas, such as forensics, archaeology, venomics, and others, a genome sequence may not be available, or the correct genome sequence to use is not known. In these cases, de novo peptide identification can play an important role. De novo methods infer peptide sequence directly from the tandem mass spectrum without reference to a sequence database, usually using graph‐based or machine learning algorithms. In this review, we provide a basic overview of de novo peptide identification methods and applications, briefly covering de novo algorithms and tools, and focusing in more depth on recent applications from venomics, metaproteomics, forensics, and characterization of antibody drugs.
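
As an illustration of what a "simplified model of a theoretical spectrum" can look like, here is a minimal Python sketch that computes singly charged b- and y-ion m/z values for an unmodified peptide from standard monoisotopic residue masses. The function name `fragment_ions` is hypothetical, and the model is an assumption made for illustration: it ignores modifications, higher charge states, neutral losses, and intensities, all of which real search engines may handle differently.

```python
# Standard monoisotopic residue masses in daltons.
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    running = 0.0
    for m in masses[:-1]:            # b1 .. b(n-1), built from the N-terminus
        running += m
        b_ions.append(running + PROTON)
    running = 0.0
    for m in reversed(masses[1:]):   # y1 .. y(n-1), built from the C-terminus
        running += m
        y_ions.append(running + WATER + PROTON)
    return b_ions, y_ions


if __name__ == "__main__":
    b, y = fragment_ions("GAK")
    print([round(x, 4) for x in b])
    print([round(x, 4) for x in y])
```

A database search engine would compare peak lists like these against the observed spectrum, whereas a de novo method works in the opposite direction and infers residues from mass differences between observed peaks.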

3.
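Amino acid mutations can alter the structure and function of proteins and affect the life processes of organisms. Shotgun proteomics based on tandem mass spectrometry is currently the main method for large-scale proteomics research, but existing mass spectrometry identification workflows often deliberately reduce the amino acid mutation information in the database in order to improve the sensitivity of identification. How to mine amino acid mutation information from the data has therefore become an important part of current mass spectrometry data identification. The tandem mass spectrometry identification methods currently applied to amino acid mutations fall roughly into three categories: methods based on sequence database searching, algorithms based on sequence tag searching, and algorithms based on spectral library searching. This article first describes these three kinds of amino acid mutation identification algorithms in detail and analyzes the characteristics and shortcomings of each, and then presents the current state of research and future directions of amino acid mutation identification. As tandem mass spectrometry-based proteomics continues to develop, the amino acid mutation information in protein sequences will be resolved more fully, allowing in-depth investigation of the changes in protein structure and function caused by amino acid mutations and laying a foundation for revealing their biological significance.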

4.
Identifying the proteome: software tools   (total citations: 18; self-citations: 0; citations by others: 18)
The interest in proteomics has recently increased dramatically and proteomic methods are now applied to many problems in cell biology. The method of choice in proteomics for identifying and characterizing proteins is mass spectrometry combined with database searching. Software tools have been improved to increase the sensitivity of protein identification and methods for evaluating the search results have been incorporated

5.
Manual analysis of mass spectrometry data is a current bottleneck in high-throughput proteomics. In particular, the need to manually validate the results of mass spectrometry database searching algorithms can be prohibitively time-consuming. Development of software tools that attempt to quantify the confidence in the assignment of a protein or peptide identity to a mass spectrum is an area of active interest. We sought to extend work in this area by investigating the potential of recent machine learning algorithms to improve the accuracy of these approaches and as a flexible framework for accommodating new data features. Specifically, we demonstrated the ability of boosting and random forest approaches to improve the discrimination of true hits from false positive identifications in the results of mass spectrometry database search engines compared with thresholding and other machine learning approaches. We accommodated additional attributes obtainable from database search results, including a factor addressing proton mobility. Performance was evaluated using publicly available electrospray data and a new collection of MALDI data generated from purified human reference proteins.

6.
In shotgun proteomics, protein identification by tandem mass spectrometry relies on bioinformatics tools. Despite recent improvements in identification algorithms, a significant number of high-quality spectra remain unidentified for various reasons. Here we present ScanRanker, an open-source tool that evaluates the quality of tandem mass spectra via sequence tagging with reliable performance in data from different instruments. The superior performance of ScanRanker enables it not only to find unassigned high-quality spectra that evade identification through database search but also to select spectra for de novo sequencing and cross-linking analysis. In addition, we demonstrate that the distribution of ScanRanker scores predicts the richness of identifiable spectra among multiple LC-MS/MS runs in an experiment, and ScanRanker scores assist the process of peptide assignment validation to increase confident spectrum identifications. The source code and executable versions of ScanRanker are available from http://fenchurch.mc.vanderbilt.edu.

7.
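Proteomics research based on tandem mass spectrometry generates massive amounts of mass spectrometry data. These data are usually identified and analyzed with database search engines, and the proteins actually present in the sample are inferred from the peptide-spectrum matches (PSMs). In the processing of high-throughput proteomics data, reliable identification results are a prerequisite for downstream analysis and application, so quality control of the identification results is especially important, and quality control based on the target-decoy search strategy is currently the most widely used approach. This article first describes the workflow for database searching and quality control under the target-decoy search strategy, then reviews quality control tools based on this strategy, discusses the shortcomings of the strategy and ways to improve it, and finally summarizes the strategy and its outlook.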
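
The quality-control step of a target-decoy search can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of any particular tool: PSMs are represented as hypothetical `(score, is_decoy)` pairs in which higher scores are better, and the false discovery rate above a threshold is estimated as the number of decoy hits divided by the number of target hits, which is one common convention among several.

```python
def estimate_fdr(psms, threshold):
    """Estimate FDR among PSMs scoring >= threshold as decoys / targets."""
    targets = sum(1 for score, is_decoy in psms if score >= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in psms if score >= threshold and is_decoy)
    if targets == 0:
        return 1.0 if decoys else 0.0
    return decoys / targets


def score_cutoff(psms, max_fdr=0.01):
    """Return the lowest score threshold whose estimated FDR is <= max_fdr.

    Returns None if no threshold satisfies the requested FDR.
    """
    best = None
    for threshold in sorted({score for score, _ in psms}, reverse=True):
        if estimate_fdr(psms, threshold) <= max_fdr:
            best = threshold
    return best


if __name__ == "__main__":
    # Hypothetical PSMs: (search score, matched a decoy sequence?)
    psms = [(10, False), (9, False), (8, False), (7, True), (6, False), (5, True)]
    print(score_cutoff(psms, max_fdr=0.01))
    print(score_cutoff(psms, max_fdr=0.25))
```

Because the estimate is not monotonic in the threshold, the sketch scans every observed score rather than stopping at the first failure. Tools that report q-values address the same issue by taking a running minimum.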

8.
MOTIVATION: Tandem mass spectrometry combined with sequence database searching is one of the most powerful tools for protein identification. As thousands of spectra are generated by a mass spectrometer in one hour, the speed of database searching is critical, especially when searching against a large sequence database, when the peptide is generated by some unknown or non-specific enzyme, or even when the target peptides have post-translational modifications (PTM). In practice, about 70-90% of the spectra have no match in the database. Many believe that a significant portion of them are due to peptides of non-specific digestions by unknown enzymes or amino acid modifications. In another case, scientists may choose to use some non-specific enzymes such as pepsin or thermolysin for proteolysis in a proteomic study, because not all proteins are amenable to digestion by site-specific enzymes, and furthermore many digested peptides may not fall within the range of molecular weight suitable for mass spectrometry analysis. Interpreting mass spectra of these kinds costs database search engines a great deal of computational time. OVERVIEW: The present study was designed to speed up the database searching process for both cases. More specifically, we employed an approach combining a suffix tree data structure and a spectrum graph. The suffix tree is used to preprocess the protein sequence database, while the spectrum graph is used to preprocess the tandem mass spectrum. We then search the suffix tree against the spectrum graph for candidate peptides. We design an efficient algorithm to compute a matching threshold with some statistical significance level, e.g. p = 0.01, for each spectrum, and use it to select candidate peptides. Then we rank these peptides using a SEQUEST-like scoring function. The algorithms were implemented and tested on experimental data. For post-translational modifications, we allow an arbitrary number of any modification to a protein.
AVAILABILITY: The executable program and other supplementary materials are available online at: http://hto-c.usc.edu:8000/msms/suffix/.

9.
Peptide identification of tandem mass spectra by a variety of available search algorithms forms the foundation for much of modern-day mass spectrometry-based proteomics. Despite the critical importance of proper evaluation and interpretation of the results generated by these algorithms, there is still little consistency in their application or understanding of their similarities and differences. A survey was conducted of four tandem mass spectrometry peptide identification search algorithms: Mascot, Open Mass Spectrometry Search Algorithm, Sequest, and X! Tandem. The same input data, search parameters, and sequence library were used for the searches. Comparisons were based on commonly used scoring methodologies for each algorithm and on the results of a target-decoy approach to sequence library searching. The results indicated that there is little difference in the output of the algorithms so long as consistent scoring procedures are applied. The results showed that some commonly used scoring procedures may lead to excessive false discovery rates. Finally, an alternative method for the determination of an optimal cutoff threshold is proposed.

10.
We report a significantly enhanced bioinformatics suite and database for proteomics research called Yale Protein Expression Database (YPED) that is used by investigators at more than 300 institutions worldwide. YPED meets the data management, archival, and analysis needs of high-throughput mass spectrometry-based proteomics research ranging from a single laboratory, or a group of laboratories within and beyond an institution, to the entire proteomics community. The current version is a significant improvement over the first version in that it contains new modules for liquid chromatography–tandem mass spectrometry (LC–MS/MS) database search results, label and label-free quantitative proteomic analysis, and several scoring outputs for phosphopeptide site localization. In addition, we have added both peptide and protein comparative analysis tools to enable pairwise analysis of distinct peptides/proteins in each sample and of overlapping peptides/proteins between all samples in multiple datasets. We have also implemented a targeted proteomics module for automated multiple reaction monitoring (MRM)/selective reaction monitoring (SRM) assay development. We have linked YPED's database search results and both label-based and label-free fold-change analysis to the Skyline Panorama repository for online spectra visualization. In addition, we have built enhanced functionality to curate peptide identifications into an MS/MS peptide spectral library for all of our protein database search identification results.

11.
With the recent rapid expansion of DNA and protein sequence databases, intensive efforts are underway to interpret the linear genetic information of DNA in terms of function, structure, and control of biological processes. The systematic identification and quantification of expressed proteins has proven particularly powerful in this regard. Large-scale protein identification is usually achieved by automated liquid chromatography-tandem mass spectrometry of complex peptide mixtures and sequence database searching of the resulting spectra [Aebersold and Goodlett, Chem. Rev. 2001, 101, 269-295]. As generating large numbers of sequence-specific (collision-induced dissociation, CID) mass spectra has become a routine operation, research has shifted from the generation of sequence database search results to their validation. Here we describe in detail a novel probabilistic model and score function that ranks the quality of the match between tandem mass spectral data and a peptide sequence in a database. We document the performance of the algorithm on a reference data set and in comparison with another sequence database search tool. The software is publicly available for use and evaluation at http://www.systemsbiology.org/research/software/proteomics/ProbID.

12.
In high-throughput proteomics, the development of computational methods and that of novel experimental strategies often rely on each other. In certain areas, mass spectrometry methods for data acquisition are ahead of computational methods to interpret the resulting tandem mass spectra. In particular, although there are numerous situations in which a mixture tandem mass spectrum can contain fragment ions from two or more peptides, nearly all database search tools still make the assumption that each tandem mass spectrum comes from one peptide. Common examples include mixture spectra from co-eluting peptides in complex samples, spectra generated from data-independent acquisition methods, and spectra from peptides with complex post-translational modifications. We propose a new database search tool (MixDB) that is able to identify mixture tandem mass spectra from more than one peptide. We show that peptides can be reliably identified with up to 95% accuracy from mixture spectra while considering only 0.01% of all possible peptide pairs (a four orders of magnitude speedup). Comparison with current database search methods indicates that our approach has better or comparable sensitivity and precision at identifying single-peptide spectra while simultaneously being able to identify 38% more peptides from mixture spectra at significantly higher precision.

13.
The combination of tandem mass spectrometry and sequence database searching is the method of choice for the identification of peptides and the mapping of proteomes. Over the last several years, the volume of data generated in proteomic studies has increased dramatically, which challenges the computational approaches previously developed for these data. Furthermore, a multitude of search engines have been developed that identify different, overlapping subsets of the sample peptides from a particular set of tandem mass spectrometry spectra. We present iProphet, the new addition to the widely used open-source suite of proteomic data analysis tools Trans-Proteomics Pipeline. Applied in tandem with PeptideProphet, it provides a more accurate representation of the multilevel nature of shotgun proteomic data. iProphet combines the evidence from multiple identifications of the same peptide sequences across different spectra, experiments, precursor ion charge states, and modified states. It also allows accurate and effective integration of the results from multiple database search engines applied to the same data. The use of iProphet in the Trans-Proteomics Pipeline increases the number of correctly identified peptides at a constant false discovery rate as compared with both PeptideProphet and another state-of-the-art tool, Percolator. As the main outcome, iProphet permits the calculation of accurate posterior probabilities and false discovery rate estimates at the level of sequence-identical peptide identifications, which in turn leads to more accurate probability estimates at the protein level. Fully integrated with the Trans-Proteomics Pipeline, it supports all commonly used MS instruments, search engines, and computer platforms.
The performance of iProphet is demonstrated on two publicly available data sets: data from a human whole cell lysate proteome profiling experiment representative of typical proteomic data sets, and from a set of Streptococcus pyogenes experiments more representative of organism-specific composite data sets.

14.
Sequence determination of peptides is a crucial step in mass spectrometry–based proteomics. Peptide sequences are determined either by database search or by de novo sequencing using tandem mass spectrometry. Determination of all the theoretically expected peptide fragments and elimination of false discoveries remain a challenge in proteomics. Developing standards for evaluating the performance of mass spectrometers and algorithms used for identification of proteins is important for proteomics studies. The current study is focused on these aspects by using synthetic peptides. A total of 599 peptides were designed from an in silico tryptic digest with 1 or 2 missed cleavages from 199 human proteins, and synthetic peptides corresponding to these sequences were obtained. The peptides were mixed together, and analysis was carried out using liquid chromatography–electrospray ionization tandem mass spectrometry on a Q-Exactive HF mass spectrometer. The peptides and proteins were identified with the SEQUEST program. The analysis was carried out using the proteomics workflows. A total of 573 peptides representing 196 proteins could be identified, and a spectral library was created for these peptides. Analysis parameters such as “no enzyme selection” gave the maximum number of detected peptides as compared with trypsin in the selection. False discoveries could be identified. This study highlights the limitations of peptide detection and the need for developing powerful algorithms along with tools to evaluate mass spectrometers and algorithms. It also shows the limitations of peptide detection even with high-end mass spectrometers. The mass spectral data are available in ProteomeXchange with accession no. PXD017992.

15.
Informatics for protein identification by mass spectrometry   (total citations: 3; self-citations: 0; citations by others: 3)
High-throughput protein analysis (i.e., proteomics) first became possible when sensitive peptide mass mapping techniques were developed, thereby allowing for the possibility of identifying and cataloging most 2D gel electrophoresis spots. Shortly thereafter, a few groups pioneered the idea of identifying proteins by using peptide tandem mass spectra to search protein sequence databases. Hence, it became possible to identify proteins from very complex mixtures. One drawback to these latter techniques is that it is not entirely straightforward to make matches using tandem mass spectra of peptides that are modified or have sequences that differ slightly from what is present in the sequence database that is being searched. This has been part of the motivation behind automated de novo sequencing programs that attempt to derive a peptide sequence regardless of its presence in a sequence database. The sequence candidates thus generated are then subjected to homology-based database search programs (e.g., BLAST or FASTA). These homology search programs, however, were not developed with mass spectrometry in mind, and it became necessary to make minor modifications such that mass spectrometric ambiguities can be taken into account when comparing query and database sequences. Finally, this review will discuss the important issue of validating protein identifications. All of the search programs will produce a top ranked answer; however, only the credulous are willing to accept them carte blanche.

16.
Despite the publication of several software tools for analysis of glycopeptide tandem mass spectra, there remains a lack of consensus regarding the most effective and appropriate methods. In part, this reflects problems with applying standard methods for proteomics database searching and false discovery rate calculation. While the analysis of small post-translational modifications (PTMs) may be regarded as an extension of proteomics database searching, glycosylation requires specialized approaches. This is because glycans are large and heterogeneous by nature, causing glycopeptides to exist as multiple glycosylated variants. Thus, the mass of the peptide cannot be calculated directly from that of the intact glycopeptide. In addition, the chemical nature of the glycan strongly influences product ion patterns observed for glycopeptides. As a result, glycopeptidomics requires specialized bioinformatics methods. We summarize the recent progress towards a consensus for effective glycopeptide tandem mass spectrometric analysis.

17.
Park GW  Kwon KH  Kim JY  Lee JH  Yun SH  Kim SI  Park YM  Cho SY  Paik YK  Yoo JS 《Proteomics》2006,6(4):1121-1132
In shotgun proteomics, proteins can be fractionated by 1-D gel electrophoresis and digested into peptides, followed by liquid chromatography to separate the peptide mixture. Mass spectrometry generates hundreds of thousands of tandem mass spectra from these fractions, and proteins are identified by database searching. However, the search scores are usually not sufficient to distinguish the correct peptides. In this study, we propose a confident protein identification method for high-throughput analysis of the human proteome. To build a filtering protocol in database search, we chose Pseudomonas putida KT2440 as a reference because this bacterial proteome contains fewer modifications and is simpler than the human proteome. First, the P. putida KT2440 proteome was filtered by reversed sequence database search and correlated by the molecular weight in 1-D-gel band positions. The characterization protocol was then applied to determine the criteria for clustering of the human plasma proteome into three different groups. This protein filtering method, based on bacterial proteome data analysis, represents a rapid way to generate a higher-confidence protein list of the human proteome, which includes some heavily modified and cleaved proteins.

18.
This paper describes an open-source system for analyzing, storing, and validating proteomics information derived from tandem mass spectrometry. It is based on a combination of data analysis servers, a user interface, and a relational database. The database was designed to store the minimum amount of information necessary to search and retrieve data obtained from the publicly available data analysis servers. Collectively, this system was referred to as the Global Proteome Machine (GPM). The components of the system have been made available as open source development projects. A publicly available system has been established, comprised of a group of data analysis servers and one main database server.

19.
Reliable statistical validation of peptide and protein identifications is a top priority in large-scale mass spectrometry-based proteomics. PeptideProphet is one of the computational tools commonly used for assessing the statistical confidence in peptide assignments to tandem mass spectra obtained using database search programs such as SEQUEST, MASCOT, or X! TANDEM. We present two flexible methods, the variable component mixture model and the semiparametric mixture model, that remove the restrictive parametric assumptions in the mixture modeling approach of PeptideProphet. Using a control protein mixture data set generated on a linear ion trap Fourier transform (LTQ-FT) mass spectrometer, we demonstrate that both methods improve on parametric models in terms of the accuracy of probability estimates and the power to detect correct identifications while controlling the false discovery rate to the same degree. The statistical approaches presented here require that the data set contain a sufficient number of decoy (known to be incorrect) peptide identifications, which can be obtained using the target-decoy database search strategy.

20.
Proteomics research routinely involves identifying peptides and proteins via MS/MS sequence database search. Thus the database search engine is an integral tool in many proteomics research groups. Here, we introduce the Comet search engine to the existing landscape of commercial and open‐source database search tools. Comet is open source, freely available, and based on one of the original sequence database search tools that has been widely used for many years.
