首页 | 本学科首页   官方微博 | 高级检索  
   检索      


PepExplorer: A Similarity-driven Tool for Analyzing de Novo Sequencing Results
Authors:Felipe V Leprevost  Richard H Valente  Diogo B Lima  Jonas Perales  Rafael Melani  John R Yates  III  Valmir C Barbosa  Magno Junqueira  Paulo C Carvalho
Institution:3. Laboratory for Proteomics and Protein Engineering, Carlos Chagas Institute, Fiocruz, Paraná, Brazil;4. Laboratory of Toxinology, Oswaldo Cruz Institute, Fiocruz, Rio de Janeiro, Brazil;5. Instituto Nacional de Ciência e Tecnologia em Toxinas (INCTTox/CNPq), Brazil;6. Proteomics Unit, Rio de Janeiro Proteomics Network, Department of Biochemistry, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil;12. Systems Engineering and Computer Science Program, Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
Abstract:Peptide spectrum matching is the current gold standard for protein identification via mass-spectrometry-based proteomics. Peptide spectrum matching compares experimental mass spectra against theoretical spectra generated from a protein sequence database to perform identification, but protein sequences not present in a database cannot be identified unless their sequences are in part conserved. The alternative approach, de novo sequencing, can make it possible to infer a peptide sequence directly from a mass spectrum, but interpreting long lists of peptide sequences resulting from large-scale experiments is not trivial. With this as motivation, PepExplorer was developed to use rigorous pattern recognition to assemble a list of homologue proteins using de novo sequencing data coupled to sequence alignment to allow biological interpretation of the data. PepExplorer can read the output of various widely adopted de novo sequencing tools and converge to a list of proteins with a global false-discovery rate. To this end, it employs a radial basis function neural network that considers precursor charge states, de novo sequencing scores, peptide lengths, and alignment scores to select similar protein candidates, from a target-decoy database, usually obtained from phylogenetically related species. Alignments are performed using a modified Smith–Waterman algorithm tailored for the task at hand. We verified the effectiveness of our approach using a reference set of identifications generated by ProLuCID when searching for Pyrococcus furiosus mass spectra on the corresponding NCBI RefSeq database. We then modified the sequence database by swapping amino acids until ProLuCID was no longer capable of identifying any proteins. By searching the mass spectra using PepExplorer on the modified database, we were able to recover most of the identifications at a 1% false-discovery rate. Finally, we employed PepExplorer to disclose a comprehensive proteomic assessment of the Bothrops jararaca plasma, a known biological source of natural inhibitors of snake toxins. PepExplorer is integrated into the PatternLab for Proteomics environment, which makes available various tools for downstream data analysis, including resources for quantitative and differential proteomics.Very often, groundbreaking discoveries with a significant impact on the biotechnological and biomedical fields have emerged from studying “non-canonical” organisms. For example, the study of Thermus aquaticus allowed us to ultimately pave the way to modern molecular biology with the characterization of that organism''s thermostable DNA polymerase (1). The characterization of the green fluorescent protein in Aequoria victoria led to a revolution in cellular biology and to a Nobel Prize being awarded to Osamu Shimomura, Martin Chalfie, and Roger Tsien. In Brazil, Sergio Ferreira''s work on the venom of the Brazilian poisonous snake Bothrops jararaca enabled the development of the first angiotensin-converting enzyme inhibitor drug (Captopril) for the treatment of hypertension (2).In scenarios such as these, proteomics has the potential to allow a better understanding of the complexity of biological systems and the process of evolution than the study of the genetic code alone. It enables the characterization of molecular processes according to their protein content, facilitating new discoveries. In proteomics, the most frequently used strategy for protein identification is so-called peptide spectrum matching (PSM),1 or the comparison of experimental mass spectra obtained by fragmenting peptides in a mass spectrometer to theoretical spectra generated from a sequence database. In general, the identification process follows from the sequence whose theoretical spectrum yields the highest matching score according to some empirical or probabilistic function. Examples of search engines adopting this strategy are SEQUEST (3), X!Tandem (4), and Mascot (5).Back in the 1990s, establishment of a cutoff score for confident identification relied mostly on user experience; for example, given a specific charge state, Washburn et al. established cross-correlation and deltaCn cutoff values for SEQUEST in order to allow the selection of a subset of confident identifications from LCQ data. This has since been termed “the Washburn criterion.” In what followed, target-decoy databases were implemented to allow for more sophisticated refinements in filtering the data (6). In 2007, Elias and Gygi published a seminal paper on the target-decoy approach to shotgun proteomics (7) that ultimately firmed this approach as a standard and motivated the development of several statistical filters capable of converging to a list of confident identifications satisfying a user-specified false-discovery rate (FDR) with significantly more sensitivity than the conservative Washburn criterion. Such statistical filters include mixtures of probabilities (8), quadratic discriminant analysis (9), semi-supervised learning with support vector machines (10), and Bayesian logic (11) using a semi-labeled decoy analysis to account for overfitting (12). With so many advances, the PSM workflow has become the gold standard, as it is very sensitive and the least error-prone method when a database is available with the corresponding proteins. The latter factor limits the application of PSM to those organisms for which accurate sequence databases have been established. If a peptide''s sequence is not contained within the sequence database, it cannot be identified via the PSM method. However, efforts in developing error-tolerant PSM approaches such as implemented in Mascot have made it possible to handle minor sequence modifications constrained by a simple set of rules. Nevertheless, increasing the search space in the PSM approach leads to decreased sensitivity (13).Even though the concept of computer-aided de novo sequencing predates that of PSM (14), advances in the quality of mass spectrometry data and the power of computer hardware have allowed it to reemerge at the heart of a highly active field. De novo sequencing is unbiased insofar as it is not constrained by a sequence database, and it is therefore complementary to PSM. However, it has remained the most error prone of the two methods (15). The challenges of de novo sequencing notwithstanding, a few recent and notable improvements in computer-aided de novo analysis are PepNovo (16), which combines graph theory with machine learning; pNovo+ (17), which is optimized for high-resolution HCD data; NovoHMM (18), relying on hidden Markov models for increased sensitivity; and PEAKS (19), which creates a spectrum graph model by performing dynamic programming on the mass values regardless of the presence of an observed fragment ion. By considering the complementarities of different fragmentation strategies (e.g. collision induced dissociation, electron transfer dissociation (20), and electron capture dissociation (21)), computational proteomics scientists have also demonstrated significant advances in de novo accuracy (22). In particular, the Bandeira group has continually pushed the limits and redefined the notion of what de novo sequencing can do by introducing the spectral networks paradigm (2325). Briefly, this strategy can assemble mass spectra into spectral pairs by joining overlapping spectra obtained from sample aliquots digested by different enzymes. As a consequence, it reduces noise and significantly improves protein coverage. Its latest version also combines data from different fragmentation techniques.These algorithm developments have improved de novo sequencing, shifting the bottleneck to post-sequence processing of data. This is because the output of de novo software is a long list of highly similar full and partial peptide sequence and scores. An initial attempt to overcome these limitations consisted of a tag approach that was a hybrid of de novo sequencing and database searching: short sequence tags were derived from tandem mass spectra and used to search a sequence database (26). In what followed, a modified version based on the FASTA homology search tool was proposed for homology-driven proteomics (27). This strategy was implemented as part of the CIDentify tool, whose novelty was to account, in the alignment score, for limitations of mass spectrometry sequencing such as switching between leucine and isoleucine or other combinations of amino acids having the same mass. The next steps were taken mainly by the Shevchenko group through the introduction of the MS-Blast algorithm, which relies on a different set of scores and uses the PAM30MS substitution matrix, itself tailored for mass-spectrometry-based proteomics (28, 29). For a complete review of de novo sequencing and homology searching, we suggest Ref. 30.The current de novo post-processing paradigm presents several limitations that are similar to those of the early PSM workflow. Output files generally consist of a peptide list with corresponding scores, demanding an experienced user to assess trustworthy identifications. If the same peptide is analyzed by different mass spectrometers, different scores might be generated, which makes data comparison between different groups a challenging task. In a sense, problems are similar to those encountered when adopting the early Washburn criterion. Assembling protein information from a list of peptides is not a simple task, and usually it is not performed using state-of-the-art de novo tools. Although there are great tools for doing this at the PSM level, there is still a lack of similar tools for de novo sequencing.To tackle the aforementioned shortcomings, and in line with our strong interest in diversity-driven proteomics (29), we present a methodology for post-processing de novo sequencing data that allows inference of protein identification through statistical mapping of de novo sequencing results to a protein sequence database. Our approach begins with the use of Gotoh''s version of the Smith–Waterman algorithm, based on affine gap scoring (31) for increased scalability, to align de novo sequences against those in a target-decoy database. Then a radial basis function neural network (RBF-NN) is used to rank results according to alignment score, de novo score, precursor charge state, and peptide length. Finally, a heuristic method is used to present protein identification results in a user-friendly, interactive report. The resulting algorithm was implemented as the software PepExplorer. In essence, its goal is somewhat similar to that of post-processing tools such as DTASelect (9), Percolator (10), and SEPro (11), but with an extra layer of complexity inherent from de novo sequencing. PepExplorer can handle the output of several widely adopted de novo tools, such as PepNovo, pNovo+, and PEAKS, and accepts a generic format to enable result analysis from a broader range of tools once results are run through simple parsers. Similarly, the software accepts a series of database formats for input analysis. These features are not found in other tools. PepExplorer is freely available to the scientific community and is provided with the necessary documentation.The effectiveness of our methodology has been verified in two distinct scenarios, the first a real but controlled experiment and the other pertaining to comprehensive profiling of the plasma components of Bothrops jararaca, a venomous viper endemic to Brazil, southern Paraguay, and northern Argentina. The first scenario''s purpose was to validate the effectiveness of the tool in analyzing a published Pyrococcus furiosus dataset (11). We note that this organism is recognized by the proteomics community as well suited for benchmarking, because it allows for the rigorous testing of identification algorithms at the peptide and protein levels (32, 33). We modified the P. furiosus sequence database in such a way that no more peptides were identified via the PSM approach or another widely adopted error-tolerant search tool, Mod-A (34). We then found that we could recover protein identifications using our tool. The B. jararaca scenario has allowed us to explore uncharted territory, as this organism has an incomplete sequence database and we were therefore required to rely on those of orthologous organisms. In particular, B. jararaca plasma was chosen because it is a main research model studied at the Laboratory of Toxinology (FIOCRUZ, Brazil), and several natural inhibitors of snake toxins have already been identified/characterized from this biological matrix (3537).
Keywords:
本文献已被 ScienceDirect 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号