Peptide and protein identification remains challenging in organisms with poorly annotated or rapidly evolving genomes, as are commonly encountered in environmental or biofuels research. Such limitations render tandem mass spectrometry (MS/MS) database search algorithms ineffective as they lack corresponding sequences required for peptide-spectrum matching. We address this challenge with the spectral networks approach to (1) match spectra of orthologous peptides across multiple related species and then (2) propagate peptide annotations from identified to unidentified spectra. We here present algorithms to assess the statistical significance of spectral alignments (Align-GF), reduce the impurity in spectral networks, and accurately estimate the error rate in propagated identifications. Analyzing three related
Cyanothece species, a model organism for biohydrogen production, spectral networks identified peptides from highly divergent sequences from networks with dozens of variant peptides, including thousands of peptides in species lacking a sequenced genome. Our analysis further detected the presence of many novel putative peptides even in genomically characterized species, thus suggesting the possibility of gaps in our understanding of their proteomic and genomic expression. A web-based pipeline for spectral networks analysis is available at
http://proteomics.ucsd.edu/software.

Microorganisms have evolved their cellular metabolism to generate energy for life in unusual environments (
1), and their capabilities are of great interest in the production of renewable bioenergy and could contribute toward managing the world's current energy and climate crisis (
2). Genomics studies have increased the number of sequenced bioenergy-related microbial genomes and revealed the possible biological reactions involved in bioenergy production (
3). Studies of photosynthetic microorganisms, for example, have yielded insights into how they harvest solar energy and use it to produce bioenergy products (
4). Despite this importance of microorganisms, the characterization of diverse microbial phenotypes by proteomics tandem mass spectrometry (MS/MS) has been limited. The dominant approaches for MS/MS analysis heavily rely on the availability of completely annotated genomes (
i.e. accurate protein databases) (
5–
7), yet most microorganisms populating the planet have unsequenced or poorly annotated genomes. Thus it remains challenging to identify proteins from environmental and unculturable organisms.

One solution to protein identification in a species with no sequenced genome is to use the genomes of closely related species (
8). This requires matching MS/MS data to peptides with slightly different amino acid sequences (polymorphic, orthologous peptides), but matching the shifted masses of peptides and their fragment ions is computationally expensive and challenging. Moreover, species-specific post-translational modifications (PTMs) can make cross-species identification more complex. The common computational approach is to tolerantly match
de novo sequences derived from MS/MS data to the database while allowing for amino acid mutations and modifications (
9–
11). However, this approach critically depends on good
de novo interpretations, which are nearly always partially incorrect and yield high-quality subsequences only for a small fraction of all spectra. The blind database search approach, developed to identify peptides with unexpected modifications, can also be used to directly match MS/MS data from unknown species to a database of closely related species, but its utilization is limited because of its exceptionally large search space (
12–
18). These spectrum-database matching approaches to cross-species identification face significant challenges in speed and sensitivity when the database is very large, which leads to much longer search times and more false positive identifications (
19,
20).

As a complement to spectrum-database matching, spectral library searching is an emerging and promising approach (
21). A spectral library is a large collection of identified MS/MS spectra, and an unknown query spectrum can then be identified by direct spectral matching to the library. The great advantage of this approach is the reduction of search space and the use of fragmentation patterns of peptides. The spectral networks approach expands this concept to the identification of modified peptides in MS/MS data sets (
22,
23). Spectral networks analysis does not directly search a database; instead, it groups MS/MS spectra by computing the pairwise similarity between MS/MS spectra of peptide variants and then constructs networks in which each spectrum defines a node and each significant spectral pair, highly correlated in fragmentation pattern, defines an edge. In spectral networks, the identifications of spectra belonging to the same subnetwork should be related, and thus the peptide sequence of an identified spectrum can be propagated to neighboring unidentified spectra.
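The node, edge, and propagation idea described above can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: spectra are represented as dictionaries of binned m/z to intensity, plain cosine similarity with a fixed threshold stands in for the spectral alignment scoring used in the paper (it does not handle mass-shifted peaks), and the propagated value is just a peptide-family label rather than a sequence adjusted for the mass difference. The function names `cosine`, `build_network`, and `propagate` are hypothetical.

```python
from collections import deque
from itertools import combinations
import math


def cosine(a, b):
    """Cosine similarity of two spectra given as {binned m/z: intensity} dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_network(spectra, threshold=0.7):
    """One node per spectrum; one edge per pair scoring at or above threshold.

    The threshold is an arbitrary placeholder, not a value from the paper.
    """
    edges = {name: set() for name in spectra}
    for u, v in combinations(spectra, 2):
        if cosine(spectra[u], spectra[v]) >= threshold:
            edges[u].add(v)
            edges[v].add(u)
    return edges


def propagate(edges, annotations):
    """Breadth-first propagation of labels from identified to unidentified nodes."""
    result = dict(annotations)
    queue = deque(annotations)
    while queue:
        node = queue.popleft()
        for neighbor in edges[node]:
            if neighbor not in result:
                # A real implementation would also localize the mass shift or
                # mutation on the propagated sequence; here we copy the label.
                result[neighbor] = result[node]
                queue.append(neighbor)
    return result


spectra = {
    "s1": {100: 1.0, 200: 1.0, 300: 1.0},
    "s2": {100: 1.0, 200: 1.0, 300: 0.8},
    "s3": {100: 1.0, 200: 0.9, 350: 0.5},
    "s4": {500: 1.0, 600: 1.0},
}
network = build_network(spectra)
labels = propagate(network, {"s1": "peptide A"})
```

In this toy input, `s1`, `s2`, and `s3` share enough peaks to form one subnetwork, so the single annotation on `s1` reaches all three, while `s4` shares no peaks with the others and stays unlabeled.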
Overview of multi-species spectral networks. Nodes represent individual spectra and edges between nodes represent significant pairwise alignment between spectra; edges are labeled with amino acid mutations (dotted edges) or parent mass differences (solid edges). In spectral networks, a peptide and its related variants are ideally grouped into a single subnetwork. If at least one spectrum in a subnetwork is annotated (filled node), all the neighboring spectra (unfilled nodes) can potentially become identified by propagating the annotation over network edges. For example, all spectra in the subnetwork of “peptide A” (top left, blue network) can be annotated via up to three iterative propagations, first from A to {A1, A2, A3}, second from {A2, A3} to {A4, A5}, and third from {A4, A5} to A6. This paradigm can be equally applied to cross-species data analysis, as “peptide L” identified in species 1 (top middle, olive-colored network) is propagated to a node unidentified in species 2, identifying its orthologous “peptide l”, with a serine to alanine polymorphism. Thus, spectral networks enable the detection of orthologous peptide pairs between different species.

We recently reported that a vast number of polymorphic, orthologous peptides across species are present in MS/MS data sets (
24). We propose a new approach to cross-species proteomics research that aggregates MS/MS spectra of multiple related species and then applies spectral networks analysis to the pooled data to capitalize on pairs of spectra from orthologous peptides. This approach does not require advance knowledge of the genomes of all species, and enables the identification of novel, polymorphic peptides across species via interspecies propagation. Compared with previous approaches, cross-species spectral network analysis has two major advantages. First, by matching spectra to spectra instead of spectra to database sequences, spectral networks only consider the sequence variability of peptides present in the samples instead of considering all possible variability across the whole database of related species; thus the performance of spectral networks is independent of database size. Second, the analysis of a set of highly related spectra increases the reliability of identifying polymorphic peptides, in that multiple different spectra can support the same novel identification. The utility of spectral networks can also be extended to the proteomic analysis of microbial communities, which often contain hundreds of distinct organisms (
25,
26). But despite the success of spectral networks in low complexity data sets (
22,
23), the analysis of large multi-species proteomics data requires significantly higher reliability in spectral similarity scores because the number of pairwise spectral comparisons grows quadratically with the number of spectra.

In this work, we present algorithmic and statistical advances to spectral networks to improve their utility with large and diverse spectral data sets. To statistically assess the significance of spectral alignments when pairing millions of spectra, we propose Align-GF (generating function for spectral alignment) to compute rigorous
p values of a spectral pair based on the complete score histogram of all possible alignments between two spectra. We show that Align-GF successfully addresses the reliability challenge in the analysis of a large data set and demonstrate its utility with a 4-fold increase in the sensitivity of spectral pair detection. Even with this dramatically improved accuracy, a very small number of incorrect pairs in a network can still complicate the propagation of annotations. To further progress toward the ideal scenario where each subnetwork consists only of spectra from a single peptide family, we introduce new procedures to split mixed networks from different peptide families and show that these effectively eliminate many false spectral pairs. Finally, we propose the first approach to calculating the false discovery rate (FDR) for spectral networks propagation of identifications from unmodified to progressively more modified peptides. The proposed FDR estimation was conservative and was more rigorous for highly modified peptides, and thus makes propagation results comparable to those of other peptide identification approaches.

The cross-species spectral networks techniques proposed here enabled the proteomic analysis of three different
Cyanothece species, including a strain whose genome sequence is not known.
Cyanobacteria are among the most diverse and widely distributed microorganisms and have received considerable attention for their potential to meet the various demands of bioenergy generation (
27). We show that spectral networks can improve peptide identification by up to 38% compared with mainstream approaches, including many polymorphic and modified peptides. Spectral networks could identify peptides with highly divergent sequences (with 7 amino acid mutations) by leveraging networks of variant peptides, and one example subnetwork of species-specific variants of phycobilisome proteins reflects the diversity of photosynthetic light-harvesting strategies (
28). Our approach thus demonstrates the potential gains in multi-species proteomics and sets the stage for related developments in higher-complexity metaproteomics samples. Finally, spectral networks revealed many subnetworks containing only unidentified spectra, strongly suggesting the presence of novel peptides that are missing from current protein databases. Although we illustrate the potential of our approach on a specific set of bioenergy-related species, we note that the proposed approach is generic and should be applicable to any other set of related species. The diversity of biologically important protein families could be studied by comparing closely and more remotely related species.
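As a rough illustration of why pairing millions of spectra demands stringent p values, as argued earlier in this introduction, the sketch below counts the pairwise comparisons for a given number of spectra and estimates how many unrelated pairs would pass a p value threshold by chance. It also shows a tail p value read off a score histogram. The assumptions are loud ones: nearly all pairs are taken to be unrelated, the histogram is supplied as a plain dictionary rather than computed by a generating function over all possible alignments as Align-GF does, and all function names are hypothetical.

```python
def n_pairs(n_spectra):
    """Number of distinct spectrum pairs; grows quadratically with n_spectra."""
    return n_spectra * (n_spectra - 1) // 2


def expected_false_pairs(n_spectra, p_threshold):
    """Expected chance pairs passing the threshold if nearly all pairs are unrelated."""
    return n_pairs(n_spectra) * p_threshold


def tail_p_value(score_histogram, observed_score):
    """Fraction of alignments scoring at least observed_score.

    score_histogram maps an alignment score to the number of alignments
    with that score (assumed to cover all alignments of the two spectra).
    """
    total = sum(score_histogram.values())
    tail = sum(count for score, count in score_histogram.items()
               if score >= observed_score)
    return tail / total


pairs_thousand = n_pairs(1_000)
pairs_million = n_pairs(1_000_000)
false_at_1e3 = expected_false_pairs(1_000, 1e-3)
p_top = tail_p_value({0: 90, 1: 9, 2: 1}, observed_score=2)
```

Under these assumptions, a thousandfold increase in the number of spectra multiplies the number of comparisons by roughly a million, so a threshold that admits a tolerable number of chance pairs in a small data set admits far too many in a large one.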