Similar articles
 Found 20 similar articles (search time: 15 ms)
1.

Background

It is important to accurately determine the performance of peptide:MHC binding predictions, as this enables users to compare and choose between different prediction methods and provides estimates of the expected error rate. Two common approaches to determine prediction performance are cross-validation, in which all available data are iteratively split into training and testing data, and the use of blind sets generated separately from the data used to construct the predictive method. In the present study, we have compared cross-validated prediction performances generated on our last benchmark dataset from 2009 with prediction performances generated on data subsequently added to the Immune Epitope Database (IEDB) which served as a blind set.

Results

We found that cross-validated performances systematically overestimated performance on the blind set. This was found not to be due to the presence of similar peptides in the cross-validation dataset. Rather, we found that small size and low sequence/affinity diversity of either training or blind datasets were associated with large differences in cross-validated vs. blind prediction performances. We use these findings to derive quantitative rules of how large and diverse datasets need to be to provide generalizable performance estimates.

Conclusion

It has long been known that cross-validated prediction performance estimates often overestimate performance on independently generated blind set data. We here identify and quantify the specific factors contributing to this effect for MHC-I binding predictions. An increasing number of peptides for which MHC binding affinities are measured experimentally have been selected based on binding predictions and thus are less diverse than historic datasets sampling the entire sequence and affinity space, making them more difficult benchmark data sets. This has to be taken into account when comparing performance metrics between different benchmarks, and when deriving error estimates for predictions based on benchmark performance.

Electronic supplementary material

The online version of this article (doi:10.1186/1471-2105-15-241) contains supplementary material, which is available to authorized users.
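
The two evaluation schemes contrasted in entry 1 can be illustrated with a minimal sketch. It is not the authors' code: the function names `k_fold_splits` and `blind_set`, the fold count, and the toy peptide identifiers are all assumptions made for illustration. Cross-validation repeatedly partitions one dataset into training and testing folds, while a blind set is assumed here to be later data with any overlap with the benchmark removed.

```python
import random


def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) pairs; every item appears in exactly one test fold."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def blind_set(later_items, benchmark_items):
    """Keep only later items that never appeared in the benchmark data."""
    seen = set(benchmark_items)
    return [x for x in later_items if x not in seen]


# Hypothetical peptide identifiers, for illustration only.
benchmark = [f"PEP{i:02d}" for i in range(10)]
later = ["PEP03", "PEP10", "PEP11"]

splits = list(k_fold_splits(benchmark, k=5))
blind = blind_set(later, benchmark)
```

A model would be trained on each `train` list and scored on the matching `test` list, then scored once more on `blind`; the abstract reports that the first number tends to exceed the second.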

2.
Signal peptides and transmembrane helices both contain a stretch of hydrophobic amino acids. This common feature makes it difficult for signal peptide and transmembrane helix predictors to correctly assign identity to stretches of hydrophobic residues near the N-terminal methionine of a protein sequence. The inability to reliably distinguish between N-terminal transmembrane helix and signal peptide is an error with serious consequences for the prediction of protein secretory status or transmembrane topology. In this study, we report a new method for differentiating protein N-terminal signal peptides and transmembrane helices. Based on the sequence features extracted from hydrophobic regions (amino acid frequency, hydrophobicity, and the start position), we set up discriminant functions and examined them on non-redundant datasets with jackknife tests. This method can incorporate other signal peptide prediction methods and achieve higher prediction accuracy. For Gram-negative bacterial proteins, 95.7% of N-terminal signal peptides and transmembrane helices can be correctly predicted (coefficient 0.90). Given a sensitivity of 90%, transmembrane helices can be identified from signal peptides with a precision of 99% (coefficient 0.92). For eukaryotic proteins, 94.2% of N-terminal signal peptides and transmembrane helices can be correctly predicted with coefficient 0.83. Given a sensitivity of 90%, transmembrane helices can be identified from signal peptides with a precision of 87% (coefficient 0.85). The method can be used to complement current transmembrane protein prediction and signal peptide prediction methods to improve their prediction accuracies.

3.
4.
The identification of MHC class II restricted peptide epitopes is an important goal in immunological research. A number of computational tools have been developed for this purpose, but there is a lack of large-scale systematic evaluation of their performance. Herein, we used a comprehensive dataset consisting of more than 10,000 previously unpublished MHC-peptide binding affinities, 29 peptide/MHC crystal structures, and 664 peptides experimentally tested for CD4+ T cell responses to systematically evaluate the performances of publicly available MHC class II binding prediction tools. While in selected instances the best tools were associated with AUC values up to 0.86, in general, class II predictions did not perform as well as historically noted for class I predictions. It appears that the ability of MHC class II molecules to bind variable length peptides, which requires the correct assignment of peptide binding cores, is a critical factor limiting the performance of existing prediction tools. To improve performance, we implemented a consensus prediction approach that combines methods with top performances. We show that this consensus approach achieved best overall performance. Finally, we make the large datasets used publicly available as a benchmark to facilitate further development of MHC class II binding peptide prediction methods.

5.
Prediction of peptides binding to HLA (human leukocyte antigen) finds application in peptide vaccine design. A number of statistical and structural models have been developed in recent years for HLA binding peptide prediction. However, a Bayesian Network (BNT) model is not available. In this study we describe a BNT model for HLA-A2 binding peptide prediction. It has been demonstrated that the BNT model allows up to 99% accurate identification of the HLA-A2 binding peptides and provides prediction accuracy similar to that of HMM (Hidden Markov Model) and ANN (Artificial Neural Network). At the same time, it has been shown that the BNT has the advantage of performing more accurately on smaller sets of empirical data than the HMM and the ANN methods. When the training set was reduced to 40% of the original data, the identification of the HLA-A2 binding peptides by the BNT, ANN and HMM methods produced ARoc (area under receiver operating characteristic) values of 0.88, 0.85, and 0.85, respectively. The results of the work demonstrate certain advantages of using Bayesian Networks in predicting HLA binding peptides from smaller datasets.
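
Several entries in this list report the area under the ROC curve (AUC, called ARoc in entry 5). As a hedged illustration of what that number measures, the sketch below computes it as the fraction of (binder, non-binder) pairs that a predictor ranks correctly, counting ties as half. The function name `roc_auc` and the toy scores are assumptions; none of the cited papers specifies this implementation.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win. Runs in O(P * N) time, which is fine for a sketch.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy predictions: label 1 marks a binder, 0 a non-binder (illustrative values).
perfect = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0])
partial = roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0])
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives context for values such as 0.85 to 0.88 quoted above.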

6.
Many genomes of nonmodel organisms are yet to be annotated. Peptidomics research on those organisms therefore cannot adopt the commonly used database-driven identification strategy, leaving the more difficult de novo sequencing approach as the only alternative. The reported tool uses the growing resources of publicly or in-house available fragmentation spectra and sequences of (model) organisms to elucidate the identity of peptides of experimental spectra of nonannotated species. Clustering algorithms are implemented to infer the identity of unknown peak lists based on their publicly or in-house available counterparts. The reported tool, which we call the HomClus-tool, can cope with post-translational modifications and amino acid substitutions. We applied this tool to LC-MALDI-TOF/TOF datasets from two locusts (Schistocerca gregaria and Locusta migratoria). Compared to a Mascot database search (using the available UniProt-KB proteins of these species), we were able to double the number of peptide identifications for both spectral sets. Known bioactive peptides from Drosophila melanogaster (i.e., fragmentation spectra generated in silico from them) were used as a starting point for clustering, in an attempt to reveal their experimental homologues.

7.
LC–MS/MS has become the standard platform for the characterization of immunopeptidomes, the collection of peptides naturally presented by major histocompatibility complex molecules to the cell surface. The protocols and algorithms used for immunopeptidomics data analysis are based on tools developed for traditional bottom‐up proteomics that address the identification of peptides generated by tryptic digestion. Such algorithms are generally not tailored to the specific requirements of MHC ligand identification and, as a consequence, immunopeptidomics datasets suffer from dismissal of informative spectral information and high false discovery rates. Here, a new pipeline for the refinement of peptide‐spectrum matches (PSM) is proposed, based on the assumption that immunopeptidomes contain a limited number of recurring peptide motifs, corresponding to MHC specificities. Sequence motifs are learned directly from the individual peptidome by training a prediction model on high‐confidence PSMs. The model is then applied to PSM candidates with lower confidence, and sequences that score significantly higher than random peptides are rescued as likely true ligands. The pipeline is applied to MHC class I immunopeptidomes from three different species, and it is shown that it can increase the number of identified ligands by up to 20–30%, while effectively removing false positives and products of co‐precipitation. Spectral validation using synthetic peptides confirms the identity of a large proportion of rescued ligands in the experimental peptidome.

8.
APCs process heat shock protein (HSP):peptide complexes to present HSP-chaperoned peptides on class I MHC molecules, but the ability of HSPs to contribute chaperoned peptides for class II MHC (MHC-II) Ag processing and presentation is unclear. Our studies revealed that exogenous bacterial HSPs (Escherichia coli DnaK and Mycobacterium tuberculosis HSP70) delivered an extended OVA peptide for processing and MHC-II presentation, as detected by T hybridoma cells. Bacterial HSPs enhanced MHC-II presentation only if peptide was complexed to the HSP, suggesting that the key HSP function was enhanced delivery or processing of chaperoned peptide Ag rather than generalized enhancement of APC function. HSP-enhanced processing was intact in MyD88 knockout cells, which lack most TLR signaling, further suggesting the effect was not due to TLR-induced induction of accessory molecules. Bacterial HSPs enhanced uptake of peptide, which may contribute to increased MHC-II presentation. In addition, HSPs enhanced binding of peptide to MHC-II molecules at pH 5.0 (the pH of vacuolar compartments), but not at pH 7.4, indicating another mechanism for enhancement of MHC-II Ag processing. Bacterial HSPs are a potential source of microbial peptide Ags during phagocytic processing of bacteria during infection and could potentially be incorporated in vaccines to enhance presentation of peptides to CD4+ T cells.

9.
10.
11.
12.
MOTIVATION: Machine learning methods such as neural networks, support vector machines, and other classification and regression methods rely on iterative optimization of the model quality in the space of the parameters of the method. Model quality measures (accuracies, correlations, etc.) are frequently overly optimistic because the training sets are dominated by particular families and subfamilies. To overcome the bias, the dataset is usually reduced by filtering out closely related objects. However, such filtering uses fixed similarity thresholds and ignores a part of the training information. RESULTS: We suggested a novel approach to calculate prediction model quality based on assigning to each data point inverse density weights derived from the postulated distance metric. We demonstrated that our new weighted measures estimate the model generalization better and are consistent with the machine learning theory. The Vapnik-Chervonenkis theorem was reformulated and applied to derive the space-uniform error estimates. Two examples were used to illustrate the advantages of the inverse density weighting. First, we demonstrated on a set with a built-in bias that the unweighted cross-validation procedure leads to an overly optimistic quality estimate, while the density-weighted quality estimates are more realistic. Second, an analytical equation for weighted quality estimates was used to derive an SVM model for signal peptide prediction using a full set of known signal peptides, instead of the usual filtered subset.
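
The inverse density weighting idea in entry 12 can be sketched as follows. The abstract does not give the weighting formula, so this example makes its own assumptions: density is taken to be the number of data points within a fixed radius under Hamming distance, and the weight is the reciprocal of that count. The names `hamming`, `inverse_density_weights`, and `weighted_accuracy`, the radius, and the toy sequences are all hypothetical.

```python
def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def inverse_density_weights(points, distance, radius):
    """Weight each point by 1 / (number of points within `radius`, itself included)."""
    weights = []
    for p in points:
        density = sum(1 for q in points if distance(p, q) <= radius)
        weights.append(1.0 / density)  # density >= 1 because p counts itself
    return weights


def weighted_accuracy(correct, weights):
    """Fraction of total weight carried by correctly predicted points."""
    return sum(w for c, w in zip(correct, weights) if c) / sum(weights)


# Three near-identical sequences from one "family" and one outlier (toy data).
sequences = ["AAAA", "AAAB", "AABA", "CDEF"]
correct = [True, True, True, False]  # the model fails only on the outlier

weights = inverse_density_weights(sequences, hamming, radius=1)
plain = sum(correct) / len(correct)
weighted = weighted_accuracy(correct, weights)
```

With these toy values the unweighted accuracy is 0.75, while the weighted accuracy drops to 4/7 because the crowded family is down-weighted. This mirrors the abstract's point that unweighted estimates can be optimistic when a dataset is dominated by close relatives.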

13.
14.
Recognition of peptides bound to major histocompatibility complex (MHC) class I molecules by T lymphocytes is an essential part of immune surveillance. Each MHC allele has a characteristic peptide binding preference, which can be captured in prediction algorithms, allowing for the rapid scan of entire pathogen proteomes for peptides likely to bind MHC. Here we make public a large set of 48,828 quantitative peptide-binding affinity measurements relating to 48 different mouse, human, macaque, and chimpanzee MHC class I alleles. We use these data to establish a set of benchmark predictions with one neural network method and two matrix-based prediction methods extensively utilized in our groups. In general, the neural network outperforms the matrix-based predictions mainly due to its ability to generalize even on a small amount of data. We also retrieved predictions from tools publicly available on the internet. While differences in the data used to generate these predictions hamper direct comparisons, we do conclude that tools based on combinatorial peptide libraries perform remarkably well. The transparent prediction evaluation on this dataset provides tool developers with a benchmark for comparison of newly developed prediction methods. In addition, to generate and evaluate our own prediction methods, we have established an easily extensible web-based prediction framework that allows automated side-by-side comparisons of prediction methods implemented by experts. This is an advance over the current practice of tool developers having to generate reference predictions themselves, which can lead to underestimating the performance of prediction methods they are not as familiar with as their own. The overall goal of this effort is to provide a transparent prediction evaluation allowing bioinformaticians to identify promising features of prediction methods and providing guidance to immunologists regarding the reliability of prediction tools.

15.
Orthogonal analysis of amino acid substitutions as a result of SNPs in existing proteomic datasets provides a critical foundation for the emerging field of population-based proteomics. Large-scale proteomics datasets, derived from shotgun tandem MS analysis of complex cellular protein mixtures, contain many unassigned spectra that may correspond to alternate alleles coded by SNPs. The purpose of this work was to identify tandem MS spectra in LC-MS/MS shotgun proteomics datasets that may represent coding nonsynonymous SNPs (nsSNP). To this end, we generated a tryptic peptide database created from allelic information found in NCBI's dbSNP. We searched this database with tandem MS spectra of tryptic peptides from DU4475 breast tumor cells that had been fractionated by pI in the first dimension and reverse-phase LC in the second dimension. In all we identified 629 nsSNPs, of which 36 were of alternate SNP alleles not found in the reference NCBI or IPI protein databases. Searches for SNP-peptides carry a high risk of false positives, due both to mass shifts caused by modifications and to multiple representations of the same peptide within the genome. In this work, false positives were filtered using a novel peptide pI prediction algorithm and characterized using a decoy database developed by random substitution of similarly sized reference peptides. Secondary validation by sequencing of corresponding genomic DNA confirmed the presence of the predicted SNP in 8 of 10 SNP-peptides. This work highlights that the usefulness of interpreting unassigned spectra as polymorphisms is highly reliant on the ability to detect and filter false positives.

16.
Major histocompatibility complex class II (MHC-II) antigen presentation underlies a wide range of immune responses in health and disease. However, how MHC-II antigen presentation is regulated by the peptide-loading catalyst HLA-DM (DM) and its associated modulator, HLA-DO (DO), is incompletely understood. This is due largely to technical limitations: model antigen-presenting cell (APC) systems that express these MHC-II peptidome regulators at physiologically variable levels have not been described. Likewise, computational prediction tools that account for DO and DM activities are not presently available. To address these gaps, we created a panel of single MHC-II allele, HLA-DR4-expressing APC lines that cover a wide range of DO:DM ratio states. Using a combined immunopeptidomic and proteomic discovery strategy, we measured the effects DO:DM ratios have on peptide presentation by surveying over 10,000 unique DR4-presented peptides. The resulting data provide insight into peptide characteristics that influence their presentation with increasing DO:DM ratios. These include DM sensitivity, peptide abundance, binding affinity and motif, peptide length, and choice of binding register along the source protein. These findings have implications for designing improved HLA-II prediction algorithms and research strategies for dissecting the variety of functions that different APCs serve in the body.

17.

Background  

Antigen presenting cells (APCs) sample the extracellular space and present peptides from it to T helper cells, which can be activated if the peptides are of foreign origin. The peptides are presented on the surface of the cells in complex with major histocompatibility class II (MHC II) molecules. Identification of peptides that bind MHC II molecules is thus a key step in rational vaccine design, and the development of methods for accurate prediction of peptide:MHC interactions plays a central role in epitope discovery. The MHC class II binding groove is open at both ends, making the correct alignment of a peptide in the binding groove a crucial part of identifying the core of an MHC class II binding motif. Here, we present a novel stabilization matrix alignment method, SMM-align, that allows for direct prediction of peptide:MHC binding affinities. The predictive performance of the method is validated on a large MHC class II benchmark data set covering 14 HLA-DR (human MHC) and three mouse H2-IA alleles.

18.
Peptide length-based prediction of peptide-MHC class II binding   Total citations: 2 (self-citations: 0, citations by others: 2)
MOTIVATION: Algorithms for predicting peptide-MHC class II binding are typically similar, if not identical, to methods for predicting peptide-MHC class I binding despite known differences between the two scenarios. We investigate whether representing one of these differences, the greater range of peptide lengths binding MHC class II, improves the performance of these algorithms. RESULTS: A non-linear relationship between peptide length and peptide-MHC class II binding affinity was identified in the data available for several MHC class II alleles. Peptide length was incorporated into existing prediction algorithms using one of several modifications: using regression to pre-process the data, using peptide length as an additional variable within the algorithm, or representing register shifting in longer peptides. For several datasets and at least two algorithms these modifications consistently improved prediction accuracy. AVAILABILITY: http://malthus.micro.med.umich.edu/Bioinformatics

19.
Lai JS, Cheng CW, Sung TY, Hsu WL. PLoS ONE 2012, 7(4): e35018
Secretome analysis is important in pathogen studies. A fundamental and convenient way to identify secreted proteins is to first predict signal peptides, which are essential for protein secretion. However, signal peptides are highly complex functional sequences that are easily confused with transmembrane domains. Such confusion would obviously affect the discovery of secreted proteins. Transmembrane proteins are important drug targets, but very few transmembrane protein structures have been determined experimentally; hence, prediction of the structures is essential. In the field of structure prediction, researchers do not make assumptions about organisms, so there is a need for a general signal peptide predictor. To improve signal peptide prediction without prior knowledge of the associated organisms, we present a machine-learning method, called SVMSignal, which uses biochemical properties as features, as well as features acquired from a novel encoding, to capture biochemical profile patterns for learning the structures of signal peptides directly. We tested SVMSignal and five popular methods on two benchmark datasets from the SPdb and UniProt/Swiss-Prot databases, respectively. Although SVMSignal was trained on an old dataset, it performed well, and the results demonstrate that learning the structures of signal peptides directly is a promising approach. We also utilized SVMSignal to analyze proteomes in the entire HAMAP microbial database. Finally, we conducted a comparative study of secretome analysis on seven tuberculosis-related strains selected from the HAMAP database. We identified ten potential secreted proteins, two of which are drug resistant and four of which are potential transmembrane proteins. SVMSignal is publicly available at http://bio-cluster.iis.sinica.edu.tw/SVMSignal. It provides user-friendly interfaces and visualizations, and the prediction results are available for download.

20.
The coupling between peptides and MHC-II proteins in the human immune system is not well understood. This work presents an evidence-based hypothesis of a guiding intermolecular force present in every human MHC-II protein (HLA-II). Previously, we examined the spatial positions of the fully conserved residues in all HLA-II protein types. In each one, constant planar patterns were revealed. These molecular planes consist of amino acid groups of the same chemical species (for example, Gly) distributed across the protein structure. Each amino acid plane has a unique direction and this directional element offers spatial selectivity. Constant within all planes, too, is the presence of an aromatic residue possessing electrons in movement, leading the authors to consider that the planes generate electromagnetic fields that could serve as an attractive force in a single direction. Selection and attraction between HLA-II molecules and antigen peptides would, therefore, be non-random, resulting in a coupling mechanism as effective and rapid as is clearly required in the immune response. On the basis of planar projections onto the HLA-II groove, modifications were made by substituting the key residues in the class II-associated invariant chain peptide (a peptide with a universal binding affinity), resulting in eight different modified peptides with affinities greater than that of the unmodified peptide. Accurate and reliable prediction of MHC class II-binding peptides may facilitate the design of universal vaccine-peptides with greatly enhanced binding affinities. The proposed mechanisms of selection, attraction and coupling between HLA-II and antigen peptides are explained further in the paper.
