Similar literature
 Found 20 similar articles (search time: 15 ms)
1.
Gao M  Skolnick J 《Proteins》2011,79(5):1623-1634
With the development of many computational methods that predict the structural models of protein-protein complexes, there is a pressing need to benchmark their performance. As was the case for protein monomers, assessing the quality of models of protein complexes is not straightforward. An effective scoring scheme should be able to detect substructure similarity and estimate its statistical significance. Here, we focus on characterizing the similarity of the interfaces of the complex and introduce two scoring functions. The first, the interfacial Template Modeling score (iTM-score), measures the geometric distance between the interfaces, while the second, the Interface Similarity score (IS-score), evaluates their residue-residue contact similarity in addition to their geometric similarity. We first demonstrate that the IS-score is more suitable for assessing docking models than the iTM-score. The IS-score is then validated in a large-scale benchmark test on 1562 dimeric complexes. Finally, the scoring function is applied to evaluate docking models submitted to the Critical Assessment of Prediction of Interactions (CAPRI) experiments. While the results according to the new scoring scheme are generally consistent with the original CAPRI assessment, the IS-score identifies models whose significance was previously underestimated.

2.
In protein tertiary structure prediction, a crucial step is to select near-native structures from a large number of predicted structural models. Over the years, extensive research has been conducted for the protein structure selection problem with most approaches focusing on developing more accurate energy or scoring functions. Despite significant advances in this area, the discerning power of current approaches is still unsatisfactory. In this paper, we propose a novel consensus-based algorithm for the selection of predicted protein structures. Given a set of predicted models, our method first removes redundant structures to derive a subset of reference models. Then, a structure is ranked based on its average pairwise similarity to the reference models. Using the CASP8 data set containing a large collection of predicted models for 122 targets, we compared our method with the best CASP8 quality assessment (QA) servers, which are all consensus based, and showed that our QA scores correlate better with the GDT-TSs than those of the CASP8 QA servers. We also compared our method with the state-of-the-art scoring functions and showed its improved performance for near-native model selection. The GDT-TSs of the top models picked by our method are on average more than 8 percent better than the ones selected by the best performing scoring function.
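The two steps described in this abstract (filter redundant models into a reference set, then rank every model by its average similarity to that set) can be sketched as follows. This is only an illustrative sketch: the greedy redundancy filter, the `redundancy_cutoff` value, and the function names are assumptions, and the `similarity` callable stands in for whatever structural similarity measure (for example GDT-TS) is actually used.

```python
from typing import Callable, List, Sequence, TypeVar

Model = TypeVar("Model")


def select_references(
    models: Sequence[Model],
    similarity: Callable[[Model, Model], float],
    redundancy_cutoff: float,
) -> List[int]:
    """Greedy filter (an assumption): keep a model only if it is less similar
    than the cutoff to every reference model already kept."""
    refs: List[int] = []
    for i, model in enumerate(models):
        if all(similarity(model, models[j]) < redundancy_cutoff for j in refs):
            refs.append(i)
    return refs


def consensus_scores(
    models: Sequence[Model],
    similarity: Callable[[Model, Model], float],
    redundancy_cutoff: float = 0.95,  # hypothetical value
) -> List[float]:
    """Score each model by its mean similarity to the reference models,
    excluding the model itself when it is a reference."""
    refs = select_references(models, similarity, redundancy_cutoff)
    scores: List[float] = []
    for i, model in enumerate(models):
        others = [j for j in refs if j != i]
        if not others:
            scores.append(0.0)
            continue
        total = sum(similarity(model, models[j]) for j in others)
        scores.append(total / len(others))
    return scores


# Toy usage: "models" are points on a line, similarity decays with distance.
toy_models = [0.0, 0.0, 0.1, 5.0]
toy_similarity = lambda a, b: 1.0 / (1.0 + abs(a - b))
toy_refs = select_references(toy_models, toy_similarity, 0.95)
toy_scores = consensus_scores(toy_models, toy_similarity, 0.95)
```

In the toy data the duplicate of the first point is dropped from the reference set, and the outlier at 5.0 receives the lowest consensus score.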

3.
Our information-driven docking approach HADDOCK has demonstrated a sustained performance since the start of its participation in CAPRI. This is due, in part, to its ability to integrate data into the modeling process, and to the robustness of its scoring function. We participated in CAPRI both as server and manual predictors. In CAPRI rounds 38-45, we have used various strategies depending on the available information. These ranged from imposing restraints to a few residues identified from literature as being important for the interaction, to binding pockets identified from homologous complexes or template-based refinement/CA-CA restraint-guided docking from identified templates. When relevant, symmetry restraints were used to limit the conformational sampling. For a large decamer target, we also tested a new implementation of the MARTINI coarse-grained force field in HADDOCK. Overall, we obtained acceptable or better predictions for 13 and 11 server and manual submissions, respectively, out of the 22 interfaces. Our server performance (acceptable or higher-quality models when considering the top 10) was better (59%) than the manual (50%) one, in which we typically experiment with various combinations of protocols and data sources. Again, our simple scoring function based on a linear combination of intermolecular van der Waals and electrostatic energies and an empirical desolvation term demonstrated a good performance in the scoring experiment with a 63% success rate across all 22 interfaces. An analysis of model quality indicates that, while we are consistently performing well in generating acceptable models, there is room for improvement for generating/identifying higher quality models.
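The scoring function mentioned in this abstract is a linear combination of three energy terms. A minimal sketch of that general form is shown below; the class name, function names, and especially the default weights are placeholders chosen for illustration and are not HADDOCK's actual weights or API.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class EnergyTerms:
    vdw: float     # intermolecular van der Waals energy
    elec: float    # intermolecular electrostatic energy
    desolv: float  # empirical desolvation term


def linear_score(
    terms: EnergyTerms,
    w_vdw: float = 1.0,     # placeholder weight, not taken from the source
    w_elec: float = 1.0,    # placeholder weight, not taken from the source
    w_desolv: float = 1.0,  # placeholder weight, not taken from the source
) -> float:
    """Weighted sum of the three terms; lower is taken to mean better."""
    return w_vdw * terms.vdw + w_elec * terms.elec + w_desolv * terms.desolv


def rank_models(models: Dict[str, EnergyTerms], **weights: float) -> List[str]:
    """Return model names sorted from best (lowest score) to worst."""
    return sorted(models, key=lambda name: linear_score(models[name], **weights))


# Toy usage with made-up energies.
decoys = {
    "model_a": EnergyTerms(vdw=-30.0, elec=-100.0, desolv=5.0),
    "model_b": EnergyTerms(vdw=-10.0, elec=-20.0, desolv=2.0),
}
ranking = rank_models(decoys)
```

Keeping the weights as explicit keyword arguments makes it easy to see how the ranking responds when one term is down-weighted.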

4.
Structures of proteins complexed with other proteins, peptides, or ligands are essential for investigation of molecular mechanisms. However, the experimental structures of protein complexes of interest are often not available. Therefore, computational methods are widely used to predict these structures, and, of those methods, template-based modeling is the most successful. In rounds 38-45 of the Critical Assessment of PRediction of Interactions (CAPRI), we applied template-based modeling for 9 of 11 protein-protein and protein-peptide interaction targets, resulting in medium and high-quality models for six targets. For the protein-oligosaccharide docking targets, we used constraints derived from template structures, and generated models of at least acceptable quality for most of the targets. Apparently, high flexibility of oligosaccharide molecules was the main cause preventing us from obtaining models of higher quality. We also participated in the CAPRI scoring challenge, the goal of which was to identify the highest quality models from a large pool of decoys. In this experiment, we tested VoroMQA, a scoring method based on interatomic contact areas. The results showed VoroMQA to be quite effective in scoring strongly binding and obligatory protein complexes, but less successful in the case of transient interactions. We extensively used manual intervention in both CAPRI modeling and scoring experiments. This oftentimes allowed us to select the correct templates from available alternatives and to limit the search space during the model scoring.

5.
Continuum solvent models such as Generalized-Born and Poisson–Boltzmann methods hold the promise to treat solvation effects efficiently and to enable rapid scoring of protein structures when they are combined with physics-based energy functions. Yet, a direct comparison of these two approaches on large protein data sets is lacking. Building on our previous work with a scoring function based on a Generalized-Born (GB) solvation model and short molecular-dynamics simulations, we further extended the scoring function to compare with the MM-PBSA method to treat the solvent effect. We benchmarked this scoring function against seven publicly available decoy sets. We found that, somewhat surprisingly, the results of the MM-PBSA approach are comparable to the previous GB-based scoring function. We also discussed the effect on scoring function accuracy of the presence of large ligands and ions in some native structures of the decoy sets.

6.
Yue Cao  Yang Shen 《Proteins》2020,88(8):1091-1099
Structural information about protein-protein interactions, often missing at the interactome scale, is important for mechanistic understanding of cells and rational discovery of therapeutics. Protein docking provides a computational alternative for such information. However, ranking near-native docked models high among a large number of candidates, often known as the scoring problem, remains a critical challenge. Moreover, estimating model quality, also known as the quality assessment problem, is rarely addressed in protein docking. In this study, the two challenging problems in protein docking are regarded as relative and absolute scoring, respectively, and addressed in one physics-inspired deep learning framework. We represent protein and complex structures as intra- and inter-molecular residue contact graphs with atom-resolution node and edge features. We also propose a novel graph convolutional kernel that aggregates interacting nodes’ features through edges so that generalized interaction energies can be learned directly from 3D data. The resulting energy-based graph convolutional networks (EGCN) with multihead attention are trained to predict intra- and inter-molecular energies, binding affinities, and quality measures (interface RMSD) for encounter complexes. Compared to a state-of-the-art scoring function for model ranking, EGCN significantly improves ranking for a critical assessment of predicted interactions (CAPRI) test set involving homology docking; and is comparable or slightly better for Score_set, a CAPRI benchmark set generated by diverse community-wide docking protocols not known to training data. For Score_set quality assessment, EGCN shows about 27% improvement over our previous efforts. Directly learning from 3D structure data in graph representation, EGCN represents the first successful development of graph convolutional networks for protein docking.

7.
BACKGROUND: Human diversity, namely single nucleotide polymorphisms (SNPs), is becoming a focus of biomedical research. Despite the binary nature of SNP determination, the majority of genotyping assay data need a critical evaluation for genotype calling. We applied statistical models to improve the automated analysis of 2-dimensional SNP data. METHODS: We derived several quantities in the framework of Gaussian mixture models that provide figures of merit to objectively measure the data quality. The accuracy of individual observations is scored as the probability of belonging to a certain genotype cluster, while the assay quality is measured by the overlap between the genotype clusters. RESULTS: The approach was extensively tested with a dataset of 438 nonredundant SNP assays comprising >150,000 datapoints. The performance of our automatic scoring method was compared with manual assignments. The agreement for the overall assay quality is remarkably good, and individual observations were scored differently by man and machine in 2.6% of cases, when applying stringent probability threshold values. CONCLUSION: Our definition of bounds for the accuracy for complete assays in terms of misclassification probabilities goes beyond other proposed analysis methods. We expect the scoring method to minimise human intervention and provide a more objective error estimate in genotype calling.
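Scoring an observation as "the probability of belonging to a certain genotype cluster" corresponds to the posterior membership probability in a Gaussian mixture. The sketch below illustrates that idea for already-fitted clusters; the diagonal-covariance simplification, the 0.95 threshold, the cluster parameters, and all names are assumptions made for the example, not details from the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple


def diag_gaussian_pdf(
    x: Sequence[float], mean: Sequence[float], var: Sequence[float]
) -> float:
    """Density of a Gaussian with diagonal covariance (a simplification)."""
    log_p = 0.0
    for xi, mi, vi in zip(x, mean, var):
        log_p += -0.5 * math.log(2.0 * math.pi * vi) - 0.5 * (xi - mi) ** 2 / vi
    return math.exp(log_p)


def genotype_posteriors(
    x: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
) -> List[float]:
    """Posterior probability that x belongs to each genotype cluster."""
    dens = [
        w * diag_gaussian_pdf(x, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    total = sum(dens)
    return [d / total for d in dens]


def call_genotype(
    x: Sequence[float],
    weights: Sequence[float],
    means: Sequence[Sequence[float]],
    variances: Sequence[Sequence[float]],
    threshold: float = 0.95,  # hypothetical stringency
) -> Tuple[Optional[int], List[float]]:
    """Return the cluster index, or None when no posterior reaches the threshold."""
    post = genotype_posteriors(x, weights, means, variances)
    best = max(range(len(post)), key=post.__getitem__)
    return (best if post[best] >= threshold else None), post


# Toy 2-D signal space with three made-up clusters.
W = [1 / 3, 1 / 3, 1 / 3]
MEANS = [(0.0, 1.0), (1.0, 1.0), (1.0, 0.0)]
VARS = [(0.01, 0.01)] * 3
clear_call, clear_post = call_genotype((0.02, 0.98), W, MEANS, VARS)
ambiguous_call, ambiguous_post = call_genotype((0.5, 1.0), W, MEANS, VARS)
```

A point close to one cluster centre is called with near-certainty, while a point midway between two clusters is left uncalled because neither posterior reaches the threshold.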

8.
We describe a new method for identifying the sequences that signal the start of translation, and the boundaries between exons and introns (donor and acceptor sites) in human mRNA. According to the mandatory keyword, ORGANISM, and feature key, CDS, a large set of standard data for each signal site was extracted from the ASCII flat file, gbpri.seq, in the GenBank release 108.0. This was used to generate the scoring matrices, which summarize the sequence information for each signal site. The scoring matrices take into account the independent nucleotide frequencies between adjacent bases in each position within the signal site regions, and the relative weight on each nucleotide in proportion to their probabilities in the known signal sites. Using a scoring scheme that is based on the nucleotide scoring matrices, the method has great sensitivity and specificity when used to locate signals in uncharacterized human genomic DNA. These matrices are especially effective at distinguishing true and false sites.
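As a rough illustration of how a nucleotide scoring matrix can be built from known signal sites and used to scan uncharacterized DNA, the sketch below uses a plain position-specific log-odds matrix. This is a simplification and not the paper's exact matrix construction; the pseudocount, the uniform background of 0.25, the example sites, and the function names are all assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

ALPHABET = "ACGT"


def build_matrix(
    sites: Sequence[str],
    pseudocount: float = 1.0,  # assumed smoothing
    background: float = 0.25,  # assumed uniform base composition
) -> List[Dict[str, float]]:
    """Per-position log2 odds of each base in the aligned known sites."""
    length = len(sites[0])
    matrix: List[Dict[str, float]] = []
    for pos in range(length):
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1.0
        total = sum(counts.values())
        matrix.append(
            {base: math.log2(counts[base] / total / background) for base in ALPHABET}
        )
    return matrix


def score_window(matrix: List[Dict[str, float]], window: str) -> float:
    """Sum the per-position log-odds for one candidate site."""
    return sum(column[base] for column, base in zip(matrix, window))


def best_site(matrix: List[Dict[str, float]], sequence: str) -> Tuple[float, int]:
    """Slide the matrix along the sequence; return (best score, start index)."""
    width = len(matrix)
    return max(
        (score_window(matrix, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


# Toy usage with made-up donor-like sites.
known_sites = ["CAGGTAAGT", "AAGGTGAGT", "CAGGTAAGA", "GAGGTAAGT"]
pwm = build_matrix(known_sites)
top_score, top_start = best_site(pwm, "TTTTCAGGTAAGTCCCC")
```

A threshold on the window score would then separate candidate true sites from false ones; choosing that threshold is where sensitivity and specificity are traded off.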

9.
MOTIVATION: Feature selection methods aim to reduce the complexity of data and to uncover the most relevant biological variables. In reality, information in biological datasets is often incomplete as a result of untrustworthy samples and missing values. The reliability of selection methods may therefore be questioned. METHOD: Information loss is incorporated into a perturbation scheme, testing which features are stable under it. This method is applied to data analysis by unsupervised feature filtering (UFF). The latter has been shown to be a very successful method in analysis of gene-expression data. RESULTS: We find that the UFF quality degrades smoothly with information loss. It remains successful even under substantial damage. Our method allows for selection of a best imputation method on a dataset treated by UFF. More importantly, scoring features according to their stability under information loss is shown to be correlated with biological importance in cancer studies. This scoring may lead to novel biological insights.

10.
Protein and peptide mass analysis and amino acid sequencing by mass spectrometry are widely used for identification and annotation of post-translational modifications (PTMs) in proteins. Modification-specific mass increments, neutral losses or diagnostic fragment ions in peptide mass spectra provide direct evidence for the presence of post-translational modifications, such as phosphorylation, acetylation, methylation or glycosylation. However, the commonly used database search engines are not always practical for exhaustive searches for multiple modifications and concomitant missed proteolytic cleavage sites in large-scale proteomic datasets, since the search space is dramatically expanded. We present a formal definition of the problem of searching databases with tandem mass spectra of peptides that are partially (sub-stoichiometrically) modified. In addition, an improved search algorithm and peptide scoring scheme that includes modification specific ion information from MS/MS spectra was implemented and tested using the Virtual Expert Mass Spectrometrist (VEMS) software. A set of 2825 peptide MS/MS spectra were searched with 16 variable modifications and 6 missed cleavages. The scoring scheme returned a large set of post-translationally modified peptides including precise information on modification type and position. The scoring scheme was able to extract and distinguish the near-isobaric modifications of trimethylation and acetylation of lysine residues based on the presence and absence of diagnostic neutral losses and immonium ions. In addition, the VEMS software contains a range of new features for analysis of mass spectrometry data obtained in large-scale proteomic experiments. Windows binaries are available at http://www.yass.sdu.dk/.

11.
We propose a new type of probabilistic scoring scheme framework for protein identification from peptide masses. We first introduce the framework itself and explain its requirements. In a second part, we describe a particular implementation and test it on a data set of more than 8000 MALDI-TOF spectra with known contents. Doing so, we also compare its performance to two widely used scoring schemes, thereby demonstrating the potential of the proposed approach.

12.
Zhang Y  Skolnick J 《Proteins》2004,57(4):702-710
We have developed a new scoring function, the template modeling score (TM-score), to assess the quality of protein structure templates and predicted full-length models by extending the approaches used in Global Distance Test (GDT) and MaxSub. First, a protein size-dependent scale is exploited to eliminate the inherent protein size dependence of the previous scores and appropriately account for random protein structure pairs. Second, rather than setting specific distance cutoffs and calculating only the fractions with errors below the cutoff, all residue pairs in alignment/modeling are evaluated in the proposed score. For comparison of various scoring functions, we have constructed a large-scale benchmark set of structure templates for 1489 small to medium size proteins using the threading program PROSPECTOR_3 and built the full-length models using MODELLER and TASSER. The TM-score of the initial threading alignments, compared to the GDT and MaxSub scoring functions, shows a much stronger correlation to the quality of the final full-length models. The TM-score is further exploited as an assessment of all 'new fold' targets in the recent CASP5 experiment and shows a close coincidence with the results of human-expert visual assessment. These data suggest that the TM-score is a useful complement to the fully automated assessment of protein structure predictions. The executable program of TM-score is freely downloadable at http://bioinformatics.buffalo.edu/TM-score.
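The two ideas in this abstract (a length-dependent distance scale, and a contribution from every aligned residue pair rather than a hard cutoff) appear in the published TM-score definition as d0 = 1.24 * (L - 15)^(1/3) - 1.8 and a sum of 1 / (1 + (d_i / d0)^2) divided by the native length L. The sketch below evaluates that sum for one fixed superposition only; the real score takes the maximum over superpositions, which is not attempted here. The 0.5 Å floor on d0 for short chains and the function names are assumptions of this sketch.

```python
from typing import Sequence


def tm_d0(native_length: int) -> float:
    """Length-dependent distance scale d0 in angstroms.

    The floor of 0.5 for short chains is an assumption of this sketch,
    added so that d0 stays positive.
    """
    if native_length <= 15:
        return 0.5
    return max(0.5, 1.24 * (native_length - 15) ** (1.0 / 3.0) - 1.8)


def tm_score_fixed(distances: Sequence[float], native_length: int) -> float:
    """TM-score-like value for ONE given superposition.

    `distances` holds the distance of each aligned residue pair after
    superposition; unaligned residues simply contribute nothing.
    """
    scale = tm_d0(native_length)
    total = sum(1.0 / (1.0 + (d / scale) ** 2) for d in distances)
    return total / native_length


# Toy usage: a perfect 100-residue model, and one covering half the chain.
perfect = tm_score_fixed([0.0] * 100, 100)
half_covered = tm_score_fixed([0.0] * 50, 100)
```

Because the sum is normalised by the native length rather than the aligned length, a model that covers only half of the chain perfectly cannot score above 0.5.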

13.
The amplified fragment length polymorphism (AFLP) technique is an increasingly popular component of the phylogenetic toolbox, particularly for plant species. Technological advances in capillary electrophoresis now allow very precise estimates of DNA fragment mobility and amplitude, and current AFLP software allows greater control of data scoring and the production of the binary character matrix. However, for AFLP to become a useful modern tool for large data sets, improvements to automated scoring are required. We design a procedure that can be used to optimize AFLP scoring parameters to improve phylogenetic resolution and demonstrate it for two AFLP scoring programs (GeneMapper and GeneMarker). In general, we found that there was a trade-off between getting more characters of lower quality and fewer characters of high quality. Conservative settings that gave the least error did not give the best phylogenetic resolution, as too many useful characters were discarded. For example, in GeneMapper, we found that bin width was a crucial parameter, and that although reducing bin width from 1.0 to 0.5 base pairs increased the error rate, it nevertheless improved resolution due to the increased number of informative characters. For our 30-taxon data sets, moving from default to optimized parameter settings gave between 3 and 11 extra internal edges with >50% bootstrap support, in the best case increasing the number of resolved edges from 14 to 25 out of a possible 27. Nevertheless, improvements to current AFLP software packages are needed to (1) make use of replicate profiles to calibrate the data and perform error calculations and (2) perform tests to optimize scoring parameters in a rigorous and automated way. This is true not only when AFLP data are used for phylogenetics, but also for other applications, including linkage mapping and population genetics.

14.
Camacho CJ  Ma H  Champ PC 《Proteins》2006,63(4):868-877
Predicting protein-protein interactions involves sampling and scoring docked conformations. Barring some large structural rearrangement, rapidly sampling the space of docked conformations is now a real possibility, and the limiting step for the successful prediction of protein interactions is the scoring function used to reduce the space of conformations from billions to a few, and eventually one high affinity complex. An atomic level free-energy scoring function that estimates in units of kcal/mol both electrostatic and desolvation interactions (plus van der Waals if appropriate) of protein-protein docked conformations is used to rerank the blind predictions (860 in total) submitted for six targets to the community-wide Critical Assessment of PRediction of Interactions (CAPRI; http://capri.ebi.ac.uk). We found that native-like models often have varying intermolecular contacts and atom clashes, making it unlikely that one can construct a universal function that would rank all these models as native-like. Nevertheless, our scoring function is able to consistently identify the native-like complexes as those with the lowest free energy for the individual models of 16 (out of 17) human predictors for five of the targets, while at the same time the modelers failed to do so in more than half of the cases. The scoring of high-quality models developed by a wide variety of methods and force fields confirms that electrostatic and desolvation forces are the dominant interactions determining the bound structure. The CAPRI experiment has shown that modelers can predict valuable models of protein-protein complexes, and improvements in scoring functions should soon solve the docking problem for complexes whose backbones do not change much upon binding. A scoring server and programs are available at http://structure.pitt.edu.

15.
Unbalanced repeated-measures models with structured covariance matrices   Total citations: 32, self-citations: 0, citations by others: 32
The question of how to analyze unbalanced or incomplete repeated-measures data is a common problem facing analysts. We address this problem through maximum likelihood analysis using a general linear model for expected responses and arbitrary structural models for the within-subject covariances. Models that can be fit include standard univariate and multivariate models with incomplete data, random-effects models, and models with time-series and factor-analytic error structures. We describe Newton-Raphson and Fisher scoring algorithms for computing maximum likelihood estimates, and generalized EM algorithms for computing restricted and unrestricted maximum likelihood estimates. An example fitting several models to a set of growth data is included.
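For readers unfamiliar with Fisher scoring, the generic update is theta_new = theta + I(theta)^(-1) U(theta), where U is the score vector (gradient of the log-likelihood) and I is the expected information matrix. The sketch below applies that update to a deliberately simple two-parameter Poisson log-linear model; it is a toy illustration of the iteration only, not the repeated-measures covariance model of the paper, and the function name, starting values, and tolerance are assumptions.

```python
import math
from typing import Sequence, Tuple


def fisher_scoring_poisson(
    x: Sequence[float],
    y: Sequence[float],
    tol: float = 1e-10,
    max_iter: int = 50,
) -> Tuple[float, float]:
    """Fit log(mu_i) = b0 + b1 * x_i by Fisher scoring."""
    b0 = math.log(sum(y) / len(y))  # start from the intercept-only fit
    b1 = 0.0
    for _ in range(max_iter):
        mu = [math.exp(b0 + b1 * xi) for xi in x]
        # Score vector U = sum_i (y_i - mu_i) * (1, x_i)
        u0 = sum(yi - mi for yi, mi in zip(y, mu))
        u1 = sum(xi * (yi - mi) for xi, yi, mi in zip(x, y, mu))
        # Expected information I = sum_i mu_i * (1, x_i)(1, x_i)^T
        i00 = sum(mu)
        i01 = sum(xi * mi for xi, mi in zip(x, mu))
        i11 = sum(xi * xi * mi for xi, mi in zip(x, mu))
        det = i00 * i11 - i01 * i01
        # Step = I^{-1} U, using the closed-form inverse of a 2x2 matrix
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (i00 * u1 - i01 * u0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tol:
            break
    return b0, b1


# Toy usage: responses lying exactly on exp(0.5 + 0.3 * x).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [math.exp(0.5 + 0.3 * xi) for xi in xs]
b0_hat, b1_hat = fisher_scoring_poisson(xs, ys)
```

For this canonical-link model the expected and observed information coincide, so Fisher scoring and Newton-Raphson take identical steps; the two differ for models such as structured covariance matrices.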

16.
The analysis of high-dimensional data sets is often forced to rely upon well-chosen summary statistics. A systematic approach to choosing such statistics, which is based upon a sound theoretical framework, is currently lacking. In this paper we develop a sequential scheme for scoring statistics according to whether their inclusion in the analysis will substantially improve the quality of inference. Our method can be applied to high-dimensional data sets for which exact likelihood equations are not possible. We illustrate the potential of our approach with a series of examples drawn from genetics. In summary, in a context in which well-chosen summary statistics are of high importance, we attempt to put the 'well' into 'chosen.'

17.
Hubbard RA  Inoue LY  Fann JR 《Biometrics》2008,64(3):843-850
Summary. Longitudinal studies are a powerful tool for characterizing the course of chronic disease. These studies are usually carried out with subjects observed at periodic visits giving rise to panel data. Under this observation scheme the exact times of disease state transitions and sequence of disease states visited are unknown and Markov process models are often used to describe disease progression. Most applications of Markov process models rely on the assumption of time homogeneity, that is, that the transition rates are constant over time. This assumption is not satisfied when transition rates depend on time from the process origin. However, limited statistical tools are available for dealing with nonhomogeneity. We propose models in which the time scale of a nonhomogeneous Markov process is transformed to an operational time scale on which the process is homogeneous. We develop a method for jointly estimating the time transformation and the transition intensity matrix for the time transformed homogeneous process. We assess maximum likelihood estimation using the Fisher scoring algorithm via simulation studies and compare performance of our method to homogeneous and piecewise homogeneous models. We apply our methodology to a study of delirium progression in a cohort of stem cell transplantation recipients and show that our method identifies temporal trends in delirium incidence and recovery.

18.
The promise of mass spectrometry as a tool for probing signal-transduction is predicated on reliable identification of post-translational modifications. Phosphorylations are key mediators of cellular signaling, yet are hard to detect, partly because of unusual fragmentation patterns of phosphopeptides. In addition to being accurate, MS/MS identification software must be robust and efficient to deal with increasingly large spectral data sets. Here, we present a new scoring function for the Inspect software for phosphorylated peptide tandem mass spectra for ion-trap instruments, without the need for manual validation. The scoring function was modeled by learning fragmentation patterns from 7677 validated phosphopeptide spectra. We compare our algorithm against SEQUEST and X!Tandem on testing and training data sets. At a 1% false positive rate, Inspect identified the greatest total number of phosphorylated spectra, 13% more than SEQUEST and 39% more than X!Tandem. Spectra identified by Inspect tended to score better in several spectral quality measures. Furthermore, Inspect runs much faster than either SEQUEST or X!Tandem, making desktop phosphoproteomics feasible. Finally, we used our new models to reanalyze a corpus of 423,000 LTQ spectra acquired for a phosphoproteome analysis of Saccharomyces cerevisiae DNA damage and repair pathways and discovered 43% more phosphopeptides than the previous study.

19.
Fischer B  Fukuzawa K  Wenzel W 《Proteins》2008,70(4):1264-1273
The adaptation of forcefield-based scoring functions to specific receptors remains an important challenge for in-silico drug discovery. Here we compare binding energies of forcefield-based scoring functions with models that are reparameterized on the basis of large-scale quantum calculations of the receptor. We compute binding energies of eleven ligands to the human estrogen receptor subtype alpha (ERalpha) and four ligands to the human retinoic acid receptor of isotype gamma (RARgamma). Using the FlexScreen all-atom receptor-ligand docking approach, we compare docking simulations parameterized by quantum-mechanical calculation of a large protein fragment with purely forcefield-based models. The use of receptor flexibility in FlexScreen permits the treatment of all ligands in the same receptor model. We find a high correlation between the classical binding energy obtained in the docking simulation and quantum mechanical binding energies, and a good correlation with experimental affinities (R=0.81 for ERalpha and R=0.95 for RARgamma) using the quantum derived scoring functions. A significant part of this improvement is retained when only the receptor is treated with quantum-based parameters, while the ligands are parameterized with a purely classical model.

20.
MS2 library spectra are rich in reproducible information about peptide fragmentation patterns compared to theoretical spectra modeled by a sequence search tool. So far, spectrum library searches are mostly applied to detect peptides as they are present in the library. However, they also allow finding modified variants of the library peptides if the search is done with a large precursor mass window and an adapted Spectrum-Spectrum Match (SSM) scoring algorithm. We perform a thorough evaluation on the use of library spectra as opposed to theoretical peptide spectra for the identification of PTMs, analyzing spectra of a well-annotated modification-rich test data set compiled from public data repositories. These initial studies motivate the development of our modification tolerant spectrum library search tool QuickMod, designed to identify modified variants of the peptides listed in the spectrum library without any prior input from the user estimating the modifications present in the sample. We built the search algorithm of QuickMod after carefully testing different SSM similarity scores. The final spectrum scoring scheme uses a support vector machine (SVM) on a selection of scoring features to classify correct and incorrect SSMs. After identification of a list of modified peptides at a given False Discovery Rate (FDR), the modifications need to be positioned on the peptide sequence. We present a rapid modification site assignment algorithm and evaluate its positioning accuracy. Finally, we demonstrate that QuickMod performs favorably in terms of speed and identification rate when compared to other software solutions for PTM analysis.
