首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
We present a new structurally derived pair-to-pair substitution matrix (P2PMAT). This matrix is constructed from a very large amount of integrated high quality multiple sequence alignments (Blocks) and protein structures. It evaluates the likelihoods of all 160,000 pair-to-pair substitutions. P2PMAT matrix implicitly accounts for evolutionary conservation, correlated mutations, and residue-residue contact potentials. The usefulness of the matrix for structural predictions is shown in this article. Predicting protein residue-residue contacts from sequence information alone, by our method (P2PConPred) is particularly accurate in the protein cores, where it performs better than other basic contact prediction methods (increasing accuracy by 25-60%). The method mean accuracy for protein cores is 24% for 59 diverse families and 34% for a subset of proteins shorter than 100 residues. This is above the level that was recently shown to be sufficient to significantly improve ab initio protein structure prediction. We also demonstrate the ability of our approach to identify native structures within large sets of (300-2000) protein decoys. On the basis of evolutionary information alone our method ranks the native structure in the top 0.3% of the decoys in 4/10 of the sets, and in 8/10 of sets the native structure is ranked in the top 10% of the decoys. The method can, thus, be used to assist filtering wrong models, complementing traditional scoring functions.  相似文献   

2.
Protein functional sites control most biological processes and are important targets for drug design and protein engineering. To characterize them, the evolutionary trace (ET) ranks the relative importance of residues according to their evolutionary variations. Generally, top‐ranked residues cluster spatially to define evolutionary hotspots that predict functional sites in structures. Here, various functions that measure the physical continuity of ET ranks among neighboring residues in the structure, or in the sequence, are shown to inform sequence selection and to improve functional site resolution. This is shown first, in 110 proteins, for which the overlap between top‐ranked residues and actual functional sites rose by 8% in significance. Then, on a structural proteomic scale, optimized ET led to better 3D structure‐function motifs (3D templates) and, in turn, to enzyme function prediction by the Evolutionary Trace Annotation (ETA) method with better sensitivity of (40% to 53%) and positive predictive value (93% to 94%). This suggests that the similarity of evolutionary importance among neighboring residues in the sequence and in the structure is a universal feature of protein evolution. In practice, this yields a tool for optimizing sequence selections for comparative analysis and, via ET, for better predictions of functional site and function. This should prove useful for the efficient mutational redesign of protein function and for pharmaceutical targeting.  相似文献   

3.
In this article, we present COMSAT, a hybrid framework for residue contact prediction of transmembrane (TM) proteins, integrating a support vector machine (SVM) method and a mixed integer linear programming (MILP) method. COMSAT consists of two modules: COMSAT_SVM which is trained mainly on position–specific scoring matrix features, and COMSAT_MILP which is an ab initio method based on optimization models. Contacts predicted by the SVM model are ranked by SVM confidence scores, and a threshold is trained to improve the reliability of the predicted contacts. For TM proteins with no contacts above the threshold, COMSAT_MILP is used. The proposed hybrid contact prediction scheme was tested on two independent TM protein sets based on the contact definition of 14 Å between Cα‐Cα atoms. First, using a rigorous leave‐one‐protein‐out cross validation on the training set of 90 TM proteins, an accuracy of 66.8%, a coverage of 12.3%, a specificity of 99.3% and a Matthews' correlation coefficient (MCC) of 0.184 were obtained for residue pairs that are at least six amino acids apart. Second, when tested on a test set of 87 TM proteins, the proposed method showed a prediction accuracy of 64.5%, a coverage of 5.3%, a specificity of 99.4% and a MCC of 0.106. COMSAT shows satisfactory results when compared with 12 other state‐of‐the‐art predictors, and is more robust in terms of prediction accuracy as the length and complexity of TM protein increase. COMSAT is freely accessible at http://hpcc.siat.ac.cn/COMSAT/ . Proteins 2016; 84:332–348. © 2016 Wiley Periodicals, Inc.  相似文献   

4.
For current state-of-the-art methods, the prediction of correct topology of membrane proteins has been reported to be above 80%. However, this performance has only been observed in small and possibly biased data sets obtained from protein structures or biochemical assays. Here, we test a number of topology predictors on an "unseen" set of proteins of known structure and also on four "genome-scale" data sets, including one recent large set of experimentally validated human membrane proteins with glycosylated sites. The set of glycosylated proteins is also used to examine the ability of prediction methods to separate membrane from nonmembrane proteins. The results show that methods utilizing multiple sequence alignments are overall superior to methods that do not. The best performance is obtained by TOPCONS, a consensus method that combines several of the other prediction methods. The best methods to distinguish membrane from nonmembrane proteins belong to the "Phobius" group of predictors. We further observe that the reported high accuracies in the smaller benchmark sets are not quite maintained in larger scale benchmarks. Instead, we estimate the performance of the best prediction methods for eukaryotic membrane proteins to be between 60% and 70%. The low agreement between predictions from different methods questions earlier estimates about the global properties of the membrane proteome. Finally, we suggest a pipeline to estimate these properties using a combination of the best predictors that could be applied in large-scale proteomics studies of membrane proteins.  相似文献   

5.
Hu LL  Li Z  Wang K  Niu S  Shi XH  Cai YD  Li HP 《Biopolymers》2011,95(11):763-771
Protein methylation, one of the most important post-translational modifications, typically takes place on arginine or lysine residue. The reversible modification involves a series of basic cellular processes. Identification of methyl proteins with their sites will facilitate the understanding of the molecular mechanism of methylation. Besides the experimental methods, computational predictions of methylated sites are much more desirable for their convenience and fast speed. Here, we propose a method dedicated to predicting methylated sites of proteins. Feature selection was made on sequence conservation, physicochemical/biochemical properties, and structural disorder by applying maximum relevance minimum redundancy and incremental feature selection methods. The prediction models were built according to nearest the neighbor algorithm and evaluated by the jackknife cross-validation. We built 11 and 9 predictors for methylarginine and methyllysine, respectively, and integrated them to predict methylated sites. As a result, the average prediction accuracies are 74.25%, 77.02% for methylarginine and methyllysine training sets, respectively. Feature analysis suggested evolutionary information, and physicochemical/biochemical properties play important roles in the recognition of methylated sites. These findings may provide valuable information for exploiting the mechanisms of methylation. Our method may serve as a useful tool for biologists to find the potential methylated sites of proteins.  相似文献   

6.
Detection of biologically interesting, low-abundance proteins in complex proteomes such as serum typically requires extensive fractionation and high-performance mass spectrometers. Processing of the resulting large data sets involves trade-offs between confidence of identification and depth of protein coverage; that is, higher stringency filters preferentially reduce the number of low-abundance proteins identified. In the current study, an alternative database search and results filtering strategies were evaluated using test samples ranging from purified proteins to ovarian tumor secretomes and human serum to maximize peptide and protein coverage. Full and partial tryptic searches were compared because substantial numbers of partial tryptic peptides were observed in all samples, and the proportion of partial tryptic peptides was particularly high for serum. When data filters that yielded similar false discovery rates (FDR) were used, full tryptic searches detected far fewer peptides than partial tryptic searches. In contrast to the common practice of using full tryptic specificity and a narrow precursor mass tolerance, more proteins and peptides could be confidently identified using a partial tryptic database search with a 100 ppm precursor mass tolerance followed by filtering of results using 10 ppm mass error and full tryptic boundaries.  相似文献   

7.
8.
Three different prenyltransferases attach isoprenyl anchors to C-terminal motifs in substrate proteins. These lipid anchors serve for membrane attachment or protein-protein interactions in many pathways. Although well-tolerated selective prenyltransferase inhibitors are clinically available, their mode of action remains unclear since the known substrate sets of the various prenyltransferases are incomplete. The Prenylation Prediction Suite (PrePS) has been applied for large-scale predictions of prenylated proteins. To prioritize targets for experimental verification, we rank the predictions by their functional importance estimated by evolutionary conservation of the prenylation motifs within protein families. The ranked lists of predictions are accessible as PRENbase (http://mendel.imp.univie.ac.at/sat/PrePS/PRENbase) and can be queried for verification status, type of modifying enzymes (anchor type), and taxonomic distribution. Our results highlight a large group of plant metal-binding chaperones as well as several newly predicted proteins involved in ubiquitin-mediated protein degradation, enriching the known functional repertoire of prenylated proteins. Furthermore, we identify two possibly prenylated proteins in Mimivirus. The section HumanPRENbase provides complete lists of predicted prenylated human proteins-for example, the list of farnesyltransferase targets that cannot become substrates of geranylgeranyltransferase 1 and, therefore, are especially affected by farnesyltransferase inhibitors (FTIs) used in cancer and anti-parasite therapy. We report direct experimental evidence verifying the prediction of the human proteins Prickle1, Prickle2, the BRO1 domain-containing FLJ32421 (termed BROFTI), and Rab28 (short isoform) as exclusive farnesyltransferase targets. We introduce PRENbase, a database of large-scale predictions of protein prenylation substrates ranked by evolutionary conservation of the motif. Experimental evidence is presented for the selective farnesylation of targets with an evolutionary conserved modification site.  相似文献   

9.
By design, structural genomics (SG) solves many structures that cannot be assigned function based on homology to known proteins. Alternative function annotation methods are therefore needed and this study focuses on function prediction with three-dimensional (3D) templates: small structural motifs built of just a few functionally critical residues. Although experimentally proven functional residues are scarce, we show here that Evolutionary Trace (ET) rankings of residue importance are sufficient to build 3D templates, match them, and then assign Gene Ontology (GO) functions in enzymes and non-enzymes alike. In a high-specificity mode, this Evolutionary Trace Annotation (ETA) method covered half (53%) of the 2384 annotated SG protein controls. Three-quarters (76%) of predictions were both correct and complete. The positive predictive value for all GO depths (all-depth PPV) was 84%, and it rose to 94% over GO depths 1-3 (depth 3 PPV). In a high-sensitivity mode, coverage rose significantly (84%), while accuracy fell moderately: 68% of predictions were both correct and complete, all-depth PPV was 75%, and depth 3 PPV was 86%. These data concur with prior mutational experiments showing that ET rank information identifies key functional determinants in proteins. In practice, ETA predicted functions in 42% of 3461 unannotated SG proteins. In 529 cases—including 280 non-enzymes and 21 for metal ion ligands—the expected accuracy is 84% at any GO depth and 94% down to GO depth 3, while for the remaining 931 the expected accuracies are 60% and 71%, respectively. Thus, local structural comparisons of evolutionarily important residues can help decipher protein functions to known reliability levels and without prior assumption on functional mechanisms. ETA is available at http://mammoth.bcm.tmc.edu/eta.  相似文献   

10.
O-GalNAc-glycosylation is one of the main types of glycosylation in mammalian cells. No consensus recognition sequence for the O-glycosyltransferases is known, making prediction methods necessary to bridge the gap between the large number of known protein sequences and the small number of proteins experimentally investigated with regard to glycosylation status. From O-GLYCBASE a total of 86 mammalian proteins experimentally investigated for in vivo O-GalNAc sites were extracted. Mammalian protein homolog comparisons showed that a glycosylated serine or threonine is less likely to be precisely conserved than a nonglycosylated one. The Protein Data Bank was analyzed for structural information, and 12 glycosylated structures were obtained. All positive sites were found in coil or turn regions. A method for predicting the location for mucin-type glycosylation sites was trained using a neural network approach. The best overall network used as input amino acid composition, averaged surface accessibility predictions together with substitution matrix profile encoding of the sequence. To improve prediction on isolated (single) sites, networks were trained on isolated sites only. The final method combines predictions from the best overall network and the best isolated site network; this prediction method correctly predicted 76% of the glycosylated residues and 93% of the nonglycosylated residues. NetOGlyc 3.1 can predict sites for completely new proteins without losing its performance. The fact that the sites could be predicted from averaged properties together with the fact that glycosylation sites are not precisely conserved indicates that mucin-type glycosylation in most cases is a bulk property and not a very site-specific one. NetOGlyc 3.1 is made available at www.cbs.dtu.dk/services/netoglyc.  相似文献   

11.
Chen H  Zhou HX 《Proteins》2005,61(1):21-35
The number of structures of protein-protein complexes deposited to the Protein Data Bank is growing rapidly. These structures embed important information for predicting structures of new protein complexes. This motivated us to develop the PPISP method for predicting interface residues in protein-protein complexes. In PPISP, sequence profiles and solvent accessibility of spatially neighboring surface residues were used as input to a neural network. The network was trained on native interface residues collected from the Protein Data Bank. The prediction accuracy at the time was 70% with 47% coverage of native interface residues. Now we have extensively improved PPISP. The training set now consisted of 1156 nonhomologous protein chains. Test on a set of 100 nonhomologous protein chains showed that the prediction accuracy is now increased to 80% with 51% coverage. To solve the problem of over-prediction and under-prediction associated with individual neural network models, we developed a consensus method that combines predictions from multiple models with different levels of accuracy and coverage. Applied on a benchmark set of 68 proteins for protein-protein docking, the consensus approach outperformed the best individual models by 3-8 percentage points in accuracy. To demonstrate the predictive power of cons-PPISP, eight complex-forming proteins with interfaces characterized by NMR were tested. These proteins are nonhomologous to the training set and have a total of 144 interface residues identified by chemical shift perturbation. cons-PPISP predicted 174 interface residues with 69% accuracy and 47% coverage and promises to complement experimental techniques in characterizing protein-protein interfaces. .  相似文献   

12.
A method to detect DNA-binding sites on the surface of a protein structure is important for functional annotation. This work describes the analysis of residue patches on the surface of DNA-binding proteins and the development of a method of predicting DNA-binding sites using a single feature of these surface patches. Surface patches and the DNA-binding sites were initially analysed for accessibility, electrostatic potential, residue propensity, hydrophobicity and residue conservation. From this, it was observed that the DNA-binding sites were, in general, amongst the top 10% of patches with the largest positive electrostatic scores. This knowledge led to the development of a prediction method in which patches of surface residues were selected such that they excluded residues with negative electrostatic scores. This method was used to make predictions for a data set of 56 non-homologous DNA-binding proteins. Correct predictions made for 68% of the data set.  相似文献   

13.

Background  

DNA microarray technology has emerged as a major tool for exploring cancer biology and solving clinical issues. Predicting a patient's response to chemotherapy is one such issue; successful prediction would make it possible to give patients the most appropriate chemotherapy regimen. Patient response can be classified as either a pathologic complete response (PCR) or residual disease (NoPCR), and these strongly correlate with patient outcome. Microarrays can be used as multigenic predictors of patient response, but probe selection remains problematic. In this study, each probe set was considered as an elementary predictor of the response and was ranked on its ability to predict a high number of PCR and NoPCR cases in a ratio similar to that seen in the learning set. We defined a valuation function that assigned high values to probe sets according to how different the expression of the genes was and to how closely the relative proportions of PCR and NoPCR predictions to the proportions observed in the learning set was. Multigenic predictors were designed by selecting probe sets highly ranked in their predictions and tested using several validation sets.  相似文献   

14.
The strip-of-helix hydrophobicity algorithm was devised to identify protein sequences which, when coiled as alpha or 3(10) helices, had one axial, hydrophobic strip and otherwise variably hydrophilic residues. The strip-of-helix hydrophobicity algorithm also ranked such sequences according to an index, the mean hydrophobicity of amino acids in the axial strip. This algorithm well predicted T cell-presented fragments of antigenic proteins. A derivative of this algorithm (the structural helices algorithm (SHA] was tested for the prediction of helices in crystallographically defined proteins. For the SHA, eight amino acid sequences, 2 cycles plus one amino acid in an alpha helix, with strip-of-helix hydrophobicity indices greater than 2.5, were selected with overlapping segments joined. These selections were terminated according to simple "capping rules," which took into account the roles of N-terminal Asn or Pro and C-terminal Gly in the stability of helices. In analyses of 35 crystallographically defined proteins with known alpha and 3(10) helices, the predictions with the SHA overlapped (had overlap indices x greater than or equal to 0.5) with 34% of known helices, touched (had overlap indices 0.5 greater than x greater than 0) or overlapped with 66% of known helices, or were neighboring (came within 6 residues) or touched or overlapped with 82% of known helices. At each level of judging the quality of prediction, the SHA was usually less sensitive (correct predictions/total number of known helices) and more efficient (correct predictions/total number of predictions) than the Chou-Fasman and Garnier-Robson methods. It was simpler in design and calculation. The chemical mechanisms underlying these algorithms appear to apply both to protein folding and to selection of T cell-presented antigenic sequences.  相似文献   

15.
A neural network-based tool, TargetP, for large-scale subcellular location prediction of newly identified proteins has been developed. Using N-terminal sequence information only, it discriminates between proteins destined for the mitochondrion, the chloroplast, the secretory pathway, and "other" localizations with a success rate of 85% (plant) or 90% (non-plant) on redundancy-reduced test sets. From a TargetP analysis of the recently sequenced Arabidopsis thaliana chromosomes 2 and 4 and the Ensembl Homo sapiens protein set, we estimate that 10% of all plant proteins are mitochondrial and 14% chloroplastic, and that the abundance of secretory proteins, in both Arabidopsis and Homo, is around 10%. TargetP also predicts cleavage sites with levels of correctly predicted sites ranging from approximately 40% to 50% (chloroplastic and mitochondrial presequences) to above 70% (secretory signal peptides). TargetP is available as a web-server at http://www.cbs.dtu.dk/services/TargetP/.  相似文献   

16.
MOTIVATION: Identifying the location of ligand binding sites on a protein is of fundamental importance for a range of applications including molecular docking, de novo drug design and structural identification and comparison of functional sites. Here, we describe a new method of ligand binding site prediction called Q-SiteFinder. It uses the interaction energy between the protein and a simple van der Waals probe to locate energetically favourable binding sites. Energetically favourable probe sites are clustered according to their spatial proximity and clusters are then ranked according to the sum of interaction energies for sites within each cluster. RESULTS: There is at least one successful prediction in the top three predicted sites in 90% of proteins tested when using Q-SiteFinder. This success rate is higher than that of a commonly used pocket detection algorithm (Pocket-Finder) which uses geometric criteria. Additionally, Q-SiteFinder is twice as effective as Pocket-Finder in generating predicted sites that map accurately onto ligand coordinates. It also generates predicted sites with the lowest average volumes of the methods examined in this study. Unlike pocket detection, the volumes of the predicted sites appear to show relatively low dependence on protein volume and are similar in volume to the ligands they contain. Restricting the size of the pocket is important for reducing the search space required for docking and de novo drug design or site comparison. The method can be applied in structural genomics studies where protein binding sites remain uncharacterized since the 86% success rate for unbound proteins appears to be only slightly lower than that of ligand-bound proteins. AVAILABILITY: Both Q-SiteFinder and Pocket-Finder have been made available online at http://www.bioinformatics.leeds.ac.uk/qsitefinder and http://www.bioinformatics.leeds.ac.uk/pocketfinder  相似文献   

17.
Biophysical forcefields have contributed less than originally anticipated to recent progress in protein structure prediction. Here, we have investigated the selectivity of a recently developed all‐atom free‐energy forcefield for protein structure prediction and quality assessment (QA). Using a heuristic method, but excluding homology, we generated decoy‐sets for all targets of the CASP7 protein structure prediction assessment with <150 amino acids. The decoys in each set were then ranked by energy in short relaxation simulations and the best low‐energy cluster was submitted as a prediction. For four of nine template‐free targets, this approach generated high‐ranking predictions within the top 10 models submitted in CASP7 for the respective targets. For these targets, our de‐novo predictions had an average GDT_S score of 42.81, significantly above the average of all groups. The refinement protocol has difficulty for oligomeric targets and when no near‐native decoys are generated in the decoy library. For targets with high‐quality decoy sets the refinement approach was highly selective. Motivated by this observation, we rescored all server submissions up to 200 amino acids using a similar refinement protocol, but using no clustering, in a QA exercise. We found an excellent correlation between the best server models and those with the lowest energy in the forcefield. The free‐energy refinement protocol may thus be an efficient tool for relative QA and protein structure prediction. Proteins 2009. © 2009 Wiley‐Liss, Inc.  相似文献   

18.
The protein structure field is experiencing a revolution. From the increased throughput of techniques to determine experimental structures, to developments such as cryo-EM that allow us to find the structures of large protein complexes or, more recently, the development of artificial intelligence tools, such as AlphaFold, that can predict with high accuracy the folding of proteins for which the availability of homology templates is limited. Here we quantify the effect of the recently released AlphaFold database of protein structural models in our knowledge on human proteins. Our results indicate that our current baseline for structural coverage of 48%, considering experimentally-derived or template-based homology models, elevates up to 76% when including AlphaFold predictions. At the same time the fraction of dark proteome is reduced from 26% to just 10% when AlphaFold models are considered. Furthermore, although the coverage of disease-associated genes and mutations was near complete before AlphaFold release (69% of Clinvar pathogenic mutations and 88% of oncogenic mutations), AlphaFold models still provide an additional coverage of 3% to 13% of these critically important sets of biomedical genes and mutations. Finally, we show how the contribution of AlphaFold models to the structural coverage of non-human organisms, including important pathogenic bacteria, is significantly larger than that of the human proteome. Overall, our results show that the sequence-structure gap of human proteins has almost disappeared, an outstanding success of direct consequences for the knowledge on the human genome and the derived medical applications.  相似文献   

19.
20.
Predicting transmembrane beta-barrels in proteomes   总被引:1,自引:0,他引:1  
Very few methods address the problem of predicting beta-barrel membrane proteins directly from sequence. One reason is that only very few high-resolution structures for transmembrane beta-barrel (TMB) proteins have been determined thus far. Here we introduced the design, statistics and results of a novel profile-based hidden Markov model for the prediction and discrimination of TMBs. The method carefully attempts to avoid over-fitting the sparse experimental data. While our model training and scoring procedures were very similar to a recently published work, the architecture and structure-based labelling were significantly different. In particular, we introduced a new definition of beta- hairpin motifs, explicit state modelling of transmembrane strands, and a log-odds whole-protein discrimination score. The resulting method reached an overall four-state (up-, down-strand, periplasmic-, outer-loop) accuracy as high as 86%. Furthermore, accurately discriminated TMB from non-TMB proteins (45% coverage at 100% accuracy). This high precision enabled the application to 72 entirely sequenced Gram-negative bacteria. We found over 164 previously uncharacterized TMB proteins at high confidence. Database searches did not implicate any of these proteins with membranes. We challenge that the vast majority of our 164 predictions will eventually be verified experimentally. All proteome predictions and the PROFtmb prediction method are available at http://www.rostlab.org/ services/PROFtmb/.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号