Similar literature
Found 20 similar documents; search took 78 ms.
1.
We present a novel method for the comparison of multiple protein alignments with assessment of statistical significance (COMPASS). The method derives numerical profiles from alignments, constructs optimal local profile-profile alignments and analytically estimates E-values for the detected similarities. The scoring system and E-value calculation are based on a generalization of the PSI-BLAST approach to profile-sequence comparison, which is adapted for the profile-profile case. Tested along with existing methods for profile-sequence (PSI-BLAST) and profile-profile (prof_sim) comparison, COMPASS shows increased abilities for sensitive and selective detection of remote sequence similarities, as well as improved quality of local alignments. The method allows prediction of relationships between protein families in the PFAM database beyond the range of conventional methods. Two predicted relations with high significance are similarities between various Rossmann-type folds and between various helix-turn-helix-containing families. The potential value of COMPASS for structure/function predictions is illustrated by the detection of an intricate homology between the DNA-binding domain of the CTF/NFI family and the MH1 domain of the Smad family.

2.
MOTIVATION: Improved comparisons of multiple sequence alignments (profiles) with other profiles can identify subtle relationships between protein families and motifs significantly beyond the resolution of sequence-based comparisons. RESULTS: The local alignment of multiple alignments (LAMA) method was modified to estimate alignment score significance by applying a new measure based on Fisher's combining method. To verify the new procedure, we used known protein structures, sequence annotations and cyclical relations consistency analysis (CYRCA) sets of consistently aligned blocks. Using the new significance measure improved the sensitivity of LAMA without altering its selectivity. The program performed better than other profile-to-profile methods (COMPASS and Prof_sim) and a sequence-to-profile method (PSI-BLAST). The testing was large scale and used several parameters, including pseudo-counts profile calculations and local ungapped blocks or more extended gapped profiles. This comparison provides guidelines to the relative advantages of each method for different cases. We demonstrate and discuss the unique advantages of using block multiple alignments of protein motifs.
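
The abstract above mentions a significance measure based on Fisher's combining method but does not give its exact form. As a rough illustration only, the sketch below implements the textbook version of Fisher's method for combining independent p-values; the function name and the idea of feeding it per-block p-values are assumptions, not details of LAMA.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with the textbook Fisher method.

    Illustrative only: this is not the LAMA significance measure itself.
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(p <= 0.0 or p > 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    # Fisher's statistic is X = -2 * sum(ln p_i); under the null hypothesis it
    # follows a chi-square distribution with 2k degrees of freedom.
    half_x = -sum(math.log(p) for p in p_values)

    # For an even number of degrees of freedom the chi-square tail has a
    # closed form: P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_x / i
        total += term
    return math.exp(-half_x) * total
```

With a single p-value the function returns that value unchanged, which is a convenient sanity check on the closed-form tail.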

3.
MOTIVATION: The Monte Carlo fragment insertion method for protein tertiary structure prediction (ROSETTA) of Baker and others has been merged with the I-SITES library of sequence structure motifs and the HMMSTR model for local structure in proteins, to form a new public server for the ab initio prediction of protein structure. The server performs several tasks in addition to tertiary structure prediction, including a database search, amino acid profile generation, fragment structure prediction, and backbone angle and secondary structure prediction. Meeting reasonable service goals required improvements in efficiency, in particular for the ROSETTA algorithm. RESULTS: The new server was used for blind predictions of 40 protein sequences as part of the CASP4 blind structure prediction experiment. The results for 31 of those predictions are presented here. 61% of the residues overall were found in topologically correct predictions, which are defined as fragments of 30 residues or more with a root-mean-square deviation in superimposed alpha carbons of less than 6 Å. HMMSTR 3-state secondary structure predictions were 73% correct overall. Tertiary structure predictions did not improve the accuracy of secondary structure prediction.

4.
PFAM is a popular and effective database of Hidden Markov Models (HMMs), which represent a wide range of protein families. Here, we introduce TLFAM as a more specific set of HMM databases. Analyses of bacterial genomes using TLFAM-Pro show better scores, E-values, and alignment lengths than those using the more generalized PFAM. Since PFAM will still find hits that TLFAM-Pro will not, we recommend that they be used jointly, rather than exclusively. This method provides the best features of both databases. This method has been extended to a number of other organism types, such as archaea, and the databases are freely available to interested researchers.

5.
Domains are considered the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences was decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.

6.
Vicatos S, Reddy BV, Kaznessis Y. Proteins, 2005, 58(4): 935-949
In this work we present a novel correlated mutations analysis (CMA) method that is significantly more accurate than previously reported CMA methods. Calculation of correlation coefficients is based on physicochemical properties of residues (predictors) and not on substitution matrices. This results in reliable prediction of pairs of residues that are distant in protein sequence but proximal in its three-dimensional tertiary structure. Multiple sequence alignments (MSA) containing a sequence of known structure for 127 families from the PFAM database have been selected so that all major protein architectures described in the CATH classification database are represented. Protein sequences in the selected families were filtered so that only those evolutionarily close to the target protein remain in the MSA. The average accuracy obtained for the alpha beta class of proteins was 26.8% of predicted proximal pairs with average improvement over random accuracy (IOR) of 6.41. Average accuracy is 20.6% for the mainly beta class and 14.4% for the mainly alpha class. The optimum correlation coefficient cutoff (cc cutoff) was found to be around 0.65. The first predictor, which correlates to hydrophobicity, provides the most reliable results. The other two predictors give good predictions which can be used in conjunction with those of the first one. When a stricter cc cutoff is chosen, the average accuracy increases significantly (38.76% for alpha beta class), but the trade-off is a smaller number of predictions. The use of solvent accessible area estimations for filtering false positives out of the predictions is promising.
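
The idea of correlating alignment columns through a physicochemical property and keeping pairs above a cutoff can be sketched as follows. This is a minimal illustration under stated assumptions: the Kyte-Doolittle hydropathy values stand in for the paper's predictors (which the abstract does not list), the Pearson coefficient is assumed as the correlation measure, and all function names are hypothetical. Only the 0.65 cutoff comes from the abstract.

```python
import math

# Kyte-Doolittle hydropathy values for a few residues. This scale is an
# assumption used as a stand-in for the predictors described in the abstract.
HYDROPATHY = {
    "A": 1.8, "I": 4.5, "L": 3.8, "V": 4.2,
    "D": -3.5, "E": -3.5, "K": -3.9, "R": -4.5,
}


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def correlated_pairs(alignment, cutoff=0.65):
    """Return (i, j, cc) for column pairs whose correlation reaches the cutoff."""
    n_cols = len(alignment[0])
    # Map every alignment column to a vector of property values.
    columns = [[HYDROPATHY[seq[j]] for seq in alignment] for j in range(n_cols)]
    pairs = []
    for i in range(n_cols):
        for j in range(i + 1, n_cols):
            cc = pearson(columns[i], columns[j])
            if cc >= cutoff:
                pairs.append((i, j, cc))
    return pairs
```

For a toy alignment such as `["AKD", "IED", "LDD", "VED"]`, only the first two columns vary together, so they form the only candidate pair; the constant third column is skipped because its variance is zero.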

7.
Adamczak R, Porollo A, Meller J. Proteins, 2004, 56(4): 753-767
Accurate prediction of relative solvent accessibilities (RSAs) of amino acid residues in proteins may be used to facilitate protein structure prediction and functional annotation. Toward that goal we developed a novel method for improved prediction of RSAs. Contrary to other machine learning-based methods from the literature, we do not impose a classification problem with arbitrary boundaries between the classes. Instead, we seek a continuous approximation of the real-value RSA using nonlinear regression, with several feed-forward and recurrent neural networks, which are then combined into a consensus predictor. A set of 860 protein structures derived from the PFAM database was used for training, whereas validation of the results was carefully performed on several nonredundant control sets comprising a total of 603 structures that were derived from new Protein Data Bank structures and had no homology to proteins included in the training. Two classes of alternative predictors were developed for comparison with the regression-based approach: one based on the standard classification approach and the other based on a semicontinuous approximation with the so-called thermometer encoding. Furthermore, a weighted approximation, with errors being scaled by the observed levels of variability in RSA for equivalent residues in families of homologous structures, was applied in order to improve the results. The effects of including evolutionary profiles and the growth of sequence databases were assessed. In accord with the observed levels of variability in RSA for different ranges of RSA values, the regression accuracy is higher for buried than for exposed residues, with overall 15.3-15.8% mean absolute errors and correlation coefficients between the predicted and experimental values of 0.64-0.67 on different control sets. The new method outperforms classification-based algorithms when the real value predictions are projected onto two-class classification problems with several commonly used thresholds to separate exposed and buried residues. For example, classification accuracy of about 77% is consistently achieved on all control sets with a threshold of 25% RSA. A web server that enables RSA prediction using the new method and provides customizable graphical representation of the results is available at http://sable.cchmc.org.

8.
The increasing number and diversity of protein sequence families requires new methods to define and predict details regarding function. Here, we present a method for analysis and prediction of functional sub-types from multiple protein sequence alignments. Given an alignment and set of proteins grouped into sub-types according to some definition of function, such as enzymatic specificity, the method identifies positions that are indicative of functional differences by comparison of sub-type specific sequence profiles, and analysis of positional entropy in the alignment. Alignment positions with significantly high positional relative entropy correlate with those known to be involved in defining sub-types for nucleotidyl cyclases, protein kinases, lactate/malate dehydrogenases and trypsin-like serine proteases. We highlight new positions for these proteins that suggest additional experiments to elucidate the basis of specificity. The method is also able to predict sub-type for unclassified sequences. We assess several variations on a prediction method, and compare them to simple sequence comparisons. For assessment, we remove close homologues to the sequence for which a prediction is to be made (by a sequence identity above a threshold). This simulates situations where a protein is known to belong to a protein family, but is not a close relative of another protein of known sub-type. Considering the four families above, and a sequence identity threshold of 30%, our best method gives an accuracy of 96% compared to 80% obtained for sequence similarity and 74% for BLAST. We describe the derivation of a set of sub-type groupings derived from an automated parsing of alignments from PFAM and the SWISSPROT database, and use this to perform a large-scale assessment. The best method gives an average accuracy of 94% compared to 68% for sequence similarity and 79% for BLAST. We discuss implications for experimental design, genome annotation and the prediction of protein function and protein intra-residue distances.
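
Comparing sub-type specific profiles by positional relative entropy can be illustrated with a small sketch. The abstract does not state the exact formula, so the version below is an assumption: per-column residue frequencies with a uniform pseudocount, scored by a symmetrised Kullback-Leibler divergence in bits. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_frequencies(residues, pseudocount=1.0):
    """Residue frequencies for one alignment column, smoothed by a pseudocount."""
    # Gaps and unknown symbols are ignored so the frequencies sum to one.
    counts = Counter(r for r in residues if r in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts.get(aa, 0) + pseudocount) / total for aa in AMINO_ACIDS}


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q) in bits."""
    return sum(p[aa] * math.log2(p[aa] / q[aa]) for aa in AMINO_ACIDS)


def subtype_scores(subtype_a, subtype_b):
    """Score each column by the symmetrised relative entropy between sub-types."""
    n_cols = len(subtype_a[0])
    scores = []
    for j in range(n_cols):
        p = column_frequencies([seq[j] for seq in subtype_a])
        q = column_frequencies([seq[j] for seq in subtype_b])
        scores.append(relative_entropy(p, q) + relative_entropy(q, p))
    return scores
```

Columns that are conserved within each sub-type but differ between them receive the highest scores, which matches the kind of position the abstract describes as indicative of functional differences.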

9.
The genome projects have unearthed an enormous diversity of genes of unknown function that are still awaiting biological and biochemical characterization. These genes, as most others, can be grouped into families based on sequence similarity. The PFAM database currently contains over 2,200 such families, referred to as domains of unknown function (DUF). In a coordinated effort, the four large-scale centers of the NIH Protein Structure Initiative have determined the first three-dimensional structures for more than 250 of these DUF families. Analysis of the first 248 reveals that about two thirds of the DUF families likely represent very divergent branches of already known and well-characterized families, which allows hypotheses to be formulated about their biological function. The remainder can be formally categorized as new folds, although about one third of these show significant substructure similarity to previously characterized folds. These results suggest that, despite the enormous increase in the number and the diversity of new genes being uncovered, the fold space of the proteins they encode is gradually becoming saturated. The previously unexplored sectors of the protein universe appear to be primarily shaped by extreme diversification of known protein families, which then enables organisms to evolve new functions and adapt to particular niches and habitats. Notwithstanding, these DUF families still constitute the richest source for discovery of the remaining protein folds and topologies.

10.
11.
Expansions of sequence databases driven by new sequencing technology continue apace. These result in a continuous supply of protein sequences and domains that cannot be straightforwardly annotated by simple homology methods. For these, structure-based function prediction may contribute to an improved annotation. Here, short Domains of Unknown Function (DUFs) are ab initio modeled with ROSETTA and screened for likely nucleic acid binding function. Thirty-two DUFs are thereby predicted to have a nucleic acid binding function. In most cases, additional evidence supporting that function could be obtained from structure comparison, domain architectures, distant evolutionary relationships, genome context or protein-protein interaction data. These predictions contribute to the function annotation of thousands of proteins.

12.
Frenkel ZM, Trifonov EN. Proteins, 2007, 67(2): 271-284
A new method is proposed to reveal apparent evolutionary relationships between protein fragments with similar 3D structures by finding "intermediate" sequences in the proteomic database. Instead of looking for homologies and intermediates for a whole protein domain, we build a chain of intermediate short sequences, which allows one to link similar structural modules of proteins belonging to the same or different families. Several such chains of intermediates can be combined into an evolutionary tree of structural protein modules. All calculations were made for protein fragments of 20 aa residues. Three evolutionary trees for different module structures are described. The aim of the paper is to introduce the new method and to demonstrate its potential for protein structural predictions. The approach also opens new perspectives for protein evolution studies.

13.
14.
PD-(D/E)XK nucleases, initially represented by only Type II restriction enzymes, now comprise a large and extremely diverse superfamily of proteins. They participate in many different nucleic acid transactions including DNA degradation, recombination, repair and RNA processing. Different PD-(D/E)XK families, although sharing a structurally conserved core, typically display little or no detectable sequence similarity except for the active site motifs. This makes the identification of new superfamily members using standard homology search techniques challenging. To tackle this problem, we developed a method for the detection of PD-(D/E)XK families based on the binary classification of profile-profile alignments using support vector machines (SVMs). Using a number of both superfamily-specific and general features, SVMs were trained to identify true positive alignments of PD-(D/E)XK representatives. With this method we identified several PFAM families of uncharacterized proteins as putative new members of the PD-(D/E)XK superfamily. In addition, we assigned several unclassified restriction enzymes to the PD-(D/E)XK type. Results show that the new method is able to make confident assignments even for alignments that have statistically insignificant scores. We also implemented the method as a freely accessible web server at http://www.ibt.lt/bioinformatics/software/pdexk/.

15.

Background  

The relationship between divergence of amino-acid sequence and divergence of function among homologous proteins is complex. The assumption that homologs share function (the basis of transfer of annotations in databases) must therefore be regarded with caution. Here, we present a quantitative study of sequence and function divergence, based on the Gene Ontology classification of function. We determined the relationship between sequence divergence and function divergence in 6828 protein families from the PFAM database. Within families there is a broad range of sequence similarity, from very closely related proteins (for instance, orthologs in different mammals) to very distantly related proteins at the limit of reliable recognition of homology.

16.
The small ubiquitin-like modifier (SUMO) proteins are a family of proteins that can be attached to a variety of target proteins. Sumoylation of proteins is an important posttranslational modification, so predicting the sumoylation sites of a given protein is significant. Here we employed a combined method to perform this task. We predicted the sumoylation sites of a protein by a two-stage procedure. At the first stage, whether a protein would be sumoylated was predicted; at the second stage, the sumoylation sites of the protein were predicted if it was determined to be modified by SUMO at the first stage. At the first stage, we encoded a protein with protein families (PFAM) and trained the predictor with nearest network algorithm (NNA); at the second stage, we encoded nonapeptides (peptides that contain nine residues) of the protein containing the lysine residues, with Amino Acid Index, and trained the predictor with NNA. The predictor was tested by the k-fold cross-validation method. The highest accuracy of the second-stage predictor was 99.55% when 12 features were incorporated in the predictor. The corresponding Matthews Correlation Coefficient was 0.7952. These results indicate that the method is a promising tool to predict the sumoylation sites of a protein. Finally, the features used in the predictor are discussed. The software is available on request.
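
The abstract reports both an accuracy and a Matthews Correlation Coefficient (MCC). The sketch below shows the standard definitions of the two measures computed from a confusion matrix; the function names are hypothetical, and the abstract does not give the underlying counts.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom
```

The two measures can diverge sharply on imbalanced data, which is typical of site prediction where most lysines are unmodified. With made-up counts of 990 true negatives and 10 false negatives, a predictor that never predicts a site reaches 99% accuracy but an MCC of 0, which is why reporting both is informative.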

17.
MOTIVATION: The sequence patterns contained in the available motif and hidden Markov model (HMM) databases are a valuable source of information for protein sequence annotation. For structure prediction and fold recognition purposes, we computed mappings from such pattern databases to the protein domain hierarchy given by the ASTRAL compendium and applied them to the prediction of SCOP classifications. Our aim is to make highly confident predictions also for non-trivial cases if possible and abstain from a prediction otherwise, and thus to provide a method that can be used as a first step in a pipeline of prediction methods. We describe two successful examples for such pipelines. With the AutoSCOP approach, it is possible to make predictions in a large-scale manner for many domains of the available sequences in the well-known protein sequence databases. RESULTS: AutoSCOP computes unique sequence patterns and pattern combinations for SCOP classifications. For instance, we assign a SCOP superfamily to a pattern found in its members whenever the pattern does not occur in any other SCOP superfamily. Especially on the fold and superfamily level, our method achieves both high sensitivity (above 93%) and high specificity (above 98%) on the difference set between two ASTRAL versions, due to being able to abstain from unreliable predictions. Further, on a harder test set filtered at low sequence identity, the combination with profile-profile alignments improves accuracy and performs comparably even to structure alignment methods. Integrating our method with structure alignment, we are able to achieve an accuracy of 99% on SCOP fold classifications on this set. In an analysis of false assignments of domains from new folds/superfamilies/families to existing SCOP classifications, AutoSCOP correctly abstains for more than 70% of the domains belonging to new folds and superfamilies, and more than 80% of the domains belonging to new families. These findings show that our approach is a useful additional filter for SCOP classification prediction of protein domains in combination with well-known methods such as profile-profile alignment. AVAILABILITY: A web server where users can input their domain sequences is available at http://www.bio.ifi.lmu.de/autoscop.
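
The rule described above, assigning a superfamily to a pattern only when the pattern occurs in no other superfamily and abstaining otherwise, can be sketched in a few lines. This is a simplified illustration, not the AutoSCOP implementation: pattern combinations are omitted, and the function names and placeholder identifiers (`pat1`, `sfA`, and so on) are hypothetical.

```python
from collections import defaultdict


def unique_pattern_map(domain_patterns, domain_superfamily):
    """Map each pattern to a superfamily if it occurs in exactly one superfamily."""
    seen = defaultdict(set)
    for domain, patterns in domain_patterns.items():
        for pattern in patterns:
            seen[pattern].add(domain_superfamily[domain])
    # Patterns shared between superfamilies are ambiguous and are dropped.
    return {
        pattern: next(iter(superfamilies))
        for pattern, superfamilies in seen.items()
        if len(superfamilies) == 1
    }


def predict_superfamily(patterns, pattern_to_superfamily):
    """Predict a superfamily, or return None to abstain."""
    hits = {
        pattern_to_superfamily[p] for p in patterns if p in pattern_to_superfamily
    }
    # Abstain when no unique pattern matches or when the matches disagree.
    return hits.pop() if len(hits) == 1 else None
```

Returning `None` instead of a best guess is what lets a method like this act as the first step of a pipeline: anything it declines to classify can be passed on to a slower method such as profile-profile alignment.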

18.
We describe a method to assign a protein structure to a functional family using family-specific fingerprints. Fingerprints represent amino acid packing patterns that occur in most members of a family but are rare in the background, a nonredundant subset of PDB; their information is additional to sequence alignments, sequence patterns, structural superposition, and active-site templates. Fingerprints were derived for 120 families in SCOP using Frequent Subgraph Mining. For a new structure, all occurrences of these family-specific fingerprints may be found by a fast algorithm for subgraph isomorphism; the structure can then be assigned to a family with a confidence value derived from the number of fingerprints found and their distribution in background proteins. In validation experiments, we infer the function of new members added to SCOP families and we discriminate between structurally similar, but functionally divergent TIM barrel families. We then apply our method to predict function for several structural genomics proteins, including orphan structures. Some predictions have been corroborated by other computational methods and some validated by subsequent functional characterization.

19.
The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify due to extreme diversity among its members. Previous comparisons of BLAST, k-nearest neighbor (k-NN), hidden Markov model (HMM) and support vector machine (SVM) using alignment-based features have suggested that classifiers at the complexity of SVM are needed to attain high accuracy. Here, analogous to document classification, we applied Decision Tree and Naive Bayes classifiers with chi-square feature selection on counts of n-grams (i.e. short peptide sequences of length n) to this classification task. Using the GPCR dataset and evaluation protocol from the previous study, the Naive Bayes classifier attained an accuracy of 93.0 and 92.4% in level I and level II subfamily classification respectively, while SVM has a reported accuracy of 88.4 and 86.3%. This is a 39.7 and 44.5% reduction in residual error for level I and level II subfamily classification, respectively. The Decision Tree, while inferior to SVM, outperforms HMM in both level I and level II subfamily classification. For those GPCR families whose profiles are stored in the Protein FAMilies database of alignments and HMMs (PFAM), our method performs comparably to a search against those profiles. Finally, our method can be generalized to other protein families by applying it to the superfamily of nuclear receptors with 94.5, 97.8 and 93.6% accuracy in family, level I and level II subfamily classification respectively.
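
The residual error reductions quoted above follow from the reported accuracies. The short sketch below reproduces that arithmetic, assuming the usual definition of relative error reduction; the function name is hypothetical.

```python
def residual_error_reduction(baseline_accuracy, new_accuracy):
    """Percentage of the baseline's error removed by the new classifier.

    Both accuracies are given in percent.
    """
    baseline_error = 100.0 - baseline_accuracy
    new_error = 100.0 - new_accuracy
    return 100.0 * (baseline_error - new_error) / baseline_error


# Accuracies reported in the abstract: SVM first, then Naive Bayes.
level_1 = residual_error_reduction(88.4, 93.0)
level_2 = residual_error_reduction(86.3, 92.4)
```

For level I, the error falls from 11.6% to 7.0%, a drop of 4.6 points out of 11.6, or about 39.7%. For level II, it falls from 13.7% to 7.6%, a drop of 6.1 points out of 13.7, or about 44.5%. Both match the figures in the abstract.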

20.
Structural biology and structural genomics are expected to produce many three-dimensional protein structures in the near future. Each new structure raises questions about its function and evolution. Correct functional and evolutionary classification of a new structure is difficult for distantly related proteins and error-prone using simple statistical scores based on sequence or structure similarity. Here we present an accurate numerical method for the identification of evolutionary relationships (homology). The method is based on the principle that natural selection maintains structural and functional continuity within a diverging protein family. The problem of different rates of structural divergence between different families is solved by first using structural similarities to produce a global map of folds in protein space and then further subdividing fold neighborhoods into superfamilies based on functional similarities. In a validation test against a classification by human experts (SCOP), 77% of homologous pairs were identified with 92% reliability. The method is fully automated, allowing fast, self-consistent and complete classification of large numbers of protein structures. In particular, the discrimination between analogy and homology of close structural neighbors will lead to functional predictions while avoiding overprediction.
