首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 0 毫秒
1.
2.
3.

Background

It is a major challenge of computational biology to provide a comprehensive functional classification of all known proteins. Most existing methods seek recurrent patterns in known proteins based on manually-validated alignments of known protein families. Such methods can achieve high sensitivity, but are limited by the necessary manual labor. This makes our current view of the protein world incomplete and biased. This paper concerns ProtoNet, a automatic unsupervised global clustering system that generates a hierarchical tree of over 1,000,000 proteins, based solely on sequence similarity.

Results

In this paper we show that ProtoNet correctly captures functional and structural aspects of the protein world. Furthermore, a novel feature is an automatic procedure that reduces the tree to 12% its original size. This procedure utilizes only parameters intrinsic to the clustering process. Despite the substantial reduction in size, the system's predictive power concerning biological functions is hardly affected. We then carry out an automatic comparison with existing functional protein annotations. Consequently, 78% of the clusters in the compressed tree (5,300 clusters) get assigned a biological function with a high confidence. The clustering and compression processes are unsupervised, and robust.

Conclusions

We present an automatically generated unbiased method that provides a hierarchical classification of all currently known proteins.
  相似文献   

4.

Background  

Methods for predicting protein function directly from amino acid sequences are useful tools in the study of uncharacterised protein families and in comparative genomics. Until now, this problem has been approached using machine learning techniques that attempt to predict membership, or otherwise, to predefined functional categories or subcellular locations. A potential drawback of this approach is that the human-designated functional classes may not accurately reflect the underlying biology, and consequently important sequence-to-function relationships may be missed.  相似文献   

5.
6.
The structure, function, stability, and many other properties of a protein in a fixed environment are fully specified by its sequence, but in a manner that is difficult to discern. We present a general approach for rapidly mapping sequences directly to their energies on a pre-specified rigid backbone, an important sub-problem in computational protein design and in some methods for protein structure prediction. The cluster expansion (CE) method that we employ can, in principle, be extended to model any computable or measurable protein property directly as a function of sequence. Here we show how CE can be applied to the problem of computational protein design, and use it to derive excellent approximations of physical potentials. The approach provides several attractive advantages. First, following a one-time derivation of a CE expansion, the amount of time necessary to evaluate the energy of a sequence adopting a specified backbone conformation is reduced by a factor of 10(7) compared to standard full-atom methods for the same task. Second, the agreement between two full-atom methods that we tested and their CE sequence-based expressions is very high (root mean square deviation 1.1-4.7 kcal/mol, R2 = 0.7-1.0). Third, the functional form of the CE energy expression is such that individual terms of the expansion have clear physical interpretations. We derived expressions for the energies of three classic protein design targets-a coiled coil, a zinc finger, and a WW domain-as functions of sequence, and examined the most significant terms. Single-residue and residue-pair interactions are sufficient to accurately capture the energetics of the dimeric coiled coil, whereas higher-order contributions are important for the two more globular folds. For the task of designing novel zinc-finger sequences, a CE-derived energy function provides significantly better solutions than a standard design protocol, in comparable computation time. Given these advantages, CE is likely to find many uses in computational structural modeling.  相似文献   

7.
The aim of this work was to study the relationship between structure conservation and sequence divergence in protein evolution. To this end, we developed a model of structurally constrained protein evolution (SCPE) in which trial sequences, generated by random mutations at gene level, are selected against departure from a reference three-dimensional structure. Since at the mutational level SCPE is completely unbiased, any emergent sequence pattern will be due exclusively to structural constraints. In this first report, it is shown that SCPE correctly predicts the characteristic hexapeptide motif of the left-handed parallel beta helix (LbetaH) domain of UDP-N-acetylglucosamine acyltransferases (LpxA).  相似文献   

8.
Systematic and fully automated identification of protein sequence patterns.   总被引:4,自引:0,他引:4  
We present an efficient algorithm to systematically and automatically identify patterns in protein sequence families. The procedure is based on the Splash deterministic pattern discovery algorithm and on a framework to assess the statistical significance of patterns. We demonstrate its application to the fully automated discovery of patterns in 974 PROSITE families (the complete subset of PROSITE families which are defined by patterns and contain DR records). Splash generates patterns with better specificity and undiminished sensitivity, or vice versa, in 28% of the families; identical statistics were obtained in 48% of the families, worse statistics in 15%, and mixed behavior in the remaining 9%. In about 75% of the cases, Splash patterns identify sequence sites that overlap more than 50% with the corresponding PROSITE pattern. The procedure is sufficiently rapid to enable its use for daily curation of existing motif and profile databases. Third, our results show that the statistical significance of discovered patterns correlates well with their biological significance. The trypsin subfamily of serine proteases is used to illustrate this method's ability to exhaustively discover all motifs in a family that are statistically and biologically significant. Finally, we discuss applications of sequence patterns to multiple sequence alignment and the training of more sensitive score-based motif models, akin to the procedure used by PSI-BLAST. All results are available at httpl//www.research.ibm.com/spat/.  相似文献   

9.
We describe a novel approach for inferring functional relationship of proteins by detecting sequence and spatial patterns of protein surfaces. Well-formed concave surface regions in the form of pockets and voids are examined to identify similarity relationship that might be directly related to protein function. We first exhaustively identify and measure analytically all 910,379 surface pockets and interior voids on 12,177 protein structures from the Protein Data Bank. The similarity of patterns of residues forming pockets and voids are then assessed in sequence, in spatial arrangement, and in orientational arrangement. Statistical significance in the form of E and p-values is then estimated for each of the three types of similarity measurements. Our method is fully automated without human intervention and can be used without input of query patterns. It does not assume any prior knowledge of functional residues of a protein, and can detect similarity based on surface patterns small and large. It also tolerates, to some extent, conformational flexibility of functional sites. We show with examples that this method can detect functional relationship with specificity for members of the same protein family and superfamily, as well as remotely related functional surfaces from proteins of different fold structures. We envision that this method can be used for discovering novel functional relationship of protein surfaces, for functional annotation of protein structures with unknown biological roles, and for further inquiries on evolutionary origins of structural elements important for protein function.  相似文献   

10.
The increasing number and diversity of protein sequence families requires new methods to define and predict details regarding function. Here, we present a method for analysis and prediction of functional sub-types from multiple protein sequence alignments. Given an alignment and set of proteins grouped into sub-types according to some definition of function, such as enzymatic specificity, the method identifies positions that are indicative of functional differences by comparison of sub-type specific sequence profiles, and analysis of positional entropy in the alignment. Alignment positions with significantly high positional relative entropy correlate with those known to be involved in defining sub-types for nucleotidyl cyclases, protein kinases, lactate/malate dehydrogenases and trypsin-like serine proteases. We highlight new positions for these proteins that suggest additional experiments to elucidate the basis of specificity. The method is also able to predict sub-type for unclassified sequences. We assess several variations on a prediction method, and compare them to simple sequence comparisons. For assessment, we remove close homologues to the sequence for which a prediction is to be made (by a sequence identity above a threshold). This simulates situations where a protein is known to belong to a protein family, but is not a close relative of another protein of known sub-type. Considering the four families above, and a sequence identity threshold of 30 %, our best method gives an accuracy of 96 % compared to 80 % obtained for sequence similarity and 74 % for BLAST. We describe the derivation of a set of sub-type groupings derived from an automated parsing of alignments from PFAM and the SWISSPROT database, and use this to perform a large-scale assessment. The best method gives an average accuracy of 94 % compared to 68 % for sequence similarity and 79 % for BLAST. We discuss implications for experimental design, genome annotation and the prediction of protein function and protein intra-residue distances.  相似文献   

11.
12.
MOTIVATION: The prediction of ligand-binding residues or catalytically active residues of a protein may give important hints that can guide further genetic or biochemical studies. Existing sequence-based prediction methods mostly rank residue positions by evolutionary conservation calculated from a multiple sequence alignment of homologs. A problem hampering more wide-spread application of these methods is the low per-residue precision, which at 20% sensitivity is around 35% for ligand-binding residues and 20% for catalytic residues. RESULTS: We combine information from the conservation at each site, its amino acid distribution, as well as its predicted secondary structure (ss) and relative solvent accessibility (rsa). First, we measure conservation by how much the amino acid distribution at each site differs from the distribution expected for the predicted ss and rsa states. Second, we include the conservation of neighboring residues in a weighted linear score by analytically optimizing the signal-to-noise ratio of the total score. Third, we use conditional probability density estimation to calculate the probability of each site to be functional given its conservation, the observed amino acid distribution, and the predicted ss and rsa states. We have constructed two large data sets, one based on the Catalytic Site Atlas and the other on PDB SITE records, to benchmark methods for predicting functional residues. The new method FRcons predicts ligand-binding and catalytic residues with higher precision than alternative methods over the entire sensitivity range, reaching 50% and 40% precision at 20% sensitivity, respectively. AVAILABILITY: Server: http://frpred.tuebingen.mpg.de. Data sets: ftp://ftp.tuebingen.mpg.de/pub/protevo/FRpred/.  相似文献   

13.
14.
MOTIVATION: Data Mining Prediction (DMP) is a novel approach to predicting protein functional class from sequence. DMP works even in the absence of a homologous protein of known function. We investigate the utility of different ways of representing protein sequence in DMP (residue frequencies, phylogeny, predicted structure) using the Escherichia coli genome as a model. RESULTS: Using the different representations DMP learnt prediction rules that were more accurate than default at every level of function using every type of representation. The most effective way to represent sequence was using phylogeny (75% accuracy and 13% coverage of unassigned ORFs at the most general level of function: 69% accuracy and 7% coverage at the most detailed). We tested different methods for combining predictions from the different types of representation. These improved both the accuracy and coverage of predictions, e.g. 40% of all unassigned ORFs could be predicted at an estimated accuracy of 60% and 5% of unassigned ORFs could be predicted at an estimated accuracy of 86%.  相似文献   

15.
16.
We identified key residues from the structural alignment of families of protein domains from SCOP which we represented in the form of sparse protein signatures. A signature-generating algorithm (SigGen) was developed and used to automatically identify key residues based on several structural and sequence-based criteria. The capacity of the signatures to detect related sequences from the SWISSPROT database was assessed by receiver operator characteristic (ROC) analysis and jack-knife testing. Test signatures for families from each of the main SCOP classes are described in relation to the quality of the structural alignments, the SigGen parameters used, and their diagnostic performance. We show that automatically generated signatures are potently diagnostic for their family (ROC50 scores typically >0.8), consistently outperform random signatures, and can identify sequence relationships in the "twilight zone" of protein sequence similarity (<40%). Signatures based on 15%-30% of alignment positions occurred most frequently among the best-performing signatures. When alignment quality is poor, sparser signatures perform better, whereas signatures generated from higher-quality alignments of fewer structures require more positions to be diagnostic. Our validation of signatures from the Globin family shows that when sequences from the structural alignment are removed and new signatures generated, the omitted sequences are still detected. The positions highlighted by the signature often correspond (alignment specificity >0.7) to the key positions in the original (non-jack-knifed) alignment. We discuss potential applications of sparse signatures in sequence annotation and homology modeling.  相似文献   

17.
MOTIVATION: We describe a novel method for detecting the domain structure of a protein from sequence information alone. The method is based on analyzing multiple sequence alignments that are derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence and are combined into a single predictor using a neural network. The output is further smoothed and post-processed using a probabilistic model to predict the most likely transition positions between domains. RESULTS: The method was assessed using the domain definitions in SCOP and CATH for proteins of known structure and was compared with several other existing methods. Our method performs well both in terms of accuracy and sensitivity. It improves significantly over the best methods available, even some of the semi-manual ones, while being fully automatic. Our method can also be used to suggest and verify domain partitions based on structural data. A few examples of predicted domain definitions and alternative partitions, as suggested by our method, are also discussed. AVAILABILITY: An online domain-prediction server is available at http://biozon.org/tools/domains/  相似文献   

18.
Hydrophobic side chains often are buried in the interior of a protein, and evolutionarily related proteins usually maintain the hydrophobic character of buried positions. In this paper we show that a pattern of hydrophobicity values derived from a set of related protein sequences is well correlated with the linear pattern of side-chain solvent accessibility values, calculated from a known protein structure representative of the sequences. In several cases, information from aligned sequences can be used to select the correct tertiary fold from a large data base of protein structures.  相似文献   

19.
A method is described in which proteins that match PROSITE patterns are filtered by the root-mean-square deviation of the local 3D structures of the probe and target over the pattern components. This was found to increase the discrimination between true and false members of the protein family but was dependent on how unique the structural features in the pattern were compared to equivalent fragments extracted from the structure databank (for example; if the pattern fell in an alpha-helix, then discrimination was poor.) We then generalised the sequence patterns (by widening the range of amino acid residues allowed at each position) and monitored how well the structural information helped retain specificity. While the discrimination of the pure sequence pattern had generally disappeared at information content values less than ten bits, the discrimination of the combined sequence structure probe remained high at this point before following a similar decay. The displacement between these curves indicates that the structural component is, on average, equivalent to about ten bits. The sequence patterns were also filtered using the structure comparison program SAP, giving a global, rather than local "view" of the proteins. This allowed the information content of the sequence patterns to become even less specific but raised problems of whether some proteins encountered with the same fold but no PROSITE pattern should constitute family members.  相似文献   

20.
MOTIVATION: Protein sequence comparison methods are routinely used to infer the intricate network of evolutionary relationships found within the rapidly growing library of protein sequences, and thereby to predict the structure and function of uncharacterized proteins. In the present study, we detail an improved statistical benchmark of pairwise protein sequence comparison algorithms. We use bootstrap resampling techniques to determine standard statistical errors and to estimate the confidence of our conclusions. We show that the underlying structure within benchmark databases causes Efron's standard, non-parametric bootstrap to be biased. Consequently, the standard bootstrap underpredicts average performance when used in the context of evaluating sequence comparison methods. We have developed, as an alternative, an unbiased statistical evaluation based on the Bayesian bootstrap, a resampling method operationally similar to the standard bootstrap. RESULTS: We apply our analysis to the comparative study of amino acid substitution matrix families and find that using modern matrices results in a small, but statistically significant improvement in remote homology detection compared with the classic PAM and BLOSUM matrices. AVAILABILITY: The sequence sets and code for performing these analyses are available from http://compbio.berkeley.edu/. Contact: brenner@compbio.berkeley.edu.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号