首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 453 毫秒
1.
C Sander  R Schneider 《Proteins》1991,9(1):56-68
The database of known protein three-dimensional structures can be significantly increased by the use of sequence homology, based on the following observations. (1) The database of known sequences, currently at more than 12,000 proteins, is two orders of magnitude larger than the database of known structures. (2) The currently most powerful method of predicting protein structures is model building by homology. (3) Structural homology can be inferred from the level of sequence similarity. (4) The threshold of sequence similarity sufficient for structural homology depends strongly on the length of the alignment. Here, we first quantify the relation between sequence similarity, structure similarity, and alignment length by an exhaustive survey of alignments between proteins of known structure and report a homology threshold curve as a function of alignment length. We then produce a database of homology-derived secondary structure of proteins (HSSP) by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability, and sequence profile. Tertiary structures of the aligned sequences are implied, but not modeled explicitly. The database effectively increases the number of known protein structures by a factor of five to more than 1800. The results may be useful in assessing the structural significance of matches in sequence database searches, in deriving preferences and patterns for structure prediction, in elucidating the structural role of conserved residues, and in modeling three-dimensional detail by homology.  相似文献   

2.
SUMMARY: Sequence-structure alignments are a common means for protein structure prediction in the fields of fold recognition and homology modeling, and there is a broad variety of programs that provide such alignments based on sequence similarity, secondary structure or contact potentials. Nevertheless, finding the best sequence-structure alignment in a pool of alignments remains a difficult problem. QUASAR (quality of sequence-structure alignments ranking) provides a unifying framework for scoring sequence-structure alignments that aids finding well-performing combinations of well-known and custom-made scoring schemes. Those scoring functions can be benchmarked against widely accepted quality scores like MaxSub, TMScore, Touch and APDB, thus enabling users to test their own alignment scores against 'standard-of-truth' structure-based scores. Furthermore, individual score combinations can be optimized with respect to benchmark sets based on known structural relationships using QUASAR's in-built optimization routines.  相似文献   

3.
Structural genomics is the idea of covering protein space so that every protein sequence comes within model building distance of a protein of known structure. Unfortunately, reproducing the structural alignment of distantly related proteins is a difficult challenge to existing sequence alignment and motif search software. We have developed a new transitive alignment algorithm (MaxFlow), which generates accurate alignments between proteins deep in the twilight zone of sequence similarity, below 20% sequence identity. In particular, MaxFlow reliably identifies conserved core motifs between proteins which are only indirect PSI-Blast neighbours. Based on MaxFlow alignments, useful 3D models can be generated for all members of a superfamily from as few as a single structural template – despite hundreds of representatives at 40% sequence identity level and patchy detection of homology by PSI-Blast. We propose novel strategies for target prioritization using MaxFlow scores to predict the optimal templates in a superfamily. Our results support an increase in the granularity of covering protein space that has potentially enormous economic implications for planning the transition to the full production phase of structural genomics.  相似文献   

4.
The database PALI (Phylogeny and ALIgnment of homologous protein structures) consists of families of protein domains of known three-dimensional (3D) structure. In a PALI family, every member has been structurally aligned with every other member (pairwise) and also simultaneous superposition (multiple) of all the members has been performed. The database also contains 3D structure-based and structure-dependent sequence similarity-based phylogenetic dendrograms for all the families. The PALI release used in the present analysis comprises 225 families derived largely from the HOMSTRAD and SCOP databases. The quality of the multiple rigid-body structural alignments in PALI was compared with that obtained from COMPARER, which encodes a procedure based on properties and relationships. The alignments from the two procedures agreed very well and variations are seen only in the low sequence similarity cases often in the loop regions. A validation of Direct Pairwise Alignment (DPA) between two proteins is provided by comparing it with Pairwise alignment extracted from Multiple Alignment of all the members in the family (PMA). In general, DPA and PMA are found to vary rarely. The ready availability of pairwise alignments allows the analysis of variations in structural distances as a function of sequence similarities and number of topologically equivalent Calpha atoms. The structural distance metric used in the analysis combines root mean square deviation (r.m.s.d.) and number of equivalences, and is shown to vary similarly to r.m.s.d. The correlation between sequence similarity and structural similarity is poor in pairs with low sequence similarities. A comparison of sequence and 3D structure-based phylogenies for all the families suggests that only a few families have a radical difference in the two kinds of dendrograms. The difference could occur when the sequence similarity among the homologues is low or when the structures are subjected to evolutionary pressure for the retention of function. The PALI database is expected to be useful in furthering our understanding of the relationship between sequences and structures of homologous proteins and their evolution.  相似文献   

5.
PALI (release 1.2) contains three-dimensional (3-D) structure-dependent sequence alignments as well as structure-based phylogenetic trees of homologous protein domains in various families. The data set of homologous protein structures has been derived by consulting the SCOP database (release 1.50) and the data set comprises 604 families of homologous proteins involving 2739 protein domain structures with each family made up of at least two members. Each member in a family has been structurally aligned with every other member in the same family (pairwise alignment) and all the members in the family are also aligned using simultaneous super-position (multiple alignment). The structural alignments are performed largely automatically, with manual interventions especially in the cases of distantly related proteins, using the program STAMP (version 4.2). Every family is also associated with two dendrograms, calculated using PHYLIP (version 3.5), one based on a structural dissimilarity metric defined for every pairwise alignment and the other based on similarity of topologically equivalent residues. These dendrograms enable easy comparison of sequence and structure-based relationships among the members in a family. Structure-based alignments with the details of structural and sequence similarities, superposed coordinate sets and dendrograms can be accessed conveniently using a web interface. The database can be queried for protein pairs with sequence or structural similarities falling within a specified range. Thus PALI forms a useful resource to help in analysing the relationship between sequence and structure variation at a given level of sequence similarity. PALI also contains over 653 'orphans' (single member families). Using the web interface involving PSI_BLAST and PHYLIP it is possible to associate the sequence of a new protein with one of the families in PALI and generate a phylogenetic tree combining the query sequence and proteins of known 3-D structure. The database with the web interfaced search and dendrogram generation tools can be accessed at http://pauling.mbu.iisc.ernet. in/ approximately pali.  相似文献   

6.
Comparing and classifying the three-dimensional (3D) structures of proteins is of crucial importance to molecular biology, from helping to determine the function of a protein to determining its evolutionary relationships. Traditionally, 3D structures are classified into groups of families that closely resemble the grouping according to their primary sequence. However, significant structural similarities exist at multiple levels between proteins that belong to these different structural families. In this study, we propose a new algorithm, CLICK, to capture such similarities. The method optimally superimposes a pair of protein structures independent of topology. Amino acid residues are represented by the Cartesian coordinates of a representative point (usually the C(α) atom), side chain solvent accessibility, and secondary structure. Structural comparison is effected by matching cliques of points. CLICK was extensively benchmarked for alignment accuracy on four different sets: (i) 9537 pair-wise alignments between two structures with the same topology; (ii) 64 alignments from set (i) that were considered to constitute difficult alignment cases; (iii) 199 pair-wise alignments between proteins with similar structure but different topology; and (iv) 1275 pair-wise alignments of RNA structures. The accuracy of CLICK alignments was measured by the average structure overlap score and compared with other alignment methods, including HOMSTRAD, MUSTANG, Geometric Hashing, SALIGN, DALI, GANGSTA(+), FATCAT, ARTS and SARA. On average, CLICK produces pair-wise alignments that are either comparable or statistically significantly more accurate than all of these other methods. We have used CLICK to uncover relationships between (previously) unrelated proteins. These new biological insights include: (i) detecting hinge regions in proteins where domain or sub-domains show flexibility; (ii) discovering similar small molecule binding sites from proteins of different folds and (iii) discovering topological variants of known structural/sequence motifs. Our method can generally be applied to compare any pair of molecular structures represented in Cartesian coordinates as exemplified by the RNA structure superimposition benchmark.  相似文献   

7.
Structural alignments often reveal relationships between proteins that cannot be detected using sequence alignment alone. However, profile search methods based entirely on structural alignments alone have not been found to be effective in finding remote homologs. Here, we explore the role of structural information in remote homolog detection and sequence alignment. To this end, we develop a series of hybrid multidimensional alignment profiles that combine sequence, secondary and tertiary structure information into hybrid profiles. Sequence-based profiles are profiles whose position-specific scoring matrix is derived from sequence alignment alone; structure-based profiles are those derived from multiple structure alignments. We compare pure sequence-based profiles to pure structure-based profiles, as well as to hybrid profiles that use combined sequence-and-structure-based profiles, where sequence-based profiles are used in loop/motif regions and structural information is used in core structural regions. All of the hybrid methods offer significant improvement over simple profile-to-profile alignment. We demonstrate that both sequence-based and structure-based profiles contribute to remote homology detection and alignment accuracy, and that each contains some unique information. We discuss the implications of these results for further improvements in amino acid sequence and structural analysis.  相似文献   

8.
Twilight zone of protein sequence alignments   总被引:38,自引:0,他引:38  
Sequence alignments unambiguously distinguish between protein pairs of similar and non-similar structure when the pairwise sequence identity is high (>40% for long alignments). The signal gets blurred in the twilight zone of 20-35% sequence identity. Here, more than a million sequence alignments were analysed between protein pairs of known structures to re-define a line distinguishing between true and false positives for low levels of similarity. Four results stood out. (i) The transition from the safe zone of sequence alignment into the twilight zone is described by an explosion of false negatives. More than 95% of all pairs detected in the twilight zone had different structures. More precisely, above a cut-off roughly corresponding to 30% sequence identity, 90% of the pairs were homologous; below 25% less than 10% were. (ii) Whether or not sequence homology implied structural identity depended crucially on the alignment length. For example, if 10 residues were similar in an alignment of length 16 (>60%), structural similarity could not be inferred. (iii) The 'more similar than identical' rule (discarding all pairs for which percentage similarity was lower than percentage identity) reduced false positives significantly. (iv) Using intermediate sequences for finding links between more distant families was almost as successful: pairs were predicted to be homologous when the respective sequence families had proteins in common. All findings are applicable to automatic database searches.  相似文献   

9.
We identified key residues from the structural alignment of families of protein domains from SCOP which we represented in the form of sparse protein signatures. A signature-generating algorithm (SigGen) was developed and used to automatically identify key residues based on several structural and sequence-based criteria. The capacity of the signatures to detect related sequences from the SWISSPROT database was assessed by receiver operator characteristic (ROC) analysis and jack-knife testing. Test signatures for families from each of the main SCOP classes are described in relation to the quality of the structural alignments, the SigGen parameters used, and their diagnostic performance. We show that automatically generated signatures are potently diagnostic for their family (ROC50 scores typically >0.8), consistently outperform random signatures, and can identify sequence relationships in the "twilight zone" of protein sequence similarity (<40%). Signatures based on 15%-30% of alignment positions occurred most frequently among the best-performing signatures. When alignment quality is poor, sparser signatures perform better, whereas signatures generated from higher-quality alignments of fewer structures require more positions to be diagnostic. Our validation of signatures from the Globin family shows that when sequences from the structural alignment are removed and new signatures generated, the omitted sequences are still detected. The positions highlighted by the signature often correspond (alignment specificity >0.7) to the key positions in the original (non-jack-knifed) alignment. We discuss potential applications of sparse signatures in sequence annotation and homology modeling.  相似文献   

10.
The biological role, biochemical function, and structure of uncharacterized protein sequences is often inferred from their similarity to known proteins. A constant goal is to increase the reliability, sensitivity, and accuracy of alignment techniques to enable the detection of increasingly distant relationships. Development, tuning, and testing of these methods benefit from appropriate benchmarks for the assessment of alignment accuracy.Here, we describe a benchmark protocol to estimate sequence-to-sequence and sequence-to-structure alignment accuracy. The protocol consists of structurally related pairs of proteins and procedures to evaluate alignment accuracy over the whole set. The set of protein pairs covers all the currently known fold types. The benchmark is challenging in the sense that it consists of proteins lacking clear sequence similarity.Correct target alignments are derived from the three-dimensional structures of these pairs by rigid body superposition. An evaluation engine computes the accuracy of alignments obtained from a particular algorithm in terms of alignment shifts with respect to the structure derived alignments. Using this benchmark we estimate that the best results can be obtained from a combination of amino acid residue substitution matrices and knowledge-based potentials.  相似文献   

11.

Background  

While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins - proteins that share a common ancestor and a similar structure; pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate.  相似文献   

12.
Structural comparison reveals remote homology that often fails to be detected by sequence comparison. The DALI web server ( http://ekhidna2.biocenter.helsinki.fi/dali ) is a platform for structural analysis that provides database searches and interactive visualization, including structural alignments annotated with secondary structure, protein families and sequence logos, and 3D structure superimposition supported by color-coded sequence and structure conservation. Here, we are using DALI to mine the AlphaFold Database version 1, which increased the structural coverage of protein families by 20%. We found 100 remote homologous relationships hitherto unreported in the current reference database for protein domains, Pfam 35.0. In particular, we linked 35 domains of unknown function (DUFs) to the previously characterized families, generating a functional hypothesis that can be explored downstream in structural biology studies. Other findings include gene fusions, tandem duplications, and adjustments to domain boundaries. The evidence for homology can be browsed interactively through live examples on DALI's website.  相似文献   

13.
A natural way to study protein sequence, structure, and function is to put them in the context of evolution. Homologs inherit similarities from their common ancestor, while analogs converge to similar structures due to a limited number of energetically favorable ways to pack secondary structural elements. Using novel strategies, we previously assembled two reliable databases of homologs and analogs. In this study, we compare these two data sets and develop a support vector machine (SVM)-based classifier to discriminate between homologs and analogs. The classifier uses a number of well-known similarity scores. We observe that although both structure scores and sequence scores contribute to SVM performance, profile sequence scores computed based on structural alignments are the best discriminators between remote homologs and structural analogs. We apply our classifier to a representative set from the expert-constructed database, Structural Classification of Proteins (SCOP). The SVM classifier recovers 76% of the remote homologs defined as domains in the same SCOP superfamily but from different families. More importantly, we also detect and discuss interesting homologous relationships between SCOP domains from different superfamilies, folds, and even classes.  相似文献   

14.
The database of Phylogeny and ALIgnment of homologous protein structures (PALI) contains three-dimensional (3-D) structure-dependent sequence alignments as well as structure-based phylogenetic trees of protein domains in various families. The latest updated version (Release 2.1) comprises of 844 families of homologous proteins involving 3863 protein domain structures with each of these families having at least two members. Each member in a family has been structurally aligned with every other member in the same family using two proteins at a time. In addition, an alignment of multiple structures has also been performed using all the members in a family. Every family with at least three members is associated with two dendrograms, one based on a structural dissimilarity metric and the other based on similarity of topologically equivalenced residues for every pairwise alignment. Apart from these multi-member families, there are 817 single member families in the updated version of PALI. A new feature in the current release of PALI is the integration, with 3-D structural families, of sequences of homologues from the sequence databases. Alignments between homologous proteins of known 3-D structure and those without an experimentally derived structure are also provided for every family in the enhanced version of PALI. The database with several web interfaced utilities can be accessed at: http://pauling.mbu.iisc.ernet.in/~pali.  相似文献   

15.
MOTIVATION: In recent years, advances have been made in the ability of computational methods to discriminate between homologous and non-homologous proteins in the 'twilight zone' of sequence similarity, where the percent sequence identity is a poor indicator of homology. To make these predictions more valuable to the protein modeler, they must be accompanied by accurate alignments. Pairwise sequence alignments are inferences of orthologous relationships between sequence positions. Evolutionary distance is traditionally modeled using global amino acid substitution matrices. But real differences in the likelihood of substitutions may exist for different structural contexts within proteins, since structural context contributes to the selective pressure. RESULTS: HMMSUM (HMMSTR-based substitution matrices) is a new model for structural context-based amino acid substitution probabilities consisting of a set of 281 matrices, each for a different sequence-structure context. HMMSUM does not require the structure of the protein to be known. Instead, predictions of local structure are made using HMMSTR, a hidden Markov model for local structure. Alignments using the HMMSUM matrices compare favorably to alignments carried out using the BLOSUM matrices or structure-based substitution matrices SDM and HSDM when validated against remote homolog alignments from BAliBASE. HMMSUM has been implemented using local Dynamic Programming and with the Bayesian Adaptive alignment method.  相似文献   

16.
Protein structure modeling by homology requires an accurate sequence alignment between the query protein and its structural template. However, sequence alignment methods based on dynamic programming (DP) are typically unable to generate accurate alignments for remote sequence homologs, thus limiting the applicability of modeling methods. A central problem is that the alignment that is "optimal" in terms of the DP score does not necessarily correspond to the alignment that produces the most accurate structural model. That is, the correct alignment based on structural superposition will generally have a lower score than the optimal alignment obtained from sequence. Variations of the DP algorithm have been developed that generate alternative alignments that are "suboptimal" in terms of the DP score, but these still encounter difficulties in detecting the correct structural alignment. We present here a new alternative sequence alignment method that relies heavily on the structure of the template. By initially aligning the query sequence to individual fragments in secondary structure elements and combining high-scoring fragments that pass basic tests for "modelability", we can generate accurate alignments within a small ensemble. Our results suggest that the set of sequences that can currently be modeled by homology can be greatly extended.  相似文献   

17.
Protein functional annotation relies on the identification of accurate relationships, sequence divergence being a key factor. This is especially evident when distant protein relationships are demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach to purposefully bridge gaps between related protein families through directed design of protein-like “linker” sequences. For this, we represented SCOP domain families, integrated with sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and another one additionally containing designed sequences. Our results showed that augmented database searches showed up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the fold established the sequence continuum to demonstrate 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be “plugged-into” routine and generic sequence database searches to empower not only remote homology detection but also fold recognition. Our richly statistically supported findings show that complementary searches in both databases will increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold.  相似文献   

18.
A DNA/protein sequence comparison is a popular computational tool for molecular biologists. Finding a good alignment implies an evolutionary and/or functional relationship between proteins or genomic loci. Sequential similarity between two proteins indicates their structural resemblance, providing a practical approach for structural modeling, when structure of one of these proteins is known. The first step in the homology modeling is a construction of an accurate sequence alignment. The commonly used alignment algorithms do not provide an adequate treatment of the structurally mismatched residues in locally dissimilar regions. We propose a simple modification of the existing alignment algorithm which treats these regions properly and demonstrate how this modification improves sequence alignments in real proteins.  相似文献   

19.
Members of a superfamily of proteins could result from divergent evolution of homologues with insignificant similarity in the amino acid sequences. A superfamily relationship is detected commonly after the three-dimensional structures of the proteins are determined using X-ray analysis or NMR. The SUPFAM database described here relates two homologous protein families in a multiple sequence alignment database of either known or unknown structure. The present release (1.1), which is the first version of the SUPFAM database, has been derived by analysing Pfam, which is one of the commonly used databases of multiple sequence alignments of homologous proteins. The first step in establishing SUPFAM is to relate Pfam families with the families in PALI, which is an alignment database of homologous proteins of known structure that is derived largely from SCOP. The second step involves relating Pfam families which could not be associated reliably with a protein superfamily of known structure. The profile matching procedure, IMPALA, has been used in these steps. The first step resulted in identification of 1280 Pfam families (out of 2697, i.e. 47%) which are related, either by close homologous connection to a SCOP family or by distant relationship to a SCOP family, potentially forming new superfamily connections. Using the profiles of 1417 Pfam families with apparently no structural information, an all-against-all comparison involving a sequence-profile match using IMPALA resulted in clustering of 67 homologous protein families of Pfam into 28 potential new superfamilies. Expansion of groups of related proteins of yet unknown structural information, as proposed in SUPFAM, should help in identifying ‘priority proteins’ for structure determination in structural genomics initiatives to expand the coverage of structural information in the protein sequence space. For example, we could assign 858 distinct Pfam domains in 2203 of the gene products in the genome of Mycobacterium tubercolosis. Fifty-one of these Pfam families of unknown structure could be clustered into 17 potentially new superfamilies forming good targets for structural genomics. SUPFAM database can be accessed at http://pauling.mbu.iisc.ernet.in/~supfam.  相似文献   

20.
Lin HN  Notredame C  Chang JM  Sung TY  Hsu WL 《PloS one》2011,6(12):e27872
Most sequence alignment tools can successfully align protein sequences with higher levels of sequence identity. The accuracy of corresponding structure alignment, however, decreases rapidly when considering distantly related sequences (<20% identity). In this range of identity, alignments optimized so as to maximize sequence similarity are often inaccurate from a structural point of view. Over the last two decades, most multiple protein aligners have been optimized for their capacity to reproduce structure-based alignments while using sequence information. Methods currently available differ essentially in the similarity measurement between aligned residues using substitution matrices, Fourier transform, sophisticated profile-profile functions, or consistency-based approaches, more recently.In this paper, we present a flexible similarity measure for residue pairs to improve the quality of protein sequence alignment. Our approach, called SymAlign, relies on the identification of conserved words found across a sizeable fraction of the considered dataset, and supported by evolutionary analysis. These words are then used to define a position specific substitution matrix that better reflects the biological significance of local similarity. The experiment results show that the SymAlign scoring scheme can be incorporated within T-Coffee to improve sequence alignment accuracy. We also demonstrate that SymAlign is less sensitive to the presence of structurally non-similar proteins. In the analysis of the relationship between sequence identity and structure similarity, SymAlign can better differentiate structurally similar proteins from non- similar proteins. We show that protein sequence alignments can be significantly improved using a similarity estimation based on weighted n-grams. In our analysis of the alignments thus produced, sequence conservation becomes a better indicator of structural similarity. SymAlign also provides alignment visualization that can display sub-optimal alignments on dot-matrices. The visualization makes it easy to identify well-supported alternative alignments that may not have been identified by dynamic programming. SymAlign is available at http://bio-cluster.iis.sinica.edu.tw/SymAlign/.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号