Similar literature
 Found 20 similar articles; search took 31 ms
1.
Twilight zone of protein sequence alignments   Total citations: 38, self-citations: 0, citations by others: 38
Sequence alignments unambiguously distinguish between protein pairs of similar and non-similar structure when the pairwise sequence identity is high (>40% for long alignments). The signal gets blurred in the twilight zone of 20-35% sequence identity. Here, more than a million sequence alignments were analysed between protein pairs of known structures to re-define a line distinguishing between true and false positives for low levels of similarity. Four results stood out. (i) The transition from the safe zone of sequence alignment into the twilight zone is described by an explosion of false negatives. More than 95% of all pairs detected in the twilight zone had different structures. More precisely, above a cut-off roughly corresponding to 30% sequence identity, 90% of the pairs were homologous; below 25% less than 10% were. (ii) Whether or not sequence homology implied structural identity depended crucially on the alignment length. For example, if 10 residues were similar in an alignment of length 16 (>60%), structural similarity could not be inferred. (iii) The 'more similar than identical' rule (discarding all pairs for which percentage similarity was lower than percentage identity) reduced false positives significantly. (iv) Using intermediate sequences for finding links between more distant families was almost as successful: pairs were predicted to be homologous when the respective sequence families had proteins in common. All findings are applicable to automatic database searches.
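The quantities this abstract reasons about can be illustrated with a minimal Python sketch. It is not the paper's procedure: the function names are hypothetical, percentage identity is assumed to be counted over gap-free aligned columns, and the percentage similarity is taken as a given input because the abstract does not say how it is scored.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of identical residues over gap-free aligned columns (assumed definition)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b) if x != "-" and y != "-"]
    if not columns:
        return 0.0
    identical = sum(1 for x, y in columns if x == y)
    return 100.0 * identical / len(columns)


def passes_more_similar_than_identical(pct_identity: float, pct_similarity: float) -> bool:
    """Apply the 'more similar than identical' rule: discard a pair whose
    percentage similarity is lower than its percentage identity."""
    return pct_similarity >= pct_identity


# The abstract's short-alignment example: 10 matching residues over length 16.
short_alignment_pct = 100.0 * 10 / 16
print(f"{short_alignment_pct:.1f}%")  # 62.5%
```

The last lines show why alignment length matters: a 16-residue alignment can exceed 60% while, according to the abstract, still not supporting an inference of structural similarity.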

2.
In this article, we use the animal G-protein alpha subunit family as an example to illustrate a comprehensive analytical pipeline for detecting different types of functional divergence of protein families, which is phylogeny-dependent, combined with ancestral sequence inference and available protein structure information. In particular, we focus on (i) Type-I functional divergence, or site-specific rate shift, as typically exemplified by amino acid residues highly conserved in a subset of homologous genes but highly variable in a different subset of homologous genes, and (ii) Type-II functional divergence, or the shift of cluster-specific amino acid property, as exemplified by a radical shift of amino acid property between duplicate genes, which is otherwise evolutionarily conserved. We utilized the software DIVERGE2 to carry out these analyses. In the case of the G-protein alpha subunit gene family, we have predicted amino acid residues that are related to either Type-I or Type-II functional divergence. The inferred ancestral sequences for these sites are helpful to explore the trends of functional divergence. Finally, these predicted residues are mapped to the protein structures to test whether these residues may have 3D structure or solvent accessibility preference.

3.
The database of Phylogeny and ALIgnment of homologous protein structures (PALI) contains three-dimensional (3-D) structure-dependent sequence alignments as well as structure-based phylogenetic trees of protein domains in various families. The latest updated version (Release 2.1) comprises 844 families of homologous proteins involving 3863 protein domain structures with each of these families having at least two members. Each member in a family has been structurally aligned with every other member in the same family using two proteins at a time. In addition, an alignment of multiple structures has also been performed using all the members in a family. Every family with at least three members is associated with two dendrograms, one based on a structural dissimilarity metric and the other based on similarity of topologically equivalenced residues for every pairwise alignment. Apart from these multi-member families, there are 817 single member families in the updated version of PALI. A new feature in the current release of PALI is the integration, with 3-D structural families, of sequences of homologues from the sequence databases. Alignments between homologous proteins of known 3-D structure and those without an experimentally derived structure are also provided for every family in the enhanced version of PALI. The database with several web interfaced utilities can be accessed at: http://pauling.mbu.iisc.ernet.in/~pali.

4.
PALI (release 1.2) contains three-dimensional (3-D) structure-dependent sequence alignments as well as structure-based phylogenetic trees of homologous protein domains in various families. The data set of homologous protein structures has been derived by consulting the SCOP database (release 1.50) and the data set comprises 604 families of homologous proteins involving 2739 protein domain structures with each family made up of at least two members. Each member in a family has been structurally aligned with every other member in the same family (pairwise alignment) and all the members in the family are also aligned using simultaneous super-position (multiple alignment). The structural alignments are performed largely automatically, with manual interventions especially in the cases of distantly related proteins, using the program STAMP (version 4.2). Every family is also associated with two dendrograms, calculated using PHYLIP (version 3.5), one based on a structural dissimilarity metric defined for every pairwise alignment and the other based on similarity of topologically equivalent residues. These dendrograms enable easy comparison of sequence and structure-based relationships among the members in a family. Structure-based alignments with the details of structural and sequence similarities, superposed coordinate sets and dendrograms can be accessed conveniently using a web interface. The database can be queried for protein pairs with sequence or structural similarities falling within a specified range. Thus PALI forms a useful resource to help in analysing the relationship between sequence and structure variation at a given level of sequence similarity. PALI also contains over 653 'orphans' (single member families). Using the web interface involving PSI-BLAST and PHYLIP it is possible to associate the sequence of a new protein with one of the families in PALI and generate a phylogenetic tree combining the query sequence and proteins of known 3-D structure. The database with the web interfaced search and dendrogram generation tools can be accessed at http://pauling.mbu.iisc.ernet.in/~pali.

5.
Several studies based on the known three-dimensional (3-D) structures of proteins show that two homologous proteins with insignificant sequence similarity could adopt a common fold and may perform the same or similar biochemical functions. Hence, it is appropriate to use similarities in 3-D structure of proteins rather than the amino acid sequence similarities in modelling evolution of distantly related proteins. Here we present an assessment of using 3-D structures in modelling evolution of homologous proteins. Using a dataset of 108 protein domain families of known structures with at least 10 members per family we present a comparison of the extent of structural and sequence dissimilarities among pairs of proteins which are inputs into the construction of phylogenetic trees. We find that correlation between the structure-based dissimilarity measures and the sequence-based dissimilarity measures is usually good if the sequence similarity among the homologues is about 30% or more. For protein families with low sequence similarity among the members, the correlation coefficient between the sequence-based and the structure-based dissimilarities is poor. In these cases the structure-based dendrogram clusters proteins with most similar biochemical functional properties better than the sequence-similarity based dendrogram. In multi-domain protein families and disulphide-rich protein families the correlation coefficient for the match of sequence-based and structure-based dissimilarity (SDM) measures can be poor though the sequence identity could be higher than 30%. Hence it is suggested that protein evolution is best modelled using 3-D structures if the sequence similarities (SSM) of the homologues are very low.

6.
In order to study structural aspects of sequence conservation in families of homologous proteins, we have analyzed structurally aligned sequences of 585 proteins grouped into 128 homologous families. The conservation of a residue in a family is defined as the average residue similarity in a given position of aligned sequences. The residue similarities were expressed in the form of log-odds substitution tables that take into account the environments of amino acids in three-dimensional structures. The protein core is defined as those residues that have less than 7% solvent accessibility. The density of a protein core is described in terms of atom packing, which is investigated as a criterion for residue substitution and conservation. Although there is no significant correlation between sequence conservation and average atom packing around nonpolar residues such as leucine, valine and isoleucine, a significant correlation is observed for polar residues in the protein core. This may be explained by the hydrogen bonds in which polar residues are involved; the better their protection from water access the more stable should be the structure in that position. Proteins 33:358–366, 1998. © 1998 Wiley-Liss, Inc.
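The two definitions in this abstract (conservation as average residue similarity at an aligned position, and the core as residues under 7% solvent accessibility) can be sketched in Python as follows. The abstract's environment-specific substitution tables are not reproduced here: `toy_similarity` is a hypothetical stand-in that scores 1 for identical residues and 0 otherwise, and averaging over all residue pairs in a column is an assumption about how the average is taken.

```python
from itertools import combinations


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for a substitution-table lookup (1 if identical, else 0)."""
    return 1.0 if a == b else 0.0


def column_conservation(column: str, similarity=toy_similarity) -> float:
    """Average pairwise residue similarity at one aligned position, ignoring gaps."""
    residues = [r for r in column if r != "-"]
    pairs = list(combinations(residues, 2))
    if not pairs:
        return 0.0
    return sum(similarity(a, b) for a, b in pairs) / len(pairs)


def is_core(solvent_accessibility_pct: float, cutoff_pct: float = 7.0) -> bool:
    """Core residues are those with less than 7% solvent accessibility."""
    return solvent_accessibility_pct < cutoff_pct
```

Passing a real substitution table as the `similarity` argument would bring the sketch closer to the approach described, without changing its structure.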

7.
The 3D structural comparison of families of divergent homologous domains revealed two main populations of hydrophobic amino acids, one with a low and the other with a significantly higher mean solvent accessibility, allowing two regions of the core of protein globular domains to be distinguished. The side chains of hydrophobic amino acids in topologically conserved positions (positions in the structural alignment where only hydrophobic amino acids are found), which we call topohydrophobic positions, are considerably less dispersed than those of the other amino acids (hydrophobic or not). Mean distances between gravity centers of amino acids in topohydrophobic positions are significantly shorter than those for non-topohydrophobic positions and show that the corresponding amino acids are almost all in direct contact in the inner core of globular domains. This study also showed that the small number of topohydrophobic positions is a characteristic of the structural differences between proteins of a family. This criterion is independent of the sequence identity between the sequences and of the root-mean-square distance between their corresponding structures. Using sensitive sequence alignment processes it will be possible, for many protein families, to identify topohydrophobic positions from sequences only. Proteins 33:329–342, 1998. © 1998 Wiley-Liss, Inc.

8.
Thiamine diphosphate-dependent decarboxylases catalyze both cleavage and formation of C–C bonds in various reactions, which have been assigned to different homologous sequence families. This work compares 53 ThDP-dependent decarboxylases with known crystal structures. Both sequence and structural information were analyzed synergistically and data were analyzed for global and local properties by means of statistical approaches (principal component analysis and principal coordinate analysis) enabling complexity reduction. The different results obtained both locally and globally, that is, individual positions compared with the overall protein sequence or structure, revealed challenges in the assignment of separated homologous families. The methods applied herein support the comparison of enzyme families and the identification of functionally relevant positions. The findings for the family of ThDP-dependent decarboxylases underline that global sequence identity alone is not sufficient to distinguish enzyme function. Instead, local sequence similarity, defined by comparisons of structurally equivalent positions, allows for a better navigation within several groups of homologous enzymes. The differentiation between homologous sequences is further enhanced by taking structural information into account, such as BioGPS analysis of the active site properties or pairwise structural superimpositions. The methods applied herein are expected to be transferable to other enzyme families, to facilitate family assignments for homologous protein sequences.

9.
We investigated the conservation of sidechain conformation for each residue within a homologous family of proteins in the Protein Data Bank (PDB) and performed sidechain modeling using this information. The information was represented by the probability of conserved sidechain torsional angles obtained from many families of proteins, and these were calculated for a pair of residues at topologically equivalent positions as a result of structural alignment. Probabilities were obtained for a pair of same amino acids and for a pair of different amino acids. The correlation between environmental residues and the fluctuation of probability was examined for the pair of same amino acid residues, and the simple probability was calculated for the pair of different amino acids. From the results on the same amino acid pairs, 17 amino acids, except for Ala, Gly, and Pro, were divided into two types: those that were influenced and those that were not influenced by the environmental residues. From results on different amino acid pairs, a replacement between large residues, such as Trp, Phe, and Tyr, was performed assuming conservation of their torsional angles within a homologous family of proteins. We performed sidechain modeling for 11 known proteins from their native and modeled backbones, respectively. With the native backbones, the percentage of the χ1 angle correct within 30° was found to be 67% and 80% for all and core residues, respectively. With the modeled backbones, the percentage of the correct χ1 angle was found to be 60% and 72% for all and core residues, respectively. To estimate an upper limit on the accuracy for predicting sidechain conformations, we investigated the probability of conserved sidechain torsional angles for highly similar proteins having >90% sequence identity and <2.5-Å X-ray resolution. In those proteins, 83% of the sidechain conformations were conserved for the χ1 angle. Proteins 31:355–369, 1998. © 1998 Wiley-Liss, Inc.
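The "χ1 angle correct within 30°" measure reported above depends on comparing torsion angles on a circle, where -175° and 175° are only 10° apart. The following Python sketch shows one way to compute such a fraction. The function names are hypothetical, and treating a difference of exactly 30° as correct is an assumption, since the abstract does not state how the boundary is handled.

```python
def angle_difference(a_deg: float, b_deg: float) -> float:
    """Smallest absolute difference between two angles, in degrees (0 to 180)."""
    d = abs(a_deg - b_deg) % 360.0
    return min(d, 360.0 - d)


def fraction_correct(predicted, observed, tolerance_deg: float = 30.0) -> float:
    """Fraction of predicted torsion angles within the tolerance of the observed ones."""
    if len(predicted) != len(observed):
        raise ValueError("angle lists must have equal length")
    if not predicted:
        return 0.0
    hits = sum(
        1 for p, o in zip(predicted, observed)
        if angle_difference(p, o) <= tolerance_deg  # boundary inclusion is an assumption
    )
    return hits / len(predicted)
```

A plain subtraction without the wrap-around step would count a pair such as 180° and -170° as wrong even though the two rotamers are nearly identical.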

10.
The database PALI (Phylogeny and ALIgnment of homologous protein structures) consists of families of protein domains of known three-dimensional (3D) structure. In a PALI family, every member has been structurally aligned with every other member (pairwise) and also simultaneous superposition (multiple) of all the members has been performed. The database also contains 3D structure-based and structure-dependent sequence similarity-based phylogenetic dendrograms for all the families. The PALI release used in the present analysis comprises 225 families derived largely from the HOMSTRAD and SCOP databases. The quality of the multiple rigid-body structural alignments in PALI was compared with that obtained from COMPARER, which encodes a procedure based on properties and relationships. The alignments from the two procedures agreed very well and variations are seen only in the low sequence similarity cases often in the loop regions. A validation of Direct Pairwise Alignment (DPA) between two proteins is provided by comparing it with Pairwise alignment extracted from Multiple Alignment of all the members in the family (PMA). In general, DPA and PMA are found to vary rarely. The ready availability of pairwise alignments allows the analysis of variations in structural distances as a function of sequence similarities and number of topologically equivalent Cα atoms. The structural distance metric used in the analysis combines root mean square deviation (r.m.s.d.) and number of equivalences, and is shown to vary similarly to r.m.s.d. The correlation between sequence similarity and structural similarity is poor in pairs with low sequence similarities. A comparison of sequence and 3D structure-based phylogenies for all the families suggests that only a few families have a radical difference in the two kinds of dendrograms. The difference could occur when the sequence similarity among the homologues is low or when the structures are subjected to evolutionary pressure for the retention of function. The PALI database is expected to be useful in furthering our understanding of the relationship between sequences and structures of homologous proteins and their evolution.

11.
Li W, Liu Z, Lai L. Biopolymers 1999, 49(6): 481-495
A general problem in comparative modeling and protein design is the conformational evaluation of loops with a certain sequence in specific environmental protein frameworks. Loops of different sequences and structures on similar scaffolds are common in the Protein Data Bank (PDB). In order to explore both structural and sequential diversity of them, a data base of loops connecting similar secondary structure fragments is constructed by searching the data base of families of structurally similar proteins and PDB. A total of 84 loop families having 2-13 residues are found among the well-determined structures of resolution better than 2.5 Å. Eight alpha-alpha, 20 alpha-beta, 19 beta-alpha, and 37 beta-beta families are identified. Every family contains more than 5 loop motifs. In each family, no loops share the same sequence and all the frameworks are well superimposed. Forty-three new loop classes are distinguished in the data base. The structural variability of loops in homologous proteins is examined and shown in 44 families. Motif families are characterized with geometric parameters and sequence patterns. The conformations of loops in each family are clustered into subfamilies using the average linkage cluster analysis method. Information such as geometric properties, sequence profile, sequential and structural variability in loop, structural alignment parameters, sequence similarities, and clustering results are provided. Correlations between the conformation of loops and loop sequence, motif sequence, and global sequence of PDB chain are examined in order to find how loop structures depend on their sequences and how they are affected by the local and global environment. Strong correlations (R > 0.75) are only found in 24 families. The best R value is 0.98. The data base is available through the Internet.

12.
We investigate the conservation of amino acid residue sequences in 21 DNA-binding protein families and study the effects that mutations have on DNA-sequence recognition. The observations are best understood by assigning each protein family to one of three classes: (i) non-specific, where binding is independent of DNA sequence; (ii) highly specific, where binding is specific and all members of the family target the same DNA sequence; and (iii) multi-specific, where binding is also specific, but individual family members target different DNA sequences. Overall, protein residues in contact with the DNA are better conserved than the rest of the protein surface, but there is a complex underlying trend of conservation for individual residue positions. Amino acid residues that interact with the DNA backbone are well conserved across all protein families and provide a core of stabilising contacts for homologous protein-DNA complexes. In contrast, amino acid residues that interact with DNA bases have variable levels of conservation depending on the family classification. In non-specific families, base-contacting residues are well conserved and interactions are always found in the minor groove where there is little discrimination between base types. In highly specific families, base-contacting residues are highly conserved and allow member proteins to recognise the same target sequence. In multi-specific families, base-contacting residues undergo frequent mutations and enable different proteins to recognise distinct target sequences. Finally, we report that interactions with bases in the target sequence often follow (though not always) a universal code of amino acid-base recognition and the effects of amino acid mutations can be most easily understood for these interactions.

13.
Balaji S, Aruna S, Srinivasan N. Proteins 2003, 53(4): 783-791
Occurrence and accommodation of charged amino acid residues in proteins that are structurally equivalent to buried non-polar residues in homologues have been investigated. Using a dataset of 1,852 homologous pairs of crystal structures of proteins available at 2 Å or better resolution, 14,024 examples of apolar residues in the structurally conserved regions replaced by charged residues in homologues have been identified. Out of 2,530 cases of buried apolar residues, 1,677 of the equivalent charged residues in homologues are exposed and the rest of the charged residues are buried. These drastic substitutions are most often observed in homologous protein pairs with low sequence identity (<30%) and in large protein domains (>300 residues). Such buried charged residues in the large proteins are often located in the interface of sub-domains or in the interface of structural repeats. Beyond 7 Å of residue depth of buried apolar residues, or less than 4% of solvent accessibility, almost all the substituting charged residues are buried. It is also observed that acidic sidechains have higher preference to get buried than the positively charged residues. There is a preference for buried charged residues to get accommodated in the interior by forming hydrogen bonds with another sidechain than the main chain. The sidechains interacting with a buried charged residue are most often located in the structurally conserved regions of the alignment. About 50% of the observations involving hydrogen bond between buried charged sidechain and another sidechain correspond to salt bridges. Among the buried charged residues interacting with the main chain, positively charged sidechains form hydrogen bonds commonly with main chain carbonyls while the negatively charged residues are accommodated by hydrogen bonding with the main chain amides. These carbonyls and amides are usually located in the loops that are structurally variable among homologous proteins.

14.
It is observed that during divergent evolution of two proteins with a common phylogenetic origin, the structural similarity of their backbones is often preserved even when the sequence similarity between them decreases to a virtually undetectable level. Here we analyzed whether the conservation of structure along evolution involves also the local atomic structures in the interfaces between secondary structural elements. We have used as a study case one protein family, the proteasomal subunits, for which 17 crystal structures are known. These include 14 different subunits of Saccharomyces cerevisiae, 2 subunits of Thermoplasma acidophilum and one subunit of Escherichia coli. The structural core of the 17 proteasomal subunits has 23 secondary structural elements. Any two adjacent secondary structural elements form a molecular interface consisting of two molecular patches. We found 61 interfaces that occurred in all 17 subunits. The 3D shapes of equivalent molecular patches from different proteasomal subunits were compared by superposition. Our results demonstrate that pairs of equivalent molecular patches show an RMSD which is lower than that of randomly chosen patches from unrelated proteins. This is true even when patch comparisons with identical residues were excluded from the analysis. Furthermore it is known that the sequential dissimilarity is correlated to the RMSD between the backbones of the members of protein families. The question arises whether this is also true for local atomic structures. The results show that the correlation of individual patch RMSD values and local sequence dissimilarities is low and has a wide range from 0 to 0.41; however, it is surprising that there is a good correlation between the average RMSD of all corresponding patches and the global sequence dissimilarity. This average patch RMSD correlates slightly more strongly than the Cα-trace RMSD to the global sequence dissimilarity.
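The RMSD values compared in this abstract measure how far apart equivalent atoms lie after two patches have been superposed. A minimal Python sketch of that final step is given below. It assumes the two coordinate lists are already superposed and listed in matching atom order; the superposition itself is not implemented, and the function name is hypothetical.

```python
import math


def rmsd(coords_a, coords_b) -> float:
    """Root-mean-square deviation between two equally long lists of (x, y, z)
    coordinates that are assumed to be already superposed and atom-matched."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))
```

Averaging this value over all corresponding patches of a subunit pair would give the "average patch RMSD" that the abstract relates to global sequence dissimilarity.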

15.
We present a new method for predicting the secondary structure of globular proteins based on non-linear neural network models. Network models learn from existing protein structures how to predict the secondary structure of local sequences of amino acids. The average success rate of our method on a testing set of proteins non-homologous with the corresponding training set was 64.3% on three types of secondary structure (alpha-helix, beta-sheet, and coil), with correlation coefficients of Cα = 0.41, Cβ = 0.31 and Ccoil = 0.41. These quality indices are all higher than those of previous methods. The prediction accuracy for the first 25 residues of the N-terminal sequence was significantly better. We conclude from computational experiments on real and artificial structures that no method based solely on local information in the protein sequence is likely to produce significantly better results for non-homologous proteins. The performance of our method on homologous proteins is much better than for non-homologous proteins, but is not as good as simply assuming that homologous sequences have identical structures.

16.
Structural genomics projects are producing many three-dimensional structures of proteins that have been identified only from their gene sequences. It is therefore important to develop computational methods that will predict sites involved in productive intermolecular interactions that might give clues about functions. Techniques based on evolutionary conservation of amino acids have the advantage over physicochemical methods in that they are more general. However, the majority of techniques neither use all available structural and sequence information, nor are able to distinguish between evolutionary restraints that arise from the need to maintain structure and those that arise from function. Three methods to identify evolutionary restraints on protein sequence and structure are described here. The first identifies those residues that have a higher degree of conservation than expected: this is achieved by comparing for each amino acid position the sequence conservation observed in the homologous family of proteins with the degree of conservation predicted on the basis of amino acid type and local environment. The second uses information theory to identify those positions where environment-specific substitution tables make poor predictions of the overall amino acid substitution pattern. The third method identifies those residues that have highly conserved positions when three-dimensional structures of proteins in a homologous family are superposed. The scores derived from these methods are mapped onto the protein three-dimensional structures and contoured, allowing identification of clusters of residues with strong evolutionary restraints that are sites of interaction in proteins involved in a variety of functions. Our method differs from other published techniques by making use of structural information to identify restraints that arise from the structure of the protein and differentiating these restraints from others that derive from intermolecular interactions that mediate functions in the whole organism.

17.
The availability of fast and robust algorithms for protein structure comparison provides an opportunity to produce a database of three-dimensional comparisons, called families of structurally similar proteins (FSSP). The database currently contains an extended structural family for each of 154 representative (below 30% sequence identity) protein chains. Each data set contains: the search structure; all its relatives with 70-30% sequence identity, aligned structurally; and all other proteins from the representative set that contain substructures significantly similar to the search structure. Very close relatives (above 70% sequence identity) rarely have significant structural differences and are excluded. The alignments of remote relatives are the result of pairwise all-against-all structural comparisons in the set of 154 representative protein chains. The comparisons were carried out with each of three novel automatic algorithms that cover different aspects of protein structure similarity. The user of the database has the choice between strict rigid-body comparisons and comparisons that take into account interdomain motion or geometrical distortions; and, between comparisons that require strictly sequential ordering of segments and comparisons, which allow altered topology of loop connections or chain reversals. The data sets report the structurally equivalent residues in the form of a multiple alignment and as a list of matching fragments to facilitate inspection by three-dimensional graphics. If substructures are ignored, the result is a database of structure alignments of full-length proteins, including those in the twilight zone of sequence similarity. (ABSTRACT TRUNCATED AT 250 WORDS)
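The three sequence-identity bands that organise an FSSP data set can be written as a small Python sketch. The function name and tier labels are hypothetical, and assigning the exact 30% and 70% boundary values to the middle band is an assumption, since the abstract does not say how ties are resolved.

```python
def fssp_tier(sequence_identity_pct: float) -> str:
    """Assign a relative of the search structure to one of the three bands
    described in the abstract (labels are illustrative only)."""
    if sequence_identity_pct > 70.0:
        return "excluded"          # very close relative, rarely structurally different
    if sequence_identity_pct >= 30.0:
        return "aligned_relative"  # 70-30% identity, aligned structurally
    return "remote"                # below 30%, compared within the representative set
```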

18.
The Lipase Engineering Database (LED) (http://www.led.uni-stuttgart.de) integrates information on sequence, structure, and function of lipases, esterases, and related proteins. Sequence data on 806 protein entries are assigned to 38 homologous families, which are grouped into 16 superfamilies with no global sequence similarity between each other. For each family, multisequence alignments are provided with functionally relevant residues annotated. Pre-calculated phylogenetic trees allow navigation inside superfamilies. Experimental structures of 45 proteins are superposed and consistently annotated. The LED has been applied to systematically analyze sequence-structure-function relationships of this vast and diverse enzyme class. It is a useful tool to identify functionally relevant residues apart from the active site residues, and to design mutants with desired substrate specificity.

19.
Family profile analysis (FPA), described in this paper, compares all available homologous amino acid sequences of a target family with the profile of a probe family while conventional sequence profile analysis (Gribskov M, Lüthy R, Eisenberg D. Meth Enzymol 1990;183:146-159) considers only a single target sequence in comparison with the probe family. The increased input of sequence information in FPA expands the range for sequence-based recognition of structural relationships. In the FPA algorithm, Zscores of each of the target sequences, obtained from a probe profile search over all known amino acid sequences, are averaged and then compared with the scores for sequences of 100 reference families in the same probe family search. The resulting F-Zscore of the target family is expressed in "effective standard deviations" of the mean Zscores of the reference families; a value above a threshold of 3.5 indicates a statistically significant evolutionary relationship between the target and probe families. The sensitivity of FPA to sequence information was tested with several protein families where distant relationships have been verified from known tertiary protein architectures, which included vitamin B6-dependent enzymes, (beta/alpha)8-barrel proteins, beta-trefoil proteins, and globins. In comparison to other methods, FPA proved to be significantly more sensitive, finding numerous new homologies. The FPA technique is not only useful to test a suspected relationship between probe and target families but also identifies possible target families in profile searches over all known primary structures.
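The averaging-and-comparison step described in this abstract can be sketched in Python as below. This is an interpretation, not the published algorithm: the abstract does not define "effective standard deviations", so the sample standard deviation of the reference-family mean Zscores is used as a stand-in, and the function names are hypothetical.

```python
from statistics import mean, stdev


def f_zscore(target_zscores, reference_family_zscores) -> float:
    """Mean Zscore of the target family, expressed in standard deviations of the
    reference-family mean Zscores (sample standard deviation is an assumption)."""
    target_mean = mean(target_zscores)
    reference_means = [mean(family) for family in reference_family_zscores]
    spread = stdev(reference_means)  # needs at least two reference families
    if spread == 0:
        raise ValueError("reference family means have zero spread")
    return (target_mean - mean(reference_means)) / spread


def is_significant(score: float, threshold: float = 3.5) -> bool:
    """The abstract treats an F-Zscore above 3.5 as statistically significant."""
    return score > threshold
```

In the described procedure the reference set contains 100 families; the sketch accepts any number of at least two.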

20.
The solution structure of the 154-residue conserved hypothetical protein HI0004 has been determined using multidimensional heteronuclear NMR spectroscopy. HI0004 has sequence homologs in many organisms ranging from bacteria to humans and is believed to be essential in Haemophilus influenzae, although an exact function has yet to be defined. It has an alpha-beta-alpha sandwich architecture consisting of a central four-stranded beta-sheet with the alpha2-helix packed against one side of the beta-sheet and four alpha-helices (alpha1, alpha3, alpha4, alpha5) on the other side. There is structural homology with the eukaryotic matrix metalloproteases (MMPs), but little sequence similarity except for a conserved region containing three histidines that appears in both the MMPs and throughout the HI0004 family of proteins. The solution structure of HI0004 is compared with the X-ray structure of an Aquifex aeolicus homolog, AQ_1354, which has 36% sequence identity over 148 residues. Despite this level of sequence homology, significant differences exist between the two structures. These differences are described along with possible functional implications of the structures.
