Similar literature
Found 20 similar articles (search time: 15 ms)
1.
Molecular modeling of proteins is confronted with the problem of finding homologous proteins, especially when few identities remain after the process of molecular evolution. Even with the most recent methods based on sequence identity detection, structural relationships are still difficult to establish with high reliability. As protein structures are more conserved than sequences, we investigated the possibility of using protein secondary structure comparison (observed or predicted structures) to discriminate between related and unrelated protein sequences in the range of 10%-30% sequence identity. Pairwise comparison of secondary structures has been measured using the structural overlap (Sov) parameter. In this article, we show that if the secondary structure likeness is >50%, most of the pairs are structurally related. Taking into account the secondary structures of proteins that have been detected by BLAST, FASTA, or SSEARCH in the noisy region (with high E-value), we show that distantly related protein sequences (even with <20% identity) can still be identified. This strategy can be used to identify three-dimensional templates in homology modeling by finding unexpected related proteins and to select proteins for experimental investigation in a structural genomic approach, as well as for genome annotation.
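
The filtering idea in this abstract can be illustrated with a minimal sketch. It does not implement the Sov parameter itself: as a stand-in, it uses a much simpler per-residue agreement between two pre-aligned secondary-structure strings (H, E, C states, `-` for gaps). The function names and the way the 50% cutoff is applied to this simpler score are assumptions made for illustration only.

```python
def ss_agreement(ss_a: str, ss_b: str) -> float:
    """Percentage of aligned, non-gap positions with identical secondary-structure state.

    NOTE: this is a simplified per-residue score, not the segment-based Sov measure.
    """
    if len(ss_a) != len(ss_b):
        raise ValueError("aligned secondary-structure strings must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(ss_a, ss_b):
        if a == "-" or b == "-":
            continue  # skip alignment gaps
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def likely_related(ss_a: str, ss_b: str, cutoff: float = 50.0) -> bool:
    """Flag a pair as a candidate structural relative (hypothetical helper).

    The 50% cutoff mirrors the abstract's threshold but is applied here to the
    simplified score, which is an assumption.
    """
    return ss_agreement(ss_a, ss_b) > cutoff


if __name__ == "__main__":
    print(ss_agreement("HHHHEEEECC", "HHHHEEECCC"))
    print(likely_related("HHHHEEEECC", "HHHHEEECCC"))
```

A pair found in the high E-value region of a BLAST, FASTA, or SSEARCH search could be passed through such a filter before being considered as a modeling template.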

2.
3.
C Sander  R Schneider 《Proteins》1991,9(1):56-68
The database of known protein three-dimensional structures can be significantly increased by the use of sequence homology, based on the following observations. (1) The database of known sequences, currently at more than 12,000 proteins, is two orders of magnitude larger than the database of known structures. (2) The currently most powerful method of predicting protein structures is model building by homology. (3) Structural homology can be inferred from the level of sequence similarity. (4) The threshold of sequence similarity sufficient for structural homology depends strongly on the length of the alignment. Here, we first quantify the relation between sequence similarity, structure similarity, and alignment length by an exhaustive survey of alignments between proteins of known structure and report a homology threshold curve as a function of alignment length. We then produce a database of homology-derived secondary structure of proteins (HSSP) by aligning to each protein of known structure all sequences deemed homologous on the basis of the threshold curve. For each known protein structure, the derived database contains the aligned sequences, secondary structure, sequence variability, and sequence profile. Tertiary structures of the aligned sequences are implied, but not modeled explicitly. The database effectively increases the number of known protein structures by a factor of five to more than 1800. The results may be useful in assessing the structural significance of matches in sequence database searches, in deriving preferences and patterns for structure prediction, in elucidating the structural role of conserved residues, and in modeling three-dimensional detail by homology.
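
The length-dependent threshold described here can be sketched as a small function. The abstract does not give the formula; the power-law form and the coefficients below are the values commonly quoted for this curve in later literature and are included as an assumption that should be checked against the paper. The function names are hypothetical.

```python
def hssp_threshold(alignment_length: int) -> float:
    """Minimum percent identity suggesting structural homology at a given alignment length.

    Assumed form (not stated in the abstract): t(L) = 290.15 * L**-0.562 for
    alignment lengths of 10 to 80 residues, and a flat 24.8% for longer alignments.
    """
    if alignment_length < 10:
        raise ValueError("the curve is not defined for alignments shorter than 10 residues")
    if alignment_length > 80:
        return 24.8
    return 290.15 * alignment_length ** -0.562


def above_threshold(percent_identity: float, alignment_length: int) -> bool:
    """True when an alignment lies above the assumed homology threshold curve."""
    return percent_identity >= hssp_threshold(alignment_length)


if __name__ == "__main__":
    for length in (20, 40, 80, 150):
        print(length, round(hssp_threshold(length), 1))
```

The point of the curve is that a short alignment needs a much higher identity than a long one before structural homology can be inferred.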

4.
Structural genomics projects as well as ab initio protein structure prediction methods provide structures of proteins with no sequence or fold similarity to proteins with known functions. These are often low-resolution structures that may only include the positions of C alpha atoms. We present a fast and efficient method to predict DNA-binding proteins from just the amino acid sequences and low-resolution, C alpha-only protein models. The method uses the relative proportions of certain amino acids in the protein sequence, the asymmetry of the spatial distribution of certain other amino acids, and the dipole moment of the molecule. These quantities are used in a linear formula, with coefficients derived from logistic regression performed on a training set, and DNA-binding is predicted based on whether the result is above a certain threshold. We show that the method is insensitive to errors in the atomic coordinates and provides correct predictions even on inaccurate protein models. We demonstrate that the method is capable of predicting proteins with novel binding site motifs and structures solved in an unbound state. The accuracy of our method is close to that of another published method that uses all-atom structures, time-consuming calculations, and information on conserved residues.
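
The decision rule described in this abstract, a linear combination of features compared against a threshold, can be sketched as follows. The feature names, weights, bias, and threshold are hypothetical placeholders; the abstract does not report the fitted coefficients.

```python
from typing import Dict


def linear_score(features: Dict[str, float], weights: Dict[str, float], bias: float) -> float:
    """Weighted sum of feature values plus a bias term (the 'linear formula')."""
    return bias + sum(weights[name] * features[name] for name in weights)


def predict_dna_binding(features: Dict[str, float], weights: Dict[str, float],
                        bias: float, threshold: float = 0.0) -> bool:
    """Predict DNA binding when the linear score exceeds the threshold."""
    return linear_score(features, weights, bias) > threshold


if __name__ == "__main__":
    # Hypothetical weights standing in for coefficients fitted by logistic regression.
    weights = {"arg_fraction": 12.0, "lys_fraction": 8.0,
               "asymmetry": 1.5, "dipole_moment": 0.002}
    # Hypothetical feature values for one protein model.
    features = {"arg_fraction": 0.08, "lys_fraction": 0.09,
                "asymmetry": 0.6, "dipole_moment": 450.0}
    print(predict_dna_binding(features, weights, bias=-3.0))
```

In practice the weights and bias would come from fitting a logistic regression on a labeled training set, and the threshold would be tuned on held-out data.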

5.
Zhu M  Li M 《Molecular bioSystems》2012,8(6):1686-1693
G-protein coupled receptors (GPCRs) are recognized as the largest family of membrane proteins. Because far fewer crystal structures than amino acid sequences are available, homology modeling offers a reasonable and feasible route to theoretical GPCR coordinates. With new crystal structures recently resolved, we examined how to select them as templates for homology modeling in four respects: (1) various sequence alignment methods; (2) protein weight matrix; (3) different sets of multiple templates; (4) active and inactive states of templates. The accuracy of the models was evaluated by comparing the conformational similarity and molecular docking results between the models and the experimental structure of Meleagris gallopavo β1-adrenergic receptor (Mg_Adrb1), which we used as an example. Our results suggest that: (1) Cobalt and MAFFT, two sequence alignment algorithms, are suitable for single- and multiple-template modeling, respectively; (2) Blosum30 is suitable for aligning sequences with low sequence identity; (3) multiple-template modeling is not always better than single-template modeling; (4) the state of the template is also an influential factor in modeling GPCR structures.

6.
Due to advances in molecular biology, the DNA sequences of structural genes coding for proteins are often known before a protein is characterized or even isolated. The function of a protein whose amino acid sequence has been deduced from a DNA sequence may not even be known. This has created greater interest in the development of methods to predict the tertiary structures of proteins. The a priori prediction of a protein's structure from its amino acid sequence is not yet possible. However, since proteins with similar amino acid sequences are observed to have similar three-dimensional structures, it is possible to use an analogy with a protein of known structure to draw some conclusions about the structure and properties of an uncharacterized protein. The process of predicting the tertiary structure of a protein relies very much upon computer modeling and analysis of the structure. The prediction of the structure of the bacteriophage 434 cro repressor is used as an example illustrating current procedures.

7.
Proteins with similar structures are generally assumed to arise from similar sequences. However, there are more cases than not where this is not true. The dogma is that sequence determines structure; how, then, can very different sequences fold to the same structure? Here, we employ high temperature unfolding simulations to probe the pathways and specific interactions that direct the folding and unfolding of the SH3 domain. The SH3 metafold in the Dynameomics Database consists of 753 proteins with the same structure, but varied sequences and functions. To investigate the relationship between sequence and structure, we selected 17 targets from the SH3 metafold with high sequence variability. Six unfolding simulations were performed for each target and transition states were identified, revealing two general folding/unfolding pathways at the transition state. Transition states were also expressed as mathematical graphs of connected chemical nodes, and it was found that three positions within the structure, independent of sequence, were consistently more connected within the graph than any other nearby positions in the sequence. These positions represent a hub connecting different portions of the structure. Multiple sequence alignment and covariation analyses also revealed certain positions that were more conserved due to packing constraints and stabilizing long‐range contacts. This study demonstrates that members of the SH3 domain with different sequences can unfold through two main pathways, but certain characteristics are conserved regardless of the sequence or unfolding pathway. While sequence determines structure, we show that disparate sequences can provide similar interactions that influence folding and lead to similar structures.

8.
In spite of the tremendous increase in the rate at which protein structures are being determined, there is still an enormous gap between the numbers of known DNA-derived sequences and the numbers of three-dimensional structures. In order to shed light on the biological functions of the molecules, researchers often resort to comparative molecular modeling. Earlier work has shown that when the sequence alignment is in error, the comparative model is guaranteed to be wrong. In addition, loops, the sites of insertions and deletions in families of homologous proteins, are exceedingly difficult to model. Thus, many of the current problems in comparative molecular modeling are minor versions of the global protein folding problem. In order to assess objectively the current state of comparative molecular modeling, 13 groups submitted blind predictions of seven different proteins of undisclosed tertiary structure. This assessment shows that where sequence identity between the target and the template structure is high (>70%), comparative molecular modeling is highly successful. On the other hand, automated modeling techniques and sophisticated energy minimization methods fail to improve upon the starting structures when the sequence identity is low (~30%). Based on these results it appears that insertions and deletions are still major problems. Successfully deducing the correct sequence alignment when the local similarity is low is still difficult. We suggest some minimal testing of submitted coordinates that should be required of authors before papers on comparative molecular modeling are accepted for publication in journals. © 1995 Wiley-Liss, Inc.

9.
Kosloff M  Kolodny R 《Proteins》2008,71(2):891-902
It is often assumed that in the Protein Data Bank (PDB), two proteins with similar sequences will also have similar structures. Accordingly, it has proved useful to develop subsets of the PDB from which "redundant" structures have been removed, based on a sequence-based criterion for similarity. Similarly, when predicting protein structure using homology modeling, if a template structure for modeling a target sequence is selected by sequence alone, this implicitly assumes that all sequence-similar templates are equivalent. Here, we show that this assumption is often not correct and that standard approaches to create subsets of the PDB can lead to the loss of structurally and functionally important information. We have carried out sequence-based structural superpositions and geometry-based structural alignments of a large number of protein pairs to determine the extent to which sequence similarity ensures structural similarity. We find many examples where two proteins that are similar in sequence have structures that differ significantly from one another. The source of the structural differences usually has a functional basis. The number of such protein pairs that are identified and the magnitude of the dissimilarity depend on the approach that is used to calculate the differences; in particular, sequence-based structure superpositioning will identify a larger number of structurally dissimilar pairs than geometry-based structural alignments. When two sequences can be aligned in a statistically meaningful way, sequence-based structural superpositioning provides a meaningful measure of structural differences. This approach and geometry-based structure alignments reveal somewhat different information and one or the other might be preferable in a given application. Our results suggest that in some cases, notably homology modeling, the common use of nonredundant datasets, culled from the PDB based on sequence, may mask important structural and functional information. We have established a database of sequence-similar, structurally dissimilar protein pairs that will help address this problem (http://luna.bioc.columbia.edu/rachel/seqsimstrdiff.htm).

10.
Dolan MA  Keil M  Baker DS 《Proteins》2008,72(4):1243-1258
Although the number of known protein structures is increasing, the number of protein sequences without determined structures is still much larger. Three-dimensional (3D) protein structure information helps in the understanding of functional mechanisms, but solving structures by X-ray crystallography or NMR is often a lengthy and difficult process. A relatively fast way of determining a protein's 3D structure is to construct a computer model using homologous sequence and structure information. Much work has gone into algorithms that comprise the ORCHESTRAR homology modeling program in the SYBYL software package. This novel homology modeling tool combines algorithms for modeling conserved cores, variable regions, and side chains. The paradigm of using existing knowledge from multiple templates and the underlying protein environment knowledgebase is used in all of these algorithms, and will become even more powerful as the number of experimentally derived protein structures increases. To determine how ORCHESTRAR compares to Composer (a widely used but older tool), homology models of 18 proteins were constructed using each program so that a detailed comparison of each step in the modeling process could be carried out. Proteins modeled include kinases, dihydrofolate reductase, HIV protease, and factor Xa. In almost all cases ORCHESTRAR produces models with lower root-mean-squared deviation (RMSD) values when compared with structures determined by X-ray crystallography or NMR. Moreover, ORCHESTRAR produced a homology model for three target sequences where Composer failed to produce any. Data for RMSD comparisons between structurally conserved cores, structurally variable regions, and side-chain conformations are presented, as well as analyses of active site and protein-protein interface configurations.

11.
Proteins have been classified into families based upon sequence homology. An accurate, systematic comparative model-building procedure for a homologous family of proteins would be very valuable scientifically. This paper presents such a procedure and applies it to the mammalian serine proteases, which are ubiquitous and involved in many important biological functions. Eleven proteins of this family are considered here, including a variety of blood serum, intestinal and pancreatic proteins as well as a closely related bacterial enzyme. The modeling method capitalizes upon the availability of three experimentally determined structures for mammalian serine proteases. These structures show that the molecule is divided into structurally conserved regions, which contain the strong sequence homology, and structurally variable regions, which include all the additions and deletions. We show that by applying this structural distinction to new sequences, erroneous alignments of the sequences are greatly minimized. For each aligned new sequence, the structurally conserved regions can be constructed from any of the known structures. In examining the variable regions, we have found that a variable region that has the same length and residue character in two different known structures usually has the same conformation in both. Thus, when the eight structurally unknown proteins are modeled, most of the variable regions can be constructed directly from the known structures. A minority of the variable regions require more sophisticated analysis to evaluate the relative merits of a small number of possible conformations. Only a very few are so different that modeling by homology is entirely ruled out. We demonstrate, therefore, that by this modeling procedure, the maximum of each of these mammalian serine proteases is constructed directly from the experimentally determined structures and the necessity to build from intuition or from energy considerations is greatly reduced.

12.
It is commonly believed that similarities between the sequences of two proteins imply similarities between their structures. Sequence alignments reliably recognize pairs of proteins of similar structures provided that the percentage sequence identity between their two sequences is sufficiently high. This distinction, however, is statistically less reliable when the percentage sequence identity is lower than 30% and little is known then about the detailed relationship between the two measures of similarity. Here, we investigate the inverse correlation between structural similarity and sequence similarity on 12 protein structure families. We define the structure similarity between two proteins as the cRMS distance between their structures. The sequence similarity for a pair of proteins is measured as the mean distance between the sequences in the subsets of sequence space compatible with their structures. We obtain an approximation of the sequence space compatible with a protein by designing a collection of protein sequences both stable and specific to the structure of that protein. Using these measures of sequence and structure similarities, we find that structural changes within a protein family are linearly related to changes in sequence similarity.
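
The structural measure named here, a coordinate root-mean-square (cRMS) distance, can be sketched in a few lines. This minimal version assumes the two coordinate sets are already optimally superposed and paired residue by residue; the superposition step (for example with the Kabsch algorithm) is deliberately left out, and the function name is hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def crms(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Root-mean-square distance between paired coordinates.

    Assumes the two structures are already superposed and that
    coords_a[i] and coords_b[i] refer to equivalent atoms.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


if __name__ == "__main__":
    a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
    b = [(0.0, 1.0, 0.0), (3.8, 1.0, 0.0)]
    print(crms(a, b))
```

Without the superposition step the value depends on the frame in which each structure is stored, so this function is only meaningful after the structures have been aligned.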

13.
The information required to generate a protein structure is contained in its amino acid sequence, but how three-dimensional information is mapped onto a linear sequence is still incompletely understood. Multiple structure alignments of similar protein structures have been used to investigate conserved sequence features but contradictory results have been obtained, due, in large part, to the absence of objective criteria to be used in the construction of sequence profiles and in the quantitative comparison of alignment results. Here, we report a new procedure for multiple structure alignment and use it to construct structure-based sequence profiles for similar proteins. The definition of "similar" is based on the structural alignment procedure and on the protein structural distance (PSD) described in paper I of this series, which offers an objective measure for protein structure relationships. Our approach is tested in two well-studied groups of proteins: serine proteases and Ig-like proteins. It is demonstrated that the quality of a sequence profile generated by a multiple structure alignment is quite sensitive to the PSD used as a threshold for the inclusion of proteins in the alignment. Specifically, if the proteins included in the aligned set are too distant in structure from one another, there will be a dilution of information and patterns that are relevant to a subset of the proteins are likely to be lost. In order to understand better how the same three-dimensional information can be encoded in seemingly unrelated sequences, structure-based sequence profiles are constructed for subsets of proteins belonging to nine superfolds. We identify patterns of relatively conserved residues in each subset of proteins. It is demonstrated that the most conserved residues are generally located in the regions where tertiary interactions occur and that are relatively conserved in structure. Nevertheless, the conservation patterns are relatively weak in all cases studied, indicating that structure-determining factors that do not require a particular sequential arrangement of amino acids, such as secondary structure propensities and hydrophobic interactions, are important in encoding protein fold information. In general, we find that similar structures can fold without having a set of highly conserved residue clusters or a well-conserved sequence profile; indeed, in some cases there is no apparent conservation pattern common to structures with the same fold. Thus, when a group of proteins exhibits a common and well-defined sequence pattern, it is more likely that these sequences have a close evolutionary relationship rather than the similarities having arisen from the structural requirements of a given fold.

14.
Finding structural similarities between proteins often helps reveal shared functionality, which otherwise might not be detected by native sequence information alone. Such similarity is usually detected and quantified by protein structure alignment. Determining the optimal alignment between two protein structures, however, remains a hard problem. An alternative approach is to approximate each three-dimensional protein structure using a sequence of motifs derived from a structural alphabet. Using this approach, structure comparison is performed by comparing the corresponding motif sequences or structural sequences. In this article, we measure the performance of such alphabets in the context of the protein structure classification problem. We consider both local and global structural sequences. Each letter of a local structural sequence corresponds to the best matching fragment to the corresponding local segment of the protein structure. The global structural sequence is designed to generate the best possible complete chain that matches the full protein structure. We use an alphabet of 20 letters, corresponding to a library of 20 motifs or protein fragments having four residues. We show that the global structural sequences approximate well the native structures of proteins, with an average coordinate root mean square of 0.69 Å over 2225 test proteins. The approximation is best for all α-proteins, while relatively poorer for all β-proteins. We then test the performance of four different sequence representations of proteins (their native sequence, the sequence of their secondary-structure elements, and the local and global structural sequences based on our fragment library) with different classifiers in their ability to classify proteins that belong to five distinct folds of CATH. Unsurprisingly, the primary sequence alone performs poorly as a structure classifier. We show that addition of either secondary-structure information or local information from the structural sequence considerably improves the classification accuracy. The two fragment-based sequences perform better than the secondary-structure sequence but not well enough at this stage to be a viable alternative to more computationally intensive methods based on protein structure alignment.
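
The local structural sequence described above, one letter per four-residue segment chosen as the best matching library fragment, can be sketched roughly as follows. This is not the authors' procedure: instead of superposing fragments, the sketch compares three internal Cα distances, which are unaffected by rotation and translation. The two-letter library and its distance values are hypothetical, illustrative numbers, whereas the paper uses a 20-letter fragment library.

```python
import math
from typing import Dict, Sequence, Tuple

Point = Tuple[float, float, float]
Descriptor = Tuple[float, float, float]


def descriptor(fragment: Sequence[Point]) -> Descriptor:
    """Internal distances (1-3, 2-4, 1-4) of a four-residue C-alpha fragment."""
    a, b, c, d = fragment
    return (math.dist(a, c), math.dist(b, d), math.dist(a, d))


def encode_local(ca_coords: Sequence[Point], library: Dict[str, Descriptor]) -> str:
    """Assign to each overlapping 4-residue window the letter of the nearest library entry."""
    letters = []
    for i in range(len(ca_coords) - 3):
        desc = descriptor(ca_coords[i:i + 4])
        best = min(library, key=lambda letter: math.dist(desc, library[letter]))
        letters.append(best)
    return "".join(letters)


# Hypothetical two-letter library: rough helix-like and strand-like distances in angstroms.
TOY_LIBRARY: Dict[str, Descriptor] = {
    "H": (5.4, 5.4, 5.0),
    "E": (6.7, 6.7, 10.0),
}

if __name__ == "__main__":
    # An artificial fully extended chain with 3.3 angstrom steps along the x axis.
    chain = [(3.3 * i, 0.0, 0.0) for i in range(6)]
    print(encode_local(chain, TOY_LIBRARY))
```

The resulting strings can then be compared with ordinary sequence tools, which is the appeal of a structural alphabet; the global variant described in the abstract, which optimizes the fit of the whole rebuilt chain, is not covered by this sketch.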

15.
Internal symmetry is commonly observed in the majority of fundamental protein folds. Meanwhile, sufficient evidence suggests that nascent polypeptide chains of proteins have the potential to start the co-translational folding process and this process allows mRNA to contain additional information on protein structure. In this paper, we study the relationship between gene sequences and protein structures from the viewpoint of symmetry to explore how gene sequences code for structural symmetry in proteins. We found that, for a set of two-fold symmetric proteins from the left-handed beta-helix fold, intragenic symmetry always exists in their corresponding gene sequences. Meanwhile, codon usage bias and local mRNA structure might be involved in modulating translation speed for the formation of structural symmetry: a major decrease of local codon usage bias in the middle of the codon sequence can be identified as a common feature; and major or consecutive decreases in local mRNA folding energy near the boundaries of the symmetric substructures can also be observed. The results suggest that gene duplication and fusion may be an evolutionarily conserved process for this protein fold. In addition, the usage of rare codons and the formation of higher-order secondary structure near the boundaries of symmetric substructures might have coevolved as conserved mechanisms to slow down translation elongation and to facilitate effective folding of symmetric substructures. These findings provide valuable insights into our understanding of the mechanisms of translation and its evolution, as well as the design of proteins via symmetric modules.

16.
PAS domains are widespread in archaea, bacteria, and eukaryota, and play important roles in various functions. In this study, we aim to explore the functional evolutionary relationships among proteins in the PAS domain superfamily in view of the sequence‐structure‐dynamics‐function relationship. We collected protein sequences and crystal structure data from the RCSB Protein Data Bank for members of the PAS domain superfamily belonging to three biological functions (nucleotide binding, photoreceptor activity, and transferase activity). Protein sequences were aligned and then used to select sequence‐conserved residues and build a phylogenetic tree. Three‐dimensional structure alignment was also applied to obtain structure‐conserved residues. The protein dynamics were analyzed using an elastic network model (ENM) and validated by molecular dynamics (MD) simulation. The results showed that proteins with the same function could be grouped by sequence similarity, and proteins in different functional groups displayed statistically significant differences in their vibrational patterns. Interestingly, in all three functional groups, conserved amino acid residues identified by sequence and structure conservation analysis generally have a lower fluctuation than other residues. In addition, the fluctuation of conserved residues in each biological function group was strongly correlated with the corresponding biological function. This research suggests a direct connection in which the protein sequences are related to various functions through structural dynamics. This is a new attempt to delineate the functional evolution of proteins using the integrated information of sequence, structure, and dynamics.

17.
The structure of Pyrococcus furiosus carboxypeptidase (PfuCP) has been determined to 2.2 Å resolution using multiwavelength anomalous diffraction (MAD) methods. PfuCP represents the first structure of the new M32 family of carboxypeptidases. The overall structure is a homodimer. Each subunit is mostly helical with its most pronounced feature being a deep substrate binding groove. The active site lies at the bottom of this groove and contains an HEXXH motif that coordinates the metal ion required for catalysis. Surprisingly, the structure is similar to the recently reported rat neurolysin. Comparison of these structures as well as sequence analyses with other homologous proteins reveal several conserved residues. The roles for these conserved residues in the catalytic mechanism are inferred based on modeling and their location.

18.
Cathepsin L is a cysteine protease which degrades connective tissue proteins including collagen, elastin, and fibronectin. In this study, five well-characterized cathepsin L proteins from different arthropods were used as query sequences for the Drosophila genome database. The search yielded 10 cathepsin L-like sequences, of which eight putatively represent novel cathepsin L-like proteins. To understand the phylogenetic relationship among these cathepsin L-like proteins, a phylogenetic tree was constructed based on their sequences. In addition, models of the tertiary structures of cathepsin L were constructed using homology modeling methods and subjected to molecular dynamics simulations to obtain reasonable structures and to understand their dynamical behavior. Our findings demonstrate that all of the potential Drosophila cathepsin L-like proteins contain at least one cathepsin propeptide inhibitor domain. Multiple sequence alignment and homology models clearly highlight the conservation of active site residues, disulfide bonds, and amino acid residues critical for inhibitor binding. Furthermore, comparative modeling indicates that the sequence/structure/function profiles and active site architectures are conserved.

19.
The NMR structure of the conserved hypothetical protein TM0487 from Thermotoga maritima represents an alpha/beta-topology formed by the regular secondary structures alpha1-beta1-beta2-alpha2-beta3-beta4-alpha3-beta5-3(10)-alpha4, with a small anti-parallel beta-sheet of beta-strands 1 and 2, and a mixed parallel/anti-parallel beta-sheet of beta-strands 3-5. Similar folds have previously been observed in other proteins, with amino acid sequence identity as low as 3% and a variety of different functions. There are also 216 sequence homologs of TM0487, which all have the signature sequence of domains of unknown function 59 (DUF59), for which no three-dimensional structures have as yet been reported. The TM0487 structure thus presents a platform for homology modeling of this large group of DUF59 proteins. Conserved among most of the DUF59s are 13 hydrophobic residues, which are clustered in the core of TM0487. A putative active site of TM0487 consisting of residues D20, E22, L23, T51, T52, and C55 is conserved in 98 of the 216 DUF59 sequences. Asp20 is buried within the proposed active site without any compensating positive charge, which suggests that its pK(a) value may be perturbed. Furthermore, the DUF59 family includes ORFs that are part of a conserved chromosomal group of proteins predicted to be involved in Fe-S cluster metabolism.

20.
Plant family 1 UDP-dependent glycosyltransferases (UGTs) catalyze the glycosylation of a plethora of bioactive natural products. In Arabidopsis thaliana, 120 UGT encoding genes have been identified. The crystal-based 3D structures of four plant UGTs have recently been published. Despite low sequence conservation, the UGTs show a highly conserved secondary and tertiary structure. The sugar acceptor and sugar donor substrates of UGTs are accommodated in the cleft formed between the N- and C-terminal domains. Several regions of the primary sequence contribute to the formation of the substrate binding pocket including structurally conserved domains as well as loop regions differing both with respect to their amino acid sequence and sequence length. In this review we provide a detailed analysis of the available plant UGT crystal structures to reveal structural features determining substrate specificity. The high 3D structural conservation of the plant UGTs renders homology modeling an attractive tool for structure elucidation. The accuracy and utility of UGT structures obtained by homology modeling are discussed and quantitative assessments of model quality are performed by modeling of a plant UGT for which the 3D crystal structure is known. We conclude that homology modeling offers a high degree of accuracy. Shortcomings in homology modeling are also apparent, with modeling of loop regions remaining a particularly difficult task.
