Similar literature
1.
We identified key residues from the structural alignment of families of protein domains from SCOP, which we represented in the form of sparse protein signatures. A signature-generating algorithm (SigGen) was developed and used to automatically identify key residues based on several structural and sequence-based criteria. The capacity of the signatures to detect related sequences from the SWISSPROT database was assessed by receiver operating characteristic (ROC) analysis and jack-knife testing. Test signatures for families from each of the main SCOP classes are described in relation to the quality of the structural alignments, the SigGen parameters used, and their diagnostic performance. We show that automatically generated signatures are highly diagnostic for their family (ROC50 scores typically >0.8), consistently outperform random signatures, and can identify sequence relationships in the "twilight zone" of protein sequence similarity (<40%). Signatures based on 15%-30% of alignment positions occurred most frequently among the best-performing signatures. When alignment quality is poor, sparser signatures perform better, whereas signatures generated from higher-quality alignments of fewer structures require more positions to be diagnostic. Our validation of signatures from the Globin family shows that when sequences from the structural alignment are removed and new signatures generated, the omitted sequences are still detected. The positions highlighted by the signature often correspond (alignment specificity >0.7) to the key positions in the original (non-jack-knifed) alignment. We discuss potential applications of sparse signatures in sequence annotation and homology modeling.
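
The ROC50 score quoted above can be illustrated with a minimal sketch. It assumes the common ROC-n convention (area under the ROC curve truncated at the first n false positives, normalised by n times the number of true members); the abstract does not state which exact variant SigGen was evaluated with, and the function name `roc_n` is hypothetical.

```python
def roc_n(ranked_labels, n=50):
    """Truncated ROC score for a ranked hit list.

    ranked_labels: booleans sorted by decreasing score, True = true family member.
    Assumed convention: sum, over the first n false positives, of the number of
    true positives ranked above each one, divided by n * (number of true members).
    """
    total_true = sum(ranked_labels)
    if total_true == 0:
        return 0.0
    true_seen = 0
    false_seen = 0
    area = 0
    for is_true in ranked_labels:
        if is_true:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    return area / (n * total_true)


# Toy ranking (hypothetical data): three true members, two false positives.
print(roc_n([True, True, False, True, False], n=2))
```

If the hit list contains fewer than n false positives, this normalisation underestimates the score, so in practice the list should be long enough to reach the n-th false positive.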

2.
A novel algorithm has been developed for scoring the match between an imprecise sparse signature and all the protein sequences in a sequence database. The method was applied to a specific problem. Signatures were derived in two ways for three small globular proteins: (i) from the probable folding nucleus and positions obtained from the determined interactions that occur during folding, and (ii) from points of inter-element contact and sequence comparison of the actual three-dimensional structures of the same three proteins. In the case of two of these, lysozyme and myoglobin, the residues in the folding nucleus corresponded well to the key residues spotted by examination of the structures; in the remaining case, barnase, they did not. The diagnostic performance of the two types of signatures was compared for all three proteins. The significance of this for the application of an understanding of protein folding mechanisms to structure prediction is discussed. The algorithm is generic and could be applied to other user-defined problems of sequence analysis.

3.
MOTIVATION: Previous work had established that it was possible to derive sparse signatures (essentially sequence-length motifs) by examining points of contact between residues in proteins of known three-dimensional (3D) structure. Many interesting protein families have very little tertiary structural information. Methods for deriving signatures using only primary and secondary-structural information were therefore developed. RESULTS: Two methods for deriving protein signatures using protein sequence information and predicted secondary structures are described. One method is based on a scoring approach, the other on the Genetic Algorithm (GA). The effectiveness of the method was tested on the superfamily of GPCRs and compared with the established hidden Markov model (HMM) method. The signature method is shown to perform well, detecting 68% of superfamily members before the first false positive sequence and detecting several distant relationships. The GA population was used to provide information on alignment regions of particular importance for selection of key residues.

4.
Several stratagems are used in protein bioinformatics for the classification of proteins based on sequence, structure or function. We explore the concept of a minimal signature embedded in a sequence that defines the likely position of a protein in a classification. Specifically, we address the derivation of sparse profiles for the G-protein coupled receptor (GPCR) clan of integral membrane proteins. We present an evolutionary algorithm (EA) for the derivation of sparse profiles (signatures) without the need to supply a multiple alignment. We also apply an evolution strategy (ES) to the problem of pattern and profile refinement. Patterns were derived for the GPCR 'superfamily' and GPCR families 1-3 individually from starting populations of randomly generated signatures, using a database of integral membrane protein sequences and an objective function using a modified receiver operating characteristic (ROC) statistic. The signature derived for the family 1 GPCR sequences was shown to perform very well in a stringent cross-validation test, detecting 76% of unseen GPCR sequences at 5% error. Application of the ES refinement method to a signature developed by a previously described method [Sadowski, M.I., Parish, J.H., 2003. Automated generation and refinement of protein signatures: case study with G-protein coupled receptors. Bioinformatics 19, 727-734] resulted in a 6% increase of coverage for 5% error as measured in the validation test. We note that there might be a limit to this or any classification of proteins based on patterns or schemata.

5.
The integrase family of site-specific recombinases catalyzes conservative rearrangements between defined segments of DNA. A highly conserved tetrad (RHRY) of catalytic residues is essential for this process. This tetrad is dispersed in two motifs in the linear sequence, but is configured appropriately in the catalytic pocket to execute the strand cleavage and rejoining reactions. A third conserved motif has been identified in the Xer subgroup of the integrase family. Mutational analysis of 12 conserved residues in this motif in the XerD protein from Salmonella typhimurium led to the identification of an essential fifth catalytic residue (lysine 172) which is implicated in strand cleavage or exchange. This lysine residue occupies part of the turn of an antiparallel beta-hairpin which forms one side of the catalytic cleft in XerD, and is found at similar positions among evolutionarily diverse integrase family members. Related antiparallel beta-hairpins are present in eukaryotic type IB topoisomerase enzymes, which also contain a critical lysine residue in the turn of the hairpin. In both the integrase family and eukaryotic type IB topoisomerases, the catalytic lysine residues are in close contact with the substrates and may play similar roles in influencing the reactivity of the phosphotyrosine intermediates formed during reactions catalyzed by both enzymes.

6.

Background  

Proteins with similar functions from different sources can be identified by the occurrence in their sequences of a conserved cluster of amino acids referred to as a pattern, motif, signature or fingerprint. The wide use of protein sequence analysis, on par with the growth of databases, underlines the importance of using patterns or signatures to retrieve related sequences. Blue copper proteins are found in the electron transport chain of prokaryotes and eukaryotes. The signatures already existing in the databases, such as type 1 copper blue, multiple copper oxidase, cyt b/b6, photosystem 1 psaA&B, psaG&K, and Rieske iron-sulphur protein, are not specific signatures for blue copper proteins, as the names themselves suggest. Most profile and motif databases strive to classify protein sequences into a broad spectrum of protein families. This work describes signatures designed on the basis of the copper-binding motifs in blue copper proteins. The common feature in all blue copper proteins is a trigonal planar arrangement of two nitrogen ligands [each from histidine] and one sulphur-containing thiolate ligand [from cysteine], with strong interactions between the copper center and these ligands.

7.
Cytochrome P450 monooxygenases (P450s) are heme-thiolate proteins distributed across the biological kingdoms. P450s are catalytically versatile and play key roles in organisms' primary and secondary metabolism. Identification of P450s across the biological kingdoms depends largely on the identification of two P450 signature motifs, EXXR and CXG, in the protein sequence. Once a putative protein has been identified as a P450, it is assigned to a family and subfamily based on the criteria that P450s within a family share more than 40% homology and members of subfamilies share more than 55% homology. However, to date, no evidence has been presented that can distinguish members of a P450 family. Here, for the first time, we report the identification of EXXR- and CXG-motif-based amino acid patterns that are characteristic of the P450 family. Analysis of P450 signature motifs in the under-explored fungal P450s from four different phyla (ascomycota, basidiomycota, zygomycota and chytridiomycota) indicated that the EXXR motif is highly variable and the CXG motif is somewhat variable. The amino acids threonine and leucine are preferred as second and third amino acids in the EXXR motif, and proline and glycine are preferred as second and third amino acids in the CXG motif in fungal P450s. Analysis of 67 P450 families from biological kingdoms such as plants, animals, bacteria and fungi showed conservation of a set of amino acid patterns characteristic of a particular P450 family in the EXXR and CXG motifs. This suggests that during the divergence of P450 families from a common ancestor these amino acid patterns evolved and were retained in each P450 family as a signature of that family. The role of amino acid patterns characteristic of a P450 family in the structural and/or functional aspects of members of the P450 family is a topic for future research.
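
As a rough illustration of the motif-based screening step described above, the following sketch scans a sequence for EXXR and CXG occurrences with regular expressions. It only locates the two motifs named in the abstract; it is not the authors' pipeline, and the function name `find_motifs` and the toy sequence are hypothetical.

```python
import re

# Lookahead patterns so that overlapping motif occurrences are all reported.
# "." stands for X (any residue).
EXXR = re.compile(r"(?=(E..R))")
CXG = re.compile(r"(?=(C.G))")


def find_motifs(seq):
    """Return (motif name, 0-based start, matched text) for each EXXR and CXG hit."""
    hits = []
    for name, pattern in (("EXXR", EXXR), ("CXG", CXG)):
        for match in pattern.finditer(seq):
            hits.append((name, match.start(), match.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


# Hypothetical toy sequence, not a real P450.
print(find_motifs("MAETLRLLPFGCPGAA"))
# [('EXXR', 2, 'ETLR'), ('CXG', 11, 'CPG')]
```

Both patterns are short and match by chance in many unrelated proteins, so hits of this kind are candidates for further checks rather than evidence of P450 membership.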

8.
This report describes the application of a simple computational tool, AAPAIR.TAB, for the systematic analysis of the cysteine-rich EGF, Sushi, and Laminin motif/sequence families at the two-amino acid level. Automated dipeptide frequency/bias analysis detects preferences in the distribution of amino acids in established protein families, by determining which "ordered dipeptides" occur most frequently in comprehensive motif-specific sequence data sets. Graphic display of the dipeptide frequency/bias data revealed family-specific preferences for certain dipeptides, but more importantly detected a shared preference for employment of the ordered dipeptides Gly-Tyr (GY) and Gly-Phe (GF) in all three protein families. The dipeptide Asn-Gly (NG) also exhibited high-frequency and bias in the EGF and Sushi motif families, whereas Asn-Thr (NT) was distinguished in the Laminin family. Evaluation of the distribution of dipeptides identified by frequency/bias analysis subsequently revealed the highly restricted localization of the G(F/Y) and N(G/T) sequence elements at two separate sites of extreme conservation in the consensus sequence of all three sequence families. The similar employment of the high-frequency/bias dipeptides in three distinct protein sequence families was further correlated with the concurrence of these shared molecular determinants at similar positions within the distinctive scaffolds of three structurally divergent, but similarly employed, motif modules.
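
The ordered-dipeptide counting described above can be sketched as follows. The abstract does not define how AAPAIR.TAB computes "bias", so the observed/expected ratio used here (expected counts derived from single-residue frequencies) is an assumption, and both function names are hypothetical.

```python
from collections import Counter


def dipeptide_counts(sequences):
    """Count ordered dipeptides (overlapping adjacent residue pairs) across sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    return counts


def dipeptide_bias(sequences):
    """Observed/expected ratio per dipeptide (assumed definition of bias).

    Expected counts assume the two positions are independent and follow the
    pooled single-residue frequencies of the data set.
    """
    residues = Counter("".join(sequences))
    total_residues = sum(residues.values())
    pairs = dipeptide_counts(sequences)
    total_pairs = sum(pairs.values())
    bias = {}
    for dipeptide, observed in pairs.items():
        p_first = residues[dipeptide[0]] / total_residues
        p_second = residues[dipeptide[1]] / total_residues
        bias[dipeptide] = observed / (p_first * p_second * total_pairs)
    return bias


# Hypothetical toy fragments, not real EGF, Sushi or Laminin sequences.
toy = ["CGYC", "NGY"]
print(dipeptide_counts(toy).most_common(1))  # [('GY', 2)]
print(dipeptide_bias(toy)["GY"])
```

A ratio above 1 marks a dipeptide that occurs more often than its residue composition alone would predict under this assumed definition.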

9.
The nucleobase-ascorbate transporter (NAT) signature motif is a conserved sequence motif of the ubiquitous NAT/NCS2 family implicated in defining the function and selectivity of the purine translocation pathway in the major fungal homolog UapA. To analyze the role of the NAT motif more systematically, we employed Cys-scanning mutagenesis of the Escherichia coli xanthine-specific homolog YgfO. Using a functional mutant devoid of Cys residues (C-less), each amino acid residue in sequence (315)GSIPITTFAQNNGVIQMTGVASRYVG(340) (motif underlined) was replaced individually with Cys. Of the 26 single-Cys mutants, 16 accumulate xanthine to ≥50% of the steady state observed with C-less YgfO, 4 accumulate to low levels (10-25% of C-less), F322C, N325C, and N326C accumulate marginally (5-8% of C-less), and P318C, Q324C, and G340C are inactive. When transferred to wild type, F322C(wt) and N326C(wt) are highly active, but P318G(wt), Q324C(wt), N325C(wt), and G340C(wt) are inactive, and G340A(wt) displays low activity. Immunoblot analysis shows that replacements at Pro-318 or Gly-340 are associated with low or negligible expression in the membrane. More extensive mutagenesis reveals that Gln-324 is critical for high-affinity uptake and ligand recognition, and Asn-325 is irreplaceable for active xanthine transport, whereas Thr-332 and Gly-333 are important determinants of ligand specificity. All single-Cys mutants react with N-ethylmaleimide, but with regard to sensitivity to inactivation they fall into three regions: positions 315-322 are insensitive to N-ethylmaleimide, with IC50 values ≥0.4 mM; positions 323-329 are highly sensitive, with IC50 values of 15-80 µM; and sensitivity of positions 330-340 follows a periodicity, with mutants sensitive to inactivation clustering on one face of an alpha-helix.

10.
Serine/threonine protein kinases of the Ste20p/PAK family are highly conserved from yeast to man. These protein kinases have been implicated in the signaling from heterotrimeric G proteins to mitogen-activated protein (MAP) kinase cascades and to cytoskeletal components such as myosin-I. In the yeast Saccharomyces cerevisiae, Ste20p is involved in transmitting the mating-pheromone signal from the betagamma-subunits of a heterotrimeric G protein to a downstream MAP kinase cascade. We have previously shown that binding of the G-protein beta-subunit (Gbeta) to a short binding site in the non-catalytic carboxy-terminal region of Ste20p is essential for transmitting the pheromone signal. In this study, we searched protein sequence databases for sequences that are similar to the Gbeta binding site in Ste20p. We identified a sequence motif with the consensus sequence S S L phi P L I/V x phi phi beta (x: any residue; phi: A, I, L, S, or T; beta: basic residues) that is solely present in members of the Ste20p/PAK family of protein kinases. We propose that this sequence motif, which we have designated the GBB (Gbeta binding) motif, is specifically responsible for binding of Gbeta to Ste20p/PAK protein kinases in response to activation of heterotrimeric G protein coupled receptors. Thus, the GBB motif is a novel type of signaling domain that serves to link protein kinases of the Ste20p/PAK family to G protein coupled receptors.
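
The GBB consensus given above translates directly into a regular expression. In this sketch the class for phi follows the abstract (A, I, L, S or T), while treating "basic residues" as K, R and H is an assumption; the function name `find_gbb` and the toy sequences are hypothetical.

```python
import re

PHI = "[AILST]"  # phi, as defined in the abstract
BASIC = "[KRH]"  # beta; assumption: K, R and H count as basic residues

# Consensus: S S L phi P L I/V x phi phi beta
GBB = re.compile(f"SSL{PHI}PL[IV].{PHI}{PHI}{BASIC}")


def find_gbb(seq):
    """Return (0-based start, matched text) for each GBB consensus match."""
    return [(match.start(), match.group()) for match in GBB.finditer(seq)]


# Hypothetical toy sequences, not taken from Ste20p or any PAK kinase.
print(find_gbb("MQSSLAPLIDSTKE"))  # [(2, 'SSLAPLIDSTK')]
print(find_gbb("MQSSLAPLIDSTEE"))  # []
```

A strict pattern like this reports only exact consensus matches; a database search of the kind described in the abstract would normally tolerate some mismatches.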

11.
Here, we present an approach for the prediction of binding preferences of members of a large protein family for which structural information for a number of family members bound to a substrate is available. The approach involves a number of steps. First, an accurate multiple alignment of sequences of all members of a protein family is constructed on the basis of a multiple structural superposition of family members with known structure. Second, the methods of continuum electrostatics are used to characterize the energetic contribution of each residue in a protein to the binding of its substrate. Residues that make a significant contribution are mapped onto the protein sequence and are used to define a "binding site signature" for the complex being considered. Third, sequences whose structures have not been determined are checked to see if they have binding-site signatures similar to one of the known complexes. Predictions of binding affinity to a given substrate are based on similarities in binding-site signature. An important component of the approach is the introduction of a context-specific substitution matrix suitable for comparison of binding-site residues. The methods are applied to the prediction of phosphopeptide selectivity of SH2 domains. To this end, the energetic roles of all protein residues in 17 different complexes of SH2 domains with their cognate targets are analyzed. The total number of residues that make significant contributions to binding is found to vary from nine to 19 in different complexes. These energetically important residues are found to contribute to binding through a variety of mechanisms, involving both electrostatic and hydrophobic interactions. Binding-site signatures are found to involve residues in different positions in SH2 sequences, some of them as far as 9 Å away from a bound peptide. Surprisingly, similarities in the signatures of different domains do not correlate with whole-domain sequence identities unless the latter is greater than 50%. An extensive comparison with the optimal binding motifs determined by peptide library experiments, as well as other experimental data, indicates that the similarity in binding preferences of different SH2 domains can be deduced on the basis of their binding-site signatures. The analysis provides a rationale for the empirically derived classification of SH2 domains described by Songyang & Cantley, in that proteins in the same group are found to have similar residues at positions important for binding. Confident predictions of binding preference can be made for about 85% of SH2 domain sequences found in SWISSPROT. The approach described in this work is quite general and can, in principle, be used to analyze binding preferences of members of large protein families for which structural information for a number of family members is available. It also offers a strategy for predicting cross-reactivity of compounds designed to bind to a particular target, for example in structure-based drug design.

12.
MbeA is a 60 kDa protein encoded by plasmid ColE1. It plays a key role in conjugative mobilization. MbeA*, a slightly truncated version of MbeA, was purified for in vitro analysis. MbeA* catalysed DNA cleavage and strand-transfer reactions using oligonucleotides embracing the ColE1 nic site, which was mapped to 5'-(1469)CTGG/CTTA(1462)-3'. Thus MbeA is the relaxase for ColE1 conjugal mobilization, in spite of the fact that it lacks the three-histidine motif considered the invariant signature of conjugative relaxases. Amino acid sequence comparisons suggest MbeA is nevertheless related to the common relaxase protein family. For instance, MbeA residue Y19 could correspond to the invariant tyrosine in Motif I, whereas H97, E104 and N106 may constitute the equivalent residues to the histidine triad in Motif III. This hypothesis was tested by site-directed mutagenesis. MbeA amino acid residues Y19, H97, E104 and N106 were changed to alanine. MbeA mutant N106A showed reduced oligonucleotide cleavage and strand-transfer activities, whereas mutation of the other three residues resulted in proteins without detectable activity, suggesting they are directly implicated in catalysis of DNA-cleavage and strand-transfer reactions. A double substitution of E104 and N106 by histidines, therefore reconstituting the canonical histidine triad, restored relaxase activities to 1% of wild type. Thus, MbeA is a variant of the common relaxase theme with a HEN signature motif, which has to be added to the canonical three-histidine motif of previously reported relaxases.

13.
14.
SPLASH: structural pattern localization analysis by sequential histograms   Cited by: 6 (self-citations: 0, citations by others: 6)
MOTIVATION: The discovery of sparse amino acid patterns that match repeatedly in a set of protein sequences is an important problem in computational biology. Statistically significant patterns, that is, patterns that occur more frequently than expected, may identify regions that have been preserved by evolution and which may therefore play a key functional or structural role. Sparseness can be important because a handful of non-contiguous residues may play a key role, while others, in between, may be changed without significant loss of function or structure. Similar arguments may be applied to conserved DNA patterns. Available sparse pattern discovery algorithms are either inefficient or impose limitations on the type of patterns that can be discovered. RESULTS: This paper introduces a deterministic pattern discovery algorithm, called Splash, which can find sparse amino or nucleic acid patterns matching identically or similarly in a set of protein or DNA sequences. Sparse patterns of any length, up to the size of the input sequence, can be discovered without significant loss in performance. Splash is extremely efficient and embarrassingly parallel by nature. Large databases, such as a complete genome or the non-redundant SWISS-PROT database, can be processed in a few hours on a typical workstation. Alternatively, a protein family or superfamily with low overall homology can be analyzed to discover common functional or structural signatures. Some examples of biologically interesting motifs discovered by Splash are reported for the histone I and G-Protein Coupled Receptor families. Due to its efficiency, Splash can be used to systematically and exhaustively identify conserved regions in protein family sets. These can then be used to build accurate and sensitive PSSM or HMM models for sequence analysis. AVAILABILITY: Splash is available to non-commercial research centers upon request, conditional on the signing of a test field agreement.
CONTACT: acal@us.ibm.com, Splash main page http://www.research.ibm.com/splash

15.
The representation of protein structures as small-world networks facilitates the search for topological determinants, which may relate to functionally important residues. Here, we aimed to investigate the performance of residue centrality, viewed as a family fold characteristic, in identifying functionally important residues in protein families. Our study is based on 46 families, including 29 enzyme and 17 non-enzyme families. A total of 80% of these central positions corresponded to active site residues or residues in direct contact with these sites. For enzyme families, this percentage increased to 91%, while for non-enzyme families the percentage decreased substantially to 48%. A total of 70% of these central positions are located in catalytic sites in the enzyme families, 64% are in hetero-atom binding sites in those families binding hetero-atoms, and only 16% belong to protein-protein interfaces in families with protein-protein interaction data. These differences reflect the active site shape: enzyme active sites locate in surface clefts, hetero-atom binding residues are in deep cavities, while protein-protein interactions involve a more planar configuration. On the other hand, not all surface cavities or clefts are comprised of central residues. Thus, closeness centrality identifies functionally important residues in enzymes. While here we focus on binding sites, we expect to identify key residues for the integration and transmission of the information to the rest of the protein, reflecting the relationship between fold and function. Residue centrality is more conserved than the protein sequence, emphasizing the robustness of protein structures.
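
Closeness centrality on a residue contact network, the measure named above, can be sketched in a few lines. The abstract does not say how contacts were defined or how centrality was normalised, so the unweighted edge list and the (reachable nodes - 1) / (sum of shortest-path lengths) formula are assumptions, as is the function name `closeness_centrality`.

```python
from collections import deque


def closeness_centrality(edges):
    """Closeness centrality for every node of an undirected, unweighted graph.

    edges: iterable of (residue_a, residue_b) contact pairs.
    Score = (number of reachable nodes - 1) / (sum of shortest-path lengths).
    """
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    scores = {}
    for source in graph:
        # Breadth-first search gives shortest path lengths from the source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbour in graph[node]:
                if neighbour not in dist:
                    dist[neighbour] = dist[node] + 1
                    queue.append(neighbour)
        total = sum(dist.values())
        scores[source] = (len(dist) - 1) / total if total else 0.0
    return scores


# Hypothetical toy contact map: residue 1 touches residues 2, 3 and 4.
print(closeness_centrality([(1, 2), (1, 3), (1, 4)]))
```

In the toy star graph the hub residue scores 1.0 and each peripheral residue 0.6, which mirrors the idea that central positions are a short path away from the rest of the fold.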

16.

Background  

Profile-based analysis of multiple sequence alignments (MSA) allows for accurate comparison of protein families. Here, we address the problems of detecting statistically confident dissimilarities (1) between an MSA position and a set of predicted residue frequencies, and (2) between two MSA positions. These problems are important for (i) evaluation and optimization of methods predicting residue occurrence at protein positions; (ii) detection of potentially misaligned regions in automatically produced alignments and their further refinement; and (iii) detection of sites that determine functional or structural specificity in two related families.

17.
We investigate the conservation of amino acid residue sequences in 21 DNA-binding protein families and study the effects that mutations have on DNA-sequence recognition. The observations are best understood by assigning each protein family to one of three classes: (i) non-specific, where binding is independent of DNA sequence; (ii) highly specific, where binding is specific and all members of the family target the same DNA sequence; and (iii) multi-specific, where binding is also specific, but individual family members target different DNA sequences. Overall, protein residues in contact with the DNA are better conserved than the rest of the protein surface, but there is a complex underlying trend of conservation for individual residue positions. Amino acid residues that interact with the DNA backbone are well conserved across all protein families and provide a core of stabilising contacts for homologous protein-DNA complexes. In contrast, amino acid residues that interact with DNA bases have variable levels of conservation depending on the family classification. In non-specific families, base-contacting residues are well conserved and interactions are always found in the minor groove where there is little discrimination between base types. In highly specific families, base-contacting residues are highly conserved and allow member proteins to recognise the same target sequence. In multi-specific families, base-contacting residues undergo frequent mutations and enable different proteins to recognise distinct target sequences. Finally, we report that interactions with bases in the target sequence often, though not always, follow a universal code of amino acid-base recognition, and the effects of amino acid mutations can be most easily understood for these interactions.

18.
The maintenance of protein function and structure constrains the evolution of amino acid sequences. This fact can be exploited to interpret correlated mutations observed in a sequence family as an indication of probable physical contact in three dimensions. Here we present a simple and general method to analyze correlations in mutational behavior between different positions in a multiple sequence alignment. We then use these correlations to predict contact maps for each of 11 protein families and compare the result with the contacts determined by crystallography. For the most strongly correlated residue pairs predicted to be in contact, the prediction accuracy ranges from 37 to 68% and the improvement ratio relative to a random prediction from 1.4 to 5.1. Predicted contact maps can be used as input for the calculation of protein tertiary structure, either from sequence information alone or in combination with experimental information. © 1994 John Wiley & Sons, Inc.

19.
Two families of deaminases, one specific for cytidine, the other for deoxycytidylate, are shown to possess a novel zinc-binding motif, here designated ZBS. We have (1) identified the protein members of these 2 families, (2) carried out sequence analyses that allow specification of this zinc-binding motif, and (3) determined signature sequences that will allow identification of additional members of these families as their sequences become available.

20.
Detecting homology of distantly related proteins with consensus sequences   Cited by: 15 (self-citations: 0, citations by others: 15)
A simple protocol is described that is suitable for the detection of distantly related members of a protein family. In this procedure, similarity to a consensus sequence is used to distinguish chance similarity from similarity due to common ancestry. The consensus sequence is constructed from the sequences of established members of a protein family and it incorporates features characteristic of the protein fold of this family: conserved residues, the pattern of variable and conserved segments, preferred location of gaps, etc. The database is searched with the consensus sequence, using the unitary matrix or log odds matrix for scoring the alignments, with variable gap penalty. The advantage of the method is that it weights key residues, ignores sequence similarity in variable segments (thus partially eliminating "background noise" coming from chance similarity), and distinguishes gaps disrupting conserved segments from those occurring in positions known to be tolerant of gap events. The utility of the method was demonstrated in the case of the protein family homologous with the internal repeats of complement B as well as the internal repeats identified in fibroblast proteoglycan PG40. The consensus sequence method succeeded in finding some new members of these protein families that could not be detected by earlier methods of sequence comparison.
