Similar articles
Found 20 similar articles (search time: 15 ms)
1.
Abstract

Current methods for comparative analysis of protein sequences are 1D alignments of amino acid sequences based on the maximization of amino acid identity (homology) and the prediction of secondary structure elements. This approach has a major drawback once the amino acid identity drops below 20–25%, since maximization of a homology score does not take into account any structural information. A new technique called Hydrophobic Cluster Analysis (HCA) has been developed by Lemesle-Varloot et al. (Biochimie 72, 555–574, 1990). It consists of comparing several sequences simultaneously and combining homology detection with secondary structure analysis.

HCA is primarily based on the detection and comparison of structural segments constituting the hydrophobic core of globular protein domains, with or without transmembrane domains. We have applied HCA to the analysis of different families of G-protein coupled receptors, such as catecholamine receptors as well as peptide hormone receptors. Using HCA, the thrombin receptor, a new and as yet unique member of the family of G-protein coupled receptors, can be clearly classified as being closely related to the family of neuropeptide receptors rather than to the catecholamine receptors, for which the shape of the hydrophobic clusters and the length of the third cytoplasmic loop are very different. Furthermore, the potential of HCA to predict relationships between new putative and already characterized members of this family of receptors will be presented.
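
The idea of a hydrophobic cluster can be illustrated with a deliberately simplified, one-dimensional sketch. HCA itself works on a two-dimensional helical-net drawing of the sequence, which is not reproduced here. The hydrophobic alphabet (VILFMYW), the rule that a cluster is closed by four or more consecutive non-hydrophobic residues, and the treatment of proline as a cluster breaker are assumptions taken from common descriptions of the method, not from the abstract above; the function name is hypothetical.

```python
# Simplified 1D approximation of hydrophobic cluster detection.
# ASSUMPTIONS (not stated in the abstract): hydrophobic alphabet VILFMYW,
# clusters are separated by >= 4 non-hydrophobic residues or by a proline.
HYDROPHOBIC = set("VILFMYW")


def hydrophobic_clusters(seq, min_gap=4):
    """Return 0-based inclusive (start, end) spans of hydrophobic clusters."""
    clusters = []
    start = None  # first hydrophobic residue of the cluster being built
    last = None   # most recent hydrophobic residue seen in that cluster
    for i, aa in enumerate(seq.upper()):
        if aa == "P":
            # Proline closes any open cluster.
            if start is not None:
                clusters.append((start, last))
                start = last = None
            continue
        if aa in HYDROPHOBIC:
            if start is None:
                start = i
            elif i - last - 1 >= min_gap:
                # Gap of non-hydrophobic residues is long enough: new cluster.
                clusters.append((start, last))
                start = i
            last = i
    if start is not None:
        clusters.append((start, last))
    return clusters
```

Comparing the number, length, and spacing of such spans between two receptor sequences gives a rough feel for the kind of signal HCA compares, although the shape information of the 2D plot is lost in this reduction.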

2.
The 3D structural comparison of families of divergent homologous domains revealed two main populations of hydrophobic amino acids, one with a low and the other with a significantly higher mean solvent accessibility, allowing two regions of the core of protein globular domains to be distinguished. The side chains of hydrophobic amino acids in topologically conserved positions (positions in the structural alignment where only hydrophobic amino acids are found), which we call topohydrophobic positions, are considerably less dispersed than those of the other amino acids (hydrophobic or not). Mean distances between gravity centers of amino acids in topohydrophobic positions are significantly shorter than those for non-topohydrophobic positions and show that the corresponding amino acids are almost all in direct contact in the inner core of globular domains. This study also showed that the small number of topohydrophobic positions is a characteristic of the structural differences between proteins of a family. This criterion is independent of the sequence identity between the sequences and of the root-mean-square distance between their corresponding structures. Using sensitive sequence alignment processes it will be possible, for many protein families, to identify topohydrophobic positions from sequences only. Proteins 33:329–342, 1998. © 1998 Wiley-Liss, Inc.

3.
Structural genomic projects envision almost routine protein structure determinations, which are currently imaginable only for small proteins with molecular weights below 25,000 Da. For larger proteins, structural insight can be obtained by breaking them into small segments of amino acid sequences that can fold into native structures, even when isolated from the rest of the protein. Such segments are autonomously folding units (AFU) and have sizes suitable for fast structural analyses. Here, we propose to expand an intuitive procedure often employed for identifying biologically important domains to an automatic method for detecting putative folded protein fragments. The procedure is based on the recognition that large proteins can be regarded as a combination of independent domains conserved among diverse organisms. We thus have developed a program that reorganizes the output of BLAST searches and detects regions with a large number of similar sequences. To automate the detection process, it is reduced to a simple geometrical problem of recognizing rectangular shaped elevations in a graph that plots the number of similar sequences at each residue of a query sequence. We used our program to quantitatively corroborate the premise that segments with conserved sequences correspond to domains that fold into native structures. We applied our program to a test data set composed of 99 amino acid sequences containing 150 segments with structures listed in the Protein Data Bank, and thus known to fold into native structures. Overall, the fragments identified by our program have an almost 50% probability of forming a native structure, and comparable results are observed with sequences containing domain linkers classified in SCOP. Furthermore, we verified that our program identifies AFU in libraries from various organisms, and we found a significant number of AFU candidates for structural analysis, covering an estimated 5 to 20% of the genomic databases. 
Altogether, these results argue that methods based on sequence similarity can be useful for dissecting large proteins into small autonomously folding domains, and such methods may provide efficient support to structural genomics projects.
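
The geometric step described in this abstract (plot the number of similar sequences at each residue of the query, then look for rectangular elevations) can be sketched as follows. This is a minimal illustration, not the authors' program: the hit intervals are assumed to have been parsed from BLAST output already, and the function names and the `min_hits` and `min_len` thresholds are hypothetical.

```python
def coverage_profile(query_len, hits):
    """Count, for each query residue, how many hits cover it.

    hits: iterable of (start, end) query coordinates, 1-based and inclusive
    (assumed to be extracted from BLAST alignments beforehand).
    """
    diff = [0] * (query_len + 2)  # difference array, index 0 unused
    for start, end in hits:
        diff[start] += 1
        diff[end + 1] -= 1
    profile = []
    running = 0
    for pos in range(1, query_len + 1):
        running += diff[pos]
        profile.append(running)
    return profile


def plateaus(profile, min_hits, min_len):
    """Return 1-based inclusive (start, end) regions where the profile stays
    at or above min_hits for at least min_len consecutive residues."""
    regions = []
    start = None
    for i, depth in enumerate(profile, start=1):
        if depth >= min_hits:
            if start is None:
                start = i
        elif start is not None:
            if i - start >= min_len:
                regions.append((start, i - 1))
            start = None
    if start is not None and len(profile) - start + 1 >= min_len:
        regions.append((start, len(profile)))
    return regions
```

A simple threshold such as this only approximates "rectangular" elevations; a real implementation would presumably also check how sharply the profile rises and falls at the region boundaries.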

4.
Lipocalins constitute a superfamily of extracellular proteins that are found in all three kingdoms of life. Although very divergent in their sequences and functions, they show remarkable similarity in 3-D structures. Lipocalins bind and transport small hydrophobic molecules. Earlier sequence-based phylogenetic studies of lipocalins highlighted that they have a long evolutionary history. However, the molecular and structural basis of their functional diversity is not completely understood. The main objective of the present study is to understand the functional diversity of the lipocalins using a structure-based phylogenetic approach. The present study with 39 protein domains from the lipocalin superfamily suggests that the clusters of lipocalins obtained by structure-based phylogeny correspond well with the functional diversity. The detailed analysis of each of the clusters and sub-clusters reveals that the 39 lipocalin domains cluster based on their mode of ligand binding, though the clustering was performed on the basis of gross domain structure. The outliers in the phylogenetic tree are often from single-member families. The structure-based phylogenetic approach has also provided pointers for assigning putative functions to the domains of unknown function in the lipocalin family. The approach employed in the present study can be used in the future for the functional identification of new lipocalin proteins and may be extended to other protein families where members show poor sequence similarity but high structural similarity.

5.
Protein design experiments have shown that the use of specific subsets of amino acids can produce foldable proteins. This prompts the question of whether there is a minimal amino acid alphabet which could be used to fold all proteins. In this work we make an analogy between sequence patterns which produce foldable sequences and those which make it possible to detect structural homologs by aligning sequences, and use it to suggest the possible size of such a reduced alphabet. We estimate that reduced alphabets containing 10-12 letters can be used to design foldable sequences for a large number of protein families. This estimate is based on the observation that there is little loss of the information necessary to pick out structural homologs in a clustered protein sequence database when a suitable reduction of the amino acid alphabet from 20 to 10 letters is made, but that this information is rapidly degraded when further reductions in the alphabet are made.
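
What reducing the alphabet from 20 to 10 letters means in practice can be shown with a short sketch. The particular grouping below is an illustrative assumption based on broad physico-chemical similarity; the abstract does not state which grouping was used, and the function names are hypothetical.

```python
# ASSUMPTION: an illustrative 10-group partition of the 20 amino acids.
# The abstract does not specify the grouping it evaluated.
GROUPS = ["LVIM", "C", "A", "G", "ST", "P", "FYW", "EDNQ", "KR", "H"]

# Map every residue to the first letter of its group.
REDUCE = {aa: group[0] for group in GROUPS for aa in group}


def reduce_sequence(seq):
    """Rewrite a sequence in the reduced alphabet ('X' for unknown letters)."""
    return "".join(REDUCE.get(aa, "X") for aa in seq.upper())


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

Two aligned fragments that share no identical residue in the full alphabet can become identical after reduction, which is the sense in which a reduced alphabet may retain the signal needed to recognize structural homologs.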

6.
Compactness has been used to locate discontinuous structural units containing one or more polypeptide chains in proteins of known structure. Rather than exhaustively calculating the compactness of all possible units, our procedure uses a screening algorithm to find discontinuous regions that are potentially compact. Precise calculations of compactness are restricted only to units in these regions. With our procedure, compactness can be used to discover discontinuous domains with virtually any number of disjoint peptides. Small, single-domain proteins may contain several compact regions: thus, compact regions do not always correspond to folding domains. Because a domain is an independent folding unit and should contain a hydrophobic core, compact units were further examined for the presence of hydrophobic clusters (Zehfus MH, 1995, Protein Sci 4:1188-1202). This added constraint limits the number of acceptable units and helps greatly in the location of the true structural domains. The larger hydrophobically stabilized compact units correspond to domains, while the smaller units may correspond to folding intermediates.

7.
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

10.
Isolation and structure of a rhodopsin gene from D. melanogaster (cited 45 times: 0 self-citations, 45 citations by others)
C S Zuker, A F Cowman, G M Rubin. Cell, 1985, 40(4): 851-858
Using a novel method for detecting cross-homologous nucleic acid sequences we have isolated the gene coding for the major rhodopsin of Drosophila melanogaster and mapped it to chromosomal region 92B8-11. Comparison of cDNA and genomic DNA sequences indicates that the gene is divided into five exons. The amino acid sequence deduced from the nucleotide sequence is 373 residues long, and the polypeptide chain contains seven hydrophobic segments that appear to correspond to the seven transmembrane segments characteristic of other rhodopsins. Three regions of Drosophila rhodopsin are highly conserved with the corresponding domains of bovine rhodopsin, suggesting an important role for these polypeptide regions.

11.
It is proposed that particular segments of some ribosomal, histone and plant viral capsid proteins adopt a helical structural mode for interaction with nucleic acid. The amino acid regions were determined by three probes applied to 26 protein sequences: searches for helical wheels displaying asymmetric basic charge distributions, secondary structural predictions, and searches for primary structural homologies. In 11 of the protein sequences examined, homologous heptapeptides were found in the residue spans delineated by the three probes. A helical wheel analysis of the oligopeptide amino acids showed a distinct positive charge clustering. It is suggested that the basic amino acid side chains on the hydrophilic helical side interact with nucleic acid negative phosphate groups while the somewhat hydrophobic side is available for interaction within the protein or possibly with the major groove of double-stranded nucleic acid.
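
A helical wheel places successive residues of an ideal alpha helix 100 degrees apart around the helix axis. One way to quantify an "asymmetric basic charge distribution" on such a wheel is sketched below, under stated assumptions: basic residues are taken to be Lys and Arg only (His is left out), and asymmetry is scored as the mean resultant length of their wheel positions, a choice made here for illustration and not taken from the abstract. The function names are hypothetical.

```python
import math

# ASSUMPTION: only Lys and Arg are counted as basic residues.
BASIC = set("KR")


def wheel_angles(n, step_deg=100.0):
    """Angular position (degrees) of each of n residues on an ideal
    alpha-helical wheel, 100 degrees per residue."""
    return [(i * step_deg) % 360.0 for i in range(n)]


def basic_charge_asymmetry(segment):
    """Mean resultant length of the wheel positions of basic residues.

    Values near 1 mean the basic residues lie on one face of the helix;
    values near 0 mean they are spread around it. Returns 0.0 when the
    segment has no basic residue.
    """
    angles = [
        math.radians(angle)
        for angle, aa in zip(wheel_angles(len(segment)), segment.upper())
        if aa in BASIC
    ]
    if not angles:
        return 0.0
    x = sum(math.cos(a) for a in angles)
    y = sum(math.sin(a) for a in angles)
    return math.hypot(x, y) / len(angles)
```

Sliding this score along a sequence in short windows would flag candidate segments whose positive charges cluster on one helical face, which is the pattern the first of the three probes searches for.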

12.
Classification of proteins is a major challenge in bioinformatics. Here an approach is presented that unifies different existing classifications of protein structures and sequences. Protein structural domains are represented as nodes in a hypergraph. Shared memberships in sequence families result in hyperedges in the graph. The presented method partitions the hypergraph into clusters of structural domains. Each computed cluster is based on a set of shared sequence family memberships. Thus, the clusters put existing protein sequence families into the context of structural family hierarchies. Conversely, structural domains are related to their sequence family memberships, which can be used to gain further knowledge about the respective structural families.
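
The data structure described here can be sketched in a few lines: domains are nodes, and each sequence family is a hyperedge containing every domain that belongs to it. The abstract does not specify the partitioning algorithm, so the clustering step below is a stand-in chosen for simplicity (domains are grouped when they have exactly the same set of family memberships). All identifiers are hypothetical.

```python
from collections import defaultdict


def build_hyperedges(memberships):
    """Turn {domain: families} into hyperedges {family: set of domains}."""
    edges = defaultdict(set)
    for domain, families in memberships.items():
        for family in families:
            edges[family].add(domain)
    return dict(edges)


def cluster_by_shared_families(memberships):
    """ASSUMED stand-in for the partitioning step: group domains that have
    an identical set of family memberships. Keys are the shared family sets."""
    clusters = defaultdict(set)
    for domain, families in memberships.items():
        clusters[frozenset(families)].add(domain)
    return dict(clusters)
```

Each resulting cluster is labelled by the set of sequence families its domains share, which mirrors the statement that every computed cluster is based on a set of shared memberships, even though the published method may relax the exact-match requirement used in this sketch.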

13.
Seal RP, Leighton BH, Amara SG. Neuron, 2000, 25(3): 695-706
Excitatory amino acid transporters (EAATs) function as both substrate transporters and ligand-gated anion channels. Characterization of the transporter's general topology is the first requisite step in defining the structural bases for these distinct activities. While the first six hydrophobic domains can be readily modeled as conventional transmembrane segments, the organization of the C-terminal hydrophobic domains, which have been implicated in both substrate and ion interactions, has been controversial. Here, we report the results of a comprehensive evaluation of the C-terminal topology of EAAT1 determined by the chemical modification of introduced cysteine residues. Our data support a model in which two membrane-spanning domains flank a central region that is highly accessible to the extracellular milieu and contains at least one reentrant loop domain.

14.
The primary structures for several members of both the vicilin and legumin families of storage proteins were examined using a computer routine based on amino acid physical characteristics. The comparison algorithm revealed that sequences from the two families could be aligned and share a number of predicted secondary structural features. The COOH-terminal half of the subunits in both families displayed a highly conserved core region that was largely hydrophobic and in which a high proportion of the residues were predicted to be in beta-sheet conformations. The central region of the molecules which contained mixed areas of predicted helical and sheet conformations showed more variability in residue selection than the COOH-terminal regions. The NH2-terminal segments of subunits from the two different families could not be aligned though they characteristically had a high proportion of residues predicted to be in helical conformations. The feature which most clearly distinguished subunits between the two families was an inserted span in the legumin group with a high proportion of acidic amino acids located between the central and COOH-terminal domains. Residues in this insertion were predicted to exist mainly in helical conformation. Since considerable size variation occurs in this area amongst the legumin subunits, alterations in this region may have a minimal detrimental effect on the structure of the proteins.

15.
Signature sequences are contiguous patterns of amino acids 10-50 residues long that are associated with a particular structure or function in proteins. These may be of three types (by our nomenclature): superfamily signatures, remnant homologies, and motifs. We have performed a systematic search through a database of protein sequences to automatically and preferentially find remnant homologies and motifs. This was accomplished in three steps: 1. We generated a nonredundant sequence database. 2. We used BLAST3 (Altschul and Lipman, Proc. Natl. Acad. Sci. U.S.A. 87:5509-5513, 1990) to generate local pairwise and triplet sequence alignments for every protein in the database vs. every other. 3. We selected "interesting" alignments and grouped them into clusters. We find that most of the clusters contain segments from proteins which share a common structure or function. Many of them correspond to signatures previously noted in the literature. We discuss three previously recognized motifs in detail (FAD/NAD-binding, ATP/GTP-binding, and cytochrome b5-like domains) to demonstrate how the alignments generated by our procedure are consistent with previous work and make structural and functional sense. We also discuss two signatures (for N-acetyltransferases and glycerol-phosphate binding) which to our knowledge have not been previously recognized.

16.
Co-translational translocation of proteins across the membrane of the rough endoplasmic reticulum (ER) is interrupted by particular amino acid sequences, which are functionally termed "stop-transfer sequences." We analyzed the structural requirements for the interruption of peptide translocation. By manipulating the cDNA of interleukin 2 (IL2), which passes through the ER membrane co-translationally, the middle portion of the IL2 molecule was replaced with systematically altered hydrophobic segments: leucine, alanine, or mixed leucine/alanine clusters. Furthermore, charged amino acid residues were introduced just downstream of the hydrophobic segments. These modified IL2 peptides were synthesized with a wheat germ cell-free system in the presence of rough microsomes, and the topology of the peptides in the microsomes was assessed by post-translational digestion with proteinase K. We obtained the following results. (i) Each modified protein was processed to the mature form but the extent of stop-translocation varied widely. The ratio of the stopped to the translocated products increased as the length and hydrophobicity of the inserted segment increased. (ii) Hydrophobic segments shorter than naturally occurring transmembrane segments promoted stop-translocation. (iii) Proteins with hydrophobic segments followed by positive charges were more efficiently stop-translocated than those having negative charges. (iv) If the hydrophobicity of the segment was sufficiently high, the positive charges after the segment were not essential for stop-translocation. We also suggest that the stop-transfer process includes protein-protein interaction between the hydrophobic segment and the translocation channel.

17.
The open reading frames of human cytomegalovirus (human herpesvirus-5, HHV5) encode some 213 unique proteins with mostly unknown functions. Using the threading program ProCeryon, we calculated possible matches between the amino acid sequences of these proteins and the Protein Data Bank library of three-dimensional structures. Thirty-six proteins were fully identified in terms of their structure and, often, function; 65 proteins were recognized as members of narrow structural/functional families (e.g. DNA-binding factors, cytokines, enzymes, signaling particles, cell surface receptors, etc.); and 87 proteins were assigned to broad structural classes (e.g. all-β, 3-layer αβα, multidomain, etc.). Genes encoding proteins with similar folds, or containing identical structural traits (extreme sequence length, runs of unstructured (Pro and/or Gly-rich) residues, transmembrane segments, etc.), often formed tandem clusters throughout the genome. In the course of this work, benchmarks on about 20 known folds were used to optimize adjustable parameters of threading calculations, i.e. gap penalty weights used in sequence/structure alignments; new scores obtained as simple combinations of existing scoring functions; and the number of threading runs conducive to meaningful results. The introduction of summed, per-residue-normalized scores has been essential for the discovery of subdomains (EGF-like, SH2, SH3) in longer protein sequences, such as the eight "open sandwich" cytokine domains, 60-70 amino acids long and having the 3β1α fold with one or two disulfide bridges, present in otherwise unrelated proteins.

18.
GeMMA (Genome Modelling and Model Annotation) is a new approach to automatic functional subfamily classification within families and superfamilies of protein sequences. A major advantage of GeMMA is its ability to subclassify very large and diverse superfamilies with tens of thousands of members, without the need for an initial multiple sequence alignment. Its performance is shown to be comparable to the established high-performance method SCI-PHY. GeMMA follows an agglomerative clustering protocol that uses existing software for sensitive and accurate multiple sequence alignment and profile–profile comparison. The produced subfamilies are shown to be equivalent in quality whether whole protein sequences are used or just the sequences of component predicted structural domains. A faster, heuristic version of GeMMA that also uses distributed computing is shown to maintain the performance levels of the original implementation. The use of GeMMA to increase the functional annotation coverage of functionally diverse Pfam families is demonstrated. It is further shown how GeMMA clusters can help to predict the impact of experimentally determining a protein domain structure on comparative protein modelling coverage, in the context of structural genomics.

19.
It is known that larger globular proteins are built from domains, relatively independent structural units. Domain size seems to be limited, and a single domain consists of a few tens to a couple of hundred amino acids. Based on Monte Carlo simulations of a reduced protein model restricted to the face centered simple cubic lattice, with a minimal set of short-range and long-range interactions, we have shown that some model sequences spontaneously divide into separate domains upon the folding transition. The observed domain sizes closely correspond to the sizes of real protein domains. Short chains with a proper sequence pattern of hydrophobic and polar residues undergo a two-state folding transition to the structurally ordered globular state, while similar longer sequences follow a multistate transition. Homopolymeric (uniformly hydrophobic) chains and random heteropolymers undergo a continuous collapse transition into a single globule, and the globular state is much less ordered. Thus, the factors responsible for the multidomain structure of proteins are a sufficiently long polypeptide chain and characteristic, protein-like sequence patterns. These findings provide some hints for the analysis of real sequences aimed at prediction of the domain structure of large proteins.

20.
The strip-of-helix hydrophobicity algorithm was devised to identify protein sequences which, when coiled as alpha or 3(10) helices, had one axial, hydrophobic strip and otherwise variably hydrophilic residues. The strip-of-helix hydrophobicity algorithm also ranked such sequences according to an index, the mean hydrophobicity of amino acids in the axial strip. This algorithm predicted T cell-presented fragments of antigenic proteins well. A derivative of this algorithm (the structural helices algorithm, SHA) was tested for the prediction of helices in crystallographically defined proteins. For the SHA, eight-amino-acid sequences (two cycles plus one amino acid in an alpha helix) with strip-of-helix hydrophobicity indices greater than 2.5 were selected, with overlapping segments joined. These selections were terminated according to simple "capping rules," which took into account the roles of N-terminal Asn or Pro and C-terminal Gly in the stability of helices. In analyses of 35 crystallographically defined proteins with known alpha and 3(10) helices, the predictions with the SHA overlapped (had overlap indices x ≥ 0.5) with 34% of known helices, touched (had overlap indices 0.5 > x > 0) or overlapped with 66% of known helices, or were neighboring (came within 6 residues) or touched or overlapped with 82% of known helices. At each level of judging the quality of prediction, the SHA was usually less sensitive (correct predictions/total number of known helices) and more efficient (correct predictions/total number of predictions) than the Chou-Fasman and Garnier-Robson methods. It was simpler in design and calculation. The chemical mechanisms underlying these algorithms appear to apply both to protein folding and to selection of T cell-presented antigenic sequences.
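
The index described in this abstract (the mean hydrophobicity of the residues lying on one axial strip of a helix, evaluated over eight-residue windows and compared with a cutoff of 2.5) can be sketched as follows. Two things are assumptions made for illustration and are not stated in the abstract: the hydrophobicity scale (Kyte-Doolittle values are used here) and the strip positions within the window (offsets 0, 4 and 7, roughly one residue per helical turn). Joining of overlapping windows and the capping rules are omitted, and the function names are hypothetical.

```python
# ASSUMPTION: Kyte-Doolittle hydropathy values; the abstract does not
# name the hydrophobicity scale that was used.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

WINDOW = 8                 # "two cycles plus one amino acid"
STRIP_OFFSETS = (0, 4, 7)  # ASSUMED positions of the axial strip


def strip_index(window):
    """Mean hydrophobicity of the strip residues of one 8-residue window.
    Raises KeyError for letters outside the 20 standard amino acids."""
    return sum(KD[window[o]] for o in STRIP_OFFSETS) / len(STRIP_OFFSETS)


def candidate_windows(seq, threshold=2.5):
    """Return (start, index) for every window whose strip index exceeds
    the threshold; start is 0-based."""
    seq = seq.upper()
    found = []
    for i in range(len(seq) - WINDOW + 1):
        index = strip_index(seq[i:i + WINDOW])
        if index > threshold:
            found.append((i, index))
    return found
```

Because only the strip positions contribute to the index, a window can score highly even when every other residue is charged, which matches the description of one hydrophobic strip amid otherwise hydrophilic residues.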
