首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 31 毫秒
1.
Lipocalins constitute a superfamily of extracellular proteins that are found in all three kingdoms of life. Although very divergent in their sequences and functions, they show remarkable similarity in 3-D structures. Lipocalins bind and transport small hydrophobic molecules. Earlier sequence-based phylogenetic studies of lipocalins highlighted that they have a long evolutionary history. However the molecular and structural basis of their functional diversity is not completely understood. The main objective of the present study is to understand functional diversity of the lipocalins using a structure-based phylogenetic approach. The present study with 39 protein domains from the lipocalin superfamily suggests that the clusters of lipocalins obtained by structure-based phylogeny correspond well with the functional diversity. The detailed analysis on each of the clusters and sub-clusters reveals that the 39 lipocalin domains cluster based on their mode of ligand binding though the clustering was performed on the basis of gross domain structure. The outliers in the phylogenetic tree are often from single member families. Also structure-based phylogenetic approach has provided pointers to assign putative function for the domains of unknown function in lipocalin family. The approach employed in the present study can be used in the future for the functional identification of new lipocalin proteins and may be extended to other protein families where members show poor sequence similarity but high structural similarity.  相似文献   

2.
3.
Three-dimensional cluster analysis offers a method for the prediction of functional residue clusters in proteins. This method requires a representative structure and a multiple sequence alignment as input data. Individual residues are represented in terms of regional alignments that reflect both their structural environment and their evolutionary variation, as defined by the alignment of homologous sequences. From the overall (global) and the residue-specific (regional) alignments, we calculate the global and regional similarity matrices, containing scores for all pairwise sequence comparisons in the respective alignments. Comparing the matrices yields two scores for each residue. The regional conservation score (C(R)(x)) defines the conservation of each residue x and its neighbors in 3D space relative to the protein as a whole. The similarity deviation score (S(x)) detects residue clusters with sequence similarities that deviate from the similarities suggested by the full-length sequences. We evaluated 3D cluster analysis on a set of 35 families of proteins with available cocrystal structures, showing small ligand interfaces, nucleic acid interfaces and two types of protein-protein interfaces (transient and stable). We present two examples in detail: fructose-1,6-bisphosphate aldolase and the mitogen-activated protein kinase ERK2. We found that the regional conservation score (C(R)(x)) identifies functional residue clusters better than a scoring scheme that does not take 3D information into account. C(R)(x) is particularly useful for the prediction of poorly conserved, transient protein-protein interfaces. Many of the proteins studied contained residue clusters with elevated similarity deviation scores. These residue clusters correlate with specificity-conferring regions: 3D cluster analysis therefore represents an easily applied method for the prediction of functionally relevant spatial clusters of residues in proteins.  相似文献   

4.
Yau SS  Yu C  He R 《DNA and cell biology》2008,27(5):241-250
Graphical representation of gene sequences provides a simple way of viewing, sorting, and comparing various gene structures. Here we first report a two-dimensional graphical representation for protein sequences. With this method, we constructed the moment vectors for protein sequences, and mathematically proved that the correspondence between moment vectors and protein sequences is one-to-one. Therefore, each protein sequence can be represented as a point in a map, which we call protein map, and cluster analysis can be used for comparison between the points. Sixty-six proteins from five protein families were analyzed using this method. Our data showed that for proteins in the same family, their corresponding points in the map are close to each other. We also illustrate the efficiency of this approach by performing an extensive cluster analysis of the protein kinase C family. These results indicate that this protein map could be used to mathematically specify the similarity of two proteins and predict properties of an unknown protein based on its amino acid sequence.  相似文献   

5.
Mishra P  Pandey PN 《Bioinformation》2011,6(10):372-374
The number of amino acid sequences is increasing very rapidly in the protein databases like Swiss-Prot, Uniprot, PIR and others, but the structure of only some amino acid sequences are found in the Protein Data Bank. Thus, an important problem in genomics is automatically clustering homologous protein sequences when only sequence information is available. Here, we use graph theoretic techniques for clustering amino acid sequences. A similarity graph is defined and clusters in that graph correspond to connected subgraphs. Cluster analysis seeks grouping of amino acid sequences into subsets based on distance or similarity score between pairs of sequences. Our goal is to find disjoint subsets, called clusters, such that two criteria are satisfied: homogeneity: sequences in the same cluster are highly similar to each other; and separation: sequences in different clusters have low similarity to each other. We tested our method on several subsets of SCOP (Structural Classification of proteins) database, a gold standard for protein structure classification. The results show that for a given set of proteins the number of clusters we obtained is close to the superfamilies in that set; there are fewer singeltons; and the method correctly groups most remote homologs.  相似文献   

6.
Several studies based on the known three-dimensional (3-D) structures of proteins show that two homologous proteins with insignificant sequence similarity could adopt a common fold and may perform same or similar biochemical functions. Hence, it is appropriate to use similarities in 3-D structure of proteins rather than the amino acid sequence similarities in modelling evolution of distantly related proteins. Here we present an assessment of using 3-D structures in modelling evolution of homologous proteins. Using a dataset of 108 protein domain families of known structures with at least 10 members per family we present a comparison of extent of structural and sequence dissimilarities among pairs of proteins which are inputs into the construction of phylogenetic trees. We find that correlation between the structure-based dissimilarity measures and the sequence-based dissimilarity measures is usually good if the sequence similarity among the homologues is about 30% or more. For protein families with low sequence similarity among the members, the correlation coefficient between the sequence-based and the structure-based dissimilarities are poor. In these cases the structure-based dendrogram clusters proteins with most similar biochemical functional properties better than the sequence-similarity based dendrogram. In multi-domain protein families and disulphide-rich protein families the correlation coefficient for the match of sequence-based and structure-based dissimilarity (SDM) measures can be poor though the sequence identity could be higher than 30%. Hence it is suggested that protein evolution is best modelled using 3-D structures if the sequence similarities (SSM) of the homologues are very low.  相似文献   

7.
DiffTool is a resource to build and visualize protein clusters computed from a sequence database. The package provides a clustering tool to construct protein families according to sequence similarities and a web interface to query the corresponding clusters. A subtractive genome analysis tool selects protein families specific for a genome or a group of genomes. For each protein cluster, DiffTool includes access to sequences, coloured multiple alignments and phylogenetic trees. AVAILABILITY: A cluster database built from yeast and complete prokaryotic genomes is queryable at http://bioweb.pasteur.fr/seqanal/difftool. All the Perl sources are freely available to non-profit organizations upon request.  相似文献   

8.
Silva PJ 《Proteins》2008,70(4):1588-1594
Hydrophobic cluster analysis (HCA) has long been used as a tool to detect distant homologies between protein sequences, and to classify them into different folds. However, it relies on expert human intervention, and is sensitive to subjective interpretations of pattern similarities. In this study, we describe a novel algorithm to assess the similarity of hydrophobic amino acid distributions between two sequences. Our algorithm correctly identifies as misattributions several HCA-based proposals of structural similarity between unrelated proteins present in the literature. We have also used this method to identify the proper fold of a large variety of sequences, and to automatically select the most appropriate structure for homology modeling of several proteins with low sequence identity to any other member of the protein data bank. Automatic modeling of the target proteins based on these templates yielded structures with TM-scores (vs. experimental structures) above 0.60, even without further refinement. Besides enabling a reliable identification of the correct fold of an unknown sequence and the choice of suitable templates, our algorithm also shows that whereas most structural classes of proteins are very homogeneous in hydrophobic cluster composition, a tenth of the described families are compatible with a large variety of hydrophobic patterns. We have built a browsable database of every major representative hydrophobic cluster pattern present in each structural class of proteins, freely available at http://www2.ufp.pt/ pedros/HCA_db/index.htm.  相似文献   

9.
Periplasmic binding protein-dependent transport systems are composed of a periplasmic substrate-binding protein, a set of 2 (sometimes 1) very hydrophobic integral membrane proteins, and 1 (sometimes 2) hydrophilic peripheral membrane protein that binds and hydrolyzes ATP. These systems are members of the superfamily of ABC transporters. We performed a molecular phylogenetic analysis of the sequences of 70 hydrophobic membrane proteins of these transport systems in order to investigate their evolutionary history. Proteins were grouped into 8 clusters. Within each cluster, protein sequences displayed significant similarities, suggesting that they derive from a common ancestor. Most clusters contained proteins from systems transporting analogous substrates such as monosaccharides, oligopeptides, or hydrophobic amino acids, but this was not a general rule. Proteins from diverse bacteria are found within each cluster, suggesting that the ancestors of current clusters were present before the divergence of bacterial groups. The phylogenetic trees computed for hydrophobic membrane proteins of these permeases are similar to those described for the periplasmic substrate-binding proteins. This result suggests that the genetic regions encoding binding protein-dependent permeases evolved as whole units. Based on the results of the classification of the proteins and on the reconstructed phylogenetic trees, we propose an evolutionary scheme for periplasmic permeases. According to this model, it is probable that these transport systems derive from an ancestral system having only 1 hydrophobic membrane protein.(ABSTRACT TRUNCATED AT 250 WORDS)  相似文献   

10.
The tannase protein sequences of 149 bacteria and 36 fungi were retrieved from NCBI database. Among them only 77 bacterial and 31 fungal tannase sequences were taken which have different amino acid compositions. These sequences were analysed for different physical and chemical properties, superfamily search, multiple sequence alignment, phylogenetic tree construction and motif finding to find out the functional motif and the evolutionary relationship among them. The superfamily search for these tannase exposed the occurrence of proline iminopeptidase-like, biotin biosynthesis protein BioH, O-acetyltransferase, carboxylesterase/thioesterase 1, carbon–carbon bond hydrolase, haloperoxidase, prolyl oligopeptidase, C-terminal domain and mycobacterial antigens families and alpha/beta hydrolase superfamily. Some bacterial and fungal sequence showed similarity with different families individually. The multiple sequence alignment of these tannase protein sequences showed conserved regions at different stretches with maximum homology from amino acid residues 389–469 and 482–523 which could be used for designing degenerate primers or probes specific for tannase producing bacterial and fungal species. Phylogenetic tree showed two different clusters; one has only bacteria and another have both fungi and bacteria showing some relationship between these different genera. Although in second cluster near about all fungal species were found together in a corner which indicates the sequence level similarity among fungal genera. The distributions of fourteen motifs analysis revealed Motif 1 with a signature amino acid sequence of 29 amino acids, i.e. GCSTGGREALKQAQRWPHDYDGIIANNPA, was uniformly observed in 83.3 % of studied tannase sequences representing its participation with the structure and enzymatic function.  相似文献   

11.
Kinch LN  Grishin NV 《Proteins》2002,48(1):75-84
Nitrogen regulatory (PII) proteins are signal transduction molecules involved in controlling nitrogen metabolism in prokaryots. PII proteins integrate the signals of intracellular nitrogen and carbon status into the control of enzymes involved in nitrogen assimilation. Using elaborate sequence similarity detection schemes, we show that five clusters of orthologs (COGs) and several small divergent protein groups belong to the PII superfamily and predict their structure to be a (betaalphabeta)(2) ferredoxin-like fold. Proteins from the newly emerged PII superfamily are present in all major phylogenetic lineages. The PII homologs are quite diverse, with below random (as low as 1%) pairwise sequence identities between some members of distant groups. Despite this sequence diversity, evidence suggests that the different subfamilies retain the PII trimeric structure important for ligand-binding site formation and maintain a conservation of conservations at residue positions important for PII function. Because most of the orthologous groups within the PII superfamily are composed entirely of hypothetical proteins, our remote homology-based structure prediction provides the only information about them. Analogous to structural genomics efforts, such prediction gives clues to the biological roles of these proteins and allows us to hypothesize about locations of functional sites on model structures or rationalize about available experimental information. For instance, conserved residues in one of the families map in close proximity to each other on PII structure, allowing for a possible metal-binding site in the proteins coded by the locus known to affect sensitivity to divalent metal ions. Presented analysis pushes the limits of sequence similarity searches and exemplifies one of the extreme cases of reliable sequence-based structure prediction. In conjunction with structural genomics efforts to shed light on protein function, our strategies make it possible to detect homology between highly diverse sequences and are aimed at understanding the most remote evolutionary connections in the protein world.  相似文献   

12.
Classification of proteins is a major challenge in bioinformatics. Here an approach is presented, that unifies different existing classifications of protein structures and sequences. Protein structural domains are represented as nodes in a hypergraph. Shared memberships in sequence families result in hyperedges in the graph. The presented method partitions the hypergraph into clusters of structural domains. Each computed cluster is based on a set of shared sequence family memberships. Thus, the clusters put existing protein sequence families into the context of structural family hierarchies. Conversely, structural domains are related to their sequence family memberships, which can be used to gain further knowledge about the respective structural families.  相似文献   

13.
Cellular membrane lipids, of which phospholipids are the major constituents, form one of the characteristic features that distinguish Archaea from other organisms. In this study, we focused on the steps in archaeal phospholipid synthetic pathways that generate polar lipids such as archaetidylserine, archaetidylglycerol, and archaetidylinositol. Only archaetidylserine synthase (ASS), from Methanothermobacter thermautotrophicus, has been experimentally identified. Other enzymes have not been fully examined. Through database searching, we detected many archaeal hypothetical proteins that show sequence similarity to members of the CDP alcohol phosphatidyltransferase family, such as phosphatidylserine synthase (PSS), phosphatidylglycerol synthase (PGS) and phosphatidylinositol synthase (PIS) derived from Bacteria and Eukarya. The archaeal hypothetical proteins were classified into two groups, based on the sequence similarity. Members of the first group, including ASS from M. thermautotrophicus, were closely related to PSS. The rough agreement between PSS homologue distribution within Archaea and the experimentally identified distribution of archaetidylserine suggested that the hypothetical proteins are ASSs. We found that an open reading frame (ORF) tends to be adjacent to that of ASS in the genome, and that the order of the two ORFs is conserved. The sequence similarity of phosphatidylserine decarboxylase to the product of the ORF next to the ASS gene, together with the genomic context conservation, suggests that the ORF encodes archaetidylserine decarboxylase, which may transform archaetidylserine to archaetidylethanolamine. The second group of archaeal hypothetical proteins was related to PGS and PIS. The members of this group were subjected to molecular phylogenetic analysis, together with PGSs and PISs and it was found that they formed two distinct clusters in the molecular phylogenetic tree. The distribution of members of each cluster within Archaea roughly corresponded to the experimentally identified distribution of archaetidylglycerol or archaetidylinositol. The molecular phylogenetic tree patterns and the correspondence to the membrane compositions suggest that the two clusters in this group correspond to archaetidylglycerol synthases and archaetidylinositol synthases. No archaeal hypothetical protein with sequence similarity to known phosphatidylcholine synthases was detected in this study.  相似文献   

14.
We have characterized the relationship between accurate phylogenetic reconstruction and sequence similarity, testing whether high levels of sequence similarity can consistently produce accurate evolutionary trees. We generated protein families with known phylogenies using a modified version of the PAML/EVOLVER program that produces insertions and deletions as well as substitutions. Protein families were evolved over a range of 100-400 point accepted mutations; at these distances 63% of the families shared significant sequence similarity. Protein families were evolved using balanced and unbalanced trees, with ancient or recent radiations. In families sharing statistically significant similarity, about 60% of multiple sequence alignments were 95% identical to true alignments. To compare recovered topologies with true topologies, we used a score that reflects the fraction of clades that were correctly clustered. As expected, the accuracy of the phylogenies was greatest in the least divergent families. About 88% of phylogenies clustered over 80% of clades in families that shared significant sequence similarity, using Bayesian, parsimony, distance, and maximum likelihood methods. However, for protein families with short ancient branches (ancient radiation), only 30% of the most divergent (but statistically significant) families produced accurate phylogenies, and only about 70% of the second most highly conserved families, with median expectation values better than 10(-60), produced accurate trees. These values represent upper bounds on expected tree accuracy for sequences with a simple divergence history; proteins from 700 Giardia families, with a similar range of sequence similarities but considerably more gaps, produced much less accurate trees. For our simulated insertions and deletions, correct multiple sequence alignments did not perform much better than those produced by T-COFFEE, and including sequences with expressed sequence tag-like sequencing errors did not significantly decrease phylogenetic accuracy. In general, although less-divergent sequence families produce more accurate trees, the likelihood of estimating an accurate tree is most dependent on whether radiation in the family was ancient or recent. Accuracy can be improved by combining genes from the same organism when creating species trees or by selecting protein families with the best bootstrap values in comprehensive studies.  相似文献   

15.
16.
We propose a new method for classifying and identifying transmembrane (TM) protein functions in proteome-scale by applying a single-linkage clustering method based on TM topology similarity, which is calculated simply from comparing the lengths of loop regions. In this study, we focused on 87 prokaryotic TM proteomes consisting of 31 proteobacteria, 22 gram-positive bacteria, 19 other bacteria, and 15 archaea. Prior to performing the clustering, we first categorized individual TM protein sequences as "known," "putative" (similar to "known" sequences), or "unknown" by using the homology search and the sequence similarity comparison against SWISS-PROT to assess the current status of the functional annotation of the TM proteomes based on sequence similarity only. More than three-quarters, that is, 75.7% of the TM protein sequences are functionally "unknown," with only 3.8% and 20.5% of them being classified as "known" and "putative," respectively. Using our clustering approach based on TM topology similarity, we succeeded in increasing the rate of TM protein sequences functionally classified and identified from 24.3% to 60.9%. Obtained clusters correspond well to functional superfamilies or families, and the functional classification and identification are successfully achieved by this approach. For example, in an obtained cluster of TM proteins with six TM segments, 109 sequences out of 119 sequences annotated as "ATP-binding cassette transporter" are properly included and 122 "unknown" sequences are also contained.  相似文献   

17.

Background  

The rapid burgeoning of available protein data makes the use of clustering within families of proteins increasingly important. The challenge is to identify subfamilies of evolutionarily related sequences. This identification reveals phylogenetic relationships, which provide prior knowledge to help researchers understand biological phenomena. A good evolutionary model is essential to achieve a clustering that reflects the biological reality, and an accurate estimate of protein sequence similarity is crucial to the building of such a model. Most existing algorithms estimate this similarity using techniques that are not necessarily biologically plausible, especially for hard-to-align sequences such as proteins with different domain structures, which cause many difficulties for the alignment-dependent algorithms. In this paper, we propose a novel similarity measure based on matching amino acid subsequences. This measure, named SMS for Substitution Matching Similarity, is especially designed for application to non-aligned protein sequences. It allows us to develop a new alignment-free algorithm, named CLUSS, for clustering protein families. To the best of our knowledge, this is the first alignment-free algorithm for clustering protein sequences. Unlike other clustering algorithms, CLUSS is effective on both alignable and non-alignable protein families. In the rest of the paper, we use the term "phylogenetic" in the sense of "relatedness of biological functions".  相似文献   

18.
A comprehensive, structural and functional, in silico analysis of the medium-chain dehydrogenase/reductase (MDR) superfamily, including 583 proteins, was carried out by use of extensive database mining and the blastp program in an iterative manner to identify all known members of the superfamily. Based on phylogenetic, sequence, and functional similarities, the protein members of the MDR superfamily were classified into three different taxonomic categories: (a) subfamilies, consisting of a closed group containing a set of ideally orthologous proteins that perform the same function; (b) families, each comprising a cluster of monophyletic subfamilies that possess significant sequence identity among them and might share or not common substrates or mechanisms of reaction; and (c) macrofamilies, each comprising a cluster of monophyletic protein families with protein members from the three domains of life, which includes at least one subfamily member that displays activity related to a very ancient metabolic pathway. In this context, a superfamily is a group of homologous protein families (and/or macrofamilies) with monophyletic origin that shares at least a barely detectable sequence similarity, but showing the same 3D fold. The MDR superfamily encloses three macrofamilies, with eight families and 49 subfamilies. These subfamilies exhibit great functional diversity including noncatalytic members with different subcellular, phylogenetic, and species distributions. This results from constant enzymogenesis and proteinogenesis within each kingdom, and highlights the huge plasticity that MDR superfamily members possess. Thus, through evolution a great number of taxa-specific new functions were acquired by MDRs. The generation of new functions fulfilled by proteins, can be considered as the essence of protein evolution. The mechanisms of protein evolution inside MDR are not constrained to conserve substrate specificity and/or chemistry of catalysis. In consequence, MDR functional diversity is more complex than sequence diversity. MDR is a very ancient protein superfamily that existed in the last universal common ancestor. It had at least two (and probably three) different ancestral activities related to formaldehyde metabolism and alcoholic fermentation. Eukaryotic members of this superfamily are more related to bacterial than to archaeal members; horizontal gene transfer among the domains of life appears to be a rare event in modern organisms.  相似文献   

19.
GeneRAGE: a robust algorithm for sequence clustering and domain detection   总被引:9,自引:0,他引:9  
MOTIVATION: Efficient, accurate and automatic clustering of large protein sequence datasets, such as complete proteomes, into families, according to sequence similarity. Detection and correction of false positive and negative relationships with subsequent detection and resolution of multi-domain proteins. RESULTS: A new algorithm for the automatic clustering of protein sequence datasets has been developed. This algorithm represents all similarity relationships within the dataset in a binary matrix. Removal of false positives is achieved through subsequent symmetrification of the matrix using a Smith-Waterman dynamic programming alignment algorithm. Detection of multi-domain protein families and further false positive relationships within the symmetrical matrix is achieved through iterative processing of matrix elements with successive rounds of Smith-Waterman dynamic programming alignments. Recursive single-linkage clustering of the corrected matrix allows efficient and accurate family representation for each protein in the dataset. Initial clusters containing multi-domain families, are split into their constituent clusters using the information obtained by the multi-domain detection step. This algorithm can hence quickly and accurately cluster large protein datasets into families. Problems due to the presence of multi-domain proteins are minimized, allowing more precise clustering information to be obtained automatically. AVAILABILITY: GeneRAGE (version 1.0) executable binaries for most platforms may be obtained from the authors on request. The system is available to academic users free of charge under license.  相似文献   

20.
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号