首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 687 毫秒
1.
Annotations of the genes and their products are largely guided by inferring homology. Sequence similarity is the primary measure used for annotation purpose however, the domain content and order were given less importance albeit the fact that domain insertion, deletion, positional changes can bring in functional varieties. Of late, several methods developed quantify domain architecture similarity depending on alignments of their sequences and are focused on only homologous proteins. We present an alignment-free domain architecture-similarity search (ADASS) algorithm that identifies proteins that share very poor sequence similarity yet having similar domain architectures. We introduce a “singlet matching-triplet comparison” method in ADASS, wherein triplet of domains is compared with other triplets in a pair-wise comparison of two domain architectures. Different events in the triplet comparison are scored as per a scoring scheme and an average pairwise distance score (Domain Architecture Distance score - DAD Score) is calculated between protein domains architectures. We use domain architectures of a selected domain termed as centric domain and cluster them based on DAD score. The algorithm has high Positive Prediction Value (PPV) with respect to the clustering of the sequences of selected domain architectures. A comparison of domain architecture based dendrograms using ADASS method and an existing method revealed that ADASS can classify proteins depending on the extent of domain architecture level similarity. ADASS is more relevant in cases of proteins with tiny domains having little contribution to the overall sequence similarity but contributing significantly to the overall function.  相似文献   

2.
Protein domain architectures (PDAs), in which single domains are linked to form multiple-domain proteins, are a major molecular form used by evolution for the diversification of protein functions. However, the design principles of PDAs remain largely uninvestigated. In this study, we constructed networks to connect domain architectures that had grown out from the same single domain for every single domain in the Pfam-A database and found that there are three main distinctive types of these networks, which suggests that evolution can exploit PDAs in three different ways. Further analysis showed that these three different types of PDA networks are each adopted by different types of protein domains, although many networks exhibit the characteristics of more than one of the three types. Our results shed light on nature''s blueprint for protein architecture and provide a framework for understanding architectural design from a network perspective.  相似文献   

3.
MOTIVATION: The structural interaction of proteins and their domains in networks is one of the most basic molecular mechanisms for biological cells. Topological analysis of such networks can provide an understanding of and solutions for predicting properties of proteins and their evolution in terms of domains. A single paradigm for the analysis of interactions at different layers, such as domain and protein layers, is needed. RESULTS: Applying a colored vertex graph model, we integrated two basic interaction layers under a unified model: (1) structural domains and (2) their protein/complex networks. We identified four basic and distinct elements in the model that explains protein interactions at the domain level. We searched for motifs in the networks to detect their topological characteristics using a pruning strategy and a hash table for rapid detection. We obtained the following results: first, compared with a random distribution, a substantial part of the protein interactions could be explained by domain-level structural interaction information. Second, there were distinct kinds of protein interaction patterns classified by specific and distinguishable numbers of domains. The intermolecular domain interaction was the most dominant protein interaction pattern. Third, despite the coverage of the protein interaction information differing among species, the similarity of their networks indicated shared architectures of protein interaction network in living organisms. Remarkably, there were only a few basic architectures in the model (>10 for a 4-node network topology), and we propose that most biological combinations of domains into proteins and complexes can be explained by a small number of key topological motifs. CONTACT: doheon@kaist.ac.kr.  相似文献   

4.
Domains are basic evolutionary units of proteins and most proteins have more than one domain. Advances in domain modeling and collection are making it possible to annotate a large fraction of known protein sequences by a linear ordering of their domains, yielding their architecture. Protein domain architectures link evolutionarily related proteins and underscore their shared functions. Here, we attempt to better understand this association by identifying the evolutionary pathways by which extant architectures may have evolved. We propose a model of evolution in which architectures arise through rearrangements of inferred precursor architectures and acquisition of new domains. These pathways are ranked using a parsimony principle, whereby scenarios requiring the fewest number of independent recombination events, namely fission and fusion operations, are assumed to be more likely. Using a data set of domain architectures present in 159 proteomes that represent all three major branches of the tree of life allows us to estimate the history of over 85% of all architectures in the sequence database. We find that the distribution of rearrangement classes is robust with respect to alternative parsimony rules for inferring the presence of precursor architectures in ancestral species. Analyzing the most parsimonious pathways, we find 87% of architectures to gain complexity over time through simple changes, among which fusion events account for 5.6 times as many architectures as fission. Our results may be used to compute domain architecture similarities, for example, based on the number of historical recombination events separating them. Domain architecture "neighbors" identified in this way may lead to new insights about the evolution of protein function.  相似文献   

5.
Most eukaryotic proteins are multi-domain proteins that are created from fusions of genes, deletions and internal repetitions. An investigation of such evolutionary events requires a method to find the domain architecture from which each protein originates. Therefore, we defined a novel measure, domain distance, which is calculated as the number of domains that differ between two domain architectures. Using this measure the evolutionary events that distinguish a protein from its closest ancestor have been studied and it was found that indels are more common than internal repetition and that the exchange of a domain is rare. Indels and repetitions are common at both the N and C-terminals while they are rare between domains. The evolution of the majority of multi-domain proteins can be explained by the stepwise insertions of single domains, with the exception of repeats that sometimes are duplicated several domains in tandem. We show that domain distances agree with sequence similarity and semantic similarity based on gene ontology annotations. In addition, we demonstrate the use of the domain distance measure to build evolutionary trees. Finally, the evolution of multi-domain proteins is exemplified by a closer study of the evolution of two protein families, non-receptor tyrosine kinases and RhoGEFs.  相似文献   

6.
7.

Background

Conserved domains are recognized as the building blocks of eukaryotic proteins. Domains showing a tendency to occur in diverse combinations (??promiscuous?? domains) are involved in versatile architectures in proteins with different functions. Current models, based on global-level analyses of domain combinations in multiple genomes, have suggested that the propensity of some domains to associate with other domains in high-level architectures increases with organismal complexity. Alternative models using domain-based phylogenetic trees propose that domains have become promiscuous independently in different lineages through convergent evolution and are, thus, random with no functional or structural preferences. Here we test whether complex protein architectures have occurred by accretion from simpler systems and whether the appearance of multidomain combinations parallels organismal complexity. As a model, we analyze the modular evolution of the PWWP domain and ask whether its appearance in combinations with other domains into multidomain architectures is linked with the occurrence of more complex life-forms. Whether high-level combinations of domains are conserved and transmitted as stable units (cassettes) through evolution is examined in the genomes of plant or metazoan species selected for their established position in the evolution of the respective lineages.

Results

Using the domain-tree approach, we analyze the evolutionary origins and distribution patterns of the promiscuous PWWP domain to understand the principles of its modular evolution and its existence in combination with other domains in higher-level protein architectures. We found that as a single module the PWWP domain occurs only in proteins with a limited, mainly, species-specific distribution. Earlier, it was suggested that domain promiscuity is a fast-changing (volatile) feature shaped by natural selection and that only a few domains retain their promiscuity status throughout evolution. In contrast, our data show that most of the multidomain PWWP combinations in extant multicellular organisms (humans or land plants) are present in their unicellular ancestral relatives suggesting they have been transmitted through evolution as conserved linear arrangements (??cassettes??). Among the most interesting biologically relevant results is the finding that the genes of the two plant Trithorax family subgroups (ATX1/2 and ATX3/4/5) have different phylogenetic origins. The two subgroups occur together in the earliest land plants Physcomitrella patens and Selaginella moellendorffii.

Conclusion

Gain/loss of a single PWWP domain is observed throughout evolution reflecting dynamic lineage- or species-specific events. In contrast, higher-level protein architectures involving the PWWP domain have survived as stable arrangements driven by evolutionary descent. The association of PWWP domains with the DNA methyltransferases in O. tauri and in the metazoan lineage seems to have occurred independently consistent with convergent evolution. Our results do not support models wherein more complex protein architectures involving the PWWP domain occur with the appearance of more evolutionarily advanced life forms.  相似文献   

8.
Lu J  Bao Q  Wu J  Wang H  Li D  Xi Y  Wang S  Yu S  Qu J 《Genomics》2008,92(1):60-64
Adenylate cyclases, guanylate cyclases, cyclic nucleotide phosphodiesterases, and cyclic nucleotide-binding proteins constitute the core of cAMP and cGMP signaling components. Using a combination of BLAST and profile search methods, we found that cyclic nucleotide-binding proteins exhibited diverse domain architectures. In addition to the domain architectures involved in the characterized functional groups, a cyclic nucleotide-binding domain was also fused to various domains involved in pyridine nucleotide-disulfide oxidoreductase, acetyltransferase, thioredoxin reductase, glutaminase, rhodanese, ferredoxin, and diguanylate cyclase, implying the versatile functions of cyclic nucleotide-binding proteins. We constructed the CSCDB database to accumulate the components of cAMP and cGMP signaling pathways in the complete genomes. User-friendly interfaces were created for easier browsing, searching, and downloading the data. Besides harboring the sequence itself, each entry provided detailed annotation information, such as sequence features, chromosomal localization, functional domains, transmembrane region, and sequence similarity against several major databases. Currently, CSCDB contains 4234 entries covering 466 organisms, including 35 eukaryotes, 382 bacteria, and 29 archaea. CSCDB can be freely accessible on the web at http://cscdb.com.cn.  相似文献   

9.
There is a limited repertoire of domain families in nature that are duplicated and combined in different ways to form the set of proteins in a genome. Most proteins in both prokaryote and eukaryote genomes consist of two or more domains, and we show that the family size distribution of multi-domain protein families follows a power law like that of individual families. Most domain pairs occur in four to six different domain architectures: in isolation and in combinations with different partners. We showed previously that within the set of all pairwise domain combinations, most small and medium-sized families are observed in combination with one or two other families, while a few large families are very versatile and combine with many different partners. Though this may appear to be a stochastic pattern, in which large families have more combination partners by virtue of their size, we establish here that all the domain families with more than three members in genomes are duplicated more frequently than would be expected by chance considering their number of neighbouring domains. This duplication of domain pairs is statistically significant for between one and three quarters of all families with seven or more members. For the majority of pairwise domain combinations, there is no known three-dimensional structure of the two domains together, and we term these novel combinations. Novel domain combinations are interesting and important targets for structural elucidation, as the geometry and interaction between the domains will help understand the function and evolution of multi-domain proteins. Of particular interest are those combinations that occur in the largest number of multi-domain proteins, and several of these frequent novel combinations contain DNA-binding domains.Abbreviations:SCOP: Structural Classification of Proteins database, PDB: Protein DataBank, HMM: hidden Markov model  相似文献   

10.
Homology identification is the first step for many genomic studies. Current methods, based on sequence comparison, can result in a substantial number of mis-assignments due to the similarity of homologous domains in otherwise unrelated sequences. Here we propose methods to detect homologs through explicit comparison of protein domain content. We developed several schemes for scoring the homology of a pair of protein sequences based on methods used in the field of information retrieval. We evaluate the proposed methods and methods used in the literature using a benchmark of fifteen sequence families of known evolutionary history. The results of these studies demonstrate the effectiveness of comparing domain architectures using these similarity measures. We also demonstrate the importance of both weighting promiscuous domains and of compensating for the statistical effect of having a large number of domains in a protein. Using logistic regression, we demonstrate the benefit of combining similarity measures based on domain content with sequence similarity measures.  相似文献   

11.
The CluSTr database (http://www.ebi.ac.uk/clustr/) offers an automatic classification of SWISS-PROT+TrEMBL proteins into groups of related proteins. The clustering is based on analysis of all pair-wise sequence comparisons between proteins using the Smith-Waterman algorithm. The analysis, carried out on different levels of protein similarity, yields a hierarchical organization of clusters. Information about domain content of the clustered proteins is provided via the InterPro resource. The introduced InterPro 'condensed graphical view' simplifies the visual analysis of represented domain architectures. Integrated applications allow users to visualize and edit multiple alignments and build sequence divergence trees. Links to the relevant structural data in Protein Data Bank (PDB) and Homology derived Secondary Structure of Proteins (HSSP) are also provided.  相似文献   

12.
Evolution of Chitin-Binding Proteins in Invertebrates   总被引:11,自引:0,他引:11  
Analysis of a group of invertebrate proteins, including chitinases and peritrophic matrix proteins, reveals the presence of chitin-binding domains that share significant amino acid sequence similarity. The data suggest that these domains evolved from a common ancestor which may be a protein containing a single chitin-binding domain. The duplication and transposition of this chitin-binding domain may have contributed to the functional diversification of chitin-binding proteins. Sequence comparisons indicated that invertebrate and plant chitin binding domains do not share significant amino acid sequence similarity, suggesting that they are not coancestral. However, both the invertebrate and the plant chitin-binding domains are cysteine-rich and have several highly conserved aromatic residues. In plants, cysteines have been elucidated in maintaining protein folding and aromatic amino acids in interacting with saccharides [Wright HT, Sanddrasegaram G, Wright CS (1991) J Mol Evol 33:283–294]. It is likely that these residues perform similar functions in invertebrates. We propose that the invertebrate and the plant chitin-binding domains share similar mechanisms for folding and saccharide binding and that they evolved by convergent evolution. Furthermore, we propose that the disulfide bonds and aromatic residues are hallmarks for saccharide-binding proteins. Received: 2 March 1998 / Accepted: 17 July 1998  相似文献   

13.
The overall function of a multi‐domain protein is determined by the functional and structural interplay of its constituent domains. Traditional sequence alignment‐based methods commonly utilize domain‐level information and provide classification only at the level of domains. Such methods are not capable of taking into account the contributions of other domains in the proteins, and domain‐linker regions and classify multi‐domain proteins. An alignment‐free protein sequence comparison tool, CLAP (CLAssification of Proteins) was previously developed in our laboratory to especially handle multi‐domain protein sequences without a requirement of defining domain boundaries and sequential order of domains. Through this method we aim to achieve a biologically meaningful classification scheme for multi‐domain protein sequences. In this article, CLAP‐based classification has been explored on 5 datasets of multi‐domain proteins and we present detailed analysis for proteins containing (1) Tyrosine phosphatase and (2) SH3 domain. At the domain‐level CLAP‐based classification scheme resulted in a clustering similar to that obtained from an alignment‐based method. CLAP‐based clusters obtained for full‐length datasets were shown to comprise of proteins with similar functions and domain architectures. Our study demonstrates that multi‐domain proteins could be classified effectively by considering full‐length sequences without a requirement of identification of domains in the sequence.  相似文献   

14.
Over the past several years fungal infections have shown an increasing incidence in the susceptible population, and caused high mortality rates. In parallel, multi-resistant fungi are emerging in human infections. Therefore, the identification of new potential antifungal targets is a priority. The first task of this study was to analyse the protein domain and domain architecture content of the 137 fungal proteomes (corresponding to 111 species) available in UniProtKB (UniProt KnowledgeBase) by January 2013. The resulting list of core and exclusive domain and domain architectures is provided in this paper. It delineates the different levels of fungal taxonomic classification: phylum, subphylum, order, genus and species. The analysis highlighted Aspergillus as the most diverse genus in terms of exclusive domain content. In addition, we also investigated which domains could be considered promiscuous in the different organisms. As an application of this analysis, we explored three different ways to detect potential targets for antifungal drugs. First, we compared the domain and domain architecture content of the human and fungal proteomes, and identified those domains and domain architectures only present in fungi. Secondly, we looked for information regarding fungal pathways in public repositories, where proteins containing promiscuous domains could be involved. Three pathways were identified as a result: lovastatin biosynthesis, xylan degradation and biosynthesis of siroheme. Finally, we classified a subset of the studied fungi in five groups depending on their occurrence in clinical samples. We then looked for exclusive domains in the groups that were more relevant clinically and determined which of them had the potential to bind small molecules. Overall, this study provides a comprehensive analysis of the available fungal proteomes and shows three approaches that can be used as a first step in the detection of new antifungal targets.  相似文献   

15.
SMART: a web-based tool for the study of genetically mobile domains   总被引:61,自引:2,他引:59  
SMART (a Simple Modular Architecture Research Tool) allows the identification and annotation of genetically mobile domains and the analysis of domain architectures (http://SMART.embl-heidelberg.de ). More than 400 domain families found in signalling, extra-cellular and chromatin-associated proteins are detectable. These domains are extensively annotated with respect to phyletic distributions, functional class, tertiary structures and functionally important residues. Each domain found in a non-redundant protein database as well as search parameters and taxonomic information are stored in a relational database system. User interfaces to this database allow searches for proteins containing specific combinations of domains in defined taxa.  相似文献   

16.
MOTIVATION: We describe a novel method for detecting the domain structure of a protein from sequence information alone. The method is based on analyzing multiple sequence alignments that are derived from a database search. Multiple measures are defined to quantify the domain information content of each position along the sequence and are combined into a single predictor using a neural network. The output is further smoothed and post-processed using a probabilistic model to predict the most likely transition positions between domains. RESULTS: The method was assessed using the domain definitions in SCOP and CATH for proteins of known structure and was compared with several other existing methods. Our method performs well both in terms of accuracy and sensitivity. It improves significantly over the best methods available, even some of the semi-manual ones, while being fully automatic. Our method can also be used to suggest and verify domain partitions based on structural data. A few examples of predicted domain definitions and alternative partitions, as suggested by our method, are also discussed. AVAILABILITY: An online domain-prediction server is available at http://biozon.org/tools/domains/  相似文献   

17.
18.
Domain insertions and deletions lead to variations in the domain architectures of the proteins from their common ancestor. In this work, we investigated four groups of the RhoGEF-containing proteins from different organisms with domain architectures RhoGEF-PH-SH3, SH3-RhoGEF-PH, RhoGEF-PH, and SH3-RhoGEF defined in the Pfam database. The phylogenetic trees were constructed using each individual domain and/or the combinations of all the domains. The phylogenetic analysis suggests that RhoGEF-PH-SH3 and SH3-RhoGEF-PH might have evolved from RhoGEF-PH through the insertion of SH3 independently, while SH3- RhoGEF of proteins in fruit fly might have evolved from SH3-RhoGEF-PH by the degeneration of PH domain.  相似文献   

19.
Fan M  Wong KC  Ryu T  Ravasi T  Gao X 《PloS one》2012,7(6):e39475
With rapid advances in the development of DNA sequencing technologies, a plethora of high-throughput genome and proteome data from a diverse spectrum of organisms have been generated. The functional annotation and evolutionary history of proteins are usually inferred from domains predicted from the genome sequences. Traditional database-based domain prediction methods cannot identify novel domains, however, and alignment-based methods, which look for recurring segments in the proteome, are computationally demanding. Here, we propose a novel genome-wide domain prediction method, SECOM. Instead of conducting all-against-all sequence alignment, SECOM first indexes all the proteins in the genome by using a hash seed function. Local similarity can thus be detected and encoded into a graph structure, in which each node represents a protein sequence and each edge weight represents the shared hash seeds between the two nodes. SECOM then formulates the domain prediction problem as an overlapping community-finding problem in this graph. A backward graph percolation algorithm that efficiently identifies the domains is proposed. We tested SECOM on five recently sequenced genomes of aquatic animals. Our tests demonstrated that SECOM was able to identify most of the known domains identified by InterProScan. When compared with the alignment-based method, SECOM showed higher sensitivity in detecting putative novel domains, while it was also three orders of magnitude faster. For example, SECOM was able to predict a novel sponge-specific domain in nucleoside-triphosphatase (NTPases). Furthermore, SECOM discovered two novel domains, likely of bacterial origin, that are taxonomically restricted to sea anemone and hydra. SECOM is an open-source program and available at http://sfb.kaust.edu.sa/Pages/Software.aspx.  相似文献   

20.
The ever increasing speed of DNA sequencing widens the discrepancy between the number of known gene products, and the knowledge of their function and structure. Proper annotation of protein sequences is therefore crucial if the missing information is to be deduced from sequence‐based similarity comparisons. These comparisons become exceedingly difficult as the pairwise identities drop to very low values. To improve the accuracy of domain identification, we exploit the fact that the three‐dimensional structures of domains are much more conserved than their sequences. Based on structure‐anchored multiple sequence alignments of low identity homologues we constructed 850 structure‐anchored hidden Markov models (saHMMs), each representing one domain family. Since the saHMMs are highly family specific, they can be used to assign a domain to its correct family and clearly distinguish it from domains belonging to other families, even within the same superfamily. This task is not trivial and becomes particularly difficult if the unknown domain is distantly related to the rest of the domain sequences within the family. In a search with full length protein sequences, harbouring at least one domain as defined by the structural classification of proteins database (SCOP), version 1.71, versus the saHMM database based on SCOP version 1.69, we achieve an accuracy of 99.0%. All of the few hits outside the family fall within the correct superfamily. Compared to Pfam_ls HMMs, the saHMMs obtain about 11% higher coverage. A comparison with BLAST and PSI‐BLAST demonstrates that the saHMMs have consistently fewer errors per query at a given coverage. Within our recommended E‐value range, the same is true for a comparison with SUPERFAMILY. Furthermore, we are able to annotate 232 proteins with 530 nonoverlapping domains belonging to 102 different domain families among human proteins labelled “unknown” in the NCBI protein database. Our results demonstrate that the saHMM database represents a versatile and reliable tool for identification of domains in protein sequences. With the aid of saHMMs, homology on the family level can be assigned, even for distantly related sequences. Due to the construction of the saHMMs, the hits they provide are always associated with high quality crystal structures. The saHMM database can be accessed via the FISH server at http://babel.ucmp.umu.se/fish/ . Proteins 2009. © 2008 Wiley‐Liss, Inc.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号