Similar articles
 20 similar articles found (search time: 15 ms)
1.
Classification of proteins is a major challenge in bioinformatics. Here, an approach is presented that unifies different existing classifications of protein structures and sequences. Protein structural domains are represented as nodes in a hypergraph. Shared memberships in sequence families result in hyperedges in the graph. The presented method partitions the hypergraph into clusters of structural domains. Each computed cluster is based on a set of shared sequence family memberships. Thus, the clusters put existing protein sequence families into the context of structural family hierarchies. Conversely, structural domains are related to their sequence family memberships, which can be used to gain further knowledge about the respective structural families.
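
The hypergraph representation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' partitioning method, which the abstract does not specify. It assumes the simplest possible rule: domains are grouped when their sets of family memberships are identical. The function names, domain identifiers, and family names are all hypothetical.

```python
from collections import defaultdict


def build_hyperedges(memberships):
    """Map each sequence family (hyperedge) to the set of domains (nodes) it joins."""
    hyperedges = defaultdict(set)
    for domain, families in memberships.items():
        for family in families:
            hyperedges[family].add(domain)
    return dict(hyperedges)


def cluster_by_shared_families(memberships):
    """Group domains whose sets of family memberships are identical.

    This is an illustrative simplification, not the published algorithm.
    """
    clusters = defaultdict(set)
    for domain, families in memberships.items():
        clusters[frozenset(families)].add(domain)
    return dict(clusters)


# Hypothetical toy data: structural domain -> sequence families it belongs to.
memberships = {
    "d1": {"famA", "famB"},
    "d2": {"famA", "famB"},
    "d3": {"famB"},
    "d4": {"famC"},
}

hyperedges = build_hyperedges(memberships)
clusters = cluster_by_shared_families(memberships)
```

In this toy input, `famB` forms a hyperedge over three domains, while `d1` and `d2` fall into one cluster because they share exactly the same family memberships.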

2.
Analyses of increasingly saturated sequence databases have shown that gene family sizes are highly skewed, with many families being small and few containing many far-diverged homologs. Additionally, recently published results have identified a structural determinant of mutational plasticity, designability, which correlates strongly with gene family size. In this paper, we explore possible links between the two observations, examining the possible effect of designability on duplication and divergence. We show that designability has the inverse of the expected relationship with strength of selection: more designable domains, which should have more mutational plasticity, evolve more slowly. However, we also present evidence that recently duplicated genes have a variable probability of locus fixation that is correlated with strength of selection. As expected, paralogs under stronger evolutionary pressure have a lower failure rate. Finally, we show that the probability of pseudogene formation from gene duplication can be directly tied to the designability and functional flexibility of the family. We present evidence that gene families with higher designability have diverged farther because of a lower probability of pseudogenization. Additionally, mutational plasticity may play an integral role by influencing the pseudogenization rate. Either way, we show that considering the failure rate of duplications is integral to understanding the determinants and dynamics of molecular evolution.

3.
Many proteins, especially in eukaryotes, contain tandem repeats of several domains from the same family. These repeats have a variety of binding properties and are involved in protein–protein interactions as well as binding to other ligands such as DNA and RNA. The rapid expansion of protein domain repeats is assumed to have evolved through internal tandem duplications. However, the exact mechanisms behind these tandem duplications are not well understood. Here, we have studied the evolution, function, protein structure, gene structure, and phylogenetic distribution of domain repeats. For this purpose, we have assigned Pfam-A domain families to 24 proteomes with more sensitive domain assignments in the repeat regions. These assignments confirmed previous findings that eukaryotes, and in particular vertebrates, contain a much higher fraction of proteins with repeats compared with prokaryotes. The internal sequence similarity in each protein revealed that the domain repeats are often expanded through duplications of several domains at a time, while the duplication of one domain is less common. Many of the repeats appear to have been duplicated in the middle of the repeat region. This is in strong contrast to the evolution of other proteins, which mainly works through additions of single domains at either terminus. Further, we found that some domain families show distinct duplication patterns, e.g., nebulin domains have mainly been expanded with a unit of seven domains at a time, while duplications of other domain families involve varying numbers of domains. Finally, no common mechanism for the expansion of all repeats could be detected. We found that the duplication patterns show no dependence on the size of the domains. Further, repeat expansion in some families can possibly be explained by shuffling of exons. However, exon shuffling could not have created all repeats.

4.
ABSTRACT: BACKGROUND: Proteins convey the majority of biochemical and cellular activities in organisms. Over the course of evolution, proteins undergo normal sequence mutations as well as large-scale mutations involving domain duplication and/or domain shuffling. These events result in the generation of new proteins and protein families. Processes that affect proteome evolution drive species diversity and adaptation. Herein, change over the course of metazoan evolution, as defined by birth/death and duplication/deletion events within protein families and domains, was examined using the proteomes of nine metazoan and two outgroup species. RESULTS: In studying members of the three major metazoan groups, the vertebrates, arthropods, and nematodes, we found that the number of protein families increased at the majority of lineages over the course of metazoan evolution, and that the magnitude of these increases was greatest at the lineages leading to mammals. In contrast, the number of protein domains decreased at most lineages and at all terminal lineages. This resulted in a weak correlation between protein family birth and domain birth; however, the correlation between domain birth and domain member duplication was quite strong. These data suggest that domain birth and protein family birth occur via different mechanisms, and that domain shuffling plays a role in the formation of protein families. The ratio of protein family birth to protein domain birth (domain shuffling index) suggests that shuffling had a more demonstrable effect on protein families in nematodes and arthropods than in vertebrates. Through the contrast of high and low domain shuffling indices at the lineages of Trichinella spiralis and Gallus gallus, we propose a link between protein redundancy and evolutionary changes controlled by domain shuffling; however, the speed of adaptation among the different lineages was relatively invariant. Evaluating the functions of protein families that appeared or disappeared at the last common ancestors (LCAs) of the three metazoan clades supports a correlation with organism adaptation. Furthermore, bursts of new protein families and domains in the LCAs of metazoans and vertebrates are consistent with whole genome duplications. CONCLUSION: Metazoan speciation and adaptation were explored by birth/death and duplication/deletion events among protein families and domains. Our results provide insights into protein evolution and its bearing on metazoan evolution.
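
The domain shuffling index mentioned in this abstract is defined there as the ratio of protein family birth to protein domain birth. A minimal Python sketch of that ratio follows. The function name, the handling of a zero denominator, and the lineage counts are assumptions made for illustration, not values or conventions from the study.

```python
def domain_shuffling_index(family_births, domain_births):
    """Return family births divided by domain births for one lineage.

    Returns None when no domains were born, since the ratio is undefined
    (this convention is an assumption, not taken from the abstract).
    """
    if domain_births == 0:
        return None
    return family_births / domain_births


# Hypothetical birth counts per lineage: (family births, domain births).
lineage_counts = {
    "lineage_X": (30, 10),
    "lineage_Y": (12, 8),
    "lineage_Z": (5, 0),
}

indices = {
    name: domain_shuffling_index(families, domains)
    for name, (families, domains) in lineage_counts.items()
}
```

Under this definition, a higher index means that more new protein families appeared per new domain, which the abstract interprets as a larger contribution from shuffling of existing domains.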

5.
Some plant microRNAs have been shown to be generated de novo by inverted duplication from their target genes. Subsequent duplication events potentially generate multigene microRNA families. In this article, we provide supportive evidence for the inverted duplication model of plant microRNA evolution. First, we report that the precursors of four Arabidopsis thaliana microRNA families, miR157, miR158, miR405 and miR447, share nearly identical nucleotide sequences throughout the whole miRNA precursor between the family members. The extent and degree of sequence conservation are suggestive of recent evolutionary duplication events. Furthermore, we found that sequence similarities are not restricted to the transcribed part but extend into the promoter regions. Thus, the duplication event most probably included the promoter regions as well. Conserved elements in upstream regions of miR163 and its targets were also detected. This implies that the inverted duplication of target genes, at least in certain cases, included the promoters of the target genes. Sequence conservation within promoters of miRNA families, as well as between a miRNA and its potential progenitor gene, can be exploited for understanding the regulation of microRNA genes.

6.
Lee D, Grant A, Marsden RL, Orengo C. Proteins, 2005, 59(3): 603-615
Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244) and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As with earlier studies by other researchers, the distribution of domain families shows power-law behavior such that the largest 2,000 domain families can be mapped to approximately 70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While approximately 50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (< 10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g. COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structure genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/.

7.
Self-organization of tree form: a model for complex social systems   Total citations: 1, self-citations: 0, citations by others: 1

8.
The sequencing of a genome is the first stage of its complete characterization. Subsequent work seeks to utilize available sequence data to gain a better understanding of the genes which are found within a genome. Gene families comprise large portions of the genomes of higher vertebrates, and the available genomic data allow for a reappraisal of gene family evolution. This reappraisal will clarify relatedness within and between gene families. One such family, the alpha-actinin gene family, is part of the spectrin superfamily. There are four known loci, which encode alpha-actinins 1, 2, 3, and 4. Of the eight domains in alpha-actinin, the actin-binding domain is the most highly conserved. Here we present evidence gained through phylogenetic analyses of the highly conserved actin-binding domain that alpha-actinin 2 was the first of the four alpha-actinins to arise by gene duplication, followed by the divergence of alpha-actinin 3 and then alpha-actinins 1 and 4. Resolution of the gene tree for this gene family has allowed us to reclassify several alpha-actinins which were previously given names inconsistent with the most widely accepted nomenclature for this gene family. This reclassification clarifies previous discrepancies in the public databases as well as in the literature, thus eliminating confusion caused by continued misclassification of members of the alpha-actinin gene family. In addition, the topology found for this gene family undermines the 2R hypothesis of two rounds of genome duplication early in vertebrate evolution.

9.

Background  

Proteins are composed of one or several building blocks, known as domains. Such domains can be classified into families according to their evolutionary origin. Whereas sequencing technologies have advanced immensely in recent years, there are no matching computational methodologies for large-scale determination of protein domains and their boundaries. We provide and rigorously evaluate a novel set of domain families that is automatically generated from sequence data. Our domain family identification process, called EVEREST (EVolutionary Ensembles of REcurrent SegmenTs), begins by constructing a library of protein segments that emerge in an all vs. all pairwise sequence comparison. It then proceeds to cluster these segments into putative domain families. The selection of the best putative families is done using machine learning techniques. A statistical model is then created for each of the chosen families. This procedure is then iterated: the aforementioned statistical models are used to scan all protein sequences, to recreate a library of segments and to cluster them again.

10.
Dengler U, Siddiqui AS, Barton GJ. Proteins, 2001, 42(3): 332-344
The 3Dee database of domain definitions was developed as a comprehensive collection of domain definitions for all three-dimensional structures in the Protein Data Bank (PDB). The database includes definitions for complex, multiple-segment and multiple-chain domains as well as simple sequential domains, organized in a structural hierarchy. Two different snapshots of the 3Dee database, from September 1996 and November 1999, were analyzed. For the November 1999 release, 7,995 PDB entries contained 13,767 protein chains and gave rise to 18,896 domains. The domain sequences clustered into 1,715 domain sequence families, which were further clustered into a conservative 1,199 domain structure families (families with similar folds). The proportion of different domain structure families per domain sequence family increases from 84% for domains 1-100 residues long to 100% for domains greater than 600 residues. This is in keeping with the idea that longer chains will have more alternative folds available to them. Of the representative domains from the domain sequence families, 49% are in the range of 51-150 residues, whereas 64% of the representative chains over 200 residues have more than 1 domain. Of the representative chains, 8.5% are part of multichain domains. The largest multichain domain in the database has 14 chains and 1,400 residues, whereas the largest single-chain domain has 907 residues. The largest number of domains found in a protein is 13. The analysis shows that over the history of the PDB, new domain folds have been discovered at a slower rate than by random selection of all known folds. Between 1992 and 1997, a constant 1 in 11 new domains deposited in the PDB has shown no sequence similarity to a previously known domain sequence family, and only 1 in 15 new domain structures has had a fold that has not been seen previously. A comparison of the September 1996 release of 3Dee to the Structural Classification of Proteins (SCOP) showed that the domain definitions agreed for 80% of the representative protein chains. However, 3Dee provided explicit domain boundaries for more proteins. 3Dee is accessible on the World Wide Web at http://barton.ebi.ac.uk/servers/3Dee.html.

11.
12.
Grapevine is an important fruit crop that has undergone a long history of evolution. Analysis of the whole genome sequence of grapevine has revealed the presence of an early palaeo-hexaploid along with three complements. Thus, gene duplication and genome expansion are common in this genome. In this study, we identified 17,922 duplicated genes in the whole grapevine genome. Among these, 2,039; 628; 1,428; 722; and 2,942 were identified, respectively, as produced by genome-wide, tandem, proximal, retrotransposed, and DNA-based transposed duplications. Analyses of the evolutionary patterns for different types of duplication using non-synonymous and synonymous substitution rates uncovered a series of underlying rules. Thereafter, all the grapevine genes were classified into families, and the contributions of different types of duplication to the expansion of large families were revealed. No duplication type was solely responsible for the formation of any large gene family, but some families showed enrichment of a particular type of duplication. On the basis of this study, we believe that uncovering the underlying rules for gene duplications, expansions of gene families, and their evolutionary styles will contribute significantly to a comprehensive understanding of the features of the grapevine genome.

13.
Modular rearrangements play an important role in protein evolution. Functional modules, often tantamount to structural domains or smaller fragments, are in many cases well conserved but reoccur in a different order and across many protein families. The underlying genetic mechanisms are gene duplication, fusion, and loss of sequence fragments. As a consequence, the sequential order of domains can be inverted, leading to what is known as circularly permuted proteins. Using a recently developed algorithm, we have identified a large number of such rearrangements and analyzed their evolutionary history. We searched for examples which have arisen by one of the three postulated mechanisms: independent fusion/fission, "duplication/deletion," and plasmid-mediated "cut and paste." We conclude that all three mechanisms can be observed, with the independent fusion/fission being the most frequent. This can be partly attributed to highly mobile domains. Duplication/deletion has been found in modular proteins such as peptide synthases.

14.
Gene duplication events play key roles in gene innovation during the evolution of eukaryotic genomes. A large portion of the total gene content in plants arose from tandem duplication events, which often result in paralogous genes with high sequence identity. Ubiquitin ligases or E3 enzymes are components of the ubiquitin proteasome system that function during the transfer of the ubiquitin molecule to the substrate. In plants, several E3s have expanded in their genomes as multigene families. To gain insight into the consequences of gene duplications on the expansion and diversification of E3s, we examined the evolutionary basis of a cluster of six genes, duplC-ATLs, which arose from segmental and tandem duplication events in Brassicaceae. The assessment of the expression suggested two patterns that are supported by lineage. While retention of expression domains was observed, an apparent absence or reduction of expression was also inferred. We found that two duplC-ATL genes underwent pseudogenization and that, in one case, gene expression is probably regained. Our findings provide insights into the evolution of gene families in plants, defining key events in the expansion of the Arabidopsis Tóxicos en Levadura family of E3 ligases.

15.
Abstract: The sequencing of a genome is the first stage of its complete characterization. Subsequent work seeks to utilize available sequence data to gain a better understanding of the genes which are found within a genome. Gene families comprise large portions of the genomes of higher vertebrates, and the available genomic data allow for a reappraisal of gene family evolution. This reappraisal will clarify relatedness within and between gene families. One such family, the α-actinin gene family, is part of the spectrin superfamily. There are four known loci, which encode α-actinins 1, 2, 3, and 4. Of the eight domains in α-actinin, the actin-binding domain is the most highly conserved. Here we present evidence gained through phylogenetic analyses of the highly conserved actin-binding domain that α-actinin 2 was the first of the four α-actinins to arise by gene duplication, followed by the divergence of α-actinin 3 and then α-actinins 1 and 4. Resolution of the gene tree for this gene family has allowed us to reclassify several α-actinins which were previously given names inconsistent with the most widely accepted nomenclature for this gene family. This reclassification clarifies previous discrepancies in the public databases as well as in the literature, thus eliminating confusion caused by continued misclassification of members of the α-actinin gene family. In addition, the topology found for this gene family undermines the 2R hypothesis of two rounds of genome duplication early in vertebrate evolution.

16.
Domains are considered as the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences was decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.

17.
The amino acid sequence of bovine lung cGMP-dependent protein kinase has been determined by degradation and alignment of two primary overlapping sets of peptides generated by cleavage at methionyl or arginyl residues. The protein contains 670 residues in a single Nα-acetylated chain corresponding to a molecular weight of 76,331. The function of the molecule is considered in six segments of sequence which may correspond to four folding domains. From the amino terminus, the first segment is related to the dimerizing property of the protein. The second and third segments appear to have evolved from an ancestral tandem internal gene duplication, generating twin cGMP-binding domains which are homologous to twin domains in the regulatory subunits of cAMP-dependent protein kinase and to the cAMP-binding domain of the catabolite gene activator of Escherichia coli. The fourth and fifth segments may comprise one domain which is homologous to the catalytic subunits of cAMP-dependent protein kinase, of calcium-dependent phosphorylase b kinase, and of certain oncogenic viral protein tyrosine kinases. The regulatory, amino-terminal half of cGMP-dependent protein kinase appears to be related to a family of smaller proteins that bind cAMP for diverse purposes, whereas the catalytic, carboxyl-terminal half is related to a family of protein kinases of varying specificity and varying sensitivity to regulators. These data suggest that ancestral gene splicing events may have been involved in the fusion of two families of proteins to generate the allosteric character of this chimeric enzyme.

18.
The HNHc (SMART ID: SM00507) domain (SCOP nomenclature: HNH family) can be subclassified into at least eight subsets by iterative refinement of HMM profiles. An initial clustering of 323 proteins containing the HNHc domain helped identify the subsets. The subsets could be differentiated on the basis of the pattern of occurrence of seven defining features. Domain association is also different between the subsets. The subsets show organism as well as domain-based clustering, suggestive of propagation by both duplication and horizontal transfer events. Structure-based sequence analysis of the subsets led to the identification of common structural and sequence motifs in the HNH family with the other three families under the His-Me endonuclease superfamily.

19.
Hahn Y, Bera TK, Pastan IH, Lee B. Gene, 2006, 366(2): 238-245
The POTE family genes encode a highly homologous group of primate-specific proteins that contain ankyrin repeats and coiled coil domains. At least 13 paralogous POTE family genes are found on 8 human chromosomes (2, 8, 13, 14, 15, 18, 21 and 22), which can be sorted into 3 groups based on sequence similarity. We identified by a database search a group of additional human ankyrin repeat domain proteins, of which ANKRD26 and ANKRD30A are the best characterized; these are more distant homologs of POTE family proteins. A comprehensive comparison of the genomic organization indicates that ANKRD26 has the genomic structure of the possible ancestor of ANKRD30A and all POTE family genes. Extensive remodeling involving segmental loss and internal duplication appears to have reshaped the ANKRD30A and POTE family genes after the primal duplication of the ancestor gene. We also identified a mouse homolog of human ANKRD26, but failed to find a mouse homolog that bears the structural characteristics of any of the POTE family of proteins. The mouse Ankrd26 may serve as a useful model for the study of the function of human ANKRD26, ANKRD30A and POTE family proteins.

20.
We examined the primary sequence of canavalin, the major storage protein of jack beans, and found that an ancient sequence duplication accounts for 80% of the amino acid residues. Evidence for such a duplication was also found in the orthologous proteins phaseolin and pea vicilin. This sequence duplication presumably accounts for a structural duplication in the canavalin monomer observed by crystallographic analysis. One copy of this repeat was found in a second storage-protein family, the legumins, where it encompasses almost the entire B-chain of the mature molecule. We propose that the vicilin and legumin families of legume seed proteins evolved from a common precursor, which consisted of one copy of the repeat in the vicilins.
