Similar articles
 Found 20 similar articles (search time: 15 ms)
1.
Domains are the building blocks of all globular proteins, and are units of compact three-dimensional structure as well as evolutionary units. There is a limited repertoire of domain families, so that these domain families are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products. The processes that produce new genes are duplication and recombination as well as gene fusion and fission. We attempt to gain an overview of these processes by studying the structural domains in the proteins of seven genomes from the three kingdoms of life: Eubacteria, Archaea and Eukaryota. We use the domain and superfamily definitions in the Structural Classification of Proteins (SCOP) database to map pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 624 out of the 764 superfamilies in SCOP in these genomes, and the 624 families occur in 585 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour. This type of pattern can be described by a scale-free network. Finally, we study domain repeats, compare the set of domain combinations in the genomes to those in the PDB, and discuss the implications for structural genomics.

2.
The recognition of remote protein homologies is a major aspect of the structural and functional annotation of newly determined genomes. Here we benchmark the coverage and error rate of genome annotation using the widely used homology-searching program PSI-BLAST (position-specific iterated basic local alignment search tool). This study evaluates the one-to-many success rate for recognition, as often there are several homologues in the database and only one needs to be identified for annotating the sequence. In contrast, previous benchmarks considered one-to-one recognition, in which a single query was required to find a particular target. The benchmark constructs a model genome from the full sequences of the Structural Classification of Proteins (SCOP) database and searches against a target library of remote homologous domains (<20% identity). The structural benchmark provides a reliable list of correct and false homology assignments. PSI-BLAST successfully annotated 40% of the domains in the model genome that had at least one homologue in the target library. This coverage is more than three times that obtained when one-to-one recognition is evaluated (11% coverage of domains). Although a structural benchmark was used, the results apply equally to purely sequence-based homology searches. Accordingly, structural and sequence assignments were made to the sequences of Mycoplasma genitalium and Mycobacterium tuberculosis (see http://www.bmm.icnet.uk). The extent of missed assignments and of new superfamilies can be estimated for these genomes for both structural and functional annotations.
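The difference between one-to-one and one-to-many coverage described in item 2 can be illustrated with a small sketch. This is not the benchmark's actual code: the function names, the pair-based data layout and the toy data are all assumptions made for illustration.

```python
def one_to_one_coverage(hits, true_pairs):
    """Fraction of true (query, target) homologous pairs that were detected."""
    found = sum(1 for pair in true_pairs if pair in hits)
    return found / len(true_pairs)


def one_to_many_coverage(hits, true_pairs):
    """Fraction of queries for which at least one true homologue was detected."""
    queries = {query for query, _ in true_pairs}
    annotated = {query for query, target in true_pairs if (query, target) in hits}
    return len(annotated) / len(queries)


# Hypothetical toy data: three queries with 3, 2 and 1 true homologues.
true_pairs = {
    ("q1", "t1"), ("q1", "t2"), ("q1", "t3"),
    ("q2", "t4"), ("q2", "t5"),
    ("q3", "t6"),
}
# Reported hits; ("q9", "t1") is a false assignment and counts for nothing.
hits = {("q1", "t2"), ("q2", "t4"), ("q9", "t1")}

print(one_to_one_coverage(hits, true_pairs))
print(one_to_many_coverage(hits, true_pairs))
```

In the toy data, two of six true pairs are found, yet two of three queries receive an annotation, which is why one-to-many coverage can be much higher than one-to-one coverage for the same search results.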

3.
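Bioinformatic methods were used to annotate the mobilomes (transposable element complements) of nine bony fish species. The mobilomes of the nine species differed markedly in size and composition. Ranked from highest to lowest mobilome content, the species were zebrafish, coelacanth, medaka, tilapia, platyfish, Atlantic cod, three-spined stickleback, tetraodon (金娃娃) and fugu (红鳍东方鲀), and transposon content was positively correlated with genome size. DNA transposons showed high diversity and large differences in content among bony fishes (0.50%–38.37%) and were the main determinant of mobilome differences; the hAT and Tc/Mariner superfamilies were the major DNA transposons. RNA transposons were also highly diverse in bony fishes. LINE transposons made up 0.53%–5.75% of bony fish genomes, with 14 superfamilies detected, among which L1, L2, RTE and Rex showed pronounced amplification. LTR transposons reached 5.58% and 2.51% in zebrafish and three-spined stickleback, respectively, but were below 2% in most bony fish genomes; six LTR superfamilies (Copia, DIRS, ERV, Gypsy, Ngaro and Pao) were detected, with Gypsy the most strongly amplified. SINE transposons showed the weakest amplification, reaching 3.28% and 5.64% only in zebrafish and coelacanth, respectively, and remaining below 1% in the other seven species. Among SINEs, the tRNA, 5S and MIR superfamilies were amplified to some extent in some bony fishes. This study shows that bony fish mobilomes are rich in diversity and differ greatly, that mobilome differences correlate strongly with genome size, and that the mobilome is an important determinant of genome size in bony fishes.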

4.
The explosion in gene sequence data and technological breakthroughs in protein structure determination inspired the launch of structural genomics (SG) initiatives. An often stated goal of structural genomics is the high-throughput structural characterisation of all protein sequence families, with the long-term hope of significantly impacting the life sciences, biotechnology and drug discovery. Here, we present a comprehensive analysis of solved SG targets to assess progress of these initiatives. Eleven consortia have contributed 316 non-redundant entries and 323 protein chains to the Protein Data Bank (PDB), and 459 and 393 domains to the CATH and SCOP structure classifications, respectively. The quality and size of these proteins are comparable to those solved in traditional structural biology and, despite huge scope for duplicated efforts, only 14% of targets have a close homologue (≥30% sequence identity) solved by another consortium. Analysis of CATH and SCOP revealed the significant contribution that structural genomics is making to the coverage of superfamilies and folds. A total of 67% of SG domains in CATH are unique, lacking an already characterised close homologue in the PDB, whereas only 21% of non-SG domains are unique. For 29% of domains, structure determination revealed a remote evolutionary relationship not apparent from sequence, and 19% and 11% contributed new superfamilies and folds, respectively. The secondary structure class, fold and superfamily distributions of this dataset reflect those of the genomes. The domains fall into 172 different folds and 259 superfamilies in CATH but the distribution is highly skewed. The most populous of these are those that recur most frequently in the genomes. Whilst 11% of superfamilies are bacteria-specific, most are common to all three superkingdoms of life and together the 316 PDB entries have provided new and reliable homology models for 9287 non-redundant gene sequences in 206 completely sequenced genomes. From the perspective of this analysis, it appears that structural genomics is on track to be a success, and it is hoped that this work will inform future directions of the field.

5.

Background  

Annotation of sequences that share little similarity to sequences of known function remains a major obstacle in genome annotation. Some of the best methods of detecting remote relationships between protein sequences are based on matching sequence profiles. We analyse the superfamily specific performance of sequence profile-profile matching. Our benchmark consists of a set of 16 protein superfamilies that are highly diverse at the sequence level. We relate the performance to the number of sequences in the profiles, the profile diversity and the extent of structural conservation in the superfamily.

6.
7.
We present a systematic study of the clustering of genes within the human genome based on homology inferred from both sequence and structural similarity. The 3D-Genomics automated proteome annotation pipeline was utilised to infer homology for each protein domain in the genome, for the 26 superfamilies most highly represented in the Structural Classification Of Proteins (SCOP) database. This approach enabled us to identify homologues that could not be detected by sequence-based methods alone. For each superfamily, we investigated the distribution, both within and among chromosomes, of genes encoding at least one domain within the superfamily. The results indicate a diversity of clustering behaviours: some superfamilies showed no evidence of any clustering, and others displayed significant clustering either within or among chromosomes, or both. Removal of tandem repeats reduced the levels of clustering observed, but some superfamilies still displayed highly significant clustering. Thus, our study suggests that either the process of gene duplication, or the evolution of the resulting clusters, differs between structural superfamilies.

8.
There is a limited repertoire of domain families that are duplicated and combined in different ways to form the set of proteins in a genome. Proteins are gene products, and at the level of genes, duplication, recombination, fusion and fission are the processes that produce new genes. We attempt to gain an overview of these processes by studying the evolutionary units in proteins, domains, in the protein sequences of 40 genomes. The domain and superfamily definitions in the Structural Classification of Proteins Database are used, so that we can view all pairs of adjacent domains in genome sequences in terms of their superfamily combinations. We find 783 out of the 859 superfamilies in SCOP in these genomes, and the 783 families occur in 1307 pairwise combinations. Most families are observed in combination with one or two other families, while a few families are very versatile in their combinatorial behaviour; 209 families do not make combinations with other families. This type of pattern can be described as a scale-free network. We also study the N to C-terminal orientation of domain pairs and domain repeats. The phylogenetic distribution of domain combinations is surveyed, to establish the extent of common and kingdom-specific combinations. Of the kingdom-specific combinations, significantly more combinations consist of families present in all three kingdoms than of families present in one or two kingdoms. Hence, we are led to conclude that recombination between common families, as compared to the invention of new families and recombination among these, has also been a major contribution to the evolution of kingdom-specific and species-specific functions in organisms in all three kingdoms. Finally, we compare the set of the domain combinations in the genomes to those in the RCSB Protein Data Bank, and discuss the implications for structural genomics.
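The counting of pairwise superfamily combinations described in items 1 and 8 can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the data layout (each protein as an N-to-C ordered list of superfamily labels), the function names and the toy data are assumptions, and pairs are treated as unordered here even though item 8 analyses N-to-C orientation separately.

```python
from collections import Counter


def adjacent_pairs(architecture):
    """Yield unordered pairs of neighbouring superfamilies, skipping repeats (A next to A)."""
    for left, right in zip(architecture, architecture[1:]):
        if left != right:
            yield left, right


def combination_network(proteins):
    """Map each superfamily to the set of superfamilies it is found adjacent to."""
    partners = {}
    for architecture in proteins.values():
        for superfamily in architecture:
            # Register every superfamily, even one that never pairs with another.
            partners.setdefault(superfamily, set())
        for left, right in adjacent_pairs(architecture):
            partners[left].add(right)
            partners[right].add(left)
    return partners


def degree_distribution(partners):
    """Count how many superfamilies have 0, 1, 2, ... distinct partners."""
    return Counter(len(neighbours) for neighbours in partners.values())


# Hypothetical toy genome: superfamily labels A-E are placeholders.
proteins = {
    "p1": ["A", "B"],
    "p2": ["A", "C"],
    "p3": ["A", "D", "D"],  # D-D is a domain repeat, not a combination
    "p4": ["E"],            # single-domain protein
    "p5": ["B", "A"],       # same combination as p1, opposite orientation
}

partners = combination_network(proteins)
n_combinations = sum(len(n) for n in partners.values()) // 2
print(n_combinations, dict(degree_distribution(partners)))
```

In the toy data one versatile family (A) has three partners, three families have one partner each and one has none. A real analysis would inspect the full degree distribution for the heavy tail that a scale-free description implies.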

9.
The age of genomics has given us a wealth of information and the tools to study whole genomes. This, in turn, has facilitated genome-wide studies among organisms that were relatively less studied in the pre-genomic era or are non-model organisms. This paves the way to the discovery of interesting evolutionary patterns, which are brought to light by genome-wide surveys of protein superfamilies. Phosphorylation is a post-translational modification that is utilised across all clades of life, and acts as an important signalling switch, regulating several cellular processes. Tyrosine phosphatases, which are found predominantly in eukaryotes, act on phosphorylated tyrosine residues and sometimes on other substrates. Extending our previous effort to look for tyrosine phosphatases in the human genome, we have looked for sequences of the cysteine-based tyrosine phosphatase superfamily in thirty mammalian genomes from across Mammalia and validated the sequences by the presence of the signature catalytic motif. Domain architecture annotation, followed by in-depth analysis, revealed interesting taxon-specific patterns such as subtle differences between the protein families in marsupials and early mammals versus placental mammals. Finally, we discuss an interesting case of loss of the tyrosine phosphatase domain from a gene product in the course of eutherian evolution.

10.
Phylogenomics of prokaryotic ribosomal proteins   (Total citations: 1; self-citations: 0; citations by others: 1)
Yutin N, Puigbò P, Koonin EV, Wolf YI. PLoS One 2012, 7(5): e36972
Archaeal and bacterial ribosomes contain more than 50 proteins, including 34 that are universally conserved in the three domains of cellular life (bacteria, archaea, and eukaryotes). Despite the high sequence conservation, annotation of ribosomal (r-) protein genes is often difficult because of their short lengths and biased sequence composition. We developed an automated computational pipeline for identification of r-protein genes and applied it to 995 completely sequenced bacterial and 87 archaeal genomes available in the RefSeq database. The pipeline employs curated seed alignments of r-proteins to run position-specific scoring matrix (PSSM)-based BLAST searches against six-frame genome translations, mitigating possible gene annotation errors. As a result of this analysis, we performed a census of prokaryotic r-protein complements, enumerated missing and paralogous r-proteins, and analyzed the distributions of ribosomal protein genes among chromosomal partitions. Phyletic patterns of bacterial and archaeal r-protein genes were mapped to phylogenetic trees reconstructed from concatenated alignments of r-proteins to reveal the history of likely multiple independent gains and losses. These alignments, available for download, can be used as search profiles to improve genome annotation of r-proteins and for further comparative genomics studies.

11.
1. Size variation is a ubiquitous feature of animal populations and is predicted to strongly influence species abundance and dynamics; however, the factors that determine size variation are not well understood. 2. In a mesocosm experiment, we found that the relationship between mean and variation in wood frog (Rana sylvatica) tadpole size is qualitatively different at different levels of competition created by manipulating resource supply rates or tadpole density. At low competition, relative size variation (as measured by the coefficient of variation) decreased as a function of mean size, while at high competition, relative size variation increased. Therefore, increased competition magnified differences in individual performance as measured by growth rate. 3. A model was developed to estimate the contribution of size-dependent factors (i.e. based on size alone) and size-independent factors (i.e. resulting from persistent inherent phenotypic differences other than size that affect growth) to the empirical patterns. 4. Model analysis of the low competition treatment indicated that size-dependent factors alone can describe the relationship between mean size and size variation. To fit the data, the size scaling exponent that describes the dependence of growth rate on size was determined. The estimated value, 0.83, is in the range of that derived from physiological studies. 5. At high competition, the model analysis indicated that individual differences in foraging ability, either size-based or due to inherent phenotypic differences (size-independent factors), were much more pronounced than at low competition. The model was used to quantify the changes in size-dependent or size-independent factors that underlie the effect of competition on size variation. In contrast to results at low competition, parameters derived from physiological studies could not be used to describe the observed relationships. 6. Our experimental and model results elucidate the role of size-dependent and size-independent factors in the development of size variation, and highlight and quantify the context dependence of individual (intrapopulation) differences in competitive abilities.
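The low-competition result in item 11, where relative size variation falls as mean size rises when only size-dependent factors act, can be illustrated with a short sketch. The abstract does not give the model's equations, so the power-law growth form dS/dt = a·S^b, the rate constant `a`, the time units and the initial sizes below are all assumptions; only the scaling exponent 0.83 comes from the abstract.

```python
import statistics


def grow(size0, t, a=1.0, b=0.83):
    """Closed-form solution of dS/dt = a * S**b (valid for b != 1)."""
    return (size0 ** (1 - b) + (1 - b) * a * t) ** (1 / (1 - b))


def coefficient_of_variation(sizes):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(sizes) / statistics.mean(sizes)


# Hypothetical initial sizes in arbitrary units.
initial = [0.8, 0.9, 1.0, 1.1, 1.2]

for t in (0, 5, 10):
    sizes = [grow(s, t) for s in initial]
    print(t, round(statistics.mean(sizes), 3), round(coefficient_of_variation(sizes), 3))
```

Because the exponent is below 1, smaller individuals grow faster relative to their size, so the coefficient of variation shrinks as the cohort grows. Reproducing the high-competition pattern would need additional size-independent terms that this sketch does not attempt.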

12.
Ancestral lipid biosynthesis and early membrane evolution   (Total citations: 5; self-citations: 0; citations by others: 5)
Archaea possess unique membrane phospholipids that generally comprise isoprenoid ethers built on sn-glycerol-1-phosphate (G1P). By contrast, bacterial and eukaryal membrane phospholipids are fatty acid esters linked to sn-glycerol-3-phosphate (G3P). The two key dehydrogenase enzymes that produce G1P and G3P, G1PDH and G3PDH, respectively, are not homologous. Various models propose that these enzymes originated during the speciation of the two prokaryotic domains, and the nature (and even the very existence) of lipid membranes in the last universal common ancestor (cenancestor) is subject to debate. G1PDH and G3PDH belong to two separate superfamilies that are universally distributed, suggesting that members of both superfamilies existed in the cenancestor. Furthermore, archaea possess homologues to known bacterial genes involved in fatty acid metabolism and synthesize fatty acid phospholipids. The cenancestor seems likely to have been endowed with membrane lipids whose synthesis was enzymatic but probably non-stereospecific.

13.
Satoshi Fukuchi, Ken Nishikawa. DNA Research 2004, 11(4): 219-31, 311-313
Genome annotation produces a considerable number of putative proteins lacking sequence similarity to known proteins. These are referred to as "orphans." The proportion of orphan genes varies among genomes, and is independent of genome size. In the present study, we show that the proportion of orphan genes roughly correlates with the isolation index of organisms (IIO), an indicator introduced in the present study, which represents the degree of isolation of a given genome as measured by sequence similarity. However, there are outlier genomes with respect to the linear correlation, consisting of those genomes that may contain excess amounts of orphan genes. Comparisons of genome sequences among closely related strains revealed that some of the annotated genes are not conserved, suggesting that they are ORFs occurring by chance. Exclusion of these non-conserved ORFs within closely related genomes improved the correlation between the proportion of orphan genes and the IIO values. Assuming that the correlation holds in general, this relationship was used to estimate the number of "authentic" orphan genes in a genome. Using this definition of authentic orphan genes, the anomalies arising from over-assignments, e.g., the percentages of structural annotations, were corrected for 16 genomes, including those of five archaea.

14.
While it is well accepted that horizontal gene transfer plays an important role in the evolution and the diversification of prokaryotic genomes, many questions remain open regarding its functional mechanisms of action and its interplay with the extant genome. This study addresses the relationship between proteome innovation by horizontal gene transfer and genome content in Proteobacteria. We characterize the transferred genes, focusing on the protein domain compositions and their relationships with the existing protein domain superfamilies in the genome. In agreement with previous observations, we find that the protein domain architectures of horizontally transferred genes are significantly shorter than the genomic average. Furthermore, protein domains that are more common in the total pool of genomes appear to have a proportionally higher chance of being transferred. This suggests that transfer events behave as if they were drawn randomly from a cross-genomic community gene pool, much like gene duplicates are drawn from a genomic gene pool. Finally, horizontally transferred genes carry domains of exogenous families less frequently in larger genomes, although they may still do so more often than expected by chance.

15.
Using a data set of aligned protein domain superfamilies of known three-dimensional structure, we compared the location of interdomain interfaces on the tertiary folds between members of distantly related protein domain superfamilies. The data set analyzed comprises interdomain interfaces, both between domains within a polypeptide chain and between two polypeptide chains. We observe that, in general, the interfaces between protein domains are formed in entirely different locations on the tertiary folds in such pairs. This variation in the location of the interface occurs in protein domains involved in a wide range of functions, such as enzymes, adapters, and domains that bind protein ligands or cofactors. While basic biochemical functionality is preserved at the domain superfamily level, the effect of biochemical function on protein assemblies is different in these protein domains related by superfamily. The divergence between proteins, in most cases, is coupled with domain recruitment, with different modes of interaction with the recruited domain. This is in complete contrast to the observation that in closely related homologous protein domains, the interaction interfaces are almost always topologically equivalent. In a small subset of interacting domains within proteins related by remote homology, we observe that the relative positioning of domains with respect to one another is preserved. Based on the analysis of multidomain proteins of known or unknown structure, we suggest that variation in protein-protein interactions among members of a superfamily could serve as diverging points in otherwise parallel metabolic or signaling pathways. We discuss a few representative cases of diverging pathways involving domains in a superfamily.

16.
As the largest fraction of any proteome does not carry out enzymatic functions, and in order to leverage 3D structural data for the annotation of increasingly higher volumes of sequence data, we wanted to assess the strength of the link between coarse-grained structural data (i.e., homologous superfamily level) and the enzymatic versus non-enzymatic nature of protein sequences. To probe this relationship, we took advantage of 41 phylogenetically diverse (encompassing 11 distinct phyla) genomes recently sequenced within the GEBA initiative, for which we integrated structural information, as defined by CATH, with enzyme level information, as defined by Enzyme Commission (EC) numbers. This analysis revealed that only a very small fraction (about 1%) of domain sequences occurring in the analyzed genomes was associated with homologous superfamilies strongly indicative of enzymatic function. Resorting to less stringent criteria to define enzyme versus non-enzyme biased structural classes, or excluding highly prevalent folds from the analysis, had only a modest effect on this proportion. Thus, the low genomic coverage by structurally anchored protein domains strongly associated with catalytic activities indicates that, on its own, the power of coarse-grained structural information to infer the general property of being an enzyme is rather limited.

17.
Somogyi K, Sipos B, Pénzes Z, Andó I. FEBS Letters 2010, 584(21): 4375-4378
The Nimrod gene superfamily is an important component of the innate immune response. The majority of its member genes are located in close proximity within the Drosophila melanogaster genome and they lie in a larger conserved cluster ("Nimrod cluster"), made up of non-related groups (families, superfamilies) of genes. This cluster has been a part of arthropod genomes for about 300-350 million years. The available data suggest that the Nimrod cluster is a functional module of the insect innate immune response.

18.
Sixty-five families of glycosyltransferases (EC 2.4.x.y) have been recognized on the basis of high sequence similarity to a founding member with experimentally demonstrated enzymatic activity. Although distant sequence relationships between some of these families have been reported, the natural history of glycosyltransferases is poorly understood. We used iterative searches of sequence databases, motif extraction, structural comparison, and analysis of completely sequenced genomes to track the origins of modern-type glycosyltransferases. We show that >75% of recognized glycosyltransferase families belong to one of only three monophyletic superfamilies of proteins, namely, (1) a recently described GPGTF/GT-B superfamily; (2) a nucleoside-diphosphosugar transferase (GT-A) superfamily, which is characterized by a DxD sequence signature and also includes nucleotidyltransferases; and (3) a GT-C superfamily of integral membrane glycosyltransferases with a modified DxD signature in the first extracellular loop. Several developmental regulators in Metazoans, including Fringe and Egghead homologs, belong to the second superfamily. Interestingly, the Tout-velu/Exostosin family of developmental proteins, found in all multicellular eukaryotes, contains separate domains belonging to the first and the second superfamilies, explaining multiple glycosyltransferase activities in one protein.

19.
Restauro-G: A Rapid Genome Re-Annotation System for Comparative Genomics   (Total citations: 1; self-citations: 0; citations by others: 1)
Annotations of complete genome sequences submitted directly from sequencing projects are diverse in terms of annotation strategies and update frequencies. These inconsistencies make comparative studies difficult. To allow rapid data preparation of a large number of complete genomes, automation and speed are important for genome re-annotation. Here we introduce an open-source rapid genome re-annotation software system, Restauro-G, specialized for bacterial genomes. Restauro-G re-annotates a genome by similarity searches utilizing the BLAST-Like Alignment Tool (BLAT), referring to protein databases such as UniProtKB, NCBI nr, NCBI COGs, Pfam, and PSORTb. Re-annotation by Restauro-G achieved over 98% accuracy for most bacterial chromosomes in comparison with the original manually curated annotation of EMBL releases. Restauro-G was developed in the generic bioinformatics workbench G-language Genome Analysis Environment and is distributed at http://restauro-g.iab.keio.ac.jp/ under the GNU General Public License.

20.
The interneuronal network that produces local bending in the leech is distributed, in the sense that most of the interneurons involved are activated in all forms of local bending, even those in which their outputs would produce inappropriate movements. Such networks have been found to control a number of different behaviors in a variety of animals. This article reviews three issues: the physiological and modeling observations that led to the conclusion that local bending in leeches is controlled by a distributed system; what distributed processing means for this and other behaviors; and why the leech interneuronal network may have evolved to be distributed in the first place. © 1995 John Wiley & Sons, Inc.
