Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
Protein functional annotation relies on the identification of accurate relationships, sequence divergence being a key factor. This is especially evident when distant protein relationships are demonstrated only with three-dimensional structures. To address this challenge, we describe a computational approach to purposefully bridge gaps between related protein families through directed design of protein-like “linker” sequences. For this, we represented SCOP domain families, integrated with sequence homologues, as multiple profiles and performed HMM-HMM alignments between related domain families. Where convincing alignments were achieved, we applied a roulette wheel-based method to design 3,611,010 protein-like sequences corresponding to 374 SCOP folds. To analyze their ability to link proteins in homology searches, we used 3024 queries to search two databases, one containing only natural sequences and another additionally containing designed sequences. Searches of the augmented database showed up to 30% improvement in fold coverage for over 74% of the folds, with 52 folds achieving all theoretically possible connections. Although sequences could not be designed between some families, the availability of designed sequences between other families within the fold established the sequence continuum to demonstrate 373 difficult relationships. Ultimately, as a practical and realistic extension, we demonstrate that such protein-like sequences can be “plugged into” routine and generic sequence database searches to empower not only remote homology detection but also fold recognition. Our statistically well-supported findings show that complementary searches in both databases will increase the effectiveness of sequence-based searches in recognizing all homologues sharing a common fold.
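The abstract names a roulette wheel-based design step but does not say how the wheel is built. As a minimal sketch, assuming each aligned profile column is reduced to a table of residue weights, one residue per column can be drawn with probability proportional to its weight. The names `roulette_pick`, `design_sequence` and `toy_profile` are hypothetical and are not taken from the paper.

```python
import random


def roulette_pick(column, rng):
    """Draw one residue with probability proportional to its weight.

    `column` maps residue letters to non-negative weights (assumed format).
    """
    total = sum(column.values())
    spin = rng.uniform(0, total)  # where the wheel stops
    running = 0.0
    for residue, weight in column.items():
        running += weight
        if weight > 0 and spin <= running:
            return residue
    # Floating-point round-off guard: fall back to the heaviest residue.
    return max(column, key=column.get)


def design_sequence(profile, rng):
    """Build one protein-like sequence by spinning the wheel once per column."""
    return "".join(roulette_pick(column, rng) for column in profile)


# Toy three-column profile (illustrative weights, not real data).
toy_profile = [
    {"A": 0.7, "G": 0.3},
    {"L": 0.5, "I": 0.4, "V": 0.1},
    {"K": 1.0},
]

rng = random.Random(0)  # seeded so repeated runs draw the same sequences
designed = [design_sequence(toy_profile, rng) for _ in range(5)]
```

Repeated draws from the same profile give different but family-like sequences, which is the property a linker-sequence library would rely on under this assumption.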

2.
3.
The ribonuclease H-like (RNHL) superfamily, also called the retroviral integrase superfamily, groups together numerous enzymes involved in nucleic acid metabolism and implicated in many biological processes, including replication, homologous recombination, DNA repair, transposition and RNA interference. The RNHL superfamily proteins show extensive divergence of sequences and structures. We conducted database searches to identify members of the RNHL superfamily (including those previously unknown), yielding >60 000 unique domain sequences. Our analysis led to the identification of new RNHL superfamily members, such as RRXRR (PF14239), DUF460 (PF04312, COG2433), DUF3010 (PF11215), DUF429 (PF04250 and COG2410, COG4328, COG4923), DUF1092 (PF06485), COG5558, OrfB_IS605 (PF01385, COG0675) and Peptidase_A17 (PF05380). Based on the clustering analysis we grouped all identified RNHL domain sequences into 152 families. Phylogenetic studies revealed relationships between these families, and suggested a possible history of the evolution of the RNHL fold and its active site. Our results revealed a clear division of the RNHL superfamily into exonucleases and endonucleases. Structural analyses of features characteristic of particular groups revealed a correlation between the orientation of the C-terminal helix and both the exonuclease/endonuclease function and the architecture of the active site. Our analysis provides a comprehensive picture of sequence-structure-function relationships in the RNHL superfamily that may guide functional studies of the previously uncharacterized protein families.

4.
Vibrio cholerae, an enteropathogenic Gram-negative bacterium, is one of the main causative agents of waterborne diseases such as cholera. About one third of the organism's genome is uncharacterised, with many protein-coding genes lacking structural and functional information. These proteins form a significant fraction of the genome and are crucial in understanding the organism's complete functional makeup. In this study we report the general structure and function of a family of hypothetical proteins, Domain of Unknown Function 3233 (DUF3233), which are conserved across Gram-negative gammaproteobacteria (especially in Vibrio sp. and similar bacteria). Profile and HMM based sequence search methods were used to screen homologues of DUF3233. The I-TASSER fold recognition method was used to build a three-dimensional structural model of the domain. The structure resembles a transmembrane beta-barrel with an axial N-terminal helix and twelve antiparallel beta-strands. Using a combination of amphipathy and discrimination analysis we analysed the potential transmembrane beta-barrel forming properties of DUF3233. Sequence, structure and phylogenetic analyses of DUF3233 indicate that this Gram-negative bacterial hypothetical protein resembles the beta-barrel translocation unit of the autotransporter Va secretory mechanism, with a gene organisation that differs from the conventional Va system.

5.
We have identified two new lysozyme-like protein families by using a combination of sequence similarity searches, domain architecture analysis, and structural predictions. First, the P5 protein from bacteriophage phi8, which belongs to COG3926 and Pfam family DUF847, is predicted to have a new lysozyme-like domain. This assignment is consistent with the lytic function of P5 proteins observed in several related double-stranded RNA bacteriophages. Domain architecture analysis reveals two lysozyme-associated transmembrane modules (LATM1 and LATM2) in a few COG3926/DUF847 members. LATM2 is also present in two proteins containing a peptidoglycan binding domain (PGB) and an N-terminal region that corresponds to COG5526 with uncharacterized function. Second, structure prediction and sequence analysis suggest that COG5526 represents another new lysozyme-like family. Our analysis offers fold and active-site assignments for COG3926/DUF847 and COG5526. The predicted enzymatic activity is consistent with an experimental study on the zliS gene product from Zymomonas mobilis, suggesting that bacterial COG3926/DUF847 members might be activators of macromolecular secretion.

6.
7.
8.
We have determined the crystal structure of hypothetical protein TTHB192 from Thermus thermophilus HB8 at 1.9 Å resolution. This protein is a member of the Escherichia coli ygcH sequence family, which contains approximately 15 sequence homologs of bacterial origin. These homologs have a high isoelectric point. The crystal structure reveals that TTHB192 consists of two independently folded domains, and that each domain exhibits a ferredoxin-like fold with a four-stranded antiparallel beta-sheet packed on one side by alpha-helices. These two tandem domains face each other to generate a beta-sheet platform. TTHB192 displays overall structural similarity to Sex-lethal protein and poly(A)-binding protein fragments. These proteins have RNA binding activity, which is supported by a beta-sheet platform formed by two tandem repeats of an RNA recognition motif domain with signature sequence motifs on the beta-sheet surface. Although TTHB192 does not have the same signature sequence motif as the RNA recognition motif domain, the presence of an evolutionarily conserved basic patch on the beta-sheet platform could be functionally relevant for nucleic acid binding. This report shows that TTHB192 and its sequence homologs adopt an RNA recognition motif-like domain and provides the first testable functional hypothesis for this protein family.

9.
New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries. Methods for detecting remote sequence similarities have also been made more sensitive and this means that we can map domains from these structural families onto genome sequences to understand how these families are distributed throughout the genomes and reveal how they might influence the functional repertoires and biological complexities of the organisms. We have used robust protocols to assign sequences from completed genomes to domain structures in the CATH database, allowing up to 60% of domain sequences in these genomes, depending on the organism, to be assigned to a domain family of known structure. Analysis of the distribution of these families throughout bacterial genomes identified more than 300 universal families, some of which had expanded significantly in proportion to genome size. These highly expanded families are primarily involved in metabolism and regulation and appear to make major contributions to the functional repertoire and complexity of bacterial organisms. When comparisons are made across all kingdoms of life, we find a smaller set of universal domain families (approx. 140), of which families involved in protein biosynthesis are the largest conserved component. Analysis of the behaviour of other families reveals that some (e.g. those involved in metabolism, regulation) have remained highly innovative during evolution, making it harder to trace their evolutionary ancestry. Structural analyses of metabolic families provide some insights into the mechanisms of functional innovation, which include changes in domain partnerships and significant structural embellishments leading to modulation of active sites and protein interactions.

10.
In addition to one hypothetical viral sequence from Bacteriophage KVP40, the PfamA family of unknown function DUF458 (Pfam Accession No. PF04308) encompasses several uncharacterized bacterial proteins, including the Bacillus subtilis YkuK protein. Using Meta-BASIC, a highly sensitive method for detection of distant similarity between proteins, we assign DUF458 family members to the ribonuclease H-like (RNase H-like) superfamily. DUF458 sequences maintain all core secondary structure elements of the RNase H-like fold and share several conserved, presumably active site residues with RNase HI, including an invariant DDE motif. In addition to providing a model structure for a previously uncharacterized protein family, this finding suggests that DUF458 proteins function as nucleases. The unusual phyletic pattern, together with the presence of DUF458 in several thermophilic organisms, may suggest a potential role of these proteins in DNA repair under stressful conditions such as extreme heat or other stress that causes spore formation.

11.
Processing of exogenous glycerol esters is an initial step in energy derivation for many bacterial cells. Lipid-rich environments settled by a variety of organisms exert strong evolutionary pressure for establishing enzymatic pathways involved in lipid metabolism. However, a certain number of enzymes involved in this process remain unknown since they do not share detectable sequence similarity with any known protein domains. Using distant homology detection and fold recognition we predict that bacterial transmembrane proteins belonging to the uncharacterized domain of unknown function 2319 (DUF2319) family possess the alpha/beta hydrolase fold domain together with the catalytic triad critical for hydrolysis. A detailed analysis of sequence/structure features and genomic context indicates that DUF2319 proteins may be involved in lipid metabolism. Therefore, these enzymes are likely to serve as extracellular lipases.

12.
Dicer or Dicer-like (DCL) protein is a catalytic component of the microRNA (miRNA) and small interfering RNA (siRNA) processing pathways, and the structures of some of its fragments have been solved. However, the structure and function of the unique DUF283 domain within Dicer are largely unknown. Here we report the first structure of the DUF283 domain from Arabidopsis thaliana DCL4. The DUF283 domain adopts an α-β-β-β-α topology and structurally resembles the double-stranded RNA-binding domain. Notably, the N-terminal α helix of DUF283 crosses over the C-terminal α helix orthogonally; therefore, the N- and C-termini of DUF283 are in close proximity. Biochemical analysis shows that the DUF283 domain of DCL4 displays weak dsRNA binding affinity and specifically binds to double-stranded RNA-binding domain 1 (dsRBD1) of Arabidopsis DRB4, whereas the DUF283 domain of DCL1 specifically binds to dsRBD2 of Arabidopsis HYL1. These data suggest a potential functional role of the Arabidopsis DUF283 domain in target selection in small RNA processing.

13.
Lee D, Grant A, Marsden RL, Orengo C. Proteins 2005, 59(3):603-615
Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244) and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As with earlier studies by other researchers, the distribution of domain families shows power-law behavior such that the largest 2,000 domain families can be mapped to approximately 70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While approximately 50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (< 10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g. COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structural genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/.
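The second clustering stage named above, complete linkage clustering by sequence identity, can be illustrated with a naive sketch. This is not PFscape's implementation: the identity measure (matching positions in equal-length, pre-aligned strings), the threshold and the function names are all assumptions made for illustration.

```python
from itertools import combinations


def identity(a, b):
    """Fraction of identical positions in two equal-length, pre-aligned strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def complete_linkage(seqs, threshold):
    """Merge clusters while the *least* similar cross-cluster pair stays >= threshold."""
    clusters = [[s] for s in seqs]
    while True:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            # Complete linkage: a cluster pair is only as similar as its worst pair.
            link = min(identity(a, b) for a in clusters[i] for b in clusters[j])
            if link >= threshold and (best is None or link > best[0]):
                best = (link, i, j)
        if best is None:
            return clusters  # no remaining pair of clusters may be merged
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected


# Toy input (illustrative strings, not real sequences).
toy = ["AAAA", "AAAT", "TTTT", "TTTA"]
families = complete_linkage(toy, threshold=0.7)
```

Because every member of a merged cluster must meet the threshold against every other member, complete linkage yields tight families, at the cost of a quadratic comparison per merge in this naive form.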

14.
The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath_new) currently contains 34 287 domain structures classified into 1383 superfamilies and 3285 sequence families. Each structural family is expanded with domain sequence relatives recruited from GenBank using a variety of efficient sequence search protocols and reliable thresholds. This extended resource, known as the CATH-protein family database (CATH-PFDB), contains a total of 310 000 domain sequences classified into 26 812 sequence families. New sequence search protocols have been designed, based on these intermediate sequence libraries, to allow more regular updating of the classification. Further developments include the adaptation of a recently developed method for rapid structure comparison, based on secondary structure matching, for domain boundary assignment. The philosophy behind CATHEDRAL is the recognition of recurrent folds already classified in CATH. Benchmarking of CATHEDRAL, using manually validated domain assignments, demonstrated that 43% of domain boundaries could be completely automatically assigned. This is an improvement on a previous consensus approach for which only 10-20% of domains could be reliably processed in a completely automated fashion. Since domain boundary assignment is a significant bottleneck in the classification of new structures, CATHEDRAL will also help to increase the frequency of CATH updates.

15.
In multi‐domain proteins, the domains typically run end‐to‐end, that is, one domain follows the C‐terminus of another domain. However, approximately 10% of multi‐domain proteins are formed by insertion of one domain sequence into that of another domain. Detecting such insertions within protein sequences is a fundamental challenge in structural biology. The haloacid dehalogenase superfamily (HADSF) serves as a challenging model system wherein a variable cap domain (~5–200 residues in length) accessorizes the ubiquitous Rossmann‐fold core domain, with variations in insertion site and topology corresponding to different classes of cap types. Herein, we describe a comprehensive computational strategy, CapPredictor, for determining large, variable domain insertions in protein sequences. Using a novel sequence‐alignment algorithm in conjunction with a structure‐guided sequence profile from 154 core‐domain‐only structures, more than 40,000 HADSF member sequences were assigned cap types. The resulting data set afforded insight into HADSF evolution. Notably, a similar distribution of cap‐type classes across different phyla was observed, indicating that all cap types existed in the last universal common ancestor. In addition, comparative analyses of the predicted cap‐type and functional assignments showed that different cap types carry out similar chemistries. Thus, while cap domains play a role in substrate recognition and chemical reactivity, cap‐type does not strictly define functional class. Through this example, we have shown that CapPredictor is an effective new tool for the study of form and function in protein families where domain insertion occurs. Proteins 2014; 82:1896–1906. © 2014 Wiley Periodicals, Inc.

16.
Extracting protein alignment models from the sequence database. Cited by: 16 (self-citations: 2, citations by others: 14)
Biologists often gain structural and functional insights into a protein sequence by constructing a multiple alignment model of the family. Here a program called Probe fully automates this process of model construction starting from a single sequence. Central to this program is a powerful new method to locate and align only those, often subtly, conserved patterns essential to the family as a whole. When applied to randomly chosen proteins, Probe found on average about four times as many relationships as a pairwise search and yielded many new discoveries. These include: an obscure subfamily of globins in the roundworm Caenorhabditis elegans; two new superfamilies of metallohydrolases; a lipoyl/biotin swinging arm domain in bacterial membrane fusion proteins; and a DH domain in the yeast Bud3 and Fus2 proteins. By identifying distant relationships and merging families into superfamilies in this way, this analysis further confirms the notion that proteins evolved from relatively few ancient sequences. Moreover, this method automatically generates models of these ancient conserved regions for rapid and sensitive screening of sequences.

17.
Bacterial competence, which can be natural or induced, allows the uptake of exogenous double-stranded DNA (dsDNA) into a competent bacterium. This process is known as transformation. A multiprotein assembly binds and processes the dsDNA to import one strand and degrade the other, yet the underlying molecular mechanisms are relatively poorly understood. Here distant relationships of domains in Competence protein EC (ComEC) of Bacillus subtilis (Uniprot: P39695) were characterized. DNA‐protein interactions were investigated in silico by analyzing models for structural conservation, surface electrostatics and structure‐based DNA binding propensity; and by data‐driven macromolecular docking of DNA to models. Our findings suggest that the DUF4131 domain contains a cryptic DNA‐binding OB fold domain and that the β‐lactamase‐like domain is the hitherto cryptic competence nuclease. Proteins 2016; 84:1431–1442. © 2016 The Authors Proteins: Structure, Function, and Bioinformatics Published by Wiley Periodicals, Inc.

18.
To understand the molecular basis of the catalytic mechanism of glycosyltransferases (GTFs), extensive structural information is required. Here, fold recognition methods were employed to assign 3D protein shapes (folds) to the currently known GTF sequences, available in public databases such as GenBank and Swissprot. First, GTF sequences were retrieved and classified into clusters, based on sequence similarity only. The intracluster sequence similarity threshold was set high enough to ensure that the same fold is found within a given cluster. Then, a representative sequence from each cluster was selected to compose a subset of GTF sequences. The members of this reduced set were processed by three different fold recognition methods: 3D-PSSM, FUGUE, and GeneFold. Finally, the results from different fold recognition methods were analyzed and compared to sequence-similarity search methods (i.e., BLAST and PSI-BLAST). It was established that the folds of about 70% of all currently known GTF sequences can be confidently assigned by fold recognition methods, a value which is higher than the fold identification rate based on sequence comparison alone (48% for BLAST and 64% for PSI-BLAST). The identified folds were submitted to 3D clustering, and we found that most of the GTF sequences adopt the typical GTF A or GTF B folds. Our results indicate a lack of evidence that new GTF folds (i.e., folds other than GTF A and B) exist. Based on cases where fold identification was not possible, we suggest several sequences as the most promising targets for a structural genomics initiative focused on the GTF protein family.

19.
Pseudomonas aeruginosa is an opportunistic pathogen commonly found in humans and other organisms and is an important cause of infection, especially in patients with compromised immune defense mechanisms. The PA3611 gene of P. aeruginosa PAO1 encodes a secreted protein of unknown function, which has been recently classified into a small Pseudomonas‐specific protein family called DUF4146. As part of our effort to extend structural coverage of novel protein space and provide a structure‐based functional insight into new protein families, we report the crystal structure of PA3611, the first structural representative of the DUF4146 protein family. Proteins 2014; 82:1086–1092. © 2013 Wiley Periodicals, Inc.

20.
Qi Y, Grishin NV. Proteins 2005, 58(2):376-388
Protein structure classification is necessary to comprehend the rapidly growing structural data for better understanding of protein evolution and sequence-structure-function relationships. Thioredoxins are important proteins that ubiquitously regulate cellular redox status and various other crucial functions. We define the thioredoxin-like fold using the structure consensus of thioredoxin homologs and consider all circular permutations of the fold. The search for thioredoxin-like fold proteins in the PDB database identified 723 protein domains. These domains are grouped into eleven evolutionary families based on combined sequence, structural, and functional evidence. Analysis of the protein-ligand structure complexes reveals two major active site locations for the thioredoxin-like proteins. Comparison to existing structure classifications reveals that our thioredoxin-like fold group is broader and more inclusive, unifying proteins from five SCOP folds, five CATH topologies and seven DALI domain dictionary globular folding topologies. Considering these structurally similar domains together sheds new light on the relationships between sequence, structure, function and evolution of thioredoxins.
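To make "all circular permutations of the fold" concrete: if a fold is written as an ordered tuple of secondary structure elements, its circular permutations are the rotations of that tuple. The sketch below is illustrative only; the element labels are placeholders and do not reproduce the consensus fold definition used by the authors.

```python
def circular_permutations(elements):
    """Return every rotation of an ordered tuple of structure elements."""
    return [elements[i:] + elements[:i] for i in range(len(elements))]


def is_circular_permutation(a, b):
    """True if tuple b is some rotation of tuple a."""
    return len(a) == len(b) and b in circular_permutations(a)


# Placeholder element order (hypothetical labels, not the published consensus).
toy_fold = ("b1", "a1", "b2", "a2")
rotations = circular_permutations(toy_fold)
```

A fold with n elements has at most n distinct rotations, so checking a candidate domain against all of them stays cheap even when done for every domain in a structure database.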
