Similar articles
1.
Restriction endonucleases and other nucleic acid cleaving enzymes form a large and extremely diverse superfamily that displays little sequence similarity despite retaining a common core fold responsible for cleavage. The lack of significant sequence similarity between protein families makes homology inference a challenging task and hinders new family identification with traditional sequence-based approaches. Using the consensus fold recognition method Meta-BASIC, which combines sequence profiles with predicted protein secondary structure, we identify nine new restriction endonuclease-like fold families among previously uncharacterized proteins and predict these proteins to cleave nucleic acid substrates. Application of transitive searches combined with gene neighborhood analysis allows us to confidently link these unknown families to a number of known restriction endonuclease-like structures and thus assign folds to the uncharacterized proteins. Finally, our method identifies a novel restriction endonuclease-like domain in the C-terminus of RecC that is not detected with structure-based searches of the existing PDB.

2.
High divergence in protein sequences makes the detection of distant protein relationships through homology-based approaches challenging. Grouping protein sequences into families, through similarities in either sequence or 3-D structure, facilitates improved recognition of protein relationships. In addition, strategically designed protein-like sequences have been shown to bridge distant structural domain families by serving as artificial linkers. In this study, we have augmented a search database of known protein domain families with such designed sequences, with the intention of providing functional clues to domain families of unknown structure. When assessed using representative query sequences from each family, we obtain a success rate of 94% in protein domain families of known structure. Further, we demonstrate that the augmented search space enabled fold recognition for 582 families with no structural information available a priori. Additionally, we were able to provide reliable functional relationships for 610 orphan families. We discuss the application of our method in predicting functional roles through select examples for DUF4922, DUF5131, and DUF5085. Our approach also detects new associations between families that were previously not known to be related, as demonstrated through new sub-groups of the RNA polymerase domain among three distinct RNA viruses. Taken together, search databases augmented with designed sequences direct the detection of meaningful relationships between distant protein families. In turn, they enable fold recognition and offer reliable pointers to potential functional sites that may be probed further through direct mutagenesis studies.

3.
The ribonuclease H-like (RNHL) superfamily, also called the retroviral integrase superfamily, groups together numerous enzymes involved in nucleic acid metabolism and implicated in many biological processes, including replication, homologous recombination, DNA repair, transposition and RNA interference. The RNHL superfamily proteins show extensive divergence of sequences and structures. We conducted database searches to identify members of the RNHL superfamily (including those previously unknown), yielding >60 000 unique domain sequences. Our analysis led to the identification of new RNHL superfamily members, such as RRXRR (PF14239), DUF460 (PF04312, COG2433), DUF3010 (PF11215), DUF429 (PF04250 and COG2410, COG4328, COG4923), DUF1092 (PF06485), COG5558, OrfB_IS605 (PF01385, COG0675) and Peptidase_A17 (PF05380). Based on the clustering analysis, we grouped all identified RNHL domain sequences into 152 families. Phylogenetic studies revealed relationships between these families and suggested a possible history of the evolution of the RNHL fold and its active site. Our results revealed a clear division of the RNHL superfamily into exonucleases and endonucleases. Structural analyses of features characteristic of particular groups revealed a correlation of the orientation of the C-terminal helix with the exonuclease/endonuclease function and the architecture of the active site. Our analysis provides a comprehensive picture of sequence-structure-function relationships in the RNHL superfamily that may guide functional studies of the previously uncharacterized protein families.

4.
5.
In addition to one hypothetical viral sequence from Bacteriophage KVP40, the PfamA family of unknown function DUF458 (Pfam Accession No. PF04308) encompasses several uncharacterized bacterial proteins, including the Bacillus subtilis YkuK protein. Using Meta-BASIC, a highly sensitive method for detection of distant similarity between proteins, we assign DUF458 family members to the ribonuclease H-like (RNase H-like) superfamily. DUF458 sequences maintain all core secondary structure elements of the RNase H-like fold and share several conserved, presumably active site residues with RNase HI, including an invariant DDE motif. In addition to providing a model structure for a previously uncharacterized protein family, this finding suggests that DUF458 proteins function as nucleases. The unusual phyletic pattern, together with the presence of DUF458 in several thermophilic organisms, may suggest a potential role of these proteins in DNA repair under stressful conditions such as extreme heat or other stress that causes spore formation.

6.
The genome sciences face the challenge of characterizing the structure and function of a vast number of novel genes. Sequence search techniques are used to infer functional and structural information from similarities to experimentally characterized genes or proteins. The persistent goal is to refine these techniques and to develop alternative and complementary methods to increase the range of reliable inference. Here, we focus on the structural and functional assignments that can be inferred from the known three-dimensional structures of proteins. The study uses all structures in the Protein Data Bank that were known by the end of 1997. The protein structures released in 1998 were then characterized in terms of functional and structural similarity to the previously known structures, yielding an estimate of the maximum amount of information on novel protein sequences that can be obtained from inference techniques. The 147 globular proteins corresponding to 196 domains released in 1998 have no clear sequence similarity to previously known structures. However, 75% of the domains have extensive structure similarity to previously known folds, and most importantly, in two out of three cases similarity in structure coincides with related function. In view of this analysis, full utilization of existing structure databases would provide information for many new targets even if the relationship is not accessible from sequence information alone. Currently, the most sophisticated techniques detect on the order of one-third of these relationships.

7.
DUF538 (domain of unknown function 538) proteins are known as a group of putative hypothetical proteins in a wide range of plant species. They have been identified in some plants challenged with various environmental stresses. However, little is known about their functional properties. They have recently been predicted to have binding capacity and esterase-type hydrolytic activity towards bacterial lipopolysaccharides and chlorophyll molecules as carboxylic compounds in plants. In the present study, the binding ability and the methylesterase activity of DUF538 proteins towards pectin molecules were also predicted. Their similarities to pectin methylesterases and their ability to bind pectin molecules were predicted using bioinformatic tools as well as experimental methods. A probable cooperation was speculated between the DUF538 and pectin methylesterase protein families in cell wall-associated defense responses in plants.

8.
A unifold, mesofold, and superfold model of protein fold use. (Total citations: 4; self-citations: 0; citations by others: 4)
As more and more protein structures are determined, there is increasing interest in the question of how many different folds have been used in biology. The history of the rate of discovery of new folds and the distribution of sequence families among known folds provide a means of estimating the underlying distribution of fold use. Previous models exploiting these data have led to rather different conclusions on the total number of folds. We present a new model, based on the notion that the folds used in biology fall naturally into three classes: unifolds, that is, folds found only in a single narrow sequence family; mesofolds, found in an intermediate number of families; and the previously noted superfolds, found in many protein families. We show that this model fits the available data well and has predicted the development of SCOP over the past 2 years. The principal implications of the model are as follows: (1) The vast majority of folds will be found in only a single sequence family; (2) the total number of folds is at least 10,000; and (3) 80% of sequence families have one of about 400 folds, most of which are already known.

9.
Metagenomics projects based on shotgun sequencing of populations of micro-organisms yield insight into protein families. We used sequence similarity clustering to explore proteins with a comprehensive dataset consisting of sequences from available databases together with 6.12 million proteins predicted from an assembly of 7.7 million Global Ocean Sampling (GOS) sequences. The GOS dataset covers nearly all known prokaryotic protein families. A total of 3,995 medium- and large-sized clusters consisting of only GOS sequences are identified, out of which 1,700 have no detectable homology to known families. The GOS-only clusters contain a higher than expected proportion of sequences of viral origin, thus reflecting a poor sampling of viral diversity until now. Protein domain distributions in the GOS dataset and current protein databases show distinct biases. Several protein domains that were previously categorized as kingdom specific are shown to have GOS examples in other kingdoms. About 6,000 sequences (ORFans) from the literature that heretofore lacked similarity to known proteins have matches in the GOS data. The GOS dataset is also used to improve remote homology detection. Overall, besides nearly doubling the number of current proteins, the predicted GOS proteins also add a great deal of diversity to known protein families and shed light on their evolution. These observations are illustrated using several protein families, including phosphatases, proteases, ultraviolet-irradiation DNA damage repair enzymes, glutamine synthetase, and RuBisCO. The diversity added by GOS data has implications for choosing targets for experimental structure characterization as part of structural genomics efforts. Our analysis indicates that new families are being discovered at a rate that is linear or almost linear with the addition of new sequences, implying that we are still far from discovering all protein families in nature.

12.
Processing of exogenous glycerol esters is an initial step in energy derivation for many bacterial cells. Lipid-rich environments settled by a variety of organisms exert strong evolutionary pressure for establishing enzymatic pathways involved in lipid metabolism. However, a certain number of enzymes involved in this process remain unknown since they do not share detectable sequence similarity with any known protein domains. Using distant homology detection and fold recognition we predict that bacterial transmembrane proteins belonging to the uncharacterized domain of unknown function 2319 (DUF2319) family possess the alpha/beta hydrolase fold domain together with the catalytic triad critical for hydrolysis. A detailed analysis of sequence/structure features and genomic context indicates that DUF2319 proteins may be involved in lipid metabolism. Therefore, these enzymes are likely to serve as extracellular lipases.

13.
Using sequence similarity searches and top-of-the-range fold-recognition methods, we have identified a novel family of bacterial transglutaminase-like cysteine proteinases (BTLCPs) with an invariant Cys-His-Asp catalytic triad and a predicted N-terminal signal sequence. This family of previously uncharacterized hypothetical proteins encompasses sequences of unknown function from DUF920 (in the Pfam database) and COG3672. BTLCPs are predicted to possess the papain-like cysteine proteinase fold and catalyze post-translational protein modification through transamidase, acetylase or hydrolase activity. Inspection of neighboring genes encoding BTLCPs suggests a link between this predicted activity and a type-I secretion system resembling ATP-binding cassette exporters of toxins and proteases involved in bacterial pathogenicity.

14.
We have identified two new lysozyme-like protein families by using a combination of sequence similarity searches, domain architecture analysis, and structural predictions. First, the P5 protein from bacteriophage phi8, which belongs to COG3926 and Pfam family DUF847, is predicted to have a new lysozyme-like domain. This assignment is consistent with the lytic function of P5 proteins observed in several related double-stranded RNA bacteriophages. Domain architecture analysis reveals two lysozyme-associated transmembrane modules (LATM1 and LATM2) in a few COG3926/DUF847 members. LATM2 is also present in two proteins containing a peptidoglycan binding domain (PGB) and an N-terminal region that corresponds to COG5526 with uncharacterized function. Second, structure prediction and sequence analysis suggest that COG5526 represents another new lysozyme-like family. Our analysis offers fold and active-site assignments for COG3926/DUF847 and COG5526. The predicted enzymatic activity is consistent with an experimental study on the zliS gene product from Zymomonas mobilis, suggesting that bacterial COG3926/DUF847 members might be activators of macromolecular secretion.

15.
Progress in genome sequencing has led to an increasing number of GenBank submissions of uncharacterized hypothetical genes containing the domain of unknown function DUF985, and none of these genes is related to a known protein. We therefore undertook an experimental study to identify the function of a DUF985 domain-containing hypothetical gene, BbDUF985 (GenBank Accession No. AY273818), isolated from amphioxus Branchiostoma belcheri (B. belcheri). BbDUF985 was successfully expressed in both prokaryotic and eukaryotic systems, and the recombinant proteins expressed in both systems exhibited phosphoglucose isomerase (PGI) activity. Both tissue-section in situ hybridization and immunohistochemistry demonstrated that BbDUF985 was expressed in a tissue-specific manner, with the most abundant levels in the hepatic caecum and ovary. In CHO cells transfected with the expression plasmid pEGFP-N1/BbDUF985, the fusion protein was targeted to the cytoplasm, suggesting that BbDUF985 is a cytosolic protein. In contrast, Western blotting indicated that BbDUF985 was also present in amphioxus humoral fluids, suggesting that it exists as a secreted protein as well. Our study provides a framework for further understanding the biochemical properties and physiological function of DUF985-containing hypothetical proteins in other species.

16.
There are 10 genes in the Arabidopsis genome that contain a domain described in the Pfam database as domain of unknown function 579 (DUF579). Although DUF579 is widely distributed in eukaryotic species, there is no direct experimental evidence to assign a function to it. Five of the 10 Arabidopsis DUF579 family members are co-expressed with marker genes for secondary cell wall formation. Plants in which two closely related members of the DUF579 family have been disrupted by T-DNA insertions contain less xylose in the secondary cell wall as a result of decreased xylan content, and exhibit mildly distorted xylem vessels. Consequently, we have named these genes IRREGULAR XYLEM 15 (IRX15) and IRX15L. These mutant plants exhibit many features of previously described xylan synthesis mutants, such as the replacement of glucuronic acid side chains with methylglucuronic acid side chains. By contrast, immunostaining of xylan and transmission electron microscopy (TEM) reveal that the walls of these irx15 irx15l double mutants are disorganized, compared with the wild type or other previously described xylan mutants, and exhibit dramatic increases in the quantity of sugar released in cell wall digestibility assays. Furthermore, localization studies using fluorescent fusion proteins label both the Golgi and an unknown intracellular compartment. These data are consistent with irx15 and irx15l defining a new class of genes involved in xylan biosynthesis. How these genes function during xylan biosynthesis and deposition is discussed.

17.
A number of structural genomics/proteomics initiatives are focused on bacterial or viral pathogens. In this article, we will review the progress of structural proteomics initiatives targeting the SARS coronavirus (SARS-CoV), the etiological agent of the 2003 worldwide epidemic that culminated in approximately 8,000 cases and 800 deaths. The SARS-CoV genome encodes 28 proteins in three distinct classes, many of them with unknown function and sharing low similarity to other proteins. The structures of 16 SARS-CoV proteins or functional domains have been determined to date. Remarkably, eight of these 16 proteins or functional domains have novel folds, indicating the uniqueness of the coronavirus proteins. The results of SARS-CoV structural proteomics initiatives will have several profound biological impacts, including elucidation of the structure-function relationships of coronavirus proteins; identification of targets for the design of anti-viral compounds against SARS-CoV and other coronaviruses; and addition of new protein folds to the fold space, with further understanding of the structure-function relationships for several new protein families. We discuss the use of structural proteomics in response to emerging infectious diseases such as SARS-CoV and to increase preparedness against future emerging coronaviruses.

18.
Objective: DUF538 (domain of unknown function 538) domain-containing proteins are known as putative hypothetical proteins in plants. To date, there is little information regarding their structure and function. Methods: In the present research work, homologous structures and binding potentials shared by plant/mammalian lipocalins and plant DUF538 protein were identified using bioinformatics and experimental tools including molecular dynamics simulation, molecular docking and recombinant tech...

19.
Many protein classification systems capture homologous relationships by grouping domains into families and superfamilies on the basis of sequence similarity. Superfamilies with similar 3D structures are further grouped into folds. In the absence of discernable sequence similarity, these structural similarities were long thought to have originated independently, by convergent evolution. However, the growth of databases and advances in sequence comparison methods have led to the discovery of many distant evolutionary relationships that transcend the boundaries of superfamilies and folds. To investigate the contributions of convergent versus divergent evolution in the origin of protein folds, we clustered representative domains of known structure by their sequence similarity, treating them as point masses in a virtual 2D space which attract or repel each other depending on their pairwise sequence similarities. As expected, families in the same superfamily form tight clusters. But often, superfamilies of the same fold are linked with each other, suggesting that the entire fold evolved from an ancient prototype. Strikingly, some links connect superfamilies with different folds. They arise from modular peptide fragments of between 20 and 40 residues that co-occur in the connected folds in disparate structural contexts. These may be descendants of an ancestral pool of peptide modules that evolved as cofactors in the RNA world and from which the first folded proteins arose by amplification and recombination. Our galaxy of folds summarizes, in a single image, most known and many yet undescribed homologous relationships between protein superfamilies, providing new insights into the evolution of protein domains.

20.
Progress towards mapping the universe of protein folds (Total citations: 1; self-citations: 0; citations by others: 1)
Although the precise aims differ between the various international structural genomics initiatives currently aiming to illuminate the universe of protein folds, many selectively target protein families for which the fold is unknown. How well can the current set of known protein families and folds be used to estimate the total number of folds in nature, and will structural genomics initiatives yield representatives for all the major protein families within a reasonable time scale?
