Similar literature
 Found 20 similar articles (search time: 539 ms)
1.
PASS2 is a nearly automated version of CAMPASS and contains sequence alignments of proteins grouped at the level of superfamilies. The database was created to correspond to the SCOP database (release 1.53) and currently consists of 110 multi-member superfamilies and 613 single-member superfamilies. In multi-member superfamilies, only protein chains with no more than 25% sequence identity have been considered for the alignment; the database therefore aims to provide sequence alignments that represent the 26 219 protein domains in SCOP release 1.53. Structure-based sequence alignments have been obtained with COMPARER; the initial equivalences are provided automatically from a MALIGN alignment and subsequently augmented using STAMP4.0. The final sequence alignments have been annotated for structural features using JOY4.0. Several links are provided to other related databases and to genome sequence relatives. The availability of reliable sequence alignments of distantly related proteins despite poor sequence identity, together with the single-member superfamilies, permits better sampling of structures in libraries for fold recognition of new sequences and for understanding the protein structure–function relationships of individual superfamilies. The database can be queried by keywords and also by sequence search, interfaced with PSI-BLAST. Structure-annotated sequence alignments and several structural accessory files can be retrieved for all the superfamilies, including the user-input sequence. The database can be accessed from http://www.ncbs.res.in/%7Efaculty/mini/campass/pass.html.
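The 25% sequence-identity cutoff mentioned above can be illustrated with a minimal sketch. It is not the PASS2 pipeline: the function names, the greedy selection order, and the choice to compute identity over columns where both aligned sequences have a residue are all assumptions made for illustration.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity over columns where both aligned sequences have a residue.

    The denominator (ungapped columns only) is an assumption; the abstract
    does not say how identity is defined.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    matches = sum(1 for x, y in pairs if x == y)
    return 100.0 * matches / len(pairs)


def filter_by_identity(aligned: dict[str, str], max_identity: float = 25.0) -> list[str]:
    """Greedily keep chains whose identity to every already-kept chain is <= max_identity."""
    kept: list[str] = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= max_identity for k in kept):
            kept.append(name)
    return kept
```

A greedy pass depends on input order, so a different ordering of the chains can keep a different representative set.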

2.
To maximise the assignment of function of the proteins encoded by a genome and to aid the search for novel drug targets, there is an emerging need for sensitive methods of predicting protein function on a genome-wide basis. GeneAtlas is an automated, high-throughput pipeline for the prediction of protein structure and function using sequence similarity detection, homology modelling and fold recognition methods. GeneAtlas is described in detail here. To test GeneAtlas, a 'virtual' genome was used: a subset of PDB structures from the SCOP database, in which the functional relationships are known. By building 3D models, GeneAtlas detects additional relationships compared with the sequence-searching method PSI-BLAST. Functionally related proteins with sequence identity below the twilight zone can be recognised correctly.

3.
Protein classification artificial neural system.   Total citations: 2, self-citations: 0, citations by others: 2
A neural network classification method is developed as an alternative approach to the large database search/organization problem. The system, termed Protein Classification Artificial Neural System (ProCANS), has been implemented on a Cray supercomputer for rapid superfamily classification of unknown proteins based on the information content of the neural interconnections. The system employs an n-gram hashing function that is similar to the k-tuple method for sequence encoding. A collection of modular back-propagation networks is used to store the large number of sequence patterns. The system has been trained and tested with the first 2,148 of the 8,309 entries of the annotated Protein Identification Resource protein sequence database (release 29). The entries included the electron transfer proteins and the six enzyme groups (oxidoreductases, transferases, hydrolases, lyases, isomerases, and ligases), with a total of 620 superfamilies. After a total training time of seven Cray central processing unit (CPU) hours, the system has reached a predictive accuracy of 90%. The classification is fast (i.e., 0.1 Cray CPU second per sequence), as it only involves a forward-feeding through the networks. The classification time on a full-scale system embedded with all known superfamilies is estimated to be within 1 CPU second. Although the training time will grow linearly with the number of entries, the classification time is expected to remain low even if there is a 10-100-fold increase of sequence entries. The neural database, which consists of a set of weight matrices of the networks, together with the ProCANS software, can be ported to other computers and made available to the genome community. The rapid and accurate superfamily classification would be valuable to the organization of protein sequence databases and to gene recognition in large sequencing projects.
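As a rough illustration of n-gram sequence encoding of the kind the abstract describes, the sketch below maps a protein sequence to a fixed-length count vector. It is not the ProCANS hashing function: the 20-letter alphabet, the base-20 slot index, and the names `AMINO_ACIDS` and `ngram_vector` are assumptions chosen for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed 20-letter alphabet
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def ngram_vector(sequence: str, n: int = 2) -> list[int]:
    """Count overlapping n-grams into a vector of length 20**n.

    Each n-gram is mapped to a slot by reading it as a base-20 number,
    so every sequence yields an input vector of the same size.
    """
    size = len(AMINO_ACIDS) ** n
    vector = [0] * size
    for start in range(len(sequence) - n + 1):
        gram = sequence[start:start + n]
        if any(aa not in AA_INDEX for aa in gram):
            continue  # skip n-grams containing non-standard residues
        slot = 0
        for aa in gram:
            slot = slot * len(AMINO_ACIDS) + AA_INDEX[aa]
        vector[slot] += 1
    return vector
```

A fixed-length vector of this kind is what allows sequences of different lengths to feed the same network input layer.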

4.
A hint to search for metalloproteins in gene banks   Total citations: 1, self-citations: 0, citations by others: 1
MOTIVATION: With the advent of genome sequencing, a huge database of protein primary sequences has been accumulating. In parallel, a number of tools to investigate and expand upon this information, e.g. reconstructing and building relationships between protein families and superfamilies, have been developed. Metalloproteins are proteins capable of binding one or more metal ions, which are required for their biological function, for regulation of their activities or for structural purposes. Sometimes, metal binding can be observed in vitro but not be physiologically relevant. At present, specific tools for identifying metalloproteins in databases of gene sequences are lacking. RESULTS: In the present work, an approach exploiting metal-binding patterns (MBPs) of metalloproteins present in the Protein Data Bank to search gene banks for new metalloproteins is presented and applied to copper proteins. Nearly 100 different MBPs have been identified and then used for subsequent applications. The ensemble of sequences of the whole PDB is used to assess the potential and limits of the method and to identify levels of confidence for the predictions output by the search. It appears that copper-binding capabilities are identified with a confidence >90% when the percentage of identical amino acids aligned around the MBP by PHI-BLAST is at least 20% with respect to the entire protein domain length. If this percentage is between 10% and 20%, the level of confidence is approximately 50%. Application of the methodology to the entire genome sequences of Pyrococcus furiosus, Escherichia coli, Drosophila melanogaster and Homo sapiens suggests some differentiation between prokaryotes and eukaryotes. SUPPLEMENTARY INFORMATION: A table reporting statistics on the MBP identified; a list of all hits retrieved for the four organisms considered; a figure showing the number of hits for the four organisms as a function of I(d)(Global).
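The confidence levels reported above can be written as a small lookup. This sketch encodes only the two thresholds stated in the abstract; the function name, the inclusive handling of the 10% boundary, and the "not reported" label for values below 10% are assumptions.

```python
def copper_binding_confidence(identity_percent: float) -> str:
    """Map the percentage of identical residues aligned around the MBP
    (relative to the whole domain length) to the reported confidence level."""
    if not 0.0 <= identity_percent <= 100.0:
        raise ValueError("identity_percent must be between 0 and 100")
    if identity_percent >= 20.0:
        return ">90%"
    if identity_percent >= 10.0:  # boundary treated as inclusive (assumption)
        return "~50%"
    return "not reported"  # the abstract gives no figure below 10%
```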

5.
6.

Background  

Inferences about protein function are often made based on sequence homology to other gene products of known activities. This approach is valuable for small families of conserved proteins but can be difficult to apply to large superfamilies of proteins with diverse function. In this study we looked at sequence homology between members of the DJ-1/ThiJ/PfpI superfamily, which includes a human protein of unclear function, DJ-1, associated with inherited Parkinson's disease.

7.
The presence of sequence homologues and the availability of structural information of proteins enable better understanding of the biological function of a protein family. A majority of entries in the protein structural databank are single-member superfamilies, for which it is hard to derive motifs due to the paucity of structural homologues. Important conserved segments for these superfamilies have been identified and compiled into a database, SSToSS (Sequence Structural Templates of Single member Superfamily). Conserved regions, recognized by permitted amino acid exchanges, are mapped on the structure and various structural features (solvent accessibility, secondary structure content, hydrogen bonding and residue packing) are examined. These conserved segments with high structural feature content are projected as sequence-structural templates for the particular superfamily member. Interactive three-dimensional displays of the templates in the three-dimensional structure (in Chime and RASMOL) are provided for better understanding and visualization. In the SSToSS database, we also provide the application of sequence-structural templates in three different areas: multiple-motif based sequence search, multiple sequence alignment and homology modeling. In each case, the inclusion of the sequence-structural templates can give rise to sensitive and accurate results. This enables the inclusion of singletons to provide added value to the recognition of additional members, to comparative modeling and to the design of experiments.

8.
The increasing number of annotated genome sequences in public databases has made it possible to study the length distributions and domain composition of proteins at unprecedented resolution. To identify factors that influence protein length in metazoans, we performed an analysis of all domain-annotated proteins from a total of 49 animal species from Ensembl (v.56) or EnsemblMetazoa (v.3). Our results indicate that protein length constraints are not fixed as a linear function of domain count and can vary based on domain content. The presence of repeating domains was associated with relaxation of the constraints that govern protein length. Conversely, for proteins with unique domains, length constraints were generally maintained with increased domain counts. It is clear that mean (and median) protein length and domain composition vary significantly between metazoans and other kingdoms; however, the connections between function, domain content, and length are unclear. We incorporated Gene Ontology (GO) annotation to identify biological processes, cellular components, or molecular functions that favor the incorporation of multi-domain proteins. Using this approach, we identified multiple GO terms that favor the incorporation of multi-domain proteins; interestingly, several of the GO terms with elevated domain counts were not restricted to a single gene family. The findings presented here represent an important step in resolving the complex relationship between protein length, function, and domain content. The comparison of the data presented in this work to data from other kingdoms is likely to reveal additional differences in the regulation of protein length.

9.
Evolution of function in protein superfamilies, from a structural perspective   Total citations: 29, self-citations: 0, citations by others: 29
The recent growth in protein databases has revealed the functional diversity of many protein superfamilies. We have assessed the functional variation of homologous enzyme superfamilies containing two or more enzymes, as defined by the CATH protein structure classification, by way of the Enzyme Commission (EC) scheme. Combining sequence and structure information to identify relatives, the majority of superfamilies display variation in enzyme function, with 25% of superfamilies in the PDB having members of different enzyme types. We determined the extent of functional similarity at different levels of sequence identity for 486,000 homologous pairs (enzyme/enzyme and enzyme/non-enzyme), with structural and sequence relatives included. For single and multi-domain proteins, variation in EC number is rare above 40% sequence identity, and above 30%, the first three digits may be predicted with an accuracy of at least 90%. For more distantly related proteins sharing less than 30% sequence identity, functional variation is significant, and below this threshold, structural data are essential for understanding the molecular basis of observed functional differences. To explore the mechanisms for generating functional diversity during evolution, we have studied in detail 31 diverse structural enzyme superfamilies for which structural data are available. A large number of variations and peculiarities are observed, at the atomic level through to gross structural rearrangements. Almost all superfamilies exhibit functional diversity generated by local sequence variation and domain shuffling. Commonly, substrate specificity is diverse across a superfamily, whilst the reaction chemistry is maintained. In many superfamilies, the position of catalytic residues may vary despite playing equivalent functional roles in related proteins. The implications of functional diversity within superfamilies for the structural genomics projects are discussed.
More detailed information on these superfamilies is available at http://www.biochem.ucl.ac.uk/bsm/FAM-EC/.
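Comparing EC numbers digit by digit, as in the analysis above, can be sketched as follows. The helper name `shared_ec_levels` and the treatment of "-" as an unassigned level are assumptions made for illustration; this is not the authors' code.

```python
def shared_ec_levels(ec_a: str, ec_b: str) -> int:
    """Return how many leading EC levels (0 to 4) two EC numbers share.

    A "-" placeholder is treated as unassigned, so comparison stops there.
    """
    count = 0
    for x, y in zip(ec_a.split("."), ec_b.split(".")):
        if x != y or x == "-":
            break
        count += 1
    return count
```

Under this convention, two enzymes that agree on the first three digits return 3, which corresponds to the level the abstract reports as predictable with at least 90% accuracy above 30% sequence identity.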

10.
The emergence of genomics; ongoing computational advances; and the development of large-scale sequence, structural, and functional databases have created important new interdisciplinary linkages between molecular evolution, molecular biology, and enzymology. The five minireviews in this series survey advances and challenges in this burgeoning field from complementary perspectives. The series has three major themes. The first is the evolution of enzyme superfamilies, in which members exhibit increasing sequence, structural, and functional divergence with increasing time of divergence from a common ancestor. The second is the evolutionary role of promiscuous enzymes, which, in addition to their primary function, have adventitious secondary activities that frequently provide the starting point for the evolution of new enzymes. The third is the importance of in silico approaches to the daunting challenge of assigning and predicting the functions of the many uncharacterized proteins in the large-scale sequence and structural databases that are now available. A recent computational advance, the use of protein similarity networks that map functional data onto proteins clustered by similarity, is presented as an approach that can improve functional insight and inference. The three themes are illustrated with several examples of enzyme superfamilies, including the amidohydrolase, metallo-β-lactamase, and enolase superfamilies.

11.
What are the selective pressures on protein sequences during evolution? Amino acid residues may be highly conserved for functional or structural (stability) reasons. Theoretical studies have proposed that residues involved in the folding nucleus may also be highly conserved. To test this we are using an experimental "fold approach" to the study of protein folding. This compares the folding and stability of a number of proteins that share the same fold, but have no common amino acid sequence or biological activity. The fold selected for this study is the immunoglobulin-like beta-sandwich fold, which has no specifically conserved function. Four model proteins are used from two distinct superfamilies that share the immunoglobulin-like fold, the fibronectin type III and immunoglobulin superfamilies. Here, the fold approach and protein engineering are used to question the role of a highly conserved tyrosine in the "tyrosine corner" motif that is found ubiquitously and exclusively in Greek key proteins. In the four model beta-sandwich proteins characterised here, the tyrosine is the only residue that is absolutely conserved at equivalent sites. By mutating this position to phenylalanine, we show that the tyrosine hydroxyl is not required to nucleate folding in the immunoglobulin superfamily, whereas it is involved to some extent in early structure formation in the fibronectin type III superfamily. The tyrosine corner is important for stability: mutation to phenylalanine costs between 1.5 and 3 kcal mol⁻¹. We propose that the high level of conservation of the tyrosine is related to the structural restraints of the loop connecting the beta-sheets, representing an evolutionary "cul-de-sac".

12.
Analysis of cellular protein patterns by computer-aided 2-dimensional gel electrophoresis, together with recent advances in protein sequence analysis, has made possible the establishment of comprehensive 2-dimensional gel protein databases that may link protein and DNA information and that offer a global approach to the study of the cell. Using the integrated approach offered by 2-dimensional gel protein databases, it is now possible to reveal phenotype-specific proteins, to microsequence them, to search for homology with previously identified proteins, to clone the cDNAs, to assign partial protein sequence to genes for which the full DNA sequence and the chromosome location are known, and to study the regulatory properties and function of groups of proteins that are coordinately expressed in a given biological process. Human 2-dimensional gel protein databases are becoming increasingly important in view of the concerted effort to map and sequence the entire genome.

13.
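The study of the evolutionary patterns of protein molecules is a focus of molecular evolution research and is important for revealing the origin of life and the mechanisms of evolution. In this paper, for singly wound (单绕) proteins with known three-dimensional structures and species information, hierarchical clustering trees of singly wound samples at different levels were constructed using structural alignment information. The analysis found that proteins with similar functions cluster markedly: samples from the same superfamily generally fall within one major branch, samples from the same family are concentrated in sub-branches under their superfamily, and, under functional constraints, the clustering tree of singly wound samples corresponds well to the species evolutionary tree. The results indicate that the structural evolution of singly wound proteins reflects the constraints of protein function, that the structural differences among singly wound samples with a specific function are species-specific, and that structural evolution carries information about species evolution.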

14.
Enzyme evolution is often constrained by aspects of catalysis. Sets of homologous proteins that catalyze different overall reactions but share an aspect of catalysis, such as a common partial reaction, are called mechanistically diverse superfamilies. The common mechanistic steps and structural characteristics of several of these superfamilies, including the enolase, Nudix, amidohydrolase, and haloacid dehalogenase superfamilies, have been characterized. In addition, studies of mechanistically diverse superfamilies are helping to elucidate mechanisms of functional diversification, such as catalytic promiscuity. Understanding how enzyme superfamilies evolve is vital for accurate genome annotation, predicting protein functions, and protein engineering.

15.
We present, to our knowledge, the first quantitative analysis of functional site diversity in homologous domain superfamilies. Different types of functional sites are considered separately. Our results show that most diverse superfamilies are very plastic in terms of the spatial location of their functional sites. This is especially true for protein–protein interfaces. In contrast, we confirm that catalytic sites typically occupy only a very small number of topological locations. Small-ligand binding sites are more diverse than expected, although in a more limited manner than protein–protein interfaces. In spite of the observed diversity, our results also confirm the previously reported preferential location of functional sites. We identify a subset of homologous domain superfamilies where diversity is particularly extreme, and discuss possible reasons for such plasticity, i.e. structural diversity. Our results do not contradict previous reports of preferential co-location of sites among homologues, but rather point at the importance of not ignoring other sites, especially in large and diverse superfamilies. Data on sites exploited by different relatives, within each well annotated domain superfamily, have been made accessible from the CATH website in order to highlight versatile superfamilies or superfamilies with highly preferential sites. This information is valuable for systems biology, and knowledge of any constraints on protein interactions could help in understanding the dynamic control of networks in which these proteins participate.
The novelty of our work lies in the comprehensive nature of the analysis (we have used a significantly larger dataset than previous studies) and in the fact that in many superfamilies we show that different parts of the domain surface are exploited by different relatives for ligand/protein interactions, particularly in superfamilies which are diverse in sequence and structure, an observation not previously reported on such a large scale. This article is part of a Special Issue entitled: The emerging dynamic view of proteins: Protein plasticity in allostery, evolution and self-assembly.

16.
17.
Statistical analyses of genome sequence-derived protein sequence data can identify amino acid residues that interact between proteins or between domains of a protein. These statistical methods are based on evolution-directed amino acid variation responding to structural and functional constraints in proteins. The identified residues form a basis for determining structure and folding of proteins as well as inferring mechanisms of protein function. When applied to two-component systems, several research groups have shown they can be used to identify the amino acid interactions between response regulators and histidine kinases and the specificity therein. Recently, statistical studies between the HisKA and HATPase-ATP-binding domains of histidine kinases identified amino acid interactions for both the inactive and the active catalytic states of such kinases. The identified interactions generated a model structure for the domain conformation of the active state. This conformation requires an unwinding of a portion of the C-terminal helix of the HisKA domain that destroys the inactive state residue contacts and suggests how signal-binding determines the equilibrium between the inactive and active states of histidine kinases. The rapidly accumulating protein sequence databases from genome, metagenome and microbiome studies are an important resource for functional and structural understanding of proteins and protein complexes in microbes.
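The abstract does not name the statistical method used. As one simple, commonly used stand-in for detecting co-varying alignment columns, the sketch below computes the mutual information between two columns of a multiple sequence alignment. The function name and the choice of mutual information are assumptions, not the method of the cited work.

```python
from collections import Counter
from math import log2


def mutual_information(col_a: str, col_b: str) -> float:
    """Mutual information (in bits) between two alignment columns.

    Each string holds one residue per sequence, in the same sequence order.
    High values suggest the two positions vary together across sequences.
    """
    if len(col_a) != len(col_b) or not col_a:
        raise ValueError("columns must be non-empty and of equal length")
    n = len(col_a)
    freq_a = Counter(col_a)
    freq_b = Counter(col_b)
    freq_ab = Counter(zip(col_a, col_b))
    return sum(
        (c / n) * log2((c / n) / ((freq_a[x] / n) * (freq_b[y] / n)))
        for (x, y), c in freq_ab.items()
    )
```

Raw mutual information is sensitive to phylogenetic bias and small sample sizes, so it should be read only as an illustration of the covariation idea.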

18.
Over the next few years, various genome projects will sequence many new genes and yield many new gene products. Many of these products will have no known function and little, if any, sequence homology to existing proteins. There is reason to believe that a rapid determination of a protein fold, even at low resolution, can aid in the identification of function and expedite the determination of structure at higher resolution. Recently devised NMR methods of measuring residual dipolar couplings provide one route to the determination of a fold. They do this by allowing the alignment of previously identified secondary structural elements with respect to each other. When combined with constraints involving loops connecting elements or other short-range experimental distance information, a fold is produced. We illustrate this approach to protein fold determination on ¹⁵N-labeled Escherichia coli acyl carrier protein using a limited set of ¹⁵N-¹H and ¹H-¹H dipolar couplings. We also illustrate an approach using a more extended set of heteronuclear couplings on a related protein, ¹³C,¹⁵N-labeled NodF protein from Rhizobium leguminosarum.

19.
Thus far, identification of functionally important residues in Type II restriction endonucleases (REases) has been difficult using conventional methods. Even though known REase structures share a fold and a marginally recognizable active site, the overall sequence similarities are statistically insignificant, unless compared among proteins that recognize identical or very similar sequences. Bsp6I is a Type II REase, which recognizes the palindromic DNA sequence 5′GCNGC and cleaves between the cytosine and the unspecified nucleotide in both strands, generating a double-strand break with 5′-protruding single nucleotides. There are no solved structures of REases that recognize similar DNA targets or generate cleavage products with similar characteristics. In straightforward comparisons, the Bsp6I sequence shows no significant similarity to REases with known structures. However, using a fold-recognition approach, we have identified a remote relationship between Bsp6I and the structure of PvuII. Starting from the sequence–structure alignment between Bsp6I and PvuII, we constructed a homology model of Bsp6I and used it to predict functionally significant regions in Bsp6I. The homology model was supported by site-directed mutagenesis of residues predicted to be important for dimerization, DNA binding and catalysis. Completing the picture of sequence–structure–function relationships in protein superfamilies becomes an essential task in the age of structural genomics, and our study may serve as a paradigm for future analyses of superfamilies comprising strongly diverged members with little or no sequence similarity.

20.
An efficient algorithm for large-scale detection of protein families   Total citations: 6, self-citations: 0, citations by others: 6
Detection of protein families in large databases is one of the principal research objectives in structural and functional genomics. Protein family classification can significantly contribute to the delineation of functional diversity of homologous proteins, the prediction of function based on domain architecture or the presence of sequence motifs as well as comparative genomics, providing valuable evolutionary insights. We present a novel approach called TRIBE-MCL for rapid and accurate clustering of protein sequences into families. The method relies on the Markov cluster (MCL) algorithm for the assignment of proteins into families based on precomputed sequence similarity information. This novel approach does not suffer from the problems that normally hinder other protein sequence clustering algorithms, such as the presence of multi-domain proteins, promiscuous domains and fragmented proteins. The method has been rigorously tested and validated on a number of very large databases, including SwissProt, InterPro, SCOP and the draft human genome. Our results indicate that the method is ideally suited to the rapid and accurate detection of protein families on a large scale. The method has been used to detect and categorise protein families within the draft human genome and the resulting families have been used to annotate a large proportion of human proteins.
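The Markov cluster (MCL) idea that the abstract relies on alternates two steps on a column-stochastic matrix: expansion (matrix squaring) and inflation (element-wise powering followed by renormalisation). The following is a minimal pure-Python sketch of that loop, not the TRIBE-MCL implementation. The self-loop weight of 1.0, the default inflation of 2.0, the thresholds, and the function names are assumptions, and the dense-matrix approach is only suitable for toy inputs.

```python
def normalize_columns(m: list[list[float]]) -> list[list[float]]:
    """Scale each column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)] for i in range(n)]


def mcl(similarity: list[list[float]], inflation: float = 2.0,
        max_iterations: int = 50, tol: float = 1e-9) -> list[list[int]]:
    """Cluster nodes of a symmetric similarity matrix with a basic MCL loop."""
    n = len(similarity)
    if n == 0:
        return []
    # Copy the matrix and add self-loops (weight 1.0 is an assumption).
    m = [[1.0 if i == j else float(similarity[i][j]) for j in range(n)] for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iterations):
        # Expansion: one step of random-walk spreading (matrix squaring).
        expanded = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
                    for i in range(n)]
        # Inflation: strengthen strong flows, weaken weak ones, renormalise.
        inflated = normalize_columns([[v ** inflation for v in row] for row in expanded])
        change = max(abs(inflated[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = inflated
        if change < tol:
            break
    # Each row that still carries flow lists the members of one cluster.
    clusters = {frozenset(j for j, v in enumerate(row) if v > 1e-6) for row in m}
    return sorted(sorted(c) for c in clusters if c)
```

For example, a similarity matrix made of two disconnected triangles yields two clusters of three nodes each. Higher inflation values generally produce finer-grained clusters.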
