Similar literature
20 similar documents found.
1.
Dengler U, Siddiqui AS, Barton GJ. Proteins, 2001, 42(3): 332-344
The 3Dee database of domain definitions was developed as a comprehensive collection of domain definitions for all three-dimensional structures in the Protein Data Bank (PDB). The database includes definitions for complex, multiple-segment and multiple-chain domains as well as simple sequential domains, organized in a structural hierarchy. Two snapshots of the 3Dee database, from September 1996 and November 1999, were analyzed. For the November 1999 release, 7,995 PDB entries contained 13,767 protein chains and gave rise to 18,896 domains. The domain sequences clustered into 1,715 domain sequence families, which were further clustered into a conservative 1,199 domain structure families (families with similar folds). The proportion of different domain structure families per domain sequence family increases from 84% for domains 1-100 residues long to 100% for domains greater than 600 residues. This is in keeping with the idea that longer chains will have more alternative folds available to them. Of the representative domains from the domain sequence families, 49% are in the range of 51-150 residues, whereas 64% of the representative chains over 200 residues have more than 1 domain. Of the representative chains, 8.5% are part of multichain domains. The largest multichain domain in the database has 14 chains and 1,400 residues, whereas the largest single-chain domain has 907 residues. The largest number of domains found in a protein is 13. The analysis shows that over the history of the PDB, new domain folds have been discovered at a slower rate than by random selection of all known folds. Between 1992 and 1997, a constant 1 in 11 new domains deposited in the PDB has shown no sequence similarity to a previously known domain sequence family, and only 1 in 15 new domain structures has had a fold that has not been seen previously.
A comparison of the September 1996 release of 3Dee to the Structural Classification of Proteins (SCOP) showed that the domain definitions agreed for 80% of the representative protein chains. However, 3Dee provided explicit domain boundaries for more proteins. 3Dee is accessible on the World Wide Web at http://barton.ebi.ac.uk/servers/3Dee.html.

2.
Viruses are the most abundant life form and infect practically all organisms. Consequently, these obligate parasites are a major cause of human suffering and economic loss. The Rossmann-like fold is the most populated fold among α/β-folds in the Protein Data Bank, and proteins containing a Rossmann-like fold constitute 22% of all known protein 3D structures. Thus, analysis of viral proteins containing Rossmann-like domains could provide an understanding of viral biology and evolution and could suggest possible targets for antiviral therapy. We provide functional and evolutionary analysis of viral proteins containing a Rossmann-like fold found in the evolutionary classification of protein domains (ECOD) database developed in our lab. We identified 81 protein families of bacterial, archaeal, and eukaryotic viruses in light of their evolution-based ECOD classification and Pfam taxonomy. We defined their functional significance using enzymatic EC number assignments as well as domain-level family annotations.

3.
The HSSP database of protein structure-sequence alignments.   Cited by: 3 (self-citations: 0, citations by others: 3)
HSSP (homology-derived structures of proteins) is a derived database merging structural (2-D and 3-D) and sequence (1-D) information. For each protein of known 3-D structure from the Protein Data Bank, the database has a file with all sequence homologues, properly aligned to the PDB protein. Homologues are very likely to have the same 3-D structure as the PDB protein to which they have been aligned. As a result, the database is not only a database of aligned sequence families, but also a database of implied secondary and tertiary structures.

4.
The iProClass database is an integrated resource that provides comprehensive family relationships and structural and functional features of proteins, with rich links to various databases. It is extended from ProClass, a protein family database that integrates PIR superfamilies and PROSITE motifs. iProClass currently consists of more than 200,000 non-redundant PIR and SWISS-PROT proteins organized with more than 28,000 superfamilies, 2600 domains, 1300 motifs, 280 post-translational modification sites and links to more than 30 databases of protein families, structures, functions, genes, genomes, literature and taxonomy. Protein and family summary reports provide rich annotations, including membership information with length, taxonomy and keyword statistics, full family relationships, comprehensive enzyme and PDB cross-references and graphical feature display. The database facilitates classification-driven annotation for protein sequence databases and complete genomes, and supports structural and functional genomic research. iProClass is implemented in the Oracle 8i object-relational system and is available for sequence search and report retrieval at http://pir.georgetown.edu/iproclass/.

5.
Restriction endonucleases and other nucleic acid cleaving enzymes form a large and extremely diverse superfamily that displays little sequence similarity despite retaining a common core fold responsible for cleavage. The lack of significant sequence similarity between protein families makes homology inference a challenging task and hinders new family identification with traditional sequence-based approaches. Using the consensus fold recognition method Meta-BASIC, which combines sequence profiles with predicted protein secondary structure, we identify nine new restriction endonuclease-like fold families among previously uncharacterized proteins and predict these proteins to cleave nucleic acid substrates. Application of transitive searches combined with gene neighborhood analysis allows us to confidently link these unknown families to a number of known restriction endonuclease-like structures and thus assign folds to the uncharacterized proteins. Finally, our method identifies a novel restriction endonuclease-like domain in the C-terminus of RecC that is not detected with structure-based searches of the existing PDB database.

6.
The HSSP database of protein structure-sequence alignments.   Cited by: 4 (self-citations: 0, citations by others: 4)
HSSP is a derived database merging structural (3-D) and sequence (1-D) information. For each protein of known 3-D structure from the Protein Data Bank (PDB), the database has a multiple sequence alignment of all available homologues and a sequence profile characteristic of the family. The list of homologues is the result of a database search in SwissProt using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). The database is updated frequently. The listed homologues are very likely to have the same 3-D structure as the PDB protein to which they have been aligned. As a result, the database is not only a database of aligned sequence families, but also a database of implied secondary and tertiary structures covering 29% of all SwissProt-stored sequences.

7.
Understanding the evolution of a protein, including both close and distant relationships, often yields insight into its structure and function. Fast and easy access to such up-to-date information facilitates research. We have developed a hierarchical evolutionary classification of all proteins with experimentally determined spatial structures, and presented it as an interactive and updatable online database. ECOD (Evolutionary Classification of protein Domains) is distinct from other structural classifications in that it groups domains primarily by evolutionary relationships (homology), rather than topology (or "fold"). This distinction highlights cases of homology between domains of differing topology to aid in understanding of protein structure evolution. ECOD uniquely emphasizes distantly related homologs that are difficult to detect, and thus catalogs the largest number of evolutionary links among structural domain classifications. Placing distant homologs together underscores the ancestral similarities of these proteins and draws attention to the most important regions of sequence and structure, as well as conserved functional sites. ECOD also recognizes closer sequence-based relationships between protein domains. Currently, approximately 100,000 protein structures are classified in ECOD into 9,000 sequence families clustered into close to 2,000 evolutionary groups. The classification is assisted by an automated pipeline that quickly and consistently classifies weekly releases of PDB structures and allows for continual updates. This synchronization with PDB uniquely distinguishes ECOD among all protein classifications. Finally, we present several case studies of homologous proteins not recorded in other classifications, illustrating how ECOD can be used to further biological and evolutionary studies.

8.
The HSSP database of protein structure-sequence alignments.   Cited by: 2 (self-citations: 0, citations by others: 2)
HSSP is a derived database merging three-dimensional (3-D) structural and one-dimensional (1-D) sequence information. For each protein of known 3-D structure from the Protein Data Bank (PDB), the database has a multiple sequence alignment of all available homologues and a sequence profile characteristic of the family. The list of homologues is the result of a database search in Swissprot using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). The database is updated frequently. The listed homologues are very likely to have the same 3-D structure as the PDB protein to which they have been aligned. As a result, the database is not only a database of aligned sequence families, but also a database of implied secondary and tertiary structures covering 27% of all Swissprot-stored sequences.

9.
Qi Y, Grishin NV. Proteins, 2005, 58(2): 376-388
Protein structure classification is necessary to comprehend the rapidly growing structural data for better understanding of protein evolution and sequence-structure-function relationships. Thioredoxins are important proteins that ubiquitously regulate cellular redox status and various other crucial functions. We define the thioredoxin-like fold using the structure consensus of thioredoxin homologs and consider all circular permutations of the fold. The search for thioredoxin-like fold proteins in the PDB database identified 723 protein domains. These domains are grouped into eleven evolutionary families based on combined sequence, structural, and functional evidence. Analysis of the protein-ligand structure complexes reveals two major active site locations for the thioredoxin-like proteins. Comparison to existing structure classifications reveals that our thioredoxin-like fold group is broader and more inclusive, unifying proteins from five SCOP folds, five CATH topologies and seven DALI domain dictionary globular folding topologies. Considering these structurally similar domains together sheds new light on the relationships between sequence, structure, function and evolution of thioredoxins.

10.
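The study of protein three-dimensional structure is an important topic in molecular biology, cell biology, biochemistry, and drug design. The fold type reflects the topological pattern of a protein's core structure, and fold recognition is an important part of research on the relationship between protein sequence and structure. The 53 fold types with the largest sample sizes in the LIFCA database were selected, and the functional domain composition method was applied to fold recognition. Samples from Astral 1.65 with sequence identity below 95% were used as the test set. In the whole-database test, the average sensitivity was 96.42%, the specificity was 99.91%, and the Matthews correlation coefficient (MCC) was 0.91. These statistics indicate that the functional domain composition method is well suited to protein fold recognition, and that the relatively simple classification rules of LIFCA capture most of the functional characteristics of proteins, reflecting the correspondence between structure and function.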
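For readers unfamiliar with the three statistics quoted in entry 10, the following is a minimal Python sketch of how sensitivity, specificity, and the Matthews correlation coefficient are conventionally computed from confusion-matrix counts for one fold class treated as the positive class. The function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper, whose reported values are averages over 53 fold types.

```python
import math

def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, mcc) from confusion-matrix counts."""
    # Sensitivity: fraction of true members of the fold that were recognized.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of non-members that were correctly rejected.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # MCC: correlation between predicted and true labels, in [-1, 1].
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc

# Hypothetical counts for a single fold class (illustrative only).
sens, spec, mcc = binary_metrics(tp=90, fp=5, tn=900, fn=10)
print(f"sensitivity={sens:.4f} specificity={spec:.4f} MCC={mcc:.4f}")
```

Because non-members vastly outnumber members of any single fold, specificity is usually close to 100% in this setting, which is one reason MCC is commonly reported alongside it.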

11.
The bias in protein structure and function space resulting from experimental limitations and targeting of particular functional classes of proteins by structural biologists has long been recognized, but never continuously quantified. Using the Enzyme Commission and the Gene Ontology classifications as a reference frame, and integrating structure data from the Protein Data Bank (PDB), target sequences from the structural genomics projects, structure homology derived from the SUPERFAMILY database, and genome annotations from Ensembl and NCBI, we provide a quantified view, both at the domain and whole-protein levels, of the current and projected coverage of protein structure and function space relative to the human genome. Protein structures currently provide at least one domain that covers 37% of the functional classes identified in the genome; whole structure coverage exists for 25% of the genome. If all the structural genomics targets were solved (twice the current number of structures in the PDB), it is estimated that structures of one domain would cover 69% of the functional classes identified and complete structure coverage would be 44%. Homology models from existing experimental structures extend the 37% coverage to 56% of the genome as single domains and 25% to 31% for complete structures. Coverage from homology models is not evenly distributed by protein family, reflecting differing degrees of sequence and structure divergence within families. While these data provide coverage, conversely, they also systematically highlight functional classes of proteins for which structures should be determined. Current key functional families without structure representation are highlighted here; updated information on the "most wanted list" that should be solved is available on a weekly basis from http://function.rcsb.org:8080/pdb/function_distribution/index.html.

12.
High divergence in protein sequences makes the detection of distant protein relationships through homology-based approaches challenging. Grouping protein sequences into families, through similarities in either sequence or 3-D structure, facilitates improved recognition of protein relationships. In addition, strategically designed protein-like sequences have been shown to bridge distant structural domain families by serving as artificial linkers. In this study, we have augmented a search database of known protein domain families with such designed sequences, with the intention of providing functional clues to domain families of unknown structure. When assessed using representative query sequences from each family, we obtain a success rate of 94% in protein domain families of known structure. Further, we demonstrate that the augmented search space enabled fold recognition for 582 families with no structural information available a priori. Additionally, we were able to provide reliable functional relationships for 610 orphan families. We discuss the application of our method in predicting functional roles through select examples for DUF4922, DUF5131, and DUF5085. Our approach also detects new associations between families that were previously not known to be related, as demonstrated through new sub-groups of the RNA polymerase domain among three distinct RNA viruses. Taken together, designed sequences-augmented search databases direct the detection of meaningful relationships between distant protein families. In turn, they enable fold recognition and offer reliable pointers to potential functional sites that may be probed further through direct mutagenesis studies.

13.
New directions in biology are being driven by the complete sequencing of genomes, which has given us the protein repertoires of diverse organisms from all kingdoms of life. In tandem with this accumulation of sequence data, worldwide structural genomics initiatives, advanced by the development of improved technologies in X-ray crystallography and NMR, are expanding our knowledge of structural families and increasing our fold libraries. Methods for detecting remote sequence similarities have also been made more sensitive and this means that we can map domains from these structural families onto genome sequences to understand how these families are distributed throughout the genomes and reveal how they might influence the functional repertoires and biological complexities of the organisms. We have used robust protocols to assign sequences from completed genomes to domain structures in the CATH database, allowing up to 60% of domain sequences in these genomes, depending on the organism, to be assigned to a domain family of known structure. Analysis of the distribution of these families throughout bacterial genomes identified more than 300 universal families, some of which had expanded significantly in proportion to genome size. These highly expanded families are primarily involved in metabolism and regulation and appear to make major contributions to the functional repertoire and complexity of bacterial organisms. When comparisons are made across all kingdoms of life, we find a smaller set of universal domain families (approx. 140), of which families involved in protein biosynthesis are the largest conserved component. Analysis of the behaviour of other families reveals that some (e.g. those involved in metabolism, regulation) have remained highly innovative during evolution, making it harder to trace their evolutionary ancestry. 
Structural analyses of metabolic families provide some insights into the mechanisms of functional innovation, which include changes in domain partnerships and significant structural embellishments leading to modulation of active sites and protein interactions.

14.
The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath_new) currently contains 34 287 domain structures classified into 1383 superfamilies and 3285 sequence families. Each structural family is expanded with domain sequence relatives recruited from GenBank using a variety of efficient sequence search protocols and reliable thresholds. This extended resource, known as the CATH-protein family database (CATH-PFDB), contains a total of 310 000 domain sequences classified into 26 812 sequence families. New sequence search protocols have been designed, based on these intermediate sequence libraries, to allow more regular updating of the classification. Further developments include the adaptation of a recently developed method for rapid structure comparison, based on secondary structure matching, for domain boundary assignment. The philosophy behind CATHEDRAL is the recognition of recurrent folds already classified in CATH. Benchmarking of CATHEDRAL, using manually validated domain assignments, demonstrated that 43% of domain boundaries could be completely automatically assigned. This is an improvement on a previous consensus approach for which only 10-20% of domains could be reliably processed in a completely automated fashion. Since domain boundary assignment is a significant bottleneck in the classification of new structures, CATHEDRAL will also help to increase the frequency of CATH updates.

15.
HSSP (http://www.sander.embl-ebi.ac.uk/hssp/) is a derived database merging structure (3-D) and sequence (1-D) information. For each protein of known 3-D structure from the Protein Data Bank (PDB), we provide a multiple sequence alignment of putative homologues and a sequence profile characteristic of the protein family, centered on the known structure. The list of homologues is the result of an iterative database search in SWISS-PROT using a position-weighted dynamic programming method for sequence profile alignment (MaxHom). The database is updated frequently. The listed putative homologues are very likely to have the same 3-D structure as the PDB protein to which they have been aligned. As a result, the database not only provides aligned sequence families, but also implies secondary and tertiary structures covering 33% of all sequences in SWISS-PROT.

16.
We have determined consensus protein-fold classifications on the basis of three classification methods, SCOP, CATH, and Dali. These classifications make use of different methods of defining and categorizing protein folds that lead to different views of protein-fold space. Pairwise comparisons of domains on the basis of their fold classifications show that much of the disagreement between the classification systems is due to differing domain definitions rather than assigning the same domain to different folds. However, there are significant differences in the fold assignments between the three systems. These remaining differences can be explained primarily in terms of the breadth of the fold classifications. Many structures may be defined as having one fold in one system, whereas far fewer are defined as having the analogous fold in another system. By comparing these folds for a nonredundant set of proteins, the consensus method breaks up broad fold classifications and combines restrictive fold classifications into metafolds, creating, in effect, an averaged view of fold space. This averaged view requires that the structural similarities between proteins having the same metafold be recognized by multiple classification systems. Thus, the consensus map is useful for researchers looking for fold similarities that are relatively independent of the method used to compare proteins. The 30 most populated metafolds, representing the folds of about half of a nonredundant subset of the PDB, are presented here. The full list of metafolds is presented on the Web.

17.
MOTIVATION: It is commonly believed that sequence determines structure, which in turn determines function. However, the presence of many proteins with the same structural fold but different functions suggests that global structure and function do not always correlate well. RESULTS: We propose a method for accurate functional annotation, based on identification of functional signatures from structural alignments (FSSA) using the Structural Classification of Proteins (SCOP) database. The FSSA method is superior at function discrimination and classification compared with several methods that directly inherit functional annotation information from homology inference, such as Smith-Waterman, PSI-BLAST, hidden Markov models and structure comparison methods, for a large number of structural fold families. Our results indicate that the contributions of amino acid residue types and positions to structure and function are largely separable for proteins in multi-functional fold families.

18.
MOTIVATION: In recent years, the Protein Data Bank (PDB) has experienced rapid growth. To maximize the utility of the high resolution protein-protein interaction data stored in the PDB, we have developed PIBASE, a comprehensive relational database of structurally defined interfaces between pairs of protein domains. It is composed of binary interfaces extracted from structures in the PDB and the Probable Quaternary Structure server using domain assignments from the Structural Classification of Proteins and CATH fold classification systems. RESULTS: PIBASE currently contains 158,915 interacting domain pairs between 105,061 domains from 2125 SCOP families. A diverse set of geometric, physicochemical and topologic properties is calculated for each complex, its domains, interfaces and binding sites. A subset of the interface properties is used to remove interface redundancy within PDB entries, resulting in 20,912 distinct domain-domain interfaces. The complexes are grouped into 989 topological classes based on their patterns of domain-domain contacts. The binary interfaces and their corresponding binding sites are categorized into 18,755 and 30,975 topological classes, respectively, based on the topology of secondary structure elements. The utility of the database is illustrated by outlining several current applications. AVAILABILITY: The database is accessible via the world wide web at http://salilab.org/pibase. SUPPLEMENTARY INFORMATION: http://salilab.org/pibase/suppinfo.html.

19.
It is well known that structures are currently available for only a small fraction of known protein sequences. It is therefore urgent to discover the important features of known protein sequences based on the available protein structures. Here, we report a study on the size distribution of protein families within different types of folds. The fold of a protein means the global arrangement of its main secondary structures, both in terms of their relative orientations and their topological connections, which specify a certain biochemical and biophysical aspect. We first search protein families in the structural database SCOP against the sequence-based database Pfam, and acquire a pool of corresponding Pfam families whose structures can be deemed known. This pool of Pfam families is called the sample space for short. Then the size distributions of protein families in the sample space, the Pfam database and the SCOP database are obtained. The results indicate that the size distributions of protein families under different kinds of folds follow similar power laws. In particular, the largest families are spread evenly across different kinds of folds. This may help to better understand the relationship between protein sequence, structure and function. We also show that the set of proteins with known structures can be considered a random sample from the whole space of protein sequences, which is an essential but unsettled assumption for related predictions, such as estimating the number of protein folds in nature. Finally, using a simple method, we conclude that about 2957 folds are needed to cover all Pfam families.

20.
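Protein sequences, structures, and functions are highly diverse. Numerous studies have shown that a protein's structure is related to the order of its amino acid sequence, and that the local amino acid sequence environment has some influence on the structure. This paper proposes a new method for predicting protein structure type based on the statistical torsion-angle preferences of amino acids in 5-mers; the method predicts structure type from the statistical torsion-angle preferences of the central amino acid of 5-mers in the PDB database. The new method can, by means of computer...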
