Similar literature
1.
Dengler U, Siddiqui AS, Barton GJ. Proteins, 2001, 42(3):332-344
The 3Dee database of domain definitions was developed as a comprehensive collection of domain definitions for all three-dimensional structures in the Protein Data Bank (PDB). The database includes definitions for complex, multiple-segment and multiple-chain domains as well as simple sequential domains, organized in a structural hierarchy. Two different snapshots of the 3Dee database were analyzed at September 1996 and November 1999. For the November 1999 release, 7,995 PDB entries contained 13,767 protein chains and gave rise to 18,896 domains. The domain sequences clustered into 1,715 domain sequence families, which were further clustered into a conservative 1,199 domain structure families (families with similar folds). The proportion of different domain structure families per domain sequence family increases from 84% for domains 1-100 residues long to 100% for domains greater than 600 residues. This is in keeping with the idea that longer chains will have more alternative folds available to them. Of the representative domains from the domain sequence families, 49% are in the range of 51-150 residues, whereas 64% of the representative chains over 200 residues have more than 1 domain. Of the representative chains, 8.5% are part of multichain domains. The largest multichain domain in the database has 14 chains and 1,400 residues, whereas the largest single-chain domain has 907 residues. The largest number of domains found in a protein is 13. The analysis shows that over the history of the PDB, new domain folds have been discovered at a slower rate than by random selection of all known folds. Between 1992 and 1997, a constant 1 in 11 new domains deposited in the PDB has shown no sequence similarity to a previously known domain sequence family, and only 1 in 15 new domain structures has had a fold that has not been seen previously. 
A comparison of the September 1996 release of 3Dee to the Structural Classification of Proteins (SCOP) showed that the domain definitions agreed for 80% of the representative protein chains. However, 3Dee provided explicit domain boundaries for more proteins. 3Dee is accessible on the World Wide Web at http://barton.ebi.ac.uk/servers/3Dee.html.

2.
MOTIVATION: Although many methods are available for the identification of structural domains from protein three-dimensional structures, accurate definition of protein domains and the curation of such data for a large number of proteins are often possible only after manual intervention. The availability of domain definitions for protein structural entries is useful for the sequence analysis of aligned domains, structure comparison, fold recognition procedures and understanding protein folding, domain stability and flexibility. RESULTS: We have improved our method of domain identification, starting from the concept of clustering secondary structural elements but with the aim of reducing the number of discontinuous segments in identified domains. The results of our modified and automatic approach have been compared with the domain definitions from other databases. On a test data set of 55 proteins, this method achieves high agreement (88%) in the number of domains with the crystallographers' definition and resources such as the SCOP, CATH, DALI, 3Dee and PDP databases. This method also obtains a 98% overlap score with the other resources in the definition of domain boundaries of the 55 proteins. We have examined the domain arrangements of 4592 non-redundant protein chains using the improved method to include 5409 domains, leading to an update of the structural domain database. AVAILABILITY: The latest version of the domain database and online domain identification methods are available from http://www.ncbs.res.in/~faculty/mini/ddbase/ddbase.html SUPPLEMENTARY INFORMATION: http://www.ncbs.res.in/~faculty/mini/ddbase/supplementary/supplementary.html

3.
Cheng H, Kim BH, Grishin NV. Proteins, 2008, 70(4):1162-1166
We describe MALIDUP (manual alignments of duplicated domains), a database of 241 pairwise structure alignments for homologous domains that originated by internal duplication within the same polypeptide chain. Since duplicated domains within a protein frequently diverge in function and thus in sequence, this would be the first database of structurally similar homologs that is not strongly biased by sequence or functional similarity. Our manual alignments in most cases agree with the automatic structural alignments generated by several commonly used programs. This carefully constructed database could be used in studies on protein evolution and as a reference for testing structure alignment programs. The database is available at http://prodata.swmed.edu/malidup.

4.
Structural comparison reveals remote homology that often fails to be detected by sequence comparison. The DALI web server (http://ekhidna2.biocenter.helsinki.fi/dali) is a platform for structural analysis that provides database searches and interactive visualization, including structural alignments annotated with secondary structure, protein families and sequence logos, and 3D structure superimposition supported by color-coded sequence and structure conservation. Here, we use DALI to mine the AlphaFold Database version 1, which increased the structural coverage of protein families by 20%. We found 100 remote homologous relationships hitherto unreported in the current reference database for protein domains, Pfam 35.0. In particular, we linked 35 domains of unknown function (DUFs) to previously characterized families, generating functional hypotheses that can be explored downstream in structural biology studies. Other findings include gene fusions, tandem duplications, and adjustments to domain boundaries. The evidence for homology can be browsed interactively through live examples on DALI's website.

5.
We report the latest release (version 1.6) of the CATH protein domains database (http://www.biochem.ucl.ac.uk/bsm/cath). This is a hierarchical classification of 18 577 domains into evolutionary families and structural groupings. We have identified 1028 homologous superfamilies in which the proteins have both structural, and sequence or functional similarity. These can be further clustered into 672 fold groups and 35 distinct architectures. Recent developments of the database include the generation of 3D templates for recognising structural relatives in each fold group, which has led to significant improvements in the speed and accuracy of updating the database and also means that less manual validation is required. We also report the establishment of the CATH-PFDB (Protein Family Database), which associates 1D sequences with the 3D homologous superfamilies. Sequences showing identifiable homology to entries in CATH have been extracted from GenBank using PSI-BLAST. A CATH-PSIBLAST server has been established, which allows you to scan a new sequence against the database. The CATH Dictionary of Homologous Superfamilies (DHS), which contains validated multiple structural alignments annotated with consensus functional information for evolutionary protein superfamilies, has been updated to include annotations associated with sequence relatives identified in GenBank. The DHS is a powerful tool for considering the variation of functional properties within a given CATH superfamily and in deciding what functional properties may be reliably inherited by a newly identified relative.

6.
MOTIVATION: Much research has been devoted to the characterization of interaction interfaces found in complexes with known structure. In this context, the interactions of non-homologous domains at equivalent binding sites are of particular interest, as they can reveal convergently evolved interface motifs. Such motifs are an important source of information to formulate rules for interaction specificity and to design ligands based on the common features shared among diverse partners. RESULTS: We develop a novel method to identify non-homologous structural domains which bind at equivalent sites when interacting with a common partner. We systematically apply this method to all pairs of interactions with known structure and derive a comprehensive database for these interactions. Of all non-homologous domains that bind a common interaction partner, 4.2% use the same interface of the common interaction partner (excluding immunoglobulins and proteases). This rises to 16% if immunoglobulins and proteases are included. We demonstrate two applications of our database: first, the systematic screening for viral protein interfaces, which can mimic native interfaces and thus interfere; and second, structural motifs in enzymes and their inhibitors. We highlight several cases of virus protein mimicry: viral M3 protein interferes with a chemokine dimer interface. The virus has evolved the motif SVSPLP, which mimics the native SSDTTP motif. A second example is the regulatory factor Nef in HIV, which can mimic a kinase when interacting with SH3. Among others, the virus has evolved the kinase's PxxP motif. Further, we elucidate motif resemblances in Baculovirus p35 and HIV capsid proteins. Finally, chymotrypsin is examined with respect to its structural similarity to subtilisin and with respect to its inhibitor's similar recognition sites. SUPPLEMENTARY INFORMATION: A database is online at scoppi.biotec.tu-dresden.de/abac/.

7.
8.
9.
Membership in a protein domain database does not a domain make; a feature we realized when generating a consensus view of protein fold space with our consensus domain dictionary (CDD). This dictionary was used to select representative structures for characterization of the protein dynameome: the Dynameomics initiative. Through this endeavor we rejected a surprising 40% of the 1,695 folds in the CDD as being non-autonomous folding units. Although some of this was due to the challenges of grouping similar fold topologies, the dissonance between the cataloguing and structural qualification of protein domains remains surprising. Another potential factor is previously overlooked intrinsic disorder; predictions suggest that 40% of proteins have either local or global disorder. One thing is clear: filtering a structural database and ensuring a consistent definition for protein domains is crucial, and caution is prescribed when generalizations of globular domains are drawn from unfiltered protein domain datasets.

10.
Comparison and classification of folding patterns from a database of protein structures is crucial to understand the principles of protein architecture, evolution and function. Current search methods for proteins with similar folding patterns are slow and computationally intensive. The sharp growth in the number of known protein structures poses severe challenges for methods of structural comparison. There is a need for methods that can search the database of structures accurately and rapidly. We provide several methods to search for similar folding patterns using a concise tableau representation of proteins that encodes the relative geometry of secondary structural elements. Our first approach allows the extraction of identical and very closely related protein folding patterns in constant time (per hit). Next, we address the hard computational problem of extracting maximally similar subtableaux when comparing two tableaux. We solve the problem using quadratic and linear integer programming formulations and demonstrate their power to identify subtle structural similarities, especially when protein structures significantly diverge. Finally, we describe a rapid and accurate method for comparing a query structure against a database of protein domains, TableauSearch. TableauSearch is rapid enough to search the entire structural database in seconds on a standard desktop computer. Our analysis of TableauSearch on many queries shows that the method is very accurate in identifying similarities of folding patterns, even between distantly related proteins. AVAILABILITY: A web server implementing TableauSearch is available from http://hollywood.bx.psu.edu/TabSearch.
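As a rough illustration of what it can mean to encode the relative geometry of secondary structural elements, the sketch below builds a matrix of pairwise angles between element axis vectors. It is a simplified stand-in and not the TableauSearch implementation: the function names, the use of raw angles in degrees and the example axis vectors are all assumptions made for this illustration, and the published tableau encoding may differ.

```python
import math


def angle_deg(u, v):
    """Angle in degrees between two 3D axis vectors (hypothetical helper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_theta))


def orientation_matrix(axes):
    """Square matrix of pairwise angles between secondary-structure axes.

    `axes` is a list of 3D direction vectors, one per element
    (an assumed input format for this sketch).
    """
    return [[angle_deg(u, v) for v in axes] for u in axes]


# Toy example: three elements, the second perpendicular and the third
# antiparallel to the first.
example_axes = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (-1.0, 0.0, 0.0)]
matrix = orientation_matrix(example_axes)
```

Two structures with similar folding patterns would be expected to give similar matrices under this simplification, which is what makes a compact matrix form attractive for fast comparison.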

11.
Protein Structural Interactome map (PSIMAP) is a global interaction map that describes domain-domain and protein-protein interaction information for known Protein Data Bank structures. It calculates the Euclidean distance to determine interactions between possible pairs of structural domains in proteins. PSIbase is a database and file server for protein structural interaction information calculated by the PSIMAP algorithm. PSIbase also provides an easy-to-use protein domain assignment module, interaction navigation and visual tools. Users can retrieve possible interaction partners of their proteins of interest if a significant homology assignment is made with their query sequences. AVAILABILITY: http://psimap.org and http://psibase.kaist.ac.kr/
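The distance-based test described for PSIMAP might look like the following minimal sketch. The 5 Å cutoff, the minimum of five contacts, the function names and the toy coordinates are assumptions chosen for illustration; the abstract states only that Euclidean distances between pairs of structural domains are used.

```python
import math
from itertools import product


def count_contacts(domain_a, domain_b, cutoff=5.0):
    """Count atom pairs (one atom from each domain) within `cutoff` angstroms.

    Each domain is a list of (x, y, z) coordinates. The cutoff value is an
    assumption for this sketch, not a figure taken from the abstract.
    """
    return sum(
        1 for a, b in product(domain_a, domain_b) if math.dist(a, b) <= cutoff
    )


def domains_interact(domain_a, domain_b, cutoff=5.0, min_contacts=5):
    """Call two domains interacting if they share enough close atom pairs.

    `min_contacts` is likewise an illustrative assumption.
    """
    return count_contacts(domain_a, domain_b, cutoff) >= min_contacts


# Toy coordinates: two atoms of domain_b lie close to domain_a, one is far away.
domain_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
domain_b = [(0.0, 3.0, 0.0), (1.0, 3.0, 0.0), (20.0, 0.0, 0.0)]
far_domain = [(50.0, 50.0, 50.0)]
```

The all-pairs loop is quadratic in the number of atoms; it is kept here for clarity, and a spatial index would be the usual choice for whole-database scans.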

12.
In this paper, we present the protein classification based on structural trees (PCBOST). This is a novel hierarchical classification of proteins that is primarily based on similarity of overall folds of proteins as well as on the modeled folding pathways of proteins. Amino acid sequences, functions of proteins and their evolutionary relationship are not taken into account in this classification. To date, the database includes 3847 proteins and domains grouped into six categories having structural similarity and forming six structural trees (total 10,547 PDB entries). The work on extension of the database and construction of novel structural trees is in progress. The service is free for all users and available at the URL <http://strees.protres.ru/>.

13.
We have developed a tool, named "SCOPExplorer", for browsing and analyzing SCOP information. SCOPExplorer 1) contains a tree-style viewer to display an overview of protein structure data, 2) is able to employ a variety of options to analyze SCOP data statistically, and 3) provides a function to link protein domains to Protein Data Bank (PDB) resources. SCOPExplorer uses an XML-based structural document format, named "SCOPML", derived from the SCOP data. To evaluate SCOPExplorer, proteins containing more than 20 domains were analyzed. The Skp1-Skp2 protein complex and the Fab fragment of IgG2 contain the largest numbers of domains in the current eukaryotic SCOP database. These proteins are known to either bind to various proteins or generate diversity. This suggests that the more domains a protein has, the more interactions or the more variability it will be capable of. (SCOPExplorer is available for download at http://scopexplorer.ulsan.ac.kr).

14.
Motif-based searching in TOPS protein topology databases. (Total citations: 1; self-citations: 0; citations by others: 1)
MOTIVATION: TOPS cartoons are a schematic abstraction of protein three-dimensional structures in two dimensions, and are used for understanding and manual comparison of protein folds. Recently, an algorithm that produces the cartoons automatically from protein structures has been devised and cartoons have been generated to represent all the structures in the structural databank. There is now a need to be able to define target topological patterns and to search the database for matching domains. RESULTS: We have devised a formal language for describing TOPS diagrams and patterns, and have designed an efficient algorithm to match a pattern to a set of diagrams. A pattern-matching system has been implemented, and tested on a database derived from all the current entries in the Protein Data Bank (15,000 domains). Users can search on patterns selected from a library of motifs or, alternatively, they can define their own search patterns. AVAILABILITY: The system is accessible over the Web at http://tops.ebi.ac.uk/tops

15.
The Gene3D database (http://www.biochem.ucl.ac.uk/bsm/cath_new/Gene3D/) provides structural assignments for genes within complete genomes. These are available via the internet from either the World Wide Web or FTP. Assignments are made using PSI-BLAST and subsequently processed using the DRange protocol. The DRange protocol is an empirically benchmarked method for assessing the validity of structural assignments made using sequence searching methods where appropriate assignment statistics are collected and made available. Gene3D links assignments to their appropriate entries in relevant structural and classification resources (PDBsum, CATH database and the Dictionary of Homologous Superfamilies). Release 2.0 of Gene3D includes 62 genomes, 2 eukaryotes, 10 archaea and 40 bacteria. Currently, structural assignments can be made for between 30 and 40 percent of any given genome. In any genome, around half of those genes assigned a structural domain are assigned a single domain and the other half of the genes are assigned multiple structural domains. Gene3D is linked to the CATH database and is updated with each new update of CATH.

16.
17.
Background

Protein domains are commonly used to assess the functional roles and evolutionary relationships of proteins and protein families. Here, we use the Pfam protein family database to examine a set of candidate partial domains. Pfam protein domains are often thought of as evolutionarily indivisible, structurally compact, units from which larger functional proteins are assembled; however, almost 4% of Pfam27 PfamA domains are shorter than 50% of their family model length, suggesting that more than half of the domain is missing at those locations. To better understand the structural nature of partial domains in proteins, we examined 30,961 partial domain regions from 136 domain families contained in a representative subset of PfamA domains (RefProtDom2 or RPD2).

Results

We characterized three types of apparent partial domains: split domains, bounded partials, and unbounded partials. We find that bounded partial domains are over-represented in eukaryotes and in lower quality protein predictions, suggesting that they often result from inaccurate genome assemblies or gene models. We also find that a large percentage of unbounded partial domains produce long alignments, which suggests that their annotation as a partial is an alignment artifact; yet some can be found as partials in other sequence contexts.

Conclusions

Partial domains are largely the result of alignment and annotation artifacts and should be viewed with caution. The presence of partial domain annotations in proteins should raise the concern that the prediction of the protein’s gene may be incomplete. In general, protein domains can be considered the structural building blocks of proteins.

Electronic supplementary material

The online version of this article (doi:10.1186/s13059-015-0656-7) contains supplementary material, which is available to authorized users.
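The 50% length criterion used for partial domains in entry 17 above can be written as a small check. This is a hedged sketch: the function names and the 1-based inclusive model coordinates are assumptions made for illustration, and the way the authors measured aligned domain length may differ.

```python
def model_coverage(model_start, model_end, model_length):
    """Fraction of a family model covered by one domain alignment.

    Coordinates are assumed to be 1-based and inclusive on the model.
    """
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    if not 1 <= model_start <= model_end <= model_length:
        raise ValueError("alignment coordinates fall outside the model")
    return (model_end - model_start + 1) / model_length


def is_partial_domain(model_start, model_end, model_length, threshold=0.5):
    """Flag a domain as partial when it covers less than `threshold` of the model."""
    return model_coverage(model_start, model_end, model_length) < threshold


# Hypothetical alignments against a model of length 100.
short_hit = is_partial_domain(1, 40, 100)    # covers 40 of 100 positions
half_hit = is_partial_domain(1, 50, 100)     # covers exactly half
long_hit = is_partial_domain(11, 100, 100)   # covers 90 of 100 positions
```

The strict inequality means a domain covering exactly half of its model is not flagged, matching the wording "shorter than 50%".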

18.

Background

As tertiary structure is currently available only for a fraction of known protein families, it is important to assess what parts of sequence space have been structurally characterized. We consider protein domains whose structure can be predicted by sequence similarity to proteins with solved structure and address the following questions. Do these domains represent an unbiased random sample of all sequence families? Do targets solved by structural genomic initiatives (SGI) provide such a sample? What are approximate total numbers of structure-based superfamilies and folds among soluble globular domains?

Results

To make these assessments, we combine two approaches: (i) sequence analysis and homology-based structure prediction for proteins from complete genomes; and (ii) monitoring dynamics of the assigned structure set in time, with the accumulation of experimentally solved structures. In the Clusters of Orthologous Groups (COG) database, we map the growing population of structurally characterized domain families onto the network of sequence-based connections between domains. This mapping reveals a systematic bias suggesting that target families for structure determination tend to be located in highly populated areas of sequence space. In contrast, the subset of domains whose structure is initially inferred by SGI is similar to a random sample from the whole population. To accommodate for the observed bias, we propose a new non-parametric approach to the estimation of the total numbers of structural superfamilies and folds, which does not rely on a specific model of the sampling process. Based on dynamics of robust distribution-based parameters in the growing set of structure predictions, we estimate the total numbers of superfamilies and folds among soluble globular proteins in the COG database.

Conclusion

The set of currently solved protein structures allows for structure prediction in approximately a third of sequence-based domain families. The choice of targets for structure determination is biased towards domains with many sequence-based homologs. The growing SGI output in the future should further contribute to the reduction of this bias. The total number of structural superfamilies and folds in the COG database are estimated as ~4000 and ~1700. These numbers are respectively four and three times higher than the numbers of superfamilies and folds that can currently be assigned to COG proteins.

19.
The iProClass database is an integrated resource that provides comprehensive family relationships and structural and functional features of proteins, with rich links to various databases. It is extended from ProClass, a protein family database that integrates PIR superfamilies and PROSITE motifs. iProClass currently consists of more than 200,000 non-redundant PIR and SWISS-PROT proteins organized with more than 28,000 superfamilies, 2600 domains, 1300 motifs, 280 post-translational modification sites and links to more than 30 databases of protein families, structures, functions, genes, genomes, literature and taxonomy. Protein and family summary reports provide rich annotations, including membership information with length, taxonomy and keyword statistics, full family relationships, comprehensive enzyme and PDB cross-references and graphical feature display. The database facilitates classification-driven annotation for protein sequence databases and complete genomes, and supports structural and functional genomic research. iProClass is implemented in the Oracle 8i object-relational system and is available for sequence search and report retrieval at http://pir.georgetown.edu/iproclass/.

20.