Similar articles
 Found 20 similar articles (search time: 31 ms)
1.
We have updated the Protein Sequence-Structure Analysis Relational Database (PSSARD), first published in Int. J. Biol. Macromol. 36 (2005) 259-262, which corresponded to 1573 representative protein chains selected from the Protein Data Bank (PDB). In the updated and revised PSSARD (Version 2.0), we have included all proteins in the Protein Data Bank available at the time of developing this database, including the NMR PDB entries. The current database corresponds to 22,752 X-ray PDB entries and 3977 NMR PDB entries, which are kept separate in order to facilitate the appropriate database search. The representative protein chains can also be separately accessed within the current database. We have made a provision to combine more than one field to query the database, and the results of any search can be used to carry out further nested searches using a combination of queries. We have provided hyperlinks to the individual PDB entries obtained as the result of any search in PSSARD in order to obtain additional details relevant to the protein structure. Certain applications useful to identify domains and structural motifs are discussed.

2.
We predicted gamma-turns from amino acid sequences using the first-order Markov chain theory and enlarged representative data sets corresponding to protein chains selected from the Protein Data Bank (PDB). The following data sets were used for training and deriving the probability values: (1) an initial data set containing 315 protein chains comprising 904 gamma-turns and (2) a later data set, compiled in order to include new entries in the PDB, containing 434 protein chains and comprising 1053 gamma-turns. By excluding 93 protein chains that were common to these two training data sets, we generated two mutually exclusive data sets containing 222 and 341 protein chains for testing our predictions. Applying amino acid probability values derived from the training data sets to the testing data sets yielded overall prediction accuracies in the range 54-57%. We recommend the use of probability values derived from the data set comprising 315 protein chains, which represents more gamma-turns and also provides better predictions.
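
A first-order Markov chain scores a sequence as an initial-residue probability multiplied by successive residue-to-residue transition probabilities. The sketch below illustrates that idea only. The function names, the add-one smoothing, the two-model comparison, and the toy tripeptides are assumptions made for illustration; they are not the authors' data sets, probability values, or decision rule.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def train_first_order(segments, pseudocount=1.0):
    """Estimate initial and transition log-probabilities from residue segments.

    Add-one (pseudocount) smoothing is an assumption of this sketch so that
    unseen residue pairs do not receive zero probability.
    """
    n = len(AMINO_ACIDS)
    init_counts = defaultdict(float)
    trans_counts = defaultdict(lambda: defaultdict(float))
    for seg in segments:
        if not seg:
            continue
        init_counts[seg[0]] += 1
        for a, b in zip(seg, seg[1:]):
            trans_counts[a][b] += 1

    init_total = sum(init_counts.values()) + pseudocount * n
    log_init = {
        a: math.log((init_counts[a] + pseudocount) / init_total)
        for a in AMINO_ACIDS
    }
    log_trans = {}
    for a in AMINO_ACIDS:
        row_total = sum(trans_counts[a].values()) + pseudocount * n
        log_trans[a] = {
            b: math.log((trans_counts[a][b] + pseudocount) / row_total)
            for b in AMINO_ACIDS
        }
    return log_init, log_trans


def log_likelihood(segment, model):
    """Log-probability of a segment under a first-order Markov model."""
    log_init, log_trans = model
    score = log_init[segment[0]]
    for a, b in zip(segment, segment[1:]):
        score += log_trans[a][b]
    return score


def predict_turn(segment, turn_model, background_model):
    """Hypothetical rule: call a turn if the turn model explains it better."""
    return log_likelihood(segment, turn_model) > log_likelihood(
        segment, background_model
    )


# Toy training data, invented for this example.
turn_model = train_first_order(["GNG", "GDG", "SNG"])
background_model = train_first_order(["LVA", "AIL", "VLA"])
```

With these toy models, `predict_turn("GNG", turn_model, background_model)` favors the turn model, while a hydrophobic segment such as `"LVA"` does not.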

3.
PDB-REPRDB is a database of representative protein chains from the Protein Data Bank (PDB). Started at the Real World Computing Partnership (RWCP) in August 1997, it developed into the present PDB-REPRDB system. In April 2001, the system was moved to the Computational Biology Research Center (CBRC), National Institute of Advanced Industrial Science and Technology (AIST) (http://www.cbrc.jp/); it is available at http://www.cbrc.jp/pdbreprdb/. The current database includes 33 368 protein chains from 16 682 PDB entries (1 September, 2002), from which are excluded (a) DNA and RNA data, (b) theoretically modeled data, (c) short chains (fewer than 40 residues), or (d) data with non-standard amino acid residues at all residues. The number of PDB entries that include membrane protein structures has increased rapidly as many membrane protein structures have been determined, owing to improved X-ray crystallography, NMR, and electron microscopy techniques. Since many protein structure studies must address globular and membrane proteins separately, a new elimination factor, which excludes membrane protein chains, is introduced in the PDB-REPRDB system. Moreover, the PDB-REPRDB system for membrane protein chains begins at the same URL. The current membrane database includes 551 protein chains, including membrane domains in the SCOP database of release 1.59 (15 May, 2002).

4.
Many proteins function as homo-oligomers and are regulated via their oligomeric state. For some proteins, the stoichiometry of homo-oligomeric states under various conditions has been studied using gel filtration or analytical ultracentrifugation experiments. The interfaces involved in these assemblies may be identified using cross-linking and mass spectrometry, solution-state NMR, and other experiments. However, for most proteins, the actual interfaces that are involved in oligomerization are inferred from X-ray crystallographic structures using assumptions about interface surface areas and physical properties. Examination of interfaces across different Protein Data Bank (PDB) entries in a protein family reveals several important features. First, similarities in space group, asymmetric unit size, and cell dimensions and angles (within 1%) do not guarantee that two crystals are actually the same crystal form, containing similar relative orientations and interactions within the crystal. Conversely, two crystals in different space groups may be quite similar in terms of all the interfaces within each crystal. Second, NMR structures and an existing benchmark of PDB crystallographic entries consisting of 126 dimers as well as larger structures and 132 monomers were used to determine whether the existence or lack of common interfaces across multiple crystal forms can be used to predict whether a protein is an oligomer or not. Monomeric proteins tend to have common interfaces across only a minority of crystal forms, whereas higher-order structures exhibit common interfaces across a majority of available crystal forms. The data can be used to estimate the probability that an interface is biological if two or more crystal forms are available. 
Finally, the Protein Interfaces, Surfaces, and Assemblies (PISA) database available from the European Bioinformatics Institute is more consistent in identifying interfaces observed in many crystal forms compared with the PDB and the European Bioinformatics Institute's Protein Quaternary Server (PQS). The PDB, in particular, is missing highly likely biological interfaces in its biological unit files for about 10% of PDB entries.

5.
Mapping PDB chains to UniProtKB entries   (Total citations: 2; self-citations: 0; citations by others: 2)
MOTIVATION: UniProtKB/SwissProt is the main resource for detailed annotations of protein sequences. This database provides a jumping-off point to many other resources through the links it provides. Among others, these include other primary databases, secondary databases, the Gene Ontology and OMIM. While a large number of links are provided to Protein Data Bank (PDB) files, obtaining a regularly updated mapping between UniProtKB entries and PDB entries at the chain or residue level is not straightforward. In particular, there is no regularly updated resource which allows a UniProtKB/SwissProt entry to be identified for a given residue of a PDB file. RESULTS: We have created a completely automatically maintained database which maps PDB residues to residues in UniProtKB/SwissProt and UniProtKB/TrEMBL entries. The protocol uses links from PDB to UniProtKB, links from UniProtKB to PDB, and a brute-force sequence scan to resolve PDB chains for which no annotated link is available. Finally, the sequences from PDB and UniProtKB are aligned to obtain a residue-level mapping. AVAILABILITY: The resource may be queried interactively or downloaded from http://www.bioinf.org.uk/pdbsws/.
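
The final step described above, turning a pairwise alignment into a residue-level mapping, can be sketched as follows. This is a minimal illustration, not the resource's actual code: the function name `map_residues`, the gapped-string input format, and the example sequences and residue numbers are all hypothetical, and the alignment itself is assumed to have been computed elsewhere.

```python
def map_residues(aligned_pdb, aligned_uniprot, pdb_residue_ids, uniprot_start=1):
    """Map PDB residue identifiers to UniProtKB sequence positions.

    aligned_pdb / aligned_uniprot: equal-length gapped strings ('-' = gap).
    pdb_residue_ids: residue numbers of the PDB chain, in sequence order.
    Only columns where both sequences have a residue produce a mapping.
    """
    if len(aligned_pdb) != len(aligned_uniprot):
        raise ValueError("aligned sequences must have the same length")

    mapping = {}
    pdb_index = 0            # index into pdb_residue_ids
    uniprot_pos = uniprot_start  # 1-based UniProtKB position
    for p, u in zip(aligned_pdb, aligned_uniprot):
        if p != "-" and u != "-":
            mapping[pdb_residue_ids[pdb_index]] = uniprot_pos
        if p != "-":
            pdb_index += 1
        if u != "-":
            uniprot_pos += 1
    return mapping


# Invented example: the PDB chain lacks the first two UniProtKB residues
# and one internal residue, and its numbering starts at 10.
example = map_residues("--KVFG-RC", "MAKVFGARC", [10, 11, 12, 13, 14, 15])
# example == {10: 3, 11: 4, 12: 5, 13: 6, 14: 8, 15: 9}
```

Keying the result on the PDB residue identifier rather than on a running index reflects the fact that PDB numbering need not start at 1 or be contiguous.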

6.
A statistical analysis is reported of 1,200 of the 1,404 nuclear magnetic resonance (NMR)-derived protein and nucleic acid structures deposited in the Protein Data Bank (PDB) before 1999. Excluded from this analysis were the entries not yet fully validated by the PDB and the more than 100 entries that contained < 95% of the expected hydrogens. The aim was to assess the geometry of the hydrogens in the remaining structures and to provide a check on their nomenclature. Deviations in bond lengths, bond angles, improper dihedral angles, and planarity with respect to estimated values were checked. More than 100 entries showed anomalous protonation states for some of their amino acids. Approximately 250,000 (1.7%) atom names differed from the consensus PDB nomenclature. Most of the inconsistencies are due to swapped prochiral labeling. Large deviations from the expected geometry exist for a considerable number of entries, many of which are average structures. The most common causes for these deviations seem to be poor minimization of average structures and an improper balance between force-field constraints for experimental and holonomic data. Some specific geometric outliers are related to the refinement programs used. A number of recommendations for biomolecular databases, modeling programs, and authors submitting biomolecular structures are given.

7.
Mitchell JB, Smith J. Proteins, 2003, 50(4): 563-571
We have investigated the D-amino acid residues present in Protein Data Bank (PDB) entries, categorizing them into "real" D-residues and artifacts. In polypeptide chains of more than 20 residues, only a single instance of a "real" D-residue, other than those deliberately designed or engineered, was found. This example was the result of a slow chemical epimerization process. Another 12 designed D-residues were found in these longer polypeptide chains. Smaller peptides of 20 or fewer residues contained 479 "real" D-residues, the majority in various gramicidin, actinomycin, or cyclosporin structures. We found 148 PDB entries with "real" D-residues and a further 186 in which all apparent D-residues are artifacts. Investigating the (phi, psi) preferences of the "real" D-residues, we found that the region around (-60°, -45°) was almost completely unoccupied, even though it is not formally disallowed. We link the low propensity to occupy this region with the alpha-helix destabilizing properties of D-residues.

8.
Sussman JL, Abola EE, Lin D, Jiang J, Manning NO, Prilusky J. Genetica, 1999, 106(1-2): 149-158
The Protein Data Bank (PDB), at Brookhaven National Laboratory, is a database containing information on experimentally determined three-dimensional structures of proteins, nucleic acids, and other biological macromolecules, with approximately 9000 entries. The PDB has a 27-year history of service to a global community of researchers, educators, and students in a wide variety of scientific disciplines. Data are easily submitted via PDB's WWW-based tool AutoDep, in either PDB or mmCIF format, and are most conveniently examined via PDB's WWW-based tool 3DB Browser. Collaborative centers have been, and continue to be, established worldwide to assist in data deposition, archiving, and distribution. This revised version was published online in October 2005 with corrections to the Cover Date.

9.
MOTIVATION: Integral membrane proteins play important roles in living cells. Although these proteins are estimated to constitute 25% of proteins at a genomic scale, the Protein Data Bank (PDB) contains only a few hundred membrane proteins due to the difficulties with experimental techniques. The presence of transmembrane proteins in the structure data bank, however, is largely invisible, as the annotation of these entries is rather poor. Even if a protein is identified as a transmembrane one, the possible location of the lipid bilayer is not indicated in the PDB because these proteins are crystallized without their natural lipid bilayer, and currently no method is publicly available to detect the possible membrane plane using the atomic coordinates of membrane proteins. RESULTS: Here, we present a new geometrical approach to distinguish between transmembrane and globular proteins using structural information only and to locate the most likely position of the lipid bilayer. An automated algorithm (TMDET) is given to determine the membrane planes relative to the position of atomic coordinates, together with a discrimination function which is able to separate transmembrane and globular proteins even in cases of low resolution or incomplete structures such as fragments or parts of large multichain complexes. This method can be used for the proper annotation of protein structures containing transmembrane segments and paves the way to an up-to-date database containing the structure of all known transmembrane proteins and fragments (PDB_TM) which can be automatically updated. The algorithm is equally important for the purpose of constructing databases purely of globular proteins.

10.
PDB-REPRDB is a database of representative protein chains from the Protein Data Bank (PDB). The previous version of PDB-REPRDB provided 48 representative sets, whose similarity criteria were predetermined, on the WWW. The current version is designed so that the user may obtain a quick selection of representative chains from PDB. The selection of representative chains can be dynamically configured according to the user's requirement. The WWW interface provides a large degree of freedom in setting parameters, such as cut-off scores of sequence and structural similarity. One can obtain a representative list and classification data of protein chains from the system. The current database includes 20 457 protein chains from PDB entries (August 6, 2000). The system for PDB-REPRDB is available at the Parallel Protein Information Analysis system (PAPIA) WWW server (http://www.rwcp.or.jp/papia/).

11.
PISCES: a protein sequence culling server   (Total citations: 21; self-citations: 0; citations by others: 21)
PISCES is a public server for culling sets of protein sequences from the Protein Data Bank (PDB) by sequence identity and structural quality criteria. PISCES can provide lists culled from the entire PDB or from lists of PDB entries or chains provided by the user. The sequence identities are obtained from PSI-BLAST alignments with position-specific substitution matrices derived from the non-redundant protein sequence database. PISCES therefore provides better lists than servers that use BLAST, which is unable to identify many relationships below 40% sequence identity and often overestimates sequence identity by aligning only well-conserved fragments. PDB sequences are updated weekly. PISCES can also cull non-PDB sequences provided by the user as a list of GenBank identifiers, a FASTA format file, or BLAST/PSI-BLAST output.
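
Culling by sequence identity and structural quality can be illustrated with a simple greedy pass: rank chains by quality, then keep a chain only if it is not too similar to anything already kept. The abstract does not describe the algorithm PISCES actually uses, so the sketch below is a generic stand-in. The function name `cull_chains`, the 25% default cutoff, the use of resolution as the only quality criterion, and the chain identifiers and identity values are all assumptions for illustration.

```python
def cull_chains(chains, identity, max_identity=25.0):
    """Greedy culling sketch (not the PISCES implementation).

    chains: list of (chain_id, resolution_in_angstroms) tuples.
    identity: dict mapping frozenset({id_a, id_b}) to percent sequence
              identity, assumed precomputed; missing pairs count as 0%.
    Returns the kept chain ids, best resolution first.
    """
    kept = []
    # Lower resolution value means better structural quality.
    for chain_id, _resolution in sorted(chains, key=lambda c: c[1]):
        too_similar = any(
            identity.get(frozenset((chain_id, other)), 0.0) > max_identity
            for other in kept
        )
        if not too_similar:
            kept.append(chain_id)
    return kept


# Hypothetical chain identifiers and identities, invented for this example.
chains = [("1ABC_A", 2.0), ("2XYZ_A", 1.5), ("3DEF_B", 1.8)]
identity = {
    frozenset(("1ABC_A", "2XYZ_A")): 90.0,
    frozenset(("2XYZ_A", "3DEF_B")): 12.0,
    frozenset(("1ABC_A", "3DEF_B")): 15.0,
}
kept = cull_chains(chains, identity)
```

Here `1ABC_A` is dropped because it is 90% identical to the better-resolved `2XYZ_A`, so `kept` contains `2XYZ_A` and `3DEF_B`.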

12.
The Protein Data Bank (PDB) is the single most important repository of structural data for proteins and other biologically relevant molecules. Therefore, it is critically important to keep the PDB data, as much as possible, error-free. In this study, we have analyzed PDB crystal structures possessing the oligonucleotide/oligosaccharide binding (OB) fold, one of the highly populated folds, for the presence of sequence-structure mapping errors. Using energy-based structure quality assessment coupled with sequence analyses, we have found that there are at least five OB-structures in the PDB that have regions where sequences have been incorrectly mapped onto the structure. We have demonstrated that the combination of these computational techniques is effective not only in detecting sequence-structure mapping errors, but also in providing guidance to correct them. Namely, we have used results of computational analysis to direct a revision of X-ray data for one of the PDB entries containing a fairly inconspicuous sequence-structure mapping error. The revised structure has been deposited with the PDB. We suggest use of computational energy assessment and sequence analysis techniques to facilitate structure determination when homologs having known structure are available to use as a reference. Such computational analysis may be useful in either guiding the sequence-structure assignment process or verifying the sequence mapping within poorly defined regions.

13.
Intrinsic disorder in the Protein Data Bank   (Total citations: 2; self-citations: 0; citations by others: 2)
The Protein Data Bank (PDB) is the preeminent source of protein structural information. PDB contains over 32,500 experimentally determined 3-D structures solved using X-ray crystallography or nuclear magnetic resonance spectroscopy. Intrinsically disordered regions fail to form a fixed 3-D structure under physiological conditions. In this study, we compare the amino-acid sequences of proteins whose structures are determined by X-ray crystallography with the corresponding sequences from the Swiss-Prot database. The analyzed dataset includes 16,370 structures, which represent 18,101 PDB chains and 5,434 different proteins from 910 different organisms (2,793 eukaryotic, 2,109 bacterial, 288 viral, and 244 archaeal). In this dataset, on average, each Swiss-Prot protein is represented by 7 PDB chains with 76% of the crystallized regions being represented by more than one structure. Intriguingly, the complete sequences of only approximately 7% of proteins are observed in the corresponding PDB structures, and only approximately 25% of the total dataset have >95% of their lengths observed in the corresponding PDB structures. This suggests that the vast majority of PDB proteins are shorter than their corresponding Swiss-Prot sequences and/or contain numerous residues that are not observed in maps of electron density. To determine the prevalence of disordered regions in PDB, the residues in the Swiss-Prot sequences were grouped into four general categories, "Observed" (which correspond to structured regions), "Not observed" (regions with missing electron density, potentially disordered), "Uncharacterized," and "Ambiguous," depending on their appearance in the corresponding PDB entries. This non-redundant set of residues can be viewed as a "fragment" or empirical domain database that contains a set of experimentally determined structured regions or domains and a set of experimentally verified disordered regions or domains.
We studied the propensities and properties of residues in these four categories and analyzed their relations to the predictions of disorder using several algorithms. "Non-observed," "Ambiguous," and "Uncharacterized" regions were shown to possess the amino acid compositional biases typical of intrinsically disordered proteins. The application of four different disorder predictors (PONDR® VL-XT, VL3-BA, VSL1P, and IUPred) revealed that the vast majority of residues in the "Observed" dataset are ordered, and that the "Not observed" regions are mostly disordered. The "Uncharacterized" regions possess some tendency toward order, whereas the predictions for the short "Ambiguous" regions are really ambiguous. Long "Ambiguous" regions (>70 amino acid residues) are mostly predicted to be ordered, suggesting that they are likely to be "wobbly" domains. Overall, we showed that completely ordered proteins are not highly abundant in PDB and many PDB sequences have disordered regions. In fact, in the analyzed dataset approximately 10% of the PDB proteins contain regions of consecutive missing or ambiguous residues longer than 30 amino acids and approximately 40% of the proteins possess short regions (≥10 and <30 amino acids long) of missing and ambiguous residues.
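
The length thresholds quoted above (regions longer than 30 residues, and regions of at least 10 but fewer than 30 residues) can be applied to a per-residue observation mask with a simple run-length scan. This is a sketch under stated assumptions: the mask encoding ('1' for observed, '0' for missing or ambiguous) and the function names `missing_runs` and `classify_protein` are hypothetical and do not come from the study.

```python
from itertools import groupby


def missing_runs(observed_mask):
    """Return (start_index, length) for each run of '0' in the mask.

    observed_mask: string over {'0', '1'}; '1' marks a residue seen in the
    structure, '0' a missing or ambiguous residue (an assumed encoding).
    """
    runs = []
    position = 0
    for flag, group in groupby(observed_mask):
        length = len(list(group))
        if flag == "0":
            runs.append((position, length))
        position += length
    return runs


def classify_protein(observed_mask):
    """Flag a protein using the two length ranges quoted in the abstract."""
    lengths = [length for _start, length in missing_runs(observed_mask)]
    return {
        "has_long_region": any(n > 30 for n in lengths),
        "has_short_region": any(10 <= n < 30 for n in lengths),
    }


# Invented mask: 5 observed, 12 missing, 20 observed, 35 missing, 3 observed.
mask = "1" * 5 + "0" * 12 + "1" * 20 + "0" * 35 + "1" * 3
runs = missing_runs(mask)
flags = classify_protein(mask)
```

Note that, taken literally, the two quoted ranges leave a run of exactly 30 residues in neither class; the sketch preserves that rather than guessing how the authors handled it.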

14.
Abstract

The Protein Data Bank (PDB) is the preeminent source of protein structural information. PDB contains over 32,500 experimentally determined 3-D structures solved using X-ray crystallography or nuclear magnetic resonance spectroscopy. Intrinsically disordered regions fail to form a fixed 3-D structure under physiological conditions. In this study, we compare the amino-acid sequences of proteins whose structures are determined by X-ray crystallography with the corresponding sequences from the Swiss-Prot database. The analyzed dataset includes 16,370 structures, which represent 18,101 PDB chains and 5,434 different proteins from 910 different organisms (2,793 eukaryotic, 2,109 bacterial, 288 viral, and 244 archaeal). In this dataset, on average, each Swiss-Prot protein is represented by 7 PDB chains with 76% of the crystallized regions being represented by more than one structure. Intriguingly, the complete sequences of only ~7% of proteins are observed in the corresponding PDB structures, and only ~25% of the total dataset have >95% of their lengths observed in the corresponding PDB structures. This suggests that the vast majority of PDB proteins is shorter than their corresponding Swiss-Prot sequences and/or contain numerous residues, which are not observed in maps of electron density. To determine the prevalence of disordered regions in PDB, the residues in the Swiss-Prot sequences were grouped into four general categories, “Observed” (which correspond to structured regions), “Not observed” (regions with missing electron density, potentially disordered), “Uncharacterized,” and “Ambiguous,” depending on their appearance in the corresponding PDB entries. This non-redundant set of residues can be viewed as a ‘fragment’ or empirical domain database that contains a set of experimentally determined structured regions or domains and a set of experimentally verified disordered regions or domains. 
We studied the propensities and properties of residues in these four categories and analyzed their relations to the predictions of disorder using several algorithms. “Non-observed,” “Ambiguous,” and “Uncharacterized” regions were shown to possess the amino acid compositional biases typical of intrinsically disordered proteins. The application of four different disorder predictors (PONDR® VL-XT, VL3-BA, VSL1P, and IUPred) revealed that the vast majority of residues in the “Observed” dataset are ordered, and that the “Not observed” regions are mostly disordered. The “Uncharacterized” regions possess some tendency toward order, whereas the predictions for the short “Ambiguous” regions are really ambiguous. Long “Ambiguous” regions (>70 amino acid residues) are mostly predicted to be ordered, suggesting that they are likely to be “wobbly” domains.

Overall, we showed that completely ordered proteins are not highly abundant in PDB and many PDB sequences have disordered regions. In fact, in the analyzed dataset ~10% of the PDB proteins contain regions of consecutive missing or ambiguous residues longer than 30 amino-acids and ~40% of the proteins possess short regions (≥10 and <30 amino-acid long) of missing and ambiguous residues.

15.
16.
Of the roughly 20,000 canonical human protein sequences, as of January 20, 2021, 7,077 proteins have had their full or partial, medium- to high-resolution structures determined by X-ray crystallography or other methods. Which of these proteins dominate the Protein Data Bank (PDB) and why? In this paper, we list the 273 top human protein structures based on the number of their PDB entries. This set of proteins accounts for more than 40% of all available human PDB entries and represents past trends as well as the current status of protein structural biology. We briefly discuss the relationship which some of the prominent protein structures have with protein research as a whole and mention their relevance to human diseases. The top-10 soluble and membrane proteins are all well-known (most of their first structures being deposited more than 30 years ago). Overall, there is no dramatic change in recent trends in the PDB. Remarkably, the number of structure depositions has grown nearly exponentially over the last 10 or more years (with a doubling time of 7 years for proteins, obtained from any organism). Growth in human protein structures is slightly faster (with a doubling time of 5.9 years). The information in this paper may be informative to senior scientists and may also inspire researchers who are new to protein science, providing a year-2021 snapshot of the state of protein structural biology.
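
The quoted doubling times can be converted into equivalent annual growth rates if one assumes ideal exponential growth, N(t) = N(0) · 2^(t/T), where T is the doubling time. The helper names below are hypothetical, and the calculation is only as good as that assumption; the abstract itself says the growth is "nearly" exponential.

```python
def annual_growth_rate(doubling_time_years):
    """Fractional growth per year implied by a doubling time.

    Assumes ideal exponential growth: N(t) = N(0) * 2 ** (t / T).
    """
    return 2 ** (1.0 / doubling_time_years) - 1.0


def growth_factor(years, doubling_time_years):
    """Multiplicative increase after a given number of years."""
    return 2 ** (years / doubling_time_years)


all_proteins_rate = annual_growth_rate(7.0)    # roughly 10% per year
human_proteins_rate = annual_growth_rate(5.9)  # roughly 12% per year
```

Under this model, a 7-year doubling time corresponds to roughly 10% growth per year and a 5.9-year doubling time to roughly 12% per year.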

17.
Dengler U, Siddiqui AS, Barton GJ. Proteins, 2001, 42(3): 332-344
The 3Dee database of domain definitions was developed as a comprehensive collection of domain definitions for all three-dimensional structures in the Protein Data Bank (PDB). The database includes definitions for complex, multiple-segment and multiple-chain domains as well as simple sequential domains, organized in a structural hierarchy. Two different snapshots of the 3Dee database were analyzed at September 1996 and November 1999. For the November 1999 release, 7,995 PDB entries contained 13,767 protein chains and gave rise to 18,896 domains. The domain sequences clustered into 1,715 domain sequence families, which were further clustered into a conservative 1,199 domain structure families (families with similar folds). The proportion of different domain structure families per domain sequence family increases from 84% for domains 1-100 residues long to 100% for domains greater than 600 residues. This is in keeping with the idea that longer chains will have more alternative folds available to them. Of the representative domains from the domain sequence families, 49% are in the range of 51-150 residues, whereas 64% of the representative chains over 200 residues have more than 1 domain. Of the representative chains, 8.5% are part of multichain domains. The largest multichain domain in the database has 14 chains and 1,400 residues, whereas the largest single-chain domain has 907 residues. The largest number of domains found in a protein is 13. The analysis shows that over the history of the PDB, new domain folds have been discovered at a slower rate than by random selection of all known folds. Between 1992 and 1997, a constant 1 in 11 new domains deposited in the PDB has shown no sequence similarity to a previously known domain sequence family, and only 1 in 15 new domain structures has had a fold that has not been seen previously. 
A comparison of the September 1996 release of 3Dee to the Structural Classification of Proteins (SCOP) showed that the domain definitions agreed for 80% of the representative protein chains. However, 3Dee provided explicit domain boundaries for more proteins. 3Dee is accessible on the World Wide Web at http://barton.ebi.ac.uk/servers/3Dee.html.

18.
There are now four structures of vertebrate mitochondrial bc1 complexes available in the protein databases, and structures from yeast and bacterial sources are expected soon. This review summarizes the new information with emphasis on the avian cytochrome bc1 complex (PDB entries 1BCC and 3BCC). The Rieske iron–sulfur protein is mobile and this has been proposed to be important for catalysis. The binding sites for quinone have been located based on structures containing inhibitors and, in the case of the quinone reduction site Qi, the quinone itself.

19.
The Protein Data Bank (PDB) has been processed to extract a screening protein library (sc-PDB) of 2148 entries. A knowledge-based detection algorithm has been applied to 18,000 PDB files to find regular expressions corresponding to either protein, ions, co-factors, solvent, or ligand atoms. The sc-PDB database comprises high-resolution X-ray structures of proteins for which (i) a well-defined active site exists and (ii) the bound ligand is a small molecular weight molecule. The database has been screened by an inverse docking tool derived from the GOLD program to recover the known target of four unrelated ligands. Both the database and the inverse screening procedures are accurate enough to rank the true target of the four investigated ligands among the top 1% scorers, with 70-100 fold enrichment with respect to random screening. Applying the proposed screening procedure to a small-sized generic ligand was much less accurate, suggesting that inverse screening should be reserved for rather selective compounds.

20.
MOTIVATION: Data on both single nucleotide polymorphisms and disease-related mutations are being collected at ever-increasing rates. To understand the structural effects of missense mutations, we consider both classes under the term single amino acid polymorphisms (SAAPs) and we wish to map these to protein structure where their effects can be analyzed. Our initial aim therefore is to create a completely automatically maintained database of SAAPs mapped to individual residues in the Protein Data Bank (PDB) updated as new mutations or structures become available. RESULTS: We present an integrated pipeline for the automated mapping of SAAP data from HGVbase to individual PDB residues. Achieving this in a completely automated and reliable manner is a complex task. Data extracted from HGVbase are mapped to EMBL entries to confirm whether the mutation occurs in an exon and, if so, where in the sequence it occurs. From there we map to Swiss-Prot entries and thence to the PDB. AVAILABILITY: The resulting database may be accessed over the web at http://www.bioinf.org.uk/saap/ or http://acrmwww.biochem.ucl.ac.uk/saap/ CONTACT: a.martin@biochem.ucl.ac.uk.
