Similar articles
 20 similar articles found
1.
2.
ProClass is a protein family database that organizes non-redundant sequence entries into families defined collectively by PIR superfamilies and PROSITE patterns. By combining global similarities and functional motifs into a single classification scheme, ProClass helps to reveal domain and family relationships and classify multi-domain proteins. The database currently consists of >155 000 sequence entries retrieved from both PIR-International and SWISS-PROT databases. Approximately 92 000 or 60% of the ProClass entries are classified into approximately 6000 families, including a large number of new members detected by our GeneFIND family identification system. The ProClass motif collection contains approximately 72 000 motif sequences and >1300 multiple alignments for all PROSITE patterns, including >21 000 matches not listed in PROSITE and mostly detected from unique PIR sequences. To maximize family information retrieval, the database provides links to various protein family, domain, alignment and structural class databases. With its high classification rate and comprehensive family relationships, ProClass can be used to support full-scale genomic annotation. The database, now being implemented in an object-relational database management system, is available for online sequence search and record retrieval from our WWW server at http://pir.georgetown.edu/gfserver/proclass.html
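
Because the families above are defined partly by PROSITE patterns, it may help to see how such a pattern can be matched against a sequence. The following is a minimal sketch, not ProClass or GeneFIND code: the function name `prosite_to_regex` is hypothetical, and it assumes only the common PROSITE pattern elements (`x`, `[..]`, `{..}`, repeat counts, and the `<` / `>` terminal anchors).

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression.

    Hypothetical helper for illustration; it handles x, [..], {..},
    repeat counts such as (3) or (2,4), and the < and > anchors.
    """
    pattern = pattern.strip().rstrip(".")  # patterns end with a period
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchor at the C-terminus
            suffix, element = "$", element[:-1]
        # Split off an optional repeat count such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."  # x stands for any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"  # {..} lists forbidden residues
        # [..] sets and single residue letters are already valid regex.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Illustrative pattern written in PROSITE syntax.
regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")
print(regex)
```

A sequence can then be scanned with `re.search(regex, sequence)`. Pattern features outside the elements listed above are not handled by this sketch.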

3.
The Protein Information Resource, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Information Database (JIPID), produces the most comprehensive and expertly annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database. To provide timely and high quality annotation and promote database interoperability, the PIR-International employs rule-based and classification-driven procedures based on controlled vocabulary and standard nomenclature and includes status tags to distinguish experimentally determined from predicted protein features. The database contains about 200,000 non-redundant protein sequences, which are classified into families and superfamilies, with their domains and motifs identified. Entries are extensively cross-referenced to other sequence, classification, genome, structure and activity databases. The PIR web site features search engines that use sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. The PIR-International databases and search tools are accessible on the PIR web site at http://pir.georgetown.edu/ and at the MIPS web site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP.

4.
The Protein Information Resource (PIR) is an integrated public resource of protein informatics that supports genomic and proteomic research and scientific discovery. PIR maintains the Protein Sequence Database (PSD), an annotated protein database containing over 283 000 sequences covering the entire taxonomic range. Family classification is used for sensitive identification, consistent annotation, and detection of annotation errors. The superfamily curation defines signature domain architecture and categorizes memberships to improve automated classification. To increase the amount of experimental annotation, the PIR has developed a bibliography system for literature searching, mapping, and user submission, and has conducted retrospective attribution of citations for experimental features. PIR also maintains NREF, a non-redundant reference database, and iProClass, an integrated database of protein family, function, and structure information. PIR-NREF provides a timely and comprehensive collection of protein sequences, currently consisting of more than 1 000 000 entries from PIR-PSD, SWISS-PROT, TrEMBL, RefSeq, GenPept, and PDB. The PIR web site (http://pir.georgetown.edu) connects data analysis tools to underlying databases for information retrieval and knowledge discovery, with functionalities for interactive queries, combinations of sequence and text searches, and sorting and visual exploration of search results. The FTP site provides free download for PSD and NREF biweekly releases and auxiliary databases and files.

5.
PASS2 is a nearly automated version of CAMPASS and contains sequence alignments of proteins grouped at the level of superfamilies. This database has been created to correspond with the SCOP database (1.53 release) and currently consists of 110 multi-member superfamilies and 613 superfamilies corresponding to single members. In multi-member superfamilies, protein chains with no more than 25% sequence identity have been considered for the alignment and hence the database aims to address sequence alignments which represent 26 219 protein domains under the SCOP 1.53 release. Structure-based sequence alignments have been obtained by COMPARER and the initial equivalences are provided automatically from a MALIGN alignment and subsequently augmented using STAMP4.0. The final sequence alignments have been annotated for the structural features using JOY4.0. Several interesting links are provided to other related databases and genome sequence relatives. Availability of reliable sequence alignments of distantly related proteins, despite poor sequence identity and single-member superfamilies, permits better sampling of structures in libraries for fold recognition of new sequences and for the understanding of protein structure–function relationships of individual superfamilies. The database can be queried by keywords and also by sequence search, interfaced by PSI-BLAST methods. Structure-annotated sequence alignments and several structural accessory files can be retrieved for all the superfamilies including the user-input sequence. The database can be accessed from http://www.ncbs.res.in/%7Efaculty/mini/campass/pass.html.
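
The 25% sequence identity cutoff mentioned above can be illustrated with a short sketch. This is not the PASS2 pipeline: the function names are hypothetical, identity is counted over aligned columns where neither sequence has a gap (one of several conventions, assumed here), and the greedy selection order is also an assumption.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over columns where neither aligned sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    matches = compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # skip gapped columns (assumed convention)
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def filter_by_identity(aligned: dict, max_identity: float = 25.0) -> list:
    """Greedily keep members that are at most max_identity% identical
    to every member already kept (hypothetical selection strategy)."""
    kept = []
    for name, seq in aligned.items():
        if all(percent_identity(seq, aligned[k]) <= max_identity for k in kept):
            kept.append(name)
    return kept


# Toy aligned sequences, invented for illustration only.
toy = {"a": "ACDEFG", "b": "ACDEFG", "c": "WYWYWY"}
print(filter_by_identity(toy))
```

Other identity definitions (for example, dividing by the length of the shorter sequence) would give different values, so the cutoff is only meaningful together with the chosen convention.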

6.
Rational classification of proteins encoded in sequenced genomes is critical for making the genome sequences maximally useful for functional and evolutionary studies. The family of DNA-binding proteins is one of the most populated and studied amongst the various genomes of bacteria, archaea and eukaryotes and the Web-based system presented here is an approach to their classification. The DnaProt resource is an annotated and searchable collection of protein sequences for the families of DNA-binding proteins. The database contains 3238 full-length sequences (retrieved from the SWISS-PROT database, release 38) that include at least one DNA-binding domain. Sequence entries are organized into families defined by PROSITE patterns, PRINTS motifs and de novo excised signatures. Combining global similarities and functional motifs into a single classification scheme, DNA-binding proteins are classified into 33 unique classes, which helps to reveal comprehensive family relationships. To maximize family information retrieval, DnaProt contains a collection of multiple alignments for each DNA-binding family while the recognized motifs can be used as diagnostically functional fingerprints. All available structural class representatives have been referenced. The resource was developed as a Web-based management system for online free access of customized data sets. Entries are fully hyperlinked to facilitate easy retrieval of the original records from the source databases while functional and phylogenetic annotation will be applied to newly sequenced genomes. The database is freely available for online search of a library containing specific patterns of the identified DNA-binding protein classes and retrieval of individual entries from our WWW server (http://kronos.biol.uoa.gr/~mariak/dbDNA.html).

7.
An automatic procedure is proposed to identify, from the protein sequence database, conserved amino acid patterns (or sequence motifs) that are exclusive to a group of functionally related proteins. This procedure is applied to the PIR database and a dictionary of sequence motifs that relate to specific superfamilies is constructed. The motifs have a practical relevance in identifying the membership of specific superfamilies without the need to perform sequence database searches in 20% of newly determined sequences. The sequence motifs identified represent functionally important sites on protein molecules. When multiple blocks exist in a single motif they are often close together in the 3-D structure. Furthermore, occasionally these motif blocks were found to be split by introns when the correlation with exon structures was examined.

8.
9.
The protein information resource (PIR)   Cited by: 13 (self-citations: 0, citations by others: 13)
The Protein Information Resource (PIR) produces the largest, most comprehensive, annotated protein sequence database in the public domain, the PIR-International Protein Sequence Database, in collaboration with the Munich Information Center for Protein Sequences (MIPS) and the Japan International Protein Sequence Database (JIPID). The expanded PIR WWW site allows sequence similarity and text searching of the Protein Sequence Database and auxiliary databases. Several new web-based search engines combine searches of sequence similarity and database annotation to facilitate the analysis and functional identification of proteins. New capabilities for searching the PIR sequence databases include annotation-sorted search, domain search, combined global and domain search, and interactive text searches. The PIR-International databases and search tools are accessible on the PIR WWW site at http://pir.georgetown.edu and at the MIPS WWW site at http://www.mips.biochem.mpg.de. The PIR-International Protein Sequence Database and other files are also available by FTP.

10.
Evolution of function in protein superfamilies, from a structural perspective   Cited by: 29 (self-citations: 0, citations by others: 29)
The recent growth in protein databases has revealed the functional diversity of many protein superfamilies. We have assessed the functional variation of homologous enzyme superfamilies containing two or more enzymes, as defined by the CATH protein structure classification, by way of the Enzyme Commission (EC) scheme. Combining sequence and structure information to identify relatives, the majority of superfamilies display variation in enzyme function, with 25 % of superfamilies in the PDB having members of different enzyme types. We determined the extent of functional similarity at different levels of sequence identity for 486,000 homologous pairs (enzyme/enzyme and enzyme/non-enzyme), with structural and sequence relatives included. For single and multi-domain proteins, variation in EC number is rare above 40 % sequence identity, and above 30 %, the first three digits may be predicted with an accuracy of at least 90 %. For more distantly related proteins sharing less than 30 % sequence identity, functional variation is significant, and below this threshold, structural data are essential for understanding the molecular basis of observed functional differences. To explore the mechanisms for generating functional diversity during evolution, we have studied in detail 31 diverse structural enzyme superfamilies for which structural data are available. A large number of variations and peculiarities are observed, at the atomic level through to gross structural rearrangements. Almost all superfamilies exhibit functional diversity generated by local sequence variation and domain shuffling. Commonly, substrate specificity is diverse across a superfamily, whilst the reaction chemistry is maintained. In many superfamilies, the position of catalytic residues may vary despite playing equivalent functional roles in related proteins. The implications of functional diversity within superfamilies for the structural genomics projects are discussed.
More detailed information on these superfamilies is available at http://www.biochem.ucl.ac.uk/bsm/FAM-EC/.
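
The abstract above measures functional similarity by how many leading digits of two EC numbers agree. A minimal sketch of that comparison follows; the function name `shared_ec_levels` and the treatment of `-` as an unassigned level are assumptions for illustration, not the authors' code.

```python
def shared_ec_levels(ec_a: str, ec_b: str) -> int:
    """Count how many leading EC levels (0 to 4) two EC numbers share.

    A '-' is treated as an unassigned level and ends the comparison
    (an assumption made for this sketch).
    """
    levels = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":
            break
        levels += 1
    return levels


# Two EC numbers that agree in the first three digits and differ in the fourth.
print(shared_ec_levels("3.4.21.4", "3.4.21.5"))  # prints 3
```

Under this measure, a statement such as "the first three digits may be predicted" corresponds to `shared_ec_levels(...) >= 3` for a pair of homologues.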

11.
The cadherin superfamily is a large protein family with diverse structures and functions. Because of this diversity and the growing biological interest in cell adhesion and signaling processes, in which many members of the cadherin superfamily play a crucial role, it is becoming increasingly important to develop tools to manage, distribute and analyze sequences in this protein family. Current profile and motif databases classify protein sequences into a broad spectrum of protein superfamilies; however, to provide a more specific functional annotation, the next step should include classification of subfamilies of these protein superfamilies. Here, we present a tool that classified greater than 90% of the proteins belonging to the cadherin superfamily found in the SWISS-PROT database. Therefore, for most members of the cadherin superfamily, this tool can assist in adding more specific functional annotations than can be achieved with current profile and motif databases. Finally, the classification tool and the results of our analysis were integrated into a web-accessible database (http://calcium.uhnres.utoronto.ca/cadherin).

12.
13.
Knowledge of three-dimensional structure is essential to understand the function of a protein. Although the overall fold is determined by the full details of its sequence, a small group of residues, often called structural motifs, play a crucial role in determining the protein fold and its stability. Identification of such structural motifs requires a sufficient number of sequence and structural homologs to define conservation and evolutionary information. Unfortunately, many structures in the protein structure databases have no homologous structures or sequences. In this work, we report an SVM method, SMpred, to identify structural motifs from a single protein structure without using sequence and structural homologs. The SMpred method was trained and tested using 132 protein domains containing 581 motifs. The SMpred method achieved 78.79% accuracy with 79.06% sensitivity and 78.53% specificity. The performance of SMpred was evaluated with MegaMotifBase using 188 proteins containing 1161 motifs. Out of 1161 motifs, SMpred correctly identified 1503 structural motifs reported in MegaMotifBase. Further, we showed that SMpred is a useful approach for the length-deviant superfamilies and single-member superfamilies. This result suggests the usefulness of our approach for facilitating the identification of structural motifs in protein structure in the absence of sequence and structural homologs. The dataset and executable for the SMpred algorithm are available at http://www3.ntu.edu.sg/home/EPNSugan/index_files/SMpred.htm.
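
For readers unfamiliar with the three performance figures quoted above, the sketch below shows how accuracy, sensitivity, and specificity are conventionally derived from confusion-matrix counts. The function name and the counts in the usage line are invented for illustration; the abstract does not report the underlying counts, so none are reproduced here.

```python
def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    total = tp + fp + tn + fn
    if total == 0 or tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return {
        "accuracy": (tp + tn) / total,    # fraction of all calls that are correct
        "sensitivity": tp / (tp + fn),    # fraction of true positives recovered
        "specificity": tn / (tn + fp),    # fraction of true negatives recovered
    }


# Hypothetical counts, not taken from the SMpred study.
print(classification_metrics(tp=80, fp=30, tn=70, fn=20))
```

Reporting all three values together, as the abstract does, shows whether a classifier is balanced between the positive and negative classes.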

14.
The epoxide hydrolases and haloalkane dehalogenases database (EH/HD) integrates sequence and structure of a highly diverse protein family, including mainly the Asp-hydrolases of EHs and HDs but also proteins, such as Ser-hydrolases non-heme peroxidases, prolyl iminopeptidases and 2-hydroxymuconic semialdehyde hydrolases. These proteins have a highly conserved structure, but display a remarkable diversity in sequence and function. A total of 305 protein entries were assigned to 14 homologous families, forming two superfamilies. Annotated multisequence alignments and phylogenetic trees are provided for each homologous family and superfamily. Experimentally derived structures of 19 proteins are superposed and consistently annotated. Sequence and structure of all 305 proteins were systematically analysed. Thus, deeper insight is gained into the role of highly conserved sequence motifs and structural elements. AVAILABILITY: The EH/HD database is available at http://www.led.uni-stuttgart.de

15.
BioThesaurus is a web-based system designed to map a comprehensive collection of protein and gene names to protein entries in the UniProt Knowledgebase. Currently covering more than two million proteins, BioThesaurus consists of over 2.8 million names extracted from multiple molecular biological databases according to the database cross-references in iProClass. The BioThesaurus web site allows the retrieval of synonymous names of given protein entries and the identification of protein entries sharing the same names. AVAILABILITY: BioThesaurus is accessible for online searching at http://pir.georgetown.edu/iprolink/biothesaurus

16.
Many protein classification systems capture homologous relationships by grouping domains into families and superfamilies on the basis of sequence similarity. Superfamilies with similar 3D structures are further grouped into folds. In the absence of discernable sequence similarity, these structural similarities were long thought to have originated independently, by convergent evolution. However, the growth of databases and advances in sequence comparison methods have led to the discovery of many distant evolutionary relationships that transcend the boundaries of superfamilies and folds. To investigate the contributions of convergent versus divergent evolution in the origin of protein folds, we clustered representative domains of known structure by their sequence similarity, treating them as point masses in a virtual 2D space which attract or repel each other depending on their pairwise sequence similarities. As expected, families in the same superfamily form tight clusters. But often, superfamilies of the same fold are linked with each other, suggesting that the entire fold evolved from an ancient prototype. Strikingly, some links connect superfamilies with different folds. They arise from modular peptide fragments of between 20 and 40 residues that co‐occur in the connected folds in disparate structural contexts. These may be descendants of an ancestral pool of peptide modules that evolved as cofactors in the RNA world and from which the first folded proteins arose by amplification and recombination. Our galaxy of folds summarizes, in a single image, most known and many yet undescribed homologous relationships between protein superfamilies, providing new insights into the evolution of protein domains.

17.
The current pace of structural biology now means that protein three-dimensional structure can be known before protein function, making methods for assigning homology via structure comparison of growing importance. Previous research has suggested that sequence similarity after structure-based alignment is one of the best discriminators of homology and often functional similarity. Here, we exploit this observation, together with a merger of protein structure and sequence databases, to predict distant homologous relationships. We use the Structural Classification of Proteins (SCOP) database to link sequence alignments from the SMART and Pfam databases. We thus provide new alignments that could not be constructed easily in the absence of known three-dimensional structures. We then extend the method of Murzin (1993b) to assign statistical significance to sequence identities found after structural alignment and thus suggest the best link between diverse sequence families. We find that several distantly related protein sequence families can be linked with confidence, showing the approach to be a means for inferring homologous relationships and thus possible functions when proteins are of known structure but of unknown function. The analysis also finds several new potential superfamilies, where inspection of the associated alignments and superimpositions reveals conservation of unusual structural features or co-location of conserved amino acids and bound substrates. We discuss implications for Structural Genomics initiatives and for improvements to sequence comparison methods.

18.
The Structural Motifs of Superfamilies (SMoS) database provides information about the structural motifs of aligned protein domain superfamilies. Such motifs among structurally aligned multiple members of protein superfamilies are recognized by the conservation of amino acid preference and solvent inaccessibility and are examined for the conservation of other features like secondary structural content, hydrogen bonding, non-polar interaction and residue packing. These motifs, along with their sequence and spatial orientation, represent the conserved core structure of each superfamily and also provide the minimal requirement of sequence and structural information to retain each superfamily fold.

19.
W R Pearson. Genomics, 1991, 11(3): 635-650
The sensitivity and selectivity of the FASTA and the Smith-Waterman protein sequence comparison algorithms were evaluated using the superfamily classification provided in the National Biomedical Research Foundation/Protein Identification Resource (PIR) protein sequence database. Sequences from each of the 34 superfamilies in the PIR database with 20 or more members were compared against the protein sequence database. The similarity scores of the related and unrelated sequences were determined using either the FASTA program or the Smith-Waterman local similarity algorithm. These two sets of similarity scores were used to evaluate the ability of the two comparison algorithms to identify distantly related protein sequences. The FASTA program using the ktup = 2 sensitivity setting performed as well as the Smith-Waterman algorithm for 19 of the 34 superfamilies. Increasing the sensitivity by setting ktup = 1 allowed FASTA to perform as well as Smith-Waterman on an additional 7 superfamilies. The rigorous Smith-Waterman method performed better than FASTA with ktup = 1 on 8 superfamilies, including the globins, immunoglobulin variable regions, calmodulins, and plastocyanins. Several strategies for improving the sensitivity of FASTA were examined. The greatest improvement in sensitivity was achieved by optimizing a band around the best initial region found for every library sequence. For every superfamily except the globins and immunoglobulin variable regions, this strategy was as sensitive as a full Smith-Waterman. For some sequences, additional sensitivity was achieved by including conserved but nonidentical residues in the lookup table used to identify the initial region.
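
As a reminder of what the "rigorous" method in the comparison above computes, here is a minimal sketch of Smith-Waterman local alignment. It is a simplification under stated assumptions: a flat match/mismatch score and a linear gap penalty stand in for the amino acid substitution matrix and gap parameters a real protein comparison would use, and the default values are arbitrary.

```python
def smith_waterman(a: str, b: str, match: int = 2, mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_a, aligned_b) for the best local alignment.

    Simplified scoring (assumed): flat match/mismatch, linear gap penalty.
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best score ending at a[i-1], b[j-1]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("AAAGGG", "TTTGGGTTT"))
```

The full dynamic-programming table costs time proportional to the product of the two sequence lengths, which is why word-based heuristics such as FASTA, with its ktup setting, restrict the search to promising regions first.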

20.
Sixty-five families of glycosyltransferases (EC 2.4.x.y) have been recognized on the basis of high-sequence similarity to a founding member with experimentally demonstrated enzymatic activity. Although distant sequence relationships between some of these families have been reported, the natural history of glycosyltransferases is poorly understood. We used iterative searches of sequence databases, motif extraction, structural comparison, and analysis of completely sequenced genomes to track the origins of modern-type glycosyltransferases. We show that >75% of recognized glycosyltransferase families belong to one of only three monophyletic superfamilies of proteins, namely, (1) a recently described GPGTF/GT-B superfamily; (2) a nucleoside-diphosphosugar transferase (GT-A) superfamily, which is characterized by a DxD sequence signature and also includes nucleotidyltransferases; and (3) a GT-C superfamily of integral membrane glycosyltransferases with a modified DxD signature in the first extracellular loop. Several developmental regulators in Metazoans, including Fringe and Egghead homologs, belong to the second superfamily. Interestingly, the Tout-velu/Exostosin family of developmental proteins, found in all multicellular eukaryotes, contains separate domains belonging to the first and the second superfamilies, explaining multiple glycosyltransferase activities in one protein.
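
The DxD signature mentioned above (aspartate, any residue, aspartate) is simple enough to scan for directly. The sketch below is only an illustration with a hypothetical function name and an invented toy sequence; a bare three-residue pattern occurs by chance in many proteins, so hits are candidates at best, and the study itself relied on iterative database searches and structural comparison rather than this kind of scan.

```python
import re


def find_dxd(sequence: str) -> list:
    """Return 1-based start positions of every D-x-D occurrence.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    return [m.start() + 1 for m in re.finditer(r"(?=D[A-Z]D)", sequence.upper())]


# Toy sequence invented for illustration.
print(find_dxd("MKDADLLDDDK"))  # prints [3, 8]
```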
