Similar articles
 Found 20 similar articles; search took 46 ms
1.
InterPro, an integrated documentation resource for protein families, protein domains, and functional sites, was developed to amalgamate the individual efforts of the PROSITE, PRINTS, Pfam, and ProDom databases. InterPro can be used for the computational functional classification of newly determined amino acid sequences that lack biochemical characterization and for comparative genome analysis. InterPro contains over 3500 entries for more than 1 000 000 hits in SWISS-PROT and TrEMBL. The database is accessible for text- and sequence-based searches at http://www.ebi.ac.uk/interpro/. InterPro was used for the complete analysis of the proteome of the pathogenic microorganism Mycobacterium tuberculosis and the comparison with the predicted protein-coding sequences of the complete genomes of Bacillus subtilis and Escherichia coli. It was found that 64.8% of proteins in the proteome of M. tuberculosis matched InterPro entries and could be classified by their functions. The comparison with B. subtilis and E. coli provided information on the most common protein families and domains and on the most highly represented protein families in each organism. Thus, InterPro is a useful tool for general comparison of complete proteomes and their compositions.

2.
3.
The Proteome Analysis database (http://www.ebi.ac.uk/proteome/) has been developed by the Sequence Database Group at EBI utilizing existing resources and providing comparative analysis of the predicted protein coding sequences of the complete genomes of bacteria, archaea and eukaryotes. Three main projects, InterPro, CluSTr and GO Slim, are used to give an overview of families, domains, sites, and functions of the proteins from each of the complete genomes. Complete proteome analysis is available for a total of 89 proteome sets. A specifically designed application enables InterPro proteome comparisons for any one proteome against any other one or more of the proteomes in the database.

4.
Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function prediction can be done using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. As well as characterization of proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e. the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.

5.
6.
InterPro (http://www.ebi.ac.uk/interpro/) is an integrated documentation resource for protein families, domains and sites, developed initially as a means of rationalizing the complementary efforts of the PROSITE, PRINTS, Pfam and ProDom database projects. It is a useful resource that aids the functional classification of proteins. Almost 90% of the Actinopterygii protein sequences from SWISS-PROT and TrEMBL can be classified using InterPro. Over 30% of the Actinopterygii protein sequences currently in SWISS-PROT and TrEMBL are of mitochondrial origin, the majority of which belong to the cytochrome b/b6 family. InterPro also gives insights into the domain composition of the classified proteins and has applications in the functional classification of newly determined sequences lacking biochemical characterization, and in comparative genome analysis. A comparison of the Actinopterygii protein sequences against the sequences of other eukaryotes confirms the high representation of eukaryotic protein kinase in the organisms studied. The comparisons also show that, based on InterPro families, the trans-species evolution of MHC class I and II molecules in mammals and teleost fish can be recognized.

7.
The SWISS-PROT group at EBI has developed the Proteome Analysis Database utilising existing resources and providing comparative analysis of the predicted protein coding sequences of the complete genomes of bacteria, archaea and eukaryotes (http://www.ebi.ac.uk/proteome/). The two main projects used, InterPro and CluSTr, give a new perspective on families, domains and sites and cover 31-67% (InterPro statistics) of the proteins from each of the complete genomes. CluSTr covers the three complete eukaryotic genomes and the incomplete human genome data. The Proteome Analysis Database is accompanied by a program that has been designed to carry out InterPro proteome comparisons for any one proteome against any other one or more of the proteomes in the database.

8.
9.
Applications of InterPro in protein annotation and genome analysis   Total citations: 2, self-citations: 0, citations by others: 2
The applications of InterPro span a range of biologically important areas that includes automatic annotation of protein sequences and genome analysis. In automatic annotation of protein sequences InterPro has been utilised to provide reliable characterisation of sequences, identifying them as candidates for functional annotation. Rules based on the InterPro characterisation are stored and operated through a database called RuleBase. RuleBase is used as the main tool in the sequence database group at the EBI to apply automatic annotation to unknown sequences. The annotated sequences are stored and distributed in the TrEMBL protein sequence database. InterPro also provides a means to carry out statistical and comparative analyses of whole genomes. In the Proteome Analysis Database, InterPro analyses have been combined with other analyses based on CluSTr, the Gene Ontology (GO) and structural information on the proteins.

10.
The CluSTr (Clusters of SWISS-PROT and TrEMBL proteins) database offers an automatic classification of SWISS-PROT and TrEMBL proteins into groups of related proteins. The clustering is based on analysis of all pairwise comparisons between protein sequences. Analysis has been carried out for different levels of protein similarity, yielding a hierarchical organisation of clusters. The database provides links to InterPro, which integrates information on protein families, domains and functional sites from PROSITE, PRINTS, Pfam and ProDom. Links to the InterPro graphical interface allow users to see at a glance whether proteins from the cluster share particular functional sites. CluSTr also provides cross-references to HSSP and PDB. The database is available for querying and browsing at http://www.ebi.ac.uk/clustr.
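
The idea of clustering from pairwise comparisons at several similarity levels can be illustrated with a small sketch. It is not the CluSTr implementation: the single-linkage rule, the function name `clusters_at`, the protein identifiers and the scores are all assumptions made for illustration, since the abstract does not state the linkage method or score scale.

```python
from typing import Dict, FrozenSet, List, Set, Tuple


def clusters_at(
    ids: List[str],
    scores: Dict[Tuple[str, str], float],
    threshold: float,
) -> Set[FrozenSet[str]]:
    """Group proteins whose pairwise score reaches `threshold`.

    Assumption: single linkage, i.e. two proteins share a cluster if a
    chain of pairs scoring at or above the threshold connects them.
    """
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        # Follow parent links to the cluster representative (path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two clusters

    groups: Dict[str, Set[str]] = {}
    for i in ids:
        groups.setdefault(find(i), set()).add(i)
    return {frozenset(g) for g in groups.values()}


# Hypothetical identifiers and similarity scores.
ids = ["P1", "P2", "P3", "P4"]
scores = {("P1", "P2"): 90.0, ("P2", "P3"): 50.0, ("P3", "P4"): 10.0}
levels = {t: clusters_at(ids, scores, t) for t in (80.0, 40.0, 5.0)}
```

Lowering the threshold can only merge clusters, never split them, so the clusterings obtained at successive levels nest inside one another and form a hierarchy.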

11.
InterPro, an integrated documentation resource of protein families, domains and functional sites, was created in 1999 as a means of amalgamating the major protein signature databases into one comprehensive resource. PROSITE, Pfam, PRINTS, ProDom, SMART and TIGRFAMs have been manually integrated and curated and are available in InterPro for text- and sequence-based searching. The results are provided in a single format that rationalises the results that would be obtained by searching the member databases individually. The latest release of InterPro contains 5629 entries describing 4280 families, 1239 domains, 95 repeats and 15 post-translational modifications. Currently, the combined signatures in InterPro cover more than 74% of all proteins in SWISS-PROT and TrEMBL, an increase of nearly 15% since the inception of InterPro. New features of the database include improved searching capabilities and enhanced graphical user interfaces for visualisation of the data. The database is available via a webserver (http://www.ebi.ac.uk/interpro) and anonymous FTP (ftp://ftp.ebi.ac.uk/pub/databases/interpro).

12.
We have developed GFam, a platform for automatic annotation of gene/protein families. GFam provides a framework for genome initiatives and model organism resources to build domain-based families and derive meaningful functional labels, and offers a seamless approach to propagating functional annotation across periodic genome updates. GFam is a hybrid approach that uses a greedy algorithm to chain component domains from InterPro annotation provided by its 12 member resources, followed by a sequence-based connected component analysis of un-annotated sequence regions, to derive a consensus domain architecture for each sequence and subsequently generate families based on common architectures. For the proteome of Arabidopsis, our integrated approach increases sequence coverage by 7.2 percentage points and residue coverage by 14.6 percentage points relative to the best single-constituent database within InterPro. The true power of GFam lies in maximizing annotation provided by the different InterPro data sources that offer resource-specific coverage for different regions of a sequence. GFam’s capability to capture higher sequence and residue coverage can be useful for genome annotation, comparative genomics and functional studies. GFam is general-purpose software and can be used for any collection of protein sequences. The software is open source and can be obtained from http://www.paccanarolab.org/software/gfam/.

13.
Many protein regions have been shown to be intrinsically disordered, lacking unique structure under physiological conditions. These intrinsically disordered regions are not only very common in proteomes, but also crucial to the function of many proteins, especially those involved in signaling, recognition, and regulation. The goal of this work was to identify the prevalence, characteristics, and functions of conserved disordered regions within protein domains and families. A database was created to store the amino acid sequences of nearly one million proteins and their domain matches from the InterPro database, a resource integrating eight different protein family and domain databases. Disorder prediction was performed on these protein sequences. Regions of sequence corresponding to domains were aligned using a multiple sequence alignment tool. From this initial information, regions of conserved predicted disorder were found within the domains. The methodology for this search consisted of finding regions of consecutive positions in the multiple sequence alignments in which 90% or more of the sequences were predicted to be disordered. This procedure was constrained to find such regions of conserved disorder prediction that were at least 20 amino acids in length. The results of this work included 3,653 regions of conserved disorder prediction, found within 2,898 distinct InterPro entries. Most regions of conserved predicted disorder detected were short, with less than 10% of those found exceeding 30 residues in length.
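
The search rule described in this abstract (runs of consecutive alignment columns in which at least 90% of sequences are predicted disordered, with a minimum run length of 20) can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name, the boolean-matrix input format and the treatment of alignment gaps as "not disordered" are assumptions.

```python
from typing import List, Tuple


def conserved_disorder_regions(
    disorder: List[List[bool]],
    min_fraction: float = 0.9,
    min_length: int = 20,
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges of conserved disorder.

    `disorder[s][c]` is True when sequence `s` is predicted disordered at
    alignment column `c`. Assumption: gap positions are passed as False.
    """
    if not disorder:
        return []
    n_seqs = len(disorder)
    n_cols = len(disorder[0])
    regions: List[Tuple[int, int]] = []
    start = None  # first column of the run currently being extended
    for col in range(n_cols):
        fraction = sum(row[col] for row in disorder) / n_seqs
        if fraction >= min_fraction:
            if start is None:
                start = col
        else:
            # The run ended at the previous column; keep it if long enough.
            if start is not None and col - start >= min_length:
                regions.append((start, col))
            start = None
    # Close a run that extends to the last column.
    if start is not None and n_cols - start >= min_length:
        regions.append((start, n_cols))
    return regions


# Toy alignment: 10 sequences, 30 columns; 9 of 10 sequences are predicted
# disordered in columns 5..27, and the last sequence is ordered throughout.
rows = [[5 <= c < 28 for c in range(30)] for _ in range(9)] + [[False] * 30]
found = conserved_disorder_regions(rows)
```

In the toy alignment the run spans 23 columns, so it passes the default 20-column minimum and would be dropped if `min_length` were raised above 23.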

14.
MOTIVATION: A key goal of genomics is to assign function to genes, especially for orphan sequences. RESULTS: We compared the clustered functional domains in the SBASE database to each protein sequence using BLASTP. This representation for a protein is a vector, where each of the non-zero entries in the vector indicates a significant match between the sequence of interest and the SBASE domain. The machine learning methods nearest neighbour algorithm (NNA) and support vector machines are used for predicting protein functional classes from this information. We find that the best results are found using the SBASE-A database and the NNA, namely 72% accuracy for 79% coverage. We also tested assigning function by searching for InterPro sequence motifs and by taking the most significant BLAST match within the dataset. We applied the functional domain composition method to predict the functional class of 2018 currently unclassified yeast open reading frames. AVAILABILITY: A program implementing the NNA-based prediction method, called Functional Class Prediction based on Functional Domains (FCPFD), is available and can be obtained by contacting Y.D. Cai at y.cai@umist.ac.uk.
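
The functional domain composition representation and the nearest neighbour prediction can be sketched as below. This is an illustrative reading of the abstract, not the FCPFD program: the binary encoding, the cosine similarity measure, the rule that a protein with no matching domain gets no prediction, and all domain and class names are assumptions.

```python
import math
from typing import List, Optional, Sequence, Set, Tuple


def composition_vector(hits: Set[str], domains: Sequence[str]) -> List[int]:
    """Encode a protein as a 0/1 vector over a fixed domain list.

    `hits` holds the domains with a significant match to the protein.
    """
    return [1 if d in hits else 0 for d in domains]


def cosine(u: Sequence[int], v: Sequence[int]) -> float:
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def predict_class(
    query: Sequence[int],
    training: List[Tuple[List[int], str]],
) -> Optional[str]:
    """Return the class of the most similar training protein.

    Returns None when the query shares no domain with any training
    protein, so such proteins stay unclassified (an assumption).
    """
    best_label: Optional[str] = None
    best_similarity = 0.0
    for vector, label in training:
        similarity = cosine(query, vector)
        if similarity > best_similarity:
            best_label, best_similarity = label, similarity
    return best_label


# Hypothetical domain list and functional classes.
domains = ["d1", "d2", "d3", "d4"]
training = [
    (composition_vector({"d1", "d2"}, domains), "metabolism"),
    (composition_vector({"d3"}, domains), "transport"),
]
label_a = predict_class(composition_vector({"d1"}, domains), training)
label_b = predict_class(composition_vector({"d3", "d4"}, domains), training)
label_c = predict_class(composition_vector(set(), domains), training)
```

Leaving proteins without any domain match unclassified is one way a method can report accuracy over only part of the dataset, which is how figures such as "72% accuracy for 79% coverage" are usually read.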

15.
Various sequence-motif and sequence-cluster databases have been integrated into a new resource known as InterPro. Because the contributing databases have different clustering principles and scoring sensitivities, the combined assignments complement each other for grouping protein families and delineating domains. InterPro and new developments in the analysis of both the phylogenetic profiles of protein families and domain fusion events improve the prediction of specific functions for numerous proteins.

16.
MOTIVATION: The SWISS-PROT group at the EBI has developed the Proteome Analysis Database utilizing existing resources and providing comprehensive and integrated comparative analysis of the predicted protein coding sequences of the complete genomes of bacteria, archaea and eukaryotes. The Proteome Analysis Database is accompanied by a program that has been designed to carry out interactive InterPro proteome comparisons for any one proteome against any other one or more of the proteomes in the database.

17.
Lengthy co-evolution of Homo sapiens and Mycobacterium tuberculosis, the main causative agent of tuberculosis, resulted in a dramatically successful pathogen species that presents a considerable challenge for modern medicine. The continuous and ever-increasing appearance of multi-drug resistant mycobacteria necessitates the identification of novel drug targets and drugs with new mechanisms of action. However, further insights are needed to establish automated protocols for target selection based on the available complete genome sequences. In the present study, we perform complete proteome level comparisons between M. tuberculosis, mycobacteria, other prokaryotes and available eukaryotes based on protein domains, local sequence similarities and protein disorder. We show that the enrichment of certain domains in the genome can indicate an important function specific to M. tuberculosis. We identified two families, termed pkn and PE/PPE, that stand out in this respect. The common property of these two protein families is a complex domain organization that combines species-specific regions, commonly occurring domains and disordered segments. Besides highlighting promising novel drug target candidates in M. tuberculosis, the presented analysis can also be viewed as a general protocol to identify proteins involved in species-specific functions in a given organism. We conclude that target selection protocols should be extended to include proteins with complex domain architectures instead of focusing on sequentially unique and essential proteins only.

18.
Classifications of proteins into groups of related sequences are in some respects like a periodic table for biology, allowing us to understand the underlying molecular biology of any organism. Pfam is a large collection of protein domains and families. Its scientific goal is to provide a complete and accurate classification of protein families and domains. The next release of the database will contain over 10,000 entries, which leads us to reflect on how far we are from completing this work. Currently Pfam matches 72% of known protein sequences, but for proteins with known structure Pfam matches 95%, which we believe represents the likely upper bound. Based on our analysis a further 28,000 families would be required to achieve this level of coverage for the current sequence database. We also show that as more sequences are added to the sequence databases the fraction of sequences that Pfam matches is reduced, suggesting that continued addition of new families is essential to maintain its relevance.

19.
H J Pel, M Rep, L A Grivell. Nucleic Acids Research, 1992, 20(17): 4423-4428
We have recently reported the cloning and sequencing of the gene for the mitochondrial release factor mRF-1. mRF-1 displays high sequence similarity to the bacterial release factors RF-1 and RF-2. A database search for proteins resembling these three factors revealed high similarities to two amino acid sequences deduced from unassigned genomic reading frames in Escherichia coli and Bacillus subtilis. The amino acid sequence derived from the Bacillus reading frame is 47% identical to E. coli and Salmonella typhimurium RF-2, strongly suggesting that it represents B. subtilis RF-2. Our comparison suggests that the expression of the B. subtilis gene is, like that of the E. coli and S. typhimurium RF-2 genes, autoregulated by a stop codon dependent +1 frameshift. A comparison of prokaryotic and mitochondrial release factor sequences, including the putative B. subtilis RF-2, leads us to propose a five-domain model for release factor structure. Possible functions of the various domains are discussed.

20.
Dengler U, Siddiqui AS, Barton GJ. Proteins, 2001, 42(3): 332-344
The 3Dee database of domain definitions was developed as a comprehensive collection of domain definitions for all three-dimensional structures in the Protein Data Bank (PDB). The database includes definitions for complex, multiple-segment and multiple-chain domains as well as simple sequential domains, organized in a structural hierarchy. Two snapshots of the 3Dee database, taken in September 1996 and November 1999, were analyzed. For the November 1999 release, 7,995 PDB entries contained 13,767 protein chains and gave rise to 18,896 domains. The domain sequences clustered into 1,715 domain sequence families, which were further clustered into a conservative 1,199 domain structure families (families with similar folds). The proportion of different domain structure families per domain sequence family increases from 84% for domains 1-100 residues long to 100% for domains greater than 600 residues. This is in keeping with the idea that longer chains will have more alternative folds available to them. Of the representative domains from the domain sequence families, 49% are in the range of 51-150 residues, whereas 64% of the representative chains over 200 residues have more than 1 domain. Of the representative chains, 8.5% are part of multichain domains. The largest multichain domain in the database has 14 chains and 1,400 residues, whereas the largest single-chain domain has 907 residues. The largest number of domains found in a protein is 13. The analysis shows that over the history of the PDB, new domain folds have been discovered at a slower rate than by random selection of all known folds. Between 1992 and 1997, a constant 1 in 11 new domains deposited in the PDB has shown no sequence similarity to a previously known domain sequence family, and only 1 in 15 new domain structures has had a fold that has not been seen previously.
A comparison of the September 1996 release of 3Dee to the Structural Classification of Proteins (SCOP) showed that the domain definitions agreed for 80% of the representative protein chains. However, 3Dee provided explicit domain boundaries for more proteins. 3Dee is accessible on the World Wide Web at http://barton.ebi.ac.uk/servers/3Dee.html.
