Similar literature
Found 20 similar articles (search time: 31 ms)
1.
SBASE 4.0 is the fourth release of SBASE, a collection of annotated protein domain sequences that represent various structural, functional, ligand binding and topogenic segments of proteins. SBASE was designed to facilitate the detection of functional homologies and can be searched with standard database search tools, such as FASTA and BLAST3. The present release contains 61 137 entries provided with standardized names and cross-referenced to all major protein, nucleic acid and sequence pattern collections. The entries are clustered into 13 155 groups in order to facilitate detection of distant similarities. SBASE 4.0 is freely available by anonymous ftp file transfer from ftp.icgeb.trieste.it. Individual records can be retrieved with the gopher server at icgeb.trieste.it and with a World Wide Web server at http://www.icgeb.trieste.it. Automated searching of SBASE with BLAST can be carried out with the electronic mail server sbase@icgeb.trieste.it, which now also provides a graphic representation of the homologies. A related mail server, domain@hubi.abc.hu, assigns SBASE domain homologies on the basis of SWISS-PROT searches.

2.
SBASE 8.0 is the eighth release of the SBASE library of protein domain sequences that contains 294 898 annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to most major sequence databases and sequence pattern collections. The entries are clustered into over 2005 statistically validated domain groups (SBASE-A) and 595 non-validated groups (SBASE-B), provided with several WWW-based search and browsing facilities for online use. A domain-search facility was developed, based on non-parametric pattern recognition methods, including artificial neural networks. SBASE 8.0 is freely available by anonymous 'ftp' file transfer from ftp.icgeb.trieste.it. Automated searching of SBASE can be carried out with the WWW servers http://www.icgeb.trieste.it/sbase/ and http://sbase.abc.hu/sbase/.

3.
SBASE 7.0 is the seventh release of the SBASE protein domain sequence library, which contains 237 937 annotated structural, functional, ligand-binding and topogenic segments of proteins, cross-referenced to all major sequence databases and sequence pattern collections. The entries are clustered into over 1811 groups and are provided with two WWW-based search facilities for on-line use. SBASE 7.0 is freely available by anonymous 'ftp' file transfer from ftp.icgeb.trieste.it. Automated searching of SBASE with BLAST can be carried out with the WWW servers http://www.icgeb.trieste.it/sbase/ and http://sbase.abc.hu/sbase/.

4.
MOTIVATION: A key goal of genomics is to assign function to genes, especially for orphan sequences. RESULTS: We compared the clustered functional domains in the SBASE database to each protein sequence using BLASTP. The resulting representation of a protein is a vector in which each non-zero entry indicates a significant match between the sequence of interest and an SBASE domain. Two machine learning methods, the nearest neighbour algorithm (NNA) and support vector machines, are used for predicting protein functional classes from this information. The best results are obtained using the SBASE-A database and the NNA, namely 72% accuracy for 79% coverage. We also tested assigning function by searching for InterPro sequence motifs and by taking the most significant BLAST match within the dataset. We applied the functional domain composition method to predict the functional class of 2018 currently unclassified yeast open reading frames. AVAILABILITY: A program implementing the NNA-based prediction method, called Functional Class Prediction based on Functional Domains (FCPFD), is available and can be obtained by contacting Y.D.Cai at y.cai@umist.ac.uk
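The functional domain composition idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the significance cut-off, the vector encoding or the distance measure, so the E-value threshold, the binary entries, the cosine similarity and all names below (`domain_vector`, `predict_class`, the example domain groups and class labels) are assumptions made for illustration only; this is not the FCPFD program.

```python
from math import sqrt

def domain_vector(hits, domain_groups, threshold=1e-3):
    """Encode a protein as a binary vector over domain groups.

    hits maps a domain group name to the best BLASTP E-value of the protein
    against that group. The E-value threshold is an assumed cut-off.
    """
    return [1 if hits.get(g, float("inf")) <= threshold else 0 for g in domain_groups]

def cosine_similarity(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_class(query_vec, training):
    """Return the label of the most similar training vector.

    training is a list of (vector, label) pairs. A query without any
    significant domain hit gets no prediction (None); treating such
    proteins as uncovered is an assumption of this sketch.
    """
    if not any(query_vec):
        return None
    best = max(training, key=lambda item: cosine_similarity(query_vec, item[0]))
    return best[1]

# Hypothetical domain groups and functional classes, for illustration only.
groups = ["kinase", "SH2", "zinc_finger"]
training = [
    (domain_vector({"kinase": 1e-30, "SH2": 1e-10}, groups), "signalling"),
    (domain_vector({"zinc_finger": 1e-20}, groups), "transcription"),
]
query = domain_vector({"kinase": 1e-12, "zinc_finger": 0.5}, groups)
prediction = predict_class(query, training)
```

In this toy example the weak zinc finger hit (E-value 0.5) is discarded by the threshold, so the query shares a domain only with the first training protein.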

5.
SBASE 2.0 is the second release of SBASE, a collection of annotated protein domain sequences. SBASE entries represent various structural, functional, ligand-binding and topogenic segments of proteins [Pongor, S. et al. (1993) Prot. Eng., in press]. This release contains 34,518 entries provided with standardized names and it is cross-referenced to the major protein and nucleic acid databanks as well as to the PROSITE catalog of protein sequence patterns [Bairoch, A. (1992) Nucl. Acids Res., 20 suppl, 2013-2018]. SBASE can be used for establishing domain homologies using different database-search tools such as FASTA [Lipman and Pearson (1985) Science, 227, 1436-1441], FASTDB [Brutlag et al. (1990) Comp. Appl. Biosci., 6, 237-245] or BLAST3 [Altschul and Lipman (1990) Proc. Natl. Acad. Sci. USA, 87, 5509-5513] which is especially useful in the case of loosely defined domain types for which efficient consensus patterns cannot be established. SBASE 2.0 and a set of search and retrieval tools are freely available on request to the authors or by anonymous 'ftp' file transfer from ftp.icgeb.trieste.it.

6.
SBASE 5.0 is the fifth release of SBASE, a collection of annotated protein domain sequences that represent various structural, functional, ligand-binding and topogenic segments of proteins. SBASE was designed to facilitate the detection of functional homologies and can be searched with standard database-search programs. The present release contains over 79863 entries provided with standardized names and is cross-referenced to all major sequence databases and sequence pattern collections. The information is assigned to individual domains rather than to entire protein sequences, thus SBASE contains substantially more cross-references and links than do the protein sequence databases. The entries are clustered into >16 000 groups in order to facilitate the detection of distant similarities. SBASE 5.0 is freely available by anonymous 'ftp' file transfer from <ftp.icgeb.trieste.it>. Automated searching of SBASE with BLAST can be carried out with the WWW-server <http://www.icgeb.trieste.it/sbase/> and with the electronic mail server <sbase@icgeb.trieste.it>, which now also provides a graphic representation of the homologies. A related WWW-server <http://www.abc.hu/blast.html> and e-mail server <domain@hubi.abc.hu> predict SBASE domain homologies on the basis of SWISS-PROT searches.

7.
SBASE 3.0 is the third release of SBASE, a collection of annotated protein domain sequences. SBASE entries represent various structural, functional, ligand-binding and topogenic segments of proteins as defined by their publishing authors. SBASE can be used for establishing domain homologies using different database-search tools such as FASTA [Lipman and Pearson (1985) Science, 227, 1436-1441] and BLAST3 [Altschul and Lipman (1990) Proc. Natl. Acad. Sci. USA, 87, 5509-5513] which is especially useful in the case of loosely defined domain types for which efficient consensus patterns cannot be established. The present release contains 41,749 entries provided with standardized names and cross-referenced to the major protein and nucleic acid databanks as well as to the PROSITE catalogue of protein sequence patterns. The entries are clustered into 2285 groups using the BLAST algorithm for computing similarity measures. SBASE 3.0 is freely available on request to the authors or by anonymous 'ftp' file transfer from <ftp.icgeb.trieste.it>. Individual records can be retrieved with the gopher server at <icgeb.trieste.it> and with a www-server at <http://www.icgeb.trieste.it>. Automated searching of SBASE by BLAST can be carried out with the electronic mail server <sbase@icgeb.trieste.it>. Another mail server <domain@hubi.abc.hu> assigns SBASE domain homologies on the basis of SWISS-PROT searches. A comparison of pertinent search strategies is presented.

8.
SUMMARY: A simple heuristic scoring method is described for assigning sequences to known domain types based on BLAST search outputs. The scoring is based on the score distribution of the known domain groups determined from a database versus database comparison and is directly applicable to BLAST output processing.
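The summary does not state the scoring formula, so the following is only one plausible sketch of the general idea: score distributions are precomputed per domain group from a database versus database comparison, and a query is assigned to a group when its BLAST scores against that group are consistent with the distribution. The z-score criterion, the cut-off value, the function names and the example group names and scores are all assumptions and are not taken from the described method.

```python
from statistics import mean, stdev

def group_score_stats(within_group_scores):
    """Summarise each group's score distribution as (mean, standard deviation).

    within_group_scores maps a domain group to BLAST scores observed between
    its members in a database-versus-database comparison (at least two scores).
    """
    return {g: (mean(s), stdev(s)) for g, s in within_group_scores.items()}

def assign_domain(query_hits, stats, z_cutoff=-2.0):
    """Assign the query to the group whose distribution best fits its scores.

    query_hits maps a domain group to the BLAST scores of the query against
    members of that group. Returns None when no group passes the assumed
    z-score cut-off.
    """
    best_group, best_z = None, None
    for group, scores in query_hits.items():
        mu, sigma = stats[group]
        # Guard against zero spread; treating it as z = 0 is an arbitrary choice.
        z = (mean(scores) - mu) / sigma if sigma > 0 else 0.0
        if z >= z_cutoff and (best_z is None or z > best_z):
            best_group, best_z = group, z
    return best_group

# Made-up scores for two hypothetical domain groups.
stats = group_score_stats({
    "SH3": [100, 110, 120, 130],
    "PDZ": [200, 210, 190, 220],
})
assigned = assign_domain({"SH3": [105, 112], "PDZ": [60]}, stats)
```

Here the query scores against SH3 members lie well inside the SH3 distribution, while the single PDZ score of 60 is far below typical within-group PDZ scores and is rejected.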

9.
Domains are considered as the basic units of protein folding, evolution, and function. Decomposing each protein into modular domains is thus a basic prerequisite for accurate functional classification of biological molecules. Here, we present ADDA, an automatic algorithm for domain decomposition and clustering of all protein domain families. We use alignments derived from an all-on-all sequence comparison to define domains within protein sequences based on a global maximum likelihood model. In all, 90% of domain boundaries are predicted within 10% of domain size when compared with the manual domain definitions given in the SCOP database. A representative database of 249,264 protein sequences was decomposed into 450,462 domains. These domains were clustered on the basis of sequence similarities into 33,879 domain families containing at least two members with less than 40% sequence identity. Validation against family definitions in the manually curated databases SCOP and PFAM indicates almost perfect unification of various large domain families while contamination by unrelated sequences remains at a low level. The global survey of protein-domain space by ADDA confirms that most large and universal domain families are already described in PFAM and/or SMART. However, a survey of the complete set of mobile modules leads to the identification of 1479 new interesting domain families which shuffle around in multi-domain proteins. The data are publicly available at ftp://ftp.ebi.ac.uk/pub/contrib/heger/adda.

10.
The ever increasing speed of DNA sequencing widens the discrepancy between the number of known gene products, and the knowledge of their function and structure. Proper annotation of protein sequences is therefore crucial if the missing information is to be deduced from sequence‐based similarity comparisons. These comparisons become exceedingly difficult as the pairwise identities drop to very low values. To improve the accuracy of domain identification, we exploit the fact that the three‐dimensional structures of domains are much more conserved than their sequences. Based on structure‐anchored multiple sequence alignments of low identity homologues we constructed 850 structure‐anchored hidden Markov models (saHMMs), each representing one domain family. Since the saHMMs are highly family specific, they can be used to assign a domain to its correct family and clearly distinguish it from domains belonging to other families, even within the same superfamily. This task is not trivial and becomes particularly difficult if the unknown domain is distantly related to the rest of the domain sequences within the family. In a search with full length protein sequences, harbouring at least one domain as defined by the structural classification of proteins database (SCOP), version 1.71, versus the saHMM database based on SCOP version 1.69, we achieve an accuracy of 99.0%. All of the few hits outside the family fall within the correct superfamily. Compared to Pfam_ls HMMs, the saHMMs obtain about 11% higher coverage. A comparison with BLAST and PSI‐BLAST demonstrates that the saHMMs have consistently fewer errors per query at a given coverage. Within our recommended E‐value range, the same is true for a comparison with SUPERFAMILY. Furthermore, we are able to annotate 232 proteins with 530 nonoverlapping domains belonging to 102 different domain families among human proteins labelled “unknown” in the NCBI protein database. 
Our results demonstrate that the saHMM database represents a versatile and reliable tool for identification of domains in protein sequences. With the aid of saHMMs, homology on the family level can be assigned, even for distantly related sequences. Due to the construction of the saHMMs, the hits they provide are always associated with high quality crystal structures. The saHMM database can be accessed via the FISH server at http://babel.ucmp.umu.se/fish/. Proteins 2009. © 2008 Wiley‐Liss, Inc.

11.
The ProDom database of protein domain families. Cited 12 times (self-citations: 1; citations by others: 11)
F Corpet, J Gouzy, D Kahn. Nucleic Acids Research, 1998, 26(1): 323-326
The ProDom database contains protein domain families generated from the SWISS-PROT database by automated sequence comparisons. It can be searched on the World Wide Web (http://protein.toulouse.inra.fr/prodom.html) or by E-mail (prodom@toulouse.inra.fr) to study domain arrangements within known families or new proteins. Strong emphasis has been put on the graphical user interface which allows for interactive analysis of protein homology relationships. Recent improvements to the server include: ProDom search by keyword; links to PROSITE and PDB entries; more sensitive ProDom similarity search with BLAST or WU-BLAST; alignments of query sequences with homologous ProDom domain families; and links to the SWISS-MODEL server (http://www.expasy.ch/swissmod/SWISS-MODEL.html) for homology based 3-D domain modelling where possible.

12.
Databases of multiple sequence alignments are a valuable aid to protein sequence classification and analysis. One of the main challenges when constructing such a database is to simultaneously satisfy the conflicting demands of completeness on the one hand and quality of alignment and domain definitions on the other. The latter properties are best dealt with by manual approaches, whereas completeness in practice is only amenable to automatic methods. Herein we present a database based on hidden Markov model profiles (HMMs), which combines high quality and completeness. Our database, Pfam, consists of parts A and B. Pfam-A is curated and contains well-characterized protein domain families with high quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B contains sequence families that were generated automatically by applying the Domainer algorithm to cluster and align the remaining protein sequences after removal of Pfam-A domains. By using Pfam, a large number of previously unannotated proteins from the Caenorhabditis elegans genome project were classified. We have also identified many novel family memberships in known proteins, including new kazal, Fibronectin type III, and response regulator receiver domains. Pfam-A families have permanent accession numbers and form a library of HMMs available for searching and automatic annotation of new protein sequences. Proteins: 28:405–420, 1997. © 1997 Wiley-Liss, Inc.

13.
The CATH database of protein domain structures (http://www.biochem.ucl.ac.uk/bsm/cath_new) currently contains 34 287 domain structures classified into 1383 superfamilies and 3285 sequence families. Each structural family is expanded with domain sequence relatives recruited from GenBank using a variety of efficient sequence search protocols and reliable thresholds. This extended resource, known as the CATH-protein family database (CATH-PFDB), contains a total of 310 000 domain sequences classified into 26 812 sequence families. New sequence search protocols have been designed, based on these intermediate sequence libraries, to allow more regular updating of the classification. Further developments include the adaptation of a recently developed method for rapid structure comparison, based on secondary structure matching, for domain boundary assignment (CATHEDRAL). The philosophy behind CATHEDRAL is the recognition of recurrent folds already classified in CATH. Benchmarking of CATHEDRAL, using manually validated domain assignments, demonstrated that 43% of domain boundaries could be completely automatically assigned. This is an improvement on a previous consensus approach for which only 10-20% of domains could be reliably processed in a completely automated fashion. Since domain boundary assignment is a significant bottleneck in the classification of new structures, CATHEDRAL will also help to increase the frequency of CATH updates.

14.
Predicting the structural fold of a protein is an important and challenging problem. Available computer programs for determining whether a protein sequence is compatible with a known 3-dimensional structure fall into 2 categories: (1) structure-based methods, in which structural features such as local conformation and solvent accessibility are encoded in a template, and (2) sequence-based methods, in which aligned sequences of a set of related proteins are encoded in a template. In both cases, the programs use a static template based on a predetermined set of proteins. Here, we describe a computer-based method, called iterative template refinement (ITR), that uses templates combining structure-based and sequence-based information and employs an iterative search procedure to detect related proteins and sequentially add them to the templates. Starting from a single protein of known structure, ITR performs sequential cycles of database search to construct an expanding tree of templates with the aim of identifying subtle relationships among proteins. Evaluating the performance of ITR on 6 proteins, we found that the method automatically identified a variety of subtle structural similarities to other proteins. For example, the method identified structural similarity between arabinose-binding protein and phosphofructokinase, a relationship that has not been widely recognized.

15.
MOTIVATION: Since protein domains are the units of evolution, databases of domain signatures such as ProDom or Pfam enable both a sensitive and selective sequence analysis. However, manually curated databases have a low coverage and automatically generated ones often miss relationships which have not yet been discovered between domains or cannot display similarities between domains which have drifted apart. METHODS: We present a tool which makes use of the fact that overall domain arrangements are often conserved. AIDAN (Automated Improvement of Domain ANnotations) identifies potential annotation artifacts and domains which have drifted apart. The underlying database supplements ProDom and is interfaced by a graphical tool allowing the localization of single domain deletions or annotations which have been falsely made by the automated procedure. AVAILABILITY: http://www.uni-muenster.de/Evolution/ebb/Services/AIDAN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

16.
ProDom contains all protein domain families automatically generated from the SWISS-PROT and TrEMBL sequence databases (http://www.toulouse.inra.fr/prodom.html). ProDom-CG results from a similar domain analysis as applied to completed genomes (http://www.toulouse.inra.fr/prodomCG.html). Recent improvements to the ProDom database and its server include: scaling up to include sequences from TrEMBL, addition of Pfam-A entries to the set of expert validated families, assignment of stable accession numbers, consistency indicators for domain families, domain arrangements of sub-families and links to Pfam-A.

17.
The Conserved Domain Database (CDD) is now indexed as a separate database within the Entrez system and linked to other Entrez databases such as MEDLINE(R). This allows users to search for domain types by name, for example, or to view the domain architecture of any protein in Entrez's sequence database. CDD can be accessed on the World Wide Web at http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=cdd. Users may also employ the CD-Search service to identify conserved domains in new sequences, at http://www.ncbi.nlm.nih.gov/Structure/cdd/wrpsb.cgi. CD-Search results, and pre-computed links from Entrez's protein database, are calculated using the RPS-BLAST algorithm and Position Specific Score Matrices (PSSMs) derived from CDD alignments. CD-Searches are also run by default for protein-protein queries submitted to BLAST(R) at http://www.ncbi.nlm.nih.gov/BLAST. CDD mirrors the publicly available domain alignment collections SMART and PFAM, and now also contains alignment models curated at NCBI. Structure information is used to identify the core substructure likely to be present in all family members, and to produce sequence alignments consistent with structure conservation. This alignment model allows NCBI curators to annotate 'columns' corresponding to functional sites conserved among family members.

18.
In the postgenomic era it is essential that protein sequences are annotated correctly in order to help in the assignment of their putative functions. Over 1300 proteins in current protein sequence databases are predicted to contain a PAS domain based upon amino acid sequence alignments. One of the problems with the current annotation of the PAS domain is that this domain exhibits limited similarity at the amino acid sequence level. It is therefore essential, when using proteins with low-sequence similarities, to apply profile hidden Markov model searches for the PAS domain-containing proteins, as for the PFAM database. From recent 3D X-ray and NMR structures, however, PAS domains appear to have a conserved 3D fold as shown here by structural alignment of the six representative 3D-structures from the PDB database. Large-scale modelling of the PAS sequences from the PFAM database against the 3D-structures of these six structural prototypes was performed. All 3D models generated (> 5700) were evaluated using ProsaII. We conclude from our large-scale modelling studies that the PAS and PAC motifs (which are separately defined in the PFAM database) are directly linked and that these two motifs form the PAS fold. The existing subdivision in PAS and PAC motifs, as used by the PFAM and SMART databases, appears to be caused by major differences in sequences in the region connecting these two motifs. This region, as has been shown by Gardner and coworkers for human PAS kinase (Amezcua, C.A., Harper, S.M., Rutter, J. & Gardner, K.H. (2002) Structure 10, 1349-1361, [1]), is very flexible and adopts different conformations depending on the bound ligand. Some PAS sequences present in the PFAM database did not produce a good structural model, even after realignment using a structure-based alignment method, suggesting that these representatives are unlikely to have a fold resembling any of the structural prototypes of the PAS domain superfamily.

19.
The SYSTERS (short for SYSTEmatic Re-Searching) protein sequence cluster set consists of the classification of all sequences from SWISS-PROT and PIR into disjoint protein family clusters and hierarchically into superfamily and subfamily clusters. The cluster set can be searched with a sequence using the SSMAL search tool or a traditional database search tool like BLAST or FASTA. Additionally a multiple alignment is generated for each cluster and annotated with domain information from the Pfam database of protein domain families. A taxonomic overview of the organisms covered by a cluster is given based on the NCBI taxonomy. The cluster set is available for querying and browsing at http://www.dkfz-heidelberg.de/tbi/services/cluster/systersform

20.
MELDB: a database for microbial esterases and lipases. Cited 1 time (self-citations: 0; citations by others: 1)
Kang HY, Kim JF, Kim MH, Park SH, Oh TK, Hur CG. FEBS Letters, 2006, 580(11): 2736-2740
MELDB is a comprehensive protein database of microbial esterases and lipases which are hydrolytic enzymes important in the modern industry. Proteins in MELDB are clustered into groups according to their sequence similarities based on a local pairwise alignment algorithm and a graph clustering algorithm (TribeMCL). This differs from traditional approaches that use global pairwise alignment and joining methods. Our procedure was able to reduce the noise caused by dubious alignment in the distantly related or unrelated regions in the sequences. In the database, 883 esterase and lipase sequences derived from microbial sources are deposited and conserved parts of each protein are identified. HMM profiles of each cluster were generated to classify unknown sequences. Contents of the database can be keyword-searched and query sequences can be aligned to sequence profiles and sequences themselves.
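TribeMCL applies Markov clustering (MCL) to a graph of pairwise sequence similarities. The sketch below shows the basic MCL loop of expansion and inflation on a small similarity matrix in plain Python. It is not the TribeMCL program or the MELDB pipeline: the inflation value, the fixed iteration count (instead of a convergence check), the added self-loops and the toy similarity matrix are assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
            for i in range(n)]

def expand(m):
    """Expansion step: square the matrix (two steps of the random walk)."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def inflate(m, power):
    """Inflation step: raise entries to a power, then renormalise columns."""
    return normalize_columns([[x ** power for x in row] for row in m])

def markov_cluster(similarity, inflation=2.0, iterations=50, eps=1e-6):
    """Cluster nodes of a weighted similarity graph with a basic MCL loop.

    similarity is a symmetric matrix of non-negative weights, e.g. derived
    from local pairwise alignment scores. Returns clusters as sorted lists
    of node indices.
    """
    n = len(similarity)
    # Self-loops are commonly added before running MCL.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    # Each non-empty row lists the members attracted to that node;
    # identical member sets are merged.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > eps)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)

# Toy graph: sequences 0-2 are mutually similar, as are sequences 3-4.
similarity = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
clusters = markov_cluster(similarity)
```

On graphs with weak links between groups, a higher inflation value tends to produce more, smaller clusters. This simple read-out merges only identical member sets, so overlapping clusters would need extra handling.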
