Similar articles
 20 similar articles found
1.
Plant protein annotation in the UniProt Knowledgebase   Total citations: 3, self-citations: 0, citations by others: 3
The Swiss-Prot, TrEMBL, Protein Information Resource (PIR), and DNA Data Bank of Japan (DDBJ) protein database activities have united to form the Universal Protein Resource (UniProt) Consortium. UniProt presents three database layers: the UniProt Archive, the UniProt Knowledgebase (UniProtKB), and the UniProt Reference Clusters. The UniProtKB consists of two sections: UniProtKB/Swiss-Prot (fully manually curated entries) and UniProtKB/TrEMBL (automated annotation, classification and extensive cross-references). New releases are published fortnightly. A specific Plant Proteome Annotation Program (http://www.expasy.org/sprot/ppap/) was initiated to cope with the increasing amount of data produced by the complete sequencing of plant genomes. Through UniProt, our aim is to provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information that will allow the plant community to fully explore and utilize the wealth of information available for both plant and non-plant model organisms.

2.
The I-conotoxin superfamily (I-Ctx) is known to have four disulfide bonds with the cysteine arrangement C-C-CC-CC-C-C, and the members inhibit or modify ion channels of nerve cells. Recently, Olivera and co-workers (FEBS J. 2005; 272: 4178-4188) have suggested that the previously described I-Ctx should now be divided into two different gene superfamilies, namely, I1 and I2, in view of their having two different types of signal peptides and exhibiting distinct functions. We have revisited the 28 entries presently grouped as I-Ctx in the UniProt Swiss-Prot knowledgebase, and on the basis of in silico analysis have divided them into I1 and I2 superfamilies. The sequence analysis has provided a framework for in silico annotation enabling us to carry out computer-based functional characterization of the UniProtKB/TrEMBL entry Q59AA4 from Conus miles and to predict it as a member of the I2 superfamily. Furthermore, we have predicted the mature toxin of this entry and have proposed that it may be an inhibitor of voltage-gated potassium channels.
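
The cysteine arrangement C-C-CC-CC-C-C quoted in this abstract can be checked mechanically. The following is a minimal sketch, not the authors' method: it assumes that each hyphen stands for one or more non-cysteine residues and that adjacent letters are consecutive cysteines. The function names and the toy peptide in the usage note are hypothetical.

```python
import re


def framework_to_regex(framework: str) -> re.Pattern:
    """Turn a cysteine framework such as 'C-C-CC-CC-C-C' into a regex.

    Assumption: '-' means one or more non-cysteine residues, and
    adjacent 'C' letters are consecutive (vicinal) cysteines.
    """
    gap = "[^C]+"
    body = gap.join(framework.split("-"))
    # Non-cysteine residues may flank the framework on either side.
    return re.compile(f"^[^C]*{body}[^C]*$")


def has_framework(mature_peptide: str, framework: str = "C-C-CC-CC-C-C") -> bool:
    """Return True if the peptide's cysteines follow the given framework."""
    return framework_to_regex(framework).match(mature_peptide.upper()) is not None
```

For example, the made-up peptide `GACKAACAGCCGTKCCAGSCRACG` satisfies the framework, while the same string with one of the paired cysteines removed does not.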

3.
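A brief introduction to the UniProt protein database   Total citations: 1, self-citations: 0, citations by others: 1
Luo Jingchu 《生物信息学》 (Chinese Journal of Bioinformatics) 2019,17(3):131-144
UniProt (https://www.uniprot.org/) is an internationally renowned protein database with three main components: the UniProtKB knowledgebase, the UniParc archive, and the UniRef reference clusters. UniProtKB is the core of UniProt and contains extensive annotation in addition to protein sequence data. It is divided into two sections, Swiss-Prot and TrEMBL. The more than 500,000 sequences in Swiss-Prot are all manually reviewed and annotated, whereas the more than 140 million sequences in TrEMBL are translated from protein-coding sequences in the EMBL nucleotide sequence database and annotated by computer according to defined rules. UniParc merges copies of the same protein held in different databases into a single record to avoid redundancy and assigns each sequence a unique identifier. UniRef groups the sequences in UniProtKB and UniParc by degree of similarity into three datasets: UniRef100, UniRef90 and UniRef50. The UniProt website provides an efficient, practical advanced search system and extensive help documentation. Each new UniProt release, published every 4 weeks, is accompanied by statistics reports from which users can learn basic information such as the size of the database and its update status, data categories and species distribution, and view statistics on general annotation, sequence feature annotation and database cross-references. UniProt is currently the most complete and most richly annotated non-redundant protein sequence database in the world and, since its creation at the beginning of this century, has been a valuable resource for the life sciences.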

4.
Metal ion binding domains are found in proteins that mediate transport, buffering or detoxification of metal ions. In this study, we have performed an in silico analysis of metal binding proteins and have identified putative metal binding motifs for the ions of cadmium, cobalt, zinc, arsenic, mercury, magnesium, manganese, molybdenum and nickel. A pattern search against the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases yielded true positives in each case, showing the high specificity of the motifs. Motifs were also validated against PDB structures and site-directed mutagenesis studies.

5.
SUMMARY: Several methods for establishing cross-links between Protein Data Bank (PDB) structures or Structural Classification of Proteins (SCOP) domains and Swiss-Prot + TrEMBL sequences (or vice versa) rely on database annotations. Alternatively, sequence alignment procedures can be used. In this study, we describe Seq2Struct, a web resource for the identification of sequence-structure links. The resource consists of an exhaustive collection of annotated links between Swiss-Prot + TrEMBL and PDB + SCOP database entries. Links are based on pre-established highly reliable thresholds and stored in a relational database, which has been enhanced using annotations derived from Swiss-Prot, PDB, SCOP, GOA and DSSP databases. The Seq2Struct database contents, supported by a web interface, can be queried online or downloaded. AVAILABILITY: The Seq2Struct resource, with related documentation, is available at http://surface.bio.uniroma2.it/seq2struct/ CONTACT: seq2struct@cbm.bio.uniroma2.it.

6.
Mapping PDB chains to UniProtKB entries   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: UniProtKB/SwissProt is the main resource for detailed annotations of protein sequences. This database provides a jumping-off point to many other resources through the links it provides. Among others, these include other primary databases, secondary databases, the Gene Ontology and OMIM. While a large number of links are provided to Protein Data Bank (PDB) files, obtaining a regularly updated mapping between UniProtKB entries and PDB entries at the chain or residue level is not straightforward. In particular, there is no regularly updated resource which allows a UniProtKB/SwissProt entry to be identified for a given residue of a PDB file. RESULTS: We have created a completely automatically maintained database which maps PDB residues to residues in UniProtKB/SwissProt and UniProtKB/TrEMBL entries. The protocol uses links from PDB to UniProtKB, from UniProtKB to PDB and a brute-force sequence scan to resolve PDB chains for which no annotated link is available. Finally, the sequences from PDB and UniProtKB are aligned to obtain a residue-level mapping. AVAILABILITY: The resource may be queried interactively or downloaded from http://www.bioinf.org.uk/pdbsws/.
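
The final step described here, turning an alignment of a PDB chain and a UniProtKB sequence into a residue-level mapping, can be illustrated as follows. This is only a sketch under stated assumptions: the pairwise alignment is taken as already computed and given as two equal-length gapped strings, residues are numbered sequentially (real PDB files may use insertion codes or non-contiguous numbering), and the function name is hypothetical rather than part of the published protocol.

```python
def residue_mapping(aligned_pdb: str, aligned_uniprot: str,
                    pdb_start: int = 1, uniprot_start: int = 1) -> dict[int, int]:
    """Map PDB chain positions to UniProtKB positions from a pairwise alignment.

    Both inputs are rows of the same alignment, with '-' marking gaps.
    Only columns where both rows hold a residue produce a mapping entry.
    """
    if len(aligned_pdb) != len(aligned_uniprot):
        raise ValueError("aligned sequences must have the same length")

    mapping: dict[int, int] = {}
    pdb_pos, uni_pos = pdb_start, uniprot_start
    for pdb_res, uni_res in zip(aligned_pdb, aligned_uniprot):
        if pdb_res != "-" and uni_res != "-":
            mapping[pdb_pos] = uni_pos
        # Advance each counter only when its row consumed a residue.
        if pdb_res != "-":
            pdb_pos += 1
        if uni_res != "-":
            uni_pos += 1
    return mapping
```

With the toy alignment rows `--MKT-AY` (PDB chain) and `GSMKTLAY` (UniProtKB), PDB residues 1 to 5 map to UniProtKB positions 3, 4, 5, 7 and 8.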

7.
8.
Biomolecule phosphorylation by protein kinases is a fundamental cell signaling process in all living cells. Following the comprehensive cataloguing of the protein kinase complement of the human genome (Manning, G., Whyte, D. B., Martinez, R., Hunter, T., and Sudarsanam, S. (2002) The protein kinase complement of the human genome. Science 298, 1912-1934), this review will detail the state-of-the-art human and mouse kinase proteomes as provided in the UniProtKB/Swiss-Prot protein knowledgebase. The sequences of the 480 classical and up to 24 atypical protein kinases now believed to exist in the human genome and 484 classical and up to 24 atypical kinases within the mouse genome have been reviewed and, where necessary, revised. Extensive annotation has been added to each entry. In an era when a wealth of new databases is emerging on the Internet, UniProtKB/Swiss-Prot makes available to the scientific community the most up-to-date and in-depth annotation of these proteins with access to additional external resources linked from within each entry. Incorrect sequence annotations resulting from errors and artifacts have been eliminated. Each entry will be constantly reviewed and updated as new information becomes available with the orthologous enzymes in related species being annotated in a parallel effort and complete kinomes being completed as sequences become available. This ensures that the mammalian kinomes available from UniProtKB/Swiss-Prot are of a consistently high standard with each separate entry acting both as a valuable information resource and a central portal to a wealth of further detail via extensive cross-referencing.

9.
The unique family of membrane-bound proton-pumping inorganic pyrophosphatases, involving pyrophosphate as the alternative to ATP, was investigated by characterizing 166 members of the UniProtKB/Swiss-Prot + UniProtKB/TrEMBL databases and available completed genomes, using sequence comparisons and a hidden Markov model based upon a conserved 57-residue region in the loop between transmembrane segments 5 and 6. The hidden Markov model was also used to search the approximately one million sequences recently reported from a large-scale sequencing project of organisms in the Sargasso Sea, resulting in an additional 164 partial pyrophosphatase sequences. The strongly conserved 57-residue region was found to contain two nonapeptidyl sequences, mainly consisting of the four 'very early' proteinaceous amino acid residues Gly, Ala, Val and Asp, compatible with an ancient origin of the inorganic pyrophosphatases. The nonapeptide patterns have charged amino acid residues at positions 1, 5 and 9, are apparent binding sites for the substrate and parts of the active site, and were shown to be so specific for these enzymes that they can be used for functional assignments of unannotated genomes.

10.
The Swiss-Prot protein knowledgebase provides manually annotated entries for all species, but concentrates on the annotation of entries from model organisms to ensure the presence of high quality annotation of representative members of all protein families. A specific Plant Protein Annotation Program (PPAP) was started to cope with the increasing amount of data produced by the complete sequencing of plant genomes. Its main goal is the annotation of proteins from the model plant organism Arabidopsis thaliana. In addition to bibliographic references, experimental results, computed features and sometimes even contradictory conclusions, direct links to specialized databases connect amino acid sequences with the current knowledge in plant sciences. As protein families and groups of plant-specific proteins are regularly reviewed to keep up with current scientific findings, we hope that the wealth of information of Arabidopsis origin accumulated in our knowledgebase, and the numerous software tools provided on the Expert Protein Analysis System (ExPASy) web site might help to identify and reveal the function of proteins originating from other plants. Recently, a single, centralized, authoritative resource for protein sequences and functional information, UniProt, was created by joining the information contained in Swiss-Prot, Translation of the EMBL nucleotide sequence (TrEMBL), and the Protein Information Resource-Protein Sequence Database (PIR-PSD). A rising problem is that an increasing number of nucleotide sequences are not being submitted to the public databases, and thus the proteins inferred from such sequences will have difficulties finding their way to the Swiss-Prot or TrEMBL databases.

11.

Background

Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes.

Results

Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries.

Conclusion

MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.
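
Two of the five MisPred routines listed in the Background, (ii) and (iii), reduce to simple consistency checks on per-sequence features. The sketch below illustrates that principle only. It is not the MisPred implementation: the `ProteinFeatures` record, its field names and the flag labels are hypothetical, and the domain and transmembrane predictions are assumed to come from upstream tools.

```python
from dataclasses import dataclass


@dataclass
class ProteinFeatures:
    """Hypothetical per-sequence summary of predicted features."""
    has_extracellular_domain: bool = False
    has_cytoplasmic_domain: bool = False
    has_nuclear_domain: bool = False
    transmembrane_segments: int = 0


def consistency_flags(p: ProteinFeatures) -> list[str]:
    """Return labels of the rules (ii) and (iii) that the sequence violates."""
    flags = []
    # Rule (ii): extracellular and cytoplasmic domains both present,
    # but no transmembrane segment to connect the two compartments.
    if (p.has_extracellular_domain and p.has_cytoplasmic_domain
            and p.transmembrane_segments == 0):
        flags.append("ii")
    # Rule (iii): extracellular and nuclear domains occur together.
    if p.has_extracellular_domain and p.has_nuclear_domain:
        flags.append("iii")
    return flags
```

An empty list means that neither rule is violated; it does not mean the sequence is correct, since the other three routines are not covered here.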

12.
Metal ion binding domains are found in proteins that mediate transport, buffering or detoxification of metal ions. The objective of the study is to design and analyze metal binding motifs against the genes involved in phytoremediation. This is being done on the basis of certain pre-requisite amino-acid residues known to bind metal ions/metal complexes in medicinal and aromatic plants (MAPs). Earlier work on MAPs has shown that heavy metals accumulated by aromatic and medicinal plants do not appear in the essential oil and that some of these species are able to grow in metal-contaminated sites. A pattern search against the UniProtKB/Swiss-Prot and UniProtKB/TrEMBL databases yielded true positives in each case, showing the high specificity of the motifs designed for the ions of nickel, lead, molybdenum, manganese, cadmium, zinc, iron, cobalt and xenobiotic compounds. Motifs were also studied against PDB structures. Results of the study suggested the presence of binding sites on the surface of the protein molecules involved. PDB structures of proteins were finally predicted for the binding-site functionality in their respective phytoremediation usage. This was further validated through the CASTp server to study its physico-chemical properties. Bioinformatics implications would help in designing strategies for developing transgenic plants with increased metal binding capacity. These metal binding factors can be used to restrict metal uptake by plants. This helps in reducing the possibility of metal movement into the food chain.

13.
SWISS-PROT, a curated protein sequence data bank, contains not only sequence data but also annotation relevant to a particular sequence. The annotation added to each entry is done by a team of biologists and comes, primarily, from articles in journals reporting the actual sequencing and sometimes characterisation. Review articles and collaboration with external experts also play a role along with the use of secondary databases like PROSITE and Pfam in addition to a variety of feature prediction methods. Annotation added by these methods is checked for relevance and likelihood to a particular sequence. The onset of genome sequencing has led to a dramatic increase in sequence data to be included in SWISS-PROT. This has led to the production of TrEMBL (Translation of the EMBL database). TrEMBL consists of entries in a SWISS-PROT format that are derived from the translation of all coding sequences in the EMBL nucleotide sequence database, that are not in SWISS-PROT. Unlike SWISS-PROT entries those in TrEMBL are awaiting manual annotation. However, rather than just representing basic sequence and source information, steps have been taken to add features and annotation automatically. In taking these steps it is hoped that TrEMBL entries are enhanced with some indication as to what a protein is, could or may be.

14.
15.
16.
Programmatic access to the UniProt Knowledgebase (UniProtKB) is essential for many bioinformatics applications dealing with protein data. We have created a Java library named UniProtJAPI, which facilitates the integration of UniProt data into Java-based software applications. The library supports queries and similarity searches that return UniProtKB entries in the form of Java objects. These objects contain functional annotations or sequence information associated with a UniProt entry. Here, we briefly describe the UniProtJAPI and demonstrate its usage.

17.
Amino acid changes due to non-synonymous variation are included as annotations for individual proteins in UniProtKB/Swiss-Prot and RefSeq, which present biological data in a protein- or gene-centric fashion. Unfortunately, proteome-wide analysis of non-synonymous single-nucleotide variations (nsSNVs) is not easy to perform because information on nsSNVs and functionally important sites is not well integrated both within and between databases and their search engines. We have developed SNVDis, which allows evaluation of proteome-wide nsSNV distribution in functional sites, domains and pathways. More specifically, we have integrated human-specific data from major variation databases (UniProtKB, dbSNP and COSMIC), comprehensive sequence feature annotation from UniProtKB, Pfam, RefSeq, Conserved Domain Database (CDD) and pathway information from Protein ANalysis THrough Evolutionary Relationships (PANTHER) and mapped all of them in a uniform and comprehensive way to the human reference proteome provided by UniProtKB/Swiss-Prot. Integrated information of active sites, pathways, binding sites, domains, which are extracted from a number of different sources, provides a detailed overview of how nsSNVs are distributed over the human proteome and pathways and how they intersect with functional sites of proteins. Additionally, it is possible to find out whether there is an over- or under-representation of nsSNVs in specific domains, pathways or user-defined protein lists. The underlying datasets are updated once every 3 months. SNVDis is freely available at http://hive.biochemistry.gwu.edu/tool/snvdis.

18.
UniRef: comprehensive and non-redundant UniProt reference clusters   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. RESULTS: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. AVAILABILITY: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
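
The UniRef100 idea of combining identical sequences and subfragments into a single entry can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the UniRef pipeline: it treats any exact substring as a subfragment, picks the longest sequence as the representative, and runs in quadratic time. The function name and accessions are hypothetical.

```python
def collapse_identical_and_fragments(seqs: dict[str, str]) -> dict[str, list[str]]:
    """Group identical sequences and exact subfragments under one representative.

    `seqs` maps accession -> sequence. The result maps each representative
    accession to the accessions merged into its cluster (itself included).
    """
    clusters: dict[str, list[str]] = {}
    # Visit longest sequences first so that a fragment always finds its
    # parent already registered; ties are broken by accession for determinism.
    for acc, seq in sorted(seqs.items(), key=lambda kv: (-len(kv[1]), kv[0])):
        for rep in clusters:
            if seq in seqs[rep]:  # identical or contained in the representative
                clusters[rep].append(acc)
                break
        else:
            clusters[acc] = [acc]
    return clusters
```

Four toy sequences, of which one is a duplicate and one a fragment of the same parent, collapse to two clusters, which is the kind of size reduction the abstract reports at much larger scale.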

19.
In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time. To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data we analyzed the changes of reported protein identifiers from all public experiments in the Proteomics Identifications (PRIDE) database by November 2010. To map the submitted protein identifier to a currently active entry, two distinct approaches were used. The first approach used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second one (called logical mapping algorithm) accessed the source databases and retrieved the current status of the reported identifier. Our analysis showed the differences between the main protein databases (International Protein Index (IPI), UniProt Knowledgebase (UniProtKB), National Center for Biotechnological Information nr database (NCBI nr), and Ensembl) in respect to identifier stability. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more IPI entries deleted compared with the logical mapping algorithm. We found several cases where experiments contained more than 10% deleted identifiers already at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB: two releases per year were used, from 2005. 
This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.
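
The first mapping approach described in this abstract, mapping identifiers on the basis of 100% sequence identity, can be sketched with a digest index. This is an illustration of the principle only and not the PICR service's implementation: the function names and identifiers are hypothetical, and sequences are assumed to be available locally for both the submitted and the current entries.

```python
import hashlib


def sequence_digest(seq: str) -> str:
    """Digest of the upper-cased sequence, used as an exact-identity key."""
    return hashlib.sha256(seq.upper().encode("ascii")).hexdigest()


def map_by_identity(old_entries: dict[str, str],
                    current_entries: dict[str, str]) -> dict[str, list[str]]:
    """Map each old identifier to current identifiers with an identical sequence.

    Both arguments map identifier -> sequence. An empty list means that no
    currently active entry carries exactly the same sequence.
    """
    index: dict[str, list[str]] = {}
    for acc, seq in current_entries.items():
        index.setdefault(sequence_digest(seq), []).append(acc)
    return {old: sorted(index.get(sequence_digest(seq), []))
            for old, seq in old_entries.items()}
```

A mapping of this kind cannot tell a deleted identifier from one whose sequence was merely revised, which is one plausible reason the two approaches in the study could give different results.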

20.
Liu F  Baggerman G  Schoofs L  Wets G 《Peptides》2006,27(12):3137-3153
Bioactive (neuro)peptides play critical roles in regulating most biological processes in animals. Peptides belonging to the same family are characterized by a typical sequence pattern that is conserved among the family's peptide members. Such a conserved pattern or motif usually corresponds to the functionally important part of the biologically active peptide. In this paper, all known bioactive (neuro)peptides annotated in Swiss-Prot and TrEMBL protein databases are collected, and the pattern searching program Pratt is used to search these unaligned peptide sequences for conserved patterns. The obtained patterns are then refined by combining the information on amino acids at important functional sites collected from the literature. All the identified patterns are further tested by scanning them against Swiss-Prot and TrEMBL protein databases. The diagnostic power of each pattern is validated by the fact that any annotated protein from Swiss-Prot and TrEMBL that contains one of the established patterns is indeed a known (neuro)peptide precursor. We discovered 155 novel peptide patterns in addition to the 56 established ones in the PROSITE database. All the patterns cover 110 peptide families. Fifty-five of these families are not characterized by the PROSITE signatures, and 12 are also not identified by other existing motif databases, such as Pfam and SMART. Using the newly identified peptide signatures as a search tool, we predicted 95 hypothetical proteins as putative peptide precursors.
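
Scanning a PROSITE-style pattern against protein sequences, as described in this abstract, can be approximated by translating the pattern into a regular expression. The sketch below covers only the core PROSITE syntax (`x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` and `(n,m)` for repetition, `<` and `>` for terminal anchors). The function names and the example pattern are hypothetical, and this is not how Pratt or the PROSITE scanning tools are implemented.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Separate the residue specification from an optional (n) or (n,m).
        m = re.fullmatch(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?", element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"
        elif not (len(core) == 1 or (core.startswith("[") and core.endswith("]"))):
            raise ValueError(f"unsupported pattern element: {element!r}")
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def scan(pattern: str, sequence: str) -> list[int]:
    """Return 0-based start positions of non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [m.start() for m in regex.finditer(sequence.upper())]
```

For instance, the made-up pattern `<M-x(2,3)-[RK]-{P}-G>` matches the toy peptide `MAAKAG` but not `MAAKPG`, because proline is excluded at the penultimate position.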
