Similar literature
 20 similar documents found (search time: 187 ms)
1.
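De novo interpretation of tandem mass spectrometry data and protein identification by database searching   Total citations: 3, self-citations: 0, citations by others: 3
Protein identification is an indispensable step in proteomics research. Tandem mass spectrometry (MS/MS) can be used for de novo sequencing of peptides, and the resulting sequences can be searched against databases to identify proteins. Using graph theory and alignment of experimental spectra with theoretical spectra, peptide spectra obtained by tandem mass spectrometry were interpreted de novo, yielding reliable peptide sequences that were then used in database searches to identify the corresponding proteins. In addition, the SwissProt and TrEMBL protein databases were analyzed in detail with statistical methods. The results show that three tetrapeptides, two pentapeptides, or one octapeptide are generally sufficient to determine a protein uniquely.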

2.
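Even though the gene structure of bacterial genomes is relatively simple, genes may still be missed during annotation. Knowledge-based gene prediction methods run into difficulty when a potential gene has no significant homologous sequence in high-quality databases. This paper aims to find missed genes by systematically scanning the protein sequence patterns of all possible ORFs in a genome. To verify the feasibility of the method, the authors systematically analyzed the genome of Corynebacterium glutamicum, an important industrial fermentation microorganism, and found 25 candidate putative genes. These have significant protein sequence patterns but no significant homologs in Swiss-Prot, and they remain unannotated in GenBank. Further analysis showed that, of the 25 candidates, 19 are probable genes, 3 are probable pseudogenes, and 3 are putative gene sequences. These results indicate that the method can be used effectively to search for genes without significant homologous sequences.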
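The first step of such a scan, enumerating every candidate ORF, can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes an ORF runs from an ATG to the first in-frame stop codon, scans all six reading frames, and uses a hypothetical `min_codons` cutoff. The subsequent protein-pattern search is not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_orfs(genome, min_codons=3):
    """Yield (strand, frame, start, end, orf) for every ATG...stop ORF.

    Coordinates are 0-based, end-exclusive, and refer to the scanned strand
    (the reverse complement for strand "-"). min_codons counts the codons
    before the stop codon and is an illustrative cutoff.
    """
    genome = genome.upper()
    for strand, seq in (("+", genome), ("-", reverse_complement(genome))):
        for frame in range(3):
            start = None  # position of the first ATG of the currently open ORF
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, frame, start, i + 3, seq[start:i + 3]
                    start = None


if __name__ == "__main__":
    for orf in find_orfs("CCATGAAATTTTAGGG"):
        print(orf)
```

Each ORF found this way would then be translated and checked against protein sequence patterns; raising `min_codons` trades sensitivity for fewer spurious candidates.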

3.
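Microbial genome annotation system MGAP   Total citations: 6, self-citations: 0, citations by others: 6
A microbial genome annotation system (Microbial genome annotation package, MGAP) was developed using bioinformatics methods and tools and applied to the genome annotation of the cyanobacterium PCC7002. The system consists of two parts: a genome annotation system and a Web-based user interface. The genome annotation system integrates multiple programs for gene finding, function prediction, and sequence analysis, together with a protein sequence database, a protein resource information system, and a database of orthologous protein families. The user interface includes a circular genome map display, distribution maps of genes and open reading frames on the chromosome, and a tool for retrieving annotation information. The system runs on PC hardware under Linux, uses MySQL as the database management system and Apache as the Web server, and its application programming interfaces are written in Perl; all of this software is freely available.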

4.
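Establishment of a proteome map database   Total citations: 12, self-citations: 0, citations by others: 12
The Bioinformation Center and the Research Center for Proteome Analysis of the Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, will release China's first proteome map database. The database has a three-tier architecture consisting of a physical layer, a link layer, and an interaction layer. It is developed entirely with Java technology and is fully platform independent. Users can conveniently browse the maps in the database and can use the various search tools it provides to obtain detailed information on proteins of interest.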

5.
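E-mail retrieval from GenBank   Total citations: 2, self-citations: 0, citations by others: 2
Hu Dehua, Fang Ping. 《遗传》 1999, 21(6): 43-46
GenBank is a public database of all known nucleic acid and protein sequences, together with their literature references and biological annotations. It is established and distributed by the US National Institutes of Health, the National Library of Medicine, and the National Center for Biotechnology Information. Its data can be obtained via the WWW, FTP, and e-mail; this paper mainly describes how to search using the query server.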

6.
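Mining the biological meaning contained in high-throughput experimental data is a major challenge for proteomics research. Based on GO (Gene Ontology), a hierarchically structured vocabulary, and the protein function annotations in related databases, a strategy was developed for functional analysis of expression profiles obtained in proteomics research. Building on functional annotation of a protein expression profile, the strategy gives the distribution of protein functions in the profile, together with statistics for functional categories of interest. This helps with an overall understanding of the protein functions in the profile and with in-depth bioinformatics analysis. The strategy has been applied successfully to the study of the fetal liver protein expression profile; users can use or download the program at http://www.hupo.org.cn/GOfact/.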

7.
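Zhao Rui, Qian Zhen, Ren Shuangxi. 《生物信息学》 2009, 7(2): 143-145, 149
A Web-based database model is designed for storing and annotating massive amounts of DNA data. The work has three parts: first, the database framework is built; next, the raw genome sequence data are annotated in batch and output in a valid format for import into the database; finally, a friendly user interface provides online reading, querying, annotation, and other operations on the genome data. The database is intended to solve the problem of effectively storing and managing the large volumes of genome sequences that are being produced and await analysis.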

8.
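Design and implementation of a sample-set database for protein secondary structure prediction   Total citations: 1, self-citations: 0, citations by others: 1
Zhang Ning, Zhang Tao. 《生物信息学》 2006, 4(4): 163-166
Database technology was applied to the processing and analysis of sample sets for protein secondary structure prediction, and a sample-set database for secondary structure prediction was built. The construction of the database is described using the CB513 sample set as an example. A sample database not only makes data easier to store, manage, and retrieve, but can also perform some simple sequence analysis tasks, replacing much of the programming that used to be necessary. This greatly improves efficiency and reduces errors.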

9.
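Dicranopteris pedata (Houtt.) Nakaike growing above the Dachang tin-polymetallic deposit in Guangxi (heavy-metal stress area) and on the periphery of the mining area, unaffected by mineralization or pollution (control area), was used as experimental material. The leaf transcriptome was sequenced by high-throughput sequencing, and the assembled unigenes were annotated against the NCBI non-redundant protein sequence database (Nr), the NCBI non-redundant nucleotide sequence database (Nt), the KEGG Orthology database (KO), the Swiss-Prot database (Swiss-Prot), the protein family database (Pfam), the Gene Ontology database (GO), and the euKaryotic Orthologous Groups database (KOG). Differentially expressed unigenes between leaves from the heavy-metal stress area and the control area were also analyzed. Sequencing yielded 19.56 Gb of clean data, of which leaves from the stress area and the control area accounted for 10.14 and 9.42 Gb, respectively. Of the 250582 assembled unigenes, 120097 were annotated, accounting for 47.93% of all unigenes. Compared with the control area, leaves from the stress area had 208 up-regulated and 620 down-regulated differentially expressed unigenes. Of the up-regulated unigenes, 120 were annotated as metabolic process, accounting for 57.69% of all up-regulated differentially expressed unigenes; of the down-regulated unigenes, 285 were annotated as catalytic activity, accounting for 45.97% of all down-regulated differentially expressed unigenes. In leaves from the stress area, 15 unigenes were related to heavy-metal transport and tolerance; among them, the relative expression levels of c44988 g1 and c84121 g1 were, respectively, highly significantly and significantly higher than in the control area. The results indicate that genes of D. pedata that respond to natural metal mineralization or to heavy-metal pollution from mining can be used for biogeochemical prospecting and for detecting heavy-metal contamination of soil.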

10.
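Protein-protein interaction (PPI) is the basis and a hallmark of the structure and activities of living organisms and controls every process of life. PPI networks are an effective means of studying protein interactions. With the development of high-throughput experimental techniques, more and more PPI data have become available, and the contents of databases that record protein interactions change every year. This paper computes the degree distribution of the PPI network data in the DIP database for each year from 2003 to 2008. To improve reliability, the intersection with annotated protein databases was sampled, and the effects on the PPI network degree distribution of the data from different years and of sampling by annotation database were examined. The results show that data growth from 2003 to 2008 had no obvious effect on the degree distribution of the PPI network, and that the function that best fits the degree distribution is not the power law, as previously believed, but a stretched exponential function. Sampling the database intersection likewise gave the stretched exponential as the best-fitting distribution, and the level of reliability did not affect the degree distribution of the PPI network.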
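As an illustration of the quantity being fitted, the sketch below computes an empirical degree distribution from an undirected interaction list and defines the two candidate model forms. The function names and the parameterisation of the models (`c`, `gamma`, `k0`, `beta`) are assumptions for illustration; the abstract does not give the authors' exact formulas or fitting procedure, and no fitting is performed here.

```python
import math
from collections import Counter


def degree_distribution(edges):
    """Return {degree k: fraction of proteins with degree k}.

    Self-interactions and duplicate pairs (in either order) are ignored,
    which is an assumption of this sketch.
    """
    degree = Counter()
    seen = set()
    for a, b in edges:
        pair = frozenset((a, b))
        if a == b or pair in seen:
            continue
        seen.add(pair)
        degree[a] += 1
        degree[b] += 1
    counts = Counter(degree.values())
    total = len(degree)
    return {k: counts[k] / total for k in sorted(counts)}


def power_law(k, c, gamma):
    """Power-law form: P(k) = c * k^(-gamma)."""
    return c * k ** (-gamma)


def stretched_exponential(k, c, k0, beta):
    """Stretched-exponential form: P(k) = c * exp(-(k / k0)^beta)."""
    return c * math.exp(-((k / k0) ** beta))


if __name__ == "__main__":
    toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "A")]
    print(degree_distribution(toy_edges))
```

Comparing the two models would then amount to fitting each form to the empirical distribution and comparing goodness of fit, for example with a least-squares routine from a numerical library.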

11.
UniRef: comprehensive and non-redundant UniProt reference clusters   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: Redundant protein sequences in biological databases hinder sequence similarity searches and make interpretation of search results difficult. Clustering of protein sequence space based on sequence similarity helps organize all sequences into manageable datasets and reduces sampling bias and overrepresentation of sequences. RESULTS: The UniRef (UniProt Reference Clusters) provide clustered sets of sequences from the UniProt Knowledgebase (UniProtKB) and selected UniProt Archive records to obtain complete coverage of sequence space at several resolutions while hiding redundant sequences. Currently covering >4 million source sequences, the UniRef100 database combines identical sequences and subfragments from any source organism into a single UniRef entry. UniRef90 and UniRef50 are built by clustering UniRef100 sequences at the 90 or 50% sequence identity levels. UniRef100, UniRef90 and UniRef50 yield a database size reduction of approximately 10, 40 and 70%, respectively, from the source sequence set. The reduced redundancy increases the speed of similarity searches and improves detection of distant relationships. UniRef entries contain summary cluster and membership information, including the sequence of a representative protein, member count and common taxonomy of the cluster, the accession numbers of all the merged entries and links to rich functional annotation in UniProtKB to facilitate biological discovery. UniRef has already been applied to broad research areas ranging from genome annotation to proteomics data analysis. AVAILABILITY: UniRef is updated biweekly and is available for online search and retrieval at http://www.uniprot.org, as well as for download at ftp://ftp.uniprot.org/pub/databases/uniprot/uniref. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
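The idea of clustering at an identity threshold can be illustrated with a greedy, representative-based sketch. This is not the UniRef pipeline: the `identity` function below is a deliberately crude, ungapped stand-in for a real alignment-based identity, and `greedy_cluster` is a hypothetical helper written for this example.

```python
def identity(a, b):
    """Toy identity: matching positions (no gaps) divided by the longer length."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, threshold):
    """Cluster {id: sequence} greedily at the given identity threshold.

    Sequences are visited longest first; each one joins the first cluster
    whose representative it matches at >= threshold, otherwise it founds a
    new cluster and becomes its representative.
    Returns a list of (representative_id, [member_ids]).
    """
    clusters = []
    ordered = sorted(sequences.items(), key=lambda kv: (-len(kv[1]), kv[0]))
    for seq_id, seq in ordered:
        for rep_id, members in clusters:
            if identity(sequences[rep_id], seq) >= threshold:
                members.append(seq_id)
                break
        else:
            clusters.append((seq_id, [seq_id]))
    return clusters


if __name__ == "__main__":
    toy = {"s1": "MKTAYIAKQR", "s2": "MKTAYIAKQK", "s3": "GGGGGGGGGG"}
    print(greedy_cluster(toy, 0.9))  # a 90% level, in the spirit of UniRef90
    print(greedy_cluster(toy, 1.0))  # identical sequences only
```

Lowering the threshold merges more sequences into each cluster, which is why the 50% level gives a larger size reduction than the 90% level.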

12.
In this article, we provide a comprehensive study of the content of the Universal Protein Resource (UniProt) protein data sets for human and mouse. The tryptic search spaces of the UniProtKB (UniProt knowledgebase) complete proteome sets were compared with other data sets from UniProtKB and with the corresponding International Protein Index, reference sequence, Ensembl, and UniRef100 (where UniRef is UniProt reference clusters) organism‐specific data sets. All protein forms annotated in UniProtKB (both the canonical sequences and isoforms) were evaluated in this study. In addition, natural and disease‐associated amino acid variants annotated in UniProtKB were included in the evaluation. The peptide unicity was also evaluated for each data set. Furthermore, the peptide information in the UniProtKB data sets was also compared against the available peptide‐level identifications in the main MS‐based proteomics repositories. Identifying the peptides observed in these repositories is an important resource of information for protein databases as they provide supporting evidence for the existence of otherwise predicted proteins. Likewise, the repositories could use the information available in UniProtKB to direct reprocessing efforts on specific sets of peptides/proteins of interest. In summary, we provide comprehensive information about the different organism‐specific sequence data sets available from UniProt, together with the pros and cons for each, in terms of search space for MS‐based bottom‐up proteomics workflows. The aim of the analysis is to provide a clear view of the tryptic search space of UniProt and other protein databases to enable scientists to select those most appropriate for their purposes.
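A tryptic search space and peptide unicity can be illustrated with an in silico digest. The sketch below assumes the common cleavage rule (cut after K or R unless the next residue is P) and an optional number of missed cleavages; the study's actual digestion settings are not stated in the abstract, and all function names here are hypothetical.

```python
from collections import defaultdict


def cleavage_pieces(protein):
    """Split a protein after every K or R that is not followed by P."""
    pieces, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            pieces.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        pieces.append(protein[start:])  # C-terminal piece without K/R
    return pieces


def tryptic_peptides(protein, missed_cleavages=0, min_length=1):
    """List peptides spanning up to missed_cleavages uncut cleavage sites."""
    pieces = cleavage_pieces(protein)
    peptides = []
    for i in range(len(pieces)):
        last = min(i + missed_cleavages, len(pieces) - 1)
        for j in range(i, last + 1):
            peptide = "".join(pieces[i:j + 1])
            if len(peptide) >= min_length:
                peptides.append(peptide)
    return peptides


def peptide_index(proteins, missed_cleavages=0, min_length=1):
    """Map each peptide to the set of protein accessions that contain it."""
    index = defaultdict(set)
    for accession, sequence in proteins.items():
        for peptide in tryptic_peptides(sequence, missed_cleavages, min_length):
            index[peptide].add(accession)
    return index


if __name__ == "__main__":
    toy = {"P1": "MKAAR", "P2": "MKCCR"}  # hypothetical accessions
    index = peptide_index(toy)
    unique = sorted(p for p, accs in index.items() if len(accs) == 1)
    print(len(index), unique)
```

The size of `index` corresponds to the search space of a data set, and the peptides that map to exactly one accession are the unique ones.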

13.
UniProt archive
UniProt Archive (UniParc) is the most comprehensive, non-redundant protein sequence database available. Its protein sequences are retrieved from predominant, publicly accessible resources. All new and updated protein sequences are collected and loaded daily into UniParc for full coverage. To avoid redundancy, each unique sequence is stored only once with a stable protein identifier, which can be used later in UniParc to identify the same protein in all source databases. When proteins are loaded into the database, database cross-references are created to link them to the origins of the sequences. As a result, performing a sequence search against UniParc is equivalent to performing the same search against all databases cross-referenced by UniParc. UniParc contains only protein sequences and database cross-references; all other information must be retrieved from the source databases.
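The storage principle described above, one record per unique sequence plus cross-references to every source, can be sketched with a small in-memory structure. The class name, the `ARC` identifier format, and the example accessions are hypothetical and are not UniParc's actual schema or identifiers.

```python
class SequenceArchive:
    """Store each distinct sequence once and track where it came from."""

    def __init__(self):
        self._id_by_sequence = {}  # sequence -> stable archive identifier
        self._xrefs = {}           # archive identifier -> {(database, accession)}

    def load(self, sequence, source_db, source_accession):
        """Register a sequence from a source database and return its identifier."""
        sequence = sequence.upper()
        archive_id = self._id_by_sequence.get(sequence)
        if archive_id is None:
            # Hypothetical identifier scheme: a running counter, never reused.
            archive_id = "ARC{:010d}".format(len(self._id_by_sequence) + 1)
            self._id_by_sequence[sequence] = archive_id
            self._xrefs[archive_id] = set()
        self._xrefs[archive_id].add((source_db, source_accession))
        return archive_id

    def cross_references(self, archive_id):
        """Return the sorted (database, accession) pairs for one sequence."""
        return sorted(self._xrefs[archive_id])

    def __len__(self):
        return len(self._id_by_sequence)


if __name__ == "__main__":
    archive = SequenceArchive()
    first = archive.load("MKTAYIAK", "DatabaseA", "A0001")
    second = archive.load("mktayiak", "DatabaseB", "B0042")
    print(first == second, archive.cross_references(first))
```

Because identical sequences share one identifier, a single search over the archive covers every source database that contributed a cross-reference.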

14.
Mapping PDB chains to UniProtKB entries   Total citations: 2, self-citations: 0, citations by others: 2
MOTIVATION: UniProtKB/SwissProt is the main resource for detailed annotations of protein sequences. This database provides a jumping-off point to many other resources through the links it provides. Among others, these include other primary databases, secondary databases, the Gene Ontology and OMIM. While a large number of links are provided to Protein Data Bank (PDB) files, obtaining a regularly updated mapping between UniProtKB entries and PDB entries at the chain or residue level is not straightforward. In particular, there is no regularly updated resource which allows a UniProtKB/SwissProt entry to be identified for a given residue of a PDB file. RESULTS: We have created a completely automatically maintained database which maps PDB residues to residues in UniProtKB/SwissProt and UniProtKB/TrEMBL entries. The protocol uses links from PDB to UniProtKB, from UniProtKB to PDB and a brute-force sequence scan to resolve PDB chains for which no annotated link is available. Finally the sequences from PDB and UniProtKB are aligned to obtain a residue-level mapping. AVAILABILITY: The resource may be queried interactively or downloaded from http://www.bioinf.org.uk/pdbsws/.

15.
Programmatic access to the UniProt Knowledgebase (UniProtKB) is essential for many bioinformatics applications dealing with protein data. We have created a Java library named UniProtJAPI, which facilitates the integration of UniProt data into Java-based software applications. The library supports queries and similarity searches that return UniProtKB entries in the form of Java objects. These objects contain functional annotations or sequence information associated with a UniProt entry. Here, we briefly describe the UniProtJAPI and demonstrate its usage.

16.
Babnigg G, Giometti CS. Proteomics, 2006, 6(16): 4514-4522
In proteome studies, identification of proteins requires searching protein sequence databases. The public protein sequence databases (e.g., NCBInr, UniProt) each contain millions of entries, and private databases add thousands more. Although much of the sequence information in these databases is redundant, each database uses distinct identifiers for the identical protein sequence and often contains unique annotation information. Users of one database obtain a database-specific sequence identifier that is often difficult to reconcile with the identifiers from a different database. When multiple databases are used for searches or the databases being searched are updated frequently, interpreting the protein identifications and associated annotations can be problematic. We have developed a database of unique protein sequence identifiers called Sequence Globally Unique Identifiers (SEGUID) derived from primary protein sequences. These identifiers serve as a common link between multiple sequence databases and are resilient to annotation changes in either public or private databases throughout the lifetime of a given protein sequence. The SEGUID Database can be downloaded (http://bioinformatics.anl.gov/SEGUID/) or easily generated at any site with access to primary protein sequence databases. Since SEGUIDs are stable, predictions based on the primary sequence information (e.g., pI, Mr) can be calculated just once; we have generated approximately 500 different calculations for more than 2.5 million sequences. SEGUIDs are used to integrate MS and 2-DE data with bioinformatics information and provide the opportunity to search multiple protein sequence databases, thereby providing a higher probability of finding the most valid protein identifications.
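A sequence-derived identifier of this kind can be computed locally from the sequence alone. The abstract does not spell out the algorithm; the sketch below assumes the commonly described construction (SHA-1 digest of the upper-cased sequence, Base64-encoded, with the trailing `=` padding removed), and that detail should be checked against the SEGUID documentation before relying on it.

```python
import base64
import hashlib


def seguid(sequence):
    """Return a SEGUID-style checksum for a protein sequence.

    Assumed construction: SHA-1 of the upper-case sequence, Base64-encoded,
    with trailing '=' padding stripped (27 characters in total).
    """
    normalized = sequence.strip().upper().encode("ascii")
    digest = hashlib.sha1(normalized).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


if __name__ == "__main__":
    # The same sequence from two databases yields the same identifier,
    # regardless of the database-specific accession or annotation.
    print(seguid("MKTAYIAKQR") == seguid("mktayiakqr"))
```

Because the identifier depends only on the residues, results computed once per sequence, such as pI or Mr, can be keyed on it and reused across database releases.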

17.
Plant protein annotation in the UniProt Knowledgebase   Total citations: 3, self-citations: 0, citations by others: 3
The Swiss-Prot, TrEMBL, Protein Information Resource (PIR), and DNA Data Bank of Japan (DDBJ) protein database activities have united to form the Universal Protein Resource (UniProt) Consortium. UniProt presents three database layers: the UniProt Archive, the UniProt Knowledgebase (UniProtKB), and the UniProt Reference Clusters. The UniProtKB consists of two sections: UniProtKB/Swiss-Prot (fully manually curated entries) and UniProtKB/TrEMBL (automated annotation, classification and extensive cross-references). New releases are published fortnightly. A specific Plant Proteome Annotation Program (http://www.expasy.org/sprot/ppap/) was initiated to cope with the increasing amount of data produced by the complete sequencing of plant genomes. Through UniProt, our aim is to provide the scientific community with a single, centralized, authoritative resource for protein sequences and functional information that will allow the plant community to fully explore and utilize the wealth of information available for both plant and non-plant model organisms.

18.
A procedure to recruit members to enlarge protein family databases is described here. The procedure makes use of UniRef50 clusters produced by UniProt. Current family entries are used to recruit additional members based on the UniRef50 clusters to which they belong. Only those additional UniRef50 members that are not fragments and whose length is within a restricted range relative to the original entry are recruited. The enriched dataset is then limited to contain only genomes from selected clades. We used the COG database, which is used for genome annotation and for studies of phylogenetics and gene evolution, as a model. To validate the method, a UniRef-Enriched COG0151 (UECOG) was tested with distinct procedures that compare recruited members with the recruiters: PSI-BLAST, secondary structure overlap (SOV), Seed Linkage, COGnitor, shared domain content, and neighbor-joining single-linkage. We observed that the first four agree in their validations. Presently, the UniRef50-based recruitment procedure enriches the COG database for Archaea, Bacteria and its subgroups Actinobacteria, Firmicutes, Proteobacteria, and other bacteria by 2.2-, 8.0-, 7.0-, 8.8-, 8.7-, and 4.2-fold, respectively, in terms of sequences, and also considerably increases the number of species.

19.
Nucleic acid sequences from genome sequencing projects are submitted as raw data, from which biologists attempt to elucidate the function of the predicted gene products. The protein sequences are stored in public databases, such as the UniProt Knowledgebase (UniProtKB), where curators try to add predicted and experimental functional information. Protein function prediction can be done using sequence similarity searches, but an alternative approach is to use protein signatures, which classify proteins into families and domains. The major protein signature databases are available through the integrated InterPro database, which provides a classification of UniProtKB sequences. As well as characterization of proteins through protein families, many researchers are interested in analyzing the complete set of proteins from a genome (i.e. the proteome), and there are databases and resources that provide non-redundant proteome sets and analyses of proteins from organisms with completely sequenced genomes. This article reviews the tools and resources available on the web for single and large-scale protein characterization and whole proteome analysis.

20.
In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time. To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data we analyzed the changes of reported protein identifiers from all public experiments in the Proteomics Identifications (PRIDE) database by November 2010. To map the submitted protein identifier to a currently active entry, two distinct approaches were used. The first approach used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second one (called logical mapping algorithm) accessed the source databases and retrieved the current status of the reported identifier. Our analysis showed the differences between the main protein databases (International Protein Index (IPI), UniProt Knowledgebase (UniProtKB), National Center for Biotechnology Information nr database (NCBI nr), and Ensembl) in respect to identifier stability. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more IPI entries deleted compared with the logical mapping algorithm. We found several cases where experiments contained more than 10% deleted identifiers already at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB: two releases per year were used, from 2005. This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.
