Similar literature
1.
In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time. To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data, we analyzed the changes of reported protein identifiers from all public experiments in the Proteomics Identifications (PRIDE) database as of November 2010. To map the submitted protein identifier to a currently active entry, two distinct approaches were used. The first approach used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second one (called the logical mapping algorithm) accessed the source databases and retrieved the current status of the reported identifier. Our analysis showed the differences between the main protein databases (International Protein Index (IPI), UniProt Knowledgebase (UniProtKB), National Center for Biotechnology Information nr database (NCBI nr), and Ensembl) with respect to identifier stability. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more IPI entries deleted compared with the logical mapping algorithm. We found several cases where experiments contained more than 10% deleted identifiers already at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB, using two releases per year from 2005 onward.
This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.

2.
In this article, we provide a comprehensive study of the content of the Universal Protein Resource (UniProt) protein data sets for human and mouse. The tryptic search spaces of the UniProtKB (UniProt knowledgebase) complete proteome sets were compared with other data sets from UniProtKB and with the corresponding International Protein Index, reference sequence, Ensembl, and UniRef100 (where UniRef is UniProt reference clusters) organism‐specific data sets. All protein forms annotated in UniProtKB (both the canonical sequences and isoforms) were evaluated in this study. In addition, natural and disease‐associated amino acid variants annotated in UniProtKB were included in the evaluation. The peptide unicity was also evaluated for each data set. Furthermore, the peptide information in the UniProtKB data sets was compared against the available peptide‐level identifications in the main MS‐based proteomics repositories. The peptides observed in these repositories are an important source of information for protein databases, as they provide supporting evidence for the existence of otherwise predicted proteins. Likewise, the repositories could use the information available in UniProtKB to direct reprocessing efforts on specific sets of peptides/proteins of interest. In summary, we provide comprehensive information about the different organism‐specific sequence data sets available from UniProt, together with the pros and cons for each, in terms of search space for MS‐based bottom‐up proteomics workflows. The aim of the analysis is to provide a clear view of the tryptic search space of UniProt and other protein databases to enable scientists to select those most appropriate for their purposes.

3.
4.
Protein N-terminal acetylation (Nα-acetylation) is among the most common modifications in eukaryotes. We previously described a simple method to enrich Nα-modified peptides using CNBr-activated Sepharose resin. A limitation of this method is that an optimal ratio of sample to resin had to be determined prior to the analysis since Lys-containing Nα-modified peptides may be lost. To address this problem, we hereby present an optimized method by the introduction of double incubation at pH 6.0. We demonstrate with the optimized method that the Nα-modified peptides can be enriched regardless of whether ε-NH2 is present or not, and the sample to resin ratio optimization is no longer necessary. Another improvement was accomplished by the inclusion of the singly charged precursor for MS/MS fragmentation to alleviate the shortcoming of the reduced charge state of Nα-modified peptides. We employed a duplicate experiment using 80 μg samples each and identified 922 IPI annotated and 103 IPI unannotated acetylated N-termini from 989 proteins, so far the largest acetylated N-termini data set acquired from a tryptic digest. Furthermore, the reproducibility of the Nα-acetyl proteome approach was evaluated and its complementarity to the regular proteome approach was analyzed. The unexpected coupling of CNBr-activated Sepharose to His-containing peptides via the imidazole group was discovered.

5.
A crucial aim upon the completion of the human genome is the verification and functional annotation of all predicted genes and their protein products. Here we describe the mapping of peptides derived from accurate interpretations of protein tandem mass spectrometry (MS) data to eukaryotic genomes and the generation of an expandable resource for integration of data from many diverse proteomics experiments. Furthermore, we demonstrate that peptide identifications obtained from high-throughput proteomics can be integrated on a large scale with the human genome. This resource could serve as an expandable repository for MS-derived proteome information.

6.
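Introduction to the UniProt protein database
Luo Jingchu, 《生物信息学》 (Chinese Journal of Bioinformatics), 2019, 17(3): 131-144
UniProt (https://www.uniprot.org/) is an internationally renowned protein database comprising three main parts: the UniProtKB knowledgebase, the UniParc archive, and the UniRef reference sequence sets. UniProtKB is the core of UniProt; in addition to protein sequence data, it contains a large amount of annotation. UniProtKB is divided into two sections, Swiss-Prot and TrEMBL. The more than 500,000 sequences in Swiss-Prot are all manually reviewed and annotated, whereas the more than 140 million sequences in TrEMBL are translated from protein-coding sequences in the EMBL nucleotide sequence database and annotated computationally according to defined rules. UniParc merges the same protein stored in different databases into a single record to avoid redundancy and assigns each sequence a unique identifier. UniRef groups the sequences in UniProtKB and UniParc by degree of similarity into three data sets: UniRef100, UniRef90, and UniRef50. The UniProt website provides users with an efficient and practical advanced search system and extensive help documentation. With each new release, issued every four weeks, UniProt also publishes statistical reports, from which users can learn basic information such as the size and update status of the database, its data categories, and its species distribution, and can view statistics on general annotation, sequence feature annotation, and database cross-references. UniProt is currently the non-redundant protein sequence database with the most complete sequence data and the richest annotation in the world; since its creation at the beginning of this century, it has provided an invaluable resource for the life sciences.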

7.
In mass spectrometry‐based proteomics, most conventional search engines match spectral data to sequence databases. These search databases thus play a crucial role in the identification process. While search engines can derive peptides in silico from protein sequences, this is usually limited to standard digestion algorithms. Customized search databases that provide detailed control over the search space can vastly outperform such standard strategies, especially in gel‐free proteomics experiments. Here we present Database on Demand, an easy‐to‐use web tool that can quickly produce a wide variety of customized search databases.

8.
There is growing interest in using mass spectrometry data to search genome sequences directly. Previous work by other authors demonstrated that this approach is able to correct and complement available genome annotations. We discuss the practical difficulty of searching large eukaryotic genomes with peptide ion trap tandem mass spectra of small proteins (<40 kDa). The challenging problem of automatically identifying peptides that span exon/intron boundaries is explored for the first time by using experimental data. In a human genome search, we find that roughly 30% of the peptides are missed, for various reasons, compared to a Swiss-Prot search. We show that this percentage is significantly reduced with improved parent mass accuracy. We finally provide several examples of predicted gene structures that could be improved by proteomics data, in particular by peptides spanning exon/intron boundaries.

9.
10.
Proteomics research routinely involves identifying peptides and proteins via MS/MS sequence database search. Thus the database search engine is an integral tool in many proteomics research groups. Here, we introduce the Comet search engine to the existing landscape of commercial and open‐source database search tools. Comet is open source, freely available, and based on one of the original sequence database search tools that has been widely used for many years.

11.
With great biological interest in post-translational modifications (PTMs), various approaches have been introduced to identify PTMs using MS/MS. Recent developments for PTM identification have focused on an unrestrictive approach that searches MS/MS spectra for all known and possibly even unknown types of PTMs at once. However, the resulting expanded search space requires much longer search time and also increases the number of false positives (incorrect identifications) and false negatives (missed true identifications), thus creating a bottleneck in high throughput analysis. Here we introduce MODa, a novel "multi-blind" spectral alignment algorithm that allows for fast unrestrictive PTM searches with no limitation on the number of modifications per peptide while featuring over an order of magnitude speedup in relation to existing approaches. We demonstrate the sensitivity of MODa on human shotgun proteomics data where it reveals multiple mutations, a wide range of modifications (including glycosylation), and evidence for several putative novel modifications. Based on the reported findings, we argue that the efficiency and sensitivity of MODa make it the first unrestrictive search tool with the potential to fully replace conventional restrictive identification of proteomics mass spectrometry data.

12.
The main goal of many proteomics experiments is an accurate and rapid quantification and identification of regulated proteins in complex biological samples. The bottleneck in quantitative proteomics remains the availability of efficient software to evaluate and quantify the tremendous amount of mass spectral data acquired during a proteomics project. A new software suite, ICPLQuant, has been developed to accurately quantify isotope‐coded protein label (ICPL)‐labeled peptides on the MS level during LC‐MALDI and peptide mass fingerprint experiments. The tool is able to generate a list of differentially regulated peptide precursors for subsequent MS/MS experiments, minimizing time‐consuming acquisition and interpretation of MS/MS data. ICPLQuant is based on two independent units. Unit 1 performs ICPL multiplex detection and quantification and proposes peptides to be identified by MS/MS. Unit 2 combines MASCOT MS/MS protein identification with the quantitative data and produces a protein/peptide list with all the relevant information accessible for further data mining. The accuracy of quantification, selection of peptides for MS/MS‐identification and the automated output of a protein list of regulated proteins are demonstrated by the comparative analysis of four different mixtures of three proteins (Ovalbumin, Horseradish Peroxidase and Rabbit Albumin) spiked into the complex protein background of the DGPF Proteome Marker.

13.
Assessment of differential protein abundance from the observed properties of detected peptides is an essential part of protein profiling based on shotgun proteomics. However, the abundance observed for shared peptides may be due to contributions from multiple proteins that are affected differently by a given treatment. Excluding shared peptides eliminates this ambiguity but may significantly decrease the number of proteins for which abundance estimates can be obtained. Peptide sharing within a family of biologically related proteins does not cause ambiguity if family members have a common response to treatment. On the basis of this concept, we have developed an approach for including shared peptides in the analysis of differential protein abundance in protein profiling. Data from a recent proteomics study of lung tissue from mice exposed to lipopolysaccharide, cigarette smoke, and a combination of these agents are used to illustrate our method. Starting from data where about half of the implicated database proteins involved shared peptides, 82% of the affected proteins were grouped into families, based on FASTA annotation, with closure on peptide sharing. In many cases, a common abundance relative to control was sufficient to explain ion-current peak areas for peptides, both unique and shared, that identified biologically related proteins in a peptide-sharing closure group. On the basis of these results, we propose that peptide-sharing closure groups provide a way to include abundance data for shared peptides in quantitative protein profiling by high-throughput mass spectrometry.
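
One way to read "closure on peptide sharing" is as the connected components of a graph in which two proteins are linked whenever they share an identified peptide. The sketch below illustrates only that reading; it is not the authors' implementation, it ignores the FASTA-annotation step mentioned in the abstract, and the function name `closure_groups` and the toy protein and peptide labels are hypothetical.

```python
from collections import defaultdict


def closure_groups(protein_peptides):
    """Group proteins into connected components linked by shared peptides.

    protein_peptides maps a protein ID to the set of peptides identified
    for it (hypothetical input format, chosen for illustration).
    """
    parent = {protein: protein for protein in protein_peptides}

    def find(protein):
        # Follow parent links to the group representative, halving paths.
        while parent[protein] != protein:
            parent[protein] = parent[parent[protein]]
            protein = parent[protein]
        return protein

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Invert the mapping: which proteins does each peptide point to?
    peptide_owners = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            peptide_owners[peptide].append(protein)

    # Every peptide shared by several proteins merges their groups.
    for owners in peptide_owners.values():
        for other in owners[1:]:
            union(owners[0], other)

    groups = defaultdict(set)
    for protein in protein_peptides:
        groups[find(protein)].add(protein)
    return sorted(sorted(group) for group in groups.values())


example = {
    "A": {"p1", "p2"},
    "B": {"p2", "p3"},
    "C": {"p3"},
    "D": {"p4"},
}
print(closure_groups(example))
```

In this toy input, A and C share no peptide directly but end up in the same group through B, which is the sense of "closure" assumed here.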

14.
Protein identification via peptide mass fingerprinting (PMF) remains a key component of high-throughput proteomics experiments in post-genomic science. Candidate protein identifications are made using bioinformatic tools from peptide peak lists obtained via mass spectrometry (MS). These algorithms rely on several search parameters, including the number of potential uncut peptide bonds matching the primary specificity of the hydrolytic enzyme used in the experiment. Typically, up to one of these "missed cleavages" is considered by the bioinformatics search tools, usually after digestion of the in silico proteome by trypsin. Using two distinct, nonredundant datasets of peptides identified via PMF and tandem MS, a simple predictive method based on information theory is presented which is able to identify experimentally defined missed cleavages with up to 90% accuracy from amino acid sequence alone. Using this simple protocol, we are able to "mask" candidate protein databases so that confident missed cleavage sites need not be considered for in silico digestion. We show that this leads to an improvement in database searching, with two different search engines, using the PMF dataset as a test set. In addition, the improved approach is also demonstrated on an independent PMF data set of known proteins that also has corresponding high-quality tandem MS data, validating the protein identifications. This approach has wider applicability for proteomics database searching, and the program for predicting missed cleavages and masking Fasta-formatted protein sequence databases has been made available via http://ispider.smith.man.ac.uk/MissedCleave.
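
For readers unfamiliar with the "up to one missed cleavage" convention, the following minimal sketch shows a conventional in silico tryptic digest. It does not reproduce the predictive masking method of the abstract. The cleavage rule used (cut after K or R unless the next residue is P) is a commonly applied simplification and is an assumption here, as are the function names and the toy sequence.

```python
def cleavage_fragments(sequence):
    """Split a protein sequence at assumed tryptic sites.

    Assumed rule: cleave after K or R, except when followed by P.
    """
    fragments, start = [], 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            fragments.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        # Keep the C-terminal remainder that does not end in K or R.
        fragments.append(sequence[start:])
    return fragments


def tryptic_digest(sequence, max_missed=1):
    """Return peptides containing at most max_missed uncut sites."""
    fragments = cleavage_fragments(sequence)
    peptides = []
    for i in range(len(fragments)):
        last = min(i + max_missed, len(fragments) - 1)
        for j in range(i, last + 1):
            # Joining j - i + 1 adjacent fragments leaves j - i missed sites.
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


print(tryptic_digest("MKRPAKGR", max_missed=0))
print(tryptic_digest("MKRPAKGR", max_missed=1))
```

Allowing one missed cleavage grows this toy search space from three peptides to five, which is why restricting where missed cleavages are considered can reduce the number of candidates a search engine has to score.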

15.
Understanding differences in the repertoire of orthologous gene pairs is vital for interpretation of pharmacological and physiological experiments if conclusions are conveyed between species. Here we present a comprehensive dataset for G protein-coupled receptors (GPCRs) in both human and mouse with a phylogenetic road map. We performed systematic searches applying several search tools such as BLAST, BLAT, and Hidden Markov models and searches in literature data. We aimed to gather a full-length version of each human or mouse GPCR in only one copy referring to a single chromosomal position. Moreover, we performed detailed phylogenetic analysis of the transmembrane regions of the receptors to establish accurate orthologous pairs. The results show the identity of 495 mouse and 400 human functional nonolfactory GPCRs. Overall, 329 of the receptors are found in one-to-one orthologous pairs, while 119 mouse and 31 human receptors originate from species-specific expansions or deletions. The average percentage similarity of the orthologue pairs is 85%, while it varies between the main GRAFS families from an average of 59 to 94%. The orthologous pairs for the lipid-binding GPCRs had the lowest levels of conservation, while the biogenic amines had the highest levels of conservation. Moreover, we searched for expressed sequence tags (ESTs) and identified more than 17,000 ESTs matching GPCRs in mouse and human, providing information about their expression patterns. On the whole, this is the most comprehensive study of the gene repertoire that codes for human and mouse GPCRs. The datasets are available for downloading.

16.
Recent studies using stable isotope labeling with amino acids in culture (SILAC) in quantitative proteomics have made mention of the problematic conversion of isotope-coded arginine to proline in cells. The resulting converted proline peptide divides the heavy peptide ion signal, causing inaccuracy when compared with the light peptide ion signal. This is of particular concern as it can affect up to half of all peptides in a proteomic experiment. Strategies to both compensate for and limit the inadvertent conversion have been demonstrated, but none have been shown to prevent it. Additionally, these methods combined with SILAC labeling in general have proven problematic in their large scale application to sensitive cell types including embryonic stem cells (ESCs) from the mouse and human. Here, we show that by providing as little as 200 mg/liter L-proline in SILAC media, the conversion of arginine to proline can be rendered completely undetectable. At the same time, there was no compromise in labeling with isotope-coded arginine, indicating there is no observable back conversion from the proline supplement. As a result, when supplemented with proline, correct interpretation of "light" and "heavy" peptide ratios could be achieved even in the worst cases of conversion. By extending these principles to ESC culture protocols and reagents we were able to routinely SILAC label both mouse and human ESCs in the absence of feeder cells and without compromising the pluripotent phenotype. This study provides the simplest protocol to prevent proline artifacts in SILAC labeling experiments with arginine. Moreover, it presents a robust, feeder cell-free protocol for performing SILAC experiments on ESCs from both the mouse and the human.

17.
18.
Here, we describe the novel use of a volatile surfactant, perfluorooctanoic acid (PFOA), for shotgun proteomics. PFOA was found to solubilize membrane proteins as effectively as sodium dodecyl sulfate (SDS). PFOA concentrations up to 0.5% (w/v) did not significantly inhibit trypsin activity. The unique features of PFOA allowed us to develop a single-tube shotgun proteomics method that used all volatile chemicals that could easily be removed by evaporation prior to mass spectrometry analysis. The experimental procedures involved: 1) extraction of proteins in 2% PFOA; 2) reduction of cystine residues with triethyl phosphine and their S-alkylation with iodoethanol; 3) trypsin digestion of proteins in 0.5% PFOA; 4) removal of PFOA by evaporation; and 5) LC-MS/MS analysis of the resulting peptides. The general applicability of the method was demonstrated with the membrane preparation of photoreceptor outer segments. We identified 75 proteins from 1 μg of the tryptic peptides in a single, 1-hour, LC-MS/MS run. About 67% of the proteins identified were classified as membrane proteins. We also demonstrate that a proteolytic 18O labeling procedure can be incorporated after the PFOA removal step for quantitative proteomic experiments. The present method does not require sample clean-up devices such as solid-phase extractions and membrane filters, so no proteins/peptides are lost in any experimental steps. Thus, this single-tube shotgun proteomics method overcomes the major drawbacks of surfactant use in proteomic experiments.

19.
Ideally, shotgun proteomics would facilitate the identification of an entire proteome with 100% protein sequence coverage. In reality, the large dynamic range and complexity of cellular proteomes result in oversampling of abundant proteins, while peptides from low abundance proteins are undersampled or remain undetected. We tested the proteome equalization technology, ProteoMiner, in conjunction with Multidimensional Protein Identification Technology (MudPIT) to determine how the equalization of protein dynamic range could improve shotgun proteomics methods for the analysis of cellular proteomes. Our results suggest low abundance protein identifications were improved by two mechanisms: (1) depletion of high abundance proteins freed ion trap sampling space usually occupied by high abundance peptides and (2) enrichment of low abundance proteins increased the probability of sampling their corresponding more abundant peptides. Both mechanisms also contributed to dramatic increases in the quantity of peptides identified and the quality of MS/MS spectra acquired due to increases in precursor intensity of peptides from low abundance proteins. From our large data set of identified proteins, we categorized the dominant physicochemical factors that facilitate proteome equalization with a hexapeptide library. These results illustrate that equalization of the dynamic range of the cellular proteome is a promising methodology to improve low abundance protein identification confidence, reproducibility, and sequence coverage in shotgun proteomics experiments, opening a new avenue of research for improving proteome coverage.

20.
Human saliva contains a large number of proteins and peptides (the salivary proteome) that help maintain homeostasis in the oral cavity. Global analysis of the human salivary proteome is important for understanding oral health and disease pathogenesis. In this study, large-scale identification of salivary proteins was demonstrated by using shotgun proteomics and two-dimensional gel electrophoresis-mass spectrometry (2-DE-MS). For the shotgun approach, whole saliva proteins were prefractionated according to molecular weight. The smallest fraction, presumably containing salivary peptides, was directly separated by capillary liquid chromatography (LC). The larger protein fractions, by contrast, were digested into peptides for subsequent LC separation. Separated peptides were analyzed by on-line electrospray tandem mass spectrometry (MS/MS) using a quadrupole-time of flight mass spectrometer, and the obtained spectra were automatically processed to search a human protein sequence database for protein identification. Additionally, 2-DE was used to map out the proteins in whole saliva. A total of 105 protein spots were excised and digested in-gel, and the resulting peptide fragments were measured by matrix-assisted laser desorption/ionization-mass spectrometry and sequenced by LC-MS/MS for protein identification. In total, we cataloged 309 proteins from human whole saliva by using these two proteomic approaches.
