
1.

A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics

Alexey I. Nesvizhskii 《Journal of Proteomics》2010,73(11):2092-2123

Similar articles

2.

A scoring method for peptide identification by tandem mass spectrometry  Cited by: 1 (self-citations: 0, citations by others: 1)

Sheng Quanhu, Tang Haixu, Xie Tao, Wang Lianshui, Ding Dafu 《Acta biochimica et biophysica Sinica》2003,35(8):734-740

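Several high-throughput scoring algorithms that assess the agreement between a tandem mass spectrum and the theoretical spectra of peptides in a database are now used in shotgun proteomics research. In practice, however, these methods produce a large number of incorrect peptide identifications. Here we propose a new scoring algorithm for identifying peptide sequences from tandem mass spectra. The algorithm jointly considers the probability of occurrence of different ion types in the tandem mass spectrum, the number of enzymatic cleavage sites in the peptide, and the degree and pattern of matching between theoretical and experimental ions. Tests on a large tandem mass spectrometry data set show that PepSearch, the software developed from this algorithm, identifies peptides more accurately than SEQUEST, currently the most widely used program. PepSearch can be downloaded from http://compbio.sibsnet.org/projects/pepsearch.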

3.

The probability distribution for a random match between an experimental-theoretical spectral pair in tandem mass spectrometry

Fridman T Razumovskaya J Verberkmoes N Hurst G Protopopescu V Xu Y 《Journal of bioinformatics and computational biology》2005,3(2):455-476

Proteomic techniques are fast becoming the main method for qualitative and quantitative determination of the protein content in biological systems. Despite notable advances, efficient and accurate analysis of high throughput proteomic data generated by mass spectrometers remains one of the major stumbling blocks in the protein identification problem. We present a model for the number of random matches between an experimental MS-MS spectrum and a theoretical spectrum of a peptide. The shape of the probability distribution is a function of the experimental accuracy, the number of peaks in the experimental spectrum, the length of the interval over which the peaks are distributed, and the number of theoretical spectral peaks in this interval. Based on this probability distribution, a goodness-of-fit tool can be used to yield fast and accurate scoring schemes for peptide identification through database search. In this paper, we describe one possible implementation of such a method and compare the performance of the resulting scoring function with that of SEQUEST. In terms of speed, our algorithm is roughly two orders of magnitude faster than the SEQUEST program, and its accuracy of peptide identification compares favorably to that of SEQUEST. Moreover, our algorithm does not use information related to the intensities of the peaks.
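
The abstract above names four quantities that shape the random-match distribution: mass accuracy, number of experimental peaks, interval length, and number of theoretical peaks. As a rough illustration only, the sketch below uses a simple binomial approximation. It assumes experimental peaks are placed uniformly and independently and ignores edge effects, so it is not the model derived in the paper. The function names are hypothetical.

```python
from math import comb


def random_match_probability(n_exp_peaks: int, interval_length: float,
                             tolerance: float) -> float:
    """Chance that one theoretical peak lies within +/- tolerance of at least
    one of n_exp_peaks experimental peaks placed uniformly at random on an
    interval of the given length (assumption: uniform, independent peaks)."""
    p_single = min(1.0, 2.0 * tolerance / interval_length)
    return 1.0 - (1.0 - p_single) ** n_exp_peaks


def match_count_tail(k_theoretical: int, p_match: float, m_observed: int) -> float:
    """P(X >= m_observed) for X ~ Binomial(k_theoretical, p_match), i.e. the
    chance of seeing at least m_observed random matches among the
    theoretical peaks (assumption: matches are independent)."""
    return sum(
        comb(k_theoretical, j) * p_match ** j * (1.0 - p_match) ** (k_theoretical - j)
        for j in range(m_observed, k_theoretical + 1)
    )


if __name__ == "__main__":
    # Illustrative numbers only: 100 peaks over 1000 m/z units, 0.5 tolerance.
    p = random_match_probability(n_exp_peaks=100, interval_length=1000.0, tolerance=0.5)
    print(match_count_tail(k_theoretical=20, p_match=p, m_observed=10))
```

A small tail probability for the observed number of matches suggests the match is unlikely to be random. A score built this way, like the one in the abstract, uses no peak intensities.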

4.

AMASS: software for automatically validating the quality of MS/MS spectrum from SEQUEST results

Sun W Li F Wang J Zheng D Gao Y 《Molecular & cellular proteomics : MCP》2004,3(12):1194-1199

Time-consuming and experience-dependent manual validations of tandem mass spectra are usually applied to SEQUEST results. This inefficient method has become a significant bottleneck for MS/MS data processing. Here we introduce a program, AMASS (advanced mass spectrum screener), which can filter the tandem mass spectra of SEQUEST results by measuring the match percentage of high-abundance ions and the continuity of matched fragment ions in the b and y series. Compared with the Xcorr and DeltaCn filters, AMASS can increase the number of positives and reduce the number of negatives in 22 datasets generated from 18 known protein mixtures. It effectively removed most noisy spectra, false interpretations, and about half of the poor fragmentation spectra, and AMASS can work synergistically with the Rscore filter. We believe the use of AMASS and Rscore can result in a more accurate identification of peptide MS/MS spectra and reduce the time and energy needed for manual validation.

5.

ProbIDtree: an automated software program capable of identifying multiple peptides from a single collision-induced dissociation spectrum collected by a tandem mass spectrometer

Zhang N Li XJ Ye M Pan S Schwikowski B Aebersold R 《Proteomics》2005,5(16):4096-4106

In MS/MS experiments with automated precursor ion selection, only a fraction of sequencing attempts lead to the successful identification of a peptide. A number of reasons may contribute to this situation. They include poor fragmentation of the selected precursor ion, the presence of modified residues in the peptide, mismatches with sequence databases, and frequently, the concurrent fragmentation of multiple precursors in the same CID attempt. Current database search engines are incapable of correctly assigning the sequences of multiple precursors to such spectra. We have developed a search engine, ProbIDtree, which can identify multiple peptides from a CID spectrum generated by the concurrent fragmentation of multiple precursor ions. This is achieved by iterative database searching in which the submitted spectra are generated by subtracting the fragment ions assigned to a tentatively matched peptide from the acquired spectrum and in which each match is assigned a tentative probability score. Tentatively matched peptides are organized in a tree structure from which their adjusted probability scores are calculated and used to determine the correct identifications. The results using MALDI-TOF-TOF MS/MS data demonstrate that multiple peptides can be effectively identified simultaneously with high confidence using ProbIDtree.

6.

Learning score function parameters for improved spectrum identification in tandem mass spectrometry experiments

M Spivak MS Bereman MJ Maccoss WS Noble 《Journal of proteome research》2012,11(9):4499-4508

The identification of proteins from spectra derived from a tandem mass spectrometry experiment involves several challenges: matching each observed spectrum to a peptide sequence, ranking the resulting collection of peptide-spectrum matches, assigning statistical confidence estimates to the matches, and identifying the proteins. The present work addresses algorithms to rank peptide-spectrum matches. Many of these algorithms, such as PeptideProphet, IDPicker, or Q-ranker, follow a similar methodology that includes representing peptide-spectrum matches as feature vectors and using optimization techniques to rank them. We propose a richer and more flexible feature set representation that is based on the parametrization of the SEQUEST XCorr score and that can be used by all of these algorithms. This extended feature set allows a more effective ranking of the peptide-spectrum matches based on the target-decoy strategy, in comparison to a baseline feature set devoid of these XCorr-based features. Ranking using the extended feature set gives a 10-40% improvement in the number of distinct peptide identifications across a range of q-value thresholds. While this work is inspired by the model of the theoretical spectrum and the similarity measure between spectra used specifically by SEQUEST, the method itself can be applied to the output of any database search. Further, our approach can be trivially extended beyond XCorr to any linear operator that can serve as a similarity score between experimental spectra and peptide sequences.

7.

A computational method for assessing peptide-identification reliability in tandem mass spectrometry analysis with SEQUEST  Cited by: 3 (self-citations: 0, citations by others: 3)

Razumovskaya J Olman V Xu D Uberbacher EC VerBerkmoes NC Hettich RL Xu Y 《Proteomics》2004,4(4):961-969

High-throughput protein identification in mass spectrometry is predominantly achieved by first identifying tryptic peptides by a database search and then by combining the peptide hits for protein identification. One of the popular tools used for the database search is SEQUEST. Peptide identification is carried out by selecting SEQUEST hits above a specified threshold, the value of which is typically chosen empirically in an attempt to separate true identifications from false ones. These SEQUEST scores are not normalized with respect to the composition, length and other parameters of the peptides. Furthermore, there is no rigorous reliability estimate assigned to the protein identifications derived from these scores. Hence, the interpretation of SEQUEST hits generally requires human involvement, making it difficult to scale up the identification process for genome-scale applications. To overcome these limitations, we have developed a method, which combines a neural network and a statistical model, for normalizing SEQUEST scores, and also for providing a reliability estimate for each SEQUEST hit. This method improves the sensitivity and specificity of peptide identification compared to the standard filtering procedure used in the SEQUEST package, and provides a basis for estimating the reliability of protein identifications.

8.

Estimating false discovery rates for peptide and protein identification using randomized databases

Gregory Hather Roger Higdon Andrew Bauman Priska D. von Haller Eugene Kolker 《Proteomics》2010,10(12):2369-2376

MS‐based proteomics characterizes protein contents of biological samples. The most common approach is to first match observed MS/MS peptide spectra against theoretical spectra from a protein sequence database and then to score these matches. The false discovery rate (FDR) can be estimated as a function of the score by searching together the protein sequence database and its randomized version and comparing the score distributions of the randomized versus nonrandomized matches. This work introduces a straightforward isotonic regression‐based method to estimate the cumulative FDRs and local FDRs (LFDRs) of peptide identification. Our isotonic method not only performed as well as other methods used for comparison, but also has the advantages of being: (i) monotonic in the score, (ii) computationally simple, and (iii) not dependent on assumptions about score distributions. We demonstrate the flexibility of our approach by using it to estimate FDRs and LFDRs for protein identification using summaries of the peptide spectra scores. We reconfirmed that several of these methods were superior to a two‐peptide rule. Finally, by estimating both the FDRs and LFDRs, we showed that, for both peptide and protein identification, moderate FDR values (5%) corresponded to large LFDR values (53 and 60%).
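
The randomized-database idea described above can be sketched in a few lines. The example below is a minimal illustration and not the isotonic-regression estimator of the paper. It assumes a concatenated search against the real and randomized databases, estimates the cumulative FDR at each score threshold as decoys divided by targets, and takes a running minimum so that the result is monotonic in the score. The function name and the toy scores are hypothetical.

```python
from typing import List, Tuple


def target_decoy_qvalues(
    psms: List[Tuple[float, bool]]
) -> List[Tuple[float, bool, float]]:
    """psms holds (score, is_decoy) pairs, higher score = better match.
    Returns (score, is_decoy, q_value) sorted by descending score."""
    ranked = sorted(psms, key=lambda p: p[0], reverse=True)

    # Cumulative FDR estimate when accepting everything down to each rank.
    fdrs = []
    targets = decoys = 0
    for _, is_decoy in ranked:
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        fdrs.append(min(1.0, decoys / max(targets, 1)))

    # q-value: smallest FDR at this threshold or any more permissive one.
    q_values = [0.0] * len(ranked)
    running_min = 1.0
    for i in range(len(ranked) - 1, -1, -1):
        running_min = min(running_min, fdrs[i])
        q_values[i] = running_min

    return [(s, d, q) for (s, d), q in zip(ranked, q_values)]


if __name__ == "__main__":
    toy = [(9.0, False), (8.0, False), (7.0, True), (6.0, False), (5.0, True)]
    for score, is_decoy, q in target_decoy_qvalues(toy):
        print(score, is_decoy, round(q, 3))
```

The running minimum is a crude way to get the monotonicity that the abstract lists as advantage (i). Tied scores are not treated specially in this sketch.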

9.

MassMatrix: A database search program for rapid characterization of proteins and peptides from tandem mass spectrometry data

Hua Xu Michael A. Freitas 《Proteomics》2009,9(6):1548-1555

MassMatrix is a program that matches tandem mass spectra with theoretical peptide sequences derived from a protein database. The program uses a mass accuracy sensitive probabilistic score model to rank peptide matches. The MS/MS search software was evaluated by use of a high mass accuracy dataset and its results compared with those from MASCOT, SEQUEST, X!Tandem, and OMSSA. For the high mass accuracy data, MassMatrix provided better sensitivity than MASCOT, SEQUEST, X!Tandem, and OMSSA for a given specificity and the percentage of false positives was 2%. More importantly, all manually validated true positives corresponded to a unique peptide/spectrum match. The presence of decoy sequences and additional variable PTMs did not significantly affect the results from the high mass accuracy search. MassMatrix performs well when compared with MASCOT, SEQUEST, X!Tandem, and OMSSA with regard to search time. MassMatrix was also run on distributed-memory clusters and achieved search speeds of ~100 000 spectra per hour when searching against a complete human database with eight variable modifications. The algorithm is available for public searches at http://www.massmatrix.net.

10.

Whole-Cell Protein Identification Using the Concept of Unique Peptides

Yupeng Zhao Yen-Han Lin 《Genomics, Proteomics & Bioinformatics (English edition)》2010,8(1):33-41

A concept of unique peptides (CUP) was proposed and implemented to identify whole-cell proteins from tandem mass spectrometry (MS/MS) ion spectra. A unique peptide is defined as a peptide, irrespective of its length, that exists in only one protein of a proteome of interest, even though it may appear more than once in that protein. Integrating CUP, a two-step whole-cell protein identification strategy was developed to further increase the confidence of identified proteins. A dataset containing 40,243 MS/MS ion spectra of Saccharomyces cerevisiae and protein identification tools including Mascot and SEQUEST were used to illustrate the proposed concept and strategy. Without CUP, SEQUEST identified 2.26-fold as many proteins as Mascot. When CUP was applied, SEQUEST identified 3.89-fold as many proteins bearing unique peptides as Mascot. Cross-comparing the two sets of identified proteins revealed only 89 common proteins derived from CUP. The key discrepancy between the identified proteins resulted from the filtering criteria employed by each protein identification tool. According to the origin of peptides classified by CUP and the commonality of proteins recognized by the protein identification tools, all identified proteins were cross-compared, resulting in four groups of proteins with different levels of assigned confidence.
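
The definition of a unique peptide given above maps onto a short lookup. The sketch below is an illustration under stated assumptions, not the authors' implementation: peptides come from a simple in-silico tryptic digest (cleavage after K or R unless followed by P, no missed cleavages), and the protein identifiers and sequences are invented for the example.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def tryptic_peptides(sequence: str) -> Set[str]:
    """Distinct peptides from a simplified tryptic digest:
    cleave after K or R unless the next residue is P, no missed cleavages."""
    return {p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p}


def unique_peptides(proteome: Dict[str, str]) -> Dict[str, str]:
    """Map each peptide that occurs in exactly one protein to that protein.
    Repeats within the same protein still count as unique, as in the
    definition quoted above."""
    owners: Dict[str, Set[str]] = defaultdict(set)
    for protein_id, sequence in proteome.items():
        for peptide in tryptic_peptides(sequence):
            owners[peptide].add(protein_id)
    return {pep: next(iter(ids)) for pep, ids in owners.items() if len(ids) == 1}


if __name__ == "__main__":
    # Hypothetical three-protein proteome.
    toy_proteome = {"P1": "AAKCCR", "P2": "AAKDDR", "P3": "EEKEEK"}
    print(unique_peptides(toy_proteome))
```

In the toy proteome, `AAK` is shared by P1 and P2 and is dropped, while `EEK` occurs twice in P3 only and is kept. Splitting on a zero-width pattern with `re.split` requires Python 3.7 or later.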

11.

Automatic quality assessment of peptide tandem mass spectra

Bern M Goldberg D McDonald WH Yates JR 《Bioinformatics (Oxford, England)》2004,20(Z1):i49-i54

MOTIVATION: A powerful proteomics methodology couples high-performance liquid chromatography (HPLC) with tandem mass spectrometry and database-search software, such as SEQUEST. Such a set-up, however, produces a large number of spectra, many of which are of too poor quality to be useful. Hence a filter that eliminates poor spectra before the database search can significantly improve throughput and robustness. Moreover, spectra judged to be of high quality, but that cannot be identified by database search, are prime candidates for still more computationally intensive methods, such as de novo sequencing or wider database searches including post-translational modifications. RESULTS: We report on two different approaches to assessing spectral quality prior to identification: binary classification, which predicts whether or not SEQUEST will be able to make an identification, and statistical regression, which predicts a more universal quality metric involving the number of b- and y-ion peaks. The best of our binary classifiers can eliminate over 75% of the unidentifiable spectra while losing only 10% of the identifiable spectra. Statistical regression can pick out spectra of modified peptides that can be identified by a de novo program but not by SEQUEST. In a section of independent interest, we discuss intensity normalization of mass spectra.

12.

Deriving the probabilities of water loss and ammonia loss for amino acids from tandem mass spectra

Sun S Yu C Qiao Y Lin Y Dong G Liu C Zhang J Zhang Z Cai J Zhang H Bu D 《Journal of proteome research》2008,7(1):202-208

In protein identification through tandem mass spectrometry, it is critical to accurately predict the theoretical spectrum for a peptide sequence. The widely used prediction models, such as those of SEQUEST and MASCOT, ignore the intensity of the ions with important neutral losses, including water loss and ammonia loss. However, ignoring these neutral losses results in a significant deviation between the predicted theoretical spectrum and its experimental counterpart. Here, based on the "one peak, multiple explanations" observation, we proposed an expectation-maximization (EM) method to automatically learn the probabilities of water loss and ammonia loss for each amino acid. Then we employed these probabilities to design an improved statistical model for theoretical spectrum prediction. We implemented these methods and tested them on practical data. On a training set containing 1803 spectra, the experimental results show good agreement with existing knowledge about neutral losses, such as the tendency of water loss from Asp, Glu, Ser, and Thr. Furthermore, on a testing set containing 941 spectra, the improved similarity between the experimental and predicted spectra demonstrates that this method can generate more reasonable predictions relative to the model that ignores neutral losses. As an application of the derived probabilities, we implemented a database searching method adopting the improved theoretical spectrum model with neutral loss ions estimated. Experimental results on Keller's data set demonstrate that this method can identify peptides more accurately than SEQUEST. In another application to validate SEQUEST's results, the reported peptide-spectrum pairs are reranked with respect to the similarity between experimental and predicted spectra. Experimental results on both LTQ and QSTAR data sets suggest that this reranking strategy can effectively distinguish the false negative predictions reported by SEQUEST.

13.

MixGF: Spectral Probabilities for Mixture Spectra from more than One Peptide

Jian Wang Philip E. Bourne Nuno Bandeira 《Molecular & cellular proteomics : MCP》2014,13(12):3688-3697

In large-scale proteomic experiments, multiple peptide precursors are often cofragmented simultaneously in the same mixture tandem mass (MS/MS) spectrum. These spectra tend to elude current computational tools because of the ubiquitous assumption that each spectrum is generated from only one peptide. Therefore, tools that consider multiple peptide matches to each MS/MS spectrum can potentially improve the relatively low spectrum identification rate often observed in proteomics experiments. More importantly, data independent acquisition protocols promoting the cofragmentation of multiple precursors are emerging as alternative methods that can greatly improve the throughput of peptide identifications, but their success also depends on the availability of algorithms to identify multiple peptides from each MS/MS spectrum. Here we address a fundamental question in the identification of mixture MS/MS spectra: determining the statistical significance of multiple peptides matched to a given MS/MS spectrum. We propose the MixGF generating function model to rigorously compute the statistical significance of peptide identifications for mixture spectra and show that this approach improves the sensitivity of current mixture spectra database search tools by ≈30–390%. Analysis of multiple data sets with MixGF reveals that in complex biological samples the number of identified mixture spectra can be as high as 20% of all the identified spectra and the number of unique peptides identified only in mixture spectra can be up to 35.4% of those identified in single-peptide spectra.
The advancement of technology and instrumentation has made tandem mass (MS/MS) spectrometry the leading high-throughput method to analyze proteins. In typical experiments, tens of thousands to millions of MS/MS spectra are generated and enable researchers to probe various aspects of the proteome on a large scale.
Part of this success hinges on the availability of computational methods that can analyze the large amount of data generated from these experiments. The classical question in computational proteomics asks: given an MS/MS spectrum, what is the peptide that generated the spectrum? However, it is increasingly being recognized that this assumption that each MS/MS spectrum comes from only one peptide is often not valid. Several recent analyses show that as many as 50% of the MS/MS spectra collected in typical proteomics experiments come from more than one peptide precursor. The presence of multiple peptides in mixture spectra can decrease their identification rate to as low as one half of that for MS/MS spectra generated from only one peptide. In addition, there have been numerous developments in data independent acquisition (DIA) technologies where multiple peptide precursors are intentionally selected to cofragment in each MS/MS spectrum. These emerging technologies can address some of the enduring disadvantages of traditional data-dependent acquisition (DDA) methods (e.g. low reproducibility) and potentially increase the throughput of peptide identification 5–10 fold. However, despite the growing importance of mixture spectra in various contexts, there are still only a few computational tools that can analyze mixture spectra from more than one peptide. Our recent analysis indicated that current database search methods for mixture spectra still have relatively low sensitivity compared with their single-peptide counterparts and the main bottleneck is their limited ability to separate true matches from false positive matches.
Traditionally, the problem of peptide identification from MS/MS spectra involves two sub-problems: 1) define a Peptide-Spectrum-Match (PSM) scoring function that assigns each MS/MS spectrum to the peptide sequence that most likely generated the spectrum; and 2) given a set of top-scoring PSMs, select the subset of statistically significant PSMs. Here we focus on the second problem, which is still an ongoing research question even for the case of single-peptide spectra. Intuitively the second problem is difficult because one needs to consider spectra across the whole data set (instead of comparing different peptide candidates against one spectrum as in the first problem) and PSM scoring functions are often not well-calibrated across different spectra (i.e. a PSM score of 50 may be good for one spectrum but poor for a different spectrum). Ideally, a scoring function will give high scores to all true PSMs and low scores to false PSMs regardless of the peptide or spectrum being considered. However, in practice, some spectra may receive higher scores than others simply because they have more peaks or their precursor mass results in more peptide candidates being considered from the sequence database. Therefore, a scoring function that accounts for spectrum- or peptide-specific effects can make the scores more comparable and thus help assess the confidence of identifications across different spectra. The MS-GF solution to this problem is to compute the per-spectrum statistical significance of each top-scoring PSM, which can be defined as the probability that a random peptide (out of all possible peptides within parent mass tolerance) will match the spectrum with a score at least as high as that of the top-scoring PSM. This measures how good the current best match is in relation to all possible peptides matching the same spectrum, normalizing any spectrum effect from the scoring function.
Intuitively, our proposed MixGF approach extends the MS-GF approach to calculate the statistical significance of the top pair of peptides matched from the database to a given mixture spectrum M (i.e. the significance of the top peptide–peptide spectrum match (PPSM)). As such, MixGF determines the probability that a random pair of peptides (out of all possible peptides within parent mass tolerance) will match a given mixture spectrum with a score at least as high as that of the top-scoring PPSM.
Despite the theoretical attractiveness of computing statistical significance, it is generally prohibitive for any database search method to score all possible peptides against a spectrum. Therefore, earlier works in this direction focus on approximating this probability by assuming the score distribution of all PSMs follows a certain analytical form such as the normal, Poisson or hypergeometric distribution. In practice, because score distributions are highly data-dependent and spectrum-specific, these model assumptions do not always hold. Other approaches tried to learn the score distribution empirically from the data. However, one is most interested in the region of the score distribution where only a small fraction of false positives are allowed (typically at 1% FDR). This usually corresponds to the extreme tail of the distribution where p values are on the order of 10⁻⁹ or lower, and thus there are typically too few data points to accurately model the tail of the score distribution. More recently, Kim et al. and Alves et al., in parallel, proposed a generating function approach to compute the exact score distribution of random peptide matches for any spectrum without explicitly matching all peptides to a spectrum. Because it is an exact computation, no assumption is made about the form of the score distribution and the tail of the distribution can be computed very accurately.
As a result, this approach substantially improved the ability to separate true matches from false positive ones and led to a significant increase in sensitivity of peptide identification over state-of-the-art database search tools in single-peptide spectra.
For mixture spectra, it is expected that the scores for the top-scoring match will be even less comparable across different spectra because now more than one peptide and different numbers of peptides can be matched to each spectrum at the same time. We extend the generating function approach to rigorously compute the statistical significance of multiple-Peptide-Spectrum Matches (mPSMs) and demonstrate its utility toward addressing the peptide identification problem in mixture spectra. In particular, we show how to extend the generating function approach to mixtures of two peptides. We focus on this relatively simple case of mixture spectra because it accounts for a large fraction of the mixture spectra present in traditional DDA workflows. This allows us to test and develop algorithmic concepts using readily available DDA data because data with more complex mixture spectra, such as those from DIA workflows, are still not widely available in public repositories.
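
The single-peptide generating function idea that MixGF builds on can be illustrated with a toy dynamic program. The sketch below is not the MS-GF or MixGF implementation. It assumes a hypothetical three-letter alphabet with small integer residue masses, equal residue probabilities, and an additive score in which the spectrum contributes an integer `node_score` at each prefix mass of the peptide. Under those assumptions it computes the exact score distribution over all peptides of a given parent mass without enumerating them.

```python
from collections import defaultdict
from typing import Dict

# Hypothetical toy alphabet: residue -> integer mass (not real amino acids).
TOY_MASSES: Dict[str, int] = {"A": 2, "B": 3, "C": 5}


def score_distribution(parent_mass: int, node_score: Dict[int, int],
                       masses: Dict[str, int] = TOY_MASSES) -> Dict[int, float]:
    """Total probability of all peptides with the given parent mass, grouped
    by score. A peptide's probability is the product of equal per-residue
    probabilities; its score is the sum of node_score over its prefix masses."""
    p_residue = 1.0 / len(masses)
    dist = [defaultdict(float) for _ in range(parent_mass + 1)]
    dist[0][0] = 1.0  # the empty prefix has mass 0 and score 0
    for m in range(1, parent_mass + 1):
        gain = node_score.get(m, 0)
        for residue_mass in masses.values():
            prev = m - residue_mass
            if prev < 0:
                continue
            # Extend every prefix of mass prev by one residue.
            for t, p in dist[prev].items():
                dist[m][t + gain] += p * p_residue
    return dict(dist[parent_mass])


def spectral_probability(dist: Dict[int, float], threshold: int) -> float:
    """Probability that a random peptide scores at least threshold."""
    return sum(p for t, p in dist.items() if t >= threshold)


if __name__ == "__main__":
    toy_scores = {2: 1, 5: 2, 7: -1, 10: 3}  # made-up per-mass scores
    d = score_distribution(10, toy_scores)
    print(spectral_probability(d, threshold=5))
```

The table grows with parent mass times the number of distinct scores, not with the number of peptides, which is why the tail can be computed exactly. Extending this to pairs of peptides, as the text describes for MixGF, needs a larger state space that this sketch does not attempt.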

14.

Phosphorylation-specific MS/MS scoring for rapid and accurate phosphoproteome analysis

Payne SH Yau M Smolka MB Tanner S Zhou H Bafna V 《Journal of proteome research》2008,7(8):3373-3381

The promise of mass spectrometry as a tool for probing signal-transduction is predicated on reliable identification of post-translational modifications. Phosphorylations are key mediators of cellular signaling, yet are hard to detect, partly because of unusual fragmentation patterns of phosphopeptides. In addition to being accurate, MS/MS identification software must be robust and efficient to deal with increasingly large spectral data sets. Here, we present a new scoring function for the Inspect software for phosphorylated peptide tandem mass spectra for ion-trap instruments, without the need for manual validation. The scoring function was modeled by learning fragmentation patterns from 7677 validated phosphopeptide spectra. We compare our algorithm against SEQUEST and X!Tandem on testing and training data sets. At a 1% false positive rate, Inspect identified the greatest total number of phosphorylated spectra, 13% more than SEQUEST and 39% more than X!Tandem. Spectra identified by Inspect tended to score better in several spectral quality measures. Furthermore, Inspect runs much faster than either SEQUEST or X!Tandem, making desktop phosphoproteomics feasible. Finally, we used our new models to reanalyze a corpus of 423,000 LTQ spectra acquired for a phosphoproteome analysis of Saccharomyces cerevisiae DNA damage and repair pathways and discovered 43% more phosphopeptides than the previous study.

15.

Tempest: GPU-CPU computing for high-throughput database spectral matching

Milloy JA Faherty BK Gerber SA 《Journal of proteome research》2012,11(7):3581-3591

Modern mass spectrometers are now capable of producing hundreds of thousands of tandem (MS/MS) spectra per experiment, making the translation of these fragmentation spectra into peptide matches a common bottleneck in proteomics research. When coupled with experimental designs that enrich for post-translational modifications such as phosphorylation and/or include isotopically labeled amino acids for quantification, additional burdens are placed on this computational infrastructure by shotgun sequencing. To address this issue, we have developed a new database searching program that utilizes the massively parallel compute capabilities of a graphics processing unit (GPU) to produce peptide spectral matches in a very high throughput fashion. Our program, named Tempest, combines efficient database digestion and MS/MS spectral indexing on a CPU with fast similarity scoring on a GPU. In our implementation, the entire similarity score, including the generation of full theoretical peptide candidate fragmentation spectra and their comparison to experimental spectra, is computed on the GPU. Although Tempest uses the classical SEQUEST XCorr score as a primary metric for evaluating similarity for spectra collected at unit resolution, we have developed a new "Accelerated Score" for MS/MS spectra collected at high resolution that is based on a computationally inexpensive dot product but exhibits scoring accuracy similar to that of the classical XCorr. In our experience, Tempest provides compute-cluster level performance in an affordable desktop computer.

16.

Generalized method for probability-based peptide and protein identification from tandem mass spectrometry data and sequence database searching

Ramos-Fernández A Paradela A Navajas R Albar JP 《Molecular & cellular proteomics : MCP》2008,7(9):1748-1754

Tandem mass spectrometry-based proteomics is currently in great need of computational methods that facilitate the elimination of likely false positives in peptide and protein identification. In the last few years, a number of new peptide identification programs have been described, but scores or other significance measures reported by these programs cannot always be directly translated into an easy-to-interpret error rate measurement such as the false discovery rate. In this work we used generalized lambda distributions to model frequency distributions of database search scores computed by MASCOT, X!TANDEM with k-score plug-in, OMSSA, and InsPecT. From these distributions, we could successfully estimate p values and false discovery rates with high accuracy. From the set of peptide assignments reported by any of these engines, we also defined a generic protein scoring scheme that enabled accurate estimation of protein-level p values by simulation of random score distributions and that was also found to yield good estimates of the protein-level false discovery rate. The performance of these methods was evaluated by searching four freely available data sets ranging from 40,000 to 285,000 MS/MS spectra.

17.

A dataset of human liver proteins identified by protein profiling via isotope-coded affinity tag (ICAT) and tandem mass spectrometry  Cited by: 7 (self-citations: 0, citations by others: 7)

Yan W Lee H Deutsch EW Lazaro CA Tang W Chen E Fausto N Katze MG Aebersold R 《Molecular & cellular proteomics : MCP》2004,3(10):1039-1041

Proteins from human liver carcinoma Huh7 cells, representing transformed liver cells, and cultured primary human fetal hepatocytes (HFH) and human HH4 hepatocytes, representing nontransformed liver cells, were extracted and processed for proteome analysis. Proteins from stimulated cells (interferon-alpha treatment for the Huh7 and HFH cells and induction of hepatitis C virus [HCV] proteins for the HH4 cells) and corresponding control cells were labeled with light and heavy cleavable ICAT reagents, respectively. The labeled samples were combined, trypsinized, and subjected to cation-exchange and avidin-affinity chromatography. The resulting cysteine-containing peptides were analyzed by microcapillary LC-MS/MS. The MS/MS spectra were initially analyzed by searching the human International Protein Index database using the SEQUEST software (1). Subsequently, new statistical algorithms were applied to the collective SEQUEST search results of each experiment. First, the PeptideProphet software (2) was applied to discriminate true assignments of MS/MS spectra to peptide sequences from false assignments, to assign a probability value for each identified peptide, and to compute the sensitivity and error rate for the assignment of spectra to sequences in each experiment. Second, the ProteinProphet software (3) was used to infer the protein identifications and to compute probabilities that a protein had been correctly identified, based on the available peptide sequence evidence. The resulting protein lists were filtered by a ProteinProphet probability score p ≥ 0.5, which corresponded to an error rate of less than 5%. A total of 1,296, 1,430, and 1,476 proteins or related protein groups were identified in three subdatasets from the Huh7, HFH, and HH4 cells, respectively. In total, these subdatasets contained 2,486 unique protein identifications from human liver cells. An increase of the threshold to p ≥ 0.9 (corresponding to an error rate of less than 1%) resulted in 2,159 unique protein identifications (1,146, 1,235, and 1,318 for the Huh7, HFH, and HH4 cells, respectively).
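The link between a probability threshold and an error rate in the abstract above can be illustrated with a minimal sketch. It assumes the usual model-based estimate, in which the expected fraction of incorrect entries in a filtered list is the mean of (1 - p) over the accepted entries. The function name and the example probabilities are hypothetical and are not taken from the paper or from the ProteinProphet software.

```python
from typing import Sequence


def expected_error_rate(probabilities: Sequence[float], threshold: float) -> float:
    """Estimate the fraction of incorrect entries among those with p >= threshold.

    Assumption: each probability is a well-calibrated chance that the
    identification is correct, so (1 - p) is its expected error contribution.
    """
    accepted = [p for p in probabilities if p >= threshold]
    if not accepted:
        return 0.0  # nothing passes the filter, so there is nothing to be wrong about
    return sum(1.0 - p for p in accepted) / len(accepted)


# Hypothetical probabilities, for illustration only.
example_probs = [0.99, 0.97, 0.95, 0.80, 0.55, 0.30]
rate_at_05 = expected_error_rate(example_probs, 0.5)
rate_at_09 = expected_error_rate(example_probs, 0.9)
```

Raising the threshold shortens the accepted list and lowers the estimated error rate, which is the trade-off the abstract reports between p ≥ 0.5 and p ≥ 0.9.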

18.

A new algorithm for the evaluation of shotgun peptide sequencing in proteomics: support vector machine classification of peptide MS/MS spectra and SEQUEST scores (total citations: 1; self-citations: 0; citations by others: 1)

Anderson DC Li W Payan DG Noble WS 《Journal of proteome research》2003,2(2):137-146

Shotgun tandem mass spectrometry-based peptide sequencing using programs such as SEQUEST allows high-throughput identification of peptides, which in turn allows the identification of corresponding proteins. We have applied a machine learning algorithm, called the support vector machine, to discriminate between correctly and incorrectly identified peptides using SEQUEST output. Each peptide was characterized by SEQUEST-calculated features such as delta Cn and Xcorr, measurements such as precursor ion current and mass, and additional calculated parameters such as the fraction of matched MS/MS peaks. The trained SVM classifier performed significantly better than previous cutoff-based methods at separating positive from negative peptides. Positive and negative peptides were more readily distinguished in training set data acquired on a QTOF, compared to an ion trap mass spectrometer. The use of 13 features, including four new parameters, significantly improved the separation between positive and negative peptides. Use of the support vector machine and these additional parameters resulted in a more accurate interpretation of peptide MS/MS spectra and is an important step toward automated interpretation of peptide tandem mass spectrometry data in proteomics.

19.

The standard protein mix database: a diverse data set to assist in the production of improved peptide and protein identification software tools

Klimek J Eddes JS Hohmann L Jackson J Peterson A Letarte S Gafken PR Katz JE Mallick P Lee H Schmidt A Ossola R Eng JK Aebersold R Martin DB 《Journal of proteome research》2008,7(1):96-103

Tandem mass spectrometry (MS/MS) is frequently used in the identification of peptides and proteins. Typical proteomic experiments rely on algorithms such as SEQUEST and MASCOT to compare thousands of tandem mass spectra against the theoretical fragment ion spectra of peptides in a database. The probabilities that these spectrum-to-sequence assignments are correct can be determined by statistical software such as PeptideProphet or through estimations based on reverse or decoy databases. However, many of the software applications that assign probabilities for MS/MS spectra to sequence matches were developed using training data sets from 3D ion-trap mass spectrometers. Given the variety of types of mass spectrometers that have become commercially available over the last 5 years, we sought to generate a data set of reference data covering multiple instrumentation platforms to facilitate both the refinement of existing computational approaches and the development of novel software tools. We analyzed the proteolytic peptides in a mixture of tryptic digests of 18 proteins, named the "ISB standard protein mix", using 8 different mass spectrometers. These include linear and 3D ion traps, two quadrupole time-of-flight platforms (qq-TOF), and two MALDI-TOF-TOF platforms. The resulting data set, which has been named the Standard Protein Mix Database, consists of over 1.1 million spectra in 150+ replicate runs on the mass spectrometers. The data were inspected for quality of separation and searched using SEQUEST. All data, including the native raw instrument and mzXML formats and the PeptideProphet validated peptide assignments, are available at http://regis-web.systemsbiology.net/PublicDatasets/.
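The abstract above mentions error estimates based on reverse or decoy databases. A minimal sketch of one common variant follows. It assumes that every match carries a score and a flag saying whether it hit a decoy sequence, and that the decoy database is the same size as the target database, so the number of decoy hits above a cutoff approximates the number of false target hits. The function name, the scores, and the decoys/targets estimator are illustrative assumptions, not the procedure used for this data set.

```python
from typing import Iterable, Tuple


def decoy_fdr(matches: Iterable[Tuple[float, bool]], score_cutoff: float) -> float:
    """Estimate the false discovery rate among target matches at a score cutoff.

    Each match is (score, is_decoy). Assumption: decoy and target databases
    are of equal size, so decoy hits estimate false target hits one-to-one.
    """
    targets = 0
    decoys = 0
    for score, is_decoy in matches:
        if score >= score_cutoff:
            if is_decoy:
                decoys += 1
            else:
                targets += 1
    if targets == 0:
        return 0.0  # no accepted target matches, so no rate to report
    return decoys / targets


# Hypothetical (score, is_decoy) pairs, for illustration only.
example_matches = [
    (4.1, False), (3.8, False), (3.5, False),
    (3.2, True), (2.9, False), (2.5, True),
]
fdr_loose = decoy_fdr(example_matches, 3.0)
fdr_strict = decoy_fdr(example_matches, 3.4)
```

Other estimators exist (for example, doubling the decoy count in a concatenated search), so the ratio used here should be read as one possible choice.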

20.

Monte Carlo simulation-based algorithms for analysis of shotgun proteomic data

Xu H Freitas MA 《Journal of proteome research》2008,7(7):2605-2615

Two new statistical models based on Monte Carlo simulation (MCS) have been developed to score peptide matches in shotgun proteomic data and incorporated in a database search program, MassMatrix (www.massmatrix.net). The first model evaluates peptide matches based on the total abundance of matched peaks in the experimental spectra. The second model evaluates amino acid residue tags within MS/MS spectra. The two models provide complementary scores for peptide matches that result in higher confidence in peptide identification when significant scores are returned from both models. The MCS-based models use a variance reduction technique that improves estimation precision. Due to the high computational expense of MCS-based models, peptide matches were prefiltered by other statistical models before further evaluation by the MCS-based models. Receiver operating characteristic analysis of the data sets confirmed that MCS-based models improved the overall performance of the MassMatrix search software, especially for low-mass accuracy data sets.
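The general idea of scoring a match by Monte Carlo simulation on matched peak abundance can be sketched as follows. This is an assumption-laden illustration, not the MassMatrix model: it asks how often the same number of peaks drawn at random from the spectrum would reach the observed total abundance, and it uses plain sampling without the variance reduction technique mentioned in the abstract. The function name, parameters, and intensities are hypothetical.

```python
import random
from typing import Sequence


def monte_carlo_p_value(
    intensities: Sequence[float],
    n_matched: int,
    observed_sum: float,
    n_trials: int = 2000,
    seed: int = 0,
) -> float:
    """Estimate P(random total abundance >= observed) by simulation.

    Each trial draws n_matched distinct peaks uniformly at random and sums
    their intensities. The add-one correction keeps the estimate above zero.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    peaks = list(intensities)
    hits = 0
    for _ in range(n_trials):
        if sum(rng.sample(peaks, n_matched)) >= observed_sum:
            hits += 1
    return (hits + 1) / (n_trials + 1)


# Hypothetical spectrum: 18 weak peaks and 2 strong ones.
example_spectrum = [1.0] * 18 + [50.0, 60.0]
p_strong = monte_carlo_p_value(example_spectrum, 2, 110.0)  # both strong peaks matched
p_weak = monte_carlo_p_value(example_spectrum, 2, 2.0)      # two weak peaks matched
```

A small estimate for the match covering both strong peaks indicates that such a total abundance is unlikely to arise from random peak matching, whereas the weak match is indistinguishable from chance.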