Similar articles
 20 similar articles found (search time: 15 ms)
1.
Tandem mass spectrometry (MS/MS) combined with database searching is currently the most widely used method for high-throughput peptide and protein identification. Many different algorithms, scoring criteria, and statistical models have been used to identify peptides and proteins in complex biological samples, and many studies, including our own, describe the accuracy of these identifications using, at best, generic terms such as "high confidence." False positive identification rates for these criteria can vary substantially with changing organisms under study, growth conditions, sequence databases, experimental protocols, and instrumentation; therefore, study-specific methods are needed to estimate the accuracy (false positive rates) of these peptide and protein identifications. We present and evaluate methods for estimating false positive identification rates based on searches of randomized databases (reversed and reshuffled). We examine the use of separate searches of a forward then a randomized database and combined searches of a randomized database appended to a forward sequence database. Estimated error rates from randomized database searches are first compared against actual error rates from MS/MS runs of known protein standards. These methods are then applied to biological samples of the model microorganism Shewanella oneidensis strain MR-1. Based on the results obtained in this study, we recommend the use of combined searches of a reshuffled database appended to a forward sequence database as a means of providing quantitative estimates of false positive identification rates of peptides and proteins. This will allow researchers to set criteria and thresholds to achieve a desired error rate and provide the scientific community with direct and quantifiable measures of peptide and protein identification accuracy as opposed to vague assessments such as "high confidence."
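The combined forward-plus-randomized search described in this abstract can be illustrated with a minimal Python sketch. The abstract does not give the estimator, so the ratio used here (randomized hits divided by forward hits above a score threshold) is an assumption, as are the function names and the `(score, is_decoy)` input layout.

```python
from typing import List, Optional, Tuple

# One search hit: (score, is_decoy). is_decoy is True when the best match
# came from the randomized half of the combined database (assumed layout).
Match = Tuple[float, bool]


def estimate_false_positive_rate(matches: List[Match], threshold: float) -> float:
    """Estimate the false positive rate among forward hits at `threshold`.

    Assumption: a wrong match is equally likely to fall in the forward or
    the randomized half, so the count of randomized hits approximates the
    count of wrong forward hits.
    """
    decoys = sum(1 for score, is_decoy in matches if is_decoy and score >= threshold)
    forward = sum(1 for score, is_decoy in matches if not is_decoy and score >= threshold)
    if forward == 0:
        # No forward hits pass: treat any surviving decoy as a 100% error rate.
        return 1.0 if decoys else 0.0
    return min(1.0, decoys / forward)


def lowest_threshold(matches: List[Match], max_rate: float) -> Optional[float]:
    """Return the lowest observed score whose estimated rate is <= max_rate."""
    for threshold in sorted({score for score, _ in matches}):
        if estimate_false_positive_rate(matches, threshold) <= max_rate:
            return threshold
    return None
```

Given a desired error rate, `lowest_threshold` picks the most permissive score cutoff that meets it, which is the kind of study-specific threshold setting the abstract argues for.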

2.
Database search is a standard technique for identifying peptides from their tandem mass spectra. To increase the number of correctly identified peptides, we suggest a probabilistic framework that allows the combination of scores from different search engines into a joint consensus score. Central to the approach is a novel method to estimate scores for peptides not found by an individual search engine. This approach allows the estimation of p-values for each candidate peptide and their combination across all search engines. The consensus approach works better than any single search engine across all different instrument types considered in this study. Improvements vary strongly from platform to platform and from search engine to search engine. Compared to the industry standard MASCOT, our approach can identify up to 60% more peptides. The software for consensus predictions is implemented in C++ as part of OpenMS, a software framework for mass spectrometry. The source code is available in the current development version of OpenMS and can easily be used as a command line application or via the graphical pipeline designer TOPPAS.
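To make the idea of combining per-engine p-values concrete, here is a minimal Python sketch. The abstract does not state which combination rule the authors use, so Fisher's method is swapped in purely as a familiar stand-in; it assumes independent p-values, which search engines scoring the same spectrum only approximate. The function name is hypothetical and this is not the OpenMS implementation.

```python
import math
from typing import Sequence


def fisher_combined_p(p_values: Sequence[float]) -> float:
    """Combine k p-values with Fisher's method (stand-in, not the paper's rule).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null. For even degrees of freedom
    the survival function has the closed form
        P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not 0.0 < p <= 1.0 for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    half_stat = -sum(math.log(p) for p in p_values)  # equals x / 2
    term = 1.0   # (x/2)^i / i! for i = 0
    total = 1.0
    for i in range(1, len(p_values)):
        term *= half_stat / i
        total += term
    return min(1.0, math.exp(-half_stat) * total)
```

With a single engine the function returns the input p-value unchanged; two engines that each report a moderately small p-value yield a smaller combined value, which is the intuition behind a consensus score.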

3.

Background  

Tandem mass spectrometry (MS/MS) is a powerful tool for protein identification. Although great efforts have been made in scoring the correlation between tandem mass spectra and an amino acid sequence database, improvements could be made in three aspects: characterization of peaks in spectra, adoption of effective scoring functions, and assessment of the reliability of matching between peptides and spectra.

4.
Kwon KH, Kim M, Kim JY, Kim KW, Kim SI, Park YM, Yoo JS. Proteomics 2003, 3(12): 2305-2309
We compared peptide identification by database (DB) search methods with de novo sequencing results for a proteomics study in an organism without genome sequence information. When the DB search was run against the Expressed Sequence Tag (EST) DB of the sample organism or the NCBI nonredundant (nr) protein DB of green plants, using either the MASCOT or SEQUEST software program, it proved as accurate as de novo sequencing. Twice as many peptides were identified from the EST DB as from the nr protein DB, even though the EST DB holds less data (26 222 ESTs) than the NCBI nr protein DB (224 238). This study demonstrates that an EST DB combined with tandem mass spectra can be used reliably for high-throughput proteomics studies in an organism without genome information.

5.
There are many computer programs that can match tandem mass spectra of peptides to database-derived sequences; however, situations can arise where mass spectral data cannot be correlated with any database sequence. In such cases, sequences can be automatically deduced de novo, without recourse to sequence databases, and the resulting peptide sequences can be used to perform homology-based, nonexact searches of sequence databases. This article describes how to implement both a de novo sequencing program called “Lutefisk,” and a version of FASTA that has been modified to account for sequence ambiguities inherent in tandem mass spectrometry data.

6.
We evaluate statistical models used in two-hypothesis tests for identifying peptides from tandem mass spectrometry data. The null hypothesis H(0), that a peptide matches a spectrum by chance, requires information on the probability of by-chance matches between peptide fragments and peaks in the spectrum. Likewise, the alternate hypothesis H(A), that the spectrum is due to a particular peptide, requires probabilities that the peptide fragments would indeed be observed if it were the causative agent. We compare models for these probabilities by determining the identification rates produced by the models using an independent data set. The initial models use different probabilities depending on fragment ion type, but uniform probabilities for each ion type across all of the labile bonds along the backbone. More sophisticated models for probabilities under both H(A) and H(0) are introduced that do not assume uniform probabilities for each ion type. In addition, the performance of these models using a standard likelihood model is compared to an information theory approach derived from the likelihood model. Also, a simple but effective model for incorporating peak intensities is described. Finally, a support-vector machine is used to discriminate between correct and incorrect identifications based on multiple characteristics of the scoring functions. The results are shown to reduce the misidentification rate significantly when compared to a benchmark cross-correlation based approach.
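The two-hypothesis test in this abstract can be sketched as a log-likelihood ratio over predicted fragments. The sketch below covers only the "initial" model the abstract mentions, with one probability per ion type applied uniformly along the backbone; the Bernoulli matched/unmatched treatment, the function name, and any probability values passed in are illustrative assumptions rather than the authors' fitted model.

```python
import math
from typing import Dict, Iterable, Tuple


def log_likelihood_ratio(
    fragments: Iterable[Tuple[str, bool]],
    p_alt: Dict[str, float],
    p_null: Dict[str, float],
) -> float:
    """Score a peptide-spectrum match as log P(data | H(A)) - log P(data | H(0)).

    `fragments` lists each predicted fragment as (ion_type, matched), where
    `matched` says whether a peak was found at the predicted m/z.
    `p_alt[ion_type]` is the probability of observing that ion type if the
    peptide produced the spectrum; `p_null[ion_type]` is the probability of
    a by-chance match. Both are assumed uniform across backbone bonds and
    strictly between 0 and 1.
    """
    score = 0.0
    for ion_type, matched in fragments:
        pa, p0 = p_alt[ion_type], p_null[ion_type]
        if matched:
            score += math.log(pa / p0)
        else:
            score += math.log((1.0 - pa) / (1.0 - p0))
    return score
```

A positive score favors H(A) and a negative score favors H(0); every matched fragment raises the score and every missing fragment lowers it, provided the probability under H(A) exceeds the chance probability.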

7.
MOTIVATION: We reformulate the problem of comparing mass spectra by mapping spectra to a vector space model. Our search method leverages a metric space indexing algorithm to produce an initial candidate set, which can be followed by any fine ranking scheme. RESULTS: We consider three distance measures integrated into a multi-vantage point index structure. Of these, a semi-metric fuzzy-cosine distance using peptide precursor mass constraints performs the best. The index acts as a coarse, lossless filter with respect to the SEQUEST and ProFound scoring schemes, reducing the number of distance computations and returned candidates for fine filtering to about 0.5% and 0.02% of the database respectively. The fuzzy cosine distance term improves specificity over a peptide precursor mass filter, reducing the number of returned candidates by an order of magnitude. Run time measurements suggest proportional speedups in overall search times. Using an implementation of ProFound's Bayesian score as an example of a fine filter on a test set of Escherichia coli protein fragmentation spectra, the top results of our sample system are consistent with those of SEQUEST.
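The vector space view in this abstract can be sketched in a few lines of Python: bin each spectrum into a sparse vector, then keep only candidates that pass a precursor mass window and a distance cutoff. This sketch uses a plain cosine distance and a linear scan; the paper's fuzzy-cosine variant and its multi-vantage point index are not reproduced, and the bin width, tolerances, and function names are assumptions.

```python
import math
from typing import Dict, List, Tuple

Peak = Tuple[float, float]              # (m/z, intensity)
Candidate = Tuple[str, float, List[Peak]]  # (name, precursor mass, peaks)


def binned_vector(peaks: List[Peak], bin_width: float = 1.0) -> Dict[int, float]:
    """Map a peak list to a sparse vector indexed by m/z bin."""
    vec: Dict[int, float] = {}
    for mz, intensity in peaks:
        index = int(mz // bin_width)
        vec[index] = vec.get(index, 0.0) + intensity
    return vec


def cosine_distance(u: Dict[int, float], v: Dict[int, float]) -> float:
    """Return 1 - cosine similarity; 1.0 if either vector is empty."""
    dot = sum(x * v.get(index, 0.0) for index, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def coarse_filter(
    query_peaks: List[Peak],
    query_mass: float,
    candidates: List[Candidate],
    mass_tol: float,
    max_distance: float,
) -> List[str]:
    """Return candidate names passing both filters, closest first."""
    query_vec = binned_vector(query_peaks)
    kept = []
    for name, mass, peaks in candidates:
        if abs(mass - query_mass) > mass_tol:
            continue  # precursor mass constraint
        distance = cosine_distance(query_vec, binned_vector(peaks))
        if distance <= max_distance:
            kept.append((distance, name))
    return [name for _, name in sorted(kept)]
```

The returned list plays the role of the initial candidate set, which a fine ranking scheme would then rescore.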

8.
We describe the application of a previously reported reversed-phase liquid chromatography (RPLC) peptide retention time prediction model (Petritis et al. Anal. Chem. 2003, 75, 1039) for improved peptide identification. The model uses peptide sequence information to generate a theoretical (predicted) elution time that can be compared with the observed elution time. Using data from a set of known proteins, the retention time parameter was incorporated into a discriminant function for use with tandem mass spectrometry (MS/MS) data analyzed with the peptide/protein identification program SEQUEST. For singly charged ions, the number of confident identifications increased by 12% when the elution time metric was included, compared to when mass spectral data were the sole source of information, in the context of a Drosophila melanogaster database. A 3-4% improvement was obtained for doubly and triply charged ions for the same biological system. Application to the larger Rattus norvegicus (rat) and human proteome databases resulted in an 8-9% overall increase in the number of confident identifications when both the discriminant function and elution time were used. The effect of adding "runner-up" hits (peptide matches that are not the highest scoring for a spectrum) from SEQUEST is also explored, and we find that the number of confident identifications is further increased by 1% when these hits are also considered. Finally, application of the discriminant functions derived in this work to approximately 2.2 million spectra from over three hundred LC-MS/MS analyses of peptides from human plasma protein resulted in a 16% increase in confident peptide identifications (9022 vs 7779) using elution time information. Further improvements from the use of elution time information can be expected as both the experimental control of elution time reproducibility and the predictive capability are improved.

9.
In a previous paper we introduced a novel model-based approach (OLAV) to the problem of identifying peptides via tandem mass spectrometry, for which early implementations showed promising performance. We recently further improved this performance to a remarkable level (1-2% false positive rate at 95% true positive rate) and characterized key properties of OLAV such as robustness and training set size. We present these results in a synthetic and coherent way along with detailed performance comparisons, a new scoring component making use of peptide amino acid composition, and new developments such as automatic parameter learning. Finally, we discuss the impact of OLAV on the automation of proteomics projects.

10.
MOTIVATION: The correlation among fragment ions in a tandem mass spectrum is crucial in reducing stochastic mismatches for peptide identification by database searching. Until now, an efficient scoring algorithm that considers the correlative information in a tunable and comprehensive manner has been lacking. RESULTS: This paper provides a promising approach to utilizing the correlative information for improving the peptide identification accuracy. The kernel trick, rooted in statistical learning theory, is exploited to address this issue with low computational effort. The common scoring method, the tandem mass spectral dot product (SDP), is extended to the kernel SDP (KSDP). Experiments on a previously reported dataset demonstrate the effectiveness of the KSDP. The implementation on consecutive fragments shows a decrease of 10% in the error rate compared with the SDP. Our software tool, pFind, using a simple scoring function based on the KSDP, outperforms two SDP-based software tools, SEQUEST and Sonar MS/MS, in terms of identification accuracy. SUPPLEMENTARY INFORMATION: http://www.jdl.ac.cn/user/yfu/pfind/index.html

11.
Computational analysis of mass spectra remains the bottleneck in many proteomics experiments. SEQUEST was one of the earliest software packages to identify peptides from mass spectra by searching a database of known peptides. Though still popular, SEQUEST performs slowly. Crux and TurboSEQUEST have successfully sped up SEQUEST by adding a precomputed index to the search, but the demand for ever-faster peptide identification software continues to grow. Tide, introduced here, is a software program that implements the SEQUEST algorithm for peptide identification and that achieves a dramatic speedup over Crux and SEQUEST. The optimization strategies detailed here employ a combination of algorithmic and software engineering techniques to achieve speeds up to 170 times faster than a recent version of SEQUEST that uses indexing. For example, on a single Xeon CPU, Tide searches 10,000 spectra against a tryptic database of 27,499 Caenorhabditis elegans proteins at a rate of 1550 spectra per second, which compares favorably with a rate of 8.8 spectra per second for a recent version of SEQUEST with index running on the same hardware.

12.
13.
14.
De novo peptide sequencing via tandem mass spectrometry. Total citations: 10 (self-citations: 0; citations by others: 10)
Peptide sequencing via tandem mass spectrometry (MS/MS) is one of the most powerful tools in proteomics for identifying proteins. Because complete genome sequences are accumulating rapidly, the recent trend in interpretation of MS/MS spectra has been database search. However, de novo MS/MS spectral interpretation remains an open problem typically involving manual interpretation by expert mass spectrometrists. We have developed a new algorithm, SHERENGA, for de novo interpretation that automatically learns fragment ion types and intensity thresholds from a collection of test spectra generated from any type of mass spectrometer. The test data are used to construct optimal path scoring in the graph representations of MS/MS spectra. A ranked list of high scoring paths corresponds to potential peptide sequences. SHERENGA is most useful for interpreting sequences of peptides resulting from unknown proteins and for validating the results of database search algorithms in fully automated, high-throughput peptide sequencing.

15.
Li D, Fu Y, Sun R, Ling CX, Wei Y, Zhou H, Zeng R, Yang Q, He S, Gao W. Bioinformatics (Oxford, England) 2005, 21(13): 3049-3050
SUMMARY: Research in proteomics requires powerful database-searching software to automatically identify protein sequences in a complex protein mixture via tandem mass spectrometry. In this paper, we describe a novel database-searching software system called pFind (peptide/protein Finder), which employs an effective peptide-scoring algorithm that we reported earlier. The pFind server is implemented with the C++ STL, .NET and XML technologies. As a result, high speed and good usability of the software are achieved.

16.
A novel ProteinChip-interfaced tandem mass spectrometer was employed to identify collagen binding proteins from biosurfactant produced by Lactobacillus fermentum RC-14. On-chip tryptic digestion of the captured collagen binding proteins resulted in rapid sequence identification of five novel tryptic peptide sequences via collision-induced dissociation tandem mass spectrometry.

17.
To interpret LC-MS/MS data in proteomics, most popular protein identification algorithms primarily use predicted fragment m/z values to assign peptide sequences to fragmentation spectra. The intensity information is often undervalued, because it is not as easy to predict and incorporate into algorithms. Nevertheless, the use of intensity to assist peptide identification is an attractive prospect and can potentially improve the confidence of matches and generate more identifications. On the basis of our previously reported study of fragmentation intensity patterns, we developed a protein identification algorithm, SeQuence IDentification (SQID), that makes use of the coarse intensity from a statistical analysis. The scoring scheme was validated by comparing with Sequest and X!Tandem using three data sets, and the results indicate an improvement in the number of identified peptides, including unique peptides that are not identified by Sequest or X!Tandem. The software and source code are available under the GNU GPL license at http://quiz2.chem.arizona.edu/wysocki/bioinformatics.htm.

18.
Mass spectrometry combined with database searching has become the preferred method for identifying proteins in proteomics projects. Proteins are digested by one or several enzymes to obtain peptides, which are analyzed by mass spectrometry. We introduce a new family of scoring schemes, named OLAV, aimed at identifying peptides in a database from their tandem mass spectra. OLAV scoring schemes are based on signal detection theory, and exploit mass spectrometry information more extensively than previously existing schemes. We also introduce a new concept of structural matching that uses pattern detection methods to better separate true from false positives. We show the superiority of OLAV scoring schemes compared to MASCOT, a widely used identification program. We believe that this work introduces a new way of designing scoring schemes that are especially adapted to high-throughput projects such as the GeneProt large-scale human plasma project, where it is impractical to check all identifications manually.

19.
Mass spectrometry, the core technology in the field of proteomics, promises to enable scientists to identify and quantify the entire complement of proteins in a complex biological sample. Currently, the primary bottleneck in this type of experiment is computational. Existing algorithms for interpreting mass spectra are slow and fail to identify a large proportion of the given spectra. We describe a database search program called Crux that reimplements and extends the widely used database search program Sequest. For speed, Crux uses a peptide indexing scheme to rapidly retrieve candidate peptides for a given spectrum. For each peptide in the target database, Crux generates shuffled decoy peptides on the fly, providing a good null model and, hence, accurate false discovery rate estimates. Crux also implements two recently described postprocessing methods: a p value calculation based upon fitting a Weibull distribution to the observed scores, and a semisupervised method that learns to discriminate between target and decoy matches. Both methods significantly improve the overall rate of peptide identification. Crux is implemented in C and is distributed with source code freely to noncommercial users.
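Two ideas from this abstract, on-the-fly shuffled decoys and a Weibull-based p value, can be illustrated with a short Python sketch. It is not Crux's code: keeping the terminal residues fixed during shuffling, the parameter names, and the use of already-fitted Weibull parameters (the fitting step is omitted) are all assumptions made for illustration.

```python
import math
import random


def shuffled_decoy(peptide: str, rng: random.Random) -> str:
    """Return a decoy with the same composition and a shuffled interior.

    Assumption: the first and last residues stay in place so the decoy
    keeps the same termini (for example a tryptic K/R at the C-terminus).
    """
    if len(peptide) <= 3:
        return peptide  # nothing to rearrange
    interior = list(peptide[1:-1])
    rng.shuffle(interior)
    return peptide[0] + "".join(interior) + peptide[-1]


def weibull_p_value(score: float, shape: float, scale: float, shift: float = 0.0) -> float:
    """Survival probability of `score` under a shifted Weibull distribution.

    P(S >= score) = exp(-(((score - shift) / scale) ** shape)); scores at or
    below `shift` get a p value of 1.0. The parameters are assumed to have
    been fitted to the observed score distribution beforehand.
    """
    x = score - shift
    if x <= 0.0:
        return 1.0
    return math.exp(-((x / scale) ** shape))
```

Because a decoy has the same mass and amino acid composition as its target, it competes for the same spectra, which is what makes decoy matches usable as a null model.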

20.
A novel hybrid methodology for the automated identification of peptides via de novo integer linear optimization, local database search, and tandem mass spectrometry is presented in this article. A modified version of the de novo identification algorithm PILOT is utilized to construct accurate de novo peptide sequences. A modified version of the local database search tool FASTA is used to query these de novo predictions against the nonredundant protein database to resolve any low-confidence amino acids in the candidate sequences. The computational burden associated with performing several alignments is alleviated with the use of distributed computing. Extensive computational studies are presented for this new hybrid methodology, as well as comparisons with MASCOT for a set of 38 quadrupole time-of-flight (QTOF) and 380 OrbiTrap tandem mass spectra. The results for our proposed hybrid method for the OrbiTrap spectra are also compared with a modified version of PepNovo, which was trained for use on high-precision tandem mass spectra, and the tag-based method InsPecT. The de novo sequences of PILOT and PepNovo are also searched against the nonredundant protein database using CIDentify to compare with the alignments achieved by our modifications of FASTA. The comparative studies demonstrate the excellent peptide identification accuracy gained from combining the strengths of our de novo method, which is based on integer linear optimization, and database-driven search methods.
