期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

Target-decoy search strategy for increased confidence in large-scale protein identifications by mass spectrometry 总被引：1，自引：0，他引：1

Elias JE Gygi SP 《Nature methods》2007,4(3):207-214

Liquid chromatography and tandem mass spectrometry (LC-MS/MS) has become the preferred method for conducting large-scale surveys of proteomes. Automated interpretation of tandem mass spectrometry (MS/MS) spectra can be problematic, however, for a variety of reasons. As most sequence search engines return results even for 'unmatchable' spectra, proteome researchers must devise ways to distinguish correct from incorrect peptide identifications. The target-decoy search strategy represents a straightforward and effective way to manage this effort. Despite the apparent simplicity of this method, some controversy surrounds its successful application. Here we clarify our preferred methodology by addressing four issues based on observed decoy hit frequencies: (i) the major assumptions made with this database search strategy are reasonable; (ii) concatenated target-decoy database searches are preferable to separate target and decoy database searches; (iii) the theoretical error associated with target-decoy false positive (FP) rate measurements can be estimated; and (iv) alternate methods for constructing decoy databases are similarly effective once certain considerations are taken into account. 相似文献

2.

Learning from Decoys to Improve the Sensitivity and Specificity of Proteomics Database Search Results

Amit Kumar Yadav Dhirendra Kumar Debasis Dash 《PloS one》2012,7(11)

The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since the complete separation between the true and false hits is practically never achieved, there is need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. Modelling a simple linear regression on the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2% respectively. The approach is applicable to different search methodologies- separate as well as concatenated database search, high mass accuracy, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed better performance than before. We have shown that appropriate threshold learnt from decoys, can be very effective in improving the database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size. 相似文献

3.

MassMatrix: A database search program for rapid characterization of proteins and peptides from tandem mass spectrometry data

Hua Xu Michael A. Freitas Dr. 《Proteomics》2009,9(6):1548-1555

MassMatrix is a program that matches tandem mass spectra with theoretical peptide sequences derived from a protein database. The program uses a mass accuracy sensitive probabilistic score model to rank peptide matches. The MS/MS search software was evaluated by use of a high mass accuracy dataset and its results compared with those from MASCOT, SEQUEST, X!Tandem, and OMSSA. For the high mass accuracy data, MassMatrix provided better sensitivity than MASCOT, SEQUEST, X!Tandem, and OMSSA for a given specificity and the percentage of false positives was 2%. More importantly all manually validated true positives corresponded to a unique peptide/spectrum match. The presence of decoy sequence and additional variable PTMs did not significantly affect the results from the high mass accuracy search. MassMatrix performs well when compared with MASCOT, SEQUEST, X!Tandem, and OMSSA with regard to search time. MassMatrix was also run on a distributed memory clusters and achieved search speeds of ～100 000 spectra per hour when searching against a complete human database with eight variable modifications. The algorithm is available for public searches at http://www.massmatrix.net. 相似文献

4.

Analysis of the resolution limitations of peptide identification algorithms

Colaert N Degroeve S Helsens K Martens L 《Journal of proteome research》2011,10(12):5555-5561

Proteome identification using peptide-centric proteomics techniques is a routinely used analysis technique. One of the most powerful and popular methods for the identification of peptides from MS/MS spectra is protein database matching using search engines. Significance thresholding through false discovery rate (FDR) estimation by target/decoy searches is used to ensure the retention of predominantly confident assignments of MS/MS spectra to peptides. However, shortcomings have become apparent when such decoy searches are used to estimate the FDR. To study these shortcomings, we here introduce a novel kind of decoy database that contains isobaric mutated versions of the peptides that were identified in the original search. Because of the supervised way in which the entrapment sequences are generated, we call this a directed decoy database. Since the peptides found in our directed decoy database are thus specifically designed to look quite similar to the forward identifications, the limitations of the existing search algorithms in making correct calls in such strongly confusing situations can be analyzed. Interestingly, for the vast majority of confidently identified peptide identifications, a directed decoy peptide-to-spectrum match can be found that has a better or equal match score than the forward match score, highlighting an important issue in the interpretation of peptide identifications in present-day high-throughput proteomics. 相似文献

5.

Towards accurate estimation of the proportion of true null hypotheses in multiple testing

Zhang SD 《PloS one》2011,6(4):e18874

BACKGROUND: Biomedical researchers are now often faced with situations where it is necessary to test a large number of hypotheses simultaneously, eg, in comparative gene expression studies using high-throughput microarray technology. To properly control false positive errors the FDR (false discovery rate) approach has become widely used in multiple testing. The accurate estimation of FDR requires the proportion of true null hypotheses being accurately estimated. To date many methods for estimating this quantity have been proposed. Typically when a new method is introduced, some simulations are carried out to show the improved accuracy of the new method. However, the simulations are often very limited to covering only a few points in the parameter space. RESULTS: Here I have carried out extensive in silico experiments to compare some commonly used methods for estimating the proportion of true null hypotheses. The coverage of these simulations is unprecedented thorough over the parameter space compared to typical simulation studies in the literature. Thus this work enables us to draw conclusions globally as to the performance of these different methods. It was found that a very simple method gives the most accurate estimation in a dominantly large area of the parameter space. Given its simplicity and its overall superior accuracy I recommend its use as the first choice for estimating the proportion of true null hypotheses in multiple testing. 相似文献

6.

A Scalable Approach for Protein False Discovery Rate Estimation in Large Proteomic Data Sets

Mikhail M. Savitski Mathias Wilhelm Hannes Hahne Bernhard Kuster Marcus Bantscheff 《Molecular & cellular proteomics : MCP》2015,14(9):2394-2404

Calculating the number of confidently identified proteins and estimating false discovery rate (FDR) is a challenge when analyzing very large proteomic data sets such as entire human proteomes. Biological and technical heterogeneity in proteomic experiments further add to the challenge and there are strong differences in opinion regarding the conceptual validity of a protein FDR and no consensus regarding the methodology for protein FDR determination. There are also limitations inherent to the widely used classic target–decoy strategy that particularly show when analyzing very large data sets and that lead to a strong over-representation of decoy identifications. In this study, we investigated the merits of the classic, as well as a novel target–decoy-based protein FDR estimation approach, taking advantage of a heterogeneous data collection comprised of ∼19,000 LC-MS/MS runs deposited in ProteomicsDB (https://www.proteomicsdb.org). The “picked” protein FDR approach treats target and decoy sequences of the same protein as a pair rather than as individual entities and chooses either the target or the decoy sequence depending on which receives the highest score. We investigated the performance of this approach in combination with q-value based peptide scoring to normalize sample-, instrument-, and search engine-specific differences. The “picked” target–decoy strategy performed best when protein scoring was based on the best peptide q-value for each protein yielding a stable number of true positive protein identifications over a wide range of q-value thresholds. We show that this simple and unbiased strategy eliminates a conceptual issue in the commonly used “classic” protein FDR approach that causes overprediction of false-positive protein identification in large data sets. The approach scales from small to very large data sets without losing performance, consistently increases the number of true-positive protein identifications and is readily implemented in proteomics analysis software.Shotgun proteomics is the most popular approach for large-scale identification and quantification of proteins. The rapid evolution of high-end mass spectrometers in recent years (–) has made proteomic studies feasible that identify and quantify as many as 10,000 proteins in a sample (–) and enables many lines of new scientific research including, for example, the analysis of many human proteomes, and proteome-wide protein–drug interaction studies (–). One fundamental step in most proteomic experiments is the identification of proteins in the biological system under investigation. To achieve this, proteins are digested into peptides, analyzed by LC-MS/MS, and tandem mass spectra are used to interrogate protein sequence databases using search engines that match experimental data to data generated in silico (, ). Peptide spectrum matches (PSMs)¹ are commonly assigned by a search engine using either a heuristic or a probabilistic scoring scheme (–). Proteins are then inferred from identified peptides and a protein score or a probability derived as a measure for the confidence in the identification (, ).Estimating the proportion of false matches (false discovery rate; FDR) in an experiment is important to assess and maintain the quality of protein identifications. Owing to its conceptual and practical simplicity, the most widely used strategy to estimate FDR in proteomics is the target–decoy database search strategy (target–decoy strategy; TDS) (). The main assumption underlying this idea is that random matches (false positives) should occur with similar likelihood in the target database and the decoy (reversed, shuffled, or otherwise randomized) version of the same database (, ). The number of matches to the decoy database, therefore, provides an estimate of the number of random matches one should expect to obtain in the target database. The number of target and decoy hits can then be used to calculate either a local or a global FDR for a given data set (–). This general idea can be applied to control the FDR at the level of PSMs, peptides, and proteins, typically by counting the number of target and decoy observations above a specified score.Despite the significant practical impact of the TDS, it has been observed that a peptide FDR that results in an acceptable protein FDR (of say 1%) for a small or medium sized data set, turns into an unacceptably high protein FDR when the data set grows larger (, ). This is because the basic assumption of the classical TDS is compromised when a large proportion of the true positive proteins have already been identified. In small data sets, containing say only a few hundred to a few thousand proteins, random peptide matches will be distributed roughly equally over all decoy and “leftover” target proteins, allowing for a reasonably accurate estimation of false positive target identifications by using the number of decoy identifications. However, in large experiments comprising hundreds to thousands of LC-MS/MS runs, 10,000 or more target proteins may be genuinely and repeatedly identified, leaving an ever smaller number of (target) proteins to be hit by new false positive peptide matches. In contrast, decoy proteins are only hit by the occasional random peptide match but fully count toward the number of false positive protein identifications estimated from the decoy hits. The higher the number of genuinely identified target proteins gets, the larger this imbalance becomes. If this is not corrected for in the decoy space, an overestimation of false positives will occur.This problem has been recognized and e.g. Reiter and colleagues suggested a way for correcting for the overestimation of false positive protein hits termed MAYU (). Following the main assumption that protein identifications containing false positive PSMs are uniformly distributed over the target database, MAYU models the number of false positive protein identifications using a hypergeometric distribution. Its parameters are estimated from the number of protein database entries and the total number of target and decoy protein identifications. The protein FDR is then estimated by dividing the number of expected false positive identifications (expectation value of the hypergeometric distribution) by the total number of target identifications. Although this approach was specifically designed for large data sets (tested on ∼1300 LC-MS/MS runs from digests of C. elegans proteins), it is not clear how far the approach actually scales. Another correction strategy for overestimation of false positive rates, the R factor, was suggested initially for peptides () and more recently for proteins (). A ratio, R, of forward and decoy hits in the low probability range is calculated, where the number of true peptide or protein identifications is expected to be close to zero, and hence, R should approximate one. The number of decoy hits is then multiplied (corrected) by the R factor when performing FDR calculations. The approach is conceptually simpler than the MAYU strategy and easy to implement, but is also based on the assumption that the inflation of the decoy hits intrinsic in the classic target–decoy strategy occurs to the same extent in all probability ranges.In the context of the above, it is interesting to note that there is currently no consensus in the community regarding if and how protein FDRs should be calculated for data of any size. One perhaps extreme view is that, owing to issues and assumptions related to the peptide to protein inference step and ways of constructing decoy protein sequences, protein level FDRs cannot be meaningfully estimated at all (30). This is somewhat unsatisfactory as an estimate of protein level error in proteomic experiments is highly desirable. Others have argued that target–decoy searches are not even needed when accurate p values of individual PSMs are available () whereas others choose to tighten the PSM or peptide FDRs obtained from TDS analysis to whatever threshold necessary to obtain a desired protein FDR (). This is likely too conservative.We have recently proposed an alternative protein FDR approach termed “picked” target–decoy strategy (picked TDS) that indicated improved performance over the classical TDS in a very large proteomic data set () but a systematic investigation of the idea had not been performed at the time. In this study, we further characterized the picked TDS for protein FDR estimation and investigated its scalability compared with that of the classic TDS FDR method in data sets of increasing size up to ∼19,000 LC-MS/MS runs. The results show that the picked TDS is effective in preventing decoy protein over-representation, identifies more true positive hits, and works equally well for small and large proteomic data sets. 相似文献

7.

Randomized sequence databases for tandem mass spectrometry peptide and protein identification 总被引：4，自引：0，他引：4

Higdon R Hogan JM Van Belle G Kolker E 《Omics : a journal of integrative biology》2005,9(4):364-379

Tandem mass spectrometry (MS/MS) combined with database searching is currently the most widely used method for high-throughput peptide and protein identification. Many different algorithms, scoring criteria, and statistical models have been used to identify peptides and proteins in complex biological samples, and many studies, including our own, describe the accuracy of these identifications, using at best generic terms such as "high confidence." False positive identification rates for these criteria can vary substantially with changing organisms under study, growth conditions, sequence databases, experimental protocols, and instrumentation; therefore, study-specific methods are needed to estimate the accuracy (false positive rates) of these peptide and protein identifications. We present and evaluate methods for estimating false positive identification rates based on searches of randomized databases (reversed and reshuffled). We examine the use of separate searches of a forward then a randomized database and combined searches of a randomized database appended to a forward sequence database. Estimated error rates from randomized database searches are first compared against actual error rates from MS/MS runs of known protein standards. These methods are then applied to biological samples of the model microorganism Shewanella oneidensis strain MR-1. Based on the results obtained in this study, we recommend the use of use of combined searches of a reshuffled database appended to a forward sequence database as a means providing quantitative estimates of false positive identification rates of peptides and proteins. This will allow researchers to set criteria and thresholds to achieve a desired error rate and provide the scientific community with direct and quantifiable measures of peptide and protein identification accuracy as opposed to vague assessments such as "high confidence." 相似文献

8.

Analysis and practical guideline of constraint-based boolean method in genetic network inference

Saithong T Bumee S Liamwirat C Meechai A 《PloS one》2012,7(1):e30232

Boolean-based method, despite of its simplicity, would be a more attractive approach for inferring a network from high-throughput expression data if its effectiveness has not been limited by high false positive prediction. In this study, we explored factors that could simply be adjusted to improve the accuracy of inferring networks. Our work focused on the analysis of the effects of discretisation methods, biological constraints, and stringency of boolean function assignment on the performance of boolean network, including accuracy, precision, specificity and sensitivity, using three sets of microarray time-series data. The study showed that biological constraints have pivotal influence on the network performance over the other factors. It can reduce the variation in network performance resulting from the arbitrary selection of discretisation methods and stringency settings. We also presented the master boolean network as an approach to establish the unique solution for boolean analysis. The information acquired from the analysis was summarised and deployed as a general guideline for an efficient use of boolean-based method in the network inference. In the end, we provided an example of the use of such a guideline in the study of Arabidopsis circadian clock genetic network from which much interesting biological information can be inferred. 相似文献

9.

Semisupervised model-based validation of peptide identifications in mass spectrometry-based proteomics 总被引：1，自引：0，他引：1

Choi H Nesvizhskii AI 《Journal of proteome research》2008,7(1):254-265

Development of robust statistical methods for validation of peptide assignments to tandem mass (MS/MS) spectra obtained using database searching remains an important problem. PeptideProphet is one of the commonly used computational tools available for that purpose. An alternative simple approach for validation of peptide assignments is based on addition of decoy (reversed, randomized, or shuffled) sequences to the searched protein sequence database. The probabilistic modeling approach of PeptideProphet and the decoy strategy can be combined within a single semisupervised framework, leading to improved robustness and higher accuracy of computed probabilities even in the case of most challenging data sets. We present a semisupervised expectation-maximization (EM) algorithm for constructing a Bayes classifier for peptide identification using the probability mixture model, extending PeptideProphet to incorporate decoy peptide matches. Using several data sets of varying complexity, from control protein mixtures to a human plasma sample, and using three commonly used database search programs, SEQUEST, MASCOT, and TANDEM/k-score, we illustrate that more accurate mixture estimation leads to an improved control of the false discovery rate in the classification of peptide assignments. 相似文献

10.

PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification

Zhang J Xin L Shan B Chen W Xie M Yuen D Zhang W Zhang Z Lajoie GA Ma B 《Molecular & cellular proteomics : MCP》2012,11(4):M111.010587

Many software tools have been developed for the automated identification of peptides from tandem mass spectra. The accuracy and sensitivity of the identification software via database search are critical for successful proteomics experiments. A new database search tool, PEAKS DB, has been developed by incorporating the de novo sequencing results into the database search. PEAKS DB achieves significantly improved accuracy and sensitivity over two other commonly used software packages. Additionally, a new result validation method, decoy fusion, has been introduced to solve the issue of overconfidence that exists in the conventional target decoy method for certain types of peptide identification software. 相似文献

11.

GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy

Xue Y Ren J Gao X Jin C Wen L Yao X 《Molecular & cellular proteomics : MCP》2008,7(9):1598-1608

Identification of protein phosphorylation sites with their cognate protein kinases (PKs) is a key step to delineate molecular dynamics and plasticity underlying a variety of cellular processes. Although nearly 10 kinase-specific prediction programs have been developed, numerous PKs have been casually classified into subgroups without a standard rule. For large scale predictions, the false positive rate has also never been addressed. In this work, we adopted a well established rule to classify PKs into a hierarchical structure with four levels, including group, family, subfamily, and single PK. In addition, we developed a simple approach to estimate the theoretically maximal false positive rates. The on-line service and local packages of the GPS (Group-based Prediction System) 2.0 were implemented in Java with the modified version of the Group-based Phosphorylation Scoring algorithm. As the first stand alone software for predicting phosphorylation, GPS 2.0 can predict kinase-specific phosphorylation sites for 408 human PKs in hierarchy. A large scale prediction of more than 13,000 mammalian phosphorylation sites by GPS 2.0 was exhibited with great performance and remarkable accuracy. Using Aurora-B as an example, we also conducted a proteome-wide search and provided systematic prediction of Aurora-B-specific substrates including protein-protein interaction information. Thus, the GPS 2.0 is a useful tool for predicting protein phosphorylation sites and their cognate kinases and is freely available on line. 相似文献

12.

Strategies for the effective identification of remotely related sequences in multiple PSSM search approach

Gowri VS Tina KG Krishnadev O Srinivasan N 《Proteins》2007,67(4):789-794

Searches using position specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is generated typically using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches the reference sequence is same as the query. Recently we have shown that searches against the database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite relatively a better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and divergence of sequences used in the construction of a PSSM have major influence on the performance of multiple profile based search approach. We also identify that a simple parameter defined by the number of PSSMs corresponding to a family that is hit, for a query, divided by the total number of PSSMs in the family can distinguish effectively the true positives from the false positives in the multiple profiles search approach. 相似文献

13.

Peptide identification quality control

Vaudel M Burkhart JM Sickmann A Martens L Zahedi RP 《Proteomics》2011,11(10):2105-2114

Identification of large proteomics data sets is routinely performed using sophisticated software tools called search engines. Yet despite the importance of the identification process, its configuration and execution is often performed according to established lab habits, and is mostly unsupervised by detailed quality control. In order to establish easily obtainable quality control criteria that can be broadly applied to the identification process, we here introduce several simple quality control methods. An unbiased quality control of identification parameters will be conducted using target/decoy searches providing significant improvement over identification standards. MASCOT identifications were for instance increased by 13% at a constant level of confidence. The target/decoy approach can however not be universally applied. We therefore also quality control the application of this strategy itself, providing useful and intuitive metrics for evaluating the precision and robustness of the obtained false discovery rate. 相似文献

14.

Fold recognition with minimal gaps

Chen W Mirny L Shakhnovich EI 《Proteins》2003,51(4):531-543

Here we present a simplified form of threading that uses only a 20 x 20 two-body residue-based potential and restricted number of gaps. Despite its simplicity and transparency the Monte Carlo-based threading algorithm performs very well in a rigorous test of fold recognition. The results suggest that by simplifying and constraining the decoy space, one can achieve better fold recognition. Fold recognition results are compared with and supplemented by a PSI-BLAST search. The statistical significance of threading results is rigorously evaluated from statistics of extremes by comparison with optimal alignments of a large set of randomly shuffled sequences. The statistical theory, based on the Random Energy Model, yields a cumulative statistical parameter, epsilon, that attests to the likelihood of correct fold recognition. A large epsilon indicates a significant energy gap between the optimal alignment and decoy alignments and, consequently, a high probability that the fold is correctly recognized. For a particular number of gaps, the epsilon parameter reaches its maximal value, and the fold is recognized. As the number of gaps further increases, the likelihood of correct fold recognition drops off. This is because the decoy space is small when gaps are restricted to a small number, but the native alignment is still well approximated, whereas unrestricted increase of the number of gaps leads to rapid growth of the number of decoys and their statistical dominance over the correct alignment. It is shown that best results are obtained when a combination of one-, two-, and three-gap threading is used. To this end, use of the epsilon parameter is crucial for rigorous comparison of results across the different decoy spaces belonging to a different number of gaps. 相似文献

15.

Assessing the Probability that a Finding Is Genuine for Large-Scale Genetic Association Studies

Chia-Ling Kuo Olga A. Vsevolozhskaya Dmitri V. Zaykin 《PloS one》2015,10(5)

Genetic association studies routinely involve massive numbers of statistical tests accompanied by P-values. Whole genome sequencing technologies increased the potential number of tested variants to tens of millions. The more tests are performed, the smaller P-value is required to be deemed significant. However, a small P-value is not equivalent to small chances of a spurious finding and significance thresholds may fail to serve as efficient filters against false results. While the Bayesian approach can provide a direct assessment of the probability that a finding is spurious, its adoption in association studies has been slow, due in part to the ubiquity of P-values and the automated way they are, as a rule, produced by software packages. Attempts to design simple ways to convert an association P-value into the probability that a finding is spurious have been met with difficulties. The False Positive Report Probability (FPRP) method has gained increasing popularity. However, FPRP is not designed to estimate the probability for a particular finding, because it is defined for an entire region of hypothetical findings with P-values at least as small as the one observed for that finding. Here we propose a method that lets researchers extract probability that a finding is spurious directly from a P-value. Considering the counterpart of that probability, we term this method POFIG: the Probability that a Finding is Genuine. Our approach shares FPRP''s simplicity, but gives a valid probability that a finding is spurious given a P-value. In addition to straightforward interpretation, POFIG has desirable statistical properties. The POFIG average across a set of tentative associations provides an estimated proportion of false discoveries in that set. POFIGs are easily combined across studies and are immune to multiple testing and selection bias. We illustrate an application of POFIG method via analysis of GWAS associations with Crohn''s disease. 相似文献

16.

Generalized method for probability-based peptide and protein identification from tandem mass spectrometry data and sequence database searching

Ramos-Fernández A Paradela A Navajas R Albar JP 《Molecular & cellular proteomics : MCP》2008,7(9):1748-1754

Tandem mass spectrometry-based proteomics is currently in great demand of computational methods that facilitate the elimination of likely false positives in peptide and protein identification. In the last few years, a number of new peptide identification programs have been described, but scores or other significance measures reported by these programs cannot always be directly translated into an easy to interpret error rate measurement such as the false discovery rate. In this work we used generalized lambda distributions to model frequency distributions of database search scores computed by MASCOT, X!TANDEM with k-score plug-in, OMSSA, and InsPecT. From these distributions, we could successfully estimate p values and false discovery rates with high accuracy. From the set of peptide assignments reported by any of these engines, we also defined a generic protein scoring scheme that enabled accurate estimation of protein-level p values by simulation of random score distributions that was also found to yield good estimates of protein-level false discovery rate. The performance of these methods was evaluated by searching four freely available data sets ranging from 40,000 to 285,000 MS/MS spectra. 相似文献

17.

Statistical validation of peptide identifications in large-scale proteomics using the target-decoy database search strategy and flexible mixture modeling

Choi H Ghosh D Nesvizhskii AI 《Journal of proteome research》2008,7(1):286-292

Reliable statistical validation of peptide and protein identifications is a top priority in large-scale mass spectrometry based proteomics. PeptideProphet is one of the computational tools commonly used for assessing the statistical confidence in peptide assignments to tandem mass spectra obtained using database search programs such as SEQUEST, MASCOT, or X! TANDEM. We present two flexible methods, the variable component mixture model and the semiparametric mixture model, that remove the restrictive parametric assumptions in the mixture modeling approach of PeptideProphet. Using a control protein mixture data set generated on an linear ion trap Fourier transform (LTQ-FT) mass spectrometer, we demonstrate that both methods improve parametric models in terms of the accuracy of probability estimates and the power to detect correct identifications controlling the false discovery rate to the same degree. The statistical approaches presented here require that the data set contain a sufficient number of decoy (known to be incorrect) peptide identifications, which can be obtained using the target-decoy database search strategy. 相似文献

18.

Wenguang Shao Kan Zhu Henry Lam 《Proteomics》2013,13(22):3273-3283

Spectral library searching is a maturing approach for peptide identification from MS/MS, offering an alternative to traditional sequence database searching. Spectral library searching relies on direct spectrum‐to‐spectrum matching between the query data and the spectral library, which affords better discrimination of true and false matches, leading to improved sensitivity. However, due to the inherent diversity of the peak location and intensity profiles of real spectra, the resulting similarity score distributions often take on unpredictable shapes. This makes it difficult to model the scores of the false matches accurately, necessitating the use of decoy searching to sample the score distribution of the false matches. Here, we refined the similarity scoring in spectral library searching to enable the validation of spectral search results without the use of decoys. We rank‐transformed the peak intensities to standardize all spectra, making it possible to fit a parametric distribution to the scores of the nontop‐scoring spectral matches. The statistical significance of the top‐scoring match can then be estimated in a rigorous manner according to Extreme Value Theory. The overall result is a more robust and interpretable measure of the quality of the spectral match, which can be obtained without decoys. We tested this refined similarity scoring function on real datasets and demonstrated its effectiveness. This approach reduces search time, increases sensitivity, and extends spectral library searching to situations where decoy spectra cannot be readily generated, such as in searching unidentified and nonpeptide spectral libraries. 相似文献

19.

A Bayesian Framework to Identify Methylcytosines from High-Throughput Bisulfite Sequencing Data

Qing Xie Qi Liu Fengbiao Mao Wanshi Cai Honghu Wu Mingcong You Zhen Wang Bingyu Chen Zhong Sheng Sun Jinyu Wu 《PLoS computational biology》2014,10(9)

High-throughput bisulfite sequencing technologies have provided a comprehensive and well-fitted way to investigate DNA methylation at single-base resolution. However, there are substantial bioinformatic challenges to distinguish precisely methylcytosines from unconverted cytosines based on bisulfite sequencing data. The challenges arise, at least in part, from cell heterozygosis caused by multicellular sequencing and the still limited number of statistical methods that are available for methylcytosine calling based on bisulfite sequencing data. Here, we present an algorithm, termed Bycom, a new Bayesian model that can perform methylcytosine calling with high accuracy. Bycom considers cell heterozygosis along with sequencing errors and bisulfite conversion efficiency to improve calling accuracy. Bycom performance was compared with the performance of Lister, the method most widely used to identify methylcytosines from bisulfite sequencing data. The results showed that the performance of Bycom was better than that of Lister for data with high methylation levels. Bycom also showed higher sensitivity and specificity for low methylation level samples (<1%) than Lister. A validation experiment based on reduced representation bisulfite sequencing data suggested that Bycom had a false positive rate of about 4% while maintaining an accuracy of close to 94%. This study demonstrated that Bycom had a low false calling rate at any methylation level and accurate methylcytosine calling at high methylation levels. Bycom will contribute significantly to studies aimed at recalibrating the methylation level of genomic regions based on the presence of methylcytosines. 相似文献

20.

A complex standard for protein identification, designed by evolution

Vaudel M Burkhart JM Breiter D Zahedi RP Sickmann A Martens L 《Journal of proteome research》2012,11(10):5065-5071

Shotgun proteomic investigations rely on the algorithmic assignment of mass spectra to peptides. The quality of these matches is therefore a cornerstone in the analysis and has been the subject of numerous recent developments. In order to establish the benefits of novel algorithms, they are applied to reference samples of known content. However, these were recently shown to be either too simple to resemble typical real-life samples or as leading to results of lower accuracy as the method itself. Here, we describe how to use the proteome of Pyrococcus furiosus , a hyperthermophile, as a standard to evaluate proteomics identification workflows. Indeed, we prove that the Pyrococcus furiosus proteome provides a valid method for detecting random hits, comparable to the decoy databases currently in popular use, but we also prove that the Pyrococcus furiosus proteome goes squarely beyond the decoy approach by also providing many hundreds of highly reliable true positive hits. Searching the Pyrococcus furiosus proteome can thus be used as a unique test that provides the ability to reliably detect both false positives as well as proteome-scale true positives, allowing the rigorous testing of identification algorithms at the peptide and protein level. 相似文献