Similar literature
 20 similar documents found (search time: 944 ms)
1.
We present a wrapper-based approach to estimate and control the false discovery rate for peptide identifications using the outputs from multiple commercially available MS/MS search engines. The approach can combine output from multiple search engines with sequence- and spectrum-derived features in a flexible classification model to produce a score associated with correct peptide identifications. This classification model score from a reversed database search is taken as the null distribution for estimating p-values and false discovery rates using a simple and established statistical procedure. Results from 10 analyses of rat sera on an LTQ-FT mass spectrometer indicate that the method is well calibrated for controlling the proportion of false positives in a set of reported peptide identifications while correctly identifying more peptides than rule-based methods using one search engine alone.
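
A minimal sketch of the idea in this abstract, not the authors' implementation: scores from the reversed-database search act as an empirical null distribution for p-values, which are then adjusted to control the false discovery rate. The abstract does not name the statistical procedure, so the Benjamini-Hochberg adjustment used here is an assumption, as are the function names and the +1 pseudo-count.

```python
from bisect import bisect_left


def empirical_p_values(target_scores, decoy_scores):
    """Estimate p-values for target scores from a decoy (null) score sample.

    Assumed estimator: p = (decoys scoring >= s, plus 1) / (decoys, plus 1).
    """
    decoys = sorted(decoy_scores)
    n = len(decoys)
    p_values = []
    for s in target_scores:
        n_at_least = n - bisect_left(decoys, s)  # decoys with score >= s
        p_values.append((n_at_least + 1) / (n + 1))
    return p_values


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

Identifications whose adjusted value falls below a chosen level (for example 0.05) would then be reported; the level itself is a user choice and is not taken from the abstract.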

2.
The statistical validation of database search results is a complex issue in bottom-up proteomics. The correct and incorrect peptide spectrum match (PSM) scores overlap significantly, making an accurate assessment of true peptide matches challenging. Since complete separation between the true and false hits is practically never achieved, there is a need for better methods and rescoring algorithms to improve upon the primary database search results. Here we describe the calibration and False Discovery Rate (FDR) estimation of database search scores through a dynamic FDR calculation method, FlexiFDR, which increases both the sensitivity and specificity of search results. By fitting a simple linear regression to the decoy hits for different charge states, the method maximized the number of true positives and reduced the number of false negatives in several standard datasets of varying complexity (18-mix, 49-mix, 200-mix) and a few complex datasets (E. coli and Yeast) obtained from a wide variety of MS platforms. The net positive gain for correct spectral and peptide identifications was up to 14.81% and 6.2%, respectively. The approach is applicable to different search methodologies: separate as well as concatenated database searches, high mass accuracy searches, and semi-tryptic and modification searches. FlexiFDR was also applied to Mascot results and showed improved performance. We have shown that an appropriate threshold learnt from decoys can be very effective in improving the database search results. FlexiFDR adapts itself to different instruments, data types and MS platforms. It learns from the decoy hits and sets a flexible threshold that automatically aligns itself to the underlying variables of data quality and size.

3.
Confident peptide identification is one of the most important components in mass-spectrometry-based proteomics. We propose a method to properly combine the results from different database search methods to enhance the accuracy of peptide identifications. The database search methods included in our analysis are SEQUEST (v27 rev12), ProbID (v1.0), InsPecT (v20060505), Mascot (v2.1), X! Tandem (v2007.07.01.2), OMSSA (v2.0) and RAId_DbS. Using two data sets, one collected in profile mode and one collected in centroid mode, we tested the search performance of all 21 combinations of two search methods as well as all 35 possible combinations of three search methods. The results obtained from our study suggest that properly combining search methods does improve retrieval accuracy. In addition to performance results, we also describe the theoretical framework which in principle allows one to combine many independent scoring methods including de novo sequencing and spectral library searches. The correlations among different methods are also investigated in terms of common true positives, common false positives, and a global analysis. We find that the average correlation strength, between any pairwise combination of the seven methods studied, is usually smaller than the associated standard error. This indicates only weak correlation may be present among different methods and validates our approach in combining the search results. The usefulness of our approach is further confirmed by showing that the average cumulative number of false positive peptides agrees reasonably well with the combined E-value. The data related to this study are freely available upon request.
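
As a quick check of the counts quoted in this abstract, the pairs and triples of the seven listed search methods can be enumerated directly. This is only an illustrative sketch; the helper name `engine_combinations` is hypothetical and nothing here reflects how the authors combined the results themselves.

```python
from itertools import combinations

# The seven database search methods named in the abstract.
ENGINES = [
    "SEQUEST", "ProbID", "InsPecT", "Mascot",
    "X! Tandem", "OMSSA", "RAId_DbS",
]


def engine_combinations(k):
    """Return every unordered combination of k search methods."""
    return list(combinations(ENGINES, k))


pairs = engine_combinations(2)    # C(7, 2) combinations
triples = engine_combinations(3)  # C(7, 3) combinations
```

The lengths of `pairs` and `triples` match the 21 and 35 combinations reported above.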

4.

Background  

High-throughput shotgun proteomics data contain a significant number of spectra from non-peptide ions or spectra of too poor quality to obtain highly confident peptide identifications. These spectra cannot be identified with any positive peptide matches in some database search programs or are identified with false positives in others. Removing these spectra can improve the database search results and lower computational expense.

5.

Background  

Rejection of false positive peptide matches in database searches of shotgun proteomic experimental data is highly desirable. Several methods have been developed that use the peptide retention time to refine and improve peptide identifications from database search algorithms. This report describes the implementation of an automated approach to reduce false positives and validate peptide matches.

6.
Although bioacoustics is increasingly used to study species and environments for their monitoring and conservation, detecting calls produced by species of interest is prohibitively time consuming when done manually. Here we compared four methods for detecting and identifying roar-barks of maned wolves (Chrysocyon brachyurus) within long sound recordings: (1) a manual method, (2) an automated detector method using Raven Pro 1.4, (3) an automated detector method using XBAT and (4) a mixed method using XBAT's detector followed by manual verification. Recordings were made using a song meter installed at the Serra da Canastra National Park (Minas Gerais, Brazil). For each method we evaluated the following variables in a 24-h recording: (1) total time required to analyse the files, (2) number of false positives identified and (3) number of true positives identified compared to the total number of target sounds. Automated methods required less time to analyse the recordings (77–93 min) than the manual method (189 min), but consistently produced more false positives and were less efficient in identifying true positives (manual = 91.89%, Raven = 32.43% and XBAT = 84.86%). Adding a manual verification after XBAT detection dramatically increased efficiency in identifying target sounds (XBAT+manual = 100% true positives). Manual verification of XBAT detections seems to be the best of the proposed methods for collecting target sound data in studies where large amounts of audio data need to be analysed in a reasonable time (111 min, 58.73% of the time required to find calls manually).

7.
With great biological interest in post-translational modifications (PTMs), various approaches have been introduced to identify PTMs using MS/MS. Recent developments for PTM identification have focused on an unrestrictive approach that searches MS/MS spectra for all known and possibly even unknown types of PTMs at once. However, the resulting expanded search space requires much longer search time and also increases the number of false positives (incorrect identifications) and false negatives (missed true identifications), thus creating a bottleneck in high throughput analysis. Here we introduce MODa, a novel "multi-blind" spectral alignment algorithm that allows for fast unrestrictive PTM searches with no limitation on the number of modifications per peptide while featuring over an order of magnitude speedup in relation to existing approaches. We demonstrate the sensitivity of MODa on human shotgun proteomics data where it reveals multiple mutations, a wide range of modifications (including glycosylation), and evidence for several putative novel modifications. Based on the reported findings, we argue that the efficiency and sensitivity of MODa make it the first unrestrictive search tool with the potential to fully replace conventional restrictive identification of proteomics mass spectrometry data.

8.
Data produced from the MudPIT analysis of yeast (S. cerevisiae) and rice (O. sativa) were used to develop a technique to validate single-peptide protein identifications using complementary database search algorithms. This results in a considerable reduction of overall false-positive rates for protein identifications; the overall false discovery rates in yeast are reduced from near 25% to less than 1%, and the false discovery rate of yeast single-peptide protein identifications becomes negligible. This technique can be employed by laboratories utilizing a SEQUEST-based proteomic analysis platform, incorporating the XTandem algorithm as a complementary tool for verification of single-peptide protein identifications. We have achieved this using open-source software, including several data-manipulation software tools developed in our laboratory, which are freely available to download.

9.
MOTIVATION: The deluge of biological information from different genomic initiatives and the rapid advancement in biotechnologies have made bioinformatics tools an integral part of modern biology. Among the widely used sequence alignment tools, BLAST and PSI-BLAST are arguably the most popular. PSI-BLAST, which uses an iterative profile position specific score matrix (PSSM)-based search strategy, is more sensitive than BLAST in detecting weak homologies, thus making it suitable for remote homolog detection. Many refinements have been made to improve PSI-BLAST, and its computational efficiency and high specificity have been much touted. Nevertheless, corruption of its profile via the incorporation of false positive sequences remains a major challenge. RESULTS: We have developed a simple and elegant approach to resolve the problem of model corruption in PSI-BLAST searches. We hypothesized that combining results from the first (least-corrupted) profile with results from later (most sensitive) iterations of PSI-BLAST provides a better discriminator for true and false hits. Accordingly, we have derived a formula that utilizes the E-values from these two PSI-BLAST iterations to obtain a figure of merit for rank-ordering the hits. Our verification results based on a 'gold-standard' test set indicate that this figure of merit does indeed delineate true positives from false positives better than PSI-BLAST E-values. Perhaps what is most notable about this strategy is that it is simple and straightforward to implement.

10.
False positive peptide identifications are a major concern in the field of peptidecentric, mass spectrometry-driven gel-free proteomics. They occur in regions where the score distributions of true positives and true negatives overlap. Removal of these false positive identifications necessarily involves a trade-off between sensitivity and specificity. Existing postprocessing tools typically rely on a fixed or semifixed set of assumptions in their attempts to optimize both the sensitivity and the specificity of peptide and protein identification using MS/MS spectra. Because of the expanding diversity in available proteomics technologies, however, these postprocessing tools often struggle to adapt to emerging technology-specific peculiarities. Here we present a novel tool named Peptizer that solves this adaptability issue by making use of pluggable assumptions. This research-oriented postprocessing tool also includes a graphical user interface to perform efficient manual validation of suspect identifications for optimal sensitivity recovery. Peptizer is open source software under the Apache2 license and is written in Java.

11.
One of the challenges associated with large-scale proteome analysis using tandem mass spectrometry (MS/MS) and automated database searching is to reduce the number of false positive identifications without sacrificing the number of true positives found. In this work, a systematic investigation of the effect of 2MEGA labeling (N-terminal dimethylation after lysine guanidination) on the proteome analysis of a membrane fraction of an Escherichia coli cell extract by 2-dimensional liquid chromatography MS/MS is presented. By a large-scale comparison of MS/MS spectra of native peptides with those from the 2MEGA-labeled peptides, the labeled peptides were found to undergo facile fragmentation with enhanced a1 or a1-related (a1-17 and a1-45) ions derived from all N-terminal amino acids in the MS/MS spectra; these ions are usually difficult to detect in the MS/MS spectra of nonderivatized peptides. The 2MEGA labeling alleviated the biased detection of arginine-terminated peptides that is often observed in MALDI and ESI MS experiments. 2MEGA labeling was found not only to increase the number of peptides and proteins identified but also to generate enhanced a1 or a1-related ions as a constraint to reduce the number of false positive identifications. In total, 640 proteins were identified from the E. coli membrane fraction, with each protein identified based on peptide mass and sequence match of one or more peptides using the MASCOT database search algorithm from the MS/MS spectra generated by a quadrupole time-of-flight mass spectrometer. Among them, the subcellular locations of 336 proteins are presently known, including 258 membrane and membrane-associated proteins (76.8%). Among the classified proteins, there was a dramatic increase in the total number of integral membrane proteins identified in the 2MEGA-labeled sample (153 proteins) versus the unlabeled sample (77 proteins).

12.
In a wide range of contexts, including predator avoidance, medical decision-making and security screening, decision accuracy is fundamentally constrained by the trade-off between true and false positives. Increased true positives are possible only at the cost of increased false positives; conversely, decreased false positives are associated with decreased true positives. We use an integrated theoretical and experimental approach to show that a group of decision-makers can overcome this basic limitation. Using a mathematical model, we show that a simple quorum decision rule enables individuals in groups to simultaneously increase true positives and decrease false positives. The results from a predator-detection experiment that we performed with humans are in line with these predictions: (i) after observing the choices of the other group members, individuals both increase true positives and decrease false positives, (ii) this effect gets stronger as group size increases, (iii) individuals use a quorum threshold set between the average true- and false-positive rates of the other group members, and (iv) individuals adjust their quorum adaptively to the performance of the group. Our results have broad implications for our understanding of the ecology and evolution of group-living animals and lend themselves to applications in the human domain such as the design of improved screening methods in medical, forensic, security and business applications.
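
The quorum effect described in this abstract can be illustrated with a simple binomial calculation. This is a sketch under stated assumptions, not the authors' model: the other group members are assumed to decide independently with a common true-positive rate and a common false-positive rate, and the function name and all numbers are illustrative.

```python
from math import ceil, comb


def quorum_positive_rate(p_positive, n_others, quorum):
    """Probability that at least a fraction `quorum` of `n_others`
    independent members respond positively, each with probability
    `p_positive`.
    """
    # Smallest vote count meeting the quorum; the small tolerance guards
    # against floating-point error in quorum * n_others.
    k_min = max(0, ceil(quorum * n_others - 1e-9))
    return sum(
        comb(n_others, k) * p_positive**k * (1 - p_positive) ** (n_others - k)
        for k in range(k_min, n_others + 1)
    )


# Illustrative values: individual rates of 0.7 (true positive) and
# 0.3 (false positive), 9 other members, quorum set between the two rates.
group_tpr = quorum_positive_rate(0.7, n_others=9, quorum=0.5)
group_fpr = quorum_positive_rate(0.3, n_others=9, quorum=0.5)
```

With these illustrative numbers, an individual following the quorum would respond positively with probability of about 0.90 when the signal is present and about 0.10 when it is absent, so true positives rise and false positives fall together relative to the individual rates of 0.7 and 0.3.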

13.
MS/MS is a widely used method for proteome‐wide analysis of protein expression and PTMs. The thousands of MS/MS spectra produced from a single experiment pose a major challenge for downstream analysis. Standard programs, such as MASCOT, provide peptide assignments for many of the spectra, including identification of PTM sites, but these results are plagued by false‐positive identifications. In phosphoproteomic experiments, only a single peptide assignment is typically available to support identification of each phosphorylation site, and hence minimizing false positives is critical. Thus, tedious manual validation is often required to increase confidence in the spectral assignments. We have developed phoMSVal, an open‐source platform for managing MS/MS data and automatically validating identified phosphopeptides. We tested five classification algorithms with 17 extracted features to separate correct peptide assignments from incorrect ones using over 2600 manually curated spectra. The naïve Bayes algorithm was among the best classifiers with an AUC value of 97% and PPV of 97% for phosphotyrosine data. This classifier required only three features to achieve a 76% decrease in false positives as compared with MASCOT while retaining 97% of true positives. This algorithm was able to classify an independent phosphoserine/threonine data set with AUC value of 93% and PPV of 91%, demonstrating the applicability of this method for all types of phospho‐MS/MS data. PhoMSVal is available at http://csbi.ltdk.helsinki.fi/phomsval .

14.
Shotgun proteomics using mass spectrometry is a powerful method for protein identification but suffers limited sensitivity in complex samples. Integrating peptide identifications from multiple database search engines is a promising strategy to increase the number of peptide identifications and reduce the volume of unassigned tandem mass spectra. Existing methods pool statistical significance scores such as p-values or posterior probabilities of peptide-spectrum matches (PSMs) from multiple search engines after high scoring peptides have been assigned to spectra, but these methods lack reliable control of identification error rates as data are integrated from different search engines. We developed a statistically coherent method for integrative analysis, termed MSblender. MSblender converts raw search scores from search engines into a probability score for every possible PSM and properly accounts for the correlation between search scores. The method reliably estimates false discovery rates and identifies more PSMs than any single search engine at the same false discovery rate. Increased identifications increment spectral counts for most proteins and allow quantification of proteins that would not have been quantified by individual search engines. We also demonstrate that enhanced quantification contributes to improved sensitivity in differential expression analyses.

15.
Identification of proteins by MS/MS is performed by matching experimental mass spectra against calculated spectra of all possible peptides in a protein database. The search engine assigns each spectrum a score indicating how well the experimental data agree with the expected data; a higher score means increased confidence in the identification. One problem is false-positive identifications, which arise from incomplete data as well as from the presence of misleading ions in experimental mass spectra due to gas-phase reactions, stray ions, contaminants, and electronic noise. We employed a novel technique for reducing false positives that is based on the combined use of the orthogonal fragmentation techniques electron capture dissociation (ECD) and collisionally activated dissociation (CAD). Since ECD and CAD exhibit many complementary properties, their combined use greatly increased the analysis specificity, which was further strengthened by the high mass accuracy (approximately 1 ppm) afforded by Fourier transform mass spectrometry. The utility of this approach is demonstrated on a whole cell lysate from Escherichia coli. Analysis was made using the data-dependent acquisition mode. Extraction of complementary sequence information was performed prior to the database search using in-house written software. Only masses involved in complementary pairs in the MS/MS spectrum from the same or orthogonal fragmentation techniques were submitted to the database search. ECD/CAD identified twice as many proteins at a fixed statistically significant confidence level with on average a 64% higher Mascot score. The confidence in protein identification was thereby increased by more than 1 order of magnitude. The combined ECD/CAD searches were on average 20% faster than CAD-only searches. A specially developed test with scrambled MS/MS data revealed that the number of false-positive identifications was dramatically reduced by the combined use of CAD and ECD.

16.
LC‐MS experiments can generate large quantities of data, for which a variety of database search engines are available to make peptide and protein identifications. Decoy databases are becoming widely used to place statistical confidence in result sets, allowing the false discovery rate (FDR) to be estimated. Different search engines produce different identification sets so employing more than one search engine could result in an increased number of peptides (and proteins) being identified, if an appropriate mechanism for combining data can be defined. We have developed a search engine independent score, based on FDR, which allows peptide identifications from different search engines to be combined, called the FDR Score. The results demonstrate that the observed FDR is significantly different when analysing the set of identifications made by all three search engines, by each pair of search engines or by a single search engine. Our algorithm assigns identifications to groups according to the set of search engines that have made the identification, and re‐assigns the score (combined FDR Score). The combined FDR Score can differentiate between correct and incorrect peptide identifications with high accuracy, allowing on average 35% more peptide identifications to be made at a fixed FDR than using a single search engine.
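
The decoy-based FDR estimate that this abstract builds on can be sketched as below. This is not the FDR Score algorithm itself, which the abstract does not specify; it assumes a separate target and decoy search in which decoy hits above a threshold approximate the number of false target hits, and the function name and sample scores are hypothetical.

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimate the FDR among target hits scoring at or above `threshold`.

    Assumed estimator: (decoy hits >= threshold) / (target hits >= threshold),
    capped at 1.0. Returns 0.0 when no target hit passes the threshold.
    """
    n_target = sum(1 for s in target_scores if s >= threshold)
    n_decoy = sum(1 for s in decoy_scores if s >= threshold)
    if n_target == 0:
        return 0.0
    return min(1.0, n_decoy / n_target)


# Illustrative scores only, where higher means a better match.
targets = [10, 9, 8, 7, 3]
decoys = [8, 2, 1]
fdr_at_7 = estimate_fdr(targets, decoys, threshold=7)
```

Here four target hits and one decoy hit pass the threshold of 7, giving an estimated FDR of 0.25. Grouping identifications by the set of search engines that reported them, as the abstract describes, would apply an estimate of this kind within each group.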

17.
Searches using position specific scoring matrices (PSSMs) have been commonly used in remote homology detection procedures such as PSI-BLAST and RPS-BLAST. A PSSM is typically generated using one of the sequences of a family as the reference sequence. In the case of PSI-BLAST searches the reference sequence is the same as the query. Recently we have shown that searches against the database of multiple family-profiles, with each one of the members of the family used as a reference sequence, are more effective than searches against the classical database of single family-profiles. Despite a relatively better overall performance when compared with common sequence-profile matching procedures, searches against the multiple family-profiles database result in a few false positives and false negatives. Here we show that profile length and divergence of sequences used in the construction of a PSSM have a major influence on the performance of the multiple-profile-based search approach. We also find that a simple parameter, defined as the number of PSSMs of a family that are hit by a query divided by the total number of PSSMs in the family, can effectively distinguish the true positives from the false positives in the multiple-profile search approach.
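
The parameter described at the end of this abstract is a simple ratio, sketched here for concreteness. The function name and the PSSM identifiers are hypothetical, and the abstract gives no cutoff for separating true from false positives, so none is assumed.

```python
def family_hit_fraction(hit_pssm_ids, family_pssm_ids):
    """Fraction of a family's PSSMs that a query hits.

    hit_pssm_ids: identifiers of all PSSMs hit by the query.
    family_pssm_ids: identifiers of every PSSM built for the family.
    """
    family = set(family_pssm_ids)
    if not family:
        raise ValueError("family must contain at least one PSSM")
    # Only hits that belong to this family count towards the numerator.
    return len(family & set(hit_pssm_ids)) / len(family)


# Illustrative identifiers: the query hits 3 of the family's 4 PSSMs,
# plus one PSSM from an unrelated family.
fraction = family_hit_fraction(
    ["fam1_a", "fam1_b", "fam1_c", "fam2_a"],
    ["fam1_a", "fam1_b", "fam1_c", "fam1_d"],
)
```

In this example the fraction is 0.75; by the abstract's account, higher fractions would point towards a true family member.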

18.
19.
Small interfering RNAs (siRNAs) have become a ubiquitous experimental tool for down-regulating mRNAs. Unfortunately, off-target effects are a significant source of false positives in siRNA experiments and an effective control for them has not previously been identified. We introduce two methods of mismatched siRNA design for negative controls based on changing bases in the middle of the siRNA to their complement bases. To test these controls, a test set of 20 highly active siRNAs (10 true positives and 10 false positives) was identified from a genome-wide screen performed in a cell line expressing a simple, constitutively expressed luciferase reporter. Three controls were then synthesized for each of these 20 siRNAs, the first two using the proposed mismatch design methods and the third being a simple random permutation of the sequence (scrambled siRNA). When tested in the original assay, the scrambled siRNAs showed significantly reduced activity in comparison to the original siRNAs, regardless of whether they had been identified as true or false positives, indicating that they have little utility as experimental controls. In contrast, one of the proposed mismatch design methods, dubbed C911 because bases 9 through 11 of the siRNA are replaced with their complement, was able to completely distinguish between the two groups. False positives due to off-target effects maintained most of their activity when the C911 mismatch control was tested, whereas true positives whose phenotype was due to on-target effects lost most or all of their activity when the C911 mismatch was tested. The ability of control siRNAs to distinguish between true and false positives, if widely adopted, could reduce erroneous results being reported in the literature and save research dollars spent on expensive follow-up experiments.
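
The C911 design rule stated in this abstract (bases 9 through 11 replaced with their complement) is easy to express in code. The sketch below assumes a single RNA strand written 5' to 3' using A, C, G and U, with 1-based positions; which strand the authors apply the rule to is not stated in the abstract, and the function name and example sequence are hypothetical.

```python
_RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def c911_control(sirna):
    """Return a copy of `sirna` with bases 9-11 (1-based) complemented."""
    seq = sirna.upper()
    if len(seq) < 11:
        raise ValueError("sequence must have at least 11 bases")
    if any(base not in _RNA_COMPLEMENT for base in seq):
        raise ValueError("sequence must contain only A, C, G and U")
    # Python slices are 0-based, so bases 9-11 are seq[8:11].
    middle = "".join(_RNA_COMPLEMENT[base] for base in seq[8:11])
    return seq[:8] + middle + seq[11:]


# Made-up 19-nt sequence: positions 9-11 are "GCU".
control = c911_control("AAAAAAAAGCUAAAAAAAA")
```

Positions 9 to 11 change from "GCU" to "CGA" while the flanking bases are untouched, and applying the function twice returns the original sequence.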

20.
Manual analysis of mass spectrometry data is a current bottleneck in high throughput proteomics. In particular, the need to manually validate the results of mass spectrometry database searching algorithms can be prohibitively time-consuming. Development of software tools that attempt to quantify the confidence in the assignment of a protein or peptide identity to a mass spectrum is an area of active interest. We sought to extend work in this area by investigating the potential of recent machine learning algorithms to improve the accuracy of these approaches and to serve as a flexible framework for accommodating new data features. Specifically, we demonstrated the ability of boosting and random forest approaches to improve the discrimination of true hits from false positive identifications in the results of mass spectrometry database search engines compared with thresholding and other machine learning approaches. We accommodated additional attributes obtainable from database search results, including a factor addressing proton mobility. Performance was evaluated using publicly available electrospray data and a new collection of MALDI data generated from purified human reference proteins.
