共查询到20条相似文献,搜索用时 0 毫秒
1.
Background
A better understanding of the mechanisms involved in gas-phase fragmentation of peptides is essential for the development of more reliable algorithms for high-throughput protein identification using mass spectrometry (MS). Current methodologies depend predominantly on the use of derived m/z values of fragment ions, and, the knowledge provided by the intensity information present in MS/MS spectra has not been fully exploited. Indeed spectrum intensity information is very rarely utilized in the algorithms currently in use for high-throughput protein identification. 相似文献2.
For MALDI-TOF mass spectrometry, we show that the intensity of a peptide-ion peak is directly correlated with its sequence, with the residues M, H, P, R, and L having the most substantial effect on ionization. We developed a machine learning approach that exploits this relationship to significantly improve peptide mass fingerprint (PMF) accuracy based on training data sets from both true-positive and false-positive PMF searches. The model's cross-validated accuracy in distinguishing real versus false-positive database search results is 91%, rivaling the accuracy of MS/MS-based protein identification. 相似文献
3.
Mass spectrometry data from high-resolution time-of-flight instruments often contain a vast number of noninformative background-ion peaks whose signal is similar to that of peptide peaks. Consequently, seeking peptide signal in these spectra based on a signal-to-noise ratio will remove signal peaks as well as noise. This work characterizes the background as a precursor to seeking peptide-related features. Robust-regression methods are used to estimate distributions for null (background) peak intensities and locations. Defining signal peaks as outliers with respect to these distributions leads to more precision in detecting the isotopic envelope of peaks from low-abundance peptides in high-resolution spectra. 相似文献
4.
Manual analysis of mass spectrometry data is a current bottleneck in high throughput proteomics. In particular, the need to manually validate the results of mass spectrometry database searching algorithms can be prohibitively time-consuming. Development of software tools that attempt to quantify the confidence in the assignment of a protein or peptide identity to a mass spectrum is an area of active interest. We sought to extend work in this area by investigating the potential of recent machine learning algorithms to improve the accuracy of these approaches and as a flexible framework for accommodating new data features. Specifically we demonstrated the ability of boosting and random forest approaches to improve the discrimination of true hits from false positive identifications in the results of mass spectrometry database search engines compared with thresholding and other machine learning approaches. We accommodated additional attributes obtainable from database search results, including a factor addressing proton mobility. Performance was evaluated using publically available electrospray data and a new collection of MALDI data generated from purified human reference proteins. 相似文献
5.
Salmi J Moulder R Filén JJ Nevalainen OS Nyman TA Lahesmaa R Aittokallio T 《Bioinformatics (Oxford, England)》2006,22(4):400-406
Peptide identification by tandem mass spectrometry is an important tool in proteomic research. Powerful identification programs exist, such as SEQUEST, ProICAT and Mascot, which can relate experimental spectra to the theoretical ones derived from protein databases, thus removing much of the manual input needed in the identification process. However, the time-consuming validation of the peptide identifications is still the bottleneck of many proteomic studies. One way to further streamline this process is to remove those spectra that are unlikely to provide a confident or valid peptide identification, and in this way to reduce the labour from the validation phase. RESULTS: We propose a prefiltering scheme for evaluating the quality of spectra before the database search. The spectra are classified into two classes: spectra which contain valuable information for peptide identification and spectra that are not derived from peptides or contain insufficient information for interpretation. The different spectral features developed for the classification are tested on a real-life material originating from human lymphoblast samples and on a standard mixture of 9 proteins, both labelled with the ICAT-reagent. The results show that the prefiltering scheme efficiently separates the two spectra classes. 相似文献
6.
Background
Many algorithms have been developed for deciphering the tandem mass spectrometry (MS) data sets. They can be essentially clustered into two classes. The first performs searches on theoretical mass spectrum database, while the second based itself on de novo sequencing from raw mass spectrometry data. It was noted that the quality of mass spectra affects significantly the protein identification processes in both instances. This prompted the authors to explore ways to measure the quality of MS data sets before subjecting them to the protein identification algorithms, thus allowing for more meaningful searches and increased confidence level of proteins identified. 相似文献7.
Wu LC Hsieh PH Horng JT Jou YJ Lin CD Cheng KF Lin CW Chen SY 《Protein and peptide letters》2012,19(1):120-129
Mass spectrometry biomarker discovery may assist patient's diagnosis in time and realize the characteristics of new diseases. Our previous work built a preprocess method called HHTmass which is capable of removing noise, but HHTmass only a proof of principle to be peak detectable and did not tested for peak reappearance rate and used on medical data. We developed a modified version of biomarker discovery method called Enhance HHTMass (E-HHTMass) for MALDI-TOF and SELDI-TOF mass spectrometry data which improved old HHTMass method by removing the interpolation and the biomarker discovery process. E-HHTMass integrates the preprocessing and classification functions to identify significant peaks. The results show that most known biomarker can be found and high peak appearance rate achieved comparing to MSCAP and old HHTMass2. E-HHTMass is able to adapt to spectra with a small increasing interval. In addition, new peaks are detected which can be potential biomarker after further validation. 相似文献
8.
Mass spectrometry data is inherently uncertain. Rather than compare peak heights across samples, a comparison can be made of the relative ordering of the peak height across samples. Order statistics are used to provide a distance metric between each ordered list of peak heights from the samples. A principal component analysis is performed on the set of distance vectors to highlight to important components. 相似文献
9.
In a regression setting, it is often of interest to quantify the importance of various features in predicting the response. Commonly, the variable importance measure used is determined by the regression technique employed. For this reason, practitioners often only resort to one of a few regression techniques for which a variable importance measure is naturally defined. Unfortunately, these regression techniques are often suboptimal for predicting the response. Additionally, because the variable importance measures native to different regression techniques generally have a different interpretation, comparisons across techniques can be difficult. In this work, we study a variable importance measure that can be used with any regression technique, and whose interpretation is agnostic to the technique used. This measure is a property of the true data‐generating mechanism. Specifically, we discuss a generalization of the analysis of variance variable importance measure and discuss how it facilitates the use of machine learning techniques to flexibly estimate the variable importance of a single feature or group of features. The importance of each feature or group of features in the data can then be described individually, using this measure. We describe how to construct an efficient estimator of this measure as well as a valid confidence interval. Through simulations, we show that our proposal has good practical operating characteristics, and we illustrate its use with data from a study of risk factors for cardiovascular disease in South Africa. 相似文献
10.
Harezlak J Wu MC Wang M Schwartzman A Christiani DC Lin X 《Journal of proteome research》2008,7(1):217-224
Plasma biomarkers of exposure to environmental contaminants play an important role in early detection of disease. The emerging field of proteomics presents an attractive opportunity for candidate biomarker discovery, as it simultaneously measures and analyzes a large number of proteins. This article presents a case study for measuring arsenic concentrations in a population residing in an As-endemic region of Bangladesh using plasma protein expressions measured by SELDI-TOF mass spectrometry. We analyze the data using a unified statistical method based on functional learning to preprocess mass spectra and extract mass spectrometry (MS) features and to associate the selected MS features with arsenic exposure measurements. The task is challenging due to several factors, the high dimensionality of mass spectrometry data, complicated error structures, and a multiple comparison problem. We use nonparametric functional regression techniques for MS modeling, peak detection based on the significant zero-downcrossing method, and peak alignment using a warping algorithm. Our results show significant associations of arsenic exposure to either under- or overexpressions of 20 proteins. 相似文献
11.
It is important to preprocess high-throughput data generated from mass spectrometry experiments in order to obtain a successful proteomics analysis. Outlier detection is an important preprocessing step. A naive outlier detection approach may miss many true outliers and instead select many non-outliers because of the heterogeneity of the variability observed commonly in high-throughput data. Because of this issue, we developed a outlier detection software program accounting for the heterogeneous variability by utilizing linear, non-linear and non-parametric quantile regression techniques. Our program was developed using the R computer language. As a consequence, it can be used interactively and conveniently in the R environment. AVAILABILITY: An R package, OutlierD, is available at the Bioconductor project at http://www.bioconductor.org 相似文献
12.
N I Taranenko S L Allman V V Golovlev N V Taranenko N R Isola C H Chen 《Nucleic acids research》1998,26(10):2488-2490
Sequencing of DNA fragments of 130 and 200 bp using matrix-assisted laser desorption/ionization time-of-flight mass spectrometry for DNA ladder detection was demonstrated. With further improvement in mass resolution and detection sensitivity, mass spectrometry shows great promise for routine DNA sequencing in the future. 相似文献
13.
De Bruyne K Slabbinck B Waegeman W Vauterin P De Baets B Vandamme P 《Systematic and applied microbiology》2011,34(1):20-29
At present, there is much variability between MALDI-TOF MS methodology for the characterization of bacteria through differences in e.g., sample preparation methods, matrix solutions, organic solvents, acquisition methods and data analysis methods. After evaluation of the existing methods, a standard protocol was developed to generate MALDI-TOF mass spectra obtained from a collection of reference strains belonging to the genera Leuconostoc, Fructobacillus and Lactococcus. Bacterial cells were harvested after 24 h of growth at 28 °C on the media MRS or TSA. Mass spectra were generated, using the CHCA matrix combined with a 50:48:2 acetonitrile:water:trifluoroacetic acid matrix solution, and analyzed by the cell smear method and the cell extract method. After a data preprocessing step, the resulting high quality data set was used for PCA, distance calculation and multi-dimensional scaling. Using these analyses, species-specific information in the MALDI-TOF mass spectra could be demonstrated. As a next step, the spectra, as well as the binary character set derived from these spectra, were successfully used for species identification within the genera Leuconostoc, Fructobacillus, and Lactococcus. Using MALDI-TOF MS identification libraries for Leuconostoc and Fructobacillus strains, 84% of the MALDI-TOF mass spectra were correctly identified at the species level. Similarly, the same analysis strategy within the genus Lactococcus resulted in 94% correct identifications, taking species and subspecies levels into consideration. Finally, two machine learning techniques were evaluated as alternative species identification tools. The two techniques, support vector machines and random forests, resulted in accuracies between 94% and 98% for the identification of Leuconostoc and Fructobacillus species, respectively. 相似文献
14.
In this article, we describe a general approach to modeling the structure of binary protein complexes using structural mass spectrometry data combined with molecular docking. In the first step, hydroxyl radical mediated oxidative protein footprinting is used to identify residues that experience conformational reorganization due to binding or participate in the binding interface. In the second step, a three-dimensional atomic structure of the complex is derived by computational modeling. Homology modeling approaches are used to define the structures of the individual proteins if footprinting detects significant conformational reorganization as a function of complex formation. A three-dimensional model of the complex is constructed from these binary partners using the ClusPro program, which is composed of docking, energy filtering, and clustering steps. Footprinting data are used to incorporate constraints-positive and/or negative-in the docking step and are also used to decide the type of energy filter-electrostatics or desolvation-in the successive energy-filtering step. By using this approach, we examine the structure of a number of binary complexes of monomeric actin and compare the results to crystallographic data. Based on docking alone, a number of competing models with widely varying structures are observed, one of which is likely to agree with crystallographic data. When the docking steps are guided by footprinting data, accurate models emerge as top scoring. We demonstrate this method with the actin/gelsolin segment-1 complex. We also provide a structural model for the actin/cofilin complex using this approach which does not have a crystal or NMR structure. 相似文献
15.
16.
Mass spectrometry data are often corrupted by noise. It is very difficult to simultaneously detect low-abundance peaks and reduce false-positive peak detection caused by noise. In this paper, we propose to improve peak detection using an additional constraint: the consistent appearance of similar true peaks across multiple spectra. We observe that false -positive peaks in general do not repeat themselves well across multiple spectra. When we align all the identified peaks (including false-positive ones) from multiple spectra together, those false-positive peaks are not as consistent as true peaks. Thus, we propose to use information from other spectra in order to reduce false-positive peaks. The new method improves the detection of peaks over the traditional single spectrum based peak detection methods. Consequently, the discovery of cancer biomarkers also benefits from this improvement. Source code and additional data are available at: http://www.ece.ust.hk/ approximately eeyu/mspeak.htm. 相似文献
17.
Wiebke Timm Alexandra Scherbart Sebastian Böcker Oliver Kohlbacher Tim W Nattkemper 《BMC bioinformatics》2008,9(1):443
Background
Mass spectrometry is a key technique in proteomics and can be used to analyze complex samples quickly. One key problem with the mass spectrometric analysis of peptides and proteins, however, is the fact that absolute quantification is severely hampered by the unclear relationship between the observed peak intensity and the peptide concentration in the sample. While there are numerous approaches to circumvent this problem experimentally (e.g. labeling techniques), reliable prediction of the peak intensities from peptide sequences could provide a peptide-specific correction factor. Thus, it would be a valuable tool towards label-free absolute quantification. 相似文献18.
Macromolecular assemblies play an important role in all cellular processes. While there has recently been significant progress in protein structure prediction based on deep learning, large protein complexes cannot be predicted with these approaches. The integrative structure modeling approach characterizes multi-subunit complexes by computational integration of data from fast and accessible experimental techniques. Crosslinking mass spectrometry is one such technique that provides spatial information about the proximity of crosslinked residues. One of the challenges in interpreting crosslinking datasets is designing a scoring function that, given a structure, can quantify how well it fits the data. Most approaches set an upper bound on the distance between Cα atoms of crosslinked residues and calculate a fraction of satisfied crosslinks. However, the distance spanned by the crosslinker greatly depends on the neighborhood of the crosslinked residues. Here, we design a deep learning model for predicting the optimal distance range for a crosslinked residue pair based on the structures of their neighborhoods. We find that our model can predict the distance range with the area under the receiver-operator curve of 0.86 and 0.7 for intra- and inter-protein crosslinks, respectively. Our deep scoring function can be used in a range of structure modeling applications. 相似文献
19.
Suitable shark conservation depends on well-informed population assessments. Direct methods such as scientific surveys and fisheries monitoring are adequate for defining population statuses, but species-specific indices of abundance and distribution coming from these sources are rare for most shark species. We can rapidly fill these information gaps by boosting media-based remote monitoring efforts with machine learning and automation.We created a database of 53,345 shark images covering 219 species of sharks, and packaged object-detection and image classification models into a Shark Detector bundle. The Shark Detector recognizes and classifies sharks from videos and images using transfer learning and convolutional neural networks (CNNs). We applied these models to common data-generation approaches of sharks: collecting occurrence records from photographs taken by the public or citizen scientists, processing baited remote camera footage and online videos, and data-mining Instagram. We examined the accuracy of each model and tested genus and species prediction correctness as a result of training data quantity.The Shark Detector can classify 47 species pertaining to 26 genera. It sorted heterogeneous datasets of images sourced from Instagram with 91% accuracy and classified species with 70% accuracy. It located sharks in baited remote footage and YouTube videos with 89% accuracy, and classified located subjects to the species level with 69% accuracy. All data-generation methods were processed without manual interaction.As media-based remote monitoring appears to dominate methods for observing sharks in nature, we developed an open-source Shark Detector to facilitate common identification applications. Prediction accuracy of the software pipeline increases as more images are added to the training dataset. We provide public access to the software on our GitHub page. 相似文献
20.