Similar articles
 20 similar articles found (search time: 31 ms)
1.

Motivation

Mass spectrometry is a high-throughput, fast, and accurate method of protein analysis. Using the peaks detected in spectra, we can compare a normal group with a disease group. However, the spectra are complicated by scale shifting and are full of noise. Such shifting makes the spectra non-stationary, so they need to be aligned before comparison. Consequently, preprocessing of the mass data plays an important role in the analysis. Noise in mass spectrometry data comes from many different sources and spans many frequencies. A powerful preprocessing method is needed to remove the large amount of noise in mass spectrometry data.

Results

The Hilbert-Huang transform (HHT) is a signal-processing method designed for non-stationary signals. We provide a novel preprocessing algorithm that can handle MALDI and SELDI spectra. We use the HHT to decompose the spectrum and filter out the very-high-frequency and very-low-frequency components. We believe that the noise in mass spectrometry comes from many sources and that some of it can be removed by analyzing the signal in the frequency domain. Since a protein is expected to appear in the spectrum as a distinct peak, its frequency content should lie in the middle of the frequency range and will not be removed. The results show that HHT, when used for preprocessing, is generally better than other preprocessing methods. The approach not only detects peaks successfully but also denoises spectra efficiently, especially when the data are complex. The drawback of HHT is that it takes much longer to run than the wavelet and traditional methods. However, the processing time is still manageable and is worth the wait to obtain high-quality data.
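
The band-selection idea in this abstract can be illustrated with a minimal sketch. It assumes the spectrum has already been decomposed into intrinsic mode functions (IMFs) by an empirical mode decomposition step that is not shown, and that the IMFs are ordered from highest to lowest frequency. The function name and the `drop_high` / `drop_low` parameters are hypothetical; the abstract does not say how many components the authors discard.

```python
def reconstruct_from_imfs(imfs, drop_high=1, drop_low=1):
    """Sum the IMFs that remain after discarding the extremes.

    imfs: list of equal-length lists, ordered from highest to lowest
          frequency (assumed to come from a separate EMD step).
    drop_high / drop_low: how many of the highest- and lowest-frequency
          components to discard (illustrative defaults, not from the paper).
    """
    if drop_high < 0 or drop_low < 0:
        raise ValueError("drop counts must be non-negative")
    if drop_high + drop_low >= len(imfs):
        raise ValueError("no IMFs would remain after filtering")
    kept = imfs[drop_high:len(imfs) - drop_low]
    # Add the kept components sample by sample.
    return [sum(values) for values in zip(*kept)]
```

Calling `reconstruct_from_imfs(imfs)` with three components keeps only the middle one, which is where the abstract expects protein peaks to sit.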

2.
Wagner M  Naik D  Pothen A 《Proteomics》2003,3(9):1692-1698
We report our results in classifying protein matrix-assisted laser desorption/ionization-time of flight mass spectra obtained from serum samples into diseased and healthy groups. We discuss in detail five of the steps in preprocessing the mass spectral data for biomarker discovery, as well as our criterion for choosing a small set of peaks for classifying the samples. Cross-validation studies with four selected proteins yielded misclassification rates in the 10-15% range for all the classification methods. Three of these proteins or protein fragments are down-regulated and one up-regulated in lung cancer, the disease under consideration in this data set. When cross-validation studies are performed, care must be taken to ensure that the test set does not influence the choice of the peaks used in the classification. Misclassification rates are lower when both the training and test sets are used to select the peaks used in classification versus when only the training set is used. This expectation was validated for various statistical discrimination methods when thirteen peaks were used in cross-validation studies. One particular classification method, a linear support vector machine, exhibited especially robust performance when the number of peaks was varied from four to thirteen, and when the peaks were selected from the training set alone. Experiments with the samples randomly assigned to the two classes confirmed that misclassification rates were significantly higher in such cases than those observed with the true data. This indicates that our findings are indeed significant. We found closely matching masses in a database for protein expression in lung cancer for three of the four proteins we used to classify lung cancer. Data from additional samples, increased experience with the performance of various preprocessing techniques, and affirmation of the biological roles of the proteins that help in classification, will strengthen our conclusions in the future.  
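
The caution about cross-validation can be made concrete with a small sketch in which peak selection is repeated inside every fold using the training samples only. The ranking score (absolute difference of class means) and the nearest-centroid classifier are stand-ins chosen for brevity, not the methods used in the paper, and all function names are hypothetical. The sketch assumes binary labels 0/1 and that both classes appear in every training fold.

```python
def select_peaks(X, y, k):
    """Return indices of the k features with the largest class-mean gap."""
    scores = []
    for j in range(len(X[0])):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(len(scores)), key=lambda j: -scores[j])[:k]


def predict(X_train, y_train, x, peaks):
    """Nearest-centroid prediction restricted to the selected peaks."""
    def distance(label):
        rows = [r for r, l in zip(X_train, y_train) if l == label]
        centroid = [sum(r[j] for r in rows) / len(rows) for j in peaks]
        return sum((x[j] - c) ** 2 for j, c in zip(peaks, centroid))
    return min((0, 1), key=distance)


def cross_validate(X, y, k_peaks, n_folds):
    """Misclassification rate with peak selection done per training fold."""
    errors = 0
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        test = [i for i in range(len(X)) if i % n_folds == fold]
        X_train = [X[i] for i in train]
        y_train = [y[i] for i in train]
        # Held-out samples never influence which peaks are chosen.
        peaks = select_peaks(X_train, y_train, k_peaks)
        errors += sum(predict(X_train, y_train, X[i], peaks) != y[i]
                      for i in test)
    return errors / len(X)
```

Moving the `select_peaks` call outside the loop, so that it sees all samples, would reproduce the optimistic bias the abstract warns about.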

3.
In recent years, mass spectrometry has become one of the core technologies for high-throughput proteomic profiling in biomedical research. However, the reproducibility of results obtained with this technology has been questioned. It has been realized that sophisticated automatic signal processing algorithms using advanced statistical procedures are needed to analyze high-resolution, high-dimensional proteomic data, e.g., Matrix-Assisted Laser Desorption/Ionization Time-of-Flight (MALDI-TOF) data. In this paper we present an R-based software package, pkDACLASS, which provides a complete data analysis solution for users of MALDI-TOF raw data. Complete data analysis comprises data preprocessing, monoisotopic peak detection through statistical model fitting and testing, alignment of the monoisotopic peaks across multiple samples, and classification of normal and diseased samples using the detected peaks. The software lets users either run the complete, integrated analysis in one step or use it as a flexible platform and inspect the results at every step of the analysis. AVAILABILITY: The package is available for free at http://cran.r-project.org/web/packages/pkDACLASS/index.html.

4.
5.
Kwon D  Vannucci M  Song JJ  Jeong J  Pfeiffer RM 《Proteomics》2008,8(15):3019-3029
In recent years there has been an increased interest in using protein mass spectroscopy to discriminate diseased from healthy individuals with the aim of discovering molecular markers for disease. A crucial step before any statistical analysis is the pre-processing of the mass spectrometry data. Statistical results are typically strongly affected by the specific pre-processing techniques used. One important pre-processing step is the removal of chemical and instrumental noise from the mass spectra. Wavelet denoising techniques are a standard method for denoising. Existing techniques, however, do not accommodate errors that vary across the mass spectrum, but instead assume a homogeneous error structure. In this paper we propose a novel wavelet denoising approach that deals with heterogeneous errors by incorporating a variance change point detection method in the thresholding procedure. We study our method on real and simulated mass spectrometry data and show that it improves the performance of peak detection methods.
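
A rough sketch of thresholding with a heterogeneous error structure follows. It assumes the wavelet detail coefficients and the variance change points are already available (neither the transform nor the change point detection is shown), estimates the noise level of each segment from the median absolute coefficient, and applies soft thresholding with a universal-style threshold per segment. These specific choices and the function names are assumptions for illustration; the abstract does not state the authors' estimator or threshold rule.

```python
import math
from statistics import median


def soft_threshold(c, t):
    """Shrink coefficient c toward zero by t; zero it if |c| <= t."""
    if c > t:
        return c - t
    if c < -t:
        return c + t
    return 0.0


def threshold_by_segment(coeffs, change_points):
    """Soft-threshold coefficients with a separate noise level per segment.

    coeffs: wavelet detail coefficients (assumed precomputed).
    change_points: indices where the noise variance is assumed to change.
    """
    n = len(coeffs)
    bounds = [0] + sorted(change_points) + [n]
    out = []
    for start, stop in zip(bounds, bounds[1:]):
        segment = coeffs[start:stop]
        if not segment:
            continue
        # Robust noise estimate for this segment only.
        sigma = median(abs(c) for c in segment) / 0.6745
        t = sigma * math.sqrt(2 * math.log(n))
        out.extend(soft_threshold(c, t) for c in segment)
    return out
```

With a single global noise estimate, a quiet region and a noisy region would share one threshold; splitting at the change points lets a large coefficient in the quiet region survive while the noisy region is still suppressed.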

6.
The purpose of this research was to develop a noise-tolerant and faster processing approach for in vivo and in vitro spectrophotometric applications where distorted spectra are difficult to interpret quantitatively. A PC-based multilayer neural network with a sigmoid activation function and a generalized delta learning rule was trained with a two-component (protonated and unprotonated form) pH-dependent spectrum generated from microspectrophotometry of the vital dye neutral red (NR). The network makes use of the digitized absorption spectrum between 375 and 675 nm. The number of nodes in the input layer was determined by the required resolution. The number of output nodes determined the step size of the quantization value used to distinguish the input spectra (i.e., defined the number of distinct output steps). Mathematical analysis provided the conditions under which this network is guaranteed to converge. Simulation results showed that features of the input spectrum were successfully identified and stored in the weight matrix of the input and hidden layers. After convergent training with typical spectra, a calibration curve was constructed to interpret the output layer activity and thereby predict interpolated pH values of unknown spectra. With its built-in redundant presentation, this approach needed none of the preprocessing procedures (baseline correction or intensive signal averaging) normally used in multicomponent analyses. The identification of unknown spectra with the activities of the output layer is a one-step process using the convergent weight matrix. After learning from examples, real-time applications can be accomplished without solving multiple linear equations as in the multiple linear regression method. This method can be generalized to pattern-oriented sensory information processing and multi-sensor data fusion for quantitative measurement purposes.

7.
Computer simulation of database searches of electron transfer dissociation (ETD) spectra using both "bottom up" and "top down" approaches was performed to evaluate the utility of knowing a priori which product ions contain the C-terminus (i.e., the z* ions). In this work, knowledge of the identities of the z* ions was used to exclude putative identifications that are based solely on the mass matching of undifferentiated product ions derived from an experiment with those derived from in silico fragmentation. The benefit from knowing which ions are z* ions was found to be heavily dependent on the quality of the ETD spectra, in terms of sequence coverage afforded by the product ions, the amount of noise in the spectra (i.e., extraneous peaks that do not directly reflect primary structure), and mass measurement accuracy. Under conditions in which the likelihood of misidentification is high without a priori knowledge of ion types (e.g., b-, y-, c-, or z-ions), knowledge of which product ions are z* ions allows discrimination against false-positive identifications. Relatively little benefit from knowing which ions are z* ions was noted when product spectra reflected relatively high sequence coverage and when a low fraction of the product ions were due to extraneous peaks (i.e., spectra with relatively little noise). In all cases, specificity is higher with higher mass measurement accuracy, with a consequent reduction in the benefit from knowing which ions are z* ions.

8.
The paper presents two analyses of the MALDI-TOF mass spectrometry dataset. Both analyses use the support vector machine as a tool to build a prediction model. The first analysis, which is our contribution to the competition, uses the given spectra data without further processing. In the second analysis, we employed an additional preprocessing step consisting of peak detection, peak alignment, and feature selection based on statistical tests. The experimental results suggest that the preprocessing step with feature selection improves prediction accuracy.

9.
We report the results of our work to facilitate protein identification using tandem mass spectra and protein sequence databases. We describe a parallel version of SEQUEST (SEQUEST-PVM) that is tolerant toward arithmetic exceptions. The changes we report effectively separate search processes on slave nodes from each other. Therefore, if one of the slave nodes drops out of the cluster due to an error, the rest of the cluster will carry the search process to the end. SEQUEST has been widely used for protein identifications. The modifications made to the code improve its stability and effectiveness in a high-throughput production environment. We evaluate the overhead associated with the parallelization of SEQUEST. A prior version of software to preprocess LC/MS/MS data attempted to differentiate the charge states of ions. Singly charged ions can be accurately identified, but the software was unable to reliably differentiate tandem mass spectra of +2 and +3 charge states. We have designed and implemented a computational approach to narrow charge states of precursor ions from nominal resolution ion-trap tandem mass spectra. The preprocessing code, 2to3, determines the charge state of the precursor ion using its mass-to-charge ratio (m/z) and the fragment ions contained in the tandem mass spectrum. For each possible charge state, the program calculates the expected fragment ions that account for the precursor ion m/z value. If the number for either charge state is less than an empirically determined threshold value, the copy of the spectrum corresponding to that charge state is removed. If both numbers are higher than the threshold value, then both the +2 and +3 copies of the spectrum are kept. We present a comparison of results from protein identification experiments with and without using 2to3. It is shown that, by determining the charge state and eliminating poor-quality spectra, 2to3 decreases the number of spectral files to be searched without affecting the search results. The decrease reduces computer requirements and researcher efforts for analysis of the results.
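
The charge-state test described in this abstract can be sketched as follows. The abstract does not say how 2to3 scores a charge hypothesis, so this sketch assumes one common heuristic: for a hypothesised charge, count pairs of singly charged fragment peaks whose m/z values add up to the neutral precursor mass plus two protons (complementary b/y pairs). The function names, the tolerance, and the threshold are hypothetical.

```python
PROTON = 1.007276  # proton mass in Da


def neutral_mass(mz, charge):
    """Neutral precursor mass implied by an m/z value and a charge."""
    return charge * (mz - PROTON)


def count_complementary_pairs(fragment_mzs, precursor_mz, charge, tol=0.5):
    """Count fragment pairs consistent with the hypothesised charge.

    Assumes singly charged fragments, for which a complementary pair
    sums to the neutral precursor mass plus two protons.
    """
    target = neutral_mass(precursor_mz, charge) + 2 * PROTON
    peaks = sorted(fragment_mzs)
    lo, hi, count = 0, len(peaks) - 1, 0
    while lo < hi:
        total = peaks[lo] + peaks[hi]
        if abs(total - target) <= tol:
            count += 1
            lo += 1
            hi -= 1
        elif total < target:
            lo += 1
        else:
            hi -= 1
    return count


def plausible_charges(fragment_mzs, precursor_mz, min_pairs, candidates=(2, 3)):
    """Keep each candidate charge whose pair count reaches the threshold."""
    return [z for z in candidates
            if count_complementary_pairs(fragment_mzs, precursor_mz, z) >= min_pairs]
```

If both +2 and +3 reach `min_pairs`, both are returned, which mirrors keeping both copies of the spectrum; if neither does, the empty list corresponds to discarding the spectrum as poor quality.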

10.
Infrared spectra obtained from cell or tissue specimens have commonly been observed to involve a significant degree of scattering effects, often Mie scattering, which probably overshadows biochemically relevant spectral information by a nonlinear, nonadditive spectral component in Fourier transform infrared (FTIR) spectroscopic measurements. Correspondingly, many successful machine learning approaches for FTIR spectra have relied on preprocessing procedures that computationally remove the scattering components from an infrared spectrum. We propose an approach to approximate this complex preprocessing function using deep neural networks. As we demonstrate, the resulting model is not just several orders of magnitude faster, which is important for real-time clinical applications, but also generalizes strongly across different tissue types. Using Bayesian machine learning approaches, our approach unveils model uncertainty that coincides with a band shift in the amide I region that occurs when scattering is removed computationally based on an established physical model. Furthermore, our proposed method overcomes the trade-off between computation time and the corrected spectrum being biased towards an artificial reference spectrum.

11.
MOTIVATION: Tandem mass spectrometry combined with sequence database searching is one of the most powerful tools for protein identification. As thousands of spectra are generated by a mass spectrometer in one hour, the speed of database searching is critical, especially when searching against a large sequence database, when the peptide is generated by some unknown or non-specific enzyme, or when the target peptides have post-translational modifications (PTM). In practice, about 70-90% of the spectra have no match in the database. Many believe that a significant portion of them are due to peptides from non-specific digestion by unknown enzymes or to amino acid modifications. In another case, scientists may choose to use non-specific enzymes such as pepsin or thermolysin for proteolysis in a proteomic study, because not all proteins are amenable to digestion by site-specific enzymes, and furthermore many digested peptides may not fall within the range of molecular weight suitable for mass spectrometry analysis. Interpreting mass spectra of these kinds costs database search engines a great deal of computational time. OVERVIEW: The present study was designed to speed up the database searching process for both cases. More specifically, we employed an approach combining a suffix tree data structure and a spectrum graph. The suffix tree is used to preprocess the protein sequence database, while the spectrum graph is used to preprocess the tandem mass spectrum. We then search the suffix tree against the spectrum graph for candidate peptides. We design an efficient algorithm to compute a matching threshold with some statistical significance level, e.g. p = 0.01, for each spectrum, and use it to select candidate peptides. Then we rank these peptides using a SEQUEST-like scoring function. The algorithms were implemented and tested on experimental data. For post-translational modifications, we allow an arbitrary number of any modification to a protein.
AVAILABILITY: The executable program and other supplementary materials are available online at: http://hto-c.usc.edu:8000/msms/suffix/.
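
The spectrum-graph half of this approach can be sketched in a few lines. The abstract does not describe how the authors build their graph, so this follows the textbook construction: peaks become nodes, and two peaks are joined by an edge labelled with an amino acid when their mass difference matches that residue's monoisotopic mass. Only five residues are included to keep the example short, the tolerance is an arbitrary choice, and the suffix-tree search is not shown.

```python
# Monoisotopic residue masses in Da (small subset for illustration).
RESIDUE_MASSES = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "P": 97.05276,
    "V": 99.06841,
}


def build_spectrum_graph(peaks, tol=0.02):
    """Return sorted peaks and edges (i, j, residue) between them."""
    peaks = sorted(peaks)
    edges = []
    for i, low in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - low
            for residue, mass in RESIDUE_MASSES.items():
                if abs(diff - mass) <= tol:
                    edges.append((i, j, residue))
    return peaks, edges


def spell_paths(n_nodes, edges, start=0):
    """List residue strings along every path from start to the last node."""
    adjacency = {}
    for i, j, residue in edges:
        adjacency.setdefault(i, []).append((j, residue))
    results = []

    def walk(node, prefix):
        if node == n_nodes - 1:
            results.append(prefix)
            return
        for nxt, residue in adjacency.get(node, []):
            walk(nxt, prefix + residue)

    walk(start, "")
    return results
```

Each string produced by `spell_paths` is a sequence tag that could then be looked up in an index of the protein database, which is the role the suffix tree plays in the abstract.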

12.
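Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) is widely used in food microbiology testing and clinical microbial identification because it is fast, accurate, and high-throughput. Preprocessing and analysis of MALDI-TOF MS data are key steps in microbial identification: processing the data extracts the characteristic peptide or protein information of microorganisms from large volumes of data, and supervised and unsupervised learning methods then classify and cluster this information, enabling identification, typing, and homology analysis of microorganisms. This article reviews the mathematical and statistical analysis methods and the data analysis software used in identifying microorganisms by MALDI-TOF MS.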

13.
In this study, we present a preprocessing method for quadrupole time-of-flight (Q-TOF) tandem mass spectra to increase the accuracy of database searching for peptide (protein) identification. Based on the natural isotopic information inherent in tandem mass spectra, we construct a decision tree after feature selection to classify the noise and ion peaks in tandem spectra. Furthermore, we recognize overlapping peaks to find the monoisotopic masses of ions for the following identification process. The experimental results show that this preprocessing method increases the search speed and the reliability of peptide identification.

14.
Rapid and early identification of pathogens is critical to guide antibiotic therapy. Raman spectroscopy, as a noninvasive diagnostic technique, provides rapid and accurate detection of pathogens. The Raman spectrum of a single cell serves as the “fingerprint” of the cell, revealing its metabolic characteristics. Rapid identification of pathogens can be achieved by combining Raman spectroscopy and deep learning. Traditional classification techniques frequently require large amounts of training data, and collecting Raman spectra is time-consuming. For trace samples and strains that are difficult to culture, it is difficult to provide an accurate classification model. In order to reduce the number of samples collected and improve the accuracy of the classification model, a new pathogen detection method integrating Raman spectroscopy, a variational auto-encoder (VAE), and a long short-term memory network (LSTM) is proposed in this paper. We collect the Raman signals of pathogens and input them to the VAE for training. The VAE generates a large number of Raman spectra that cannot be distinguished from real spectra and that have a higher signal-to-noise ratio than the real spectra. These spectra are input into the LSTM together with the real spectra for training, and a good classification model is obtained. The experimental results show that this method not only improves the average accuracy of pathogen classification to 96.9% but also reduces the number of Raman spectra collected from 1000 to 200. With this technology, the number of Raman spectra collected can be greatly reduced, so that trace strains and strains that are difficult to culture can be rapidly identified.

15.
Neville P  Tan PY  Mann G  Wolfinger R 《Proteomics》2003,3(9):1710-1715
We bring a "spectrum" of classical data mining and statistical analysis methods to bear on discrimination of two groups of spectra from 24 diseased and 17 normal patients. Our primary goal is to accurately estimate the generalizability of this small dataset. After an aggressive preprocessing step that reduces consideration to only 55 peaks, we conduct over 35 out-of-sample cross-validation simulations of logistic regression, binary decision trees, and linear discriminant analysis. Misclassification rates grow worse as the size of the holdout sample increases, with many exceeding 30 percent. The ability to generalize is clearly tempered by the statistical, instrumentation, and biophysical characteristics of the study.

16.
The efficacy of DNA extraction protocols can be highly dependent upon both the type of sample being investigated and the types of downstream analyses performed. Considering that the use of new bacterial community analysis techniques (e.g., microbiomics, metagenomics) is becoming more prevalent in the agricultural and environmental sciences and many environmental samples within these disciplines can be physicochemically and microbiologically unique (e.g., fecal and litter/bedding samples from the poultry production spectrum), appropriate and effective DNA extraction methods need to be carefully chosen. Therefore, a novel semi-automated hybrid DNA extraction method was developed specifically for use with environmental poultry production samples. This method is a combination of the two major types of DNA extraction: mechanical and enzymatic. An intense two-step mechanical homogenization (using bead-beating specifically formulated for environmental samples) was added to the beginning of the “gold standard” enzymatic DNA extraction method for fecal samples to enhance the removal of bacteria and DNA from the sample matrix and improve the recovery of Gram-positive bacterial community members. Once the enzymatic extraction portion of the hybrid method was initiated, the remaining purification process was automated using a robotic workstation to increase sample throughput and decrease sample processing error. In comparison to the strictly mechanical and strictly enzymatic DNA extraction methods, this novel hybrid method provided the best overall combined performance when considering quantitative (using 16S rRNA qPCR) and qualitative (using microbiomics) estimates of the total bacterial communities when processing poultry feces and litter samples.

17.
A classification of teleostean cartilages is essentially an impossible task because of an intergrading of tissues along an almost continuous spectrum of skeletal tissue types. Teleost fish display a spectrum that ranges from cartilage‐like connective tissue to bone‐like cartilage. In addition, many teleost cartilages cannot be equated with mammalian hyaline cartilage. Existing classifications of teleost cartilage are often disregarded due to the necessarily cumbersome terminology that is used to describe the astounding array of tissue types, a neglect that hampers enhancing our knowledge of the origin and function of teleost cartilage.

18.
Discovery of biomarker patterns using proteomic techniques requires examination of large numbers of patient and control samples, followed by data mining of the molecular read-outs (e.g., mass spectra). Adequate signal processing and statistical analysis are critical for successful extraction of markers from these data sets. The protocol, specifically designed for use in conjunction with MALDI-TOF-MS-based serum peptide profiling, is a data analysis pipeline, starting with transfer of raw spectra that are interpreted using signal processing algorithms to define suitable features (i.e., peptides). We describe an algorithm for minimal entropy-based peak alignment across samples. Peak lists obtained in this way, and containing all samples, all peptide features and their normalized MS-ion intensities, can be evaluated, and results validated, using common statistical methods. We recommend visual inspection of the spectra to confirm all results, and have written freely available software for viewing and color-coding of spectral overlays.

19.
Peak detection is a pivotal first step in biomarker discovery from MS data and can significantly influence the results of downstream data analysis steps. We developed a novel automatic peak detection method for prOTOF MS data, which does not require a priori knowledge of protein masses. Random noise is removed by an undecimated wavelet transform and chemical noise is attenuated by an adaptive short‐time discrete Fourier transform. Isotopic peaks corresponding to a single protein are combined by extracting an envelope over them. Depending on the S/N, the desired peaks in each individual spectrum are detected and those with the highest intensity among their peak clusters are recorded. The common peaks among all the spectra are identified by choosing an appropriate cut‐off threshold in the complete linkage hierarchical clustering. To remove the 1 Da shifting of the peaks, the peak corresponding to the same protein is determined as the detected peak with the largest number among its neighborhood. We validated this method using a data set of serial peptide and protein calibration standards. Compared with the MoverZ program, our new method detects more peaks and significantly enhances the S/N of the peaks after chemical noise removal. We then successfully applied this method to a data set of prOTOF MS spectra of albumin and albumin‐bound proteins from serum samples of 59 patients with carotid artery disease compared to vascular disease‐free patients to detect peaks with S/N≥2. Our method is easily implemented and is highly effective at defining peaks that will be used for disease classification or to highlight potential biomarkers.
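
The final S/N cut in this abstract can be illustrated with a deliberately simple sketch. It is not the authors' wavelet and Fourier pipeline: it assumes an already denoised intensity trace, takes the median intensity as the baseline and the median absolute deviation as the noise level, and reports local maxima whose signal-to-noise ratio reaches a threshold. The function name and the noise estimator are assumptions made for illustration.

```python
from statistics import median


def detect_peaks(intensities, min_snr=2.0):
    """Return (index, snr) for local maxima with snr >= min_snr."""
    baseline = median(intensities)
    # Median absolute deviation as a robust noise estimate.
    noise = median(abs(v - baseline) for v in intensities)
    if noise == 0:
        raise ValueError("noise estimate is zero; cannot compute S/N")
    peaks = []
    for i in range(1, len(intensities) - 1):
        v = intensities[i]
        # Local maximum: rises from the left, does not rise to the right.
        if v > intensities[i - 1] and v >= intensities[i + 1]:
            snr = (v - baseline) / noise
            if snr >= min_snr:
                peaks.append((i, snr))
    return peaks
```

Lowering `min_snr` admits the small bumps around the main peak, which shows how strongly the threshold shapes the resulting peak list.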

20.
Mass spectrometry biomarker discovery may assist timely patient diagnosis and help characterize new diseases. Our previous work built a preprocessing method called HHTMass, which is capable of removing noise, but HHTMass was only a proof of principle for peak detection; it was not tested for peak reappearance rate or applied to medical data. We developed a modified biomarker discovery method called Enhance HHTMass (E-HHTMass) for MALDI-TOF and SELDI-TOF mass spectrometry data, which improves on the old HHTMass method by removing the interpolation and the biomarker discovery process. E-HHTMass integrates the preprocessing and classification functions to identify significant peaks. The results show that most known biomarkers can be found and that a high peak appearance rate is achieved compared with MSCAP and the old HHTMass2. E-HHTMass is able to adapt to spectra with a small increasing interval. In addition, new peaks are detected that may be potential biomarkers after further validation.
