Similar literature
1.
Runmin Yang, Daming Zhu. BMC Genomics, 2018, 19(7): 666-39

Background

Database search has been the main approach for proteoform identification by top-down tandem mass spectrometry. However, when the target proteoform that produced the spectrum contains post-translational modifications (PTMs) and/or mutations, it is quite time consuming to align a query spectrum against all protein sequences without any PTMs and mutations in a large database. Consequently, it is essential to develop efficient and sensitive filtering algorithms for speeding up database search.

Results

In this paper, we propose a spectrum graph matching (SGM) based protein sequence filtering method for top-down mass spectral identification. It uses the subspectra of a query spectrum to generate spectrum graphs and searches them against a protein database to report the best candidates. Whereas the sequence tag and gapped tag approaches need a preprocessing step to extract and select tags, the SGM filtering method circumvents this step, thus simplifying data processing. We evaluated the filtration efficiency of the SGM filtering method with various parameter settings on an Escherichia coli top-down mass spectrometry data set and compared the performance of the SGM filtering method with that of two tag-based filtering methods on a data set of MCF-7 cells.

Conclusions

Experimental results on the data sets show that the SGM filtering method achieves high sensitivity in protein sequence filtration. When coupled with a spectral alignment algorithm, the SGM filtering method significantly increases the number of identified proteoform spectrum-matches compared with the tag-based methods in top-down mass spectrometry data analysis.

2.
This paper introduces a modified technique based on the Hilbert-Huang transform (HHT) to improve spectrum estimates of heart rate variability (HRV). To make the beat-to-beat (RR) interval a function of time and produce an evenly sampled time series, we first adopt a preprocessing method that interpolates and resamples the original RR intervals. Then the HHT, which uses empirical mode decomposition (EMD) to decompose the HRV signal into several monocomponent signals that become analytic signals by means of the Hilbert transform, is used to extract the features of the preprocessed time series and to characterize the dynamic behaviors of the parasympathetic and sympathetic nervous systems of the heart. Finally, the frequency behaviors of the Hilbert spectrum and the Hilbert marginal spectrum (HMS) are studied to estimate the spectral traits of HRV signals. Two kinds of experimental data are used to compare our method with conventional power spectral density (PSD) estimation. The analysis of the simulated HRV series shows that interpolation and resampling are basic requirements for HRV data processing, and that HMS is superior to PSD estimation. To further demonstrate the superiority of our approach, real HRV signals were collected from seven young healthy subjects whose autonomic nervous system (ANS) was blocked by acute selective blocking drugs: atropine and metoprolol. The high-frequency power/total power ratio and the low-frequency power/high-frequency power ratio indicate that, compared with the Fourier spectrum based on principal dynamic mode, our method is more sensitive and effective in identifying the low-frequency and high-frequency bands of HRV.
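The interpolation and resampling step described in this abstract can be illustrated with a minimal sketch. It is not the authors' preprocessing method: linear interpolation, the 4 Hz default rate, and the function name `resample_rr` are all assumptions made for illustration.

```python
from bisect import bisect_right


def resample_rr(rr_intervals_s, fs_hz=4.0):
    """Resample an RR-interval series onto an evenly spaced time grid.

    rr_intervals_s: successive RR intervals in seconds (at least two).
    fs_hz: target sampling rate (4 Hz is an assumed default).
    Returns (times, values) for the evenly sampled series.
    """
    # Each RR interval is assigned to the time of the beat that ends it.
    beat_times = []
    t = 0.0
    for rr in rr_intervals_s:
        t += rr
        beat_times.append(t)

    step = 1.0 / fs_hz
    times, values = [], []
    k = 0
    while True:
        grid_t = beat_times[0] + k * step
        if grid_t > beat_times[-1]:
            break
        # Find the pair of beats that brackets grid_t, then interpolate linearly.
        i = bisect_right(beat_times, grid_t)
        i = min(max(i, 1), len(beat_times) - 1)
        t0, t1 = beat_times[i - 1], beat_times[i]
        w = (grid_t - t0) / (t1 - t0)
        lo, hi = rr_intervals_s[i - 1], rr_intervals_s[i]
        values.append(lo + w * (hi - lo))
        times.append(grid_t)
        k += 1
    return times, values
```

The evenly sampled output is what a later decomposition such as EMD, or a PSD estimate, would take as input; other interpolation schemes could be substituted in the same place.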

3.

Background  

Mass spectrometry is an essential technique in proteomics both to identify the proteins of a biological sample and to compare proteomic profiles of different samples. In both cases, the main phase of the data analysis is the procedure to extract the significant features from a mass spectrum. Its final output is the so-called peak list which contains the mass, the charge and the intensity of every detected biomolecule. The main steps of the peak list extraction procedure are usually preprocessing, peak detection, peak selection, charge determination and monoisotoping operation.

4.
Mass spectrometry (MS) is a technique used for biological studies. It associates a spectrum with a biological sample. A spectrum consists of pairs of values (intensity, m/z), where intensity measures the abundance of biomolecules (such as proteins) with a given mass-to-charge ratio (m/z) present in the originating sample. In proteomics experiments, MS spectra are used to identify expression patterns in clinical samples that may be responsible for diseases. Recently, to improve the identification of peptides/proteins related to such patterns, the MS/MS process has been used, which performs a cascade of mass spectrometric analyses on selected peaks. The latter technique has been demonstrated to improve the identification and quantification of proteins/peptides in samples. Nevertheless, MS analysis deals with a huge amount of data, often affected by noise, and thus requires automatic data management systems. Tools have been developed, most often supplied with the instruments, that allow: (i) spectra analysis and visualization, (ii) pattern recognition, (iii) protein database querying, (iv) peptide/protein quantification and identification. Currently most of the tools supporting these phases need to be optimized to improve the identification of proteins and their functions. In this article we survey applications that support spectrometrists and biologists in obtaining information from biological samples, analyzing the available software for the different phases. We consider different mass spectrometry techniques, and thus different requirements. We focus on tools for (i) data preprocessing, which prepares the results obtained from spectrometers for analysis; (ii) spectra analysis, representation and mining, aimed at identifying common and/or hidden patterns in spectra sets or at classifying data; (iii) database querying to identify peptides; and (iv) improving and boosting the identification and quantification of selected peaks.
We trace some open problems and report on requirements that represent new challenges for bioinformatics.

5.

Background  

Many algorithms have been developed for deciphering tandem mass spectrometry (MS) data sets. They can essentially be grouped into two classes. The first performs searches on a theoretical mass spectrum database, while the second is based on de novo sequencing from raw mass spectrometry data. It was noted that the quality of mass spectra significantly affects the protein identification process in both instances. This prompted the authors to explore ways to measure the quality of MS data sets before subjecting them to the protein identification algorithms, thus allowing for more meaningful searches and increased confidence in the proteins identified.

6.
MOTIVATION: Application of mass spectrometry in proteomics is a breakthrough in high-throughput analyses. Early applications have focused on protein expression profiles to differentiate among various types of tissue samples (e.g. normal versus tumor). Here our goal is to use mass spectra to differentiate bacterial species using whole-organism samples. The raw spectra are similar to spectra of tissue samples, raising some of the same statistical issues (e.g. non-uniform baselines and higher noise associated with higher baseline), but are substantially noisier. As a result, new preprocessing procedures are required before these spectra can be used for statistical classification. RESULTS: In this study, we introduce novel preprocessing steps that can be used with any mass spectra. These comprise a standardization step and a denoising step. The noise level for each spectrum is determined using only data from that spectrum. Only spectral features that exceed a threshold defined by the noise level are subsequently used for classification. Using this approach, we trained the Random Forest program to classify 240 mass spectra into four bacterial types. The method resulted in zero prediction errors in the training samples and in two test datasets having 240 and 300 spectra, respectively.
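The idea of a per-spectrum noise threshold can be sketched as follows. The abstract does not say how the noise level is estimated, so the median absolute deviation (MAD), the multiplier `k`, and the function name `denoise_spectrum` are assumptions for illustration rather than the authors' procedure.

```python
from statistics import median


def denoise_spectrum(intensities, k=3.0):
    """Keep only points that rise above a noise threshold from this spectrum alone.

    Returns a list of (index, standardized intensity) pairs.
    """
    baseline = median(intensities)
    # Median absolute deviation, scaled by the usual consistency constant so
    # that it approximates a standard deviation under Gaussian noise.
    mad = median(abs(x - baseline) for x in intensities)
    noise = 1.4826 * mad
    if noise == 0:
        return []  # flat spectrum: nothing stands out from the noise
    # Standardize against the spectrum's own baseline and noise level.
    standardized = [(x - baseline) / noise for x in intensities]
    # Keep features exceeding k noise units (k is an assumed parameter).
    return [(i, z) for i, z in enumerate(standardized) if z > k]
```

Because both the baseline and the noise level come from the same spectrum, no information is shared between spectra before classification.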

7.

Background  

Surface enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI) is a proteomics tool for biomarker discovery and other high throughput applications. Previous studies have identified various areas for improvement in preprocessing algorithms used for protein peak detection. Bottom-up approaches to preprocessing that emphasize modeling SELDI data acquisition are promising avenues of research to find the needed improvements in reproducibility.

8.
The paper presents two analyses of the MALDI-TOF mass spectrometry dataset. Both analyses use the support vector machine as a tool to build a prediction model. The first analysis, which is our contribution to the competition, uses the given spectra data without further processing. In the second analysis, we employed an additional preprocessing step consisting of peak detection, peak alignment and feature selection based on statistical tests. The experimental results suggest that the preprocessing step with feature selection improves prediction accuracy.

9.
The SELDI-TOF mass spectrometer's compact size and automated, high throughput design have been attractive to clinical researchers, and the platform has seen steady use in biomarker studies. Despite new algorithms and preprocessing pipelines that have been developed to address reproducibility issues, visual inspection of the results of SELDI spectra preprocessing by the best algorithms still shows miscalled peaks and systematic sources of error. This suggests that there continue to be problems with SELDI preprocessing. In this work, we study the preprocessing of SELDI in detail and introduce improvements. While many algorithms, including the vendor supplied software, can identify peak clusters of specific mass (or m/z) in groups of spectra with high specificity and low false discovery rate (FDR), the algorithms tend to underperform when estimating the exact prevalence and intensity of peaks in those clusters. Thus group differences that at first appear very strong are shown, after careful and laborious hand inspection of the spectra, to be less than significant. Here we introduce a wavelet/neural network based algorithm which mimics what a team of expert human users would call as peaks in each of several hundred spectra in a typical SELDI clinical study. The wavelet denoising part of the algorithm optimally smooths the signal in each spectrum according to an improved suite of signal processing algorithms previously reported (the LibSELDI toolbox under development). The neural network part of the algorithm combines those results with the raw signal and a training dataset of expertly called peaks, to call peaks in a test set of spectra with approximately 95% accuracy. The new method was applied to data collected from a study of cervical mucus for the early detection of cervical cancer in HPV infected women. The method shows promise in addressing the ongoing SELDI reproducibility issues.

10.
The use of mass spectrometry (MS) is pivotal in analyses of the metabolome and presents a major challenge for subsequent data processing. While the last few years have given new high performance instruments, there has not been a comparable development in data processing. In this paper we discuss an automated data processing pipeline to compare large numbers of fingerprint spectra from direct infusion experiments analyzed by high resolution MS. We describe some of the intriguing problems that have to be addressed, starting with the conversion and pre-processing of the raw data to the final data analysis. Illustrated on the direct infusion analysis (ESI-TOF-MS) of complex mixtures the method exploits the full quality of the high-resolution present in the mass spectra. Although the method is illustrated as a new library search method for high resolution MS, we demonstrate that the output of the preprocessing is applicable to cluster-, discriminant analysis, and related multivariate methods applied directly to mass spectra from direct infusion analysis of crude extracts. This is done to find the relationship between several terverticillate Penicillium species and identify the ions responsible for the segregation.

11.

Background  

Compared to the waveform or spectrum analysis of event-related potentials (ERPs), time-frequency representation (TFR) has the advantage of revealing the time- and frequency-domain information of ERPs simultaneously. As the human brain can be modeled as a complicated nonlinear system, it is interesting from the viewpoint of psychological knowledge to study the performance of nonlinear and linear time-frequency representation methods for ERP research. In this study the Hilbert-Huang transformation (HHT) and the Morlet wavelet transformation (MWT) were performed on the mismatch negativity (MMN) of children. Participants were 102 children aged 8–16 years. MMN was elicited in a passive oddball paradigm with duration deviants. The stimuli consisted of an uninterrupted sound including two alternating 100 ms tones (600 and 800 Hz) with infrequent 50 ms or 30 ms 600 Hz deviant tones. In theory, a larger deviant should elicit a larger MMN. This theoretical expectation is used as a criterion to test the two TFR methods in this study. For statistical analysis, the MMN support to absence ratio (SAR) could be utilized to qualify the TFR of MMN.

12.
Mass spectrometry (MS) technology in clinical proteomics is very promising for the discovery of new biomarkers for disease management. To overcome the obstacle of data noise in MS analysis, we proposed a new approach of knowledge-integrated biomarker discovery using data from Major Adverse Cardiac Events (MACE) patients. We first built a cardiovascular-related network based on protein information from protein annotations in Uniprot, protein-protein interaction (PPI) data, and a signal transduction database. In contrast to previous machine learning methods for MS data processing, we then used statistical methods to discover biomarkers in the cardiovascular-related network. Through the tradeoff between known protein information and data noise in the mass spectrometry data, we could finally identify high-confidence biomarkers. Most importantly, aided by the protein-protein interaction network, that is, the cardiovascular-related network, we proposed a new type of biomarker, the network biomarker, composed of a set of proteins and the interactions among them. The candidate network biomarkers can classify the two groups of patients more accurately than current single biomarkers, which do not consider biological molecular interactions.

13.

Background  

Feature selection is an approach to overcoming the 'curse of dimensionality' in complex research problems such as disease classification using microarrays. Statistical methods are the most widely used in this domain, but most of them do not fit a wide range of datasets. Transform-oriented signal processing methods have not been explored much here, although other fields such as image and video processing use them well. Wavelets, one such technique, have the potential to be used for feature selection. The aim of this paper is to assess the capability of the Haar wavelet power spectrum in the problem of clustering and gene selection based on expression data in the context of disease classification, and to propose a method based on the Haar wavelet power spectrum.

14.
MOTIVATION: Due to recent advances in mass spectrometry technology, there has been an exponential increase in the amount of data being generated in the past few years. Database searches have not been able to keep up with this data explosion. Thus, speeding up the data searches becomes increasingly important in mass-spectrometry-based applications. Traditional database search methods use one-against-all comparisons of a query spectrum against a very large number of peptides generated from in silico digestion of protein sequences in a database, to filter potential candidates from this database, followed by a detailed scoring and ranking of those filtered candidates. RESULTS: In this article, we show that we can avoid the one-against-all comparisons. The basic idea is to design a set of hash functions to pre-process peptides in the database such that for each query spectrum we can use the hash functions to find only a small subset of peptide sequences that are most likely to match the spectrum. The construction of each hash function is based on a random spectrum, and the hash value of a peptide is the normalized shared peak counts score (cosine) between the random spectrum and the hypothetical spectrum of the peptide. To implement this idea, we first embed each peptide into a unit vector in a high-dimensional metric space. The random spectrum is represented by a random vector, and we use random vectors to construct a set of hash functions called locality sensitive hashing (LSH) for preprocessing. We demonstrate that our mapping is accurate. We show that our method can filter out >95.65% of the spectra without missing any correct sequences, or gain 111 times speedup by filtering out 99.64% of spectra while missing at most 0.19% (2 out of 1014) of the correct sequences. In addition, we show that our method can be effectively used for other mass spectra mining applications such as finding clusters of spectra efficiently and accurately.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
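To make the hashing idea concrete, here is a minimal sketch of sign-based random-projection LSH for cosine similarity. It is a standard variant swapped in for illustration, not the construction from the paper, which uses the cosine score itself as the hash value. The helper names and the use of fixed-length binned peak vectors are assumptions.

```python
import random


def make_hyperplanes(dim, n_bits, seed=0):
    """Draw n_bits random Gaussian vectors, one per hash bit."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def signature(vec, hyperplanes):
    """One bit per random vector: the sign of its dot product with vec."""
    return tuple(
        1 if sum(v * h for v, h in zip(vec, plane)) >= 0 else 0
        for plane in hyperplanes
    )


def build_index(vectors, hyperplanes):
    """Preprocess the database once: bucket each peptide vector by signature."""
    index = {}
    for name, vec in vectors.items():
        index.setdefault(signature(vec, hyperplanes), []).append(name)
    return index


def candidates(query, index, hyperplanes):
    """Return only the peptides whose bucket matches the query's signature."""
    return index.get(signature(query, hyperplanes), [])
```

Vectors separated by a small angle agree on most bits, so a query is compared only against its bucket rather than the whole database. In practice several independent tables are usually combined to reduce the chance of missing a true match.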

15.

Background  

Mass spectrometry based peptide mass fingerprints (PMFs) offer a fast, efficient, and robust method for protein identification. A protein is digested (usually by trypsin) and its mass spectrum is compared to simulated spectra for protein sequences in a database. However, existing tools for analyzing PMFs often suffer from missing or heuristic analysis of the significance of search results and insufficient handling of missing and additional peaks.

16.

Background  

Liquid chromatography coupled to mass spectrometry (LC/MS) has been widely used in proteomics and metabolomics research. In this context, the technology has been increasingly used for differential profiling, i.e. broad screening of biomolecular components across multiple samples in order to elucidate the observed phenotypes and discover biomarkers. One of the major challenges in this domain remains development of better solutions for processing of LC/MS data.

17.

Background

Electrocardiogram (ECG) signals provide important information about the heart's electrical activity in medical and diagnostic applications. These signals may be contaminated by different types of noise. One noise type that overlaps considerably with ECG signals in the frequency domain is the electromyogram (EMG). Among the existing approaches for de-noising ECG signals, those based on singular spectrum analysis (SSA) are popular.

Methods

In this paper, we propose a method based on SSA to separate ECG signals from EMG noise. In general, SSA consists of four steps: embedding, singular value decomposition, grouping, and diagonal averaging. Among these, the grouping step has a parameter (the indices) that can be adjusted to achieve the desired results. Indeed, grouping is one of the most important steps of SSA, as the ECG and EMG signals are separated in this step. Hence, in the proposed method, a new criterion is presented for selecting the indices in the grouping step so as to separate the ECG from the EMG signal with higher accuracy.
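As a rough illustration of the SSA steps listed above, the sketch below implements only the first and last steps, embedding and diagonal averaging, in plain Python. The singular value decomposition and grouping steps would sit between them and normally rely on a linear algebra library. The grouping criterion proposed in the paper is not described in this abstract and is not reproduced. The function names are hypothetical.

```python
def embed(series, window):
    """Step 1 (embedding): build the trajectory matrix with `window` rows.

    Row i holds series[i : i + n_cols], so every anti-diagonal is constant.
    """
    n_cols = len(series) - window + 1
    return [[series[i + j] for j in range(n_cols)] for i in range(window)]


def diagonal_average(matrix):
    """Step 4 (diagonal averaging): average each anti-diagonal into a series."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    sums = [0.0] * (n_rows + n_cols - 1)
    counts = [0] * (n_rows + n_cols - 1)
    for i in range(n_rows):
        for j in range(n_cols):
            sums[i + j] += matrix[i][j]
            counts[i + j] += 1
    return [s / c for s, c in zip(sums, counts)]
```

Applied directly to an unmodified trajectory matrix, diagonal averaging returns the original series. In a full SSA pipeline it is instead applied to the matrix rebuilt from the selected group of singular components, which is where the choice of indices determines what is kept as ECG and what is discarded as EMG.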

Results

Performance of the proposed method is investigated using several experiments. Two sub-sets from Physionet MIT-BIH arrhythmia database are used for this purpose.

Conclusion

The experimental results demonstrate effectiveness of the proposed method in comparison with other SSA-based techniques.

18.
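Data-independent acquisition (DIA) is a mass spectrometry acquisition technique that has developed rapidly in proteomics in recent years. It acquires MS/MS spectra by fragmenting, without bias, all precursor ions within an isolation window, which in theory enables deep coverage of protein samples while offering high throughput, high reproducibility, and high sensitivity. Existing DIA acquisition methods fall into three categories: full-window fragmentation methods, sequential isolation-window fragmentation methods, and four-dimensional DIA acquisition methods (4D-DIA). Given the different characteristics of DIA data, the main data analysis methods fall into four categories: spectral library search, direct search against protein sequence databases, pseudo-MS/MS spectrum identification, and de novo sequencing. The resulting peptide identifications require confidence assessment, which comprises two steps, re-ranking with machine learning methods and false discovery rate estimation for the reported result set, to provide quality control of the analysis results. This article organizes and reviews DIA acquisition methods, data analysis methods and software, and methods for assessing the confidence of identification results, and discusses future directions.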

19.

Background  

Hydrogen/deuterium exchange mass spectrometry (H/DX-MS) experiments implemented to characterize protein interaction and protein folding generate large quantities of data. Organizing, processing and visualizing data requires an automated solution, particularly when accommodating new tandem mass spectrometry modes for H/DX measurement. We sought to develop software that offers flexibility in defining workflows so as to support exploratory treatments of H/DX-MS data, with a particular focus on the analysis of very large protein systems and the mining of tandem mass spectrometry data.

20.

Background

Mass spectrometry is an important analytical tool for clinical proteomics. Primarily employed for biomarker discovery, it is increasingly used for developing methods which may help to provide unambiguous diagnosis of biological samples. In this context, we investigated the classification of phenotypes by applying a support vector machine (SVM) to experimental data obtained by the MudPIT approach. In particular, we compared the performance of SVM using two independent collections of complex samples and different data types, such as mass spectra (m/z), peptides and proteins.

Results

Overall, protein and peptide data provided better discriminant information than experimental mass spectra (overall accuracy higher than 87% in both collection 1 and collection 2). These results indicate that sequencing of peptides and proteins reduces the experimental noise affecting the raw mass spectra, and allows the extraction of more informative features for the effective classification of samples. In addition, the protein and peptide features selected by SVM showed an 80% match with the differentially expressed proteins identified by the MAProMa software.

Conclusions

These findings confirm the availability of most label-free quantitative methods based on processing of spectral count and SEQUEST-based SCORE values. They also stress the usefulness of MudPIT data for correctly grouping sample phenotypes by applying both supervised and unsupervised learning algorithms. This capacity permits the evaluation of actual samples and is a good starting point for translating proteomic methodology into clinical application.
