Related Articles
20 related articles found (search time: 15 ms)
1.
Protein structure determination is an important topic in structural genomics that helps us understand a variety of biological functions, such as protein–protein and protein–DNA interactions. Nowadays, nuclear magnetic resonance (NMR) is often used to determine the three-dimensional structures of proteins in vivo. This study aims to automate the peak picking step, the most important and tricky step in NMR structure determination. We propose to model the NMR spectrum by a mixture of bivariate Gaussian densities and to use the stochastic approximation Monte Carlo algorithm as the computational tool to solve the problem. Under the Bayesian framework, the peak picking problem is cast as a variable selection problem. The proposed method can automatically distinguish true peaks from false ones without preprocessing the data. To the best of our knowledge, this is the first effort in the literature to tackle the peak picking problem for NMR spectrum data with a Bayesian method.
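The core idea above, treating each candidate peak in a 2-D spectrum as one component of a mixture of bivariate Gaussian densities, can be illustrated with a minimal sketch. This is not the authors' SAMC sampler: it assumes axis-aligned (zero-correlation) Gaussians and uses made-up peak coordinates purely for illustration.

```python
import math

def bivariate_gaussian(x, y, mx, my, sx, sy):
    """Axis-aligned bivariate Gaussian density (zero correlation assumed)."""
    zx = (x - mx) / sx
    zy = (y - my) / sy
    return math.exp(-0.5 * (zx * zx + zy * zy)) / (2.0 * math.pi * sx * sy)

def mixture_density(x, y, peaks):
    """Evaluate a mixture of bivariate Gaussians at (x, y).
    peaks: list of (weight, mx, my, sx, sy) tuples, one per candidate peak."""
    return sum(w * bivariate_gaussian(x, y, mx, my, sx, sy)
               for (w, mx, my, sx, sy) in peaks)

# Two hypothetical peaks on a 2-D spectrum (coordinates are invented).
peaks = [(0.7, 8.2, 120.5, 0.05, 0.4),
         (0.3, 7.9, 118.0, 0.05, 0.4)]
# The mixture density is high near a peak centre and near zero elsewhere.
at_peak = mixture_density(8.2, 120.5, peaks)
off_peak = mixture_density(9.5, 130.0, peaks)
```

Under the variable-selection view described in the abstract, each candidate component's inclusion in (or exclusion from) the mixture is the quantity being inferred.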

2.
Our goal in this paper is to show an analytical workflow for selecting protein biomarker candidates from SELDI-MS data. The clinical question at issue is to enable prediction of the complete remission (CR) duration for acute myeloid leukemia (AML) patients, which would facilitate disease prognosis and make individualized therapy possible. SELDI mass spectrometry proteomics analyses were performed on blast cell samples collected from AML patients pre-chemotherapy. Although the available biobank included approximately 200 samples, only 58 were available for analysis. The presented workflow includes sample selection, experimental optimization, repeatability estimation, data preprocessing, data fusion, and feature selection. Specific difficulties have been the small number of samples and the skewed distribution of the CR duration among the patients. Further, we had to deal with both noisy SELDI-MS data and a diverse patient cohort. This was handled by sample selection and several methods for data preprocessing and feature detection in the analysis workflow. Four conceptually different methods for peak detection and alignment were considered, as well as two diverse methods for feature selection. The peak detection and alignment methods included the recently developed annotated regions of significance (ARS) method; the SELDI-MS software Ciphergen Express, which was regarded as the standard method; segment-wise spectral alignment by a genetic algorithm (PAGA) followed by binning; and, finally, binning of raw data. In the feature selection, the standard Mann–Whitney test was compared with a hierarchical orthogonal partial least-squares (O-PLS) analysis approach. The combined information from all these analyses gave a collection of 21 protein peaks. These were regarded as the most promising and robust biomarker candidates, since they were picked out as significant features in several of the models.
The chosen peaks will now be our first choice for the continuing work on protein identification and biological validation. The identification will be performed by chromatographic purification and MALDI MS/MS. Thus, we have shown that the use of several data handling methods can improve a protein profiling workflow from experimental optimization to a predictive model. The framework of this methodology should be seen as general and could be used with other one-dimensional spectral omics data than SELDI MS, provided an adequate number of samples is available.

3.
4.

Background  

Mass spectrometry is an essential technique in proteomics both to identify the proteins of a biological sample and to compare proteomic profiles of different samples. In both cases, the main phase of the data analysis is the procedure to extract the significant features from a mass spectrum. Its final output is the so-called peak list which contains the mass, the charge and the intensity of every detected biomolecule. The main steps of the peak list extraction procedure are usually preprocessing, peak detection, peak selection, charge determination and monoisotoping operation.

5.

Background  

The problem of locating valid peaks from data corrupted by noise frequently arises while analyzing experimental data. In various biological and chemical data analysis tasks, peak detection thus constitutes a critical preprocessing step that greatly affects downstream analysis and eventual quality of experiments. Many existing techniques require the users to adjust parameters by trial and error, which is error-prone, time-consuming and often leads to incorrect analysis results. Worse, conventional approaches tend to report an excessive number of false alarms by finding fictitious peaks generated by mere noise.
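A common way to avoid the fictitious peaks this abstract warns about is to estimate the noise level robustly and accept only local maxima that clear a signal-to-noise threshold. The sketch below is illustrative (not from the paper) and uses the median absolute deviation, which is insensitive to the peaks themselves.

```python
def mad_noise_level(signal):
    """Estimate the noise scale via median absolute deviation (robust to peaks)."""
    s = sorted(signal)
    med = s[len(s) // 2]
    dev = sorted(abs(v - med) for v in signal)
    return dev[len(dev) // 2] / 0.6745  # MAD -> sigma under Gaussian noise

def detect_peaks(signal, snr=3.0):
    """Report indices of local maxima exceeding snr * estimated noise level."""
    thresh = snr * mad_noise_level(signal)
    return [i for i in range(1, len(signal) - 1)
            if signal[i] > signal[i - 1]
            and signal[i] > signal[i + 1]
            and signal[i] > thresh]

# A noisy baseline with one genuine peak at index 5; small local maxima
# in the noise are rejected by the threshold.
data = [0.1, -0.2, 0.15, -0.1, 0.05, 9.0, 0.1, -0.05, 0.2, -0.15, 0.1]
found = detect_peaks(data)  # -> [5]
```

The single `snr` parameter replaces the trial-and-error parameter tuning the abstract criticizes; 3.0 is a conventional default, not a value from the paper.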

6.
Preprocessing analysis of microarray gene expression profile data
Preprocessing of microarray data is a critical step: data filtering extracts the data of interest, data transformation makes the data meet the normality requirements of the analysis, missing-value estimation completes the data, and normalization corrects systematic errors, all in preparation for downstream analysis. Preprocessing is no less important than the subsequent analysis of the microarray experiment, and it directly determines whether that analysis can deliver the expected results. This paper focuses on reviewing data preprocessing for cDNA microarrays and briefly outlines data preprocessing for oligonucleotide microarrays.
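The preprocessing operations listed in this abstract (transformation, missing-value estimation, normalization) can be sketched for a tiny expression matrix. The concrete choices below (log2 transform, row-mean imputation, per-array median centring) are assumptions for illustration, not the review's recommendations.

```python
import math

def preprocess(matrix):
    """Toy microarray preprocessing for a genes-by-arrays matrix:
    1) log2-transform intensities, 2) impute missing values (None) with
    the gene's row mean, 3) median-centre each array (column)."""
    # 1) log transformation toward approximate normality
    logged = [[math.log2(v) if v is not None else None for v in row]
              for row in matrix]
    # 2) row-mean imputation of missing values
    for row in logged:
        known = [v for v in row if v is not None]
        mean = sum(known) / len(known)
        for j, v in enumerate(row):
            if v is None:
                row[j] = mean
    # 3) median-centre each column to correct array-level systematic error
    for j in range(len(logged[0])):
        col = sorted(row[j] for row in logged)
        med = col[len(col) // 2]
        for row in logged:
            row[j] -= med
    return logged

# Three genes on three arrays; one missing intensity.
data = [[4.0, 8.0, None],
        [16.0, 32.0, 64.0],
        [2.0, 4.0, 8.0]]
result = preprocess(data)
```

Real pipelines estimate missing values more carefully (e.g. from similar genes rather than the row mean), but the order of operations mirrors the abstract's description.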

7.
BOLD fMRI is sensitive to blood-oxygenation changes correlated with brain function; however, it is limited by relatively weak signal and significant noise confounds. Many preprocessing algorithms have been developed to control noise and improve signal detection in fMRI. Although the chosen set of preprocessing and analysis steps (the “pipeline”) significantly affects signal detection, pipelines are rarely quantitatively validated in the neuroimaging literature, due to complex preprocessing interactions. This paper outlines and validates an adaptive resampling framework for evaluating and optimizing preprocessing choices, which optimizes data-driven metrics of task prediction and spatial reproducibility. Compared to standard “fixed” preprocessing pipelines, this optimization approach significantly improves independent validation measures of within-subject test–retest reliability, between-subject activation overlap, and behavioural prediction accuracy. We demonstrate that preprocessing choices function as implicit model regularizers, and that improvements due to pipeline optimization generalize across a range of simple to complex experimental tasks and analysis models. Results are shown for brief scanning sessions (<3 minutes each), demonstrating that with pipeline optimization it is possible to obtain reliable results and brain–behaviour correlations in relatively small datasets.

8.
In recent years, mass spectrometry has become one of the core technologies for high-throughput proteomic profiling in biomedical research. However, the reproducibility of results obtained with this technology has been in question. It has been realized that sophisticated automatic signal processing algorithms using advanced statistical procedures are needed to analyze high-resolution, high-dimensional proteomic data, e.g., Matrix-Assisted Laser Desorption/Ionization Time-of-Flight (MALDI-TOF) data. In this paper we present an R-based software package, pkDACLASS, which provides a complete data analysis solution for users of MALDI-TOF raw data. The complete analysis comprises data preprocessing, monoisotopic peak detection through statistical model fitting and testing, alignment of the monoisotopic peaks across multiple samples, and classification of normal and diseased samples through the detected peaks. The software gives users the flexibility to accomplish the complete, integrated analysis in one step or to conduct the analysis as a flexible platform and inspect the results at each and every step. AVAILABILITY: The package is freely available at http://cran.r-project.org/web/packages/pkDACLASS/index.html.

9.
Summary: As random shotgun metagenomic projects proliferate and become the dominant source of publicly available sequence data, procedures for the best practices in their execution and analysis become increasingly important. Based on our experience at the Joint Genome Institute, we describe the chain of decisions accompanying a metagenomic project from the viewpoint of the bioinformatic analysis step by step. We guide the reader through a standard workflow for a metagenomic project beginning with presequencing considerations such as community composition and sequence data type that will greatly influence downstream analyses. We proceed with recommendations for sampling and data generation including sample and metadata collection, community profiling, construction of shotgun libraries, and sequencing strategies. We then discuss the application of generic sequence processing steps (read preprocessing, assembly, and gene prediction and annotation) to metagenomic data sets in contrast to genome projects. Different types of data analyses particular to metagenomes are then presented, including binning, dominant population analysis, and gene-centric analysis. Finally, data management issues are presented and discussed. We hope that this review will assist bioinformaticians and biologists in making better-informed decisions on their journey during a metagenomic project.

10.
Independent of the approach used, the ability to correctly interpret tandem MS data depends on the quality of the original spectra. Even in the highest-quality spectra, the majority of spectral peaks cannot be reliably interpreted. The accuracy of sequencing algorithms can be improved by filtering out such 'noise' peaks. Preprocessing MS/MS spectra to select informative ion peaks increases accuracy and reduces processing time. Intuitively, the mix of informative versus non-informative peaks has a direct effect on the quality and size of the resulting candidate peptide search space. As the number of selected peaks increases, the corresponding search space increases exponentially. If we select too few peaks, the ion-ladder interpretation of the spectrum will contain gaps that can only be explained by permutations of combinations of amino acids, resulting in a larger candidate peptide search space and poorer-quality candidates. The dependency of peptide sequencing accuracy on the initial peak selection regime makes this preprocessing step a crucial facet of any approach, de novo or not, to MS/MS spectra interpretation. We have developed a novel approach to address this problem. Our approach uses a staged neural network to model ion fragmentation patterns and estimate the posterior probability of each ion type. Our method improves upon other preprocessing techniques and shows a significant reduction in the search space for candidate peptides without sacrificing candidate peptide quality.
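A simple example of the kind of peak-selection regime this abstract calls crucial is to keep only the few most intense peaks within each m/z window. The sketch below is a common baseline heuristic, not the staged neural-network scorer the paper proposes; window width and count are arbitrary illustration values.

```python
def filter_peaks(peaks, window=100.0, keep=2):
    """Keep the `keep` most intense peaks in each m/z window.
    peaks: list of (mz, intensity) tuples; returns the retained peaks
    in m/z order within each window."""
    if not peaks:
        return []
    start = min(mz for mz, _ in peaks)
    bucket = {}
    for mz, inten in peaks:
        bucket.setdefault(int((mz - start) // window), []).append((mz, inten))
    kept = []
    for idx in sorted(bucket):
        best = sorted(bucket[idx], key=lambda p: -p[1])[:keep]
        kept.extend(sorted(best))
    return kept

# Six candidate peaks; the weak ones in each window are discarded.
spectrum = [(101.1, 5.0), (110.2, 80.0), (150.7, 60.0), (160.0, 1.0),
            (220.5, 40.0), (230.1, 2.0)]
selected = filter_peaks(spectrum, window=100.0, keep=2)
```

Because the candidate peptide search space grows exponentially in the number of selected peaks, even this crude filter shrinks the search dramatically; the paper's contribution is choosing peaks by estimated ion-type probability instead of raw intensity.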

11.
MOTIVATION: Application of mass spectrometry in proteomics is a breakthrough in high-throughput analyses. Early applications have focused on protein expression profiles to differentiate among various types of tissue samples (e.g. normal versus tumor). Here our goal is to use mass spectra to differentiate bacterial species using whole-organism samples. The raw spectra are similar to spectra of tissue samples, raising some of the same statistical issues (e.g. non-uniform baselines and higher noise associated with higher baseline), but are substantially noisier. As a result, new preprocessing procedures are required before these spectra can be used for statistical classification. RESULTS: In this study, we introduce novel preprocessing steps that can be used with any mass spectra. These comprise a standardization step and a denoising step. The noise level for each spectrum is determined using only data from that spectrum. Only spectral features that exceed a threshold defined by the noise level are subsequently used for classification. Using this approach, we trained the Random Forest program to classify 240 mass spectra into four bacterial types. The method resulted in zero prediction errors in the training samples and in two test datasets having 240 and 300 spectra, respectively.
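The two preprocessing steps described (standardization, then a per-spectrum noise threshold) might be sketched as follows. The concrete estimators here, unit-total-intensity scaling and a noise level taken from the lower half of sorted intensities, are assumptions for illustration, not the authors' procedure.

```python
def standardize(spectrum):
    """Scale a spectrum to unit total intensity so spectra from different
    runs are comparable (one simple standardization choice)."""
    total = sum(spectrum)
    return [v / total for v in spectrum]

def denoise(spectrum, k=3.0):
    """Zero out features below k times the spectrum's own noise level.
    The noise level is estimated from this spectrum only, here as the
    mean of the lower half of its sorted intensities."""
    low = sorted(spectrum)[: len(spectrum) // 2]
    noise = sum(low) / len(low)
    return [v if v > k * noise else 0.0 for v in spectrum]

# Only the two genuine features survive the per-spectrum threshold.
raw = [1.0, 1.0, 2.0, 50.0, 1.0, 2.0, 80.0, 1.0]
clean = denoise(raw)
```

Estimating the threshold from each spectrum individually, as the abstract emphasizes, lets the method adapt to the higher noise that accompanies higher baselines.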

12.
Continuing advances in sequencing technology and steadily falling costs have made deeper phylogenomic studies possible. In phylogenomic analysis, the crucial step is orthology prediction, because a prerequisite for phylogenetic reconstruction is that the genes being aligned are orthologous. Here we briefly review the definition of orthology and the different methods for orthology prediction, and we also give some suggestions for choosing a more suitable orthology prediction method.

13.
MOTIVATION: Feature subset selection is an important preprocessing step for classification. In biology, where structures or processes are described by a large number of features, the elimination of irrelevant and redundant information in a reasonable amount of time has a number of advantages. It enables the classification system to achieve good or even better solutions with a restricted subset of features, allows for a faster classification, and it helps the human expert focus on a relevant subset of features, hence providing useful biological knowledge. RESULTS: We present a heuristic method based on Estimation of Distribution Algorithms to select relevant subsets of features for splice site prediction in Arabidopsis thaliana. We show that this method performs a fast detection of relevant feature subsets using the technique of constrained feature subsets. Compared to the traditional greedy methods the gain in speed can be up to one order of magnitude, with results being comparable or even better than the greedy methods. This makes it a very practical solution for classification tasks that can be solved using a relatively small amount of discriminative features (or feature dependencies), but where the initial set of potential discriminative features is rather large.

14.
Chromatogram overlays are frequently used to monitor inter‐batch performance of bioprocess purification steps. However, the objective analysis of chromatograms is difficult due to peak shifts caused by variable phase durations or unexpected process holds. Furthermore, synchronization of batch process data may also be required prior to performing multivariate analysis techniques. Dynamic time warping was originally developed as a method for spoken word recognition, but shows potential in the objective analysis of time variant signals, such as manufacturing data. In this work we will discuss the application of dynamic time warping with a derivative weighting function to align chromatograms to facilitate process monitoring and fault detection. In addition, we will demonstrate the utility of this method as a preprocessing step for multivariate model development. © 2013 American Institute of Chemical Engineers Biotechnol. Prog., 29: 394–402, 2013
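Classic dynamic time warping, the technique being applied to chromatograms here, fits in a few lines. This sketch computes the plain DTW distance and omits the derivative weighting function the paper adds.

```python
def dtw_distance(a, b):
    """Dynamic-time-warping distance between two 1-D signals, computed by
    the standard O(len(a)*len(b)) dynamic program with an absolute-difference
    local cost."""
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch a
                                 d[i][j - 1],      # stretch b
                                 d[i - 1][j - 1])  # match
    return d[n][m]

# A time-shifted copy of a peak aligns perfectly under DTW, whereas a
# naive point-by-point comparison of the same pair gives a large distance.
ref = [0, 0, 1, 5, 1, 0, 0]
shifted = [0, 1, 5, 1, 0, 0, 0]
d = dtw_distance(ref, shifted)  # -> 0.0
```

This shift-invariance is exactly what makes DTW attractive for chromatograms whose peaks drift with variable phase durations; the warping path (not returned here) is what one would use to synchronize batch data before multivariate analysis.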

15.
Gene expression profiling has been widely used to study molecular signatures of many diseases and to develop molecular diagnostics for disease prediction. Gene selection, as an important step for improved diagnostics, screens tens of thousands of genes and identifies a small subset that discriminates between disease types. A two-step gene selection method is proposed to identify informative gene subsets for accurate classification of multiclass phenotypes. In the first step, individually discriminatory genes (IDGs) are identified by using a one-dimensional weighted Fisher criterion (wFC). In the second step, jointly discriminatory genes (JDGs) are selected by sequential search methods, based on their joint class separability measured by a multidimensional weighted Fisher criterion (wFC). The performance of the selected gene subsets for multiclass prediction is evaluated by artificial neural networks (ANNs) and/or support vector machines (SVMs). By applying the proposed IDG/JDG approach to two microarray studies, that is, small round blue cell tumors (SRBCTs) and muscular dystrophies (MDs), we successfully identified a much smaller yet efficient set of JDGs for diagnosing SRBCTs and MDs with high prediction accuracies (96.9% for SRBCTs and 92.3% for MDs, respectively). These experimental results demonstrated that the two-step gene selection method is able to identify a subset of highly discriminative genes for improved multiclass prediction.
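The first step, scoring individual genes by a one-dimensional Fisher criterion, can be sketched for the two-class case. The class weighting of the paper's wFC is omitted here, so this is the plain (unweighted) criterion on hypothetical data.

```python
def fisher_score(values_a, values_b):
    """One-dimensional Fisher criterion for one gene: squared difference of
    class means over the sum of within-class variances."""
    ma = sum(values_a) / len(values_a)
    mb = sum(values_b) / len(values_b)
    va = sum((v - ma) ** 2 for v in values_a) / len(values_a)
    vb = sum((v - mb) ** 2 for v in values_b) / len(values_b)
    return (ma - mb) ** 2 / (va + vb)

def rank_genes(class_a, class_b):
    """Rank gene indices by decreasing Fisher score.
    class_a, class_b: lists of per-sample expression vectors."""
    ngenes = len(class_a[0])
    scores = []
    for g in range(ngenes):
        a = [sample[g] for sample in class_a]
        b = [sample[g] for sample in class_b]
        scores.append((fisher_score(a, b), g))
    return [g for _, g in sorted(scores, reverse=True)]

# Gene 1 separates the two classes cleanly; gene 0 is essentially noise.
class_a = [[1.0, 10.0], [2.0, 11.0], [1.5, 10.5]]
class_b = [[1.2, 2.0], [1.8, 1.0], [1.4, 1.5]]
order = rank_genes(class_a, class_b)
```

The top-ranked genes from such a screen would correspond to the IDGs; the second step then searches for combinations (JDGs) whose joint separability exceeds that of the individual genes.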

16.
We addressed the problem of discriminating between 24 diseased and 17 healthy specimens on the basis of protein mass spectra. To prepare the data, we performed mass to charge ratio (m/z) normalization, baseline elimination, and conversion of absolute peak height measures to height ratios. After preprocessing, the major difficulty encountered was the extremely large number of variables (1676 m/z values) versus the number of examples (41). Dimensionality reduction was treated as an integral part of the classification process; variable selection was coupled with model construction in a single ten-fold cross-validation loop. We explored different experimental setups involving two peak height representations, two variable selection methods, and six induction algorithms, all on both the original 1676-mass data set and on a prescreened 124-mass data set. Highest predictive accuracies (1-2 off-sample misclassifications) were achieved by a multilayer perceptron and Naïve Bayes, with the latter displaying more consistent performance (hence greater reliability) over varying experimental conditions. We attempted to identify the most discriminant peaks (proteins) on the basis of scores assigned by the two variable selection methods and by neural network based sensitivity analysis. These three scoring schemes consistently ranked four peaks as the most relevant discriminators: 11683, 1403, 17350 and 66107.

17.
Stabilizing selection is a fundamental concept in evolutionary biology. In the presence of a single intermediate optimum phenotype (fitness peak) on the fitness surface, stabilizing selection should cause the population to evolve toward such a peak. This prediction has seldom been tested, particularly for suites of correlated traits. The lack of tests for an evolutionary match between population means and adaptive peaks may be due, at least in part, to problems associated with empirically detecting multivariate stabilizing selection and with testing whether population means are at the peak of multivariate fitness surfaces. Here we show how canonical analysis of the fitness surface, combined with the estimation of confidence regions for stationary points on quadratic response surfaces, may be used to define multivariate stabilizing selection on a suite of traits and to establish whether natural populations reside on the multivariate peak. We manufactured artificial advertisement calls of the male cricket Teleogryllus commodus and played them back to females in laboratory phonotaxis trials to estimate the linear and nonlinear sexual selection that female phonotactic choice imposes on male call structure. Significant nonlinear selection on the major axes of the fitness surface was convex in nature and displayed an intermediate optimum, indicating multivariate stabilizing selection. The mean phenotypes of four independent samples of males, from the same population as the females used in phonotaxis trials, were within the 95% confidence region for the fitness peak. These experiments indicate that stabilizing sexual selection may play an important role in the evolution of male call properties in natural populations of T. commodus.

18.
MOTIVATION: Logistic regression is a standard method for building prediction models for a binary outcome and has been extended for disease classification with microarray data by many authors. A feature (gene) selection step, however, must be added to penalized logistic modeling due to a large number of genes and a small number of subjects. Model selection for this two-step approach requires new statistical tools because prediction error estimation ignoring the feature selection step can be severely downward biased. Generic methods such as cross-validation and non-parametric bootstrap can be very ineffective due to the large variability in the prediction error estimate. RESULTS: We propose a parametric bootstrap model for more accurate estimation of the prediction error that is tailored to the microarray data by borrowing from the extensive research in identifying differentially expressed genes, especially the local false discovery rate. The proposed method provides guidance on the two critical issues in model selection: the number of genes to include in the model and the optimal shrinkage for the penalized logistic regression. We show that selecting more than 20 genes usually helps little in further reducing the prediction error. Application to Golub's leukemia data and our own cervical cancer data leads to highly accurate prediction models. AVAILABILITY: R library GeneLogit at http://geocities.com/jg_liao

19.
MOTIVATION: High-throughput NMR structure determination is a goal that will require progress on many fronts, one of which is rapid resonance assignment. An important rate-limiting step in the resonance assignment process is accurate identification of resonance peaks in the NMR spectra. Peak-picking schemes range from incomplete (which lose essential assignment connectivities) to noisy (which obscure true connectivities with many false ones). We introduce an automated preassignment process that removes false peaks from noisy peak lists by requiring consensus between multiple NMR experiments and exploiting a priori information about NMR spectra. This process is designed to accept multiple input formats and generate multiple output formats, in an effort to be compatible with a variety of user preferences. RESULTS: Automated preprocessing with APART rapidly identifies and removes false peaks from initial peak lists, reduces the burden of manual data entry, and documents and standardizes the peak filtering process. Successful preprocessing is demonstrated by the increased number of correct assignments obtained when data are submitted to an automated assignment program. AVAILABILITY: APART is available from http://sir.lanl.gov/NMR/APART.htm CONTACT: npawley@lanl.gov; rmichalczyk@lanl.gov SUPPLEMENTARY INFORMATION: Manual pages with installation instructions, procedures and screen shots can also be found at http://sir.lanl.gov/NMR/APART_Manual1.pdf.

20.
Rapid and quantitative measurements of cellulose concentrations in ionic liquids (ILs) are difficult. In this study, FTIR operated in attenuated total reflectance (ATR) mode was investigated as a tool to measure cellulose concentration in 1-ethyl-3-methylimidazolium acetate ([emim][OAc]) and the spectra were subjected to partial least squares (PLS) regression for the quantitative determination of cellulose content. Additionally, the spectra were subjected to 7 data preprocessing methods to reduce physical effects in the spectra. Peak normalization was found to be the technique that most improved the prediction of dissolved cellulose in [emim][OAc]. When peak normalization was used for data preprocessing, a model for the quantitative estimation of cellulose content between 0 wt.% and 4 wt.% with an error of 0.53 wt.% was generated. The methods described here provide the basis for a rapid and facile technique for the determination of dissolved cellulose content in [emim][OAc].
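Peak normalization, the preprocessing this study found most helpful, simply rescales each spectrum by the intensity of a chosen reference peak so that overall intensity variation (e.g. from ATR contact effects) cancels out. Which index serves as the reference peak is an assumption in this sketch.

```python
def peak_normalize(spectrum, ref_index):
    """Normalize a spectrum by the intensity at a reference peak position,
    removing multiplicative intensity variation between measurements."""
    ref = spectrum[ref_index]
    return [v / ref for v in spectrum]

# Two replicate spectra differing only by an overall scale factor become
# identical after normalizing on the reference peak (index 1 here).
s1 = [0.2, 1.0, 0.4]
s2 = [0.4, 2.0, 0.8]
n1 = peak_normalize(s1, 1)
n2 = peak_normalize(s2, 1)
```

After such normalization, the remaining spectral variation is (ideally) chemical rather than physical, which is what makes a subsequent PLS calibration against cellulose content feasible.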
