Similar Literature
20 similar records retrieved.
1.
Extracting biomedical information from large metabolomic datasets by multivariate data analysis is of considerable complexity. Common challenges include, among others, screening for differentially produced metabolites, estimation of fold changes, and sample classification. Prior to these analysis steps, it is important to minimize contributions from unwanted biases and experimental variance. This is the goal of data preprocessing. In this work, different data normalization methods were compared systematically employing two different datasets generated by means of nuclear magnetic resonance (NMR) spectroscopy. To this end, two different types of normalization methods were used, one aiming to remove unwanted sample-to-sample variation, while the other adjusts the variance of the different metabolites by variable scaling and variance stabilization methods. The impact of all methods tested on sample classification was evaluated on urinary NMR fingerprints obtained from healthy volunteers and patients suffering from autosomal dominant polycystic kidney disease (ADPKD). Performance in terms of screening for differentially produced metabolites was investigated on a dataset following a Latin-square design, where varied amounts of 8 different metabolites were spiked into a human urine matrix while keeping the total spike-in amount constant. In addition, specific tests were conducted to systematically investigate the influence of the different preprocessing methods on the structure of the analyzed data. In conclusion, preprocessing methods originally developed for DNA microarray analysis, in particular Quantile and Cubic-Spline Normalization, performed best in reducing bias, accurately detecting fold changes, and classifying samples.
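As an aside, quantile normalization (one of the microarray-derived methods highlighted above) maps each sample's intensity distribution onto a common reference distribution. A minimal NumPy sketch, with a small hypothetical intensity matrix `X` and without tie averaging, might look like this:

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a samples-by-features intensity matrix.

    Each sample's intensity distribution is forced onto the mean
    distribution computed across all samples (rank-based mapping).
    """
    X = np.asarray(X, dtype=float)
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)   # rank of each feature within its sample
    reference = np.sort(X, axis=1).mean(axis=0)          # mean distribution across samples
    return reference[ranks]

# Hypothetical example: 3 urine NMR fingerprints (rows) x 5 spectral variables (columns)
X = np.array([[5.0, 2.0, 3.0, 4.0, 1.0],
              [4.0, 1.0, 4.0, 2.0, 3.0],
              [3.0, 4.0, 6.0, 8.0, 5.0]])
print(quantile_normalize(X))
```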

2.

Introduction

Failure to properly account for normal systematic variations in OMICS datasets may result in misleading biological conclusions. Accordingly, normalization is a necessary step in the proper preprocessing of OMICS datasets. In this regard, an optimal normalization method will effectively reduce unwanted biases and increase the accuracy of downstream quantitative analyses. However, it is currently unclear which normalization method is best, since each algorithm addresses systematic noise in different ways.

Objective

Determine an optimal choice of a normalization method for the preprocessing of metabolomics datasets.

Methods

Nine MVAPACK normalization algorithms were compared with simulated and experimental NMR spectra modified with added Gaussian noise and random dilution factors. Methods were evaluated based on their ability to recover the intensities of the true spectral peaks and on the reproducibility of the true classifying features from an orthogonal projections to latent structures discriminant analysis (OPLS-DA) model.

Results

Most normalization methods (except histogram matching) performed equally well at modest levels of signal variance. Only probabilistic quotient (PQ) and constant sum (CS) maintained the highest level of peak recovery (>67%) and correlation with true loadings (>0.6) at maximal noise.

Conclusion

PQ and CS performed the best at recovering peak intensities and reproducing the true classifying features for an OPLS-DA model regardless of spectral noise level. Our findings suggest that performance is largely determined by the level of noise in the dataset, while the effect of dilution factors was negligible. A maximum allowable noise level of 20% was also identified for a valid NMR metabolomics dataset.
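For orientation, constant sum (CS) and probabilistic quotient (PQ) normalization can be sketched in a few lines of NumPy. This follows the commonly published PQN recipe (median reference spectrum, median quotient per sample) rather than MVAPACK's exact implementation, and the function names are illustrative:

```python
import numpy as np

def constant_sum_normalize(X):
    """Scale each spectrum (row) so its intensities sum to 1."""
    X = np.asarray(X, dtype=float)
    return X / X.sum(axis=1, keepdims=True)

def probabilistic_quotient_normalize(X):
    """PQN: divide each spectrum by the median quotient between its
    intensities and a median reference spectrum."""
    X = constant_sum_normalize(X)              # integral normalization as a first pass
    reference = np.median(X, axis=0)           # median spectrum across samples
    quotients = X / reference
    dilution = np.median(quotients, axis=1)    # most probable dilution factor per sample
    return X / dilution[:, None]
```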

3.
Background

Metabolomics provides measurement of numerous metabolites in human samples, which can be a useful tool in clinical research. Blood and urine are regarded as preferred subjects of study because of their minimally invasive collection and simple preprocessing methods. Adhering to standard operating procedures is an essential factor in ensuring excellent sample quality and reliable results.

Aim of review

In this review, we summarize the studies about the impacts of various preprocessing factors on metabolomics studies involving clinical blood and urine samples in order to provide guidance for sample collection and preprocessing.

Key scientific concepts of review

Clinical information is important for sample grouping and data analysis and deserves attention before sample collection. Plasma, serum, and urine samples are all appropriate for metabolomics analysis. Collection tubes, hemolysis, delays at room temperature, and freeze–thaw cycles may affect the metabolic profiles of blood samples. Collection time, the interval between sampling and examination, contamination, normalization strategies, and storage conditions may alter the analysis results of urine samples. Taking these collection and preprocessing factors into account, this review provides suggestions for standardized sample preprocessing.


4.

Background  

Classifying nuclear magnetic resonance (NMR) spectra is a crucial step in many metabolomics experiments. Since several multivariate classification techniques depend upon the variance of the data, it is important to first minimise any contribution from unwanted technical variance arising from sample preparation and analytical measurements, and thereby maximise any contribution from wanted biological variance between different classes. The generalised logarithm (glog) transform was developed to stabilise the variance in DNA microarray datasets, but has rarely been applied to metabolomics data. In particular, it has not been rigorously evaluated against other scaling techniques used in metabolomics, nor tested on all forms of NMR spectra including 1-dimensional (1D) 1H, projections of 2D 1H, 1H J-resolved (pJRES), and intact 2D J-resolved (JRES).
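For readers unfamiliar with the glog transform, one common parameterisation is shown below; it behaves like an ordinary logarithm for large intensities but remains finite and variance-stabilising near zero. The transform parameter `lam` is normally optimised against technical replicates rather than fixed as in this hypothetical sketch:

```python
import numpy as np

def glog(x, lam=1e-2):
    """Generalised logarithm: approximately log(x) when x >> sqrt(lam),
    but smooth and finite as x approaches zero."""
    x = np.asarray(x, dtype=float)
    return np.log(x + np.sqrt(x**2 + lam))
```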

5.
Due to the great variety of preprocessing tools for two-channel expression microarray data analysis, it is difficult to choose the most appropriate one for a given experimental setup. In our study, two independent two-channel in-house microarray experiments as well as a publicly available dataset were used to investigate the influence of the selection of preprocessing methods (background correction, normalization, and duplicate-spot correlation calculation) on the discovery of differentially expressed genes. Here we show that both the list of differentially expressed genes and the expression values of selected genes depend significantly on the preprocessing approach applied. The choice of normalization method had the highest impact on the results. We propose a simple but efficient approach to increase the reliability of the obtained results, in which two normalization methods that are theoretically distinct from one another are applied to the same dataset. The intersection of the results, that is, of the lists of differentially expressed genes, is then used to obtain a more accurate estimate of the genes that were de facto differentially expressed.
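The intersection strategy proposed here is straightforward to express; in this hypothetical sketch the gene lists and normalization names are placeholders:

```python
# Hypothetical gene lists produced by two theoretically distinct
# normalization pipelines applied to the same two-channel dataset.
de_loess = {"TP53", "BRCA1", "MYC", "EGFR"}
de_quantile = {"TP53", "MYC", "EGFR", "KRAS"}

# Genes reported by both pipelines are kept as the more reliable calls.
robust_de = de_loess & de_quantile
print(sorted(robust_de))   # ['EGFR', 'MYC', 'TP53']
```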

6.

Background

The potential for astrocyte participation in central nervous system recovery is highlighted by in vitro experiments demonstrating their capacity to transdifferentiate into neurons. Understanding astrocyte plasticity could be advanced by comparing astrocytes with stem cells. RNA sequencing (RNA-seq) is ideal for comparing differences across cell types. However, this novel multi-stage process has the potential to introduce unwanted technical variation at several points in the experimental workflow. Quantitative understanding of the contribution of experimental parameters to technical variation would facilitate the design of robust RNA-Seq experiments.

Results

RNA-Seq was used to achieve biological and technical objectives. The biological aspect compared gene expression between normal human fetal-derived astrocytes and human neural stem cells cultured in identical conditions. When differential expression threshold criteria of |log2 fold change| > 2 were applied to the data, no significant differences were observed. The technical component quantified variation arising from particular steps in the research pathway, and compared the ability of different normalization methods to reduce unwanted variance. To facilitate this objective, a liberal false discovery rate of 10% and a |log2 fold change| > 0.5 were implemented for the differential expression threshold. Data were normalized with RPKM, TMM, and UQS methods using JMP Genomics. The contributions of key replicable experimental parameters (cell lot; library preparation; flow cell) to variance in the data were evaluated using principal variance component analysis. Our analysis showed that, although the variance for every parameter is strongly influenced by the normalization method, the largest contributor to technical variance was library preparation. The ability to detect differentially expressed genes was also affected by normalization; differences were only detected in non-normalized and TMM-normalized data.
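Of the three normalization methods named above, RPKM is the simplest to state as a formula; the sketch below shows only that within-sample calculation (TMM and upper-quartile scaling additionally require between-sample factors), with hypothetical example numbers:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical example: 500 reads on a 2,000 bp gene in a library of 20 million mapped reads
print(rpkm(500, 2_000, 20_000_000))   # 12.5
```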

Conclusions

The similarity in gene expression between astrocytes and neural stem cells supports the potential for astrocytic transdifferentiation into neurons, and emphasizes the need to evaluate the therapeutic potential of astrocytes for central nervous system damage. The choice of normalization method influences the contributions to experimental variance as well as the outcomes of differential expression analysis. However, irrespective of normalization method, our findings illustrate that library preparation contributed the largest component of technical variance.

7.
The measurement of coordinated patterns of protein abundance using antibody microarrays could be used to gain insight into disease biology and to probe the use of combinations of proteins for disease classification. The correct use and interpretation of antibody microarray data require proper normalization of the data, which has not yet been systematically studied. Therefore, we undertook a study to determine the optimal normalization of data from antibody microarray profiling of proteins in human serum specimens. Forty-three serum samples collected from patients with pancreatic cancer and from control subjects were probed in triplicate on microarrays containing 48 different antibodies, using a direct-labeling, two-color comparative fluorescence detection format. Seven different normalization methods representing major classes of normalization for antibody microarray data were compared by their effects on reproducibility, accuracy, and trends in the data set. Normalization with ELISA-determined concentrations of IgM resulted in the most accurate, reproducible, and reliable data. The other normalization methods were deficient in at least one of the criteria. Multiparametric classification of the samples based on the combined measurement of seven of the proteins demonstrated the potential for increased classification accuracy compared with the use of individual measurements. This study establishes reliable normalization for antibody microarray data, criteria for assessing normalization performance, and the capability of antibody microarrays for serum-protein profiling and multiparametric sample classification.

8.

Saliva is an easy-to-obtain bodily fluid that is specific to the oral environment. It can be used for metabolomic studies as it is representative of the overall wellbeing of an organism, as well as of mouth health and bacterial flora. The metabolomic composition of saliva varies greatly depending on the bacteria present in the mouth, as they produce a range of metabolites. In this study we investigated the metabolomic profiles of human saliva obtained using 1H NMR (nuclear magnetic resonance) analysis. Forty-eight samples of saliva were collected from 16 healthy subjects over 3 days. Each sample was split in two; the first half was treated with an oral rinse, while the second was left untreated as a control sample. The 96 1H NMR metabolomic profiles in the dataset are affected by three factors, namely 16 subjects, 3 sampling days and 2 treatments, which together contribute to the total variation in the dataset. When analysing saliva datasets with traditional methods such as PCA (principal component analysis), the overall variance is dominated by the subjects' contributions, and trends that would highlight the effect of specific factors such as oral rinse cannot be seen. In order to identify these trends, we used methods such as MSCA (multilevel simultaneous component analysis) and ASCA (ANOVA simultaneous component analysis), which split the variance according to the experimental factors, so that the particular effect of treatment on saliva could be examined. The analysis of the treatment effect was thereby enhanced, as it was isolated from the overall variance and assessed without confounding factors.
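The variance-splitting idea behind MSCA/ASCA can be illustrated with a toy ANOVA-style decomposition: each factor's effect matrix is built from the level means of the centred data, and component analysis is then applied to the effect matrix of interest. The sketch below is a simplified, balanced-design illustration with hypothetical data, not the published algorithms:

```python
import numpy as np

def factor_effect(X, labels):
    """Effect matrix of one experimental factor: every row is replaced
    by the mean spectrum of its factor level (after overall centring)."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)
    labels = np.asarray(labels)
    effect = np.zeros_like(Xc)
    for lvl in np.unique(labels):
        idx = labels == lvl
        effect[idx] = Xc[idx].mean(axis=0)
    return effect

# Hypothetical balanced design: 4 saliva spectra, 2 subjects x 2 treatments
rng = np.random.default_rng(0)
X = rng.random((4, 50))
subjects = ["s1", "s1", "s2", "s2"]
treatments = ["rinse", "control", "rinse", "control"]

subject_effect = factor_effect(X, subjects)
treatment_effect = factor_effect(X, treatments)
residual = (X - X.mean(axis=0)) - subject_effect - treatment_effect

# Component analysis applied to `treatment_effect` (rather than to X)
# isolates the rinse effect from the dominant subject-to-subject variance.
```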


9.
This paper analyzes why commonly used normalization methods cause misclassification in tumor gene-expression microarray data and proposes a normalization method based on class means. The method normalizes gene expression profiles in both directions (genes and samples) and interleaves the normalization process with the clustering process, using the clustering results to refine the reference expression levels. Five tumor microarray datasets were selected, and hierarchical clustering and K-means clustering were applied, at different variance levels, to expression data preprocessed with conventional normalization and with class-mean-based normalization, and the clustering results were compared. The experimental results show that class-mean-based normalization effectively improves the quality of clustering results for tumor gene expression profiles.

10.
Successful metabolic profile analysis will aid in the fundamental understanding of physiology. Here, we present a possible analysis workflow. Initially, the procedure to transform raw data into a data matrix containing relative metabolite levels for each sample is described. Because experimental issues with the technical equipment mean that the levels of some metabolites cannot always be determined, and because different experiments often need to be compared, missing value estimation and normalization are presented as helpful preprocessing steps. Regression methods are presented in this review as tools to relate metabolite levels to other physiological properties such as biomass and gene expression. As the number of measured metabolites often exceeds the number of samples, dimensionality reduction methods are required; two of these methods are discussed in detail in this review. Throughout this article, practical examples illustrating the application of the aforementioned methods are given. We focus on uncovering the relationship between metabolism and growth-related properties.

11.
With the advent of the -omics era, classical technology platforms, such as hyphenated mass spectrometry, are currently undergoing a transformation toward high-throughput application. These novel platforms yield highly detailed metabolite profiles in large numbers of samples. Such profiles can be used as fingerprints for the accurate identification and classification of samples as well as for the study of effects of experimental conditions on the concentrations of specific metabolites. Challenges for the application of these methods lie in the acquisition of high-quality data, data normalization, and data mining. Here, a high-throughput fingerprinting approach based on analysis of headspace volatiles using ultrafast gas chromatography coupled to time of flight mass spectrometry (ultrafast GC/TOF-MS) was developed and evaluated for classification and screening purposes in food fermentation. GC-MS mass spectra of headspace samples of milk fermented by different mixed cultures of lactic acid bacteria (LAB) were collected and preprocessed in MetAlign, a dedicated software package for the preprocessing and comparison of liquid chromatography (LC)-MS and GC-MS data. The Random Forest algorithm was used to detect mass peaks that discriminated combinations of species or strains used in fermentations. Many of these mass peaks originated from key flavor compounds, indicating that the presence or absence of individual strains or combinations of strains significantly influenced the concentrations of these components. We demonstrate that the approach can be used for purposes like the selection of strains from collections based on flavor characteristics and the screening of (mixed) cultures for the presence or absence of strains. In addition, we show that strain-specific flavor characteristics can be traced back to genetic markers when comparative genome hybridization (CGH) data are available.
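As an illustration of using Random Forest importances to flag discriminating mass peaks, a scikit-learn sketch with simulated (hypothetical) data might look like this; the array shapes and labels are placeholders, not the study's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Hypothetical preprocessed GC/TOF-MS data: rows = fermented milk samples,
# columns = aligned mass peaks; y = which LAB strain combination was used.
rng = np.random.default_rng(0)
X = rng.random((40, 200))
y = rng.integers(0, 2, size=40)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X, y)

# Peaks with the highest importance are candidate discriminating flavor compounds.
top_peaks = np.argsort(forest.feature_importances_)[::-1][:10]
print(top_peaks)
```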

12.
13.
The development of fast and effective spectroscopic methods that can detect most compounds in an untargeted manner is of increasing interest in plant-extract fingerprinting and profiling projects. Metabolite fingerprinting by nuclear magnetic resonance (NMR) is a fast-growing field that is increasingly applied to the quality control of herbal products, mostly via 1D 1H NMR coupled to multivariate data analysis. Nevertheless, signal overlap is a common problem in 1H NMR profiles that hinders metabolite identification and results in incomplete data interpretation. Herein, we introduce a novel approach coupling 2D NMR datasets with principal component analysis (PCA), exemplified by hop resin classification. Heteronuclear multiple bond correlation (HMBC) profile maps of hop resins (Humulus lupulus) were generated for a comparative study of 13 hop cultivars. The method described herein combines reproducible metabolite fingerprints with minimal sample preparation effort and an experimental time of ca. 28 min per sample, comparable to that of a standard HPLC run. Moreover, HMBC spectra not only provide unequivocal assignment of the major hop secondary metabolites, but also allow the identification of several isomerization and degradation products of hop bitter acids, including the sedative principle of hop (2-methylbut-3-en-2-ol). We believe that combining 2D NMR datasets with chemometrics, i.e. PCA, has great potential for application in other plant metabolome projects of (commercially relevant) nutraceuticals and/or herbal drugs.
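Once each HMBC map is binned and flattened into a row vector, the PCA step itself is routine; the sketch below uses hypothetical data purely to show the shape of the workflow:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical input: each HMBC map binned onto a fixed grid and flattened,
# one row per hop resin sample (13 cultivars x 1024 bins here).
rng = np.random.default_rng(1)
X = rng.random((13, 1024))

pca = PCA(n_components=2)
scores = pca.fit_transform(X)           # PCA centres the data internally
print(pca.explained_variance_ratio_)    # variance captured by PC1 and PC2
# A scatter plot of `scores` gives the cultivar classification map.
```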

14.
One of the major issues in expression profiling analysis is still to define proper thresholds for determining differential expression while avoiding false positives, the problem being that the variance is inversely proportional to the log of the signal intensities. Aiming to solve this issue, we describe a model, expression variation (EV), based on the LMS method, which allows data normalization and the construction of confidence bands of gene expression by fitting cubic spline curves to the Box-Cox transformation. The confidence bands, fitted to the actual variance of the data, include the genes devoid of significant variation and allow, based on the confidence bandwidth, the calculation of EVs. Each outlier is positioned according to the dispersion space (DS), and a P-value is statistically calculated to determine EV. This model results in variance stabilization. Using two Affymetrix-generated datasets, the sets of differentially expressed genes selected using EV and other classical methods were compared. The analysis suggests that EV is more robust with respect to variance stabilization and to selecting differential expression from both rare and strongly expressed genes.
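The Box-Cox step that underlies the EV model is available directly in SciPy; the sketch below shows only that transformation (not the LMS cubic-spline fitting or the confidence-band construction), with hypothetical intensities:

```python
import numpy as np
from scipy import stats

# Hypothetical positive signal intensities for one probe across arrays.
x = np.array([120.0, 95.0, 140.0, 180.0, 110.0, 160.0])

# scipy estimates the Box-Cox lambda by maximum likelihood.
transformed, lam = stats.boxcox(x)
print(lam, transformed)
```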

15.
Metabolomics involves the unbiased quantitative and qualitative analysis of the complete set of metabolites present in cells, body fluids and tissues (the metabolome). By analyzing differences between metabolomes using biostatistics (multivariate data analysis; pattern recognition), metabolites relevant to a specific phenotypic characteristic can be identified. However, the reliability of the analytical data is a prerequisite for correct biological interpretation in metabolomics analysis. In this review the challenges in quantitative metabolomics analysis with regard to both analytical and data preprocessing steps are discussed. Recommendations are given on how to optimize and validate comprehensive silylation-based methods from sample extraction and derivatization up to data preprocessing, and how to perform quality control during metabolomics studies. The current state of method validation and data preprocessing methods used in the published literature is discussed, and a perspective on the future research necessary to obtain accurate quantitative data from comprehensive GC-MS data is provided.

16.
lumi: a pipeline for processing Illumina microarray
Illumina microarray is becoming a popular microarray platform. The BeadArray technology from Illumina makes its preprocessing and quality control different from other microarray technologies. Unfortunately, most other analyses have not taken advantage of the unique properties of the BeadArray system and have simply incorporated preprocessing methods originally designed for Affymetrix microarrays. lumi is a Bioconductor package especially designed to process Illumina microarray data. It includes data input, quality control, variance stabilization, normalization and gene annotation portions. Specifically, the lumi package includes a variance-stabilizing transformation (VST) algorithm that takes advantage of the technical replicates available on every Illumina microarray. Different normalization method options and multiple quality control plots are provided in the package. To better annotate the Illumina data, a vendor-independent nucleotide universal identifier (nuID) was devised to identify the probes of the Illumina microarray. The nuID annotation packages and the output of lumi-processed results can be easily integrated with other Bioconductor packages to construct a statistical data analysis pipeline for Illumina data. Availability: The lumi Bioconductor package, www.bioconductor.org

17.
1H NMR spectra from urine can yield information-rich data sets that offer important insights into many biological and biochemical phenomena. However, the quality and utility of these insights can be profoundly affected by how the NMR spectra are processed and interpreted. For instance, if the NMR spectra are incorrectly referenced or inconsistently aligned, the identification of many compounds will be incorrect. If the NMR spectra are mis-phased or if the baseline correction is flawed, the estimated concentrations of many compounds will be systematically biased. Furthermore, because NMR permits the measurement of concentrations spanning up to five orders of magnitude, several problems can arise with data analysis. For instance, signals originating from the most abundant metabolites may prove to be the least biologically relevant while signals arising from the least abundant metabolites may prove to be the most important but hardest to accurately and precisely measure. As a result, a number of data processing techniques such as scaling, transformation and normalization are often required to address these issues. Therefore, proper processing of NMR data is a critical step to correctly extract useful information in any NMR-based metabolomic study. In this review we highlight the significance, advantages and disadvantages of different NMR spectral processing steps that are common to most NMR-based metabolomic studies of urine. These include: chemical shift referencing, phase and baseline correction, spectral alignment, spectral binning, scaling and normalization. We also provide a set of recommendations for best practices regarding spectral and data processing for NMR-based metabolomic studies of biofluids, with a particular focus on urine.
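Spectral binning (bucketing), one of the processing steps discussed, can be sketched with NumPy; the bin width and ppm range below are typical but hypothetical choices:

```python
import numpy as np

def bin_spectrum(ppm, intensity, bin_width=0.04, ppm_min=0.5, ppm_max=10.0):
    """Sum intensities into fixed-width chemical-shift bins, a common
    bucketing step before multivariate analysis of urine 1H NMR spectra."""
    edges = np.arange(ppm_min, ppm_max + bin_width, bin_width)
    binned, _ = np.histogram(ppm, bins=edges, weights=intensity)
    return edges[:-1], binned

# Hypothetical spectrum: 8,192 points between 0.5 and 10 ppm
rng = np.random.default_rng(2)
ppm = np.linspace(0.5, 10.0, 8192)
intensity = rng.random(8192)
bin_left, buckets = bin_spectrum(ppm, intensity)
print(buckets.shape)
```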

18.
Various types of unwanted and uncontrollable signal variations in MS-based metabolomics and proteomics datasets severely disturb the accuracy of metabolite and protein profiling. Therefore, pooled quality control (QC) samples are often employed in quality management processes, which are indispensable to the success of metabolomics and proteomics experiments, especially in high-throughput cases and long-term projects. However, data consistency and QC sample stability are still difficult to guarantee because of the complexity of the experimental operations and differences between experimenters. To make things worse, numerous proteomics projects do not take QC samples into consideration at the beginning of experimental design. Herein, a powerful and interactive web-based software tool, named pseudoQC, is presented to simulate QC sample data for actual metabolomics and proteomics datasets using four different machine learning-based regression methods. The simulated data are used for correction and normalization of two published datasets, and the obtained results suggest that nonlinear regression methods perform better than linear ones. Additionally, the software is available as a web-based graphical user interface and can be used by scientists without a bioinformatics background. pseudoQC is open-source software and freely available at https://www.omicsolution.org/wukong/pseudoQC/.
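pseudoQC itself simulates QC data with regression models; for context, the sketch below shows the complementary and widely used idea of QC-based drift correction, in which a nonlinear regressor is fitted to QC intensities versus injection order and every sample is divided by the predicted drift. This is an illustrative, assumption-laden sketch with hypothetical data, not pseudoQC's algorithm:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def qc_drift_correct(intensity, order, is_qc):
    """Fit intensity vs. injection order on QC injections for one feature,
    then divide every sample by the predicted drift curve."""
    model = RandomForestRegressor(n_estimators=200, random_state=0)
    model.fit(order[is_qc].reshape(-1, 1), intensity[is_qc])
    drift = model.predict(order.reshape(-1, 1))
    return intensity / drift * np.median(intensity[is_qc])

# Hypothetical run: 60 injections, every 10th one a pooled QC sample
rng = np.random.default_rng(3)
order = np.arange(60, dtype=float)
is_qc = (order % 10 == 0)
intensity = 1000 + 5 * order + rng.normal(0, 20, 60)   # upward drift plus noise
corrected = qc_drift_correct(intensity, order, is_qc)
```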

19.
Array-based gene expression studies frequently serve to identify genes that are expressed differently under two or more conditions. The actual analysis of the data, however, may be hampered by a number of technical and statistical problems. Possible remedies on the level of computational analysis lie in appropriate preprocessing steps, proper normalization of the data and application of statistical testing procedures in the derivation of differentially expressed genes. This review summarizes methods that are available for these purposes and provides a brief overview of the available software tools.
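A minimal version of the statistical-testing step for deriving differentially expressed genes, here a per-gene t-test followed by Benjamini-Hochberg adjustment on hypothetical normalized data, might look like this:

```python
import numpy as np
from scipy import stats

def bh_fdr(pvals):
    """Benjamini-Hochberg adjusted p-values."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    adjusted = np.empty(n)
    adjusted[order] = np.minimum.accumulate(scaled[::-1])[::-1]
    return np.clip(adjusted, 0, 1)

# Hypothetical normalized expression matrices: genes x arrays for two conditions
rng = np.random.default_rng(4)
control = rng.normal(size=(500, 6))
treated = rng.normal(size=(500, 6))

_, pvals = stats.ttest_ind(control, treated, axis=1)
de_genes = np.where(bh_fdr(pvals) < 0.05)[0]
print(de_genes)
```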

20.
A study of pathway enrichment methods based on gene expression variability
Current pathway enrichment methods are mainly based on differential gene expression; few methods perform enrichment analysis from the perspective of pathway variability (variance). We observed that, when pathway variability is described with an appropriate statistic, the variability of some pathways rises or falls markedly under disease phenotypes. This study therefore hypothesizes that the degree of pathway variability differs between phenotypes. We designed 14 statistics and testing procedures for describing pathway variability, used them to detect pathways whose variability differs between phenotypes (i.e., enriched pathways), compared the enrichment results with the results of a literature search, and analyzed the influence of different microarray preprocessing methods on the data and the results. The results show that, among the five preprocessing methods, the robust multi-array average (RMA) algorithm is optimal for data preprocessing; that the degree of pathway variability does differ between phenotypes; and that, judged against the pathways found in the literature search, of the 14 variability-based enrichment methods, a permutation test using the variance of the Euclidean distances of the genes within a pathway as the statistic (method 11) effectively identifies significant pathways and outperforms gene set enrichment analysis (GSEA). In summary, a pathway enrichment strategy based on pathway variability is feasible; it not only provides theoretical guidance for pathway enrichment analysis but also offers a new perspective for the study of human disease.
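One plausible reading of "method 11" (the variance of the Euclidean distances of a pathway's genes, assessed by a permutation test) is sketched below; the exact distance definition in the original work may differ, so this is an illustrative reconstruction with hypothetical inputs:

```python
import numpy as np

def pathway_variance_stat(X, pathway_genes):
    """Variance of each pathway gene's Euclidean distance from the mean
    pathway profile; X is a genes-by-samples expression matrix."""
    sub = X[pathway_genes]                      # genes-in-pathway x samples
    dists = np.linalg.norm(sub - sub.mean(axis=0), axis=1)
    return dists.var()

def permutation_pvalue(X_case, X_ctrl, pathway_genes, n_perm=1000, seed=0):
    """Permute sample labels and compare the observed between-phenotype
    difference in the variability statistic with its null distribution."""
    rng = np.random.default_rng(seed)
    observed = abs(pathway_variance_stat(X_case, pathway_genes)
                   - pathway_variance_stat(X_ctrl, pathway_genes))
    pooled = np.hstack([X_case, X_ctrl])
    n_case = X_case.shape[1]
    hits = 0
    for _ in range(n_perm):
        cols = rng.permutation(pooled.shape[1])
        perm_case = pooled[:, cols[:n_case]]
        perm_ctrl = pooled[:, cols[n_case:]]
        diff = abs(pathway_variance_stat(perm_case, pathway_genes)
                   - pathway_variance_stat(perm_ctrl, pathway_genes))
        hits += diff >= observed
    return (hits + 1) / (n_perm + 1)
```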
