Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies
Authors: Kieu Trinh Do, Simone Wahl, Johannes Raffler, Sophie Molnos, Michael Laimighofer, Jerzy Adamski, Karsten Suhre, Konstantin Strauch, Annette Peters, Christian Gieger, Claudia Langenberg, Isobel D. Stewart, Fabian J. Theis, Harald Grallert, Gabi Kastenmüller, Jan Krumsiek
Affiliation: 1. Institute of Computational Biology, Helmholtz-Zentrum München, Neuherberg, Germany; 2. Institute of Epidemiology II, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany; 3. Research Unit of Molecular Epidemiology, German Research Center for Environmental Health, Helmholtz Zentrum München, Neuherberg, Germany; 4. German Center for Diabetes Research (DZD e.V.), Neuherberg, Germany; 5. Institute of Bioinformatics and Systems Biology, Helmholtz-Zentrum München, Neuherberg, Germany; 6. Institute of Experimental Genetics, Genome Analysis Center, Helmholtz Zentrum München, Neuherberg, Germany; 7. Lehrstuhl für Experimentelle Genetik, Technische Universität München, Freising, Germany; 8. German Center for Cardiovascular Disease Research (DZHK e.V.), Munich, Germany; 9. Department of Physiology and Biophysics, Weill Cornell Medical College in Qatar, Doha, Qatar; 10. Institute of Genetic Epidemiology, Helmholtz Zentrum München–German Research Center for Environmental Health, Neuherberg, Germany; 11. Chair of Genetic Epidemiology, Institute of Medical Informatics, Biometry and Epidemiology, Ludwig-Maximilians-University, Munich, Germany; 12. MRC Epidemiology Unit, University of Cambridge, Cambridge, UK; 13. Department of Mathematics, Technische Universität München, Garching, Germany; 14. Institute for Computational Biomedicine, Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, New York, USA
Abstract:

Background

Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, the various sources of missing values and the strategies for handling them have rarely been assessed systematically. Missing values can arise systematically, e.g. from run day-dependent effects due to limits of detection (LOD), or randomly, for instance as a consequence of sample preparation.
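The two mechanisms can be illustrated with a small sketch. This is not the authors' simulation framework, which the abstract does not describe; the function names, the per-day LOD dictionary, and the example values are all assumptions chosen for illustration. It masks one metabolite's measurements either below a run day-specific LOD or completely at random.

```python
import numpy as np

def mask_below_lod(values, run_day, lod_by_day):
    """Systematic missingness: hide values below the LOD of their run day.

    `lod_by_day` is a hypothetical mapping {run day -> detection limit}.
    """
    values = np.asarray(values, dtype=float)
    lod = np.array([lod_by_day[d] for d in run_day], dtype=float)
    out = values.copy()
    out[values < lod] = np.nan
    return out

def mask_at_random(values, fraction, rng):
    """Random missingness: hide each value independently with probability `fraction`."""
    values = np.asarray(values, dtype=float)
    out = values.copy()
    out[rng.random(values.shape) < fraction] = np.nan
    return out

# Illustrative data: four samples of one metabolite measured on two run days.
x_obs = [1.0, 2.0, 3.0, 4.0]
days = [0, 0, 1, 1]
x_lod = mask_below_lod(x_obs, days, {0: 1.5, 1: 3.5})
x_rand = mask_at_random(x_obs, 0.25, np.random.default_rng(0))
```

In the LOD case the probability of being missing depends on the unobserved value itself and on the run day, whereas in the random case it depends on neither.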

Methods

We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established metabolic quantitative trait loci.

Results

Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.
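As a rough illustration of what KNN imputation on observations with variable pre-selection can look like, the sketch below is one possible reading, not the authors' implementation. The abstract gives no algorithmic details, so the correlation threshold `min_abs_corr`, the neighbor count `k`, the z-scoring, the root-mean-square distance over shared variables, and the unweighted neighbor mean are all assumptions. For each variable with gaps, it pre-selects correlated variables, finds the nearest samples that have the variable observed, and averages their values.

```python
import numpy as np

def pairwise_corr(X):
    """Pearson correlation per variable pair, using only rows where both are observed."""
    p = X.shape[1]
    C = np.full((p, p), np.nan)
    for a in range(p):
        for b in range(a + 1, p):
            ok = ~np.isnan(X[:, a]) & ~np.isnan(X[:, b])
            if ok.sum() >= 3 and X[ok, a].std() > 0 and X[ok, b].std() > 0:
                C[a, b] = C[b, a] = np.corrcoef(X[ok, a], X[ok, b])[0, 1]
    return C

def knn_impute_obs_sel(X, k=5, min_abs_corr=0.2):
    """Impute NaNs from the k nearest samples, measured on pre-selected variables.

    k and min_abs_corr are illustrative defaults, not values from the paper.
    """
    X = np.asarray(X, dtype=float)
    # z-score each variable so that no single metabolite dominates the distances
    sd = np.nanstd(X, axis=0)
    sd[~(sd > 0)] = 1.0
    Z = (X - np.nanmean(X, axis=0)) / sd
    C = pairwise_corr(X)
    out = X.copy()
    for j in range(X.shape[1]):
        miss = np.isnan(X[:, j])
        if not miss.any() or miss.all():
            continue
        # variable pre-selection; NaN correlations (including the diagonal) fail the test
        sel = np.where(np.abs(C[j]) >= min_abs_corr)[0]
        donors = np.where(~miss)[0]
        fallback = X[donors, j].mean()
        for i in np.where(miss)[0]:
            diff = Z[donors][:, sel] - Z[i, sel]
            shared = ~np.isnan(diff)
            n_shared = shared.sum(axis=1)
            sq = np.where(shared, diff ** 2, 0.0).sum(axis=1)
            dist = np.full(len(donors), np.inf)
            has = n_shared > 0
            dist[has] = np.sqrt(sq[has] / n_shared[has])
            nearest = np.argsort(dist)[:k]
            nearest = nearest[np.isfinite(dist[nearest])]
            # fall back to the observed mean when no comparable donor exists
            out[i, j] = X[donors[nearest], j].mean() if nearest.size else fallback
    return out

# Illustrative data: two strongly correlated variables plus an unrelated one.
x = np.linspace(0.0, 10.0, 50)
X_demo = np.column_stack([x, 2.0 * x + 1.0, np.tile([1.0, -1.0], 25)])
truth = X_demo[25, 0]
X_demo[25, 0] = np.nan
X_imputed = knn_impute_obs_sel(X_demo, k=4)
```

Restricting the distance computation to correlated variables keeps unrelated metabolites from adding noise to the neighbor search, which is one plausible reason why pre-selection helps.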

Conclusion

Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend KNN-based imputation on observations with variable pre-selection, since it showed robust results in all evaluation schemes.
Keywords:
This article is indexed in SpringerLink and other databases.