Do, K.T. ; Wahl, S. ; Raffler, J. ; Molnos, S. ; Laimighofer, M. ; Adamski, J. ; Suhre, K.* ; Strauch, K. ; Peters, A. ; Gieger, C. ; Langenberg, C.* ; Stewart, I.D.* ; Theis, F.J. ; Grallert, H. ; Kastenmüller, G. ; Krumsiek, J.
Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies.
Metabolomics 14:128 (2018)
BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation.
METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established metabolic quantitative trait loci.
RESULTS: Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable.
CONCLUSION: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.
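The recommended strategy, KNN imputation on observations with variable pre-selection, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names `pairwise_corr` and `knn_impute_obs_sel`, the correlation threshold `min_abs_corr=0.2`, the default `k`, the Euclidean distance over shared observed entries, and the unweighted neighbor mean are all assumptions made for illustration. The input is assumed to be a samples-by-metabolites matrix that is already log-transformed and scaled, with missing values coded as NaN.

```python
import numpy as np


def pairwise_corr(a, b):
    """Pearson correlation over entries observed in both vectors (0.0 if undefined)."""
    ok = ~np.isnan(a) & ~np.isnan(b)
    if ok.sum() < 3:
        return 0.0
    a, b = a[ok], b[ok]
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])


def knn_impute_obs_sel(X, k=10, min_abs_corr=0.2):
    """Illustrative KNN imputation on observations with variable pre-selection.

    X: samples x metabolites array with NaN for missing values.
    Threshold, distance, and averaging are assumptions, not the paper's settings.
    """
    X = np.asarray(X, dtype=float)
    out = X.copy()
    n_vars = X.shape[1]
    for j in range(n_vars):
        miss = np.isnan(X[:, j])
        if not miss.any():
            continue
        donors = np.flatnonzero(~miss)  # samples where metabolite j was measured
        if donors.size == 0:
            continue  # nothing to borrow from; leave the column as NaN
        # Variable pre-selection: keep metabolites correlated with metabolite j.
        sel = [m for m in range(n_vars)
               if m != j and abs(pairwise_corr(X[:, j], X[:, m])) > min_abs_corr]
        for i in np.flatnonzero(miss):
            if sel:
                diff = X[donors][:, sel] - X[i, sel]
                counts = (~np.isnan(diff)).sum(axis=1)
                with np.errstate(invalid="ignore", divide="ignore"):
                    # Root mean squared difference over shared observed variables.
                    dist = np.sqrt(np.nansum(diff ** 2, axis=1) / counts)
                dist[counts == 0] = np.inf
            else:
                dist = np.full(donors.size, np.inf)
            finite = np.isfinite(dist)
            if finite.any():
                nearest = donors[finite][np.argsort(dist[finite])[:k]]
            else:
                nearest = donors  # fallback: mean of all observed values
            out[i, j] = X[nearest, j].mean()
    return out
```

Calling `knn_impute_obs_sel(X, k=10)` returns a copy of `X` in which each missing entry is replaced by the mean of that metabolite across the `k` most similar samples, where similarity is computed only on the pre-selected correlated metabolites. Observed values are left unchanged.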
Publication type
Article: Journal article
Document type
Scientific Article
Keywords
Batch Effects ; K-nearest Neighbor ; Limit Of Detection ; Mice ; Mass Spectrometry ; Missing Values Imputation ; Untargeted Metabolomics ; Multiple Imputation ; Human Blood ; Networks ; Limit
Language
English
Publication Year
2018
HGF-reported in Year
2018
ISSN (print) / ISBN
1573-3882
e-ISSN
1573-3890
Source details
Volume: 14
Issue: 10
Article Number: 128
Publisher
Springer
Publishing Place
New York, NY
Reviewing status
Peer reviewed
POF-Topic(s)
30205 - Bioengineering and Digital Health
30505 - New Technologies for Biomedical Discoveries
30202 - Environmental Health
30201 - Metabolic Health
30501 - Systemic Analysis of Genetic and Environmental Factors that Impact Health
90000 - German Center for Diabetes Research
Research field(s)
Enabling and Novel Technologies
Genetics and Epidemiology
PSP Element(s)
G-554100-001
G-503700-001
G-503800-001
G-504091-002
G-500600-001
G-504100-001
G-504000-001
G-501900-402
G-504090-001
Date of record entry
2018-09-27