dc.contributor.author: Do, Kieu Trinh
dc.contributor.author: Wahl, Simone
dc.contributor.author: Raffler, Johannes
dc.contributor.author: Molnos, Sophie
dc.contributor.author: Laimighofer, Michael
dc.contributor.author: Adamski, Jerzy
dc.contributor.author: Suhre, Karsten
dc.contributor.author: Strauch, Konstantin
dc.contributor.author: Peters, Annette
dc.contributor.author: Gieger, Christian
dc.contributor.author: Langenberg, Claudia
dc.contributor.author: Stewart, Isobel D
dc.contributor.author: Theis, Fabian J
dc.contributor.author: Grallert, Harald
dc.contributor.author: Kastenmüller, Gabi
dc.contributor.author: Krumsiek, Jan
dc.description.abstract: BACKGROUND: Untargeted mass spectrometry (MS)-based metabolomics data often contain missing values that reduce statistical power and can introduce bias in biomedical studies. However, a systematic assessment of the various sources of missing values and strategies to handle these data has received little attention. Missing data can occur systematically, e.g. from run day-dependent effects due to limits of detection (LOD); or it can be random as, for instance, a consequence of sample preparation. METHODS: We investigated patterns of missing data in an MS-based metabolomics experiment of serum samples from the German KORA F4 cohort (n = 1750). We then evaluated 31 imputation methods in a simulation framework and biologically validated the results by applying all imputation approaches to real metabolomics data. We examined the ability of each method to reconstruct biochemical pathways from data-driven correlation networks, and the ability of the method to increase statistical power while preserving the strength of established metabolic quantitative trait loci. RESULTS: Run day-dependent LOD-based missing data accounts for most missing values in the metabolomics dataset. Although multiple imputation by chained equations performed well in many scenarios, it is computationally and statistically challenging. K-nearest neighbors (KNN) imputation on observations with variable pre-selection showed robust performance across all evaluation schemes and is computationally more tractable. CONCLUSION: Missing data in untargeted MS-based metabolomics data occur for various reasons. Based on our results, we recommend that KNN-based imputation is performed on observations with variable pre-selection since it showed robust results in all evaluation schemes.
dc.description.sponsorship: This work was supported by grants from the German Federal Ministry of Education and Research (BMBF), by BMBF Grant No. 01ZX1313C (project e:Athero-MED) and Grant No. 03IS2061B (project Gani_Med). Moreover, the research leading to these results has received funding from the European Union’s Seventh Framework Programme [FP7-Health-F5-2012] under grant agreement No. 305280 (MIMOmics) and from the European Research Council (starting grant “LatentCauses”). KS is supported by Biomedical Research Program funds at Weill Cornell Medical College in Qatar, a program funded by the Qatar Foundation. The KORA Augsburg studies were financed by the Helmholtz Zentrum München, German Research Center for Environmental Health, Neuherberg, Germany and supported by grants from the German Federal Ministry of Education and Research (BMBF). Analyses in the EPIC-Norfolk study were supported by funding from the Medical Research Council (MC_PC_13048 and MC_UU_12015/1).
dc.publisher: Springer Science and Business Media LLC
dc.rights: Attribution 4.0 International
dc.title: Characterization of missing values in untargeted MS-based metabolomics data and evaluation of missing data handling strategies.
dc.contributor.orcid: Langenberg, Claudia [0000-0002-5017-7344]
rioxxterms.type: Journal Article/Review
pubs.funder-project-id: Medical Research Council (MC_UU_12015/1)
pubs.funder-project-id: European Commission (305280)

Except where otherwise noted, this item's licence is described as Attribution 4.0 International