Selecting Methods for Multiple Imputation of Missing Data

Micha Fischer

doi:10.7302/8407

JOURNAL ARTICLE

Year: 2023 Journal: Deep Blue (University of Michigan) Publisher: University of Michigan–Ann Arbor

Abstract

Most data sets from sample surveys contain incomplete observations for various reasons, such as a respondent’s refusal to answer questions. Unfortunately, most analysis tools assume complete data sets. When applying such tools to incomplete data, researchers are limited to using either complete observations or complete variables, which can have problematic consequences: biased and inefficient estimates, and decreased power in statistical tests. Often, however, the challenges of missing data can be circumvented through sequential imputation (SI), an iterative procedure that imputes missing values variable by variable, conditioning on observed or previously imputed values of other variables. SI generates a complete data set that can be analyzed using standard analytical tools. Multiple imputation, which generates multiple data sets with different draws of the missing values, can be used to improve efficiency and provide inferences that take into account imputation uncertainty.

Various procedures have been proposed for SI, and each procedure involves a choice of options, which can lead to subjectivity in the imputation process. Further, data are mainly analyzed with a substantive question in mind, and missing data imputation might not be the primary focus of an analyst. To address these issues, previous studies compared different procedures to find the best way to apply SI. However, they often rely on a single assessment strategy, e.g., simulated data only, and often compare only a small number of procedures. These shortcomings limit the generalizability of their findings. This dissertation aims to close this gap by comparing multiple parametric and non-parametric procedures for multiple imputation within the SI framework, and by automating the SI process and reducing sensitivity within it.

Study One compares several parametric and non-parametric procedures for SI. The evaluation uses a simulation approach, analyzing data from 1) parametric models, 2) non-parametric models, and 3) a real survey data set, a publicly available version of the National Health and Nutrition Examination Survey (NHANES) data. The procedures compared include parametric and tree-based procedures. The study finds no overall best-performing method. However, it provides guidance for practice based on the simulation, taking into account the data situation and the required modelling effort.

Study Two proposes a modified SI procedure in which the assessment of different procedures is automated. The study develops criteria for binary, nominal, and continuous incomplete variables to assess imputation methods within SI in an automated and objective fashion. The modified SI process is assessed via a simulation study using data from the NHANES. This study provides methodology for a more automated SI procedure with built-in plausibility checks, with potential application to high-dimensional data sets with missing values, where having a human imputer specify models is inefficient.

Study Three investigates the use and implications of incorporating response indicators (RIs) for covariates in the imputation process. This approach leads to imputation under a missing-not-at-random (MNAR) model. A literature review provides insights into how to include RIs for predictors in models with different analysis goals. Furthermore, a targeted simulation study suggests data situations and analysis goals where this approach is sensible. The simulation shows that, under missing at random (MAR), methods including RIs perform as well as those without them. In MNAR scenarios, methods including RIs can improve performance.
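To make the SI idea in the abstract concrete, the following is a minimal sketch and not the dissertation's procedure. It assumes continuous variables, uses a plain linear regression with Gaussian noise as the per-variable imputation model, and pools a column mean across completed data sets with Rubin's rules. The function names `sequential_imputation` and `pool_mean` are hypothetical. The sketch does not draw the regression parameters from their posterior, so it understates uncertainty relative to a proper multiple imputation.

```python
import numpy as np


def sequential_imputation(X, n_iter=10, rng=None):
    """Return one completed copy of X (NaN = missing) via chained regressions.

    Assumption: every column is continuous and has at least one observed value.
    """
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = X.copy()

    # Starting values: random draws from the observed values of each column.
    for j in range(X.shape[1]):
        observed = X[~miss[:, j], j]
        filled[miss[:, j], j] = rng.choice(observed, size=miss[:, j].sum())

    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            # Condition on observed or previously imputed values of the other variables.
            others = np.delete(filled, j, axis=1)
            A = np.column_stack([np.ones(len(filled)), others])
            A_obs = A[~miss[:, j]]
            y_obs = filled[~miss[:, j], j]
            beta, *_ = np.linalg.lstsq(A_obs, y_obs, rcond=None)
            resid = y_obs - A_obs @ beta
            dof = max(len(y_obs) - A.shape[1], 1)
            sigma = np.sqrt(resid @ resid / dof)
            # Stochastic draw: fitted value plus residual noise.
            noise = rng.normal(0.0, sigma, miss[:, j].sum())
            filled[miss[:, j], j] = A[miss[:, j]] @ beta + noise
    return filled


def pool_mean(completed, col):
    """Pool the mean of one column over m completed data sets (Rubin's rules)."""
    m = len(completed)
    estimates = np.array([d[:, col].mean() for d in completed])
    within = np.mean([d[:, col].var(ddof=1) / len(d) for d in completed])
    between = estimates.var(ddof=1)
    total_var = within + (1 + 1 / m) * between
    return estimates.mean(), total_var


# Illustrative data only: two correlated variables, about 30% of the second set to missing.
demo_rng = np.random.default_rng(0)
x = demo_rng.normal(size=200)
data = np.column_stack([x, 2 * x + demo_rng.normal(size=200)])
data[demo_rng.random(200) < 0.3, 1] = np.nan

completed_sets = [sequential_imputation(data, rng=seed) for seed in range(5)]
pooled_estimate, pooled_variance = pool_mean(completed_sets, col=1)
print(pooled_estimate, pooled_variance)
```

Swapping the regression step for a tree-based or other non-parametric model would give the kind of alternative procedure the studies compare; the surrounding loop would stay the same.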

Keywords:

Imputation (statistics); Missing data; Computer science; Data mining; Statistics; Mathematics; Machine learning

Metrics

FWCI (Field Weighted Citation Impact): 0.00

Topics

Statistical Methods and Inference

Physical Sciences → Mathematics → Statistics and Probability

Bayesian Methods and Mixture Models

Physical Sciences → Computer Science → Artificial Intelligence

Bayesian Modeling and Causal Inference

Physical Sciences → Computer Science → Artificial Intelligence

Related Documents

Multiple Imputation for Univariate Missing Data: Robust Methods

Multiple Imputation for Univariate Missing Data: Parametric Methods

Missing Data and Multiple Imputation

Multiple Imputation of Missing Data

Multiple Imputation for Missing Data