Artificial Intelligence in Health
ORIGINAL RESEARCH ARTICLE
Benchmarking machine learning missing data
imputation methods in large-scale mental
health survey databases
Preethi Prakash¹, Kelly Street², Shrikanth Narayanan³, Bridget A. Fernandez⁴,⁵, Yufeng Shen⁶, and Chang Shu⁷*
1 Department of Computer Science, Fu Foundation School of Engineering and Applied Science,
Columbia University, New York, NY, United States of America
2 Department of Population and Public Health Sciences, Division of Biostatistics, Keck School of
Medicine, University of Southern California, Los Angeles, CA, United States of America
3 Viterbi School of Engineering, University of Southern California, Los Angeles, CA, United States
of America
4 Department of Pediatrics, Division of Medical Genetics, Children’s Hospital Los Angeles and The
Saban Research Institute, Los Angeles, CA, United States of America
5 Department of Pediatrics, Keck School of Medicine of USC, University of Southern California, Los
Angeles, CA, United States of America
6 Departments of Systems Biology and Biomedical Informatics, and JP Sulzberger Columbia
Genome Center, Columbia University Irving Medical Center, New York, NY, United States of America
7 Department of Population and Public Health Sciences, Center for Genetic Epidemiology, Division
of Epidemiology and Genetics, Keck School of Medicine, University of Southern California, Los
Angeles, CA, United States of America
*Corresponding author:
Chang Shu
(april.shu@usc.edu)

Citation: Prakash P, Street K, Narayanan S, Fernandez BA, Shen Y, Shu C. Benchmarking machine learning missing data imputation methods in large-scale mental health survey databases. Artif Intell Health. 2025;2(1):81-92. doi: 10.36922/aih.4406

Received: August 1, 2024
Revised: September 17, 2024
Accepted: October 14, 2024
Published Online: November 7, 2024

Copyright: © 2024 Author(s). This is an Open-Access article distributed under the terms of the Creative Commons Attribution License, permitting distribution and reproduction in any medium, provided the original work is properly cited.

Abstract

Databases tied to mental and behavioral health surveys suffer from missing data when participants skip entire surveys, which affects data quality and sample size. These missing data patterns were investigated and imputation performance was evaluated in Simons Foundation Powering Autism Research for Knowledge, a large-scale autism cohort consisting of over 117,000 participants. Four common methods were assessed: multiple imputation by chained equations (MICE), K-nearest neighbors (KNN), MissForest, and multiple imputation with denoising autoencoders (MIDAS). In a complete subset of 15,196 autism participants, three types of missingness patterns were simulated. We observed that MIDAS and KNN performed the best as the random missingness rate increased and when blockwise missingness was simulated. The average computational times were 10 min each for MIDAS and KNN, 35 min for MissForest, and 290 min for MICE. MIDAS and KNN both provide promising imputation performance in mental and behavioral health survey data that exhibit blockwise missingness patterns.

Keywords: Missing data; Mental health survey; Imputation methods; Machine learning
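The benchmarking idea summarized in the abstract (mask cells in a complete data set, impute them, and score the imputed values against the known truth) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`mask_at_random`, `mask_blockwise`, `knn_impute`, `masked_rmse`), the toy Likert-style data, the distance measure, and the RMSE score are all assumptions made for illustration, and the simple KNN imputer here stands in for the library implementations a real benchmark would use.

```python
import random

def mask_at_random(rows, rate, rng):
    """Set each cell to None independently with probability `rate`."""
    return [[None if rng.random() < rate else v for v in row] for row in rows]

def mask_blockwise(rows, block_cols, rate, rng):
    """With probability `rate`, blank every column in `block_cols` for a row,
    mimicking a participant who skipped one whole survey."""
    out = []
    for row in rows:
        skip = rng.random() < rate
        out.append([None if (skip and j in block_cols) else v
                    for j, v in enumerate(row)])
    return out

def _distance(a, b):
    """Mean absolute difference over the columns observed in both rows."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return float("inf")
    return sum(abs(x - y) for x, y in shared) / len(shared)

def knn_impute(rows, k=3):
    """Fill each missing cell with the mean of that column over the k nearest
    rows that observed it. Cells with no donor at all stay None."""
    filled = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, v in enumerate(row):
            if v is not None:
                continue
            donors = sorted(
                (_distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            )[:k]
            if donors:
                filled[i][j] = sum(val for _, val in donors) / len(donors)
    return filled

def masked_rmse(truth, masked, imputed):
    """Root mean squared error over cells that were masked and then imputed."""
    errs = [(t - p) ** 2
            for tr, mr, pr in zip(truth, masked, imputed)
            for t, m, p in zip(tr, mr, pr)
            if m is None and p is not None]
    return (sum(errs) / len(errs)) ** 0.5 if errs else 0.0

# Toy "complete subset": 60 participants, 6 correlated items scored 0-4.
rng = random.Random(0)
truth = []
for _ in range(60):
    base = rng.randint(0, 4)
    truth.append([min(4, max(0, base + rng.choice([-1, 0, 1]))) for _ in range(6)])

for name, masked in [
    ("random 20%", mask_at_random(truth, 0.2, rng)),
    ("blockwise 30%", mask_blockwise(truth, {3, 4, 5}, 0.3, rng)),
]:
    print(name, round(masked_rmse(truth, masked, knn_impute(masked)), 3))
```

Because the masked cells come from complete records, the error is computed against known values, which is what makes a comparison across imputation methods possible.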
Publisher’s Note: AccScience Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1. Introduction

Large-scale biobank databases in mental and behavioral health such as Simons Foundation Powering Autism Research for Knowledge (SPARK), UK Biobank, and All
Volume 2 Issue 1 (2025) 81 doi: 10.36922/aih.4406

