
Artificial Intelligence in Health





                                        ORIGINAL RESEARCH ARTICLE
Benchmarking machine learning missing data imputation methods in large-scale mental health survey databases



Preethi Prakash¹, Kelly Street², Shrikanth Narayanan³, Bridget A. Fernandez⁴,⁵, Yufeng Shen⁶, and Chang Shu⁷*
                                        1 Department of Computer Science, Fu Foundation School of Engineering and Applied Science,
                                        Columbia University, New York, NY, United States of America
                                        2 Department of Population and Public Health Sciences, Division of Biostatistics, Keck School of
                                        Medicine, University of Southern California, Los Angeles, CA, United States of America
                                        3 Viterbi School of Engineering, University of Southern California, Los Angeles, CA, United States
                                        of America
                                        4 Department of Pediatrics, Division of Medical Genetics, Children’s Hospital Los Angeles and The
                                        Saban Research Institute, Los Angeles, CA, United States of America
                                        5 Department of Pediatrics, Keck School of Medicine of USC, University of Southern California, Los
                                        Angeles, CA, United States of America
                                        6 Departments of Systems Biology and Biomedical Informatics, and JP Sulzberger Columbia
                                        Genome Center, Columbia University Irving Medical Center, New York, NY, United States of America
                                        7 Department of Population and Public Health Sciences, Center for Genetic Epidemiology, Division
                                        of Epidemiology and Genetics, Keck School of Medicine, University of Southern California, Los
                                        Angeles, CA, United States of America

*Corresponding author:
Chang Shu (april.shu@usc.edu)

Citation: Prakash P, Street K, Narayanan S, Fernandez BA, Shen Y, Shu C. Benchmarking machine learning missing data imputation methods in large-scale mental health survey databases. Artif Intell Health. 2025;2(1):81-92. doi: 10.36922/aih.4406

Received: August 1, 2024
Revised: September 17, 2024
Accepted: October 14, 2024
Published Online: November 7, 2024

Copyright: © 2024 Author(s). This is an Open-Access article distributed under the terms of the Creative Commons Attribution License, permitting distribution and reproduction in any medium, provided the original work is properly cited.

Abstract

Databases tied to mental and behavioral health surveys suffer from missing data when participants skip an entire survey, which reduces both data quality and sample size. These missing data patterns were investigated, and imputation performance was evaluated, in Simons Foundation Powering Autism Research for Knowledge, a large-scale autism cohort consisting of over 117,000 participants. Four common methods were assessed: multiple imputation by chained equations (MICE), K-nearest neighbors (KNN), MissForest, and multiple imputation with denoising autoencoders (MIDAS). In a complete subset of 15,196 autism participants, three types of missingness patterns were simulated. We observed that MIDAS and KNN performed the best as the random missingness rate increased and when blockwise missingness was simulated. The average computational time was 10 min each for MIDAS and KNN, 35 min for MissForest, and 290 min for MICE. MIDAS and KNN both provide promising imputation performance in mental and behavioral health survey data that exhibit blockwise missingness patterns.

Keywords: Missing data; Mental health survey; Imputation methods; Machine learning
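As an illustration only, and not the authors' pipeline, the benchmarking idea summarized in the abstract can be sketched in Python: start from a fully observed matrix, hide some cells under a chosen missingness pattern, impute them, and score the imputed values against the hidden truth. Everything in this sketch is an assumption made for illustration: the synthetic data, the function names, the 20% and 30% rates, the choice of k = 5, and the use of root mean squared error as the score. Only two of the simulated patterns (random and blockwise) are shown, and the KNN imputer is a simplified pure-Python version; in practice a library implementation such as scikit-learn's `KNNImputer` would normally be preferred.

```python
import math
import random

def make_complete_data(n_rows=200, n_cols=6, seed=0):
    """Synthetic, fully observed 'survey' matrix with correlated columns (hypothetical data)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_rows):
        latent = rng.gauss(0, 1)  # shared factor so that columns are correlated
        data.append([latent + rng.gauss(0, 0.3) for _ in range(n_cols)])
    return data

def mask_random(data, rate, seed=1):
    """Random missingness: each cell is hidden independently with probability `rate`."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in data]

def mask_blockwise(data, block_cols, row_rate, seed=2):
    """Blockwise missingness: a sampled participant skips a whole block of columns."""
    rng = random.Random(seed)
    out = []
    for row in data:
        skip = rng.random() < row_rate
        out.append([None if (skip and j in block_cols) else v for j, v in enumerate(row)])
    return out

def distance(a, b):
    """Root mean squared difference over the columns observed in both rows."""
    shared = [(x - y) ** 2 for x, y in zip(a, b) if x is not None and y is not None]
    return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

def knn_impute(masked, k=5):
    """Fill each missing cell with the mean of that column over the k nearest donor rows."""
    n_cols = len(masked[0])
    col_means = []
    for j in range(n_cols):
        obs = [row[j] for row in masked if row[j] is not None]
        col_means.append(sum(obs) / len(obs) if obs else 0.0)  # fallback value
    imputed = [list(row) for row in masked]
    for i, row in enumerate(masked):
        missing = [j for j, v in enumerate(row) if v is None]
        if not missing:
            continue
        # Rank all other rows once by their distance to row i.
        neighbors = sorted(
            (distance(row, other), idx) for idx, other in enumerate(masked) if idx != i
        )
        for j in missing:
            donors = [masked[idx][j] for d, idx in neighbors
                      if d < math.inf and masked[idx][j] is not None][:k]
            imputed[i][j] = sum(donors) / len(donors) if donors else col_means[j]
    return imputed

def rmse_on_masked(truth, masked, imputed):
    """Root mean squared error computed only on the cells that were hidden."""
    errs = [(t - p) ** 2
            for tr, mr, pr in zip(truth, masked, imputed)
            for t, m, p in zip(tr, mr, pr) if m is None]
    return math.sqrt(sum(errs) / len(errs))

truth = make_complete_data()
scenarios = [
    ("random, 20% of cells", mask_random(truth, 0.2)),
    ("blockwise, 30% of rows skip columns 3-5", mask_blockwise(truth, {3, 4, 5}, 0.3)),
]
for name, masked in scenarios:
    imputed = knn_impute(masked)
    print(name, "RMSE:", round(rmse_on_masked(truth, masked, imputed), 3))
```

Scoring only the hidden cells keeps the comparison fair across methods, because observed cells are passed through unchanged by any imputer. Swapping `knn_impute` for another imputer with the same input and output shape would allow the same masks to be reused for a side-by-side comparison.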
Publisher’s Note: AccScience Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1. Introduction

Large-scale biobank databases in mental and behavioral health such as Simons Foundation Powering Autism Research for Knowledge (SPARK), UK Biobank, and All

            Volume 2 Issue 1 (2025)                         81                               doi: 10.36922/aih.4406