Artificial Intelligence in Health Benchmarking ML imputation in mental health surveys

Figure 1. Overview of workflow and study design. (A) The full dataset refers to the original data filtered to only include autism spectrum disorder (ASD)
participants. The preprocessed complete dataset refers to the original dataset after filtering to only include ASD participants, dropping incomplete rows,
removing variables with extreme rates of missingness, and applying one-hot encoding to the categorical variables (which increases the number of
variables). (B) Missing completely at random refers to the simulation scenario that randomly converts a specified fraction of the input dataset to missing.
Survey-specific missing rate refers to the simulation environment that is tailored to the missingness of the original dataset. Blockwise survey-specific
missing rate refers to the simulation environment that is also tailored to the missingness of the original dataset but converts all rows of a given column
to missing at once. (C) Multiple imputation by chained equations is an imputation method that employs a series of regression models; MissForest is an
imputation method that is based on random forests; Multiple Imputation with Denoising Autoencoders is an imputation method that uses denoising
autoencoders; K-nearest neighbors is an imputation method that uses neighboring data points in the feature space. (D) RMSE corresponds to root mean
squared error.
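
As a minimal sketch of panels (B) and (D), the three simulation scenarios can be written as boolean masks over a complete data matrix, with RMSE computed over the masked cells only. The function names, the per-column rate list, and the column-mean imputer below are illustrative assumptions and not the study's code. The blockwise mask follows the caption's wording literally (all rows of a column are masked at once), and mean imputation is only a stand-in for the four methods in panel (C).

```python
import math
import random


def mcar_mask(n_rows, n_cols, fraction, rng):
    """Missing completely at random: every cell is masked with the same probability."""
    return [[rng.random() < fraction for _ in range(n_cols)]
            for _ in range(n_rows)]


def survey_specific_mask(n_rows, column_rates, rng):
    """Each cell is masked with the missing rate of the survey its column belongs to."""
    return [[rng.random() < rate for rate in column_rates]
            for _ in range(n_rows)]


def blockwise_mask(n_rows, column_rates, rng):
    """Assumed literal reading of the caption: a column is masked for all rows
    at once, with probability equal to its survey's missing rate."""
    dropped = [rng.random() < rate for rate in column_rates]
    return [list(dropped) for _ in range(n_rows)]


def mean_impute(data, mask):
    """Stand-in imputer (not one of the benchmarked methods): fill each masked
    cell with the mean of the observed cells in the same column."""
    n_cols = len(data[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row, m in zip(data, mask) if not m[j]]
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if m[j] else row[j] for j in range(n_cols)]
            for row, m in zip(data, mask)]


def rmse_on_masked(truth, imputed, mask):
    """Root mean squared error over the masked cells only."""
    squared = [(t - p) ** 2
               for t_row, p_row, m_row in zip(truth, imputed, mask)
               for t, p, m in zip(t_row, p_row, m_row) if m]
    if not squared:
        raise ValueError("mask selects no cells")
    return math.sqrt(sum(squared) / len(squared))


# Toy run on synthetic data; the rates are hypothetical per-column values.
rng = random.Random(0)
truth = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(500)]
rates = [0.1, 0.2, 0.3, 0.4]
mask = survey_specific_mask(len(truth), rates, rng)
score = rmse_on_masked(truth, mean_impute(truth, mask), mask)
```

Because the masks are applied to a dataset with no missing values, the true value of every masked cell is known, which is what allows RMSE to be computed at all.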

data to generate a dataset comprising complete observations, (2) setting up the simulation scenarios for three
missing data mechanisms including random missingness, survey-specific missing rates, and blockwise missingness
with survey-specific missing rates, (3) conducting the missing data imputation, and (4) evaluating the
performance of each model.

2.1. Data source and preprocessing

The dataset used in this study is based on SPARK phenotype V8, consisting of 117,099 participants with autism
and 363 variables. It contains information extracted from standardized surveys and parent-reported medical
history regarding children with autism. The following eight surveys with <80% missing rates in the full dataset
(Table 1) were included in the missing data imputation assessment: individuals registration, basic medical
screening, background history, SCQ, RBS-R, DCDQ, Child Behavior Checklist, and area deprivation index.

Table 1. Percentage of subjects who did not complete each individual survey among all 117,099 participants with
autism in SPARK

Survey name                 Percentage of subjects who did not complete corresponding survey (%)
Individuals registration    0
Basic medical screening     39.9
Background history          59.3
Area deprivation index      35.1
SCQ                         51.3
RBS-R                       63.8
DCDQ                        72.9
Vineland                    82.2
Intelligence quotient       95.3
CBCL                        99.6

Note: SCQ (Social Communication Questionnaire), RBS-R (Repetitive Behavior Scale-Revised), and DCDQ
(Developmental Coordination Disorder Questionnaire) are surveys commonly used to quantify mental and
behavioral functions at scale.
Abbreviation: CBCL: Child Behavior Checklist.

This dataset was first filtered to remove variables with extreme rates of missingness (~90% or greater),
resulting in a drop of 22 variables. The dataset was then modified to remove any rows with missing information.
This resulted in 15,196 participants with autism and 347 variables.

One-hot encoding was used to transform the categorical variables in this dataset, resulting in 15,196

Volume 2 Issue 1 (2025) 83 doi: 10.36922/aih.4406
