Artificial Intelligence in Health Benchmarking ML imputation in mental health surveys

participants with autism and 431 variables. The preProcess function from the caret package in R was used to normalize each variable to a mean of 0 and a standard deviation of 1. This was mainly to allow root mean squared error (RMSE) metrics, which are commonly used in similar studies,21,23,24 to be comparable across all variables.

This preprocessed complete dataset of participants with autism was used to simulate different missing data mechanisms and assess the accuracy of various imputation methods.

2.2. Three simulation scenarios for missing data mechanisms

Three simulation scenarios were constructed for missing data mechanisms in mental and behavioral surveys, as outlined in Figure 2.

2.2.1. MCAR

The first missing data simulation scenario, referred to as MCAR, introduces missingness completely at random by converting a specific percentage of the preprocessed complete dataset to missing. To observe the imputation performance as the missing rate gradually increases, MCAR was implemented with missing rates from 10% to 90% in 10% intervals for all variables in the dataset.

2.2.2. MNAR: SMR

The second missing data simulation scenario is SMR, in which the proportion of missing values in each column depends on the survey type to which it belongs. SMR is tailored to mirror the missing rates in the full SPARK dataset by reusing the same proportions of missing values for each survey (Table 1).

2.2.3. MNAR: BSMR

The last missing data simulation scenario, referred to as BSMR, incorporates blockwise missingness with survey-specific missing rates. Instead of randomly selecting a specific portion of each column to be converted to missing as in SMR, a proportion of participants is randomly selected to have completely missing values for all surveys of a particular survey type. In other words, every column of a specific survey type contains the same missing rows. This resembles real data more closely, since subjects often skip an entire survey.

2.3. Machine learning imputation

For each missing data simulation scenario described in the previous section, multiple machine learning models were used to impute the missing values. The generated incomplete datasets were passed through the following imputation algorithms to compute the predicted values. A separate set of 10 datasets with 20% randomly selected missing values was used to conduct hyperparameter tuning on each of these models.

2.3.1. MICE

This study used the MICE (version 3.16.0) package in R,13 which employs a multiple imputation model. It uses a

Figure 2. Visualization of the three missing data simulation scenarios explored in this study. On the left is missing completely at random (MCAR) with a 40% missing rate. In the middle is survey-specific missing rate (SMR) with a 20% missing rate for Survey 1 and an 80% missing rate for Survey 2. On the right is blockwise survey-specific missing rate (BSMR) with a 20% missing rate for Survey 1 and an 80% missing rate for Survey 2.
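
To make the difference between the three scenarios concrete, the following is a minimal Python sketch of how such missingness masks could be generated. It is an illustration only and not the authors' code: the study was carried out in R, the function names (mcar, smr, bsmr), the list-of-rows data layout, and the toy survey labels are all hypothetical, and the choice to blank an exact count of cells (rather than drawing each cell with a fixed probability) is an assumption, since the text does not specify it. The toy rates reuse the values shown in Figure 2.

```python
import random


def mcar(rows, rate, rng):
    """Blank a fixed share of all cells, chosen uniformly over the whole table."""
    n, p = len(rows), len(rows[0])
    out = [list(r) for r in rows]  # copy so the complete data stay intact
    for cell in rng.sample(range(n * p), round(rate * n * p)):
        out[cell // p][cell % p] = None
    return out


def smr(rows, col_survey, survey_rate, rng):
    """Blank each column independently, at the rate of the survey it belongs to."""
    n = len(rows)
    out = [list(r) for r in rows]
    for j, survey in enumerate(col_survey):
        for i in rng.sample(range(n), round(survey_rate[survey] * n)):
            out[i][j] = None
    return out


def bsmr(rows, col_survey, survey_rate, rng):
    """Blank whole survey blocks: one draw of participants per survey,
    applied to every column of that survey."""
    n = len(rows)
    out = [list(r) for r in rows]
    for survey, rate in survey_rate.items():
        cols = [j for j, s in enumerate(col_survey) if s == survey]
        for i in rng.sample(range(n), round(rate * n)):
            for j in cols:
                out[i][j] = None
    return out


# Toy example (hypothetical data): 10 participants, two surveys of two columns each.
rng = random.Random(0)
complete = [[float(i + j) for j in range(4)] for i in range(10)]
col_survey = ["Survey 1", "Survey 1", "Survey 2", "Survey 2"]
survey_rate = {"Survey 1": 0.2, "Survey 2": 0.8}

for name, masked in [
    ("MCAR", mcar(complete, 0.4, rng)),
    ("SMR", smr(complete, col_survey, survey_rate, rng)),
    ("BSMR", bsmr(complete, col_survey, survey_rate, rng)),
]:
    per_column = [sum(row[j] is None for row in masked) for j in range(4)]
    print(name, "missing cells per column:", per_column)
```

In this sketch, SMR and BSMR produce the same number of missing cells per column. They differ only in that BSMR reuses one set of participants for every column of a survey, so the missing rows line up within each survey block.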

Volume 2 Issue 1 (2025) 84 doi: 10.36922/aih.4406
