
Artificial Intelligence in Health Benchmarking ML imputation in mental health surveys

concept called fully conditional specification, in which each incomplete variable is imputed by a different model. It generates multiple imputed datasets that are averaged to retrieve the final imputed data. Since MICE employs a regression-based approach, hyperparameter tuning was not performed.

2.3.2. KNN

KNNImputer is a method in Python's Scikit-learn package²⁵ (version 0.22) and was used to study the KNN algorithm. KNNImputer predicts each sample's missing values using the average value from the closest data points in the training set. Hyperparameter tuning was used to select the optimal value for the number of nearest neighbors used during imputation.

2.3.3. MissForest

MissForest (version 1.5) is an R package¹⁶ which uses a random forest approach to impute missing values, building multiple decision trees to make predictions using the other remaining features. By averaging several classification or regression trees, MissForest employs out-of-bag error estimates and can capture complex, non-linear relationships. Hyperparameter tuning was used to select the optimal values for the number of trees and maximum number of iterations.

2.3.4. MIDAS

MIDASpy (version 1.3.1) is a Python package²⁶ that was used to study the MIDAS algorithm. It introduces additional missing values into a given dataset and restores these values using an unsupervised neural network called a denoising autoencoder. Then, the resulting model is used to predict the values of the original missing data. Similar to MICE, MIDASpy generates multiple imputed datasets that are averaged to retrieve the final imputed data. Hyperparameter tuning was used to select the optimal values for the input drop, layer structure, and number of epochs.

2.4. Evaluation of imputation performance

For each missing data simulation scenario, missingness was introduced into the complete dataset 10 different times as 10 separate trials. The values in Table 1 correspond to the percentage of subject IDs in the full dataset (with missing values among participants with autism) who are not present in each specific survey. These missing rates were used when generating the missing datasets for the SMR and BSMR simulation scenarios.

The four models were used to impute the missing data, and these imputed values were compared with the true values in the preprocessed complete dataset. In each imputation trial, the RMSE values were calculated for each column using the postResample method from the caret package (version 6.0-94) in R. To retrieve the RMSE value for an imputed column, the following formula was used:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}$$

where $\hat{y}_i$ are predicted values, $y_i$ are observed values, and $n$ is the total number of imputed items in the column. As indicated by the equation, the square of the difference between the predicted and observed value was summed across each item in the column that was imputed. This value was then divided by the total number of imputed items and the square root of this value was stored as the column's RMSE.

These column-specific RMSEs were averaged across all columns in the dataset. Then, these RMSEs were again averaged across the 10 trials for each simulation setting. This resulted in a mean overall RMSE for each simulation scenario. These error values were then compared for every simulation scenario between each imputation method.

The SCQ summary score, RBS-R summary score, and DCDQ summary score evaluate the social communication function, the severity of repetitive behaviors, and motor functions, respectively, in study participants with autism. They were calculated based on the corresponding questionnaires. The RMSE values of these specific mental and behavioral summary scores were also compared between the four imputation methods across each simulation scenario.

Finally, the total computation time was assessed for the four imputation methods during the BSMR simulation scenario, which was chosen since it is closest in nature to missingness in real survey data.

3. Results

3.1. Overview of full dataset and missingness patterns

The full dataset used in this study consists of 117,099 study participants with autism. Slightly more than half of the participants (51.3%) did not complete the SCQ survey, which screens for social functioning; 63.8% did not complete the RBS-R survey on repetitive behaviors; and 72.9% did not complete the DCDQ survey on motor functions (Table 1).

A total of 34,067 participants have medium missing rates between 20% and 80% among 363 total questions (Table 2), 37,710 participants exhibit low missing rates (<20%), and 45,322 participants exhibit high missing rates (>80%, Table 2).

When compared to female participants, there are slightly more male participants with high and low missing
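
The RMSE aggregation described in Section 2.4 (a per-column RMSE over the imputed items, averaged across columns, then averaged across trials) can be illustrated with a minimal Python sketch. This is not the authors' code: the study computed RMSE with caret's postResample in R, whereas the function names `column_rmse`, `trial_rmse`, and `scenario_rmse`, the dictionary layout, and the toy values below are all hypothetical and chosen only for illustration.

```python
import math
from statistics import mean


def column_rmse(predicted, observed):
    """RMSE over the imputed items of a single column."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def trial_rmse(columns):
    """Average the column-specific RMSEs within one imputation trial.

    `columns` maps a column name to a (predicted, observed) pair that
    holds only the entries that were masked and then imputed.
    """
    return mean(column_rmse(p, o) for p, o in columns.values())


def scenario_rmse(trials):
    """Average the per-trial RMSEs across all trials of one scenario."""
    return mean(trial_rmse(t) for t in trials)


# Toy data (made up): two trials, two imputed columns each.
trial_a = {
    "item_1": ([1.0, 2.0], [1.0, 4.0]),
    "item_2": ([3.0, 3.0], [0.0, 0.0]),
}
trial_b = {
    "item_1": ([0.0, 0.0], [0.0, 0.0]),
    "item_2": ([1.0, 1.0], [2.0, 2.0]),
}
overall = scenario_rmse([trial_a, trial_b])
print(f"mean overall RMSE: {overall:.4f}")
```

In the study this aggregation would run over 10 trials per simulation scenario and once per imputation method, so that the resulting mean overall RMSE values can be compared between methods.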

Volume 2 Issue 1 (2025) 85 doi: 10.36922/aih.4406
