Artificial Intelligence in Health Benchmarking ML imputation in mental health surveys
of Us have empowered researchers to investigate the genetic and environmental risk factors associated with mental and behavioral disorders among more than 100,000 subjects.¹⁻³

Self-reported surveys and questionnaires such as the social communication questionnaire (SCQ),⁴ repetitive behavior scale-revised (RBS-R),⁵ and developmental coordination disorder questionnaire (DCDQ)⁶ are commonly used to quantify mental and behavioral functions at scale. These questionnaires typically consist of a series of related questions and measure responses on Likert scales, which are ordinal scales with a natural order or rank that indicates the level of agreement.⁷

However, missingness commonly occurs in the responses to these surveys and questionnaires. The reasons include inapplicable or ambiguous questions and characteristics of the participants themselves, including reluctance to answer sensitive questions, incomplete knowledge, and lack of time.⁸ Missingness can also arise at the source level. Specifically, data may have been curated from varying sources with different instrument administration protocols. Certain questions in the survey also may not be relevant to specific demographic groups, such as those that might not apply to young children.

Common types of missing data include missing completely at random (MCAR) and missing not at random (MNAR), with either specific parts of surveys or entire surveys being incomplete.⁹ In MCAR, the probability of missingness is independent of both the observed and unobserved data. Missing at random (MAR) is a broader class than MCAR in which missingness is related to the observed but not the unobserved data. In contrast, the probability of missingness in MNAR data depends on the unobserved missing values. When participants skip entire questionnaires because of unobserved factors, a form of MNAR missingness referred to as blockwise missingness arises. Blockwise missingness occurs when all responses belonging to the same survey are missing simultaneously for the same participants, forming clustered missing blocks in the overall phenotypic data.

The simplest solution to address blockwise missingness in mental and behavioral questionnaires is to drop participants with missing surveys.¹⁰ However, this option leads to a significant loss of information, reduced sample size, and loss of statistical power when analyzing mental and behavioral questionnaires in biobank data. Another commonly used approach is to impute missing data using statistical and computational methods. Mean, median, and mode substitutions are basic imputation approaches that maintain the original sample size but can lead to biased inferences.¹¹ Specifically, participants who skip certain questionnaires may exhibit different characteristics than those who complete the questionnaires.¹² More advanced imputation approaches using statistical and computational methods are needed to accurately impute mental and behavioral surveys with blockwise missingness.

Here, four commonly used missing data imputation methods were employed: multivariate imputation by chained equations (MICE), K-nearest neighbors (KNN), non-parametric missing value imputation using random forest (MissForest), and multiple imputation with denoising autoencoders (MIDAS).¹³⁻¹⁶ MICE is one of the most popular methods of multiple imputation, originally developed in the early 2000s.¹³ This approach uses a series of regression models to predict each variable with missingness using the remaining variables in the data. KNN is a supervised machine learning algorithm commonly used when the distribution of the data is unknown or difficult to determine.¹⁴ This method predicts missing values by averaging the K nearest data points. MissForest is a missing data imputation method based on random forests, developed in 2012.¹⁵ It predicts missing values based on random forest models trained on the complete dataset and imputes missing values iteratively. MIDAS uses a type of unsupervised neural network to predict missing values by reducing the dimensions of the observed data and reconstructing the missing data.¹⁶ MIDAS was developed in 2022 and has shown high accuracy and computational efficiency in systematic tests on simulated and real social science data.¹⁷

Previous studies have not systematically reviewed recently developed machine learning-based imputation methods for databases tied to mental and behavioral health surveys.¹⁸⁻²² Most psychiatric studies use multiple imputation for handling missing data and have not taken advantage of the latest machine learning-based imputation techniques.¹⁸⁻²² In addition, they have not focused on assessing imputation accuracy in surveys with blockwise missing structures.¹⁸⁻²² This study systematically examines the imputation performance and computational time of these four commonly used missing data imputation methods (MICE, KNN, MissForest, and MIDAS) in the presence of blockwise missingness in mental and behavioral surveys. It uses data from SPARK, a large-scale autism research study that collects social functioning and behavioral surveys from over 117,000 participants. This study assesses imputation models on both MCAR and MNAR data, identifying the optimal method for each type of missingness pattern. This study conducts a novel exploration of these methods while also addressing the commonly encountered blockwise missingness pattern.

2. Methods

Figure 1 outlines the sample selection and workflow of the study. The four major steps included: (1) preprocessing the
Volume 2 Issue 1 (2025) 82 doi: 10.36922/aih.4406

