
Artificial Intelligence in Health                          Benchmarking ML imputation in mental health surveys



of Us have empowered researchers to investigate the genetic and environmental risk factors associated with mental and behavioral disorders among more than 100,000 subjects.¹⁻³ Self-reported surveys and questionnaires such as the social communication questionnaire (SCQ),⁴ repetitive behavior scale-revised (RBS-R),⁵ and developmental coordination disorder questionnaire (DCDQ)⁶ are commonly used to quantify mental and behavioral functions at scale. These questionnaires typically consist of a series of related questions and measure responses using ordinal scales with a natural order or rank to indicate the level of agreement, known as Likert scales.⁷

However, missingness commonly occurs in the responses to these surveys and questionnaires. The reasons include inapplicable or ambiguous questions, and characteristics of the participants themselves, including reluctance to answer sensitive questions, incomplete knowledge, and lack of time.⁸ Missingness can also arise at the source level. Specifically, data may have been curated from varying sources with different administered instrument protocols. Certain questions in the survey also may not be relevant to specific demographic groups, such as those that might not apply to young children.

Common types of missing data include missing completely at random (MCAR) and missing not at random (MNAR), with either specific parts of surveys or entire surveys being incomplete.⁹ In MCAR, the probability of missingness is independent of the observed and unobserved data. Missing at random (MAR) is a broader class than MCAR in which the missingness is related to the observed but not the unobserved data. On the other hand, the probability of missingness in MNAR data depends on the unobserved missing values. Typically, participants tend to skip entire questionnaires due to unobserved factors, and a form of MNAR missingness referred to as blockwise missingness arises. Blockwise missingness occurs when all responses belonging to the same survey are missing simultaneously for the same participants, forming clustered missing blocks in the overall phenotypic data.

The simplest solution to address blockwise missingness in mental and behavioral questionnaires is to drop participants with missing surveys.¹⁰ However, this option leads to a significant loss of information, reduced sample size, and loss of statistical power when analyzing mental and behavioral questionnaires in biobank data. Another commonly used approach is to impute missing data using statistical and computational methods. Mean, median, and mode substitutions are basic imputation approaches that maintain the original sample size but can lead to biased inferences.¹¹ Specifically, participants who skip certain questionnaires may exhibit different characteristics than those who complete the questionnaires.¹² More advanced imputation approaches using statistical and computational methods are needed to accurately impute mental and behavioral surveys with blockwise missingness.

Here, four commonly used missing data imputation methods were employed: multivariate imputation by chained equations (MICE), K-nearest neighbors (KNN), non-parametric missing value imputation using random forest (MissForest), and multiple imputation with denoising autoencoders (MIDAS).¹³⁻¹⁶ MICE is one of the most popular methods of multiple imputation, originally developed in the early 2000s.¹³ This approach uses a series of regression models to predict each variable with missingness using the remaining variables in the data.¹⁴ KNN is a supervised machine learning algorithm commonly used when the distribution of the data is unknown or difficult to determine.¹⁵ This method performs predictions on the missing data by averaging the K-nearest data points. MissForest is a missing data imputation method based on a random forest, developed in 2012. It predicts missing values based on random forest models trained on the complete dataset and imputes missing values iteratively.¹⁶ MIDAS uses a type of unsupervised neural network to predict missing values in the data by reducing the dimensions in the observed data and reconstructing the missing data. MIDAS was developed in 2022 and has shown high accuracy and computational efficiency in systematic tests on simulated and real social science data.¹⁷

Previous studies have not systematically reviewed machine learning-based imputation methods recently developed for the databases tied to mental and behavioral health surveys.¹⁸⁻²² Most psychiatric studies use multiple imputation for handling missing data and have not taken advantage of the latest machine learning-based imputation techniques.¹⁸⁻²² In addition, they have not focused on assessing imputation accuracy in surveys with blockwise missing structures.¹⁸⁻²² This study systematically examines the imputation performance and computational time of these four commonly used missing data imputation methods (MICE, KNN, MissForest, and MIDAS) in the presence of blockwise missingness in mental and behavioral surveys. It uses data from SPARK, a large-scale autism research study that collects social functioning and behavioral surveys from over 117,000 participants. This study assesses imputation models on both MCAR and MNAR data, identifying the optimal method for each type of missingness pattern. This study conducts a novel exploration of these methods while also addressing the commonly encountered blockwise missingness pattern.

2. Methods

Figure 1 outlines the sample selection and workflow of the study. The four major steps included: (1) preprocessing the
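
Illustrative sketch (not part of the study's pipeline). The missingness mechanisms and the KNN approach described above can be made concrete with a small standard-library Python example. Everything here is an assumption made for illustration: the function names, the 0-3 response scale, the skip-probability formula, and the simplified distance (mean squared difference over items both respondents answered). It does not reproduce the MICE, KNN, MissForest, or MIDAS implementations benchmarked in this article.

```python
import random


def make_likert_data(n_rows, n_items, seed=0):
    """Simulate complete responses on a 0-3 ordinal scale (hypothetical data)."""
    rng = random.Random(seed)
    return [[rng.randint(0, 3) for _ in range(n_items)] for _ in range(n_rows)]


def mask_mcar(data, rate, seed=0):
    """MCAR: hide each cell independently with the same probability."""
    rng = random.Random(seed)
    return [[None if rng.random() < rate else v for v in row] for row in data]


def mask_blockwise_mnar(data, block_cols, seed=0):
    """Blockwise MNAR: hide a whole questionnaire (block_cols) at once, with a
    skip probability that rises with the respondent's own hidden scores."""
    rng = random.Random(seed)
    out = [row[:] for row in data]
    for row in out:
        block_mean = sum(row[j] for j in block_cols) / len(block_cols)
        p_skip = 0.1 + 0.2 * block_mean  # assumed: 0.1 at score 0, 0.7 at score 3
        if rng.random() < p_skip:
            for j in block_cols:
                row[j] = None
    return out


def row_distance(a, b):
    """Mean squared difference over items observed in both rows, else None."""
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return None
    return sum((x - y) ** 2 for x, y in shared) / len(shared)


def knn_impute(data, k=3):
    """Fill each missing cell with the rounded mean of that item among the
    k nearest respondents who answered it. Observed cells are left unchanged."""
    filled = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            neighbors = []
            for i2, other in enumerate(data):
                if i2 == i or other[j] is None:
                    continue
                d = row_distance(row, other)
                if d is not None:
                    neighbors.append((d, other[j]))
            neighbors.sort(key=lambda t: t[0])
            nearest = [v for _, v in neighbors[:k]]
            if nearest:  # stays None if no respondent can serve as a donor
                filled[i][j] = round(sum(nearest) / len(nearest))
    return filled


# Toy check: the two nearest rows to row 0 both answered 2 on the last item.
toy = [[1, 1, None], [1, 1, 2], [1, 1, 2], [3, 3, 0]]
print(knn_impute(toy, k=2)[0])  # [1, 1, 2]

# Blockwise example: items 3-5 form one questionnaire that is skipped as a unit.
full = make_likert_data(200, 6, seed=1)
masked = mask_blockwise_mnar(full, block_cols=[3, 4, 5], seed=1)
imputed = knn_impute(masked, k=5)
```

In the blockwise case, a respondent who skipped the questionnaire can be matched to neighbors only through items outside the missing block, which is one reason this pattern is harder to impute than scattered MCAR gaps.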


            Volume 2 Issue 1 (2025)                         82                               doi: 10.36922/aih.4406