
Artificial Intelligence in Health                          Benchmarking ML imputation in mental health surveys



concept called fully conditional specification, in which each incomplete variable is imputed by a different model. It generates multiple imputed datasets that are averaged to retrieve the final imputed data. Since MICE employs a regression-based approach, hyperparameter tuning was not performed.

2.3.2. KNN

KNNImputer is a method in Python's Scikit-learn package²⁵ (version 0.22) and was used to study the KNN algorithm. KNNImputer predicts each sample's missing values using the average value from the closest data points in the training set. Hyperparameter tuning was used to select the optimal value for the number of nearest neighbors used during imputation.

2.3.3. MissForest

MissForest¹⁶ (version 1.5) is an R package that uses a random forest approach to impute missing values, building multiple decision trees to make predictions using the other remaining features. By averaging several classification or regression trees, MissForest can capture complex, non-linear relationships, and it employs out-of-bag error estimates. Hyperparameter tuning was used to select the optimal values for the number of trees and the maximum number of iterations.

2.3.4. MIDAS

MIDASpy²⁶ (version 1.3.1) is a Python package that was used to study the MIDAS algorithm. It introduces additional missing values into a given dataset and restores these values using an unsupervised neural network called a denoising autoencoder. Then, the resulting model is used to predict the values of the original missing data. Similar to MICE, MIDASpy generates multiple imputed datasets that are averaged to retrieve the final imputed data. Hyperparameter tuning was used to select the optimal values for the input drop, layer structure, and number of epochs.

2.4. Evaluation of imputation performance

For each missing data simulation scenario, missingness was introduced into the complete dataset 10 different times as 10 separate trials. The values in Table 1 correspond to the percentage of subject IDs in the full dataset (with missing values among participants with autism) that are not present in each specific survey. These missing rates were used when generating the missing datasets for the SMR and BSMR simulation scenarios.

The four models were used to impute the missing data, and these imputed values were compared with the true values in the preprocessed complete dataset. In each imputation trial, the RMSE values were calculated for each column using the postResample function from the caret package (version 6.0-94) in R. To retrieve the RMSE value for an imputed column, the following formula was used:

$$\mathrm{RMSE} = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_i - y_i)^2}{n}}$$

Where $\hat{y}_i$ are predicted values, $y_i$ are observed values, and $n$ is the total number of imputed items in the column. As indicated by the equation, the square of the difference between the predicted and observed value was summed across each item in the column that was imputed. This value was then divided by the total number of imputed items and the square root of this value was stored as the column's RMSE.

These column-specific RMSEs were averaged across all columns in the dataset. Then, these RMSEs were again averaged across the 10 trials for each simulation setting. This resulted in a mean overall RMSE for each simulation scenario. These error values were then compared for every simulation scenario between each imputation method.

SCQ summary score, RBS-R summary score, and DCDQ summary score evaluate the social communication function, severity of repetitive behaviors, and motor functions, respectively, in study participants with autism. They were calculated based on the corresponding questionnaires. The RMSE values of these specific mental and behavioral summary scores were also compared between the four imputation methods across each simulation scenario.

Finally, the total computation time was assessed for the four imputation methods during the BSMR simulation scenario, which was chosen because it is closest in nature to missingness in real survey data.

3. Results

3.1. Overview of full dataset and missingness patterns

The full dataset used in this study consists of 117,099 study participants with autism. Slightly more than half of the participants (51.3%) did not complete the SCQ survey, which screens for social functioning; 63.8% did not complete the RBS-R survey on repetitive behaviors; and 72.9% did not complete the DCDQ survey on motor functions (Table 1). A total of 34,067 participants have medium missing rates between 20% and 80% among 363 total questions (Table 2), 37,710 participants exhibit low missing rates (<20%), and 45,322 participants exhibit high missing rates (>80%, Table 2).

When compared to female participants, there are slightly more male participants with high and low missing


            Volume 2 Issue 1 (2025)                         85                               doi: 10.36922/aih.4406
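The RMSE evaluation described in Section 2.4 (a per-column RMSE over the imputed items, averaged across columns and then across trials) can be illustrated with a minimal sketch. The study itself used the postResample function from the caret package in R; the Python code below is only an illustration under stated assumptions. The function names `column_rmse` and `mean_overall_rmse`, the data layout, and the toy values are hypothetical and do not come from the study.

```python
import math


def column_rmse(predicted, observed):
    """RMSE over the imputed items of a single column (hypothetical helper)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    # Sum of squared differences between predicted and observed values.
    squared_error = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    # Divide by the number of imputed items, then take the square root.
    return math.sqrt(squared_error / len(predicted))


def mean_overall_rmse(trials):
    """Average column RMSEs within each trial, then average across trials.

    `trials` is an assumed layout: a list of dicts mapping a column name
    to a (predicted, observed) pair holding only the imputed entries.
    """
    per_trial = []
    for trial in trials:
        col_rmses = [column_rmse(p, o) for p, o in trial.values()]
        per_trial.append(sum(col_rmses) / len(col_rmses))
    return sum(per_trial) / len(per_trial)


# Toy values for illustration only (not study data).
trial_1 = {"item_a": ([1.0, 2.0], [1.0, 4.0]), "item_b": ([3.0], [0.0])}
trial_2 = {"item_a": ([0.0, 0.0], [0.0, 0.0]), "item_b": ([1.0], [2.0])}
overall = mean_overall_rmse([trial_1, trial_2])
print(round(overall, 4))
```

In this sketch each trial contributes equally to the final mean, mirroring the two-stage averaging described in the text (columns first, then trials).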