
Artificial Intelligence in Health                          Benchmarking ML imputation in mental health surveys



Methods) was first obtained. Around 78% of participants with complete data are male and 22% are female. The male-to-female ratio is 3.5:1, which aligns with the sex ratio among subjects with autism in the general population. About half of the individuals with complete data are between 6 and 11 years of age. Only 0.4% of subjects are under 2 years of age, while none are above 18. About 79% of participants self-identified as white. The category with the second largest number of participants is “Multiple Races” (10.9%), followed by African American (4.3%), “Other” (3.5%), and Asian (2.2%). Participants who are Native American or Native Hawaiian account for less than 1%. In the preprocessed complete dataset, the SCQ, RBS-R, and DCDQ scores have average values of 21.72, 35.16, and 37.87, respectively.

All variables were standardized to a mean of zero and a standard deviation of 1 so that the imputation error, calculated as RMSE, can be interpreted as the average deviation of the predicted scores from the true scores in units of standard deviation. To assess the performance of the missing data imputation methods, missing values were introduced into the preprocessed complete dataset of 15,196 participants with autism. First, to simulate the MCAR scenario, a random subset of values across the entire dataset was converted to missing values. Ten incomplete datasets were generated for each missingness percentage (10% to 90%). Second, to examine the performance of the imputation methods on MNAR patterns, 10 incomplete datasets were randomly generated for each of the SMR and BSMR simulation scenarios. When doing so, the missing rates in the original SPARK dataset were used (Table 1) to reflect the missingness distribution present in the real data.

3.3. Performance of imputation on overall dataset

The four imputation methods were applied to the incomplete datasets in each of the three simulation scenarios (Figure 3). The imputed values were compared with the actual values in the complete dataset, and the RMSE values were calculated. RMSE can be interpreted as the average deviation of the predicted scores from the true scores in units of standard deviation since all variables were standardized. Lower RMSE values correspond to higher accuracy in missing value imputation.

In the MCAR scenario, the imputation error for all models generally rose as the missing rate increased. MissForest had the lowest overall RMSE (ranging between 0.73 and 1.0), outperforming the other methods, especially when the missing rate was low (Figure 3, left panel). However, as the percentage of missing values increased, the performance of KNN and MIDAS became comparable to that of MissForest. MICE outperformed KNN and MIDAS between 20% and 60% of random missingness but performed considerably worse than all other models for the remaining missing rates.

In the MNAR scenarios, all models exhibited a higher imputation error in the BSMR scenario than in the SMR scenario. MissForest produced the lowest error in the SMR scenario, with an RMSE of 0.83, but did not perform as well in the BSMR scenario, which simulated blockwise missingness. MissForest also exhibited larger variations in

Figure 3. Evaluation of imputation performance based on overall root mean squared error (RMSE). Overall RMSE values across the 10 trials in the missing completely at random (MCAR) simulation scenario (left). Overall RMSE values across the 10 missing not at random (MNAR) trials in the survey-specific missing rate (SMR) and blockwise missingness with survey-specific missing rate (BSMR) simulation scenarios (right).
Abbreviations: KNN: K-Nearest Neighbors; MICE: Multiple Imputation by Chained Equations; MIDAS: Multiple Imputation with Denoising Autoencoders; MissForest: Non-parametric missing value imputation using Random Forest.
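
The evaluation loop described in the text (standardize, hide values completely at random, impute, then score the hidden cells with RMSE) can be illustrated with a minimal sketch. It is not the authors' code. The toy data, the function names (`standardize`, `mask_mcar`, `impute_column_mean`, `rmse_on_hidden`), and the column-mean imputer are all assumptions made for illustration; the column-mean fill merely stands in for KNN, MICE, MIDAS, or MissForest.

```python
import math
import random
import statistics


def standardize(rows):
    """Scale each column to mean 0 and (population) standard deviation 1."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def mask_mcar(rows, rate, rng):
    """Return a copy with a random `rate` fraction of all cells set to None."""
    cells = [(i, j) for i in range(len(rows)) for j in range(len(rows[0]))]
    hidden = set(rng.sample(cells, round(rate * len(cells))))
    masked = [[None if (i, j) in hidden else v for j, v in enumerate(row)]
              for i, row in enumerate(rows)]
    return masked, hidden


def impute_column_mean(masked):
    """Placeholder imputer (an assumption): fill gaps with the observed column mean."""
    fills = []
    for col in zip(*masked):
        observed = [v for v in col if v is not None]
        fills.append(statistics.fmean(observed) if observed else 0.0)
    return [[fills[j] if v is None else v for j, v in enumerate(row)]
            for row in masked]


def rmse_on_hidden(truth, imputed, hidden):
    """RMSE computed over the artificially hidden cells only."""
    squared = [(truth[i][j] - imputed[i][j]) ** 2 for i, j in hidden]
    return math.sqrt(statistics.fmean(squared))


rng = random.Random(0)
# Hypothetical stand-in for the complete dataset: 200 rows, 5 numeric variables.
toy = [[rng.gauss(50, 10) for _ in range(5)] for _ in range(200)]
complete = standardize(toy)

results = {}
for rate in (0.1, 0.5, 0.9):
    trial_errors = []
    for _ in range(10):  # ten incomplete datasets per missing rate
        masked, hidden = mask_mcar(complete, rate, rng)
        imputed = impute_column_mean(masked)
        trial_errors.append(rmse_on_hidden(complete, imputed, hidden))
    results[rate] = statistics.fmean(trial_errors)
```

Because every variable has unit standard deviation, a column-mean fill tends to give an RMSE close to 1 in this sketch, which offers a rough reference point for reading RMSE values reported on standardized data.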


            Volume 2 Issue 1 (2025)                         87                               doi: 10.36922/aih.4406