
Artificial Intelligence in Health                          Benchmarking ML imputation in mental health surveys



participants with autism and 431 variables. The preProcess function from the caret package in R was used to normalize each variable to a mean of 0 and a standard deviation of 1. This was done mainly so that root mean squared error (RMSE), a metric commonly used in similar studies, would be comparable across all variables.²¹,²³,²⁴

This preprocessed complete dataset of participants with autism was used to simulate different missing data mechanisms and assess the accuracy of various imputation methods.

2.2. Three simulation scenarios for missing data mechanisms

Three simulation scenarios were constructed for missing data mechanisms in mental and behavioral surveys, as outlined in Figure 2.

2.2.1. MCAR

The first missing data simulation scenario, referred to as MCAR, introduces missingness completely at random by converting a specific percentage of the preprocessed complete dataset to missing. To observe the imputation performance as the missing rate gradually increases, MCAR was implemented with missing rates from 10% to 90% in 10% intervals for all variables in the dataset.

2.2.2. MNAR: SMR

The second missing data simulation scenario is SMR, in which the proportion of missing values in each column depends on the survey type to which the column belongs. SMR is tailored to mirror the missing rates in the full SPARK dataset by reusing the same proportions of missing values for each survey (Table 1).

2.2.3. MNAR: BSMR

The last missing data simulation scenario, referred to as BSMR, incorporates blockwise missingness with survey-specific missing rates. Instead of randomly selecting a specific portion of each column to be converted to missing as in SMR, a proportion of participants is randomly selected to have completely missing values for all surveys of a particular survey type. In other words, every column of a specific survey type contains the same missing rows. This resembles real data more closely, in which subjects skip an entire survey.

2.3. Machine learning imputation

For each missing data simulation scenario described in the previous section, multiple machine learning models were used to impute the missing values. The generated incomplete datasets were passed through the following imputation algorithms to compute the predicted values. A separate set of 10 datasets with 20% randomly selected missing values was used to conduct hyperparameter tuning on each of these models.

2.3.1. MICE

This study used the MICE¹³ (version 3.16.0) package in R, which employs a multiple imputation model. It uses a






























            Figure 2. Visualization of the three missing data simulation scenarios explored in this study. On the left is Missing Completely at Random with a 40%
            missing rate. In the middle is Survey-specific missing rate with a 20% missing rate for Survey 1 and 80% missing rate for Survey 2. On the right is blockwise
            survey-specific missing rate with a 20% missing rate for Survey 1 and 80% missing rate for Survey 2.
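
As an illustration only, the three scenarios shown in Figure 2 could be simulated roughly as follows. This is a minimal NumPy sketch and not the R code used in the study. The function names, the toy 100 × 6 matrix, the column-to-survey mapping, and the choice to sample MCAR cells over the whole matrix are all assumptions made for the example. The 40%, 20%, and 80% rates are taken from the Figure 2 caption.

```python
import numpy as np


def simulate_mcar(X, rate, rng):
    """MCAR sketch: blank out a fraction `rate` of all cells, chosen uniformly."""
    X_miss = X.astype(float)                  # astype returns a copy
    n_missing = int(round(rate * X.size))
    cells = rng.choice(X.size, size=n_missing, replace=False)
    X_miss.flat[cells] = np.nan
    return X_miss


def simulate_smr(X, survey_of_column, survey_rates, rng):
    """SMR sketch: each column loses the rate of its survey, rows drawn per column."""
    X_miss = X.astype(float)
    n_rows = X.shape[0]
    for j, survey in enumerate(survey_of_column):
        n_missing = int(round(survey_rates[survey] * n_rows))
        rows = rng.choice(n_rows, size=n_missing, replace=False)
        X_miss[rows, j] = np.nan
    return X_miss


def simulate_bsmr(X, survey_of_column, survey_rates, rng):
    """BSMR sketch: the same rows are blanked in every column of a survey."""
    X_miss = X.astype(float)
    n_rows = X.shape[0]
    surveys = np.asarray(survey_of_column)
    for survey, rate in survey_rates.items():
        n_missing = int(round(rate * n_rows))
        rows = rng.choice(n_rows, size=n_missing, replace=False)
        cols = np.flatnonzero(surveys == survey)
        X_miss[np.ix_(rows, cols)] = np.nan   # whole block for these participants
    return X_miss


# Toy data (hypothetical): 100 participants, two surveys of three columns each.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 6))
survey_of_column = ["survey_1"] * 3 + ["survey_2"] * 3
survey_rates = {"survey_1": 0.2, "survey_2": 0.8}

mcar = simulate_mcar(X, 0.4, rng)
smr = simulate_smr(X, survey_of_column, survey_rates, rng)
bsmr = simulate_bsmr(X, survey_of_column, survey_rates, rng)
```

In this sketch, SMR and BSMR produce the same per-column missing rates. They differ only in whether the missing rows are drawn independently for each column (SMR) or shared by all columns of a survey (BSMR).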


            Volume 2 Issue 1 (2025)                         84                               doi: 10.36922/aih.4406