
Artificial Intelligence in Health                        Complex early diagnosis of MS through machine learning



Specificity = TN / (TN + FP)

where TN = true negatives, and FP = false positives.

2.5. Statistical test

The statistical testing process started with a normal distribution check on all evaluation metrics for each model across all folds. We chose the Shapiro–Wilk test⁵⁵ due to its sensitivity in detecting deviations from normality in small sample sizes. On analyzing the results, we found that some metrics were not normally distributed, such as the recall of LGBM and the specificity of RF. Therefore, we proceeded with the non-parametric Friedman test⁵⁶ to compare the models' performance. This test is well suited to comparing metrics of multiple models across different folds without relying on the assumption of normal distribution. When the Friedman test indicated significant differences among model metrics, we further investigated these differences using the Nemenyi post hoc test⁵⁷ for pairwise comparisons between models.

2.6. Explainability

To explain our models, we leveraged SHAP, a highly regarded technique in explainable AI within the medical and healthcare domain,⁵⁸ to identify important factors influencing CDMS predictions. SHAP provides both global and local insights into feature importance, helping us understand overall model behavior as well as individual predictions. These techniques ensure that our models are not only accurate but also transparent, enhancing their trustworthiness for predicting CDMS from CIS.

For tree-based models such as CatBoost, XGBoost, LGBM, and RF, we used TreeExplainer with the parameter model_output set to "raw" and feature_perturbation set to "tree_path_dependent." This setup captures the raw output of the models before any logistic transformation is applied and uses the decision-tree structures to perturb features. For SVM, we applied KernelExplainer with the number of background samples selected automatically by SHAP; this ensures that the sample size provides a satisfactory approximation without requiring excessive computation time. LR was explained with LinearExplainer to suit the linear nature of the model; the explainer was also set to account for the correlation dependence of features during the explanation.

We calculated SHAP values for each row in the test set across all folds and models. SHAP interaction values are available only for the four tree-based models, as KernelExplainer and LinearExplainer do not support interactions. For each model, we first averaged the mean absolute SHAP values over all rows and folds, followed by min-max normalization to ensure the values were on a consistent scale. This process produced a matrix of 20 feature rows by six model columns, which allowed us to analyze the overall impact of each feature. Next, to rank the features, we calculated the mean of the normalized SHAP values across all six models for each feature and sorted the results. This step provides a clear view of which CIS features drive the prediction of CDMS across all ML models.

3. Results

Overall, we found that the gradient-boosted models – CatBoost, LGBM, and XGBoost – consistently outperformed the other models in predicting CDMS, though the performance differences were not statistically significant. CatBoost achieved the highest AUC and showed the best overall balance between precision and recall. Periventricular_MRI, Infratentorial_MRI, and Oligoclonal_Bands were the features with the strongest influence on how well the different ML models predict CDMS. Among these features, Oligoclonal_Bands and Periventricular_MRI showed a strong interaction.

3.1. Model performance

3.1.1. Model performance metrics

To compare model performance, we evaluated predictions across five-fold cross-validation, focusing on metrics such as AUC, ACC, F1 score, precision, recall, and specificity. Gradient-boosted tree models – CatBoost, LGBM, and XGBoost – consistently outperformed the others, likely due to their iterative error-correcting nature. These models achieved higher AUCs, a better precision-recall balance, and superior F1 scores, indicating stronger class separation and greater accuracy in predicting positive cases while reducing false positives and false negatives. Table 2 presents the mean metrics for each model across the five folds.

Table 2. Evaluation metrics for six machine learning models across five folds

Model     AUC     ACC     F1 score  Precision  Recall  Specificity
CatBoost  0.9312  0.8791  0.8675    0.8710     0.8640  0.8919
XGBoost   0.9202  0.8645  0.8514    0.8548     0.8480  0.8784
LGBM      0.9150  0.8791  0.8675    0.8710     0.8640  0.8919
RF        0.9097  0.8388  0.8295    0.8045     0.8560  0.8243
SVM       0.8985  0.8168  0.8031    0.7907     0.8160  0.8176
LR        0.8922  0.8132  0.7935    0.8033     0.7840  0.8378

Note: Boldface values indicate the highest value in each column.
Abbreviations: CatBoost: Categorical boosting; LGBM: Light gradient boosting machine; LR: Logistic regression; RF: Random forest; SVM: Support vector machine; XGBoost: Extreme gradient boosting; AUC: Area under the curve.
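The statistical workflow described in Section 2.5 (a per-model normality check, then a non-parametric omnibus test across folds) can be sketched as follows. This is a minimal illustration on synthetic fold-wise scores, not the study's code; the Nemenyi post hoc step is noted only in a comment because it typically requires the third-party scikit-posthocs package.

```python
import numpy as np
from scipy.stats import shapiro, friedmanchisquare

rng = np.random.default_rng(42)

# Synthetic per-fold AUCs for six models across five folds
# (illustrative stand-ins; real values would come from cross-validation).
scores = {
    "CatBoost": rng.normal(0.93, 0.01, 5),
    "XGBoost":  rng.normal(0.92, 0.01, 5),
    "LGBM":     rng.normal(0.915, 0.01, 5),
    "RF":       rng.normal(0.91, 0.01, 5),
    "SVM":      rng.normal(0.90, 0.01, 5),
    "LR":       rng.normal(0.89, 0.01, 5),
}

# 1. Shapiro-Wilk normality check per model (sensitive for small samples).
normality = {model: shapiro(vals).pvalue for model, vals in scores.items()}

# 2. Friedman test: non-parametric comparison of all six models over the
#    same five folds, with no normality assumption.
stat, p_value = friedmanchisquare(*scores.values())

# 3. If p_value < 0.05, follow up with pairwise Nemenyi comparisons,
#    e.g. scikit_posthocs.posthoc_nemenyi_friedman (third-party package).
print(f"Friedman chi-square = {stat:.3f}, p = {p_value:.4f}")
```

Because the Friedman test ranks models within each fold, it pairs naturally with the fold-wise evaluation design used here.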
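The feature-ranking aggregation in Section 2.6 (mean absolute SHAP per feature, min-max normalization per model, then averaging across models and sorting) can be sketched with plain NumPy. The array shapes, random values, and feature names below are illustrative placeholders, not the study's data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, n_models = 20, 6
feature_names = [f"feature_{i}" for i in range(n_features)]  # placeholders

# Stand-in for per-model SHAP values: one array per model, with shape
# (rows pooled over all test folds, n_features). Real arrays come from SHAP.
shap_per_model = [rng.normal(size=(120, n_features)) for _ in range(n_models)]

# 1. Mean absolute SHAP value per feature, for each model.
mean_abs = np.stack([np.abs(s).mean(axis=0) for s in shap_per_model], axis=1)

# 2. Min-max normalize each model's column so the six models share a scale.
col_min, col_max = mean_abs.min(axis=0), mean_abs.max(axis=0)
normalized = (mean_abs - col_min) / (col_max - col_min)  # 20 x 6 matrix

# 3. Average across the six models and sort (descending) to rank features.
overall = normalized.mean(axis=1)
ranking = [feature_names[i] for i in np.argsort(overall)[::-1]]
```

The intermediate `normalized` array corresponds to the 20-feature-by-6-model matrix described in the text; `ranking` is the final ordering of features by overall impact.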


            Volume 1 Issue 4 (2024)                        111                               doi: 10.36922/aih.4255