Artificial Intelligence in Health Complex early diagnosis of MS through machine learning
$$\text{Specificity} = \frac{TN}{TN + FP}$$

where TN = true negatives and FP = false positives.

2.5. Statistical test

The statistical testing process started with a normality check on all evaluation metrics for each model across all folds. We chose the Shapiro–Wilk test due to its sensitivity in detecting deviations from normality in small sample sizes.⁵⁵ On analyzing the results, we found that some metrics were not normally distributed, such as the recall of LGBM and the specificity of RF. Therefore, we proceeded with the non-parametric Friedman test to compare the models’ performance.⁵⁶ This test is suited to comparing metrics of multiple models across different folds without relying on the assumption of a normal distribution. When the Friedman test indicated significant differences among model metrics, we further investigated these differences using the Nemenyi post hoc test for pairwise comparisons between models.⁵⁷

2.6. Explainability

To explain our models, we leveraged SHAP, a highly regarded technique in explainable AI within the medical and healthcare domain, to identify important factors influencing CDMS predictions.⁵⁸ SHAP provides both global and local insights into feature importance, helping us understand overall model behavior and individual predictions. These insights helped ensure that our models were not only accurate but also transparent, enhancing their trustworthiness for predicting CDMS from CIS.

For the tree-based models (CatBoost, XGBoost, LGBM, and RF), we used TreeExplainer with the parameters model_output set to “raw” and feature_perturbation set to “tree_path_dependent.” This setup explains the raw output of the models before any logistic function is applied and uses the decision tree structures to perturb features. For SVM, we applied KernelExplainer with the number of background samples automatically selected by SHAP, which provided a satisfactory approximation without requiring excessive computation time. LR was explained with LinearExplainer to suit the linear nature of the model. The explainer was also set to account for the correlation dependence between features during the explanation.

We calculated SHAP values for each row in the test set across all folds and models. SHAP interaction values are available only for the four tree-based models, as KernelExplainer and LinearExplainer do not support interactions. For each model, we first averaged the absolute SHAP values of each feature over all rows and folds, followed by min-max normalization to ensure the values were on a consistent scale across models. This process produced a matrix of 20 rows (features) by six columns (models), which allowed us to analyze the overall impact of features. Next, to rank features, we calculated the mean of the normalized SHAP values across all six models for each feature and then sorted the features by this mean. This step provides a clear view of which CIS features drive the prediction of CDMS across all ML models.

3. Results

Overall, we found that gradient-boosted models – CatBoost, LGBM, and XGBoost – consistently outperformed other models in predicting CDMS, though the performance differences were not statistically significant. CatBoost achieved the highest AUC and showed the best overall balance between precision and recall. Periventricular_MRI, Infratentorial_MRI, and Oligoclonal_Bands were the features with the greatest effect on how well the different ML models predicted CDMS. Among the features, Oligoclonal_Bands and Periventricular_MRI showed a strong interaction.

3.1. Model performance

3.1.1. Model performance metrics

To compare model performance, we evaluated predictions across five-fold cross-validation, focusing on AUC, ACC, F1 score, precision, recall, and specificity. Gradient-boosted tree models – CatBoost, LGBM, and XGBoost – consistently outperformed the others, likely due to their iterative error-correcting nature. These models achieved higher AUCs, better precision-recall balance, and superior F1 scores, indicating stronger class separation and accuracy in predicting positive cases while reducing false positives and negatives. Table 2 presents the mean metrics for each model across the five folds.

Table 2. Evaluation metrics for six machine learning models across five folds

| Model | AUC | ACC | F1 score | Precision | Recall | Specificity |
|---|---|---|---|---|---|---|
| CatBoost | **0.9312** | **0.8791** | **0.8675** | **0.8710** | **0.8640** | **0.8919** |
| XGBoost | 0.9202 | 0.8645 | 0.8514 | 0.8548 | 0.8480 | 0.8784 |
| LGBM | 0.9150 | **0.8791** | **0.8675** | **0.8710** | **0.8640** | **0.8919** |
| RF | 0.9097 | 0.8388 | 0.8295 | 0.8045 | 0.8560 | 0.8243 |
| SVM | 0.8985 | 0.8168 | 0.8031 | 0.7907 | 0.8160 | 0.8176 |
| LR | 0.8922 | 0.8132 | 0.7935 | 0.8033 | 0.7840 | 0.8378 |

Note: Values in boldface are the highest in each column.
Abbreviations: CatBoost: Categorical boosting; LGBM: Light gradient boosting machine; LR: Logistic regression; RF: Random forest; SVM: Support vector machine; XGBoost: Extreme gradient boosting; AUC: Area under the curve.
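The Friedman comparison described in Section 2.5 can be illustrated with a minimal Python sketch. This is not the authors' code: the per-fold scores and model names below are hypothetical, the function names `average_ranks` and `friedman_statistic` are illustrative, and the statistic is the classical form without a tie correction (library implementations such as `scipy.stats.friedmanchisquare` may differ slightly when ties occur). Each fold is treated as a block, the models are ranked within that fold, and the rank sums are combined into a chi-square-like statistic.

```python
from typing import Dict, List


def average_ranks(values: List[float]) -> List[float]:
    """Rank values from 1 (smallest) upward; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the group while the next sorted value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for j in range(pos, end + 1):
            ranks[order[j]] = avg
        pos = end + 1
    return ranks


def friedman_statistic(scores: Dict[str, List[float]]) -> float:
    """Classical Friedman statistic for k models measured on the same n folds."""
    models = list(scores)
    k = len(models)
    n = len(scores[models[0]])
    rank_sums = [0.0] * k
    for fold in range(n):
        fold_ranks = average_ranks([scores[m][fold] for m in models])
        for j, rank in enumerate(fold_ranks):
            rank_sums[j] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)


# Hypothetical per-fold AUC values (not taken from the study).
example_auc = {
    "model_a": [0.93, 0.92, 0.94, 0.91],
    "model_b": [0.90, 0.91, 0.92, 0.89],
    "model_c": [0.88, 0.89, 0.90, 0.87],
}

# 8.0, the maximum n * (k - 1), because the ordering is identical in every fold.
print(friedman_statistic(example_auc))
```

The statistic is compared against a chi-square distribution with k − 1 degrees of freedom. With only a handful of folds, that approximation is coarse, which is one reason a post hoc test such as Nemenyi is applied only after the omnibus test signals a difference.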
Volume 1 Issue 4 (2024) 111 doi: 10.36922/aih.4255
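The feature-ranking procedure in Section 2.6 (mean absolute SHAP value per feature, min-max normalization, then averaging across models) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the SHAP values, model names, and feature names are hypothetical; the helper names are invented for this example; and min-max normalization is assumed to be applied across features within each model.

```python
from typing import Dict, List, Tuple


def mean_abs_per_feature(rows: List[List[float]]) -> List[float]:
    """Mean absolute SHAP value of each feature over all rows (folds pooled)."""
    n_features = len(rows[0])
    return [sum(abs(row[j]) for row in rows) / len(rows) for j in range(n_features)]


def min_max(values: List[float]) -> List[float]:
    """Scale values to [0, 1]; a constant vector maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank_features(
    shap_by_model: Dict[str, List[List[float]]], feature_names: List[str]
) -> List[Tuple[str, float]]:
    """Average the normalized importances across models and sort descending."""
    # One normalized importance vector per model (a column of the features x models matrix).
    normalized = [min_max(mean_abs_per_feature(rows)) for rows in shap_by_model.values()]
    mean_importance = [
        sum(column[j] for column in normalized) / len(normalized)
        for j in range(len(feature_names))
    ]
    return sorted(zip(feature_names, mean_importance), key=lambda p: p[1], reverse=True)


# Hypothetical SHAP values: two models, two test rows each, three features.
example_shap = {
    "model_x": [[4.0, -2.0, 0.0], [-4.0, 2.0, 0.0]],
    "model_y": [[0.5, 0.0, 1.0], [-0.5, 0.0, -1.0]],
}
example_features = ["feature_a", "feature_b", "feature_c"]

# [('feature_a', 0.75), ('feature_c', 0.5), ('feature_b', 0.25)]
print(rank_features(example_shap, example_features))
```

Normalizing within each model before averaging keeps a model with large raw SHAP magnitudes from dominating the combined ranking, which appears to be the purpose of the consistent scale mentioned in the text.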

