Page 75 - GTM-3-1
Global Translational Medicine Evaluating ML models for CAD prediction
made by the model using the counts of TPs, TNs, false positives, and false negatives.31 Both κ and MCC range on a scale from -1 to 1, where 1 means that the model is making perfect predictions (perfect agreement between the model's predictions and actual outcomes), 0 indicates that the model's predictions are equivalent to what is expected from chance alone, and -1 indicates that the predictions are worse than chance.31,32 A κ value of 0.57 indicates a weak level of agreement between the model's predictions and actual outcomes (κ of 0.40–0.59 = weak, and 0.60–0.79 = moderate).33 Little information can be obtained from interpreting the actual score for MCC; however, prior research has shown that the MCC is a special case of the Pearson's correlation coefficient between the observed and predicted binary classification.34,35 Therefore, the MCC can be crudely interpreted with similar cutoff values as the Pearson's correlation coefficient, so for an MCC of 0.57, there is a moderate agreement between the model's predictions and actual outcomes (Pearson's correlation coefficient of 0.40–0.69 = moderate correlation, and 0.70–0.89 = strong correlation).35,36 Prior studies have also shown that the MCC is a better metric than κ, especially if the dataset is unbalanced, such as the current dataset (+CAD = 572 and -CAD = 477), because an unbalanced dataset could affect the hypothetical probability of chance agreement, which is part of the κ formula; however, for the current study, the κ and MCC are the same number.31

The ROC curve shows the performance of a binary ML model as it evaluates the trade-off between the TP rate (sensitivity) on the Y-axis and the false-positive rate (1-specificity) on the X-axis.37 The TP rate is the proportion of actual positive instances correctly identified by the model, and the false-positive rate is the proportion of actual negative instances incorrectly classified as positive by the model. The AUC is used to quantify the performance of the model from the ROC curve. If there is a one-to-one relationship between the TP rate and the false-positive rate (meaning that the ROC curve is a straight diagonal line), then the model's ability to discriminate between the positive and negative classes is no better than random chance, and thus, the AUC will be 0.5.37 The perfect ROC curve has a value of 1.0, where the model achieves a TP rate of 1 and a false-positive rate of 0, meaning that the model makes no errors in distinguishing between positive and negative cases (presence or absence of CAD in the current study). The LR model had an AUC of 0.88 (Figure 4), outperforming other models in this metric. Comparing similar literature, this AUC matches the performance of other chosen models.18 This also indicates a positive performance, given the sensitivity to certain imbalances in the dataset (572 +CAD compared to 477 -CAD).

The feature importance plot revealed that exercise-induced angina, TA chest pain, and age stood out as the most significant factors in influencing the model's predictions; however, more known factors in CAD diagnosis, including male sex and cholesterol, held less impact in our model (Figure 5). Variability in the dataset in terms of the abundance of certain parameters (i.e., more male records as compared to female records) could skew the feature importance plot in that the model focuses on the features that are most different between positive and negative CAD cases. In addition, the present study had a relatively low number of records (1049), so this could also explain the differences between the feature importance plot generated by the model and the actual risk factors that lead to CAD. To further analyze the performance of the model, and to assess whether the model can be improved with more data points, a learning curve for LR was assessed (Figure 6).

Learning curves are plots that demonstrate an ML model's performance (prediction accuracy, F1 score, error) over time with more experience and with more training instances. Learning curves usually consist of two lines, one for the model's performance on the training dataset (data that was used to train the model) and one for the validation dataset (data that the model has never seen).38 The learning curve in the present study has an X-axis that consists of training iterations and a Y-axis that consists of the model's F1 score, with the blue line representing the training dataset and the green line representing the validation dataset. As the training iterations increase, the training score starts off high and remains constant, while the cross-validation score starts lower and increases until there is a small gap between the training and cross-validation scores. The small gap between the two scores could indicate that the model is not overfitting. However, since the training score remained constant throughout the training iterations, the model could have been memorizing the information provided by the dataset, or it has simply learned all it can from the dataset. In addition, since the training and cross-validation scores are converging, the addition of more patient records would most likely yield minimal improvements in model metrics.39

The three ML models that provided the best results in predicting CAD included LR, linear discriminant analysis (LDA), and Ridge classifier (RIDGE). As a statistical method for binary classification, logistic regression is applied when there are only two possible outcomes for the target variable.40 Logistic regression accomplishes binary classification by modeling the probability that a given input belongs to a certain class. For instance, in the medical field, logistic regression can be used to determine the relationship between variables such as weight and exercise
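For readers who wish to reproduce the evaluation metrics discussed in this section, all of them map directly onto standard scikit-learn functions. The sketch below is illustrative only: it trains a logistic regression classifier on a synthetic, mildly unbalanced binary dataset (not the study's patient records) and computes Cohen's κ, the MCC, and the ROC AUC; the sample sizes and model settings are assumptions chosen to loosely mirror the study's setup.

```python
# Illustrative sketch: computing Cohen's kappa, MCC, and ROC AUC for a
# logistic regression binary classifier. Uses synthetic data, not the
# study's CAD records; feature count and class balance are assumptions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef, roc_auc_score
from sklearn.model_selection import train_test_split

# Mildly unbalanced problem, loosely mirroring 572 +CAD vs. 477 -CAD records
X, y = make_classification(
    n_samples=1049, n_features=11, weights=[0.45, 0.55], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)            # hard class labels
y_prob = model.predict_proba(X_test)[:, 1]  # probability of the positive class

kappa = cohen_kappa_score(y_test, y_pred)  # chance-corrected agreement
mcc = matthews_corrcoef(y_test, y_pred)    # correlation of observed vs. predicted labels
auc = roc_auc_score(y_test, y_prob)        # area under the ROC curve
print(f"kappa={kappa:.2f}  MCC={mcc:.2f}  AUC={auc:.2f}")
```

Note that κ and MCC are computed from the hard predictions, while the AUC requires the predicted probabilities, which is why `predict_proba` is used for the latter.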
Volume 3 Issue 1 (2024) 8 https://doi.org/10.36922/gtm.2669

