made by the model using the counts of TPs, TNs, false positives, and false negatives.31 Both κ and MCC range from -1 to 1, where 1 means that the model is making perfect predictions (perfect agreement between the model's predictions and actual outcomes), 0 indicates that the model's predictions are equivalent to what is expected from chance alone, and -1 indicates that the predictions are worse than chance.31,32 A κ value of 0.57 indicates a weak level of agreement between the model's predictions and actual outcomes (κ of 0.40–0.59 = weak, and 0.60–0.79 = moderate).33 Little information can be obtained from interpreting the raw MCC score; however, prior research has shown that the MCC is a special case of the Pearson's correlation coefficient between the observed and predicted binary classifications.34,35 Therefore, the MCC can be crudely interpreted with similar cutoff values as the Pearson's correlation coefficient, so for an MCC of 0.57, there is moderate agreement between the model's predictions and actual outcomes (Pearson's correlation coefficient of 0.40–0.69 = moderate correlation, and 0.70–0.89 = strong correlation).35,36 Prior studies have also shown that MCC is a better metric than κ, especially if the dataset is unbalanced, such as the current dataset (+CAD = 572 and -CAD = 477), because an unbalanced dataset could affect the hypothetical probability of chance agreement, which is part of the κ formula; however, for the current study, the κ and MCC are the same number.31
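As a rough illustration of how these two agreement metrics can be obtained from a set of binary predictions, the sketch below uses scikit-learn with made-up labels; it is not the study's actual pipeline or data.

```python
# Minimal sketch: Cohen's kappa and MCC for a binary classifier
# (illustrative labels only; not the study's data or pipeline).
from sklearn.metrics import cohen_kappa_score, matthews_corrcoef, confusion_matrix

y_true = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]   # actual status (1 = +CAD, 0 = -CAD)
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]   # model predictions

# Both metrics are derived from the counts of TPs, TNs, FPs, and FNs.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")

kappa = cohen_kappa_score(y_true, y_pred)   # chance-corrected agreement
mcc = matthews_corrcoef(y_true, y_pred)     # correlation between observed and predicted labels
print(f"kappa={kappa:.2f}, MCC={mcc:.2f}")
```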
The ROC curve shows the performance of a binary ML model as it evaluates the trade-off between the TP rate (sensitivity) on the Y-axis and the false-positive rate (1-specificity) on the X-axis.37 The TP rate is the proportion of actual positive instances correctly identified by the model, and the false-positive rate is the proportion of actual negative instances incorrectly classified as positive by the model. The AUC is used to quantify the performance of the model from the ROC curve. If there is a one-to-one relationship between the TP rate and the false-positive rate (meaning that the ROC curve is a straight diagonal line), then the model's ability to discriminate between the positive and negative classes is no better than random chance, and thus, the AUC will be 0.5.37 A perfect ROC curve has an AUC of 1.0, where the model achieves a TP rate of 1 and a false-positive rate of 0, meaning that the model makes no errors in distinguishing between positive and negative cases (presence or absence of CAD in the current study). The LR model had an AUC of 0.88 (Figure 4), outperforming the other models in this metric. Compared with similar literature, this AUC matches the performance of other chosen models.18 This also indicates good performance, given the model's sensitivity to certain imbalances in the dataset (572 +CAD compared to 477 -CAD).
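For context, a minimal sketch of how an ROC curve and its AUC are typically computed from a fitted classifier's predicted probabilities with scikit-learn; the synthetic data and variable names are placeholders, not the study's records or tuned model.

```python
# Minimal sketch: ROC curve and AUC for a binary classifier on synthetic data
# (illustrative only; not the study's dataset or final model).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=13, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]          # probability of the positive class

fpr, tpr, thresholds = roc_curve(y_test, proba)    # false-positive rate vs. TP rate
auc = roc_auc_score(y_test, proba)                 # area under the ROC curve
print(f"AUC = {auc:.2f}")                          # 0.5 = chance, 1.0 = perfect discrimination
```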
The feature importance plot revealed that exercise-induced angina, TA chest pain, and age stood out as the most significant factors influencing the model's predictions; however, better-known factors in CAD diagnosis, including male sex and cholesterol, had less impact in our model (Figure 5). Variability in the dataset in terms of the abundance of certain parameters (i.e., more male records than female records) could skew the feature importance plot, in that the model focuses on the features that differ most between positive and negative CAD cases. In addition, the present study had a relatively low number of records (1049), so this could also explain the differences between the feature importance plot generated by the model and the actual risk factors that lead to CAD. To further analyze the performance of the model, and to assess whether the model can be improved with more data points, a learning curve for LR was assessed (Figure 6).
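One common way to derive a feature-importance-style ranking from a logistic regression model is to inspect the magnitudes of its standardized coefficients; the sketch below illustrates that approach on synthetic data with placeholder feature names, and is not necessarily how the study's plot was generated.

```python
# Minimal sketch: ranking features by logistic-regression coefficient magnitude
# (a common proxy for feature importance; not necessarily the study's method).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

feature_names = [f"feature_{i}" for i in range(8)]   # placeholders, e.g., age, cholesterol, ...
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)).fit(X, y)
coefs = pipe.named_steps["logisticregression"].coef_.ravel()

# Larger absolute (standardized) coefficients suggest stronger influence on predictions.
for name, coef in sorted(zip(feature_names, coefs), key=lambda t: -abs(t[1])):
    print(f"{name}: {coef:+.3f}")
```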
Learning curves are plots that demonstrate an ML model's performance (prediction accuracy, F1 score, error) over time, with more experience and more training instances. Learning curves usually consist of two lines, one for the model's performance on the training dataset (the data used to train the model) and one for the validation dataset (data that the model has never seen).38 The learning curve in the present study has an X-axis that consists of training iterations and a Y-axis that consists of the model's F1 score, with the blue line representing the training dataset and the green line representing the validation dataset. As the training iterations increase, the training score starts off high and remains constant, while the cross-validation score starts lower and increases until there is a small gap between the training and cross-validation scores. The small gap between the two scores could indicate that the model is not overfitting. However, since the training score remained constant throughout the training iterations, the model could have been memorizing the information provided by the dataset, or it may simply have learned all it can from the dataset. In addition, since the training and cross-validation scores are converging, the addition of more patient records would most likely yield minimal improvements in model metrics.39
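A minimal sketch of how such a learning curve can be generated with scikit-learn's learning_curve utility, using synthetic data rather than the study's records:

```python
# Minimal sketch: learning curve (F1 score vs. number of training examples)
# for a logistic-regression model; illustrative data, not the study's records.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, n_features=13, random_state=0)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    cv=5, scoring="f1",                       # F1 score, as on the study's Y-axis
    train_sizes=np.linspace(0.1, 1.0, 10),
)

# Converging training and cross-validation curves suggest that adding more
# records would yield only marginal gains.
for n, tr, va in zip(train_sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train F1={tr:.2f}  validation F1={va:.2f}")
```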
The three ML models that provided the best results in predicting CAD were LR, linear discriminant analysis (LDA), and the Ridge classifier (RIDGE). As a statistical method for binary classification, logistic regression is applied when there are only two possible outcomes for the target variable. Logistic regression accomplishes binary classification by modeling the probability that a given input belongs to a certain class.40 For instance, in the medical field, logistic regression can be used to determine the relationship between variables such as weight and exercise
