
Global Translational Medicine                                       Evaluating ML models for CAD prediction




            Table 3. Performance of all machine learning classification models
            Model                                           Accuracy   Recall     Precision  F1-Score   Kappa      MCC
            Logistic regression (LR)                        0.7861     0.8075     0.8025     0.8030     0.5687     0.5719
            Linear discriminant analysis (LDA)              0.7834     0.7950     0.8059     0.7984     0.5641     0.5674
            Ridge classifier (RIDGE)                        0.7820     0.7950     0.8040     0.7974     0.5612     0.5646
            AdaBoost classifier (ADA)                       0.7793     0.8000     0.7975     0.7967     0.5550     0.5584
            Gradient Boost classifier (GBC)                 0.7765     0.8075     0.7897     0.7968     0.5481     0.5516
            Naive Bayes (NB)                                0.7725     0.8000     0.7866     0.7923     0.5405     0.5423
            Light Gradient Boosting Machine (LGBM)          0.7628     0.7950     0.7761     0.7851     0.5205     0.5214
            Extreme Gradient Boosting (XGBOOST)             0.7559     0.7725     0.7815     0.7753     0.5077     0.5103
            Random Forest classifier (RF)                   0.7491     0.7675     0.7739     0.7683     0.4944     0.4980
            Extra Trees classifier (ET)                     0.7383     0.7350     0.7753     0.7530     0.4748     0.4776
            K-nearest neighbor classifier (KNN)             0.7315     0.7575     0.7559     0.7551     0.4575     0.4597
            Decision Tree classifier (DT)                   0.7260     0.6975     0.7827     0.7347     0.4526     0.4592
            Quadratic discriminant analysis (QDA)           0.6920     0.7550     0.7038     0.7215     0.3751     0.3886
            Support vector machines–linear kernel (SVM-LK)  0.5670     0.5050     0.5095     0.4355     0.1401     0.1713
            Abbreviation: MCC: Matthews correlation coefficient.
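
For readers less familiar with the six metrics reported in Table 3, the following is a minimal sketch of how each can be computed from the four cells of a binary confusion matrix. It is illustrative only and is not the pipeline used in the study: the function name `binary_metrics` and the example counts are hypothetical, and the sketch assumes that no denominator is zero. Because the tabulated values may be averages over cross-validation folds (an assumption), they need not satisfy these single-matrix identities exactly.

```python
import math


def binary_metrics(tp, fp, fn, tn):
    """Compute the Table 3 metrics from confusion-matrix counts.

    tp, fp, fn, tn are true positives, false positives, false negatives,
    and true negatives. Assumes every denominator below is nonzero.
    """
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    recall = tp / (tp + fn)        # sensitivity
    precision = tp / (tp + fp)     # positive predictive value
    f1 = 2 * precision * recall / (precision + recall)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    kappa = (accuracy - p_chance) / (1 - p_chance)

    # Matthews correlation coefficient (MCC).
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )

    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "kappa": kappa,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Hypothetical counts chosen for illustration; not data from the study.
    for name, value in binary_metrics(tp=80, fp=20, fn=20, tn=60).items():
        print(f"{name:>9}: {value:.4f}")
```

With these hypothetical counts, recall, precision, and F1 all equal 0.80, while kappa and MCC both equal 0.55, which illustrates why the chance-corrected measures in Table 3 sit well below accuracy.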

Figure 4. Area under the ROC curves for logistic regression model.

can be high – in terms of both health outcomes for the patient and resource allocation in health-care settings. An AUC of 0.88 is considered to be very good, suggesting that the logistic regression model can be trusted to make accurate predictions about the presence or absence of CAD, though it should typically be used in conjunction with other diagnostic information and clinical judgment.

Through LR, a feature importance plot was created to categorize the impact of each variable in diagnosing CAD (Figure 5). The feature importance plot in ML is used to identify which variables have the most influence on the predictions made by the model. In the context of the study, exercise angina had the highest variable importance, followed very closely by chest pain type TA. Additional variables with high importance were age, chest pain type NAP, and chest pain type ATA, which are arranged in descending order of importance. Variables with lower importance in the context of this model included sex, fasting blood sugar, chest pain type ASY, resting ECG ST, and cholesterol, which are listed in order of decreasing impact on the model’s predictions. These findings imply that the occurrence of exercise-induced angina (chest pain) and the type of chest pain categorized as TA angina are the strongest predictors for CAD in this logistic regression model. Age and other types of chest pain, such as NAP and ATA angina, are also important considerations, although they have less influence compared to exercise angina and chest pain type TA. The factors of lesser importance, such as patient’s gender, fasting blood sugar levels, presence of asymptomatic chest pain, certain ECG changes denoted as resting ECG ST, and cholesterol levels, still contribute to the predictive power of the model, but to a smaller extent. Clinically, this information can be valuable for risk stratification and tailoring diagnostic evaluations. For example, the prominence of exercise-induced angina suggests a significant association with CAD and could be a strong indicator for further diagnostic testing. Similarly, the type and characteristics of chest pain can influence clinical decision-making. While the other variables may have less importance in the model’s predictions, they might still contribute to a comprehensive risk profile and should not be disregarded in clinical evaluation.

The learning curve for LR reveals a training score that stays consistent between 0.78 and 0.80, and the cross-validation score exhibits a rising trend as it reaches a point of intersection with the training score within the 0.78 to 0.80 range (Figure 6). The learning curve


            Volume 3 Issue 1 (2024)                         6                        https://doi.org/10.36922/gtm.2669