Global Translational Medicine Evaluating ML models for CAD prediction
to predict whether a person will suffer from a heart attack or other medical complication. Such a model is set up through training and testing. Throughout the training phase, the logistic regression model's parameters are learned.[41] The model identifies patterns in the input data and associates them with some form of output. As such, after training, the model can forecast the likelihood that an input will belong to a specific class. This makes logistic regression a valuable algorithm for linearly separable datasets where two classes are separated by a line on a graph. Due to its simplicity, logistic regression often serves as the baseline for more complex classification models.

LDA is a classification algorithm used in ML that is employed to solve multi-class classification problems.[42] LDA utilizes a linear combination of features to separate classes and ultimately determine whether an input belongs to a given output class. It accomplishes this task through dimensionality reduction, in which the separation between classes is prioritized while the dimensionality of the data is reduced.[43] For instance, in the realm of medicine, an LDA algorithm would aim to maximize class separability, such as separating disease categories from one another. Within each class, the algorithm identifies key patterns to simplify the information and make predictions based on how the data were reduced.

RIDGE is an important tool used in ML that utilizes examples to learn to classify new inputs into different categories. RIDGE is a linear classifier that extends the concept of logistic regression and LDA by incorporating regularization to mitigate overfitting, a situation that occurs when a classification algorithm performs exceptionally well on the training data but poorly in the testing phase. This allows for more accurate predictions.[44] As a model used for multi-class classification tasks, RIDGE learns from examples in the training set to categorize inputs into different classes. Once trained, the RIDGE classifier calculates a decision function based on the learned coefficients.[45]

A known hurdle in the previous literature on ML and CAD is that most investigated datasets have a limited number of features and small sample sizes.[18] The limitations of the present study also mostly involve the dataset. Since the dataset is a combination of different resources, institutional differences in collecting data could have an impact on the quality of the dataset. In addition, even though the learning curve suggests that the addition of more data would yield minimal improvement, the present study still has a relatively low number of patient records. Furthermore, the training score of the learning curve staying constant (Figure 6) could also indicate that the model learned everything it could early on due to the lack of complexity in the dataset. Some continuous variables such as age and fasting blood sugar were converted to binary outcomes, and the final outcome of presence of heart disease is also a binary outcome.

There have been a multitude of studies that provide a non-invasive method of predicting CAD using ML. Özbilgin et al. proposed a method of early diagnosis of CAD using iris images.[25] The study used images from 198 volunteers: 94 with CAD and 104 without CAD. Features of the iris images were extracted using wavelet transform, first-order statistical analysis, gray level co-occurrence matrix, and a gray level run length matrix based on the ReliefF feature selection method. Similar to the present study, a number of different classifiers were employed, and their efficacy was compared. The support vector machine model provided the highest accuracy at 93%.[25] Another study by Akella and Akella had a similar design to the current study, where data from the UCI Center for ML and Intelligent Systems were used to train six different ML algorithms to predict the presence of CAD. The ML models included linear model, decision tree, random forest, support vector machine, neural network, and k-nearest neighbor.[46] The neural network achieved the highest accuracy of 93.03% and a sensitivity of 93.80 as well as the highest AUC.[46] The present study is different in that PyCaret was used with the same dataset to automatically train a total of 14 different ML models.

Hence, the results demonstrate that among the various ML models tested, LR exhibited superior performance with an AUC of 0.88, reflecting a high degree of discriminative ability. Clinically, the importance attributed to features such as exercise angina, chest pain type TA, age, and other chest pain types indicates that the model aligns well with known clinical predictors of CAD, reinforcing its potential utility in a clinical setting. However, it is crucial to acknowledge the plateau observed in the learning curve, suggesting that further expansion of the dataset beyond the 500-instance mark may not significantly enhance model performance unless new varieties of data or features are introduced. Therefore, future work should focus on refining the logistic regression model with more diverse and complex data, as well as on external validation of the model to ensure its generalizability and applicability across different patient populations.

Future extensions of the present study would involve improving the quality and quantity of the dataset as well as making the dataset more complex. In addition to the LR model, it is also important to note that the LDA, RIDGE, AdaBoost classifier (ADA), Gradient Boosting classifier (GBC), and Naive Bayes (NB) all had comparable results to the LR classifier; therefore, further research should also
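
Listing (illustrative only). The description of logistic regression above, in which parameters are learned during training and the fitted model then returns a class likelihood, and the role of regularization described for RIDGE, can be made concrete with a minimal sketch. It is not the PyCaret pipeline used in this study: the data are synthetic, all function names and the penalty strength `l2` are hypothetical, and the L2 penalty is attached to the logistic loss purely to show how regularization shrinks coefficients, which is not necessarily how a given RIDGE implementation is formulated.

```python
import math
import random


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(X, y, l2=0.0, lr=0.1, epochs=500):
    """Learn weights and bias by batch gradient descent on the log-loss.

    l2 is a hypothetical penalty strength; l2 > 0 shrinks the weights,
    which is the regularization idea attributed to RIDGE in the text.
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        for j in range(d):
            w[j] -= lr * (grad_w[j] / n + l2 * w[j])
        b -= lr * grad_b / n
    return w, b


def predict_proba(X, w, b):
    """Return the predicted probability of class 1 for each input."""
    return [sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) for xi in X]


def auc(y_true, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def make_data(n, rng):
    """Synthetic two-feature, two-class data (an assumption; not patient data)."""
    X, y = [], []
    for _ in range(n):
        label = rng.random() < 0.5
        center = 1.0 if label else -1.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(int(label))
    return X, y


rng = random.Random(0)
X_train, y_train = make_data(200, rng)
X_test, y_test = make_data(100, rng)

# Training phase: parameters are learned with and without the L2 penalty.
w_plain, b_plain = train_logistic(X_train, y_train, l2=0.0)
w_ridge, b_ridge = train_logistic(X_train, y_train, l2=0.5)

# Testing phase: held-out data are scored and summarized by AUC.
auc_plain = auc(y_test, predict_proba(X_test, w_plain, b_plain))
auc_ridge = auc(y_test, predict_proba(X_test, w_ridge, b_ridge))
print(f"test AUC without penalty: {auc_plain:.2f}")
print(f"test AUC with L2 penalty: {auc_ridge:.2f}")
```

Comparing `w_plain` with `w_ridge` shows the penalized coefficients pulled toward zero. On data this simple the two test AUC values should be close, which mirrors the comparable results reported for the linear classifiers in this study.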
Volume 3 Issue 1 (2024) 9 https://doi.org/10.36922/gtm.2669

