Artificial Intelligence in Health Predicting mortality in COVID-19 using ML
“ICU_2,” and “ICU_97.” After this processing, we created three more categorical attributes, bringing the total to 20. After adding the two numerical attributes “Age” and “Days from Symptom to Hospitalization,” we ultimately settled on a final total of 22 attributes used by the ML models.

Finally, using sklearn’s statistical methods, namely “StandardScaler” (std) and “MinMaxScaler” (mm), we created six distinct datasets with different normalization schemes for the numerical attributes “Age” and “Days from Symptom to Hospitalization.” Specifically, one dataset was created using the “StandardScaler” method, four using the “MinMaxScaler” method with different ranges (0–1, 0–10, 0–100, and 0–1000), and one without any normalizing method (none). Each of the six datasets was saved in CSV file format. The transformation-encoding process flowchart is shown in Figure 4.

3.1.3. Train-test set generation
To create the train and test sets for each ML model iteration, we formed a dataset by randomly selecting 20% of the samples from the current dataset file using the “sample” function from the pandas library. Next, to mitigate the imbalance in the “Survived” attribute, which had an approximate dead-to-survivor ratio of 1:14, we applied the synthetic minority oversampling technique (SMOTE)⁵⁰ from Python’s imblearn library. This adjustment created a set with a dead-to-survivor ratio of 1:10. Finally, we applied the “RandomUnderSampler” method from the imblearn library to create the final dataset with a dead-to-survivor ratio of 1:2. This dataset was then randomly divided into two subsets: the train set, consisting of 70% of the data, and the test set, consisting of 30% of the data. The flowchart illustrating the process of generating the train and test sets is depicted in Figure 5.

3.2. Models and algorithmic methods
In this study, we used six ML algorithmic methods, i.e., LR,²⁷,²⁸ DTs,²⁹,³⁰ RF,³¹ XGBoost,³² MLPs,³³,³⁴ and KNN.³⁵ The models were implemented in Python (version 3.7) using the integrated development environment (IDE) software Spyder (version 5.1.5) and the pandas library (version 1.3.5).

3.2.1. LR
LR is a supervised learning method developed by David Cox in 1958 that aims to solve classification problems.²⁷ LR is a generalized form of simple linear regression, used for solving classification problems where both numerical variables and categorical variables can be used as independent variables. LR models data using the sigmoid function to make predictions about different possible outcomes.²⁸ Specifically, it predicts the value of dependent variables based on the weights of each independent variable. The weight of each variable is related to the degree of correlation it has with the dependent variable. In this study, we used the “LogisticRegression” method from Python’s sklearn library.

3.2.2. DTs
DTs are a non-parametric supervised learning method used for classification or regression problems. The main goal of DTs is to build a model that predicts the value of a target variable by learning simple decision rules inferred from data features. The algorithm takes the given dataset and divides it into categories consisting of entities with the same value for a specific variable (attribute). This process is repeated recursively until the DT is constructed through the rules of the individual categorizations of the specific model.³⁰ A tree can be thought of as a piecewise constant approximation. In this study, we constructed our DT models using the “DecisionTreeClassifier” method from Python’s sklearn library.

3.2.3. RF
RF is an ensemble supervised ML method that can be applied to both classification and regression problems. RF improves model performance by combining multiple classifiers to solve complex problems.³¹ Specifically, RF is a classifier that consists of a number of DTs, each trained on a different subset of the training set. The final decision (prediction) is made by the majority vote for categorical variables or by averaging the values for numerical variables, enhancing the model’s accuracy. In general, the higher the number of DTs that comprise the forest, the higher the model’s accuracy and the lower the risk of overfitting. In this study, we used the “RandomForestClassifier” method from the sklearn library.

3.2.4. XGBoost
XGBoost is a well-known variant of the gradient boosting algorithm, developed to increase prediction accuracy. XGBoost is an ensemble learning method based on DTs, utilizing a gradient-boosting framework. This framework corrects mistakes from previous DT models by modifying the weights of the variables, thereby improving subsequent models. The method was originally developed by Tianqi Chen and described by him and Carlos Guestrin in 2016.³² XGBoost has gained widespread popularity due to its performance in ML competitions.³² In this study, we used the “XGBClassifier” method from Python’s XGBoost library.

3.2.5. MLPs
MLPs are another term for ANNs since the artificial neuron is also called a “Perceptron.”⁵¹ MLPs are a
Volume 1 Issue 3 (2024) 36 doi: 10.36922/aih.2591
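The class counts implied by the resampling steps of Section 3.1.3 (1:14, then 1:10 after SMOTE, then 1:2 after random undersampling, then a 70/30 split) can be sketched as below. This is a minimal illustration, not the authors' code: the function name `resampling_counts`, its parameters, and the starting counts (1,000 deaths, 14,000 survivors) are hypothetical and chosen only to match the stated 1:14 ratio. It assumes that oversampling adds only minority-class (dead) samples and that undersampling removes only majority-class (survivor) samples.

```python
def resampling_counts(n_dead, n_survived, smote_ratio=10, under_ratio=2, train_pct=70):
    """Return class counts after a 1:14 -> 1:10 -> 1:2 resampling pipeline.

    Ratios are given as survivors per death, so smote_ratio=10 means 1:10.
    All names and defaults are illustrative assumptions.
    """
    # Step 1: oversampling adds synthetic minority (dead) samples until the
    # dead-to-survivor ratio reaches 1:smote_ratio; survivors are untouched.
    dead_after_smote = max(n_dead, n_survived // smote_ratio)

    # Step 2: undersampling drops majority (survivor) samples until the
    # ratio reaches 1:under_ratio; the dead class is untouched.
    survived_after_under = min(n_survived, dead_after_smote * under_ratio)

    # Step 3: split the rebalanced set into train and test portions.
    total = dead_after_smote + survived_after_under
    n_train = total * train_pct // 100
    return {
        "dead": dead_after_smote,
        "survived": survived_after_under,
        "train": n_train,
        "test": total - n_train,
    }


# Hypothetical 20% sample with a 1:14 dead-to-survivor ratio.
counts = resampling_counts(n_dead=1000, n_survived=14000)
print(counts)
```

With imblearn, targets of this kind are typically passed through the `sampling_strategy` argument of `SMOTE` and `RandomUnderSampler` as minority-to-majority ratios (0.1 and 0.5 here); the text does not state the exact arguments the authors used.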

