
Artificial Intelligence in Health                                  Predicting mortality in COVID-19 using ML



            “ICU_2,” and “ICU_97.” After this processing, we created three more categorical attributes, bringing the total to 20. After adding the two numerical attributes “Age” and “Days from Symptom to Hospitalization,” we ultimately settled on a final total of 22 attributes used by the ML models. Finally, using sklearn’s scaling methods, namely “StandardScaler” (std) and “MinMaxScaler” (mm), we created six distinct datasets with different normalization schemes for the numerical attributes “Age” and “Days from Symptom to Hospitalization.” Specifically, one dataset was created using the “StandardScaler” method, four using the “MinMaxScaler” method with different ranges (0 – 1, 0 – 10, 0 – 100, and 0 – 1000), and one without any normalizing method (none). Each of the six datasets was saved in CSV file format. The transformation-encoding process flowchart is shown in Figure 4.

            3.1.3. Train-test set generation
            To create the train and test sets for each ML model iteration, we formed a dataset by randomly selecting 20% of the samples from the current dataset file using the “sample” function from the pandas library. Next, to mitigate the imbalance in the “Survived” attribute, which had an approximate dead-to-survivor ratio of 1:14, we applied the synthetic minority oversampling technique (SMOTE)50 from Python’s imblearn library. This adjustment created a set with a dead-to-survivor ratio of 1:10. Finally, we applied the “RandomUnderSampler” method from the imblearn library to create the final dataset with a dead-to-survivor ratio of 1:2. This dataset was then randomly divided into two subsets: the train set, consisting of 70% of the data, and the test set, consisting of 30% of the data. The flowchart illustrating the process of generating the train and test sets is depicted in Figure 5.

            3.2. Models and algorithmic methods
            In this study, we used six ML algorithmic methods, i.e., LR,27,28 DTs,29,30 RF,31 XGBoost,32 MLPs,33,34 and KNN.35 The models were implemented in Python (version 3.7) using the integrated development environment (IDE) software Spyder (version 5.1.5) and the pandas library (version 1.3.5).

            3.2.1. LR
            LR is a supervised learning method developed by David Cox in 1958 that aims to solve classification problems.27 LR is a generalized form of simple linear regression, used for solving classification problems, where both numerical and categorical variables can be used as independent variables. LR models data using the sigmoid function to make predictions about different possible outcomes.28 Specifically, it predicts the value of dependent variables based on the weights of each independent variable. The weight of each variable is related to the degree of correlation it has with the dependent variable. In this study, we used the “LogisticRegression” method from Python’s sklearn library.

            3.2.2. DTs
            DTs are a non-parametric supervised learning method used for classification or regression problems. The main goal of DTs is to build a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. The algorithm takes the given dataset and divides it into categories consisting of entities with the same value for a specific variable (attribute). This process is repeated recursively until the DT is constructed through the rules of the individual categorizations of the specific model. A tree can be thought of as a piecewise constant approximation.30 In this study, we constructed our DT models using the “DecisionTreeClassifier” method from Python’s sklearn library.

            3.2.3. RF
            RF is an ensemble supervised ML method that can be applied to both classification and regression problems. RF improves model performance by combining multiple classifiers to solve complex problems.31 Specifically, RF is a classifier that consists of a number of DTs, each trained on a different subset of the training set. The final decision (prediction) is made by majority vote for categorical variables or by averaging the values for numerical variables, enhancing the model’s accuracy. The higher the number of DTs that comprise the forest, the higher the model’s accuracy and the lower the risk of overfitting. In this study, we used the “RandomForestClassifier” method from the sklearn library.

            3.2.4. XGBoost
            XGBoost is a well-known variant of the gradient boosting algorithm, developed to increase prediction accuracy. XGBoost is an ensemble learning method based on DTs, utilizing a gradient-boosting framework. This framework corrects the mistakes of previous DT models by modifying the weights of the variables, thereby improving subsequent models. The method was originally developed by Tianqi Chen and described by him and Carlos Guestrin in 2016.32 XGBoost has gained widespread popularity due to its performance in ML competitions.32 In this study, we used the “XGBClassifier” method from Python’s XGBoost library.

            3.2.5. MLPs
            MLPs are another term for ANNs since the artificial neuron is also called a “Perceptron.”51 MLPs are a
            Volume 1 Issue 3 (2024)                                                          doi: 10.36922/aih.2591