
Artificial Intelligence in Health                                  Predicting mortality in COVID-19 using ML



dataset comprised more than 2,670,000 confirmed COVID-19 patients from 146 countries, with an average age of 44.75 years. They applied feature selection to filter irrelevant symptoms and pre-existing conditions, obtaining accuracies of 89.98% for ANNs, 89.83% for KNNs, 89.02% for SVM, 87.93% for RF, 87.91% for LR, and 86.87% for DTs. The advantages of the study include the diversity of patient origins, the large sample size, and the use of different ML methods. The main limitation was the lack of ensemble ML methods.

Naseem et al.⁴¹ aimed to develop a novel deep learning neural network (DNN) model for COVID-19 mortality prediction using the Neo-V framework and compared its performance to other traditional ML models, such as LR, RF, KNN, random trees, support vector classifier using radial basis function (SVC-RBF), adaptive boosting (AdaBoost) classifier, quadratic discriminant analysis, and a DNN. The dataset used comprised laboratory and clinical data of 1,214 adult COVID-19 patients admitted to Aga Khan University Hospital from February to September 2020. The DNN Neo-V model outperformed the conventional ML models, achieving an accuracy of 99.53%, a sensitivity of 89.87%, a specificity of 95.63%, and an AUC-ROC of 88.5%. The main advantage of the study is the diversity of the ML methods used, including the Neo-V framework, with the limitation being the small number of patients.

Chadaga et al.⁴² aimed to predict mortality among COVID-19 patients using epidemiological parameters. The ML methods used were RF, XGBoost, LightGBM, categorical boosting, AdaBoost, and gradient boosting. The dataset used was provided by the Directorate General of Epidemiology, Secretariat of Health (Mexico)⁴³ and consisted of 263,007 confirmed COVID-19 patients with 19 selected attributes each. The XGBoost model achieved the best results with an accuracy of 96%, a precision of 95%, a recall of 95%, an F1-score of 95%, and an AUC-ROC of 96%. The advantages of the study include the number and variety of ML methods used and the large patient dataset. However, the main limitation was the lack of non-ensemble ML methods.

Rai et al.⁴⁴ proposed a voting ensemble model comprising the extra trees classifier, the RF, the gradient boosting classifier, and XGBoost. The proposed model was compared to baseline models, including KNN, Naïve Bayes classifier, XGBoost, RF, gradient boosting classifier, and extra trees classifier. The dataset used for the research was obtained from a publicly available source consisting of blood biomarkers of 4,711 patients admitted to the hospital from March 1 to April 16, 2020. The highest scores were recorded by the proposed voting ensemble model, with an accuracy of 86.99%, a precision of 0.744, a recall of 0.690, an AUC-ROC of 0.895, and an F1-score of 0.716. The advantage of the study is the use of a diverse ensemble of ML methods, while the main limitation was the small number of patients.

Bárcenas and Fuentes-García⁴⁵ conducted a study to determine the risk factors associated with mortality in COVID-19 patients using RF, GBM, and XGBoost. They used a subset of the dataset provided by the Mexican government,²⁶ recorded from January 17, 2020, to June 28, 2020, which consisted of 583,678 patients, 220,657 of whom were confirmed COVID-19 patients. Patients were classified into three risk categories: low, moderate, and high, depending on comorbidities and major symptoms. The overall accuracy for predicting mortality was 89.97% for XGBoost, 89.86% for RF, and 83.37% for GBM. The advantage of the study is the large patient dataset, while its limitations include the use of only a few ML methods and the lack of non-ensemble methods. In addition, recent studies demonstrate promising results in the use of ML in various domains, such as contact tracing for COVID-19 transmission,⁴⁶ prenatal screening,⁴⁷ predicting the occurrence of type 2 diabetes,⁴⁸ and cardiovascular disease.⁴⁹

3. Data and methods

In this section, we present the dataset used, the preprocessing techniques applied, and the ML algorithmic methods employed to train the models.

3.1. Data preprocessing

The data preprocessing procedure encompasses cleaning, transforming, and encoding the raw data, as well as generating the training and testing datasets for each iteration.

3.1.1. Cleansing

Our dataset consists of 12,425,179 cases suspected of having COVID-19, who attended various health facilities in Mexico from January 17, 2020, until January 3, 2022. The dataset is publicly available as a CSV file disseminated by the Government of Mexico.²⁶

First, we translated all 40 attribute names from Spanish to English. Second, we cleansed the dataset by retaining only the positive COVID-19 cases, as indicated by the “Laboratory Result” (1: SARS-CoV-2 positive) and “Final Classification” (1, 2, and 3: Confirmed case) attributes, in accordance with the guidelines of the Epidemiological Association of Mexico and the Mexican Commission of Medical Decisions. Third, we discarded 184,345 cases containing invalid (98: Ignored and 99: Not Specified) or null values in one or more of their attributes. Thus, we


            Volume 1 Issue 3 (2024)                         33                               doi: 10.36922/aih.2591