Artificial Intelligence in Health Predicting mortality in COVID-19 using ML
dataset comprised more than 2,670,000 confirmed COVID-19 patients from 146 countries, with an average age of 44.75 years. They applied feature selection to filter irrelevant symptoms and pre-existing conditions, obtaining accuracies of 89.98% for ANNs, 89.83% for KNNs, 89.02% for SVM, 87.93% for RF, 87.91% for LR, and 86.87% for DTs. The advantages of the study include the diversity of patient origins, the large sample size, and the use of different ML methods. The main limitation was the lack of ensemble ML methods.

Naseem et al. [41] aimed to develop a novel deep learning neural network (DNN) model for COVID-19 mortality prediction using the Neo-V framework and compared its performance to other traditional ML models, such as LR, RF, KNN, random trees, support vector classifier using radial basis function (SVC-RBF), adaptive boosting (AdaBoost) classifier, quadratic discriminant analysis, and a DNN. The dataset used comprised laboratory and clinical data of 1,214 adult COVID-19 patients admitted to Aga Khan University Hospital from February to September 2020. The DNN Neo-V model outperformed the conventional ML models, achieving an accuracy of 99.53%, a sensitivity of 89.87%, a specificity of 95.63%, and an AUC-ROC of 88.5. The main advantage of the study is the diversity of the ML methods used, including the Neo-V framework, with the limitation being the small number of patients.

Chadaga et al. [42] aimed to predict mortality among COVID-19 patients using epidemiological parameters. The ML methods used were RF, XGBoost, LightGBM, categorical boosting, AdaBoost, and gradient boosting. The dataset used was provided by the Directorate General of Epidemiology, Secretariat of Health (Mexico) [43] and consisted of 263,007 confirmed COVID-19 patients with 19 selected attributes each. The XGBoost model achieved the best results with an accuracy of 96%, a precision of 95%, a recall of 95%, an F1-score of 95%, and an AUC-ROC of 96%. The advantages of the study include the number and variety of ML methods used and the large patient dataset. However, the main limitation was the lack of non-ensemble ML methods.

Rai et al. [44] proposed a voting ensemble model comprising the extra trees classifier, the RF, the gradient boosting classifier, and the XGBoost. The proposed model was compared to baseline models, including KNN, Naïve Bayes classifier, XGBoost, RF, gradient boosting classifier, and extra trees classifier. The dataset used for the research was obtained from a publicly available source consisting of blood biomarkers of 4,711 patients admitted to the hospital from March 1 to April 16, 2020. The highest scores were recorded by the proposed voting ensemble model, with an accuracy of 86.99%, a precision of 0.744, a recall of 0.690, an AUC-ROC of 0.895, and an F1-score of 0.716. The advantage of the study is the use of a diverse ensemble of ML methods, while the main limitation was the small number of patients.

Bárcenas and Fuentes-García [45] conducted a study to determine the risk factors associated with mortality in COVID-19 patients using RF, GBM, and XGBoost. They used a subset of the dataset provided by the Mexican government [26], recorded from January 17, 2020, to June 28, 2020, which consisted of 583,678 patients, 220,657 of whom were confirmed COVID-19 patients. Patients were classified into three risk categories: low, moderate, and high, depending on comorbidities and major symptoms. The overall accuracy for predicting mortality was 89.97% for XGBoost, 89.86% for RF, and 83.37% for GBM. The advantage of the study is the large patient dataset, while its limitations include the use of only a few ML methods and the lack of non-ensemble methods. In addition, recent studies demonstrate promising results in the use of ML in various domains, such as contact tracing for COVID-19 transmission [46], prenatal screening [47], predicting the occurrence of type 2 diabetes [48], and cardiovascular disease [49].

3. Data and methods

In this section, we present the dataset used, the preprocessing techniques applied, and the ML algorithmic methods employed to train the models.

3.1. Data preprocessing

The data preprocessing procedure encompasses cleaning, transforming, and encoding the raw data, as well as generating the training and testing datasets for each iteration.

3.1.1. Cleansing

Our dataset consists of 12,425,179 cases suspected of having COVID-19, who attended various health facilities in Mexico from January 17, 2020, until January 3, 2022. The dataset is publicly available as a CSV file disseminated by the Government of Mexico [26].

First, we translated all 40 attribute names from Spanish to English. Second, we cleansed the dataset by retaining only the positive COVID-19 cases, as indicated by the “Laboratory Result” (1: SARS-CoV-2 positive) and “Final Classification” (1, 2, and 3: Confirmed case) attributes, in accordance with the guidelines of the Epidemiological Association of Mexico and the Mexican Commission of Medical Decisions. Third, we discarded 184,345 cases containing invalid (98: Ignored and 99: Not Specified) or null values in one or more of their attributes. Thus, we
Volume 1 Issue 3 (2024) 33 doi: 10.36922/aih.2591

