Artificial Intelligence in Health Predicting ICU mortality: A stacked ensemble model
in the study. The XGBoost model showed superior predictive performance over traditional models (APACHE-IV and SOFA), with an AUC of 0.86 [10]. Nevertheless, the results' generalizability to other ICU patients was restricted by the difficulty of predicting mortality in a specific patient cohort. In addition, the use of multiple imputation to address missing data, despite providing more complete sets for model training, may have contributed to bias, since it produced a sample that differed from the "natural" actual dataset. Finally, the low mortality rate (3.4%) may have led to an imbalanced sample, resulting in high overall accuracy (due to overfitting) but low sensitivity in predicting mortality and insufficient generalizability. The authors do not seem to have used techniques such as resampling or oversampling of the minority class to address this challenge [11].

Other attempts using stacking ensemble models have been made to predict ICU mortality in specific patient categories, such as patients with heart failure (HF). In their study, Chiu et al. [13] collected and analyzed data from 6699 HF patients from the MIMIC-III database. Their model had slightly higher accuracy (95.25%), which, as in other cases, can be attributed to the focus on a specific patient cohort, but the AUC was quite low (82.55%). Furthermore, the other metrics (precision 80.30%, recall 66.82%, and F1-score 72.86%) were not remarkably high. The stacking ensemble model overall outperformed models such as Random Forests, Support Vector Machines, K-Nearest Neighbors, LightGBM, Bagging, and AdaBoost. The main limitations include the retrospective nature of data collected long ago (between 2001 and 2012), which may introduce bias due to the possible evolution of HF treatment protocols, the use of data from a single center, and the use of a single patient category. Finally, the study achieved its best results in predicting mortality within three days, which is a strong advantage over other studies [13].

This study considered data from a publicly available database, which may have contributed to higher overall model performance, but with less adaptation to local conditions. In studies such as that of Choi et al. [10] in South Korea, an approach was taken that produced varying results depending on the individual hospitals (hospitals "S" and "G"). Data were collected from 2006 to 2020 and included 85,146 patients. The study included ensemble techniques and found that, among the ML models evaluated (K-Nearest Neighbors, Decision Trees, Random Forests, XGBoost, LightGBM, SVM, and artificial neural networks), the XGBoost and LightGBM algorithms had the best overall results. XGBoost achieved an AUC of 0.977 and an F1-score of 0.840 in hospital "S," and LightGBM achieved an AUC of 0.955 and an F1-score of 0.762 in hospital "G." In general, the difference between hospitals can be attributed to several factors, such as the composition of patient characteristics, medical and nursing practices, medical and nursing staffing, and other hospital resources. This suggests that the success of ML models may depend largely on their adaptation to the local context of each hospital, whose importance was noted by the authors [10].

In another case of using a stacking ensemble model, Hwangbo et al. [34] attempted to predict 6-month mortality in ischemic stroke patients without reperfusion therapy. The sample comprised 8787 patients from a special dataset (International Stroke Trial) in South Korea. The results showed an AUC of 0.783, accuracy of 71.6%, sensitivity of 72.3%, specificity of 70.9%, and F1-score of 0.420. The stacking ensemble model showed comparable or slightly better performance (especially in AUC) compared with traditional models. However, the performance of the model can be considered relatively poor, likely due to the use of very early clinical data (or clinical variables that may not fully capture the complexity of the patient's condition). Data collected later in the course of the disease may enable more accurate mortality prediction. Furthermore, the exclusion of many patients from the dataset due to various criteria, and the age of the dataset (from the 1990s), were limitations that may have introduced biases and reduced the overall usable information generated [34].

The importance and interpretability of the clinical characteristics included in the different datasets are crucial. In a study in Japan, for example, Iwase et al. [35] additionally included lactate dehydrogenase (LDH) among their variables, and it turned out to be the most critical predictor. Overall, LDH, along with lactic acid and platelet count, emerged as the most important variables for predicting mortality, which is consistent with the existing clinical literature (especially for LDH), as high LDH levels have been associated with mortality in patients with sepsis, acute respiratory distress syndrome, and acute pancreatitis. The authors tested several algorithms (Random Forests, XGBoost, Neural Networks) on a sample of 12,747 ICU patients to predict mortality and LOS. The study showed that Random Forests had the highest performance, with an AUC of 0.945 [35].

As has been reported elsewhere, the XGBoost algorithm appears to be used in several similar studies for prediction analyses. In the study by Pang et al. [36], for example, XGBoost, as an ensemble technique, showed the best performance among the models tested (XGBoost, LR, SVM, and Decision Trees). They applied the undersampling technique with a random subset of 14,110 patients from
Volume 2 Issue 2 (2025) 54 doi: 10.36922/aih.4981

