
Artificial Intelligence in Health                            Predicting ICU mortality: A stacked ensemble model



in the study. The XGBoost model showed superior predictive performance compared with traditional models (APACHE-IV and SOFA), with an AUC of 0.86.

Nevertheless, the results’ generalizability to other ICU patients was restricted by the difficulty of predicting mortality in a specific patient cohort. In addition, the use of multiple imputation to address missing data, despite providing more complete sets for model training, may have contributed to bias, since it produced a sample that differed from the “natural” actual dataset. Finally, the low mortality rate (3.4%) may have led to an imbalanced sample, resulting – due to overfitting – in high overall accuracy but low sensitivity in predicting mortality and insufficient generalizability. The authors do not appear to have used techniques such as resampling or oversampling of the minority class to address this challenge.¹¹

Other attempts using stacking ensemble models have been made to predict ICU mortality in specific patient categories, such as patients with heart failure (HF). In their study, Chiu et al.¹³ collected and analyzed data from 6699 HF patients from the MIMIC-III database. Their model had slightly higher accuracy (95.25%), which, as in other cases, can be attributed to the focus on a specific patient cohort, but the AUC was quite low (82.55%). Furthermore, the other metrics (precision 80.30%, recall 66.82%, and F₁-score 72.86%) were not remarkably high.¹³ The stacking ensemble model overall outperformed models such as Random Forests, Support Vector Machines, K-Nearest Neighbors, LightGBM, Bagging, and AdaBoost. The main limitations include the retrospective nature of data collected long ago (between 2001 and 2012), which may involve bias due to the possible evolution of HF treatment protocols, the use of data from a single center, and the use of a single patient category. Finally, the study achieved its best results when predicting mortality within three days, a strong advantage over other studies without such a capability.¹³

This study considered data from a publicly available database, which may have contributed to higher overall model performance, but with less adaptation to local conditions. In studies such as that of Choi et al.¹⁰ in South Korea, an approach was taken that produced varying results depending on the individual hospitals (hospitals “S” and “G”). Data were collected from 2006 to 2020 and included 85,146 patients. The study included ensemble techniques and found that, among the ML models evaluated (K-Nearest Neighbors, Decision Trees, Random Forests, XGBoost, LightGBM, SVM, and artificial neural networks), the XGBoost and LightGBM algorithms had the best overall results. XGBoost achieved an AUC of 0.977 and an F₁-score of 0.840 in hospital “S,” and LightGBM achieved an AUC of 0.955 and an F₁-score of 0.762 in hospital “G.”¹⁰ In general, the difference between hospitals can be attributed to several factors, such as the composition of patient characteristics, medical and nursing practices, medical and nursing staffing, and other hospital resources. This suggests that the success of ML models may depend largely on their adaptation to the local context of each hospital, whose importance was noted by the authors.¹⁰

In another case of using a stacking ensemble model, Hwangbo et al.³⁴ attempted to predict 6-month mortality in ischemic stroke patients without reperfusion therapy. The sample comprised 8787 patients from a special dataset (International Stroke Trial) in South Korea. The results showed an AUC of 0.783, accuracy of 71.6%, sensitivity of 72.3%, specificity of 70.9%, and an F₁-score of 0.420. The stacking ensemble model showed comparable or slightly better performance (especially in AUC) compared with traditional models. However, the model’s performance can be considered relatively poor, likely due to the use of very early clinical data (or clinical variables that may not fully capture the complexity of the patient’s condition). Data collected later in the course of the disease may be helpful for more accurate mortality prediction. Furthermore, the exclusion of many patients from the dataset due to various criteria, and the age of the dataset (from the 1990s), were limitations that may have introduced biases and reduced the overall usable information generated.³⁴

The importance and interpretability of the clinical characteristics included in the different datasets are crucial. In a study in Japan, for example, Iwase et al.³⁵ additionally included lactate dehydrogenase (LDH) among their variables, which turned out to be the most critical predictor. Overall, LDH, along with lactic acid and platelet count, emerged as the most important variables for predicting mortality, which is consistent with the existing clinical literature (especially for LDH), as high LDH levels have been associated with mortality in patients with sepsis, acute respiratory distress syndrome, and acute pancreatitis. The authors tested several algorithms (Random Forests, XGBoost, Neural Networks) on a sample of 12,747 ICU patients to predict mortality and LOS. The study showed that Random Forests had the highest performance, with an AUC of 0.945.³⁵

As has been reported elsewhere, the XGBoost algorithm appears to be used in several similar studies for prediction analyses. In the study by Pang et al.,³⁶ for example, XGBoost, as an ensemble technique, showed the best performance among the models tested (XGBoost, LR, SVM, and Decision Trees). They applied the undersampling technique with a random subset of 14,110 patients from
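Two techniques recurring across the studies reviewed here are stacking ensembles and resampling to counter the class imbalance caused by low mortality rates. The following sketch illustrates both in scikit-learn on synthetic data; it is a hypothetical, minimal example, not the pipeline of any cited study, and all model choices and parameters are illustrative assumptions.

```python
# Illustrative sketch (assumed setup, not any cited study's actual pipeline):
# a stacking ensemble for binary mortality prediction on an imbalanced
# synthetic cohort, with simple random undersampling of the majority class.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for an ICU cohort: ~3% positive (died) class,
# mirroring the low mortality rates discussed in the text.
X, y = make_classification(n_samples=5000, n_features=20,
                           weights=[0.97, 0.03], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Random undersampling: keep all positives, draw an equal number of negatives.
rng = np.random.default_rng(0)
pos = np.flatnonzero(y_tr == 1)
neg = rng.choice(np.flatnonzero(y_tr == 0), size=len(pos), replace=False)
idx = np.concatenate([pos, neg])
X_bal, y_bal = X_tr[idx], y_tr[idx]

# Stacking: tree-based base learners, logistic regression meta-learner that
# combines their out-of-fold predictions.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("gb", GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_bal, y_bal)

# Evaluate on the untouched (still imbalanced) test split.
proba = stack.predict_proba(X_te)[:, 1]
print(f"AUC: {roc_auc_score(y_te, proba):.3f}  "
      f"F1: {f1_score(y_te, stack.predict(X_te)):.3f}")
```

Reporting AUC alongside the F₁-score, as the studies above do, matters here: with a 3% positive class, accuracy alone can look high even for a model that misses most deaths.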


Volume 2 Issue 2 (2025)                                                        doi: 10.36922/aih.4981