Artificial Intelligence in Health Predicting ICU mortality: A stacked ensemble model
meta-model (Level-1), by combining the aforementioned predictions and effectively learning to weight the predictions of each Level-0 model. Taken together, the stacked ensemble learning outputs the final predictions and delivers more accurate and robust results.

Analogous to Figure 3, Level-0 consisted of two non-identical CatBoost models and a Random Forests one. The initial CatBoost model was assigned the highest importance weight, equal to 27. The model was trained with a learning rate of 0.1 and a Random Subspace Method rate of 0.9. The second CatBoost model utilized largely the same hyperparameters; the only variation was that the first CatBoost model had a depth of 8, whereas the second had a depth of 7. Moreover, the importance of the second CatBoost model was equal to 1. Guided by the feature importance analysis of the respective classifier, the second CatBoost model reduced its training feature vector by excluding the following features: hepatic failure, leukemia, AIDS, lymphoma, cirrhosis, and immunosuppression. Finally, the Random Forests classifier was implemented using the Gini impurity criterion, a broadly used metric for evaluating split quality during tree construction. The number of features considered for node splitting was reduced by setting the maximum features parameter to 70%. In addition, a minimum of 30 samples per node split was enforced, and the maximum tree depth was constrained to 7. These hyperparameter settings contributed to the overall efficacy of the ensemble learning strategy.
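The Gini impurity criterion used for the Random Forests splits can be illustrated with a minimal, self-contained sketch (the function below is a generic textbook implementation, not code from this study):

```python
def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class
    proportions. 0.0 means a pure node; 0.5 is the worst case for
    a balanced binary node."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A pure node (e.g., all survivors) has zero impurity;
# a perfectly mixed node has impurity 0.5.
pure = gini_impurity([0, 0, 0, 0])
mixed = gini_impurity([0, 1, 0, 1])
```

During tree construction, the split that most reduces the weighted impurity of the child nodes is chosen.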
After extensive validation and performance evaluation, we arrived at the aforementioned ensemble structure. The subsequent section presents the results of the validation and evaluation methods used in this study.
4. Results
The 10-fold cross-validation approach was employed to improve generalization ability and to reduce the possibility of overfitting. Cross-validation maximized model generalization by minimizing the bias associated with single train-test splits, resulting in an evaluation that is more robust and trustworthy. The trained models were evaluated using a variety of metrics, such as F1, accuracy, precision, and recall, each of which offered a unique perspective on this binary medical classification task. The optimal architecture was selected based on a comprehensive comparison of the metrics, with particular emphasis placed on the F1 score.
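For reference, the metrics emphasized here can be computed from the confusion-matrix counts as follows (a generic sketch with illustrative labels, where 1 denotes ICU death and 0 survival; these are not labels from the study's data):

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for a binary task from true/false
    positive and false negative counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Illustrative labels only
p, r, f = precision_recall_f1([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
```

F1, the harmonic mean of precision and recall, is a natural emphasis for a mortality task, where class imbalance makes plain accuracy misleading.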
Figure 2. SHAP Beeswarm plot
Abbreviations: AIDS: Acquired immune deficiency syndrome; BSL: Blood sugar level; BUN: Blood urea nitrogen; FiO2: Fraction of inspired oxygen; GCS: Glasgow Coma Scale; ICU: Intensive care unit; MAP: Mean arterial pressure; SHAP: SHapley Additive exPlanations; WBC: White blood cell count.

A comparative examination of the F1 scores attained by the various models examined in this research is shown in Figure 4. Prior to ensemble learning, individual models such as Random Forests, LightGBM, XGBoost, and CatBoost demonstrated the most promising results, as observed in separate evaluations. Nevertheless, the utilization of stacked ensemble learning yielded an
Figure 3. Stacked ensemble learning process flow
Volume 2 Issue 2 (2025) 51 doi: 10.36922/aih.4981

