Artificial Intelligence in Health Cirrhosis prediction in hepatitis C
Brier scores (0.120 [0.015] and 0.129 [0.002], respectively), compared to the LR and RF models (Table 2). All p-values of the four paired t-tests, including LR versus RNN, LR versus semi-RNN, RF versus RNN, and RF versus semi-RNN, are below 0.05 for AuROC, Brier score, AuPRC, proportion of samples who test positive at 80% sensitivity, specificity at 80% sensitivity, positive predictive value at 80% sensitivity, and negative predictive value at 80% sensitivity.

3.3. Model robustness
Robustness characterizes a model’s capacity to sustain consistent and reliable predictions under varying conditions. In this study, we deliberately reduced the volume of labeled data to assess the models’ stability and generalizability. The superior performance of the semi-RNN and RNN models became evident when 50% and 100% of the training and validation data for the labeled cohort were used (Figure 3). This finding underscores the critical role of substantial labeled data in optimizing the predictive power of deep learning models.

3.4. Model calibration
To examine the calibration of the models, we chose a representative split with an AuROC closest to the mean of the 10 splits for the RNN model. The calibration plot demonstrates the correspondence between predicted probabilities and observed outcomes. Perfect calibration is denoted by a 45° diagonal line, signifying that the model’s predicted probabilities precisely match the actual probabilities of events occurring; proximity to this ideal line indicates superior calibration. In Figure 4 (and Figures S1-S3), all four models exhibited good calibration across various proportions of labeled training and validation data (10%, 20%, 50%, and 100%) for predicting 1-year risks, with the semi-RNN emerging as the optimal performer. This suggests that the models are reliable in estimating risks and have the potential for use in clinical decision-making.

3.5. Feature attribution of models
To elucidate the decision-making processes within a neural network model, we applied the feature attribution technique to an exemplary patient, who had lower predicted risk scores at first and higher predicted risk scores at later visits, for a representative split from the RNN model when 100% of the training and validation set was used for the labeled cohort. The feature attribution technique quantifies the contribution of individual features to a model’s prediction by calculating the gradient of the features with respect to the loss function at two different visits and comparing feature importance based on their centered, adjusted values. The plots (Figure 5) revealed that, at the later visit with a higher predicted risk score, the RNN model relied more heavily on features such as AFP, FIB-4, and albumin, despite the centered values of these features not being extreme. In contrast, compared to an earlier visit, the model placed less emphasis on features such as time to first visit and glucose, but greater emphasis on features such as SVR and the standardized ratio of AFP. We also provide a similar variable importance analysis for RF and LR in Figure S4.
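As a concrete illustration of the calibration check described above, the sketch below bins hypothetical predicted 1-year risks and compares each bin’s mean prediction against the observed event fraction, the two quantities plotted in a calibration curve, and also computes the Brier score reported in Table 2. All data here are simulated for illustration; `calibration_curve` and `brier_score_loss` are standard scikit-learn utilities, not the authors’ code.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Hypothetical stand-ins for a model's predicted 1-year cirrhosis risks
# and the observed outcomes (1 = developed cirrhosis within 1 year).
# Outcomes are drawn from the predicted probabilities, so this simulated
# "model" is well calibrated by construction.
y_prob = rng.uniform(0.0, 1.0, size=1000)
y_true = rng.binomial(1, y_prob)

# Observed event fraction per bin of predicted probability; a well-calibrated
# model tracks the 45-degree diagonal (frac_pos close to mean_pred per bin).
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

# Brier score: mean squared error between predicted probability and outcome
# (lower is better; Table 2 reports roughly 0.12-0.13 for the four models).
brier = brier_score_loss(y_true, y_prob)
print("per-bin observed vs predicted:", list(zip(frac_pos, mean_pred)))
print("Brier score:", round(brier, 3))
```

In a calibration plot, the pairs `(mean_pred, frac_pos)` are the plotted points; deviation from the diagonal in any bin indicates systematic over- or under-estimation of risk in that range.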
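The gradient-based attribution in Section 3.5 can be sketched as follows. This is a minimal, hypothetical stand-in, not the authors’ model: a small GRU risk model, an illustrative subset of the paper’s features, and one synthetic patient trajectory. The importance score is the magnitude of the gradient of the loss with respect to each input feature, which can then be compared between an early and a late visit.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Minimal stand-in for the risk model: a GRU over per-visit features
# followed by a linear head that outputs a risk logit at every visit.
class RiskRNN(nn.Module):
    def __init__(self, n_features: int, hidden: int = 16):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):                 # x: (batch, visits, features)
        h, _ = self.rnn(x)
        return self.head(h).squeeze(-1)   # risk logit per visit

features = ["AFP", "FIB-4", "albumin", "glucose", "SVR"]  # illustrative subset
model = RiskRNN(n_features=len(features))

# One synthetic patient with 4 visits; requires_grad lets autograd compute
# the gradient of the loss with respect to every input feature value.
x = torch.randn(1, 4, len(features), requires_grad=True)
y = torch.tensor([[0.0, 0.0, 1.0, 1.0]])  # cirrhosis within 1 year at later visits

loss = nn.functional.binary_cross_entropy_with_logits(model(x), y)
loss.backward()

# |d loss / d feature| at each visit serves as a per-visit importance score.
attribution = x.grad.abs().squeeze(0)     # shape: (visits, features)
for visit in (0, 3):                      # compare an early vs a late visit
    ranked = sorted(zip(features, attribution[visit].tolist()),
                    key=lambda t: -t[1])
    print(f"visit {visit}:", ranked)
```

Comparing the ranked lists across the two visits mirrors the Figure 5 analysis: a feature whose gradient magnitude grows between visits is one the model leans on more heavily for the later, higher-risk prediction.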
Table 2. Comparison of performance metrics of the models predicting cirrhosis development within 1 year in patients at risk
when 100% of labeled data were used
Characteristic, mean (SD) LR RF RNN Semi‑RNN P‑value*
AuROC 0.724 (0.008) 0.731 (0.008) 0.744 (0.009) 0.785 (0.062) <0.050
Brier score 0.133 (0.002) 0.131 (0.002) 0.129 (0.002) 0.120 (0.015) <0.050
AuPRC 0.345 (0.009) 0.358 (0.006) 0.371 (0.010) 0.448 (0.119) <0.050
Proportion of samples who test positive at 90% sensitivity 0.699 (0.018) 0.685 (0.025) 0.674 (0.021) 0.597 (0.109) >0.050
Specificity at 90% sensitivity 0.344 (0.022) 0.361 (0.030) 0.374 (0.026) 0.467 (0.133) >0.050
Positive predictive value at 90% sensitivity 0.227 (0.006) 0.232 (0.009) 0.236 (0.009) 0.277 (0.061) >0.050
Negative predictive value at 90% sensitivity 0.940 (0.004) 0.943 (0.004) 0.945 (0.004) 0.953 (0.010) >0.050
Proportion of samples who test positive at 80% sensitivity 0.538 (0.014) 0.533 (0.015) 0.517 (0.018) 0.455 (0.095) <0.050
Specificity at 80% sensitivity 0.518 (0.017) 0.524 (0.018) 0.544 (0.022) 0.619 (0.115) <0.050
Positive predictive value at 80% sensitivity 0.263 (0.007) 0.265 (0.009) 0.274 (0.011) 0.328 (0.084) <0.050
Negative predictive value at 80% sensitivity 0.923 (0.003) 0.924 (0.003) 0.926 (0.003) 0.933 (0.010) <0.050
*All p-values of the four paired t-tests, including LR versus RNN, LR versus semi-RNN, RF versus RNN, and RF versus semi-RNN, are below 0.05 for AuROC, Brier score, AuPRC, proportion of samples who test positive at 80% sensitivity, specificity at 80% sensitivity, positive predictive value at 80% sensitivity, and negative predictive value at 80% sensitivity.
Abbreviations: AuROC: Area under the receiver operating characteristic curve; AuPRC: Area under the precision-recall curve; LR: Logistic regression;
RF: Random forest; RNN: Supervised recurrent neural network; Semi-RNN: Semi-supervised recurrent neural network.
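The paired t-tests behind the p-value column can be sketched as below. Because every model is evaluated on the same 10 train/test splits, each split yields one paired difference in the metric. The per-split AuROC values here are simulated to loosely match the Table 2 means and SDs; they are not the study’s actual per-split results.

```python
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(1)

# Hypothetical per-split AuROCs for two models over the same 10 splits,
# simulated around the Table 2 summaries (LR ~0.724, semi-RNN ~0.785).
auroc_lr = rng.normal(0.724, 0.008, size=10)
auroc_semi = rng.normal(0.785, 0.020, size=10)

# Paired t-test: H0 is that the mean per-split difference is zero.
# Pairing by split removes split-to-split difficulty as a noise source.
t_stat, p_value = ttest_rel(auroc_semi, auroc_lr)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The same call, repeated for each metric and each of the four model pairs, produces the set of comparisons summarized in the footnote above.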
Volume 2 Issue 2 (2025) doi: 10.36922/aih.4671

