
Artificial Intelligence in Health                                         Cirrhosis prediction in hepatitis C



            Brier scores (0.120 [0.015] and 0.129 [0.002], respectively), compared to the LR and RF models (Table 2). All P-values of the four paired t-tests (LR versus RNN, LR versus semi-RNN, RF versus RNN, and RF versus semi-RNN) were below 0.05 for AuROC, Brier score, AuPRC, proportion of samples who test positive at 80% sensitivity, specificity at 80% sensitivity, positive predictive value at 80% sensitivity, and negative predictive value at 80% sensitivity.

            3.3. Model robustness

            Robustness characterizes a model's capacity to sustain consistent and reliable predictions under varying conditions. In this study, we deliberately reduced the volume of labeled data to assess the models' stability and generalizability. The superior performance of the semi-RNN and RNN models became evident when 50% and 100% of the training and validation data for the labeled cohort were used (Figure 3). This finding underscores the critical role of substantial labeled data in optimizing the predictive power of deep learning models.

            3.4. Model calibration

            To examine the calibration of the models, we chose a representative split with an AuROC closest to the mean of the 10 splits for the RNN model. The calibration plot demonstrates the correspondence between predicted probabilities and observed outcomes. Perfect calibration is denoted by a 45° diagonal line, signifying that the model's predicted probabilities precisely match the actual probabilities of events occurring; proximity to this ideal line therefore indicates superior calibration. In Figure 4 (and Figures S1-S3), all four models exhibited good calibration across various proportions of labeled training and validation data (10%, 20%, 50%, and 100%) for predicting 1-year risks, with semi-RNN emerging as the optimal performer. This suggests that the models are reliable in estimating risks and have the potential for use in clinical decision-making.

            3.5. Feature attribution of models

            To elucidate the decision-making processes within a neural network model, we applied the feature attribution technique to an exemplary patient whose predicted risk scores were lower at early visits and higher at later visits, using a representative split from the RNN model in which 100% of the training and validation set was used for the labeled cohort. The feature attribution technique quantifies the contribution of individual features to a model's prediction [23] by calculating the gradient of the features against the loss function at two different visits and comparing feature importance based on the features' centered, adjusted values. The plots (Figure 5) revealed that, at the later visit with a higher predicted risk score, the RNN model relied more heavily on features such as AFP, FIB-4, and albumin, despite the centered values of these features not being extreme. In contrast, compared to an earlier visit, the model placed less emphasis on features such as time to first visit and glucose, but greater emphasis on features such as SVR and the standardized ratio of AFP. We also provide a similar variable importance analysis for RF and LR in Figure S4.
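The gradient-based attribution described above can be sketched as follows. The toy GRU architecture, feature subset, and synthetic visit values are illustrative assumptions, not the study's model or data:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the paper's RNN risk model (hypothetical architecture).
class RiskRNN(nn.Module):
    def __init__(self, n_features: int, hidden: int = 16):
        super().__init__()
        self.gru = nn.GRU(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x):  # x: (batch, visits, features)
        out, _ = self.gru(x)
        return torch.sigmoid(self.head(out)).squeeze(-1)  # per-visit risk

features = ["AFP", "FIB-4", "albumin", "glucose"]  # illustrative subset
model = RiskRNN(n_features=len(features))

# One synthetic patient with two visits (centered, standardized values).
x = torch.randn(1, 2, len(features), requires_grad=True)
target = torch.ones(1, 2)  # label: develops cirrhosis

risk = model(x)
loss = nn.functional.binary_cross_entropy(risk, target)
loss.backward()  # gradient of the loss with respect to the input features

# Attribution: gradient magnitude per feature, per visit; comparing the two
# visits shows which features the model leaned on as the risk score changed.
attribution = x.grad.abs().squeeze(0)  # shape: (visits, features)
for v in range(attribution.shape[0]):
    ranked = sorted(zip(features, attribution[v].tolist()), key=lambda t: -t[1])
    print(f"visit {v}: {ranked}")
```

Ranking features by gradient magnitude at each visit mirrors the paper's comparison of feature importance between an earlier and a later visit.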

            Table 2. Comparison of performance metrics of the models predicting cirrhosis development within 1 year in patients at risk
            when 100% of labeled data were used

            Characteristic, mean (SD)                   LR          RF          RNN       Semi‑RNN    P‑value*
            AuROC                                    0.724 (0.008)  0.731 (0.008)  0.744 (0.009)  0.785 (0.062)  <0.050
            Brier score                              0.133 (0.002)  0.131 (0.002)  0.129 (0.002)  0.120 (0.015)  <0.050
            AuPRC                                    0.345 (0.009)  0.358 (0.006)  0.371 (0.010)  0.448 (0.119)  <0.050
            Proportion of samples who test positive at 90% sensitivity  0.699 (0.018)  0.685 (0.025)  0.674 (0.021)  0.597 (0.109)  >0.050
            Specificity at 90% sensitivity           0.344 (0.022)  0.361 (0.030)  0.374 (0.026)  0.467 (0.133)  >0.050
            Positive predictive value at 90% sensitivity  0.227 (0.006)  0.232 (0.009)  0.236 (0.009)  0.277 (0.061)  >0.050
            Negative predictive value at 90% sensitivity  0.940 (0.004)  0.943 (0.004)  0.945 (0.004)  0.953 (0.010)  >0.050
            Proportion of samples who test positive at 80% sensitivity  0.538 (0.014)  0.533 (0.015)  0.517 (0.018)  0.455 (0.095)  <0.050
            Specificity at 80% sensitivity           0.518 (0.017)  0.524 (0.018)  0.544 (0.022)  0.619 (0.115)  <0.050
            Positive predictive value at 80% sensitivity  0.263 (0.007)  0.265 (0.009)  0.274 (0.011)  0.328 (0.084)  <0.050
            Negative predictive value at 80% sensitivity  0.923 (0.003)  0.924 (0.003)  0.926 (0.003)  0.933 (0.010)  <0.050
            *All P-values of four paired t-tests including LR versus RNN, LR versus semi-RNN, RF versus RNN, and RF versus semi-RNN are below 0.05 for
            AuROC, Brier score, AuPRC, proportion of samples who test positive at 80% sensitivity, specificity at 80% sensitivity, positive predictive value at 80%
            sensitivity, and negative predictive value at 80% sensitivity.
            Abbreviations: AuROC: Area under the receiver operating characteristic curve; AuPRC: Area under the precision-recall curve; LR: Logistic regression;
            RF: Random forest; RNN: Supervised recurrent neural network; Semi-RNN: Semi-supervised recurrent neural network.
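The split-level comparisons in the footnote can be sketched as paired t-tests over per-split metrics. The AuROC values below are synthetic placeholders drawn to resemble the means and SDs in Table 2, not the study's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic per-split AuROCs for two models over the same 10 splits
# (placeholders; the actual values are summarized in Table 2).
auroc_lr = rng.normal(0.724, 0.008, size=10)
auroc_rnn = auroc_lr + rng.normal(0.020, 0.005, size=10)  # paired improvement

# The same splits underlie both samples, so a paired test (ttest_rel) applies
# rather than an independent two-sample test.
t_stat, p_value = stats.ttest_rel(auroc_rnn, auroc_lr)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

The same call would be repeated for each metric and each of the four model pairs (LR vs. RNN, LR vs. semi-RNN, RF vs. RNN, RF vs. semi-RNN).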


            Volume 2 Issue 2 (2025)                         94                               doi: 10.36922/aih.4671
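As a brief sketch of the calibration check and Brier score used in Section 3.4, with synthetic predictions and outcomes standing in for the study's data:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)

# Synthetic predicted risks, with outcomes drawn to follow those risks,
# i.e., a well-calibrated model by construction.
y_prob = rng.uniform(0, 1, size=5000)
y_true = (rng.uniform(0, 1, size=5000) < y_prob).astype(int)

# Observed event rate vs. mean predicted risk per bin; a well-calibrated
# model tracks the 45-degree diagonal.
frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)

# Brier score: mean squared error between predicted risk and outcome
# (lower is better).
brier = brier_score_loss(y_true, y_prob)
print(f"Brier score: {brier:.3f}")
for m, f in zip(mean_pred, frac_pos):
    print(f"predicted {m:.2f} -> observed {f:.2f}")
```

Plotting `frac_pos` against `mean_pred` reproduces the calibration plot described in the text, with the diagonal as the ideal reference line.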