
Artificial Intelligence in Health                                         Cirrhosis prediction in hepatitis C



            2,199 control samples. Of the 8,031 patients who did not develop cirrhosis and maintained a follow-up of more than 1 year, 976 achieved SVR within 1 year of their most recent TE. We assumed that they would not develop cirrhosis in the subsequent year, and thus randomly sampled one visit as time t within 1 year before the last available TE. For the rest of the patients, who neither achieved SVR in the most recent year nor developed cirrhosis, 6,345 had documented records within the 1–2-year window before the last available TE, for whom we randomly sampled one visit as time t in that window. Finally, for the remaining 710 patients who did not have any records within 2 years before the last available TE, we selected the most recent documented visit before the last available TE as time t. We ended up with 10,230 control samples in which the TE outcome was assumed to remain negative within 1 year of the sampled visit (time t) (Figure 1).

            2.5.2. Additional unlabeled cohort for semi-supervised learning
            In addition to the labeled cohort described above, we included patients without known TE outcomes (unlabeled patients) to improve the feature representation of longitudinal predictors. We defined a surrogate outcome as the achievement of two consecutive APRIs >2⁴ to ensure a sampling scheme similar to that of the labeled cohort and to avoid potential bias. After removing visits later than the 1st date of developing the surrogate outcome, we randomly sampled one visit (time t) within the 1–2-year window before the surrogate outcome for the 159,039 unlabeled patients (Figure 1).

            2.6. Models for supervised and semi-supervised learning
            We developed four different models to predict the probability of developing cirrhosis within 1 year after time t using baseline predictors and longitudinal predictors from enrollment to time t. To utilize longitudinal information, we employed two approaches in our analysis. The first approach was to compute summary statistics for each longitudinal predictor, including the maximum, minimum, maximum of slope, minimum of slope, and total variation. These summary statistics were combined with baseline predictors and used to train conventional machine learning models.¹³ We opted for LR (a classic linear method) and RF (a highly non-linear method based on decision trees) to evaluate the effectiveness of conventional machine learning methods based on human-designed features.

            The second approach, which handles raw longitudinal predictors from enrollment to time t, was to use an RNN, which excels in processing sequential data with irregular time gaps and eliminates the need for feature extraction. The adaptable structure of RNNs also enables them to support both supervised and semi-supervised learning, a capability that is not easy to attain with LR or RF. We developed a supervised RNN utilizing labeled data only, and a semi-RNN that employed the abundant unlabeled data to improve classification performance.

            Specifically, for the supervised RNN model, we used gated recurrent units (GRU)¹⁴ to regulate information flow and remember long-term information. The longitudinal information from the hidden units of the GRU was passed to a max pooling layer, and then merged with time-invariant information from baseline predictors using a feedforward neural network (FNN). Finally, we built another FNN to process the combined information, and used a sigmoid activation function in the output layer to predict the probability of developing cirrhosis within 1 year (Figure 2). In all FNNs, we used the rectified linear unit (ReLU) as the non-linear activation function. To train the model, we minimized the binary cross-entropy loss, termed the supervised loss, through the Adam stochastic optimization algorithm.¹³ Adam's adaptive learning rates naturally deal with noisy gradients, which can be viewed as a form of implicit regularization. By dampening the effect of noisy updates (due to averaging over time), Adam avoids the tendency to overfit to noise in the training data.¹⁵ We also used the dropout technique¹⁶ to prevent overfitting, and an early stopping mechanism with a patience of 10 epochs to prevent unnecessary training beyond the optimal point.

            For the semi-RNN model, we incorporated an auxiliary task in addition to the primary prediction task of the supervised RNN. The auxiliary task was to predict the values of the longitudinal predictors at the next visit, which can be trained using unlabeled data. We shared the GRU layers between the auxiliary task and the prediction task. Jointly learning both tasks could help improve feature representation by leveraging unlabeled data. We defined the negative log-likelihood for the longitudinal predictors as the unsupervised loss, and we minimized the weighted sum of the supervised and unsupervised losses to train the model end-to-end.¹⁷ The weight, which controls the trade-off between supervised and unsupervised learning, was selected by hyperparameter tuning.

            2.7. Statistical analysis
            To conduct the analysis, we randomly split the labeled data into a training set (40%), a validation set (30%), and a testing set (30%). The unlabeled data used in the semi-RNN belonged to the training set. We learned each model using the training and validation sets, and then evaluated their performance on the same testing set. This procedure was


            Volume 2 Issue 2 (2025)                         91                               doi: 10.36922/aih.4671
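The semi-RNN objective described in Section 2.6, binary cross-entropy as the supervised loss, a negative log-likelihood over next-visit predictor values as the unsupervised loss, and a tuned weight balancing the two, can be sketched numerically as follows. This is a minimal illustration rather than the authors' implementation: the Gaussian form of the likelihood, the fixed sigma, and the names `lam` and `semi_supervised_loss` are assumptions.

```python
import numpy as np

def supervised_loss(y_true, p_pred, eps=1e-7):
    """Binary cross-entropy on labeled samples (the 'supervised loss')."""
    p = np.clip(p_pred, eps, 1 - eps)  # guard against log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def unsupervised_loss(x_next, x_pred, sigma=1.0):
    """Average Gaussian negative log-likelihood of next-visit predictor
    values (the 'unsupervised loss' for the auxiliary task); the Gaussian
    form and fixed sigma are assumptions for illustration."""
    n = x_next.size
    sq_err = (x_next - x_pred) ** 2 / sigma**2
    return 0.5 * np.sum(sq_err + np.log(2 * np.pi * sigma**2)) / n

def semi_supervised_loss(y_true, p_pred, x_next, x_pred, lam):
    """Weighted sum minimized end-to-end; lam plays the role of the
    trade-off weight selected by hyperparameter tuning."""
    return supervised_loss(y_true, p_pred) + lam * unsupervised_loss(x_next, x_pred)
```

In training, gradients of this combined objective would flow back into the GRU layers shared by both tasks, which is how the unlabeled data shapes the feature representation used by the cirrhosis-prediction head.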