2,199 control samples. Of the 8,031 patients who did not develop cirrhosis and maintained a follow-up of more than 1 year, 976 achieved SVR within 1 year of their most recent TE. We assumed that they would not develop cirrhosis in the subsequent year, and thus randomly sampled one visit as time t within 1 year before the last available TE. For the rest of the patients, who neither achieved SVR in the most recent year nor developed cirrhosis, 6,345 had documented records within the 1–2-year window before the last available TE; for these, we randomly sampled one visit as time t in that window. Finally, for the remaining 710 patients who did not have any records within 2 years before the last available TE, we selected the most recent documented visit before the last available TE as time t. We ended up with 10,230 control samples in which the TE outcome was assumed to remain negative within 1 year of the sampled visit (time t) (Figure 1).
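As a rough illustration of this three-case sampling rule, the pandas sketch below picks time t for one control patient. The DataFrame layout, the column names visit_date and last_te_date, and the svr_within_1y flag are our illustrative assumptions, not the study's code.

import pandas as pd

def sample_time_t(visits, svr_within_1y, seed=0):
    """Pick one visit (time t) for a control patient, following the
    three cases above; column names are illustrative."""
    days = (visits["last_te_date"].iloc[0] - visits["visit_date"]).dt.days
    if svr_within_1y:                        # case 1: within 1 year before last TE
        pool = visits[(days >= 0) & (days < 365)]
    else:                                    # case 2: the 1-2-year window
        pool = visits[(days >= 365) & (days < 730)]
    if pool.empty:                           # case 3: most recent prior visit
        return visits[days > 0].sort_values("visit_date").iloc[-1]
    return pool.sample(1, random_state=seed).iloc[0]

visits = pd.DataFrame({
    "visit_date": pd.to_datetime(["2019-03-01", "2020-02-01", "2020-06-01"]),
    "last_te_date": pd.to_datetime("2020-07-01")})
print(sample_time_t(visits, svr_within_1y=True)["visit_date"])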
2.5.2. Additional unlabeled cohort for semi-supervised learning

In addition to the labeled cohort described above, we included patients without known TE outcomes (unlabeled patients) to improve the feature representation of the longitudinal predictors. We defined a surrogate outcome as the achievement of two consecutive APRIs >2 [4] to ensure a sampling scheme similar to that of the labeled cohort and to avoid potential bias. After removing visits later than the 1st date of developing the surrogate outcome, we randomly sampled one visit (time t) within the 1–2-year window before the surrogate outcome for the 159,039 unlabeled patients (Figure 1).
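A minimal sketch of the surrogate-outcome definition, assuming per-patient lab records in a pandas DataFrame with illustrative columns visit_date and apri:

import pandas as pd

def first_surrogate_date(labs):
    """Return the 1st date at which two consecutive APRI values
    exceed 2, or None if the surrogate outcome never occurs."""
    labs = labs.sort_values("visit_date")
    hit = (labs["apri"] > 2) & (labs["apri"].shift(1) > 2)
    return labs.loc[hit, "visit_date"].min() if hit.any() else None

labs = pd.DataFrame({
    "visit_date": pd.to_datetime(["2018-01-05", "2018-06-01", "2018-09-10"]),
    "apri": [1.4, 2.3, 2.8]})
t0 = first_surrogate_date(labs)  # 2018-09-10 in this toy example
# Time t would then be sampled from visits dated 1-2 years before t0,
# after discarding all visits on or after t0.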
2.6. Models for supervised and semi-supervised learning

We developed four different models to predict the probability of developing cirrhosis within 1 year after time t, using baseline predictors and longitudinal predictors from enrollment to time t. To utilize the longitudinal information, we employed two approaches in our analysis. The first approach was to compute summary statistics for each longitudinal predictor, including the maximum, minimum, maximum of slope, minimum of slope, and total variation [17]. These summary statistics were combined with the baseline predictors and used to train conventional machine learning models. We opted for LR (a classic linear method) and RF [13] (a highly non-linear method based on decision trees) to evaluate the effectiveness of conventional machine learning methods built on human-designed features.
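The sketch below illustrates this first approach, assuming scikit-learn for LR and RF. Total variation is computed here as the sum of absolute successive differences; in the study, each feature row would also include the baseline predictors.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def summary_features(times, values):
    """Maximum, minimum, maximum and minimum of slope, and total
    variation for one longitudinal predictor observed at >= 2 visits."""
    slopes = np.diff(values) / np.diff(times)         # per-interval slopes
    return np.array([values.max(), values.min(), slopes.max(), slopes.min(),
                     np.abs(np.diff(values)).sum()])  # total variation

# Toy example: one predictor for two patients (times in days from enrollment).
X = np.stack([
    summary_features(np.array([0.0, 90.0, 200.0]), np.array([1.1, 1.4, 0.9])),
    summary_features(np.array([0.0, 30.0, 400.0]), np.array([0.8, 2.5, 3.1]))])
y = np.array([0, 1])                                  # cirrhosis within 1 year?
lr = LogisticRegression().fit(X, y)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)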
The second approach, which handles the raw longitudinal predictors from enrollment to time t, was to use an RNN, which excels at processing sequential data with irregular time gaps and eliminates the need for manual feature extraction. The adaptable structure of RNNs also enables them to support both supervised and semi-supervised learning, a capability that is not easy to attain with LR or RF. We developed a supervised RNN utilizing labeled data only, and a semi-RNN that employed the abundant unlabeled data to improve classification performance.
Specifically, for the supervised RNN model, we used gated recurrent units (GRU) [14] to regulate information flow and remember long-term information. The longitudinal information from the hidden units of the GRU was passed to a max pooling layer and then merged with the time-invariant information from the baseline predictors using a feedforward neural network (FNN). Finally, we built another FNN to process the combined information and used a sigmoid activation function in the output layer to predict the probability of developing cirrhosis within 1 year (Figure 2). In all FNNs, we used the rectified linear unit (ReLU) as the non-linear activation function. To train the model, we minimized the binary cross-entropy loss, termed the supervised loss, with the Adam stochastic optimization algorithm [13]. Adam's adaptive learning rates naturally deal with noisy gradients, which can be viewed as a form of implicit regularization: by dampening the effect of noisy updates (through averaging over time), Adam avoids the tendency to overfit to noise in the training data [15]. We also used the dropout technique [16] to prevent overfitting, and an early stopping mechanism with a patience of 10 epochs to avoid unnecessary training beyond the optimal point.
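A minimal sketch of this architecture and training loop follows. The paper does not state its framework, so PyTorch is assumed here, and all layer sizes, the dropout rate, and the learning rate are illustrative.

import torch
import torch.nn as nn

class CirrhosisRNN(nn.Module):
    """GRU over the visit sequence, max pooling over time, fusion with
    baseline predictors through FNNs, and a sigmoid output."""
    def __init__(self, n_long, n_base, hidden=64):
        super().__init__()
        self.gru = nn.GRU(n_long, hidden, batch_first=True)
        self.base_fnn = nn.Sequential(nn.Linear(n_base, hidden), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(),
            nn.Dropout(0.3),                          # dropout against overfitting
            nn.Linear(hidden, 1))

    def forward(self, x_long, x_base):
        h, _ = self.gru(x_long)                       # (batch, visits, hidden)
        pooled = h.max(dim=1).values                  # max pooling over visits
        fused = torch.cat([pooled, self.base_fnn(x_base)], dim=1)
        return torch.sigmoid(self.head(fused)).squeeze(1)

# Toy tensors standing in for the cohort: 64 patients, 20 visits,
# 12 longitudinal and 8 baseline predictors (all sizes arbitrary).
xl, xb = torch.randn(64, 20, 12), torch.randn(64, 8)
y = torch.randint(0, 2, (64,)).float()
model = CirrhosisRNN(n_long=12, n_base=8)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)   # Adam optimizer
bce = nn.BCELoss()                                    # the supervised loss
best, wait = float("inf"), 0
for epoch in range(200):
    model.train()
    opt.zero_grad()
    bce(model(xl[:48], xb[:48]), y[:48]).backward()   # "training" slice
    opt.step()
    model.eval()
    with torch.no_grad():                             # "validation" slice
        val = bce(model(xl[48:], xb[48:]), y[48:]).item()
    if val < best:
        best, wait = val, 0
    else:
        wait += 1
        if wait >= 10:                                # early stopping, patience 10
            break

In practice, one might prefer BCEWithLogitsLoss for numerical stability; BCELoss is used here only to mirror the explicit sigmoid output layer described above.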
For the semi-RNN model, we incorporated an auxiliary task alongside the primary prediction task of the supervised RNN. The auxiliary task was to predict the values of the longitudinal predictors at the next visit, so it can be trained using unlabeled data. We shared the GRU layers between the auxiliary task and the prediction task; jointly learning both tasks could thus improve the feature representation by leveraging the unlabeled data. We defined the negative log-likelihood for the longitudinal predictors as the unsupervised loss and minimized the weighted sum of the supervised and unsupervised losses to train the model end-to-end. The weight, which controls the trade-off between supervised and unsupervised learning, was selected by hyperparameter tuning.
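Continuing the sketch above, the following shows the shared-GRU design and the weighted joint loss. MSE stands in for the Gaussian negative log-likelihood (to which it is equal up to constants under a fixed unit variance), and the weight lam = 0.5 is purely illustrative.

class SemiRNN(nn.Module):
    """The supervised network above plus an auxiliary head, sharing the
    GRU, that predicts the longitudinal predictors at the next visit."""
    def __init__(self, n_long, n_base, hidden=64):
        super().__init__()
        self.backbone = CirrhosisRNN(n_long, n_base, hidden)
        self.aux_head = nn.Linear(hidden, n_long)

    def forward(self, x_long, x_base):                # primary prediction task
        return self.backbone(x_long, x_base)

    def next_visit(self, x_long):                     # auxiliary task
        h, _ = self.backbone.gru(x_long)
        return self.aux_head(h[:, :-1])               # visit t predicts visit t+1

model = SemiRNN(n_long=12, n_base=8)
xl_u = torch.randn(256, 20, 12)                       # unlabeled sequences
lam = 0.5                                             # trade-off weight (tuned)
sup = nn.BCELoss()(model(xl, xb), y)                  # supervised loss (labeled)
unsup = nn.MSELoss()(model.next_visit(xl_u), xl_u[:, 1:])
loss = sup + lam * unsup                              # weighted sum, end-to-end
loss.backward()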
2.7. Statistical analysis

To conduct the analysis, we randomly split the labeled data into a training set (40%), a validation set (30%), and a testing set (30%). The unlabeled data used in the semi-RNN belonged to the training set. We learned each model using the training and validation sets, and then evaluated their performance on the same testing set. This procedure was

