Artificial Intelligence in Health: Transformer-based radiology report summaries
Table 2. Hyperparameters used for the top 3 best-performing models trained

Hyperparameter                BERT2BERT   BigBird   T5
Epochs                        6           7         12
Batch size                    8           8         2
Learning rate                 1.0e-5      1.0e-5    1.0e-5
Gradient accumulation steps   2           2         4
Optimizer                     AdamW       AdamW     AdamW
Training time per epoch       0.65 h      1.55 h    0.63 h

Abbreviation: BERT: Bidirectional encoder representations from transformers.
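The gradient accumulation steps in Table 2 multiply the effective batch size: the BERT2BERT and BigBird runs effectively train on 8 × 2 = 16 examples per optimizer step, and T5 on 2 × 4 = 8. A minimal sketch (illustrative only, not the authors' training code) of why averaging scaled micro-batch gradients matches a single larger-batch step, shown here for a toy scalar model:

```python
# Illustrative sketch (not from the paper): gradient accumulation with
# micro-batch size 8 and 2 accumulation steps gives an effective batch
# of 16, matching the BERT2BERT row in Table 2.

def grad(w, xs):
    """Gradient of the mean squared error L(w) = mean((w - x)^2) over a batch."""
    return sum(2 * (w - x) for x in xs) / len(xs)

def accumulated_step(w, data, micro_bs, accum_steps, lr):
    """One optimizer step built from `accum_steps` scaled micro-batch gradients."""
    total = 0.0
    for i in range(accum_steps):
        micro = data[i * micro_bs:(i + 1) * micro_bs]
        total += grad(w, micro) / accum_steps  # scale each micro-batch gradient
    return w - lr * total

data = [float(i) for i in range(16)]        # one effective batch of 16 examples
w0 = 0.0
w_accum = accumulated_step(w0, data, micro_bs=8, accum_steps=2, lr=1e-5)
w_full = w0 - 1e-5 * grad(w0, data)         # single step on the full batch
assert abs(w_accum - w_full) < 1e-12        # the two updates coincide
```

This equivalence is what lets a batch size of 2 with 4 accumulation steps (the T5 row) stand in for a batch of 8 on memory-constrained GPUs.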
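The metrics in Table 3 below are ROUGE scores; ROUGE-L in particular is an F-measure over the longest common subsequence (LCS) of candidate and reference tokens. A rough self-contained sketch (not the authors' evaluation pipeline; the reference impression here is a hypothetical example, and we take beta = 1 for simplicity):

```python
# Hedged sketch of ROUGE-L scoring, the metric reported in Table 3.

def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ta in enumerate(a, 1):
        for j, tb in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if ta == tb else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def rouge_l(candidate, reference):
    """ROUGE-L F1 over whitespace tokens (beta = 1 for simplicity)."""
    c, r = candidate.lower().split(), reference.lower().split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)

# A canonical "No Findings" phrase scores highly against a hypothetical
# normal-report impression:
score = rouge_l("There is no cardiopulmonary process",
                "No acute cardiopulmonary process")  # ≈ 0.67
```

Published ROUGE implementations add stemming and length-weighted F-measures, but the LCS core is the same.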
Table 3. ROUGE scores among different models we experimented with

#   ROUGE-1   ROUGE-2   ROUGE-L   Baseline/rank
1   0.67      0.15      0.63      Yes
2   48.71     37.98     47.42     Yes
3   30.81     19.51     26.88     Yes
4   55.68     45.52     54.74     3rd
5   59.61     48.22     58.75     1st
6   57.83     47.12     56.66     2nd
7   43.8      30.98     41.7      -
8   58.97     47.06     57.37     -

Note: Bold values highlight the best performance in the selected metrics. We found that a fine-tuned BERT2BERT (bidirectional encoder representations from transformers) model performs the best.
Abbreviation: ROUGE: Recall-Oriented Understudy for Gisting Evaluation.

indicated or negated by the radiologist. For instance, the radiologist might note that the patient's chest "had no indications for pneumonia," which would be provided as "pneumonia: -1."

By comparing performance across these disease profiles (Figure 3), we observed that the Biomedical-BERT2BERT model's performance has a slight positive correlation with the number of examples per disease type. This indicates that while the model gains knowledge from more examples, there is potentially a model saturation point. One interpretation is that BERT2BERT has reached an architectural limit on improving the summarization of these complex disease types.

In contrast, the model performs almost twice as well on reports with "No Findings" (77.6 ROUGE-Lsum). "No Findings" reports are those where the radiologist still describes X-ray features but interprets them as normal and without disease. This summarization improvement is likely due to the following reasons:
• "No Findings" reports account for the majority of the examples.
• Each report with no findings has a much smaller space of possible impressions compared to other disease-type impressions. Other diseases have a variety of nuances that are harder for the model to capture.

In fact, solely summarizing these reports with the phrase "There is no cardiopulmonary process" could achieve a ROUGE-L of 53.2 on them. It is possible that the model has learned an efficient classification strategy to detect "No Findings" reports and respond with a few canonical phrases in those cases.

5.2. Specialized and general checkpoints
Interestingly, fine-tuned checkpoints pre-trained on PubMed and other clinical data underperformed the base BERT2BERT model on this task. One example is the Clinical LongFormer model,[9] which is pre-trained on large-scale clinical corpora and achieves state-of-the-art performance on many biomedical tasks. Similar performance was observed with BioClinical BERT. However, radiology text summarization is a highly specialized task. One interpretation of this result is that other clinical checkpoints may contain only a fraction of the information required to summarize radiology notes effectively. As a result, these specialized checkpoints can easily fall into local minima with respect to the loss function, whereas a more general language checkpoint can optimize more toward global minima.

For future studies, this evidence points to the importance of using a variety of pre-trained checkpoints, and not relying solely on fine-tuned variants for specialized tasks.

5.3. Limitations of linear attention mechanisms
Attention is the key mechanism underlying transformers. However, the time and memory complexity of calculating attention scales as O(n²), which restricts models such as BERT to a limited context size (i.e., 512 tokens).

Many models, such as Linformer, Reformer, and Perceiver, have been formulated to use linear attention methods[19] that indirectly approximate "full attention." Google's BigBird[16] is the latest of such models,
Volume 1 Issue 4 (2024) 91 doi: 10.36922/aih.3846

