4.2. Evaluation method

In text summarization tasks, various evaluation metrics are used to assess the quality of the generated summaries. Common metrics include BLEU, METEOR, and ROUGE.¹⁸ BLEU measures the correspondence between machine-generated text and human-written reference text using n-gram overlaps, making it a popular choice for evaluating machine translation.¹⁶ However, BLEU has limitations in text summarization, as it primarily focuses on precision and does not adequately capture recall, which is crucial for summarization tasks. METEOR addresses some of BLEU's shortcomings by considering synonyms and stemming, thereby providing a more nuanced evaluation. However, METEOR is still more suited for translation tasks than for summarization.
For our study, we chose ROUGE, specifically ROUGE-L, as the primary evaluation metric. ROUGE-L focuses on the LCS between the predicted and reference summaries, which is particularly effective in measuring the informativeness and fluency of the summaries. Unlike n-gram-based metrics, ROUGE-L captures the overall structure and coherence of the text by considering the longest matching sequence of words, making it well-suited for evaluating abstractive summarization tasks where the generated text may not have exact n-gram matches with the reference. This characteristic of ROUGE-L is crucial for radiology report summarization, where the goal is to produce coherent and informative summaries that may not directly match the source text verbatim.
By selecting ROUGE-L, we ensured that our evaluation metric aligned with the specific requirements of medical text summarization. The ability of ROUGE-L to balance precision and recall makes it an ideal choice for capturing the quality of summaries in terms of both completeness and relevance. Our study leveraged ROUGE-L to provide a comprehensive assessment of our model's performance, ensuring that the generated summaries effectively convey the critical information contained in radiology reports.
Equations II and III compute ROUGE precision and recall, respectively, where MaxLCS(R, C) is the maximum length of the LCS between the reference summary (R) and the candidate summary (C),⁹ and r and c are the lengths of the reference and candidate summaries, respectively.
ROUGE precision = MaxLCS(R, C) / r  (II)

ROUGE recall = MaxLCS(R, C) / c  (III)

ROUGE F1 = (2 × Precision × Recall) / (Precision + Recall)  (IV)
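For illustration only (this code is not from the paper), the sketch below computes Equations II-IV with a dynamic-programming LCS, normalizing precision by the reference length r and recall by the candidate length c exactly as printed above; note that the widely used rouge_score package, referenced in the next paragraph, normalizes the other way around (precision by the candidate length, recall by the reference length). Whitespace tokenization and the function names are assumptions.

```python
# Illustrative sketch of Equations II-IV, not the paper's implementation.
# Assumes plain whitespace tokenization.

def lcs_length(ref_tokens, cand_tokens):
    """Length of the longest common subsequence, via dynamic programming."""
    m, n = len(ref_tokens), len(cand_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]


def rouge_l(reference, candidate):
    """ROUGE-L precision, recall, and F1 following Equations II-IV as printed."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    precision = lcs / len(ref) if ref else 0.0   # Equation II
    recall = lcs / len(cand) if cand else 0.0    # Equation III
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)        # Equation IV
    return precision, recall, f1


print(rouge_l("no acute cardiopulmonary process",
              "no acute cardiopulmonary abnormality"))
```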
Precision and recall scores are combined into an F1 score, as seen in Equation IV. We implemented ROUGE-L leveraging the base rouge_score calculator available from HuggingFace (huggingface.co/metrics/rouge). We also used BERTScore, which compares contextual embeddings of the reference and predicted summaries of the radiology reports, to mitigate the drawbacks of pure ROUGE score evaluation.
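The exact evaluation calls are not given in the paper; the following is a minimal sketch, assuming the rouge_score package that backs the HuggingFace ROUGE metric and the bert_score package, with illustrative prediction and reference strings.

```python
# Illustrative evaluation sketch: ROUGE-L via rouge_score, plus BERTScore.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

predictions = ["no acute cardiopulmonary abnormality"]  # model-generated IMPRESSION (example)
references = ["no acute cardiopulmonary process"]       # ground-truth IMPRESSION (example)

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
scores = [scorer.score(ref, pred)["rougeL"] for ref, pred in zip(references, predictions)]
print("ROUGE-L F1:", sum(s.fmeasure for s in scores) / len(scores))

# BERTScore compares contextual embeddings of the predicted and reference
# summaries, mitigating ROUGE's reliance on exact lexical overlap.
P, R, F1 = bert_score(predictions, references, lang="en")
print("BERTScore F1:", F1.mean().item())
```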
Our baseline models use cross-entropy loss on teacher-forced masked spans during training. We implemented teacher forcing on our training batches with the HuggingFace DataCollatorForSeq2Seq.
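As a hedged sketch of the teacher-forced batching described above (not the authors' training code), the example below shows DataCollatorForSeq2Seq padding a batch, setting padded label positions to -100 so they are ignored by cross-entropy, and producing the right-shifted decoder inputs used for teacher forcing; the checkpoint name, field names, and length limits are assumptions.

```python
# Illustrative sketch of teacher-forced batching, not the authors' pipeline.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq

checkpoint = "t5-small"  # illustrative; any encoder-decoder checkpoint works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

collator = DataCollatorForSeq2Seq(tokenizer, model=model, label_pad_token_id=-100)

def preprocess(example):
    # "findings" and "impression" are illustrative field names.
    features = tokenizer(example["findings"], truncation=True, max_length=512)
    labels = tokenizer(text_target=example["impression"], truncation=True, max_length=128)
    features["labels"] = labels["input_ids"]
    return features

batch = collator([
    preprocess({"findings": "Lungs are clear. No pleural effusion or pneumothorax.",
                "impression": "No acute cardiopulmonary process."}),
])
print(batch.keys())  # input_ids, attention_mask, labels, decoder_input_ids
```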
4.3. Experimental details

Our experiments established summarization from the fine-tuned BERT2BERT model and reproduced the medical summarization work by Chen et al.⁵
We applied a set of hyperparameters, as illustrated in Table 2, to the entire dataset, using a train/test split ratio of 5:1 for developing the model. We also built a custom section extractor to automate the processing of whole reports into various components such as FINDINGS and IMPRESSION. These reports were filtered for those that included ALL INPUT FIELDS and IMPRESSION sections (Figure 2).
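The custom section extractor is described but not shown; the following hypothetical sketch splits a MIMIC-CXR-style report on upper-case section headers and filters for reports containing the required input fields and an IMPRESSION. The header names, regular expression, and helper names are assumptions; a 5:1 train/test split could then be drawn from the reports that pass this filter.

```python
# Hypothetical section extractor sketch; header names and regex are assumptions.
import re

SECTION_RE = re.compile(r"^([A-Z][A-Z /]+):", re.MULTILINE)

def extract_sections(report_text):
    """Return {SECTION_NAME: section_text} for one free-text radiology report."""
    matches = list(SECTION_RE.finditer(report_text))
    sections = {}
    for i, m in enumerate(matches):
        start = m.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(report_text)
        sections[m.group(1).strip()] = report_text[start:end].strip()
    return sections

REQUIRED_INPUT_FIELDS = {"EXAMINATION", "INDICATION", "FINDINGS"}  # illustrative set

def keep_report(report_text):
    """Keep only reports containing all required input fields and an IMPRESSION."""
    sections = extract_sections(report_text)
    return REQUIRED_INPUT_FIELDS.issubset(sections) and "IMPRESSION" in sections
```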
The filtered reports were then sampled, tokenized, and batched for training. Finally, we shuffled inputs for the last 2-3 epochs of every model trained.
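The mechanism for varying the input ordering is not shown; since the results below note that each BERT2BERT epoch used a different ordering of input fields, the hypothetical sketch below illustrates one way such per-epoch reordering could be implemented. The field names and the per-epoch seeding are assumptions, not the authors' procedure.

```python
# Hypothetical sketch of per-epoch reordering of input fields; not the authors' code.
import random

INPUT_FIELDS = ["EXAMINATION", "INDICATION", "COMPARISON", "FINDINGS"]  # illustrative

def build_source(sections, field_order):
    """Concatenate the available input sections, in the given order, into one source string."""
    return " ".join(f"{name}: {sections[name]}" for name in field_order if name in sections)

def sources_for_epoch(section_dicts, epoch):
    """Build source strings for one epoch using an epoch-specific field ordering."""
    order = INPUT_FIELDS[:]
    random.Random(epoch).shuffle(order)  # deterministic, different ordering per epoch
    return [build_source(sections, order) for sections in section_dicts]
```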
4.4. Results

Our final set of ROUGE results is shown in Table 3. We found that a fine-tuned T5 model performs best among our baseline models. We achieved state-of-the-art ROUGE-L F1 performance with a BERT2BERT model after 6 epochs, with each epoch using a different ordering of input fields.
5. Analysis and discussion

Our experiments yielded several non-intuitive results across model architecture, pre-training, and attention context. For instance, larger specialized models like ClinicalLongFormer significantly underperformed baselines.
We investigated these results across disease types and analyzed why the vanilla BERT-to-BERT model trained directly on this task outperformed models that had specialized checkpoints on clinical data and used architectures with more sophisticated attention mechanisms.
5.1. Summarization across disease types

Alongside patient radiology reports, the MIMIC-CXR dataset provides extracted disease metadata that is either

