4.2. Evaluation method

In text summarization tasks, various evaluation metrics are used to assess the quality of the generated summaries. Common metrics include BLEU, METEOR, and ROUGE.18 BLEU measures the correspondence between machine-generated text and human-written reference text using n-gram overlaps, making it a popular choice for evaluating machine translation. However, BLEU has limitations in text summarization, as it primarily focuses on precision and does not adequately capture recall, which is crucial for summarization tasks. METEOR addresses some of BLEU's shortcomings by considering synonyms and stemming, thereby providing a more nuanced evaluation. However, METEOR is still more suited for translation tasks than for summarization.

For our study, we chose ROUGE, specifically ROUGE-L, as the primary evaluation metric. ROUGE-L focuses on the LCS between the predicted and reference summaries, which is particularly effective in measuring the informativeness and fluency of the summaries. Unlike n-gram-based metrics, ROUGE-L captures the overall structure and coherence of the text by considering the longest matching sequence of words, making it well-suited for evaluating abstractive summarization tasks where the generated text may not have exact n-gram matches with the reference. This characteristic of ROUGE-L is crucial for radiology report summarization, where the goal is to produce coherent and informative summaries that may not directly match the source text verbatim.

By selecting ROUGE-L, we ensured that our evaluation metric aligned with the specific requirements of medical text summarization. The ability of ROUGE-L to balance precision and recall makes it an ideal choice for capturing the quality of summaries in terms of both completeness and relevance. Our study leveraged ROUGE-L to provide a comprehensive assessment of our model's performance, ensuring that the generated summaries effectively convey the critical information contained in radiology reports.

Equations II and III compute ROUGE precision and recall, respectively, where Max_LCS is the maximum length of the LCS between the reference summary (R) and the candidate summary (C), and r and c are the lengths of the reference and candidate summaries, respectively.

\text{ROUGE precision} = \frac{\text{Max}_{\text{LCS}}(R, C)}{r}    (II)

\text{ROUGE recall} = \frac{\text{Max}_{\text{LCS}}(R, C)}{c}    (III)

\text{ROUGE F1} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}    (IV)
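As a minimal illustration of Equations II - IV, the following sketch computes ROUGE-L from a dynamic-programming LCS; the whitespace tokenization and function names are illustrative and do not reproduce the rouge_score implementation used in our experiments.

# Minimal illustration of Equations II-IV: ROUGE-L from a dynamic-programming LCS.
# Whitespace tokenization and these function names are assumptions for the example,
# not the study's implementation.
def lcs_length(ref_tokens, cand_tokens):
    """Length of the longest common subsequence of two token lists."""
    m, n = len(ref_tokens), len(cand_tokens)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if ref_tokens[i - 1] == cand_tokens[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

def rouge_l(reference, candidate):
    """ROUGE-L precision, recall, and F1 as written in Equations II-IV."""
    r_tokens, c_tokens = reference.split(), candidate.split()
    if not r_tokens or not c_tokens:
        return 0.0, 0.0, 0.0
    max_lcs = lcs_length(r_tokens, c_tokens)
    precision = max_lcs / len(r_tokens)   # Equation II: Max_LCS(R, C) / r
    recall = max_lcs / len(c_tokens)      # Equation III: Max_LCS(R, C) / c
    f1 = 2 * precision * recall / (precision + recall) if max_lcs else 0.0  # Equation IV
    return precision, recall, f1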
Precision and recall scores are combined into an F1 score, as seen in Equation IV. We implemented ROUGE-L leveraging the base rouge_score calculator available from HuggingFace (huggingface.co/metrics/rouge). Our baseline models use cross-entropy loss on teacher-forcing masked spans while training; we implemented teacher forcing on our training batches with the HuggingFace DataCollatorForSeq2Seq. We also used BERTScore16 in our model to learn contextual embeddings for the reference and predicted summaries of the radiology reports and thus mitigate the drawbacks of pure ROUGE score evaluation.
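A hedged example of this scoring step is sketched below, using the open-source rouge_score and bert_score packages; the report strings are illustrative and are not drawn from MIMIC-CXR or from our exact evaluation harness.

# Hedged example of the evaluation step with the rouge_score package (the
# calculator behind the HuggingFace ROUGE metric) and BERTScore.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

references = ["No acute cardiopulmonary process."]        # illustrative reference IMPRESSION
predictions = ["No acute cardiopulmonary abnormality."]   # illustrative model summary

# ROUGE-L precision / recall / F1
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for ref, pred in zip(references, predictions):
    rl = scorer.score(ref, pred)["rougeL"]
    print(f"ROUGE-L  P={rl.precision:.3f}  R={rl.recall:.3f}  F1={rl.fmeasure:.3f}")

# BERTScore compares contextual embeddings of predicted and reference summaries,
# mitigating the exact-match bias of pure ROUGE evaluation.
P, R, F1 = bert_score(predictions, references, lang="en")
print(f"BERTScore  F1={F1.mean().item():.3f}")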
4.3. Experimental details

Our experiments established summarization with the fine-tuned BERT2BERT model and reproduced the medical summarization work of Chen et al.5
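A minimal fine-tuning sketch of this BERT2BERT setup is shown below, assuming tokenized HuggingFace datasets train_ds and eval_ds with "input_ids" and "labels" columns; the checkpoint name, output path, and training arguments are placeholders rather than the hyperparameters listed in Table 2.

# Hedged sketch: BERT2BERT encoder-decoder fine-tuning with teacher forcing via
# DataCollatorForSeq2Seq. train_ds / eval_ds are assumed tokenized datasets with
# "input_ids" and "labels"; checkpoint and hyperparameters are placeholders.
from transformers import (AutoTokenizer, DataCollatorForSeq2Seq,
                          EncoderDecoderModel, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased")
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.eos_token_id = tokenizer.sep_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# The collator pads each batch and prepares decoder inputs from the labels,
# i.e., teacher forcing on the reference IMPRESSION tokens.
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

args = Seq2SeqTrainingArguments(
    output_dir="bert2bert-radiology",   # placeholder path
    num_train_epochs=6,
    per_device_train_batch_size=8,
)
trainer = Seq2SeqTrainer(model=model, args=args, data_collator=collator,
                         train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()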
We applied a set of hyperparameters, as illustrated in Table 2, to the entire dataset with a train/test split ratio of 5:1 for developing the model. We also built a custom section extractor to automate the processing of whole reports into components such as FINDINGS and IMPRESSION. These reports were filtered to those that included all input fields and an IMPRESSION section (Figure 2). The reports were then sampled, tokenized, and batched for training. Finally, we shuffled inputs for the last 2–3 epochs of every model trained.
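The extractor and filter can be sketched as follows; the heading pattern and the assumed input-field names are illustrative and do not reproduce our exact parsing rules.

# Illustrative section extractor: split a raw free-text report into named sections
# and keep only reports that contain every input field plus an IMPRESSION.
import re

SECTION_RE = re.compile(r"^([A-Z][A-Z ]+):", re.MULTILINE)
REQUIRED_FIELDS = ("EXAMINATION", "INDICATION", "COMPARISON", "FINDINGS")  # assumed input fields

def extract_sections(report_text):
    """Return {SECTION NAME: section text} for one free-text report."""
    headings = list(SECTION_RE.finditer(report_text))
    sections = {}
    for i, h in enumerate(headings):
        end = headings[i + 1].start() if i + 1 < len(headings) else len(report_text)
        sections[h.group(1).strip()] = report_text[h.end():end].strip()
    return sections

def keep_report(sections):
    """Filter: every input field and the IMPRESSION target must be present and non-empty."""
    return all(sections.get(name) for name in REQUIRED_FIELDS + ("IMPRESSION",))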
4.4. Results

Our final set of results on ROUGE is shown in Table 3. We found that a fine-tuned T5 model performs best among our baseline models. We achieved state-of-the-art ROUGE-L F1 performance with a BERT2BERT model after 6 epochs, with each epoch using a different ordering of input fields.
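The per-epoch reordering of input fields can be sketched as below, assuming sections extracted as in the previous example; the field list, separator format, and seeding are illustrative choices rather than our implementation.

# Sketch of epoch-wise input-field reordering: each epoch concatenates the source
# sections in a different, reproducible order before tokenization. Field names and
# formatting are assumptions for the example.
import random

INPUT_FIELDS = ["EXAMINATION", "INDICATION", "COMPARISON", "FINDINGS"]  # assumed field set

def build_source_text(sections, epoch, seed=0):
    """Concatenate the input fields in an order that changes from epoch to epoch."""
    order = INPUT_FIELDS[:]
    random.Random(seed + epoch).shuffle(order)
    return " ".join(f"{name}: {sections[name]}" for name in order if name in sections)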
5. Analysis and discussion

Our experiments yielded several non-intuitive results across model architecture, pre-training, and attention context. For instance, larger specialized models such as ClinicalLongFormer9 significantly underperformed the baselines.

We investigated these results across disease types and analyzed why the vanilla BERT-to-BERT model trained directly on this task outperformed models that had specialized checkpoints on clinical data and used architectures with more sophisticated attention mechanisms.

5.1. Summarization across disease types

Alongside patient radiology reports, the MIMIC-CXR dataset provides extracted disease metadata that is either

