Table 2. Hyperparameters used for the top 3 best-performing models

Hyperparameter                  BERT2BERT values    BigBird values    T5 values
Epochs                          6                   7                 12
Batch size                      8                   8                 2
Learning rate                   1.0e-5              1.0e-5            1.0e-5
Gradient accumulation steps     2                   2                 4
Optimizer                       AdamW               AdamW             AdamW
Training time per epoch         0.65 h              1.55 h            0.63 h

Abbreviation: BERT: bidirectional encoder representations from transformers.
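The paper does not state which training framework produced these runs. As a rough, hedged sketch only, the BERT2BERT settings in Table 2 map onto the Hugging Face Seq2SeqTrainingArguments roughly as follows; the output directory and every name below are illustrative assumptions, not the authors' configuration:

```python
from transformers import Seq2SeqTrainingArguments

# Illustrative sketch only; values mirror the BERT2BERT column of Table 2.
training_args = Seq2SeqTrainingArguments(
    output_dir="./bert2bert-radiology-summaries",  # hypothetical path
    num_train_epochs=6,
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    gradient_accumulation_steps=2,  # effective batch size of 16
    # The Trainer's default optimizer is AdamW, matching Table 2.
)
```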

Table 3. ROUGE scores among the different models we experimented with

#    ROUGE-1    ROUGE-2    ROUGE-L    Baseline/rank
1    0.67       0.15       0.63       Yes
2    48.71      37.98      47.42      Yes
3    30.81      19.51      26.88      Yes
4    55.68      45.52      54.74      3rd
5    59.61      48.22      58.75      1st
6    57.83      47.12      56.66      2nd
7    43.8       30.98      41.7       -
8    58.97      47.06      57.37      -

Note: Bold values highlight the best performance in the selected metrics. We found that a fine-tuned bidirectional encoder representations from transformers (BERT)2BERT model performs the best.
Abbreviation: ROUGE: Recall-Oriented Understudy for Gisting Evaluation.
indicated or negated by the radiologist. For instance, the radiologist might note that the patient’s chest “had no indications for pneumonia,” which would be provided as “pneumonia: -1.”

By comparing performance across these disease profiles (Figure 3), we observed that the Biomedical-BERT2BERT model’s performance has a slight positive correlation with the number of examples per disease type. This indicates that while the model gains knowledge with more examples, there is potentially a model saturation point. One interpretation is that BERT2BERT has reached an architectural limit to improving the summarization of these complex disease types.

In contrast, the model performs almost twice as well on reports with “No Findings” (77.6 ROUGE-LSUM). “No Findings” reports are those where the radiologist still describes X-ray features but interprets them to be normal and without disease. This summarization improvement is likely due to the following reasons:

• “No Findings” reports account for the majority of the examples.
• Each report with no findings has a much smaller space of possible impressions compared to other disease-type impressions. Other diseases have a variety of nuances that are harder for the model to capture.

In fact, solely summarizing reports with the phrase “There is no cardiopulmonary process” could achieve a ROUGE-L of 53.2 on these reports. It is possible that the model has learned an efficient classification strategy to detect “No Findings” reports and respond with a few canonical phrases in those cases.
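For intuition, that constant-prediction baseline can be estimated with the rouge-score package along the following lines. This is a hedged sketch rather than the authors' evaluation script, and the two reference impressions are illustrative placeholders, not data from the study:

```python
from rouge_score import rouge_scorer

# Constant "canonical" impression scored against every reference impression.
canonical = "There is no cardiopulmonary process"

# Placeholder references; in the study these would be the radiologist-written
# impressions of the "No Findings" subset.
references = [
    "No acute cardiopulmonary process.",
    "Heart and lungs are within normal limits.",
]

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
f_scores = [scorer.score(ref, canonical)["rougeL"].fmeasure for ref in references]
print(f"Mean ROUGE-L: {100 * sum(f_scores) / len(f_scores):.1f}")
```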
5.2. Specialized and general checkpoints

Interestingly, fine-tuned checkpoints on PubMed and other clinical data underperformed the base BERT2BERT model on this task. One example is the Clinical LongFormer⁹ model, which is pre-trained on large-scale clinical corpora and achieves state-of-the-art performance on many biomedical tasks. Similar performance was observed with BioClinical BERT. However, radiology text summarization is a highly specialized task. One interpretation of this result is that other clinical checkpoints may only contain a fraction of the information required to summarize radiology notes effectively. As a result, these specialized checkpoints can easily fall into local minima with respect to the loss function, whereas a more general language checkpoint can optimize more toward a global minimum.

For future studies, this evidence points to the importance of using a variety of pre-trained checkpoints, and not relying solely on fine-tuned variants for specialized tasks.
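As a concrete illustration of that recommendation, the sketch below warm-starts a BERT2BERT encoder-decoder from a general checkpoint and from a clinical one. It is an assumption-laden example, not the authors' code; the Hugging Face checkpoint names are common public ones and may differ from those used in the study:

```python
from transformers import EncoderDecoderModel

# Warm-start a BERT2BERT encoder-decoder from a general-domain checkpoint ...
general_bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "bert-base-uncased", "bert-base-uncased"
)

# ... and from a clinical checkpoint, for comparison on the same task.
clinical_bert2bert = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "emilyalsentzer/Bio_ClinicalBERT", "emilyalsentzer/Bio_ClinicalBERT"
)
```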
5.3. Limitations of linear attention mechanisms

Attention is the key mechanism underlying transformers. However, the time and memory complexity of calculating attention scales as O(n²), which restricts models such as BERT to a limited context size (i.e., 512 tokens).
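To make the quadratic cost concrete, the following minimal sketch (not taken from the paper) computes full single-head self-attention; the n × n score matrix is what drives the O(n²) growth in time and memory:

```python
import torch

n, d = 4096, 64                     # sequence length, head dimension
q = torch.randn(n, d)               # queries
k = torch.randn(n, d)               # keys
v = torch.randn(n, d)               # values

scores = (q @ k.T) / d ** 0.5       # (n, n) matrix: memory and time grow as O(n²)
weights = scores.softmax(dim=-1)    # attention weights over all n positions
output = weights @ v                # (n, d) contextualized representations

print(scores.shape)                 # torch.Size([4096, 4096])
```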
Many models such as Linformer, Reformer, and Perceiver¹⁹ have been formulated to use linear attention methods by indirectly calculating “full attention” through approximation. Google’s BigBird¹⁶ is the latest of such models,