which uses random attention, windowed attention, and global attention to generate a sparse attention representation (Figure 4). The value of this approach is the ability to process 4096 tokens with sparse attention at approximately the same time complexity as with 512 tokens with full attention. Theoretically, this provides better information capture for longer documents. This is relevant for our task, as radiology reports can exceed the 512-token limit.
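To make the sparse pattern concrete, the toy sketch below builds a combined attention mask from windowed, random, and global components. The window size, random count, and global-token count are illustrative parameters only, not the settings used by BigBird or by this study.

```python
import numpy as np

rng = np.random.default_rng(0)

def bigbird_style_mask(n: int, window: int = 3, n_random: int = 2, n_global: int = 2) -> np.ndarray:
    """Toy sparse attention pattern: True at (i, j) means query i may attend to key j."""
    mask = np.zeros((n, n), dtype=bool)
    for i in range(n):
        # Windowed (local) attention: each token attends to nearby tokens.
        lo, hi = max(0, i - window), min(n, i + window + 1)
        mask[i, lo:hi] = True
        # Random attention: each token also attends to a few random positions.
        mask[i, rng.choice(n, size=n_random, replace=False)] = True
    # Global attention: a handful of designated tokens attend to, and are
    # attended by, every position.
    mask[:n_global, :] = True
    mask[:, :n_global] = True
    return mask

# Each query row keeps O(window + n_random + n_global) entries instead of n,
# which is what keeps the per-layer attention cost linear in sequence length.
mask = bigbird_style_mask(16)
print(mask.sum(axis=1))
```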
For BigBird, however, complete parity with full attention over n tokens is only realized with n hidden attention layers.16 This means that at m < n layers, BigBird's performance relies on the larger context containing substantially more task-relevant information than the 512-token limit would provide. At m = n layers, we lose the performance advantage of linear attention, as O(n × m) = O(n²).
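As a rough back-of-the-envelope check of this argument (illustrative numbers only; the per-token budget k is an assumption, not a parameter reported in this study), per-layer sparse attention scales as O(n × k) while full attention scales as O(n²):

```python
# Approximate attention "cost units" = number of query-key interactions.
def sparse_total(n_tokens: int, n_layers: int, k: int = 64) -> int:
    # BigBird-style sparse attention: each token attends to ~k positions per layer.
    return n_tokens * k * n_layers

def full_total(n_tokens: int, n_layers: int) -> int:
    # Full attention: each token attends to every position per layer.
    return n_tokens ** 2 * n_layers

print(sparse_total(4096, 12))    # 3,145,728 -- 4096 tokens, sparse, 12 layers
print(full_total(512, 12))       # 3,145,728 -- 512 tokens, full, 12 layers
print(full_total(4096, 12))      # 201,326,592 -- 4096 tokens, full, 12 layers

# If parity required as many layers as tokens (m = n), the sparse total itself
# becomes quadratic, O(n * m) = O(n^2), and the advantage disappears:
print(sparse_total(4096, 4096))  # 1,073,741,824
```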
By evaluating the information distribution in radiology text data, we found that the majority of IMPRESSION information can be derived from only two to three sections (i.e., FINDINGS, COMPARISON, and INDICATION), whose combined size totaled 200 – 300 tokens, well within the BERT full attention limit. As a result, while BigBird might eventually match Biomedical-BERT2BERT's performance given more compute and scaling laws,20 the larger context size effectively acted as statistical noise rather than providing an information advantage. In contrast, since we provided the key sections to BERT directly, the Biomedical-BERT2BERT model learned summarization more efficiently with full attention.
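A minimal sketch of such an evaluation is shown below, assuming a simple regex-based section splitter and a Hugging Face BERT tokenizer; the section-header pattern and the checkpoint name are assumptions rather than the exact pipeline used here.

```python
import re
from transformers import AutoTokenizer  # assumes the Hugging Face transformers library

# Hypothetical section splitter: the header names and report format are assumptions.
SECTION_PATTERN = re.compile(r"(INDICATION|COMPARISON|FINDINGS|IMPRESSION):", re.IGNORECASE)

def split_sections(report: str) -> dict:
    parts = SECTION_PATTERN.split(report)
    # parts = [preamble, name1, body1, name2, body2, ...]
    return {name.upper(): body.strip() for name, body in zip(parts[1::2], parts[2::2])}

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # stand-in checkpoint

def section_token_counts(report: str) -> dict:
    return {name: len(tokenizer.tokenize(body)) for name, body in split_sections(report).items()}

# If FINDINGS + COMPARISON + INDICATION consistently total well under 512 tokens,
# feeding only those sections to a full-attention encoder loses little information.
```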
For future studies, the limited effectiveness of linear attention points to the importance of evaluating the information distribution within a dataset. Likely, the more concentrated the relevant information in a dataset, the less likely a larger-context transformer is to outperform.

5.4. Learning radiology from summarization

While transformers tend to find uninterpretable statistical patterns in the training data, we found that our model has learned a few radiology facts. A few notable observations that hint at some of the operating mechanisms of Biomedical-BERT2BERT are as follows:
• Pneumonia corresponds to pleural surfaces
• Negation of a disease is entailed by phrasing normal physiology (e.g., No pneumonia = Normal heart and lungs)
• "Chest" pertains to both heart and lung anatomical features.

Figure A1 provides more information in this regard. Visualizations were created by extracting cross-attention matrices between our BERT2BERT encoder and decoder components and plotting them with BERTViz.21
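A rough sketch of this visualization step is given below, assuming the Hugging Face EncoderDecoderModel API and BertViz's model_view. The BioBERT checkpoint names and the example report/summary strings are placeholders rather than this study's exact configuration, and the final call is intended to render inside a Jupyter notebook.

```python
import torch
from transformers import AutoTokenizer, EncoderDecoderModel
from bertviz import model_view

# Placeholder checkpoints; not necessarily the weights used in this study.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = EncoderDecoderModel.from_encoder_decoder_pretrained(
    "dmis-lab/biobert-base-cased-v1.1", "dmis-lab/biobert-base-cased-v1.1"
)

source = "FINDINGS: No focal consolidation. COMPARISON: None. INDICATION: Cough."
summary = "No acute cardiopulmonary process."

enc = tokenizer(source, return_tensors="pt")
dec = tokenizer(summary, return_tensors="pt")

with torch.no_grad():
    outputs = model(
        input_ids=enc.input_ids,
        attention_mask=enc.attention_mask,
        decoder_input_ids=dec.input_ids,
        output_attentions=True,
    )

# Cross-attention: one tensor per decoder layer, each shaped
# (batch, heads, target_len, source_len).
model_view(
    encoder_attention=outputs.encoder_attentions,
    decoder_attention=outputs.decoder_attentions,
    cross_attention=outputs.cross_attentions,
    encoder_tokens=tokenizer.convert_ids_to_tokens(enc.input_ids[0]),
    decoder_tokens=tokenizer.convert_ids_to_tokens(dec.input_ids[0]),
)
```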
We also sampled model outputs with a medical resident, who found that the generated summaries encapsulate the source text well for a medical setting (Figure A2). This points to an exciting future direction to extract knowledge from radiology
Figure 3. Performance distribution of ROUGE-L SUM scores versus the number of examples in the dataset. Image created with Google Sheets
Figure 4. (A-D) Multiple attention mechanisms in the BigBird linear attention calculation,16 which did not show improved performance for our summarization task

