from the source text that best preserve the salient aspects of the document. The abstractive method, on the other hand, may generate text that does not appear in the source text.

Biomedical summaries require an abstractive approach. Radiology settings in particular require the interpretation of many data points to identify the most important aspects of a patient’s condition and the implications for future care. For instance, a radiologist’s report may include only the physical details of lung nodules, but the summary may conclude that “pneumonia is or is not present.”
3.2. Baselines

We began by leveraging several pre-trained models from HuggingFace and built our own data processing pipeline to extract sections and identify relevant training data from the MIMIC-CXR dataset.4 We baselined our project by fine-tuning a T5 encoder–decoder8 and Meta’s BART13 on the MIMIC-CXR dataset, implementing teacher forcing14 on our training batches with HuggingFace’s DataCollatorForSeq2Seq.15 This baseline already performed near SOTA, at 47.55 ROUGE-L.
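A minimal sketch of this baseline setup with HuggingFace’s Seq2SeqTrainer and DataCollatorForSeq2Seq is shown below; the checkpoint name, field names, and hyperparameters are illustrative rather than the exact values we used.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM, DataCollatorForSeq2Seq,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

checkpoint = "t5-base"  # illustrative; "facebook/bart-base" can be swapped in the same way
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(example):
    # Extracted report sections form the source; IMPRESSION serves as the target summary.
    inputs = tokenizer(example["findings"], truncation=True, max_length=512)
    labels = tokenizer(text_target=example["impression"], truncation=True, max_length=128)
    inputs["labels"] = labels["input_ids"]
    return inputs

train_ds = Dataset.from_dict(
    {"findings": ["No focal consolidation."], "impression": ["No pneumonia."]}
).map(preprocess)

# The collator pads the labels and builds decoder inputs from them, so each training
# batch is decoded with teacher forcing (gold previous tokens condition every step).
collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(output_dir="baseline", per_device_train_batch_size=8,
                                  num_train_epochs=3, predict_with_generate=True),
    train_dataset=train_ds,
    data_collator=collator,
)
trainer.train()
```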
3.3. Main approach

To improve on these results, we experimented with several approaches to model architecture, including:
• Larger variants of common architectures
• Custom encoder–decoder models
• Checkpoints specialized in medical data
• Models using linear attention mechanisms.
We then moved on to a data-centric approach, shuffling all input fields for each of our highest-performing model architectures: T5, BERT2BERT, and a BigBird-PubMed-Base model,16 the latter chosen because it relies on block sparse attention instead of full attention and can therefore handle longer sequences. This data-centric approach proved to be key to reaching a new state-of-the-art performance, improving on the previous SOTA work’s5 ROUGE-L performance by 1.38 points.
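As an illustration of the long-input configuration, a PubMed-pretrained BigBird summarization checkpoint can be loaded with block sparse attention enabled; the checkpoint name and block settings below are assumptions, not necessarily the exact configuration we trained.

```python
from transformers import AutoTokenizer, BigBirdPegasusForConditionalGeneration

# Illustrative PubMed-pretrained BigBird checkpoint.
name = "google/bigbird-pegasus-large-pubmed"
tokenizer = AutoTokenizer.from_pretrained(name)

# Block sparse attention lets the model handle much longer report inputs
# than full quadratic attention would allow.
model = BigBirdPegasusForConditionalGeneration.from_pretrained(
    name,
    attention_type="block_sparse",
    block_size=64,
    num_random_blocks=3,
)

inputs = tokenizer("FINDINGS: No focal consolidation. INDICATION: Cough.",
                   return_tensors="pt", truncation=True, max_length=4096)
summary_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```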
As in the baseline setup, we leveraged several pre-trained models from HuggingFace and our own data processing pipeline to extract sections and identify relevant training data from the MIMIC-CXR dataset, fine-tuning a T5 encoder–decoder8 and Facebook’s BART13 on it.

To perform the summarization task, the input fields were fed, as sentence-level embeddings created by the encoder, to a randomly initialized transformer decoder. Encoders and decoders were fine-tuned end-to-end; see Figure 1 for the whole process.
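A minimal sketch of this warm-started encoder paired with a randomly initialized decoder, using HuggingFace’s EncoderDecoderModel, is given below; the BioBERT checkpoint named here is an illustrative stand-in, not necessarily the exact checkpoint we used.

```python
from transformers import (AutoTokenizer, BertModel, BertConfig,
                          BertLMHeadModel, EncoderDecoderModel)

# Illustrative biomedical checkpoint for the encoder.
enc_name = "dmis-lab/biobert-base-cased-v1.1"
tokenizer = AutoTokenizer.from_pretrained(enc_name)

# Warm-start the encoder from pre-trained biomedical weights.
encoder = BertModel.from_pretrained(enc_name)

# Build a randomly initialized decoder with cross-attention over the encoder states.
dec_config = BertConfig.from_pretrained(enc_name, is_decoder=True, add_cross_attention=True)
decoder = BertLMHeadModel(dec_config)

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# The whole encoder-decoder stack is then fine-tuned end-to-end on
# (report sections -> summary) pairs, as described above.
```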
We used a cross-entropy loss with a weighted average of distances to a reference word in the vector space, as outlined in Equation I.17 Here, y_i is the i-th word in the predicted summary and V_k is the k-th word in the reference vocabulary; E(w) denotes the embedding vector of a word w (V_k and y_i in the equation), and ed is the Euclidean distance computation.

Loss = \sum_{i=0}^{I} \sum_{k=0}^{K} p(y_{<i}, X) \cdot ed(E(V_k), E(y_i))    (I)
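A compact PyTorch sketch of how the distance term of Equation I can be computed for one sequence follows; how this term is weighted against the standard cross-entropy term is an assumption here, not specified above.

```python
import torch
import torch.nn.functional as F

def distance_weighted_term(logits, target_ids, embedding):
    """Sketch of the double sum in Equation I for a single sequence.

    logits:     (seq_len, vocab_size) decoder scores at each position i
    target_ids: (seq_len,) reference summary token ids y_i
    embedding:  (vocab_size, dim) word embedding matrix E
    """
    probs = F.softmax(logits, dim=-1)         # p(. | y_<i, X) for every position i
    ref_vecs = embedding[target_ids]          # E(y_i), shape (seq_len, dim)
    # ed(E(V_k), E(y_i)): Euclidean distance from every vocabulary word to the reference word.
    dists = torch.cdist(ref_vecs, embedding)  # (seq_len, vocab_size)
    return (probs * dists).sum()              # sum over positions i and vocabulary entries k

def loss_fn(logits, target_ids, embedding, alpha=1.0):
    # Cross-entropy combined with the distance term; alpha is an assumed weighting factor.
    return F.cross_entropy(logits, target_ids) + alpha * distance_weighted_term(
        logits, target_ids, embedding)
```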
Table 1 provides an overview of the model performance.

3.3.1. Data pipeline

We built a comprehensive data processing pipeline to handle the MIMIC-CXR dataset. The pipeline includes the following steps: (1) Data extraction, in which free-text radiology reports and their associated labels are extracted from the MIMIC-CXR dataset. (2) Preprocessing, which tokenizes the text and removes unnecessary characters and noise. (3) Section extraction, in which relevant sections such as FINDINGS, IMPRESSION, INDICATION, and TECHNIQUE are extracted from the radiology reports. (4) Data augmentation, where augmentation techniques are applied to create new training examples, improving model robustness and generalization (detailed in the following subsection). (5) Model training, which uses the processed and augmented data to train the Biomedical-BERT2BERT model.
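As an illustration of the section-extraction step (3), a simple regex-based extractor is sketched below; the exact rules in our pipeline, for example for missing or merged sections, are more involved.

```python
import re

# Section headers extracted from MIMIC-CXR free-text reports.
SECTIONS = ["FINDINGS", "IMPRESSION", "INDICATION", "TECHNIQUE"]

def extract_sections(report: str) -> dict:
    """Return a {section_name: text} dict for the sections present in a report."""
    pattern = re.compile(
        r"(?P<name>" + "|".join(SECTIONS) + r"):\s*(?P<body>.*?)(?=\n[A-Z ]+:|\Z)",
        re.DOTALL,
    )
    return {m.group("name"): " ".join(m.group("body").split())
            for m in pattern.finditer(report)}

# Example:
# extract_sections("INDICATION: Cough.\nFINDINGS: No focal consolidation.\nIMPRESSION: No pneumonia.")
# -> {"INDICATION": "Cough.", "FINDINGS": "No focal consolidation.", "IMPRESSION": "No pneumonia."}
```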
3.3.2. Data augmentation techniques

Significant improvements in our model are attributed to the data augmentation techniques utilized. Data augmentation in NLP involves creating new training examples by altering the existing data to improve model robustness and generalization.

We implemented input-field shuffling. For each model, different epochs were trained with shuffled input fields serving as new examples (i.e., FINDINGS, IMPRESSION, INDICATION, and TECHNIQUE; Figure 2). This technique ensures that the model learns the context and meaning of the text regardless of the order in which the sentences are presented, thereby improving its ability to generalize across different sentence structures.
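A minimal sketch of this field-shuffling augmentation is shown below; the field names follow the extracted sections, while the separator format is an assumption.

```python
import random

FIELDS = ["FINDINGS", "IMPRESSION", "INDICATION", "TECHNIQUE"]

def shuffle_fields(example: dict, rng: random.Random) -> dict:
    """Create an augmented training example by shuffling the order of the input fields.

    `example` maps section names to their extracted text; the target summary
    is built elsewhere in the pipeline.
    """
    present = [f for f in FIELDS if example.get(f)]
    order = present[:]
    rng.shuffle(order)
    augmented = dict(example)
    augmented["input_text"] = " ".join(f"{name}: {example[name]}" for name in order)
    return augmented

# Each epoch can draw a fresh permutation, so the same report yields
# differently ordered inputs across epochs.
rng = random.Random(0)
aug = shuffle_fields({"FINDINGS": "No focal consolidation.",
                      "INDICATION": "Cough."}, rng)
```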
We acknowledge that class imbalance in the dataset has affected our model’s performance, particularly in the “No Findings” category. Initially, we did not address this imbalance in order to observe the model’s natural learning patterns. To improve the model, however, we applied the field-shuffling augmentation technique to mitigate the effect of underrepresented categories.

3.3.3. Rationale for model configurations

The rationale behind choosing specific model configurations is based on balancing model complexity

