from the source text that best preserve the salient aspects of the document. The abstractive method, on the other hand, may generate text that is not contained in the source text.

Biomedical summaries require an abstractive approach. Radiology settings in particular require the interpretation of many data points to identify the most important aspects of a patient's condition and the implications for future care. For instance, a radiologist's report may include only the physical details of lung nodules, but the summary may conclude that "pneumonia is or is not present."

3.2. Baselines

We began by leveraging several pre-trained models from HuggingFace and built our own data processing pipeline4 to extract sections and identify relevant training data from the MIMIC-CXR dataset. We baselined our project by fine-tuning a T5 encoder–decoder8 and Meta's BART13 on the MIMIC-CXR dataset, implementing Teacher Forcing14 on our training batches with HuggingFace's DataCollatorForSeq2Seq.15 This baseline already performed near SOTA at 47.55 ROUGE-L.
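As a rough illustration of this baseline setup, the sketch below fine-tunes a pre-trained seq2seq checkpoint with teacher forcing, where DataCollatorForSeq2Seq pads the reference summaries as labels and the model derives the shifted decoder inputs from them. The checkpoint name, hyperparameters, and the `mimic_cxr_train` dataset object are illustrative placeholders, not the exact configuration used in our experiments.

```python
# Minimal baseline sketch (illustrative): fine-tune a pre-trained seq2seq
# model with teacher forcing via labels padded by DataCollatorForSeq2Seq.
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

checkpoint = "facebook/bart-base"            # e.g., a BART or T5 checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

def preprocess(example):
    # Report sections (e.g., FINDINGS) as input, IMPRESSION as reference summary
    model_inputs = tokenizer(example["findings"], max_length=512, truncation=True)
    labels = tokenizer(text_target=example["impression"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

# mimic_cxr_train is assumed to be a datasets.Dataset of extracted report sections
train_dataset = mimic_cxr_train.map(preprocess, remove_columns=mimic_cxr_train.column_names)

collator = DataCollatorForSeq2Seq(tokenizer, model=model)   # pads labels with -100
args = Seq2SeqTrainingArguments(output_dir="baseline-bart",
                                per_device_train_batch_size=8,
                                num_train_epochs=3,
                                predict_with_generate=True)
trainer = Seq2SeqTrainer(model=model, args=args, data_collator=collator,
                         train_dataset=train_dataset, tokenizer=tokenizer)
trainer.train()
```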
3.3. Main approach

To improve on these results, we experimented with approaches in model architecture, including:
•   Larger variants of common architectures
•   Custom encoder–decoder models
•   Specialized checkpoints in medical data
•   Models using linear attention mechanisms.

We then moved on to a data-centric approach by shuffling all input fields for each of our highest-performing model architectures: T5, BERT2BERT, and a BigBird-PubMed-Base model,16 with the latter chosen because it relies on block sparse attention instead of full attention and can handle longer sequences. This data-centric approach proved to be key to reaching a new state-of-the-art performance, improving the ROUGE-L score of the previous SOTA work5 by 1.38 points.

As in the baselines, we leveraged pre-trained models from HuggingFace and our data processing pipeline to extract sections and identify relevant training data from the MIMIC-CXR dataset, fine-tuning a T5 encoder–decoder8 and Meta's BART13 on it. To perform the summarization task, the input fields were fed, as sentence-level embeddings created by the encoder, to a randomly initialized transformer decoder. Encoders and decoders were fine-tuned end-to-end; Figure 1 illustrates the whole process.
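To make the encoder–decoder wiring concrete, the following is a minimal sketch of a BERT2BERT-style model in HuggingFace: a pre-trained biomedical BERT encoder paired with a randomly initialized BERT decoder, trained end-to-end with the reference summary passed as labels. The encoder checkpoint shown is only an example of a biomedical BERT, and the sketch uses token-level encoder states rather than the sentence-level embedding scheme described above.

```python
# Sketch of a BERT2BERT-style encoder-decoder (assumptions: the encoder
# checkpoint is an example biomedical BERT; token-level encoder states stand
# in for the sentence-level embeddings described in the text).
from transformers import (BertConfig, BertLMHeadModel, BertModel,
                          BertTokenizerFast, EncoderDecoderModel)

enc_name = "dmis-lab/biobert-base-cased-v1.1"      # example biomedical encoder
tokenizer = BertTokenizerFast.from_pretrained(enc_name)
encoder = BertModel.from_pretrained(enc_name)      # pre-trained encoder

decoder_config = BertConfig(vocab_size=encoder.config.vocab_size,
                            is_decoder=True, add_cross_attention=True)
decoder = BertLMHeadModel(decoder_config)          # randomly initialized decoder

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
model.config.eos_token_id = tokenizer.sep_token_id

# Teacher-forced training step: the reference summary is passed as labels and
# the decoder cross-attends to the encoder outputs; both parts receive gradients.
inputs = tokenizer("FINDINGS: No focal consolidation.", return_tensors="pt")
labels = tokenizer("No acute cardiopulmonary process.", return_tensors="pt").input_ids
outputs = model(**inputs, labels=labels)
outputs.loss.backward()
```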
We used cross-entropy loss17 with a weighted average of distances to a reference word in the vector space, as outlined in Equation I, where y_i is the i-th word in the predicted summary, V_k is the k-th word in the reference vocabulary, E(w) is the embedding vector of a word w (applied to V_k and y_i in the equation), and ed is the Euclidean distance computation:

Loss = Σ_{i=0}^{I} Σ_{k=0}^{K} p(y_{<i}, X) · ed(E(V_k), E(y_i))    (I)
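The PyTorch fragment below shows one possible reading of Equation I, under the assumption that p(·) is the decoder's predicted vocabulary distribution at position i under teacher forcing; tensor names and shapes are illustrative, not a verified reimplementation.

```python
import torch

def embedding_distance_loss(logits, target_ids, embedding_matrix):
    """One reading of Equation I (illustrative sketch).

    logits:           (batch, seq_len, vocab) decoder outputs under teacher forcing
    target_ids:       (batch, seq_len) reference summary token ids y_i
    embedding_matrix: (vocab, dim) word vectors E(V_k)
    """
    probs = torch.softmax(logits, dim=-1)                # p(. | y_<i, X)
    ref_vectors = embedding_matrix[target_ids]           # E(y_i): (batch, seq_len, dim)
    # Euclidean distance ed(E(V_k), E(y_i)) to every vocabulary word k
    vocab_vectors = embedding_matrix.unsqueeze(0).expand(ref_vectors.size(0), -1, -1)
    distances = torch.cdist(ref_vectors, vocab_vectors)  # (batch, seq_len, vocab)
    # Double sum over positions i and vocabulary entries k, averaged over the batch
    return (probs * distances).sum(dim=(-1, -2)).mean()
```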
Table 1 provides an overview of the model performance.

3.3.1. Data pipeline

We built a comprehensive data processing pipeline to handle the MIMIC-CXR dataset. The pipeline includes the following steps:
(1) Data extraction: free-text radiology reports are extracted, together with their associated labels, from the MIMIC-CXR dataset.
(2) Preprocessing: the text is tokenized and unnecessary characters or noise are removed.
(3) Section extraction: relevant sections such as FINDINGS, IMPRESSION, INDICATION, and TECHNIQUE are extracted from the radiology reports (see the sketch after this list).
(4) Data augmentation: augmentation techniques are applied to create new training examples, improving model robustness and generalization (detailed in the following subsection).
(5) Model training: the processed and augmented data are used to train the Biomedical-BERT2BERT model.
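A rough sketch of the section-extraction step (3) is given below; the helper name and the regular expression are illustrative, and real MIMIC-CXR reports require more robust handling of header variants than this shows.

```python
import re

# Illustrative helper for pipeline step (3): pull named sections out of one
# free-text report. Header variants and malformed reports need extra handling.
SECTION_HEADERS = ("INDICATION", "TECHNIQUE", "FINDINGS", "IMPRESSION")
_SECTION_RE = re.compile(
    r"(?P<header>" + "|".join(SECTION_HEADERS) + r"):\s*"
    r"(?P<body>.*?)(?=(?:" + "|".join(SECTION_HEADERS) + r"):|\Z)",
    flags=re.DOTALL,
)

def extract_sections(report_text: str) -> dict:
    """Return {section_name: normalized_text} for the sections found."""
    return {m.group("header"): " ".join(m.group("body").split())
            for m in _SECTION_RE.finditer(report_text)}

sections = extract_sections(
    "INDICATION: Cough.\nFINDINGS: No focal consolidation.\nIMPRESSION: No acute process."
)
# -> {'INDICATION': 'Cough.', 'FINDINGS': 'No focal consolidation.',
#     'IMPRESSION': 'No acute process.'}
```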
3.3.2. Data augmentation techniques

Significant improvements in our model are attributed to the data augmentation techniques we utilized. Data augmentation in NLP involves creating new training examples by altering the existing data to improve model robustness and generalization.

We implemented input-field shuffling: for each model, we trained over different epochs with shuffled input fields as new examples (i.e., FINDINGS, IMPRESSION, INDICATION, and TECHNIQUE; Figure 2). This technique ensures that the model learns to understand the context and meaning of the text regardless of the order in which the sentences are presented, thereby improving its ability to generalize across different sentence structures.
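A minimal sketch of this augmentation is shown below; the separator format and the number of shuffled variants per report are illustrative choices rather than our exact settings.

```python
import random

# Illustrative input-field shuffling: permute the order of the report sections
# fed to the encoder to create additional training examples.
FIELDS = ("FINDINGS", "IMPRESSION", "INDICATION", "TECHNIQUE")

def shuffled_inputs(sections: dict, n_variants: int = 3, sep: str = " "):
    """Yield n_variants permutations of the available fields as encoder inputs."""
    present = [f for f in FIELDS if f in sections]
    for _ in range(n_variants):
        order = random.sample(present, len(present))
        yield sep.join(f"{name}: {sections[name]}" for name in order)

# Example: each variant presents the same content in a different field order.
for text in shuffled_inputs({"FINDINGS": "No focal consolidation.",
                             "INDICATION": "Cough."}):
    print(text)
```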
We acknowledge that class imbalance in the dataset has affected our model's performance, particularly in the "No Findings" category. Initially, we did not address this imbalance, in order to observe the model's natural learning patterns. However, to improve the model, we implemented the field-shuffling augmentation technique to mitigate the effect of underrepresented categories.

3.3.3. Rationale for model configurations

The rationale behind choosing specific model configurations is based on balancing model complexity

