
Artificial Intelligence in Health                                 Interpretability of deep models for COVID-19



Sections 3.5 and 3.6. The goal of these experiments was to determine which operations are relevant for classification and how they affect the model's decision process (analyzed by Grad-CAM).

  The server used for our training has an Intel Xeon Silver CPU (39 cores, 2.40 GHz), 56 GB of RAM, and two Nvidia 2080 GPUs (8 GB of VRAM each). Some of the runs occurred only on the CPU cores, while others used both GPUs and the CPU. Overall, all the experiments took approximately the same time to run: around 6 h in the CPU-only scenario or 1 h using both GPUs and the CPU. Some small variations were observed, mainly due to the preprocessing techniques used, as they were performed exclusively on the CPU during each epoch of training. It should be noted that inference took only a few seconds in our environment.

3.1. Transfer learning with PANNs

PANNs have proven effective for transfer learning across various tasks.13 They have been successfully applied to multiple audio classification tasks, such as audio set tagging,13 speech emotion recognition,25 and automated audio captioning.26 PANNs are Mel spectrogram-based models trained on the AudioSet dataset, which comprises approximately 1.9 million audios, totaling 527 classes and over 5000 h. While the original authors explored several architectures, in this work, we used only the CNN14 architecture due to its simplicity and similarity to SpiraNet,9 allowing it to benefit from the same preprocessing techniques.

3.2. Data augmentation

Following the work of Casanova et al.,20 three data augmentation techniques were applied: noise insertion, Mix-up, and SpecAugment.

  First, noise insertion was performed due to the different recording environments present in the SPIRA dataset for patient and control group audios. Previous research9 has shown that models trained on this dataset can overfit if the data are not preprocessed adequately, leading to biases such as distinguishing control and patient groups based on the presence of hospital ward noise. To mitigate this, we followed the Casanova et al.9 approach by injecting hospital ward noise into all audios. For some experiments, we inserted four noise recordings for the control group and three for the patient group, while other experiments used three noise recordings for each class, based on Casanova et al.'s findings.9

  Second, due to the small size of the training set, we used a data augmentation technique called Mix-up to increase model robustness. Mix-up combines two random instances (x_i and x_j) from the training set and their respective classes (y_i and y_j) to generate a new instance (x̃, ỹ) using Equations I and II:27

  x̃ = λx_i + (1 − λ)x_j    (I)

  ỹ = λy_i + (1 − λ)y_j    (II)

where λ ∈ [0, 1] is generated from a Beta distribution. Unlike common image processing augmentation techniques, such as rotation, cropping, and horizontal flipping, Mix-up is applicable to various tasks, including audio processing. It helps generate better decision frontiers in the manifold extracted by the model during training,28 which is particularly beneficial for small datasets.

  Finally, to further enhance model robustness in the case of small training sets, we used the SpecAugment data augmentation technique.29 SpecAugment is designed for spectrograms and was initially developed for automatic voice recognition. It performs augmentation on the spectrogram by first applying distortion in the time dimension, termed time warping in the study by Casanova et al.,9 and then masking parts of frequency channels and masking blocks in time. The frequency mask is applied over f consecutive Mel channels [f0, f0 + f), where f is chosen uniformly from 0 to F and F represents the maximum number of masked Mel channels (set to eight in our experiments). The parameter f0 is chosen uniformly at random from [0, v − f), where v is the total number of Mel-frequency channels. The temporal mask is performed over t time slots [t0, t0 + t), where t and t0 are determined analogously to the frequency mask.

3.3. Windowing

Each original audio, which is at least 4 s long, is divided into smaller 4-s audios. The division was performed using a 4-s window with a 1-s hop. For example, a 5-s audio was split into two segments: the first from seconds 0 – 4 and the second from seconds 1 – 5. This approach, initially employed by Casanova et al.,9 ensures uniform audio lengths and prevents the model from overfitting based on audio length: as patient audios tend to be longer, models can overfit on audio length if no normalization is done.

  The windows cover repeated fragments of the original audios to include as many fragments of the original spoken sentence as possible. It is important to note that windowing was performed separately for the training and test sets. During training, each fragment was labeled with the class of the original audio. In the test set, a voting mechanism over the windowed audios was used to determine the class
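
  The Mix-up combination of Equations I and II can be sketched in a few lines of NumPy. The Beta-distribution parameter `alpha` below is a hypothetical choice for illustration; this section does not state the value used in the experiments:

```python
import numpy as np

def mixup(x_i, x_j, y_i, y_j, alpha=0.2, rng=None):
    """Combine two instances and their labels (Equations I and II).

    `alpha` parameterizes the Beta distribution from which the
    mixing coefficient lambda is drawn (hypothetical value here).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # lambda in [0, 1]
    x_new = lam * x_i + (1.0 - lam) * x_j  # Equation I
    y_new = lam * y_i + (1.0 - lam) * y_j  # Equation II
    return x_new, y_new
```

  With one-hot labels, the mixed label y_new is a convex combination of the two classes, which is what softens the decision frontier during training.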

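  The frequency and temporal masks described for SpecAugment can be sketched as follows. This is a simplified version that omits time warping; `F = 8` matches the experiments reported here, while the maximum temporal mask width `T` is a hypothetical parameter:

```python
import numpy as np

def spec_augment(spec, F=8, T=10, rng=None):
    """Mask one block of Mel channels and one block of time frames.

    spec: array of shape (num_mels, num_frames). F is the maximum
    number of masked Mel channels (eight in our experiments); T is
    a hypothetical maximum temporal mask width. Time warping is
    omitted for brevity.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    v, n_frames = out.shape
    # Frequency mask: f ~ U{0..F}, f0 ~ U[0, v - f)
    f = rng.integers(0, F + 1)
    f0 = rng.integers(0, v - f)
    out[f0:f0 + f, :] = 0.0
    # Temporal mask, determined analogously to the frequency mask
    t = rng.integers(0, T + 1)
    t0 = rng.integers(0, n_frames - t)
    out[:, t0:t0 + t] = 0.0
    return out
```

  Each call zeroes at most one frequency block and one time block; libraries such as torchaudio provide equivalent FrequencyMasking/TimeMasking transforms.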

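  The 4-s windowing with a 1-s hop and the test-time voting of Section 3.3 can be sketched as below; the function names are illustrative, not from the original implementation:

```python
from collections import Counter

def window_audio(signal, sr, win_s=4, hop_s=1):
    """Split a waveform into win_s-second windows with a hop_s-second hop."""
    win, hop = win_s * sr, hop_s * sr
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]

def vote(window_predictions):
    """Majority vote over per-window predictions (test-time only)."""
    return Counter(window_predictions).most_common(1)[0][0]
```

  For a 5-s audio this yields two overlapping windows (seconds 0 – 4 and 1 – 5), matching the example in the text; at test time the per-window predictions are aggregated by majority vote.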
            Volume 1 Issue 3 (2024)                        117                               doi: 10.36922/aih.2992