
Artificial Intelligence in Health                                 Interpretability of deep models for COVID-19



Sections 3.5 and 3.6. The goal of these experiments was to determine which operations are relevant for classification and how they affect the model's decision process (analyzed by Grad-CAM).

  The server used for our training has an Intel Xeon Silver CPU (39 cores, 2.40 GHz), 56 GB of RAM, and two Nvidia 2080 GPUs (8 GB of VRAM each). Some of the runs occurred only on the CPU cores, while others used both GPUs and the CPU. Overall, all the experiments took approximately the same time to run: around 6 h in the CPU-only scenario or 1 h using both GPUs and the CPU. Some small variations were observed, mainly due to the preprocessing techniques used, as they were performed exclusively on the CPU during each epoch of training. It should be noted that inference took only a few seconds in our environment.

3.1. Transfer learning with PANNs

PANNs have proven effective for transfer learning across various tasks.13 They have been successfully applied to multiple audio classification tasks, such as audio set tagging,13 speech emotion recognition,25 and automated audio captioning.26 PANNs are Mel spectrogram-based models trained on the AudioSet dataset, which comprises approximately 1.9 million audios, totaling 527 classes and over 5000 h. While the original authors explored several architectures, in this work, we used only the CNN14 architecture due to its simplicity and similarity to SpiraNet,9 allowing it to benefit from the same preprocessing techniques.

3.2. Data augmentation

Following the work of Casanova et al.,20 three data augmentation techniques were applied: noise insertion, Mix-up, and SpecAugment.

  First, noise insertion was performed due to the different recording environments present in the SPIRA dataset for patient and control group audios. Previous research9 has shown that models trained on this dataset can overfit if the data are not preprocessed adequately, leading to biases such as distinguishing control and patient groups based on the presence of hospital ward noise. To mitigate this, we followed the Casanova et al.9 approach by injecting hospital ward noise into all audios. For some experiments, we inserted four noise recordings for the control group and three for the patient group, while other experiments used three noise recordings for each class, based on Casanova et al.'s findings.9

  Second, due to the small size of the training set, we used a data augmentation technique called Mix-up to increase model robustness. Mix-up combines two random instances (x_i and x_j) from the training set and their respective classes (y_i and y_j) to generate a new instance (x̃, ỹ) using Equations I and II:27

  x̃ = λx_i + (1 − λ)x_j    (I)

  ỹ = λy_i + (1 − λ)y_j    (II)

where λ ∈ [0, 1] is generated from a Beta distribution. Unlike common image processing augmentation techniques, such as rotation, cropping, and horizontal flipping, Mix-up is applicable to various tasks, including audio processing. It helps generate better decision frontiers in the manifold extracted by the model during training,28 which is particularly beneficial for small datasets.

  Finally, to further enhance model robustness in the case of small training sets, we used the SpecAugment data augmentation technique.29 SpecAugment is designed for spectrograms and was initially developed for automatic voice recognition. It performs augmentation on the spectrogram by first applying distortion in the time dimension, termed time warping in the study by Casanova et al.,9 and then masking parts of frequency channels and masking blocks in time. The frequency mask is applied over f consecutive Mel channels [f0, f0 + f), where f is chosen uniformly from 0 to F and F represents the maximum number of masked Mel channels (set to eight in our experiments). The parameter f0 is chosen uniformly at random from [0, v − f), where v is the total number of Mel-frequency channels. The temporal mask is performed over t time slots [t0, t0 + t), where t and t0 are determined analogously to the frequency mask.

3.3. Windowing

Each original audio, which is at least 4 s long, is divided into smaller 4-s audios. The division was performed using a 4-s window with a 1-s hop. For example, a 5-s audio was split into two segments: the first from seconds 0 – 4 and the second from seconds 1 – 5. This approach, initially employed by Casanova et al.,9 ensures uniform audio lengths and prevents the model from overfitting based on audio length: as patient audios tend to be longer, models can overfit on audio length if no normalization is done.

  The windows cover repeated fragments of the original audios to include as many fragments of the original spoken sentence as possible. It is important to note that windowing was performed separately for the training and test sets. During training, each fragment was labeled with the class of the original audio. In the test set, a voting mechanism over the windowed audios was used to determine the class
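
  The Mix-up combination of Equations I and II can be sketched in a few lines of NumPy. The Beta-distribution parameter `alpha` below is a hypothetical choice for illustration; this section does not state the value used in the experiments:

```python
import numpy as np

def mixup(x_i, x_j, y_i, y_j, alpha=0.2, rng=None):
    """Combine two instances and their labels (Equations I and II).

    `alpha` parameterizes the Beta distribution from which the
    mixing coefficient lambda is drawn (hypothetical value here).
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)           # lambda in [0, 1]
    x_new = lam * x_i + (1.0 - lam) * x_j  # Equation I
    y_new = lam * y_i + (1.0 - lam) * y_j  # Equation II
    return x_new, y_new
```

  With one-hot labels, the mixed label y_new is a convex combination of the two classes, which is what softens the decision frontier during training.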

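  The frequency and temporal masks described for SpecAugment can be sketched as follows. This is a simplified version that omits time warping; `F = 8` matches the experiments reported here, while the maximum temporal mask width `T` is a hypothetical parameter:

```python
import numpy as np

def spec_augment(spec, F=8, T=10, rng=None):
    """Mask one block of Mel channels and one block of time frames.

    spec: array of shape (num_mels, num_frames). F is the maximum
    number of masked Mel channels (eight in our experiments); T is
    a hypothetical maximum temporal mask width. Time warping is
    omitted for brevity.
    """
    rng = rng or np.random.default_rng()
    out = spec.copy()
    v, n_frames = out.shape
    # Frequency mask: f ~ U{0..F}, f0 ~ U[0, v - f)
    f = rng.integers(0, F + 1)
    f0 = rng.integers(0, v - f)
    out[f0:f0 + f, :] = 0.0
    # Temporal mask, determined analogously to the frequency mask
    t = rng.integers(0, T + 1)
    t0 = rng.integers(0, n_frames - t)
    out[:, t0:t0 + t] = 0.0
    return out
```

  Each call zeroes at most one frequency block and one time block; libraries such as torchaudio provide equivalent FrequencyMasking/TimeMasking transforms.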

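  The 4-s windowing with a 1-s hop and the test-time voting of Section 3.3 can be sketched as below; the function names are illustrative, not from the original implementation:

```python
from collections import Counter

def window_audio(signal, sr, win_s=4, hop_s=1):
    """Split a waveform into win_s-second windows with a hop_s-second hop."""
    win, hop = win_s * sr, hop_s * sr
    return [signal[i:i + win] for i in range(0, len(signal) - win + 1, hop)]

def vote(window_predictions):
    """Majority vote over per-window predictions (test-time only)."""
    return Counter(window_predictions).most_common(1)[0][0]
```

  For a 5-s audio this yields two overlapping windows (seconds 0 – 4 and 1 – 5), matching the example in the text; at test time the per-window predictions are aggregated by majority vote.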
            Volume 1 Issue 3 (2024)                        117                               doi: 10.36922/aih.2992