
Artificial Intelligence in Health                                 Interpretability of deep models for COVID-19



•   Experiment 5: This experiment aims to explore how the accuracy of a classification
    model changes when data augmentation techniques, such as SpecAugment and Mix-up, are
    used. In this experiment, we used the SpiraNet model but replaced MFCCs with
    spectrograms to simplify human analysis and improve audio resynthesis (see
    Experiment 6). As usual, for the SpiraNet, we use Set 1 of Table 1.
•   Experiments 6a and 6b: These experiments differ from the previous five. We performed
    a qualitative analysis focused on model explainability using heat maps generated by
    Grad-CAM. This method aims to uncover the underlying reasons for the model’s
    classification decisions by generating heat maps that highlight important zones in
    the decision process. First, in Experiment 6a, we conducted a preliminary analysis
    and case study, investigating Grad-CAM heat maps from Experiments 1 to 3 (see
    Section 4.2) to understand which parts of the input are more relevant for
    classification. Then, in Experiment 6b, we performed a detailed analysis, focusing on
    the heat maps generated in Experiment 1. Our preliminary analysis showed that the
    spectrogram plays a major role in classification (see Section 4.3). In Section 4.3,
    we also resynthesized audios from Experiment 1, allowing us to hear them and
    investigate attention from both a visual and aural perspective. The audio
    reconstruction process is done in two steps. First, the heat map generated by
    Grad-CAM and the log-Mel spectrogram are combined using the Hadamard product. Second,
    the result and the phase of the original spectrogram are used to generate new audios
    highlighting the moments and frequencies the model considered most important in its
    decisions. We refer to the combination of original log-Mel spectrograms with heat
    maps as modified spectrograms.

4. Results
4.1. Experiments 1 – 5: Quantitative analysis

Table 2 presents the results of Experiments 1 – 5, with accuracies ranging from 65.74% to
94.44%. From Experiment 1, we observe that spectrograms are discriminative. Likewise,
Experiment 2 showed that F0, F0-STD, sex, and age also contain discriminative information.
However, spectrograms appear to carry more useful information, since the accuracy of
Experiment 1 is more than 10 percentage points higher than that of Experiment 2.
Experiment 3 suggests that features extracted from inputs in Experiments 1 and 2 are
largely equivalent despite a slight increase in accuracy (almost 2 percentage points)
compared to Experiment 1. It should be noted that Experiments 1 – 4 used only noise
insertion as data augmentation.

Table 2. Results from Experiments 1 – 5

Experiment   True positives   True negatives   False positives   False negatives   Accuracy (%)
1            37               49               5                 17                79.63
2            36               38               16                18                68.52
3            44               44               10                10                81.48
4            51               51               3                 3                 94.44ᵃ
5            49               22               32                5                 65.74

ᵃ Note: The highest accuracy among the experiments was achieved by Experiment 4.

Experiment 4 achieved the highest accuracy (94.44%), indicating that transfer learning
significantly impacts learning features from patient and control groups, surpassing the
results of Casanova et al.⁹ These findings suggest that CNN14 might be better suited than
SpiraNet for COVID-19 detection. CNN14’s results are comparable in accuracy to those of
the transformer-based architectures described by Gauy and Finger,¹² with the added
advantage of using spectrograms as input instead of MFCCs, as was the case for the
MFCC-transformer.¹² This advantage is attributed to the effectiveness of the transfer
learning used. Experiment 5 demonstrates that data augmentation techniques (SpecAugment
and Mix-up) did not improve SpiraNet accuracy, as it performed worse than in
Experiments 1 and 2. Experiments 6a and 6b are presented separately because they are
based on heat maps, human analysis, and audio resynthesis (Sections 4.2 and 4.3,
respectively).

Regarding errors, most experiments resulted in a balance of false positives and false
negatives. Experiment 1 was an exception, presenting more false negatives (17) than false
positives (5). This experiment might have been more susceptible than others to cases of
silent hypoxia, in which a patient has low blood oxygenation but does not present severe
symptoms. Another exception was Experiment 5, which had significantly more false
positives (32) than false negatives (5). A hypothesis for this phenomenon is that
SpecAugment forces the model to give less importance to pauses, which are crucial for
detecting respiratory insufficiency.¹⁸ This may occur because the method introduces
artificial pauses in training data.

4.2. Experiment 6a: Case study based on heat map analysis

Experiment 6a involved using Grad-CAM to generate heat maps for experiments based on
inputs (Section 3.5). Figures 2 – 4 present the results of heat maps and modified
spectrograms for Experiments 1 – 3, respectively. Experiment 1 focused solely on
spectrograms. The visual results for two patients and two control group
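
As an arithmetic check on Table 2, accuracy can be recomputed from the four confusion counts as (TP + TN) / (TP + TN + FP + FN). The sketch below uses the counts copied from Table 2. The names `TABLE_2`, `accuracy_percent`, and `ACCURACIES` are illustrative and are not taken from the authors' code.

```python
# Confusion counts copied from Table 2, as (TP, TN, FP, FN) per experiment.
TABLE_2 = {
    1: (37, 49, 5, 17),
    2: (36, 38, 16, 18),
    3: (44, 44, 10, 10),
    4: (51, 51, 3, 3),
    5: (49, 22, 32, 5),
}


def accuracy_percent(tp, tn, fp, fn):
    """Accuracy in percent: correct decisions over all decisions."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


# Recomputed accuracies, rounded to two decimals as in the table.
ACCURACIES = {exp: round(accuracy_percent(*counts), 2)
              for exp, counts in TABLE_2.items()}

for exp, acc in ACCURACIES.items():
    print(f"Experiment {exp}: {acc:.2f}% of {sum(TABLE_2[exp])} test samples")
```

Every row sums to the same number of test samples, so the accuracies are directly comparable across experiments.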

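
The two-step audio reconstruction described under Experiments 6a and 6b can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the authors' implementation. It assumes the heat map is scaled to [0, 1] and lies on the same time-frequency grid as the spectrogram and the phase, and it treats the spectrogram values as non-negative magnitudes. Mapping log-Mel values back to a linear-frequency scale and the final inverse transform to a waveform are not shown, because the text does not detail them. The helper names `hadamard` and `attach_phase` and all numeric values are hypothetical.

```python
import cmath
import math


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally shaped 2-D lists."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("heat map and spectrogram must have the same shape")
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def attach_phase(magnitude, phase):
    """Combine magnitudes with phases (in radians) into complex bins."""
    return [[cmath.rect(m, p) for m, p in zip(rm, rp)]
            for rm, rp in zip(magnitude, phase)]


# Toy 2 x 3 example: rows are frequency bins, columns are time frames.
# All values are made up for illustration.
spectrogram = [[1.0, 2.0, 4.0], [0.5, 3.0, 2.0]]
heat_map = [[0.0, 0.5, 1.0], [1.0, 0.0, 0.25]]  # assumed scaled to [0, 1]
phase = [[0.0, math.pi / 2, math.pi], [0.0, 0.0, -math.pi / 2]]

# Step 1: modified spectrogram = heat map (Hadamard product) spectrogram.
modified = hadamard(heat_map, spectrogram)

# Step 2: reuse the original phase so an inverse transform could
# turn the complex bins back into audio.
complex_bins = attach_phase(modified, phase)
```

Bins where the heat map is 0 are silenced and bins where it is 1 pass through unchanged, so the resynthesized audio would keep only the regions the model weighted most.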

            Volume 1 Issue 3 (2024)                        119                               doi: 10.36922/aih.2992