1. Introduction

In December 2019, a novel coronavirus, severe acute respiratory syndrome coronavirus 2, was identified as the causative agent of coronavirus disease 2019 (COVID-19). The virus rapidly became a global concern, reaching pandemic status as declared by the World Health Organization.[1] COVID-19 evolved to become more contagious and lethal over a short period.
Researchers from all fields joined efforts to tackle the pandemic crisis. In particular, researchers in artificial intelligence (AI) and related areas sought methods to simplify COVID-19 detection. These methods use a variety of sources, such as medical examinations,[2] symptoms,[3] and X-ray images,[4] among others.[5] A potential source for COVID-19 detection is audio recordings. Several projects have collected audio samples from patients, including speech and cough sounds,[6-8] to develop detection models. These models could optimize the patient screening process. However, existing approaches have limitations in data collection procedures. For example, environmental noise can be present during audio capture, leading to model overfitting on such noise.

The dataset presented in the SPIRA project[8,9] illustrates these challenges. Positive audio samples (read speech) from COVID-19 patients were collected in hospitals, while samples from symptom-free individuals were obtained through a web interface. These samples were labeled as the control group, with the caveat that no additional testing for COVID-19 was performed on these subjects. Training a model on this dataset requires precautions to avoid learning biases due to differences in the collection environment, as patient audios may contain hospital noise, while the control group may include other environmental noises. Moreover, a model trained on this dataset contrasts healthy cases with more severe COVID-19 cases, which typically exhibit symptoms such as respiratory insufficiency. Such models will likely not be able to identify COVID-19 cases that do not induce severe symptoms.
In this work, we trained and analyzed convolutional neural networks (CNNs) for COVID-19 detection from audio, using the dataset from the study by Casanova et al.[9] In addition, we analyzed the factors important for the model decision using several criteria, namely spectrograms, fundamental frequency (F0), fundamental frequency standard deviation (F0-STD), speaker age, and sex.
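As an illustration of how the prosodic criteria can be computed, the following is a minimal sketch of F0 and F0-STD extraction. The paper does not specify its F0 estimator; librosa's pyin and the pitch range used here are illustrative assumptions.

import librosa
import numpy as np

def f0_stats(audio, sr=16000):
    """Estimate mean F0 and its standard deviation over voiced frames."""
    f0, voiced, _ = librosa.pyin(
        audio, sr=sr,
        fmin=librosa.note_to_hz("C2"),   # ~65 Hz, illustrative range
        fmax=librosa.note_to_hz("C6"))   # ~1047 Hz
    f0 = f0[voiced & ~np.isnan(f0)]      # keep voiced frames only
    return float(np.mean(f0)), float(np.std(f0))  # F0, F0-STD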
We applied the gradient-weighted class activation mapping (Grad-CAM)[10] algorithm to generate heat maps, allowing us to investigate which pieces of information are most relevant for the model's classification decisions.
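To make the procedure concrete, below is a minimal Grad-CAM sketch in PyTorch. It assumes a CNN classifier that takes a log-Mel spectrogram and returns class logits; the function name, the hook-based implementation, and the choice of the last convolutional layer as the target are illustrative assumptions, not the exact implementation used in this work.

import torch
import torch.nn.functional as F

def grad_cam(model, spectrogram, target_layer, class_idx):
    """Grad-CAM heat map for one log-Mel spectrogram.

    spectrogram: tensor of shape (1, 1, n_mels, n_frames).
    target_layer: convolutional layer to explain (typically the last one).
    class_idx: class to explain (e.g., the COVID-positive logit).
    """
    activations, gradients = [], []
    # Hooks capture the layer's feature maps and their gradients.
    fwd = target_layer.register_forward_hook(
        lambda mod, inp, out: activations.append(out))
    bwd = target_layer.register_full_backward_hook(
        lambda mod, gin, gout: gradients.append(gout[0]))

    model.zero_grad()
    score = model(spectrogram)[0, class_idx]  # assumes logits output
    score.backward()
    fwd.remove()
    bwd.remove()

    acts, grads = activations[0], gradients[0]      # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)  # pool gradients
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))

    # Upsample to the spectrogram resolution and normalize to [0, 1].
    cam = F.interpolate(cam, size=spectrogram.shape[-2:],
                        mode="bilinear", align_corners=False)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam.squeeze()                            # (n_mels, n_frames)

The resulting map shares the time-frequency resolution of the input spectrogram, which is what allows the element-wise weighting and resynthesis described below.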
As the dataset used in this work contained audios from different collection environments (hospital and domestic), learning biases could occur toward hospital noise. To mitigate this problem, we introduced hospital noise into domestic audio samples, following the proposal of Casanova et al.[9] We also used their data augmentation techniques to improve model performance.
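The noise-injection step can be sketched as follows; the mixing-at-a-given-SNR formulation and the SNR range are illustrative assumptions, and Casanova et al.[9] describe the exact protocol.

import numpy as np

def inject_noise(speech, noise, snr_db):
    """Mix a noise recording into a clean sample at a target SNR (dB)."""
    # Loop the noise if it is shorter than the speech, then trim.
    if len(noise) < len(speech):
        noise = np.tile(noise, len(speech) // len(noise) + 1)
    noise = noise[: len(speech)]

    # Scale the noise so that 10*log10(P_speech / P_noise) == snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Illustrative use (control_audio and hospital_noise are placeholder
# arrays): corrupt a domestic control recording with hospital noise
# at a random SNR so both classes share similar acoustics.
# rng = np.random.default_rng(0)
# noisy = inject_noise(control_audio, hospital_noise,
#                      snr_db=rng.uniform(5, 20))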
Finally, we could literally hear the areas in the audio that the model values most in its decision process. To achieve this, we multiplied the heat maps obtained from Grad-CAM by the original log-Mel spectrograms and synthesized the result. It is important to note that we focused on spectrograms rather than Mel-frequency cepstral coefficients (MFCCs)[11] to enhance interpretability, while previous works opted to explore MFCCs[9,12] to attain accuracies above or close to 90%. As spectrogram-based models were shown in those papers to have inherently lower accuracy, we employed methods such as transfer learning (e.g., models pre-trained on large-scale audio datasets) to recover the model's performance when using log-Mel spectrograms.
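A sketch of this "listenable" explanation is shown below, assuming librosa for Mel analysis and Griffin-Lim-based inversion; the inversion method, and the choice to apply the Grad-CAM weighting in the Mel power domain before inversion, are simplifying assumptions rather than the exact synthesis pipeline of this work.

import librosa

def hear_heat_map(audio, cam, sr=16000, n_fft=1024,
                  hop_length=256, n_mels=128):
    """Resynthesize the regions of an audio signal a model attends to.

    cam: Grad-CAM heat map in [0, 1] with shape (n_mels, n_frames),
        already resized to the Mel spectrogram resolution.
    """
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft,
        hop_length=hop_length, n_mels=n_mels)

    # Emphasize the time-frequency regions the model values most.
    weighted = mel * cam

    # Invert back to a waveform; mel_to_audio estimates the phase
    # with Griffin-Lim.
    return librosa.feature.inverse.mel_to_audio(
        weighted, sr=sr, n_fft=n_fft, hop_length=hop_length)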
As a result, our best model uses a pre-trained audio neural network (PANN)[13] called CNN14, which, through transfer learning, achieves 94.44% accuracy, in line with the accuracy reported on the same dataset by recent works[12] using transformer-based architectures.[14]

This work presents four main contributions:
(i) We present an analysis detailing the features crucial for deep models to detect or rule out COVID-19 in patient and control audios. In the analyzed data, spectrograms contain more discriminative information than the combination of F0, F0-STD, sex, and age. A visual analysis of heat maps generated by Grad-CAM shows that, among F0, F0-STD, sex, and age, the most important feature is F0.
(ii) We present an interpretation of the decisions made by deep models using heat maps and audio synthesis, following an explainable AI approach. Based on the heat maps and audio resynthesis, we formulate a few hypotheses for the factors affecting model decisions, such as (a) the structure of pauses (patients produce longer and more frequent pauses than controls); (b) signal energy decreases faster over time for patients than for controls; and (c) an interplay between syntax and prosody emerges as a boundary marked by high formant-vowel energy.
(iii) Through manual analysis of the audio signals (using Grad-CAM), we ensure that the deep models focus on the voice (or silent pauses) rather than on environmental noise.
(iv) We demonstrate that models pre-trained on large-scale audio datasets, such as CNN14, can, through transfer learning, achieve accuracies on par with the best previously reported models,[12] even when using log-Mel spectrograms as input instead of MFCCs; a sketch of this setup follows the list.
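A minimal transfer-learning sketch of the setup in contribution (iv), assuming the Cnn14 class and checkpoint format of the PANNs repository (https://github.com/qiuqiangkong/audioset_tagging_cnn); the constructor settings, the checkpoint filename, and the two-class head are illustrative, not the exact configuration of this work.

import torch
import torch.nn as nn
# Hypothetical import: the Cnn14 class from the PANNs repository.
from models import Cnn14

# Settings follow the repository's 16 kHz configuration (illustrative).
backbone = Cnn14(sample_rate=16000, window_size=512, hop_size=160,
                 mel_bins=64, fmin=50, fmax=8000, classes_num=527)
ckpt = torch.load("Cnn14_16k.pth", map_location="cpu")  # placeholder path
backbone.load_state_dict(ckpt["model"])

class CovidDetector(nn.Module):
    """CNN14 backbone with a new two-class head (patient vs. control)."""

    def __init__(self, backbone, n_classes=2):
        super().__init__()
        self.backbone = backbone
        self.head = nn.Linear(2048, n_classes)  # CNN14 embedding size

    def forward(self, waveform):
        # PANNs models return a dict holding a pooled 'embedding'.
        emb = self.backbone(waveform)["embedding"]
        return self.head(emb)

model = CovidDetector(backbone)
# Fine-tune end to end, or freeze the backbone for linear probing:
# for p in model.backbone.parameters():
#     p.requires_grad = False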
