1. Introduction

In December 2019, a novel coronavirus, namely severe acute respiratory syndrome coronavirus-2, was identified as the causative agent for coronavirus disease 2019 (COVID-19). This coronavirus variant rapidly became a global concern, reaching pandemic status as declared by the World Health Organization.1 COVID-19 evolved to become more contagious and lethal over a short period. Researchers from all fields joined efforts to tackle the pandemic crisis. In particular, researchers in artificial intelligence (AI) and related areas sought methods to simplify COVID-19 detection. These methods use a variety of sources, such as medical examinations,2 symptoms,3 and X-ray images,4 among others. A potential source for COVID-19 detection is audio recordings.5 Several projects have collected audio samples from patients, including speech and cough sounds, to develop detection models.6-8 These models could optimize the patient screening process. However, existing approaches have limitations in data collection procedures. For example, environmental noise can be present during audio capture, leading to model overfitting on such noise.

The dataset presented in the SPIRA project8,9 illustrates these challenges. Positive audio samples (read speech) from COVID-19 patients were collected in hospitals, while samples from symptom-free individuals were obtained through a web interface. These samples were labeled as the control group, with the caveat that no additional testing for COVID-19 was performed on these subjects. Training a model on this dataset requires precautions to avoid learning biases due to differences in the collection environment, as patient audios may contain hospital noise, while the control group may include other environmental noises. Moreover, a model trained on this dataset contrasts healthy cases with more severe COVID-19 cases, which typically exhibit symptoms such as respiratory insufficiency. Such models will likely not be able to identify COVID-19 cases that do not induce severe symptoms.

In this work, we trained and analyzed convolutional neural networks (CNNs) for COVID-19 detection from audios using the dataset from the study by Casanova et al.9 In addition, we analyzed factors important for the model decision using several criteria, namely spectrograms, fundamental frequency (F0), fundamental frequency standard deviation (F0-STD), speaker age, and sex. We applied the gradient-weighted class activation mapping (Grad-CAM)10 algorithm to generate heat maps, allowing us to investigate which pieces of information are most relevant for the model's classification decisions.
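For concreteness, the sketch below shows one minimal way to compute such a Grad-CAM heat map for a spectrogram classifier in PyTorch. It is an illustration rather than the exact implementation used in this work: the `model` and `conv_layer` arguments are hypothetical placeholders for a CNN and its last convolutional layer.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, spec, target_class):
    """Grad-CAM heat map for one log-Mel spectrogram.

    spec: tensor of shape (1, 1, n_mels, n_frames).
    conv_layer: the layer whose activations are being explained.
    """
    activations, gradients = [], []
    # Capture the forward activations and their gradients at the chosen layer.
    h1 = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    logits = model(spec)                # (1, n_classes)
    model.zero_grad()
    logits[0, target_class].backward()  # gradient of the target class score
    h1.remove(); h2.remove()

    act, grad = activations[0], gradients[0]       # (1, C, H, W)
    weights = grad.mean(dim=(2, 3), keepdim=True)  # global-average-pooled gradients
    cam = F.relu((weights * act).sum(dim=1))       # (1, H, W)
    # Upsample to the spectrogram resolution and normalize to [0, 1].
    cam = F.interpolate(cam.unsqueeze(1), size=spec.shape[-2:],
                        mode="bilinear", align_corners=False)
    return ((cam - cam.min()) / (cam.max() - cam.min() + 1e-8)).squeeze()
```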
As the dataset used in this work contained audios from different collection environments (hospital and domestic), learning biases could occur toward hospital noise. To mitigate this problem, we introduced hospital noise into domestic audio samples following the proposal of Casanova et al.9 We also used their data augmentation techniques to improve model performance.
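The noise-insertion step can be illustrated as follows. This is a hedged sketch, not the exact procedure of Casanova et al.: the file names, sampling rate, and signal-to-noise ratio (SNR) range are assumptions for illustration only.

```python
import numpy as np
import librosa

def inject_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at a given signal-to-noise ratio (in dB)."""
    # Loop the noise so it covers the whole utterance, then trim.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10 * log10(p_speech / p_noise_scaled) == snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Illustrative usage: the paths and the 5-20 dB SNR range are assumptions.
speech, sr = librosa.load("control_sample.wav", sr=16000)
noise, _ = librosa.load("hospital_noise.wav", sr=16000)
augmented = inject_noise(speech, noise, snr_db=np.random.uniform(5, 20))
```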
Finally, we could literally hear the regions of the audio that the model values most in its decision process. To achieve this, we multiplied the heat maps obtained from Grad-CAM by the original log-Mel spectrograms, and the result was synthesized back into an audio signal. It is important to note that we focused on spectrograms rather than Mel-frequency cepstral coefficients (MFCCs) to enhance interpretability,11 while previous works opted to explore MFCCs to attain accuracies above or close to 90%.9,12 As spectrogram-based models were shown in those papers to have inherently lower accuracy, we employed methods such as transfer learning (e.g., pre-trained models on large-scale audio datasets) to recover the model's performance using log-Mel spectrograms.
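A plausible implementation of this resynthesis step is sketched below, using librosa's Griffin-Lim-based Mel inversion. The analysis settings and the min-shift normalization applied before masking are our assumptions; the exact parameters used in this work may differ.

```python
import librosa
import soundfile as sf

SR, N_FFT, HOP, N_MELS = 16000, 1024, 256, 80  # assumed analysis settings

def hear_model_focus(y, cam, out_path):
    """Resynthesize only the regions of `y` that the model attends to.

    cam: Grad-CAM heat map in [0, 1] with shape (n_mels, n_frames),
         e.g., the output of the Grad-CAM sketch above as a NumPy array.
    """
    mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT,
                                         hop_length=HOP, n_mels=N_MELS)
    log_mel = librosa.power_to_db(mel)  # the model input

    # Multiply the (min-shifted) log-Mel spectrogram by the heat map, so that
    # regions the model ignores fall to the spectrogram's noise floor.
    lo = log_mel.min()
    masked_db = (log_mel - lo) * cam + lo

    # Invert with Griffin-Lim so the masked spectrogram becomes audible.
    y_masked = librosa.feature.inverse.mel_to_audio(
        librosa.db_to_power(masked_db), sr=SR, n_fft=N_FFT, hop_length=HOP)
    sf.write(out_path, y_masked, SR)
```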
As a result, our best model uses a pre-trained audio neural network (PANN) called CNN14,13 which, through transfer learning, achieves 94.44% accuracy, in line with the accuracy on the same dataset from recent works12 using transformer-based architectures.14
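The transfer-learning setup can be sketched as follows, assuming the publicly released Cnn14 implementation and 16 kHz AudioSet checkpoint from Kong et al.'s PANNs repository; the constructor arguments, checkpoint name, output keys, and learning rate are illustrative and should be verified against that code base, and `train_loader` is an assumed data loader of (waveform, label) batches.

```python
import torch
import torch.nn as nn
from models import Cnn14  # PANNs repository (Kong et al.)

backbone = Cnn14(sample_rate=16000, window_size=1024, hop_size=256,
                 mel_bins=64, fmin=50, fmax=8000, classes_num=527)
ckpt = torch.load("Cnn14_16k_mAP=0.438.pth", map_location="cpu")
backbone.load_state_dict(ckpt["model"])

head = nn.Linear(2048, 2)  # binary COVID-19 / control classifier

optimizer = torch.optim.Adam(
    list(backbone.parameters()) + list(head.parameters()), lr=1e-4)
criterion = nn.CrossEntropyLoss()

for waveforms, labels in train_loader:
    # Cnn14 computes the log-Mel spectrogram internally and returns a
    # 2048-dimensional clip embedding alongside its AudioSet predictions.
    embedding = backbone(waveforms)["embedding"]
    loss = criterion(head(embedding), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```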
This work presents four main contributions:

(i) We present an analysis detailing the features crucial for deep models to detect or rule out COVID-19 in patient and control audios. In the analyzed data, spectrograms contain more discriminative information than the combination of F0, F0-STD, sex, and age. A visual analysis of heat maps generated by Grad-CAM shows that, among F0, F0-STD, sex, and age, the most important feature is F0.

(ii) We present an interpretation of the decisions made by deep models using heat maps and audio synthesis, following an explainable AI approach. Based on the heat maps and audio resynthesis, we formulate a few hypotheses for the factors affecting model decisions, such as (a) the structure of pauses (patients have longer and more frequent pauses than controls); (b) signal energy decreasing faster over time for patients than for controls; and (c) an interplay between syntax and prosody that emerges as boundaries marked by high-energy vowel formants.

(iii) Through manual analysis of the audio signals (using Grad-CAM), we ensure that the deep models focus on the voice (or silent pauses) rather than on environmental noise.

(iv) We demonstrate that models pre-trained on large-scale audio datasets, such as CNN14, can, through transfer learning, achieve accuracies on par with the best previously reported models,12 even when using log-Mel spectrograms as input instead of MFCCs.

