Regarding the training process, we found that noise insertion is important, consistent with previous findings; therefore, we used it in all experiments. Other augmentations, such as Mix-up20 and SpecAugment, did not lead to improvements in the model. On the contrary, accuracy decreased. Transfer learning, on the other hand, proved to be important in this domain, as CNN14 achieved superior results compared to all other models and is comparable to the current state of the art in the literature for this task.
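As an illustration of the noise-insertion step discussed above, the sketch below mixes a ward-noise recording into a speech signal at a randomly drawn signal-to-noise ratio. The file names, SNR range, and use of librosa are assumptions for illustration, not the exact pipeline used in our experiments.

import numpy as np
import librosa

def add_noise(speech, noise, snr_db):
    """Mix `noise` into `speech` at the given signal-to-noise ratio (dB)."""
    # Tile or truncate the noise so it matches the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[:len(speech)]
    # Scale the noise so the mixture reaches the requested SNR.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical file names; "ward_noise.wav" stands for a hospital ward recording.
speech, sr = librosa.load("patient_utterance.wav", sr=16000)
noise, _ = librosa.load("ward_noise.wav", sr=16000)
augmented = add_noise(speech, noise, snr_db=np.random.uniform(5, 20))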
Furthermore, with respect to the training process, we noted in preliminary experiments some variance in the aspects a model can focus on during inference. The structure of pauses, syntactic boundaries, and pretonic syllables, among other factors, may be more or less evidenced by the models after training. This result is expected because artificial neural networks are high-variance, low-bias classifiers with randomized parameter initialization. We observed that transfer learning appears to reduce this variance.
Regarding the qualitative analysis, our first case study indicated that detailed evaluation would be better performed in the spectrograms-only scenario, which allowed for audio resynthesis, improving the process. As a result of this analysis, we can formulate the following hypotheses to explain the obtained variance and understand the data aspects that may play a role in model learning:
(i) H1: Pauses are important clues for detecting COVID-19, since patients tend to make more pauses for breathing than the control group.
(ii) H2: As the air in the lungs decreases, the speaker may begin to lose breath, or the signal energy may begin to decrease. Thus, energy over time can be an important clue (see the sketch after this list).
(iii) H3: An interplay between syntax and prosody is expected to emerge as a boundary marked phonetically by high energy in vowel formants.
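To make H1 and H2 concrete, the sketch below computes frame-wise RMS energy and marks low-energy frames as candidate pauses. The file name, frame size, hop, and threshold are illustrative assumptions, not the values used in our analysis.

import numpy as np
import librosa

# Hypothetical input file; 16 kHz sampling is an assumption.
y, sr = librosa.load("utterance.wav", sr=16000)

# Frame-wise RMS energy (H2: energy trajectory over time).
frame_length, hop_length = 1024, 256
rms = librosa.feature.rms(y=y, frame_length=frame_length, hop_length=hop_length)[0]
times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop_length)

# H1: frames well below the median energy are treated as candidate silent pauses.
threshold = 0.1 * np.median(rms)
is_pause = rms < threshold
print(f"Estimated pause proportion: {is_pause.mean():.2%}")

# H2: a simple linear fit shows whether energy tends to decay along the utterance.
slope = np.polyfit(times, rms, deg=1)[0]
print(f"Energy slope over time: {slope:.3e} (negative = decaying energy)")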
The first hypothesis confirms that deep models use the discrepancy in the structure of pauses between patients and controls, as observed by Fernandes-Svartman et al.18 The second and third hypotheses are newly observed discrepancies that were uncovered by the deep learning models.
Our work also confirms the hypothesis from previous works9,12 that the addition of hospital ward noise, alongside suitable preprocessing steps, prevents the models from making biased decisions in the COVID-19 detection task. Through Grad-CAM analysis, we confirm that deep models focus on the voice (or silent pauses) rather than on environmental noise.
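For reference, the sketch below shows one common way to obtain a Grad-CAM heat map for a spectrogram classifier in PyTorch. The model, target layer, and input shape are placeholders; for a CNN14-style backbone, the target layer would typically be the last convolutional block. This is a generic illustration rather than our exact implementation.

import torch
import torch.nn.functional as F

def grad_cam(model, target_layer, x, class_idx):
    """Return a Grad-CAM heat map for `class_idx`, upsampled to the input size."""
    activations, gradients = [], []
    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))

    logits = model(x)                     # x: (1, 1, n_mels, n_frames) spectrogram
    model.zero_grad()
    logits[0, class_idx].backward()

    h1.remove(); h2.remove()
    acts, grads = activations[0], gradients[0]           # (1, C, H, W)
    weights = grads.mean(dim=(2, 3), keepdim=True)        # channel-wise importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=x.shape[2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze()            # normalized to [0, 1]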
Finally, our best model (CNN14) achieved an accuracy of 94.44%. This result is almost as good as the best models reported in the literature12 and shows that proper use of transfer learning can make log-Mel spectrogram input nearly as efficient as MFCC input.
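For comparison of the two input representations mentioned above, the sketch below extracts a log-Mel spectrogram and MFCCs with librosa. The parameter values (number of Mel bands, FFT size, hop length) are illustrative assumptions, not necessarily those used in our experiments.

import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file

# Log-Mel spectrogram: the input used by CNN14-style models.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=64)
log_mel = librosa.power_to_db(mel, ref=np.max)   # shape: (n_mels, n_frames)

# MFCCs: the alternative representation discussed above.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=1024, hop_length=256)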
5.2. Limitations and future work

In future work, we plan to investigate other audio-related features, such as autocorrelation, jitter, and shimmer. We also intend to investigate the beginning of a sentence. When a speaker starts to produce a sentence, they have more air in their lungs, which decreases as they speak. Some models may focus more on the audio at the beginning, measuring the signal energy, as the initial energy in the audio may provide hints about pulmonary capacity. In addition, we plan to investigate models of related diseases, such as general cases of respiratory insufficiency. Finally, we aim to investigate the variance in model training, identifying factors that are important for model inference and techniques that reduce variance in the learned models (such as transfer learning).
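As a minimal illustration of one of these candidate features, the sketch below computes a normalized frame-level autocorrelation with NumPy. The frame size and lag range are assumptions, and jitter or shimmer extraction would typically rely on a dedicated tool such as Praat.

import numpy as np

def frame_autocorrelation(frame, max_lag):
    """Normalized autocorrelation of one audio frame for lags 1..max_lag."""
    frame = frame - frame.mean()
    denom = np.sum(frame ** 2) + 1e-12
    return np.array([np.sum(frame[:-lag] * frame[lag:]) / denom
                     for lag in range(1, max_lag + 1)])

# Example: a 32 ms frame at 16 kHz, searching lags up to 20 ms (50 Hz).
sr = 16000
frame = np.random.randn(512)          # stands in for a real speech frame
ac = frame_autocorrelation(frame, max_lag=320)
print("Strongest periodicity at lag", int(np.argmax(ac)) + 1, "samples")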
6. Conclusion

This work presents a method for interpretability analysis of audio classification for COVID-19 detection based on CNNs. Our work focuses on explainable AI. We investigated the importance of different features in the training process and generated heat maps to understand the model's reasoning for its predictions.

Regarding the input data, our results show that spectrograms are a suitable representation for COVID-19 detection. F0 appears to be almost as efficient as spectrograms, and the combination of these two inputs led to a small increase in the model performance. Grad-CAM analysis indicates that F0 is a more important feature than F0-STD, sex, and age. Moreover, Grad-CAM and audio resynthesis helped us formulate hypotheses about the factors that determine the model's classification process and confirm that the deep models used do not rely on environmental noise for decision-making. Our best model (CNN14) achieved 94.44% accuracy, on par with the best models in the literature.12
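Since F0 and its standard deviation (F0-STD) are among the inputs discussed above, the sketch below shows one way to estimate them with librosa's pYIN tracker. The pitch range and file name are assumptions for illustration.

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical file

# Frame-wise F0 with pYIN; the range roughly covers adult speech.
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65, fmax=400, sr=sr)

# Scalar features: mean F0 and its standard deviation over voiced frames.
voiced_f0 = f0[~np.isnan(f0)]
f0_mean, f0_std = voiced_f0.mean(), voiced_f0.std()
print(f"F0 = {f0_mean:.1f} Hz, F0-STD = {f0_std:.1f} Hz")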
Acknowledgments

We gratefully acknowledge the support of NVIDIA Corporation with the donation of a GPU used in part of the experiments presented in this research.

Funding

This work was supported by FAPESP grants 2022/16374-6 (MMG), 2020/06443-5 (SPIRA), and 2023/00488-5

