3. Proposed methods

This section describes the proposed KD-based method, which highlights the potential of leveraging teacher networks to achieve significant performance gains in student models. We developed a student network that achieves performance comparable to the teacher network while having roughly 100 times fewer trainable parameters. In addition, we explored the fundamental role of augmentation techniques and loss functions in facilitating knowledge transfer across different distillation pathways, providing new insights into the optimization of model distillation.
3.1. Dataset

As mentioned earlier, we used a publicly available 2D US dataset introduced by Yap et al.,27 referred to as Dataset_A. It consists of 163 breast US images with their manual delineations, each presenting either a cancerous mass or a benign lesion, with a mean image size of 760 × 570 pixels. In our experiments, we created three random splits, each comprising 130 images for the train-validation set and 33 images for testing.
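The repeated random splits can be produced with a few lines of code. The sketch below is a minimal illustration under our own assumptions (the seeding scheme and all function and variable names are ours), not the authors' exact procedure.

import random

def make_splits(image_ids, n_splits=3, n_test=33, seed=0):
    # Repeatedly shuffle the 163 image identifiers and hold out 33 for testing,
    # leaving 130 for training and validation, as described above.
    splits = []
    for i in range(n_splits):
        rng = random.Random(seed + i)  # a different seed per split (assumption)
        ids = list(image_ids)
        rng.shuffle(ids)
        splits.append({"test": ids[:n_test], "trainval": ids[n_test:]})
    return splits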
3.2. Teacher and student models

In our segmentation framework, we employed a U-Net-based architecture for both the student and teacher networks.45 After conducting extensive experimentation with various backbone architectures, including ResNet34, ResNet101, ResNeXt50, and ResNeXt101, we ultimately selected ResNeXt101 as the backbone for our teacher model.76 Among the options tested, ResNeXt101, with 96 million parameters, outperformed the others in both accuracy and robustness. For our student model, we modified MobileNetV3-small-100 so that it has only 0.82 million parameters.77 MobileNetV3-small-100 stood out as the sole model that combined a significantly reduced parameter count with pre-trained weights on the ImageNet dataset. This characteristic was pivotal for our choice, as it allowed us to strike a balance between model complexity and computational efficiency, making it the most suitable candidate to serve as the student model in our KD process. By leveraging distinct encoders tailored to different computational requirements, our approach aims to optimize the distillation of knowledge from teacher to student, achieving a favorable balance between model complexity and performance in our proposed segmentation framework. For both the teacher and student models, we initialized the backbone with pre-trained weights obtained from ImageNet.78
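As a concrete illustration, the two networks could be instantiated as follows. This is a minimal sketch assuming the segmentation_models_pytorch library; the encoder identifiers ("resnext101_32x8d", "timm-mobilenetv3_small_100"), the single input channel, and the binary output are our assumptions, and the authors' additional modifications that bring the student down to 0.82 million parameters are not reproduced here.

import segmentation_models_pytorch as smp

# Teacher: U-Net with a ResNeXt101 encoder (~96 million parameters),
# initialized with ImageNet pre-trained weights.
teacher = smp.Unet(
    encoder_name="resnext101_32x8d",   # assumed ResNeXt101 variant
    encoder_weights="imagenet",
    in_channels=1,                     # grayscale US images (assumption)
    classes=1,                         # binary lesion mask
)

# Student: U-Net with a MobileNetV3-small-100 encoder, also ImageNet-initialized.
student = smp.Unet(
    encoder_name="timm-mobilenetv3_small_100",
    encoder_weights="imagenet",
    in_channels=1,
    classes=1,
)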
3.3. KD paths

In our proposed KD paths, we explored two distinct strategies: distilling knowledge either from the final predictions or from the hidden representations, i.e., the output of the teacher model’s encoder. This approach allows us to examine which source of knowledge transfer is most relevant. By incorporating these alternative pathways, we ensure that the student model can effectively learn from either the teacher’s output or the intricate feature representations captured by its encoder. As illustrated in Figure 1, our approach delineates three primary pathways for distilling knowledge from the teacher to the student model: KD-Logits (L), KD-Hidden (H), and KD-HiddenRegressor (HR).
3.3.1. KD-Logits (L)

In this distillation pathway, our objective is to transfer knowledge in the form of the final predicted logits from the teacher to the student model. Logits are the raw predictions produced by the teacher model before any activation function is applied, offering a view of the model’s confidence scores across the different classes or categories. By distilling these logits, the student model gains access to valuable information about the teacher’s level of uncertainty, enabling a more nuanced optimization process than can be achieved with only a binary training signal. The design of KD-Logits (L) is shown in Figure 1A.
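A minimal sketch of this pathway is given below: the student is trained against both the ground-truth mask and the teacher’s raw per-pixel logits. The mean-squared-error distillation term and the weight alpha are our assumptions, as the section does not specify the distillation loss.

import torch
import torch.nn.functional as F

def kd_logits_loss(student_logits, teacher_logits, target_mask, alpha=0.5):
    # Supervised term: binary cross-entropy against the ground-truth mask.
    supervised = F.binary_cross_entropy_with_logits(student_logits, target_mask)
    # Distillation term: match the teacher's raw logits (no activation applied).
    distill = F.mse_loss(student_logits, teacher_logits)
    return (1 - alpha) * supervised + alpha * distill

# Usage: the teacher runs in inference mode; only the student is updated.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# loss = kd_logits_loss(student(images), teacher_logits, masks)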
3.3.2. KD-Hidden (H)

In this pathway, as illustrated in Figure 1B, we aim to distill knowledge from the output of the teacher’s encoder, specifically focusing on the hidden features. These hidden features encapsulate rich representations of the input data captured at various levels of abstraction within the teacher’s architecture. However, a challenge arises from the discrepancy in the dimensions of the hidden features between the teacher and student models. To address this, we adjust the size of the hidden features to ensure compatibility between the two models. Specifically, since the number of channels in the teacher’s hidden features may differ from that of the student’s, we harmonize their dimensions by taking the average over the channels (denoted as K in Figure 1B) of both sets of hidden features. This normalization aligns the representations of the two models, enabling a seamless transfer of knowledge and effective learning by the student. Moreover, this method ensures that the student model can benefit from the comprehensive insights encoded within the teacher’s hidden features. The hidden feature sizes of the teacher and student are denoted as B × C_t × H × W and B × C_s × H × W, respectively, where B is the batch size, C_t and C_s are the numbers of channels in the teacher and student models, H is the height, and W is the width.
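The channel-averaging step can be sketched as follows: both feature maps are reduced to B × 1 × H × W by averaging over their channel dimensions, after which a pointwise loss is applied. The mean-squared-error loss and the spatial-resizing fallback are our assumptions; the section specifies only the channel averaging.

import torch.nn.functional as F

def kd_hidden_loss(student_feat, teacher_feat):
    # student_feat: B x C_s x H x W, teacher_feat: B x C_t x H x W.
    # Averaging over channels (K in Figure 1B) removes the C_s/C_t mismatch.
    s = student_feat.mean(dim=1, keepdim=True)  # B x 1 x H x W
    t = teacher_feat.mean(dim=1, keepdim=True)  # B x 1 x H x W
    if s.shape[-2:] != t.shape[-2:]:
        # Assumption: resize spatially if the two encoders disagree on H and W.
        s = F.interpolate(s, size=t.shape[-2:], mode="bilinear", align_corners=False)
    return F.mse_loss(s, t.detach())  # teacher features receive no gradient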

