Artificial Intelligence in Health Efficient knowledge distillation for breast US
[Figure 2 image: panels A-O]
Figure 2. Visual comparison of our ablation study. The original test image, the prediction of the teacher model, and the prediction of the unsupervised
student model are shown in (A-C), respectively. The predicted segmentations of the proposed KD-based models are shown in (D-O). Green contours
represent the ground truth mask, while the red contours illustrate the corresponding predictions.
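
The ablation in this section scores each model by DSC against the ground-truth mask and compares students distilled with MSE and KLD losses. The sketch below illustrates one way these three quantities can be computed. It is not the authors' implementation: the function names, the percent scaling of DSC, the softmax over a trailing class axis, the `temperature` parameter, the KL direction (teacher relative to student), and the mean reduction are all assumptions made for illustration.

```python
import numpy as np


def dice_score(pred, target):
    """Dice similarity coefficient, in percent, between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumption: two empty masks count as perfect agreement.
        return 100.0
    return 100.0 * 2.0 * intersection / total


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def kd_mse(student_logits, teacher_logits):
    """MSE distillation loss: mean squared difference of the raw outputs."""
    return float(np.mean((student_logits - teacher_logits) ** 2))


def kd_kld(student_logits, teacher_logits, temperature=1.0, eps=1e-12):
    """KLD distillation loss: KL(teacher || student) over the class axis,
    averaged over all pixels. Arrays are assumed to have shape (..., classes)."""
    p_teacher = softmax(teacher_logits / temperature)
    p_student = softmax(student_logits / temperature)
    kl = np.sum(
        p_teacher * (np.log(p_teacher + eps) - np.log(p_student + eps)),
        axis=-1,
    )
    return float(kl.mean())


# Hypothetical data: a 4x4 image with two classes (background, lesion).
rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 4, 2))
student = teacher + 0.5 * rng.normal(size=(4, 4, 2))

print("MSE loss:", kd_mse(student, teacher))
print("KLD loss:", kd_kld(student, teacher))
print("DSC (%):", dice_score(student.argmax(-1), teacher.argmax(-1)))
```

MSE penalizes differences in the raw outputs directly, whereas KLD compares the normalized class distributions, so the two losses can rank the same student differently even when the resulting DSC values are close.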
involves exchanging knowledge between hidden features, the top DSC of 79.00 was achieved by the H_KLD model. Similarly, in KD (Hidden-Regressor), where knowledge passes from hidden features through a regressor model, the highest DSC of 79.00 was reached by the HReg_KLD model. These findings collectively suggest that all proposed KD paths perform comparably, improving student performance by approximately 9%. This consistency underscores the robustness and versatility of the proposed KD paths and demonstrates their effectiveness in knowledge exchange between teacher and student.

5.1.2. Effect of KD loss function
Further analysis of the results presented in Table 4 shows that both the MSE and KLD loss functions are effective for knowledge transfer. Across the KD pathways, including KD (Logits), KD (Hidden), and KD (Hidden-Regressor), the DSC reveals a consistent pattern in which both loss functions perform similarly. In the KD (Logits) pathway, for instance, the DSCs achieved by L_MSE and L_KLD, namely 77.75 and 78.00, respectively, show that L_KLD performs marginally better. The other KD pathways, KD (Hidden) and KD (Hidden-Regressor), show a similar pattern between MSE and KLD. This slight advantage of KLD in average DSC suggests that KLD can be a preferable choice, indicating that the selection between MSE and KLD loss functions could notably influence the effectiveness of knowledge transfer in KD.

5.1.3. Effect of augmentation
Exploring the impact of weak augmentation on the teacher model offers further insight into the KD process. Using weak augmentation for the teacher did not yield a significant impact on performance: models with and without weak augmentation for the teacher demonstrated comparable performance. Despite the negligible effect of weak augmentation on the teacher model, all models incorporating teacher guidance improved over students trained without such supervision. This observation demonstrates the fundamental role of the teacher network in guiding and enhancing the learning process of the student network. While weak augmentation may not directly influence the performance of the teacher model, its presence facilitates the extraction and transfer of valuable knowledge, thereby contributing to the overall improvement in student performance.

5.2. Results with respect to SOTA methods
In this section, we compare our best model with SOTA models that used the same dataset as our study, as outlined in Table 5. It is important to emphasize that none of these SOTA models has provided access to either its codebase or its training and testing splits. Consequently, our comparison is based solely on
Volume 2 Issue 2 (2025) 81 doi: 10.36922/aih.3509

