
Artificial Intelligence in Health                                 Efficient knowledge distillation for breast US



  In all of our experiments, we standardized the image size to 224 × 224. Given the limited size of our training dataset, we used 3-fold cross-validation to showcase the generalizability of the models. Initially, we trained both the teacher and student models separately to establish our baseline performance. Throughout this manuscript, “Teacher” (with a capital T) refers to the predictions of the selected teacher model, while “Student” (with a capital S) refers to the predictions of the student model trained from the dataset alone. All KD-based supervised student models are referred to by their “Model Notation” as defined in Table 3.

  For optimization, we employed the Adam optimizer with a learning rate of 10⁻⁴. The batch size was set to 64, and training was conducted for 500 epochs. To prevent overfitting, we implemented early stopping and terminated training if the validation loss failed to improve for 50 consecutive epochs. Model checkpoints were saved based on the best validation loss achieved during training. For performance evaluation, we utilized the DSC, which quantifies the overlap between the predicted and ground-truth segmentation masks and captures both the precision and recall aspects of model performance. DSC is calculated as follows:

\[
\mathrm{DSC} = \frac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} \left( y_i + \hat{y}_i \right)} \tag{V}
\]

where N, i, yᵢ, and ŷᵢ represent the total number of samples, one sample, the ground-truth label, and the predicted label, respectively.

5. Results

In this section, we delve into the obtained outcomes and analyze the implications of each suggested KD pathway. Additionally, we assess the effect of MSE and KLD loss functions on knowledge transfer from teacher to student. Furthermore, we examine the impact of augmentation on the teacher model. Finally, we compare the best-performing KD paths with SOTA methods that utilized the same dataset. It is worth noting that none of these SOTA methods have publicly disclosed their training and testing splits, nor have they shared their codes. As a result, we can only report the results as presented in their respective papers.

  The Dice similarity scores, averaged over a 3-fold cross-validation, are summarized in Table 4. The proposed L_KLD_WAug model achieved the highest DSC of 80.00 ± 18.86. The statistical analysis in Table 4 illustrates the significance of the differences between the DSC of the proposed models compared to Teacher and Student. For instance, the DSC for the L_KLD_WAug model does not significantly differ from that of the Teacher model, with P = 0.2075. This validates that the L_KLD_WAug model performs similarly to the Teacher model. Conversely, it significantly outperforms the Student model, with P = 0.0012.

Table 4. Experimental Dice similarity scores: Average over 3-fold cross-validation

Model notations    DSC (%) (Mean±std)   P-value w.r.t.¹ Teacher   P-value w.r.t.¹ Student
Teacher            81.50±18.40          -                         0.0048**
Student            73.16±23.78          0.0048**                  -
L_MSE              77.75±22.55          0.1414                    0.0227*
L_MSE_WAug         77.52±17.61          0.1565                    0.0948
L_KLD              78.00±22.05          0.0950                    0.0037**
L_KLD_WAug         80.00±18.86          0.2075                    0.0012**
H_MSE              77.50±21.49          0.0645                    0.0215*
H_MSE_WAug         77.85±20.06          0.0341*                   0.0125*
H_KLD              79.00±20.45          0.1320                    0.0067**
H_KLD_WAug         78.50±19.83          0.1417                    0.0316*
HReg_MSE           78.06±19.25          0.0325*                   0.0201*
HReg_MSE_WAug      77.63±19.97          0.0402*                   0.0269*
HReg_KLD           79.00±20.37          0.2132                    0.0021**
HReg_KLD_WAug      79.31±19.33          0.1247                    0.0068**

Notes: w.r.t.: with respect to. ¹* and ** denote a statistically significant difference with P<0.05 and P<0.01, respectively. The best DSC across all the proposed networks is that of L_KLD_WAug.
Abbreviation: DSC: Dice similarity score.

5.1. Ablation study

In this section, we conduct an ablation study and analyze the results from various perspectives. As presented in Table 4, the proposed KD paths consistently exhibit performance closely aligned with that of the teacher model. This is evidenced by their P-values relative to the teacher’s predictions, which generally do not demonstrate significant differences. However, these KD paths typically reveal significantly better performance when compared to the student model, as indicated by their respective P-values. A visualization example of our ablation study is presented in Figure 2.

5.1.1. Effect of KD paths

By investigating the performance evaluation of the KD paths, it becomes evident that each pathway showcases noteworthy achievements in enhancing student performance. In the KD (Logits) path, where knowledge is transferred between the logits of the teacher and student, the highest DSC of 80.00 was attained by the L_KLD_WAug model. Moving to KD (Hidden), which
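The early-stopping schedule described above (patience of 50 epochs on the validation loss, checkpointing at the best loss) can be sketched generically in Python. This is an illustration of the described procedure, not the authors' code; the function and parameter names are our own, and `step` stands in for one training epoch returning the validation loss:

```python
def train_with_early_stopping(step, max_epochs=500, patience=50):
    """Generic early-stopping loop (a sketch of the described schedule).

    `step(epoch)` is a user-supplied callable that runs one epoch and
    returns the validation loss. Training stops when the loss has not
    improved for `patience` consecutive epochs.
    """
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch in range(max_epochs):
        val_loss = step(epoch)
        if val_loss < best_loss:
            # New best validation loss: save a checkpoint here and reset patience.
            best_loss, best_epoch, wait = val_loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_loss, best_epoch
```

In practice the checkpointing branch would also serialize the model weights; only the best-loss checkpoint is kept, matching the selection rule described in the text.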
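The DSC of Equation (V) can be sketched in plain Python for flat binary masks. This is a minimal illustration, not the evaluation code used in the experiments; the small `eps` guard against division by zero is our own addition and is not part of Equation (V):

```python
def dice_score(y_true, y_pred, eps=1e-7):
    """Dice similarity coefficient (Equation V) for flat 0/1 masks.

    y_true, y_pred: equal-length sequences of binary labels.
    eps (our addition) avoids division by zero when both masks are empty.
    """
    assert len(y_true) == len(y_pred), "masks must have the same length"
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    total = sum(y_true) + sum(y_pred)
    return (2.0 * intersection + eps) / (total + eps)

# Identical masks give a score of 1.0; disjoint masks give ~0.0.
```

For 2D segmentation masks, one would flatten the predicted and ground-truth arrays before applying the same formula.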
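The two transfer losses compared throughout the results (MSE between raw logits vs. KL divergence between softened distributions) can be sketched in plain Python. This is a schematic illustration under our own assumptions: the temperature parameter `T`, the mean reduction in the MSE, and operating on a single logit vector are not details taken from the paper:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mse_loss(teacher_logits, student_logits):
    # Mean squared error between raw logits (the "MSE" transfer option).
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n

def kld_loss(teacher_logits, student_logits, T=1.0):
    # KL(teacher || student) between softened distributions (the "KLD" option).
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Both losses are zero when the student reproduces the teacher's logits exactly; the KLD variant compares normalized distributions, so it is invariant to a constant shift of the logits, whereas the MSE variant is not.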


            Volume 2 Issue 2 (2025)                         80                               doi: 10.36922/aih.3509