
Artificial Intelligence in Health                                 Efficient knowledge distillation for breast US



  In all of our experiments, we standardized the image size to 224 × 224. Given the limited size of our training dataset, we used 3-fold cross-validation to showcase the generalizability of the models. Initially, we trained both the teacher and student models separately to establish our baseline performance. Throughout this manuscript, “Teacher” (with a capital T) refers to the predictions of the selected teacher model, while “Student” (with a capital S) refers to the predictions of the student model trained from the dataset alone. All KD-based supervised student models are referred to by their “Model Notation” as defined in Table 3.

  For optimization, we employed the Adam optimizer with a learning rate of 10⁻⁴. The batch size was set to 64, and training was conducted for 500 epochs. To prevent overfitting, we implemented early stopping and terminated training if the validation loss failed to improve for 50 consecutive epochs. Model checkpoints were saved based on the best validation loss achieved during training. For performance evaluation, we utilized the DSC, which quantifies the overlap between the predicted and ground-truth segmentation masks and captures both the precision and recall aspects of model performance. DSC is calculated as follows:

\[
\mathrm{DSC} = \frac{2\sum_{i=1}^{N} y_i \hat{y}_i}{\sum_{i=1}^{N} \left( y_i + \hat{y}_i \right)} \tag{V}
\]

where N, i, yᵢ, and ŷᵢ represent the total number of samples, one sample, the ground-truth label, and the predicted label, respectively.

5. Results

In this section, we delve into the obtained outcomes and analyze the implications of each suggested KD pathway. Additionally, we assess the effect of MSE and KLD loss functions on knowledge transfer from teacher to student. Furthermore, we examine the impact of augmentation on the teacher model. Finally, we compare the best-performing KD paths with SOTA methods that utilized the same dataset. It is worth noting that none of these SOTA methods have publicly disclosed their training and testing splits, nor have they shared their codes. As a result, we can only report the results as presented in their respective papers.

  The Dice similarity scores, averaged over a 3-fold cross-validation, are summarized in Table 4. The proposed L_KLD_WAug model achieved the highest DSC of 80.00 ± 18.86. The statistical analysis in Table 4 illustrates the significance of the differences between the DSC of the proposed models compared to Teacher and Student. For instance, the DSC for the L_KLD_WAug model does not significantly differ from that of the Teacher model, with P = 0.2075. This validates that the L_KLD_WAug model performs similarly to the Teacher model. Conversely, it significantly outperforms the Student model, with P = 0.0012.

Table 4. Experimental Dice similarity scores: Average over 3-fold cross-validation

Model notations    DSC (%) (Mean±std)   P-value w.r.t.¹ Teacher   P-value w.r.t.¹ Student
Teacher            81.50±18.40          -                         0.0048**
Student            73.16±23.78          0.0048**                  -
L_MSE              77.75±22.55          0.1414                    0.0227*
L_MSE_WAug         77.52±17.61          0.1565                    0.0948
L_KLD              78.00±22.05          0.0950                    0.0037**
L_KLD_WAug         80.00±18.86          0.2075                    0.0012**
H_MSE              77.50±21.49          0.0645                    0.0215*
H_MSE_WAug         77.85±20.06          0.0341*                   0.0125*
H_KLD              79.00±20.45          0.1320                    0.0067**
H_KLD_WAug         78.50±19.83          0.1417                    0.0316*
HReg_MSE           78.06±19.25          0.0325*                   0.0201*
HReg_MSE_WAug      77.63±19.97          0.0402*                   0.0269*
HReg_KLD           79.00±20.37          0.2132                    0.0021**
HReg_KLD_WAug      79.31±19.33          0.1247                    0.0068**

Notes: w.r.t.: with respect to. ¹* and ** denote a statistically significant difference with P<0.05 and P<0.01, respectively. The best DSC across all the proposed networks is that of L_KLD_WAug.
Abbreviation: DSC: Dice similarity score.

5.1. Ablation study

In this section, we conduct an ablation study and analyze the results from various perspectives. As presented in Table 4, the proposed KD paths consistently exhibit performance closely aligned with that of the teacher model. This is evidenced by their P-values relative to the teacher’s predictions, which generally do not demonstrate significant differences. However, these KD paths typically reveal significantly better performance when compared to the student model, as indicated by their respective P-values. A visualization example of our ablation study is presented in Figure 2.

5.1.1. Effect of KD paths

By investigating the performance evaluation of the KD paths, it becomes evident that each pathway showcases noteworthy achievements in enhancing student performance. In the KD (Logits) path, where knowledge is transferred between the logits of the teacher and student, the highest DSC of 80.00 was attained by the L_KLD_WAug model. Moving to KD (Hidden), which
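The early-stopping schedule described above (patience of 50 epochs on the validation loss, checkpointing at the best loss) can be sketched generically in Python. This is an illustration of the described procedure, not the authors' code; the function and parameter names are our own, and `step` stands in for one training epoch returning the validation loss:

```python
def train_with_early_stopping(step, max_epochs=500, patience=50):
    """Generic early-stopping loop (a sketch of the described schedule).

    `step(epoch)` is a user-supplied callable that runs one epoch and
    returns the validation loss. Training stops when the loss has not
    improved for `patience` consecutive epochs.
    """
    best_loss, best_epoch, wait = float("inf"), -1, 0
    for epoch in range(max_epochs):
        val_loss = step(epoch)
        if val_loss < best_loss:
            # New best validation loss: save a checkpoint here and reset patience.
            best_loss, best_epoch, wait = val_loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_loss, best_epoch
```

In practice the checkpointing branch would also serialize the model weights; only the best-loss checkpoint is kept, matching the selection rule described in the text.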
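The DSC of Equation (V) can be sketched in plain Python for flat binary masks. This is a minimal illustration, not the evaluation code used in the experiments; the small `eps` guard against division by zero is our own addition and is not part of Equation (V):

```python
def dice_score(y_true, y_pred, eps=1e-7):
    """Dice similarity coefficient (Equation V) for flat 0/1 masks.

    y_true, y_pred: equal-length sequences of binary labels.
    eps (our addition) avoids division by zero when both masks are empty.
    """
    assert len(y_true) == len(y_pred), "masks must have the same length"
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    total = sum(y_true) + sum(y_pred)
    return (2.0 * intersection + eps) / (total + eps)

# Identical masks give a score of 1.0; disjoint masks give ~0.0.
```

For 2D segmentation masks, one would flatten the predicted and ground-truth arrays before applying the same formula.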
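The two transfer losses compared throughout the results (MSE between raw logits vs. KL divergence between softened distributions) can be sketched in plain Python. This is a schematic illustration under our own assumptions: the temperature parameter `T`, the mean reduction in the MSE, and operating on a single logit vector are not details taken from the paper:

```python
import math

def softmax(logits, T=1.0):
    # Temperature-scaled softmax over a list of logits.
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mse_loss(teacher_logits, student_logits):
    # Mean squared error between raw logits (the "MSE" transfer option).
    n = len(teacher_logits)
    return sum((t - s) ** 2 for t, s in zip(teacher_logits, student_logits)) / n

def kld_loss(teacher_logits, student_logits, T=1.0):
    # KL(teacher || student) between softened distributions (the "KLD" option).
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

Both losses are zero when the student reproduces the teacher's logits exactly; the KLD variant compares normalized distributions, so it is invariant to a constant shift of the logits, whereas the MSE variant is not.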


            Volume 2 Issue 2 (2025)                         80                               doi: 10.36922/aih.3509