3. Proposed methods

This section describes the proposed KD-based method, which highlights the potential of leveraging teacher networks to achieve significant performance gains in student models. We developed a student network that achieves performance comparable to the teacher network while having a remarkable 100 times fewer trainable parameters. Additionally, we explored the fundamental role of augmentation techniques and loss functions in facilitating knowledge transfer across different distillation pathways, providing new insights into the optimization of model distillation processes.
3.1. Dataset

As mentioned earlier, we used a publicly available 2D US dataset introduced by Yap et al.,²⁷ referred to as Dataset_A. It consists of 163 breast US images with their manual delineations, each presenting either a cancerous mass or a benign lesion, with a mean image size of 760 × 570 pixels. In our experiments, we created three random splits of 130 images for the train-validation set and 33 images for testing.
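The splitting procedure can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions (the image identifiers and seed values are placeholders), not the authors' code:

```python
import random

def make_splits(image_ids, n_splits=3, n_test=33, seeds=(0, 1, 2)):
    """Create three random 130/33 train-validation/test splits of Dataset_A.

    `image_ids` lists the 163 breast US image identifiers; the seed values
    are illustrative, not those used in the paper.
    """
    splits = []
    for seed in seeds[:n_splits]:
        ids = list(image_ids)
        random.Random(seed).shuffle(ids)
        splits.append({"trainval": ids[n_test:], "test": ids[:n_test]})
    return splits

# Usage: Dataset_A has 163 images, giving 130 train-validation / 33 test.
splits = make_splits([f"img_{i:03d}" for i in range(163)])
assert len(splits[0]["trainval"]) == 130 and len(splits[0]["test"]) == 33
```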
3.2. Teacher and student models

In our segmentation framework, we employed a U-Net-based architecture⁴⁵ for both the student and teacher networks. After extensive experimentation with various backbone architectures, including ResNet34, ResNet101, ResNeXt50, and ResNeXt101, we ultimately selected ResNeXt101⁷⁶ as the backbone for our teacher model. Among the options tested, ResNeXt101, with 96 million parameters, outperformed the others in terms of accuracy and robustness. For our student model, we modified MobileNetV3-small-100⁷⁷ so that it had only 0.82 million parameters. MobileNetV3-small-100 stood out as the sole model with a significantly reduced parameter count that still had weights pre-trained on the ImageNet dataset. This characteristic was pivotal for our choice, as it allowed us to strike a balance between model complexity and computational efficiency, making it the most suitable candidate to serve as the student model in our KD process. By leveraging distinct encoders tailored to computational requirements, our approach optimizes the distillation of knowledge from teacher to student, achieving a favorable balance between model complexity and performance in our proposed segmentation framework. For both the teacher and student models, we initialized the backbone with pre-trained weights obtained from ImageNet.⁷⁸
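A minimal sketch of how such teacher and student networks could be instantiated, assuming the segmentation_models_pytorch library; the specific encoder variant (resnext101_32x8d), the single input channel, and the single output class are our assumptions, not details given in the paper:

```python
import segmentation_models_pytorch as smp

# Teacher: U-Net with a ResNeXt101 encoder pre-trained on ImageNet.
# "resnext101_32x8d" is an assumed variant; the paper only says ResNeXt101.
teacher = smp.Unet(
    encoder_name="resnext101_32x8d",
    encoder_weights="imagenet",
    in_channels=1,   # grayscale breast US input (assumption)
    classes=1,       # binary lesion mask
)

# Student: U-Net with a MobileNetV3-small-100 encoder (via timm),
# also initialized from ImageNet weights.
student = smp.Unet(
    encoder_name="timm-mobilenetv3_small_100",
    encoder_weights="imagenet",
    in_channels=1,
    classes=1,
)

def n_params(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# The paper reports ~96 M (teacher) vs. ~0.82 M (modified student) parameters.
print(f"teacher: {n_params(teacher)/1e6:.1f} M, student: {n_params(student)/1e6:.2f} M")
```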
3.3. KD paths

In our proposed KD paths, we explored two distinct strategies: distilling knowledge either from the final predictions or from the hidden representations, i.e., the output of the teacher model's encoder. This approach allows us to examine the most relevant source of knowledge transfer. By incorporating these alternative pathways, we ensure that the student model can effectively learn from either the teacher's output or the intricate feature representations captured by its encoder. As illustrated in Figure 1, our approach delineates three primary pathways for distilling knowledge from the teacher to the student model: KD-Logits (L), KD-Hidden (H), and KD-HiddenRegressor (HR).

3.3.1. KD-Logits (L)

In this distillation pathway, our objective is to transfer knowledge in the form of the final predicted logits from the teacher to the student model. Logits are the raw predictions produced by the teacher model before any activation function is applied, offering a view of the model's confidence scores across different classes or categories. By distilling these logits, the student model gains access to valuable information about the teacher's level of uncertainty, enabling a more nuanced optimization process than can be achieved with only a binary training signal. The design of KD-Logits (L) is shown in Figure 1A.
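One way this pathway could be realized is sketched below in PyTorch. The combination of a binary cross-entropy term with an MSE term on temperature-softened logits, as well as the values of T and lam, are our assumptions rather than the paper's exact loss, which is analyzed separately:

```python
import torch
import torch.nn.functional as F

def kd_logits_loss(student_logits, teacher_logits, target_mask, T=2.0, lam=0.5):
    """KD-Logits (L): supervise the student with both the ground-truth mask
    and the teacher's raw per-pixel logits.

    All tensors are B x 1 x H x W. T (temperature) softens the teacher signal
    and lam balances the two terms; both values here are illustrative.
    """
    # Hard-label term: ordinary binary cross-entropy against the mask.
    hard = F.binary_cross_entropy_with_logits(student_logits, target_mask)
    # Soft-label term: match the teacher's temperature-softened probabilities.
    soft = F.mse_loss(torch.sigmoid(student_logits / T),
                      torch.sigmoid(teacher_logits / T))
    return (1 - lam) * hard + lam * soft
```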
3.3.2. KD-Hidden (H)

In this pathway, illustrated in Figure 1B, we aim to distill knowledge from the output of the teacher's encoder, specifically focusing on the hidden features. These hidden features encapsulate rich representations of the input data, captured at various levels of abstraction within the teacher's architecture. However, a challenge arises from discrepancies in the dimensions of the hidden features between the teacher and student models. To address this, we adjust the size of the hidden features to ensure compatibility between the two models. Specifically, since the number of channels in the teacher's hidden features may differ from that of the student's, we harmonize their dimensions by taking the average over the channels (denoted as K in Figure 1B) of both sets of hidden features. This normalization facilitates a seamless transfer of knowledge, aligning the representations of the two models and enabling effective learning by the student. Moreover, this method ensures that the student model can benefit from the comprehensive insights encoded in the teacher's hidden features. The hidden feature sizes of the teacher and the student are denoted as B × Cₜ × H × W and B × Cₛ × H × W, respectively, where B is the batch size, Cₜ and Cₛ are the numbers of channels in the teacher and student models, H is the height, and W is the width.
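A minimal PyTorch sketch of this channel-averaging alignment follows; using MSE to compare the averaged maps, and resizing when the spatial sizes happen to differ, are our assumptions:

```python
import torch
import torch.nn.functional as F

def kd_hidden_loss(student_hidden, teacher_hidden):
    """KD-Hidden (H): align encoder features whose channel counts differ.

    student_hidden: B x C_s x H x W, teacher_hidden: B x C_t x H x W.
    Averaging over the channel axis (the K operation in Figure 1B) maps both
    tensors to B x 1 x H x W, so they can be compared directly.
    """
    s = student_hidden.mean(dim=1, keepdim=True)  # B x 1 x H x W
    t = teacher_hidden.mean(dim=1, keepdim=True)  # B x 1 x H x W
    if s.shape[-2:] != t.shape[-2:]:
        # Fallback if spatial sizes differ (assumption; the paper states
        # matching H and W for teacher and student).
        t = F.interpolate(t, size=s.shape[-2:], mode="bilinear",
                          align_corners=False)
    # MSE between the channel-averaged maps (illustrative matching loss).
    return F.mse_loss(s, t)
```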