the textual prompt. As mentioned above, $V'_i$ represents the adapted feature maps extracted from the $i$-th attention block, and the textual prompt is $T \in \mathbb{R}^{B \times L \times C}$.

The cross-attention process can be formally expressed as follows (Equation VI). For each hierarchical feature $F'_i$, we computed the attention scores $A_i$ with respect to the text $T$:

$$A_i = \operatorname{softmax}\left(\frac{Q_i K^{\top}}{\sqrt{d_k}}\right) \tag{VI}$$

where $Q_i \in \mathbb{R}^{B \times D_i H_i W_i \times C}$ are the queries derived from $F'_i$, and $K \in \mathbb{R}^{B \times L \times C}$ are the keys derived from the textual prompt $T$. The dimensionality $d_k$ is the size of the keys and acts as a scaling factor that keeps gradients stable during training. The attention output $O_i$ for each feature block can then be computed as in Equation VII:

$$O_i = A_i V_i \tag{VII}$$

where $V_i$ denotes the values corresponding to $F'_i$, dimensioned similarly to $F'_i$. The final output of the cross-attention mechanism can be represented as in Equation VIII:

$$O = [O_1, O_2, \ldots, O_N] \in \mathbb{R}^{B \times D \times H \times W \times C} \tag{VIII}$$

resulting in a combined output that integrates visual and textual information across multiple layers. This enriched representation was then used as a cross-modal prompt in the subsequent stages of SAM's prompt encoder, effectively bridging the gap between visual features and the semantic understanding derived from the text.
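To make the mechanism concrete, the following PyTorch sketch implements Equations VI and VII for a single feature block. The projection layers, the shared channel width $C$, and the choice to derive the values from the textual prompt (the standard cross-attention arrangement, which makes the shapes in Equation VII consistent) are assumptions of this illustration, not details fixed by the paper:

```python
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """Sketch of Equations VI-VII: visual queries attend to textual keys/values."""

    def __init__(self, c: int):
        super().__init__()
        self.q_proj = nn.Linear(c, c)  # queries Q_i from visual features F'_i
        self.k_proj = nn.Linear(c, c)  # keys K from the textual prompt T
        self.v_proj = nn.Linear(c, c)  # values (taken from T here; an assumption)

    def forward(self, feat_i: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat_i: (B, N_i, C) with N_i = D_i * H_i * W_i; text: (B, L, C)
        q, k, v = self.q_proj(feat_i), self.k_proj(text), self.v_proj(text)
        d_k = k.size(-1)
        a = torch.softmax(q @ k.transpose(-2, -1) / d_k ** 0.5, dim=-1)  # Eq. VI
        return a @ v  # Eq. VII: O_i = A_i V_i, shape (B, N_i, C)
```

Per Equation VIII, the per-block outputs $O_1, \ldots, O_N$ are then gathered into a single cross-modal prompt tensor.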
3.4. Lightweight mask decoder

The original SAM mask decoder comprises merely two transformer layers, two transposed convolution layers, and a multilayer perceptron layer. In the context of 3D medical image processing, we replaced the 2D convolutions with 3D convolutions to enable direct 3D mask generation. Given that many anatomical structures or lesions in medical images are relatively small, higher-resolution outputs are often necessary to distinguish the segmented elements reliably.

In the image encoder of SAM, the patch embedding process of the transformer backbone embeds each 16 × 16 patch into a feature vector, down-sampling the input by a factor of 16 along each spatial dimension. The SAM mask decoder employs two consecutive transposed convolution layers to up-sample the feature map by a factor of four, so the final prediction generated by SAM still has a resolution four times lower than the original input. To address this problem, we employed progressive up-sampling, making moderate adjustments to the SAM decoder by integrating two additional transposed convolution operations. With each layer up-sampling the feature maps by a factor of two, the four transposed convolutional layers progressively restore the feature maps to the original input resolution.
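A minimal sketch of this progressive up-sampling path, assuming a PyTorch implementation; the channel schedule and activation are illustrative, since the paper does not specify them:

```python
import torch
import torch.nn as nn


class ProgressiveUpsampler(nn.Module):
    """Four stride-2 transposed 3D convolutions undo the 16x patch-embedding
    down-sampling (2 ** 4 = 16)."""

    def __init__(self, channels=(256, 128, 64, 32, 16)):  # assumed widths
        super().__init__()
        layers = []
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            layers.append(nn.ConvTranspose3d(c_in, c_out, kernel_size=2, stride=2))
            layers.append(nn.GELU())  # assumed activation
        self.up = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 256, D/16, H/16, W/16) -> (B, 16, D, H, W)
        return self.up(x)
```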
In addition, we introduced a multilayer aggregation mechanism, designing a network akin to a "U-shaped" architecture. We combined intermediate feature maps from stages 1–4 of the image encoder with the prompts generated during the cross-modal reference prompt generation phase to enrich the mask features. After up-sampling the mask feature map to the original resolution, we concatenated it with the original image and applied another 3D convolution to fuse the information and generate the final mask, thereby better leveraging information available at the original resolution.
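The final fusion step might look like the following sketch; the channel counts (16 mask-feature channels, a single-channel image) and the kernel size are assumptions:

```python
import torch
import torch.nn as nn

# Concatenate full-resolution mask features with the original volume along
# the channel axis, then fuse with one 3D convolution to produce the mask.
fuse = nn.Conv3d(16 + 1, 1, kernel_size=3, padding=1)

up_feats = torch.randn(1, 16, 64, 128, 128)  # up-sampled mask features (assumed shape)
image = torch.randn(1, 1, 64, 128, 128)      # original CT/MRI volume
mask_logits = fuse(torch.cat([up_feats, image], dim=1))  # (1, 1, 64, 128, 128)
```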
4. Experiments

4.1. Experimental setup

We conducted a comprehensive evaluation of our segmentation method across four medical image segmentation tasks spanning three distinct imaging modalities: CT-based tumor segmentation, MRI-based cardiac segmentation, and multi-organ segmentation from multi-modal datasets. Our approach was rigorously compared against state-of-the-art methods on the CT imaging tasks. In addition, we assessed its performance on the MRI cardiac and multi-organ segmentation tasks, providing a thorough analysis of its generalization capabilities, and conducted an in-depth ablation study to elucidate the contributions of its constituent components.
4.1.1. Datasets

The kidney tumor segmentation (KiTS21) dataset^43 is a comprehensive collection designed for the segmentation of kidneys, tumors, and cysts in CT imaging. It comprises 300 publicly available training cases and 100 withheld testing cases. The volumes are 3D CT scans stored in the .nii.gz format. The image dimensions exhibit significant variability, with voxel spacing ranging from (0.5, 0.44, 0.44) mm to (5.0, 1.04, 1.04) mm and sizes ranging from (29, 512, 512) to (1,059, 512, 796). The dataset includes annotations for three anatomical structures: kidneys, tumors, and cysts. Kidneys and tumors are consistently present across all training cases, while cysts appear in 49.33% of the cases. This dataset serves as a critical resource for advancing automated segmentation techniques in medical imaging analysis.
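Given the wide spacing range quoted above, volumes are typically inspected and resampled to a common voxel grid before training; a quick way to check a case with nibabel (the file path below is a placeholder):

```python
import nibabel as nib

img = nib.load("case_00000/imaging.nii.gz")  # placeholder path to one CT volume
print(img.shape)               # volume size in voxels, e.g., (611, 512, 512)
print(img.header.get_zooms())  # voxel spacing in mm, e.g., (0.5, 0.78, 0.78)
```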
The Medical Segmentation Decathlon (MSD) pancreas tumor dataset^12 consists of 281 contrast-enhanced
