efficiently while addressing the unique challenges posed by these complex tasks.35-37

2.4. Image segmentation by referring expressions

Referring image segmentation is a task that involves segmenting a specific object in an image based on a natural language description. This task requires the model to understand both the visual content of the image and the semantic meaning of the text, making it a challenging problem at the intersection of computer vision and natural language processing. With the advent of large-scale vision-language models, the performance of referring image segmentation has improved significantly. Models such as CLIP38 and ALIGN39 leverage large datasets of image-text pairs to learn joint embeddings that can be used for various vision-language tasks, including referring image segmentation. These models have demonstrated strong zero-shot and few-shot capabilities, enabling them to generalize well to unseen tasks and datasets. Recent advances have seen the adoption of transformer architectures for referring expression-based image segmentation. Transformer-based models, such as ViT,40 have been adapted to this task by integrating textual information into the visual processing pipeline. Ding et al.41 introduced a vision-language transformer approach that leverages transformer and multi-head attention mechanisms to establish deep interactions between vision and language features, significantly enhancing holistic understanding. Similarly, cross-modal attention mechanisms have become a key component in modern referring image segmentation models. These mechanisms enable the model to combine visual and textual features effectively by computing attention scores between the two modalities. Li et al.42 introduced a hierarchical dense attention module that fuses hierarchical visual semantic information with sparse embeddings to obtain fine-grained dense embeddings, and an implicit tracking module that generates a tracking token and provides historical information for the mask decoder.

3. Method

3.1. Overview of RefSAM3D

The original SAM, built on a 2D ViT, is proficient at capturing global patterns within 2D natural images. However, its applicability is limited when it comes to medical imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI), which involve 3D volumetric data. In these contexts, 3D information is essential for applications such as organ segmentation and tumor quantification, as the characteristics of these structures must be captured from a 3D perspective. Relying solely on 2D views can result in reduced accuracy due to potential boundary blurring and non-standard scanning postures. Moreover, medical images differ significantly from natural images in both content and structure, demanding higher anatomical precision and detail. Directly applying segmentation models trained on natural images to medical domains thus yields limited effectiveness. Figure 1 shows the proposed method, RefSAM3D.

3.2. 3D volumetric input processing

To enhance SAM's performance in medical imaging tasks, the model needs to be adapted and fine-tuned to accommodate the domain-specific challenges. We introduced a 3D image adapter to enable SAM's processing of volumetric data.

We first modified the visual encoder to handle 3D volumetric inputs. Given a 3D medical volume V ∈ ℝ^{C×D×H×W}, where C, D, H, and W denote the channel, depth, height, and width, respectively, we extracted the 3D features through the following steps.
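For concreteness, the sketches accompanying the following subsections assume a PyTorch-style tensor layout with a batch dimension added in front of (C, D, H, W); the concrete channel and spatial sizes below are illustrative assumptions, not values reported here.

import torch

# Illustrative volume in the (C, D, H, W) layout described above, with a batch
# dimension added; the sizes are assumptions (e.g., a single-channel CT scan).
B, C, D, H, W = 1, 1, 64, 224, 224
volume = torch.randn(B, C, D, H, W)   # shape: (1, 1, 64, 224, 224)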
3.2.1. Patch embedding

We approximated a k × k × k convolution (with k = 14) by employing a combination of 1 × k × k and k × 1 × 1 3D convolutions. The 1 × k × k convolution was initialized with pre-trained 2D convolution weights, which remain frozen during fine-tuning. To manage the complexity of the model, we applied depth-wise convolutions for the newly introduced k × 1 × 1 convolutions, reducing the number of parameters that require tuning.
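A minimal sketch of this factorized patch embedding, assuming PyTorch, is given below. The class name, the channel width (embed_dim = 768), and the stride choices are illustrative assumptions; the text above specifies only the kernel factorization, the frozen 2D initialization, and the depth-wise k × 1 × 1 convolution.

import torch
import torch.nn as nn

class FactorizedPatchEmbed3D(nn.Module):
    """Approximates a k x k x k patch embedding with a frozen 1 x k x k convolution
    (initialized from pretrained 2D weights) followed by a trainable depth-wise
    k x 1 x 1 convolution. Strides and channel sizes are assumptions."""

    def __init__(self, in_ch=1, embed_dim=768, k=14):
        super().__init__()
        # In-plane convolution: reuses the pretrained 2D patch-embedding weights.
        self.inplane = nn.Conv3d(in_ch, embed_dim,
                                 kernel_size=(1, k, k), stride=(1, k, k))
        # Depth-axis convolution: depth-wise (groups=embed_dim) and trainable,
        # so only a small number of new parameters are tuned.
        self.depthwise = nn.Conv3d(embed_dim, embed_dim,
                                   kernel_size=(k, 1, 1), stride=(k, 1, 1),
                                   groups=embed_dim)

    def load_2d_weights(self, w2d):
        # w2d: pretrained 2D conv weights with shape (embed_dim, in_ch, k, k).
        self.inplane.weight.data.copy_(w2d.unsqueeze(2))  # -> (E, in_ch, 1, k, k)
        for p in self.inplane.parameters():
            p.requires_grad = False                       # frozen during fine-tuning

    def forward(self, x):                                 # x: (B, C, D, H, W)
        return self.depthwise(self.inplane(x))            # coarse 3D patch features

Applied to the illustrative volume above, this yields a feature map of shape (1, 768, 4, 16, 16); the sketches that follow consume these features after permuting them to a (B, D, H, W, C) token layout.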
3.2.2. Positional encoding

In the pre-trained ViT model, we introduced an additional learnable lookup table with dimensions (C × D) to encode the positional information for 3D points (d, h, and w). By summing the positional embedding from the frozen (h, w) table with the learnable depth-axis embedding, we provided accurate positional encoding for the 3D data.
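The positional-encoding step can be sketched as follows, again in PyTorch. The frozen (h, w) table is stood in for by a random buffer, and the grid and depth sizes match the illustrative shapes above; in the actual model the 2D table would come from the pretrained ViT.

import torch
import torch.nn as nn

class DepthAwarePositionalEncoding(nn.Module):
    """Sums the frozen (h, w) positional table of the pretrained ViT with a new
    learnable depth-axis lookup table of size (D x C). Shapes are illustrative."""

    def __init__(self, embed_dim=768, grid_hw=16, depth=4, pos_2d=None):
        super().__init__()
        if pos_2d is None:                       # stand-in for the pretrained table
            pos_2d = torch.randn(grid_hw, grid_hw, embed_dim)
        self.register_buffer("pos_2d", pos_2d)   # a buffer, so it stays frozen
        # Learnable lookup table over the depth axis: one C-dim vector per slice d.
        self.pos_depth = nn.Parameter(torch.zeros(depth, embed_dim))

    def forward(self, x):                        # x: (B, D, H, W, C) patch tokens
        pos = self.pos_2d[None, None] + self.pos_depth[None, :, None, None]
        return x + pos                           # broadcasts over the batch axis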
3.2.3. Attention block

The attention block was directly adjusted to accommodate 3D features. For 2D inputs, the query size was (B, HW, C), which is easily modified to (B, DHW, C) for 3D inputs while retaining all pretrained weights. We adopted a sliding window mechanism, similar to that in the Swin Transformer, to mitigate the memory overhead resulting from the increased dimensionality, optimizing the model's performance and memory footprint.
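Below is a sketch of how 3D tokens can be flattened so that pretrained attention weights apply unchanged, together with a non-shifted Swin-style window partition that keeps attention local and bounds memory. The window size and the use of nn.MultiheadAttention as a stand-in for SAM's attention layers are assumptions made for illustration.

import torch
import torch.nn as nn

def flatten_3d_tokens(x):
    """(B, D, H, W, C) -> (B, D*H*W, C): the pretrained attention weights operate
    on a token sequence, so the same layers accept 3D inputs once flattened."""
    B, D, H, W, C = x.shape
    return x.reshape(B, D * H * W, C)

def window_partition_3d(x, ws):
    """Split (B, D, H, W, C) into non-overlapping ws x ws x ws windows so attention
    is computed per window (Swin-style); assumes D, H, W are divisible by ws and
    omits the shifted windows of the full scheme."""
    B, D, H, W, C = x.shape
    x = x.reshape(B, D // ws, ws, H // ws, ws, W // ws, ws, C)
    x = x.permute(0, 1, 3, 5, 2, 4, 6, 7)
    return x.reshape(-1, ws ** 3, C)             # (num_windows * B, ws^3, C)

# Example: reuse a pretrained-style attention layer on windowed 3D tokens.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
tokens = torch.randn(1, 4, 16, 16, 768)          # (B, D, H, W, C)
seq = flatten_3d_tokens(tokens)                  # (1, 1024, 768): (B, DHW, C)
windows = window_partition_3d(tokens, ws=4)      # (16, 64, 768)
out, _ = attn(windows, windows, windows)         # attention within each window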
3.2.4. Bottleneck

As in other studies, we enhanced the bottleneck layer to better adapt to 3D tasks. Specifically, we replaced

