efficiently while addressing the unique challenges posed by these complex tasks.35-37

2.4. Image segmentation by referring expressions
Referring image segmentation is the task of segmenting a specific object in an image based on a natural language description. The task requires the model to understand both the visual content of the image and the semantic meaning of the text, making it a challenging problem at the intersection of computer vision and natural language processing. With the advent of large-scale vision-language models, the performance of referring image segmentation has improved significantly. Models such as CLIP38 and ALIGN39 leverage large datasets of image-text pairs to learn joint embeddings that can be used for various vision-language tasks, including referring image segmentation. These models have demonstrated strong zero-shot and few-shot capabilities, enabling them to generalize well to unseen tasks and datasets. Recent advances have seen the adoption of transformer architectures for referring expression-based image segmentation. Transformer-based models, such as the ViT,40 have been adapted to this task by integrating textual information into the visual processing pipeline. Ding et al.41 introduced a vision-language transformer approach that leverages transformer and multi-head attention mechanisms to establish deep interactions between vision and language features, significantly enhancing holistic understanding. Similarly, cross-modal attention mechanisms have become a key component in modern referring image segmentation models. These mechanisms enable the model to effectively combine visual and textual features by computing attention scores between the two modalities. Li et al.42 introduced a hierarchical dense attention module that fuses hierarchical visual semantic information with sparse embeddings to obtain fine-grained dense embeddings, and an implicit tracking module that generates a tracking token and provides historical information for the mask decoder.

3. Method
3.1. Overview of RefSAM3D
The original SAM, built on a 2D ViT, is proficient in capturing global patterns within 2D natural images. However, its applicability is limited for medical imaging modalities such as computed tomography (CT) and magnetic resonance imaging (MRI), which involve 3D volumetric data. In these contexts, 3D information is essential for applications such as organ segmentation and tumor quantification, as the characteristics of these structures must be captured from a 3D perspective. Relying solely on 2D views can result in reduced accuracy due to potential boundary blurring and non-standard scanning postures. Moreover, medical images differ significantly from natural images in both content and structure, demanding higher anatomical precision and detail. Directly applying segmentation models trained on natural images to medical domains thus yields limited effectiveness. Figure 1 shows the proposed method, RefSAM3D.

3.2. 3D volumetric input processing
To enhance SAM's performance in medical imaging tasks, the model needs to be adapted and fine-tuned to accommodate the domain-specific challenges. We introduced a 3D image adapter to enable SAM's processing of volumetric data.
We first modified the visual encoder to handle 3D volumetric inputs. Given a 3D medical volume V ∈ R^(C×D×H×W), where C, D, H, and W denote the channel, depth, height, and width, respectively, we extracted the 3D features through the following steps.

3.2.1. Patch embedding
We approximated a k × k × k convolution (with k = 14) by employing a combination of 1 × k × k and k × 1 × 1 3D convolutions. The 1 × k × k convolution was initialized with pre-trained 2D convolution weights, which remain frozen during fine-tuning. To manage model complexity, we applied depth-wise convolutions for the newly introduced k × 1 × 1 convolutions, reducing the number of parameters that require tuning.
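A minimal PyTorch sketch of this decomposition is given below, assuming a single-channel volume and a 768-dimensional embedding; the class name PatchEmbed3D, the weight_2d argument, and the way the 2D weights are copied in are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    def __init__(self, in_chans: int = 1, embed_dim: int = 768, k: int = 14,
                 weight_2d: torch.Tensor = None):
        super().__init__()
        # In-plane 1 x k x k convolution, initialized from the pre-trained 2D
        # patch-embedding weights and kept frozen during fine-tuning.
        self.conv_hw = nn.Conv3d(in_chans, embed_dim, kernel_size=(1, k, k),
                                 stride=(1, k, k), bias=False)
        if weight_2d is not None:
            # weight_2d: (embed_dim, in_chans, k, k) from the 2D ViT.
            self.conv_hw.weight.data.copy_(weight_2d.unsqueeze(2))
        self.conv_hw.weight.requires_grad = False

        # Newly introduced k x 1 x 1 convolution along depth, made depth-wise
        # (groups=embed_dim) so that only a small number of parameters are tuned.
        self.conv_d = nn.Conv3d(embed_dim, embed_dim, kernel_size=(k, 1, 1),
                                stride=(k, 1, 1), groups=embed_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, D, H, W) -> (B, embed_dim, D/k, H/k, W/k),
        # approximating a single k x k x k patch embedding.
        return self.conv_d(self.conv_hw(x))

# Example: one single-channel volume whose dimensions are divisible by k = 14.
vol = torch.randn(1, 1, 28, 224, 224)
tokens = PatchEmbed3D()(vol)  # (1, 768, 2, 16, 16)
```

Keeping the in-plane convolution frozen preserves the pre-trained 2D patch representation, while the depth-wise convolution adds only embed_dim × k new weights per layer.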
3.2.2. Positional encoding
In the pre-trained ViT model, we introduced an additional learnable lookup table with dimensions (C×D) to encode the positional information for 3D points (d, h, and w). By summing the positional embedding from the frozen (h, w) table with the learnable depth-axis embedding, we provided accurate positional encoding for the 3D data.
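A minimal PyTorch sketch of this summation is shown below, assuming the patch tokens are kept on their 3D grid; the class and attribute names (PositionalEncoding3D, pos_embed_hw, pos_embed_d) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class PositionalEncoding3D(nn.Module):
    def __init__(self, pos_embed_hw: torch.Tensor, depth: int, embed_dim: int):
        super().__init__()
        # Frozen (h, w) positional table from the pre-trained 2D ViT: (1, H_p, W_p, C).
        self.register_buffer("pos_embed_hw", pos_embed_hw)
        # Learnable depth-axis lookup table: one C-dimensional vector per depth index.
        self.pos_embed_d = nn.Parameter(torch.zeros(1, depth, 1, 1, embed_dim))

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, D_p, H_p, W_p, C) patch embeddings on the 3D grid.
        pos = self.pos_embed_hw.unsqueeze(1) + self.pos_embed_d  # (1, D_p, H_p, W_p, C)
        return tokens + pos

# Example with illustrative sizes: 8 x 8 in-plane positions, 4 depth positions.
pe = PositionalEncoding3D(torch.randn(1, 8, 8, 768), depth=4, embed_dim=768)
x = torch.randn(2, 4, 8, 8, 768)
y = pe(x)  # same shape, with positions added
```

Initializing the depth-axis table to zero leaves the frozen 2D positional encoding unchanged at the start of fine-tuning.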
3.2.3. Attention block
The attention block was directly adjusted to accommodate 3D features. For 2D inputs, the query size was (B, HW, C), which is easily modified to (B, DHW, C) for 3D inputs while retaining all pretrained weights. We adopted a sliding window mechanism, similar to that in the Swin Transformer, to mitigate the memory overhead resulting from the increased dimensionality, optimizing the model's performance and memory footprint.
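The sketch below illustrates the reshaping only, assuming a generic nn.MultiheadAttention layer as a stand-in for SAM's pre-trained attention blocks; the Swin-style windowing is omitted and would restrict attention to local 3D windows rather than the full token sequence.

```python
import torch
import torch.nn as nn

def attend_3d(attn: nn.MultiheadAttention, tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (B, D_p, H_p, W_p, C) patch tokens on the 3D grid.
    b, d, h, w, c = tokens.shape
    seq = tokens.reshape(b, d * h * w, c)             # (B, DHW, C), vs. (B, HW, C) in 2D
    out, _ = attn(seq, seq, seq, need_weights=False)  # pre-trained weights apply unchanged
    return out.reshape(b, d, h, w, c)

# Example: a stand-in attention layer applied to one local 3D window of tokens.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
tokens = torch.randn(1, 4, 8, 8, 768)  # (B, D_p, H_p, W_p, C)
out = attend_3d(attn, tokens)          # (1, 4, 8, 8, 768)
```

Because attention operates on a flat token sequence, the pre-trained weights transfer directly; only the sequence length grows from HW to DHW, which is why windowing is needed to keep the quadratic attention cost manageable.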
3.2.4. Bottleneck
As in other studies, we enhanced the bottleneck layer to better adapt to 3D tasks. Specifically, we replaced

