the Segment Anything Model (SAM), to medical image segmentation. For example, Huang et al. 6 demonstrated that SAM performs suboptimally on medical data, especially with objects that have irregular shapes or low contrast. Three main factors limit SAM’s effectiveness in this domain. First, medical images, which often differ significantly from natural images, tend to be smaller, irregular in shape, and low in contrast, complicating direct application of the model. Second, medical structures typically have blurred or indistinct boundaries, whereas SAM’s pre-training data includes predominantly well-defined edges, reducing segmentation accuracy and stability. Finally, medical imaging data often exists in three-dimensional (3D) form with rich volumetric detail, yet SAM’s prompt engineering was developed for two-dimensional (2D) data, limiting its ability to leverage the 3D spatial features essential in medical contexts.
To enhance SAM’s performance in medical imaging tasks, it is crucial to adapt and fine-tune the model to address domain-specific challenges. Recent studies have shown that parameter-efficient transfer learning (PETL) techniques, such as Low-Rank Adaptation 7 and Adapters, 8 are effective in this context. For instance, Med-Tuning 9 reduces the domain gap between natural images and medical volumes by incorporating Med-Adapter modules into pretrained visual foundation models. SAMed 10 employs the Low-Rank Adaptation fine-tuning strategy to adjust the image encoder, prompt encoder, and mask decoder of SAM, achieving a balance between performance and deployment cost. However, these approaches predominantly focus on pure 2D adaptation and do not fully exploit the 3D information inherent in volumetric medical data. Research is now gradually shifting toward better utilizing the extensive data available in the 3D domain. The related methodologies can be categorized into two main approaches: one relies on prompt design based on SAM, 11-13 while the other achieves fully automatic segmentation when the segmented objects exhibit relatively regular shapes and positions. 14,15 Automatic prompt generation, however, fails to leverage specialized medical knowledge and struggles to capture critical features amid the blurred boundaries and small targets common in medical images. These limitations result in suboptimal performance of automated methods, indicating the need for further optimization.
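To make the PETL idea concrete, the following is a minimal sketch of a LoRA-style layer: the pretrained weight is frozen and only two small low-rank factors are trained. This is an illustration of the general technique, not the Med-Tuning or SAMed implementation; the class and parameter names (LoRALinear, rank, alpha) are our own.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update
    (illustrative sketch of the LoRA idea, not the SAMed code)."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze pretrained weights
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A projects down to `rank`, B projects back up
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = W x + scale * B A x; only A and B receive gradients
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

# Example: wrap a projection inside an attention block
proj = nn.Linear(768, 768)
proj_lora = LoRALinear(proj, rank=8)
out = proj_lora(torch.randn(2, 196, 768))  # (batch, tokens, dim)
```

Because B is initialized to zero, the wrapped layer initially reproduces the pretrained behavior exactly, which is what makes this form of adaptation stable to fine-tune.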
In this paper, we propose RefSAM3D, an innovative approach that integrates textual prompts to enhance segmentation accuracy and consistency in complex anatomical scenarios. By incorporating text-based cues, our method enables SAM to perform referring expression segmentation within a 3D context, allowing the model to process both visual inputs and semantic descriptions for more intelligent segmentation strategies.
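As an illustration of how text-based cues can condition SAM-style visual features, the sketch below fuses a referring expression’s token embeddings with image tokens through cross-attention. It is a generic sketch under our own naming and dimension assumptions (CrossModalFusion, vis_dim, txt_dim), not the RefSAM3D architecture itself.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Fuse text and image embeddings in a shared feature space via
    cross-attention (an illustrative sketch, not the RefSAM3D code)."""

    def __init__(self, vis_dim: int = 768, txt_dim: int = 512, heads: int = 8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)  # map text into visual space
        self.attn = nn.MultiheadAttention(vis_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens: torch.Tensor, txt_tokens: torch.Tensor):
        # vis_tokens: (B, N_vis, vis_dim); txt_tokens: (B, N_txt, txt_dim)
        txt = self.txt_proj(txt_tokens)
        # Visual tokens query the text description of the target structure
        fused, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return self.norm(vis_tokens + fused)  # residual keeps visual detail

fusion = CrossModalFusion()
vis = torch.randn(1, 196, 768)   # patch tokens from the image encoder
txt = torch.randn(1, 12, 512)    # token embeddings of a referring expression
prompt_feats = fusion(vis, txt)  # text-conditioned features for a mask decoder
```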
We introduce a hierarchical attention mechanism that significantly improves the model’s ability to capture and integrate information across different scales. This mechanism focuses on critical feature layers while filtering out irrelevant data, thereby enhancing segmentation precision and robustness, particularly in complex 3D structures. By integrating information across multiple scales, the model achieves a nuanced understanding of volumetric data, leading to more precise medical image segmentation.
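A minimal sketch of one way such multi-scale weighting can be realized is given below: token maps from several encoder stages are pooled, scored, and softly re-weighted before fusion. The gating scheme and all names (HierarchicalScaleAttention) are illustrative assumptions; the paper’s actual hierarchical attention may differ.

```python
import torch
import torch.nn as nn

class HierarchicalScaleAttention(nn.Module):
    """Weight and fuse encoder features from multiple scales
    (an illustrative sketch of hierarchical attention, not the paper's code)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # importance score per scale
        self.fuse = nn.Linear(dim, dim)

    def forward(self, scale_feats: list[torch.Tensor]) -> torch.Tensor:
        # scale_feats: list of (B, N_i, dim) token maps from different stages
        pooled = torch.stack([f.mean(dim=1) for f in scale_feats], dim=1)  # (B, S, dim)
        weights = torch.softmax(self.score(pooled), dim=1)                 # (B, S, 1)
        # Emphasize informative scales, suppress irrelevant ones
        fused = (weights * pooled).sum(dim=1)                              # (B, dim)
        return self.fuse(fused)

ham = HierarchicalScaleAttention(dim=256)
feats = [torch.randn(2, n, 256) for n in (4096, 1024, 256)]  # fine -> coarse
global_ctx = ham(feats)  # (2, 256) multi-scale context vector
```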
In addition, we adapt the visual encoder to handle 3D inputs and enhance the mask decoder for direct 3D mask generation, bridging the gap between SAM’s 2D architecture and the demands of 3D medical imaging. This adaptation is crucial for ensuring the model’s applicability and effectiveness in this domain.
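One standard way to bridge a 2D ViT-style encoder to volumetric input is to swap its 2D patch embedding for a 3D one, as sketched below. The shapes, names (PatchEmbed3D), and single-channel CT assumption are ours; this is not the paper’s actual encoder adaptation.

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """3D patch embedding that replaces a ViT's 2D patchifier so the
    encoder can consume volumes (illustrative sketch, assumed shapes)."""

    def __init__(self, patch: int = 16, in_ch: int = 1, dim: int = 768):
        super().__init__()
        # A 3D convolution with stride == kernel cuts the volume into
        # non-overlapping patches and projects each one to a token.
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, vol: torch.Tensor) -> torch.Tensor:
        # vol: (B, C, D, H, W), e.g., a CT volume with one channel
        x = self.proj(vol)                   # (B, dim, D/p, H/p, W/p)
        return x.flatten(2).transpose(1, 2)  # (B, N_patches, dim) token sequence

embed = PatchEmbed3D(patch=16, in_ch=1, dim=768)
tokens = embed(torch.randn(1, 1, 64, 256, 256))  # (1, 1024, 768) tokens
```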
We evaluate our approach on multiple medical imaging datasets, demonstrating its superior performance compared to state-of-the-art methods. Our experiments highlight the effectiveness of our model in accurately segmenting complex anatomical structures, thereby advancing the application of SAM in medical imaging. The contributions of our work are as follows:
(i) We introduce a cross-modal reference prompt generation mechanism that integrates text and image embeddings into a unified feature space, facilitating effective cross-modal interaction.
(ii) We develop a hierarchical attention mechanism that significantly improves the model’s ability to capture and integrate information across different scales, leading to improved segmentation precision and robustness, particularly in complex 3D structures.
(iii) We achieve state-of-the-art results across multiple benchmarks, demonstrating superior performance in 3D medical image segmentation tasks.
2. Related work

2.1. Vision foundation models (VFMs)
With the rapid development of foundation models in computer vision, recent research has focused on leveraging large-scale pre-training to create adaptable models with zero-shot and few-shot generalization capabilities. 16-19 These VFMs draw inspiration from language foundation models such as the generative pre-trained transformer (GPT) series, showing remarkable adaptability across domains and tasks through pre-training and fine-tuning paradigms. 20 Notable examples include the Contrastive Language-Image Pre-training (CLIP) model 21 and the A Large-scale ImaGe and Noisy-text embedding (ALIGN) model, 22 which employ image-text pairs to achieve zero-shot generalization across tasks such as classification and video understanding. Building on these foundations, segmentation-specific models such as the segment-everything-everywhere model 23 and SegGPT 24 have emerged to address more