Figure 5. Qualitative visualization of segmentation results generated by different methods for magnetic resonance imaging cardiac tumor segmentation.
Abbreviations: 3D: Three-dimensional; nn: No new; NSD: Normalized surface Dice; SAM: Segment Anything Model; UNETR: U-Net Transformers; UX-Net: UNet-eXpanded Network.
Table 4. Ablation on each key component in our method

Parameters                        Dice (%)   Hausdorff distance (%)
Ref-SAM3D                         88.3       2.34
Without a text prompt             72.3       7.31
Without a cross-modal projector   80.1       4.22
Without hierarchical fusion       74.1       6.33
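For concreteness, the sketch below shows one standard way to compute the two metrics reported in Table 4 for binary 3D masks. This is an illustration using NumPy/SciPy, not the authors' evaluation code; in particular, Table 4 reports the Hausdorff distance as a percentage, which implies a normalization step that is not detailed on this page and is omitted here.

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice = 2|P intersect G| / (|P| + |G|) for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    denom = pred.sum() + gt.sum()
    return 2.0 * np.logical_and(pred, gt).sum() / denom if denom else 1.0

def hausdorff(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric Hausdorff distance, approximated over all
    foreground voxel coordinates (in voxel units, unnormalized)."""
    p = np.argwhere(pred).astype(float)
    g = np.argwhere(gt).astype(float)
    return max(directed_hausdorff(p, g)[0], directed_hausdorff(g, p)[0])

# Toy 3D masks: two partially overlapping cubes.
pred = np.zeros((32, 32, 32), dtype=bool); pred[8:20, 8:20, 8:20] = True
gt = np.zeros((32, 32, 32), dtype=bool); gt[10:22, 10:22, 10:22] = True
print(f"Dice = {dice_score(pred, gt):.3f}, HD = {hausdorff(pred, gt):.2f}")
```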
Figure 6. Comparison of zero-shot and five-shot generalization performance of Ref-SAM3D, nnU-Net, and Swin-UNETR on AMOS22 data: (A) computed tomography (CT) data and (B) magnetic resonance imaging (MRI) data.
Abbreviations: 3D: Three-dimensional; nn: No new; NSD: Normalized surface Dice; SAM: Segment Anything Model; UNETR: U-Net Transformers; UX-Net: UNet-eXpanded Network.
These results demonstrate Ref-SAM3D's robust generalization capabilities and its potential as a versatile solution for medical image segmentation across different imaging modalities.

These experimental findings clearly demonstrate Ref-SAM3D's robust performance across different datasets and imaging modalities. The model's strong zero-shot generalization capabilities and impressive few-shot learning results suggest its practical value in real-world medical applications, where adapting to diverse imaging conditions with minimal additional training is essential. These characteristics position Ref-SAM3D as a promising solution for clinical deployment, particularly in scenarios requiring flexible and efficient medical image analysis tools.
4.4. Ablation study

4.4.1. Effects of text prompt
The text prompt in our Ref-SAM3D model provided essential semantic guidance by bridging textual descriptions and visual features, enabling better interpretation of anatomical structures. As shown in Table 4, without this component the model's performance dropped significantly, with the Dice score decreasing from 88.3% to 72.3% (−16.0%) and the HD increasing from 2.34% to 7.31% (+4.97%). This substantial degradation demonstrates that the text prompt is crucial for leveraging linguistic context to achieve precise medical image segmentation.
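The prompt-injection mechanism itself is not shown on this page. As a hedged illustration of what text conditioning of a SAM-style 3D mask decoder can look like, the sketch below lets flattened visual tokens cross-attend to encoded text tokens; `TextGuidedDecoderBlock`, the attention direction, and all dimensions and token counts are assumptions for illustration, not Ref-SAM3D's actual modules.

```python
import torch
import torch.nn as nn

class TextGuidedDecoderBlock(nn.Module):
    """Hypothetical sketch: 3D visual tokens attend to text-prompt
    tokens so linguistic context can steer mask prediction."""
    def __init__(self, vis_dim: int = 256, txt_dim: int = 512, heads: int = 8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)  # match token widths
        self.cross_attn = nn.MultiheadAttention(vis_dim, heads,
                                                batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, txt_tokens):
        txt = self.txt_proj(txt_tokens)               # (B, T, vis_dim)
        attended, _ = self.cross_attn(vis_tokens, txt, txt)  # queries = visual
        return self.norm(vis_tokens + attended)       # residual fusion

# Toy usage: an 8x8x8 patch grid flattened to 512 visual tokens,
# plus 12 tokens from an encoded description such as "the cardiac tumor".
vis = torch.randn(1, 512, 256)
txt = torch.randn(1, 12, 512)
print(TextGuidedDecoderBlock()(vis, txt).shape)  # torch.Size([1, 512, 256])
```

Ablating the text prompt in a design like this amounts to removing the attended term, leaving the decoder with purely visual tokens, which is consistent with the large Dice drop reported in Table 4.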
4.4.2. Effects of cross-modal projector
The cross-modal projector in Ref-SAM3D plays a vital role in aligning textual and visual inputs, facilitating effective integration of multi-modal information for improved segmentation. By harmonizing these inputs, the projector enhances the model's ability to utilize semantic context from text alongside visual data. As shown in Table 4, removing this component resulted in an 8.2% decrease in the Dice score (from 88.3% to 80.1%) and an HD increase from 2.34% to 4.22%. These results confirm that when the cross-modal projector is removed, the model relies on unaligned embeddings, which can lead to less effective feature integration.
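As a sketch of the role described above, such a projector can be thought of as a small MLP that maps text-encoder embeddings into the visual feature space so the two modalities become directly comparable. `CrossModalProjector`, the two-layer design, and the 512/256 dimensions below are assumptions for illustration, not the authors' configuration.

```python
import torch
import torch.nn as nn

class CrossModalProjector(nn.Module):
    """Hypothetical two-layer MLP that maps text embeddings into the
    visual feature space; dimensions are illustrative only."""
    def __init__(self, txt_dim: int = 512, vis_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(txt_dim, vis_dim),
            nn.GELU(),
            nn.Linear(vis_dim, vis_dim),
        )

    def forward(self, txt_emb: torch.Tensor) -> torch.Tensor:
        return self.mlp(txt_emb)

txt_emb = torch.randn(1, 512)   # pooled text embedding
vis_emb = torch.randn(1, 256)   # pooled visual embedding
aligned = CrossModalProjector()(txt_emb)  # now in the 256-d visual space
sim = torch.cosine_similarity(aligned, vis_emb)
print(aligned.shape, sim.shape)  # torch.Size([1, 256]) torch.Size([1])
```

Without such a projection, the text and visual embeddings in this sketch would not even share a dimensionality, let alone a common geometry, which mirrors the "unaligned embeddings" failure mode reflected in the 8.2% Dice drop.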

