Table 5. Ablation experiments of each stage under the hierarchical cross-attention mechanism

Stages            Dice (%)   Hausdorff distance (%)
All stages        88.3       2.34
Stages 1 and 4    78.5       2.76
Stages 2 and 4    82.1       2.62
Stages 3 and 4    85.4       2.48
Stage 4 only      73.78      2.89
4.4.3. Effects of hierarchical cross-attention mechanism

The hierarchical fusion mechanism in Ref-SAM3D is pivotal for integrating information across encoder layers, enabling the model to capture the detailed, multi-level semantic features essential for precise segmentation. Ablation studies, summarized in Table 4, demonstrate the significance of this mechanism: removing the hierarchical fusion led to a sharp decline in segmentation accuracy, with the Dice coefficient dropping from 88.3% to 74.1% and the Hausdorff distance (HD) increasing from 2.34% to 6.33%. This underscores the mechanism's role in effectively combining features across layers for better performance.
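The paper describes this fusion qualitatively; as a concrete illustration, the minimal PyTorch sketch below shows one way per-stage cross-attention between image tokens and text embeddings could be wired up. All module names, dimensions, and the residual fusion scheme are assumptions made for illustration, not the authors' released implementation.

```python
# Illustrative sketch of hierarchical cross-attention fusion. Stage widths,
# the shared fused dimension, and the residual scheme are assumptions,
# not Ref-SAM3D's actual implementation.
import torch
import torch.nn as nn

class HierarchicalCrossAttentionFusion(nn.Module):
    """Fuse text embeddings with image tokens from several encoder stages."""

    def __init__(self, stage_dims=(96, 192, 384, 768), text_dim=512,
                 fused_dim=256, num_heads=8):
        super().__init__()
        # Project each stage's tokens and the text tokens to a shared width.
        self.img_proj = nn.ModuleList([nn.Linear(d, fused_dim) for d in stage_dims])
        self.txt_proj = nn.Linear(text_dim, fused_dim)
        # One cross-attention block per stage: image tokens query text tokens.
        self.cross_attn = nn.ModuleList([
            nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
            for _ in stage_dims
        ])
        self.norm = nn.LayerNorm(fused_dim)

    def forward(self, stage_feats, text_tokens):
        # stage_feats: list of (B, N_i, C_i) token maps from encoder stages 1-4.
        # text_tokens: (B, T, text_dim) embedding of the referring description.
        txt = self.txt_proj(text_tokens)
        fused = []
        for feats, proj, attn in zip(stage_feats, self.img_proj, self.cross_attn):
            q = proj(feats)                             # (B, N_i, fused_dim)
            out, _ = attn(query=q, key=txt, value=txt)  # text-conditioned tokens
            fused.append(self.norm(q + out))            # residual + norm
        # Concatenate across scales; a decoder could instead merge per scale.
        return torch.cat(fused, dim=1)

# Usage with dummy shapes (batch of 2, 16 text tokens):
feats = [torch.randn(2, n, d) for n, d in
         [(4096, 96), (1024, 192), (256, 384), (64, 768)]]
text = torch.randn(2, 16, 512)
fused = HierarchicalCrossAttentionFusion()(feats, text)  # (2, 5440, 256)
```

Dropping any stage from `stage_feats` in this sketch corresponds to the per-stage ablations reported in Table 5.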
Moreover, Table 5 provides a systematic evaluation of each block level's contribution to the model. The results reveal that utilizing all layers (Stages 1–4) achieved the best performance, with a Dice score of 88.3% and an HD of 2.34%. In contrast, excluding specific layers led to varied performance declines, with the shallow layers contributing significantly to contextual information and the deeper layers enhancing fine-grained details. For example, retaining only Stages 1 and 4 dropped the Dice score to 78.5% and increased the HD to 2.76%, while relying on Stage 4 alone yielded a Dice score of 73.78% and an HD of 2.89%.

These findings underscore the necessity of a comprehensive fusion approach. Each layer's unique contributions, from the broad contextual cues in the shallow layers to the detailed semantic information in the deeper layers, work synergistically to enhance the model's ability to capture complex anatomical structures, ultimately improving overall segmentation accuracy and robustness.
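For reference, the two metrics used throughout these ablations are standard: the Dice coefficient measures volumetric overlap, and the Hausdorff distance measures worst-case boundary deviation. The minimal sketch below assumes binary NumPy masks and surface-point coordinates; note it computes the raw symmetric HD in voxel units, whereas the paper reports HD values in percent, suggesting a normalized variant.

```python
# Minimal sketch of the two reported metrics (illustrative; the paper's
# exact HD normalization is not specified, so raw units are used here).
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Dice overlap between two binary masks (1.0 = perfect agreement)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum())

def hausdorff_distance(pred_pts: np.ndarray, gt_pts: np.ndarray) -> float:
    """Symmetric Hausdorff distance between two (N, 3) sets of surface points."""
    return max(directed_hausdorff(pred_pts, gt_pts)[0],
               directed_hausdorff(gt_pts, pred_pts)[0])
```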
5. Conclusion

We present Ref-SAM3D, a 3D-adapted SAM framework that synergizes cross-modal prompting and hierarchical attention to address medical segmentation challenges in volumetric imaging. Our model establishes a bidirectional interaction between visual data and semantic text descriptions, enabling intelligent segmentation through joint reasoning over volumetric imaging and clinical context. Three key innovations drive our methodology: (i) a cross-modal reference prompt generator that fuses text and image embeddings into a unified feature space through adaptive alignment, significantly enhancing spatial-semantic correlation; (ii) a multi-scale hierarchical attention mechanism that dynamically prioritizes critical anatomical features across dimensional scales while suppressing irrelevant noise, significantly improving segmentation robustness in intricate 3D topologies; and (iii) a volumetric architecture adaptation that transforms SAM's native 2D processing into true 3D computation through depth-aware convolutions and recursive mask refinement, effectively bridging the dimensional gap in medical imaging analysis. Extensive validation demonstrates state-of-the-art performance on complex segmentation tasks. While our approach is highly effective, future work will focus on improving computational efficiency to enable real-time clinical applications and on exploring semi-supervised learning techniques to address the challenge of limited labeled data. Overall, our method holds significant promise as a generalizable and robust segmentation framework, offering both fully automatic and promptable segmentation capabilities for a wide range of 3D medical imaging applications.
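The depth-aware convolutions in innovation (iii) are described only at a high level; one common way to realize such a 2D-to-3D adaptation is I3D-style weight inflation, sketched below. The helper name, kernel depth, and inflation scheme are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of adapting a pretrained 2D conv to volumetric input
# via weight inflation (an assumption; Ref-SAM3D's exact scheme may differ).
import torch
import torch.nn as nn

def inflate_conv2d_to_3d(conv2d: nn.Conv2d, depth: int = 3) -> nn.Conv3d:
    """Build a depth-aware Conv3d initialized from a pretrained Conv2d.

    The 2D kernel is replicated along the new depth axis and scaled by
    1/depth, so the layer initially reproduces the 2D response on
    depth-constant inputs (standard I3D-style inflation).
    """
    kh, kw = conv2d.kernel_size
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(depth, kh, kw),
        stride=(1, *conv2d.stride),
        padding=(depth // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        # (out, in, kh, kw) -> (out, in, depth, kh, kw), scaled by 1/depth.
        w3d = conv2d.weight.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
        conv3d.weight.copy_(w3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

# Usage: adapt a 2D layer from a SAM-like image encoder to volumetric input.
conv2d = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
conv3d = inflate_conv2d_to_3d(conv2d)   # now accepts (B, 3, D, H, W)
vol = torch.randn(1, 3, 16, 64, 64)
print(conv3d(vol).shape)                # torch.Size([1, 64, 16, 32, 32])
```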
Acknowledgments

None.

Funding

None.

Conflict of interest

The authors declare they have no competing interests.

Author contributions

Conceptualization: Xiang Gao
Data curation: Xiang Gao
Investigation: Xiang Gao
Methodology: Xiang Gao
Visualization: Xiang Gao
Writing–original draft: Xiang Gao
Writing–review & editing: Kai Lu

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data

Data will be made available upon request to the corresponding author.