Artificial Intelligence in Health





ORIGINAL RESEARCH ARTICLE

RefSAM3D: Adapting the Segment Anything Model with cross-modal references for three-dimensional medical image segmentation

Xiang Gao and Kai Lu*

Department of Anesthesiology, Nanjing Drum Tower Hospital, Nanjing University, Nanjing, Jiangsu, China



Abstract

The Segment Anything Model (SAM), originally built on a two-dimensional vision transformer, excels at capturing global patterns in two-dimensional natural images but faces challenges when applied to three-dimensional (3D) medical imaging modalities such as computed tomography and magnetic resonance imaging. These modalities require capturing spatial information in volumetric space for tasks such as organ segmentation and tumor quantification. To address this challenge, we introduce RefSAM3D, an adaptation of SAM for 3D medical imaging by incorporating a 3D image adapter and cross-modal reference prompt generation. Our approach modifies the visual encoder to handle 3D inputs and enhances the mask decoder for direct 3D mask generation. We also integrate textual prompts to improve segmentation accuracy and consistency in complex anatomical scenarios. By employing a hierarchical attention mechanism, our model effectively captures and integrates information across different scales. Extensive evaluations on multiple medical imaging datasets demonstrate that RefSAM3D outperforms state-of-the-art methods. Our work thus advances the application of SAM in accurately segmenting complex anatomical structures in medical imaging.

Keywords: Three-dimensional medical imaging; Cross-modal reference prompt; Volumetric segmentation; Vision transformer

*Corresponding author: Kai Lu (961340955@qq.com)

Citation: Gao X, Lu K. RefSAM3D: Adapting the Segment Anything Model with cross-modal references for three-dimensional medical image segmentation. Artif Intell Health. 2025;2(4):114-128. doi: 10.36922/AIH025080010
Received: February 17, 2025
Revised: May 1, 2025
Accepted: June 23, 2025
Published online: August 14, 2025

Copyright: © 2025 Author(s). This is an Open-Access article distributed under the terms of the Creative Commons Attribution License, permitting distribution and reproduction in any medium, provided the original work is properly cited.

Publisher’s Note: AccScience Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

1. Introduction

Medical image segmentation is a fundamental task in medical imaging, primarily aimed at identifying and extracting specific anatomical structures, such as organs, lesions, and tissues, from medical images. This process is crucial for numerous clinical applications, including computer-aided diagnosis, treatment planning, and disease progression monitoring. Accurate image segmentation provides precise volumetric and shape information about target structures, which is essential for further clinical applications such as disease diagnosis, quantitative analysis, and surgical planning.[1-3]

Recent breakthroughs in foundational models for image segmentation[4,5] have yielded transformative results, leveraging extensive datasets to capture general representations that exhibit exceptional generalizability and performance. However, despite these strides, significant challenges arise when applying these models, particularly


            Volume 2 Issue 4 (2025)                        114                          doi: 10.36922/AIH025080010