the Segment Anything Model (SAM), to medical image segmentation. For example, Huang et al.6 demonstrated that SAM performs suboptimally on medical data, especially with objects that have irregular shapes or low contrast. Three main factors limit SAM’s effectiveness in this domain. First, medical images often differ significantly from natural images: target structures tend to be small, irregular in shape, and low in contrast, complicating direct application of the model. Second, medical structures typically have blurred or indistinct boundaries, whereas SAM’s pre-training data consists predominantly of well-defined edges, reducing segmentation accuracy and stability. Finally, medical imaging data often exists in three-dimensional (3D) form with rich volumetric detail, yet SAM’s prompt engineering was developed for two-dimensional (2D) data, limiting its ability to leverage the 3D spatial features essential in medical contexts.

To enhance SAM’s performance in medical imaging tasks, it is crucial to adapt and fine-tune the model to address domain-specific challenges. Recent studies have shown that parameter-efficient transfer learning (PETL) techniques, such as Low-Rank Adaptation7 and Adapters,8 are effective in this context. For instance, Med-Tuning9 reduces the domain gap between natural images and medical volumes by incorporating Med-Adapter modules into pretrained visual foundation models. SAMed10 employs the Low-Rank Adaptation fine-tuning strategy to adjust the image encoder, prompt encoder, and mask decoder of SAM, achieving a balance between performance and deployment cost. However, these approaches predominantly focus on pure 2D adaptation and do not fully exploit the 3D information inherent in volumetric medical data. Research is now gradually shifting toward better utilizing the extensive data available in the 3D domain. The related methodologies can be categorized into two main approaches: one relies on prompt design based on SAM,11-13 and the other achieves fully automatic segmentation when the segmented objects exhibit relatively regular shapes and positions.14,15 Automatic prompt generation fails to leverage specialized medical knowledge and struggles to capture critical features because of the blurred boundaries and small targets common in medical images. These limitations result in suboptimal performance of automated methods, indicating the need for further optimization.
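As a concrete illustration of the PETL idea, the following is a minimal PyTorch sketch of Low-Rank Adaptation applied to a single linear layer. It is illustrative only, not the SAMed or Med-Tuning code; the class name and hyperparameters are assumptions:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Wraps a frozen pretrained nn.Linear with a trainable low-rank
        # update: y = base(x) + (alpha / r) * x A^T B^T. Only A and B train.
        def __init__(self, base: nn.Linear, r: int = 4, alpha: float = 8.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad = False  # keep the pretrained weights frozen
            # A is small-random, B is zero, so training starts at the base model
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))
            self.scale = alpha / r

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

In a SAMed-style setup, wrappers of this kind would typically replace the attention projection layers of SAM’s image encoder, so that only a small fraction of the parameters is updated during fine-tuning.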
In this paper, we propose RefSAM3D, an innovative approach that integrates textual prompts to enhance segmentation accuracy and consistency in complex anatomical scenarios. By incorporating text-based cues, our method enables SAM to perform referring expression segmentation within a 3D context, allowing the model to process both visual inputs and semantic descriptions for more intelligent segmentation strategies. We introduce a hierarchical attention mechanism that significantly improves the model’s ability to capture and integrate information across different scales. This mechanism focuses on critical feature layers while filtering out irrelevant data, thereby enhancing segmentation precision and robustness, particularly in complex 3D structures. By integrating information across multiple scales, the model achieves a nuanced understanding of volumetric data, leading to more precise medical image segmentation. In addition, we adapt the visual encoder to handle 3D inputs and enhance the mask decoder for direct 3D mask generation, bridging the gap between SAM’s 2D architecture and the demands of 3D medical imaging. This adaptation is crucial for ensuring the model’s applicability and effectiveness in this domain.
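The mechanics of such a 2D-to-3D adaptation are not spelled out at this point in the paper; one widely used recipe, shown below as a hedged PyTorch sketch, is to inflate the pretrained 2D patch-embedding convolution along a new depth axis. The function name and default depth are assumptions, not RefSAM3D’s actual design:

    import torch
    import torch.nn as nn

    def inflate_patch_embed(conv2d: nn.Conv2d, depth: int = 16) -> nn.Conv3d:
        # Replicates a pretrained 2D patch-embedding kernel along a new depth
        # axis and rescales it, so the 3D layer initially reproduces the 2D
        # response on slice-replicated volumes. Illustrative recipe only.
        k_h, k_w = conv2d.kernel_size
        conv3d = nn.Conv3d(
            conv2d.in_channels, conv2d.out_channels,
            kernel_size=(depth, k_h, k_w),
            stride=(depth, *conv2d.stride),
            bias=conv2d.bias is not None,
        )
        with torch.no_grad():
            weight = conv2d.weight.unsqueeze(2).repeat(1, 1, depth, 1, 1) / depth
            conv3d.weight.copy_(weight)
            if conv2d.bias is not None:
                conv3d.bias.copy_(conv2d.bias)
        return conv3d

Rescaling by the depth keeps the initial 3D activations consistent with the 2D model, which tends to stabilize early fine-tuning.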
                                                                                    21
            segmentation accuracy and consistency in complex   and  Noisy-text  embedding  (ALIGN)  model,   which
                                                                                                      22
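The prompt generator is described here only at a high level; purely as a schematic, cross-modal interaction of this kind is often implemented as cross-attention in which projected text tokens query image features. The PyTorch sketch below illustrates that pattern under assumed dimensions; it is not RefSAM3D’s actual module:

    import torch
    import torch.nn as nn

    class CrossModalPrompt(nn.Module):
        # Hypothetical sketch: project text and image tokens into one shared
        # space and let the text attend to image features, yielding prompt
        # tokens for the mask decoder. All dimensions are illustrative.
        def __init__(self, dim: int = 256, text_dim: int = 512,
                     img_dim: int = 768, heads: int = 8):
            super().__init__()
            self.text_proj = nn.Linear(text_dim, dim)
            self.img_proj = nn.Linear(img_dim, dim)
            self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.norm = nn.LayerNorm(dim)

        def forward(self, text_tokens, img_tokens):
            # text_tokens: (B, T, text_dim); img_tokens: (B, N, img_dim)
            q = self.text_proj(text_tokens)   # queries from the referring text
            kv = self.img_proj(img_tokens)    # keys/values from image features
            fused, _ = self.attn(q, kv, kv)   # text attends to the image
            return self.norm(q + fused)       # (B, T, dim) prompt embeddings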
2. Related work

2.1. Vision foundation models (VFMs)

With the rapid development of foundation models in computer vision, recent research has focused on leveraging large-scale pre-training to create adaptable models with zero-shot and few-shot generalization capabilities.16-19 These VFMs draw inspiration from language foundation models such as the generative pre-trained transformer (GPT) series, showing remarkable adaptability across domains and tasks using pre-training and fine-tuning paradigms.20 Notable examples include the Contrastive Language-Image Pre-training (CLIP) model21 and A Large-scale ImaGe and Noisy-text embedding (ALIGN) model,22 which employ image-text pairs to achieve zero-shot generalization across tasks such as classification and video understanding.
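Both models are trained with essentially the same signal: a symmetric contrastive loss over a batch of matched image-text pairs. A compact PyTorch sketch of that objective follows (the temperature value is illustrative):

    import torch
    import torch.nn.functional as F

    def contrastive_loss(img_emb, txt_emb, temperature=0.07):
        # img_emb, txt_emb: (B, D) embeddings of B matched image-text pairs.
        # Matched pairs sit on the diagonal of the similarity matrix and are
        # contrasted against all mismatched pairs, in both directions.
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature
        targets = torch.arange(img.size(0), device=img.device)
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))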
Building on these foundations, segmentation-specific models such as the segment-everything-everywhere model23 and SegGPT24 have emerged to address more