
Artificial Intelligence in Health                                 RefSAM3D for medical image segmentation


















Figure 5. Qualitative visualization of segmentation results generated by different methods for magnetic resonance imaging cardiac tumor segmentation.
Abbreviations: 3D: Three-dimensional; nn: No new; NSD: Normalized surface Dice; SAM: Segment Anything Model; UNETR: U-Net Transformers; UX-Net: UNet-eXpanded Network.

Table 4. Ablation on each key component in our method

Parameters                        Dice (%)   Hausdorff distance (%)
Ref-SAM3D                         88.3       2.34
Without a text prompt             72.3       7.31
Without a cross-modal projector   80.1       4.22
Without hierarchical fusion       74.1       6.33

Figure 6. Comparison of zero-shot and five-shot generalization performance of Ref-SAM3D, nnU-Net, and Swin-UNETR on AMOS22 data. (A) Computed tomography (CT) and (B) magnetic resonance imaging (MRI) data.
Abbreviations: 3D: Three-dimensional; nn: No new; NSD: Normalized surface Dice; SAM: Segment Anything Model; UNETR: U-Net Transformers; UX-Net: UNet-eXpanded Network.

Ref-SAM3D’s robust generalization capabilities and its potential as a versatile solution for medical image segmentation across different imaging modalities.

These experimental findings demonstrate Ref-SAM3D’s robust performance across different datasets and imaging modalities. The model’s strong zero-shot generalization and its few-shot learning results suggest practical value in real-world medical applications, where adapting to diverse imaging conditions with minimal additional training is essential. These characteristics position Ref-SAM3D as a promising solution for clinical deployment, particularly in scenarios requiring flexible and efficient medical image analysis tools.

4.4. Ablation study
4.4.1. Effects of text prompt
The text prompt in Ref-SAM3D provides essential semantic guidance by bridging textual descriptions and visual features, enabling better interpretation of anatomical structures. As shown in Table 4, the model’s performance dropped substantially without this component: the Dice score decreased from 88.3% to 72.3% (−16.0 percentage points) and the Hausdorff distance (HD) increased from 2.34% to 7.31% (+4.97 percentage points). This degradation demonstrates that the text prompt is crucial for leveraging linguistic context to achieve precise medical image segmentation.

4.4.2. Effects of cross-modal projector
The cross-modal projector in Ref-SAM3D aligns textual and visual inputs, facilitating effective integration of multi-modal information for improved segmentation. By harmonizing these inputs, the projector enhances the model’s ability to use semantic context from text alongside visual data. As shown in Table 4, removing this component resulted in a decrease of 8.2 percentage points in Dice score (from 88.3% to 80.1%) and an increase in HD from 2.34% to 4.22%. These results indicate that, without the cross-modal projector, the model relies on unaligned embeddings, which leads to less effective feature integration.
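
The differences quoted in this ablation study can be recomputed directly from Table 4. The following is a minimal sketch and not part of the original study: the dictionary layout and the helper name `ablation_deltas` are assumptions introduced for illustration, and the numeric values are copied from Table 4.

```python
# Values copied from Table 4 as (Dice %, Hausdorff distance %).
# The data layout and the helper name are illustrative assumptions,
# not part of the original work.
TABLE_4 = {
    "Ref-SAM3D": (88.3, 2.34),
    "Without a text prompt": (72.3, 7.31),
    "Without a cross-modal projector": (80.1, 4.22),
    "Without hierarchical fusion": (74.1, 6.33),
}


def ablation_deltas(table, baseline="Ref-SAM3D"):
    """Return (Dice change, HD change) of each variant relative to the baseline."""
    base_dice, base_hd = table[baseline]
    return {
        name: (round(dice - base_dice, 2), round(hd - base_hd, 2))
        for name, (dice, hd) in table.items()
        if name != baseline
    }


if __name__ == "__main__":
    # Print signed changes: negative Dice and positive HD both mean degradation.
    for name, (d_dice, d_hd) in ablation_deltas(TABLE_4).items():
        print(f"{name}: Dice {d_dice:+.1f}, HD {d_hd:+.2f}")
```

Under these assumptions, the row without a text prompt gives a Dice change of −16.0 and an HD change of +4.97, and the row without a cross-modal projector gives −8.2 and +1.88. The same calculation for the row without hierarchical fusion gives −14.2 and +3.99.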


            Volume 2 Issue 4 (2025)                        125                          doi: 10.36922/AIH025080010