the textual prompt. As mentioned above, $V'_i$ represents the adapted feature maps extracted from the $i$-th attention block, and the textual prompt is $T \in \mathbb{R}^{B \times L \times C}$.

The cross-attention process can be formally expressed as follows (Equation VI). For each hierarchical feature $F'_i$, we computed the attention scores $A_i$ with respect to the text $T$:

$$A_i = \operatorname{softmax}\left(\frac{Q_i K^{T}}{\sqrt{d_k}}\right) \tag{VI}$$

where $Q_i \in \mathbb{R}^{B \times D_i H_i W_i \times C}$ are the queries derived from $F'_i$, and $K \in \mathbb{R}^{B \times L \times C}$ are the keys derived from the textual prompt $T$. The dimensionality $d_k$ represents the size of the keys and serves as a scaling factor to ensure stable gradients during training. The attention output $O_i$ for each feature block can then be computed as in Equation VII:

$$O_i = A_i V_i \tag{VII}$$

where $V_i$ denotes the values corresponding to $F'_i$ and is dimensioned similarly to $F'_i$. The final output from the cross-attention mechanism can be represented as Equation VIII:

$$O = [O_1, O_2, \ldots, O_N] \in \mathbb{R}^{B \times DHW \times C} \tag{VIII}$$

resulting in a combined output that integrates both visual and textual information across multiple layers. This enriched representation was then utilized as a cross-modal prompt in the subsequent stages of SAM's prompt encoder, effectively bridging the gap between visual features and the semantic understanding derived from text.
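To make the computation concrete, the following is a minimal, single-head PyTorch sketch of one such cross-attention block; the module and projection names are illustrative, not the paper's code. One assumption is made explicit: the values are projected from the text embedding so that the matrix product in Equation VII is shape-consistent, which leaves each output $O_i$ dimensioned like the flattened $F'_i$, as the text describes.

```python
import torch
import torch.nn as nn

class CrossModalTextAttention(nn.Module):
    """One cross-attention block: volumetric image features attend to a text prompt."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)  # queries Q_i from the flattened F'_i
        self.k_proj = nn.Linear(dim, dim)  # keys K from the text prompt T
        self.v_proj = nn.Linear(dim, dim)  # values (assumed projected from T)
        self.scale = dim ** -0.5           # 1 / sqrt(d_k)

    def forward(self, feat: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feat: (B, D_i*H_i*W_i, C) flattened volumetric feature map F'_i
        # text: (B, L, C) textual prompt embedding T
        q = self.q_proj(feat)
        k = self.k_proj(text)
        v = self.v_proj(text)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # Eq. VI
        return attn @ v  # Eq. VII: O_i, shaped like the flattened F'_i


def fuse_levels(outputs: list[torch.Tensor]) -> torch.Tensor:
    # Eq. VIII: concatenate the per-level outputs into one cross-modal prompt.
    return torch.cat(outputs, dim=1)  # (B, sum of D_i*H_i*W_i, C)
```

Applying one such block per hierarchical level and concatenating the outputs (Equation VIII) yields the cross-modal prompt passed to SAM's prompt encoder.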
3.4. Lightweight mask decoder
The original SAM mask decoder comprises merely two transformer layers, two transposed convolution layers, and a multilayer perceptron layer. In the context of 3D medical image processing tasks, we replaced the 2D convolutions with 3D convolutions to enable direct 3D mask generation. Given that many anatomical structures or lesions in medical images are relatively small, higher-resolution outputs are often necessary to distinguish the segmented elements clearly.

In the image encoder of SAM, the patch embedding process of the transformer backbone embeds each 16 × 16 patch into a feature vector, resulting in a 16 × 16 down-sampling of the input. The SAM mask decoder employs two consecutive transposed convolution layers to up-sample the feature map by a factor of four. However, the final prediction generated by SAM still has a resolution four times lower than the original input shape. To address this problem, we employed progressive up-sampling, making moderate adjustments to the SAM decoder by integrating two additional transposed convolution operations. With each layer up-sampling the feature maps by a factor of two, the four transposed convolutional layers progressively restored the feature maps to their original input resolution. In addition, we introduced a multilayer aggregation mechanism, designing a network akin to a "U-shaped" architecture: we combined intermediate feature maps from stages 1–4 of the image encoder phase with prompts generated during the cross-modal reference prompt generation phase to enrich the mask features. After up-sampling the mask feature map to the original resolution, we concatenated it with the original image and applied another 3D convolution to fuse the information and generate the final mask, thereby better leveraging information available at the original resolution.
4. Experiments

4.1. Experimental setup
We conducted a comprehensive evaluation of our segmentation method across four medical image segmentation tasks spanning three distinct imaging modalities: CT-based tumor segmentation, MRI-based cardiac segmentation, and multi-organ segmentation from multi-modal datasets. Our approach was rigorously compared against state-of-the-art methods on the CT imaging tasks. In addition, we assessed our method's performance on the MRI cardiac and multi-organ segmentation tasks, providing a thorough analysis of its generalization capabilities, and conducted an in-depth ablation study to elucidate the contributions of its constituent components.

4.1.1. Datasets
The kidney tumor segmentation (KiTS21) dataset⁴³ is a comprehensive collection designed for the segmentation of kidneys, tumors, and cysts in CT imaging. It comprises 300 publicly available training cases and 100 withheld testing cases. The data are 3D CT volumes stored in the .nii.gz format. The image dimensions exhibit significant variability, with voxel spacing ranging from (0.5, 0.44, 0.44) mm to (5.0, 1.04, 1.04) mm and sizes ranging from (29, 512, 512) to (1,059, 512, 796). The dataset includes annotations for three anatomical structures: kidneys, tumors, and cysts. Kidneys and tumors are present in all training cases, while cysts appear in 49.33% of the cases. This dataset serves as a critical resource for advancing automated segmentation techniques in medical imaging analysis.

The Medical Segmentation Decathlon (MSD) pancreas tumor dataset¹² consists of 281 contrast-enhanced