Here, $f^{e}$ is the sequence of $L$ word embeddings, each with $C_{e}$ dimensions, i.e., $f^{e} = \{f_{i}^{e}\}_{i=1}^{L}$, where each word is represented by a $C_{e}$-dimensional embedding. By applying a pooling operation over these word embeddings, we obtained a sentence-level embedding $f^{s} \in \mathbb{R}^{C_{e}}$.
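As a concrete illustration of this pooling step, the short sketch below applies mean pooling over the word dimension; both the use of PyTorch and the choice of mean pooling are assumptions, since the text only states that a pooling operation is applied.

```python
import torch

def sentence_embedding(word_embeddings: torch.Tensor) -> torch.Tensor:
    """Pool a sequence of word embeddings into a sentence-level embedding.

    word_embeddings: tensor of shape (L, C_e), one C_e-dimensional
    embedding per word, i.e., f^e = {f_i^e} for i = 1..L.
    Returns: tensor of shape (C_e,), the sentence-level embedding f^s.
    """
    # Mean pooling over the word dimension is assumed here; the paper
    # only states that "a pooling operation" is applied.
    return word_embeddings.mean(dim=0)

# Example with hypothetical sizes: L = 12 words, C_e = 768 dimensions.
f_e = torch.randn(12, 768)
f_s = sentence_embedding(f_e)  # shape: (768,)
```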
3.3.2. Cross-modal projector

While text embeddings derived from pre-trained language models capture rich semantic representations, a significant gap exists between these representations and those obtained from visual encoders. This semantic disparity poses challenges in cross-modal fusion, as the two modalities do not naturally reside in the same embedding space. To address this, we adopted a strategy inspired by vision-and-language bidirectional encoder representations from transformers, wherein we employed a multilayer perceptron to align the text and image embeddings. This allows both modalities to be projected into a unified feature space, enabling more effective interaction. Specifically, for each word embedding $f_{i}^{e}$ in $f^{e}$, the sparse embedding can be obtained by adopting the cross-modal multilayer perceptron (Equation III):

$f_{i}^{s} = \mathrm{MLP}(f_{i}^{e}) \in \mathbb{R}^{C_{v}}$ (III)
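A minimal sketch of the cross-modal projector in Equation III might look as follows; the two-layer design, hidden width, activation, and the example dimensions C_e = 768 and C_v = 384 are all illustrative assumptions, as the text only specifies a multilayer perceptron mapping from $C_{e}$ to $C_{v}$.

```python
import torch
import torch.nn as nn

class CrossModalProjector(nn.Module):
    """Projects word embeddings (C_e) into the visual embedding space (C_v).

    Sketch of f_i^s = MLP(f_i^e) from Equation III. The two-layer design,
    hidden width, and GELU activation are illustrative assumptions.
    """
    def __init__(self, c_e: int, c_v: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c_e, hidden),
            nn.GELU(),
            nn.Linear(hidden, c_v),
        )

    def forward(self, f_e: torch.Tensor) -> torch.Tensor:
        # f_e: (L, C_e) word embeddings -> (L, C_v) sparse embeddings
        return self.mlp(f_e)

# Example usage with hypothetical dimensions.
projector = CrossModalProjector(c_e=768, c_v=384)
sparse_embeddings = projector(torch.randn(12, 768))  # shape: (12, 384)
```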
3.3.3. Image feature extraction

As previously mentioned, we integrated lightweight adapters into our 3D SAM to efficiently adapt the model for processing volumetric medical images. In this step, we extracted the features produced by each attention block as cross-attention visual hierarchical features. Let $V_{i} \in \mathbb{R}^{B \times D_{i} \times H_{i} \times W_{i} \times C}$ denote the output of the $i$-th attention block, where $B$ is the batch size, and $H_{i}$, $W_{i}$, and $D_{i}$ represent the height, width, and depth of the feature maps, respectively. This extraction allowed us to leverage the unique focus of each attention block on different aspects of the input data, capturing a rich representation of 3D spatial patterns. The adapted features are computed as Equation IV:

$V_{i}' = \mathrm{Adapter}_{i}(V_{i}), \quad i \in \{1, 2, \ldots, N\}$ (IV)

where $N = 4$. We can obtain a collection of image features, as depicted in Equation V:

$V' = \{V_{1}', V_{2}', \ldots, V_{N}'\}$ (V)
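The following sketch mirrors Equations IV and V: each of the $N = 4$ attention-block outputs passes through its own lightweight adapter, and the adapted features are collected into $V'$. The bottleneck-with-residual adapter design and the example tensor shapes are assumptions; the text only describes the adapters as lightweight.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter (down-project, nonlinearity, up-project).

    The bottleneck structure and residual connection are assumptions; the
    paper only describes the adapters as "lightweight".
    """
    def __init__(self, channels: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(channels, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, channels)

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        # v: (B, D, H, W, C) feature map from one attention block.
        # Output plays the role of V_i' in Equation IV.
        return v + self.up(self.act(self.down(v)))

N = 4  # number of attention blocks, as in Equation IV
adapters = nn.ModuleList([Adapter(channels=384) for _ in range(N)])

# Hypothetical attention-block outputs V_i with shape (B, D_i, H_i, W_i, C).
block_outputs = [torch.randn(1, 8, 16, 16, 384) for _ in range(N)]

# Equation V: collect the adapted features V' = {V_1', ..., V_N'}.
adapted_features = [adapters[i](block_outputs[i]) for i in range(N)]
```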
3.3.4. Hierarchical cross-attention

The hierarchical cross-attention architecture is designed to integrate multi-level visual features with textual inputs, enabling a deeper understanding of cross-modal data in 3D tasks such as medical image analysis. By extracting hierarchical features from each attention block in a 3D SAM, the architecture leverages the fact that each layer focuses on different aspects of the input data, from low-level details to high-level semantics. This structure enhances the model’s ability to relate complex 3D spatial patterns with corresponding textual prompts, improving cross-modal understanding. Figure 2 shows the hierarchical cross-attention architecture.

In this architecture, the inputs include both the hierarchical image features, $V' = \{V_{1}', V_{2}', \ldots, V_{N}'\}$, derived from each attention block, and a textual prompt $T$, which encodes the semantic information. These inputs are fused through a cross-attention mechanism in which each layer of visual features interacts with the textual input, allowing mutual enrichment of the two modalities. The output is a cross-modal prompt that combines visual and textual information, which can be fed into SAM’s prompt encoder to guide tasks such as segmentation or object detection in 3D medical images.
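A hedged sketch of this fusion step is given below, using PyTorch's nn.MultiheadAttention with the flattened visual tokens of each level as queries and the aligned textual embeddings as keys and values; the head count, the mean pooling of attended tokens into one prompt vector per level, and the example dimensions are assumptions made to keep the example self-contained.

```python
import torch
import torch.nn as nn

class HierarchicalCrossAttention(nn.Module):
    """Fuses hierarchical visual features with textual embeddings.

    For each level V_i', the visual tokens attend to the textual prompt T,
    and the attended tokens are pooled into one cross-modal prompt vector
    per level. Head count, pooling, and per-level attention modules are
    illustrative assumptions.
    """
    def __init__(self, dim: int, num_levels: int = 4, num_heads: int = 8):
        super().__init__()
        self.attn = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_levels)
        )

    def forward(self, visual_levels, text_tokens):
        # visual_levels: list of (B, D, H, W, C) adapted features V_i'
        # text_tokens:   (B, L, C) aligned textual embeddings
        prompts = []
        for level, attn in zip(visual_levels, self.attn):
            b, d, h, w, c = level.shape
            tokens = level.reshape(b, d * h * w, c)       # flatten 3D grid
            fused, _ = attn(query=tokens, key=text_tokens,
                            value=text_tokens)            # cross-attention
            prompts.append(fused.mean(dim=1))             # pool to (B, C)
        # Stack per-level prompts into a cross-modal prompt that could be
        # passed to SAM's prompt encoder: (B, num_levels, C).
        return torch.stack(prompts, dim=1)

# Example usage with hypothetical shapes.
fuser = HierarchicalCrossAttention(dim=384)
visual = [torch.randn(1, 8, 16, 16, 384) for _ in range(4)]
text = torch.randn(1, 12, 384)
cross_modal_prompt = fuser(visual, text)  # shape: (1, 4, 384)
```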
In the hierarchical cross-attention architecture, the cross-attention mechanism is designed to facilitate interaction between the hierarchical image features and
Figure 2. The structure of the cross-modal prompt embedding module. The left part illustrates the overall architecture, where hierarchical visual
embeddings from four stages interact with aligned textual embeddings using cross-attention mechanisms to generate cross-modal prompt embeddings.
The right part details the cross-attention mechanism, showing how attention weights are computed to align textual and visual embeddings through linear
transformations and fusion, enabling effective multi-modal integration for downstream tasks.