Here, $f^e$ is the sequence of $L$ word embeddings, each with $C_e$ dimensions, i.e., $f^e = \{ f_i^e \}_{i=1}^{L}$, where each word is represented by a $C_e$-dimensional embedding. By applying a pooling operation over these word embeddings, we obtained a sentence-level embedding $s \in \mathbb{R}^{C_e}$.
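The pooling operation itself is not specified on this page. As a minimal sketch, the snippet below assumes masked mean pooling over the token dimension in PyTorch; the function name, tensor shapes, and the attention-mask argument are illustrative assumptions rather than details taken from the paper.

```python
import torch


def sentence_embedding(word_embeddings: torch.Tensor,
                       attention_mask: torch.Tensor) -> torch.Tensor:
    """Pool L word embeddings of dimension C_e into one sentence embedding s.

    word_embeddings: (B, L, C_e) token-level outputs of the text encoder.
    attention_mask:  (B, L), 1 for real tokens and 0 for padding.
    """
    mask = attention_mask.unsqueeze(-1).to(word_embeddings.dtype)   # (B, L, 1)
    summed = (word_embeddings * mask).sum(dim=1)                    # (B, C_e)
    counts = mask.sum(dim=1).clamp(min=1e-6)                        # (B, 1)
    return summed / counts                                          # masked mean pooling
```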
            3.3.2. Cross-modal projector
While text embeddings derived from pre-trained language models capture rich semantic representations, a significant gap exists between these representations and those obtained from visual encoders. This semantic disparity poses challenges in cross-modal fusion, as the two modalities do not naturally reside in the same embedding space. To address this, we adopted a strategy inspired by vision-and-language bidirectional encoder representations from transformers, wherein we employed a multilayer perceptron to align the text and image embeddings. This allows both modalities to be projected into a unified feature space, enabling more effective interaction. Specifically, for each word embedding $f_i^e$ in $f^e$, the sparse embedding can be obtained by adopting the cross-modal multilayer perceptron (Equation III):

$s_i = \mathrm{MLP}(f_i^e) \in \mathbb{R}^{C_v}$   (III)
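Equation III only states that a multilayer perceptron maps each $C_e$-dimensional word embedding into the $C_v$-dimensional visual space. The following is a minimal PyTorch sketch of such a projector; the depth, hidden width, and activation are assumptions, not taken from the paper.

```python
import torch.nn as nn


class CrossModalProjector(nn.Module):
    """Minimal MLP projector for Equation III: maps C_e-dimensional word
    embeddings into the C_v-dimensional visual space. Depth, hidden width,
    and activation are illustrative assumptions."""

    def __init__(self, c_text: int, c_visual: int, hidden: int = 1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c_text, hidden),
            nn.GELU(),
            nn.Linear(hidden, c_visual),
        )

    def forward(self, f_e):       # f_e: (B, L, C_e) word embeddings
        return self.mlp(f_e)      # (B, L, C_v) sparse text-side embeddings
```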
            3.3.3. Image feature extraction
As previously mentioned, we integrated lightweight adapters into our 3D SAM to efficiently adapt the model for processing volumetric medical images. In this step, we extracted the features produced by each attention block as cross-attention visual hierarchical features.

Let $V_i \in \mathbb{R}^{B \times D_i \times H_i \times W_i \times C}$ denote the output of the $i$-th attention block, where $B$ is the batch size, and $H_i$, $W_i$, and $D_i$ represent the height, width, and depth of the feature maps, respectively. This extraction allowed us to leverage the unique focus of each attention block on different aspects of the input data, capturing a rich representation of 3D spatial patterns. The adapted features are computed as Equation IV:

$V_i' = \mathrm{Adapter}(V_i), \quad i \in \{1, 2, \ldots, N\}$   (IV)

where $N = 4$. We can obtain a collection of image features, as depicted in Equation V:

$V' = \{V_1', V_2', \ldots, V_N'\}$   (V)
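The internal design of the adapter is not detailed on this page. The sketch below assumes a standard residual bottleneck adapter applied to the token sequence of each attention block, producing the collection $V'$ of Equation V; the bottleneck ratio, the residual connection, and the helper function are illustrative assumptions.

```python
import torch.nn as nn


class Adapter(nn.Module):
    """Residual bottleneck adapter for Equation IV. The bottleneck ratio and
    residual connection are assumptions; the paper only names the Adapter here."""

    def __init__(self, channels: int, ratio: int = 4):
        super().__init__()
        self.down = nn.Linear(channels, channels // ratio)
        self.act = nn.GELU()
        self.up = nn.Linear(channels // ratio, channels)

    def forward(self, v):                           # v: (B, D_i*H_i*W_i, C) tokens of block i
        return v + self.up(self.act(self.down(v)))  # adapted features V'_i


def extract_hierarchical_features(block_outputs, adapters):
    """Apply one adapter per attention block to obtain V' = {V'_1, ..., V'_N} (Equation V)."""
    return [adapter(v) for v, adapter in zip(block_outputs, adapters)]
```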
3.3.4. Hierarchical cross-attention

The hierarchical cross-attention architecture is designed to integrate multi-level visual features with textual inputs, enabling a deeper understanding of cross-modal data in 3D tasks such as medical image analysis. By extracting hierarchical features from each attention block in a 3D SAM, the architecture leverages the fact that each layer focuses on different aspects of the input data, from low-level details to high-level semantics. This structure enhances the model's ability to relate complex 3D spatial patterns with corresponding textual prompts, improving cross-modal understanding. Figure 2 shows the hierarchical cross-attention architecture.

In this architecture, the inputs include both the hierarchical image features, $V' = \{V_1', V_2', \ldots, V_N'\}$, derived from each attention block, and a textual prompt $T$, which encodes the semantic information. These inputs are fused through a cross-attention mechanism where each layer of visual features interacts with the textual input, allowing mutual enrichment of modalities. The output is a cross-modal prompt that combines visual and textual information, which can be fed into SAM's prompt encoder to guide tasks such as segmentation or object detection in 3D medical images.
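As a rough sketch of this fusion step, the module below assumes one multi-head cross-attention layer per visual level, with the projected text tokens as queries and that level's visual tokens as keys and values, summing the per-level outputs into a single cross-modal prompt. The head count, the summation fusion, and the final linear projection are assumptions made for illustration, not the paper's stated design.

```python
import torch.nn as nn


class HierarchicalCrossAttention(nn.Module):
    """Sketch of hierarchical cross-attention: one cross-attention layer per
    visual level; text tokens act as queries, each level's visual tokens as
    keys/values. Head count, summation fusion, and output projection are
    illustrative assumptions."""

    def __init__(self, dim: int, num_levels: int = 4, num_heads: int = 8):
        super().__init__()
        self.levels = nn.ModuleList([
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_levels)
        ])
        self.proj = nn.Linear(dim, dim)

    def forward(self, text_tokens, visual_levels):
        # text_tokens:   (B, L, C_v) text embeddings after the cross-modal projector
        # visual_levels: list of N tensors, each (B, D_i*H_i*W_i, C_v)
        prompt = 0
        for attn, v in zip(self.levels, visual_levels):
            fused, _ = attn(query=text_tokens, key=v, value=v)
            prompt = prompt + fused            # accumulate level-wise cross-modal features
        return self.proj(prompt)               # (B, L, C_v) cross-modal prompt embedding
```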
In the hierarchical cross-attention architecture, the cross-attention mechanism is designed to facilitate interaction between the hierarchical image features and









Figure 2. The structure of the cross-modal prompt embedding module. The left part illustrates the overall architecture, where hierarchical visual embeddings from four stages interact with aligned textual embeddings using cross-attention mechanisms to generate cross-modal prompt embeddings. The right part details the cross-attention mechanism, showing how attention weights are computed to align textual and visual embeddings through linear transformations and fusion, enabling effective multi-modal integration for downstream tasks.

