
Artificial Intelligence in Health                                       ViT for neurodegeneration diagnosis



3.3. Data pre-processing

The 18F-FDG PET scans in the ADNI dataset were acquired using different types of imaging devices and during various study phases. Therefore, these scans vary significantly in their properties, including image size, number of channels, and voxel intensities. Furthermore, they include the subject's skull, which provides no beneficial information for NDD diagnosis. Moreover, raw scans may contain noise or blur due to patient movement or other technical issues. Consequently, we used the pre-processing procedure developed by Etminani et al.,4 which employs MATLAB32 and statistical parametric mapping (SPM12)33 to ensure all scans have the same properties.

The pre-processing steps for each sample are as follows:
• We converted the scan to the NIfTI format
• It was crucial to place the brain approximately in the center of the scan. Therefore, we reoriented and repositioned the brain to set the volume's origin at the anterior commissure region
• Our dataset included scans of various shapes. Hence, we normalized each scan to ensure all samples had an identical spatial size and number of channels
• Using the tissue probability map of SPM12, the brain was segmented
• The last pre-processing stage removed the subject's skull from the scan. To this end, we used the segmentation maps obtained in the previous step with a filter for skull-stripping.

The pre-processing procedure led to skull-stripped scans of size 79 × 95 × 79, representing channels, height, and width, respectively. Then, the voxel values were normalized using a min-max scaler across the channels in Python. Finally, we discarded the first ten and last nine channels of the 3D scan, since they included only a tiny fraction of the brain, resulting in 3D scans with the shape of 60 × 95 × 79.

3.4. Data reshaping

According to our experiments and the literature,6 pre-training on large amounts of data is crucial to achieving the best performance with ViTs. However, most large computer vision datasets consist of natural images with three RGB channels. Therefore, we reshaped the samples in our dataset to 3 × 570 × 950 to utilize transfer learning and available pre-trained models. This procedure constructs a three-channel image in which every channel depicts the brain along a unique axis (sagittal, coronal, and axial). Figure 2 shows the result of the data pre-processing and reshaping steps on a single scan.

3.5. Model architecture and training

Our proposed model has an architecture similar to the vanilla ViT suggested by Dosovitskiy et al.6 After training different models with and without transfer learning, we concluded that pre-training is crucial to obtaining excellent results. Therefore, we employed the Hugging Face Transformers API and model hub34 for development. Specifically, the foundation of our model is a base-sized ViT, pre-trained on ImageNet-21k35 and ImageNet 2012.36,37 Finally, we fine-tuned the model on our 18F-FDG PET scan dataset to classify NDDs.

Figure 3 illustrates the model's diagram, inspired by Dosovitskiy et al.6 First, the scan was resized to 3 × 384 × 384 to match the model's input shape. Then, the scan was divided into patches of 3 × 32 × 32, flattened, and supplied to a standard transformer along with position embeddings holding the spatial information. Finally, a multilayer perceptron head translated the model's final hidden state into class probabilities for the classification task. Table 2 summarizes the model's specifications.

We employed an AdamW optimizer (learning rate = 5e-5, weight decay = 0.15) for model development. Furthermore, an exponential learning rate decay (γ = 0.9999 per epoch) was used during training. Finally, we selected a weighted cross-entropy as the loss function, in which the weight of each class was the inverse of its frequency during training, as shown below:

\( \ell(x, y) = L = \{l_1, \dots, l_N\}^{\top} \)                 (I)

\( l_n = -\sum_{c=1}^{C} w_c \log \frac{\exp(x_{n,c})}{\sum_{i=1}^{C} \exp(x_{n,i})} \, y_{n,c} \)                 (II)
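Equation (II) is the standard weighted cross-entropy with one-hot targets. A minimal NumPy sketch (an illustration, not the authors' implementation) of the per-sample losses and of inverse-frequency class weights might look like this:

```python
import numpy as np

def weighted_cross_entropy(logits, targets, weights):
    """Weighted cross-entropy over a batch.

    logits  : (N, C) raw model outputs x
    targets : (N,) integer class labels; the one-hot y_{n,c} is implied
    weights : (C,) per-class weights w_c
    """
    # Numerically stable log-softmax: log(exp(x_c) / sum_i exp(x_i))
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    rows = np.arange(len(targets))
    # l_n = -w_{y_n} * log p_{n, y_n}  (equation II with one-hot y)
    return -weights[targets] * log_probs[rows, targets]  # vector L of equation I

# Class weights as the inverse frequency of each class in the training labels
train_labels = np.array([0, 0, 0, 1, 1, 2])
freq = np.bincount(train_labels) / len(train_labels)
w = 1.0 / freq  # so w_c * freq_c == 1 for every class
```

With uniform logits over three classes and unit weights, the loss reduces to log 3, which is a quick sanity check on the implementation.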
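The patch-embedding front end described in Section 3.5, which splits a 3 × 384 × 384 input into 144 flattened 3 × 32 × 32 patches, can be sketched as follows. This is a NumPy illustration of the tokenization step only; the actual model relies on the Hugging Face ViT implementation:

```python
import numpy as np

def patchify(image, patch=32):
    """Split a (C, H, W) image into flattened ViT patches:
    each (C, patch, patch) block becomes one token of length C*patch*patch."""
    c, h, w = image.shape
    assert h % patch == 0 and w % patch == 0
    # (C, H/p, p, W/p, p) -> (H/p, W/p, C, p, p) -> (N, C*p*p)
    blocks = image.reshape(c, h // patch, patch, w // patch, patch)
    blocks = blocks.transpose(1, 3, 0, 2, 4)
    return blocks.reshape(-1, c * patch * patch)

tokens = patchify(np.zeros((3, 384, 384), dtype=np.float32))
assert tokens.shape == (144, 3072)  # 12 x 12 patches, 3*32*32 features each
```

Each of the 144 tokens is then linearly projected and summed with a learned position embedding before entering the transformer encoder.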
Figure 2. After the initial data pre-processing, we reshaped each scan of 60 × 95 × 79 into a three-channel image of 3 × 570 × 950, in which every channel illustrates the brain along a unique axis. This data reshaping was crucial for utilizing transfer learning and models pre-trained on large computer vision datasets that contain natural three-channel RGB images. (A) Sagittal, (B) Coronal, (C) Axial.
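The reshaping illustrated in Figure 2 can be realized as a per-axis montage of 2D slices. Since 570 = 6 × 95 and 950 = 10 × 95, one plausible construction tiles up to 60 zero-padded 95 × 95 slices into a 6 × 10 grid per channel; the grid layout, padding, and slice selection below are our assumptions for illustration, not the authors' stated procedure:

```python
import numpy as np

TILE, ROWS, COLS = 95, 6, 10  # 6*95 = 570 rows, 10*95 = 950 columns

def montage(volume, axis):
    """Tile 2D slices taken along `axis` into one 570 x 950 image.

    Slices are cropped to at most 95 x 95 and zero-padded by placement;
    at most 60 (= 6 * 10) slices are used. Layout is an assumption.
    """
    slices = np.moveaxis(volume, axis, 0)  # (n_slices, h, w)
    canvas = np.zeros((ROWS * TILE, COLS * TILE), dtype=volume.dtype)
    for k in range(min(len(slices), ROWS * COLS)):
        s = slices[k][:TILE, :TILE]  # crop if larger than one tile
        r, c = divmod(k, COLS)
        canvas[r * TILE : r * TILE + s.shape[0],
               c * TILE : c * TILE + s.shape[1]] = s
    return canvas

def reshape_scan(volume):
    """Map a pre-processed 60 x 95 x 79 scan to a 3 x 570 x 950 image,
    one channel per anatomical axis."""
    return np.stack([montage(volume, a) for a in range(3)])

img = reshape_scan(np.random.rand(60, 95, 79).astype(np.float32))
assert img.shape == (3, 570, 950)
```

The resulting three-channel image can then be fed to standard RGB-pretrained vision models, which is the point of the reshaping step.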


            Volume 2 Issue 4 (2025)                         37                          doi: 10.36922/AIH025140026