
Artificial Intelligence in Health                                       ViT for Glioma Classification in MRI



3. Methodology

This section presents in detail the dataset preparation, including data preprocessing, the ViT architecture, and model training, with special attention to the pretraining and fine-tuning approaches.

3.1. Dataset preparation and preprocessing

The BraTS 2015 dataset⁷ containing 220 MRI scans of high-grade gliomas (HGGs) and 54 scans of low-grade gliomas (LGGs) was used for model training, validation, and testing. The dataset contained MRI images of each patient in four modalities: T1 (spin-lattice relaxation), T1Gd (postcontrast T1-weighted), T2 (spin–spin relaxation), and T2-Flair (fluid attenuation inversion recovery). The analysis was restricted to the axial plane images of T1-MRIs. The file format of the dataset was ".mha," which is primarily associated with the Insight Segmentation and Registration Toolkit, whereas the DL architecture used ".png" as the input image format. Hence, the T1-MRIs of each patient were converted to ".png" using "mha2png." Each patient's record resulted in 154 independent ".png" files, corresponding to brain slices in the coronal plane, yielding a ".png" image dataset containing 42,196 images. Using the tumor mask of the BraTS 2015 dataset, each slice was first labeled based on the presence or absence of a brain tumor. Then, slices with tumors were categorized into HGG or LGG tumors using the auxiliary data available in the BraTS 2015 dataset.

Intensity uniformization is another essential step in the preprocessing of MRI images. The pixel intensity of MRI images in BraTS ranges from −1000 to +1000, with more than 2000 levels. To aid image handling in limited-resource environments, this pixel intensity range was reduced and scaled to the intensity levels 0–255, i.e., 8 bits/pixel grayscale. During preprocessing, the values above the upper gray level (G_u) and below the lower gray level (G_d) were assigned white and black, respectively. The center, known as the window level (WL), and the window width (WW) were adjusted to set the upper and lower gray levels. The upper gray level was calculated as G_u = WL + WW/2, and the lower gray level as G_d = WL − WW/2. Table 1 summarizes the effect of different values of WL, WW, and Range (G_d, G_u) on the preprocessed images. For instance, input images preprocessed with WL = 0, WW = 400, and Range (−200 – 200) failed to show the fine details of brain MRI images. After a few trial-and-error iterations, Range (−200 – 1000), WW = 1200, and WL = 400 were chosen as the best parameters for 8-bit/pixel grayscale conversion.

Moreover, as one patient record holds 154 (or 155) images, each image was considered a single input in the analysis and classified into one of three classes: HGG, LGG, and nontumorous. The dataset was first developed using 120 patient records comprising 18,480 images, which were subgrouped into three subsets of 40 patients each. The dataset was further separated into two subsets, namely training and testing; approximately 70% of the data were used for training and 30% for testing.

3.2. ViT architecture

ViTs are a family of neural network architectures that convert one input sequence into another. During preprocessing in ViTs, the input image is split into fixed-size patches, and an input sequence is generated by linearly embedding each patch into a vector and adding position embedding information (Figure 1). The encoder transforms the input sequence into an embedding space, which is a vector representation of the image. Subsequently, the decoder receives the data in the embedding space and converts them into an output vector. An embedding layer generally precedes each encoder or decoder to process their respective inputs, and an output layer is used toward the end of the architecture to generate the final output. ViTs perform classification using an extra learnable layer, i.e., a classifier.²⁰ Figure 1 summarizes the process of image classification using the ViT. Herein, a modified version of the model²⁶ was used for the classification of MRI images from the BraTS 2015 dataset. The classification operation flow of the ViT is shown in Figure 2, and the performance of the proposed system was analyzed using accuracy, training and validation loss, and the confusion matrix.

3.3. Model pretraining and fine-tuning

The ViT is a DL model that requires a considerably large dataset for training. As BraTS is a relatively small dataset for training the ViT effectively, pretraining was performed to generate initial weights. CIFAR-10, a simple dataset, can serve as a foundation for pretraining models for medical image analysis.²⁷ The ViT was pretrained using the grayscale images of CIFAR-10, comprising 60,000 32 × 32 images belonging to 10 classes. All classes in CIFAR-10 are mutually exclusive, without any overlap between classes, and are well defined and bounded. For pretraining, the dataset was split into five training batches and one test batch, each comprising 10,000 images. The test batch of CIFAR-10 was created using exactly 10,000 randomly selected images, and the training batches contained the remaining 50,000 images. Some training batches contained more images from one class than others because the remaining images were added to the training batches in a random order.
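The intensity windowing step of Section 3.1 can be sketched as follows. This is a minimal illustration of the G_u/G_d formulas, not the authors' implementation; the function name and the use of NumPy are assumptions.

```python
import numpy as np

def window_to_grayscale(image, wl=400, ww=1200):
    """Map raw MRI intensities to 8-bit grayscale using a window
    level (WL) and window width (WW), per Section 3.1 (illustrative
    sketch, not the paper's code).

    G_u = WL + WW/2 and G_d = WL - WW/2; values above G_u become
    white (255), values below G_d become black (0), and values in
    between are scaled linearly to 0-255.
    """
    g_u = wl + ww / 2  # upper gray level
    g_d = wl - ww / 2  # lower gray level
    clipped = np.clip(image.astype(np.float64), g_d, g_u)
    scaled = (clipped - g_d) / (g_u - g_d) * 255.0
    return scaled.astype(np.uint8)
```

With the paper's chosen parameters (WL = 400, WW = 1200), the window covers the range (−200, 1000), so the full −1000 to +1000 intensity span is compressed into 8 bits/pixel.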
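The approximately 70%/30% training/testing separation of Section 3.1 could be realized at the patient level as sketched below; the function name and random seed are illustrative assumptions, and splitting by patient ID (rather than by slice) is one reasonable reading of the described procedure.

```python
import random

def split_patients(patient_ids, train_frac=0.70, seed=0):
    """Split patient records into training and testing subsets
    (approximately 70%/30%, as in Section 3.1). Splitting at the
    patient level keeps all ~154 slices of a patient in the same
    subset, avoiding leakage of one patient's slices across sets.
    """
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)  # deterministic shuffle
    n_train = round(train_frac * len(ids))
    return ids[:n_train], ids[n_train:]
```

For the 120 patient records described in the text, this yields 84 training and 36 testing patients.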
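The ViT input preparation described in Section 3.2 (fixed-size patches, linear embedding, position embeddings) can be sketched in a few lines. The random projection and position matrices below stand in for learned parameters, and the patch size and embedding dimension are arbitrary illustrative choices.

```python
import numpy as np

def patchify_and_embed(image, patch=8, dim=64, rng=None):
    """Sketch of ViT input-sequence construction (Section 3.2):
    split a square grayscale image into fixed-size patches, flatten
    each patch, project it linearly, and add a position embedding.
    Random matrices stand in for learned parameters."""
    rng = rng or np.random.default_rng(0)
    h, w = image.shape
    # Rearrange (h, w) into (num_patches, patch*patch) flat patches.
    patches = (image.reshape(h // patch, patch, w // patch, patch)
                    .transpose(0, 2, 1, 3)
                    .reshape(-1, patch * patch))
    W_e = rng.standard_normal((patch * patch, dim))     # linear embedding
    pos = rng.standard_normal((patches.shape[0], dim))  # position embeddings
    return patches @ W_e + pos  # input sequence for the encoder
```

A 32 × 32 CIFAR-10-sized image with 8 × 8 patches yields a sequence of 16 embedded patch vectors, which the encoder then maps into the embedding space described in the text.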
            Volume 2 Issue 1 (2025)                         71                               doi: 10.36922/aih.4155