3. Methodology

This section presents in detail the dataset preparation, including data preprocessing, the ViT architecture, and model training, with special attention to the pretraining and fine-tuning approaches.

3.1. Dataset preparation and preprocessing
The BraTS 2015 dataset,7 containing 220 MRI scans of high-grade gliomas (HGGs) and 54 scans of low-grade gliomas (LGGs), was used for model training, validation, and testing. The dataset contains the MRI images of each patient in four modalities: T1 (spin-lattice relaxation), T1Gd (postcontrast T1-weighted), T2 (spin-spin relaxation), and T2-FLAIR (fluid-attenuated inversion recovery). The analysis was restricted to the axial-plane images of the T1 MRIs. The file format of the dataset was “.mha,” which is primarily associated with the Insight Segmentation and Registration Toolkit, whereas the DL architecture used “.png” as its input image format. Hence, the T1 MRIs of each patient were converted to “.png” using “mha2png.” Each patient’s record resulted in 154 independent “.png” files, corresponding to axial brain slices, which yielded a “.png” image dataset containing 42,196 images.
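
As a sketch of this conversion step: the code below is an illustrative assumption using SimpleITK and Pillow (the authors used the “mha2png” utility, whose exact behavior is not described), exporting each axial slice of a “.mha” volume as an 8-bit “.png” file.

```python
# Illustrative sketch only; the paper used the "mha2png" utility.
import SimpleITK as sitk
import numpy as np
from PIL import Image

def mha_to_png(mha_path: str, out_dir: str) -> None:
    """Export each axial slice of a .mha volume as an 8-bit .png file."""
    volume = sitk.GetArrayFromImage(sitk.ReadImage(mha_path))  # (slices, H, W)
    for i, sl in enumerate(volume.astype(np.float32)):
        # Naive min-max rescale to 0-255; the windowing actually used for
        # intensity uniformization is described later in this section.
        sl = (sl - sl.min()) / max(sl.max() - sl.min(), 1e-6) * 255.0
        Image.fromarray(sl.astype(np.uint8)).save(f"{out_dir}/slice_{i:03d}.png")
```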

Using the tumor masks of the BraTS 2015 dataset, each slice was first labeled based on the presence or absence of a brain tumor. Then, the slices with tumors were categorized into HGG or LGG tumors using the auxiliary data available in the BraTS 2015 dataset.
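
This labeling rule can be sketched as follows, assuming the tumor mask is loaded as a NumPy array aligned with the image volume (the helper and label strings are illustrative, not the authors’ code):

```python
import numpy as np

def label_slice(mask_slice: np.ndarray, grade: str) -> str:
    """Label one axial slice from its tumor-mask slice.

    grade: "HGG" or "LGG", taken from the BraTS auxiliary data
    for the patient record that the slice belongs to.
    """
    # A slice is considered tumorous if its mask has any nonzero pixel.
    return grade if np.any(mask_slice > 0) else "nontumorous"
```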

Intensity uniformization is another essential step in the preprocessing of MRI images. The pixel intensity of the MRI images in BraTS ranges from −1000 to +1000, with more than 2000 intensity levels. To aid image handling in limited-resource environments, this intensity range was reduced and rescaled to the intensity levels 0–255, i.e., 8-bit/pixel grayscale. During preprocessing, the values above the upper gray level (G_u) and below the lower gray level (G_d) were assigned white and black, respectively. The center, also known as the window level (WL), and the window width (WW) were adjusted to set the upper and lower gray levels: the upper gray level was calculated as G_u = WL + WW/2, and the lower gray level as G_d = WL − WW/2. Table 1 summarizes the effect of different values of WL, WW, and the range (G_d, G_u) on the preprocessed images. For instance, input images preprocessed with WL = 0 and WW = 400, i.e., a range of (−200, 200), failed to show the fine details of the brain MRI images. After a few trial-and-error iterations, WL = 400 and WW = 1200, i.e., a range of (−200, 1000), were chosen as the best parameters for the 8-bit/pixel grayscale conversion.
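
A minimal sketch of this windowing step, assuming the raw slice is available as a NumPy array (the function name is hypothetical):

```python
import numpy as np

def window_to_8bit(slice_raw: np.ndarray, wl: float = 400.0, ww: float = 1200.0) -> np.ndarray:
    """Map raw MRI intensities to 8-bit grayscale via window level/width."""
    g_u = wl + ww / 2  # upper gray level: G_u = WL + WW/2 -> 1000
    g_d = wl - ww / 2  # lower gray level: G_d = WL - WW/2 -> -200
    # Values above G_u saturate to white; values below G_d to black.
    clipped = np.clip(slice_raw, g_d, g_u)
    return ((clipped - g_d) / (g_u - g_d) * 255.0).astype(np.uint8)
```

With the chosen parameters (WL = 400, WW = 1200), the window covers exactly the range (−200, 1000) reported above.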

Moreover, as one patient record holds 154 (or 155) images, each image was considered a single input in the analysis and classified into one of three classes: HGG, LGG, and nontumorous. The dataset was first developed using 120 patient records comprising 18,480 images, which were subgrouped into three subsets of 40 patients each. The dataset was further separated into two subsets, namely training and testing; approximately 70% of the data were used for training and 30% for testing.
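
The paper does not state whether the 70/30 split was performed at the patient or the slice level; the sketch below assumes a patient-level split (helper name and seed are assumptions):

```python
import random

def split_records(record_ids: list, train_frac: float = 0.7, seed: int = 0):
    """Split patient records into training and testing subsets (~70/30)."""
    ids = sorted(record_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_frac * len(ids))
    return ids[:n_train], ids[n_train:]

train_ids, test_ids = split_records([f"pat_{i:03d}" for i in range(120)])
# 84 training records and 36 testing records out of the 120 used here.
```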

3.2. ViT architecture

ViTs are a group of neural network architectures that convert one input sequence into another. During preprocessing in ViTs, the input image is split into fixed-size patches, and an input sequence is generated by linearly embedding each patch into a vector and adding position-embedding information (Figure 1). The encoder transforms the input sequence into an embedding space, which is a vector representation of the image. Subsequently, the decoder receives the data in the embedding space and converts it into an output vector. An embedding layer generally precedes each encoder or decoder to process its respective input, and an output layer toward the end of the architecture generates the final output.20 ViTs perform classification using an extra learnable layer, i.e., the classifier. Figure 1 summarizes the process of image classification using the ViT for image recognition.26 Herein, a modified version of the model was used for the classification of the MRI images from the BraTS 2015 dataset. The classification operation flow of the ViT is shown in Figure 2, and the performance of the proposed system was analyzed using the accuracy, the training and validation losses, and the confusion matrix.
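
A minimal PyTorch sketch of the patch-and-position-embedding step described above (the image, patch, and embedding sizes are illustrative defaults, not the paper’s configuration):

```python
import torch
import torch.nn as nn

class PatchEmbedding(nn.Module):
    """Split an image into fixed-size patches, linearly embed each patch,
    and add learnable position embeddings, forming the ViT input sequence."""
    def __init__(self, img_size=224, patch_size=16, in_ch=1, dim=768):
        super().__init__()
        n_patches = (img_size // patch_size) ** 2
        # A strided convolution implements "split into patches + linear
        # projection" in a single operation.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))

    def forward(self, x):                                # x: (B, C, H, W)
        x = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)  # learnable class token
        x = torch.cat([cls, x], dim=1)                   # (B, N + 1, dim)
        return x + self.pos_embed                        # add position information
```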

3.3. Model pretraining and fine-tuning

The ViT is a DL model that requires a considerably large dataset for training. As BraTS is too small a dataset to train the ViT effectively, pretraining was performed to generate initial weights. CIFAR-10, a simple dataset, can serve as a foundation for pretraining models for medical image analysis.27 The ViT was pretrained using grayscale versions of the CIFAR-10 images, which comprise 60,000 32 × 32 images belonging to 10 classes. All classes in CIFAR-10 are mutually exclusive, without any overlap between classes, and each class is well defined and bounded. For pretraining, the dataset was split into five training batches and one test batch, with each batch comprising 10,000 images. The test batch of CIFAR-10 was created using exactly 10,000 randomly selected images, and the training batches contained the remaining 50,000 images. Some training batches contained more images from one class than another because the remaining images were added to the training batches in a random order.
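
A sketch of loading CIFAR-10 in grayscale for such pretraining, assuming torchvision (the exact transform chain used by the authors is not reported):

```python
import torchvision
import torchvision.transforms as T

# CIFAR-10 ships as RGB images; convert to single-channel grayscale
# to match the grayscale MRI slices used for fine-tuning.
transform = T.Compose([
    T.Grayscale(num_output_channels=1),
    T.ToTensor(),
])
train_set = torchvision.datasets.CIFAR10(root="./data", train=True,
                                         download=True, transform=transform)
test_set = torchvision.datasets.CIFAR10(root="./data", train=False,
                                        download=True, transform=transform)
# 50,000 training images (the five 10,000-image batches) and the
# 10,000-image test batch, as described above.
```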

