Transformers also exhibit superior transferability to downstream tasks through extensive pretraining and superior performance in modeling global contexts.17 In many applications of machine translation and NLP, long short-term memory and artificial neural network models have been successfully replaced by transformers.10

The results of transformers have matched or surpassed those of state-of-the-art methods in various image recognition tasks.12-16 The original design of transformers presented by Dosovitskiy et al.11 has undergone several changes to suit CV tasks. For instance, Parmar et al.12 modified transformers to apply the self-attention mechanism only in the local neighborhood of each query pixel. A novel transformer model, known as sparse transformers, was proposed by Child et al.,14 which attained global self-attention using scalable approximations. Wu et al.15 introduced convolutions into ViTs to combine the benefits of both convolutions and transformers.

In general, large amounts of data and powerful computers are required for training ViTs, limiting their application in medical imaging diagnostics.17-20 Hence, the research presented here explores the possibility of utilizing transformer-based attention features along with DL for the classification of brain tumors with a relatively small clinical dataset. We proposed mechanisms to tackle data scarcity and high processing power requirements while achieving sufficient model performance.

After image classification, MRI images with tumors typically undergo segmentation. Although segmentation generally provides detailed information about the spatial extent of tumors, classification offers insights into their nature. Therefore, segmentation was not investigated herein, and only image classification was considered. However, as segmentation and classification work in tandem to provide a comprehensive understanding of disease diagnosis, existing studies can be referred to for more information on medical image segmentation.21-25
2.1. ViT model

ViTs, as presented by Dosovitskiy et al.,11 mimic the original transformer model developed for NLP tasks, using image patches as words for the input. ViTs can be used for image classification primarily because they reduce architectural complexity and offer enhanced scalability and training efficiency. Recent studies have shown that the direct application of transformers with global self-attention to input images provides excellent results on ImageNet classification.15 Moreover, ViTs can achieve high training accuracy with less computational time. The success of transformers in medical image segmentation and classification was proven in the diagnosis of breast cancer using biopsy images and an end-to-end holistic attention network.16 ViT-based medical image classification and segmentation continues to be a popular topic among researchers.
ViT contains stacks of encoder and decoder layers in its core, which will be hereinafter referred to as an encoder and a decoder, respectively. The encoder comprises two sublayers, namely the multihead attention and feed-forward layers. The decoder comprises three sublayers, where the masked multihead attention layer is followed by the multihead attention layer and the feed-forward layer. The encoder maps an input sequence x = (x₁, x₂, …, xₙ) to a sequence z = (z₁, z₂, …, zₙ). Based on z, the decoder generates an output sequence y = (y₁, y₂, …, yₙ), one element at a time. The model is auto-regressive at every step and uses already generated data as additional input to create a new data instance. For more detail on the implementation of the ViT architecture, please refer to "An image is worth 16 × 16 words" by Dosovitskiy et al.11
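To make the sublayer structure concrete, the following is a minimal sketch of one encoder block in PyTorch. It is an illustrative implementation under our own naming, not the exact code of the cited work or of this study: a multihead self-attention sublayer followed by a feed-forward sublayer, each with a residual connection and layer normalization, using ViT-Base-like hyperparameters.

```python
import torch
import torch.nn as nn


class EncoderBlock(nn.Module):
    """One transformer encoder block: multihead self-attention + feed-forward,
    each sublayer wrapped with layer normalization and a residual connection."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 12,
                 mlp_dim: int = 3072, dropout: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads,
                                          dropout=dropout, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_dim, embed_dim),
            nn.Dropout(dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Multihead self-attention sublayer (pre-norm variant, residual connection)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Feed-forward sublayer (pre-norm variant, residual connection)
        x = x + self.mlp(self.norm2(x))
        return x


# Example: a batch of 2 sequences of 197 tokens (196 patch tokens + 1 class token)
tokens = torch.randn(2, 197, 768)
print(EncoderBlock()(tokens).shape)  # torch.Size([2, 197, 768])
```

Stacking several such blocks yields the encoder described above; the decoder adds a masked multihead attention sublayer in front of the same two sublayers.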
There are two approaches to ViTs: hybrid and transformer-only architectures.20 Hybrid architectures use a CNN to produce an embedding for an image or a subregion of an image (patch), and this encoding is used as the input for a subsequent transformer. In the hybrid method, a CNN is used to process the lower-level features of the input. In transformer-only architectures, a trainable part of the architecture projects patches into an embedding space, and no hand-coded or convolutional architecture is used; the transformer architecture alone learns both the lower- and higher-level features.16 Herein, the transformer-only architecture is the focus, and the model developed by Dosovitskiy et al.11 was used for image classification.
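As an illustration of how the transformer-only approach turns an image into a token sequence, the sketch below (assuming PyTorch; the PatchEmbedding name and hyperparameters are ours, not from the cited work) splits an image into 16 × 16 patches, applies a shared linear projection into the embedding space, and prepends a learnable class token with position embeddings, producing the sequence consumed by encoder blocks such as the one sketched earlier.

```python
import torch
import torch.nn as nn


class PatchEmbedding(nn.Module):
    """Split an image into non-overlapping patches and project each patch
    into an embedding space, then prepend a class token and add position
    embeddings (no separate convolutional feature extractor)."""

    def __init__(self, image_size: int = 224, patch_size: int = 16,
                 in_channels: int = 3, embed_dim: int = 768):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # A convolution with stride = kernel size = patch size is equivalent to
        # cutting the image into patches and applying one shared linear projection.
        self.project = nn.Conv2d(in_channels, embed_dim,
                                 kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        b = images.shape[0]
        # (B, C, H, W) -> (B, embed_dim, H/P, W/P) -> (B, num_patches, embed_dim)
        patches = self.project(images).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(b, -1, -1)      # one class token per image
        tokens = torch.cat([cls, patches], dim=1)   # prepend the class token
        return tokens + self.pos_embed              # add learnable position embeddings


# Example: two 224 x 224 RGB images -> 196 patch tokens + 1 class token each
x = torch.randn(2, 3, 224, 224)
print(PatchEmbedding()(x).shape)  # torch.Size([2, 197, 768])
```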
Transformers have been used for tumor analysis in several studies. For instance, Asiri et al.25 used a fine-tuned ViT model with the CE-MRI dataset, containing only 5712 images, for brain tumor classification. The lack of diversity and the limited number of images in the dataset affected the generalizability of the ViT to real-world scenarios, suggesting that further research is needed to improve its accuracy and reliability, particularly for complex cases. Overall, the current ViT model used for brain tumor classification might not be fully optimized, and further research is required to enhance its diversity, reliability, and accuracy. This study focused on addressing this research gap in brain tumor classification using diverse BraTS datasets that primarily contain glioma MRIs. These datasets offered a benchmarked set of ground-truth labels for glioma classification, addressing the limitations of existing studies. Moreover, potential model optimization techniques and MRI preprocessing techniques were discussed for their use in improving the model results.
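One common way to cope with the data scarcity noted above is to start from a ViT pretrained on a large natural-image corpus and retrain only its classification head, in the spirit of the fine-tuning approach just described. The sketch below is illustrative only; it assumes torchvision's pretrained vit_b_16 and a hypothetical three-class tumor label set, and it is not the pipeline used in this study.

```python
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

NUM_CLASSES = 3  # hypothetical label set, e.g., three tumor categories

# Load a ViT-Base/16 pretrained on ImageNet and swap in a new classification head.
weights = ViT_B_16_Weights.IMAGENET1K_V1
model = vit_b_16(weights=weights)
model.heads.head = nn.Linear(model.heads.head.in_features, NUM_CLASSES)

# Freeze the backbone so only the new head is trained (helps on small datasets).
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on a dummy batch standing in for MRI slices.
preprocess = weights.transforms()          # resize/normalize to the pretraining setup
images = preprocess(torch.rand(4, 3, 256, 256))
labels = torch.randint(0, NUM_CLASSES, (4,))

optimizer.zero_grad()
logits = model(images)
loss = criterion(logits, labels)
loss.backward()
optimizer.step()
```

Freezing the backbone keeps the number of trainable parameters small, which reduces both the compute requirement and the risk of overfitting on a small clinical dataset.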

