Transformers also exhibit superior transferability to downstream tasks through extensive pretraining and superior performance in modeling global contexts. In many applications of machine translation and NLP, long short-term memory networks and other artificial neural networks have been successfully substituted by transformers.10

The results of transformers have matched or surpassed those of state-of-the-art methods in various image recognition tasks.12-16 The original design of transformers presented by Dosovitskiy et al.11 has undergone several changes for suitability with CV tasks. For instance, Parmar et al.12 modified transformers so that the self-attention mechanism is applied only in a local neighborhood of each query pixel. A novel transformer model, known as the sparse transformer, was proposed by Child et al.,14 which attained global self-attention using scalable approximations. Wu et al.15 introduced convolutions into ViTs to combine the strengths of both convolutions and transformers.

In general, large amounts of data and powerful computers are required for training ViTs, limiting their application in medical imaging diagnostics.17-20 Hence, the research presented here exploits the possibility of utilizing transformer-based attention features along with DL for the classification of brain tumors with a relatively small clinical dataset. We proposed mechanisms to tackle data scarcity and high processing power requirements while achieving sufficient model performance.

After image classification, the MRI images with tumors underwent segmentation. Although segmentation generally provides detailed information about the spatial extent of tumors, classification offers insights into their nature. Therefore, segmentation was not researched, and only image classification was focused on herein. However, as segmentation and classification work in tandem to provide a comprehensive understanding of disease diagnosis, existing studies can be referred to for more information on medical image segmentation.21-25

2.1. ViT model

ViTs, as presented by Dosovitskiy et al.,11 mimic the original transformer model developed for NLP tasks by using image patches as words for the input. ViTs can be used for image classification primarily because they reduce architectural complexity and offer enhanced scalability and training efficiency. Recent studies have shown that the direct application of transformers with global self-attention to input images provides excellent results on ImageNet classification.15 Moreover, ViTs can achieve high training accuracy with less computational time.16 The success of transformers in medical image segmentation and classification was proven in the diagnosis of breast cancer using biopsy images and an end-to-end holistic attention network.17 ViT-based medical image classification and segmentation continues to be a popular topic among researchers.

ViT contains stacks of encoder and decoder layers at its core, hereinafter referred to as the encoder and the decoder, respectively. The encoder comprises two sublayers, namely the multihead attention and feed-forward layers. The decoder comprises three sublayers, in which the masked multihead attention layer is followed by the multihead attention and feed-forward layers. The encoder maps an input sequence x = (x₁, x₂, …, xₙ) to a sequence z = (z₁, z₂, …, zₙ). Based on z, the decoder generates an output sequence y = (y₁, y₂, …, yₙ), one element at a time. The model is auto-regressive at every step, using the already generated elements as additional input when creating the next one. For more detail on the implementation of the ViT architecture, please refer to "An image is worth 16 × 16 words" by Dosovitskiy et al.11
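
To make the encoder structure described above concrete, the following is a minimal sketch in PyTorch of one encoder layer, showing the two sublayers named in the text: multihead self-attention and a feed-forward network, each combined with layer normalization and a residual connection. This is an illustrative sketch rather than the implementation used in this study; the dimensions (embed_dim = 768, num_heads = 12, mlp_dim = 3072) are assumptions borrowed from the ViT-Base configuration.

import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    # One ViT encoder layer (illustrative): multihead self-attention
    # followed by a feed-forward network, each preceded by layer
    # normalization and wrapped in a residual connection (pre-norm).
    def __init__(self, embed_dim: int = 768, num_heads: int = 12, mlp_dim: int = 3072):
        super().__init__()
        self.norm1 = nn.LayerNorm(embed_dim)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, mlp_dim),
            nn.GELU(),
            nn.Linear(mlp_dim, embed_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sublayer 1: global multihead self-attention over the token sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # Sublayer 2: position-wise feed-forward network.
        return x + self.mlp(self.norm2(x))

# Example: a batch of one sequence of 196 patch tokens plus a class token.
# z = EncoderLayer()(torch.randn(1, 197, 768))

Stacking several such layers yields the encoder that maps the input sequence x to the contextualized sequence z described above.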

There are two approaches to ViTs: hybrid and transformer-only architectures.20 Hybrid architectures use a CNN to produce an embedding for an image or a subregion of an image (patch); this encoding is then used as the input to a subsequent transformer. In the hybrid method, the CNN processes the lower-level features of the input. In transformer-only architectures, a trainable part of the architecture projects the patches into an embedding space, and no hand-coded or convolutional component is used; the transformer alone learns both the lower- and higher-level features.16 Herein, the transformer-only architecture was the focus, and the model developed by Dosovitskiy et al.11 was used for image classification.
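
As a concrete illustration of the transformer-only approach, the sketch below shows how flattened image patches can be linearly projected into the embedding space and classified through a learnable class token, in the spirit of Dosovitskiy et al.11 This is a minimal sketch under stated assumptions, not the implementation used in this study: the image size, patch size, depth, and the two-class head (e.g., high- versus low-grade glioma) are illustrative, and a hybrid variant would simply replace the linear projection with a small CNN stem.

import torch
import torch.nn as nn

class MinimalViT(nn.Module):
    # Transformer-only ViT sketch (illustrative): trainable linear patch
    # projection (no CNN stem), learnable class token and position
    # embeddings, a stack of standard encoder layers, and a linear head.
    def __init__(self, image_size: int = 224, patch_size: int = 16,
                 embed_dim: int = 768, num_heads: int = 12,
                 mlp_dim: int = 3072, depth: int = 12, num_classes: int = 2):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2      # e.g., 14 * 14 = 196
        patch_dim = 3 * patch_size * patch_size            # flattened RGB patch
        self.patch_size = patch_size
        self.proj = nn.Linear(patch_dim, embed_dim)        # patches as "words"
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(embed_dim, num_heads, mlp_dim,
                                       activation="gelu", batch_first=True,
                                       norm_first=True),
            num_layers=depth,
        )
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        B, C, H, W = img.shape
        p = self.patch_size
        # Split the image into non-overlapping p x p patches and flatten each.
        patches = img.unfold(2, p, p).unfold(3, p, p)      # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(B, -1, C * p * p)
        tokens = self.proj(patches)                        # linear patch embedding
        cls = self.cls_token.expand(B, -1, -1)
        x = torch.cat([cls, tokens], dim=1) + self.pos_embed
        x = self.encoder(x)
        return self.head(x[:, 0])                          # classify from class token

# Example: logits = MinimalViT(num_classes=2)(torch.randn(1, 3, 224, 224))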

Transformers have been used for tumor analysis in several studies. For instance, Asiri et al.25 used a fine-tuned ViT model with the CE-MRI dataset, which contains only 5712 images, for brain tumor classification. The lack of diversity and the limited number of images in the dataset affected the generalizability of the ViT to real-world scenarios, suggesting further research to improve its accuracy and reliability, particularly for complex cases. Overall, the current ViT models used for brain tumor classification might not be fully optimized, and further research is required to enhance their diversity, reliability, and accuracy. This study focused on addressing this research gap in brain tumor classification using the diverse BraTS datasets, which primarily contain glioma MRIs. This dataset offered a benchmarked set of ground-truth labels for glioma classification, addressing the limitations of existing studies. Moreover, potential model optimization and MRI preprocessing techniques were discussed for their use in improving the model results.
