8. Handling different types of data in the medical industry

This section provides an overview of how different data formats and types are handled in the medical industry when used as training data or inputs for an LLM.
8.1. Clinical notes

Clinical notes, an integral component of patient health records, have increasingly been utilized as input to LLMs in the medical domain. These notes, typically generated by health-care professionals, serve as rich repositories of patient information, including medical history, present symptoms, diagnoses, treatments, and more. Clinical notes are fed into LLMs to generate meaningful patterns, predictions, and insights. Before use, the notes are often preprocessed to ensure they are in a format that is easily digestible for the models. This preprocessing can involve converting handwritten notes into digital formats, anonymizing patient data to maintain privacy, and structuring the data in a consistent format. LLMs can process these notes directly to support activities such as condensing medical data, assisting in clinical decisions, and creating medical reports.37 To utilize clinical notes in LLMs, prompts containing questions, scenarios, or comments about the note are used, such as "Assume the role of a neurologist at the Mayo Clinic brain bank clinicopathological conference." In response to the prompt, the model provides an output that aids in evaluation or diagnosis across different medical fields.37
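As a minimal sketch of this workflow, the snippet below assembles a role-setting prompt around a clinical note. The helper names, the toy regex-based de-identification, and the sample note are illustrative assumptions rather than the cited study's method; real deployments would rely on dedicated de-identification tooling.

```python
import re

def deidentify(note: str) -> str:
    """Toy de-identification: mask date-like strings and name fields.
    Illustrative only; production systems need dedicated PHI-removal tools."""
    note = re.sub(r"\d{1,2}/\d{1,2}/\d{2,4}", "[DATE]", note)
    note = re.sub(r"(?m)^(Name|Patient):.*$", r"\1: [REDACTED]", note)
    return note

def build_prompt(note: str) -> str:
    """Wrap a clinical note in a role-setting prompt like the one quoted above."""
    role = ("Assume the role of a neurologist at the Mayo Clinic "
            "brain bank clinicopathological conference.")
    return (f"{role}\n\nClinical note:\n{deidentify(note)}\n\n"
            "Provide your assessment and most likely diagnosis.")

# The resulting string can be sent to any chat-style LLM endpoint.
print(build_prompt("Name: J. Doe\nVisit 03/14/2021: progressive memory loss."))
```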
8.2. X-rays/Images

X-rays are a medical imaging modality that uses ionizing radiation to produce images of internal body organs. This data type may include CT scans (computed tomography), chest X-rays, and bone X-rays. In medicine, X-ray images can be processed by a computer-aided detection (CAD) model, which is pre-trained to derive outputs in tensor form. These tensors are then translated into natural language, where they can be used as LLM input to generate summaries or descriptions of the X-ray images.38 Wang et al.38 illustrated how X-ray exam images are handled when used with LLMs: the images are fed into pre-trained CAD models to derive the output, the resulting tensor is translated into natural language, and the language model then draws final conclusions and summarizes the results.
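The sketch below illustrates this CAD-to-LLM flow. The cad_model, the finding labels, and the llm callable are hypothetical placeholders, not the components used by Wang et al.

```python
import torch

# Illustrative finding labels; a real CAD model defines its own output space.
FINDINGS = ["atelectasis", "cardiomegaly", "pleural effusion"]

def describe_xray(image: torch.Tensor, cad_model, llm) -> str:
    # 1. The pre-trained CAD model derives its output in tensor form,
    #    here read as one probability per finding.
    scores = torch.sigmoid(cad_model(image.unsqueeze(0))).squeeze(0)
    # 2. Translate the tensor into natural language.
    sentences = [f"{name}: probability {p:.2f}"
                 for name, p in zip(FINDINGS, scores)]
    # 3. The language model draws conclusions and summarizes the findings.
    prompt = ("Summarize the following chest X-ray findings for a clinician:\n"
              + "\n".join(sentences))
    return llm(prompt)
```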
The authors also established that X-ray images can be used as direct input to the LLM, where the images are fed into the model together with prompts to generate an image summarization or descriptive caption. The LLM further supports visual question answering, where the patients' X-ray images are fed into an image encoder (BLIP-2) that generates and embeds a natural language representation based on its understanding of the image.
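As a concrete illustration of prompt-driven visual question answering with BLIP-2, the sketch below uses the general-domain Salesforce/blip2-opt-2.7b checkpoint from Hugging Face Transformers; the image file name and the question are assumptions, and a clinical system would use a radiology-adapted model.

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# General-domain BLIP-2 checkpoint, standing in for a medically tuned one.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("chest_xray.png").convert("RGB")   # assumed local file
prompt = "Question: Is there evidence of pleural effusion? Answer:"

# The processor embeds the image and tokenizes the prompt; the model
# generates a free-text answer conditioned on both.
inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```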
Bazi et al.39 proposed a transformer encoder-decoder architecture to handle visual data when using an LLM. They extracted image features using the vision transformer (ViT) model and used a textual encoder transformer to embed the questions; the resulting visual and textual representations were subsequently fed into a multimodal decoder to generate the answers. To demonstrate how the LLM handles visual data, the authors used two VQA datasets of radiology images, PathVQA and VQA-RAD. In decoding the radiology images, the proposed model achieved 72.97% and 8.99% on VQA-RAD, and 62.37% and 83.86% on PathVQA.
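The two encoder streams of such an architecture can be sketched with off-the-shelf components, as below. The google/vit-base-patch16-224-in21k and bert-base-uncased checkpoints, the sample question, and the simple concatenation stand in for the authors' actual encoders and multimodal decoder.

```python
import torch
from PIL import Image
from transformers import (ViTImageProcessor, ViTModel,
                          BertTokenizer, BertModel)

# Stand-in checkpoints; Bazi et al.'s exact encoders and decoder differ.
vit_proc = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224-in21k")
vit = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
text_enc = BertModel.from_pretrained("bert-base-uncased")

image = Image.open("radiograph.png").convert("RGB")   # assumed local file
pixels = vit_proc(images=image, return_tensors="pt")
visual_feats = vit(**pixels).last_hidden_state        # [1, patches + 1, 768]

question = tokenizer("Is there a fracture?", return_tensors="pt")
text_feats = text_enc(**question).last_hidden_state   # [1, tokens, 768]

# A multimodal decoder would attend over both streams to generate the answer;
# here they are merely concatenated along the sequence axis as a placeholder.
fused = torch.cat([visual_feats, text_feats], dim=1)
```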
8.3. Radiological reports

Radiological reports are documents from radiologists that present the findings or interpretation of medical imaging studies such as magnetic resonance imaging (MRI), X-rays, and CT scans. The text within these reports is processed as the input for LLMs in medicine. After data augmentation, the radiological reports are used as inputs to the LLM. Tan et al.40 collected 10,602 CT scan reports of cancer patients from a single facility and categorized them into four response types: no evidence of disease, partial response, stable disease, or progressive disease. To analyze these reports, they utilized various models, including transformer models, a bidirectional LSTM model, a CNN model, and traditional machine learning approaches. Techniques such as data augmentation through sentence shuffling with consistency loss and prompt-based fine-tuning were applied to enhance the performance of the most effective models.
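Sentence shuffling is simple to sketch: a report's sentences are permuted while its label is kept, and a consistency loss can then penalize disagreement between the model's predictions on the original and shuffled versions. The naive period-based splitting and the KL-divergence loss below are illustrative assumptions, not the exact formulation of Tan et al.

```python
import random
import torch
import torch.nn.functional as F

def shuffle_sentences(report: str, rng: random.Random) -> str:
    """Permute a report's sentences; the class label is unchanged.
    Naive period-based splitting, for illustration only."""
    sents = [s.strip() for s in report.split(".") if s.strip()]
    rng.shuffle(sents)
    return ". ".join(sents) + "."

def consistency_loss(logits_orig: torch.Tensor,
                     logits_shuf: torch.Tensor) -> torch.Tensor:
    """One common consistency loss: KL divergence between predictions
    on the original report and on its shuffled counterpart."""
    return F.kl_div(F.log_softmax(logits_shuf, dim=-1),
                    F.softmax(logits_orig, dim=-1),
                    reduction="batchmean")

rng = random.Random(0)
print(shuffle_sentences("No new lesions. Target lesion smaller. Nodes stable.", rng))
```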
8.4. Speech data

Speech data, encompassing medical interviews, consultations, and patient audio interactions, serve as a valuable reservoir of information. Before being applied in LLMs, these data are converted into a textual format through automatic speech recognition (ASR) systems. Notably, converting audio data into text is accomplished using pre-trained models such as Wav2vec 2.0, which has emerged as a leading contender in speech recognition technology. In their work, Agbavor and Liang21 employed the Wav2vec2-base-960 model, fine-tuned on an extensive 960-h dataset of 16 kHz speech audio. Their methodology incorporated Librosa for audio file loading and Wav2Vec2Tokenizer for the crucial task of converting the raw audio into model inputs.
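A minimal transcription sketch in this spirit appears below, using the facebook/wav2vec2-base-960h checkpoint from Hugging Face Transformers. The audio file name is an assumption, and Wav2Vec2Processor is used here as the current replacement for the older Wav2Vec2Tokenizer interface that the authors name.

```python
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# 960-hour, 16 kHz English ASR checkpoint from the model family described above.
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Librosa loads the recording and resamples it to the 16 kHz rate the model expects.
waveform, sr = librosa.load("patient_interview.wav", sr=16000)  # assumed file

inputs = processor(waveform, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding yields the transcript that can then be passed to an LLM.
pred_ids = torch.argmax(logits, dim=-1)
transcript = processor.batch_decode(pred_ids)[0]
print(transcript)
```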

