8. Handling different types of data in the medical industry

This section provides an overview of how different data formats and types are handled in the medical industry when used as training data or inputs for an LLM.
8.1. Clinical notes

Clinical notes, an integral component of patient health records, have increasingly been utilized as input to LLMs in the medical domain. These notes, typically generated by health-care professionals, serve as rich repositories of patient information, including medical history, present symptoms, diagnoses, treatments, and more. Clinical notes are fed into LLMs to generate meaningful patterns, predictions, and insights. Before use, these notes are often preprocessed to ensure they are in a format that is easily digestible for the models. This preprocessing can involve converting handwritten notes into digital formats, anonymizing patient data to maintain privacy, and structuring the data in a consistent format. LLMs can process these notes directly to support activities such as condensing medical data, assisting in clinical decisions, and creating medical reports. To utilize clinical notes in LLMs, prompts containing questions, scenarios, or comments about the note are used, such as “Assume the role of a neurologist at the Mayo Clinic brain bank clinicopathological conference.”37 In response to the prompt, the model provides an output that aids in evaluation or diagnosis across different medical fields.37
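As a concrete illustration, the sketch below shows how a de-identified clinical note and a role-setting prompt might be passed to a general-purpose instruction-tuned model. The checkpoint, note text, and prompt wording are illustrative assumptions, not the setup of the cited study.

```python
# Minimal sketch: prompting an instruction-tuned LLM with a de-identified
# clinical note. Checkpoint, note, and prompt are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

note = (
    "72-year-old male with progressive memory loss over three years, "
    "word-finding difficulty, and declining activities of daily living."
)
prompt = (
    "Assume the role of a neurologist at a clinicopathological conference. "
    f"Clinical note: {note} "
    "List the most likely differential diagnoses."
)

answer = generator(prompt, max_new_tokens=128)[0]["generated_text"]
print(answer)
```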
8.2. X-rays/Images

X-rays are a medical imaging modality that utilizes ionizing radiation to produce images of internal body organs. This data type may include CT scans (tomography), chest X-rays, and bone X-rays. In medicine, X-ray images can be processed by a computer-aided detection (CAD) model, which is pre-trained to derive the outputs in tensor form. These tensors are then translated into natural language, where they can be used as LLM input to generate summaries or descriptions of the X-ray images. Wang et al.38 illustrated how X-ray exam images are handled when utilized with LLMs. They found that the images can be fed into pre-trained CAD models to derive the output. The tensor (output) is then translated into natural language, and finally the language models are used to draw final conclusions and summarize the results.
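The sketch below illustrates this CAD-to-LLM chain under stated assumptions: `cad_model` stands in for any pre-trained detector, the finding labels are hypothetical, and the tensor-to-text step is a simple template rather than the method of the cited work.

```python
# Illustrative sketch of the chain described above: a pre-trained CAD model
# produces a tensor of finding probabilities, which is templated into natural
# language for an LLM to summarize. The label set is a hypothetical stand-in.
import torch

FINDINGS = ["cardiomegaly", "pleural effusion", "pneumothorax", "consolidation"]

def tensor_to_text(probs: torch.Tensor, threshold: float = 0.5) -> str:
    """Translate the CAD output tensor into natural-language findings."""
    positive = [f for f, p in zip(FINDINGS, probs.tolist()) if p >= threshold]
    if not positive:
        return "No acute findings detected by the CAD model."
    return "CAD model flags possible " + ", ".join(positive) + "."

# probs = cad_model(xray_tensor)                 # hypothetical CAD model call
probs = torch.tensor([0.81, 0.12, 0.03, 0.64])   # example output tensor
report_text = tensor_to_text(probs)
print(report_text)  # this text can now be passed to an LLM for summarization
```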
The authors also established that X-ray images can be used as input to the LLM, where the images are fed into the model together with prompts to generate an image summarization or descriptive caption. The LLM also supports visual question answering, where the X-ray images of patients are fed into an image encoder (BLIP-2) and a natural-language presentation is generated and embedded based on the image understanding.
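As a minimal sketch of this kind of visual question answering, the following uses a publicly available BLIP-2 checkpoint from Hugging Face; the checkpoint choice, image file, and question are illustrative assumptions, not the cited pipeline.

```python
# Hedged sketch: BLIP-2-style visual question answering on a chest X-ray.
# Checkpoint, image path, and question are illustrative assumptions.
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

image = Image.open("chest_xray.png").convert("RGB")  # hypothetical file
prompt = "Question: Is there evidence of pleural effusion? Answer:"

# The processor encodes the image for the vision encoder and the prompt for
# the language model; in practice fp16 and a GPU are used for speed.
inputs = processor(images=image, text=prompt, return_tensors="pt")
generated = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated, skip_special_tokens=True)[0].strip())
```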
Bazi et al.39 proposed a transformer encoder-decoder architecture to handle visual data when using an LLM. They extracted image features with the vision transformer (ViT) model and embedded the questions with a textual encoder transformer; the resulting visual and textual representations were then fed into a multimodal decoder to generate the answers. To demonstrate how the LLM handles the visual data, the authors used two VQA datasets for radiology images, termed PathVQA and VQA-RAD. In decoding the radiology images, the proposed model achieved 72.97% and 8.99%, respectively, for VQA-RAD, and 62.37% and 83.86%, respectively, for PathVQA.
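The schematic sketch below mirrors this arrangement (ViT image encoder, text encoder for the question, multimodal decoder over the fused representations). All dimensions and module choices are illustrative assumptions, not the authors' implementation.

```python
# Schematic sketch of a ViT + text-encoder + multimodal-decoder VQA model in
# the spirit of the architecture described above; hyperparameters are assumed.
import torch
import torch.nn as nn
from transformers import ViTModel, BertModel

class MedicalVQA(nn.Module):
    def __init__(self, vocab_size=30522, d_model=768, n_heads=8, n_layers=4):
        super().__init__()
        self.vision = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")
        self.text = BertModel.from_pretrained("bert-base-uncased")
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, pixel_values, question_ids, answer_embeds):
        # Visual features from ViT and question features from the text encoder
        img = self.vision(pixel_values=pixel_values).last_hidden_state
        txt = self.text(input_ids=question_ids).last_hidden_state
        # Fused visual+textual context attended to by the multimodal decoder;
        # answer_embeds are teacher-forced embeddings of the answer tokens.
        memory = torch.cat([img, txt], dim=1)
        out = self.decoder(tgt=answer_embeds, memory=memory)
        return self.lm_head(out)  # per-token answer logits
```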
8.3. Radiological reports

Radiological reports are documents from radiologists that present the findings or interpretation of medical imaging studies such as magnetic resonance imaging (MRI), X-rays, and CT scans. These data are processed as the text within the report to serve as input for LLMs in medicine. After data augmentation, the radiological reports are used as inputs to the LLM. Tan et al.40 collected 10,602 CT scan reports of cancer patients from a single facility and categorized them into four response types: no evidence of disease, partial response, stable disease, or progressive disease. To analyze these reports, they utilized various models, including transformer models, a bidirectional LSTM model, a CNN model, and traditional machine learning approaches. Techniques such as data augmentation through sentence shuffling with consistency loss and prompt-based fine-tuning were applied to enhance the performance of the most effective models.
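To make the setup concrete, the sketch below pairs a sentence-shuffling augmentation with a four-label transformer classifier. The checkpoint, example report, and training details are illustrative assumptions, not Tan et al.'s configuration.

```python
# Illustrative sketch: sentence-shuffling augmentation plus a four-label
# transformer classifier for CT report response categories.
import random
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["no evidence of disease", "partial response",
          "stable disease", "progressive disease"]

def shuffle_sentences(report: str) -> str:
    """Label-preserving augmentation: reorder the report's sentences.
    During training, a consistency loss can tie the original and the
    shuffled copy to the same prediction."""
    sentences = [s.strip() for s in report.split(".") if s.strip()]
    random.shuffle(sentences)
    return ". ".join(sentences) + "."

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=len(LABELS)
)

report = ("Interval decrease in size of hepatic metastases. "
          "No new lesions identified.")
inputs = tok(shuffle_sentences(report), truncation=True, return_tensors="pt")

with torch.no_grad():  # classification head is untrained here: shapes only
    logits = model(**inputs).logits
print(LABELS[logits.argmax(-1).item()])
```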
8.4. Speech data

Speech data, encompassing medical interviews, consultations, and patient audio interactions, serve as a valuable reservoir of information. Before being applied in LLMs, these data are converted into a textual format through automatic speech recognition (ASR) systems. Notably, converting audio data into text is accomplished using pre-trained models such as Wav2vec 2.0, which has emerged as a leading contender in speech recognition technology. In their groundbreaking work, Agbavor and Liang21 employed the Wav2vec2-base-960h base model, an advanced tool fine-tuned on an extensive 960-h dataset of 16 kHz speech audio. Their methodology incorporated Librosa for audio file loading and Wav2Vec2Tokenizer for the crucial task of tokenizing the audio input.
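A minimal sketch of this ASR step is shown below, assuming the facebook/wav2vec2-base-960h checkpoint and a hypothetical audio file; it uses Wav2Vec2Processor, the current replacement for the deprecated Wav2Vec2Tokenizer interface mentioned above.

```python
# Sketch of the ASR step described above: transcribing a medical interview
# with Wav2vec 2.0. The audio path is a hypothetical placeholder.
import librosa
import torch
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Load audio at the 16 kHz sampling rate the model was trained on.
speech, sr = librosa.load("patient_interview.wav", sr=16_000)

inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits

ids = torch.argmax(logits, dim=-1)       # greedy CTC decoding
transcript = processor.batch_decode(ids)[0]
print(transcript)  # text that can now be fed to an LLM
```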

